OCLC Online Computer Library Center Virtual International Authority File Ed O’Neill Prepared with the assistance of Rick Bennett Australian Committee on Cataloguing Seminar Sydney, Australia, January 31, 2005


OCLC Online Computer Library Center
Virtual
International
Authority
File
Ed O’Neill
Prepared with the assistance of Rick Bennett
Australian Committee on Cataloguing Seminar
Sydney, Australia, January 31, 2005
Background
The IFLA Section on Cataloguing recognized the
need for an international authority file:
Where authority records from the world’s national
bibliographic agencies could be linked
Would be available via the Internet
Would be a practical expansion of the concept of
universal bibliographic control
Would build on the work done by each national
bibliographic agency
Allowing national or regional variations in authorized form
to co-exist
Supporting worldwide users’ needs for variations in
preferred language, script, and spelling
Background
The VIAF could be one of the basic building blocks for a
“semantic web”
When combined with other controlled vocabularies and
authority files from such sources as abstracting and
indexing services, archives, museums, publishers, etc.
Libraries now have an opportunity to make a great
contribution to this future and should help make this vision
a reality
The VIAF should be made freely available on the Web to users
worldwide
Joint Project
A project to test the concept of a VIAF is being
jointly undertaken by:
Die Deutsche Bibliothek (DDB)
The Library of Congress (LC)
OCLC Online Computer Library Center (OCLC)
VIAF Formally Approved in Berlin
Christel Hengel-Dittrich
Jay Jordan
Renate Gömpel
Ed O’Neill
Barbara Tillett
Elisabeth Niggemann
Beacher Wiggins
Project Goal
Demonstrate the feasibility of VIAF by linking
the personal names authority records
between:
Personennormdatei (PND)
Library of Congress Name Authority File
(LCNAF)
What is the VIAF?
The VIAF will be a file of metadata to link users from records in
one national bibliographic agency’s personal name authority file
to matching records in other national authority files
The VIAF will provide for web access through a specially
designed user interface
The VIAF will support multilingual and multi-script
capability
The VIAF will use Open Archives Initiative (OAI) protocols to
harvest metadata from the agencies’ authority files, which
would then be added to the shared servers to keep the file
updated
The system is being designed so that any number of authority
files can be linked
The Problem
In the LCNAF and PND authority files:
A person may have the same established form in both
authority files
Different people may be assigned the same established
form
Different forms of the name may be established for the
same person
A particular person may not be established in both files
Two People – One Name
Adams, Mike
In the PND, the name is established
for a golfer
In LCNAF, the name is established for
an author of a Beatles collector's guide
Two Names – One Person
LC: Morel, Pierre
PND: Morellus, Petrus
Brief LC Authority
010 n 84044261
040 DLC $c DLC $d DLC
100 1 Larson, Jack.
670 Thomson, V. The cat, c1982: $b t.p. (Jack Larson)
Information in Bibliographic Records
From the bibliographic records we gain significant additional
information about Jack Larson:
He is a lyricist
His primary subject area is music
He was published in the 80s and 90s by G. Schirmer and
Belwin Mills in New York
Worked with Virgil Thomson and Gerhard Samuel
Jack Larson is the only name he has used on his
publications
Etc.
Project Phases
Phase 1: Build enhanced authority files for both PND
and LC person names
Phase 2: Match PND and LC enhanced authority
records to create the initial version of the VIAF
Phase 3: Build OAI Server
Phase 4: Ongoing maintenance and metadata
harvesting using OAI protocols
Phase 5: Build end user interface with Unicode
displays
Phase 1
Building the Enhanced Authority Files
Authority records generally include very few, if any,
details about the person and/or their publishing
history
The information is rarely sufficient to determine if two
different authority records represent the same person
To provide the additional information needed to unambiguously
match authority records for the same author, information
from bibliographic records is used to enhance the
authority record
Enhancing the Authorities
[Diagram: Bibliographic Record -> Derived Authority; Derived Authority + Authority Record -> Enhanced Authority]
Mining the Bibliographic Record
LDR 00826ccm 2200289 a 4500
001 ocm10025532
005 20031229650847.0
008 840627s1982 nyuuua n eng
010 $a 84758340
040 $a DLC $c DLC
019 $a 17706440
020 $c $2.95
028 22 $a 48418 $b G. Schirmer
045 2 $b d198006 $b d198007
048 $b va01 $b ve01 $a ka01
050 00 $a M1529.3 $b .T
100 1 $a Thomson, Virgil, $d 1896
245 14 $a The cat : $b duet for soprano and baritone / $c Virgil Thomson ; [words by Jack Larson].
260 $a New York : $b G. Schirmer, $c c1982.
300 $a 1 score (11 p.) ; $c 31 cm.
500 $a For soprano, baritone, and piano.
650 0 $a Vocal duets with piano.
600 10 $a Larson, Jack $x Musical settings.
700 1 $a Larson, Jack.
Callouts on the slide: Language, LC Control Number, LC Classification, Usage, Title, Publisher, Place of Publication, Date of Publication, Material Type, Authors
Derived Authority Record
LDR 00525nz 2200229n 4500
001 xlc 1
003 OCoLC
005 20040721111415.0
008 040721nneanz||abbn n and d
040 $a OCoLC $b eng $c OCoLC $f viaf
100 1 $a Larson, Jack.
903 $a 84758340
910 14 $a the cat $b duet for soprano and baritone
921 $a g schirmer
922 $a nyu
930 $a jack larson
940 $a eng
942 $a 234
943 $a 198x
944 $a cm
950 1 $a thomson, virgil $d 1896
Callouts on the slide:
All text is normalized
Subjects are grouped into broad subject areas
Material type is coded
Publication date is by decade
Coauthor
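The normalization and decade coding noted on this slide can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `normalize` and `decade` are hypothetical, and the exact normalization rules are assumed (for instance, the slide keeps the comma in inverted personal names, which this sketch would drop, so it is shown here for titles and publishers only).

```python
import re
import unicodedata
from typing import Optional


def normalize(text: str) -> str:
    """Lowercase, drop diacritics and punctuation, collapse whitespace (assumed rules)."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^\w\s]", " ", no_marks.lower())
    return " ".join(cleaned.split())


def decade(date_field: str) -> Optional[str]:
    """Reduce a 260 $c value such as 'c1982.' to a decade code such as '198x'."""
    match = re.search(r"\d{4}", date_field)
    return match.group()[:3] + "x" if match else None


print(normalize("The cat :"))      # title from 245 $a
print(normalize("G. Schirmer,"))   # publisher from 260 $b
print(decade("c1982."))            # date from 260 $c
```

Applied to the bibliographic record for "The cat", these helpers give the values carried in fields 910, 921, and 943 of the derived authority.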
90x Control numbers
901 ISBN $a Numeric portion of ISBN
902 ISSN $a Numeric portion of ISSN
903 LCCN $a Numeric portion of LCCN
91x Title fields
910 Title from 245, Subfields a & b
911 Abbreviated title from 210, Subfields a & b
913 Uniform title from 240, Subfields a & b
914 Translated title from 242, Subfields a & b
915 Collective uniform title from 243, All subfields
916 Variant title from 246, Subfields a & b
917 Uniform title extracted from Name/Title authorities, field 100 $t
92x Publisher fields
920 Publisher number (Publisher number from ISBN)
921 Publisher name (Publisher name from the 260 $b or 533 $c)
922 Place of publication (Country of publication code from 008 field)
93x Usage
930 Name Usage (Form of name found in the statement of responsibility, 245 subfield $c)
94x Attributes
940 Language (Language code from the 008 or 041 subfield $a)
941 Author's role (Relator code from 700, subfields $e and/or $4)
942 North American Title Count subject (NATC survey line number)
943 Decade of publication
944 Format (Type and bib level)
945 Broader Subject Area
95x Joint Authors
950 Personal Authors (From either the 100 or 700 fields)
951 Corporate Authors
96x Names as Subjects
960 Name as Subject
99x Number of Records
999 Number of associated bibliographic records
– $a Total number of associated bibliographic records
– $b Bibliographic Record Control Number
– $2 Source of Bibliographic Record
Enhanced Authority Record
LDR 00824nz 2200301n 4500
001 oca01144962
005 19840809154202.7
008 840702n| acannaab| |n aaa |||
010 $a n 84044261
040 $a DLC $c DLC $d DLC
100 1 $a Larson, Jack.
670 $a Thomson, V. The cat, c1982: $b t.p. (Jack Larson)
903 $a 84758340 $9 1
903 $a 93710923 $9 1
910 11 $a the cat $b duet for soprano and baritone $9 1
910 11 $a sun like $b on a poem by jack larson $9 1
921 $a g schirmer $9 1
921 $a belwin mills publ corp $9 2
922 $a nyu $9 2
930 $a jack larson $9 1
940 $a eng $9 2
942 $a 234 $9 2
943 $a 198x $9 1
943 $a 197x $9 1
944 $a cm $9 2
950 11 $a thomson, virgil $d 1896 $9 1
950 11 $a samuel, gerhard $9 1
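One way to picture how derived authorities for the same name are folded into an enhanced authority is sketched below. It assumes that $9 carries the number of bibliographic records in which a value occurs; the slides do not define $9, so that reading is an assumption. The first input reuses values from the derived authority for "The cat"; the second input is made up for illustration, because only the merged result is shown. The function name `merge_derived` is hypothetical.

```python
from collections import Counter

# (tag, value) pairs from the derived authority for "The cat".
the_cat = [
    ("910", "the cat"), ("921", "g schirmer"), ("922", "nyu"),
    ("930", "jack larson"), ("940", "eng"), ("943", "198x"),
    ("950", "thomson, virgil"),
]
# Second derived authority: assumed for illustration only.
sun_like = [
    ("910", "sun like"), ("921", "belwin mills publ corp"), ("922", "nyu"),
    ("940", "eng"), ("943", "197x"), ("950", "samuel, gerhard"),
]


def merge_derived(derived_records):
    """Count, per (tag, value), how many derived records contain it."""
    counts = Counter()
    for fields in derived_records:
        # set() so a value repeated inside one record is counted once
        counts.update(set(fields))
    return counts


merged = merge_derived([the_cat, sun_like])
for (tag, value), count in sorted(merged.items()):
    print(f"{tag} $a {value} $9 {count}")
```

Values shared by both inputs (922 nyu, 940 eng) come out with a count of 2, while values unique to one record keep a count of 1.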
LC Bibliographic Records
Number of records: 7,612,979
Personal names assigned: 6,318,094
Unique personal names: 2,554,266
LCNAF Personal Name Authorities
Differentiated names: 3,834,162
Undifferentiated names: 37,990
Total authority records: 3,872,152
LC Names
Established names: 3,834,162
Names from bib records: 2,554,266
Active established names: 2,159,315
Orphaned names: 1,674,847
Uncontrolled names: 394,951
DDB Bibliographic Records
Die Deutsche Bibliothek (DDB): 6,316,675
Bibliotheksverbund Bayern (BVB): 5,022,316
Total number of records: 11,338,991
Number of assignments: 12,080,387
Number of unique names: 2,371,461
DDB Names
Established names: 2,498,071
Names from bib records: 2,371,461
Active established names: 2,057,530
Orphaned names: 440,541
Uncontrolled names: 313,931
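The figures on the LC Names and DDB Names slides fit a simple overlap of two sets. The slides do not define the three categories, so the reading below is an assumption: active names are both established and used in bibliographic records, orphaned names are established but unused, and uncontrolled names are used but not established. The function name `name_overlap` is hypothetical.

```python
def name_overlap(established: int, from_bibs: int, active: int) -> dict:
    """Split two overlapping name sets into the categories used on the slides."""
    return {
        "active": active,                      # established and used in bib records
        "orphaned": established - active,      # established, never used
        "uncontrolled": from_bibs - active,    # used, never established
    }


lc = name_overlap(established=3_834_162, from_bibs=2_554_266, active=2_159_315)
ddb = name_overlap(established=2_498_071, from_bibs=2_371_461, active=2_057_530)
print(lc)
print(ddb)
```

Under that reading the arithmetic reproduces the orphaned and uncontrolled counts reported for both files.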
Phase 2
Matching the Enhanced Authorities
Linking Retrospective Files
[Diagram: Enhanced LCNAF Authorities + Enhanced PND Authorities -> Matching Algorithms -> VIAF Authorities]
Matching
LCNAF: Pauling, Linus, 1901-
PND: Pauling, Linus, 1901-1994
Name Matching
To be considered for a match, two names must be consistent:
Smith, J. William and Smith, John are consistent
Smith, J. William and Smith, John Q. are inconsistent
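The consistency test is not spelled out on the slide. A minimal sketch that reproduces the two Smith examples is given below, under assumed rules: surnames must be equal, an initial is compatible with any forename beginning with that letter, and a forename missing from one name never conflicts. The function names are hypothetical.

```python
def parse_name(name: str):
    """Split 'Surname, Forename ...' into a surname and a list of forename tokens."""
    surname, _, rest = name.partition(",")
    tokens = (tok.strip(".").lower() for tok in rest.split())
    return surname.strip().lower(), [tok for tok in tokens if tok]


def tokens_compatible(a: str, b: str) -> bool:
    """An initial matches any forename starting with it; full forms must be equal."""
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return a == b


def names_consistent(name_a: str, name_b: str) -> bool:
    surname_a, fore_a = parse_name(name_a)
    surname_b, fore_b = parse_name(name_b)
    if surname_a != surname_b:
        return False
    # zip stops at the shorter list, so a missing forename never conflicts
    return all(tokens_compatible(a, b) for a, b in zip(fore_a, fore_b))


print(names_consistent("Smith, J. William", "Smith, John"))
print(names_consistent("Smith, J. William", "Smith, John Q."))
```

Consistency is only a gate: it decides which pairs are worth scoring, not whether they refer to the same person.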
Strong Matching Attributes
A work (title) in common
Common control numbers (ISBN, ISSN, or LCCN)
Dates: the combination of birth and death years (a moderate match
score value is given for matching birth dates)
Joint authors
Distinct form of alternate name
For example, LC has
100 Schade, Peter, $d 1493-1524
400 Mosellanus, Petrus, $d 1493-1524
While PND has
100 Mosellanus, Petrus, $d 1493-1524
400 Schade, Peter, $d 1493-1524
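The alternate-name check in the Schade/Mosellanus example can be sketched as a cross-comparison of the 100 heading in one file with the 400 references in the other. This is an illustration under assumptions: the dictionary layout and the function name `alternate_name_match` are hypothetical, and real records would be compared after normalization.

```python
lc_record = {
    "heading": "Schade, Peter, 1493-1524",          # 100
    "see_from": ["Mosellanus, Petrus, 1493-1524"],  # 400
}
pnd_record = {
    "heading": "Mosellanus, Petrus, 1493-1524",     # 100
    "see_from": ["Schade, Peter, 1493-1524"],       # 400
}


def alternate_name_match(record_a: dict, record_b: dict) -> bool:
    """True when one record's heading appears among the other's references."""
    return (
        record_a["heading"] in record_b["see_from"]
        or record_b["heading"] in record_a["see_from"]
    )


print(alternate_name_match(lc_record, pnd_record))
```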
Weaker Attributes
Role (Author, Illustrator, Composer, etc.)
Subject Area of Publications
Format (Books, Films, Musical scores, etc.)
Language
Country
Date of publication
Similarity Measure
The total similarity measure is a weighted
sum of the individual attribute matches
A similarity measure is only computed for
consistent names
The weighting factor is lower for the weaker
attributes and higher for the stronger
attributes
Care is taken to avoid double counting or
using scores that are correlated
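A weighted sum of this kind might look like the sketch below. The weights are invented for illustration; the slides give no actual values, only that stronger attributes weigh more than weaker ones. The weights here are chosen to sum to 1.0 so the result falls between 0 and 1, and the function name `similarity` is hypothetical.

```python
from typing import Dict, Optional

# Hypothetical weights: stronger attributes get more weight than weaker ones.
WEIGHTS: Dict[str, float] = {
    "title": 0.35,
    "control_number": 0.25,
    "dates": 0.15,
    "joint_author": 0.10,
    "language": 0.05,
    "format": 0.05,
    "country": 0.05,
}


def similarity(scores: Dict[str, float], consistent: bool = True) -> Optional[float]:
    """Weighted sum of per-attribute match scores (each between 0 and 1).

    Returns None for inconsistent names, which are never scored.
    Attributes missing from `scores` contribute nothing.
    """
    if not consistent:
        return None
    return sum(WEIGHTS[name] * scores.get(name, 0.0) for name in WEIGHTS)


print(similarity({"title": 1.0, "format": 1.0}))
print(similarity({"title": 1.0}, consistent=False))
```

Keeping one weight per attribute makes it easy to see where correlated attributes would be counted twice, which is the risk the slide warns about.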
Similarity Metric
LC enhanced authority:
001 oca04693556
005 19980327132122.5
008 980327n| acannaab| |n aaa |||
010 n 98029633
040 DLC $c DLC $d DLC
100 1 Tarrant, John, $d 1949
670 The light inside the dark, 1998: $b CIP t.p. (John Tarrant) data sheet (John M. Tarrant; b. 1949)
901 006017219 $9 1
903 98017676 $9 1
910 11 the light inside the dark $b zen soul and the spiritual life $9 1
920 0-06 $9 1
921 harpercollins publishers $9 1
922 nyu $9 1
930 john tarrant $9 1
940 eng $9 1
942 26 $9 1
943 199x $9 1
944 am $9 1
999 1 $b ocm38948253 $2 DLC
PND enhanced authority:
001 12231638X
003 DDB
005 20000926224921.0
008 000825|||az|nnaa|||||||||||| a|aba|||| d
016 12231638X $2 GyFmDB
040 DDB $b ger $d 9999 $f RAK-PND
100 1 Tarrant, John
901 344221568 $9 1
910 11 licht im herzen der dunkelheit $b die nacht der seele und der weg zur erleuchtung $9 1
913 11 the light inside the dark $9 1
920 3-442 $9 1
921 goldmann $9 1
922 gw $9 1
930 john tarrant $9 1
940 ger $9 1
943 200x $9 1
944 am $9 1
999 1 $b 959703160 $2 DDB
Elements highlighted on the slide: name (100 Tarrant, John), title (the light inside the dark), publisher (harpercollins publishers; goldmann)
Similarity Metric = 0.89
Future of VIAF?
If the proof-of-concept is successful, the
VIAF will be expanded:
To include other authority files for
personal names,
To include other types of authorities
– Corporate names,
– Geographic names,
– etc.
First VIAF Record
Rec stat: n    Entered: 20030225
Type: z    Upd status: a    Enc lvl:    Source:
Roman:    Ref status: a    Mod rec:    Name use:
Govt agn:    Auth status: a    Subj:    Subj use:
Series: n    Auth/ref: a    Geo subd:    Ser use:
Ser num: n    Name: a    Subdiv tp: n    Rules:
[Fixed-field values left blank above are not recoverable from this transcript]
[Variable fields 010 and 040 and two linked heading fields follow; their text is only partly legible:]
al,  P de 2 loc 0 n 22324
al, oannes P de d 194- 2 pnd 0 12251993
Phase 3: Build OAI Server
[Diagram: LCNAF and DDB/PND connected through OAI Server(s)]
Slide Courtesy of Barbara Tillett, Library of Congress
Phase 4: Ongoing maintenance and
metadata harvesting using OAI protocols
Slide Courtesy of Barbara Tillett, Library of Congress
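Harvesting with OAI protocols usually means issuing OAI-PMH requests such as `ListRecords` and following resumption tokens until the result set is exhausted. The sketch below only builds request URLs and reads the token from a response; it makes no network calls. The base URL and the `marc21` metadata prefix are placeholders, not the project's actual endpoint or format, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url: str, metadata_prefix: str,
                     from_date: Optional[str] = None,
                     token: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL."""
    if token:
        # In OAI-PMH, resumptionToken is an exclusive argument
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date  # harvest only records changed since this date
    return f"{base_url}?{urlencode(params)}"


def next_token(response_xml: str) -> Optional[str]:
    """Return the resumptionToken from a response, or None when harvesting is done."""
    root = ET.fromstring(response_xml)
    element = root.find(f".//{OAI_NS}resumptionToken")
    return element.text if element is not None and element.text else None


# Placeholder endpoint, for illustration only.
print(list_records_url("https://example.org/oai", "marc21", from_date="2005-01-31"))
```

A harvester would loop: request, store the returned records, then request again with the token until `next_token` returns None.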
Phase 5: Build End User Interface with
Unicode displays
User’s cookie specifies Hangul is preferred.
Display 700 form, building on local system’s authority structure
Slide Courtesy of Barbara Tillett, Library of Congress
Questions?
Thank you
[email protected]
http://www.oclc.org/research/projects/viaf