OCLC Online Computer Library Center Virtual International Authority File Ed O’Neill Prepared with the assistance of Rick Bennett Australian Committee on Cataloguing Seminar Sydney, Australia, January 31, 2005
Background
The IFLA Section on Cataloguing recognized the need for an international authority file:
– Where authority records from the world’s national bibliographic agencies could be linked
– That would be available via the Internet
– That would be a practical expansion of the concept of universal bibliographic control
– That would build on the work done by each national bibliographic agency
  – Allowing national or regional variations in authorized form to co-exist
  – Supporting worldwide users’ needs for variations in preferred language, script, and spelling

Background (continued)
– The VIAF could be one of the basic building blocks for a “semantic web”
  – When combined with other controlled vocabularies and authority files from such sources as abstracting and indexing services, archives, museums, publishers, etc.
– Libraries now have an opportunity to make a great contribution to this future and should help make this vision a reality
– The VIAF should be made freely available on the Web to users worldwide

Joint Project
A project to test the concept of a VIAF is being jointly undertaken by:
– Die Deutsche Bibliothek (DDB)
– The Library of Congress (LC)
– OCLC Online Computer Library Center (OCLC)

VIAF Formally Approved in Berlin
Christel Hengel-Dittrich, Jay Jordan, Renate Gömpel, Ed O’Neill, Barbara Tillett, Elisabeth Niggemann, Beacher Wiggins

Project Goal
Demonstrate the feasibility of VIAF by linking the personal name authority records between:
– Personennormdatei (PND)
– Library of Congress Name Authority File (LCNAF)

What is the VIAF?
– The VIAF will be a file of metadata to link users from records in one national bibliographic agency’s personal name authority file to matching records in other national authority files
– The VIAF will provide for web access through a specially designed user interface
– The VIAF will support multilingual and multi-script capability
– The VIAF will use Open Archives Initiative (OAI) protocols to harvest metadata from the agencies’ authority files, which would then be added to the shared servers to keep the file updated
– The system is being designed so that any number of authority files can be linked

The Problem
In the LCNAF and PND authority files:
– A person may have the same established form in both authority files
– Different people may be assigned the same established form
– Different forms of the name may be established for the same person
– A particular person may not be established in both files

Two People – One Name
Adams, Mike
– In the PND, the name is established for a golfer
– In LCNAF, the name is established for an author of a Beatles collector's guide

Two Names – One Person
LC: Morel, Pierre
PND: Morellus, Petrus

Brief LC Authority
010    n 84044261
040    DLC $c DLC $d DLC
100 1  Larson, Jack.
670    Thomson, V. The cat, c1982: $b t.p. (Jack Larson)

Information in Bibliographic Records
From the bibliographic records we gain significant additional information about Jack Larson:
– He is a lyricist
– His primary subject area is music
– He was published in the 80s and 90s by G. Schirmer and Belwin Mills in New York
– He worked with Virgil Thomson and Gerhard Samuel
– Jack Larson is the only name he has used on his publications
– Etc.
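The kind of information listed above can be gathered by walking through the bibliographic records attached to a name and tallying what they say. The following is only a minimal sketch of that idea: the dictionary layout, the `profile` function, and the decade notation are assumptions made for illustration, not the project's actual data model. Only the values for "The cat" are taken from the slides.

```python
from collections import Counter

# Hypothetical, simplified stand-in for a bibliographic record.
# The values come from the "The cat" example; the layout is assumed.
records = [
    {
        "title": "The cat",
        "year": 1982,
        "publisher": "G. Schirmer",
        "place": "New York",
        "authors": ["Thomson, Virgil", "Larson, Jack"],
    },
]


def profile(name, records):
    """Tally publishers, publication decades and co-authors for one name."""
    publishers, decades, coauthors = Counter(), Counter(), Counter()
    for rec in records:
        if name not in rec["authors"]:
            continue  # record is not associated with this name
        publishers[rec["publisher"]] += 1
        # Reduce the year to a decade label, e.g. 1982 -> "198x"
        decades[f"{rec['year'] // 10}x"] += 1
        for other in rec["authors"]:
            if other != name:
                coauthors[other] += 1
    return {"publishers": publishers, "decades": decades, "coauthors": coauthors}


larson = profile("Larson, Jack", records)
print(larson["publishers"], larson["decades"], larson["coauthors"])
```

With more records attached to the same name, the counters would accumulate the publishing history that a bare authority record lacks.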
Project Phases
– Phase 1: Build enhanced authority files for both PND and LC personal names
– Phase 2: Match PND and LC enhanced authority records to create the initial version of the VIAF
– Phase 3: Build OAI server
– Phase 4: Ongoing maintenance and metadata harvesting using OAI protocols
– Phase 5: Build end user interface with Unicode displays

Phase 1
Building the Enhanced Authority Files
– Authority records generally include very few, if any, details about the person and/or their publishing history
– The information is rarely sufficient to determine if two different authority records represent the same person
– To provide the additional information needed to unambiguously match authority records for the same author, information from bibliographic records is used to enhance the authority record

Enhancing the Authorities
Bibliographic Record → Derived Authority
Derived Authority + Authority Record → Enhanced Authority

Mining the Bibliographic Record
LDR     00826ccm 2200289 a 4500
001     ocm10025532
005     20031229650847.0
008     840627s1982 nyuuua n eng
010     $a 84758340
040     $a DLC $c DLC
019     $a 17706440
020     $c $2.95
028 22  $a 48418 $b G. Schirmer
045 2   $b d198006 $b d198007
048     $b va01 $b ve01 $a ka01
050 00  $a M1529.3 $b .T
100 1   $a Thomson, Virgil, $d 1896-
245 14  $a The cat : $b duet for soprano and baritone / $c Virgil Thomson ; [words by Jack Larson].
260     $a New York : $b G. Schirmer, $c c1982.
300     $a 1 score (11 p.) ; $c 31 cm.
500     $a For soprano, baritone, and piano.
650 0   $a Vocal duets with piano.
600 10  $a Larson, Jack $x Musical settings.
700 1   $a Larson, Jack.
Elements called out on the slide: Language, LC Control Number, LC Classification, Usage, Title, Publisher, Place of Publication, Date of Publication, Material Type, Authors

Derived Authority Record
LDR     00525nz 2200229n 4500
001     xlc
003     OCoLC
005     20040721111415.0
008     040721nneanz||abbn n and d
040     $a OCoLC $b eng $c OCoLC $f viaf
100 1   $a Larson, Jack.
903     $a 84758340
910 14  $a the cat $b duet for soprano and baritone
921     $a g schirmer
922     $a nyu
930     $a jack larson
940     $a eng
942     $a 234
943     $a 198x
944     $a cm
950 1   $a thomson, virgil $d 1896
Notes on the slide:
– All text is normalized
– Subjects are grouped into broad subject areas
– Material type is coded
– Publication date is by decade
– Coauthor

90x Control numbers
901 ISBN: $a Numeric portion of ISBN
902 ISSN: $a Numeric portion of ISSN
903 LCCN: $a Numeric portion of LCCN

91x Title fields
910 Title from 245, subfields a & b
911 Abbreviated title from 210, subfields a & b
913 Uniform title from 240, subfields a & b
914 Translated title from 242, subfields a & b
915 Collective uniform title from 243, all subfields
916 Variant title from 246, subfields a & b
917 Uniform title extracted from name/title authorities, field 100 $t

92x Publisher fields
920 Publisher number (publisher number from ISBN)
921 Publisher name (publisher name from the 260 $b or 533 $c)
922 Place of publication (country of publication code from 008 field)

93x Usage
930 Name usage (form of name found in the statement of responsibility, 245 subfield $c)

94x Attributes
940 Language (language code from the 008 or 041 subfield $a)
941 Author's role (relator code from 700, subfields $e and/or $4)
942 North American Title Count subject (NATC survey line number)
943 Decade of publication
944 Format (type and bib level)
945 Broader subject area

95x Joint Authors
950 Personal authors (from either the 100 or 700 fields)
951 Corporate authors

96x Names as Subjects
960 Name as subject

99x Number of Records
999 Number of associated bibliographic records
– $a Total number of associated bibliographic records
– $b Bibliographic record control number
– $2 Source of bibliographic record

Enhanced Authority Record
LDR     00824nz 2200301n 4500
001     oca01144962
005     19840809154202.7
008     840702n| acannaab| |n aaa |||
010     $a n 84044261
040     $a DLC $c DLC $d DLC
100 1   $a Larson, Jack.
670     $a Thomson, V. The cat, c1982: $b t.p.
(Jack Larson)
903     $a 84758340 $9 1
903     $a 93710923 $9 1
910 11  $a the cat $b duet for soprano and baritone $9 1
910 11  $a sun like $b on a poem by jack larson $9 1
921     $a g schirmer $9 1
921     $a belwin mills publ corp $9 2
922     $a nyu $9 2
930     $a jack larson $9 1
940     $a eng $9 2
942     $a 234 $9 2
943     $a 198x $9 1
943     $a 197x $9 1
944     $a cm $9 2
950 11  $a thomson, virgil $d 1896 $9 1
950 11  $a samuel, gerhard $9 1

LC Bibliographic Records
Number of records: 7,612,979
Personal names assigned: 6,318,094
Unique personal names: 2,554,266

LCNAF Personal Name Authorities
Differentiated names: 3,834,162
Undifferentiated names: 37,990
Total authority records: 3,872,152

LC Names
Established names: 3,834,162
Names from bib records: 2,554,266
Active established names: 2,159,315
Orphaned names: 1,674,847
Uncontrolled names: 394,951

DDB Bibliographic Records
Die Deutsche Bibliothek (DDB): 6,316,675
Bibliotheksverbund Bayern (BVB): 5,022,316
Total number of records: 11,338,991
Number of assignments: 12,080,387
Number of unique names: 2,371,461

DDB Names
Established names: 2,498,071
Names from bib records: 2,371,461
Active established names: 2,057,530
Orphaned names: 440,541
Uncontrolled names: 313,931

Phase 2
Matching the Enhanced Authorities

Linking Retrospective Files
Enhanced LCNAF Authorities + Enhanced PND Authorities → Matching Algorithms → VIAF Authorities

Matching
LCNAF: Pauling, Linus, 1901-
PND: Pauling, Linus, 1901-1994

Name Matching
To be considered for a match, two names must be consistent:
Smith, J. William / Smith, John
Smith, J. William / Smith, John Q.
The first pair (Smith, J. William and Smith, John) is consistent; the second pair (Smith, J. William and Smith, John Q.) is inconsistent.

Strong Matching Attributes
– A work (title) in common
– Common control numbers (ISBN, ISSN, or LCCN)
– Dates: the combination of birth and death years (a moderate match score is given when only birth dates match)
– Joint authors
– Distinct alternate form of the name. For example, LC has
    100 Schade, Peter, $d 1493-1524
    400 Mosellanus, Petrus, $d 1493-1524
  while PND has
    100 Mosellanus, Petrus, $d 1493-1524
    400 Schade, Peter, $d 1493-1524

Weaker Attributes
– Role (author, illustrator, composer, etc.)
– Subject area of publications
– Format (books, films, musical scores, etc.)
– Language
– Country
– Date of publications

Similarity Measure
– The total similarity measure is a weighted sum of the individual attribute matches
– A similarity measure is only computed for consistent names
– The weighting factor is lower for the weaker attributes and higher for the stronger attributes
– Care is taken to avoid double counting or using scores that are correlated

Similarity Metric
LC enhanced authority record:
001     oca04693556
005     19980327132122.5
008     980327n| acannaab| |n aaa |||
010     n 98029633
040     DLC $c DLC $d DLC
100 1   Tarrant, John, $d 1949-
670     The light inside the dark, 1998: $b CIP t.p. (John Tarrant) data sheet (John M. Tarrant; b.
1949)
901     006017219 $9 1
903     98017676 $9 1
910 11  the light inside the dark $b zen soul and the spiritual life $9 1
920     0-06 $9 1
921     harpercollins publishers $9 1
922     nyu $9 1
930     john tarrant $9 1
940     eng $9 1
942     26 $9 1
943     199x $9 1
944     am $9 1
999     1 $b ocm38948253 $2 DLC

PND enhanced authority record:
001     12231638X
003     DDB
005     20000926224921.0
008     000825|||az|nnaa|||||||||||| a|aba|||| d
016     12231638X $2 GyFmDB
040     DDB $b ger $d 9999 $f RAK-PND
100 1   Tarrant, John
901     344221568 $9 1
910 11  licht im herzen der dunkelheit $b die nacht der seele und der weg zur erleuchtung $9 1
913 11  the light inside the dark $9 1
920     3-442 $9 1
921     goldmann $9 1
922     gw $9 1
930     john tarrant $9 1
940     ger $9 1
943     200x $9 1
944     am $9 1
999     1 $b 959703160 $2 DDB

Compared on the slide:
LC: 100 1 Tarrant, John, $d 1949- ; the light inside the dark $b zen soul and the spiritual life ; harpercollins publishers
PND: 100 1 Tarrant, John ; the light inside the dark ; goldmann
Similarity Metric = 0.89

Future of VIAF?
If the proof-of-concept is successful, the VIAF will be expanded:
– To include other authority files for personal names
– To include other types of authorities
  – Corporate names
  – Geographic names
  – etc.

First VIAF Record
[Record display: Rec stat n, Type z, Entered 20030225; linked headings carrying $2 loc $0 n 22324 and $2 pnd $0 12251993, as far as the field text can be read]

Phase 3: Build OAI Server
[Diagram: LCNAF and DDB/PND connected to OAI server(s)]
Slide courtesy of Barbara Tillett, Library of Congress

Phase 4: Ongoing maintenance and metadata harvesting using OAI protocols
Slide courtesy of Barbara Tillett, Library of Congress

Phase 5: Build End User Interface with Unicode displays
User’s cookie specifies Hangul is preferred.
Display 700 form, building on local system’s authority structure
Slide courtesy of Barbara Tillett, Library of Congress

Questions?
Thank you
http://www.oclc.org/research/projects/viaf
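
The Phase 2 matching logic (a name-consistency test followed by a weighted sum of attribute matches) can be sketched in a few lines. This is an illustration under stated assumptions, not the project's algorithm: the consistency rule (same surname, and each given-name token either equal or an initial agreeing with the other name's token), the function names, and every weight in `WEIGHTS` are hypothetical. The slides do not publish the actual weights, so this sketch does not reproduce the 0.89 reported for the Tarrant records.

```python
def _tokens(name):
    """Split 'Surname, Given names' into a normalized surname and given-name tokens."""
    surname, _, given = name.partition(",")
    return surname.strip().lower(), given.replace(".", " ").lower().split()


def _compatible(a, b):
    """Tokens agree if equal, or if one is an initial matching the other's first letter."""
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return a == b


def consistent(name1, name2):
    """Assumed rule: same surname and no conflicting given-name token."""
    s1, g1 = _tokens(name1)
    s2, g2 = _tokens(name2)
    if s1 != s2:
        return False
    # zip stops at the shorter name, so extra given names never conflict
    return all(_compatible(a, b) for a, b in zip(g1, g2))


# Hypothetical weights: higher for strong attributes, lower for weak ones.
WEIGHTS = {
    "title": 0.5,
    "control_number": 0.5,
    "dates": 0.5,
    "joint_author": 0.5,
    "language": 0.25,
    "format": 0.25,
}


def similarity(name1, name2, matches):
    """Weighted sum of attribute match scores; None when the names are inconsistent."""
    if not consistent(name1, name2):
        return None
    return sum(WEIGHTS[attr] * score for attr, score in matches.items())


print(consistent("Smith, J. William", "Smith, John"))
print(consistent("Smith, J. William", "Smith, John Q."))
# Tarrant example: shared title and format, different language
print(similarity("Tarrant, John", "Tarrant, John",
                 {"title": 1.0, "format": 1.0, "language": 0.0}))
```

Returning `None` for inconsistent names mirrors the rule that a similarity measure is only computed for consistent names; a production matcher would also need to guard against correlated attributes, which this sketch does not attempt.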