OCLC - The world's libraries. Connected.

Charleston Conference
7 November 2008
Data Mining, Advanced Collection Analysis, and Publisher Profiles: An Update on the OCLC Publisher Name Authority File
Lynn Silipigni Connaway, Ph.D.
Senior Research Scientist
OCLC Research
Timothy J. Dickey, Ph.D.
Post-Doctoral Researcher
OCLC Research
Overall Research Goals
To Build a Database that Will:
Identify
• Authoritative strings for publisher names
• Common variants for names and locations
• Hierarchical references indicating relationships and nesting of subsidiaries
• Definitions of publishing entities
Produce
• Profiles, including data-mined information regarding formats, languages, subjects, etc. for publishers
Conform and Interoperate
• Conform to international authority and standards practice
• Interoperate with other OCLC products
Issues & Challenges
Database Quality:
Historical Practices
• “…the shortest form in which it can be understood.” [AACR2 2004]
• Different versions of cataloging rules
• Abbreviations
Errors and misspellings
Local Practices
Method: Data Mining in an “Aggregate Collection”
Data Mining and Analysis of WorldCat:
“…affords high-level perspective on historical patterns,
suggests future trends, and supplies useful intelligence
with which to inform decision making.”
Lavoie, B.F., Connaway, L. S., & O’Neill, E. T. (2007). Mapping WorldCat’s digital landscape.
Library Resources & Technical Services, 51, 106-115 at 107.
WorldCat: July 2008
Manifestations (records): 108,828,533
Total holdings: 1,292,763,300
Digital Items: 3,182,550
Works: 84,096,107
Institutions: 69,000
Physical Items: ~1.2 billion
Global Origins of WorldCat Materials
• US: 28%
• Rest of World: 27%
• Unknown: 17%
• Germany: 10%
• UK: 8%
• France: 4%
• Canada: 3%
Global Origins of WorldCat Materials
Materials w/non-US origins: 57.9 million (55%)
Top 5:
• Germany: 10.0 million
• UK: 8.8 million
• France: 4.2 million
• Netherlands: 2.9 million
• Canada: 2.9 million
Content Languages: 478
49% of WC non-English
Top 5 non-English:
• German: 12 million
• French: 6.1 million
• Spanish: 3.5 million
• Dutch: 2.6 million
• Japanese: 2.4 million
Non-English Metadata Language: 28 million (66 languages)
Top 5:
• German: 11 million
• Dutch: 5.0 million
• Swedish: 1.9 million
• French: 1.8 million
• Finnish: 0.7 million
OCLC Publisher Name Server
Publisher Name Server: Objectives
Resolve, to support data mining and WorldCat quality:
• ISBN prefixes to publisher name
• Variant publisher names to a preferred form
Complement Collection Analysis Service
• Librarians & Publishers
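The second objective, resolving variant publisher names to a preferred form, can be pictured as a normalized lookup table. The following is a minimal sketch only: the variant strings, the `normalize` rules, and the function names are illustrative assumptions, not the project's actual data or matching logic.

```python
import re

# Hypothetical variant table for illustration; the preferred forms are taken
# from the slides, the variant strings are invented examples.
VARIANTS = {
    "Prentice-Hall, Inc.": ["Prentice Hall", "Prentice-Hall", "Prentice-Hall Inc"],
    "John Wiley & Sons": ["Wiley", "J. Wiley", "John Wiley and Sons"],
}


def normalize(name: str) -> str:
    """Lowercase, spell out '&', and drop punctuation so near-identical strings collide."""
    name = name.lower().replace("&", " and ")
    name = re.sub(r"[^a-z0-9 ]+", " ", name)
    return " ".join(name.split())


# Index every known form (preferred and variant) under its normalized key.
LOOKUP = {}
for preferred, variants in VARIANTS.items():
    for form in [preferred, *variants]:
        LOOKUP[normalize(form)] = preferred


def preferred_form(name: str):
    """Return the preferred form for a publisher string, or None if it is unknown."""
    return LOOKUP.get(normalize(name))
```

Strings that normalize to an unknown key return `None`, which in practice would be the cases left for hand parsing.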
Capture and profile attributes of individual publishers:
• Location(s)
• Language(s) of materials published
• Genre(s)/format(s)
• Dominant subject domain(s)
• Parent company and subsidiaries
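As a rough illustration of how such a profile could be derived, the sketch below counts one attribute across a publisher's records and reports each value's percentage share. The record layout (dictionaries with `language` and `format` keys) and the sample values are assumptions made for the example, not WorldCat's actual record structure.

```python
from collections import Counter


def profile(records, attribute):
    """Return (value, percent share) pairs for one attribute, largest share first."""
    counts = Counter(r[attribute] for r in records if r.get(attribute))
    total = sum(counts.values())
    return [(value, round(100 * n / total, 2)) for value, n in counts.most_common()]


# Invented sample records for a single publishing entity.
records = [
    {"language": "eng", "format": "Printed Material"},
    {"language": "eng", "format": "Printed Material"},
    {"language": "eng", "format": "Computer File"},
    {"language": "ger", "format": "Printed Material"},
]

language_profile = profile(records, "language")
format_profile = profile(records, "format")
```

The same function can be pointed at any captured attribute (genre, subject division, and so on), which is why it takes the attribute name as a parameter.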
Publisher Name Server: Methodology
Programmatically cluster publishers’ records using ISBN prefixes
• Data clustering
• Classification of similar objects into different groups
• Partitioning of a data set into subsets (clusters)
Hand parse the entities and resolve ISBN prefixes
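The clustering step can be sketched as grouping records by the leading two elements of a hyphenated ISBN-10 (for example `0-13`). This is an illustrative sketch, not the project's code: it assumes ISBNs arrive already hyphenated, whereas an unhyphenated ISBN would need the registration agency's range tables to find the prefix boundary. The sample records are invented.

```python
from collections import defaultdict


def isbn_prefix(isbn: str) -> str:
    """Return 'group-registrant' from a hyphenated ISBN-10, e.g. '0-13-110362-8' -> '0-13'."""
    parts = isbn.strip().split("-")
    if len(parts) != 4:
        raise ValueError(f"expected a hyphenated ISBN-10, got {isbn!r}")
    return "-".join(parts[:2])


def cluster_by_prefix(records):
    """Partition records into clusters keyed by ISBN prefix."""
    clusters = defaultdict(list)
    for rec in records:
        clusters[isbn_prefix(rec["isbn"])].append(rec)
    return dict(clusters)


# Invented records: the publisher strings vary, but the prefix groups them.
SAMPLE = [
    {"isbn": "0-13-110362-8", "publisher": "Prentice Hall"},
    {"isbn": "0-13-110370-9", "publisher": "Prentice-Hall, Inc."},
    {"isbn": "0-471-00000-0", "publisher": "Wiley"},
]

clusters = cluster_by_prefix(SAMPLE)
```

Each cluster then gathers the variant name strings seen under one prefix, which is the input a person would review when hand parsing the entities.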
Publisher Name Server: Database
1750 publishing entities
Relational database, preserving hierarchical relationships
Begins with high-occurrence entities:
• “Top 10” lists
• Top 10 university presses
• Mergers and acquisitions, last 8 years
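One way to preserve parent/subsidiary relationships in a relational database is a self-referencing table walked with a recursive query. The sketch below uses SQLite purely for illustration; the table names, column names, and the three-level Pearson chain are assumptions for the example and do not describe the project's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity (
    id             INTEGER PRIMARY KEY,
    preferred_name TEXT NOT NULL,
    parent_id      INTEGER REFERENCES entity(id)  -- NULL for a top-level parent
);
CREATE TABLE isbn_prefix (
    prefix    TEXT PRIMARY KEY,
    entity_id INTEGER NOT NULL REFERENCES entity(id)
);
""")

# Assumed nesting, for illustration only.
conn.executemany("INSERT INTO entity VALUES (?, ?, ?)", [
    (1, "Pearson PLC", None),
    (2, "Pearson Education, Inc.", 1),
    (3, "Prentice-Hall, Inc.", 2),
])
conn.execute("INSERT INTO isbn_prefix VALUES ('0-13', 3)")


def ancestors(conn, prefix):
    """Resolve an ISBN prefix to its entity, then walk up to the ultimate parent."""
    rows = conn.execute("""
        WITH RECURSIVE chain(id, preferred_name, parent_id, depth) AS (
            SELECT e.id, e.preferred_name, e.parent_id, 0
            FROM entity e JOIN isbn_prefix p ON p.entity_id = e.id
            WHERE p.prefix = ?
            UNION ALL
            SELECT e.id, e.preferred_name, e.parent_id, chain.depth + 1
            FROM entity e JOIN chain ON e.id = chain.parent_id
        )
        SELECT preferred_name FROM chain ORDER BY depth
    """, (prefix,)).fetchall()
    return [name for (name,) in rows]
```

With this layout a merger or acquisition becomes a change to one `parent_id` value, and aggregate counts for a parent can be computed by walking the chain in the other direction.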
Example: Top U.S. Publishing Entities by ISBN
ISBN Prefix  WorldCat Records  Publishing Entity
0-13         50,298            Prentice-Hall, Inc.
0-07         44,545            McGraw Hill, Inc.
0-06         44,362            HarperCollins (Firm)
0-16         40,451            United States G.P.O.
0-471        37,710            John Wiley & Sons
0-312        33,318            St. Martin's Press
0-671        31,765            Simon & Schuster, Inc.
0-02         27,602            MacMillan Publishers
0-15         18,420            Harcourt Brace & Company
0-394        18,043            Random House (Firm)
0-590        17,290            Scholastic Inc.
0-385        16,768            Doubleday and Company, Inc.
0-395        16,699            Houghton Mifflin Company
0-19         15,724            Oxford University Press
0-03         15,417            Holt, Rinehart, and Winston
Publisher Name Server: Data Captured
Data:
• Publisher Name, Preferred Form
• Source of Preferred Form
• Former Names
• Variant Forms
• ISBN Prefixes
• HQ City
• HQ Country
• Other Cities
• URL
Data from data mining:
• Languages
• Formats
• Conspectus Subjects
Sources:
• U.S. Library of Congress Name Authority File, 110 (Corporate Name) field
• Books In Print Online (R.R. Bowker)
• The International ISBN Registry (K.G. Saur)
• Publishers’ Weekly Online
• Hoover’s Handbook Online
• Standard and Poor’s Corporate Descriptions
• The Directory of Corporate Affiliations (DIALOG)
• Company websites
Publisher Name Server: Current Scope
More than 56,000 separate strings mapped to 1750 entities
• 8.5 million OCLC records
• 22% of these are Library of Congress records
• ~490 million holdings
Hierarchical relationships maintained
Entity-Parsing in a World of Mergers and Acquisitions
Pearson PLC (organization chart; entities shown):
• Penguin Books
• Allen Lane
• Puffin Books
• Ladybird Books
• Pearson Canada
• Copp Clark
• Riverhead Books
• Pearson Technology Group
• Adobe Press
• Cisco Press
• Putnam Books
• Berkeley Publishing Group
• Pearson Education, Inc.
• Avery
• Addison-Wesley Publishing Company
• Benjamin/Cummings Publishing Company
• Allyn and Bacon
• Scott, Foresman and Company
• Prentice-Hall, Inc.
• HarperCollins Educational Publishers
• Dominie Press
• Longmans, Green, and Co.
Publisher Profiles within WorldCat
Oxford University Press
• 119,237 records with ISBNs mapped to 210,095 records (0.19% of WorldCat)
Pearson PLC
• Includes 14 subsidiaries and acquisitions
• Aggregate: 291,433 records (0.27% of WorldCat)
Springer (Firm)
• 197,263 records (0.18% of WorldCat)
Reed Elsevier PLC
• Includes dozens of subsidiaries
• Aggregate: 370,029 records (0.34% of WorldCat)
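The percentages above follow from dividing each publisher's record count by the July 2008 WorldCat total of 108,828,533 records reported earlier in the deck. A short sketch of that arithmetic (the variable and function names are ours):

```python
WORLDCAT_RECORDS = 108_828_533  # manifestations (records), WorldCat, July 2008

PUBLISHER_RECORDS = {
    "Oxford University Press": 210_095,
    "Pearson PLC": 291_433,
    "Springer (Firm)": 197_263,
    "Reed Elsevier PLC": 370_029,
}


def share_of_worldcat(records, total=WORLDCAT_RECORDS):
    """Percentage of all WorldCat records, rounded to two decimals."""
    return round(100 * records / total, 2)


SHARES = {name: share_of_worldcat(n) for name, n in PUBLISHER_RECORDS.items()}
# SHARES reproduces the slide figures: 0.19, 0.27, 0.18 and 0.34 percent.
```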
WorldCat Publisher Profiles – Top Languages
Oxford Univ. Press                    Pearson PLC
English                     96.74%    English                     95.27%
Latin                        0.51%    Spanish                      1.43%
German                       0.39%    German                       1.33%
Chinese                      0.39%    French                       0.60%
French                       0.37%    Dutch                        0.55%
Spanish                      0.28%    Latin                        0.26%
Afrikaans                    0.14%    Malay                        0.06%
Middle English               0.13%    Ancient Greek                0.05%
Malay                        0.09%    Portuguese                   0.05%
Swahili                      0.09%    Italian                      0.04%
WorldCat Publisher Profiles – Top Languages
Springer (Firm)                       Reed Elsevier PLC
English                     61.25%    English                     83.64%
German                      37.10%    French                       9.34%
French                       1.02%    Dutch                        2.32%
Italian                      0.29%    Spanish                      0.95%
Polish                       0.13%    Italian                      0.60%
Czech                        0.04%    Latin                        0.27%
Spanish                      0.04%    Afrikaans                    0.16%
Hungarian                    0.03%    Ancient Greek                0.12%
Dutch                        0.02%    Portuguese                   0.09%
Danish                       0.02%    Polish                       0.06%
WorldCat Publisher Profiles – Formats
Oxford University Press               Pearson PLC
Printed Material            89.57%    Printed Material            92.98%
Computer File                8.23%    Microform                    2.82%
Microform                    1.39%    Computer File                2.15%
Sound Recording              0.50%    Video Recording              0.70%
Video Recording              0.16%    Sound Recording              0.67%

Springer (Firm)                       Reed Elsevier PLC
Printed Material            81.69%    Printed Material            92.31%
Computer File               17.51%    Computer File                5.46%
Microform                    0.71%    Microform                    1.85%
Video Recording              0.05%    Video Recording              0.14%
WorldCat Publisher Profiles – Conspectus Divisions
Oxford Univ. Press                    Pearson PLC
Language/Literature         27.12%    Language/Literature         18.67%
History                     11.92%    Business/Economics          13.30%
Music                        9.78%    Computer Science             9.42%
Philosophy/Religion          9.55%    Engineering                  8.04%
Business/Economics           6.15%    History                      7.59%
Medicine                     4.36%    Mathematics                  6.04%
Law                          3.85%    Education                    5.64%
Sociology                    3.75%    Sociology                    4.18%
Political Science            3.58%    Philosophy/Religion          3.81%
Biology                      2.60%    Physical Sciences            2.75%
WorldCat Publisher Profiles – Conspectus Categories
Oxford Univ. Press                    Pearson PLC
English literature          10.66%    English language             7.74%
English language             5.86%    Business admin.              4.62%
Instrumental music           3.48%    English literature           3.63%
Vocal music                  3.09%    Economics                    2.94%
Literature on music          2.26%    Comp. programming            2.39%
History – Britain            1.82%    Electrical engineering       2.24%
Economic history             1.38%    Early childhood ed.          2.05%
American lit.                1.35%    Computer software            1.88%
History – S. Asia            1.30%    U.S. federal law             1.80%
General history              1.29%    Computer science             1.54%
WorldCat Publisher Profiles – Conspectus Subjects
Oxford Univ. Press                    Pearson PLC
English – modern             5.57%    English – modern             7.68%
English lit. – prose         2.51%    Management                   2.53%
English lit. – 19th c.       2.23%    Programming                  1.74%
Juvenile lit.                1.06%    Arithmetic                   1.09%
English lit. – poetry        1.03%    Economic theory              1.06%
English lit. – collections   0.80%    Marketing                    1.06%
Biographies                  0.76%    General algebra              1.04%
English lit. – 1900-1960     0.74%    Accounting                   0.97%
Shakespeare                  0.68%    Juvenile lit.                0.93%
Sacred choruses              0.66%    English lit. – 19th c.       0.89%
WorldCat Publisher Profiles – Conspectus Divisions
Springer (Firm)                       Reed Elsevier PLC
Computer Science            16.83%    Language/Literature         14.18%
Engineering                 15.12%    Law                         11.78%
Mathematics                 12.96%    Engineering                 11.73%
Medicine                     9.93%    Business/Economics           6.82%
Physical Sciences            9.83%    Medicine                     6.50%
Biology                      5.22%    Physical Sciences            5.01%
Business/Economics           5.13%    History                      4.57%
Health Professions           4.48%    Biology                      4.32%
Chemistry                    3.14%    Health Professions           3.70%
Geography                    2.58%    Chemistry                    3.51%
WorldCat Publisher Profiles – Conspectus Categories
Springer (Firm)                       Reed Elsevier PLC
Computer science             5.23%    English literature           5.84%
General math                 4.48%    Health professions           3.40%
Health professions           4.03%    English language             2.79%
Electrical engineering       3.73%    U.S. federal law             2.32%
General engineering          3.25%    General engineering          2.26%
Mathematical analysis        3.06%    Electrical engineering       2.10%
Computer software            2.37%    General law                  1.70%
Comp. programming            2.34%    Industrial economics         1.65%
Probability/Statistics       2.20%    Business admin.              1.53%
Mech. engineering            2.17%    U.S. state law               1.46%
WorldCat Publisher Profiles – Conspectus Subjects
Springer (Firm)                       Reed Elsevier PLC
Health professions           3.56%    English – modern             2.68%
Math collections             2.76%    English – prose              2.06%
Computer science             1.84%    Health professions           1.92%
Programming                  1.46%    U.S. state law               1.37%
Access/security              1.10%    Industrial management        1.22%
Artificial intelligence      1.03%    Legal periodicals            1.16%
Mathematical stats           1.03%    English lit. – 1900-1960     1.15%
Analytical physics           1.02%    Engineering materials        0.86%
Industrial management        0.99%    English fiction              0.83%
Engineering materials        0.90%    Nuclear physics              0.68%
Projected MARC coding of Authorized Forms
710 Added Entry – Corporate Name
• Add $4 for publisher name
• Add $2 NAF where preferred form matches existing authority record (44% of current PNAF)
752 Added Entry – Hierarchical Place Name
• Add $2 FAST where place of publication matches FAST geographical subject headings
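To make the projected 710 coding concrete, the sketch below renders such a field as a display string. It is a hedged illustration only: the indicator values (`2#`), the lowercase source code `naf`, and the function name are assumptions on our part, and `pbl` is the MARC relator code for publisher. It does not produce binary MARC and is not the project's actual output format.

```python
def marc_710(preferred_name: str, in_naf: bool) -> str:
    """Render a 710 added entry as a human-readable string (not ISO 2709 MARC)."""
    # $a carries the authorized form; $4 'pbl' marks the entity as publisher.
    subfields = [("a", preferred_name), ("4", "pbl")]
    if in_naf:
        # Assumed source code for a form that matches an existing NAF record.
        subfields.append(("2", "naf"))
    body = " ".join(f"${code} {value}" for code, value in subfields)
    # Indicators '2#' (name in direct order, second indicator blank) are assumed.
    return f"710 2# {body}"


example = marc_710("Prentice-Hall, Inc.", in_naf=True)
# example == "710 2# $a Prentice-Hall, Inc. $4 pbl $2 naf"
```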
Ongoing Research
Further data mining
• Profile other aspects of publication output
• Profile other publishers
• Trends over time
• Author clusters
• Geographic holdings patterns
• Collection Analysis
Plan for long-term maintenance
• ISBN-13 compliance
• File expansion to track ongoing merger and acquisition activity
• Deeper scaling into WorldCat (beyond ISBN)
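For the ISBN-13 item, the standard conversion of an ISBN-10 is to prepend 978, drop the old check digit, and recompute the check digit with alternating weights of 1 and 3. The sketch below shows that conversion; the function name is ours, and how the file itself will store 13-digit prefixes is not stated in the slides.

```python
def isbn10_to_isbn13(isbn10: str) -> str:
    """Convert an ISBN-10 (hyphens optional) to an unhyphenated ISBN-13."""
    chars = isbn10.replace("-", "").replace(" ", "")
    if len(chars) != 10:
        raise ValueError(f"expected 10 characters, got {isbn10!r}")
    core = "978" + chars[:9]  # the ISBN-10 check digit is discarded
    if not core.isdigit():
        raise ValueError(f"non-digit characters in {isbn10!r}")
    # Weights alternate 1, 3, 1, 3, ... across the first twelve digits.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
    check = (10 - total % 10) % 10
    return core + str(check)


converted = isbn10_to_isbn13("0-13-110362-8")
# converted == "9780131103627"
```

Under this conversion an ISBN-10 prefix such as 0-13 corresponds to 978-0-13, so existing prefix-to-publisher mappings carry over once the 978 element is added.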
OCLC Publisher Name Server
Project page:
http://www.oclc.org/research/projects/publisherns/
Thank You!
Questions and Discussion
Lynn Silipigni Connaway
Timothy J. Dickey