OCLC - The world's libraries. Connected.


ASIS&T 2008
Annual Meeting
Columbus, OH
28 October, 2008
Beyond Data Mining: Delivering the Next Generation of Services from Library Data
Lynn Silipigni Connaway, Ph.D.
Senior Research Scientist
OCLC
Timothy J. Dickey, Ph.D.
Post-Doctoral Researcher
OCLC
WorldCat as an “Aggregate Collection”
Data Mining and Analysis of WorldCat:
“…affords high-level perspective on historical patterns,
suggests future trends, and supplies useful intelligence
with which to inform decision making.”
Lavoie, B.F., Connaway, L. S., & O’Neill, E. T. (2007). Mapping WorldCat’s digital landscape.
Library Resources & Technical Services, 51, 106-115 at 107.
WorldCat: July 2008
Manifestations (records): 108,828,533
Total holdings: 1,292,763,300
Digital Items: 3,182,550
Works: 84,096,107
Institutions: 69,000
Physical Items: ~1.2 billion
Global Origins of WorldCat Materials
Rest of World: 27%
Germany: 10%
Unknown: 17%
France: 4%
Canada: 3%
UK: 8%
US: 28%
Global Origins of WorldCat Materials
Content Languages: 478
49% of WC non-English
Top 5 non-English:
German: 12 million
French: 6.1 million
Spanish: 3.5 million
Dutch: 2.6 million
Japanese: 2.4 million
Materials w/non-US origins: 57.9 million (55%)
Top 5:
Germany: 10.0 million
UK: 8.8 million
France: 4.2 million
Netherlands: 2.9 million
Canada: 2.9 million
Non-English Metadata Language:
28 million (66 languages)
Top 5:
German: 11 million
Swedish: 1.9 million
French, Dutch, Finnish: 1.8 million, 0.7 million, 5.0 million (pairing of values to languages not recoverable from the slide layout)
WorldCat as a Decision-Making Resource
Collection management
• Cooperative collection development
• Comparative collection analysis
• Collection assessment
• Mass digitization
• Off-site storage
• Preservation
WorldCat as a Decision-Making Resource
Services
• Virtual reference
• Recommender services
• Social networking
Systems
• Precision
WorldCat as a Decision-Making Resource
Three Areas of Data Mining Research:
• OCLC WorldMap
• Audience Level
• Publisher Name Server
OCLC WorldMap
OCLC WorldMap™: Objectives
Geographically represent WorldCat data
• Titles published in each country
• Holdings for titles published in each country
• Languages represented for titles published in each country
OCLC WorldMap™: Objectives
Geographically represent data from UNESCO, ARL, and NCES
for each country
• Number of
• Libraries
• Library volumes
• Certified/degreed librarians
• Registered library users
• Library expenditures
• Cultural heritage institutions (museums and archives)
• Publishers
OCLC WorldMap™: Objectives
Research prototype
• Support OCLC data mining research
• Visually display data for review and analysis
• Internal use
• Sales and marketing
• External use
• Library collection assessment and comparison
• Data may be processed AT A GLANCE
• Complement the AAU/ARL Global Resources Network project
• Project of the Council on Library and Information Resources
(CLIR)
http://pubserv.oclc.org:12223/WorldMap/
OCLC Audience Level
Audience Level: Rationale and Objectives
Holdings represent selection decisions by librarians … implies there are more than 1 billion individual selection decisions in the WorldCat holdings file
Selections serve the interests of a library’s target community …
• Associate community (audience level) to library profiles, e.g., ARL, non-ARL academic, public, K12 school …
Thus we can infer materials’ audience level from holdings patterns, which in turn can support:
• Collection management
• Readers’ advisory services
• Reference services
• Information retrieval
Example Computation: Build Community
Library symbol  Library name                    Library type  Weight
OHI             State Library of Ohio           Other         x
OCO             Columbus Metropolitan Library   Public        0.33
CDC             Cedarville University           Academic      0.67
LIM             Lima Public Library             Public        0.33
OUN             Ohio University                 Research      1.00
OSD             SEO Automation Consortium       Other         x
BGU             Bowling Green State University  Academic      0.67
MIA             Miami University                Academic      0.67
AKR             University of Akron             Academic      0.67
BGF             Firelands College               Academic      0.67
CIN             University of Cincinnati        Research      1.00
TOL             University of Toledo            Academic      0.67
KSU             Kent State University           Research      1.00
HIR             Hiram College                   Academic      0.67
YNG             Youngstown State University     Academic      0.67
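The weights above can be turned into a per-manifestation score. The following is a minimal sketch, assuming that a manifestation's audience level is the mean weight of the libraries holding it and that "Other" libraries (marked x) are excluded as unusable holdings. The slides do not spell out the formula, so the averaging rule and the function name `manifestation_audience_level` are assumptions for illustration only.

```python
# Weights by library type, as listed in the example table.
# "Other" has no weight (marked "x") and is treated as an unusable holding.
TYPE_WEIGHTS = {"Public": 0.33, "Academic": 0.67, "Research": 1.00}


def manifestation_audience_level(holding_types):
    """Return (total holdings, usable holdings, audience level).

    Assumption: the level is the plain mean of the weights of usable
    holdings; it is None when no holding is usable.
    """
    weights = [TYPE_WEIGHTS[t] for t in holding_types if t in TYPE_WEIGHTS]
    total, usable = len(holding_types), len(weights)
    level = sum(weights) / usable if usable else None
    return total, usable, level


# Library types of the fifteen Ohio libraries in the table, in table order.
ohio_holdings = [
    "Other", "Public", "Academic", "Public", "Research",
    "Other", "Academic", "Academic", "Academic", "Academic",
    "Research", "Academic", "Research", "Academic", "Academic",
]

total, usable, level = manifestation_audience_level(ohio_holdings)
print(total, usable, level)
```

Under this assumption, a title held only by "Other" libraries has no usable holdings and therefore no score, which matches the convention of marking such cases with x.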
“FRBRizing” Audience Level Results
• Calculate Audience Level for each Manifestation
• Aggregate weighted holdings for Work
OCLC Number  Total Holdings  Usable Holdings  Manifestation Audience Level
15504400     147             114              0.783825
29613712     172             117              0.769453
40393191     207             136              0.789426
62762763     190             124              0.758274
81016224     1               0                x
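One way to read "aggregate weighted holdings for Work" is as a mean of the manifestation scores weighted by usable holdings. The sketch below uses the five manifestation rows from this slide; the aggregation rule and the function name `work_audience_level` are assumptions, since the slides do not give the exact formula.

```python
# (OCLC number, total holdings, usable holdings, manifestation audience level)
# Rows as shown on the slide; None stands for the "x" (no usable holdings).
manifestations = [
    (15504400, 147, 114, 0.783825),
    (29613712, 172, 117, 0.769453),
    (40393191, 207, 136, 0.789426),
    (62762763, 190, 124, 0.758274),
    (81016224, 1, 0, None),
]


def work_audience_level(rows):
    """Mean of manifestation levels weighted by usable holdings (assumed rule)."""
    scored = [(usable, level) for _, _, usable, level in rows
              if usable and level is not None]
    usable_total = sum(usable for usable, _ in scored)
    if usable_total == 0:
        return None
    return sum(usable * level for usable, level in scored) / usable_total


print(work_audience_level(manifestations))
```

Weighting by usable holdings keeps a manifestation with a single holding from counting as much as one held by more than a hundred libraries.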
Evaluating the OCLC Audience Level
• Random sample of 30 Zoology books, all audience levels
• Human subjects
• Ranked books “in increasing order of difficulty”
• Strong statistical correlation between human subjects’
ranking and programmatic ranking
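Agreement between two rankings of the same books can be measured with a rank correlation coefficient. The sketch below computes Spearman's rho for rankings without ties. The slides do not name the statistic that was used, and the two short rankings are made-up values for illustration, not the study data.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two tie-free rankings of the same items."""
    n = len(rank_a)
    if n != len(rank_b) or n < 2:
        raise ValueError("rankings must have equal length of at least 2")
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical rankings of six books (1 = easiest, 6 = most difficult).
programmatic_rank = [1, 2, 3, 4, 5, 6]
human_rank = [2, 1, 3, 4, 6, 5]

rho = spearman_rho(programmatic_rank, human_rank)
print(rho)
```

A value near 1 means the human and programmatic orderings largely agree; a value near -1 means they are reversed.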
Evaluating the OCLC Audience Level
[Figure: Subjects' Rankings (vertical axis, 0 to 30) plotted against Audience Level Ranking (horizontal axis, 1 to 30)]
http://audiencelevel.oclc.org/
OCLC Publisher Name Server
Publisher Name Server: Research Objectives
Resolve for data mining and quality of WorldCat
• ISBN prefixes to publisher name
• Variant publisher names to a preferred form
Complement Collection Analysis Service
• Librarians
• Publishers
Capture and profile attributes of individual publishers
• Location(s)
• Language(s) of materials published
• Genre(s)/format(s)
• Dominant subject domain(s)
• Parent company and subsidiaries
Publisher Name Server: Methodology
Programmatically cluster publishers’ records using ISBN
prefixes
• Data clustering (The Free Dictionary)
• "The science of extracting useful information from large data
sets or databases"
• Classification of similar objects into different groups
• Partitioning of a data set into subsets (clusters)
• Data in each subset (ideally) share some common trait
Hand parse the entities and resolve ISBN prefixes
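The clustering step can be pictured as grouping publisher name strings under the ISBN prefix they share. This is a minimal sketch, assuming the ISBNs are already hyphenated so that the group and registrant elements can be read directly; unhyphenated ISBNs would need the official range tables and are skipped here. The function names and the sample records (including their title numbers and check digits) are hypothetical.

```python
from collections import defaultdict


def isbn_prefix(isbn):
    """Return 'group-registrant' from a hyphenated ISBN, or None if unusable."""
    parts = isbn.strip().split("-")
    if len(parts) == 5:      # ISBN-13: drop the leading 978/979 element
        parts = parts[1:]
    if len(parts) != 4:      # not hyphenated into group-registrant-title-check
        return None
    return "-".join(parts[:2])


def cluster_by_prefix(records):
    """Map each ISBN prefix to the set of publisher strings seen with it."""
    clusters = defaultdict(set)
    for isbn, publisher_string in records:
        prefix = isbn_prefix(isbn)
        if prefix is not None:
            clusters[prefix].add(publisher_string)
    return dict(clusters)


# Made-up records: the ISBNs are placeholders, not real titles.
records = [
    ("0-19-000000-0", "Oxford University Press"),
    ("0-19-111111-1", "Oxford Univ. Pr."),
    ("978-0-19-222222-2", "OUP"),
    ("0-14-000000-0", "Penguin Books"),
    ("0140000000", "Penguin"),  # unhyphenated, skipped in this sketch
]

clusters = cluster_by_prefix(records)
print(clusters)
```

Each resulting cluster is a candidate list of variant names for one publisher, which would then be reviewed by hand as described above.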
Publisher Name Server: Database
1750 publishing entities
Relational database, preserving hierarchical relationships
Begins with high-occurrence entities:
• “Top 10” lists (USA, UK, Canada, Australia, Germany, France,
Netherlands, Japan, Italy, China, Russia, Spain, Finland, Australia,
Taiwan, New Zealand)
• Top 10 university presses
• Mergers and acquisitions, last 8 years
Publisher Name Server: Data Captured
Database Fields:
• Publisher Name, Preferred Form
• Source of Preferred Form
• Former Names
• Variant Forms
• ISBN Prefixes
• HQ City
• HQ Country
• Other Cities
• URL
Fields derived by data mining:
• Languages
• Formats
• Conspectus Subjects
Data Sources:
• U.S. Library of Congress, National Authority File, 110 (Corporate Name) field
• Books In Print Online (R.R. Bowker)
• The International ISBN Registry (K.G. Saur)
• Publishers’ Weekly Online
• Hoover’s Handbook Online
• Standard and Poor’s Corporate Descriptions
• The Directory of Corporate Affiliations (DIALOG)
• Company websites
Publisher Name Server: Database
More than 56,000 separate strings mapped to 1750 entities
• 8.5 million OCLC records
• 22% of these are Library of Congress records
• ~490 million holdings
Hierarchical relationships maintained
Entity-Parsing in a World of Mergers and Acquisitions
Pearson PLC
Penguin Books
Allen Lane
Puffin Books
Ladybird Books
Pearson Canada
Copp Clark
Riverhead Books
Pearson Technology Group
Adobe Press
Cisco Press
Putnam Books
Berkeley Publishing Group
Pearson Education, Inc.
Avery
Addison-Wesley Publishing Company
Benjamin/Cummings Publishing Company
Allyn and Bacon
Scott, Foresman and Company
Prentice-Hall, Inc.
HarperCollins Educational Publishers
Dominie Press
Longmans, Green, and Co.
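Maintaining hierarchical relationships means a variant name can be resolved first to its preferred form and then rolled up to the top-level parent. The sketch below uses a small parent map and variant map; the specific parent links and variant strings are illustrative assumptions, not a reconstruction of the chart, and the names `PARENT`, `VARIANTS`, and `top_level_entity` are hypothetical.

```python
# Child entity -> parent entity (illustrative subset, assumed links).
PARENT = {
    "Penguin Books": "Pearson PLC",
    "Puffin Books": "Penguin Books",
    "Pearson Education, Inc.": "Pearson PLC",
    "Addison-Wesley Publishing Company": "Pearson Education, Inc.",
}

# Variant string -> preferred form (hypothetical variants).
VARIANTS = {
    "Puffin": "Puffin Books",
    "Addison Wesley": "Addison-Wesley Publishing Company",
}


def top_level_entity(name):
    """Resolve a name to its preferred form, then walk up to the root parent."""
    entity = VARIANTS.get(name, name)
    seen = set()
    while entity in PARENT:
        if entity in seen:  # guard against accidental cycles in the data
            raise ValueError(f"cycle detected at {entity!r}")
        seen.add(entity)
        entity = PARENT[entity]
    return entity


print(top_level_entity("Puffin"))
print(top_level_entity("Addison Wesley"))
```

Storing only the immediate parent for each entity keeps the table easy to update after a merger or acquisition: one changed link moves a whole subtree.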
Publisher Profiles
Oxford University Press
• 119,237 records with ISBNs mapped to 210,095 records
(0.19% of WorldCat)
Pearson PLC
• Includes 14 subsidiaries and acquisitions
• Aggregate: 291,433 records (0.27% of WorldCat)
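The percentages in these profiles are consistent with dividing each publisher's record count by the total number of WorldCat records. A quick check, assuming the denominator is the July 2008 figure of 108,828,533 records reported earlier in this deck:

```python
WORLDCAT_RECORDS = 108_828_533  # manifestations (records), July 2008


def share_of_worldcat(record_count):
    """Return a publisher's record count as a percentage of WorldCat."""
    return 100 * record_count / WORLDCAT_RECORDS


oup_share = share_of_worldcat(210_095)      # Oxford University Press
pearson_share = share_of_worldcat(291_433)  # Pearson PLC aggregate

print(f"{oup_share:.2f}% {pearson_share:.2f}%")
```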
Publisher Profiles – Top Languages
Oxford Univ. Press:
English: 96.74%
Latin: 0.51%
German: 0.39%
Chinese: 0.39%
French: 0.37%
Spanish: 0.28%
Afrikaans: 0.14%
Middle English: 0.13%
Malay: 0.09%
Swahili: 0.09%
Pearson PLC:
English: 95.27%
Spanish: 1.43%
German: 1.33%
French: 0.60%
Dutch: 0.55%
Latin: 0.26%
Malay: 0.06%
Ancient Greek: 0.05%
Portuguese: 0.05%
Italian: 0.04%
Publisher Profiles – Conspectus Divisions
Oxford Univ. Press:
Language/Literature: 27.12%
History: 11.92%
Music: 9.78%
Philosophy/Religion: 9.55%
Business/Economics: 6.15%
Medicine: 4.36%
Law: 3.85%
Sociology: 3.75%
Political Science: 3.58%
Biology: 2.60%
Pearson PLC:
Language/Literature: 18.67%
Business/Economics: 13.30%
Computer Science: 9.42%
Engineering: 8.04%
History: 7.59%
Mathematics: 6.04%
Education: 5.64%
Sociology: 4.18%
Philosophy/Religion: 3.81%
Physical Sciences: 2.75%
Publisher Profiles – Conspectus Categories
Oxford Univ. Press:
English literature: 10.66%
English language: 5.86%
Instrumental music: 3.48%
Vocal music: 3.09%
Literature on music: 2.26%
History – Britain: 1.82%
Economic history: 1.38%
American lit.: 1.35%
History – S. Asia: 1.30%
General history: 1.29%
Pearson PLC:
English language: 7.74%
Business admin.: 4.62%
English literature: 3.63%
Economics: 2.94%
Comp. programming: 2.39%
Electrical engineering: 2.24%
Early childhood ed.: 2.05%
Computer software: 1.88%
U.S. federal law: 1.80%
Computer Science: 1.54%
Publisher Profiles – Conspectus Subjects
Oxford Univ. Press:
English – modern: 5.57%
English lit – prose: 2.51%
English lit – 19th c.: 2.23%
Juvenile lit.: 1.06%
English lit – poetry: 1.03%
English lit – collections: 0.80%
Biographies: 0.76%
English lit – 1900-1960: 0.74%
Shakespeare: 0.68%
Sacred choruses: 0.66%
Pearson PLC:
English – modern: 7.68%
Management: 2.53%
Programming: 1.74%
Arithmetic: 1.09%
Economic theory: 1.06%
Marketing: 1.06%
General algebra: 1.04%
Accounting: 0.97%
Juvenile lit.: 0.93%
English lit – 19th c.: 0.89%
Projected MARC coding of Authorized Forms
710 Added Entry – Corporate Name
• Add $4 for publisher name
• Add $2 NAF where preferred form matches existing authority
record (44% of current PNAF)
752 Added Entry – Hierarchical Place Name
• Add $2 FAST where place of publication matches FAST
geographical subject headings
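The projected coding can be sketched as plain text renderings of the two fields. This is an illustration only: indicators are omitted, the relator code `pbl` (publisher) and the source codes `naf` and `fast` are the values assumed here for $4 and $2, and the helper names and sample values are hypothetical rather than taken from the project.

```python
def render(tag, subfields):
    """Render a field as 'TAG $code value ...' (indicators omitted)."""
    return tag + " " + " ".join(f"${code} {value}" for code, value in subfields)


def field_710(preferred_name, matches_naf):
    """710 added entry for the publisher; $2 only when the name matches NAF."""
    subfields = [("a", preferred_name), ("4", "pbl")]
    if matches_naf:
        subfields.append(("2", "naf"))
    return render("710", subfields)


def field_752(country, city, matches_fast):
    """752 hierarchical place name; $2 only when the place matches FAST."""
    subfields = [("a", country), ("d", city)]
    if matches_fast:
        subfields.append(("2", "fast"))
    return render("752", subfields)


print(field_710("Oxford University Press", matches_naf=True))
print(field_752("England", "Oxford", matches_fast=True))
```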
Future Research
• Further data mining
• Profile aspects of publication output
• Deeper scaling into WorldCat (beyond ISBN)
• Plan for long-term maintenance
• ISBN-13 compliance
• File expansion of ongoing mergers/ acquisition activities
Thank You!
Questions and Discussion
Lynn Silipigni Connaway
Timothy J. Dickey