HATHI TRUST A Shared Digital Repository HathiTrust, Collections, and Collaboration COLD 2011 Spring Meeting Jeremy York May 20, 2011

HATHI TRUST
A Shared Digital Repository
HathiTrust, Collections, and
Collaboration
COLD 2011 Spring Meeting
Jeremy York
May 20, 2011
Outline
• Overview
• Partnership
• Mission/Goals
• Collections
• Services
• Collaboration
Mission and Goals
Overview
Current Partners
Arizona State University
Baylor University
California Digital Library
Columbia University
Cornell University
Dartmouth College
Duke University
Emory University
Harvard University Library
Indiana University
Johns Hopkins University
Library of Congress
Massachusetts Institute of Technology
Michigan State University
New York University
New York Public Library
North Carolina Central University
North Carolina State University
Northwestern University
The Ohio State University
The Pennsylvania State University
Princeton University
Purdue University
Stanford University
Texas A&M University
Universidad Complutense de Madrid
University of California: Berkeley, Davis, Irvine, Los Angeles, Merced, Riverside, San Diego, San Francisco, Santa Barbara, Santa Cruz
The University of Chicago
University of Illinois
University of Illinois at Chicago
The University of Iowa
University of Maryland
University of Michigan
University of Minnesota
The University of North Carolina at Chapel Hill
University of Pennsylvania
University of Pittsburgh
University of Utah
University of Virginia
University of Washington
University of Wisconsin-Madison
Utah State University
Yale University Library
Mission
• To contribute to the common good by collecting, organizing, preserving, communicating, and sharing the record of human knowledge
Collections and Collaboration
• Comprehensive collection
• Preservation…with Access
• Shared strategies
– Collection management, development
– Copyright
– Preservation
– Efficient user services
• Public Good
Collections
What is in HathiTrust?
• 8,725,092 Total volumes
• 2,367,111 Public Domain
• 4,774,782 Book titles
• 211,688 Serial titles
* As of May 20, 2011
Content Sources
* As of May 1, 2011
Content Distribution
* As of May 1, 2011
Dates
* As of May 1, 2011
Breakdown of HathiTrust book corpus by publication date
Bibliographic Indeterminacy and the Scale of Problems and Opportunities of "Rights" in Digital Collection Building – 2/2011
Breakdown of HathiTrust book corpus by publication date
Language Distribution (1)
The top 10 languages make up ~86% of all content
* As of May 1, 2011
Language Distribution (2)
The next 40 languages make up ~13% of total
* As of May 1, 2011
Content over time
[Chart: share of content by contributing institution over time, 0% to 100%. Legend: Chicago, Madrid, Columbia, LoC, Harvard, Minnesota, Indiana, Princeton, NYPL, Cornell, Wisconsin, California, Michigan]
* As of May 1, 2011
Content Growth
Services:
Preservation, Access
Services (1)
• Ingest
– Book and Journal content
• Google
• Internet Archive
• In-house, other vendor digitization
– Images, Audio, Born digital (coming soon…)
• Two parts
– Bibliographic Data
– Content
Services (2)
• Long-term preservation
– Bit-level, migration
– Standard and open formats (ITU G4 TIFF, JPEG2000, JPG, Unicode)
– Validation, integrity, redundancy
– OAIS
• How reliable is it?
– DRAMBORA, TRAC
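The "validation, integrity" bullet can be illustrated with a checksum (fixity) check. The following is a minimal sketch in Python, not HathiTrust's actual code: the function names and the manifest shape (a mapping of file name to expected MD5 hex digest) are assumptions made for illustration only.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_manifest(directory: Path, manifest: dict[str, str]) -> list[str]:
    """Return the names of files that are missing or whose MD5 differs.

    `manifest` is a hypothetical structure: file name -> expected MD5 hex digest.
    """
    failures = []
    for name, expected in manifest.items():
        target = directory / name
        if not target.is_file() or md5_of_file(target) != expected.lower():
            failures.append(name)
    return failures
```

An empty list from `validate_manifest` means every listed file is present and matches its recorded checksum. Any returned name would be a candidate for repair from a replicated copy.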
Technology - OAIS
[Diagram: OAIS-based repository architecture. Labels: MARC record extensions (Aleph); Rights DB; GROOVE (JHOVE); Page Turner; HathiTrust API; OAI; GeoIP DB; CNRI Handles; Solr; Google; Internet Archive; In-house; Conversion; GRIN; Internal Data Loading; METS/PREMIS object; TIFF G4/JPEG2000; OCR; MD5 checksums; Isilon; Site Replication; TSM; MD5 checksum validation; METS object; PNG; OCR; PDF]
Quality
• Partner Digitization
• Google Digitization
• Quality work / Volume certification
• [email protected]
Services (3)
• Preservation…with Access
– As part of preservation, service to partners, and as public good
– Discovery
• Bibliographic (temporary catalog, OCLC/HathiTrust catalog)
• Full-text
– Reading
• Interface optimized for users with print disabilities
– Collections
• Skip navigation link
• Info about SSD service & link to accessibility page
• Descriptive headings added (hidden from GUI with CSS)
• Added labels & descriptive titles to forms & ToC table
• Access keys for navigating pages with keyboard
• Images used for style are in CSS, so no need to use alt tags
Access Matrix
[Table: access by type of work]
Rows (type of work): Public domain worldwide; Public domain in the US; Open Access (+Creative Commons); In copyright (and undetermined)
Columns: Search (bib and full text); View; Full-PDF download; Print on Demand; Print disabilities; Section 108 (preservation uses)
Cell values: World; US; World if no restrictions, Partners if restrictions; US if no restrictions, US partners if restrictions; World if no restrictions; World with permission; Partners worldwide; Not available; N/A; Partners US and worldwide, where applicable
Services (4)
• Rights Management
– Rights Database
– Copyright review
• IMLS grant awarded to University of Michigan in 2008 to determine copyright status of books published in US between 1923 and 1963
• 18 staff members, 4 institutions
– Indiana University
– University of Michigan
– University of Minnesota
– University of Wisconsin
• 125k reviewed through CRMS
• 67,000 (54%) in public domain
Services (5)
• Data Availability
– Tab-delimited inventory files
– Bibliographic API
– Data API
– OAI feed of public domain
– SFX target
– Summon
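As a rough illustration of working with the tab-delimited inventory files listed above, the sketch below reads tab-separated rows and filters them by rights status. The column names (`volume_id`, `rights`, `title`), the rights codes, and the sample rows are hypothetical placeholders, not the documented layout of the actual files.

```python
import csv
import io

# Hypothetical column names, for illustration only; the real inventory
# files define their own layout.
COLUMNS = ["volume_id", "rights", "title"]


def read_inventory(stream):
    """Parse tab-delimited rows into dictionaries keyed by COLUMNS."""
    reader = csv.DictReader(
        stream, fieldnames=COLUMNS, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return list(reader)


def public_domain_ids(rows, pd_code="pd"):
    """Return identifiers of rows whose rights field equals pd_code (assumed code)."""
    return [row["volume_id"] for row in rows if row["rights"] == pd_code]


# Made-up sample data standing in for an inventory file.
sample = io.StringIO(
    "vol.0001\tpd\tAn Example Title\n"
    "vol.0002\tic\tAnother Title\n"
)
rows = read_inventory(sample)
print(public_domain_ids(rows))
```

A library could use a filter like this to decide which records to load into a local catalog, as in the examples of use that follow.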
Some Examples of Use
• Catalogs
– UM loaded every record
– Chicago links to public domain volumes owned in print
– TROVE harvesting through OAI
– OCLC loads records into WorldCat
• Link Resolvers
– UC created SFX target
• Vendors
– H.W. Wilson database links to public domain volumes
– ProQuest full-text index via Summon
Services (6)
• Collaborative Development Environment
– Active repository development
• Support for Computational Research
– Datasets
• 120,000-volume set
• Google-digitized public domain
– Protocol-based access
– Research Center
How Different from Google?
• Preservation
• Content
• Collective work
• Uses of materials
• Own trajectory
• Partnership
– Not just about digital content or repository
– Address challenges
– Fulfill mission
– Provide services for our communities
Collaboration:
Print Storage
A global change in the library environment
Academic print book collection already substantially duplicated in mass digitized book corpus
[Chart: % of titles in local collection (0% to 60%) plotted against rank in 2008 ARL Investment Index (0 to 120)]
• June 2009, median duplication: 19%
• June 2010, median duplication: 31%
Continuing growth of overlap …
• ARL overlap
– 31% in June 2010
– 33% in Dec (adjustment: adding little-held works)
– ~ 1% per 225,000 vols
– 38% in May, 2011; 45% by December, 2011
• Oberlin Group overlap
– 41% in December, 2010
– Higher rate of overlap per added volume?
– Close to 50% in May, 2011
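The rule of thumb of roughly 1% of ARL overlap per 225,000 volumes can be turned into a quick back-of-the-envelope calculation. The sketch below assumes the relationship is linear, which the slide itself only approximates (note the December adjustment for little-held works). The function names are illustrative.

```python
# From the slide: roughly 1 percentage point of overlap per 225,000 volumes.
VOLUMES_PER_POINT = 225_000


def volumes_needed(current_pct: float, target_pct: float) -> int:
    """Volumes to add to move overlap from current_pct to target_pct (linear assumption)."""
    return round((target_pct - current_pct) * VOLUMES_PER_POINT)


def projected_overlap(current_pct: float, added_volumes: int) -> float:
    """Overlap percentage after adding volumes, under the same linear assumption."""
    return current_pct + added_volumes / VOLUMES_PER_POINT


# Moving from 38% (May 2011) to 45% (December 2011) is 7 points.
print(volumes_needed(38, 45))  # 1575000
```

Under this assumption, reaching the projected 45% by December 2011 would take about 1.6 million additional volumes over seven months.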
Digitized Books in Shared Repositories
~75% of mass digitized corpus is ‘backed up’ in one or more shared print repositories
[Chart: unique titles (0 to 3,500,000) by month, Sep-09 to Jun-10. Series: mass digitized books in Hathi digital repository; mass digitized books in shared print repositories. Annotations: ~3.5M titles; ~2.5M]
Cost Model
• Based on overlap with print collections
– Public Domain / In-copyright
• Print Holdings Database
– Costs
– Lawful uses of materials
– Complete picture
– Volumes institutions own or have owned
• OCLC number; Bib record ID; Condition; Holding Status
Collaboration:
Copyright
Copyright status of books published pre-1923 and US works published 1923-1963
Public domain, in-copyright, and orphan works, pre-1923 and 1923-1963
Breakdown by US/non-US and rights status, pre-1923, 1923-1963 and 1964-1977
Breakdown by US/non-US and rights status for all periods
Collaboration:
Preservation
Technology - OAIS
[Diagram: OAIS-based repository architecture. Labels: MARC record extensions (Aleph); Rights DB; GROOVE (JHOVE); Page Turner; HathiTrust API; OAI; GeoIP DB; CNRI Handles; Solr; Google; Internet Archive; In-house; Conversion; GRIN; Internal Data Loading; METS/PREMIS object; TIFF G4/JPEG2000; OCR; MD5 checksums; Isilon; Site Replication; TSM; MD5 checksum validation; Technology; METS object; PNG; OCR; PDF]
How to find out more
• Web site “About” section: http://www.hathitrust.org/about
• Twitter: http://twitter.com/hathitrust
• RSS: http://www.hathitrust.org/updates_rss
• Monthly newsletter: http://www.hathitrust.org/updates
• Contact us: [email protected]
• Soon: Facebook, blog
Thank you!