HATHITRUST A Shared Digital Repository HathiTrust Overview MichALL Spring Meeting May 18, 2012 Jeremy York, Project Librarian, HathiTrust.


HATHITRUST
A Shared Digital Repository
HathiTrust Overview
MichALL Spring Meeting
May 18, 2012
Jeremy York, Project Librarian, HathiTrust
Partnership
Arizona State University
Baylor University
Boston College
Boston University
California Digital Library
Columbia University
Cornell University
Dartmouth College
Duke University
Emory University
Florida State University
Getty Research Institute
Harvard University Library
Indiana University
Johns Hopkins University
Lafayette College
Library of Congress
Massachusetts Institute of Technology
McGill University
Michigan State University
New York Public Library
New York University
North Carolina Central University
North Carolina State University
Northwestern University
The Ohio State University
The Pennsylvania State University
Princeton University
Purdue University
Stanford University
Texas A&M University
Universidad Complutense de Madrid
University of Arizona
University of Calgary
University of California
– Berkeley
– Davis
– Irvine
– Los Angeles
– Merced
– Riverside
– San Diego
– San Francisco
– Santa Barbara
– Santa Cruz
The University of Chicago
University of Connecticut
University of Florida
University of Illinois
University of Illinois at Chicago
The University of Iowa
University of Maryland
University of Miami
University of Michigan
University of Minnesota
University of Missouri
University of Nebraska-Lincoln
The University of North Carolina at Chapel Hill
University of Notre Dame
University of Pennsylvania
University of Pittsburgh
University of Utah
University of Virginia
University of Washington
University of Wisconsin-Madison
Utah State University
Washington University
Yale University Library
Digital Repository
• Launched 2008
• Initial focus on digitized book and journal content
– 10,302,450 total volumes
– 5,462,709 book titles
– 271,014 serial titles
– 2,994,286 public domain (~29%)
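The ~29% figure follows from the two volume counts above. A quick check, with illustrative variable names that are not part of the original slide:

```python
# Counts copied from the slide above.
total_volumes = 10_302_450
public_domain_volumes = 2_994_286

# Public domain share of the whole repository, as a percentage.
public_domain_share = 100 * public_domain_volumes / total_volumes
print(f"Public domain share: {public_domain_share:.1f}%")
```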
The Name
• The meaning behind the name
– Hathi (hah-tee): Hindi for elephant
– Big, strong
– Never forgets, wise
– Secure
– Trustworthy
Mission
• To contribute to the common good by collecting, organizing, preserving, communicating, and sharing the record of human knowledge
Collections and Collaboration
• Comprehensive collection
– Preservation…with Access
• Shared strategies
– Copyright
– Collection management, development
– Preservation
– Discovery / Use
– Bibliographic Indeterminacy
– Efficient user services
• Public Good
Content
Content Distribution
• "Public Domain": 28%
– Public Domain (worldwide): 14%
– U.S. Federal Government Documents (worldwide): 4%
– Public Domain (US): 10%
• Open Access: .1%
• Creative Commons: .01%
• Remaining content: 72%
Content Sources
[Pie chart: share of content by contributing institution]
• Michigan: 45%
• California: 33%
• Other contributors (each 5% or less): LC, Minnesota, Yale, UNC-Chapel Hill, Harvard, Madrid, Virginia, Utah State, Indiana, Chicago, NCSU, Columbia, Northwestern, Duke, Illinois, Penn State, NYPL, Princeton, Purdue, Cornell, Wisconsin
Dates
• 0-1500: 0%
• 1500-1599: 0%
• 1600-1699: 0%
• 1700-1799: 1%
• 1800-1849: 3%
• 1850-1899: 8%
• 1900-1909: 4%
• 1910-1919: 4%
• 1920-1929: 4%
• 1930-1939: 4%
• 1940-1949: 4%
• 1950-1959: 6%
• 1960-1969: 11%
• 1970-1979: 13%
• 1980-1989: 15%
• 1990-1999: 14%
• 2000-2009: 10%
Language Distribution (1)
The top 10 languages make up ~86% of all content
• English: 48%
• German: 9%
• French: 7%
• Spanish: 5%
• Chinese: 4%
• Russian: 4%
• Japanese: 3%
• Italian: 3%
• Arabic: 2%
• Latin: 1%
• Remaining Languages: 14%
Language Distribution (2)
The next 40 languages make up ~13% of total
[Pie chart; slice labels range from 1% to 7%: Ancient-Greek, Ukrainian, Bulgarian, Panjabi, Catalan, Multiple, Malayalam, Romanian, Armenian, Telugu, Undetermined, Marathi, Malay, Greek, Vietnamese, Finnish, Slovak, Serbian, Polish, Hungarian, Sanskrit, Portuguese, Norwegian, Dutch, Music, Bengali, Tamil, Persian, Croatian, Unknown, Czech, Danish, Hebrew, Hindi, Thai, Turkish, Urdu, Korean, Swedish, Indonesian]
Preservation and Access
Preservation with Access
• Cost effective preservation and access services
• Preservation
– TRAC-certified
– Robust infrastructure
– Long-term commitments on digital content facilitate planning, decision-making
Preservation with Access (2)
• Discovery
– Bibliographic and full-text search of all materials
– Extended discovery (ProQuest, EBSCO, OCLC, Ex Libris)
– Mechanisms for local loading of records
Preservation with Access (3)
• Access and Use
– Public domain and open access works
– Full download of materials where possible*
– Print on demand
– Collections and APIs
– Research Center*
– Lawful uses of in-copyright works*
Lawful uses
• Access to users who have print disabilities
• Section 108 uses of materials
• Access to orphan works
Terms of Access
• Available to students, faculty, staff of partnering institutions
– On library premises or authenticated into HathiTrust
• Partner libraries own a print copy
– One simultaneous user per print copy owned
• Users must be on U.S. soil
• One page at a time download
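The terms above amount to a set of conditions that must all hold before a work is opened. A minimal sketch of that check, assuming hypothetical field and function names; it illustrates the listed terms and is not HathiTrust's actual access-control code:

```python
from dataclasses import dataclass


@dataclass
class AccessRequest:
    # All field names are hypothetical, chosen to mirror the bullets above.
    partner_affiliate: bool             # student, faculty, or staff of a partner
    on_premises_or_authenticated: bool  # in the library or logged in to HathiTrust
    on_us_soil: bool
    print_copies_owned: int             # copies held by the user's partner library
    current_users: int                  # users already viewing this work


def may_open(req: AccessRequest) -> bool:
    """Return True only if every term listed on the slide is satisfied."""
    return (
        req.partner_affiliate
        and req.on_premises_or_authenticated
        and req.on_us_soil
        # One simultaneous user per print copy owned; zero copies means no access.
        and req.current_users < req.print_copies_owned
    )


print(may_open(AccessRequest(True, True, True, print_copies_owned=1, current_users=0)))
```

The one-page-at-a-time download limit is a separate restriction on what an admitted user can do, so it is left out of the admission check.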
How do we facilitate uses?
• Fundamental issues of
– Identification
– Description
– Rights
Copyright
Automatic Rights Determination
• Conducted on all works at time of ingest and when records are modified
– Public domain worldwide
• US works published before 1923, US federal government publications, non-US works published prior to 1872
– Public domain in the United States
• Non-US works published prior to 1923
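The date and place rules above can be sketched as a small function. The `Work` fields and the "needs further review" fallback label are assumptions made for illustration; the slide does not say how works outside these rules are labeled, and this is not the production rights code:

```python
from dataclasses import dataclass


@dataclass
class Work:
    # Hypothetical record shape; real bibliographic records carry far more detail.
    pub_year: int
    published_in_us: bool
    us_federal_document: bool = False


def automatic_rights(work: Work) -> str:
    """Apply only the rules stated on the slide."""
    if work.us_federal_document:
        return "public domain worldwide"
    if work.published_in_us:
        if work.pub_year < 1923:
            return "public domain worldwide"
    elif work.pub_year < 1872:
        return "public domain worldwide"
    elif work.pub_year < 1923:
        return "public domain in the United States"
    # Assumption: anything not covered by the rules is left for other processes.
    return "needs further review"


print(automatic_rights(Work(pub_year=1900, published_in_us=False)))
```

Checking the federal-document flag first keeps later government publications from falling through to the date tests.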
Manual Rights Determination
• IMLS-funded CRMS project
– US-published works 1923-1963
– Conformance with formalities
– Expanding to non-US works
– Double-blind review with expert review for conflicts
– Staff at 4 HathiTrust partner institutions (15 will take part in non-US)
– As of February 2012 ~190,000 reviewed, more than 100,000 opened
• Rights Holder Permissions
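The double-blind review step can be pictured as follows: two reviewers judge a work independently, and an expert decides only when they disagree. A minimal sketch, with a hypothetical function name and status strings that are not taken from the CRMS system:

```python
from typing import Optional


def resolve_review(first: str, second: str, expert: Optional[str] = None) -> str:
    """Combine two independent determinations, escalating conflicts to an expert."""
    if first == second:
        # Agreement between the two blind reviews settles the determination.
        return first
    if expert is None:
        # Hypothetical placeholder status while the conflict is unresolved.
        return "conflict: awaiting expert review"
    return expert


print(resolve_review("public domain", "in copyright", expert="in copyright"))
```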
Breakdown of HathiTrust book corpus by publication date
Bibliographic Indeterminacy and the Scale of Problems and Opportunities of "Rights" in Digital Collection Building – 2/2011
Copyright status of books published pre-1923 and US works published 1923-1963
In Print ?
Collection Management
A global change in the library environment
Academic print book collection already substantially duplicated in mass digitized book corpus
[Chart: % of Titles in Local Collection (0% to 60%) against Rank in 2008 ARL Investment Index (0 to 120)]
• June 2009 median duplication: 19%
• June 2010 median duplication: 31%
Digitized Books in Shared Repositories
~75% of mass digitized corpus is ‘backed up’ in one or more shared print repositories
[Chart: Unique Titles (0 to 3,500,000) by month, Sep-09 to Jun-10; series: mass digitized books in Hathi digital repository, mass digitized books in shared print repositories; annotations: ~3.5M titles, ~2.5M]
Collection Management, Development
• Overlap
– More than 50% median overlap with ARL institutions; higher for small liberal arts colleges
• Pricing model based on Print holdings
– Requires print holdings database
– Also support expansion of legal uses, efforts in deduplication
– Facilitate individual and collaborative collection development and management operations
Holdings relevant to Law
• Reports
• Committee Hearings
• Government Documents
• Law Reviews
• Information around court cases
How to find out more
• Web site “About” section
– http://www.hathitrust.org/about
• HathiTrust Research Center
– http://www.hathitrust.org/htrc
• Twitter
– http://twitter.com/hathitrust
• Monthly newsletter
– http://www.hathitrust.org/updates
– RSS: http://www.hathitrust.org/updates_rss
• Contact us: [email protected]
• Blogs: http://www.hathitrust.org/blogs
– Large-scale search
– Perspectives from HathiTrust
Thank you very much!