Transcript Slide 1

HATHI TRUST
A Shared Digital Repository
HathiTrust
How We Can Make A Difference
Jeremy York
Yale University
November 3, 2010
Current Partners
• Committee on Institutional Cooperation (CIC)
• Columbia University
• Cornell University
• Dartmouth College
• New York Public Library
• Yale University Library
• Princeton University Library
• Triangle Research Libraries Network
• University of California system
• University of Virginia
• Utah State University
Mission
• To contribute to the common good by collecting,
organizing, preserving, communicating, and
sharing the record of human knowledge
HathiTrust
Universal Library
Common Goal
Single Entity, Many Partners
Goals
• Comprehensive collection
• Preservation…with Access
• Shared strategies
– Collection management, development
– Preservation
– Copyright
– Efficient user services
• Openness
Outline
• Content
• Services
• Governance
• How work is done
• What work there is
• How we can make a difference
Content Distribution
7,130,606 – Total volumes
1,678,161 – Public Domain
4,071,294 Book titles
170,535 Serial titles
* As of November 3, 2010
Language Distribution (1)
* As of November 2, 2010
Language Distribution (2)
The next 40
languages make
up ~13% of total
* As of November 2, 2010
Dates
* As of November 2, 2010
Content Growth
A global change in the library environment
Academic print book collection already substantially
duplicated in mass digitized book corpus
[Chart: % of Titles in Local Collection (0% to 60%) vs. Rank in 2008 ARL Investment Index (0 to 120)]
June 2009: Median duplication: 19%
June 2010: Median duplication: 31%
Digitized Books in Shared Repositories
~75% of mass digitized corpus is ‘backed up’ in one
or more shared print repositories
[Chart: titles (0 to 3,500,000) by month, Sep-09 to Jun-10. Series: Mass digitized books in Hathi digital repository; Mass digitized books in shared print repositories. Annotations: ~3.5M titles; ~2.5M Unique Titles]
Services (1)
• Ingest
– Google, Internet Archive, Local
– Working toward sustainable model for ingest of
content from diverse sources
• Long-term preservation
– Bit-level, migration
– Standard and open formats (ITU G4 TIFF, JPEG2000,
JPEG, Unicode)
– OAIS, TRAC
– Validation, integrity, redundancy
Repository Philosophy/Design
• Consistency
• Standardization
• Simplicity (in design, not function)
• Practicality
• Sustainability
Services (2)
• Rights Management
– Automatic review
– Manual review (Michigan, Indiana, Minnesota,
Wisconsin)
• Since 2007
• IMLS in 2008
• 20 staff in all
• US 1923-1963
• 462,497 US publications from 1923-1963
• 96,000 reviewed, 175,000 remaining candidates
• 52,000 in public domain
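The review figures above imply a public domain rate that can be worked out directly. The following is a minimal sketch, assuming the 52,000 public domain volumes are counted among the 96,000 reviewed, and assuming (purely for illustration) that the 175,000 remaining candidates would turn out to be public domain at the same rate. Neither assumption is stated on the slide, and the function names are hypothetical.

```python
from fractions import Fraction

# Figures quoted on the slide.
REVIEWED = 96_000        # volumes reviewed so far
PUBLIC_DOMAIN = 52_000   # assumed to be a subset of the reviewed volumes
REMAINING = 175_000      # remaining candidates


def public_domain_rate(found: int, reviewed: int) -> Fraction:
    """Share of reviewed volumes determined to be in the public domain."""
    if reviewed <= 0:
        raise ValueError("reviewed must be positive")
    return Fraction(found, reviewed)


def project_public_domain(remaining: int, rate: Fraction) -> int:
    """Project public domain volumes among the remaining candidates.

    Assumes the observed rate carries over unchanged, which is an
    illustrative assumption rather than a claim from the presentation.
    """
    return round(remaining * rate)


rate = public_domain_rate(PUBLIC_DOMAIN, REVIEWED)
projected = project_public_domain(REMAINING, rate)
print(f"Observed public domain rate: {float(rate):.1%}")
print(f"Projected public domain volumes among remaining candidates: {projected:,}")
```

Exact fractions are used so the rate is not distorted by floating-point rounding before the projection is rounded to whole volumes.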
Services (3)
• Preservation…with Access
• Brings concerns of research libraries to bear on the
way the scholarly record is cared for and made
available
– Bibliographic Search
– Full-text search
– Collections
– Full-PDF download of public domain
– Scholarly Resource
Services (4)
• Data Distribution
– Metadata files, Bib API, Data API, OAI
• Print on Demand
• Collaborative Development Environment
• Coming soon…
– Non-Book/Non-Journal Ingest
– Computational Research
Computational Research
• Data distribution
• Protocol-based access
• Research Center
Quality
• Partner Digitization
• Google Digitization
• Volume Certification
Outlook
• Leverage partner resources and input to
create and maintain the library of the future
• This is our library
• The more we use it, the better it will become
Governance
Budget/Finances
Decision-making
Strategic Advisory Board
Executive Committee
HathiTrust
Guidance on Policy, Planning
Executive Committee
• Paul Courant, University Librarian and Dean of Libraries, UM
• Laine Farley, Executive Director, CDL
• John King, Vice Provost for Academic Information, UM
• Paula Kaufman, University Librarian and Dean of Libraries, UI
• Brian Schottlaender, University Librarian, UCSD
• Ed Van Gemert, Deputy Director of Libraries, UW – Madison
(ex officio)
• Brenda Johnson, Dean of Libraries, IU
• Brad Wheeler, Chief Information Officer, IU
• John Wilkin, Executive Director of HathiTrust and
Associate University Librarian, LIT, UM
Strategic Advisory Board
• Ed Van Gemert (Chair), Deputy Director of Libraries, UW Madison
• John Butler, Associate University Librarian for Information
Technology, U Minn
• Patricia Cruse, Director, Preservation, CDL
• Bernie Hurley, Director, Library Technologies, UC Berkeley
• R. Bruce Miller, University Librarian, UC - Merced
• Sarah Pritchard, University Librarian, Northwestern
• Paul Soderdahl, Director, LIT, U Iowa
• John Wilkin, Executive Director, HathiTrust (ex officio)
• Robert Wolven, Columbia University
How does work get done?
• Collective work
– e.g., working groups
• Distributed work
– Projects, e.g. grant work, ingest specifications,
page turner, bibliographic data management
Working Groups (1)
• Operational focus
– Appointed by Executive Director in coordination
with Executive Committee
• Usability
• Communications
• Development Environment
• Storage
• Research Center
Working Groups (2)
• Planning or Exploratory focus
– Appointed by Strategic Advisory Board
– Recommendations reviewed by SAB and XCom;
may call for subsequent implementation
• Collections Committee
• Surrogates
• Quality, Ingest, and Error rate
• Discovery
HathiTrust Functional Framework
Governance
Budget, Finances
Decision-making
Policy
Enterprise Management
Repository Administration
Repository Administration
Communication and Coordination with partner institutions
Hardware configuration and maintenance
Data management (content storage, backup, integrity checks, deletion)
Project management
Planning
Web and application server configuration and maintenance
Security
Hardware selection and replacement
Content and Metadata specifications
Permissions
Rights Management
Bibliographic Data Management
Copyright determination
Entity description (record-level)
Copyright review
Object identification (item-level)
Copyright information management (database)
Data availability
Collection Development
Digital
• Expansion beyond books and journals (born-digital, images and maps, audio)
• Selection of content (for non-Google volume ingest and pilot projects)
Print
• Cloud Library (effect of digital on print)
Rightsholder permissions
Disaster Recovery
Logging
Processes for ensuring content integrity
e-Commerce
Print on Demand
Content Ingest
Content Access
Quality Assurance
User Services
Transformation
PageTurner
Quality Review
Usability
Validation
Collection Builder
Content Certification
User support (helpdesk)
Large-scale Search
Financial contributions of partners
Research Center
Bibliographic Catalog
APIs
Outreach
Project website
Monthly newsletter
Papers and presentations
Communication with potential partners
Surveys, general inquiries
Repository evaluation and audit (e.g., DRAMBORA, TRAC)
Legal
Risk management (use of materials)
Partner agreements
Advocacy
What work is there?
• Usage Reporting
• Quality
• Copyright Review
• Specifications
• Metadata
• Development Environment
• Other?
Cost Model 1
Reasonable costs of sustaining the archive, includes cost of
replacement, capital fund
Cost Model 1
• Economies of scale keep costs low
– $0.145/volume/year for Google-digitized
– about $0.45/volume/year for IA-digitized
• Advantages not fully known until you jump in
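To make the per-volume rates above concrete, here is a small sketch of the annual cost arithmetic under Cost Model 1. The rates are the ones quoted on the slide (the IA rate is given only as "about" $0.45); the function name and the volume counts in the example are hypothetical and not taken from the presentation.

```python
from decimal import Decimal

# Per-volume annual rates quoted on the slide (US dollars).
RATE_GOOGLE = Decimal("0.145")  # Google-digitized volumes
RATE_IA = Decimal("0.45")       # IA-digitized volumes (approximate per the slide)


def annual_cost(google_volumes: int, ia_volumes: int) -> Decimal:
    """Annual cost for a partner's deposited volumes at the quoted rates."""
    if google_volumes < 0 or ia_volumes < 0:
        raise ValueError("volume counts must be non-negative")
    return google_volumes * RATE_GOOGLE + ia_volumes * RATE_IA


# Hypothetical partner: 100,000 Google-digitized and 10,000 IA-digitized volumes.
cost = annual_cost(100_000, 10_000)
print(f"Estimated annual cost: ${cost:,.2f}")
```

Decimal arithmetic is used so that currency amounts are not affected by binary floating-point rounding.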
Cost Model 2
For public domain volumes:
(PD*X*C)/N
For a given in copyright volume:
IC=(C*X)/H
• Share in costs of curation
• Share in uses of relevant materials
• Voice in future directions
• Free riders?
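The two Cost Model 2 formulas can be sketched in code. The slide does not define its symbols, so the readings here are assumptions: PD is the number of public domain volumes, C is a cost per volume, X is a multiplier whose meaning is not given, N is the number of partners, and H is the number of partners holding a given in-copyright volume. Treating IC as the amount charged to each holding partner is also an assumption, and all function names and example numbers are hypothetical.

```python
from typing import Iterable


def public_domain_share(pd: float, x: float, c: float, n: int) -> float:
    """(PD*X*C)/N: public domain cost split evenly across N partners (assumed reading)."""
    if n <= 0:
        raise ValueError("n must be positive")
    return (pd * x * c) / n


def in_copyright_cost(c: float, x: float, h: int) -> float:
    """IC=(C*X)/H: cost of one in-copyright volume split across its H holders (assumed reading)."""
    if h <= 0:
        raise ValueError("h must be positive")
    return (c * x) / h


def partner_total(pd: float, x: float, c: float, n: int,
                  holders_per_volume: Iterable[int]) -> float:
    """Public domain share plus the IC cost of each in-copyright volume the partner holds.

    holders_per_volume lists, for each such volume, how many partners hold it.
    """
    ic_total = sum(in_copyright_cost(c, x, h) for h in holders_per_volume)
    return public_domain_share(pd, x, c, n) + ic_total


# Hypothetical inputs: 1,000,000 public domain volumes, X = 1.0, C = $0.25,
# 50 partners, and three in-copyright volumes held by 1, 5, and 5 partners.
total = partner_total(1_000_000, 1.0, 0.25, 50, [1, 5, 5])
print(f"Estimated partner total: ${total:,.2f}")
```

Under these readings, a widely held in-copyright volume costs each holder less than a uniquely held one, while public domain costs are shared by all partners regardless of holdings.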
Cost Model 2
• Sustaining common resource
• Costs go down
• Quality of services increases
– Realize in an aggregated collection something we don’t
get through distributed search or federation
Cost Model 2: Timeline &
Requirements
• Timeline:
– Implement in 2013
– Accept new partners now with costs based on
overlap calculations
• Requirements:
– Print holdings database
– Update mechanisms
– Manual remediation
Print Holdings Database
• Print holdings database will also benefit
– De-duplication
• Compromises user experience, obscures collection
development needs
– Management of print volumes
• Information to withdraw volumes (journals)
– Legal uses of copyright materials
• Section 108, 121, ADA uses will depend on knowledge of
which institutions own(ed) which materials
Future Directions (1)
• Locally-digitized partner content
• Usage reporting
• Coordinate digital and print resources
(holdings database)
• Computational Research
• Quality
• Strategies for openness
• Collaborative Development
• Extending Services through Shibboleth
• Non-book, non-journal content
Future Directions (2)
• Born-digital content
• New Bibliographic Management
• Compliance with TRAC
• Grant projects
• OCLC Catalog
• 3-year review
• Improvements to Large-scale Search
• Improvements to PageTurner
• Ingest Reporting
How can we make a difference?
• Digital Curation
– Drive costs down
– Reduce bibliographic indeterminacy
– Make meaningful decisions about formats and quality
– Increase discoverability
– Consolidate development talent
– Improve strength of archiving
• Print Curation
– Means to associate our print holdings
– Coordinated record-keeping
• Subsidiary benefits
– Improve description
– Quantify problems
– Collective attention to solving shared problems
Thank you!