HATHI TRUST
A Shared Digital Repository
Digital Preservation, HathiTrust,
and the Reimagination of the
Library Landscape
Jeremy York
Iceland
August 5, 2010
Outline
• Digital Preservation in U.S.
• HathiTrust
– About HathiTrust
– Content
– What we do (services)
– Governance
– Partnership & Resources
• Google Settlement
• Publishing
• Changing Library Landscape
Books and Journals
Archives
Data
Portico
• Centralized
• Journals
• Source files, mainly focused
on XML, highly controlled
transformation
Internet Archive
• Centralized
• Web files
ICPSR
• Centralized
• Social science data
LOCKSS
• Distributed
• Journals
• Web files, not source images
or XML
MetaArchive (NDIIPP)
• Distributed
• Private LOCKSS Network
• Web files
DATA-PASS (NDIIPP)
• Distributed
• Social science data
HathiTrust
• Centralized
• Books and Journals
• Master image and OCR files
International Internet
Preservation Consortium
• Distributed
• Harvesting tools, Access,
Preservation strategies
GeoMAPP (NDIIPP)
• Distributed
• Geospatial data
• State governments
OCLC – Digital Archive
• Centralized
• Master files, web archiving
• CONTENTdm, custom
repository
LOCKSS, DuraCloud, DSpace, Fedora
NDIIPP
Mission: Develop a national strategy to collect, preserve and make available
significant digital content, especially information that is created in
digital form only, for current and future generations.
• Since 2000
• Broad collaborations with institutions and organizations (e.g., OCLC, Portico)
• Funding (Establishing a network, Preserving Creative America, Preserving State
Government Information)
• Standards/Best Practices
• Tools
o JHOVE2 (validation)
o Chronopolis (data grid framework)
o Dataverse (management, dissemination, exchange, and citation of virtual
collections (dataverses) of quantitative data)
o BagIt (transfer utilities - creation, manipulation and validation of bags)
o Hub and Spoke (repository interoperability)
o FITS (bundle of identification, validation and metadata extraction tools)
About
HathiTrust Digital Library
• Digital Repository
– Initial focus on digitized book and journal content
– “Light” archive
• Collections and Collaboration
– Comprehensive collection
– Shared strategies
– Local services
– Public Good
Current Partners
– Columbia University
– New York Public Library
– University of California system
– CIC (Committee on Institutional Cooperation)
• University of Chicago
• University of Illinois
• Indiana University
• University of Iowa
• University of Michigan
• Michigan State University
• University of Minnesota
• Northwestern University
• Ohio State University
• Pennsylvania State University
• Purdue University
• University of Wisconsin-Madison
– University of Virginia
– Yale University
Content Distribution
6,383,209 – Total
1,234,088 – Public Domain
* As of August 5, 2010
Language Distribution (1)
* As of July 25, 2010
Language Distribution (2)
The next 40
languages make
up ~13% of total
* As of July 25, 2010
Dates
* As of July 25, 2010
Originating Institution
* As of July 25, 2010
Content over time
* As of July 25, 2010
Content Growth
What we do
Services (1)
• Ingest
– Google, Internet Archive
– Working toward sustainable model for ingest of
content from diverse sources
• Long-term preservation
– Bit-level, migration
– Standard and open formats (ITU G4 TIFF,
JPEG2000, JPG, Unicode)
– OAIS, TRAC
– Validation, integrity, redundancy
Services (2)
• Preservation…with Access
• Brings concerns of research libraries to bear on the
way the scholarly record is cared for and made
available
– Scholarly Resource
– Bibliographic Search
– Full-text search
– Collections
– Full-PDF download of public domain
Services (4)
• Rights Management
– Rights Database
– Copyright review
• US 1923-1963
• 188k candidates, 85k reviewed
• 60% in public domain
• Data Distribution
– Metadata files, Bib API, Data API
• Print on Demand
Services (5)
• Community Development Environment
• Non-Google Ingest
• Non-Book/Non-Journal Ingest
• Computational Research
Outlook
• Leverage partner resources and input to
create and maintain the library of the future
• This is our library
• The more we use it, the better it will become
Governance
Governance
Budget/Finances
Decision-making
Strategic Advisory Board
Executive Committee
HathiTrust
Guidance on Policy, Planning
Partnership & Resources
Funding
• Funded for an initial 5 years with base funding from partners
• 3-year review of governance and sustainability
• Budget – separately held within UMich budget system
• Cost Models
– Per GB cost of storage per year with a one-time fee on new
content to build a capital fund
– Volume overlap
Cost Model 1
Reasonable costs of sustaining the archive, includes cost of
replacement, capital fund
Cost Model 1
• Economies of scale keep costs low
– $0.145/volume/year for Google-digitized
– about $0.45/volume/year for IA-digitized
• Advantages not fully known until you jump in
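As a rough illustration of the per-volume rates above, the following Python sketch estimates an annual cost from volume counts. Only the two rates come from the slide (and the IA rate is approximate); the function name and the example counts are hypothetical.

```python
# Per-volume annual rates quoted on the slide (USD per volume per year).
GOOGLE_RATE = 0.145  # Google-digitized volumes
IA_RATE = 0.45       # Internet Archive-digitized volumes (approximate)


def annual_cost(google_volumes: int, ia_volumes: int) -> float:
    """Estimate the yearly cost for a collection (hypothetical helper)."""
    if google_volumes < 0 or ia_volumes < 0:
        raise ValueError("volume counts must be non-negative")
    return google_volumes * GOOGLE_RATE + ia_volumes * IA_RATE


if __name__ == "__main__":
    # Hypothetical collection: 100,000 Google-digitized and 2,000 IA-digitized volumes.
    print(f"${annual_cost(100_000, 2_000):,.2f} per year")
```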
Cost Model 2
• Shared space to deal with shared problems
– Use HathiTrust as part of broader library strategies
• Beginning to see benefits of aggregating this
body of materials together
– Overlap, collection development
– Coordinated print management
– Begin to ask “What is missing”?
Cost Model 2
For public domain volumes:
(PD*X*C)/N
For a given in-copyright volume:
IC=(C*X)/H
• Share in costs of curation
• Share in uses of relevant materials
• Voice in future directions
• Free riders?
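The two formulas on this slide can be sketched in Python. The slide does not define its symbols, so the reading here is an assumption: PD is the number of public domain volumes, C the average cost per volume, X an overhead multiplier, N the number of partners, and H the number of partners holding a given in-copyright volume. All function names and example values are hypothetical.

```python
def public_domain_share(pd_volumes: int, cost_per_volume: float,
                        overhead: float, partners: int) -> float:
    """Per-partner share of public domain costs: (PD * X * C) / N."""
    if partners <= 0:
        raise ValueError("partners must be positive")
    return (pd_volumes * overhead * cost_per_volume) / partners


def in_copyright_cost(cost_per_volume: float, overhead: float,
                      holders: int) -> float:
    """Per-holder cost of one in-copyright volume: IC = (C * X) / H."""
    if holders <= 0:
        raise ValueError("holders must be positive")
    return (cost_per_volume * overhead) / holders


if __name__ == "__main__":
    # Hypothetical inputs, not figures from the presentation.
    print(public_domain_share(1_000_000, 0.10, 1.5, 25))
    print(in_copyright_cost(0.10, 1.5, 5))
```

Under this reading, public domain costs are split evenly across all partners, while the cost of an in-copyright volume is split only among the partners that hold it, so widely held volumes cost each holder less.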
Staff
• Staff/Expertise – highly integrated
– Project managers, IT and communications
staff, copyright experts, administrators (UM,
Indiana and UC taking the lead)
• Working groups
• Shared development space
HathiTrust Functional Framework
[Diagram; box labels as recovered from the slide]
Governance
Budget, Finances
Decision-making
Policy
Enterprise Management
Repository Administration
Communication and Coordination with partner institutions
Hardware configuration and maintenance
Data management (content storage, backup, integrity checks, deletion)
Project management
Planning
Web and application server configuration and maintenance
Security
Hardware selection and replacement
Content and Metadata specifications
Permissions
Rights Management
Bibliographic Data Management
Copyright determination
Entity description (record-level)
Copyright review
Object identification (item-level)
Copyright information management (database)
Data availability
Collection Development
Digital
• Expansion beyond books and journals (born-digital, images and maps, audio)
• Selection of content (for non-Google volume ingest and pilot projects)
Print
• Cloud Library (effect of digital on print)
Rightsholder permissions
Disaster Recovery
Logging
Processes for ensuring content integrity
e-Commerce
Print on Demand
Content Ingest
Content Access
Quality Assurance
User Services
Transformation
PageTurner
Quality Review
Usability
Validation
Collection Builder
Content Certification
User support (helpdesk)
Large-scale Search
Financial contributions of partners
Research Center
Bibliographic Catalog
APIs
Outreach
Project website
Monthly newsletter
Papers and presentations
Communication with potential partners
Surveys, general inquiries
Repository evaluation and audit (e.g., DRAMBORA, TRAC)
Legal
Risk management (use of materials)
Partner agreements
Advocacy
Working Groups
Current
• Quality
• Discovery Interface (with OCLC)
• Collections
• Communication
• Usability
Past
• Storage
• Research Center
Google Settlement (1)
• 2005, Authors Guild, AAP sued
• Google claimed fair use
• Settlement – 2008
• Amended – Nov 2009
• Works covered
– registered with U.S. Copyright Office, or published in Canada, UK, Australia
• Works not covered
– public domain, published after 5 Jan 2009
Google Settlement (2)
• Google continues scanning
• In copyright, non-commercially available out-of-print work
– Sell individual access, any book retailer - 63% of revenue to rights
holders, distributed by BRR
– display up to 20%
– Copy & paste and printing
– Rights holders can open access, distribute under CC, set printing limits
– Institutional subscription (available to libraries, fee based on FTE
users)
• Includes unclaimed works
– BRR required to search for rights holders and hold revenue on their
behalf
• Public access terminals
• Cash payments to Rightsholders whose works were scanned
before May 5, 2009
Book Rights Registry
• Book Rights Registry
– Represent the interests of the Rightsholders – equal
representation of Author and Publisher sub-classes on board;
one author and publisher representative from US, UK, Canada,
Australia; court-appointed representative for rights holders of
unclaimed works
– Establish and maintain a database of contact information for
authors and publishers;
– Use commercially reasonable efforts to locate Rightsholders;
– Distribute payments received from Google for the
Rightsholders’ share of revenues; and
– Assist in the resolution of disputes between Rightsholders.
– Funded by Google (initial $34.5 million, ongoing percentage of revenues)
http://www.googlebooksettlement.com/help/bin/answer.py?hl=en&answer=118704
Settlement for HathiTrust
• Complementary
– Settlement provides access to covered works,
HathiTrust is preservation, trust for the future
– Research Center (75% of Google Book Search scanned
from HathiTrust partner libraries)
• Specifically sanctions
– Section 108 uses, access for users with print
disabilities, computational research
• Does not allow
– Fair use, sale of access, interlibrary loan, e-reserves,
use in course management systems
Publishing
• Libraries would like to buy more eBooks
• Cost is high
• Not good models for consortia (multiple users)
• Move to on-demand purchase, leasing of volumes
• Do we need to own it?
Changing Library Landscape
• Leverage collective resources, expertise
– Drive costs down
– Increase discoverability, use
– Improve strength of archiving
– Reduce redundancy of collections (digital and
print), effort
– Address collective challenges
• Focus on local resources and services
• Redefine who we are, what we provide
– Collections, research
Thank you!