HATHI TRUST A Shared Digital Repository HathiTrust, Collections, and Collaboration COLD 2011 Spring Meeting Jeremy York May 20, 2011
Outline
• Overview
• Partnership
• Mission/Goals
• Collections
• Services
• Collaboration

Mission and Goals Overview

Current Partners
Arizona State University
Baylor University
California Digital Library
Columbia University
Cornell University
Dartmouth College
Duke University
Emory University
Harvard University Library
Indiana University
Johns Hopkins University
Library of Congress
Massachusetts Institute of Technology
Michigan State University
New York University
New York Public Library
North Carolina Central University
North Carolina State University
Northwestern University
The Ohio State University
The Pennsylvania State University
Princeton University
Purdue University
Stanford University
Texas A&M University
Universidad Complutense de Madrid
University of California
  Berkeley
  Davis
  Irvine
  Los Angeles
  Merced
  Riverside
  San Diego
  San Francisco
  Santa Barbara
  Santa Cruz
The University of Chicago
University of Illinois
University of Illinois at Chicago
The University of Iowa
University of Maryland
University of Michigan
University of Minnesota
The University of North Carolina at Chapel Hill
University of Pennsylvania
University of Pittsburgh
University of Utah
University of Virginia
University of Washington
University of Wisconsin-Madison
Utah State University
Yale University Library

Mission
• To contribute to the common good by collecting, organizing, preserving, communicating, and sharing the record of human knowledge

Collections and Collaboration
• Comprehensive collection
• Preservation…with Access
• Shared strategies
  – Collection management, development
  – Copyright
  – Preservation
  – Efficient user services
• Public Good

Collections

What is in HathiTrust?
• 8,725,092 Total volumes
• 2,367,111 Public Domain
• 4,774,782 Book titles
• 211,688 Serial titles
* As of May 20, 2011

Content Sources
* As of May 1, 2011

Content Distribution
* As of May 1, 2011

Dates
* As of May 1, 2011

Breakdown of HathiTrust book corpus by publication date
Bibliographic Indeterminacy and the Scale of Problems and Opportunities of "Rights" in Digital Collection Building – 2/2011

Breakdown of HathiTrust book corpus by publication date

Language Distribution (1)
The top 10 languages make up ~86% of all content
* As of May 1, 2011

Language Distribution (2)
The next 40 languages make up ~13% of total
* As of May 1, 2011

Content over time
[Chart: share of content (0% to 100%) by contributing institution. Legend: Chicago, Madrid, Columbia, LoC, Harvard, Minnesota, Indiana, Princeton, NYPL, Cornell, Wisconsin, California, Michigan]
* As of May 1, 2011

Content Growth

Services: Preservation, Access

Services (1)
• Ingest
  – Book and Journal content
    • Google
    • Internet Archive
    • In-house, other vendor digitization
  – Images, Audio, Born digital (coming soon…)
• Two parts
  – Bibliographic Data
  – Content

Services (2)
• Long-term preservation
  – Bit-level, migration
  – Standard and open formats (ITU G4 TIFF, JPEG2000, JPG, Unicode)
  – Validation, integrity, redundancy
  – OAIS
• How reliable is it?
  – DRAMBORA, TRAC

Technology - OAIS
[Architecture diagram. Component labels: MARC record extensions (Aleph); Rights DB; GROOVE (JHOVE); Page Turner; HathiTrust API; OAI; GeoIP DB; CNRI Handles; Solr; Google; Internet Archive; In-house Conversion; GRIN; Internal Data Loading; METS/PREMIS object; TIFF G4/JPEG2000; OCR; MD5 checksums; Isilon Site Replication; TSM; MD5 checksum validation; METS object; PNG; OCR; PDF]

Quality
• Partner Digitization
• Google Digitization
• Quality work / Volume certification
• [email protected]

Services (3)
• Preservation…with Access
  – As part of preservation, service to partners, and as public good
  – Discovery
    • Bibliographic (temporary catalog, OCLC/HathiTrust catalog)
    • Full-text
  – Reading
    • Interface optimized for users with print disabilities
  – Collections

[Screenshot with annotations:]
• Skip navigation link
• Info about SSD service & link to accessibility page
• Descriptive headings added (hidden from GUI with CSS)
• Added labels & descriptive titles to forms & ToC table
• Access keys for navigating pages with keyboard
• Images used for style are in CSS so no need to use alt tags

Access Matrix
[Table flattened in extraction; cell alignment is not recoverable]
Columns: Type of work; Search – Bib and Full text; View; Full-PDF download; Print on Demand; Print disabilities; Section 108 (preservation uses)
Rows: Public domain worldwide; Public domain in the US; Open Access (+Creative Commons); In copyright (and undetermined)
Cell values as extracted: World World World World US World if no restrictions, Partners if restrictions US if no restrictions, US partners if restrictions World if no restrictions World World World US Partners N/A worldwide US Partners World with Partners permission worldwide if no restrictions Not Not available Not Partners available available US and worldwide, where applicable N/A N/A Partners US and worldwide, where applicable

Services (4)
• Rights Management
  – Rights Database
  – Copyright review
    • IMLS grant awarded to University of Michigan in 2008 to determine copyright status of books published in the US between 1923 and 1963
    • 18 staff members, 4 institutions
      – Indiana University
      – University of Michigan
      – University of Minnesota
      – University of Wisconsin
    • 125k reviewed through CRMS
    • 67,000 (54%) in public domain

Services (5)
• Data Availability
  – Tab-delimited inventory files
  – Bibliographic API
  – Data API
  – OAI feed of public domain
  – SFX target
  – Summon

Some Examples of Use
• Catalogs
  – UM loaded every record
  – Chicago links to public domain volumes owned in print
  – TROVE harvesting through OAI
  – OCLC loads records into OCLC
• Link Resolvers
  – UC created SFX target
• Vendors
  – H.W. Wilson database links to public domain volumes
  – ProQuest full-text index via Summon

Services (6)
• Collaborative Development Environment
  – Active repository development
• Support for Computational Research
  – Datasets
    • 120,000-volume set
    • Google-digitized public domain
  – Protocol-based access
  – Research Center

How Different from Google?
• Preservation
• Content
• Collective work
• Uses of materials
• Own trajectory
• Partnership
  – Not just about digital content or repository
  – Address challenges
  – Fulfill mission
  – Provide services for our communities

Collaboration: Print Storage

A global change in the library environment
Academic print book collection already substantially duplicated in mass digitized book corpus
[Chart: % of Titles in Local Collection (0% to 60%) against Rank in 2008 ARL Investment Index (0 to 120). Annotations: June 2009 Median duplication: 19%; June 2010 Median duplication: 31%]

Continuing growth of overlap …
• ARL overlap
  – 31% in June 2010
  – 33% in Dec (adjustment: adding little-held works)
  – ~1% per 225,000 vols
  – 38% in May, 2011; 45% by December, 2011
• Oberlin Group overlap
  – 41% in December, 2010
  – Higher rate of overlap per added volume?
  – Close to 50% in May, 2011

Digitized Books in Shared Repositories
~75% of mass digitized corpus is ‘backed up’ in one or more shared print repositories
[Chart: title counts (0 to 3,500,000) by month, Sep-09 to Jun-10. Series: Mass digitized books in Hathi digital repository; Mass digitized books in shared print repositories. Annotations: ~3.5M titles; ~2.5M Unique Titles]

Cost Model
• Based on overlap with print collections
  – Public Domain / In-copyright
• Print Holdings Database
  – Costs
  – Lawful uses of materials
  – Complete picture
  – Volumes institutions own or have owned
    • OCLC number; Bib record ID; Condition; Holding Status

Collaboration: Copyright

Copyright status of books published pre-1923 and US works published 1923-1963

Public domain, in-copyright, and orphan works, pre-1923 and 1923-1963

Breakdown by US/non-US and rights status, pre-1923, 1923-1963 and 1964-1977

Breakdown by US/non-US and rights status for all periods

Collaboration: Preservation

Technology - OAIS
[Architecture diagram. Component labels: MARC record extensions (Aleph); Rights DB; GROOVE (JHOVE); Page Turner; HathiTrust API; OAI; GeoIP DB; CNRI Handles; Solr; Google; Internet Archive; In-house Conversion; GRIN; Internal Data Loading; METS/PREMIS object; TIFF G4/JPEG2000; OCR; MD5 checksums; Isilon Site Replication; TSM; MD5 checksum validation; Technology; METS object; PNG; OCR; PDF]

How to find out more
• Web site “About” section: http://www.hathitrust.org/about
• Twitter: http://twitter.com/hathitrust
• RSS: http://www.hathitrust.org/updates_rss
• Monthly newsletter: http://www.hathitrust.org/updates
• Contact us: [email protected]
• Soon: Facebook, blog

Thank you!