Mass Digitization at the University of California UC Davis February 23, 2010 Heather Christenson, California Digital Library.
Mass Digitization at UC
• Overview & history of UC mass digitization projects & partnerships
• What we have digitized and where you can find it
• HathiTrust & UC mass digitized collections: preservation and access
• Google settlement overview and implications

What is UC’s rank, in terms of how many books we have digitized?
#2 in the world!

How many digitized books do we have?
2.5 million+ digitized books

How many are fully viewable?
445,000+ public domain books

Where are the books coming from?
UC Mass Digitization Timeline
Tracks: UC Internet Archive digitization; UC Google digitization
• Oct 2005: OCA founding member
• Aug 2006: UC joins Google project
• Mar 2007 – Jul 2008: Microsoft digitization funding
• Oct 2008: HathiTrust founding member

Internet Archive / Open Content Alliance Projects
• ~200,000 books digitized to date
• Out-of-copyright works only
• Primarily English language, some Romance languages
• Foldouts are included
• Scanning was primarily done on-site at SRLF and NRLF
• Previously funded by Microsoft, Yahoo, Sloan Foundation, others
• Future funding uncertain; likely to require library and/or grant funding

Google Projects
• Over 2.3 million books digitized to date
• In-copyright and out-of-copyright works
• All languages
• Foldouts are skipped
• Scanning done at a Google off-site facility
• Funded almost entirely by Google

UC Books digitized by Google
• NRLF: all subjects, all languages
• UC Santa Cruz: Humanities / Social Sciences
• UC San Diego:
– East Asian collection
– International Relations & Pacific Studies
– Scripps Institution of Oceanography
• UCLA: Beginning East Asia library

UC Books digitized by Internet Archive
• Pre-1923 English language books from NRLF and SRLF
• Pre-1923 foreign language books at SRLF
• UC Davis: Selected California documents
• Selected Bancroft collections
• UCB cookbooks, mathematics
• UCLA children’s books, Italian comedies, rare business and economics texts

Why are we doing this?

Why UC Participates in Mass Digitization Partnerships
• Improve discovery
• Fulfill our public service mission
• Preserve and protect our collections
• Enhance student and faculty research
• Support collection management
• “A place at the table”
• Carpe diem!
Participant Roles
• UC Libraries
– Supply books and bibliographic metadata
• CDL
– Liaison with digitization partners
– Planning and coordination
– Stewardship and preservation of digital content
– Discovery & access services
– Funding
• Digitization Partners
– Funding
– Digitization: scanning, post-processing
– Host copies of digitized items

CDL’s role
• Maintain & develop primary relationships with partners
• Project management & coordination
• Guidance to campuses and facilitation of information sharing

CDL’s role: technical leadership & data management
• Coordinate transfer of bibliographic data
• Maintain central server for file pickup and delivery
• Create and maintain standards for file delivery from campuses
• Create and maintain standards for delivery of images & metadata from partners
• Responsible for download of files from digitization partners

CDL’s role: Stewardship of digital output
• Coordinate download of UC’s copies
• Specifications for data transformation and ingest
• Preservation services
• Discovery & access services
• Quality

Where are our books?
• Copies of approximately 1.1 million UC Google books are stored on HathiTrust servers at the University of Michigan and Indiana University, and an additional copy is backed up on tape. More are being added daily.
• Internet Archive books will go into HathiTrust in Q1 2010
• Internet Archive keeps copies of the books they digitize
• Google keeps copies of the books they digitize

Where can you find UC books?
• Next Generation Melvyl: http://melvyl.worldcat.org/
• Google Books: http://books.google.com/
• Internet Archive: http://www.archive.org/details/university_of_california_libraries
• Open Library: http://openlibrary.org/
• Biodiversity Heritage Library: http://www.biodiversitylibrary.org/
• HathiTrust: http://catalog.hathitrust.org/

Copyright Status is a factor
• Out of copyright, pre-1923
• “Orphan works,” 1923-1964
• 1965 - present

What is the HathiTrust?

The HathiTrust
• A shared digital repository for mass digitized content, founded in October 2008
• A partnership of major U.S. research libraries
• Operates at web scale
• Currently digitized as of February 17, 2010
– 5,432,325 volumes
– 1,901,313,750 pages
– 202 terabytes
– 64 miles
– 4,414 tons
– 802,861 volumes (~15% of total) in the public domain

Who else is involved?

HathiTrust Partners
• Lead partners
– University of Michigan
– University of California
– Indiana University
• Members
– Columbia University
– Indiana University
– Michigan State University
– Northwestern University
– The Ohio State University
– Penn State University
– Purdue University
– University of California Berkeley
– University of California Davis
– University of California Irvine
– University of California Los Angeles
– University of California Merced
– University of California Riverside
– University of California San Diego
– University of California San Francisco
– University of California Santa Barbara
– University of California Santa Cruz
– California Digital Library
– The University of Chicago
– University of Illinois
– University of Illinois at Chicago
– The University of Iowa
– University of Michigan
– University of Minnesota
– University of Wisconsin-Madison
– University of Virginia

UC involved at many levels
• Executive Committee
– Brian Schottlaender, UCSD
– Laine Farley, CDL
• Working Groups
– CDL & campus participants according to tasks
• Strategic Advisory Board
– Bernie Hurley, UCB
– R. Bruce Miller, UCM
– Patricia Cruse, CDL
• Operations & Development
– CDL, for now
– Potentially campuses?

Why is UC participating in the HathiTrust?
• Preservation and stewardship of UC resources
– Brings our Google and Internet Archive books together in a common preservation repository under UC control
• Economy of scale
– Storing mass digitized books is expensive – many terabytes of data
• Better access to our own books
– Create robust links to full text in HathiTrust in Next Generation Melvyl, including all viewable content from UC and other participating libraries
– Build improved access interfaces via the HathiTrust API
• Aggregate multiple library collections for greater research impact
– HathiTrust will support shared access and search mechanisms across all partner content to the extent possible
– With UC, over 5 million books and counting
– Over ¾ million books in the public domain
• Experiment with large scale search, text mining, and other specialized services developed with academic users in mind
– Google and Internet Archive are building services for the general user
– Research libraries will build services optimized for serious research

Over 1.1 million UC books stored in the HathiTrust

Is the Google Settlement a good thing for libraries?
• As David Weinberger, a fellow at Harvard’s Berkman Center for Internet & Society, has written:
– “The settlement is not what you would come up with if you began with a blank piece of paper and designed the optimal system for all the interested parties.”

Controversy over the Settlement
• Some are concerned that it will:
– Give Google a monopoly over book digitization and suppress competition
– Allow Google to charge high prices for subscriptions
– Create an artificial market for orphan works, give Google a monopoly over them, and prevent more open sharing of those works
• Orphan works = works still under copyright whose copyright owners cannot be identified or located

What’s the potential upside for libraries?
If approved, the Settlement will:
• Make millions of books in research library collections more accessible to users and the general public than ever before, including more accessible than they are now via Google Book Search
• Provide a significant corpus of material for advanced computational research
• Allow individual rights holders to convey broader use rights if they wish
• Potentially spur a more rational legislative solution for orphan works

An approach for libraries:
• Assist and encourage rights holders to release their books into the public sphere
• Press for orphan works legislation
• Advocate for robust privacy controls
• Neither we nor other libraries need to rush to purchase an institutional subscription

So, what’s next?

What’s next?
• Digitization continues!
• UC Google books continue to be ingested into the HathiTrust
• UC Internet Archive books will follow Q1 2010
• CDL is beginning to investigate discovery & access mechanisms in concert with Michigan and other HathiTrust partners
• HathiTrust will continue to work within boundaries of the law to make as many books viewable as possible
• The building of robust services & true collaboration will take time

For more information:
• CDL Mass Digitization: http://www.cdlib.org/services/collections/massdig/index.html
• UC Mass Digitization FAQ: http://www.cdlib.org/services/collections/massdig/faq.html
• UC & the Google Book Settlement: http://osc-s10.cdlib.org/google/
• HathiTrust: http://www.hathitrust.org/
• UC Libraries & the HathiTrust: http://www.cdlib.org/services/hathi/faq.html

Questions?

Thank you!
Heather Christenson
Mass Digitization Project Manager
California Digital Library
[email protected]