Mass Digitization at the University of California UC Davis February 23, 2010 Heather Christenson, California Digital Library.

Mass Digitization at the
University of California
UC Davis
February 23, 2010
Heather Christenson, California Digital Library
Mass Digitization at UC
• Overview & history of UC mass digitization
projects & partnerships
• What we have digitized and where you can
find it
• HathiTrust & UC mass digitized collections:
preservation and access
• Google settlement overview and implications
What is UC’s rank, in terms of
how many books we have
digitized?
#2
In the
world!
How many digitized books do
we have?
2.5 Million+
digitized books
How many are fully viewable?
445,000+
public domain books
Where are the books coming
from?
UC Mass Digitization Timeline
(Tracks shown: UC Internet Archive digitization, UC Google digitization)
• Oct 2005: OCA founding member
• Aug 2006: UC joins Google project
• Mar 2007 – Jul 2008: Microsoft digitization funding
• Oct 2008: HathiTrust founding member
Internet Archive / Open Content Alliance Projects
• ~200,000 books digitized to date
• Out-of-copyright works only
• Primarily English language, some Romance languages
• Foldouts are included
• Scanning was primarily done on-site at SRLF and NRLF
• Previously funded by Microsoft, Yahoo, Sloan Foundation, others
• Future funding uncertain - likely to require library and/or grant funding
Google Projects
• Over 2.3 million books digitized to date
• In-copyright and out-of-copyright works
• All languages
• Foldouts are skipped
• Scanning done at a Google off-site facility
• Funded almost entirely by Google
UC Books digitized by Google
• NRLF:
– All subjects, all languages
• UC Santa Cruz:
– Humanities / Social Sciences
• UC San Diego:
– East Asian collection
– International Relations & Pacific Studies
– Scripps Institution of Oceanography
• UCLA:
– Beginning with the East Asia library
UC Books digitized by Internet
Archive
• Pre-1923 English language books from NRLF and
SRLF
• Pre-1923 foreign language books at SRLF
• UC Davis: Selected California documents
• Selected Bancroft collections
• UCB cookbooks, mathematics
• UCLA children’s books, Italian comedies, rare
business and economics texts
Why are we doing this?
Why UC Participates in Mass
Digitization Partnerships
• Improve discovery
• Fulfill our public service mission
• Preserve and protect our collections
• Enhance student and faculty research
• Support collection management
• “A place at the table”
• Carpe diem!
Participant Roles
• UC Libraries
– Supply books and bibliographic metadata
• CDL
– Liaison with digitization partners
– Planning and coordination
– Stewardship and preservation of digital content
– Discovery & access services
– Funding
• Digitization Partners
– Funding
– Digitization: scanning, post-processing
– Host copies of digitized items
CDL’s role
• Maintain & develop
primary
relationships with
partners
• Project
management &
coordination
• Guidance to
campuses and
facilitation of
information sharing
CDL’s role: technical leadership &
data management
• Coordinate transfer of bibliographic data
• Maintain central server for file pickup and
delivery
• Create and maintain standards for file
delivery from campuses
• Create and maintain standards for delivery of
images & metadata from partners
• Responsible for download of files from
digitization partners
CDL’s role: Stewardship of
digital output
• Coordinate download of
UC’s copies
• Specifications for data
transformation and
ingest
• Preservation services
• Discovery & access
services
• Quality
Where are our books?
• Copies of approximately 1.1 million UC Google books are stored on HathiTrust servers at the University of Michigan and Indiana University, and an additional copy is backed up on tape. More are being added daily.
• Internet Archive books will go into HathiTrust in Q1 2010
• Internet Archive keeps copies of the books they digitize
• Google keeps copies of the books they digitize
Where can you find UC books?
• Next Generation Melvyl: http://melvyl.worldcat.org/
• Google Books: http://books.google.com/
• Internet Archive:
http://www.archive.org/details/university_of_california_libraries
• Open Library: http://openlibrary.org/
• Biodiversity Heritage Library: http://www.biodiversitylibrary.org/
• HathiTrust: http://catalog.hathitrust.org/
Copyright Status is a factor
• Out of copyright,
pre-1923
• “orphan works,”
1923-1964
• 1965 - present
What is the HathiTrust?
The HathiTrust
• A shared digital repository for mass digitized
content, founded in October 2008
• A partnership of major U.S. research libraries
• Operates at web scale
• Currently digitized as of February 17, 2010
– 5,432,325 volumes
– 1,901,313,750 pages
– 202 terabytes
– 64 miles
– 4,414 tons
– 802,861 volumes (~15% of total) in the public domain
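The figures on this slide can be cross-checked with a little arithmetic. The sketch below is illustrative only: the function names are made up for this example, and the megabytes-per-volume figure assumes decimal units (1 TB = 1,000,000 MB), which the slide does not specify. The page total works out to a round 350 pages per volume, which suggests the page count may itself be an estimate.

```python
# Figures copied from the HathiTrust slide (as of February 17, 2010).
VOLUMES = 5_432_325
PAGES = 1_901_313_750
TERABYTES = 202
PUBLIC_DOMAIN_VOLUMES = 802_861


def public_domain_share(public_domain: int, total: int) -> float:
    """Return the public-domain portion of the collection as a percentage."""
    return 100 * public_domain / total


def pages_per_volume(pages: int, volumes: int) -> float:
    """Return the average number of pages per digitized volume."""
    return pages / volumes


def megabytes_per_volume(terabytes: float, volumes: int) -> float:
    """Return average storage per volume in MB.

    Assumption: decimal units, 1 TB = 1,000,000 MB (not stated on the slide).
    """
    return terabytes * 1_000_000 / volumes


if __name__ == "__main__":
    share = public_domain_share(PUBLIC_DOMAIN_VOLUMES, VOLUMES)
    print(f"Public domain share: {share:.1f}%")
    print(f"Pages per volume: {pages_per_volume(PAGES, VOLUMES):.0f}")
    print(f"Storage per volume: {megabytes_per_volume(TERABYTES, VOLUMES):.1f} MB")
```

The computed public-domain share is consistent with the "~15% of total" stated on the slide.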
Who else is involved?
HathiTrust Partners
• Lead partners
– University of Michigan
– University of California
– Indiana University
• Members
– Columbia University
– Indiana University
– Michigan State University
– Northwestern University
– The Ohio State University
– Penn State University
– Purdue University
– University of California Berkeley
– University of California Davis
– University of California Irvine
– University of California Los Angeles
– University of California Merced
– University of California Riverside
– University of California San Diego
– University of California San Francisco
– University of California Santa Barbara
– University of California Santa Cruz
– California Digital Library
– The University of Chicago
– University of Illinois
– University of Illinois at Chicago
– The University of Iowa
– University of Michigan
– University of Minnesota
– University of Wisconsin-Madison
– University of Virginia
UC involved at many levels
Executive Committee
• Brian Schottlaender, UCSD
• Laine Farley, CDL
Working Groups
• CDL & campus participants according to tasks
Strategic Advisory Board
• Bernie Hurley, UCB
• R. Bruce Miller, UCM
• Patricia Cruse, CDL
Operations & Development
• CDL, for now
• Potentially campuses?
Why is UC participating in the HathiTrust?
• Preservation and stewardship of UC resources
– Brings our Google and Internet Archive books together in a common preservation repository under UC control
• Economy of scale
– Storing mass digitized books is expensive – many terabytes of data
• Better access to our own books
– Create robust links to full text in HathiTrust in Next Generation Melvyl, including all viewable content from UC and other participating libraries
– Build improved access interfaces via the HathiTrust API
• Aggregate multiple library collections for greater research impact
– HathiTrust will support shared access and search mechanisms across all partner content to the extent possible
– With UC, over 5 million books and counting
– Over ¾ million books in the public domain
• Experiment with large scale search, text mining, and other specialized services developed with academic users in mind
– Google and Internet Archive are building services for the general user
– Research libraries will build services optimized for serious research
Over 1.1 million UC books
stored in the HathiTrust
Is the Google Settlement a good
thing for libraries?
• As David Weinberger, a fellow at Harvard’s Berkman Center for Internet & Society, has written:
– “The settlement is not what you would come up with if you began with a blank piece of paper and designed the optimal system for all the interested parties.”
Controversy over the Settlement
• Some are concerned that it will:
– Give Google a monopoly over book digitization and
suppress competition
– Allow Google to charge high prices for subscriptions
– Create an artificial market for orphan works, give
Google a monopoly over them and prevent more
open sharing of those works
• Orphan works = works still under copyright whose copyright
owners cannot be identified or located
What’s the potential upside for libraries?
If approved, the Settlement will:
• Make millions of books in research library collections more accessible to users and the general public than ever before – including more accessible than they are now via Google Book Search
• Provide a significant corpus of material for advanced computational research
• Allow individual rights holders to convey broader use rights if they wish
• Potentially, spur a more rational legislative solution for orphan works
An approach for libraries:
• Assist and encourage rights holders to
release their books into the public sphere
• Press for orphan works legislation
• Advocate for robust privacy controls
• Neither we, nor other libraries, need rush to
purchase an institutional subscription
So, what’s next?
What’s next?
• Digitization continues!
• UC Google books continue to be ingested into the HathiTrust
• UC Internet Archive books will follow Q1 2010
• CDL is beginning to investigate discovery & access mechanisms in
concert with Michigan and other HathiTrust partners
• HathiTrust will continue to work within boundaries of the law to make
as many books viewable as possible
• The building of robust services & true collaboration will take time
For more information:
• CDL Mass Digitization:
http://www.cdlib.org/services/collections/massdig/index.html
• UC Mass Digitization FAQ:
http://www.cdlib.org/services/collections/massdig/faq.html
• UC & the Google Book Settlement:
http://osc-s10.cdlib.org/google/
• HathiTrust: http://www.hathitrust.org/
• UC Libraries & the HathiTrust:
http://www.cdlib.org/services/hathi/faq.html
Questions?
Thank you!
Heather Christenson
Mass Digitization Project Manager
California Digital Library