Digital Repository Preservation Service

LMC Plus, April 16, 2008
Meg Bellinger, AUL; Roy Lechich and Audrey Novak, ILTS


Yale and global

Yale Cyber Infrastructure Architecture Infrastructure Framework and Protocols

Web services, Z39.50, OAI-PMH, RSS, SRU/SRW, OAIS, Fedora

Common Services

Persistent identification, Authentication & Authorization, Registries, Rights Management

Content Provision: Services & Storage For

digital collections, preservation, metadata


library, museums, research, academic and administrative departments

Fusion: Services, Tools, Applications

Brokers, aggregators, indexes, catalogs, MetaLib, XSearch

Presentation: Interfaces

Yale uPortal, Classesv2, Google, Personal Information Environment, Discipline specific, gallery, museum and library sites

(Based on a graphic created by Lorcan Dempsey)

Content Sources: Full Text Books; Finding Aids; Audio & Video; Images and Metadata; Complex Objects; Research Data; Personal Collections

Yale University Library Digital Repository Service: Collections Environment; E-Publishing (Institutional Repository); Integration Services; Preservation Archive

Dissemination: Google, MSN, Yahoo …; Image Commons; Classes*v2 (Sakai); University Portal; Library, MetaLib; Collections; XSearch; VITAL


• Introduction
• Background
• Digital Preservation Repository
  – Phase I
  – Additional Phases
• Within the Larger Landscape

16 Apr 2008

Intro: What is Digital Preservation?

“Digital preservation is the whole of the activities and processes involved in the physical and intellectual protection and technical stabilization of digital resources through time in order to reproduce authentic copies of these resources.”

(YUL Digital Preservation Policy)


Introduction: The Need

Statistical Datasets Mass Digitization

At an ever accelerating pace, faculty, students, and staff (e.g., the Library) are creating, sharing, and storing digital information for teaching, learning, research, administrative, and creative purposes.

Information in digital form is now integral to Yale's core mission.

Images Scientific & Biomedical Data Audio, Video, Podcasts Web Sites


Introduction: The Need

• Digital resources are fragile, and preserving them is complex.
• Digital preservation is dynamic
  – Responses to technological obsolescence or media decay must be taken quickly.
• Digital preservation is proactive rather than reactive
  – The prospects for successfully preserving digital resources rest heavily upon decisions taken at each stage of their life cycle, starting with creation.

Introduction: The Need

Digital Landscapes Committee, Cyberinfrastructure Survey (Oct 2006)

Ranking from 19 survey questions posed to faculty:
#1 Easier electronic access to scholarly materials
#2 Providing students with digital access to research and instructional materials
#11 Ensuring the preservation of my scholarly digital output (e.g., datasets, research notes, e-prints)

Introduction: The Need

“The coolest thing that will be done with your data someone else will do.”

Open Repositories 08


Background – YUL Related Initiatives

• IAC Rescue Repository – 2004 - present
• IAC Digital Preservation Committee – Nov 2004 - Jan 2007
• IAC Metadata Committee – Nov 2004 - Feb 2007
  – PREMIS - Preservation Metadata Task Force
    • April - Oct 2006

Rescue Repository

(May 2004 Requirements Report)

“An increasing number of projects in the YUL are generating or acquiring digital content …”

“The digital masters for much of this material are in immediate danger of permanent loss through media decay, physical damage, technological obsolescence, or difficulties in archival management...”

“… the interim, we propose a flexible and agile/quick short-term solution…”

Rescue Repository Description

• Managed, secure storage (disk-to-disk-to-tape).
• Resources are organized according to owning library, collection, subcollection(s), file name.
• Activity is managed by simple ingest and retrieval applications with basic file verification and validation.
• A ~3 year temporary solution (May 2005 +3 yrs).
• Heavily used …

Users: BRBL, Div, E-Collections, Geo, LWL, MSS/A, Peabody, Preservation, SSL, VRC, YUAG
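The slide does not say how the Rescue Repository lays out or verifies files. As a minimal sketch only, assuming a SHA-256 digest comparison for "basic file verification" and a directory path built from the four organizing levels named above, it might look like the following. The function names `storage_path`, `sha256_of`, and `verify_copy` are hypothetical and are not taken from the Rescue Repository applications.

```python
import hashlib
from pathlib import Path, PurePosixPath


def storage_path(library: str, collection: str,
                 subcollections: list[str], file_name: str) -> PurePosixPath:
    """Build a path of the form library/collection/subcollection(s)/file name.

    Assumption: the four organizing levels map directly to directories.
    """
    return PurePosixPath(library, collection, *subcollections, file_name)


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copy(source: Path, stored: Path) -> bool:
    """Return True when the stored copy has the same digest as the source."""
    return sha256_of(source) == sha256_of(stored)
```

Reading in chunks keeps memory use flat for large image masters; any other strong digest would serve the same purpose.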

RR Storage Usage

[Chart: Total Storage, Used Storage, and Available Storage, Oct-05 through Dec-08; vertical axis 0 to 60]

Digital Preservation Committee

• Preservation Policy – Defines digital preservation; establishes general principles about what is preserved; promulgates our commitment to standards.

• Best Practices – A dynamic suite of documents that address current best practices for preservation-related issues such as format validation, registries, etc.


Metadata Committee

Preservation Metadata Taskforce (PREMIS) Report

• PREMIS (PREservation Metadata: Implementation Strategies) defines the metadata needed to preserve digital information assets for the long term.


Preservation Metadata Task Force Recommendations

Two Profiles for YUL’s PREMIS Implementation:

• Base (6 elements) - A sub-set of full PREMIS … that is temporary until the library has developed digital preservation policies.

• Full - A draft that needs to be fine-tuned through experience with actual instances of use at Yale. Experience using PREMIS will determine which elements in the PREMIS model are necessary at Yale.


Digital Preservation Need and Related Initiatives Summary

• The demand for a Digital Preservation Repository from faculty, Rescue Repository users, digitization operations and projects is heavy.
• The Rescue Repository and work by the IAC Digital Preservation and Metadata/PREMIS Committees laid the foundation.
• Rescue Repository is reaching its planned end of life.


Digital Preservation Repository: Phase I

• $500,000 funding to establish a Digital Preservation Repository prototype.
  – Provide mechanisms and services for preservation and access to the data.
  – Create the scalable hardware infrastructure.
  – Demonstrate an extensible repository service model.
  – Develop the resource (staff and economic) models.
  – Establish the collaborative campus partnerships.
  – Further the research and scholarship into digital preservation issues.


Digital Preservation Repository: Phase I

Working from two Use Cases:

1. YPED (Yale Protein Expression Database)*: protein profiling mass spectrometry data sets generated by the Keck Lab.
2. Images from the Rescue Repository: approximately 400,000 individual image files from the Art Gallery, Beinecke, Divinity Library, Lewis Walpole Library, Library Visual Resources Collection, and Manuscripts and Archives department.

* Proteomics is the large-scale study of proteins and is often considered the next step in the study of biological systems, after genomics.


Digital Preservation Repository: Phase I

1. Hardware Architecture
2. Software Design
3. Preservation Metadata
4. Use Case: YPED
5. Use Case: Images

Phase I - Hardware

• 20TB YPED and Images
• 30TB Microsoft mass digitization
• 10TB non-images (Rescue Repository)
• 40TB Annual growth with Library digitization projects
• 250TB Annual growth with Fortunoff video digitization project
• 1000TBs (a petabyte) within 5 years
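The arithmetic behind the petabyte figure can be checked with a rough sketch. It assumes, beyond what the slide states, that the first three items form a one-time starting volume and that both annual growth figures apply linearly from the first year. The names `INITIAL_TB`, `ANNUAL_GROWTH_TB`, `projected_tb`, and `first_year_reaching` are illustrative.

```python
# Starting volume (assumption: the three one-time items are summed).
INITIAL_TB = 20 + 30 + 10       # YPED and images, mass digitization, non-images

# Yearly growth (assumption: both projects grow linearly from year one).
ANNUAL_GROWTH_TB = 40 + 250     # Library digitization + Fortunoff video


def projected_tb(years: int) -> int:
    """Projected storage in TB after the given number of whole years."""
    return INITIAL_TB + ANNUAL_GROWTH_TB * years


def first_year_reaching(target_tb: int) -> int:
    """Smallest whole number of years at which the projection meets the target."""
    year = 0
    while projected_tb(year) < target_tb:
        year += 1
    return year


if __name__ == "__main__":
    for year in range(6):
        print(year, projected_tb(year))
    print("Years to reach 1000 TB:", first_year_reaching(1000))
```

Under these assumptions the projection is 930TB after three years and passes 1000TB in the fourth, which is consistent with "within 5 years"; a later start for the video project would push that date out.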

Phase I - Hardware


Projected Growth in Storage

[Chart: projected storage in TB (vertical axis 0 to 900) by year, 2005 through 2012]

Software Design

Phase I - Core Preservation Functionality

• Deposit, Normalization, Packaging, Validation, Ingest, Storage (multiple copies, geographic separation), Preservation Policy Management, Authorization, OAI-PMH, SRW/SRU, Retrieval
• YPED and Image Use Case Requirements

Additional Phases - Additional Services

• Preservation actions
• All (or almost all) user-facing services
• Enhanced access & delivery through applications
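For readers unfamiliar with OAI-PMH, a harvesting request is an HTTP query whose `verb` parameter names the operation. The sketch below builds a `ListRecords` request URL. The base URL is a placeholder, not the repository's actual endpoint, and `list_records_url` is a hypothetical helper; `verb`, `metadataPrefix`, `set`, and `from` are parameters defined by the protocol, and `oai_dc` is the Dublin Core format every OAI-PMH repository must support.

```python
from typing import Optional
from urllib.parse import urlencode

# Placeholder endpoint (assumption): the slides do not give a real base URL.
BASE_URL = "https://repository.example.edu/oai"


def list_records_url(base_url: str,
                     metadata_prefix: str = "oai_dc",
                     set_spec: Optional[str] = None,
                     from_date: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL.

    set_spec and from_date are optional selective-harvesting arguments.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec
    if from_date is not None:
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    print(list_records_url(BASE_URL, from_date="2008-04-16"))
```

A harvester would fetch this URL and follow any resumption token in the XML response; that part is omitted here.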

Repository

Deposit / Ingest (SIP)
• Flexible
• Accept Different Types of Data
• Collect Data and Metadata Components
• Normalize for Ingest Processing
• Verify Integrity
• Add Identifiers
• Add Preservation Metadata

Preservation / Storage (AIP)
• Continuous Integrity Checks
• Format Migrations (e.g. TIFF to JPEG 2000)
• Storage Migrations (to new or different types of physical media)
• Logging
• Reporting

Access (DIP)
• Authorization
• Validation
• OAI-PMH
• SRW/SRU
• Indexing
• Retrieval
• Logging
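Three of the deposit steps (verify integrity, add identifiers, add preservation metadata) can be illustrated with a small sketch. This is not the repository's design: the `PackagedObject` class, the `ingest` function, the use of SHA-256 and UUID identifiers, and the event field names are all assumptions made for illustration, and the event record is only loosely modeled on the idea of a PREMIS event rather than on the PREMIS schema.

```python
import hashlib
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class PackagedObject:
    """A deposited file plus the identifier and metadata added at ingest."""
    identifier: str
    file_name: str
    sha256: str
    events: list = field(default_factory=list)


def ingest(file_name: str, content: bytes, expected_sha256: str) -> PackagedObject:
    """Verify integrity, assign an identifier, and record an ingest event."""
    # Verify integrity: the depositor's digest must match the received bytes.
    actual = hashlib.sha256(content).hexdigest()
    if actual != expected_sha256:
        raise ValueError(f"checksum mismatch for {file_name}")

    # Add identifier: a random UUID stands in for a persistent identifier.
    packaged = PackagedObject(
        identifier=f"urn:uuid:{uuid.uuid4()}",
        file_name=file_name,
        sha256=actual,
    )

    # Add preservation metadata: an illustrative event record.
    packaged.events.append({
        "type": "ingestion",
        "outcome": "success",
        "date": datetime.now(timezone.utc).isoformat(),
    })
    return packaged
```

Rejecting a deposit on a digest mismatch, rather than storing it and flagging it later, keeps damaged files out of the archive in the first place; the stored digest can then be reused for the continuous integrity checks listed under Preservation / Storage.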

Digital Preservation Repository – Phase I Summary

Build:
• Hardware environment
• Core preservation repository services
• Project specific service components needed for YPED and to replace Rescue Repository
• Migration of Rescue Repository image content

Additional Phases

Examples:
• Full Rescue Repository migration
• More content (project/use cases)
  – Project specific ingest and access
• More storage (950TBs)
• Preservation actions (integrity checks, format migrations, etc.)
• Reporting
• Rights Management

5 years, 6 FTE, ~7 million dollars


Larger Landscape

Peer Institutions:
• Stanford, Harvard
• Rutgers
• DAITSS (Florida)
• Michigan
• Columbia

Internationally:
• European National Libraries
• Australia & New Zealand