A Micro-Services-Based Approach for Curation and

UC3 Summer Webinar Series

An Introduction to the Merritt Curation Repository

University of California Curation Center Team, California Digital Library

June 9, 2011

First, a word about the webinar series…

• A forum for timely topics of interest to the UC community

– Highlighting projects, services, and developments in the areas of digital preservation, web archiving, and data curation

– Intended to raise awareness of issues and provide information on useful resources and services available to the UC community

– 2nd and 4th Thursday of the month, and as scheduled, featuring UC3 staff and UC librarians, content managers, and technologists

Teleconference: +1 (866) 740-1260, access code 9879016#

Webconference: http://bit.ly/jdjMAP

First, a word about the webinar series…

• Some logistics…

– Participant phones will be muted during the formal presentation, but we will be monitoring the online chat

– Slides, Q & A, and web and voice recordings will be posted after each presentation

– Schedule available at http://www.cdlib.org/uc3/uc3webinars.html

– Please suggest additional topics!

[email protected]

– Take the short survey http://www.surveymonkey.com/s/XSGWP8R

Now on with the show…

• Today’s topic is an introduction to the Merritt curation repository

– Who is it for?

– What can it do?

– Why use it?

– What does it cost?

– Next steps?

– Q & A

What keeps you up at night?

How much will it cost?

What’s the best strategy to ensure permanent availability?

Are there standards or best practices I should be aware of?

How do I know my content is safe?

I have a good discovery platform; how can I add preservation services?

How can I transfer my content to an appropriate curation environment?

Do I need to create new derivatives just for preservation purposes?

Can I control who can see my content?

How can I get a persistent reference to my content?

What if my content needs to evolve over time?

“There’s an app for that”

Are there standards or best practices I should be aware of?

– UC3 consultation

How much will it cost?

– Storage at $1.04/GB/year

What’s the best strategy to ensure permanent availability?

– Automatic replication and high-availability redundancy

How do I know my content is safe?

– Periodic fixity audit

I have a good discovery platform; how can I add preservation services?

– Modular micro-services “toolkit”

How can I transfer my content to an appropriate curation environment?

– Simple submission UI/API

– METS “feeder” duplicates existing DPR workflow

Can I control who can see my content?

– Curator-defined access control rules

How can I get a persistent reference to my content?

– Integration with EZID and DataCite

Do I need to create new derivatives just for preservation purposes?

– Model free

– No packaging, format, or metadata requirements

What if my content needs to evolve over time?

– Strongly versioned
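The periodic fixity audit named on this slide can be illustrated with a minimal sketch: recompute each file's digest and compare it with the value recorded at ingest. This is only an illustration, not Merritt's implementation; the function names, the choice of SHA-256, and the in-memory dictionary of recorded digests are all assumptions made for the example.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def file_digest(path: Path, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Return the hex digest of a file, reading it in chunks to bound memory use."""
    hasher = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def audit(recorded: dict[Path, str], algorithm: str = "sha256") -> list[Path]:
    """Return the files that are missing or whose digest no longer matches.

    `recorded` maps each file to the digest stored when it was ingested
    (a hypothetical stand-in for a repository's inventory).
    """
    failures = []
    for path, expected in recorded.items():
        if not path.is_file() or file_digest(path, algorithm) != expected:
            failures.append(path)
    return failures
```

An empty list from `audit` means every file still matches its recorded digest; any returned path would need to be repaired, for example from a replica.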

Merritt repository

• Merritt is available for use by all members of the UC community

– Libraries/archives/museums

– ORU/MRUs

– Faculty/staff

Mediated through campus libraries

• Centrally hosted by UC3/CDL on behalf of the UC community

– Economies of scale

– Shared experience and expertise

Modes of use: dark archive

• Proactive preservation, but no expectation of direct end-user access

– Legacy DPR content contributed by campus libraries

– Cultural heritage texts, master images, sound, moving image, data sets

– All DPR content will be automatically migrated to Merritt

Modes of use: bright archive

• Provide preservation and end-user access

– NIH Healthy Pathways project on bio-demographics

  • Multi-institutional: UC Davis, University of Colorado, University of Virginia, Syddansk University (Denmark)

  • Need to restrict access to project partners initially, with eventual public access

Modes of use: bright archive

• Content discovery: search

Modes of use: bright archive

• Content discovery: browse

Modes of use: preservation “back end”

• Preservation only; content discovery/delivery provided by well-known external systems

– Using direct hooks into Merritt to retrieve content

– eScholarship: open access publishing

– Open Context: archaeological data publishing

– Investigating integration with Islandora/Drupal and Alfresco

Modes of use: distributed data grids

• DataONE: “Enable new science and knowledge creation through universal access to data about life on earth and the environment that sustains it”

More information

• Online help: http://merritt.cdlib.org/help

• FAQ: http://merritt.cdlib.org/docs/merritt_handout.pdf

• User’s guide: http://merritt.cdlib.org/docs/merritt_user_guide.pdf

• UC3 contact: http://www.cdlib.org/uc3/contact.html

[email protected]

Merritt cost model

• UC3 provides technical infrastructure, data center hosting, staff, monitoring, maintenance, enhancements, help, outreach, consultation, etc.

• Contributors are charged only for storage used, at the UC3 recovery rate of $1.04/GB/year

How does this compare?

• Cost of a physical book in RLF: $4.62/year

• Cost of a digital book in HathiTrust: $0.15/year

• Cost of a digital book in Merritt: $0.06/year

• Developing an “endowment” model: Pay once, preserve forever

• Will soon extend model for non-UC contributors

† Gary Lawrence (2007) Internal analysis, CDL; ‡ Paul Courant and Matthew Nielsen (2010), On the cost of keeping a book, HathiTrust.

Average collection sizes and costs

Collection        Objects   Size       Annual cost
CA DOE reports    8,000     12.0 GB    $12.48
Cal Cultures      420       65.6 GB    $68.22
eScholarship      46,425    118.6 GB   $123.34

A “cost calculator” spreadsheet is available at http://www.cdlib.org/uc3/docs/Merritt-cost-calculator-v3.xlsx
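The annual costs in the table follow directly from the stated recovery rate of $1.04/GB/year. The sketch below reproduces that arithmetic; it is not the UC3 cost-calculator spreadsheet, and the function name and the half-up rounding to whole cents are assumptions made for the example.

```python
from decimal import Decimal, ROUND_HALF_UP

# Recovery rate quoted in the slides: $1.04 per GB per year.
RATE_PER_GB_YEAR = Decimal("1.04")


def annual_cost(size_gb) -> Decimal:
    """Annual storage cost in dollars for a collection of `size_gb` gigabytes.

    Rounding to cents with ROUND_HALF_UP is an assumption, not a documented rule.
    """
    cost = Decimal(str(size_gb)) * RATE_PER_GB_YEAR
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Collection sizes taken from the table above.
collections = {"CA DOE reports": "12.0", "Cal Cultures": "65.6", "eScholarship": "118.6"}
for name, size in collections.items():
    print(f"{name}: {size} GB costs ${annual_cost(size)} per year")
```

Using `Decimal` rather than floating point keeps the cents exact, which matters when the figures are compared against a published price list.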

Average ETD size and cost

Campus            ETD titles   Size       Annual cost
Berkeley          797          12.4 GB    $12.88
Davis             837          13.0 GB    $13.52
Irvine            390          6.1 GB     $6.30
Los Angeles       720          11.2 GB    $11.63
Riverside         192          2.9 GB     $3.10
San Diego         558          8.7 GB     $9.02
San Francisco *   560          8.7 GB     $9.05
Santa Barbara     325          5.0 GB     $5.25
Santa Cruz        155          2.4 GB     $2.50

Based on 2009 holdings in ProQuest

* UCSF based on total ETD holdings in Merritt

Average research data size and cost

• Almost 50% of all research data is less than 1 GB

Size             Percentage   Annual cost
< 1 GB           48.3 %       < $1.04
1 – 100 GB       32.0 %       $1.04 – 104.00
100 GB – 1 TB    12.1 %       $104.00 – 1,040.00
> 1 TB           7.6 %        > $1,040.00

Source: Science 331:6018 (February 11, 2011): 692-693

Next steps

• UC3 is working with campus partners to determine ongoing development and collection priorities

[Diagram labels: Annotation; Notification; Transformation; Characterization; Fixity / Linked data; Replication; IdM/Authn/Authz; Ingest, Access; Inventory, Queuing; Storage and Identity; Technology watch; Metadata standards; Policy and business model; Data management guidelines; Object and collection modeling; New content acquisition]

Next steps

In production

• Model-free objects

• Submission via UI and API

• Persistent identifiers

• Format identification

• Version provenance

• Automated replication

• Automated fixity audit

• Role-based access control

• Collections

• Semantic index and search

• Object/version/file download

In progress

• Simplified update

• Enhanced characterization (JHOVE2)

• Faceted search and browse (XTF)

• CMS/DAMS-like function (Islandora)

In planning

• Simplified batch

• UCTrust integration

• Linked data

• Transformation

• Notification

• Annotation

• Support for NGTS/DLSTF recommendations

We welcome your feedback on needs and priorities!

http://www.cdlib.org/uc3/contact.html

[email protected]

Simplified update

• Variant form of object update requiring the submission of only the changed components

• Client-side tools to simplify the creation of batch manifests

#%checkm_0.7
#%profile | http://uc3.cdlib.org/registry/ingest/mani
#%prefix | mrt: | http://merritt.cdlib.org/terms#
#%prefix | nfo: | http://www.semanticdesktop.org/onto
#%fields | nfo:fileUrl | nfo:hashAlgorithm | nfo:hash
http://merritt.cdlib.org/samples/goldenDragon.jpg | m
http://merritt.cdlib.org/samples/tumbleBug.jpg | md5
http://merritt.cdlib.org/samples/generalDrapery.jpg |
http://merritt.cdlib.org/samples/generalDrapery.jpg |
#%eof
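A client-side helper for building rows in the manifest's three visible fields (nfo:fileUrl, nfo:hashAlgorithm, nfo:hash) might look like the sketch below. It is an illustration only, not a UC3 tool: the function names are hypothetical, and the `#%profile` and `#%prefix` lines are deliberately left out because their values are cut off on the slide, so the output is not a complete Merritt manifest.

```python
import hashlib

# Field names as they appear in the "#%fields" line of the slide's manifest.
FIELDS = ("nfo:fileUrl", "nfo:hashAlgorithm", "nfo:hash")


def manifest_row(file_url: str, content: bytes, algorithm: str = "md5") -> str:
    """Build one pipe-delimited row: URL, digest algorithm, hex digest."""
    digest = hashlib.new(algorithm, content).hexdigest()
    return " | ".join((file_url, algorithm, digest))


def build_manifest(entries) -> str:
    """Assemble a partial manifest from (url, content) pairs.

    The profile and prefix header lines are omitted on purpose; a real
    submission would need them filled in from the Merritt documentation.
    """
    lines = ["#%checkm_0.7", "#%fields | " + " | ".join(FIELDS)]
    lines.extend(manifest_row(url, data) for url, data in entries)
    lines.append("#%eof")
    return "\n".join(lines)
```

In practice the content would be read from the files being submitted, and the URLs would point to wherever those files can be fetched.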

Enhanced characterization

• JHOVE2 next-generation framework for format-aware characterization http://jhove2.org/

– Automated extraction and inference of extensive technical metadata significant for preservation analysis and planning

"Module": {
  "scope": "ICCModule",
  "Header": {
    "scope": "ICCHeader",
    "ProfileSize": {
      "unit": "byte",
      "value": 60960
    }
    ,"ProfileVersionNumber": "4.2.0.0"
    ,"ProfileDeviceClass_raw": "spac"
    ,"ProfileDeviceClass_descriptive": "ColorSpace Conversion profile"
    ,"ColourSpace_raw": "RGB "
    ,"ColourSpace_descriptive": "rgbData"
    ,"ProfileConnectionSpace_raw": "Lab "
    ,"ProfileConnectionSpace_descriptive": "labData"

Enhanced discovery via XTF

• eXtensible Text Framework http://xtf.cdlib.org/

– CDL developed/supported open source discovery platform

– Robust, scalable faceted search and browse

CMS/DAMS-like function

• Many campuses are looking for CMS/DAMS solutions

• Investigating integration with Islandora to provide a Drupal CMS/DAMS front-end to Merritt

http://islandora.ca/

http://drupal.org/

Questions?

Upcoming webinars

• Wednesday, June 15, 12:30 pm: Data Sharing by Scientists: Practices and Perceptions (Carol Tenopir, Univ. Tennessee; Mike Frame, USGS)

• Thursday, June 30, 2:00 pm: The Data Management Planning Tool (DMP Tool) (Trisha Cruse, UC3)

• Thursday, July 14, 2:00 pm: Data as Publication (John Kunze, UC3; Catherine Mitchell, CDL Publishing Program)

• Thursday, July 28, 2:00 pm: Merritt: Depositing Content and Providing Access

• Thursday, August 11, 2:00 pm: DCXL (Data Curation Excel)

http://www.cdlib.org/uc3/uc3webinars.html

Please take the webinar survey

http://www.surveymonkey.com/s/XSGWP8R

For more information

UC Curation Center

http://www.cdlib.org/uc3 http://www.cdlib.org/uc3/contact.html

[email protected]

Stephen Abrams, Lisa Colvin, Patricia Cruse, Scott Fisher, Erik Hetzner, Greg Janée, John Kunze, Margaret Low, David Loy, Mark Reyes, Tracy Seneca, Joan Starr, Marisa Strong, Perry Willett

UC3 webinar series

http://www.cdlib.org/uc3/uc3webinars.html

Merritt repository

http://merritt.cdlib.org/ http://merritt.cdlib.org/help http://merritt.cdlib.org/docs/merritt_handout.pdf

http://merritt.cdlib.org/docs/merritt_user_guide.pdf