A Micro-Services-Based Approach for Curation and

UC3 Summer Webinar Series
Merritt Repository
Depositing Content and Providing Access
University of California Curation Center Team
California Digital Library
July 28, 2011
Merritt summary
• Curation repository
– Supporting long-term preservation and access
– Publish, share, preserve, discover, (re-)use
For more information, review the June 9 webinar
http://www.cdlib.org/uc3/uc3webinars.html
• “Model free”
– There are no prescriptive requirements for content genre,
format, structure, or accompanying metadata
• No service fee (for UC affiliates)
– Contributors are billed only for storage, $1.04/GB/year
Cost of a physical book in offsite storage: $4.62/year
Cost of a digital book in HathiTrust: $0.15/year
Cost of a digital book in Merritt: $0.06/year
Cost of a dataset in Merritt: $1.00/year
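The Merritt figures can be related to the flat storage rate with simple arithmetic. The following is a minimal sketch, assuming cost scales linearly with size; the function names are illustrative, and the implied sizes are derived here rather than stated in the slides.

```python
RATE_PER_GB_YEAR = 1.04  # storage rate quoted above, in US dollars


def annual_storage_cost(size_gb: float) -> float:
    """Annual storage cost in dollars, assuming purely linear pricing."""
    return RATE_PER_GB_YEAR * size_gb


def size_for_cost(dollars_per_year: float) -> float:
    """Size in GB that a given annual cost would cover at the quoted rate."""
    return dollars_per_year / RATE_PER_GB_YEAR


# Sizes implied by the per-item Merritt figures (derived, not from the slides).
book_gb = size_for_cost(0.06)
dataset_gb = size_for_cost(1.00)
print(f"digital book: about {book_gb * 1024:.0f} MB")
print(f"dataset: about {dataset_gb:.2f} GB")
```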
Master recipe
• Registration (one time) [contributor → UC3, [email protected]]
• Submission [contributor → Merritt]
• Ingest [Merritt]
• Notification [Merritt → contributor]
• Discovery/delivery [consumer → Merritt → consumer]
Registration
• Contact Perry Willett, Merritt service manager
[email protected]
Submission
• User interface: manual deposits
• METS feeder: existing DPR workflows
• API: automated deposits
UI submission
• The submission package is always a single file
• An opportunity to supply descriptive metadata
UI submission
• The submission package is always a single file, which
may be:
– For a single object
• The complete object
• A multi-file object in a container (zip, gzip, tar.gz)
• A multi-file object defined by a manifest
– For a batch of objects
• A manifest referring to single file objects
• A manifest referring to objects in containers
• A manifest referring to objects defined by manifests
Manifest
• A “packing slip” for an object, providing URLs for all of the
object’s file components
– Object manifest
fileURL | hashAlgorithm | hashValue | fileSize | fileName | mimeType
...
• Algorithm = adler32, crc32, md2, md5, sha1, sha256, sha384, sha512
#%checkm_0.7
#%profile| http://uc3.cdlib.org/registry/ingest/manifest/mrt-ingest-manifest
#%prefix | mrt: | http://merritt.cdlib.org/terms#
#%prefix | nfo: | http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#
#%fields | nfo:fileUrl | nfo:hashAlgorithm | nfo:hashValue | nfo:fileSize | nfo:fileLastModified | nfo:fileName | mrt:mimeType
http://merritt.cdlib.org/samples/call911.jpg | md5 | 47d321056e60944a06973...
http://merritt.cdlib.org/samples/call911.txt | md5 | 77fe42b1055bbabe51648...
#%eof
• See User’s Guide and online help for more information http://merritt.cdlib.org/
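An object manifest in the layout shown above could be generated programmatically. The following is a minimal sketch, assuming md5 digests, an empty last-modified field, and components already held in memory; the header lines are copied from the sample, while the function names and the example.org URL are hypothetical. The User’s Guide remains the authority on the exact syntax.

```python
import hashlib

# Header lines copied from the sample manifest above.
HEADER = [
    "#%checkm_0.7",
    "#%profile| http://uc3.cdlib.org/registry/ingest/manifest/mrt-ingest-manifest",
    "#%prefix | mrt: | http://merritt.cdlib.org/terms#",
    "#%prefix | nfo: | http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#",
    "#%fields | nfo:fileUrl | nfo:hashAlgorithm | nfo:hashValue | nfo:fileSize | "
    "nfo:fileLastModified | nfo:fileName | mrt:mimeType",
]


def manifest_row(url, content, file_name, mime_type, last_modified=""):
    """One pipe-delimited row, in the order given by the #%fields line."""
    digest = hashlib.md5(content).hexdigest()
    fields = [url, "md5", digest, str(len(content)), last_modified,
              file_name, mime_type]
    return " | ".join(fields)


def build_object_manifest(components):
    """components: iterable of (url, content_bytes, file_name, mime_type)."""
    rows = [manifest_row(*component) for component in components]
    return "\n".join(HEADER + rows + ["#%eof"]) + "\n"


# Hypothetical component; a real manifest lists web-accessible URLs.
manifest = build_object_manifest([
    ("http://example.org/objects/readme.txt", b"hello", "readme.txt", "text/plain"),
])
print(manifest)
```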
Manifest
• A “packing slip” for a batch, providing URLs for all of the
objects in the batch
– Batch manifest
• Batch of single file objects
• Batch of container objects
• Batch of manifest objects
fileURL | hashAlgorithm | hashValue | fileSize | fileName | primaryID | localID | creator | title | date
...
• An Excel macro is available for automatically generating manifests
from spreadsheets http://merritt.cdlib.org/docs/merrittManifest.xls
• See User’s Guide and online help for more information http://merritt.cdlib.org/
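For contributors who prefer a script to the Excel macro, batch manifest rows could be assembled as follows. This is a sketch only: the field order is taken from the line above, but leaving unused fields empty and rejecting values that contain a pipe are assumptions, and the example.org URL is hypothetical. The header lines a batch manifest needs are not shown in the slides, so only row formatting is illustrated.

```python
BATCH_FIELDS = [
    "fileURL", "hashAlgorithm", "hashValue", "fileSize", "fileName",
    "primaryID", "localID", "creator", "title", "date",
]


def batch_row(record: dict) -> str:
    """Format one batch manifest row; missing fields are left empty."""
    unknown = set(record) - set(BATCH_FIELDS)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    values = [str(record.get(name, "")) for name in BATCH_FIELDS]
    if any("|" in value for value in values):
        raise ValueError("field values must not contain the '|' delimiter")
    return " | ".join(values)


row = batch_row({
    "fileURL": "http://example.org/theses/blaine.zip",  # hypothetical URL
    "creator": "Blaine, Tegan Woodward",
    "title": "Continuous measurements of atmospheric argon/nitrogen ...",
    "date": "2005",
})
print(row)
```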
Metadata
• Submission form
• Batch manifest
• Object component: mrt-erc.txt
erc:
who: Blaine, Tegan Woodward
what: Continuous measurements of atmospheric argon/nitrogen ...
when: 2005
where: ark:/20775/bb21509964
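The mrt-erc.txt component is small enough to generate from existing catalog data. A minimal sketch, assuming one value per element and the "label: value" layout of the example above; the function name is illustrative.

```python
def format_erc(who: str, what: str, when: str, where: str) -> str:
    """Render the four kernel elements as the text of an mrt-erc.txt file."""
    lines = [
        "erc:",
        f"who: {who}",
        f"what: {what}",
        f"when: {when}",
        f"where: {where}",
    ]
    return "\n".join(lines) + "\n"


erc_text = format_erc(
    who="Blaine, Tegan Woodward",
    what="Continuous measurements of atmospheric argon/nitrogen ...",
    when="2005",
    where="ark:/20775/bb21509964",
)
print(erc_text)
```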
Dublin Kernel element | Dublin Core element | Definition
who | creator | Responsible person or party
what | title | Content description
when | date | Lifecycle-meaningful date
where | identifier | Locally-meaningful identifier
http://dublincore.org/groups/kernel/spec/
METS feeder
• METS must conform to a profile documented in the
CDL Guidelines for Digital Objects
http://www.cdlib.org/services/dsc/contribute/docs/GDO.pdf
– METS, all referenced file components, and manifest must be web
accessible
– The Merritt IP address can be provided for configuring firewall rules
• Feeder manifest
http://url/path/mets.xml
http://url/path/mets.xml
...
• Submission
http://feeder.cdlib.org/?userID=id&authCode=passwd&accessGroupID=collection&manifestURL=manifest
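Because the manifest location is itself a URL, it needs to be percent-encoded when placed in the query string. A minimal sketch of building the submission URL, using the parameter names from the line above; the credential, collection, and manifest values are placeholders.

```python
from urllib.parse import urlencode

FEEDER_BASE = "http://feeder.cdlib.org/"


def feeder_url(user_id, auth_code, access_group_id, manifest_url):
    """Build a METS feeder submission URL with properly encoded parameters."""
    params = {
        "userID": user_id,
        "authCode": auth_code,
        "accessGroupID": access_group_id,
        "manifestURL": manifest_url,
    }
    return FEEDER_BASE + "?" + urlencode(params)


# Placeholder values for illustration only.
url = feeder_url("id", "passwd", "collection",
                 "http://example.org/feeder-manifest.txt")
print(url)
```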
API submission
Field | Required? | Value
filename | optional | File name
file | required | File contents
type | optional | File type:
• file
• container
• object-manifest
• batch-manifest
• container-batch-manifest
• single-file-batch-manifest
profile | required | Profile (supplied by UC3)
primaryIdentifier | optional | Primary identifier (ARK)
localIdentifier | optional | Local identifier
digestType | optional | Message digest type:
• adler-32
• crc-32
• md2
• md5
• sha-1
• sha-256
• sha-384
• sha-512
API submission
Field | Required? | Value
digestValue | optional | Message digest value (hexadecimal encoded)
creator | optional | Creator
title | optional | Title
date | optional | Date
note | optional | Descriptive note
responseForm | optional | Response form:
• anvl
• json
• xhtml
• xml
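A client can check a submission against the field table before sending it. The following is a sketch under the assumption that the lists above are exhaustive; the function name and the profile value are hypothetical, and Merritt itself remains the authority on what is accepted.

```python
ALLOWED_TYPES = {
    "file", "container", "object-manifest", "batch-manifest",
    "container-batch-manifest", "single-file-batch-manifest",
}
ALLOWED_DIGEST_TYPES = {
    "adler-32", "crc-32", "md2", "md5", "sha-1", "sha-256", "sha-384", "sha-512",
}
ALLOWED_RESPONSE_FORMS = {"anvl", "json", "xhtml", "xml"}
KNOWN_FIELDS = {
    "filename", "file", "type", "profile", "primaryIdentifier",
    "localIdentifier", "digestType", "digestValue", "creator", "title",
    "date", "note", "responseForm",
}


def check_submission(fields: dict) -> list:
    """Return a list of problems; an empty list means the fields look valid."""
    problems = []
    for name in ("file", "profile"):  # the two required fields
        if not fields.get(name):
            problems.append(f"missing required field: {name}")
    for name in sorted(set(fields) - KNOWN_FIELDS):
        problems.append(f"unknown field: {name}")
    enumerated = (
        ("type", ALLOWED_TYPES),
        ("digestType", ALLOWED_DIGEST_TYPES),
        ("responseForm", ALLOWED_RESPONSE_FORMS),
    )
    for name, allowed in enumerated:
        value = fields.get(name)
        if value is not None and value not in allowed:
            problems.append(f"invalid {name}: {value}")
    return problems


# "example_profile" is a placeholder; real profiles are supplied by UC3.
print(check_submission({"file": b"...", "profile": "example_profile",
                        "type": "object-manifest", "responseForm": "json"}))
print(check_submission({"file": b"...", "digestType": "sha512"}))
```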
API submission
POST /object/ingest HTTP/1.1
Host: merritt.cdlib.org
Content-type: multipart/form-data; boundary=boundary

--boundary
Content-disposition: form-data; name="file"; filename="filename"

file
--boundary
Content-disposition: form-data; name="type"

type
--boundary
Content-disposition: form-data; name="profile"

profile
--boundary
...
API submission
• cURL
http://curl.haxx.se/
% curl -s -u user:password \
    -F "file=@manifest" \
    -F "type=manifest-type" \
    -F "profile=profile" \
    -F "localIdentifier=identifier" \
    -F "creator=creator" \
    -F "title=title" \
    http://merritt.cdlib.org/object/ingest
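The same request can be assembled without cURL. The sketch below uses only the Python standard library and builds the request without sending it. It assumes HTTP Basic authentication, which is what cURL's -u option uses by default; the helper names, credentials, and profile value are placeholders.

```python
import base64
import urllib.request
import uuid

INGEST_URL = "http://merritt.cdlib.org/object/ingest"


def encode_multipart(fields, file_name, file_bytes):
    """Encode text fields plus one 'file' part as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode("utf-8")
        )
    parts.append(
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n".encode("utf-8")
        + file_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_ingest_request(user, password, fields, file_name, file_bytes):
    """Return an unsent POST request carrying the submission."""
    body, content_type = encode_multipart(fields, file_name, file_bytes)
    request = urllib.request.Request(INGEST_URL, data=body, method="POST")
    request.add_header("Content-Type", content_type)
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request.add_header("Authorization", f"Basic {token}")
    return request


# Placeholder credentials and profile; nothing is sent over the network here.
req = build_ingest_request(
    "user", "password",
    {"type": "object-manifest", "profile": "example_profile", "responseForm": "json"},
    "manifest.txt", b"#%checkm_0.7\n#%eof\n",
)
print(req.get_method(), req.full_url, len(req.data), "bytes")
```

Passing the request to urllib.request.urlopen would perform the actual submission.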
Ingest
• Primary identifier
– ARK (required; auto-generated if not supplied)
– DOI (can optionally be requested)
• Validation
• Characterization
• SIP → AIP
ISO 14721, Open Archival Information System (OAIS)
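ARK identifiers have a regular shape, ark:/NAAN/Name, and appear percent-encoded when embedded in a URL path, as in the object state URL of the final notification. A minimal sketch of both operations; the function names are illustrative, and the identifiers are the ones used in this presentation.

```python
from urllib.parse import quote, unquote


def split_ark(ark: str):
    """Split 'ark:/NAAN/Name' into its name assigning authority number and name."""
    scheme, naan, name = ark.split("/", 2)
    if scheme != "ark:" or not naan or not name:
        raise ValueError(f"not an ARK: {ark}")
    return naan, name


def encode_ark(ark: str) -> str:
    """Percent-encode an ARK so it can be used as a single URL path segment."""
    return quote(ark, safe="")


naan, name = split_ark("ark:/20775/bb21509964")
encoded = encode_ark("ark:/99999/fk4vm4kg6")
print(naan, name, encoded, unquote(encoded))
```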
Notification
• You will receive two separate email notifications
– Initial notification that we have received your submission,
and that it is queued for subsequent processing
– Final notification that we have fully processed your
submission
• UC3’s preservation commitment starts at the time of final
notification
Initial notification
From: UC3 Merritt Support [mailto:[email protected]]
Sent: Thursday, July 14, 2011 3:28 PM
To: Stephen Abrams
Subject: Completion of submission

Completion of submission - Notification Report
- Submission ID: bid-4ed4bf45-aa78-4da7-bb65-63b125d88150
- Job(s):
- Job ID: jid-3498bef6-e296-429d-b652-da1f35f8bc04
- Primary ID: ark:/20775/bb21509964
- Local ID: http://libraries.ucsd.edu/ark:/20775/bb21509964;b4946677;umi-ucsd-1040
- Filename: manifest2.txt
- Object title: Continuous measurements of atmospheric argon/nitrogen as a tracer of air-sea heat flux : models, methods, and data
- Object creator: Blaine, Tegan Woodward
- Object date: 2005
- Status: PENDING

Completion of submission - Notification
- Submission ID: bid-4ed4bf45-aa78-4da7-bb65-63b125d88150
- Job(s):
Number of pending job(s): 1
Number of completed job(s): 0
Number of failed job(s): 0
- User agent: slabrams
- Submission date: 2011-07-14T15:27:41-07:00
- Status: QUEUED

With attachment, bid-4ed4bf45-aa78-4da7-bb65-63b125d88150.txt
Final notification
From: UC3 Merritt Support [mailto:[email protected]]
Sent: Thursday, July 14, 2011 3:28 PM
To: Stephen Abrams
Subject: Completion of ingest

Completion of ingest - Notification Report
- Submission ID: bid-4ed4bf45-aa78-4da7-bb65-63b125d88150
- Job(s):
- Job ID: jid-3498bef6-e296-429d-b652-da1f35f8bc04
- Primary ID: ark:/99999/fk4vm4kg6
- Local ID: ark:/20775/bb21509964
- Version: 3
- Filename: manifest2.txt
- Object title: Continuous measurements of atmospheric argon/nitrogen as a tracer of air-sea heat flux : models, methods, and data
- Object creator: Blaine, Tegan Woodward
- Object date: 2005
- Object state: http://store-stage.cdlib.org:35121/state/2111/ark%3A%2F99999%2Ffk4vm4kg6?t=xhtml
- Submission date: 2011-07-14T15:27:46-07:00
- Completion date: 2011-07-14T15:27:53-07:00
- Status: COMPLETED

Notification Summary
- Submission ID: bid-4ed4bf45-aa78-4da7-bb65-63b125d88150
- Job(s):
Number of pending job(s): 0
Number of completed job(s): 1
Number of failed job(s): 0
- User agent: slabrams
- Queue Priority: 06
- Submission date: 2011-07-14T15:27:41-07:00
- Completion date: 2011-07-14T15:27:53-07:00
- Status: COMPLETED

With attachment, bid-4ed4bf45-aa78-4da7-bb65-63b125d88150.txt
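Since the preservation commitment starts with the final notification, an automated depositor may want to confirm the status recorded in the report. The following sketch parses "- Key: value" lines like those in the example. The exact plain-text layout of the report is an assumption based on the slide, and the function name is illustrative.

```python
from datetime import datetime

SAMPLE_REPORT = """\
- Submission ID: bid-4ed4bf45-aa78-4da7-bb65-63b125d88150
- User agent: slabrams
- Queue Priority: 06
- Submission date: 2011-07-14T15:27:41-07:00
- Completion date: 2011-07-14T15:27:53-07:00
- Status: COMPLETED
"""


def parse_report(text: str) -> dict:
    """Collect '- Key: value' lines into a dictionary, skipping anything else."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("- ") or ":" not in line:
            continue
        # Split on the first colon only; timestamps and ARKs contain colons too.
        key, _, value = line[2:].partition(":")
        fields[key.strip()] = value.strip()
    return fields


report = parse_report(SAMPLE_REPORT)
completed = report.get("Status") == "COMPLETED"
elapsed = (datetime.fromisoformat(report["Completion date"])
           - datetime.fromisoformat(report["Submission date"]))
print(completed, elapsed.total_seconds())
```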
Discovery/delivery
• Search
Discovery/delivery
• Browse
Coming soon …
• Enhanced characterization
– JHOVE2
http://jhove2.org/
• Faceted search/browse
– XTF (the technology behind
)
http://xtf.cdlib.org/
• Investigation of CMS/DAMS-like function through
integration with …
– Islandora/Drupal (in cooperation with UCLA)
– Alfresco (in cooperation with UCB)
– Omeka (in cooperation with UCSC)
Questions?
Upcoming webinars
Date/time | Topic | Presenter
Thursday, August 11, 2:00 pm | EZID: Create and Manage Persistent Identifiers | Joan Starr, UC3/CDL
Thursday, August 25, 2:00 pm | DCXL (Data Curation Excel) | Carly Strasser, UC3/CDL
Thursday, Sept. 22, 2:00 pm | Data Management Planning Tool | Patricia Cruse/Tracy Seneca, UC3/CDL
http://www.cdlib.org/uc3/uc3webinars.html
For more information
UC Curation Center
http://www.cdlib.org/uc3
http://www.cdlib.org/uc3/contact.html
[email protected]
Stephen Abrams
Lisa Colvin
Patricia Cruse
Scott Fisher
Erik Hetzner
Greg Janée
John Kunze
Margaret Low
David Loy
Mark Reyes
Abhishek Salve
Tracy Seneca
Joan Starr
Carly Strasser
Marisa Strong
Perry Willett
UC3 webinar series
http://www.cdlib.org/uc3/uc3webinars.html
Merritt repository
http://merritt.cdlib.org/
http://merritt.cdlib.org/help
http://merritt.cdlib.org/docs/merritt_handout.pdf
http://merritt.cdlib.org/docs/merritt_user_guide.pdf