Developing an Ingest Service for Fedora

Ryan Scherle
Muzaffer Ozakca
IUDL infrastructure project
• 2-year project funded by University Information Technology Services
to reengineer digital library infrastructure around Fedora
• Builds on experience with Fedora in context of EVIA Digital Archive
(ethnomusicology video)
• 2 full-time staff, plus part-time from many others
• Dozens of legacy collections with roughly 100,000 objects
• New collections: some content-focused, some research-focused
Diversity
• Multiple media types
• Multiple brands
• Multiple tools
The goal
[Diagram: Ingest]
Required features
• Ingest common content types:
▫ Images
▫ Paged documents
▫ Textual documents
• Allow for easy creation of new content types
• Must support several workflows
▫ Metadata or media may be primary
▫ Most objects include derived media
▫ Systematic changes to metadata may be desired
▫ May need to connect with external tools for metadata generation, validation, etc.
▫ A workflow engine may sit on top of the ingest system
Existing Ingest Tools
Criteria
• Ease of install
• Native content models
• Custom content models (e.g. paged)
• Workflow neutrality, including object modification
• Batch ingest
Remember, we’re evaluating object ingest only,
not object delivery!
But first, some disclaimers…
• This is not an objective evaluation, just our
experiences
• We’re not experts in these systems
• We’re evaluating ingest only, not delivery!
• We’re evaluating ingest with a focus on our
needs
• We believe in community
Fedora admin client
• Comes with Fedora
• Geared towards admins rather than end users
• No systematic way of entering data or attaching
files
• Very flexible
• The only way to create disseminators
• Tedious
Fez
• End-to-End GUI system
• Highly customizable content models, workflow,
security
• Customizable role and group based access control
• Growing community
• Originally developed as an Institutional Repository
• Many preset content models
• Can create “extension” metadata based on an XSD
• External MySQL database for workflow/vocabulary
data
• GPL
Fez - ingest
• Single object ingest
▫ Through Web UI
▫ ImageMagick/JHOVE integration
• Bulk ingest:
▫ Upload files to a directory
▫ Can also import existing Fedora objects in bulk
▫ Templates for metadata common to all objects, manual updates for the rest
▫ Batches possible, but only one file per object
[Diagram: File, Custom MD, File, Fedora]
• No disseminators
• Custom metadata can be stored as a simple XML file
• Objects must use “compound” content model
Fez – object organization
[Diagram: three levels. Community level: Community. Collection level: Collection, Collection. Content level: Image DO, Paper DO, DO with Custom MD]
Elated overview
• End to end complete system for digital
collections
• Simple customizable metadata and a simple
workflow supported
• GPL
“Elated is a lightweight, general-purpose application for managing
digital files. ELATED is built on top of the Fedora Repository
System, and could be used as a digital assets management system,
an institutional repository, or to meet other collection archiving,
publishing and searching needs.”
Elated ingest
• Single object ingest
▫ Through Web UI
▫ Focused on DC metadata, custom fields can be added
• Multi-object ingest via zipped folders and files
▫ Metadata template + manually
▫ Batches possible, but only one file per object
• Simple content model
• Manually-attached disseminators
[Diagram: File, DC + Custom MD, File, Fedora]
Elated object organization
[Diagram: nested hierarchy. Top level: Collection. Level 1, Level 2, ... Level n: Folders containing Image DOs and PDF DOs]
Valet for ETDs
• A component of the VTLS VITAL product
focused on ETD submission
• Allows submission of thesis and a simple
workflow for approval
• Part of a larger framework
• Highly focused on ETDs
DirIngest overview
• Ingests objects from a structured ZIP file
• Highly flexible
• User must create METS structure by hand
• Doesn’t handle disseminators
• Can create some RELS-EXT data, but not fully flexible
• Cannot modify existing objects/collections
• Easy to use OhioLink Bulk Ingest
DirIngest
[Diagram: Zip Archive (METS.xml, Crules.xml, Images folder with image files, Texts folder with a text file) mapped into Fedora. Top level: Collection. Folder level: Images, Texts. Content level: Image DOs, Text DO]
Batch modify
• A method of controlling API-M with simple XML
statements
• Can create “empty” objects and change them in
systematic ways.
• Requires manual (or programmatic) creation of
the modify scripts
• Can be used in conjunction with other tools…
Summary
[Table: Fez, Elated, Valet, DirIngest, Batch Modify, and Admin Client compared on Ease of install, Native CM, Custom CM, Workflow neutrality, and Batch ingest]
Indiana Ingest Tool
• A structured interface between a workflow management or repository
management GUI and the Fedora repository
• Focused on simple input formats for maximum flexibility
• Keeps the tools independent of the repository architecture
• Builds the FOXML, rather than requiring a full structure to be pre-built
• Binds disseminators
• Creates RELS-EXT relationships
• Can create and/or alter items in a collection
• Auto-generates technical metadata with JHOVE or XSLT.
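As an illustration of the "Creates RELS-EXT relationships" bullet, here is a minimal Python sketch of generating a RELS-EXT datastream body. It is not the Indiana tool's code. It assumes the relationship being asserted is collection membership (isMemberOfCollection from Fedora's relations-external ontology), and the item PID used in the usage note is made up.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Namespace of Fedora's external-relationship ontology.
REL_NS = "info:fedora/fedora-system:def/relations-external#"


def build_rels_ext(item_pid: str, collection_pid: str) -> str:
    """Return RDF/XML stating that item_pid is a member of collection_pid."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("rel", REL_NS)
    rdf = ET.Element(f"{{{RDF_NS}}}RDF")
    # The subject of the statement is the item itself.
    desc = ET.SubElement(
        rdf,
        f"{{{RDF_NS}}}Description",
        {f"{{{RDF_NS}}}about": f"info:fedora/{item_pid}"},
    )
    # Assumption: membership is the relationship of interest.
    ET.SubElement(
        desc,
        f"{{{REL_NS}}}isMemberOfCollection",
        {f"{{{RDF_NS}}}resource": f"info:fedora/{collection_pid}"},
    )
    return ET.tostring(rdf, encoding="unicode")
```

Calling build_rels_ext("iudl:100", "iudl:6") would yield RDF linking a hypothetical item iudl:100 to the collection PID iudl:6.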
[Diagram: Image Cataloging Tool, Sheet Music Cataloging Tool, EAD, MODS, JPG, PDF, and SIP feed the Ingest Tool, which sends FOXML and datastreams to Fedora]
Performing an ingest
• Place source metadata in an accessible location
(filesystem, website)
• Place media files (both master and derivative) in an
accessible location
• Define the "collection configuration"
• Run the ingest process
• Receive report
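The last two steps can be pictured with a small dry-run sketch in Python. This is an illustration only, not the tool's actual process: it assumes every item has one master file and a set of derivatives named by appending a suffix to the item ID, and the function name and report shape are hypothetical.

```python
from pathlib import Path


def ingest_report(item_ids, master_dir, master_ext, derived_dir, derived_exts):
    """Map each item ID to the names of its expected files that are missing.

    derived_exts maps a derivative label (e.g. "thumb") to a filename
    suffix (e.g. "-thumb.jpg"). An empty list means the item is complete.
    """
    report = {}
    for item_id in item_ids:
        # One master file per item (assumption).
        expected = [Path(master_dir) / f"{item_id}{master_ext}"]
        # One derivative per configured suffix.
        expected += [
            Path(derived_dir) / f"{item_id}{suffix}"
            for suffix in derived_exts.values()
        ]
        report[item_id] = [p.name for p in expected if not p.is_file()]
    return report
```

A report like this could be produced before anything is written to the repository, so that missing media are caught early.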
Sample collection config file
<!-- Collection definition -->
<cc:collectionName>Hoagy Carmichael Correspondence</cc:collectionName>
<cc:contentModel>paged</cc:contentModel>
<cc:collectionID>hoagy</cc:collectionID>
<cc:collectionPid>iudl:6</cc:collectionPid>

<!-- What to do if the item already exists -->
<cc:existingItem>
  <cc:fedoraItemExists action="alter"/>
</cc:existingItem>

<!-- File definitions -->
<cc:masterContent type="image" subtype="tiff">
  <cc:source location="localfs">{path to master images}</cc:source>
  <cc:extension>.tif</cc:extension>
</cc:masterContent>
<cc:derivedContent derivativeType="images">
  <cc:source location="localfs">{path to derivative images}</cc:source>
  <cc:extension item="thumb">-thumb.jpg</cc:extension>
  <cc:extension item="screen">-screen.jpg</cc:extension>
  <cc:extension item="large">-full.jpg</cc:extension>
</cc:derivedContent>

<!-- Descriptive metadata -->
<cc:descriptiveMetadata>
  <cc:metadataItem type="ead" authoritative="true" level="collection">
    <cc:source location="localfs">{path to ead}</cc:source>
  </cc:metadataItem>
  ...

<!-- Technical metadata -->
<cc:technicalMetadata>
  <cc:metadataItem type="mix" authoritative="true" level="masterContent">
  </cc:metadataItem>
  ...
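A config of this shape could be read with a few lines of Python. The sketch below is a guess at how a consumer might do it, not the tool's parser. The slide does not show the root element or the namespace bound to the cc prefix, so the root name cc:collectionConfig and the namespace URI are both hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI; the slide does not show the real one.
CC_NS = "http://example.org/collection-config"
NS = {"cc": CC_NS}

# Trimmed copy of the sample, wrapped in a hypothetical root element.
SAMPLE = f"""<cc:collectionConfig xmlns:cc="{CC_NS}">
  <cc:collectionName>Hoagy Carmichael Correspondence</cc:collectionName>
  <cc:contentModel>paged</cc:contentModel>
  <cc:collectionPid>iudl:6</cc:collectionPid>
  <cc:existingItem><cc:fedoraItemExists action="alter"/></cc:existingItem>
  <cc:derivedContent derivativeType="images">
    <cc:extension item="thumb">-thumb.jpg</cc:extension>
    <cc:extension item="screen">-screen.jpg</cc:extension>
    <cc:extension item="large">-full.jpg</cc:extension>
  </cc:derivedContent>
</cc:collectionConfig>"""


def read_config(xml_text):
    """Extract the fields an ingest run would need from a collection config."""
    root = ET.fromstring(xml_text)
    exists = root.find("cc:existingItem/cc:fedoraItemExists", NS)
    return {
        "name": root.findtext("cc:collectionName", namespaces=NS),
        "content_model": root.findtext("cc:contentModel", namespaces=NS),
        "collection_pid": root.findtext("cc:collectionPid", namespaces=NS),
        # Action to take when the item is already in the repository.
        "if_exists": exists.get("action") if exists is not None else None,
        # Derivative label -> filename suffix.
        "derived_extensions": {
            e.get("item"): e.text
            for e in root.iterfind("cc:derivedContent/cc:extension", NS)
        },
    }
```

read_config(SAMPLE) returns a plain dictionary, which keeps the rest of an ingest script independent of the XML layout.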
Example – Sheet Music
[Diagram: Images (with link to parent), MODS, and the ingest config feed the Ingest Tool, which adds tech MD and loads Fedora]
Example – preservation package
[Diagram: SIP containing audio (with link to parent) and AES31 metadata, plus the ingest config, feed the Ingest Tool, which adds tech MD and loads Fedora]
Summary
[Table: Fez, Elated, Valet, DirIngest, Batch Modify, Admin Client, and IU Tool compared on Ease of install, Native CM, Custom CM, Workflow neutrality, and Batch ingest]
Major difficulties in any ingest tool
• Providing flexibility in “style” of content model
• Matching filenames with metadata records
• Indicating the sequence of files in complex
objects
• Abstracting over differing local metadata
standards (even in our own collections)
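To make the filename-matching and sequencing difficulties concrete, here is a small Python sketch. It assumes one particular naming convention, record ID, hyphen, page number, extension, which is only one of many a collection might use. The function name and sample filenames are hypothetical.

```python
import re
from collections import defaultdict

# Assumed convention: <record id>-<page number>.<ext>, e.g. "hc001-2.tif".
PAGE_FILE = re.compile(r"^(?P<record>.+)-(?P<seq>\d+)\.[^.]+$")


def group_pages(filenames, record_ids):
    """Group page files under their metadata record, in page order.

    Returns (ordered, unmatched): ordered maps each record ID to its
    filenames sorted by numeric page number; unmatched lists files that
    fit no known record.
    """
    pages = defaultdict(list)
    unmatched = []
    for name in filenames:
        m = PAGE_FILE.match(name)
        if m and m["record"] in record_ids:
            pages[m["record"]].append((int(m["seq"]), name))
        else:
            unmatched.append(name)
    # Sort numerically so that page 10 follows page 2, not page 1.
    ordered = {rec: [n for _, n in sorted(items)] for rec, items in pages.items()}
    return ordered, unmatched
```

Sorting on the parsed integer avoids the common pitfall of alphabetical ordering, where "hc001-10.tif" would sort before "hc001-2.tif". Files that do not fit the convention are reported rather than silently dropped.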
Topics for future discussion
• What is the best structure for an ingest tool?
▫ Is our tool of interest to others?
▫ Would it be better to combine our capabilities with
an existing tool?
• Can we agree on some core content models?
Thank You!
• Infrastructure project wiki:
▫ http://wiki.dlib.indiana.edu/confluence/display/INF
• Contact info:
▫ Ryan Scherle
▫ Muzaffer Ozakca