Towards smart storage for repository preservation services

Steve Hitchcock, David Tarrant, Adrian Brown (1), Ben O’Steen (2),
Neil Jefferies (2) and Leslie Carr
Preserv 2 Project
School of Electronics and Computer Science, University of Southampton
(1) The National Archives, Kew
(2) Oxford University Library Services
@iPRES 2008: The Fifth International Conference on Preservation of Digital
Objects, London, 29-30 September 2008
Three-stage strategy for keeping
your data safe
• Ability to move data freely, easily
and instantly
– OAI, ORE, Atom
• Reliable, trusted large-scale storage
– Open Storage
• Risk profiling: invoke a range of
selectable services
– Smart storage
About institutional repositories
• Set up by institutions of higher education and research to manage
and disseminate their digital intellectual outputs.
• IRs are a special type of Web site, typically based on some
repository software that presents a database of records pointing to
the objects deposited.
• The Preserv 2 project is investigating the provision of
preservation services for IRs.
IRs in flux
• Uncertainty in terms of target content (published papers, theses,
research data, teaching materials), policy, rights, even locus of
content and responsibility for long-term management.
• OAI-ORE (Object Reuse and Exchange) effectively frees the data from
being captive to repository software.
• Commercial repository services, from software-specific services to
digital library services or more general 'cloud' or network storage
services.
Photo: Flickr/cpikas
IRs are
• Open source repository software
• Open access content
• Open archives, using OAI-PMH to share data with e.g. discovery
services
• Open repositories, using OAI-ORE to enable the easy movement of
data between different types of repository software
Photo: Flickr/Rightee
A new ‘open’
How open storage supports
preservation services
• Open storage: large-scale storage devices based on open source
software
• Open storage averts the need for a repository layer to access
first-class objects, i.e. objects that can be addressed directly
– In turn, these digital objects can be distributed and/or replicated over
many open storage platforms.
– In turn, storage with built-in preservation support can be selected
– Resilient storage platforms may be viable for preservation services
aimed at multiple repositories
• E.g. Sun Microsystems STK5800 (codenamed Honeycomb)
• Google Repository
Smart storage
• Smart storage combines an underlying
passive storage approach with the intelligence
provided through services.
• The key to realising smart storage is to enable
the services to communicate and share
information with the digital content sources they
may be acting on. This is done through
machine-level application programming
interfaces (APIs) and protocols.
APIs, interfaces and the Web
architecture
• Major services on the Web deploy their own simple, but
different, APIs, e.g.
– Google Maps
– Within the repository community,
SWORD (Simple Web-service
Offering Repository Deposit)
– Open storage platforms such as
Sun's STK5800 and the Amazon
Simple Storage Service (S3)
• To take advantage of open storage,
repositories have to be able to talk
to these services through their APIs.
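The last point can be sketched as a thin adapter layer between the repository and its storage. This is a minimal sketch under stated assumptions: `ObjectStore`, `InMemoryStore` and `replicate` are hypothetical names invented for illustration and are not the STK5800 or Amazon S3 APIs; a real adapter would wrap the platform's own API behind the same two methods.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class ObjectStore(ABC):
    """Hypothetical minimal interface a repository might code against."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None:
        """Store an object under a directly addressable key."""

    @abstractmethod
    def get(self, key: str) -> bytes:
        """Return the object stored under the key."""


class InMemoryStore(ObjectStore):
    """Stand-in backend for illustration only (assumption, not a real platform)."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


def replicate(key: str, data: bytes, stores: List[ObjectStore]) -> int:
    """Write the same object to every configured store; return the copy count."""
    for store in stores:
        store.put(key, data)
    return len(stores)
```

Because the repository only sees `put` and `get`, the same object can be written to several platforms without the repository software needing to know which ones are behind the interface.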
Smart storage example:
format services
• Preservation methods affecting formats can be classified
in three stages (‘seamless flow’):
– Format identification and characterization (which format?)
– Preservation planning and technology watch (format risk
and implications)
– Preservation action, migration, etc. (what to do with the
format)
• Format-based services tend to be ad hoc processes for
which some tools are available
– E.g. PRONOM-DROID from The National Archives (UK)
– PRONOM is an online registry of technical information,
such as file format signatures
– DROID is a downloadable file format identification tool that
applies these signatures
• These and other tools could be used in a more
coordinated manner.
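The idea of signature-based format identification can be illustrated with a minimal sketch. It is not DROID and does not use the PRONOM registry: the `SIGNATURES` table below is a tiny hand-written subset of well-known leading byte sequences, and the function name `identify` is an assumption made for illustration.

```python
from typing import Optional

# Well-known leading byte signatures ("magic numbers"). This is a tiny
# illustrative subset, not the PRONOM registry.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"PK\x03\x04", "ZIP container"),
]


def identify(data: bytes) -> Optional[str]:
    """Return a format label if the data starts with a known signature."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return None  # unknown format: a candidate for manual review
```

A real identification tool would use a much larger signature set and would also examine bytes beyond the start of the file; the sketch covers only the simplest case.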
Smart storage DROID: concept
Smart storage DROID: scheduling/history
• Scheduling interface controls when a DROID
classification needs to be performed.
• Preserv 2 has developed a scheduling service that uses
the Darwin Calendar Server and iCalendar format.
• Provides a powerful scheduling service with many clients
already available - Apple iCal, Mozilla Sunbird, and others
- that can read and interpret the files so that past and
future events can be reviewed.
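As an illustration of why the iCalendar format suits this role, the following minimal sketch builds a one-event calendar file that standard calendar clients should be able to read. It is not the Preserv 2 scheduling service: the function name `make_event`, the product identifier and the event fields chosen are assumptions, and the summary text is assumed to contain no characters that need iCalendar escaping.

```python
from datetime import datetime, timezone


def make_event(uid: str, start: datetime, summary: str, stamp: datetime) -> str:
    """Build a minimal iCalendar (RFC 5545) document holding one event.

    Both datetimes are expected to be timezone-aware; they are written in UTC.
    """
    fmt = "%Y%m%dT%H%M%SZ"  # UTC date-time form used by iCalendar
    lines = [
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//example//droid-scheduler-sketch//EN",  # hypothetical product id
        "BEGIN:VEVENT",
        f"UID:{uid}",
        f"DTSTAMP:{stamp.astimezone(timezone.utc).strftime(fmt)}",
        f"DTSTART:{start.astimezone(timezone.utc).strftime(fmt)}",
        f"SUMMARY:{summary}",
        "END:VEVENT",
        "END:VCALENDAR",
    ]
    # iCalendar content lines are terminated by CRLF.
    return "\r\n".join(lines) + "\r\n"
```

Past events kept in the same calendar double as a history of classification runs, which is the review capability described above.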
Smart storage DROID: OAI-PMH interface
• An OAI-PMH interface to open storage discovers the latest
objects to have been deposited and which are ready for
format classification.
• This discovery could also be performed by simpler RSS or Atom-based
methods.
• The interface has since been expanded to allow export of
OAI-ORE resource maps in both RDF and Atom formats.
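Discovering recently deposited objects over OAI-PMH can be sketched as follows. This is a minimal sketch, not the project's harvester: it uses the standard `ListIdentifiers` verb with a `from` date, the base URL and the `SAMPLE` response are made up for illustration, and resumption tokens for large result sets are ignored.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace of OAI-PMH 2.0 response documents.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_identifiers_url(base_url: str, since: str, prefix: str = "oai_dc") -> str:
    """Build a ListIdentifiers request for records changed on or after `since`."""
    query = urlencode(
        {"verb": "ListIdentifiers", "metadataPrefix": prefix, "from": since}
    )
    return f"{base_url}?{query}"


def parse_headers(xml_text: str) -> list:
    """Extract (identifier, datestamp) pairs from a ListIdentifiers response."""
    root = ET.fromstring(xml_text)
    pairs = []
    for header in root.iterfind(".//oai:header", OAI_NS):
        identifier = header.findtext("oai:identifier", namespaces=OAI_NS)
        datestamp = header.findtext("oai:datestamp", namespaces=OAI_NS)
        pairs.append((identifier, datestamp))
    return pairs


# Made-up response fragment for illustration only.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:example.org:42</identifier>
      <datestamp>2008-09-01</datestamp>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""
```

Each returned identifier and datestamp pair names an object that is a candidate for format classification since the last harvest date.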
Smart storage DROID: implementation
[Figure: architecture diagram. Components shown: user interface
(calendar clients, e.g. iCal, Outlook, Sunbird); calendar server;
DROID scheduler and history (stores results of DROID events);
DROID-OAI harvester; OAI-PMH; open storage; repository; Web server
(HTTP); messaging (Atom?); machine interface, API. Labelled flows:
schedule event; url, date; is event done?; get results of event.
Legend: implemented / to be implemented.]
• Risk profiling
• The scheduler will invoke actions based on the results of
scanning by DROID allied to decision-making tools that use
intelligence from planning and technology watch tools, such as
– PRONOM,
– Plato preservation planning tool from the EC-funded
Planets project,
– and others.
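The decision step can be sketched as a lookup from identified format to risk level to action. This is a sketch under stated assumptions only: the format labels, risk levels and action names below are hypothetical placeholders, and in practice the risk information would come from planning and technology watch tools rather than a hard-coded table.

```python
# Hypothetical risk levels per identified format (placeholder labels, not
# real assessments of any format).
FORMAT_RISK = {"format-a": "low", "format-b": "medium", "format-c": "high"}

# Hypothetical mapping from risk level to the action the scheduler invokes.
ACTION_FOR_RISK = {"low": "none", "medium": "monitor", "high": "migrate"}


def plan_action(format_label: str) -> str:
    """Map an identified format to a preservation action via its risk level."""
    risk = FORMAT_RISK.get(format_label, "unknown")
    # Formats with no known risk profile fall through to manual review.
    return ACTION_FOR_RISK.get(risk, "review manually")
```

Keeping the two tables separate means the risk intelligence can be refreshed independently of the policy that decides which action each risk level triggers.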
Photo: Flickr/yourbartender
Summary: smart storage in the
storage scheme
• Binary stream
• File system: need to store multiple streams with permissions
• Content addressable: adds content validation and object identifiers,
metadata required to locate an object
• Open: adds error correction and recovery, places processing close to
storage, solves some bandwidth problems
• Smart: opens up the close-to-storage approach for application
development, transition to 'cloud' storage
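The content validation and object identifiers of the content-addressable tier can be illustrated with a minimal sketch. It assumes SHA-256 as the hash and an in-memory dictionary as the backing store; the class name `ContentAddressableStore` is hypothetical and does not describe any particular storage product.

```python
import hashlib


class ContentAddressableStore:
    """Illustrative in-memory store keyed by the SHA-256 digest of the content."""

    def __init__(self) -> None:
        self._blobs = {}

    def put(self, data: bytes) -> str:
        """Store the data and return its identifier, derived from the content."""
        object_id = hashlib.sha256(data).hexdigest()
        self._blobs[object_id] = data
        return object_id

    def get(self, object_id: str) -> bytes:
        """Return the data, re-hashing it first to validate the content."""
        data = self._blobs[object_id]
        if hashlib.sha256(data).hexdigest() != object_id:
            raise ValueError("stored content no longer matches its identifier")
        return data
```

Because the identifier is computed from the bytes themselves, identical content always receives the same identifier and any corruption is detectable on read.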
How smart storage addresses current storage issues – see full paper
Storage can become smarter
• Openness in its various forms, i.e. the ability to
move data freely and easily, needs to be
supplemented by decision-making that can be
automated based on the supplied intelligence
and information.
• In this way, open storage can become ‘smarter’.
http://preserv.eprints.org/
Thanks to