Transcript Bild 1 - Kungliga Biblioteket
Depositing e-material to The National Library of Sweden
www.kb.se
KB - Overview
1661 – First legal deposit law
1877 – Becomes a government institution
1996 – First steps in digitization
1997 – Kulturarw3, the first collection of the Swedish web
20?? – Deposit law expanded to include electronically published documents
KB – Aim of repository
• Be able to receive different kinds of data in different kinds of formats
• Be able to handle large amounts of incoming data (scalability)
• Have a flexible and modular design
• Be able to utilize services that can receive data from organizations with different technical capabilities
• A system for long-term preservation and presentation
Overview - Architecture
Reality – Types of material
• Will receive widely different kinds of materials
  – Different:
    • file formats
    • metadata formats
    • structure of data
    • naming schemas
• From a lot of different sources
  – Local file system, FTP, Database, URL on the web
  – Should still try to use the same services
• Solution:
  – Normalize received material to an internal format
  – Represent data + metadata as DIDL XML
Overview – Deposit system
Fundamentals of deposit system
• Modular design
• One internal format for representing packages
• Try to use as simple interfaces between services as possible
  – REST services (HTTP + XML)
  – Message queue where packages are dropped for the system to pick up
  – This makes the system independent of platform and programming framework
• Each module should be highly configurable with smaller sub-components
  – Build services as chains of simple components concerned with just one task
  – Use Spring Framework for configuration
Internal package format
• Uses Digital Item Declaration Language (DIDL)
  – An MPEG-21 standard
  – An XML format for both data and metadata
• Do not inline data, just metadata
• Store datastreams centrally and reference them
• 1 DIDL file = 1 "object"
• One package has:
  – ID
  – Type
  – List of Attributes (name/value pairs)
  – List of Metadata (as XML)
  – List of Resources (as references)
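The package structure listed above can be sketched as a small data model. This is only an illustration in Python (the system described in these slides works with Java objects); the class names, field names, and example values are assumptions, not the actual KB classes.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Metadata:
    name: str               # for example "MARCXML"
    xml: str                # the metadata itself, kept as an XML string
    description: str = ""   # optional


@dataclass
class Resource:
    id: str
    mimetype: str
    ref: str  # URL of the centrally stored datastream; the bytes are never inlined
    attributes: Dict[str, str] = field(default_factory=dict)
    metadata: List[Metadata] = field(default_factory=list)


@dataclass
class Package:
    id: str
    type: str
    attributes: Dict[str, str] = field(default_factory=dict)
    metadata: List[Metadata] = field(default_factory=list)
    resources: List[Resource] = field(default_factory=list)


# Hypothetical example values, for illustration only.
page = Resource(
    id="res-1",
    mimetype="image/jpeg",
    ref="http://resourcestore.example/res-1",
    attributes={"page-number": "5"},
)
package = Package(id="pkg-1", type="paged-object", resources=[page])
```

Because a Resource holds only a reference, a package stays small no matter how large its datastreams are.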
Internal package format
• Represent a package as a DIDL file
  – Parser to read a DIDL file into a Java object
  – Serializer to write a Java object to a DIDL file
• Usually works with the package as a Java object
• BUT:
  – Only plain XML is sent between services
  – Decouples services from programming language, anything that can handle XML is fine
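A minimal sketch of the parser/serializer pair, written in Python to underline that any language with an XML library can take part. The element names (DIDL, Item, Descriptor, Statement, Component, Resource) and the namespace come from the DIDL standard, but the way the package type and attributes are mapped onto Descriptor/Statement pairs is an assumption made for this example and not necessarily the mapping used at KB.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

DIDL_NS = "urn:mpeg:mpeg21:2002:02-DIDL-NS"  # MPEG-21 DIDL namespace
ET.register_namespace("didl", DIDL_NS)


def q(tag: str) -> str:
    """Qualify a tag name with the DIDL namespace."""
    return f"{{{DIDL_NS}}}{tag}"


@dataclass
class Package:
    id: str
    type: str
    attributes: Dict[str, str] = field(default_factory=dict)
    resources: List[Tuple[str, str]] = field(default_factory=list)  # (mimetype, url)


def serialize(pkg: Package) -> bytes:
    root = ET.Element(q("DIDL"))
    item = ET.SubElement(root, q("Item"), {"id": pkg.id})
    # Assumed convention: the type and each attribute become a plain-text
    # Descriptor/Statement pair, keyed by the Descriptor id.
    for name, value in [("type", pkg.type), *sorted(pkg.attributes.items())]:
        descriptor = ET.SubElement(item, q("Descriptor"), {"id": name})
        statement = ET.SubElement(descriptor, q("Statement"), {"mimeType": "text/plain"})
        statement.text = value
    for mimetype, url in pkg.resources:
        component = ET.SubElement(item, q("Component"))
        # Only a reference is written; the datastream itself is stored elsewhere.
        ET.SubElement(component, q("Resource"), {"mimeType": mimetype, "ref": url})
    return ET.tostring(root, encoding="utf-8")


def parse(xml_bytes: bytes) -> Package:
    item = ET.fromstring(xml_bytes).find(q("Item"))
    values = {
        d.get("id"): d.find(q("Statement")).text or ""
        for d in item.findall(q("Descriptor"))
    }
    pkg_type = values.pop("type")
    resources = [(r.get("mimeType"), r.get("ref")) for r in item.iter(q("Resource"))]
    return Package(item.get("id"), pkg_type, values, resources)


original = Package("pkg-1", "paged-object", {"page-number": "5"},
                   [("image/jpeg", "http://resourcestore.example/res-1")])
round_tripped = parse(serialize(original))
```

A service only ever sees the bytes produced by `serialize`; whether it turns them back into an object, and in which language, is its own business.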
Internal package format - Attributes
• Attributes
  – Name/value pairs (Example: page-number = 5)
  – Flexible way of representing additional information about a package
In DIDL: [XML example on slide]
Internal package format - Metadata
• Metadata
  – Name
  – Description (optional)
  – XML that represents the metadata
In DIDL: [XML example on slide]
Internal package format - Resource
• Resource
  – ID
  – Mimetype
  – List of Attributes (for this Resource only)
  – List of Metadata (for this Resource only)
  – Reference to the datastream (a URL)
In DIDL: [XML example on slide]
Package normalizer
• Takes data in one format and creates an internal package
  – Creates the DIDL file and writes the datastreams to the Resource Store
• Places the package on a queue for further processing
• One normalizer per type of data package delivered
  – Has to know the contract for the delivered data
• Looks in an inbox at regular intervals for new packages
  – File system directory
    • Data could be delivered via FTP or file copy on local file system
  – URL
    • OAI-PMH server with metadata that has links to actual resources
    • OAI-ORE fits in nicely here
  – Database
  – Web form operated by human
  – Anything else?
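One polling pass of a file-system normalizer might look roughly like the sketch below. The delivery contract is invented for the example (each sub-directory of the inbox is one delivery, and every file in it is one datastream), the package is a plain dictionary rather than a DIDL file, and an in-process `queue.Queue` stands in for the message queue.

```python
import queue
import shutil
import tempfile
import uuid
from pathlib import Path


def normalize_inbox(inbox: Path, resource_store: Path, outbox: queue.Queue) -> int:
    """Run one polling pass and return the number of packages created."""
    created = 0
    for delivery in sorted(p for p in inbox.iterdir() if p.is_dir()):
        package_id = uuid.uuid4().hex
        target = resource_store / package_id
        target.mkdir(parents=True)
        resources = []
        for source in sorted(p for p in delivery.iterdir() if p.is_file()):
            stored = target / source.name
            shutil.copy2(source, stored)                 # datastream goes to the Resource Store
            resources.append(stored.resolve().as_uri())  # the package keeps only a reference
        outbox.put({"id": package_id, "type": "file-delivery", "resources": resources})
        shutil.rmtree(delivery)  # handled, so the next pass does not pick it up again
        created += 1
    return created


# Usage with throwaway directories and placeholder content.
with tempfile.TemporaryDirectory() as tmp:
    inbox, store = Path(tmp, "inbox"), Path(tmp, "store")
    (inbox / "delivery-1").mkdir(parents=True)
    (inbox / "delivery-1" / "page1.jpg").write_bytes(b"placeholder bytes")
    store.mkdir()
    packages = queue.Queue()
    first_pass = normalize_inbox(inbox, store, packages)
    second_pass = normalize_inbox(inbox, store, packages)  # inbox is empty now
    package = packages.get_nowait()
```

Checking "at regular intervals" then amounts to calling `normalize_inbox` from a scheduler or a sleep loop; normalizers for other sources (URL, database, web form) would differ only in how they discover and fetch a delivery.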
Enricher
Enriching a package
• REST service
  – POST a DIDL file and get it back enriched
• Implemented with Spring and a chain of enrichers
  – Each doing one specific task, for example adding a urn:nbn
  – Some only make sense for a specific kind of package
  – Can be a different set of enrichers for different package types
• Examples of enrichers
  – Adding urn:nbn
  – Updating MARCXML to reflect that it is an electronic copy
  – Adding extracted technical metadata from JHOVE or DROID
  – And so on...
• Possible to have enrichers that involve human intervention
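The chain-of-enrichers idea can be sketched without Spring as a table that maps a package type to an ordered list of single-purpose functions. Everything concrete here is an assumption: the simplified `<package>` XML stands in for DIDL, the package types are invented, the identifier is a placeholder rather than a real URN:NBN, and `mark_electronic_copy` only stands in for the MARCXML update mentioned above.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List

Enricher = Callable[[ET.Element], None]  # each enricher changes the package in place


def add_urn_nbn(package: ET.Element) -> None:
    """Add an identifier attribute unless the package already has one."""
    if any(a.get("name") == "urn:nbn" for a in package.findall("attribute")):
        return
    # Placeholder value: how real URN:NBNs are assigned is not covered here.
    value = f"urn:nbn:PLACEHOLDER:{package.get('id', '')}"
    ET.SubElement(package, "attribute", {"name": "urn:nbn", "value": value})


def mark_electronic_copy(package: ET.Element) -> None:
    """Stand-in for updating MARCXML to say the item is an electronic copy."""
    ET.SubElement(package, "attribute", {"name": "carrier", "value": "electronic"})


# Hypothetical package types, each with its own chain.
CHAINS: Dict[str, List[Enricher]] = {
    "paged-object": [add_urn_nbn],
    "e-book": [add_urn_nbn, mark_electronic_copy],
}


def enrich(xml_bytes: bytes) -> bytes:
    """What a POST handler would do: XML in, enriched XML out."""
    package = ET.fromstring(xml_bytes)
    for enricher in CHAINS.get(package.get("type", ""), []):
        enricher(package)
    return ET.tostring(package)


enriched = ET.fromstring(enrich(b'<package id="p1" type="e-book"/>'))
attribute_names = [a.get("name") for a in enriched.findall("attribute")]
```

Adding a new enrichment step means writing one function and listing it in the chains where it applies; the other steps are untouched.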
Validator
Validating a package
• Similar in design to Enricher
• REST service
  – POST a DIDL file and get back a status report
• Implemented with Spring and a chain of tests
  – Each test doing one specific task
  – Some only make sense for a specific kind of package
  – Can be a different set of tests for different package types
• Examples of tests
  – Verifying that a PDF is readable
  – Validating metadata
  – And so on...
• Possible to have tests that involve human intervention
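A validator differs from an enricher mainly in what it returns: a report instead of a changed package. The sketch below assumes a package represented as a plain dictionary with hypothetical keys (`pdf_bytes`, `title`). The PDF test only checks for the `%PDF-` file header, which is much weaker than real readability checking with a tool such as JHOVE.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CheckResult:
    check: str
    passed: bool
    message: str = ""


Package = Dict[str, object]
Check = Callable[[Package], CheckResult]


def pdf_has_header(package: Package) -> CheckResult:
    """Cheap sanity test: a PDF file starts with the bytes '%PDF-'."""
    data = package.get("pdf_bytes", b"")
    ok = isinstance(data, bytes) and data.startswith(b"%PDF-")
    return CheckResult("pdf-header", ok, "" if ok else "missing %PDF- header")


def has_title(package: Package) -> CheckResult:
    """Minimal metadata test: the title must not be empty."""
    ok = bool(package.get("title"))
    return CheckResult("metadata-title", ok, "" if ok else "title is empty")


def validate(package: Package, checks: List[Check]) -> List[CheckResult]:
    # Run every test instead of stopping at the first failure,
    # so the status report is complete.
    return [check(package) for check in checks]


report = validate({"pdf_bytes": b"%PDF-1.4\n", "title": ""}, [pdf_has_header, has_title])
accepted = all(result.passed for result in report)
```

The list of results is what would be serialized into the status report returned to the caller.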
Ingest
• REST service
  – PUT a DIDL file and get back an id pointing into the repository
• In future:
  – Perhaps add possibility to update or delete package in repository using POST and DELETE
• Abstraction that hides the actual repository used
  – Can change repository without affecting rest of the system
  – Repository-dependent enrichments and tests can be done here
• We use Fedora as our repository
• The same principle is used for ingestion into the long-term preservation archive
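The abstraction that hides the repository can be sketched as an interface with interchangeable implementations. The names below (`Repository`, `InMemoryRepository`, `IngestService`) are hypothetical, and the in-memory class is only a stand-in; a Fedora-backed class would implement the same method.

```python
import itertools
from abc import ABC, abstractmethod
from typing import Dict


class Repository(ABC):
    """The only thing the rest of the system knows about the repository."""

    @abstractmethod
    def store(self, didl: bytes) -> str:
        """Persist the package and return its repository id."""


class InMemoryRepository(Repository):
    """Stand-in implementation used here instead of a real repository."""

    def __init__(self) -> None:
        self._items: Dict[str, bytes] = {}
        self._counter = itertools.count(1)

    def store(self, didl: bytes) -> str:
        repo_id = f"mem:{next(self._counter)}"
        self._items[repo_id] = didl
        return repo_id


class IngestService:
    def __init__(self, repository: Repository) -> None:
        self._repository = repository

    def put(self, didl: bytes) -> str:
        """What the PUT handler would call: DIDL in, repository id out."""
        # Repository-dependent enrichments and tests would run here.
        return self._repository.store(didl)


service = IngestService(InMemoryRepository())
first_id = service.put(b"<DIDL/>")
second_id = service.put(b"<DIDL/>")
```

Swapping the repository, or pointing a second instance at the long-term preservation archive, then only requires passing a different `Repository` to `IngestService`.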
Fedora
• Fedora is used as the repository
  – Reasons why:
    • Open-source
    • Actively developed
    • Large (and growing) user base
    • Good design and nice features
  – We use version 2.2
    • obviously going to move to 3.0 in the future
• Used for storage and presentation
  – Stores both relevant datastreams and metadata
  – Has relations between datastreams (e.g. sequence-number)
• Possible to search against the repository
  – As standard search against DC fields
Fedora – Content Models
• Content Model
  – A contract of available Datastreams and Behaviour Definitions in a Fedora record
    • In Fedora 2.x just an informal agreement
    • But from Fedora 3.0 a new mechanism exists for this
      – Called Content Model Architecture (CMA)
  – A Content Model could involve multiple Fedora records
    • Atomistic versus Compound model
  – Also specifies relations
    • Both between datastreams and Fedora records
    • Using RDF in the RELS-EXT datastream
Fedora - An example Content Model
• PagedObject Content Model
  – Used for digitized material where each page is an image
  – Atomistic, i.e. one page becomes one Fedora record
  – Also has one Fedora record for the object as a whole
• Record for the object
  – Datastreams
    • DC
    • MODS
    • MARCXML
  – Behaviour Definitions
    • view
    • list
    • getPreview
  – Relations
    • member of a collection
    • member of OAI-PMH set
• Record for an individual page
  – Datastreams
    • WEBIMAGE
    • THUMBNAIL
  – Behaviour Definitions
    • getImage
    • getZoom
  – Relations
    • member of the object
    • sequence-number etc.
Fedora - Ingest
• Gets a DIDL package and creates corresponding FOXML
  – Different FOXML for different Content Models
  – Which Content Model depends on Type of package
  – A Content Model can result in multiple FOXML files (and accordingly multiple Fedora records)
• Uses Fedora's Web Services to ingest the FOXML to the repository
• The datastreams are also transferred to the Fedora repository
• (Also a urn:nbn is mapped to the object's location in Fedora)
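How one package can fan out into several Fedora records is sketched below for the PagedObject model. The sketch stops short of producing FOXML: it only plans which records are needed, with which datastreams and relations. The `RecordPlan` class, the relation names, and the `paged-object` type key are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class RecordPlan:
    label: str
    datastreams: List[str]
    relations: Dict[str, str] = field(default_factory=dict)


def plan_paged_object(package_id: str, page_urls: List[str]) -> List[RecordPlan]:
    """Atomistic model: one record for the object plus one record per page."""
    records = [RecordPlan(package_id, ["DC", "MODS", "MARCXML"])]
    for number, _url in enumerate(page_urls, start=1):
        records.append(
            RecordPlan(
                label=f"{package_id}-page-{number}",
                datastreams=["WEBIMAGE", "THUMBNAIL"],
                relations={"member-of": package_id, "sequence-number": str(number)},
            )
        )
    return records


# The package type selects the Content Model, and with it the planner.
PLANNERS: Dict[str, Callable[[str, List[str]], List[RecordPlan]]] = {
    "paged-object": plan_paged_object,
}


def plan_records(package_type: str, package_id: str, page_urls: List[str]) -> List[RecordPlan]:
    return PLANNERS[package_type](package_id, page_urls)


plans = plan_records("paged-object", "obj1", ["http://store.example/p1", "http://store.example/p2"])
```

Each `RecordPlan` would then be rendered as one FOXML file and ingested separately, with the relations ending up as RDF in RELS-EXT.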
Fedora - Access
• Built-in search system
  – Search for DC terms and some Fedora terms
• Built-in OAI-PMH provider
  – We give access to DC, MODS and MARCXML
• Built-in RDF Query Server
  – Query against the RDF in RELS-EXT
• In future: OAI-ORE provider for Fedora
• We provide our own viewer for digitized objects
  – Developed with Google Web Toolkit (GWT)
  – Has one tab with an overview of all pages
  – Another tab with an individual page with zooming functionality and the ability to navigate between pages
  – Some simple metadata displayed
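For harvesters, access through the OAI-PMH provider comes down to plain HTTP GET requests. The sketch below builds a `ListRecords` request URL. The base URL is a made-up placeholder; `oai_dc` is the Dublin Core prefix that the OAI-PMH protocol requires every provider to support, while the prefixes for MODS and MARCXML are chosen by each repository and can be discovered with the `ListMetadataFormats` verb.

```python
from urllib.parse import urlencode


def list_records_url(base_url: str, metadata_prefix: str, set_spec: str = "") -> str:
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec  # optional: restrict the harvest to one OAI-PMH set
    return f"{base_url}?{urlencode(params)}"


# Placeholder base URL, not the address of an actual provider.
url = list_records_url("http://repository.example/oai", "oai_dc")
```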
Example
A demo of viewing e-material from our Fedora repository.
Accessing SOT from LIBRIS.