Bild 1 - Kungliga Biblioteket

Depositing e-material to The National Library of Sweden

www.kb.se

KB - Overview

1661 – First legal deposit law
1877 – Becomes a government institution
1996 – First steps in digitization
1997 – Kulturarw3 - the first collection of the Swedish web
20?? – Deposit law expanded to include electronically published documents

KB – Aim of repository

• Be able to receive different kinds of data in different kinds of formats
• Be able to handle large amounts of incoming data (scalability)
• Have a flexible and modular design
• Be able to utilize services that can receive data from organizations with different technical capabilities
• A system for long-term preservation and presentation

Overview - Architecture

Reality – Types of material

• Will receive widely different kinds of materials
  – Different:
    • file formats
    • metadata formats
    • structure of data
    • naming schemas
• From a lot of different sources
  – Local file system, FTP, Database, URL on the web
  – Should still try to use the same services
• Solution: Normalize received material to an internal format
  – Represent data + metadata as DIDL XML

Overview – Deposit system

Fundamentals of deposit system

• Modular design
• One internal format for representing packages
• Try to use as simple interfaces between services as possible
  – REST services (HTTP + XML)
  – Message Queue to drop packages for the system in
  – This makes the system independent of platform and programming framework
• Each module should be highly configurable with smaller sub-components
  – Build services as chains of simple components concerned with just one task
  – Use Spring Framework for configuration

Internal package format

• Uses Digital Item Declaration Language (DIDL)
  – An MPEG-21 standard
  – An XML format for both data and metadata
• Do not inline data, just metadata
• Store datastreams centrally and reference them
• 1 DIDL file = 1 "object"
• One package has:
  – ID
  – Type
  – List of Attributes (name/value pairs)
  – List of Metadata (as XML)
  – List of Resources (as references)
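The package structure listed above can be sketched as a small data model. This is only an illustration: the deck describes Java objects, Python is used here for brevity, and all class names, field names, and example values are assumptions rather than KB's actual code. The Resource fields follow the Resource slide later in the deck.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Metadata:
    name: str
    xml: str                           # the metadata record itself, as an XML string
    description: Optional[str] = None  # optional, as on the Metadata slide


@dataclass
class Resource:
    id: str
    mimetype: str
    ref: str                           # URL of the centrally stored datastream
    attributes: Dict[str, str] = field(default_factory=dict)
    metadata: List[Metadata] = field(default_factory=list)


@dataclass
class Package:
    id: str
    type: str
    attributes: Dict[str, str] = field(default_factory=dict)
    metadata: List[Metadata] = field(default_factory=list)
    resources: List[Resource] = field(default_factory=list)


# Example values are invented for illustration only.
page = Resource(id="r1", mimetype="image/tiff",
                ref="http://example.org/store/r1.tif",
                attributes={"page-number": "5"})
package = Package(id="p1", type="paged-object", resources=[page])
```

Note that the package holds only a reference to each datastream, never the bytes themselves, which matches the "do not inline data" rule.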

Internal package format

• Represent a package as a DIDL file
  – Parser to read a DIDL file into a Java object
  – Serializer to write a Java object to a DIDL file
• Usually works with the package as a Java object
• BUT:
  – Only plain XML is sent between services
  – Decouples services from programming language, anything that can handle XML is fine
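Because only plain XML crosses service boundaries, any language with an XML library can implement the parser and serializer. The sketch below shows the round trip in Python instead of Java. The namespace is the standard MPEG-21 DIDL one; the mapping of a package onto Item, Component and Resource elements, and the simplified Package class, are assumptions made for this example and not KB's actual profile.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List, Tuple

DIDL_NS = "urn:mpeg:mpeg21:2002:02-DIDL-NS"
ET.register_namespace("didl", DIDL_NS)


def _tag(name: str) -> str:
    """Return the namespace-qualified tag name ElementTree expects."""
    return f"{{{DIDL_NS}}}{name}"


@dataclass
class Package:
    # Simplified, hypothetical package: an id plus (mimetype, url) references.
    id: str
    resources: List[Tuple[str, str]] = field(default_factory=list)


def serialize(package: Package) -> str:
    """Write a package object out as a DIDL XML string."""
    root = ET.Element(_tag("DIDL"))
    item = ET.SubElement(root, _tag("Item"), {"id": package.id})
    for mimetype, url in package.resources:
        component = ET.SubElement(item, _tag("Component"))
        # Data is referenced, never inlined.
        ET.SubElement(component, _tag("Resource"),
                      {"mimeType": mimetype, "ref": url})
    return ET.tostring(root, encoding="unicode")


def parse(didl_xml: str) -> Package:
    """Read a DIDL XML string back into a package object."""
    item = ET.fromstring(didl_xml).find(_tag("Item"))
    resources = [(r.get("mimeType"), r.get("ref"))
                 for r in item.iter(_tag("Resource"))]
    return Package(id=item.get("id"), resources=resources)


original = Package("pkg-1", [("image/tiff", "http://example.org/store/1.tif")])
round_tripped = parse(serialize(original))
```

A service written in another language only needs to agree on the XML, not on these classes.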

Internal package format - Attributes

• Attributes
  – Name/value pairs (Example: page-number = 5)
  – Flexible way of representing additional information about a package

In DIDL:

#Attributes foo=bar

Internal package format - Metadata

• Metadata
  – Name
  – Description (optional)
  – XML that represents the metadata

In DIDL:

...

Internal package format - Resource

• Resource
  – ID
  – Mimetype
  – List of Attributes (for this Resource only)
  – List of Metadata (for this Resource only)
  – Reference to the datastream (a URL)

In DIDL:

123456

Package normalizer

• Takes data in one format and creates an internal package
  – Creates the DIDL file and writes the datastreams to the Resource Store
• Places the package on a queue for further processing
• One normalizer per type of data package delivered
  – Has to know the contract for the delivered data
• Looks in an inbox at regular intervals for new packages
  – File system directory
    • Data could be delivered via FTP or file copy on local file system
  – URL
    • OAI-PMH server with metadata that has links to actual resources
    • OAI-ORE fits in nicely here
  – Database
  – Web form operated by human
  – Anything else?
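One polling pass of a file-system normalizer might look roughly like the sketch below. It assumes the simplest possible delivery contract (every file in the inbox is one single-file package), uses an in-process queue as a stand-in for the real message queue, and represents the package as a plain dictionary instead of a DIDL file. The function name, package type, and directory layout are all hypothetical.

```python
import queue
import shutil
import tempfile
import uuid
from pathlib import Path


def normalize_inbox(inbox: Path, resource_store: Path, out: queue.Queue) -> int:
    """Turn every file found in the inbox into a package on the queue."""
    processed = 0
    for delivered in sorted(inbox.iterdir()):
        if not delivered.is_file():
            continue
        package_id = uuid.uuid4().hex
        # Write the datastream to the central resource store.
        stored = resource_store / f"{package_id}{delivered.suffix}"
        shutil.move(str(delivered), str(stored))
        # The package carries only a reference to the datastream.
        out.put({"id": package_id,
                 "type": "single-file",
                 "resources": [stored.resolve().as_uri()]})
        processed += 1
    return processed


# Demo with temporary directories; a real deployment would run this on a timer.
work_queue: queue.Queue = queue.Queue()
with tempfile.TemporaryDirectory() as tmp:
    inbox = Path(tmp, "inbox")
    store = Path(tmp, "store")
    inbox.mkdir()
    store.mkdir()
    (inbox / "report.pdf").write_bytes(b"%PDF-1.4 demo")
    processed = normalize_inbox(inbox, store, work_queue)
    remaining = list(inbox.iterdir())
    stored_names = [p.suffix for p in store.iterdir()]
```

A normalizer for another source (URL, database, web form) would differ only in how it discovers and reads deliveries; what it puts on the queue stays the same.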

Enricher

Enriching a package

• REST service
  – POST a DIDL file and get it back enriched
• Implemented with Spring and a chain of enrichers
  – Each doing one specific task, for example adding a urn:nbn
  – Some only make sense for a specific kind of package
  – Can be a different set of enrichers for different package types
• Examples of enrichers
  – Adding urn:nbn
  – Updating MARCXML to reflect that it is an electronic copy
  – Adding extracted technical metadata from JHOVE or DROID
  – And so on...
• Possible to have enrichers that involve human intervention
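The chain-of-enrichers idea can be illustrated with plain functions. In the real system the chain is wired up with Spring; here a dictionary plays that role. Everything in this sketch is hypothetical, including the package types, the attribute names, and in particular the urn:nbn value, which is a placeholder and not a real identifier scheme.

```python
from typing import Callable, Dict, List

Package = dict
Enricher = Callable[[Package], Package]


def add_urn_nbn(package: Package) -> Package:
    """Attach a persistent identifier if the package lacks one."""
    attributes = dict(package.get("attributes", {}))
    # Placeholder identifier, invented for this example.
    attributes.setdefault("urn:nbn", f"urn:nbn:se:example-{package['id']}")
    return {**package, "attributes": attributes}


def mark_electronic_copy(package: Package) -> Package:
    """Stand-in for updating MARCXML to say this is an electronic copy."""
    attributes = dict(package.get("attributes", {}))
    attributes["electronic-copy"] = "true"
    return {**package, "attributes": attributes}


# A different chain per package type; unknown types get the default chain.
CHAINS: Dict[str, List[Enricher]] = {
    "default": [add_urn_nbn],
    "paged-object": [add_urn_nbn, mark_electronic_copy],
}


def enrich(package: Package) -> Package:
    """Run the package through each enricher of its chain, in order."""
    for enricher in CHAINS.get(package["type"], CHAINS["default"]):
        package = enricher(package)
    return package


enriched = enrich({"id": "42", "type": "paged-object"})
```

Each enricher returns a new package and touches one concern only, so adding or reordering steps is a configuration change.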

Validator

Validating a package

Similar in design to Enricher

• REST service
  – POST a DIDL file and get back a status report
• Implemented with Spring and a chain of tests
  – Each test doing one specific task
  – Some only make sense for a specific kind of package
  – Can be a different set of tests for different package types
• Examples of tests
  – Verifying that a PDF is readable
  – Validating metadata
  – And so on...
• Possible to have tests that involve human intervention
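A validator can follow the same pattern, except that each test reports a problem and leaves the package unchanged. The sketch below is a guess at the general shape only: the test names, the dictionary-based package, and the layout of the status report are assumptions, and the metadata test checks XML well-formedness rather than schema validity.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List, Optional

# A test returns an error message, or None when the package passes.
Check = Callable[[dict], Optional[str]]


def has_resources(package: dict) -> Optional[str]:
    return None if package.get("resources") else "package has no resources"


def metadata_is_well_formed(package: dict) -> Optional[str]:
    for name, xml_text in package.get("metadata", {}).items():
        try:
            ET.fromstring(xml_text)
        except ET.ParseError as exc:
            return f"metadata '{name}' is not well-formed XML: {exc}"
    return None


# One list of tests per package type, with a default for unknown types.
CHECKS: Dict[str, List[Check]] = {
    "default": [has_resources, metadata_is_well_formed],
}


def validate(package: dict) -> dict:
    """Run every test for the package type and collect a status report."""
    checks = CHECKS.get(package.get("type"), CHECKS["default"])
    errors = [msg for msg in (check(package) for check in checks) if msg]
    return {"valid": not errors, "errors": errors}


good_report = validate({"type": "book", "resources": ["http://example.org/1"],
                        "metadata": {"dc": "<dc/>"}})
bad_report = validate({"type": "book", "resources": [],
                       "metadata": {"dc": "<dc>"}})
```

All tests run even after the first failure, so the depositor gets a complete report in one round trip.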

Ingest

• REST service
  – PUT a DIDL file and get back an id pointing into the repository
• In future:
  – Perhaps add the possibility to update or delete a package in the repository using POST and DELETE
• Abstraction that hides the actual repository used
  – Can change repository without affecting the rest of the system
  – Repository-dependent enrichments and tests can be done here
• We use Fedora as our repository
• The same principle is used for ingestion into the long-term preservation archive
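The abstraction that hides the repository can be pictured as a small interface with interchangeable back ends. The sketch below uses an in-memory stand-in where the real system would talk to Fedora or to the preservation archive; the class names, method names, and id format are hypothetical.

```python
import uuid
from abc import ABC, abstractmethod
from typing import Dict


class Repository(ABC):
    """What the ingest service needs from any repository back end."""

    @abstractmethod
    def store(self, didl_xml: str) -> str:
        """Store the package and return an id pointing into the repository."""


class InMemoryRepository(Repository):
    """Stand-in back end; a Fedora-backed class would implement the same method."""

    def __init__(self) -> None:
        self.objects: Dict[str, str] = {}

    def store(self, didl_xml: str) -> str:
        object_id = f"demo:{uuid.uuid4().hex}"
        self.objects[object_id] = didl_xml
        return object_id


class IngestService:
    """Handles the PUT of a DIDL file; knows nothing about the back end."""

    def __init__(self, repository: Repository) -> None:
        self.repository = repository

    def put(self, didl_xml: str) -> str:
        # Repository-dependent enrichments and tests would run here.
        return self.repository.store(didl_xml)


repository = InMemoryRepository()
service = IngestService(repository)
object_id = service.put("<DIDL/>")
```

Swapping the back end means passing a different Repository implementation to IngestService; callers of the REST service see no change.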

Fedora

• Fedora is used as the repository
  – Reasons why:
    • Open-source
    • Actively developed
    • Large (and growing) user base
    • Good design and nice features
  – We use version 2.2
    • obviously going to move to 3.0 in the future
• Used for storage and presentation
  – Stores both relevant datastreams and metadata
  – Has relations between datastreams (e.g. sequence-number)
• Possible to search against the repository
  – By default, searches against DC fields

Fedora – Content Models

• Content Model
  – A contract of available Datastreams and Behaviour Definitions in a Fedora record
    • In Fedora 2.x just an informal agreement
    • But from Fedora 3.0 a new mechanism exists for this
      – Called Content Model Architecture (CMA)
  – A Content Model could involve multiple Fedora records
    • Atomistic versus Compound model
  – Also specifies relations
    • Both between datastreams and Fedora records
    • Using RDF in the RELS-EXT datastream

Fedora - An example Content Model

• PagedObject Content Model
  – Used for digitized material where each page is an image
  – Atomistic, i.e. one page becomes one Fedora record
  – Also has one Fedora record for the object as a whole
• Record for the object
  – Datastreams: DC, MODS, MARCXML
  – Behaviour Definitions: view, list, getPreview
  – Relations: member of a collection, member of OAI-PMH set
• Record for an individual page
  – Datastreams: WEBIMAGE, THUMBNAIL
  – Behaviour Definitions: getImage, getZoom
  – Relations: member of the object, sequence-number etc.

Fedora - Ingest

• Gets a DIDL package and creates corresponding FOXML
  – Different FOXML for different Content Models
  – Which Content Model depends on Type of package
  – A Content Model can result in multiple FOXML files (and accordingly multiple Fedora records)
• Uses Fedora's Web Services to ingest the FOXML into the repository
• The datastreams are also transferred to the Fedora repository
• (Also a urn:nbn is mapped to the object's location in Fedora)
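How one package fans out into several Fedora records under an atomistic model such as PagedObject can be sketched as follows. Plain dictionaries stand in for FOXML files here; the datastream names come from the PagedObject slide, while the function name, the record id scheme, and the relation keys are invented for the example.

```python
from typing import Dict, List


def records_for_paged_object(package_id: str, page_urls: List[str]) -> List[Dict]:
    """Map one paged package to one object record plus one record per page."""
    records: List[Dict] = [{
        "pid": package_id,
        "datastreams": ["DC", "MODS", "MARCXML"],
        "relations": {},
    }]
    for number, url in enumerate(page_urls, start=1):
        records.append({
            "pid": f"{package_id}-page{number}",  # hypothetical id scheme
            "datastreams": ["WEBIMAGE", "THUMBNAIL"],
            "source": url,
            # These would be expressed as RDF in the RELS-EXT datastream.
            "relations": {"member-of": package_id, "sequence-number": number},
        })
    return records


records = records_for_paged_object(
    "demo:1",
    ["http://example.org/store/p1.tif", "http://example.org/store/p2.tif"],
)
```

A two-page object therefore yields three records, each of which would be serialized to its own FOXML file before ingest.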

Fedora - Access

• Built-in search system
  – Search for DC terms and some Fedora terms
• Built-in OAI-PMH provider
  – We give access to DC, MODS and MARCXML
• Built-in RDF Query Server
  – Query against the RDF in RELS-EXT
• In future: OAI-ORE provider for Fedora
• We provide our own viewer for digitized objects
  – Developed with Google Web Toolkit (GWT)
  – Has one tab with an overview of all pages
  – Another tab with an individual page with zooming functionality and the ability to navigate between pages
  – Some simple metadata displayed
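A harvester talks to the OAI-PMH provider with ordinary HTTP GET requests. The sketch below only builds such a request URL. The verb, metadataPrefix and set parameters, and the oai_dc prefix, are defined by the OAI-PMH protocol; the base URL is a placeholder, and the prefixes under which MODS and MARCXML are exposed depend on the repository configuration, so they are not guessed here.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str,
                     metadata_prefix: str = "oai_dc",
                     set_spec: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec  # restrict the harvest to one OAI-PMH set
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint, not the address of any real repository.
url = list_records_url("http://example.org/oai")
```

Fetching that URL would return an XML document of DC records, which a harvester pages through using the protocol's resumption tokens.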

Example

A demo of viewing e-material from our Fedora repository.

Accessing SOT from LIBRIS.