DLP Gate 0 Initial Planning Meeting Presentation


The Archival Problem & Infrastructure for Solutions
What needs to be archived and what needs to be done?
Richard Boulderstone
Director eStrategy
February 2010
What needs to be archived?
at most
least things...
a sample of most things...
What needs to be done?
1. Ingest

Transition from print to digital information resources

Heterogeneity, complexity and scale of digital content

Interactive items

Should we validate?
2. Storage

Long term authenticity of items

Loss or corruption

External References
3. Access

Securely share content with other legal deposit libraries

Long term access – Beyond life of original hardware & software platform (aka Digital Preservation)

Controlled access

Public domain, legal deposit, licensed

Content must be easy to find – Important!!!
Digital Library Architecture & Design Considerations

Started with Long-Term Storage problem

Wanted cost effective, highly resilient store (highly unlikely to lose items or have items corrupted), long term integrity

Analysis showed that magnetic tape solutions had limitations:

As size of store grows (petabytes) total recovery time can be long

Cost of tape not much less than commodity spinning disk

Wanted continuous validation to ensure content retained integrity

Disk storage market tends to focus ‘value-added’ products on high transaction rates, large capacity and high reliability – our requirement is low cost, large capacity and reasonable performance

Needed to share some of the archived content with other legal deposit libraries

Architect resilience into system – is ‘backup’ part of the architecture?
The Digital Library System Store
CONTENT STREAMS with operational DIGITAL LIBRARY INGEST
ID | Name | Description
1 | DIGITISED BOOKS & JOURNALS | Microsoft-funded digitised nineteenth century books accessible from DLS to the Reading Rooms via ILS. Ingest is complete.
2 | VOLUNTARY ELECTRONIC LEGAL DEPOSIT (VELD) | Deposited hand held media and offline electronic media submitted ‘in lieu’ of electronic legal deposit legislation; includes journals, books etc (formerly known as VDEP material; excludes scholarly e-journals which are identified as a separate stream). A very limited amount of this content is accessible from DLS in the Reading Rooms via ILS.
3 | FIELD SOUND RECORDINGS | Born digital field recordings created by Sound Archive – low volume, accessible from DLS to Reading Rooms via Sound Server.
4 | eJOURNALS | Scholarly e-Journals sent to the BL as part of the Voluntary Deposit scheme. Ingest of simple eJournals (Stream 2) is live. Progress towards ingest of complex eJournals is halted while the technical options are being considered. Work has started on a project to ingest ESTAR eJournals.
5a | DIGITAL NEWSPAPERS | Contemporary newspapers to be supplied digitally & directly to the BL by newspaper publishers. A pilot of a small number of titles from a single publisher is live. Progress is halted pending agreements from publishers.
5b | LEGACY DIGITISED NEWSPAPERS | Scanned historical newspapers already in the BL’s current collection. Ingest of JISC-1 is currently underway.
CONTENT STREAMS prioritised for DIGITAL LIBRARY INGEST
ID | Name | Description
6 | WEB ARCHIVING | An archive of web-sites gathered after gaining permission from rights holders. Following Legal Deposit Regulations the BL will be able to harvest sites from the uk domain without asking for permission. The project to ingest this collection is at the shape stage.
7 | BORN DIGITAL SOUND | Born digital sound recordings, acquisitions & voluntary deposits. Expanding, with implications for Gateway project & links to Moving Image stream. The project to ingest this collection is at the shape stage.
OTHER CONTENT STREAMS for DIGITAL LIBRARY INGEST
ID | Name | Description
8 | LEGACY DIGITISED MASTERS | All existing image-based digitisation products, largely on hand-held media with significant preservation risk.
9 | NEW DIGITISED MASTERS | Digitised images from forthcoming projects including Single-Sheet Digitisation, Vulnerable Items Imaging, possibly Greek Manuscripts. Fast Track to Safety has been used to process the Vulnerable Collection Items (VCI) images – these are now DLS-ready. The same process can be used for the single-sheet digitised objects & other Aleph catalogued content.
10 | DIGITISED SOUND | Digitised archival sound recordings funded by JISC (ASR1 & ASR2).
11 | eMANUSCRIPTS | Hybrid collections, comprising paper, computer & other media with significant issues (technical, privacy, rights management etc).
12 | DATABASES | Including large numerical datasets.
13 | DIGITAL MAPS | Including OS/OSNI MasterMap data and other contacts / agreements (actually a dataset rather than a ‘map’).
14 | E-THESES | Digitised & born digital theses funded by JISC (eThOS Project).
15 | ELECTRONIC GREY LITERATURE | Scientific, technical & business documents, conference papers, newsletters, e-govt documents, i.e. not readily available through commercial channels. Some of this content is already ingested to DLS via the VELD route (see 2 above).
16 | eBOOKS | Assumes born-digital material. Some of this content is already ingested to DLS via the VELD route (see 2 above).
17 | MOVING IMAGES | Digital recordings of television programmes, online podcasts etc.
1. Ingest – Some remaining issues





‘Dynamic’ Content – Update after initial deposit

Currently use snapshot, version-based approach

Other generic solutions?
Should we archive published outputs, underlying data or both?
Growing diversity of content

Should we validate to ensure long-term access?
Container formats may hide significant complexity (3D PDF)
Scale
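
The snapshot, version-based approach to ‘dynamic’ content mentioned on this slide can be illustrated roughly as follows. This is only a sketch under stated assumptions, not the DLS implementation: the names `Snapshot`, `VersionedItem`, `deposit` and `as_of` are hypothetical, deposits are assumed to arrive in date order, and a SHA-256 digest is assumed as the test for whether content has changed.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """One immutable capture of an item (hypothetical structure)."""
    version: int
    captured: str   # ISO date string supplied by the caller
    digest: str     # SHA-256 of the content at capture time
    content: bytes


@dataclass
class VersionedItem:
    """An item whose history is kept as a list of snapshots."""
    item_id: str
    snapshots: list = field(default_factory=list)

    def deposit(self, content: bytes, captured: str) -> Snapshot:
        """Store a new snapshot only if the content has changed."""
        digest = hashlib.sha256(content).hexdigest()
        if self.snapshots and self.snapshots[-1].digest == digest:
            return self.snapshots[-1]  # unchanged: reuse latest version
        snap = Snapshot(len(self.snapshots) + 1, captured, digest, content)
        self.snapshots.append(snap)
        return snap

    def as_of(self, date: str):
        """Latest snapshot captured on or before the given ISO date."""
        candidates = [s for s in self.snapshots if s.captured <= date]
        return candidates[-1] if candidates else None


# Example: a web page deposited three times, changed once.
page = VersionedItem("example-page")
first = page.deposit(b"v1", "2010-01-01")
same = page.deposit(b"v1", "2010-01-15")
second = page.deposit(b"v2", "2010-02-01")
```

Earlier versions are never overwritten, so a reader can ask for the item as it stood on a given date. The sketch does not address the open question on the slide of whether other generic solutions exist.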
Legal Deposit Libraries Shared Infrastructure




Large scale, highly resilient digital store
Complete copies of content at each node
Continuous validation & correction
Long term digital storage for BL content & eLegal deposit distribution
Distribution of eLegal deposit content (NLW, NLS and Oxford & Cambridge)
[Diagram: sites shown are Boston Spa, St. Pancras, Aberystwyth, Edinburgh (2010), Oxford Univ. and Cambridge Univ.]
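
A rough illustration of how complete copies at each node make continuous validation and correction possible. This is a minimal sketch, not the DLS design: it assumes a SHA-256 digest is recorded for each item at ingest, models each node as an in-memory dictionary, and uses hypothetical names (`validate_and_repair`, the node labels and `item-1`).

```python
import hashlib


def sha256(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def validate_and_repair(item_id, expected_digest, nodes):
    """Check every node's copy against the digest recorded at ingest.

    nodes maps a node name to a dict of item_id -> bytes.
    Missing or corrupted copies are overwritten from an intact copy.
    Returns the names of the nodes that were repaired.
    """
    good_copy = None
    damaged = []
    for name, store in nodes.items():
        data = store.get(item_id)
        if data is not None and sha256(data) == expected_digest:
            if good_copy is None:
                good_copy = data
        else:
            damaged.append(name)
    if good_copy is None:
        raise RuntimeError(f"no intact copy of {item_id} on any node")
    for name in damaged:
        nodes[name][item_id] = good_copy
    return damaged


# Example: one corrupted copy and one missing copy are both corrected.
original = b"digitised page image"
digest = sha256(original)
nodes = {
    "boston_spa": {"item-1": original},
    "st_pancras": {"item-1": b"digitised page imagX"},  # simulated corruption
    "aberystwyth": {},                                  # simulated loss
}
repaired = validate_and_repair("item-1", digest, nodes)
```

Running such a check repeatedly over the whole store is one way to read "continuous validation"; correction only works while at least one node still holds an intact copy, which is the argument for full replication.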
Agreement between UK Legal Deposit Libraries






Use of single IT infrastructure, based on BL Digital Library System, to share legal deposit content
Use of single ingest point (Boston Spa) for legal deposit content
Deployment of ‘nodes’ at BL, NLW & NLS for resilience, operational efficiency, autonomy of operation. Oxford and Cambridge to access content from BL node.
Consistent approach to preservation, metadata standards, SLAs (service level agreements), infrastructure operations.
Access controls
Trinity College Dublin will be included when legislation allows
Digital Library System Contents


Live Content Streams

Sound Archives (BL)

Voluntary Digital Donations (Vol. Scheme)

Nineteenth Century Digitised Books (BL)

Born Digital Newspapers (BL Pilot)

eJournals (Vol. Scheme)

Digitised Newspapers (BL)
Storage

>500,000 Digital Items

~50 Terabytes of Content
Long-Term Access (aka Digital Preservation)




Dedicated digital preservation team at BL
Digital Library System currently supports Bit-level Preservation – long term integrity of ingested ‘bits’.
Also need to support Content-level Preservation, where the DLS is able to provide long-term access to the
content, ensuring that users can render and use preserved content.
The Planets Project will deliver preservation modules for DLS in summer 2010.

Identification of at risk content

Support for file format migrations

Technology watch service
2. Storage – Some remaining issues




Ongoing cost

Storage

Can we share common costs (tools, technology watch, test-beds)?
Can ‘dynamic’ items be frozen and, more importantly, unfrozen?
How many file formats/software will become obsolete, requiring heroic efforts to recreate the original user experience?
How are external references maintained over time?
Digital Policy & Rights Management

To provide the widest possible access to our digital collections while respecting the terms and conditions of licenses, voluntary schemes and regulations.
Most content controlled by copyright/legal deposit restrictions – will this change?

Current access control supports:



Embargoed (no access), Authorised staff only, Reading room only
To be developed:

Internet

Single consecutive use at legal deposit libraries

Secure container so that readers can use own PCs to access legal deposit content
Mobile (anywhere) access
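
The three access levels listed as currently supported could be modelled along these lines. This is a hedged sketch only: `AccessLevel`, `Request` and `may_access` are hypothetical names, and the rule that authorised staff may also see reading-room content is an assumption, not something stated on the slide.

```python
from dataclasses import dataclass
from enum import Enum


class AccessLevel(Enum):
    """The three levels named on the slide."""
    EMBARGOED = "embargoed (no access)"
    STAFF_ONLY = "authorised staff only"
    READING_ROOM = "reading room only"


@dataclass(frozen=True)
class Request:
    """Context of one access attempt (hypothetical fields)."""
    is_authorised_staff: bool
    in_reading_room: bool


def may_access(level: AccessLevel, request: Request) -> bool:
    """Decide whether a request may see an item at the given level."""
    if level is AccessLevel.EMBARGOED:
        return False
    if level is AccessLevel.STAFF_ONLY:
        return request.is_authorised_staff
    if level is AccessLevel.READING_ROOM:
        # Assumption: authorised staff are also allowed here.
        return request.in_reading_room or request.is_authorised_staff
    raise ValueError(f"unknown access level: {level}")


# Example: a reader in the reading room and a remote member of the public.
reader = Request(is_authorised_staff=False, in_reading_room=True)
remote = Request(is_authorised_staff=False, in_reading_room=False)
reader_ok = may_access(AccessLevel.READING_ROOM, reader)
remote_ok = may_access(AccessLevel.READING_ROOM, remote)
```

The levels marked "to be developed" (internet, single use at legal deposit libraries, secure container, mobile) would need further request attributes and are deliberately left out.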
Content Navigation & Discovery


The most important issue
Catalogue model designed for two levels of hierarchy (Title & holdings)

Using Ex Libris Primo product as initial solution (Lucene full-text search engine embedded in product)

Much more needed – need help!

Persistent links

Full featured commercial search engines

Semantic web/Linked data/RDF Triples

Text mining, entity extraction

Information visualisation techniques

Hardware developments, mobile technologies, large displays
3. Access – Some remaining issues



With a huge quantity of content, how can people find what they want?
How can we support the development of sophisticated content navigation tools?
Where should we invest in resource discovery?
Conclusion





We have developed a highly-resilient, scalable store for digital items.
We will need to archive a very broad range of content.
The BL Digital Library System will be used by the legal deposit libraries to share legal deposit content.
However, this feels like the beginning of a very long journey!
We will need considerable help along the way.
Thank you.