British Library Web Archiving


Archiving the UK Web
Helen Hockx-Yu
Web Archiving Programme Manager
British Library
13 October 2008
Overview

The need to archive the web

The UK Domain

British Library web archiving activities

What do we collect and how do we select?

Workflow

Metadata

System components

Permission / legal issues

Selective archiving: pros and cons

Conclusion
The need to archive the web

To preserve e-publishing

Reduce the risk of losing ephemeral in-scope material –
the cited average life expectancy of a website is 44 days

Research requires reliable references to e-publishing

Contributes to the memory of the nation

Issues

Illegal (e.g. pornographic or terrorist) or low-value sites
Needle in a haystack: how to filter out useful material
How to foresee future research interests
How to keep pace with rapidly changing web technology
The UK domain – an overview

Territoriality criteria
 Has a .uk domain name
 Relates to UK-based individuals or organisations which use
other domain names, e.g. .org, .com, .net or alternatives
 Can be demonstrated, if an overseas publication, to be made
available by a UK-based publisher

6.1 million .uk domains registered in 2007, plus 50,000 other
domains which can be identified as published in the UK

Growth of 17% per annum until 2011, then 15% until 2016

Online publications in scope for legal deposit estimated at 3.9
million sites in 2007, rising to 14.6 million in 2016

Average website size is 25MB (domain harvesting model) or
180MB (permission-based harvesting)
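The territoriality criteria amount to a simple scope test. A minimal sketch in Python, where the `in_uk_scope` helper and its sample inputs are illustrative assumptions, not part of any British Library system:

```python
# Minimal sketch of the territoriality test described above.
# `in_uk_scope` and its inputs are illustrative, not part of
# any British Library system.

def in_uk_scope(domain: str, publisher_country: str = "") -> bool:
    """A site is in scope if it has a .uk domain name, or if it is
    made available by a UK-based publisher under another TLD
    (.org, .com, .net, etc.)."""
    if domain.lower().rstrip(".").endswith(".uk"):
        return True
    return publisher_country.upper() == "UK"

print(in_uk_scope("www.bl.uk"))                             # True
print(in_uk_scope("example.com", publisher_country="UK"))   # True
print(in_uk_scope("example.fr", publisher_country="FR"))    # False
```

In practice the second criterion requires manual judgement (identifying the publisher), which is one reason selection is curator-driven.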
British Library web archiving activities – history (1)

DOMAIN.UK project by the BL in 2001: a 6-month experiment
to select and capture 100 UK websites – leading to the ongoing
Web Archiving Programme

BL participated in and led the UK Web Archiving Consortium
(UKWAC), a collaborative initiative since 2004 to build a
collective national web archive

Permission-based selective archive

Underwent major system / data migration earlier this year:
from PANDAS to the IIPC toolset

Over 3,700 unique websites and over 11,400 instances,
totalling approximately 2TB
British Library web archiving activities – history (2)

The BL is the largest collector: to date it has archived 1,853 unique
websites and 5,264 instances, or 1TB of data

The BL infrastructure is shared by NLW, TNA, the Wellcome Trust and
JISC. Currently finalising the public access interface to the
archive

Web Archiving Programme: the BL as the point of first resort for
a comprehensive archive of material from the UK web domain
 embed web archiving within the BL's overall collection
development policy
 provide the infrastructure to collect, preserve and make
accessible web materials
British Library Web Collection Policy
The BL will collect freely available sites selectively
from the UK web space by prioritising the archiving
of sites of research value across the spectrum of
knowledge. In addition, the BL archives a selection
of sites which are representative of British social
history and cultural heritage in all its diversity and
across the regions. It will also archive a small
number of sites which demonstrate web
innovation.
- British Library Web Collection Policy
What does the British Library select?

Research
 stated intentions of sites themselves or potential to be primary
resources for research
 sites hosted by universities, government bodies, grey material
published by campaigning organisations and charities
 “research on the web”

Social history & culture
 Sites representing British cultural diversity, regional difference
and social significance (current trends, e.g. Facebook); key events
of national life; topicality

Innovation
 Award-winning sites (pre-selected); sites illustrating the web's
strengths in information, communication and training
Some examples

Institute for Criminal Policy Research
Live site

UK General Election 2005: Andrew George MP
Live site

Archived site (25 September 2005)
David Shaw’s Homepage
Live site

Archived site (13 March 2007)
London Bombings: ABC News Map
Live site

Archived site (21 April 2005)
Football Poets
Live site

Archived site (17 March 2005)
Archived site (25 June 2006)
E-publishing trends: Egg Bacon Chips and Beans
Live site
Archived site (17 Jan 2006)
How do we select?

Subject specialist team
 20 curators spending 5% of their time
 cultural change
 appreciation of web technology & understanding of
crawler limitations

Collaboration with external organisations
 Women's Library, V&A

UK Web Archiving Consortium

Recommendations (e.g. by colleagues)

Self-selection via UKWAC web form
Web Archiving Workflow

[Flowchart: the management workflow in three stages – Selection,
Permissions and Archiving – each detailed on the following slides.
Legend: decision, process, time delay, terminator.]
Workflow (1): Selection

[Flowchart: selections come from subject specialists, from
recommendations (e.g. by colleagues) and from self-selection
(e.g. via web form), and feed into the Process Selection step.]
Workflow (2): Permissions

[Flowchart:
1. Process the selection. If the site is already selected and no
follow-up is needed, take no action.
2. Otherwise register the site on the database (DB) and send a
permission request by letter, email or web form, requesting contact
details first where necessary.
3. If there is no response, update the DB. If the response is an
enquiry, update the DB, then assign and answer it.
4. If permission is granted, update the DB and add the site's details
to the Web Curator Tool (WCT); if refused, update the DB.]
Workflow (3): Archiving

[Flowchart:
1. Add details to WCT and create a harvest authorisation.
2. Create a target (set crawl options and harvest schedule), then
approve and tune it.
3. Once a target instance is harvested, review it (quality review).
4. If the harvest is not OK, reject the target instance; otherwise
endorse and archive it. End.]
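The archiving stage moves each target through a small set of states. A hedged sketch of that lifecycle, where the state names follow the slide but the class itself is illustrative and is not the Web Curator Tool's actual data model:

```python
# Illustrative sketch of the target lifecycle in the archiving stage.
# Not the Web Curator Tool's real data model.

class Target:
    def __init__(self, url, crawl_options=None, schedule="weekly"):
        # Create Target: set crawl options and harvest schedule
        self.url = url
        self.crawl_options = crawl_options or {}
        self.schedule = schedule
        self.state = "created"

    def approve(self):
        # Approve Target (it may also be tuned before harvesting)
        self.state = "approved"

    def harvest(self):
        # Target Instance Harvested
        self.state = "harvested"

    def review(self, harvested_ok):
        # Quality review: endorse and archive, or reject
        self.state = "archived" if harvested_ok else "rejected"

t = Target("http://www.example.org.uk/", {"depth": 3})
t.approve()
t.harvest()
t.review(harvested_ok=True)
print(t.state)  # archived
```

A rejected instance loops back to tuning the target and re-harvesting, which is where most of the manual quality-assurance effort goes.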
Metadata

Broad subject headings

Collections
 Catalogue records for collections
 26 thematic collections (e.g. blogs, British countryside, Digital lives)
 Event-based collections
 UK general election 2005
 London Terrorist attack 7/7/05
 Indian Ocean Tsunami
 London Mayor election 2008
 Olympic & Paralympics Games 2012

Titles

Descriptions, at instance, site and collection levels, but not consistently collected

Permissions

Harvest log files (generated during the crawling process)

Will start a project to define requirements for preservation metadata and workflow
System Components

Web Curator Tool (WCT)
 workflow management tool for selective archiving, jointly developed
by the BL and NLNZ
 4 iterations of development completed
 requirements collected from sources worldwide
 uses embedded Heritrix (1.14.1) for crawling
 archivists still learning to use crawl profiles and settings effectively

NutchWAX for (keyword) indexing and search

Access interface incorporating OSWM (open-source Wayback Machine)
 URL and keyword search
 Browsing by titles, subjects and collections

RESTful API based on OpenSearch standards

Smart crawler project: the BL works together with the Internet Archive
and other national libraries to develop smart capabilities for the
crawler
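An OpenSearch-based API is queried with a small set of standard template parameters. A hedged sketch of building such a query, where the base URL is illustrative (the BL's actual endpoint may differ) but searchTerms, startIndex and count are the standard OpenSearch parameter names:

```python
# Sketch of a keyword query against an OpenSearch-style endpoint.
# The base URL is an illustrative placeholder.

from urllib.parse import urlencode

def opensearch_query_url(base, terms, start=1, count=10):
    """Build a keyword-search URL using OpenSearch template
    parameters (searchTerms, startIndex, count)."""
    params = {"searchTerms": terms, "startIndex": start, "count": count}
    return base + "?" + urlencode(params)

url = opensearch_query_url("http://webarchive.example.org/search",
                           "general election 2005")
print(url)
# http://webarchive.example.org/search?searchTerms=general+election+2005&startIndex=1&count=10
```

The response would be an Atom/RSS-style result list, which keeps the archive's search interoperable with other OpenSearch clients.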
Permission / legal issues

Legal Deposit Libraries Act 2003 and extension of legal deposit to
non-print publications – not yet fully implemented

LDAP Web Archiving Sub-committee advising the Secretary of State on
implementation of the Act: regulation-based harvesting and archiving
of freely available online publications

Slow process with delays; earliest legislation expected April 2010

Low response rate to permission requests (25% success rate)

3rd-party cover can be prohibitive (e.g. multiple contributors)

Resources required on both sides for administration

Difficulties in tracking down the right person

Valuable websites disappearing before owners are found or contacted
Selective archiving: pros and cons

No access restrictions

Added value and depth to collection due to curatorial input

Better quality (look and feel) of archived sites due to detailed quality
assurance checks

Easier to navigate, as it is based on subject knowledge and tools

Offers better hooks to build hybrid library collections (as catalogue
records)
However:

Low response rate to permission requests

Labour intensive

Expensive

Only a small portion of the UK domain is being collected
Conclusions

Lack of national legislation is holding us back and is the biggest
issue for web archiving in the UK

Most national libraries in the world undertake selective and
domain harvesting at the same time

Selective and domain harvesting are not mutually exclusive;
they can and should complement each other

The BL needs to get up to speed

Need to shift web archiving to the centre of the BL's overall
collection and infrastructure

Selective archiving is expected to play a continued role
alongside domain harvesting
References

Alison Hill, Archiving in the British Library: an overview, June 2008

Philip Beresford & Ravish Mistry, Web Archiving with the Web
Curator Tool at the British Library, September 2008

LDAP Web Archiving Sub-Committee, Recommendation for the
Collection and Preservation of UK Online Publications, free of
charge and without access restriction, 22 May 2008
Any Questions?
Helen Hockx-Yu
Web Archiving Programme Manager
The British Library
96 Euston Road
London NW1 2DB
Tel: +44 (0)20 7412 7184
Mobile: 07766 474 368
Email: [email protected]
Website:
http://www.bl.uk/aboutus/stratpolprog/digi/webarch/index.html