British Library Web Archiving
Archiving the UK Web
Helen Hockx-Yu
Web Archiving Programme Manager
British Library
13 October 2008
Overview
The need to archive the web
The UK Domain
British Library web archiving activities
What do we collect and how do we select?
Workflow
Metadata
System components
Permission / legal issues
Selective archiving: pros and cons
Conclusion
2
The need to archive the web
To preserve e-publishing
Reduce the risk of losing ephemeral in-scope material –
the cited average life expectancy of a website is 44 days
Research requires reliable references to e-publishing
Contributes to the memory of the nation
Issues
Illegal (e.g. pornography, terrorist material) or worthless sites
Needle in a haystack: how to filter out the useful material
How to foresee future research interests
How to keep pace with rapidly changing web technology
3
The UK domain – an overview
Territoriality criteria
Has a .uk domain name
Relates to UK-based individuals or organisations which use
other domain names (.org, .com, .net, etc.) or alternatives
Can be demonstrated, if an overseas publication, to be made
available by a UK-based publisher
6.1 million .uk domains registered in 2007, plus 50,000 other
domains which can be identified as published in the UK
Projected growth of 17% per annum until 2011, then 15% until 2016
On-line publications in scope for legal deposit estimated at 3.9
million sites in 2007 rising to 14.6 million in 2016
Average size of websites is 25MB (domain harvesting model) or
180MB (permission-based harvesting)
4
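The projections above follow from simple compound growth; the sketch below reproduces them from the slide's own base figures and rates (the exact year ranges implied by "till 2011" and "till 2016" are my reading, not stated on the slide):

```python
# Compound-growth check of the slide's UK domain projections.
# Base figures and growth rates come from the slide; the year
# ranges (2008-2011, 2012-2016) are an assumption.

base_2007 = 6.1e6                        # .uk domains registered in 2007
domains_2011 = base_2007 * 1.17 ** 4     # 17% p.a. for 2008-2011
domains_2016 = domains_2011 * 1.15 ** 5  # 15% p.a. for 2012-2016

# In-scope online publications: 3.9M (2007) -> 14.6M (2016).
# At the 25MB average of the domain-harvesting model, one full
# snapshot of the 2016 in-scope estimate would need roughly:
snapshot_tb = 14.6e6 * 25 / 1e6          # MB -> TB (decimal)

print(f"{domains_2011:,.0f} domains in 2011")      # ~11.4 million
print(f"{domains_2016:,.0f} domains in 2016")      # ~23 million
print(f"~{snapshot_tb:.0f} TB per full snapshot")  # ~365 TB
```

The compounded figures stay broadly consistent with the slide's estimate that the in-scope subset alone grows from 3.9 to 14.6 million sites over the same period.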
British Library web archiving activities – history (1)
DOMAIN.UK project by the BL in 2001: a 6-month experiment
to select and capture 100 UK websites – leading to the ongoing
Web Archiving Programme
BL participated in and led the UK Web Archiving Consortium
(UKWAC), a collaborative initiative since 2004 to build a
collective national web archive
Permission-based selective archive
Underwent major system / data migration earlier this year:
from PANDAS to IIPC toolset
Over 3,700 unique websites and over 11,400 instances,
measuring approximately 2TB
5
British Library web archiving activities – history (2)
BL the largest collector: to date archived 1,853 unique
websites, 5,264 instances, or 1TB of data
BL infrastructure shared by NLW, TNA, Wellcome Trust and
JISC. Currently finalising the public access interface to the
archive
Web Archiving Programme: BL as the point of first resort for
a comprehensive archive of material from the UK Web
domain
embed web archiving within the BL's overall collection
development policy
provide the infrastructure to collect, preserve and make
accessible web materials
6
British Library Web Collection Policy
The BL will collect freely available sites selectively
from the UK web space by prioritising the archiving
of sites of research value across the spectrum of
knowledge. In addition, the BL archives a selection
of sites which are representative of British social
history and cultural heritage in all its diversity and
across the regions. It will also archive a small
number of sites which demonstrate web
innovation.
- British Library Web Collection Policy
7
What does the British Library select?
Research
stated intentions of sites themselves or potential to be primary
resources for research
sites hosted by universities and government bodies; grey literature
published by campaigning organisations and charities
“research on the web”
Social history & culture
Sites representing British cultural diversity, regional difference
and social significance (current trends, e.g. Facebook); key events
of national life; topicality
Innovation
Award-winning sites (pre-selected); sites illustrating the web's
strengths in information, communication and training
8
Some examples
Institute for Criminal Policy Research
Live site
UK General Election 2005: Andrew George MP
Live site
Archived site (25 September 2005)
David Shaw’s Homepage
Live site
Archived site (13 March 2007)
London Bombings: ABC News Map
Live site
Archived site (21 April 2005)
Football Poets
Live site
Archived site (17 March 2005)
Archived site (25 June 2006)
E-publishing trends: Egg Bacon Chips and Beans
Live site
Archived site (17 January 2006)
9
How do we select?
Subject specialist team
20 curators spending 5% of their time
cultural change
Appreciation of web technology & understanding of
crawler limitations
Collaboration with external organisations
Women’s Library, V&A
UK Web Archiving Consortium
Recommendations (e.g. from colleagues)
Self-selection via the UKWAC web form
10
Web Archiving Workflow
[Flowchart: the management workflow, framed by the Collection Development Policy and Legal Deposit legislation, spans three stages – Selection, Permissions and Archiving.]
Selection: nominations come from subject specialists, recommendations (e.g. by colleagues) and self-selection (e.g. via a web form), and feed into Process Selection.
Permissions: if a site is already selected, either follow it up or take no action. Otherwise register it on the database and send a permission request by letter, email or web form, requesting contact details where needed. No response: update the database. An enquiry: update the database, assign and answer it. Permission refused: update the database. Permission granted: update the database and add the details to the Web Curator Tool (WCT).
Archiving: create a harvest authorisation; create a target (set crawl options and harvest schedule); approve and tune the target; review each harvested target instance (quality review). If the harvest is not OK, reject the target instance; otherwise endorse it and archive it. End.
Legend: decision, process, time delay, terminator.
11
Workflow (1)
Selection
Candidate sites come from subject specialists, recommendations (e.g. by colleagues) and self-selection (e.g. via a web form); all feed into Process Selection.
12
Workflow (2)
Permissions
From Process Selection: if the site is already selected, either follow it up or take no action. Otherwise, register it on the database and send a permission request by letter, email or web form, requesting contact details where needed. If there is no response, update the database. If the response is an enquiry, update the database, then assign and answer it. If permission is refused, update the database; if granted, update the database and add the site's details to the WCT.
13
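The permission branch above reduces to a small decision procedure. A minimal sketch in Python; the function name, action strings and field names are mine, not the BL's actual database schema:

```python
# Sketch of the permissions decision logic in the workflow.
# All names (function, statuses, fields) are illustrative.

def process_permission_response(response):
    """Return the next workflow action for one permission request."""
    if response is None:                 # no reply received
        return "update_db"               # record the non-response
    if response.get("enquiry"):          # owner asked a question
        return "update_db_assign_and_answer"
    if response.get("granted"):          # permission granted
        return "add_details_to_wct"      # hand over to the Web Curator Tool
    return "update_db"                   # permission refused

# Only granted requests progress to archiving (the talk cites a
# roughly 25% success rate for permission requests overall):
requests = [None, {"granted": True}, {"enquiry": True}, {"granted": False}]
actions = [process_permission_response(r) for r in requests]
print(actions)
# → ['update_db', 'add_details_to_wct', 'update_db_assign_and_answer', 'update_db']
```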
Workflow (3)
Archiving
With the site's details in the WCT: create a harvest authorisation; create a target, setting crawl options and a harvest schedule; approve and tune the target. Once a target instance has been harvested, review it (quality review). If the harvest is not OK, reject the target instance; otherwise endorse it and archive it. End.
14
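The archiving stage above can be read as a crawl-and-review loop. A minimal sketch, assuming a target record carries its crawl options and schedule; every field name here is illustrative, not real WCT configuration:

```python
# Illustrative sketch of the archiving steps: a target holds crawl
# options and a harvest schedule, and each harvested instance is
# quality-reviewed. Field names are mine, not actual WCT settings.

target = {
    "seed_url": "http://www.example.org.uk/",  # hypothetical seed
    "crawl_options": {"max_hops": 3, "politeness_delay_ms": 500},
    "schedule": "monthly",
    "approved": True,
}

def review_instance(instance):
    """Endorse an instance that passed quality review, else reject it."""
    return "endorse_and_archive" if instance["harvested_ok"] else "reject"

instances = [{"harvested_ok": True}, {"harvested_ok": False}]
print([review_instance(i) for i in instances])
# → ['endorse_and_archive', 'reject']
```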
Metadata
Broad subject headings
Collections
Catalogue records for collections
26 thematic collections (e.g. blogs, British countryside, Digital lives)
Event-based collections
UK general election 2005
London Terrorist attack 7/7/05
Indian Ocean Tsunami
London Mayor election 2008
Olympic & Paralympic Games 2012
Titles
Descriptions, at instance, site and collection levels, but not consistently collected
Permissions
Harvest log files (generated during the crawling process)
A project will be started to define requirements for preservation metadata and workflow
15
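Taken together, the elements listed above suggest the shape of a per-instance metadata record. The sketch below is purely illustrative; the field names are mine, not a BL or WCT schema:

```python
# Illustrative shape of the metadata held for one archived instance,
# combining the elements listed on the slide. Field names are mine,
# not a real BL or WCT schema.

instance_metadata = {
    "title": "UK General Election 2005: Andrew George MP",
    "subject_heading": "Politics",             # broad subject heading
    "collection": "UK general election 2005",  # event-based collection
    "description": None,         # not consistently collected (see above)
    "permission_granted": True,  # recorded via the permissions workflow
    "harvest_log": "crawl-20050925.log",  # generated during crawling
    "harvest_date": "2005-09-25",
}

print(sorted(instance_metadata))
```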
System Components
Web Curator Tool (WCT)
workflow management tool for selective archiving jointly developed
by the BL and NLNZ
4 iterations of development completed
requirements collected from sources worldwide
uses embedded Heritrix (version 1.14.1) for crawling
archivists still learning to use crawl profiles and settings effectively
NutchWAX for (keyword) indexing and search
Access interface incorporating the Open Source Wayback Machine (OSWM)
URL and keyword search
Browsing by titles, subjects and collections
RESTful API based on OpenSearch standards
Smart crawler project: BL works together with the Internet Archive and
other national libraries to develop smart capabilities of the crawler
16
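The keyword search could be exercised through the OpenSearch-based API described above. Here is a minimal sketch of building such a query URL; the endpoint address is hypothetical, though the `q`/`startPage` parameters follow the OpenSearch URL-template convention:

```python
# Sketch of a keyword query against an OpenSearch-style endpoint.
# The base URL is hypothetical; OpenSearch defines the {searchTerms}
# and {startPage} template parameters that q/startPage stand in for.
from urllib.parse import urlencode

def build_query(base_url, search_terms, start_page=1):
    """Fill an OpenSearch-style URL template for a keyword search."""
    params = {"q": search_terms, "startPage": start_page}
    return f"{base_url}?{urlencode(params)}"

url = build_query("http://archive.example.org/search", "general election")
print(url)
# → http://archive.example.org/search?q=general+election&startPage=1
```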
Permission / legal issues
Legal Deposit Libraries Act 2003 and extension of legal deposit to non-print publications – not yet fully implemented
LDAP Web Archiving Sub-committee advising the Secretary of State on
implementation of the Act: regulation-based harvesting and archiving of
freely available online publications.
Slow process with delays; earliest legislation expected April 2010
Low response rate to the permission requests (25% success rate)
Third-party rights clearance can be prohibitive (e.g. multiple contributors)
Administration requires resources on both sides
Difficulties in tracking down the right person
Valuable websites disappearing before owners are found or contacted
17
Selective archiving: pros and cons
No access restrictions
Added value and depth to collection due to curatorial input
Better quality (look and feel) of archived sites due to detailed quality
assurance checks
Easier to navigate, as access is based on subject knowledge and tools
Offers better hooks to build hybrid library collections (as catalogue
records)
However:
Low response rate to permission requests
Labour intensive
Expensive
Only a small portion of the UK domain is being collected
18
Conclusions
The lack of national legislation is holding us back and is the biggest
issue for web archiving in the UK
Most national libraries in the world undertake selective and
domain harvesting at the same time
Selective and domain harvesting are not exclusive of each
other; they can and should complement each other
BL needs to get up to speed
Need to shift web archiving to the centre of BL’s overall
collection and infrastructure
Selective archiving is expected to play a continued role
alongside domain harvesting
19
Reference
Alison Hill, Archiving in the British Library: an overview, June 2008
Philip Beresford & Ravish Mistry, Web Archiving with the Web
Curator Tool at the British Library, September 2008
LDAP Web Archiving Sub-Committee, Recommendation for the
Collection and Preservation of UK Online Publications, free of
charge and without access restriction, 22 May 2008
20
Any Questions?
Helen Hockx-Yu
Web Archiving Programme Manager
The British Library
96 Euston Road
London NW1 2DB
Tel: +44 (0)20 7412 7184
Mobile: 07766 474 368
Email: [email protected]
Website:
http://www.bl.uk/aboutus/stratpolprog/digi/webarch/index.html
21