A survey of Web preservation initiatives

Michael Day UKOLN, University of Bath

[email protected]

7th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2003), Trondheim, Norway, 17-22 August 2003

Presentation overview

• The importance of the Web
• Challenges:
  – Technical, legal, and organisational challenges
• Approaches to collection:
  – Harvesting-based, selective, and deposit; combined approaches
• Discussion:
  – Collection and access policies, software, costs, long-term preservation

ECDL 2003, Trondheim, Norway, 17-22 August 2003

Importance of the Web

An all-pervasive communication medium:
• In research:
  – Scientists are 'increasingly reliant' on the Web for supporting research (Hendler, 2003)
• Wider societal role:
  – Personal communication, e-commerce, etc.
  – "… the information source of first resort for millions of readers" (Lyman, 2002)

The UKOLN study

Feasibility study produced for:
– Joint Information Systems Committee (JISC)
– Wellcome Library

– A survey of initiatives
– Recommendations for the JISC and Wellcome Library
– Supplementary legal study (Charlesworth)
– Published February 2003

http://library.wellcome.ac.uk/projects/archiving_reports.shtml


Technical challenges (1)

Size of Web:
– Surface web > 50 Tb (2000) … and still growing
– The 'deep Web'
– Scale of task means that Web-archiving needs to be a collaborative activity

Technical challenges (2)

Dynamic nature of Web:
– Web pages disappear on average after 75 days
– Many leave no trace
Evolution of Web-based technologies:
– Increasing reliance on databases, scripts, plug-ins, etc.
– A 'moving target'

Legal challenges

Copyright
Content liability, e.g.:
– Defamation
– Data protection
In the UK:
– Selective approach would be the safest solution (unless law changes)
See: Charlesworth (2003)

http://library.wellcome.ac.uk/projects/archiving_reports.shtml


Organisational challenges

Decentralised organisation:
– Web-archiving initiatives focus on defined sub-sets of the Web, e.g.:
  – National domain, subject, organisation type
– Need for co-operation between initiatives
Quality:
– Much on Web is low-quality (or worse)
– Is there a need to preserve all of this?


Initiatives (1)

The Internet Archive
– Largest initiative, running since 1996
– Co-operates on special collections and with other repositories
National Libraries:
– Pioneer archives in Sweden (Kulturarw3) and Australia (PANDORA)
– Now many, many more
– Changes to legal deposit legislation in some countries

Initiatives (2)

National archives:
– Focus on government Web-sites (however defined)
– Guidance for Web-site managers:
  – e.g., UK and Australia
– Snapshots:
  – e.g., USA and UK
Other:
– Universities and scholarly societies:
  – e.g., Archipol, Occasio archive, Political Communications Web Archiving (Cornell)

Approaches (1)

Automatic harvesting:
– Use of Web crawler technologies
– Crawler follows links and downloads content
– Pioneered by Internet Archive and Kulturarw3 project
– Also used for the gathering of the Finnish and Austrian Web
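The harvesting loop described on this slide (follow links, download content) can be illustrated with a minimal sketch. It is not the code of any crawler named in this presentation: `harvest`, `LinkExtractor`, the injected `fetch` function and the `FAKE_WEB` pages are all hypothetical names chosen for illustration, and a production harvester would also need politeness delays, robots.txt handling and scope rules.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def harvest(seed, fetch, max_pages=100):
    """Breadth-first crawl from `seed`; returns {url: html} for fetched pages.

    `fetch` is a caller-supplied function (an assumption of this sketch) that
    takes a URL and returns its HTML as a string, or None if unavailable.
    """
    archive = {}
    queue = deque([seed])
    seen = {seed}
    while queue and len(archive) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page has disappeared or cannot be retrieved
        archive[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before queueing.
            target, _fragment = urldefrag(urljoin(url, href))
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return archive


# A tiny in-memory "Web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="b.html#top">B</a>',
    "http://example.org/a.html": '<a href="/">home</a>',
    "http://example.org/b.html": "<p>no links</p>",
}

snapshot = harvest("http://example.org/", FAKE_WEB.get)
```

Passing `fetch` in as a parameter keeps the traversal logic separate from the download step, so the same loop could be pointed at a real HTTP client or at a test fixture like the one above.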

Approaches (2)

Selective approaches:
– Selection of individual Web sites
– Negotiate rights with site owners
– Collection using gathering or mirroring software, ftp, or e-mail
– Pioneered in PANDORA project
– Experimented with by Library of Congress and British Library
Deposit approaches:
– Site owners/administrators deposit site in repositories

Approaches (3)

Combined approaches:
– Combines the advantages of the harvesting and selective approaches
– Pioneered by the Bibliothèque nationale de France
– Experimented with enhancements to the harvesting approach
  • e.g., noting the change frequency of sites, and their 'importance'
  • Uses the selective approach for the 'deep Web'
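One way to picture the combined approach is as a planning step that ranks crawlable sites by change frequency and 'importance' and routes 'deep Web' sites to selective handling. The sketch below is an illustration under stated assumptions, not the Bibliothèque nationale de France's method: the `Site` fields, the 0.5 weighting, the cap at one change per day and the example URLs are all invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Site:
    url: str
    changes_per_month: float  # observed change frequency (assumed to be measured elsewhere)
    importance: float         # score in [0, 1], e.g. link-based (an assumption)
    deep_web: bool = False    # True if content sits behind a database or form


def capture_priority(site, weight=0.5):
    """Blend importance with change frequency, capped at one change per day."""
    freshness = min(site.changes_per_month / 30.0, 1.0)
    return weight * site.importance + (1.0 - weight) * freshness


def plan_collection(sites, weight=0.5):
    """Split sites into a prioritised harvest queue and a selective list."""
    crawlable = [s for s in sites if not s.deep_web]
    selective = [s.url for s in sites if s.deep_web]
    crawlable.sort(key=lambda s: capture_priority(s, weight), reverse=True)
    return [s.url for s in crawlable], selective


sites = [
    Site("http://static.example/", changes_per_month=0, importance=0.1),
    Site("http://news.example/", changes_per_month=30, importance=0.4),
    Site("http://library.example/", changes_per_month=3, importance=0.9),
    Site("http://catalogue.example/", changes_per_month=10, importance=0.8, deep_web=True),
]
harvest_queue, selective_list = plan_collection(sites)
```

With these made-up inputs the frequently changing news site is queued ahead of the more 'important' but slower-changing library site, and the database-backed catalogue is set aside for selective collection.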

Collection policies

Dependent on technical approach chosen
– National domain ++ (for harvesting-based approaches)
– Collection guidelines (for selective approaches)
  – Based on relevance, provenance, quality, etc.
– Frequency of capture
– Possible overlap with subject gateway initiatives, e.g. the Resource Discovery Network (RDN) in the UK
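For harvesting-based approaches, a collection policy ultimately becomes a scope test applied to each URL. The sketch below reads 'National domain ++' as "the national top-level domain plus an explicit list of extra hosts", which is an interpretation rather than something the slide spells out; the function name `in_scope`, the `se` domain and the host names are illustrative assumptions.

```python
from urllib.parse import urlsplit


def in_scope(url, national_tld, extra_hosts=()):
    """Return True if `url` falls inside the collection scope.

    A URL is in scope when its host is under `national_tld` (e.g. "se")
    or is listed in `extra_hosts` (relevant sites hosted elsewhere).
    """
    host = urlsplit(url).hostname  # lower-cased by urlsplit; None if absent
    if host is None:
        return False
    tld = national_tld.lstrip(".").lower()
    if host == tld or host.endswith("." + tld):
        return True
    return host in {h.lower() for h in extra_hosts}


# Hypothetical policy: everything under .se plus one named site outside it.
EXTRA = {"swedish-news.example.com"}

checks = {
    "http://www.example.se/index.html": in_scope("http://www.example.se/index.html", "se", EXTRA),
    "http://swedish-news.example.com/a": in_scope("http://swedish-news.example.com/a", "se", EXTRA),
    "http://www.example.com/": in_scope("http://www.example.com/", "se", EXTRA),
}
```

A test like this could be called before a crawler queues each discovered link, so that out-of-scope material is never downloaded.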

Approximate size (2002)

Country    Initiative          Type  Size (Gb.)
USA        Internet Archive    H     >150,000
Sweden     Kulturarw3          H     4,500
France     BnF                 C     <1,000
Austria    AOLA                H     448
Australia  PANDORA             S     405
Finland    HUL                 H     401
UK         Britain on the Web  S     0.03
USA        MINERVA             S

Type: H = harvesting, S = selective, C = combined
No. Sites (row alignment lost in the transcript): 3,300; 100; *; 35

Source: Day (2003)


Access policies

Access policies differ:
– Internet Archive and the PANDORA archive make data available
  – e.g., the Wayback Machine
– Other collections effectively closed (for legal reasons or because experimental)
– Need for specialised Web indexes that can search and navigate large collections of Web material
  – e.g., Nordic Web Archive (NWA) Toolset

Software

Various software in use:
– Harvesting:
  – Adapted Combine harvester, NEDLIB harvester, Xyleme, Alexa
– Selective:
  – HTTrack (popular), etc.
  – PANDAS (PANDORA Digital Archiving System): helps with managing the process, adding metadata, etc.


Costs

Costs vary widely:
– Selective approach much more expensive (per Tb.) than bulk harvesting
– But resulting archives are more widely accessible
– Significant costs in undertaking rights clearance

Long-term preservation

Many initiatives until now mainly focused on the collection of resources:
– Need to consider the longer-term
  – Descriptive and technical metadata
  – Migration needs (e.g. for complex sites)
– Need for Web archiving initiatives to become trusted repositories
– Need to be embedded into the 'core activities' of their host organisation

Summing up

• Much experimentation to date, but now moving into implementation phase
• Co-operation and collaboration are important
• Combined technical approaches offer best way forward
• Legal challenges still problematic
• Long-term preservation issues still to be explored in detail

Acknowledgements

UKOLN is funded by Resource: the Council for Museums, Archives and Libraries, the Joint Information Systems Committee (JISC) of the UK higher and further education funding councils, as well as by project funding from the JISC and the European Union. UKOLN also receives support from the University of Bath, where it is based.

http://www.ukoln.ac.uk/
