A survey of Web preservation initiatives
Michael Day UKOLN, University of Bath
7th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2003), Trondheim, Norway, 17-22 August 2003
Presentation overview
• The importance of the Web
• Challenges:
  – Technical, legal, and organisational challenges
• Approaches to collection:
  – Harvesting-based, selective, and deposit; combined approaches
• Discussion:
  – Collection and access policies, software, costs, long-term preservation
Importance of the Web
An all-pervasive communication medium:
• In research:
  – Scientists are 'increasingly reliant' on the Web for supporting research (Hendler, 2003)
• Wider societal role:
  – Personal communication, e-commerce, etc.
  – "… the information source of first resort for millions of readers" (Lyman, 2002)
The UKOLN study
Feasibility study produced for:
– Joint Information Systems Committee (JISC)
– Wellcome Library
– A survey of initiatives
– Recommendations for the JISC and Wellcome Library
– Supplementary legal study (Charlesworth)
– Published February 2003
http://library.wellcome.ac.uk/projects/archiving_reports.shtml
Technical challenges (1)
Size of Web:
– Surface Web > 50 Tb (2000) … and still growing
– The 'deep Web'
– Scale of task means that Web-archiving needs to be a collaborative activity
Technical challenges (2)
Dynamic nature of Web:
– Web pages disappear on average after 75 days
– Many leave no trace
Evolution of Web-based technologies:
– Increasing reliance on databases, scripts, plug-ins, etc.
– A 'moving target'
Legal challenges
Copyright
Content liability, e.g.:
– Defamation
– Data protection
In the UK:
– Selective approach would be the safest solution (unless the law changes)
See: Charlesworth (2003)
http://library.wellcome.ac.uk/projects/archiving_reports.shtml
Organisational challenges
Decentralised organisation:
– Web-archiving initiatives focus on defined sub-sets of the Web, e.g.:
  – National domain, subject, organisation type
– Need for co-operation between initiatives
Quality:
– Much on the Web is low-quality (or worse)
– Is there a need to preserve all of this?
Initiatives (1)
The Internet Archive:
– Largest initiative, running since 1996
– Co-operates on special collections and with other repositories
National libraries:
– Pioneer archives in Sweden (Kulturarw3) and Australia (PANDORA)
– Now many, many more
– Changes to legal deposit legislation in some countries
Initiatives (2)
National archives:
– Focus on government Web-sites (however defined)
– Guidance for Web-site managers:
  – e.g., UK and Australia
– Snapshots:
  – e.g., USA and UK
Other:
– Universities and scholarly societies:
  – e.g., Archipol, Occasio archive, Political Communications Web Archiving (Cornell)
Approaches (1)
Automatic harvesting:
– Use of Web crawler technologies
– Crawler follows links and downloads content
– Pioneered by the Internet Archive and the Kulturarw3 project
– Also used for the gathering of the Finnish and Austrian Web
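The "follows links and downloads content" step can be illustrated with a minimal sketch. It is not the software used by any of the initiatives named above: the `harvest` function, the `LinkExtractor` class, and the `TOY_WEB` dictionary are hypothetical names introduced here, and an in-memory dictionary stands in for real HTTP requests so that the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags found in one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def harvest(seed, fetch, max_pages=100):
    """Breadth-first crawl from seed; returns {url: html} for fetched pages.

    fetch is a caller-supplied function (an assumption of this sketch) that
    returns the HTML for a URL, or None if the page cannot be retrieved.
    """
    archive = {}
    queue = deque([seed])
    seen = {seed}
    while queue and len(archive) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page has disappeared or is unreachable
        archive[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before queueing.
            target = urldefrag(urljoin(url, href)).url
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return archive


# Toy in-memory "Web" standing in for real HTTP requests.
TOY_WEB = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="b.html">B</a>',
    "http://example.org/a.html": '<a href="/">home</a>',
    "http://example.org/b.html": '<a href="/missing.html">gone</a>',
}

snapshot = harvest("http://example.org/", TOY_WEB.get)
```

A production harvester would also need politeness delays, robots.txt handling, and scope rules, none of which are shown here.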
Approaches (2)
Selective approaches:
– Selection of individual Web sites
– Negotiate rights with site owners
– Collection using gathering or mirroring software, ftp, or e-mail
– Pioneered in the PANDORA project
– Experimented with by the Library of Congress and the British Library
Deposit approaches:
– Site owners/administrators deposit site in repositories
Approaches (3)
Combined approaches:
– Combines the advantages of the harvesting and selective approaches
– Pioneered by the Bibliothèque nationale de France
– Experimented with enhancements to the harvesting approach
  – e.g., noting the change frequency of sites, and their 'importance'
– Uses the selective approach for the 'deep Web'
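One way to picture a harvesting enhancement based on change frequency and 'importance' is a simple re-crawl ordering. The sketch below is an illustration only and does not describe the Bibliothèque nationale de France's actual method: the `SiteRecord` fields, the smoothed change-rate estimate, and the multiplication of rate by importance are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class SiteRecord:
    url: str
    visits: int        # how many times the site has been re-crawled
    changes: int       # how many of those visits found changed content
    importance: float  # hypothetical 0..1 score, e.g. based on inbound links


def change_rate(site):
    """Smoothed estimate of how often a visit finds changed content."""
    # Add-one smoothing keeps rarely visited sites away from 0 and 1.
    return (site.changes + 1) / (site.visits + 2)


def crawl_order(sites):
    """Sites most worth revisiting first: frequent change and high importance."""
    return sorted(
        sites,
        key=lambda s: change_rate(s) * s.importance,
        reverse=True,
    )


sites = [
    SiteRecord("http://static.example/", visits=10, changes=1, importance=0.9),
    SiteRecord("http://news.example/", visits=10, changes=9, importance=0.8),
    SiteRecord("http://blog.example/", visits=4, changes=2, importance=0.5),
]
ordered = [s.url for s in crawl_order(sites)]
```

With these made-up figures the frequently changing news site is scheduled first and the important but static site last, which is the trade-off such a scheme is meant to capture.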
Collection policies
Dependent on technical approach chosen:
– National domain ++ (for harvesting-based approaches)
– Collection guidelines (for selective approaches)
  – Based on relevance, provenance, quality, etc.
– Frequency of capture
– Possible overlap with subject gateway initiatives
  – e.g., the Resource Discovery Network (RDN) in the UK
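A harvesting-based collection policy ultimately has to be expressed as a rule a crawler can apply to each URL. The sketch below assumes that "National domain ++" means the national top-level domain plus an explicitly maintained list of additional hosts; that reading, the `in_scope` function, and the example hosts are assumptions of this sketch rather than any initiative's documented policy.

```python
from urllib.parse import urlsplit


def in_scope(url, national_tld, extra_hosts=frozenset()):
    """True if the URL's host is under the national TLD or explicitly listed."""
    host = (urlsplit(url).hostname or "").lower()
    return host.endswith("." + national_tld) or host in extra_hosts


# Hypothetical policy: everything under .se, plus one hand-picked .com host.
extras = frozenset({"www.example.com"})
decisions = {
    url: in_scope(url, "se", extras)
    for url in (
        "http://www.kb.se/index.html",
        "http://www.example.com/sweden/",
        "http://www.example.org/",
    )
}
```

Selective approaches would replace this mechanical test with human judgements of relevance, provenance, and quality.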
Approximate size (2002)
Country    Initiative          Type  Size (Gb.)  No. sites
USA        Internet Archive    H     >150,000    -
Sweden     Kulturarw3          H     4,500       -
France     BnF                 C     <1,000      -
Austria    AOLA                H     448         -
Australia  PANDORA             S     405         3,300
Finland    HUL                 H     401         -
UK         Britain on the Web  S     0.03        100
USA        MINERVA             S     -           * 35

Type: H = harvesting, S = selective, C = combined
Source: Day (2003)
Access policies
Access policies differ:
– Internet Archive and the PANDORA archive make data available
  – e.g., the Wayback Machine
– Other collections effectively closed (for legal reasons or because experimental)
– Need for specialised Web indexes that can search and navigate large collections of Web material
  – e.g., Nordic Web Archive (NWA) Toolset
Software
Various software in use:
– Harvesting:
  – Adapted Combine harvester, NEDLIB harvester, Xyleme, Alexa
– Selective:
  – HTTrack (popular), etc.
  – PANDAS (PANDORA Digital Archiving System): helps with managing the process, adding metadata, etc.
Costs
Costs vary widely:
– Selective approach much more expensive (per Tb.) than bulk harvesting
– But resulting archives are more widely accessible
– Significant costs in undertaking rights clearance
Long-term preservation
Many initiatives have until now focused mainly on the collection of resources:
– Need to consider the longer term
– Descriptive and technical metadata
– Migration needs (e.g., for complex sites)
– Need for Web-archiving initiatives to become trusted repositories
– Need to be embedded into the 'core activities' of their host organisation
Summing up
• Much experimentation to date, but now moving into implementation phase
• Co-operation and collaboration are important
• Combined technical approaches offer the best way forward
• Legal challenges still problematic
• Long-term preservation issues still to be explored in detail
Acknowledgements
UKOLN is funded by Resource: the Council for Museums, Archives and Libraries, the Joint Information Systems Committee (JISC) of the UK higher and further education funding councils, as well as by project funding from the JISC and the European Union. UKOLN also receives support from the University of Bath, where it is based.
http://www.ukoln.ac.uk/