Status of the netarchive.dk project

Strategies for archiving the Danish web space
Bjarne Andersen
Head of Digital Resources
State and University Library, Aarhus
[email protected]
http://netarchive.dk
Agenda
 New legal deposit law in Denmark
 Collection strategies
 NetarchiveSuite software package
 Snapshot harvesting
 Selective harvesting
 Event harvesting
 Challenges in snapshot harvesting
 Snapshot harvesting usefulness
 Future work
Legal deposit law 1
 Revision of the legal deposit law in 1997
 -> legal deposit included static documents on the internet
 During 1998-1999 we found out that:
 We were actually preserving the least interesting part
 Many of the documents in that collection are also available in print
 A lot of work was done between 2000 and 2004
 2 pilot projects run by the two national libraries
 Testing different software / different strategies for archiving and storing web material
 A governmental publication on "preserving the Danish digital cultural heritage" (2003)
 A report to the Ministry of Culture (2004) outlining:
 Recommendations from the two national libraries on how to solve the "entire" problem
 Issues to be covered by a new revision of the legal deposit law
Legal deposit law 2
 A new revision came into force on July 1st 2005
 Allowing the two national libraries to automatically gather all Danish websites
 "Danish" is roughly defined as:
 Websites on the .dk TLD
 Websites aimed at a Danish audience / written in Danish
 Websites about Danish people (Hans Christian Andersen)
 More or less any site of interest to Denmark
 We are by law granted access to all relevant data from the .dk TLD administrator
Legal deposit law 3
 The law covers all publicly available material
 Material that all Danish people can, in principle, gain access to
 Material which requires an action before use (payment, registration, ...)
 Pay-sites must hand out a username / password upon request (for free)
 Other interesting parts:
 Combined strategy (snapshot, selective and event harvesting)
 Robots.txt is explicitly mentioned in the regulations of the law
 Many of the most interesting websites have very restrictive robots.txt files (we discovered around 35.000 robots.txt files)
 During 6 snapshots of more than 750.000 websites we received fewer than 50 robots.txt-related complaints
Legal deposit law 4
 In the end, this led to the funding of netarchive.dk
 A virtual centre run in cooperation between
 The Royal Library, Copenhagen
 The State & University Library, Aarhus
 Implementing a complete system
 Running the archiving on a daily basis
 Currently with an annual budget of 450.000 euros
 Involving 15 people from the two libraries
 4.5 man-years of manpower
The 3 collection strategies
 Illustrated by coverage over time
 Amount of data collected so far
 Snapshots: 61 TB (6 times)
 Selective harvests: 9.5 TB (80 web sites)
 Event harvests: 5.6 TB (9 events)
NetarchiveSuite software package
 We needed a curator tool ready by July 1st 2005
 Requirement number 1: operated by librarians
 With the web interface librarians can:
 Define harvests (all three types)
 Based on quite simple settings + a number of different predefined Heritrix setups
 Do quality control
 Looking at harvest results (simple reports and statistics)
 Browsing through harvested material
 Automated pickup of missing URIs (see the sketch after this list)
 NetarchiveSuite was released as Open Source in July 2007
 Currently used by a number of national libraries
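
A minimal sketch of the idea behind automated pickup of missing URIs (how NetarchiveSuite actually implements this is not described in the talk; the function and variable names below are hypothetical):

def collect_missing(harvested: set, referenced: set) -> list:
    """URIs that harvested pages referred to but that were never fetched;
    these become the seeds of a small patch crawl."""
    return sorted(referenced - harvested)

# Example with placeholder URIs
harvested = {"http://example.dk/", "http://example.dk/style.css"}
referenced = {"http://example.dk/style.css", "http://example.dk/logo.png"}
print(collect_missing(harvested, referenced))  # ['http://example.dk/logo.png']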
Snapshot harvesting
 The .dk TLD currently holds > 750.000 active domains
 We encountered around 42.000 Danish domains outside the .dk TLD
 By extracting links from the entire .dk web space – checking the country code of each IP number (GeoIP); see the sketch below
 By doing Google searches on Danish localities (city names, ...)
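
A minimal sketch of the GeoIP country-code check, assuming MaxMind's geoip2 Python library and a local country database file (both assumptions; the talk does not specify the tooling used):

import socket

import geoip2.database   # assumed: MaxMind's geoip2 library (pip install geoip2)
import geoip2.errors

# The database path is a placeholder for any MaxMind country database.
reader = geoip2.database.Reader("GeoLite2-Country.mmdb")

def is_danish_host(domain: str) -> bool:
    """True if the domain resolves to an IP address registered in Denmark."""
    try:
        ip = socket.gethostbyname(domain)                   # DNS lookup
        return reader.country(ip).country.iso_code == "DK"  # GeoIP country code
    except (socket.gaierror, geoip2.errors.AddressNotFoundError):
        return False                                        # unresolved / unknown IP

# Example: filter candidate domains extracted from .dk outlinks
candidates = ["example.com", "example.org"]
danish = [d for d in candidates if is_danish_host(d)]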
 With 8 machines we can do
 One complete snapshot (including deduplication) of 20 TB in 80 days
 Deduplication saves around 30% of the storage space – a minimal sketch follows below
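
A minimal sketch of digest-based deduplication as the slide describes it (the function name and index structure are illustrative assumptions; NetarchiveSuite's actual implementation builds on Heritrix):

import hashlib

def should_store(previous_index: dict, url: str, content: bytes) -> bool:
    """Return False if the content is an unchanged duplicate of the
    previous snapshot, so only a reference needs to be recorded."""
    digest = hashlib.sha1(content).hexdigest()  # fingerprint of the payload
    if previous_index.get(url) == digest:       # same URL, same bytes as last crawl
        return False                            # skip storing the duplicate
    previous_index[url] = digest                # remember for the next snapshot
    return True

In practice the duplicate is typically recorded as a pointer to the earlier copy, so the page can still be replayed from every snapshot.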
Selective harvesting
 Archiving of 80 selected websites
 News sites
 "Typical" dynamic and heavily used sites representing civic society, the commercial sector and public authorities
 Experimental and/or unique sites documenting new ways of using the web (e.g. net art)
 Harvested much more frequently
 From weekly to several times per day
Event harvesting
 Combining the other two strategies
 Taking a larger number of sites (200-3000)
 On a more frequent basis (daily / weekly)
 For a shorter period of time
 We have done 9 event harvests so far
 Elections, various national events
 We have pre-defined some harvest definitions, especially for news sites (both local and national)
 With one click we can start these if a sudden event happens – to ensure collection of important sites from the very beginning (a sketch of such a definition follows below)
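
A minimal sketch of what such a pre-defined, one-click harvest definition could look like as data (the structure and field names are illustrative assumptions, not NetarchiveSuite's actual configuration format):

from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

@dataclass
class EventHarvest:
    """A dormant harvest definition that is activated when an event happens."""
    name: str
    seeds: List[str]        # news sites to crawl when the event starts
    interval_hours: int     # how often to re-crawl while the event runs
    duration_days: int      # how long the event harvest keeps running
    started: Optional[datetime] = None

    def activate(self) -> None:
        """The 'one click': start collecting from the very beginning."""
        self.started = datetime.now()

# Pre-defined and waiting; the seed URLs are placeholders.
election = EventHarvest(
    name="national-election",
    seeds=["http://news-site-1.dk", "http://news-site-2.dk"],
    interval_hours=24,
    duration_days=30,
)
election.activate()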
Challenges in snapshot harvesting

 Number of domains is constantly growing
 2005: 607.000 domains – 480.000 active
 2008: 950.000 domains – 750.000 active
 Domains are growing bigger and bigger
 Audio/video is getting more and more popular
 Sites larger than 10 MB increased from 40.000 to 90.000
 Sites larger than 500 MB increased from 6.000 to 12.000
 Web 2.0 makes harvesting difficult
 Web material is inlined from other websites – from all over the world
 The border of a website is disappearing
 The web is getting more and more dynamic – Flash / Ajax
 The amount of traps and spam grows constantly
 In Denmark librarians manually inspect all websites larger than 1 GB
 Currently over 3.000 domains
 They identify aliases and potential crawler traps
 That task should be (semi-)automated; one possible heuristic is sketched below
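
A minimal sketch of such a heuristic: flag URLs whose paths are suspiciously deep or keep repeating the same segments, a classic crawler-trap symptom (the thresholds are arbitrary assumptions):

from collections import Counter
from urllib.parse import urlparse

def looks_like_trap(url: str, max_repeats: int = 3, max_depth: int = 20) -> bool:
    """Flag a URL for librarian review if its path is abnormally deep or repetitive."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) > max_depth:   # e.g. endlessly generated calendar pages
        return True
    if not segments:
        return False
    # A path like /a/b/a/b/a/b repeats segments far more than a real site does.
    return Counter(segments).most_common(1)[0][1] > max_repeats

print(looks_like_trap("http://example.dk/a/b/a/b/a/b/a/b"))  # True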
Snapshot harvesting usefulness

 With snapshot harvesting a web archive ensures cultural heritage by:
 Archiving regular "pictures" of entire national parts of the internet
 Archiving as much as possible in a quite cheap way
 Netarchive.dk: storage space plus 15 hours per week of librarian time
 Snapshots are very useful for research in many different areas:
 Linguistics
 Web technologies
 File formats and their evolution
 Web design
 Genealogy / ancestor search
 Website history
 And many, many more – to be defined in the future
 And of course useful for more ordinary users wanting:
 To find content that has disappeared from the live web – the typical page lives only 40-100 days
 Getting more and more interesting over time
 Currently access to Netarchive.dk is limited to researchers
Future work
 Automating discovery of Danish websites outside the .dk TLD
 Automating quality assurance for large crawls
 Automating filtering of web spam and traps
 Improving archiving of web 2.0
 Dynamic web content
 Streaming audio/video
 None of these problems are uniquely Danish
 Let's solve them together
 LIWA – a European project working on most of these problems
 Danish challenges
 Working for better access possibilities
 On the system level: Wayback Machine / NutchWAX search
 On the political level: a change of the law
Questions?