Transcript martin_how_to_face_the_challanges_II_4
How to Face the Challenges of Web Archiving?
The experiences of a small library on the edge.
Chloe Martin, Internet Memory
Catherine Ryan, National Library of Ireland
LIBER 2012 - 1
Context: National Library of Ireland
• Beginnings: Established by the Dublin Science and Art Museum Act, 1877
• Mission: “to collect, preserve, promote and make accessible the documentary and intellectual record of the life of Ireland”.
• The Digital Record: Born Digital Programme established in 2010, covering web archiving.
• Web Archive Projects: 2 pilot projects in 2011
Context: Internet Memory
European Archive / Internet Memory Foundation
• Established in 2004 in Amsterdam (offices also in Paris)
• Mission: to preserve Web content as a new medium for current and future generations
• Actions: awareness raising, partnerships, R&D
• Open Access Collections: UK National Archives & Parliament, PRONI, CERN and The National Library of Ireland
Internet Memory Research
• Spin-off of IM established in June 2011 in Paris
• Missions: to operate large-scale or selective crawls and develop new technologies (crawl, access, processing and extraction)
Web Archiving Project: Project Origins
National Library of Ireland
Building a 21st Century Library:
– Born Digital
– Digitisation
– Single Integrated Catalogue
– Digital Repository
– OSCAIL, the Digital Library Programme
Web Archiving Project: Project Origins
National Library of Ireland
Born Digital Materials:
• Natural progression for NLI’s strong political, cultural and historical collections
• How best to approach this in a time of unprecedented financial difficulty?
• Born Digital Programme established to examine requirements and produce a policy document for the next steps
Web Archiving Project: Project Origins
National Library of Ireland
The Hand of History:
– Snap General Election
– Five Weeks
Web Archiving Project: Project Origins
National Library of Ireland
Just do it
Web Archiving Project: Project Origins
National Library of Ireland
Just do it
How?
Web Archiving Project: Project Origins
National Library of Ireland
Collaborative Partnership: a partner that suited our requirements and that had experience with others in the cultural sector
Requirements:
– Technical skills existed in the NLI but were committed to other projects, so these skills had to come from a partner
– Leverage NLI’s own strong curatorial experience, esp. in politics
– Fast!
Web Archiving Project: Project Origins
National Library of Ireland
Project phases:
– Project scoping and contract
– Site selection
– Permissions gathering
– QA (look and feel)
– Publication and promotion
Site Selection and Permissions
National Library of Ireland
Selection Criteria:
– Website presence
– Technical reasons
– Cut-off date
– Women candidates
Permissions:
– All sites contacted and provided with a brief
– Pressurised but necessary phase
Scope of projects
National Library of Ireland
General Election:
– Crawl: 200 snapshots
– Scope: 100 seeds
– Frequency: 2 times
– Date: Feb. 2011
Presidential Election:
– Crawl: 80 snapshots
– Scope: 70 seeds
– Frequency: 3 times
– Date: Oct-Nov. 2011
Crawl
Internet Memory
• Seeds Validation: URLs, Duplication, Redirection, External links, Dynamic websites
• Scope Parameters: Domain, host and path; Social Web content; Frequency; Robots.txt files exclusion; Politeness
• Specific incidents, technical changes on the fly: Modification of scope; Pending crawls; Adaptation of the politeness
• Improvement of second crawl
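The seed validation and scope parameters listed on this slide can be illustrated with a small sketch. It is not Internet Memory's crawler configuration: the function names, the `example.ie` hosts, and the naive "last two labels" rule for the domain scope are all assumptions made for illustration, and redirection, robots.txt handling and politeness delays are left out.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_seed(url):
    """Return a canonical http(s) seed URL, or None if the input is unusable."""
    try:
        parts = urlsplit(url.strip())
        port = parts.port  # raises ValueError on a malformed port
    except ValueError:
        return None
    scheme = parts.scheme.lower()
    if scheme not in ("http", "https") or not parts.hostname:
        return None
    host = parts.hostname.lower()
    if port:
        host = f"{host}:{port}"
    # Drop fragments and give bare hosts an explicit root path.
    return urlunsplit((scheme, host, parts.path or "/", parts.query, ""))


def validate_seeds(urls):
    """Split raw seeds into (valid, rejected); duplicates count as rejected."""
    valid, rejected, seen = [], [], set()
    for raw in urls:
        seed = normalize_seed(raw)
        if seed is None or seed in seen:
            rejected.append(raw)
            continue
        seen.add(seed)
        valid.append(seed)
    return valid, rejected


def in_scope(url, seed, level="host"):
    """Check a discovered URL against a seed at "domain", "host" or "path" level."""
    u, s = urlsplit(url), urlsplit(seed)
    url_host = (u.hostname or "").lower()
    seed_host = (s.hostname or "").lower()
    if level == "domain":
        # Assumption: the domain is the last two labels (wrong for e.g. .co.uk).
        domain = ".".join(seed_host.split(".")[-2:])
        return url_host == domain or url_host.endswith("." + domain)
    if url_host != seed_host:
        return False
    if level == "host":
        return True
    # Path scope: the URL must sit at or below the seed's path.
    seed_path = s.path or "/"
    prefix = seed_path if seed_path.endswith("/") else seed_path + "/"
    url_path = u.path or "/"
    return url_path == seed_path or url_path.startswith(prefix)


if __name__ == "__main__":
    # Hypothetical seed list, not taken from the NLI projects.
    raw_seeds = [
        "http://Example.ie",
        "http://example.ie/",
        "ftp://example.ie/",
        "https://candidate.example.ie/news",
    ]
    valid, rejected = validate_seeds(raw_seeds)
    print("valid:", valid)
    print("rejected:", rejected)
```

Choosing the scope level per seed matters for a selective crawl: a candidate hosted under a party's site would typically need path scope, while a dedicated campaign site can be taken at host or domain level.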
Quality Assurance (QA)
National Library of Ireland
• Manual QA
• Jira software
• IM – Technical QA
• NLI – ‘Look and Feel’ QA
• Multiple browsers
• Communication with site owners (building relationships and promotion)
Quality Assurance (QA)
Internet Memory
• Why?
• How?
• Manual and visual method: homepage + 2
• Resolution of issues
• Temporal Coherence
Access
National Library of Ireland
• Available to the public
• Full text search
• IM website – search by keyword, URL
• NLI catalogue – keyword via widget developed by NLI IS team and IM
• Future – access through NLI’s own interfaces, issue of integrating results
Publication and Promotion
National Library of Ireland
• NLI social media initiative (Twitter and blog)
• Project participants
• Print media (esp. in area of technology)
• And IM!
• Usage figures have increased but real value more apparent in 5-10 years
Usage Statistics of Web Archive
National Library of Ireland
[Chart: unique visitors per month, y-axis from 0 to 1000, annotated with the following events]
21/09/2011: Official launch of NLI Web archives (Tweets)
26/10/2011: Blog post on nli.ie/blog and Paper in thejournal.ie
25/11/2011: Paper on irishtimes.com
20/01/2012: Paper on irishtimes.com
17/03/2012: Post on soundofthearchives.wordpress.com
04/05/2012: Paper on irisheconomy.ie
Advantages of Web Archiving
National Library of Ireland
Web archiving:
– New opportunities for delivery of materials to users
– Work with existing users’ expectations that content be online
– Reach new audiences
Advantages of Web Archiving
National Library of Ireland
Political web archives; Irish General Election:
– Researchers can compare online content pre- and post-election
– Facilitates research into how ‘online’ this election was
– Assess impact of technological developments in campaign communications
– Record of campaign information
Benefits of Working Together
National Library of Ireland
Pilot project for a long-term activity:
– Allowed us to enter a new collecting area despite lack of tech expertise
– Facilitated collection of important material that no one else was collecting
– Collect material quickly
– Leverage curatorial skills
– Gained new technical skills
Benefits of Working Together
Internet Memory
• To support the development of Web archiving initiatives
• To operate rapid deployment of Web archives
• To address new challenges in this area:
  • Social media content
  • QA
  • Automation
Conclusion
General Election:
• 18,495,771 URLs
• 1.14 TB
• 10,405 ARCs
Presidential Election:
• 7,333,399 URLs
• 278.10 GB
• 2,513 ARCs
View the NLI collections at: http://www.nli.ie/en/udlist/digital-collections.aspx
View the Web archive blog entry at: http://www.nli.ie/blog/index.php/2011/10/26/general-election-2011-web-archiving/
View Internet Memory Collections at: http://collections.europarchive.org/
To be continued…
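The collection figures on this slide can be put on a common footing with a little arithmetic. The sketch below derives the average size per URL and per ARC file for each crawl; it assumes decimal units (1 TB = 10**12 bytes, 1 GB = 10**9 bytes), which the slide does not specify, and the dictionary layout and function name are illustrative only.

```python
# Figures from the Conclusion slide. Byte counts assume decimal units,
# which is an assumption: the slide does not say whether TB/GB are binary.
COLLECTIONS = {
    "General Election": {"urls": 18_495_771, "bytes": 1.14e12, "arcs": 10_405},
    "Presidential Election": {"urls": 7_333_399, "bytes": 278.10e9, "arcs": 2_513},
}


def averages(stats):
    """Return average kilobytes per URL and megabytes per ARC file."""
    return {
        "kb_per_url": stats["bytes"] / stats["urls"] / 1e3,
        "mb_per_arc": stats["bytes"] / stats["arcs"] / 1e6,
    }


if __name__ == "__main__":
    for name, stats in COLLECTIONS.items():
        avg = averages(stats)
        print(
            f"{name}: {avg['kb_per_url']:.1f} KB per URL, "
            f"{avg['mb_per_arc']:.1f} MB per ARC file"
        )
```

Under these assumptions the two crawls differ noticeably in average size per URL but are close in average size per ARC file.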
Questions?
Thanks for your attention!
Catherine Ryan
National Library of Ireland
http://www.nli.ie
@NLIreland
Chloe Martin
Internet Memory
http://internetmemory.org
@InternetMemory