PANDORA Australia’s Web Archive


PANDORA and Beyond: Managing Web Archiving at the National Library of Australia
Digital Preservation Seminar
National Library of Australia, 21 November 2006
Paul Koerbin
Manager Digital Archiving
National Library of Australia
PANDORA and Beyond
• Context and background
• PANDORA – selective archiving
• PANDAS – a web archiving system
• Domain harvesting
• Now and beyond
PANDORA and Beyond – Context – Legislation
• National Library Act, 1960
• Functions of the National Library
– Maintain and develop a national collection of library
material, including a comprehensive collection of
library material relating to Australia and the Australian
people
– To make library material in the national collection
available … in the national interest
– ‘Library material’ ~ books, periodicals, newspapers,
manuscripts, films, sound recordings, musical scores,
maps, plans, pictures, photographs, prints and other
recorded material …
PANDORA and Beyond – Context – Legislation
• Copyright Act, 1968 – Sect 201
• Delivery of library materials to the National
Library
– ‘Library material’ ~ book, periodical, newspaper,
pamphlet, sheet of letter-press, sheet of music, map,
plan, chart or table, being a literary, dramatic, musical
or artistic work or an edition of such a work …
• Enabling and supportive legislation does not
address the collection of digital content
• Copyright Amendment (Digital Agenda) Act, 2000
– some support for digital preservation actions
PANDORA and Beyond – Context – Web Publishing
• World Wide Web: a new publishing medium, 1995→
• Defining a publication for our purpose:
A publication is information, regardless of its format or
method of delivery, that is made available to the
general public, or to an identified public, either free of
charge or for a fee.
Definition from: PANDORA Selection Guidelines
http://pandora.nla.gov.au/selectionguidelines.html#pubdefinition
• Content rendered through a web browser
• Email – only as delivery mechanism (e.g. PDF)
• Databases – yes, but more problematic
PANDORA and Beyond – Context – Web Publishing
• Enormous growth and volume of material
• Everyone can be creators and publishers
• Virtually instantaneous publication
• Dynamic content and format
• Multiplicity of formats
• Technology dependent
• Hyperlinked and interconnected
• Highly accessible but hard to identify
• Ephemeral
• Interactivity, re-use, personalisation (web 2.0)
PANDORA and Beyond – Context – Some Objectives
• Fulfil the functions of the National Library
• Identify published content to collect
• Manage content for long term preservation
– Integrity of the data streams
– Maintain access to authentic content
• Provide persistent access to the content
• Incorporate collection and preservation of web
content into routine Library processes
• Efficient and sustainable
PANDORA and Beyond – The PANDORA Archive
• PANDORA Archive 1996→
• Began as proof-of-concept project
• Now a routine process within NLA
• Currently 10 participants – NLA, state libraries
(not Tas), NFSA, AWM, AIATSIS
• Selective, content focused (bibliocentric)
– simple documents to whole websites
• PANDAS workflow management system, 2001→
PANDORA and Beyond – PANDORA – Web Archiving
What is web archiving?
• Identifying and selecting
• Seeking permission to collect and make accessible
• Recording metadata
• Crawling/harvesting (including scheduling)
• Processing for quality assurance (best effort)
• Storing and maintaining the data
• Preparing and rendering for public display
• Creating resource discovery metadata
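The steps listed above can be read as an ordered pipeline. The following is a minimal sketch in Python, assuming a strictly linear order; the stage names are illustrative labels taken from the slide wording and are not how PANDAS itself models the process.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    """Web archiving steps in the order listed on the slide (names are illustrative)."""
    SELECT = 1             # identifying and selecting
    SEEK_PERMISSION = 2    # permission to collect and make accessible
    RECORD_METADATA = 3
    HARVEST = 4            # crawling/harvesting, including scheduling
    QUALITY_ASSURANCE = 5  # best-effort processing
    STORE = 6              # storing and maintaining the data
    PUBLISH = 7            # preparing and rendering for public display
    DESCRIBE = 8           # creating resource discovery metadata


def next_stage(stage: Stage) -> Optional[Stage]:
    """Return the stage that follows `stage`, or None after the last one."""
    members = list(Stage)
    position = members.index(stage)
    return members[position + 1] if position + 1 < len(members) else None
```

In practice some steps overlap or repeat (a title is re-harvested on a schedule, for example), so a real workflow would not be purely linear.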
PANDORA and Beyond – PANDAS
• PANDAS – PANDORA Digital Archiving System
• Web based workflow management system
• Developed specifically to manage the web
archiving processes at the National Library of
Australia
• Used by PANDORA’s participants located
throughout Australia (mainland state libraries,
AWM, NFSA, AIATSIS)
• Also used by UKWAC
PANDORA and Beyond – PANDAS
• Developed in-house at the NLA
• Replaced multiple non-integrated systems used
between 1996 and 2001
• Written in Java on Apple WebObjects application
development platform
• Presentation, application, business and data layers
• Version 1 released June 2001
• Version 2 released August 2002
• Version 3 due early 2007
PANDORA and Beyond – PANDAS
• Record administrative metadata about titles
selected (or considered) for archiving
• Schedule and initiate harvesting
– but not a crawler; currently use HTTrack
• Manage quality assurance checking and problem
fixing workflow
• Prepare and deliver archived copies for public
display through the PANDORA home page
– dynamically from PANDAS database
• Manage access restrictions
• Facilitate management reporting
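To make the division of labour concrete (the workflow system schedules, a separate crawler harvests), here is a hedged sketch in Python. The `Title` record, its field names and the fixed-interval schedule are all assumptions for illustration and do not reflect the actual PANDAS data model. The only external detail relied on is HTTrack's basic command form, `httrack <URL> -O <output directory>`.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class Title:
    """Hypothetical administrative record for a title selected for archiving."""
    title_id: int             # running number assigned by the workflow system
    seed_url: str             # where the crawler starts
    harvest_every_days: int   # assumed fixed-interval schedule
    last_harvested: date
    restricted: bool = False  # assumed flag for access restrictions

    def next_harvest(self) -> date:
        return self.last_harvested + timedelta(days=self.harvest_every_days)

    def is_due(self, today: date) -> bool:
        return today >= self.next_harvest()


def httrack_command(title: Title, output_dir: str) -> List[str]:
    """Build, but do not run, an HTTrack command line for this title."""
    return ["httrack", title.seed_url, "-O", output_dir]
```

The returned list could be passed to `subprocess.run` by a scheduler once `is_due` returns true; it is left unexecuted here so the sketch has no side effects.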
PANDORA and Beyond – Persistent URIs
• Running number generated by PANDAS
• Persistent URL applied to title entry page
http://nla.gov.au/nla.arc-21220
• Logically extended to any resource in the Archive
http://nla.gov.au/nla.arc-21220-20030822www.ipjp.org/september2002/schweitzered.html
• Citation generator on public interface
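The two example URLs above suggest a simple construction rule. The sketch below, in Python, reproduces the pattern exactly as it appears on the slide; reading `20030822` as a harvest date in YYYYMMDD form is an assumption, and the function names are hypothetical.

```python
from datetime import date

BASE = "http://nla.gov.au/nla.arc-"


def title_uri(title_id: int) -> str:
    """Persistent URL for a title entry page, from its running number."""
    return f"{BASE}{title_id}"


def resource_uri(title_id: int, harvested: date, resource: str) -> str:
    """Extend the title URL to one archived resource.

    Mirrors the slide's example: title URL, a hyphen, an eight-digit date,
    then the original host and path with the scheme removed.
    """
    return f"{title_uri(title_id)}-{harvested:%Y%m%d}{resource}"
```

Calling `resource_uri(21220, date(2003, 8, 22), "www.ipjp.org/september2002/schweitzered.html")` yields the second URL shown on the slide.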
PANDORA and Beyond – PANDORA Statistics
Indicative statistics as at October 2006
• 13,000+ titles
• 26,000+ archived instances
• 33.5+ million files*
• 1.2+ Terabytes data*
*These figures are for the display copy only. Three
preservation copies are actually maintained: a preservation
master, an access master and a metadata master.
PANDORA and Beyond – Domain Harvesting
• Crawl conducted by the Internet Archive for the
NLA
• 1st harvest June/July 2005
– 4 weeks, 185m files, 6.69 TBs
• 2nd harvest Aug/Sept 2006
– 5 weeks, 516m files, 19.04 TBs
• Harvest of the .au top level domain
– plus non-.au hosts identified through geoIP
lookup as being hosted in Australia
• Domain harvesting – obvious choice?
Comparative statistics

Domain Harvest     2005            2006
Unique files       185,549,662     516,280,205
Hosts crawled      811,523         1,046,038
Size               6.69 TB         19.04 TB

PANDORA (c. 6% of 2006 DH)
Files:             33 million
Size:              1.2 TB

DH MIME types
HTML:              67%
Image files:       28.5%
PDF files:         1.6%
MS Word files:     0.2%
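The "c. 6%" comparison and the growth between the two harvests can be checked with a few lines of arithmetic. This Python sketch uses only the figures quoted above; the variable names are illustrative.

```python
# Figures as quoted on the slides.
files_2005, files_2006 = 185_549_662, 516_280_205
size_2005_tb, size_2006_tb = 6.69, 19.04
pandora_files, pandora_size_tb = 33_000_000, 1.2


def percent(part: float, whole: float) -> float:
    """Express `part` as a percentage of `whole`."""
    return 100 * part / whole


# PANDORA display copy as a share of the 2006 domain harvest.
file_share = percent(pandora_files, files_2006)
size_share = percent(pandora_size_tb, size_2006_tb)

# Growth of the domain harvest from 2005 to 2006.
file_growth = files_2006 / files_2005
size_growth = size_2006_tb / size_2005_tb

print(f"PANDORA share of 2006 harvest: {file_share:.1f}% of files, {size_share:.1f}% of size")
print(f"2005 to 2006 growth: {file_growth:.2f}x files, {size_growth:.2f}x size")
```

Both shares come out a little above 6%, consistent with the slide, and the 2006 harvest is a little under three times the size of the 2005 harvest on both measures.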
PANDORA and Beyond – Domain Harvesting – Pros and Cons
• Convergence of resources, technology,
collaborations, and purpose in 2005
• Some pros –
– Retains linkages and context
– Large scale – more bytes for the buck
– Less selectively discriminating
• Some cons –
– High dependence on the crawler technology
– Domain and geo-location bias (.au, geoIP)
– Limitations in timeliness, quality assurance, scoping,
site complexity, deep web
– Legal and access issues to resolve
PANDORA and Beyond – Now
• 10 years selective web archiving for PANDORA
– publicly accessible web archive
• 2 years domain harvesting
– large scale archival content
• PANDAS
– production workflow system
• Tangible outcomes from pragmatic approach
• Doing (what we can) with limited resources
• Developing experience, knowledge and skill
through practical engagement in the tasks
PANDORA and Beyond – Future Strategies
• Renewed focus on strategic thinking
• Collaborations, relationships, partnerships
– International Internet Preservation Consortium,
Internet Archive
– Open source tools, standards (IIPC)
– Institutional and trusted repositories (universities and
e-presses)
– Government & academic sectors (APSR, ARROW)
– ‘research information infrastructure’
• services that support the discovery and management of
research resources and research outputs by and for the current
and future research community
PANDORA and Beyond – Future Strategies
• Preservation planning and infrastructure
• Sustainable resourcing and workflows
• Push for legislation for collecting in the digital age
• Understanding the territory
– Personal web archiving (HanzoWeb); archive crawlers
(Warrick); advanced bookmarking (spurl.net)
• Strategic use of selective and domain harvesting
• Architecture, systems and workflows for efficient
management of and access to web archive
collections
PANDORA
Australia’s Web Archive
http://pandora.nla.gov.au/