WAX: A candle in the darkness
A digital to digital project
Wendy Gogel, Andrea Goethals
Harvard University Library, Office for Information Systems
May 1, 2009
Today’s Journey
• The Darkness – The Web
Introducing the challenge of web archiving
• The Candle – WAX
HUL’s Web Archive Collection Service
• The Light – The Collections
Demonstrating the results
The Darkness: The Web
The Challenges of Web Archiving
• A fleeting record – here today, gone tomorrow
• Government Documents
• Public Debate
• Culture
• Personal expression
• University Output
Harvard Magazine May/June 2009
Curator Activities
• Selection
• Acquisition
• Rights management
• Quality assurance
• Arrangement
• Storage
• Description and indexing for discovery (cataloguing, searching, browsing)
• Presentations and exhibitions
• Preservation
IP and Other Legal Risks
• Copyright infringement
• State tort liability
• Civil damages, resulting from invasion of privacy, sensitive personal data, commercial content, defamatory content
• Statutory content restrictions
• Foreign Laws
Preservation Challenges
• We were not there at creation
• Viruses more likely
• Formats misidentify themselves
• A lot of formats are invalid (especially HTML)
• It’s a moving target – what should we preserve?
• Evolving born digital formats
• Proliferation of formats
• Partial capture
• Complex behaviors and styles
• Complex delivery to maintain
• Hyperlinked resources
• Multiple renderers will continue to evolve
2006/07 Alternatives

| Option | Selection | Crawling | Management (QA and Metadata) | Storage | Preservation | Discovery and Display |
|---|---|---|---|---|---|---|
| Wayback (IA) | No | Yes | No | Yes | Partial Replicated storage – Not Harvard owned | No full text searching |
| Contract IA | Yes | Yes | No, handle in-house | No, handle in-house | No, handle in-house | No, handle in-house |
| Archive It! (IA) | Yes | Yes | Minimal, has since improved | Yes | Partial Replicated storage | Minimal, has since improved |
| Customize IIPC Tools (WAX)* | Yes | Yes | Yes | Yes | More than others | Yes |

* Additional benefit of integration with HUL central services
Notes: 2008 costs: $16,000/yr; $2,000/yr Harvard copy
The Candle: WAX
HUL’s Web Archiving Project
• 2.5 year pilot project funded by LDI
• Key Goals
1. Gain experience in domain
2. Explore legal terrain
3. Investigate sustainability of a Harvard web archiving service
• quantify technical, human, and $ requirements
• aim for operational efficiencies
Project Players
1. Curators and Collection Managers
• Harvard University Archives
• Schlesinger Library on the History of Women in America
• Edwin O. Reischauer Institute of Japanese Studies
2. Legal Counsel – Office of General Counsel (OGC)
3. Technologists – OIS
What Did We Build? WAX
Third Party Software
• International Internet Preservation Consortium (IIPC) tools
www.netpreserve.org
• Heritrix
• HCC
• NutchWAX
• Wayback
• JBoss
• Oracle
• Struts
• Tomcat
• Quartz job scheduler
The Web is vast and interconnected.
How do you specify the part you want to capture?
Or “training a web crawler”…
How to Train a Web Crawler
1. Tell it where to start
• “Seed URI”
2. Tell it what to collect and where to stop
• “Scope”
3. Tell it when and how often
• “Schedule”
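The three instructions above map naturally onto a small record. The sketch below is illustrative only: `HarvestProfile`, its field names, the fixed-interval schedule, and the example seed are all hypothetical and are not taken from the WAX implementation.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class HarvestProfile:
    """Hypothetical record of the three things a crawler must be told."""
    seed_uri: str       # 1. where to start
    scope: str          # 2. what to collect and where to stop
    interval_days: int  # 3. when and how often (assumed: a fixed interval)

    def next_harvest(self, last_harvest: date) -> date:
        """Return the date of the next scheduled crawl after `last_harvest`."""
        return last_harvest + timedelta(days=self.interval_days)


# Placeholder values, not a real WAX seed.
profile = HarvestProfile(
    seed_uri="http://www.example.org/blog/",
    scope="directory",
    interval_days=30,
)
```

A scheduler could call `profile.next_harvest(...)` after each crawl to decide when to start the next one.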
Web Archiving Steps
1. Create a harvest profile
Identify website URI (“seed”), define scope and schedule
2. Harvest web site
3. QA harvest
4. Send harvest to DRS
5. Index harvest
Becomes searchable and viewable by users
A lot of work per website – which steps can be automated?
Web Archiving Steps
Manual by curator → 1. Create a harvest profile
Automated by scheduler and crawler software → 2. Harvest web site
Manual by curator → 3. QA harvest
Manual by curator → 4. Send harvest to DRS
Automated by indexing software → 5. Index harvest
Workflow Efficiencies
• Curator’s manual tasks:
• Create a harvest profile
• 3 scopes: Directory, host and host+1
• Schedules
• Global excluded URIs
• QA harvests
• Remove unwanted pieces
• Detect missing pieces
• Refinement of seed scope
• Send harvests to DRS
How can the system help with these tasks?
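One way a system can take scoping work off the curator is to turn the chosen scope into an automatic test applied to every discovered URI. The sketch below is a guess at how such a test might look, not WAX's actual rules. In particular, it assumes "host+1" means "anything on the seed's host, plus anything linked directly from a page on that host", and the `GLOBAL_EXCLUDES` entry is a made-up example.

```python
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical global exclusion list (URI fragments never to collect).
GLOBAL_EXCLUDES = ("/calendar/",)


def _host(uri: str) -> Optional[str]:
    return urlsplit(uri).hostname


def _directory(uri: str) -> str:
    """Return the path up to and including its last slash."""
    path = urlsplit(uri).path
    return path[: path.rfind("/") + 1] or "/"


def in_scope(seed: str, uri: str, scope: str, via: Optional[str] = None) -> bool:
    """Decide whether `uri` belongs to the harvest that starts at `seed`.

    `via` is the page on which the link to `uri` was found, if known.
    """
    if any(fragment in uri for fragment in GLOBAL_EXCLUDES):
        return False
    same_host = _host(uri) == _host(seed)
    if scope == "directory":
        return same_host and urlsplit(uri).path.startswith(_directory(seed))
    if scope == "host":
        return same_host
    if scope == "host+1":
        # Assumed meaning: one hop beyond the seed's host is still collected.
        return same_host or (via is not None and _host(via) == _host(seed))
    raise ValueError(f"unknown scope: {scope!r}")
```

With a seed of `http://www.example.org/blog/index.html`, the "directory" scope keeps `/blog/2009/post.html` but drops `/about.html`, while "host" keeps both.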
Efficiencies: QA Harvests
• Exclude URIs from future crawls
• Delete URIs from harvest
• Delete URIs from harvest and Exclude them from future crawls
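The three QA actions differ only in which of two collections they touch: the current harvest, the exclusion list for future crawls, or both. A minimal sketch follows, with a hypothetical `HarvestQA` class that works on exact URIs; how WAX actually stores and matches these is not described in the slides.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class HarvestQA:
    """Hypothetical QA state for a single harvest."""
    harvested: Set[str]                              # URIs captured by this crawl
    excluded: Set[str] = field(default_factory=set)  # URIs to skip in future crawls

    def exclude(self, uri: str) -> None:
        """Keep the captured copy, but do not collect this URI again."""
        self.excluded.add(uri)

    def delete(self, uri: str) -> None:
        """Drop the captured copy; future crawls may collect it again."""
        self.harvested.discard(uri)

    def delete_and_exclude(self, uri: str) -> None:
        """Drop the captured copy and never collect it again."""
        self.delete(uri)
        self.exclude(uri)
```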
Efficiencies: Send Harvests to DRS
The Ultimate Shortcut?
• Can pre-configure WAX to send harvests directly to the DRS
• Skip QA step
• Skip push to archive step
Web Harvest Objects:
Unit of Preservation in the DRS
• For each crawl starting from a seed URI:
• One or more ARC files (*.arc.gz)
• contain one or more “resources” - the individual HTML, JPEG, Javascript, etc. files that make up the harvested web pages
• Crawl log
• records all URI requests, regardless of result
• Crawler configuration
• Metadata
• descriptive, administrative, technical
WAX Legal Mitigations: Crawls
• Polite crawling
• Obey robots.txt
• Leave WAX crawler information in logs
• Employ a respectful “request frequency” during crawls
• Don’t overload web servers
• Capture surface web only
• No attempt to crawl protected content
• Choice of offsite crawler for curators
• Non-Harvard IP address
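Two of the politeness rules above, obeying robots.txt and keeping a respectful request frequency, can be illustrated with Python's standard `urllib.robotparser`. This is only a sketch of the idea. WAX's crawler is Heritrix, which has its own politeness settings, and the user-agent string, the sample robots.txt, and the two-second default delay here are assumptions. `Crawl-delay` is a widely used but non-standard robots.txt directive.

```python
import urllib.robotparser

# Sample robots.txt content, made up for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

USER_AGENT = "example-archiver"  # hypothetical crawler name


def make_policy(robots_txt: str) -> urllib.robotparser.RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def delay_for(parser: urllib.robotparser.RobotFileParser, default: float = 2.0) -> float:
    """Seconds to wait between requests: the site's Crawl-delay, else a default."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default


policy = make_policy(ROBOTS_TXT)
```

A crawler loop would call `policy.can_fetch(USER_AGENT, uri)` before each request and sleep for `delay_for(policy)` seconds between requests to the same server.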
WAX Legal Mitigations: Use
• Don’t compete with or divert traffic from live site
• Exclude robots from the WAX archive
• Add transformative content
• Framing
• Presentation pages with original intellectual content
• Embargo display for 3 months
• Link to live site
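The three-month display embargo comes down to a date comparison. The sketch below assumes the embargo is counted in calendar months from the harvest date and that a day which does not exist in the target month is clamped to that month's last day. Both choices, and the function names, are illustrative rather than taken from WAX.

```python
import calendar
from datetime import date


def add_months(d: date, months: int) -> date:
    """Return `d` shifted forward by whole calendar months, clamping the day."""
    index = d.month - 1 + months
    year = d.year + index // 12
    month = index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def is_viewable(harvested: date, today: date, embargo_months: int = 3) -> bool:
    """True once the embargo period after the harvest date has passed."""
    return today >= add_months(harvested, embargo_months)
```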
The Collections
• 191 “seeds” identified by curators for harvesting
• Stored in DRS:
• Over 8 million web archive resources
• 365.17 gigabytes of storage ($913/year)
• 291 mime types
application/x-download
application/x-java-vm
Shockwave
message/rfc822
text/Javascript
audio/x-realaudio
image/x-portable-anymap
text\css
chemical/mdl-rdf
javascript/x-javascript
application/x-Shockwave-Flash
content-type
application/bds
png
text/text
image/png?ver=074219b2138e87ecf980914471183dfc
text/x-c++
application/xrds+xml
"text/xml"
image/x-bmp
gif
application/x-rar-compressed
Image/png
mime/type
image/null
text/troff
application/vnd.sun.xml.impress
text/enriched
application/icalendar
application-x/javascript
x-mapp-php4
imag/x-icon
application/x-shockwave-flash2-preview
Swish
image/x-photoshop
application/x-quicktimeplayer
image/x-cmu-raster
httpd/yahoo-send-as-is
application/x-mpeg
Video/X-Flv
text/x-python
audio/x-scpls
Text/HTML
audio/mid
text/Calendar
application/x-wais-source
application/x-perl
image/txt
application/pgp-keys
Applicationxm
text/calendar
PNG
text/x-vcard
x-png
application/octet-string
unknown/unknown
application/x-troff-me
text/x-javascript
application/octetstream
Image
application/x-sh
audio/x-mpegurl
video/x-m4v
application/pgp-signature
image/x-portable-graymap
image/#{favicon_formats[format]}
image/files/curryjpg
test/xml
text/x-invalid
video/x-flv
text/javascript+json
audio/unknown
chemical/x-xyz
application/perl
application/x.atom+xml
application/octet_stream
video/mp4
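Many of the values above are not well-formed MIME types at all, which is one reason "formats misidentify themselves". A first-pass cleanup might lowercase each value, strip stray quotes, and check the `type/subtype` shape. The sketch below approximates the name syntax of RFC 6838. The function name is hypothetical, and a value that passes is only syntactically valid, not necessarily a real or correct type ("test/xml" passes, for example).

```python
import re
from typing import Optional

# Approximation of an RFC 6838 restricted name: starts with a letter or digit,
# then letters, digits, or any of ! # $ & ^ _ . + -
_NAME = r"[a-z0-9][a-z0-9!#$&^_.+-]*"
_MIME = re.compile(rf"{_NAME}/{_NAME}")


def normalize_mime(raw: str) -> Optional[str]:
    """Return a lowercased `type/subtype`, or None if the value is malformed."""
    value = raw.strip().strip('"').lower()
    return value if _MIME.fullmatch(value) else None
```

Applied to the list above, `Text/HTML` becomes `text/html` and `"text/xml"` loses its quotes, while `png`, `text\css`, and `image/png?ver=074219b2138e87ecf980914471183dfc` are rejected and would need format identification by other means.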
The Light: The Collections
The Partners
Megan Sniffin-Marinoff, University Archivist
A-Sites: Archived Harvard Web Sites collected by the Harvard University Archives
Marilyn Dunn, Executive Director of the Schlesinger Library and Librarian of the Radcliffe Institute
Blogs: Capturing Women's Voices collected by the Arthur and Elizabeth Schlesinger Library on the History of Women in America
Helen Hardacre, Reischauer Institute Professor of Japanese Religions and Society
Web Archiving Project on Constitutional Revision collected by the Reischauer Institute of Japanese Studies with Sponsorship from the Harvard College Library Documentation Center on Contemporary Japan
To Participate
http://hul.harvard.edu/ois/systems/wax
Questions?
“…we have rather chosen to fill our hives with honey and wax, thus furnishing mankind with the two noblest of things, which are sweetness and light.”
Jonathan Swift
Image Credits
Title slide:
http://www.flickr.com/photos/lwr/59014972/in/set-1552655/
The darkness:
http://www.melegraph.com/images/outerspace.jpg
The candle:
http://www.sxc.hu/pic/m/a/as/asolario/472153_peach_votive_candle.jpg
The Web:
http://projecta-z.com/Internet_map_1024.jpg
The light:
http://i252.photobucket.com/albums/hh2/habeba2007/candles-1-1.gif