Creating and Maintaining Web Archives

Joanne Archer, University of Maryland
Kate Odell, Archive-It
Abbie Grotke, Library of Congress
Tessa Fallon, Columbia University
Session Goals
• Provide an overview of web archiving and the tasks involved
• Discuss workflow management and copyright issues
• Talk about collection strategies and collection development for web archives
• Analyze the different options for web archiving
• Discuss some of the commonly encountered technical challenges and problems
• Examine methods of access and description
What is web archiving?
Web archiving is the capture, management, and preservation of websites and web resources.
Web Archiving Initiatives
Prominent Web Archiving Initiatives include:
• Internet Archive
• International Internet Preservation Consortium
• Large National Libraries:
– Australia
– United Kingdom
– United States
– Denmark
• Web at Risk Project
Workflow Management
• Resource planning
• Monitor along the way
• Determine crawling approach
• Determine permissions approach
• Identify services and tools to use
• Collection development and planning
• Access for researchers
Copyright/Permissions
– Legal deposit requirement only applies to “published works” (§ 407)
– § 108 of the Copyright Act provides library exceptions but doesn’t address digital preservation and web archiving
– Varying approaches taken:
  • Crawl permissions
  • Access permissions
  • Notification of crawling
  • Respecting robots.txt (or not!)
– Risk and web archiving policies should be determined by each institution - talk to your lawyers!
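For institutions that do choose to respect robots.txt, the check itself is simple. The sketch below uses Python's standard-library `urllib.robotparser`. The robots.txt content, the user agent name `example-archive-bot`, and the `example.org` URLs are assumptions made up for illustration; a real crawler would fetch the file from the site being archived.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (assumption, not from any real site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://example.org/about.html",
        "http://example.org/private/minutes.html",
    ):
        print(url, may_crawl(ROBOTS_TXT, "example-archive-bot", url))
```

Whether a crawl should honor the answer is a policy decision, not a technical one, which is why it belongs in the institution's web archiving policy.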
Collection Strategies
• Whole Domain
  – Used by some national libraries and by the Internet Archive.
  – Captures everything within a geographic domain; in the case of Sweden, all sites within the .se domain.
• Selective Archiving
  – Captures certain portions of the web based on predefined criteria or collection policies.
• Thematic
  – Event driven (September 11) or theme driven (human rights)
• Deposit
• Combination
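As a rough illustration of the whole-domain strategy, the sketch below tests whether a URL's host falls under a country-code domain such as .se. The function name `in_domain` and the sample URLs are assumptions for illustration; actual domain crawls may define their scope differently.

```python
from urllib.parse import urlsplit


def in_domain(url: str, suffix: str = ".se") -> bool:
    """Return True if the URL's host name ends with the given domain suffix."""
    host = (urlsplit(url).hostname or "").lower()
    return host.endswith(suffix)


if __name__ == "__main__":
    # Hypothetical URLs; only the first has a host under .se.
    for url in (
        "http://www.example.se/index.html",
        "http://www.example.com/page.se",
        "http://example.se.example.com/",
    ):
        print(url, in_domain(url))
```

The check looks at the host name only, so a path that merely contains ".se" does not count as in scope.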
Collection Development: Topical
Diagram panels: SCOPE, FOCUS, TOOLS, SELECTION
• Finite/Ongoing
• Active/Inactive
• Public
• Organization
• Academic
• Subject specialists
• Curators
• Collaborators
• Nomination forms
• Delicious social bookmarks
• Survey/forms
• Email
• Bookmarklet
• Subject/general
• Project-specific
• Collaborative
• Institutional history
• Event-specific
• Data set
• Distributed
• Survey
• Nomination
• Targeted
• Domain
Collection Development: Technical
Technical considerations:
• Flash
• Storage
• Languages
• JavaScript
• Databases
• Hidden content
• Copyright
• Social media
• Multiple domains
Collection Development Policies/Guidelines
• Collection Development Policies or Similar Documents:
– Center for Human Rights Documentation and Research, Human Rights Web Archive
• http://library.columbia.edu/indiv/humanrights/hrwa.html
– Library of Congress
• http://www.loc.gov/acq/devpol/webarchive.pdf
– Tamiment Library Web Archive
• http://www.nyu.edu/library/bobst/research/tam/webarchive.html
– University of Michigan Bentley Historical Library
• http://bentley.umich.edu/uarphome/webarchives/BHL_WebArchives_Policy.pdf
– National Library of Ireland general election 2011 web archive
• http://www.nli.ie/GetAttachment.aspx?id=8f6b68db-e19c-411c-b041-aa8b741d2e10
Tools: HTTrack
Tools: In-House Program (Web Curator Tool)
Tools: In-House Program (DigiBoard)
Tools: Subscriptions, Web Archiving Service
Tools: Subscriptions, Archive-It
How does web archiving work?
1. Curator selects websites (seeds) to archive
2. Curator specifies scope (how much of the websites are archived)
3. Seeds and scoping are sent to the crawler (usually Heritrix)
4. Crawler visits seed sites and archives the URLs that are discovered (following the scoping rules)
5. Archived content is processed and stored (.warc format)
6. Access tools (Wayback) allow archived content to be viewed and browsed
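The seed, scope, and crawl steps can be sketched in a few lines. This is a toy illustration, not how Heritrix is implemented: the `LINKS` dictionary is a made-up link graph standing in for the live web, and `crawl` and `same_host_as` are hypothetical helper names. The scope rule used here (stay on the seed's host) is just one possible choice.

```python
from collections import deque
from urllib.parse import urlsplit

# Hypothetical link graph standing in for the live web: page URL -> links on that page.
LINKS = {
    "http://example.org/": ["http://example.org/news", "http://other.example.com/"],
    "http://example.org/news": ["http://example.org/", "http://example.org/news/2011"],
    "http://example.org/news/2011": [],
    "http://other.example.com/": ["http://other.example.com/deep"],
}


def same_host_as(seed):
    """Build a scope rule that accepts only URLs on the seed's host."""
    host = urlsplit(seed).hostname
    return lambda url: urlsplit(url).hostname == host


def crawl(seeds, in_scope, fetch_links):
    """Visit pages breadth-first from the seeds, following only in-scope links."""
    seen = set(seeds)
    queue = deque(seeds)
    captured = []
    while queue:
        url = queue.popleft()
        captured.append(url)  # a real crawler would write the response to a WARC file here
        for link in fetch_links(url):
            if link not in seen and in_scope(link):
                seen.add(link)
                queue.append(link)
    return captured


if __name__ == "__main__":
    seed = "http://example.org/"
    pages = crawl([seed], same_host_as(seed), lambda u: LINKS.get(u, []))
    for page in pages:
        print(page)
```

Changing the scope rule changes what gets captured without touching the crawl loop, which is why scoping is usually the first thing adjusted after quality review.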
Quality Review
Quality Review is different for everyone. Why?
• The tool(s) being used for harvesting and access
• Your institution’s goals, needs, and preferences
• How much time you have
Review Reports
• Was any content blocked, or were any sites unreachable?
• Did you get more content than expected? Less?
Review Archived Web Pages
• Some issues can only be found with the human eye (for now!)
• Was look-and-feel properly captured?
Make Desired Changes
• Scoping, seeds, crawl settings, etc.
Crawl Again
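The report-review questions lend themselves to a quick tally. The sketch below assumes a made-up report format of (URL, outcome) rows with the outcomes "captured", "blocked", and "unreachable"; real harvesting tools produce their own report formats, so the rows and the `summarize` helper are purely illustrative.

```python
from collections import Counter

# Hypothetical crawl report rows: (URL, outcome). Not the format of any real tool.
REPORT = [
    ("http://example.org/", "captured"),
    ("http://example.org/private/", "blocked"),
    ("http://example.org/old", "unreachable"),
    ("http://example.org/news", "captured"),
]


def summarize(report):
    """Count outcomes and list the URLs that were not captured."""
    counts = Counter(outcome for _, outcome in report)
    problems = [url for url, outcome in report if outcome != "captured"]
    return counts, problems


if __name__ == "__main__":
    counts, problems = summarize(REPORT)
    print(dict(counts))
    for url in problems:
        print("needs review:", url)
```

A tally like this only covers what reports can show; look-and-feel problems still need a person to open the archived pages.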
Common Problems – “The Web is a Mess”
• Some web technologies can be tricky (though not impossible!) to capture or to view in the archived version:
  – Database driven sites
  – JavaScript (only sometimes)
  – Flash (only sometimes)
  – Certain video formats
• Websites change: what archived perfectly yesterday might not after today’s redesign
Access and Description
Access Options:
• Subscription service access page (e.g. Archive-It website)
• Website of your organization or project (e.g. Human Rights Web Portal, LOC’s Web Archives site)
• OPAC (e.g. Columbia’s CLIO)
• OCLC’s WorldCat
Examples of Description:
• Columbia University
  – Dublin Core
  – MARC
  – Internet Resource Cataloging Request (IRCR)
• Library of Congress
  – Creates MODS records for each “site”
  – Collection level records in MARC (for the OPAC)
• Archive-It
  – Dublin Core
  – Coming soon: Automated transformation to MARC, MODS, and more.
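To make the Dublin Core option concrete, here is a minimal sketch that serializes a few Dublin Core elements as XML with Python's standard library. The `record` wrapper element, the `dublin_core_record` helper, and all field values are placeholders chosen for illustration; they do not reflect the record structure used by Archive-It, Columbia, or any other institution named above.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"


def dublin_core_record(fields):
    """Serialize (element name, value) pairs as a simple Dublin Core XML record."""
    ET.register_namespace("dc", DC_NS)
    record = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields:
        element = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        element.text = value
    return ET.tostring(record, encoding="unicode")


if __name__ == "__main__":
    # Placeholder values, not taken from a real collection record.
    print(dublin_core_record([
        ("title", "Example Web Archive"),
        ("creator", "Example University Libraries"),
        ("type", "InteractiveResource"),
        ("identifier", "http://example.org/collections/example-web-archive"),
    ]))
```

Keeping the description as simple name and value pairs makes a later transformation to another schema easier to reason about.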
Archive-It Partner Page
Library of Congress Web Archives Page
http://lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html
Library of Virginia
http://www.virginiamemory.com/collections/archival_web_collections
CLIO Record (public view)
WorldCat
Link back to the Archive-It collection
http://www.worldcat.org/title/north-africa-the-middle-east-2011/oclc/756767371
Staffing
Staff needed include:
• Project Management
• Selectors/Curators
• Technical staff for Seed URL preparation (scoping), Quality Review, analysis of reports, etc.
• Catalogers
Training for Staff:
• Use of Tools
• Selection, and how what can and cannot be archived affects it
• Permissions
• Quality Review
Helpful skills: comfort with the web (not everyone has it, in our experience!), flexibility, a good sense of humor
Taking the First Steps…
• Is there web content within your collection scope?
  – Your organization’s website(s)
  – Print material that has migrated to web publication
  – Subject related websites
  – Websites related to manuscript or archival collections
  – State or local government websites
• Research and talk to similar organizations
• Talk to subscription services about trial accounts
• Try out some of the lower barrier tools (e.g. HTTrack)
• Get involved with collaborative web archiving efforts
• Just do it! Jump in!
NDSA Web Archiving Survey
The National Digital Stewardship Alliance (NDSA) Content Working Group [http://www.digitalpreservation.gov/ndsa/working_groups/content.html] is sponsoring this survey of organizations in the United States that are actively involved in or planning to archive content from the web.
http://www.surveymonkey.com/s/USWebArchiving
The survey will close October 31, 2011.
Questions? Comments? Suggestions?
Joanne Archer • [email protected]
Tessa Fallon • [email protected]
Abbie Grotke • [email protected]
Kate Odell • [email protected]