WEB ARCHIVING IN THE BRITISH LIBRARY

WEB ARCHIVING IN THE BRITISH
LIBRARY
John Tuck
Head of British Collections
February 2004
1
BRITISH LIBRARY: CONTEXT

Created by the British Library Act 1972.

National Library of the United Kingdom.

Origins from 1753.

One of the world’s greatest research libraries.

160 million collection items.
2
BRITISH LIBRARY:
COLLECTION DEVELOPMENT

Building as completely as possible the
UK national published archive - current
and retrospective gap filling; print and
electronic.

Collecting research-level English-language material published world-wide
in the humanities, social sciences, STM.

Buying foreign-language material
selectively.

Material acquired through: legal deposit,
voluntary deposit from publishers, purchase,
donation, exchange.
3
LEGISLATION

Legal Deposit Libraries Act 2003: enabling
legislation.

VDEP: Voluntary Deposit of Electronic
Publications.
4
DOMAIN.UK

Six-month experiment to select and capture
100 UK web-sites, 2001.

Audit change, loss, links, etc.

Determine next steps.
5
DOMAIN.UK: Why?

Short-lived nature/changing content of many
web-sites.

Loss of information.

increasing reference to web-sites in
research/scholarship.
6
DOMAIN.UK:
Voluntary/Rights Cleared Approach

Voluntary.

Requiring explicit agreement of website
publishers to take part in pilot.

No public access.
7
DOMAIN.UK: Selection

Websites of historical or cultural
significance.

Cross-section of Dewey Decimal
Classification.
8
DOMAIN.UK: Process

E-mail selected sites for approval and to
check whether already archived.

Measure sites for links, size, change, etc.

Frequency of visits: every three weeks or
more in some cases.

Supported by those sites approached.

Report recommended scaling up.
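
The measurements named on this slide (links, size, change) can be sketched in a few lines. This is a minimal illustration only, not the tooling used in the Domain.uk pilot: the names `LinkCounter`, `measure_capture` and `has_changed` and the sample HTML are hypothetical, and change is detected here simply by comparing SHA-256 digests of two captures of the same page.

```python
import hashlib
from html.parser import HTMLParser


class LinkCounter(HTMLParser):
    """Collects the href targets of <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def measure_capture(html: str) -> dict:
    """Return size in bytes, link count and a content digest for one capture."""
    parser = LinkCounter()
    parser.feed(html)
    data = html.encode("utf-8")
    return {
        "size_bytes": len(data),
        "link_count": len(parser.links),
        "digest": hashlib.sha256(data).hexdigest(),
    }


def has_changed(earlier: dict, later: dict) -> bool:
    """Two captures differ if their content digests differ."""
    return earlier["digest"] != later["digest"]


# Two invented captures of the same page, taken on different visits.
first = measure_capture('<html><body><a href="/about">About</a></body></html>')
second = measure_capture(
    '<html><body><a href="/about">About</a><a href="/news">News</a></body></html>'
)
print(first["link_count"], second["link_count"], has_changed(first, second))
```

A byte-level digest flags any difference, including trivial ones such as a changed timestamp, so a real audit would probably need a more tolerant comparison.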
9
BRITISH LIBRARY WEB ARCHIVING
PROGRAMME

Building on Domain.uk.

BL to play leading role in collecting UK web
presence in partnership with other
institutions nationally and internationally.

Selective approach.
10
BRITISH LIBRARY WEB ARCHIVING
PROGRAMME contd.

Co-ordinate a snapshot of entire UK web
presence at occasional intervals.

Achieve more regular capture of limited and
well-defined range of sites.

Sites judged to be research-level, whether in
terms of stated intentions of sites
themselves or of potential to be primary
resources for research.
11
WEB ARCHIVING PROGRAMME

Comprises a series of complementary
projects and activities.

Operates entirely on a voluntary, rights-cleared
basis pending secondary legal deposit
legislation.

Aims to embed web archiving within the
BL's overall collection development policy.

Aims to provide the infrastructure to collect,
preserve and make accessible web-site
material alongside material in other formats.
12
WEB ARCHIVING PROGRAMME
STRANDS

Four main strands:

Definition of collection development policy.

UK Web Archiving Consortium.

International Internet Preservation
Consortium.

Internet Archive: incunabula of the internet.
13
COLLECTION DEVELOPMENT

Appointment of Curator, Web Archiving.

Extension of policy defined for Domain.uk.

Sites of national, historical and cultural
significance.

Research level now/in the future.
14
UK WEB ARCHIVING CONSORTIUM

Two-year project.

Six partners: BL (lead), National Library of
Scotland, National Library of Wales,
National Archives, Joint Information
Systems Committee, Wellcome Library.

Plan to use PANDAS software developed by
National Library of Australia.

Rights to use individual sites to be cleared
with rights-holders.
15
UK WEB ARCHIVING CONSORTIUM
contd.

Procurement exercise in progress to recruit
a supplier to host the service.

Intention to let contract in April 2004 and to
be operational in summer 2004.

Sites to be made accessible to users.

Each partner to collect up to 500 sites per
year, i.e. 6,000 during project.
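
The 6,000 figure follows from the numbers given on the consortium slides: six partners, up to 500 sites each per year, over a two-year project. A short check of that arithmetic, with variable names chosen here purely for illustration:

```python
PARTNERS = 6                      # partners listed for the consortium
SITES_PER_PARTNER_PER_YEAR = 500  # upper limit per partner per year
PROJECT_YEARS = 2                 # stated length of the project

# Upper bound on sites collected over the whole project.
total_sites = PARTNERS * SITES_PER_PARTNER_PER_YEAR * PROJECT_YEARS
print(total_sites)  # 6000
```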
16
INTERNATIONAL INTERNET
PRESERVATION CONSORTIUM

Project involving national libraries.

Led by Bibliothèque nationale de France.

Also includes BL, Library of Congress,
Library and Archives Canada, Nordic
countries, Italy, Australia, Internet Archive.
17
INTERNATIONAL INTERNET
PRESERVATION CONSORTIUM
contd.

Aims to develop automated web-crawler
mechanism.

Open-source tools to search web at regular
intervals matching agreed collection
development policies.

Working groups on: access tools, content
management, deep web, framework, metrics
and test-beds, researcher requirements.

Developmental at this stage.
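
The idea of visiting sites at regular intervals set by a collection development policy can be sketched as a simple scheduling check. This is an assumption-laden illustration, not the consortium's crawler: the `POLICY` table, the `sites_due` function, the example URLs and the revisit intervals are all hypothetical.

```python
from datetime import date, timedelta

# Hypothetical policy: each selected site mapped to a revisit interval in days.
POLICY = {
    "https://example.org.uk/": 21,
    "https://example.ac.uk/": 180,
}


def sites_due(policy, last_visit, today):
    """Return the sites whose revisit interval has elapsed, or never visited."""
    due = []
    for site, interval_days in policy.items():
        last = last_visit.get(site)
        if last is None or today - last >= timedelta(days=interval_days):
            due.append(site)
    return sorted(due)


# Invented visit history: both sites last captured on 1 January 2004.
last_visit = {
    "https://example.org.uk/": date(2004, 1, 1),
    "https://example.ac.uk/": date(2004, 1, 1),
}
due_sites = sites_due(POLICY, last_visit, date(2004, 2, 1))
print(due_sites)  # ['https://example.org.uk/']
```

Keeping the intervals in a table separate from the scheduling logic means each partner's policy could be changed without touching the code.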
18
INTERNET ARCHIVE

Collecting and saving sites since 1996.

Wayback Machine.

Legal, technical and procurement issues.
19
SOME CHALLENGES

Defining UK.

Rapid technology change.

Third party rights (not always subject to UK
law).

Libel/defamation issues.

Software issues / which platform?

Validity of a snapshot.
20
SOME CHALLENGES
contd.

Formats for archiving.

Metadata standards.

Archiving ‘look and feel’.

Authenticity.
21
