DOI and DIGITAL LIBRARIES

MELLON
E-JOURNAL ARCHIVING
PROJECT
January 20, 2002
DIGITAL PRESERVATION
THE BIG ISSUE IN
DIGITAL LIBRARIES
• Digital is inherently fragile
– constant technological change yields short life
for all digital materials
• Nothing will be saved passively
– requires constant and conscious action to
preserve
• A core role for research libraries in the
digital era?
JOURNAL ARCHIVING IN
THE PAPER ERA
• Large-scale redundancy
• Access copy and archival copy usually the
same
• Not just storage, but preservation
– includes environmental control, library binding,
repair, reformatting...
• Deliberate, long-term archiving largely the
role of national and research libraries
E-JOURNAL MODEL IS
DIFFERENT
• “Copies” are remote, held in publisher
systems
– not replicated across different institutions
• Perpetual license provides limited comfort
in the absence of independent copies
• Long-term preservation involves very
different issues than day-to-day access
LACK OF ARCHIVING A
GROWING PROBLEM
• Libraries bearing double costs
– the e-journals users prefer
– the paper for preservation
• Publishers cannot convert totally to digital
– authors and editors distrust e-only journals
because of concerns about persistence
– libraries demand paper for preservation
• Libraries preserving paper version, but
electronic more complete, increasingly the
copy of record
MELLON E-JOURNAL
ARCHIVING PROGRAM
• 13 institutions invited to submit proposals
for a one-year planning project
• Six planning proposals were selected and
funded in December 2000
– additional project focused on technology
(LOCKSS) also funded
• Second round of Mellon grants to be
announced in June will fund actual
implementation
SIX PLANNING
PROJECTS
• Publisher-based
– Harvard (Wiley, Blackwell, University of
Chicago Press)
– Penn (Oxford and Cambridge University
Presses)
– Yale (Elsevier)
• Discipline-based
– Cornell (agriculture),
– NYPL (performing arts)
• Dynamic e-journals
– MIT
SOME BASIC
ASSUMPTIONS
• Archive should be independent of
publishers
– responsibility of institutions for whom
archiving is a core mission
• Archiving requires active publisher
partnership
• Address long timeframes (100 years?)
• Archive design based on Open Archival
Information System (OAIS) model
OBJECTIVES FOR
PLANNING PROJECTS
• Develop draft archiving agreements with
publisher partners
• Design technical architecture for an archive
• Formulate an acquisitions and growth plan
• Articulate access policies
• Address validation/certification
• Design an organizational model, staffing,
long-term funding model
Key planning issues/decisions…
BASE ON DL
INFRASTRUCTURE
• Use existing infrastructure for storage,
management, preservation, access
• Enhanced to comply with OAIS model
• New ingest and rendering functions
ARCHIVING
AGREEMENT
• Explicit archiving license with publisher
• License addresses what content is archived,
responsibilities of parties, conditions of use,
economics
• Not always an easy negotiation
– archiving involves handing the publisher’s
intellectual property to an independent party
PUSH MODEL
• Publishers will “push” content to be
archived to Harvard
– on-going regular deposit following on-line
publication of issue
• (what happens when issues disappear?)
WHAT CONTENT IS
DEPOSITED?
• “Journal issues” are complex
– publishers do not treat all journal content the
same (e.g., “front matter” treated as web pages,
not objects in content management systems)
– “associated materials” (datasets, images, tables,
etc.) not in the print versions
– advertising usually dynamic, and can involve
country-specific complexities
SOME COMMON STUFF
• Journal description
• Editorial board
• Instructions to authors
• Rights and usage terms
• Copyright statement
• Ordering information
• Reprint information
• Indexes
• Career information
• News
• Events lists
• Discussion fora
• Editorials
• Errata
• Reviewers
• Conference announcements
ARCHIVE MOST
CONTENT
• Exclude little except advertisements
– different from most “local loading”
• Articles include supplementary materials
• Include an “issue object” in addition to the
article components
– masthead, news, jobs, meetings, etc.
• Reference links problematic
– dynamic, frequently separate from article
STANDARD ARCHIVAL
ARTICLE DTD
• Publisher’s SGML formats vary widely
• Consultant report on practicality of common
archival XML DTD
• Dramatically reduces archive complexity
• Issues include
– how low a common denominator
– extended character sets, formulae, etc.
– sacrifice functionality and original appearance
– transformations involve risks
DEPOSIT MORE THAN
ONE FORMAT?
• Archive must accept PDF in any case
– so include both SGML and PDF when
available?
• belt and suspenders
– inclined to do this
• Accept publisher’s original SGML also?
– conversion to archival DTD will result in loss
– inclined to not do this
“DARK-TO-LIGHT”
• Archived material not accessible at deposit
– do not compete with publishers
• Content becomes accessible after “trigger
event”
– default then is universal access
• But how do you know “dark” archival
content is still good?
– it would be better if there were some ongoing
access...
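One common way to check that dark content is "still good" without opening it to users is a periodic fixity audit: record a digest for each object at deposit, then recompute and compare later. The sketch below is illustrative only. The function names, the in-memory dictionaries, and the choice of SHA-256 are assumptions, not part of the project's design.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of an object's bytes."""
    return hashlib.sha256(data).hexdigest()


def audit(objects: dict[str, bytes], manifest: dict[str, str]) -> list[str]:
    """Return the identifiers that fail the audit.

    `manifest` maps object id -> digest recorded at deposit (hypothetical layout).
    An object fails if it is missing or its current digest differs.
    """
    failures = []
    for object_id, recorded_digest in manifest.items():
        data = objects.get(object_id)
        if data is None or sha256_of(data) != recorded_digest:
            failures.append(object_id)
    return failures
```

An audit like this confirms only that the bits are unchanged. It says nothing about whether the format can still be rendered, which is the separate preservation question.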
ACCESS MODEL
• Archived content always accessible to
anyone with appropriate license from
publisher
– might be satisfied by batch export
• After trigger, simple on-line functionality
– assume same functionality for auditors
TRIGGER EVENTS
• “N” years after deposit
– “N” set by publisher title-by-title
• When title/year no longer commercially
accessible on the Internet
– still problematic with some publishers
• When content enters public domain
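The three trigger events above can be read as a simple rule for when a deposited title/year moves from dark to light. The sketch below assumes that any single trigger is sufficient, which the slides do not state explicitly. The class and field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ArchivedVolume:
    deposit_date: date
    embargo_years: Optional[int]   # the publisher's "N"; None if no N was set
    commercially_available: bool   # still offered on the Internet
    public_domain: bool


def is_light(volume: ArchivedVolume, today: date) -> bool:
    """Return True if any trigger event has occurred (assumed: one is enough)."""
    if volume.public_domain:
        return True
    if not volume.commercially_available:
        return True
    if volume.embargo_years is not None:
        # Whole years elapsed since deposit, counted by anniversary.
        before_anniversary = (today.month, today.day) < (
            volume.deposit_date.month,
            volume.deposit_date.day,
        )
        years_elapsed = today.year - volume.deposit_date.year - int(before_anniversary)
        return years_elapsed >= volume.embargo_years
    return False
```

Because "N" is set title-by-title, a real archive would store it per title rather than per volume. It is kept on the volume here only to keep the sketch short.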
PRESERVATION
• Format-by-format issue
• Archive specifies preferred formats, which
will be kept renderable
• Just maintain bits for others
– e.g., “associated materials” (datasets, models,
etc.) generally accepted in ANY format
• maintaining the viability of such wildly
heterogeneous materials is unrealistic
– keep unaltered for future “digital archeology”
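The format-by-format policy amounts to assigning each deposited file one of two service levels. A minimal sketch follows. The preferred-format list and the use of file extensions to identify formats are assumptions for illustration. The slides name PDF and an archival XML DTD as expected deposits but do not define the preferred list.

```python
# Hypothetical preferred-format list; the actual list is not given in the slides.
PREFERRED_FORMATS = {"xml", "pdf"}


def preservation_level(file_name: str) -> str:
    """Classify a file as 'renderable' (kept usable) or 'bits-only' (kept unaltered)."""
    extension = file_name.rsplit(".", 1)[-1].lower() if "." in file_name else ""
    return "renderable" if extension in PREFERRED_FORMATS else "bits-only"
```

Identifying a format by extension is fragile. A production archive would more likely inspect file contents or rely on metadata supplied at deposit.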
ECONOMIC MODEL
• First question is not who pays, but what will
it cost…
– reducing costs to the minimum is critical
• In general publishers expected to bear
preparation costs for archived objects
• Process automation critical to keeping costs
low
– ingest process
– auditing
PAYMENT WITH
DEPOSIT
• Two part fee
– ingest fee to cover up-front costs
• varies with publisher effort to create easily archived
objects?
– “dowry” to create maintenance endowment
• Sources include subscribers, authors,
societies
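As a rough illustration of the two-part fee, the "dowry" can be sized as a perpetuity: an endowment whose real investment return covers the annual maintenance cost. This sizing rule, the function names, and all figures are assumptions for the sketch. The slides do not specify how the dowry would be calculated.

```python
def dowry(annual_maintenance_cost: float, real_return_rate: float) -> float:
    """Endowment needed so that annual returns cover maintenance (perpetuity formula)."""
    if real_return_rate <= 0:
        raise ValueError("real_return_rate must be positive")
    return annual_maintenance_cost / real_return_rate


def deposit_fee(ingest_cost: float, annual_maintenance_cost: float,
                real_return_rate: float) -> float:
    """Two-part fee: one-time ingest fee plus the maintenance dowry."""
    return ingest_cost + dowry(annual_maintenance_cost, real_return_rate)
```

With a hypothetical ingest cost of 100, an annual maintenance cost of 10, and a real return of 5 percent, the dowry dominates the fee. This is one reason the slides stress reducing costs before asking who pays.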
NEXT...
• Proposal to Mellon by April 1 for funding to
implement an archive
– particular parameters of the call-for-proposals
still uncertain
• Original plan suggested 3 or 4 year projects
• Intent is to implement archive, contract for
deposit, begin operations
– learn by getting dirty hands
– help understand issues, costs