HARVARD E-JOURNAL ARCHIVING STUDY Dale Flecker June, 2002

HARVARD
E-JOURNAL ARCHIVING
STUDY
Dale Flecker
June, 2002
JOURNAL ARCHIVING IN
THE PAPER ERA
• Large-scale redundancy
• Access copy and archival copy usually the
same
• Not just storage, but preservation
– includes environmental control, library binding,
repair, reformatting...
• Deliberate, long-term archiving largely the
role of national and research libraries
E-JOURNAL MODEL IS
DIFFERENT
• “Copies” are remote, held in publisher
systems
– Not replicated across different institutions
• Perpetual license provides limited comfort
in the absence of independent copies
• Long-term preservation involves very
different issues than day-to-day access
E-JOURNAL ARCHIVING
A GROWING PROBLEM
• Libraries bearing double costs
– the e-journals users prefer
– the paper for preservation
• Publishers cannot convert totally to digital
– authors and editors distrust e-only journals
because of concerns about persistence
– libraries demand paper for preservation
• Libraries preserving paper version, but
electronic more complete, increasingly the
copy of record
MELLON E-JOURNAL
ARCHIVING PROGRAM
• 13 institutions invited to submit proposals
for planning projects
• Two approaches
– Large-scale distributed replication (LOCKSS)
– Centralized archives serving a wider
community
CENTRAL ARCHIVES
PLANNING PROJECTS
• Publisher-based
– Harvard (Wiley, Blackwell, University of
Chicago Press)
– Penn (Oxford and Cambridge University
Presses)
– Yale (Elsevier)
• Discipline-based
– Cornell (agriculture)
– NYPL (performing arts)
• Dynamic e-journals
– MIT
FOUR BASIC
ASSUMPTIONS
• Archive should be independent of
publishers
– responsibility of institutions for whom
archiving is a core mission
• Archiving requires active publisher
partnership
• Address long timeframes (100 years?)
• Archive design based on Open Archival
Information System (OAIS) model
CENTRAL ARCHIVE
MODEL
• Archive negotiates relationship with
publisher
• Publisher deposits content regularly
• Content accompanied by metadata to
support discovery and preservation
• Archived content only accessible under
specific conditions
• Archive assumes responsibility for long-term
preservation
SOME INTERESTING
QUESTIONS
• What is archived?
• In what format?
• When is archive accessible?
• Who can access archived content?
• What does the archive “preserve”?
• Who does archiving?
• How is the archive paid for?
• How is the archive governed?
WHAT CONTENT IS
ARCHIVED?
E-journals not simply articles….
SOME COMMON STUFF
• Journal description
• Editorial board
• Instructions to authors
• Rights and usage terms
• Copyright statement
• Ordering information
• Reprint information
• Indexes
• Career information
• News
• Events lists
• Discussion fora
• Editorials
• Errata
• Reviewers
• Conference announcements
HARD AREAS
• Masthead, “front matter” stored as web
pages, not in content management systems
• No control over the format of “associated
materials” (datasets, images, tables, etc.)
• Advertising very complex
– dynamic, frequently from third party, can
involve country-specific complexities
• Links frequently separate from articles
– regularly updated, sometimes dynamic
OUR INCLINATION
• Exclude little except advertisements
– based on discussions with librarians and
scholars
– different from most “local loading”
• Articles include supplementary materials
• Include an “issue object” in addition to the
article components
– masthead, news, jobs, meetings, etc.
Format for archived articles?
PDF?
• PDF almost universally available from
publishers
– and the only format available for some journals
• There are qualms...
– proprietary
– marked-up for display, not meaning
– supports limited functionality
– long-term “preservability” unclear
– unlikely to remain the universal format over
time
MARKED-UP TEXT?
• SGML/XML increasingly common
– and likely to become more so
• Greater functionality, easier migration as
technology changes
• Complex
– DTDs vary widely from publisher to publisher
– DTDs far from stable
– archive documentation and rendering would be
complex
“INTERCHANGE”
ARTICLE DTD
• Intended for exchanging content between
independent players
• Reduces complexity of interaction
– archive needs to document, migrate, and
display only one format
• archive can choose whether to maintain articles in
interchange DTD, or transform at ingest for long-term
storage
– publisher needs deposit only one format for all
archives
“INTERCHANGE”
ARTICLE DTD
• Mellon, Harvard, National Library of
Medicine, 2 consultants (Inera, Mulberry)
working on draft standard DTD
• Design based on current publisher practice
– must be easy for publishers to produce
– homogenizes many elements
– leaves options in some difficult areas
– eliminates elements specific to individual
publisher delivery systems
INTERCHANGE DTD
ISSUES
• How low is the common denominator?
• What gets lost?
– inevitably sacrifices some functionality and
original appearance
• Transformation from publisher’s “native”
DTD involves risks
• Some technically difficult areas
– extended character sets, mathematical and
chemical formulae, tables, “generated text”
SGML/XML QUALITY
CONTROL PROBLEM
• SGML/XML is an output rather than the
input for many publishers today
– may not fully reflect the output (PDF, print)
that users see day-to-day…how do you know it
is good?
• If SGML/XML is transformed for deposit,
errors can be introduced
• Quality control of ingested content is
expensive but critical for a sound archive
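
The mechanical part of ingest quality control can be sketched as below. This is only an illustration under stated assumptions: the function name `check_deposit` and the idea that the publisher supplies a SHA-256 checksum for the PDF are hypothetical and do not come from the study. Checks like these catch transfer damage and malformed markup; they cannot tell whether the SGML/XML faithfully reflects what users see in the PDF or print, which is the harder and more expensive question raised above.

```python
import hashlib
import xml.etree.ElementTree as ET


def check_deposit(xml_bytes: bytes, pdf_bytes: bytes, expected_pdf_sha256: str) -> list:
    """Return a list of problems found in one deposited article (empty if none).

    Hypothetical helper: the parameters and the checksum convention are
    assumptions made for illustration only.
    """
    problems = []

    # 1. The marked-up text must at least be well-formed XML.
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError as exc:
        problems.append(f"XML not well-formed: {exc}")

    # 2. The PDF must start with the standard PDF header.
    if not pdf_bytes.startswith(b"%PDF-"):
        problems.append("PDF header missing")

    # 3. The PDF must match the checksum supplied with the deposit.
    actual = hashlib.sha256(pdf_bytes).hexdigest()
    if actual != expected_pdf_sha256.lower():
        problems.append("PDF checksum mismatch")

    return problems
```

An empty list means the deposit passed these automated checks only; human review of content fidelity would still be a separate step.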
ARCHIVE MORE THAN
ONE FORMAT?
• Publisher-based archive must accept PDF in
any case (only format available for some
titles)
– so include both SGML and PDF when
available?
• belt and suspenders
• Accept publisher’s original SGML also?
– preserve information lost in conversion to
interchange DTD
– maintenance over time problematic
WHEN IS ARCHIVE
ACCESSIBLE?
• Most publishers instinctively prefer “dark”
archives
– does not compete with publisher’s service
• If “dark”, what “trigger events” make it
accessible?
– after a given period of time (“moving wall”)?
– when content is not otherwise accessible
(“failsafe”)?
– only when content enters the public domain?
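
The three candidate trigger events can be expressed as a small access rule. The sketch below is an assumption-laden illustration, not a policy from the study: the `ArchivedIssue` fields, the `is_accessible` function, and its parameters are all hypothetical names, and a real archive would negotiate which triggers apply with each publisher.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ArchivedIssue:
    # Hypothetical record; the field names are illustrative only.
    published: date
    available_from_publisher: bool
    public_domain_from: Optional[date] = None


def years_after(d: date, years: int) -> date:
    """Return the date `years` later, mapping 29 February to 28 February if needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def is_accessible(issue: ArchivedIssue, today: date,
                  moving_wall_years: Optional[int] = None,
                  failsafe: bool = False,
                  public_domain: bool = False) -> bool:
    """Return True if any enabled trigger event has occurred for this issue."""
    # "Moving wall": open after a fixed period from publication.
    if moving_wall_years is not None:
        if today >= years_after(issue.published, moving_wall_years):
            return True
    # "Failsafe": open when the content is not otherwise accessible.
    if failsafe and not issue.available_from_publisher:
        return True
    # Public domain: open once the content has left copyright.
    if public_domain and issue.public_domain_from is not None:
        if today >= issue.public_domain_from:
            return True
    return False
```

With no trigger enabled the function always returns False, which corresponds to a fully “dark” archive.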
IS “DARK”
DANGEROUS?
If content is dark, how do you
know it is still good?
(real users are the best auditors)
WHO CAN ACCESS
ARCHIVE CONTENT?
• Just other subscribing institutions?
– does the archive need to maintain complex
records of license rights?
• defining licensees a nightmare
• tracking license changes over time another
nightmare
• Individual subscribers?
– an even greater nightmare
• Everybody?
– dramatically easier to administer
WHAT DOES THE
ARCHIVE PRESERVE?
• Preservation is a format-by-format issue
– and most e-journals are composed of many
formats
• How much “look and feel” preserved?
• Just preserve the “core intellectual
content”?
• Does archive ensure content remains
“render-able” as technology changes?
HARVARD’S DIGITAL
REPOSITORY
• Repository specifies preferred
(“normative”) formats, which will be kept
useable
• Just maintain bits for others
– for e-journals this is likely for many “associated
materials” (datasets, models, etc.)
• generally accepted in ANY format
• maintaining the viability of such wildly
heterogeneous materials unrealistic
– keep unaltered for future “digital archeology”
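
The two service levels described here (keep normative formats usable, keep only the bits for everything else) amount to a lookup by format. The following is a minimal sketch under assumptions: the contents of `NORMATIVE_FORMATS` are invented for illustration, since the slides do not list the repository's actual preferred formats, and the function names are hypothetical.

```python
# Hypothetical list of "normative" formats. The slides name XML as a more
# robust format elsewhere; treating PDF as normative is an assumption.
NORMATIVE_FORMATS = {"application/xml", "application/pdf"}


def preservation_level(mime_type: str) -> str:
    """Return the service level the repository would commit to for a format."""
    if mime_type.lower() in NORMATIVE_FORMATS:
        return "keep-usable"   # migrate as technology changes
    return "bits-only"         # store unaltered for future "digital archeology"


def plan_for(components: dict) -> dict:
    """Map each component of an article (file name -> MIME type) to a level."""
    return {name: preservation_level(mime) for name, mime in components.items()}
```

For an article with an XML text and an arbitrary dataset, `plan_for` would assign "keep-usable" to the former and "bits-only" to the latter, mirroring the treatment of “associated materials” above.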
WHO DOES ARCHIVING?
• “Common good” activity
– model based on a few archives serving many
subscribers
• Is this an appropriate role for individual
universities?
– research libraries have technical capability,
relationships with publishers and subscribers
– BUT how archiving would be paid for is
central…...
HOW IS THE ARCHIVE
PAID FOR?
• First question: who benefits?
– publishers, libraries, authors, scholarly
societies…
– is there a way to share costs?
• Cost categories include
– preparation of “archivable” objects
– ingestion and quality control
– long-term storage
– preservation
PROPOSED MODEL
• Publisher assumes cost of preparing objects
in standard format (whenever possible)
• Deposited material accompanied by two
part fee from publisher
– ingest fee to cover up-front costs
• varies with publisher effort to create easily archived
objects???
– “dowry” to create maintenance endowment
• Real funding sources include subscribers,
authors, societies
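
One way to reason about the size of the “dowry” is the standard perpetuity calculation: an endowment that pays a fixed annual cost forever out of its real return must equal that cost divided by the rate of return. The sketch below applies that textbook formula; it is not a costing method from the study, and the function names and any figures passed to them are purely hypothetical.

```python
def dowry(annual_maintenance_cost: float, real_return_rate: float) -> float:
    """Endowment needed to fund a fixed annual cost indefinitely.

    Uses the perpetuity formula E = C / r. Assumes a constant real rate of
    return and a constant real annual cost, both strong simplifications.
    """
    if real_return_rate <= 0:
        raise ValueError("real_return_rate must be positive")
    return annual_maintenance_cost / real_return_rate


def deposit_fee(ingest_cost: float, annual_maintenance_cost: float,
                real_return_rate: float) -> float:
    """Two-part fee: up-front ingest cost plus the maintenance dowry."""
    return ingest_cost + dowry(annual_maintenance_cost, real_return_rate)
```

The formula makes the sensitivity visible: halving the assumed rate of return doubles the dowry, which is one reason the organizational and economic model is hard to settle.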
HOW IS THE ARCHIVE
GOVERNED?
• Publishers hand their intellectual
property to an independent party: do they have a continuing say?
• Are there other stakeholders who
should also have a say?
HARVARD’S MODEL
ARCHIVE
• Accept content for all titles a publisher
produces
– archive as many journal elements as possible
• Maintain an archive serving the entire
community
• Store and maintain more robust formats
(e.g., XML) when possible
• Collect metadata to support administration
and preservation
HARVARD’S MODEL
ARCHIVE
• Requires only a few archival copies of any
given journal
• Archive assumes responsibility for
preservation migration when canonical
versions deposited
• Organizational and economic model
difficult
NEXT?
Over to Kevin….