"Keeping alert: issues to know today for long-term digital preservation with repositories" Neil Beagrie Fedora Users Group Open Repositories Southampton April 2008


"Keeping alert:
issues to know today for long-term digital
preservation with repositories"
Neil Beagrie
Fedora Users Group Open Repositories
Southampton April 2008
Focus of this lecture
• Research Data
• JISC Research Data Digital Preservation
Costs Study
• Long-term costs and sustainability of
Repositories
Trends
Computer Processing Power and Storage
[Chart: cost in $/MB on a logarithmic scale (10000 down to 0.0001), 1950 to 2010]
Growth of Scientific Data and Data Curation
• In next 5 years e-Science will produce more data than
has been collected in the whole of human history
• Data growth – Protein Data Bank (1972 to 07/2005)
e-Research
and preservation
(UK Science and Innovation Investment
Framework 2004 – 2014)
Information Infrastructure
• 2.23 The growing UK research base must have ready
and efficient access to information of all kinds – such as
experimental data sets, journals, theses, conference
proceedings and patents….
• 2.24 It is clear that the research community needs
access to information mechanisms which: systematically
collect, preserve and make available digital
information;….
• 2.25 The Government [via DTI] will therefore work with
interested funders and stakeholders to consider the
national e-infrastructure (hardware, networks,
communications technology) necessary to deliver an
effective system.
e-research data and repositories
EU Studies:
• Driver2
• E-SCIDR (e-science repositories)
Current UK Studies:
• Data Scientist careers/skills
• Data Audit Framework and institutional pilots
• UK Research Data (shared) Service
Feasibility Study
• Costs for long-term preservation of research data
Keeping Research Data Safe
JISC Research Data Digital Preservation
Costs Study
Overview
• Aim – investigate costs, develop model and
recommendations
• Project team – Me, Julia Chruszcz, Brian
Lavoie (OCLC), Cambridge, KCL,
Southampton
• Method – detailed analysis of 2 cost models
(LIFE & NASA CET) in combination with
OAIS and TRAC; literature review; 12
interviews; 4 case studies.
• 4-month study
• Draft final report in peer review
What have we Produced?
• A cost framework consisting of:
– activity model in 3 parts: pre-archive, archive,
support services
– Key cost variables divided into economic adjustments
and service adjustments
– Resources template for TRAC
– Used in combination to generate cost/charging
models
• 4 detailed case studies (ADS, Cambridge, KCL,
Southampton)
• Spreadsheet supplement
• Data from other services.
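As a rough illustration of how the parts of such a framework could combine into a cost model, here is a minimal sketch. It assumes that the economic and service adjustments act as simple multipliers on per-activity base costs; the slides do not say how they are applied. The `Activity` class, the `total_cost` function and every figure are hypothetical, not taken from the study.

```python
from dataclasses import dataclass

# Phase names follow the three-part activity model listed above.
PHASES = ("pre-archive", "archive", "support services")


@dataclass
class Activity:
    name: str         # hypothetical activity label
    phase: str        # one of PHASES
    base_cost: float  # hypothetical annual cost before adjustments


def total_cost(activities, economic_adjustment=1.0, service_adjustment=1.0):
    """Sum adjusted costs per phase.

    Treating both adjustments as plain multipliers is an assumption
    made for this sketch only.
    """
    totals = {phase: 0.0 for phase in PHASES}
    for act in activities:
        if act.phase not in totals:
            raise ValueError(f"unknown phase: {act.phase}")
        totals[act.phase] += act.base_cost * economic_adjustment * service_adjustment
    return totals


# Illustrative figures only; none of these numbers come from the study.
example = [
    Activity("outreach", "pre-archive", 1000.0),
    Activity("ingest", "archive", 4000.0),
    Activity("administration", "support services", 2000.0),
]
costs = total_cost(example, economic_adjustment=1.5, service_adjustment=1.0)
```

Keeping the per-phase totals separate, rather than returning one number, makes it easier to see which part of the activity model drives a given charging model.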
Some Tentative Findings
Findings
Annual recurrent costs
• Institutional Repository (epublications):
– Staff: 1 FTE
– Equipment (capital depreciated over 3 years): £1,300 pa
• Federated Institutional Repository (data):
– Cambridge: Staff 4 FTE; Equipment (capital depreciated over 3 years) £58,764 pa
– KCL: Staff 2.5 FTE; Equipment (capital depreciated over 3 years) £27,546 pa
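To make the "capital depreciated over 3 years" note concrete, the sketch below back-calculates the capital outlay implied by each annual equipment figure. It assumes straight-line depreciation and that each quoted figure is the annual depreciated charge; the slide states neither, so the results are indicative only. The function name `implied_capital` is hypothetical.

```python
def implied_capital(annual_charge, years=3):
    """Capital outlay implied by an annual charge, assuming straight-line
    depreciation over `years` (an assumption, not stated on the slide)."""
    return annual_charge * years


# Annual equipment figures (GBP) as quoted in the findings above.
annual_equipment = {
    "Institutional repository (epublications)": 1300,
    "Cambridge": 58764,
    "KCL": 27546,
}

implied = {name: implied_capital(charge) for name, charge in annual_equipment.items()}
```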
Findings
• Timing: creating a batch of 1000 records costs c. 333 euros. Once 10 years have passed since
creation, it may cost 10,000 euros to ‘repair’ a batch of
1000 records with badly created metadata (Digitale
Bewaring Project)
• Efficiency curve effects – start-up to operational
• Economy of scale effects – accession rates of
10 or 60 collections: a 600% increase in accessions will
only increase costs by 325% (ULCC)
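The timing and economy-of-scale figures above can be turned into simple ratios. This is a sketch under stated assumptions. The accession multiplier is taken from the explicit 10 and 60 collections (a sixfold rise). The phrase "increase costs by 325%" is ambiguous, so both readings are computed: costs multiplied by 4.25, or by 3.25. The function names are hypothetical.

```python
def repair_to_create_ratio(create_cost, repair_cost):
    """How many times more expensive late repair is than creation."""
    return repair_cost / create_cost


def unit_cost_ratio(accession_multiplier, cost_multiplier):
    """Cost per accession at the higher rate, relative to the lower rate."""
    return cost_multiplier / accession_multiplier


# Timing: 333 euros to create vs 10,000 euros to repair a batch of 1000 records.
timing_ratio = repair_to_create_ratio(333, 10_000)  # roughly 30 times

# Scale: 10 -> 60 collections is read here as a sixfold rise in accessions.
scale_if_costs_x425 = unit_cost_ratio(6, 4.25)  # "by 325%" read as +325%
scale_if_costs_x325 = unit_cost_ratio(6, 3.25)  # "by 325%" read as "to 325%"
```

Under either reading the cost per accession falls at the higher accession rate, which is the economy-of-scale effect the bullet describes.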
Findings
• National subject repositories costs
– Acquisition and Ingest: c. 42%
– Archival Storage and Preservation: c. 23%
– Access: c. 35%
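For a sense of scale, the approximate shares above can be applied to a budget. The sketch below uses a purely hypothetical total of 200,000 (currency unspecified); only the percentages come from the slide, and `split_budget` is an illustrative name.

```python
# Approximate shares for national subject repositories, from the slide.
SHARES = {
    "Acquisition and Ingest": 42,
    "Archival Storage and Preservation": 23,
    "Access": 35,
}


def split_budget(total, shares=SHARES):
    """Split a total budget according to percentage shares."""
    if sum(shares.values()) != 100:
        raise ValueError("shares must sum to 100")
    return {name: total * pct / 100 for name, pct in shares.items()}


# Hypothetical total budget; not a figure from the study.
breakdown = split_budget(200_000)
```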
Findings
• ADS project of long-term preservation costs
• Implications for sustainability via project charges
• Preservation interventions (file format migrations)
• Long-term storage costs
• Assumptions of archive growth (economies of scale)
• Assumptions on “first mover innovation”
What’s New?
• FEC (full economic costing) based – absent or only partial in other models, but:
– Requirement for HEIs
– Absence of FEC (a) distorts business cases, e.g. for
automation; (b) means in-house and outsourced costs cannot be accurately compared
• Not just DIY – application neutral – can cost for an in-house archive, full or partial shared service(s),
national/subject data centre archive charges
• Preservation: archival storage, preservation planning,
data management, “first mover innovation”
• Tailored for research data: different collection levels,
documentation + metadata, products from data, etc.
Conclusions
Cost Observations for
Repositories
• Not just formula of function costs
• Can illustrate effect of some choices on
costs
• Sustainable project archive funding model?
• Start-up v running costs
• Bleeding-edge costs – “first mover
innovation”
• Audit/capacity planning
• Not last word on costs....
Questions?