"Keeping alert: issues to know today for long-term digital preservation with repositories" Neil Beagrie Fedora Users Group Open Repositories Southampton April 2008
Focus of this lecture
• Research Data
• JISC Research Data Digital Preservation Costs Study
• Long-term costs and sustainability of Repositories

Trends: Computer Processing Power and Storage
[Chart: $/MB, 1950 to 2010, axis ticks from 10000 down to 0.0001]

Growth of Scientific Data and Data Curation
• In next 5 years e-Science will produce more data than has been collected in the whole of human history
• Data growth – Protein Data Bank (1972 to 07/2005)

e-Research and preservation
(UK Science and Innovation Investment Framework 2004 – 2014)
Information Infrastructure
• 2.23 The growing UK research base must have ready and efficient access to information of all kinds – such as experimental data sets, journals, theses, conference proceedings and patents….
• 2.24 It is clear that the research community needs access to information mechanisms which: systematically collect, preserve and make available digital information;….
• 2.25 The Government [via DTI] will therefore work with interested funders and stakeholders to consider the national e-infrastructure (hardware, networks, communications technology) necessary to deliver an effective system.

e-research data and repositories
EU Studies:
• Driver2
• E-SCIDR (e-science repositories)
Current UK Studies:
• Data Scientist careers/skills
• Data Audit Framework and institutional pilots
• UK Research Data (shared) Service Feasibility Study
• Costs for long-term preservation of research

Keeping Research Data Safe
JISC Research Data Digital Preservation Costs Study

Overview
• Aim – investigate costs, develop model and recommendations
• Project team – Me, Julia Chruszcz, Brian Lavoie (OCLC), Cambridge, KCL, Southampton
• Method – detailed analysis of 2 cost models (LIFE & NASA CET) in combination with OAIS and TRAC; literature review; 12 interviews; 4 case studies
• 4 month study
• Draft final report in peer review

What have we Produced?
• A cost framework consisting of:
  – Activity model in 3 parts: pre-archive, archive, support services
  – Key cost variables divided into economic adjustments and service adjustments
  – Resources template for TRAC
  – Used in combination to generate cost/charging models
• 4 detailed case studies (ADS, Cambridge, KCL, Southampton)
• Spreadsheet supplement
• Data from other services

Some Tentative Findings

Findings
• Institutional Repository (epublications), annual recurrent costs:
  – Staff: 1 FTE
  – Equipment (capital depreciated over 3 years): £1,300 pa
• Federated Institutional Repository (data), annual recurrent costs:
  – Cambridge: staff 4 FTE; equipment (capital depreciated over 3 years) £58,764 pa
  – KCL: staff 2.5 FTE; equipment (capital depreciated over 3 years) £27,546 pa

Findings
• Timing – costs c. 333 euros for the creation of a batch of 1000 records. Once 10 years have passed since creation it may cost 10,000 euros to ‘repair’ a batch of 1000 records with badly created metadata (Digitale Bewaring Project)
• Efficiency Curve effects – start-up to operational
• Economy of scale effects – accession rates of 10 or 60 collections: a 600% increase in accessions will only increase costs by 325% (ULCC)

Findings
• National subject repositories costs:
  – Acquisition and Ingest c. 42%
  – Archival Storage and Preservation c. 23%
  – Access c. 35%

Findings
• ADS project of long-term preservation costs
• Implications for sustainability via project charges
• Preservation interventions (file format migrations)
• Long-term storage costs
• Assumptions of archive growth (economies of scale)
• Assumptions on “first mover innovation”

What’s New?
• FEC based – not in or partial in other models but
  – Requirement for HEIs
  – Absence of FEC (a) distorts business cases, e.g. for automation (b) cannot accurately compare in-house or outsource costs
• Not just DIY – application neutral – can cost for in-house archive, full or partial shared service(s), national/subject data centre archive charges
• Preservation: archival storage, preservation planning, data management, “first mover innovation”
• Tailored for research data: different collection levels, documentation + metadata, products from data, etc.

Conclusions
Cost Observations for Repositories
• Not just formula of function costs
• Can illustrate effect of some choices on costs
• Sustainable project archive funding model?
• Start-up v running costs
• Bleeding-edge costs – “first mover innovation”
• Audit/capacity planning
• Not last word on costs....

Questions?
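
Worked arithmetic for the Findings figures: the timing and economy-of-scale numbers quoted above can be checked with a few lines of Python. This is only an illustrative sketch, not part of the study's cost framework; the function and constant names are made up for this example, and the only inputs are the figures quoted on the slides (333 euros, 10,000 euros, batches of 1000 records, 600% and 325%).

```python
# Illustrative arithmetic only; names are hypothetical, figures are from the slides.

CREATE_COST_EUR = 333       # creating a batch of records (Digitale Bewaring Project)
REPAIR_COST_EUR = 10_000    # repairing a badly created batch after 10 years
BATCH_SIZE = 1000           # records per batch


def per_record(cost_eur, batch_size=BATCH_SIZE):
    """Cost per record for a batch costing cost_eur."""
    return cost_eur / batch_size


def repair_multiplier(create_eur=CREATE_COST_EUR, repair_eur=REPAIR_COST_EUR):
    """How many times more expensive late repair is than creation."""
    return repair_eur / create_eur


def unit_cost_ratio(volume_multiplier, cost_multiplier):
    """Cost per accession after growth, relative to before growth.

    Both arguments are multipliers (new / old). A result below 1.0
    means each accession became cheaper, i.e. an economy of scale.
    """
    return cost_multiplier / volume_multiplier


if __name__ == "__main__":
    print("creation cost per record (EUR):", per_record(CREATE_COST_EUR))
    print("repair cost per record (EUR):", per_record(REPAIR_COST_EUR))
    print("repair / creation:", round(repair_multiplier(), 1))
    # Assumption: the slide's percentages can be read two ways.
    # Literal "increase of X%" means multipliers of 7.0 and 4.25.
    print("unit cost ratio, literal reading:", round(unit_cost_ratio(7.0, 4.25), 2))
    # Reading them as "X% of the original" means multipliers of 6.0 and 3.25.
    print("unit cost ratio, multiple reading:", round(unit_cost_ratio(6.0, 3.25), 2))
```

Late repair comes out at roughly 30 times the creation cost per batch. For the ULCC figure, the slide does not say which reading of the percentages is intended, but under either assumption the cost per accession falls to somewhere between about half and about three fifths of its original level.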