Where are we with Digital Preservation?
Andrew Waugh
Public Record Office Victoria
Where are we?
• Now this is not the end. It is not even the
beginning of the end. But it is, perhaps,
the end of the beginning
– Winston Churchill
• This talk will cover
– Consensus views on digital preservation
– Open questions and future challenges
What this presentation will cover
• Understanding (building systems)
• Storage (preserving the bit strings)
• Access (preserving the meaning)
• Metadata (preserving the context &
authenticity)
• Transfer (overcoming system senescence)
Understanding
• Communication requires shared terminology and
concepts
• Open Archival Information System (OAIS)
reference model (ISO 14721:2003)
– http://public.ccsds.org/publications/archive/650x0b1.pdf
– High level terminology very widely used, but few use
the detail in the model
– Does not cover preservation
– Pre web and detail does not reflect actual
implementations
– Currently under review
Trusted digital repositories
• How can you be sure if an organisation (& its
system) is up to holding your digital objects?
• Trustworthy Repositories Audit and Certification
– CRL/NARA (2007)
• http://www.crl.edu/content.asp?l1=13&l2=58&l3=162&l4=91
– Administrative focus rather than technical
– High level (cannot be tested)
– Based on OAIS, basis for audit checklists
Audit checklists
• Provide tests to see if a repository can be
trusted
– DRAMBORA: DCC/DPE (2007)
• Risk based, self certification
• http://www.repositoryaudit.eu/
Public domain digital repositories
• Public domain digital repository code
– DSpace (http://www.dspace.org/)
– Fedora (http://www.fedora-commons.org/)
• Both came out of the academic community
and primarily support institutional
repositories
Storage – preserving the bit string
• Fundamental task of digital preservation is
ensuring that the bits that make up the
digital objects are preserved
• “Solved” problem – large scale data
repositories have existed for decades and
there is lots of operational experience
• Archival twist: actively monitor health of
stored objects using hashes
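The hash-based health monitoring mentioned above can be sketched roughly as follows. This is a minimal illustration, not the method of any particular repository: the function names (sha256_of, build_manifest, audit) and the choice of SHA-256 are assumptions made for the example. A digest is recorded for each object at ingest, and a periodic audit recomputes the digests and reports any object that is missing or has changed.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Record a digest for every file under root (done once, at ingest)."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def audit(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the names of objects that are missing or whose bits changed."""
    damaged = []
    for name, expected in manifest.items():
        path = root / name
        if not path.is_file() or sha256_of(path) != expected:
            damaged.append(name)
    return damaged
```

In this sketch the manifest would need to be stored separately from the objects it describes, since a manifest that is damaged along with the data cannot detect the damage.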
Storage - future challenges
• Reducing storage cost (and chance for error)
– Swedish National Archives estimated in 2005 a cost of between 4 and 8
euros per digitised page, mostly in system and support costs
– http://www.tape-online.net/docs/Palm_Black_Hole.pdf
• Reducing risks
– Administrator risk vs packaged risk
• Ideal storage system
– Packaged (i.e. built in administration such as the Centera)
– Open so that you can trust it and replace components
• CLOCKSS
– Uses redundant copies at participating institutions to ensure
preservation (LOCKSS)
– http://www.clockss.org/clockss/Home
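The idea of using redundant copies to detect and repair damage can be illustrated with a simple majority vote over digests. This is a deliberately simplified sketch and is not the actual LOCKSS polling protocol; the function name find_damaged_copies and the site labels are hypothetical.

```python
import hashlib
from collections import Counter
from typing import Optional


def find_damaged_copies(copies: dict[str, bytes]) -> tuple[Optional[str], list[str]]:
    """Compare copies by digest; return (majority digest, sites that disagree).

    Returns (None, all sites) when no digest is held by a strict majority,
    because in that case there is no basis for deciding which copy is good.
    """
    if not copies:
        return None, []
    digests = {site: hashlib.sha256(data).hexdigest() for site, data in copies.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(copies):
        return None, sorted(copies)
    return winner, sorted(site for site, d in digests.items() if d != winner)
```

A site reported as disagreeing would then fetch a fresh copy from one of the sites in the majority. With only two copies a disagreement cannot be resolved this way, which is one reason to hold three or more.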
Access – preserving the meaning
• What do you do when you no longer have
an application to open the data files?
• Current approach is either
– Do nothing now with eventual migration
– Normalisation upon accession
• Future approach might be emulation
Migration
• Save what you capture now and convert to
new formats as required
– Web harvesting (studies show web sites
mostly use safe formats – HTML, XML, JPEG,
GIF, etc)
– Formats (and software) proving surprisingly
resilient
Normalisation
• Convert upon accession to small number
of long term preservation formats
– E.g. PDF/A (PROV), ODF (NAA)
– Immediate cost upon accession, but expected
lower long term management cost
– Criteria for good LTPF (Library of Congress)
• http://www.digitalpreservation.gov/formats/intro/intro.shtml
Challenges
• What is it? Tools to determine file formats
– PRONOM – repository of format descriptions
and DROID (format classifier)
http://www.nationalarchives.gov.uk/pronom/
– JHOVE (Harvard) classifier and simple
validation http://hul.harvard.edu/jhove/
• How accurate is the conversion?
• Is it a valid file according to the standard?
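As a rough illustration of the "What is it?" question, a file format can often be guessed from the first few bytes of the file (its "magic number"). The sketch below is not how DROID or JHOVE are implemented; the SIGNATURES table and the identify function are assumptions for the example, and cover only a handful of well-known signatures.

```python
# A few well-known leading byte signatures. Real tools draw on a much
# larger registry of signatures; this table is illustrative only.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"PK\x03\x04", "ZIP container (e.g. ODF)"),
]


def identify(data: bytes) -> str:
    """Guess a format label from the leading bytes of a file."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"
```

Note that a match only suggests what a file claims to be. It says nothing about whether the file is valid according to the format standard, which is the separate validation question raised above.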
Metadata is better data
• Metadata is information about the bit string
– What it is (semantic)
– What it is (technical)
– How it relates to other digital objects
– What is its history?
– How is it to be managed?
• Unfortunately, lots and lots of large
metadata standards
Metadata standards
• For an excellent summary of metadata
standards see the Metadata chapter in the
DCC Digital Curation Manual
– http://www.dcc.ac.uk/resource/curationmanual/chapters/metadata/metadata.pdf
Digital preservation metadata
• Data Dictionary for Preservation Metadata
(PREMIS)
– little descriptive information and nothing format
specific
– http://www.loc.gov/standards/premis/
• ISO 23081 (Metadata for records)
• National Archives of Australia Recordkeeping
Metadata Standard
– http://www.naa.gov.au/Images/rkms_pt1_2_tcm2-1036.pdf
Future challenges
• Too many competing standards
– Which do I implement?
• Too many elements
– Increases cost of standard development and
software implementation
• Few elements ever used
– Too expensive and too hard to capture
metadata
Transfer
Overcoming system senescence
• Digital objects have a much longer life
than the systems that hold them
– Move objects to digital repositories where
they can be properly managed
– Move them from one digital repository to its
replacement
• Storage is so cheap that holders may be
tempted to keep digital objects (until it is
too late)
Future challenges
• Current systems are not designed around the
assumption that digital objects must be relocated
– AIHT, Conceptual Issues from Practical Tests, Clay
Shirky, D-Lib Magazine, Vol 11 No 12, December
2005,
http://www.dlib.org/dlib/december05/shirky/12shirky.html
• ADRI-UN/CEFACT work on a standard to
transfer custody of digital records
More information
• If I have whetted your appetite...
– PADI Annotated bibliography of digital
preservation (http://www.nla.gov.au/padi/)
– D-Lib Magazine (http://www.dlib.org/)
Final thoughts
• We know about compasses, and we have some
charts, but there are a lot of rocks out there…
We are a long way from satellite navigation
• What about small/medium archives… personal
archives?
• Are photographs better digital or as negatives?
– http://www.wilhelm-research.com/