Basic electronic records

Preserving and Providing Long-Term Access to Archival
Electronic Records
Thomas J. Ruller
www.truller.com
© 2003 Thomas J. Ruller
Workshop topics
• Defining and identifying records and documents in a digital environment
• Selecting and preparing records for long-term preservation
• Preserving records and documents in digital formats
• Providing access to digital materials and special challenges

Basic archival objectives
• Select the information that is most important
• Ensure that it remains useable, understandable and accessible over time
• Protect it from changes in hardware and software
• Make it understandable to current technology

These objectives are accomplished when you:
• Identify all the components of a record.
• Ensure all of the components are managed and cared for.
• Move components to a standards-based data and systems architecture.
• Protect components from change over time.

Life-Cycle Management is the Key

The archival administration of records in electronic form requires LIFE-CYCLE management.
Actions taken at the beginning of the life cycle impact preservation and access.
Start at the beginning
Starting at the beginning... all digital data is simply binary code

• Information is represented by 1 or 0
• Each 1 or 0 is a bit
• 8 ones and zeroes together make a byte
• 7-bit ASCII is a standard for representing characters in 7 bits; the eighth bit of the byte is often used as a check (parity) bit

01000001 = A

01100001 = a

01000010 = B

01100010 = b

• Unicode is an emerging standard for encoding
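
The bit patterns above can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original workshop; the function names are illustrative, and even parity is assumed for the check bit (the slide does not say which scheme is meant).

```python
def to_bits(ch: str) -> str:
    """Return the 8-bit pattern for a 7-bit ASCII character (eighth bit 0)."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not a 7-bit ASCII character")
    return format(code, "08b")


def from_bits(bits: str) -> str:
    """Decode an 8-bit pattern, ignoring the eighth (high) bit."""
    return chr(int(bits, 2) & 0x7F)


def with_even_parity(ch: str) -> str:
    """Use the eighth bit as an even-parity check bit (an assumed scheme)."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not a 7-bit ASCII character")
    parity = bin(code).count("1") % 2  # 1 if the 7 data bits hold an odd number of ones
    return format((parity << 7) | code, "08b")


for ch in "AaBb":
    print(to_bits(ch), "=", ch)
```

With even parity, a character such as C (1000011, three ones) gets a 1 in the eighth position so that the byte as a whole contains an even number of ones.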
Data vs. Metadata
• Data is the actual information that is created and acted upon in the information system
• Metadata is either applied to the data or created by its use. It is information about the data.
• With electronic records, data and metadata are inseparable.

South Carolina Public Record
"Public record" includes all books, papers, maps,
photographs, cards, tapes, recordings, or other
documentary materials regardless of physical form or
characteristics prepared, owned, used, in the
possession of, or retained by a public body.
Does this work in a digital environment?
Records, Documents and Data

Data
– All recorded
information

Documents
– discrete and
identifiable
– logical structure
– stored as more than
one component

Records
– support accountability,
transactional
– recorded
– in any form
– created, received or
maintained
– by an organization or
person
– in the transaction of
business
– and kept as evidence
Records are defined through

Content
– What information and facts do they contain?

Context
– Who created these records, when, why and for what
purpose?

Structure
– What are the components of this record and how are
those components organized?
Metadata is the Key
• Information about the structure and organization of the digital information
– Needed to make the information useable and understandable
• Information about the access and use of the information
– Needed to substantiate its authenticity
Identifying an Organization’s Records
• What is important to document?
• What are the “recordkeeping requirements” of the organization?
• What is considered the “record” of a transaction or element of business?
• What are all of the components of information that make this a “record”?

Qualities of “Authentic”
electronic records:
• Rules governing “Documentary Form”
• Rules Governing Annotations
• Medium
• Context
Authenticity
• Identity and integrity of the record
• Access privileges
• Protective procedures to guard against loss and corruption
• Protective procedures for media and technology
• Established “documentary forms”
• Procedures for authentication
• Procedures for moving records from active to inactive status
Creating Useful Records
From the University of Pittsburgh project
• Develop information systems that are capable of creating and keeping records
– Metadata that defines the context of records
– Who created this information
– When was it created
– What happened to it after it was created
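
One way to picture context metadata is as a small structure that travels with the record and answers the three questions above. The sketch below is illustrative only; the class and field names are assumptions and do not come from the University of Pittsburgh project.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class Event:
    """Something that happened to the record after it was created."""
    when: datetime
    actor: str
    action: str


@dataclass
class RecordContext:
    """Hypothetical context metadata kept alongside a record."""
    creator: str                                         # who created this information
    created: datetime                                    # when it was created
    history: List[Event] = field(default_factory=list)  # what happened to it afterwards

    def log(self, when: datetime, actor: str, action: str) -> None:
        self.history.append(Event(when, actor, action))


ctx = RecordContext(creator="Budget Office", created=datetime(2003, 4, 16, 9, 30))
ctx.log(datetime(2003, 4, 17, 14, 0), "Records Manager", "filed under retention schedule")
ctx.log(datetime(2003, 6, 1, 8, 0), "Archives", "copied to archive storage")
print(ctx)
```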
The DIRKS Methodology
• Understand the business, regulatory and social context in which they operate (step A: preliminary investigation)
• Identify their need to create, control, retrieve and dispose of records (that is, their recordkeeping requirements) through an analysis of their business activities and environmental factors (steps B and C: analysis of business activity and identification of recordkeeping requirements)
• Assess the extent to which existing organizational strategies (such as policies, procedures and practices) satisfy their recordkeeping requirements (step D)
• Redesign existing strategies or design new strategies to address unmet or poorly satisfied requirements (steps E and F)
• Implement, maintain and review these strategies (steps G and H)
Simple strategies for office documents
• Develop filing systems for storing electronic files.
• Develop naming conventions for types of documents.
• Set baseline standards for important business documents.
• Office defines procedures for all documents.
• Filing systems should support implementation of the retention schedule.
• Electronic files are copied to archive storage.
• Electronic files are migrated to new formats.
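
A naming convention is easiest to enforce when file names are generated rather than typed. The following is a sketch of one possible convention (document-type code, ISO date, short title, version); the type codes and the pattern itself are hypothetical and would be set by each office.

```python
import re
from datetime import date

# Hypothetical document-type codes; each office would define its own.
DOC_TYPES = {"minutes": "MIN", "report": "RPT", "memo": "MEM"}


def build_name(doc_type: str, title: str, created: date, version: int, ext: str) -> str:
    """Build a file name of the form TYPE_YYYY-MM-DD_short-title_vNN.ext."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{DOC_TYPES[doc_type]}_{created.isoformat()}_{slug}_v{version:02d}.{ext}"


print(build_name("minutes", "Budget Committee", date(2003, 4, 16), 2, "doc"))
```

Putting the type code and date first means a plain alphabetical directory listing already groups documents by type and orders them by date, which supports applying a retention schedule to whole groups of files.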
Records are equivalent regardless of format
• Identify what information must be created
• Determine how long the information must be maintained
• Understand the uses and functions of the records, both primary and secondary
• Select the “best evidence” of the activity, fact, transaction or event and maintain it.

Questions
• How do you apply these distinctions to personal papers?
• Can you apply these distinctions “after the fact”?
• Can data or information be institutional records?

Defining records with business rules
• Key source of documentation
• Key to understanding:
– Risks
– Documentation needs
– Archiving needs
• Often not written
• Required for complex data systems

Sources of business rules
• Law and regulation
• Federal law and regulation
• Operational procedures
• Data Administration
• System documentation

These same considerations can apply to individuals and organizations.
Document Management System Solutions
• DOD 5015.2
• Requires a file plan
• Requires implementation support
• Helps organizations select and file mail and documents they need to keep
A Word About Metadata

Data about the structure and content of the
information
– Code books
– File layouts
– Database Entity-Relationship-Diagram

Data about the access and use of the
information
– Audit trails, etc.
More on Metadata

Different levels of metadata needed for
different types of records
– High risk records that require greater
authenticity require lots of access and use
metadata
– Low risk records may require no access and use
metadata
– Virtually all records require some kind of
structural documentation
Documentation
• File and record layout
• Codebook
• Data-flow diagrams
• Entity-Relationship Diagrams
• Log files and audit trails
• Electronic repository system

Electronic records inventory
• Name of system
• Function of system
• List of sub-systems
• Functions of sub-systems
• Recordkeeping requirements supported
• List of files/procedures that support the recordkeeping requirements

Electronic records system inventory continued
• Operating environment
• Organization of the data
• Source of documentation
• Related audit trails
• Purge/Migration criteria
• Documentation of off-line resources
• Security documentation/procedures

Retention Scheduling /
Purge-Migration
• What information is kept?
• What files does it go to?
• Where do the files get stored?
• How long are they stored there?
• What happens at the end of each stage of the life of the data?

Note: This is a systems tool, not a recordkeeping process.
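
The questions above map naturally onto a small table of stages. The sketch below shows one way such a purge/migration schedule might be written down and queried; the stage names, durations and actions are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Stage:
    location: str          # where the files are stored
    years: Optional[int]   # how long they are stored there (None = indefinitely)
    at_end: str            # what happens at the end of this stage


def stage_on(schedule: List[Stage], created: date, today: date) -> Stage:
    """Return the stage that data created on `created` is in on `today`."""
    # Age in whole years, subtracting one if the anniversary has not yet passed.
    age = today.year - created.year - ((today.month, today.day) < (created.month, created.day))
    elapsed = 0
    for stage in schedule:
        if stage.years is None or age < elapsed + stage.years:
            return stage
        elapsed += stage.years
    return schedule[-1]


# Hypothetical schedule for one file in a system.
SCHEDULE = [
    Stage("online database", 2, "migrate to tape"),
    Stage("near-line tape", 5, "transfer copy to archives"),
    Stage("archives", None, "retain permanently"),
]

current = stage_on(SCHEDULE, created=date(2003, 1, 15), today=date(2006, 1, 15))
print(current.location, "->", current.at_end)
```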
Purge/Migration
or Archiving Criteria
• Purging data from system or migrating to tape
• In the language of the data manager
• Provide specific guidance on what files/data/processes are maintained
• Easily and frequently updated
• Can be documented
Evidential value and electronic
records

Tries to answer the question, what did they
know?
– Uses metadata on the access and use of the
system.
– Requires “snapshots” of data to link to
metadata
– Available only for “high-risk” environments
Legal Evidence

• Proving that the information is the “best evidence” of a fact, activity, or action.
– Complete
– Accurate
– Created in normal course of business
– Authentic
– Original
• All accomplished through “use” metadata and business rules
Appraisal of legacy electronic
records

Content appraisal
– Determines whether the function and the
information are of archival quality

Technical appraisal
– Determines whether the data is complete and
structured in such a way that it can be
preserved.
Content Appraisal

Basic archival questions
– Informational value
» What facts?
– Completeness of data
– Time series
– Function of data
– Accuracy and reliability
– Evidential value
Technical Appraisal
• Structure of information
• Types of storage used for the files
• Purge/Migration criteria and methods
• Physical storage methods and capabilities of creator
• Analysis of documentation
• File sizes, integrity of database, completeness of files, etc.

Complex Relational Database
• Entity Relationship Diagram
• File and record layouts
• Business rules
• Function to entity diagram or documentation
• Archiving criteria and plan

Annual Report Web Page
• Appraisal of annual report
• Downloaded HTML/XML files
• Downloaded cascading style sheet definition
• Downloaded images and graphics
• Inventory of all document components

Web resource preservation
• Understand that the web is not a recordkeeping system by itself, it is a delivery system
• Static HTML pages are only one small aspect of web content
• The web is multi-component and extremely dynamic
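
Because a single web page is made of several components, a first step toward an inventory is to list what a downloaded page refers to. The sketch below uses Python's standard `html.parser` module; the class name and the component categories are assumptions made for illustration, and it only sees components named in static HTML.

```python
from html.parser import HTMLParser


class ComponentInventory(HTMLParser):
    """Collect the style sheets, images and scripts a static HTML page refers to."""

    def __init__(self):
        super().__init__()
        self.components = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # tag and attribute names arrive lower-cased
        if tag == "link" and attrs.get("href"):
            rel = (attrs.get("rel") or "").lower()
            kind = "stylesheet" if "stylesheet" in rel else "link"
            self.components.append((kind, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.components.append(("image", attrs["src"]))
        elif tag == "script" and attrs.get("src"):
            self.components.append(("script", attrs["src"]))


def inventory(html: str):
    parser = ComponentInventory()
    parser.feed(html)
    parser.close()
    return parser.components


page = (
    '<html><head><link rel="stylesheet" href="report.css"></head>'
    '<body><img src="cover.gif"><IMG SRC="chart1.gif"></body></html>'
)
for kind, target in inventory(page):
    print(kind, target)
```

Content generated by scripts or pulled from a database at request time will not appear in such a list, which is one reason static harvesting captures only part of a site.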

Web harvesting tool example
• HTTrack is one example, used by National Library of Australia
• Open source software used to harvest specific web sites
• Copies HTML and other files into your file system.
• Works well for static content
• Modifies links for retrieval
Internet Archive
www.archive.org
• Wholesale “capture” of all static HTML content
• Not selective or related to changes in content
• Uses Alexa web crawler and proprietary storage method
Internet Archive Footer
<!-- SOME LINK HREF'S ON THIS PAGE HAVE BEEN REWRITTEN BY THE WAYBACK MACHINE
OF THE INTERNET ARCHIVE IN ORDER TO PRESERVE THE TEMPORAL INTEGRITY OF THE SESSION.
-->
<SCRIPT language="Javascript">
<!--
// FILE ARCHIVED ON 19970606072913 AND RETRIEVED FROM THE
// INTERNET ARCHIVE ON 20020416142610.
// JAVASCRIPT APPENDED BY WAYBACK MACHINE, COPYRIGHT INTERNET ARCHIVE.
// ALL OTHER CONTENT MAY ALSO BE PROTECTED BY COPYRIGHT (17 U.S.C.
// SECTION 108(a)(3)).
.
//-->
</SCRIPT>
</HTML>
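
The two 14-digit numbers in the footer above are timestamps in the form YYYYMMDDhhmmss. As a small sketch (the function names are illustrative, and the time zone is not stated in the footer, so none is assumed), they can be pulled out and read as dates:

```python
import re
from datetime import datetime


def parse_wayback(timestamp: str) -> datetime:
    """Read a 14-digit YYYYMMDDhhmmss timestamp as a (time-zone-naive) datetime."""
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S")


def find_timestamps(footer: str):
    """Return every 14-digit timestamp found in the footer text, in order."""
    return [parse_wayback(ts) for ts in re.findall(r"\b(\d{14})\b", footer)]


footer = """// FILE ARCHIVED ON 19970606072913 AND RETRIEVED FROM THE
// INTERNET ARCHIVE ON 20020416142610."""

archived, retrieved = find_timestamps(footer)
print("archived: ", archived.isoformat())
print("retrieved:", retrieved.isoformat())
```

The first value is when the page was captured and the second is when the copy was served, which is exactly the kind of access and use metadata discussed earlier.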
• Selection guidelines for government publications
• Procedures for managing and adding to the archive
• Guidelines for transfer of ownership
• Practices for persistence of links and for naming resources
• pandora.nla.gov.au

Accessioning
• Physical and legal acquisition of records
• May result in a copy of records transferred to archives as the “record copy”
• Direct relationship to purge/migration criteria
• “First step” in preservation process

Accessioning guidelines

Recording mode for data
– Standard label ASCII or EBCDIC (cartridge or
tape only)
– Density of recording
– Operating system requirements
– What files are on the media
– ISO standard for CD-ROM
Accessioning guidelines

Documentation must accompany records
– Code books
– File and record layouts
– Data dictionaries
– Index files
– Media documentation
Accessioning guidelines

Metadata and documentation are also transferred at the time records are transferred
– Log files are most common metadata
– Linked metadata files common in database
environments
– Data dictionary file
– Entity Relationship Diagram
– DTD for encoded text
Accessioning procedures

“Pre-Accessioning” analysis
– Test whether the records you receive are
complete, match the documentation, and are in
accessible data structures
– Simple dump of sample of data
» applies to all digital materials
– Compare to documentation
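
For a flat file of fixed-length records, the pre-accessioning checks above can be sketched in a few lines. This is an illustration under stated assumptions: the record length and expected count would come from the creator's documentation, the data is assumed to be ASCII, and the function name is hypothetical.

```python
def check_transfer(data: bytes, record_length: int, expected_records: int, sample: int = 3):
    """Compare a transferred file with its documentation and dump a small sample."""
    problems = []
    count, remainder = divmod(len(data), record_length)
    if remainder:
        problems.append(f"{remainder} trailing bytes do not fill a record")
    if count != expected_records:
        problems.append(f"found {count} records, documentation says {expected_records}")
    # Simple dump of the first few records for comparison with the record layout.
    dump = [
        data[i * record_length:(i + 1) * record_length].decode("ascii", errors="replace")
        for i in range(min(sample, count))
    ]
    return problems, dump


data = b"0001SMITH " + b"0002JONES "
problems, dump = check_transfer(data, record_length=10, expected_records=2)
print(problems or "matches documentation")
for line in dump:
    print(repr(line))
```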
Accessioning procedures

Make two copies of the records
– One for accessioning
– One for off-site storage

Expect to spend a lot of time working with
the records creator to acquire
documentation and work out details of
media formats, etc.
Accessioning procedures

Tools:
– SPSS for statistical or numeric files
– Desktop database for simple databases
– Robust RDBMS for SQL based databases
– Word processor for desktop files
– High-end PCs/Unix workstations
– Tape drives
– DTDs or RFCs
– System modelling tools
Accessioning procedures

Storage media
– The only “archival” digital media are open-reel tape or 3480/3490 or 3590 magnetic cartridges.
No other media have been rigorously tested or subscribe to national/international standards.
Accessioning Procedures
Digital Linear Tape is becoming an alternative and
has been “field tested” as a possible alternative to
tape cartridges.
Preservation Components
• Management of media to ensure that the information can be retrieved from some storage medium
• Preservation of the data
• Preservation of the means to understand and interpret the data
• Preservation of the metadata to demonstrate that the data is an authentic record
Preservation

Media preservation
– Store only on tape or cartridge
– Periodic (3-5 year) rewind of each volume
– Periodic (5-10 year) copy to new media
– Migrate to new media when appropriate/necessary
– Maintain environmental controls
– Maintain use and archival copies
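
The periodic tasks above lend themselves to a simple due-date check. The sketch below assumes the shorter end of each range (rewind every 3 years, copy every 5) and a hypothetical per-volume record with `written` and `last_rewind` dates; a real media log would carry more detail.

```python
from datetime import date

REWIND_YEARS = 3   # assumed: lower bound of the 3-5 year rewind interval
COPY_YEARS = 5     # assumed: lower bound of the 5-10 year copy interval


def add_years(d: date, years: int) -> date:
    """Add whole years to a date, mapping 29 February to 28 February when needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def tasks_due(volume: dict, today: date) -> list:
    """List the media-preservation tasks that are due for one tape volume."""
    due = []
    if add_years(volume["last_rewind"], REWIND_YEARS) <= today:
        due.append("rewind")
    if add_years(volume["written"], COPY_YEARS) <= today:
        due.append("copy to new media")
    return due


volume = {"written": date(1998, 5, 1), "last_rewind": date(2001, 5, 1)}
print(tasks_due(volume, today=date(2003, 6, 1)))
```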
Preservation

Migration:
– Move to new hardware and software
environments because data is in a standard
format

Emulation
– Create software tools that mimic the original
encoding scheme and save the bitstream as-is
Preservation

Migration:
– David Bearman
– Kenneth Thibodeau
– Margaret Hedstrom
– 40 years of experience

Emulation:
– Jeff Rothenberg
– Universal Preservation Format (sort of)

The 1980s video game Missile Command running under emulation via the Internet… 2003
Preservation via Migration

Objective:
– Understandable
– Useable
– Accessible

Accomplished through:
– Media
– Recording methods
– Data formats
Preservation through standard formats
• Information maintained in a standard, ubiquitous format.
• Formats are backward compatible.
• Formats that ensure no information loss.
• Information is migrated to target preservation format.
• Data structures and encoding format should both be in appropriate format.
Life-cycle standards
• Industry standards are often the best available approach
• Ubiquitous environments
• Archivability begins at the point of creation
• Archiving must add value if creators are to incorporate it.

Standards list

Text
– SGML
– PDF
– ASCII
– XML

Sound/Video
– MPG

E-mail
– SMTP
– Encoded text
– SQL database

Database
– SQL
– xBase
– XML

GIS
– SDTS
– Content Standard

Recording Modes
– ASCII
– EBCDIC
Preservation of meaning through documentation and metadata
• File layouts and formats must be defined in structural metadata or documentation
• Meaning of specific attributes must be explained in structural metadata.
• Information that supports completeness and authenticity must be maintained through maintenance of logs, record counts and business process documentation.
• All of this information must be maintained and migrated as well if it is in electronic format.
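
To see why the layout and the codebook must travel with the data, consider a fixed-width record that is meaningless without them. The layout, field names and codes below are invented for illustration.

```python
# Hypothetical file layout: (field name, start column, end column), zero-based, end exclusive.
LAYOUT = [("case_id", 0, 4), ("surname", 4, 14), ("status", 14, 15)]

# Hypothetical codebook: what the coded values of each attribute mean.
CODEBOOK = {"status": {"A": "active", "C": "closed"}}


def decode(record: str) -> dict:
    """Turn one fixed-width record into named, human-readable values."""
    row = {}
    for name, start, end in LAYOUT:
        value = record[start:end].strip()
        # Translate coded values where the codebook defines them; otherwise keep the raw value.
        row[name] = CODEBOOK.get(name, {}).get(value, value)
    return row


record = "0001" + "SMITH".ljust(10) + "A"
print(decode(record))
```

Without `LAYOUT` the record is an undifferentiated string of characters, and without `CODEBOOK` the final "A" cannot be interpreted, which is why both are treated as part of the record.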
Preservation of documentation and
metadata

Sources
– Entity Relationship Diagrams (UML)
– New file layouts
– New documentation of recording modes
Access
• User guide is primary vehicle for access
• Good user guides enable the researcher to understand the records before actually seeing the data itself.
• Modern technology tools greatly enhance the ability to locate discrete pieces of information.

OAIS

Ingest: to bring a bitstream into the system.
– Wraps bits to identify them for retrieval
– Ensures bits are complete and accurate

Storage:
– Maintains bits over time

Dissemination:
– Query capability
– Reproduction of bits into useful information
– Protection of sensitive information

OAIS does not discriminate about bits. Selection and context are needed.
Ken Thibodeau, “Building the Archives of the Future,” D-Lib Magazine, February 2001
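
The ingest and storage functions described above can be illustrated with a toy example: wrap the bits with an identifier and a checksum at ingest, then recompute the checksum later to confirm they are still complete and accurate. This is only a sketch; the OAIS model does not prescribe SHA-256 or this structure, and the function names and the in-memory `store` are assumptions.

```python
import hashlib


def ingest(identifier: str, bits: bytes, store: dict) -> str:
    """Wrap a bitstream with an identifier, size and checksum, and place it in storage."""
    digest = hashlib.sha256(bits).hexdigest()
    store[identifier] = {"bits": bits, "size": len(bits), "sha256": digest}
    return digest


def verify(identifier: str, store: dict) -> bool:
    """Check that the stored bits still match the checksum recorded at ingest."""
    package = store[identifier]
    return (
        len(package["bits"]) == package["size"]
        and hashlib.sha256(package["bits"]).hexdigest() == package["sha256"]
    )


store = {}
ingest("annual-report-2002", b"<html>...</html>", store)
print(verify("annual-report-2002", store))
```

Note that nothing here says whether the bits were worth keeping or what they mean, which is the point of the last statement on the slide.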
Research trends

Functional
requirements research
– Focus on records
– At beginning of lifecycle
– Based on model
developed at
University of
Pittsburgh

Migration/preservation
research
– Emerging from digital
libraries work
– Heavy focus on digital
images
– RLG is a major player
in this work
– InterPARES
– NARA
Research trends

Functional
requirements research
– Focus on records
– At beginning of lifecycle
– Based on model
developed at
University of
Pittsburgh
– Trustworthy
Information Systems

Migration/preservation
research
– Emerging from digital
libraries work
– SDSC/PERPOS
– CAMiLEON
– RLG is a major player
in this work
Key Archival Research
• RLG Digital Libraries and various task forces of Preserv section
• US NARA sponsored research at UCSD and Georgia Tech
• InterPARES
• Program implementations at:
– Delaware State Archives
– Indiana University
– State of Michigan
– Cornell University
Your skills
• Keep a general understanding of how records can be created
• Learn business process analysis
• Learn project management
• Understand trends in standards
• Understand preservation research
• Read:
– JASIS
– PC
– HBR
Concluding guidance
• Remember that records are representations of activity, not physical devices
• Electronic records management requires life-cycle management
• Preservation is much more than media management
• Documentation is of equal importance to the data.
• Records are more than just the data.