JHOVE2: A Next-Generation Architecture for Format-Aware Characterization. British Library, 1 October 2008. Stephen Abrams (California Digital Library); Sheila Morrissey and Evan Owens (Portico); Tom Cramer and Keith Johnson (Stanford University).


JHOVE2
A Next-Generation Architecture for
Format-Aware Characterization
British Library, 1 October 2008
Stephen Abrams
California Digital Library
Sheila Morrissey
Evan Owens
Portico
Tom Cramer
Keith Johnson
Stanford University
Thanks to…
• Koninklijke Bibliotheek, our meeting sponsor
• British Library, our meeting host
• Library of Congress, our funding agency
Agenda
• Introductions
• Review of objectives and agenda
• Project goals, deliverables, and schedule
• New terminology
• Functional requirements
• Assessment
• Technology options
• Community engagement
• Next steps
Introductions
• Who are you?
• Where are you from?
• What is your level of involvement with JHOVE?
Objectives
• Get feedback on the utility and achievability of project
goals, deliverables, and schedule
• Introduce new terminology and concepts
• Refine functional requirements
• Understand what assessment means in preservation
workflows
• Propose community engagement plan
• Gauge the potential for development partnerships
• Solicit interesting test data
Why JHOVE2?
• Preservation requires management of the gap between
what you were given and what you need
• That gap is only manageable if it is quantifiable
• Characterization tells you what you have, as the starting
point for iterative preservation planning and action
[Figure: iterative cycle linking Characterization, Preservation planning, and Preservation action]
Adapted from A. Brown, “Developing Practical Approaches to Active Preservation,” IJDC 2:1 (June 2007): 3-11.
JHOVE2 project
• A next-generation architecture for format-aware object
characterization
– Three-fold goals:
• Re-factor the existing architecture to achieve higher performance,
simplify system integration, and encourage third-party enhancement
• Provide significant new function
• Implement modules
• Collaborative project of CDL, Portico, and Stanford
University
– Funded by Library of Congress/NDIIPP
– Open source BSD license
New function
• Complex object data model
• Generic plug-in interface
• Common data structure passed between modules to
enable stateful processing
• Identification de-coupled from validation
• Standardized handling of format profiles and error
reporting
• Symbolic display of binary formats
• API-level support for editing
Format support
• Based on project partner requirements and budgetary
constraints
– Image: JPEG 2000, TIFF
– Audio: WAVE
– Text: SGML, UTF-8, XML
– Document: PDF
– GIS: Shapefile
– Color: ICC
– And their well-known variants, e.g. TIFF/IT, TIFF/EP, GeoTIFF, EXIF, DNG, …
• Unfortunately precluding some JHOVE-supported
formats
– AIFF, GIF, HTML, JPEG
Schedule
• Months 1-6: Outreach, design, and prototyping
• Months 7-9: Core APIs and framework
• Months 10-24: Module implementation
Terminology
• What is it?
– Identification: Determining presumptive format through signature matching
• What is it, really?
– Validation: Determining conformance to commonly accepted normative requirements
• What about it?
– Feature extraction: Reporting intrinsic properties significant to preservation planning and action
• What should you do with it?
– Assessment: Determining acceptability for a given purpose on the basis of locally-defined policies
Objects, not files
• JHOVE assumed 1 object = 1 file = 1 format
• But what about…
– TIFF with embedded ICC profile and XMP metadata: 1 object = 1 file = 3 formats
– JPEG 2000 JPX fragmentation: 1 object = n files = 1 format
– ESRI Shapefile: 1 object = 3 files = 3 formats
• JHOVE2 will support 1 object = n files = m formats
Objects, not files
[Figure: source units mapped to reportable units]
• Source unit abcd.tif; reportable units TIFF, ICC, XMP
• Source units 1234.dbf, 1234.shp, 1234.shx; reportable units Shapefile, dBASE IV, SHP, SHX
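The "1 object = n files = m formats" model can be sketched as a small tree of reportable units. This is a minimal illustration only: the class ReportableUnit and its methods are hypothetical names, not the JHOVE2 API, and treating the aggregate Shapefile as a unit of its own is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical reportable unit: one format, backed by one or more source units. */
final class ReportableUnit {
    private final String format;
    private final List<String> sourceUnits;   // file names backing this unit
    private final List<ReportableUnit> children = new ArrayList<>();

    ReportableUnit(String format, String... sourceUnits) {
        this.format = format;
        this.sourceUnits = Arrays.asList(sourceUnits);
    }

    /** Nests a unit inside this one and returns this unit for chaining. */
    ReportableUnit add(ReportableUnit child) {
        children.add(child);
        return this;
    }

    /** Distinct files backing this unit and everything nested in it. */
    Set<String> allSourceUnits() {
        Set<String> files = new LinkedHashSet<>(sourceUnits);
        for (ReportableUnit child : children) files.addAll(child.allSourceUnits());
        return files;
    }

    /** Formats of this unit and everything nested in it, parent first. */
    List<String> allFormats() {
        List<String> formats = new ArrayList<>();
        formats.add(format);
        for (ReportableUnit child : children) formats.addAll(child.allFormats());
        return formats;
    }
}

final class ObjectModelExamples {
    /** 1 object = 1 file = 3 formats. */
    static ReportableUnit tiffWithEmbeddedProfiles() {
        return new ReportableUnit("TIFF", "abcd.tif")
            .add(new ReportableUnit("ICC", "abcd.tif"))
            .add(new ReportableUnit("XMP", "abcd.tif"));
    }

    /** 1 object = 3 files = 3 member formats, plus the aggregate itself. */
    static ReportableUnit shapefile() {
        return new ReportableUnit("Shapefile", "1234.dbf", "1234.shp", "1234.shx")
            .add(new ReportableUnit("dBASE IV", "1234.dbf"))
            .add(new ReportableUnit("SHP", "1234.shp"))
            .add(new ReportableUnit("SHX", "1234.shx"));
    }
}
```

Under these assumptions, the TIFF example reports one source unit and three formats, while the Shapefile example reports three source units.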
Functional requirements
• JHOVE2 will provide a highly configurable, extensible,
and scalable framework for preservation characterization
• JHOVE2 can process an arbitrary number of source
units during a single invocation
• JHOVE2 function will be encapsulated into granular
plug-in modules that can be installed in and invoked by
the framework
• The output of characterization will be expressed in an
intermediate XML form, with further presentation via XSL
• JHOVE2 can be invoked through an API or as a standalone application with a command line interface
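One way to picture the plug-in requirement is a framework that runs each installed module over each source unit and passes a shared state object between them. The sketch below is an assumption made for illustration: CharacterizationModule, CharacterizationState, and CharacterizationFramework are hypothetical names, and the design under discussion may differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical shared state passed from module to module for one source unit. */
final class CharacterizationState {
    final String sourceUnit;
    final Map<String, String> properties = new LinkedHashMap<>();

    CharacterizationState(String sourceUnit) {
        this.sourceUnit = sourceUnit;
    }
}

/** Hypothetical plug-in contract: each module reads and extends the shared state. */
interface CharacterizationModule {
    String name();
    void process(CharacterizationState state);
}

/** Hypothetical framework: runs every installed module over every source unit. */
final class CharacterizationFramework {
    private final List<CharacterizationModule> modules = new ArrayList<>();

    /** Installs a module; modules run in installation order. */
    CharacterizationFramework install(CharacterizationModule module) {
        modules.add(module);
        return this;
    }

    /** Processes any number of source units in a single invocation. */
    List<CharacterizationState> characterize(String... sourceUnits) {
        List<CharacterizationState> results = new ArrayList<>();
        for (String unit : sourceUnits) {
            CharacterizationState state = new CharacterizationState(unit);
            for (CharacterizationModule module : modules) {
                module.process(state);   // later modules see earlier results
            }
            results.add(state);
        }
        return results;
    }
}
```

Because the state object travels with the source unit, a validation module installed after an identification module can read the format that was already recorded rather than re-deriving it.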
Identification requirements
• Based on extrinsic hints and internal/external signatures
– Aggregate-level identification signatures are based on file-level
properties
• Identification may be possible for formats for which
JHOVE2 feature extraction or validation is not
supported
• Identification will be reported in terms of level of
confidence
• Identification will be reported in terms of all known
common names and public identifiers
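As an illustration of combining an internal signature with an extrinsic hint, the sketch below matches leading bytes against the well-known TIFF and PDF signatures and compares the result with the file extension. The three-level Confidence scale and all class names are hypothetical; the slides do not define a confidence scale.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical confidence levels, weakest first. */
enum Confidence { EXTENSION_ONLY, SIGNATURE_ONLY, SIGNATURE_AND_EXTENSION }

final class Identification {
    final String format;
    final Confidence confidence;

    Identification(String format, Confidence confidence) {
        this.format = format;
        this.confidence = confidence;
    }
}

final class SignatureIdentifier {
    private static final byte[] TIFF_LE = {0x49, 0x49, 0x2A, 0x00};  // "II", then 42
    private static final byte[] TIFF_BE = {0x4D, 0x4D, 0x00, 0x2A};  // "MM", then 42
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private static boolean startsWith(byte[] data, byte[] signature) {
        return data.length >= signature.length
            && Arrays.equals(Arrays.copyOf(data, signature.length), signature);
    }

    /** Internal signature: inspects the leading bytes of the content. */
    private static String bySignature(byte[] head) {
        if (startsWith(head, TIFF_LE) || startsWith(head, TIFF_BE)) return "TIFF";
        if (startsWith(head, PDF)) return "PDF";
        return null;
    }

    /** Extrinsic hint: inspects only the file name. */
    private static String byExtension(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".tif") || lower.endsWith(".tiff")) return "TIFF";
        if (lower.endsWith(".pdf")) return "PDF";
        return null;
    }

    /** Returns every presumptive format, each with a level of confidence. */
    static List<Identification> identify(String fileName, byte[] head) {
        List<Identification> results = new ArrayList<>();
        String sig = bySignature(head);
        String ext = byExtension(fileName);
        if (sig != null) {
            results.add(new Identification(sig, sig.equals(ext)
                ? Confidence.SIGNATURE_AND_EXTENSION : Confidence.SIGNATURE_ONLY));
        }
        if (ext != null && !ext.equals(sig)) {
            results.add(new Identification(ext, Confidence.EXTENSION_ONLY));
        }
        return results;
    }
}
```

When the signature and the extension disagree, this sketch reports both candidates rather than choosing one, leaving the decision to later validation or assessment.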
Validation requirements
• Conformance reported in terms defined by the format
itself
• The applicability of ambiguous conformance
requirements will be under local configuration control
• Validation will be exhaustive, documenting all identified
violations (not merely the first)
• Ability to distinguish syntactic and semantic conformance
• Error and informative messages will reference specific
passages in the relevant specifications
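Exhaustive validation with specification references might look like the following sketch, which checks the 8-byte TIFF Image File Header and collects every violation instead of stopping at the first. The names ValidationMessage and TiffHeaderValidator are hypothetical, and falling back to big-endian when the byte-order mark is unrecognized is an assumption made so that the remaining checks can still run.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical message type pairing a finding with the passage it relies on. */
final class ValidationMessage {
    final String text;
    final String specReference;

    ValidationMessage(String text, String specReference) {
        this.text = text;
        this.specReference = specReference;
    }
}

final class TiffHeaderValidator {
    private static final String REF = "TIFF 6.0, Section 2, Image File Header";

    /** Checks the 8-byte header and returns all violations found. */
    static List<ValidationMessage> validate(byte[] header) {
        List<ValidationMessage> errors = new ArrayList<>();
        if (header.length < 8) {
            errors.add(new ValidationMessage("Header is shorter than 8 bytes", REF));
            return errors;   // nothing further can be read
        }
        boolean little = header[0] == 'I' && header[1] == 'I';
        boolean big = header[0] == 'M' && header[1] == 'M';
        if (!little && !big) {
            errors.add(new ValidationMessage("Byte order is neither \"II\" nor \"MM\"", REF));
        }
        // Assumption: fall back to big-endian so the remaining checks still run.
        ByteBuffer buf = ByteBuffer.wrap(header)
            .order(little ? ByteOrder.LITTLE_ENDIAN : ByteOrder.BIG_ENDIAN);
        int magic = buf.getShort(2) & 0xFFFF;
        if (magic != 42) {
            errors.add(new ValidationMessage("Magic number is " + magic + ", expected 42", REF));
        }
        long ifdOffset = buf.getInt(4) & 0xFFFFFFFFL;
        if (ifdOffset < 8 || ifdOffset % 2 != 0) {
            errors.add(new ValidationMessage("First IFD offset " + ifdOffset
                + " is inside the header or not on a word boundary", REF));
        }
        return errors;
    }
}
```

A header that is wrong in three ways yields three messages in one pass, which is the behavior the "not merely the first" requirement asks for.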
Feature extraction requirements
• Control over the granularity of parsing and reporting will
be a local configuration option
• General reportable properties will include:
– Source unit name and modification date
– Reportable unit size and offset
– Format identifiers and version
– Format profile
• Genre- and format-specific properties will be reported in
terms of well-known data dictionaries and schemas
– ANSI/NISO Z39.87 for still image
– AES-X098B for audio
Assessment
• Evaluation of characterization data on the basis of local
policy to inform decisions and instigate actions
• Focus is on “instance risk” in an object
– JHOVE2 cannot answer, “Is JPEG 2000 a suitable archival
format?”
– But it can answer, “I have a JPEG 2000 format policy; how does
this object measure up to it?”
• Assessment…
– Makes technical metadata actionable
– Provides a common framework in which digital memory
institutions can publish and share their policies
Assessment policy
• Expressed in a set of user-configured rules
• Any piece of characterization data is eligible for analysis
– Presence/absence of a particular value or informative/error
message
– Particular range or combination of values
• Default policies associated with each format module,
based on community best practice
– Easily modified, extended, or replaced to conform to local policy
and practice
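A rule-based policy of this kind can be sketched as a list of named predicates evaluated against characterization data. Everything here is hypothetical: the class names, the property keys ("valid", "resolutionPpi", "errorMessage"), and the 300 ppi threshold are invented for illustration and do not represent a recommended policy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical rule: a described predicate over characterization data. */
final class PolicyRule {
    final String description;
    final Predicate<Map<String, Object>> test;

    PolicyRule(String description, Predicate<Map<String, Object>> test) {
        this.description = description;
        this.test = test;
    }
}

final class AssessmentPolicy {
    private final List<PolicyRule> rules = new ArrayList<>();

    /** Adds a rule; a default policy can be extended or replaced this way. */
    AssessmentPolicy add(PolicyRule rule) {
        rules.add(rule);
        return this;
    }

    /** Returns the description of every rule the object fails, in rule order. */
    List<String> assess(Map<String, Object> characterization) {
        List<String> failures = new ArrayList<>();
        for (PolicyRule rule : rules) {
            if (!rule.test.test(characterization)) failures.add(rule.description);
        }
        return failures;
    }

    /** Illustrative local policy; property names and thresholds are invented. */
    static AssessmentPolicy exampleImagePolicy() {
        return new AssessmentPolicy()
            // Presence of a particular value
            .add(new PolicyRule("Object must be valid",
                c -> Boolean.TRUE.equals(c.get("valid"))))
            // Particular range of values
            .add(new PolicyRule("Resolution must be at least 300 ppi",
                c -> c.get("resolutionPpi") instanceof Integer
                    && (Integer) c.get("resolutionPpi") >= 300))
            // Absence of an error message
            .add(new PolicyRule("No error messages may be present",
                c -> !c.containsKey("errorMessage")));
    }
}
```

An empty result means the object meets the local policy; a non-empty result lists the reasons it does not, which a workflow could use for triage.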
Assessment output
• Summary report of information not elsewhere
summarized
• Ability to create custom preservation metadata
• Can be extended to initiate downstream processes in
local workflows automatically
Assessment in workflows
• In a flow of homogeneous objects, verifying specifications
– Am I receiving what I expected to receive?
• In a flow of heterogeneous objects, conducting triage
– Which risks are tolerable? Which aren’t?
– What level of service can I offer the depositor?
Ingest workflow
[Figure: ingest workflow diagram]
• Producer: Content; Identification, Validation, Feature extraction; Metadata; Assessment (Policy rules); Package; SIP
• Archive: Unpackage; Content; Identification, Validation, Feature extraction; Metadata ′; Consistency (Metadata, Metadata ′); Assessment (Policy rules); Ingest
Migration workflow
[Figure: migration workflow diagram]
• Labels: AIP; Unpackage; Content; Metadata; Assessment (Policy rules); Migration; Content ′; Identification, Validation, Feature extraction; Metadata ′; Equivalence; (Re)Ingest
Technical components
• Java
– java.nio package
– 1.5 vs 1.6
• OSGi/Spring frameworks
– Component versioning and dependency management
– Fine-grained control of component invocation
– Inversion of control
• SourceForge
– Distribution platform
– Issue tracking
Data abstraction
• Based on the “natural” conceptual structures of a format
and their component attributes
– Each such structure maps to a class with methods for parsing,
validating, reporting, and serializing
– Each such attribute maps to a field with accessor and mutator
methods
• UTF-8: Character
• TIFF: Image File Header and Image File Directory
• JPEG 2000: Box
• PDF: boolean, number, string, name, array, dictionary, and stream
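To make the structure-to-class mapping concrete, here is a minimal sketch for the JPEG 2000 Box header using java.nio. The class Jp2Box is a hypothetical name, not the JHOVE2 API. It covers only the length and type fields (LBox, TBox, and XLBox when LBox is 1), and its validity check is deliberately minimal.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical class for the JPEG 2000 "Box" structure. */
final class Jp2Box {
    private long length;   // LBox, or XLBox when LBox == 1; 0 means "to end of file"
    private String type;   // TBox, a four-character code

    // Each attribute maps to a field with accessor and mutator methods.
    long getLength() { return length; }
    void setLength(long length) { this.length = length; }
    String getType() { return type; }
    void setType(String type) { this.type = type; }

    /** Parses a box header at the buffer's current position (big-endian fields). */
    static Jp2Box parse(ByteBuffer buf) {
        Jp2Box box = new Jp2Box();
        long lbox = buf.getInt() & 0xFFFFFFFFL;   // read as unsigned 32-bit
        byte[] tbox = new byte[4];
        buf.get(tbox);
        box.setType(new String(tbox, StandardCharsets.ISO_8859_1));
        box.setLength(lbox == 1 ? buf.getLong() : lbox);
        return box;
    }

    /** Minimal check: a non-zero length must cover at least the 8-byte header. */
    boolean isValid() {
        return length == 0 || length >= 8;
    }

    /** Reports the parsed attributes in a simple textual form. */
    String report() {
        return "Box[type=" + type + ", length=" + length + "]";
    }

    /** Serializes the header, using the 8-byte form when the length fits in 32 bits. */
    byte[] serialize() {
        boolean extended = length > 0xFFFFFFFFL;
        ByteBuffer out = ByteBuffer.allocate(extended ? 16 : 8);
        out.putInt(extended ? 1 : (int) length);
        out.put(type.getBytes(StandardCharsets.ISO_8859_1));
        if (extended) out.putLong(length);
        return out.array();
    }
}
```

Parsing the 12-byte JPEG 2000 signature box with this sketch gives a box of type "jP  " and length 12, and serializing it reproduces the first eight bytes of the input.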
Community engagement
• Wiki: confluence.ucop.edu/display/JHOVE2Info/Home
• Mailing lists: JHOVE2-Announce-L, JHOVE2-Techtalk-L (subscribe via the wiki)
Advisory board
• Deutsche Nationalbibliothek
• Ex Libris
• Fedora Commons
• Florida Center for Library Automation
• Harvard University
• Koninklijke Bibliotheek
• MIT/DSpace
• National Archives (UK)
• National Library of Australia
• National Library of New Zealand
• Planets project
Next steps?