JHOVE2: A Next-Generation Architecture for Format-Aware Characterization
British Library, 1 October 2008
Stephen Abrams, California Digital Library
Sheila Morrissey and Evan Owens, Portico
Tom Cramer and Keith Johnson, Stanford University

Thanks to…
• Koninklijke Bibliotheek, our meeting sponsor
• British Library, our meeting host
• Library of Congress, our funding agency

Agenda
• Introductions
• Review of objectives and agenda
• Project goals, deliverables, and schedule
• New terminology
• Functional requirements
• Assessment
• Technology options
• Community engagement
• Next steps

Introductions
• Who are you?
• Where are you from?
• What is your level of involvement with JHOVE?

Objectives
• Get feedback on the utility and achievability of project goals, deliverables, and schedule
• Introduce new terminology and concepts
• Refine functional requirements
• Understand what assessment means in preservation workflows
• Propose community engagement plan
• Gauge the potential for development partnerships
• Solicit interesting test data

Why JHOVE2?
• Preservation requires management of the gap between what you were given and what you need
• That gap is only manageable if it is quantifiable
• Characterization tells you what you have, as the starting point for iterative preservation planning and action
[Figure labels: Characterization, Preservation planning, Preservation action. Adapted from A. Brown, “Developing Practical Approaches to Active Preservation,” IJDC 2:1 (June 2007): 3-11.]

JHOVE2 project
• A next-generation architecture for format-aware object characterization
  – Three-fold goals:
    • Re-factor the existing architecture to achieve higher performance, simplify system integration, and encourage third-party enhancement
    • Provide significant new function
    • Implement modules
• Collaborative project of CDL, Portico, and Stanford University
  – Funded by Library of Congress/NDIIPP
  – Open source BSD license

New function
• Complex object data model
• Generic plug-in interface
• Common data structure passed between modules to enable stateful processing
• Identification de-coupled from validation
• Standardized handling of format profiles and error reporting
• Symbolic display of binary formats
• API-level support for editing

Format support
• Based on project partner requirements and budgetary constraints
  – Image: JPEG 2000, TIFF
  – Audio: WAVE
  – Text: SGML, UTF-8, XML
  – Document: PDF
  – GIS: Shapefile
  – Color: ICC
  – And their well-known variants, e.g. TIFF/IT, TIFF/EP, GeoTIFF, EXIF, DNG, …
• Unfortunately precluding some JHOVE-supported formats
  – AIFF, GIF, HTML, JPEG

Schedule
• Months 1-6: Outreach, design, and prototyping
• Months 7-9: Core APIs and framework
• Months 10-24: Module implementation

Terminology
• What is it?
  – Identification: Determining presumptive format through signature matching
• What is it, really?
  – Validation: Determining conformance to commonly accepted normative requirements
• What about it?
  – Feature extraction: Reporting intrinsic properties significant to preservation planning and action
• What should you do with it?
  – Assessment: Determining acceptability for a given purpose on the basis of locally-defined policies

Objects, not files
• JHOVE assumed 1 object = 1 file = 1 format
• But what about…
  – TIFF with embedded ICC profile and XMP metadata: 1 object = 1 file = 3 formats
  – JPEG 2000 JPX fragmentation: 1 object = n files = 1 format
  – ESRI Shapefile: 1 object = 3 files = 3 formats
• JHOVE2 will support 1 object = n files = m formats

Objects, not files
[Diagram: source units mapped to reportable units. Source unit abcd.tif: TIFF, ICC, XMP. Source units 1234.dbf, 1234.shp, 1234.shx: Shapefile, dBASE IV, SHP, SHX.]

Functional requirements
• JHOVE2 will provide a highly configurable, extensible, and scalable framework for preservation characterization
• JHOVE2 can process an arbitrary number of source units during a single invocation
• JHOVE2 function will be encapsulated into granular plug-in modules that can be installed in and invoked by the framework
• The output of characterization will be expressed in an intermediate XML form, with further presentation via XSL
• JHOVE2 can be invoked through an API or as a standalone application with a command line interface

Identification requirements
• Based on extrinsic hints and internal/external signatures
  – Aggregate-level identification signatures are based on file-level properties
• Identification may be possible for formats for which JHOVE2 feature extraction or validation is not supported
• Identification will be reported in terms of level of confidence
• Identification will be reported in terms of all known common names and public identifiers

Validation requirements
• Conformance reported in terms defined by the format itself
• The applicability of ambiguous conformance requirements will be under local configuration control
• Validation will be exhaustive, documenting all identified violations (not merely the first)
• Ability to distinguish syntactic and semantic conformance
• Error and informative messages will reference specific passages in the relevant specifications

Feature extraction requirements
• Control over the granularity of parsing and reporting will be a local configuration option
• General reportable properties will include:
  – Source unit name and modification date
  – Reportable unit size and offset
  – Format identifiers and version
  – Format profile
• Genre- and format-specific properties will be reported in terms of well-known data dictionaries and schemas
  – ANSI/NISO Z39.87 for still image
  – AES X-098B for audio

Assessment
• Evaluation of characterization data on the basis of local policy to inform decisions and instigate actions
• Focus is on “instance risk” in an object
  – JHOVE2 cannot answer, “Is JPEG 2000 a suitable archival format?”
  – But it can answer, “I have a JPEG 2000 format policy; how does this object measure up to it?”
• Assessment…
  – Makes technical metadata actionable
  – Provides a common framework in which digital memory institutions can publish and share their policies

Assessment policy
• Expressed in a set of user-configured rules
• Any piece of characterization data is eligible for analysis
  – Presence/absence of a particular value or informative/error message
  – Particular range or combination of values
• Default policies associated with each format module, based on community best practice
  – Easily modified, extended, or replaced to conform to local policy and practice

Assessment output
• Summary report of information not elsewhere summarized
• Ability to create custom preservation metadata
• Can be extended to initiate downstream processes in local workflows automatically

Assessment in workflows
• In a flow of homogeneous objects, verifying specifications
  – Am I receiving what I expected to receive?
• In a flow of heterogeneous objects, conducting triage
  – Which risks are tolerable? Which aren’t?
  – What level of service can I offer the depositor?
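
Illustration (not part of the original slides): the assessment policy described above, a set of user-configured rules evaluated against any piece of characterization data, might be sketched as follows. This is a hypothetical sketch, not JHOVE2 code. The class names `AssessmentSketch` and `Rule`, the property keys (`valid`, `bitDepth`, `lossy`, `hasIccProfile`), and the thresholds are all assumptions made up for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical illustration only; none of these names come from JHOVE2.
final class AssessmentSketch {
    // Returns the description of every rule that the characterization data fails.
    static List<String> assess(Map<String, Object> properties, List<Rule> policy) {
        List<String> failures = new ArrayList<>();
        for (Rule rule : policy) {
            if (!rule.test.test(properties)) {
                failures.add(rule.description);
            }
        }
        return failures;
    }

    // A made-up local policy for JPEG 2000 objects.
    static List<Rule> exampleJpeg2000Policy() {
        List<Rule> policy = new ArrayList<>();
        // Presence/absence of a particular value.
        policy.add(new Rule("object is valid",
                p -> Boolean.TRUE.equals(p.get("valid"))));
        // Particular range of values (the 8..16 range is an assumed threshold).
        policy.add(new Rule("bit depth is between 8 and 16",
                p -> p.get("bitDepth") instanceof Integer
                        && (Integer) p.get("bitDepth") >= 8
                        && (Integer) p.get("bitDepth") <= 16));
        // Combination of values.
        policy.add(new Rule("lossy objects carry an ICC profile",
                p -> !Boolean.TRUE.equals(p.get("lossy"))
                        || Boolean.TRUE.equals(p.get("hasIccProfile"))));
        return policy;
    }

    public static void main(String[] args) {
        // Stand-in for the output of identification, validation, and feature extraction.
        Map<String, Object> properties = new LinkedHashMap<>();
        properties.put("valid", true);
        properties.put("bitDepth", 24);
        properties.put("lossy", true);
        properties.put("hasIccProfile", false);
        System.out.println(assess(properties, exampleJpeg2000Policy()));
    }
}

// One user-configured rule: a description plus a test over characterization data.
final class Rule {
    final String description;
    final Predicate<Map<String, Object>> test;

    Rule(String description, Predicate<Map<String, Object>> test) {
        this.description = description;
        this.test = test;
    }
}
```

With the sample data in `main`, the object passes the validity rule and fails the bit-depth and ICC-profile rules. Because a policy here is an ordinary list, a default policy could be modified, extended, or replaced locally, as the Assessment policy slide suggests.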

Ingest workflow
[Diagram: ingest workflow spanning Producer and Archive. Labels: Content, Identification, Validation, Feature extract, Metadata, Metadata′, Assessment, Policy rules, Package, SIP, Unpackage, Consistency, Ingest.]

Migration workflow
[Diagram: migration workflow. Labels: AIP, Unpackage, Content, Content′, Identification, Validation, Feature extract, Metadata, Metadata′, Assessment, Policy rules, Migration, Equivalence, (Re)Ingest.]

Technical components
• Java
  – java.nio package
  – 1.5 vs 1.6
• OSGi/Spring frameworks
  – Component versioning and dependency management
  – Fine-grained control of component invocation
  – Inversion of control
• SourceForge
  – Distribution platform
  – Issue tracking

Data abstraction
• Based on the “natural” conceptual structures of a format and their component attributes
  – Each such structure maps to a class with methods for parsing, validating, reporting, and serializing
  – Each such attribute maps to a field with accessor and mutator methods
• UTF-8: Character
• TIFF: Image File Header and Image File Directory
• JPEG 2000: Box
• PDF: boolean, number, string, name, array, dictionary, and stream

Community engagement
Wiki
  confluence.ucop.edu/display/JHOVE2Info/Home
Mailing lists (subscribe via the wiki)
  JHOVE2-Announce-L
  JHOVE2-Techtalk-L

Advisory board
• Deutsche Nationalbibliothek
• Ex Libris
• Fedora Commons
• Florida Center for Library Automation
• Harvard University
• Koninklijke Bibliotheek
• MIT/DSpace
• National Archives (UK)
• National Library of Australia
• National Library of New Zealand
• Planets project

Next steps?
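
Illustration (not part of the original slides): the Data abstraction slide says each format structure maps to a class with methods for parsing, validating, reporting, and serializing, and each attribute to a field with accessor and mutator methods. A minimal sketch of that idea for the 8-byte TIFF Image File Header, using `java.nio.ByteBuffer`, could look like the following. The class name `TiffHeaderSketch`, its method names, and its message wording are hypothetical and are not the JHOVE2 API. The sketch assumes the header starts at index 0 of the buffer.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not the JHOVE2 API.
final class TiffHeaderSketch {
    private ByteOrder byteOrder;   // null if the byte-order mark is unrecognized
    private int magic;             // 42 in a conforming TIFF file
    private long firstIfdOffset;   // unsigned 32-bit offset of the first IFD

    // Accessors and one mutator, one per attribute of the structure.
    ByteOrder getByteOrder() { return byteOrder; }
    int getMagic() { return magic; }
    long getFirstIfdOffset() { return firstIfdOffset; }
    void setFirstIfdOffset(long offset) { this.firstIfdOffset = offset; }

    // Parsing: read the header fields without judging them.
    static TiffHeaderSketch parse(ByteBuffer in) {
        if (in.limit() < 8) {
            throw new IllegalArgumentException("a TIFF header needs 8 bytes");
        }
        TiffHeaderSketch h = new TiffHeaderSketch();
        byte b0 = in.get(0);
        byte b1 = in.get(1);
        if (b0 == 'I' && b1 == 'I') {
            h.byteOrder = ByteOrder.LITTLE_ENDIAN;
        } else if (b0 == 'M' && b1 == 'M') {
            h.byteOrder = ByteOrder.BIG_ENDIAN;
        }
        // Fall back to big-endian so the remaining fields can still be reported.
        ByteBuffer view = in.duplicate()
                .order(h.byteOrder != null ? h.byteOrder : ByteOrder.BIG_ENDIAN);
        h.magic = view.getShort(2) & 0xFFFF;
        h.firstIfdOffset = view.getInt(4) & 0xFFFFFFFFL;
        return h;
    }

    // Validating: collect every violation found, not merely the first.
    List<String> validate() {
        List<String> errors = new ArrayList<>();
        if (byteOrder == null) {
            errors.add("byte-order mark is neither II nor MM");
        }
        if (magic != 42) {
            errors.add("magic number is " + magic + ", expected 42");
        }
        if (firstIfdOffset < 8) {
            errors.add("first IFD offset " + firstIfdOffset + " overlaps the header");
        }
        return errors;
    }

    // Reporting: a plain-text rendering of the extracted properties.
    String report() {
        return "byteOrder=" + byteOrder + ", magic=" + magic
                + ", firstIfdOffset=" + firstIfdOffset;
    }

    // Serializing: write the fields back out as an 8-byte header.
    ByteBuffer serialize() {
        ByteOrder order = byteOrder != null ? byteOrder : ByteOrder.BIG_ENDIAN;
        ByteBuffer out = ByteBuffer.allocate(8).order(order);
        byte mark = (byte) (order == ByteOrder.LITTLE_ENDIAN ? 'I' : 'M');
        out.put(mark).put(mark).putShort((short) magic).putInt((int) firstIfdOffset);
        out.flip();
        return out;
    }

    public static void main(String[] args) {
        byte[] bytes = {'I', 'I', 42, 0, 8, 0, 0, 0};
        TiffHeaderSketch header = parse(ByteBuffer.wrap(bytes));
        System.out.println(header.report());
        System.out.println(header.validate());
    }
}
```

Keeping `parse` separate from `validate` lets a nonconforming header still be read and reported, which fits the exhaustive validation requirement stated in the slides. The mutator plus `serialize` hints at how API-level editing might work.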