OCLC Online Computer Library Center Metadata Standards Eric Childress OCLC FEDLINK OCLC Users Group Meeting Washington, DC November 18, 2003


OCLC Online Computer Library Center
Metadata Standards
Eric Childress
OCLC
FEDLINK OCLC Users Group Meeting
Washington, DC
November 18, 2003
Overview
Fundamentals
– Types of metadata
– Document mark-up languages & character encodings
MetaMap
Metadata formats:
– MARC, MODS
– DC, ONIX
– TEI, EAD, METS, MIX
– RDF, FGDC, COSATI
Fundamentals
5 types of metadata
Descriptive
– Title, author, summary, topic, etc.
Technical & Structural
– File size, software needed, file type(s), presentation
instructions, etc.
Administrative (a.k.a. “meta-metadata”)
– Record number, record date, record source, etc.
Rights
– Copyright ownership, use privileges, etc.
Management
– [Typically by/for owning agency]: price paid, circulation
restrictions, etc.
Fundamentals
Markup languages
Markup languages:
– Address the structure of a document
– Convey instructions to software that will process text to:
• Index the text for searching
• Render the text (e.g., for screen display or print)
• Transform the text for some output device(s) (e.g., for a voice synthesizer)
– The markup is generally invisible to end-users
Extensible Markup Language (XML):
– XML is a metalanguage
• Agencies define their own XML to suit their task
– By creating Document Type Definitions (DTDs) or XML schema
– Data is separate from presentation instructions
• Presentation instructions go in a style sheet
– Offers just the right mix of flexibility and structure
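As a minimal sketch of the "data is separate from presentation" point, the Python below parses one small record and produces two different outputs from it. The element names (`report`, `title`, `author`) and both functions are hypothetical, invented for this illustration; a real application would define its vocabulary in a DTD or XML schema and would typically keep presentation rules in a style sheet rather than in code.

```python
import xml.etree.ElementTree as ET

# Hypothetical agency-defined vocabulary: these element names are
# invented for this example and come from no published standard.
RECORD = """<report>
  <title>Metadata Standards</title>
  <author>Eric Childress</author>
</report>"""


def render_plain(root):
    """Render the record as a one-line citation for screen display."""
    return f"{root.findtext('author')}. {root.findtext('title')}."


def render_index_terms(root):
    """Derive lowercase index terms from the same underlying data."""
    words = root.findtext("title").lower().split()
    return sorted(set(words))


root = ET.fromstring(RECORD)
citation = render_plain(root)
terms = render_index_terms(root)
print(citation)
print(terms)
```

The markup carries only structure; the display and indexing decisions live outside the record, so either can change without touching the data.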
Fundamentals
Character Encodings
Character encoding:
– Used for communicating text characters in a computing
environment
– Hundreds of character encoding standards exist
– Character conversion is complex and expensive
Unicode:
– A single, “comprehensive” global encoding standard
– Includes characters from scripts of all major modern,
most minor, and selected ancient languages
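The sketch below, which assumes nothing beyond Python's built-in codecs, illustrates why conversion between encodings is error-prone: the same text yields different bytes under different standards, decoding with the wrong standard garbles it silently, and a narrower standard may have no code for a character at all. The sample string is arbitrary.

```python
text = "Dvořák"  # arbitrary sample containing two non-ASCII letters

utf8_bytes = text.encode("utf-8")
latin2_bytes = text.encode("iso8859_2")  # Central European repertoire

# The same characters map to different byte sequences in each standard.
print(utf8_bytes)
print(latin2_bytes)

# Decoding bytes with the wrong standard raises no error; it just
# produces the wrong characters.
garbled = latin2_bytes.decode("latin-1")
print(garbled)

# A narrower encoding may have no slot for a character at all.
try:
    text.encode("latin-1")
    fits_latin1 = True
except UnicodeEncodeError:
    fits_latin1 = False
print(fits_latin1)
```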
MetaMap
http://mapageweb.umontreal.ca/turner/meta/english/metamap.html
MARC 21
MARC 21 (ISO 2709)
– ISO 2709-based metadata communications protocol
– Choice of two character encoding options:
• MARC 8 (ASCII, ANSEL, selected ISO, EACC)
• Unicode (limited to equivalents of MARC 8 repertoire)
– XML expression is now also an option
– Maintenance agency: Library of Congress w/ NLC, BL
Strengths:
• Well-maintained, mature standard
• Widely adopted by library communities
• Large universe of MARC 21 records available
• Wide choice of software vendors
Weaknesses (in the present & future):
• Virtually unused outside of libraries
• Limits on field and record size
• Restricted range of scripts supported
• Limited ability to convey complex relationships, hierarchy,
attributes at tag/subfield level
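The limits on field and record size follow from the ISO 2709 layout: a 24-character leader stores the record length in 5 digits, and each 12-character directory entry stores a field length in 4 digits. The sketch below builds and re-reads a minimal record to show that layout. It is an illustration only, not a full MARC 21 implementation: the function names are hypothetical, the leader values are simplified, and the sample field contents are made up.

```python
FT = b"\x1e"  # field terminator
RT = b"\x1d"  # record terminator
SF = b"\x1f"  # subfield delimiter


def build_record(fields):
    """Serialize (tag, bytes) pairs into a minimal ISO 2709 record."""
    directory = b""
    data = b""
    for tag, value in fields:
        body = value + FT
        if len(body) > 9999:  # field length has only 4 digits
            raise ValueError(f"field {tag} exceeds the 4-digit length limit")
        # Directory entry: tag (3) + field length (4) + start offset (5)
        directory += b"%s%04d%05d" % (tag.encode("ascii"), len(body), len(data))
        data += body
    base = 24 + len(directory) + 1   # leader + directory + its terminator
    total = base + len(data) + 1     # plus the record terminator
    if total > 99999:                # record length has only 5 digits
        raise ValueError("record exceeds the 5-digit length limit")
    # Simplified leader: length, status/type codes, base address, entry map
    leader = b"%05dnam  22%05d   4500" % (total, base)
    return leader + directory + FT + data + RT


def parse_record(raw):
    """Recover (tag, bytes) pairs by walking the directory."""
    base = int(raw[12:17])
    directory = raw[24:base - 1]
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[0:3].decode("ascii")
        length = int(entry[3:7])
        start = int(entry[7:12])
        # Slice out the field and drop its trailing terminator
        fields.append((tag, raw[base + start:base + start + length - 1]))
    return fields


# Made-up sample: a control number and a title field with indicators "10"
fields = [
    ("001", b"demo0001"),
    ("245", b"10" + SF + b"aMetadata standards"),
]
record = build_record(fields)
print(record)
print(parse_record(record))
```

A field longer than 9,999 bytes, or a record longer than 99,999 bytes, simply cannot be described by the fixed-width numbers, which is why the sketch raises an error in those cases.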
MODS
Metadata Object Description Schema (MODS)
– Essentially MARC 21 recast in an XML-native framework
• Text-based tags rather than numeric ones,
• Selected clusters of related MARC 21 attributes condensed into single
MODS element
– MARC 21 readily converts to MODS, but you can’t do a lossless
reverse conversion of MODS to MARC 21
– Maintenance agency: Library of Congress
Value of MODS:
– A rich, library-oriented XML metadata schema
– Optimized for from-MARC conversion of legacy records
– Well-suited as a metadata format for OAI harvesting
Applications of MODS:
– LC planning to convert 100K American Memory records
– Minerva project, U of Chicago Press, California Digital Library, others
using or planning to use for records for web sites, e-texts.
– OpenOffice Bibliographic Project
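A rough sketch of the "text-based tags" and "condensed clusters" points: the Python below maps a few MARC 21 fields to MODS elements. The in-memory record structure, the helper names, and the sample values are hypothetical; only the MODS namespace and the element names (`titleInfo`, `title`, `subTitle`, `name`, `namePart`) are taken from the MODS schema, and the mapping covers a tiny fraction of what a real conversion handles.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)

# Hypothetical in-memory MARC 21 record: tag -> subfield code -> value
marc = {
    "100": {"a": "Doe, Jane"},
    "245": {"a": "Sample title", "b": "a sample subtitle"},
    "700": {"a": "Roe, Richard"},
}


def q(name):
    """Qualify a local element name with the MODS namespace."""
    return f"{{{MODS_NS}}}{name}"


def marc_to_mods(record):
    """Convert a few MARC fields to a MODS element tree (sketch only)."""
    mods = ET.Element(q("mods"))
    if "245" in record:
        title_info = ET.SubElement(mods, q("titleInfo"))
        ET.SubElement(title_info, q("title")).text = record["245"]["a"]
        if "b" in record["245"]:
            ET.SubElement(title_info, q("subTitle")).text = record["245"]["b"]
    # Main entry (100) and added entry (700) both become <name>
    for tag in ("100", "700"):
        if tag in record:
            name = ET.SubElement(mods, q("name"))
            ET.SubElement(name, q("namePart")).text = record[tag]["a"]
    return mods


mods = marc_to_mods(marc)
print(ET.tostring(mods, encoding="unicode"))
```

Because this sketch does not record which MARC tag each `name` came from, its output could not be converted back to the original record exactly, which is a small-scale picture of why the reverse conversion is lossy.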
MARC 21 & MODS
Feature | MARC 21 | MARC 21 Unicode | MARC XML | MARC Slim | MODS
Structure | ISO 2709 | ISO 2709 | XML | XML | XML
Encoding | MARC 8 | Unicode | Unicode | Unicode | Unicode
Repertoire of scripts | JACKPHY | JACKPHY | JACKPHY | JACKPHY | Unicode
Conversion from MARC 21 | lossless | lossless | lossless | lossless | minimal loss
Conversion to MARC 21 | lossless | lossless | lossless | lossless? | minor loss
· Bibliographic | OCLC | OCLC R | OCLC R | OCLC R | OCLC DCPS
· Authority | OCLC | OCLC R | OCLC R | x
· Classification | x | OCLC R | OCLC R | x
· Community | x | x | x | x
· Holdings | OCLC | x | x | x
Dublin Core
Dublin Core Metadata Element Set
– ISO 15836:2003(E) The Dublin Core metadata element set
– A standard for cross-domain resource description
• Designed primarily to support discovery and retrieval
– Defines semantics but not syntax (i.e. container)
– Choice of simple or qualified DC
– Maintenance agency: Dublin Core Metadata Initiative (DCMI) hosted
by OCLC Research
Value of Dublin Core:
– Simplicity, extensibility, interoperability
– Worldwide adoption (DCMES translated into 20+ languages)
– Usable as crosswalk between major metadata standards
Applications of Dublin Core:
– Open Archives Initiative (OAI) mandates DC metadata
– Wide variety of extended versions in use:
• In digital library, archives, museums projects
• By e-government programs (AU, CA, DK, FI, IE, NZ, UK)
– OCLC usage: Connexion, DCPS, ContentDM, Research
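Because Dublin Core defines semantics but not syntax, the same element/value pairs can be carried in more than one container. The sketch below, offered as an illustration rather than a reference implementation, writes one simple DC description as XML in the `oai_dc` container used by OAI and as HTML `meta` tags. The function names are hypothetical; the values come from this presentation's title slide.

```python
import xml.etree.ElementTree as ET
from html import escape

DC_NS = "http://purl.org/dc/elements/1.1/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
ET.register_namespace("dc", DC_NS)
ET.register_namespace("oai_dc", OAI_DC_NS)

# Simple (unqualified) DC: a list of element/value pairs
pairs = [
    ("title", "Metadata Standards"),
    ("creator", "Childress, Eric"),
    ("date", "2003-11-18"),
]


def to_oai_dc(pairs):
    """Express the pairs as XML inside an oai_dc:dc container."""
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for element, value in pairs:
        ET.SubElement(root, f"{{{DC_NS}}}{element}").text = value
    return root


def to_html_meta(pairs):
    """Express the same pairs as HTML <meta> tags."""
    return [
        f'<meta name="DC.{element}" content="{escape(value)}">'
        for element, value in pairs
    ]


record = to_oai_dc(pairs)
xml_text = ET.tostring(record, encoding="unicode")
meta_tags = to_html_meta(pairs)
print(xml_text)
print("\n".join(meta_tags))
```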
ONIX
ONIX International (Online Information Exchange):
– Standard data exchange format for publishers & jobbers
• Based on EPICS (EDItEUR Product Information Communication Standards)
– For representing and communicating book industry product
information in electronic form
• Offers two levels of richness (level 1 & level 2)
– XML schema with Unicode encoding
– Maintenance agency: EDItEUR working with input from the Book
Industry Communication (BIC) and the Book Industry Study Group
(BISG)
Value of ONIX:
– Meets needs of publishers, jobbers, retail sellers for:
• Easier access to richer book data (including bibliographic data, cover art,
blurbs, TOCs, UPC data, and much more)
• An inexpensive-to-implement common data exchange format
Applications of ONIX:
– Primarily oriented towards publishers, jobbers, retailers
• Most major players (Amazon, Baker & Taylor, etc.) now using/supporting
ONIX
– Some interest by libraries & ILS vendors in ONIX
TEI
Text Encoding Initiative (TEI):
– For complex markup of literary texts
– Both SGML & XML DTDs available
– TEI “header” (TEIH) can be used as a metadata record
– Maintenance agency: TEI Consortium
• TEI Consortium has executive offices in Bergen, Norway, and is hosted at
four university sites worldwide: the Univ. of Bergen, Brown Univ., Oxford
Univ., and the Univ. of Virginia
• Maintains “P4” Guidelines for Electronic Text Encoding and Interchange
Value of TEI:
– Designed to meet the needs of scholarly research community (esp.
in the humanities) for a variety of activities including:
• Adding in-line academic commentary in e-texts
• As an aid to research by supporting special indexing points, etc.
Applications of TEI:
– Widely used by major humanities electronic text collections such as
CETH, UVa e-text center, many others.
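To show how a TEI header can serve as a metadata record, the sketch below pulls the title and author out of the header of a tiny P4-style XML document and ignores the text body. The sample document and the function name are made up for illustration; the element path (`teiHeader/fileDesc/titleStmt`) follows the TEI Guidelines.

```python
import xml.etree.ElementTree as ET

# Made-up, minimal P4-style document (XML flavor, no namespace)
TEI_DOC = """<TEI.2>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>A sample e-text</title>
        <author>Doe, Jane</author>
      </titleStmt>
      <publicationStmt><p>Unpublished sample.</p></publicationStmt>
      <sourceDesc><p>Born digital.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text><body><p>Body of the text, not used for the record.</p></body></text>
</TEI.2>"""


def header_to_record(tei_xml):
    """Build a small descriptive record from the TEI header only."""
    root = ET.fromstring(tei_xml)
    stmt = root.find("teiHeader/fileDesc/titleStmt")
    return {
        "title": stmt.findtext("title"),
        "author": stmt.findtext("author"),
    }


tei_record = header_to_record(TEI_DOC)
print(tei_record)
```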
EAD
Encoded Archival Description (EAD)
– A format for expressing electronic archival finding aids
– EAD DTD (Version 2002) is designed to function as both an SGML
and XML DTD
– Maintained jointly by the Library of Congress and the Society of
American Archivists (SAA)
Value of EAD:
– Effectively an organized presentation of a collection of documents
(typically in an archive or manuscript collection)
• EAD header carries metadata for the finding aid
• Provides for simple or complex mark-up to support varying levels of
indexing
• Well-suited for interweaving narrative with links to specific objects in a
collection (either directly to the object or via a record for the object that
may link to the object).
Applications of EAD:
– Conversion of existing paper finding aids to electronic form
– Widely used by academic institutions and archives in North America
– RLG Archival Resources database hosts copies of many EADs
METS
Metadata Encoding and Transmission Standard (METS)
– A standard “shell” for encoding data essential for retrieving,
preserving, and serving up digital resources
• Six modules define descriptive, administrative, structural, rights and other
metadata
• Some parts of a METS object may be external (e.g., a MODS record for
the descriptive metadata)
– Maintenance agency: Library of Congress
Value of METS:
– Need for METS identified at DLF metadata experts meetings
• Varied local approaches to non-descriptive metadata not scaling well &
offering little interoperability between agencies
– Offers a standard mode for object “packaging” for preservation,
institutional repositories, other activities
Applications of METS:
– LC: planning to use with selected moving images, audio recordings,
folk life mixed media collections
– OCLC DCPS, RLG, Harvard, Stanford, UC Berkeley, National Library
of Wales exploring or using for variety of projects
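The "shell" idea can be sketched as follows: a METS document that points to an external MODS record for descriptive metadata, lists one content file, and ties the two together in a structural map. This is a minimal illustration under stated assumptions, not a valid production METS object. The function name, the IDs, the MIME type, and the example.org URLs are hypothetical; the namespaces, element names, and attribute names come from the METS schema.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)

HREF = f"{{{XLINK_NS}}}href"


def m(name):
    """Qualify a local element name with the METS namespace."""
    return f"{{{METS_NS}}}{name}"


def build_mets(mods_url, file_url):
    """Assemble a minimal METS wrapper around external resources."""
    mets = ET.Element(m("mets"))

    # Descriptive metadata section: a reference to an external MODS record
    dmd = ET.SubElement(mets, m("dmdSec"), {"ID": "DMD1"})
    ET.SubElement(dmd, m("mdRef"),
                  {"LOCTYPE": "URL", "MDTYPE": "MODS", HREF: mods_url})

    # File section: the content file(s) that make up the object
    file_sec = ET.SubElement(mets, m("fileSec"))
    group = ET.SubElement(file_sec, m("fileGrp"))
    item = ET.SubElement(group, m("file"),
                         {"ID": "FILE1", "MIMETYPE": "image/tiff"})
    ET.SubElement(item, m("FLocat"), {"LOCTYPE": "URL", HREF: file_url})

    # Structural map: links the description to the file
    struct = ET.SubElement(mets, m("structMap"))
    div = ET.SubElement(struct, m("div"), {"DMDID": "DMD1"})
    ET.SubElement(div, m("fptr"), {"FILEID": "FILE1"})
    return mets


mets = build_mets("http://example.org/records/1.mods.xml",
                  "http://example.org/images/1.tif")
print(ET.tostring(mets, encoding="unicode"))
```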
MIX
Metadata for Images in XML (MIX)
– XML schema for a set of technical data elements required to manage
digital image collections
– Format for interchange and/or storage of the data specified in the
NISO Draft Standard Data Dictionary: Technical Metadata for Digital
Still Images (version 1.2)
– Still in early development and testing phases
– Collaboration of: Library of Congress and NISO Technical Metadata
for Digital Still Images Standards Committee
Value of MIX:
– Provides a common XML schema for expressing technical data
particular to digital still images
– Can be used with other schema such as METS and MODS as part of
a comprehensive approach to managing and preserving digital
images
Applications of MIX:
– OCLC DCPS, LC, others planning or testing
– MIX still in nascent stage of development and testing
Summary
Feature | DC | ONIX | TEI | EAD | METS | MIX
Structure |  | XML | XML | XML | XML | XML
Encoding |  | Unicode | Unicode | Unicode | Unicode | Unicode
Repertoire of scripts |  | Unicode | Unicode | Unicode | Unicode | Unicode
Conversion from MARC 21 | Lossiness varies | Minimal loss | Header only - lossy | Header only - lossy |  |
Conversion to MARC 21 | Minimal loss | Some ONIX-only data may be lost | Header only - lossless | Header only - lossless |  |
Chief purpose | Simple description for discovery & retrieval | Publisher product info exchange | Markup of scholarly e-texts | Markup of electronic finding aids | Shell with technical data | Technical data for digital images
Primary user base | e-Govt, Libraries, Museums, Archives | Publishers, Jobbers | Humanities scholars | Archives, Libraries | Archives, Libraries | Archives, Libraries
Maintenance agency | DCMI | EDItEUR | TEI Consortium | LC w/ SAA | LC | LC
RDF
Resource Description Framework (RDF)
– Graph theory (i.e. arcs and nodes)-influenced, XML syntax-based
metalanguage for expressing metadata about web resources
– Designed to convey metadata for machine consumption (raw RDF is
not very human-readable)
– Fundamental building block of RDF is the triple (subject + predicate +
object)
– Maintained by the W3C; RDF specification under revision
Value of RDF:
– A subject of debate (typically RDF vs. XML)!
– Pro: Model-based expression of metadata critical to the Semantic
Web (i.e. derived connections); more flexible, scalable and forgiving
standard than XML
– Con: RDF carries unneeded processing overhead vs. XML; RDF
specification has too many flaws; few use RDF
Applications of RDF:
– Open Directory Project, selected software (e.g., Siderean)
– OCLC Connexion exports Dublin Core in RDF/XML
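The triple model can be sketched without any RDF library: each statement is a (subject, predicate, object) tuple, and a set of such tuples forms a graph that can be queried by pattern. The `Triple` type, the `objects` helper, and the example.org subject URI are hypothetical; the predicate URIs reuse the Dublin Core element namespace, and the values come from this presentation's title slide.

```python
from collections import namedtuple

Triple = namedtuple("Triple", "subject predicate object")

DC = "http://purl.org/dc/elements/1.1/"
# Hypothetical URI standing in for the resource being described
doc = "http://example.org/slides/metadata-standards"

# A graph is just a set of triples: nodes joined by labeled arcs
graph = {
    Triple(doc, DC + "title", "Metadata Standards"),
    Triple(doc, DC + "creator", "Eric Childress"),
    Triple(doc, DC + "date", "2003-11-18"),
}


def objects(graph, subject, predicate):
    """Return every object linked to a subject by a given predicate."""
    return sorted(
        t.object for t in graph
        if t.subject == subject and t.predicate == predicate
    )


titles = objects(graph, doc, DC + "title")
print(titles)
```

Adding a new statement never requires restructuring the existing ones, which is part of the flexibility argument made for RDF.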
CSDGM (a.k.a. FGDC)
Content Standard for Digital Geospatial Metadata (CSDGM)
[better known as “FGDC”]
– CSDGM Version 2 - FGDC-STD-001-1998
– Defines a common set of terminology and definitions for the
documentation of digital geospatial data
– Maintained by Federal Geographic Data Committee (FGDC) [an
interagency committee]
– Crosswalk of FGDC to ISO 19115:2003(E) “Geographic information - Metadata” available; ANSI technical amendment for ISO-FGDC
harmonization in progress
Value of FGDC:
– Provides common standard for publishing metadata about geospatial
resources
– Widely used by government and business
– Many systems and applications support the standard
Applications of FGDC:
– Adopted or usable by major geospatial agencies in the West
– Usefulness extended with profiles (e.g. Biological Data)
COSATI
Committee on Scientific and Technical Information (COSATI)
– Cataloging rules and record format for the descriptive cataloging of
technical reports and similar documents
– Field tags are alpha strings (not numerical like MARC)
– Related COSATI subject category list can be used
– Owned by CENDI (the Commerce, Energy, NASA, Defense
Information Managers Group) [successor to COSATI]
Value of COSATI:
– Supports straightforward capture of useful metadata for scientific and
technical information
Applications of COSATI:
– Used by a number of science/technical and defense U.S. federal
agencies
– Small number of library systems (e.g., SIRSI) support COSATI
record import/export
– COSATI can be converted to MARC if desired
OCLC Online Computer Library Center
Questions
Links
Dublin Core: http://www.dublincore.org
EAD: http://www.loc.gov/ead
FGDC: http://www.fgdc.gov/metadata/meta_stand.html
MARC 21: http://lcweb.loc.gov/marc/marcdocz.html
MARCXML: http://www.loc.gov/marc/marcxml.html
METS: http://www.loc.gov/standards/mets
MIX: http://www.loc.gov/standards/mix
MODS: http://www.loc.gov/standards/mods
ONIX: http://www.editeur.org/onix.html
RDF: http://www.w3.org/RDF
TEI: http://www.tei-c.org
OCLC Research: http://www.oclc.org/research