Open Archives Initiative Where we are, Where we are going Carl Lagoze 4th OAF Workshop September, 2003


Open Archives Initiative
Where we are,
Where we are going
Carl Lagoze
4th OAF Workshop
September, 2003
Where we are now
• De facto standard for Internet information
exchange
• Deployed extensively and internationally
– (digital) libraries
– Museums
– Eprint repositories
– Research projects
Protocol Stability
• OAI-PMH has been stable since release
– No functional changes, just typographic edits
– Validation of leadership/participation model
• No plans for a 3.0 release
– Core protocol will not be extended
– Minor 2.x release could occur (more later)
– Additional implementation guidelines (more
later)
NSDL and OAI-PMH
The NSDL Context
• National STEM (Science, Technology,
Engineering, Mathematics, Medicine) Digital
Library
• Major National Science Foundation project
targeted at the application of web and Internet to
(STEM) education
• $25M over six years to over 100 projects
– Collections
– Services
– Targeted Research
– Core Integration
NSDL technical guidelines
• Aggregation rather than collection
– Core integration team will not manage any collections
• Spectrum of interoperability
– Accommodate diversity of participation models
– Open interfaces and standards permitting plug in of
array of value-added services
• One library many portals
– Accommodate multiple quality and selection metrics
– Tailor presentation of content and nature of services
to audience needs
Spectrum of interoperability
• Federation
– Agreements: strict use of standards (syntax, semantic, and business)
– Example: AACR, MARC, Z 39.50
• Harvesting
– Agreements: digital libraries expose metadata; simple protocol and registry
– Example: Open Archives metadata harvesting
• Gathering
– Agreements: digital libraries do not cooperate; services must seek out information
– Example: web crawlers and search engines
Translating to initial goals
• This is a big task that no one has done before!
• Work on the priorities
– Focus on one point on spectrum of interoperability
• Metadata harvesting
• Incorporate NSF funded collections and selected other collections
– Leverage existing (or at least emerging) technologies and protocols
• OAI, uPortal, Shibboleth, SDLIP, InQuery
– Provide reliable base level services
• Search and Discovery, Access Management, User Profiles, Exemplary
Portals, Persistence
• Plant some seeds for the future
– Machine-assisted metadata generation
– Automated collection aggregation
– Web gathering strategies
Metadata Repository
• Central storage of all metadata about all resources in
the NSDL
– Defines the extent of NSDL collection
– Metadata includes collections, items, annotations, etc.
• MR main functions
– Aggregation
– Normalization
– Redistribution
• Ingest of metadata by various means
– Harvesting, manual, automatic, cross-walking
• Open access to MR contents for service builders via
OAI-PMH
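How a service builder might read the MR over OAI-PMH can be sketched roughly as follows. This is an illustration only: the base URL `http://mr.example.org/oai` is a placeholder rather than the NSDL endpoint, the function names are invented here, and the sample response is hand-written. The verb and argument names (`ListRecords`, `metadataPrefix`, `from`, `set`, `resumptionToken`) are those of OAI-PMH 2.0.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
BASE_URL = "http://mr.example.org/oai"  # placeholder, not a real endpoint


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None,
                     set_spec=None, resumption_token=None):
    """Build a ListRecords request URL (hypothetical helper)."""
    if resumption_token:
        # resumptionToken is exclusive: it is sent with the verb and nothing else.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date  # selective harvesting by datestamp
        if set_spec:
            params["set"] = set_spec    # selective harvesting by set
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None) from a response."""
    root = ET.fromstring(xml_text)
    ids = [e.text for e in root.iter(OAI_NS + "identifier")]
    token = root.find(f"{OAI_NS}ListRecords/{OAI_NS}resumptionToken")
    return ids, (token.text if token is not None and token.text else None)


# Hand-written sample response, trimmed to headers only.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier>
      <datestamp>2003-09-01</datestamp></header></record>
    <record><header><identifier>oai:example.org:2</identifier>
      <datestamp>2003-09-02</datestamp></header></record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

ids, token = parse_list_records(SAMPLE)
next_url = list_records_url(BASE_URL, resumption_token=token)
```

A harvester would repeat the request with each returned token until the response carries none, which is how the protocol pages through a large repository.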
Metadata Strategy
• Collect and redistribute any native (XML)
metadata format
• Provide crosswalks to Dublin Core from
standard formats
– DC-GEM, LTSC (IMS), ADL (SCORM),
MARC, FGDC, EAD
• Concentrate on collection-level metadata
• Use automatic generation to augment
item-level metadata
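A crosswalk to Dublin Core can be pictured as a field-by-field mapping, as in the minimal sketch below. The native field names and the mapping table are hypothetical (loosely modelled on a learning-object record) and do not reproduce any published crosswalk; the point is only that fields with no DC counterpart are dropped on the way.

```python
# Hypothetical mapping from a native format's field names to unqualified
# Dublin Core elements; a real crosswalk would follow the published mapping.
CROSSWALK = {
    "general.title": "title",
    "general.description": "description",
    "lifecycle.contribute.author": "creator",
    "general.keyword": "subject",
    "rights.description": "rights",
}


def to_dublin_core(native_record):
    """Map a flat native record (field -> value or list) to DC element lists.

    Returns the DC record and the list of native fields that had no mapping.
    """
    dc = {}
    unmapped = []
    for field, value in native_record.items():
        element = CROSSWALK.get(field)
        if element is None:
            unmapped.append(field)  # information lost in the crosswalk
            continue
        values = value if isinstance(value, list) else [value]
        dc.setdefault(element, []).extend(values)  # DC elements are repeatable
    return dc, unmapped


record = {
    "general.title": "Plate Tectonics",
    "general.keyword": ["geology", "earth science"],
    "educational.typicalAgeRange": "12-15",
}
dc, unmapped = to_dublin_core(record)
```

Keeping the native record alongside the crosswalked one, as the strategy above proposes, means the unmapped fields remain available to services that understand the native format.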
Importing metadata into the MR
[Diagram: Collections → Harvest → Staging area → Cleanup and crosswalks → Database load → Metadata Repository]
Exporting metadata from the MR
[Diagram: Metadata Repository → SQL queries → Create OAI server tables → OAI server → Harvest → NSDL services]
NSDL and OAI-PMH: Two years later
• Concepts are good, practice is hard
• Issues
– Metadata is hard
• http://www.well.com/~doctorow/metacrap.htm
– XML is hard
– Protocols are hard
• Static repositories (more later)
– IP is relevant (more later)
Some Essential Metadata
Questions
• Review original (DC) metadata
assumptions
– Metadata is essential for good resource
discovery
– “Joe Sixpack” could create metadata
• Account for current realities
– 2003 is not 1994
– Google, etc. keep getting better
Metadata Triage
[Figure: Metadata Space; labels: Metadata Richness/Quality/Trust, Contextual Information (e.g., web links), Automatic Indexing]
Reconsidering the Dublin Core
Requirement
• Questions about utility of unqualified DC
– The conundrum….
• Specification too loose to serve intended
interoperability goal
• But more complex metadata may be too hard
• Limited energy for interoperability
– Data providers implement required DC at
expense of better metadata
• Use of protocol for purposes other than
resource discovery
Rethinking record-oriented model
[Diagram labels: Base Web Graph, NSDL Selections, Descriptive Metadata, Annotations, Branding, Collection (Semantic), People and Organizations, Equivalence]
Implications for record-oriented harvesting?
Topology Evolution
Simple Data Provider, Service Provider Topology
[Diagram: Search, Browse, and Linking services, each with an OAI-PMH harvester; four repositories, each with an OAI-PMH server]
Topology Evolution (cont.)
Metadata Aggregator
[Diagram components: four collections, each with an OAI-PMH server; OAI-PMH harvesters; a Metadata Repository with an OAI-PMH server; Search and Browse services]
Topology Evolution (cont.)
OAI-PMH p2p network
[Diagram: network of interconnected OAI-PMH servers and OAI-PMH harvesters]
OAI-P2pMH Issues
• Document (metadata) location
– Exploit unique identifiers, use efficient key-based
location mechanisms (distributed hash tables)
• Provenance-based queries
– Metadata records may go through refinement and/or
translation phases as they move through value-added
aggregators.
– Exploit provenance guidelines
• Network harvesting
– Broadcast query (Gnutella) inefficient
– Exploit techniques for efficient routing of queries (Ptrees)
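The idea of key-based location can be illustrated with a small consistent-hashing ring, one common building block of distributed hash tables. This is a sketch under assumptions, not a design taken from the slides: the peer names and the `Ring` class are invented here, and a real DHT would add routing tables, replication, and handling of peers joining and leaving.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a point on the ring (an integer derived from SHA-1)."""
    return int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Hypothetical consistent-hashing ring over a fixed set of peers."""

    def __init__(self, peers):
        self._points = sorted((_hash(p), p) for p in peers)
        self._keys = [h for h, _ in self._points]

    def locate(self, identifier):
        """Return the peer responsible for an identifier.

        The responsible peer is the first one clockwise from the
        identifier's position on the ring, wrapping around at the end.
        """
        i = bisect.bisect_right(self._keys, _hash(identifier)) % len(self._keys)
        return self._points[i][1]


peers = ["peer-a", "peer-b", "peer-c"]
ring = Ring(peers)
owner = ring.locate("oai:example.org:1")
```

Because the owner depends only on the unique identifier and the set of peers, any node can compute where a record's metadata should live without broadcasting a query.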
OAI-PMH and Intellectual Property
• Protocol exists in a context where
information providers have concerns about
use of intellectual property
• OAI-PMH is nominally about metadata,
but…
– Rich metadata is an intellectual product
– The protocol can be used to transmit anything
(e.g. content) that can be encoded in XML
– Generally metadata leads to content so….
OAI-rights effort
• Goal is to investigate and develop means
of expressing rights about metadata and
resources in the OAI framework.
• The result will be an addition to the OAI
implementation guidelines that specifies
mechanisms for rights expressions within
OAI-PMH.
– No changes to core protocol
OAI-rights Effort (cont.)
• Extensible, providing a general framework for
expressing rights statements within OAI-PMH.
– Not an effort to develop a new rights expression
language
• Use Creative Commons licenses as a motivating
and deployable example.
• Release of specification by 2nd quarter ’04
• Invited OAI-rights group
– Standard OAI development model
Dimensions of OAI-PMH and rights
Entity Association
• Metadata: concern in NSDL for (re)use of
rich metadata
• Content: predominant application of the
protocol to resource discovery and
ultimate access makes this important
Dimensions of OAI-PMH and rights
Aggregation Association
• OAI-PMH aggregations
– Repository
– Set
– Item
• Rights association with an aggregation
may provide shortcut (e.g., the rights for all
resources in a repository/set…)
• Cost of shortcut is pseudo-statefulness,
possibly complex overriding rules
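One possible overriding rule is "most specific wins": an item-level statement overrides a set-level one, which overrides the repository-level one. The sketch below assumes that rule purely for illustration; the slides do not fix one, and the function name and the rights labels (`by`, `by-nc`, `all-rights-reserved`) are placeholders standing in for real rights expressions.

```python
def resolve_rights(item_id, item_sets, repo_rights=None, set_rights=None,
                   item_rights=None):
    """Return (rights, level) for an item under a most-specific-wins rule.

    Assumed precedence (an illustration, not part of OAI-PMH):
    item, then set, then repository.
    """
    set_rights = set_rights or {}
    item_rights = item_rights or {}
    if item_id in item_rights:
        return item_rights[item_id], "item"
    matches = sorted({set_rights[s] for s in item_sets if s in set_rights})
    if len(matches) == 1:
        return matches[0], "set"
    if len(matches) > 1:
        # An item may belong to several sets carrying different statements;
        # this is the kind of overriding complexity noted above.
        raise ValueError(f"conflicting set-level rights for {item_id}: {matches}")
    return repo_rights, "repository"


from_set = resolve_rights("i1", ["physics"], repo_rights="by",
                          set_rights={"physics": "by-nc"})
from_repo = resolve_rights("i2", [], repo_rights="by",
                           set_rights={"physics": "by-nc"})
```

The harvester has to remember repository-level and set-level statements while processing individual records, which is the pseudo-statefulness the shortcut costs.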
Dimensions of OAI-PMH and rights
Binding
• Choices
– exploit mechanisms in metadata formats, e.g., DC-rights
– restrict the rights statements to some more specific
protocol mechanism
– allow some mixture of these methods.
• DC-rights problems
– Semantics is restricted to rights about resource
– Can’t embed XML in dc value
– What if DC is not required
• Burden on harvesters if rights embedding is not
explicit but scattered across several locations
OAI-PMH Static Repositories
• Provide a lightweight mechanism for data
provider participation
• Intended for relatively small and static
collections
• Two components
– Static Repository XML format
• Semantically equivalent to Identify and ListRecords
• Invisible to harvester
– Static Repository Gateway
• Virtual data provider for static repository data
• Unique baseURL for each “contained” static repository
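The gateway's role can be sketched as a dispatcher that maps each baseURL to one static repository and answers protocol requests on its behalf. Everything concrete below is an assumption made for illustration: the host `gateway.example.org`, the in-memory dictionaries standing in for parsed Static Repository files, and the simplified dictionary responses in place of OAI-PMH XML.

```python
from urllib.parse import urlsplit, parse_qs

OAI_VERBS = {"Identify", "ListMetadataFormats", "ListSets",
             "ListIdentifiers", "ListRecords", "GetRecord"}

# Stand-in for parsed Static Repository files, keyed by the path segment the
# gateway assigns to each one. The dict layout is hypothetical.
STATIC_REPOSITORIES = {
    "sr1": {"name": "Static Repository 1",
            "records": [{"identifier": "oai:sr1:1", "title": "First item"}]},
    "sr2": {"name": "Static Repository 2", "records": []},
}


def handle_request(url, repositories=STATIC_REPOSITORIES):
    """Answer an OAI-PMH request on behalf of one contained static repository."""
    parts = urlsplit(url)
    repo_key = parts.path.strip("/").split("/")[-1]
    repo = repositories.get(repo_key)
    if repo is None:
        return {"status": 404}
    verb = parse_qs(parts.query).get("verb", [None])[0]
    if verb not in OAI_VERBS:
        return {"status": 200, "error": "badVerb"}  # OAI-PMH error code
    if verb == "Identify":
        # Each contained repository reports its own, unique baseURL.
        return {"status": 200, "repositoryName": repo["name"],
                "baseURL": f"{parts.scheme}://{parts.netloc}{parts.path}"}
    if verb == "ListRecords":
        return {"status": 200, "records": repo["records"]}
    return {"status": 501, "note": "verb omitted from this sketch"}


identify = handle_request("http://gateway.example.org/sr1?verb=Identify")
listing = handle_request("http://gateway.example.org/sr1?verb=ListRecords")
```

From the harvester's side each of these baseURLs looks like an ordinary data provider, which is why the static file itself stays invisible.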
Static Repositories and Static Repository Gateway
[Diagram: a Harvester issues HTTP GET requests to the Static Repository Gateway, which exposes static repositories SR1, SR2, and SR3 at http://srg.org/sr1/verb=...., http://srg.org/sr2/verb=...., and http://srg.org/sr3/verb=....]
Static Repositories Open Issue
Relationship to RSS?
Conclusions
• Interoperability and lowest common denominator
• Rapid advances in automated methods
– Moore’s law
– Smart algorithms
– Benefits of issues of scale
• Combining human effort and automated
methods
– Extracting order from chaos
– Learning from order
• Move beyond resource discovery