OAI Protocol for Metadata Harvesting


OAI Protocol for Metadata
Harvesting
Tim Brody
Intelligence, Agents, Multimedia Group
University of Southampton
OpCit – http://opcit.eprints.org/
www.ecs.soton.ac.uk
BCS Metadata Meeting, London 29th May 2002
(Many slides borrowed from Michael L. Nelson)
OAI 2.0
• Public, stable version not released yet … (but
very close)
– Beta released mid-May
– Public release scheduled: 1st June
• 2.0 implementations in the pipeline
– British Library, Cornell Univ, Ex Libris, my.OAI, Humboldt Univ,
InQuirion Pty Ltd, Library of Congress, NASA, OCLC, Old
Dominion Univ, U. of Illinois, U. of Southampton, UCLA,
Johns Hopkins U., Indiana U., NYU, UKOLN, Virginia Tech
Open Archives Initiative
• The protocol is openly documented, and metadata is “exposed” to at least some peer group (note: rights management can still apply!)
• Archive defined as a “collection of stuff”, not the archivist’s definition of “archive”. “Repository” used in most OAI documents.
• OAI is happening at break-neck speed...
Metadata Harvesting
• Move away from distributed searching
• Extract metadata from various sources
• Build services on local copies of metadata
– Resources remain at remote repositories
[Figure: a user searches for “cfd applications” against a local copy of metadata; all searching, browsing, etc. is performed on the metadata there. Metadata is harvested offline from each node; each node is independently maintained, and individual nodes can still support direct user interaction.]
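The harvesting model above can be illustrated with a minimal sketch, assuming a service provider that keeps harvested titles in memory. The class `LocalMetadataStore` and its methods are hypothetical names chosen for illustration; they are not part of OAI-PMH or of any tool named in these slides.

```python
from dataclasses import dataclass, field


@dataclass
class LocalMetadataStore:
    """Hypothetical service-provider store: a local copy of harvested metadata."""

    # identifier -> (repository name, title); the resources themselves stay remote
    records: dict = field(default_factory=dict)

    def harvest(self, repository, batch):
        """Merge a batch of (identifier, title) pairs fetched offline from one node."""
        for identifier, title in batch:
            self.records[identifier] = (repository, title)

    def search(self, phrase):
        """Search only the local metadata copy; no remote node is contacted."""
        phrase = phrase.lower()
        return sorted(
            identifier
            for identifier, (_, title) in self.records.items()
            if phrase in title.lower()
        )
```

The point of the design is that a query such as “cfd applications” touches only the local copy, so search cost does not depend on how many remote nodes exist or whether they are reachable at query time.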
Metadata Harvesting
• Repositories (archives etc.) = low
implementation cost
• Services = higher implementation cost
• Similar to web search model
– DP9 gateway makes it exactly the same
| Santa Fe convention | OAI-PMH v.1.0/1.1 | OAI-PMH v.2.0
nature | experimental | experimental | stable
verbs | Dienst | OAI-PMH | OAI-PMH
requests | HTTP GET/POST | HTTP GET/POST | HTTP GET/POST
responses | XML | XML | XML
transport | HTTP | HTTP | HTTP
metadata | OAMS | unqualified Dublin Core | unqualified Dublin Core
about | eprints | document-like objects | resources
model | metadata harvesting | metadata harvesting | metadata harvesting
OAI-PMH v.2.0 [06/2002]
• Goal: recurrent exchange of metadata
about resources between systems
• Input:
– OAI-PMH v.1.0 [01/01 – 09/02]
– feedback on OAI-implementers
– deliberations by OAI-tech [09/01 -]
– alpha test group of OAI-PMH v.2.0 [03/02 -]
OAI-PMH v.2.0 [06/2002]
• low-barrier interoperability specification
• metadata harvesting model: data provider / service
provider
• metadata about resources
• autonomous protocol
• distinction between protocol and periphery
• community-specific extensions
• HTTP based
• XML responses
• unqualified Dublin Core
• stable (1.0 characterized as experimental)
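OAI Data Model: Resources / Items / Records
• resource: the thing the metadata is about (the slide's example: “David”)
• item = identifier; all available metadata about the resource
• records: the item's metadata in individual formats (Dublin Core metadata, MARC metadata, SPECTRUM metadata)
• record = identifier + metadata format + datestamp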
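The item/record relationship can be sketched as two small data classes. This is an illustrative model only: the class and field names are assumptions, `oai_dc` is the metadata prefix OAI-PMH uses for unqualified Dublin Core, and the `marc` prefix and sample values are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    """record = identifier + metadata format + datestamp (plus the metadata itself)."""

    identifier: str
    metadata_prefix: str
    datestamp: str
    metadata: str


@dataclass
class Item:
    """item = identifier; the source of records in one or more metadata formats."""

    identifier: str
    records: dict = field(default_factory=dict)  # metadata_prefix -> Record

    def add(self, metadata_prefix, datestamp, metadata):
        """Store this item's metadata in one more format."""
        self.records[metadata_prefix] = Record(
            self.identifier, metadata_prefix, datestamp, metadata
        )

    def get_record(self, metadata_prefix):
        """Return the record in the requested format, or None if unavailable."""
        return self.records.get(metadata_prefix)
```

An item with both Dublin Core and MARC metadata therefore yields two records that share one identifier and differ in metadata format.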
Overview of OAI Verbs
Verb | Function
archival metadata:
Identify | description of archive
ListMetadataFormats | metadata formats supported by archive
ListSets | sets defined by archive
harvesting verbs:
ListIdentifiers | OAI unique ids contained in archive
ListRecords | listing of N records
GetRecord | listing of a single record
Most verbs take arguments: dates, sets, ids, metadata formats, and resumption token (for flow control)
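Since every request is an HTTP GET or POST carrying a verb plus arguments, a request URL can be assembled with the standard library. This is a minimal sketch; `build_request` is a hypothetical helper, and the base URL and identifier are the ones shown in the sample response later in the slides.

```python
from urllib.parse import urlencode

# The six verbs listed in the overview above.
VERBS = {
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
}


def build_request(base_url, verb, **arguments):
    """Return a GET URL for one OAI-PMH request (hypothetical helper)."""
    if verb not in VERBS:
        raise ValueError(f"{verb} is not a valid OAI-PMH verb")
    # Arguments are sorted only to make the output deterministic.
    query = urlencode([("verb", verb)] + sorted(arguments.items()))
    return f"{base_url}?{query}"


url = build_request(
    "http://arXiv.org/oai2",
    "GetRecord",
    identifier="oai:arXiv:cs/0112017",
    metadataPrefix="oai_dc",
)
```

Because `from` is a Python keyword, that argument has to be passed as `**{"from": "2002-01-01"}` rather than as a plain keyword argument.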
Identify
1.1
• Arguments
– none
• Errors
– none
2.0
• Arguments
– none
• Errors
– badArgument
ListMetadataFormats
1.1
• Arguments
– identifier (OPTIONAL)
• Errors
– id does not exist
2.0
• Arguments
– identifier (OPTIONAL)
• Errors
– badArgument
– noMetadataFormats
– idDoesNotExist
ListSets
1.1
• Arguments
– resumptionToken (EXCLUSIVE)
• Errors
– no set hierarchy
2.0
• Arguments
– resumptionToken (EXCLUSIVE)
• Errors
– badArgument
– badResumptionToken
– noSetHierarchy
ListIdentifiers
1.1
• Arguments
– from (OPTIONAL)
– until (OPTIONAL)
– set (OPTIONAL)
– resumptionToken (EXCLUSIVE)
• Errors
– no records match
2.0
• Arguments
– from (OPTIONAL)
– until (OPTIONAL)
– set (OPTIONAL)
– resumptionToken (EXCLUSIVE)
– metadataPrefix (REQUIRED)
• Errors
– badArgument
– cannotDisseminateFormat
– badResumptionToken
– noSetHierarchy
– noRecordsMatch
ListRecords
1.1
• Arguments
– from (OPTIONAL)
– until (OPTIONAL)
– set (OPTIONAL)
– resumptionToken (EXCLUSIVE)
– metadataPrefix (REQUIRED)
• Errors
– no records match
– metadata format cannot be disseminated
2.0
• Arguments
– from (OPTIONAL)
– until (OPTIONAL)
– set (OPTIONAL)
– resumptionToken (EXCLUSIVE)
– metadataPrefix (REQUIRED)
• Errors
– noRecordsMatch
– cannotDisseminateFormat
– badResumptionToken
– noSetHierarchy
– badArgument
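The OPTIONAL / REQUIRED / EXCLUSIVE labels for the 2.0 list verbs can be read as a small validation rule. The sketch below is one interpretation, assuming EXCLUSIVE means resumptionToken must be the only argument besides the verb; `check_list_arguments` is a hypothetical name, and a real repository would also check date formats and the other error conditions listed above.

```python
def check_list_arguments(arguments):
    """Return "badArgument" for an illegal argument combination, else None.

    `arguments` maps argument names to values for a 2.0 ListIdentifiers or
    ListRecords request, excluding the verb itself.
    """
    allowed = {"from", "until", "set", "metadataPrefix", "resumptionToken"}
    if set(arguments) - allowed:
        return "badArgument"  # unknown argument name
    if "resumptionToken" in arguments:
        # EXCLUSIVE: no other argument may accompany the token
        return None if len(arguments) == 1 else "badArgument"
    if "metadataPrefix" not in arguments:
        return "badArgument"  # REQUIRED when no token is given
    return None
```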
GetRecord
1.1
• Arguments
– identifier (REQUIRED)
– metadataPrefix (REQUIRED)
• Errors
– id does not exist
– metadata format cannot be disseminated
2.0
• Arguments
– identifier (REQUIRED)
– metadataPrefix (REQUIRED)
• Errors
– badArgument
– cannotDisseminateFormat
– idDoesNotExist
response with no errors
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
<responseDate>2002-02-08T08:55:46Z</responseDate>
<request verb="GetRecord"… …>http://arXiv.org/oai2</request>
<GetRecord>
<record>
<header>
<identifier>oai:arXiv:cs/0112017</identifier>
<datestamp>2001-12-14</datestamp>
<setSpec>cs</setSpec>
<setSpec>math</setSpec>
</header>
<metadata>
…..
</metadata>
</record>
</GetRecord>
</OAI-PMH>
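A harvester reads the record header out of such a response. The sketch below parses a simplified copy of the slide's example with the standard library: the elided request attributes are left out and the metadata element is left empty, because the slide does not show them. The function name `parse_header` is hypothetical. Real OAI-PMH 2.0 responses place these elements in the `http://www.openarchives.org/OAI/2.0/` XML namespace, which this simplified sample omits just as the slide does.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free copy of the slide's example response.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request verb="GetRecord">http://arXiv.org/oai2</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:arXiv:cs/0112017</identifier>
        <datestamp>2001-12-14</datestamp>
        <setSpec>cs</setSpec>
        <setSpec>math</setSpec>
      </header>
      <metadata/>
    </record>
  </GetRecord>
</OAI-PMH>"""


def parse_header(xml_bytes):
    """Extract identifier, datestamp, and set membership from a GetRecord response."""
    root = ET.fromstring(xml_bytes)
    header = root.find("./GetRecord/record/header")
    return {
        "identifier": header.findtext("identifier"),
        "datestamp": header.findtext("datestamp"),
        "sets": [s.text for s in header.findall("setSpec")],
    }


header = parse_header(SAMPLE)
```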
response with error
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
<responseDate>2002-02-08T08:55:46Z</responseDate>
<request>http://arXiv.org/oai2</request>
<error code="badVerb">ShowMe is not a valid OAI-PMH verb</error>
</OAI-PMH>
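Because 2.0 reports errors inside the XML response, a client can check for an error element before processing the rest. This is a minimal sketch using a namespace-free copy of the slide's example; `OAIError` and `raise_for_error` are hypothetical names, and a response could in principle carry more than one error element, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

ERROR_SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request>http://arXiv.org/oai2</request>
  <error code="badVerb">ShowMe is not a valid OAI-PMH verb</error>
</OAI-PMH>"""


class OAIError(Exception):
    """Raised when a response carries an <error> element (hypothetical class)."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code


def raise_for_error(xml_bytes):
    """Return the parsed root, or raise OAIError for the first error found."""
    root = ET.fromstring(xml_bytes)
    error = root.find("error")
    if error is not None:
        raise OAIError(error.get("code"), error.text or "")
    return root
```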
resumptionToken Flow-Control
• Idempotency of resumptionToken: re-issuing a resumptionToken returns the same incomplete list
– while no changes occur in the repository: strictly the same list
– while changes occur in the repository: all items with unchanged datestamp
• New attributes for the resumptionToken:
– expirationDate
– completeListSize
– cursor
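The flow-control loop on the harvester side can be sketched as follows. Here `fetch_page` stands in for an HTTP request plus XML parsing and is an assumption of this example, as are `harvest_all`, `fake_fetch`, and the in-memory `PAGES` table that plays the role of a repository.

```python
def harvest_all(fetch_page, metadata_prefix):
    """Yield every record of a list request, following resumptionTokens.

    `fetch_page(arguments)` must return (records, token), where token is
    None or empty once the list is complete.
    """
    arguments = {"metadataPrefix": metadata_prefix}
    while True:
        records, token = fetch_page(arguments)
        yield from records
        if not token:
            return
        # The token is exclusive: it replaces all other arguments.
        arguments = {"resumptionToken": token}


# In-memory stand-in for a repository: token -> (records, next token).
PAGES = {None: (["rec1", "rec2"], "t1"), "t1": (["rec3"], None)}


def fake_fetch(arguments):
    """Serve pages from PAGES; re-issuing a token returns the same page."""
    return PAGES[arguments.get("resumptionToken")]


all_records = list(harvest_all(fake_fetch, "oai_dc"))
```

Idempotency is what makes this loop safe to retry: if a request fails partway through, the harvester can re-issue the last token instead of restarting the whole harvest.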
Adoption
• evolution
• from talking about OAI-PMH
• to talking about projects that use OAI-PMH
• to talking about projects and failing to mention
they use OAI-PMH
• => OAI-PMH becomes part of the infrastructure
Data Providers (a.k.a. repositories)
• 49 registered repositories [11/2001]
• 65 registered repositories [03/2002]
• 77 registered repositories [05/2002]
• 5+ million records
• many unregistered repositories
• private implementations (e.g. RDN)
Service Providers
• Arc: cross-searching of registered repositories
[ http://arc.cs.odu.edu ]
• CiteBase: research literature search + citation
ranking
[ http://citebase.eprints.org ]
• OLAC: cross-searching of Language Archive
Community repositories
[ http://www.language-archives.org/index.html ]
Service Providers
• Scirus scientific search engine [Elsevier]
[ http://www.scirus.com ]
• my.OAI : user-tailorable cross-searching of
registered repositories [FS Consulting, Inc.]
[ http://www.myoai.com ]
• Growing interest from web search engines
OAI-PMH tools
• Repository Explorer: interactive exploration of
repositories [Virginia Tech]
[ http://www.purl.org/NET/oai_explorer ]
• eprints.org: generic OAI-PMH compliant
repository software [U of Southampton]
[ http://www.eprints.org ]
• ALCME repository and harvester software
[OCLC]
[ http://alcme.oclc.org/index.html ]
• APIs, other tools at www.openarchives.org
http://www.openarchives.org/