Metadata Harvesting
Interoperable digital collections
Two basic approaches
• One service provider with access to
resources stored in multiple locations
– Information about the resources located at the
service provider.
– Services developed to use the information to
provide connections to resources at multiple
locations
• Distributed services
– Information kept with the resources
– Services interact with multiple collection sites
Two protocols
• Z39.50
– Developed before the web
– Protocol for communicating with collection
holders in order to provide services.
• Open Archives Initiative
– Recent innovation
– Central service provider gathers
information from collection holders
Z39.50 - briefly
• Information Retrieval Service Definition and Protocol
Specifications for Library Applications
• Initially developed over the OSI network standards
• Protocol for information exchange
– Free the information seeker from the need to know the
details of the target database configuration
• Each site provides services
– Each service queries remote sites for needed information
• Information requests mapped to database queries at the
collection site.
• Some inconsistency in the interpretation of queries.
Distributed Resources, Multiple Services
Approach 1: One service provider gathers information about data and uses it to provide services.
[Diagram: five data providers feeding one service provider (search, browse, compare, etc.)]
Distributed data and services
Approach 2: Each system is both a data repository and a service provider. Services query other data providers as needed.
[Diagram: several systems, each offering services such as search, browse, and compare]
Open Archives Initiative (OAI)
• Web-based
– Uses HTTP to communicate between sites
• Centralized server
– Services provided from a site that has
already gathered the information it needs
for those services from a distributed
collection of sites.
OAI Compared to Z39.50
                        Z39.50            OAI
Content (Objects)       Distributed       Distributed
World View              Bibliographic     Bibliographic
Object Presentation     Data provider     Data provider
Searching is            Distributed       Centralized
Search done by          Data provider     Service provider
Metadata searched is    Up to date        Stale
Semantic Mapping        When searching    Metadata delivery
Source: oai.grainger.uiuc.edu/FinalReport/JCDL_2003_OAI_Intro.ppt
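The "Up to date" versus "Stale" row can be illustrated with a small sketch. It assumes hypothetical in-memory data providers (the names and records below are made up for illustration) and contrasts searching each provider at query time with searching a previously harvested copy.

```python
# Hypothetical data providers: provider name -> {identifier: title}.
providers = {
    "A": {"oai:a.example:1": "Metadata harvesting"},
    "B": {"oai:b.example:1": "Distributed search"},
}

def distributed_search(term):
    """Z39.50 style: query every data provider at search time."""
    return sorted(
        identifier
        for records in providers.values()
        for identifier, title in records.items()
        if term.lower() in title.lower()
    )

def harvest():
    """OAI style, step 1: copy the metadata to the service provider."""
    return {
        identifier: title
        for records in providers.values()
        for identifier, title in records.items()
    }

def centralized_search(index, term):
    """OAI style, step 2: search only the harvested copy."""
    return sorted(
        identifier for identifier, title in index.items()
        if term.lower() in title.lower()
    )

index = harvest()
# A record added after the harvest is invisible until the next harvest.
providers["A"]["oai:a.example:2"] = "Harvesting protocols"

live_hits = distributed_search("harvesting")
stale_hits = centralized_search(index, "harvesting")
print(live_hits)
print(stale_hits)
```

The centralized search misses the newer record until the service provider harvests again, which is the trade-off the table calls "Stale".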
Open Archives Initiative Protocol for Metadata Harvesting -- OAI-PMH
• OAI-PMH defines an interface between the Harvester and any number of Repositories
[Diagram: the Harvester (service provider) sends an HTTP request (OAI verb) to the Repository (metadata provider), which returns an HTTP response (XML). The repository side is implemented as CGI, ASP, PHP, or other.]
OAI components
• Service Providers and Data Providers
• Requests and Responses
http://www.oaforum.org/tutorial/english/page3.htm#section3
Records
• Metadata of a resource.
• Three parts
– Header (required)
• Identifier (required: 1 only)
• Datestamp (required: 1 only)
• setSpec elements (optional: 0, 1, or more)
• Status attribute for deleted item
– Metadata (required)
• XML encoded metadata with root tag, namespace
• Repositories must support Dublin Core, other formats
optional
– “About” statement (optional)
• Rights statements
• Provenance statements
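A minimal sketch of reading the header part of a record, using Python's standard xml.etree.ElementTree. The sample record, its title, and the names Header and parse_header are assumptions made up for illustration; only the element names (identifier, datestamp, setSpec, status) come from the structure described above.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

@dataclass
class Header:
    identifier: str                                  # required, exactly one
    datestamp: str                                   # required, exactly one
    set_specs: list = field(default_factory=list)    # optional, zero or more
    deleted: bool = False                            # status="deleted" attribute

def parse_header(record):
    """Extract the header fields from one <record> element."""
    header = record.find(OAI_NS + "header")
    return Header(
        identifier=header.findtext(OAI_NS + "identifier"),
        datestamp=header.findtext(OAI_NS + "datestamp"),
        set_specs=[s.text for s in header.findall(OAI_NS + "setSpec")],
        deleted=header.get("status") == "deleted",
    )

# Hypothetical record, not taken from a real repository.
SAMPLE = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:etd.vt.edu:etd-1234567890</identifier>
    <datestamp>2002-12-14</datestamp>
    <setSpec>biology</setSpec>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Hypothetical title</dc:title>
    </oai_dc:dc>
  </metadata>
</record>"""

header = parse_header(ET.fromstring(SAMPLE))
print(header)
```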
Identifiers
• Globally unique identifier
• Valid URI
– Examples
• oai:<archiveId>:<recordId>
• oai:etd.vt.edu:etd-1234567890
– Must resolve to one item
• No duplicates
• No reuse of previously used identifiers
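The identifier rules can be sketched as follows. The regular expression is a simplified assumption (it only checks the oai:<archiveId>:<recordId> shape, not the full identifier grammar), and IdentifierRegistry is a hypothetical helper showing the "no duplicates, no reuse" rule.

```python
import re

# Simplified assumption: oai:<archiveId>:<recordId>, with no colon or
# whitespace in the archive id and no whitespace in the record id.
OAI_ID = re.compile(r"^oai:(?P<archive>[^:\s]+):(?P<record>\S+)$")

def split_identifier(identifier):
    """Return (archiveId, recordId) or raise ValueError if malformed."""
    match = OAI_ID.match(identifier)
    if match is None:
        raise ValueError(f"not an oai identifier: {identifier!r}")
    return match.group("archive"), match.group("record")

class IdentifierRegistry:
    """Remembers every identifier ever issued so that none is reused."""

    def __init__(self):
        self._ever_issued = set()

    def issue(self, identifier):
        split_identifier(identifier)  # raises ValueError if malformed
        if identifier in self._ever_issued:
            raise ValueError(f"identifier already used: {identifier}")
        self._ever_issued.add(identifier)
        return identifier

registry = IdentifierRegistry()
registry.issue("oai:etd.vt.edu:etd-1234567890")
parts = split_identifier("oai:etd.vt.edu:etd-1234567890")
print(parts)
```

Because the registry never forgets an identifier, a deleted item's identifier stays reserved rather than being handed to a new item.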
Datestamps
• Date of last modification of a record
– Used only for harvesting (meta metadata?)
• Mandatory for each item in the repository
• Two levels of granularity possible
– YYYY-MM-DD
– YYYY-MM-DDThh:mm:ssZ
• T separates date from time; Z marks the time zone, which must be GMT (UTC)
• Allows harvesting incrementally -- get only
what is new since last visit
– Accessed by arguments from and until
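A short sketch of producing datestamps at either granularity for the from and until arguments. The function name to_datestamp and the sample time are assumptions for illustration; the conversion to UTC reflects the rule that the time zone must be GMT.

```python
from datetime import datetime, timedelta, timezone

def to_datestamp(moment, granularity="YYYY-MM-DD"):
    """Format a timezone-aware datetime as an OAI-PMH datestamp in UTC."""
    utc = moment.astimezone(timezone.utc)
    if granularity == "YYYY-MM-DD":
        return utc.strftime("%Y-%m-%d")
    if granularity == "YYYY-MM-DDThh:mm:ssZ":
        return utc.strftime("%Y-%m-%dT%H:%M:%SZ")
    raise ValueError(f"unsupported granularity: {granularity}")

# Hypothetical time of the last harvest, recorded five hours behind UTC.
last_visit = datetime(2006, 10, 16, 20, 37, 44,
                      tzinfo=timezone(timedelta(hours=-5)))

day_stamp = to_datestamp(last_visit)
second_stamp = to_datestamp(last_visit, "YYYY-MM-DDThh:mm:ssZ")
print(day_stamp, second_stamp)
```

Passing the result as the from argument on the next visit asks the repository only for records changed since then.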
The OAI-PMH verbs
• Each requests a specific response from
a data repository
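Since every request is an HTTP call carrying a verb and its arguments, a request URL can be assembled with a small helper. This is a sketch only: build_request is a hypothetical name, and the base URL is the placeholder used in the examples in these slides.

```python
from urllib.parse import urlencode

# The six OAI-PMH verbs.
OAI_VERBS = {"Identify", "ListMetadataFormats", "ListSets",
             "ListIdentifiers", "ListRecords", "GetRecord"}

def build_request(base_url, verb, **arguments):
    """Return the GET URL for one OAI-PMH request."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    # "from" is a Python keyword, so callers pass from_ and it is renamed here.
    if "from_" in arguments:
        arguments["from"] = arguments.pop("from_")
    return f"{base_url}?{urlencode({'verb': verb, **arguments})}"

url = build_request("http://archive.org/oai-script", "ListRecords",
                    metadataPrefix="oai_dc", from_="2002-12-01")
print(url)
```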
Identify
• Function: Description of the archive
• Example: http://www.language-archives.org/cgi-bin/olaca3.pl?verb=Identify
• Parameters: none
• Errors/exceptions:
– badArgument (raised if any argument is supplied; there should not be any)
• Response format:
Element            Example                                Ordinality ‡
repositoryName     My Archive                             1
baseURL            http://archive.org/oai                 1
protocolVersion    2.0                                    1
earliestDatestamp  1999-01-01                             1
deletedRecord      no, transient, persistent              1
granularity        YYYY-MM-DD, YYYY-MM-DDThh:mm:ssZ       1
adminEmail         [email protected]                      +
compression        deflate, compress                      *
description        oai-identifier, eprints, friends, …    *
‡ Ordinality: 1 = mandatory, 1 only; + = mandatory, 1 or more; * = optional, 0 or more
Actual response from
http://www.language-archives.org/cgi-bin/olaca3.pl?verb=Identify
<OAI-PMH xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
    http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2006-10-17T01:37:44Z</responseDate>
  <request verb="Identify">http://www.language-archives.org/cgi-bin/olaca3.pl</request>
  <Identify>
    <repositoryName>OLAC Aggregator</repositoryName>
    <baseURL>http://www.language-archives.org/cgi-bin/olaca3.pl</baseURL>
    <protocolVersion>2.0</protocolVersion>
    <adminEmail>mailto:[email protected]</adminEmail>
    <earliestDatestamp>2002-12-14</earliestDatestamp>
    <deletedRecord>no</deletedRecord>
    <granularity>YYYY-MM-DD</granularity>
    <!-- maybe later
    <compression>identity</compression>
    -->
    <description>
      <oai-identifier
          xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai-identifier
          http://www.openarchives.org/OAI/2.0/oai-identifier.xsd">
        <scheme>oai</scheme>
        <repositoryIdentifier>OLACA.language-archives.org</repositoryIdentifier>
        <delimiter>:</delimiter>
        <sampleIdentifier>oai:ethnologue.com:aaa</sampleIdentifier>
      </oai-identifier>
    </description>
    <description>
      <olac-archive type="institutional"
          xsi:schemaLocation="http://www.language-archives.org/OLAC/1.0/olac-archive
          http://www.language-archives.org/OLAC/1.0/olac-archive.xsd">
        <archiveURL>http://www.language-archives.org:8082/dp9/</archiveURL>
        <curator>Steven Bird &amp; Gary Simons</curator>
        <curatorTitle>Coordinators</curatorTitle>
        <curatorEmail>mailto:[email protected]</curatorEmail>
        <institution>Open Language Archives Community</institution>
        <institutionURL>http://www.language-archives.org/</institutionURL>
        <shortLocation>Philadelphia, U.S.A.</shortLocation>
        <location/>
        <synopsis>
          This repository contains all records from OLAC-registered archives. It is intended to be used by
          services which do not want to harvest individual OLAC archives.
        </synopsis>
        <access>
          Metadata may be used only subject to the access permissions given by the individual archives.
        </access>
      </olac-archive>
    </description>
  </Identify>
</OAI-PMH>
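A harvester would typically read a few fields out of an Identify response before doing anything else. The sketch below parses a trimmed copy of the response above with Python's standard library. The xmlns declaration on the root element is an assumption added so the sample parses with namespaces (the rendering above does not show it), and parse_identify is a hypothetical name.

```python
import xml.etree.ElementTree as ET

NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Trimmed copy of the response above; xmlns added as an assumption.
RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2006-10-17T01:37:44Z</responseDate>
  <request verb="Identify">http://www.language-archives.org/cgi-bin/olaca3.pl</request>
  <Identify>
    <repositoryName>OLAC Aggregator</repositoryName>
    <baseURL>http://www.language-archives.org/cgi-bin/olaca3.pl</baseURL>
    <protocolVersion>2.0</protocolVersion>
    <earliestDatestamp>2002-12-14</earliestDatestamp>
    <deletedRecord>no</deletedRecord>
    <granularity>YYYY-MM-DD</granularity>
  </Identify>
</OAI-PMH>"""

def parse_identify(xml_text):
    """Return the single-valued Identify fields as a dict."""
    root = ET.fromstring(xml_text)
    identify = root.find("oai:Identify", NS)
    if identify is None:
        raise ValueError("response has no Identify element")
    fields = ("repositoryName", "baseURL", "protocolVersion",
              "earliestDatestamp", "deletedRecord", "granularity")
    return {name: identify.findtext(f"oai:{name}", namespaces=NS)
            for name in fields}

info = parse_identify(RESPONSE)
print(info["repositoryName"], info["granularity"])
```

The granularity and earliestDatestamp values tell the harvester which datestamp format to use and how far back a full harvest needs to reach.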
ListMetadataFormats
• Function: retrieve available metadata formats
from archive
• Example: archive.org/oai-script?verb=ListMetadataFormats&identifier=oai:HUBerlin.de:3000218
• Parameters: identifier (optional)
• Errors/exceptions:
– badArgument
– idDoesNotExist
– noMetadataFormats
<OAI-PMH xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
    http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2006-10-17T01:58:06Z</responseDate>
  <request verb="ListMetadataFormats">http://www.language-archives.org/cgi-bin/olaca3.pl</request>
  <ListMetadataFormats>
    <metadataFormat>
      <metadataPrefix>olac</metadataPrefix>
      <schema>http://www.language-archives.org/OLAC/1.0/olac.xsd</schema>
      <metadataNamespace>http://www.language-archives.org/OLAC/1.0/</metadataNamespace>
    </metadataFormat>
    <metadataFormat>
      <metadataPrefix>olac_display</metadataPrefix>
      <schema>http://www.language-archives.org/OLAC/1.0/olac.xsd</schema>
      <metadataNamespace>http://www.language-archives.org/OLAC/1.0/</metadataNamespace>
    </metadataFormat>
    <metadataFormat>
      <metadataPrefix>oai_dc</metadataPrefix>
      <schema>http://www.openarchives.org/OAI/2.0/oai_dc.xsd</schema>
      <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
    </metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>
Response to http://www.language-archives.org/cgi-bin/olaca3.pl?verb=ListMetadataFormats
ListSets
• Function: retrieve set structure of a repository
• Example: archive.org/oai-script?verb=ListSets
• Parameters: resumptionToken (exclusive)
• Errors/exceptions:
– badArgument
– badResumptionToken
– noSetHierarchy
ListIdentifiers
• Function: abbreviated form of ListRecords, retrieve only headers
• Example: archive.org/oai-script?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2002-12-01
• Parameters:
– from (optional)
– until (optional)
– metadataPrefix (required)
– set (optional)
– resumptionToken (exclusive)
• Errors/exceptions:
– badArgument
– badResumptionToken
– cannotDisseminateFormat
– noRecordsMatch
– noSetHierarchy
ListRecords
• Function: harvest records from a repository
• Example: archive.org/oai-script?verb=ListRecords&metadataPrefix=oai_dc&set=biology
• Parameters:
– from (optional)
– until (optional)
– metadataPrefix (required)
– set (optional)
– resumptionToken (exclusive)
• Errors/exceptions:
– badArgument
– badResumptionToken
– cannotDisseminateFormat
– noRecordsMatch
– noSetHierarchy
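The "exclusive" resumptionToken argument is what lets a harvester page through a long ListRecords result. A minimal sketch follows, assuming a hypothetical fetch callable that takes a dict of request arguments and returns the response XML as text. The two-page fake repository below stands in for a real HTTP call, and its records omit the metadata part for brevity.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

def harvest_identifiers(fetch, **arguments):
    """Yield every record identifier, following resumption tokens."""
    params = {"verb": "ListRecords", **arguments}
    while True:
        root = ET.fromstring(fetch(params))
        for record in root.iter(OAI + "record"):
            yield record.findtext(f"{OAI}header/{OAI}identifier")
        token = root.findtext(f"{OAI}ListRecords/{OAI}resumptionToken")
        if not token:  # missing or empty token: the list is complete
            return
        # Exclusive argument: send the token with the verb and nothing else.
        params = {"verb": "ListRecords", "resumptionToken": token}

def page(identifiers, token):
    """Build one hypothetical response page (metadata omitted for brevity)."""
    records = "".join(
        f"<record><header><identifier>{i}</identifier>"
        f"<datestamp>2002-12-14</datestamp></header></record>"
        for i in identifiers
    )
    return (f'<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
            f"<ListRecords>{records}"
            f"<resumptionToken>{token}</resumptionToken>"
            f"</ListRecords></OAI-PMH>")

PAGES = {
    None: page(["oai:example.org:1", "oai:example.org:2"], token="page2"),
    "page2": page(["oai:example.org:3"], token=""),
}

def fake_fetch(params):
    """Stand-in for an HTTP GET against a repository."""
    return PAGES[params.get("resumptionToken")]

harvested = list(harvest_identifiers(fake_fetch, metadataPrefix="oai_dc"))
print(harvested)
```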
GetRecord
• Function: retrieve an individual metadata record from a
repository
• Example:
archive.org/oai-script?verb=GetRecord&identifier=oai:HUBerlin.de:3000218&metadataPrefix=oai_dc
• Parameters:
– identifier (required)
– metadataPrefix (required)
• Errors/exceptions:
– badArgument
– cannotDisseminateFormat
– idDoesNotExist
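The errors listed for each verb come back inside the XML response as error elements with a code attribute. A sketch of turning them into Python exceptions follows; OAIError and check_errors are hypothetical names, and the sample response is made up for illustration.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

class OAIError(Exception):
    """Raised when a response carries an OAI-PMH error element."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code

def check_errors(xml_text):
    """Return the parsed response, or raise OAIError on the first error."""
    root = ET.fromstring(xml_text)
    for error in root.findall(OAI + "error"):
        raise OAIError(error.get("code"), (error.text or "").strip())
    return root

# Hypothetical response to a GetRecord request for an unknown identifier.
FAILED = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2006-10-17T01:37:44Z</responseDate>
  <request verb="GetRecord">http://archive.org/oai-script</request>
  <error code="idDoesNotExist">No matching identifier</error>
</OAI-PMH>"""

try:
    check_errors(FAILED)
    error_code = None
except OAIError as exc:
    error_code = exc.code
print(error_code)
```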
Interoperability
• The goal: communication, without human
intervention, between information sources
– Books that “talk to each other”
• Live links for references
• Knowledge of how to find relevant resources
when needed
• Ability to query other information locations
Protocols
• Precise rules for interactions between
independent processes
– Format of the messages
• Both structure and content
– Specified behavior in response to specific
messages
• Many ways to accomplish the same result,
but both sides must have the same
understanding of the rules of engagement.
Protocol Types
• RPC model
– Point to point
– Completely open to definition by developer
• Verbs (methods)
• Nouns (objects, resources)
– Useful to closed community or group who
know about the availability of the resource.
SOAP
• Originally an acronym for Simple Object Access Protocol; the expansion has since been dropped.
• Initially developed as part of the Microsoft
.NET paradigm
– Now a W3C Recommendation (SOAP 1.2)
• Stateless, one-way message exchange
paradigm
• XML encoded
• Flexibility of RPC, but more constrained in the
way communication is formatted.
REST
• REpresentational State Transfer
• An after-the-fact definition of the architecture
of the World Wide Web
• The model is
– Client/server
– Stateless
– Cacheable
– Layered
• Resource interface constrained
– Restricted verbs
– Restricted content types
REST and RPC
• RPC provides flexibility for any type of
interaction between any type of
resources
• REST provides consistency to allow
interaction among resources without
prior discovery of accepted actions and
responses.
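The contrast can be sketched in a few lines. Everything here is hypothetical (the method names, the URIs, and the in-memory store): the point is only that the RPC-style service invents its own verbs, while the REST-style interface applies a small fixed set of verbs to any resource.

```python
# Hypothetical in-memory resource store: URI -> title.
records = {"/records/1": "Metadata harvesting"}

def rpc_call(method, *args):
    """RPC style: the developer defines the verbs, so a client must
    learn this particular service's vocabulary in advance."""
    methods = {
        "lookupRecord": lambda key: records[key],
        "renameRecord": lambda key, title: records.__setitem__(key, title),
    }
    return methods[method](*args)

def rest_call(verb, uri, body=None):
    """REST style: the same restricted verbs work for every resource URI."""
    if verb == "GET":
        return records[uri]
    if verb == "PUT":
        records[uri] = body
        return body
    if verb == "DELETE":
        return records.pop(uri)
    raise ValueError(f"unsupported verb: {verb}")

rest_call("PUT", "/records/2", "Distributed search")
via_rest = rest_call("GET", "/records/2")
via_rpc = rpc_call("lookupRecord", "/records/1")
print(via_rest, via_rpc)
```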
SOAP and REST
• Debate in the Web community about
which is the better paradigm for
application development
• REST -- restricted, but simple extension
of existing Web processes
• SOAP -- added flexibility with cost in
terms of bandwidth, security, complexity
for development
References
• Giving SOAP a REST http://www.devx.com/DevX/Article/8155
• SOAP Version 1.2 Part 0: Primer http://www.w3.org/TR/2003/REC-soap12-part0-20030624/#L1153
• OAI For Beginners - The Open Archives Forum online tutorial:
http://www.oaforum.org/tutorial/index.php
• Z39.50 Resource Page:
http://www.niso.org/standards/resources/Z3950_Resources.html
• Z39.50 An Overview of Development and the Future (1995)
http://www.cqs.washington.edu/~camel/z/z.html