Metadata Harvesting

EuropeanaLocal Knowledge Sharing Workshop
Metadata Harvesting
Julie Verleyen
Scientific Coordinator, Europeana Office
The Hague, 13 & 14 January 2009
Table of Contents
• Harvesting in Europeana: workflow and requirements
• Best-practices
• Recommendations
• Common issues
• Tools / Software
• Resources
• Documentation
Harvesting in Europeana
1. Determine collections to be contributed
• Questionnaire
2. Obtain OAI-PMH repository parameters:
– Absolute minimum (enough for fully implemented, tested and documented OAI repositories)
• Server base URL
– Very useful to have:
• Mapping between described collection(s) and OAI-PMH set(s)
• Prefix of the metadata format to use preferably for Europeana (if not described in the ListMetadataFormats response), e.g. oai_dc, mods, tel, ese
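The parameters above map directly onto OAI-PMH request arguments. The following is a minimal sketch, not the Europeana harvester's own code: the base URL and set name are placeholders, and the helper names build_request and metadata_prefixes are made up for illustration.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def build_request(base_url, verb, **arguments):
    """Return the GET URL for an OAI-PMH request (hypothetical helper)."""
    params = {"verb": verb}
    # Skip arguments that were not supplied, e.g. when no set is configured.
    params.update({k: v for k, v in arguments.items() if v is not None})
    return base_url + "?" + urlencode(params)


def metadata_prefixes(list_metadata_formats_xml):
    """Extract the metadataPrefix values from a ListMetadataFormats response."""
    root = ET.fromstring(list_metadata_formats_xml)
    return [e.text for e in root.findall(".//oai:metadataPrefix", OAI_NS)]


# Placeholder base URL and set name, for illustration only.
url = build_request("http://example.org/oai", "ListRecords",
                    metadataPrefix="oai_dc", set="maps")
print(url)
```

If the provider does not state a preferred prefix, a ListMetadataFormats request built the same way shows which formats the repository offers.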
3. Configuration of harvester
4. Full harvest with ListRecords request
– Records collected in XML files ≤ 10MB
– Harvest stored in SVN
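A full harvest with ListRecords can be sketched as a loop that follows resumptionTokens and then groups the records into files below the 10 MB limit. This is an illustrative sketch under assumptions, not the actual Europeana harvester: fetch is a hypothetical callable supplied by the caller (it takes a dict of request arguments and returns the response body), and the size check counts only the serialized records, not any wrapper element added when writing the file.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
MAX_BYTES = 10 * 1024 * 1024  # the 10 MB ceiling per XML file


def harvest(fetch, metadata_prefix, set_spec=None):
    """Yield serialized <record> elements from a full ListRecords harvest.

    `fetch` is a hypothetical callable: it receives a dict of OAI-PMH
    arguments and returns the response body as bytes or str.
    """
    args = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        args["set"] = set_spec
    while True:
        root = ET.fromstring(fetch(args))
        for record in root.iter(OAI + "record"):
            yield ET.tostring(record, encoding="utf-8")
        token = root.find(".//" + OAI + "resumptionToken")
        if token is None or not (token.text or "").strip():
            break  # no token, or an empty one: the list is complete
        # Follow-up requests carry only the verb and the token.
        args = {"verb": "ListRecords", "resumptionToken": token.text.strip()}


def batch(records, max_bytes=MAX_BYTES):
    """Group serialized records into chunks of at most max_bytes.

    A single record larger than max_bytes still gets a chunk of its own.
    """
    chunk, size = [], 0
    for rec in records:
        if chunk and size + len(rec) > max_bytes:
            yield chunk
            chunk, size = [], 0
        chunk.append(rec)
        size += len(rec)
    if chunk:
        yield chunk
```

Each chunk produced by batch(harvest(...)) can then be written to one XML file and committed to version control.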
Best-practices: implementation
• Compliance with the OAI-PMH 2.0 protocol specification
http://www.openarchives.org/OAI/openarchivesprotocol.html
=> Follow the OAI-PMH v2 implementation guidelines for repository implementers
http://www.openarchives.org/OAI/2.0/guidelines-repository.htm
• Full functional tests!!
Best-practices: OAI validation
OAI validation = your OAI repository correctly implements the OAI-PMH!
=> Correct responses to all OAI-PMH requests: with arguments, under various error conditions, with every OAI response valid against its XML schema, ...
Recommended approach to OAI validation
• Follow the Open Archives Initiative protocol conformance testing
• Validate your server using the validator supplied by the OAI:
http://www.openarchives.org/data/registerasprovider.html#Protocol_Conformance_Testing
• To validate without registering, tick the checkbox "only validate and do not register (you may then register later)." at the bottom of that page
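Before submitting a repository to the OAI validator, a quick local smoke test of the error conditions can be useful. The sketch below is only an illustration and no substitute for the validator: it reads the error elements from a response and checks that a request with an illegal verb is answered with the badVerb error code defined by the protocol. The function names are made up for this example.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def oai_errors(response_xml):
    """Return (code, message) pairs for every <error> element in a response."""
    root = ET.fromstring(response_xml)
    return [(e.get("code"), (e.text or "").strip())
            for e in root.iter(OAI + "error")]


def rejects_illegal_verb(response_xml):
    """True if the response to an illegal verb reports the badVerb error."""
    return any(code == "badVerb" for code, _ in oai_errors(response_xml))
```

The same oai_errors helper can be pointed at responses to other deliberately wrong requests, such as a missing metadataPrefix, to see which error code the repository returns.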
Issues and recommendations: sets
• Set = "an optional construct for grouping items for the purpose of selective harvesting."
A number of obstacles relate to sets:
• Interpreting how a repository has organized
sets and determining which sets to harvest
– Issue: setName is not human-readable and/or no setDescription is provided.
– Issue: Large number of sets to sort through.
• Knowing when there are records that belong to
no sets
– Issue: Items that belong to no sets are included in
the OAI repository.
• Knowing when there are empty sets
– Issue: Data provider exposes sets with no
records.
• Understanding relationships between sets
– Issue: Relationships between sets are not expressed.
• There is a mechanism to express relationships between hierarchical sets (colon-separated setSpec values)
• But no mechanism to express relationships between overlapping sets!
• The only way to know: harvest the identifiers or records, whose headers list the sets each record belongs to
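Working out set membership from the headers can be sketched as follows. This is a minimal illustration that handles a single ListIdentifiers response page; a real run would merge the results of all pages. The function names are invented for the example, and the code takes setSpec values literally, so it does not infer that a record in a child set also belongs to the parent set.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def set_membership(list_identifiers_xml):
    """Map each setSpec to the identifiers listed under it.

    Also returns the identifiers whose header carries no setSpec at all.
    """
    members = defaultdict(set)
    no_set = set()
    root = ET.fromstring(list_identifiers_xml)
    for header in root.iter(OAI + "header"):
        identifier = header.findtext(OAI + "identifier")
        specs = [s.text for s in header.findall(OAI + "setSpec")]
        if not specs:
            no_set.add(identifier)
        for spec in specs:
            members[spec].add(identifier)
    return members, no_set


def overlapping_sets(members):
    """Return pairs of setSpecs that share at least one identifier."""
    specs = sorted(members)
    return [(a, b) for i, a in enumerate(specs) for b in specs[i + 1:]
            if members[a] & members[b]]
```

Comparing the keys of members with the output of a ListSets request also reveals empty sets: any set listed by the repository but absent from members after a full pass has no records.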
• Knowing how many records there are within a
set before harvesting
– Issue: The number of records within a set is not expressed, although it can be given via a completeListSize attribute on a resumptionToken or within the set description.
• Knowing when a set structure has been
substantially changed
– Issue: Changes in a set structure have not been communicated.
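Where a repository does supply completeListSize, a harvester can read it from the first response page. The sketch below assumes nothing beyond the optional resumptionToken attributes defined by the protocol (completeListSize, cursor, expirationDate); the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def list_size_info(response_xml):
    """Read the optional resumptionToken attributes from a list response.

    Returns None if the response carries no resumptionToken element.
    """
    root = ET.fromstring(response_xml)
    token = root.find(".//" + OAI + "resumptionToken")
    if token is None:
        return None
    size = token.get("completeListSize")
    cursor = token.get("cursor")
    return {
        # Both counters are optional, so either may be missing.
        "completeListSize": int(size) if size is not None else None,
        "cursor": int(cursor) if cursor is not None else None,
        "expirationDate": token.get("expirationDate"),
    }
```

The expirationDate value is also worth logging, since a short token lifetime is one of the common causes of interrupted harvests.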
Sets: recommendations
• No single best practice for the organization of
sets.
• Realistically: data providers organize sets in a
way which best meets the needs of their primary
service provider and can be easily done within
their own internal workflows.
• Useful to organize the metadata items into sets
according to the collections of resources they
represent.
– The concept of a collection varies and is not completely clear in Europeana.
– It is useful for the harvester to understand what data providers mean by a collection.
Basic requirements
• Repository implementation following OAI-PMH v2.0 + tested
• Inform the person responsible for harvesting at Europeana of any repository changes / maintenance
• No regular harvesting schedule determined yet
• "SLA" between data providers and harvesters
Common issues
• Unavailability / unreliability of repository
server
• Implementation of OAI-PMH v2 incomplete
– resumptionToken not supported
– Only ListIdentifiers
• XML syntax errors
• Character encoding errors
• Short lifetime of resumptionToken
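XML syntax errors and character encoding errors can be caught on the provider side before a harvester ever sees them. The following is a rough sketch of such a check, assuming the response body is available as raw bytes; it relies on the protocol's requirement that responses be encoded in UTF-8. The function name is made up for this example.

```python
import xml.etree.ElementTree as ET


def check_response(body: bytes):
    """Return a list of problems found in a raw OAI-PMH response body."""
    problems = []
    try:
        body.decode("utf-8")
    except UnicodeDecodeError as exc:
        # Stop here: parsing bytes that are not valid UTF-8 would only
        # report the same problem a second time.
        problems.append("encoding error: %s" % exc)
        return problems
    try:
        ET.fromstring(body)
    except ET.ParseError as exc:
        problems.append("XML syntax error: %s" % exc)
    return problems
```

This only tests well-formedness and encoding; validating each response against its XML schema remains the job of the OAI validator.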
Tools / Software
TEL/Europeana OAI-PMH Harvester – Offline
documentation
– Harvester
– Java standalone application with GUI
– Multiple harvesting jobs
– Resuming unfinished jobs
– Logging
– No scheduling, No configuration interface
Tools / Software
REPOX - http://repox.ist.utl.pt/
• Repository + Harvester
• Java standalone application with web GUI
• Multiple harvesting jobs, Scheduler
• Statistics
• Management of XML metadata repository
– Versioning and identification of records
– Different metadata formats
– User interface to create metadata crosswalks: Schema mapper
Tools / Software
OAIcat from OCLC http://www.oclc.org/research/software/oai/cat.htm
• Framework conforming to the OAI-PMH v2.0
• Repository + Harvesting
• Java web application
• Scheduling, logging
• Limited scalability (~2M records)
Tools / Software (TELplus D2.1)
Other implementations in different languages to plug into a Library Management System:
– PHP: OAIbiblio
Data provider implementation of the OAI-PMH, version 2.0. This toolkit can be easily customized to communicate with an already existing, multi-table MySQL database.
– Perl: Celestial
OAI aggregator/cache application that imports OAI metadata from version 1.0, 1.1 and 2.0 OAI-compliant repositories, and re-exposes that metadata through either an aggregated or per-repository OAI-compliant 2.0 interface. Celestial requires oai-perl v2, MySQL, Perl 5.6.x and a CGI-capable web server.
– Ruby: ruby-oai
Includes a client library, a server/provider library and an interactive harvesting shell.
– Python: pyoai package
Enables high-level access to an OAI-PMH metadata repository and also implements a framework for quickly creating OAI-PMH compliant servers.
Tools / Software
• ESE XML validation schemas developed by
partners
Resources
• The Open Archives Initiative Protocol for
Metadata Harvesting v2.0
http://www.openarchives.org/OAI/openarchivesprotocol.html
• TELplus D2.1, “OAI-PMH implementation and
tools guidelines”, 21 pages
– Protocol overview and description of main
concepts
– OAI-PMH implementation in libraries
– References
Resources
• Wiki "Best Practices for OAI Data Provider Implementations and Shareable Metadata":
Excellent source of guidelines, tutorials, recommendations, implementation software and tools, references, etc.
http://webservices.itcs.umich.edu/mediawiki/oaibp/index.php/Main_Page
Documentation in Europeana context
• Requirements:
– Europeana OAI-PMH Harvesting
– Europeana OAI-PMH Repositories
• ESE XML validation schema
• Europeana OAI-PMH data providers registry &
forum/mailing list
– Local systems
– OAI-PMH repository solution
– Contact
Thank you
Questions? Remarks?...
[email protected]