panFMP - An XML-based
Framework for Metadata Portals
Talk and hands-on seminar at GFZ Potsdam
Uwe Schindler
MARUM – Universität Bremen
PANGAEA® - Publishing Network for Geoscientific &
Environmental Data
[email protected]
Metadata Portals: Search Technology
for distributed Catalogues
• Searching directly on distributed catalogues: In distributed
search infrastructures, every data provider not only has its own
metadata catalogue, but also a corresponding search interface to
the portal (e.g., web service based). Search requests are sent to all
data providers. The portal only needs to collect the search results
from the providers, then rank and display these to the end user.
Examples: NSDI Clearinghouse, GeoMIS.BUND
• Harvesting catalogues into a central searchable catalogue:
Every data provider has its own metadata catalogue but the search
engine is centralized. The portal periodically harvests all metadata
records into a central index and serves search requests from there.
Major web search engines like Google or the FGDC-related
Geospatial One-Stop are based on this concept. The response time
is optimal because only local components are used in the search
process.
Metadata Portals: Harvesting solutions
from PANGAEA®
• WDC-MARE with its information system PANGAEA®
currently provides data portals for several
EU/international projects:
• Not all data are stored centrally, so the datasets
offered in the portals must be consolidated from different
sources!
• Features:
– Data stays at the data providers
– Metadata is harvested by the portal
– Search queries are handled by the centralized catalogue
(Google-like search speed!)
– Scientist gets link to data at the provider
Metadata Harvesting Solutions
• Web Accessible Folder (WAF): Simple
harvesting by recursively collecting XML
files from a web server's directory listing –
simple, but inefficient
• Open Archives Initiative Protocol for Metadata
Harvesting (OAI-PMH)
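The Web Accessible Folder approach from the first bullet can be sketched as follows. This is a minimal illustration, not panFMP code: `LinkCollector`, `crawl_waf`, the `fetch` callable, and the example.org URLs are assumptions made for the example. The harvester has to walk every directory listing on every run, which is why the slide calls the approach simple but inefficient.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a directory listing page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl_waf(base_url, fetch):
    """Recursively collect the URLs of all XML files below base_url.

    fetch is a hypothetical callable mapping a URL to the HTML text of
    its directory listing (for example a thin wrapper around urllib).
    """
    xml_urls, todo, seen = [], [base_url], set()
    while todo:
        url = todo.pop()
        if url in seen:
            continue
        seen.add(url)
        parser = LinkCollector()
        parser.feed(fetch(url))
        for href in parser.links:
            target = urljoin(url, href)
            if not target.startswith(base_url):
                continue  # ignore links leading out of the folder, e.g. "../"
            if target.endswith("/"):
                todo.append(target)  # subdirectory: visit it later
            elif target.lower().endswith(".xml"):
                xml_urls.append(target)
    return sorted(xml_urls)


# Fake listings so the sketch runs without network access.
PAGES = {
    "http://example.org/waf/": '<a href="../">Parent</a><a href="a.xml">a.xml</a>'
                               '<a href="sub/">sub/</a><a href="readme.txt">readme</a>',
    "http://example.org/waf/sub/": '<a href="../">Parent</a><a href="b.xml">b.xml</a>',
}
found = crawl_waf("http://example.org/waf/", PAGES.__getitem__)
```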
Open Archives Protocol
• The Open Archives Initiative Protocol for Metadata
Harvesting (OAI-PMH) is a protocol developed by
the Open Archives Initiative.
• Almost all digital libraries support it (most famous ones:
Fedora Commons, arXiv and the CERN Document
Server; GeoNetwork Opensource)
• Portals by Scientific Commons, OAIster, SUB
•
uses it during web crawling (if available)
• Very simple to implement (XML over HTTP-REST)
• Repository software for databases or file system
metadata providers is widely available (e.g. DLESE jOAI
software on the data provider side)
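To illustrate the "XML over HTTP" point, here is a hedged sketch of the client side of an OAI-PMH `ListRecords` exchange. The verb, the `metadataPrefix` and `resumptionToken` arguments, and the response namespace come from the OAI-PMH 2.0 specification; the function names, the example.org base URL, and the sample response are assumptions for this example only.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL.

    A resumptionToken is an exclusive argument: when continuing a
    partial list, it is sent instead of metadataPrefix.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None)."""
    root = ET.fromstring(xml_text)
    identifiers = [
        el.text
        for el in root.findall(".//oai:record/oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return identifiers, token


# Shortened, made-up response used instead of a live repository.
SAMPLE_RESPONSE = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
    "<record><header><identifier>oai:example.org:1</identifier></header></record>"
    "<record><header><identifier>oai:example.org:2</identifier></header></record>"
    "<resumptionToken>page2</resumptionToken>"
    "</ListRecords></OAI-PMH>"
)
first_url = list_records_url("http://example.org/oai")
identifiers, token = parse_list_records(SAMPLE_RESPONSE)
next_url = list_records_url("http://example.org/oai", resumption_token=token)
```

A harvester would repeat the request with each returned token until the response no longer carries one.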
Current OAI-PMH software
1. Limited to Dublin Core metadata (libraries)!
2. Limited full text search functionality due to
relational databases in the background!
3. No geographic retrievals (because of Dublin
Core limitation)!
4. The end-user interface is built into the
software, which limits its use
inside content management systems (CMS).
Central indexing requirements
1. Open for any XML metadata format
2. Any mappings to document fields should be done by
XPath/XSLT
3. Possibility to map incompatible XML schemas during
harvesting by XSLT on-the-fly
4. On-the-fly validation of (possibly already transformed)
documents during harvesting
5. No relational database, only a full text search engine,
that contains everything needed for operation
6. Range queries on specific fields (date/time or numeric)
7. Web service interface / programming API for the end user
interface that is accessible from any language (Java/JSP,
PHP, Perl,...)
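Requirement 2 (mapping document fields by XPath) can be illustrated with a small sketch. It uses the limited XPath subset of Python's standard library rather than panFMP's own configuration format, which is not shown in this talk. The field table, the function name, and the sample record are assumptions for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical field definitions: field name -> XPath expression.
FIELD_XPATHS = {
    "title": "./title",
    "latitude": "./latitude",
    "longitude": "./longitude",
}
# Fields whose values are parsed as numbers so they can serve range queries.
NUMERIC_FIELDS = {"latitude", "longitude"}


def extract_fields(xml_text, field_xpaths=FIELD_XPATHS):
    """Apply each XPath to the document and collect the matched values."""
    root = ET.fromstring(xml_text)
    fields = {}
    for name, xpath in field_xpaths.items():
        values = [el.text.strip() for el in root.findall(xpath) if el.text]
        if name in NUMERIC_FIELDS:
            values = [float(v) for v in values]
        fields[name] = values
    return fields


SAMPLE_RECORD = (
    "<metadata>"
    "<title>Carbon and oxygen isotope ratios</title>"
    "<latitude>74.1</latitude>"
    "<longitude>11.0</longitude>"
    "</metadata>"
)
fields = extract_fields(SAMPLE_RECORD)
```

Because the mapping is plain data, supporting another metadata format in this sketch only means supplying a different table of XPath expressions.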
Apache Lucene
• Ranked searching - best results returned first
• Many powerful query types: phrase queries, wildcard
queries, proximity queries, range queries for date/time
values and numbers
• Fielded searching. All fields are searchable as a whole,
each field separately (e.g. for author, parameter), or
mixed.
• Any combination of boolean operators between search
terms (AND, OR, NOT, exact phrase)
• Sorting by any field
• Multiple-index searching with merged results
• Simultaneous searching and updates due to high-performance indexing
Structure of a Lucene Index
Inverted index (terms):

Field      Text token     Document IDs
title      benthic        3
title      carbon         1, 2, 3
title      composition    2
title      foraminifera   3
title      isotope        1, 2
title      oxygen         1, 2, 3
title      ratios         1
title      stable         2
latitude   63.9           2
latitude   74.1           1, 3
longitude  11.0           1, 2
longitude  12.3           3

Documents (ID and stored document contents):

ID 1:
<metadata>
<title>Carbon and oxygen isotope ratios</title>
<latitude>74.1</latitude>
<longitude>11.0</longitude>
</metadata>

ID 2:
<metadata>
<title>Stable oxygen and carbon isotope composition</title>
<latitude>63.9</latitude>
<longitude>11.0</longitude>
</metadata>

ID 3:
<metadata>
<title>Carbon and oxygen in benthic foraminifera</title>
<latitude>74.1</latitude>
<longitude>12.3</longitude>
</metadata>
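The index structure on this slide can be reproduced with a few lines of Python. This is a sketch of the general inverted-index idea, not of Lucene's internal data structures. Lowercasing, the tokenizer, and the stop words "and" and "in" are assumptions inferred from the slide's term list, which omits those two words.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

DOCS = {
    1: "<metadata><title>Carbon and oxygen isotope ratios</title>"
       "<latitude>74.1</latitude><longitude>11.0</longitude></metadata>",
    2: "<metadata><title>Stable oxygen and carbon isotope composition</title>"
       "<latitude>63.9</latitude><longitude>11.0</longitude></metadata>",
    3: "<metadata><title>Carbon and oxygen in benthic foraminifera</title>"
       "<latitude>74.1</latitude><longitude>12.3</longitude></metadata>",
}
STOP_WORDS = {"and", "in"}  # assumption: these never appear as terms on the slide


def build_inverted_index(docs):
    """Map each (field, token) term to the sorted list of document IDs."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        root = ET.fromstring(docs[doc_id])
        for element in root:
            if element.tag == "title":
                words = re.findall(r"[a-z]+", element.text.lower())
                tokens = {w for w in words if w not in STOP_WORDS}
            else:
                tokens = {element.text.strip()}  # numeric fields: one term per value
            for token in tokens:
                index[(element.tag, token)].append(doc_id)
    return dict(index)


def search_all(index, *terms):
    """Boolean AND: IDs of the documents that contain every given term."""
    postings = [set(index.get(term, [])) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


index = build_inverted_index(DOCS)
hits = search_all(index, ("title", "isotope"), ("latitude", "74.1"))
```

A query only touches the posting lists of its terms, so the stored XML is needed only to display the hits.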
panFMP – PANGAEA® Framework
for Metadata Portals
panFMP is a generic and flexible framework for building geoscientific metadata
portals independent of content standards for metadata and protocols. Data
providers can be harvested with commonly used protocols (e.g., Open
Archives Initiative Protocol for Metadata Harvesting) and metadata
standards like Dublin Core, DIF, or ISO 19115. The new Java-based portal
software supports any XML encoding and makes metadata searchable
through Apache Lucene. Software administrators are free to define
searchable fields independent of their type using XPath and/or XSL
Templates. In addition, by extending the full-text search engine (FTS)
Apache Lucene, we have significantly improved queries for numerical and
date/time ranges by supplying a new trie-based algorithm, thus enabling
high-performance space/time retrievals in FTS-based geo portals. The
harvested metadata are stored in separate indexes, which makes it possible
to combine these into different portals. The portal-specific Java API and web
service interface are highly flexible: they support custom front-ends for users,
automatic query completion (AJAX), and dynamic visualization
with conventional mapping tools.
panFMP – Components of a
metadata portal
panFMP - Harvesting
[Figure: harvesting data flow. Components: Data Provider (two instances), OAI-PMH Harvester, Directory Harvester (reading from the File System), Index Builder, Lucene Index, Virtual Index, Search Interface. Harvested documents are passed as DOM trees through a central buffer. Index Builder steps: accept document as DOM tree, transform by XSL, validate against schema, apply XPath (producing the fields), serialize DOM (producing the XML blob), add document to index.]
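The processing steps named in the harvesting diagram can be sketched as a small pipeline. This is an illustration under several assumptions, not panFMP's implementation: Python callables stand in for the XSL transformation and the schema validation (the standard library offers neither), a `queue.Queue` plays the role of the central buffer, and a plain dictionary replaces the Lucene index. All names are hypothetical.

```python
import queue
import xml.etree.ElementTree as ET

FIELD_XPATHS = {"title": "./title", "latitude": "./latitude"}


def harvest(raw_records, buffer):
    """Harvester side: parse each record into a DOM tree and buffer it."""
    for identifier, xml_text in raw_records:
        buffer.put((identifier, ET.fromstring(xml_text)))
    buffer.put(None)  # sentinel: nothing more to harvest


def rename_name_to_title(root):
    """Stand-in for an XSL transformation that maps a foreign schema."""
    for element in root.iter("name"):
        element.tag = "title"
    return root


def has_title(root):
    """Stand-in for schema validation."""
    return root.tag == "metadata" and root.find("./title") is not None


def build_index(buffer, transform=None, validate=None, field_xpaths=FIELD_XPATHS):
    """Index builder side: transform, validate, extract fields, store."""
    index = {}
    while True:
        item = buffer.get()
        if item is None:
            break
        identifier, root = item
        if transform is not None:
            root = transform(root)
        if validate is not None and not validate(root):
            continue  # invalid documents are skipped, not indexed
        fields = {
            name: [el.text for el in root.findall(xpath)]
            for name, xpath in field_xpaths.items()
        }
        blob = ET.tostring(root, encoding="unicode")  # serialized DOM
        index[identifier] = {"fields": fields, "xml": blob}
    return index


RAW_RECORDS = [
    ("rec-1", "<metadata><name>Carbon and oxygen isotope ratios</name>"
              "<latitude>74.1</latitude></metadata>"),
    ("rec-2", "<other><latitude>63.9</latitude></other>"),
]
buffer = queue.Queue()
harvest(RAW_RECORDS, buffer)
index = build_index(buffer, transform=rename_name_to_title, validate=has_title)
```

In this run the first record is transformed and indexed, while the second fails the validation stand-in and is dropped.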
panFMP - Search Interface
[Figure: diagram with the components Search Interface, Virtual Index, Lucene Index (two instances), and OAI-PMH Harvester]
• Supports all standard Lucene search
features
• Additional support for fast range queries
to enable bounding boxes, etc.:
– implemented by redundant storage of
“numerical terms” in different precisions
– recursive reduction of distinct terms (every
numerical value is a term) on range query
– search time no longer dependent on index
size
• Accessible via Java API or AXIS web
service
panFMP – Range Queries
[Figure: trie of indexed terms in three precisions]
4
  42: 421, 423
  44: 445, 446, 448
5
  52: 521, 522
6
  63: 632, 633, 634
  64: 641, 642, 644
Example of trie-based recursive splitting of a range query with three precisions
(simplified for demonstration): the user wants to find all records with terms between
"423" and "642". Instead of selecting all terms in the lowermost row, the query is optimized
to match terms of lower precision wherever possible. It is enough
to select the term "5" to match all records starting with "5" ("521", "522"), or "44" for
"445", "446", "448". The query is therefore simplified to match all records containing
the terms "423", "44", "5", "63", "641", or "642".
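The example above can be traced in code. The sketch below follows the slide's simplified decimal-string model (three-digit values, one prefix term per precision) and is not the actual panFMP or Lucene implementation, which this talk does not show. The function names are assumptions.

```python
VALUES = ["421", "423", "445", "446", "448", "521", "522",
          "632", "633", "634", "641", "642", "644"]


def index_terms(value, precisions=(1, 2, 3)):
    """Terms stored for one value: its prefix at every precision."""
    return [value[:p] for p in precisions]


def build_term_set(values):
    """All terms present in the index (redundant storage per precision)."""
    terms = set()
    for value in values:
        terms.update(index_terms(value))
    return terms


def split_range(terms, lo, hi, width=3):
    """Select the fewest indexed terms that cover exactly [lo, hi]."""
    selected = []

    def visit(prefix):
        low = prefix.ljust(width, "0")   # smallest value below this prefix
        high = prefix.ljust(width, "9")  # largest value below this prefix
        if high < lo or low > hi:
            return  # subtree lies completely outside the range
        if lo <= low and high <= hi:
            selected.append(prefix)  # subtree lies completely inside
            return
        # Partial overlap: descend to the next higher precision.
        children = sorted(
            t for t in terms
            if len(t) == len(prefix) + 1 and t.startswith(prefix)
        )
        for child in children:
            visit(child)

    for top in sorted(t for t in terms if len(t) == 1):
        visit(top)
    return selected


query_terms = split_range(build_term_set(VALUES), "423", "642")
print(query_terms)  # ['423', '44', '5', '63', '641', '642']
```

The query matches 6 terms instead of the 10 full-precision terms inside the range, and the number of selected terms is bounded by the number of precisions rather than by the number of distinct values in the index.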
Examples
• http://sedis.iodp.org
• http://www.c3grid.de/portal
• http://www.world-data-centers.org/
• http://dataportal.carboocean.org
• http://pages-dataportal.unibe.ch/cgibin/WebObjects/dataportal
• Currently not available: http://data.planktonnet.eu
Thank You!
The software is available as open source on
SourceForge.net!
http://www.panFMP.org
http://sourceforge.net/projects/panfmp