The PLAZI Markup System

Donat Agosti, Terry Catapano, Robert “Bob” Morris, Guido Sautter

Universität Karlsruhe (TH) Research University – founded 1825

The PLAZI Markup System

[Architecture diagram]
• GoldenGATE Document Editor – Document markup, external referencing
• PLAZI Server – XML & PDF storage, treatment server
• PLAZI Search Portal – Search portal, TAPIR provider, RSS feed
• External Data Sources – Taxonomic data sources & web services
• Arrow labels between the components:
  – Taxon LSIDs, GeoData
  – New Taxon Names
  – Marked-Up Documents
  – Queries
  – Treatments, Detail Data, PDF Document Handles
  – Links, Materials Citations

The PLAZI Server

• GoldenGATE Search & Retrieval Server (SRS)
  – Extracts individual treatments from XML documents
  – Stores and indexes treatments
  – Based on independent, pluggable Indexers
    • Taxonomic names (TN)
    • Materials citations (MC)
    • Document meta data (MD)
    • Full text (FT)
  – Serves treatments or indexed details
• DSpace
  – Stores PDF and XML documents
  – Issues Handles for documents

[Diagram labels: Web Service, SRS, indexers TN / MC / MD / FT, Index Data, File System, PostgreSQL]
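The pluggable-indexer idea above can be sketched in a few lines. This is only an illustration in Python, not the SRS code, which the slides do not show. The names `Indexer`, `TaxonNameIndexer`, `FullTextIndexer` and `TreatmentStore`, the `treatment` and `taxonomicName` element names, and the second species name in the sample are all assumptions made for this example.

```python
import re
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod


class Indexer(ABC):
    """Hypothetical plug-in interface: one indexer per kind of detail."""

    name = "?"

    @abstractmethod
    def extract(self, treatment):
        """Return the index keys found in one treatment element."""


class TaxonNameIndexer(Indexer):
    name = "TN"

    def extract(self, treatment):
        # Assumption: taxon names are marked up as <taxonomicName> elements.
        return ["".join(e.itertext()).strip()
                for e in treatment.iter("taxonomicName")]


class FullTextIndexer(Indexer):
    name = "FT"

    def extract(self, treatment):
        text = "".join(treatment.itertext()).lower()
        return sorted(set(re.findall(r"[a-z]+", text)))


class TreatmentStore:
    """Splits documents into treatments and feeds each one to every indexer."""

    def __init__(self, indexers):
        self.indexers = list(indexers)
        self.treatments = {}
        self.index = {ix.name: {} for ix in self.indexers}

    def add_document(self, doc_id, xml_text):
        root = ET.fromstring(xml_text)
        for n, treatment in enumerate(root.iter("treatment")):
            tid = f"{doc_id}#{n}"
            self.treatments[tid] = treatment
            for ix in self.indexers:
                for key in ix.extract(treatment):
                    self.index[ix.name].setdefault(key, set()).add(tid)

    def search(self, index_name, key):
        return sorted(self.index[index_name].get(key, set()))


# Sample document; the second species name is made up for illustration.
DOC = """<document>
  <treatment><taxonomicName>Probolomyrmex tani</taxonomicName> Worker small.</treatment>
  <treatment><taxonomicName>Hypotheticus exampli</taxonomicName> Worker larger.</treatment>
</document>"""

store = TreatmentStore([TaxonNameIndexer(), FullTextIndexer()])
store.add_document("doc1", DOC)
print(store.search("TN", "Probolomyrmex tani"))
print(store.search("FT", "worker"))
```

Adding a further index (for example materials citations) would then mean writing one more `Indexer` subclass and passing it to the store, without touching the extraction loop.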

The PLAZI Search Portal

• Series of Java Servlets running in Apache Tomcat
• Front-end for SRS Web Service
• Linker plug-ins create hyperlinks to other web sites
• HTML based search portal for humans
  – Search treatments & index data
  – Links submitting new search queries
  – Links to external data sources (e.g. HNS, GoogleMaps)
  – Links to PDF document & XML versions of treatments
• XML document access in various XML schemas
• TAPIR provider
  – Taxonomic names
  – Materials citations
• RSS feed for new treatments
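A linker plug-in, as described above, turns a piece of index data into a hyperlink. The sketch below shows one way this could look, written in Python rather than as a Java Servlet for brevity. The class names, the record keys, the portal URL `http://portal.example.org/search`, the `taxonName` query parameter and the sample coordinates are all hypothetical. The map link uses the public Google Maps search URL format, which is an assumption about how such a link might be built, not a statement about the portal's actual links.

```python
from html import escape
from urllib.parse import urlencode


class MapLinker:
    """Links a record that carries coordinates to a map view."""

    def links(self, record):
        if "latitude" in record and "longitude" in record:
            point = f"{record['latitude']},{record['longitude']}"
            query = urlencode({"api": 1, "query": point})
            yield ("Map", "https://www.google.com/maps/search/?" + query)


class SearchLinker:
    """Links a taxon name back to the portal as a new search query."""

    def __init__(self, portal_url):
        self.portal_url = portal_url  # hypothetical portal address

    def links(self, record):
        if "taxonName" in record:
            query = urlencode({"taxonName": record["taxonName"]})
            yield ("More treatments", self.portal_url + "?" + query)


def render_links(record, linkers):
    """Ask every linker for links and render them as HTML anchors."""
    parts = []
    for linker in linkers:
        for label, url in linker.links(record):
            parts.append(f'<a href="{escape(url, quote=True)}">{escape(label)}</a>')
    return " | ".join(parts)


# Made-up record for illustration; the coordinates are not from the slides.
record = {"taxonName": "Probolomyrmex tani", "latitude": 47.5, "longitude": 8.5}
linkers = [MapLinker(), SearchLinker("http://portal.example.org/search")]
html_out = render_links(record, linkers)
print(html_out)
```

Because each linker only yields links for records it understands, new link targets can be added without changing the rendering code.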

The PLAZI Search Portal

[Screenshot of the search portal; example taxon: Probolomyrmex tani]

The GoldenGATE Editor

• Java-based editor for semi-automated document markup
• Extensible through plug-in mechanism
• Independent of specific XML schema
• Element-level XML editing (XML syntax is generated)
• Flexible display for clear view on all detail levels
• Existing plug-ins provide broad spectrum of functionality:
  – NLP-based markup generation
    • Regular expressions, gazetteers, GATE JAPE
    • Homegrown and third-party NLP components
    • Import of data from external sources (e.g. LSIDs)
  – Specialized document views for correcting NLP results
  – Markup transformation & filtering
  – IO components for different data formats & storage locations (e.g. for uploading XML documents to PLAZI server)
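To make "regular expressions, gazetteers" concrete, here is a minimal sketch of markup generation that combines the two: a regular expression proposes candidate binomials and a gazetteer of known genus names filters them. It is an illustration in Python, not GoldenGATE's implementation. The pattern, the one-entry gazetteer, the function name `mark_up` and the `taxonomicName` element name are assumptions made for this example.

```python
import re
from xml.sax.saxutils import escape

# Assumption: a tiny gazetteer of known genus names.
GENUS_GAZETTEER = {"Probolomyrmex"}

# Candidate binomial: capitalised word followed by a lowercase word.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{2,})\b")


def mark_up(text, element="taxonomicName"):
    """Wrap gazetteer-confirmed binomials in an XML element."""
    out = []
    pos = 0
    for m in BINOMIAL.finditer(text):
        if m.group(1) not in GENUS_GAZETTEER:
            continue  # the regex matched, but the genus is unknown
        out.append(escape(text[pos:m.start()]))
        out.append(f"<{element}>{escape(m.group(0))}</{element}>")
        pos = m.end()
    out.append(escape(text[pos:]))
    return "".join(out)


marked = mark_up("The worker of Probolomyrmex tani is small & rare.")
print(marked)
```

The element name is a parameter, which mirrors the slide's point that the editor is independent of a specific XML schema. In a semi-automated workflow, a human would then review and correct what such a step produces.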

The External Data Sources

• Hymenoptera Name Server (HNS)
  – Retrieve LSIDs for taxon names
  – Enter new taxon names in HNS database
• Further LSID sources: ZooBank, Index Fungorum
• GBIF pulls materials citations via TAPIR
• EOL pulls treatments via TAPIR (to start soon)
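For readers unfamiliar with LSIDs (Life Science Identifiers): they are URNs of the form `urn:lsid:authority:namespace:object`, optionally followed by `:revision`. The sketch below parses and checks that structure in Python. The function name `parse_lsid` and the sample identifier on `example.org` are made up for illustration; it does not show how HNS, ZooBank or Index Fungorum are actually queried.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(value):
    """Split an LSID string into its parts, or raise ValueError."""
    parts = value.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {value!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {value!r}")
    if any(p == "" for p in parts[2:]):
        raise ValueError(f"empty LSID component: {value!r}")
    return LSID(*parts[2:])


# Made-up identifier for illustration only.
parsed = parse_lsid("urn:lsid:example.org:taxname:12345")
print(parsed.authority, parsed.namespace, parsed.object_id)
```

A check like this could run before an identifier is attached to a taxon name in the markup, so that malformed values are caught early.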

Outlook

• Tighter integration of GoldenGATE editor with server
  – Load plug-ins from server → Easier update distribution
  – Upload documents directly after OCR
  – Host documents at server throughout markup
    → Users can share markup work (experts do LSIDs, etc.)
    → Treatments available in search portal as soon as marked up
  – Auto-distribute documents to different storage locations
  – Run automated markup generation on server side
  – Get corrections from community via online feedback forms
• Other extensions of GoldenGATE editor
  – Simplified, more flexible plug-in architecture
  – Extensible user interface

Thank you! Questions?

Donat Agosti, Terry Catapano, Robert “Bob” Morris, Guido Sautter

PLAZI homepage: http://plazi.org
PLAZI search portal: http://plazi.org:8080/GgSRS
GoldenGATE homepage: http://idaho.ipd.uka.de/GoldenGATE

The GoldenGATE Editor V3

[Screenshot annotations]
• Plug-in GUI extensions (hideable)
• Simplified, more flexible architecture
• Document navigator for finding stuff more quickly
• Pre-OCR page images for correcting OCR errors