The PLAZI Markup System
Donat Agosti, Terry Catapano, Robert “Bob” Morris, Guido Sautter
Universität Karlsruhe (TH) Research University – founded 1825
The PLAZI Markup System
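Components:
• GoldenGATE Document Editor: document markup, external referencing
• PLAZI Server: XML & PDF storage, treatment server
• PLAZI Search Portal: search portal, TAPIR provider, RSS feed
• External Data Sources: taxonomic data sources & web services
Data exchanged between components:
• GoldenGATE Document Editor and External Data Sources: Taxon LSIDs, GeoData; New Taxon Names
• GoldenGATE Document Editor and PLAZI Server: Marked-Up Documents
• PLAZI Search Portal and PLAZI Server: Queries; Treatments, Detail Data, PDF Document Handles
• PLAZI Search Portal and External Data Sources: Links, Materials Citations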
The PLAZI Server
• GoldenGATE Search & Retrieval Server (SRS)
  – Extracts individual treatments from XML documents
  – Stores and indexes treatments
  – Based on independent, pluggable indexers
    • Taxonomic names (TN)
    • Materials citations (MC)
    • Document metadata (MD)
    • Full text (FT)
  – Serves treatments or indexed details
• DSpace
  – Stores PDF and XML documents
  – Issues Handles for documents
[Diagram labels: Web Service, SRS with indexers FT, MD, MC, TN, Index Data, File System, PostgreSQL]
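The idea of independent, pluggable indexers can be illustrated with a minimal sketch. It is written in Python for brevity and is not the SRS code; the class names, the `taxonomicName` element, and the in-memory index are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Treatment:
    """One treatment extracted from a marked-up document (hypothetical shape)."""
    treatment_id: str
    xml: str


class Indexer:
    """Hypothetical plug-in interface: each indexer decides what it extracts."""
    name = "base"

    def extract(self, treatment):
        raise NotImplementedError


class TaxonNameIndexer(Indexer):
    name = "taxonomicName"

    def extract(self, treatment):
        # Assumption: taxon names are wrapped in <taxonomicName> elements.
        return re.findall(r"<taxonomicName>(.*?)</taxonomicName>", treatment.xml)


class FullTextIndexer(Indexer):
    name = "fullText"

    def extract(self, treatment):
        # Strip tags, lower-case the text, and split it into words.
        text = re.sub(r"<[^>]+>", " ", treatment.xml)
        return text.lower().split()


class TreatmentStore:
    """Stores treatments and feeds each one to every registered indexer."""

    def __init__(self, indexers):
        self.indexers = list(indexers)
        self.treatments = {}
        # index name -> term -> set of treatment ids
        self.index = {ix.name: {} for ix in self.indexers}

    def add(self, treatment):
        self.treatments[treatment.treatment_id] = treatment
        for ix in self.indexers:
            for term in ix.extract(treatment):
                self.index[ix.name].setdefault(term, set()).add(treatment.treatment_id)

    def search(self, index_name, term):
        return sorted(self.index[index_name].get(term, set()))


store = TreatmentStore([TaxonNameIndexer(), FullTextIndexer()])
store.add(Treatment(
    "t1",
    "<treatment><taxonomicName>Probolomyrmex tani</taxonomicName> worker described</treatment>",
))
print(store.search("taxonomicName", "Probolomyrmex tani"))
```

Because the store only talks to the `extract` method, adding an indexer for materials citations or document metadata would not require changes to the storage code, which is the property the slide attributes to the pluggable design.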
The PLAZI Search Portal
• Series of Java Servlets running in Apache Tomcat
• Front-end for SRS Web Service
• Linker plug-ins create hyperlinks to other web sites
• HTML based search portal for humans
  – Search treatments & index data
  – Links submitting new search queries
  – Links to external data sources (e.g. HNS, GoogleMaps)
  – Links to PDF document & XML versions of treatments
• XML document access in various XML schemas
• TAPIR provider
  – Taxonomic names
  – Materials citations
• RSS feed for new treatments
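A rough sketch of how linker plug-ins might turn indexed details into hyperlinks follows. The portal itself is built from Java Servlets; this Python version only illustrates the pattern. The `Linker` interface, the detail keys (`latitude`, `longitude`, `taxonName`), and the portal URL in the usage lines are hypothetical. The map link uses the documented Google Maps search URL form.

```python
from html import escape
from urllib.parse import urlencode


class Linker:
    """Hypothetical plug-in interface: return an HTML link for a detail, or None."""

    def link(self, detail):
        raise NotImplementedError


class MapLinker(Linker):
    """Links a materials citation with coordinates to a map view."""

    def link(self, detail):
        if "latitude" not in detail or "longitude" not in detail:
            return None
        query = urlencode({"api": 1, "query": f"{detail['latitude']},{detail['longitude']}"})
        url = "https://www.google.com/maps/search/?" + query
        return f'<a href="{escape(url)}">Map</a>'


class SearchLinker(Linker):
    """Links a taxon name to a new search query on the portal itself."""

    def __init__(self, portal_url):
        self.portal_url = portal_url  # assumed base URL of the search page

    def link(self, detail):
        if "taxonName" not in detail:
            return None
        url = self.portal_url + "?" + urlencode({"taxonName": detail["taxonName"]})
        return f'<a href="{escape(url)}">{escape(detail["taxonName"])}</a>'


def render_links(detail, linkers):
    """Ask every linker for a link and keep the ones that apply."""
    links = []
    for linker in linkers:
        html = linker.link(detail)
        if html is not None:
            links.append(html)
    return links


detail = {"taxonName": "Probolomyrmex tani", "latitude": 1.5, "longitude": 103.8}
for html in render_links(detail, [MapLinker(), SearchLinker("http://example.org/search")]):
    print(html)
```

Each linker decides on its own whether it applies to a given detail, so linkers for further external sites could be added without touching the page-rendering code.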
[Screenshot: The PLAZI Search Portal, Probolomyrmex tani]
The GoldenGATE Editor
• Java-based editor for semi-automated document markup
• Extensible through plug-in mechanism
• Independent of specific XML schema
• Element-level XML editing (XML syntax is generated)
• Flexible display for clear view on all detail levels
• Existing plug-ins provide broad spectrum of functionality:
  – NLP-based markup generation
    • Regular expressions, gazetteers, GATE JAPE
    • Homegrown and third-party NLP components
    • Import of data from external sources (e.g. LSIDs)
  – Specialized document views for correcting NLP results
  – Markup transformation & filtering
  – IO components for different data formats & storage locations (e.g. for uploading XML documents to PLAZI server)
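To make "regular expressions plus gazetteers" concrete, here is a small Python sketch of one way such markup generation could work. It is not GoldenGATE code: the binomial pattern, the genus gazetteer, and the `taxonomicName` element name are assumptions chosen for illustration, and a real component would need to handle abbreviations, authorities, and subspecies.

```python
import re
from xml.sax.saxutils import escape

# Candidate binomials: a capitalized genus followed by a lower-case epithet.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{2,})\b")


def tag_taxon_names(text, genus_gazetteer, element="taxonomicName"):
    """Wrap binomials whose genus is in the gazetteer in an XML element.

    The regular expression proposes candidates; the gazetteer filters out
    false positives such as "The worker". The XML syntax is generated here,
    so the surrounding text is escaped as well.
    """
    out = []
    pos = 0
    for match in BINOMIAL.finditer(text):
        if match.group(1) not in genus_gazetteer:
            continue
        out.append(escape(text[pos:match.start()]))
        out.append(f"<{element}>{escape(match.group(0))}</{element}>")
        pos = match.end()
    out.append(escape(text[pos:]))
    return "".join(out)


sentence = "The worker of Probolomyrmex tani differs from other species."
print(tag_taxon_names(sentence, {"Probolomyrmex"}))
```

The output of such a step is only a proposal, which is why the editor pairs automated generation with specialized views for correcting NLP results.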
[Figure: The GoldenGATE Editor]
The External Data Sources
• Hymenoptera Name Server (HNS)
  – Retrieve LSIDs for taxon names
  – Enter new taxon names in HNS database
• Further LSID sources: ZooBank, Index Fungorum
• GBIF pulls materials citations via TAPIR
• EOL pulls treatments via TAPIR (to start soon)
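An LSID is a URN of the form `urn:lsid:<authority>:<namespace>:<object>[:<revision>]`. The Python sketch below parses that form and attaches an LSID to a taxon name. The dictionary stands in for a name server such as HNS and is purely hypothetical, as is the `example.org` identifier; no real web service call is shown.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(urn):
    """Split an LSID URN into its parts, raising ValueError if malformed."""
    parts = urn.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {urn!r}")
    # The "urn:lsid" prefix is matched case-insensitively.
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {urn!r}")
    if any(not part for part in parts[2:]):
        raise ValueError(f"empty component in LSID: {urn!r}")
    return Lsid(*parts[2:])


def attach_lsid(taxon_name, name_server):
    """Look up a name in a stand-in name server (a plain dict here)."""
    urn = name_server.get(taxon_name)
    return taxon_name, (parse_lsid(urn) if urn else None)


# Hypothetical lookup table; a real editor plug-in would query a remote service.
fake_name_server = {"Probolomyrmex tani": "urn:lsid:example.org:names:12345"}
print(attach_lsid("Probolomyrmex tani", fake_name_server))
```

A name that the stand-in server does not know comes back with `None`, which corresponds to the case on the slide where a new taxon name would first have to be entered into the name server's database.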
Outlook
• Tighter integration of GoldenGATE editor with server
  – Load plug-ins from server
    • Easier update distribution
  – Upload documents directly after OCR
  – Host documents at server throughout markup
    • Users can share markup work (experts do LSIDs, etc.)
    • Treatments available in search portal as soon as marked up
  – Auto-distribute documents to different storage locations
  – Run automated markup generation on server side
  – Get corrections from community via online feedback forms
• Other extensions of GoldenGATE editor
  – Simplified, more flexible plug-in architecture
  – Extensible user interface
Thank you! Questions?
Donat Agosti, Terry Catapano, Robert “Bob” Morris, Guido Sautter [email protected]
PLAZI homepage: http://plazi.org
PLAZI search portal: http://plazi.org:8080/GgSRS
GoldenGATE homepage: http://idaho.ipd.uka.de/GoldenGATE
The GoldenGATE Editor V3
• Plug-in GUI extensions (hideable)
• Simplified, more flexible architecture
• Document navigator for finding stuff more quickly
• Pre-OCR page images for correcting OCR errors