Transcript Language Resources Portal - Digital Humanities Austria
Digital Editions & Language Resources Portal
Workshop "Save the data", 2 December 2014, Vienna
Matej Ďurčo, ICLTT / ACDH, ÖAW
What kind of data?
TEXT
http://www
What kind of data?
Dictionaries
• Persian – English Dictionary
• German – Russian Dictionary
• Dictionary of Bavarian dialects in Austria
• Cooperation with Austrian dictionary and the dictionary of German variants
• Full word-form corpus-based lexical database of German
Databases, e.g. prosopographic, bibliographic data, …
Audio – speech recordings (project Tunico)
Complexity, Formats
XML TEI
Sources: plain text, images (need OCR), Word documents (need conversion), audio (needs transcribing), born-digital material from the web (needs cleaning)
Multi-level enrichment: structural markup, linguistic / semantic annotation (stand-off)
Linking:
• Combining lexicographic material with information from corpora (encoding in TEI)
• Semantic representation of lexicographic resources in RDF
Audio with aligned transcription
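To make the stand-off idea concrete, here is a minimal Python sketch. It assumes annotations are stored as character-offset spans that point into an unmodified base text. The class name, the layer names ("entity", "pos") and the sample sentence are illustrative assumptions, not the data model actually used in the projects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One stand-off annotation: a labelled character span over the base text."""
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    layer: str   # hypothetical layer name, e.g. "entity" or "pos"
    value: str


def covered_text(text, ann):
    """Return the stretch of base text an annotation points at."""
    return text[ann.start:ann.end]


def annotations_at(annotations, layer, offset):
    """All annotations of one layer whose span contains the given offset."""
    return [a for a in annotations
            if a.layer == layer and a.start <= offset < a.end]


# The base text is stored once and never modified by the annotation layers.
TEXT = "Karl Kraus published Die Fackel in Vienna."

ANNOTATIONS = [
    Annotation(0, 10, "entity", "person"),
    Annotation(21, 31, "entity", "work"),
    Annotation(35, 41, "entity", "place"),
    Annotation(11, 20, "pos", "VERB"),
]

for ann in ANNOTATIONS:
    print(ann.layer, ann.value, repr(covered_text(TEXT, ann)))
```

Because each layer only references offsets, further layers (linguistic, semantic) can be added or replaced without touching the text or the other layers.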
Size?
qualitative vs. quantitative
K. Kraus "Die Fackel" (1899 – 1936): ~22,500 pages, ~6 million tokens
AAC: ~500 million tokens + facsimiles, 40 TB!
AMC: ~8 billion tokens in over 35 million articles of recent journalistic texts (complete newspapers & magazines in Austria over the last 20 years!) – 100 GB
Prosopographic database: 23,000 entries
A number of smaller editions/corpora: 5 – 50 works/resources, rich annotation, < 100,000 t
Multiple dictionaries with a few thousand entries
Metadata
Bibliographic information encoded as teiHeader
CMDI – Component Metadata Infrastructure used within CLARIN
Allows for flexible "profiles" specific to the type of resource and project/context:
- Lexical Resource
- TextCorpus
- Collection
- teiHeader (emulated in CMDI)
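As an illustration of bibliographic information carried in a teiHeader, the following Python sketch reads a few fields from a small header using only the standard library. The sample record (title, author, publisher, date) is invented for the example, and the choice of fields is an assumption, not one of the portal's actual profiles.

```python
import xml.etree.ElementTree as ET

# Official TEI P5 namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample record, for illustration only.
SAMPLE_HEADER = """\
<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
  <fileDesc>
    <titleStmt>
      <title>Sample edition</title>
      <author>Jane Example</author>
    </titleStmt>
    <publicationStmt>
      <publisher>Example Publisher</publisher>
      <date when="2014">2014</date>
    </publicationStmt>
    <sourceDesc><p>Illustrative record only.</p></sourceDesc>
  </fileDesc>
</teiHeader>
"""


def read_header(xml_text):
    """Extract a flat dictionary of basic bibliographic fields from a teiHeader."""
    root = ET.fromstring(xml_text)

    def first(path):
        # Text of the first matching element, or None if it is missing or empty.
        node = root.find(path, TEI_NS)
        return node.text.strip() if node is not None and node.text else None

    return {
        "title": first("tei:fileDesc/tei:titleStmt/tei:title"),
        "author": first("tei:fileDesc/tei:titleStmt/tei:author"),
        "publisher": first("tei:fileDesc/tei:publicationStmt/tei:publisher"),
        "date": first("tei:fileDesc/tei:publicationStmt/tei:date"),
    }


record = read_header(SAMPLE_HEADER)
print(record)
```

A flat record like this is the kind of information a metadata profile would then map onto its own components.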
Requirements on online availability
Varying combinations of:
• full-text search
• semantic search (search for persons, places, search by categories and classifications)
• full view (e.g. text and facsimile of individual pages)
• specialized visualizations (temporal, spatial, graph)
• raw data available for download
• stable references to resources and resource fragments
BUT before publication: collaborative editing
VRE!
Solutions
Publishing framework: corpus_shell
Repository for digital objects (Fedora-based)
Viennese Lexicographic Editor
Collaborative environment for lexicographic work
oXygen, XML database eXist
Apache Solr, Sketch Engine (/NoSke), DDC for fast advanced (linguistic) search capabilities
Most recently: Language Resources Portal
European Research Infrastructures
CLARIN
"under construction" but many real services already available
Stable organisational structure has been set up: general assembly, board of directors, national coordinators, thematic committees, …
Network of Centres
• real ones with computing and storage – not virtual
• Certification process (centre assessment)
• Types: A (infrastructure), B (LRT data/services), C (metadata)
• currently 14 centres certified (+ 4 pending)
Coordinated through SCCTC (Standing Committee on CLARIN Technical Centres)
CLARIN
Infrastructure
Federated Identity: AAI, Single Sign-On
Persistent Identifiers
CMDI – Component Metadata Infrastructure: flexible framework for creation and publication of metadata
FCS – Federated Content Search: distributed system for searching in the content of the resources (corpora, …)
Fostering the use of standards: CLARIN Standards Committee (SCS)
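Federated Content Search builds on the SRU protocol with CQL queries, so a search request is essentially a URL with a handful of standard parameters. The Python sketch below assembles such a request. The endpoint address is a placeholder and the function name is an assumption; only the parameter names (operation, version, query, startRecord, maximumRecords) come from SRU itself.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_search_url(endpoint, cql_query, max_records=10, start_record=1):
    """Build an SRU searchRetrieve URL for a CQL query (hypothetical helper)."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": cql_query,
        "startRecord": start_record,
        "maximumRecords": max_records,
    }
    return endpoint + "?" + urlencode(params)


# Placeholder endpoint, not a real service address.
url = build_search_url("https://example.org/fcs", "Haus", max_records=5)
print(url)

# Round trip: the query string decodes back to the parameters we put in.
decoded = parse_qs(urlsplit(url).query)
print(decoded["query"], decoded["maximumRecords"])
```

The same URL shape can be sent to every participating endpoint, which is what makes a distributed search across centres practical.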
Publishing Framework
corpus_shell
modular framework for publishing a wide range of language resources
designed to operate in a distributed and heterogeneous environment
distributed setup, FCS-based
integration with CLARIN metadata infrastructure
(reusing) specialized resource viewers for specific types of resources
multiple implementations (PHP, Perl, XQuery)
cooperation/integration with SADE (Scalable Architecture for Digital Editions, BBAW, Berlin)
open source (code on GitHub)
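The distributed setup can be pictured as one query fanned out to several independent endpoints whose results are collected side by side. The Python sketch below shows that pattern with stand-in search functions. It is not corpus_shell code: the endpoint names, the in-memory data and the error handling are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def federated_search(endpoints, query):
    """Send one query to every endpoint and collect the hits per endpoint.

    `endpoints` maps a name to a callable that takes the query and returns
    an iterable of hits. A failing endpoint contributes an empty list, so
    one unreachable resource does not break the whole search.
    """
    def safe_call(search):
        try:
            return list(search(query))
        except Exception:
            return []

    with ThreadPoolExecutor(max_workers=max(1, len(endpoints))) as pool:
        futures = {name: pool.submit(safe_call, search)
                   for name, search in endpoints.items()}
        return {name: future.result() for name, future in futures.items()}


# Stand-in endpoints with in-memory data (hypothetical).
def search_dictionary(query):
    return [entry for entry in ["Haus", "Hausfrau", "Maus"] if query in entry]


def search_corpus(query):
    sentences = ["Das Haus ist alt.", "Die Katze schläft."]
    return [s for s in sentences if query in s]


def broken_endpoint(query):
    raise ConnectionError("endpoint unreachable")


results = federated_search(
    {"dictionary": search_dictionary,
     "corpus": search_corpus,
     "offline": broken_endpoint},
    "Haus",
)
print(results)
```

In a real deployment each callable would wrap a network request to a remote endpoint. The threads only serve to keep slow endpoints from blocking the others.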
corpus_shell instances
vicav
corpus_shell instances
ABaC:us
Lexicography suite
Dictionary Server
• Open and freely available software that can be readily distributed (MySQL, PHP)
• Integrated with corpus_shell (FCS as common protocol)
• Connected to the clients through a REST-style web service
Viennese Lexicographic Editor
The corresponding client
DictGate
• a service for hosting lexicographic data
• for smaller lexicographic projects
Lexicography suite
Viennese Lexicographic Editor (VLE)
• XML editor specialized for editing lexicographic data
• Generic – support for any (XML) format (LMF, TEI, TBX, RDF)
• Making use of cognate technologies (XSLT, XPath, XSD)
• Various editing modes, configurable keyboard layouts
• Optimised corpus-dictionary interface
• On-the-fly data visualisations
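To show how XPath-style queries address lexicographic data, the following Python sketch pulls lemmas and translations out of two TEI dictionary entries. It uses the limited XPath subset of the standard library's ElementTree, not the editor's own machinery, and the two entries are invented sample data.

```python
import xml.etree.ElementTree as ET

# Official TEI P5 namespace.
TEI = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample entries in TEI dictionary markup.
SAMPLE = """\
<div xmlns="http://www.tei-c.org/ns/1.0">
  <entry>
    <form type="lemma"><orth>Haus</orth></form>
    <gramGrp><pos>noun</pos></gramGrp>
    <sense>
      <cit type="translation" xml:lang="en"><quote>house</quote></cit>
      <cit type="example"><quote>Das Haus ist alt.</quote></cit>
    </sense>
  </entry>
  <entry>
    <form type="lemma"><orth>gehen</orth></form>
    <gramGrp><pos>verb</pos></gramGrp>
    <sense>
      <cit type="translation" xml:lang="en"><quote>to go</quote></cit>
    </sense>
  </entry>
</div>
"""


def lemma(entry):
    """Headword of an entry: the orth inside the lemma form."""
    return entry.findtext("tei:form[@type='lemma']/tei:orth", namespaces=TEI)


def translations(entry):
    """Translation equivalents only; example citations are skipped."""
    path = "tei:sense/tei:cit[@type='translation']/tei:quote"
    return [quote.text for quote in entry.findall(path, TEI)]


root = ET.fromstring(SAMPLE)
glossary = {lemma(e): translations(e) for e in root.findall(".//tei:entry", TEI)}
print(glossary)
```

The predicate on the type attribute is what separates translations from example sentences. The same selection could be written in full XPath or XSLT.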
CLARIN Centre Vienna
clarin.oeaw.ac.at
First Austrian node in the network of CLARIN Centres
DSA (Data Seal of Approval) and CLARIN Centre B status: April 2014
Language Resources Portal
Mission: national depositing and publishing service for digital language resources
Tools: corpus_shell, lexicographic suite, …
Infrastructure services - "Knowledge Hub", mostly about metadata (under development)
Thank you!
Matej Ďurčo
Austrian Centre for Digital Humanities, Österreichische Akademie der Wissenschaften