Language Resources Portal - Digital Humanities Austria


Digital Editions & Language Resources Portal

Workshop „Save the data“, 2 December 2014, Vienna
Matej Ďurčo, ICLTT/ACDH, ÖAW

What kind of data?





Dictionaries
• Persian – English Dictionary
• German – Russian Dictionary
• Dictionary of Bavarian dialects in Austria
Cooperation with Austrian dictionary and the dictionary of German variants
Full word-form corpus-based lexical database of German
Databases, e.g. prosopographic, bibliographic data, …
Audio – speech recordings (project Tunico)

Complexity, Formats


Sources: plain text, images (need OCR), Word documents (need conversion), audio (needs transcribing), born-digital web content (needs cleaning)
Multi-level enrichment: structural markup, linguistic / semantic annotation (stand-off)
Linking:
• Combining lexicographic material with information from corpora (encoding in TEI)
• Semantic representation of lexicographic resources in RDF
Audio with aligned transcription
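Stand-off annotation leaves the source text untouched and keeps each annotation layer separately, pointing into the text by offsets. A minimal sketch in Python; the `Annotation` class, the layer names and the sample sentence are illustrative assumptions, not the encoding used in the projects listed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One stand-off annotation: a labelled span of character offsets."""
    start: int   # inclusive offset into the base text
    end: int     # exclusive offset into the base text
    layer: str   # hypothetical layer name, e.g. "entity" or "pos"
    value: str


def covered_text(text: str, ann: Annotation) -> str:
    """Return the stretch of the base text an annotation points at."""
    return text[ann.start:ann.end]


def annotations_at(annotations, offset: int):
    """All annotations, from any layer, that cover a given offset."""
    return [a for a in annotations if a.start <= offset < a.end]


# The base text is never modified; layers can be added or dropped freely.
BASE_TEXT = "Karl Kraus published Die Fackel in Vienna."

ANNOTATIONS = [
    Annotation(0, 10, "entity", "person"),
    Annotation(21, 31, "entity", "work"),
    Annotation(35, 41, "entity", "place"),
    Annotation(11, 20, "pos", "VERB"),
]
```

Because the layers only reference the text, structural markup, linguistic annotation and semantic annotation can be produced by different tools and combined later.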


Qualitative vs. quantitative
• K. Kraus, „Die Fackel“ (1899 – 1936): ~ 22,500 pages, ~ 6 million tokens
• AAC: ~ 500 million tokens + facsimiles, 40 TB!
• AMC: ~ 8 billion tokens in over 35 million articles of recent journalistic texts (complete newspapers & magazines in Austria over the last 20 years!), 100 GB
• Prosopographic database: 23,000 entries
• A number of smaller editions/corpora: 5 – 50 works/resources, rich annotation, < 100,000 tokens
• Multiple dictionaries with a few thousand entries
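A back-of-the-envelope calculation puts these sizes in proportion. The sketch below uses only the approximate figures quoted on this slide, so the results are rough orders of magnitude and nothing more.

```python
# Approximate figures as quoted on the slide.
FACKEL_PAGES = 22_500
FACKEL_TOKENS = 6_000_000
AAC_TOKENS = 500_000_000
AMC_TOKENS = 8_000_000_000
AMC_ARTICLES = 35_000_000  # "over 35 million", so a lower bound


def average(total: float, count: float) -> float:
    """Plain arithmetic mean of a total spread over `count` units."""
    return total / count


# Average density of a page of "Die Fackel".
tokens_per_page = round(average(FACKEL_TOKENS, FACKEL_PAGES), 1)

# Average length of an AMC article (an upper bound, since the
# article count is itself a lower bound).
tokens_per_article = round(average(AMC_TOKENS, AMC_ARTICLES), 1)

# How many times larger the AMC is than the AAC, counted in tokens.
amc_to_aac_ratio = AMC_TOKENS / AAC_TOKENS
```

On these numbers a page of „Die Fackel“ and an AMC article are of comparable length (a few hundred tokens each), while the corpora themselves differ by orders of magnitude.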


Bibliographic information encoded as teiHeader
CMDI – metadata infrastructure used within CLARIN
Allows for flexible „profiles“ specific to the type of resource and project/context:
- Lexical Resource
- TextCorpus
- Collection
- teiHeader (emulated in CMDI)
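For orientation, a teiHeader needs at least a file description with a title statement, a publication statement and a source description. The sketch below builds such a minimal header with Python's standard library; the function name and the sample values are hypothetical, and real project headers carry far more detail.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def tei(tag: str) -> str:
    """Qualify a tag name with the TEI namespace."""
    return f"{{{TEI_NS}}}{tag}"


def build_tei_header(title: str, publisher: str, source_note: str) -> ET.Element:
    """Build a minimal teiHeader: fileDesc with its three required parts."""
    # Serialize TEI as the default namespace (no prefix).
    ET.register_namespace("", TEI_NS)

    header = ET.Element(tei("teiHeader"))
    file_desc = ET.SubElement(header, tei("fileDesc"))

    title_stmt = ET.SubElement(file_desc, tei("titleStmt"))
    ET.SubElement(title_stmt, tei("title")).text = title

    publication_stmt = ET.SubElement(file_desc, tei("publicationStmt"))
    ET.SubElement(publication_stmt, tei("publisher")).text = publisher

    source_desc = ET.SubElement(file_desc, tei("sourceDesc"))
    ET.SubElement(source_desc, tei("p")).text = source_note

    return header


# Sample values are placeholders, not actual resource metadata.
header = build_tei_header(
    title="Sample dictionary",
    publisher="Sample institution",
    source_note="Born digital.",
)
header_xml = ET.tostring(header, encoding="unicode")
```

A CMDI profile that emulates the teiHeader would carry the same fields as components; that mapping is not shown here.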

Requirements on online availability

Varying combinations of:
• full-text search
• semantic search (search for persons, places, search by categories and classifications)
• full view (e.g. text and facsimile of individual pages)
• specialized visualizations (temporal, spatial, graph)
• raw data available for download
• stable references to resources and resource fragments
BUT before publication: collaborative editing




Publishing framework: corpus_shell
Repository for digital objects (Fedora-based)
Viennese Lexicographic Editor
Collaborative environment for lexicographic work
oXygen, XML database eXist
Apache Solr, Sketch Engine (/NoSke), DDC for fast advanced (linguistic) search capabilities
Most recently: Language Resources Portal

European Research Infrastructures


„Under construction“, but many real services already available
Stable organisational structure has been set up: general assembly, board of directors, national coordinators, thematic committees, …
Network of Centres (real ones with computing and storage, not virtual)
• Certification process (centre assessment)
• Types: A (infrastructure), B (LRT data/services), C (metadata)
• Currently 14 centres certified (+ 4 pending)
• Coordinated through SCCTC, the Standing Committee on CLARIN Technical Centres



Federated Identity: AAI, single sign-on
Persistent Identifiers
CMDI – Component Metadata Infrastructure: flexible framework for creation and publication of metadata
FCS – Federated Content Search: distributed system for searching in the content of the resources (corpora, …)
Fostering the use of standards: CLARIN Standards Committee (SCS)
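FCS builds on the SRU protocol with CQL queries, so a search is in essence an HTTP GET request with a handful of standard parameters. A minimal sketch of how such a request URL is assembled; the endpoint address is a placeholder and the helper names are hypothetical, only the SRU parameter names are taken from the standard.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(endpoint: str, cql_query: str,
                     start_record: int = 1, maximum_records: int = 10) -> str:
    """Assemble an SRU 1.2 searchRetrieve URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": cql_query,
        "startRecord": start_record,
        "maximumRecords": maximum_records,
    }
    return f"{endpoint}?{urlencode(params)}"


def query_parameters(url: str) -> dict:
    """Decode the query string again, e.g. for logging or debugging."""
    return {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}


# Placeholder endpoint; a real one would be published by the hosting centre.
url = build_search_url("https://example.org/fcs", "Fackel", maximum_records=5)
decoded = query_parameters(url)
```

The same request can be sent to every endpoint in the federation, which is what makes a common protocol attractive for distributed search.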

Publishing Framework


modular framework for publishing a wide range of language resources
designed to operate in a distributed and heterogeneous environment
distributed setup, FCS-based
integration with CLARIN metadata infrastructure (reusing)
specialized resource viewers for specific types of resources
multiple implementations (PHP, Perl, XQuery)
cooperation/integration with SADE (Scalable Architecture for Digital Editions, BBAW, Berlin)
open source (code on GitHub)
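The idea of a distributed setup behind a common search protocol can be illustrated with a small fan-out: one query goes to several endpoints in parallel, and a failing endpoint does not block the others. This is a conceptual sketch only, not corpus_shell code; the endpoints are in-memory stand-ins and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def federated_search(query, endpoints):
    """Send one query to every endpoint and collect hits and errors.

    `endpoints` maps an endpoint name to a callable taking the query
    and returning a list of hits.
    """
    def run(item):
        name, search = item
        try:
            return name, search(query), None
        except Exception as exc:  # one failing endpoint must not break the rest
            return name, [], str(exc)

    with ThreadPoolExecutor(max_workers=max(1, len(endpoints))) as pool:
        outcomes = list(pool.map(run, endpoints.items()))

    hits = {name: result for name, result, _ in outcomes}
    errors = {name: error for name, _, error in outcomes if error is not None}
    return hits, errors


def make_stub_endpoint(documents):
    """In-memory stand-in for a remote endpoint (substring match only)."""
    return lambda query: [d for d in documents if query.lower() in d.lower()]


def unreachable_endpoint(query):
    """Stand-in for an endpoint that is down."""
    raise TimeoutError("endpoint unreachable")


ENDPOINTS = {
    "corpus_a": make_stub_endpoint(["Die Fackel, Nr. 1", "Wiener Zeitung"]),
    "corpus_b": make_stub_endpoint(["Notes on Die Fackel"]),
    "corpus_c": unreachable_endpoint,
}

hits, errors = federated_search("fackel", ENDPOINTS)
```

In a real deployment each callable would wrap an HTTP request to a remote service; keeping per-endpoint errors separate lets the client show partial results.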

corpus_shell instances






Lexicography suite

Dictionary Server
• Open and freely available software that can be readily distributed (MySQL, PHP)
• Integrated with corpus_shell (FCS as common protocol)
• Connected to the clients through a REST-style web service
Viennese Lexicographic Editor
The corresponding client
DictGate
• a service for hosting lexicographic data
• for smaller lexicographic projects

Lexicography suite

Viennese Lexicographic Editor (VLE)
• XML editor specialized for editing lexicographic data
• Generic: support for any (XML) format (LMF, TEI, TBX, RDF)
• Making use of cognate technologies (XSLT, XPath, XSD)
• Various editing modes, configurable keyboard layouts
• Optimised corpus-dictionary interface
• On-the-fly data visualisations
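To give an impression of what XPath-style access to a TEI-encoded dictionary entry looks like, the sketch below reads a small entry with Python's standard library, which supports a limited XPath subset. The sample entry is invented for illustration and the function name is hypothetical; this is not VLE code.

```python
import xml.etree.ElementTree as ET

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample entry using common TEI dictionary elements.
SAMPLE_ENTRY = """
<entry xmlns="http://www.tei-c.org/ns/1.0" xml:id="haus">
  <form type="lemma"><orth>Haus</orth></form>
  <gramGrp><pos>noun</pos></gramGrp>
  <sense>
    <cit type="translation" xml:lang="en"><quote>house</quote></cit>
  </sense>
</entry>
"""


def summarize_entry(xml_text: str) -> dict:
    """Pull lemma, part of speech and translations out of one entry."""
    entry = ET.fromstring(xml_text.strip())
    lemma = entry.findtext("tei:form[@type='lemma']/tei:orth", namespaces=TEI)
    pos = entry.findtext("tei:gramGrp/tei:pos", namespaces=TEI)
    translations = [
        quote.text
        for quote in entry.findall(".//tei:cit[@type='translation']/tei:quote", TEI)
    ]
    return {"lemma": lemma, "pos": pos, "translations": translations}


summary = summarize_entry(SAMPLE_ENTRY)
```

The same path expressions could drive an XSLT stylesheet for display, which is the kind of reuse that generic XML tooling makes possible.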

CLARIN Centre Vienna

First Austrian node in the network of CLARIN Centres
DSA (Data Seal of Approval) and CLARIN Centre B status, April 2014
Language Resources Portal
Mission: national depositing and publishing service for digital language resources
Tools: corpus_shell, lexicographic suite, …
Infrastructure services: „Knowledge Hub“, mostly about metadata (under development)



Thank you!

Matej Ďurčo
Austrian Centre for Digital Humanities
Österreichische Akademie der Wissenschaften