Language Resources Portal - Digital Humanities Austria


Digital Editions & Language Resources Portal

Workshop "Save the data", 2 December 2014, Vienna
Matej Ďurčo, ICLTT/ACDH, ÖAW
[email protected]

What kind of data?

TEXT

http://www


What kind of data?

Dictionaries
• Persian – English Dictionary
• German – Russian Dictionary
• Dictionary of Bavarian dialects in Austria
• Cooperation with the Austrian dictionary and the dictionary of German variants
• Full word-form corpus-based lexical database of German
Databases, e.g. prosopographic and bibliographic data, …
Audio – speech recordings (project Tunico)

Complexity, Formats

XML TEI

Sources: plain text, images (need OCR), Word documents (need conversion), audio (needs transcribing), born-digital web content (needs cleaning)
Multi-level enrichment: structural markup, linguistic / semantic annotation (stand-off)
Linking:
• Combining lexicographic material with information from corpora (encoding in TEI)
• Semantic representation of lexicographic resources in RDF
Audio with aligned transcription
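The stand-off annotation mentioned above can be illustrated with a minimal Python sketch. It assumes a simple character-offset model; the `Annotation` class, the layer names and the sample sentence are hypothetical and are not taken from any of the projects presented here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One stand-off annotation: it points into the text instead of wrapping it."""
    start: int   # character offset where the span begins (inclusive)
    end: int     # character offset where the span ends (exclusive)
    layer: str   # hypothetical layer name, e.g. "entity" or "pos"
    value: str   # the label assigned to the span


def spans(text, annotations, layer):
    """Return (covered text, label) pairs for a single annotation layer."""
    return [(text[a.start:a.end], a.value) for a in annotations if a.layer == layer]


# The base text stays untouched; every layer is stored separately.
text = "Karl Kraus published Die Fackel in Vienna."
annotations = [
    Annotation(0, 10, "entity", "person"),
    Annotation(21, 31, "entity", "work"),
    Annotation(35, 41, "entity", "place"),
    Annotation(11, 20, "pos", "VERB"),
]

for covered, label in spans(text, annotations, "entity"):
    print(f"{label}: {covered}")
```

Because the layers only reference offsets, structural, linguistic and semantic annotation can be added or replaced independently of each other.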

Size?

Qualitative vs. quantitative
• K. Kraus, "Die Fackel" (1899 – 1936): ~ 22,500 pages, ~ 6 million tokens
• AAC: ~ 500 million tokens + facsimiles – 40 TB!
• AMC: ~ 8 billion tokens in over 35 million articles of recent journalistic texts (complete newspapers & magazines in Austria over the last 20 years!) – 100 GB
• Prosopographic database: 23,000 entries
• A number of smaller editions/corpora: 5 – 50 works/resources, rich annotation, < 100,000 tokens
• Multiple dictionaries with a few thousand entries

Metadata

Bibliographic information encoded as teiHeader
CMDI – the Component Metadata Infrastructure used within CLARIN
Allows for flexible "profiles" specific to the type of resource and project/context:
- Lexical Resource
- TextCorpus
- Collection
- teiHeader (emulated in CMDI)
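As a rough illustration of bibliographic information held in a teiHeader, the following Python sketch reads title and author from a minimal TEI document. The sample record and the `bibliographic_info` helper are made up for this example; only the TEI namespace and the element names (`teiHeader`, `fileDesc`, `titleStmt`) come from the TEI guidelines.

```python
import xml.etree.ElementTree as ET

# The TEI P5 namespace; the prefix "tei" is only a local convention.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Illustrative record, not an actual resource description.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Die Fackel</title>
        <author>Karl Kraus</author>
      </titleStmt>
      <publicationStmt><p>Illustrative record only</p></publicationStmt>
      <sourceDesc><p>Illustrative record only</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text><body><p/></body></text>
</TEI>"""


def bibliographic_info(tei_xml):
    """Extract title and author from the titleStmt of a teiHeader."""
    root = ET.fromstring(tei_xml)
    stmt = root.find("tei:teiHeader/tei:fileDesc/tei:titleStmt", TEI_NS)
    return {
        "title": stmt.findtext("tei:title", namespaces=TEI_NS),
        "author": stmt.findtext("tei:author", namespaces=TEI_NS),
    }


print(bibliographic_info(SAMPLE))
```

A CMDI profile would carry comparable fields in its own component structure; that mapping is not shown here.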

Requirements on online availability

Varying combinations of:
• full-text search
• semantic search (search for persons, places, search by categories and classifications)
• full view (e.g. text and facsimile of individual pages)
• specialized visualizations (temporal, spatial, graph)
• raw data available for download
• stable references to resources and resource fragments
BUT before publication: collaborative editing

VRE (virtual research environment)!

Solutions

• Publishing framework: corpus_shell
• Repository for digital objects (Fedora-based)
• Viennese Lexicographic Editor – collaborative environment for lexicographic work
• oXygen, XML database eXist
• Apache Solr, Sketch Engine (/NoSke), DDC for fast advanced (linguistic) search capabilities
• Most recently: Language Resources Portal

European Research Infrastructures

CLARIN

"Under construction", but many real services are already available
Stable organisational structure has been set up: general assembly, board of directors, national coordinators, thematic committees, …
Network of Centres
• Real ones with computing and storage – not virtual
• Certification process (centre assessment)
• Types: A (infrastructure), B (LRT data/services), C (metadata)
• Currently 14 centres certified (+ 4 pending)
Coordinated through the SCCTC (Standing Committee on CLARIN Technical Centres)

CLARIN

Infrastructure

• Federated identity: AAI, single sign-on
• Persistent identifiers
• CMDI – Component Metadata Infrastructure: flexible framework for creation and publication of metadata
• FCS – Federated Content Search: distributed system for searching in the content of the resources (corpora, …)
• Fostering the use of standards: CLARIN Standards Committee (SCS)
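CLARIN's Federated Content Search builds on the SRU protocol with CQL queries, so a search request is essentially a parameterised URL. The Python sketch below assembles such a request with the standard SRU 1.2 parameters. The endpoint `https://example.org/fcs` and the helper name `fcs_search_url` are placeholders for illustration, not actual CLARIN services.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def fcs_search_url(endpoint, cql_query, maximum_records=10):
    """Build an SRU searchRetrieve URL for a (hypothetical) FCS endpoint."""
    params = {
        "operation": "searchRetrieve",   # SRU operation for running a query
        "version": "1.2",                # SRU protocol version assumed here
        "query": cql_query,              # the CQL query string
        "maximumRecords": str(maximum_records),
    }
    return endpoint + "?" + urlencode(params)


# Placeholder endpoint: a real centre publishes its own FCS endpoint URL.
url = fcs_search_url("https://example.org/fcs", "Fackel", maximum_records=5)
print(url)

# Decoding the query string again shows what the endpoint would receive.
print(parse_qs(urlsplit(url).query))
```

An aggregator can send the same request to several endpoints and merge the responses, which is what makes the search federated.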

Publishing Framework

corpus_shell

• modular framework for publishing a wide range of language resources
• designed to operate in a distributed and heterogeneous environment
• distributed setup, FCS-based
• integration with CLARIN metadata infrastructure
• (reusing) specialized resource viewers for specific types of resources
• multiple implementations (PHP, Perl, XQuery)
• cooperation/integration with SADE (Scalable Architecture for Digital Editions, BBAW, Berlin)
• open source (code on GitHub)

corpus_shell instances

vicav


corpus_shell instances

ABaC:us


Lexicography suite

Dictionary Server
• Open and freely available software that can be readily distributed (MySQL, PHP)
• Integrated with corpus_shell (FCS as common protocol)
• Connected to the clients through a REST-style web service
Viennese Lexicographic Editor
• The corresponding client
DictGate
• A service for hosting lexicographic data
• For smaller lexicographic projects

Lexicography suite

Viennese Lexicographic Editor (VLE)
• XML editor specialized for editing lexicographic data
• Generic – support for any (XML) format (LMF, TEI, TBX, RDF)
• Making use of cognate technologies (XSLT, XPath, XSD)
• Various editing modes, configurable keyboard layouts
• Optimised corpus-dictionary interface
• On-the-fly data visualisations
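To give an idea of how XPath-style queries work on TEI-encoded dictionary data, here is a small Python sketch. It does not use VLE or reflect its internals: it relies on the limited XPath subset in Python's `xml.etree.ElementTree`, and the two entries and the `translations` helper are invented for illustration. The element names (`entry`, `form`, `orth`, `gramGrp`, `pos`, `sense`, `cit`, `quote`) follow the TEI dictionary module.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Two invented entries in TEI dictionary encoding.
ENTRIES = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <entry xml:id="e1">
    <form type="lemma"><orth>Haus</orth></form>
    <gramGrp><pos>noun</pos></gramGrp>
    <sense><cit type="translation" xml:lang="en"><quote>house</quote></cit></sense>
  </entry>
  <entry xml:id="e2">
    <form type="lemma"><orth>gehen</orth></form>
    <gramGrp><pos>verb</pos></gramGrp>
    <sense><cit type="translation" xml:lang="en"><quote>to go</quote></cit></sense>
  </entry>
</body>"""


def translations(xml_text, pos):
    """Map each lemma with the given part of speech to its translations."""
    root = ET.fromstring(xml_text)
    result = {}
    for entry in root.findall("tei:entry", NS):
        # Skip entries whose part of speech does not match.
        if entry.findtext("tei:gramGrp/tei:pos", namespaces=NS) != pos:
            continue
        lemma = entry.findtext("tei:form[@type='lemma']/tei:orth", namespaces=NS)
        quotes = entry.findall(".//tei:cit[@type='translation']/tei:quote", NS)
        result[lemma] = [q.text for q in quotes]
    return result


print(translations(ENTRIES, "noun"))
```

The same path expressions would carry over to XSLT or XQuery, where full XPath is available.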

CLARIN Centre Vienna

clarin.oeaw.ac.at

• First Austrian node in the network of CLARIN Centres
• DSA (Data Seal of Approval) and CLARIN Centre B status: April 2014
• Language Resources Portal
• Mission: national depositing and publishing service for digital language resources
• Tools: corpus_shell, lexicographic suite, …
• Infrastructure services: "Knowledge Hub", mostly about metadata (under development)

clarin.oeaw.ac.at

CLARIN Centre Vienna


Thank you!

Matej Ďurčo
Austrian Centre for Digital Humanities
Österreichische Akademie der Wissenschaften
[email protected]