Transcript iew-Report

Intera
MPI WP2/3 Report
Metadata
Integrated Resource Domain
Portal Creation
Peter Wittenburg
MPI for Psycholinguistics
Nijmegen NL
INTERA
WP2 Summary
November 2004
What is Metadata?
Annotation Resource
Primary Functions of MD
• visibility of resources
• searching/browsing
• organization of corpus
• management of corpus
• event documentation
• etc
Metadata Description
• Language about
• Researcher
• Modalities
• Content Type
• Informant Name
• Age
• Microphone Type
• Resource Pointers
• etc etc
Sound Resource
Video Resource
Emerging Functions of MD
• metadata is virtual fingerprint of the resource
• can be used instead of resource
• ready for the Semantic Web – virtual resource domains
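The idea that a metadata description can stand in for the resource itself can be pictured with a small record type. This is only a sketch: the field names are taken loosely from the descriptor list on this slide, the class `MetadataRecord`, the function `find_by_language` and the example.org URLs are hypothetical, and none of this is the IMDI schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MetadataRecord:
    """Hypothetical, simplified metadata description (not the IMDI schema)."""
    language: str
    researcher: str
    modalities: List[str]
    content_type: str
    # URLs pointing at the actual sound/video/annotation files
    resource_pointers: List[str] = field(default_factory=list)


def find_by_language(records, language):
    """Search the descriptions only; the media files are never opened."""
    return [r for r in records if r.language == language]


records = [
    MetadataRecord("Trumai", "researcher-a", ["speech"], "narrative",
                   ["http://example.org/trumai/session1.wav"]),
    MetadataRecord("Dutch", "researcher-b", ["speech", "gesture"], "dialogue",
                   ["http://example.org/dutch/session7.mpg"]),
]

hits = find_by_language(records, "Trumai")
for hit in hits:
    print(hit.resource_pointers)
```

The search touches only the small descriptions, which is what makes browsing and searching feasible over large media collections.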
Metadata Process
Language Resource
• can be any type of Language Resource (Annotated Media, Lexica, Grammars, etc)
• Resource Creation: the creation process is iterative, mostly very complex and dependent on the resource type
• can be grouped to large distributed LR collections (Large Collection of LR)
Metadata Description
• MD Creation: the creation process is comparatively simple; any time the resource is updated some MD information has to be updated as well
• IMDI provides a core description and special extensions for resource types
• can be grouped to large distributed MD catalogues (Large Catalogue of MD)
• searching for resources possible (MD Search, Content Search)
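The split into a core description plus extensions per resource type can be sketched as a simple field check. Everything named here is an assumption for illustration: the field names, the resource type keys and the helpers `allowed_fields` and `validate` are hypothetical, and the real core and extension sets are defined by IMDI itself.

```python
# Hypothetical field names; the real core set and extensions come from IMDI.
CORE_FIELDS = {"name", "date", "language"}

EXTENSIONS = {
    "annotated_media": {"modalities", "recording_quality"},
    "lexicon": {"headword_type", "entry_count"},
}


def allowed_fields(resource_type):
    """Core fields plus the extension fields for one resource type."""
    return CORE_FIELDS | EXTENSIONS.get(resource_type, set())


def validate(description, resource_type):
    """Return the field names that neither core nor extension covers."""
    return sorted(set(description) - allowed_fields(resource_type))


lexicon_md = {"name": "Sample lexicon", "language": "Trumai", "entry_count": "1200"}
print(validate(lexicon_md, "lexicon"))                        # no unknown fields
print(validate({"name": "x", "modalities": "speech"}, "lexicon"))
```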
Strategic Goals and Impact
strategic goals are about survival after project lifetime
• stimulate the idea of building a joint metadata domain
• “critical mass” idea
• ISO standardization based
• impact
• from few subcontractors to over 50 institutions world-wide
• ISO TC37/SC4 standardization activity (ISO, ->industry)
• LIRICS – adaptation of relevant tools to ISO DCR
• DAM-LR – bring the DELAMAN archives into Data-GRID
• web-based exploration and commentary frameworks (MPI, CMU, U Melbourne, etc working on this)
• but
• metadata creation is hard, it also means organizing, cleaning …
• needs more evangelization and benefits
DAM-LR/DELAMAN GRID
EMELD
ELAR
INL
MPI
Lund
ANLC
AILLA
AMPM
LACITO
PARADISEC
Stabilization and Framework
• IMDI 3.04 now stable and part of ISO standardization efforts
• all categories are in ISO DCR (WP3)
• DCR is key element on the way to Semantic Web
• IMDI infrastructure now mature and stable (open source, free)
• professional IMDI Editor (creating correct IMDI XML)
• CV editor
• IMDI browser (can operate in linked IMDI XML domains)
• gateway to OLAC and Dublin Core
• HTML browsing
• Google-like and complex searching
• Access Rights Management
• portal creation
• web-based Ingestion (not Intera - in progress)
• web-based exploration (not Intera – in progress)
WP3 Issues
Getting Metadata into the Semantic Web Framework
• ISO TC37/SC4 meeting in Pisa during this whole week
• IMDI is in the ISO DCR
• all ISO 11179 and ISO 12620 compliant
• localization of IMDI in DCR (Se, Gr, D, E, Fr, Nl, It, Sp)
• ISO DCR is based on XML (not RDF)
• SYNTAX tool at LORIA is web-accessible
• next steps:
• integrate OLAC(DC) and TEI (LIRICS)
• link tools with SYNTAX via Web-services
• already done for a lexicon tool
• still deep discussions (is_a, has_a relation)
• separate relation repositories (in RDF/OWL of course)
• different layers of DCRs remain an issue
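The points above about localized categories and separate relation repositories can be pictured roughly as follows. This is a sketch under stated assumptions, not the ISO DCR data model: the class `DataCategory`, the helper `localized_name`, the sample entry and the sample relation are all hypothetical, and the relations are held in a plain Python list here rather than in RDF/OWL.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DataCategory:
    """Hypothetical registry entry: one concept, many localized names."""
    identifier: str
    definition: str
    names: Dict[str, str] = field(default_factory=dict)  # language code -> name


# The registry holds definitions and localized names only ...
registry = {
    "Age": DataCategory(
        identifier="Age",
        definition="Age of the participant at recording time.",
        names={"en": "Age", "nl": "Leeftijd", "de": "Alter"},
    ),
}

# ... while relations between categories live in a separate repository.
relations: List[Tuple[str, str, str]] = [
    ("AnnotatedMedia", "is_a", "LanguageResource"),  # illustrative triple
]


def localized_name(registry, identifier, lang, fallback="en"):
    """Look up a category name in one language, falling back to English."""
    names = registry[identifier].names
    return names.get(lang, names[fallback])


print(localized_name(registry, "Age", "nl"))
print(localized_name(registry, "Age", "fr"))  # no French entry: falls back
```

Keeping the triples outside the registry mirrors the slide's point that the relation question (is_a, has_a) is still under discussion and should not be baked into the category definitions.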
WP3 DCR
IMDI Editor
also supports node creation and profiles
Corpus Structure Building
IMDI Browser
also supports lexica, catalogue metadata and profiles
Structured IMDI Search
HTML Browsing
Unstructured Search
Access Rights Management
MD Infrastructure/Portal
Browsing & Searching: IMDI Browser & IE
IMDI Domain via INTERNET
corpus structure generation
Metadata Editing: IMDI Editor, Excel
Diagram nodes: MPI, BAS, S (repeated)
Corpus exploitation (WP4)
HRELP Workshop
INTERA
Review
London
November 2003
INTERA Domain
State INTERA sub-contracts
Partner   Subcontractor  Corpus                     Type              State
MPI       BAS            Smartkom                   multimodal        integrated
MPI       BAS            Verbmobil and others       speech, text      integrated
MPI       Meertens       Dialect Corpus             speech            integrated
MPI       U Florence     Lablita                    speech text       integrated
MPI       U Florence     CORAL ROM                  Semantics ext     integrated
MPI       -              Dutch Spoken Corpus        speech text       integrated
MPI       -              Gesture corpus             multimodal        integrated
MPI       -              ESF Second Learner Corpus  speech text       integrated
MPI       -              PMOLL Corpus               speech text       integrated
MPI       -              various others             sign speech text  integrated
USAAR     DFKI           Negra, Tiger               annotated text    to be integrated
USAAR     CLPP Bulg      HPSG                       treebank          to be integrated
USAAR     U Iasi         1984                       text              to be integrated
LORIA     ATILF          Frantext, etc              text              to be integrated
ELDA      -              catalogue resources        various           integrated
ILSP/ILC  -              textual corpora            various           integrated
IMDI Domain
Europe
• ELRA Paris
• INALF Nancy
• DFKI Saarbrücken
• University of Saarland
• Bavarian Speech Archive Munich
• Meertens Institute Amsterdam
• University of Florence
• ILSP Athens
• ILC Pisa
• University of Madrid
• Max-Planck-Institute Nijmegen
• University of Kiel
• University of Bochum
• Free University of Berlin
• University of Bonn
• University of Bielefeld
• University of Helsinki
• Phonogrammarchiv Vienna
• University of Groningen
• Kotus Project Helsinki
• Sweden’s National Dialect Archive Lund
• European Sign Language Communities (Se, UK, NL, D)
• University of Utrecht
• University of Uppsala
• University of Stavanger
• University of Lund
• University of Leipzig
• University of Erfurt
• University of Leiden
• University of Frankfurt
•…
International
• Federal University of Rio de Janeiro
• University of Colorado
• University of Buenos Aires
• University of Kansas
• University of Victoria
• University of Sydney
• University of Melbourne
• E Michigan University
• Wayne State University
• AILLA Austin
•…
Big problem:
integration and portal effort
MD Creation Problems
Conclusions
• contracts are difficult – much overhead for little money
• no broad experience for MD creation
• much interaction necessary over all aspects
• no standard contract form – adaptations needed
• institutes often wanted more money than expected
• in some cases a rather chaotic situation as the starting point
• in some cases no familiarity with XML
• problems with changing student assistants
• special wishes wrt MD (IMDI flexible enough)
• MPI expected stepwise availability; in practice delivery happens at the end
• strong support for the ENABLER declaration necessary
• creating MD remains extra work
Portal Creation – XML Browsing
Task: creation of a web-site that offers all options for a selected domain of IMDI resources
just get the URLs and create a root node
IMDI
domain
BAS
Verbmobil
Speech
info files
MPI
Trumai
Sign
info files
lexica
grammar
….
text
sound
image
movie
annotations
eye movements
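The root-node step in the task above (collect the URLs, link them under one node) might look roughly like this. It is a sketch only: the element names `METATRANSCRIPT`, `Corpus`, `Name` and `CorpusLink` follow the general shape of IMDI corpus files as an assumption and should be checked against the IMDI schema, the namespace declaration is left out for brevity, and the function `build_root_node` and the example.org URLs are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_root_node(name, links):
    """Build a corpus root node that only points at existing sub-corpora.

    links: list of (label, url) pairs. Element names are assumptions
    modelled on IMDI corpus files; verify against the schema before use.
    """
    root = ET.Element("METATRANSCRIPT", {"Type": "CORPUS"})
    corpus = ET.SubElement(root, "Corpus")
    ET.SubElement(corpus, "Name").text = name
    for label, url in links:
        link = ET.SubElement(corpus, "CorpusLink", {"Name": label})
        link.text = url
    return ET.tostring(root, encoding="unicode")


xml_text = build_root_node(
    "Example portal",
    [
        ("BAS", "http://example.org/bas/corpus.imdi"),  # placeholder URL
        ("MPI", "http://example.org/mpi/corpus.imdi"),  # placeholder URL
    ],
)
print(xml_text)
```

The root node holds no resource data of its own; it only references the partner sites, which is why a portal can be set up without copying anything.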
Portal Creation – Searching
harvest all data by traversing links and validate
create a fast index file (using Java Library DBMS)
just select a button in the browser
so: simple, everyone can set up a portal
Portal Node
Fast Index
IMDI Repositories
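The harvest, validate and index steps on this slide can be sketched as a link traversal followed by an inverted index. Assumptions are stated loudly: the slide mentions a Java library DBMS for the index, whereas this sketch uses a plain in-memory dictionary; the node layout in `NODES`, the validity rule (a record needs a `Name`) and the functions `harvest` and `build_index` are hypothetical.

```python
from collections import defaultdict, deque

# In-memory stand-in for metadata files linked across repositories.
NODES = {
    "root": {"links": ["bas", "mpi"], "fields": {"Name": "Portal"}},
    "bas": {"links": ["missing"], "fields": {"Name": "Verbmobil", "Type": "speech"}},
    "mpi": {"links": ["trumai", "draft"], "fields": {"Name": "MPI corpora"}},
    "trumai": {"links": [], "fields": {"Name": "Trumai", "Type": "speech"}},
    "draft": {"links": [], "fields": {"Type": "speech"}},  # invalid: no Name
}


def is_valid(node):
    """Assumed validation rule: every description must carry a Name."""
    return bool(node["fields"].get("Name"))


def harvest(start, fetch):
    """Breadth-first traversal of links, keeping only valid descriptions."""
    seen, queue, records = {start}, deque([start]), {}
    while queue:
        node_id = queue.popleft()
        node = fetch(node_id)
        if node is None:  # dangling link: skip it
            continue
        if is_valid(node):
            records[node_id] = node["fields"]
        for target in node["links"]:
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return records


def build_index(records):
    """Map each lower-cased word to the set of nodes that contain it."""
    index = defaultdict(set)
    for node_id, fields in records.items():
        for value in fields.values():
            for token in value.lower().split():
                index[token].add(node_id)
    return index


records = harvest("root", NODES.get)
index = build_index(records)
print(sorted(index["speech"]))
```

Because searches run against the prebuilt index, the linked repositories do not need to be contacted at query time.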
Portal Creation – HTML Support
install Tomcat server and IMDI-Web-Interface
software traverses tree to establish database
large index file is created under the cover
provide an HTML entry point (HTTP server)
Web
Client
TOMCAT
Server
IMDI-WebInterface
Web-Server
MPI
Web-Server
BAS
IMDI Provider
IMDI Provider
Database
Portal Site
Portal Creation – DC/OLAC Gateway
DC Service Provider
the database can be used to fulfill the OAI protocol for metadata harvesting; any record can be served
Servlet
OAI-PMH
Portal Node
IMDI Repositories
Fast Index
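Serving a record to a Dublin Core harvester comes down to mapping stored fields onto DC elements. The sketch below builds only the `oai_dc` payload of a single record, not a full OAI-PMH response, and it is not the gateway's actual mapping: the table `FIELD_TO_DC`, the function `to_oai_dc` and the sample record are hypothetical. The two namespace URIs are the standard ones for `oai_dc` and the Dublin Core element set.

```python
import xml.etree.ElementTree as ET

OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("oai_dc", OAI_DC)
ET.register_namespace("dc", DC)

# Hypothetical mapping from stored field names to Dublin Core elements.
FIELD_TO_DC = {
    "Name": "title",
    "Language": "language",
    "Researcher": "creator",
    "Date": "date",
}


def to_oai_dc(record):
    """Build the oai_dc payload for one record; absent fields are skipped."""
    root = ET.Element(f"{{{OAI_DC}}}dc")
    for field_name, dc_name in FIELD_TO_DC.items():
        value = record.get(field_name)
        if value:
            ET.SubElement(root, f"{{{DC}}}{dc_name}").text = value
    return root


example = to_oai_dc({"Name": "Trumai session", "Language": "Trumai"})
print(ET.tostring(example, encoding="unicode"))
```

The mapping is lossy by design: Dublin Core has far fewer elements than a detailed description, so only fields with a DC counterpart are exported.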
Dissemination
Dissemination / Events
• International Metadata Workshop, Nijmegen, November 02
• Open Forum on Metadata Registries, Santa Fe, January 03
• Lexicon Workshop, Munich, February 03
• Workshop on Resource Storage and Access, Göttingen, February 03
• International Workshop on LR Archiving, London, March 03
• Sign Language Workshop, Nijmegen, May 03
• International E-Meld Workshop, Ypsilanti, July 03
• International Linguistic Congress, Prague, July 03
• ENABLER Workshop, Paris, August 03
• DRH Meeting, Cheltenham, September 03
• International PARADISEC Archiving Workshop, Sydney, October 03
• HRELP Archiving Workshop, London, November 03
• etc
LREC 2004 – Demonstration of infrastructure and MD domain
Two metadata flyers (MPI, U Lund) distributed on various occasions
Web-Site Design
several training workshops done
INTERA Portal Screenshots