Intera MPI WP2/3 Report
Metadata
Integrated Resource Domain
Portal Creation
Peter Wittenburg
MPI for Psycholinguistics, Nijmegen, NL
INTERA WP2 Summary November 2004

What is Metadata?
Primary Functions of MD
• visibility of resources
• searching/browsing
• organization of corpus
• management of corpus
• event documentation
• etc
Metadata Description
• Language about
• Researcher
• Modalities
• Content Type
• Informant Name
• Age
• Microphone Type
• Resource Pointers
• etc etc
Diagram labels: Annotation Resource, Sound Resource, Video Resource
Emerging Functions of MD
• metadata is a virtual fingerprint of the resource
• can be used instead of the resource
• ready for the Semantic Web – virtual resource domains

Metadata Process
Language Resource
• can be any type of Language Resource (Annotated Media, Lexica, Grammars, etc)
• can be grouped into large distributed LR collections (Large Collection of LR, Content Search)
• Resource Creation: the creation process is iterative, mostly very complex and dependent on the resource type
Metadata Description
• can be grouped into large distributed MD catalogues (Large Catalogue of MD, MD Search)
• searching for resources possible
• IMDI provides a core description and special extensions for resource types
• MD Creation: the creation process is comparatively simple; any time the resource is updated some MD information has to be updated as well

Strategic Goals and Impact
strategic goals are about survival after project lifetime
• stimulate the idea of building a joint metadata domain
• “critical mass” idea
• ISO standardization based
• impact
• from few subcontractors to over 50 institutions world-wide
• ISO TC37/SC4 standardization activity (ISO, ->industry)
• LIRICS – adaptation of relevant tools to ISO DCR
• DAM-LR – bring the DELAMAN archives into Data-GRID
• web-based exploration and commentary frameworks: MPI, CMU, U Melbourne, etc working on this
• but
• metadata creation is hard,
it also means organizing, cleaning …
• needs more evangelization and benefits

DAM-LR/DELAMAN GRID
Diagram labels: EMELD, ELAR, INL, MPI, Lund, ANLC, AILLA, AMPM, LACITO, PARADISEC

Stabilization and Framework
• IMDI 3.04 now stable and part of ISO standardization efforts
• all categories are in ISO DCR (WP3)
• DCR is a key element on the way to the Semantic Web
• IMDI infrastructure now mature and stable (open source, free)
• professional IMDI Editor (creating correct IMDI XML)
• CV editor
• IMDI browser (can operate in linked IMDI XML domains)
• gateway to OLAC and Dublin Core
• HTML browsing
• Google-like and complex searching
• Access Rights Management
• portal creation
• web-based Ingestion (not Intera – in progress)
• web-based exploration (not Intera – in progress)

WP3 Issues
Getting Metadata into the Semantic Web Framework
• ISO TC37/SC4 meeting in Pisa during this whole week
• IMDI is in the ISO DCR
• all ISO 11179 and ISO 12620 compliant
• localization of IMDI in DCR (Se, Gr, D, E, Fr, Nl, It, Sp)
• ISO DCR is based on XML (not RDF)
• SYNTAX tool at LORIA is web-accessible
• next steps:
• integrate OLAC(DC) and TEI (LIRICS)
• link tools with SYNTAX via Web-services
• already done for a lexicon tool
• still deep discussions (is_a, has_a relation)
• separate relation repositories (in RDF/OWL of course)
• different layers of DCRs remain an issue

WP3 DCR

IMDI Editor
also supports node creation and profiles

Corpus Structure Building

IMDI Browser
also supports lexica, catalogue metadata and profiles

Structured IMDI Search

HTML Browsing

Unstructured Search

Access Rights Management

MD Infrastructure/Portal
Diagram labels: Browsing & Searching, IMDI Browser & IE, IMDI Domain via INTERNET, corpus structure generation, MPI, Metadata Editing, IMDI Editor, Excel, nodes labelled S and BAS, Corpus exploitation (WP4)
HRELP Workshop INTERA Review London November 2003

INTERA Domain State
INTERA sub-contracts
Partner    Subcontractor  Corpus                     Type            Status
MPI        BAS            Smartkom                   multimodal      integrated
MPI        BAS            Verbmobil and others       Speech, text    integrated
MPI        Meertens       Dialect Corpus             speech          integrated
MPI        U Florence     Lablita                    speech text     integrated
MPI        U Florence     CORAL ROM Semantics        ext             integrated
MPI                       Dutch Spoken Corpus        speech text     integrated
MPI                       Gesture corpus             multimodal      integrated
MPI                       ESF Second Learner Corpus  speech text     integrated
MPI                       PMOLL Corpus               speech text     integrated
MPI                       various others             sign speech text  integrated
USAAR      DFKI           Negra, Tiger               annotated text  to be integrated
USAAR      CLPP           Bulg HPSG                  treebank        to be integrated
USAAR      U Iasi         1984                       text            to be integrated
LORIA      ATILF          Frantext, etc              text            to be integrated
ELDA                      catalogue resources        various         integrated
ILSP/ILC                  textual corpora            various         integrated

IMDI Domain
Europe
• ELRA Paris
• INALF Nancy
• DFKI Saarbrücken
• University of Saarland
• Bavarian Speech Archive Munich
• Meertens Institute Amsterdam
• University of Florence
• ILSP Athens
• ILC Pisa
• University of Madrid
• Max-Planck-Institute Nijmegen
• University of Kiel
• University of Bochum
• Free University of Berlin
• University of Bonn
• University of Bielefeld
• University of Helsinki
• University of Helsinki
• Phonogrammarchiv Vienna
• University of Groningen
• Kotus Project Helsinki
• Sweden’s National Dialect Archive Lund
• European Sign Language Communities (Se, UK, NL, D)
• University of Utrecht
• University of Uppsala
• University of Stavanger
• University of Lund
• University of Leipzig
• University of Erfurt
• University of Leiden
• University of Frankfurt
• …
International
• Federal University of Rio de Janeiro
• University of Colorado
• University of Buenos Aires
• University of Kansas
• University of Victoria
• University of Sydney
• University of Melbourne
• E Michigan University
• Wayne State University
• AILLA Austin
• …
Big problem: integration and portal effort

MD Creation Problems
Conclusions
• contracts are difficult – much overhead for little money
• no broad experience with MD creation
• much interaction necessary over all aspects
• no standard contract form – adaptations needed
• institutes often wanted more money than expected
• in some cases a rather chaotic situation as the starting point
• in some cases no familiarity with XML
• problems with changing student assistants
• special wishes wrt MD (IMDI flexible enough)
• MPI expected stepwise availability – in practice, delivery comes at the end
• strong support for the ENABLER declaration necessary
• creating MD remains extra work

Portal Creation – XML Browsing
Task: creation of a web-site that offers all options for a selected domain of IMDI resources
just get the URLs and create a root node
Diagram labels: IMDI domain, BAS, Verbmobil, Speech, info files, MPI, Trumai, Sign, info files, lexica, grammar, ….
Diagram labels (continued): text, sound, image, movie, annotations, eye movements

Portal Creation – Searching
harvest all data by traversing links and validate
create a fast index file (using Java Library DBMS)
just select a button in the browser
so: simple, everyone can set up a portal
Diagram labels: Portal Node, Fast Index, IMDI Repositories

Portal Creation – HTML Support
install Tomcat server and IMDI-Web-Interface software
software traverses the tree to establish a database
large index file is created under the cover
give an HTML entry point (HTTP server)
Diagram labels: Web Client, TOMCAT Server, IMDI-WebInterface, Web-Server MPI, Web-Server BAS, IMDI Provider, IMDI Provider, Database, Portal Site

Portal Creation – DC/OLAC Gateway
the database can be used to fulfill the OAI protocol for metadata harvesting; any record can be served
Diagram labels: DC Service Provider, Servlet OAI-PMH, Portal Node, IMDI Repositories, Fast Index

Dissemination
Dissemination / Events
• Intern Metadata Workshop                    Nijmegen    November 02
• Open Forum on Metadata Registries           Santa Fe    January 03
• Lexicon Workshop                            Munich      February 03
• Workshop on Resource Storage and Access     Göttingen   February 03
• Intern Workshop on LR Archiving             London      March 03
• Sign Language Workshop                      Nijmegen    May 03
• Intern E-Meld Workshop                      Ypsilanti   July 03
• Intern Linguistic Congress                  Prague      July 03
• ENABLER Workshop                            Paris       August 03
• DRH Meeting                                 Cheltenham  September 03
• Intern PARADISEC Archiving Workshop         Sydney      October 03
• HRELP Archiving Workshop                    London      November 03
• etc
LREC 2004 – Demonstration of infrastructure and MD domain
Two Metadata Flyers (MPI – U Lund) distributed at various occasions
Web-Site Design
several training workshops done

INTERA Portal Screenshots
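
The two steps listed under "Portal Creation – Searching" (harvest by traversing links and validate, then build a fast index) can be pictured with the minimal sketch below. It is an illustration only, not the IMDI tools' implementation: the element names (`node`, `link`, `session`), the file names, and the in-memory `DOCS` repository are hypothetical stand-ins for real IMDI XML files fetched over HTTP, and "validation" is reduced here to checking that a document can be fetched and parsed.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical repository: URL -> XML text. A real portal would fetch these
# over HTTP; the element names are simplified and are NOT the IMDI schema.
DOCS = {
    "root.xml": "<node><link>bas.xml</link><link>mpi.xml</link></node>",
    "bas.xml": "<node><link>verbmobil.xml</link></node>",
    "mpi.xml": "<node><link>trumai.xml</link><link>missing.xml</link></node>",
    "verbmobil.xml": "<session><name>Verbmobil</name><type>speech</type></session>",
    "trumai.xml": "<session><name>Trumai</name><type>speech</type></session>",
}


def harvest(root_url, fetch):
    """Follow links from the root node; return (records, failed_urls)."""
    records, errors, seen, stack = {}, [], set(), [root_url]
    while stack:
        url = stack.pop()
        if url in seen:  # guard against cycles and repeated links
            continue
        seen.add(url)
        try:
            tree = ET.fromstring(fetch(url))
        except (KeyError, ET.ParseError):
            # Stand-in for validation: unreachable or malformed documents
            # are reported instead of indexed.
            errors.append(url)
            continue
        if tree.tag == "node":
            # Structure node: queue every linked document.
            stack.extend(link.text for link in tree.findall("link") if link.text)
        else:
            # Leaf description: keep its fields as a flat record.
            records[url] = {child.tag: child.text for child in tree}
    return records, errors


def build_index(records):
    """Map each lower-cased token to the set of URLs whose record contains it."""
    index = defaultdict(set)
    for url, fields in records.items():
        for value in fields.values():
            for token in (value or "").lower().split():
                index[token].add(url)
    return index


records, errors = harvest("root.xml", DOCS.__getitem__)
index = build_index(records)
```

Looking up `index["speech"]` then returns the URLs of both session descriptions without touching the repositories again, which is the point of building the index once at harvest time.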