Transcript Slide 1
The current state of Metadata - as far as we understand it
Peter Wittenburg
The Language Archive - Max Planck Institute
CLARIN Research Infrastructure
Nijmegen, The Netherlands

Old Concept
• of course "metadata" is an old concept
• library cards were introduced to cope with mass and anonymity
• not surprising that library people started thinking about this to describe all kinds of web-accessible resources
• DC and qualified DC were the results
• however, research world is different - not just search
• therefore in many domains solutions were developed
• 2 years ago CLARIN revised its 15-year-old set & framework

Big Ideas
• of course managing increasing amounts of data
• of course finding valuable data in the growing haystacks
• but also
• machine usage of metadata
• automatic profile matching
• research statistics - virtual sub-collection building etc.
• multilinguality in a multilingual European society
• interdisciplinary research
• biodiversity people should find information in linguistic archives etc.
• linking with contextual information
• document lifecycle management (provenance)

Big Change
• until now researchers informed each other
• culture of personal exchange
• claim: this will only work partially in the future
• have distributed centers storing lots of data - national and discipline dimensions
• depositors upload their data into these centers
• will have an anonymous landscape of data & tools all offered as services
• what do we have to find things:
  • proper metadata descriptions
  • social tagging by virtual organizations
  • content to operate on by "smart" data mining

Big Question
• are we ready to meet these wishes and changes?
• probably not
• some major issues
  • quality
  • interoperability
  • registry and reference stability
  • functional
  • multilingual
  • scalability
  • IT principles

Quality Issue
• lack of quality in descriptions
• not all elements filled in (researchers are lazy, lack of tool support)
• often not schema based (XLS) thus inconsistent
• lack of agreed and standardized vocabularies
• ISO 639-3 - about 6000 language codes
• what about subject classification schemes
• what about institution names
• thus many errors and inconsistencies
• ontologies are expensive to maintain
• misinterpretations/misuse of element semantics
• etc

Interoperability Issue
• hampered by different approaches (closed DB, no modularity, embedded ontologies)
• structural difficulties up to context dependency
• difficult semantic mapping
  • different description dimensions
  • bad element definitions
  • bad vocabulary definitions
• only little support of OAI-PMH
• reliance on DC semantics - but useless for research etc
• often "hardwired" mappings
• lack of a flexible framework to create/share/use relations
• little is standardized - what about lifetime then

Registry and Reference Stability Issue
• flexibility only when we separate things
• define & register all concepts in open registries (we are using ISO 12620 - ISOcat)
• define & register all components/profiles (we are using CLARIN registry)
• register all mappings (nothing yet)
• but if we do this we need to refer
• are our references stable??
• some are using Cool URIs - are they just URLs?
• some using explicit Handles - are they maintained?
• who takes care?
  (we are using EPIC - European PID Consortium)

Functional Issue
• do we address new functional requirements
• what about provenance information - is it automatically generated
• what about versions - are they visible
• what about long-term preservation (LTP) information
• what about formal access information
• do we know what is needed for the web services scenario (profile matching, deployment information, etc)

Multilingual Issue
• what does it really include?
  • localizing all software
  • multilingual definitions of all concepts, elements and vocabulary terms (no translations of proper names of course, or?)
• or do we simply rely on some lingua franca
• answer probably discipline dependent
• how much is (should be) the public involved
• whatever we do it is a lot of work
• CLARIN: ISOcat covers almost all major EU languages

Scalability Issue
• are our solutions scalable?
• in Europeana millions of metadata records
• in CLARIN about 270,000
  • how to structure the offer
  • how to present this to naive users
• do we share same granularity (metadata at collection and/or resource level)
  • can we deal with aggregations in same way
• can we apply semantic web technology
  • automatic mapping
  • automatic quality improvement

IT Principles
• we need to disseminate the message of some basic IT principles
• define and register your semantics
• specify and register your syntax
• use a stable reference scheme
• in some areas separate definitions and relations
• get things standardized or use standards such as
  • XML, some schema language
  • ISO 12620, etc
  • URI, Handles

What can we do?
• listen to each other first
• increase awareness about metadata and basic principles
• see how we can create an interoperable landscape
• harmonizing approaches
• harmonizing along major issues
• making things explicit and scalable
• look for proper interdisciplinary solutions moving towards an ideal e-Science domain

Üm nicht to end in Babylonish scenario nous avons still algo time om sistemas te improve.

Thanks for your attention.