Multilingual Ontologies –
Standards and Technologies
Gerhard Budin
University of Vienna
Chair, ISO/TC 37/SC 2
22nd APAN Meeting
NUS, Singapore
20 July, 2006
Outline
• Problem description
• Methods integration
• Multi-layer data modeling
• Multi-standard frameworks
• Multiple representation languages
• Interlinking and harmonization of standards and
specifications
• Interoperability frameworks
• Integration of tools
• Examples from risk management ontology engineering
Problem Description
1. There is (still) a communication gap between formalized
knowledge representations such as ontologies and the users
of the information and communication systems in which such
ontologies are used, including on user interfaces.
2. Although the Semantic Web has been designed primarily
for machine-to-machine communication, we need
seamless natural-language interaction workflows in
(semantic) web services of any kind.
3. While the Semantic Web is (still) essentially monolingual
and the international lingua franca is English, there is a
growing need for multilingual ontology resources as well
as ontology-based translation services that overcome
communication barriers arising from cultural-linguistic
differences, limited command of English, the need
for high precision in communication, etc.
Need for integration of diverse methods
• As expressed in standards and implemented in technologies, the
following “traditions” increasingly merge:
– Ontology engineering standards, frameworks, technologies
• e.g. OWL and SKOS (both W3C, both based on RDF),
DOLCE/SUMO, description logic, frame logic, unified logic, annotation
• Types of ontologies (e.g. domain, upper, application and task ontologies)
• Editors such as Protégé, Altova, OntoEdit, various merging/annotation tools
– Translation engineering standards
• i.e. various paradigms in machine translation and computer-assisted
translation (language-based, statistical MT, translation memories, patterns)
– Terminology and language engineering standards (as the pre-requisite for
and interface between ontology and translation)
• Terminology and lexical markup frameworks: TMF, LMF (ISO)
• Markup languages such as TBX (language industry+ISO)
• Lexical databases/linguistic ontologies: WordNet, OntoWordNet, EuroWordNet
• Linguistic enrichment of ontologies (e.g. FrameNet)
• Interaction mechanisms, translation of ontologies
• Integration of multilingual ontologies in machine translation processes
Diversity and interoperability
• Strong diversity of lexico-terminological resources
– Data models, data structures + data semantics
– Diversity of semantic, linguistic/cultural complexity and semantic
depth/richness
• Diversity of user groups and their requirements
• Sheer quantity of resources
• Data interchange between organizations (within and
across domains) as well as (distributed) data integration –
early needs asking for immediate solutions

• History of data modeling
• History of interchange standards
• History of semantic interoperability management
Need for multi-level modeling architectures
[Diagram. Labels shown: generic interoperability framework;
terminological interoperability]
Developing the Terminology Markup Framework
in order to cope with this complexity-diversity
• Based on empirical studies and practical user-driven
requirements analysis
• Markup/representation/modeling: XML, XML Schema, RDF, UML
• Open standards strategy (ISO TC 37)
– ISO 12620 Data categories – meta-model element + semantics
registry (RDF)
– ISO 16642 Terminology Markup Framework (TMF) – meta-model
architecture and specifications (UML)
– ISO 12200 – Terminology Markup Language (XML)
• Instance for language industry: TBX Termbase Exchange Format
(XML)
• Instance for lexicography/publishing: LexML ISO 1951
– Lexical Markup Framework (LMF) (UML)
– ISO 704 and ISO 1087 (foundational level)
– ISO 15188 (workflow and collaborative issues)
– Alignment with ISO 11179, W3C, OASIS, etc.
Introduction to TBX
• TBX® stands for TermBase eXchange
• TBX is a Terminological Markup Framework (TMF)
markup language
– TMF is an ISO standard (16642)
• TBX is consistent with ISO 12200 (MARTIF)
• TBX is maintained by OSCAR (www.lisa.org)
• The TBX specification is free
• Serving portability of resources across proprietary
terminology management systems, as well as
interoperability of application-specific resources
TBX structure
• A TBX file is an XML document
• A TBX file consists of:
– A header that describes the file
– A set of entries, one per concept in the termbase
– For each concept, a set of terms, grouped by
language, that designate the concept
• A terminological concept entry (termEntry)
– Can be multilingual
– Can be monolingual
TBX and Other Standards
• (1) TBX and ISO 16642 (TMF)
• (2) TBX and ISO 12620 (Data Categories)
• (3) TBX and SKOS
1: TBX and ISO 16642
• TBX is a TML (Terminological Markup Language) of TMF
(ISO 16642) (see Annex B)
• TBX maps to the TMF meta-model
– A TBX file is a TDC (terminological data collection)
– martifHeader provides GI (global information)
– termEntry: TE (terminological entry)
– langSet: LS (language section)
– tig/ntig: TS (term section)
• A TMF DCS (Data Category Selection) in TBX is in XCS
(eXtensible Constraint Specification) format
• TBX uses ISO 12200 for its XML style
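The element-to-level mapping above can be sketched in a few lines of Python. This is a minimal illustration only: the sample document is hypothetical and heavily simplified, the helper name `tmf_outline` is invented for this sketch, and treating the root element `martif` as the TDC is an assumption based on the statement that a TBX file is a TDC.

```python
import xml.etree.ElementTree as ET

# TBX element name -> TMF meta-model level, as listed on the slide.
TBX_TO_TMF = {
    "martif": "TDC",       # assumption: the root element stands for the whole collection
    "martifHeader": "GI",
    "termEntry": "TE",
    "langSet": "LS",
    "tig": "TS",
    "ntig": "TS",
}

# Hypothetical, heavily simplified TBX-like document for illustration only.
SAMPLE = """<martif type="TBX">
  <martifHeader><fileDesc><sourceDesc><p>demo termbase</p></sourceDesc></fileDesc></martifHeader>
  <text><body>
    <termEntry id="c1">
      <langSet xml:lang="en"><tig><term>scribe</term></tig></langSet>
      <langSet xml:lang="fr"><tig><term>copiste</term></tig></langSet>
    </termEntry>
  </body></text>
</martif>"""


def tmf_outline(xml_text):
    """Return (depth, element name, TMF level) for every element that maps to a TMF level."""
    root = ET.fromstring(xml_text)
    outline = []

    def walk(elem, depth):
        level = TBX_TO_TMF.get(elem.tag)
        if level is not None:
            outline.append((depth, elem.tag, level))
            depth += 1  # only mapped elements add a nesting level
        for child in elem:
            walk(child, depth)

    walk(root, 0)
    return outline


for depth, tag, level in tmf_outline(SAMPLE):
    print("  " * depth + f"{tag}: {level}")
```

Wrapper elements that have no TMF counterpart in the table (here `text` and `body`) are skipped, so the printed outline mirrors the TDC / GI / TE / LS / TS nesting rather than the raw XML nesting.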
TMF Metamodel
Terminological Data Collection (TDC)
– Global Information (GI)
– Complementary Information (CI)
– Terminological (Concept) Entry/Entries (TE)
  – Language Section(s) (LS)
    – Term Section(s) (TS)
      – Term Component Section(s) (TCS)
TMF and lexical resources
• In general, a terminological resource is organized into
concept entries, each of which includes one or more
terms designating a particular concept
• In general, a lexical resource is organized into lexical
entries, each of which includes one or more senses of a
particular lexical item (a word or phrase)
• A concept entry containing multiple terms can be split
into multiple lexical entries, one per term, and multiple
lexical entries associated with the same concept can be
combined into one concept entry
• Link to Lexical Markup Framework (LMF)
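The split and merge operations described above can be illustrated with plain dictionaries. This is a sketch under stated assumptions, not a TMF or LMF implementation: the dictionary layout, the function names and the English definition text are all hypothetical.

```python
from collections import defaultdict


def split_concept_entry(concept_entry):
    """Turn one concept entry into one lexical entry per term."""
    return [
        {
            "lemma": term["term"],
            "lang": term["lang"],
            "senses": [{"concept_id": concept_entry["id"],
                        "definition": concept_entry.get("definition")}],
        }
        for term in concept_entry["terms"]
    ]


def merge_lexical_entries(lexical_entries):
    """Group lexical entries by the concept their senses point to."""
    concepts = defaultdict(lambda: {"definition": None, "terms": []})
    for entry in lexical_entries:
        for sense in entry["senses"]:
            concept = concepts[sense["concept_id"]]
            concept["definition"] = concept["definition"] or sense.get("definition")
            concept["terms"].append({"term": entry["lemma"], "lang": entry["lang"]})
    return [{"id": cid, **data} for cid, data in concepts.items()]


# Hypothetical concept entry with two terms designating the same concept.
concept = {
    "id": "c1",
    "definition": "person who copies or writes documents",
    "terms": [
        {"term": "copiste", "lang": "fr"},
        {"term": "scribe", "lang": "en"},
    ],
}

lexical = split_concept_entry(concept)
print(lexical)
print(merge_lexical_entries(lexical))
```

In this simplified layout the round trip is lossless; real resources carry term-level and sense-level data categories that would need their own mapping rules.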
2: TBX and ISO 12620
• All data categories in the default TBX DCS are taken
from ISO 12620
• ISO 12620 is organized as an online registry and serves
as a meta-ontology for resource modeling and for
resource interoperability
3: TBX and SKOS
• A typical concept entry will contain a subject field
to specify the domain of the concept.
• However, the subject field is typically some kind
of hierarchy that is flattened into a string within
TBX
• SKOS makes it possible to represent the subject
field hierarchy as a hierarchy and then create a
link within TBX
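The difference between a flattened subject-field string and a linked hierarchy can be sketched as follows. The concept identifiers and labels follow the Food example used on the slides; the dictionary layout, the " > " separator and the function names are assumptions made for this illustration.

```python
# Hypothetical subject-field hierarchy: concept id -> (label, broader concept id or None).
SUBJECT_FIELDS = {
    "s71": ("Food", None),
    "s79": ("Restaurant Menu Item", "s71"),
    "s81": ("Recipe Ingredient", "s71"),
}


def flatten(concept_id, separator=" > "):
    """Flatten the path from the top concept down to concept_id into one string."""
    labels = []
    while concept_id is not None:
        label, broader = SUBJECT_FIELDS[concept_id]
        labels.append(label)
        concept_id = broader
    return separator.join(reversed(labels))


def narrower(concept_id):
    """Return the ids of all concepts whose broader concept is concept_id."""
    return sorted(cid for cid, (_, broader) in SUBJECT_FIELDS.items()
                  if broader == concept_id)


# A concept entry can store a link (here: the id) instead of the flattened string.
entry = {"id": "e1", "subjectField": "s81"}
print(flatten(entry["subjectField"]))
print(narrower("s71"))
```

Storing the link keeps the hierarchy navigable in both directions, while the flattened string can still be derived on demand for display.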
Simple Knowledge Organization
System (SKOS)
• “SKOS is an area of work developing
specifications and standards to support the use
of knowledge organisation systems (KOS) such
as thesauri, classification schemes, subject
heading lists, taxonomies, other types of
controlled vocabulary, and perhaps also
terminologies and glossaries, within the
framework of the Semantic Web.”
- http://www.w3.org/2004/02/skos/ (Accessed on 3/17/06)
Sample SKOS
<skos:Concept rdf:about="#s71">
  <skos:prefLabel>Food</skos:prefLabel>
  <skos:narrower rdf:resource="#s81"/>
  <skos:narrower rdf:resource="#s79"/>
</skos:Concept>
<skos:Concept rdf:about="#s81">
  <skos:prefLabel>Recipe Ingredient</skos:prefLabel>
  <skos:broader rdf:resource="#s71"/>
</skos:Concept>
<skos:Concept rdf:about="#s79">
  <skos:prefLabel>Restaurant Menu Item</skos:prefLabel>
  <skos:broader rdf:resource="#s71"/>
</skos:Concept>
Visual Representation of SKOS
[Tree diagram. Nodes shown: Food; Recipe Ingredient; Restaurant Menu Item;
Appetizer; Entree; Salad; Soup; Grocery Store Item; Homemade Item]
GEvTerm Initiative
• The food example used on the preceding slides is taken
from FooNaVar, a project of the GEvTerm Initiative.
• The GEvTerm Initiative is a terminological
database project that has committed to being fully TBX
and SKOS compliant
C: Multilingual Thesaurus for
Medieval Studies (MLTMS)
• “Imagine the ability to search across web-resources
using your native modern european language and find
appropriate primary and secondary sources in Latin,
French, Italian, German, Spanish, English, etc., based
upon the meaning rather than the form of the search
term. Imagine having a tool that would enable you to
search for a concept and be able to construct the forms it
has taken historically as well as the ability to link outward
for both evidence and argument. Imagine a tool that
would enable you to study the slippage of concept which
is beyond naming. Imagine having a tool that can
deconstruct ontological orders asking for different kinds
of readings.”
http://www.mith2.umd.edu/thes/ (Accessed on 3/17/06)
Why did MLTMS use TBX?
• integration of terminological data from multiple
sources;
• querying multiple termbases through a single
user interface by passing data through a
common intermediate format on a batch or
dynamic basis;
• placing data on an FTP site for download by
interested parties;
• peer review by colleagues of tentative entries
- http://www.mith2.umd.edu/thes/ytbx.html (Accessed on 3/17/06)
MLTMS Sample
<termEntry id='eid-VocCod-211.01'>
<descrip type='subjectField'>personnel</descrip>
<descrip type='definition'>personne qui accomplit un travail copie ou d'&#x00E9;criture</descrip>
<langSet xml:lang='fr'>
<ntig>
<termGrp>
<term id='tid-voccod-211.01-fr1'>copiste</term>
<termNote type='termType'>entryTerm</termNote>
</termGrp>
</ntig>
<ntig>
<termGrp>
<term id='tid-voccod-211.01-fr3'>&#x00E9;crivain</term>
<termNote type='termType'>synonym</termNote>
</termGrp>
</ntig>
</langSet>
<langSet xml:lang='en'>
<ntig>
<termGrp>
<term id='tid-voccod-211.01-en1'>scribe</term>
<termNote type='termType'>entryTerm</termNote>
</termGrp>
</ntig>
</langSet>
</termEntry>
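An entry of this shape can be read with a standard XML parser. The following Python sketch uses an abridged copy of the entry (the definition is omitted) and a hypothetical helper, `terms_by_language`, that groups terms by their `xml:lang` value; it assumes every term sits inside a `termGrp`, as in the sample.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Abridged copy of the entry shown above.
ENTRY = """<termEntry id='eid-VocCod-211.01'>
  <descrip type='subjectField'>personnel</descrip>
  <langSet xml:lang='fr'>
    <ntig><termGrp>
      <term id='tid-voccod-211.01-fr1'>copiste</term>
      <termNote type='termType'>entryTerm</termNote>
    </termGrp></ntig>
    <ntig><termGrp>
      <term id='tid-voccod-211.01-fr3'>&#x00E9;crivain</term>
      <termNote type='termType'>synonym</termNote>
    </termGrp></ntig>
  </langSet>
  <langSet xml:lang='en'>
    <ntig><termGrp>
      <term id='tid-voccod-211.01-en1'>scribe</term>
      <termNote type='termType'>entryTerm</termNote>
    </termGrp></ntig>
  </langSet>
</termEntry>"""


def terms_by_language(entry_xml):
    """Map each language code to a list of (term, termType) pairs."""
    entry = ET.fromstring(entry_xml)
    result = {}
    for lang_set in entry.findall("langSet"):
        pairs = []
        for group in lang_set.iter("termGrp"):
            term = group.findtext("term")
            term_type = group.findtext("termNote[@type='termType']")
            pairs.append((term, term_type))
        result[lang_set.get(XML_LANG)] = pairs
    return result


print(terms_by_language(ENTRY))
```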
MLTMS Sample
(Rendered with XSLT)
TBX → HTML
• The last few slides have provided an example of
rendering HTML from a TBX file. Here is a brief
diagram of the process.
TBX → (processed by) XSLT → (results in) HTML
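The slides use XSLT for this step. As a rough stand-in, the same TBX-to-HTML idea can be sketched in Python with the standard library, which has no XSLT processor. This is not the stylesheet used by the project: the function name, the sample entry and the HTML layout are all hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def render_entry_html(entry_xml):
    """Render one TBX-like termEntry as a small HTML fragment."""
    entry = ET.fromstring(entry_xml)
    parts = [f"<h2>{escape(entry.get('id', ''))}</h2>"]
    # Entry-level descriptive data (e.g. subject field) comes first.
    for descrip in entry.findall("descrip"):
        label = escape(descrip.get("type", ""))
        parts.append(f"<p><b>{label}:</b> {escape(descrip.text or '')}</p>")
    # Then one line per language section, listing its terms.
    for lang_set in entry.findall("langSet"):
        terms = ", ".join(escape(t.text or "") for t in lang_set.iter("term"))
        parts.append(f"<p><i>{escape(lang_set.get(XML_LANG, ''))}</i>: {terms}</p>")
    return "\n".join(parts)


# Hypothetical, shortened entry in the style of the MLTMS sample.
ENTRY = """<termEntry id='demo-1'>
  <descrip type='subjectField'>personnel</descrip>
  <langSet xml:lang='fr'><ntig><termGrp><term>copiste</term></termGrp></ntig></langSet>
  <langSet xml:lang='en'><ntig><termGrp><term>scribe</term></termGrp></ntig></langSet>
</termEntry>"""

print(render_entry_html(ENTRY))
```

An XSLT stylesheet expresses the same traversal declaratively as templates matching `termEntry`, `descrip` and `langSet`, which is why a common interchange format makes such renderings reusable across termbases.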
D: Other Standards
• ISO 11179 and XCS, which defines a flavor of
TBX, both provide a list of data element types
• XMDR
• RDF
• OWL
• Topic Maps/XTM
E: Tasks for TBX
• Encourage translation technology vendors to implement
TBX
• Revise the specification
• Compare ISO 11179 to XCS
• Render TBX in RDF -> TBR for TBX-SKOS
interoperability implementation
• TBR -> OWL
• TBX – TMX (translation memory exchange standard)
• TBX in Machine Translation applications
XMDR Prototype Architecture:
Initial Implemented Modules
[Architecture diagram. Modules shown: External Interface, Registry,
RegistryStore, WritableRegistryStore, RetrievalIndex, FullTextIndex,
LogicBasedIndex, Authentication Service (defer), MetadataValidator (defer),
schema-driven syntax checker, MappingEngine (defer), 11179 OWL Ontology,
Ontology Editor. Technologies named: Java, Subversion, Jena, Xerces,
Lucene, OWI KS, Racer, Kowari, Protege. UML legend: Generalization,
Composition (tight ownership), Aggregation (loose ownership).]
OWL, RDF & XML Schema used to specify
XMDR as UML used for 11179 Edition 2
[Diagram. Labels shown: 11179 Relational Schema; UML 11179 Metamodel;
OWL XMDR Ontology & annotations; RDF Spec; Types & Cardinalities;
XMDR’s Triples: binary labeled relationships; Relax NG Schema;
XML Schema Language spec; Relational Metadata; TRang; XMDR XML Schema;
XML Objects. Design questions noted: What things go in own files?
Which property direction stored? Sequential ordering of properties.]
XMDR XML schema provides a
number of important benefits…
• Schema specifies what is required as well as what is legal
• Divides metadata into files conforming to XML schema
• Normalizes data (à la relational “one fact in one place”)
• Facilitates XSLT transformations by reducing degrees of
freedom to a canonical encoding within the RDF standard
• Relax NG used to create and check XMDR-it schema
• RNG validator enforces many OWL ontology constraints
• TRang automatically translates into XML schema syntax
From texts and terminologies to
ontologies
• Using the Risk scenario
– Termbase
• Export XML
• Domain Models – meta-models -> patterns
– Text corpus
• Term extraction – comparative testing ProTerm, MultiTerm
Extract, MultiCorpora
• Aligning with termbase
• Convert to RDF
– Ontology import -> editor
• The MULTH-WIN Project as an example of methods
integration:
Bornemisza
Terminological frame semantics
• INTERVENTION (ACTOR(S), ACTIVITIES/PHASES):
• RISK DETECTING (PRE-EVENT)
• R-ASSESSMENT
• R-PERCEPTION (X is risk)
• EXPERIENCE (statistics, case studies)
• OBSERVATION (monitoring)
• METHOD
• SATELLITE
• PROGNOSES
• R-ANALYSIS
• R-FEATURES
• SITUATION/CONTEXT (danger/hazard)
• SIMULATION (course of events)
• PROBABILISTIC METHODS (safety)
• RELIABILITY
• R-IDENTIFICATION (DAMAGE)
• R-SOURCE
• DAMAGE CAUSE
• VULNERABILITY (DAMAGE TARGET)
• SUSCEPTIBILITY (capacity/people)
Terminological frame semantics
I. Pre-event B. Public awareness and planning, II. In-event: C.
Events and response
afflux/Hochwasser durch Aufstau
BE [[TYPE=flood], [PLACE=], [TIME=]],
HAVE [CAUSE [[ORIGIN=], [NIEDERSCHLAG [TYPE=]], [STAU [TYPE= Aufstau]]],
DAMAGE [TARGET=, SOURCE=, DEGREE=]],
HAPPEN [STATES=, PROCESSES=]]
backwater/Rückstau
BE [[TYPE=flood], [PLACE=], [TIME=]],
HAVE [CAUSE [[ORIGIN=], [NIEDERSCHLAG [TYPE=]], [STAU [TYPE= Rückstau]]],
DAMAGE [TARGET=, SOURCE=, DEGREE=]],
HAPPEN [STATES=, PROCESSES=]]
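The two frames above share one structure and differ in a single slot. A minimal Python sketch makes this explicit by encoding the frames as nested dictionaries and comparing them slot by slot. The nesting (DAMAGE placed under HAVE) is one reading of the bracketing on the slide, empty slots are represented as `None`, and the function names are hypothetical.

```python
def make_flood_frame(stau_type):
    """Build the flood frame from the slide; empty slots are None."""
    return {
        "BE": {"TYPE": "flood", "PLACE": None, "TIME": None},
        "HAVE": {
            "CAUSE": {
                "ORIGIN": None,
                "NIEDERSCHLAG": {"TYPE": None},
                "STAU": {"TYPE": stau_type},
            },
            # Assumption: DAMAGE is nested under HAVE, following the brackets.
            "DAMAGE": {"TARGET": None, "SOURCE": None, "DEGREE": None},
        },
        "HAPPEN": {"STATES": None, "PROCESSES": None},
    }


def frame_diff(a, b, path=()):
    """Return {slot path: (value in a, value in b)} for slots that differ."""
    diffs = {}
    for key in sorted(set(a) | set(b)):
        left, right = a.get(key), b.get(key)
        here = path + (key,)
        if isinstance(left, dict) and isinstance(right, dict):
            diffs.update(frame_diff(left, right, here))
        elif left != right:
            diffs[".".join(here)] = (left, right)
    return diffs


afflux = make_flood_frame("Aufstau")
backwater = make_flood_frame("Rückstau")
print(frame_diff(afflux, backwater))
```

A comparison of this kind shows which slot distinguishes two closely related terms, which is the information a translator or ontology engineer needs when deciding whether two terms designate the same concept.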
Relationship modeling
[Diagram. Nodes shown: disaster; general; origin; cause; manmade; natural;
type; precipitation; snow; hail; rain; „Stau“; Aufstau `afflux`;
Rückstau `backwater`; im Entwässerungssystem `drainage flood`]
TBX-SKOS interoperability
• Differences
– XML vs. RDF (-> TBX will be turned into TBR)
– Inherent flexibility + “open” data modeling for a large
variety of resources vs. traditional thesaurus data
model as a default for a KOS (different scopes)
– TBX has documented use cases and mapping tools
-> language industry standard
– Different semantics + vocabularies (12620 vs.
thesaurus standard)
• Commonalities
– Conceptual approach
– W3C
– Integrated applications in the future
• Vocabulary mapping (RDF)
Term Entry Level (Level 1)
[Diagram of the Terminological (Concept) Entry (TE) level within the
Terminological Data Collection (TDC), alongside Global Information (GI)
and Complementary Information (CI). Data-category groups shown:
concept-related data categories, concept system data categories,
administrative data categories, notes. Data categories shown:
Subject Field, Definition, Note, Transaction, Date, Responsibility,
SourceID.]
Language Section Level (Level 2)
[Diagram of the Language Section (LS) level: one Terminological Entry
with n Language Sections. Data-category groups shown: concept-related
data categories, concept system data categories, language-related data
categories, administrative data categories, notes. Data categories
shown: Definition, xml:lang, Note, Transaction, Date, Transfer comment,
Responsibility, Equivalence, SourceID.]
Term-Level 3
[Diagram of the Term Section (TS) level: one Language Section (LS)
with n Term Sections, each holding a Term. Data-category groups shown:
concept-related data categories, term-related data categories (TRD),
administrative data categories, notes. Data categories shown:
Definition, Note, Context, Transaction, Date, Transfer comment,
Responsibility, SourceID.]
SKOS Vocabulary
• SKOS Core is a model for expressing the structure and
content of concept schemes (thesauri, classification
schemes, subject heading lists, taxonomies,
terminologies, glossaries and other types of controlled
vocabulary).
• The SKOS Core Vocabulary is an application of the
Resource Description Framework (RDF), that can be
used to express a concept scheme as an RDF graph.
Using RDF allows data to be linked to and/or merged
with other RDF data by semantic web applications.
SKOS Graphs
RDF Representation of SKOS Graph
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <skos:Collection>
    <rdfs:label>milk by source animal</rdfs:label>
    <skos:member rdf:resource="http://www.example.com/concepts#buffalomilk"/>
    <skos:member rdf:resource="http://www.example.com/concepts#cowmilk"/>
    <skos:member rdf:resource="http://www.example.com/concepts#goatmilk"/>
    <skos:member rdf:resource="http://www.example.com/concepts#sheepmilk"/>
  </skos:Collection>
  <skos:Concept rdf:about="http://www.example.com/concepts#buffalomilk">
    <skos:prefLabel>buffalo milk</skos:prefLabel>
  </skos:Concept>
  <skos:Concept rdf:about="http://www.example.com/concepts#cowmilk">
    <skos:prefLabel>cow milk</skos:prefLabel>
  </skos:Concept>
  <skos:Concept rdf:about="http://www.example.com/concepts#goatmilk">
    <skos:prefLabel>goat milk</skos:prefLabel>
  </skos:Concept>
  <skos:Concept rdf:about="http://www.example.com/concepts#sheepmilk">
    <skos:prefLabel>sheep milk</skos:prefLabel>
  </skos:Concept>
</rdf:RDF>
Mapping TBX/12620 DatCats
to SKOS Vocabulary
• TBX data categories (data element concepts in
the sense of ISO/IEC 11179-3) contain
instantiations of information that are expressed
in SKOS using SKOS core vocabulary.
• Interoperability (a cross-walk between the two
standards) depends on mapping between the
two systems
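What such a cross-walk could look like is sketched below in Python. The mapping table is purely illustrative and is not taken from any published TBX-SKOS mapping: pairing the `termType` values `entryTerm` and `synonym` with `skos:prefLabel` and `skos:altLabel` is an assumption, as are the simplified entry layout, the function name and the example base URI. Triples are plain tuples so that no RDF library is required.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Illustrative cross-walk only: TBX termType value -> SKOS labelling property.
TERM_TYPE_TO_SKOS = {
    "entryTerm": SKOS + "prefLabel",
    "synonym": SKOS + "altLabel",
}


def entry_to_triples(entry, base="http://www.example.com/concepts#"):
    """Convert a simplified concept entry into (subject, predicate, object) triples."""
    subject = base + entry["id"]
    triples = [(subject, RDF_TYPE, SKOS + "Concept")]
    if entry.get("definition"):
        triples.append((subject, SKOS + "definition", entry["definition"]))
    for term in entry["terms"]:
        # Unmapped term types fall back to altLabel in this sketch.
        predicate = TERM_TYPE_TO_SKOS.get(term["termType"], SKOS + "altLabel")
        # Literal objects are (text, language tag) pairs.
        triples.append((subject, predicate, (term["text"], term["lang"])))
    return triples


# Simplified entry modelled on the MLTMS sample.
entry = {
    "id": "eid-VocCod-211.01",
    "terms": [
        {"text": "copiste", "lang": "fr", "termType": "entryTerm"},
        {"text": "écrivain", "lang": "fr", "termType": "synonym"},
        {"text": "scribe", "lang": "en", "termType": "entryTerm"},
    ],
}

for triple in entry_to_triples(entry):
    print(triple)
```

Because a concept entry groups terms by language, one preferred label per language falls out naturally; data categories without an obvious SKOS counterpart are the cases where the mapping needs explicit decisions.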
Terminological knowledge engineering framework
1. Meta-modeling level
– Terminology Markup Framework (TMF), Lexical Markup Framework (LMF)
– UML, RDF, XML
– ISO 12620 in RDF is the meta-ontology
– DC selection (subsets, value sets)
2. Modeling level
– Terminology Markup Languages, e.g. TBX
– Lexical Markup, e.g. LexML (ISO 1951)
– XML
3. Resource level
– Terminological resources
– Lexical resources
– Markup, Annotation, Alignment, Analysis, Term Extraction
4. Workflow level
– ISO 15188 and other workflow specifications govern resource management processes
(logistics, organizational measures, maintenance, quality assurance, etc.)
Framework integration
[Diagram. Labels shown: ontology engineering framework; translation
engineering framework; terminology and language engineering framework;
interoperable integrative multilingual applications, e.g. MULTH-WIN project]
Thank you for your attention
Acknowledgements:
Slides 9-27 together with Alan Melby, Sue Ellen Wright
Slides 28-29 Bruce Bargmeyer
Slide 33 Entry from the WordNet database
Slides 35-37 WIN project (Rothkegel)
Slides 39-42 Flood Risk Project
Slide 44: WIN; Slide 45: ThesShow (Legat/Stallbaumer)
Slide 46: GEMET; Slide 47: Bandholtz; Slides 48/49: Gangemi
Slides 56-59: Miles/SKOS