Semantic relations


The Global Wordnet Grid: anchoring
languages to universal meaning
Piek Vossen
Irion Technologies/Free University of Amsterdam
Overview
• Wordnet, EuroWordNet background
• Architecture of the Global Wordnet Grid
• Mapping wordnets to the Grid
• Advantages of shared knowledge structure
• 7th Framework project KYOTO
WordNet1.5
• Semantic network in which concepts are defined in
terms of relations to other concepts.
• Structure:
– organized around the notion of synsets (sets of synonymous words)
– basic semantic relations between these synsets
• Developed at Princeton by George Miller and his team as a model of the mental lexicon.
• http://www.cogsci.princeton.edu/~wn/w3wn.html
Relational model of meaning
[Figure: network relating the words animal, man, woman, boy, girl, meisje, cat, dog, kitten, puppy]
Structure of WordNet
[Figure: fragment of the WordNet graph with hyperonym and meronym links]
Hyperonym chain: {conveyance; transport} <- {vehicle} <- {motor vehicle; automotive vehicle} <- {car; auto; automobile; machine; motorcar} <- {cruiser; squad car; patrol car; police car; prowl car}, {cab; taxi; hack; taxicab}
Synsets attached by meronym links: {bumper}, {car door}, {car window}, {car mirror}, {doorlock}, {hinge; flexible joint}, {armrest}
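The hyperonym chain in this figure can be sketched as a small linked structure. This is a minimal illustration only: the class and function names are hypothetical, and the chain follows one reading of the figure (taxi under car under motor vehicle, and so on); it is not the WordNet database format.

```python
from dataclasses import dataclass

@dataclass
class Synset:
    """A set of synonymous words with an optional link to its hyperonym."""
    words: tuple
    hyperonym: "Synset | None" = None

def hyperonym_chain(synset):
    """Return the first word of each synset from `synset` up to the top."""
    chain = []
    while synset is not None:
        chain.append(synset.words[0])
        synset = synset.hyperonym
    return chain

# Synsets copied from the figure; the links are an assumed reading of it.
conveyance = Synset(("conveyance", "transport"))
vehicle = Synset(("vehicle",), hyperonym=conveyance)
motor_vehicle = Synset(("motor vehicle", "automotive vehicle"), hyperonym=vehicle)
car = Synset(("car", "auto", "automobile", "machine", "motorcar"),
             hyperonym=motor_vehicle)
taxi = Synset(("cab", "taxi", "hack", "taxicab"), hyperonym=car)

print(hyperonym_chain(taxi))
# ['cab', 'car', 'motor vehicle', 'vehicle', 'conveyance']
```

Meronym links could be added in the same way as a second, list-valued field on each synset.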
Wordnet Data Model
[Figure: three linked layers: Relations (type-of, part-of), Concepts, and Vocabulary of a language]
Concepts:
rec: 12345 (sense 1) - financial institute
rec: 54321 (sense 2) - side of a river
rec: 9876 - small string instrument
rec: 65438 - musician playing violin
rec: 42654 - musician
rec: 35576 (sense 1) - string of instrument
rec: 29551 (sense 2) - underwear
rec: 25876 - string instrument
Vocabulary of a language: bank (senses 1, 2), fiddle, violin, fiddler, violist, string
Usage of Wordnet
• Improve recall of text-based analysis:
– Query -> Index
  • Synonyms: commence – begin
  • Hypernyms: taxi -> car
  • Hyponyms: car -> taxi
  • Meronyms: trunk -> elephant
  • Lexical entailments: gun -> shoot
• Inferencing:
– what things can burn?
• Expression in language generation and translation:
– alternative words and paraphrases
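The recall improvement from Query -> Index expansion can be sketched as follows. The relation tables hold only the examples given on this slide, and the function name and its flags are hypothetical; a real system would look the relations up in a wordnet.

```python
# Toy relation tables built from the slide's own examples.
SYNONYMS = {"commence": {"begin"}, "begin": {"commence"}}
HYPERNYMS = {"taxi": {"car"}}
HYPONYMS = {"car": {"taxi"}}

def expand_query(terms, use_hyponyms=True, use_hypernyms=False):
    """Add related words to a query so that more indexed texts match."""
    expanded = set(terms)
    for term in terms:
        expanded |= SYNONYMS.get(term, set())
        if use_hyponyms:
            expanded |= HYPONYMS.get(term, set())
        if use_hypernyms:
            expanded |= HYPERNYMS.get(term, set())
    return expanded

print(sorted(expand_query(["commence", "car"])))
# ['begin', 'car', 'commence', 'taxi']
```

Expanding with hyponyms is switched on by default here because a text about a taxi is also about a car, whereas hypernym expansion broadens the query and is left optional.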
Improve recall
• Information retrieval:
– small databases without redundancy, e.g. image
captions, video text
• Text classification:
– small training sets
• Question & Answer systems
– query analysis: who, whom, where, what, when
Improve recall
• Anaphora resolution:
– The girl fell off the table. She....
– The glass fell off the table. It...
• Coreference resolution:
– When he moved the furniture, the antique table got
damaged.
• Information extraction (unstructured text to
structured databases):
– generic forms or patterns "vehicle" -> text with
specific cases "car"
Improve recall
• Summarizers:
– Sentence selection based on word counts ->
concept counts
– Avoid repetition in summary -> language
generation
• Limited inferencing: detect locations,
organisations, etc.
Many others
• Data sparseness for machine learning:
hapaxes can be replaced by semantic classes
• Use redundancy for more robustness:
spelling correction and speech recognition
can build semantic expectations using
Wordnet and make better choices
• Sentiment and opinion mining
• Natural language learning
EuroWordNet
• The development of a multilingual database with wordnets
for several European languages
• Funded by the European Commission, DG XIII,
Luxembourg as projects LE2-4003 and LE4-8328
• March 1996 - September 1999
• 2.5 Million EURO.
• http://www.hum.uva.nl/~ewn
• http://www.illc.uva.nl/EuroWordNet/finalresultsewn.html
EuroWordNet
• Languages covered:
– EuroWordNet-1 (LE2-4003): English, Dutch, Spanish, Italian
– EuroWordNet-2 (LE4-8328): German, French, Czech, Estonian.
• Size of vocabulary:
– EuroWordNet-1: 30,000 concepts - 50,000 word meanings.
– EuroWordNet-2: 15,000 concepts - 25,000 word meanings.
• Type of vocabulary:
– the most frequent words of the languages
– all concepts needed to relate more specific concepts
Wordnet family
• Princeton WordNet (Fellbaum 1998): 115,000 concepts
• EuroWordNet (Vossen 1998): 8 languages
• BalkaNet (Tufis 2004): 6 languages
• Global Wordnet Association: all languages
[Figure: EuroWordNet architecture. The wordnets of the individual languages are linked through the Inter-Lingual-Index, which is connected to ontologies (SUMO, DOLCE) and to a domain hierarchy.]
Ontology nodes shown: Object, Device, TransportDevice
Domain nodes shown: Transport, Road, Air, Water
Words shown per language:
English: vehicle, car, train
Dutch: voertuig, auto, trein
German: Fahrzeug, Auto, Zug
Spanish: vehículo, auto, tren
Italian: veicolo, auto, treno
French: véhicule, voiture, train
Czech: dopravní prostředník, auto, vlak
Estonian: liiklusvahend, auto, killavoor
EuroWordNet
• Wordnets are unique language-specific structures:
– different lexicalizations
– differences in synonymy and homonymy
– different relations between synsets
– same organizational principles: synset structure and same set of semantic relations.
• Language-independent knowledge is assigned to the ILI and can thus be shared by all
languages linked to the ILI: both an ontology and a domain hierarchy
Autonomous & Language-Specific
[Figure: two hierarchies shown side by side]
Wordnet1.5 nodes: object; artifact, artefact (a man-made object); natural object (an object occurring naturally); block; body; instrumentality; implement; device; container; tool; instrument; box; spoon; bag
Dutch Wordnet nodes: voorwerp {object}; blok {block}; lichaam {body}; werktuig {tool}; bak {box}; lepel {spoon}; tas {bag}
Linguistic versus Artificial Ontologies
Artificial ontology:
• better control or performance, or a more compact and
coherent structure.
• introduce artificial levels for concepts which are not
lexicalized in a language (e.g. instrumentality, hand tool),
• neglect levels which are lexicalized but not relevant for the
purpose of the ontology (e.g. tableware, silverware,
merchandise).
What properties can we infer for spoons?
spoon -> container; artifact; hand tool; object; made of metal or
plastic; for eating, pouring or cooking
Linguistic versus Artificial Ontologies
Linguistic ontology:
• Exactly reflects the relations between all the lexicalized words and
expressions in a language.
• Captures valuable information about the lexical capacity of
languages: what is the available fund of words and expressions in a
language.
What words can be used to name spoons?
spoon -> object, tableware, silverware, merchandise, cutlery,
Wordnets versus ontologies
• Wordnets:
– autonomous language-specific lexicalization patterns in a relational network.
– Usage: to predict substitution in text for information retrieval, text generation,
machine translation, word-sense disambiguation.
• Ontologies:
– data structure with formally defined concepts.
– Usage: making semantic inferences.
The Multilingual Design
• Inter-Lingual-Index: unstructured fund of concepts to
provide an efficient mapping across the languages;
• Index-records are mainly based on WordNet synsets and
consist of synonyms, glosses and source references;
• Various types of complex equivalence relations are
distinguished;
• Equivalence relations from synsets to index records: not on a
word-to-word basis;
• Indirect matching of synsets linked to the same index items;
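Indirect matching through the index can be sketched as a two-step lookup: from a synset to its index records, and from those records to the synsets of another language. The words are taken from the EuroWordNet example; the record identifiers such as "ili-vehicle" and all function names are hypothetical.

```python
from collections import defaultdict

# Equivalence links: (language, synset, index record).
# The record identifiers are invented for this illustration.
LINKS = [
    ("en", "vehicle", "ili-vehicle"),
    ("nl", "voertuig", "ili-vehicle"),
    ("de", "Fahrzeug", "ili-vehicle"),
    ("en", "train", "ili-train"),
    ("nl", "trein", "ili-train"),
    ("de", "Zug", "ili-train"),
]

def build_index(links):
    """Index the links in both directions."""
    by_record = defaultdict(set)
    by_synset = defaultdict(set)
    for lang, synset, record in links:
        by_record[record].add((lang, synset))
        by_synset[(lang, synset)].add(record)
    return by_record, by_synset

def translate(lang, synset, target_lang, links=LINKS):
    """Find target-language synsets that share an index record with `synset`."""
    by_record, by_synset = build_index(links)
    matches = set()
    for record in by_synset[(lang, synset)]:
        for other_lang, other in by_record[record]:
            if other_lang == target_lang:
                matches.add(other)
    return matches

print(translate("nl", "trein", "de"))  # {'Zug'}
```

No Dutch-German link is stored anywhere; the match follows only from both synsets pointing at the same index record, so adding a language needs links to the index only.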
Equivalent Near Synonym
1. Multiple Targets (1:many)
Dutch wordnet: schoonmaken (to clean) matches with 4
senses of clean in WordNet1.5:
• make clean by removing dirt, filth, or unwanted substances from
• remove unwanted substances from, such as feathers or pits, as of chickens or fruit
• remove in making clean; "Clean the spots off the rug"
• remove unwanted substances from - (as in chemistry)
2. Multiple Sources (many:1)
Dutch wordnet: versiersel near_synonym versiering
ILI-Record: decoration.
3. Multiple Targets and Sources (many:many)
Dutch wordnet: toestel near_synonym apparaat
ILI-records: machine; device; apparatus; tool
Equivalent Hyperonymy
Typically used for gaps in English WordNet:
• genuine, cultural gaps for things not known in
English culture:
– Dutch: klunen, to walk on skates over land from one
frozen body of water to another
• pragmatic, in the sense that the concept is known but
is not expressed by a single lexicalized form in
English:
– Dutch: kunstproduct = artifact substance <=> artifact
object
From EuroWordNet to Global WordNet
• Currently, wordnets exist for more than 40
languages, including:
• Arabic, Bantu, Basque, Chinese, Bulgarian,
Estonian, Hebrew, Icelandic, Japanese, Kannada,
Korean, Latvian, Nepali, Persian, Romanian,
Sanskrit, Tamil, Thai, Turkish, Zulu...
• Many languages are genetically and typologically
unrelated
• http://www.globalwordnet.org
Some downsides
• Construction is not done uniformly
• Coverage differs
• Not all wordnets can communicate with one
another
• Proprietary rights restrict free access and usage
• A lot of semantics is duplicated
• Complex and obscure equivalence relations due to
linguistic differences between English and other
languages
Next step: Global WordNet Grid
[Figure: Global Wordnet Grid architecture. The wordnets of the individual languages are linked to a shared Inter-Lingual Ontology.]
Ontology nodes shown: Object, Device, TransportDevice
Words shown per language:
English: vehicle, car, train
Dutch: voertuig, auto, trein
German: Fahrzeug, Auto, Zug
Spanish: vehículo, auto, tren
Italian: veicolo, auto, treno
French: véhicule, voiture, train
Czech: dopravní prostředník, auto, vlak
Estonian: liiklusvahend, auto, killavoor
GWNG: Main Features
• Construct separate wordnets for each Grid
language
• Contributors from each language encode the
same core set of concepts plus
culture/language-specific ones
• Synsets (concepts) can be mapped
crosslinguistically via an ontology
• No license constraints, freely available
The Ontology: Main Features
• Formal, artificial ontology serves as
universal index of concepts
• List of concepts is not just based on the
lexicon of a particular language (unlike in
EuroWordNet) but uses ontological
observations
• Concepts are related in a type hierarchy
• Concepts are defined with axioms
The Ontology: Main Features
• In addition to high-level (“primitive”) concepts, the
ontology needs to express low-level concepts
lexicalized in the Grid languages
• Additional concepts can be defined with
expressions in Knowledge Interchange Format
(KIF), based on first-order predicate calculus and
atomic elements
The Ontology: Main Features
• Minimal set of concepts (Reductionist view):
– to express equivalence across languages
– to support inferencing
• Ontology must be powerful enough to encode all
concepts that are lexically expressed in any of the
Grid languages
The Ontology: Main Features
• Ontology need not and cannot provide a linguistic
encoding for all concepts found in the Grid
languages
– Lexicalization in a language is not sufficient to warrant
inclusion in the ontology
– Lexicalization in all or many languages may be
sufficient
• Ontological observations will be used to define the
concepts in the ontology
Ontological observations
• Identity criteria as used in OntoClean (Guarino &
Welty 2002):
– rigidity: to what extent are properties true for entities
in all worlds? You are always a human, but you can be
a student for a short while.
– essence: what properties are essential for an entity?
Shape is essential for a statue but not for the clay it is
made of.
– unicity: what represents a whole and what entities are
parts of these wholes? An ocean is a whole but the
water it contains is not.
Type-role distinction
• Current WordNet treatment:
(1) a husky is a kind of dog (type)
(2) a husky is a kind of working dog (role)
• What’s wrong?
(2) is defeasible, (1) is not:
*This husky is not a dog
This husky is not a working dog
Other roles: watchdog, sheepdog, herding dog,
lapdog, etc….
Ontology and lexicon
• Hierarchy of disjunct types:
Canine -> PoodleDog; NewfoundlandDog; GermanShepherdDog; Husky
• Lexicon:
– NAMES for TYPES:
{poodle}EN, {poedel}NL, {pudoru}JP
(instance x Poodle)
– LABELS for ROLES:
{watchdog}EN, {waakhond}NL, {banken}JP
((instance x Canine) and (role x GuardingProcess))
Ontology and lexicon
• Hierarchy of disjunct types:
River; Clay; etc…
• Lexicon:
– NAMES for TYPES:
{river}EN, {rivier, stroom}NL
(instance x River)
– LABELS for dependent concepts:
{rivierwater}NL (water from a river => water is not Unit)
((instance x water) and (instance y River) and (portion x y))
{kleibrok}NL (irregularly shaped piece of clay => Non-essential)
((instance x Object) and (instance y Clay) and (portion x y) and (shape x Irregular))
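The split between names for types and labels for roles or dependent concepts can be sketched with definitions stored as lists of clauses. The tuple encoding, the TYPES set and the function name are assumptions made for this illustration; the clause contents follow the slides, which use Poodle in the expression although the hierarchy lists PoodleDog.

```python
# Types assumed to be present in the type hierarchy.
TYPES = {"Canine", "Poodle", "River", "Clay"}

# Each lexicon entry maps (word, language) to a conjunction of clauses.
LEXICON = {
    ("poodle", "EN"): [("instance", "x", "Poodle")],
    ("poedel", "NL"): [("instance", "x", "Poodle")],
    ("watchdog", "EN"): [("instance", "x", "Canine"),
                         ("role", "x", "GuardingProcess")],
    ("river", "EN"): [("instance", "x", "River")],
    ("rivierwater", "NL"): [("instance", "x", "water"),
                            ("instance", "y", "River"),
                            ("portion", "x", "y")],
}

def is_type_name(definition):
    """True when the definition is a single `instance` clause on a known type."""
    if len(definition) != 1:
        return False
    predicate, _variable, target = definition[0]
    return predicate == "instance" and target in TYPES

for entry, definition in LEXICON.items():
    kind = "name for a type" if is_type_name(definition) else "label"
    print(entry, "->", kind)
```

Under this sketch, watchdog and rivierwater add nothing to the type hierarchy; they are described entirely by expressions over existing types.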
Rigidity
• The “primitive” concepts represented in the
ontology are rigid types
• Entities with non-rigid properties will be
represented with KIF statements
• But: ontology may include some universal,
core concepts referring to roles like father,
mother
Properties of the Ontology
• Minimal: terms are distinguished by
essential properties only
• Comprehensive: includes all distinct
concept types of all Grid languages
• Allows definitions via KIF of all lexemes
that express non-rigid, non-essential
properties of types
• Logically valid, allows inferencing
Mapping Grid Languages onto
the Ontology
• Explicit and precise equivalence relations among synsets in
different languages, which become easier to state because:
– the type hierarchy is minimal
– subtle differences can be encoded in KIF expressions
• Grid database contains wordnets with synsets that label
– either “primitive” types in the hierarchies,
– or words relating to these types in ways made explicit in KIF
expressions
• If two languages create the same KIF expression, this is a statement
of equivalence!
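The idea that identical expressions amount to a statement of equivalence can be sketched by grouping synsets on a normalized form of their definition. The tuple encoding and function names are hypothetical, and the normalization only ignores the order of conjuncts; it does not handle renamed variables or logically equivalent rewrites.

```python
from collections import defaultdict

def normalize(clauses):
    """The order of conjuncts does not matter, so compare them as a set."""
    return frozenset(clauses)

def equivalence_classes(lexicon):
    """Group synsets from any language that carry the same expression."""
    groups = defaultdict(set)
    for synset, clauses in lexicon.items():
        groups[normalize(clauses)].add(synset)
    return [members for members in groups.values() if len(members) > 1]

LEXICON = {
    "watchdog@EN": [("instance", "x", "Canine"),
                    ("role", "x", "GuardingProcess")],
    "waakhond@NL": [("role", "x", "GuardingProcess"),
                    ("instance", "x", "Canine")],
    "straathond@NL": [("instance", "x", "Canine"),
                      ("habitat", "x", "Street")],
}

print(equivalence_classes(LEXICON))
```

Here watchdog and waakhond end up in one class without any direct English-Dutch link, while straathond stays on its own.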
How to construct the GWNG
• Take an existing ontology as starting point;
• Use English WordNet to maximize the
number of disjunct types in the ontology;
• Link English WordNet synsets as names to
the disjunct types;
• Provide KIF expressions for all other
English words and synsets
How to construct the GWNG
• Copy the relations from the English WordNet to the
ontology over to other languages, including the KIF
statements built for English
• Revise KIF statements to make the mapping more
precise
• Map all words and synsets that are not, and cannot be,
mapped to English WordNet onto the ontology:
– propose extensions to the type hierarchy
– create KIF expressions for all non-rigid concepts
Initial Ontology: SUMO
(Niles and Pease)
SUMO = Suggested Upper Merged Ontology
--consistent with good ontological practice
--fully mapped to WordNet(s): 1000 equivalence
mappings, the rest through subsumption
--freely and publicly available
--allows data interoperability
--allows NLP
--allows reasoning/inferencing
Mapping Grid languages onto the
Ontology
• Check existing SUMO mappings to
Princeton WordNet -> extend the ontology
with rigid types for specific concepts
• Extend it to many other WordNet synsets
• Observe OntoClean principles! (Synsets
referring to non-rigid, non-essential, non-unicitous
concepts must be expressed in KIF)
Lexicalizations not mapped to WordNet
• Not added to the type hierarchy:
{straathond}NL (a dog that lives in the streets)
((instance x Canine) and (habitat x Street))
• Added to the type hierarchy:
{klunen}NL (to walk on skates from one frozen body to
the next over land)
KluunProcess => WalkProcess
Axioms:
(and (instance x Human) (instance y Walk) (instance z
Skates) (wear x z) (instance s1 Skate) (instance s2
Skate) (before s1 y) (before y s2) etc…
• National dishes, customs, games,....
Most mismatching concepts are not
new types
• Refer to sets of types in specific circumstances or
to concepts that are dependent on these types; next
to {rivierwater}NL there are many others:
{theewater}NL (water used for making tea)
{koffiewater}NL (water used for making coffee)
{bluswater}NL (water used for extinguishing fire)
• Relate to linguistic phenomena:
– gender, perspective, aspect, diminutives, politeness,
pejoratives, part-of-speech constraints
KIF expression for gender marking
• {teacher}EN
((instance x Human) and (agent x
TeachingProcess))
• {Lehrer}DE ((instance x Man) and (agent
x TeachingProcess))
• {Lehrerin}DE ((instance x Woman) and
(agent x TeachingProcess))
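How the gender-marked German words relate to the English one can be sketched with a subsumption check over the types used in the expressions. That Man and Woman are subtypes of Human is an assumption about the type hierarchy, and the table layout and function names are hypothetical.

```python
# Assumed fragment of the type hierarchy: child type -> parent type.
PARENT = {"Man": "Human", "Woman": "Human"}

# (type of x, process in which x is the agent), following the slide.
DEFINITIONS = {
    "teacher@EN": ("Human", "TeachingProcess"),
    "Lehrer@DE": ("Man", "TeachingProcess"),
    "Lehrerin@DE": ("Woman", "TeachingProcess"),
}

def is_subtype(type_name, ancestor):
    """Walk up the hierarchy from `type_name` looking for `ancestor`."""
    while type_name is not None:
        if type_name == ancestor:
            return True
        type_name = PARENT.get(type_name)
    return False

def covers(general, specific):
    """True if everything described by `specific` is also described by `general`."""
    general_type, general_process = DEFINITIONS[general]
    specific_type, specific_process = DEFINITIONS[specific]
    return general_process == specific_process and is_subtype(specific_type, general_type)

print(covers("teacher@EN", "Lehrerin@DE"))  # True
print(covers("Lehrer@DE", "Lehrerin@DE"))   # False
```

The asymmetry is the point: a Lehrerin can always be called a teacher, but teacher does not determine which German word applies.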
KIF expression for perspective
sell: subj(x), direct obj(z),indirect obj(y)
versus
buy: subj(y), direct obj(z),indirect obj(x)
(and (instance x Human) (instance y Human)
(instance z Entity) (instance e FinancialTransaction)
(source x e) (destination y e) (patient z e))
The same process, but a different perspective through subject
and object realization: Russian has two verbs for marry, and
French apprendre can mean both teach and learn
Parallel Noun and Verb hierarchy
Encoded once as a Process in the ontology!
Nouns                    Verbs
• event                  • to happen
– act                    – to act
• deed                   • to do
– sale                   – to sell
– promise                – to promise
– change                 – to change
• movement               • to move
– change of location     – to move position
Part-of-speech mismatches
• {bankdrukken-V}NL vs.{bench press-N}EN
• {gehuil-N}NL vs. {cry-V}EN
• {afsluiting-N}NL vs. {close-V}EN
• Process in the ontology is neutral with respect
to POS!
Aspectual variants
• Slavic languages: two members of a verb pair for an
ongoing event and a completed event.
• English: can mark perfectivity with particles, as in the
phrasal verbs eat up and read through.
• Romance languages: mark aspect by verb conjugations on
the same verb.
• Dutch: verbs with marked aspect can be created by
prefixing a verb with door: doorademen, dooreten,
doorfietsen, doorlezen, doorpraten (continue to
breathe/eat/bike/read/talk).
• These verbs are restrictions on phases of the same
process
• Which does NOT warrant the extension of the ontology
with separate processes for each aspectual variant
Aspectual lexicalization
• Regular compositional verb structures:
doorademen: (lit. through+breathe, continue to breathe)
doorbetalen: (lit. through+pay, continue to pay)
doorlopen: (lit. through+walk, continue to walk)
doorfietsen: (lit. through+cycle, continue to cycle)
doorrijden: (lit. through+drive, continue to drive)
(and (instance x BreathProcess) (instance y Time)
(instance z Time) (end x z) (expected (end x y))
(after z y))
Lexicalization of Resultatives
• MORE GENERAL VERBS:
openmaken: (lit. open+make, to cause to be open)
dichtmaken: (lit. closed+make, to cause to be closed)
• MORE SPECIFIC VERBS:
openknijpen: (lit. open+squeeze, to open by squeezing)
has_hyperonym knijpen (to squeeze) & openmaken (to open)
opendraaien: (lit. open+turn, to open by turning)
has_hyperonym draaien (to turn) & openmaken (to open)
dichtknijpen: (lit. closed+squeeze, to close by squeezing)
has_hyperonym knijpen (to squeeze) & dichtmaken (to close)
dichtdraaien: (lit. closed+turn, to close by turning)
has_hyperonym draaien (to turn) & dichtmaken (to close)
Kinship relations in Arabic
• عَمّ (Eam~): father's brother, paternal uncle.
• خَال (xaAl): mother's brother, maternal uncle.
• عَمَّة (Eam~ap): father's sister, paternal aunt.
• خَالَة (xaAlap): mother's sister, maternal aunt.
Kinship relations in Arabic
• .........
• شَقِيقَة ($aqiyqap): full sister, sister on the paternal and maternal side (as distinct from أُخْت (>uxot): 'sister', which may refer to a sister from the paternal or maternal side, or both sides).
• ثَكْلان (vakolAna): father bereaved of a child (as opposed to يَتِيم (yatiym) or يَتِيمَة (yatiymap) for feminine: 'orphan', a person whose father or mother died or both father and mother died).
• ثَكْلَى (vakolaYa): mother bereaved of a child (as opposed to يَتِيم or يَتِيمَة for feminine: 'orphan', a person whose father or mother died or both father and mother died).
Complex Kinship concepts
father's brother, paternal uncle
WORDNET
paternal uncle
=> uncle
=> brother of ....????
ONTOLOGY
(=>
(paternalUncle ?P ?UNC)
(exists (?F)
(and
(father ?P ?F)
(brother ?F ?UNC))))
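The consequent of this axiom can be checked against a small fact base, as sketched here. The fact tables, the person names and the function name are hypothetical; the argument order follows the axiom, so (father P F) means F is the father of P and (brother F UNC) means UNC is a brother of F.

```python
# Invented facts for illustration.
FATHER = {("ali", "omar")}     # (father ali omar): omar is the father of ali
BROTHER = {("omar", "yusuf")}  # (brother omar yusuf): yusuf is a brother of omar

def consequent_holds(person, uncle, father=FATHER, brother=BROTHER):
    """Check: exists F such that (father person F) and (brother F uncle)."""
    return any(
        (f, uncle) in brother
        for (child, f) in father
        if child == person
    )

print(consequent_holds("ali", "yusuf"))  # True
print(consequent_holds("ali", "omar"))   # False
```

The axiom as written only states what follows from paternalUncle; using the same check to derive new paternalUncle facts would treat it as a full definition, which is an extra assumption.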
Advantages of the Global Wordnet
Grid
• Shared and uniform world knowledge:
– universal inferencing
– uniform text analysis and interpretation
• More compact and less redundant databases
• A clearer notion of how languages map to
the knowledge
– better criteria for expressing knowledge
– better criteria for understanding variation
Expansion with pure hyponymy
relations
[Figure: dog hierarchy with the nodes dog, hunting dog, puppy, dachshund, lapdog, street dog, poodle, bitch, watchdog, short hair dachshund, long hair dachshund]
Expansion from a type to roles
Expansion with pure hyponymy
relations
[Figure: dog hierarchy with the nodes dog, hunting dog, puppy, dachshund, lapdog, street dog, poodle, bitch, watchdog, short hair dachshund, long hair dachshund]
Expansion from a role to types and other roles
Automotive ontology:
(http://www.ontoprise.de)
Who uses ontologies?
Human dialogues with Alice-bot
Full understanding is
fundamentally impossible BUT?
• How can people communicate?
• How can people communicate with
computers?
• As long as language is effective:
– meaning= to have the desired effect!
– Link language to useful content!
[Figure: diagram connecting Thought, Expression (example: 携帯電話, keitaidenwa, 'mobile phone'), Objects in reality, Ontology, Texts, and Knowledge & information]
Useful and effective behavior:
-reason over knowledge
-collect information and data
-deliver services and be helpful
Concrete goals for GWG
• Global Wordnet Association website:
http://www.globalwordnet.org/gwa/gwa_grid.htm
• 5000 Base Concepts or more:
– English
– Spanish
– Catalan
– Czech, Polish, Dutch, other wordnets
• 7th Framework project KYOTO
KYOTO Project
• 7th Framework project (under negotiation)
• Knowledge Yielding Ontologies for Transition-based
Organisations
• Goal:
– Global Wordnet Grid = ontology + wordnets
– AutoCons = Automatic concept extractors
– Kybots = Knowledge yielding robots
– Wiki environment for encoding domain knowledge in expert groups
– Index and retrieval software for deep semantic search
• Languages: Dutch, English, Spanish, Basque, Italian,
Chinese and Japanese
• Domain of application: environmental organisations
• Period: March/April 2008 - 2011
KYOTO Consortium
Universities
• Vrije Universiteit Amsterdam, Amsterdam, Netherlands
• Consiglio Nazionale delle Ricerche, Pisa, Italy
• Berlin-Brandenburg Academy of Sciences and Humanities, Berlin,
Germany
• Euskal Herriko Unibertsitatea, San Sebastian, Spain
• Academia Sinica, Taipei, Taiwan
• National Institute of Information and Communications Technology,
Kyoto, Japan
• Masaryk University, Brno, Czech Republic
Companies
• Irion Technologies, Delft, Netherlands
• Synthema, Pisa, Italy
Users
• European Centre for Nature Conservation, Tilburg, Netherlands
• World Wide Fund for Nature, Zeist, Netherlands
[Figure: KYOTO system overview. Labels shown: Citizens, Governors, Companies, Environmental organizations; Domain, Experts, Wiki, Capture; Docs, URLs, Images; Concept Mining, Fact Mining; Index, Dialogue, Search; Wordnets; Universal Ontology with levels Top (Abstract, Physical, Process, Substance), Middle (water, CO2) and Domain (water pollution, CO2 emission); wordnet, ontology, domain wordnet, domain ontology]
[Figure: KYOTO workflow with numbered steps 1 to 8. Labels shown: source data, Capture, Text & Meta data in XML Format, Concept Miners, term hierarchy, term relations, Manual Revision, Wiki, DEB Client, DEB Server, Kybots, Data & Facts in XML Format, Indexing, Index, Access end-users, User scenarios, Benchmark data, Benchmarking, Manual Test]
[Figure: Ontology, Wordnets and Linguistic Miners or Kybots. Labels shown: Ontology with Logical Expressions (Abstract, Physical, Process, Chemical Reaction, Substance, water, CO2); Domain (CO2 emission, water pollution); Wordnets with generic words and domain words]
END