No slide title - Home Page of Prof. Dr. Piek Th.J.M. Vossen

Wordnet, EuroWordNet, Global Wordnet
Piek Vossen
[email protected]
http://www.globalwordnet.org
Overview
• Princeton WordNet (1980 - ongoing)
• EuroWordNet (1996 - 1999)
• The database design
• The general building strategy
• Towards a universal index of meaning
• Global WordNet Association (2001 - ongoing)
• Other wordnets
• BalkaNet (2001 - 2004)
• IndoWordnet (2002 - ongoing)
• Meaning (2002 - 2005)
WordNet1.5
• Developed at Princeton by George Miller and his team as a model of the mental lexicon.
• Semantic network in which concepts are defined in terms of relations to other concepts.
• Structure:
   - organized around the notion of synsets (sets of synonymous words)
   - basic semantic relations between these synsets
• Initially no glosses
• Main revision after tagging the Brown corpus with word meanings: SemCor.
• http://www.cogsci.princeton.edu/~wn/w3wn.html
Structure of WordNet1.5
[Figure: fragment of the WordNet1.5 network; the synsets are connected by hyperonym and meronym links.]
Synsets shown in the figure:
{conveyance; transport}
{vehicle}
{motor vehicle; automotive vehicle}
{car; auto; automobile; machine; motorcar}
{cruiser; squad car; patrol car; police car; prowl car}
{cab; taxi; hack; taxicab}
{bumper}
{car door}
{car window}
{car mirror}
{doorlock}
{hinge; flexible joint}
{armrest}
EuroWordNet
• The development of a multilingual database with wordnets for several European languages
• Funded by the European Commission, DG XIII, Luxembourg as projects LE2-4003 and LE4-8328
• March 1996 - September 1999
• 2.5 Million EURO.
• URL: http://www.hum.uva.nl/~ewn
Objectives of EuroWordNet
• Languages covered:
   - EuroWordNet-1 (LE2-4003): English, Dutch, Spanish, Italian
   - EuroWordNet-2 (LE4-8328): German, French, Czech, Estonian.
• Size of vocabulary:
   - EuroWordNet-1: 30,000 concepts - 50,000 word meanings.
   - EuroWordNet-2: 15,000 concepts - 25,000 word meanings.
• Type of vocabulary:
   - the most frequent words of the languages
   - all concepts needed to relate more specific concepts
Consortium
Organization | Country | Task
University of Amsterdam | NL | Project Coordinator & Build the Dutch wordnet
Istituto Di Linguistica Computazionale Pisa | IT | Build the Italian wordnet
Fundacion Universidad Empresa | ES | Build the Spanish wordnet
Université d’Avignon and Memodata at Avignon | FR | Build the French wordnet
Universität Tübingen | DE | Build the German wordnet
University of Masaryk at Brno | CZ | Build the Czech wordnet
University of Tartu, Estonia | EE | Build the Estonian wordnet
University of Sheffield | GB | Adapt the English wordnet
Novell Belgium NV | BE | User; Build the common database
Xerox Research Centre, Meylan | FR | User
Bertin & Cie, Plaisir, Paris | FR | User
The basic principles of EuroWordNet
• the structure of the Princeton WordNet
• the design of the EuroWordNet database
• wordnets as language-specific structures
• the language-internal relations
• the multilingual relations
Specific features of EuroWordNet
• it contains semantic lexicons for languages other than English.
• each wordnet reflects the relations as a language-internal system, maintaining cultural and linguistic differences in the wordnets.
• it contains multilingual relations from each wordnet to English meanings, which makes it possible to compare the wordnets, tracking down inconsistencies and cross-linguistic differences.
• each wordnet is linked to a language-independent top-ontology and to domain labels.
Autonomous & Language-Specific
[Figure: two hierarchy fragments side by side, one for WordNet1.5 and one for the Dutch wordnet.]
WordNet1.5: object; artifact, artefact (a man-made object); natural object (an object occurring naturally); block; body; instrumentality; implement; device; container; tool; instrument; box; spoon; bag
Dutch Wordnet: voorwerp {object}; blok {block}; lichaam {body}; werktuig {tool}; bak {box}; lepel {spoon}; tas {bag}
Differences in structure
•Artificial Classes versus Lexicalized Classes:
instrumentality; natural object
•Lexicalization differences of classes:
container and artifact (object) are not lexicalized in Dutch
•What is the purpose of different hierarchies?
•Should we include all lexicalized classes from all (8)
languages?
Linguistic versus Conceptual
Ontologies
Conceptual ontology:
A particular level or structuring may be required to achieve a
better control or performance, or a more compact and coherent
structure.
• introduce artificial levels for concepts which are not
lexicalized in a language (e.g. instrumentality, hand tool),
• neglect levels which are lexicalized but not relevant for the
purpose of the ontology (e.g. tableware, silverware,
merchandise).
What properties can we infer for spoons?
spoon -> container; artifact; hand tool; object; made of metal or
plastic; for eating, pouring or cooking
Linguistic versus Conceptual
Ontologies
Linguistic ontology:
Exactly reflects the relations between all the lexicalized words and
expressions in a language. It therefore captures valuable
information about the lexical capacity of languages: what is the
available fund of words and expressions in a language.
What words can be used to name spoons?
spoon -> object, tableware, silverware, merchandise, cutlery,
Separate Wordnets and Ontologies
[Figure: language-specific wordnets linked to a language-neutral ontology.]
Language-Specific Wordnets:
   Dutch Wordnet: voorwerp; doos
   WordNet1.5: object; container; box
Language-Neutral Ontology:
   ReferenceOntologyClasses: BOX; ContainerProduct; SolidTangibleThing
   EuroWordNet Top-Ontology: Form: Cubic; Function: Contain; Origin: Artifact; Composition: Whole
Wordnets versus ontologies
Wordnets:
autonomous language-specific lexicalization patterns
in a relational network.
Usage: to predict substitution in text for information retrieval, text generation, machine translation, word-sense disambiguation.
Ontologies:
data structure with formally defined concepts.
Usage: making semantic inferences.
Wordnets as Linguistic Ontologies
Classical Substitution Principle:
Any word that is used to refer to something can be replaced by its synonyms, hyperonyms and hyponyms:
horse -> stallion, mare, pony, mammal, animal, being.
It cannot be referred to by co-hyponyms and co-hyponyms of its hyperonyms:
horse X cat, dog, camel, fish, plant, person, object.
Conceptual Distance Measurement:
Number of hierarchical nodes between words is a measurement of closeness, where the level and the local density of nodes are additional factors.
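The substitution principle and a basic distance measure can be sketched on a toy hierarchy. This is a minimal illustration, assuming a single-parent hyperonym table with invented links (not taken from WordNet); the function names are hypothetical, synonyms are left out, and the distance counts links only, ignoring the level and local density factors mentioned above.

```python
from collections import deque

# Toy hyperonym links (child -> parent); illustrative data only.
HYPERONYM = {
    "stallion": "horse", "mare": "horse", "pony": "horse",
    "horse": "mammal", "cat": "mammal", "dog": "mammal", "camel": "mammal",
    "mammal": "animal", "fish": "animal",
    "animal": "being", "plant": "being", "person": "being",
}

def hyperonyms(word):
    """All ancestors of `word`, nearest first."""
    chain = []
    while word in HYPERONYM:
        word = HYPERONYM[word]
        chain.append(word)
    return chain

def hyponyms(word):
    """All descendants of `word` (transitive)."""
    found, queue = set(), deque([word])
    while queue:
        current = queue.popleft()
        for child, parent in HYPERONYM.items():
            if parent == current and child not in found:
                found.add(child)
                queue.append(child)
    return found

def can_substitute(word, candidate):
    """Hyperonyms and hyponyms may replace `word`; co-hyponyms may not."""
    return candidate in hyperonyms(word) or candidate in hyponyms(word)

def distance(a, b):
    """Number of hyperonym links on the path between a and b, or None."""
    up_a = [a] + hyperonyms(a)
    up_b = [b] + hyperonyms(b)
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            return steps_a + up_b.index(node)
    return None
```

With this table, `can_substitute("horse", "pony")` and `can_substitute("horse", "animal")` hold, while `can_substitute("horse", "cat")` does not, mirroring the horse example.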
Linguistic Principles for deriving relations
1. Substitution tests (Cruse 1986):
1 a. It is a fiddle therefore it is a violin.
  b. It is a violin therefore it is a fiddle.
2 a. It is a dog therefore it is an animal.
  b. *It is an animal therefore it is a dog.
3 a. to kill (/a murder) causes to die (/death)
     to kill (/a murder) has to die (/death) as a consequence
  b. *to die / death causes to kill
     *to die / death has to kill as a consequence
Linguistic Principles for
deriving relations
2. Principle of Economy (Dik 1978):
If a word W1 (animal) is the hyperonym of W2 (mammal) and W2
is the hyperonym of W3 (dog) then W3 (dog) should not be linked to
W1 (animal) but to W2 (mammal).
3. Principle of Compatibility
If a word W1 is related to W2 via relation R1, W1 and W2 cannot be
related via relation Rn, where Rn is defined as a distinct relation
from R1.
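Both principles lend themselves to simple consistency checks. The sketch below is one possible reading, not the project's own tooling: the function names are hypothetical, hyperonym links are given as (child, parent) pairs, and relations as (source, relation, target) triples.

```python
def _reachable(parents, start, target):
    """True if `target` can be reached from `start` by following parent links."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent == target:
                return True
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return False

def redundant_hyperonym_links(links):
    """Principle of Economy: report direct links that skip an intermediate level."""
    parents = {}
    for child, parent in links:
        parents.setdefault(child, set()).add(parent)
    redundant = []
    for child, parent in links:
        # The link is redundant if `parent` is also reachable via another parent.
        others = parents[child] - {parent}
        if any(_reachable(parents, other, parent) for other in others):
            redundant.append((child, parent))
    return redundant

def incompatible_pairs(relations):
    """Principle of Compatibility: report pairs linked by more than one relation."""
    seen = {}
    for source, relation, target in relations:
        seen.setdefault((source, target), set()).add(relation)
    return {pair: rels for pair, rels in seen.items() if len(rels) > 1}
```

For the example in the text, a direct link from dog to animal is flagged once dog is linked to mammal and mammal to animal.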
Architecture of the EuroWordNet Data Base
[Figure: the language-specific wordnets (Lexical Items Tables) are connected through the Inter-Lingual-Index, which is linked to the Ontology and to the Domains.]
Domains: Traffic; Air; Road
Ontology: 2OrderEntity; Location; Dynamic
Inter-Lingual-Index: ILI-record {drive}
Lexical Items Tables:
   English: move; go; ride; drive
   Dutch: bewegen; gaan; rijden; berijden
   Spanish: mover; transitar; conducir; cabalgar; jinetear
   Italian: muoversi; andare; guidare; cavalcare
Link types:
I = Language Independent link
II = Link from Language Specific to Inter-Lingual-Index
III = Language Dependent Link
The mono-lingual design of EuroWordNet
Language Internal Relations
WN 1.5 starting point
The ‘synset’ as a weak notion of synonymy:
“two expressions are synonymous in a linguistic context C if the substitution of one for the other in C does not alter the truth value.” (Miller et al. 1993)
Relations between synsets:
Relation | POS-combination | Example
ANTONYMY | adjective-to-adjective; verb-to-verb | open/close
HYPONYMY | noun-to-noun; verb-to-verb | car/vehicle; walk/move
MERONYMY | noun-to-noun | head/nose
ENTAILMENT | verb-to-verb | buy/pay
CAUSE | verb-to-verb | kill/die
Differences
EuroWordNet/WordNet1.5
• Added Features to relations
• Cross-Part-Of-Speech relations
• New relations to differentiate shallow hierarchies
• New interpretations of relations
EWN Relationship Labels
Disjunction/Conjunction of multiple relations of the same type
WordNet1.5
door1 -- (a swinging or sliding barrier that will close the entrance to a room or
building; "he knocked on the door"; "he slammed the door as he left")
PART OF: doorway, door, entree, entry, portal, room access
door 6 -- (a swinging or sliding barrier that will close off access into a car;
"she forgot to lock the doors of her car") PART OF: car, auto, automobile,
machine, motorcar.
EWN Relationship Labels
{airplane}
   HAS_MERO_PART: conj1         {door}
   HAS_MERO_PART: conj2 disj1   {jet engine}
   HAS_MERO_PART: conj2 disj2   {propeller}
{door}
   HAS_HOLO_PART: disj1   {car}
   HAS_HOLO_PART: disj2   {room}
   HAS_HOLO_PART: disj3   {entrance}
{dog}
   HAS_HYPERONYM: conj1   {mammal}
   HAS_HYPERONYM: conj2   {pet}
{albino}
   HAS_HYPERONYM: disj1   {plant}
   HAS_HYPERONYM: disj2   {animal}
Default Interpretation: non-exclusive disjunction
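One way to read the conjunction and disjunction labels is as a small boolean structure: every conjunct must hold, and the disjuncts inside a conjunct are alternatives. The sketch below encodes that reading; the tuple layout and the `satisfies` helper are assumptions made for illustration, not the EuroWordNet database format.

```python
from collections import defaultdict

# (target synset, conjunction index or None, disjunction index or None)
AIRPLANE_PARTS = [
    ("door", 1, None),
    ("jet engine", 2, 1),
    ("propeller", 2, 2),
]
DOOR_WHOLES = [
    ("car", None, 1),
    ("room", None, 2),
    ("entrance", None, 3),
]

def satisfies(labelled_targets, present):
    """True if `present` covers every conjunct of the labelled relation set.

    Targets sharing a conjunction index are alternatives (non-exclusive
    disjunction); targets without a conjunction index form one group.
    """
    groups = defaultdict(list)
    for target, conj, _disj in labelled_targets:
        groups[conj].append(target)
    return all(
        any(target in present for target in alternatives)
        for alternatives in groups.values()
    )
```

Under this reading an airplane needs a door and at least one of jet engine or propeller, so `satisfies(AIRPLANE_PARTS, {"door", "propeller"})` holds while `satisfies(AIRPLANE_PARTS, {"door"})` does not.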
EWN Relationship Labels
Disjunction/Conjunction of multiple relations of the same type
{dog}
   HAS_HYPONYM: disj1   {poodle}
   HAS_HYPONYM: disj1   {labrador}
   HAS_HYPONYM:         {sheep dog}   (Orthogonal)
   HAS_HYPONYM:         {watch dog}   (Orthogonal)
Default Interpretation: non-exclusive disjunction
EWN Relationship Labels
Factive/Non-factive CAUSES (Lyons 1977)
factive (default interpretation):
“to kill causes to die”:
{kill}   CAUSES   {die}
non-factive: E1 probably or likely causes event E2 or E1 is intended to cause some event E2:
“to search may cause to find”:
{search}   CAUSES   {find}   non-factive
EWN Relationship Labels
Reversed
In the database every relation must have a reverse counterpart, but there is a difference between relations which are explicitly coded as reverse and automatically reversed relations:
{finger}       HAS_HOLONYM       {hand}
{hand}         HAS_MERONYM       {finger}
{paper-clip}   HAS_MER_MADE_OF   {metal}
{metal}        HAS_HOL_MADE_OF   {paper-clip}   reversed
Negation
{monkey}   HAS_MERO_PART   {tail}
{ape}      HAS_MERO_PART   {tail}   not
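The difference between explicitly coded and automatically reversed relations can be sketched as follows. The relation names come from the slide; the `Relation` class, the `CONVERSE` table and `with_reverses` are hypothetical names for this illustration, and the negation label is left out.

```python
from dataclasses import dataclass

# Relation names from the slide, each paired with its converse.
CONVERSE = {
    "HAS_HOLONYM": "HAS_MERONYM",
    "HAS_MERONYM": "HAS_HOLONYM",
    "HAS_MER_MADE_OF": "HAS_HOL_MADE_OF",
    "HAS_HOL_MADE_OF": "HAS_MER_MADE_OF",
}

@dataclass(frozen=True)
class Relation:
    source: str
    name: str
    target: str
    is_reversed: bool = False  # True only for automatically generated links

def with_reverses(relations):
    """Add a converse link, labelled reversed, wherever none was coded explicitly."""
    explicit = {(r.source, r.name, r.target) for r in relations}
    result = list(relations)
    for r in relations:
        converse = (r.target, CONVERSE[r.name], r.source)
        if converse not in explicit:
            result.append(Relation(*converse, is_reversed=True))
    return result
```

Given finger/hand coded in both directions and paper-clip/metal coded in one, only the link from metal back to paper-clip is generated and labelled as reversed.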
Cross-Part-Of-Speech
relations
WordNet1.5: nouns and verbs are not interrelated by basic semantic
relations such as hyponymy and synonymy:
adornment 2 -> change of state -- (the act of changing something)
adorn 1 -> change, alter -- (cause to change; make different)
EuroWordNet: words of different parts of speech can be inter-linked with
explicit xpos-synonymy, xpos-antonymy and xpos-hyponymy relations:
{adorn V}   XPOS_NEAR_SYNONYM   {adornment N}
Cross-Part-Of-Speech relations
The advantages of such explicit cross-part-of-speech relations are:
• similar words with different parts of speech are grouped together.
• the same information can be coded in an NP or in a sentence. By unifying higher-order nouns and verbs in the same ontology it will be possible to match expressions with very different syntactic structures but comparable content.
• by merging verbs and abstract nouns we can more easily link mismatches across languages that involve a part-of-speech shift. Dutch nouns such as “afsluiting”, “gehuil” are translated with the English verbs “close” and “cry”, respectively.
Entailment in WordNet
WordNet1.5: Entailment indicates the direction of the implication or entailment:
a. + Temporal Inclusion (the two situations partially or totally overlap)
   a.1 co-extensiveness (e.g., to limp/to walk): hyponymy/troponymy
   a.2 proper inclusion (e.g., to snore/to sleep): entailment
b. - Temporal Exclusion (the two situations are temporally disjoint)
   b.1 backward presupposition (e.g., to succeed/to try): entailment
   b.2 cause (e.g., to give/to have)
Subevents in EuroWordNet
Direction of the entailment is expressed by the labels factive and reversed:
{to succeed}   is_caused_by   {to try}       factive
{to try}       causes         {to succeed}   non-factive
Proper inclusion is described by the has_subevent/is_subevent_of relation in combination with the label reversed:
{to snore}   is_subevent_of   {to sleep}
{to sleep}   has_subevent     {to snore}   reversed
{to buy}     has_subevent     {to pay}
{to pay}     is_subevent_of   {to buy}     reversed
The interpretation of the CAUSE relation
WordNet1.5: The causal relation only holds between verbs and it should only apply to temporally disjoint situations.
EuroWordNet: the causal relation will also be applied across different parts of speech:
{to kill} v   causes         {death} n
{death} n     is_caused_by   {to kill} v   reversed
{to kill} v   causes         {dead} a
{dead} a      is_caused_by   {to kill} v   reversed
{murder} n    causes         {death} n
{death} n     is_caused_by   {murder} n    reversed
The interpretation of the
CAUSE relation
Various temporal relationships between the (dynamic/non-dynamic) situations may hold:
• Temporally disjoint: there is no time point when dS1 takes place and also
S2 (which is caused by dS1) (e.g. to shoot/to hit);
• Temporally overlapping: there is at least one time point when both dS1
and S2 take place, and there is at least one time point when dS1 takes
place and S2 (which is caused by dS1) does not yet take place (e.g. to
teach/to learn);
• Temporally co-extensive: whenever dS1 takes place also S2 (which is
caused by dS1) takes place and there is no time point when dS1 takes
place and S2 does not take place, and vice versa (e.g. to feed/to eat).
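The three cases can be made concrete by treating each situation as a set of discrete time points. This is only an illustrative sketch under that assumption; the function name is hypothetical, and the "not yet" condition is approximated by requiring a cause time point before the earliest effect time point.

```python
def temporal_relation(cause_times, effect_times):
    """Classify dS1 (cause) against S2 (effect), both given as sets of time points."""
    cause, effect = set(cause_times), set(effect_times)
    if not cause & effect:
        return "disjoint"          # no shared time point
    if cause == effect:
        return "co-extensive"      # S2 holds exactly when dS1 holds
    if any(t < min(effect) for t in cause):
        return "overlapping"       # dS1 starts before S2, then both hold
    return "other"                 # not covered by the three cases above
```

For example, a cause occupying time points 1 to 3 with an effect occupying 2 to 4 is classified as overlapping, the teach/learn pattern.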
Role relations
In the case of many verbs and nouns the most salient relation is not the hyperonym but the relation between the event and the involved participants. These relations are expressed as follows:
{hammer}      ROLE_INSTRUMENT       {to hammer}
{to hammer}   INVOLVED_INSTRUMENT   {hammer}     reversed
{school}      ROLE_LOCATION         {to teach}
{to teach}    INVOLVED_LOCATION     {school}     reversed
These relations are typically used when other relations, mainly hyponymy, do not clarify the position of the concept in the network, but the word is still closely related to another word.
Co_Role relations
guitar player   HAS_HYPERONYM           player
                CO_AGENT_INSTRUMENT     guitar
player          HAS_HYPERONYM           person
                ROLE_AGENT              to play music
                CO_AGENT_INSTRUMENT     musical instrument
to play music   HAS_HYPERONYM           to make
                ROLE_INSTRUMENT         musical instrument
guitar          HAS_HYPERONYM           musical instrument
                CO_INSTRUMENT_AGENT     guitar player
ice saw         HAS_HYPERONYM           saw
                CO_INSTRUMENT_PATIENT   ice
saw             HAS_HYPERONYM           saw
                ROLE_INSTRUMENT         to saw
ice             CO_PATIENT_INSTRUMENT   ice saw   REVERSED
Co_Role relations
Examples of the other relations are:
criminal               CO_AGENT_PATIENT       victim
novel writer/ poet     CO_AGENT_RESULT        novel/ poem
dough                  CO_PATIENT_RESULT      pastry/ bread
photographic camera    CO_INSTRUMENT_RESULT   photo
BE_IN_STATE and STATE_OF
Example:
the poor are the ones to whom the state poor applies
Effect:
poor N   HAS_HYPERONYM   person N
poor N   BE_IN_STATE     poor A
poor A   STATE_OF        poor N   reversed
IN_MANNER and MANNER_OF
Example:
to slurp is to eat in a noisy manner
Effect:
slurp V           HAS_HYPERONYM   eat V
slurp V           IN_MANNER       noisily Adverb
noisily Adverb    MANNER_OF       slurp V   reversed
Overview of the Language Internal relations in EuroWordnet
Same Part of Speech relations:
NEAR_SYNONYMY
apparatus - machine
HYPERONYMY/HYPONYMY
car - vehicle
ANTONYMY
open - close
HOLONYMY/MERONYMY
head - nose
Cross-Part-of-Speech relations:
XPOS_NEAR_SYNONYMY
dead - death; to adorn - adornment
XPOS_HYPERONYMY/HYPONYMY
to love - emotion
XPOS_ANTONYMY
to live - dead
CAUSE
die - death
SUBEVENT
buy - pay; sleep - snore
ROLE/INVOLVED
write - pencil; hammer - hammer
STATE
the poor - poor
MANNER
to slurp - noisily
BELONG_TO_CLASS
Rome - city
Thematic networks
[Figure: a thematic network from the Dutch wordnet around medical treatment. Link labels in the figure: Causes, Patient, Agent, Instrument, Part of, Involves.]
Nodes: organisme (organism); wezen (being); persoon (person); arts (doctor); zieke (sick person, patient); ziekte (disease); maagaandoening (stomach disease); orgaan (organ); maag (stomach); behandelen (treat); opereren (operate); genezen (to get well); scalpel
The multi-lingual design of
EuroWordNet
The Multilingual Design

Inter-Lingual-Index: unstructured fund of concepts to provide
an efficient mapping across the languages;

Index-records are mainly based on WordNet1.5 synsets and
consist of synonyms, glosses and source references;

Various types of complex equivalence relations are
distinguished;

Equivalence relations from synsets to index records: not on a
word-to-word basis;

Indirect matching of synsets linked to the same index items;
EWN Interlingual Relations
• EQ_SYNONYM: there is a direct match between a synset and an ILI-record
• EQ_NEAR_SYNONYM: a synset matches multiple ILI-records simultaneously,
• HAS_EQ_HYPERONYM: a synset is more specific than any available ILI-record.
• HAS_EQ_HYPONYM: a synset can only be linked to more specific ILI-records.
• other relations:
CAUSES/IS_CAUSED_BY, EQ_SUBEVENT/EQ_ROLE,
EQ_IS_STATE_OF/EQ_BE_IN_STATE
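Indirect matching through the Inter-Lingual-Index can be sketched as a lookup over shared ILI-records. The link table below is toy data assembled from examples in these slides (dedo, dito, finger, toe); the tuple layout and the `indirect_matches` function are assumptions for illustration, not the Polaris database interface.

```python
from collections import defaultdict

# (language, synset, equivalence relation, ILI-record); toy data only.
LINKS = [
    ("ES", "dedo", "HAS_EQ_HYPONYM", "finger"),
    ("ES", "dedo", "HAS_EQ_HYPONYM", "toe"),
    ("IT", "dito", "HAS_EQ_HYPONYM", "finger"),
    ("IT", "dito", "HAS_EQ_HYPONYM", "toe"),
    ("GB", "finger", "EQ_SYNONYM", "finger"),
    ("GB", "toe", "EQ_SYNONYM", "toe"),
]

def indirect_matches(links, language, synset):
    """Synsets in other languages that are linked to the same ILI-records."""
    by_record = defaultdict(list)
    for lang, syn, _rel, record in links:
        by_record[record].append((lang, syn))
    found = set()
    for lang, syn, _rel, record in links:
        if (lang, syn) == (language, synset):
            for other_lang, other_syn in by_record[record]:
                if other_lang != language:
                    found.add((other_lang, other_syn, record))
    return sorted(found)
```

Here Spanish dedo is matched with Italian dito and with English finger and toe without any direct link between the wordnets; a fuller version would also weigh the type of equivalence relation on each side.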
Equivalent Near Synonym
1. Multiple Targets
One sense for Dutch schoonmaken (to clean) which
simultaneously matches with at least 4 senses of clean in
WordNet1.5:
•{make clean by removing dirt, filth, or unwanted substances from}
•{remove unwanted substances from, such as feathers or pits, as of chickens or fruit}
•{remove in making clean; "Clean the spots off the rug"}
•{remove unwanted substances from - (as in chemistry)}
The Dutch synset schoonmaken will thus be linked with an eq_near_synonym relation to all these senses of clean.
Equivalent Near Synonym
2. Multiple Source meanings
Synsets inter-linked by a near_synonym relation can be linked to the same target ILI-record(s), either with an eq_synonym or an eq_near_synonym relation:
Dutch wordnet:
toestel near_synonym apparaat
ILI-records: {machine}; {device}; {apparatus}; {tool}
Equivalent Hyponymy
has_eq_hyperonym
Typically used for gaps in WordNet1.5 or in English:
• genuine, cultural gaps for things not known in English culture, e.g.
citroenjenever, which is a kind of gin made out of lemon skin,
• pragmatic, in the sense that the concept is known but is not expressed
by a single lexicalized form in English, e.g.: Dutch hoofd only refers to
human head and Dutch kop only refers to animal head, English uses
head for both.
has_eq_hyponym
Used when WordNet1.5 only provides narrower terms. In this case there can only be a pragmatic difference, not a genuine cultural gap, e.g.: Spanish dedo can be used to refer to both finger and toe.
Complex mappings across languages
[Figure: synsets from four wordnets linked to shared ILI-records.]
GB-Net: toe; finger; head
IT-Net: dito
ES-Net: dedo
NL-Net: hoofd; kop
ILI-records: { toe : part of foot }; { finger : part of hand }; { dedo , dito : finger or toe }; { head : part of body }; { hoofd : human head }; { kop : animal head }
Link types in the figure: normal equivalence; eq_has_hyponym; eq_has_hyperonym
The methodologies for
building wordnets
Overall Building Process
[Figure: flow chart of the building process.]
Sources: Machine Readable Dictionaries; Wordnets, Taxonomies, Corpora (loaded in local databases)
Ia   Specification of selection criteria -> Subset of word meanings
Ia   Encoding of language internal and equivalence relations -> Wordnet fragment with links to WordNet1.5 in local database
Ib   Improve and extend the wordnet fragments
     Load wordnet in the EuroWordNet Database -> Wordnet fragment in EuroWordNet database
Ic   Comparing and restructuring the wordnet
     Adjust coverage, improve encoding
II   Verification by users -> Verification Report
III  Demonstration in Information Retrieval
Main Methods
• Expand approach: translate WordNet1.5 synsets to another language and take over the structure
   - easier and more efficient method
   - compatible structure with WordNet1.5
   - structure is close to WordNet1.5 but also biased by it
• Merge approach: create an independent wordnet in another language and align the separate hierarchies by generating the appropriate translations
   - more complex and labour intensive
   - different structure from WordNet1.5
   - language-specific patterns can be maintained
Methods for extracting
language-internal relations
• editors and database for manually encoding relations;
• comparison with WordNet1.5 structure;
• definition patterns in monolingual dictionaries;
• co-occurrences in corpora;
• morphology;
• bilingual dictionaries;
• lexical semantic substitution tests
Methods for extracting equivalence relations
• extract monosemous translations of English synsets, e.g. a Spanish word has only 1 translation to an English word which has only one sense and vice versa;
• disambiguation of multiple ambivalent translations by measuring the conceptual distance between the senses of these translations in the WordNet1.5 hierarchy (Rigau and Aguirre, 95);
• disambiguation of ambivalent translations by measuring the conceptual distance directly in the WordNet1.5 hierarchy between alternative translations and the translations of the direct semantic context in the source wordnet;
• disambiguation of ambivalent translations by measuring the overlap in top-concepts inherited in the source wordnet and inherited for the different senses of translations in WordNet1.5;
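The first method, extracting monosemous translations, can be sketched as a filter over a bilingual dictionary. Everything below is a hypothetical illustration: the dictionary entries and sense counts are invented, and `monosemous_pairs` is not a function from the project.

```python
def monosemous_pairs(bilingual, sense_count):
    """Return (source word, English word) pairs that are unambiguous both ways.

    bilingual:   dict mapping a source-language word to its set of English translations
    sense_count: dict mapping an English word to its number of senses
    """
    # Invert the dictionary to check the "vice versa" condition.
    back = {}
    for source, targets in bilingual.items():
        for english in targets:
            back.setdefault(english, set()).add(source)

    pairs = []
    for source, targets in bilingual.items():
        if len(targets) != 1:
            continue  # more than one translation: needs disambiguation
        (english,) = targets
        if sense_count.get(english) == 1 and back[english] == {source}:
            pairs.append((source, english))
    return sorted(pairs)
```

Words that fail this filter are the ambivalent cases that the distance-based and top-concept methods in the list are meant to handle.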
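Aligning wordnets
[Figure: a Dutch hierarchy aligned with a WordNet1.5 hierarchy; one of the links to the senses of organ is marked with a question mark.]
WordNet1.5 nodes: object; artifact object; natural object; instrument; musical instrument; organ; hammond organ
Dutch nodes: muziekinstrument; orgel; hammond orgel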
Inheriting
Semantic Features
hart 1
orgaan 1 (Living Part) deel 2 (Part) iets 1 LEAF
----------------------------------------
heart 1
playing card 1 card 1 (Artifact Function Object) paper 6 (Artifact Solid) material 5 (Substance) matter 1 inanimate object 1 entity 1 LEAF
heart 2
disposition 2 (Dynamic Experience Mental) nature 1 trait 1 (Property) attribute 1 (Property) abstraction 1 LEAF
heart 3
bravery 1 spirit 1 character 1 trait 1 (Property) attribute 1 (Property) abstraction 1 LEAF
heart 4
internal organ 1 organ 4 (Living Part) body part 1 (Living Part) part 10 entity 1 LEAF
Reliability of Equivalence Relations
Spanish wordnet
Confidence (Variants) | Nouns | Verbs | Total
100% (Manual) | 7819 | 8394 | 16213
>96% | 382 | 0 | 382
>94% | 2948 | 0 | 2948
>92% | 1364 | 0 | 1364
>85% | 23113 | 0 | 23113
>84% | 4156 | 0 | 4156
Total | 39782 | 8394 | 48176
Reliability of Equivalence Relations
Dutch wordnet
Matching Type | Nouns: No of Synsets | Perc. | Reliability | Verbs: No of synsets | Perc. | Reliability
manual/ok | 4138 | 17,00% | 100% | 3383 | 37,07% | 100%
1 match | 4846 | 19,91% | 86% | 763 | 8,36% | 78%
2 matches | 3059 | 12,57% | 68% | 652 | 7,15% | 71%
3-9 matches | 5408 | 22,22% | 65% | 2471 | 27,08% | 49%
10+ matches | 1864 | 7,66% | 54% | 980 | 10,74% | 23%
0 matches | 5022 | 20,64% | n.a. | 876 | 9,60% | n.a.
Total | 24337 | | | 9125 | |
Conflicting Starting points
1. There should be a maximum of flexibility:
• the wordnets should be able to reflect language-specific relations and patterns
• the wordnets should be built relatively independently because each site has different starting points:
   - different tools, databases and resources (Machine Readable Dictionaries)
   - differences in the languages
2. The wordnets have to be compatible in terms of coverage and relations to be useful for multilingual information retrieval and translation tools and to be able to compare the wordnets.
Measures to achieve maximal compatibility
• The results are loaded into a common Multilingual Database (Polaris):
   - consistency checks and types of incompatibility
   - specific comparison options to measure consistency and overlap in coverage
• User-guides for building wordnets in each language:
   - the steps to encode the relations for a word meaning.
   - common tests and criteria for all the relations.
   - overview of problems and solutions.
• A set of common Base-Concepts which are shared by all the sites, having:
   - most relations and the most-important positions in the wordnets
   - most meanings and badly defined
• Classification of the common Base Concepts in terms of a Top-Ontology of 63 basic Semantic Distinctions
• Top-Down Approach, where first the Base Concepts and their direct context are (manually) encoded and next the wordnets are (semi-automatically) extended top-down to include more specific concepts that depend on these Base Concepts.
Top-Ontology and Base Concepts
Top-Ontology with 63 higher-level concepts
Existing Ontologies:
WordNet1.5 top-levels
Aktions-Art models (Vendler, Verkuyl)
Acquilex and Sift ontologies (EC-projects)
Qualia-structure (Pustejovsky)
Upper-Model, MikroKosmos, Cyc, Ad Hoc ANSI-Committee on ontologies
The ontology was adapted to represent the variety of concepts in the set of Common Base Concepts, across the 4 languages:
homogeneous Base-Concept Clusters
average size of Base Concept Cluster
apply to both nouns and verbs
Set of 1024 common Base Concepts making up the core of the separate wordnets.
Base Concepts
Procedure:
• Each site determined the set of word meanings with most relations (up to 15% of all
relations) and high positions in the hierarchy.
• This set was extended with all meanings used to define the first selection.
• The local selection was translated to WordNet1.5 equivalences: 4 lists of
WordNet1.5 synsets (between 450 – 2000 synsets per selection).
• These sets of WordNet1.5 translations have been compared.
Concepts selected by all sites:
30 synsets (24 noun synsets, 6 verb synsets).
Explanations:
•The individual selections are not representative enough.
•There are major differences in the way meanings are classified, which have an effect
on the frequency of the relations.
•The translations of the selection to WordNet1.5 synsets are not reliable
•The resources cover very different vocabularies
Concepts selected by at least two sites: intersections of pairs
NOUNS | NL | ES | IT | GB/WN
NL | 1027 | 103 | 182 | 333
ES | 103 | 523 | 45 | 284
IT | 182 | 45 | 334 | 167
GB/WN | 333 | 284 | 167 | 1296
VERBS | NL | ES | IT | GB/WN
NL | 323 | 36 | 42 | 86
ES | 36 | 128 | 18 | 43
IT | 42 | 18 | 104 | 39
GB/WN | 86 | 43 | 39 | 236
Total Set of shared Base Concepts : Union of intersection pairs
 | Nouns | Verbs | Total
1stOrderEntities | 491 | | 491
2ndOrderEntities | 272 | 228 | 500
3rdOrderEntities | 33 | | 33
Total | 796 | 228 | 1024
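The comparison of the local selections (concepts chosen by all sites, by each pair of sites, and the union of those pairwise intersections) reduces to a few set operations. The sketch below uses invented synset identifiers and a hypothetical `compare_selections` helper; it illustrates the bookkeeping only and does not reproduce the project's actual counts.

```python
from itertools import combinations

def compare_selections(selections):
    """selections: dict mapping a site to its set of WordNet1.5 synset identifiers."""
    pairwise = {
        (a, b): selections[a] & selections[b]
        for a, b in combinations(sorted(selections), 2)
    }
    shared_by_all = set.intersection(*selections.values())
    # Union of the pairwise intersections: selected by at least two sites.
    shared_by_two = set().union(*pairwise.values())
    return pairwise, shared_by_all, shared_by_two

# Toy selections with invented identifiers.
selections = {
    "NL": {"entity", "object", "move"},
    "ES": {"object", "move", "place"},
    "IT": {"object", "act"},
    "GB/WN": {"object", "place", "act"},
}
pairwise, shared_by_all, shared_by_two = compare_selections(selections)
```

In this toy run only one concept is shared by all four sites, while four are shared by at least two, which mirrors why the union of intersection pairs was used for the common set.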
Table 4: Number of Common BCs represented in the local wordnets
 | Related to CBCs | Eq_synonym Relations | Eq_near_Synonym relations | CBCs Without Direct Equivalent
AMS | 992 | 725 | 269 | 97
FUE | 1012 | 1009 | 0 | 15
PSA | 878 | 759 | 191 | 9
Table 5: BC4 Gaps in at least two wordnets (10 synsets)
body covering#1
body substance#1
social control#1
change of magnitude#1
contractile organ#1
psychological feature#1
mental object#1; cognitive content#1; content#2
natural object#1
place of business#1; business establishment#1
plant organ#1
Plant part#1
spatial property#1; spatiality#1
Table 6: Local senses with complex equivalence relations to CBCs
 | NL | ES | IT
Eq_has_hyperonym | 61 | 40 | 4
eq_has_hyponym | 34 | 14 | 20
Eq_has_holonym | 2 | 0 |
Eq_has_meronym | 3 | 2 |
Eq_involved | 3 | |
Eq_is_caused_by | 3 | |
Eq_is_state_of | 1 | |
Example of complex relation
CBC: cause to feel unwell#1, Verb
Closest Dutch concept: {onwel#1}, Adjective (sick)
Equivalence relation: eq_is_caused_by
Adaptation of Base Concepts in EuroWordNet-2
• A similar selection of fundamental concepts has been made in EuroWordNet-2
• The selected concepts have been compared among German, French, Czech and Estonian and with the EuroWordNet-1 selection
• The EuroWordNet-1 set has been extended to 1310 Base Concepts
• A distinction has been made between Hard and Soft Base Concepts
   - Hard: represented by only a single Index-record
   - Soft: represented by several close Index-records
• The final set has been used as starting point in EuroWordNet-2
Comparison of Base Concept Selections
NOUNS | Local NBCs | Intersection with NBC-ewn1 (905) | % of NBC-ewn1 | % of Local BCs | New BCs
FR | 787 | 787 | 99,24% | 100,00% | 0
DE | 460 | 202 | 25,47% | 43,91% | 258
CZ | 726 | 271 | 34,17% | 37,33% | 455
EE | 703 | 389 | 49,05% | 55,33% | 314
Union (selected by at least 1 side) | 1727 | 811 | 102,27% | 46,96% | 916
Union of Intersections (selected by at least 2 sides) | 619 | 516 | 65,07% | 83,36% | 105
Intersection (selected by 4 sides) | 70 | 70 | 8,83% | 100,00% |
VERBS | Local VBCs | Intersection with VBC-ewn1 (239) | % of VBC-EWN1 | % of Local BCs | New BCs
FR | 225 | 225 | 94.14% | 100.00% | 0
DE | 321 | 98 | 41.00% | 30.53% | 223
EE | 459 | 145 | 60.67% | 31.80% | 314
CZ | 260 | 71 | 29.71% | 27.31% | 189
Union (selected by at least 1 side) | 872 | 233 | 97.49% | 26.72% | 639
Union of Intersections (selected by at least 2 sides) | 258 | 179 | 74.90% | 69.38% | 61
Intersections (selected by 4 sides) | 30 | 30 | 12.55% | 100.00% |
Revised Set of Base Concepts
 | EWN1 Total | EWN1 Hard | EWN1 Soft | EWN2 Total | EWN2 Hard | EWN2 Soft | EWN12 Total | EWN12 Hard | EWN12 Soft
NOUNS | 905 | 575 | 330 | 105 | 20 | 85 | 1010 | 595 | 415
VERBS | 239 | 164 | 75 | 61 | 23 | 38 | 300 | 187 | 113
Table 7: Proposed, Missing and Selected Noun Base Concepts for EWN2
 | Local BCs | HARD Missing | SOFT Total | SOFT Partial | SOFT Missing | Unique BCs | Shared BCs
FR | 787 | 24 | 199 | 112 | 87 | 0 | 787
DE | 460 | 427 | 322 | 97 | 225 | 199 | 216
EE | 703 | 293 | 252 | 160 | 92 | 238 | 465
CZ | 726 | 339 | 260 | 153 | 107 | 375 | 351
Table 8: Proposed, Missing and Selected Verb Base Concepts for EWN2
 | Total | HARD Missing | SOFT Total | SOFT Partial | SOFT Missing | Unique BCs | Shared BCs
FR | 225 | 30 | 45 | 11 | 34 | 0 | 225
DE | 321 | 91 | 70 | 36 | 34 | 182 | 139
EE | 459 | 52 | 43 | 36 | 7 | 254 | 205
CZ | 260 | 126 | 76 | 35 | 41 | 162 | 98
Starting points
for the Top-Ontology
• The ontology should support the building and encoding of semantic networks as
linguistic ontologies: networks of lexicalized words and expressions in a language.
• The classification of the Base Concepts in terms of the Top Ontology should apply
to all the involved languages.
• Enforce uniformity and compatibility of the different wordnets, by providing a
common framework. Divide the Base Concepts (BCs) into coherent clusters to
enable contrastive-analysis and discussion of closely related word meanings
• Customize the database by assigning features to the top-concepts, irrespective of
language-specific structures.
• Provide an anchor point for connecting other ontologies to the Inter-Lingual-Index, such as CYC, MikroKosmos, the Upper-Model, by linking them to the corresponding ILI-records.
Principles for deciding on the distinctions
Starting point is that the wordnets are linguistic ontologies:
• Semantic classifications common in linguistic paradigms: Aktionsart models [Vendler 1967, Verkuyl 1972, Verkuyl 1989, Pustejovsky 1991], entity-orders [Lyons 1977], Aristotle’s Qualia-structure [Pustejovsky 1995].
• Ontologies developed in previous EC-projects, which had a similar basis and are well-known in the project consortium: Acquilex (BRA 3030, 7315), Sift (LE62030) [Vossen and Bon 1996].
• The ontology should be capable of reflecting the diversity of the set of common BCs, across the 4 languages. In this sense the classification of the common BCs in terms of the top-concepts should result in:
   - Homogeneous Base Concept Clusters: classifications in WordNet1.5 and the other wordnets.
   - Average-sized Base Concept Clusters: not extremely large or small.
Other important
characteristics:

The distinctions apply to both nouns, verbs and adjectives, because these can be related in the
language-specific wordnets via a xpos_synonymy relation, and the ILI-records can be related
to any part-of-speech.

The top-concepts are hierarchically ordered by means of a subsumption relation but there can
only be one super-type linked to each top-concept: multiple inheritance between top-concepts
is not allowed.

In addition to the subsumption relation top-concepts can have an opposition-relation to
indicate that certain distinctions are disjunct, whereas others may overlap.

There may be multiple relations from ILI-records to top-concepts: the Base Conceptss can be
cross-classified in terms of multiple top-concepts (as long as these have no opposition-relation
between them): i.e. multiple inheritance from Top-Concept to Base Concept is allowed.
Result: the TCs function as cross-classifying features rather than conceptual classes.
Meanings for bodyparts are not linked to a single class BodyPart but to two features:
Living and Part.
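The rules above (single super-type per top-concept, opposition relations, cross-classification of Base Concepts) can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and the function names are assumptions, the fragment of the hierarchy is abbreviated, and the only opposition relations encoded are those between the three first-level entity orders, which the slides describe as disjoint.

```python
# Hypothetical, abbreviated fragment of the top-ontology (names from the slides;
# the data layout is an assumption made for this sketch).
SUPERTYPE = {
    "Origin": "1stOrderEntity",
    "Natural": "Origin",
    "Living": "Natural",
    "Composition": "1stOrderEntity",
    "Part": "Composition",
    "SituationType": "2ndOrderEntity",
    "Dynamic": "SituationType",
}

# Opposition relations: the first-level division is described as disjoint.
OPPOSED = {
    frozenset({"1stOrderEntity", "2ndOrderEntity"}),
    frozenset({"1stOrderEntity", "3rdOrderEntity"}),
    frozenset({"2ndOrderEntity", "3rdOrderEntity"}),
}


def ancestors(top_concept):
    """Return the top-concept followed by its chain of super-types.

    Each top-concept has at most one super-type, so this is a simple chain.
    """
    chain = [top_concept]
    while top_concept in SUPERTYPE:
        top_concept = SUPERTYPE[top_concept]
        chain.append(top_concept)
    return chain


def is_consistent(top_concepts):
    """True if no two assigned top-concepts (or their super-types) are opposed."""
    expanded = set()
    for tc in top_concepts:
        expanded.update(ancestors(tc))
    return not any(pair <= expanded for pair in OPPOSED)


print(is_consistent(["Living", "Part"]))     # body part: two compatible features
print(is_consistent(["Living", "Dynamic"]))  # mixes 1st and 2nd order entities
```

The body-part case is accepted because Living and Part both fall under 1stOrderEntity, while a combination that crosses the first-level division is rejected.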
The EuroWordNet Top-Ontology:
63 concepts (excluding the top)
First Level [Lyons 1977]:
1stOrderEntity (491 BC synsets, all nouns)
Any concrete entity (publicly) perceivable by the senses and located at any point in
time, in a three-dimensional space.
2ndOrderEntity (500 BC synsets, 272 nouns and 228 verbs)
Any Static Situation (property, relation) or Dynamic Situation, which cannot be
grasped, heard, seen or felt as an independent physical thing. They can be located in
time and occur or take place rather than exist; e.g. continue, occur, apply
3rdOrderEntity (33 BC synsets, all nouns)
An unobservable proposition that exists independently of time and space. They can
be true or false rather than real. They can be asserted or denied, remembered or
forgotten. E.g. idea, thought, information, theory, plan.
Test to distinguish 1st, 2nd
and 3rd OrderEntities
Third-order entities cannot occur and have no temporal duration, and therefore fail on both
tests:
a    The same person was here again to-day
b    The same thing happened/occurred again to-day
*?   The idea, fact, expectation, etc. was here/occurred/took place
A positive test for a 3rdOrderEntity is based on the properties that can be predicated:
ok   The idea, fact, expectation, etc. is true, is denied, forgotten
The first division of the ontology is disjoint: BCs cannot be classified as combinations
of these TCs. This distinction cuts across the different parts of speech in that:
 1stOrderEntities are always (concrete) nouns.
 2ndOrderEntities can be nouns, verbs and adjectives, where adjectives are always
non-dynamic (refer to states and situations not involving a change of state).
 3rdOrderEntities are always (abstract) nouns.
Base Concepts classified as
3rdOrderEntities
theory; idea; structure; evidence;
procedure; doctrine; policy; data
point; content; plan of action; concept;
plan; communication; knowledge
base; cognitive content; know-how;
category; information; abstract; info;
1stOrderEntity 1
  Origin 0 (the way in which an entity has come about)
    Natural 21
      Living 30
        Plant 18
        Human 106
        Creature 2
        Animal 123
    Artifact 144
  Function 0 (the typical activity or role that is associated with an entity)
    Vehicle 8
    Occupation 23
    Covering 8
    Garment 3
    Software 4
    Furniture 6
    Place 45
    Container 12
    Comestible 32
    Instrument 18
    Building 13
    Representation 12
      MoneyRepresentation 10
      LanguageRepresentation 34
      ImageRepresentation 9
  Form 0 (amorphous or fixed shape)
    Substance 32
      Solid 63
      Liquid 13
      Gas 1
    Object 62
  Composition 0 (a group of self-contained wholes, or a part of such a whole)
    Part 86
    Group 63
Conjunctive classes of
1stOrderEntities
Frequent combinations
 5  Comestible;Solid;Artifact          7  LanguageRepresentation
 5  Container;Part;Solid;Living        7  Vehicle;Object;Artifact
 5  Furniture;Object;Artifact         10  Instrument;Object;Artifact
 5  Instrument;Artifact               12  Part
 5  Living                            14  Place
 5  Plant                             14  Place;Part
 6  Liquid                            15  Substance
 6  Object;Artifact                   19  LanguageRepresentation;Artifact
 6  Part;Living                       20  Occupation;Object;Human
 6  Place;Part;Solid                  22  Object;Animal;Function
 7  Building;Object;Artifact          38  Group;Human
 7  Group                             42  Object;Human
Conjunctive classes of
1stOrderEntities
Low-frequency combinations
fruit: Comestible (Function), Object (Form), Part (Composition), Plant (Natural, Origin)
skin:  Covering (Function), Solid (Form), Part (Composition), Living (Natural, Origin)
life:  Group (Composition), Living (Natural, Origin)
cell:  Part (Composition), Living (Natural, Origin)
arms:  Instrument (Function), Group (Composition), Object (Form), Artifact (Origin)
1stOrderEntities classified as
Function only
barrier 1; belonging 2; building material 1; causal agency 1; commodity 1;
consumer goods 1; creation 3; curative 1; decoration 2; device 4; fastener 1;
force 6; force 7; form 5; impediment 1; medicament 1; piece of work 1;
possession 1; protection 4; remains 2; restraint 2; support 6; support 7;
supporting structure 1; thing 3
2ndOrderEntity 0
  SituationType 6 (the event-structure in terms of which a situation can be characterized as a conceptual unit over time; disjoint features)
    Dynamic 134 (he sat down quickly; a quick meeting)
      BoundedEvent 183
      UnboundedEvent 48
    Static 28 (?he sits quickly)
      Property 61
      Relation 38
  SituationComponent 0 (the most salient semantic component(s) that characterize(s) a situation; conjunctive features)
    Cause 67
      Agentive 170
      Phenomenal 17
      Stimulating 25
    Communication 50
    Existence 27
    Location 76
    Mental 90
    Social 102
    Condition 62
    Experience 43
    Manner 21
    Modal 10
    Time 24
    Physical 140
    Possession 23
    Purpose 137
    Quantity 39
    Usage 8
Conjunctive classes of
2ndOrderEntities
Static
 5  Property;Physical;Condition
 5  Property;Stimulating;Physical
 5  Relation
 5  Relation;Social
 6  Static;Quantity
 7  Property;Condition
 8  Relation;Location
 9  Property
10  Relation;Physical;Location:
    adjoin 1; aim 4; blank space 1; course 7; direction 8; distance 1; elbow
    room 1; path 3; spatial property 1; spatial relation 1
Conjunctive classes of
2ndOrderEntities
Dynamic
 5  BoundedEvent;Cause;Physical
 5  BoundedEvent;Cause;Physical;Location
 5  BoundedEvent;Time
 5  Dynamic
 5  Dynamic;Location
 5  Dynamic;Phenomenal
 5  Dynamic;Phenomenal;Physical
 6  BoundedEvent;Agentive
 6  BoundedEvent;Location
 6  BoundedEvent;Physical;Location
 6  Dynamic;Agentive;Communication
 6  Dynamic;Cause
 8  BoundedEvent;Agentive;Mental;Purpose
 8  BoundedEvent;Quantity;Time
 9  BoundedEvent;Cause
 9  Dynamic;Experience;Mental:
    experience 7; find 3; affect 5; arouse 5; excite 2; cognition 1; desire 2;
    disposition 2; disposition 4; disturbance 7; emotion 1; feeling 1; humor 3;
    pleasance 1; process 4; look 8; phenomenon 1; cause to appear 1;
    perception 2; sensation 1; feel 12; experience 8; trouble 3; reality 1
Top-Down Building Procedure
1) Construction of a core wordnet from the common set of Base Concepts
• Find Representatives in the local language for the Common Base Concepts (1310 synsets)
• Add local Base Concepts that are not selected as Common Base Concepts
• Specify the hyperonyms of the local and common Base Concepts
2) Extend the Core Wordnets
• Add the first level of hyponyms to the core wordnets
• Add other hyponyms which have many sub-hyponyms
• Add other types of relations: XPOS, roles, meronymy, subevents, causes.
3) Verify the Selection
• Corpus frequency: Parole lexicons and corpora
• Top-Concept clustering
• Intersection of ILI-records
• Overlap in ILI-chains
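Step 2 of the procedure above (extending the core wordnet) can be sketched as follows. This is a minimal illustration, not the project's tooling: the hyponymy dictionary, the threshold `min_sub_hyponyms` for "many sub-hyponyms", and the toy vocabulary are all assumptions introduced here.

```python
def descendants(synset, hyponyms):
    """All synsets below `synset` in the hyponymy hierarchy (assumed acyclic)."""
    found = set()
    stack = list(hyponyms.get(synset, []))
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(hyponyms.get(node, []))
    return found


def extend_core(core, hyponyms, min_sub_hyponyms=2):
    """Extend a core set of Base Concepts top-down.

    2a: add the first level of hyponyms of every core concept.
    2b: add deeper hyponyms that themselves have many sub-hyponyms
        (the threshold is an assumption of this sketch).
    """
    selected = set(core)
    for bc in core:
        selected.update(hyponyms.get(bc, []))
    for bc in core:
        for node in descendants(bc, hyponyms):
            if node not in selected and len(descendants(node, hyponyms)) >= min_sub_hyponyms:
                selected.add(node)
    return selected


# Toy hierarchy, invented for illustration only.
toy_hyponyms = {
    "vehicle": ["car", "bicycle"],
    "car": ["taxi", "police car"],
    "taxi": ["minicab", "black cab"],
}
print(sorted(extend_core(["vehicle"], toy_hyponyms)))
```

In the toy data, "taxi" is kept because it has two sub-hyponyms, while "police car" and the leaves are left for a later extension round.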
Top-Down Building
[Figure: Top-Down Building. Components shown: Top-Ontology (63 TCs); Inter-Lingual-Index (1310 CBCs, 149 new ILIs, remaining WordNet1.5 synsets); and, for each of the two wordnets drawn, hyperonyms, CBC representatives, local BCs, first-level hyponyms, remaining hyponyms, and WMs related via non-hyponymy.]
The current wordnets
           Synsets  No. of   Sens./  Entries  Sens./  LIRels.  LIRels/  EQRels-  EQRels/  Synsets
                    senses   syns.            entry            syns.    ILI      syn.     without ILI
Dutch        44015   70201    1.59    56283    1.25   111639    2.54     53448    1.21       7203
Spanish      23370   50526    2.16    27933    1.81    55163    2.36     21236    0.91          0
Italian      40428   48499    1.20    32978    1.47   117068    2.90     71789    1.78       1561
French       22745   32809    1.44    18777    1.75    49494    2.18     22730    1.00         20
German       15132   20453    1.35    17098    1.20    34818    2.30     16347    1.08          0
Czech        12824   19949    1.56    12283    1.62    26259    2.05     12824    1.00          0
Estonian      7678   13839    1.80    10961    1.26    16318    2.13      9004    1.17          0
English      16361   40588    2.48    17320    2.34    42140    2.58      n.a.    n.a.       n.a.
WN1.5        94515  187602    1.98   126617    1.48   211375    2.24      n.a.    n.a.       n.a.
Comparison of wordnets
In-depth comparison of major semantic fields:
• Comparison of the intersection of the associated ILI-records
• Distribution of the associated ILI-records over the different top ontology clusters
• Comparison of the hyponymy relations in the wordnets, projected on the associated ILI-records
Intersection of the associated ILI-records
                          Nouns                                  Verbs
               frequency  % of ∪           % of ∪      frequency  % of ∪           % of ∪
                          (WN,IT,NL,ES)    (IT,NL,ES)             (WN,IT,NL,ES)    (IT,NL,ES)
Total (union)             62780            32520                  12215            7455
ES               24596    39.2%            75.6%         4654     38.1%            62.4%
IT               14272    22.7%            43.9%         4673     38.3%            62.7%
NL               21259    33.9%            65.4%         6416     52.5%            86.1%
∩ (ES, IT)       10907    17.4%            33.5%         3272     26.8%            43.9%
∩ (ES, NL)       14773    23.5%            45.4%         3870     31.7%            51.9%
∩ (IT, NL)        9862    15.7%            30.3%         3950     32.3%            53.0%
∩ (ES, IT, NL)    8183    13.0%            25.2%         3051     25.0%            40.9%
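The percentages in this table are intersection sizes divided by the size of a union of ILI-records. A minimal sketch of that computation, using small invented sets in place of the real ILI-record sets (the function name and the toy data are assumptions):

```python
def intersection_share(wordnets, members, universe):
    """Size of the intersection over `members`, and its share (in %) of the
    union over `universe`. `wordnets` maps a language code to the set of
    ILI-records its synsets are linked to."""
    shared = set.intersection(*(wordnets[m] for m in members))
    union = set.union(*(wordnets[u] for u in universe))
    return len(shared), 100.0 * len(shared) / len(union)


# Toy ILI-record sets, invented for illustration only.
toy = {
    "ES": {1, 2, 3, 4},
    "IT": {2, 3, 5},
    "NL": {3, 4, 5, 6},
}
count, share = intersection_share(toy, ["ES", "IT", "NL"], ["ES", "IT", "NL"])
print(count, round(share, 1))
```

Applied to the figures on the slide, the noun intersection of ES, IT and NL is 8183 out of a union of 32520, i.e. 8183 / 32520 ≈ 25.2%.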
Distribution over the top
ontology clusters
                        WN               NL                       ES                       IT
Top-Concept          TC-     % of     TC-     % of    % of     TC-     % of    % of     TC-     % of    % of
                     Tokens  wn       Tokens  nl      wn       Tokens  es      wn       Tokens  it      wn
Animal               14068   3.99%     1193   0.97%    8.5%     2458   1.81%  17.5%     1122   1.44%    8.0%
Artifact             19562   5.55%    10803   8.83%   55.2%     9969   7.36%  51.0%     6494   8.34%   33.2%
Building              1022   0.29%      707   0.58%   69.2%      628   0.46%  61.4%      434   0.56%   42.5%
Comestible            3377   0.96%     1393   1.14%   41.2%     1614   1.19%  47.8%      624   0.80%   18.5%
Container             1725   0.49%      778   0.64%   45.1%      799   0.59%  46.3%      432   0.55%   25.0%
Covering              2030   0.58%     1208   0.99%   59.5%     1027   0.76%  50.6%      690   0.89%   34.0%
Creature               664   0.19%      159   0.13%   23.9%      254   0.19%  38.3%       27   0.03%    4.1%
Function             34081   9.68%    17668  14.44%   51.8%    18904  13.96%  55.5%    11043  14.18%   32.4%
Furniture              298   0.08%      171   0.14%   57.4%      147   0.11%  49.3%       87   0.11%   29.2%
Garment                756   0.21%      494   0.40%   65.3%      426   0.31%  56.3%      292   0.37%   38.6%
Gas                     93   0.03%       67   0.05%   72.0%       62   0.05%  66.7%       49   0.06%   52.7%
Group                27805   7.90%     3357   2.74%   12.1%     3630   2.68%  13.1%     2337   3.00%    8.4%
Human                11543   3.28%     6372   5.21%   55.2%     7683   5.67%  66.6%     4488   5.76%   38.9%
ImageRepresentation    780   0.22%      412   0.34%   52.8%      426   0.31%  54.6%      294   0.38%   37.7%
Instrument            7036   2.00%     4102   3.35%   58.3%     3590   2.65%  51.0%     2564   3.29%   36.4%
LanguageRepresent.    2844   0.81%     1273   1.04%   44.8%     1218   0.90%  42.8%      691   0.89%   24.3%
Liquid                1629   0.46%      617   0.50%   37.9%      500   0.37%  30.7%      339   0.44%   20.8%
Living               47104  13.37%    10225   8.36%   21.7%    13661  10.08%  29.0%     7408   9.51%   15.7%
Distribution over the top
ontology clusters
                        WN               NL                       ES                       IT
Top-Concept          TC-     % of     TC-     % of    % of     TC-     % of    % of     TC-     % of    % of
                     Tokens  wn       Tokens  nl      wn       Tokens  es      wn       Tokens  it      wn
MoneyRepresentation    372   0.11%      190   0.16%   51.1%      183   0.14%  49.2%      111   0.14%   29.8%
Natural              68370  19.41%    21948  17.94%   32.1%    24556  18.13%  35.9%    14400  18.49%   21.1%
Object               48162  13.68%    20206  16.51%   42.0%    22608  16.69%  46.9%    13242  17.00%   27.5%
Occupation            2059   0.58%     1209   0.99%   58.7%     1395   1.03%  67.8%      824   1.06%   40.0%
Part                 12083   3.43%     4806   3.93%   39.8%     5819   4.30%  48.2%     2586   3.32%   21.4%
Place                 5281   1.50%     2072   1.69%   39.2%     2439   1.80%  46.2%     1227   1.58%   23.2%
Plant                18874   5.36%     1534   1.25%    8.1%     2012   1.49%  10.7%     1121   1.44%    5.9%
Representation         934   0.27%      560   0.46%   60.0%      577   0.43%  61.8%      302   0.39%   32.3%
Software               201   0.06%       80   0.07%   39.8%       91   0.07%  45.3%       49   0.06%   24.4%
Solid                 6319   1.79%     2845   2.33%   45.0%     2721   2.01%  43.1%     1406   1.81%   22.3%
Substance            12365   3.51%     5447   4.45%   44.1%     5599   4.13%  45.3%     2847   3.66%   23.0%
Vehicle                747   0.21%      466   0.38%   62.4%      466   0.34%  62.4%      352   0.45%   47.1%
Total               352184           122362           34.7%   135462          38.5%    77882           22.1%
Comparison of the hyponymy relations,
projected on the associated ILI-records
To be able to compare hyponymy chains, each word sense in the chain has
been replaced by the ILI-records that are linked to these synsets which gives
the following result:
veranderen (change) → bewegen (move intransitive) → bewegen (move
reflexive) → voortbewegen (move location) → verplaatsen (move from A to
B) → stijgen (move to a higher position) → opstijgen (take off)
00064108 → 01046072 → 01046072 → 01046072 → 01055491 → 01094615 → 00257753
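The projection shown above can be sketched in Python. The word/ILI pairs are taken from the slide in the order listed; the step that collapses consecutive identical ILI-records is an assumption added here to show why several Dutch senses can end up on one WordNet1.5 node, and is not stated in the source.

```python
from itertools import groupby

# Dutch hyponymy chain with the ILI-record each sense is linked to,
# paired in the order given on the slide.
chain = [
    ("veranderen", "00064108"),
    ("bewegen (intransitive)", "01046072"),
    ("bewegen (reflexive)", "01046072"),
    ("voortbewegen", "01046072"),
    ("verplaatsen", "01055491"),
    ("stijgen", "01094615"),
    ("opstijgen", "00257753"),
]


def project(chain):
    """Replace every word sense in the chain by its ILI-record."""
    return [ili for _, ili in chain]


def collapse(ili_chain):
    """Merge runs of identical ILI-records (an assumption of this sketch)."""
    return [ili for ili, _ in groupby(ili_chain)]


projected = project(chain)
print(projected)
print(collapse(projected))
```

After collapsing, the seven Dutch senses occupy five distinct positions, which is the kind of chain that can then be compared with the chains of other wordnets.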
Coverage of complete noun chains
projected over WN1.5 structure
                  nodes (53467)         edges (53467)
                  frequency      %      frequency      %
ES                   14221   26.60         14221   26.60
NL                     650    1.22            17    0.03
IT                    2760    5.16            49    0.09
∩ (ES,NL)              352    0.66            10    0.02
∩ (ES,IT)             1563    2.92            34    0.06
∩ (NL,IT)              190    0.36             0    0.00
∩ (ES,NL,IT)           136    0.25             0    0.00
Partial noun chains projected
over WN1.5
LENGTH     ES     NL     IT  ∩(ES,NL)  ∩(ES,IT)  ∩(NL,IT)  ∩(ES,NL,IT)     WN
 1      53467  53213  53456     53148     53452     52862        52803  53467
 2      53385  43161  47346     41959     47138     40893        40636  53467
 3      51541  26862  44076     25162     42764     21573        21089  53434
 4      47930  15032  27878     13106     26260      7808         7112  52913
 5      42049   6771  21019      5454     19433      2996         2506  50693
 6      27582   2781  14817      1929     12552       949          799  45029
 7      16789    967   7865       726      6259       169          148  32299
 8       8337    196   3526        87      2648        17           12  20558
 9       3800      6   1062         3       779                         11821
10       1647           380                 311                          5881
11        647            82                  73                          2576
12        299            28                  25                          1176
13        115                                                             659
14         19                                                             295
15          2                                                              82
Partial noun chains with 1 gap
projected over WN1.5
LENGTH     ES     NL     IT  ∩(ES,NL)  ∩(ES,IT)  ∩(NL,IT)  ∩(ES,NL,IT)     WN
 3       7804  29355  12152     28312     11619     20886        20439  53434
 4       7776  26152  11616     24655     11086     17228        16775  52913
 5       7333  18633  10480     16712      9652     11136        10561  50693
 6       6296  12019   7782     10158      6879      6023         5262  45029
 7       5017   5326   4602      3866      4119      2531         1960  32299
 8       3392   1891   2456      1046      2131       704          560  20558
 9       1914    487   1166       268       986       115           98  11821
10       1038     83    538        32       485        11            7   5881
11        564      2    173         1       163                          2576
12        232           108                 101                          1176
13         98            35                   4                           659
14         43             2                                               295
15          5                                                              82
16          2                                                               7
Towards an efficient, condensed and
universal index of sense-distinctions
• Independently of the wordnet structures in each
language, we can manipulate the mapping across
languages via the ILI.
• We can use the information of all the languages to
correct incompleteness and inconsistencies of the
individual resources.
• Ultimately, we should try to find a minimal and
sufficient set of concepts to provide an efficient
mapping.
Characteristics of the
Inter-Lingual-Index
The Inter-lingual-Index (ILI) is an unstructured fund of concepts with the
sole purpose of providing an efficient mapping of senses across languages.
Requirements:
1. efficient level of granularity
[Figure: ILI-records {break} “He broke the glass”, {break; cause to break} and {break; damage} “inflict damage upon”, shown against the wordnet entries breken (Dutch), breken (Dutch), romper (Spanish) and rompere (Italian).]
2. superset of concepts that occur across languages
[Figure: ILI-records {cashier} and {female cashier}, shown against the wordnet entries caissière (Dutch) and cajera (Spanish); link labels shown: eq_hyperonym (twice) and eq_synonym (twice).]
A Minimal and Efficient set of
concepts
• Globalizing the sense-differentiation:
• create metonymic clusters
• abstract from contextual specialization and grammatical
perspectives
• abstract from part-of-speech realization
• abstract from productive and predictable meanings
• Extending the Inter-Lingual-Index to become the
superset of concepts occurring in two or more
wordnets only if:
• concepts are unpredictable and unproductive
• concepts cannot be linked exhaustively and uniquely to the ILI
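Under-specified concepts
Metonymic clusters
[Figure: the cluster record "club" (metonym#) groups the ILI-records "club: organization" and "club: building". Wordnet synsets shown: {vereniging}NL, {club; verenigingsgebouw}NL and {club}EN. Link labels shown: eq_synonym (twice) and eq_metonym (twice).]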
Under-specified concepts
Generalization and Diathesis clusters
[Figure: the cluster record "break" (diathesis#) groups the ILI-records "break: inchoative" and "break: causative". Wordnet synsets shown: {breken; kapotgaan}NL, {breken; kapotmaken}NL, {rompere}IT and {rompersi}IT. Link labels shown: eq_synonym (twice) and eq_diathesis (twice).]
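Under-specified for POS
[Figure: the cluster record "depart" (xpos#) groups the ILI-records "depart" and "departure". Wordnet synsets shown: {vertrekkenV}NL, {departV}EN, {vertrekN}NL and {departureN}EN. Link labels shown: eq_synonym (twice) and eq_xpos_synonym (twice).]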
Overview of equivalence
relations to the ILI
Relation            POS    Sources : Targets     Example
eq_synonym          same   1 : 1                 auto : voiture (car)
eq_near_synonym     any    many : many           apparaat, machine, toestel: apparatus, machine, device
eq_hyperonym        same   many : 1 (usually)    citroenjenever: gin
eq_hyponym          same   (usually) 1 : many    dedo: toe, finger
eq_metonymy         same   many/1 : 1            universiteit, universiteitsgebouw: university
eq_diathesis        same   many/1 : 1            raken (cause), raken: hit
eq_generalization   same   many/1 : 1            schoonmaken: clean
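Because every wordnet links to the same ILI-records, equivalents in another language can be found by going through the index. The sketch below assumes a flat list of (language, word, relation, ILI-record) tuples; that layout, the language tags and the function name are illustrative assumptions, and the example words are taken from the table above.

```python
# Hypothetical link list: (language, word, equivalence relation, ILI-record).
LINKS = [
    ("NL", "auto", "eq_synonym", "car"),
    ("FR", "voiture", "eq_synonym", "car"),
    ("NL", "citroenjenever", "eq_hyperonym", "gin"),
    ("ES", "dedo", "eq_hyponym", "toe"),
    ("ES", "dedo", "eq_hyponym", "finger"),
]


def equivalents(word, source_lang, target_lang, links, relation="eq_synonym"):
    """Words in `target_lang` that share an ILI-record with `word`
    through the given equivalence relation."""
    ili_records = {
        ili for lang, w, rel, ili in links
        if lang == source_lang and w == word and rel == relation
    }
    return sorted({
        w for lang, w, rel, ili in links
        if lang == target_lang and rel == relation and ili in ili_records
    })


print(equivalents("auto", "NL", "FR", LINKS))
print(equivalents("citroenjenever", "NL", "FR", LINKS))
```

The second lookup returns nothing because citroenjenever is only linked by eq_hyperonym; a fuller tool would fall back to the weaker relations in the table when no eq_synonym link exists.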
Progress on
restructuring the ILI
Clusters added manually and automatically based on:
• structural properties of WN1.5
• mapping to other sources: Levin’s classes, WN1.6
• cross-lingual mapping
Nouns
Verbs2905
clusters words
1703
1398
1799
5134
word senses
3205
3839
synsets
2895
New ILIs from other wordnets have not yet been added. We estimated
that for verbs hardly any new ILIs are needed, for nouns about 30% of
non-translated concepts (2,000 synsets based on Dutch).
Effects of ILI-clusters
Intersection of ILI-references for Dutch, Spanish,
Italian and English
Nouns: 2895 clustered synsets (4.6% of 62780 WN1.5 noun synsets);
intersection increased from 7736 (23.8%) to 8183 (25.2%) out of the
union of 32520 synsets.
Verbs: 3839 clustered synsets (31.4% of 12215 WN1.5 verb synsets);
intersection increased from 1632 (21.9%) to 3051 (40.9%) out of the
union of 7455 synsets.
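Why clustering raises the intersection can be shown with a small sketch: references to different members of one cluster only coincide once they are mapped to the cluster record. The data and names below are invented for illustration and do not reproduce the project's procedure.

```python
def shared_refs(wordnets, cluster_of=None):
    """ILI-references shared by all wordnets, optionally after replacing each
    reference by the cluster record it belongs to."""
    cluster_of = cluster_of or {}
    normalised = [
        {cluster_of.get(ref, ref) for ref in refs}
        for refs in wordnets.values()
    ]
    return set.intersection(*normalised)


# Toy ILI-references, invented for illustration only.
toy_refs = {
    "NL": {"break: inchoative", "walk"},
    "IT": {"break: causative", "walk"},
}
toy_clusters = {
    "break: inchoative": "break",
    "break: causative": "break",
}
print(sorted(shared_refs(toy_refs)))
print(sorted(shared_refs(toy_refs, toy_clusters)))
```

Without clusters only "walk" is shared; with the diathesis cluster the two "break" senses also meet, which mirrors the larger gain reported for verbs than for nouns.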
Superset of all concepts.
Procedure:
• Initially, the ILI will only contain WordNet1.5 synsets.
• A site that cannot find a proper equivalent among the available ILI-concepts
will link the meaning to another ILI-record using a so-called complex-equivalence
relation and will generate a potential new ILI-record:
    Dutch Meaning:         klunen
    Definition:            to walk on skates
    Complex-equivalence:   has_eq_hyperonym
    Target concept:        walk
• After a building-phase all potentially new ILI-records are collected and
verified for overlap by one site.
• A proposal for updating the ILI is distributed to all sites and has to be verified.
• The ILI is updated and all sites have to reconsider the equivalence relations for
all meanings that can potentially be linked to the new ILI-records.
Filling gaps in the ILI
Types of GAPS
1. genuine, cultural gaps for things not known in English culture,
   e.g. citroenjenever, which is a kind of gin made out of lemon skin
2. pragmatic, in the sense that the concept is known but is not
   expressed by a single lexicalized form in English, e.g.: container,
   borrower, cajera (female cashier)

1. • Non-productive
   • Non-compositional
2. • Productive
   • Compositional
Universality of gaps: Concepts occurring in at least 2 languages
Productive and Predictable Lexicalizations
exhaustively linked to the ILI
[Figure: the wordnet synsets {doodslaanV}NL, {totschlagenV}DE, {doodstampenV}NL, {tottrampelnV}DE, {doodschoppenV}NL, {caissière}NL, {cajeraN}ES and {alevínN}ES are linked to the ILI-records beat, kill, stamp, kick, cashier, female, fish and young. Link labels shown: eq_has_hyperonym and eq_in_state.]
WordNet gaps across languages
(mostly hyperonyms)
                       ILI REFs          ILI Vars
                       Nouns   Verbs     Nouns   Verbs
NL                       491      99       551      82
DE                       109       9       144      10
IT                        45      22        77      66
NL & DE                   10       0         2       0
NL & IT                    6       3         1       0
DE & IT                    5       1         0       0
NL & DE & IT               3       0         0       0
Union Intersections       15       4         3       0
Towards an efficient, condensed and
universal index of sense-distinctions
[Figure: labels shown: WordNet1.5 (90,000 concepts); Metonymy/Generalization clusters; Universal Core meanings; POS Independent; Non-predictable; Universal systematic polysemy and level of granularity; Language and domain specific lexicalizations that do not occur in a large variety of languages; Language specific realizations in grammatical forms; Productive derivations and compounds linked exhaustively.]
The EuroWordNet database
1. The actual wordnets in Flaim database format: an indexing and compression
format of Novell.
2. Polaris (Louw 1997): re-implementation of the Novell ConceptNet toolkit
(Díez-Orzas et al 1995) adapted to the EuroWordNet architecture. It can:
    • import and export wordnets or wordnet selections from/to ASCII files.
    • resolve links for imported concepts.
    • edit and add concepts, variants and relations in the wordnets.
    • access the ILI and ontologies and switch between the wordnets and
      ontologies via the ILI.
    • extract, import and export clusters of senses based on relations.
    • project synsets or clusters from one wordnet to another wordnet.
    • compare clusters of synsets.
    • import new or adapted ILI-records.
    • update ILI-references to the updated ILI.
3. Periscope (Cuypers and Adriaens 1997): a graphical interface for viewing the
EuroWordNet database.
Global Wordnet Association
http://www.globalwordnet.org
provide a standardized framework to link, compare and build complete
wordnets for all the European languages and dialects.
initialize the development of wordnets in non-European languages
develop more specific definitions, tests and procedures for evaluating
and developing wordnets.
extend the specification of EuroWordNet to lexical units which are not
yet covered (adjectives/adverbs, lexicalized phrases and multi-words).
develop (axiomatized) ontologies for Domains and World-Knowledge
that can be shared by all languages via the ILI.
develop an efficient ILI for linking, sharing, consistency checking and
cross-language technology applications. This ILI could function as a
gold-standard of sense-distinctions.
organize an (annual/bi-annual) workshop or conference.
2nd Global Wordnet Conference
• Location: Masaryk University, Brno (Czech Republic)
• January 20 - 23, 2004
• http://www.fi.muni.cz/gwc2004/
Other wordnet initiatives
• Danish
• Norwegian
• Swedish
• Portuguese
• Arabic
• Korean
• Russian
• Welsh
• Basque, Catalan
• Chinese
• BalkaNet
• IndoWordnet
• Meaning
BalkaNet
• Funded by the European Union as project IST-2000-29388.
• 3-year project: 2001 - 2004
• Follows a strict EuroWordNet approach:
    • Expanded set of base concepts
    • Top-down building approach
• EWN database extended with:
    • Greek, Romanian, Serbian, Turkish, Bulgarian, Czech
• Development of new wordnet database system: VisDic
• http://www.ceid.upatras.gr/Balkanet/
IndoWordnet
• Current Wordnet development in India:
    • Hindi and Marathi at IIT Bombay,
    • Tamil at Anna University-K.B Chandrashekhar Research
      Centre (AU-KBC) Chennai and Tamil University Tanjavur,
    • Gujarati at MS University Baroda, Oriya at Utkal
      University Bhubaneswar and Bengali at IIT Kharagpur.
• The Hindi WordNet is at an advanced stage of
development with about 11000 semantically linked
synsets and with associated software and user
interface.
IndoWordnet
• By the end of 2003 each Indian language will create a WordNet of 5000
synsets. These will be for about 2000 most frequent content words in
each language. Use will be made of the wordlist sorted by frequency available with the CIIL.
• Language specific WordNets developed by the following institutions:
    • CIIL, Mysore: Kannada, Kashmiri, Punjabi, Urdu, Himachali, Malayalam.
    • IIT Bombay: Hindi, Marathi and Konkani
    • AU-KBC Chennai and Tamil University Tanjavur: Tamil and Malayalam
    • University of Hyderabad: Telugu
    • University of Baroda: Gujarati
    • Utkal University Bhubaneswar: Oriya
    • IIT Kharagpur: Bengali
• Research groups have to be identified for building the WordNets of
Assamese, Nepali and Languages of the North East.
Meaning
Developing Multilingual Web-scale
Language Technologies
http://www.lsi.upc.es/~nlp/meaning/
Meaning Objectives
• Funded by the European Union as project
IST-2001-34460
• 3-year project: April 2002 - April 2005
• Large-scale (Lexical) Knowledge Bases
    • Automatic enrichment of EWN
    • Mixed approach (KB + ML)
    • Applied to Q/A, CLIR
• Problem
    • structural and lexical ambiguity
Meaning Approach
    • automatic collection of sense examples
      (Leacock et al. 98, Mihalcea & Moldovan 99)
    • Large-scale WSD (Boosting, SVM,
      transductive methods)
    • Large-scale Knowledge Acquisition
      (McCarthy 01, Agirre & Martinez 02)
Meaning
Architecture
[Figure: Meaning architecture. Components shown: a Multilingual Central Repository; the English, Italian, Spanish, Basque and Catalan EWNs, each with UPLOAD and PORT links; and an English, Italian, Spanish, Basque and Catalan Web Corpus, each with ACQ and WSD steps.]
Meaning
WP6: Word Sense Disambiguation
A combination of unsupervised Knowledge-based and supervised Machine
Learning techniques that will provide a high-precision system that is able to
tag running text with word senses
A system that acquires a huge number of examples per word from the web
The use of sophisticated linguistic information, such as syntactic relations,
semantic classes, selectional restrictions, subcategorization information,
domain, etc.
Efficient margin-based Machine Learning algorithms.
Novel algorithms that combine tagged examples with huge amounts of
untagged examples in order to increase the precision of the system.
THE END...