Transcript Document
IJCNLP2008 Jan 10, 2008
Gloss-based Semantic Similarity Metrics for Predominant Sense Acquisition
Ryu Iida
Nara Institute of Science and Technology
Diana McCarthy and Rob Koeling
University of Sussex
Word Sense Disambiguation
- Predominant sense acquisition: exploited as a powerful back-off strategy for word sense disambiguation
- McCarthy et al. (2004): achieved 64% precision on the Senseval-2 all-words task
- Strongly relies on linguistic resources such as WordNet for calculating the semantic similarity
- Difficulty: porting it to other languages
Focus
- How to calculate the semantic similarity score without semantic relations such as hyponymy
- Explore the potential use of word definitions (glosses) instead of WordNet-style resources for porting McCarthy et al.'s method to other languages
Table of contents
1. Task
2. Related work: McCarthy et al. (2004)
3. Gloss-based semantic similarity metrics
4. Experiments: WSD on the two datasets, EDR and the Japanese Senseval-2 task
5. Conclusion and future directions
Word Sense Disambiguation (WSD) task
- Select the correct sense of a word appearing in context
  "I ate fried chicken last Sunday."

  sense id | gloss
  1        | a common farm bird that is kept for its meat and eggs
  2        | the meat from this bird eaten as food
  3        | (informal) someone who is not at all brave
  4        | a game in which children must do something dangerous to show that they are brave

- Supervised approaches, which learn the sense from the context, have mainly been applied
Word Sense Disambiguation (WSD) task (cont'd)
- Estimate the predominant sense of a word regardless of its context
- English coarse-grained all-words task (SemEval-2007):
  - choosing the most frequent senses: 78.9%
  - best performing system: 82.5%
- Systems using a first-sense heuristic have relied on sense-tagged data
- However, sense-tagged data is expensive
McCarthy et al. (2004)'s unsupervised approach
- Extract the top N neighbour words of the target word according to the distributional similarity score (simds)
- Calculate the prevalence score of each sense:
  - weight the simds of each neighbour by the semantic similarity score (simss) between the sense and that neighbour
  - sum up the weighted simds over the top N neighbours
  - semantic similarity is estimated from linguistic resources (e.g. WordNet)
- Output the sense with the maximum prevalence score (sketched below)
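
A minimal Python sketch of this scoring loop. The names are hypothetical: `neighbours` is assumed to map each of the top-N neighbours to its simds score, and `sim_ss` stands in for whichever semantic similarity metric is plugged in. Following the slides' simplified weighting, no normalisation of simss over the senses is applied:

    def predominant_sense(senses, neighbours, sim_ss):
        # prevalence(sense) = sum over the top-N neighbours of
        # simds(target, neighbour) * simss(sense, neighbour)
        prevalence = {
            sense: sum(sim_ds * sim_ss(sense, n)
                       for n, sim_ds in neighbours.items())
            for sense in senses
        }
        # output the sense with the maximum prevalence score
        return max(prevalence, key=prevalence.get)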
McCarthy et al. (2004)'s approach: An example

chicken
sense2: the meat from this bird eaten as food.
sense3: informal someone who is not at all brave.

neighbour | simds  | simss(word, sense2) | weighted simds
turkey    | 0.1805 | 0.15                | 0.0271
meat      | 0.1781 | 0.20                | 0.0365
...       | ...    | ...                 | ...
tomato    | 0.1573 | 0.10                | 0.0157

(simds: distributional similarity score; simss: semantic similarity score, from WordNet)

prevalence(sense2) = 0.0271 + 0.0365 + ... + 0.0157
                   = 0.152
McCarthy et al. (2004)'s approach: An example

chicken
sense2: the meat from this bird eaten as food.
sense3: informal someone who is not at all brave.

neighbour | simds  | simss(word, sense3) | weighted simds
turkey    | 0.1805 | 0.01                | 0.0018
meat      | 0.1781 | 0.02                | 0.0037
...       | ...    | ...                 | ...
tomato    | 0.1573 | 0.01                | 0.0016

prevalence(sense2) = 0.152
prevalence(sense3) = 0.0018 + 0.0037 + ... + 0.0016
                   = 0.023

prevalence(sense2) > prevalence(sense3)
predominant sense: sense2
Problem
- While McCarthy et al.'s method works well for English, other inventories do not always have WordNet-style resources to tie the nearest neighbours to the sense inventory
- While traditional dictionaries do not organise senses into synsets, they do typically have sense definitions (glosses) associated with the senses
Gloss-based similarity
- Calculate the similarity between two glosses in a dictionary as the semantic similarity
- simlesk: simply calculate the overlap of the content words in the glosses of the two word senses
- simDSlesk: use distributional similarity as an approximation of the semantic distance between the words in the two glosses
lesk: Example

word    | gloss
chicken | the meat from this bird eaten as food
turkey  | the meat from a turkey eaten as food

simlesk(chicken, turkey) = 2
("meat" and "food" overlap in the two glosses)
lesk: Example

word    | gloss
chicken | the meat from this bird eaten as food
tomato  | a round soft red fruit eaten raw or cooked as a vegetable

simlesk(chicken, tomato) = 0
(no overlap between the two glosses)
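
A minimal sketch of simlesk in Python. To keep it self-contained, I assume the content words of each gloss have already been extracted (e.g. by a POS tagger); with the noun sets from the slides' examples this reproduces the two results above:

    def sim_lesk(content_a, content_b):
        # overlap of the content words in the glosses of two word senses
        return len(set(content_a) & set(content_b))

    chicken = ["meat", "bird", "food"]    # the meat from this bird eaten as food
    turkey = ["meat", "turkey", "food"]   # the meat from a turkey eaten as food
    tomato = ["fruit", "vegetable"]       # a round soft red fruit eaten raw or cooked as a vegetable

    print(sim_lesk(chicken, turkey))  # -> 2 ("meat" and "food")
    print(sim_lesk(chicken, tomato))  # -> 0 (no overlap)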
DSlesk
- Calculate distributional similarity scores for all pairs of nouns in the two glosses:
  simds(meat, fruit) = 0.1625, simds(meat, vegetable) = 0.1843,
  simds(bird, fruit) = 0.1001, simds(bird, vegetable) = 0.0717,
  simds(food, fruit) = 0.1857, simds(food, vegetable) = 0.1772
- Output the average, over the nouns in the gloss of the target word sense, of each noun's maximum distributional similarity:
  simDSlesk(chicken, tomato) = 1/3 (0.1843 + 0.1001 + 0.1857) = 0.1557
DSlesk
\[
\mathrm{sim}_{\mathrm{DSlesk}}(ws_i, n) = \max_{ws_j \in WS(n)} \mathrm{sim}(ws_i, ws_j) \approx \max_{ws_j \in WS(n)} \mathrm{sim}(g_i, g_j)
\]
\[
\mathrm{sim}(g_i, g_j) = \frac{1}{|g_i|} \sum_{a \in g_i} \max_{b \in g_j} \mathrm{sim}_{ds}(a, b)
\]
where WS(n) is the set of word senses of neighbour n, g_i is the gloss of word sense ws_i (treated as the set of nouns appearing in it), and a (b) ranges over the nouns in g_i (g_j)
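
A direct transcription of these formulas into Python (the names are mine; `sim_ds` is assumed to return the distributional similarity of two nouns, and each gloss is represented by the list of nouns appearing in it):

    def gloss_sim(nouns_i, nouns_j, sim_ds):
        # sim(g_i, g_j): for each noun a in g_i take its best match
        # max_b simds(a, b) over the nouns b in g_j, then average
        if not nouns_i or not nouns_j:
            return 0.0
        return sum(max(sim_ds(a, b) for b in nouns_j)
                   for a in nouns_i) / len(nouns_i)

    def sim_dslesk(target_nouns, neighbour_sense_glosses, sim_ds):
        # sim_DSlesk(ws_i, n): maximise gloss similarity over the senses ws_j of n
        return max(gloss_sim(target_nouns, g_j, sim_ds)
                   for g_j in neighbour_sense_glosses)

With the pairwise scores listed on the previous slide, gloss_sim(["meat", "bird", "food"], ["fruit", "vegetable"], sim_ds) averages the per-noun maxima 0.1843, 0.1001 and 0.1857, as in the chicken/tomato example.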
Apply gloss-based similarity to McCarthy et al.'s approach

chicken
sense2: the meat from this bird eaten as food.
sense3: informal someone who is not at all brave.

neighbour | simds  | simDSlesk(word, sense2) | weighted simds
turkey    | 0.1805 | 0.3453                  | 0.0623
meat      | 0.1781 | 0.2323                  | 0.0414
...       | ...    | ...                     | ...
tomato    | 0.1573 | 0.1557                  | 0.0245

prevalence(sense2) = 0.0623 + 0.0414 + ... + 0.0245
                   = 0.2387
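
Tying the two earlier sketches together with the numbers shown above (only the three visible neighbours are included, so this reproduces the three weighted terms but not the full sum over all 50 neighbours):

    # simds of the three neighbours shown above (out of the top 50)
    neighbours = {"turkey": 0.1805, "meat": 0.1781, "tomato": 0.1573}
    # simDSlesk(neighbour, sense2), taken from the table
    dslesk_sense2 = {"turkey": 0.3453, "meat": 0.2323, "tomato": 0.1557}

    partial = sum(sim_ds * dslesk_sense2[n]
                  for n, sim_ds in neighbours.items())
    # = 0.0623 + 0.0414 + 0.0245, the visible part of prevalence(sense2) = 0.2387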
Table of contents
1. Task
2. Related work: McCarthy et al. (2004)
3. Gloss-based semantic similarity metrics
4. Experiments: WSD on the two datasets, EDR and the Japanese Senseval-2 task
5. Conclusion and future directions
Experiment 1: EDR
- Dataset: EDR corpus
  - 3,836 polysemous nouns (183,502 instances)
- Adopt the similarity score proposed by Lin (1998) as the distributional similarity score
  - computed from 9 years of Mainichi newspaper articles and 10 years of Nikkei newspaper articles
  - parsed with the Japanese dependency parser CaboCha (Kudo and Matsumoto, 2002)
- Use the 50 nearest neighbours, in line with McCarthy et al. (2004)
Methods
- Baseline: select one word sense at random for each word token and average the precision over 100 trials
- Unsupervised: McCarthy et al. (2004), with three semantic similarity metrics: Jiang and Conrath (1997) (jcn), lesk, DSlesk
- Supervised (majority): use hand-labelled training data to obtain the predominant sense of the test words
Results: EDR

method      | recall | precision
baseline    | 0.402  | 0.402
jcn         | 0.495  | 0.495
lesk        | 0.474  | 0.488
DSlesk      | 0.495  | 0.495
upper-bound | 0.745  | 0.745
supervised  | 0.731  | 0.731

DSlesk is comparable to jcn, without the requirement for semantic relations such as hyponymy
Results: EDR (cont'd)

method      | all   | freq ≤ 10 | freq ≤ 5
baseline    | 0.402 | 0.405     | 0.402
jcn         | 0.495 | 0.445     | 0.431
lesk        | 0.474 | 0.448     | 0.426
DSlesk      | 0.495 | 0.453     | 0.433
upper-bound | 0.745 | 0.674     | 0.639
supervised  | 0.731 | 0.519     | 0.367

All methods for finding a predominant sense outperform the supervised one for items with little data (freq ≤ 5), indicating that these methods work robustly even for low-frequency data, where hand-tagged data is unreliable
Experiment 2 and Results: Senseval-2 in Japanese
- 50 nouns (5,000 instances)
- 4 methods: lesk, DSlesk, baseline, supervised

precision = recall:
method      | fine-grained | coarse-grained
baseline    | 0.282        | 0.399
lesk        | 0.344        | 0.501
DSlesk      | 0.386        | 0.747
upper-bound | 0.593        | 0.834
supervised  | 0.742        | 0.842

(sense-id example: 105-0-0-2-0; the full id distinguishes fine-grained senses, while coarse-grained evaluation groups senses sharing the leading part)
Conclusion
- We examined different measures of semantic similarity for automatically finding a first-sense heuristic for WSD in Japanese
- We defined a new gloss-based similarity (DSlesk) and evaluated it on two Japanese WSD datasets (EDR and Senseval-2); it outperformed lesk and achieved performance comparable to the jcn method, which relies on hyponym links that are not always available
Future directions
- Explore other information in the glosses, such as words of other parts of speech and predicate-argument relations
- Group fine-grained word senses into clusters, making the task suitable for NLP applications (Ide and Wilks, 2006)
- Use the results of predominant sense acquisition as prior knowledge for other approaches, e.g. graph-based approaches (Mihalcea 2005, Nastase 2008)