Word Sense Disambiguation

• Many words have multiple meanings
– E.g., river bank vs. financial bank
• Problem: Assign proper sense to each
ambiguous word in text
• Applications:
– Machine translation
– Information retrieval
– Semantic interpretation of text
Tagging?
• Idea: Treat sense disambiguation like POS
tagging, just with “semantic tags”
• The problems differ:
– POS tags depend on specific structural cues
(mostly neighboring tags)
– Senses depend on semantic context, which is less
structured and involves longer-distance dependencies
Approaches
• Supervised learning: learn from a pretagged corpus
• Dictionary-based learning: learn to distinguish senses from dictionary entries
• Unsupervised learning: automatically cluster word occurrences into different senses
Evaluation
• Train and test on pretagged texts
– Difficult to come by
• Artificial data: ‘merge’ two words to form
an ‘ambiguous’ word with two ‘senses’
– E.g., replace all occurrences of door and of
window with doorwindow and see if the system
figures out which is which (see the sketch below)
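As an illustration, here is a minimal sketch of building such pseudoword data in Python, assuming the corpus is given as a list of tokens; the function name and the merged token doorwindow are purely illustrative:

def make_pseudoword_corpus(tokens, w1="door", w2="window", merged="doorwindow"):
    """Replace every occurrence of w1 or w2 with the merged pseudoword,
    keeping the original word as the gold 'sense' label for evaluation."""
    merged_tokens, gold_senses = [], []
    for i, tok in enumerate(tokens):
        if tok in (w1, w2):
            merged_tokens.append(merged)
            gold_senses.append((i, tok))  # position and true sense
        else:
            merged_tokens.append(tok)
    return merged_tokens, gold_senses

A disambiguation system is then run on merged_tokens and its output compared against gold_senses.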
Performance Bounds
• How good is, say, 83.2% accuracy?
• Evaluate performance relative to lower and
upper bounds:
– Baseline performance: how well does the
simplest “reasonable” algorithm do?
– Human performance: what percentage of the
time do people agree on classification?
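As a concrete example of a lower bound (the slide does not name one, so treat this as an assumption), a common baseline is to always predict the sense that is most frequent in the training data:

from collections import Counter

def most_frequent_sense_baseline(train_senses, test_senses):
    """Lower bound: always predict the majority sense from training."""
    majority = Counter(train_senses).most_common(1)[0][0]
    return sum(1 for s in test_senses if s == majority) / len(test_senses)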
Supervised Learning
• Each ambiguous word token w_i of word type w^k in the
training corpus is tagged with a sense from s^k_1, …, s^k_{n_k}
• Each word token occurs in a context c_i
(usually defined as a window around the word
occurrence – up to ~100 words long)
• Each context contains a set of words used as
features v_j
Bayesian Classification
• Bayes decision rule:
Classify s(w_i) = arg max_s P(s | c_i)
• Minimizes probability of error
• How to compute? Use Bayes' theorem:
$$P(s_k \mid c) = \frac{P(c \mid s_k)\,P(s_k)}{P(c)}$$
Bayes’ Classifier (cont.)
• Note that P(c) is constant for all senses,
therefore:
$$\begin{aligned}
s &= \arg\max_{s_k} P(s_k \mid c) \\
  &= \arg\max_{s_k} \frac{P(c \mid s_k)\,P(s_k)}{P(c)} \\
  &= \arg\max_{s_k} P(c \mid s_k)\,P(s_k) \\
  &= \arg\max_{s_k} \left[\log P(c \mid s_k) + \log P(s_k)\right]
\end{aligned}$$
Naïve Bayes
• Assume:
– Features are conditionally independent, given
the example class,
– Feature order doesn’t matter
(bag-of-words model; repetitions are counted)
$$P(c \mid s_k) = P(\{v_j : v_j \in c\} \mid s_k) = \prod_{v_j \in c} P(v_j \mid s_k)$$
$$\log P(c \mid s_k) = \sum_{v_j \in c} \log P(v_j \mid s_k)$$
Naïve Bayes Training
• For all senses s_k of w, do:
– For all words v_j in the vocabulary, do:
$$P(v_j \mid s_k) = \frac{C(v_j, s_k)}{C(s_k)}$$
• For all senses s_k of w, do:
$$P(s_k) = \frac{C(s_k)}{C(w)}$$
(C(v_j, s_k): count of v_j in contexts labeled s_k; C(s_k): count of occurrences of sense s_k; C(w): count of occurrences of w)
Naïve Bayes Classification
• For all senses s_k of w_i, do:
– score(s_k) = log P(s_k)
– For all words v_j in the context window c_i, do:
• score(s_k) += log P(v_j | s_k)
• Choose s(w_i) = arg max_{s_k} score(s_k)  (see the sketch below)
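A minimal sketch of both the training and classification loops above, assuming the training data is a list of (sense, context-word-list) pairs for one ambiguous word; the add-one smoothing is an assumption, not something stated on the slides:

import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_contexts):
    """labeled_contexts: list of (sense, [context words]) pairs for one ambiguous word."""
    sense_counts = Counter()              # C(s_k)
    word_counts = defaultdict(Counter)    # C(v_j, s_k)
    vocab = set()
    for sense, context in labeled_contexts:
        sense_counts[sense] += 1
        for v in context:
            word_counts[sense][v] += 1
            vocab.add(v)
    return sense_counts, word_counts, vocab

def classify(context, sense_counts, word_counts, vocab):
    total = sum(sense_counts.values())
    best, best_score = None, float("-inf")
    for sense in sense_counts:
        score = math.log(sense_counts[sense] / total)          # log P(s_k)
        denom = sum(word_counts[sense].values()) + len(vocab)  # add-one smoothing
        for v in context:
            score += math.log((word_counts[sense][v] + 1) / denom)  # log P(v_j | s_k)
        if score > best_score:
            best, best_score = sense, score
    return best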
Significant Features
• Senses of drug (Gale et al. 1992):
– ‘medication’: prices, prescription, patent, increase, consumer, pharmaceutical
– ‘illegal substance’: abuse, paraphernalia, illicit, alcohol, cocaine, traffickers
Dictionary-Based Disambiguation
Idea: Choose between senses of a word given
in a dictionary based on the words in the
definitions
Cone:
1. A mass of ovule-bearing or pollen-bearing scales in
trees of the pine family or in cycads that are arranged
usually on a somewhat elongated axis
2. Something that resembles a cone in shape: as a crisp
cone-shaped wafer for holding ice cream
Algorithm (Lesk 1986)
Define D_i(w) as the bag of words in the i-th dictionary definition of w
Define E(w) = ∪_i D_i(w)
• For all senses s_k of w, do:
– score(s_k) = similarity(D_k(w), ∪_{v_j in c} E(v_j))
• Choose s = arg max_{s_k} score(s_k)  (see the sketch below)
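A minimal sketch of this simplified Lesk procedure, assuming definitions is a dictionary mapping each word to a list of definition strings and using the matching coefficient from the next slide as the similarity function:

def lesk(word, context_words, definitions):
    """Return the index of the best-scoring sense of `word`."""
    def bag(text):
        return set(text.lower().split())

    # E(v): union of all definition words of each context word
    context_bag = set()
    for v in context_words:
        for d in definitions.get(v, []):
            context_bag |= bag(d)

    best_sense, best_score = 0, -1
    for k, d in enumerate(definitions[word]):
        score = len(bag(d) & context_bag)   # matching coefficient |X ∩ Y|
        if score > best_score:
            best_sense, best_score = k, score
    return best_sense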
Similarity Metrics
similarity(X, Y) can be any of:
– matching coefficient: $|X \cap Y|$
– Dice coefficient: $\dfrac{2\,|X \cap Y|}{|X| + |Y|}$
– Jaccard coefficient: $\dfrac{|X \cap Y|}{|X \cup Y|}$
– overlap coefficient: $\dfrac{|X \cap Y|}{\min(|X|,\,|Y|)}$
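A minimal sketch of these measures, assuming X and Y are Python sets of words:

def matching(X, Y):
    return len(X & Y)

def dice(X, Y):
    return 2 * len(X & Y) / (len(X) + len(Y))

def jaccard(X, Y):
    return len(X & Y) / len(X | Y)

def overlap(X, Y):
    return len(X & Y) / min(len(X), len(Y))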
Simple Example
ash:
s1: a tree of the olive family
s2: the solid residue left when combustible
material is burned
This cigar burns slowly and creates a stiff ash2
The ash1 is one of the last trees to come into leaf
After being struck by lightning the olive tree was reduced to ash?
Some Improvements
• Lesk obtained results of 50-70% accuracy
Possible improvements:
• Run iteratively, each time only using
definitions of “appropriate” senses for
context words
• Expand each word to a set of synonyms,
also using a thesaurus
Thesaurus-Based Disambiguation
• Thesaurus assigns subject codes to
different words, assigning multiple codes to
ambiguous words
• t(s_k) = subject code of sense s_k of word w in
the thesaurus
• δ(t, v) = 1 iff t is a subject code for word v (and 0 otherwise)
Simple Algorithm
• Count the number of context words with the
same subject code:
for each sense s_k of w_i, do:
score(s_k) = Σ_{v_j in c_i} δ(t(s_k), v_j)
s(w_i) = arg max_{s_k} score(s_k)  (see the sketch below)
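A minimal sketch of this counting rule, assuming sense_code maps each sense of the ambiguous word to its thesaurus subject code t(s_k) and subject_codes maps each word to the set of codes assigned to it (both hypothetical inputs):

def thesaurus_disambiguate(context_words, sense_code, subject_codes):
    """sense_code: {sense: subject code t(s_k)};
    subject_codes: {word: set of subject codes}."""
    scores = {}
    for sense, code in sense_code.items():
        # delta(t(s_k), v_j) = 1 iff code is among v_j's subject codes
        scores[sense] = sum(1 for v in context_words
                            if code in subject_codes.get(v, set()))
    return max(scores, key=scores.get)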
Some Issues
• Domain-dependence: In computer
manuals, “mouse” will not be evidence for
topic “mammal”
• Coverage: “Michael Jordan” will not likely
be in a thesaurus, but is an excellent
indicator for topic “sports”
Tuning for a Corpus
• Use a naïve-Bayes formulation:
$$P(t \mid c) = \frac{P(t)\prod_{v_j \in c} P(v_j \mid t)}{\prod_{v_j \in c} P(v_j)}$$
• Initialize the probabilities as uniform
• Re-estimate P(t) and P(v_j | t) for each topic t
and each word v_j by evaluating all contexts
in the corpus, assuming a context has
topic t if P(t | c) > α (where α is a predefined
threshold)
• Disambiguate by choosing the highest-probability
topic
Training (based on the paper):
for all topics t_l, let W_l be the set of words listed
under the topic
for all topics t_l, let T_l = { c(w) : w ∈ W_l }
for all words v_j, let V_j = { c : v_j ∈ c }
for all words v_j and topics t_l, let (with smoothing):
$$P(v_j \mid t_l) = \frac{|V_j \cap T_l|}{\sum_j |V_j \cap T_l|}$$
for all topics t_l, let:
$$P(t_l) = \frac{\sum_j |V_j \cap T_l|}{\sum_l \sum_j |V_j \cap T_l|}$$
for all words v_j, let:
$$P(v_j) = \frac{|V_j|}{N_c} \qquad (N_c \text{ is the total number of contexts})$$
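A minimal sketch of these initial estimates, assuming contexts is a list of word sets and topic_words maps each thesaurus topic to its listed words; the add-one smoothing stands in for the unspecified "with smoothing":

from collections import defaultdict

def initial_estimates(contexts, topic_words):
    """contexts: list of sets of words; topic_words: {topic: iterable of words}."""
    # V_j: indices of contexts containing word v_j
    V = defaultdict(set)
    for i, c in enumerate(contexts):
        for v in c:
            V[v].add(i)

    # T_l: indices of contexts of the words listed under topic t_l
    T = {}
    for t, words in topic_words.items():
        T[t] = set()
        for w in words:
            T[t] |= V.get(w, set())

    # |V_j ∩ T_l| with add-one smoothing
    overlap = {t: {v: len(V[v] & T[t]) + 1 for v in V} for t in T}

    p_v_given_t = {}
    for t in T:
        total = sum(overlap[t].values())                  # Σ_j |V_j ∩ T_l|
        p_v_given_t[t] = {v: n / total for v, n in overlap[t].items()}

    grand_total = sum(sum(o.values()) for o in overlap.values())
    p_t = {t: sum(overlap[t].values()) / grand_total for t in T}
    p_v = {v: len(V[v]) / len(contexts) for v in V}       # P(v_j) = |V_j| / N_c
    return p_t, p_v_given_t, p_v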
Using a Bilingual Corpus
• Use correlations between phrases in two
languages to disambiguate
E.g., interest, with its German translations:
– ‘legal share’ (acquire an interest) → Beteiligung erwerben
– ‘attention’ (show interest) → Interesse zeigen
• Depending on where the translations of related
words occur, determine which sense applies
Scoring
• Given a context c in which a syntactic
relation R(w, v) holds between w and a
context word v:
– The score of sense s_k is the number of contexts c'
in the second-language corpus in which R(w', v') holds,
where w' is a translation of s_k and v' is a
translation of v
– Choose the highest-scoring sense (see the sketch below)
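A minimal sketch of this scoring step, assuming the second-language corpus has already been reduced to a set of (w', v') pairs for which the relation R holds, and that the translation sets are given as inputs (all hypothetical data structures):

def bilingual_score(sense_translations, v_translations, l2_relations):
    """sense_translations: {sense s_k: set of translations w' of the ambiguous word};
    v_translations: set of translations v' of the context word v;
    l2_relations: set of (w', v') pairs with R(w', v') in the second-language corpus."""
    scores = {}
    for sense, w_primes in sense_translations.items():
        scores[sense] = sum(1 for (wp, vp) in l2_relations
                            if wp in w_primes and vp in v_translations)
    return max(scores, key=scores.get)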
Global Constraints
• One sense per discourse: The sense of a
word tends to be consistent within a given
document
• One sense per collocation: Nearby words
provide strong and consistent clues for
sense, depending on distance, order, and
syntactic relationship
Examples
D1 (sense of plant: context)
living: existence of plant and animal life
living: classified as either plant or animal
???: bacterial and plant cells are enclosed
D2 (sense of plant: context)
living: contains varied plant and animal life
living: the most common plant life
factory: protected by plant parts remaining
Collocations
• Collocation: a pair of words that tend to occur
together
• Rank collocations that occur with different senses
by:
$$\mathrm{rank}(f) = \frac{P(s_{k_1} \mid f)}{P(s_{k_2} \mid f)}$$
• A higher rank for a collocation f means it is more
highly indicative of its sense
for all senses s_k of w, do:
  F_k = all collocations in s_k's dictionary definition
  E_k = { }
while at least one E_k changed, do:
  for all senses s_k of w, do:
    E_k = { c_i : there exists f_m ∈ c_i with f_m ∈ F_k }
  for all senses s_k of w, do:
    F_k = { f_m : P(s_k | f_m) / P(s_n | f_m) > α } (α a threshold)
for all documents d, do:
  determine the majority sense s* of w in d
  assign all occurrences of w in d to s*
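A minimal sketch of this bootstrapping loop plus the one-sense-per-discourse step, assuming contexts is a list of (document id, set of collocations) pairs and seed_collocations gives the initial F_k from the dictionary definitions; the ratio test is approximated by a simple count comparison with threshold alpha (an assumption):

from collections import Counter, defaultdict

def bootstrap_senses(contexts, seed_collocations, alpha=2.0, max_iters=10):
    """contexts: list of (doc_id, set of collocations) for one ambiguous word w.
    seed_collocations: {sense: set of collocations taken from its definition}."""
    F = {s: set(f) for s, f in seed_collocations.items()}
    E = {s: set() for s in F}

    for _ in range(max_iters):
        # E_k: contexts containing at least one collocation from F_k
        new_E = {s: {i for i, (_, colls) in enumerate(contexts) if colls & F[s]}
                 for s in F}
        if new_E == E:          # stop when no E_k changed
            break
        E = new_E
        # F_k: collocations much more indicative of sense k than of the other senses
        counts = {s: Counter(f for i in E[s] for f in contexts[i][1]) for s in E}
        for s in F:
            F[s] = {f for f, n in counts[s].items()
                    if all(n > alpha * counts[o][f] for o in counts if o != s)}

    # one sense per discourse: assign the majority sense within each document
    doc_votes = defaultdict(Counter)
    for s, idxs in E.items():
        for i in idxs:
            doc_votes[contexts[i][0]][s] += 1
    return {doc: votes.most_common(1)[0][0] for doc, votes in doc_votes.items()}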