Transcript wsd.ppt

Lecture 19
Word Sense Disambiguation
CS 4705
Overview
• Selectional restriction based approaches
• Robust techniques
– Machine Learning
• Supervised
• Unsupervised
– Dictionary-based techniques
Disambiguation via Selectional Restrictions
• Eliminates ambiguity by eliminating ill-formed
semantic representations much as syntactic parsing
eliminates ill-formed syntactic analyses
– Different verbs select for different thematic roles
wash the dishes (takes washable-thing as patient)
serve delicious dishes (takes food-type as patient)
• Method: rule-to-rule syntactico-semantic analysis
– Semantic attachment rules are applied as sentences are
syntactically parsed
– Selectional restriction violation: no parse
• Requires:
– Selectional restrictions for each sense of each predicate
– Hierarchical type information about each argument (a la
WordNet)
• Limitations:
– Sometimes not sufficiently constraining to
disambiguate (Which dishes do you like?)
– Violations that are intentional (Eat dirt, worm!)
– Metaphor and metonymy
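To make the check itself concrete, here is a minimal sketch of testing a selectional restriction against WordNet's hypernym hierarchy. It assumes NLTK's WordNet interface; the particular synsets chosen as required types are illustrative, not part of the lecture.
```python
# Sketch only: check a selectional restriction by walking WordNet hypernyms.
# Assumes NLTK and its WordNet data are installed; synset choices are illustrative.
from nltk.corpus import wordnet as wn

def satisfies_restriction(arg_synset, required_synset):
    """True if required_synset dominates arg_synset in the hypernym hierarchy."""
    closure = set(arg_synset.closure(lambda s: s.hypernyms()))
    return required_synset == arg_synset or required_synset in closure

# "wash the dishes" wants a washable artifact; "serve delicious dishes" wants food.
artifact = wn.synset('artifact.n.01')
food = wn.synset('food.n.01')

for sense in wn.synsets('dish', pos=wn.NOUN):
    print(sense.name(),
          'artifact:', satisfies_restriction(sense, artifact),
          'food:', satisfies_restriction(sense, food))
```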
Selectional Restrictions as Preferences
• Resnik ‘97, ‘98’s selectional association:
– Probabilistic measure of strength of association
between predicate and class dominating argument
– Derive predicate/argument relations from tagged corpus
– Derive hyponymy relations from WordNet
– Selects sense with highest selectional association
between an ancestor and predicate (44% correct)
Brian ate the dish.
• WN: dish is a kind of crockery and a kind of food
• tagged corpus counts: ate/<crockery> vs. ate/<food>
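A rough sketch of how the sense choice could be made from such counts is below; the counts are invented and the scoring function is a simplification of Resnik's full selectional-association measure.
```python
# Illustrative sketch of a Resnik-style selectional association; the corpus
# counts and the class inventory below are invented for the example.
import math

# counts of (predicate, argument-class) pairs from a (hypothetical) tagged corpus
pair_counts = {('eat', 'food'): 480, ('eat', 'crockery'): 20}
class_counts = {'food': 5000, 'crockery': 3000}        # overall class frequencies
total_classes = sum(class_counts.values())
total_pairs = sum(pair_counts.values())

def association(pred, cls):
    """P(c|p) * log(P(c|p) / P(c)); a simplified stand-in for Resnik's measure."""
    p_c_given_p = pair_counts.get((pred, cls), 0) / total_pairs
    p_c = class_counts[cls] / total_classes
    return 0.0 if p_c_given_p == 0 else p_c_given_p * math.log(p_c_given_p / p_c)

# "Brian ate the dish": dish has senses dominated by <crockery> and by <food>
senses = {'dish/crockery': 'crockery', 'dish/food': 'food'}
best = max(senses, key=lambda s: association('eat', senses[s]))
print(best)   # expected: dish/food
```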
Machine Learning Approaches
• Learn a classifier to assign one of possible word
senses for each word
– Acquire knowledge from labeled or unlabeled corpus
– Human intervention only in labeling corpus and
selecting set of features to use in training
• Input: feature vectors
– Target (dependent variable)
– Context (set of independent variables)
• Output: classification rules for unseen text
Input Features for WSD
• POS tags of target and neighbors
• Surrounding context words (stemmed or not)
• Partial parsing to identify thematic/grammatical
roles and relations
• Collocational information:
– How likely are target and left/right neighbor to co-occur
Is the bass fresh today?
[w-2, w-2/pos, w-1, w-1/pos, w+1, w+1/pos, w+2, w+2/pos, …]
[is, V, the, DET, fresh, JJ, today, N, …]
• Co-occurrence of neighboring words
– How often does sea or words with root sea (e.g.
seashore, seafood, seafaring) occur in a window of size
N
– How to choose which words to count?
• The M most frequent content words occurring within a
window of size N in the training data
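A minimal sketch of extracting both kinds of features for a target occurrence is below; the hand-supplied tokens and POS tags, the window sizes, and the chosen vocabulary are assumptions made for the example.
```python
# Sketch: collocational and co-occurrence features for a target word.
# Tokens and POS tags are hard-coded here; a real system would run a tagger.
tokens = ['is', 'the', 'bass', 'fresh', 'today']
tags   = ['V',  'DET', 'N',   'JJ',    'N']
target_index = 2   # position of "bass"

def collocational_features(i, k=2):
    """Words and POS tags at fixed offsets -k..+k around the target."""
    feats = {}
    for offset in range(-k, k + 1):
        if offset == 0 or not (0 <= i + offset < len(tokens)):
            continue
        feats[f'w{offset:+d}'] = tokens[i + offset]
        feats[f'pos{offset:+d}'] = tags[i + offset]
    return feats

def cooccurrence_features(i, vocab, window=10):
    """Counts of selected content words anywhere in a window around the target."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    window_words = tokens[lo:i] + tokens[i + 1:hi]
    return {f'has({w})': window_words.count(w) for w in vocab}

print(collocational_features(target_index))
print(cooccurrence_features(target_index, vocab=['sea', 'fish', 'guitar', 'fresh']))
```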
Supervised Learning
• Training and test sets with words labeled as to
correct sense (It was the biggest [fish: bass] I’ve
seen.)
– Obtain independent vars automatically (POS, co-occurrence information, etc.)
– Run classifier on training data
– Test on test data
– Result: Classifier for use on unlabeled data
Types of Classifiers
• Naïve Bayes
– Choose ŝ = argmax_{s ∈ S} P(s|V) = argmax_{s ∈ S} P(V|s) P(s) / P(V)
– Where s is one of the possible senses and V is the input
vector of features
– Assume the features are independent, so P(V|s) is the
product of the probabilities of each feature given s:
P(V|s) = ∏_{j=1..n} P(v_j|s), and P(V) is the same for every s
– If P(s) is the prior, then
ŝ = argmax_{s ∈ S} P(s) ∏_{j=1..n} P(v_j|s)
• Decision lists:
– like case statements, applying tests to the input in turn
fish within window      --> bass1
striped bass            --> bass1
guitar within window    --> bass2
bass player             --> bass2
…
– Yarowsky '96's approach orders the tests by their individual
accuracy on the entire training set, based on the log-likelihood
ratio
Abs( Log( P(Sense1 | f_i = v_j) / P(Sense2 | f_i = v_j) ) )
• Bootstrapping I
– Start with a few labeled instances of target item as
seeds to train initial classifier, C
– Use high confidence classifications of C on unlabeled
data as training data
– Iterate
• Bootstrapping II
– Start with sentences containing words strongly
associated with each sense (e.g. sea and music for
bass), chosen intuitively, from a corpus, or from
dictionary entries
– One Sense per Discourse hypothesis: a word tends to be
used with only one sense within a given discourse, so
confident labels can be propagated to other occurrences
in the same document
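One plausible shape of the bootstrapping loop is sketched below; the confidence threshold, the stopping rule, and the train/predict callbacks standing in for the base classifier are all assumptions.
```python
# Bootstrapping sketch: grow a labeled set from a few seed instances.
# train() and predict_with_confidence() stand in for any base classifier
# (e.g. the Naive Bayes or decision-list learners sketched above).

def bootstrap(seeds, unlabeled, train, predict_with_confidence,
              threshold=0.95, max_rounds=10):
    """Iteratively label high-confidence examples and retrain."""
    labeled = list(seeds)          # (example, sense) pairs
    pool = list(unlabeled)         # unlabeled examples
    for _ in range(max_rounds):
        classifier = train(labeled)
        newly_labeled, remaining = [], []
        for example in pool:
            sense, confidence = predict_with_confidence(classifier, example)
            if confidence >= threshold:
                newly_labeled.append((example, sense))
            else:
                remaining.append(example)
        if not newly_labeled:      # nothing confident enough: stop iterating
            break
        labeled.extend(newly_labeled)
        pool = remaining
    return train(labeled)
```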
Unsupervised Learning
• Cluster automatically derived feature vectors to
‘discover’ word senses using some similarity
metric
– Represent each cluster as average of feature vectors it
contains
– Label clusters by hand with known senses
– Classify unseen instances by proximity to these known
and labeled clusters
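A minimal sketch of the clustering step, using plain k-means over toy two-dimensional context vectors; the choice of k and the distance metric are assumptions (and choosing k is itself one of the evaluation problems noted below).
```python
# Unsupervised sketch: cluster feature vectors, then label clusters by hand.
# Pure-Python k-means with a fixed k; the toy vectors are invented.
import random

def kmeans(vectors, k, iters=20, seed=0):
    random.seed(seed)
    centroids = random.sample(vectors, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(v, centroids[i])))
            clusters[nearest].append(v)
        centroids = [tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# toy 2-dimensional context vectors for occurrences of "bass"
vectors = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
centroids, clusters = kmeans(vectors, k=2)
# a human would now inspect each cluster and attach a known sense label;
# unseen instances are classified by proximity to the labeled centroids
print(centroids)
```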
• Evaluation problem
– What are the ‘right’ senses?
– Cluster impurity
– How do you know how many clusters to create?
– Some clusters may not map to ‘known’ senses
Dictionary Approaches
• Problem of scale for all ML approaches
– Must build a separate classifier for each sense ambiguity
(i.e. for every ambiguous word)
• Machine readable dictionaries (Lesk ‘86)
– Retrieve all definitions of content words in context of
target
– Compare for overlap with sense definitions of target
– Choose sense with most overlap
• Limitations
– Entries are short --> expand entries to ‘related’ words
using subject codes
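A sketch of the overlap idea in its simplified form is below: it compares the context words directly against abbreviated glosses of the target's senses, rather than against full definitions of every content word in the context, and the glosses are paraphrased by hand rather than drawn from a real machine-readable dictionary.
```python
# Simplified Lesk sketch: pick the target sense whose gloss overlaps most
# with the content words of the surrounding context.
# Glosses are shortened by hand, not quoted from any real dictionary.

sense_glosses = {
    'bass_fish':  'lean-fleshed saltwater or freshwater fish caught for food',
    'bass_music': 'lowest part in polyphonic music or the instrument that plays it',
}

STOPWORDS = {'the', 'a', 'an', 'of', 'or', 'in', 'for', 'that', 'it', 'off'}

def content_words(text):
    return {w for w in text.lower().split() if w not in STOPWORDS}

def simplified_lesk(context_sentence):
    ctx = content_words(context_sentence)
    return max(sense_glosses,
               key=lambda s: len(content_words(sense_glosses[s]) & ctx))

print(simplified_lesk('the fishermen caught a huge bass off the coast'))  # bass_fish
```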