Transcript Notebook

Part 4: Supervised Methods of Word Sense Disambiguation
Outline
• What is Supervised Learning?
• Task Definition
• Single Classifiers
– Naïve Bayesian Classifiers
– Decision Lists and Trees
• Ensembles of Classifiers
What is Supervised Learning?
• Collect a set of examples that illustrate the various possible
classifications or outcomes of an event.
• Identify patterns in the examples associated with each
particular class of the event.
• Generalize those patterns into rules.
• Apply the rules to classify a new event.
Learn from these examples:
“when do I go to the store?”

Day | Go to Store? | Hot Outside? | Slept Well? | Ate Well?
 1  | YES          | YES          | NO          | NO
 2  | NO           | YES          | NO          | YES
 3  | YES          | NO           | NO          | NO
 4  | NO           | NO           | NO          | YES
Outline
• What is Supervised Learning?
• Task Definition
• Single Classifiers
– Naïve Bayesian Classifiers
– Decision Lists and Trees
• Ensembles of Classifiers
Task Definition
• Supervised WSD: Class of methods that induces a classifier from
manually sense-tagged text using machine learning techniques.
• Resources
– Sense Tagged Text
– Dictionary (implicit source of sense inventory)
– Syntactic Analysis (POS tagger, Chunker, Parser, …)
• Scope
– Typically one target word per context
– Part of speech of target word resolved
– Lends itself to “lexical sample” formulation
• Reduces WSD to a classification problem where a target word is
assigned the most appropriate sense from a given set of possibilities
based on the context in which it occurs
Sense Tagged Text
Bonnie and Clyde are two really famous criminals, I think
they were bank/1 robbers
My bank/1 charges too much for an overdraft.
I went to the bank/1 to deposit my check and get a new ATM
card.
The University of Minnesota has an East and a West Bank/2
campus right on the Mississippi River.
My grandfather planted his pole in the bank/2 and got a great
big catfish!
The bank/2 is pretty muddy, I can’t walk there.
Two Bags of Words
(Co-occurrences in the “window of context”)
FINANCIAL_BANK_BAG:
a an and are ATM Bonnie card charges check Clyde
criminals deposit famous for get I much My new overdraft
really robbers the they think to too two went were
RIVER_BANK_BAG:
a an and big campus cant catfish East got grandfather great
has his I in is Minnesota Mississippi muddy My of on planted
pole pretty right River The the there University walk West
Simple Supervised Approach
Given a sentence S containing “bank”:

def disambiguate_bank(S, financial_bag, river_bag):
    sense_1 = sense_2 = 0
    for w in S.split():              # each word Wi in S
        if w in financial_bag:
            sense_1 += 1
        if w in river_bag:
            sense_2 += 1
    if sense_1 > sense_2:
        print("Financial")
    elif sense_2 > sense_1:
        print("River")
    else:
        print("Can't Decide")
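A quick usage sketch, with the two bags from the previous slide abbreviated to a few illustrative words:

financial_bag = {"ATM", "check", "deposit", "overdraft", "robbers", "charges"}
river_bag = {"catfish", "muddy", "pole", "campus", "grandfather", "Mississippi"}

# "deposit" and "check" hit the financial bag, so this prints "Financial"
disambiguate_bank("I went to the bank to deposit my check", financial_bag, river_bag)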
Supervised Methodology
• Create a sample of training data where a given target word
is manually annotated with a sense from a predetermined
set of possibilities.
– One tagged word per instance/lexical sample disambiguation
• Select a set of features with which to represent context.
– co-occurrences, collocations, POS tags, verb-obj relations, etc...
• Convert sense-tagged training instances to feature vectors (see the sketch after this list).
• Apply a machine learning algorithm to induce a classifier.
– Form – structure or relation among features
– Parameters – strength of feature interactions
• Convert a held out sample of test data into feature vectors.
– “correct” sense tags are known but not used
• Apply classifier to test instances to assign a sense tag.
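As a minimal sketch of the feature-vector conversion step, assuming binary bag-of-words features over a fixed vocabulary (the names and toy vocabulary below are illustrative):

def to_feature_vector(context_words, vocabulary):
    # one binary feature per vocabulary word:
    # 1 if the word appears in the context of the target, else 0
    return [1 if w in context_words else 0 for w in vocabulary]

vocabulary = ["credit", "deposit", "muddy", "river"]
print(to_feature_vector({"my", "credit", "union"}, vocabulary))  # [1, 0, 0, 0]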
Outline
• What is Supervised Learning?
• Task Definition
• Naïve Bayesian Classifier
• Decision Lists and Trees
• Ensembles of Classifiers
Naïve Bayesian Classifier
• Naïve Bayesian Classifier well known in Machine
Learning community for good performance across a range
of tasks (e.g., Domingos and Pazzani, 1997)
…Word Sense Disambiguation is no exception
• Assumes conditional independence among features, given
the sense of a word.
– The form of the model is assumed, but parameters are estimated
from training instances
• When applied to WSD, features are often “a bag of words”
that come from the training data
– Usually thousands of binary features that indicate if a word is
present in the context of the target word (or not)
Bayesian Inference
p(S \mid F_1, F_2, \ldots, F_n) = \frac{p(F_1, F_2, \ldots, F_n \mid S) \, p(S)}{p(F_1, F_2, \ldots, F_n)}
• Given observed features, what is the most likely sense?
• Estimate probability of observed features given sense
• Estimate unconditional probability of sense
• Unconditional probability of features is a normalizing term; it doesn’t affect sense classification
Naïve Bayesian Model
[Graphical model: the sense S is the parent of the features F1, F2, F3, F4, …, Fn]

p(F_1, F_2, \ldots, F_n \mid S) = p(F_1 \mid S) \, p(F_2 \mid S) \cdots p(F_n \mid S)
The Naïve Bayesian Classifier
sense = \operatorname{argmax}_{S \in \text{senses}} \; p(F_1 \mid S) \cdots p(F_n \mid S) \, p(S)
– Given 2,000 instances of “bank”, 1,500 for bank/1 (financial sense)
and 500 for bank/2 (river sense)
• P(S=1) = 1,500/2,000 = .75
• P(S=2) = 500/2,000 = .25
– Given “credit” occurs 200 times with bank/1 and 4 times with bank/2.
• P(F1=“credit”) = 204/2,000 = .102
• P(F1=“credit”|S=1) = 200/1,500 = .133
• P(F1=“credit”|S=2) = 4/500 = .008
– Given a test instance that has one feature “credit”
• P(S=1|F1=“credit”) = .133*.75/.102 = .978
• P(S=2|F1=“credit”) = .008*.25/.102 = .020
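The arithmetic above is easy to check directly; a minimal sketch using the counts from the slide (the slide's .978 reflects rounding p(F1|S=1) to .133 before multiplying):

# 2,000 instances of "bank": 1,500 tagged bank/1 and 500 tagged bank/2;
# "credit" occurs in 200 of the bank/1 and 4 of the bank/2 instances
p_s1, p_s2 = 1500 / 2000, 500 / 2000            # priors
p_credit = 204 / 2000                           # normalizing term
p_credit_s1, p_credit_s2 = 200 / 1500, 4 / 500  # conditionals
print(p_credit_s1 * p_s1 / p_credit)            # ≈ 0.980 (slide: .978)
print(p_credit_s2 * p_s2 / p_credit)            # ≈ 0.020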
Comparative Results
• (Leacock et al., 1993) compared Naïve Bayes with a
Neural Network and a Context Vector approach when
disambiguating six senses of line…
• (Mooney, 1996) compared Naïve Bayes with a Neural
Network, Decision Tree/List Learners, Disjunctive and
Conjunctive Normal Form learners, and a perceptron when
disambiguating six senses of line…
• (Pedersen, 1998) compared Naïve Bayes with Decision
Tree, Rule Based Learner, Probabilistic Model, etc. when
disambiguating line and 12 other words…
• …All found that the Naïve Bayesian Classifier performed as well as any of the other methods!
Outline
• What is Supervised Learning?
• Task Definition
• Naïve Bayesian Classifiers
• Decision Lists and Trees
• Ensembles of Classifiers
Decision Lists and Trees
• Very widely used in Machine Learning.
• Decision trees were used very early in WSD research (e.g., Kelly and Stone, 1975; Black, 1988).
• Represent disambiguation problem as a series of questions
(presence of feature) that reveal the sense of a word.
– List decides between two senses after one positive answer
– Tree allows for decision among multiple senses after a series of
answers
• Uses a smaller, more refined set of features than “bag of
words” and Naïve Bayes.
– More descriptive and easier to interpret.
Decision List for WSD (Yarowsky, 1994)
• Identify collocational features from sense tagged data.
• Word immediately to the left or right of target:
– I have my bank/1 statement.
– The river bank/2 is muddy.
• Pair of words to immediate left or right of target:
– The world’s richest bank/1 is here in New York.
– The river bank/2 is muddy.
• Words found within k positions to left or right of target, where k is often 10-50:
– My credit is just horrible because my bank/1 has made several
mistakes with my account and the balance is very low.
Building the Decision List
• Sort the collocation tests by the absolute value of the log ratio of the conditional sense probabilities.
• Words most indicative of one sense (and not the other) will be ranked highly.

\left| \log \frac{p(S = 1 \mid F_i = \text{collocation}_i)}{p(S = 2 \mid F_i = \text{collocation}_i)} \right|
Computing DL score
– Given 2,000 instances of “bank”, 1,500 for bank/1
(financial sense) and 500 for bank/2 (river sense)
• P(S=1) = 1,500/2,000 = .75
• P(S=2) = 500/2,000 = .25
– Given “credit” occurs 200 times with bank/1 and 4
times with bank/2.
• P(F1=“credit”) = 204/2,000 = .102
• P(F1=“credit”|S=1) = 200/1,500 = .133
• P(F1=“credit”|S=2) = 4/500 = .008
– From Bayes Rule…
• P(S=1|F1=“credit”) = .133*.75/.102 = .978
• P(S=2|F1=“credit”) = .008*.25/.102 = .020
– DL Score = abs (log (.978/.020)) = 3.89
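A one-line check of the score, assuming the natural log (which reproduces the slide's 3.89). Note that the normalizing term p(F1=“credit”) cancels in the ratio, so the score can equally be computed from the conditional probabilities and priors alone:

import math

# ratio of the two posteriors from the slide; p(F1="credit") cancels
print(abs(math.log(0.978 / 0.020)))  # ≈ 3.89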
Using the Decision List
• Sort by DL score; go through the test instance looking for a matching feature. The first match reveals the sense…

DL score | Feature            | Sense
3.89     | credit within bank | Bank/1 financial
2.20     | bank is muddy      | Bank/2 river
1.09     | pole within bank   | Bank/2 river
0.00     | of the bank        | N/A
Using the Decision List
[Decision list as a flowchart: CREDIT? → Bank/1 financial; else IS MUDDY? → Bank/2 river; else POLE? → Bank/2 river]
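A minimal sketch of applying the sorted list, with the table's features written as illustrative predicates over a bag of context words:

decision_list = [  # already sorted by DL score, highest first
    (3.89, lambda ctx: "credit" in ctx, "Bank/1 financial"),
    (2.20, lambda ctx: "muddy" in ctx, "Bank/2 river"),
    (1.09, lambda ctx: "pole" in ctx, "Bank/2 river"),
]

def classify(context_words):
    for score, matches, sense in decision_list:
        if matches(context_words):
            return sense  # first matching feature decides the sense
    return "N/A"

print(classify({"the", "bank", "is", "muddy"}))  # Bank/2 river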
Learning a Decision Tree
• Identify the feature that most “cleanly” divides the training
data into the known senses.
– “Cleanly” measured by information gain or gain ratio (see the sketch after this slide).
– Create subsets of training data according to feature values.
• Find another feature that most cleanly divides a subset of
the training data.
• Continue until each subset of training data is “pure” or as
clean as possible.
• Well known decision tree learning algorithms include ID3 and C4.5 (Quinlan, 1986, 1993).
• In Senseval-1, a modified decision list (which supported some conditional branching) was the most accurate system for the English Lexical Sample task (Yarowsky, 2000).
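To make “cleanly divides” concrete, here is a toy information-gain computation reusing the counts from the earlier bank/credit example (the 1,796 instances without “credit” split into 1,300 bank/1 and 496 bank/2 by subtraction):

import math

def entropy(counts):
    # H = -sum(p * log2(p)) over the sense distribution
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

gain = entropy([1500, 500]) - (
    204 / 2000 * entropy([200, 4]) +        # "credit" present
    1796 / 2000 * entropy([1300, 496]))     # "credit" absent
print(gain)  # ≈ 0.03 bits gained by splitting on "credit"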
Supervised WSD with Individual Classifiers
• Most supervised Machine Learning algorithms have been applied to Word Sense Disambiguation, and most work reasonably well.
• The choice of features tends to differentiate methods more than the choice of learning algorithm.
• Good sets of features tend to include:
– Co-occurrences or keywords (global)
– Collocations (local)
– Bigrams (local and global)
– Part of speech (local)
– Predicate-argument relations
• Verb-object, subject-verb, …
– Heads of Noun and Verb Phrases
Convergence of Results
• Accuracy of different systems applied to the same data tends to converge on a particular value, with no one system shockingly better than another.
– Senseval-1: a number of systems in the range of 74-78% accuracy for the English Lexical Sample task.
– Senseval-2: a number of systems in the range of 61-64% accuracy for the English Lexical Sample task.
– Senseval-3: a number of systems in the range of 70-73% accuracy for the English Lexical Sample task…
• What to do next?
Outline
• What is Supervised Learning?
• Task Definition
• Naïve Bayesian Classifiers
• Decision Lists and Trees
• Ensembles of Classifiers
Ensembles of Classifiers
• Classifier error has two components (Bias and Variance)
– Some algorithms (e.g., decision trees) try to build a representation of the training data – Low Bias/High Variance
– Others (e.g., Naïve Bayes) assume a parametric form and don’t represent the training data – High Bias/Low Variance
• Combining classifiers with different bias/variance characteristics can lead to improved overall accuracy
• “Bagging” a decision tree can smooth out the effect of
small variations in the training data (Breiman, 1996)
– Sample with replacement from the training data to learn multiple
decision trees.
– Outliers in training data will tend to be obscured/eliminated.
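A minimal sketch of the resampling step of bagging; the tree learner itself is omitted and the names are illustrative:

import random

def bootstrap_samples(training_data, n_bags=10, seed=0):
    # sample with replacement: each bag has the size of the original
    # training set, so outliers are often left out of any given bag
    rng = random.Random(seed)
    return [rng.choices(training_data, k=len(training_data))
            for _ in range(n_bags)]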
Ensemble Considerations
• Must choose different learning algorithms with
significantly different bias/variance characteristics.
– Naïve Bayesian Classifier versus Decision Tree
• Must choose feature representations that yield significantly
different (independent?) views of the training data.
– Lexical versus syntactic features
• Must choose how to combine classifiers.
– Simple Majority Voting (see the sketch after this list)
– Averaging of probabilities across multiple classifier outputs
– Maximum Entropy combination (e.g., Klein et al., 2002)
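A minimal sketch of the simplest combination rule, majority voting over the member classifiers' sense predictions:

from collections import Counter

def majority_vote(predictions):
    # the most common sense label among the member classifiers wins
    return Counter(predictions).most_common(1)[0][0]

print(majority_vote(["bank/1", "bank/2", "bank/1"]))  # bank/1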
Ensemble Results
• (Pedersen, 2000) achieved state of the art for the interest and line data using an ensemble of Naïve Bayesian Classifiers.
– Many Naïve Bayesian Classifiers trained on varying sized
windows of context / bags of words.
– Classifiers combined by a weighted vote
• (Florian and Yarowsky, 2002) achieved state of the art for Senseval-1 and Senseval-2 data using a combination of six classifiers.
– Rich set of collocational and syntactic features.
– Combined via linear combination of top three classifiers.
• Many Senseval-2 and Senseval-3 systems employed
ensemble methods.
References
• (Black, 1988) E. Black. (1988) An experiment in computational discrimination of English word senses. IBM Journal of Research and Development (32), pp. 185-194.
• (Breiman, 1996) L. Breiman. (1996) The heuristics of instability in model selection. Annals of Statistics (24), pp. 2350-2383.
• (Domingos and Pazzani, 1997) P. Domingos and M. Pazzani. (1997) On the Optimality of the Simple Bayesian Classifier under Zero-One Loss. Machine Learning (29), pp. 103-130.
• (Domingos, 2000) P. Domingos. (2000) A Unified Bias-Variance Decomposition for Zero-One and Squared Loss. In Proceedings of AAAI, pp. 564-569.
• (Florian and Yarowsky, 2002) R. Florian and D. Yarowsky. (2002) Modeling Consensus: Classifier Combination for Word Sense Disambiguation. In Proceedings of EMNLP, pp. 25-32.
• (Kelly and Stone, 1975) E. Kelly and P. Stone. (1975) Computer Recognition of English Word Senses. North Holland Publishing Co., Amsterdam.
• (Klein et al., 2002) D. Klein, K. Toutanova, H. Tolga Ilhan, S. Kamvar, and C. Manning. (2002) Combining Heterogeneous Classifiers for Word-Sense Disambiguation. In Proceedings of Senseval-2, pp. 87-89.
• (Leacock et al., 1993) C. Leacock, G. Towell, E. Voorhees. (1993) Corpus-based statistical sense resolution. In Proceedings of the ARPA Workshop on Human Language Technology, pp. 260-265.
• (Mooney, 1996) R. Mooney. (1996) Comparative experiments on disambiguating word senses: An illustration of the role of bias in machine learning. In Proceedings of EMNLP, pp. 82-91.
• (Pedersen, 1998) T. Pedersen. (1998) Learning Probabilistic Models of Word Sense Disambiguation. Ph.D. Dissertation, Southern Methodist University.
• (Pedersen, 2000) T. Pedersen. (2000) A simple approach to building ensembles of Naive Bayesian classifiers for word sense disambiguation. In Proceedings of NAACL.
• (Quinlan, 1986) J.R. Quinlan. (1986) Induction of Decision Trees. Machine Learning (1), pp. 81-106.
• (Quinlan, 1993) J.R. Quinlan. (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco.
• (Yarowsky, 1994) D. Yarowsky. (1994) Decision lists for lexical ambiguity resolution: Application to accent restoration in Spanish and French. In Proceedings of ACL, pp. 88-95.
• (Yarowsky, 2000) D. Yarowsky. (2000) Hierarchical decision lists for word sense disambiguation. Computers and the Humanities, 34.