Corpora and Statistical Methods Lecture 12
Albert Gatt
Word Sense Disambiguation
What are word senses?
Cognitive definition:
mental representation of meaning
used in psychological experiments
relies on introspection (notoriously deceptive)
Dictionary-based definition:
adopt sense definitions in a dictionary
most frequently used resource is WordNet
WordNet
Taxonomic representation of words (“concepts”)
Each word belongs to a synset, which contains near-synonyms
Each synset has a gloss
Words with multiple senses (polysemy) belong to multiple
synsets
Synsets organised by hyponymy (IS-A) relations
Also, other lexical relations, depending on category
How many senses?
Example: interest
pay 3% interest on a loan
showed interest in something
purchased an interest in a company.
the national interest…
have X’s best interest at heart
have an interest in word senses
The economy is run by business interests
WordNet entry for interest (noun)
1. a sense of concern with and curiosity about someone or something … (Synonym: involvement)
2. the power of attracting or holding one’s interest… (Synonym: interestingness)
3. a reason for wanting something done (Synonym: sake)
4. a fixed charge for borrowing money…
5. a diversion that occupies one’s time and thoughts … (Synonym: pastime)
6. a right or legal share of something; a financial involvement with something (Synonym: stake)
7. (usually plural) a social group whose members control some field of activity and who have common aims (Synonym: interest group)
Some issues
Are all these really distinct senses? Is WordNet too fine-grained?
Would native speakers distinguish all these as different?
Cf. The distinction between sense ambiguity and
underspecification (vagueness):
one could argue that there are fewer senses, but these are
underspecified out of context
Translation equivalence
Many WSD applications rely on translation equivalence
Given: parallel corpus (e.g. English-German)
if word w in English has n translations in German, then each
translation represents a sense
e.g. German translations of interest:
Zins: financial charge (WN sense 4)
Anteil: stake in a company (WN sense 6)
Interesse: all other senses
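The translation-equivalence idea can be sketched as a simple lookup. This is a minimal illustration, not a real WSD system: the `TRANSLATION_TO_SENSE` table just restates the three German translations listed above, and the word-aligned pairs in `aligned_pairs` are hypothetical (a real parallel corpus would need a word-alignment step first).

```python
# Sense labels implied by the German translations of "interest" listed above.
TRANSLATION_TO_SENSE = {
    "Zins": "financial charge (WN sense 4)",
    "Anteil": "stake in a company (WN sense 6)",
    "Interesse": "all other senses",
}

def sense_from_translation(aligned_translation):
    """Return the sense label implied by the aligned German word, or None."""
    return TRANSLATION_TO_SENSE.get(aligned_translation)

# Hypothetical word-aligned pairs: (English word, German word).
aligned_pairs = [("interest", "Zins"), ("interest", "Interesse")]
labels = [sense_from_translation(de) for en, de in aligned_pairs if en == "interest"]
```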
Some terminology
WSD Task: given an ambiguous word, find the intended sense
in context
Sense tagging: task of labelling words as belonging to one
sense or another.
needs some a priori characterisation of senses of each relevant word
Discrimination:
distinguishes between occurrences of words based on senses
not necessarily explicit labelling
Some more terminology
Two types of WSD task:
Lexical sample task: focuses on disambiguating a small set of
target words, using an inventory of the senses of those words.
All-words task: focuses on entire texts and a lexicon, where
every word in the text has to be disambiguated
Serious data sparseness problems!
Approaches to WSD
All methods rely on training data. Basic idea:
Given word w in context c
learn how to predict sense s of w based on various features of w
Supervised learning: training data is labelled with correct senses
can do sense tagging
Unsupervised learning: training data is unlabelled
but many other knowledge sources used
cannot do sense tagging, since this requires a priori senses
Supervised learning
Words in training data labelled with their senses
She pays 3% interest/INTEREST-MONEY on the loan.
He showed a lot of interest/INTEREST-CURIOSITY in the painting.
Similar to POS tagging
given a corpus tagged with senses
define features that indicate one sense over another
learn a model that predicts the correct sense given the features
Features (e.g. plant)
Neighbouring words:
plant life
manufacturing plant
assembly plant
plant closure
plant species
Content words in a larger window
animal
equipment
employee
automatic
Other features
Syntactically related words
e.g. object, subject….
Topic of the text
is it about SPORT? POLITICS?
Part-of-speech tag, surrounding part-of-speech tags
Some principles proposed (Yarowsky 1995)
One sense per discourse:
typically, all occurrences of a word will have the same sense in the
same stretch of discourse (e.g. same document)
One sense per collocation:
nearby words provide clues as to the sense, depending on the distance
and syntactic relationship
e.g. plant life: all (?) occurrences of plant+life will indicate the botanic
sense of plant
Training data
SENSEVAL
Shared Task competition
datasets available for WSD, among other things
annotated corpora in many languages
(NB: SENSEVAL now merged with the broader SEMEVAL tasks)
Pseudo-words
create training corpus by artificially conflating words
e.g. replace all occurrences of man and hammer with man-hammer
easy way to create training data
Multi-lingual parallel corpora
translated texts aligned at the sentence level
translation indicates sense
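The pseudo-word technique listed above can be sketched as follows. This is a minimal sketch under assumptions: the function name `make_pseudoword_corpus` and the two example sentences are hypothetical, and each occurrence of either word yields its own training instance, with the original word kept as the gold "sense" label.

```python
import re

def make_pseudoword_corpus(sentences, word_a, word_b):
    """Conflate word_a and word_b into one artificially ambiguous pseudo-word.

    Returns (conflated_sentence, original_word) pairs, one per occurrence;
    the original word plays the role of the correct sense.
    """
    pseudo = f"{word_a}-{word_b}"
    pattern = re.compile(rf"\b({re.escape(word_a)}|{re.escape(word_b)})\b")
    labelled = []
    for sentence in sentences:
        for match in pattern.finditer(sentence):
            # Replace only this occurrence, leaving the rest of the sentence intact.
            conflated = sentence[:match.start()] + pseudo + sentence[match.end():]
            labelled.append((conflated, match.group(1)))
    return labelled

examples = make_pseudoword_corpus(
    ["the man walked home", "she picked up a hammer"], "man", "hammer"
)
```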
SemCor corpus
Corpus consisting of files from the Brown Corpus (1m
words)
SemCor contains 352 files, 360k words total:
186 files contain sense tags for all content words
The remainder only have sense tags for verbs
Originally created using the WordNet 1.6 sense inventory;
since updated to the more recent versions of WordNet
SemCor Example (= Brown file a01)
<s snum=1>
[...]
<wf cmd=ignore pos=DT>an</wf>
<wf cmd=done pos=NN lemma=investigation wnsn=1 lexsn=1:09:00::>investigation</wf>
<wf cmd=ignore pos=IN>of</wf>
<wf cmd=done pos=NN lemma=atlanta wnsn=1 lexsn=1:15:00::>Atlanta</wf>
<wf cmd=ignore pos=POS>'s</wf>
<wf cmd=done pos=JJ lemma=recent wnsn=2 lexsn=5:00:00:past:00>recent</wf>
<wf cmd=done pos=NN lemma=primary_election wnsn=1
lexsn=1:04:00::>primary_election</wf>
[...]
</s>
SemCor example
<wf cmd=done pos=NN lemma=investigation wnsn=1
lexsn=1:09:00::>investigation</wf>
Senses in WordNet:
S: (n) probe, investigation (an inquiry into unfamiliar or
questionable activities) "there was a congressional probe into the
scandal"
S: (n) investigation, investigating (the work of inquiring
into something thoroughly and systematically)
This occurrence involves the first sense
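Reading the sense annotation out of a SemCor line could look roughly like this. It is a sketch that assumes each `<wf>` element sits on one line with unquoted attributes, as in the example above; the helper name `parse_wf` is hypothetical, and a regular expression is used because unquoted attributes are not well-formed XML.

```python
import re

# One <wf ...>token</wf> element with unquoted attribute values.
WF_TAG = re.compile(r"<wf\s+([^>]*)>([^<]*)</wf>")
ATTR = re.compile(r"(\w+)=(\S+)")

def parse_wf(line):
    """Return (token, attribute dict) for one <wf> element, or None if absent."""
    match = WF_TAG.search(line)
    if match is None:
        return None
    attrs = dict(ATTR.findall(match.group(1)))
    return match.group(2), attrs

token, attrs = parse_wf(
    "<wf cmd=done pos=NN lemma=investigation wnsn=1 lexsn=1:09:00::>investigation</wf>"
)
# attrs["wnsn"] holds the WordNet sense number as a string.
```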
Data representation
Example sentence: An electric guitar and bass player stand off to
one side...
Target word: bass
Possible senses: fish, musical instrument...
Relevant features are represented as vectors, e.g.:
$[w_{i-2}, POS_{i-2}, w_{i-1}, POS_{i-1}, w_{i+1}, POS_{i+1}, w_{i+2}, POS_{i+2}]$
[guitar, NN, and, CC, player, NN, stand, VB]
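A feature vector of this shape could be built as in the sketch below. The function name `window_features`, the `<PAD>` placeholder for positions outside the sentence, and the POS tags for words outside the window are assumptions made for illustration.

```python
def window_features(tokens, tags, i, size=2):
    """Collect word and POS from `size` positions on each side of index i."""
    features = []
    for offset in list(range(-size, 0)) + list(range(1, size + 1)):
        j = i + offset
        if 0 <= j < len(tokens):
            features.extend([tokens[j], tags[j]])
        else:
            # Position falls outside the sentence.
            features.extend(["<PAD>", "<PAD>"])
    return features

tokens = ["An", "electric", "guitar", "and", "bass", "player", "stand", "off"]
tags = ["DT", "JJ", "NN", "CC", "NN", "NN", "VB", "RB"]  # hand-assigned tags
vector = window_features(tokens, tags, tokens.index("bass"))
```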
Part 1
Supervised methods I: Naive Bayes
Naïve Bayes Classifier
Identify the features (F)
e.g. surrounding words
other cues apart from surrounding context
Combine evidence from all features
Decision rule: decide on sense s’ iff
$\forall s_k,\ s_k \neq s':\ P(s' \mid F) > P(s_k \mid F)$
Example: drug. F = words in context
medication sense: price, prescription, pharmaceutical
illegal substance sense: alcohol, illicit, paraphernalia
Naive Bayes Classifier
Problem: We usually don’t know the probability of a sense
given the features: P(sk|F)
In our corpus, we have access to P(F|sk)
We can compute P(sk|F) from P(F|sk)
For this we use Bayes’ theorem
Deriving Bayes’ rule from the multiplication rule
Recall that:
$P(A \cap B) = P(A)\,P(B \mid A)$
Given symmetry of intersection, multiplication rule can be written in two ways:
$P(A \cap B) = P(A)\,P(B \mid A)$
$P(A \cap B) = P(B)\,P(A \mid B)$
Bayes’ rule involves the substitution of one equation into the other, to replace P(A and B):
$P(B \mid A) = \frac{P(B)\,P(A \mid B)}{P(A)}$
Deriving P(A)
Often, it’s not clear where P(A) should come from
we start out from conditional probabilities!
Given that we have two sets of outcomes of interest, A and B,
P(A) can be derived from the following observation:
$A = (A \cap B) \cup (A \cap \bar{B})$
i.e. The events in A are made up of those which are only in A (but not in
B) and those which are in both A and B.
Finding P(A) -- I
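[Venn diagram: sets A and B overlapping, with the regions $A \cap B$ and $A \cap \bar{B}$ labelled]
$P(A \cap B) = P(B)\,P(A \mid B)$
$P(A \cap \bar{B}) = P(\bar{B})\,P(A \mid \bar{B})$
Every outcome in A must fall in exactly one of these two regions, since A is composed of these two disjoint sets.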
Finding P(A) -- II
Step 1: Applying the addition rule:
$P(A) = P(A \cap B) + P(A \cap \bar{B}) = P(B)\,P(A \mid B) + P(\bar{B})\,P(A \mid \bar{B})$
Step 2: Substituting into Bayes’ equation to replace P(A):
$P(B \mid A) = \frac{P(B)\,P(A \mid B)}{P(B)\,P(A \mid B) + P(\bar{B})\,P(A \mid \bar{B})}$
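A quick numeric check of the two steps above, as a sketch: the function name `posterior` and the probabilities 0.3, 0.8 and 0.1 are made up for illustration.

```python
def posterior(p_b, p_a_given_b, p_a_given_not_b):
    """P(B|A) via Bayes' rule, with P(A) expanded over B and not-B."""
    p_not_b = 1.0 - p_b
    # Step 1: P(A) = P(B)P(A|B) + P(not B)P(A|not B)
    p_a = p_b * p_a_given_b + p_not_b * p_a_given_not_b
    # Step 2: substitute P(A) into Bayes' rule
    return p_b * p_a_given_b / p_a

p = posterior(p_b=0.3, p_a_given_b=0.8, p_a_given_not_b=0.1)
```

Here P(A) = 0.3 × 0.8 + 0.7 × 0.1 = 0.31, so P(B|A) = 0.24 / 0.31.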
Using Bayes’ rule for WSD
We usually don’t know P(sk|F), but we can compute P(sk) (the prior) and P(F|sk) from training data:
$P(s_k \mid f) = \frac{P(f \mid s_k)\,P(s_k)}{P(f)}$
P(f) can be eliminated because it is constant for all senses in the corpus:
$s_{best} = \arg\max_{s_k} P(s_k \mid f) = \arg\max_{s_k} \frac{P(f \mid s_k)\,P(s_k)}{P(f)} = \arg\max_{s_k} P(f \mid s_k)\,P(s_k)$
The independence assumption
It’s called “naïve” because:
$P(f \mid s_k) = \prod_{j=1}^{n} P(f_j \mid s_k)$
i.e. all features are assumed to be independent
Obviously, this is often not true.
e.g. finding illicit in the context of drug may not be independent of finding pusher.
cf. our discussion of collocations!
Also, topics often constrain word choice.
Training the naive Bayes classifier
We need to compute:
P(s) for all senses s of w:
$P(s_k) = \frac{\mathrm{Count}(s_k, w)}{\mathrm{Count}(w)}$
P(fj|s) for all features fj:
$P(f_j \mid s_k) = \frac{\mathrm{Count}(f_j, s_k)}{\mathrm{Count}(s_k)}$
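The counting and the arg max decision can be put together in a small sketch. Two things here are assumptions that go beyond the slides: add-one smoothing is used so that unseen features do not give zero probabilities, and P(fj|sk) is normalised by the total number of feature tokens seen with the sense. The training set reuses the context words from the drug example and is far too small to be meaningful.

```python
import math
from collections import Counter, defaultdict

def train(instances):
    """instances: list of (sense, context_words). Returns the count tables."""
    sense_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocab = set()
    for sense, words in instances:
        sense_counts[sense] += 1
        feature_counts[sense].update(words)
        vocab.update(words)
    return sense_counts, feature_counts, vocab

def classify(words, sense_counts, feature_counts, vocab):
    """Return arg max over senses of log P(s) + sum_j log P(f_j | s)."""
    total = sum(sense_counts.values())
    best_sense, best_score = None, float("-inf")
    for sense, count in sense_counts.items():
        score = math.log(count / total)  # log prior
        denom = sum(feature_counts[sense].values()) + len(vocab)
        for word in words:
            if word in vocab:  # ignore words never seen in training
                score += math.log((feature_counts[sense][word] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

data = [
    ("medication", ["price", "prescription", "pharmaceutical"]),
    ("illegal substance", ["alcohol", "illicit", "paraphernalia"]),
]
model = train(data)
predicted = classify(["prescription", "price"], *model)
```

Working in log space avoids numerical underflow when many feature probabilities are multiplied.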
Part 2
Supervised methods II: Information-theoretic and collocation-based approaches
Information-theoretic measures
Find the single, most informative feature to predict a sense.
E.g. using a parallel corpus:
prendre (FR) can translate as take or make
prendre une décision: make a decision
prendre une mesure: take a measure [to…]
Informative feature in this case: direct object
mesure indicates take
décision indicates make
Problem: need to identify the correct value of the feature that
indicates a specific sense.
Brown et al’s algorithm
1. Given: translations T of word w
2. Given: values X of a useful feature (e.g. mesure, décision as values of DO)
3. Step 1: random partition P of T
4. While improving, do:
   create partition Q of X that maximises I(P;Q)
   find a partition P of T that maximises I(P;Q)
comment: relies on mutual info to find clusters of translations mapping to clusters of feature values
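The quantity being maximised, I(P;Q), can be computed from co-occurrence counts as in the sketch below. This shows only the scoring step, not the full iterative search; the function name `mutual_information`, the dictionary encoding of partitions, and the counts in `pairs` are assumptions for illustration.

```python
import math
from collections import Counter

def mutual_information(pairs, part_t, part_x):
    """I(P;Q) in bits for a list of (translation, feature value) pairs.

    part_t maps each translation to its block in partition P;
    part_x maps each feature value to its block in partition Q.
    """
    joint = Counter((part_t[t], part_x[x]) for t, x in pairs)
    n = sum(joint.values())
    p_marg, q_marg = Counter(), Counter()
    for (p, q), c in joint.items():
        p_marg[p] += c
        q_marg[q] += c
    # sum over blocks of p(p,q) * log2( p(p,q) / (p(p) p(q)) )
    return sum(
        (c / n) * math.log2(c * n / (p_marg[p] * q_marg[q]))
        for (p, q), c in joint.items()
    )

# Hypothetical counts: translation of prendre paired with its direct object.
pairs = [("take", "mesure")] * 2 + [("make", "décision")] * 2
part_t = {"take": 0, "make": 1}
informative_q = {"mesure": 0, "décision": 1}
uninformative_q = {"mesure": 0, "décision": 0}
mi_good = mutual_information(pairs, part_t, informative_q)
mi_bad = mutual_information(pairs, part_t, uninformative_q)
```

A partition of the feature values that separates mesure from décision scores higher than one that lumps them together, which is what drives the alternating search.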
Using dictionaries and thesauri
Lesk (1986): one of the first to exploit dictionary
definitions
the definition corresponding to a sense can contain words which
are good indicators for that sense
Method:
1. Given: ambiguous word w with senses s1…sn with glosses g1…gn.
2. Given: the word w in context c
3. compute overlap between c & each gloss
4. select the maximally matching sense
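The four steps can be sketched as a simplified word-overlap version of the method. The function name `lesk`, the stopword set, the sense labels and the context sentence are hypothetical; the two glosses are shortened from the WordNet entry for interest shown earlier. Ties go to the first sense listed.

```python
def lesk(context_words, glosses, stopwords=frozenset()):
    """Pick the sense whose gloss shares the most words with the context."""
    context = {w.lower() for w in context_words} - stopwords
    best_sense, best_overlap = None, -1
    for sense, gloss in glosses.items():
        gloss_words = {w.lower() for w in gloss.split()} - stopwords
        overlap = len(context & gloss_words)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

glosses = {
    "money": "a fixed charge for borrowing money",
    "pastime": "a diversion that occupies one's time and thoughts",
}
context = "the bank raised the charge for borrowing money".split()
chosen = lesk(context, glosses, stopwords=frozenset({"a", "the", "for"}))
```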
Expanding a dictionary
Problem with Lesk:
often dictionary definitions don’t contain sufficient information
not all words in dictionary definitions are good informants
Solution: use a thesaurus with subject/topic categories
e.g. Roget’s thesaurus
http://www.gutenberg.org/cache/epub/22/pg22.html.utf8
Using topic categories
Suppose every sense sk of word w has subject/topic tk
w can be disambiguated by identifying the words related to tk
in the thesaurus
Problems:
general-purpose thesauri don’t list domain-specific topics
several potentially useful words can be left out
e.g. … Navratilova plays great tennis …
proper name here useful as indicator of topic SPORT
Expanding a thesaurus: Yarowsky 1992
1. Given: context c and topic t
2. For all contexts and topics, compute p(c|t) using Naïve Bayes
   by comparing words pertaining to t in the thesaurus with words in c
   if p(c|t) > α, then assign topic t to context c
3. For all words in the vocabulary, update the list of contexts in which the word occurs.
   Assign topic t to each word in c
4. Finally, compute p(w|t) for all w in the vocabulary
   this gives the “strength of association” of w with t
Yarowsky 1992: some results
WORD      SENSE         ROGET TOPICS   ACCURACY
star      space object  UNIVERSE       96%
          celebrity     ENTERTAINER    95%
          shape         INSIGNIA       82%
sentence  punishment    LEGAL_ACTION   99%
          set of words  GRAMMAR        98%
Bootstrapping
Yarowsky (1995) suggested the one sense per
discourse/collocation constraints.
Yarowsky’s method:
1. select the strongest collocational feature in a specific context
2. disambiguate based only on this feature
(similar to the information-theoretic method discussed earlier)
One sense per collocation
1. For each sense s of w, initialise F, the collocations found in the dictionary definition for s
2. One sense per collocation:
   identify the set of contexts containing collocates of s
   for each sense s of w, update F to contain those collocates f such that, for all s’ ≠ s:
   $\frac{P(s \mid f)}{P(s' \mid f)} > \alpha$
   (where α is a threshold)
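The threshold test on the ratio P(s|f) / P(s'|f) can be sketched as follows. The function name `select_collocates`, the counts, the value of alpha and the small smoothing constant (added to avoid division by zero) are all assumptions for illustration, not values from the original method.

```python
def select_collocates(counts, senses, alpha=2.0, smooth=0.1):
    """counts[f][s] = number of times collocate f co-occurs with sense s of w.

    Keep f for sense s if P(s|f) / P(s'|f) > alpha for every other sense s'.
    """
    selected = {s: set() for s in senses}
    for f, by_sense in counts.items():
        total = sum(by_sense.get(s, 0) + smooth for s in senses)
        prob = {s: (by_sense.get(s, 0) + smooth) / total for s in senses}
        for s in senses:
            if all(prob[s] / prob[t] > alpha for t in senses if t != s):
                selected[s].add(f)
    return selected

# Hypothetical co-occurrence counts for collocates of "plant".
counts = {
    "life": {"botanic": 9, "factory": 0},
    "closure": {"botanic": 0, "factory": 5},
    "new": {"botanic": 3, "factory": 3},
}
collocates = select_collocates(counts, ["botanic", "factory"])
```

A collocate such as new, which is equally likely with both senses, passes the test for neither and is discarded.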
One sense per discourse
3. For each document:
find the majority sense of w out of those found in previous step
assign all occurrences of w the majority sense
This is implemented as a post-processing step. Reduces
error rate by ca. 27%.
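The post-processing step amounts to a per-document majority vote, sketched below. The function name `one_sense_per_discourse` and the convention of using None for occurrences left untagged by the collocation step are assumptions.

```python
from collections import Counter

def one_sense_per_discourse(tagged_occurrences):
    """tagged_occurrences: senses assigned to w within one document,
    with None where the previous step left an occurrence untagged."""
    votes = Counter(s for s in tagged_occurrences if s is not None)
    if not votes:
        # Nothing was tagged in this document, so nothing can be propagated.
        return list(tagged_occurrences)
    majority, _ = votes.most_common(1)[0]
    return [majority] * len(tagged_occurrences)

relabelled = one_sense_per_discourse(["botanic", "botanic", None, "factory"])
```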
Part 3
Unsupervised disambiguation
Preliminaries
Recall: unsupervised learning can do sense discrimination
not tagging
akin to clustering occurrences with the same sense
e.g. Brown et al 1991: cluster translations of a word
this is akin to clustering senses
Brown et al’s method
Preliminary categorisation:
1. Set P(w|s) randomly for all words w and senses s of w.
2. Compute, for each context c of w, the probability P(c|s) that the context was generated by sense s.
Use (1) and (2) as a preliminary estimate. Re-estimate
iteratively to find best fit to the corpus.
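The random initialisation and iterative re-estimation can be illustrated with a generic EM loop over a mixture of word distributions. This is a sketch of the general idea only and is not claimed to reproduce the published algorithm: the function name `em_sense_clusters`, the small smoothing constants, the fixed iteration count and the toy contexts are all assumptions.

```python
import math
import random

def em_sense_clusters(contexts, n_senses=2, iterations=20, seed=0):
    """Return, for each context, a list of P(s|c) over the induced senses."""
    rng = random.Random(seed)
    vocab = sorted({w for c in contexts for w in c})
    # Step 1: random P(w|s), normalised per sense.
    p_w = []
    for _ in range(n_senses):
        raw = {w: rng.random() + 0.1 for w in vocab}
        z = sum(raw.values())
        p_w.append({w: v / z for w, v in raw.items()})
    p_s = [1.0 / n_senses] * n_senses
    resp = []
    for _ in range(iterations):
        # E-step: P(s|c) is proportional to P(s) * product of P(w|s) for w in c.
        resp = []
        for c in contexts:
            logs = [math.log(p_s[k]) + sum(math.log(p_w[k][w]) for w in c)
                    for k in range(n_senses)]
            m = max(logs)
            weights = [math.exp(x - m) for x in logs]
            z = sum(weights)
            resp.append([x / z for x in weights])
        # M-step: re-estimate P(w|s) and P(s) from the soft assignments.
        for k in range(n_senses):
            counts = {w: 1e-3 for w in vocab}  # small floor avoids log(0)
            for c, r in zip(contexts, resp):
                for w in c:
                    counts[w] += r[k]
            z = sum(counts.values())
            p_w[k] = {w: v / z for w, v in counts.items()}
            p_s[k] = (sum(r[k] for r in resp) + 1e-3) / (len(contexts) + n_senses * 1e-3)
    return resp

toy_contexts = [["loan", "bank", "rate"], ["bank", "rate", "loan"],
                ["painting", "museum", "art"], ["art", "museum", "painting"]]
responsibilities = em_sense_clusters(toy_contexts)
```

The induced clusters carry no labels, which is why this kind of method supports sense discrimination rather than sense tagging.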
Characteristics of unsupervised disambiguation
Can adapt easily to new domains, not covered by a dictionary
or pre-labelled corpus
Very useful for information retrieval
If there are many senses (e.g. 20 senses for word w), the
algorithm will split contexts into fine-grained sets
NB: can go awry with infrequent senses
Part 4
Some issues with WSD
The task definition
The WSD task traditionally assumes that a word has one and
only one sense in a context.
Is this true?
Is it possible to see evidence of co-activation (one word
displaying more than one sense)?
this would bring competition to the licensed trade
competition = “act of competing”
competition = “people/organisations who are competing”
Systematic polysemy
Not all senses are so easy to distinguish. E.g. competition in the
“agent competing” vs “act of competing” sense.
The polysemy here is systematic
Compare bank/bank where the senses are utterly distinct (and most
linguists wouldn’t consider this a case of polysemy, but homonymy)
Can translation equivalence help here?
depends on whether the polysemy is systematic in all languages
Logical metonymy
Metonymy = usage of a word to stand for something else
e.g. the pen is mightier than the sword
pen = the press
Logical metonymy arises due to systematic polysemy
good cook vs. good book
enjoy the paper vs enjoy the cake
Should WSD systems distinguish these? How could they do this?
Which words/usages count?
Many proper names are identical to common nouns (cf.
Brown, Bush,…)
This presents a WSD algorithm with systematic ambiguity
and reduces performance.
Also, names are good indicators of senses of neighbouring
words.
But this requires a priori categorisation of names.
Brown’s green stance vs. the cook’s green curry
Useful links
WordNet: http://wordnet.princeton.edu/
SemCor: http://www.cse.unt.edu/~rada/downloads.html#semcor
Senseval: http://www.senseval.org/