Corpora and Statistical Methods Lecture 12


Corpora and Statistical Methods
Albert Gatt
Word Sense Disambiguation
What are word senses?
 Cognitive definition:
 mental representation of meaning
 used in psychological experiments
 relies on introspection (notoriously deceptive)
 Dictionary-based definition:
 adopt sense definitions in a dictionary
 most frequently used resource is WordNet
WordNet
 Taxonomic representation of words (“concepts”)
 Each word belongs to a synset, which contains near-synonyms
 Each synset has a gloss
 Words with multiple senses (polysemy) belong to multiple
synsets
 Synsets organised by hyponymy (IS-A) relations
 Also, other lexical relations, depending on category
How many senses?
 Example: interest
 pay 3% interest on a loan
 showed interest in something
 purchased an interest in a company.
 the national interest…
 have X’s best interest at heart
 have an interest in word senses
 The economy is run by business interests
Wordnet entry for interest (noun)
1. a sense of concern with and curiosity about someone or something … (Synonym: involvement)
2. the power of attracting or holding one’s interest… (Synonym: interestingness)
3. a reason for wanting something done (Synonym: sake)
4. a fixed charge for borrowing money…
5. a diversion that occupies one’s time and thoughts… (Synonym: pastime)
6. a right or legal share of something; a financial involvement with something (Synonym: stake)
7. (usually plural) a social group whose members control some field of activity and who have common aims (Synonym: interest group)
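These senses can be listed programmatically. A minimal sketch using NLTK’s WordNet interface (assumes nltk is installed and the WordNet data has been downloaded via nltk.download('wordnet'); sense numbering can vary across WordNet versions):

from nltk.corpus import wordnet as wn

# Print the WordNet noun senses of "interest" with their glosses.
for i, synset in enumerate(wn.synsets('interest', pos='n'), start=1):
    lemmas = ', '.join(lemma.name() for lemma in synset.lemmas())
    print(f"{i}. {synset.definition()} (lemmas: {lemmas})")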
Some issues
 Are all these really distinct senses? Is WordNet too fine-grained?
 Would native speakers distinguish all these as different?
 Cf. The distinction between sense ambiguity and
underspecification (vagueness):
 one could argue that there are fewer senses, but these are
underspecified out of context
Translation equivalence
 Many WSD applications rely on translation equivalence
 Given: parallel corpus (e.g. English-German)
 if word w in English has n translations in German, then each
translation represents a sense
 e.g. German translations of interest:
 Zins: financial charge (WN sense 4)
 Anteil: stake in a company (WN sense 6)
 Interesse: all other senses
Some terminology
 WSD Task: given an ambiguous word, find the intended sense
in context
 Sense tagging: task of labelling words as belonging to one
sense or another.
 needs some a priori characterisation of senses of each relevant word
 Discrimination:
 distinguishes between occurrences of words based on senses
 not necessarily explicit labelling
Some more terminology
 Two types of WSD task:
 Lexical sample task: focuses on disambiguating a small set of
target words, using an inventory of the senses of those words.
 All-words task: focuses on entire texts and a lexicon, where
every word in the text has to be disambiguated
 Serious data sparseness problems!
Approaches to WSD
 All methods rely on training data. Basic idea:
 Given word w in context c
 learn how to predict sense s of w based on various features of w
 Supervised learning: training data is labelled with correct senses
 can do sense tagging
 Unsupervised learning: training data is unlabelled
 but many other knowledge sources used
 cannot do sense tagging, since this requires a priori senses
Supervised learning
 Words in training data labelled with their senses
 She pays 3% interest/INTEREST-MONEY on the loan.
 He showed a lot of interest/INTEREST-CURIOSITY in the painting.
 Similar to POS tagging
 given a corpus tagged with senses
 define features that indicate one sense over another
 learn a model that predicts the correct sense given the features
Features (e.g. plant)
 Neighbouring words:
 plant life
 manufacturing plant
 assembly plant
 plant closure
 plant species
 Content words in a larger window
 animal
 equipment
 employee
 automatic
Other features
 Syntactically related words
 e.g. object, subject….
 Topic of the text
 is it about SPORT? POLITICS?
 Part-of-speech tag, surrounding part-of-speech tags
Some principles proposed (Yarowsky 1995)
 One sense per discourse:
 typically, all occurrences of a word will have the same sense in the
same stretch of discourse (e.g. same document)
 One sense per collocation:
 nearby words provide clues as to the sense, depending on the distance
and syntactic relationship
 e.g. plant life: all (?) occurrences of plant+life will indicate the botanic
sense of plant
Training data
 SENSEVAL
 Shared Task competition
 datasets available for WSD, among other things
 annotated corpora in many languages
 (NB: SENSEVAL now merged with the broader SEMEVAL tasks)
 Pseudo-words
 create training corpus by artificially conflating words
 e.g. all occurrences of man and hammer with man-hammer
 easy way to create training data (see the sketch below)
 Multi-lingual parallel corpora
 translated texts aligned at the sentence level
 translation indicates sense
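A minimal sketch of the pseudo-word idea; the word pair man/hammer and the sentences are illustrative only:

# Conflate two unrelated words into one artificial ambiguous token;
# the original word serves as the gold-standard "sense" label.
sentences = [
    ['the', 'man', 'walked', 'home'],
    ['she', 'bought', 'a', 'hammer'],
]

def make_pseudo(sentences, w1='man', w2='hammer'):
    data = []
    for sent in sentences:
        if w1 in sent or w2 in sent:
            gold = w1 if w1 in sent else w2
            conflated = [f'{w1}-{w2}' if w in (w1, w2) else w for w in sent]
            data.append((conflated, gold))
    return data

print(make_pseudo(sentences))
# [(['the', 'man-hammer', 'walked', 'home'], 'man'),
#  (['she', 'bought', 'a', 'man-hammer'], 'hammer')]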
SemCor corpus
 Corpus consisting of files from the Brown Corpus (1m
words)
 SemCor contains 352 files, 360k words total:
 186 files contain sense tags for all content words
 The remainder only have sense tags for verbs
 Originally created using the WordNet 1.6 sense inventory;
since updated to the more recent versions of WordNet
SemCor Example (= Brown file a01)
<s snum=1>
[...]
<wf cmd=ignore pos=DT>an</wf>
<wf cmd=done pos=NN lemma=investigation wnsn=1 lexsn=1:09:00::>investigation</wf>
<wf cmd=ignore pos=IN>of</wf>
<wf cmd=done pos=NN lemma=atlanta wnsn=1 lexsn=1:15:00::>Atlanta</wf>
<wf cmd=ignore pos=POS>'s</wf>
<wf cmd=done pos=JJ lemma=recent wnsn=2 lexsn=5:00:00:past:00>recent</wf>
<wf cmd=done pos=NN lemma=primary_election wnsn=1 lexsn=1:04:00::>primary_election</wf>
[...]
</s>
SemCor example
<wf cmd=done pos=NN lemma=investigation wnsn=1 lexsn=1:09:00::>investigation</wf>
Senses in WordNet:
 S: (n) probe, investigation (an inquiry into unfamiliar or questionable activities) “there was a congressional probe into the scandal”
 S: (n) investigation, investigating (the work of inquiring into something thoroughly and systematically)
This occurrence (wnsn=1) involves the first sense.
Data representation
 Example sentence: An electric guitar and bass player stand off to
one side...
 Target word: bass
 Possible senses: fish, musical instrument...
 Relevant features are represented as vectors, e.g. the ±2 word/POS window around the target (see the sketch below):
[w_{i-2}, POS_{i-2}, w_{i-1}, POS_{i-1}, w_{i+1}, POS_{i+1}, w_{i+2}, POS_{i+2}]
= [guitar, NN, and, CC, player, NN, stand, VB]
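A minimal sketch of extracting this window; the POS-tagged sentence is hard-coded for illustration (in practice it would come from a POS tagger):

tagged = [('an', 'DT'), ('electric', 'JJ'), ('guitar', 'NN'), ('and', 'CC'),
          ('bass', 'NN'), ('player', 'NN'), ('stand', 'VB'), ('off', 'RP')]

def window_features(tagged, i, size=2):
    # Collect [w_{i-2}, POS_{i-2}, ..., w_{i+2}, POS_{i+2}] around position i.
    feats = []
    for offset in list(range(-size, 0)) + list(range(1, size + 1)):
        j = i + offset
        word, pos = tagged[j] if 0 <= j < len(tagged) else ('<PAD>', '<PAD>')
        feats.extend([word, pos])
    return feats

print(window_features(tagged, 4))  # target word: 'bass'
# ['guitar', 'NN', 'and', 'CC', 'player', 'NN', 'stand', 'VB']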
Part 1
Supervised methods I: Naive Bayes
Naïve Bayes Classifier
 Identify the features (F)
 e.g. surrounding words
 other cues apart from surrounding context
 Combine evidence from all features
 Decision rule: decide on sense s’ iff P(s’|F) ≥ P(s_k|F) for all senses s_k ≠ s’
 Example: drug. F = words in context
 medication sense: price, prescription, pharmaceutical
 illegal substance sense: alcohol, illicit, paraphernalia
Naive Bayes Classifier
 Problem: we usually don’t know the probability of a sense
given the features: P(s_k|F)
 In our corpus, we have access to P(F|s_k)
 We can compute P(s_k|F) from P(F|s_k)
 For this we use Bayes’ theorem
Deriving Bayes’ rule from the multiplication rule
 Recall that:
P(A ∩ B) = P(A) P(B|A)
 Given the symmetry of intersection, the multiplication rule can be written in two ways:
P(A ∩ B) = P(A) P(B|A)
P(A ∩ B) = P(B) P(A|B)
 Bayes’ rule involves substituting one equation into the other, to replace P(A ∩ B):
P(B|A) = P(B) P(A|B) / P(A)
Deriving P(A)
 Often, it’s not clear where P(A) should come from
 we start out from conditional probabilities!
 Given that we have two sets of outcomes of interest, A and B,
P(A) can be derived from the following observation:
A  ( A  B)  ( A  B)
 i.e. The events in A are made up of those which are only in A (but not in
B) and those which are in both A and B.
Finding P(A) -- I
[Venn diagram: the region A is split into A ∩ B and A ∩ ¬B]
P(A ∩ B) = P(B) P(A|B)
P(A ∩ ¬B) = P(¬B) P(A|¬B)
Any outcome in A must fall in one of these two regions, since together they compose A.
Finding P(A) -- II
Step 1: applying the addition rule:
P(A) = P(A ∩ B) + P(A ∩ ¬B)
     = P(B) P(A|B) + P(¬B) P(A|¬B)
Step 2: substituting into Bayes’ equation to replace P(A):
P(B|A) = P(B) P(A|B) / [P(B) P(A|B) + P(¬B) P(A|¬B)]
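A quick worked example, with illustrative numbers not taken from the slides: let P(B) = 0.3, P(A|B) = 0.8 and P(A|¬B) = 0.2. Then:
P(A) = (0.3)(0.8) + (0.7)(0.2) = 0.24 + 0.14 = 0.38
P(B|A) = 0.24 / 0.38 ≈ 0.63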
Using Bayes’ rule for WSD
 We usually don’t know P(s_k|F), but we can compute P(s_k) (the prior) and P(F|s_k) from training data:
P(s_k|F) = P(F|s_k) P(s_k) / P(F)
 P(F) can be eliminated because it is constant for all senses in the corpus:
s_best = argmax_{s_k} P(s_k|F)
       = argmax_{s_k} P(F|s_k) P(s_k) / P(F)
       = argmax_{s_k} P(F|s_k) P(s_k)
The independence assumption
 It’s called “naïve” because all features are assumed to be independent given the sense:
P(F|s_k) = ∏_{j=1}^{n} P(f_j|s_k)
 Obviously, this is often not true.
 e.g. finding illicit in the context of drug may not be independent of finding pusher.
 cf. our discussion of collocations!
 Also, topics often constrain word choice.
Training the naive Bayes classifier
 We need to compute:
 P(s_k) for all senses s_k of w:
P(s_k) = Count(s_k, w) / Count(w)
 P(f_j|s_k) for all features f_j:
P(f_j|s_k) = Count(f_j, s_k) / Count(s_k)
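A minimal end-to-end sketch of this training procedure plus the argmax decision rule, on a toy hand-labelled corpus; the sense labels and contexts are illustrative, and add-one smoothing (not covered on the slides) is added to avoid zero counts:

import math
from collections import Counter, defaultdict

train = [
    ('MONEY',     ['pays', 'interest', 'on', 'the', 'loan']),
    ('MONEY',     ['fixed', 'interest', 'rate', 'bank']),
    ('CURIOSITY', ['showed', 'interest', 'in', 'the', 'painting']),
    ('CURIOSITY', ['great', 'interest', 'in', 'word', 'senses']),
]

sense_count = Counter()            # Count(s_k, w): occurrences of w with sense s_k
feat_count = defaultdict(Counter)  # Count(f_j, s_k): feature f_j seen with sense s_k
vocab = set()
for sense, context in train:
    sense_count[sense] += 1
    for f in context:
        feat_count[sense][f] += 1
        vocab.add(f)

def classify(context):
    total = sum(sense_count.values())
    best, best_score = None, float('-inf')
    for s in sense_count:
        # log P(s_k) + sum_j log P(f_j|s_k), with add-one smoothing
        score = math.log(sense_count[s] / total)
        n_s = sum(feat_count[s].values())
        for f in context:
            score += math.log((feat_count[s][f] + 1) / (n_s + len(vocab)))
        if score > best_score:
            best, best_score = s, score
    return best

print(classify(['the', 'bank', 'pays', 'interest']))  # -> MONEY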
Part 2
Supervised methods II: Information-theoretic and collocation-based
approaches
Information-theoretic measures
 Find the single, most informative feature to predict a sense.
 E.g. using a parallel corpus:
 prendre (FR) can translate as take or make
 prendre une décision: make a decision
 prendre une mesure: take a measure [to…]
 Informative feature in this case: direct object
 mesure indicates take
 décision indicates make
 Problem: need to identify the correct value of the feature that
indicates a specific sense.
Brown et al.’s algorithm
1. Given: translations T of word w
2. Given: values X of a useful feature (e.g. mesure, décision as values of DO)
3. Step 1: random partition P of T
4. While improving, do:
 create a partition Q of X that maximises I(P;Q)
 find a partition P of T that maximises I(P;Q)
comment: relies on mutual information to find clusters of translations mapping to clusters of feature values
Using dictionaries and thesauri
 Lesk (1986): one of the first to exploit dictionary definitions
 the definition corresponding to a sense can contain words which are good indicators for that sense
 Method (a minimal sketch follows below):
1. Given: ambiguous word w with senses s1…sn and glosses g1…gn
2. Given: the word w in context c
3. compute the overlap between c and each gloss
4. select the maximally matching sense
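The sketch below follows the simplified-Lesk variant, using WordNet glosses in place of dictionary definitions (assumes NLTK with the WordNet data downloaded):

from nltk.corpus import wordnet as wn

def simplified_lesk(word, context_words):
    # Score each sense by the overlap between its gloss and the context.
    context = set(w.lower() for w in context_words)
    best, best_overlap = None, -1
    for sense in wn.synsets(word):
        gloss = set(sense.definition().lower().split())
        overlap = len(gloss & context)
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best

# Prints the highest-overlap synset for "bank" in a money-related context.
print(simplified_lesk('bank', 'I deposited money into my savings account'.split()))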
Expanding a dictionary
 Problem with Lesk:
 often dictionary definitions don’t contain sufficient information
 not all words in dictionary definitions are good informants
 Solution: use a thesaurus with subject/topic categories
 e.g. Roget’s thesaurus
 http://www.gutenberg.org/cache/epub/22/pg22.html.utf8
Using topic categories
 Suppose every sense sk of word w has subject/topic tk
 w can be disambiguated by identifying the words related to tk
in the thesaurus
 Problems:
 general-purpose thesauri don’t list domain-specific topics
 several potentially useful words can be left out
 e.g. … Navratilova plays great tennis …
 proper name here useful as indicator of topic SPORT
Expanding a thesaurus: Yarowsky 1992
1. Given: context c and topic t
2. For all contexts and topics, compute p(c|t) using Naïve Bayes
 by comparing words pertaining to t in the thesaurus with words in c
 if p(c|t) > α, then assign topic t to context c
3. For all words in the vocabulary, update the list of contexts in which the word occurs
 assign topic t to each word in c
4. Finally, compute p(w|t) for all w in the vocabulary
 this gives the “strength of association” of w with t
(A toy sketch of the topic-scoring step follows below.)
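A toy sketch of scoring p(c|t) with a unigram Naïve Bayes model; the two-topic thesaurus and all word lists are hypothetical:

import math
from collections import Counter

thesaurus = {
    'UNIVERSE':    ['galaxy', 'orbit', 'telescope', 'star'],
    'ENTERTAINER': ['film', 'stage', 'audience', 'star'],
}

def topic_scores(context, smoothing=1.0):
    # log p(c|t) for each topic, smoothing over the thesaurus vocabulary.
    vocab = {w for words in thesaurus.values() for w in words}
    scores = {}
    for topic, words in thesaurus.items():
        counts = Counter(words)
        total = len(words)
        scores[topic] = sum(
            math.log((counts[w] + smoothing) / (total + smoothing * len(vocab)))
            for w in context)
    return scores

print(topic_scores(['telescope', 'star', 'orbit']))  # UNIVERSE scores highest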
Yarowsky 1992: some results

WORD      SENSE          ROGET TOPIC    ACCURACY
star      space object   UNIVERSE        96%
star      celebrity      ENTERTAINER     95%
star      shape          INSIGNIA        82%
sentence  punishment     LEGAL_ACTION    99%
sentence  set of words   GRAMMAR         98%
Bootstrapping
 Yarowsky (1995) suggested the one sense per discourse/collocation constraints.
 Yarowsky’s method:
1. select the strongest collocational feature in a specific context
2. disambiguate based only on this feature
(similar to the information-theoretic method discussed earlier)
One sense per collocation
1. For each sense s of w, initialise F, the collocations found in the dictionary definition for s
2. One sense per collocation:
 identify the set of contexts containing collocates of s
 for each sense s of w, update F to contain those collocates f such that, for all s’ ≠ s:
P(s|f) / P(s’|f) > α
(where α is a threshold)
One sense per discourse
3. For each document:
 find the majority sense of w out of those found in the previous step
 assign all occurrences of w the majority sense
 This is implemented as a post-processing step (sketched below); it reduces the error rate by ca. 27%.
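A minimal sketch of that post-processing step:

from collections import Counter

def one_sense_per_discourse(labels):
    # labels: sense tags assigned to the occurrences of w within one
    # document by the collocation step; relabel them all with the majority.
    majority, _ = Counter(labels).most_common(1)[0]
    return [majority] * len(labels)

print(one_sense_per_discourse(['PLANT-FACTORY', 'PLANT-FACTORY', 'PLANT-LIVING']))
# ['PLANT-FACTORY', 'PLANT-FACTORY', 'PLANT-FACTORY']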
Part 3
Unsupervised disambiguation
Preliminaries
 Recall: unsupervised learning can do sense discrimination, not tagging
 akin to clustering occurrences with the same sense
 e.g. Brown et al 1991: cluster translations of a word
 this is akin to clustering senses
Brown et al.’s method
 Preliminary categorisation:
1. Set P(w|s) randomly for all words w and senses s of w.
2. Compute, for each context c of w, the probability P(c|s) that the context was generated by sense s.
 Use (1) and (2) as a preliminary estimate; re-estimate iteratively to find the best fit to the corpus (an EM-style sketch follows below).
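An EM-style sketch of this re-estimation loop, as a mixture of unigram sense models over toy contexts for bass; with data this small the split is purely illustrative, and real runs need more data and random restarts (requires Python 3.8+ for math.prod):

import math, random

contexts = [
    ['fish', 'caught', 'bass', 'river'],
    ['bass', 'guitar', 'player', 'band'],
    ['fresh', 'bass', 'fish', 'market'],
    ['loud', 'bass', 'music', 'band'],
]
K = 2  # number of senses to discriminate
vocab = sorted({w for c in contexts for w in c})
random.seed(0)

# Step 1: set P(w|s) randomly and normalise; uniform sense priors.
p_w_s = []
for _ in range(K):
    raw = {w: random.random() for w in vocab}
    z = sum(raw.values())
    p_w_s.append({w: v / z for w, v in raw.items()})
p_s = [1.0 / K] * K

for _ in range(20):
    # E-step: P(s|c) for each context (Step 2 on the slide).
    post = []
    for c in contexts:
        scores = [p_s[s] * math.prod(p_w_s[s][w] for w in c) for s in range(K)]
        z = sum(scores)
        post.append([x / z for x in scores])
    # M-step: re-estimate P(s) and P(w|s) from the expected counts.
    for s in range(K):
        p_s[s] = sum(p[s] for p in post) / len(contexts)
        counts = {w: 1e-6 for w in vocab}  # tiny smoothing
        for p, c in zip(post, contexts):
            for w in c:
                counts[w] += p[s]
        z = sum(counts.values())
        p_w_s[s] = {w: v / z for w, v in counts.items()}

for c, p in zip(contexts, post):
    print(c, '-> sense', max(range(K), key=lambda s: p[s]))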
Characteristics of unsupervised
disambiguation
 Can adapt easily to new domains, not covered by a dictionary
or pre-labelled corpus
 Very useful for information retrieval
 If there are many senses (e.g. 20 senses for word w), the
algorithm will split contexts into fine-grained sets
 NB: can go awry with infrequent senses
Part 4
Some issues with WSD
The task definition
 The WSD task traditionally assumes that a word has one and
only one sense in a context.
 Is this true?
 Is it possible to see evidence of co-activation (one word
displaying more than one sense)?
 this would bring competition to the licensed trade
 competition = “act of competing”
 competition = “people/organisations who are competing”
Systematic polysemy
 Not all senses are so easy to distinguish. E.g. competition in the
“agent competing” vs “act of competing” sense.
 The polysemy here is systematic
 Compare bank/bank where the senses are utterly distinct (and most
linguists wouldn’t consider this a case of polysemy, but homonymy)
 Can translation equivalence help here?
 depends if polysemy is systematic in all languages
Logical metonymy
 Metonymy = usage of a word to stand for something else
 e.g. the pen is mightier than the sword
 pen = the press
 Logical metonymy arises due to systematic polysemy
 good cook vs. good book
 enjoy the paper vs enjoy the cake
 Should WSD systems distinguish these? How could they do so?
Which words/usages count?
 Many proper names are identical to common nouns (cf.
Brown, Bush,…)
 This presents a WSD algorithm with systematic ambiguity
and reduces performance.
 Also, names are good indicators of senses of neighbouring
words.
 But this requires a priori categorisation of names.
 Brown’s green stance vs. the cook’s green curry
Useful links
 WordNet: http://wordnet.princeton.edu/
 SemCor:
http://www.cse.unt.edu/~rada/downloads.html#semcor
 Senseval: http://www.senseval.org/