UCSD Linguistics/CSE 256: HMM part-of-speech tagging


Statistical NLP
Winter 2009
Lecture 8: Part-of-speech tagging
Roger Levy
Thanks to Dan Klein, Chris Manning, and Jason Eisner for slides
Parts-of-Speech (English)
• One basic kind of linguistic structure: syntactic word classes
Open class (lexical) words:
• Nouns: Proper (IBM, Italy); Common (cat / cats, snow)
• Verbs: Main (see, registered)
• Adjectives (yellow)
• Adverbs (slowly)
• Numbers (122,312, one)
• … more
Closed class (functional) words:
• Determiners (the, some)
• Conjunctions (and, or)
• Pronouns (he, its)
• Modals (can, had)
• Prepositions (to, with)
• Particles (off, up)
• … more
The Penn Treebank tagset (tag: description; examples):
CC: conjunction, coordinating; and both but either or
CD: numeral, cardinal; mid-1890 nine-thirty 0.5 one
DT: determiner; a all an every no that the
EX: existential there; there
FW: foreign word; gemeinschaft hund ich jeux
IN: preposition or conjunction, subordinating; among whether out on by if
JJ: adjective or numeral, ordinal; third ill-mannered regrettable
JJR: adjective, comparative; braver cheaper taller
JJS: adjective, superlative; bravest cheapest tallest
MD: modal auxiliary; can may might will would
NN: noun, common, singular or mass; cabbage thermostat investment subhumanity
NNP: noun, proper, singular; Motown Cougar Yvette Liverpool
NNPS: noun, proper, plural; Americans Materials States
NNS: noun, common, plural; undergraduates bric-a-brac averages
POS: genitive marker; ' 's
PRP: pronoun, personal; hers himself it we them
PRP$: pronoun, possessive; her his mine my our ours their thy your
RB: adverb; occasionally maddeningly adventurously
RBR: adverb, comparative; further gloomier heavier less-perfectly
RBS: adverb, superlative; best biggest nearest worst
RP: particle; aboard away back by on open through
TO: "to" as preposition or infinitive marker; to
UH: interjection; huh howdy uh whammo shucks heck
VB: verb, base form; ask bring fire see take
VBD: verb, past tense; pleaded swiped registered saw
VBG: verb, present participle or gerund; stirring focusing approaching erasing
VBN: verb, past participle; dilapidated imitated reunified unsettled
VBP: verb, present tense, not 3rd person singular; twist appear comprise mold postpone
VBZ: verb, present tense, 3rd person singular; bases reconstructs marks uses
WDT: WH-determiner; that what whatever which whichever
WP: WH-pronoun; that what whatever which who whom
WP$: WH-pronoun, possessive; whose
WRB: WH-adverb; however whenever where why
The Tagging Task
Input: the lead paint is unsafe
Output: the/Det lead/N paint/N is/V unsafe/Adj
• Uses:
• text-to-speech (how do we pronounce “lead”?)
• can write regexps like (Det) Adj* N+ over the output
• preprocessing to speed up parser (but a little dangerous)
• if you know the tag, you can back off to it in other tasks
Why Do We Care?
Input: the lead paint is unsafe
Output: the/Det lead/N paint/N is/V unsafe/Adj
• The first statistical NLP task
• Been done to death by different methods
• Easy to evaluate (how many tags are correct?)
• Canonical finite-state task
• Can be done well with methods that look at local context
• Though should “really” do it by parsing!
Current Performance
Input: the lead paint is unsafe
Output: the/Det lead/N paint/N is/V unsafe/Adj
• How many tags are correct?
• About 97% currently
• But baseline is already 90%
• Baseline is performance of stupidest possible method
• Tag every word with its most frequent tag
• Tag unknown words as nouns
Why POS Tagging?
• Useful in and of itself
• Text-to-speech: record, lead
• Lemmatization: saw[v] → see, saw[n] → saw
• Quick-and-dirty NP-chunk detection: grep {JJ | NN}* {NN | NNS}
• Useful as a pre-processing step for parsing
• Less tag ambiguity means fewer parses
• However, some tag choices are better decided by parsers
  The/DT Georgia/NNP branch/NN had/VBD taken/VBN on/RP(or IN?) loan/NN commitments/NNS …
  The/DT average/NN of/IN interbank/NN offered/VBD(or VBN?) rates/NNS plummeted/VBD …
Part-of-Speech Ambiguity
• Example: Fed raises interest rates 0.5 percent
  Fed: {VBD, VBN, NNP}   raises: {VBZ, NNS, VB}   interest: {VBP, NN}   rates: {VBZ, NNS}   0.5: {CD}   percent: {NN}
• Two basic sources of constraint:
• Grammatical environment
• Identity of the current word
• Many more possible features:
• … but we won’t be able to use them for a while
What Should We Look At?
correct tags: Bill/PN directed/Verb a/Det cortege/Noun of/Prep autos/Noun through/Prep the/Det dunes/Noun
[Figure: under each word, some possible tags (maybe more), drawn from PN, Verb, Det, Noun, Prep, Adj, …?]
Each unknown tag is constrained by its word and by the tags to its immediate left and right. But those tags are unknown too …
Today’s main goal
• Introduce Hidden Markov Models (HMMs) for part-of-speech tagging/sequence classification
• Cover the three fundamental questions of HMMs:
• How do we fit the model parameters of an HMM?
• Given an HMM, how do we efficiently calculate the
likelihood of an observation w?
• Given an HMM and an observation w, how do we
efficiently calculate the most likely state sequence for w?
• Next time: we’ll cover richer models for sequence
classification
HMMs
• We want a model of sequences s and observations w
  [Figure: HMM graphical model with states s0, s1, s2, …, sn, where each state si emits word wi]
• Assumptions:
  • States are tag n-grams
  • Usually a dedicated start and end state / word
  • Tag/state sequence is generated by a Markov model
  • Words are chosen independently, conditioned only on the tag/state
  • These are totally broken assumptions: why?
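To make the generative story above concrete, here is a minimal sketch (not from the lecture) of a bigram HMM's joint probability: it simply multiplies P(tag | previous tag) and P(word | tag) along the sequence. The toy probability tables are made up purely for illustration.

```python
import math

# Hypothetical toy parameters (for illustration only); a real tagger would
# estimate these from a treebank, with smoothing.
transitions = {("<s>", "Det"): 0.5, ("Det", "N"): 0.7, ("N", "N"): 0.3,
               ("N", "V"): 0.4, ("V", "Adj"): 0.2, ("Adj", "</s>"): 0.5}
emissions = {("Det", "the"): 0.6, ("N", "lead"): 0.01, ("N", "paint"): 0.02,
             ("V", "is"): 0.3, ("Adj", "unsafe"): 0.001}

def log_joint(words, tags):
    """log P(tags, words) under a bigram HMM: transition * emission at each step."""
    score = 0.0
    prev = "<s>"
    for w, t in zip(words, tags):
        score += math.log(transitions.get((prev, t), 1e-12))   # P(t | previous tag)
        score += math.log(emissions.get((t, w), 1e-12))        # P(w | t)
        prev = t
    score += math.log(transitions.get((prev, "</s>"), 1e-12))  # stop probability
    return score

print(log_joint("the lead paint is unsafe".split(),
                ["Det", "N", "N", "V", "Adj"]))
```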
Transitions and Emissions
• Transitions P(s|s') encode well-formed tag sequences
• In a bigram tagger, states = tags
  [Figure: state sequence s0 = <>, s1 = <t1>, s2 = <t2>, …, sn = <tn>, with state si emitting word wi]
• In a trigram tagger, states = tag pairs
  [Figure: state sequence s0 = <,>, s1 = <,t1>, s2 = <t1,t2>, …, sn = <tn-1,tn>, with state si emitting word wi]
Estimating Transitions
• Use standard smoothing methods to estimate transitions:
$$P(t_i \mid t_{i-1}, t_{i-2}) = \lambda_2 \hat{P}(t_i \mid t_{i-1}, t_{i-2}) + \lambda_1 \hat{P}(t_i \mid t_{i-1}) + (1 - \lambda_1 - \lambda_2)\,\hat{P}(t_i)$$
• Can get a lot fancier (e.g. KN smoothing), but in this
case it doesn’t buy much
• One option: encode more into the state, e.g. whether the
previous word was capitalized (Brants 00)
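As an illustration of the interpolation above, here is a small assumed sketch (not the lecture's code) that builds the interpolated trigram transition estimate from raw n-gram counts; the lambda weights are fixed here, whereas in practice they would be tuned on held-out data.

```python
from collections import Counter

def make_transition_estimator(tag_sequences, lam1=0.3, lam2=0.6):
    """P(t_i | t_{i-1}, t_{i-2}) as an interpolation of trigram, bigram,
    and unigram relative frequencies (lambda weights assumed given)."""
    uni, bi, tri = Counter(), Counter(), Counter()
    bi_ctx, tri_ctx = Counter(), Counter()
    total = 0
    for tags in tag_sequences:
        padded = ["<s>", "<s>"] + list(tags)
        for i in range(2, len(padded)):
            t2, t1, t = padded[i - 2], padded[i - 1], padded[i]
            uni[t] += 1
            total += 1
            bi[(t1, t)] += 1
            bi_ctx[t1] += 1
            tri[(t2, t1, t)] += 1
            tri_ctx[(t2, t1)] += 1

    def p(t, t1, t2):
        # t1 = previous tag, t2 = the tag before that
        p_uni = uni[t] / total if total else 0.0
        p_bi = bi[(t1, t)] / bi_ctx[t1] if bi_ctx[t1] else 0.0
        p_tri = tri[(t2, t1, t)] / tri_ctx[(t2, t1)] if tri_ctx[(t2, t1)] else 0.0
        return lam2 * p_tri + lam1 * p_bi + (1 - lam1 - lam2) * p_uni

    return p

p_trans = make_transition_estimator([["DT", "NN", "VBZ"], ["DT", "JJ", "NN"]])
print(p_trans("NN", "DT", "<s>"))   # P(NN | previous=DT, before that=<s>)
```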
Estimating Emissions
• Emissions are trickier:
  • Words we've never seen before
  • Words which occur with tags we've never seen
  • One option: break out the Good-Turing smoothing
  • Issue: words aren't black boxes:
    343,127.23    11-year    Minteria    reintroducibly
• Unknown words usually broken into word classes:
    D+,D+.D+    D+-x+    Xx+    x+"ly"
• Another option: decompose words into features and use a maxent model along with Bayes' rule:
$$P(w \mid t) = P_{\text{MAXENT}}(t \mid w)\, P(w) \,/\, P(t)$$
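To make the word-class idea concrete, here is a small assumed sketch that maps unknown words onto coarse shape classes like the ones above, so that emission probabilities can be shared per class instead of per word; the exact patterns and their order are illustrative choices, not the lecture's.

```python
import re

def word_class(word):
    """Map an unknown word to a coarse shape class (illustrative patterns only)."""
    if re.fullmatch(r"\d+[,.\d]*", word):
        return "D+,D+.D+"        # numbers like 343,127.23
    if re.fullmatch(r"\d+-[a-z]+", word):
        return "D+-x+"           # digit-hyphen-letters like 11-year
    if word.endswith("ly"):
        return 'x+"ly"'          # -ly words like reintroducibly
    if word[:1].isupper():
        return "Xx+"             # capitalized words like Minteria
    return "x+"                  # fallback class

for w in ["343,127.23", "11-year", "Minteria", "reintroducibly"]:
    print(w, "->", word_class(w))
```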
Better Features
• Can do surprisingly well just looking at a word by itself:
• Word: the: the → DT
• Lowercased word: Importantly: importantly → RB
• Prefixes: unfathomable: un- → JJ
• Suffixes: Importantly: -ly → RB
• Capitalization: Meridian: CAP → NNP
• Word shapes: 35-year: d-x → JJ
• Then build a maxent (or whatever) model to predict tag
• Maxent P(t|w): 93.7% / 82.6%
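As a sketch of how such word-internal features might be extracted (assumed code, not the lecture's; the feature names are made up), the function below produces identity, lowercase, affix, capitalization, and shape features for a single word. A maxent classifier over features like these is the kind of model behind the 93.7% / 82.6% figure quoted above.

```python
import re

def word_features(word):
    """Features of a word in isolation: identity, lowercase form, affixes,
    capitalization, and a coarse word shape."""
    shape = re.sub(r"[A-Z]", "X", word)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"\d", "d", shape)
    return {
        "word=" + word: 1.0,
        "lower=" + word.lower(): 1.0,
        "prefix2=" + word[:2]: 1.0,
        "suffix2=" + word[-2:]: 1.0,
        "capitalized": 1.0 if word[:1].isupper() else 0.0,
        "shape=" + shape: 1.0,
    }

print(word_features("Importantly"))   # includes suffix2='ly' and a capitalized shape
```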
Disambiguation
• Given these two multinomials, we can score any word / tag sequence pair:
  Fed/NNP raises/VBZ interest/NN rates/NNS 0.5/CD percent/NN ./.
  [Figure: trellis of trigram states <,> → <,NNP> → <NNP,VBZ> → <VBZ,NN> → <NN,NNS> → <NNS,CD> → <CD,NN> → … → <STOP>, each state emitting the corresponding word]
  P(NNP|<,>) P(Fed|NNP) P(VBZ|<,NNP>) P(raises|VBZ) P(NN|<NNP,VBZ>) …
• In principle, we're done: list all possible tag sequences, score each one, pick the best one (the Viterbi state sequence)
  NNP VBZ NN NNS CD NN   logP = -23
  NNP NNS NN NNS CD NN   logP = -29
  NNP VBZ VB NNS CD NN   logP = -27
Finding the Best Trajectory
• Too many trajectories (state sequences) to list
• Option 1: Beam Search
  [Figure: beam of partial taggings, e.g. <> expands to Fed:NNP, Fed:VBN, Fed:VBD; Fed:NNP expands to Fed:NNP raises:NNS and Fed:NNP raises:VBZ; Fed:VBN expands to Fed:VBN raises:NNS and Fed:VBN raises:VBZ]
• A beam is a set of partial hypotheses
• Start with just the single empty trajectory
• At each derivation step:
• Consider all continuations of previous hypotheses
• Discard most; keep the top k, or those within a factor of the best (or some combination)
• Beam search works relatively well in practice
• … but sometimes you want the optimal answer
• … and you need optimal answers to validate your beam search
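A compact, assumed sketch of the beam search just described, keeping the top-k partial taggings at each position. The `score_extension` argument stands in for whatever transition-plus-emission log score the model assigns; the toy scorer below is made up purely so the example runs.

```python
import math

def beam_search(words, tagset, score_extension, k=5):
    """Keep the top-k partial tag sequences at each position.
    score_extension(prev_tags, word, tag) -> log-prob increment (model-specific)."""
    beam = [([], 0.0)]                      # (partial tag sequence, log score)
    for word in words:
        candidates = []
        for tags, score in beam:            # consider all continuations
            for tag in tagset:
                candidates.append((tags + [tag],
                                   score + score_extension(tags, word, tag)))
        candidates.sort(key=lambda x: x[1], reverse=True)
        beam = candidates[:k]               # discard most, keep the top k
    return beam[0]                          # best (approximately) complete tagging

# Toy scorer: prefer NNP for capitalized words, NN otherwise (illustration only).
def toy_score(prev_tags, word, tag):
    good = (tag == "NNP") if word[0].isupper() else (tag == "NN")
    return math.log(0.9 if good else 0.1)

print(beam_search("Fed raises interest rates".split(),
                  ["NN", "NNP", "VBZ"], toy_score, k=3))
```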
The goal: efficient exact inference
• We want a way of computing exactly:
  1. The probability of any sequence w given the model
  2. The most likely state sequence given the model and some w
• BUT! There are exponentially many possible state sequences ($|T|^n$)
• We can, however, efficiently calculate (1) and (2) using dynamic programming (DP)
• The DP algorithm used for efficient state-sequence inference uses a trellis of paths through state space
• It is an instance of what in NLP is called the Viterbi Algorithm
HMM Trellis
The Viterbi Algorithm
• Dynamic program for computing
$$\delta_i(s) = \max_{s_0 \ldots s_{i-1}} P(s_0 \ldots s_{i-1} s,\; w_1 \ldots w_{i-1})$$
• The score of a best path up to position i ending in state s
$$\delta_0(s) = \begin{cases} 1 & \text{if } s = \langle \bullet, \bullet \rangle \\ 0 & \text{otherwise} \end{cases}$$
$$\delta_i(s) = \max_{s'} P(s \mid s')\, P(w_{i-1} \mid s')\, \delta_{i-1}(s')$$
• Also store a back-trace: the most likely previous state for each state
$$\psi_i(s) = \arg\max_{s'} P(s \mid s')\, P(w_{i-1} \mid s')\, \delta_{i-1}(s')$$
• Iterate on i, storing partial results as you go
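Here is a small, self-contained sketch (assumed, not the course's reference code) of the Viterbi recursion for a bigram HMM: it fills in the delta table position by position, keeps a back-pointer for each state, and traces back to recover the best tag sequence. The toy transition and emission tables are made up for illustration.

```python
import math

def viterbi(words, tags, trans, emit, start="<s>", stop="</s>"):
    """Best tag sequence under a bigram HMM.
    trans[(prev_tag, tag)] and emit[(tag, word)] are probabilities."""
    def lp(p):                      # safe log of a possibly-missing probability
        return math.log(p) if p > 0 else float("-inf")

    delta = [{start: 0.0}]          # delta[i][s] = best log score ending in state s
    backptr = [{}]
    for i, w in enumerate(words, start=1):
        delta.append({})
        backptr.append({})
        for t in tags:
            best_prev, best = None, float("-inf")
            for s_prev, score in delta[i - 1].items():
                cand = score + lp(trans.get((s_prev, t), 0)) + lp(emit.get((t, w), 0))
                if cand > best:
                    best_prev, best = s_prev, cand
            delta[i][t] = best
            backptr[i][t] = best_prev
    # fold in the stop transition, then trace back the best path
    last = max(tags, key=lambda t: delta[len(words)][t] + lp(trans.get((t, stop), 0)))
    path = [last]
    for i in range(len(words), 1, -1):
        path.append(backptr[i][path[-1]])
    return list(reversed(path))

trans = {("<s>", "NNP"): 0.4, ("NNP", "VBZ"): 0.5, ("VBZ", "NN"): 0.5,
         ("NN", "NNS"): 0.4, ("NNS", "</s>"): 0.5}
emit = {("NNP", "Fed"): 0.01, ("VBZ", "raises"): 0.02,
        ("NN", "interest"): 0.02, ("NNS", "rates"): 0.02}
print(viterbi("Fed raises interest rates".split(),
              ["NNP", "VBZ", "NN", "NNS"], trans, emit))
```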
A simple example
• [enter handout/whiteboard mini-example]
So How Well Does It Work?
• Choose the most common tag
• 90.3% with a bad unknown word model
• 93.7% with a good one
• TnT (Brants, 2000):
• A carefully smoothed trigram tagger
• Suffix trees for emissions
• 96.7% on WSJ text (SOA is ~97.2%)
• Noise in the data
• Many errors in the training and test corpora
  The/DT average/NN of/IN interbank/NN offered/VBD rates/NNS plummeted/VBD …
• Probably about 2% guaranteed error from noise (on this data)
  chief/JJ executive/JJ officer/NN
  chief/NN executive/JJ officer/NN
  chief/JJ executive/NN officer/NN
  chief/NN executive/NN officer/NN
Theoretical recap
• An HMM is generative and is easy to train from supervised labeled sequence data
• With a trained HMM, we can use Bayes' rule for noisy-channel evaluation of P(s|w) (s = states, w = words):
• But just like with TextCat, we might want to do discriminative classification (directly estimate P(s|w))
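The Bayes-rule step is left implicit on the slide; for completeness, the standard noisy-channel decomposition it refers to is

$$P(s \mid w) = \frac{P(w \mid s)\, P(s)}{P(w)} \;\propto\; P(w \mid s)\, P(s),$$

where P(s) is the transition (tag sequence) model, P(w|s) is the emission model, and the denominator P(w) does not affect which state sequence is best.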
Overview: Accuracies
• Roadmap of (known / unknown) accuracies:
  Most freq tag:               ~90% / ~50%
  Trigram HMM:                 ~95% / ~55%
  Maxent P(t|w):               93.7% / 82.6%
  TnT (HMM++):                 96.2% / 86.0%
  MEMM tagger:                 96.9% / 86.9%
  Bidirectional dependencies:  97.2% / 89.0%
  Upper bound:                 ~98% (human agreement)
• Most errors are on unknown words
How to improve supervised results?
• Better features!
  They/PRP left/VBD as/IN soon/RB as/IN he/PRP arrived/VBD ./.   (the first "as" should be RB)
• We could fix this with a feature that looked at the next word
  Intrinsic/NNP flaws/NNS remained/VBD undetected/VBN ./.   (Intrinsic should be JJ)
• We could fix this by linking capitalized words to their lowercase versions
• More general solution: Maximum-entropy Markov models
• Reality check:
• Taggers are already pretty good on WSJ journal text…
• What the world needs is taggers that work on other text!
Discriminative sequence classification
• We will look once again at Maximum-Entropy models as a basis for sequence classification
• We want to calculate the joint distribution P(s|w) over whole tag sequences, but it is too sparse to estimate directly
• We will once again use a Markov assumption
• As before, we can identify states with tag tuples to obtain dependence on the previous n-1 tags
  • e.g., tag trigrams:
Maxent Markov Models (MEMMs)
• This is sometimes called a maxent tagger, or MEMM
• Key point: the next-tag distribution conditions on the
entire word sequence!
• Why can this be done?
• How is this in principle useful?
MEMM taggers: graphical visualization
• The HMM tagger had words generated from states:
• The MEMM tagger commonly has states conditioned
on same-position words only:
• It’s also possible to condition states on all words:
MaxEnt & Feature Templates
• In MaxEnt, we directly estimate conditional distributions
• Of course, there is too much sparsity to do this directly
• So we featurize the history (just like for text
categorization)
MaxEnt Classifiers
• Given the featurization, how to estimate conditional probabilities?
• Exponential (log-linear, maxent, logistic, Gibbs) models:
• Turn the votes into a probability distribution:
$$P(c \mid d, \lambda) = \frac{\exp \sum_i \lambda_i(c)\, f_i(d)}{\sum_{c'} \exp \sum_i \lambda_i(c')\, f_i(d)}$$

  (c = next tag, d = conditioning info, λ = parameters; the exp makes the feature "votes" positive, and the denominator normalizes the votes)
• For any weight vector {λi}, we get a conditional probability model P(c | d, λ).
• We want to choose parameters {λi} that maximize the conditional (log) likelihood of the data:

$$\log P(C \mid D, \lambda) = \sum_{(c,d) \in (C,D)} \log P(c \mid d, \lambda) = \sum_{(c,d) \in (C,D)} \log \frac{\exp \sum_i \lambda_i(c)\, f_i(d)}{\sum_{c'} \exp \sum_i \lambda_i(c')\, f_i(d)}$$
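A small assumed sketch of the conditional model above: given feature values f_i(d) and per-class weights λ_i(c), it exponentiates the weighted votes and normalizes them into P(c | d, λ). The function name, feature names, and weights are invented for illustration.

```python
import math

def maxent_prob(features, weights, classes):
    """P(c | d, lambda): softmax over weighted feature votes.
    features: dict feature -> value f_i(d)
    weights:  dict (class, feature) -> lambda_i(c)"""
    scores = {c: sum(weights.get((c, f), 0.0) * v for f, v in features.items())
              for c in classes}
    z = sum(math.exp(s) for s in scores.values())        # normalizer
    return {c: math.exp(s) / z for c, s in scores.items()}

features = {"word=interest": 1.0, "prev_tag=VBZ": 1.0}
weights = {("NN", "word=interest"): 1.2, ("VBP", "word=interest"): 0.3,
           ("NN", "prev_tag=VBZ"): 0.5}
print(maxent_prob(features, weights, ["NN", "VBP"]))
```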
Building a Maxent Model
• How to define features:
• Features are patterns in the input which we think the weighted vote
should depend on
• Usually features added incrementally to target errors
• If we’re careful, adding some mediocre features into the mix won’t hurt
(but won’t help either)
• How to learn model weights?
• Maxent just one method
• Use a numerical optimization package
• Given a current weight vector, need to calculate (repeatedly):
• Conditional likelihood of the data
• Derivative of that likelihood wrt each feature weight
Featurizing histories
• Features corresponding to the generative HMM:
  • An "emission feature": <w_i=future, t_i=JJ>
  • A "transition feature": <t_i=JJ, t_{i-1}=DT>
  • Feature templates: <w_i, t_i>, <t_i, t_{i-1}>
• But we can be much more flexible:
  • Mixed feature templates: <t_{i-1}, w_i, t_i>
  • Or feature templates picking out subregularities:
    <w_i matches /[A-Z].*/, t_i==NNP or t_i==NNPS>
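To show what instantiating such templates might look like, here is a short assumed sketch that produces emission-style, transition-style, mixed, and subregularity features for one position of a tagging history; the feature name strings are hypothetical.

```python
import re

def history_features(words, i, prev_tag, tag):
    """Instantiate a few feature templates for predicting the tag at position i."""
    w = words[i]
    feats = {
        f"w_i={w},t_i={tag}": 1.0,                    # emission-style template
        f"t_i={tag},t_i-1={prev_tag}": 1.0,           # transition-style template
        f"t_i-1={prev_tag},w_i={w},t_i={tag}": 1.0,   # mixed template
    }
    if re.match(r"[A-Z].*", w) and tag in ("NNP", "NNPS"):
        feats["capitalized_and_proper"] = 1.0         # subregularity template
    return feats

print(history_features("The Georgia branch had taken on".split(), 1, "DT", "NNP"))
```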
Maximum-likelihood parameter estimation
• The (log) conditional likelihood is a function of the iid data (C,D) and the parameters λ:

$$\log P(C \mid D, \lambda) = \log \prod_{(c,d) \in (C,D)} P(c \mid d, \lambda) = \sum_{(c,d) \in (C,D)} \log P(c \mid d, \lambda)$$

• If there aren't many values of c, it's easy to calculate:

$$\log P(C \mid D, \lambda) = \sum_{(c,d) \in (C,D)} \log \frac{\exp \sum_i \lambda_i(c)\, f_i(d)}{\sum_{c'} \exp \sum_i \lambda_i(c')\, f_i(d)}$$

• We can separate this into two components:

$$\log P(C \mid D, \lambda) = \sum_{(c,d) \in (C,D)} \log \exp \sum_i \lambda_i(c)\, f_i(d) \;-\; \sum_{(c,d) \in (C,D)} \log \sum_{c'} \exp \sum_i \lambda_i(c')\, f_i(d)$$

$$\log P(C \mid D, \lambda) = N(\lambda) - M(\lambda)$$
Likelihood maximization
• We have a function to optimize:

$$\log P(C \mid D, \lambda) = \sum_{(c,d) \in (C,D)} \log \frac{\exp \sum_i \lambda_i(c)\, f_i(d)}{\sum_{c'} \exp \sum_i \lambda_i(c')\, f_i(d)}$$

• It turns out that this function is convex
• We know the function's derivatives:

$$\partial \log P(C \mid D, \lambda) / \partial \lambda_i(c) = \text{actual count}(f_i, c) - \text{predicted count}(f_i, \lambda)$$
• Ready to feed it into a numerical optimization
package…
• [also turns out to be the maximum entropy sol’n]
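As a concrete sketch of the derivative above (assumed, not the lecture's code), the gradient for each (class, feature) weight is the empirical feature count minus the model's expected feature count; the softmax helper from the earlier sketch is repeated here so the snippet stands alone, and the example data is invented.

```python
import math

def maxent_prob(features, weights, classes):
    """P(c | d, lambda): softmax over weighted feature votes (as sketched earlier)."""
    scores = {c: sum(weights.get((c, f), 0.0) * v for f, v in features.items())
              for c in classes}
    z = sum(math.exp(s) for s in scores.values())
    return {c: math.exp(s) / z for c, s in scores.items()}

def gradient(data, weights, classes):
    """d logP / d lambda_i(c) = actual count(f_i, c) - predicted count(f_i, lambda).
    data: list of (gold_class, feature_dict) pairs."""
    grad = {}
    for gold, feats in data:
        probs = maxent_prob(feats, weights, classes)
        for f, v in feats.items():
            grad[(gold, f)] = grad.get((gold, f), 0.0) + v            # actual count
            for c in classes:
                grad[(c, f)] = grad.get((c, f), 0.0) - probs[c] * v   # expected count
    return grad

data = [("NN", {"word=interest": 1.0, "prev_tag=VBZ": 1.0}),
        ("VBP", {"word=interest": 1.0, "prev_tag=PRP": 1.0})]
print(gradient(data, {}, ["NN", "VBP"]))
```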
Smoothing: Issues of Scale
• Lots of features:
• NLP maxent models can have over 1M features.
• Even storing a single array of parameter values can have a
substantial memory cost.
• Lots of sparsity:
• Overfitting very easy – need smoothing!
• Many features seen in training will never occur again at test time.
• Optimization problems:
• Feature weights can be infinite, and iterative solvers can take a
long time to get to those infinities.
Smoothing: Issues
• Assume the following empirical distribution: Heads = h, Tails = t
• Features: {Heads}, {Tails}
• We'll have the following model distribution:

$$p_{\text{HEADS}} = \frac{e^{\lambda_H}}{e^{\lambda_H} + e^{\lambda_T}} \qquad p_{\text{TAILS}} = \frac{e^{\lambda_T}}{e^{\lambda_H} + e^{\lambda_T}}$$

• Really, only one degree of freedom (λ = λ_H - λ_T):

$$p_{\text{HEADS}} = \frac{e^{\lambda_H} e^{-\lambda_T}}{e^{\lambda_H} e^{-\lambda_T} + e^{\lambda_T} e^{-\lambda_T}} = \frac{e^{\lambda}}{e^{\lambda} + e^{0}} \qquad p_{\text{TAILS}} = \frac{e^{0}}{e^{\lambda} + e^{0}}$$
Smoothing: Issues
• The data likelihood in this model is:

$$\log P(h, t \mid \lambda) = h \log p_{\text{HEADS}} + t \log p_{\text{TAILS}} = h\lambda - (t + h)\log(1 + e^{\lambda})$$

[Figure: plots of log P as a function of λ for observed counts Heads/Tails = 2/2, 3/1, and 4/0]
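As a brief worked step (not on the slide), setting the derivative of this likelihood to zero shows why the 4/0 case is problematic:

$$\frac{\partial}{\partial \lambda}\left[h\lambda - (t+h)\log(1+e^{\lambda})\right] = h - (t+h)\,\frac{e^{\lambda}}{1+e^{\lambda}} = 0 \;\Rightarrow\; p_{\text{HEADS}} = \frac{e^{\lambda}}{1+e^{\lambda}} = \frac{h}{h+t}$$

So for Heads/Tails = 2/2 the optimum is λ = 0, for 3/1 it is λ = log 3, but for 4/0 the maximum requires p_HEADS = 1, which is only reached as λ → ∞; this is the infinite-weight problem that the priors below address.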
Smoothing: Priors (MAP)
• What if we had a prior expectation that parameter values wouldn't be very large?
• We could then balance evidence suggesting large (or infinite) parameters against our prior.
• The evidence would never totally defeat the prior, and parameters would be smoothed (and kept finite!).
• We can do this explicitly by changing the optimization objective to maximum posterior likelihood:

$$\underbrace{\log P(C, \lambda \mid D)}_{\text{Posterior}} = \underbrace{\log P(\lambda)}_{\text{Prior}} + \underbrace{\log P(C \mid D, \lambda)}_{\text{Evidence}}$$

• This is alternatively known as penalization in the non-Bayesian statistical literature, and as the use of a prior in the Bayesian literature.
Smoothing: Priors
• Gaussian, or quadratic, priors:
  • Intuition: parameters shouldn't be large.
  • Formalization: prior expectation that each parameter will be distributed according to a Gaussian with mean μ and variance σ²:

$$P(\lambda_i) = \frac{1}{\sigma_i\sqrt{2\pi}} \exp\left(-\frac{(\lambda_i - \mu_i)^2}{2\sigma_i^2}\right)$$

• Penalizes parameters for drifting too far from their mean prior value (usually μ = 0).
• 2σ² = 1 works surprisingly well (better to set using held-out data, though)
[Figure: the prior for 2σ² = ∞, 2σ² = 10, and 2σ² = 1]
Smoothing: Priors
• If we use gaussian priors:
• Trade off some expectation-matching for smaller parameters.
• When multiple features can be recruited to explain a data point, the more
common ones generally receive more weight.
• Accuracy generally goes up!
• Change the objective:

$$\log P(C, \lambda \mid D) = \log P(C \mid D, \lambda) + \log P(\lambda)$$

$$\log P(C, \lambda \mid D) = \sum_{(c,d) \in (C,D)} \log P(c \mid d, \lambda) \;-\; \sum_i \frac{(\lambda_i - \mu_i)^2}{2\sigma_i^2} + k$$

[Figure: effect of the prior for 2σ² = ∞, 2σ² = 10, and 2σ² = 1]
• Change the derivative:

$$\partial \log P(C, \lambda \mid D) / \partial \lambda_i = \text{actual}(f_i, C) - \text{predicted}(f_i, \lambda) - (\lambda_i - \mu_i)/\sigma_i^2$$
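A tiny assumed sketch of that change: the same count-based gradient as before, with the Gaussian-prior penalty subtracted for each weight (with μ = 0 the penalty is just λ_i/σ²). The numbers are invented; sigma2 = 0.5 corresponds to the 2σ² = 1 setting mentioned above.

```python
def penalized_gradient(grad, weights, sigma2=0.5, mu=0.0):
    """Apply the Gaussian-prior correction: actual - predicted - (lambda_i - mu)/sigma^2.
    grad: dict of unpenalized gradients keyed like the weight vector."""
    out = dict(grad)
    for key, lam in weights.items():
        out[key] = out.get(key, 0.0) - (lam - mu) / sigma2
    return out

# Example: a weight drifting to a large value gets pulled back toward mu = 0.
print(penalized_gradient({("NN", "word=interest"): 0.8},
                         {("NN", "word=interest"): 5.0}, sigma2=0.5))
```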
Decoding
• Decoding maxent taggers:
• Just like decoding HMMs
• Viterbi, beam search, posterior decoding
• Viterbi algorithm (HMMs):
• Viterbi algorithm (Maxent):
Iteratively improving your model
• With MaxEnt models, you need feature templates
• There are no a priori principles of choosing which
templates to include in a model
• Two basic approaches
• Hand-crafted features (or feature templates)
• Search in the space of possible feature templates
(feature/model selection)
• We saw an example of the latter approach in TextCat
• Now we’ll look at an example of the former: Toutanova
& Manning 2000