Ngrams, Markov models, and Hidden Markov Models

Sequence Models
With slides by me, Joshua Goodman,
Fei Xia
Outline
• Language Modeling
• Ngram Models
• Hidden Markov Models
– Supervised Parameter Estimation
– Probability of a sequence
– Viterbi (or decoding)
– Baum-Welch
A bad language model
[Example slides; figures not included in the transcript.]
What is a language model?
Language Model: A distribution that assigns a
probability to language utterances.
e.g.,
PLM(“zxcv ./,mwea afsido”) is zero;
PLM(“mat cat on the sat”) is tiny;
PLM(“Colorless green ideas sleep furiously”) is bigger;
PLM(“A cat sat on the mat”) is bigger still.
What’s a language model for?
• Information Retrieval
• Handwriting recognition
• Speech Recognition
• Spelling correction
• Optical character recognition
• Machine translation
• …
Example Language Model Application
Speech Recognition: convert an acoustic signal
(sound wave recorded by a microphone) to a
sequence of words (text file).
Straightforward model:
P ( text | sound )
But this can be hard to train effectively (although
see CRFs later).
Example Language Model Application
Speech Recognition: convert an acoustic signal
(sound wave recorded by a microphone) to a
sequence of words (text file).
Traditional solution: Bayes' Rule

P(text | sound) = P(sound | text) P(text) / P(sound)

– P(sound | text): the acoustic model (easier to train)
– P(text): the language model
– P(sound): can be ignored; it doesn't matter for picking a good text
Importance of Sequence
So far, we’ve been making the exchangeability, or
bag-of-words, assumption:
The order of words is not important.
It turns out, that’s actually not true (duh!).
“cat mat on the the sat” ≠ “the cat sat on the mat”
“Mary loves John” ≠ “John loves Mary”
Language Models
with Sequence Information
Problem: How can we define a model that
• assigns probability to sequences of words (a language model),
• depends on the order of the words, and
• can be trained and computed tractably?
Outline
• Language Modeling
• Ngram Models
• Hidden Markov Models
– Supervised parameter estimation
– Probability of a sequence (decoding)
– Viterbi (Best hidden layer sequence)
– Baum-Welch
Smoothing: Kneser-Ney
P(Francisco | eggplant) vs. P(stew | eggplant)
• "Francisco" is common, so backoff and interpolated methods say it is likely
• But it only occurs in the context of "San"
• "Stew" is common, and occurs in many contexts
• Weight the backoff probability by the number of contexts the word occurs in

Kneser-Ney smoothing (cont.)
Backoff:
Interpolation:
(a reference formulation of the interpolated form is sketched below)
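The formulas on the original slides are not reproduced in the transcript. As a reference only (notation is mine, not from the slides), the interpolated Kneser-Ney bigram estimate is commonly written as:

P_KN(w_i | w_{i-1}) = max(c(w_{i-1} w_i) − d, 0) / c(w_{i-1}) + λ(w_{i-1}) · |{w' : c(w' w_i) > 0}| / |{(w', w) : c(w' w) > 0}|

where d is an absolute discount and λ(w_{i-1}) is chosen so the probabilities sum to 1. The second term is the "continuation probability" of w_i: it counts how many distinct contexts w_i has been seen in, which is exactly the intuition behind the Francisco/stew example above.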
Outline
• Language Modeling
• Ngram Models
• Hidden Markov Models
– Supervised parameter estimation
– Probability of an observation sequence
– Viterbi (Best hidden layer sequence)
– Baum-Welch
The Hidden Markov Model
[Figure: a chain of hidden nodes X_1 → X_2 → … → X_T, each emitting an observed node O_1, O_2, …, O_T.]
A dynamic Bayes net (dynamic because the size can change).
The O_i nodes are called observed nodes.
The X_i nodes are called hidden nodes.
HMMs and Language Processing
[Figure: the same X_1 … X_T / O_1 … O_T chain.]
• HMMs have been used in a variety of applications, but especially:
– Speech recognition (hidden nodes are text words, observations are spoken words)
– Part of Speech Tagging (hidden nodes are parts of speech, observations are words)
HMM Independence Assumptions
[Figure: the same X_1 … X_T / O_1 … O_T chain.]
HMMs assume that:
• X_i is independent of X_1 through X_{i-2}, given X_{i-1} (Markov assumption)
• O_i is independent of all other nodes, given X_i
• P(X_i | X_{i-1}) and P(O_i | X_i) do not depend on i (stationary assumption)
Not very realistic assumptions about language, but HMMs are often good enough, and very convenient.
HMM Formula
An HMM defines the probability of observing a sequence o = <o_1, o_2, …, o_T> with a particular sequence of hidden states x = <x_1, …, x_T> as:

P(o, x) = P(x_1) P(o_1 | x_1) ∏_{i=2}^{T} P(x_i | x_{i-1}) P(o_i | x_i)

To calculate this, we need:
– Prior: P(x_1) for all values of x_1
– Observation: P(o_i | x_i) for all values of o_i and x_i
– Transition: P(x_i | x_{i-1}) for all values of x_i and x_{i-1}
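As a minimal sketch (not part of the slides), the joint probability can be computed directly from this formula, assuming the parameters are stored as numpy arrays:

import numpy as np

def joint_prob(pi, A, B, states, obs):
    """P(o, x) for one hidden-state sequence and one observation sequence.

    pi[i]   = P(x_1 = h_i)                  (length-N vector)
    A[i, j] = P(x_t = h_j | x_{t-1} = h_i)  (N x N)
    B[i, k] = P(o_t = v_k | x_t = h_i)      (N x M)
    states, obs: lists of integer indices of equal length T.
    """
    p = pi[states[0]] * B[states[0], obs[0]]
    for t in range(1, len(obs)):
        p *= A[states[t - 1], states[t]] * B[states[t], obs[t]]
    return p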
HMM: Pieces
1) A set of hidden states H = {h_1, …, h_N}: the values which a hidden node may take.
2) A vocabulary, or set of observation symbols, V = {v_1, …, v_M}: the values which an observed node may take.
3) Initial probabilities P(x_1 = h_j) for all j
   – Written as a vector π of N initial probabilities: π_j = P(x_1 = h_j)
4) Transition probabilities P(x_t = h_k | x_{t-1} = h_j) for all j, k
   – Written as an N×N 'transition matrix' A: A_{j,k} = P(x_t = h_k | x_{t-1} = h_j)
5) Observation probabilities P(o_t = v_k | x_t = h_j) for all j, k
   – Written as an N×M 'observation matrix' B: B_{j,k} = P(o_t = v_k | x_t = h_j)
HMM for POS Tagging
1) H = {DT, NN, VB, IN, …}, the set of all POS tags.
2) V = the set of all words in English.
3) Initial probabilities π_j are the probability that POS tag h_j can start a sentence.
4) Transition probabilities A_{j,k} represent the probability that one tag (e.g., h_k = VB) can follow another (e.g., h_j = NN).
5) Observation probabilities B_{j,k} represent the probability that a tag (e.g., h_j = NN) will generate a particular word (e.g., v_k = cat).
(A toy instantiation is sketched below.)
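A toy instantiation of these pieces in code. The tag set, vocabulary, and probability values are made up for illustration only:

import numpy as np

# Hypothetical toy tag set and vocabulary (values invented for illustration).
H = ["DT", "NN", "VB"]                 # hidden states (POS tags)
V = ["the", "cat", "sat", "mat"]       # observation symbols (words)

pi = np.array([0.7, 0.2, 0.1])         # pi_j = P(x_1 = H[j])

# A[j, k] = P(x_t = H[k] | x_{t-1} = H[j]); each row sums to 1.
A = np.array([[0.05, 0.90, 0.05],
              [0.10, 0.30, 0.60],
              [0.60, 0.30, 0.10]])

# B[j, k] = P(o_t = V[k] | x_t = H[j]); each row sums to 1.
B = np.array([[0.97, 0.01, 0.01, 0.01],
              [0.05, 0.50, 0.05, 0.40],
              [0.10, 0.05, 0.80, 0.05]])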
Outline
• Language Models
• Ngram modeling
• Hidden Markov Models
– Supervised parameter estimation
– Probability of a sequence
– Viterbi: what’s the best hidden state sequence?
– Baum-Welch: unsupervised parameter estimation
Supervised Parameter Estimation
[Figure: the hidden-state chain X_1 → … → X_T with transition arcs labeled A and emission arcs labeled B, emitting o_1 … o_T.]
• Given an observation sequence and states, find the HMM
parameters (π, A, and B) that are most likely to produce
the sequence.
• For example, POS-tagged data from the Penn Treebank
Maximum Likelihood
Parameter Estimation
[Figure: the same hidden-state chain with arcs A and B, emitting o_1 … o_T.]

π̂_i = (# sentences starting with state h_i) / (# sentences)

â_{ij} = (# times h_i is followed by h_j) / (# times h_i is in the data)

b̂_{ik} = (# times h_i produces v_k) / (# times h_i is in the data)
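A minimal sketch of these counts in code, assuming the training data is a list of sentences of (word, tag) pairs (function and variable names are illustrative, not from the slides):

from collections import Counter

def estimate_hmm(tagged_sentences):
    """Supervised MLE for an HMM from data like [[('the','DT'), ('cat','NN')], ...].
    Returns raw relative-frequency estimates (no smoothing)."""
    start, trans, emit, state_count = Counter(), Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        tags = [tag for _, tag in sent]
        start[tags[0]] += 1                       # sentence-initial tag
        for prev, nxt in zip(tags, tags[1:]):
            trans[(prev, nxt)] += 1               # tag h_i followed by h_j
        for word, tag in sent:
            emit[(tag, word)] += 1                # tag h_i produces word v_k
            state_count[tag] += 1                 # times h_i is in the data
    n_sent = len(tagged_sentences)
    pi = {t: c / n_sent for t, c in start.items()}
    A = {(i, j): c / state_count[i] for (i, j), c in trans.items()}
    B = {(i, w): c / state_count[i] for (i, w), c in emit.items()}
    return pi, A, B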
Outline
• Language Modeling
• Ngram models
• Hidden Markov Models
– Supervised parameter estimation
– Probability of a sequence
– Viterbi
– Baum-Welch
What’s the probability of a sentence?
Suppose I asked you, ‘What’s the probability of
seeing a sentence v1, …, vT on the web?’
If we have an HMM model of English, we can use it
to estimate the probability.
(In other words, HMMs can be used as language
models.)
Conditional Probability of a Sentence
• If we knew the hidden states that generated
each word in the sentence, it would be easy:
P(o_1, …, o_T | x_1, …, x_T) = P(o_1, …, o_T, x_1, …, x_T) / P(x_1, …, x_T)

= [ P(x_1) P(o_1 | x_1) ∏_{i=2}^{T} P(x_i | x_{i-1}) P(o_i | x_i) ] / [ P(x_1) ∏_{i=2}^{T} P(x_i | x_{i-1}) ]

= ∏_{i=1}^{T} P(o_i | x_i)
Marginal Probability of a Sentence
Via marginalization, we have:
P(o_1, …, o_T) = Σ_{x_1, …, x_T} P(o_1, …, o_T, x_1, …, x_T)
             = Σ_{x_1, …, x_T} P(x_1) P(o_1 | x_1) ∏_{i=2}^{T} P(x_i | x_{i-1}) P(o_i | x_i)

Unfortunately, if there are N possible values for each x_i (h_1 through h_N), then there are N^T values for x_1, …, x_T.
Brute-force computation of this sum is intractable.
Forward Procedure
[Figure: the x_1 … x_T chain emitting o_1 … o_T.]
• The special structure of the HMM gives us an efficient solution using dynamic programming.
• Intuition: the probability of the first t observations is the same for all possible length-(t+1) state sequences.
• Define:

α_i(t) = Σ_{x_1, …, x_{t-1}} P(o_1 … o_t, x_1, …, x_{t-1}, x_t = h_i | π, A, B) = P(o_1 … o_t, x_t = h_i | π, A, B)

α_i(1) = π_i B_{i,o_1}
Forward Procedure
[Figure: the x_1 … x_T chain emitting o_1 … o_T.]

α_j(t+1) = P(o_1 … o_{t+1}, x_{t+1} = h_j)
         = P(o_1 … o_{t+1} | x_{t+1} = h_j) P(x_{t+1} = h_j)
         = P(o_1 … o_t | x_{t+1} = h_j) P(o_{t+1} | x_{t+1} = h_j) P(x_{t+1} = h_j)
         = P(o_1 … o_t, x_{t+1} = h_j) P(o_{t+1} | x_{t+1} = h_j)
Forward Procedure
[Figure: the x_1 … x_T chain emitting o_1 … o_T.]

= Σ_{i=1…N} P(o_1 … o_t, x_t = h_i, x_{t+1} = h_j) P(o_{t+1} | x_{t+1} = h_j)
= Σ_{i=1…N} P(o_1 … o_t, x_{t+1} = h_j | x_t = h_i) P(x_t = h_i) P(o_{t+1} | x_{t+1} = h_j)
= Σ_{i=1…N} P(o_1 … o_t, x_t = h_i) P(x_{t+1} = h_j | x_t = h_i) P(o_{t+1} | x_{t+1} = h_j)
= Σ_{i=1…N} α_i(t) A_{ij} B_{j,o_{t+1}}
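As a minimal sketch (not from the slides), the forward recursion can be implemented with numpy, assuming the conventions used above: A[i, j] = P(x_{t+1} = h_j | x_t = h_i) and B[i, k] = P(o_t = v_k | x_t = h_i).

import numpy as np

def forward(pi, A, B, obs):
    """alpha[t, i] = P(o_1 .. o_{t+1}, x_{t+1} = h_i)  (0-indexed over t).
    Returns (alpha, total probability of the observation sequence)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                  # alpha_i(1) = pi_i B_{i,o_1}
    for t in range(1, T):
        # alpha_j(t+1) = sum_i alpha_i(t) A_{ij} B_{j,o_{t+1}}
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha, alpha[-1].sum()                 # P(O) = sum_i alpha_i(T)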
Backward Procedure
[Figure: the x_1 … x_T chain emitting o_1 … o_T.]

β_i(t) = P(o_{t+1} … o_T | x_t = h_i)
(the probability of the rest of the observations, given the state at time t)

β_i(T) = 1

β_i(t) = Σ_{j=1…N} A_{ij} B_{j,o_{t+1}} β_j(t+1)

(A sketch of this recursion appears below.)
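A matching sketch of the backward recursion, under the same assumed array conventions as the forward sketch:

import numpy as np

def backward(A, B, obs):
    """beta[t, i] = P(o_{t+2} .. o_T | x_{t+1} = h_i)  (0-indexed over t)."""
    T, N = len(obs), A.shape[0]
    beta = np.ones((T, N))                        # beta_i(T) = 1
    for t in range(T - 2, -1, -1):
        # beta_i(t) = sum_j A_{ij} B_{j,o_{t+1}} beta_j(t+1)
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta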
Decoding Solution
[Figure: the x_1 … x_T chain emitting o_1 … o_T.]

Forward Procedure:   P(O | π, A, B) = Σ_{i=1}^{N} α_i(T)

Backward Procedure:  P(O | π, A, B) = Σ_{i=1}^{N} π_i B_{i,o_1} β_i(1)

Combination:         P(O | π, A, B) = Σ_{i=1}^{N} α_i(t) β_i(t), for any t
Outline
• Language modeling
• Ngram models
• Hidden Markov Models
– Supervised parameter estimation
– Probability of a sequence
– Viterbi: what’s the best hidden state sequence?
– Baum-Welch
Best State Sequence
[Figure: observations o_1 … o_T; the hidden states are to be inferred.]
• Find the hidden state sequence that best explains the observations:

argmax_X P(X | O)

• Viterbi algorithm
Viterbi Algorithm
[Figure: the hidden chain x_1 … x_{t-1} reaching state h_j, emitting o_1 … o_T.]

δ_j(t) = max_{x_1 … x_{t-1}} P(x_1 … x_{t-1}, o_1 … o_{t-1}, x_t = h_j, o_t)

The probability of the state sequence that maximizes the probability of seeing the observations to time t−1, landing in state h_j, and seeing the observation at time t.
Viterbi Algorithm
[Figure: the x_1 … x_T chain emitting o_1 … o_T.]

δ_j(t) = max_{x_1 … x_{t-1}} P(x_1 … x_{t-1}, o_1 … o_{t-1}, x_t = h_j, o_t)

Recursive computation:

δ_j(t+1) = max_i δ_i(t) A_{ij} B_{j,o_{t+1}}

ψ_j(t+1) = argmax_i δ_i(t) A_{ij} B_{j,o_{t+1}}
Viterbi Algorithm
[Figure: the x_1 … x_T chain emitting o_1 … o_T.]

X̂_T = h_{argmax_i δ_i(T)}

X̂_t = ψ_{X̂_{t+1}}(t+1)

P(X̂) = max_i δ_i(T)

Compute the most likely state sequence by working backwards.
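A minimal sketch of the Viterbi recursion and backtrace, under the same assumed array conventions as the forward sketch (not an optimized implementation):

import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely hidden-state sequence and its probability."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)             # backpointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        # scores[i, j] = delta_i(t) A_{ij} B_{j,o_{t+1}}
        scores = delta[t - 1][:, None] * A * B[:, obs[t]][None, :]
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0)
    path = [int(delta[-1].argmax())]              # best final state
    for t in range(T - 1, 0, -1):                 # follow backpointers
        path.append(int(psi[t][path[-1]]))
    return list(reversed(path)), float(delta[-1].max())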
Outline
• Language modeling
• Ngram models
• Hidden Markov Models
– Supervised parameter estimation
– Probability of a sequence
– Viterbi
– Baum-Welch: Unsupervised parameter estimation
Unsupervised Parameter Estimation
[Figure: the hidden-state chain with transition arcs A and emission arcs B, emitting o_1 … o_T.]
• Given an observation sequence, find the model that is most likely to produce that sequence.
• No analytic method.
• Given a model and an observation sequence, update the model parameters to better fit the observations.
Parameter Estimation
[Figure: the hidden-state chain with arcs A and B, emitting o_1 … o_T.]

p_t(i, j) = α_i(t) A_{ij} B_{j,o_{t+1}} β_j(t+1) / Σ_{m=1…N} α_m(t) β_m(t)
(probability of traversing the arc from h_i to h_j at time t)

γ_t(i) = Σ_{j=1…N} p_t(i, j)
(probability of being in state h_i at time t)
Parameter Estimation
Parameter Estimation
[Figure: the hidden-state chain with arcs A and B, emitting o_1 … o_T.]

π̂_i = γ_1(i)

â_{ij} = Σ_{t=1}^{T} p_t(i, j) / Σ_{t=1}^{T} γ_t(i)

b̂_{ik} = Σ_{t : o_t = v_k} γ_t(i) / Σ_{t=1}^{T} γ_t(i)

Now we can compute the new estimates of the model parameters.
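A minimal sketch of one Baum-Welch re-estimation step, reusing the forward and backward sketches above. It follows the standard formulation: γ is computed directly as α_i(t) β_i(t) / P(O), which agrees with the slide's γ_t(i) for t < T.

import numpy as np

def baum_welch_step(pi, A, B, obs):
    """One EM (Baum-Welch) re-estimation step for a single observation sequence."""
    T, N = len(obs), len(pi)
    alpha, total = forward(pi, A, B, obs)         # forward() from the earlier sketch
    beta = backward(A, B, obs)                    # backward() from the earlier sketch

    # p[t, i, j]: probability of traversing the arc i -> j at time t
    p = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        p[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
        p[t] /= p[t].sum()
    gamma = alpha * beta / total                  # gamma[t, i]: P(x_{t+1} = h_i | O)

    new_pi = gamma[0]
    new_A = p.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        mask = np.array(obs) == k                 # time steps where o_t = v_k
        new_B[:, k] = gamma[mask].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B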
Parameter Estimation
[Figure: the hidden-state chain with arcs A and B, emitting o_1 … o_T.]
• Guarantee: P(o1:T|A,B,π) <= P(o1:T|Â,B̂, π̂ )
• In other words, by repeating this procedure, we
can gradually improve how well the HMM fits the
unlabeled data.
• There is no guarantee that this will converge to
the best possible HMM, however (only
guaranteed to find a local maximum).
The Most Important Thing
[Figure: the hidden-state chain with arcs A and B, emitting o_1 … o_T.]
We can use the special structure of this
model to do a lot of neat math and solve
problems that are otherwise not tractable.