Lecture 9: Hidden Markov Models
(HMMs)
(Chapter 9 of Manning and Schütze)
Dr. Mary P. Harper
ECE, Purdue University
[email protected]
yara.ecn.purdue.edu/~harper
Fall 2001
EE669: Natural Language Processing
Markov Assumptions
• Often we want to consider a sequence of random variables
that aren’t independent, but rather the value of each variable
depends on previous elements in the sequence.
• Let X = (X1, .., XT) be a sequence of random variables
taking values in some finite set
S = {s1, …, sN}, the state space. If X possesses the
following properties, then X is a Markov Chain.
• Limited Horizon:
P(Xt+1 = sk | X1, …, Xt) = P(Xt+1 = sk | Xt), i.e., a word’s state
only depends on the previous state.
• Time Invariant (Stationary):
P(Xt+1= sk|Xt) = P(X2 = sk|X1) i.e., the dependency does not
change over time.
Definition of a Markov Chain
• A is a stochastic N × N matrix of the probabilities of transitions with:
– aij = Pr{transition from si to sj} = Pr(Xt+1 = sj | Xt = si)
– ∀i,j: aij ≥ 0, and ∀i: Σj=1..N aij = 1
• π is a vector of N elements representing the initial
state probability distribution with:
– πi = Pr{the initial state is si} = Pr(X1 = si)
– Σi=1..N πi = 1
– Can avoid this by creating a special start state s0.
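
As a quick illustration of these constraints, a minimal Python sketch (with made-up two-state values, not from the slides) that checks the stochastic conditions on A and π:

```python
import numpy as np

# Hypothetical two-state chain; the values are illustrative, not from the slides.
A = np.array([[0.7, 0.3],      # aij = Pr(Xt+1 = sj | Xt = si)
              [0.4, 0.6]])
pi = np.array([0.9, 0.1])      # pi_i = Pr(X1 = si)

# Stochastic constraints from the slide: non-negative entries,
# each row of A sums to 1, and pi sums to 1.
assert (A >= 0).all() and np.allclose(A.sum(axis=1), 1.0)
assert (pi >= 0).all() and np.isclose(pi.sum(), 1.0)
```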
Markov Chain
Markov Process & Language
Models
• Chain rule:
P(W) = P(w1,w2,...,wT) = Πi=1..T p(wi|w1,w2,...,wi−n+1,...,wi−1)
• n-gram language models:
– Markov process (chain) of the order n-1:
P(W) = P(w1,w2,...,wT) = Πi=1..T p(wi|wi−n+1,wi−n+2,...,wi−1)
Using just one distribution (Ex.: trigram model: p(wi|wi-2,wi-1)):
Positions: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Words: My car broke down , and within hours Bob ’s car broke down , too .
p(, | broke down) = p(w5 | w3,w4) = p(w14 | w12,w13)
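
As an illustration of the trigram factorization above, a small sketch; the probability table p3, the default value, and the padding symbols are stand-ins, not estimates from any real model:

```python
# Hypothetical trigram table p(wi | wi-2, wi-1); the single entry and the
# default value are made up for illustration.
p3 = {("broke", "down", ","): 0.2}

def trigram_prob(words, p3, default=1e-6):
    """P(W) = prod_i p(wi | wi-2, wi-1) under a 2nd-order Markov assumption."""
    prob = 1.0
    padded = ["<s>", "<s>"] + words          # pad so the first words have a history
    for i in range(2, len(padded)):
        prob *= p3.get(tuple(padded[i - 2:i + 1]), default)
    return prob

# Usage: trigram_prob("My car broke down , and ...".split(), p3)
```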
Example with a Markov Chain
[State diagram: a Markov chain over the letters h, a, p, e, t, i with a Start node and transition probabilities such as .6, .4, .3, and 1.]
Probability of a Sequence of States
P(x1x2x3…xT) = P(x1) · P(x2|x1) · … · P(xT|x1x2x3…xT−1)
             = P(x1) · P(x2|x1) · … · P(xT|xT−1)
             = πx1 · Πt=1..T−1 axtxt+1
• P(t, a, p, p) = 1.0 × 0.3 × 0.4 × 1.0 = 0.12
• P(t, a, p, e) = ?
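
The worked example can be reproduced with a short sketch. Only the factors appearing in the 0.12 product are filled in; they are read off the (partially garbled) diagram, so treat them as assumptions, and the second sequence is left as the slide's exercise.

```python
def seq_prob(seq, pi, A):
    """P(x1..xT) = pi[x1] * prod over t of A[(xt, xt+1)] for a Markov chain."""
    prob = pi[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        prob *= A[(prev, cur)]
    return prob

# Factors taken from the worked product 1.0 x 0.3 x 0.4 x 1.0 (assumed from the figure).
pi = {"t": 1.0}
A = {("t", "a"): 0.3, ("a", "p"): 0.4, ("p", "p"): 1.0}

print(seq_prob("tapp", pi, A))   # 0.12, as on the slide
```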
Hidden Markov Models (HMMs)
• Sometimes it is not possible to know precisely which states
the model passes through; all we can do is observe some
phenomenon that occurs when in that state with some
probability distribution.
• An HMM is an appropriate model for cases when you
don’t know the state sequence that the model passes
through, but only some probabilistic function of it. For
example:
– Word recognition (discrete utterance or within continuous speech)
– Phoneme recognition
– Part-of-Speech Tagging
– Linear Interpolation
Example HMM
• Process:
– According to π, pick a state qi. Select a ball from urn i
(state is hidden).
– Observe the color (observable).
– Replace the ball.
– According to A, transition to the next urn from which to
pick a ball (state is hidden).
N states corresponding to urns: q1, q2, …,qN
M colors for balls found in urns: z1, z2, …, zM
A: N x N matrix; transition probability distribution (between urns)
B: N x M matrix; for each urn, there is a probability density
function for each color.
bij = Pr(COLOR= zj | URN = qi)
π: N vector; initial state probability distribution (where do we start?)
What is an HMM?
• Green circles are hidden states
• Dependent only on the previous state
What is an HMM?
• Purple nodes are observations
• Dependent only on their corresponding hidden state (for a
state emit model)
Elements of an HMM
• HMM (the most general case):
– five-tuple (S, K, π, A, B), where:
• S = {s1,s2,...,sN} is the set of states,
• K = {k1,k2,...,kM} is the output alphabet,
• π = {πi}, i ∈ S, initial state probability set
• A = {aij}, i,j ∈ S, transition probability set
• B = {bijk}, i,j ∈ S, k ∈ K (arc emission)
• B = {bik}, i ∈ S, k ∈ K (state emission), emission probability set
• State Sequence: X = (x1, x2, ..., xT+1), xt: S → {1,2,…,N}
• Observation (output) Sequence: O = (o1, ..., oT), ot ∈ K
HMM Formalism
[Trellis diagram: a chain of hidden states (S), each emitting an observation (K).]
• {S, K, π, A, B}
• S : {s1…sN} are the values for the hidden states
• K : {k1…kM} are the values for the observations
HMM Formalism
[Diagram: π feeds the initial hidden state; A labels transitions between states (S); B labels emissions from states to observations (K).]
• {S, K, π, A, B}
• π = {πi} are the initial state probabilities
• A = {aij} are the state transition probabilities
• B = {bik} are the observation probabilities
Example of an Arc Emit HMM
[Diagram: an arc-emission HMM with four states (1–4). Arcs carry transition probabilities 0.6, 0.4, 0.12, 0.88, and 1, and each arc has its own emission distribution over the symbols t, o, e (e.g., p(t)=.8, p(o)=.1, p(e)=.1 on one arc; p(t)=0, p(o)=0, p(e)=1 on another).]
HMMs
1. N states in the model: a state has some measurable,
distinctive properties.
2. At clock time t, you make a state transition (to a different
state or back to the same state), based on a transition
probability distribution that depends on the current state
(i.e., the one you're in before making the transition).
3. After each transition, you output an observation symbol
according to a probability distribution which depends on
the current state or the current arc.
 Goal: From the observations, determine what model
generated the output, e.g., in word recognition with a
different HMM for each word, the goal is to choose the
one that best fits the input word (based on state
transitions and output observations).
Example HMM
• N states corresponding to urns: q1, q2, …,qN
• M colors for balls found in urns: z1, z2, …, zM
• A: N x N matrix; transition probability distribution
(between urns)
• B: N x M matrix; for each urn, there is a
probability density function for each color.
– bij = Pr(COLOR= zj | URN = qi)
• π: N vector; initial state probability distribution
(where do we start?)
Example HMM
• Process:
– According to π, pick a state qi. Select a ball from urn i
(state is hidden).
– Observe the color (observable).
– Replace the ball.
– According to A, transition to the next urn from which to pick a ball
(state is hidden).
• Design: Choose N and M. Specify a model μ = (A, B, π) from the
training data. Adjust the model parameters (A, B, π) to maximize P(O|μ).
• Use: Given O and μ = (A, B, π), what is P(O|μ)? If we compute
P(O|μ) for all models μ, then we can determine which model most
likely generated O.
Simulating a Markov Process
t:= 1;
Start in state si with probability πi (i.e., X1 = i)
Forever do
Move from state si to state sj with probability aij
(i.e., Xt+1= j)
Emit observation symbol ot = k with probability bik (bijk )
t:= t+1
End
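
A runnable Python version of this pseudocode, sketched under a state-emission reading (arc emission would index B by (i, j, k)); the model arrays stand in for an actual μ = (A, B, π):

```python
import numpy as np

def simulate_hmm(pi, A, B, T, seed=0):
    """Generate T observation symbols from an HMM (pi: N, A: NxN, B: NxM arrays)."""
    rng = np.random.default_rng(seed)
    state = rng.choice(len(pi), p=pi)                # start in si with probability pi_i
    obs = []
    for _ in range(T):                               # the slide's "forever" loop, truncated at T
        state = rng.choice(len(pi), p=A[state])      # move si -> sj with probability aij
        obs.append(int(rng.choice(B.shape[1], p=B[state])))  # emit k with probability bjk
    return obs
```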
Why Use Hidden Markov Models?
• HMMs are useful when one can think of
underlying events probabilistically generating
surface events. Example: Part-of-Speech-Tagging.
• HMMs can be efficiently trained using the EM
Algorithm.
• Another example where HMMs are useful is in
generating parameters for linear interpolation of
n-gram models.
• If we assume that some set of data was generated by
an HMM, then an HMM is useful for calculating the
probabilities of possible underlying state sequences.
Fundamental Questions for HMMs
• Given a model μ = (A, B, π), how do we efficiently
compute how likely a certain observation is, that is,
P(O|μ)?
• Given the observation sequence O and a model μ,
how do we choose a state sequence (X1, …, XT+1)
that best explains the observations?
• Given an observation sequence O, and a space of
possible models found by varying the model
parameters μ = (A, B, π), how do we find the model
that best explains the observed data?
Probability of an Observation
• Given the observation sequence O = (o1, …, oT)
and a model μ = (A, B, π), we wish to know how
to efficiently compute P(O|μ).
• For any state sequence X = (X1, …, XT+1), we
find: P(O|μ) = ΣX P(X|μ) P(O|X, μ)
• This is simply the probability of an observation
sequence given the model.
• Direct evaluation of this expression is extremely
inefficient; however, there are dynamic
programming methods that compute it quite
efficiently.
Probability of an Observation
Given an observation sequence and a model, compute the probability of the observation sequence:
O = (o1…oT), μ = (A, B, π); compute P(O|μ).
Probability of an Observation
P(O | X, μ) = bx1o1 bx2o2 … bxToT
Probability of an Observation
P(O | X, μ) = bx1o1 bx2o2 … bxToT
P(X | μ) = πx1 ax1x2 ax2x3 … axT−1xT
Probability of an Observation
P(O | X, μ) = bx1o1 bx2o2 … bxToT
P(X | μ) = πx1 ax1x2 ax2x3 … axT−1xT
P(O, X | μ) = P(X | μ) P(O | X, μ)
Probability of an Observation
P(O | X, μ) = bx1o1 bx2o2 … bxToT
P(X | μ) = πx1 ax1x2 ax2x3 … axT−1xT
P(O, X | μ) = P(O | X, μ) P(X | μ)
P(O | μ) = ΣX P(X | μ) P(O | X, μ)
Probability of an Observation
(State Emit)
P(O | μ) = Σ{x1…xT} πx1 bx1o1 Πt=1..T−1 axtxt+1 bxt+1ot+1
Probability of an Observation with an
Arc Emit HMM
[Diagram: the same four-state arc-emission HMM as in the earlier arc-emit example.]
p(toe) = .6×.8 × .88×.7 × 1×.6
       + .4×.5 × 1×1 × .88×.2
       + .4×.5 × 1×1 × .12×1
       ≈ .237
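
The arithmetic can be checked by summing the three path probabilities directly:

```python
# Each term multiplies the (transition x emission) factors along one path, as on the slide.
paths = [
    0.6 * 0.8 * 0.88 * 0.7 * 1.0 * 0.6,
    0.4 * 0.5 * 1.0 * 1.0 * 0.88 * 0.2,
    0.4 * 0.5 * 1.0 * 1.0 * 0.12 * 1.0,
]
print(sum(paths))   # 0.2366..., i.e. ~.237 as on the slide
```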
Making Computation Efficient
• To avoid this complexity, use dynamic programming or
memoization techniques that exploit the trellis structure of the
problem.
• Use an array of states versus time to compute the probability
of being at each state at time t+1 in terms of the probabilities
for being in each state at t.
• A trellis can record the probability of all initial subpaths of
the HMM that end in a certain state at a certain time. The
probability of longer subpaths can then be worked out in
terms of the shorter subpaths.
• A forward variable, αi(t) = P(o1o2…ot−1, Xt = i | μ), is stored at
(si, t) in the trellis and expresses the total probability of
ending up in state si at time t.
The Trellis Structure
Forward Procedure
• Special structure gives us an efficient solution
using dynamic programming.
• Intuition: Probability of the first t observations is
the same for all possible t+1 length state
sequences (so don’t recompute it!)
• Define: αi(t) = P(o1…ot−1, xt = i | μ)
Forward Procedure
Note: we omit | μ from the formulas, and use P(A, B) = P(A|B)·P(B).

αj(t+1) = P(o1…ot+1, xt+1 = j)
        = P(o1…ot+1 | xt+1 = j) P(xt+1 = j)
        = P(o1…ot | xt+1 = j) P(ot+1 | xt+1 = j) P(xt+1 = j)
        = P(o1…ot, xt+1 = j) P(ot+1 | xt+1 = j)
Forward Procedure
        = Σi=1..N P(o1…ot, xt = i, xt+1 = j) P(ot+1 | xt+1 = j)
        = Σi=1..N P(o1…ot, xt+1 = j | xt = i) P(xt = i) P(ot+1 | xt+1 = j)
        = Σi=1..N P(o1…ot, xt = i) P(xt+1 = j | xt = i) P(ot+1 | xt+1 = j)
        = Σi=1..N αi(t) aij bijot+1
The Forward Procedure
• Forward variables are calculated as follows:
– Initialization: αi(1) = πi, 1 ≤ i ≤ N
– Induction: αj(t+1) = Σi=1..N αi(t) aij bijot, 1 ≤ t ≤ T, 1 ≤ j ≤ N
– Total: P(O|μ) = Σi=1..N αi(T+1) = Σj=1..N Σi=1..N αi(T) aij bijoT
• This algorithm requires 2N²T multiplications, much less than the
direct method, which takes on the order of (2T+1)·N^(T+1) multiplications.
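
A forward-procedure sketch following the recursion above, with the arc-emission factor bijot simplified to a state-emission B[i, ot] (an assumption of this sketch, not the slides' exact parameterization); it runs in O(N²T):

```python
import numpy as np

def forward(obs, pi, A, B):
    """Forward procedure.

    alpha[t, i] corresponds to the slides' alpha_i(t+1) = P(o1..ot, X_{t+1} = i | mu)
    with 0-based t; the state-emission factor B[i, obs[t]] replaces b_{ij,ot}.
    """
    N, T = len(pi), len(obs)
    alpha = np.zeros((T + 1, N))
    alpha[0] = pi                                     # initialization: alpha_i(1) = pi_i
    for t in range(T):
        alpha[t + 1] = (alpha[t] * B[:, obs[t]]) @ A  # induction: sum_i alpha_i(t) b_i(ot) a_ij
    return alpha[T].sum(), alpha                      # total: P(O|mu) = sum_i alpha_i(T+1)
```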
The Backward Procedure
• We could also compute these probabilities by
working backward through time.
• The backward procedure computes backward
variables which are the total probability of seeing
the rest of the observation sequence given that we
were in state si at time t.
• βi(t) = P(ot…oT | Xt = i, μ) is a backward variable.
• Backward variables are useful for the problem of
parameter reestimation.
Backward Procedure
βi(T+1) = 1
βi(t) = P(ot…oT | xt = i)
βi(t) = Σj=1..N aij bijot βj(t+1)
The Backward Procedure
• Backward variables can be calculated working
backward through the trellis as follows:
– Initialization: βi(T+1) = 1, 1 ≤ i ≤ N
– Induction: βi(t) = Σj=1..N aij bijot βj(t+1), 1 ≤ t ≤ T, 1 ≤ i ≤ N
– Total: P(O|μ) = Σi=1..N πi βi(1)
• Backward variables can also be combined with
forward variables:
P(O|μ) = Σi=1..N αi(t) βi(t), 1 ≤ t ≤ T+1
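
A matching backward sketch under the same state-emission simplification; βi(T+1) = 1, and the combination Σi αi(t)βi(t) then reproduces P(O|μ) at every t:

```python
import numpy as np

def backward(obs, pi, A, B):
    """Backward procedure: beta[t, i] ~ beta_i(t+1) = P(o_{t+1}..oT | X_{t+1} = i, mu), 0-based t."""
    N, T = len(pi), len(obs)
    beta = np.zeros((T + 1, N))
    beta[T] = 1.0                                     # initialization: beta_i(T+1) = 1
    for t in range(T - 1, -1, -1):
        beta[t] = B[:, obs[t]] * (A @ beta[t + 1])    # induction: b_i(ot) * sum_j a_ij beta_j(t+1)
    return float(np.dot(pi, beta[0])), beta           # total: P(O|mu) = sum_i pi_i beta_i(1)
```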
The Backward Procedure
• The combination of forward and backward variables follows from:
P(O, Xt = i | μ) = P(o1…oT, Xt = i | μ)
                 = P(o1…ot−1, Xt = i, ot…oT | μ)
                 = P(o1…ot−1, Xt = i | μ) · P(ot…oT | o1…ot−1, Xt = i, μ)
                 = P(o1…ot−1, Xt = i | μ) · P(ot…oT | Xt = i, μ)
                 = αi(t) · βi(t)
Hence P(O|μ) = Σi=1..N αi(t) βi(t), 1 ≤ t ≤ T+1.
Probability of an Observation
P(O | μ) = Σi=1..N αi(T+1)    (Forward Procedure)
P(O | μ) = Σi=1..N πi βi(1)    (Backward Procedure)
P(O | μ) = Σi=1..N αi(t) βi(t)    (Combination)
Finding the Best State Sequence
• One method consists of optimizing on the states individually.
• For each t, 1 ≤ t ≤ T+1, we would like to find Xt that maximizes
P(Xt | O, μ).
• Let γi(t) = P(Xt = i | O, μ) = P(Xt = i, O | μ) / P(O | μ) =
αi(t)βi(t) / Σj=1..N αj(t)βj(t)
• The individually most likely state is
X̂t = argmax1≤i≤N γi(t), 1 ≤ t ≤ T+1
• This quantity maximizes the expected number of states that
will be guessed correctly. However, it may yield a quite
unlikely state sequence.
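
Given α and β from the earlier sketches, γi(t) and the individually most likely states take only a few lines; note this is a per-time argmax, not the Viterbi path:

```python
import numpy as np

def posterior_states(alpha, beta):
    """gamma[t, i] = alpha_i(t) beta_i(t) / sum_j alpha_j(t) beta_j(t), plus argmax per t."""
    gamma = alpha * beta
    gamma = gamma / gamma.sum(axis=1, keepdims=True)
    return gamma, gamma.argmax(axis=1)    # hat{X}_t = argmax_i gamma_i(t)
```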
Best State Sequence
• Find the state sequence that best explains the observation
sequence.
• Viterbi algorithm: argmaxX P(X | O, μ)
Finding the Best State Sequence: The
Viterbi Algorithm
• The Viterbi algorithm efficiently computes the
most likely state sequence.
• To find the most likely complete path, compute:
argmaxX P(X | O, μ)
• To do this, it is sufficient to maximize for a fixed
O: argmaxX P(X, O | μ)
• We define
δj(t) = maxX1…Xt−1 P(X1…Xt−1, o1…ot−1, Xt = j | μ)
ψj(t) records the node of the incoming arc that led
to this most probable path.
Viterbi Algorithm
δj(t) = maxx1…xt−1 P(x1…xt−1, o1…ot−1, xt = j | μ)
(The state sequence which maximizes the probability of seeing the observations to time t−1, landing in state j, and seeing the observation at time t.)
Viterbi Algorithm
δj(t) = maxx1…xt−1 P(x1…xt−1, o1…ot−1, xt = j | μ)
Recursive computation:
δj(t+1) = maxi δi(t) aij bijot
ψj(t+1) = argmaxi δi(t) aij bijot
Finding the Best State Sequence:
The Viterbi Algorithm
The Viterbi Algorithm works as follows:
• Initialization: δj(1) = πj, 1 ≤ j ≤ N
• Induction: δj(t+1) = max1≤i≤N δi(t) aij bijot, 1 ≤ j ≤ N
Store backtrace: ψj(t+1) = argmax1≤i≤N δi(t) aij bijot, 1 ≤ j ≤ N
• Termination and path readout:
X̂T+1 = argmax1≤i≤N δi(T+1)
X̂t = ψj(t+1), where j = X̂t+1
P(X̂) = max1≤i≤N δi(T+1)
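
A Viterbi sketch following the initialization, induction, and backtrace above, again with the arc-emission bijot simplified to a state-emission B[i, ot]:

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely state sequence (length T+1, matching the slides' X1..X_{T+1})."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T + 1, N))
    psi = np.zeros((T + 1, N), dtype=int)
    delta[0] = pi                                            # delta_j(1) = pi_j
    for t in range(T):
        scores = delta[t, :, None] * B[:, obs[t], None] * A  # delta_i(t) b_i(ot) a_ij
        delta[t + 1] = scores.max(axis=0)                    # induction: max over i
        psi[t + 1] = scores.argmax(axis=0)                   # backtrace pointers
    path = [int(delta[T].argmax())]                          # hat{X}_{T+1}
    for t in range(T, 0, -1):
        path.append(int(psi[t][path[-1]]))                   # hat{X}_t = psi_j(t+1), j = hat{X}_{t+1}
    return path[::-1], delta[T].max()                        # best path and its score
```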
Viterbi Algorithm
X̂T+1 = argmaxi δi(T+1)
X̂t = ψj(t+1), where j = X̂t+1
P(X̂) = maxi δi(T+1)
Compute the most likely state sequence by working backwards.
Third Problem: Parameter Estimation
• Given a certain observation sequence, we want to
find the values of the model parameters μ = (A, B, π)
which best explain what we observed.
• Using Maximum Likelihood Estimation, find values
to maximize P(O|μ), i.e., argmaxμ P(Otraining|μ)
• There is no known analytic method to choose μ to
maximize P(O|μ). However, we can locally
maximize it by an iterative hill-climbing algorithm
known as Baum-Welch or Forward-Backward
algorithm. (This is a special case of the EM
Algorithm)
Parameter Estimation: The Forward-Backward Algorithm
• We don’t know what the model is, but we can
work out the probability of the observation
sequence using some (perhaps randomly chosen)
model.
• Looking at that calculation, we can see which state
transitions and symbol emissions were probably
used the most.
• By increasing the probability of those, we can
choose a revised model which gives a higher
probability to the observation sequence.
Parameter Estimation
• Given an observation sequence, find the model
that is most likely to produce that sequence.
• No analytic method
• Given a model and training sequences, update the
model parameters to better fit the observations.
Definitions
The probability of traversing a certain arc at time t, given
observation sequence O:
pt(i, j) = P(Xt = i, Xt+1 = j | O, μ)
         = P(Xt = i, Xt+1 = j, O | μ) / P(O | μ)
         = αi(t) aij bijot βj(t+1) / Σm=1..N αm(t) βm(t)
Let: γi(t) = Σj=1..N pt(i, j)
The Probability of Traversing an Arc
[Diagram: states si (time t) and sj (time t+1) joined by an arc labeled aij bijot; αi(t) summarizes the paths into si, βj(t+1) the paths out of sj. Time axis: t−1, t, t+1, t+2.]
Definitions
• The expected number of transitions from state i in O:
Σt=1..T γi(t),   where γi(t) = Σj=1..N pt(i, j)
• The expected number of transitions from state i to j in O:
Σt=1..T pt(i, j)
Reestimation Procedure
• Begin with a model μ, perhaps selected at random.
• Run O through the current model to estimate the
expectations of each model parameter.
• Change model to maximize the values of the paths
that are used a lot while respecting stochastic
constraints.
• Repeat this process (hoping to converge on
optimal values for the model parameters μ).
The Reestimation Formulas
From μ = (A, B, π), derive μ̂ = (Â, B̂, π̂).
Note that P(O | μ̂) ≥ P(O | μ).
The expected frequency in state i at time t = 1:
π̂i = γi(1)
âij = expected # transitions from state i to j / expected # transitions from state i
    = Σt=1..T pt(i, j) / Σt=1..T γi(t)
Reestimation Formulas
b̂ijk = expected # transitions from state i to j observing k / expected # transitions from state i to j
     = Σ{t: ot = k, 1 ≤ t ≤ T} pt(i, j) / Σt=1..T pt(i, j)
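
One reestimation sweep can be sketched directly from α, β, pt(i, j), and γi(t); this version reuses the forward()/backward() sketches above and their state-emission simplification, so the emission update is the per-state b̂ik form of the following slides rather than the arc form b̂ijk:

```python
import numpy as np

def baum_welch_step(obs, pi, A, B):
    """One forward-backward reestimation sweep (state-emission simplification).

    Assumes the forward()/backward() sketches above and their alpha/beta indexing.
    """
    N, T = len(pi), len(obs)
    P_O, alpha = forward(obs, pi, A, B)
    _, beta = backward(obs, pi, A, B)

    # p_t(i, j): probability of traversing the arc i -> j at time t, given O.
    p = np.zeros((T, N, N))
    for t in range(T):
        p[t] = alpha[t, :, None] * B[:, obs[t], None] * A * beta[t + 1, None, :]
    p /= P_O
    gamma = p.sum(axis=2)                        # gamma_i(t) = sum_j p_t(i, j)

    pi_hat = gamma[0]                            # expected frequency in state i at t = 1
    A_hat = p.sum(axis=0) / gamma.sum(axis=0)[:, None]
    B_hat = np.zeros_like(B)
    for k in range(B.shape[1]):                  # hat{b}_ik: expected emissions of k from i
        B_hat[:, k] = gamma[np.asarray(obs) == k].sum(axis=0)
    B_hat /= gamma.sum(axis=0)[:, None]
    return pi_hat, A_hat, B_hat
```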
Parameter Estimation
pt(i, j) = αi(t) aij bijot βj(t+1) / Σm=1..N αm(t) βm(t)    (probability of traversing an arc at time t)
γi(t) = Σj=1..N pt(i, j)    (probability of being in state i)
Parameter Estimation
π̂i = γi(1)
âij = Σt=1..T pt(i, j) / Σt=1..T γi(t)
b̂ik = Σ{t: ot = k} γi(t) / Σt=1..T γi(t)
Now we can compute the new estimates of the model parameters.
EE669: Natural Language Processing
64
HMM Applications
• Parameters for interpolated n-gram models
• Part of speech tagging
• Speech recognition