HMM - Academia Sinica


Hidden Markov Models
Hsin-min Wang
[email protected]
References:
1. L. R. Rabiner and B. H. Juang (1993), Fundamentals of Speech Recognition, Chapter 6
2. X. Huang et al. (2001), Spoken Language Processing, Chapter 8
3. L. R. Rabiner (1989), "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, vol. 77, no. 2, February 1989
Hidden Markov Model (HMM)
 History
– Published in Baum's papers in the late 1960s and early 1970s
– Introduced to speech processing by Baker (CMU) and Jelinek (IBM) in the 1970s
– Introduced to DNA sequencing in the 1990s?
 Assumption
– Speech signal (DNA sequence) can be characterized as a parametric
random process
– Parameters can be estimated in a precise, well-defined manner
 Three fundamental problems
– Evaluation of probability (likelihood) of a sequence of observations
given a specific HMM
– Determination of a best sequence of model states
– Adjustment of model parameters so as to best account for observed
signal/sequence
Hidden Markov Model (HMM)
Given an initial model as follows:

A = [0.34 0.33 0.33
     0.33 0.34 0.33
     0.33 0.33 0.34]

b1(A) = 0.34, b1(B) = 0.33, b1(C) = 0.33
b2(A) = 0.33, b2(B) = 0.34, b2(C) = 0.33
b3(A) = 0.33, b3(B) = 0.33, b3(C) = 0.34

π = [0.34 0.33 0.33]

[Figure: 3-state diagram; S1 emits {A:.34, B:.33, C:.33}, S2 emits {A:.33, B:.34, C:.33}, S3 emits {A:.33, B:.33, C:.34}; each self-transition has probability 0.34 and each cross-transition 0.33]
We can train HMMs for the following
two classes using their training data respectively.
Training set for class 1:
1. ABBCABCAABC
2. ABCABC
3. ABCA ABC
4. BBABCAB
5. BCAABCCAB
6. CACCABCA
7. CABCABCA
8. CABCA
9. CABCA
Training set for class 2:
1. BBBCCBC
2. CCBABB
3. AACCBBB
4. BBABBAC
5. CCAABBAB
6. BBBCCBAA
7. ABBBBABA
8. CCCCC
9. BBAAA
We can then decide which class each of the following testing sequences belongs to:
ABCABCCAB
AABABCCCCBBB
The Markov Chain
P(A, B) = P(B|A) P(A)

P(X1, X2, …, Xn) = P(Xn|X1, X2, …, Xn-1) P(X1, X2, …, Xn-1) = P(X1) ∏_{i=2}^{n} P(Xi|Xi-1, Xi-2, …, X1)

First-order Markov chain:
P(X1, X2, …, Xn) = P(X1) ∏_{i=2}^{n} P(Xi|Xi-1)

 An Observable Markov Model
– The parameters of a Markov chain, with N states labeled by {1,…,N} and the state at time t in the Markov chain denoted as qt, can be described as
  aij = P(qt = j|qt-1 = i), 1≤i,j≤N   (Σ_{j=1}^{N} aij = 1 for all i)
  πi = P(q1 = i), 1≤i≤N   (Σ_{i=1}^{N} πi = 1)
– The output of the process is the set of states at each time instant t, where each state corresponds to an observable event Xi
– There is a one-to-one correspondence between the observable sequence and the Markov chain state sequence (the observation is deterministic!) (Rabiner 1989)
The Markov Chain – Ex 1
 Example 1: a 3-state Markov chain λ
– State 1 generates symbol A only, State 2 generates symbol B only, State 3 generates symbol C only

A = [0.6 0.3 0.1
     0.1 0.7 0.2
     0.3 0.2 0.5]

π = [0.4 0.5 0.1]

[Figure: state diagram; S1 emits A, S2 emits B, S3 emits C, with the transition probabilities given in A]
– Given a sequence of observed symbols O={CABBCABC}, the only corresponding state sequence is Q={S3 S1 S2 S2 S3 S1 S2 S3}, and the corresponding probability is
  P(O|λ) = P(CABBCABC|λ) = P(Q|λ) = P(S3 S1 S2 S2 S3 S1 S2 S3|λ)
  = π(S3) P(S1|S3) P(S2|S1) P(S2|S2) P(S3|S2) P(S1|S3) P(S2|S1) P(S3|S2)
  = 0.1 × 0.3 × 0.3 × 0.7 × 0.2 × 0.3 × 0.3 × 0.2 = 0.00002268
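For concreteness, here is a minimal Python/NumPy sketch (not part of the original slides) of this computation; the symbol-to-state mapping dictionary and the function name markov_sequence_prob are assumptions for illustration.

```python
import numpy as np

# Observable Markov model from Example 1: state Si emits symbol i deterministically.
A = np.array([[0.6, 0.3, 0.1],
              [0.1, 0.7, 0.2],
              [0.3, 0.2, 0.5]])
pi = np.array([0.4, 0.5, 0.1])
state_of = {'A': 0, 'B': 1, 'C': 2}   # symbol -> state index (one-to-one)

def markov_sequence_prob(obs, A, pi):
    """P(O|lambda) for an observable Markov model: pi(q1) * prod_t A[q_{t-1}, q_t]."""
    states = [state_of[o] for o in obs]
    p = pi[states[0]]
    for prev, cur in zip(states[:-1], states[1:]):
        p *= A[prev, cur]
    return p

print(markov_sequence_prob("CABBCABC", A, pi))   # 0.00002268, as computed above
```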
The Markov Chain – Ex 2
 Example 2: a three-state Markov chain for the Dow Jones Industrial average (Huang et al., 2001), with initial state distribution π = [0.5 0.2 0.3]

[Figure: Dow Jones Markov chain (Huang et al., 2001)]

The probability of 5 consecutive up days:
P(5 consecutive up days) = P(S1, S1, S1, S1, S1) = π1 a11 a11 a11 a11 = 0.5 × (0.6)^4 = 0.0648
Extension to Hidden Markov Models
 HMM: an extended version of Observable Markov Model
– The observation is a probabilistic function (discrete or continuous) of a state, instead of being in one-to-one correspondence with a state
– The model is a doubly embedded stochastic process with an
underlying stochastic process that is not directly observable (hidden)
• What is hidden? The State Sequence!
Given the observation sequence, we are not sure which state sequence generated it!
Hidden Markov Models – Ex 1
 Example: a 3-state discrete HMM λ (initial model)

A = [0.6 0.3 0.1
     0.1 0.7 0.2
     0.3 0.2 0.5]

b1(A) = 0.3, b1(B) = 0.2, b1(C) = 0.5
b2(A) = 0.7, b2(B) = 0.1, b2(C) = 0.2
b3(A) = 0.3, b3(B) = 0.6, b3(C) = 0.1

π = [0.4 0.5 0.1]

[Figure: 3-state diagram; S1 emits {A:.3, B:.2, C:.5}, S2 emits {A:.7, B:.1, C:.2}, S3 emits {A:.3, B:.6, C:.1}, with the transition probabilities given in A]
– Given a sequence of observations O={ABC}, there are 27 possible corresponding state sequences, and therefore the probability P(O|λ) is
  P(O|λ) = Σ_{i=1}^{27} P(O, Qi|λ) = Σ_{i=1}^{27} P(O|Qi, λ) P(Qi|λ),  Qi: state sequence
  e.g. when Qi = S2S2S3,
  P(O|Qi, λ) = P(A|S2) P(B|S2) P(C|S3) = 0.7 × 0.1 × 0.1 = 0.007
  P(Qi|λ) = π(S2) P(S2|S2) P(S3|S2) = 0.5 × 0.7 × 0.2 = 0.07
Hidden Markov Models – Ex 2
Given a three-state Hidden Markov Model for the Dow Jones Industrial average as follows (Huang et al., 2001):
How to find the probability P(up, up, up, up, up|λ)?
How to find the optimal state sequence of the model which generates the observation sequence "up, up, up, up, up"?
Elements of an HMM
 An HMM is characterized by the following:
1. N, the number of states in the model
2. M, the number of distinct observation symbols per state
3. The state transition probability distribution A={aij}, where
aij=P[qt+1=j|qt=i], 1≤i,j≤N
4. The observation symbol probability distribution in state j, B={bj(vk)}, where bj(vk)=P[ot=vk|qt=j], 1≤j≤N, 1≤k≤M
5. The initial state distribution π={πi}, where πi=P[q1=i], 1≤i≤N
 For convenience, we usually use a compact notation λ=(A,B,π) to indicate the complete parameter set of an HMM
– Requires specification of two model parameters (N and M)
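As a small illustrative sketch (not from the slides), the complete parameter set λ=(A,B,π), together with N and M, might be bundled as follows; the class and attribute names are assumptions.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class HMM:
    """lambda = (A, B, pi): a discrete HMM with N states and M observation symbols."""
    A:  np.ndarray   # N x N state-transition matrix, rows sum to 1
    B:  np.ndarray   # N x M emission matrix, B[j, k] = b_j(v_k), rows sum to 1
    pi: np.ndarray   # length-N initial state distribution

    @property
    def N(self):     # number of states
        return self.A.shape[0]

    @property
    def M(self):     # number of distinct observation symbols
        return self.B.shape[1]
```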
Two Major Assumptions for HMM
 First-order Markov assumption
– The state transition depends only on the origin and destination states
  P(Q|λ) = P(q1, …, qt, …, qT|λ) = P(q1) ∏_{t=2}^{T} P(qt|qt-1, λ)
– The state transition probability is time invariant
  aij = P(qt+1=j|qt=i), 1≤i, j≤N
 Output-independent assumption
– The observation depends only on the state that generates it, not on its neighboring observations
  P(O|Q, λ) = P(o1, …, ot, …, oT|q1, …, qt, …, qT, λ) = ∏_{t=1}^{T} P(ot|qt, λ) = ∏_{t=1}^{T} bqt(ot)
Three Basic Problems for HMMs
 Given an observation sequence O=(o1, o2, …, oT) and an HMM λ=(A,B,π)
– Problem 1:
  How to efficiently compute P(O|λ)?
   Evaluation problem
– Problem 2:
  How to choose an optimal state sequence Q=(q1, q2, …, qT) which best explains the observations?
   Decoding problem
  Q* = argmax_Q P(Q, O|λ)
– Problem 3:
  How to adjust the model parameters λ=(A,B,π) to maximize P(O|λ)?
   Learning/Training problem
Solution to Problem 1 - Direct Evaluation
Given O and λ, find P(O|λ) = Pr{observing O given λ}
 Evaluating all possible state sequences of length T that generate the observation sequence O
  P(O|λ) = Σ_{all Q} P(O, Q|λ) = Σ_{all Q} P(O|Q, λ) P(Q|λ)
 P(Q|λ): the probability of the path Q
– By the first-order Markov assumption
  P(Q|λ) = P(q1) ∏_{t=2}^{T} P(qt|qt-1, λ) = πq1 aq1q2 aq2q3 … aqT-1qT
 P(O|Q, λ): the joint output probability along the path Q
– By the output-independent assumption
  P(O|Q, λ) = ∏_{t=1}^{T} P(ot|qt, λ) = ∏_{t=1}^{T} bqt(ot)
Solution to Problem 1 - Direct Evaluation (cont.)
[Figure: trellis of states S1–S3 over time t = 1, 2, 3, …, T-1, T with observations o1, o2, o3, …, oT-1, oT; a shaded state node Si means bj(ot) has been computed, and an arc aij means aij has been computed]
Solution to Problem 1 - Direct Evaluation (cont.)
P O   
 PQ  PO Q,  
all Q

  q1 aq1q2 aq2q3 .....aqT 1qT bq1 o1 bq2 o2 .....bqT oT 
all Q

  q1 bq1 o1 aq1q2 bq2 o2 .....aqT 1qT bqT oT 
q1,q2 ,..,qT
– Huge Computation Requirements: O(NT) (NT state sequences)
• Exponential computational complexity
Comple xity
: 2T-1N T MUL  2TN T , N T -1 ADD
 A more efficient algorithm can be used to evaluate PO  
– The Forward Procedure/Algorithm
15
Solution to Problem 1 - The Forward Procedure
 Based on the HMM assumptions, the calculation of P(qt|qt-1, λ) and P(ot|qt, λ) involves only qt-1, qt, and ot, so it is possible to compute the likelihood P(O|λ) with recursion on t
 Forward variable:
  αt(i) = P(o1, o2, …, ot, qt = i|λ)
– The probability of the joint event that o1, o2, …, ot are observed and the state at time t is i, given the model λ
  αt+1(j) = P(o1, o2, …, ot, ot+1, qt+1 = j|λ) = [Σ_{i=1}^{N} αt(i) aij] bj(ot+1)
Solution to Problem 1 - The Forward Procedure
(cont.)
αt+1(j) = P(o1, o2, …, ot, ot+1, qt+1 = j|λ)
        = P(o1, o2, …, ot, ot+1|qt+1 = j, λ) P(qt+1 = j|λ)
        = P(o1, o2, …, ot|qt+1 = j, λ) P(ot+1|qt+1 = j, λ) P(qt+1 = j|λ)    (output-independent assumption)
        = P(o1, o2, …, ot, qt+1 = j|λ) P(ot+1|qt+1 = j, λ)
        = P(o1, o2, …, ot, qt+1 = j|λ) bj(ot+1)
        = [Σ_{i=1}^{N} P(o1, o2, …, ot, qt = i, qt+1 = j|λ)] bj(ot+1)
        = [Σ_{i=1}^{N} P(o1, o2, …, ot, qt = i|λ) P(qt+1 = j|o1, o2, …, ot, qt = i, λ)] bj(ot+1)
        = [Σ_{i=1}^{N} P(o1, o2, …, ot, qt = i|λ) P(qt+1 = j|qt = i, λ)] bj(ot+1)    (first-order Markov assumption)
        = [Σ_{i=1}^{N} αt(i) aij] bj(ot+1)

Side identities used above:
P(A, B|λ) = [P(A, B, λ)/P(B, λ)]·[P(B, λ)/P(λ)] = P(A|B, λ) P(B|λ)
P(A|B, λ) P(B|λ) = P(A, B|λ)
P(ot+1|qt+1 = j, λ) = bj(ot+1)
P(A) = Σ_{all B} P(A, B)
P(A, B|λ) = P(A|λ) P(B|A, λ)
Solution to Problem 1 - The Forward Procedure
(cont.)
α3(2) = P(o1, o2, o3, q3 = 2|λ) = [α2(1)·a12 + α2(2)·a22 + α2(3)·a32]·b2(o3)

[Figure: trellis illustrating the induction step; α2(1), α2(2), and α2(3) at time 2 are combined through a12, a22, and a32 and scaled by b2(o3) to give α3(2); a shaded state node Si means bj(ot) has been computed, and an arc aij means aij has been computed]
Solution to Problem 1 - The Forward Procedure
(cont.)
αt(i) = P(o1 o2 … ot, qt = i|λ)
 Algorithm
1. Initialization: α1(i) = πi bi(o1), 1 ≤ i ≤ N
2. Induction: αt+1(j) = [Σ_{i=1}^{N} αt(i) aij] bj(ot+1), 1 ≤ t ≤ T-1, 1 ≤ j ≤ N
3. Termination: P(O|λ) = Σ_{i=1}^{N} αT(i)
– Complexity: O(N²T)
  MUL: N(N+1)(T-1) + N ≈ N²T; ADD: (N-1)N(T-1) ≈ N²T
 Based on the lattice (trellis) structure
– Computed in a time-synchronous fashion from left to right, where each cell for time t is completely computed before proceeding to time t+1
 All state sequences, regardless of how long previously, merge to N nodes (states) at each time instant t
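A minimal NumPy sketch of the forward procedure above (not from the slides); the function name and the symbol encoding A=0, B=1, C=2 are assumptions, and the Example-1 HMM is reused so the result can be checked against the direct evaluation.

```python
import numpy as np

def forward(obs, A, B, pi):
    """Forward procedure: returns P(O|lambda) and the T x N matrix of alpha_t(i).
    obs is a list of symbol indices; the complexity is O(N^2 T)."""
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                  # 1. initialization: alpha_1(i) = pi_i b_i(o_1)
    for t in range(1, T):                         # 2. induction
        alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]
    return alpha[T-1].sum(), alpha                # 3. termination: sum_i alpha_T(i)

# Check against the brute-force evaluation of P(ABC|lambda) for the Example-1 HMM.
A = np.array([[0.6, 0.3, 0.1], [0.1, 0.7, 0.2], [0.3, 0.2, 0.5]])
B = np.array([[0.3, 0.2, 0.5], [0.7, 0.1, 0.2], [0.3, 0.6, 0.1]])
pi = np.array([0.4, 0.5, 0.1])
p, _ = forward([0, 1, 2], A, B, pi)               # O = ABC
print(p)
```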
Solution to Problem 1 - The Forward Procedure
(cont.)
 A three-state Hidden Markov Model for the Dow Jones Industrial average (Huang et al., 2001)

[Figure: trellis for the first two time steps of the observation sequence "up, up":
α1(1) = π1·b1(up) = 0.5×0.7 = 0.35
α1(2) = π2·b2(up) = 0.2×0.1 = 0.02
α1(3) = π3·b3(up) = 0.3×0.3 = 0.09
α2(1) = (α1(1)·a11 + α1(2)·a21 + α1(3)·a31)·b1(up) = (0.35×0.6 + 0.02×0.5 + 0.09×0.4)×0.7]
Solution to Problem 2 - The Viterbi Algorithm
 The Viterbi algorithm can be regarded as the dynamic
programming algorithm applied to the HMM or as a
modified forward algorithm
– Instead of summing up probabilities from different paths coming
to the same destination state, the Viterbi algorithm picks and
remembers the best path
• Find a single optimal state sequence Q=(q1, q2, …, qT)
– The Viterbi algorithm also can be illustrated in a trellis framework
similar to the one for the forward algorithm
Solution to Problem 2 - The Viterbi Algorithm
(cont.)
[Figure: trellis of states S1–S3 over time t = 1, 2, 3, …, T-1, T with observations o1, o2, o3, …, oT-1, oT, used to illustrate the Viterbi search]
Solution to Problem 2 - The Viterbi Algorithm
(cont.)
1. Initialization:
   δ1(i) = πi bi(o1), 1 ≤ i ≤ N
   ψ1(i) = 0, 1 ≤ i ≤ N
2. Induction:
   δt+1(j) = max_{1≤i≤N} [δt(i) aij]·bj(ot+1), 1 ≤ t ≤ T-1, 1 ≤ j ≤ N
   ψt+1(j) = argmax_{1≤i≤N} [δt(i) aij], 1 ≤ t ≤ T-1, 1 ≤ j ≤ N
3. Termination:
   P*(O|λ) = max_{1≤i≤N} δT(i)
   qT* = argmax_{1≤i≤N} δT(i)
4. Backtracking:
   qt* = ψt+1(qt+1*), t = T-1, T-2, …, 1
   Q* = (q1*, q2*, …, qT*) is the best state sequence

Complexity: O(N²T)
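A NumPy sketch of the Viterbi recursion above (not from the slides); the function name and the usage with the Example-1 HMM are assumptions for illustration.

```python
import numpy as np

def viterbi(obs, A, B, pi):
    """Viterbi algorithm: returns P*(O|lambda) and the best state sequence Q* (0-based indices)."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                  # initialization
    for t in range(1, T):                         # induction
        scores = delta[t-1][:, None] * A          # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    q = np.zeros(T, dtype=int)                    # termination and backtracking
    q[T-1] = delta[T-1].argmax()
    for t in range(T-2, -1, -1):
        q[t] = psi[t+1, q[t+1]]
    return delta[T-1].max(), q

# Best state sequence for O = ABC under the Example-1 HMM (symbols encoded as A=0, B=1, C=2).
A = np.array([[0.6, 0.3, 0.1], [0.1, 0.7, 0.2], [0.3, 0.2, 0.5]])
B = np.array([[0.3, 0.2, 0.5], [0.7, 0.1, 0.2], [0.3, 0.6, 0.1]])
pi = np.array([0.4, 0.5, 0.1])
print(viterbi([0, 1, 2], A, B, pi))
```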
Solution to Problem 2 - The Viterbi Algorithm
(cont.)
 A three-state Hidden Markov Model for the Dow Jones Industrial average (Huang et al., 2001)

[Figure: trellis for the first two time steps of the observation sequence "up, up":
δ1(1) = π1·b1(up) = 0.5×0.7 = 0.35
δ1(2) = π2·b2(up) = 0.2×0.1 = 0.02
δ1(3) = π3·b3(up) = 0.3×0.3 = 0.09
δ2(1) = max(0.35×0.6, 0.02×0.5, 0.09×0.4)×0.7 = 0.35×0.6×0.7 = 0.147
Ψ2(1) = 1]
Solution to Problem 3 –
The Baum-Welch Algorithm
 How to adjust (re-estimate) the model parameters λ=(A,B,π) to maximize P(O|λ)?
– The most difficult one among the three problems, because there
is no known analytical method that maximizes the joint
probability of the training data in a closed form
• The data is incomplete because of the hidden state sequence
– The problem can be solved by the iterative Baum-Welch
algorithm, also known as the forward-backward algorithm
• The EM (Expectation Maximization) algorithm is perfectly suitable
for this problem
Solution to Problem 3 –
The Backward Procedure
 Backward variable:
  βt(i) = P(ot+1, ot+2, …, oT|qt = i, λ)
– The probability of the partial observation sequence ot+1, ot+2, …, oT, given state i at time t and the model λ
– e.g. β2(3) = P(o3, o4, …, oT|q2 = 3, λ)
       = a31·b1(o3)·β3(1) + a32·b2(o3)·β3(2) + a33·b3(o3)·β3(3)

[Figure: trellis illustrating the backward induction step from time 3 back to time 2; the arc a31 and the term b1(o3)·β3(1) are highlighted]
Solution to Problem 3 –
The Backward Procedure (cont.)
βt(i) = P(ot+1, ot+2, …, oT|qt = i, λ)
 Algorithm
1. Initialization: βT(i) = 1, 1 ≤ i ≤ N
2. Induction: βt(i) = Σ_{j=1}^{N} aij bj(ot+1) βt+1(j), 1 ≤ t ≤ T-1, 1 ≤ i ≤ N
– Complexity: MUL: 2N²(T-1) ≈ N²T; ADD: (N-1)N(T-1) ≈ N²T

P(O|λ) = Σ_{i=1}^{N} P(o1, o2, o3, …, oT, q1 = i|λ) = Σ_{i=1}^{N} P(o1, o2, o3, …, oT|q1 = i, λ) P(q1 = i|λ)
       = Σ_{i=1}^{N} P(o2, o3, …, oT|q1 = i, λ) P(o1|q1 = i, λ) P(q1 = i|λ)
       = Σ_{i=1}^{N} β1(i) bi(o1) πi
(cf. P(O|λ) = Σ_{i=1}^{N} αT(i))
Solution to Problem 3 –
The Forward-Backward Algorithm
 Relation between the forward and backward variables
  αt(i) = P(o1 o2 … ot, qt = i|λ)
  αt(i) = [Σ_{j=1}^{N} αt-1(j) aji]·bi(ot)
  βt(i) = P(ot+1 ot+2 … oT|qt = i, λ)
  βt(i) = Σ_{j=1}^{N} aij bj(ot+1) βt+1(j)
  αt(i)·βt(i) = P(O, qt = i|λ)
  P(O|λ) = Σ_{i=1}^{N} αt(i) βt(i)
(Huang et al., 2001)
Solution to Problem 3 –
The Forward-Backward Algorithm (cont.)
αt(i)·βt(i)
= P(o1, o2, …, ot, qt = i|λ) · P(ot+1, ot+2, …, oT|qt = i, λ)
= P(o1, o2, …, ot|qt = i, λ) · P(qt = i|λ) · P(ot+1, ot+2, …, oT|qt = i, λ)
= P(o1, o2, …, oT|qt = i, λ) · P(qt = i|λ)    (the two observation segments are conditionally independent given qt = i, by the output-independent assumption)
= P(o1, o2, …, oT, qt = i|λ)
= P(O, qt = i|λ)

P(O|λ) = Σ_{i=1}^{N} P(O, qt = i|λ) = Σ_{i=1}^{N} αt(i) βt(i)
Solution to Problem 3 – The Intuitive View
  αt(i)·βt(i) = P(O, qt = i|λ),  P(O|λ) = Σ_{i=1}^{N} αt(i) βt(i)
 Define two new variables:
  γt(i) = P(qt = i|O, λ)
– Probability of being in state i at time t, given O and λ
  γt(i) = P(O, qt = i|λ) / P(O|λ) = αt(i) βt(i) / P(O|λ) = αt(i) βt(i) / Σ_{j=1}^{N} αt(j) βt(j)
  ξt(i, j) = P(qt = i, qt+1 = j|O, λ)
– Probability of being in state i at time t and state j at time t+1, given O and λ
  ξt(i, j) = P(qt = i, qt+1 = j, O|λ) / P(O|λ) = αt(i) aij bj(ot+1) βt+1(j) / [Σ_{m=1}^{N} Σ_{n=1}^{N} αt(m) amn bn(ot+1) βt+1(n)]
  γt(i) = Σ_{j=1}^{N} ξt(i, j)
Solution to Problem 3 – The Intuitive View
(cont.)
 P(q3 = 3, O|λ) = α3(3)·β3(3)

[Figure: trellis of states S1–S3 over time t = 1, …, T; the node for state 3 at time 3 is highlighted, with α3(3) accounting for the paths up to time 3 and β3(3) accounting for the observations from time 4 to T]
Solution to Problem 3 – The Intuitive View
(cont.)
 P(q3 = 3, q4 = 1, O|λ) = α3(3)·a31·b1(o4)·β4(1)

[Figure: trellis of states S1–S3 over time t = 1, …, T; the transition from state 3 at time 3 to state 1 at time 4 is highlighted, with α3(3) before the transition and b1(o4)·β4(1) after it]
Solution to Problem 3 – The Intuitive View
(cont.)
 ξt(i, j) = P(qt = i, qt+1 = j|O, λ)
  Σ_{t=1}^{T-1} ξt(i, j) = expected number of transitions from state i to state j in O
 γt(i) = P(qt = i|O, λ)
  Σ_{t=1}^{T-1} γt(i) = expected number of transitions from state i in O
Solution to Problem 3 – The Intuitive View
(cont.)
 The re-estimation formulae for π, A, and B are
  π̄i = expected frequency (number of times) in state i at time (t = 1) = γ1(i)
  āij = (expected number of transitions from state i to state j) / (expected number of transitions from state i)
      = Σ_{t=1}^{T-1} ξt(i, j) / Σ_{t=1}^{T-1} γt(i)
  b̄j(vk) = (expected number of times in state j and observing symbol vk) / (expected number of times in state j)
       = Σ_{t=1, s.t. ot=vk}^{T} γt(j) / Σ_{t=1}^{T} γt(j)