CS 552/652
Speech Recognition with Hidden Markov Models
Winter 2011
Oregon Health & Science University
Center for Spoken Language Understanding
John-Paul Hosom
Lecture 12
February 16
Expectation Maximization, Embedded Training
1
Project 3: Forward-Backward Algorithm
• Given existing data files of speech, implement the forward-backward (EM, Baum-Welch) algorithm to train HMMs.
• “Template” code is available to read in features, write out HMM
values to an output file, provide some context and a starting point.
• The variables “gamma” and “xi” are not defined in the template
code, because the template assumes that you will compute them
in a subroutine. However, feel free to define them and compute
them wherever you want (one possible way to compute them is sketched at the end of this list).
• The features in the speech files are “real speech features,” in that
they are 7 cepstral coefficients plus 7 delta values from utterances
of “yes” and “no” sampled every 10 msec.
• All necessary files (data files and list of files to train on) are in
the project3.zip file on the class web site.
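For the gamma and xi variables mentioned above, a minimal sketch of one way to compute them from the alpha and beta arrays is given below. The array names (alpha, beta, A, b_prob) are assumptions for illustration, not the template's actual variable names, and the sketch ignores the NULL states used in this project.

    import numpy as np

    def compute_gamma_xi(alpha, beta, A, b_prob):
        """Compute gamma[t, i] = P(q_t = i | O, lambda) and
        xi[t, i, j] = P(q_t = i, q_t+1 = j | O, lambda) from alpha and beta.
        alpha, beta, b_prob: (T, N) arrays (b_prob[t, j] = b_j(o_t)); A: (N, N).
        Assumes a plain N-state HMM with no NULL states."""
        T, N = alpha.shape
        p_obs = alpha[T - 1].sum()              # P(O | lambda): sum of final-frame alphas
        gamma = alpha * beta / p_obs            # (T, N)
        xi = np.zeros((T - 1, N, N))
        for t in range(T - 1):
            # xi_t(i, j) = alpha_t(i) * a_ij * b_j(o_t+1) * beta_t+1(j) / P(O | lambda)
            xi[t] = (alpha[t][:, None] * A
                     * (b_prob[t + 1] * beta[t + 1])[None, :]) / p_obs
        return gamma, xi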
2
Project 3: Forward-Backward Algorithm
• Train an HMM on the word “no” using the list “nolist.txt,”
which contains the filenames “no_1.txt,” “no_2.txt,” and “no_3.txt.”
Train another HMM on the word “yes” using the list “yeslist.txt”.
• Train for 10 iterations.
• The HMM should have 7 states, the first and last of which are
“NULL” states.
• You can use the first NULL state to store information about π,
and you can start off assuming that the π value for the first “real”
(non-null) state is 1.0 and that for all other states π is zero.
• You can initialize the non-π transition probabilities (the matrix
A) to a probability of 0.5 for a self-loop and 0.5 for transitioning
to the next state.
3
Project 3: Forward-Backward Algorithm
• You can use any method to get initial HMM parameters; the “flat
start” method is easiest.
• You can use only one mixture component in training, and you
can assume a diagonal covariance matrix.
• Updating of the parameters using the accumulators is currently
set up for accumulating numerators and denominators separately
for aij, means, and covariances. If you want to do the updating
differently (using only one accumulator each for aij, means,
and covariances), feel free to do so.
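As a rough illustration of that accumulator scheme (the function and variable names below are made up, not taken from the template), the numerators and denominators for the Gaussian means and diagonal covariances might be accumulated file by file and divided once at the end:

    import numpy as np

    def accumulate_gaussian_stats(gamma, obs, mean_num, var_num, denom):
        """Add one training file's contribution to the accumulators.
        gamma: (T, N) occupancy probabilities; obs: (T, D) feature vectors;
        mean_num, var_num: (N, D) numerators; denom: (N,) denominators."""
        mean_num += gamma.T @ obs               # sum over t of gamma_t(j) * o_t
        var_num += gamma.T @ (obs * obs)        # sum over t of gamma_t(j) * o_t^2
        denom += gamma.sum(axis=0)              # sum over t of gamma_t(j)

    def finish_gaussian_update(mean_num, var_num, denom):
        """One division per parameter, after all files have been accumulated."""
        mean = mean_num / denom[:, None]
        var = var_num / denom[:, None] - mean ** 2   # diagonal covariance: E[x^2] - (E[x])^2
        return mean, var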
4
Project 3: Forward-Backward Algorithm
• Submit your results for the 10th iteration of training on the
words “yes” and “no”.
• Send your source code and results (the files “hmm_no.10”
and “hmm_yes.10” that you created) to hosom at cslu.ogi.edu;
late responses are generally not accepted.
• Computing variance: Don't forget that you can compute the variance
by making one pass over all values and then computing it from the
sum, count, and sum of squares using the formula
$\sigma^2 = \dfrac{\sum x^2 - \dfrac{\left(\sum x\right)^2}{N}}{N - 1}$
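A minimal sketch of that one-pass computation in plain Python (not part of the template):

    def one_pass_variance(values):
        """Variance from the sum, count, and sum of squares, in a single pass."""
        total = 0.0
        total_sq = 0.0
        count = 0
        for x in values:
            total += x
            total_sq += x * x
            count += 1
        return (total_sq - (total * total) / count) / (count - 1)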
5
Project 3: Forward-Backward Algorithm
Sanity checks:
1. First, your output should be close to the HMM file for the word
“no” that you used in the Viterbi project. (Results may not be
exactly the same, depending on different assumptions made.)
2. Second, you can compare alpha and beta values, as discussed in
class, to make sure that they are equal in certain cases (see the sketch after this list).
3. When you train on only one file, the probability of the
observation sequence given the model should increase with
each iteration. (This should also be true when you train on
all files, but training file-by-file can help with debugging.)
4. Re-arranging the order of the training files should have no
impact on the final results.
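For sanity check 2, one way to code the test: the product of alpha and beta, summed over all states, should give the same value, P(O | λ), at every frame. The sketch below assumes (T, N) arrays indexed [t][j] and a plain HMM without NULL states; the names are illustrative only.

    import numpy as np

    def check_alpha_beta(alpha, beta, rel_tol=1e-6):
        """Sanity check: sum over j of alpha_t(j) * beta_t(j) should equal
        P(O | lambda) at every frame t (plain HMM, no NULL states)."""
        per_frame = (alpha * beta).sum(axis=1)   # one value per frame
        p_obs = per_frame[-1]                    # at the final frame beta = 1, so this is P(O | lambda)
        return np.allclose(per_frame, p_obs, rtol=rel_tol)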
6
Expectation-Maximization*
• We want to compute “good” parameters for an HMM so that
when we evaluate it on different utterances, recognition results
are accurate.
• How do we define or measure “good”?
• Important variables are the HMM model λ, the observations O,
where O = {o₁, o₂, … o_T}, and the state sequence S (instead of Q).
• The probability density function p(o_t | λ) is used to compute the
probability of an observation given the entire model (NOT the same
as b_j(o_t)); p(O | λ) is the probability of an observation sequence
given the model: $p(O \mid \lambda) = \sum_{i=1}^{N} \alpha_i(T)$.
• We assume a 1:1 correspondence between p.d.f. and probabilities
*These lecture notes are based on:
• Bilmes, J. A., “A Gentle Tutorial of the EM Algorithm and Its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models,” ICSI Tech. Report TR-97-021, 1998.
• Zhai, C. X., “A Note on the Expectation-Maximization (EM) Algorithm,” CS397-CXZ Introduction to Text Information Systems, University of Illinois at Urbana-Champaign, 2003.
7
Expectation-Maximization: Likelihood Functions, “Best” Model
• Let’s assume, as usual, that the data vectors ot are independent.
• Define the likelihood of a model λ given a set of observations O:
$\mathcal{L}(\lambda \mid O) = p(O \mid \lambda) = \prod_{t=1}^{T} p(o_t \mid \lambda)$   [1]
• L(λ | O) is the likelihood function. It is a function of the
model λ, given a fixed set of data O. If, for two models λ₁ and
λ₂, the probability p(O | λ₁) is larger than the probability p(O | λ₂),
then λ₁ provides a better fit to the data than λ₂, and we
consider λ₁ to be a “better” model than λ₂ for the data O.
In this case, also, L(λ₁ | O) > L(λ₂ | O), and so we can measure
the relative goodness of a model by computing its likelihood.
• So, to find the “best” model parameters, we want to find the
λ that maximizes the likelihood function:
$\hat{\lambda} = \operatorname*{arg\,max}_{\lambda} \mathcal{L}(\lambda \mid O)$   [2]
8
Expectation-Maximization: Maximizing the Likelihood
• This is the “maximum likelihood” approach to obtaining
parameters of a model (training).
• It is sometimes easier to maximize the log likelihood,
log(L(λ | O)). This will be true in our case.
• In some cases (e.g. where the data have the distribution of a
single Gaussian), a solution can be obtained directly.
• In our case, p(ot | λ) is a complicated distribution (depending on
several mixtures of Gaussians and an unknown state sequence),
and a more complicated solution is used… namely the iterative
approach of the Expectation-Maximization (EM) algorithm.
• EM is more of a (general) process than a (specific) algorithm;
the Baum-Welch algorithm (also called the forward-backward
algorithm) is a specific implementation of EM.
9
Expectation-Maximization: Incorporating Hidden Data
• Before talking about EM in more detail, we should specifically
mention the “hidden” data…
• Instead of just O, the observed data, and a model λ, we also
have “hidden” data, the state sequence S. S is “hidden” because
we can never know the “true” state sequence that generated
a set of observations; we can only compute the most probable
state sequence (using Viterbi).
• Let’s call the set of complete data (both the observations and
the state sequence) Z, where Z = (O, S).
• The state sequence S is unknown, but can be expressed as a
random variable dependent on the observed data and the model.
Again, we can compute the most probable S (using Viterbi),
but the true S is unknown… we can think of different possible
state sequences having different probabilities, given the
observed data and model.
10
Expectation-Maximization: Incorporating Hidden Data
• Specify a joint-density function
$p(Z \mid \lambda) = p(O, S \mid \lambda) = p(S \mid O, \lambda)\, p(O \mid \lambda)$   [3]
(the last term comes from the multiplication rule)
• The complete-data likelihood function is then
$\mathcal{L}(\lambda \mid Z) = \mathcal{L}(\lambda \mid O, S) = p(O, S \mid \lambda)$   [4]
• Our goal is then to maximize the expected value of the
log-likelihood of this complete likelihood function, and
determine the model that yields this maximum likelihood:
$\hat{\lambda} = \operatorname*{arg\,max}_{\lambda} E[\log(\mathcal{L}(\lambda \mid Z))] = \operatorname*{arg\,max}_{\lambda} E[\log p(O, S \mid \lambda)]$   [5]
• We compute the expected value because the true value
can never be known, since S is hidden. We only know
probabilities of different state sequences.
11
Expectation-Maximization: Incorporating Hidden Data
• What is the expected value of a function when the p.d.f. of the
random variable depends on other variable(s)?
• Expected value of a random variable Y:
$E[Y] = \int_{-\infty}^{\infty} y\, f_Y(y)\, dy$, where $f_Y(y)$ is the p.d.f. of Y   [6]
(as specified on slide 6 of Lecture 3)
• Expected value of a function h(Y) of the random variable Y:
$E[h(Y)] = \overline{h(Y)} = \int_{-\infty}^{\infty} h(y)\, f_Y(y)\, dy$   [7]
• If the probability density function of Y, fY(y), depends on some
random variable X, then:
$E[h(Y) \mid X = x] = \int_{-\infty}^{\infty} h(y)\, f_{Y \mid X}(y \mid x)\, dy$   [8]
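Since the hidden data in our case, the state sequence, is discrete, the form of [8] that is actually used later (in [12] and [15]) replaces the integral with a sum over the discrete values; this restatement is not on the slide, but it follows directly from [8]:

$E[h(Y) \mid X = x] = \sum_{y} h(y)\, p_{Y \mid X}(y \mid x)$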
12
Expectation-Maximization: Overview of EM
• First step in EM:
Compute the expected value of the complete-data log-likelihood,
log(L(λ | O, S)) = log p(O, S | λ), with respect to the hidden data S
(so we'll integrate over the space of state sequences S), given the
observed data O and the previous best model λ^(i−1).
• Let's review the meaning of all these variables:
• λ is some model for which we want to evaluate the likelihood.
• O is the observed data (O is known and constant).
• i is the index of the current iteration, i = 1, 2, 3, …
• λ^(i−1) is the set of parameters of the model from the previous
iteration i−1 (for i = 1, λ^(i−1) is the set of initial model values);
λ^(i−1) is known and constant.
• S is a random variable dependent on O and λ^(i−1), with p.d.f.
p(s | O, λ^(i−1))
13
Expectation-Maximization: Overview of EM
• First step in EM:
Compute the expected value of the complete-data log-likelihood,
log(L(λ | O, S)) = log p(O, S | λ), with respect to the hidden data S
(so we'll integrate over the space of state sequences S), given the
observed data O and the previous best model λ^(i−1).
• The function of this expected value is called Q(λ, λ^(i−1)):
$Q(\lambda, \lambda^{(i-1)}) = E[\log p(O, S \mid \lambda) \mid O = \{o_1, o_2, \ldots, o_T\},\ \lambda^{(i-1)}]$   [9]
• O and λ^(i−1) are constant, but we can compute E[p(O, S | λ)]
for any λ. (We can't compute the exact probability, because S is unknown.)
[Figure: sketch of E[p(O, S | λ)], i.e. Q(λ, λ^(i−1)), plotted as a function of the model λ parameters.]
14
Expectation-Maximization: Overview of EM
• Second step in EM:
Find the parameters λ that maximize the value of Q(λ, λ^(i−1)).
These parameters become the i-th value of λ, to be used in the
next iteration:
$\lambda^{(i)} = \operatorname*{arg\,max}_{\lambda} Q(\lambda, \lambda^{(i-1)})$   [10]
• In practice, the expectation and maximization steps are
performed simultaneously.
• Repeat this expectation-maximization, increasing the value of i
at each iteration, until Q(λ, λ^(i−1)) doesn't change (or the change is
below some threshold).
• It is guaranteed that with each iteration, the likelihood of λ will
increase or stay the same. (The reasoning for this will follow
later in this lecture.)
15
Expectation-Maximization: EM Step 1
• So, for the first step, we want to compute
$Q(\lambda, \lambda^{(i-1)}) = E[\log p(O, S \mid \lambda) \mid O, \lambda^{(i-1)}]$   [11]
which we can combine with equation 8,
$E[h(Y) \mid X = x] = \int_{-\infty}^{\infty} h(y)\, f_{Y \mid X}(y \mid x)\, dy$   [8]
to get the expected value with respect to the unknown data S:
$E[\log p(O, S \mid \lambda) \mid O, \lambda^{(i-1)}] = \int_{s \in \mathcal{S}} \log p(O, s \mid \lambda)\ p(s \mid O, \lambda^{(i-1)})\, ds$   [12]
where $\mathcal{S}$ is the space of values (state sequences) that s can have.
16
Expectation-Maximization: EM Step 1
• Problem:
We don't easily know p(s | O, λ^(i−1))
• But, from the multiplication rule,
$p(s \mid O, \lambda^{(i-1)}) = \dfrac{p(s, O \mid \lambda^{(i-1)})}{p(O \mid \lambda^{(i-1)})}$   [13]
• We do know how to compute p(s, O | λ^(i−1))
• p(O | λ^(i−1)) is constant for a given λ^(i−1), and so this term
has no effect on maximizing the expected value of L(λ | Z)
• So, we can replace p(s | O, λ^(i−1)) with p(s, O | λ^(i−1))
and not affect results.
17
Expectation-Maximization: EM Step 1
• The Q function will therefore be implemented as
$Q(\lambda, \lambda^{(i-1)}) = \int_{s \in \mathcal{S}} \log p(O, s \mid \lambda)\ p(s, O \mid \lambda^{(i-1)})\, ds$   [14]
• Since the state sequence is discrete, not continuous, this can
be represented as
$Q(\lambda, \lambda^{(i-1)}) = \sum_{s \in \mathcal{S}} \log p(O, s \mid \lambda)\ p(O, s \mid \lambda^{(i-1)})$   [15]
• Given a specific state sequence s = {q₁, q₂, … q_T},
$p(O, s \mid \lambda) = \pi_{q_1} \left[ \prod_{t=1}^{T-1} b_{q_t}(o_t)\, a_{q_t q_{t+1}} \right] b_{q_T}(o_T)$   [16]
$= \pi_{q_1} \prod_{t=1}^{T-1} a_{q_t q_{t+1}} \prod_{t=1}^{T} b_{q_t}(o_t)$   [17]
18
Expectation-Maximization: EM Step 1
• Then the Q function is represented as:
$Q(\lambda, \lambda^{(i-1)}) = \sum_{s \in \mathcal{S}} \log p(O, s \mid \lambda)\ p(O, s \mid \lambda^{(i-1)})$   [18 = 15]
$= \sum_{s \in \mathcal{S}} \log\!\left( \pi_{q_1} \prod_{t=1}^{T-1} a_{q_t q_{t+1}} \prod_{t=1}^{T} b_{q_t}(o_t) \right) p(s, O \mid \lambda^{(i-1)})$   [19]
$= \sum_{s \in \mathcal{S}} \left( \log \pi_{q_1} + \sum_{t=1}^{T-1} \log a_{q_t q_{t+1}} + \sum_{t=1}^{T} \log b_{q_t}(o_t) \right) p(s, O \mid \lambda^{(i-1)})$   [20]
$= \sum_{s \in \mathcal{S}} \log \pi_{q_1}\ p(s, O \mid \lambda^{(i-1)}) + \sum_{s \in \mathcal{S}} \left( \sum_{t=1}^{T-1} \log a_{q_t q_{t+1}} \right) p(s, O \mid \lambda^{(i-1)}) + \sum_{s \in \mathcal{S}} \left( \sum_{t=1}^{T} \log b_{q_t}(o_t) \right) p(s, O \mid \lambda^{(i-1)})$   [21]
19
Expectation-Maximization: EM Step 2
• If we optimize by finding the parameters at which the derivative
of the Q function is zero, we don't have to actually search over all
possible λ to compute $\lambda^{(i)} = \operatorname*{arg\,max}_{\lambda} Q(\lambda, \lambda^{(i-1)})$.
• We can optimize each part independently, since the three
parameters to be optimized are in three separate terms.
We will consider each term separately.
• First term to optimize:
$\sum_{s \in \mathcal{S}} \log \pi_{q_1}\ p(s, O \mid \lambda^{(i-1)})$   [22]
$= \sum_{i=1}^{N} \log \pi_i\ p(q_1 = i, O \mid \lambda^{(i-1)})$   [23]
because states other than q₁ have a constant effect and so can
be omitted (e.g. $P(X) = \sum_{y \in Y} P(X, y)$)
20
Expectation-Maximization: EM Step 2
• We have the additional constraint that all π values sum to 1.0, so
we use a Lagrange multiplier, written here as η (the usual symbol for the
Lagrange multiplier, λ, is taken), then find the maximum by
setting the derivative to 0:
$\frac{\partial}{\partial \pi_i} \left[ \sum_{i=1}^{N} \log \pi_i\ p(q_1 = i, O \mid \lambda^{(i-1)}) + \eta \left( \sum_{i=1}^{N} \pi_i - 1 \right) \right] = 0$   [24]
• Solution (lots of math left out):
$\pi_i = \dfrac{p(q_1 = i, O \mid \lambda^{(i-1)})}{p(O \mid \lambda^{(i-1)})}$   [25]
• Which equals γ₁(i)
• Which is the same update formula for π we saw earlier
(Lecture 11, slide 19)
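The step from [25] to γ₁(i) relies on an identity from Lecture 11 that is not repeated here: p(q₁ = i, O | λ^(i−1)) is the product of the forward and backward probabilities for state i at t = 1, so dividing by p(O | λ^(i−1)) gives exactly the occupancy probability:

$\pi_i = \dfrac{p(q_1 = i, O \mid \lambda^{(i-1)})}{p(O \mid \lambda^{(i-1)})} = \gamma_1(i)$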
21
Expectation-Maximization: EM Step 2
• Second term to optimize:
$\sum_{s \in \mathcal{S}} \left( \sum_{t=1}^{T-1} \log a_{q_t q_{t+1}} \right) p(s, O \mid \lambda^{(i-1)})$   [26]
• We (again) have an additional constraint, namely $\sum_{j=1}^{N} a_{ij} = 1$,
so we again use a Lagrange multiplier, then find the maximum by
setting the derivative to 0.
• Solution (lots of math left out):
$a_{ij} = \dfrac{\sum_{t=1}^{T-1} p(q_t = i, q_{t+1} = j, O \mid \lambda^{(i-1)})}{\sum_{t=1}^{T-1} p(q_t = i, O \mid \lambda^{(i-1)})}$   [27]
• Which is equivalent to the update formula in Lecture 11, slide 20.
22
Expectation-Maximization: EM Step 2
• Third term to optimize:
$\sum_{s \in \mathcal{S}} \left( \sum_{t=1}^{T} \log b_{q_t}(o_t) \right) p(s, O \mid \lambda^{(i-1)})$   [28]
• Which has the constraint, in the discrete-HMM case, of
$\sum_{p=1}^{M} b_j(e_p) = 1$
where there are M discrete events e₁ … e_M generated by the HMM.
• After lots of math, the result is:
$b_j(k) = \dfrac{\sum_{t=1,\ \text{s.t. } o_t = e_k}^{T} p(q_t = j, O \mid \lambda^{(i-1)})}{\sum_{t=1}^{T} p(q_t = j, O \mid \lambda^{(i-1)})}$   [29]
• Which is equivalent to the update formula in Lecture 11, slide 20.
23
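As a rough sketch of how update formulas [25], [27], and [29] look in code for the discrete-HMM case (the project itself uses Gaussian outputs, but the discrete case shows the mapping most directly): since p(q_t = i, O | λ^(i−1)) = γ_t(i) · p(O | λ^(i−1)), the p(O | λ^(i−1)) factors cancel in each ratio, and the updates can be written in terms of the γ and ξ quantities. Variable names are illustrative, not from the course template.

    import numpy as np

    def reestimate_discrete(gamma, xi, obs_symbols, M):
        """Updates [25], [27], [29] for a discrete HMM.
        gamma: (T, N); xi: (T-1, N, N); obs_symbols: length-T list of indices 0..M-1."""
        T, N = gamma.shape
        pi = gamma[0]                                            # [25]: pi_i = gamma_1(i)
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]     # [27]
        B = np.zeros((N, M))
        for t in range(T):
            B[:, obs_symbols[t]] += gamma[t]                     # numerator of [29]: frames where o_t = e_k
        B /= gamma.sum(axis=0)[:, None]                          # denominator of [29]
        return pi, A, B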
Expectation-Maximization: Increasing Likelihood?
• By solving for the point at which the derivative is zero, these
solutions find the point at which the Q function (the expected
log-likelihood of the model λ given the complete data, O and S)
is at a local maximum, based on the prior model λ^(i−1).
• We are maximizing the Q function at each iteration.
Is that the same as maximizing the (log) likelihood of the
model λ given only the data O?
• Consider the log-likelihood of a model based on the complete
data set, Llog(λ | O, S), vs. the log-likelihood based on only the
observed data O, Llog(λ | O), where Llog = log(L):
$L_{\log}(\lambda \mid O, S) = \log p(O, S \mid \lambda) = \log p(O \mid \lambda) + \log p(S \mid O, \lambda) = L_{\log}(\lambda \mid O) + \log p(S \mid O, \lambda)$   [30]
$L_{\log}(\lambda \mid O) = L_{\log}(\lambda \mid O, S) - \log p(S \mid O, \lambda)$   [31]
24
Expectation-Maximization: Increasing Likelihood?
• Now consider the difference between a new and an old likelihood
of the observed data, as a function of the complete data:
$L_{\log}(\lambda \mid O) - L_{\log}(\lambda^{(i-1)} \mid O) = \left[ L_{\log}(\lambda \mid O, S) - \log p(S \mid O, \lambda) \right] - \left[ L_{\log}(\lambda^{(i-1)} \mid O, S) - \log p(S \mid O, \lambda^{(i-1)}) \right]$   [32]
$L_{\log}(\lambda \mid O) - L_{\log}(\lambda^{(i-1)} \mid O) = L_{\log}(\lambda \mid O, S) - L_{\log}(\lambda^{(i-1)} \mid O, S) + \log \dfrac{p(S \mid O, \lambda^{(i-1)})}{p(S \mid O, \lambda)}$   [33]
• If we take the expectation of this difference in log-likelihood
with respect to the hidden state sequence S, given the observations
O and the model λ^(i−1), then we get…
(next slide)
25
Expectation-Maximization: Increasing Likelihood?
$L_{\log}(\lambda \mid O) - L_{\log}(\lambda^{(i-1)} \mid O) = \sum_{s \in \mathcal{S}} L_{\log}(\lambda \mid O, s)\, p(s \mid O, \lambda^{(i-1)}) - \sum_{s \in \mathcal{S}} L_{\log}(\lambda^{(i-1)} \mid O, s)\, p(s \mid O, \lambda^{(i-1)}) + \sum_{s \in \mathcal{S}} p(s \mid O, \lambda^{(i-1)}) \log \dfrac{p(s \mid O, \lambda^{(i-1)})}{p(s \mid O, \lambda)}$   [34]
• The left-hand side doesn't change because it's not a function of S:
$\int_{x=-\infty}^{\infty} Y p(x)\, dx = Y \int_{x=-\infty}^{\infty} p(x)\, dx$   [35]
if p(x) is a probability density function, then $\int_{x=-\infty}^{\infty} p(x)\, dx = 1$,
so
$\int_{x=-\infty}^{\infty} Y p(x)\, dx = Y$   [36]
26
Expectation-Maximization: Increasing Likelihood?
• The third term is the Kullback-Leibler distance:
$\sum_i P(z_i) \log \dfrac{P(z_i)}{Q(z_i)} \geq 0$   [37]
where P(zᵢ) and Q(zᵢ) are probability density functions
(the proof involves the inequality log(x) ≤ x − 1)
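The one-line argument behind [37], using log(x) ≤ x − 1 (this expansion is not written out on the slide):

$-\sum_i P(z_i)\log\dfrac{P(z_i)}{Q(z_i)} = \sum_i P(z_i)\log\dfrac{Q(z_i)}{P(z_i)} \leq \sum_i P(z_i)\left(\dfrac{Q(z_i)}{P(z_i)} - 1\right) = \sum_i Q(z_i) - \sum_i P(z_i) = 0$

so the sum in [37] is greater than or equal to zero, with equality only when P = Q.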
• So, we have
$L_{\log}(\lambda \mid O) - L_{\log}(\lambda^{(i-1)} \mid O) \geq \sum_{s \in \mathcal{S}} L_{\log}(\lambda \mid O, s)\, p(s \mid O, \lambda^{(i-1)}) - \sum_{s \in \mathcal{S}} L_{\log}(\lambda^{(i-1)} \mid O, s)\, p(s \mid O, \lambda^{(i-1)})$   [38]
which is the same as
$L_{\log}(\lambda \mid O) \geq \sum_{s \in \mathcal{S}} L_{\log}(\lambda \mid O, s)\, p(s \mid O, \lambda^{(i-1)}) + L_{\log}(\lambda^{(i-1)} \mid O) - \sum_{s \in \mathcal{S}} L_{\log}(\lambda^{(i-1)} \mid O, s)\, p(s \mid O, \lambda^{(i-1)})$   [39]
27
Expectation-Maximization: Increasing Likelihood?
• The right-hand side of this equation [39] is a lower bound on
the likelihood function Llog(λ | O).
• By combining [12], [4], and [15] we can write Q as
$Q(\lambda, \lambda^{(i-1)}) = \sum_{s \in \mathcal{S}} L_{\log}(\lambda \mid O, s)\, p(s \mid O, \lambda^{(i-1)})$   [40]
• So, we can re-write Llog(λ | O) as
$L_{\log}(\lambda \mid O) \geq Q(\lambda, \lambda^{(i-1)}) + L_{\log}(\lambda^{(i-1)} \mid O) - Q(\lambda^{(i-1)}, \lambda^{(i-1)})$   [41]
• Since we have maximized the Q function for model λ, the term
Q(λ, λ^(i−1)) is a maximum and Q(λ^(i−1), λ^(i−1)) is not greater than
that maximum, and therefore
$Q(\lambda, \lambda^{(i-1)}) - Q(\lambda^{(i-1)}, \lambda^{(i-1)}) \geq 0$   [42]
$L_{\log}(\lambda \mid O) \geq L_{\log}(\lambda^{(i-1)} \mid O)$   [43]
28
Expectation-Maximization: Increasing Likelihood?
• Therefore, by maximizing the Q function, the log-likelihood of
the model λ given the observations O does increase (or stay the
same) with each iteration.
• More work is needed to show the solutions for the re-estimation
formulae for ĉ, μ̂, and Σ̂ in the case where b_j(o_t) is computed from
a Gaussian Mixture Model.
29
Expectation-Maximization: Forward-Backward Algorithm
• Because we directly compute the model parameters that maximize
the Q function, we don't need to iterate in the
Maximization step, and so we can perform both Expectation and
Maximization for one model λ^(i) simultaneously.
• The algorithm is then as follows:
(1) get initial model λ^(0)
(2) for i = 1 to R:
(2a) use the re-estimation formulae to compute the parameters
of λ^(i) (based on model λ^(i−1))
(2b) if λ^(i) = λ^(i−1), then stop
where R is the maximum number of iterations.
• This is called the forward-backward algorithm because the
re-estimation formulae use the variables α (which computes
probabilities going forward in time) and β (which computes
probabilities going backward in time).
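A minimal sketch of the loop in step (2), assuming hypothetical helper functions forward_backward() and reestimate() (not the template's function names), and with the stopping test of (2b) relaxed to a small threshold on the change in log-likelihood, as suggested on slide 15:

    def train(model, training_files, max_iterations, threshold=1e-4):
        """EM / forward-backward training loop from step (2)."""
        prev_ll = None
        for i in range(1, max_iterations + 1):                    # (2): i = 1 .. R
            accumulators, log_likelihood = forward_backward(model, training_files)  # Expectation
            model = reestimate(accumulators)                      # Maximization: re-estimation formulae (2a)
            if prev_ll is not None and abs(log_likelihood - prev_ll) < threshold:
                break                                             # (2b): model has stopped changing
            prev_ll = log_likelihood
        return model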
30
Expectation-Maximization: Forward-Backward Illustration
• Forward-Backward Algorithm, Iteration 1:
[Figure: after iteration 1, every self-loop and next-state transition probability a_ij is still at its initial value of 0.5; the figure also plots the observations o_t, the per-state output distributions b_j(o_t) with their means μ and variances σ², and the state-occupancy probabilities P(q_t = j | O, λ) = γ_t(j) computed from α and β.]
31
Expectation-Maximization: Forward-Backward Illustration
• Forward-Backward Algorithm, Iteration 2:
[Figure: after iteration 2, the estimated self-loop / next-state transition probability pairs are roughly 0.94/0.06, 0.5/0.5, 0.5/0.5, 0.91/0.09, and 0.92/0.08; the figure again plots o_t, b_j(o_t) (μ, σ²), and P(q_t = j | O, λ) = γ_t(j).]
32
Expectation-Maximization: Forward-Backward Illustration
• Forward-Backward Algorithm, Iteration 3:
[Figure: after iteration 3, the estimated self-loop / next-state transition probability pairs are roughly 0.93/0.07, 0.53/0.47, 0.75/0.25, 0.88/0.12, and 0.93/0.07; the figure again plots o_t, b_j(o_t) (μ, σ²), and P(q_t = j | O, λ) = γ_t(j).]
33
Expectation-Maximization: Forward-Backward Illustration
• Forward-Backward Algorithm, Iteration 4:
[Figure: after iteration 4, the estimated self-loop / next-state transition probability pairs are roughly 0.91/0.08, 0.58/0.42, 0.88/0.12, 0.85/0.15, and 0.93/0.07; the figure again plots o_t, b_j(o_t) (μ, σ²), and P(q_t = j | O, λ) = γ_t(j).]
34
Expectation-Maximization: Forward-Backward Illustration
• Forward-Backward Algorithm, Iteration 10:
[Figure: after iteration 10, the estimated self-loop / next-state transition probability pairs are roughly 0.89/0.11, 0.85/0.15, 0.87/0.13, 0.78/0.22, and 0.94/0.06; the figure again plots o_t, b_j(o_t) (μ, σ²), and P(q_t = j | O, λ) = γ_t(j).]
35
Expectation-Maximization: Forward-Backward Illustration
• Forward-Backward Algorithm, Iteration 20:
[Figure: after iteration 20, the estimated self-loop / next-state transition probability pairs are roughly 0.89/0.11, 0.84/0.16, 0.87/0.13, 0.73/0.27, and 0.94/0.06, showing that the model has nearly converged; the figure again plots o_t, b_j(o_t) (μ, σ²), and P(q_t = j | O, λ) = γ_t(j).]
36
Embedded Training
• Typically, when training a medium- to large-vocabulary
system, each phoneme has its own HMM; these phoneme-level
HMMs are then concatenated into word-level HMMs
to form the words in the vocabulary.
• Typically, forward-backward training is used to train the
phoneme-level HMMs, and uses a database in which the
phonemes have been time-aligned (e.g. TIMIT) so that each
phoneme can be trained separately.
• The phoneme-level HMMs have been trained to maximize
the likelihood of these phoneme models, and so the word-level
HMMs created from these phoneme-level HMMs can
then be used to recognize words.
• In addition, we can train on sentences (word sequences) in
our training corpus using a method called embedded
training.
37
Embedded Training
• Initial forward-backward procedure trains on each phoneme
individually:
[Figure: three separate 3-state phoneme HMMs, one per phoneme, with states y1–y3, E1–E3, and s1–s3.]
• Embedded training concatenates all phonemes in a sentence
into one sentence-level HMM, then performs forward-backward
training on the entire sentence:
[Figure: the same states concatenated into a single sentence-level HMM: y1 y2 y3 E1 E2 E3 s1 s2 s3.]
38
Embedded Training
• Example: Perform embedded training on a sentence from the
Resource-Management (RM) corpus:
“Show all alerts.”
• First, generate phoneme-level pronunciations for each word
• Second, take existing phoneme-level HMMs and concatenate
them into one sentence-level HMM.
• Third, perform forward-backward training on this sentence-level HMM.
[Figure: the sentence “Show all alerts” expanded into its phoneme-level pronunciation (SHOW → SH OW, ALL → AA L, ALERTS → AX L ER TS), with each phoneme's HMM states concatenated into one sentence-level HMM.]
39
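As a rough sketch of the second step, concatenating existing phoneme-level HMMs into one sentence-level HMM: the data structures and names below are invented for illustration (they are not the course's HMM file format), and NULL or entry/exit states are ignored.

    import numpy as np

    def concatenate_hmms(phoneme_hmms):
        """Chain a list of left-to-right phoneme HMMs into one utterance-level HMM.
        Each element is assumed to have: A (n x n transition matrix) and
        states (a list of per-state output-distribution parameters)."""
        states = [s for hmm in phoneme_hmms for s in hmm.states]
        sizes = [hmm.A.shape[0] for hmm in phoneme_hmms]
        A = np.zeros((sum(sizes), sum(sizes)))
        offset = 0
        for k, hmm in enumerate(phoneme_hmms):
            n = sizes[k]
            A[offset:offset + n, offset:offset + n] = hmm.A        # within-phoneme transitions
            if k + 1 < len(phoneme_hmms):
                # route the last state's exit probability into the next phoneme's first state
                A[offset + n - 1, offset + n] = 1.0 - hmm.A[n - 1, n - 1]
            offset += n
        return states, A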
Embedded Training
• Why do embedded training?
(1) Better learning of the acoustic characteristics of specific words.
(The acoustics of /r/ in “true” and “not rue” are somewhat
different, even though the phonetic context is the same.)
(2) Given initial phoneme-level HMMs trained using forward-backward,
we can perform embedded training on a much
larger corpus of target speech using only the word-level
transcription and a pronunciation dictionary. The resulting
HMMs are then (a) trained on more data and (b) tuned to
the specific words in the target corpus.
Caution: Words spoken in sentences can have pronunciations that
differ from the pronunciation obtained from a dictionary.
(Word pronunciation can be context-dependent or speaker-dependent.)
40