HMM, Chapter 15 - Southern Oregon University

Synthesis Generations
• First Generation Evaluation
– Perfect speech could be generated
– Required perfect setting of the parameters
– Human intervention put upper limits on the achievable quality
• Second Generation
– Store pre-recorded waveforms for concatenation
– Cannot store enough data to concatenate all that we want
– Only allows pitch and timing changes
• Third Generation
– Introduce statistical models to learn the data’s properties
– Allows the possibility to modify the output in many ways
Dynamic Programming
• Definition: A technique that evaluates a recursive computation iteratively, storing sub-results in arrays.
• Description:
– Start with the base case, which initializes the arrays
– Each step of the algorithm fills in table entries
– Later steps access table entries filled-in by earlier steps
• Advantages:
– Avoids the repeated calculations that naive recursion performs (see the sketch after this list)
– Uses loops without the overhead of creating activation records
• Applications
– Many applications beyond signal processing
– Dynamic Time Warping: How close are two sequences
– Hidden Markov Model Algorithms
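
A minimal sketch of the pattern just described (our illustration, not from the slides): computing Fibonacci numbers by filling a table instead of recursing, so each value is computed exactly once.

def fib(n: int) -> int:
    table = [0] * (n + 1)          # base case initializes the array
    if n > 0:
        table[1] = 1
    for i in range(2, n + 1):      # each step fills in a table entry
        table[i] = table[i - 1] + table[i - 2]   # reads earlier entries
    return table[n]

print(fib(10))   # 55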
Example: Minimum Edit Distance
A useful dynamic programming algorithm
• Problem: How can we measure how different one word is
from another word (i.e., for a spell checker)?
– How many operations will transform one word into another?
– Examples: caat --> cat, fplc --> fireplace
• Definition:
– Levenshtein distance: smallest number of insertion, deletion,
or substitution operations to transform one string into another
– Each insertion, deletion, or substitution is one operation
• Requires a two-dimensional array
– Rows: source word positions, Columns: spelled word positions
– Cells: distance[r][c] is the distance up to that point
Pseudo Code (minDistance(target, source))
n = number of characters in source
m = number of characters in target
Create array, distance, with dimensions n+1, m+1
FOR r=0 TO n distance[r,0] = r
FOR c=0 TO m distance[0,c] = c
FOR each row r=1 TO n
  FOR each column c=1 TO m
    IF source[r]=target[c] cost = 0
    ELSE cost = 1
    distance[r,c] = minimum of
      distance[r-1,c] + 1,            // deletion
      distance[r,c-1] + 1,            // insertion
      and distance[r-1,c-1] + cost    // substitution
Result is in distance[n,m]
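
Below is a direct Python translation of the pseudo code (a sketch; the function name and typing are ours). Indexing is shifted by one because row r of the table corresponds to source character r-1.

def min_distance(target: str, source: str) -> int:
    n, m = len(source), len(target)
    distance = [[0] * (m + 1) for _ in range(n + 1)]
    for r in range(n + 1):               # base case: delete all source characters
        distance[r][0] = r
    for c in range(m + 1):               # base case: insert all target characters
        distance[0][c] = c
    for r in range(1, n + 1):
        for c in range(1, m + 1):
            cost = 0 if source[r - 1] == target[c - 1] else 1
            distance[r][c] = min(distance[r - 1][c] + 1,         # deletion
                                 distance[r][c - 1] + 1,         # insertion
                                 distance[r - 1][c - 1] + cost)  # substitution
    return distance[n][m]

print(min_distance("GUMBO", "GAMBOL"))   # 2, matching the example below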
Example

       G   U   M   B   O
   0   1   2   3   4   5
G  1
A  2
M  3
B  4
O  5
L  6

• Source: GAMBOL, Target: GUMBO
• Algorithm Step: Initialization
Example

       G   U   M   B   O
   0   1   2   3   4   5
G  1   0
A  2   1
M  3   2
B  4   3
O  5   4
L  6   5

• Source: GAMBOL, Target: GUMBO
• Algorithm Step: Column 1
Example

       G   U   M   B   O
   0   1   2   3   4   5
G  1   0   1
A  2   1   1
M  3   2   2
B  4   3   3
O  5   4   4
L  6   5   5

• Source: GAMBOL, Target: GUMBO
• Algorithm Step: Column 2
Example

       G   U   M   B   O
   0   1   2   3   4   5
G  1   0   1   2
A  2   1   1   2
M  3   2   2   1
B  4   3   3   2
O  5   4   4   3
L  6   5   5   4

• Source: GAMBOL, Target: GUMBO
• Algorithm Step: Column 3
Example

       G   U   M   B   O
   0   1   2   3   4   5
G  1   0   1   2   3
A  2   1   1   2   3
M  3   2   2   1   2
B  4   3   3   2   1
O  5   4   4   3   2
L  6   5   5   4   3

• Source: GAMBOL, Target: GUMBO
• Algorithm Step: Column 4
Example

       G   U   M   B   O
   0   1   2   3   4   5
G  1   0   1   2   3   4
A  2   1   1   2   3   4
M  3   2   2   1   2   3
B  4   3   3   2   1   2
O  5   4   4   3   2   1
L  6   5   5   4   3   2

• Source: GAMBOL, Target: GUMBO
• Algorithm Step: Column 5
• Result: Distance equals 2
Another Example

       E   X   E   C   U   T   I   O   N
   0   1   2   3   4   5   6   7   8   9
I  1   1   2   3   4   5   6   6   7   8
N  2   2   2   3   4   5   6   7   7   7
T  3   3   3   3   4   5   5   6   7   8
E  4   3   4   3   4   5   6   6   7   8
N  5   4   4   4   4   5   6   7   7   7
T  6   5   5   5   5   5   5   6   7   8
I  7   6   6   6   6   6   6   5   6   7
O  8   7   7   7   7   7   7   6   5   6
N  9   8   8   8   8   8   8   7   6   5

• Source: INTENTION, Target: EXECUTION
• Result: Distance equals 5
Hidden Markov Model
• Motivation
– We observe the output
– We don't know which internal states the model is in
– Goal: Determine the most likely internal (hidden) state sequence
– Hence the title, "Hidden"
• Definition: Discrete HMM Φ = (O, S, A, B, Ω)
1. O = {o1, o2, …, oM} is the set of possible output symbols
2. S = {1, 2, …, N} is the set of possible internal HMM states
3. A = {aij} is the transition probability matrix from state i to state j
4. B = {bi(k)} is the probability of state i outputting ok
5. Ω = {Ωi} is the set of initial state probabilities, where Ωi is the
probability that the system starts in state i
HMM Applications
Given an HMM Model and an observation sequence:
1. Evaluation Problem
What is the probability that the model generated
the observations?
2. Decoding Problem
What is the most likely state sequence S=(s0, s1, s2,
…, sT) in the model that produced the
observations?
3. Learning Problem
How can we adjust parameters of the model to
maximize the likelihood that the observation will
be correctly recognized?
Hidden Markov Model (HMM)
Natural Language Processing and HMMs
1. Speech Recognition
• Which words generated the observed acoustic signal?
2. Handwriting Recognition
• Which words generated the observed image?
3. Part-of-speech
• Which parts of speech correspond to the observed words?
• Where are the word boundaries in the acoustic signal?
• Which morphological word variants match the acoustic signal?
4. Translation
• Which foreign words are in the observed signal?
5. Speech Synthesis
• What database unit fits the synthesis script?
Demo: http://www.comp.leeds.ac.uk/roger/HiddenMarkovModels/html_dev/hmms/s3_pg1.html
Natural Language HMM Assumptions
• A Stochastic Markov process
– System state changes are not deterministic; they vary according
to some probabilistic distribution
• Discrete
– The set of system states is countable, and the system is observed at discrete time steps
• Markov Chain: Next state depends solely on the current state
P(w1, …, wn) ≈ P(w1) ∏i=2,n P(wi | wi-1)
not P(w1, …, wn) = P(w1) ∏i=2,n P(wi | w1, …, wi-1)
• Output Assumption
– Output at a given state solely depends on that state
Demonstration of a Stochastic Process
http://cs.sou.edu/~harveyd/classes/cs415/docs/unm/movie.html
Speech Recognition Example
• Observations: The digital signal features
• Hidden States: The spoken word that generated the features
• Goal: Choose the word that maximizes P(Word|Observation)
• Bayes' Law gives us something we can calculate:
– P(Word|Observation) = P(Word) P(Observation|Word) / P(Observation)
– Ignore the denominator: it is the same for every candidate word, so it
doesn't affect which word maximizes the expression
• P(Word) can be looked up from a database
– Use bi-grams or tri-grams to take the context into account
– Chain rule: P(w) = P(w1) P(w2|w1) P(w3|w1,w2) … P(wn|w1,w2,…,wn-1)
– If there is no such probability, we can use a smoothing algorithm
to insert a value for combinations never encountered (see the sketch below)
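
A small illustrative sketch (the toy corpus and names are ours) of estimating bigram probabilities with add-one (Laplace) smoothing, so combinations never encountered still receive a small nonzero value:

from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()   # toy training data
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocabulary_size = len(unigrams)

def bigram_prob(prev: str, word: str) -> float:
    # Add-one smoothing: the +1 and +V terms keep unseen pairs nonzero
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocabulary_size)

print(bigram_prob("the", "cat"))   # seen pair: (2+1)/(3+6) = 0.333
print(bigram_prob("cat", "mat"))   # unseen pair: (0+1)/(2+6) = 0.125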
HMM: Trellis Model
Question: How do we find the most likely sequence?
Probabilities
• Forward probability: αt(i)
The probability of being in state si, given the partial
observation o1,…,ot
• Backward probability: βt(i)
The probability of being in state si, given the partial
observation ot+1,…,oT
• Transition probability: ξt(i,j) = P(qt = si, qt+1 = sj | O, λ)
The probability of being in state si at time t and
going from state si to state sj, given the complete
observation o1,…,oT
Forward Probabilities
αt(j) = ∑i=1,N αt-1(i) aij bj(ot), where αt(j) = P(o1…ot, qt = sj | λ)
Notes
λ = HMM, qt = HMM state at time t, sj = jth state, ot = output at time t
aij = probability of transitioning from state si to sj
bj(ot) = probability of observation ot resulting from sj
αt(j) = probability of state j at time t given observations o1,o2,…,ot
Forward Algorithm Pseudo Code
What is the likelihood of each possible observed pronunciation?
forward[i,j] = 0 for all i,j; forward[0,0] = 1.0
FOR each time step t
  FOR each state s
    FOR each state transition s to s'
      forward[s',t+1] += forward[s,t] * a(s,s') * b(s',ot)
RETURN ∑ forward[s,tfinal+1] for all states s
Notes
1. a(s,s') is the transition probability from state s to state s'
2. b(s',ot) is the probability of state s' given observation ot
Complexity: O(T·S²) where S is the number of states and T the number of time steps
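
A minimal NumPy sketch of the forward algorithm for a discrete HMM (the names are ours): omega holds the initial state probabilities Ωi, A the transition matrix aij, and B[i,k] the probability of state i outputting symbol k.

import numpy as np

def forward(omega, A, B, observations):
    # alpha[t, j] = P(o1..ot, q_t = s_j | lambda)
    T, N = len(observations), len(omega)
    alpha = np.zeros((T, N))
    alpha[0] = omega * B[:, observations[0]]              # base case
    for t in range(1, T):                                 # fill the trellis
        alpha[t] = (alpha[t - 1] @ A) * B[:, observations[t]]
    return alpha

def observation_likelihood(omega, A, B, observations):
    # P(O | lambda) is the sum over the final trellis column
    return forward(omega, A, B, observations)[-1].sum()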
Viterbi Algorithm
• Viterbi is an optimally efficient dynamic programming
HMM algorithm that traces through a series of possible
states to find the most likely cause of an observation
• Similar to computing the forward probabilities, but
instead of summing over transitions from incoming
states, compute the maximum
• Forward Algorithm: αt(j) = ∑i=1,N αt-1(i) aij bj(ot)
• Viterbi: vt(j) = max1≤i≤N vt-1(i) aij bj(ot)
Viterbi Algorithm Pseudo Code
What is the likelihood of a word given an observation sequence?
viterbi[i,j] = 0 for all i,j; viterbi[0,0] = 1.0
FOR each time step t
  FOR each state s
    FOR each state transition s to s'
      newScore = viterbi[s,t] * a(s,s') * b(s',ot)
      IF (newScore > viterbi[s',t+1])
        viterbi[s',t+1] = newScore
        save the backpointer s for state s' at time t+1
RETURN the best final score and the state sequence traced back
through the backpointers
Notes
1. a(s,s') is the transition probability from state s
to state s'
2. b(s',ot) is the probability of state s' given
observation ot
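
A minimal NumPy sketch of Viterbi (the names are ours), mirroring the forward sketch but taking a maximum instead of a sum and keeping backpointers so the most likely state sequence can be traced back:

import numpy as np

def viterbi(omega, A, B, observations):
    T, N = len(observations), len(omega)
    v = np.zeros((T, N))                    # best score ending in each state
    back = np.zeros((T, N), dtype=int)      # backpointers
    v[0] = omega * B[:, observations[0]]
    for t in range(1, T):
        scores = v[t - 1][:, None] * A      # scores[i, j] = v[t-1, i] * a_ij
        back[t] = scores.argmax(axis=0)
        v[t] = scores.max(axis=0) * B[:, observations[t]]
    path = [int(v[-1].argmax())]            # trace back from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return list(reversed(path)), float(v[-1].max())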
Markov Example
• Problem: Model the probability of stocks being bull, bear, or stable
• Observe: up, down, unchanged
• Hidden: bull, bear, stable

Probability Matrix (aij)
         Bull   Bear   Stable
Bull     0.6    0.2    0.2
Bear     0.5    0.3    0.2
Stable   0.4    0.1    0.5

Initialization Matrix (Ωi)
Bull     0.5
Bear     0.2
Stable   0.3

Example: What is the probability of observing up five days in a row?
HMM Example
• O = {up, down, unchanged (Unch)}
• S = {bull (1), bear (2), stable (3)}

Observe 'up, up, down, down, up'
What is the most likely sequence of states for this output?

aij    1      2      3
1      0.6    0.2    0.2
2      0.5    0.3    0.2
3      0.4    0.1    0.5

bi     up     down   Unch.
1      0.7    0.1    0.2
2      0.1    0.6    0.3
3      0.3    0.3    0.4

State  Ωi
1      0.5
2      0.2
3      0.3
Forward Probabilities
Observed X = [up, up]; the aij, bc, and Ωc matrices are as defined above.

t=0: α0(c) = Ωc * bc(up)
State 0 (bull):   0.5 * 0.7 = 0.35
State 1 (bear):   0.2 * 0.1 = 0.02
State 2 (stable): 0.3 * 0.3 = 0.09

t=1: α1(c) = [sum of α0(i) * ai,c] * bc(up)
State 0 (bull):   0.179
State 1 (bear):   0.009
State 2 (stable): 0.036

Note: α1(stable) = 0.35*0.2*0.3 + 0.02*0.2*0.3 + 0.09*0.5*0.3 = 0.0357
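
The numbers above can be reproduced with the forward() sketch from the Forward Algorithm section (state order bull, bear, stable; output order up, down, unchanged):

import numpy as np
A = np.array([[0.6, 0.2, 0.2], [0.5, 0.3, 0.2], [0.4, 0.1, 0.5]])
B = np.array([[0.7, 0.1, 0.2], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]])
omega = np.array([0.5, 0.2, 0.3])
alpha = forward(omega, A, B, [0, 0])   # X = [up, up]
print(alpha[0])   # [0.35, 0.02, 0.09]
print(alpha[1])   # [0.1792, 0.0085, 0.0357]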
Viterbi Example
Observed = [up, up]; the aij, bc, and Ωc matrices are as defined above.

t=0: v0(c) = Ωc * bc(up)
State 0 (bull):   0.5 * 0.7 = 0.35
State 1 (bear):   0.2 * 0.1 = 0.02
State 2 (stable): 0.3 * 0.3 = 0.09

t=1: v1(c) = [maximum of v0(i) * ai,c] * bc(up)
State 0 (bull):   0.147
State 1 (bear):   0.007
State 2 (stable): 0.021

Note: v1(stable) = 0.021 = 0.35*0.2*0.3, versus 0.02*0.2*0.3 and 0.09*0.5*0.3
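
The same matrices fed to the viterbi() sketch above reproduce this slide's numbers; the winning path for [up, up] stays in the bull state:

path, score = viterbi(omega, A, B, [0, 0])
print(path, score)   # [0, 0] 0.147  (t=1 scores: 0.147, 0.007, 0.021)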
Backward Probabilities
• Similar algorithm as computing the forward
probabilities, but in the other direction
• Answers the question: Given an HMM model and
that the state at time t is i, what is the probability
that the partial observation ot+1 … oT is generated?
βt(i) = ∑j=1,N aij bj(ot+1) βt+1(j), where βt(i) = P(ot+1…oT | qt = si, λ)
Backward Probabilities
βt(i) = ∑j=1,N βt+1(j) aij bj(ot+1), where βt(i) = P(ot+1…oT | qt = si, λ)
Notes
λ = HMM, qt = HMM state at time t, sj = jth state, ot = output at time t
aij = probability of transitioning from state si to sj
bj(ot) = probability of observation ot resulting from sj
βt(i) = probability of state i at time t given observations ot+1,ot+2,…,oT
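
A minimal NumPy sketch of the backward pass (the names are ours), complementary to the forward() sketch earlier:

import numpy as np

def backward(A, B, observations):
    # beta[t, i] = P(o_{t+1}..o_T | q_t = s_i, lambda)
    T, N = len(observations), A.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                    # base case
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, observations[t + 1]] * beta[t + 1])
    return beta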
Parameters for HMM states
• Cepstrals
– Why? They are largely statistically independent, which makes
them suitable for classifying outputs
• Delta coefficients
– Why? To overcome the HMM limitation where transitions
only depend on one previous state. Speech articulators
change slowly, so they don’t follow the traditional HMM
model. Without delta coefficients, HMM tends to jump too
quickly between states
• Synthesis requires more parameters than ASR
– Examples: additional delta coefficients, duration and F0
modeling, acoustic energy
Cepstral Review
1. Perform Fourier transform to go from time to
frequency domain
2. Warp the frequencies using the Mel-scale
3. Gather the amplitude data into bins (usually 13)
4. Take the log power of the amplitudes
5. Compute first and second order delta coefficients
6. Perform a discrete cosine transform (no complex
numbers) to form the Cepstrals
Note: Phase data is lost in the process
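
A deliberately simplified NumPy/SciPy sketch of steps 1-4 and 6 above (the triangular mel filterbank here is crude, step 5's delta coefficients are computed as differences across successive frames and omitted, and all names are ours):

import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def cepstral_features(frame, sample_rate, n_bins=13):
    spectrum = np.abs(np.fft.rfft(frame)) ** 2             # step 1: to frequency domain (phase discarded)
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sample_rate)
    mel_edges = np.linspace(0.0, hz_to_mel(sample_rate / 2), n_bins + 2)
    edges = mel_to_hz(mel_edges)                           # step 2: mel-warped bin edges
    energies = np.zeros(n_bins)
    for i in range(n_bins):                                # step 3: pool into triangular bins
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        weights = np.clip(np.minimum((freqs - lo) / (mid - lo),
                                     (hi - freqs) / (hi - mid)), 0.0, None)
        energies[i] = weights @ spectrum
    log_energies = np.log(energies + 1e-10)                # step 4: log power
    return dct(log_energies, norm='ortho')                 # step 6: cepstral coefficients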
Training Data
• Question: How do we establish the transition
probabilities between states when that information is
not available?
– Older Method: tedious hand-marking of wave files based
on spectrograms
– Optimal Method: intractable (the exact optimization is NP-complete)
– Newer Method: the Baum-Welch algorithm is a popular
heuristic to automate the process
• Strategies
– Speech Recognition: train with data from many speakers
– Speech Synthesis: train with data for specific speakers
Baum-Welch Algorithm Pseudo-code
• Initialize HMM parameters; iterations = 0
• DO
– HMM' = HMM; iterations++
– FOR each training data sequence
• Calculate forward probabilities
• Calculate backward probabilities
• Update HMM parameters
• UNTIL |HMM - HMM'| < delta OR iterations ≥ MAX
Re-estimation of State Changes
a'ij = ∑t=1,T-1 αt(i) aij bj(ot+1) βt+1(j) / ∑t=1,T-1 αt(i) βt(i)
Numerator: the forward/backward ways to be in state i at time t and
jump to state j at time t+1 with the observed output; the joint
probability of state i at t and state j at t+1 contributes the
factor aij bj(ot+1)
Denominator: the forward/backward ways to be in state i at time t
Note: b(ot) is already part of αt(i)
Re-estimation of Other Probabilities
• The probability of an output, o, being observed from
a given state, s
b'(o) = (Number of times in state s observing o) /
(Number of times in state s)
• The probability of initially being in state, s, when
observing the output sequence
Ω's = ∑j=1,N α1(s) asj bj(o2) β2(j) /
∑i=1,N ∑j=1,N α1(i) aij bj(o2) β2(j)
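
A hedged sketch of one re-estimation pass (our formulation of the standard update, reusing the forward() and backward() sketches above):

import numpy as np

def reestimate(omega, A, B, observations):
    obs = np.array(observations)
    alpha = forward(omega, A, B, obs)
    beta = backward(A, B, obs)
    likelihood = alpha[-1].sum()
    gamma = alpha * beta / likelihood         # gamma[t, i]: P(state i at t | O)
    # xi[t, i, j]: P(state i at t and state j at t+1 | O)
    xi = (alpha[:-1, :, None] * A[None, :, :] *
          B[:, obs[1:]].T[:, None, :] * beta[1:, None, :]) / likelihood
    new_omega = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):               # expected counts of output k per state
        new_B[:, k] = gamma[obs == k].sum(axis=0) / gamma.sum(axis=0)
    return new_omega, new_A, new_B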
Summary of HMM Approaches
• Discrete
– The continuous valued observed outputs are compared against
a codebook of discrete values for HMM observations
– Performs well for smaller dictionaries
• Continuous Mixture Density
– The observed outputs are fed to the HMM in continuous form
– Gaussian mixture: outputs map to a range of distribution
parameters
– Applicable for large vocabulary with a large number of
parameters
• Semi-Continuous
– No mixture of Gaussian densities
– Tradeoff between discrete and continuous mixture
– Large vocabularies: better than discrete, worse than continuous
HMM limitations
1. Training an HMM is a hill-climbing algorithm
– It finds local optima, not global optima
– It is sensitive to initial parameter settings
2. HMMs have trouble modeling time duration in speech
3. The first-order Markov independence assumptions don't
exactly model speech
4. Underflow occurs when computing Markov probabilities. For
this reason, log probabilities are normally used
5. Continuous output model performance is limited by
probabilities that incorrectly map to outputs
6. Outputs are interrelated, not independent
Decision Trees
Partition the data with a series of questions, each with a discrete set of answers
[Figure: two scatter plots of data points, one showing a reasonably
good partition and one showing a poor partition]
CART Algorithm
Classification and regression trees
1. Create a set of questions that can distinguish between the
measured variables
a. Singleton Questions: Boolean (yes/no or true/false) answers
b. Complex Questions: many possible answers
2. Initialize the tree with one root node
3. Compute the entropy for a node to be split
4. Pick the question with the greatest entropy gain
5. Split the tree based on step 4
6. Return to step 3 as long as nodes remain to split
7. Prune the tree to the optimal size by removing leaf nodes
with minimal improvement
Note: We build the tree from the top down. We prune the tree from the bottom up.
Example: Play or not Play?

Outlook    Temperature  Humidity  Windy  Play?
sunny      hot          high      false  No
sunny      hot          high      true   No
overcast   hot          high      false  Yes
rain       mild         high      false  Yes
rain       cool         normal    false  Yes
rain       cool         normal    true   No
overcast   cool         normal    true   Yes
sunny      mild         high      false  No
sunny      cool         normal    false  Yes
rain       mild         normal    false  Yes
sunny      mild         normal    true   Yes
overcast   mild         high      true   Yes
overcast   hot          normal    false  Yes
rain       mild         high      true   No
Questions
1) What is the outlook?
2) What is the temperature?
3) What is the humidity?
4) Is it Windy?
Goal: Order the questions in
the most efficient way
Example Tree for "Do we play?"
Goal: Find the optimal tree

Outlook?
  sunny --> Humidity?
    high --> No
    normal --> Yes
  overcast --> Yes
  rain --> Windy?
    true --> No
    false --> Yes

Which question to select?
(witten&eibe)
Computing Entropy
• Entropy: Bits needed to store possible question answers
• Formula: Computing the entropy for a question:
Entropy(p1, p2, …, pn) = -p1 log2(p1) - p2 log2(p2) - … - pn log2(pn)
• Where
pi is the probability of the ith answer to a question
log2x is logarithm base 2 of x
• Examples:
– A coin toss requires one bit (head=1, tail=0)
– A question with 30 equally likely answers requires
∑i=1,30-(1/30)log2(1/30) = - log2(1/30) = 4.907
Example: question “Outlook”
Compute the entropy for the question: What is the outlook?
Entropy(“Outlook”=“Sunny”)=Entropy(0.4, 0.6)=-0.4 log2(0.4)-0.6 log2(0.6)=0.971
Five outcomes, 2 for play for P = 0.4, 3 for not play for P=0.6
Entropy(“Outlook” = “Overcast”) = Entropy(1.0, 0.0)= -1 log2(1.0) - 0 log2(0.0) = 0.0
Four outcomes, all for play. P = 1.0 for play and P = 0.0 for no play.
Entropy(“Outlook”=“Rainy”)= Entropy(0.6,0.4)= -0.6 log2(0.6) - 0.4 log2(0.4)= 0.971
Five Outcomes, 3 for play for P=0.6, 2 for not play for P=0.4
Entropy(Outlook) = weighted average over Sunny, Overcast, Rainy
= 5/14*0.971 + 4/14*0 + 5/14*0.971 = 0.693
Computing the Entropy gain
• Original Entropy : Do we play?
Entropy(“Play“)=Entropy(9/14,5/14)=-9/14log2(9/14) - 5/14 log2(5/14)=0.940
14 outcomes, 9 for Play P = 9/14, 5 for not play P=5/14
• Information gain equals
(information before) – (information after)
gain("Outlook") = 0.940 – 0.693 = 0.247
• Information gain for other weather questions
– gain("Temperature") = 0.029
– gain("Humidity") = 0.152
– gain("Windy") = 0.048
• Conclusion: Ask, “What is the Outlook?” first
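
A short sketch (the names and data layout are ours) that reproduces the entropy and gain numbers above from the play/not-play table:

from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((labels.count(v) / total) * log2(labels.count(v) / total)
                for v in set(labels))

def gain(attribute_values, labels):
    before = entropy(labels)                  # information before the split
    after = 0.0
    for value in set(attribute_values):       # weighted entropy after the split
        subset = [lab for a, lab in zip(attribute_values, labels) if a == value]
        after += len(subset) / len(labels) * entropy(subset)
    return before - after

outlook = ["sunny", "sunny", "overcast", "rain", "rain", "rain", "overcast",
           "sunny", "sunny", "rain", "sunny", "overcast", "overcast", "rain"]
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

print(f"{entropy(play):.3f}")        # 0.940
print(f"{gain(outlook, play):.3f}")  # 0.247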
Continuing to split
Entropy gains for the questions remaining under the "sunny" branch:
gain("Temperature") = 0.571 bits
gain("Humidity") = 0.971 bits
gain("Windy") = 0.020 bits
For each child question, do the same thing to form the complete decision tree
Example: After the outlook sunny node, we can still ask about temperature,
humidity, and windiness
The final decision tree
Note: The splitting stops when further splits don't reduce
entropy more than some threshold value
Other Models
• Goal: Find database units to use for
synthesizing some element of speech
• Other approaches
– Relax the Markov assumption
• Advantage: Can better model speech
• Disadvantage: Complicates the model
– Neural nets
• Disadvantage: Have not been demonstrated to be superior to
the HMM approach