Speech Recognition and
Hidden Markov Models
CPSC4600@UTC/CSE
Hidden Markov models
• Probability fundamentals
• Markov models
• Hidden Markov models
– Likelihood calculation
Probability fundamentals
• Normalization
– discrete and continuous
• Independent events
– joint probability
• Dependent events
– conditional probability
• Bayes’ theorem
– posterior probability
• Marginalization
– discrete and continuous
Normalization
Discrete: the probabilities of all possibilities sum to one:
Σi P(xi) = 1
Continuous: the integral over the entire probability density function (pdf) comes to one:
∫ p(x) dx = 1
Joint probability
The joint probability that two independent events occur is the product of their individual probabilities:
P(A,B) = P(A) P(B)
Conditional probability
If two events are dependent, we need to determine their conditional probabilities. The joint probability is now
P(A,B) = P(A) P(B|A),    (4)
where P(B|A) is the probability of event B given that A occurred; conversely, taking the events the other way,
P(A,B) = P(A|B) P(B).    (5)
Bayes’ theorem
Equating the RHS of eqs. 4 and 5 gives
P(A|B) = P(B|A) P(A) / P(B)
For example, in a word recognition application we have
P(word | data) = P(data | word) P(word) / P(data),
which can be interpreted as
posterior = likelihood × prior / evidence.
The posterior probability is used to make Bayesian inferences; the conditional likelihood describes how likely the data were for a given class; the prior allows us to incorporate other forms of knowledge into our decision (such as a language model); the evidence acts as a normalization factor and is often discarded in practice, as it is the same for all classes.
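As a toy illustration of the slide above, here is a minimal Python sketch (the likelihood and prior values are made-up numbers, not from the lecture) that scores two candidate words for a single observation and drops the evidence term, since it is the same for every class:

```python
# Hypothetical recognizer scores P(data | word) for one observed utterance (made-up numbers).
likelihood = {"write": 0.020, "right": 0.015}
# Hypothetical language-model priors P(word) (made-up numbers).
prior = {"write": 0.3, "right": 0.7}

# Unnormalized posterior: P(word | data) is proportional to P(data | word) * P(word).
posterior = {w: likelihood[w] * prior[w] for w in likelihood}

print(posterior)                          # {'write': 0.006, 'right': ~0.0105}
print(max(posterior, key=posterior.get))  # 'right' wins once the prior is taken into account
```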
Marginalization
Discrete: the probability of event B, which depends on A, is the sum over A of all joint probabilities:
P(B) = ΣA P(A,B)
Continuous: similarly, the nuisance variable x can be eliminated from its joint pdf with y:
p(y) = ∫ p(x,y) dx
Introduction to Markov Models
• Set of states: {s1, s2, ..., sN}
• The process moves from one state to another, generating a sequence of states: si1, si2, ..., sik, ...
• Markov chain property: the probability of each subsequent state depends only on the previous state:
P(sik | si1, si2, ..., sik-1) = P(sik | sik-1)
• To define a Markov model, the following probabilities have to be specified: transition probabilities aij = P(si | sj) and initial probabilities πi = P(si).
Example of Markov Model
[State diagram: two states, ‘Rain’ and ‘Dry’, with transitions labeled 0.3, 0.7, 0.2, 0.8.]
• Two states: ‘Rain’ and ‘Dry’.
• Transition probabilities: P(‘Rain’|‘Rain’)=0.3, P(‘Dry’|‘Rain’)=0.7, P(‘Rain’|‘Dry’)=0.2, P(‘Dry’|‘Dry’)=0.8
• Initial probabilities: say P(‘Rain’)=0.4, P(‘Dry’)=0.6.
Calculation of sequence probability
• By the Markov chain property, the probability of a state sequence can be found by the formula:
P(si1, si2, ..., sik) = P(sik | si1, ..., sik-1) P(si1, ..., sik-1)
= P(sik | sik-1) P(si1, ..., sik-1) = ...
= P(sik | sik-1) P(sik-1 | sik-2) ... P(si2 | si1) P(si1)
• Suppose we want to calculate the probability of a sequence of states in our example, {‘Dry’,‘Dry’,‘Rain’,‘Rain’}:
P({‘Dry’,‘Dry’,‘Rain’,‘Rain’}) =
P(‘Rain’|‘Rain’) P(‘Rain’|‘Dry’) P(‘Dry’|‘Dry’) P(‘Dry’) =
0.3 * 0.2 * 0.8 * 0.6 = 0.0288
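The same calculation can be scripted directly. A minimal Python sketch of the Rain/Dry chain above, using the transition and initial probabilities from the slides:

```python
# Rain/Dry Markov chain from the slides.
trans = {"Rain": {"Rain": 0.3, "Dry": 0.7},
         "Dry":  {"Rain": 0.2, "Dry": 0.8}}
init = {"Rain": 0.4, "Dry": 0.6}

def sequence_probability(states):
    """P(s1, ..., sk) = P(s1) * P(s2|s1) * ... * P(sk|sk-1)."""
    p = init[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= trans[prev][cur]
    return p

print(sequence_probability(["Dry", "Dry", "Rain", "Rain"]))  # 0.6*0.8*0.2*0.3 ≈ 0.0288
```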
Hidden Markov models
• Set of states: {s1, s2, ..., sN}
• The process moves from one state to another, generating a sequence of states: si1, si2, ..., sik, ...
• Markov chain property: the probability of each subsequent state depends only on the previous state:
P(sik | si1, si2, ..., sik-1) = P(sik | sik-1)
• States are not visible, but each state randomly generates one of M observations (or visible states) {v1, v2, ..., vM}
• To define a hidden Markov model, the following probabilities have to be specified:
• matrix of transition probabilities A = (aij), aij = P(si | sj)
• matrix of observation probabilities B = (bi(vm)), bi(vm) = P(vm | si)
• initial probabilities π = (πi), πi = P(si).
The model is represented by M = (A, B, π).
Example of Hidden Markov Model
[Diagram: hidden states ‘Low’ and ‘High’ with transition probabilities 0.3, 0.7, 0.2, 0.8 and emission arrows to the observations ‘Rain’ and ‘Dry’ labeled 0.6, 0.4 (from ‘Low’) and 0.4, 0.6 (from ‘High’).]
Two states: ‘Low’ and ‘High’ atmospheric pressure.
Example of Hidden Markov Model
1. Two states: ‘Low’ and ‘High’ atmospheric pressure.
2. Two observations: ‘Rain’ and ‘Dry’.
3. Transition probabilities: P(‘Low’|‘Low’)=0.3, P(‘High’|‘Low’)=0.7, P(‘Low’|‘High’)=0.2, P(‘High’|‘High’)=0.8.
4. Observation probabilities: P(‘Rain’|‘Low’)=0.6, P(‘Dry’|‘Low’)=0.4, P(‘Rain’|‘High’)=0.4, P(‘Dry’|‘High’)=0.6.
5. Initial probabilities: say P(‘Low’)=0.4, P(‘High’)=0.6.
Calculation of observation sequence probability
• Suppose we want to calculate the probability of a sequence of observations in our example, {‘Dry’,‘Rain’}.
• Consider all possible hidden state sequences:
P({‘Dry’,‘Rain’}) = P({‘Dry’,‘Rain’}, {‘Low’,‘Low’}) + P({‘Dry’,‘Rain’}, {‘Low’,‘High’}) + P({‘Dry’,‘Rain’}, {‘High’,‘Low’}) + P({‘Dry’,‘Rain’}, {‘High’,‘High’}),
where the first term is:
P({‘Dry’,‘Rain’}, {‘Low’,‘Low’}) =
P({‘Dry’,‘Rain’} | {‘Low’,‘Low’}) P({‘Low’,‘Low’}) =
P(‘Dry’|‘Low’) P(‘Rain’|‘Low’) P(‘Low’) P(‘Low’|‘Low’) =
0.4 * 0.6 * 0.4 * 0.3 = 0.0288
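A minimal Python sketch of this marginalization for the Low/High example above; it enumerates every hidden state sequence and sums the joint probabilities:

```python
from itertools import product

# Low/High HMM from the slides.
states = ["Low", "High"]
init  = {"Low": 0.4, "High": 0.6}
trans = {"Low":  {"Low": 0.3, "High": 0.7},
         "High": {"Low": 0.2, "High": 0.8}}
emit  = {"Low":  {"Rain": 0.6, "Dry": 0.4},
         "High": {"Rain": 0.4, "Dry": 0.6}}

def observation_probability(obs):
    """P(O) = sum over all hidden state sequences Q of P(O, Q)."""
    total = 0.0
    for q in product(states, repeat=len(obs)):
        p = init[q[0]] * emit[q[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= trans[q[t - 1]][q[t]] * emit[q[t]][obs[t]]
        total += p
    return total

print(observation_probability(["Dry", "Rain"]))  # 0.0288 + 0.0448 + 0.0432 + 0.1152 = 0.232
```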
Summary of Markov models
State topology: a set of N states with the allowed transitions between them.
Initial-state probabilities: πi = P(x1 = i), 1 ≤ i ≤ N,
and state-transition probabilities: aij = P(xt = j | xt-1 = i), 1 ≤ i, j ≤ N.
Probability of a given state sequence X = (x1, x2, ..., xT):
P(X) = πx1 · ax1x2 · ax2x3 · ... · axT-1xT
Summary of Hidden Markov models
The probability of state i generating a discrete observation ot, which takes one of a finite set of values, is bi(ot) = P(ot | xt = i).
The probability distribution of a continuous observation ot, which can take one of an infinite set of values, is a probability density bi(ot) (e.g., a mixture of Gaussians).
We begin by considering only discrete observations.
Elements of a discrete HMM, λ = (A, B, π):
1. Number of different hidden states N,
2. Number of different observation symbols K,
3. Initial-state probabilities, π = {πi},
4. State-transition probabilities, A = {aij},
5. Discrete emission/output probabilities, B = {bi(k)}.
Three main issues using HMMs
• Evaluation problem: compute the likelihood of a set of observations given an HMM model λ = (A, B, π).
• Decoding problem: decode a state sequence by finding the most likely path through the model given an observation sequence and an HMM model.
• Learning problem: optimize the template patterns by training the parameters of the models.
Word recognition example(1).
• Typed word recognition; assume all characters are separated.
• The character recognizer outputs the probability of the image being a particular character, P(image|character).
[Figure: an observed character image (observation) linked to hidden states ‘a’ (0.5), ‘b’ (0.03), ‘c’ (0.005), ..., ‘z’ (0.31), i.e., the recognizer scores; hidden state = character, observation = image.]
Word recognition example(2).
• Hidden states of the HMM = characters.
• Observations = typed images of characters segmented from the word image, v. Note that there is an infinite number of possible observations.
• Observation probabilities = character recognizer scores:
B = {bi(v)}, bi(v) = P(v | si)
• Transition probabilities will be defined differently in the two subsequent models.
Word recognition example(3).
• If a lexicon is given, we can construct a separate HMM model for each lexicon word.
[Figure: left-to-right HMMs spelling ‘Amherst’ (a-m-h-e-r-s-t) and ‘Buffalo’ (b-u-f-f-a-l-o), with character recognizer scores attached to the observed images.]
• Here, recognition of a word image is equivalent to the problem of evaluating a few HMM models.
• This is an application of the Evaluation problem.
Word recognition example(4).
• We can construct a single HMM for all words.
• Hidden states = all characters in the alphabet.
• Transition probabilities and initial probabilities are calculated from a language model.
• Observations and observation probabilities are as before.
[Figure: a single HMM whose hidden states are the characters a, m, f, r, t, o, b, h, e, s, connected according to the language model, with an observed character image v.]
• Here we have to determine the best sequence of hidden states, the one that most likely produced the word image.
• This is an application of the Decoding problem.
Task 1: Likelihood of an Observation Sequence
• What is P(O | λ)?
• The likelihood of an observation sequence is the sum of the probabilities of all possible state sequences in the HMM.
• Naïve computation is very expensive. Given T observations and N states, there are N^T possible state sequences.
• Even small HMMs, e.g. T=10 and N=10, contain 10 billion different paths.
• The solution to this and to Task 2 is to use dynamic programming.
Forward Probabilities
• What is the probability that, given an HMM, at time t the state is i and the partial observation o1 ... ot has been generated?
αt(i) = P(o1 ... ot, qt = si | λ)
• This can be computed recursively:
αt(j) = [ Σi=1..N αt-1(i) aij ] bj(ot)
Forward Algorithm
• Initialization:
α1(i) = πi bi(o1),  1 ≤ i ≤ N
• Induction:
αt(j) = [ Σi=1..N αt-1(i) aij ] bj(ot),  2 ≤ t ≤ T, 1 ≤ j ≤ N
• Termination:
P(O | λ) = Σi=1..N αT(i)
Forward Algorithm Complexity
• The naïve approach to solving problem 1 takes on the order of 2T·N^T computations.
• The forward algorithm takes on the order of N^2·T computations.
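A minimal NumPy sketch of the forward algorithm above; the array conventions (A is N×N, B is N×K with one column per observation symbol, pi has length N, obs is a list of symbol indices) are my own, not the lecture's notation:

```python
import numpy as np

def forward(A, B, pi, obs):
    """Return P(O | model) with the forward algorithm, O(N^2 * T) operations."""
    T = len(obs)
    alpha = np.zeros((T, A.shape[0]))
    alpha[0] = pi * B[:, obs[0]]                       # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]   # induction
    return alpha[-1].sum()                             # termination

# Low/High pressure example from the earlier slides; symbols: 0 = 'Dry', 1 = 'Rain'.
A  = np.array([[0.3, 0.7], [0.2, 0.8]])   # rows/columns: Low, High
B  = np.array([[0.4, 0.6], [0.6, 0.4]])   # rows: Low, High; columns: Dry, Rain
pi = np.array([0.4, 0.6])
print(forward(A, B, pi, [0, 1]))          # ≈ 0.232, matching the brute-force sum
```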
Character recognition with HMM example
• The structure of hidden states is chosen.
• Observations are feature vectors extracted from vertical slices of the character image.
• Probabilistic mapping from hidden states to feature vectors, either:
1. use a mixture of Gaussian models (continuous observations), or
2. quantize the feature vector space (discrete observations).
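To illustrate option 1, a minimal sketch of a one-dimensional Gaussian-mixture emission density bi(o) for a single state; the mixture parameters are arbitrary toy values, not from the lecture:

```python
import math

def gaussian(o, mu, sigma):
    """Univariate normal density N(o; mu, sigma^2)."""
    return math.exp(-0.5 * ((o - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_density(o, weights, mus, sigmas):
    """b_i(o) = sum_k w_k * N(o; mu_k, sigma_k^2), with the weights summing to one."""
    return sum(w * gaussian(o, mu, s) for w, mu, s in zip(weights, mus, sigmas))

# Toy two-component mixture for one hidden state.
print(mixture_density(1.2, weights=[0.6, 0.4], mus=[1.0, 3.0], sigmas=[0.5, 1.0]))
```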
Exercise: character recognition with HMM(1)
• The structure of hidden states is left-to-right: s1 → s2 → s3.
• Observation = number of islands in the vertical slice (1, 2, or 3).
• HMM for character ‘A’:
Transition probabilities: {aij} =
[ .8  .2  0 ]
[ 0   .8  .2 ]
[ 0   0   1 ]
Observation probabilities: {bjk} =
[ .9  .1  0 ]
[ .1  .8  .1 ]
[ .9  .1  0 ]
• HMM for character ‘B’:
Transition probabilities: {aij} =
[ .8  .2  0 ]
[ 0   .8  .2 ]
[ 0   0   1 ]
Observation probabilities: {bjk} =
[ .9  .1  0 ]
[ 0   .2  .8 ]
[ .6  .4  0 ]
Exercise: character recognition with HMM(2)
• Suppose that after character image segmentation the following sequence of island numbers was observed in 4 slices: {1, 3, 2, 1}.
• Which HMM is more likely to have generated this observation sequence, the HMM for ‘A’ or the HMM for ‘B’?
Exercise: character recognition with HMM(3)
Consider the likelihood of generating the given observation sequence for each possible sequence of hidden states (the joint probability of a path is its transition product times its observation product):
• HMM for character ‘A’:
s1 s1 s2 s3: transitions .8 × .2 × .2, observations .9 × 0 × .8 × .9, joint = 0
s1 s2 s2 s3: transitions .2 × .8 × .2, observations .9 × .1 × .8 × .9, joint = 0.0020736
s1 s2 s3 s3: transitions .2 × .2 × 1, observations .9 × .1 × .1 × .9, joint = 0.000324
Total = 0.0023976
• HMM for character ‘B’:
s1 s1 s2 s3: transitions .8 × .2 × .2, observations .9 × 0 × .2 × .6, joint = 0
s1 s2 s2 s3: transitions .2 × .8 × .2, observations .9 × .8 × .2 × .6, joint = 0.0027648
s1 s2 s3 s3: transitions .2 × .2 × 1, observations .9 × .8 × .4 × .6, joint = 0.006912
Total = 0.0096768
The observation sequence {1, 3, 2, 1} is therefore more likely to have been generated by the HMM for ‘B’.
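A short Python check of this exercise. It enumerates, as the slide does, only the state paths that start in s1 and end in the final state s3 (the initial state is taken to be s1 with probability 1), using the matrices from part (1); observation symbols 1, 2, 3 are mapped to column indices 0, 1, 2:

```python
from itertools import product

def likelihood(A, B, obs):
    """Sum of joint path probabilities over paths that start in s1 and end in s3
    (the initial state is taken to be s1 with probability 1, as in the slides)."""
    N, total = len(A), 0.0
    for path in product(range(N), repeat=len(obs)):
        if path[0] != 0 or path[-1] != N - 1:
            continue
        p = B[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        total += p
    return total

A_trans = [[.8, .2, 0], [0, .8, .2], [0, 0, 1]]
A_emit  = [[.9, .1, 0], [.1, .8, .1], [.9, .1, 0]]
B_trans = [[.8, .2, 0], [0, .8, .2], [0, 0, 1]]
B_emit  = [[.9, .1, 0], [0, .2, .8], [.6, .4, 0]]

obs = [0, 2, 1, 0]  # island counts {1, 3, 2, 1} as 0-based symbol indices
print(likelihood(A_trans, A_emit, obs))  # ≈ 0.0023976
print(likelihood(B_trans, B_emit, obs))  # ≈ 0.0096768
```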
Task 2: Decoding
• The solution to Task 1 (Evaluation) gives us the sum over all paths through an HMM efficiently.
• For Task 2, we want to find the path with the highest probability.
• We want to find the state sequence Q = q1 ... qT such that
Q = argmaxQ' P(Q' | O, λ)
Viterbi Algorithm
• Similar to computing the forward probabilities, but instead of summing over transitions from incoming states, compute the maximum.
• Forward:
αt(j) = [ Σi=1..N αt-1(i) aij ] bj(ot)
• Viterbi recursion:
δt(j) = [ max1≤i≤N δt-1(i) aij ] bj(ot)
Viterbi Algorithm
• Initialization:
δ1(i) = πi bi(o1),  1 ≤ i ≤ N
• Induction:
δt(j) = [ max1≤i≤N δt-1(i) aij ] bj(ot),  2 ≤ t ≤ T, 1 ≤ j ≤ N
ψt(j) = argmax1≤i≤N [ δt-1(i) aij ],  2 ≤ t ≤ T, 1 ≤ j ≤ N
• Termination:
p* = max1≤i≤N δT(i)
qT* = argmax1≤i≤N δT(i)
• Read out path (backtracking):
qt* = ψt+1(qt+1*),  t = T-1, ..., 1
Example of Viterbi Algorithm
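As a worked stand-in for this example, a minimal NumPy sketch of the Viterbi recursion above, run on the Low/High weather example from the earlier slides (array conventions as in the forward-algorithm sketch; this code is illustrative, not from the lecture):

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Return (most likely state path, its probability) for a discrete-observation HMM."""
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                        # initialization
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A              # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)                  # best predecessor for each state j
        delta[t] = scores.max(axis=0) * B[:, obs[t]]    # induction
    path = [int(delta[-1].argmax())]                    # termination
    for t in range(T - 1, 0, -1):                       # read out path by backtracking
        path.append(int(psi[t][path[-1]]))
    return path[::-1], float(delta[-1].max())

# Low/High pressure example; symbols: 0 = 'Dry', 1 = 'Rain'.
A  = np.array([[0.3, 0.7], [0.2, 0.8]])
B  = np.array([[0.4, 0.6], [0.6, 0.4]])
pi = np.array([0.4, 0.6])
print(viterbi(A, B, pi, [0, 1]))   # ([1, 1], 0.1152): 'High' then 'High' is the most likely path
```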
Voice Biometrics
General Description
• Each individual's voice is built from components called phonemes; each phoneme has a pitch, cadence, and inflection.
• These three give each one of us a unique voice sound.
• The similarity in voice comes from cultural and
regional influences in the form of accents.
• Voice physiological and behavioral biometrics are influenced by our body, environment, and age.
Voice Capture
• Voice can be captured in two ways:
– Dedicated resource like a microphone
– Existing infrastructure like a telephone
• Captured voice is influenced by two factors:
– Quality of the recording device
– The recording environment
• In wireless communication, voice travels through open air and then through terrestrial lines; it therefore suffers from considerable interference.
Application of Voice Technology
• Voice technology is applicable in a variety
of areas. Those used in biometric
technology include:
– Voice Verification
• Internet/intranet security:
– on-line banking
– on-line security trading
– access to corporate databases
– on-line information services
• PC access restriction software
Voice Verification
• Voice biometrics works by digitizing a profile of a
person's speech to produce a stored model
voice print, or template.
• Biometric technology reduces each spoken word
to segments composed of several dominant
frequencies called formants.
• Each segment has several tones that can be
captured in a digital format.
• The tones collectively identify the speaker's
unique voice print.
• Voice prints are stored in databases in a manner
similar to the storing of fingerprints or other
biometric data.
Voice Verification
• Voice verification verifies the vocal
characteristics against those associated
with the enrolled user.
• The US PORTPASS Program, deployed at
remote locations along the U.S.–Canadian
border, recognizes voices of enrolled local
residents speaking into a handset. This
system enables enrollees to cross the
border when the port is unstaffed.
Automatic Speech Recognition
• Automatic Speech Recognition systems are different from voice recognition (speaker verification) systems, although the two are often confused.
• Automatic Speech Recognition is used to translate the spoken word into a specific response.
• Its goal is simply to understand the spoken word, not to establish the identity of the speaker.
Automatic Speech Recognition
• Automatic Speech Recognition
– hands free devices, for example car mobile
hands free sets
– electronic devices, for example telephone,
PC, or ATM cash dispenser
– software applications, for example games,
educational or office software
– industrial areas, warehouses, etc.
– spoken multiple choice in interactive voice
response systems, for example in telephony
– applications for people with disabilities
Difficulties in Automatic Speech
Recognition (ASR)
• Context Variability
Mr. Wright should write to Ms. Wright right away about
his Ford or four door Honda.
• Style Variability
– isolated speech recognition is easier than continuous speech
recognition
– reading recognition is easier than conversational speech
recognition
• Speaker Variability
speaker-independent vs. speaker-dependent
• Environment Variability
background noise
Task of ASR
The task of speech recognition is to take as input
an acoustic waveform and produce as output a
string of words.
Acoustic Processing of Speech
Two important characteristics of a wave
• Frequency and Pitch
– The frequency is the number of times per second that
a wave repeats itself, or cycles.
– Unit: cycles per second are usually called Hertz
(Hz)
– The pitch is the perceptual correlate of frequency
• Amplitude and loudness
– The amplitude measures the amount of air pressure
variation.
– Loudness is the perceptual correlate of the power,
which is related to the square of the amplitude.
Acoustic Processing of Speech
Feature extraction
• Analog-to-digital conversion
– Sampling: In order to accurately measure a wave, it is
necessary to have at least two samples in each cycle
• One measuring the positive part of the wave
• The other one measuring the negative part
• Thus the maximum frequency wave that can be measured is
one whose frequency is half the sample rate.
• This maximum frequency for a given sampling rate is called
the Nyquist frequency.
– Quantization: Representing a real-valued number as
an integer.
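A small illustrative Python sketch of these two steps; the 16 kHz sample rate and 16-bit depth are arbitrary example choices, not values from the lecture:

```python
import numpy as np

sample_rate = 16000                       # samples per second (Hz), an arbitrary example value
nyquist = sample_rate / 2                 # highest representable frequency: 8000 Hz

t = np.arange(0, 0.01, 1 / sample_rate)   # 10 ms of sample times
wave = 0.5 * np.sin(2 * np.pi * 440 * t)  # a 440 Hz tone with amplitude 0.5

# Quantization: map real-valued samples in [-1, 1] to 16-bit signed integers.
quantized = np.round(wave * 32767).astype(np.int16)
print(nyquist, quantized[:5])
```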
Acoustic Processing of Speech
Spectrum
• Based on the insight of Fourier that every
complex wave can be represented as a sum of
many simple waves of different frequencies.
• Spectrum is a representation of these different
frequency components.
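A minimal NumPy sketch of this idea, using a made-up signal: a complex wave built from two simple waves, whose frequency components are recovered from the magnitude spectrum:

```python
import numpy as np

sample_rate = 8000
t = np.arange(0, 1.0, 1 / sample_rate)
# A complex wave built from two simple waves at 200 Hz and 1000 Hz.
signal = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)

spectrum = np.abs(np.fft.rfft(signal))                # magnitude of each frequency component
freqs = np.fft.rfftfreq(len(signal), 1 / sample_rate)
peaks = freqs[np.argsort(spectrum)[-2:]]              # the two strongest components
print(sorted(peaks))                                  # [200.0, 1000.0]
```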
Acoustic Processing of Speech
Smoothing
• Goal: by finding where the spectral peaks (formants) are, we can get the characteristics of different sounds, e.g., determine vowel identity.
• Linear Predictive Coding (LPC) is one of the
most common methods.
• LPC spectrum is represented by a vector of
features.
• It is possible to use LPC features directly as the
observation of HMMs.
Acoustic Processing of Speech
• There are 6 states detected in the spoken digit ZERO, i.e., states 1, 2, 3, 4, 6, and 7.
Acoustic Processing of Speech
For a given acoustic observation O = o1, o2, ..., on, the goal of speech recognition is to find the corresponding word sequence W = w1, w2, ..., wn that has the maximum posterior probability P(W|O). By Bayes’ theorem, this is equivalent to maximizing P(O|W) P(W), where P(O|W) is given by the acoustic model and P(W) by the language model.
[Figure: schematic architecture for a (simplified) speech recognizer, combining the acoustic model and the language model.]
Search Space
• Given a word-segmented acoustic sequence, list all candidates.
[Figure: word lattice of candidate words per acoustic segment, e.g., ‘bot / boat / bald / bold / bought; excessive / expensive / expressive / inactive (ik-'spen-siv); presidents / presence / presents / press ('pre-z&ns); with arc probabilities such as P('bot | bald) and P(inactive | bald).]
• Compute the most likely path.
Software and Hand-on Labs (Nov. 29)
• Task 1: Download and install one of the following
software
– Speech Filing System Tools for Speech Research
http://www.phon.ucl.ac.uk/resource/sfs/
– Praat: doing phonetics by computer
http://www.fon.hum.uva.nl/praat/
• Task 2: Download and install the Speech
Recognition Software at
http://www.download.com/Voice-Recognition/31507239_4-0.html
(Tazi speech recognition)
• Reference: The Hidden Markov Model Toolkit (HTK)
http://htk.eng.cam.ac.uk/
Introduction to Markov models
Pattern recognition problem:
Need to have good templates that are
representative of speech patterns we want
to recognize.
– How should we model the patterns?
– How can we optimize the model’s
parameters?
Markov models
• State topology of an ergodic Markov model: every state can be reached from every other state.
The initial-state probabilities for each state i are defined as
πi = P(x1 = i), 1 ≤ i ≤ N,
with the properties πi ≥ 0 and Σi=1..N πi = 1.
Modeling stochastic sequences
• State topology of a left-right Markov model: transitions go only from left to right (or stay in the same state).
For 1st-order Markov chains, the probability of state occupation depends only on the previous step (Rabiner, 1989):
P(xt = j | xt-1 = i, xt-2 = k, ...) = P(xt = j | xt-1 = i).    (12)
So, if we assume the RHS of eq. 12 is independent of time, we can write the state-transition probabilities as
aij = P(xt = j | xt-1 = i),
with the properties aij ≥ 0 and Σj=1..N aij = 1.
Weather predictor example
Let us represent the state of the weather by a 1st-order, ergodic Markov model, M, with three states:
State 1: rain
State 2: cloud
State 3: sun
and state-transition probabilities {aij} (3×3 matrix given in the original slide).
Weather predictor probability calculation
Given that today is sunny (i.e., x1 = 3), what is the probability under model M of directly observing the sequence of weather states "sun-sun-rain-cloud-cloud-sun"?
Since x1 is given, the answer is a product of transition probabilities:
P(X | M, x1 = 3) = a33 · a31 · a12 · a22 · a23.
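A short Python sketch of this calculation. The transition matrix from the original slide is not reproduced above, so the values below are the ones used in the weather example of Rabiner's 1989 tutorial; they are an assumption and may differ from the lecture's matrix:

```python
# States: 1 = rain, 2 = cloud, 3 = sun (0-based indices 0, 1, 2 below).
# Placeholder transition matrix A[i][j] = P(next = j | current = i); these are the values
# from Rabiner's 1989 tutorial weather example, NOT necessarily the lecture's matrix.
A = [[0.4, 0.3, 0.3],
     [0.2, 0.6, 0.2],
     [0.1, 0.1, 0.8]]

sequence = [2, 2, 0, 1, 1, 2]    # sun-sun-rain-cloud-cloud-sun
p = 1.0                          # x1 = sun is given, so no initial-state factor is needed
for prev, cur in zip(sequence, sequence[1:]):
    p *= A[prev][cur]
print(p)                         # a33*a31*a12*a22*a23 with the placeholder matrix
```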
Formants
• Formants are the resonant frequencies of the
vocal tract when vowels are pronounced.
• Linguists classify each type of speech sound
(called phonemes) into different categories. In
order to identify each phoneme, it is sometimes
useful to look at its spectrogram or frequency
response where one can find the characteristic
formants.
• Formant values can vary widely from person to
person, but the spectrogram reader learns to
recognize patterns which are independent of
particular frequencies and which identify the
various phonemes with a high degree of
reliability.
[Spectrograms: vowel “A” and vowel “I”.]
• Formants can be seen very clearly in a
wideband spectrogram, where they are
displayed as dark bands. The darker a
formant is reproduced in the spectrogram,
the stronger it is (the more energy there is
there, or the more audible it is):
Formants
• But there is a difference between oral vowels on
one hand, and consonants and nasal vowels on
the other.
• Nasal consonants and nasal vowels can exhibit
additional formants, nasal formants, arising from
resonance within the nasal branch.
• Consequently, nasal vowels may show one or
more additional formants due to nasal
resonance, while one or more oral formants may
be weakened or missing due to nasal
antiresonance.
Oral formants are numbered consecutively upwards from the lowest frequency.
In the example, a fragment from the wideband spectrogram shows the sequence [ins] from the beginning. Five formants, labeled F1-F5, are visible. Four (F1-F4) are visible in this [n], and there is a hint of the fifth. There are four more formants between 5000 Hz and 8000 Hz in [i] and [n], but they are too weak to show up on the spectrogram, and mostly they are also too weak to be heard.
The situation is reversed in this [s], where F4-F9 show very strongly, but there is little to be seen below F4.
Individual Differences in Vowel Production
• There are differences in individual formant
frequencies attributable to size, age, gender,
environment, and speech.
• The acoustic differences that allow us to
differentiate between various vowel productions
are usually explained by a source-filter theory.
• The source is the sound spectrum created by airflow through the glottis, which varies as the vocal folds vibrate. The filter is the vocal tract itself; its shape is controlled by the speaker.
• The three figures below (taken from Miller) illustrate how different configurations of the vocal tract selectively pass certain frequencies and not others. The first shows the configuration of the vocal tract while articulating the phoneme [i] as in the word "beet," the second the phoneme [a], as in "father," and the third [u] as in "boot." Note how each configuration uniquely affects the acoustic spectrum, i.e., the frequencies that are passed.