
CS 552/652 Speech Recognition with Hidden Markov Models

Winter 2011, Oregon Health & Science University, Center for Spoken Language Understanding. John-Paul Hosom. Lecture 2: January 5

Induction, DTW, and Issues in ASR

1

Induction & Dynamic Programming

Induction (from Floyd & Beigel, The Language of Machines, pp. 39-66)

• Technique for proving theorems, used in Hidden Markov Models to guarantee "optimal" results.

• Understand induction by doing example proofs…

• Suppose P(n) is a statement about a number n, and we want to prove that P(n) is true for all n ≥ 0.

• Inductive proof: Show both of the following:

Base case: P(0) is true
Induction: (∀ n ≥ 0) [P(n) ⇒ P(n+1)]

In the inductive case, we want to show that if (assuming) P is true for n, then it must be true for n+1. We never prove P is true for any specific value of n other than 0.

If both cases are shown, then P(n) is true for all n ≥ 0.

2

Induction & Dynamic Programming

Example:

Prove that

$\sum_{k=0}^{n} k = \frac{1}{2} n(n+1)$

Step 1: Prove the base case: for n = 0,

$\sum_{k=0}^{0} k = 0 = \frac{1}{2} \cdot 0 \cdot (0+1)$

Step 2: Prove the inductive case (if true for n, true for n+1): show that if

$\sum_{k=0}^{n} k = \frac{1}{2} n(n+1)$

then

$\sum_{k=0}^{n+1} k = \frac{1}{2} (n+1)((n+1)+1)$

(In other words, show that if true for n, then true for n+1.)

Step 2a: assume that $\sum_{k=0}^{n} k = \frac{1}{2} n(n+1)$ is true for some fixed value of n.

3

Induction & Dynamic Programming

Step 2b: extend the equation to the next value for n:

$\sum_{k=0}^{n+1} k = (n+1) + \sum_{k=0}^{n} k$
$\quad = (n+1) + \frac{1}{2} n(n+1)$  (from 2a)
$\quad = \frac{1}{2}(n^2 + n) + \frac{1}{2}(2n + 2)$  (algebra)
$\quad = \frac{1}{2}(n^2 + 3n + 2)$
$\quad = \frac{1}{2}(n+1)(n+2)$
$\quad = \frac{1}{2}(n+1)((n+1)+1)$

We have now shown $\sum_{k=0}^{n+1} k = \frac{1}{2}(n+1)((n+1)+1)$, what we wanted to show at the beginning of Step 2.

• We proved the case for (n+1), assuming that the case for n is true.

• If we look at the base case (n=0), we can show truth for n=0.

• Given that the case for n=0 is true, then the case for n=1 is true.
• Given that the case for n=1 is true, then the case for n=2 is true.

(etc.)

• By proving the base case and the inductive step, we prove for all n ≥ 0.

4

Induction & Dynamic Programming

Dynamic Programming or Inductive technique: "The term was originally used in the 1940s by Richard Bellman to describe the process of solving problems where one needs to find the best decisions one after another. By 1953, he had refined this to the modern meaning, which refers specifically to nesting smaller decision problems inside larger decisions."1

Two approaches:

1. Top Down: Start out by trying to compute the final answer. To do that, we need to solve sub-problems. Each sub-problem requires solving additional sub-problems, until reaching the "base case". The process can be sped up by storing ("memoizing") answers to sub-problems.

2. Bottom Up: Start out by solving base cases, then "building" upon those results to solve larger sub-problems, until reaching the final answer.

1 From Wikipedia article on Dynamic Programming; emphasis mine.

5

Induction & Dynamic Programming

Example: Compute the Fibonacci sequence {0 1 1 2 3 5 8 13 21 …} where F(n) = F(n-1) + F(n-2)

Top-Down pseudocode:1

memoized[0] = 0
memoized[1] = 1
maxMemoized = 1
int fibonacci(n) {
    if (n <= maxMemoized)
        return(memoized[n])
    else {
        f = fibonacci(n-1) + fibonacci(n-2)
        memoized[n] = f
        maxMemoized = n
        return(f)
    }
}

1 Based on Wikipedia article on Dynamic Programming

6
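The top-down pseudocode can be rendered as runnable Python; a minimal sketch in which a dict stands in for the memoized array (names are illustrative):

```python
# Top-down (memoized) Fibonacci: solve sub-problems recursively,
# storing ("memoizing") each answer so it is computed only once.
memo = {0: 0, 1: 1}

def fibonacci(n):
    if n not in memo:
        memo[n] = fibonacci(n - 1) + fibonacci(n - 2)
    return memo[n]

print([fibonacci(i) for i in range(9)])  # [0, 1, 1, 2, 3, 5, 8, 13, 21]
```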

Induction & Dynamic Programming

Example: Compute the Fibonacci sequence {0 1 1 2 3 5 8 13 21 …} where F(n) = F(n-1) + F(n-2)

Bottom-Up pseudocode:1

int fibonacci(n) {
    if (n == 0) return(0)
    if (n == 1) return(1)
    fAtNMinusTwo = 0
    fAtNMinusOne = 1
    for (idx = 2; idx <= n; idx++) {
        f = fAtNMinusTwo + fAtNMinusOne
        fAtNMinusTwo = fAtNMinusOne
        fAtNMinusOne = f
    }
    return(f)
}

We will be using the Bottom-Up approach in this class.

1 Based on Wikipedia article on Dynamic Programming

7
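The bottom-up pseudocode can likewise be rendered as runnable Python (a minimal sketch; names are illustrative):

```python
# Bottom-up Fibonacci: start from the base cases F(0)=0 and F(1)=1,
# then build upward, keeping only the two most recent values.
def fibonacci(n):
    if n == 0:
        return 0
    f_minus_two, f_minus_one = 0, 1
    for _ in range(2, n + 1):
        f_minus_two, f_minus_one = f_minus_one, f_minus_two + f_minus_one
    return f_minus_one

print(fibonacci(8))  # 21
```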

Induction & Dynamic Programming

"Greedy Algorithm": Make a locally-optimum choice going forward at each step, hoping (but not guaranteeing) that the globally-optimum solution will be found at the last step.

Example: Travelling Salesman Problem: Given a number of cities, what is the shortest route that visits each city exactly once and then returns to the starting city?

[Figure: map of five cities (Vancouver, Gresham, Hillsboro, Bend, Salem) with pairwise distances]

Greedy-algorithm route: 26 + 21 + 58 + 132 + 183 = 420
Better solution: 26 + 21 + 146 + 132 + 53 = 378

8
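The greedy nearest-neighbor strategy can be compared against exhaustive O(n!) search in a few lines of Python. The distance matrix here is a made-up example (not the slide's map), chosen so that the greedy tour is visibly suboptimal:

```python
from itertools import permutations

# Hypothetical symmetric distance matrix for 4 cities, starting at city 0.
dist = [[0, 1, 4, 10],
        [1, 0, 2, 4],
        [4, 2, 0, 3],
        [10, 4, 3, 0]]

def tour_length(order):
    # Total distance of the round trip 0 -> ... -> 0.
    stops = [0] + list(order) + [0]
    return sum(dist[a][b] for a, b in zip(stops, stops[1:]))

def greedy_tour(n=4):
    # At each city, visit the nearest unvisited city next.
    tour, unvisited = [0], set(range(1, n))
    while unvisited:
        nxt = min(unvisited, key=lambda c: dist[tour[-1]][c])
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour[1:]

best = min(permutations(range(1, 4)), key=tour_length)
print(tour_length(greedy_tour()), tour_length(best))  # greedy: 16, optimal: 12
```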

Induction & Dynamic Programming

Exhaustive solution: compute the distance of all possible routes, and select the shortest. Time required is O(n!), where n is the number of cities. With even moderate values of n, this solution is impractical.

Greedy Algorithm solution: At each city, the next city to visit is the unvisited city nearest to the current city. This process does not guarantee that the globally-optimum solution will be found, but it is a fast solution: O(n²).

Dynamic-Programming solution: Does guarantee that the globally-optimum solution will be found, because it relies on induction. For the Travelling Salesman problem, the solution1 is O(n² 2^(n-1)). For speech problems, the dynamic-programming solution is O(n²T), where n is the number of states (not used in DTW but used in HMMs) and T is the number of time frames.

1 Bellman, R. "Dynamic Programming Treatment of the Travelling Salesman Problem," Journal of the ACM (JACM), vol. 9, no. 1, January 1962, pp. 61-63.

9

Dynamic Time Warping (DTW)

• Goal: Given two utterances (A) and (B), find the "best" alignment between pairs of frames from each utterance.

[Figure: alignment matrix with time (frame) of (A) on one axis and time (frame) of (B) on the other] The path through this matrix shows the best pairing of frames from utterance A with utterance B; this path can be considered the best warping between A and B.

10

Dynamic Time Warping (DTW)

• Dynamic Time Warping:
 Requires a measure of "distance" between 2 frames of speech, one frame from utterance A and one from utterance B.
 Requires heuristics about allowable transitions from one frame in A to another frame in A (and likewise for B).
 Uses a dynamic programming algorithm to find the best warping.
 Can get a total "distortion score" for the best warped path.

• Distance:
 Measure of dissimilarity of two frames of speech

• Heuristics:
 Constrain begin and end times to be (1,1) and (T_A, T_B)
 Allow only monotonically increasing time
 Don't allow too many frames to be skipped
 Can express in terms of "paths" with "slope weights"

11

Dynamic Time Warping (DTW)

• Does not require that both patterns have the same length.

• We may refer to one speech pattern as the "input" and the other speech pattern as the "template", and compare the input with the template.

• For speech, we divide the speech signal into equally-spaced frames (e.g. 10 msec) and compute one set of features per frame. The local distance measure is the distance between features at a pair of frames (one from A, one from B). The local distance between frames is called d. The global distortion from the beginning of the utterance until the current pair of frames is called D.

• DTW can also be applied to related speech problems, such as matching up two similar sequences of phonemes.

• Algorithm: Similar in some respects to Viterbi search, which will be covered later.

12

Dynamic Time Warping (DTW)

• Heuristics:

[Figure: two example sets of allowable paths with slope weights]
Heuristic 1: P1=(1,1)(1,0), P2=(1,1), P3=(1,1)(0,1), with slope weights of ½ on each step of P1 and P3, and 1 on P2
Heuristic 2: P1=(1,0), P2=(1,1), P3=(1,2)

• Path P and slope weight m determined heuristically
• Paths considered backward from the target frame
• Larger weight values for less preferable paths
• Paths always go up, right (monotonically increasing in time)
• Only evaluate P if all frames have meaningful values (e.g. don't evaluate a path if one frame is at time < 1, because there is no data for time < 1).

13

Dynamic Time Warping (DTW)

• Algorithm:

1. Initialization (time 1 is the first time frame):

$D(1,1) = d(1,1)$

2. Recursion:

$D(x,y) = \min_{(x',y')} \left[ D(x',y') + \zeta((x',y'),(x,y)) \right]$  (ζ = zeta)

$\zeta((x',y'),(x,y)) = \sum_{l=1}^{L} d(x_l, y_l)\, m(l)$

where l is the index along a valid path from (x', y') to (x, y), from 1 to L.

3. Termination:

$result = \frac{D(T_x, T_y)}{M}, \qquad M = \sum_{k=1}^{T} m(k)$

(M is sometimes defined as $T_x$, or $T_x + T_y$, or $(T_x^2 + T_y^2)^{1/2}$; a convenient value for M is the length of the template.)

14
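The three-step algorithm above can be sketched in runnable Python. This minimal version uses the simple paths (1,0), (1,1), (0,1) with all slope weights equal to 1, and normalizes by the template length; the function and variable names are illustrative:

```python
import math

def dtw(a, b, d=lambda u, v: abs(u - v)):
    """a: input frames; b: template frames; d: local distance measure.
    Returns the normalized global distortion D(Ta,Tb) / len(template)."""
    Ta, Tb = len(a), len(b)
    D = [[math.inf] * Tb for _ in range(Ta)]
    D[0][0] = d(a[0], b[0])                  # 1. initialization
    for x in range(Ta):
        for y in range(Tb):
            if x == 0 and y == 0:
                continue
            prev = []                        # 2. recursion: min over predecessors
            if x > 0: prev.append(D[x-1][y])             # path (1,0)
            if y > 0: prev.append(D[x][y-1])             # path (0,1)
            if x > 0 and y > 0: prev.append(D[x-1][y-1]) # path (1,1)
            D[x][y] = min(prev) + d(a[x], b[y])
    return D[Ta-1][Tb-1] / Tb                # 3. termination: normalize by M
```

For example, `dtw([1, 2, 3, 4], [1, 2, 2, 3, 4])` returns 0.0, since the second sequence is a perfect warping of the first.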

Dynamic Time Warping (DTW)

• Example:

[Figure: grid of local distances d(x,y) with the cumulative distortions D(1,1) through D(7,6) filled in cell by cell; heuristic paths P1=(1,0), P2=(1,1), P3=(1,2), all slope weights 1; begin at (1,1), end at (7,6)]

normalized distortion = 8/6 = 1.33
normalized distortion = 8/7 = 1.14

15

Dynamic Time Warping (DTW)

• Can we do local look-ahead to speed up the process?

• For example, at (1,1) we know that there are 3 possible points to go to: (2,1), (2,2), (2,3). Can we compute the cumulative distortion for those 3 points, select the minimum (e.g. (2,2)), and proceed only from that best point?

• No, because the (global) end-point constraint (end at (7,6)) may alter the path. We can't make local decisions with a global constraint.

• In addition, we can't do this because often there are many ways to end up at a single point, and we don't know all the ways of getting to a point until we visit it and compute its cumulative distortion.

• This look-ahead would transform DTW from a dynamic-programming algorithm into a greedy algorithm.

16

Dynamic Time Warping (DTW)

• Example:

[Figure: grid of local distances with cumulative distortions; heuristic paths P1=(1,0), P2=(1,1), P3=(0,1), all slope weights 1; begin at (1,1), end at (6,6)]

D(1,1) = 1  D(2,1) = 3  D(3,1) = 6  D(4,1) = 9  …
D(1,2) = 3  D(2,2) = 2  D(3,2) = 10 D(4,2) = 7  …
D(1,3) = 5  D(2,3) = 10 D(3,3) = 11 D(4,3) = 9  …
D(1,4) = 7  D(2,4) = 7  D(3,4) = 9  D(4,4) = 10 …
D(1,5) = 10 D(2,5) = 9  D(3,5) = 10 D(4,5) = 10 …
D(1,6) = 13 D(2,6) = 11 D(3,6) = 12 D(4,6) = 12 …

normalized distortion = 13/6 = 2.17

17

Dynamic Time Warping (DTW)

• Example (left as an exercise):

[Figure: grid of local distances; heuristic paths P1=(1,1)(1,0), P2=(1,1), P3=(1,1)(0,1), with slope weights of ½ on each step of P1 and P3, and 1 on P2; begin at (1,1), end at (6,6); the cumulative distortions D(1,1) through D(6,6) are to be filled in]

18

Dynamic Time Warping (DTW)

• Local Distance Measures at one time frame t:

Need to compare two frames of speech and measure how similar or dissimilar they are. Each frame has one feature vector: x_t for the features from one signal and y_t for the other signal.

A distance measure should have the following properties:

$0 \le d(x_t, y_t)$; $d(x_t, y_t) = 0$ iff $x_t = y_t$  (positive definiteness)
$d(x_t, y_t) = d(y_t, x_t)$  (symmetry)
$d(x_t, y_t) \le d(x_t, z_t) + d(z_t, y_t)$  (triangle inequality)

A distance measure should also, for speech, correlate well with perceived distance. The spectral domain is better than the time domain for this; a perceptually-warped spectral domain is even better.

19

Dynamic Time Warping (DTW)

• Local Distance Measures at one time frame t:

Simple solution: "city-block" distance (in log-spectral space) between two sets of signals represented by (vector) features x_t and y_t:

$d(x_t, y_t) = \sum_{f=0}^{F-1} \left| x_t(f) - y_t(f) \right|$

where x_t(f) is the log power spectrum of signal x at time t and frequency f, with maximum frequency F-1.

Also the Euclidean distance:

$d(x_t, y_t) = \left[ \sum_{f=0}^{F-1} \left( x_t(f) - y_t(f) \right)^2 \right]^{1/2}$

f can indicate simply a feature index, which may or may not correspond to a frequency band, e.g. 13 cepstral features c_0 through c_12.

Other distance measures: Itakura-Saito distance (also called Itakura-Saito distortion), COSH distance, likelihood ratio distance, etc.

20
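Both local distances above are one-liners in Python (a minimal sketch over plain lists of features):

```python
# City-block (L1) and Euclidean (L2) distances between two feature
# vectors, one frame from each utterance.
def city_block(x, y):
    return sum(abs(xf - yf) for xf, yf in zip(x, y))

def euclidean(x, y):
    return sum((xf - yf) ** 2 for xf, yf in zip(x, y)) ** 0.5

xt, yt = [1.0, 2.0, 3.0], [2.0, 2.0, 5.0]
print(city_block(xt, yt), euclidean(xt, yt))  # 3.0 and sqrt(5) ~ 2.236
```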

Dynamic Time Warping (DTW)

• Termination Step

The termination step takes the value at the endpoint (the score of the least distortion over the entire utterance) and divides it by a normalizing factor. The normalizing factor is only necessary in order to compare the DTW result for this template with DTW results from other templates.

So, one method of normalizing is to divide by the number of frames in the template. This is quick, easy, and effective for speech recognition when comparing results across templates.

Another method is to divide by the length of the path taken, adjusting the length by the slope weights at each transition. This requires going back and summing the slope values, so it's slower. But sometimes it's more appropriate.

21

Dynamic Time Warping (DTW)

• DTW can be used to perform ASR by comparing input speech with a number of templates; the template with the lowest normalized distortion is most similar to the input and is selected as the recognized word.

• DTW provides both a historical and a logical basis for studying Hidden Markov Models… Hidden Markov Models (HMMs) can be seen as an advancement over DTW technology.

• "Sneak preview":
 DTW compares input speech against a fixed template (local distortion measure); HMMs compare input speech against a "probabilistic template."
 The search algorithm used in HMMs is also similar, but instead of a fixed set of possible paths, there are probabilities of all possible paths.

• Remaining question: what are the x_t and y_t features in Slide 19?

22

Features of the Speech Signal: (Log) Power Spectrum

"Energy" or "Intensity":

Intensity is sound energy transmitted per second (power) through a unit area in a sound field. [Moore p. 9]
Intensity is proportional to the square of the pressure variation. [Moore p. 9]

normalized energy: $E_t = \frac{1}{N} \sum_{n=1}^{N} x_n^2 \propto$ intensity

where x_n = signal x at time sample n, and N = number of time samples (in the frame at time t).

23
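The normalized-energy formula is a one-line computation; a minimal Python sketch:

```python
# Normalized energy of one frame of speech: the mean of the squared
# samples, which is proportional to intensity.
def normalized_energy(x):
    return sum(s * s for s in x) / len(x)

print(normalized_energy([0.5, -0.5, 0.5, -0.5]))  # 0.25
```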

Features of the Speech Signal: (Log) Power Spectrum

"Energy" or "Intensity": the human auditory system is better suited to relative scales:

energy (bels) $= \log_{10}(I_1 / I_0)$

energy (decibels, dB) $= 10 \cdot \log_{10}(I_1 / I_0)$

I_0 is a reference intensity… if the signal becomes twice as powerful (I_1/I_0 = 2), then the energy level is 3 dB (3.0103 dB to be more precise).

A typical value for I_0 is 20 μPa. 20 μPa is close to the average human absolute threshold for a 1000-Hz sinusoid.

24
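The decibel conversion above can be sketched in a few lines of Python; I_0 defaults to 1.0 here, since the absolute signal strength is typically unknown:

```python
import math

# Energy on a relative (decibel) scale: 10 * log10(I1 / I0).
def decibels(i1, i0=1.0):
    return 10.0 * math.log10(i1 / i0)

print(round(decibels(2.0), 4))  # doubling the power: 3.0103 dB
```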

Features of the Speech Signal: (Log) Power Spectrum

What makes one phoneme, /aa/, sound different from another phoneme, /iy/?

Different shapes of the vocal tract… /aa/ is produced with the tongue low and in the back of the mouth; /iy/ is produced with the tongue high and toward the front.

The different shapes of the vocal tract produce different “resonant frequencies”, or frequencies at which energy in the signal is concentrated. (Simple example of resonant energy: a tuning fork may have resonant frequency equal to 440 Hz or “A”).

Resonant frequencies in speech (or other sounds) can be displayed by computing a “power spectrum” or “spectrogram,” showing the energy in the signal at different frequencies. 25

Features of the Speech Signal: (Log) Power Spectrum

A time-domain signal can be expressed in terms of sinusoids at a range of frequencies using the Fourier transform:

$X(f) = \int_{-\infty}^{\infty} x(t)\, e^{-j 2\pi f t}\, dt = \int_{-\infty}^{\infty} x(t) \left[ \cos(2\pi f t) - j \sin(2\pi f t) \right] dt$

where x(t) is the time-domain signal at time t, f is a frequency value from 0 to 1, and X(f) is the spectral-domain representation.

$e^{j\theta} = \cos(\theta) + j \sin(\theta)$

26

Features of the Speech Signal: (Log) Power Spectrum

Since samples are obtained at discrete time steps, and since only a finite section of the signal is of interest, the discrete Fourier transform is more useful:

$X_t(n) = \frac{1}{N} \sum_{t=0}^{N-1} x(t)\, e^{-j 2\pi t n / N} = \frac{1}{N} \sum_{t=0}^{N-1} x(t) \left[ \cos\!\left(\frac{2\pi t n}{N}\right) - j \sin\!\left(\frac{2\pi t n}{N}\right) \right]$ for $n = 0, \ldots, N-1$

in which x(t) is the amplitude at time sample t, n is a frequency value from 0 to N-1, N is the number of samples or frequency points of interest, and X_t(n) is the spectral-domain representation of x, beginning at time point t.

27
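The discrete Fourier transform above can be written directly in Python; an O(N²) sketch using only the standard library (real DFT code would use an FFT), with the 1/N scaling used on the slide:

```python
import cmath

# Direct discrete Fourier transform: X(n) = (1/N) * sum_t x(t) e^{-j2*pi*t*n/N}.
def dft(x):
    N = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * t * n / N)
                for t in range(N)) / N
            for n in range(N)]

print(dft([1.0, 0.0, 0.0, 0.0]))  # an impulse spreads equal energy into every bin
```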

Features of the Speech Signal: (Log) Power Spectrum

The magnitude and phase of the spectral representation are:

magnitude: $\left| X_t(n) \right| = \left( X_t^{real}(n)^2 + X_t^{imag}(n)^2 \right)^{0.5}$  (absolute value of complex number)

phase: $\angle X_t(n) = \tan^{-1}\!\left( \frac{X_t^{imag}(n)}{X_t^{real}(n)} \right)$

Phase information is generally considered not important in understanding speech, and the energy (or power) of the magnitude of X_t(n) on the decibel scale provides the most relevant information:

$LogPowerSpectrum\ X_t(n) = 10 \cdot \log_{10} \left( X_t^{real}(n)^2 + X_t^{imag}(n)^2 \right)$

Note: we usually don't worry about the reference intensity I_0 (assume a value of 1.0); the signal strength (in μPa) is unknown anyway.

28
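Putting the pieces together, the log power spectrum of one frame can be sketched in standard-library Python; the 1/N scaling and the I_0 = 1.0 assumption follow the slides:

```python
import cmath
import math

# Log power spectrum: 10 * log10(real^2 + imag^2) for each DFT bin,
# computed from a direct (1/N-scaled) DFT of the frame.
def log_power_spectrum(x):
    N = len(x)
    spectrum = []
    for n in range(N):
        X = sum(x[t] * cmath.exp(-2j * cmath.pi * t * n / N)
                for t in range(N)) / N
        power = X.real ** 2 + X.imag ** 2
        spectrum.append(10.0 * math.log10(power) if power > 0 else -math.inf)
    return spectrum
```

For a pure sinusoid exactly on a DFT bin, the peak of the log power spectrum lands at that bin, with the remaining bins far below it.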

Features of the Speech Signal: (Log) Power Spectrum

In DTW, what is x_t (Slide 19)? It's a representation of the speech signal at one point in time, t. For DTW, we'll use the log power spectrum at time t.

The speech signal is divided into T frames (for each time point 1 … T); typically one frame is 10 msec. At each frame, a feature vector of the speech signal is computed. These features should provide the ability to discriminate between phonemes. For now we'll use spectral features… later, we'll switch to "cepstral" features.

[Figure: spectrogram with T = 80 frames; each vertical line delineates one feature vector x_t at time t]

29

Dynamic Time Warping (DTW) Project

• First project: Implement the DTW algorithm, perform automatic speech recognition.

• "Template" code is available at the class web site to read in features and provide some context and a starting point.

• The features you will be given are "real," in that they are spectrogram values (energy levels at different frequencies) from utterances of "yes" and "no" sampled every 10 msec.

• For a local distance measure for each frame, use the Euclidean distance.

• Use the following heuristic paths, all with slope weights of 1: P1=(1,1)(1,0), P2=(1,1), P3=(1,1)(0,1)

• Give thought to the representation of paths in your code… make your code easily changed to specify new paths AND be able to use slope weights. (This will affect your grade.)

30

Dynamic Time Warping (DTW) Project

• Align each pair of files, and print out the normalized distortion score:

yes_template.txt  input1.txt
no_template.txt   input1.txt
yes_template.txt  input2.txt
no_template.txt   input2.txt
yes_template.txt  input3.txt
no_template.txt   input3.txt

• Then, use the results to perform rudimentary ASR…
(1) is input1.txt more likely to be "yes" or "no"?
(2) is input2.txt more likely to be "yes" or "no"?
(3) is input3.txt more likely to be "yes" or "no"?

• You may have trouble along the way… good code doesn't always produce an answer. Can you add to or modify the paths to produce an answer for all three inputs? If so, show the modifications and the new output.

31

Dynamic Time Warping (DTW) Project

• List 3 reasons why you wouldn't want to rely on DTW for all of your ASR needs…

• Due on January 19 (Wednesday, 2 weeks from now); send
• your source code
• recognition results (minimum normalized distortion scores for each comparison, as well as the best time warping between the two inputs) using the specified paths
• 3 reasons why you wouldn't want to rely on DTW…
• results using the specifications given here, and results using any necessary modifications to provide an answer for all three inputs

to 'hosom' at cslu  ogi  edu; late responses generally not accepted.

32

32

Issues in Developing ASR Systems

There are a number of issues that impact the performance of an automatic speech recognition (ASR) system:

• Type of Channel
 Microphone signal different from telephone signal; "land-line" telephone signal different from cellular signal.
 Channel characteristics: pick-up pattern (omni-directional, unidirectional, etc.), frequency response, sensitivity, noise, etc.
 Typical channels:
   desktop boom mic: unidirectional, 100 to 16000 Hz
   hand-held mic: super-cardioid, 60 to 20000 Hz
   telephone: unidirectional, 300 to 8000 Hz
 Training on data from one type of channel automatically "learns" that channel's characteristics; switching channels degrades performance.

33

Issues in Developing ASR Systems

Speaker Characteristics

 Because of differences in vocal tract length, male, female, and children’s speech are different.

 Regional accents are expressed as differences in resonant frequencies, durations, and pitch.

 Individuals have resonant frequency patterns and duration patterns that are unique (allowing us to identify speaker).

 Training on data from one type of speaker automatically "learns" that group's or person's characteristics, and makes recognition of other speaker types much worse.

 Training on data from all types of speakers results in lower performance than could be obtained with speaker-specific models.

34

Issues in Developing ASR Systems

Speaking Rate

 Even the same speaker may vary the rate of speech.

 Most ASR systems require a fixed window of input speech.

 Formant dynamics change with different speaking rates.

 ASR performance is best when tested on same rate of speech as training data.

 Training on a wide variation in speaking rate results in lower performance than could be obtained with duration specific models.

35

Issues in Developing ASR Systems

Noise

 Two types of noise: additive, convolutional  Additive: e.g. white noise (random values added to waveform)  Convolutional: filter (additive values in log spectrum)  Techniques for removing noise: RASTA, Cepstral Mean Subtraction (CMS)  (Nearly) impossible to remove all noise while preserving all speech (nearly impossible to separate speech from noise)  Stochastic training “learns” noise as well as speech; if noise changes, performance degrades.

36

Issues in Developing ASR Systems

Vocabulary

 Vocabulary must be specified in advance (can’t recognize new words)  Pronunciation of each word must be specified exactly (phonetic substitutions may degrade performance)  Grammar: either very simple but with likelihoods of word sequences, or highly structured  Reasons for pre-specified vocabulary, grammar constraints: • phonetic recognition so poor that confidence in each recognized phoneme usually very low.

• humans often speak ungrammatically or disfluently.

37

Issues in Developing ASR Systems Comparing Human and Computer Performance

Human performance:
• Large-vocabulary corpus (1995 CSR Hub-3) consisting of North American business news recorded with 3 microphones. Average word error rate of 2.2%, best word error rate of 0.9%, "committee" error rate of 0.8%.
• Typical errors: "emigrate" vs. "immigrate"; most errors due to inattention.

Computer performance:
• Similar large-vocabulary corpus (1998 Broadcast News Hub-4)
• Best performance of 13.5% word error rate (for < 10× real time, best performance of 16.1%), a "committee" error rate of 10.6%
• More recent focus on natural speech… best error rates of ≈ 20%

This is consistent with results from other tasks: a general order-of-magnitude difference between human and computer performance; the computer doesn't generalize to new conditions.
38