Automatic Speech Recognition

Introduction
Readings: Jurafsky & Martin 7.1-2
HLT Survey Chapter 1
The Human Dialogue System
Computer Dialogue Systems
[Figure: dialogue-system pipeline: signal → Audition → Automatic Speech Recognition → words → Natural Language Understanding → logical form → Dialogue Management and Planning → Natural Language Generation → words → Text-to-speech → signal]
Parameters of ASR Capabilities
• Different types of tasks with different difficulties
– Speaking mode (isolated words / continuous speech)
– Speaking style (read / spontaneous)
– Enrollment (speaker-independent / speaker-dependent)
– Vocabulary (small < 20 words / large > 20,000 words)
– Language model (finite state / context sensitive)
– Perplexity (small < 10 / large > 100)
– Signal-to-noise ratio (high > 30 dB / low < 10 dB)
– Transducer (high-quality microphone / telephone)
The Noisy Channel Model
[Figure: a message passes through a noisy channel: Message + Channel = Signal]
Decoding model: find Message* = argmax P(Message | Signal)
But how do we represent each of these things?
ASR using HMMs
• Try to solve P(Message|Signal) by breaking the problem up into separate components
• Most common method: Hidden Markov Models
– Assume that a message is composed of words
– Assume that words are composed of sub-word parts (phones)
– Assume that phones have some sort of acoustic realization
– Use probabilistic models for matching acoustics to phones to words
HMMs: The Traditional View
[Figure: the words "go" (phones g, o) and "home" (phones h, o, m) form a Markov model backbone composed of phones (hidden because we don't know the correspondences), linked to acoustic observations x0…x9; each line represents a probability estimate (more later)]
HMMs: The Traditional View
[Figure: the same phone backbone and acoustic observations x0…x9, with a different alignment]
Even with the same word hypothesis, we can have different alignments. Also, we have to search over all word hypotheses.
HMMs as Dynamic Bayesian Networks
[Figure: the Markov model backbone for "go home" unrolled over time as a DBN: hidden states q0=g, q1=o, q2=o, q3=o, q4=h, q5=o, q6=o, q7=o, q8=m, q9=m, each emitting an acoustic observation x0…x9]
HMMs as Dynamic Bayesian Networks
[Figure: the same DBN unrolling of "go home"]
ASR: What is the best assignment to q0…q9 given x0…x9?
Hidden Markov Models & DBNs
[Figure: the same model drawn two ways: a DBN representation and a Markov model representation]
Parts of an ASR System
[Figure: Feature Calculation feeds Acoustic Modeling (phones such as k, @), which feeds Pronunciation Modeling (cat: k@t; dog: dog; mail: mAl; the: D&, DE; …), which feeds Language Modeling (cat dog: 0.00002; cat the: 0.0000005; the cat: 0.029; the dog: 0.031; the mail: 0.054; …); SEARCH combines them to output "The cat chased the dog"]
Parts of an ASR System
[Figure: the same pipeline, annotated: Feature Calculation produces acoustics (x_t); Acoustic Modeling maps acoustics to phones; Pronunciation Modeling maps phones to words; Language Modeling strings words together]
Feature calculation
[Figure: spectrogram (frequency vs. time)]
• Find energy at each time step in each frequency channel
• Take the inverse Discrete Fourier Transform to decorrelate the frequencies
Feature calculation
Input: the speech waveform
Output: a sequence of feature vectors, one per time step, e.g.
[-0.1, 0.3, 1.4, -1.2, 2.3, 2.6, …]
[ 0.2, 0.1, 1.2, -1.2, 4.4, 2.2, …]
[ 0.2, 0.0, 1.2, -1.2, 4.4, 2.2, …]
[-6.1, -2.1, 3.1, 2.4, 1.0, 2.2, …]
…
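A minimal sketch of this feature calculation with NumPy (framing parameters and the number of coefficients are made up for illustration): per-frame log spectral energies, then a DCT (the real, even-symmetric relative of the inverse DFT) to decorrelate the channels.

```python
import numpy as np

def features(signal, frame_len=400, hop=160, n_ceps=13):
    """Sketch: energy per frequency channel at each time step,
    then a DCT to decorrelate, giving cepstral-style coefficients."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    frames = np.array(frames) * np.hamming(frame_len)   # taper each frame
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2    # energy per channel
    log_energy = np.log(power + 1e-10)                  # compress dynamic range
    # DCT-II matrix to decorrelate the log energies
    n = log_energy.shape[1]
    k = np.arange(n)
    dct = np.cos(np.pi / n * (k[:, None] + 0.5) * np.arange(n_ceps)[None, :])
    return log_energy @ dct                             # (num_frames, n_ceps)

x = np.random.randn(16000)       # one second of fake audio at 16 kHz
print(features(x).shape)         # (98, 13)
```

A real front end would apply a mel filterbank before the log, and append delta features; this keeps only the two steps named above.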
Robust Speech Recognition
• Different schemes have been developed for dealing with noise and reverberation
– Additive noise: reduce the effects of particular frequencies
– Convolutional noise: remove the effects of linear filters (cepstral mean subtraction)
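Cepstral mean subtraction itself is tiny; a sketch (function name hypothetical):

```python
import numpy as np

def cepstral_mean_subtraction(feats):
    """Subtract the per-utterance mean of each cepstral coefficient.
    A linear channel filter adds a roughly constant offset in the
    cepstral domain, so removing the mean removes its effect."""
    return feats - feats.mean(axis=0, keepdims=True)

feats = np.random.randn(100, 13) + 5.0   # fake cepstra with a channel offset
clean = cepstral_mean_subtraction(feats)
print(np.allclose(clean.mean(axis=0), 0.0))  # True
```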
Now what?
[Figure: the sequence of feature vectors, with a question mark: how do we get from these vectors to the words "That you …"?]
Machine Learning!
[Figure: the same feature vectors, now mapped to the words "That you …" by pattern recognition with HMMs]
Hidden Markov Models (again!)
• P(state_{t+1} | state_t): pronunciation/language models
• P(acoustics_t | state_t): acoustic model
Acoustic Model
[Figure: feature vectors labeled with phones dh, a, a, t; the pooled examples of phone "a" fit a Gaussian N_a(μ, Σ), giving P(X | state = a)]
• Assume that you can label each vector with a phonetic label
• Collect all of the examples of a phone together and build a Gaussian model (or some other statistical model, e.g. neural networks)
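A sketch of this idea with NumPy, using diagonal-covariance Gaussians (the training data, labels, and function names below are made up):

```python
import numpy as np

def fit_phone_models(vectors, labels):
    """For each phone label, pool its labeled feature vectors and fit a
    diagonal-covariance Gaussian (mean and per-dimension variance)."""
    models = {}
    for phone in set(labels):
        X = vectors[np.array(labels) == phone]
        models[phone] = (X.mean(axis=0), X.var(axis=0) + 1e-6)
    return models

def log_likelihood(x, model):
    """log P(x | state = phone) under a diagonal Gaussian."""
    mu, var = model
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

# Fake labeled training vectors for two phones
rng = np.random.default_rng(0)
vecs = np.vstack([rng.normal(0, 1, (50, 13)), rng.normal(3, 1, (50, 13))])
labs = ["a"] * 50 + ["t"] * 50
models = fit_phone_models(vecs, labs)
x = rng.normal(3, 1, 13)   # a new vector that looks like a "t"
print(log_likelihood(x, models["t"]) > log_likelihood(x, models["a"]))  # True
```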
Building up the Markov Model
• Start with a model for each phone
[Figure: a single-state model for phone "a", with a self-loop and an exit arc carrying transition probabilities 1−p and p]
• Typically, we use 3 states per phone to give a minimum duration constraint, but we ignore that here…
[Figure: a three-state version, a → a → a, each state with its own self-loop and exit probabilities]
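The phone topology above can be sketched as a transition matrix (function name and the value of p are hypothetical; the extra final column is the exit out of the phone):

```python
import numpy as np

def phone_transitions(p, n_states=3):
    """Left-to-right phone model: each state self-loops with
    probability 1 - p (modeling duration) and advances with
    probability p; the last column is the exit from the phone."""
    A = np.zeros((n_states, n_states + 1))
    for s in range(n_states):
        A[s, s] = 1 - p      # stay in this state
        A[s, s + 1] = p      # move to the next state (or exit)
    return A

A = phone_transitions(0.4)
print(np.allclose(A.sum(axis=1), 1.0))  # True: each row is a distribution
```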
Building up the Markov Model
• The pronunciation model gives connections between phones and words
[Figure: phone models dh → a → t chained to form the word "that", each phone with its own self-loop (1−p_dh, 1−p_a, 1−p_t) and exit probability (p_dh, p_a, p_t)]
• Multiple pronunciations:
[Figure: a phone network with alternative paths t → {ow | ah} → m → {ey | ah} → t → ow (e.g., "tomato")]
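A minimal sketch of a pronunciation dictionary with alternatives (entries and names are hypothetical); a full system would merge the shared phones of the alternatives into a single network rather than keeping separate chains:

```python
# Each word maps to one or more phone sequences.
pron_dict = {
    "that": [["dh", "a", "t"]],
    "tomato": [["t", "ow", "m", "ey", "t", "ow"],
               ["t", "ah", "m", "ah", "t", "ow"]],
}

def word_state_sequences(word):
    """Each pronunciation becomes one chain of phone states."""
    return pron_dict[word]

for seq in word_state_sequences("tomato"):
    print(" -> ".join(seq))
```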
Building up the Markov Model
• The language model gives connections between words (e.g., a bigram grammar)
[Figure: the phone chain for "that" (dh → a → t) connects to "he" (h → iy) with probability p(he|that) and to "you" (y → uw) with probability p(you|that)]
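Bigram probabilities like p(he|that) can be estimated from counts; a sketch (unsmoothed maximum likelihood, toy corpus made up):

```python
from collections import Counter

def bigram_probs(corpus):
    """Maximum-likelihood bigram estimates P(w2 | w1) = c(w1, w2) / c(w1).
    A real system would smooth these counts for unseen bigrams."""
    words = corpus.split()
    unigrams = Counter(words[:-1])            # counts of w1 positions
    bigrams = Counter(zip(words, words[1:]))  # counts of (w1, w2) pairs
    return {bg: c / unigrams[bg[0]] for bg, c in bigrams.items()}

probs = bigram_probs("that he said that you said that he left")
print(probs[("that", "he")])   # 2/3
print(probs[("that", "you")])  # 1/3
```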
ASR as Bayesian Inference
[Figure: HMM states q1^w1, q2^w1, q3^w1 over observations x1…x3 for the phones of "that", with word transitions weighted by p(he|that) and p(you|that)]
argmax_W P(W|X)
= argmax_W P(X|W) P(W) / P(X)
= argmax_W P(X|W) P(W)
= argmax_W Σ_Q P(X,Q|W) P(W)
≈ argmax_W max_Q P(X,Q|W) P(W)
≈ argmax_W max_Q P(X|Q) P(Q|W) P(W)
ASR Probability Models
• Three probability models
– P(X|Q): acoustic model
– P(Q|W): duration/transition/pronunciation model
– P(W): language model
• Language/pronunciation models are inferred from prior knowledge
• Other models are learned from data (how?)
Parts of an ASR System
[Figure: the pipeline again, labeled with its probability models: Acoustic Modeling computes P(X|Q); Pronunciation Modeling (cat: k@t; dog: dog; mail: mAl; the: D&, DE; …) computes P(Q|W); Language Modeling (cat dog: 0.00002; cat the: 0.0000005; the cat: 0.029; the dog: 0.031; the mail: 0.054; …) computes P(W); SEARCH combines them to output "The cat chased the dog"]
EM for ASR: The Forward-Backward Algorithm
• Determine "state occupancy" probabilities
– i.e., probabilistically assign each data vector to a state
• Calculate new transition probabilities and new means & standard deviations (emission probabilities) using these assignments
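A sketch of the forward-backward state-occupancy computation for a small discrete HMM (all numbers below are made up):

```python
import numpy as np

def forward_backward(A, B, pi):
    """State-occupancy probabilities gamma[t, s] = P(q_t = s | X).
    A: transition matrix, B[t, s] = P(x_t | s), pi: initial probabilities.
    These occupancies are what EM uses to re-estimate the transition and
    emission parameters as occupancy-weighted averages."""
    T, S = B.shape
    alpha = np.zeros((T, S))
    beta = np.ones((T, S))
    alpha[0] = pi * B[0]
    for t in range(1, T):              # forward pass
        alpha[t] = (alpha[t - 1] @ A) * B[t]
    for t in range(T - 2, -1, -1):     # backward pass
        beta[t] = A @ (B[t + 1] * beta[t + 1])
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

# Toy 2-state example
A = np.array([[0.7, 0.3], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
B = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])  # P(x_t | state)
gamma = forward_backward(A, B, pi)
print(np.allclose(gamma.sum(axis=1), 1.0))  # True: occupancies sum to 1
```

A real implementation would work in log space (or rescale alpha/beta per frame) to avoid underflow on long utterances.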
Search
• When trying to find W* = argmax_W P(W|X), we need to look at (in theory)
– All possible word sequences W
– All possible segmentations/alignments of W and X
• Generally, this is done by searching the space of W
– Viterbi search: a dynamic programming approach that looks for the most likely path
– A* search: an alternative method that keeps a stack of hypotheses around
• If |W| is large, pruning becomes important
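A sketch of Viterbi search for a small discrete HMM (transition, emission, and initial probabilities made up):

```python
import numpy as np

def viterbi(A, B, pi):
    """Dynamic programming for the most likely state path.
    A: transitions, B[t, s] = P(x_t | s), pi: initial probabilities.
    Works in log space so long utterances don't underflow."""
    T, S = B.shape
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    delta = np.zeros((T, S))
    back = np.zeros((T, S), dtype=int)
    delta[0] = logpi + logB[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA  # (from_state, to_state)
        back[t] = scores.argmax(axis=0)        # best predecessor per state
        delta[t] = scores.max(axis=0) + logB[t]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):              # follow backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

A = np.array([[0.7, 0.3], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
B = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])
print(viterbi(A, B, pi))  # [0, 0, 1]
```

Beam pruning would drop states whose delta falls too far below the frame's best score before moving to the next frame.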
How to train an ASR system
• Have a speech corpus at hand
– Should have word (and preferably phone) transcriptions
– Divide it into training, development, and test sets
• Develop models of prior knowledge
– Pronunciation dictionary
– Grammar
• Train acoustic models
– Possibly realigning the corpus phonetically
How to train an ASR system
• Test on your development data (baseline)
• **Think real hard
• Figure out some neat new modification
• Retrain the system component
• Test on your development data
• Lather, rinse, repeat**
• Then, at the end of the project, test on the test data.
Judging the quality of a system
• Usually, ASR performance is judged by the word error rate:
ErrorRate = 100 × (Subs + Ins + Dels) / Nwords
REF: I WANT TO  GO HOME ***
REC: * WANT TWO GO HOME NOW
SC:  D C    S   C  C    I
100 × (1S + 1I + 1D) / 5 = 60%
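Word error rate is a word-level edit distance; a sketch reproducing the example above:

```python
def word_error_rate(ref, rec):
    """Levenshtein alignment at the word level: the minimum number of
    substitutions, insertions, and deletions, divided by the number
    of reference words."""
    r, h = ref.split(), rec.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                  # delete all remaining reference words
    for j in range(len(h) + 1):
        d[0][j] = j                  # insert all remaining recognized words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[-1][-1] / len(r)

print(word_error_rate("I WANT TO GO HOME", "WANT TWO GO HOME NOW"))  # 60.0
```

Note that WER can exceed 100% when the recognizer inserts many extra words.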
Judging the quality of a system
• Usually, ASR performance is judged by the word error rate
• This assumes that all errors are equal
– Also, there is a bit of a mismatch between the optimization criterion and the error measurement
• Other (task-specific) measures are sometimes used
– Task completion
– Concept error rate