Automatic Speech Recognition (ASR):
A Brief Overview
Radio Rex – 1920s ASR
Statistical ASR
• i_best = argmax_i P(M_i | X)
         = argmax_i P(X | M_i) P(M_i)
  (1st term, acoustic model; 2nd term, language model)
• P(X | M_i) ≈ P(X | Q_Mi, M_i)   [Viterbi approx.]
  where Q_Mi is the best state sequence in M_i,
  approximated by a product of local likelihoods
  (Markov, conditional independence assumptions)
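As a minimal illustration of this decision rule (the scores and the word sequences below are made up, not from the slides), the argmax combines log acoustic and log language-model scores:

```python
# Hypothetical log-domain scores for three candidate models M_i:
# log P(X | M_i) from the acoustic model, log P(M_i) from the language model.
candidates = {
    "a dog is not a cat": {"log_acoustic": -120.4, "log_lm": -8.1},
    "a cat not is a dog": {"log_acoustic": -121.0, "log_lm": -14.9},
    "a dog is not a cap": {"log_acoustic": -125.7, "log_lm": -12.3},
}

# i_best = argmax_i [ log P(X | M_i) + log P(M_i) ]
def best_hypothesis(cands):
    return max(cands, key=lambda m: cands[m]["log_acoustic"] + cands[m]["log_lm"])

print(best_hypothesis(candidates))  # -> "a dog is not a cat"
```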
Automatic Speech Recognition
• Speech Production/Collection
• Pre-processing
• Feature Extraction
• Hypothesis Generation
• Cost Estimator
• Decoding
Simplified Model of Speech Production
[Diagram: source-filter model. A periodic source (vocal vibration) or a
random source (turbulence) supplies the fine spectral structure; filters
(vocal tract, nasal tract, radiation) shape the spectral envelope.]
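A minimal sketch of this source-filter idea (all parameter values are illustrative assumptions): an impulse train or white noise is passed through a single-resonance all-pole filter.

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000                                   # sample rate (Hz), assumed
n = np.arange(fs)                           # one second of samples

# Periodic source (vocal vibration): impulse train at 100 Hz
periodic = (n % (fs // 100) == 0).astype(float)
# Random source (turbulence): white noise
random_src = np.random.randn(fs)

# Filter (vocal tract): one resonance near 500 Hz with 100 Hz bandwidth
f0, bw = 500.0, 100.0
r = np.exp(-np.pi * bw / fs)
a = [1.0, -2.0 * r * np.cos(2 * np.pi * f0 / fs), r * r]

voiced = lfilter([1.0], a, periodic)        # vowel-like: fine structure + envelope
unvoiced = lfilter([1.0], a, random_src)    # fricative-like
```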
Pre-processing
Speech → Room Acoustics → Microphone → Linear Filtering → Sampling & Digitization
Issues: noise and reverb, and their effect on modeling
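A toy simulation of these degradations (the impulse response and noise level below are arbitrary assumptions), useful for seeing what later modeling has to cope with:

```python
import numpy as np

fs = 8000
speech = np.random.randn(fs)                     # stand-in for clean speech

# Room acoustics: convolution with a sparse toy impulse response (reverb)
rir = np.zeros(fs // 4)
rir[0], rir[800], rir[1600] = 1.0, 0.5, 0.25
reverberant = np.convolve(speech, rir)[:fs]

# Additive background noise picked up at the microphone
noisy = reverberant + 0.1 * np.random.randn(fs)
```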
Framewise Analysis of Speech
[Diagram: the waveform is divided into overlapping frames; Frame 1
yields feature vector X1, Frame 2 yields feature vector X2, …]
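A minimal framing sketch (the 25 ms frame / 10 ms hop values are common defaults assumed here, not stated on the slide):

```python
import numpy as np

def frames(signal, frame_len, hop):
    """Slice a waveform into overlapping analysis frames."""
    count = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(count)])

x = np.random.randn(8000)          # stand-in for one second of speech at 8 kHz
X = frames(x, frame_len=200, hop=80)   # 25 ms frames every 10 ms
print(X.shape)                     # (98, 200): one feature frame per row
```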
Feature Extraction
Spectral Analysis → Auditory Model / Orthogonalize (cepstrum)
Issues: design for discrimination; insensitivity to scaling and simple distortions
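A minimal sketch of the "orthogonalize (cepstrum)" step (real cepstrum only; practical front ends such as PLP or MFCC insert an auditory or mel warping before the log):

```python
import numpy as np

def real_cepstrum(frame, n_coeffs=13):
    """Window, log-magnitude spectrum, inverse FFT: the real cepstrum."""
    windowed = frame * np.hamming(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed)) + 1e-10   # avoid log(0)
    cepstrum = np.fft.irfft(np.log(spectrum))
    return cepstrum[:n_coeffs]    # low quefrencies ~ spectral envelope

c = real_cepstrum(np.random.randn(200))
```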
Representations are Important
• Speech waveform → network: 23% frame correct
• PLP features → network: 70% frame correct
Mel Frequency Scale
[Plot: the mel scale as a function of frequency]
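For reference, one widely used approximation of the mel scale (a standard formula, not given on the slide): roughly linear below 1 kHz, logarithmic above.

```python
import numpy as np

def hz_to_mel(f):
    # Common O'Shaughnessy approximation of the mel scale
    return 2595.0 * np.log10(1.0 + f / 700.0)

print(hz_to_mel(1000))  # ~1000 mel by construction
```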
Spectral vs Temporal Processing
• Spectral processing: analysis along the frequency axis within a frame
  (e.g., cepstral analysis)
• Temporal processing: processing along the time axis of each feature
  (e.g., mean removal)
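A minimal example of the temporal side, cepstral mean removal over an utterance:

```python
import numpy as np

def cepstral_mean_removal(C):
    """C: (n_frames, n_coeffs). Subtracting each coefficient's mean over
    time cancels fixed channel effects - temporal, not spectral, processing."""
    return C - C.mean(axis=0, keepdims=True)

C = np.random.randn(98, 13)        # stand-in for one utterance's cepstra
C_normalized = cepstral_mean_removal(C)
```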
Hypothesis Generation
[Diagram: competing hypotheses, e.g. “a cat not is a dog” vs.
“a dog is not a cat”]
Issue: models of language and task
Cost Estimation
• Distances
• -Log probabilities, from:
  – discrete distributions
  – Gaussians, mixtures
  – neural networks
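As a small illustration, the negative log likelihood of a diagonal Gaussian is a typical local cost (a sketch of the single-Gaussian case only, not mixtures or networks):

```python
import numpy as np

def gaussian_neg_log_likelihood(x, mean, var):
    """-log N(x; mean, diag(var)): a per-frame, per-state local cost."""
    return 0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

x = np.array([1.0, -0.5])
cost = gaussian_neg_log_likelihood(x, mean=np.zeros(2), var=np.ones(2))
```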
Nonlinear Time Normalization
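The classic nonlinear time-normalization technique is dynamic time warping (DTW); a minimal sketch with the usual three local steps:

```python
import numpy as np

def dtw(A, B):
    """Minimum cumulative distance aligning feature sequences A (m x d)
    and B (n x d) with a nonlinear time axis; steps (1,1), (1,0), (0,1)."""
    m, n = len(A), len(B)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = np.linalg.norm(A[i - 1] - B[j - 1])   # local distance
            D[i, j] = d + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[m, n]
```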
Decoding
Pronunciation Models
Language Models
Most likely words for largest product:
P(acoustics | words) × P(words)
P(words) = Π P(word | history)
• bigram: history is the previous word
• trigram: history is the previous 2 words
• n-gram: history is the previous n-1 words
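A toy sketch of the bigram case (the two-line training text and the add-alpha smoothing are illustrative assumptions, not from the slides):

```python
import math
from collections import Counter

def bigram_logprob(sentence, counts, vocab_size, alpha=1.0):
    """log P(words) = sum_i log P(word_i | word_{i-1}), add-alpha smoothed."""
    words = ["<s>"] + sentence.split()
    logp = 0.0
    for prev, w in zip(words, words[1:]):
        num = counts[(prev, w)] + alpha
        den = counts[prev] + alpha * vocab_size
        logp += math.log(num / den)
    return logp

counts = Counter()
for line in ["a dog is not a cat", "a cat is not a dog"]:   # toy corpus
    toks = ["<s>"] + line.split()
    counts.update(toks)                     # unigram counts
    counts.update(zip(toks, toks[1:]))      # bigram counts

print(bigram_logprob("a dog is not a cat", counts, vocab_size=6))
```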
ASR System Architecture
[Block diagram: Speech Signal → Signal Processing → Cepstrum →
Acoustic Probability Estimator (HMM state likelihoods) → Probabilities
(e.g., “z” = 0.81, “th” = 0.15, “t” = 0.03) → Decoder → Recognized
Words (e.g., “zero”, “three”, “two”); the Decoder also draws on the
Language Model and the Pronunciation Lexicon.]
HMMs for Speech
• Math from Baum and others, 1966-1972
• Applied to speech by Baker in the
original CMU Dragon System (1974)
• Developed by IBM (Baker, Jelinek, Bahl,
Mercer, …) (1970-1993)
• Extended by others in the mid-1980s
Hidden Markov model (graphical form)
[Diagram: hidden states q1 → q2 → q3 → q4 in a chain, each state
emitting its observation x1, x2, x3, x4.]
Hidden Markov Model (state machine form)
[Diagram: states q1, q2, q3 connected by transition probabilities
P(q2 | q1), P(q3 | q2), P(q4 | q3), with emission probabilities
P(x1 | q1), P(x2 | q2), P(x3 | q3).]
Markov model
P(x1, x2, q1, q2) = P(q1) P(x1 | q1) P(q2 | q1) P(x2 | q2)
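A numeric check of this factorization on a toy two-state model (all probabilities below are made up for illustration):

```python
prior = {"q1": 1.0}                                  # always start in q1
trans = {("q1", "q1"): 0.6, ("q1", "q2"): 0.4}
emit = {("q1", "A"): 0.9, ("q1", "B"): 0.1,
        ("q2", "A"): 0.2, ("q2", "B"): 0.8}

def joint(states, obs):
    """P(q1) P(x1|q1) P(q2|q1) P(x2|q2) ... for any length."""
    p = prior.get(states[0], 0.0) * emit[(states[0], obs[0])]
    for s_prev, s, x in zip(states, states[1:], obs[1:]):
        p *= trans[(s_prev, s)] * emit[(s, x)]
    return p

print(joint(("q1", "q2"), ("A", "B")))   # 1.0 * 0.9 * 0.4 * 0.8 = 0.288
```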
HMM Training Steps
• Initialize estimators and models
• Estimate “hidden” variable probabilities
• Choose estimator parameters to maximize
model likelihoods
• Assess and repeat steps as necessary
• A special case of Expectation
Maximization (EM)
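Since the slide frames HMM training as a special case of EM, below is a minimal EM loop for the simpler case of a 1-D two-component Gaussian mixture; Baum-Welch applies the same E-step/M-step pattern with state-sequence posteriors as the hidden variables.

```python
import numpy as np

def em_gmm(x, n_iter=20):
    """Minimal EM for a two-component 1-D Gaussian mixture."""
    mu = np.array([x.min(), x.max()])      # initialize estimators and models
    var = np.array([x.var(), x.var()])
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: posterior probability of each hidden component
        lik = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        post = lik / lik.sum(axis=1, keepdims=True)
        # M-step: parameters maximizing the expected log likelihood
        nk = post.sum(axis=0)
        mu = (post * x[:, None]).sum(axis=0) / nk
        var = (post * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        w = nk / len(x)
    return w, mu, var

x = np.concatenate([np.random.randn(200) - 2, np.random.randn(200) + 2])
print(em_gmm(x))
```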
Progress in 3 Decades
• From digits to 60,000 words
• From single speakers to many
• From isolated words to continuous
speech
• From no products to many products,
some systems actually saving LOTS
of money
Real Uses
• Telephone: phone company services
(collect versus credit card)
• Telephone: call centers for query
information (e.g., stock quotes,
parcel tracking)
• Dictation products: continuous
recognition, speaker dependent/adaptive
But:
• Still <97% on “yes” for telephone
• Unexpected rate of speech causes doubling
or tripling of error rate
• Unexpected accent hurts badly
• Performance on unrestricted speech at 70%
(with good acoustics)
• Don’t know when we know
• Few advances in basic understanding
Why is ASR Hard?
• Natural speech is continuous
• Natural speech has disfluencies
• Natural speech is variable over:
global rate, local rate, pronunciation
within speaker, pronunciation across
speakers, phonemes in different
contexts
Why is ASR Hard?
(continued)
• Large vocabularies are confusable
• Out-of-vocabulary words are inevitable
• Recorded speech is variable over:
room acoustics, channel characteristics,
background noise
• Large training times are not practical
• User expectations are for performance equal
to or greater than “human performance”
ASR Dimensions
• Speaker dependent, independent
• Isolated, continuous, keywords
• Lexicon size and difficulty
• Task constraints, perplexity
• Adverse or easy conditions
• Natural or read speech
Telephone Speech
• Limited bandwidth (F vs S)
• Large speaker variability
• Large noise variability
• Channel distortion
• Different handset microphones
• Mobile and handsfree acoustics
Hot Research Problems
• Speech in noise
• Multilingual conversational speech (EARS)
• Portable (e.g., cellular) ASR
• Question answering
• Understanding meetings – or at least browsing them
Hot Research Approaches
• New (multiple) features and models
• New statistical dependencies
• Multiple time scales
• Multiple (larger) sound units
• Dynamic/robust pronunciation models
• Long-range language models
• Incorporating prosody
• Incorporating meaning
• Non-speech modalities
• Understanding confidence
Multi-frame analysis
• Incorporate multiple frames as
a single observation
• LDA the most common approach
• Neural networks
• Bayesian networks (graphical models,
including Buried Markov Models)
Linear Discriminant Analysis (LDA)
[Diagram: all variables for several frames (x1, …, x5) are mapped by a
transformation to outputs (y1, y2).]
Transformation to maximize the ratio:
between-class variance / within-class variance
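A minimal sketch of that criterion: the LDA basis solves the generalized eigenproblem Sb·v = λ·Sw·v (an illustration assuming Sw is nonsingular; in practice it is regularized).

```python
import numpy as np
from scipy.linalg import eigh

def lda_transform(X, y, n_out):
    """X: (n_samples, n_features), e.g. several stacked frames; y: class
    labels.  Returns the n_out directions maximizing the ratio of
    between-class to within-class variance."""
    mean = X.mean(axis=0)
    Sw = np.zeros((X.shape[1], X.shape[1]))   # within-class scatter
    Sb = np.zeros_like(Sw)                    # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        d = (mc - mean)[:, None]
        Sb += len(Xc) * (d @ d.T)
    vals, vecs = eigh(Sb, Sw)                 # generalized eigenproblem
    return vecs[:, ::-1][:, :n_out]           # top eigenvectors = LDA basis
```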
Multi-layer perceptron
Buried Markov Models
Multi-stream analysis
• Multi-band systems
• Multiple temporal properties
• Multiple data-driven temporal filters
Multi-band analysis
Temporally distinct
features
Combining streams
Another novel approach:
Articulator dynamics
• Natural representation of context
• Production apparatus has mass, inertia
• Difficult to accurately model
• Can approximate with simple dynamics
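A toy sketch of such simple dynamics: a first-order filter whose state pursues a per-phone target value, never quite reaching it before the next switch (alpha and the target values are assumed for illustration).

```python
import numpy as np

def target_pursuit(targets, durations, alpha=0.9):
    """First-order dynamics: x <- alpha*x + (1-alpha)*target.
    alpha models the inertia of the production apparatus."""
    x, path = 0.0, []
    for target, dur in zip(targets, durations):
        for _ in range(dur):
            x = alpha * x + (1 - alpha) * target   # smooth approach to target
            path.append(x)
    return np.array(path)

traj = target_pursuit(targets=[1.0, -0.5, 0.8], durations=[30, 20, 40])
```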
Hidden Dynamic Models
“We hold these truths to be self-evident: that speech is produced
by an underlying dynamic system, that it is endowed by its production
system with certain inherent dynamic qualities, among these are
compactness, continuity, and the pursuit of target values for each
phone class, that to exploit these characteristics Hidden Dynamic
Models are instituted among men. We … solemnly publish and declare,
that these phone classes are and of right ought to be free and context
independent states …And for the support of this declaration, with a
firm reliance on the acoustic theory of speech production, we mutually
pledge our lives, our fortunes, and our sacred honor.”
John Bridle and Li Deng, 1998 Hopkins Spoken Language Workshop,
with apologies to Thomas Jefferson ...
(See http://www.clsp.jhu.edu/ws98/projects/dynamic/)
Hidden Dynamic Models
[Block diagram: a segmentation drives a target switch that selects
target values; a filter smooths the selected targets; a neural network
maps the smoothed trajectory to the speech pattern.]
Sources of Optimism
• Comparatively new research lines
• Many examples of improvements
• Moore’s Law → much more processing
• Points toward joint development of
front end and statistical components
Summary
• 2002 ASR based on 50+ years of research
• Core algorithms → mature systems, 10-30 yrs
• Deeply difficult, but tasks can be chosen
that are easier in SOME dimension
• Much more yet to do