AVICAR: Audiovisual Speech Recognition in a Car

Object Tracking and Asynchrony in Audio-Visual Speech Recognition
Mark Hasegawa-Johnson
AIVR Seminar
August 31, 2006

AVICAR is thanks to: Bowon Lee, Ming Liu, Camille Goudeseune, Suketu Kamdar, Carl Press, Sarah Borys, and to the Motorola Communications Center.
Some experiments and most good ideas in this talk are thanks to Ming Liu, Karen Livescu, Kate Saenko, and Partha Lal.
Why AVSR is not like ASR
• Use of classifiers as features
  – E.g., the output of an AdaBoost lip tracker is a feature in a face constellation
• Obstruction
  – Tongue is rarely visible, glottis never
• Asynchrony
  – Visual evidence for a word can start long before the audio evidence
Which digit is she about to say?
Why ASR is like AVSR
• Use of classifiers as features
  – E.g., neural networks or SVMs transform audio spectra into a phonetic feature space
• Obstruction
  – Lip closure “hides” tongue closure
  – Glottal stop “hides” lip or tongue position
• Asynchrony
  – Tongue, lips, velum, and glottis can be out of sync, e.g., “every” → “ervy”
Discriminative Features in Face/Lip Tracking: AdaBoost
1. Each wavelet defines a “weak classifier”:
   hi(x) = 1 iff fi > threshold, else hi(x) = 0
2. Start with equal weight for all training tokens:
   wm(1) = 1/M, 1 ≤ m ≤ M
3. For each learning iteration t:
   a. Find the feature i that minimizes the weighted training error εt.
   b. wm ↓ if token m was correctly classified, else wm ↑.
   c. αt = log((1 − εt)/εt)
4. The final “strong classifier” is
   H(x) = 1 iff Σt αt ht(x) > ½ Σt αt
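A rough sketch of this training loop, assuming the Haar wavelet responses have already been computed into an M × N array; the threshold search, variable names, and data layout are illustrative simplifications, not the AVICAR implementation:

```python
import numpy as np

def adaboost_train(f, y, n_rounds):
    """Train a 'strong classifier' from threshold-based weak classifiers.

    f: (M, N) array of Haar wavelet responses, one row per training token
    y: (M,) array of 0/1 labels (1 = lip/face region)
    Returns a list of (feature index, threshold, alpha_t).
    """
    M, N = f.shape
    w = np.full(M, 1.0 / M)                  # step 2: equal initial weights
    strong = []
    for t in range(n_rounds):                # step 3: learning iterations
        best = None
        for i in range(N):
            theta = np.median(f[:, i])       # illustrative threshold choice
            h = (f[:, i] > theta).astype(int)
            eps = float(np.sum(w[h != y]))   # weighted training error
            if best is None or eps < best[0]:
                best = (eps, i, theta, h)
        eps, i, theta, h = best
        eps = min(max(eps, 1e-10), 1 - 1e-10)
        alpha = np.log((1 - eps) / eps)      # step 3c: alpha_t = log((1 - eps_t)/eps_t)
        beta = eps / (1 - eps)
        w = np.where(h == y, w * beta, w)    # step 3b: shrink correct tokens' weights...
        w /= w.sum()                         # ...so misclassified tokens gain weight
        strong.append((i, theta, alpha))
    return strong

def margin(strong, fx):
    """AdaBoost 'margin' M_D(x) = sum_t alpha_t h_t(x) / sum_t alpha_t, in [0, 1];
    the strong classifier says 'lip' when the margin exceeds 1/2."""
    num = sum(a * float(fx[i] > theta) for i, theta, a in strong)
    den = sum(a for _, _, a in strong)
    return num / den
```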
Example Haar Wavelet Features Selected by AdaBoost
AdaBoost in a Bayesian Context
• The AdaBoost “margin”: MD(x) = Σt αt ht(x) / Σt αt
• Guaranteed range: 0 ≤ MD(x) ≤ 1
• An inverse sigmoid transform yields nearly normal distributions
Prior: Relative Position of Lips in the Face
p(r = rlips | MD(x)) ∝ p(r = rlips) p(MD(x) | r = rlips)
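One way to read the two slides above as a scoring rule: transform the margin, model it with a class-conditional Gaussian, and multiply by a spatial prior over where lips sit in the face. All distributions and parameter values below are made-up placeholders for illustration:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def logit(m, eps=1e-6):
    """Inverse sigmoid transform of the AdaBoost margin: maps (0, 1) to the real line,
    where the class-conditional distributions are modeled as (nearly) normal."""
    m = np.clip(m, eps, 1.0 - eps)
    return np.log(m / (1.0 - m))

# Assumed, illustrative models: a Gaussian likelihood for the transformed margin of a
# true lip region, and a Gaussian prior over lip position relative to the face box.
margin_lik = norm(loc=2.0, scale=1.0)                       # p(MD(x) | r = r_lips), transformed
position_prior = multivariate_normal(mean=[0.5, 0.78],      # p(r = r_lips): lips sit low
                                     cov=np.diag([0.01, 0.005]))  # and centered in the face

def lip_score(margin_value, rel_pos):
    """Unnormalized posterior p(r = r_lips | MD(x)) ∝ p(r = r_lips) p(MD(x) | r = r_lips)."""
    return position_prior.pdf(rel_pos) * margin_lik.pdf(logit(margin_value))

# Track the lips by scoring every candidate region in the face and keeping the argmax:
# best = max(candidates, key=lambda c: lip_score(c.margin, c.rel_pos))
```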
Lip Tracking: a few results
Pixel-Based Features
Pixel-Based Features: Dimension
Model-Based Correction for Head-Pose Variability
• If the head is an ellipse, its measured width wF(t) and height hF(t) are functions of roll ρ, yaw ψ, pitch φ, true height ħF, and true width wF according to …
• … which can usefully be approximated as …
Robust Correction: Linear Regression
• The additive random part of the lip width, wL(t) = wL1 + ħL cos ψ(t) sin ρ(t), is proportional to a similar additive variation in the head width, wF(t) = wF1 + ħF cos ψ(t) sin ρ(t), so we can eliminate it by orthogonalizing wL(t) to wF(t).
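A minimal sketch of that orthogonalization, assuming per-frame lip-width and face-width tracks are already available from the tracker (the function and variable names are mine, not AVICAR's):

```python
import numpy as np

def orthogonalize_lip_width(lip_width, face_width):
    """Pose-robust lip width: regress the lip-width track on the face-width track and
    keep the residual. Both tracks share an additive term proportional to
    cos(psi(t)) sin(rho(t)), so the regression cancels the head-pose variation."""
    x = face_width - face_width.mean()
    y = lip_width - lip_width.mean()
    beta = np.dot(x, y) / np.dot(x, x)       # least-squares slope
    return y - beta * x + lip_width.mean()   # component of wL(t) orthogonal to wF(t)

# e.g. wL_corrected = orthogonalize_lip_width(wL_track, wF_track)
# where wL_track and wF_track are per-frame width measurements from the tracker.
```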
WER Results from AVICAR
(Testing on the training data; 34 talkers, continuous digits)
Legend: LR = linear regression; Model = model-based head-pose compensation; LLR = log-linear regression; 13+d+dd = 13 static features; 39 = 39 static features. All systems have mean and variance normalization and MLLR.
Audio-Visual Asynchrony
For example, the tongue touches the teeth before acoustic speech onset in the word “three”; the lips are already round in anticipation of the /r/.
Audio-Visual Asynchrony: the Coupled HMM is a typical Phoneme-Viseme Model (Chu and Huang, 2002)
[Figure: a coupled HMM with an acoustic channel and a visual channel, unrolled over frames t = 1, 2, 3, …, T.]
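To make the asynchrony knob concrete: the coupled HMM's composite state is a pair (audio state, video state), and asynchrony can be capped by allowing only pairs whose indices differ by at most k, as in the “1 state async” / “2 states async” systems later in the talk. A toy enumeration (state counts and the within-phone framing are illustrative):

```python
from itertools import product

def coupled_states(n_audio, n_video, max_async):
    """Enumerate allowed composite states (a, v) of a coupled HMM in which the
    audio and video chains may drift apart by at most `max_async` states."""
    return [(a, v) for a, v in product(range(n_audio), range(n_video))
            if abs(a - v) <= max_async]

# For a 3-state-per-phone model of one phone pair:
print(len(coupled_states(3, 3, 0)))  # 3 -> fully synchronous (collapses to a single chain)
print(len(coupled_states(3, 3, 1)))  # 7 -> one state of asynchrony allowed
print(len(coupled_states(3, 3, 3)))  # 9 -> unconstrained asynchrony (full product space)
```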
A Physical Model of Asynchrony
Slide created by Karen Livescu
Articulatory Phonology [Browman & Goldstein ’90]: the following 8 tract variables are independently & asynchronously controlled:
– LIP-LOC: Protruded, Labial, Dental
– LIP-OP: CLosed, CRitical, Narrow, Wide
– TT-LOC: Dental, Alveolar, Palato-Alveolar, Retroflex
– TB-LOC: Palatal, Velar, Uvular, Pharyngeal
– TT-OP, TB-OP: CLosed, CRitical, Narrow, MidNarrow, Mid, Wide
– GLOTTIS: CLosed (stop), CRitical (voiced), Open (voiceless)
– VELUM: CLosed (non-nasal), Open (nasal)
[Figure: mid-sagittal vocal-tract diagram labeling GLO, VEL, LIP-LOC, LIP-OP, TT-LOC, TT-OP, TB-LOC, TB-OP.]
For speech recognition, we collapse these into 3 streams: lips, tongue, and glottis (LTG).
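As one concrete reading of that collapse (the grouping below is my assumption, in particular folding VELUM into the glottal stream, which is consistent with the “nas” values shown for the G stream on the later slides):

```python
# Hypothetical grouping of the 8 Browman & Goldstein tract variables into the
# three recognition streams (lips, tongue, glottis).
TRACT_VARIABLE_TO_STREAM = {
    "LIP-LOC": "lips",
    "LIP-OP": "lips",
    "TT-LOC": "tongue",
    "TT-OP": "tongue",
    "TB-LOC": "tongue",
    "TB-OP": "tongue",
    "GLOTTIS": "glottis",
    "VELUM": "glottis",  # assumption: nasality rides along with the glottal stream
}
```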
Motivation: Pronunciation variation
Slide created by Karen Livescu

word: probably; baseform: p r aa b ax b l iy
surface (actual): (2) p r aa b iy, (1) p r ay, (1) p r aw l uh, (1) p r ah b iy, (1) p r aa l iy, (1) p r aa b uw, (1) p ow ih, (1) p aa iy, (1) p aa b uh b l iy, (1) p aa ah iy

word: sense; baseform: s eh n s
surface (actual): (1) s eh n t s, (1) s ih t s

word: everybody; baseform: eh v r iy b ah d iy
surface (actual): (1) eh v r ax b ax d iy, (1) eh v er b ah d iy, (1) eh ux b ax iy, (1) eh r uw ay, (1) eh b ah iy

word: don’t; baseform: d ow n t
surface (actual): (37) d ow n, (16) d ow, (6) ow n, (4) d ow n t, (3) d ow t, (3) d ah n, (3) ow, (3) n ax, (2) d ax n, (2) ax, (1) n uw, ...

[Plot: # pronunciations / word (0–80) as a function of minimum # occurrences (0–200).]
Explanation: Asynchrony of tract variables
Based on a slide created by Karen Livescu

Dictionary form of “sense”:
  phone:  s              eh            n                s
  G:      open           critical      critical / nas   open
  T:      crit/alveolar  mid/palatal   closed/alveolar  crit/alveolar

Surface variant #1 (example of feature asynchrony): the glottal/velar stream reaches its /s/ configuration before the tongue releases the alveolar closure, producing an epenthetic [t]:
  phone:  s              eh            n                t                s
  G:      open           critical      critical / nas   open             open
  T:      crit/alveolar  mid/palatal   closed/alveolar  closed/alveolar  crit/alveolar

Surface variant #2 (example of feature asynchrony + substitution): in addition, narrow/palatal is substituted for mid/palatal and the nasality lands on the vowel:
  phone:  s              ih (nasalized)   t                s
  G:      open           critical / nas   open             open
  T:      crit/alveolar  nar/palatal      closed/alveolar  crit/alveolar
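A toy sketch of the mechanism in those tables: walk the glottal/velar and tongue streams through their dictionary targets for “sense”, let the G stream get one step ahead, and read off the surface phone implied by each (G, T) configuration. The phone lookup covers only this example and is illustrative:

```python
# Per-phone stream targets for the dictionary form of "sense" (from the table above).
G = ["open", "critical", "critical+nas", "open"]                          # glottal/velar stream
T = ["crit/alveolar", "mid/palatal", "closed/alveolar", "crit/alveolar"]  # tongue stream

# Surface phone implied by each (glottis, tongue) configuration.
PHONE = {
    ("open", "crit/alveolar"): "s",
    ("critical", "mid/palatal"): "eh",
    ("critical+nas", "closed/alveolar"): "n",
    ("open", "closed/alveolar"): "t",   # glottis ahead of the tongue -> epenthetic stop
}

def surface(alignment):
    """Map a joint alignment (list of (g_index, t_index) pairs) to surface phones."""
    phones = [PHONE[(G[g], T[t])] for g, t in alignment]
    # Collapse consecutive repeats so a configuration held across steps is one phone.
    return [p for i, p in enumerate(phones) if i == 0 or p != phones[i - 1]]

print(surface([(0, 0), (1, 1), (2, 2), (3, 3)]))          # synchronous: s eh n s
print(surface([(0, 0), (1, 1), (2, 2), (3, 2), (3, 3)]))  # G one step ahead: s eh n t s
```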
Implementation: Multi-stream DBN
Slide created by Karen Livescu
• Phone-based: q (phonetic state) generates o (observation vector)
• Articulatory feature-based: L (state of lips), T (state of tongue), and G (state of glottis) jointly generate o (observation vector)
Baseline: Audio-only phone-based HMM
Slide created by Partha Lal
[DBN variables per frame: positionInWordA ∈ {0,1,2,...}; stateTransitionA ∈ {0,1}; phoneStateA ∈ { /t/1, /t/2, /t/3, /u/1, /u/2, /u/3, … }; observation obsA.]
Baseline: Video-only phone-based HMM
Slide created by Partha Lal
[DBN variables per frame: positionInWordV ∈ {0,1,2,...}; stateTransitionV ∈ {0,1}; phoneStateV ∈ { /t/1, /t/2, /t/3, /u/1, /u/2, /u/3, … }; observation obsV.]
Audio-visual HMM without asynchrony
Slide created by Partha Lal
[DBN variables per frame: positionInWord ∈ {0,1,2,...}; stateTransition ∈ {0,1}; a single phoneState ∈ { /t/1, /t/2, /t/3, /u/1, /u/2, /u/3, … }; two observations, obsA and obsV.]
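In practice the two observation streams are combined with per-stream weights (the audio/video weights tuned on noise-specific dev sets in the CUAVE experiments below). A minimal sketch of that frame-level score, with the per-state observation likelihoods assumed to come from whatever audio and video models are in use:

```python
def fused_frame_score(log_p_obs_audio, log_p_obs_video, audio_weight):
    """Multi-stream observation score for one HMM state and one frame: a weighted
    combination of the audio and video log-likelihoods. audio_weight lies in [0, 1]
    and is tuned per noise condition; the video stream gets the complement."""
    return audio_weight * log_p_obs_audio + (1.0 - audio_weight) * log_p_obs_video

# The tuned audio weight is typically near 1 in clean speech and drops as SNR falls,
# shifting trust toward the video stream.
```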
Phoneme-Viseme CHMM
Slide created by Partha Lal
[DBN variables per frame: an audio chain (positionInWordA ∈ {0,1,2,...}, stateTransitionA ∈ {0,1}, phoneStateA ∈ { /t/1, /t/2, /t/3, /u/1, /u/2, /u/3, … }, observation obsA) coupled with a video chain (positionInWordV, stateTransitionV, phoneStateV over the same value sets, observation obsV).]
Articulatory Feature CHMM
[DBN variables per frame: three coupled chains, one per stream — L with positionInWordL ∈ {0,1,2,...}, stateTransitionL ∈ {0,1}, and values such as { /OP/1, /OP/2, /RND/1, … }; T with positionInWordT, stateTransitionT, and values such as { /CL-ALV/1, /CL-ALV/2, /MID-UV/1, … }; G with positionInWordG, stateTransitionG, and values such as { /OP/1, /OP/2, /CRIT/1, … } — together generating obsV and obsA.]
Asynchrony Experiments: CUAVE
• 169 utterances used, 10 digits each
• NOISEX speech babble added at various SNRs
• Experimental setup
  – Training on clean data; number of Gaussians tuned on clean dev set
  – Audio/video weights tuned on noise-specific dev sets
  – Uniform (“zero-gram”) language model
  – Decoding constrained to 10-word utterances (avoids language model scale/penalty tuning)
• Thanks to Amar Subramanya at UW for the video observations
• Thanks to Kate Saenko at MIT for initial baselines and audio observations
Results, part 1: Should we use video?
Answer: fusion WER < single-stream WER.
(Novelty: none. Many authors have reported this.)
[Chart: WER (%) for audio-only, video-only, and audiovisual systems in clean conditions and at 12, 10, 6, 4, and −4 dB SNR.]
Results, part 2: Should the streams be asynchronous?
Answer: asynchronous WER < synchronous WER (4% absolute at mid SNRs).
(Novelty: first phone-based AVSR with inter-phone asynchrony.)
[Chart: WER (%) for no asynchrony, 1 state async, 2 states async, and unlimited asynchrony in clean conditions and at 12, 10, 6, 4, and −4 dB SNR.]
Results, part 3: Should asynchrony be modeled using articulatory features?
Answer: articulatory feature WER = phoneme-viseme WER.
(Novelty: first articulatory feature model for AVSR.)
[Chart: WER (%) for phone-viseme and articulatory-feature systems in clean conditions and at 12, 10, 6, 4, and −4 dB SNR.]
Results, part 4: Can the AF system help the CHMM to correct mistakes?
Answer: the combination AF + PV gives the best results on this database.
Details: systems vote to determine the label of each word (NIST ROVER).
[Chart: WER on devtest, averaged across SNRs (y-axis 17–23%), for PV with 1 state async, PV with 2 states async, AF alone, ROVER over the best three systems without AF, and ROVER over the best three systems with AF. PV = phone-viseme; AF = articulatory features.]
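A simplified sketch of the word-level voting, assuming the systems' outputs are already aligned slot-by-slot (real NIST ROVER first aligns the hypotheses into a word transition network and can also weight votes by confidence; both steps are omitted here):

```python
from collections import Counter

def rover_vote(aligned_hypotheses):
    """Pick, at each word slot, the label proposed by the most systems.

    aligned_hypotheses: list of word sequences, one per system, already aligned so
    that the i-th entry of each sequence refers to the same slot.
    """
    result = []
    for slot in zip(*aligned_hypotheses):
        word, _count = Counter(slot).most_common(1)[0]
        result.append(word)
    return result

# Three systems (e.g. PV 1-state async, PV 2-state async, AF) voting on a digit string:
print(rover_vote([["five", "two", "nine"],
                  ["five", "two", "five"],
                  ["nine", "two", "nine"]]))  # -> ['five', 'two', 'nine']
```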
Conclusions
• Classifiers as features:
  – AdaBoost “margin” outputs can be used as features in a Gaussian model of facial geometry
• Head-pose correction in noise:
  – The best correction algorithm uses linear regression followed by model-based correction
• Asynchrony matters:
  – The best phone-based recognizer is a CHMM with two states of asynchrony allowed between audio and video
• Articulatory feature models complement phone models:
  – The two systems have identical WER
  – The best result is obtained when systems of both types are combined using ROVER