Audiovisual-to-articulatory speech inversion using Hidden Markov Models
Athanassios Katsamanis, George Papandreou, Petros Maragos
School of E.C.E., National Technical University of Athens, Athens 15773, Greece
We use multistream HMMs. The visual-to-articulatory mapping is expected to be nonlinear, so instead of a single global linear mapping we use a state-dependent, piecewise linear one. The visual stream is incorporated following the audiovisual ASR paradigm: audio observations y_a (spectral characteristics/MFCC) and visual observations y_v (3-D marker coordinates) are combined with stream weights w_a and w_v, and the multistream HMM determines the state sequence over time (phone models /p1/, /p2/, ...).
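As an illustration of the stream combination, a minimal Python sketch of the weighted observation score at one multistream HMM state is given below; the single-Gaussian stream models and all function and variable names are assumptions made for the example, not the poster's actual implementation.

from scipy.stats import multivariate_normal

def multistream_log_score(y_a, y_v, state, w_a=0.7, w_v=0.3):
    # y_a: audio observation (e.g. an MFCC vector)
    # y_v: visual observation (e.g. 3-D marker coordinates)
    # state: per-stream Gaussian parameters of one HMM state (illustrative)
    ll_audio = multivariate_normal.logpdf(y_a, state["mu_a"], state["cov_a"])
    ll_visual = multivariate_normal.logpdf(y_v, state["mu_v"], state["cov_v"])
    # Audiovisual-ASR-style stream weighting: exponents on the stream
    # likelihoods become weights on the per-stream log-likelihoods.
    return w_a * ll_audio + w_v * ll_visual

The state sequence is then obtained by Viterbi decoding with these combined scores.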
At time t, in state i, the audiovisual observations are related to the articulatory parameters by a state-dependent linear model:

$y_t = A_i x_t + \epsilon_t$
We apply CCA to train the linear mapping at each HMM state between the audiovisual and the articulatory data; zero states correspond to the case of a single global linear model.

Why Canonical Correlation Analysis (CCA)?
- It leads to optimal reduced-rank linear regression models.
- It gives improved predictive performance in the case of limited data (see the sketch below).
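A minimal numpy sketch of a CCA-based reduced-rank regression fit for one HMM state follows, assuming data matrices X (articulatory) and Y (audiovisual) with frames as rows; the function name, the whitening via matrix square roots, and the rank argument are illustrative choices, not the poster's exact training procedure.

import numpy as np
from scipy.linalg import sqrtm

def cca_reduced_rank_map(X, Y, rank):
    # X: (n, p) articulatory frames; Y: (n, q) audiovisual frames.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = Xc.T @ Xc / n
    Syy = Yc.T @ Yc / n
    Syx = Yc.T @ Xc / n
    # SVD of the whitened cross-covariance gives the canonical directions.
    Syy_isqrt = np.linalg.inv(sqrtm(Syy))
    Sxx_isqrt = np.linalg.inv(sqrtm(Sxx))
    U, s, Vt = np.linalg.svd(Syy_isqrt @ Syx @ Sxx_isqrt)
    Uk = U[:, :rank]                      # leading canonical directions (output side)
    A_ols = Syx @ np.linalg.inv(Sxx)      # full-rank least-squares map
    # Restrict the least-squares map to the top canonical directions.
    A_rrr = sqrtm(Syy) @ Uk @ Uk.T @ Syy_isqrt @ A_ols
    return np.real(A_rrr)                 # y ≈ A_rrr @ x for centered x, y

With rank equal to min(p, q) this reduces to the ordinary least-squares map; smaller ranks trade some bias for lower variance, which is where the gain with limited training data comes from.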
Speech inversion: recover the vocal tract geometry from the speech signal and the speaker's face (audio and video). Applications in language tutoring and speech therapy. Articulation is measured by Electromagnetic Articulography (EMA).

Evaluation
- Qualisys-Movetrack (QSMT) database.
- Performance is improved compared to a global linear model and to audio-only or visual-only HMMs.
Maximum A Posteriori articulatory parameter estimate:

$\hat{x} = (\Sigma_x^{-1} + A_i^T Q_i^{-1} A_i)^{-1} (\Sigma_x^{-1} \mu_x + A_i^T Q_i^{-1} y)$

where $Q_i$ is the covariance of the approximation error and the prior of $x$ is taken to be Gaussian, $x \sim N(\mu_x, \Sigma_x)$, determined at the training phase.
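A small numpy sketch of this MAP computation for a single frame decoded in state i; the argument names mirror the symbols above, but the helper itself only illustrates the closed-form expression and is not the authors' code.

import numpy as np

def map_articulatory_estimate(y, A_i, Q_i, mu_x, Sigma_x):
    # Model: y = A_i x + e, with e ~ N(0, Q_i) and prior x ~ N(mu_x, Sigma_x).
    Q_inv = np.linalg.inv(Q_i)
    S_inv = np.linalg.inv(Sigma_x)
    precision = S_inv + A_i.T @ Q_inv @ A_i     # posterior precision of x
    rhs = S_inv @ mu_x + A_i.T @ Q_inv @ y      # information vector
    # Solving the linear system is equivalent to multiplying by the inverse
    # in the closed-form MAP expression above.
    return np.linalg.solve(precision, rhs)

Applied frame by frame, with A_i and Q_i selected by the decoded state sequence, this yields the predicted articulatory trajectories compared against the measured ones in the figure below.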
Figure: Generalization error of the linear regression model vs. model order for varying training set size. Upper row: tongue position from face expression. Lower row: face expression from tongue position.

Figure: Measured (black) and predicted (light color) articulatory trajectories.
References

H. Yehia, P. Rubin, and E. Vatikiotis-Bateson, “Quantitative association of vocal-tract and facial behavior,” Speech Communication, vol. 26, pp. 23–43, 1998.
K. Richmond, S. King, and P. Taylor, “Modelling the uncertainty in recovering articulation from acoustics,” Computer Speech and Language, vol. 17, pp. 153–172, 2003.
S. Hiroya and M. Honda, “Estimation of articulatory movements from speech acoustics using an HMM-based speech production model,” IEEE TSAP, vol. 12, no. 2, pp. 175–185, March 2004.
H. Kjellstrom, O. Engwall, and O. Balter, “Reconstructing tongue movements from audio and video,” in Interspeech, 2006, pp. 2238–2241.
O. Engwall, “Introducing visual cues in acoustic-to-articulatory inversion,” in Interspeech, 2005, pp. 3205–3208.
J. Jiang, A. Alwan, P. A. Keating, E. T. Auer Jr., and L. E. Bernstein, “On the relationship between face movements, tongue movements, and speech acoustics,” EURASIP Journal on Applied Signal Processing, vol. 11, pp. 1174–1188, 2002.
O. Engwall and J. Beskow, “Resynthesis of 3D tongue movements from facial data,” in EUROSPEECH, 2003.
S. Dupont and J. Luettin, “Audio-visual speech modeling for continuous speech recognition,” IEEE Tr. Multimedia, vol. 2, no. 3, pp. 141–151, 2000.
K. V. Mardia, J. T. Kent, and J. M. Bibby, Multivariate Analysis. Acad. Press, 1979.
L. L. Scharf and J. K. Thomas, “Wiener filters in canonical coordinates for transform coding, filtering, and quantizing,” IEEE TSAP, vol. 46, no. 3, pp. 647–654, 1998.
L. Breiman and J. H. Friedman, “Predicting multivariate responses in multiple linear regression,” Journal of the Royal Stat. Soc. (B), vol. 59, no. 1, pp. 3–54, 1997.
Acknowledgements
This research was co-financed in part by the E.U.-European Social Fund (75%) and the Greek Ministry of Development-GSRT (25%) under Grant ΠΕΝΕΔ-2003 ΕΔ 866, and in part by the European research project ASPI under Grant FP6-021324. We would also like to thank O. Engwall from KTH for providing us with the QSMT database.