Articulatory Feature-based Speech Recognition:
A Proposal for the 2006 JHU Summer Workshop
on Language Engineering
[Title slide figure: vocal tract diagram annotated with articulatory features: LIP-OPEN, TT-LOC, TB-OPEN/TT-OPEN, VELUM, GLOTTIS]
November 11, 2005
Potential team members to date:
Karen Livescu (presenter)
Simon King
Florian Metze
Jeff Bilmes
Mark Hasegawa-Johnson
Ozgur Cetin
Kate Saenko
Motivations
• Why articulatory feature-based ASR?
– Improved modeling of co-articulatory pronunciation phenomena
– Take advantage of human perception and production knowledge
– Application to audio-visual modeling
– Application to multilingual ASR
– Evidence of improved ASR performance with feature-based models
* In noise [Kirchhoff et al. 2002]
* For hyperarticulated speech [Soltau et al. 2002]
• Why this workshop project?
– Growing number of sites investigating complementary aspects of this idea;
a non-exhaustive list:
* U. Edinburgh (King et al.)
* UIUC (Hasegawa-Johnson et al.)
* MIT (Livescu, Glass, Saenko)
– Recently developed tools (e.g. graphical models) for systematic exploration
of the model space
Approach: Main Ideas
• Many ways to use articulatory features in ASR
• Approach for this project: Multiple streams of hidden articulatory states that can
desynchronize and stray from target values
– Inspired by linguistic theories, but simplified and cast in a probabilistic setting
Example baseform dictionary entry for "everybody":

  index:     0     1     2     3     …
  phone:     eh    v     r     iy    …
  GLOT:      V     V     V     V     …
  VEL:       Off   Off   Off   Off   …
  LIP-OPEN:  Wide  Crit  Wide  Wide  …
  ...

These baseform targets are then modified by asynchrony between the feature streams, e.g. P(|ind_GLOT - ind_VEL| = 2), and by feature substitutions, p(s | u):

  ind_GLOT:      0 0 0 0 1 1 1 2 2 2 2 2
  ind_VEL:       0 0 0 0 0 0 0 0 0 0 1 2
  ind_LIP-OPEN:  0 0 0 0 1 1 1 1 2 2 2 2
  U_LIP-OPEN:    W W W W C C C C W W W W
  S_LIP-OPEN:    W W N N N C C C W W W W
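To make the two mechanisms concrete, here is a minimal Python sketch that scores a single frame-level assignment of stream indices and surface values; the feature values, asynchrony probabilities, and substitution probabilities are invented for illustration and are not from the proposal.

# Sketch of the multistream pronunciation model above. Feature values,
# asynchrony probabilities, and substitution probabilities are made up
# for illustration only.

# Baseform targets for "everybody" (truncated), indexed by subword position.
BASEFORM = {
    "GLOT":     ["V",    "V",    "V",    "V"],
    "VEL":      ["Off",  "Off",  "Off",  "Off"],
    "LIP-OPEN": ["Wide", "Crit", "Wide", "Wide"],
}

# P(|ind_GLOT - ind_VEL| = a): how far the two streams may drift apart.
P_ASYNC = {0: 0.7, 1: 0.2, 2: 0.1}

# p(surface | target) for LIP-OPEN: substitutions favor nearby values.
P_SUB = {
    "Wide": {"Wide": 0.8, "Narrow": 0.15, "Crit": 0.05},
    "Crit": {"Crit": 0.8, "Narrow": 0.15, "Wide": 0.05},
}

def frame_score(ind, surface):
    """Score one frame: ind maps stream -> baseform index,
    surface maps stream -> realized (surface) feature value."""
    # Asynchrony term for the GLOT/VEL pair.
    score = P_ASYNC.get(abs(ind["GLOT"] - ind["VEL"]), 0.0)
    # Substitution term for LIP-OPEN: surface value given its target.
    target = BASEFORM["LIP-OPEN"][ind["LIP-OPEN"]]
    score *= P_SUB[target].get(surface["LIP-OPEN"], 0.0)
    return score

# Example: GLOT and LIP-OPEN have moved on to index 1 while VEL lags at
# index 0, and the surface lip value "Narrow" substitutes for target "Crit".
print(frame_score({"GLOT": 1, "VEL": 0, "LIP-OPEN": 1},
                  {"LIP-OPEN": "Narrow"}))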
Dynamic Bayesian network implementation:
The context-independent case
Example DBN with 3 features:
[Figure: per-frame DBN structure, repeated for frames 0, 1, ..., T. Each frame t contains: the current word word_t; per-stream subword indices ind_t^1, ind_t^2, ind_t^3; asynchrony variables async_t^{1;2} (between streams 1 and 2) and async_t^{1,2;3} (between streams {1,2} and 3); enforcement variables checkSync_t^{1;2} = 1 and checkSync_t^{1,2;3} = 1; target feature values U_t^1, U_t^2, U_t^3, given by the baseform pronunciations; and surface feature values S_t^1, S_t^2, S_t^3.]

  Pr(async^{1;2} = a) = Pr(|ind^1 - ind^2| = a)

  checkSync^{1;2} = 1 if |ind^1 - ind^2| = async^{1;2}

Example CPT shown in the figure:

        0    1    2   …
  0    .7    0    0   …
  1    .2   .7    0   …
  2    .1   .2   .7   …
  3     0   .1   .2   …
  4     0    0   .1   …
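A minimal Python sketch of the asynchrony mechanism in this figure, assuming the illustrative numbers below: the async variable is given its own distribution, and the always-observed checkSync variable deterministically compares it with the index difference, so that summing async out under the constraint leaves exactly Pr(|ind^1 - ind^2| = a).

# Illustrative asynchrony distribution; the numbers are assumptions.
P_ASYNC = {0: 0.7, 1: 0.2, 2: 0.1}          # Pr(async^{1;2} = a)

def check_sync(ind1, ind2, async_deg):
    """Deterministic CPT of checkSync^{1;2}: 1 iff |ind^1 - ind^2| = async."""
    return 1 if abs(ind1 - ind2) == async_deg else 0

def index_difference_weight(ind1, ind2):
    """Summing out async with checkSync observed as 1 leaves exactly the
    weight Pr(async = |ind^1 - ind^2|), i.e. Pr(|ind^1 - ind^2| = a)."""
    return sum(p for a, p in P_ASYNC.items() if check_sync(ind1, ind2, a) == 1)

for d in range(4):
    print(f"|ind1 - ind2| = {d}: weight {index_difference_weight(0, d):.2f}")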
Recent related work
• Product observation models combining phones and features,
p(obs|s) = p(obs|ph_s) · ∏_i p(obs|f_i), improve ASR in some conditions
(see the sketch following this list)
– [Kirchhoff et al. 2002, Metze et al. 2002, Stueker et al. 2002]
• Lexical access from manual transcriptions of Switchboard words using
DBN model above [Livescu & Glass 2004, 2005]
– Improves over phone-based pronunciation models (~50% → ~25% error)
– Preliminary result: Articulatory phonology features preferable to IPA-style
(place/manner) features
• JHU WS’04 project [Hasegawa-Johnson et al. 2004]
– Can combine landmarks + IPA-style features at acoustic level with
articulatory phonology features at pronunciation level
• Articulatory recognition using DBN and ANN/DBN models [Wester et al.
2004, Frankel et al. 2005]
– Modeling inter-feature dependencies useful, asynchrony may also be useful
• Lipreading using multistream DBN model + SVM feature detectors
– Improves over viseme-based models in medium-vocabulary word ranking
and realistic small-vocabulary task [Saenko et al. 2005]
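A minimal log-domain sketch of the product observation model mentioned above; the per-stream scores and the optional stream weight are assumptions, not values from the cited work.

import math

def product_observation_logprob(logp_phone, logp_features, feature_weight=1.0):
    """Combine a phone-based observation log-likelihood with per-feature
    log-likelihoods: log p(obs|s) = log p(obs|ph_s) + w * sum_i log p(obs|f_i).
    The stream weight w is a common practical addition, not part of the slide."""
    return logp_phone + feature_weight * sum(logp_features)

# Hypothetical per-frame scores for one state s.
print(product_observation_logprob(math.log(0.02),
                                  [math.log(0.3), math.log(0.5), math.log(0.4)]))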
Ongoing work: Audio-visual ASR
[Figure: two DBN structures compared. Left, phoneme-viseme based: an audio state (phoneme) generates the audio observations A, and a coupled visual state (viseme) generates the video observations V. Right, articulatory feature-based: lip features generate the video observations V, while tongue and glottis/velum features generate the audio observations A, with asynchrony variables asyncLT / checkSyncLT between the lip and tongue streams and asyncTG / checkSyncTG between the tongue and glottis/velum streams.]
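As an illustration of the feature-based structure, here is a minimal Python sketch, with assumed stream names and probabilities, in which the lip stream generates the video observation, the tongue and glottis/velum streams generate the audio observation, and the lip/tongue pair may desynchronize as in the audio-only model.

import math

# Illustrative stream grouping for feature-based AVSR; the names and
# probabilities are assumptions for this sketch.
VISUAL_STREAMS = ["LIP-OPEN"]              # lip features -> video observation
AUDIO_STREAMS = ["TT-LOC", "GLOT", "VEL"]  # tongue, glottis/velum -> audio observation

P_ASYNC_LT = {0: 0.8, 1: 0.15, 2: 0.05}    # lip/tongue asynchrony distribution

def frame_logprob(indices, audio_model, video_model):
    """indices maps stream name -> baseform index for one frame; audio_model
    and video_model score the audio and video observations given the feature
    indices of their respective streams."""
    p_async = P_ASYNC_LT.get(abs(indices["LIP-OPEN"] - indices["TT-LOC"]), 0.0)
    if p_async == 0.0:
        return float("-inf")
    return (math.log(p_async)
            + audio_model({k: indices[k] for k in AUDIO_STREAMS})
            + video_model({k: indices[k] for k in VISUAL_STREAMS}))

# Toy observation models returning fixed log-likelihoods.
print(frame_logprob({"LIP-OPEN": 2, "TT-LOC": 1, "GLOT": 1, "VEL": 1},
                    audio_model=lambda feats: math.log(0.1),
                    video_model=lambda feats: math.log(0.2)))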
Sample alignment from a prototype feature-based system:
[Figure: spectrogram and mouth images aligned with the glottis (G), tongue (T), and lip (L) phone streams and the audio/video observations.]
Plan for 2006 Workshop
• Goals:
– To build complete articulatory feature-based ASR systems using multistream DBN structures
– To develop a thorough understanding of the design issues involved
• Questions to be addressed:
– What are appropriate ways to combine models of articulation with observations?
– Are discriminative feature classifiers preferable to generative observation models?
– What asynchrony constraints can account for co-articulation while permitting efficient
implementations?
– How does context affect the modeling of articulatory feature streams?
– Must the features modeled at the observation level be the same as the hidden state streams?
– How can such models be applied to audio-visual ASR?
• A possible work plan:
– Prior to workshop:
* Selection of feature sets to be considered
* Baseline feature-based and phone-based models on selected data
– Workshop, first half:
* Exploration of feature sets and classifiers
* Analysis of articulatory data
* Comparison of hidden feature structures on phonetically-labeled data
– Workshop, second half: Integration of most successful methods from above
Potential participants and contributors
• Local participants:
– Karen Livescu, MIT:
* Feature-based ASR structures, graphical models, GMTK
– Mark Hasegawa-Johnson, U. Illinois at Urbana-Champaign
* Discriminative feature classification, JHU WS’04
– Simon King, U. Edinburgh
* Articulatory feature recognition, ANN/DBN structures
– Ozgur Cetin, ICSI Berkeley
* Multistream/multirate modeling, graphical models, GMTK
– Florian Metze
* Articulatory features in HMM framework
– Jeff Bilmes, U. Washington
* Graphical models, GMTK
– Kate Saenko, MIT
* Visual feature classification, AVSR
– Others?
• Satellite/advisory contributors
– Jim Glass, MIT
– Katrin Kirchhoff, U. Washington
Resources
• Tools
– GMTK
– HTK
– Intel AVCSR toolkit
• Data
– Audio-only:
* Svitchboard (CSTR Edinburgh): Small-vocab, continuous, conversational
* PhoneBook: Medium-vocab, isolated-word, read
* (Switchboard rescoring? LVCSR)
– Audio-visual:
* AVTIMIT (MIT): Medium-vocab, continuous, read, added noise
* Digit strings database (MIT): Continuous, read, naturalistic setting (noise and
video background)
– Articulatory measurements:
* X-ray microbeam database (U. Wisconsin): Many speakers, large-vocab,
isolated-word and continuous
* MOCHA (QMUC, Edinburgh): Few speakers, medium-vocab, continuous
* Others?
– Manual transcriptions: ICSI Berkeley Switchboard transcription project
Thanks!
Questions? Comments?