Models of speech dynamics in a segmental-HMM recognizer using intermediate linear representations
Philip Jackson and Martin Russell
Electronic, Electrical and Computer Engineering
http://web.bham.ac.uk/p.jackson/balthasar/
Speech dynamics into ASR
INTRODUCTION
Conventional model
[Figure: an HMM generates acoustic observations directly from state-dependent acoustic PDFs; each state (1-4) accounts for a run of consecutive frames, e.g. 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 4 4 4 4.]
INTRODUCTION
Linear-trajectory model
[Figure: a segmental HMM whose states (1-4) define linear trajectories in an intermediate layer; an articulatory-to-acoustic mapping W projects these to the acoustic observations, scored by an acoustic PDF.]
INTRODUCTION
Multi-level Segmental HMM
• segmental finite-state process
• intermediate “articulatory” layer
  – linear trajectories
• mapping required
  – linear transformation
  – radial basis function network
INTRODUCTION
Estimation of linear mapping
Given matched sequences $x_1^T$ and $y_1^T$, arranged as matrices $X$ and $Y$, choose $W$ to minimise $D(WX, Y)$:
$$\hat{W} = Y X^{+}$$
where $X^{+}$ is the Moore-Penrose pseudo-inverse of $X$.
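A minimal numpy sketch of this estimate (an illustration, not the authors' code; the matrix layout, one frame per column, is an assumption):

```python
import numpy as np

def estimate_mapping(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Least-squares mapping W_hat = Y X^+, minimising ||W X - Y||_F.

    X : (d_x, T) matched source frames, one column per time step
    Y : (d_y, T) matched target frames
    """
    # The Moore-Penrose pseudo-inverse gives the least-squares solution
    return Y @ np.linalg.pinv(X)
```

In practice a row of ones can be appended to X so that the estimated mapping also includes a bias (affine) term.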
THEORY
Linear-trajectory equations
Defined as:
$$f_i(t) = m_i\,(t - \bar{t}\,) + c_i,$$
where $f_i(t)$ is the trajectory of segment $i$ at frame $t$, $m_i$ its slope, $c_i$ its midpoint value, and $\bar{t} = \tfrac{1}{2}(t_{\mathrm{start}} + t_{\mathrm{end}})$ the segment's mid-time.
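A short numpy illustration of evaluating such a trajectory over one segment (a sketch; the frame-indexing convention is an assumption):

```python
import numpy as np

def linear_trajectory(c_i: np.ndarray, m_i: np.ndarray,
                      t_start: int, t_end: int) -> np.ndarray:
    """Evaluate f_i(t) = m_i (t - t_bar) + c_i for frames t_start..t_end-1.

    c_i : (dim,) midpoint value of segment i
    m_i : (dim,) slope of segment i
    """
    t = np.arange(t_start, t_end)
    t_bar = 0.5 * (t_start + t_end - 1)     # segment mid-time
    return c_i + np.outer(t - t_bar, m_i)   # (n_frames, dim)
```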
THEORY
Training the model parameters
For a given segmentation $\hat{s}$, the optimal least-squares estimates (acoustic domain) are:
$$\hat{c}_i = \frac{1}{T_i} \sum_{t=t_i}^{t_{i+1}-1} y(t) \qquad \text{(midpoint)}$$
$$\hat{m}_i = \frac{\sum_{t=t_i}^{t_{i+1}-1} y(t)\,(t - \bar{t}\,)}{\sum_{t=t_i}^{t_{i+1}-1} (t - \bar{t}\,)^2} \qquad \text{(slope)}$$
THEORY
Training the model parameters
For a given segmentation $\hat{s}$, the optimal least-squares estimates (articulatory domain) are:
$$\hat{c}_i = \frac{1}{T_i} \sum_{t=t_i}^{t_{i+1}-1} W_k\, y(t) \qquad \text{(midpoint)}$$
$$\hat{m}_i = \frac{\sum_{t=t_i}^{t_{i+1}-1} W_k\, y(t)\,(t - \bar{t}\,)}{\sum_{t=t_i}^{t_{i+1}-1} (t - \bar{t}\,)^2} \qquad \text{(slope)}$$
where $W_k$ is the class-$k$ linear mapping applied to the acoustic frames.
THEORY
Training the model parameters
For a given segmentation $\hat{s}$, the optimal maximum-likelihood estimates (articulatory domain) are:
$$\hat{c}_i = \frac{1}{T_i} \sum_{t=t_i}^{t_{i+1}-1} \big(W_k^{\top} D_i W_k\big)^{-1} W_k^{\top} D_i\, y(t) \qquad \text{(midpoint)}$$
$$\hat{m}_i = \frac{\sum_{t=t_i}^{t_{i+1}-1} \big(W_k^{\top} D_i W_k\big)^{-1} W_k^{\top} D_i\, y(t)\,(t - \bar{t}\,)}{\sum_{t=t_i}^{t_{i+1}-1} (t - \bar{t}\,)^2} \qquad \text{(slope)}$$
where $D_i$ is the acoustic precision (inverse covariance) matrix of state $i$.
THEORY
Tests on MOCHA
• S. British English, at 16 kHz (Wrench, 2000)
  – MFCC13 acoustic features, incl. zeroth coefficient
  – articulatory x- & y-coords from 7 EMA coils
  – PCA9+Lx: first nine articulatory modes plus the laryngograph log energy
METHOD
MOCHA baseline performance
[Bar chart: accuracy (%), range 53-56, for the two baseline systems.]
• Constant-trajectory SHMM (ID_0)
• Linear-trajectory SHMM (ID_1)
RESULTS
Performance across mappings
[Bar chart: accuracy (%), range 53-56, across mappings ID_0, A (1), B (2), C (6), D (10), E (10), F (49) and ID_1.]
RESULTS
Phone categorisation
A (1):  all data
B (2):  silence; speech
C (6):  linguistic categories: silence/stop; vowel; liquid; nasal; fricative; affricate
D (10): as (Deng and Ma, 2000): silence; vowel; liquid; nasal; UV fric; /s,ch/; V fric; /z,jh/; UV stop; V stop
E (10): discrete articulatory regions
F (49): silence; individual phones
METHOD
Tests on TIMIT
• N. American English, at 8 kHz
  – MFCC13 acoustic features, incl. zeroth coefficient
  a) F1-3: formants F1, F2 and F3, estimated by the Holmes formant tracker
  b) F1-3+BE5: five band energies added
  c) PFS12: synthesiser control parameters
METHOD
TIMIT baseline performance
[Bar chart: accuracy (%), range 60-66, for the two baseline systems.]
• Constant-trajectory SHMM (ID_0)
• Linear-trajectory SHMM (ID_1)
RESULTS
Performance across feature sets
[Bar chart: accuracy (%), range 60-66, for ID_0 and ID_1 across feature sets (a) F1-3, (b) F1-3+BE5, (c) PFS12.]
RESULTS
Performance across groupings
[Bar chart: accuracy (%), range 60-66, across mappings ID_0, A (1), B (2), C (6), D (10), E (10), F (49) and ID_1.]
RESULTS
Results across groupings
[Chart: accuracy (%), range 60-66, across mappings ID_0, A (1), B (2), C (6), D (10), E (10), F (49) and ID_1, for feature sets (a) F1-3, (b) F1-3+BE5, (c) PFS12.]
RESULTS
Model visualisation
[Figure: panels comparing the original acoustic data with the constant-trajectory model and the linear-trajectory model (c, F).]
DISCUSSION
Conclusions
• Developed a framework for modelling speech dynamics in an intermediate space
• Linear trajectories with a piecewise-linear mapping are bounded by the performance of linear trajectories in acoustic space
• Near-optimal performance achieved
  – with more than 3 formant parameters
  – with 6 or more linear mappings
• Formants and articulatory parameters gave qualitatively similar results
• What next?
SUMMARY
Further work
• Complete experiments with a language model
• Include segment-duration models
• Derive pseudo-articulatory representations by unsupervised (embedded) training
• Implement non-linear mapping (i.e., RBF)
• Further information:
  – here and now
  – [email protected]
  – web.bham.ac.uk/p.jackson/balthasar
SUMMARY