Models of speech dynamics in a segmental-HMM recognizer using intermediate linear representations
Philip Jackson and Martin Russell
Electronic, Electrical and Computer Engineering
http://web.bham.ac.uk/p.jackson/balthasar/
Speech dynamics in ASR (INTRODUCTION)

Conventional model:
[Figure: a standard HMM, in which each state's acoustic PDF generates the acoustic observations directly; frames are labelled with their state index, e.g. 1 1 1 1 1 | 2 2 2 2 2 2 2 | 3 3 3 3 | 4 4 4 4.]

Linear-trajectory model (INTRODUCTION)

[Figure: a segmental HMM with an intermediate layer; each segment (1, 2, 3, 4) carries a linear trajectory in the intermediate layer, which an articulatory-to-acoustic mapping W passes to the acoustic PDF that generates the acoustic observations.]

Multi-level segmental HMM (INTRODUCTION)
• segmental finite-state process
• intermediate "articulatory" layer
  – linear trajectories
• mapping required
  – linear transformation
  – radial basis function network

Estimation of linear mapping (THEORY)

Given matched sequences $x_1^T$ and $y_1^T$, stacked as the columns of $X$ and $Y$, minimise $D(WX, Y)$. The least-squares solution is
$$\hat{W} = Y X^{+},$$
where $X^{+}$ is the Moore-Penrose pseudo-inverse of $X$.

Linear-trajectory equations (THEORY)

Defined as
$$f_i(t) = m_i\,(t - \bar{t}) + c_i, \qquad f_i(\bar{t}) = c_i,$$
where $\bar{t} = \tfrac{1}{2}(t_i + t_{i+1} - 1)$ is the midpoint of segment $i$.

Training the model parameters (THEORY)

For a fixed segmentation $\hat{s}$, the optimal least-squares estimates in the acoustic domain are
$$\hat{c}_i = \frac{1}{T}\sum_{t=t_i}^{t_{i+1}-1} y(t) \ \ \text{(midpoint)}, \qquad \hat{m}_i = \frac{\sum_{t=t_i}^{t_{i+1}-1} y(t)\,(t-\bar{t})}{\sum_{t=t_i}^{t_{i+1}-1} (t-\bar{t})^2} \ \ \text{(slope)},$$
where $T = t_{i+1} - t_i$ is the segment duration.

In the articulatory domain, the least-squares estimates project each acoustic frame back through the pseudo-inverse of the mapping $W_k$:
$$\hat{c}_i = \frac{1}{T}\sum_{t=t_i}^{t_{i+1}-1} W_k^{+}\,y(t) \ \ \text{(midpoint)}, \qquad \hat{m}_i = \frac{\sum_{t=t_i}^{t_{i+1}-1} W_k^{+}\,y(t)\,(t-\bar{t})}{\sum_{t=t_i}^{t_{i+1}-1} (t-\bar{t})^2} \ \ \text{(slope)}.$$

The maximum-likelihood estimates in the articulatory domain weight this projection by the acoustic covariance $D_i$:
$$\hat{c}_i = \left(W_k^{\top} D_i^{-1} W_k\right)^{-1} W_k^{\top} D_i^{-1}\,\frac{1}{T}\sum_{t=t_i}^{t_{i+1}-1} y(t), \qquad \hat{m}_i = \left(W_k^{\top} D_i^{-1} W_k\right)^{-1} W_k^{\top} D_i^{-1}\,\frac{\sum_{t=t_i}^{t_{i+1}-1} y(t)\,(t-\bar{t})}{\sum_{t=t_i}^{t_{i+1}-1} (t-\bar{t})^2}.$$
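As a concrete sketch of the mapping estimate $\hat{W} = YX^{+}$ above, here is a minimal numpy implementation; the function name, toy dimensions and random data are illustrative, not from the talk:

```python
import numpy as np

def estimate_linear_mapping(X, Y):
    """Least-squares estimate of the articulatory-to-acoustic mapping W.

    X : (d_art, T) matched articulatory (intermediate) frames as columns
    Y : (d_ac,  T) matched acoustic frames as columns
    Returns W_hat of shape (d_ac, d_art) minimising ||W X - Y||_F,
    i.e. W_hat = Y X^+ with X^+ the Moore-Penrose pseudo-inverse.
    """
    return Y @ np.linalg.pinv(X)

# Toy check: 9 articulatory dims, 13 MFCCs, 500 matched frames
rng = np.random.default_rng(0)
X = rng.standard_normal((9, 500))
W_true = rng.standard_normal((13, 9))
Y = W_true @ X + 0.01 * rng.standard_normal((13, 500))
print(np.allclose(estimate_linear_mapping(X, Y), W_true, atol=0.05))  # should print True
```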
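The least-squares trajectory estimates can likewise be written in a few lines. This sketch assumes one segment's frames are stacked as columns and works for both domains: applied directly to acoustic frames it gives the acoustic-domain estimates, and applied after projecting through $W_k^{+}$ it gives the articulatory-domain ones (names are mine):

```python
import numpy as np

def fit_linear_trajectory(F_seg):
    """LS midpoint and slope of a linear trajectory for one segment.

    F_seg : (d, T) frames of the segment as columns; assumes T >= 2.
    Returns (c_hat, m_hat): c_hat is the trajectory value at the segment
    midpoint (the per-dimension mean), m_hat the per-frame slope.
    """
    d, T = F_seg.shape
    t = np.arange(T) - (T - 1) / 2.0      # centred time axis, t - t_bar
    c_hat = F_seg.mean(axis=1)            # (1/T) * sum_t f(t)
    m_hat = (F_seg @ t) / np.sum(t ** 2)  # sum_t f(t)(t-t_bar) / sum_t (t-t_bar)^2
    return c_hat, m_hat

def fit_articulatory_ls(Y_seg, W):
    """Articulatory-domain LS fit: project acoustic frames through the
    pseudo-inverse of the mapping W_k, then fit as above."""
    return fit_linear_trajectory(np.linalg.pinv(W) @ Y_seg)
```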
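And a covariance-weighted version corresponding to the ML articulatory estimates. This is a hedged sketch, not the authors' code: it assumes $D_i$ is diagonal and passed as a vector of variances.

```python
import numpy as np

def fit_articulatory_ml(Y_seg, W, D_diag):
    """ML midpoint and slope in the articulatory domain, assuming the
    acoustic residual is Gaussian with diagonal covariance D_diag (d_ac,).

    Applies P = (W' D^-1 W)^-1 W' D^-1 to the acoustic-domain statistics.
    """
    d_ac, T = Y_seg.shape
    t = np.arange(T) - (T - 1) / 2.0
    W_D = W.T / D_diag                 # W' D^-1 (columns scaled by 1/variance)
    P = np.linalg.solve(W_D @ W, W_D)  # (W' D^-1 W)^-1 W' D^-1
    c_hat = P @ Y_seg.mean(axis=1)
    m_hat = P @ (Y_seg @ t) / np.sum(t ** 2)
    return c_hat, m_hat
```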
Tests on MOCHA (METHOD)
• S. British English, sampled at 16 kHz (Wrench, 2000)
  – MFCC13 acoustic features, incl. the zeroth coefficient
  – articulatory x- and y-coordinates from 7 EMA coils
  – PCA9+Lx: first nine articulatory modes plus the laryngograph log energy

MOCHA baseline performance (RESULTS)
[Figure: bar chart of accuracy (%), on a scale of 53–56%, for the two baselines.]
• Constant-trajectory SHMM (ID_0)
• Linear-trajectory SHMM (ID_1)

Performance across mappings (RESULTS)
[Figure: accuracy (%), on a scale of 53–56%, from ID_0 through mappings A (1), B (2), C (6), D (10), E (10) and F (49) to ID_1.]

Phone categorisation (METHOD)
• A (1): all data
• B (2): silence; speech
• C (6): linguistic categories: silence/stop; vowel; liquid; nasal; fricative; affricate
• D (10): as (Deng and Ma, 2000): silence; vowel; liquid; nasal; UV fric; /s,ch/; V fric; /z,jh/; UV stop; V stop
• E (10): discrete articulatory regions
• F (49): silence; individual phones

Tests on TIMIT (METHOD)
• N. American English, sampled at 8 kHz
  – MFCC13 acoustic features, incl. the zeroth coefficient
  a) F1-3: formants F1, F2 and F3, estimated by the Holmes formant tracker
  b) F1-3+BE5: five band energies added
  c) PFS12: synthesiser control parameters

TIMIT baseline performance (RESULTS)
[Figure: bar chart of accuracy (%), on a scale of 60–66%, for the two baselines.]
• Constant-trajectory SHMM (ID_0)
• Linear-trajectory SHMM (ID_1)

Performance across feature sets (RESULTS)
[Figure: accuracy (%), on a scale of 60–66%, from ID_0 through feature sets (a) F1-3, (b) F1-3+BE5 and (c) PFS12 to ID_1.]

Performance across groupings (RESULTS)
[Figure: accuracy (%), on a scale of 60–66%, from ID_0 through mappings A (1), B (2), C (6), D (10), E (10) and F (49) to ID_1.]

Results across groupings (RESULTS)
[Figure: accuracy (%), on a scale of 60–66%, across mappings A (1) to F (49) for each feature set (a) F1-3, (b) F1-3+BE5 and (c) PFS12.]

Model visualisation (DISCUSSION)
[Figure: original acoustic data alongside reconstructions from the constant-trajectory model and the linear-trajectory model (c, F).]

Conclusions (SUMMARY)
• Developed a framework for modelling speech dynamics in an intermediate space
• Linear trajectories with a piecewise-linear mapping are bounded by the performance of linear trajectories in acoustic space
• Near-optimal performance achieved
  – for more than 3 formant parameters
  – for 6 or more linear mappings
• Formants and articulatory parameters gave qualitatively similar results
• What next?

Further work (SUMMARY)
• Complete experiments with a language model
• Include segment-duration models
• Derive pseudo-articulatory representations by unsupervised (embedded) training
• Implement a non-linear mapping (i.e., an RBF network; see the sketch below)
• Further information:
  – here and now
  – [email protected]
  – web.bham.ac.uk/p.jackson/balthasar
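For the non-linear mapping mentioned above, a radial basis function network is the option named in the talk. Below is a minimal illustrative sketch, assuming Gaussian bases with fixed centres, a single shared width, and a linear output layer fitted by least squares; the class name and all of these design choices are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

class RBFMapping:
    """Minimal RBF mapping from articulatory to acoustic frames:
    Gaussian bases with fixed centres, linear output weights fitted
    by least squares."""

    def __init__(self, centres, width):
        self.centres = centres  # (n_bases, d_art) fixed basis centres
        self.width = width      # shared Gaussian width

    def _phi(self, X):
        # X: (T, d_art) -> (T, n_bases) Gaussian activations
        d2 = ((X[:, None, :] - self.centres[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2.0 * self.width ** 2))

    def fit(self, X, Y):
        # Output weights by least squares: Y ~ phi(X) @ V
        self.V = np.linalg.lstsq(self._phi(X), Y, rcond=None)[0]
        return self

    def predict(self, X):
        return self._phi(X) @ self.V
```

The centres could, for instance, be picked by k-means over the articulatory training frames; that choice, like the shared width, is a free parameter here.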