A New Technique for Robust Estimation of Feature Normalization Parameters
Venkata Ramana Rao Gadde, Andreas Stolcke
Organization of Talk
1. Problem
2. Solution
3. Experiments
4. Future Work

PROBLEM

Problem
• Feature mean and variance normalization
  – Usually applied at the speaker level.
  – The idea is based on Cepstral Mean Subtraction (CMS), which removes a channel transform that is multiplicative in the spectral domain.
• In principle, the normalization works only for cepstra or features derived from cepstra.
• Current ASR systems perform feature mean removal even for non-cepstral features (like PLP).
  – Variance normalization reduces inter-speaker differences by normalizing the mean-subtracted features to unit variance.
  – Mean/variance estimation uses all data frames with equal weight (or speech-only frames in some systems).
• The estimates are affected by the distribution of data among phones (or phone classes).
• Robust estimation requires large amounts of data.

SOLUTION

Proposed Solution
• Eliminate the dependency of the estimates on the data distribution.
  – Compute class-level means for the different phone classes (or phones).
  – Estimate the feature mean as the average of the class means.
  – Take a similar approach for the variance (computed around the mean estimate).
  – This reduces the dependency of the estimate on the distribution of data among the classes.
• Segment boundaries become less important.
• Need to optimize the number of classes (class membership?).
• Requires a phone-level alignment of the waveform.

Formulae (FMVN = feature mean/variance normalization)

Standard FMVN:
$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$$

Proposed FMVN:
$$\mu = \frac{1}{C}\sum_{c=1}^{C}\frac{1}{N_c}\sum_{i=1}^{N_c} x_i^c, \qquad \sigma^2 = \frac{1}{C}\sum_{c=1}^{C}\frac{1}{N_c}\sum_{i=1}^{N_c} (x_i^c - \mu)^2$$

where N = # frames, C = # classes, and N_c = # frames in class c.

Proposed Solution (cont.)
• Our estimates are independent of the data distribution among the different phone classes and thus more robust.
• Classes with no data can pose a problem.
  – We propose using estimates obtained from all speakers ("global estimates") to smooth the classes that have little or no data.
  – This idea can be generalized by clustering speakers (for each class) to produce a speaker clustering tree. Such a tree can then be used to identify the training speaker closest to the test speaker and use that speaker's estimates.
  – In our experiments we used the global estimates to smooth the speaker estimates.
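As a concrete illustration of the two slides above, here is a minimal numpy sketch of the macro mean/variance estimates with global-estimate smoothing. This is our reconstruction, not the authors' code: the array layout, the function name, and the MAP-style interpolation weight `tau` are all assumptions (the talk does not specify the smoothing formula).

```python
import numpy as np

def macro_fmvn(frames, classes, global_means, global_vars, tau=50.0):
    """Macro feature mean/variance normalization (sketch).

    frames:       (N, D) feature vectors for one speaker
    classes:      (N,)  phone-class index per frame, from a phone-level alignment
    global_means: (C, D) per-class means pooled over all training speakers
    global_vars:  (C, D) per-class variances pooled over all training speakers
    tau:          assumed MAP-style smoothing weight (not given in the talk)
    """
    C, D = global_means.shape
    class_means = np.empty((C, D))
    counts = np.zeros(C)
    for c in range(C):
        xc = frames[classes == c]
        counts[c] = len(xc)
        # Smooth the speaker's class mean toward the global class mean;
        # a class with no data falls back entirely on the global estimate.
        spk_mean = xc.mean(axis=0) if len(xc) else np.zeros(D)
        class_means[c] = (counts[c] * spk_mean + tau * global_means[c]) / (counts[c] + tau)

    # Macro mean: every class counts once, regardless of its frame count.
    mu = class_means.mean(axis=0)

    class_vars = np.empty((C, D))
    for c in range(C):
        xc = frames[classes == c]
        # Per-class variance around the macro mean, as in the proposed formula.
        # Note: global_vars are assumed to be computed around the global means,
        # so smoothing the variances this way is an approximation.
        spk_var = ((xc - mu) ** 2).mean(axis=0) if len(xc) else np.zeros(D)
        class_vars[c] = (counts[c] * spk_var + tau * global_vars[c]) / (counts[c] + tau)
    var = class_vars.mean(axis=0)

    return (frames - mu) / np.sqrt(var)
```

With `tau = 0` and no empty classes this reduces exactly to the proposed FMVN formulae; a large `tau` pulls the estimates toward the global ones.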
EXPERIMENTS

SPINE Experiments
• Acoustic models were trained on the SPINE 2001 training set (~32K utterances).
  – Models were trained using decision trees.
  – Viterbi training was used for the final genone iterations.
• Speaker-normalized MFC features:
  – 39-dim MFC feature (mel cepstrum with first and second deltas), normalized for mean and variance.
  – VTL warping factors from the 2001 SPINE system were used.
• The SPINE 2001 eval set (~4.5K utterances) was used for testing.

SPINE Experiments (cont.)
• The baseline system used standard feature normalization, which weights all frames equally.
• Macro-normed models used the macro mean and variance estimates for normalization.
• Results are shown for different numbers of phone classes used for normalization.
• All results are for matched conditions: both training and testing used the same phone classes.
• All results were obtained after reoptimizing the LM weight and Gaussian weight.

SPINE Results (unsmoothed)

Word Error Rate (for different numbers of classes):

Model                | 7 classes | 11 classes | 47 classes (= all phones)
Baseline             |   38.5    |   38.5     |   38.5
Macro-normed models  |   38.5    |   38.4     |   37.9

• The best WER improvement is obtained when each phone is treated as a separate class.
• For small numbers of classes (2, 3, 6, 7), no improvement is observed.

SPINE Results (smoothed with global estimates)

Word Error Rate:

Model                                   | WER
Baseline                                | 38.5
Macro-normed (47 classes), unsmoothed   | 37.9
Macro-normed (47 classes), smoothed     | 37.1

• Smoothing with global estimates resulted in a 0.8% absolute reduction in WER over the unsmoothed models.
• Does this performance hold up after adaptation (CMLLR/MLLR)?

SPINE Results (effect of adaptation)

Word Error Rate:

                  | Baseline | Macro-normed, unsmoothed | Macro-normed, smoothed
Unadapted recog.  |  38.5    |  37.9 (-0.6)             |  37.1 (-1.4)
CMLLR and recog.  |  35.8    |  35.5 (-0.3)             |  35.2 (-0.6)
MLLR and recog.   |  33.7    |  33.4 (-0.3)             |  32.8 (-0.9)

CTS Experiments
• Acoustic models were trained on the Fisher 400-hour male training set (~200K utterances).
• The feature is a 62-dim composite feature:
  – 52-dim MFC feature (mel cepstrum with first, second, and third derivatives).
  – 10-dim voicing feature.
  – Speaker-level VTL, mean, and variance normalization.
• HLDA was used to reduce the feature down to 39 dimensions.
• The test set is the male subset of the dev2004 set (~1.5K utterances).

CTS Experiments (cont.)
• Experiments used each phone as a class by itself (49 classes).
• A bigram non-SARV LM was used in the experiments.
• We experimented with two types of adaptation:
  – Using a phone loop (PL adapt).
  – Using hypotheses from first-pass recognition (HYP adapt).

CTS Results

Word Error Rate:

                      | Baseline | Macro-normed, unsmoothed | Macro-normed, smoothed
Unadapted recog.      |  34.3    |  33.7 (-0.6)             |  33.5 (-0.8)
PL adapt and recog.   |  32.4    |  32.0 (-0.4)             |  32.1 (-0.3)
HYP adapt and recog.  |  32.2    |  31.7 (-0.5)             |  31.9 (-0.3)

• Models with macro norms are consistently better than the baseline models.
• Adaptation (PL, MLLR) seems to erode part of the gain.
  – Adaptation transforms are computed by weighting all frames equally.
• Modify adaptation algorithms to optimize a likelihood normalized per phone?
• After adaptation, only a 0.3-0.4% absolute win remains, erasing half of the gain from the new normalization.

SUMMARY AND FUTURE WORK

Summary
• Proposed a new feature normalization using macro estimates of the mean and variance.
• Proposed using global estimates (from training data) to smooth the test-data normalizations.
• Established through experiments that the proposed technique improves recognizer accuracy.

Future Work
• Evaluate the improvement in a full system.
• Experiment with weighting different phone classes differently.
• Predict a class's mean/variance from the other class estimates and the global estimates.
• Try smoothing with the closest training speaker/cluster.
• Extend the approach to other feature normalizations/model adaptations by modifying the optimization criterion to be independent of the data distribution (see the sketch below):
  – VTL estimation
  – SAT/HLDA/MLLR
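One way to read the last bullet is to reweight frames so that every phone class contributes equally to whatever likelihood or statistic is being optimized. The sketch below is our illustration of that idea, not something presented in the talk; the function name and the exact weighting are assumptions.

```python
import numpy as np

def class_balanced_frame_weights(classes, num_classes):
    """Per-frame weights that make every phone class count equally in an
    accumulated statistic: a frame in class c gets weight 1 / (C * N_c).
    When all C classes are present, the weights sum to 1 and a weighted
    mean of the frames equals the macro mean from the Formulae slide.
    (Empty classes would need the global-estimate smoothing shown earlier.)"""
    counts = np.bincount(classes, minlength=num_classes).astype(float)
    return 1.0 / (num_classes * counts[classes])

# Tiny check: class means are 1, 3, and 4.5, so the macro mean is 8.5/3.
classes = np.array([0, 0, 0, 1, 2, 2])
frames = np.arange(6, dtype=float).reshape(6, 1)
w = class_balanced_frame_weights(classes, num_classes=3)
macro_mean = (w[:, None] * frames).sum(axis=0)   # -> [2.8333...]
```

In principle the same weights could be dropped into the frame-level sufficient statistics of VTL estimation or MLLR, which is how we understand the final future-work bullet.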