A New Technique for Robust Estimation of Feature Normalization Parameters
Venkata Ramana Rao Gadde
Andreas Stolcke
Organization of Talk
1. Problem
2. Solution
3. Experiments
4. Future Work
PROBLEM
Problem
• Feature mean and variance normalization
  – Usually applied at the speaker level.
  – The idea is based on Cepstral Mean Subtraction (CMS), which removes a multiplicative channel transform (in the spectral domain).
    • In principle, the normalization works only for cepstra or features derived from cepstra.
    • Current ASR systems perform feature mean removal even for non-cepstral features (like PLP).
  – Variance normalization reduces inter-speaker differences by normalizing the mean-subtracted features to unit variance.
  – Mean/variance estimation uses all data frames with equal weight (or speech-only frames in some systems); see the sketch below.
    • The estimates are affected by the distribution of data among phones (or phone classes).
    • Robust estimation requires large amounts of data.
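To make the equal-weight estimation concrete, here is a minimal sketch of standard per-speaker mean/variance normalization in Python with NumPy (the function name and the small variance floor are illustrative assumptions, not from the talk):

```python
import numpy as np

def standard_fmvn(frames: np.ndarray) -> np.ndarray:
    """Standard FMVN sketch: every frame contributes with equal weight,
    so the estimates inherit whatever phone distribution the speaker's
    data happens to have.  frames: (num_frames, feat_dim)."""
    mu = frames.mean(axis=0)            # equal-weight mean over all frames
    sigma = frames.std(axis=0) + 1e-8   # equal-weight std; floor avoids /0
    return (frames - mu) / sigma
```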
SOLUTION
Proposed Solution
• Eliminate the dependency of the estimates on the data distribution.
  – Compute class-level means for the different phone classes (or phones).
  – Estimate the feature mean as the average of the class means.
  – Take a similar approach for the variance (computed around the mean estimate).
  – This reduces the dependency of the estimate on the distribution of data among the classes.
    • Segment boundaries become less important.
    • The number of classes (and class membership?) needs to be optimized.
    • Requires a phone-level alignment of the waveform.
Formulae
Standard FMVN:

$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2$$

Proposed FMVN:

$$\mu = \frac{1}{C}\sum_{c=1}^{C}\frac{1}{N_c}\sum_{i=1}^{N_c} x_i^c, \qquad \sigma^2 = \frac{1}{C}\sum_{c=1}^{C}\frac{1}{N_c}\sum_{i=1}^{N_c}\left(x_i^c - \mu\right)^2$$

where $N$ is the total number of frames, $C$ the number of classes, $N_c$ the number of frames in class $c$, and $x_i^c$ the $i$-th frame of class $c$.
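A minimal sketch of the proposed macro estimates, assuming per-frame phone-class labels are available from a phone-level alignment (array names and the variance floor are illustrative):

```python
import numpy as np

def macro_fmvn(frames: np.ndarray, classes: np.ndarray) -> np.ndarray:
    """Macro (class-averaged) mean/variance normalization.

    frames  -- (N, D) feature matrix for one speaker
    classes -- (N,) per-frame phone-class label from a phone-level alignment
    """
    labels = np.unique(classes)
    # Mean of each class, then the unweighted average of the class means:
    # classes with many frames no longer dominate the estimate.
    class_means = np.stack([frames[classes == c].mean(axis=0) for c in labels])
    mu = class_means.mean(axis=0)
    # Per-class variance around the macro mean, averaged over classes.
    class_vars = np.stack([((frames[classes == c] - mu) ** 2).mean(axis=0)
                           for c in labels])
    sigma2 = class_vars.mean(axis=0)
    return (frames - mu) / np.sqrt(sigma2 + 1e-8)  # floor avoids divide-by-zero
```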
Proposed Solution
• Our estimates are independent of the distribution of data among the phone classes and are thus more robust.
• Classes with no data can pose a problem.
  – We propose using estimates obtained from all speakers ("global estimates") to smooth the classes that have little or no data.
  – This idea can be generalized by clustering the speakers (for each class) to produce a speaker clustering tree. Such a tree can then be used to identify the training speaker closest to the test speaker and use that speaker's estimates.
  – In our experiments we used the global estimates to smooth the speaker estimates, as sketched below.
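The talk does not give the smoothing formula; one standard choice consistent with the description is a count-based interpolation between the speaker and global class estimates, sketched here with a hypothetical prior weight `tau`:

```python
import numpy as np

def smooth_class_mean(speaker_mean: np.ndarray, n_frames: int,
                      global_mean: np.ndarray, tau: float = 100.0) -> np.ndarray:
    """MAP-style interpolation of a speaker's class mean with the global
    (all-training-speaker) class mean; `tau` is a hypothetical prior weight,
    not a value from the talk.  With n_frames == 0 the class falls back
    entirely to the global estimate."""
    w = n_frames / (n_frames + tau)
    return w * speaker_mean + (1.0 - w) * global_mean
```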
EXPERIMENTS
SPINE Experiments
• Acoustic models were trained on the SPINE 2001 training set (~32K utterances).
  – Models were trained using decision trees.
  – Viterbi training was used for the final genone iterations.
• Speaker-normalized MFC features:
  – 39-dim. MFC feature (mel cepstrum with first and second deltas),
  – normalized for mean and variance,
  – VTL (vocal tract length) warping factors from the 2001 SPINE system were used.
• The SPINE 2001 eval set (~4.5K utterances) was used for testing.
SPINE Experiments
• The baseline system used standard feature normalization, which weights all frames equally.
• Macro-normed models used macro mean and variance estimates for normalization.
• Results are shown for different numbers of phone classes used for normalization.
• All results are for matched conditions: both training and testing used the same phone classes.
• All results were obtained after reoptimizing the LM weight and Gaussian weight.
SPINE Results (unsmoothed)
Model                 Word Error Rate
                      7 classes   11 classes   47 classes (= all phones)
Baseline              38.5        38.5         38.5
Macro-normed models   38.5        38.4         37.9
SPINE Results (unsmoothed)
• The best WER improvement is obtained when each phone is treated as a separate class.
• For small numbers of classes (2, 3, 6, 7), no improvement is observed.
SPINE Results (smoothed with global estimates)
Model                                          Word Error Rate
Baseline                                       38.5
Macro-normed models (47 classes), unsmoothed   37.9
Macro-normed models (47 classes), smoothed     37.1
SPINE Results (smoothed with global estimates)
• Smoothing with global estimates resulted in a 0.8% abs. reduction in WER over the unsmoothed models.
• Does this performance hold up after adaptation?
  – CMLLR/MLLR?
SPINE Results (effect of adaptation)
Word Error Rate
                    Baseline   Macro-normed,   Macro-normed,
                               unsmoothed      smoothed
Unadapted recog.    38.5       37.9 (-0.6)     37.1 (-1.4)
CMLLR and recog.    35.8       35.5 (-0.3)     35.2 (-0.6)
MLLR and recog.     33.7       33.4 (-0.3)     32.8 (-0.9)
CTS Experiments
• Acoustic models were trained on the Fisher 400-hour male training set (~200K utterances).
• The feature used is a 62-dim. composite feature:
  – 52-dim. MFC feature (mel cepstrum with first, second, and third derivatives),
  – 10-dim. voicing feature,
  – speaker-level VTL, mean, and variance normalization.
• HLDA was used to reduce the feature to 39 dimensions.
• The test set is the male subset of the dev2004 set (~1.5K utterances).
CTS Experiments
• Experiments were conducted using each phone as a class by itself (49 classes).
• A bigram non-SARV LM was used in the experiments.
• We experimented with two types of adaptation:
  – using a phone loop (PL adapt),
  – using hyps from first-pass recognition (HYP adapt).
CTS Results
Word Error Rate
                      Baseline   Macro-normed,   Macro-normed,
                                 unsmoothed      smoothed
Unadapted recog.      34.3       33.7 (-0.6)     33.5 (-0.8)
PL adapt and recog.   32.4       32.0 (-0.4)     32.1 (-0.3)
HYP adapt and recog.  32.2       31.7 (-0.5)     31.9 (-0.3)
CTS Results
• Models with macro norms are consistently better than the baseline models.
• Adaptation (PL, MLLR) seems to erase part of the gain.
  – Adaptation transforms are computed by weighting all frames equally.
    • Modify the adaptation algorithms to optimize a likelihood normalized per phone? (A speculative sketch follows below.)
• After adaptation, the win from the new normalization shrinks to only 0.3-0.4% abs., erasing half of it.
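As a purely speculative illustration of the question above (not the method in the talk), one could weight each frame inversely by its class count, so every phone class contributes equal total weight to the adaptation statistics; with these weights the weighted frame mean equals the macro mean from the formulae slide:

```python
import numpy as np

def class_balanced_frame_weights(classes: np.ndarray) -> np.ndarray:
    """Per-frame weights w_i = 1 / (C * N_c(i)): each of the C phone classes
    contributes the same total weight (1/C) to any weighted statistic, so
    the weighted frame mean reduces to the macro mean.  Speculative sketch,
    not the algorithm used in the talk."""
    labels, counts = np.unique(classes, return_counts=True)
    count_of = dict(zip(labels.tolist(), counts.tolist()))
    num_classes = len(labels)
    return np.array([1.0 / (num_classes * count_of[c]) for c in classes.tolist()])
```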
SUMMARY AND FUTURE WORK
Summary
• Proposed a new feature normalization using macro estimates of the mean and variance.
• Proposed using global estimates (from training data) to smooth the test-data normalizations.
• Established through experiments that the proposed technique improves recognizer accuracy.
Future Work
• Evaluate the improvement in a full system.
• Experiment with weighting different phone classes differently.
• Predict a class's mean/variance from the other class estimates and the global estimates.
• Try smoothing with the closest training speaker/cluster.
• Extend the approach to other feature normalizations/model adaptations by modifying the optimization criterion to be independent of the data distribution:
  – VTL estimation
  – SAT/HLDA/MLLR