T.J. Watson Research Center, Human Language Technologies
EARS Progress Update: Improved MPE, Inline Lattice Rescoring, Fast Decoding, Gaussianization & Fisher Experiments
Dan Povey, George Saon, Lidia Mangu, Brian Kingsbury & Geoffrey Zweig
12/1/2003

Part 1: Improved MPE

Previous discriminative training setup – Implicit Lattice MMI
• Used a unigram decoding graph and fast decoding to generate state-level "posteriors" (actually relative likelihoods: the delta between the best path using the state and the best path overall).
• Posteriors used directly (without forward-backward) to accumulate "denominator" statistics.
• Numerator statistics accumulated as for ML training, with full forward-backward.
• Fairly effective, but not "MMI/MPE standard".

Current discriminative training setup (for standard MMI)
• Create lattices with unigram scores on the links.
• Run forward-backward on the lattices (using a fixed state sequence) to get occupation probabilities; reuse the same lattices over multiple iterations.
• Create numerator and denominator statistics in a consistent way.
• Use a slower training speed (E=2, not 1) and more iterations.
• Also implemented MPE.

Experimental conditions
• Same as for the RT'03 evaluation.
• 274 hours of Switchboard training data.
• Training and test data adapted using an FMLLR transform [from the ML system].
• 60-dimensional PLPs, VTLN, no MLLR.

Basic MMI results (eval'00)
With word-internal phone context, 142K Gaussians:

                 ML      Iter-1   Iter-2   Iter-3   Iter-4
  Old MMI, E=1   23.5%   22.7%    22.2%
  New MMI, E=2   23.5%   22.5%    21.7%    20.9%    20.8%

• 1.4% more improvement (2.7% total) with this setup.

MPE results (eval'00)

                 ML      Iter-1   Iter-2   Iter-3   Iter-4   Iter-5
  MMI            23.5%   22.5%    21.7%    20.9%    20.8%
  MPE            23.5%   22.2%    21.5%*   21.3%*
  MPE+MMI        23.5%   21.8%    21.3%    20.9%    20.5%    20.3%

• Standard MPE is not as good as MMI with this setup.
• "MPE+MMI", which is MPE with I-smoothing to the MMI update (not ML), gives 0.5% absolute over MMI.
* Conditions differ; treat with caution.

MPE+MMI continued
• "MPE+MMI" involves storing four sets of statistics rather than three: num, den, ml, and now also mmi-den. 33% more storage, no extra computation.
• Do the standard MMI update using the ml and mmi-den stats, then use the resulting mean & variance in place of the ML mean & variance in I-smoothing. (Note: I-smoothing is a kind of gradual backoff to a more robust estimate of the mean & variance.)

Probability scaling in MPE
• MPE training leads to an excess of deletions. Based on previous experience, this can be due to a probability scale that is too extreme.
• Changing the probability scale from 1/18 to 1/10 gave a ~0.3% win.
• 1/10 is used as the scale in all MPE experiments with left context (see later).

Fast MMI
• Work presented by Bill Byrne at Eurospeech'03 showed improved results from MMI when the correctly recognized data was excluded.*
• We achieve a similar effect without hard decisions, by canceling numerator & denominator stats: if a state has nonzero occupation probabilities for both numerator and denominator at time t, cancel the shared part so that only one remains positive (a minimal sketch of this cancellation follows).
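Below is a minimal sketch of the cancellation step just described, in Python; the array names and the per-frame layout of the occupation probabilities are illustrative assumptions, not the actual training code.

```python
import numpy as np

def cancel_num_den(gamma_num, gamma_den):
    """Cancel shared numerator/denominator occupation probability.

    gamma_num, gamma_den: arrays of shape (num_frames, num_states) holding
    per-frame state occupation probabilities from the numerator (reference)
    forward-backward and the denominator (lattice) forward-backward.

    Wherever a state has nonzero occupancy in both at the same frame, the
    shared part is removed, so at most one of the two stays positive there.
    """
    shared = np.minimum(gamma_num, gamma_den)
    return gamma_num - shared, gamma_den - shared
```

The remaining occupancies are then accumulated into the numerator and denominator statistics exactly as in standard MMI.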
• Gives results as good as or better than the baseline, with half the iterations. Use E=2 as before.

             ML      Iter-1   Iter-2   Iter-3   Iter-4
  MMI        23.5%   22.5%    21.7%    20.9%    20.8%
  Fast MMI   23.5%   21.2%    20.7%    21.2%

* "Lattice Segmentation and Minimum Bayes Risk Discriminative Training", Vlasios Doumpiotis et al., Eurospeech 2003.

MMI+MPE with cross-word (left) phone context
• Similar-size system (about 160K vs. 142K Gaussians), with cross-word context.
• Results shown here connect word traces into lattices indiscriminately (ignoring constraints of context).
• There is an additional win possible from using the context constraints (~0.2%).

  RT'00       ML      Iter-1   Iter-2   Iter-3
  Old MMI*    22.0%   20.8%    19.5%
  Fast MMI    22.0%   20.0%    19.9%
  MPE         22.0%   20.5%    20.2%    20.0%
  MPE+MMI     22.0%   20.5%    19.8%    19.4%

* I.e. last year's system, different setup.

MMI and MPE with cross-word context, on RT'03
• The new MMI setup (including "fast MMI") is no better than old MMI.
• About 1.8% improvement on RT'03 from MPE+MMI; MPE alone gives 1.4% improvement. Those numbers are 2.5% and 2.0% on RT'00.
• Comparison with the MPE results for Cambridge's 28-mix system (~170K Gaussians) from 2002: the most comparable number is a 2.2% improvement (30.4% to 28.2%) on dev01sub, using FMLLR ("constrained MLLR") and F-SAT training ["Automatic Transcription of Conversational Telephone Speech", T. Hain et al., submitted to IEEE Transactions on Speech & Audio Processing].

  RT'03       ML      Iter-1   Iter-2   Iter-3   Iter-4
  Old MMI*    29.8%
  Fast MMI    30.9%   29.9%
  MPE         30.9%
  MPE+MMI     30.9%   29.7%    29.6%    29.5%    29.1%

* Last year's system, different setup.

Part 2: Inline Lattice Rescoring

Language model rescoring – some preliminary work
• Very large LMs help, e.g. moving from a typical LM to a huge (unpruned) LM can help by 0.8% (*).
• It is very hard to build static decoding graphs for huge LMs.
• It is good to be able to efficiently rescore lattices with a different LM.
• Also useful for adaptive language modeling; adaptive language modeling gives us ~1% on the "superhuman" test set, and 0.2% on RT'03 (+).
* "Large LM", Nikolai Duta & Richard Schwartz (BBN), presentation at the 2003 EARS meeting, IDIAP, Martigny.
+ "Experiments on adaptive LM", Lidia Mangu & Geoff Zweig (IBM), ibid.

Lattice rescoring algorithm
• Taking a lattice and applying a 3- or 4-gram LM involves expanding lattice nodes.
• This can take a very large amount of time for some lattices.
• It can be solved by heavy pruning, but this is undesirable if the LMs are quite different.
• We developed a lattice LM-rescoring algorithm that finds the best path through a lattice given a different LM (*).
* (We are working on a modified algorithm that will generate rescored lattices.)

Lattice rescoring algorithm (cont'd)
• Each word instance in the lattice has k tokens (e.g. k=3).
• Each token has a partial word history ending in the current word, and a traceback to the best predecessor token (a minimal data-structure sketch follows the figure).
[Figure: a lattice fragment over the words WHY, WHEN, CAP, THE, CAT, with example tokens such as "WHY, -101", "WHEN THE, -205", "WHY THE, -210", and "WHY THE CAT, -345", each recording a word history and its score.]
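A minimal sketch of the per-word-instance token structure just described; the field names and the use of a Python dataclass are illustrative assumptions, not the actual decoder code.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Token:
    """One of the k tokens kept on a word instance in the lattice."""
    history: Tuple[str, ...]        # partial word history ending in the current word
    score: float                    # accumulated LM + acoustic cost (log domain)
    backpointer: Optional["Token"]  # best predecessor token, used for the final traceback

# Each word instance keeps only its k best tokens, with no two tokens
# sharing the same word history (the lower-scoring duplicate is dropped).
```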
Lattice rescoring algorithm (cont'd)
• For each word instance in the lattice, from left to right:
  – For each token in each predecessor word instance:
    – Add the current word to that token's word history and work out the LM & acoustic costs.
    – Delete word left-context until the word history exists in the LM as an LM context.
    – Form a new token pointing back to the predecessor token, and add it to the current word instance's list of tokens.
• Always ensure that no two tokens with the same word history exist (delete the less likely one), and always keep only the k most likely tokens.
• Finally, trace back from the most likely token at the end of the utterance.
• All done within the decoder; highly efficient.

Lattice rescoring algorithm – experiments
• To verify that it works:
  – Took the 4-gram LM used for the RT'03 evaluation and pruned it 13-fold.
  – Built a decoding graph from the pruned LM, then rescored with the original LM.
• Tested on RT'03, with an MPE-trained system with Gaussianization.

                                                       WER (RT'03)
  Big LM (132 MB)                                      28.5%
  Tiny LM (10 MB)                                      31.7%
  Tiny LM + rescoring, k=3                             28.5%
  Tiny LM + rescoring, k=2                             28.6%
  Tiny LM + rescoring, k=3, backward traces only (*)   30.1%

* See next slide.
• Note: all experiments actually include an (n-1)-word history in each token, even when not necessary. This should decrease the accuracy of the algorithm for a given k.

Lattice rescoring algorithm – forward vs. backward traces
Lattice generation algorithm:
• Both alpha and beta likelihoods are available to the algorithm.
• Whenever a word-end state's likelihood is within delta of the best path, trace back until reaching a word-beginning state whose best predecessor is a word-end state, and create a "word trace."
• Join all these word traces to form a lattice (using graph connectivity constraints).
• Equivalent to Julian Odell's algorithm (with n = infinity).
• BUT we also add "forward" traces, based on tracing forward from word beginning to word end: time-symmetric with the backward traces.
• There are fewer forward traces (due to the graph topology).
• Adding forward traces is important (0.6% hit from removing them).
• I don't believe there is much effect on the lattice oracle WER; it is the alignments of the word sequences that are affected.

Part 3: Progress in Fast Decoding

RT'03 Sub-realtime Architecture
[Figure: system architecture diagram.]

Improvements in Fast Decoding
• Switched from rank pruning to running beam pruning (a minimal sketch follows this part's results).
• Hypotheses are pruned early, based on a running max estimate updated during successor expansion, then pruned again once the final max is known.
[Figure: expansion from time t to t+1, showing max updates during expansion, pruning of states against the current max-beam, and a final pruning pass at the end.]

Runtime vs. WER: beam and rank pruning
• Resulted in a 10% decoding speed-up without loss in accuracy.
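A minimal sketch of the running beam pruning described above, assuming log-domain scores where higher is better; the function and variable names are illustrative, not taken from the actual decoder.

```python
def expand_frame(active, successors_of, beam):
    """Expand one frame with running-max beam pruning.

    active: list of (state, score) hypotheses at time t.
    successors_of: function mapping (state, score) -> list of (next_state, next_score).
    beam: pruning beam width in the log domain.

    The running max is updated as successors are generated, so clearly bad
    hypotheses are discarded immediately instead of waiting for the true
    frame maximum; a final pass prunes again once the true max is known.
    """
    running_max = float("-inf")
    survivors = []
    for state, score in active:
        for next_state, next_score in successors_of(state, score):
            if next_score > running_max:
                running_max = next_score          # max update
            if next_score >= running_max - beam:  # prune against the current max-beam
                survivors.append((next_state, next_score))
    # Final pruning pass once the true maximum for time t+1 is known.
    return [(s, sc) for s, sc in survivors if sc >= running_max - beam]
```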
Reducing the memory requirements
• Run-time memory is reduced by storing only the minimum traceback information needed to recover the Viterbi word sequence.
• Previously we stored information for a full state-level alignment; now we store only information for a word-level alignment:
  – An alpha entry has an accumulated cost and a pointer to the originating word token.
  – Two alpha vectors are used in "flip-flop" fashion.
  – Permanent word-level tokens are created only at active word ends.
• No penalty in speed, and dynamic memory is reduced by two orders of magnitude.

Part 4: Feature-Space Gaussianization

Feature space Gaussianization [Saon et al. 04]
• Idea: transform each dimension non-linearly such that it becomes Gaussian distributed.
• Motivations:
  – Perform speaker adaptation with non-linear transforms.
  – A natural form of non-linear speaker adaptive training (SAT).
  – The effort of modeling the output distribution with GMMs is reduced.
• The transform is the inverse Gaussian CDF applied to the empirical CDF: y_i = Φ^{-1}(rank(x_i) / N), where N is the number of data points (a minimal sketch appears at the end of this update).

Feature Space Gaussianization, pictorially
[Figure: the inverse Gaussian CDF (mean 0, variance 1) maps old data values, expressed as percentiles, to new (absolute) data values; the 16th, 50th and 84th percentiles map to -1, 0 and +1, so ±1 standard deviation covers 68% of the data.]

An actual transform
[Figure: an example of a learned transform.]

Feature space Gaussianization: WER
Results on RT'03 at the SAT level (no MLLR):

                          ML      MPE
  Baseline (FMLLR-SAT)    30.9%   29.1%
  Gaussianized            30.5%   28.5%

Part 5: Experiments with Fisher Data

Acoustic Training Data
Training set sizes are based on aligned frames only.

  Corpus             # frames   # hours
  Fisher 1-4         130M       361
  SWB-1              98.6M      274
  IBM Voicemail      37.9M      105
  BBN CTRANS         20.5M      57
  SWB Cellular       6.4M       18
  CallHome English   4.9M       14

• Total is 829 hours of speech; 486 hours excluding Fisher.
• Training vocabulary includes 61K tokens.
• These are first experiments with Fisher 1-4; iteration is likely to improve results.

Effect of new Fisher data on WER

                    RT-03 Switchboard   RT-03 Fisher   RT-03 overall   IBM Superhuman
  2002 System       34.1                26.0           30.2            36.8
  All data          32.7                25.1           29.1            36.7
  All less Fisher   33.2                25.7           29.6            36.2
  All less VM       32.2                25.4           28.9            36.8

• Systems are PLP, VTLN, SAT, with 60-dimensional LDA+MLLT features.
• One-shot decoding: the IBM 2003 RT-03 LM (interpolated 4-gram) for RT-03, and a generic interpolated 3-gram for Superhuman.
• Fisher data is used in the AM only, not in the LM.

Summary
• Discriminative training
  – New MPE is 0.7% better than old MMI on RT'03.
  – Used the MMI estimate rather than the ML estimate for I-smoothing with MPE (consistently gives about 0.4% improvement over standard MPE).
• LM rescoring
  – Over 10x reduction in static graph size (132 MB to 10 MB).
  – Useful for rescoring with adaptive LMs.
• Fast decoding
  – 10% speed-up from incremental application of an absolute pruning threshold.
• Gaussianization
  – 0.6% improvement on top of MPE.
  – Useful on a variety of tasks (e.g. command & control in cars).
• Fisher data
  – 1.3% improvement over last year, which did not have it (AM only).
  – Not useful in a broader context.
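To make the Part 4 transform concrete, here is a minimal per-dimension Gaussianization sketch in Python; the use of scipy's norm.ppf and the rank-based empirical CDF estimate are assumed implementation details, not the exact system code.

```python
import numpy as np
from scipy.stats import norm

def gaussianize(x):
    """Map one feature dimension to an approximately N(0, 1) distribution.

    x: 1-D array of values for a single feature dimension (e.g. all frames
    of one speaker). Each value is replaced by Phi^{-1}(rank(x_i) / N),
    i.e. the inverse Gaussian CDF applied to the empirical CDF.
    """
    n = len(x)
    ranks = np.empty(n)
    ranks[np.argsort(x)] = np.arange(1, n + 1)   # 1-based ranks
    # (n + 1) in the denominator keeps the argument strictly inside (0, 1).
    return norm.ppf(ranks / (n + 1))

# Applied per dimension and per speaker, this acts as a non-linear
# speaker-level feature transform.
```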