Integrating Speech Recognition
and Machine Translation
Spyros Matsoukas, Ivan Bulyko,
Bing Xiang, Kham Nguyen,
Richard Schwartz, John Makhoul
Integration Issues
The Machine Translation (MT) system is trained on text data,
so it expects
– segments that correspond to foreign sentences
– properly placed punctuation marks
– numbers, dates, monetary amounts, abbreviations, etc., as they
appear in ordinary text
However, Speech-To-Text (STT) output
– is segmented automatically on long pauses
• resulting segments may be too short, or may cross sentence
boundaries
– has no punctuation
• punctuation needs to be automatically added prior to translation
– has numbers, dates, etc., in spoken form
• output can be parsed to convert numbers to written form
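The number conversion step mentioned above can be illustrated with a minimal sketch. It uses English number words purely for readability (the systems here process Arabic), and the function names `words_to_number` and `normalize_numbers` are hypothetical, not part of the described system. It also ignores ambiguous cases such as "one" used as a pronoun, or two separate numbers spoken back to back.

```python
# Illustrative only: English number words stand in for the spoken-form
# numbers found in STT output. All names here are hypothetical.
UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}
SCALES = {"thousand": 1_000, "million": 1_000_000}
NUMBER_WORDS = set(UNITS) | set(TENS) | set(SCALES) | {"hundred"}


def words_to_number(words):
    """Convert a run of number words (e.g. ['two', 'hundred']) to an int."""
    total, current = 0, 0
    for w in words:
        if w in UNITS:
            current += UNITS[w]
        elif w in TENS:
            current += TENS[w]
        elif w == "hundred":
            current = max(current, 1) * 100
        elif w in SCALES:
            # Close the current group and scale it (a bare "thousand" = 1000).
            total += max(current, 1) * SCALES[w]
            current = 0
    return total + current


def normalize_numbers(text):
    """Replace each maximal run of number words with its written (digit) form."""
    out, run = [], []
    for tok in text.split():
        if tok in NUMBER_WORDS:
            run.append(tok)
            continue
        if run:
            out.append(str(words_to_number(run)))
            run = []
        out.append(tok)
    if run:
        out.append(str(words_to_number(run)))
    return " ".join(out)


print(normalize_numbers("the bank lost two hundred fifty thousand dollars"))
```

A production normalizer would also need to cover dates, monetary amounts and abbreviations, which this sketch does not attempt.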
STT/MT Pipeline
Initial set of experiments ran MT on the 1-best
hypothesis from STT
STT Components
STT-A
– EARS RT04 Arabic BN system
– Word pronunciations based on graphemes
– Acoustic models estimated using Maximum Mutual Information
(MMI) and Speaker Adaptive Training (SAT) on 100 hours of BN
audio data
– 3-gram language model trained on 400 million words of news
text
STT-B
– Uses morphological analyzer and automatic methods to infer
short vowels in word pronunciations
– Trained on an additional 50 hours of acoustic training data
STT-C
– Makes use of additional language model training data
MT Components
MT-A
– System developed during the period Sep 2004 – Apr 2005
– Phrase-based translation model, trained on 100M words of
Arabic/English UN and news bitext
– 3-gram English LM, trained on 2 billion words of text (mostly
newswire)
– Translation based on posterior probability P(English | Foreign)
MT-B
– Uses a combination of generative and posterior translation
probabilities
– Includes a phrase segmentation score
– Uses a method to compensate for over-estimated translation
probabilities
– Optimizes decoding weights by minimizing TER on N-best lists
TER results on the 2002 and 2004 MT Eval sets
System Id    2002     2004
MT-A         48.29    46.31
MT-B         46.35    45.55
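MT-B is described as optimizing decoding weights by minimizing TER on N-best lists. The slides do not say which search procedure is used, so the sketch below substitutes an exhaustive grid search as a stand-in. The data layout (a `features` vector and a precomputed `edits` count per hypothesis) and the function names are assumptions made for illustration.

```python
import itertools


def rerank_edits(nbest_lists, weights):
    """Sum the edit counts of the top-scoring hypothesis in each N-best list."""
    total = 0
    for nbest in nbest_lists:
        best = max(
            nbest,
            key=lambda h: sum(w * f for w, f in zip(weights, h["features"])),
        )
        total += best["edits"]
    return total


def tune_weights(nbest_lists, ref_words, grid):
    """Grid search (a stand-in optimizer) for weights minimizing corpus TER.

    ref_words is the total number of reference words in the tuning set.
    Returns the first weight tuple that reaches the lowest TER.
    """
    n_feats = len(nbest_lists[0][0]["features"])
    best_w, best_ter = None, float("inf")
    for weights in itertools.product(grid, repeat=n_feats):
        ter = 100.0 * rerank_edits(nbest_lists, weights) / ref_words
        if ter < best_ter:
            best_w, best_ter = weights, ter
    return best_w, best_ter


# Toy tuning set: two segments, two hypotheses each, with two
# hypothetical log-probability features (translation model, language model).
toy = [
    [{"features": (-1.0, -5.0), "edits": 3}, {"features": (-2.0, -2.0), "edits": 1}],
    [{"features": (-1.0, -4.0), "edits": 2}, {"features": (-3.0, -3.5), "edits": 0}],
]
print(tune_weights(toy, ref_words=10, grid=(0.0, 0.5, 1.0)))
```

Grid search scales exponentially with the number of features, so it is only practical for a handful of weights; it is used here because it makes the objective (corpus-level edits divided by reference words) easy to see.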
Test Data
Tested integration on bnat05
– 6-hour set from several sources from Jan 2001 and Nov 2003
– Test set consists of both Modern Standard Arabic (MSA) and
Arabic dialect segments
All system comparisons based on TER
– MT system output automatically scored against single
reference transcription, with mixed case
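As a rough illustration of the scoring, the sketch below computes a word-level edit rate against a single reference without lowercasing, so case mismatches count as errors. It is a simplification: full TER also counts a block shift of a word sequence as a single edit, which this sketch omits, so it corresponds more closely to WER applied to translations. Function names are hypothetical.

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(
                prev[j] + 1,             # hyp word has no counterpart in ref
                cur[j - 1] + 1,          # ref word is missing from hyp
                prev[j - 1] + (h != r),  # match or substitution (case-sensitive)
            ))
        prev = cur
    return prev[-1]


def edit_rate(hyps, refs):
    """Corpus-level edit rate in percent: total edits / total reference words."""
    edits = sum(edit_distance(h.split(), r.split()) for h, r in zip(hyps, refs))
    ref_words = sum(len(r.split()) for r in refs)
    return 100.0 * edits / ref_words


print(edit_rate(["the cat sat"], ["the cat sat down"]))  # one missing word of four
```

Because the rate is normalized by the reference length, hypotheses and references must be paired one-to-one by segment before scoring.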
Integration Results
Effect of STT accuracy, segmentation and punctuation on MT
accuracy
System STT/MT    STT WER   STT Segmentation   Punctuation   TER
STT-A, MT-A      22.2      auto               period        66.8
STT-B, MT-A      18.3      auto               period        65.9
STT-B, MT-B      18.3      auto               period        64.6
STT-B, MT-B      17.6      reference          period        61.9
REF, MT-B        0.0       reference          period        58.7
REF, MT-B        0.0       reference          reference     58.0
At current MT performance level:
– large improvements in STT accuracy result in small TER gain
– significant TER reduction (2.7% absolute) can be obtained by
improving sentence boundary detection
– full punctuation helps translation only marginally
Optimizing STT segmentation for MT
Tuned the audio segmentation procedure in order to output
segments that match the reference in terms of average length
System STT/MT    STT WER   Avg. Segment Length (sec)   TER
STT-B, MT-B      18.3      6.17                        64.6
STT-C, MT-B      17.8      6.17                        64.4
STT-C, MT-B      17.7      9.47                        63.1
STT-C, MT-B      17.7      13.60                       62.8
1.6% absolute TER gain from optimizing segmentation
Additional gains can be obtained by
– Converting spoken numbers to written form prior to translation
(0.4-0.5% TER reduction)
– Redefining STT output segmentation using linguistic information
Sentence Boundary Detection (SBD)
Used a hidden-event language model (HELM) to detect
sentence boundaries in the 1-best STT output
– 4-gram HELM, trained on 850M words of Arabic news with
Kneser-Ney smoothing
– Silence duration can be integrated as an observation in the HMM
search
Explored various configurations
– SBD-1: Use only LM to insert periods within speaker turns
– SBD-2: Use LM and silence duration jointly
– SBD-3: Bias the LM to insert boundaries at a higher rate
(by 30-50%), then remove boundaries with lowest model
posteriors while constraining the maximum sentence length
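The pruning step of SBD-3 can be sketched as follows. This is one possible reading of the description, not the authors' implementation: over-generated boundaries are visited from lowest to highest posterior, and a boundary is removed only if the merged sentence stays within the maximum length. The function name, the `n_keep` stopping criterion and the word-count length measure are all assumptions.

```python
def prune_boundaries(n_words, boundaries, max_len, n_keep):
    """Remove low-posterior boundaries without exceeding max_len words.

    boundaries maps a 0-based word index i (boundary after word i) to the
    model posterior of that boundary. Pruning stops once n_keep boundaries
    remain or no further boundary can be removed.
    """
    kept = set(boundaries)
    for pos in sorted(boundaries, key=boundaries.get):  # lowest posterior first
        if len(kept) <= n_keep:
            break
        # Nearest surviving boundaries on each side of pos.
        left = max((b for b in kept if b < pos), default=-1)
        right = min((b for b in kept if b > pos), default=n_words - 1)
        # Removing pos merges words left+1 .. right into one sentence.
        if right - left <= max_len:
            kept.discard(pos)
    return sorted(kept)


# Ten words, four over-generated boundaries, keep two of them.
posteriors = {2: 0.9, 4: 0.2, 6: 0.6, 8: 0.3}
print(prune_boundaries(10, posteriors, max_len=5, n_keep=2))
```

With a tighter `max_len`, fewer boundaries can be removed, so the final boundary count may stay above `n_keep`; the length constraint takes priority over the target count in this sketch.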
SBD Results
Effect of HELM-based SBD on MT accuracy, starting from one of
two audio segmentations
– audio-seg-1: 9.47 sec / segment
– audio-seg-2: 13.60 sec / segment
SBD Configuration
Baseline audio
segmentation
SBD-1
SBD-2
SBD-3
TER (TER-MSA)
audio-seg-1
audio-seg-2
62.55 (60.32)
62.37 (60.28)
62.66 (60.25)
62.49 (60.20)
62.32 (59.78)
62.81 (60.42)
62.79 (60.28)
62.34 (60.02)
HELM has larger effect on Modern Standard Arabic (MSA) regions,
where STT accuracy is high
SBD can be applied safely on top of any audio segmentation
Optimizing MT on Speech Data
MT accuracy can be enhanced by optimizing MT
decoding weights on broadcast speech data
– Optimization can compensate for differences in style between
newswire text and STT transcript (esp. on broadcast
conversations)
Optimization Issue:
– MT optimization requires one-to-one mapping between
translation hypotheses and references on the tuning set
– Non-trivial to tune on translations of automatically segmented
STT output
Solutions:
– Re-segment STT output according to reference segmentation
prior to translation, then use translation hypotheses for tuning
– Tune based on translations of the STT reference transcriptions
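The first solution, re-segmenting STT output to follow the reference segmentation, requires aligning hypothesis words to reference words. The sketch below assumes a word-level alignment from Python's `difflib.SequenceMatcher` as a stand-in for whatever aligner was actually used; the function name `resegment` is hypothetical.

```python
from difflib import SequenceMatcher


def resegment(hyp_segments, ref_segments):
    """Re-cut STT words so there is one hypothesis segment per reference segment."""
    hyp = [w for seg in hyp_segments for w in seg.split()]
    ref = [w for seg in ref_segments for w in seg.split()]

    # Reference cut points, as word offsets into the concatenated reference.
    ref_cuts, pos = [], 0
    for seg in ref_segments[:-1]:
        pos += len(seg.split())
        ref_cuts.append(pos)

    # Align reference (a) to hypothesis (b); opcodes cover both sequences fully.
    opcodes = SequenceMatcher(a=ref, b=hyp, autojunk=False).get_opcodes()

    # Map each reference cut to a hypothesis offset through the alignment.
    hyp_cuts = []
    for cut in ref_cuts:
        for _tag, i1, i2, j1, j2 in opcodes:
            if i1 <= cut <= i2:
                hyp_cuts.append(min(j1 + (cut - i1), j2))
                break

    bounds = [0] + hyp_cuts + [len(hyp)]
    return [" ".join(hyp[s:e]) for s, e in zip(bounds, bounds[1:])]


stt = ["the president arrived", "today he met a minister"]
reference = ["the president arrived today", "he met the minister"]
print(resegment(stt, reference))
```

The output has exactly one segment per reference segment, which gives the one-to-one mapping between translation hypotheses and references that tuning requires. Cut points that fall inside a misrecognized region are placed approximately.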
MT Optimization Results
Updated development sets
Purpose       Broadcast News   Broadcast Conversations
Tuning        bnat06           bcat06
Validation    bnad06           bcad06
Results
OptSet     bnad06   bcad06
MT02       67.4     73.3
BNC-STT    66.9     71.9
BNC-REF    66.8     71.9
MT02: tuning on translations of the 2002 NIST MT evaluation set
BNC-STT: tuning on translations of manually segmented (according to
reference) STT output
BNC-REF: tuning on translations of reference transcripts
Conclusions and Future Research
Results on 1-best STT/MT integration show that
sentence boundary detection has a large impact on MT
performance
– Segmentation should be based on both audio and STT
transcript
Better performance is expected by coupling STT and
MT more tightly
– Have begun running MT on consensus networks from STT
output
– Will explore joint optimization of STT and MT system
parameters
At current operating point, improvements in MT will
have the largest effect