Integrated Stochastic Pronunciation Modeling
Dong Wang
Supervisors: Simon King, Joe Frankel, James Scobbie
Contents
 Problems we are addressing
 Previous research
 Integrated stochastic pronunciation modeling
 Current experimental results
 Work plan
Problems we are addressing
1. Constructing a lexicon is time consuming.
2. Traditional lexicon-based triphone systems lack robustness to pronunciation variation in real speech.
 Linguistics-based lexica seldom consider real speech.
 Deterministic decomposition from words to acoustic units, through lexica and decision trees.
Previous research
 Alternative pronunciation generation: utilize real speech to expand the lexicon.
 Automatic lexicon generation: utilize real speech to create a lexicon.
 Hidden sequence modeling (HSM): build a probabilistic mapping from phonemes to context-dependent phones.
Previous research
Problems:
1. Linguistics-based lexica
2. Deterministic mapping
Integrated stochastic pronunciation modeling
Integrated Stochastic Pronunciation Modeling (ISPM)
 Build a flexible three-layer architecture which represents pronunciation variation with probabilistic mappings, achieving better performance than traditional triphone-based systems.
 Focus on the grapheme-based ISPM system, eliminating human effort in lexicon construction.
Integrated stochastic pronunciation modeling
Grapheme-based ISPM
Integrated stochastic pronunciation modeling
 Spelling simplification model (SSM)
   Maps a letter string with a regular pronunciation onto a simple grapheme according to its context, e.g. EA -> E. (See the sketch after this slide.)
   Maps a letter string with several pronunciations onto several simple graphemes, with appearance probabilities attached, e.g. OUGH -> O (0.6), AF (0.4).
   Examining the transcription from grapheme decoding against the reference transcription helps find the mapping.
 Grapheme pronunciation model (GPM)
   The probabilistic mapping between the canonical layer and the acoustic layer. LMs, decision trees and ANNs can all be examined here.
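To make the SSM and GPM concrete, here is a toy Python sketch of how the two probabilistic mappings chain together; the table entries, probabilities and acoustic-unit names are illustrative assumptions, not values from the system.

```python
# Toy sketch of the SSM and GPM mappings (hypothetical entries, probabilities and
# acoustic-unit names; not the actual models).

# SSM: letter strings -> simple graphemes, with appearance probabilities attached.
ssm = {
    "ea":   [("e", 1.0)],               # regular pronunciation -> one simple grapheme
    "ough": [("o", 0.6), ("af", 0.4)],  # several pronunciations -> several graphemes
}

# GPM: canonical grapheme -> acoustic units, again with probabilities.
gpm = {
    "e":  [("iy", 0.8), ("eh", 0.2)],
    "o":  [("ow", 0.7), ("ao", 0.3)],
    "af": [("ah f", 1.0)],
}

def expand(letter_string):
    """List acoustic-unit realisations of a letter string with joint probabilities."""
    out = []
    for grapheme, p_g in ssm.get(letter_string, [(letter_string, 1.0)]):
        for units, p_u in gpm.get(grapheme, [(grapheme, 1.0)]):
            out.append((units, p_g * p_u))
    return out

print(expand("ough"))  # ow 0.42, ao 0.18, "ah f" 0.4 (up to float rounding)
```

In the direct-grapheme configuration the SSM table above would collapse to a 1:1 identity mapping, and only the GPM carries the variation.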
Integrated stochastic pronunciation modeling
 Why graphemes?
   A simple relationship between word spellings and sub-word units helps generate baseforms for any word, avoiding human effort on lexicon construction.
   It is easy to handle OOV words and to reconstruct words from grapheme strings.
   Building and applying grapheme-based LMs is simple.
   The internal composition of phonological rules and acoustic clues makes the approach suitable for applications such as spoken term detection and language identification.
Integrated stochastic pronunciation modeling
 Direct grapheme ISPM
Direct grapheme ISPM: the SSM is a 1:1 mapping
Integrated stochastic pronunciation modeling
 Hidden grapheme ISPM
Hidden grapheme ISPM: the SSM is an n:m mapping
Integrated stochastic pronunciation modeling
 Training
   A divide-and-conquer approach, as in HSM, will be used for ISPM training. With this approach, the SSM, GPM and AM are optimized iteratively and alternately within an EM framework, which ensures that the process converges to a local optimum. (A schematic sketch follows this slide.)
   The acoustic units will be grown from a set of initial single-letter grapheme HMMs, as in the automatic lexicon generation approach.
 Decoding
   The optimized ISPM is used to expand the search graphs fed to the Viterbi decoder. No changes are required in the decoder itself.
 Implementation steps
   The SSM and GPM are well separated, so they can be designed and implemented independently and then combined. The SSM is relatively simpler and will therefore be implemented first.
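Below is a minimal, schematic Python sketch of the alternating optimisation described in the Training bullet; all function names are placeholders of this sketch, showing only the control flow, not the actual re-estimation formulas.

```python
# Schematic of the divide-and-conquer ISPM training described above (not real code).
# Each re-estimation step is stubbed out; in practice each would be an EM-style
# update that does not decrease the training likelihood, so the alternating loop
# converges to a local optimum.

def reestimate_am(data, ssm, gpm, am):
    return am    # placeholder: update grapheme HMMs, given fixed SSM and GPM

def reestimate_gpm(data, ssm, am, gpm):
    return gpm   # placeholder: update the canonical-to-acoustic mapping

def reestimate_ssm(data, gpm, am, ssm):
    return ssm   # placeholder: update the letter-to-canonical-grapheme mapping

def train_ispm(data, n_iter=10):
    ssm = {}     # starts as a 1:1 letter-to-grapheme mapping
    gpm = {}     # starts as a flat canonical-to-acoustic mapping
    am = {}      # starts from single-letter grapheme HMMs
    for _ in range(n_iter):
        am = reestimate_am(data, ssm, gpm, am)
        gpm = reestimate_gpm(data, ssm, am, gpm)
        ssm = reestimate_ssm(data, gpm, am, ssm)
    return ssm, gpm, am

ssm, gpm, am = train_ispm(data=[])
```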
Integrated stochastic pronunciation modeling
 The proposed ISPM will be evaluated on three tasks:
   Large vocabulary speech recognition (LVSR)
   Spoken term detection (STD)
   Language identification (LID)
        Simplest grapheme   Simple grapheme   Direct grapheme   Hidden grapheme
        (NONO)              (SSM)             (GPM)             (SSM+GPM)
LVSR    -                   ★                 ★                 ★★
STD     -                   ★                 ★★                ★★
LID     -                   ★                 -                 ★
Performance gain expectation from ISPM
Current experimental results
 Large vocabulary speech recognition
Data corpora for the LVSR task: WSJCAM0 for read speech and RT04S for spontaneous speech in the meeting domain

          Training (h)   Development (h)   Evaluation (h)
WSJCAM0   14.9           0.65              1.00
RT04S     103.9          1.40              1.66
Experiment settings for the LVSR task

                 WSJCAM0      RT04S
Training voc     WSJCAM0      CMU+festival
Test voc         WSJ-5k       CMU
Language model   WSJ 3-gram   AMI 3-gram
Current experimental results
 Large vocabulary speech recognition
Experimental results of the LVSR task

          Phoneme system (WER)   Grapheme system (WER)
WSJCAM0   11.3%                  15.8%
RT04S     44.5%                  54.5%

Contribution of context-dependent modeling

          CI (WER)   CD (WER)
Phoneme   21.2%      9.8%
Grapheme  48.4%      13.0%
Current experimental results
 Large vocabulary speech recognition
Contribution of phonology-oriented questions to the grapheme system

Phoneme (WER)   Grapheme, extended questions (WER)   Grapheme, singleton questions (WER)
11.3%           15.8%                                16.5%
Conclusions
 The grapheme-based system usually performs worse than the phoneme-based one, especially on the RT04S meeting-domain task, where a 10% absolute performance degradation is observed.
 A grapheme-based system relies on context-dependent modeling more than a phoneme-based system does, and requires more Gaussian mixture components.
 State-tying questions that reflect phonological rules are helpful.
 Other experiments showed that manually designed multi-letter graphemes do not help significantly.
Current experimental results
 Spoken term detection
Sub-word lattice-based architecture for STD
Current experimental results
 Spoken term detection
STD performance on the RT04S task

       Phone    Grapheme
FOM    20.5     18.0
OCC    0.44     0.34
ATWV   0.25     0.16
WER    44.5%    54.5%


 Figure of Merit (FOM): average detection rate over the range of [1, 10] false alarms per hour.
 Occurrence-weighted value (OCC):
   OCC = \frac{\sum_{term} \left\{ N_{correct}(term) - 0.1\, N_{spurious}(term) \right\}}{\sum_{term} N_{true}(term)}
 Actual term-weighted value (ATWV):
   ATWV = 1 - \operatorname{average}_{term} \left\{ P_{Miss}(term) + \beta\, P_{FA}(term) \right\}
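To make the OCC and ATWV definitions concrete, here is a small Python sketch computing both from hypothetical per-term counts; the false-alarm weight beta and the approximation of the number of non-target trials by the amount of speech are assumptions of this sketch, not taken from the slides.

```python
# Sketch of the OCC and ATWV computations above, on hypothetical per-term counts.

def occ(counts):
    """OCC = sum_t {N_correct(t) - 0.1 * N_spurious(t)} / sum_t N_true(t)."""
    num = sum(c["correct"] - 0.1 * c["spurious"] for c in counts.values())
    den = sum(c["true"] for c in counts.values())
    return num / den

def atwv(counts, beta, speech_seconds):
    """ATWV = 1 - average_t {P_miss(t) + beta * P_FA(t)}.

    The number of non-target trials per term is approximated here by the amount
    of speech in seconds minus the true occurrences (an assumption of this sketch).
    """
    losses = []
    for c in counts.values():
        p_miss = 1.0 - c["correct"] / c["true"]
        p_fa = c["spurious"] / (speech_seconds - c["true"])
        losses.append(p_miss + beta * p_fa)
    return 1.0 - sum(losses) / len(losses)

# Hypothetical counts for two terms in one hour (3600 s) of evaluation speech.
counts = {
    "edinburgh": {"true": 10, "correct": 8, "spurious": 2},
    "lexicon":   {"true": 5,  "correct": 3, "spurious": 1},
}
print(occ(counts))                                      # ~0.71
print(atwv(counts, beta=999.9, speech_seconds=3600.0))  # ~0.28
```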
Current experimental results
 Spoken term detection
• A grapheme-based STD system is attractive because OOV words can be handled easily and the lattice search is efficient and simple.
• In our experiments the phoneme-based STD system works better. We suppose this is because some rare terms are harder for the grapheme-based system to recognize.
• If similar ASR performance can be achieved, the grapheme-based system will outperform the phoneme-based one, as shown in the accompanying figure.
Current experimental results
 Spoken term detection
We have demonstrated that in Spanish, which has a simple grapheme-phoneme relationship and where the phoneme- and grapheme-based systems achieve close ASR performance, the grapheme-based STD system outperforms the phoneme-based one.
Current experimental results
 Language identification
Parallel phone/grapheme recognizer architecture for LID
Current experimental results
 Language identification
DER (%)               Phone   Grapheme   Phone+grapheme
Unit likelihood       35.6    32.1       27.9
Sentence likelihood   46.8    39.6       39.4
• GlobalPhone is used for the initial experiments, but we will move to NIST standard corpora.
• Detection error rate (DER), defined as the number of incorrect detections divided by the total number of trials, is used as the metric. Results on 3 seconds of speech across 4 languages are reported.
• Both whole-sentence scores and scores averaged over sub-word units are tested as the ANN input. (A toy scoring sketch follows this slide.)
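As a rough illustration of the scoring just described, here is a toy Python sketch; the scores are made up and a simple weighted average stands in for the ANN backend, so it only shows the shape of the computation (per-unit likelihood averaging, fusion, and DER).

```python
# Toy sketch of the parallel phone/grapheme LID scoring (made-up numbers; a simple
# weighted average stands in for the ANN backend used in the experiments).

def average_unit_likelihood(unit_scores):
    """Average per-unit log-likelihood, as in the 'unit likelihood' rows above."""
    return sum(unit_scores) / len(unit_scores)

def detection_error_rate(n_incorrect, n_trials):
    """DER = number of incorrect detections / total number of trials."""
    return n_incorrect / n_trials

# One 3-second trial: per-unit scores from the phone and the grapheme recognisers
# for each candidate language (hypothetical values).
phone_scores = {"english": [-2.1, -1.8, -2.4], "german": [-2.6, -2.2, -2.9]}
grapheme_scores = {"english": [-2.3, -2.0, -2.2], "german": [-2.8, -2.5, -2.7]}

fused = {
    lang: 0.5 * average_unit_likelihood(phone_scores[lang])
        + 0.5 * average_unit_likelihood(grapheme_scores[lang])
    for lang in phone_scores
}
print(max(fused, key=fused.get))        # hypothesised language for this trial
print(detection_error_rate(279, 1000))  # e.g. 0.279 -> 27.9% DER over 1000 trials
```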
Work plan
 Phase I: Simple grapheme-based system
1. Finish the STD experiments with high-order LMs (by Jan. 2008).
2. Finish the LID-oriented tuning (by Nov. 2007).
3. Apply powerful LMs to the LID task (by Jan. 2008).
4. Finish the SSM design (by Jan. 2008).
5. Apply the SSM to LVSR on RT04S and to STD (by Feb. 2008).
 Phase II: Integrated stochastic pronunciation modeling
1. Finish the direct-grapheme architecture (GPM) design (by Jul. 2008).
2. Test the direct-grapheme architecture on the RT04S LVSR task (by Oct. 2008).
3. Finish the hidden-grapheme architecture (GPM+SSM) (by Jan. 2009).
4. Test the hidden-grapheme architecture on the RT04S LVSR task (by Feb. 2009).
 Phase III: Applications based on ISPM
1. Finish the test on the STD task (by May 2009).
2. Finish the test on the LID task (by May 2009).