Transcript Outline

Combining Grapheme-to-Phoneme Converter Outputs
for Enhanced Pronunciation Generation
in Low-Resource Scenarios
Tim Schlippe, Wolf Quaschningk, Tanja Schultz
SLTU 2014 – 4th Workshop on Spoken Language Technologies for Under-resourced Languages
St. Petersburg, Russia
KIT – University of the State of Baden-Wuerttemberg and
National Research Center of the Helmholtz Association
www.kit.edu
Outline
1. Motivation and Goals
2. Experimental Setup
1. Grapheme-to-phoneme converters
2. Data
3. Experiments and Results
1. Single grapheme-to-phoneme converters' performance
2. Phoneme-level combination scheme
3. Adding web-driven grapheme-to-phoneme converters
4. Automatic speech recognition experiments
4. Conclusion and Future Work
15-May-2014
Motivation
About 7,100 languages exist in the world (www.ethnologue.com)
Only a few languages have speech processing systems
Pronunciation dictionaries needed for text-to-speech
and automatic speech recognition (ASR)
Manual production of pronunciations slow and costly
19.2–30s / word for Afrikaans (Davel and Barnard, 2004)
Automatic grapheme-to-phoneme (G2P) conversion
But: consistent pronunciations only from ~3.7k word-pronunciation pairs for training (30k phoneme tokens)
→ Methods to reduce manual effort
Goals
Common approaches use their single favorite G2P
conversion tool
Idea:
Use synergy effects of multiple G2P converters
Converters that are close in performance but whose
outputs differ in their errors
provide complementary information
→ Achieve pronunciations with higher quality
through combination of G2P converter outputs
Reduce manual effort in semi-automatic methods
Impact on ASR performance
Grapheme-to-phoneme converters
G2P converters (according to (Bisani and Ney, 2008)):
Knowledge-based
  Manual
  Rule-based: handcrafted rules, e.g. c a r s → K AX 9r S
Data-driven
  Local classification
    CART-based "t2p" (Lenzo, 1998)
  Probabilistic
    Graphone-based "Sequitur" (Bisani & Ney, 2008)
    WFST-based "Phonetisaurus" (Novak 2011)
    SMT-based "Moses" (Koehn, 2005)
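A rule-based converter of the 1:1 kind can be sketched in a few lines. The rule table below is hypothetical and only covers the graphemes of the slide example (c a r s → K AX 9r S); the function name and the choice to skip graphemes without a rule are assumptions for illustration, not the converter used in the experiments.

```python
# Hypothetical 1:1 rule table: each grapheme maps to exactly one phoneme.
# Only the four graphemes of the slide example are covered.
RULES = {"c": "K", "a": "AX", "r": "9r", "s": "S"}


def one_to_one_g2p(word, rules=RULES):
    """Map every grapheme to its single phoneme; graphemes without a rule are skipped."""
    return [rules[g] for g in word.lower() if g in rules]


print(" ".join(one_to_one_g2p("cars")))  # K AX 9r S
```

Such a table needs no training data, which is why it can still be competitive when only a few hundred word-pronunciation pairs are available.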
Data
Languages (different grades of G2P relationship):
English, German, French, Spanish
Dictionaries:
English: CMU dictionary
German, Spanish: GlobalPhone
French: Quaero Project
Data sets (randomly chosen; different small training data sizes to simulate low resources):
Training: 200, 500, 1k, 5k, 10k word-pronunciation pairs
Development / test set: 10k word-pronunciation pairs (disjoint)
Analysis of Single G2P Converter Outputs
Edit distance to reference pronunciations at
phoneme level (phoneme error rate (PER))
Lower PERs with increasing amount of training data
Lowest PERs are achieved with Sequitur and Phonetisaurus for all
languages and data sizes; Moses is also very close for de
For 200 en and fr W-P pairs, Rules outperforms Moses
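The PER used in this analysis can be sketched as a Levenshtein distance over phoneme tokens, divided by the reference length. This is a minimal per-word sketch; the function names are illustrative, and a corpus-level PER would presumably sum the edits and the reference lengths over all test words.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def per(ref, hyp):
    """Phoneme error rate in percent, relative to the reference length."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)


reference = "K AA 9r ZH".split()
print(per(reference, "K AA ZH".split()))  # one deletion in four phonemes: 25.0
```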
Phoneme-level combination scheme
Based on ROVER (Fiscus, 1997)
(Recognizer Output Voting Error Reduction)
(traditionally at word level)
Voting Module
by frequency of occurrence,
since G2P confidence scores not reliable
Phoneme-level combination scheme
Example (trained with 200 W-P pairs):
Reference: cars → K AA 9r ZH

Converter         Output        PER
Sequitur G2P      K EH 9r ZH    25%
Phonetisaurus     K AA ZH       25%
CART              K AE ZH       50%
Moses             K AA 9r S     25%
1:1 G2P (Rules)   K AX 9r S     50%
PLC output        K AA 9r ZH    0%
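The example can be reproduced with a small ROVER-style sketch: each converter output is aligned in turn to a growing network of slots, and each slot is then decided by frequency of occurrence. This is not the authors' implementation. The unit edit costs, the order in which hypotheses are added, the tie-breaking in the backtrace (an empty slot is preferred over a substitution) and all function names are assumptions.

```python
from collections import Counter

NULL = "@"  # marks an empty slot in the aligned network


def add_hypothesis(network, hyp):
    """Align phoneme list `hyp` to `network` (one slot per position,
    one token per earlier hypothesis) and return the extended network."""
    n, m = len(network), len(hyp)
    seen = len(network[0]) if network else 0

    def cost(i, j):
        # Matching is free if the phoneme already occurs in the slot.
        return 0 if hyp[j - 1] in network[i - 1] else 1

    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,
                           dp[i - 1][j - 1] + cost(i, j),
                           dp[i][j - 1] + 1)

    out, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            out.append(network[i - 1] + [NULL])        # slot without a phoneme
            i -= 1
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + cost(i, j):
            out.append(network[i - 1] + [hyp[j - 1]])  # phoneme joins the slot
            i, j = i - 1, j - 1
        else:
            out.append([NULL] * seen + [hyp[j - 1]])   # phoneme opens a new slot
            j -= 1
    return out[::-1]


def combine(hypotheses):
    """Return (voted pronunciation, aligned network)."""
    network = []
    for hyp in hypotheses:
        network = add_hypothesis(network, hyp)
    winners = [Counter(slot).most_common(1)[0][0] for slot in network]
    return [p for p in winners if p != NULL], network


outputs = ["K EH 9r ZH", "K AA ZH", "K AE ZH", "K AA 9r S", "K AX 9r S"]
result, network = combine([o.split() for o in outputs])
print(" ".join(result))  # K AA 9r ZH
```

With these five outputs the third slot holds 9r three times and the empty symbol twice, so 9r wins with a relative frequency of 0.6 even though two converters dropped it.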
Phoneme-level combination
Relative PER change compared to best single
converter output
In 10 of 16 cases → combination equal or better
Most improvement for de and en → ASR experiments
es (most regular G2P relationship): never any improvement
Wiktionary
39 Wiktionary editions with more than 1k IPA prons. (June 2012)
Growth of Wiktionary entries over several years (meta.wikimedia.org/wiki/List_of_Wiktionaries)
T. Schlippe, S. Ochs, T. Schultz: Web-based tools and methods for rapid pronunciation dictionary creation,
Speech Communication, vol. 56, pp. 101–118, January 2014
Wiktionary
Additional G2P converters based on word-pronunciation pairs in Wiktionary
[Figure: internal consistency (PER %) of the Wiktionary word-pronunciation pairs; data labels: 4.6k, 1.5k, 3.8k and 3.3k W-P pairs]
Data
Filtered web-derived pronunciations
Fully automatic methods from (Schlippe, 2012a, 2012b, 2014)
~15% with each filtering method

Language       Best method   unfiltWDP   filtWDP   Rel. change
English (en)   M2NAlign      33.18%      26.13%    +21.25%
French (fr)    Eps           14.96%      13.97%    +6.62%
German (de)    G2PLen        16.74%      14.17%    +15.35%
Spanish (es)   M2NAlign      10.25%      10.90%    -6.34%
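The "Rel. change" values follow from the two PER columns as a relative reduction, with a positive sign meaning that filtering lowered the PER. A short check, with an illustrative function name:

```python
def relative_change(unfiltered_per, filtered_per):
    """Relative PER change in percent; positive means the PER went down."""
    return 100.0 * (unfiltered_per - filtered_per) / unfiltered_per


for lang, before, after in [("en", 33.18, 26.13), ("fr", 14.96, 13.97),
                            ("de", 16.74, 14.17), ("es", 10.25, 10.90)]:
    print(lang, f"{relative_change(before, after):+.2f}%")
```

For Spanish the value is negative because the PER rises from 10.25% to 10.90% after filtering.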
Phoneme-level combination
Relative PER change compared to best single
converter output
PLC-unfiltWDP is already better than PLC-w/oWDP
23.1% rel. PER reduction
Filtering web-derived pronunciations helps
ASR experiments
Replace dictionaries in de & en recognizers with
pronunciations generated with G2P converters
Train and decode the systems
Word Error Rate (WER)
• As in PER evaluation: Sequitur and Phonetisaurus very good in most cases
• However: Rules results in lowest WERs for most scenarios
In only 1 case → PLC-w/oWDP better than or equal to the best single converter
Filtering web-derived word-pronunciation pairs helps.
Confusion Network Combination (CNC) outperforms PLC
In 9 cases → adding a system with PLC helps in CNC
Conclusion and Future Work
In most cases, PLC comes closer to the validated reference
pronunciations than the single converters
Web-derived word-pronunciation pairs can further improve
quality (filtering the web data is helpful)
Weighting single G2P converters' outputs gave no
improvement, neither
according to performance on dev set nor
according to converters' confidences
Potential to enhance semi-automatic pronunciation dictionary
creation by reducing the human editing effort
Positive impact of the combination in terms of lower PERs
had only little influence on the WERs of our ASR systems
Including systems whose pronunciation dictionaries have
been built with PLC in CNC can lead to improvements
Future work:
Embedding PLC and web-derived pronunciations into semi-automatic pronunciation dictionary creation
Further languages and further G2P converters
Thank you for your attention!
ASR experiments
In 6 cases → system with PLC better than or equal to the best single converter
Data
Filtered web-derived pronunciations
Threshold for each filtering method dependent on
mean µ and standard deviation σ of measure in focus
1st-stage filtering (Len / Eps / M2NAlign)
(Black et al., 1998), (Martirosian and Davel, 2007), (Schlippe, 2012a, 2012b):
word-pronunciation pairs → prefiltering → filtered word-pronunciation pairs
2nd-stage filtering (G2P):
filtered word-pronunciation pairs → train "reliable" G2P model → apply G2P model to words → edit distance < threshold → remaining word-pronunciation pairs
Phoneme-level combination scheme
Example (trained with 200 W-P pairs):
Reference: cars → K AA 9r ZH
Sequitur (25% PER):        K EH 9r ZH
Phonetisaurus (25% PER):   K AA ZH
CART (50% PER):            K AE ZH
Moses (25% PER):           K AA 9r S
1:1 G2P (50% PER):         K AX 9r S
Phoneme-level combination scheme
Alignment Module (@ = empty slot)
K   EH   9r   ZH
K   AA   @    ZH
K   AE   @    ZH
K   AA   9r   S
K   AX   9r   S
Voting Module
by frequency of occurrence,
since G2P confidence scores not reliable
K (1)   AA (0.4)   9r (0.6)   ZH (0.6)
Data
Filtered WDPs
German: G2PLen
1. Remove a pronunciation if the ratio of grapheme and phoneme
tokens is smaller than µLen – σLen or larger than µLen + σLen.
2.1. Train G2P models with the remaining, more "reliable" W-P pairs.
2.2. Apply the G2P models to convert a grapheme string into a
most likely phoneme string.
2.3. Remove a pronunciation if the edit distance between the
synthesized phoneme string and the pronunciation in question
is smaller than µG2P – σG2P or larger than µG2P + σG2P.
→ PER reduction: 16.74 → 14.17
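The µ ± σ thresholding shared by the filtering methods can be sketched generically: compute a measure for every word-pronunciation pair and keep the pairs whose value lies within one standard deviation of the mean. The toy pairs, the function names, the direction of the ratio (graphemes divided by phonemes) and the use of the population standard deviation are assumptions; the slides do not specify them.

```python
from statistics import mean, pstdev


def filter_by_sigma(pairs, measure):
    """Keep the pairs whose measure lies within [mu - sigma, mu + sigma]."""
    values = [measure(word, pron) for word, pron in pairs]
    mu, sigma = mean(values), pstdev(values)
    return [pair for pair, v in zip(pairs, values)
            if mu - sigma <= v <= mu + sigma]


def length_ratio(word, pron):
    """Len measure: number of grapheme tokens per phoneme token."""
    return len(word) / len(pron)


# Hypothetical toy data; the last entry imitates a noisy web-derived pair.
pairs = [
    ("cars", ["K", "AA", "9r", "ZH"]),
    ("cat", ["K", "AE", "T"]),
    ("dog", ["D", "AO", "G"]),
    ("sun", ["S", "AH", "N"]),
    ("hi", ["HH", "AH", "L", "OW", "W", "ER", "L", "D"]),
]
kept = filter_by_sigma(pairs, length_ratio)
print([word for word, _ in kept])  # ['cars', 'cat', 'dog', 'sun']
```

The same function could serve the second stage by passing a measure that returns the edit distance between a pair's pronunciation and the output of the "reliable" G2P model.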
Data
Filtered WDPs
English, Spanish: M2NAlign
1. Perform an m-n G2P alignment (Black et al., 1998).
2. Remove a pronunciation if the alignment score is smaller than
µG2P – σG2P or larger than µG2P + σG2P.
→ English PER reduction: 33.18 → 26.13
→ Spanish PER change: 10.25 → 10.90 (increase)
Data
Filtered WDPs
French: Eps (according to (Martirosian and Davel, 2007))
1. Perform a 1-1 G2P alignment (Black et al., 1998).
The alignment process involves the insertion of graphemic and
phonemic nulls (epsilons) into the lexical entries of words.
2. Remove a pronunciation if the proportion of graphemic and
phonemic nulls is smaller than µG2P – σG2P or larger than µG2P + σG2P.
→ PER reduction: 14.96 → 13.97