Transcript Document

Towards ASR Without Pronunciation Dictionary, Transcribed Speech
and Text Resources in the Target Language Using Cross-Lingual
Word-to-Phoneme Alignment
Felix Stahlberg, Tim Schlippe, Stephan Vogel, Tanja Schultz
Cognitive Systems Lab
Karlsruhe Institute of Technology
KIT – University of the State of Baden-Württemberg and
National Research Center of the Helmholtz Association
14 May 2014
4th International Workshop on Spoken Language Technologies for
Under-resourced Languages (SLTU'14)
www.kit.edu
Goal
• Exploit the phonetic output of a human simultaneous translator
• Translating between a resource-rich source language and an under-resourced target language
• Bootstrap an ASR system without any linguistic knowledge of the target language
16.07.2015
Towards ASR Without Pronunciation Dictionary, Transcribed Speech and Text Resources Using Cross-Lingual
Word-to-Phoneme Alignment – F. Stahlberg, T. Schlippe, S. Vogel, T. Schultz
http://en.wikipedia.org/wiki/File:Akha_laos_11_03d.jpg
Applications
Dialects
Speech processing for non-written and under-resourced languages
Scenario
Say “I am sick.” in your mother tongue.
/b/ /o/ /l/ /a/ /n/ /s/ /e/ /m/
Say “I am healthy.” in your mother tongue.
/z/ /d/ /r/ /a/ /v/ /s/ /e/ /m/
• /b/ /o/ /l/ /a/ /n/ seems to be a word (meaning sick)
• /z/ /d/ /r/ /a/ /v/ seems to be a word (meaning healthy)
• /s/ /e/ /m/ seems to be a word (meaning I am)
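The inference on this slide can be sketched in a few lines of Python. This is only a toy illustration, assuming the shared word sits at the end of both utterances; the helper name common_suffix is hypothetical and not part of the presented system, which relies on cross-lingual word-to-phoneme alignment rather than suffix matching.

```python
def common_suffix(a, b):
    """Return the longest run of phonemes shared at the end of both sequences."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return a[len(a) - n:]


# Phoneme sequences from the two utterances on the slide.
sick = "b o l a n s e m".split()
healthy = "z d r a v s e m".split()

shared = common_suffix(sick, healthy)                # candidate word for "I am"
rest_sick = sick[:len(sick) - len(shared)]           # candidate word for "sick"
rest_healthy = healthy[:len(healthy) - len(shared)]  # candidate word for "healthy"

print("shared:", " ".join(shared))
print("sick:", " ".join(rest_sick))
print("healthy:", " ".join(rest_healthy))
```

The shared part and the two remainders correspond to the three word hypotheses listed in the bullets above.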
Word-to-Phoneme Alignments
Sentence, English (Source Language): I am sick.
Audio, Slovene (Target Language) -> Phoneme Recognizer
Phoneme sequence: b o l a n s e m
(Stahlberg et al., 2012)
(Stüker and Besacier, 2009)
(Stüker and Waibel, 2008)
(Besacier et al., 2006)
Pronunciation Extraction
Step 1: Phoneme Recognition
  Croatian Phoneme Recognizer: z d r a v s e m
  Croatian Phoneme Recognizer: b o l a n s e m
Step 2: Alignment
  Source de: Ich bin gesund    Target si: z d r a v s e m
  Source de: Ich bin krank     Target si: b o l a n s e m
  Source en: I am healthy      Target si: z d r a v s e m
  Source en: I am sick         Target si: b o l a n s e m
Step 3: Clustering
  𝐶1 : vsem, sem, em, sem
  𝐶2 : bolans, bolan
  𝐶3 : zdrav, zdra
Step 4: Dictionary Generation
  𝑖    Written Form: p2g_hr(𝜇(𝐶𝑖))    Pronunciation: 𝜇(𝐶𝑖)
  1    sem                            s e m
  2    bolan                          b o l a n
  3    zdrav                          z d r a v
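Step 4 can be sketched as a small mapping from written form to pronunciation. This is a minimal sketch, not the authors' implementation: the cluster means are copied from the slide, and toy_p2g is a hypothetical stand-in that simply concatenates phoneme symbols, whereas the slides use a Croatian phoneme-to-grapheme model (p2g_hr) whose details are not given here.

```python
def toy_p2g(phonemes):
    """Hypothetical stand-in for p2g_hr: concatenate the phoneme symbols."""
    return "".join(phonemes)


# Cluster means mu(C_i) as shown in the slide's table.
cluster_means = {
    1: "s e m".split(),
    2: "b o l a n".split(),
    3: "z d r a v".split(),
}

# Dictionary entry per cluster: written form -> pronunciation (the cluster mean).
# Note: two means with the same written form would collide in this toy mapping.
dictionary = {toy_p2g(mean): mean for mean in cluster_means.values()}

for written, pronunciation in dictionary.items():
    print(written, " ".join(pronunciation))
```

For these three clusters the toy conversion happens to reproduce the written forms in the table; a real phoneme-to-grapheme model would be needed for phoneme symbols that do not map one-to-one to letters.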
System Design
Target language: Slovene
Source languages: English, German, Croatian
Word-to-Phoneme Alignments (one per source language)
-> Clustering (kmeansOidx)
-> Cluster Sizes -> Estimate 1-gram Language Model -> LM
-> Pronunciations -> Phoneme-to-Grapheme Conversion (Phoneme-to-Grapheme Model, Text) -> Dictionary
Acoustic Models -> AM
LM, Dictionary, AM -> Speech Recognizer
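The step from cluster sizes to a 1-gram language model can be sketched as follows. This is a minimal sketch under the assumption that each word's unigram probability is its relative cluster size; the slides only state that cluster sizes feed the 1-gram estimate, so the exact estimator and the function name unigram_from_cluster_sizes are assumptions. The sizes are those of the toy clusters 𝐶1 to 𝐶3 from the Pronunciation Extraction slide.

```python
def unigram_from_cluster_sizes(sizes):
    """Relative-frequency unigram estimate: P(word) = cluster size / total size."""
    total = sum(sizes.values())
    if total == 0:
        raise ValueError("at least one non-empty cluster is required")
    return {word: size / total for word, size in sizes.items()}


# Toy cluster sizes: C1 (sem) has 4 elements, C2 (bolan) and C3 (zdrav) have 2 each.
cluster_sizes = {"sem": 4, "bolan": 2, "zdrav": 2}

lm = unigram_from_cluster_sizes(cluster_sizes)
for word, prob in lm.items():
    print(f"P({word}) = {prob:.2f}")
```

No smoothing is applied in this sketch; since every dictionary word comes from a non-empty cluster, each word receives a non-zero probability.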
Issues with k-means Clustering
Cluster mean 𝜇(𝐶𝑖): /h/ /uw/ /eh/ /v/ /er/ (whoever)
Cluster elements 𝐶𝑖 (plotted against Number of Occurrences):
/h/ /uw/ /eh/ /v/ /er/
/y/ /uw/ /eh/ /v/ /er/
/h/ /uw/ /eh/ /v/ /er/ /m/
/h/ /aw/ /eh/ /v/ /er/
/h/ /uw/ /eh/ /v/ /er/ /b/ /ih/
/h/ /uw/ /eh/ /v/ /er/ /r/
• Errors are approx. equally likely
• Correct pronunciations (however) occur more often than wrong ones
New Cluster 𝐶𝑗
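A minimal sketch of why distance to the mean alone does not expose the problem, assuming a plain Levenshtein distance over phoneme symbols. The slides do not state which distance the clustering uses, and edit_distance is a hypothetical helper written for this illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences (lists of symbols)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]


mean = "h uw eh v er".split()  # cluster mean (whoever)
elements = [
    "h uw eh v er",
    "y uw eh v er",
    "h uw eh v er m",
    "h aw eh v er",            # however
    "h uw eh v er b ih",
    "h uw eh v er r",
]

distances = [edit_distance(mean, e.split()) for e in elements]
for element, dist in zip(elements, distances):
    print(dist, element)
```

Every element, including /h/ /aw/ /eh/ /v/ /er/ (however), lies within one or two edits of the mean, so distance does not separate a second correct pronunciation from recognition errors. The slide's argument rests on the occurrence counts, which are not preserved in this transcript.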
kmeansOidx Clustering Algorithm
𝐶𝑖 : Set of elements in cluster 𝑖.
𝜇(𝐶𝑖 ): Mean of cluster 𝑖.
𝑐(𝑝): Number of occurrences of element 𝑝.
[Figure: steps 1) to 7) of kmeansOidx, showing clusters 𝐶1 to 𝐶7, occurrence counts 𝑐(⋅) and the parameter 𝜖𝑜𝑖𝑑𝑥 (oidx)]
Basic Medical Expression Database
(BMED)
Error-Free Phonetic Transcriptions
(0% Phoneme Error Rate)
Method from (Stahlberg et al., 2013):
Source Language    Character Error Rate (0-gram)
de                 66.0%
en                 62.4%
hr                 51.3%

New Method kmeansOidx:
Src. Lang.    𝜖𝑜𝑖𝑑𝑥    CER (0-gram)    CER (1-gram)
de            ∞        52.1%           50.7%
en            ∞        51.4%           47.9%
hr            ∞        49.9%           48.4%
all           ∞        47.3%           46.1%
all           3        46.6%           46.2%
all           2        45.9%           45.1%
all           1.5      45.8%           44.4%
all           1        45.2%           44.2%

• Outperforms (Stahlberg et al., 2013)
• 1-gram language model helps
• Source language combination helps
• Setting 𝜖𝑜𝑖𝑑𝑥 helps

Slovene ASR with Croatian AM (Gold Standard):
Language Model    WER      CER
0-gram            36.2%    15.7%
1-gram            32.0%    13.6%
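For reference, the error rates reported here are conventionally defined via the minimum edit alignment between hypothesis and reference. The slides do not spell this out, so the definition below is the standard one rather than a quote from the talk, and the symbols S, D, I and N are introduced only for this note.

```latex
% S, D, I: character substitutions, deletions and insertions in the
% minimum edit alignment; N: number of characters in the reference.
\mathrm{CER} = \frac{S + D + I}{N}
```

WER is defined in the same way with words in place of characters.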
Recognized Phonetic Transcriptions
(55.2% Phoneme Error Rate)
• English best single source language
• Fluctuations with single source languages
• All minima between 𝜖𝑜𝑖𝑑𝑥 = 1.5 and 2
• Best system: 52% Character Error Rate

Slovene ASR with Croatian AM (Gold Standard):
Language Model    WER      CER
0-gram            36.2%    15.7%
1-gram            32.0%    13.6%
Human Evaluation
Task: Select correct sentence from the 200 BMED
sentences, given the recognizer transcript
System: Recognized Phonetic Transcription, all source languages, 𝜖𝑜𝑖𝑑𝑥 = 2
Character Error Rate: 52%
Task Error Rate: 12.1%
Summary
Speech recognition for non-written and under-resourced languages or dialects
Only using spoken translations from other languages and
resources from resource-rich languages
No given pronunciation dictionary, transcribed speech
and text resources in the target language
Target Language: Slovene
Source Languages: Croatian, English, German
200 sentences, limited domain
Best system: 52% Character Error Rate
Task Error Rate for selecting 1 of 200 sentences: 12.1%