Transcript Document
Towards ASR Without Pronunciation Dictionary, Transcribed Speech and Text Resources in the Target Language Using Cross-Lingual Word-to-Phoneme Alignment
Felix Stahlberg, Tim Schlippe, Stephan Vogel, Tanja Schultz
Cognitive Systems Lab, Karlsruhe Institute of Technology
KIT – University of the State of Baden-Württemberg and National Research Center of the Helmholtz Association
14 May 2014
4th International Workshop on Spoken Language Technologies for Under-resourced Languages (SLTU'14)
www.kit.edu

Goal
• Exploit the phonetic output of a human simultaneous translator
• The translator translates between a resource-rich source language and an under-resourced target language
• Bootstrap an ASR system without any linguistic knowledge of the target language

16.07.2015 Towards ASR Without Pronunciation Dictionary, Transcribed Speech and Text Resources Using Cross-Lingual Word-to-Phoneme Alignment – F. Stahlberg, T. Schlippe, S. Vogel, T. Schultz

Applications
• Dialects
• Speech processing for non-written and under-resourced languages
Image: http://en.wikipedia.org/wiki/File:Akha_laos_11_03d.jpg

Scenario
Say “I am sick.” in your mother tongue.
    /b/ /o/ /l/ /a/ /n/ /s/ /e/ /m/
Say “I am healthy.” in your mother tongue.
    /z/ /d/ /r/ /a/ /v/ /s/ /e/ /m/
• /b/ /o/ /l/ /a/ /n/ seems to be a word (meaning sick)
• /z/ /d/ /r/ /a/ /v/ seems to be a word (meaning healthy)
• /s/ /e/ /m/ seems to be a word (meaning I am)

Word-to-Phoneme Alignments
English (source language), sentence: I am sick.
Slovene (target language), audio → Phoneme Recognizer → phoneme sequence: b o l a n s e m
(Stahlberg et al., 2012) (Stüker and Besacier, 2009) (Stüker and Waibel, 2008) (Besacier et al., 2006)

Pronunciation Extraction
Step 1: Phoneme Recognition
    Croatian Phoneme Recognizer → z d r a v s e m
    Croatian Phoneme Recognizer → b o l a n s e m
Step 2: Alignment
    Source de: Ich bin gesund    Target si: z d r a v s e m
    Source de: Ich bin krank     Target si: b o l a n s e m
    Source en: I am healthy      Target si: z d r a v s e m
    Source en: I am sick         Target si: b o l a n s e m
Step 3: Clustering
    C_1: vsem, sem, em, sem
    C_2: bolans, bolan
    C_3: zdrav, zdra
Step 4: Dictionary Generation
    i    Written form: p2g_hr(μ(C_i))    Pronunciation: μ(C_i)
    1    sem                             s e m
    2    bolan                           b o l a n
    3    zdrav                           z d r a v

System Design
[Diagram: Target language: Slovene. Source languages: English, German, Croatian. Word-to-Phoneme Alignments (one set per source language) → Clustering (kmeansOidx) → Cluster Sizes and Pronunciations. Cluster Sizes → Estimate 1-gram Language Model → LM. Pronunciations → Phoneme-to-Grapheme Conversion (using a Phoneme-to-Grapheme Model) → Dictionary. LM, Dictionary and AM (Acoustic Models) → Speech Recognizer → Text.]

Issues with k-means Clustering
Cluster mean μ(C_i): /h/ /uw/ /eh/ /v/ /er/ (whoever)
Cluster elements C_i, with number of occurrences: /h/ /uw/ /eh/ /v/ /er/ /y/ /uw/ /eh/ /v/ /er/ /h/ /uw/ /eh/ /v/ /er/ /m/ /h/ /aw/ /eh/ /v/ /er/ /h/ /uw/ /eh/ /v/ /er/ /b/ /ih/ /h/ /uw/ /eh/ /v/ /er/ /r/
New cluster C_j
• Errors are approximately equally likely
• Correct pronunciations (however) occur more often than wrong ones

kmeansOidx Clustering Algorithm
C_i: Set of elements in cluster i.
μ(C_i): Mean of cluster i.
c(p): Number of occurrences of element p.
[Diagram: clusters C_1 to C_7, each annotated with c(·), numbered steps 1) to 7), and the parameter ε_oidx.]

Basic Medical Expression Database (BMED)

Error-Free Phonetic Transcriptions (0% Phoneme Error Rate)
Method from (Stahlberg et al., 2013):
    Source language    Character Error Rate (0-gram)
    de                 66.0%
    en                 62.4%
    hr                 51.3%
New method kmeansOidx:
    Src. lang.    ε_oidx    CER (0-gram)    CER (1-gram)
    de            ∞         52.1%           50.7%
    en            ∞         51.4%           47.9%
    hr            ∞         49.9%           48.4%
    all           ∞         47.3%           46.1%
    all           3         46.6%           46.2%
    all           2         45.9%           45.1%
    all           1.5       45.8%           44.4%
    all           1         45.2%           44.2%
• Outperforms (Stahlberg et al., 2013)
• 1-gram language model helps
• Source language combination helps
• Setting ε_oidx helps
Slovene ASR with Croatian AM (Gold Standard):
    Language model    WER      CER
    0-gram            36.2%    15.7%
    1-gram            32.0%    13.6%

Recognized Phonetic Transcriptions (55.2% Phoneme Error Rate)
• English is the best single source language
• Fluctuations with single source languages
• All minima lie between ε_oidx = 1.5 and ε_oidx = 2
• Best system: 52% Character Error Rate
Slovene ASR with Croatian AM (Gold Standard):
    Language model    WER      CER
    0-gram            36.2%    15.7%
    1-gram            32.0%    13.6%

Human Evaluation
Task: Select the correct sentence from the 200 BMED sentences, given the recognizer transcript
Setup: Recognized phonetic transcription, all source languages, ε_oidx = 2
Character Error Rate: 52%
Task Error Rate: 12.1%

Summary
• Speech recognition for non-written and under-resourced languages or dialects
• Only using spoken translations from other languages and resources from resource-rich languages
• No pronunciation dictionary, transcribed speech or text resources given in the target language
• Target language: Slovene
• Source languages: Croatian, English, German
• 200 sentences, limited domain
• Best system: 52% Character Error Rate
• Task Error Rate for selecting 1 of 200 sentences: 12.1%
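
Note on the Character Error Rate figures: the slides report CER but do not say how it was scored. The sketch below assumes the common definition, Levenshtein (edit) distance between reference and hypothesis divided by the reference length. The function names `edit_distance` and `character_error_rate` are illustrative and not taken from the authors' tooling.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance with unit costs for insert, delete, substitute."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance between ref[:i] and the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete r from the reference
                curr[j - 1] + 1,         # insert h into the reference
                prev[j - 1] + (r != h),  # substitute, or match at zero cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Assumed CER definition: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

For instance, with the reference "bolan sem" and a hypothesis "bolans em" that puts the word boundary one character too late, the distance is 2 over 9 reference characters, a CER of about 22%. Because this measure can exceed 100% when the hypothesis is much longer than the reference, it is an error rate rather than a bounded accuracy score.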