Transcript Slide 1
Hausa Large Vocabulary Speech Recognition

Tim Schlippe, Edy Guevara Komgang Djomgang, Ngoc Thang Vu, Sebastian Ochs, Tanja Schultz
Cape Town, South Africa, 07 May 2012
SLTU 2012 – The 3rd Workshop on Spoken Language Technologies for Under-resourced Languages

Outline
1. Motivation
2. The Hausa Language
3. Hausa Resources
   3.1 Text Corpus
   3.2 Speech Corpus
4. Baseline Speech Recognition System
5. System Optimization
   5.1 Pronunciation Dictionary Improvement
       5.1.1 Automatic rejection of inconsistent or flawed entries
       5.1.2 Tones and vowel lengths
   5.2 Language Model Improvement
   5.3 Speaker Adaptation and System Combination
6. Conclusion

Hausa Large Vocabulary Continuous Speech Recognition - 2 Review of the ARPA Speech Understanding Project

1. Motivation
• Speech technology …
    • potentially allows everyone to participate in today's information revolution
    • can bridge language barriers
    • facilitates worldwide business activities
    • simplifies life in multilingual communities
    • eases humanitarian missions
• Africa itself ...
    • has more than 2,000 languages (Heine and Nurse, 2000), e.g. more than 280 languages in Cameroon (www.ethnologue.com)
    • plus many different accents
• Speech technology has so far been analyzed and developed for only a small fraction of Africa's many languages
• We have collected speech and text data in Cameroon for the West African language Hausa as a part of our GlobalPhone corpus (Schultz, 2002) and developed an automatic speech recognition system.

2. The Hausa Language
• Why Hausa?
    • Lingua franca in many countries
    • With over 25 million speakers, it is widely spoken in West Africa (Burquest, 1992)
    • Hausa speakers according to the Summer Institute of Linguistics (SIL):
        • 18.5 million in Nigeria (1991)
        • 5 million in Niger (1998)
        • 489k in Sudan (2001)
        • 23.5k in Cameroon (1982)
        • Benin, Burkina Faso, Ghana, Togo, Chad (Koslow, 1995)
    • Online text resources available
    • Phoneme set defined by the International Phonetic Association (IPA) (IPA, 1999)

2. The Hausa Language
• Classification: Afro-Asiatic, Chadic, West, A, A.1
• Alphabet:
    • ajami (based on the Arabic alphabet)
    • boko (based on the Latin alphabet), e.g. "Hausa"
        • 22 characters of the English alphabet plus Ɓ/ɓ, Ɗ/ɗ, Ƙ/ƙ, 'Y/'y or Ƴ/ƴ, and '
        • In online newspapers, Ɓ/ɓ, Ɗ/ɗ, Ƙ/ƙ are written as B/b, D/d, K/k
• Pronunciation characteristics:
    • 3 lexical tones (low, high, falling) (IPA, 1999), e.g. wuya
        • wuyá: difficulty
        • wúya: neck
    • Vowel lengths (short, long), e.g. gari
        • garî: town
        • gaːrî: flour

3. Hausa Resources
3.1 Text Corpus
• Crawling text from 5 main online newspapers in boko using the Rapid Language Adaptation Toolkit (RLAT) (Black and Schultz, 2008)
• Text normalization:
    1. Remove all HTML tags and codes
    2. Remove special characters and empty lines
    3. Identify and remove pages and lines in languages other than Hausa, based on large lists of frequent Hausa words
    4. Delete duplicate lines
• Select prompts to record speech data for the training, development, and evaluation sets, and extract text for the language model

3. Hausa Resources
3.2 Speech Corpus
• Speech data collection in GlobalPhone style (Schultz, 2002), i.e. we asked native speakers of Hausa to read prompted sentences from newspaper articles.
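The four text normalization steps listed under 3.1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the RLAT implementation: the function name `normalize`, the parameter `frequent_hausa_words`, and the threshold `min_ratio` are assumptions made for the example, and the slides do not say how the frequent-word lists are scored.

```python
import html
import re

# Matches HTML tags such as <p> or </div>.
TAG_RE = re.compile(r"<[^>]+>")
# Anything that is not a letter, digit, underscore, whitespace or apostrophe.
# Hooked letters such as ɓ, ɗ, ƙ count as letters and are kept.
SPECIAL_RE = re.compile(r"[^\w\s']")


def normalize(lines, frequent_hausa_words, min_ratio=0.3):
    """Apply the four normalization steps to an iterable of crawled lines.

    min_ratio is a hypothetical threshold: a line is kept only if at least
    this fraction of its words appears in frequent_hausa_words.
    """
    seen = set()
    result = []
    for raw in lines:
        text = TAG_RE.sub(" ", raw)        # step 1: remove HTML tags
        text = html.unescape(text)         # step 1: decode HTML codes (entities)
        text = SPECIAL_RE.sub(" ", text)   # step 2: remove special characters
        text = " ".join(text.split())      # collapse whitespace
        if not text:                       # step 2: drop empty lines
            continue
        words = text.lower().split()
        hits = sum(1 for w in words if w in frequent_hausa_words)
        if hits / len(words) < min_ratio:  # step 3: drop lines unlikely to be Hausa
            continue
        if text in seen:                   # step 4: drop duplicate lines
            continue
        seen.add(text)
        result.append(text)
    return result
```

A fixed ratio of known words is the simplest possible language filter; a real crawl would likely need a tuned threshold and a page-level decision in addition to the line-level one.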
• Offline audio recorder
• 16 kHz sampling rate with 16-bit quantization
• Close-talk microphone (noise cancellation microphone, NC-185VM)

3. Hausa Resources
3.2 Speech Corpus - Challenges
Social factors
• The majority of Hausa people are Muslim (95% of recorded speakers)
• For Muslims, work and religion are closely connected
• Most Muslim female speakers had to ask their husband or father for permission to do the recording
Technical difficulties
• Noisy environments: big cities, restaurants, offices, homes, meeting halls
• Bad infrastructure (electricity)

3. Hausa Resources
3.2 Speech Corpus
[Figure: map with the cities Maroua, Ngaoundéré, Bafoussam, Yaoundé, Douala]
[Figure: axes labeled # and age]

4. Baseline Speech Recognition System
• 33 Hausa phonemes (26 consonants, 5 vowels, 2 diphthongs)
• 6.6 hours of speech to train acoustic models
• Bootstrapping with RLAT using the multilingual phone inventory MM7 (Schultz and Waibel, 2001)
    • MM7 trained from 7 randomly selected GlobalPhone languages
    • Selected MM7 models used as seed models to produce initial state alignments for the Hausa speech data
• Preprocessing:
    • Feature extraction with a Hamming window of 16 ms length, window overlap of 10 ms
    • Each feature vector has 143 dimensions (11 adjacent frames x 13 MFCC coefficients)
    • Feature vector size after Linear Discriminant Analysis (LDA): 42 dimensions

4.
Baseline Speech Recognition System
• Acoustic Model (AM):
    • Fully-continuous 3-state left-to-right HMM
    • Emission probabilities are modeled by Gaussian mixtures with diagonal covariances
    • Context-dependent AM: decision tree splitting stopped at 500 triphones
    • For all AMs, one global semi-tied covariance matrix after LDA
    • Data-driven tone modeling (DDTM) (Vu and Schultz, 2009)
        • Use a tone tag in the pronunciation dictionary and add the tag as a question in the clustering procedure
        • The data decide during model clustering whether two tones have a similar impact on the basic phoneme
        • If so, they share one common model; otherwise the decision tree splits them into two different models
    • Same technique for the vowel lengths (data-driven length modeling)
• Language Model (LM):
    • 3-gram
    • Vocabulary size: 6k (4k from training transcriptions + 2k frequent words)
    • Perplexity (PPL): 282
    • Out-of-vocabulary (OOV) rate: 4.7%
• Pronunciation Dictionary:
    • Pronunciations created in a rule-based fashion, then manually revised and cross-checked by native speakers
        • Initial rules based on 200 word-pronunciation pairs from Peter Ladefoged (http://archive.phonetics.ucla.edu/Language/HAU/hau.html)
        • Then manual checks by different native speakers
• Performance of baseline on dev set: 23.49% WER

5. System Optimization
5.1 Pronunciation Dictionary Improvement
5.1.1 Automatic rejection of inconsistent or flawed entries (1)
1. Length Filtering (Len)
   a. Remove a pronunciation if the numbers of grapheme and phoneme tokens differ by more than a threshold.
2. Alignment Filtering (Eps) (according to Martirosian and Davel, 2007)
   a. Perform a grapheme-to-phoneme (g2p) alignment (Black et al., 1998)
      • The alignment process involves the insertion of graphemic and phonemic nulls (epsilon) into the lexical entries of words.
   b.
Remove a pronunciation if the number of graphemic and phonemic nulls exceeds a threshold.
      Example alignment ("_" marks a null):
      Graphemes: | H   | a | u    | s   | a    |
      Phonemes:  | H_h | _ | H_aU | H_s | H_al |
3. g2p Filtering after Length/Alignment Filtering (G2PLen/G2PEps)
   a. Train g2p models with "reliable" word-pronunciation pairs.
   b. Apply the g2p models to convert a grapheme string into the most likely phoneme string.
   c. Remove a pronunciation if the edit distance between the synthesized phoneme string and the pronunciation in question exceeds a threshold.

5. System Optimization
5.1 Pronunciation Dictionary Improvement
5.1.1 Automatic rejection of inconsistent or flawed entries (2)
• The threshold for each filtering method depends on the mean and the standard deviation of the measure in focus.
• Word-pronunciation pairs whose measure falls below the lower bound or above the upper bound derived from these statistics are rejected (~16% with each filtering method).
• We built new grapheme-to-phoneme (g2p) models with the remaining word-pronunciation pairs and applied them to the words with rejected pronunciations.
• 2.67% relative improvement (grapheme-based: 22.52%), but still room for improvement

5. System Optimization
5.1 Pronunciation Dictionary Improvement
5.1.2 Analysis of importance of tones and vowel length information
5.2 Language Model Improvement
• Crawling additional text corpora (8 million tokens from http://hausa.cri.cn)
• Increase vocabulary size with frequent words (42k resulted in best WER)

5.
System Optimization
5.3 Speaker Adaptation and System Combination
System combination results (%) on dev (test) set.

6. Conclusion
• Development of a Hausa speech recognition system for large vocabulary
• Hausa is a lingua franca in West Africa spoken by over 25 million speakers
• We collected almost 9 hours of speech from 102 Hausa speakers reading newspaper articles.
• For automatic speech recognition, the modeling of tones and vowel lengths performs better than omitting tone or vowel length information
• We improved the pronunciation dictionary quality with methods to filter erroneous word-pronunciation pairs.
• The initial recognition performance of 23.49% WER was improved to 13.16% on the dev set and 16.26% on the test set.

Thanks for your interest! Baie dankie!

References