Transcript Slide 1
Hausa Large Vocabulary Speech Recognition
Tim Schlippe
Edy Guevara Komgang Djomgang
Ngoc Thang Vu
Sebastian Ochs
Tanja Schultz
Cape Town, South Africa
07 May 2012
SLTU 2012 – The 3rd Workshop on Spoken Language Technologies for Under-resourced Languages
Outline
Hausa Large Vocabulary Continuous Speech Recognition - 2
Review of the ARPA Speech Understanding Project
1. Motivation
2. The Hausa Language
3. Hausa Resources
3.1 Text Corpus
3.2 Speech Corpus
4. Baseline Speech Recognition System
5. System Optimization
5.1 Pronunciation Dictionary Improvement
5.1.1 Automatic rejection of inconsistent or flawed entries
5.1.2 Tones and vowel lengths
5.2 Language Model Improvement
5.3 Speaker Adaptation and System Combination
6. Conclusion
1. Motivation
• Speech technology …
  • potentially allows everyone to participate in today’s information revolution,
  • can bridge language barriers,
  • facilitates worldwide business activities,
  • simplifies life in multilingual communities,
  • eases humanitarian missions.
1. Motivation
• Africa itself ...
  • has more than 2,000 languages (Heine and Nurse, 2000), e.g. more than 280 languages in Cameroon (www.ethnologue.com),
  • plus many different accents.
• For only a small fraction of Africa’s many languages, speech technology has been analyzed and developed so far.
We have collected speech and text data in Cameroon for the West African language Hausa as a part of our GlobalPhone corpus (Schultz, 2002) and developed an automatic speech recognition system.
2. The Hausa Language
• Why Hausa?
  • Lingua franca in many countries
  • With over 25 million speakers, it is widely spoken in West Africa (Burquest, 1992)
  • Hausa speakers according to the Summer Institute of Linguistics (SIL):
    • 18.5 million in Nigeria (1991),
    • 5 million in Niger (1998),
    • 489k in Sudan (2001),
    • 23.5k in Cameroon (1982),
    • Benin, Burkina Faso, Ghana, Togo, Chad
    (Koslow, 1995)
  • Online text resources available
  • Phoneme set defined by the International Phonetic Association (IPA) (IPA, 1999)
2. The Hausa Language
• Classification: Afro-Asiatic, Chadic, West, A, A.1
• Alphabet:
  • ajami (based on the Arabic alphabet)
  • boko (based on the Latin alphabet), e.g. "Hausa"
    • 22 characters of the English alphabet plus Ɓ/ɓ, Ɗ/ɗ, Ƙ/ƙ, 'Y/'y or Ƴ/ƴ, and '
    • In online newspapers: Ɓ/ɓ, Ɗ/ɗ, Ƙ/ƙ → B/b, D/d, K/k
• Pronunciation characteristics:
  • 3 lexical tones (low, high, falling) (IPA, 1999), e.g. wuya
    • wuyá: difficulty
    • wúya: neck
  • Vowel lengths (short, long), e.g. gari
    • garî: town
    • gaːrî: flour
3. Hausa Resources
3.1 Text Corpus
Crawling text from 5 main online newspapers in boko using the Rapid Language Adaptation Toolkit (RLAT) (Black and Schultz, 2008)
Text Normalization
1. Remove all HTML tags and codes
2. Remove special characters and empty lines
3. Identify and remove pages and lines in languages other than Hausa, based on large lists of frequent Hausa words
4. Delete duplicate lines
Select prompts to record speech data for the training, development, and evaluation sets and extract text for the language model
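The four normalization steps above could be sketched roughly as follows. This is an illustrative sketch, not the RLAT implementation: the function name `normalize`, the regular expressions, and the `min_hausa_ratio` threshold are assumptions made for the example, and the slides do not say how the frequent-word lists were applied.

```python
import html
import re

# Simplistic tag pattern; real HTML cleaning is usually more involved.
TAG_RE = re.compile(r"<[^>]+>")
# Keep letters (including hooked letters such as ɓ, ɗ, ƙ), digits,
# whitespace, and the apostrophe used in boko; drop everything else.
SPECIAL_RE = re.compile(r"[^\w\s']")


def normalize(raw_lines, hausa_words, min_hausa_ratio=0.5):
    """Apply the four normalization steps to an iterable of raw lines.

    hausa_words: set of frequent Hausa words (lower-case).
    min_hausa_ratio: hypothetical threshold for the language check.
    """
    seen = set()
    cleaned = []
    for line in raw_lines:
        # 1. Remove HTML tags and decode HTML character codes.
        text = html.unescape(TAG_RE.sub(" ", line))
        # 2. Remove special characters, collapse whitespace, skip empty lines.
        text = " ".join(SPECIAL_RE.sub(" ", text).split())
        if not text:
            continue
        # 3. Drop lines with too few known Hausa words.
        tokens = text.lower().split()
        hits = sum(1 for tok in tokens if tok in hausa_words)
        if hits / len(tokens) < min_hausa_ratio:
            continue
        # 4. Delete duplicate lines.
        if text in seen:
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned
```

In practice the word list would be much larger than a toy set, and the threshold would need tuning on held-out pages.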
3. Hausa Resources
3.2 Speech Corpus
Speech data collection in GlobalPhone style (Schultz, 2002), i.e. we asked native speakers of Hausa to read prompted sentences of newspaper articles.
Offline audio recorder
16 kHz sampling rate with 16 bit quantization
Close-talk microphone (noise cancellation microphone, NC-185VM)
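The recording format above (16 kHz, 16 bit) can be checked per file. The sketch below assumes the recordings are stored as WAV files, which the slides do not state; both function names are hypothetical, and `write_silence` exists only to make the example self-contained.

```python
import io
import wave


def matches_recording_spec(source, rate_hz=16000, sample_width_bytes=2):
    """Return True if a WAV file (path or file object) is 16 kHz / 16 bit."""
    with wave.open(source, "rb") as wav:
        return (wav.getframerate() == rate_hz
                and wav.getsampwidth() == sample_width_bytes)


def write_silence(target, seconds, rate_hz=16000, sample_width_bytes=2):
    """Write a mono WAV file containing silence (helper for the demo)."""
    n_frames = int(seconds * rate_hz)
    with wave.open(target, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(sample_width_bytes)
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00" * (n_frames * sample_width_bytes))


# Demo on an in-memory file.
buffer = io.BytesIO()
write_silence(buffer, 0.1)
buffer.seek(0)
print(matches_recording_spec(buffer))
```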
3. Hausa Resources
3.2 Speech Corpus - Challenges
Social factors
• The majority of Hausa people are Muslim (95% of recorded speakers)
• For Muslims, there is a close connection between work and religion
• Most Muslim female speakers had to ask their husband or father for permission to do the recording
3. Hausa Resources
3.2 Speech Corpus - Challenges
Technical difficulties
• Noisy environments: big cities, restaurants, offices, at home, meeting halls
• Bad infrastructure (electricity)
3. Hausa Resources
3.2 Speech Corpus
[Map: Maroua, Ngaoundéré, Bafoussam, Yaoundé, Douala]
4. Baseline Speech Recognition System
• 33 Hausa phonemes (26 consonants, 5 vowels, 2 diphthongs)
• 6.6 hours to train acoustic models
• Bootstrapping with RLAT using multilingual phone inventory MM7 (Schultz and Waibel, 2001)
  • MM7 trained from 7 randomly selected GlobalPhone languages
  • Selected MM7 models as seed models to produce initial state alignments for the Hausa speech data
• Preprocessing:
  • Feature extraction with Hamming window of 16 ms length, window overlap of 10 ms
  • Each feature vector has 143 dimensions (11 adjacent frames × 13 MFCC coefficients)
  • Linear Discriminant Analysis (LDA) feature vector size: 42 dimensions
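The 143-dimensional vectors follow from stacking 11 adjacent frames of 13 coefficients each (11 × 13 = 143). A minimal sketch of that stacking step is given below. It assumes a symmetric context of 5 frames on each side and repeats the first or last frame at the utterance edges; neither detail is stated on the slide. MFCC extraction and the LDA projection to 42 dimensions are not shown.

```python
def stack_frames(mfcc_frames, context=5):
    """Concatenate each frame with its `context` left and right neighbours.

    mfcc_frames: list of equal-length coefficient lists, one per frame.
    Returns one stacked vector per input frame.
    """
    n = len(mfcc_frames)
    stacked = []
    for t in range(n):
        window = []
        for offset in range(-context, context + 1):
            # Assumption: clamp at the edges by repeating the boundary frame.
            idx = min(max(t + offset, 0), n - 1)
            window.extend(mfcc_frames[idx])
        stacked.append(window)
    return stacked


# Toy input: 20 frames with 13 coefficients each (values are placeholders).
frames = [[float(t)] * 13 for t in range(20)]
stacked = stack_frames(frames)
print(len(stacked), len(stacked[0]))
```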
4. Baseline Speech Recognition System
• Acoustic Model (AM):
  • Fully-continuous 3-state left-to-right HMM
  • Emission probabilities are modeled by Gaussian mixtures with diagonal covariances
  • Context-dependent AM: decision tree splitting stopped at 500 triphones
  • For all AMs one global semi-tied covariance matrix after LDA
  • Data-driven tone modeling (DDTM) (Vu and Schultz, 2009)
    • Use a tone tag in the pronunciation dictionary and add the tag as a question in the clustering procedure
    • The data decide during model clustering whether two tones have a similar impact on the basic phoneme
    • If so, they share 1 common model; otherwise the decision tree splits them into 2 different models
  • Same technique for the vowel lengths (data-driven length modeling)
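The idea behind the tone question can be illustrated with a toy split criterion. The sketch below is not the clustering procedure of the actual system: it assumes one-dimensional features, a single Gaussian per model, and a hypothetical `min_gain` threshold. It splits two tone variants of a phoneme only if separate models explain the data noticeably better than one shared model.

```python
import math


def gaussian_loglik(samples, var_floor=1e-6):
    """Log-likelihood of samples under a 1-D Gaussian fitted to them."""
    n = len(samples)
    mean = sum(samples) / n
    var = max(sum((x - mean) ** 2 for x in samples) / n, var_floor)
    return sum(-0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
               for x in samples)


def should_split_by_tag(tagged_samples, tag_a, tag_b, min_gain):
    """Decide whether two tags (e.g. two tones) need separate models.

    tagged_samples: list of (tag, feature_value) pairs for one phoneme.
    Returns True if the likelihood gain of splitting exceeds min_gain.
    """
    a = [x for tag, x in tagged_samples if tag == tag_a]
    b = [x for tag, x in tagged_samples if tag == tag_b]
    gain = gaussian_loglik(a) + gaussian_loglik(b) - gaussian_loglik(a + b)
    return gain > min_gain


# Toy data: the feature differs clearly between tags "H" and "L".
distinct = [("H", 5.0), ("H", 5.2), ("H", 4.8),
            ("L", 1.0), ("L", 1.1), ("L", 0.9)]
print(should_split_by_tag(distinct, "H", "L", min_gain=2.0))
```

The same test applies unchanged to vowel length tags, since only the tag names differ.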
4. Baseline Speech Recognition System
• Language Model (LM):
  • 3-gram
  • Vocabulary size: 6k (4k training transcriptions + 2k frequent)
  • Perplexity (PPL): 282
  • Out-of-vocabulary (OOV) rate: 4.7%
• Pronunciation Dictionary
  • Pronunciations created in a rule-based fashion, then manually revised and cross-checked by native speakers
    • Initial rules based on 200 word-pronunciation pairs from Peter Ladefoged (http://archive.phonetics.ucla.edu/Language/HAU/hau.html)
    • Then manual checks by different native speakers
• Performance of baseline on dev set: 23.49% WER
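The two language model figures reported above, perplexity and OOV rate, follow standard definitions. The sketch below shows how they are typically computed; the function names are illustrative, and the per-token log probabilities are assumed to come from some trained n-gram model that is not shown here.

```python
import math


def oov_rate(tokens, vocabulary):
    """Percentage of running tokens that are not in the vocabulary."""
    unknown = sum(1 for tok in tokens if tok not in vocabulary)
    return 100.0 * unknown / len(tokens)


def perplexity(log_probs):
    """Perplexity from natural-log probabilities, one per predicted token."""
    return math.exp(-sum(log_probs) / len(log_probs))


# Toy example: one of four tokens is unknown, and the model assigns
# probability 0.25 to every token.
print(oov_rate(["a", "b", "c", "d"], {"a", "b", "c"}))
print(perplexity([math.log(0.25)] * 4))
```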
5. System Optimization
5.1 Pronunciation Dictionary Improvement
5.1.1 Automatic rejection of inconsistent or flawed entries (1)
1. Length Filtering (Len)
   a. Remove a pronunciation if the number of grapheme and phoneme tokens differs by more than a threshold.
2. Alignment Filtering (Eps) (following Martirosian and Davel, 2007)
   a. Perform a grapheme-to-phoneme (g2p) alignment (Black et al., 1998)
      • The alignment process involves the insertion of graphemic and phonemic nulls (epsilons) into the lexical entries of words.
   b. Remove a pronunciation if the number of graphemic and phonemic nulls exceeds a threshold.
Example alignment:
H   | a | u    | s   | a    |
H_h | _ | H_aU | H_s | H_al |
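The measures behind the length and alignment filters can be sketched as follows. The example assumes the g2p alignment has already been computed (the alignment algorithm itself is not shown) and uses `_` as the null symbol, as in the alignment example; the function names are illustrative.

```python
EPS = "_"  # null (epsilon) symbol, as in the alignment example


def length_difference(graphemes, phonemes):
    """Measure for Len: absolute difference in token counts."""
    return abs(len(graphemes) - len(phonemes))


def null_count(aligned_graphemes, aligned_phonemes):
    """Measure for Eps: number of graphemic plus phonemic nulls."""
    return aligned_graphemes.count(EPS) + aligned_phonemes.count(EPS)


def keep_entry(measure, threshold):
    """Keep a pronunciation unless its measure exceeds the threshold."""
    return measure <= threshold


# The aligned entry for "Hausa" from the slide.
aligned_g = ["H", "a", "u", "s", "a"]
aligned_p = ["H_h", "_", "H_aU", "H_s", "H_al"]
unaligned_p = [p for p in aligned_p if p != EPS]

print(length_difference(aligned_g, unaligned_p))
print(null_count(aligned_g, aligned_p))
```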
5. System Optimization
5.1 Pronunciation Dictionary Improvement
5.1.1 Automatic rejection of inconsistent or flawed entries (1)
3. g2p Filtering after Length/Alignment Filtering (G2PLen/G2PEps)
   a. Train g2p models with “reliable” word-pronunciation pairs.
   b. Apply the g2p models to convert a grapheme string into the most likely phoneme string.
   c. Remove a pronunciation if the edit distance between the synthesized phoneme string and the pronunciation in question exceeds a threshold.
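Step c can be sketched with a plain Levenshtein distance over phoneme sequences. In the example below the g2p model is passed in as a callable; `toy_g2p` is a stand-in invented for the demonstration (one made-up phoneme per letter) and does not reflect how the actual g2p models work.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]


def g2p_filter(lexicon, g2p, max_distance):
    """Split a lexicon into kept and rejected entries.

    lexicon: dict mapping word -> list of phonemes.
    g2p: callable mapping a word to its most likely phoneme list.
    """
    kept, rejected = {}, {}
    for word, phonemes in lexicon.items():
        distance = edit_distance(g2p(word), phonemes)
        target = kept if distance <= max_distance else rejected
        target[word] = phonemes
    return kept, rejected


def toy_g2p(word):
    """Hypothetical stand-in for a trained g2p model."""
    return ["H_" + ch for ch in word]


lexicon = {"sa": ["H_s", "H_a"], "ba": ["H_x", "H_y", "H_z"]}
kept, rejected = g2p_filter(lexicon, toy_g2p, max_distance=1)
print(sorted(kept), sorted(rejected))
```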
5. System Optimization
5.1 Pronunciation Dictionary Improvement
5.1.1 Automatic rejection of inconsistent or flawed entries (2)
• The threshold for each filtering method depends on the mean (μ) and the standard deviation (σ) of the measure in focus.
• Those word-pronunciation pairs whose resulting number is shorter than μ − σ or longer than μ + σ are rejected (~16% with each filtering method).
• We built new grapheme-to-phoneme (g2p) models with the remaining word-pronunciation pairs and applied them to the words with rejected pronunciations
2.67% relative improvement with
(Grapheme-based: 22.52%)
but still room for improvement
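A threshold derived from the mean and standard deviation of a measure can be sketched as below. The example assumes that entries outside the interval from mean minus one standard deviation to mean plus one standard deviation are rejected, and it uses the population standard deviation; both choices are assumptions of this sketch, as is the function name.

```python
import statistics


def filter_by_mean_std(measures):
    """Split entries by whether their measure lies within mean ± std.

    measures: dict mapping word -> numeric measure (e.g. length difference,
    null count, or edit distance).
    """
    values = list(measures.values())
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # assumption: population std. dev.
    low, high = mu - sigma, mu + sigma
    kept = {w: m for w, m in measures.items() if low <= m <= high}
    rejected = {w: m for w, m in measures.items() if not low <= m <= high}
    return kept, rejected


# Toy measures: one entry deviates strongly from the rest.
toy_measures = {"a": 0, "b": 0, "c": 0, "d": 0, "e": 5}
kept_entries, rejected_entries = filter_by_mean_std(toy_measures)
print(sorted(kept_entries), sorted(rejected_entries))
```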
5. System Optimization
5.1 Pronunciation Dictionary Improvement
5.1.2 Analysis of importance of tones and vowel length information
5.2 Language Model Improvement
Crawling additional text corpora (8 million tokens from http://hausa.cri.cn)
Increase vocabulary size with frequent words (42k resulted in best WER)
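Growing the vocabulary with frequent words from crawled text can be sketched as follows. The example assumes the vocabulary starts from the words in the training transcriptions and is then filled up with the most frequent crawled words until a target size is reached; the function name and this exact selection order are assumptions.

```python
from collections import Counter


def build_vocabulary(training_tokens, crawled_tokens, target_size):
    """Training words plus the most frequent crawled words up to target_size.

    If the training words alone already exceed target_size, they are
    returned unchanged.
    """
    vocab = set(training_tokens)
    for word, _count in Counter(crawled_tokens).most_common():
        if len(vocab) >= target_size:
            break
        vocab.add(word)
    return vocab


# Toy example with single-letter "words".
vocab = build_vocabulary(["a", "b"], ["c", "c", "c", "d", "d", "e", "a"], 4)
print(sorted(vocab))
```

Repeating this for several target sizes and measuring WER on the dev set for each is one way to arrive at a best vocabulary size.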
5. System Optimization
5.3 Speaker Adaptation and System Combination
System Combination Results (%) on dev (test) set.
6. Conclusion
• Development of a Hausa speech recognition system for large vocabulary
• Hausa is a lingua franca in West Africa spoken by over 25 million speakers
• We collected almost 9 hours of speech from 102 Hausa speakers reading newspaper articles.
• For automatic speech recognition, modeling tones and vowel lengths performs better than omitting tone or vowel length information
• We improved the pronunciation dictionary quality with methods to filter erroneous word-pronunciation pairs.
• The initial recognition performance of 23.49% WER was improved to 13.16% on the dev set and 16.26% on the test set.
Thanks for your interest!
Baie dankie!
References