Colloquial Language Modeling for Spontaneous Cantonese Speech Recognition


Colloquial Language Modeling for
Spontaneous Cantonese Speech Recognition
Supervisor: Pascale FUNG
Yuen Yee LO
HKUST
Department of Electrical and Electronic Engineering
The Hong Kong University of Science and Technology
5/23/2016
SBIR/HLTC/HKUST
1
Outline of the presentation:
 Introduction
 Background
 Methodology
 Evaluation I: Spontaneous Cantonese Speech Information Retrieval System
 Evaluation II: Colloquial Cantonese Dictation System
 Conclusion
 Future Developments
Introduction
• Cantonese speech
– one of the four major language groups in Chinese (Mandarin, Cantonese, Fukienese and Hakka)
– linguistically distinct from standard written Chinese
– more colloquial and spontaneous
– different in certain vocabulary
– different word order compared with standard Chinese (Mandarin)
• Spontaneous speech
– colloquial words
– hesitations, disfluencies, short pauses, corrections
– ill-formed grammar
– fillers such as “er”, “ar”, …
– out-of-vocabulary words
• Understanding spontaneous Cantonese speech
– spontaneous speech information retrieval system
– spot the keywords --> identify the keywords
– filter out the noise, e.g. “er”, “ar”, background noise
– filter out the filler phrases
– filter out the out-of-vocabulary words
• Large vocabulary continuous speech recognition (LVCSR)
– Cantonese dictation
– Cantonese differs from Mandarin in word order, colloquial phrases and certain vocabulary
– requires a Cantonese LM for LVCSR
– not enough Cantonese text for constructing the LM
– language model adaptation from a Mandarin LM to a Cantonese LM with a small amount of Cantonese text data
Publications
 Pascale FUNG and LO Yuen Yee: “Understanding Spontaneous Chinese Speech – Are Mandarin and Cantonese Very Different?” 1999 International Symposium on Signal Processing and Intelligent System (ISSPIS’99), 583-587.
 Pascale FUNG, CHEUNG Chi Shun, LAM Kwok Leung, LIU Wai Kat, LO Yuen Yee: “SALSA, A Hong Kong English Speech-based Web Browser”. ICSLP 98: Fifth International Conference on Spoken Language Processing, Sydney: Dec. 1998, vol. 4, 1615-1619.
 Pascale FUNG, CHEUNG Chi Shuen, LAM Kwok Leung, LIU Wai Kat, LO Yuen Yee, and MA Chi Yuen: “SALSA, A Multilingual Speech-Based Web Browser”. The First AEARU Web Technology Workshop, Kyoto: Nov. 1998, 16-21.
 Pascale FUNG and LO Yuen Yee: “An IR Approach for Translating New Words from Nonparallel, Comparable Texts”. The 36th Annual Conference of the Association for Computational Linguistics, Montreal, Canada: August 1998, 414-420.
Motivation:
• understand spontaneous Cantonese
– find out the characteristics of spontaneous Cantonese speech
• keyword spotting
– identify the keywords by filtering out the filler phrases and garbage
• spoken Cantonese is different from the formal written form
– find Cantonese texts to build a Cantonese LM for LVCSR
• the language difference between Cantonese and Mandarin
– measure the differences quantitatively
Motivation
• LVCSR
– HK and southern China -- Cantonese
– typing Chinese into a computer is difficult as the Chinese language is not alphabetic
– a speech-based input method is more convenient
– colloquial Cantonese dictation
– not enough Cantonese text data for computing the Cantonese LM
– Mandarin texts are available
– an LM adaptation technique is applied to adapt the Mandarin LM to a Cantonese LM with a small amount of Cantonese text data
Background:
Spontaneous Cantonese Speech
 Spontaneous speech
 colloquial spoken Chinese
 hesitations, corrections, ill-formed grammar, short pauses
 Cantonese speech
 one of the four major language groups in Chinese
 more spontaneous and colloquial than spoken Mandarin
 different in certain vocabulary and word order
 very different in colloquial and filler phrases
 similar in content words to standard written Chinese (Mandarin), different otherwise
Spontaneous Speech Recognition and
Understanding
• a system that allows speakers to speak more flexibly
• identifies the keywords embedded in spontaneous speech
• Keyword spotting:
– spontaneous speech database collection and analysis
– garbage modeling
– filler phrases modeling
Spontaneous speech database collection and
analysis
• find out the features of spontaneous speech
• collect natural responses from the users
• can be collected from a real-life environment or a scripted scenario
• a scripted scenario cannot collect natural responses
• analyse people’s speaking styles
• analyse the colloquial phrases
• extract filler phrases embedded in spontaneous speech with a statistical tool
e.g. “Would you please …”, “Thank you” and “I would like to …”
Garbage Modeling
• absorbs the extraneous speech embedded in spontaneous speech
e.g. “um”, “ha”, “er”, short pauses, out-of-vocabulary words
• suitable for small-vocabulary speech recognition
e.g. digits, command words
• not efficient for a large-vocabulary keyword spotting system
• training of the garbage model
– trained from the noise utterances
– with a greater number of mixtures, e.g. 16, 32, 64
Filler Phrases Modeling
• people often speak extraneous (filler) words in spontaneous speech
• a filler phrases model can absorb more of the extraneous speech embedded in spontaneous speech
• different tasks have different filler phrases models
• similar events have similar filler phrases
• can be extracted from transcribed speech data or texts
• Cantonese filler phrases are very different from Mandarin ones
Out-of-Vocabulary (OOV) Extraction
• a Chinese word combines one or more characters
• a new compound word combines one or more words
• many new words appear every day and are commonly used in daily life
• OOV words can degrade recognition performance
• detect OOV words and extract them with a statistical tool, CXtract
CXtract -- extracting OOV words
• an automatic statistical tool
• retrieves collocations of Chinese words
• collocation -- a pair of words which appear together significantly more often than expected by chance
• a word pair whose ‘strength’ and ‘spread’ are greater than certain thresholds is accepted as a new word
• Strength Ki of the word pair:

    Ki = (freqi − f̄) / σ

where freqi is the frequency of its collocation wi,
f̄ is the average frequency of the freqi,
and σ is the standard deviation.

• Spread Ui of the word pair:

    Ui = Σ_{j=1..10} (p_i^j − p̄_i)² / (2a)

where a is the window size,
p_i^j is the probability of the word pair at position j,
and p̄_i is the average of the p_i^j.
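The strength and spread statistics above can be sketched as follows. This is a minimal illustration of the idea, not the CXtract implementation: the function name `collocation_stats` is invented, and raw position counts stand in for the p_i^j probabilities.

```python
from collections import defaultdict

def collocation_stats(tokens, target, a=5):
    """Strength and spread of word pairs around `target`, in the spirit of
    CXtract: positions j = -a..a (excluding 0) around each occurrence."""
    freq = defaultdict(int)                    # freq[w]: pair count anywhere in window
    pos = defaultdict(lambda: defaultdict(int))  # pos[w][j]: pair count at position j
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        for j in range(-a, a + 1):
            if j == 0 or not (0 <= i + j < len(tokens)):
                continue
            w = tokens[i + j]
            freq[w] += 1
            pos[w][j] += 1
    if not freq:
        return {}
    n = len(freq)
    f_bar = sum(freq.values()) / n             # average frequency f-bar
    sigma = (sum((f - f_bar) ** 2 for f in freq.values()) / n) ** 0.5
    stats = {}
    for w, f in freq.items():
        strength = (f - f_bar) / sigma if sigma > 0 else 0.0
        p_bar = f / (2 * a)                    # average count per position
        spread = sum((pos[w].get(j, 0) - p_bar) ** 2
                     for j in range(-a, a + 1) if j != 0) / (2 * a)
        stats[w] = (strength, spread)
    return stats
```

Pairs whose strength and spread both exceed chosen thresholds would then be accepted as candidate new words.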
Large Vocabulary Continuous Speech
Recognition
LVCSR, e.g. a dictation system
• increasing the vocabulary can cause more confusion when decoding
• recognition performance can be greatly improved by an LM for LVCSR

Let W' be the decoded word sequence and O the acoustic observation sequence; the decoded string has the maximum a posteriori (MAP) probability:

    W' = argmax_W P(W | O)

P(W | O) is the conditional probability of W given the observation O.
By Bayes’ theorem,

    P(W | O) = P(O | W) P(W) / P(O)

where P(O | W) is the observation probability given W,
P(O) is the acoustic observation probability,
and P(W) is the word sequence probability.

Since P(O) is independent of W,

    W' = argmax_W P(W) P(O | W)

P(W) is obtained from the language model.
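The MAP decision rule above amounts to ranking hypotheses by the sum of acoustic and LM log-scores. A toy sketch, with entirely hypothetical candidate sequences and made-up log-probabilities (a real decoder searches this space rather than enumerating it):

```python
# Hypothetical candidates with assumed log P(O|W) and log P(W) values.
candidates = {
    "我 想 去": {"log_acoustic": -12.0, "log_lm": -3.2},
    "我 想 睇": {"log_acoustic": -11.5, "log_lm": -4.8},
    "我 相 去": {"log_acoustic": -11.8, "log_lm": -9.1},
}

def map_decode(candidates):
    """W' = argmax_W P(O|W) P(W), computed in log space for stability."""
    return max(candidates,
               key=lambda w: candidates[w]["log_acoustic"] + candidates[w]["log_lm"])
```

Note how the LM term penalises the acoustically plausible but ungrammatical third candidate.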
Chinese Language Model
• about 40000 Chinese characters are commonly used
• about 40000 Chinese words appear frequently
• a word is composed of one to several characters
• a compound word can be formed by combining different words
• each Chinese character is represented by a syllable and a tone (app. 600 syllables and 9 tones in Cantonese)
• a lot of homonyms
• a character can be pronounced differently when forming a word
e.g. 銀行 (bank) (ngan hong) and 行人 (pedestrian) (hang jan)
• high degree of ambiguity for LVCSR
– a Chinese sentence is not segmented
– segment the Chinese text based on the Chinese lexicon
– construct the word sequence W
– compute the Markov process with a probability as follows:

    P(W) = P(w1 w2 w3 … wn)
         = P(w1) P(w2 | w1) … P(wn | w(n−N) w(n−N+1) … w(n−1))

where P(wn | w(n−N) w(n−N+1) … w(n−1)) is called an N-gram probability
– usually uni-gram, bi-gram and tri-gram probabilities are computed
– an efficient LM requires a lot of text data to train
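The N-gram estimation above (for N = 2) can be sketched by relative-frequency counting over segmented sentences. A minimal maximum-likelihood version with no smoothing; the bi-gram denominator uses the uni-gram count of the history word, which slightly overcounts sentence-final words:

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Estimate P(w) and P(w_n | w_{n-1}) by relative frequency (MLE)."""
    uni, bi = Counter(), Counter()
    for words in sentences:
        uni.update(words)
        bi.update(zip(words, words[1:]))
    total = sum(uni.values())
    p_uni = {w: c / total for w, c in uni.items()}
    # P(w_n | w_{n-1}) = count(w_{n-1} w_n) / count(w_{n-1})
    p_bi = {(h, w): c / uni[h] for (h, w), c in bi.items()}
    return p_uni, p_bi

def sentence_prob(words, p_uni, p_bi):
    """P(W) = P(w1) * prod_n P(w_n | w_{n-1}) under the Markov assumption."""
    p = p_uni.get(words[0], 0.0)
    for h, w in zip(words, words[1:]):
        p *= p_bi.get((h, w), 0.0)
    return p
```

The need for "a lot of text data" shows up immediately: any unseen bi-gram zeroes out P(W), which is why smoothing and, below, adaptation matter.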
Language Model Adaptation
• a domain-specific LM is more effective in speech recognition
• only a small amount of data is available for each specific domain
• LM adaptation can adapt a general LM to a particular task LM
• common approaches:
– linear interpolation
– MAP method
– backoff
• a similar technique is used for adapting the Mandarin LM to a Cantonese LM
Difficulty of Spontaneous Cantonese Speech
Recognition
• no spontaneous Cantonese speech database
• spoken Cantonese is not a written form
• only a small amount of Cantonese text data is available
• spoken Cantonese is different from spoken Mandarin in terms of word order, colloquial terms and certain vocabulary
• new terms, compound words and colloquial terms emerge from time to time
Spontaneous speech analysis and modeling
• spontaneous speech database collection and analysis
• Cantonese text collection and analysis
• language differences measurement
• Cantonese language model adaptation
Spontaneous Cantonese Speech collection and
data analysis
Wizard-of-Oz speech collection system
 an operator behind the screen simulates the automatic machine response
 collects spontaneous speech
 allows speakers to speak spontaneously to search for information on web pages or control the web browser
 39 speakers -- 23 engineering students and 16 business students; 25 males, 14 females
 4150 utterances collected
Analysis
 90 Cantonese filler phrases and colloquial phrases are extracted using the statistical tool CXtract
 filler phrases are very different from spoken Mandarin
 people with less technical background tend to speak more spontaneously and colloquially
 utterances include a lot of words that are not listed in the standard Mandarin lexicon
 the content words or keywords are similar to Mandarin
 short pauses, corrections, hesitations
 spoken Cantonese is very different from the written form (Mandarin)
Automatic Extraction of Cantonese
Colloquial Phrases

Bigram colloquial phrases:

Cantonese | Written Chinese | English meaning
我想 | 我想 | I want to
唔該 | 多謝 | Thank you
o個度 | 那兒 | There
睇下 | 看一看 | Read about …
想睇 | 想看一看 | Want to read …
呢個 | 這個 | That page ..
之前個頁 | 上一頁 | Previous page
可唔可以 | 可不可以 | Is it possible …
係咩黎嫁 | 是什麼 | What’s that …

N-gram colloquial phrases:

Cantonese | Written Chinese | English meaning
繼續落 | 繼續向下 | Scroll down slowly
o個版呀 | 這一版 | That page ..
我想去 | 我想去 | I want to go …
Cantonese Text Collection and Analysis
 Collect Cantonese text for
 extracting more colloquial phrases
 constructing a Cantonese LM for LVCSR
 Difficulties:
 not enough transcriptions of flexible speech
 cannot collect spontaneous speech for each specific domain
 spoken Cantonese is not in the written form (Mandarin)
 standard newspapers and Chinese documents do not fully represent spoken Cantonese
 only a small amount of Cantonese text is available
 Solution -- online newsgroup articles
Hong Kong newsgroup articles are similar to spoken Cantonese
 Collection:
Period: 6 months (~12 MByte)
Topics: travel, politics, literature, entertainment, technology, etc.
 Analysis
 600 colloquial phrases and new proper names are extracted by CXtract
 the colloquial phrases and new proper names are not in the standard Mandarin lexicon
 these colloquial phrases are very commonly used by the general population in Hong Kong
Example Hong Kong Colloquial Phrases

Cantonese | Written Chinese | English meaning
佢地 | 他們 | They
唔該 | 謝謝 | Thank you
而家 | 現在 | Now / nowadays
唔駛 | 不用 | No need
呢個 | 這個 | This
係咪 | 是否 | Isn’t it?
點解 | 為什麼 | Why
成日 | 經常 | Always
知唔知 | 知不知道 | Do you know?
同埋 | 和 | Together with / and
邊度 | 哪兒 | Where
咁多 | 這麼多 | A lot
請問有冇 | 請問有沒有 | Are there any … ?
得唔得 | 可不可以 | Is it possible … ?
差唔多 | 差不多 | Nearly the same
請問邊度 | 請問哪兒 | Could you tell me where is … ?
Similarity Measure of different languages
 Zipf’s Law of language distance
 between Cantonese and Mandarin texts, most content words (keywords) are the same
 use a common content-word lexicon to tokenize the newsgroup articles from Hong Kong, Taiwan and China
 given a pair of texts, plot the frequencies of their common words on the horizontal and vertical axes (text 1 vs. text 2)
 by Zipf’s Law of language distance, if two texts are exactly the same, the plot collapses into a line with slope 1
Mandarin (TW/China) texts are more similar
to each other than to Cantonese (HK) texts
R² regression scores of frequency:

    R² = (Σxy − (Σx)(Σy)/n)² / [ (Σx² − (Σx)²/n) (Σy² − (Σy)²/n) ]

where x is the frequency of a common word in text 1
and y is the frequency of the common word in text 2

            | All topics | Politics | Travel | Entertainment
HK1 vs HK2  | 0.9499     | 0.9244   | 0.8363 | 0.8200
HK vs TW    | 0.7957     | 0.6018   | 0.5790 | 0.5790
TW1 vs TW2  | 0.9645     | 0.7045   | 0.8295 | 0.9551
HK vs Chi   | 0.6745     | Nil      | Nil    | 0.5277
TW vs Chi   | 0.8318     | Nil      | Nil    | 0.7022
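The R² score above is the standard squared Pearson correlation of paired frequencies; given two frequency vectors for the common words, it can be computed directly:

```python
def r_squared(x, y):
    """R^2 regression score of common-word frequencies in two texts:
    (Sxy - Sx*Sy/n)^2 / ((Sxx - Sx^2/n) * (Syy - Sy^2/n))."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    num = (sxy - sx * sy / n) ** 2
    den = (sxx - sx * sx / n) * (syy - sy * sy / n)
    return num / den
```

Perfectly proportional frequencies give R² = 1 (the slope-1 collapse under Zipf's law of language distance); the score falls as the two texts diverge.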
Result analysis
• within-language pairs converge to the diagonal
• within-language articles have higher similarity scores than cross-language articles
• underlining that written Cantonese and written Mandarin are different
Cantonese Language Model Adaptation
• insufficient Cantonese text for constructing a Cantonese LM for LVCSR
• Cantonese and Mandarin share the same content words
• different in colloquial phrases and certain vocabulary
• most colloquial phrases are not listed in the standard Chinese lexicon and texts
• the word order of spoken Cantonese is different from Mandarin
• borrow the technique of LM adaptation
• segment HK newspapers based on the Mandarin lexicon
• construct a Mandarin LM from the HK newspapers (uni-gram and bi-gram)
• colloquial phrases extracted from HK newsgroup articles are added to the standard Mandarin lexicon
• segment the HK newsgroup articles based on the new lexicon
• adapt the Mandarin LM to a Cantonese LM with a small amount of Cantonese text by linear interpolation
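The slides do not specify which lexicon-based segmentation algorithm is used; forward maximum matching is one common choice for segmenting Chinese text against a lexicon, sketched here purely as an illustration:

```python
def max_match(text, lexicon, max_len=6):
    """Greedy forward maximum-matching segmentation: at each position take
    the longest lexicon entry that matches; fall back to one character."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, down to a single character.
        for L in range(min(max_len, len(text) - i), 0, -1):
            if L == 1 or text[i:i + L] in lexicon:
                words.append(text[i:i + L])
                i += L
                break
    return words
```

Adding the extracted colloquial phrases to the lexicon changes the segmentation: with 唔該 in the lexicon, 我唔該你 segments as 我 / 唔該 / 你 rather than character by character.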
Linear interpolation:

uni-gram:

    Padp(wi) = λ1 PM(wi) + (1 − λ1) PC(wi)

where λ1 and (1 − λ1) are the combination factors,
Padp(wi) is the probability of the adapted uni-gram,
PM(wi) is the uni-gram probability of word wi in the Mandarin corpus,
PC(wi) is the uni-gram probability of word wi in the Cantonese corpus,
and the combination factor is

    λ1 = PM(wi) / (PM(wi) + PC(wi))

bi-gram:

    Padp(wi | w(i−1)) = λ2 PM(wi | w(i−1)) + (1 − λ2) PC(wi | w(i−1))

where λ2 and (1 − λ2) are the combination factors,
Padp(wi | w(i−1)) is the probability of the adapted bi-gram,
PM(wi | w(i−1)) is the bi-gram probability of word wi in the Mandarin corpus,
PC(wi | w(i−1)) is the bi-gram probability of word wi in the Cantonese corpus,
and the combination factor is

    λ2 = PM(wi | w(i−1)) / (PM(wi | w(i−1)) + PC(wi | w(i−1)))
The linearly interpolated bi-gram LM is a linear combination involving the lower order empirical distribution as follows:

    P(wi | w(i−1)) = a Padp(wi | w(i−1)) + b Padp(wi)

where a and b are linear combination factors;
a is set to 0.9 and b is set to 0.1
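The two interpolation steps above can be sketched per probability entry. A minimal illustration of the formulas, not the actual adaptation code; the function names are invented, and the fallback weight of 0.5 when both probabilities are zero is an added assumption:

```python
def adapt_prob(p_m, p_c):
    """Adapt a Mandarin probability toward Cantonese by linear interpolation
    with the weight from the slides: lambda = P_M / (P_M + P_C).
    Works for both the uni-gram and the bi-gram case."""
    lam = p_m / (p_m + p_c) if (p_m + p_c) > 0 else 0.5  # assumed fallback
    return lam * p_m + (1 - lam) * p_c

def smoothed_bigram(p_adp_bi, p_adp_uni, a=0.9, b=0.1):
    """Combine the adapted bi-gram with the adapted uni-gram:
    P(wi | w(i-1)) = a * Padp(wi | w(i-1)) + b * Padp(wi)."""
    return a * p_adp_bi + b * p_adp_uni
```

With this weighting, entries the Mandarin corpus supports strongly stay close to the Mandarin estimate, while Cantonese-only colloquial entries are carried by the Cantonese counts.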
Evaluation I: Flexible Speech Recognition
Experiment Setup:
Cantonese models:
 speaker independent
 clean Cantonese word training data
 initial-final models, 195 HMM models
 initial -- right-context dependent
 final -- context independent
 16 mixtures and 39-dimensional feature vectors
Filler phrases
 a total of 82 filler phrases extracted from the transcribed speech using CXtract
Garbage model
 a garbage model for absorbing non-speech noise
Keywords
 600 keywords in total
 keyword length ~ 1 to 10 Chinese characters
Testing data
 no. of speakers: 15 (13 female + 2 male)
 520 spontaneous Cantonese utterances collected from the Wizard-of-Oz database
 recorded in a lab environment
Results

Test case | % of correct keyword recognition
Garbage only (baseline I) | 61.9%
Fillers only (baseline II) | 72.8%
Garbage + Fillers | 82.5%
Evaluation II:
Colloquial Cantonese Dictation System
• Experimental Setup
Acoustic model
– Training data:
• continuous Cantonese speech
• 2700 sentences from newspapers
• 2400 Chinese words (2-6 Chinese characters)
• phonetically balanced
• speaker-independent
– Subword model
• 195 HMM models, initial-final
• context dependent
• 6 mixtures and 39-dimensional feature vectors
Colloquial Cantonese LM Adaptation
• Lexicon
– Mandarin lexicon -- 3600 entries
– 600 colloquial phrases extracted from newsgroup articles are added to form a new lexicon
• Mandarin LM (baseline)
– HK newspapers (app. 1 year)
– segmented based on the standard Mandarin lexicon
– uni-gram and bi-gram probabilities are computed
• Cantonese text data
– HK newsgroup articles (6 months, 12 MByte)
– segmented based on the new lexicon
– uni-gram and bi-gram probabilities are computed
• LM adaptation
– linear interpolation
• Testing data
– 162 utterances
– including HK colloquial terms
– recorded in a lab environment
Results:

Baseline (Mandarin LM) | Colloq. adapted LM | Improvement
49% | 64% | 30%
Conclusion
• To model the colloquial language for spontaneous Cantonese speech recognition, we analyse the spontaneous speech data collected with a Wizard-of-Oz collection system. 90 filler phrases are extracted.
• Spoken Cantonese is different from the written form (Mandarin), and there is only a little Cantonese text for analysis. We downloaded HK newsgroup articles, which are more similar to spoken Cantonese, for analysing the colloquial phrases and for LM adaptation. 600 colloquial phrases that are not listed in the standard Mandarin lexicon are extracted.
• By Zipf’s law of language distance and R² regression scores, we find that within-language articles are more similar than cross-language articles.
• By applying garbage and filler phrases models, we obtained 82.5% keyword spotting accuracy, a 33% improvement over our baseline system.
• As there is a lack of Cantonese texts, we adapt the baseline Mandarin LM to a Cantonese LM by the linear interpolation method for our colloquial Cantonese dictation system. This gives a 30% improvement in character accuracy compared with our baseline system.
Future Developments
Speech-based information retrieval system
• domain-independent or domain-adapted filler phrases
• keyword verification
Colloquial Cantonese dictation system
• automatic update of the lexicon and language model
• class-based language model
• speaker adaptation