From generic to task-oriented speech recognition: French


ASR and scalability
Dominique Vaufreydaz
ESSLLI’02
GEOD - Groupe d'Etude sur l'Oral et le Dialogue
Communication Langagière et
Interaction Personne-Système
Fédération IMAG
BP 53 - 38041 Grenoble Cedex 9 - France
ASR and scalability
• State-of-the-art speech recognition
– general overview
– acoustic modelling
– language modelling
• Web-trained language models
– scalability of Web data?
– Nespole! example
– results
State-of-the-art speech recognition - general overview
Automatic speech recognition
[Diagram: phonetically labelled signals are used to train the acoustic models and a
text corpus is used to train the language model(s); at recognition time, the speech
signal is converted into acoustic parameters and the decoder combines both models to
output the recognised words.]

w* = argmax_i p(x | wi) · P(wi)

Acoustic parameters used:
- Mel-scaled Frequency Cepstral Coefficients (MFCC)
- Energy
- Zero crossing
- Linear Predictive Coding (LPC)
- Perceptual Linear Predictive (PLP) and Rasta-PLP
- etc.
- Δ and ΔΔ of these parameters
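To make this decision rule concrete, here is a minimal sketch (not from the original slides): the decoder scores each candidate word sequence by combining an acoustic score p(x|w) and a language model score P(w) in the log domain. The functions acoustic_log_likelihood and lm_log_prob are hypothetical placeholders supplied by the caller.

import math

def recognize(x, candidates, acoustic_log_likelihood, lm_log_prob, lm_weight=1.0):
    """Pick w* = argmax_w p(x|w) . P(w), working in the log domain."""
    best_w, best_score = None, -math.inf
    for w in candidates:
        # log p(x|w) from the acoustic models + scaled log P(w) from the language model
        score = acoustic_log_likelihood(x, w) + lm_weight * lm_log_prob(w)
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score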
State-of-the-art speech recognition - acoustic modelling
Hidden Markov Models
• Two different stochastic processes
– X: a first order hidden Markov chain for temporal variability
– Y: an observable process, for spectral variability
• An HMM can be described with λ = (A, B, π):
– Matrix A: transition probabilities from one state to another
ai,j = p(Xt = j | Xt-1 = i)
– Matrix B: observation probability distributions
bi,j(y) = p(Yt = y | Xt-1 = i, Xt = j)
In continuous speech recognition, these distributions are Gaussian
mixtures defined by:
• the mean vectors
• the covariance matrices
• the weight of each Gaussian
– Vector π: initial state probabilities
πi = p(X0 = i)
State-of-the-art speech recognition - acoustic modelling
Acoustic units
• Different kinds of system
– context independent systems: phonemes (or other units)
– context dependent systems: allophones, i.e. units in context.
More robust but use more memory and CPU.
⇒ The availability of enough training data determines the choice
between context-dependent and context-independent models, and the
number of different allophones.
• HMM topology for each unit
– usually, a Bakis model (left-to-right first order
model) with ai,j = 0 if j < i
[Diagram: three-state Bakis model S1 → S2 → S3, with self-loops a11, a22, a33,
forward transitions a12, a23 and a skip transition a13.]
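A small illustrative sketch (not from the slides) of how such a Bakis topology can be encoded: a transition matrix with ai,j = 0 whenever j < i, allowing only self-loops, next-state transitions and a skip transition.

import numpy as np

def bakis_transition_matrix(n_states, max_jump=2):
    """Random left-to-right (Bakis) transition matrix: a_ij = 0 whenever j < i."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        upper = min(i + max_jump, n_states - 1)
        allowed = np.arange(i, upper + 1)          # self-loop, next state(s), skip
        A[i, allowed] = 1.0 / len(allowed)         # uniform start, re-estimated later
    return A

print(bakis_transition_matrix(3))
# -> rows [1/3 1/3 1/3], [0 1/2 1/2], [0 0 1]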
State-of-the-art speech recognition - acoustic modelling
Train acoustic models
• Estimation and iterative reestimation of the
model parameters
– need an acoustic corpus:
• matching the future recognition conditions (speech
quality, noise environment, etc.)
• annotated in acoustic units, i.e. giving a sequence of
acoustic observations O.
– use the Baum-Welch or Expectation-Maximisation
(EM) algorithms
• find λ = (A, B, π) to maximise P(O|λ)
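For illustration (not part of the slides), a minimal sketch of the forward algorithm computing P(O|λ), the quantity that Baum-Welch re-estimation increases at each iteration; a discrete-emission HMM is assumed to keep the example short.

import numpy as np

def forward_likelihood(A, B, pi, O):
    """P(O | lambda) for a discrete-emission HMM via the forward algorithm.

    A : (n_states, n_states) transition matrix
    B : (n_states, n_symbols) emission probabilities
    pi: (n_states,) initial state probabilities
    O : sequence of observed symbol indices
    """
    alpha = pi * B[:, O[0]]                  # alpha_0(i) = pi_i * b_i(o_0)
    for o in O[1:]:
        alpha = (alpha @ A) * B[:, o]        # alpha_t(j) = sum_i alpha_{t-1}(i) a_ij * b_j(o_t)
    return float(alpha.sum())

# Toy 2-state, 2-symbol example; Baum-Welch would adjust A, B, pi to raise this value.
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
print(forward_likelihood(A, B, pi, [0, 1, 0]))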
State-of-the-art speech recognition - acoustic modelling
Acoustic Model Adaptation
• Having enough training data for the new acoustic
conditions
– train a new model with these data
– train a multicondition model with all your data
• Having a numerical way to simulate the new conditions (from
clean speech to G.723 speech, for example)
– transcode your data and train a new or multicondition model
• Having only a few adaptation data
– use adaptation algorithms such as:
• Maximum Likelihood Linear Regression (MLLR)
• Maximum A Posteriori (MAP)
• Bayesian Predictive Adaptation (BPA)
• etc.
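As a pointer to what such adaptation can look like, here is a minimal sketch (an assumption, simplified, not the actual algorithms cited above) of MAP adaptation of a single Gaussian mean: the adapted mean interpolates between the prior mean and the mean of the adaptation frames, with hard frame assignments for simplicity.

import numpy as np

def map_adapt_mean(prior_mean, adaptation_frames, tau=10.0):
    """MAP re-estimation of a Gaussian mean from a few adaptation frames.

    prior_mean       : mean vector of the speaker-independent model
    adaptation_frames: (n_frames, dim) acoustic vectors assigned to this Gaussian
    tau              : prior weight; larger tau keeps the mean closer to the prior
    """
    n = len(adaptation_frames)
    if n == 0:
        return prior_mean                      # no data: keep the prior model
    data_mean = np.mean(adaptation_frames, axis=0)
    return (tau * prior_mean + n * data_mean) / (tau + n)

# Toy example: 5 adaptation frames pull the mean slightly away from the prior.
prior = np.zeros(3)
frames = np.ones((5, 3))
print(map_adapt_mean(prior, frames))          # ~[0.33, 0.33, 0.33]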
State-of-the-art speech recognition - language modelling
Statistical language models
• Statistical language models
P(W) = P(w1) · P(w2|w1) · ∏i=3..n P(wi | w1 ... wi-1)
– more robust than grammars for large-vocabulary and
dialogue systems
– not only a yes/no answer
• n-gram models: considering n-1 words as context
– mostly n is 3 (trigrams):
P(wn | w1 ... wn-1) ≈ P(wn | wn-2 wn-1) = N(wn-2 wn-1 wn) / N(wn-2 wn-1)
⇒ need text corpora to compute these probabilities
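To make the count-based estimate concrete, a minimal sketch (illustrative, not the actual LM toolkit behind these slides) of maximum-likelihood trigram probabilities from a tokenised corpus; a real system would add smoothing and back-off on top of this.

from collections import Counter

def train_trigram_counts(sentences):
    """Collect trigram and bigram-history counts from tokenised sentences."""
    tri, bi = Counter(), Counter()
    for words in sentences:
        padded = ["<s>", "<s>"] + words + ["</s>"]
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1
            bi[tuple(padded[i - 2:i])] += 1
    return tri, bi

def trigram_prob(tri, bi, w1, w2, w3):
    """Maximum-likelihood estimate P(w3 | w1 w2) = N(w1 w2 w3) / N(w1 w2)."""
    history = bi[(w1, w2)]
    return tri[(w1, w2, w3)] / history if history else 0.0

corpus = [["le", "laboratoire", "clips"], ["le", "laboratoire", "de", "grenoble"]]
tri, bi = train_trigram_counts(corpus)
print(trigram_prob(tri, bi, "<s>", "le", "laboratoire"))   # 1.0 on this toy corpus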
State-of-the-art speech recognition - language modelling
Compute a language model
1 – "Wizard of Oz" experiments
[Diagram: the transcriptions of the "Wizard of Oz" dialogues are fed to LM
adaptation tools, which produce the language model.]
2 – train a language model
A third way: using all the available data on the Web?
ASR and scalability
• State-of-the-art speech recognition
– general overview
– acoustic modelling
– language modelling
• Web-trained language models
– scalability of Web data?
– Nespole! example
– results
Web-trained language models - scalability of Web data?
Scalability using the Web?
• Huge amount of data on many topics
– ~200,000 different French lexical forms
– different kinds of text
• well-written text in professional pages for example
• pseudo dialog forms in personal Web pages
« Euh... bonjour, euh... c'est l'Institut Macareux
... euh... c'est pour un sondage (anonyme, quoi... hein) ! »
• The size of the training set increases steadily with
the vocabulary size:
Task       Vocabulary size   Size of the training corpus
CStar-II   ~3K               ~145 M words
Nespole!   ~20K              ~1587 M words
Web-trained language models - Nespole! example
Specific vocabulary definition
• Recording real dialogues in real conditions
(see "Data Collection in Nespole!")
– 5 different scenarios recorded through NetMeeting
– 191 dialogues in 4 languages, including 31 French ones
manually transcribed
⇒ the extracted French vocabulary contains 2056 words
• Add CStar-II vocabulary
– a specific tourist vocabulary was previously defined
for the CStar-II project
⇒ the vocabulary grows to 2500 words
Web-trained language models - Nespole! example
Increase vocabulary coverage - lexical OOV
[Diagram: word frequencies are computed on the WebFr4 corpus and checked against the
ABU and BDLex lexicons; the most frequent words are added to the specific vocabulary
to obtain the 20K vocabulary.]
1 - compute word counts
2 - add the most frequent words
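A minimal sketch of how such a step could look (an assumption, not the actual tools used): the specific vocabulary is extended with the most frequent Web words that are validated by a reference lexicon, until the 20K target is reached.

from collections import Counter

def extend_vocabulary(specific_vocab, web_word_counts, lexicon, target_size=20000):
    """Add the most frequent lexicon-validated Web words to the specific vocabulary."""
    vocab = set(specific_vocab)
    # Consider Web words by decreasing frequency, keeping only known lexical forms.
    for word, _count in web_word_counts.most_common():
        if len(vocab) >= target_size:
            break
        if word in lexicon and word not in vocab:
            vocab.add(word)
    return vocab

# Toy example with made-up counts and a tiny lexicon.
specific = {"bonjour", "réserver", "hôtel"}
counts = Counter({"le": 5000, "badge": 3439, "portiques": 1165, "xyzzy": 900})
lexicon = {"le", "badge", "portiques", "bonjour", "réserver", "hôtel"}
print(sorted(extend_vocabulary(specific, counts, lexicon, target_size=6)))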
Web-trained language models - Nespole! example
Increase vocabulary coverage - short words
[Diagram: 5-grams made only of short words are counted on WebFr4; the most frequent
ones are added as multi-word units to the 20K vocabulary to obtain the final
vocabulary.]
3 - compute 5-grams on short words
(5 letters and 3 phonemes maximum)
4 - add the most frequent multi-words
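A sketch of the idea, under the assumption that a multi-word unit is any frequent sequence of up to five short words (at most five letters each; the phoneme test is omitted here): count such sequences in the corpus and keep the most frequent ones as single vocabulary entries.

from collections import Counter

MAX_LETTERS = 5          # "short word" criterion used here (phoneme test omitted)
MAX_NGRAM = 5            # multi-word units of up to 5 short words

def short_word(word):
    return len(word) <= MAX_LETTERS

def count_multiwords(sentences):
    """Count n-grams (2..5 words) made only of short words."""
    counts = Counter()
    for words in sentences:
        for n in range(2, MAX_NGRAM + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                if all(short_word(w) for w in gram):
                    counts["_".join(gram)] += 1   # e.g. "est_ce_que"
    return counts

corpus = [["est", "ce", "que", "vous", "pouvez"], ["est", "ce", "que", "je", "peux"]]
print(count_multiwords(corpus).most_common(3))   # "est_ce_que" and its parts come out on top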
Web-trained language models - Nespole! example
Trigram language model
5 - compute 3-gram language models
[Diagram: the WebFr4 corpus is passed through a minimal block length filter
(length = 5), giving a training corpus of 1,587,142,200 words; with the final
vocabulary (20,540 words), adapted LM tools produce the final LM:
1,960,813 bigrams and 6,413,376 trigrams.]
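A sketch of what the minimal block length filter could look like, under the assumption that text blocks extracted from Web pages are kept only if they contain at least five words; the retained blocks are then passed to the LM training tools.

MIN_BLOCK_LENGTH = 5   # assumed meaning of "minimal block length filter (length=5)"

def filter_blocks(blocks, min_length=MIN_BLOCK_LENGTH):
    """Keep only text blocks containing at least `min_length` words."""
    kept = []
    for block in blocks:
        words = block.split()
        if len(words) >= min_length:
            kept.append(words)
    return kept

blocks = [
    "il mordait en ce moment de fort bon appétit dans un morceau de pain",
    "accueil",                                # navigation residue: too short, dropped
    "le centre national de la recherche",
]
for words in filter_blocks(blocks):
    print(" ".join(words))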
Web-trained language models - results
Results
• On the CStar-II task (~3000 words)
Training corpus   Test corpus   # speakers   # utterances   # words   WA
WebFr1            CStar120      2            120            1127      88%
• On the Nespole! task (20524 words)
Training corpus   Test corpus   # speakers   # utterances   # words   WA
WebFr4            Nespole       2            235            2066      72.5%
Example of raw Web data: the HTML source of the CLIPS laboratory home page
<!DOCTYPE HTML SYSTEM "HTML.dcl" []>
<HTML VERSION = "2.0">
<HEAD>
<TITLE>
Laboratoire CLIPS</TITLE>
</HEAD>
<!--changement de couleur de fond -- DS, 19 mai 1997 -->
<BODY BGCOLOR = "#FFFFFF">
<TABLE LANG = "en_us" COLSPEC = "C158C215C170" UNITS = "EM" ALIGN = "CENTER" CLEAR = "no">
<TR LANG = "en_us" VALIGN = "">
<TD LANG = "en_us" COLSPAN = "1" ROWSPAN = "1" VALIGN = "TOP">
<IMG SRC = "clip-arts/logos/logo.gif" ALIGN = "TOP" HEIGHT = "120">
</TD>
<TD LANG = "en_us" COLSPAN = "1" ROWSPAN = "1" VALIGN = "TOP">
<TABLE LANG = "en_us" COLSPEC = "C207" UNITS = "EM" ALIGN = "CENTER" CLEAR = "no">
<TR LANG = "en_us" VALIGN = "TOP">
<TD LANG = "en_us" COLSPAN = "1" ROWSPAN = "1" VALIGN = "TOP">
<H1>
<B><I>CLIPS</I>
</B>
</H1>
</TD>
</TR>
<TR LANG = "en_us" VALIGN = "TOP">
<TD LANG = "en_us" COLSPAN = "1" ROWSPAN = "1" VALIGN = "TOP">
Communication Langagi&egrave;re et<BR>Interaction Personne Syst&egrave;me </TD>
</TR>
<TR LANG = "en_us" VALIGN = "TOP">
<TD LANG = "en_us" COLSPAN = "1" ROWSPAN = "1" VALIGN = "TOP">
<I>F&eacute;d&eacute;ration IMAG</I>
[…]
Example of French text extracted from such Web pages (with <s> and </s> sentence boundary markers):
rue de la bibliothèque b </s>
est un laboratoire de grenoble
<s> le centre national de la
<s> un laboratoire et un centre
<s> vous pouvez également faire des
de mots sur tout le
<s> nous avons aussi un peu
si vous ne trouvez pas ce que vous cherchez ici
également la liste de nos
organisée par le laboratoire clips
est de plus en plus important </s>
mais aussi à toute personne
<s> tout savoir sur le programme
<s> la sélection de la semaine </s>
sur le site web de la
sur le site de la
<s> pour profiter de ce site il est
<s> sinon vous pouvez visiter une
de haut niveau dans les domaines
<s> chaque année un programme est
<s> pour accéder directement au programme
et la chimie de la matière
juillet à grenoble saint martin
semaine de juillet à grenoble saint martin