On the use of Intonation in ASR: preliminary results

March 13th-14th, 2003
Meeting COST 275
Halmstad
OUR GROUP

UPV EI Bilbao: Inma Hernáez (leader), Eva Navas, Jon Sanchez.
UVA ETSII Valladolid: Valentín Cardeñoso, Isaac Moro, Carlos Vivaracho, David Escudero.
Involved in a CICYT Project of Biometrics with Javier Ortega (Madrid) and Marcos Faundez (Barcelona).
Experience in ASR and in Modelling Intonation.
OUR AIM



To work in the CICYT Project with the rest of
the groups.
To apply our knowledge in Intonation to the
field of ASR.
Here we present our preliminary results.
INTRODUCTION

Why make use of intonation in ASR?

It is a feature that characterizes the speaker:
Speakers of the same group have a similar prosody.
Each speaker can have his or her own prosody.

It is a very robust feature across:
Different sessions
Different microphones

Other experiences in applying intonation to ASR:
SUPER SID: very simple model of intonation.
INTRODUCTION

Aim of this preliminary work

To show the potential of intonation when facing different sessions and microphones.
To show that it can be important to use “sophisticated” models in order to obtain benefits in ASR.
Overview






Presentation of the model of intonation.
The corpus.
The experiment of speaker verification.
Considerations about the robustness of the results.
Considerations about the use of the model of intonation.
Conclusions and future work.
Modelling Intonation
Basic idea for its application to ASR: to compare the models of different speakers.
The Corpus






Recorded at EUPMT by Marcos Faundez.
One paragraph read by 16 speakers in 2 sessions with 3 microphones.
Each paragraph = 11 sentences, 106 stress groups. 3816 intonation units.
Speakers are male and in the same social group.
The pitch was obtained automatically and segmented into intonation units by hand.
Intonation was parameterised according to the intonation model.
The Experiment

Speaker verification.

We have 6 recordings for each speaker: 5 for modelling and 1 for testing.
Each speaker has his own impostor, modelled with the samples of the remaining speakers.
The verification experiment is repeated six times for each speaker, once for each possible choice of test recording.
The classifier is based on C4.5 decision trees, using the freeware WEKA.
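The protocol above can be sketched as follows. This is only a minimal illustration under assumptions: the speaker labels L0 to L15 are borrowed from the results table, the session and microphone indices and the function name `verification_folds` are hypothetical, and the slide does not say how impostor test samples were chosen, so the sketch only holds out one recording of the target speaker.

```python
from itertools import product

SPEAKERS = [f"L{i}" for i in range(16)]            # 16 speakers (labels assumed)
SESSIONS = (1, 2)                                  # 2 sessions
MICROPHONES = (1, 2, 3)                            # 3 microphones
RECORDINGS = list(product(SESSIONS, MICROPHONES))  # 6 recordings per speaker


def verification_folds(target):
    """Yield (target_train, impostor_train, test) splits for one speaker."""
    # The impostor is modelled with the samples of all remaining speakers.
    impostor_train = [(s, r) for s in SPEAKERS if s != target for r in RECORDINGS]
    for held_out in RECORDINGS:
        # 5 recordings for modelling, 1 for testing.
        target_train = [(target, r) for r in RECORDINGS if r != held_out]
        yield target_train, impostor_train, (target, held_out)


folds = list(verification_folds("L6"))
total_experiments = len(SPEAKERS) * len(folds)
print(len(folds), total_experiments)
```

Each split would then be handed to the classifier; with 16 speakers and 6 splits per speaker this gives one rate per cell of a 6 by 16 table.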
Results
      L0     L1     L2     L3     L4     L5     L6     L7     L8     L9     L10    L11    L12    L13    L14    L15
M1    65.19  71.60  57.95  51.72  58.52  73.24  80.12  65.50  48.75  64.77  65.03  62.50  60.87  54.97  63.69  67.10
M2    63.52  72.67  41.18  53.61  58.18  74.81  80.86  56.25  67.81  67.74  72.19  72.81  54.07  68.64  62.96  66.27
M3    60.13  65.82  61.14  58.28  57.32  78.03  85.44  59.88  59.33  68.03  60.00  64.12  56.34  59.52  72.73  66.03
M4    68.59  69.28  52.87  40.35  56.90  72.54  78.36  63.16  56.60  62.57  63.41  69.84  64.85  66.28  60.12  62.66
M5    63.75  67.70  37.87  59.64  53.53  71.13  85.37  61.85  68.28  60.13  73.05  64.75  58.72  61.05  61.96  52.15
M6    59.01  66.25  60.67  50.89  54.76  71.43  81.87  58.72  60.26  65.56  69.19  66.42  58.55  63.64  62.94  67.08
Mean  63.4   68.9   51.9   52.4   56.5   73.5   82.0   60.9   60.2   64.8   67.1   66.7   58.9   62.4   64.1   63.5
Low rates, except for some of the speakers
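As a small worked check of the table, the sketch below recomputes the mean for speaker L6 from its six rows (M1 to M6) and ranks the speakers by their mean rate. The values are copied from the table; the variable names are illustrative only.

```python
# Mean row of the results table ("Media" on the slide), one value per speaker L0..L15.
mean_rates = [63.4, 68.9, 51.9, 52.4, 56.5, 73.5, 82.0, 60.9,
              60.2, 64.8, 67.1, 66.7, 58.9, 62.4, 64.1, 63.5]

# Rows M1..M6 for speaker L6, to check how the mean row is obtained.
l6_rates = [80.12, 80.86, 85.44, 78.36, 85.37, 81.87]
l6_mean = round(sum(l6_rates) / len(l6_rates), 1)

# Rank speakers from highest to lowest mean rate.
ranked = sorted(enumerate(mean_rates), key=lambda item: item[1], reverse=True)
top_five = [f"L{index}" for index, _ in ranked[:5]]

print(l6_mean)   # 82.0
print(top_five)  # ['L6', 'L5', 'L1', 'L10', 'L11']
```

The five best-ranked speakers are the same five that appear in the later table on the relevance of prosodic knowledge.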
Results: robustness
(Same table as in the Results slide above.)
No significant changes when a different test input is used
Results: relevance of prosodic knowledge.
         L1     L5     L6     L10    L11
Total    68.89  73.53  82.00  67.15  66.74
Initial  80.03  55.08  77.65  58.11  52.44
Central  65.02  72.95  80.86  65.22  69.01
Final    69.25  67.71  89.49  75.70  53.87
Some parts of the utterance are more relevant than others, depending on the speaker
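To make this observation concrete, the following sketch picks, for each speaker in the table, the part of the utterance with the highest rate and compares it with the rate over the whole utterance. The numbers are copied from the table ("Initial" stands for "Inicial" on the slide); the function name `best_part` is hypothetical.

```python
# Rates per speaker: whole utterance (Total) and its initial, central and final parts.
rates = {
    "L1":  {"Total": 68.89, "Initial": 80.03, "Central": 65.02, "Final": 69.25},
    "L5":  {"Total": 73.53, "Initial": 55.08, "Central": 72.95, "Final": 67.71},
    "L6":  {"Total": 82.00, "Initial": 77.65, "Central": 80.86, "Final": 89.49},
    "L10": {"Total": 67.15, "Initial": 58.11, "Central": 65.22, "Final": 75.70},
    "L11": {"Total": 66.74, "Initial": 52.44, "Central": 69.01, "Final": 53.87},
}


def best_part(speaker):
    """Return (part, gain) where gain is the best part's rate minus the Total rate."""
    parts = {name: value for name, value in rates[speaker].items() if name != "Total"}
    part = max(parts, key=parts.get)
    return part, round(parts[part] - rates[speaker]["Total"], 2)


for speaker in rates:
    print(speaker, *best_part(speaker))
```

For L1 the initial part gives the highest rate, for L6 and L10 the final part, and for L5 and L11 the central part; only for L5 does no single part exceed the rate obtained over the whole utterance.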
Conclusions and future work



Promising results: some speakers are recognised with high rates. Results are robust to changes in the session and in the microphones.
Future work: to test the benefits of including these results in an ASR system.
To explore the use of our methodology for modelling intonation in a more general way:
Making use of more classes of intonation.
Finding out which classes of intonation are more relevant for characterizing the speaker.
New corpora are welcome.
Stop the war
Thank you