Speech recognition in MUMIS


Speech recognition in MUMIS
Judith Kessens, Mirjam Wester
& Helmer Strik
Manual transcriptions
• Transcriptions made by SPEX:
– orthographic transcriptions
– transcriptions at chunk level (2-3 sec.)
• Formats:
– *.TextGrid → Praat
– XML derivatives:
• *.pri – no time information
• *.skp – time information
Manual transcriptions
Total number of transcribed matches on the FTP site
(including the demo matches):
• Dutch: 6 matches
• German: 21 matches
• English: 3 matches
Extensions:
Dutch (_N), German (_G), English (_E)
Automatic speech recognition
1. Acoustic preprocessing
• Acoustic signal → features
2. Speech recognition
• Acoustic models
• Language models
• Lexicon
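The acoustic preprocessing step (signal → features) can be illustrated with a minimal sketch. Real front ends compute richer features such as MFCCs, and the frame sizes below (25 ms frames with a 10 ms hop at 16 kHz) are assumptions for illustration, not the project's actual settings:

```python
import math

def frame_log_energy(samples, frame_len=400, hop=160):
    """Toy acoustic preprocessing: slice the signal into
    overlapping frames (assumed 25 ms / 10 ms at 16 kHz) and
    compute one log-energy feature per frame.  Real front ends
    emit richer features, but the framing step is the same."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame)
        # Small floor avoids log(0) on silent frames
        feats.append(math.log(energy + 1e-10))
    return feats
```

The resulting feature sequence is what the acoustic models are trained on in the recognition step.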
Automatic transcriptions
• Problem with the recorded data:
commentaries and stadium noise are mixed
→ very high noise levels
→ recognition of such extremely noisy data is
very difficult
Examples of data
Yug-Ned match
• Dutch
“op _t ogenblik wordt in dit stadion de opstelling voorgelezen”
(“at the moment the line-up is being read out in this stadium”)
• English
“and they wanna make the change before the corner”
• German
“und die beiden Tore die die Hollaender bekommen hat haben”
(“and the two goals that the Dutch have conceded”; disfluent)
Examples of data
Eng-Dld match
• Dutch
“geeft nu een vrije trap in _t voordeel van Ince”
(“now gives a free kick in favour of Ince”)
• English
“and phil neville had to really make about three yards to stop
<dreisler*u> pulling it down and playing it”
• German
“wurde von allen englischen Zeitungen aus der Mannschaft”
(“was ... out of the team by all English newspapers”; fragment)
Evaluation of automatic transcriptions

WER (%) = 100 × (insertions + deletions + substitutions) / (number of words)

→ WER can be larger than 100%!
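The WER is the word-level edit distance between the reference and the hypothesis, normalized by the reference length; a minimal sketch:

```python
def wer(reference, hypothesis):
    """Word error rate via Levenshtein distance over word lists.

    WER = (insertions + deletions + substitutions) / number of
    reference words, so it can exceed 100% when the hypothesis
    contains many insertions."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)
```

For example, a one-word reference recognized as three words scores a WER of 200%, which is how the rates above can exceed 100%.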
WERs (all words)

            Yug-Ned   Eng-Dld
Dutch         84.5      83.2
English       84.5      83.3
German        77.4      90.8
WERs (player names)

            Yug-Ned          Eng-Dld
            all    names     all    names
Dutch       84.5   53.0      83.2   55.0
English     84.5   48.2      83.3   56.2
German      77.4   40.9      90.8   77.4
WERs versus SNR

            Yug-Ned           Eng-Dld
            WER    SNR (dB)   WER    SNR (dB)
Dutch       84.5    9         83.2    8
English     84.5   12         83.3   11
German      77.4   19         90.8    7
Automatic transcriptions
The language model (LM) and lexicon (lex)
are adapted to a specific match:
• Start with a general LM and lexicon
• Add the player names of the specific match
• Expand the general LM and lexicon when more
data becomes available
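The adaptation steps can be sketched for a unigram LM. This is a simplification: the real system adapts full n-gram models, and `name_mass` is an assumed tuning constant, not a value from the experiments:

```python
def adapt_lexicon_and_lm(lexicon, unigram_lm, player_names,
                         name_mass=0.01):
    """Match-specific adaptation sketch: add the player names of
    the target match to the lexicon and give them a small shared
    probability mass in a unigram LM, rescaling the existing
    probabilities so the distribution still sums to 1."""
    new_lexicon = set(lexicon) | set(player_names)
    scale = 1.0 - name_mass
    adapted = {w: p * scale for w, p in unigram_lm.items()}
    per_name = name_mass / len(player_names)
    for name in player_names:
        adapted[name] = adapted.get(name, 0.0) + per_name
    return new_lexicon, adapted
```

With the names in the lexicon and LM, the recognizer can at least hypothesize them, which is reflected in the lower name WERs in the table above.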
WERs for various amounts of data

[Figure: WER (%) against the number of words used to train the
language model (0 to 250,000). Curves: Yug-Ned (Dutch, lex: 1 CD),
Eng-Dld (Dutch, lex: 1 CD), Yug-Ned (German, lex: 1 CD, 7 CDs and
19 CDs) and Eng-Dld (German, lex: 7 CDs). WERs lie roughly between
76% and 96%.]
Oracle experiments - ICSLP'02
Due to the limited amount of material, we started
off with oracle experiments:
• Language models are trained on target match
• Acoustic models are trained on part of target
match or other match
→ Much lower WERs
Summary of results
Acoustic model training:
• Leaving out non-speech chunks does not hurt
recognition performance
• Using more training data is beneficial, but
more important:
• The SNRs of the training and test data should
be matched
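Matching SNRs presupposes being able to estimate them; a toy estimate from a speech segment and a noise-only segment (e.g. stadium noise between commentaries). This is illustrative only, not the measurement procedure used in the experiments:

```python
import math

def snr_db(speech, noise):
    """Estimate the signal-to-noise ratio in dB from average
    power in a speech segment versus a noise-only segment.
    Assumes both are sample lists recorded at the same level."""
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    return 10.0 * math.log10(p_speech / p_noise)
```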
Summary of results

• WERs are SNR-dependent

[Figure: WER (%) from 0 to 100 against SNR (dB) from 0 to 20,
with curves for Dutch, English and German; tested on the
Yug-Ned match.]
Summary of results
Split the words into categories, e.g. function words, content
words and football players' names:
WER function words > WER content words > WER names
(tested on Yug-Ned match)
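The category split can be sketched as a simple word bucketer; the word lists in the example below are assumptions, since the exact category definitions are not given here:

```python
def categorize(word, player_names, function_words):
    """Bucket a word for per-category WER scoring: names first,
    then function words; everything else counts as content."""
    if word in player_names:
        return "name"
    if word in function_words:
        return "function"
    return "content"
```

Scoring each bucket separately yields the per-category WERs compared above.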
Summary of results
• The noise reduction tool (FTNR) gives a small improvement

[Figure: bar chart of WERs (0 to 75%) with and without FTNR,
for NL, Eng and Dld.]
Ongoing work
Techniques to lower WERs
• Tuning of the generic language model
– Defining different classes
– Reduction of OOV words in lexicon and in the
language model (using more material)
• Speaker Adaptation in HTK
(note: all other experiments are being carried out
using Phicos)
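OOV reduction can be monitored with a simple coverage measure; a sketch using a hypothetical helper, not the project's tooling:

```python
def oov_rate(words, lexicon):
    """Percentage of running words that fall outside the
    recognition lexicon.  Expanding the lexicon and LM with
    more material should drive this rate down."""
    oov = sum(1 for w in words if w not in lexicon)
    return 100.0 * oov / len(words)
```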
Ongoing work
Noise robustness
• Extension of the acoustic models by using
double deltas.
• Histogram Normalization and FTNR.
• SNR dependent acoustic models.
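The double deltas mentioned above are second-order time derivatives of the feature sequence; a minimal one-dimensional sketch using adjacent differences (HTK-style deltas use a regression window over several frames instead):

```python
def deltas(feats):
    """First-order differences of a per-frame feature sequence;
    applying the function twice yields the double deltas.
    Simple adjacent differences are used for illustration."""
    return [b - a for a, b in zip(feats, feats[1:])]
```

Appending deltas and double deltas to the static features is the model extension listed above.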
Recommendations
Acoustic modeling
• Record commentaries and stadium noise separately
• Speaker adaptation:
- Transcribe characteristics of commentator
- Collect more speech data of commentator
Recommendations
Lexicon and language modeling
• Collect orthographic transcriptions of spoken
material, instead of written material
- Subtitles
- Closed captions