University of Sheffield
M4 speech recognition
Martin Karafiát*, Steve Renals, Vincent Wan
M4 speech recognition
1
Status at last meeting
• No working speech recognition system
• Use HTK for speech recognition
– No large vocabulary decoder with HTK
• Use Ducoder instead of HVite
– Provided by TUM
• Use the SRI LM toolkit for language modelling
• Build initial models using SWITCHBOARD and ICSI meetings data
Software limitations
• Acoustic modelling
– HTK 3.2
• No efficient large vocabulary decoder
• Supports only bigram language models
• HVite, a time-synchronous decoder, is slow
• Decoding
– Ducoder
• Based on HTK version 2
• Only capable of decoding word-internal context-dependent triphone models
• A trade-off between cross-word triphone models and a trigram language model appears necessary!
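The trade-off arises because cross-word modelling enlarges the inventory of context-dependent units the decoder must search. A minimal Python sketch (toy phone strings and HTK-style left-centre+right unit names; this is illustrative, not the HTK implementation) contrasts the two inventories:

```python
def word_internal_triphones(words):
    """Context-dependent phone units whose context stops at word
    boundaries, in HTK-style left-centre+right notation."""
    units = []
    for phones in words:
        for i, p in enumerate(phones):
            name = p
            if i > 0:
                name = f"{phones[i - 1]}-{name}"
            if i < len(phones) - 1:
                name = f"{name}+{phones[i + 1]}"
            units.append(name)
    return units

def cross_word_triphones(words):
    """Context-dependent units whose context crosses word boundaries:
    treat the whole utterance as a single phone string."""
    return word_internal_triphones([[p for w in words for p in w]])

# Toy example: "sit down" as two phone strings.
words = [["s", "ih", "t"], ["d", "aw", "n"]]
print(word_internal_triphones(words))  # ['s+ih', 's-ih+t', 'ih-t', 'd+aw', 'd-aw+n', 'aw-n']
print(cross_word_triphones(words))     # ['s+ih', 's-ih+t', 'ih-t+d', 't-d+aw', 'd-aw+n', 'aw-n']
```

Units such as `ih-t+d` exist only in the cross-word inventory, so cross-word models grow the search space considerably; this is why Ducoder restricts first-pass decoding to word-internal models.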
System Architecture
• Front end
• N-best lattice generation: best-first decoding (Ducoder)
– Word-internal triphone models
– Trigram language model (SRILM)
• Lattice rescoring: time-synchronous decoding (HTK)
– Cross-word triphone models
• MLLR adaptation (HTK)
• Recognition output
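In the rescoring pass, each first-pass hypothesis is re-ranked by combining its acoustic log-likelihood with the trigram LM score, in the usual HTK style with a language model scale factor and word insertion penalty. A sketch with toy scores (the function names, scale value, and scores are illustrative, not taken from the actual system):

```python
def combined_score(acoustic, lm, n_words, lm_scale=12.0, word_penalty=0.0):
    """HTK-style total score: acoustic log-likelihood plus a scaled
    language model log-probability plus a per-word insertion penalty."""
    return acoustic + lm_scale * lm + word_penalty * n_words

def rescore_nbest(hypotheses, lm_logprob, lm_scale=12.0, word_penalty=0.0):
    """Pick the best hypothesis from a first-pass n-best list.

    hypotheses: list of (word_tuple, acoustic_logprob) pairs.
    lm_logprob: function mapping a word tuple to its LM log-probability.
    """
    return max(
        hypotheses,
        key=lambda h: combined_score(h[1], lm_logprob(h[0]), len(h[0]),
                                     lm_scale, word_penalty),
    )[0]

# Toy n-best list: the second hypothesis has a worse acoustic score but a
# better LM score, and wins once the LM is weighted in.
nbest = [(("the", "cat"), -100.0), (("a", "cat"), -102.0)]
toy_lm = {("the", "cat"): -3.0, ("a", "cat"): -2.0}
print(rescore_nbest(nbest, toy_lm.__getitem__))  # ('a', 'cat')
```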
System limitations
• N-best list rescoring is not optimal
• Many more hyper-parameters to tune manually
• Adaptation must be performed on two sets of acoustic models
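MLLR adaptation shifts the Gaussian means of an acoustic model with a shared affine transform, which is why it must be run once per model set here. A sketch of the application step only, in pure Python (estimating the transform from adaptation data via EM is the hard part and is omitted; `W` is assumed given):

```python
def mllr_adapt_mean(mean, W):
    """Apply an MLLR mean transform: mu' = W @ [1, mu_1, ..., mu_d].

    W is a d x (d+1) matrix (nested lists); the leading column is the
    bias. One W is shared across many Gaussians, so a small amount of
    adaptation data can shift every mean in the model set.
    """
    extended = [1.0] + list(mean)
    return [sum(w * x for w, x in zip(row, extended)) for row in W]

# Toy 2-D example: identity rotation plus a bias of (0.5, -0.5).
W = [[0.5, 1.0, 0.0],
     [-0.5, 0.0, 1.0]]
print(mllr_adapt_mean([1.0, 2.0], W))  # [1.5, 1.5]
```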
Method comparison
• Ducoder/HTK system compared to a pure HTK system:
– Pure HTK: 61.51% WER (cross-word triphones, bigram LM); decoding time: over a month
– Ducoder/HTK: 60.29% WER (word-internal triphones and trigram LM, rescored using cross-word triphones); decoding time: a few days
• Early comparison result obtained using Switchboard data
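The WER figures above count substitutions, insertions, and deletions against the reference transcript. A self-contained sketch of the standard word-level Levenshtein computation:

```python
def wer(reference, hypothesis):
    """Word error rate in percent: the minimum number of substitutions,
    insertions, and deletions needed to turn the hypothesis into the
    reference, divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between first i ref words and first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match / substitution
    return 100.0 * d[-1][-1] / len(ref)

print(wer("the cat sat", "the cat sat"))              # 0.0
print(round(wer("the cat sat", "a cat sat down"), 2)) # 66.67 (1 sub + 1 ins over 3 words)
```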
Current recognisers
• SWITCHBOARD recogniser
– Acoustic & language models trained on 200 hours of speech
• ICSI meetings recogniser
– Acoustic models trained on 40 hours of speech
– Language model is a combination of SWB and ICSI
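The slide does not say how the SWB and ICSI language models are combined; linear interpolation is the standard method (the SRILM toolkit supports static mixing of this kind), sketched here over toy next-word distributions:

```python
def interpolate_lm(p_a, p_b, lam):
    """Linear interpolation of two next-word distributions (dicts mapping
    word -> probability): P(w|h) = lam * Pa(w|h) + (1 - lam) * Pb(w|h)."""
    vocab = set(p_a) | set(p_b)
    return {w: lam * p_a.get(w, 0.0) + (1.0 - lam) * p_b.get(w, 0.0)
            for w in vocab}

# Toy distributions for one history; lam would be tuned on held-out
# meeting data.
p_swb = {"yes": 0.6, "no": 0.4}
p_icsi = {"no": 0.5, "ok": 0.5}
mixed = interpolate_lm(p_swb, p_icsi, 0.5)
print(mixed["no"])  # 0.45
```

Because each input distribution sums to one, any convex combination does too, so the mixture is still a valid language model.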
Results

Recogniser                                      SWITCHBOARD   ICSI meetings
Test on same data, without adaptation           55.05% WER    52.34% WER
Test on same data, with speaker adaptation      N/A           49.27% WER
Test on M4 data, without adaptation             88.14% WER    74.00% WER
Test on M4 data, with unsupervised adaptation   84.67% WER    N/A
Areas for immediate improvement
• Increase the number of lattices in the n-best list
• Word-internal triphone models need adaptation
• Fully supervised, speaker-dependent adaptation
• Vocal tract length normalisation
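Vocal tract length normalisation compensates for speaker-specific vocal tract length by warping the frequency axis before feature extraction. A sketch of the common piecewise-linear warp (the cut-off and Nyquist values assume 16 kHz audio and are illustrative):

```python
def vtln_warp(freq, alpha, f_cut=4800.0, f_nyq=8000.0):
    """Piecewise-linear vocal tract length normalisation warp.

    Frequencies below f_cut are scaled by the speaker-specific factor
    alpha (typically in the range 0.88..1.12); above f_cut a second
    linear segment maps the Nyquist frequency onto itself, so the
    warped axis still spans the full band.
    """
    if freq <= f_cut:
        return alpha * freq
    slope = (f_nyq - alpha * f_cut) / (f_nyq - f_cut)
    return alpha * f_cut + slope * (freq - f_cut)

print(vtln_warp(1000.0, 1.1))         # 1100.0 (below cut-off: plain scaling)
print(round(vtln_warp(8000.0, 0.9)))  # 8000 (Nyquist maps to itself)
```

In practice alpha is chosen per speaker by a grid search that maximises the likelihood of the warped features under the current acoustic models.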
Coming soon
• Release current models and scripts
• Merged Switchboard and ICSI acoustic models
• ASR transcriptions of the M4 meetings