BUT SWS 2013 - Massive parallel approach
Brno University of Technology
Faculty of Information Technology
Speech@FIT
Igor Szöke, Lukáš Burget, František Grézl,
Lucas Ondel
MediaEval SWS 2013 workshop, October 18.-19. 2013, Barcelona
Outline
• Systems overview & underlying technologies
• AKWS
• DTW
• Calibration
• Fusion
• Results and discussion
System overview
• Our internal task was:
To reuse as many atomic systems as we have and fuse them at the detection level.
• We ended up with:
13 atomic systems, 26 QbE subsystems, 19 languages (16 unique).
A zero-resourced system.
• Ingredients:
Phoneme recognizer, acoustic keyword spotting (AKWS), DTW, calibration, fusion
System overview
Igor’s Greeting
Subsystem
• Sentence mean normalization
• Neural network based features
• three state phone posteriors
• Query detector
• AKWS
• DTW
System                  NN type   Posteriors
SpeechDat CZ            LCRC      O 129
SpeechDat HU            LCRC      O 177
SpeechDat RU            LCRC      O 150
BABEL CA, PA, TA, TU    St. BN    A (1045) 660
SWS 2012 4lang.         St. BN    O 150
GlobalPhone CZ          St. BN    A 120
GlobalPhone EN          St. BN    A 120
GlobalPhone GE          St. BN    A 126
GlobalPhone PO          St. BN    A 102
GlobalPhone RU          St. BN    A 156
GlobalPhone SP          St. BN    A 102
GlobalPhone TU          St. BN    A 90
GlobalPhone VI          St. BN    A 102
Atomic system
• Adaptation on target data (GP and BABEL NNs)
• The original NN was used to label the target data (state level)
• Then a universal-context, bottleneck neural network based classifier was trained
• LCRC and SWS2012 were used without any adaptation
AKWS QbE subsystem
• Query -> example-to-text using the phoneme recognizer
• Omit initial and final silence
• Omit queries with fewer than 3 non-silence phonemes
• No LM constraints
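The query filtering rules above can be sketched as follows. This is a minimal illustration, not the authors' code: the silence label `sil` and the function name `prepare_query` are hypothetical, and the input is assumed to be the phoneme string produced by the recognizer for one spoken example.

```python
from typing import List, Optional

SILENCE = "sil"  # hypothetical silence label; the real phoneme set is not given in the slides


def prepare_query(phonemes: List[str], min_phonemes: int = 3) -> Optional[List[str]]:
    """Trim initial/final silence; return None if the query is too short to use."""
    start, end = 0, len(phonemes)
    while start < end and phonemes[start] == SILENCE:
        start += 1                      # drop leading silence
    while end > start and phonemes[end - 1] == SILENCE:
        end -= 1                        # drop trailing silence
    trimmed = phonemes[start:end]
    # Count only non-silence phonemes (silence inside the query is kept but not counted).
    n_speech = sum(1 for p in trimmed if p != SILENCE)
    if n_speech < min_phonemes:
        return None                     # query omitted from the AKWS subsystem
    return trimmed
```

A query rejected here would simply produce no AKWS detections; how the fusion handles that case is not shown in this sketch.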
DTW QbE subsystem
• Segmental DTW (the query can start in any frame of the utterance)
• Log dot product over phoneme state posteriors
• Path cost: 1, 1, 1
• On-line normalization of the path
• While filling a cell in the distance matrix, the value already takes the length of the previous path into account
• We added VAD as a late submission -> really huge impact
• Initial and final silence frames were removed from the examples
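A minimal sketch of the DTW search described above, under stated assumptions: the local distance is taken to be the negative log of the dot product of two posterior vectors, the three allowed steps (vertical, horizontal, diagonal) all have weight 1, and on-line normalization is implemented by choosing, for each cell, the predecessor that gives the lowest length-averaged cost. The function name `segmental_dtw` and the `eps` floor are hypothetical; VAD and silence trimming are assumed to have been applied beforehand.

```python
import numpy as np


def segmental_dtw(query: np.ndarray, utterance: np.ndarray, eps: float = 1e-10) -> np.ndarray:
    """Return a length-normalized cost for a query match ending at each utterance frame.

    query: (n, d) posteriors, utterance: (m, d) posteriors. Lower cost is better.
    """
    # Local distance: -log of the dot product between posterior vectors (assumption).
    dist = -np.log(np.maximum(query @ utterance.T, eps))   # shape (n, m)
    n, m = dist.shape
    cost = np.full((n, m), np.inf)       # accumulated (unnormalized) path cost
    length = np.zeros((n, m), dtype=int)  # number of cells on the best path so far

    # The query may start in any utterance frame: every cell of row 0 opens a new path.
    cost[0, :] = dist[0, :]
    length[0, :] = 1

    for i in range(1, n):
        for j in range(m):
            preds = [(i - 1, j)]                       # vertical step
            if j > 0:
                preds += [(i, j - 1), (i - 1, j - 1)]  # horizontal and diagonal steps
            best_avg = np.inf
            for pi, pj in preds:
                c = cost[pi, pj] + dist[i, j]          # all step weights are 1
                l = length[pi, pj] + 1
                if c / l < best_avg:                   # compare length-normalized costs
                    best_avg, cost[i, j], length[i, j] = c / l, c, l

    # Normalized cost of the best path ending in each frame of the utterance.
    return cost[-1] / length[-1]
```

Choosing predecessors by normalized cost is a greedy approximation; it is one plausible reading of "the value already considers the length of the previous path", not a confirmed reproduction of the submitted system.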
Calibration
• Really important!
• No-norm, z-norm, z-norm_sideinfo, m-norm (the best)
• Experiments with adding side information [log(#term_occ), #phn, log(#nonsilence frames)]
• A linear model was trained (using logistic regression)
• Good improvement
• M-norm: find the peak in the histogram of term scores
• Calculate the variance of the data in <peak, +inf>
• Apply variance normalization to the whole data set
• Subtract the peak (shift the peak to 0)
• Even better than z-norm
• Side information did not help (which suggests m-norm is already calibrated enough)
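The z-norm and m-norm steps listed above can be sketched per term as follows. Several details are assumptions, since the slides do not specify them: the number of histogram bins, taking the peak as the center of the fullest bin, and measuring the spread of the scores in <peak, +inf> around the peak itself rather than around their own mean. The function names are hypothetical.

```python
import numpy as np


def z_norm(scores):
    """Classic z-norm: zero mean, unit variance over one term's detection scores."""
    scores = np.asarray(scores, dtype=float)
    std = scores.std()
    return (scores - scores.mean()) / (std if std > 0 else 1.0)


def m_norm(scores, bins: int = 50):
    """Mode-based normalization: shift the histogram peak to 0, scale by the right-side spread."""
    scores = np.asarray(scores, dtype=float)
    counts, edges = np.histogram(scores, bins=bins)
    k = int(np.argmax(counts))
    peak = 0.5 * (edges[k] + edges[k + 1])        # center of the fullest bin (assumption)
    right = scores[scores >= peak]                # data in <peak, +inf>
    # Spread of the right-hand side, measured around the peak (assumption).
    sigma = np.sqrt(np.mean((right - peak) ** 2)) if right.size else 0.0
    if sigma == 0:
        sigma = 1.0                               # degenerate case: leave the scale unchanged
    # Variance normalization and peak subtraction combined into one affine map.
    return (scores - peak) / sigma
```

The intuition behind using only the right-hand side is that the bulk of the scores around the peak comes from non-target detections, while the left tail may be truncated by pruning; the slides themselves do not state this motivation.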
[Figure: DTW and AKWS scores (Orig, Z-norm, M-norm)]
Calibration
1 AKWS subsystem    MTWV (UBTWV)
orig                0.0000 (0.1012)
z-norm              0.0330 (0.1434)
z-norm_side         0.0603 (0.1436)
m-norm              0.0769 (0.1611)
Fusion
• Linear combination of subsystems (and one bias)
• Trained to minimize cross entropy (binary logistic regression)
• Detections are clustered
• A subsystem that produces no score at a given time gets a default score
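A minimal sketch of such a fusion, assuming detections have already been clustered so that each row of the score matrix is one candidate detection and each column one subsystem. Missing scores are marked as NaN and replaced by a per-subsystem default; using the lowest observed score as that default is an assumption, as are the function names, the learning rate, and the plain gradient-descent trainer.

```python
import numpy as np


def fill_missing(scores: np.ndarray, default: np.ndarray) -> np.ndarray:
    """Replace NaN (subsystem gave no detection here) by that subsystem's default score."""
    return np.where(np.isnan(scores), default, scores)


def train_fusion(X: np.ndarray, y: np.ndarray, lr: float = 0.1, epochs: int = 2000):
    """Fit weights and one bias by minimizing cross entropy (binary logistic regression)."""
    n, k = X.shape
    w, b = np.zeros(k), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # posterior of "true detection"
        grad = p - y                              # gradient of cross entropy w.r.t. the logit
        w -= lr * (X.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def fuse(X: np.ndarray, w: np.ndarray, b: float) -> np.ndarray:
    """Fused detection score: linear combination of subsystem scores plus bias."""
    return X @ w + b
```

Usage would follow the pattern `X = fill_missing(raw, np.nanmin(raw, axis=0))`, then `w, b = train_fusion(X_dev, y_dev)` on development data and `fuse(X_eval, w, b)` on evaluation data.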
Fusion
Results
• MTWV (UBTWV)
• UBTWV: non-pooled TWV with ideal (oracle) calibration
• DTW is superior to AKWS… but the speed…
• There are still some gaps in calibration (the difference between DEV and EVAL TWV)
• Unsupervised NN adaptation helped
1 AKWS subsystem: 0.0443 (0.1154) -> 0.0769 (0.1630)
• m-norm!
• Many directions for further research
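To make the two reported numbers concrete, here is a sketch of how MTWV and UBTWV relate, assuming the standard term-weighted value TWV = 1 - (P_miss + beta * P_FA) averaged over terms. The slides do not give the value of beta or the trial counts, so both are plain parameters here; the data layout (one dict per term) and the function names are hypothetical.

```python
import numpy as np


def term_twv(scores, labels, n_true, n_nontarget, threshold, beta):
    """TWV of a single term at one decision threshold."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    accepted = scores >= threshold
    p_miss = 1.0 - np.sum(accepted & labels) / n_true
    p_fa = np.sum(accepted & ~labels) / n_nontarget
    return 1.0 - (p_miss + beta * p_fa)


def mtwv_and_ubtwv(terms, beta):
    """MTWV: best single threshold shared by all terms. UBTWV: best threshold per term."""
    thresholds = np.unique(np.concatenate([np.asarray(t["scores"], dtype=float) for t in terms]))
    thresholds = np.append(thresholds, np.inf)   # "reject everything" gives TWV = 0
    table = np.array([
        [term_twv(t["scores"], t["labels"], t["n_true"], t["n_nontarget"], th, beta)
         for th in thresholds]
        for t in terms
    ])                                           # shape (n_terms, n_thresholds)
    mtwv = table.mean(axis=0).max()              # pooled: one threshold for all terms
    ubtwv = table.max(axis=1).mean()             # non-pooled: oracle threshold per term
    return mtwv, ubtwv
```

Because UBTWV picks the threshold separately for every term, it can never be lower than MTWV; the gap between the two indicates how much a better score calibration could still recover.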