Speech recognition in MUMIS


Mirjam Wester, Judith Kessens & Helmer Strik

Intro
• Objective: automatic speech recognition of football commentaries
• SPEX transcribed two matches in two languages (Dutch and English):
  – England - Germany (Eng-Dld)
  – Yugoslavia - The Netherlands (Yug-Ned)
• Commentaries and stadium noise are mixed

Data Conversion
• SPEX transcription:
  – TextGrid:
    • orthographic transcription
    • chunk alignment; a chunk is a segment of speech of about 2 to 3 seconds
  – CD with one large wav file
• The wav file is split according to the chunk alignments
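The splitting step above can be sketched in a few lines. This is a minimal illustration, not the project's actual tooling: it assumes the chunk alignments are already available as (start, end) pairs in seconds, and the function and file names are hypothetical.

```python
import wave

def split_wav(wav_path, chunks, out_prefix):
    """Cut one large wav file into per-chunk files given (start_s, end_s) pairs."""
    with wave.open(wav_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        for i, (start_s, end_s) in enumerate(chunks):
            src.setpos(int(start_s * rate))                      # jump to chunk start
            frames = src.readframes(int((end_s - start_s) * rate))
            with wave.open(f"{out_prefix}_{i:04d}.wav", "wb") as dst:
                dst.setparams(params)                            # same format as source
                dst.writeframes(frames)
```

In practice the (start, end) pairs would be read from the TextGrid chunk tier before calling the splitter.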

Examples of data
• Yug-Ned Dutch
• Yug-Ned English
• Eng-Dld Dutch
• Eng-Dld English

Statistics

                   Dutch   English
  #chunks           5146      5613
  #speech chunks    3006      3725
  #empty chunks     2140      1843
  #words (types)    1954      2923
  #words (tokens)  12079     24022

English matches have two commentators, Dutch only one.

Overlapping segments have been disregarded.
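The statistics distinguish word types (distinct word forms) from word tokens (running words). As a minimal sketch of that count, assuming whitespace-tokenised transcript lines (the function name is hypothetical):

```python
def type_token_counts(transcripts):
    """Count distinct word forms (types) and running words (tokens)."""
    tokens = [w for line in transcripts for w in line.split()]
    return len(set(tokens)), len(tokens)
```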

Training
Dutch:
• Yug-Ned, ¾ of CD (19 min speech)
• France Telecom Noise Reduction (FTNR)
English:
• Yug-Ned, ¾ of CD (28 min speech)
• FTNR

For more information on the France Telecom Noise Reduction tool, see: B. Noé, J. Sienel, D. Jouvet, L. Mauuary, L. Boves, J. de Veth & F. de Wet, "Noise Reduction for Noise Robust Feature Extraction for Distributed Speech Recognition", in Proc. of Eurospeech '01.

Test
Dutch:
• Yug-Ned, ¼ of CD
  – 626 chunks, 1577 words
  – lexicon and language model based on the complete Yug-Ned match
English:
• Yug-Ned, ¼ of CD
  – 636 chunks, 2641 words
  – lexicon and language model based on the complete Yug-Ned match
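The language model estimated from the match transcripts could, in its simplest form, be a bigram model. A minimal maximum-likelihood sketch, assuming whitespace-tokenised transcript lines (the slides do not specify the actual model order or toolkit):

```python
from collections import Counter

def bigram_probs(transcripts):
    """Maximum-likelihood bigram probabilities P(w2 | w1) from transcript lines."""
    unigrams, bigrams = Counter(), Counter()
    for line in transcripts:
        words = ["<s>"] + line.split() + ["</s>"]   # sentence boundary markers
        unigrams.update(words[:-1])                 # history counts
        bigrams.update(zip(words[:-1], words[1:]))  # bigram counts
    return {bg: c / unigrams[bg[0]] for bg, c in bigrams.items()}
```

A real recogniser would additionally smooth these estimates, since a model trained on a single match leaves many bigrams unseen.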

SNR before and after FTNR tool
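The slides compare SNR before and after noise reduction. The exact SNR1 measure used in the project is not specified here; as a rough illustrative sketch, one can treat the quietest frames of a recording as the noise floor (all names and parameters below are hypothetical):

```python
import math

def estimate_snr_db(samples, frame_len=200, noise_fraction=0.1):
    """Rough SNR estimate: treat the lowest-energy frames as the noise floor."""
    n_frames = len(samples) // frame_len
    energies = sorted(
        sum(s * s for s in samples[i * frame_len:(i + 1) * frame_len]) / frame_len
        for i in range(n_frames)
    )
    k = max(1, int(n_frames * noise_fraction))
    noise = sum(energies[:k]) / k       # mean energy of the quietest frames
    signal = sum(energies) / n_frames   # overall mean frame energy
    return 10.0 * math.log10(signal / max(noise, 1e-12))
```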

WER results for Yug-Ned before and after FTNR

[Chart: WER (%) on the vertical axis (40–60) for acoustic models trained on NL-original, NL-FTNR, Eng-original and Eng-FTNR material]
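The results throughout are reported as word error rate. WER is the standard Levenshtein-based metric over words; a minimal sketch:

```python
def wer(ref, hyp):
    """Word error rate: (substitutions + insertions + deletions) / #reference words."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))          # edit distance DP, row by row
    for i, rw in enumerate(r, 1):
        cur = [i] + [0] * len(h)
        for j, hw in enumerate(h, 1):
            cur[j] = min(prev[j] + 1,                 # deletion
                         cur[j - 1] + 1,              # insertion
                         prev[j - 1] + (rw != hw))    # substitution or match
        prev = cur
    return prev[len(h)] / max(len(r), 1)
```

Note that WER can exceed 100% when the hypothesis contains many insertions.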

Dutch – Polyphone
• Data consists of phonetically rich sentences
• Phone models were trained on:
  – Polyphone, all speakers
  – Polyphone, male speakers
  – Polyphone, male speakers + MUMIS noise
• Polyphone was used as a bootstrap for segmentation of the MUMIS material

Polyphone models (Dutch) Yug-Ned test set

[Chart: WER (%) on the vertical axis (45–95); x-axis: training material acoustic models (Poly-all, Poly-male, Poly-male+noise, Poly-seg., MUMIS)]

Cross-tests (Dutch & English)
• train on ¾ Yug-Ned, test on ¼ Eng-Dld
• train on ¾ Eng-Dld, test on ¼ Yug-Ned

MUMIS models (Dutch) Yug-Ned test Eng-Dld test

[Chart: WER (%) on the vertical axis (45–70); x-axis: training material acoustic models (Yug-Ned, Eng-Dld-cross, Eng-Dld, Yug-Ned-cross)]

MUMIS models (English) Yug-Ned test Eng-Dld test

[Chart: WER (%) on the vertical axis (45–70); x-axis: training material acoustic models (Yug-Ned, Eng-Dld-cross, Eng-Dld, Yug-Ned-cross)]

MUMIS models (Dutch+English) Yug-Ned test Eng-Dld test

[Chart: WER (%) on the vertical axis (45–70); series NL and ENG; x-axis: training material acoustic models (Yug-Ned, Eng-Dld-cross, Eng-Dld, Yug-Ned-cross)]

Function words vs content words

[Chart: error rate (%) on the vertical axis (0–80) by word type (function, content, names, all) for Yug-Ned and Eng-Dld, shown separately for the English data and the Dutch data]

SNR vs. WER (1)
[Chart: Dutch data: WER (%) on the vertical axis (0–90) against SNR1 (dB, 0–30) for YugNed, YugNed_ftnr and EngDld]

SNR vs. WER (2)
[Chart: English data: WER (%) on the vertical axis (0–90) against SNR1 (dB, 0–40) for YugNed, YugNed_ftnr and EngDld]

Discussion
• WERs are high
• Noise?
  – FTNR leads to a lower SNR, but WERs do not improve substantially
• Not enough training data?
  – Polyphone for training/bootstrapping does not lead to lower WERs than training on MUMIS data
  – Noisifying Polyphone with MUMIS noise gives encouraging results

Discussion continued
• Function words comprise ±50% of the data and cause a great deal of the errors
• Names are recognized very well
• Function words are not necessary for information extraction (?)
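If function words are indeed unnecessary for information extraction, a simple post-filter over the recogniser output could drop them. A minimal sketch, using an illustrative hand-picked stop list (not the project's actual word classification):

```python
# Illustrative stop list; the project's actual function-word inventory is not given here.
FUNCTION_WORDS = {"the", "a", "an", "and", "of", "to", "is", "it", "in", "on"}

def content_words(transcript):
    """Keep only likely content words (e.g. names) for information extraction."""
    return [w for w in transcript.lower().split() if w not in FUNCTION_WORDS]
```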

Future work
• Steps towards noise-robust speech recognition:
  – model/speaker adaptation
  – combinations of noisified Polyphone models and FTNR
• Other issues:
  – transcription of more data
    • English, Dutch and German
    • preference for specific games? radio? TV?
  – generic football-specific language model
  – confidence measures?

Future work continued
Questions:
• What type of output from ASR is needed?
  – word graph
  – n-best list
  – top of the list
  – word spotting? only content words?
• For research purposes: is it possible to obtain data in which noise and commentary have not been mixed?