Broadcast News segmentation using Metadata and Speech

Broadcast News Segmentation using Metadata and Speech-To-Text Information
to Improve Speech Recognition
Sebastien Coquoz
Swiss Federal Institute of Technology (EPFL)
International Computer Science Institute (ICSI)
March 16, 2004
1
Outline
• General idea
• ASR system used
• Exploratory work
• Strategies
• Results
• Conclusion
2
General idea
Use Metadata (SUs) and Speech-To-Text (STT) information to
improve later STT passes (feedback loop)
3
Why segmentation?
Why segment the audio stream?
• Important to give « linguistically coherent » pieces to
the language model
• Remove « non-speech » (e.g. long silences, laughter,
music, other noises, …)
Why use MDE?
• MDE gives information about sentence and speaker
breaks
• Speaker labels improve the efficiency of the acoustic
model and sentences improve the efficiency of the
language model
• BBN’s error analysis of Broadcast News recognition
revealed a higher error rate at segment boundaries →
this may be caused by missing the true
sentence boundaries
4
Metadata and STT information
MDE object used:
Sentence-like units (SUs): each SU expresses a thought or
idea and generally corresponds to a sentence. Each SU has a
confidence measure, timing information (starting point and
duration) and a cluster label.
STT object used:
Lexemes: the words that were assumed to have
been uttered. Each word has timing information (beginning
and duration).
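As a minimal sketch of the two objects above, the following Python data classes hold the fields the slide lists. The class names, field names and the helper `words_in_su` are illustrative assumptions, not the format of any particular MDE or STT output.

```python
from dataclasses import dataclass


@dataclass
class SU:
    """Sentence-like unit from metadata extraction (hypothetical layout)."""
    start: float       # starting point, in seconds
    duration: float    # length, in seconds
    confidence: float  # confidence measure of the SU
    cluster: str       # cluster label (assumed here to name a speaker cluster)

    @property
    def end(self) -> float:
        return self.start + self.duration


@dataclass
class Lexeme:
    """Recognized word from the STT output (hypothetical layout)."""
    word: str
    start: float     # beginning of the word, in seconds
    duration: float  # length of the word, in seconds

    @property
    def end(self) -> float:
        return self.start + self.duration


def words_in_su(su: SU, lexemes: list[Lexeme]) -> list[Lexeme]:
    """Return the lexemes whose midpoint falls inside the SU."""
    return [w for w in lexemes
            if su.start <= w.start + w.duration / 2 < su.end]
```

Assigning words by midpoint is one possible convention; it keeps a word that straddles an SU boundary from being counted twice.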
5
ASR system used
The system used is a simplified SRI BN evaluation system.
Recognition steps:
1. Segment the waveforms
2. Cluster the segments into « pseudo-speakers »
3. Compute and normalize features (Mel cepstrum)
4. Do first pass recognition with non-crossword acoustic models and bigram language model
5. Generate lattices
6. Expand lattices using 5-gram language model
7. Adapt acoustic models for each « pseudo-speaker »
8. Generate new lattices using the adapted acoustic models
9. Expand new lattices using 5-gram language model
10. Score the resulting hypotheses
6
Types of segmentation
Baseline vs. MDE-based segmentation
Baseline
• Classifies frames into « speech » and « non-speech »
using a 2-state HMM
• Uses inter-word silences and speaker turns to
segment the BN shows
MDE-based
• Uses sentence and speaker breaks to define an initial
segmentation
• Further processes the segments using different
strategies presented later
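The baseline's speech/non-speech frame classification can be illustrated with Viterbi decoding over a two-state HMM. This is a sketch only: the scalar feature, the Gaussian emission parameters (`means`, `std`) and the self-transition probability `p_stay` are made-up assumptions, not the models of the SRI system.

```python
import math


def viterbi_speech_nonspeech(frames, means=(0.0, 1.0), std=0.5, p_stay=0.9):
    """Label each frame 0 (non-speech) or 1 (speech) with a 2-state HMM.

    frames: one scalar feature per frame (e.g. a normalized log energy).
    means, std, p_stay: hypothetical emission and transition parameters.
    """
    if not frames:
        return []

    def log_emit(x, state):
        # Gaussian log-density, up to a constant shared by both states.
        return -((x - means[state]) ** 2) / (2 * std ** 2)

    log_trans = [[math.log(p_stay), math.log(1 - p_stay)],
                 [math.log(1 - p_stay), math.log(p_stay)]]

    # Uniform initial state distribution.
    score = [log_emit(frames[0], s) for s in (0, 1)]
    backptr = []
    for x in frames[1:]:
        new_score, ptr = [], []
        for s in (0, 1):
            best_prev = max((0, 1), key=lambda p: score[p] + log_trans[p][s])
            ptr.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + log_emit(x, s))
        score = new_score
        backptr.append(ptr)

    # Trace back the best state sequence from the best final state.
    state = max((0, 1), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    return path[::-1]
```

The transition penalty is what smooths the labels: a single outlier frame is usually not enough to pay for switching state twice, so isolated glitches do not create spurious segment boundaries.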
7
Baseline experiments
Comments:
• The baseline segmentation is the one presented
above
• The results (shown later) obtained are:
   • the current best results
   • the baselines that ultimately have to be improved
• No additional processing step is applied to modify
the segments
8
« Cheating » experiments (1)
Why?
• See if there is room for improvement when using
MDE-based segmentation
How?
• Use transcripts written by humans to segment the
Broadcast News audio stream and apply processing
strategies to improve recognition (i.e. use ground-truth
information)
9
« Cheating » experiments (2)
Results: Baseline vs. « Cheating » experiments
Segmentation                WER (%), wtd avg on 6 shows
Baseline seg                14.0
Cheating seg (using SU)     14.2
Cheating seg (SU+proc)      13.0
There is room for improvement!
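The slides do not say how the average over the 6 shows is weighted. A common convention, assumed here, is to weight each show by its number of reference words, which is the same as pooling all errors and all reference words. The counts in the example are hypothetical and are not taken from the talk.

```python
def weighted_wer(shows):
    """Weighted-average WER (%) over shows.

    shows: list of (num_errors, num_reference_words) pairs, one per show.
    Weighting each show's WER by its reference word count is equivalent
    to dividing the pooled error count by the pooled word count.
    """
    total_errors = sum(errors for errors, _ in shows)
    total_words = sum(words for _, words in shows)
    if total_words == 0:
        raise ValueError("no reference words")
    return 100.0 * total_errors / total_words


# Hypothetical error and word counts for two shows (not from the talk).
example = [(140, 1000), (390, 3000)]
print(round(weighted_wer(example), 2))  # 13.25
```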
10
Overview of the processing steps
Broadcast News Shows
0. Segmentation using SUs
1. First strategy: splitting of long segments
2. Second strategy: concatenation of short segments
3. Third strategy: addition of time pads
Final segmentation
11
First strategy: splitting of long segments
Why?
• Overly long segments may cover more than one
sentence → confusing for the language model
How?
• Use automatically generated transcripts and MDE
• Segments that are too short must not be split → bad
for the efficiency of the language model
• Take two features into account for the decision tree:
   • The duration of segments
   • The pause between words
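A minimal sketch of the splitting idea, assuming a simple threshold rule stands in for the decision tree: a segment longer than `max_dur` is cut at its longest inter-word pause, provided that pause is at least `min_pause` and neither half becomes shorter than `min_dur`. The function name and all threshold values are hypothetical.

```python
def split_long_segment(words, max_dur=10.0, min_pause=0.3, min_dur=1.0):
    """Recursively split a segment at its longest inter-word pause.

    words: list of (start, end) times in seconds, one per word, in order.
    Returns a list of segments, each a list of (start, end) word times.
    """
    if len(words) < 2 or words[-1][1] - words[0][0] <= max_dur:
        return [words]  # short enough: do not process

    # Candidate cut points are the gaps between consecutive words.
    best_i, best_pause = None, min_pause
    for i in range(1, len(words)):
        pause = words[i][0] - words[i - 1][1]
        left_dur = words[i - 1][1] - words[0][0]
        right_dur = words[-1][1] - words[i][0]
        # Never create a segment shorter than min_dur.
        if pause >= best_pause and left_dur >= min_dur and right_dur >= min_dur:
            best_i, best_pause = i, pause

    if best_i is None:
        return [words]  # no acceptable pause: leave the segment intact
    return (split_long_segment(words[:best_i], max_dur, min_pause, min_dur)
            + split_long_segment(words[best_i:], max_dur, min_pause, min_dur))
```

For example, a 12-second segment whose only pause above 0.3 s lies after the second word is split into two segments at that pause, and each half is then checked again against `max_dur`.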
12
Second strategy: concatenation of short segments
Why?
• Short segments are not optimal for the language
model
• Short segments increase the WER because all
their words are close to the boundaries (cf. BBN’s
error analysis)
How?
• Take three features into account for the decision tree:
   • Pause between segments
   • Sum of the duration of two neighbors
   • Cluster label
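The three features above could drive a merge rule such as the following sketch, in which fixed thresholds stand in for the decision tree. Requiring identical cluster labels, and the values of `max_pause` and `max_total`, are assumptions made for illustration.

```python
def merge_short_segments(segments, max_pause=0.5, max_total=8.0):
    """Greedily merge neighboring segments.

    segments: list of (start, end, cluster) tuples sorted by start time.
    Two neighbors are merged when they share a cluster label, the pause
    between them is at most max_pause, and the sum of their durations is
    at most max_total (all thresholds are hypothetical).
    """
    merged = []
    for start, end, cluster in segments:
        if merged:
            p_start, p_end, p_cluster = merged[-1]
            pause = start - p_end
            total = (p_end - p_start) + (end - start)
            if cluster == p_cluster and pause <= max_pause and total <= max_total:
                # Extend the previous segment to cover the current one.
                merged[-1] = (p_start, end, cluster)
                continue
        merged.append((start, end, cluster))
    return merged
```

Because the comparison is always against the last merged segment, several short neighbors can be chained together until the duration limit is reached.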
13
Third strategy: addition of time pads
Why?
• Prevent words from being only partially included
• Because the windowing in the front end has a
scope of up to 8 frames (4 on each side) → better to
have enough padding
How?
• Take one feature into account for the decision tree:
   • The pause between segments
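A minimal sketch of pause-dependent padding, assuming that each segment may grow by at most `max_pad` seconds per side and never by more than half of the pause to its neighbor, so that padded segments cannot overlap. The value of `max_pad` is hypothetical; the slides give neither the pad length nor the frame shift.

```python
def add_time_pads(segments, max_pad=0.1):
    """Extend each (start, end) segment by up to max_pad seconds per side.

    segments: list of (start, end) tuples in seconds, sorted by start time.
    The pad on each side is limited to half of the pause to the neighboring
    segment; the first and last segments get the full pad on their outer side.
    """
    padded = []
    for i, (start, end) in enumerate(segments):
        left_gap = start - segments[i - 1][1] if i > 0 else 2 * max_pad
        right_gap = (segments[i + 1][0] - end
                     if i + 1 < len(segments) else 2 * max_pad)
        left = min(max_pad, max(left_gap, 0.0) / 2)
        right = min(max_pad, max(right_gap, 0.0) / 2)
        # Clamp at zero so the first segment cannot start before the audio.
        padded.append((max(start - left, 0.0), end + right))
    return padded
```

Splitting the pause evenly between the two neighbors is a design choice of this sketch; a decision tree could instead choose different pad lengths for different pause ranges.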
14
Examples of improvements (1)
1) Real sentence:
… and strictly limits state authority over how and when water is used …
Recognized sentence:
With baseline segmentation (cuts in middle of sentence):
… and stricter limits data
arty over how and when watery hues …
With MDE-based segmentation:
… and strict_ limits state authority over how and when water issues …
Legend: segmentation points are marked on the time axis; errors are shown in red
15
Examples of improvements (2)
2) Real sentence:
… I didn’t know if we would pull off the games. I didn’t know if this community
would ever rally around the Olympics again. …
Recognized sentence:
With baseline segmentation (doesn’t cut at end of sentence):
… pull off the games that had not this community would ever rally around …
With MDE-based segmentation:
… pull off the game_
I didn’t know _ this community would ever rally around …
16
Results for the development set
Segmentation                WER (%), wtd avg on 6 shows
Baseline seg                14.0
Step 0: SU seg              14.4
SU seg + step 1             14.2
SU seg + steps 1 & 2        14.0
SU seg + steps 1 & 2 & 3    13.3
The improvement is 0.7% absolute and 5% relative!
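The absolute and relative figures follow directly from the baseline WER (14.0) and the WER after all three steps (13.3). A small helper, whose name is chosen here for illustration, makes the arithmetic explicit:

```python
def improvement(baseline_wer, new_wer):
    """Return (absolute, relative) WER reduction, both in percent."""
    absolute = baseline_wer - new_wer
    relative = 100.0 * absolute / baseline_wer
    return absolute, relative


abs_gain, rel_gain = improvement(14.0, 13.3)
print(f"{abs_gain:.1f}% absolute, {rel_gain:.1f}% relative")
# prints "0.7% absolute, 5.0% relative"
```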
17
Results for the evaluation set
Segmentation                WER (%), wtd avg on 6 shows
Baseline seg                18.7
Step 0: SU seg              19.8
SU seg + step 1             19.7
SU seg + steps 1 & 2        19.6
SU seg + steps 1 & 2 & 3    18.4
The improvement is 0.3% absolute and 1.6% relative!
18
Dev results vs. Eval results
Observations:
• No « cheating » information is available for the eval set →
it is unclear how well the SU detection is working
• Improvements from step 0 (SU segmentation) to final
segmentation are similar for dev set and eval set: 1.1%
absolute (7.6% relative) for dev set and 1.3% absolute
(6.6% relative) for eval set → SU information not
optimized for eval
• Respective improvements are quite uneven for each
show → suggests that the strategies are show dependent,
not channel dependent
19
Future work
• Further optimize the thresholds for the three strategies
• Find a representation that allows choosing specific
threshold values for each show individually (i.e. fully adapt
the decision trees to each show)
• Use Metadata objects such as the confidence measure
of each SU and diarization to further improve the
strategies
20
Conclusion
• Development of a new segmentation method
based on Metadata and Speech-To-Text
information
• Use features given by MDE and STT
information in decision trees for each processing
step
• Results indicate the promise of this approach
• There still seems to be room for improvement
through further development
21
Acknowledgments
I would like to thank:
• Prof. Bourlard & Prof. Morgan
• Barbara & Andreas
• Yang
• IM2 for supporting my experience
22