Zero Resource Spoken Term
Detection on STD 06 dataset
Justin Chiu
Carnegie Mellon University
07/24/2012, JHU
Motivation
• Given an unknown language, can you do unsupervised spoken term detection?
• Using a high-level representation with some structural assumptions, we can make spoken term detection more robust
– Query by example
– Modeling
– ASR Approach
Proposed Approach
• Signals
• MFCC (13-dimensional vector; a front-end sketch follows this slide)
– 10 ms frame shift; each frame represents 25 ms of audio
• Each utterance = a sequence of MFCC frames
• Goal:
– Cluster the MFCC frames
– Represent each MFCC frame with its cluster label
– Perform term detection with the SDTW algorithm
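A minimal sketch of the front end described above, assuming librosa is available (the talk does not name a feature-extraction tool); the window and shift values follow the slide, and the 8 kHz rate matches telephone speech.

```python
# Hypothetical front end: 13-dim MFCCs, 25 ms window, 10 ms shift.
import librosa

def utterance_to_mfcc(wav_path, sr=8000):
    """Return one utterance as a (num_frames, 13) sequence of MFCC frames."""
    y, sr = librosa.load(wav_path, sr=sr)   # CTS audio is 8 kHz
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=13,
        n_fft=int(0.025 * sr),       # each frame represents 25 ms
        hop_length=int(0.010 * sr),  # one frame every 10 ms
    )
    return mfcc.T
```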
Clustering
• K-means clustering
– K-means with 10 random restarts
– Store every cluster center as the model
• Gaussian mixture model
– Clustering with Gaussian mixtures
– Store the means and variances as the model
• The number of clusters is decided on development data (see the sketch below)
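A hedged sketch of the two clustering options, assuming scikit-learn (not named in the talk); the cluster count K is a placeholder that the talk tunes on development data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

frames = np.random.randn(10000, 13)  # stand-in for pooled MFCC frames

K = 50  # hypothetical value; chosen on development data in the talk
kmeans = KMeans(n_clusters=K, n_init=10).fit(frames)  # 10 random restarts
centers = kmeans.cluster_centers_                     # stored as the model

gmm = GaussianMixture(n_components=K, covariance_type="diag").fit(frames)
means, variances = gmm.means_, gmm.covariances_       # stored as the model
```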
Representation
• Hard representation (vector -> label)
– Each audio file becomes a sequence of cluster labels
• 14 14 22 22 22 25 25 26 …
– Similar to text retrieval
• Soft representation (vector -> vector)
– Represent every MFCC frame as a vector of posterior probabilities over the Gaussian mixture components
– A better vector for distance measurement (see the sketch below)
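Continuing the scikit-learn sketch above (the fitted kmeans and gmm), both representations fall out of the models directly; the utterance array here is a hypothetical stand-in.

```python
# utt is one utterance as a (num_frames, 13) MFCC array.
utt = np.random.randn(200, 13)

hard = kmeans.predict(utt)     # label sequence, e.g. 14 14 22 22 22 25 ...
soft = gmm.predict_proba(utt)  # (num_frames, K) posterior vectors
```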
Segmental Dynamic Time Warping
[Figure: SDTW alignment grid between query frames q1…q4 and audio frames a1…a9]
• Distance measurement
– Hard distance: 0 if the labels match, 1 if they do not
– Soft distance: -log(a · q)
• Each jump: 500 ms; x-y distance limitation: 500 ms (see the sketch below)
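A minimal sketch of the segmental DTW search under the soft distance, with hypothetical helper names; the band of 50 frames approximates the 500 ms limit at a 10 ms frame shift, and the length normalization is one simple choice among several.

```python
import numpy as np

def sdtw_search(query, audio, band=50):
    """Slide a band-constrained DTW over the audio; return the best
    (average alignment cost, start offset) for the query.
    query, audio: (frames, K) posterior-vector arrays."""
    n, m = len(query), len(audio)
    best = (np.inf, -1)
    for start in range(0, m - n + 1, band):       # one DTW per band offset
        seg = audio[start:start + n + band]
        D = np.full((n + 1, len(seg) + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, len(seg) + 1):
                if abs(i - j) > band:             # x-y distance limitation
                    continue
                cost = -np.log(max(query[i - 1] @ seg[j - 1], 1e-10))
                D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
        avg = D[n, 1:].min() / n                  # crude length normalization
        if avg < best[0]:
            best = (avg, start)
    return best
```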
NIST STD 06 Data set
• One of the datasets used to evaluate spoken term detection performance
• Advantage
– Widely used because of the 2006 STD Evaluation Workshop, so results are easy to compare with others
• Disadvantage
– Only text queries are provided; there are no spoken queries
Choosing the dataset
• The 2006 STD dataset has 3 different languages
– Each language (English, Mandarin, Arabic) has different subsets
– We select the English CTS (Conversational Telephone Speech) subset
• Reason: it has the most reported results
• Spoken query generation
– Synthesized speech queries: Flite
– Extracted speech queries: extracted from the development set
Evaluation Measurement
• ATWV (Average Term-Weighted Value)
• Term-Weighted Value (TWV) is one minus the average value lost by the system per term (a worked example follows this slide):
TWV = 1 - Avg(Pmiss + w * PFA)
• Reference ATWV number (Supervised):
– English: 0.85
– Mandarin: 0.38
– Arabic: 0.34
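A worked sketch of the TWV formula above; the weight w = 999.9 follows the NIST STD 2006 evaluation setting, and the per-term error rates are toy numbers, not results from the talk.

```python
def twv(p_miss, p_fa, w=999.9):
    """TWV = 1 - Avg(Pmiss + w * PFA), averaged over the query terms."""
    n = len(p_miss)
    return 1.0 - sum(pm + w * pf for pm, pf in zip(p_miss, p_fa)) / n

# Toy example: three terms, each with 20% misses and a 1e-4 false-alarm rate.
print(twv([0.2, 0.2, 0.2], [1e-4, 1e-4, 1e-4]))  # ~0.70
```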
Query Comparison
• Preliminary experiments on the development set
• Synthesized queries
– 1100 queries
• ATWV: << 0
• Extracted queries
– 411 extracted / combined queries
• ATWV: -0.93
– 135 longer queries (length > 1)
• ATWV: 0.185
Evaluation Set Result
• Overrun by a tide of false alarms
Further struggle
• Remove the first dimension of the MFCC vector
– It represents the power of the speech and has large values
• Inverse frequency weighting (see the sketch after this list)
– If the same frame label appears too many times, it may be less important (background noise)
• Content-related bonus
– Consecutive identical labels provide a bonus
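A hedged sketch of the inverse-frequency idea, IDF-style over cluster labels; the function name and the exact weighting form are illustrative, not taken from the talk.

```python
import math
from collections import Counter

def inverse_frequency_weights(label_seqs):
    """Weight each cluster label by log(total frames / frames with that label),
    so very frequent labels (likely background noise) count for less."""
    counts = Counter(label for seq in label_seqs for label in seq)
    total = sum(counts.values())
    return {label: math.log(total / c) for label, c in counts.items()}

weights = inverse_frequency_weights([[14, 14, 22, 22, 22, 25, 25, 26]])
```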
What we have learned
• A single MFCC frame is too short a unit for representing speech
• Mismatch in the speech signal has a large effect
– Synthesized speech vs. extracted speech
• Many false alarms occur for short queries
– at vs. hat vs. bat
Threshold
• How similar must two segments be before we decide they are the same word (detected or not)?
• How many abstract representation units should we use to represent an unknown language?
– Possibly this can be handled with regularization
Representation
• We need a better representation (other than MFCC frames) to do the clustering
– Phones work; an appropriate representation should work, and it is expected to come from a data-driven approach
• Advanced approaches for representation
– Lee, Glass
– Jansen, Church
– SSS + clustering
Spoken Term Detection Experiments
• Dataset
– NIST Spoken Term Detection 2006 evaluation set
– Advantage:
• The dataset is designed for the STD task
• Evaluation metric
– ATWV
– Advantage:
• The evaluation tool is available
• We can compare with many supervised baselines
Summary
• Clustering on individual MFCC frames is an inappropriate representation for speech
• We need a better representation of the speech unit
• Channel/speaker mismatch harms performance significantly
• The extracted spoken queries and audio for the English CTS data are available.
Personal Belief in Zero Resource STD
[Figure: spectrum from speaker-dependent to speaker-independent approaches]
Special Thanks
• Alex Rudnicky
• Florian Metze
• Alan Black
• Rita Singh
• Jack Mostow