Zero Resource Spoken Term Detection on the STD 06 Dataset
Justin Chiu, Carnegie Mellon University
07/24/2012, JHU
Motivation
• Given an unknown language, can you do unsupervised spoken term detection?
• Using a high-level representation with some structural assumptions, we can make spoken term detection more robust
– Query by example
– Modeling
– ASR

Proposed Approach
• Signal: MFCC (13-dimensional vectors)
– One frame every 10 ms; each frame represents 25 ms of speech
• Each utterance = a sequence of MFCC frames
• Goal:
– Cluster the MFCC frames
– Represent each MFCC frame with cluster labels
– Perform term detection with the SDTW algorithm
– (code sketches of these steps follow after the Threshold slide)

Clustering
• K-means clustering
– K-means with 10 random starts
– Store every cluster center as the model
• Gaussian mixture model
– Clustering with Gaussian mixtures
– Store the means and variances as the model
• The number of clusters is decided on development data

Representation
• Hard representation (vector -> label)
– Each audio file becomes a sequence of cluster labels
• 14 14 22 22 22 25 25 26 …
– Similar to text retrieval
• Soft representation (vector -> vector)
– Represent every MFCC frame as posterior probabilities over the Gaussian mixture components
– A better vector for distance measurement

Segmental Dynamic Time Warping
(Figure: SDTW alignment grid between utterance frames a1…a9 and query frames q1…q4.)
• Distance measurement
– Hard distance: match (0) / no match (1)
– Soft distance: -log(a · q)
• Each segment start jumps by 500 ms; the warping band limits the x-y distance to 500 ms

NIST STD 06 Dataset
• One of the datasets used to evaluate spoken term detection performance
• Advantage
– Widely used since the 2006 STD Evaluation Workshop, so results are easy to compare with others
• Disadvantage
– Only text queries are provided; there are no spoken queries

Choosing the dataset
• The 2006 STD dataset covers 3 different languages
– Each language (English, Mandarin, Arabic) has different subsets
– We selected the English CTS (Conversational Telephone Speech) dataset
• Reason: it has the most reported results
• Spoken query generation
– Synthesized speech queries: Flite
– Extracted speech queries: cut from the development set

Evaluation Measurement
• ATWV (Average Term-Weighted Value)
• Term-Weighted Value (TWV) is one minus the average value lost by the system per term:
– TWV = 1 - Avg(Pmiss + w * PFA)
– (a small ATWV sketch also follows after the Threshold slide)
• Reference ATWV numbers (supervised systems):
– English: 0.85
– Mandarin: 0.38
– Arabic: 0.34

Query Comparison
• Initial experiments on the development set
• Synthesized queries (1100): ATWV far below 0
• Extracted queries (411 extracted/combined): ATWV = -0.93
– The 135 longer queries (length > 1): ATWV = 0.185

Evaluation Set Result
• Overrun by a tide of false alarms

Further struggle
• Remove the first MFCC dimension
– It represents the power of the speech and takes large values
• Inverse frequency weighting
– A frame label that appears too many times may be less important (background noise)
• Content-related bonus
– Runs of the same consecutive label earn a bonus

What we have learned
• Representing speech at the level of a single MFCC frame is too short
• Mismatch in the speech signal hurts a lot
– Synthesized speech vs. extracted speech
• Short queries generate many false alarms
– "at" vs. "hat" vs. "bat"

Threshold
• How similar must two segments be before we decide they are the same word (detected or not)?
• How many abstract representation units should we use to represent an unknown language?
– Possibly this can be handled with regularization
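The front end is specified by the slides only as 13-dimensional MFCCs with a 10 ms hop and a 25 ms window. A minimal sketch of that framing, assuming 16 kHz audio and the librosa library (both assumptions, not stated in the talk):

```python
import librosa

SR = 16000                 # sample rate: an assumption, not given in the slides
HOP = int(0.010 * SR)      # 10 ms hop    -> 160 samples
WIN = int(0.025 * SR)      # 25 ms window -> 400 samples

def utterance_to_frames(path):
    """Return an utterance as a (num_frames, 13) sequence of MFCC vectors."""
    y, sr = librosa.load(path, sr=SR)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                hop_length=HOP, win_length=WIN, n_fft=512)
    return mfcc.T          # one 13-dimensional vector per 10 ms frame
```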
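The clustering and representation slides map directly onto off-the-shelf tools. A sketch with scikit-learn, assuming all training frames are stacked into one (N, 13) array; the cluster count K is a placeholder, since the talk tunes it on development data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

K = 256  # placeholder; the talk chooses the cluster count on the dev set

def hard_representation(all_frames, utterance_frames):
    """Vector -> label: K-means with 10 random starts, then label each frame."""
    km = KMeans(n_clusters=K, n_init=10).fit(all_frames)
    return km.predict(utterance_frames)   # e.g. 14 14 22 22 22 25 25 26 ...

def soft_representation(all_frames, utterance_frames):
    """Vector -> vector: posterior probability of each Gaussian component."""
    gmm = GaussianMixture(n_components=K, covariance_type="diag").fit(all_frames)
    return gmm.predict_proba(utterance_frames)   # shape (num_frames, K)
```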
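The two frame distances on the SDTW slide are one-liners; here is a numpy sketch of both, plus the banded cost matrix for one search segment. Reading the 500 ms x-y limit as 50 frames at a 10 ms hop is my interpretation, and the outer loop over 500 ms segment starts and the DTW recursion itself are omitted:

```python
import numpy as np

BAND = 50  # |x - y| <= 500 ms = 50 frames at a 10 ms hop (interpretation)

def hard_distance(label_a, label_q):
    """Hard representation: 0 if the cluster labels match, 1 otherwise."""
    return 0.0 if label_a == label_q else 1.0

def soft_distance(post_a, post_q):
    """Soft representation: -log of the inner product of the posterior vectors."""
    return -np.log(max(np.dot(post_a, post_q), 1e-10))  # floor avoids log(0)

def segment_cost_matrix(query, segment, dist):
    """Frame-pair costs for one SDTW segment, restricted to the warping band."""
    n, m = len(query), len(segment)
    cost = np.full((n, m), np.inf)
    for i in range(n):
        for j in range(max(0, i - BAND), min(m, i + BAND + 1)):
            cost[i, j] = dist(segment[j], query[i])
    return cost
```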
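Finally, the TWV formula from the Evaluation Measurement slide, 1 - Avg(Pmiss + w * PFA), in code. The per-term miss and false-alarm probabilities are assumed to be computed already, and the weight w is a parameter; in the NIST 2006 setup it works out to roughly 1000, though the slides leave it abstract:

```python
def atwv(p_miss, p_fa, w=999.9):
    """Average Term-Weighted Value: 1 - mean(P_miss + w * P_FA) over query terms.

    p_miss, p_fa: per-term miss and false-alarm probabilities.
    w: false-alarm weight; ~1000 in the NIST 2006 setup (an assumption here).
    """
    losses = [pm + w * pf for pm, pf in zip(p_miss, p_fa)]
    return 1.0 - sum(losses) / len(losses)

# Example: missing 20% of each of three terms with no false alarms
# gives atwv([0.2, 0.2, 0.2], [0.0, 0.0, 0.0]) == 0.8
```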
Representation
• We need a better representation than raw MFCC frames to do the clustering
– Phones work; any appropriate representation should work, and it is expected to come from a data-driven method
• Advanced approaches to representation
– Lee & Glass
– Jansen & Church
– SSS + clustering

Spoken Term Detection Experiments
• Dataset
– NIST Spoken Term Detection 2006 evaluation set
– Advantage: the dataset was designed for the STD task
• Evaluation metric
– ATWV
– Advantages: the evaluation tool is available, and results can be compared with many supervised baselines

Summary
• Clustering on individual MFCC frames is an inappropriate representation for speech
• A better representation of the speech unit is needed
• Channel/speaker mismatch harms performance substantially
• The extracted spoken queries and audio for the English CTS data are available

Personal Belief in Zero Resource STD
(Figure: spectrum from speaker-dependent to speaker-independent systems.)

Special Thanks
• Alex Rudnicky
• Florian Metze
• Alan Black
• Rita Singh
• Jack Mostow