Transcript Slide 1
Relevance Vector Machines with Empirical Likelihood-Ratio Kernels for PLDA Speaker Verification
Wei Rao and Man-Wai Mak
ISCSLP 2014
Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China

Transcript Slide 2: Summary
• Our previous studies have shown that PLDA-SVM scoring has three advantages:
  I. Better utilization of multiple enrollment utterances.
  II. Explicit use of the information of background speakers.
  III. Opening up the opportunity of adopting sparse kernel machines for PLDA-based speaker verification systems.
• In this work, we
  – investigate the properties of empirical kernels in support vector machines (SVMs) and relevance vector machines (RVMs), and
  – compare the performance of SVMs and RVMs in PLDA-based speaker verification.

Transcript Slide 3: Outline
• Empirical LR Kernels
  – Likelihood Ratio Score for Gaussian PLDA
  – PLDA Score Space
  – Empirical Kernel Map
  – SVM with Empirical Kernels
• Relevance Vector Machines
  – RVM vs. SVM
  – Limitations of SVM
  – RVM Regression
  – PLDA-RVM Scoring
• Experiment Setup
• Results
• Conclusions

Transcript Slide 4: Likelihood Ratio Score for Gaussian PLDA
Given a length-normalized test i-vector $\mathbf{x}_t$ and a target speaker's i-vector $\mathbf{x}_s$, the likelihood ratio score can be computed as follows:

$$S_{\mathrm{LR}}(\mathbf{x}_s,\mathbf{x}_t)=\log\frac{\mathcal{N}\left(\begin{bmatrix}\mathbf{x}_s\\ \mathbf{x}_t\end{bmatrix};\begin{bmatrix}\mathbf{m}\\ \mathbf{m}\end{bmatrix},\begin{bmatrix}\mathbf{V}\mathbf{V}^{\top}+\boldsymbol{\Sigma} & \mathbf{V}\mathbf{V}^{\top}\\ \mathbf{V}\mathbf{V}^{\top} & \mathbf{V}\mathbf{V}^{\top}+\boldsymbol{\Sigma}\end{bmatrix}\right)}{\mathcal{N}\left(\begin{bmatrix}\mathbf{x}_s\\ \mathbf{x}_t\end{bmatrix};\begin{bmatrix}\mathbf{m}\\ \mathbf{m}\end{bmatrix},\begin{bmatrix}\mathbf{V}\mathbf{V}^{\top}+\boldsymbol{\Sigma} & \mathbf{0}\\ \mathbf{0} & \mathbf{V}\mathbf{V}^{\top}+\boldsymbol{\Sigma}\end{bmatrix}\right)}$$

where $\mathbf{V}$ is the factor loading matrix and $\boldsymbol{\Sigma}$ is the covariance of the PLDA model. The score implicitly uses background information.

Transcript Slide 5: PLDA Score Space
• Given a set of target speaker's i-vectors and a set of background speakers' i-vectors, we create a PLDA score space for this speaker.
• For each target-speaker i-vector, we compute a score vector that lives in this space.
• The score vectors of this speaker, together with the score vectors of the background speakers, are used to train a speaker-dependent SVM for scoring.
[Block diagram: target-speaker i-vectors → PLDA model → PLDA score vectors; these, together with the score vectors of the background speakers, go to SVM training, producing a speaker-dependent SVM.]

Transcript Slide 6: Empirical Kernel Map
• We refer to the mapping from i-vectors to PLDA score vectors as the empirical kernel map.
• Given a test i-vector, we obtain a test score vector through the same map.
[Block diagram: test i-vector → PLDA model → PLDA score vector.]

Transcript Slide 7: Utterance Partitioning with Acoustic Vector Resampling (UP-AVR)
• For speakers with a small number of enrollment utterances, we apply utterance partitioning (Rao and Mak, 2013) to produce more i-vectors from a few utterances.
[Diagram: one enrollment utterance → RN + 1 enrollment i-vectors.]

Transcript Slide 8: SVM with Empirical LR Kernel
[Block diagram: the target speaker's i-vectors X_s and the non-target (background) speakers' i-vectors X_b each pass through PLDA scoring and the empirical kernel map; the resulting score vectors are used to train the target speaker's SVM.]
The SVM score of a test i-vector is

$$S_{\mathrm{SVM}}(\mathbf{x}_t;\mathcal{X}_s,\mathcal{X}_b)=\sum_{j\in\mathcal{SV}_s}\alpha_{s,j}K(\mathbf{x}_t,\mathbf{x}_{s,j})-\sum_{j\in\mathcal{SV}_b}\alpha_{s,j}K(\mathbf{x}_t,\mathbf{x}_{b,j})+d_s$$

where $\mathcal{SV}_s$ and $\mathcal{SV}_b$ index the target-speaker and background-speaker support vectors, $\alpha_{s,j}$ are the corresponding Lagrange multipliers, and $d_s$ is the bias of the target speaker's SVM.

Transcript Slide 9: Outline
• Empirical LR Kernels
• Relevance Vector Machines
• Experiment Setup
• Results
• Conclusions

Transcript Slide 10: RVM versus SVM
• RVM stands for relevance vector machine (Tipping, 2001).
• The scoring functions of the RVM and the SVM have the same form:

$$f(\mathbf{x}_t;\mathbf{w})=\sum_{i=1}^{N}w_iK(\mathbf{x}_t,\mathbf{x}_i)+w_0$$

where $\mathbf{w}=[w_1,\ldots,w_N]^{\top}$ are the optimized weights, $w_0$ is a bias term, and $K(\mathbf{x}_t,\mathbf{x}_i)$ is a kernel function, e.g., the empirical LR kernel.
• Difference between RVM and SVM:
  – SVM training is based on structural risk minimization.
  – RVM training is based on Bayesian relevance learning.
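To make the empirical kernel map and the shared scoring form concrete, here is a minimal numpy sketch, an illustration under simplifying assumptions rather than the authors' implementation: a two-covariance Gaussian PLDA log-likelihood-ratio scorer, the mapping of an i-vector to its vector of PLDA scores against the target-speaker and background-speaker i-vectors, and the weighted-kernel score used by both the SVM and the RVM. The function names (plda_llr, empirical_kernel_map, kernel_machine_score) are hypothetical.

    import numpy as np

    def plda_llr(x1, x2, m, V, Sigma):
        # Gaussian PLDA log-likelihood ratio between two length-normalized i-vectors
        # (simplified two-covariance form; V = factor loading matrix, Sigma = covariance).
        D = len(m)
        tot = V @ V.T + Sigma                     # total covariance
        ac = V @ V.T                              # across-class (same-speaker) covariance
        z = np.concatenate([x1 - m, x2 - m])
        S_same = np.block([[tot, ac], [ac, tot]])
        S_diff = np.block([[tot, np.zeros((D, D))], [np.zeros((D, D)), tot]])

        def log_gauss(zz, S):
            _, logdet = np.linalg.slogdet(S)
            return -0.5 * (logdet + zz @ np.linalg.solve(S, zz) + len(zz) * np.log(2 * np.pi))

        return log_gauss(z, S_same) - log_gauss(z, S_diff)

    def empirical_kernel_map(x, anchor_ivecs, m, V, Sigma):
        # Empirical kernel map: an i-vector becomes the vector of its PLDA LLR scores
        # against the target-speaker and background-speaker (anchor) i-vectors.
        return np.array([plda_llr(x, a, m, V, Sigma) for a in anchor_ivecs])

    def kernel_machine_score(phi_t, phi_train, weights, bias, kernel):
        # Shared SVM/RVM scoring form: f(x_t; w) = sum_i w_i K(x_t, x_i) + w_0,
        # evaluated on score vectors produced by the empirical kernel map.
        return sum(w * kernel(phi_t, phi_i) for w, phi_i in zip(weights, phi_train)) + bias

In a PLDA-SVM or PLDA-RVM system, the weights and bias would come from SVM or RVM training on the score vectors of the target and background speakers.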
Transcript Slide 11: Limitations of SVM
• The number of support vectors in an SVM increases linearly with the size of the training set.
• SVM scores are not probabilistic, so score normalization is required to adjust the score range of individual SVMs.
• It is necessary to trade off the training error against the margin of separation by adjusting the penalty factor for each target speaker during SVM training.
The RVM is designed to address these issues.

Transcript Slide 12: RVM Regression
• In RVM regression, the target $y_i$ is assumed to follow a Gaussian distribution with mean $f(\mathbf{x}_i;\mathbf{w})$ and variance $\sigma^2$:
$$p(y_i|\mathbf{x}_i,\mathbf{w},\sigma^2)=\mathcal{N}\bigl(y_i;f(\mathbf{x}_i;\mathbf{w}),\sigma^2\bigr)$$
• To avoid over-fitting, the RVM defines a prior distribution over the RVM weights $\mathbf{w}$:
$$p(\mathbf{w}|\boldsymbol{\alpha})=\prod_{i=0}^{N}\mathcal{N}\bigl(w_i;0,\alpha_i^{-1}\bigr)$$
• Note: when $\alpha_i\to\infty$, the posterior of $w_i$ concentrates at zero and the corresponding training vector is pruned, which makes the model sparse.

Transcript Slide 13: RVM Regression
Given a test vector $\mathbf{x}_t$, the predictive distribution is
$$p(y_t|\mathbf{x}_t)=\mathcal{N}\bigl(y_t;\mu_t,\sigma_t^2\bigr),$$
where
$$\mu_t=\boldsymbol{\mu}_{\mathbf{w}}^{\top}\boldsymbol{\phi}(\mathbf{x}_t),\qquad\sigma_t^2=\sigma_{\mathrm{MP}}^2+\boldsymbol{\phi}(\mathbf{x}_t)^{\top}\boldsymbol{\Sigma}_{\mathbf{w}}\boldsymbol{\phi}(\mathbf{x}_t),$$
$\boldsymbol{\mu}_{\mathbf{w}}$ and $\boldsymbol{\Sigma}_{\mathbf{w}}$ are the posterior mean and covariance of the RVM weights, and $\boldsymbol{\phi}(\mathbf{x}_t)=[1,K(\mathbf{x}_t,\mathbf{x}_1),\ldots,K(\mathbf{x}_t,\mathbf{x}_N)]^{\top}$.

Transcript Slide 14: PLDA-RVM Scoring
We use the mean of the predictive distribution as the verification score:
$$S_{\mathrm{RVM}}(\mathbf{x}_t)=\mu_t=\boldsymbol{\mu}_{\mathbf{w}}^{\top}\boldsymbol{\phi}(\mathbf{x}_t),$$
where the kernel $K(\cdot,\cdot)$ inside $\boldsymbol{\phi}(\mathbf{x}_t)$ is the empirical LR kernel (see the code sketch after Slide 23).

Transcript Slide 15: Outline
• Empirical LR Kernels
• Relevance Vector Machines
• Experiment Setup
• Results
• Conclusions

Transcript Slide 16: Speech Data and PLDA Models
• Evaluation dataset: common evaluation condition 2 (CC2) of the NIST SRE 2012 male core set.
• Parameterization: 19 MFCCs together with energy, plus their 1st and 2nd derivatives (60 dimensions).
• UBM: gender-dependent, 1024 mixtures.
• Total variability matrix: gender-dependent, 400 total factors.
• I-vector preprocessing: whitening by WCCN, then length normalization, followed by LDA (500-dim → 200-dim) and WCCN.

Transcript Slide 17: Property of Empirical LR Kernels in SVM and RVM
• When the number of SVs increases, the performance of the SVMs degrades.
• When the number of RVs is very small, the performance of the RVMs is poor.
• The performance of the RVMs is fairly stable, and better than that of the SVMs once the number of RVs is sufficient.

Transcript Slide 18: Property of Empirical LR Kernels in SVM and RVM
[Plot: number of SVs and RVs versus RBF width γ.]
• When the RBF width increases, the number of SVs decreases first and then gradually increases.
• The number of RVs decreases monotonically as the RBF width increases.
• The RVMs are less sparse than the SVMs over a wide range of RBF widths.

Transcript Slide 19: Property of Empirical LR Kernels in SVM and RVM
[Plot: performance versus RBF width γ.]
• For the RVMs to be effective, we need at least 50 RVs.
• Once the RVMs have sufficient RVs, their performance can be better than that of the SVMs.

Transcript Slide 20: Performance Comparison (CC2 of SRE12, male)

Methods                 EER (%)   MinNDCF
PLDA                    2.40      0.33
PLDA+UP-AVR             2.32      0.32
PLDA+SVM                2.07      0.31
PLDA+RVM-C              3.76      0.48
PLDA+RVM-R              2.32      0.28
PLDA+UP-AVR+SVM         1.97      0.30
PLDA+UP-AVR+RVM-C       3.00      0.42
PLDA+UP-AVR+RVM-R       1.94      0.28

• Both PLDA-SVM and PLDA-RVM (regression) perform better than PLDA.
• RVM classification performs poorly because of the small number of RVs.
• After utterance partitioning (UP-AVR), the performance of both RVM regression and SVM improves, and RVM regression slightly outperforms SVM.

Transcript Slide 21: Conclusions
• RVM classification is not appropriate for the verification task in NIST SRE, but RVM regression with the empirical LR kernel can achieve performance comparable to PLDA-SVM.
• UP-AVR can boost the performance of both SVM and RVM regression, and after utterance partitioning the performance of RVM regression is slightly better than that of SVM.

Transcript Slide 22

Transcript Slide 23: Optimization of RBF Width
The preferred value of the RBF width γ for the RVMs is between 1400 and 1500, and that for the SVMs is between 1900 and 2000.
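To connect the RVM regression of Slides 12–14 with the RBF width discussed above, here is a minimal numpy sketch, an illustration under assumptions rather than the authors' implementation: an RBF kernel applied to PLDA score vectors, the standard sparse Bayesian re-estimation updates of Tipping (2001), and the PLDA-RVM score as the predictive mean. The function names, the ±1 regression targets, the convergence settings, and the parameterization of the width γ are all assumptions.

    import numpy as np

    def rbf_kernel(phi1, phi2, gamma):
        # RBF kernel over two PLDA score vectors; gamma plays the role of the RBF width
        # (larger gamma -> smoother kernel; the exact parameterization is an assumption).
        return np.exp(-np.sum((np.asarray(phi1) - np.asarray(phi2)) ** 2) / gamma)

    def train_rvm_regression(Phi, y, n_iters=200, alpha_max=1e9, noise_var=1e-2):
        # Sparse Bayesian regression (Tipping, 2001). Phi is the (N, M) design matrix,
        # e.g. kernel values between training score vectors plus a bias column of ones;
        # y holds regression targets such as +1 (target speaker) and -1 (background).
        N, M = Phi.shape
        alpha = np.ones(M)            # per-weight precisions of the Gaussian prior
        beta = 1.0 / noise_var        # noise precision, 1 / sigma^2
        keep = np.arange(M)           # columns not yet pruned
        mu = np.zeros(M)
        for _ in range(n_iters):
            P = Phi[:, keep]
            A = np.diag(alpha[keep])
            Sigma_w = np.linalg.inv(beta * (P.T @ P) + A)    # posterior covariance
            mu_keep = beta * (Sigma_w @ (P.T @ y))           # posterior mean
            g = 1.0 - alpha[keep] * np.diag(Sigma_w)
            alpha[keep] = g / (mu_keep ** 2 + 1e-12)         # re-estimate precisions
            resid = y - P @ mu_keep
            beta = max(N - g.sum(), 1e-6) / (resid @ resid + 1e-12)
            survivors = alpha[keep] < alpha_max              # alpha -> inf means weight -> 0
            mu = np.zeros(M)
            mu[keep[survivors]] = mu_keep[survivors]
            keep = keep[survivors]
        return mu, keep               # sparse weights and indices of the relevance vectors

    def plda_rvm_score(phi_t, mu):
        # Verification score = mean of the predictive distribution = mu . phi(x_t),
        # where phi(x_t) holds the kernelized empirical LR scores of the test i-vector
        # (with a leading 1 if a bias column was used in Phi).
        return float(mu @ phi_t)

The pruning step is what keeps the RVM sparse: columns whose prior precision diverges are removed, and the surviving training vectors are the relevance vectors. When too few of them survive, the score becomes unreliable, which is consistent with the poor RVM-classification results shown earlier.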
Transcript Slide 24: Property of Empirical LR Kernels in SVM and RVM
• 108 target speakers with true-target trials and impostor trials were extracted from NIST SRE 2012.
• The EERs achieved by the SVMs and RVMs, together with their corresponding numbers of support vectors (SVs) and relevance vectors (RVs), were averaged across the 108 speakers.

Transcript Slide 25: Experiment Setup for SVM and RVM
• Optimization of the RBF width: a development set was created for optimizing the RBF parameter of each target speaker's SVM/RVM (see the sketch at the end of this transcript).
  – True-target trials: UP-AVR was applied to the speaker's enrollment utterances to generate a number of enrollment i-vectors. Some of these i-vectors were used for training the SVM/RVM model; the remaining i-vectors were used as true-target trials.
  – Impostor trials: 200 background utterances were selected from previous NIST SREs.
  – The width was varied to maximize the difference between the mean of the true-target scores and the mean of the impostor scores.

Transcript Slide 26: Empirical Kernel Maps
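As a closing illustration of the width-selection procedure on Slide 25, here is a minimal numpy sketch of the grid search: for each candidate width γ, train the speaker's SVM/RVM on the development data, score the held-out true-target and impostor trials, and keep the γ that maximizes the gap between the two score means. This is a hypothetical sketch; the helper train_and_score, the candidate grid, and the function name select_rbf_width are assumptions and not part of the original system.

    import numpy as np

    def select_rbf_width(train_and_score, candidate_gammas=range(1000, 2101, 100)):
        # train_and_score(gamma) is a hypothetical hook that trains the speaker's
        # SVM/RVM with RBF width gamma and returns (true_target_scores, impostor_scores)
        # for the held-out development trials.
        best_gamma, best_gap = None, -np.inf
        for gamma in candidate_gammas:
            target_scores, impostor_scores = train_and_score(gamma)
            # Criterion from Slide 25: difference between the two score means.
            gap = np.mean(target_scores) - np.mean(impostor_scores)
            if gap > best_gap:
                best_gamma, best_gap = gamma, gap
        return best_gamma

In the setup above, the true-target trials would be the UP-AVR i-vectors held out from SVM/RVM training, and the impostor trials would come from the 200 background utterances.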