Transcript Slide 1

Relevance Vector Machines with
Empirical Likelihood-Ratio Kernels for
PLDA Speaker Verification
Wei Rao and Man-Wai Mak
ISCSLP 2014
Department of Electronic and Information Engineering
The Hong Kong Polytechnic University, Hong Kong SAR, China
Summary
• Our previous studies have shown that PLDA-SVM scoring
has three advantages:
I. It makes better use of multiple enrollment utterances.
II. It uses the information of background speakers explicitly.
III. It opens up opportunities for adopting sparse kernel machines in
PLDA-based speaker verification systems.
• In this work, we
– investigate the properties of empirical kernels in the SVM and
the relevance vector machine (RVM), and
– compare the performance of the SVM and the RVM in
PLDA-based speaker verification.
Outline
• Empirical LR Kernels
– Likelihood Ratio Score for Gaussian PLDA
– PLDA Score Space
– Empirical Kernel Map
– SVM with Empirical Kernels
• Relevance Vector Machines
– RVM vs. SVM
– Limitations of SVM
– RVM Regression
– PLDA-RVM Scoring
• Experiment Setup
• Results
• Conclusions
Likelihood Ratio Score for Gaussian PLDA
• Given a length-normalized test i-vector x_t and the target
speaker's i-vector x_s, the likelihood-ratio score can be
computed as follows:

S_LR(x_s, x_t) = ln N([x_s; x_t]; [m; m], [Σ_tot Σ_ac; Σ_ac Σ_tot])
               − ln N([x_s; x_t]; [m; m], [Σ_tot 0; 0 Σ_tot])

where Σ_ac = VVᵀ and Σ_tot = VVᵀ + Σ. Here V is the factor loading
matrix and Σ is the covariance of the PLDA model. Because V and Σ are
estimated from background speakers, the score implicitly uses
background information.
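The ratio above can be sketched numerically. Below is a minimal NumPy sketch of the Gaussian PLDA LR score, assuming zero-mean (m = 0) length-normalized i-vectors; the helper names `plda_llr` and `log_gauss` are hypothetical:

```python
import numpy as np

def log_gauss(z, C):
    """log N(z; 0, C), computed with slogdet/solve for stability."""
    sign, logdet = np.linalg.slogdet(C)
    return -0.5 * (logdet + z @ np.linalg.solve(C, z)
                   + len(z) * np.log(2 * np.pi))

def plda_llr(xs, xt, V, Sigma):
    """Gaussian PLDA log-likelihood ratio for an i-vector pair.
    Sigma_tot = V V^T + Sigma (total covariance)
    Sigma_ac  = V V^T        (across-class covariance)
    """
    D = len(xs)
    Sig_ac = V @ V.T
    Sig_tot = Sig_ac + Sigma
    z = np.concatenate([xs, xt])
    # Same-speaker hypothesis: the pair is correlated through Sigma_ac.
    same = np.block([[Sig_tot, Sig_ac], [Sig_ac, Sig_tot]])
    # Different-speaker hypothesis: the pair is independent.
    diff = np.block([[Sig_tot, np.zeros((D, D))],
                     [np.zeros((D, D)), Sig_tot]])
    return log_gauss(z, same) - log_gauss(z, diff)
```

Note that the score is symmetric in its two i-vector arguments, as the two-Gaussian form above requires.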
PLDA Score Space
• Given a set of target speaker's i-vectors X_s and a set of
background speakers' i-vectors X_b, we create a PLDA score
space for this speaker.
• For each target-speaker i-vector, we compute a score
vector that lives in this space.
• The score vectors of this speaker, together with the score
vectors of the background speakers, are used to train a
speaker-dependent SVM for scoring.

[Figure: i-vectors x_{s,j} → PLDA model → PLDA score vectors; these,
together with the score vectors of the background speakers, feed SVM
training, yielding a speaker-dependent SVM.]
Empirical Kernel Map
• We refer to the mapping from an i-vector to its PLDA score
vector as the empirical kernel map.

[Figure: i-vectors x_{s,j} → PLDA model → PLDA score vectors.]

• Given a test i-vector x_t, we obtain a test score-vector by
applying the same map, i.e., by scoring x_t against the
target-speaker and background-speaker i-vectors.
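The map above can be sketched in a few lines; the function name `empirical_kernel_map` and the pairwise-scorer signature `plda_score(x, anchor)` are hypothetical stand-ins:

```python
import numpy as np

def empirical_kernel_map(x, anchors, plda_score):
    """Map i-vector x to its PLDA score vector against a fixed list of
    anchor i-vectors (target-speaker + background-speaker i-vectors).
    plda_score(x, a) is any pairwise PLDA LR scorer."""
    return np.array([plda_score(x, a) for a in anchors])
```

For example, with three anchors the map turns each i-vector into a 3-dimensional score vector; the dimensionality of the score space equals the number of anchor i-vectors, not the i-vector dimension.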
Utterance Partitioning with Acoustic Vector
Resampling (UP-AVR)
• For speakers with a small number of enrollment
utterances, we apply utterance partitioning (Rao and
Mak, 2013) to produce more i-vectors from a few
utterances.

One enrollment utterance → RN + 1 enrollment i-vectors
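A minimal sketch of how one utterance yields RN + 1 i-vector sources, assuming the usual UP-AVR recipe (shuffle the frame order R times, split each shuffle into N partitions, and keep the full utterance); the function name `up_avr` is hypothetical:

```python
import numpy as np

def up_avr(frames, N=4, R=3, seed=0):
    """Utterance partitioning with acoustic vector resampling (sketch).
    frames: (T, dim) acoustic vectors of one enrollment utterance.
    Returns RN + 1 frame subsets, each of which would be fed to the
    i-vector extractor to produce one enrollment i-vector."""
    rng = np.random.default_rng(seed)
    subsets = [frames]                        # the full utterance
    for _ in range(R):
        # Randomize the frame order so each round gives new partitions.
        shuffled = frames[rng.permutation(len(frames))]
        subsets.extend(np.array_split(shuffled, N))
    return subsets
```

Shuffling before partitioning matters: without it, the N partitions of every round would be identical, and the extra i-vectors would carry no new information.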
SVM with Empirical LR Kernel

[Figure: the target speaker's i-vectors X_s and the non-target
(background) speakers' i-vectors X_b, with X = {X_s, X_b}, each pass
through PLDA scoring + empirical kernel map; the resulting score
vectors feed SVM training, yielding a target-speaker SVM.]

S_SVM(x_t, X_s, X_b) = Σ_{j∈SV_s} α_{s,j} K(x_t, x_{s,j})
                     − Σ_{j∈SV_b} α_{s,j} K(x_t, x_{b,j}) + d_s

where SV_s and SV_b index the support vectors among the target and
non-target speakers' i-vectors, α_{s,j} are the Lagrange multipliers,
and d_s is the bias of speaker s's SVM.
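The scoring function above can be sketched directly, assuming the support vectors and multipliers have already been obtained from SVM training; all names here (`svm_score`, `rbf`) are hypothetical:

```python
import numpy as np

def rbf(a, b, gamma):
    """RBF kernel K(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def svm_score(xt, sv_tgt, a_tgt, sv_bkg, a_bkg, d, gamma):
    """S_SVM(xt) = sum over target SVs of a*K(xt, sv)
                 - sum over background SVs of a*K(xt, sv) + d.
    sv_tgt/sv_bkg: support vectors; a_tgt/a_bkg: their multipliers."""
    pos = sum(a * rbf(xt, s, gamma) for a, s in zip(a_tgt, sv_tgt))
    neg = sum(a * rbf(xt, s, gamma) for a, s in zip(a_bkg, sv_bkg))
    return pos - neg + d
```

In the PLDA-SVM system, `xt` would be the empirical-kernel-mapped score vector of the test i-vector rather than the raw i-vector itself.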
RVM versus SVM
• RVM stands for relevance vector machine (Tipping, 2001).
• The scoring functions of the RVM and the SVM have the same form:

f(x_t; w) = Σ_{i=1}^{N} w_i K(x_t, x_i) + w_0

where w = [w_1, ..., w_N] are the optimized weights, w_0 is a bias
term, and K(x_t, x_i) is a kernel function, e.g., the empirical LR
kernel.
• Difference between the RVM and the SVM:
– SVM training is based on structural risk minimization.
– RVM training is based on Bayesian relevance learning.
Limitations of SVM
• The number of support vectors in an SVM increases linearly
with the size of the training set.
• SVM scores are not probabilistic, so score
normalization is required to adjust the score range of
individual SVMs.
• It is necessary to trade off the training error against the
margin of separation by adjusting the penalty factor for
each target speaker during SVM training.

The RVM is designed to address these issues in the SVM.
RVM Regression
• In RVM regression, the target y is assumed to follow a
Gaussian distribution with mean f(x; w) and variance σ²:

p(y | x, w, σ²) = N(y; f(x; w), σ²)

• To avoid over-fitting, the RVM defines a zero-mean Gaussian prior
over the RVM weights w, with one precision hyperparameter per weight:

p(w | α) = Π_{i=0}^{N} N(w_i; 0, α_i⁻¹)

• Note: when α_i → ∞, the posterior of w_i concentrates at 0 and the
corresponding kernel function is pruned, which is what makes the
RVM sparse.
RVM Regression
• Given a test vector x_t, the predictive distribution is

p(y_t | x_t) = N(y_t; μᵀk(x_t), σ_t²)

where μ is the posterior mean of the weights, k(x_t) is the vector of
kernel evaluations for x_t, and σ_t² = σ² + k(x_t)ᵀ Σ k(x_t), with Σ
the posterior covariance of the weights.
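The posterior and predictive quantities above follow Tipping's standard RVM-regression equations; a minimal sketch, assuming the hyperparameters α and σ² are given (in practice they are re-estimated iteratively), with hypothetical function names:

```python
import numpy as np

def rvm_posterior(Phi, y, alpha, sigma2):
    """Posterior over RVM weights given hyperparameters:
    Sigma = (Phi^T Phi / sigma2 + diag(alpha))^{-1}
    mu    = Sigma Phi^T y / sigma2
    Phi: (M, N) design matrix of kernel evaluations; y: (M,) targets."""
    Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + np.diag(alpha))
    mu = Sigma @ Phi.T @ y / sigma2
    return mu, Sigma

def rvm_predict(phi_t, mu, Sigma, sigma2):
    """Predictive mean and variance for a test kernel vector phi_t:
    mean = mu^T phi_t,  var = sigma2 + phi_t^T Sigma phi_t."""
    return mu @ phi_t, sigma2 + phi_t @ Sigma @ phi_t
```

When some α_i grow very large during re-estimation, the corresponding posterior weights shrink to zero and their columns of Phi can be dropped, leaving only the relevance vectors.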
PLDA-RVM Scoring
• We use the mean of the predictive distribution as the
verification score:

S_RVM(x_t) = μᵀk(x_t)

where k(x_t) is the empirical-LR-kernel vector of the test
i-vector x_t.
Speech Data and PLDA Models
• Evaluation dataset: common evaluation condition 2 of the NIST SRE
2012 male core set.
• Parameterization: 19 MFCCs together with energy, plus their
1st and 2nd derivatives → 60 dimensions.
• UBM: gender-dependent, 1024 mixtures.
• Total variability matrix: gender-dependent, 400 total factors.
• I-vector preprocessing:
– Whitening by WCCN, then length normalization,
– followed by LDA (500-dim → 200-dim) and WCCN.
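Two of the preprocessing steps above can be sketched compactly. The following is a minimal NumPy sketch of length normalization and of computing a WCCN projection, under the usual definition that WCCN whitens the average within-speaker covariance; function names are hypothetical:

```python
import numpy as np

def length_norm(X):
    """Project i-vectors onto the unit sphere (row-wise L2 norm)."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def wccn(X, labels):
    """Within-class covariance normalization: returns a projection B
    such that B^T W B = I, where W is the average within-speaker
    covariance of the rows of X."""
    classes = np.unique(labels)
    W = np.zeros((X.shape[1], X.shape[1]))
    for c in classes:
        Xc = X[labels == c] - X[labels == c].mean(axis=0)
        W += Xc.T @ Xc / len(Xc)
    W /= len(classes)
    # B B^T = W^{-1}  =>  B^T W B = I
    return np.linalg.cholesky(np.linalg.inv(W))
```

Applying `X @ B` then equalizes the within-speaker scatter in every direction, which is why WCCN is typically applied both before length normalization and again after LDA.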
Property of Empirical LR Kernels in SVM and RVM
– As the number of SVs increases, the performance of the
SVMs becomes poor.
– When the number of RVs is very small, the performance
of the RVMs is poor.
– Once the number of RVs is sufficient, the performance of the
RVMs is fairly stable and better than that of the SVMs.
Property of Empirical LR Kernels in SVM and RVM

[Figure: number of SVs/RVs vs. RBF width γ.]

– As the RBF width increases, the number of SVs
decreases first and then gradually increases.
– The number of RVs decreases monotonically as
the RBF width increases.
– The RVMs are less sparse than the SVMs over a wide
range of RBF widths.
Property of Empirical LR Kernels in SVM and RVM

[Figure: performance vs. RBF width γ.]

– For the RVMs to be effective, we need at least 50 RVs.
– Once the RVMs have sufficient RVs, their performance
can be better than that of the SVMs.
Performance Comparison (CC2 of SRE12, male)

Methods               EER (%)  MinNDCF
PLDA                   2.40     0.33
PLDA+UP-AVR            2.32     0.32
PLDA+SVM               2.07     0.31
PLDA+RVM-C             3.76     0.48
PLDA+RVM-R             2.32     0.28
PLDA+UP-AVR+SVM        1.97     0.30
PLDA+UP-AVR+RVM-C      3.00     0.42
PLDA+UP-AVR+RVM-R      1.94     0.28

– Both PLDA-SVM and PLDA-RVM (regression) perform better than PLDA.
– RVM classification performs poorly because of the small number of RVs.
– After utterance partitioning (UP-AVR), the performance of
both RVM regression and the SVM improves, and RVM regression
slightly outperforms the SVM.
Conclusions
• RVM classification is not appropriate for the
verification task in NIST SRE, but RVM regression
with the empirical LR kernel can achieve performance
comparable to that of PLDA-SVM.
• UP-AVR can boost the performance of both the SVM
and RVM regression, and after utterance partitioning
RVM regression performs slightly better than the SVM.
Optimization of RBF Width
The preferred value of γ for the RVMs is
between 1400 and 1500, and that for the
SVMs is between 1900 and 2000.
Property of Empirical LR Kernels in SVM and RVM
• 108 target speakers, together with their true-target trials and
impostor trials, were extracted from NIST 2012 SRE.
• The EER achieved by the SVMs and RVMs and their
corresponding numbers of support vectors (SVs) and
relevance vectors (RVs) were averaged across the 108
speakers.
Experiment Setup for SVM and RVM
• Optimization of RBF width: a development set was created for
optimizing the RBF parameter of each target speaker's SVM/RVM.
– True-target trials: UP-AVR was applied to each speaker's
enrollment utterances to generate a number of enrollment
i-vectors. Some of these i-vectors were used for training the
SVM/RVM model; the remaining i-vectors were used as
true-target trials.
– Impostor trials: 200 background utterances were selected
from previous NIST SREs.
– The RBF width was varied to maximize the difference between the
mean of the true-target scores and the mean of the impostor
scores.
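The selection criterion described above can be sketched as a simple grid search; `score_fn(trial, width)` is a hypothetical scorer standing in for the trained SVM/RVM:

```python
import numpy as np

def select_rbf_width(widths, score_fn, tgt_trials, imp_trials):
    """Pick the RBF width that maximizes the gap between the mean
    true-target score and the mean impostor score on a development
    set (the criterion from the slide above)."""
    def gap(w):
        t = np.mean([score_fn(x, w) for x in tgt_trials])
        i = np.mean([score_fn(x, w) for x in imp_trials])
        return t - i
    return max(widths, key=gap)
```

Using a score gap rather than EER on the small per-speaker development set sidesteps the instability of estimating error rates from only a handful of trials.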
Empirical Kernel Maps