DESIGNING A SPEAKER-DISCRIMINATIVE ADAPTIVE FILTER BANK


Spectral Features for Automatic Text-Independent Speaker Recognition
Tomi Kinnunen
Research seminar, 27.2.2004
Department of Computer Science
University of Joensuu
Based on a True Story …
T. Kinnunen: Spectral Features for Automatic Text-Independent
Speaker Recognition, Ph.Lic. thesis, 144 pages, Department of
Computer Science, University of Joensuu, 2004.
Downloadable in PDF from :
http://cs.joensuu.fi/pages/tkinnu/research/index.html
Introduction
Why Study Feature Extraction ?
• Feature extraction is the first component in the recognition chain, so the
choice of features strongly determines the classification accuracy
Why Study Feature Extraction ? (cont.)
• Typical feature extraction methods are borrowed directly
from the speech recognition task
→ Quite contradictory, considering the “opposite” nature of
the two tasks
• In general, it seems that at best we are currently
guessing what might be individual in our speech!
• Because it is interesting & challenging!
Principle of Feature Extraction
Studied Features
1. FFT-implemented filterbanks (subband processing)
2. FFT-cepstrum
3. LPC-derived features
4. Dynamic spectral features (delta features)
Speech Material & Evaluation Protocol
• Each test file is split into segments of T = 350 vectors (about 3.5 seconds of speech)
• Each segment is classified by vector quantization
• Speaker models are constructed from the training data with the RLS clustering algorithm
• Performance measure = classification error rate (%)
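The protocol above can be sketched as follows. This is a minimal illustration, not the thesis code: it assumes squared Euclidean distortion as the matching score, assumes that the speaker codebooks have already been trained (the RLS clustering step is not shown), and all function names are hypothetical.

```python
import numpy as np

def split_into_segments(features, T=350):
    """Split an (N, d) feature matrix into consecutive full segments of T vectors."""
    n_seg = features.shape[0] // T
    return [features[i * T:(i + 1) * T] for i in range(n_seg)]

def avg_quantization_distortion(segment, codebook):
    """Mean squared Euclidean distance from each vector to its nearest code vector."""
    # segment: (T, d), codebook: (K, d) -> pairwise squared distances: (T, K)
    diffs = segment[:, None, :] - codebook[None, :, :]
    d2 = np.sum(diffs ** 2, axis=2)
    return float(np.mean(np.min(d2, axis=1)))

def classify_segment(segment, codebooks):
    """Return the speaker whose codebook gives the smallest average distortion."""
    scores = {spk: avg_quantization_distortion(segment, cb)
              for spk, cb in codebooks.items()}
    return min(scores, key=scores.get)

def error_rate(segments, true_speaker, codebooks):
    """Classification error rate (%) over the segments of one test file."""
    wrong = sum(classify_segment(s, codebooks) != true_speaker for s in segments)
    return 100.0 * wrong / len(segments)
```

Here `codebooks` is assumed to be a dictionary mapping a speaker label to a (K, d) array of code vectors.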
1. Subband Features
Computation of Subband Features
Windowed speech frame
→ Magnitude spectrum by FFT
→ Smoothing by a filterbank
→ Nonlinear mapping of the filter outputs
→ Compressed filter outputs $f = (f_1, f_2, \ldots, f_M)^T$
Parameters of the filterbank:
• Number of subbands
• Filter shapes & bandwidths
• Type of frequency warping
• Filter output nonlinearity
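The processing chain and its parameters can be sketched as below. The sketch assumes a Hamming window, linearly spaced triangular filters defined on FFT bins, and an even FFT size; the function names and default values are illustrative and not taken from the thesis.

```python
import numpy as np

def triangular_filterbank(n_filters, n_fft):
    """Linearly spaced triangular filters over the FFT bins 0 .. n_fft/2."""
    n_bins = n_fft // 2 + 1
    # n_filters + 2 edge points: filter m rises from edges[m] to its peak
    # at edges[m + 1] and falls back to zero at edges[m + 2]
    edges = np.linspace(0, n_bins - 1, n_filters + 2)
    bins = np.arange(n_bins)
    fb = np.zeros((n_filters, n_bins))
    for m in range(n_filters):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (bins - lo) / (mid - lo)
        falling = (hi - bins) / (hi - mid)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def subband_features(frame, fb, nonlinearity="log"):
    """Window -> magnitude spectrum -> filterbank smoothing -> compression."""
    n_fft = 2 * (fb.shape[1] - 1)
    windowed = frame * np.hamming(len(frame))    # window choice is an assumption
    mag = np.abs(np.fft.rfft(windowed, n=n_fft))
    out = fb @ mag                               # one value per subband
    if nonlinearity == "linear":
        return out
    if nonlinearity == "log":
        return np.log1p(out)                     # log(1 + x)
    if nonlinearity == "cubic":
        return np.cbrt(out)                      # x^(1/3)
    raise ValueError(f"unknown nonlinearity: {nonlinearity}")
```

For example, `subband_features(frame, triangular_filterbank(30, 256))` would give a 30-dimensional vector for a frame of at most 256 samples.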
Frequency Warping… What’s That?!
• “Real” frequency axis (Hz) is stretched
and compressed locally according to a
(bijective) warping function
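As an illustration of a bijective warping function, the sketch below uses a commonly quoted mel-scale formula and its inverse; the slides do not state which formulas were used, so the constants here are an assumption. Placing filter centers uniformly on the warped axis and mapping them back to Hz gives dense spacing at low frequencies and sparse spacing at high frequencies.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Warp frequency in Hz to the mel scale (one common formula, assumed here)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse warping: mel back to Hz."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def warped_center_frequencies(n_filters, f_max_hz):
    """Filter centers in Hz that are equally spaced on the warped (mel) axis."""
    mel_points = np.linspace(0.0, hz_to_mel(f_max_hz), n_filters + 2)
    return mel_to_hz(mel_points)[1:-1]   # drop the two outer edge points
```

Other warpings (Bark, ERB) would follow the same pattern with a different function pair.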
[Figure: Bark scale; axes: Frequency [Bark] vs. Frequency [Hz], 0-4000 Hz]
[Figure: A 24-channel Bark-warped filterbank (shape: triangular, warping: Bark); axes: Gain vs. Frequency [kHz], 0-4 kHz]
Discrimination of Individual Subbands (F-ratio)
(Fixed parameters: 30 linearly spaced triangular filters)
[Figure: F-ratio vs. frequency; panels: Helsinki, TIMIT]
The low end (~0-200 Hz) and the mid/high frequencies (~2-4 kHz) are important;
the region ~200-2000 Hz is less important. (However, not consistently!)
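A per-subband F-ratio can be computed along the following lines. The sketch assumes the usual definition (variance of the speaker means divided by the average within-speaker variance, evaluated separately for each feature dimension); the exact normalization used in the thesis is not given on the slides.

```python
import numpy as np

def f_ratio(features_by_speaker):
    """F-ratio per feature dimension.

    features_by_speaker: list of (N_i, d) arrays, one array per speaker.
    Returns a length-d array; larger values mean better speaker separation.
    """
    means = np.array([x.mean(axis=0) for x in features_by_speaker])   # (S, d)
    within = np.array([x.var(axis=0) for x in features_by_speaker])   # (S, d)
    between_var = means.var(axis=0)      # variance of the speaker means
    return between_var / within.mean(axis=0)
```

Applied to subband outputs, each element of the result corresponds to one filter of the filterbank.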
Subband Features: The Effect of the Filter Output Nonlinearity
1. Linear: $f(x) = x$
2. Logarithmic: $f(x) = \log(1 + x)$
3. Cubic: $f(x) = x^{1/3}$
Fixed parameters: 30 linearly spaced triangular filters
[Figure: results for each nonlinearity; panels: Helsinki, TIMIT]
Consistent ordering of error rates (!): cubic < log < linear
Subband Features: The Effect of the Filter Shape
1. Rectangular
2. Triangular
3. Hanning
Fixed parameters: 30 linearly spaced filters, log-compression
[Figure: results for each filter shape; panels: Helsinki, TIMIT]
The differences are small, with no consistent ordering
→ the filter shape is probably not as crucial as the other parameters
Subband Features: The Number of Subbands (1)
Experiment 1: From 5 to 50
Fixed parameters: linearly spaced, triangular-shaped filters, log-compression
[Figure: results as a function of the number of subbands; panels: Helsinki, TIMIT]
Observation: error rates decrease monotonically with increasing
number of subbands (in most cases) …
Subband Features: The Number of Subbands (2)
Experiment 2: From 50 to 250
Fixed parameters: linearly spaced, triangular-shaped filters, log-compression
Helsinki: (Almost) monotonic decrease in errors with
increasing number of subbands
TIMIT: Optimum number of bands is in the range 50-100
Differences between corpora are (partly) explained by the
discrimination curves
Discussion of the Subband Features
• (Typically used) log-compression should be replaced with
cubic compression or some better nonlinearity
• Number of subbands should be relatively high (at least 50
based on these experiments)
• Shape of the filter does not seem to be important
• Discriminative information is not evenly distributed along the
frequency axis
• The relative discriminatory powers of the subbands depend
on the selected speaker population/language/speech
content…
2. FFT-Cepstral Features
Computation of FFT-Cepstrum
Processing is very similar to “raw” subband processing.
Common steps:
Windowed speech frame
→ Magnitude spectrum by FFT
→ Smoothing by a filterbank
→ Nonlinear mapping of the filter outputs
Additional steps:
→ Decorrelation by DCT
→ Coefficient selection
→ Cepstrum vector $c = (c_1, \ldots, c_M)^T$
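The two steps that separate the cepstrum from the raw subband features (DCT and coefficient selection) can be sketched as follows. The sketch assumes an unnormalized DCT-II and keeps the lowest coefficients while excluding c[0], as in the experiments described in these slides; the function names are illustrative.

```python
import numpy as np

def dct_ii(x):
    """Unnormalized DCT-II: c[k] = sum_m x[m] * cos(pi * k * (m + 0.5) / M)."""
    x = np.asarray(x, dtype=float)
    M = len(x)
    m = np.arange(M)
    k = np.arange(M)[:, None]
    basis = np.cos(np.pi * k * (m + 0.5) / M)    # (M, M) cosine basis
    return basis @ x

def cepstrum_from_filter_outputs(compressed, n_coeffs=15):
    """Decorrelate compressed filter outputs and keep c[1] .. c[n_coeffs]."""
    c = dct_ii(compressed)
    return c[1:n_coeffs + 1]                     # c[0] is excluded
```

The input `compressed` stands for the compressed filterbank outputs of one frame (for example, 30 log-compressed subband values).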
FFT-Cepstrum: Type of Frequency Warping
1. Linear warping
2. Mel-warping
3. Bark-warping
4. ERB-warping
Fixed parameters: 30 triangular filters, log-compression, DCT-transformed filter outputs,
15 lowest cepstral coefficients excluding c[0]
[Figure: results for each warping type; panels: Helsinki, TIMIT]
Helsinki: Mel-frequency warped cepstrum gives the best results on
average
TIMIT: Linearly warped cepstrum gives the best results on average
Same explanation as before: discrimination curves
FFT-Cepstrum: Number of Cepstral Coefficients
(Fixed parameters: mel-frequency warped triangular filters, log-compression,
DCT-transformed filter outputs, 15 lowest cepstral coefficients excluding c[0],
codebook size = 64)
[Figure panels: Helsinki, TIMIT]
Minimum number of coefficients is around 10, rather independent of
the number of filters
Discussion About the FFT-Cepstrum
• Same performance as with the subband features, but with a smaller
number of features
→ For computational and modeling reasons, the cepstrum is the
preferred method of these two in automatic recognition
• The commonly used mel-warped filterbank is not the best choice
in the general case!
There is no reason to assume that it would be, since the mel-cepstrum is
based on modeling of human hearing and was originally meant for speech
recognition purposes
• I prefer and recommend linear frequency warping, since:
It is easier to control the amount of resolution on desired subbands (e.g.
by linear weighting). In nonlinear warping, the relationship between
the “real” and “warped” frequency axes is more complicated
3. LPC-Derived Features
What Is Linear Predictive Coding (LPC) ?
• In the time domain, the current sample is approximated as a linear
combination of the past p samples: $\hat{s}[n] = \sum_{k=1}^{p} a[k]\, s[n-k]$
• The objective is to determine the LPC coefficients a[k], k = 1,…,p,
such that the squared prediction error is minimized
• In the frequency domain, the LPC coefficients define an all-pole IIR filter whose poles
correspond to local maxima of the magnitude spectrum
An LPC pole
Computation of LPC and LPC-Based Features
Windowed speech frame
→ Autocorrelation computation
→ Solving of the Yule-Walker AR equations (Levinson-Durbin algorithm)
→ LPC coefficients (LPC) and reflection coefficients (REFL)
From the LPC coefficients:
• Complex polynomial expansion → root-finding algorithm → line spectral frequencies (LSF)
• LPC pole finding → formants (FMT)
• Atal’s recursion → linear predictive cepstral coefficients (LPCC)
From the reflection coefficients:
• LAR conversion → log area ratios (LAR)
• asin(.) → arcsine coefficients (ARCSIN)
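Part of this chain can be sketched as below. Several conventions are assumed rather than taken from the slides: the predictor is written as ŝ[n] = Σ a[k] s[n−k], the log area ratio is taken as log((1 + k)/(1 − k)) (the sign convention varies between texts), and the cepstrum recursion is the standard one for the all-pole model 1/(1 − Σ a[k] z^−k). The LSF branch is omitted, and all function names are hypothetical.

```python
import numpy as np

def autocorrelation(frame, p):
    """Autocorrelation lags r[0] .. r[p] of a (windowed) frame."""
    n = len(frame)
    return np.array([np.dot(frame[:n - i], frame[i:]) for i in range(p + 1)])

def levinson_durbin(r, p):
    """Solve the Yule-Walker equations.

    Returns predictor coefficients a[1..p], reflection coefficients k[1..p]
    and the final prediction error energy.
    """
    a = np.zeros(p + 1)                  # a[0] is unused
    k = np.zeros(p + 1)
    err = r[0]
    for i in range(1, p + 1):
        acc = r[i] - np.dot(a[1:i], r[i - 1:0:-1])
        k[i] = acc / err
        a_new = a.copy()
        a_new[i] = k[i]
        a_new[1:i] = a[1:i] - k[i] * a[i - 1:0:-1]
        a = a_new
        err *= 1.0 - k[i] ** 2
    return a[1:], k[1:], err

def lpcc(a, n_ceps):
    """LPC-derived cepstrum c[1..n_ceps] via the standard recursion."""
    p = len(a)
    a_pad = np.concatenate(([0.0], a, np.zeros(max(0, n_ceps - p))))
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a_pad[n]                   # zero for n > p
        for j in range(1, n):
            acc += (j / n) * c[j] * a_pad[n - j]
        c[n] = acc
    return c[1:]

def log_area_ratios(k):
    """LAR from reflection coefficients (sign convention is an assumption)."""
    return np.log((1.0 + k) / (1.0 - k))

def arcsin_coefficients(k):
    """Arcsine coefficients from reflection coefficients."""
    return np.arcsin(k)
```

For a frame `x` and order p = 15, a typical call sequence would be `a, k, err = levinson_durbin(autocorrelation(x, 15), 15)` followed by `lpcc(a, 15)`, `log_area_ratios(k)` or `arcsin_coefficients(k)`.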
Linear Prediction (LPC): Number of LPC Coefficients
[Figure: results as a function of the number of LPC coefficients; panels: Helsinki, TIMIT]
• Minimum number is around 15 coefficients (not consistent, however)
• Error rates are surprisingly small in general!
• LPC coefficients were used directly in a Euclidean-distance-based classifier. The literature
usually carries a warning of the following form: “Do not ever use LPC’s directly, at least
with the Euclidean metric.”
Comparison of the LPC-Derived Features
Fixed parameters: LPC predictor order p = 15
[Figure: results for the LPC-derived feature sets; panels: Helsinki, TIMIT; slide annotation: “A programming bug???”]
• Overall performance is very good
• Raw LPC coefficients give the worst performance on average
• Differences between feature sets are rather small
→ Other factors to be considered:
• Computational complexity
• Ease of implementation
LPC-Derived Formants
Fixed parameters: Codebook size = 64
[Figure: results for the LPC-derived formants; panels: Helsinki, TIMIT]
• Formants give comparable, and surprisingly good results !
• Why “surprisingly good” ?
1. Analysis procedure was very simple (produces spurious formants)
2. Subband processing, LPC, cepstrum, etc… describe the spectrum
continuously - formants on the other hand pick only a discrete number of
maximum peaks’ amplitudes from the spectrum (and a small number!)
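A very simple pole-based analysis of the kind described above might look like the following sketch. It assumes the predictor convention ŝ[n] = Σ a[k] s[n−k], so the poles are the roots of z^p − a[1] z^(p−1) − … − a[p], and it returns the frequency of every complex pole without any bandwidth check, which is exactly why spurious candidates can appear. This is an illustration, not the procedure used in the thesis.

```python
import numpy as np

def lpc_pole_frequencies(a, fs):
    """Candidate formant frequencies (Hz) from the angles of the LPC poles.

    a:  predictor coefficients a[1..p]
    fs: sampling rate in Hz
    """
    poly = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
    roots = np.roots(poly)
    # keep one pole of each complex-conjugate pair (positive imaginary part)
    roots = roots[np.imag(roots) > 1e-9]
    freqs = np.angle(roots) * fs / (2.0 * np.pi)
    return np.sort(freqs)
```

A practical formant tracker would usually also discard poles with a wide bandwidth (small radius); that step is left out here.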
Discussion About the LPC-Derived Features
• In general, results are promising, even for the raw LPC
coefficients
• The differences between feature sets were small
– From the implementation and efficiency viewpoint the following
are the most attractive: LPCC, LAR and ARCSIN
• Formants give (surprisingly) good results also, which
indicates indirectly:
– The regions of spectrum with high amplitude might be important
for speaker recognition
An idea for future study: How about selecting subbands around local maxima?
[Figure: magnitude spectrum; axes: Magnitude [dB] vs. Frequency [Hz], 0-6000 Hz]
4. Dynamic Features
Dynamic Spectral Features
• Dynamic feature: an estimate of the time derivative of the feature
• Can be applied to any feature
Time trajectory of the original feature
Estimate of the 1st time derivative (Δ-feature)
Estimate of the 2nd time derivative (ΔΔ-feature)
• Two widely used estimation methods are the differentiator and the linear
regression method (M = number of neighboring frames, typically M = 1..3)
• Typical phrase: “Don’t use differentiator, it emphasizes noise”
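The two estimators can be sketched as follows, using their commonly quoted forms: the differentiator as f[t+M] − f[t−M], and the regression estimate as Σ m·f[t+m] / Σ m² with m running from −M to M. The slides do not show the exact formulas, so these forms, the edge handling (repeating the first and last frame), and the function names are assumptions.

```python
import numpy as np

def delta_differentiator(feats, M=1):
    """Delta features by simple differencing: f[t + M] - f[t - M]."""
    padded = np.pad(feats, ((M, M), (0, 0)), mode="edge")
    return padded[2 * M:] - padded[:-2 * M]

def delta_regression(feats, M=1):
    """Delta features as the least-squares slope over 2M + 1 frames."""
    T = feats.shape[0]
    padded = np.pad(feats, ((M, M), (0, 0)), mode="edge")
    num = np.zeros_like(feats, dtype=float)
    for m in range(1, M + 1):
        # m * (f[t + m] - f[t - m]) for all frames t at once
        num += m * (padded[M + m:M + m + T] - padded[M - m:M - m + T])
    return num / (2.0 * sum(m * m for m in range(1, M + 1)))
```

Here `feats` is a (T, d) matrix with one feature vector per frame. Applying either function twice gives an estimate of the second time derivative.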
Delta Features: Comparison of the Two Estimation Methods
TIMIT
Differentiator
Helsinki
Best: Δ-ARCSIN (8.1 %), M=4
Regression
Best: Δ-LSF (7.0 %), M=1
Best: Δ-LSF (10.6 %), M=2
Best: Δ-ARCSIN (8.8 %), M=1
Delta Features :
Comparison with the Static Features
Discussion About the Delta Features :
• Optimum order is small (in most cases M = 1 or 2 neighboring frames)
• The differentiator method is better in most cases (a surprising result, again!)
• Delta features are worse than static features but might provide uncorrelated
extra information (for multiparameter recognition)
• The commonly used delta-cepstrum gives quite poor results!
Towards Concluding Remarks ...
FFT-Cepstrum Revisited :
Question : Is Log-Compression / Mel-Cepstrum Best ?
Please note: the segment length is now reduced to T = 100 vectors, which is why
absolute recognition rates are worse than before (ran out of time for the thesis…)
[Figure panels: Helsinki, TIMIT]
Answer: NO!
FFT- vs. LPC-Cepstrum:
Question: Is it really true that “FFT-cepstrum is more accurate”?
[Figure panels: Helsinki, TIMIT]
Answer: NO! (TIMIT shows this quite clearly)
The Essential Difference Between the
FFT- and LPC-Cepstra ?
• FFT-cepstrum approximates the spectrum
by linear combination of cosine functions
(non-parametric model)
• LPC makes a least-squares fit of the all-pole filter to the spectrum (parametric
model)
• FFT-cepstrum first smooths the original
spectrum with a filterbank, whereas the LPC filter
is fitted directly to the original spectrum
LPC captures more “details”
FFT-cepstrum represents
“smooth” spectrum
However, one might argue that we could drop out the filterbank from FFT-cepstrum ...
General Summary and Discussion
• Number of subbands should be high (30-50 for these corpora)
• Number of cepstral coefficients (LPC/FFT-based) should be high ( 15)
• In particular, number of subbands, coefficients, and LPC order are
clearly higher than in speech recognition generally
• Formants give (surprisingly) good performance
• Number of formants should be high ( 8)
• In most cases, the differentiator method outperforms the regression
method in delta-feature computation
All of these indicate indirectly the importance of spectral
details and rapid spectral changes
“Philosophical Discussion”
• The current knowledge of speaker individuality is far from perfect :
• Engineers concentrate on tuning complex feature compensation methods but
don’t (necessarily) understand what’s individual in speech
• Phoneticians try to find the “individual code” in the speech signal, but they
don’t (necessarily) know how to apply engineers’ methods
• Why do we believe that speech would be any less individual than
e.g. fingerprints?
• Compare the history of the “fingerprint” and the “voiceprint”:
• Fingerprints have been studied systematically since the 17th century (1684)
• The spectrograph wasn’t invented until 1946! How could we possibly claim that we
know what speech is with less than 60 years of research?
• Why do we believe that human beings are optimal speaker
discriminators? Our ear can be fooled already (e.g. MP3 encoding).
That’s All, Folks !