Speaker Recognition - University at Buffalo
Speaker Recognition
Sharat S. Chikkerur
Center for Unified Biometrics and Sensors
http://www.cubs.buffalo.edu
Speech Fundamentals
Characterizing speech:
• Content (speech recognition)
• Signal representation (vocoding)
  • Waveform
  • Parametric (excitation, vocal tract)
• Signal analysis (gender determination, speaker recognition)
Terminologies
Phonemes:
• Basic discrete units of speech.
• English has around 42 phonemes.
• Language specific.
Types of speech:
• Voiced speech
• Unvoiced speech (fricatives)
• Plosives
Formants
Speech production
[Diagram: the speech production mechanism (vocal tract, about 17 cm long) and the corresponding source-filter production model. For voiced speech, an impulse train generator at the pitch period, scaled by gain Av, drives the glottal pulse model G(z); for unvoiced speech, a noise source scaled by gain AN provides the excitation. Either excitation passes through the vocal tract model V(z) and the radiation model R(z) to produce the speech signal.]
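The source-filter model above can be sketched in code: an impulse train excites an all-pole filter standing in for the vocal tract. This is a minimal illustration; the sampling rate, pitch, and filter coefficients are invented for the example, and the glottal and radiation filters G(z), R(z) are folded into the single toy filter.

```python
import numpy as np

# Source-filter sketch: impulse-train excitation (voiced speech) driving a
# toy all-pole vocal tract model V(z) = 1 / (1 - sum_k a_k z^-k).
fs = 8000                       # sampling rate in Hz (assumed)
pitch_hz = 100                  # fundamental frequency of the impulse train
n_samples = 4000

# Impulse train generator: one unit impulse per pitch period
period = fs // pitch_hz
excitation = np.zeros(n_samples)
excitation[::period] = 1.0

# Hypothetical stable predictor coefficients (poles inside the unit circle)
a = np.array([1.3, -0.6])
speech = np.zeros(n_samples)
for i in range(n_samples):
    acc = excitation[i]
    for k, ak in enumerate(a, start=1):
        if i - k >= 0:
            acc += ak * speech[i - k]   # feedback through the all-pole filter
    speech[i] = acc
```

Replacing the impulse train with white noise (scaled by AN) would model unvoiced excitation in the same framework.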
Nature of speech
[Figure: a speech waveform (amplitude between −1 and 1, roughly 10,000 samples) and its spectrogram.]
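A spectrogram like the one on this slide is the magnitude of the DFT of overlapping windowed frames. Below is a minimal numpy-only sketch; the frame length, hop size, and test signal are illustrative choices, not values from the slides.

```python
import numpy as np

# Minimal spectrogram: magnitude spectrum of overlapping Hamming-windowed
# frames, stacked into a (frames x frequency-bins) matrix.
def spectrogram(x, frame_len=256, hop=128):
    window = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, frame_len//2 + 1)

x = np.sin(2 * np.pi * 0.1 * np.arange(2048))   # toy tone at 0.1 cycles/sample
S = spectrogram(x)
```

For the toy tone, energy concentrates near bin 0.1 × 256 ≈ 26 in every frame.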
Vocal Tract modeling
[Figure: signal spectrum and smoothed signal spectrum.]
• The smoothed spectrum indicates the locations of each speaker's formants.
• The smoothed spectrum is obtained from the cepstral coefficients.
Parametric Representations: Formants
Formant frequencies:
• Characterize the frequency response of the vocal tract
• Used in the characterization of vowels
• Can be used to determine gender
[Figure: two magnitude spectra (0–4000 Hz) showing formant peaks.]
Parametric Representations: LPC
Linear predictive coefficients:
• Used in vocoding
• Spectral estimation

s[n] = Σ_{k=1}^{p} a_k s[n−k] + G u[n]
[Figures: speech segments with their signal spectra and smooth LPC spectral envelopes (0–4000 Hz).]
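The LPC coefficients in s[n] = Σ a_k s[n−k] + G u[n] are commonly estimated with the autocorrelation method and the Levinson-Durbin recursion. The slides do not show the algorithm, so the sketch below is a standard textbook version; it returns the polynomial A(z) coefficients [1, a_1, …, a_p], whose negations are the predictor coefficients in the slide's convention.

```python
import numpy as np

# LPC via the autocorrelation method + Levinson-Durbin recursion.
def lpc(x, order):
    # Biased autocorrelation r[0..order]
    r = np.array([x[:len(x) - k] @ x[k:] for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]                                   # prediction error energy
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[1:i][::-1])
        k = -acc / err                           # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err

# A decaying exponential behaves like an AR(1) signal: a ≈ [1, -0.5],
# i.e. predictor coefficient +0.5 in the slide's convention.
a, err = lpc(0.5 ** np.arange(200), 1)
```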
Parametric Representations: Cepstrum
The cepstrum is a homomorphic transform that converts convolution into addition, separating the excitation (pitch impulse train p[n], gains Av and AN, noise u[n]) from the vocal tract and radiation filters G(z), V(z), R(z):

x1[n] * x2[n] → DFT[] → X1(z)X2(z) → LOG[] → log(X1(z)) + log(X2(z)) → IDFT[] → x1'[n] + x2'[n]

The characteristic system D[] (DFT, log, IDFT), followed by a linear filter L[] and the inverse system D⁻¹[], can then process the two convolved components separately.
[Figures: a speech segment, its log magnitude spectrum, and its cepstrum.]
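The DFT → LOG → IDFT chain above is the real cepstrum, which can be sketched in a few lines. The small offset inside the log is a numerical safeguard of mine, not part of the definition.

```python
import numpy as np

# Real cepstrum: IDFT of the log magnitude spectrum. Low-quefrency bins
# capture the smooth vocal-tract envelope; for voiced speech a peak at
# higher quefrency reveals the pitch period.
def real_cepstrum(x):
    spectrum = np.fft.fft(x)
    log_mag = np.log(np.abs(spectrum) + 1e-12)   # offset avoids log(0)
    return np.real(np.fft.ifft(log_mag))

c = real_cepstrum(np.random.randn(512))
```

Because |X1(k)X2(k)| = |X1(k)||X2(k)|, the real cepstrum of a (circular) convolution is the sum of the individual cepstra, which is exactly the homomorphic property the slide's diagram depicts.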
Speaker Recognition
Definition:
• The method of recognizing a person based on his or her voice.
• One of the forms of biometric identification.
• Depends on speaker-dependent characteristics.
Speech Applications
• Transmission
• Speech synthesis
• Speech enhancement
• Aids to the handicapped
• Speech recognition
• Speaker verification

Speaker Recognition:
• Speaker Identification (text dependent / text independent)
• Speaker Detection
• Speaker Verification (text dependent / text independent)
Generic Speaker Recognition System
[Diagram: speech signal → preprocessing → analysis frames → feature extraction → feature vector. During enrollment the feature vectors build a speaker model; during verification, pattern matching of the feature vectors against the speaker model produces a score.]
Preprocessing:
• A/D conversion
• End point detection
• Pre-emphasis filter
• Segmentation
Choice of features:
• LAR, Cepstrum, LPCC, MFCC
• Differentiating factors b/w speakers include vocal tract shape and behavioral traits
• Features should have high inter-speaker and low intra-speaker variation
Pattern matching (speaker models):
• Stochastic models: GMM, HMM
• Template models: DTW, distance measures
Our Approach
• Preprocessing: silence removal
• Feature extraction: cepstrum coefficients, cepstral normalization, polynomial function expansion
• Speaker model: reference template
• Matching: dynamic time warping, distance computation
Silence Removal
Frames whose short-time energy is small relative to the long-time average energy are treated as silence and removed:

W_n = E_n / E_avg
E_n = (1/N) Σ_{k=1}^{N} (x[k] w[n−k])²   (short-time energy of frame n)
E_avg = (1/N) Σ_{k=1}^{N} x[k]²          (long-time average energy)
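The energy comparison above can be sketched as follows. The frame length, the rectangular window, and the 0.1 threshold factor are assumptions for the example; the slides do not state a threshold.

```python
import numpy as np

# Silence removal: drop frames whose short-time energy E_n falls below a
# fraction of the long-time average energy E_avg.
def remove_silence(x, frame_len=160, threshold=0.1):
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    frame_energy = np.mean(frames ** 2, axis=1)   # E_n for each frame
    long_time_avg = np.mean(x ** 2)               # E_avg over the utterance
    keep = frame_energy >= threshold * long_time_avg
    return frames[keep].ravel()

# Toy utterance: silence, a tone, then silence again
x = np.concatenate([np.zeros(800), np.sin(0.2 * np.arange(800)), np.zeros(800)])
voiced = remove_silence(x)
```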
Pre-emphasis
A first-order high-pass filter compensates for the spectral tilt of speech before analysis:

H(z) = 1 − a z⁻¹,  a = 0.95
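In the time domain, the filter H(z) = 1 − a z⁻¹ is a single difference equation:

```python
import numpy as np

# Pre-emphasis: y[n] = x[n] - a * x[n-1], with a = 0.95 as on the slide.
def pre_emphasis(x, a=0.95):
    y = np.empty_like(x, dtype=float)
    y[0] = x[0]                  # first sample has no predecessor
    y[1:] = x[1:] - a * x[:-1]
    return y

y = pre_emphasis(np.ones(10))    # a constant (DC) input is mostly removed
```

For a constant input the output after the first sample is 1 − 0.95 = 0.05, showing the high-pass (DC-suppressing) behavior.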
Segmentation
The speech signal is segmented into overlapping 'Analysis Frames'. The speech signal is assumed to be stationary within each frame (short-time analysis):

Q_n = Σ_k x[k] w[n−k]
w[n] = 0.54 − 0.46 cos(2πn / N)
Q_n: n-th analysis frame
N: length of the analysis frame
[Figure: overlapping analysis frames Q31–Q34 along the waveform.]
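Framing with the slide's Hamming window can be sketched as below. The frame length and hop (240 and 80 samples, i.e. 30 ms frames with 10 ms steps at 8 kHz) are typical values I have assumed, not taken from the slides.

```python
import numpy as np

# Split a signal into overlapping analysis frames Q_n, each multiplied by
# the Hamming window w[n] = 0.54 - 0.46 cos(2*pi*n / N) from the slide.
def analysis_frames(x, frame_len=240, hop=80):
    n = np.arange(frame_len)
    window = 0.54 - 0.46 * np.cos(2 * np.pi * n / frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] * window
                     for i in range(n_frames)])

frames = analysis_frames(np.random.randn(2400))
```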
Feature Representation
[Figure: speech signal and spectrum of two users uttering 'ONE'.]
Speaker Model
Each cepstral coefficient trajectory c_j over a sliding block of 9 frames is expanded with the first-order orthogonal polynomial P_1j = j − 5, j = 1…9:

b = Σ_{j=1}^{9} c_j P_1j / Σ_{j=1}^{9} P_1j²

The reference template is the resulting sequence of feature vectors:
F1 = [a1…a10, b1…b10]
F2 = [a1…a10, b1…b10]
…
FN = [a1…a10, b1…b10]
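The polynomial expansion can be sketched as follows. This is my reading of the slide's fragments: P_1j = j − 5 over 9-frame blocks, with the a-part of each feature vector taken as the block average (the zeroth-order fit) and the b-part as the first-order slope; the exact block arrangement is an assumption.

```python
import numpy as np

# Polynomial function expansion over a block of 9 consecutive frames.
# cep_block has shape (9, n_coeffs): one row of cepstral coefficients
# per frame.
def poly_features(cep_block):
    p1 = np.arange(1, 10) - 5          # P_1j = j - 5, j = 1..9
    a = cep_block.mean(axis=0)         # zeroth-order (average) term
    b = (p1 @ cep_block) / (p1 @ p1)   # first-order (slope) term
    return np.concatenate([a, b])      # F = [a1..a10, b1..b10]

# Toy block: every coefficient rises linearly 1..9 across the frames,
# so each average is 5 and each slope is 1.
F = poly_features(np.tile(np.arange(1.0, 10.0)[:, None], (1, 10)))
```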
Dynamic Time Warping
X(n) = {x1(n), x2(n), …, xK(n)}, n = 1…N
Y(m) = {y1(m), y2(m), …, yK(m)}, m = 1…M
m = w(n): the warping function aligns the two sequences, which need not have the same length.
• The DTW warping path in the N-by-M matrix is the path with minimum average cumulative cost. The unmarked area is the constraint region the path is allowed to traverse.

D_T = min_w Σ_{n=1}^{N} D(X(n), Y(w(n)))
D(X(n), Y(m)) = Σ_{i=1}^{K} (r_i(n) − t_i(m))²
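The minimization over warping paths is solved by dynamic programming. The sketch below uses the classic (diagonal, up, left) step pattern with squared Euclidean local distance; the slides do not specify the step pattern or global path constraints, so those are assumptions.

```python
import numpy as np

# Minimal DTW: fill an (N+1) x (M+1) cumulative-cost matrix and return the
# total cost of the cheapest warping path aligning X with Y.
def dtw_distance(X, Y):
    N, M = len(X), len(Y)
    D = np.full((N + 1, M + 1), np.inf)
    D[0, 0] = 0.0
    for n in range(1, N + 1):
        for m in range(1, M + 1):
            cost = np.sum((X[n - 1] - Y[m - 1]) ** 2)   # local distance
            D[n, m] = cost + min(D[n - 1, m - 1],        # match
                                 D[n - 1, m],            # insertion
                                 D[n, m - 1])            # deletion
    return D[N, M]

# Y repeats a frame of X, so a zero-cost alignment exists.
d = dtw_distance(np.array([[0.], [1.], [2.]]),
                 np.array([[0.], [1.], [1.], [2.]]))
```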
Results
DTW distance matrix between six utterances (a0, a1, r0, r1, s0, s1):

      a0      a1      r0      r1      s0      s1
a0    0       0.1226  0.3664  0.3297  0.4009  0.4685
a1    0.1226  0       0.5887  0.3258  0.4086  0.4894
r0    0.3664  0.5887  0       0.0989  0.3299  0.4243
r1    0.3297  0.3258  0.0989  0       0.367   0.4287
s0    0.4009  0.4086  0.3299  0.367   0       0.1401
s1    0.4685  0.4894  0.4243  0.4287  0.1401  0

• Distances are normalized w.r.t. the length of the speech signal
• Intra-speaker distance is less than inter-speaker distance
• The distance matrix is symmetric
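The claims made about the results (symmetry, intra-speaker distances smaller than inter-speaker distances) can be checked directly against the table. The grouping of the six utterances into speaker pairs a0/a1, r0/r1, s0/s1 is assumed from the labels.

```python
import numpy as np

# DTW distance matrix transcribed from the results table.
D = np.array([
    [0,      0.1226, 0.3664, 0.3297, 0.4009, 0.4685],
    [0.1226, 0,      0.5887, 0.3258, 0.4086, 0.4894],
    [0.3664, 0.5887, 0,      0.0989, 0.3299, 0.4243],
    [0.3297, 0.3258, 0.0989, 0,      0.367,  0.4287],
    [0.4009, 0.4086, 0.3299, 0.367,  0,      0.1401],
    [0.4685, 0.4894, 0.4243, 0.4287, 0.1401, 0],
])
speaker = [0, 0, 1, 1, 2, 2]          # utterance index -> assumed speaker
intra = [D[i, j] for i in range(6) for j in range(i + 1, 6)
         if speaker[i] == speaker[j]]
inter = [D[i, j] for i in range(6) for j in range(i + 1, 6)
         if speaker[i] != speaker[j]]
symmetric = bool(np.allclose(D, D.T))
```

Every intra-speaker distance (at most 0.1401) is indeed smaller than every inter-speaker distance (at least 0.3258), and the matrix is symmetric.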
Matlab Implementation
THANK YOU