
Speaker Recognition
Sharat S. Chikkerur
Center for Unified Biometrics and Sensors
http://www.cubs.buffalo.edu
Speech Fundamentals

Characterizing speech:
• Content (speech recognition)
• Signal representation (vocoding)
  - Waveform
  - Parametric (excitation, vocal tract)
• Signal analysis (gender determination, speaker recognition)
Terminologies

Phonemes:
• Basic discrete units of speech
• English has around 42 phonemes
• Language specific

Types of speech:
• Voiced speech
• Unvoiced speech (fricatives)
• Plosives

Formants: resonant frequencies of the vocal tract (described on the following slides)
Speech production

[Figure: the human speech production mechanism (vocal tract, approx. 17 cm long) alongside the source-filter speech production model: an impulse train generator (pitch period, gain Av) drives a glottal pulse model G(z) for voiced speech, while a noise source (gain AN) models unvoiced speech; the excitation passes through a vocal tract model V(z) and a radiation model R(z) to produce the speech signal.]
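To make the source-filter model concrete, here is a minimal Python sketch (not from the slides) that drives a toy vocal tract model V(z), built as a cascade of two-pole resonators, with an impulse-train excitation. The formant frequencies and bandwidths are textbook values for the vowel /a/, not values given in this deck:

```python
import numpy as np
from scipy.signal import lfilter

def synthesize_vowel(f0=100, fs=8000, dur=0.5,
                     formants=((730, 60), (1090, 90), (2440, 120))):
    """Toy source-filter synthesis: impulse train -> cascade of resonators.

    f0: pitch in Hz (impulse-train period of fs/f0 samples).
    formants: (center frequency, bandwidth) pairs in Hz; hypothetical
    textbook values for /a/, since the slides do not specify any.
    """
    n = int(fs * dur)
    excitation = np.zeros(n)
    excitation[::fs // f0] = 1.0          # voiced excitation: impulse train
    speech = excitation
    for freq, bw in formants:             # each resonator is one pole pair of V(z)
        r = np.exp(-np.pi * bw / fs)      # pole radius from bandwidth
        theta = 2.0 * np.pi * freq / fs   # pole angle from center frequency
        speech = lfilter([1.0], [1.0, -2 * r * np.cos(theta), r * r], speech)
    return speech / np.max(np.abs(speech))
```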
Nature of speech

[Figure: speech waveform (amplitude vs. sample index, 0-10000) and its spectrogram.]
Vocal Tract modeling

Signal spectrum and smoothed signal spectrum:
• The smoothed spectrum indicates the locations of each speaker's formants
• The smoothed spectrum is obtained from the cepstral coefficients
Parametric Representations: Formants

Formant frequencies:
• Characterize the frequency response of the vocal tract
• Used in the characterization of vowels
• Can be used to determine gender

[Figure: spectra of two speech frames (magnitude vs. frequency, 0-4000 Hz) with formant peaks visible.]
Parametric Representations: LPC

Linear predictive coefficients:
• Used in vocoding
• Spectral estimation

$s[n] = \sum_{k} a_k s[n-k] + G\,u[n]$

[Figure: speech frames and their LPC spectral estimates (magnitude vs. frequency, 0-4000 Hz).]
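A minimal sketch of LPC estimation via the autocorrelation method, assuming NumPy/SciPy; the model order of 10 and the formant-picking step are illustrative choices, not specified by the slides:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc(frame, order=10):
    """Solve the autocorrelation normal equations R a = r for the
    coefficients of s[n] = sum_k a_k s[n-k] + G u[n]."""
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    return solve_toeplitz(r[:order], r[1:order + 1])

def formants_from_lpc(frame, fs=8000, order=10):
    """Rough formant estimates: frequencies of the complex poles of
    the LPC synthesis filter 1 / (1 - sum_k a_k z^-k)."""
    a = lpc(frame, order)
    poles = np.roots(np.concatenate(([1.0], -a)))
    poles = poles[poles.imag > 0]                  # one of each conjugate pair
    return np.sort(np.angle(poles) * fs / (2 * np.pi))
```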
Parametric Representations: Cepstrum

[Figure: homomorphic analysis of the speech production model. The excitation (impulse train p[n] with pitch period and gain Av through the glottal model G(z), or a noise source u[n] with gain AN) is convolved with the vocal tract V(z) and radiation R(z) responses. The homomorphic system D[] (realized as DFT, LOG, IDFT, with inverse D^{-1}[]) maps this convolution into a sum:]

$x_1[n] * x_2[n] \;\xrightarrow{\text{DFT}}\; X_1(z)X_2(z) \;\xrightarrow{\text{LOG}}\; \log X_1(z) + \log X_2(z) \;\xrightarrow{\text{IDFT}}\; x_1'[n] + x_2'[n]$

[Figure: a speech segment, its log spectrum, and its cepstrum.]
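A minimal NumPy sketch of the DFT → LOG → IDFT pipeline above; the liftering cutoff of 20 coefficients used to recover the smoothed envelope is an illustrative assumption:

```python
import numpy as np

def real_cepstrum(frame):
    """c[n] = IDFT(log |DFT(x)|): the log turns the source*filter
    convolution into a sum, so low quefrencies capture the vocal tract
    envelope while a peak at the pitch period reveals F0."""
    spectrum = np.fft.fft(frame)
    return np.real(np.fft.ifft(np.log(np.abs(spectrum) + 1e-10)))

def smoothed_spectrum(frame, n_coeffs=20):
    """Liftering: keep only low-quefrency cepstral coefficients to get
    the smoothed spectral envelope shown on the earlier slide."""
    c = real_cepstrum(frame)
    c[n_coeffs:-n_coeffs] = 0.0              # zero out high quefrencies
    return np.real(np.fft.fft(c))            # log-spectrum envelope
```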
Speaker Recognition

Definition:
• The method of recognizing a person based on his or her voice
• One of the forms of biometric identification
• Depends on speaker-dependent characteristics of speech
Speech Applications

• Transmission
• Speech synthesis
• Speech enhancement
• Aids for the handicapped
• Speech recognition
• Speaker recognition
  - Speaker identification (text dependent / text independent)
  - Speaker detection
  - Speaker verification (text dependent / text independent)
Generic Speaker Recognition System

Speech signal → Preprocessing → Analysis frames → Feature extraction → Feature vector → Pattern matching → Score → Verification

Preprocessing:
• A/D conversion
• End point detection
• Pre-emphasis filter
• Segmentation

Feature extraction (choice of features):
• LAR
• Cepstrum
• LPCC
• MFCC

Speaker model (built during enrollment, matched against the feature vectors):
• Stochastic models: GMM, HMM
• Template models: DTW, distance measures

Differentiating factors between speakers include vocal tract shape and behavioral traits.
Features should have high inter-speaker and low intra-speaker variation.
Our Approach

• Preprocessing: silence removal
• Feature extraction: cepstrum coefficients, cepstral normalization (one common variant is sketched below), polynomial function expansion
• Speaker model: reference template, long-time average
• Matching: dynamic time warping, distance computation
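The slides list cepstral normalization without specifying the variant; a minimal sketch of cepstral mean normalization, a common choice, follows under that assumption:

```python
import numpy as np

def cepstral_mean_normalization(ceps):
    """Subtract the per-utterance mean from each cepstral dimension,
    removing stationary channel effects. ceps: (n_frames, n_ceps)."""
    return ceps - ceps.mean(axis=0, keepdims=True)
```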
Silence Removal
(Preprocessing stage)

Frames whose short-time energy $E_n$, computed over a sliding window $W_n$, falls below the long-time average energy $E_{avg}$ are treated as silence and removed:

$E_n = \sum_{k=1}^{N} \left( x[k]\, w[n-k] \right)^2, \qquad E_{avg} = \frac{1}{N} \sum_{k=1}^{N} x[k]^2$
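A minimal sketch of energy-based silence removal following the equations above, assuming non-overlapping frames; the frame length and threshold factor are hypothetical, since the slides give no values:

```python
import numpy as np

def remove_silence(x, frame_len=256, threshold=0.5):
    """Drop frames whose short-time energy E_n falls below a fraction
    of the long-time average energy E_avg (scaled to frame energy)."""
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).sum(axis=1)            # E_n per frame
    e_avg = (x ** 2).mean() * frame_len           # E_avg, per-frame equivalent
    return frames[energy >= threshold * e_avg].reshape(-1)
```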
Pre-emphasis
(Preprocessing stage)

$H(z) = 1 - a z^{-1}, \qquad a = 0.95$
Segmentation
(Preprocessing stage)

• The speech signal is segmented into overlapping 'analysis frames'
• The speech signal is assumed to be stationary within each frame

Short-time analysis:

$Q_n = \sum_{k=-\infty}^{\infty} x[k]\, w[n-k], \qquad w[n] = 0.54 - 0.46 \cos\!\left(\frac{2\pi n}{N}\right)$

where $Q_n$ is the $n$-th analysis frame and $N$ is the length of the analysis frame ($w[n]$ is the Hamming window).

[Figure: overlapping analysis frames Q31, Q32, Q33, Q34 over the waveform.]
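A minimal sketch of the framing step, assuming a 256-sample Hamming-windowed frame with 50% overlap (the slides do not state the frame length or hop):

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    """Cut x into overlapping analysis frames Q_n, each weighted by
    the Hamming window w[n] = 0.54 - 0.46*cos(2*pi*n/N)."""
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(frame_len) / frame_len)
    starts = range(0, len(x) - frame_len + 1, hop)
    return np.array([x[s:s + frame_len] * w for s in starts])
```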
Feature Representation
(Feature extraction stage)

[Figure: speech signal and spectrum of two users uttering 'ONE'.]
Speaker Model

The trajectory of each cepstral coefficient is summarized over a 9-frame window by a first-order orthogonal polynomial fit. With $P_{1j} = j - 5$ for $j = 1 \ldots 9$, the regression (slope) coefficient is

$b = \frac{\sum_{j=1}^{9} c_j P_{1j}}{\sum_{j=1}^{9} P_{1j}^{2}}$

Each frame is represented by a feature vector combining the cepstral coefficients and their regression coefficients, and the utterance by the sequence of such vectors:

F1 = [a1…a10, b1…b10]
F2 = [a1…a10, b1…b10]
…
FN = [a1…a10, b1…b10]
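Under the reconstruction above (a least-squares slope over a 9-frame window), a minimal sketch of the regression coefficients; the edge padding is an assumption:

```python
import numpy as np

def regression_coeffs(ceps, width=9):
    """b_t = sum_j c_{t+j-5} * P_1j / sum_j P_1j^2 with P_1j = j - 5,
    computed for every frame t and every cepstral dimension.

    ceps: (n_frames, n_ceps) array of cepstral coefficients."""
    p1 = np.arange(1, width + 1) - (width + 1) // 2   # P_1j = j - 5
    denom = float((p1 ** 2).sum())
    half = width // 2
    padded = np.pad(ceps, ((half, half), (0, 0)), mode='edge')
    return np.array([(padded[t:t + width] * p1[:, None]).sum(axis=0) / denom
                     for t in range(ceps.shape[0])])
```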
Dynamic Time Warping
(Matching stage)

$X(n) = \{x_1(n), x_2(n), \ldots, x_K(n)\}, \quad n = 1 \ldots N$
$Y(m) = \{y_1(m), y_2(m), \ldots, y_K(m)\}, \quad m = 1 \ldots M$
$m = w(n)$, with $M \neq N$ in general

• The DTW warping path through the N-by-M matrix is the path with minimum cumulative cost; the unmarked region of the matrix is the constraint region through which the path is allowed to go.

$D_T = \min_{w} \sum_{n=1}^{N} D\big(X(n), Y(w(n))\big)$

$D\big(X(n), Y(m)\big) = \sum_{i=1}^{K} \big(r_i(n) - t_i(m)\big)^2$
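A minimal DTW sketch matching the definitions above, without the slides' global path constraint; normalizing by the path-length bound N + M is one common way to realize the length normalization mentioned in the results:

```python
import numpy as np

def dtw_distance(ref, test):
    """Minimum cumulative distance D_T between a reference and a test
    feature sequence, each of shape (n_frames, K)."""
    n, m = len(ref), len(test)
    # local cost D(X(n), Y(m)) for every frame pair
    d = ((ref[:, None, :] - test[None, :, :]) ** 2).sum(axis=2)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = d[i - 1, j - 1] + min(acc[i - 1, j],      # insertion
                                              acc[i, j - 1],      # deletion
                                              acc[i - 1, j - 1])  # match
    return acc[n, m] / (n + m)    # length-normalized distance
```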
Results

Normalized DTW distance matrix (rows/columns indexed by utterances a0, a1, r0, r1, s0, s1):

        a0      a1      r0      r1      s0      s1
a0      0       0.1226  0.3664  0.3297  0.4009  0.4685
a1      0.1226  0       0.5887  0.3258  0.4086  0.4894
r0      0.3664  0.5887  0       0.0989  0.3299  0.4243
r1      0.3297  0.3258  0.0989  0       0.3670  0.4287
s0      0.4009  0.4086  0.3299  0.3670  0       0.1401
s1      0.4685  0.4894  0.4243  0.4287  0.1401  0

• Distances are normalized w.r.t. the length of the speech signal
• Intra-speaker distances are smaller than inter-speaker distances
• The distance matrix is symmetric
Matlab Implementation
THANK YOU