Pitch Prediction for Glottal Spectrum Estimation
with Applications in Speaker Recognition
Nengheng Zheng
Supervised under Professor P.C. Ching
Nov. 26, 2004
Outline
• Speech production and glottal pulse excitation in detail
• Linear prediction: short-term and long-term
• Glottal spectrum estimation with long-term prediction, and acoustic features
• Implementation for speaker recognition
Speech Production
Discrete-time model for speech production
[Block diagram: impulse train generator → glottal pulse model G(z) → gain A_V; random noise generator → gain A_N; excitation u(n) → vocal tract model V(z) → radiation model R(z) → speech signal s(n)]
A combined transfer function
$H(z) = G(z)\,V(z)\,R(z)$
[Figure: glottal pulses → vocal tract → speech signal]
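The cascade H(z) = G(z)V(z)R(z) can be illustrated with a small synthesis sketch. Everything numeric here is an assumption chosen for illustration and does not come from the slides: a two-pole G(z), a single-resonance V(z), a first-difference R(z), an 8 kHz sampling rate, and the helper names `iir_filter` and `synthesize_voiced`.

```python
import math

def iir_filter(b, a, x):
    """Apply a[0]*y(n) = sum_k b[k]*x(n-k) - sum_{k>=1} a[k]*y(n-k)."""
    y = []
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(acc / a[0])
    return y

def synthesize_voiced(num_samples, pitch_period, gain=1.0,
                      fs=8000.0, formant_hz=500.0):
    """Voiced branch only: impulse train -> G(z) -> V(z) -> R(z)."""
    # Impulse train scaled by the voicing gain A_V.
    excitation = [gain if n % pitch_period == 0 else 0.0
                  for n in range(num_samples)]
    # G(z): double real pole at 0.9 (assumed), a crude low-pass glottal shape.
    glottal = iir_filter([1.0], [1.0, -1.8, 0.81], excitation)
    # V(z): one formant modeled as a complex pole pair (assumed values).
    r, theta = 0.95, 2.0 * math.pi * formant_hz / fs
    tract = iir_filter([1.0], [1.0, -2.0 * r * math.cos(theta), r * r], glottal)
    # R(z): first-order difference as a simple lip-radiation model.
    return iir_filter([1.0, -1.0], [1.0], tract)

s = synthesize_voiced(num_samples=400, pitch_period=80)
```

Because the three stages are linear and time-invariant, applying them in any order gives the same output, which is why they can be merged into one transfer function.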
Acoustic Features of Glottal Pulse
• Time domain
– pitch period
– pitch period perturbation (jitter)
– pulse amplitude perturbation (shimmer)
– glottal pulse width
– abruptness of closure of the glottal flow
– aspiration noise
• Frequency domain
– fundamental frequency (F0)
– spectral tilt (slope)
– harmonic richness
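As a rough illustration of the perturbation features listed above, jitter and shimmer can both be computed as a relative cycle-to-cycle perturbation. This particular definition (mean absolute difference between consecutive cycles, divided by the mean) is one common choice and is assumed here, not taken from the slides; the function name and the sample values are made up.

```python
def relative_perturbation(values):
    """Mean absolute cycle-to-cycle difference divided by the mean value."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))

# Hypothetical per-cycle measurements of a voiced segment.
periods = [80, 82, 79, 81, 80]                # pitch periods in samples
amplitudes = [1.00, 0.95, 1.02, 0.98, 1.01]   # pulse peak amplitudes

jitter = relative_perturbation(periods)       # pitch period perturbation
shimmer = relative_perturbation(amplitudes)   # pulse amplitude perturbation
```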
Glottal Pulse and Voice Quality
• Glottal pulse shape plays an important role in the quality of natural or synthesized vowels [Rosenberg 1971]
– The shape and periodicity of vocal cord excitation are subject to large variation
– Such variations are significant for preserving the naturalness of speech
– A typical glottal pulse: asymmetric, with a shorter falling phase; spectrum with -12 dB/octave decay
• More variation among different speakers than among different utterances of the same speaker [Mathews 1963]
• Such variations have little significance for speech intelligibility but affect the perceived vocal quality [Childers 1991]
Various Glottal Pulses
• Some other vocal types: breathy, falsetto, vocal fry
• Temporal and spectral characteristics
[Figure: temporal and spectral characteristics of the breathy, falsetto, and vocal fry types]
Some Comments
• Generally, studying the characteristics of the glottal pulse requires reconstructing the glottal pulse waveform by inverse filtering
• Automatic and exact reconstruction of the glottal waveform from real speech is almost impossible, especially in the transient phases of articulation or for high-pitched speakers
• Fortunately, it is possible to estimate the glottal spectrum from the residual signal with pitch prediction
Linear Prediction
• Speech waveform: current and past samples are correlated, so the waveform is predictable
• Short-term correlation (P: predictor order):
$s(n) \approx \sum_{k=1}^{P} a_k\, s(n-k)$
– Occurs within one pitch period
– Formant modulation
– Classical linear prediction analysis (short-term prediction)
• Long-term correlation (p: pitch period in samples):
$u(n) \approx b\, u(n-p)$
– Occurs across consecutive pitch periods
– Vocal cord vibration
– Long-term/pitch prediction
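The short-term coefficients a_k are commonly estimated with the autocorrelation method and the Levinson-Durbin recursion. The slides do not say which estimator was used, so the sketch below is an assumption, as are the function names and the toy signal.

```python
def autocorrelation(x, max_lag):
    """r(k) = sum_n x(n) x(n-k) for k = 0..max_lag."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Return (a_1..a_P, error) such that s(n) ~ sum_k a_k s(n-k)."""
    if r[0] <= 0:
        raise ValueError("signal has no energy")
    a = [0.0] * (order + 1)          # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for stage i.
        k_i = (r[i] - sum(a[k] * r[i - k] for k in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k_i
        for k in range(1, i):
            new_a[k] = a[k] - k_i * a[i - k]
        a = new_a
        err *= 1.0 - k_i * k_i
    return a[1:], err

def short_term_residual(s, a):
    """u(n) = s(n) - sum_k a_k s(n-k), with samples before n = 0 taken as zero."""
    return [s[n] - sum(a[k - 1] * s[n - k]
                       for k in range(1, len(a) + 1) if n - k >= 0)
            for n in range(len(s))]

s = [0.9 ** n for n in range(200)]   # toy signal: a one-pole decay
a, err = levinson_durbin(autocorrelation(s, 1), 1)
u = short_term_residual(s, a)        # close to a single impulse at n = 0
```

For real speech the analysis would be run frame by frame on windowed segments; the toy signal only checks that the recursion recovers a known pole.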
Linear Prediction
• Short-term predictor <classical linear prediction>
$A(z) = 1 - \sum_{k=1}^{P} a_k z^{-k}$
– Removes the short-term correlation, resulting in a glottal excitation signal
$u(n) = s(n) - \sum_{k=1}^{P} a_k\, s(n-k)$
• Long-term predictor <pitch prediction>
$P(z) = 1 - b_{-1} z^{-(p-1)} - b_0 z^{-p} - b_1 z^{-(p+1)}$
– Removes the correlation across consecutive periods
$v(n) = u(n) - \sum_{k=-1}^{1} b_k\, u(n-p-k)$
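[Block diagram: s(n) feeds the short-term predictor $\sum_{i=1}^{M} a_i z^{-i}$, whose output is subtracted from s(n) to give u(n); u(n) feeds the long-term predictor $\sum_{k=-1}^{1} b_k z^{-(p+k)}$, whose output is subtracted from u(n) to give v(n)]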
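A minimal sketch of the three-tap long-term predictor follows. It assumes the pitch lag p is picked by maximizing a normalized correlation and that b_{-1}, b_0, b_1 are found by least squares; the slides specify neither choice, and the function names and the toy residual are hypothetical.

```python
import math

def estimate_pitch_lag(u, min_lag, max_lag):
    """Pick the lag with the highest normalized correlation (assumed method)."""
    def score(lag):
        idx = range(lag, len(u))
        num = sum(u[n] * u[n - lag] for n in idx)
        den = math.sqrt(sum(u[n] ** 2 for n in idx) *
                        sum(u[n - lag] ** 2 for n in idx))
        return num / den if den > 0 else 0.0
    return max(range(min_lag, max_lag + 1), key=score)

def solve_linear(A, y):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(y)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def long_term_predict(u, p):
    """Least-squares b_{-1}, b_0, b_1 and v(n) = u(n) - sum_k b_k u(n-p-k)."""
    taps = (-1, 0, 1)
    idx = range(p + 1, len(u))       # first n for which u(n-p-1) exists
    phi = [[sum(u[n - p - j] * u[n - p - k] for n in idx) for k in taps]
           for j in taps]
    c = [sum(u[n] * u[n - p - j] for n in idx) for j in taps]
    b = solve_linear(phi, c)
    v = [u[n] - sum(bk * u[n - p - k] for bk, k in zip(b, taps)) for n in idx]
    return b, v

# Toy residual: each period is 0.8 times the previous one (period of 8 samples).
base = [0.3, -1.0, 0.7, 0.2, -0.5, 0.9, -0.1, 0.4]
u = [0.8 ** (n // 8) * base[n % 8] for n in range(80)]
p = estimate_pitch_lag(u, 4, 12)
b, v = long_term_predict(u, p)       # b is close to [0, 0.8, 0]
```

Using three taps around the lag lets the predictor cope with a pitch period that is not an exact integer number of samples.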
Linear Prediction: An example
[Figure: waveforms of s(n), u(n), and v(n) against sample index, each shown with its power spectrum magnitude (dB) against frequency (Hz, 0 to 4000)]
Examples of glottal spectra estimated with pitch prediction
[Figure: four example panels, horizontal axis from 0 to 500]
Harmonic Structure of Glottal Spectrum
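• Two parameters describing the harmonic structure
– Harmonic richness factor and noise-to-harmonic ratio
• Harmonic richness factor (HRF)
$\mathrm{HRF}_n = 10 \log \dfrac{\sum_{H_i \in B_n} H_i}{H_1}$
• Noise-to-harmonic ratio (NHR)
$\mathrm{NHR}_n = 10 \log \dfrac{\sum_{N_i \in B_n} N_i}{\sum_{H_i \in B_n} H_i}$
[Figure: glottal spectrum with the harmonic components H_i and the noise components N_i (marked "o") labeled; horizontal axis from 0 to 2000]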
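The two ratios can be sketched for a single band B_n as follows. Several details are assumptions rather than facts from the slides: the logarithm is taken as base 10, H_i and N_i are treated as already-measured magnitudes paired with their frequencies, a band is a half-open frequency interval, and all names and numbers are made up.

```python
import math

def band_hrf_nhr(harmonics, noise, band, h1):
    """HRF_n and NHR_n for one band.

    harmonics, noise: lists of (frequency_hz, magnitude) pairs.
    band: (low_hz, high_hz), treated as the half-open interval [low, high).
    h1: magnitude of the first harmonic H_1.
    """
    low, high = band
    h_sum = sum(m for f, m in harmonics if low <= f < high)
    n_sum = sum(m for f, m in noise if low <= f < high)
    if h_sum <= 0 or n_sum <= 0 or h1 <= 0:
        raise ValueError("band needs positive harmonic and noise content")
    hrf = 10.0 * math.log10(h_sum / h1)
    nhr = 10.0 * math.log10(n_sum / h_sum)
    return hrf, nhr

# Hypothetical spectrum: harmonics of a 100 Hz voice and noise between them.
harmonics = [(100, 8.0), (200, 4.0), (300, 2.0), (400, 1.0)]
noise = [(150, 0.4), (250, 0.2), (350, 0.1)]
hrf, nhr = band_hrf_nhr(harmonics, noise, band=(150, 450), h1=8.0)
```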
Feature Generation
• Acoustic features including the following:
– Fundamental frequency F0
– Pitch prediction gain g
$g = 10 \log \dfrac{\sum_n u^2(n)}{\sum_n v^2(n)}$
– Pitch prediction coefficients b_{-1}, b_0, b_1
– HRF_n and NHR_n <n = 1:10>
• 10-band Mel-scale filter bank
• Feature generation process
[Flow diagram: s(n) → S-T prediction → u(n) → L-T prediction on every pitch period → p, g, b_i; G(z) → G(f) → Mel-scale band-pass filtering → HRF_n, NHR_n, n = 1, 2, …]
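Two pieces of the feature generation can be sketched directly: the pitch prediction gain g and the edges of a 10-band Mel-scale filter bank. The base-10 logarithm, the Mel formula 2595 log10(1 + f/700), the 4000 Hz upper edge, and the function names are assumptions; the slides do not state which Mel mapping or band limits were used.

```python
import math

def prediction_gain_db(u, v):
    """g = 10 log10( sum u^2(n) / sum v^2(n) )."""
    energy_u = sum(x * x for x in u)
    energy_v = sum(x * x for x in v)
    if energy_u <= 0 or energy_v <= 0:
        raise ValueError("both signals need non-zero energy")
    return 10.0 * math.log10(energy_u / energy_v)

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_edges(num_bands=10, f_max=4000.0):
    """Band edges in Hz, equally spaced on the Mel scale from 0 to f_max."""
    top = hz_to_mel(f_max)
    return [mel_to_hz(top * i / num_bands) for i in range(num_bands + 1)]

# Toy residuals: v has one tenth of the amplitude of u, so g is about 20 dB.
g = prediction_gain_db([2.0, -2.0, 2.0, -2.0], [0.2, -0.2, 0.2, -0.2])
edges = mel_band_edges()   # band n spans edges[n-1] to edges[n]
```

A large g indicates strongly periodic excitation, since most of the energy in u(n) is then predictable from the previous pitch period.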
Experimental Conditions
• Speech quality: telephone speech
• Subjects: 49 male speakers
• Training condition:
– 3 training sessions, about 90 s of speech in total, over 3 to 6 weeks
– 128 GMM
• Testing condition:
– 12 testing sessions over 4 to 6 months
Speaker recognition experiments
• Identification results with long-term prediction related features

Feature              Iden. rate
F0                   18%
g                    11%
[b_{-1} b_0 b_1]     14%
HRF                  32%
NHR                  17%
• Comparison of glottal source features with classical features

Features                  Identification error rate (%)
Fgs: F0_g_HRF_NHR25       52
LPCC_D_A36                2.84
LPCC_D_A+Fgs              2.26
MFCC_D_A                  2.1
MFCC_D_A+Fgs              1.9
Summary
• Glottal source excitation is important for the perceptual naturalness of voice quality and helps distinguish one speaker from another.
• Linear prediction is a powerful tool for speech analysis. Short-term prediction estimates the spectral properties of the supraglottal vocal tract system, while long-term prediction estimates the spectrum of the glottal excitation system.
• Recognition results show that the glottal-source-related acoustic features (F0, prediction gain, HRF, NHR, etc.) provide a certain degree of speaker-discriminative power.
Other Applications
• Speech coding
• Speech recognition?
• Speaking emotion recognition!
Thank You!