Spectral Analysis and LPC Vocoder
ECE 532 Course Project, May 3, 2000
Violeta Gambiroza, Aleksandar Kuzmanovic, Yonghe Liu,
Liang Sun, Yuanbin Guo
Outline
Spectral Analysis of Speech Signal
LP with Levinson and Burg Algorithms
LPC vocoder
Wavelet Based Pitch Detection
Wavelet-based RELP
A framework for SA and LPC of Speech
Spectral Analysis of Speech
Time-varying vs. WSS
Time-Frequency Analysis
Spectrogram
Short-time Spectrum (Block Processing)
AR model Spectrum
[Figure: spectrogram of "The rain in Spain stays mainly in the great plain!", frequency vs. time]
Voice/Unvoiced Spectrum
Fine and Formant Spectrum of Voice
White-noise-like unvoiced speech
Summary of the Various Spectra
[Figure: data PSD, model PSD, and error PSD (dB) vs. frequency]
Speech Production Model (2S)
Voiced/Unvoiced Switch
Pitch Period
AR model analysis
Gain
[Block diagram: 1. the pitch period drives an impulse-train generator; 2. a voiced/unvoiced switch selects between the impulse train and a random-noise generator; 3. the vocal-tract parameters and gain define the vocal-tract model V(z), which feeds the radiation model R(z)]
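The 2S model above can be sketched as code. A minimal illustration of frame-by-frame synthesis; the frame layout, names, and parameter choices here are ours, not the project's:

```python
import numpy as np

def synthesize_2s(frames, frame_len=240, seed=0):
    """Two-state (2S) synthesis sketch. Each frame is
    (voiced, pitch_period, gain, a), where `a` holds the AR coefficients
    a_1..a_p of the vocal-tract model V(z) = G / (1 - sum_k a_k z^-k)."""
    rng = np.random.default_rng(seed)
    out = []
    for voiced, pitch, gain, a in frames:
        if voiced:
            excit = np.zeros(frame_len)
            excit[::pitch] = 1.0                      # impulse train at the pitch period
        else:
            excit = rng.standard_normal(frame_len)    # white-noise excitation
        y = np.zeros(frame_len)
        p = len(a)
        for n in range(frame_len):                    # all-pole synthesis filter
            past = sum(a[k] * y[n - 1 - k] for k in range(min(p, n)))
            y[n] = gain * excit[n] + past
        out.append(y)
    return np.concatenate(out)
```

Filter state is reset at each frame boundary here for simplicity; a real vocoder would carry it across frames.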
Linear Prediction
One of the most powerful tools for speech processing
Various types of Formulation
Levinson Algorithm
Burg Algorithm
Levinson vs. Burg
e_p^f(n) = e_{p-1}^f(n) - k_p\,e_{p-1}^b(n-1)
e_p^b(n) = e_{p-1}^b(n-1) - k_p\,e_{p-1}^f(n)
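The lattice recursion can be sketched directly, assuming e^f_0 = e^b_0 = x and Burg's usual choice of k_p, which minimizes the summed forward/backward prediction-error power:

```python
import numpy as np

def burg(x, p):
    """Burg's method: estimate reflection coefficients k_1..k_p via the
    lattice recursion, with k_p = 2 <e^f, e^b> / (|e^f|^2 + |e^b|^2)."""
    ef = np.asarray(x, dtype=float).copy()   # forward error e^f_0(n)
    eb = ef.copy()                           # backward error e^b_0(n)
    ks = []
    for _ in range(p):
        f = ef[1:]                           # e^f_{p-1}(n)
        b = eb[:-1]                          # e^b_{p-1}(n-1)
        k = 2.0 * np.dot(f, b) / (np.dot(f, f) + np.dot(b, b))
        ef = f - k * b                       # e^f_p(n)
        eb = b - k * f                       # e^b_p(n)
        ks.append(k)
    return np.array(ks)
```

This harmonic-mean choice of k_p guarantees |k_p| < 1, which is why Burg's lattice is stable in practice.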
Result Comparison
Levinson Algorithm
Burg Algorithm
Comparison (cont.)
Stability
– both guaranteed in theory
– in practice, Burg behaves better than Levinson
Computation Form
– direct form for Levinson
– lattice form for Burg
Preprocessing
ACF Computation
Order Selection
AIC
– AIC(k) = N·ln(MSE(k)) + 2k
– tends to overestimate the order
MDL
– MDL(k) = N·ln(MSE(k)) + k·ln(N)
– statistically consistent; accounts for the number of bits
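A sketch of order selection using both criteria, taking the Levinson recursion's prediction-error power as MSE(k); the function name and defaults are our own:

```python
import numpy as np

def select_order(x, pmax=20):
    """Return the AR orders minimizing AIC and MDL, with MSE(k) taken
    from the Levinson-Durbin prediction-error recursion."""
    N = len(x)
    r = np.correlate(x, x, 'full')[N - 1:] / N   # autocorrelation r[0..]
    E = r[0]                                     # MSE(0)
    a = np.zeros(pmax + 1)
    aic, mdl = [], []
    for k in range(1, pmax + 1):
        # Levinson-Durbin order update
        kk = (r[k] - np.dot(a[1:k], r[1:k][::-1])) / E
        a[1:k] = a[1:k] - kk * a[1:k][::-1]
        a[k] = kk
        E = E * (1 - kk ** 2)                    # MSE(k)
        aic.append(N * np.log(E) + 2 * k)
        mdl.append(N * np.log(E) + k * np.log(N))
    return 1 + int(np.argmin(aic)), 1 + int(np.argmin(mdl))
```

Because MDL's per-order penalty ln(N) exceeds AIC's 2 for N > 8, MDL tends to pick lower orders, matching the overestimation remark above.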
Application of LP output
LP parameters (PARCOR, AR coefficients, gain)
Alternate representations (roots, cepstrum, ACF of the a-parameters, ...)
– LAR (log area ratio), appropriate for quantization:
Long-term de-correlated Residual error
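One common convention maps each reflection (PARCOR) coefficient k_i to the log-area ratio g_i = log((1 - k_i)/(1 + k_i)); the sign convention varies in the literature, so treat this as one plausible choice:

```python
import numpy as np

def lar(k):
    """Log-area ratios from PARCOR (reflection) coefficients. LAR
    flattens the sensitivity of k near +/-1, which is why it quantizes
    better than k itself."""
    k = np.asarray(k, dtype=float)
    return np.log((1.0 - k) / (1.0 + k))

def parcor(g):
    """Inverse mapping, used at the decoder side."""
    e = np.exp(np.asarray(g, dtype=float))
    return (1.0 - e) / (1.0 + e)
```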
2S LPC Vocoder
Assumes a simple two-state excitation
Loses naturalness (machine-like) but stays intelligible
Very low bit rate (2.4 kbps)
Primitives
Clear Interface and easy maintenance
Software-engineering management style
– PreProc(OBlkData, WT, WL)
– LPCanalysis(WBlkData, P)
– UVDecision
– PitchDetection
– EncodePara
– Channel
– DecodePara
– ExcitGenerator
– Synthesizer
– PostProc
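Toy stand-ins for a few of the primitives above can illustrate the analysis side of the chain. The names follow the slide, but every body below is our own simplified guess, not the project's code:

```python
import numpy as np

def PreProc(blk, wl=240):
    """Pre-emphasis followed by a Hamming window."""
    blk = np.asarray(blk, dtype=float)
    blk = np.append(blk[0], blk[1:] - 0.95 * blk[:-1])
    return blk[:wl] * np.hamming(wl)

def LPCanalysis(wblk, p=8):
    """Autocorrelation-method LPC: solve the normal equations."""
    n = len(wblk)
    r = np.correlate(wblk, wblk, 'full')[n - 1:n + p] / n
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    a = np.linalg.solve(R, r[1:p + 1])
    gain = np.sqrt(max(r[0] - a @ r[1:p + 1], 1e-12))
    return a, gain

def UVDecision(wblk):
    """Crude voiced/unvoiced decision from the zero-crossing rate."""
    zc = np.count_nonzero(np.diff(np.sign(wblk))) / len(wblk)
    return zc < 0.25                       # few crossings => voiced

def PitchDetection(wblk, lo=20, hi=160):
    """Pitch period as the strongest autocorrelation lag in [lo, hi)."""
    n = len(wblk)
    r = np.correlate(wblk, wblk, 'full')[n - 1:]
    return lo + int(np.argmax(r[lo:hi]))

def Synthesizer(a, gain, voiced, pitch, wl=240, seed=0):
    """Excite the all-pole model with an impulse train or noise."""
    if voiced:
        excit = np.zeros(wl)
        excit[::pitch] = 1.0
    else:
        excit = np.random.default_rng(seed).standard_normal(wl)
    y = np.zeros(wl)
    for n in range(wl):
        past = a @ y[n - len(a):n][::-1] if n >= len(a) else a[:n] @ y[:n][::-1]
        y[n] = gain * excit[n] + past
    return y
```

EncodePara/Channel/DecodePara would quantize, transmit, and recover these parameters; they are omitted here.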
FDP Detection with Wavelets
Wavelet transform
Pitch Detection in Frequency Domain
Down-sampling improves the frequency resolution
FFT the coarse scale
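The idea can be sketched with a plain Haar lowpass/decimate cascade standing in for a full wavelet decomposition (the structure and names below are ours):

```python
import numpy as np

def haar_approx(x, levels=4):
    """Repeated Haar lowpass + downsample-by-2: after `levels` stages the
    signal lives at fs / 2**levels, keeping only the coarse scale."""
    a = np.asarray(x, dtype=float)
    for _ in range(levels):
        n = len(a) - len(a) % 2
        a = (a[0:n:2] + a[1:n:2]) / np.sqrt(2.0)
    return a

def pitch_from_coarse(x, fs, levels=4):
    """FFT the coarse-scale approximation and take the dominant bin."""
    a = haar_approx(x, levels)
    a = a - a.mean()
    spec = np.abs(np.fft.rfft(a * np.hanning(len(a))))
    fs_c = fs / 2 ** levels               # decimated sampling rate
    k = 1 + int(np.argmax(spec[1:]))      # skip the DC bin
    return k * fs_c / len(a)
```

After decimation, the pitch fundamental occupies a much larger fraction of the (reduced) band, so a short FFT resolves it more easily.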
[Figure: the block data, its level-4 wavedec coefficients, the residual errorf, and the wavedec coefficients of errorf]
WRELP
Why RELP?
– Difficulty of Pitch Detection
– loss of naturalness
– 2S excitation assumption
Why Wavelet?
System Block for WRELP
Ingredients of WRELP
Combination of voiced/unvoiced excitation
– white-noise coefficients are flat across all scales
– voiced coefficients are localized in time-frequency
Voiced/Unvoiced pre-determination
– different energy thresholds for V/UV separation
White noise compensation
Experiment Results
[Figure: the block data, its level-4 wavedec coefficients, the residual errorf, and the wavedec coefficients of errorf]
White Noise Compensation
A Framework for SA and LP Vocoder
A set of parameters defined
– window, LPT, VDT, PDT, EGT, ...
Friendly user interface
Various choices of methods
Various display modes
More Experiment Results
[Figure: reconstruction error RecErr, residual errorf, and excitation vs. sample number]
More Results
Voiced and Unvoiced Detection
Pattern Recognition Approach(Rabiner76)
Combination of Several Methods
– Energy of Signal
– Zero Crossing Rate
– Autocorrelation Coefficient
– First Predictor Coefficient
– Energy of Predictor Error
The measurement is now a 5-dimensional vector
Pattern Recognition Approach
The Framework
The High Pass Filter
H(z) = \frac{1 - 2z^{-1} + z^{-2}}{1 - 2e^{-aT}\cos(bT)\,z^{-1} + e^{-2aT}z^{-2}}
Speech Measurements (1)
Zero Crossing Rate
Log Energy:
  E_s = 10\log\Big(\frac{1}{N}\sum_{n=1}^{N} s^2(n)\Big)
Normalized Autocorrelation Coefficient:
  C_1 = \frac{\sum_{n=1}^{N} s(n)\,s(n-1)}{\sqrt{\big(\sum_{n=1}^{N} s^2(n)\big)\big(\sum_{n=0}^{N-1} s^2(n)\big)}}
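The measurements above translate directly into code; the small floor terms guarding against log(0) and division by zero are our additions:

```python
import numpy as np

def measurements(s):
    """Zero-crossing rate, log energy Es (dB), and normalized first
    autocorrelation C1 for one analysis frame, per the formulas above."""
    s = np.asarray(s, dtype=float)
    N = len(s)
    zcr = np.count_nonzero(np.diff(np.sign(s))) / (N - 1)
    Es = 10.0 * np.log10(np.sum(s ** 2) / N + 1e-12)
    C1 = np.sum(s[1:] * s[:-1]) / np.sqrt(
        np.sum(s[1:] ** 2) * np.sum(s[:-1] ** 2) + 1e-12)
    return zcr, Es, C1
```

A constant frame gives zcr = 0 and C1 near 1; a sign-alternating frame gives zcr = 1 and C1 near -1, which is the separation the classifier exploits.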
Speech Measurements (2)
First Predictor Coefficient
– a_1 of a 12-pole LPC analysis using the covariance method
Normalized Prediction Error E_p:
  E_p = E_s - 10\log\Big(10^{-6} + \Big|\phi(0,0) - \sum_{k=1}^{P} a_k\,\phi(0,k)\Big|\Big)
where
  \phi(i,k) = \frac{1}{N}\sum_{n=1}^{N} s(n-i)\,s(n-k)
Decision Algorithm (1)
L = 5-dimensional Gaussian distribution:
  g_i(X) = (2\pi)^{-L/2}\,|W_i|^{-1/2}\exp\Big(-\frac{1}{2}(X - M_i)^H W_i^{-1}(X - M_i)\Big)
  i = voiced / silence / unvoiced
Detector: our favorite hypothesis test
  If p_i g_i(X) \ge p_j g_j(X) for all j, select hypothesis i (voiced/silence/unvoiced)
Decision Algorithm (2)
Further simplified to
  \hat d_i = (X - M_i)^H W_i^{-1}(X - M_i)
  P_1 = \frac{\hat d_2\hat d_3}{\hat d_1\hat d_2 + \hat d_2\hat d_3 + \hat d_1\hat d_3},\quad
  P_2 = \frac{\hat d_1\hat d_3}{\hat d_1\hat d_2 + \hat d_2\hat d_3 + \hat d_1\hat d_3},\quad
  P_3 = \frac{\hat d_1\hat d_2}{\hat d_1\hat d_2 + \hat d_2\hat d_3 + \hat d_1\hat d_3}
Select the class with the largest P_i
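A sketch of the simplified rule: compute the three quadratic distances and pick the class whose P_i (proportional to the product of the other two distances) is largest. Names and the fixed three-class setup are ours:

```python
import numpy as np

def classify(X, means, covs):
    """Three-class decision via d_i = (X-M_i)^T W_i^{-1} (X-M_i):
    the smallest distance wins, expressed through the P_i scores."""
    d = []
    for M, W in zip(means, covs):
        v = X - M
        d.append(float(v @ np.linalg.solve(W, v)))   # Mahalanobis-style distance
    d = np.array(d)
    # P_i is proportional to the product of the *other* two distances,
    # so a small d_i yields a large P_i
    prod = np.array([d[1] * d[2], d[0] * d[2], d[0] * d[1]])
    P = prod / prod.sum()
    return int(np.argmax(P)), P
```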
Estimators of Mi and Wi
M_i = \frac{1}{N_i}\sum_{n=1}^{N_i} x_i(n)
W_i = \frac{1}{N_i}\sum_{n=1}^{N_i} x_i(n)\,x_i^T(n) - M_i M_i^H
Use training data to estimate them.
Means and Covariance Matrices for the three classes
for the training data.
                 Zero        Log       First Auto-   First     LPC log
                 Crossings   Energy    correlation   LPC       Error
1) Silence
Mean              9.6613   -38.1601     0.9489       0.5084   -10.8084
Covariance        1.0000     0.6760    -0.7077      -0.1904     0.7208
matrix            0.6760     1.0000     0.6933       0.2918    -0.9425
(normalized)     -0.7077     0.6933     1.0000       0.3275    -0.8426
                 -0.1904     0.2918     0.3275       1.0000    -0.2122
                  0.7208    -0.9425    -0.8426      -0.2122     1.0000
2) Unvoiced
Mean             10.4286   -36.7536     0.9598       0.5243   -10.9076
Covariance        1.0000     0.6059    -0.4069       0.4648    -0.4603
matrix            0.6059     1.0000    -0.1713       0.1916    -0.9337
(normalized)     -0.4069    -0.1713     1.0000       0.1990    -0.1685
                  0.4648     0.1916     0.1990       1.0000    -0.2121
                 -0.4603    -0.9337    -0.1685      -0.2121     1.0000
3) Voiced
Mean             29.1853   -18.3327     0.9826       1.1977   -11.1256
Covariance        1.0000    -0.2146    -0.8393      -0.3362     0.3608
matrix           -0.2146     1.0000     0.1793       0.6564    -0.7129
(normalized)     -0.8393     0.1793     1.0000       0.3416    -0.5002
                 -0.3362     0.6564     0.3416       1.0000    -0.4850
                  0.3608    -0.7129    -0.5002      -0.4850     1.0000
Matrix of incorrect identifications for the three classes
for the speech data in the training set.
                          Actual class
Identified as    Silence   Unvoiced   Voiced
Silence               61          2        0
Unvoiced               1         54        1
Voiced                 0          0      258
Total                 62         56      259
Matrix of incorrect identifications for the three classes
for the speech data in the testing set.
                          Actual class
Identified as    Silence   Unvoiced   Voiced
Silence               14          2        0
Unvoiced               2         21        0
Voiced                 0          0      139
Total                 16         23      139
Comparison between actual data and
V/U/S determination results.
[Figure: waveform with actual vs. determined V/U/S labels]
Number  Parameter Used                                  Errors
                                                        Silence  Unvoiced  Voiced
1       Zero-crossing count                                  10        21      12
2       Log energy                                            4        13       9
3       Normalized autocorrelation coefficient                5        17       7
4       First predictor coefficient of LPC analysis          15        24      10
5       Normalized prediction error                           7        18       3
6       Pattern recognition using all five parameters         1         2       1
Total number of identification errors for the different classes
with different sets of parameters. The total number of segments
was 62 for the silence, 56 for the unvoiced, and 259 for the
voiced class.
Pitch Detection
Voiced sounds
– Produced by forcing air through glottis
– Vocal cords oscillate and modulate air flow into
quasi periodic pulses
– Pulses excite resonances in the rest of the vocal
tract
– Different sounds – produced as muscles work to
change shape of vocal tract
• Resonant frequencies or formant frequencies
– Fundamental frequency or pitch – rate of pulses
Pitch Detection
Short sections of
– voiced speech
– unvoiced speech
[Figure: amplitude vs. sample number for short sections of voiced and unvoiced speech]
Cepstral Analysis
Frequency domain pitch detection
Glottal excitation – rich in harmonic content
– Modulated by filtering response of vocal tract
Cepstral analysis – provides way to separate
excitation from filter response
Assumption – sequence of voiced speech
samples is convolution between
– e[n] – glottal excitation sequence
– θ[n] – vocal tract's discrete impulse response
  s[n] = e[n] * θ[n]
Cepstral Analysis
S(\omega) = E(\omega)\,\Theta(\omega)
No easy way to separate excitation from filter response
\log|S(\omega)| = \log|E(\omega)| + \log|\Theta(\omega)|
Real cepstrum of signal s[n]:
  c[n] = F_{DTFT}^{-1}\{\log|F_{DTFT}\{s[n]\}|\} = \frac{1}{2\pi}\int_{-\pi}^{\pi}\log|S(\omega)|\,e^{j\omega n}\,d\omega
where
  S(\omega) = \sum_{n} s[n]\,e^{-j\omega n}
Pitch Detection
[Figure: cepstrum vs. quefrency; the peak gives the pitch period]
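Cepstral pitch detection reduces to a few FFT calls; the quefrency search range and the log floor below are our choices:

```python
import numpy as np

def cepstral_pitch(s, fs, fmin=50.0, fmax=400.0):
    """Real cepstrum c[n] = IFFT(log|FFT(s)|); the strongest peak inside
    the plausible quefrency range gives the pitch period."""
    s = np.asarray(s, dtype=float) * np.hamming(len(s))
    spec = np.abs(np.fft.fft(s))
    c = np.fft.ifft(np.log(spec + 1e-12)).real
    qlo = int(fs / fmax)                  # shortest period considered
    qhi = int(fs / fmin)                  # longest period considered
    period = qlo + int(np.argmax(c[qlo:qhi]))
    return fs / period
```

Restricting the search to [fs/fmax, fs/fmin] keeps the low-quefrency vocal-tract envelope from masquerading as a pitch peak.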
Time-domain pitch estimation
Well-studied area
Variations of the fundamental frequency are evident in the waveform, so time-domain speech processing should be capable of detecting the pitch frequency
[Figure: voiced speech waveform, amplitude vs. sample number]
Pitch Period Estimation Using the Autocorrelation Function
Periodic signals have periodic auto-correlation
function
R_n(k) = \sum_{m=0}^{N-1-k} [x(n+m)\,w(m)]\,[x(n+m+k)\,w(k+m)]
Basic problems in choosing window length:
– Speech changes over time (N low)
– but at least 2 periods of the waveform
Approaches:
– Choose window to catch longest period
– Adaptive N
– Use modified short-time auto-correlation function
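The basic estimator behind the discussion above can be sketched as follows; the search-range parameters are our choices:

```python
import numpy as np

def autocorr_pitch(x, fs, fmin=50.0, fmax=400.0):
    """Short-time autocorrelation pitch estimate: return the frequency
    whose lag maximizes R(k) inside the plausible period range."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    r = np.correlate(x, x, 'full')[n - 1:]   # R(0), R(1), ...
    klo, khi = int(fs / fmax), int(fs / fmin)
    k = klo + int(np.argmax(r[klo:khi]))
    return fs / k
```

Keeping at least two pitch periods inside the window, as the bullets above require, is what makes the peak at the true lag dominate.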
Pitch Period Estimation Using the Autocorrelation Function (Cont’d)
The auto-correlation representation retains too much of the information in the speech signal, so the auto-correlation function has many peaks
[Figure: autocorrelation function with many peaks]
"Spectrum flattening" techniques
Remove the effects of the vocal tract transfer function
"Center clipping" - a nonlinear transformation whose clipping value depends on the maximum amplitude
=> Strong peak at the pitch frequency
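Center clipping can be written in a few lines; the 30% clipping fraction below is a typical choice, not one prescribed by the slide:

```python
import numpy as np

def center_clip(x, frac=0.3):
    """Center clipping: samples within +/-(frac * max|x|) are zeroed and
    the rest are shifted toward zero, flattening the spectrum so the
    autocorrelation peak at the pitch period stands out."""
    x = np.asarray(x, dtype=float)
    cl = frac * np.max(np.abs(x))
    y = np.zeros_like(x)
    y[x > cl] = x[x > cl] - cl
    y[x < -cl] = x[x < -cl] + cl
    return y
```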
[Figure: original waveform, center-clipped waveform, and the resulting autocorrelation with a strong peak at the pitch period]