Spectral Analysis and LPC Vocoder
ECE 532 Course Project, May 3, 2000
Violeta Gambiroza, Aleksandar Kuzmanovic, Yonghe Liu,
Liang Sun, Yuanbin Guo
Outline

Spectral Analysis of Speech Signal
LP with Levinson and Burg Algorithms
LPC Vocoder
Wavelet-Based Pitch Detection
Wavelet-Based RELP
A Framework for Spectral Analysis and LPC of Speech
Spectral Analysis of Speech

Time-varying vs. WSS
Time-frequency analysis
Spectrogram
Short-time spectrum (block processing)
AR-model spectrum
[Figure: spectrogram of "The rain in Spain stays mainly in the great plain!"; frequency vs. time]
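To make the block-processing / short-time spectrum idea concrete, here is a minimal sketch (not part of the project code); the sampling rate, window length, and overlap are illustrative assumptions, and a toy signal stands in for real speech.

```python
# Minimal spectrogram / short-time spectrum sketch (illustrative parameters only).
import numpy as np
from scipy.signal import spectrogram

fs = 8000                                    # assumed sampling rate (Hz)
t = np.arange(0, 1.0, 1 / fs)
x = np.sin(2 * np.pi * 200 * t) + 0.3 * np.random.randn(t.size)   # toy "speech-like" signal

# Block processing: 30 ms Hamming windows with 50% overlap.
f, frames, Sxx = spectrogram(x, fs=fs, window='hamming',
                             nperseg=240, noverlap=120)
print(Sxx.shape)   # (frequency bins, time frames): the short-time spectrum over time
```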
Voiced/Unvoiced Spectrum

Fine and formant spectrum of voiced speech
White-noise-like unvoiced speech
Summary of the various spectra
[Figure: data PSD, AR-model PSD, and error PSD (dB vs. frequency bin)]
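As a hedged illustration of the "data PSD vs. AR-model PSD vs. error PSD" comparison in the figure, the sketch below fits an AR (LPC) model to one frame with the autocorrelation method and evaluates the three spectra; the model order, frame length, and toy signal are assumptions.

```python
# Periodogram of a frame vs. the PSD implied by an AR (LPC) model (illustrative only).
import numpy as np
from scipy.linalg import solve_toeplitz

fs, p, N = 8000, 10, 256
n = np.arange(N)
frame = np.sin(2 * np.pi * 150 * n / fs) + 0.1 * np.random.randn(N)
frame = frame * np.hamming(N)

# Autocorrelation (Yule-Walker) LPC fit.
r = np.correlate(frame, frame, mode='full')[N - 1:N + p]   # lags 0..p
a = solve_toeplitz(r[:p], r[1:p + 1])                       # predictor coefficients a_1..a_p
gain2 = r[0] - a @ r[1:p + 1]                               # prediction-error variance

# Data PSD (periodogram), AR-model PSD, and residual ("error") PSD.
w = np.linspace(0, np.pi, 512, endpoint=False)
data_psd = np.abs(np.fft.rfft(frame, 1024))[:512] ** 2 / N
A = 1 - np.exp(-1j * np.outer(w, np.arange(1, p + 1))) @ a  # A(e^{jw})
model_psd = gain2 / np.abs(A) ** 2
error_psd = data_psd * np.abs(A) ** 2                       # PSD of the prediction residual
```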
Speech Production Model (2S)

Voiced/unvoiced switch
Pitch period
AR-model analysis
Gain

[Block diagram: (1) the pitch period drives an impulse train generator; (2) the voiced/unvoiced switch selects between the impulse train and a random noise generator; (3) the vocal tract parameters and gain feed the vocal tract model V(z), followed by the radiation model R(z)]
Linear Prediction

One of the most powerful tools for speech
Various types of formulation
Levinson algorithm
Burg algorithm
Levinson vs. Burg
$$e^f_p(n) = e^f_{p-1}(n) - k_p\, e^b_{p-1}(n-1)$$
$$e^b_p(n) = e^b_{p-1}(n-1) - k_p\, e^f_{p-1}(n)$$
Result Comparison

Levinson algorithm
Burg algorithm
Comparison (cont.)

Stability
– Both guaranteed in theory
– In practice, Burg is better than Levinson

Computation form
– Direct form for Levinson
– Lattice form for Burg

Preprocessing
ACF computation
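A minimal sketch of the Burg lattice recursion shown two slides above (illustrative, not the project's implementation): each stage chooses k_p to minimize the combined forward/backward error energy, then updates both error sequences.

```python
# Burg algorithm sketch: ef_p(n) = ef_{p-1}(n) - k_p*eb_{p-1}(n-1),
#                        eb_p(n) = eb_{p-1}(n-1) - k_p*ef_{p-1}(n).
import numpy as np

def burg_parcor(x, order):
    f = np.asarray(x, dtype=float)     # forward prediction error e^f_{p-1}(n)
    b = f.copy()                       # backward prediction error e^b_{p-1}(n)
    k = np.zeros(order)
    for p in range(order):
        fp = f[1:]                     # e^f_{p-1}(n)
        bp = b[:-1]                    # e^b_{p-1}(n-1)
        k[p] = 2 * np.dot(fp, bp) / (np.dot(fp, fp) + np.dot(bp, bp))
        f = fp - k[p] * bp             # e^f_p(n)
        b = bp - k[p] * fp             # e^b_p(n)
    return k                           # reflection (PARCOR) coefficients; a Levinson step
                                       # converts them to direct-form predictor coefficients

print(burg_parcor(np.random.randn(400), 4))
```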
Order Selection

AIC
– AIC(k) = N ln(MSE(k)) + 2k
– Tends to overestimate the order

MDL
– MDL(k) = N ln(MSE(k)) + k ln(N)
– Statistically consistent; accounts for the number of bits
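A small sketch of the two order-selection rules above, assuming a vector of per-order prediction MSE values from the LP analysis (variable names and the example numbers are illustrative):

```python
# Order selection via AIC and MDL from the prediction MSE at each candidate order.
import numpy as np

def select_order(mse, N):
    """mse[k-1] is the LP mean-squared error for model order k; N is the frame length."""
    orders = np.arange(1, len(mse) + 1)
    aic = N * np.log(mse) + 2 * orders             # AIC(k) = N ln(MSE(k)) + 2k
    mdl = N * np.log(mse) + orders * np.log(N)     # MDL(k) = N ln(MSE(k)) + k ln(N)
    return orders[np.argmin(aic)], orders[np.argmin(mdl)]

# Example: synthetic MSE values that flatten out after order 4.
mse = np.array([1.0, 0.55, 0.32, 0.20, 0.19, 0.185, 0.184, 0.183])
print(select_order(mse, N=256))   # AIC may pick a slightly higher order than MDL
```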
Application of LP Output

LP parameters (PARCOR, A parameters, gain)
Alternate representations (roots, cepstrum, ACF of the A parameters, ...)
– LAR (log area ratio), appropriate for quantization

Long-term decorrelated residual error
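For the LAR representation mentioned above, the usual conversion from PARCOR (reflection) coefficients is LAR_i = log((1 + k_i)/(1 - k_i)); a short sketch with example values:

```python
# Log area ratios from reflection coefficients, plus the inverse mapping for the decoder.
import numpy as np

k = np.array([0.85, -0.42, 0.18, -0.05])          # example reflection coefficients, |k_i| < 1
lar = np.log((1 + k) / (1 - k))                    # better behaved under uniform quantization
k_back = (np.exp(lar) - 1) / (np.exp(lar) + 1)     # recover the reflection coefficients
```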
2S LPC Vocoder

Simply assumes a two-state excitation
Loses naturalness and sounds machine-like, but remains intelligible
Very low bit rate (2.4 kbps)
Primitives

Clear interface and easy maintenance
Software-engineering management style
– PreProc(OBlkData, WT, WL);
– LPCanalysis(WBlkData, P);
– UVDecision;
– PitchDetection;
– EncodePara;
– Channel;
– DecodePara;
– ExcitGenerator;
– Synthesizer;
– PostProc;
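The primitives above map onto an analysis/synthesis loop. Below is a heavily simplified sketch of the two-state (2S) idea only, not the project's actual PreProc/LPCanalysis/... code: per-frame LPC by the autocorrelation method, a crude V/UV and pitch decision from the autocorrelation peak, and synthesis from an impulse-train or noise excitation. Frame size, order, thresholds, and the stand-in signal are assumptions.

```python
# Simplified 2S LPC vocoder sketch (illustrative only).
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def analyze_frame(frame, p=10):
    w = frame * np.hamming(len(frame))
    r = np.correlate(w, w, mode='full')[len(w) - 1:len(w) + p]
    a = solve_toeplitz(r[:p], r[1:p + 1])              # LPC coefficients
    gain = np.sqrt(max(r[0] - a @ r[1:p + 1], 1e-12))
    # Crude V/UV decision and pitch from the autocorrelation peak (placeholder
    # for the real UVDecision / PitchDetection primitives).
    ac = np.correlate(w, w, mode='full')[len(w) - 1:]
    lag = 20 + np.argmax(ac[20:160])                   # search lags 20..160 samples
    voiced = ac[lag] > 0.3 * ac[0]
    return a, gain, voiced, lag

def synthesize_frame(a, gain, voiced, pitch, n, state):
    if voiced:
        excite = np.zeros(n)
        excite[::pitch] = 1.0                          # impulse train generator
    else:
        excite = np.random.randn(n)                    # random noise generator
    out, state = lfilter([gain], np.concatenate(([1.0], -a)), excite, zi=state)
    return out, state

fs, N, p = 8000, 240, 10
speech = np.random.randn(fs)                           # stand-in for real speech samples
out, state = [], np.zeros(p)
for i in range(0, len(speech) - N, N):
    a, g, v, lag = analyze_frame(speech[i:i + N], p)
    y, state = synthesize_frame(a, g, v, lag, N, state)
    out.append(y)
reconstructed = np.concatenate(out)
```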
Frequency-Domain Pitch Detection with Wavelets

Wavelet transform
Pitch detection in the frequency domain
Down-sampling improves the frequency resolution
FFT of the coarse scale
[Figure: the block data, its level-4 wavedec coefficients, the prediction error errorf, and the wavedec coefficients of errorf]
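A hedged sketch of the frequency-domain idea above: decompose the frame with a discrete wavelet transform (PyWavelets here; the 4-level decomposition mirrors the "L4" label in the figure, but the db4 wavelet is an assumption) and take the FFT of the coarse approximation, which isolates the low-frequency band where the fundamental lies.

```python
# Wavelet-based frequency-domain pitch detection sketch (wavelet choice is an assumption).
import numpy as np
import pywt

fs, N, f0 = 8000, 320, 125
n = np.arange(N)
frame = np.sign(np.sin(2 * np.pi * f0 * n / fs)) + 0.1 * np.random.randn(N)  # toy voiced frame

level = 4
coeffs = pywt.wavedec(frame, 'db4', level=level)   # [cA4, cD4, cD3, cD2, cD1]
coarse = coeffs[0]                                  # approximation at the coarsest scale

# The coarse band is down-sampled by 2**level and keeps roughly 0..fs/2**(level+1) Hz,
# i.e. the band containing the pitch; locate the strongest line in its FFT.
fs_coarse = fs / 2 ** level
spec = np.abs(np.fft.rfft(coarse, 4 * len(coarse)))
peak = np.argmax(spec[1:]) + 1
pitch_hz = peak * fs_coarse / (4 * len(coarse))
print(pitch_hz)
```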
WRELP

Why RELP?
– Difficulty of pitch detection
– Loss of naturalness
– The two-state excitation assumption

Why wavelets?
System Block for WRELP
Ingredients of WRELP

Combination of voiced/unvoiced excitation
White-noise coefficients are flat across all scales
Voiced coefficients are localized in time-frequency
Voiced/unvoiced pre-determination
Different energy thresholds for V/UV separation
White-noise compensation
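A rough sketch of the ingredient list above, under explicit assumptions (db4 wavelet, level 4, a per-scale energy threshold): keep the large residual wavelet coefficients, which carry the time-frequency-localized voiced structure, and refill the discarded positions with white noise of matched energy.

```python
# WRELP-style residual coding sketch: threshold residual wavelet coefficients by energy,
# then compensate the dropped coefficients with matched-energy white noise.
# (Wavelet, level, and threshold rule are assumptions for illustration.)
import numpy as np
import pywt

rng = np.random.default_rng(0)
residual = rng.standard_normal(320) * 0.1
residual[::64] += 1.0                          # toy LP residual with pitch pulses

coeffs = pywt.wavedec(residual, 'db4', level=4)
coded = []
for c in coeffs:
    thr = np.sqrt(np.mean(c ** 2))             # per-scale energy threshold (illustrative)
    keep = np.abs(c) > thr
    kept = np.where(keep, c, 0.0)
    lost = c[~keep]
    if lost.size:
        # White-noise compensation: match the energy that was discarded in this scale.
        sigma = np.sqrt(np.mean(lost ** 2))
        kept[~keep] = sigma * rng.standard_normal((~keep).sum())
    coded.append(kept)

reconstructed_residual = pywt.waverec(coded, 'db4')[:residual.size]
```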
Experiment Results

[Figure: the block data, its level-4 wavedec coefficients, the prediction error errorf, and the wavedec coefficients of errorf]
White Noise Compensation
A Framework for SA and LP Vocoder

A set of parameters defined
– window, LPT, VDT, PDT, EGT, ...

Friendly user interface
Various choices of methods
Various display modes
More Experiment Results

[Figure: reconstruction error (RecErr), prediction error (errorf), and excitation vs. sample number]
More Results
Voiced and Unvoiced Detection
Pattern Recognition Approach (Rabiner 76)

Combination of several measurements:
– Energy of the signal
– Zero-crossing rate
– Autocorrelation coefficient
– First predictor coefficient
– Energy of the prediction error

The measurement is now a 5-dimensional vector.

[Figure: waveform segmented into U, S, V, U regions]
Pattern Recognition Approach

The framework
The high-pass filter:

$$H(z) = \frac{1 - 2z^{-1} + z^{-2}}{1 - 2e^{-aT}\cos(bT)\,z^{-1} + e^{-2aT}z^{-2}}$$
Speech Measurements (1)

Zero-crossing rate
Log energy E_s:

$$E_s = 10\log\Big(\frac{1}{N}\sum_{n=1}^{N} s^2(n)\Big)$$

Normalized autocorrelation coefficient C_1:

$$C_1 = \frac{\sum_{n=1}^{N} s(n)\,s(n-1)}{\sqrt{\Big(\sum_{n=1}^{N} s^2(n)\Big)\Big(\sum_{n=0}^{N-1} s^2(n)\Big)}}$$
Speech Measurements (2)

First predictor coefficient
– a_1 of a 12-pole LPC analysis using the covariance method

Normalized prediction error E_p:

$$E_p = E_s - 10\log\Big(10^{-6} + \Big|\phi(0,0) - \sum_{k=1}^{P}\alpha_k\,\phi(0,k)\Big|\Big)$$

where

$$\phi(i,k) = \frac{1}{N}\sum_{n=1}^{N} s(n-i)\,s(n-k)$$
Decision Algorithm (1)

L = 5 dimensional Gaussian distribution:

$$g_i(X) = (2\pi)^{-L/2}\,|W_i|^{-1/2}\exp\Big(-\tfrac{1}{2}(X-M_i)^T W_i^{-1}(X-M_i)\Big)$$

i = voiced / silence / unvoiced

Detector: our favorite hypothesis test. If

$$p_i\,g_i(X) \ge p_j\,g_j(X)\ \text{for all}\ j,$$

then select hypothesis i (voiced / silence / unvoiced).
Decision Algorithm (2)

Further simplified to

$$\hat d_i = (X - M_i)^T W_i^{-1}(X - M_i)$$

$$P_1 = \frac{\hat d_2 \hat d_3}{\hat d_1\hat d_2 + \hat d_2\hat d_3 + \hat d_1\hat d_3},\qquad
P_2 = \frac{\hat d_1 \hat d_3}{\hat d_1\hat d_2 + \hat d_2\hat d_3 + \hat d_1\hat d_3},\qquad
P_3 = \frac{\hat d_1 \hat d_2}{\hat d_1\hat d_2 + \hat d_2\hat d_3 + \hat d_1\hat d_3}$$

Select the class with the largest P_i.
Estimators of Mi and Wi

$$M_i = \frac{1}{N_i}\sum_{n=1}^{N_i} x_i(n)$$

$$W_i = \frac{1}{N_i}\sum_{n=1}^{N_i} x_i(n)\,x_i^T(n) - M_i M_i^T$$

Use training data to estimate these.
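Given labeled training vectors for the three classes, the estimators and the simplified decision rule above translate into a few lines. This is a sketch under the assumption that X is the 5-dimensional measurement vector from the previous slides; the toy training data are synthetic.

```python
# Train per-class mean/covariance and classify by the simplified distance rule.
import numpy as np

def train(class_vectors):
    # class_vectors: list of (Ni x 5) arrays, one per class (silence, unvoiced, voiced).
    models = []
    for X in class_vectors:
        M = X.mean(axis=0)
        W = (X.T @ X) / len(X) - np.outer(M, M)       # Wi = (1/Ni) sum x x^T - Mi Mi^T
        models.append((M, np.linalg.inv(W)))
    return models

def classify(x, models):
    d = np.array([(x - M) @ Winv @ (x - M) for M, Winv in models])   # Mahalanobis distances
    # P_i is proportional to the product of the other two distances (common denominator),
    # so the largest numerator wins.
    P = np.array([d[1] * d[2], d[0] * d[2], d[0] * d[1]])
    return int(np.argmax(P))            # 0 = silence, 1 = unvoiced, 2 = voiced

rng = np.random.default_rng(1)
train_sets = [rng.normal(loc=m, size=(60, 5)) for m in (0.0, 1.0, 3.0)]   # toy data
models = train(train_sets)
print(classify(rng.normal(loc=3.0, size=5), models))
```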
Means and covariance matrices for the three classes for the training data.

                 Zero Crossings   Log Energy   First Autocorrelation   First LPC   LPC Log Error

1) Silence
Mean:              9.6613         -38.1601       0.9489                 0.5084     -10.8084
Covariance matrix (normalized):
                   1.0000           0.6760      -0.7077                -0.1904       0.7208
                   0.6760           1.0000       0.6933                 0.2918      -0.9425
                  -0.7077           0.6933       1.0000                 0.3275      -0.8426
                  -0.1904           0.2918       0.3275                 1.0000      -0.2122
                   0.7208          -0.9425      -0.8426                -0.2122       1.0000

2) Unvoiced
Mean:             10.4286         -36.7536       0.9598                 0.5243     -10.9076
Covariance matrix (normalized):
                   1.0000           0.6059      -0.4069                 0.4648      -0.4603
                   0.6059           1.0000      -0.1713                 0.1916      -0.9337
                  -0.4069          -0.1713       1.0000                 0.1990      -0.1685
                   0.4648           0.1916       0.1990                 1.0000      -0.2121
                  -0.4603          -0.9337      -0.1685                -0.2121       1.0000

3) Voiced
Mean:             29.1853         -18.3327       0.9826                 1.1977     -11.1256
Covariance matrix (normalized):
                   1.0000          -0.2146      -0.8393                -0.3362       0.3608
                  -0.2146           1.0000       0.1793                 0.6564      -0.7129
                  -0.8393           0.1793       1.0000                 0.3416      -0.5002
                  -0.3362           0.6564       0.3416                 1.0000      -0.4850
                   0.3608          -0.7129      -0.5002                -0.4850       1.0000
Matrix of incorrect identifications for the three classes for the speech data in the training set.

                          Actual class
Identified as     Silence   Unvoiced   Voiced
Silence              61         2         0
Unvoiced              1        54         1
Voiced                0         0       258
Total                62        56       259
Matrix of incorrect identifications for the three classes for the speech data in the testing set.

                          Actual class
Identified as     Silence   Unvoiced   Voiced
Silence              14         2         0
Unvoiced              2        21         0
Voiced                0         0       139
Total                16        23       139
Comparison between actual data and V/U/S determination results.

[Figure: waveform annotated with actual and detected V/U/S labels]
                                                      Errors
Number   Parameter used                         Silence   Unvoiced   Voiced
1        Zero-crossing count                       10        21        12
2        Log energy                                 4        13         9
3        Normalized autocorrelation coefficient     5        17         7
4        First predictor coefficient of
         LPC analysis                              15        24        10
5        Normalized prediction error                7        18         3
6        Pattern recognition using all
         five parameters                            1         2         1

Total number of identification errors for the different classes with different sets of parameters. The total number of segments was 62 for the silence, 56 for the unvoiced, and 259 for the voiced class.
Pitch Detection

Voiced sounds
– Produced by forcing air through the glottis
– Vocal cords oscillate and modulate the air flow into quasi-periodic pulses
– Pulses excite resonances in the remainder of the vocal tract
– Different sounds are produced as muscles work to change the shape of the vocal tract
  • Resonant frequencies, or formant frequencies
– Fundamental frequency, or pitch: the rate of the pulses
Pitch Detection

Short sections of:
– Voiced speech
– Unvoiced speech

[Figures: amplitude vs. sample number for a voiced and an unvoiced speech segment]
Cepstral Analysis

Frequency-domain pitch detection
Glottal excitation is rich in harmonic content
– Modulated by the filtering response of the vocal tract

Cepstral analysis provides a way to separate the excitation from the filter response
Assumption: a sequence of voiced speech samples is the convolution of
– e[n], the glottal excitation sequence
– θ[n], the vocal tract's discrete impulse response

$$s[n] = e[n] * \theta[n]$$
Cepstral Analysis

$$S(\omega) = E(\omega)\,\Theta(\omega)$$

No easy way to separate the excitation from the filter response.

$$\log|S(\omega)| = \log|E(\omega)| + \log|\Theta(\omega)|$$

Real cepstrum of the signal s[n]:

$$c[n] = F^{-1}_{\mathrm{DTFT}}\{\log|F_{\mathrm{DTFT}}\{s[n]\}|\} = \frac{1}{2\pi}\int_{-\pi}^{\pi}\log|S(\omega)|\,e^{j\omega n}\,d\omega$$

where

$$S(\omega) = \sum_{n=-\infty}^{\infty} s[n]\,e^{-j\omega n}$$
Pitch Detection

[Figure: cepstrum vs. quefrency; the peak gives the pitch period]
Time-Domain Pitch Estimation

Well-studied area
Variations of the fundamental frequency are evident
Time-domain speech processing should be capable of detecting the pitch frequency

[Figure: voiced speech waveform, amplitude vs. sample number]
Pitch Period Estimation Using the Autocorrelation Function

Periodic signals have a periodic autocorrelation function:

$$R_n(k) = \sum_{m=0}^{N-1-k} [x(n+m)\,w(m)]\,[x(n+m+k)\,w(k+m)]$$

Basic problems in choosing the window length:
– Speech changes over time (keep N low)
– but the window should cover at least two periods of the waveform

Approaches:
– Choose the window to catch the longest period
– Adaptive N
– Use the modified short-time autocorrelation function
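A brief sketch of the windowed short-time autocorrelation above and the resulting pitch estimate; the window length is chosen to cover more than two periods, per the bullet above, and the toy frame and lag range are assumptions.

```python
# Short-time autocorrelation pitch estimate: R_n(k) over a windowed frame,
# peak location in the plausible pitch-lag range gives the period.
import numpy as np

fs, f0, N = 8000, 110, 400                    # window covers > 2 periods of 110 Hz
m = np.arange(N)
xw = np.sign(np.sin(2 * np.pi * f0 * m / fs)) * np.hamming(N)   # toy voiced frame, windowed

R = np.array([np.dot(xw[:N - k], xw[k:]) for k in range(N // 2)])
lo, hi = int(fs / 400), int(fs / 60)          # lags for pitch in 60-400 Hz
k0 = lo + np.argmax(R[lo:hi])
print(fs / k0)                                # estimated pitch (Hz), true value 110
```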
Pitch Period Estimation Using the Autocorrelation Function (Cont'd)

The autocorrelation representation retains too much of the information in the speech signal, so the autocorrelation function has many peaks.

[Figure: short-time autocorrelation function vs. lag]
“Spectrum Flattener” Techniques

Remove the effects of the vocal tract transfer function
“Center clipping”: a nonlinear transformation whose clipping value depends on the maximum amplitude
=> Strong peak at the pitch frequency
[Figure: speech segment (amplitude vs. sample number) and the resulting autocorrelation functions]
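The center-clipping flattener, as a short sketch: samples within the clipping level of zero are removed and the rest keep only the part that exceeds it, which suppresses the formant structure and sharpens the autocorrelation peak at the pitch period. The clipping fraction and toy signal are assumptions.

```python
# Center clipping as a spectrum flattener, followed by a short-time autocorrelation.
import numpy as np

def center_clip(x, fraction=0.3):
    cl = fraction * np.max(np.abs(x))          # clipping level from the frame maximum
    y = np.zeros_like(x, dtype=float)
    y[x > cl] = x[x > cl] - cl
    y[x < -cl] = x[x < -cl] + cl
    return y

fs, f0, N = 8000, 100, 600
n = np.arange(N)
x = np.sin(2 * np.pi * f0 * n / fs) + 0.5 * np.sin(2 * np.pi * 3 * f0 * n / fs)
y = center_clip(x)
R = np.array([np.dot(y[:N - k], y[k:]) for k in range(N // 2)])
print(np.argmax(R[20:]) + 20)    # strong autocorrelation peak near the pitch period (80 samples)
```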