Transcript Chapter 2
Speech Recognition
Chapter 3
Speech Front-Ends
Linear Prediction Analysis
Linear-Prediction Based Processing
Cepstral Analysis
Auditory Signal Processing
Linear Prediction Analysis
Introduction
Linear Prediction Model
Linear Prediction Coefficients Computation
Linear Prediction for Automatic Speech Recognition
Linear Prediction in Speech Processing
How Good is the LP Model?
Signal Processing Front End
Converts the speech waveform into some type of parametric representation:
s_k -> [Signal Processing Front End] -> O = o(1) o(2) ... o(T)
Two common front ends are the filterbank and the linear prediction front end, which produces the Linear Prediction coefficients.
Introduction
In short intervals, it provides a good model of the speech signal.
Mathematically precise and simple.
Easy to implement in software or hardware.
Works well for recognition applications.
It also has applications in formant and pitch estimation, speech coding and synthesis.
Linear Prediction Model
Basic idea: a speech sample can be approximated as a linear combination of the previous p samples,
$$s_n \approx a_1 s_{n-1} + a_2 s_{n-2} + \dots + a_p s_{n-p}$$
where $a_1, a_2, \dots, a_p$ are called the LP (Linear Prediction) coefficients.
By including the excitation signal, we obtain
$$s_n = \sum_{i=1}^{p} a_i s_{n-i} + G u_n$$
where $u_n$ is the normalised excitation and $G$ is the gain of the excitation.
In the z-domain (Sec. 1.1.4, pp. 15, Deller),
$$S(z) = \sum_{i=1}^{p} a_i z^{-i} S(z) + G\,U(z)$$
leading to the transfer function (Fig. 3.27)
$$H(z) = \frac{S(z)}{U(z)} = \frac{G}{1 - \sum_{i=1}^{p} a_i z^{-i}}$$
The LP model retains the spectral magnitude, but it has a minimum-phase (Sec. 1.1.7, Deller) characteristic.
However, in practice, phase is not very important for speech perception.
Observation: H(z) also models the glottal filter G(z) and the lip radiation R(z).
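As a sketch of the synthesis model, the following Python fragment generates a signal from a white-noise excitation through an all-pole filter $s_n = \sum_i a_i s_{n-i} + G u_n$; the second-order coefficient values are illustrative only, not taken from the text.

```python
import numpy as np

# Illustrative 2nd-order LP coefficients and gain (hypothetical values;
# poles are inside the unit circle, so the filter is stable)
a = np.array([1.3, -0.6])   # a_1, a_2
G = 0.5

rng = np.random.default_rng(0)
u = rng.standard_normal(200)  # normalised excitation (white noise)

# s_n = sum_{i=1}^{p} a_i * s_{n-i} + G * u_n
s = np.zeros(len(u))
for n in range(len(u)):
    for i, ai in enumerate(a, start=1):
        if n - i >= 0:
            s[n] += ai * s[n - i]
    s[n] += G * u[n]
```

With a voiced-like impulse-train excitation instead of noise, the same recursion produces a periodic, vowel-like signal.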
Linear Prediction Coefficients Computation
Introduction
Methodologies
Linear Prediction Coefficients Computation
LP coefficients can be obtained by solving the following equation system (Sec. 3.3.2; proof in the reference):
$$\sum_{i=1}^{M} a_i \sum_{m} s_{m-k}\, s_{m-i} = \sum_{m} s_{m-k}\, s_m, \qquad k = 1, \dots, M$$
Methodologies
Autocorrelation Method
Covariance Method
– Not commonly used in Speech Recognition
Autocorrelation Method
Assumption: each frame is independent (Fig. 3.29).
Solution (Juang, Sec. 3.3.3, pp. 105-106):
$$\sum_{k=1}^{M} a_k\, r_{|i-k|} = r_i, \qquad i = 1, \dots, M \tag{2}$$
where
$$r_i = \sum_{q=0}^{N-1-i} s_q\, s_{q+i}$$
These equations are known as the Yule-Walker equations.
Using matrix notation:
$$\mathbf{R}\,\mathbf{a} = \mathbf{r}$$
where
$$\mathbf{R} = \begin{bmatrix} r_0 & r_1 & r_2 & \cdots & r_{M-1} \\ r_1 & r_0 & r_1 & \cdots & r_{M-2} \\ r_2 & r_1 & r_0 & \cdots & r_{M-3} \\ \vdots & & & \ddots & \vdots \\ r_{M-1} & r_{M-2} & r_{M-3} & \cdots & r_0 \end{bmatrix}$$
The matrix is symmetric and all its diagonal elements ($r_0$) are the same: it is known as a Toeplitz matrix. A linear system with this matrix can be solved very efficiently.
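As a sketch of exploiting the Toeplitz structure, the helper below (the name `lp_autocorrelation` is mine) builds the autocorrelation sequence and solves the Yule-Walker system with SciPy's Toeplitz solver, then sanity-checks the result on a synthetic AR(2) signal with known coefficients.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lp_autocorrelation(s, M):
    """Solve the Yule-Walker equations R a = r for the LP coefficients.

    R is symmetric Toeplitz with first column r[0..M-1], so
    scipy.linalg.solve_toeplitz solves the system in O(M^2)."""
    N = len(s)
    r = np.array([np.dot(s[:N - i], s[i:]) for i in range(M + 1)])
    return solve_toeplitz(r[:M], r[1:M + 1])

# Synthetic AR(2) signal with known coefficients (illustrative values)
rng = np.random.default_rng(1)
true_a = np.array([1.3, -0.6])
u = rng.standard_normal(5000)
s = np.zeros_like(u)
for n in range(2, len(u)):
    s[n] = true_a[0] * s[n - 1] + true_a[1] * s[n - 2] + u[n]

a_hat = lp_autocorrelation(s, 2)  # should be close to true_a
```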
Examples (Fig. 3.32 and 3.33)
Example (Fig. 3.34)
Example (Fig. 3.35)
Example (Fig. 3.36)
Linear Prediction for Automatic Speech Recognition
The LP front end for recognition is a pipeline:
Preemphasis (flattens the spectrum) -> Frame Blocking -> Windowing (to minimise signal discontinuity) -> Autocorrelation (equation (2)) -> LPC Analysis with the Durbin algorithm (usually M = 8) -> Conversion to Cepstral Coefficients (to minimise noise sensitivity), optionally incorporating signal dynamics.
Preemphasis
The transfer function of the glottis can be modelled as follows:
$$U_g(z) = \frac{1}{(1 - \mu_1 z^{-1})(1 - \mu_2 z^{-1})}, \qquad 0.9 < \mu_1, \mu_2 < 1.0$$
The radiation effect can be modelled as follows:
$$R(z) = 1 - \gamma z^{-1}, \qquad 0.9 < \gamma < 1.0$$
The radiation zero approximately cancels one glottal pole. Hence, to obtain the transfer function of the vocal tract, the other pole must be cancelled by a preemphasis filter:
$$H(z) = 1 - \alpha z^{-1}, \qquad 0.9 < \alpha < 1.0$$
Preemphasis should be done only for sonorant sounds. This process can be automated as follows:
$$\alpha = \frac{r_s(1)}{r_s(0)}$$
where $r_s(i)$ is the autocorrelation function. This ratio is close to 1 for sonorant sounds and close to 0 for non-sonorant sounds.
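A minimal sketch of the adaptive preemphasis rule (the function name `preemphasise` is mine): the coefficient is set to $r_s(1)/r_s(0)$, which is close to 1 for strongly correlated (sonorant-like) frames and small for noise-like frames.

```python
import numpy as np

def preemphasise(s):
    """Adaptive pre-emphasis: the coefficient alpha = r_s(1)/r_s(0) is
    near 1 for sonorant frames and near 0 for non-sonorant frames, so
    the filter y_n = s_n - alpha * s_{n-1} is applied only where it helps."""
    r0 = np.dot(s, s)
    r1 = np.dot(s[:-1], s[1:])
    alpha = r1 / r0 if r0 > 0 else 0.0
    out = np.empty_like(s)
    out[0] = s[0]
    out[1:] = s[1:] - alpha * s[:-1]
    return out, alpha

# A slowly varying (sonorant-like) frame yields alpha near 1
t = np.arange(400)
frame = np.sin(2 * np.pi * 0.01 * t)
y, alpha = preemphasise(frame)
```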
Frame Blocking: the signal is divided into frames of N samples, with a frame shift of M samples.
Windowing: minimises signal discontinuities at the edges of the frames. A typical window is the Hamming window:
$$w(n) = 0.54 - 0.46 \cos\!\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1$$
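Frame blocking and Hamming windowing together can be sketched as follows (the helper name `frames` is mine; N and M follow the typical 10 kHz values given later in this chapter):

```python
import numpy as np

def frames(s, N, M):
    """Split the signal into overlapping frames of N samples, shifted by
    M samples, each multiplied by a Hamming window
    w(n) = 0.54 - 0.46 cos(2 pi n / (N-1)) to minimise edge discontinuities."""
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))
    starts = range(0, len(s) - N + 1, M)
    return np.array([s[k:k + N] * w for k in starts])

x = np.ones(1000)
F = frames(x, N=300, M=100)   # typical values for Fs = 10 kHz
```

On a constant signal, each row of `F` is simply the window shape: 0.08 at the edges, rising to nearly 1.0 at the centre.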
LPC Analysis
Converts the autocorrelation coefficients into an LPC "parameter set".
LPC parameter sets:
– LPC coefficients
– Reflection (PARCOR) coefficients
– Log area ratio coefficients
The formal method to obtain the LPC parameter set is known as Durbin's method.
Durbin's method
$$E_0 = r_0$$
For $p = 1, 2, \dots, p_{\max}$:
$$k_p = \frac{r_p - \sum_{i=1}^{p-1} a_i^{(p-1)} r_{p-i}}{E_{p-1}}$$
$$a_p^{(p)} = k_p$$
$$a_i^{(p)} = a_i^{(p-1)} - k_p\, a_{p-i}^{(p-1)}, \qquad i = 1, \dots, p-1$$
$$E_p = (1 - k_p^2)\, E_{p-1}$$
Worked through for the first three orders:
$$E_0 = r_0, \qquad k_1 = \frac{r_1}{E_0}, \qquad a_1^{(1)} = k_1, \qquad E_1 = (1 - k_1^2)\, E_0$$
$$k_2 = \frac{r_2 - a_1^{(1)} r_1}{E_1}, \qquad a_2^{(2)} = k_2, \qquad a_1^{(2)} = a_1^{(1)} - k_2\, a_1^{(1)}, \qquad E_2 = (1 - k_2^2)\, E_1$$
$$k_3 = \frac{r_3 - \sum_{i=1}^{2} a_i^{(2)} r_{3-i}}{E_2}, \qquad a_3^{(3)} = k_3, \qquad a_i^{(3)} = a_i^{(2)} - k_3\, a_{3-i}^{(2)} \; (i = 1, 2), \qquad E_3 = (1 - k_3^2)\, E_2$$
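The Levinson-Durbin recursion can be implemented directly (the function name `durbin` is mine); it returns the LP coefficients, the reflection (PARCOR) coefficients and the final prediction error. The small numeric check reproduces the worked example pattern: for $r = (1, 0.5, 0.1)$ the direct solution of the Yule-Walker system gives $a = (0.6, -0.2)$.

```python
import numpy as np

def durbin(r, p):
    """Levinson-Durbin recursion: solves the Yule-Walker equations in
    O(p^2), returning LP coefficients a_1..a_p, reflection coefficients
    k_1..k_p and the final prediction error E_p."""
    a = np.zeros(p + 1)
    k = np.zeros(p + 1)
    E = r[0]
    for m in range(1, p + 1):
        # k_m = (r_m - sum_{i=1}^{m-1} a_i r_{m-i}) / E_{m-1}
        acc = r[m] - np.dot(a[1:m], r[m - 1:0:-1])
        k[m] = acc / E
        a_new = a.copy()
        a_new[m] = k[m]
        for i in range(1, m):
            a_new[i] = a[i] - k[m] * a[m - i]
        a = a_new
        E = (1 - k[m] ** 2) * E
    return a[1:], k[1:], E

a_hat, k_hat, E_final = durbin(np.array([1.0, 0.5, 0.1]), 2)
```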
LPC (typical values)

Parameter   Fs = 6.67 kHz    Fs = 8 kHz     Fs = 10 kHz
N           300 (45 ms)      240 (30 ms)    300 (30 ms)
M           100 (15 ms)      80 (10 ms)     100 (10 ms)
p           8                10             10
Q           12               12             12
K           3                3              3
LPC Parameter Conversion
Conversion to cepstral coefficients: a robust feature set for speech recognition.
Algorithm:
$$c_0 = \ln G^2$$
$$c_m = a_m + \sum_{k=1}^{m-1} \frac{k}{m}\, c_k\, a_{m-k}, \qquad 1 \le m \le p$$
$$c_m = \sum_{k=1}^{m-1} \frac{k}{m}\, c_k\, a_{m-k}, \qquad m > p \quad (a_j = 0 \text{ for } j > p)$$
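The LPC-to-cepstrum recursion can be sketched as below (the helper name `lpc_to_cepstrum` is mine). For a single-pole model $H(z) = 1/(1 - \alpha z^{-1})$ the cepstrum is known in closed form, $c_m = \alpha^m / m$, which makes a convenient correctness check.

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps, G=1.0):
    """Convert LP coefficients a_1..a_p into cepstral coefficients:
    c_0 = ln(G^2); c_m = a_m + sum_{k=1}^{m-1} (k/m) c_k a_{m-k} for
    m <= p, and the same sum without a_m for m > p (a_j = 0, j > p)."""
    p = len(a)
    c = np.zeros(n_ceps + 1)
    c[0] = np.log(G ** 2)
    for m in range(1, n_ceps + 1):
        acc = a[m - 1] if m <= p else 0.0
        for k in range(1, m):
            if m - k <= p:
                acc += (k / m) * c[k] * a[m - k - 1]
        c[m] = acc
    return c

# Single pole at alpha = 0.5: expect c_m = 0.5^m / m
c = lpc_to_cepstrum(np.array([0.5]), 5)
```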
Parameter Weighting
Low-order cepstral coefficients are highly sensitive to noise, so a weighting window is usually applied.
Temporal Cepstral Derivative
First- or second-order derivatives are enough. The derivative can be approximated as follows:
$$\Delta c_m(t) \approx \mu \sum_{k=-K}^{K} k\, c_m(t+k)$$
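The regression approximation of the temporal derivative can be sketched as follows (the function name `delta` is mine, and I take $\mu = 1/\sum_k k^2$ as the least-squares normaliser, an assumption not stated in the text; edge frames are padded by repetition).

```python
import numpy as np

def delta(c, K=3):
    """Approximate the temporal cepstral derivative over +/-K frames:
    dc_m(t) ~ mu * sum_{k=-K}^{K} k * c_m(t+k), with mu = 1/sum(k^2)
    so that a linear ramp yields exactly its slope.
    c has shape (T, n_ceps)."""
    mu = 1.0 / sum(k * k for k in range(-K, K + 1))
    padded = np.concatenate([np.repeat(c[:1], K, axis=0), c,
                             np.repeat(c[-1:], K, axis=0)])
    out = np.zeros_like(c)
    for t in range(len(c)):
        for k in range(-K, K + 1):
            out[t] += k * padded[t + K + k]
    return mu * out

# A linear ramp has derivative 1 in the interior
ramp = np.arange(10.0).reshape(-1, 1)
d = delta(ramp, K=2)
```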
Given the predictor
$$\tilde{s}_n = \sum_{i=1}^{p} a_i s_{n-i}$$
and the model
$$s_n = \sum_{i=1}^{p} a_i s_{n-i} + G u_n$$
the prediction error is
$$e_n = s_n - \tilde{s}_n = s_n - \sum_{i=1}^{p} a_i s_{n-i}$$
and the total squared error
$$E_n = \sum_n (s_n - \tilde{s}_n)^2 = \sum_n e_n^2$$
is minimised by setting $\partial E_n / \partial a_k = 0$, $k = 1, \dots, p$.
i 1
Hamming Windowed
Large prediction errors
since speech is predicted
form previous samples
arbitray set to zero.
Large prediction errors
since speech is predicted
form previous samples
arbitray set to zero.
Unvoiced signals
are not position sensitive.
It does not show special
effect at the edges.
Observe the
“whitening” phenomena
at the error spectrum.
Observe the “whitening
phenomena at the error
specturm
Observe the error
wave periodicity
behaviour taken
as bases for the
Pitch Estimators.
•Observe that a sharp decrease
in the prediction error is obtain
for small M value (M=1...4).
•Observe that unvoiced signal
has higher RMS error.
Observe the all-pole model
ability to match the spectrum.
Linear Prediction in Speech Processing
LPC for Vocal Tract Shape Estimation
LPC for Pitch Detection
LPC for Formant Detection
LPC for Vocal Tract Shape Estimation
The signal, made free of glottis and radiation effects (preemphasis), is windowed (to minimise signal discontinuity) and passed to parameter calculation, which produces the reflection coefficients $k_n$, $0 \le n \le N$, used for vocal tract shape estimation.
Parameter Calculation
Parameter Calculation
Durbin's Method (as in Speech Recognition)
– If this method is used, the autocorrelation analysis must be performed first.
Lattice Filter
The reflection coefficients are obtained directly from the signal, avoiding the autocorrelation analysis.
Methods:
– Itakura-Saito (PARCOR)
– Burg
– New forms
Advantage:
– Easier to implement in hardware.
Disadvantage:
– Needs around 5 times more computation.
Itakura-Saito (PARCOR)
$$k_p = \frac{\sum_n e_n^{(p-1)} f_{n-1}^{(p-1)}}{\sqrt{\sum_n \left(e_n^{(p-1)}\right)^2 \; \sum_n \left(f_{n-1}^{(p-1)}\right)^2}}$$
where the forward and backward prediction errors are updated as
$$e_n^{(p)} = e_n^{(p-1)} - k_p\, f_{n-1}^{(p-1)}$$
$$f_n^{(p)} = f_{n-1}^{(p-1)} - k_p\, e_n^{(p-1)}$$
The sums accumulate over time (n).
It can be shown that the PARCOR coefficients obtained by the Itakura-Saito method are exactly the same as the reflection coefficients obtained by the Levinson-Durbin algorithm.
Burg
$$k_p = \frac{2 \sum_n e_n^{(p-1)} f_{n-1}^{(p-1)}}{\sum_n \left(e_n^{(p-1)}\right)^2 + \sum_n \left(f_{n-1}^{(p-1)}\right)^2}$$
where the error updates are the same as before:
$$e_n^{(p)} = e_n^{(p-1)} - k_p\, f_{n-1}^{(p-1)}$$
$$f_n^{(p)} = f_{n-1}^{(p-1)} - k_p\, e_n^{(p-1)}$$
Both methods can be realised as a lattice filter: a cascade of $p$ stages, each containing the cross-coupled multipliers $\pm k_i$ and a unit delay $z^{-1}$ in the backward path. Stage $p$ takes $e_n^{(p-1)}$, $f_n^{(p-1)}$ and produces $e_n^{(p)}$, $f_n^{(p)}$, starting from $e_n^{(0)} = f_n^{(0)} = s_n$.
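A sketch of the Burg lattice recursion (the function name `burg` is mine): reflection coefficients are estimated stage by stage directly from the signal, with no explicit autocorrelation. The sanity check fits a first-order model to a synthetic AR(1) signal, where the reflection coefficient should be close to the AR coefficient.

```python
import numpy as np

def burg(s, p):
    """Burg's lattice method: at each stage, k_p minimises the sum of
    forward and backward prediction error energies, then the errors are
    updated with the lattice recursions
    e_n^(p) = e_n^(p-1) - k_p f_{n-1}^(p-1),
    f_n^(p) = f_{n-1}^(p-1) - k_p e_n^(p-1)."""
    e = s.astype(float).copy()   # forward error  e_n^(0) = s_n
    f = s.astype(float).copy()   # backward error f_n^(0) = s_n
    k = np.zeros(p)
    for m in range(p):
        ef, fb = e[m + 1:], f[m:-1]          # e_n^(m), f_{n-1}^(m)
        k[m] = 2 * np.dot(ef, fb) / (np.dot(ef, ef) + np.dot(fb, fb))
        e_new = ef - k[m] * fb
        f_new = fb - k[m] * ef
        e[m + 1:] = e_new
        f[m + 1:] = f_new
    return k

# Synthetic AR(1) signal: s_n = 0.8 s_{n-1} + u_n (illustrative)
rng = np.random.default_rng(2)
u = rng.standard_normal(5000)
s = np.zeros_like(u)
for n in range(1, len(u)):
    s[n] = 0.8 * s[n - 1] + u[n]
k = burg(s, 1)
```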
New Forms
P. Strobach, "New forms of Levinson and Schur algorithms," IEEE Signal Processing Magazine, pp. 12-36, 1991.
Vocal Tract Shape Estimation
From
$$k_n = \frac{A_{n+1} - A_n}{A_{n+1} + A_n}, \qquad 0 \le n \le N$$
we obtain
$$A_n = A_{n+1}\, \frac{1 - k_n}{1 + k_n}, \qquad 0 \le n \le N$$
Therefore, by setting the lip area to an arbitrary value, we can obtain the vocal tract configuration relative to this initial condition.
This technique has been successfully used to train deaf persons.
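The area recursion above can be sketched as follows (the helper name `areas_from_parcor` is mine): working inwards from the lips, whose area is set to an arbitrary reference value.

```python
import numpy as np

def areas_from_parcor(k, lips_area=1.0):
    """Recover relative tube section areas from reflection coefficients:
    A_n = A_{n+1} * (1 - k_n) / (1 + k_n), starting from an arbitrary
    lip area, so the shape is only determined up to a scale factor."""
    A = np.empty(len(k) + 1)
    A[-1] = lips_area
    for n in range(len(k) - 1, -1, -1):
        A[n] = A[n + 1] * (1 - k[n]) / (1 + k[n])
    return A

# k_n = 0 means no impedance change (equal areas); k_0 = 0.5 shrinks A_0
A = areas_from_parcor(np.array([0.5, 0.0]))
```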
LPC for Pitch Detection
Speech sampled at 10 kHz -> LPF at 800 Hz -> Downsampler 5:1 -> LPC Analysis -> Inverse Filtering A(z) -> Autocorrelation -> Peak finding -> V/U decision and Pitch.
LPC for Formant Detection
Sampled Speech -> LPC Analysis -> LPC Spectrum -> Peak emphasis (second derivative) -> Peak finding -> Formants.
LPC Spectrum
LP assumes that the vocal tract system can be modelled with an all-pole system:
$$H(z) = \frac{1}{A(z)}, \qquad A(z) = \sum_{i=0}^{P} a_i z^{-i}$$
The spectrum can be obtained by evaluating on the unit circle, $z = e^{j\omega}$:
$$\left| \frac{1}{A(e^{j\omega})} \right|$$
In order to emphasise the formant peaks we can set $z = \sigma e^{j\omega}$ with $0.9 < \sigma < 1.0$, evaluating the spectrum on a circle just inside the unit circle, closer to the poles.
Spectrum (DTFT):
$$A(e^{j\omega}) = \sum_{i=0}^{P} a_i\, e^{-j\omega i}$$
Spectrum (DFT):
$$A\!\left(e^{j\frac{2\pi k}{N}}\right) = \sum_{i=0}^{P} a_i\, e^{-j\frac{2\pi k}{N} i}$$
In order to increase the spectral resolution we pad with zeros:
$$A\!\left(e^{j\frac{2\pi k}{N'}}\right) = \sum_{i=0}^{N'-1} a_i\, e^{-j\frac{2\pi k}{N'} i}, \qquad N' \gg P, \quad a_i = 0 \text{ for } i > P$$
In order to use an FFT algorithm, $N' = 2^n$.
Calculate the spectral magnitude (DFT): $\left| A(e^{j 2\pi k / N'}) \right|$, $k = 0, \dots, N'-1$.
Invert the spectral magnitude: $1 / \left| A(e^{j 2\pi k / N'}) \right|$, $k = 0, \dots, N'-1$.
This spectrum is called the LPC spectrum.
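The zero-padded FFT computation of the LPC spectrum can be sketched as follows (the function name `lpc_spectrum` is mine; the inverse-filter convention $A(z) = 1 - \sum_i a_i z^{-i}$ matches $H(z) = G/(1 - \sum_i a_i z^{-i})$ earlier in the chapter). For a single pole at 0.5, the spectrum is 2 at DC and 2/3 at Nyquist.

```python
import numpy as np

def lpc_spectrum(a, G=1.0, n_fft=512):
    """LPC spectrum |G / A(e^{jw})| from a zero-padded FFT of the
    inverse-filter coefficients [1, -a_1, ..., -a_p]; n_fft >> p
    increases resolution and a power of two allows an FFT."""
    inv = np.zeros(n_fft)
    inv[0] = 1.0
    inv[1:len(a) + 1] = -a
    A = np.fft.rfft(inv)          # A(e^{j 2 pi k / n_fft}), k = 0..n_fft/2
    return G / np.abs(A)

S = lpc_spectrum(np.array([0.5]), n_fft=512)
```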
How Good is the LP Model?
As shown by the physiological analysis of the vocal tract, the speech model is as follows:
$$\Theta(z) = G\, \frac{1 - \sum_{i=1}^{L} b_i z^{-i}}{1 - \sum_{i=1}^{R} a_i z^{-i}}$$
However, it can be shown (proof below) that the LP model is good for estimating the magnitude of a pole-zero system.
Proof
According to Lemma 1 and Lemma 2, $\Theta(z)$ can be written as follows:
$$\Theta(z) = G\, \frac{1}{1 - \sum_{i=1}^{I} a_i z^{-i}}\; \Theta_{ap}(z)$$
where $\Theta_{ap}(z)$ is the all-pass component. The estimates $\hat{a}_i$ are calculated such that they correspond to the $a_i$ of this model. Since
$$\left| \Theta_{ap}(z) \right| = 1$$
hence
$$\left| \Theta(z) \right| = G \left| \Theta_{\min}(z) \right|$$
Therefore, if the estimators $\hat{a}_i$ are exact, then at least we obtain a model with a correct magnitude.
Lemma 1
Lemma 1 (System Decomposition): any causal rational system
$$\Theta(z) = G\, \frac{1 - \sum_{i=1}^{L} b_i z^{-i}}{1 - \sum_{i=1}^{R} a_i z^{-i}}$$
can be decomposed (proof below) as
$$\Theta(z) = G\, \Theta_{\min}(z)\, \Theta_{ap}(z)$$
where $\Theta_{\min}(z)$ is the minimum-phase component and $\Theta_{ap}(z)$ the all-pass component.
Proof
For two poles and two zeros (zeros at $1/\beta$, $1/\beta^*$ outside the unit circle, poles at $p$, $p^*$):
$$\Theta(z) = \frac{(z - \frac{1}{\beta})(z - \frac{1}{\beta^*})}{(z - p)(z - p^*)}$$
Multiplying and dividing by $(z - \beta)(z - \beta^*)$:
$$\Theta(z) = \frac{(z - \beta)(z - \beta^*)}{(z - p)(z - p^*)} \cdot \frac{(z - \frac{1}{\beta})(z - \frac{1}{\beta^*})}{(z - \beta)(z - \beta^*)}$$
Let us define
$$\Theta_x(z) = \frac{(z - \frac{1}{\beta})(z - \frac{1}{\beta^*})}{(z - \beta)(z - \beta^*)}$$
Re-arranging this equation:
$$\Theta_x(z) = \frac{(\beta z - 1)(\beta^* z - 1)}{\beta \beta^* (z - \beta)(z - \beta^*)} = \frac{z^2 (z^{-1} - \beta)(z^{-1} - \beta^*)}{|\beta|^2 (z - \beta)(z - \beta^*)} = \frac{z^2}{|\beta|^2}\, \frac{A(z^{-1})}{A(z)}$$
with $A(z) = (z - \beta)(z - \beta^*)$. With the knowledge that $|H(e^{j\omega})|^2 = H(z)\, H(z^{-1})\big|_{z = e^{j\omega}}$:
$$\left| \Theta_x(e^{j\omega}) \right|^2 = \frac{z^2}{|\beta|^2}\, \frac{A(z^{-1})}{A(z)} \cdot \frac{z^{-2}}{|\beta|^2}\, \frac{A(z)}{A(z^{-1})} \bigg|_{z = e^{j\omega}} = \frac{1}{|\beta|^4}$$
so $|\Theta_x(e^{j\omega})| = 1/|\beta|^2$ is constant, i.e. $\Theta_x(z) = G\, \Theta_{ap}(z)$ with $|\Theta_{ap}(e^{j\omega})| = 1$. Therefore:
$$G = \frac{1}{|\beta|^2}, \qquad \Theta_{\min}(z) = \frac{(z - \beta)(z - \beta^*)}{(z - p)(z - p^*)}, \qquad \Theta_{ap}(z) = |\beta|^2\, \frac{(z - \frac{1}{\beta})(z - \frac{1}{\beta^*})}{(z - \beta)(z - \beta^*)}$$
End of proof.
Lemma 2
Lemma 2: the minimum-phase component can be expressed as an all-pole system:
$$\Theta_{\min}(z) = \frac{1}{1 - \sum_{i=1}^{I} a_i z^{-i}}$$
In theory $I$ goes to infinity; in practice it is limited.
Linear Prediction Based Processing
Criticisms of the Linear Prediction Model
Perceptual Linear Prediction (PLP)
LP Cepstra
Criticisms of the Linear Prediction Model
The LP spectrum approximates the speech spectrum equally well at all frequencies of the analysis band. This property is inconsistent with human hearing.
Perceptual Linear Prediction (PLP)
The PLP pipeline: Critical-Band Spectral Analysis -> Equal-Loudness Pre-emphasis -> Intensity-Loudness conversion -> IDFT -> solution of the Yule-Walker equations -> LP coefficients a_i.
Critical-Band Analysis
Speech signal -> Frame windowing (20 ms Hamming window) -> DFT (for Fs = 10 kHz: 200 samples plus 56 zeros of padding) -> Short-term spectrum -> Critical-band spectral resolution.
Critical-Band Spectral Resolution
The short-term power spectrum $P(\omega)$ is warped in frequency (Hertz -> Barks):
$$\Omega(\omega) = 6 \ln\!\left(\frac{\omega}{1200\pi} + \sqrt{\left(\frac{\omega}{1200\pi}\right)^2 + 1}\right)$$
The warped spectrum $P(\Omega)$ is then convolved with an approximation of the filter-bank masking curve and downsampled:
$$\Psi(\Omega) = \begin{cases} 0 & \Omega < -1.3 \\ 10^{2.5(\Omega + 0.5)} & -1.3 \le \Omega \le -0.5 \\ 1 & -0.5 < \Omega < 0.5 \\ 10^{-1.0(\Omega - 0.5)} & 0.5 \le \Omega \le 2.5 \\ 0 & \Omega > 2.5 \end{cases}$$
$$\Theta(\Omega_i) = \sum_{\Omega = -1.3}^{2.5} P(\Omega - \Omega_i)\, \Psi(\Omega), \qquad i = 1, \dots, 18$$
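The Hertz-to-Bark warping can be sketched as a one-line function (the name `hertz_to_bark` is mine; $\omega = 2\pi f$ converts the input in Hz to angular frequency). At 600 Hz the argument of the logarithm becomes $1 + \sqrt{2}$, giving about 5.29 Bark.

```python
import numpy as np

def hertz_to_bark(f):
    """PLP frequency warping: maps frequency f (Hz) via w = 2*pi*f to
    Omega = 6 * ln(w/1200pi + sqrt((w/1200pi)^2 + 1))."""
    x = 2 * np.pi * f / (1200 * np.pi)
    return 6 * np.log(x + np.sqrt(x * x + 1))
```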
Equal-Loudness Pre-emphasis
Approximates the non-equal sensitivity of human hearing at different frequencies:
$$E(\omega) = \frac{(\omega^2 + 56.8 \times 10^6)\, \omega^4}{(\omega^2 + 6.3 \times 10^6)^2\, (\omega^2 + 0.38 \times 10^9)\, (\omega^6 + 9.58 \times 10^{26})}$$
Intensity-Loudness Power Law
Approximates the non-linear relation between the intensity of a sound and its perceived loudness:
$$\Phi(\Omega) = \Theta(\Omega)^{1/3}$$
Cepstral Analysis
Introduction
Homomorphic Processing
Cepstral Spectrum
Cepstrum
Mel-Cepstrum
Cepstrum in Speech Processing
Introduction
The speech spectrum is the product of the excitation and the vocal tract responses:
$$S(\omega) = E(\omega)\, H(\omega)$$
The excitation is not necessary for estimating the vocal tract function. Therefore, it is desirable to separate the excitation information from the vocal tract information.
If we regard the speech spectrum itself as a signal, we can observe that it is composed of the product of a slowly varying signal, $H(\omega)$, and a rapidly varying signal, $E(\omega)$. Therefore, we can try to exploit this structure. The formal technique which exploits this feature is called "Homomorphic Processing".
Homomorphic Processing
It is a technique to filter non-linearly combined signals: the non-linearly related signals are first transformed to a domain where they combine linearly, filtered there, and transformed back:
$$x \;\to\; H[\,\cdot\,] \;\to\; F(z) \;\to\; H^{-1}[\,\cdot\,] \;\to\; y$$
In order to obtain a linear (additive) combination, a complex log transformation is applied to the speech spectrum:
$$\hat{S}(\omega) = \log S(\omega) = \log E(\omega) + \log H(\omega)$$
Cepstral Spectrum
Definition:
$$\hat{s}(n) = \frac{1}{2\pi} \int_0^{2\pi} \log S(e^{j\omega T})\, e^{j\omega n T}\, d\omega$$
where $S(e^{j\omega T})$ is the STFT and the complex logarithm is
$$\log S(e^{j\omega T}) = \log \left| S(e^{j\omega T}) \right| + j \angle S(e^{j\omega T})$$
Cepstrum
Definition:
$$c_s(n) = \mathrm{Re}\{\hat{s}(n)\} = \frac{1}{2\pi} \int_0^{2\pi} \log \left| S(e^{j\omega T}) \right|\, e^{j\omega n T}\, d\omega$$
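In discrete form the real cepstrum is the inverse DFT of the log magnitude spectrum, which can be sketched as follows (the function name `real_cepstrum` is mine; the small floor added before the logarithm is a numerical safeguard, not part of the definition).

```python
import numpy as np

def real_cepstrum(s, n_fft=1024):
    """Real cepstrum: inverse DFT of the log magnitude spectrum,
    c_s(n) = IDFT{ log|S(e^{jw})| }.  A small floor avoids log(0) at
    spectral zeros."""
    S = np.abs(np.fft.fft(s, n_fft))
    return np.real(np.fft.ifft(np.log(S + 1e-12)))

# A unit impulse has a flat spectrum, so its cepstrum is (near) zero
ceps = real_cepstrum(np.array([1.0]))
```

For voiced speech, the low-quefrency part of `ceps` carries the vocal tract envelope and a peak at the pitch period appears at high quefrency, which is exactly what the liftering pipelines below exploit.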
Cepstrum in Speech Processing
Pitch Estimation
Formant Estimation
Pitch and Formant Estimation
Pitch Estimation
Sampled Speech -> Cepstrum -> High-Pass Liftering -> Peak emphasis (second derivative) -> Peak finding -> Pitch.
Formant Estimation
Sampled Speech -> Cepstrum -> Low-Pass Liftering -> Peak emphasis (second derivative) -> Peak finding -> Formants.
Pitch and Formant Estimation
Sampled Speech -> Cepstrum -> in parallel: High-Pass Liftering -> Peak emphasis -> Peak finding -> Pitch, and Low-Pass Liftering -> Peak emphasis -> Peak finding -> Formants.