Speech Recognition
Chapter 3: Speech Front-Ends
• Linear Prediction Analysis
• Linear-Prediction-Based Processing
• Cepstral Analysis
• Auditory Signal Processing

Linear Prediction Analysis
• Introduction
• Linear Prediction Model
• Linear Prediction Coefficients Computation
• Linear Prediction for Automatic Speech Recognition
• Linear Prediction in Speech Processing
• How good is the LP Model?
Signal Processing Front End
• Converts the speech waveform s_k into some type of parametric representation O = o(1) o(2) ... o(T).
• Typical front ends: filterbank, and the linear prediction front end (LP coefficients).
Introduction
• In short intervals, linear prediction provides a good model of the speech signal.
• Mathematically precise and simple.
• Easy to implement in software or hardware.
• Works well for recognition applications.
• It also has applications in formant and pitch estimation, speech coding and synthesis.
Linear Prediction Model
• Basic idea:

    s_n \approx a_1 s_{n-1} + a_2 s_{n-2} + ... + a_p s_{n-p}

  a_1, a_2, ..., a_p are called LP (Linear Prediction) coefficients.
• By including the excitation signal, we obtain:

    s_n = \sum_{i=1}^{p} a_i s_{n-i} + G u_n

  where u_n is the normalised excitation and G is the gain of the excitation.
• In the z-domain (Sec. 1.1.4, p. 15, Deller):

    S(z) = \sum_{i=1}^{p} a_i z^{-i} S(z) + G U(z)

  leading to the transfer function (Fig. 3.27):

    H(z) = S(z) / U(z) = G / (1 - \sum_{i=1}^{p} a_i z^{-i})

• The LP model retains the spectral magnitude, but it has the minimum-phase (Sec. 1.1.7, Deller) property.
• However, in practice, phase is not very important for speech perception.
• Observation: H(z) also models the glottal filter G(z) and the lip radiation R(z).
Linear Prediction Coefficients Computation
• Introduction
• Methodologies

Linear Prediction Coefficients Computation
• LP coefficients can be obtained by solving the following equation system (Sec. 3.3.2):

    \sum_{k=1}^{M} a_k \sum_m s_{m-k} s_{m-i} = \sum_m s_m s_{m-i},   i = 1, ..., M
Methodologies
• Autocorrelation Method
• Covariance Method
  – Not commonly used in Speech Recognition
Autocorrelation Method
• Assumption: each frame is independent (Fig. 3.29).
• Solution (Juang, Sec. 3.3.3, pp. 105-106):

    \sum_{k=1}^{M} a_k r_{|i-k|} = r_i,   i = 1, ..., M        (2)

  where

    r_i = \sum_m s_m s_{m+i} = \sum_{q=0}^{N-1-i} s_q s_{q+i}

• These equations are known as the Yule-Walker equations.
• Using matrix notation:

    R a = r

  where

    R = | r_0      r_1      r_2      ...  r_{M-1} |
        | r_1      r_0      r_1      ...  r_{M-2} |
        | r_2      r_1      r_0      ...  r_{M-3} |
        | ...                             ...     |
        | r_{M-1}  r_{M-2}  ...           r_0     |

  a = (a_1, ..., a_M)^T and r = (r_1, ..., r_M)^T.
• R is symmetric, and the elements along each diagonal (e.g. r_0 on the main diagonal) are the same.
• Such a matrix is known as a Toeplitz matrix. A linear system with a Toeplitz matrix can be solved very efficiently.
• Examples (Figs. 3.32 and 3.33)
• Example (Fig. 3.34)
• Example (Fig. 3.35)
• Example (Fig. 3.36)
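As a sketch (not from the slides), the autocorrelation analysis and the Toeplitz solve can be written in NumPy; the function name, model order, and the synthetic AR(2) test signal are illustrative assumptions:

```python
import numpy as np

def lpc_autocorrelation(frame, order):
    """Autocorrelation method: solve the Yule-Walker system R a = r,
    where R[i, k] = r_|i-k| is the Toeplitz autocorrelation matrix."""
    n = len(frame)
    # Autocorrelations r_0 .. r_M of the (windowed) frame.
    r = np.array([np.dot(frame[:n - i], frame[i:]) for i in range(order + 1)])
    R = np.array([[r[abs(i - k)] for k in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])

# Synthetic AR(2) signal s_n = 0.9 s_{n-1} - 0.5 s_{n-2} + noise:
# the estimated LP coefficients should be close to [0.9, -0.5].
rng = np.random.default_rng(0)
s = np.zeros(4000)
for n in range(2, len(s)):
    s[n] = 0.9 * s[n - 1] - 0.5 * s[n - 2] + rng.standard_normal()
a = lpc_autocorrelation(s, order=2)
```

In practice the Toeplitz structure would be exploited (e.g. by the Durbin recursion described later) rather than a general linear solve.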
Linear Prediction for Automatic Speech Recognition
[Block diagram: Preemphasis (flattens the spectrum) → Windowing (to minimise signal discontinuity) → Autocorrelation analysis (equation (2), usually M = 8) → Durbin Algorithm → Conversion to cepstral coefficients (to minimise noise sensitivity) → Temporal derivatives (incorporate signal dynamics)]

The transfer function of the glottis can
be modelled as follows:
U g ( z) 
1
1   z 1   z 
1
1

1
0.9  1 , 2  10
.
2
The radiation effect can be modelled as
follows:
1
R( z)  1  z ,
0.9    10
.
Hence, to obtain the transfer function of the vocal tract
the other pole must be cancelled as follows:.
H( z)  1  z ,
1
0.9    10
.
Preemphasis sould be done only for sonorant sounds.
This process can be automated as follows.
rs (1)

rs (0)
where
rs (i ) is the autocorrelation function.
1

0
for sonorant sounds
for no - sonorant sounds
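The automated preemphasis can be sketched as follows; the helper name and the sinusoidal test signal are illustrative assumptions:

```python
import numpy as np

def preemphasis(s, coef=None):
    """First-order preemphasis y_n = s_n - gamma * s_{n-1}.
    If coef is None it is automated as r_s(1)/r_s(0): close to 1 for
    strongly correlated (sonorant) frames, small for noise-like frames."""
    s = np.asarray(s, dtype=float)
    if coef is None:
        coef = np.dot(s[:-1], s[1:]) / np.dot(s, s)
    return np.append(s[0], s[1:] - coef * s[:-1]), coef

# A slowly varying, sonorant-like signal yields gamma close to 1.
y, gamma = preemphasis(np.sin(2 * np.pi * 0.01 * np.arange(400)))
```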
Frame Blocking and Windowing
• Frames of N samples, with a frame shift of M samples.
• Windowing minimises signal discontinuities at the edges of the frames.
• A typical window is the Hamming window:

    w(n) = 0.54 - 0.46 \cos(2\pi n / (N - 1)),   0 \le n \le N - 1
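A minimal framing-and-windowing sketch; the parameter values N = 300, M = 100 (45 ms frames, 15 ms shift at 6.67 kHz) are taken from the typical-values table only as an illustration:

```python
import numpy as np

def frames(signal, N, M):
    """Split a signal into N-sample frames with an M-sample shift and
    apply a Hamming window w(n) = 0.54 - 0.46 cos(2*pi*n/(N-1))."""
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))
    starts = range(0, len(signal) - N + 1, M)
    return np.array([signal[t:t + N] * w for t in starts])

F = frames(np.ones(1000), N=300, M=100)
```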
LPC Analysis
• Converts the autocorrelation coefficients into an LPC "parameter set".
• LPC parameter sets:
  – LPC coefficients
  – Reflection (PARCOR) coefficients
  – Log area ratio coefficients
• The formal method to obtain the LPC parameter set is known as Durbin's method.
Durbin’s method

for (E_0 = r_0, p = 1; p \le p_max; p++)
{
    k_p = ( r_p - \sum_{i=1}^{p-1} a_i^{(p-1)} r_{p-i} ) / E_{p-1}
    a_p^{(p)} = k_p
    for (i = 1; i \le p-1; i++)
        a_i^{(p)} = a_i^{(p-1)} - k_p a_{p-i}^{(p-1)}
    E_p = (1 - k_p^2) E_{p-1}
}

Worked recursion (first three iterations):

    E_0 = r_0
    k_1 = r_1 / E_0,   a_1^{(1)} = k_1,   E_1 = (1 - k_1^2) E_0
    k_2 = ( r_2 - \sum_{i=1}^{1} a_i^{(1)} r_{2-i} ) / E_1,   a_2^{(2)} = k_2,   a_1^{(2)} = a_1^{(1)} - k_2 a_1^{(1)},   E_2 = (1 - k_2^2) E_1
    k_3 = ( r_3 - \sum_{i=1}^{2} a_i^{(2)} r_{3-i} ) / E_2,   a_3^{(3)} = k_3,   a_i^{(3)} = a_i^{(2)} - k_3 a_{3-i}^{(2)},   E_3 = (1 - k_3^2) E_2
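The recursion can be sketched directly from the pseudocode; the function and variable names are illustrative:

```python
import numpy as np

def durbin(r, p_max):
    """Levinson-Durbin recursion: converts autocorrelations r[0..p_max]
    into LP coefficients a, reflection (PARCOR) coefficients k and the
    final prediction error E, following the slide's loop structure."""
    a = np.zeros(p_max + 1)
    k = np.zeros(p_max + 1)
    E = r[0]
    for p in range(1, p_max + 1):
        # k_p = (r_p - sum_{i=1}^{p-1} a_i^(p-1) r_{p-i}) / E_{p-1}
        k[p] = (r[p] - np.dot(a[1:p], r[p - 1:0:-1])) / E
        a_new = a.copy()
        a_new[p] = k[p]
        for i in range(1, p):
            a_new[i] = a[i] - k[p] * a[p - i]
        a = a_new
        E = (1 - k[p] ** 2) * E
    return a[1:], k[1:], E

# AR(1)-style autocorrelations r_i = 0.5**i: expect a = [0.5, 0].
a, k, E = durbin(np.array([1.0, 0.5, 0.25]), p_max=2)
```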
LPC (Typical values)

Parameter | Fs = 6.67 kHz | Fs = 8 kHz  | Fs = 10 kHz
N         | 300 (45 ms)   | 240 (30 ms) | 300 (30 ms)
M         | 100 (15 ms)   | 80 (10 ms)  | 100 (10 ms)
p         | 8             | 10          | 10
Q         | 12            | 12          | 12
K         | 3             | 3           | 3
LPC Parameter Conversion
• Conversion to cepstral coefficients.
• Robust feature set for speech recognition.
• Algorithm:

    c_0 = \ln \sigma^2
    c_m = a_m + \sum_{k=1}^{m-1} (k/m) c_k a_{m-k},   1 \le m \le p
    c_m = \sum_{k=1}^{m-1} (k/m) c_k a_{m-k},         m > p

  (with a_k = 0 for k > p).
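A sketch of the recursion, taking a_k = 0 for k > p; the gain handling (c_0 from the log of the squared gain) and the names are assumptions for illustration:

```python
import numpy as np

def lpc_to_cepstrum(a, G, n_ceps):
    """Cepstral coefficients from LP coefficients a_1..a_p, under the
    convention H(z) = G / (1 - sum_i a_i z^-i); a_k = 0 for k > p."""
    p = len(a)
    c = np.zeros(n_ceps + 1)
    c[0] = np.log(G ** 2)
    for m in range(1, n_ceps + 1):
        am = a[m - 1] if m <= p else 0.0
        c[m] = am + sum((k / m) * c[k] * (a[m - k - 1] if 1 <= m - k <= p else 0.0)
                        for k in range(1, m))
    return c

# Single pole a_1 = 0.5: the minimum-phase cepstrum is c_m = 0.5**m / m.
c = lpc_to_cepstrum(np.array([0.5]), G=1.0, n_ceps=4)
```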
Parameter Weighting
• Low-order cepstral coefficients are highly sensitive to noise.

Temporal Cepstral Derivative
• First- or second-order derivatives are enough.
• The derivative can be approximated as follows:

    \Delta c_m(t) \approx \mu \sum_{k=-K}^{K} k \, c_m(t + k)
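A sketch of the derivative approximation; the least-squares normaliser mu and the edge clamping are assumptions (the slides leave them unspecified):

```python
import numpy as np

def delta(C, K=2):
    """Temporal cepstral derivative: D_c(t) ~ mu * sum_{k=-K}^{K} k * c(t+k),
    applied row-wise to a (frames x coefficients) matrix C. Frame indices
    beyond the edges are clamped to the first/last frame."""
    mu = 1.0 / sum(k * k for k in range(-K, K + 1))  # least-squares slope norm
    T = len(C)
    D = np.zeros_like(C)
    for t in range(T):
        for k in range(-K, K + 1):
            D[t] += k * C[min(max(t + k, 0), T - 1)]
    return mu * D

# A linear ramp has slope 1 away from the edges.
D = delta(np.arange(10.0).reshape(-1, 1), K=2)
```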
Prediction Error
• Given the predictor

    \tilde{s}_n = \sum_{i=1}^{p} a_i s_{n-i}

  and the model

    s_n = \sum_{i=1}^{p} a_i s_{n-i} + G u_n

  the prediction error is

    e_n = s_n - \tilde{s}_n = s_n - \sum_{i=1}^{p} a_i s_{n-i}

  and the total squared error

    E_n = \sum_n (s_n - \tilde{s}_n)^2 = \sum_n e_n^2

  is minimised by setting \partial E_n / \partial a_k = 0.

Observations (from the example figures):
• Hamming-windowed frames show large prediction errors at the frame start, since speech is predicted from previous samples arbitrarily set to zero.
• Unvoiced signals are not position sensitive; they show no special effect at the edges.
• Observe the "whitening" phenomenon in the error spectrum.
• Observe the periodic behaviour of the error waveform, taken as the basis for pitch estimators.
• Observe that a sharp decrease in the prediction error is obtained for small M (M = 1...4), and that unvoiced signals have a higher RMS error.
• Observe the all-pole model's ability to match the spectrum.
Linear Prediction in Speech Processing
• LPC for Vocal Tract Shape Estimation
• LPC for Pitch Detection
• LPC for Formant Estimation

LPC for Vocal Tract Shape Estimation
[Block diagram: Preemphasis (free of glottis and radiation effects) → Windowing (to minimise signal discontinuity) → Parameter Calculation (reflection coefficients k_n, 0 \le n \le N) → Vocal Tract Shape Estimation; conversion to cepstral coefficients to minimise noise sensitivity]
Parameter Calculation
• Durbin's Method (as in Speech Recognition)
  – If this method is used, the autocorrelation analysis must be performed first.
• Lattice Filter

Lattice Filter
• The reflection coefficients are obtained directly from the signal, avoiding the autocorrelation analysis.
• Methods:
  – Itakura-Saito (PARCOR)
  – Burg
  – New forms
• Advantage:
  – Easier to implement in hardware.
• Disadvantage:
  – Needs around 5 times more computation.
Itakura-Saito (PARCOR)

    k_p = \sum_n e_n^{(p-1)} f_{n-1}^{(p-1)} / \sqrt{ \sum_n (e_n^{(p-1)})^2 \cdot \sum_n (f_{n-1}^{(p-1)})^2 }

where

    e_n^{(p)} = e_n^{(p-1)} - k_p f_{n-1}^{(p-1)}
    f_n^{(p)} = -k_p e_n^{(p-1)} + f_{n-1}^{(p-1)}

• It can be shown that the PARCOR coefficients obtained by the Itakura-Saito method are exactly the same as the reflection coefficients obtained by the Levinson-Durbin algorithm.
[Lattice stage diagram: e_n^{(p-1)} and the delayed f_{n-1}^{(p-1)} enter the stage; each is scaled by -k_p and cross-added to produce e_n^{(p)} and f_n^{(p)}; the correlations accumulate over time (n). Example figure follows.]
Burg

    k_p = 2 \sum_n e_n^{(p-1)} f_{n-1}^{(p-1)} / ( \sum_n (e_n^{(p-1)})^2 + \sum_n (f_{n-1}^{(p-1)})^2 )

where

    e_n^{(p)} = e_n^{(p-1)} - k_p f_{n-1}^{(p-1)}
    f_n^{(p)} = -k_p e_n^{(p-1)} + f_{n-1}^{(p-1)}

[Lattice filter diagram: s_n → e_n^{(0)} = f_n^{(0)}; stage p takes e_n^{(p-1)} and the delayed (z^{-1}) f_{n-1}^{(p-1)}, applies the -k_p cross-terms, and outputs e_n^{(p)} and f_n^{(p)}; cascading stages p = 1, ..., P yields the full analysis. Example figures for Itakura-Saito and Burg.]
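A sketch of the Burg lattice recursion; the update signs follow the recursions above, while the in-place array handling and the AR(1) test signal are illustrative assumptions:

```python
import numpy as np

def burg(s, order):
    """Burg method: reflection coefficients obtained directly from the
    signal via forward (e) and backward (f) prediction errors, avoiding
    the explicit autocorrelation analysis."""
    e = np.asarray(s, dtype=float).copy()   # e_n^(0)
    f = e.copy()                            # f_n^(0)
    k = np.zeros(order)
    for p in range(order):
        # Overlapping samples: e_n and the delayed f_{n-1}, for n > p.
        ef = e[p + 1:].copy()
        fb = f[p:-1].copy()
        k[p] = 2 * np.dot(ef, fb) / (np.dot(ef, ef) + np.dot(fb, fb))
        e[p + 1:] = ef - k[p] * fb          # e_n^(p) = e_n^(p-1) - k_p f_{n-1}^(p-1)
        f[p + 1:] = fb - k[p] * ef          # f_n^(p) = f_{n-1}^(p-1) - k_p e_n^(p-1)
    return k

# AR(1) signal s_n = 0.7 s_{n-1} + noise: expect k ~ [0.7, 0].
rng = np.random.default_rng(1)
s = np.zeros(5000)
for n in range(1, len(s)):
    s[n] = 0.7 * s[n - 1] + rng.standard_normal()
k = burg(s, order=2)
```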
New Forms
• Strobach, "New forms of Levinson and Schur algorithms", IEEE Signal Processing Magazine, pp. 12-36, 1991.
Vocal Tract Shape Estimation
• From

    k_n = (A_{n+1} - A_n) / (A_{n+1} + A_n),   0 \le n \le N

  we obtain

    A_n = A_{n+1} (1 - k_n) / (1 + k_n),   0 \le n \le N

• Therefore, by setting the lip area to an arbitrary value we can obtain the vocal tract configuration relative to the initial condition.
• This technique has been successfully used to train deaf persons.
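The area recursion can be sketched as follows; working inward from an arbitrary lip area (so the areas are only relative), with the traversal direction and names as illustrative assumptions:

```python
import numpy as np

def area_function(k, lip_area=1.0):
    """Relative vocal tract areas from reflection coefficients, applying
    A_n = A_{n+1} (1 - k_n) / (1 + k_n) section by section, starting from
    an arbitrary lip area."""
    A = [lip_area]
    for kn in k:
        A.append(A[-1] * (1 - kn) / (1 + kn))
    return np.array(A)

# k = 0.5 shrinks the area by a factor of 3; k = -0.5 expands it back.
A = area_function([0.5, -0.5])
```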
LPC for Pitch Detection
[Block diagram: Speech sampled at 10 kHz → LPF at 800 Hz → Downsampler 5:1 → LPC Analysis → Inverse Filtering with A(z) → Autocorrelation → Peak finding → V/U decision and Pitch]
LPC for Formant Detection
[Block diagram: Sampled Speech → LPC Analysis → LPC Spectrum → Peak Emphasis (second derivative) → Peak finding → Formants]

LP assumes that the vocal tract system
can be modelled with an all-pole
system:
1
A( z )

The spectrum can be obtain by
z  e j
1
A(e j )

In order to emphasis formant peaks we
j
z


e
0.9    1.0
can set
Therefore
P
A( z ) 
a z
i
i
i 0
Spectrum (DTFT)
A( z) z  e j 
P
a 
i 0
i
i
e  ji
Spectrum (DFT)
A( e j )   2k 
N
P
a  e
i
i 0
 2k 
 j
i
 N 
i
In order to increase the spectral resolution we pad with zeros:
A( e
j
2k
N
N
)
a  e
i
i 0
i
 2k 
 j
i
 N 
N>>P;
In order to use an FFT algorithm N  2 n
ai =0
i>P
Caclulate the Spectral magnitude(DFT)
A( e
j
2k
N
)
k=0,...,N-1
Invert the Spectral magnitude(DFT)
1
A( e
2 k
j
N
k=0,...,N-1
)
This spectrum is called the LPC Spectrum.
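The steps above (zero-padding to a power of two, FFT, magnitude, inversion) can be sketched as follows, using the sign convention A(z) = 1 - \sum_i a_i z^{-i} from the transfer function given earlier; the function name and FFT length are illustrative:

```python
import numpy as np

def lpc_spectrum(a, G=1.0, n_fft=256):
    """LPC spectrum G / |A(e^{j 2 pi k / n_fft})|: zero-pad the
    coefficient sequence [1, -a_1, ..., -a_p] to n_fft points (a power
    of two so an FFT can be used), take the magnitude and invert it."""
    coeffs = np.zeros(n_fft)
    coeffs[0] = 1.0
    coeffs[1:len(a) + 1] = -np.asarray(a)
    return G / np.abs(np.fft.fft(coeffs))

# Single pole a_1 = 0.5: S[0] = 1/|1 - 0.5| = 2 at DC.
S = lpc_spectrum([0.5], n_fft=8)
```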
How good is the LP Model?
• As shown by the physiological analysis of the vocal tract, the speech model is as follows:

    \Theta(z) = G (1 - \sum_{i=1}^{L} b_i z^{-i}) / (1 - \sum_{i=1}^{R} a_i z^{-i})

• However, it can be shown that the LP model is good for estimating the magnitude of a pole-zero system.

Proof
• According to Lemma 1 and Lemma 2, \Theta(z) can be written as follows:

    \Theta(z) = G [ 1 / (1 - \sum_{i=1}^{I} a_i z^{-i}) ] \Theta_{ap}(z)

  where \Theta_{ap}(z) is the all-pass component.
• The estimates \hat{a}_i are calculated such that they correspond to the a_i of this model.
• Since |\Theta_{ap}(z)| = 1, we have |\Theta(z)| = G |\Theta_{min}(z)|; therefore, if the estimates \hat{a}_i are exact, then at least we obtain a model with the correct magnitude.
Lemma 1
• Lemma 1 (System Decomposition): any causal rational system

    \Theta(z) = G (1 - \sum_{i=1}^{L} b_i z^{-i}) / (1 - \sum_{i=1}^{R} a_i z^{-i})

  can be decomposed as:

    \Theta(z) = G \Theta_{min}(z) \Theta_{ap}(z)

  where \Theta_{min}(z) is the minimum-phase component and \Theta_{ap}(z) is the all-pass component.
Proof
• For two poles and two zeros:

    \Theta(z) = (z - 1/\beta)(z - 1/\beta^*) / [(z - p)(z - p^*)]

• Multiplying and dividing by (z - \beta)(z - \beta^*):

    \Theta(z) = [ (z - \beta)(z - \beta^*) / ((z - p)(z - p^*)) ] \cdot [ (z - 1/\beta)(z - 1/\beta^*) / ((z - \beta)(z - \beta^*)) ]

• Let us define:

    \Theta_x(z) = (z - 1/\beta)(z - 1/\beta^*) / [(z - \beta)(z - \beta^*)]

• Re-arranging this equation:

    \Theta_x(z) = (\beta z - 1)(\beta^* z - 1) / [\beta \beta^* (z - \beta)(z - \beta^*)]
                = z^2 (\beta - z^{-1})(\beta^* - z^{-1}) / [|\beta|^2 (z - \beta)(z - \beta^*)]
                = z^2 (z^{-1} - \beta)(z^{-1} - \beta^*) / [|\beta|^2 (z - \beta)(z - \beta^*)]
                = (z^2 / |\beta|^2) A(z^{-1}) / A(z),   with A(z) = (z - \beta)(z - \beta^*)

• With the knowledge that |H(e^{j\omega})|^2 = H(z) H(z^{-1})|_{z = e^{j\omega}}:

    |\Theta_x(e^{j\omega})|^2 = (1/|\beta|^4) [ A(z^{-1}) A(z) / (A(z) A(z^{-1})) ]|_{z = e^{j\omega}} = 1/|\beta|^4

• Hence |\Theta_x(e^{j\omega})|^2 is constant: \Theta_x is, up to a gain, an all-pass system. Therefore:

    \Theta_{min}(z) = (z - \beta)(z - \beta^*) / [(z - p)(z - p^*)]
    \Theta_{ap}(z)  = (z - 1/\beta)(z - 1/\beta^*) / [(z - \beta)(z - \beta^*)]

End of proof.
Lemma 2
• Lemma 2: the minimum-phase component can be expressed as an all-pole system:

    \Theta_{min}(z) = 1 / (1 - \sum_{i=1}^{I} a_i z^{-i})

• In theory I goes to infinity; in practice it is limited.
Linear Prediction Based Processing
• Criticisms of the Linear Prediction Model
• Perceptual Linear Prediction (PLP)
• LP Cepstra

Criticisms of the Linear Prediction Model
• The LP spectrum approximates the speech spectrum equally well at all frequencies of the analysis band.
• This property is inconsistent with human hearing.
Perceptual Linear Prediction (PLP)
[Block diagram: Speech → Critical-Band Spectral Analysis → Equal-Loudness Pre-emphasis → Intensity-to-Loudness conversion → IDFT → Yule-Walker equation solution → a_i]

Critical Band Analysis
[Block diagram: Speech Signal → Frame Windowing (20 ms Hamming window; for Fs = 10 kHz, 200 samples plus 56 zeros of padding) → DFT → Short-Term Spectrum → Critical-Band Spectral Resolution]
Critical-Band Spectral Resolution
• Frequency warping (Hertz → Barks) of the short-term power spectrum P(\omega):

    \Omega(\omega) = 6 \ln\{ \omega/(1200\pi) + [ (\omega/(1200\pi))^2 + 1 ]^{0.5} \}

• Convolution and downsampling with an approximation \Psi(\Omega) of the filter-bank masking curve:

    \Theta(\Omega_i) = \sum_{\Omega = -1.3}^{2.5} P(\Omega - \Omega_i) \Psi(\Omega),   i = 1, ..., 18

  where

    \Psi(\Omega) = 0                        for \Omega < -1.3
                 = 10^{2.5(\Omega + 0.5)}   for -1.3 \le \Omega \le -0.5
                 = 1                        for -0.5 \le \Omega \le 0.5
                 = 10^{-1.0(\Omega - 0.5)}  for 0.5 \le \Omega \le 2.5
                 = 0                        for \Omega > 2.5
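With \omega = 2\pi f, the warping formula reduces to 6 asinh(f / 600); a sketch (the function name is an assumption):

```python
import numpy as np

def hertz_to_bark(f_hz):
    """Bark warping Omega(omega) = 6 ln(omega/1200pi + sqrt((omega/1200pi)^2 + 1)),
    written in terms of frequency in Hz: omega/(1200 pi) = f/600."""
    x = np.asarray(f_hz, dtype=float) / 600.0
    return 6.0 * np.log(x + np.sqrt(x * x + 1.0))
```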
Equal Loudness Pre-emphasis
• Approximates the non-equal sensitivity of human hearing at different frequencies:

    E(\omega) = [ (\omega^2 + 56.8 \times 10^6) \omega^4 ] / [ (\omega^2 + 6.3 \times 10^6)^2 (\omega^2 + 0.38 \times 10^9) (\omega^6 + 9.58 \times 10^{26}) ]

Intensity-Loudness Power Law
• Approximates the non-linear relation between the intensity of sound and its perceived loudness:

    \Phi(\Omega) = \Theta(\Omega)^{1/3}
Cepstral Analysis
• Introduction
• Homomorphic Processing
• Cepstral Spectrum
• Cepstrum
• Mel-Cepstrum
• Cepstrum in Speech Processing
Introduction
• The (pre-emphasised) speech spectrum can be written as

    S(\omega) = E(\omega) H(\omega)

• The excitation is not necessary for estimating the vocal tract function. Therefore, it is desirable to separate the excitation information from the vocal tract information.
• If we regard the speech spectrum itself as a signal, we can observe that it is composed of the product of a slowly varying signal, H(\omega), and a rapidly varying signal, E(\omega).
• The formal technique which exploits this structure is called "Homomorphic Processing".
Homomorphic Processing
• A technique for filtering signals that are combined non-linearly.
• In homomorphic processing, the non-linearly related signals are transformed into a linear domain:

    [Diagram: H[ ] → F(z) → H^{-1}[ ]]

• To obtain a linear system, a complex log transformation is applied to the speech spectrum:

    [Diagram: log[ ] → S^{+}(z) → exp[ ]]

    S^{+}(\omega) = \log E(\omega) + \log H(\omega)
Cepstral Spectrum
• Definition:

    \hat{s}(n) = F^{-1}\{ \log F\{s(n)\} \} = (1/2\pi) \int_0^{2\pi} \log S(e^{j\omega T}) e^{j\omega n T} d\omega

  where F is the STFT.
• Since

    S(e^{j\omega T}) = |S(e^{j\omega T})| e^{j \angle S(e^{j\omega T})}

  we have

    \log S(e^{j\omega T}) = \log |S(e^{j\omega T})| + j \angle S(e^{j\omega T})
Cepstrum
• Definition:

    c_s(n) = F^{-1}\{ \mathrm{Re}[ \log F\{s(n)\} ] \} = (1/2\pi) \int_0^{2\pi} \log |S(e^{j\omega T})| e^{j\omega n T} d\omega
Cepstrum in Speech Processing
• Pitch Estimation
• Formant Estimation
• Pitch and Formant Estimation

Pitch Estimation
[Block diagram: Sampled Speech → Cepstrum → High-Pass Liftering → Peak Emphasis (second derivative) → Peak finding → Pitch]

Formant Estimation
[Block diagram: Sampled Speech → Cepstrum → Low-Pass Liftering → Peak Emphasis (second derivative) → Peak finding → Formants]

Pitch and Formant Estimation
[Block diagram: Sampled Speech → Cepstrum → High-Pass Liftering → Peak Emphasis (second derivative) → Peak finding → Pitch, and in parallel Low-Pass Liftering → Peak Emphasis (second derivative) → Peak finding → Formants]