Glottal Pulse Estimation


Itay Ben-Lulu & Uri Goldfeld
Instructor : Dr. Yizhar Lavner
Spring 2004
23/9/2004
Abstract
Goal : Estimation of glottal volume velocity (also called
glottal pulse) from acoustic speech signal samples.
Three estimation methods are examined:
1. Least Squares Glottal Inverse Filtering from the Acoustic
Speech Waveform – by Wong, Markel & Gray, 1979.
2. Pitch Synchronous Iterative Adaptive Inverse Filtering
(PSIAIF) – by Alku, 1992.
3. Estimation of the Glottal Flow Derivative Waveform
Through Formant Modulation (From: Modeling of the Glottal
Flow Derivative Waveform with Application to Speaker
Identification) – by Plumpe, Quatieri & Reynolds, 1997.
Applications
• Speech synthesis – knowledge of the glottal frequency is important to produce a synthetic speech that sounds natural.
• There are explicit differences between male and female glottal pulses.
• Different glottal excitations produce different phonation types: normal, pressed, breathy.
• Glottal pulse has great importance in determining speech types: angry voice, soft voice, happy voice, etc.
Discrete-Time System Model
for Speech Production
Block diagram (signal flow): Impulse train generator → $e(n)$ → Glottal pulse model $G(z)$ → $u_G(n)$ → Voiced/unvoiced switch → Vocal tract model $V(z)$ → $u_L(n)$ → Radiation model $R(z)$ → $s(n)$. A random noise generator feeds the second input of the voiced/unvoiced switch.
Denote:
$u_G(n)$ - glottal volume velocity (glottal pulse)
$s(n)$ - speech pressure wave signal
$q(n)$ - glottal volume velocity derivative
For voiced speech: the input to $V(z)$ is the glottal pulse, $u_G(n)$.
For unvoiced speech: the input to $V(z)$ is random noise.
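As an illustration of the model above, the following Python sketch synthesizes a voiced segment by passing an impulse train through $G(z)$, $V(z)$ and $R(z)$ in turn. The two-pole glottal model, the formant frequencies and bandwidths, the sampling rate, the radiation coefficient `mu` and all function names are assumptions chosen for the example; none of them come from the slides.

```python
import numpy as np

def iir_filter(b, a, x):
    """Direct-form filter y = (B(z)/A(z)) x, assuming a[0] == 1."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y[n] = acc
    return y

def resonator_poly(freqs_hz, bws_hz, fs):
    """Denominator of an all-pole V(z): one complex pole pair per formant."""
    a = np.array([1.0])
    for f, bw in zip(freqs_hz, bws_hz):
        r = np.exp(-np.pi * bw / fs)          # pole radius from bandwidth
        theta = 2 * np.pi * f / fs            # pole angle from frequency
        a = np.convolve(a, [1.0, -2 * r * np.cos(theta), r * r])
    return a

def synthesize_voiced(n_samples, fs=8000, f0=100.0, mu=0.99):
    """Impulse train -> G(z) -> V(z) -> R(z); returns e, u_G, u_L, s."""
    e = np.zeros(n_samples)
    e[::int(round(fs / f0))] = 1.0            # impulse train e(n)
    # Assumed glottal model G(z) = 1 / (1 - 0.9 z^-1)^2 (hypothetical choice).
    u_g = iir_filter([1.0], np.convolve([1.0, -0.9], [1.0, -0.9]), e)
    # Assumed vocal tract with three formants (illustrative values only).
    a_v = resonator_poly([700, 1200, 2500], [80, 100, 150], fs)
    u_l = iir_filter([1.0], a_v, u_g)
    s = iir_filter([1.0, -mu], [1.0], u_l)    # R(z) = 1 - mu z^-1
    return e, u_g, u_l, s
```

Calling `synthesize_voiced(800)` gives a test signal whose true glottal pulse `u_g` is known, which is convenient for checking an inverse-filtering implementation.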
Least Squares Glottal Inverse Filtering
from the Acoustic Speech Waveform
(Wong, Markel and Gray)
• The vocal-tract model is assumed to be an all-pole model:
$$V(z) = \frac{1}{1 + \sum_{i=1}^{K} c_i z^{-i}}$$
where K is an even integer.
• The lip radiation model is given by a differencing filter:
$$R(z) = 1 - \mu z^{-1}, \qquad 0.98 < \mu < 1$$
• Taking the Z-transform gives:
$$U_G(z) = \frac{S(z)}{V(z)\,R(z)} \qquad (*)$$
⇒ The problem is estimating the vocal-tract transfer function, $V(z)$.
• Assume that an M-th order analysis filter of the form
$$A(z) = \sum_{i=0}^{M} a_i z^{-i}, \qquad M \ge K$$
is to be obtained using the covariance method of linear prediction of the speech signal.
• Then, we can estimate the transform of the glottal volume velocity:
$$\hat{U}_G(z) = \frac{S(z)\,A(z)}{R(z)}$$
where $A(z)$ is an all-zero filter, the inverse of the vocal-tract estimate:
$$\hat{V}(z) = \frac{1}{A(z)}$$
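A minimal Python sketch of the two operations described above: fitting $A(z)$ with the covariance method of linear prediction, then inverse filtering and integrating. The function names, the use of a generic least-squares solver in place of explicitly formed normal equations, and the value `mu = 0.99` are assumptions made for illustration.

```python
import numpy as np

def covariance_lpc(frame, order):
    """Covariance-method LPC on one frame.

    Returns [1, a_1, ..., a_M] minimizing
    sum_{n=M}^{N-1} (s(n) + sum_i a_i s(n-i))^2 with no windowing.
    """
    s = np.asarray(frame, dtype=float)
    n = np.arange(order, len(s))
    # Column i-1 holds the delayed samples s(n - i).
    X = np.column_stack([s[n - i] for i in range(1, order + 1)])
    coeffs, *_ = np.linalg.lstsq(X, -s[n], rcond=None)
    return np.concatenate(([1.0], coeffs))

def inverse_filter(s, a, mu=0.99):
    """Apply A(z), then the integrator 1 / (1 - mu z^-1).

    Returns (q, u_g): the flow-derivative estimate and the glottal pulse estimate.
    """
    s = np.asarray(s, dtype=float)
    q = np.convolve(s, a)[:len(s)]            # q(n) = sum_i a_i s(n-i)
    u_g = np.zeros(len(s))
    acc = 0.0
    for n, qn in enumerate(q):
        acc = qn + mu * acc                   # u(n) = q(n) + mu u(n-1)
        u_g[n] = acc
    return q, u_g
```

On a segment that is a pure free response of an all-pole system (no excitation inside the frame), the fit recovers the system coefficients and the inverse-filtered signal is close to zero, which is the property the closed-phase analysis relies on.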
Analysis Procedure – Block Diagram
Block diagram (signal flow):
$s(n)$ → Linear Phase High-Pass Filter → $s_H(n)$ → Sequential Covariance Analysis → $\alpha_M(n)$ → Normalized Error Criterion → $\eta(n)$ → Searching for Minimal Periods → $\{n_1^j\}_j, \{n_2^j\}_j$ → Vocal Tract Model Estimation (LPC) → $A(z)$ → Polynomial Root Solving → $A(z) = 1/\hat{V}(z)$ → $\hat{q}(n)$ → $1/R(z) = 1/(1 - \mu z^{-1})$ → $\hat{u}_G(n)$
Pitch Detection supplies the pitch length.
Algorithm Stages
1. Linear Phase High-Pass Filter –
The speech signal $s(n)$ is passed through a high-pass filter.
2. Sequential Covariance Analysis –
An N-length analysis window is sequentially moved one sample at a time throughout $s_H(n)$. We obtain the total squared error:
$$\alpha_M(n) = \sum_{j=n}^{n+N-M-1} \varepsilon^2(j), \qquad \text{where} \quad \varepsilon(n) = \sum_{i=1}^{M} (a_i - c_i)\, s(n-i)$$
[Figure: position of the N-length analysis window, which spans samples $n-M$ to $n-M+N-1$.]
3. Normalized Error Criterion –
Obtaining $\eta(n)$ by:
$$\eta(n) = \frac{\alpha_M(n)}{\alpha_0(n)}, \qquad \text{where} \quad \alpha_0(n) = \sum_{j=n}^{n+N-M-1} s^2(j)$$
4. Searching for Minimal Values Periods –
Scanning $\eta(n)$ to find the intervals where it attains minimal values.
We denote the first and last samples in each interval by: $\{n_1^j\}_j$, $\{n_2^j\}_j$
These intervals are needed for determining the points of glottal closure and opening: $L_c^j = n_1^j - 1$, $L_o^j = n_2^j + N - M - 1$
5. Vocal Tract Model Estimation –
The prediction error filter $A(z)$ is estimated using LPC at each closed phase interval, determined by $\{L_c^j\}$, $\{L_o^j\}$.
6. Polynomial Root Solving –
Removing real poles (close to zero frequency) and high
bandwidth poles, from the filter A( z ) .
7. Inverse Filtering + Integration –
The original speech signal $s(n)$ is passed through the inverse filter $A(z) = 1/\hat{V}(z)$, and then through an integrator:
$$\frac{1}{R(z)} = \frac{1}{1 - \mu z^{-1}}$$
Finally, we obtain the estimate of the glottal pulse, $\hat{u}_G(n)$.
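Stages 2 to 4 above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the covariance fit uses a generic least-squares solver, and the fixed `threshold` stands in for the numerical criterion that selects the minimal-value intervals, which the slides do not specify.

```python
import numpy as np

def normalized_error(s, order, window):
    """Return (eta, first_n); eta[k] is the normalized error at n = first_n + k."""
    s = np.asarray(s, dtype=float)
    M, N = order, window
    eta = []
    for n in range(M, len(s) - N + M + 1):
        j = np.arange(n, n + N - M)                     # j = n .. n+N-M-1
        X = np.column_stack([s[j - i] for i in range(1, M + 1)])
        a, *_ = np.linalg.lstsq(X, -s[j], rcond=None)   # covariance-method fit
        alpha_m = np.sum((s[j] + X @ a) ** 2)           # total squared error
        alpha_0 = np.sum(s[j] ** 2)                     # energy over the same j
        eta.append(alpha_m / alpha_0 if alpha_0 > 0 else 0.0)
    return np.array(eta), M

def low_error_intervals(eta, first_n, threshold):
    """Runs of consecutive n with eta(n) < threshold, as (n1, n2) pairs."""
    intervals, start = [], None
    for k, low in enumerate(eta < threshold):
        if low and start is None:
            start = k
        elif not low and start is not None:
            intervals.append((start + first_n, k - 1 + first_n))
            start = None
    if start is not None:
        intervals.append((start + first_n, len(eta) - 1 + first_n))
    return intervals

def closed_phase_bounds(intervals, order, window):
    """Map each (n1, n2) to (L_c, L_o) = (n1 - 1, n2 + N - M - 1)."""
    return [(n1 - 1, n2 + window - order - 1) for n1, n2 in intervals]
```

For a synthetic all-pole signal excited by isolated impulses, the low-error intervals fall between the impulses, and the derived closure index lands on the excitation instant.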
Example of Glottal Pulse Estimation with LS
Algorithm for Normal AA Vowel :
Example of Glottal Pulse Estimation with
LS Algorithm for Pressed AA Vowel :
Algorithm Drawbacks
• Normalized Error Criterion Calculation –
For long voice signals, computing the criterion may become too complex.
• Closed Period Identification –
In noisy voice signals it may be difficult to determine where the normalized error criterion, $\eta(n)$, attains its minimal values (stage 4). An insufficiently accurate closed period identification causes poor glottal pulse estimation.
• Minimal Values Periods Criterion –
The numerical criterion for determining the minimal-value periods of $\eta(n)$ may need to be adapted to some voice signals.
PSIAIF - Pitch Synchronous Iterative
Adaptive Inverse Filtering (Alku)
• A reliable response to some drawbacks in the first Inverse
Filtering algorithm.
• This algorithm is based on the speech production model:
Glottal Excitation → Vocal Tract → Lip Radiation → Speech
• Assumptions for this model:
1. the model is linear and time-invariant during a short time
interval.
2. the interaction between different processes is negligible.
3. the lip radiation effect is modeled with a fixed differentiator.
The PSIAIF Analysis Method
IAIF Method:
• The main idea: we can estimate the vocal tract accurately
enough with LPC analysis, if the tilting effect of the glottal
source is eliminated from the speech spectrum.
• Estimation of the glottal pulse is computed in the IAIF algorithm with an iterative structure that is repeated twice.
PSIAIF Method:
• In order to improve the performance of LPC analysis in the
estimation of the vocal tract transfer function, the final glottal
wave estimate is computed pitch synchronously.
Structure of the IAIF Algorithm
Block diagram (blocks in processing order):
$s(n)$ → LPC analysis of order 1 → $H_{g1}(z)$ → Inverse Filtering → $u_1(n)$ → LPC analysis of order $t_1$ → $H_{vt1}(z)$ → Inverse Filtering → $u_2(n)$ → Integration → $g_1(n)$ → LPC analysis of order $g_2$ → $H_{g2}(z)$ → Inverse Filtering → $u_3(n)$ → LPC analysis of order $t_2$ → $H_{vt2}(z)$ → Inverse Filtering → $u_4(n)$ → Integration → $g_a(n)$
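A rough Python sketch of the two-pass IAIF structure. Several points are assumptions rather than details given in the slides: every inverse-filtering step is applied to the input signal $s(n)$ (as IAIF is commonly described), LPC uses the autocorrelation method with a Hann window, and the default orders and the integrator coefficient `mu` are illustrative values. All function names are hypothetical.

```python
import numpy as np

def lpc_autocorr(x, order):
    """Autocorrelation-method LPC; returns [1, a_1, ..., a_p]."""
    x = np.asarray(x, dtype=float) * np.hanning(len(x))
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    # Tiny diagonal loading keeps the Toeplitz system well conditioned.
    a = np.linalg.solve(R + 1e-9 * r[0] * np.eye(order), -r[1:])
    return np.concatenate(([1.0], a))

def inverse_filter(x, a):
    """Apply the all-zero filter A(z) to x."""
    return np.convolve(x, a)[:len(x)]

def integrate(x, mu=0.99):
    """Leaky integrator 1 / (1 - mu z^-1), cancelling the lip radiation."""
    y = np.zeros(len(x))
    acc = 0.0
    for n, v in enumerate(x):
        acc = v + mu * acc
        y[n] = acc
    return y

def iaif(s, order_t1=10, order_g2=4, order_t2=10, mu=0.99):
    """Two-pass IAIF sketch; returns the glottal wave estimate."""
    s = np.asarray(s, dtype=float)
    h_g1 = lpc_autocorr(s, 1)             # first glottal (tilt) estimate
    u1 = inverse_filter(s, h_g1)          # remove the tilt
    h_vt1 = lpc_autocorr(u1, order_t1)    # first vocal tract estimate
    u2 = inverse_filter(s, h_vt1)         # remove the vocal tract
    g1 = integrate(u2, mu)                # first glottal wave estimate
    h_g2 = lpc_autocorr(g1, order_g2)     # refined glottal model
    u3 = inverse_filter(s, h_g2)
    h_vt2 = lpc_autocorr(u3, order_t2)    # refined vocal tract estimate
    u4 = inverse_filter(s, h_vt2)
    return integrate(u4, mu)
```

The second pass repeats the first with a higher-order glottal model, so the vocal tract estimate is computed from a spectrum with less residual tilt.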
Structure of the PSIAIF Algorithm
Block diagram (signal flow):
$s(n)$ → High-Pass Filtering → $s_{hp}(n)$ → IAIF-1 → $g_{pa}(n)$ → Pitch Synchronism → $\{n_0, n_1, \ldots\}$ → IAIF-2 → $u_g(n)$
The speech signal to be analyzed is denoted s(n).
The estimated glottal excitation is denoted u g (n).
• The speech signal s(n) is high-pass filtered.
• The high-pass filtered signal, $s_{hp}(n)$, is used as the input to the first IAIF analysis. The output is one frame of a pitch-asynchronous glottal wave estimate, $g_{pa}(n)$.
• The time indices of maximum glottal opening, $\{n_0, n_1, \ldots\}$, are computed for each frame of $g_{pa}(n)$. This computation requires knowledge of M, the average length of the pitch period. Prior knowledge of M lets us restrict the search for maximum glottal openings to short time intervals.
• The final estimate of the glottal excitation is obtained by analyzing the high-pass filtered speech signal, $s_{hp}(n)$, with the IAIF algorithm pitch synchronously.
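One way the search for maximum glottal openings could use the average pitch period M is sketched below. The slides do not give the actual search rule, so the strategy here (take the largest sample in the first period, then look for each following maximum in a short interval centred one period later) and the tolerance `tol` are assumptions; the function name is hypothetical.

```python
import numpy as np

def max_opening_indices(g_pa, period, tol=0.25):
    """Indices {n0, n1, ...} of successive maxima of g_pa.

    period is the average pitch period M in samples; each new maximum is
    searched within +/- tol * period of the previous one plus one period.
    """
    g = np.asarray(g_pa, dtype=float)
    half = max(1, int(round(tol * period)))
    marks = [int(np.argmax(g[:period]))]      # first maximum, full-period search
    while True:
        centre = marks[-1] + period
        lo, hi = centre - half, centre + half + 1
        if hi > len(g):                       # stop when the interval leaves the frame
            break
        marks.append(lo + int(np.argmax(g[lo:hi])))
    return marks
```

Consecutive indices can then delimit the pitch-synchronous frames passed to the second IAIF analysis.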
Example of Glottal Pulse Estimation with
PSIAIF Algorithm for Normal AA Vowel :
Example of Glottal Pulse Estimation with
PSIAIF Algorithm for Breathy AA Vowel :
Estimation of the Glottal Flow Derivative
Waveform Through Formant Modulation
(Plumpe)
• This algorithm is similar to Wong's Least-Squares algorithm, with a few differences (in principles and implementation).
• The vocal-tract model is assumed to be an all-pole model:
$$V(z) = \frac{1}{1 + \sum_{i=1}^{K} c_i z^{-i}}$$
where K is an even integer.
• The main goal is to estimate the vocal-tract transfer function, using the covariance method of linear prediction.
When we obtain the vocal-tract model estimate, we can easily estimate the glottal flow derivative:
$$\hat{Q}(z) = \frac{S(z)}{\hat{V}(z)}$$
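The relation above follows from the speech production model used for the least-squares method. The short derivation below assumes the same lip radiation filter $R(z)$ and treats it as the discrete approximation of the derivative, so that $Q(z) = R(z)\,U_G(z)$, with $\hat{V}(z) = 1/A(z)$:

```latex
\begin{aligned}
U_G(z) &= \frac{S(z)}{V(z)\,R(z)}, \\
Q(z) &= R(z)\,U_G(z) = \frac{S(z)}{V(z)}, \\
\hat{Q}(z) &= \frac{S(z)}{\hat{V}(z)} = S(z)\,A(z),
\qquad
\hat{U}_G(z) = \frac{\hat{Q}(z)}{R(z)}.
\end{aligned}
```

Under these assumptions, the flow derivative estimate needs only the inverse filter $A(z)$, and the glottal pulse itself would follow from one further integration.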
Analysis Procedure – Block Diagram
Block diagram (signal flow):
$s(n)$ → Linear Phase High-Pass Filter → $s_H(n)$ → Speech Waveform Whitening (LPC) → $g(n)$ → Peak Picking → $\{p_j\}$ → Measuring Formant Frequencies (LPC) → $F(n)$ → Formant Tracking → $F_1(n)$ → Setting Initial Stationary Region → $[n_1^j, n_2^j]$ → Extending Initial Stationary Region → $[N_1^j, N_2^j]$ → Vocal Tract Model Estimation (LPC) → $A(z)$ → Polynomial Root Solving → $A(z) = 1/\hat{V}(z)$ → $\hat{q}(n)$
Pitch Detection supplies the pitch length.
Algorithm Stages
1. Linear Phase High-Pass Filter –
The speech signal $s(n)$ is passed through a high-pass filter.
2. Speech Waveform Whitening –
The high-pass filtered speech signal $s_H(n)$ is whitened by inverse filtering with the covariance-method solution, using a one pitch-period frame update and a two pitch-period analysis window. Real zeros are removed from the LPC solution. A rough estimate of the glottal flow derivative, $g(n)$, is obtained.
3. Peak Picking –
The obtained rough estimate, $g(n)$, is scanned to identify the approximate time of glottal pulses through negative peak picking.
The negative peaks are marked by: $\{p_j\}$.
Example of Whitened Speech Waveform Peak
Picking for Pressed AA Vowel :
4. Measuring Formant Frequencies –
At each glottal cycle, a sliding covariance-based linear prediction analysis with a one-sample shift is used. The size of the rectangular analysis window is 2M, where M is the linear prediction order.
A vocal-tract estimate is found for each window.
5. Formant Tracking –
At each glottal cycle, the four lowest formants - calculated from the
vocal-tract estimates - are tracked by their frequency using a Viterbi
search. The cost function is the variance of the formant track
including the proposed pole to be added to the end of the track.
We obtain the formant track, F1 (n).
Example of Formant Tracking for Pressed AA Vowel :
6. Setting Initial Stationary Region –
Within each glottal cycle, we define a formant change function as:
$$D(n_0) = \sum_{i=n_0}^{n_0+M-1} \left| F_1(i) - F_1(i-1) \right|, \qquad 1 \le n_0 \le N - 3M$$
where M is the linear prediction order and N is the glottal cycle length.
The argument $n_0$ is varied to minimize $D(n_0)$: $n^* = \arg\min_{n_0} D(n_0)$
The initial stationary formant region is set to be: $[n^*, n^* + M]$
This region is denoted by: $[n_1^j, n_2^j]$.
7. Extending Initial Stationary Region –
The initial stationary formant region $[n_1^j, n_2^j]$ is extended to obtain the stationary formant region $[N_1^j, N_2^j]$.
The extension to the right is based on the following procedure (flowchart):
a. Identify the initial stationary region $[n_1, n_2]$.
b. Calculate the average $F_{avg}$ and standard deviation $\sigma_F$ over the interval $[n_1, n_2]$.
c. Is $\left| F_1(n_2 + 1) - F_{avg} \right| < 6\sigma_F$ ?
YES: include the point $n_2 + 1$ in the stationary region, set $n_2 \leftarrow n_2 + 1$, and return to step b.
NO: extend the region to the left.
Extending to the left: the final mean and standard deviation are kept constant.
8. Vocal Tract Model Estimation –
The prediction error filter $A(z)$ is estimated using LPC at each stationary formant region, determined by $\{N_1^j\}$, $\{N_2^j\}$.
9. Polynomial Root Solving –
Removing real poles (close to zero frequency) and high
bandwidth poles, from the filter A( z ) .
10. Inverse Filtering –
The original speech signal $s(n)$ is passed through the inverse filter $A(z) = 1/\hat{V}(z)$ to obtain the estimate of the glottal pulse derivative, $\hat{q}(n)$.
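Stages 6 and 7 above can be sketched in Python as follows. The sketch assumes absolute first differences in the formant change function, takes the factor 6 from the flowchart condition as the default threshold `k`, and recomputes the mean and standard deviation after every step to the right. Function names are hypothetical, and the formant track `f1` is indexed from 0 within one glottal cycle.

```python
import numpy as np

def initial_region(f1, order):
    """Return (n1, n2) = (n*, n* + M), where n* minimizes D(n0).

    D(n0) = sum_{i=n0}^{n0+M-1} |F1(i) - F1(i-1)|, for 1 <= n0 <= N - 3M.
    """
    f1 = np.asarray(f1, dtype=float)
    M, N = order, len(f1)
    diffs = np.abs(np.diff(f1))               # diffs[i-1] = |F1(i) - F1(i-1)|
    d = [diffs[n0 - 1:n0 + M - 1].sum() for n0 in range(1, N - 3 * M + 1)]
    n_star = 1 + int(np.argmin(d))            # first minimizer if there are ties
    return n_star, n_star + M

def extend_region(f1, n1, n2, k=6.0):
    """Extend [n1, n2] to the right, then to the left; returns (N1, N2)."""
    f1 = np.asarray(f1, dtype=float)
    # Right: statistics are recomputed each time a point is added.
    while n2 + 1 < len(f1):
        seg = f1[n1:n2 + 1]
        if abs(f1[n2 + 1] - seg.mean()) < k * seg.std():
            n2 += 1
        else:
            break
    # Left: the final mean and standard deviation are kept constant.
    seg = f1[n1:n2 + 1]
    f_avg, sigma = seg.mean(), seg.std()
    while n1 - 1 >= 0 and abs(f1[n1 - 1] - f_avg) < k * sigma:
        n1 -= 1
    return n1, n2
```

Note that a perfectly constant initial region has zero standard deviation, so with a strict inequality it would never be extended; real formant tracks rarely hit this case, but an implementation may want a small floor on `sigma`.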
Example of Glottal Pulse Estimation with FM
Algorithm for Normal AA Vowel :
Example of Glottal Pulse Estimation with FM
Algorithm for Pressed AA Vowel :
Algorithm Drawbacks
• Initial Stationary Region Extension –
In some voice signals, the first formant frequency is not stable during the closed phase. Hence, an accurate determination of the stationary formant region depends on a single numerical parameter.