CS 551/651:
Structure of Spoken Language
Lecture 9: The Source-Filter Model
of Speech Production
John-Paul Hosom
Fall 2010
The Source-Filter Model
One more model of speech… proposed in 1848 by Johannes
Müller, developed by Gunnar Fant circa 1970. Also called
the “Acoustic Theory of Speech Production”.
The Source-Filter Model provides a static description of speech;
speech dynamics are dealt with in models of coarticulation.
According to this model, speech is defined by three parts:
1. A sound source
vibration of the vocal folds, air turbulence, or plosion
2. A tube through which the source passes
the vocal tract
3. Radiation of sound from the mouth
These 3 components are assumed to be independent.
We will discuss these three parts separately.
The Source-Filter Model: Sound Source
Voiced Sound Source:
• produced by vibration of the vocal folds
• several models exist that describe the flow of air through
the vocal folds
• each model describes the increase in air flow as the glottis
opens, the decrease in air flow as it closes, and no air flow
while the glottis remains closed during pressure buildup.
• in the spectral domain, the shape is approximately flat at very low
frequencies, and has a –12 dB/octave slope at higher frequencies.
air pressure (Pa)
Models: Rosenberg, Fant (LF model), Fujisaki (FL model), Klatt
glottis opening
glottal closure glottis opening
time (msec)
The Source-Filter Model: Sound Source
Voiced Sound Source:
• models are of “glottal flow”
• glottal flow is the same as volume velocity, V, in units of m³/s
• volume velocity per unit area, or V/unit area, is in units of
m/s, and is called the point velocity, v.
• acoustic pressure, p, in Pascals, equals impedance Z times v:
p = Zv
• impedance is constant for a given glottis and vocal tract
• therefore, acoustic pressure is directly proportional to
glottal flow, and so the vertical axis of these models can
be considered either glottal flow, volume velocity, or
acoustic pressure (in micro Pascals).
The Source-Filter Model: Sound Source
All models have the following parameters:
• pitch period = 1/F0 = T0
• open quotient (OQ)
• skew (SK)
These three parameters are used in a function that describes how
the sound pressure changes over time within one pitch period.
[Figure: one pitch period of the glottal waveform, from glottis opening to glottal closure to the next glottis opening, with T0, OQ, and SK marked]
OQ measured relative to T0;
SK measured relative to OQ
The Source-Filter Model: Sound Source
The Rosenberg model:
gR(t) is a glottal pulse with amplitude A and duration T;
gR(t) has three phases: the opening phase, of duration TO; the closing
phase, of duration TC; and the closed phase, of length T-(TO+TC)
[Figure: Rosenberg glottal pulse, with TO, TC, and T marked on the time axis]
(from http://www.physik3.gwdg.de/~micha/aachen98/aachen98.html)
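The three phases can be sketched in code. The formula for gR(t) is not given in this transcript, so the sketch below assumes one commonly cited trigonometric form of the Rosenberg pulse (a raised cosine while opening, a quarter cosine while closing) and treats TO and TC as the durations of the opening and closing phases. The function name `rosenberg_pulse` and the example timing values are illustrative assumptions, not taken from the lecture.

```python
import math

def rosenberg_pulse(t, amplitude, t_open, t_close, period):
    """Glottal flow at time t (seconds) for an assumed Rosenberg-style pulse."""
    t = t % period  # the pulse repeats every pitch period
    if t <= t_open:
        # opening phase: flow rises smoothly from 0 to the peak amplitude
        return 0.5 * amplitude * (1.0 - math.cos(math.pi * t / t_open))
    if t <= t_open + t_close:
        # closing phase: flow falls from the peak back to 0
        return amplitude * math.cos(math.pi * (t - t_open) / (2.0 * t_close))
    # closed phase: no flow for the remaining T - (TO + TC)
    return 0.0

# Example: F0 = 100 Hz, so T = 10 ms; hypothetical 4 ms opening, 2 ms closing.
fs = 16000
T, TO, TC = 0.010, 0.004, 0.002
pulse = [rosenberg_pulse(n / fs, 1.0, TO, TC, T) for n in range(round(T * fs))]
```

Repeating the pulse every T seconds gives a voiced source with F0 = 1/T; changing TO and TC changes the open quotient and skew of the waveform.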
The Source-Filter Model: Sound Source
The Liljencrants-Fant (LF) Model:
[Figure: LF-model waveform, with Ei, 0, Ti, Tp, Te, Ta, and Tc marked]
• uses sin() and exp() functions to create smooth trajectory
• many parameters allow detailed control of shape
The Fujisaki-Ljungqvist (FL) Model:
• similar to LF, but allows negative flow during closed phase
• simpler polynomial functions
(from http://www.ims.uni-stuttgart.de/phonetik/EGG/page13.htm)
The Source-Filter Model: Sound Source
Unvoiced Sound Source:
• produced by pushing air through constriction in mouth
• a simple model: noise that decreases at –6 dB/octave
Plosive Sound Source:
• produced by pressure buildup, then release of constriction
• a very simple model: approximately a step function
[Figure: step function, amplitude vs. time]
The Source-Filter Model: Vocal Tract Filter
The vocal tract can be modeled as a series of connected
tubes with different lengths and diameters:
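[Figure: vocal tract drawn as connected tubes with cross-sectional areas A1 to A6; the fourth tube is labeled with length l4 and diameter d4]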
Life can be made much simpler if we start with
only two tubes for approximating different vowels:
[Figure: two-tube approximations, with areas A1 and A2, for the vowels /iy/, /aa/, /uw/, and /ah/]
The Source-Filter Model: Vocal Tract Filter
An electrical-engineering analogy can be drawn between
the tubes and a transmission line.
From this analogy, the formant frequencies (frequencies of standing
waves) occur when
$$\frac{A_1}{A_2}\tan(\beta l_2) = \cot(\beta l_1)$$
where
$$\beta = \frac{2\pi f}{c}, \qquad c \approx 340\ \text{m/s}$$
(from Flanagan, p. 70-71)
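The two-tube resonance condition (A1/A2) tan(β l2) = cot(β l1), with β = 2πf/c, has no closed-form solution in general, but it can be solved numerically. The sketch below is one way to do that, under stated assumptions: tube 1 (l1, A1) is taken to be the back tube, closed at the glottis, and tube 2 (l2, A2) the front tube, open at the lips. The condition is multiplied through by the sines and cosines so that the function being searched has no singularities. The function name and the example dimensions are hypothetical.

```python
import math

C = 340.0  # speed of sound in m/s, as on the slide

def two_tube_formants(l1, l2, a1, a2, f_max=4000.0, step=1.0):
    """Resonances (Hz) below f_max of an assumed two-tube model.

    l1, a1: length (m) and area of the back tube (closed at the glottis).
    l2, a2: length (m) and area of the front tube (open at the lips).
    Only the area ratio a1/a2 matters, so any consistent area unit works.
    """
    def g(f):
        beta = 2.0 * math.pi * f / C
        # (A1/A2) tan(beta*l2) = cot(beta*l1), multiplied through by
        # A2 * cos(beta*l2) * sin(beta*l1) to remove the singularities
        return (a1 * math.sin(beta * l1) * math.sin(beta * l2)
                - a2 * math.cos(beta * l1) * math.cos(beta * l2))

    formants = []
    f_lo = step / 2.0
    g_lo = g(f_lo)
    while f_lo < f_max:
        f_hi = f_lo + step
        g_hi = g(f_hi)
        if g_lo * g_hi < 0.0:  # sign change: a root lies in (f_lo, f_hi)
            lo, hi, g_left = f_lo, f_hi, g_lo
            for _ in range(60):  # bisection
                mid = 0.5 * (lo + hi)
                g_mid = g(mid)
                if g_left * g_mid <= 0.0:
                    hi = mid
                else:
                    lo, g_left = mid, g_mid
            formants.append(0.5 * (lo + hi))
        f_lo, g_lo = f_hi, g_hi
    return formants

# Equal areas: the two tubes act as one uniform 17 cm tube.
uniform = two_tube_formants(0.085, 0.085, 1.0, 1.0, f_max=3000.0)

# Hypothetical narrow back tube and wide front tube.
narrow_back = two_tube_formants(0.09, 0.08, 1.0, 7.0, f_max=3000.0)
```

With equal areas the condition reduces to cos(β(l1 + l2)) = 0, so `uniform` should land on the single-tube resonances; this makes a convenient sanity check before trying vowel-like shapes.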
The Source-Filter Model: Vocal Tract Filter
In the simplest case of a single tube, the formants are located at
$$F_i = \frac{(2i-1)\,c}{4l}$$
and if l = 17 cm (the typical length of the male vocal tract), then
$$F_1 = \frac{(2-1)\times 34000}{4\times 17} = 500\ \text{Hz}$$
$$F_2 = \frac{(4-1)\times 34000}{4\times 17} = 1500\ \text{Hz}$$
etc.
So, for a neutral vowel (no constriction in the vocal tract),
formants occur at 500, 1500, 2500, … Hz.
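As a quick check of the arithmetic, the single-tube formula F_i = (2i − 1)c / (4l) can be evaluated directly. This is a minimal sketch; the function name is illustrative, and the units (cm and cm/s) follow the worked example.

```python
def uniform_tube_formants(length_cm, count=3, c_cm_per_s=34000.0):
    """Formants (Hz) of a uniform tube closed at one end: (2i - 1) c / (4 l)."""
    return [(2 * i - 1) * c_cm_per_s / (4.0 * length_cm)
            for i in range(1, count + 1)]

print(uniform_tube_formants(17.0))  # [500.0, 1500.0, 2500.0]
```

Because the formants scale as 1/l, a shorter tube shifts every formant upward by the same factor.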
The Source-Filter Model: Vocal Tract Filter
The two-tube model can be expanded to multiple tubes;
the math becomes ugly, but results are more realistic:
The Source-Filter Model: Bandwidths
In these cases, it has been assumed that the tubes have
hard surfaces, which causes the resonant frequencies (formants)
to have strong energy only at their center frequencies:
(energy is put into the system via the source, but no energy is lost)
In reality, the resonant energies decay over time; energy
is absorbed by:
• viscosity (caused by friction of air against vocal-tract walls)
• heat conduction (at the vocal-tract walls)
• soft surfaces of vocal-tract walls
These effects cause formant bandwidths to increase with frequency.
The Source-Filter Model: Radiation
A final effect of the speech-production process is radiation
of sound from the lips
As sound radiates from a source, its energy decreases.
The decrease in energy is not the same for all frequencies;
this effect can be modeled as a +6 dB/octave increase in
energy:
$$y_n = x_n - x_{n-1}$$
which, coincidentally, is the same equation as pre-emphasis,
$y_n = x_n - a\,x_{n-1}$, with a=1.0, and also corresponds to a
differentiation operation.
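The +6 dB/octave behavior can be checked numerically. The sketch below assumes the radiation effect is the first difference y[n] = x[n] − x[n−1] (pre-emphasis with a = 1.0), whose gain at frequency f is 2 sin(πf/Fs), and prints how much the gain rises per octave. The function name and the 16 kHz sampling rate are assumptions made for illustration.

```python
import math

def first_difference_gain_db(freq_hz, fs_hz):
    """Gain in dB of y[n] = x[n] - x[n-1] at freq_hz: |1 - e^{-jw}| = 2 sin(w/2)."""
    w = 2.0 * math.pi * freq_hz / fs_hz
    return 20.0 * math.log10(2.0 * math.sin(w / 2.0))

fs = 16000.0  # hypothetical sampling rate
for f in (100.0, 200.0, 400.0, 800.0):
    rise_db = first_difference_gain_db(2 * f, fs) - first_difference_gain_db(f, fs)
    print(f"{f:6.0f} Hz -> {2 * f:6.0f} Hz: {rise_db:+.2f} dB")
```

The rise stays close to 6 dB per octave at frequencies well below Fs/2 and falls off as the frequency approaches Fs/2, so the first difference is only an approximation to an ideal differentiator.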
The Source-Filter Model: Radiation
The derivative effect of radiation from the lips can be
moved to the glottal-source model:
[Figure: glottal flow and glottal flow derivative over one pitch period, with T0, OQ, and SK marked]
The Source-Filter Model: Radiation
The derivative effect of radiation from the lips can also be
moved to the models of frication and plosion:
Unvoiced Sound Source:
• a very simple model: random (white) noise
Plosive Sound Source:
• a very simple model: an impulse function
[Figure: impulse function, amplitude vs. time]
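The plosive case is easy to verify in the time domain: differentiating a step gives an impulse. A minimal sketch, assuming the derivative is approximated by a first difference (the helper name is illustrative):

```python
def first_difference(x):
    """y[n] = x[n] - x[n-1], with the sample before the start taken as 0."""
    return [x[n] - (x[n - 1] if n > 0 else 0.0) for n in range(len(x))]

step = [0.0] * 5 + [1.0] * 5
print(first_difference(step))
# [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```

The same operation applied to noise that falls at 6 dB/octave raises its spectrum by about 6 dB/octave, which is why the frication source becomes approximately white.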
The Source-Filter Model: Complete Picture
[Figure: spectra of the glottal source (harmonics), the vocal tract filter (envelope), radiation (log scale), and the final speech signal]
The Source-Filter Model: Estimating Parameters
The vocal-tract parameters (formants) can be estimated
using LPC analysis, with the order of LPC analysis equal
to 2×NF, where NF is the expected number of formants.
In practice, LPC estimation of formants is not very accurate
because of the slope of the spectrum and irregularities in the
spectrum.
Once the formants are determined, they can then be inverted,
and the original signal filtered with the inverted formants to
obtain the source + radiation (first derivative of glottal flow) signal.
This is called inverse filtering.
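The LPC and inverse-filtering steps can be sketched as follows. The lecture does not say which LPC method is meant; this sketch assumes the autocorrelation method solved with the Levinson-Durbin recursion, written in plain Python for clarity rather than speed. All function names are illustrative, and the demo signal (one 500 Hz resonance with pole radius 0.95 at Fs = 8000 Hz) is a hypothetical test case, not real speech.

```python
import math

def autocorrelation(x, max_lag):
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]

def lpc(x, order):
    """Autocorrelation-method LPC via Levinson-Durbin.

    Returns (a, err) with a[0] = 1 and A(z) = sum_k a[k] z^-k.
    """
    r = autocorrelation(x, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

def inverse_filter(x, a):
    """Residual e[n] = sum_k a[k] x[n-k], i.e. the signal filtered by A(z)."""
    p = len(a) - 1
    return [sum(a[k] * x[n - k] for k in range(min(p, n) + 1))
            for n in range(len(x))]

def pole_pair_to_formant(a1, a2, fs_hz):
    """Frequency and bandwidth (Hz) of a complex pole pair 1 + a1 z^-1 + a2 z^-2."""
    r = math.sqrt(a2)
    freq = math.acos(-a1 / (2.0 * r)) * fs_hz / (2.0 * math.pi)
    bandwidth = -math.log(r) * fs_hz / math.pi
    return freq, bandwidth

# Demo: impulse response of one resonance, then LPC of order 2 (2 x 1 formant).
fs = 8000.0
r, theta = 0.95, 2.0 * math.pi * 500.0 / fs
x = []
for n in range(600):
    prev1 = x[n - 1] if n >= 1 else 0.0
    prev2 = x[n - 2] if n >= 2 else 0.0
    x.append((1.0 if n == 0 else 0.0) + 2.0 * r * math.cos(theta) * prev1 - r * r * prev2)

a, err = lpc(x, 2)
freq, bandwidth = pole_pair_to_formant(a[1], a[2], fs)
residual = inverse_filter(x, a)  # close to the original impulse
```

For more than one formant, the roots of A(z) have to be found numerically before each complex pole pair can be converted to a frequency and bandwidth; that step is left out here. On real speech the residual will be noisier than in this idealized demo, for the reasons given in the text.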
The Source-Filter Model: Filtering
Formants can be modeled by a “damped sinusoid”, which
has the following representations:
$$x(t) = A e^{-2\pi\alpha t}\sin(2\pi f_c t)$$
S(f) = A f_c^2 / [denominator not recoverable from the transcript; it depends on f, f_c, and α]
where S(f) is the spectrum at frequency value f, A is overall
amplitude, fc is the center frequency of the damped sine wave,
and α is a damping factor. [Olive, p. 48, 58]. Or, given formant
and sampling frequency, compute IIR filter coefficients:
$$r = e^{-\pi B_f / F_s}$$
$$a_0 = 2r\cos\left(F_f \cdot \frac{2\pi}{F_s}\right)$$
$$a_1 = -r^2$$
$$a_2 = 1 - (a_0 + a_1)$$
$$y_n = a_2 x_n + a_0 y_{n-1} + a_1 y_{n-2}$$
where F_f = formant frequency, B_f = formant bandwidth, and
F_s = sampling frequency
(from Klatt, 1980)
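The IIR resonator can be sketched in a few lines. The code assumes the coefficient definitions r = exp(−πB_f/F_s), a0 = 2r cos(2πF_f/F_s), a1 = −r², a2 = 1 − (a0 + a1), and the recursion y[n] = a2 x[n] + a0 y[n−1] + a1 y[n−2]. The function names are illustrative; the formant frequencies in the example are the neutral-vowel values from the single-tube model, while the bandwidths and sampling rate are hypothetical.

```python
import math

def resonator_coeffs(formant_hz, bandwidth_hz, fs_hz):
    """Second-order resonator coefficients (a0, a1, a2) for one formant."""
    r = math.exp(-math.pi * bandwidth_hz / fs_hz)
    a0 = 2.0 * r * math.cos(2.0 * math.pi * formant_hz / fs_hz)
    a1 = -r * r
    a2 = 1.0 - (a0 + a1)  # scales the gain at 0 Hz to exactly 1
    return a0, a1, a2

def resonator_filter(x, coeffs):
    """Apply y[n] = a2*x[n] + a0*y[n-1] + a1*y[n-2] to the sequence x."""
    a0, a1, a2 = coeffs
    y1 = y2 = 0.0  # y[n-1] and y[n-2]
    out = []
    for xn in x:
        yn = a2 * xn + a0 * y1 + a1 * y2
        out.append(yn)
        y2, y1 = y1, yn
    return out

# Cascade of three resonators driven by an impulse standing in for a source.
fs = 10000.0
signal = [1.0] + [0.0] * 999
for formant, bandwidth in ((500.0, 60.0), (1500.0, 90.0), (2500.0, 150.0)):
    signal = resonator_filter(signal, resonator_coeffs(formant, bandwidth, fs))
```

Replacing the impulse with a train of glottal-pulse derivatives, one per pitch period, would give a crude voiced vowel; widening a bandwidth makes the corresponding resonance decay faster.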
The Source-Filter Model
A course project that studies the source-filter model might
be interesting…
1. Implement LPC, extract formant values and bandwidths
of different vowels; how do envelope and formant values
change with different orders of LPC (values of p)?
2. Do LPC analysis, then inverse filter the signal to extract the
glottal source waveform. Does it look the way it should?
3. Construct two-tube models, predict formant frequencies
of all vowels.
If you’re more comfortable with programming, signal processing,
etc.