CS 551/651:
Structure of Spoken Language
Lecture 1: Visualization of the Speech Signal,
Introductory Phonetics
John-Paul Hosom
Fall 2010
Visualization of the Speech Signal
Most common representations:
• Time-domain waveform
• Energy
• Pitch contour
• Spectrogram (power spectrum)
Structure of Spoken Language : Hosom
Visualization of the Speech Signal: Time-Domain Waveform
The time-domain waveform is the signal recorded directly from
a microphone, with time on the horizontal axis and amplitude on
the vertical axis.
“Variations in air pressure in the form of sound waves move
through the air somewhat like ripples on a pond. … A graph
of a sound wave is very similar to a graph of the movements
of the eardrum.” [Ladefoged, p. 184]
“Sound originates from the motion or vibration of an object.
This motion is impressed upon the surrounding medium (usually
air) as a pattern of changes in pressure. … The sound generally
weakens as it moves away from the source and also may be
subject to reflections and refractions…” [Moore, p. 2]
Visualization of the Speech Signal: Time-Domain Waveform
Vertical axis:
amplitude, relative sound pressure
typical unit: Pa (micro-pascals)
(digital signal usually unitless)
quantization (-32768 to 32767)
Horizontal axis:
time
typical unit: msec (milliseconds)
sampling (8000, 16000, 44.1K samp/sec)
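As a small illustration of the two axes, the sketch below scales 16-bit quantized samples to a unitless range and computes the time of each sample in msec from the sampling rate. The function names `pcm16_to_float` and `sample_times_msec` are hypothetical helpers chosen for this example, and dividing by 32768 is one common scaling convention, not the only one.

```python
def pcm16_to_float(samples):
    """Scale 16-bit integer samples (-32768..32767) to roughly [-1.0, 1.0)."""
    return [s / 32768.0 for s in samples]


def sample_times_msec(num_samples, sample_rate_hz):
    """Return the time of each sample, in milliseconds."""
    return [1000.0 * n / sample_rate_hz for n in range(num_samples)]


# Four quantized samples recorded at 8000 samples/sec (illustrative values).
pcm = [0, 16384, 32767, -32768]
amplitudes = pcm16_to_float(pcm)
times = sample_times_msec(len(pcm), 8000)

for t, a in zip(times, amplitudes):
    print(f"{t:6.3f} msec  amplitude {a:+.4f}")
```

At 8000 samples/sec, consecutive samples are 0.125 msec apart.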
Visualization of the Speech Signal: Energy
“Energy” or “Intensity”:
intensity is sound energy transmitted per second (power)
through a unit area in a sound field. [Moore p. 9]
intensity is proportional to the square of the pressure
variation [Moore p. 9]
$$\text{normalized energy} = \frac{1}{N}\sum_{n=t}^{t+N-1} x_n^2 \;\propto\; \text{intensity}$$
x_n = signal x at time sample n
N = number of time samples
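The normalized energy over a window of N samples starting at sample t can be sketched as follows. The function name `normalized_energy` and the example signal are assumptions made for illustration only.

```python
def normalized_energy(x, t, N):
    """Mean of the squared samples x[t] .. x[t+N-1]."""
    frame = x[t:t + N]
    if len(frame) != N:
        raise ValueError("window extends past the end of the signal")
    return sum(v * v for v in frame) / N


signal = [0.0, 1.0, -1.0, 2.0, -2.0, 0.0]
e = normalized_energy(signal, 1, 4)  # (1 + 1 + 4 + 4) / 4
print(e)
```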
Visualization of the Speech Signal: Energy
“Energy” or “Intensity”:
human auditory system better suited to relative scales:
$$\text{energy (bels)} = \log_{10}\left(\frac{I_1}{I_0}\right)$$
$$\text{energy (decibels, dB)} = 10\,\log_{10}\left(\frac{I_1}{I_0}\right)$$
I0 is a reference intensity… if the signal becomes twice as
powerful (I1/I0 = 2), then the energy level is 3 dB (3.0103 dB
to be more precise)
Typical value for I0 corresponds to a sound pressure of 20 μPa.
20 μPa is close to the average human absolute threshold for
a 1000-Hz sinusoid.
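A quick numeric check of the decibel scale is sketched below. Because intensity is proportional to the square of the pressure variation, a level computed from pressures uses 20 log10 rather than 10 log10. The function names `level_db` and `spl_db` are hypothetical, and the 20 μPa default is the reference pressure quoted above.

```python
import math


def level_db(i1, i0=1.0):
    """Level of intensity i1 relative to reference intensity i0, in dB."""
    return 10.0 * math.log10(i1 / i0)


def spl_db(pressure_pa, reference_pa=20e-6):
    """Level of a sound pressure relative to 20 micro-pascals, in dB.

    Intensity is proportional to pressure squared, so
    10*log10((p/p0)**2) = 20*log10(p/p0).
    """
    return 20.0 * math.log10(pressure_pa / reference_pa)


print(round(level_db(2.0), 4))   # doubling the power
print(round(level_db(10.0), 4))  # ten times the power
print(round(spl_db(200e-6), 4))  # ten times the reference pressure
```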
Visualization of the Speech Signal: Energy
What is a good value of N? Depends on information of interest:
[Figure: energy contours of the same signal computed with N = 1 msec, N = 5 msec, N = 20 msec, and N = 80 msec]
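The effect of the window length can be sketched by converting N from msec to samples and computing one energy value per window. This example assumes non-overlapping windows and a made-up signal (10 msec of constant amplitude followed by 10 msec of silence at 8000 samples/sec); `energy_contour` is a hypothetical name.

```python
def energy_contour(x, sample_rate_hz, window_msec):
    """Normalized energy in consecutive, non-overlapping windows."""
    n = max(1, int(round(sample_rate_hz * window_msec / 1000.0)))
    return [
        sum(v * v for v in x[t:t + n]) / n
        for t in range(0, len(x) - n + 1, n)
    ]


# 10 msec of amplitude 1.0, then 10 msec of silence, at 8000 samples/sec.
x = [1.0] * 80 + [0.0] * 80

short_windows = energy_contour(x, 8000, 5)   # four 5-msec windows
long_window = energy_contour(x, 8000, 20)    # one 20-msec window
print(short_windows, long_window)
```

The 5-msec windows show the abrupt drop in energy, while the single 20-msec window averages it away, which is why the best N depends on the information of interest.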
Visualization of the Speech Signal: Power Spectrum
What makes one phoneme, /aa/, sound different from another
phoneme, /iy/?
Different shapes of the vocal tract…
/aa/ is produced with the tongue low and in the back of the mouth;
/iy/ is produced with the tongue high and toward the front.
The different shapes of the vocal tract produce different
“resonant frequencies”, or frequencies at which energy in the
signal is concentrated. (Simple example of resonance:
a tuning fork may have a resonant frequency of 440 Hz, the note “A”.)
A resonance is the tendency of a system to oscillate with larger
amplitude at some frequencies than at others [Wikipedia].
Resonant frequencies in speech (or other sounds) can be displayed
by computing a “power spectrum” or “spectrogram,” showing the
energy in the signal at different frequencies.
Visualization of the Speech Signal: Power Spectrum
A time-domain signal can be expressed in terms of sinusoids
at a range of frequencies using the Fourier transform:
$$X(f) = \int_{t=-\infty}^{+\infty} x(t)\,e^{-j 2\pi f t}\,dt = \int_{t=-\infty}^{+\infty} x(t)\left[\cos(2\pi f t) - j\sin(2\pi f t)\right]dt$$
where x(t) is the time-domain signal at time t, f is a frequency
value from 0 to 1, and X(f) is the spectral-domain representation.
note: $e^{j\phi} = \cos(\phi) + j\sin(\phi)$
One useful property of the Fourier transform is that it is time-invariant
(actually, linear time invariant). While a periodic
signal x(t) changes at t, t+Δ, t+2Δ, etc., the Fourier transform of
this signal is constant, making analysis of periodic signals easier.
Visualization of the Speech Signal: Power Spectrum
Since samples are obtained at discrete time steps, and since
only a finite section of the signal is of interest, the discrete
Fourier transform is more useful:
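$$X(n) = \frac{1}{N}\sum_{k=0}^{N-1} x(k)\,e^{-j 2\pi k n / N} = \frac{1}{N}\sum_{k=0}^{N-1} x(k)\left[\cos\left(\frac{2\pi k n}{N}\right) - j\sin\left(\frac{2\pi k n}{N}\right)\right] \quad \text{for } n = 0, \ldots, N-1$$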
in which x(k) is the amplitude at time sample k, n is a frequency
index from 0 to N-1, N is the number of samples or frequency points
of interest, and X(n) is the spectral-domain representation of
x(k). Note that we assume that the series outside the range
(0, N-1) is “extended N-periodic,” that is, x_k = x_{k+N} for all k.
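A direct, unoptimized sketch of the discrete Fourier transform with the 1/N normalization used here is given below. It is written in pure Python for readability; in practice an FFT routine would normally be used, and many FFT libraries omit the 1/N factor. The function name `dft` and the test signal are assumptions for this example.

```python
import cmath
import math


def dft(x):
    """Discrete Fourier transform with 1/N normalization (O(N^2) sketch)."""
    N = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * math.pi * k * n / N) for k in range(N)) / N
        for n in range(N)
    ]


# One full cycle of a cosine across N = 8 samples.
N = 8
x = [math.cos(2 * math.pi * k / N) for k in range(N)]
X = dft(x)

for n, value in enumerate(X):
    print(n, round(abs(value), 6))
```

The energy of the cosine is split between index 1 and index N-1, each with magnitude 0.5, and every other index is (numerically) zero.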
Visualization of the Speech Signal: Power Spectrum
• The sampling frequency is the rate at which samples are recorded;
e.g. 8000 Hz = 8000 samples per second.
• Shannon’s Sampling Theorem states that a continuous signal
must be sampled at a rate of at least twice the highest
frequency present in the signal. So, the signal
must not contain any energy above Fsamp/2 (the Nyquist frequency).
If it does, use a low-pass filter to remove these higher frequencies
before sampling.
• Because the signal is assumed to be periodic over length N,
and this assumption is usually false, the signal is weighted
with a window so that both edges of the signal taper toward zero:
Hamming window:
$$x_w(n) = x(n)\cdot\left(0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right)\right), \quad n = 0 \ldots N-1$$
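The Hamming window can be sketched directly from its definition. The helper names `hamming` and `apply_window` are hypothetical, and the sketch assumes N is at least 2.

```python
import math


def hamming(N):
    """Hamming window weights for n = 0 .. N-1 (requires N >= 2)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]


def apply_window(x):
    """Weight a frame so that both edges taper toward zero."""
    w = hamming(len(x))
    return [xi * wi for xi, wi in zip(x, w)]


w = hamming(5)
print([round(v, 4) for v in w])

frame = [1.0] * 5
print([round(v, 4) for v in apply_window(frame)])
```

The weights are 0.08 at both edges and 1.0 at the center, so the edges taper toward (but do not reach) zero.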
Visualization of the Speech Signal: Power Spectrum
The magnitude and phase of the spectral representation are:
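$$\text{magnitude}_{F(n)} = |F(n)| = \left(F_{real}(n)\cdot F_{real}(n) + F_{imag}(n)\cdot F_{imag}(n)\right)^{0.5}$$
$$\text{phase}_{F(n)} = \tan^{-1}\left(\frac{F_{imag}(n)}{F_{real}(n)}\right)$$
where |F(n)| is the absolute value of the complex number F(n).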
Phase information is generally considered not important in
understanding speech, and the energy (or power) of the
magnitude of F(n) on the decibel scale provides most of the relevant
information:
$$\text{PowerSpectrum}_{F(n)} = 10\,\log_{10}\left(F_{real}^2(n) + F_{imag}^2(n)\right)$$
Note: usually don’t worry about reference intensity I0 (assume a
value of 1.0); the signal strength (in μPa) is unknown anyway.
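The magnitude, phase, and power-spectrum definitions can be sketched as below, using a direct DFT with 1/N normalization and a reference intensity of 1.0. The small `floor` value that avoids taking the log of zero is an assumption of this sketch, and `math.atan2` is used instead of a plain arctangent of the ratio so that the quadrant is resolved. All function names are hypothetical.

```python
import cmath
import math


def spectrum_bin(x, n):
    """Complex DFT value F(n) with 1/N normalization."""
    N = len(x)
    return sum(x[k] * cmath.exp(-2j * math.pi * k * n / N) for k in range(N)) / N


def magnitude(F):
    return (F.real * F.real + F.imag * F.imag) ** 0.5


def phase(F):
    return math.atan2(F.imag, F.real)  # quadrant-aware arctangent


def power_spectrum_db(x, floor=1e-12):
    """Power spectrum in dB for bins 0 .. N/2 (reference intensity 1.0)."""
    result = []
    for n in range(len(x) // 2 + 1):
        F = spectrum_bin(x, n)
        power = F.real ** 2 + F.imag ** 2
        result.append(10.0 * math.log10(max(power, floor)))
    return result


# A cosine with exactly two cycles in a 16-sample frame.
x = [math.cos(2 * math.pi * 2 * k / 16) for k in range(16)]
spectrum = power_spectrum_db(x)
peak_bin = spectrum.index(max(spectrum))
print(peak_bin, round(spectrum[peak_bin], 2))
```

The peak falls at bin 2 with a power of 0.25, or about -6.02 dB.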
Visualization of the Speech Signal: Power Spectrum
The power spectrum can be plotted like this (vowel /aa/):
[Figure: time-domain amplitude of a 512-sample window (top) and its spectral power in dB versus frequency from 0 Hz to 4000 Hz (bottom), with a level of 73 dB marked]
Visualization of the Speech Signal: Power Spectrum
If the speech signal is periodic and the number of samples in
the window is large enough, then harmonics are seen:
[Figure: power spectra of a periodic signal /aa/ with 128 samples, a periodic signal /aa/ with 2048 samples, and an aperiodic signal /sh/ with 2048 samples; the frequency range is 0 to 4000 Hz in all plots]
A harmonic is a strong energy component at an integer multiple
of the fundamental frequency (pitch), F0.
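To illustrate harmonics, the sketch below builds a synthetic periodic signal from three sinusoids at F0, 2·F0, and 3·F0 and locates the peaks in its DFT magnitude. The values F0 = 200 Hz, an 8000 Hz sampling rate, a 400-sample window, and the 0.1 peak threshold are all assumptions chosen so that each harmonic falls exactly on a DFT bin; real speech is not this tidy.

```python
import cmath
import math

fs = 8000   # sampling rate in Hz (assumed)
f0 = 200    # fundamental frequency in Hz (assumed)
N = 400     # window length in samples, so bins are fs/N = 20 Hz apart

# Periodic signal: three harmonics with amplitudes 1, 1/2, 1/3.
x = [
    sum(math.sin(2 * math.pi * h * f0 * n / fs) / h for h in (1, 2, 3))
    for n in range(N)
]


def dft_magnitude(x, n):
    """Magnitude of DFT bin n with 1/N normalization."""
    size = len(x)
    total = sum(x[k] * cmath.exp(-2j * math.pi * k * n / size) for k in range(size))
    return abs(total) / size


mags = [dft_magnitude(x, n) for n in range(N // 2 + 1)]

# Local maxima above a small threshold mark the harmonics.
peak_bins = [
    n for n in range(1, N // 2)
    if mags[n] > 0.1 and mags[n] > mags[n - 1] and mags[n] > mags[n + 1]
]
peak_freqs = [n * fs / N for n in peak_bins]
print(peak_freqs)
```

The peaks land at 200, 400, and 600 Hz, the integer multiples of F0.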
Visualization of the Speech Signal: Formants
Note that the resonant frequencies, or formants, for the two
vowels /aa/ and /iy/ can be identified in the spectra.
For recognition of phonemes, the spectral envelope is important
(envelope = shape of spectrum without harmonics)
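[Figure: power spectra of /aa/ (2048 samples) and /iy/ (2048 samples) from 0 to 4 kHz, with the spectral envelope drawn over the harmonics and candidate formant peaks marked “?”]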
Visualization of the Speech Signal: Formants
The harmonics, which are dependent on F0, are not, in theory,
significantly related to the resonant frequencies, which are
dependent on the vocal tract shape (or phoneme).
[Figure: power spectra of /aa/ with F0 = 80 Hz and /aa/ with F0 = 164 Hz, from 0 to 4 kHz]
Visualization of the Speech Signal: Spectrograms
Many power spectra can be plotted over time, creating a
“spectrogram” or “spectrograph” (pre-emphasis = 0.97):
[Figure: waveform (amp) and spectrogram (freq in Hz versus time in msec) for /iy/ and for /aa/, FFT size = 10 msec]
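A spectrogram can be sketched as a sequence of windowed power spectra, one per frame. The sketch below is a minimal pure-Python version under several assumptions not taken from the slides: a 10-msec frame at 8000 samples/sec, a hop of half a frame, a Hamming window, no pre-emphasis, and a synthetic test signal that switches from 500 Hz to 2000 Hz. The name `spectrogram` is hypothetical.

```python
import cmath
import math


def spectrogram(x, frame_len, hop, floor=1e-12):
    """List of frames; each frame is a list of power values in dB."""
    window = [
        0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
        for n in range(frame_len)
    ]
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        seg = [x[start + n] * window[n] for n in range(frame_len)]
        column = []
        for b in range(frame_len // 2 + 1):
            F = sum(
                seg[k] * cmath.exp(-2j * math.pi * k * b / frame_len)
                for k in range(frame_len)
            ) / frame_len
            column.append(10.0 * math.log10(max(abs(F) ** 2, floor)))
        frames.append(column)
    return frames


fs = 8000
# 50 msec of a 500 Hz tone followed by 50 msec of a 2000 Hz tone.
x = [math.sin(2 * math.pi * 500 * n / fs) for n in range(400)]
x += [math.sin(2 * math.pi * 2000 * n / fs) for n in range(400)]

S = spectrogram(x, frame_len=80, hop=40)   # 10-msec frames, 5-msec hop
bin_hz = fs / 80                            # 100 Hz per bin
first_peak_hz = S[0].index(max(S[0])) * bin_hz
last_peak_hz = S[-1].index(max(S[-1])) * bin_hz
print(len(S), first_peak_hz, last_peak_hz)
```

Plotting `S` as an image, with time on the horizontal axis and frequency on the vertical axis, gives the usual spectrogram display.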
Visualization of the Speech Signal: Formants
These formants can be modeled by a “damped sinusoid”, which
has the following representations:
$$x(t) = A\,e^{-2\pi\gamma t}\sin(2\pi f_c t)$$
$$S(f) = \frac{A f_c^2}{\sqrt{\left(f^2 - f_c^2\right)^2 + \left(2\gamma f\right)^2}}$$
where S(f) is the spectrum at frequency value f, A is overall
amplitude, f_c is the center frequency of the damped sine wave,
and γ is a damping factor. [Olive, p. 48, 58]
[Figure: damped sinusoid (amplitude versus time in msec) and its spectrum (power in dB versus frequency in Hz), with the peak at center freq. f_c and 0 dB at 0 Hz]
Visualization of the Speech Signal: Formants
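The bandwidth is defined as the width of the spectral peak
measured at the points where the spectral power is ½ the maximum
value (that is, where the linear spectral magnitude is 1/√2 of
its maximum). A reduction of the signal power by
a factor of 2 is equivalent to a 3 dB change.
[Figure: power (dB) versus frequency (Hz) for a spectral peak; the bandwidth is the width of the peak 3 dB below its 0 dB maximum]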
Also, the resonator must have a value of 0 dB at 0 Hz.
Visualization of the Speech Signal: Formants
• Formants are specified by a frequency, F, and bandwidth, B.
• A neutral vowel (/ax/) theoretically has formants at 500 Hz,
1500 Hz, 2500 Hz, 3500 Hz, etc. The first formant is called F1,
the second is called F2, etc. (The fundamental frequency, or
pitch, is F0.)
• F1, F2, and sometimes F3 are usually sufficient for identifying
vowels.
• Formants can be thought of as filters, which act on the source
waveform. For vowels, the source waveform is air pushed
through the vibrating vocal folds. Energy is lost (hence a damped
sinusoid model) by sound absorption in the mouth.
• A digital model of a formant can be implemented using an
infinite-impulse response (IIR) filter.
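One way to sketch such a digital formant model is a two-pole IIR resonator specified by a frequency F and bandwidth B, in the form commonly used in Klatt-style formant synthesizers. This is offered as an illustration under that assumption, not as the implementation used in this course; the function names and the example values (F = 500 Hz, B = 60 Hz, 8000 samples/sec) are hypothetical. The gain coefficient is chosen so that the response is 0 dB at 0 Hz.

```python
import cmath
import math


def resonator_coeffs(F, B, fs):
    """Coefficients of y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    c = -math.exp(-2 * math.pi * B / fs)
    b = 2 * math.exp(-math.pi * B / fs) * math.cos(2 * math.pi * F / fs)
    a = 1 - b - c  # gives unity gain (0 dB) at 0 Hz
    return a, b, c


def resonator_filter(x, F, B, fs):
    """Filter signal x through the two-pole resonator."""
    a, b, c = resonator_coeffs(F, B, fs)
    y1 = y2 = 0.0
    out = []
    for v in x:
        y = a * v + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def gain_db(f, F, B, fs):
    """Magnitude response of the resonator at frequency f, in dB."""
    a, b, c = resonator_coeffs(F, B, fs)
    z = cmath.exp(-2j * math.pi * f / fs)
    return 20 * math.log10(abs(a / (1 - b * z - c * z * z)))


fs = 8000
impulse = [1.0] + [0.0] * 799
h = resonator_filter(impulse, 500, 60, fs)  # impulse response: a damped sinusoid

print(round(gain_db(0, 500, 60, fs), 3))
print(round(gain_db(500, 500, 60, fs), 1))
```

The impulse response oscillates near 500 Hz and decays over time, which is the damped-sinusoid behavior described earlier.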
Visualization of the Speech Signal: Excitation/Source
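The vocal-fold vibration source looks like this:
[Figure: vocal-fold source waveform (amplitude versus time in msec) and its spectrum (power in dB versus frequency in Hz), falling at -6 dB/octave]
(Note: there are some gross simplifications here… we’ll go into
more detail later in the course.)
In fricatives and other unvoiced speech, the source is turbulent air:
[Figure: turbulent-air source waveform (amplitude versus time in msec) and its spectrum (power in dB versus frequency in Hz), with a flat slope]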
Visualization of the Speech Signal: Pre-Emphasis
Because the source for voiced sounds decreases at –6 dB/octave,
a simple filter can be used to increase the spectral tilt by
+6 dB/octave, thereby making voiced sounds spectrally flat
and easier to visualize. (NOTE: unvoiced sounds then have
spectral slope of + 6 dB/octave)
x(n)  x(n)  a  x(n  1)
a  0.97
power (dB)
where x(n) is the time-domain speech signal at sample number n,
and x(n) is the pre-emphasized speech signal at sample n.
-6 dB/octave
frequency (Hz)Structure of Spoken Language : Hosom
0 dB/octave
frequency (Hz)
22
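The pre-emphasis filter is a one-line difference equation. The sketch below assumes the sample before the start of the signal is zero, which is one common choice, and adds a helper showing the filter's gain at a given frequency. The names `pre_emphasize` and `pre_emphasis_gain_db` are hypothetical.

```python
import cmath
import math


def pre_emphasize(x, a=0.97):
    """Return x'(n) = x(n) - a*x(n-1), treating x(-1) as zero."""
    if not x:
        return []
    return [x[0]] + [x[n] - a * x[n - 1] for n in range(1, len(x))]


def pre_emphasis_gain_db(f, fs, a=0.97):
    """Gain of the filter 1 - a*z^-1 at frequency f, in dB."""
    z_inv = cmath.exp(-2j * math.pi * f / fs)
    return 20 * math.log10(abs(1 - a * z_inv))


y = pre_emphasize([1.0, 1.0, 1.0])
print([round(v, 2) for v in y])

low = pre_emphasis_gain_db(0, 8000)      # strongly attenuated
high = pre_emphasis_gain_db(4000, 8000)  # slightly boosted
print(round(low, 1), round(high, 1))
```

A constant (0 Hz) input is reduced to 3% of its value, while components near the Nyquist frequency are boosted, which is the upward spectral tilt described above.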
Visualization of the Speech Signal: Spectrograms
The FFT window size has a large impact on visual properties:
[Figure: waveform (amp) and spectrograms (freq in Hz) of /aa/ computed with FFT size = 5 msec and with FFT size = 33 msec]
“wideband” = small time window = small FFT size
“narrowband” = large time window = large FFT size
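The trade-off can be made concrete by computing the spacing between frequency points for each window size. The sketch assumes an 8000 Hz sampling rate and an illustrative F0 of 100 Hz, neither of which is stated for these particular spectrograms; `window_properties` is a hypothetical helper.

```python
def window_properties(window_msec, sample_rate_hz):
    """Return (samples per window, spacing between DFT bins in Hz)."""
    n = int(round(window_msec * sample_rate_hz / 1000.0))
    return n, sample_rate_hz / n


fs = 8000   # assumed sampling rate
f0 = 100.0  # assumed pitch; harmonics are f0 apart

for label, msec in (("wideband", 5), ("narrowband", 33)):
    n, spacing = window_properties(msec, fs)
    resolves = spacing < f0
    print(f"{label}: {n} samples, {spacing:.1f} Hz per bin, "
          f"resolves harmonics: {resolves}")
```

With the short window, the bin spacing (200 Hz) is wider than the harmonic spacing, so individual harmonics blur together but changes over time are sharp; with the long window, the bins (about 30 Hz) are fine enough to separate harmonics at the cost of time resolution.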
Spectrogram Reading: Vowels
Vowel formant frequencies:
Spectrogram Reading: Vowels
Vowel formants (averages for English, male vs. female):
[Figure: chart of F1, F2, and F3 for each vowel, vertical axis 0 to 3500 Hz]
Labeled values (female averages, Hz):
Vowel   F1    F2    F3
iy      310   2790  3310
ih      430   2480  3070
eh      610   2330  2990
ae      860   2050  2850
ah      760   1400  2780
aa      850   1220  2810
uh      470   1160  2680
uw      370    950  2670
*from Peterson, G.E., and Barney, H.L. (1952). "Control methods used in the study
of vowels", Journal of the Acoustical Society of America, 24, 175-184.
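As a rough illustration of how F1 and F2 can identify a vowel, the sketch below picks the vowel whose average (F1, F2) is closest to a measured pair. It uses the female averages labeled in the chart above; plain Euclidean distance in Hz is an assumption made for simplicity, not a method from the lecture, and `nearest_vowel` is a hypothetical name.

```python
import math

# Average (F1, F2) in Hz, female speakers, Peterson and Barney (1952).
FORMANTS = {
    "iy": (310, 2790),
    "ih": (430, 2480),
    "eh": (610, 2330),
    "ae": (860, 2050),
    "ah": (760, 1400),
    "aa": (850, 1220),
    "uh": (470, 1160),
    "uw": (370, 950),
}


def nearest_vowel(f1, f2):
    """Vowel whose average (F1, F2) is closest to the measured pair."""
    return min(
        FORMANTS,
        key=lambda v: math.hypot(f1 - FORMANTS[v][0], f2 - FORMANTS[v][1]),
    )


print(nearest_vowel(300, 2800))
print(nearest_vowel(850, 1200))
```

Real speech varies considerably around these averages with speaker and context, so this lookup is only a first approximation.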
Spectrogram Reading: Vowels
Vowel formants, Peterson and Barney data:
Spectrogram Reading: Vowels
Ratios of 1st and 2nd formant, from Miller (1989) based on
Peterson and Barney (1952) data:
Spectrogram Reading: Vowels
Observed values from vowel midpoints from a single speaker,
speaking both “clearly” and “conversationally”, in different
phonetic contexts:
[Figure: formant values at vowel midpoints for iy, ih, eh, ae, ah, aa, uh, and uw] (from Amano-Kusumoto, PhD thesis, 2010)