Acoustics of Speech Julia Hirschberg CS 4706 11/7/2015 Goal 1: Distinguishing One Phoneme from Another, Automatically • ASR: Did the caller say ‘I want to.

Acoustics of Speech
Julia Hirschberg
CS 4706
11/7/2015
Goal 1: Distinguishing One Phoneme from
Another, Automatically
• ASR: Did the caller say ‘I want to fly to Newark’
or ‘I want to fly to New York’?
• Forensic Linguistics: Did the accused say ‘Kill
him’ or ‘Bill him’?
• What evidence is there in the speech signal?
– How accurately and reliably can we extract it?
Goal 2: Determining How Things Are Said Is
Sometimes Critical to Understanding
• Forensic Linguistics: ‘Kill him!’ or ‘Kill him?’
• Call Center: ‘That amount is incorrect.’
• What information do we need to extract from the
speech signal?
• What tools do we have to do this?
Today and Next Class
• Acoustic features to extract
– Fundamental frequency (pitch)
– Amplitude/energy (loudness)
– Spectrum
– Timing (pauses, rate)
• Tools for extraction
– Praat
– Wavesurfer
– Xwaves
Sound Production
• Pressure fluctuations in the air caused by a musical
instrument, a car horn, a voice
– Cause eardrum to move
– Auditory system translates into neural impulses
– Brain interprets as sound
– Plot sound as change in air pressure over time
• From a speech-centric point of view, when sound is not
produced by the human voice, we may term it noise
– Ratio of speech-generated sound to other
simultaneous sound: signal-to-noise ratio (SNR)
• Higher SNRs are better
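The signal-to-noise ratio can be sketched numerically. This is a minimal illustration, not part of the original slides. It assumes the speech and the noise are available as separate lists of samples, measures each by its mean squared amplitude (power), and reports the ratio in decibels, a common convention that the slide does not specify. The names `mean_power` and `snr_db` are made up for this example.

```python
import math

def mean_power(samples):
    """Mean squared amplitude of a list of samples."""
    return sum(s * s for s in samples) / len(samples)

def snr_db(speech, noise):
    """Ratio of speech power to noise power, expressed in decibels (assumed convention)."""
    return 10.0 * math.log10(mean_power(speech) / mean_power(noise))

# Toy data: a "speech" tone with 10 times the amplitude of a "noise" tone.
sample_rate = 8000
speech = [1.0 * math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(8000)]
noise = [0.1 * math.sin(2 * math.pi * 1300 * n / sample_rate) for n in range(8000)]

# 10x the amplitude is 100x the power, i.e. about 20 dB.
snr = snr_db(speech, noise)
```

In practice speech and noise are rarely available separately, so the noise power usually has to be estimated, for example from pauses.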
How ‘Loud’ are Common Sounds – How
Much Pressure Generated?
Event                          Pressure (µPa)   dB
Absolute (hearing threshold)   20               0
Whisper                        200              20
Quiet office                   2K               40
Conversation                   20K              60
Bus                            200K             80
Subway                         2M               100
Thunder                        20M              120
*DAMAGE*                       200M             140
Some Sounds are Periodic
• Simple Periodic Waves (sine waves) defined by
– Frequency: how often does pattern repeat per
time unit
• Cycle: one repetition
• Period: duration of cycle
• Frequency = # cycles per time unit, e.g.
– Frequency in Hz = cycles per second, or 1/period
– E.g. 400 Hz = 1/.0025 sec (1 cycle has a period of
.0025 sec; 400 cycles complete in 1 sec)
– Amplitude: peak deviation of pressure from
normal atmospheric pressure
– Phase: timing of waveform relative to a
reference point
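The three parameters of a simple periodic wave can be made concrete with a short sketch. It is an illustration only: the function names, the 8000 Hz sampling rate, and the 10 ms duration are assumptions chosen for the example, not values from the slides.

```python
import math

def sine_wave(freq_hz, amplitude, phase_rad, sample_rate, duration_s):
    """Samples of amplitude * sin(2*pi*freq*t + phase), taken sample_rate times per second."""
    n_samples = int(round(sample_rate * duration_s))
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate + phase_rad)
        for n in range(n_samples)
    ]

def period_s(freq_hz):
    """Duration of one cycle in seconds: the inverse of the frequency."""
    return 1.0 / freq_hz

# A 400 Hz tone: each cycle lasts 1/400 = 0.0025 s, i.e. 20 samples at 8000 Hz.
tone = sine_wave(400, 1.0, 0.0, 8000, 0.01)

# Same frequency and amplitude, shifted by a quarter cycle (phase = pi/2),
# so it starts at its peak instead of at zero.
shifted = sine_wave(400, 1.0, math.pi / 2, 8000, 0.01)
```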
Complex Periodic Waves
• Cyclic but composed of multiple sine waves
• Fundamental frequency (F0): rate at which
largest pattern repeats (also GCD of component
freqs)
• Components not always easily identifiable:
power spectrum graphs amplitude vs. frequency
• Any complex waveform can be analyzed into a
set of sine waves with their own frequencies,
amplitudes, and phases (Fourier’s theorem)
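As a rough sketch of Fourier analysis, the example below builds a complex periodic wave from three sine components, recovers their frequencies with a naive discrete Fourier transform, and takes their greatest common divisor as F0. The component frequencies and amplitudes, the sampling rate, and the 0.1 peak threshold are arbitrary choices for illustration. Real tools use a fast Fourier transform rather than this O(N^2) loop.

```python
import cmath
import math
from functools import reduce

def dft_amplitudes(samples):
    """Amplitude in each frequency bin below the Nyquist bin (naive DFT).

    The factor 2/N recovers the sine amplitude for non-zero frequencies;
    bin 0 (DC) would be overstated by 2x, but the test signal has no DC.
    """
    n = len(samples)
    amplitudes = []
    for k in range(n // 2):
        total = sum(
            samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        amplitudes.append(2 * abs(total) / n)
    return amplitudes

sample_rate = 1000
n_samples = 100                                 # bin spacing = 1000 / 100 = 10 Hz
components = {100: 1.0, 200: 0.5, 300: 0.25}    # frequency in Hz -> amplitude

wave = [
    sum(a * math.sin(2 * math.pi * f * t / sample_rate) for f, a in components.items())
    for t in range(n_samples)
]

amplitudes = dft_amplitudes(wave)
peak_freqs = [k * sample_rate // n_samples for k, a in enumerate(amplitudes) if a > 0.1]
f0 = reduce(math.gcd, peak_freqs)               # GCD of the component frequencies
```

Plotting `amplitudes` against bin frequency gives the kind of spectrum described above: energy only at the component frequencies.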
Some Sounds are Aperiodic
• Waveforms with random or non-repeating
patterns
– Random aperiodic waveforms: white noise
• Flat spectrum: equal amplitude for all frequency
components
– Transients: sudden bursts of pressure (clicks,
pops, door slams)
• Waveform shows a single impulse (click.wav)
• Fourier analysis shows a flat spectrum
• Some speech sounds, e.g. many consonants
(e.g. cat.wav)
Speech Production
• Voiced and voiceless sounds
• Vocal fold vibration filtered by the vocal tract
produces a complex periodic waveform
– Cycles per sec of lowest frequency
component of signal = fundamental frequency
(F0)
– Fourier analysis yields power spectrum with
component frequencies and amplitudes
• F0 is first (lowest frequency) peak
• Harmonics are integer multiples of F0; those near
resonances of the vocal tract are amplified
Vocal fold vibration
[UCLA Phonetics Lab demo]
Places of articulation
[Figure: vocal tract diagram labeling the places of articulation: labial, dental, alveolar, post-alveolar/palatal, velar, uvular, pharyngeal, laryngeal/glottal]
http://www.chass.utoronto.ca/~danhall/phonetics/sammy.html
How do we capture speech for analysis?
• Recording conditions
– A quiet office, a sound booth, an anechoic
chamber
• Microphones
• Analog devices (e.g. tape recorders) store and
analyze continuous air pressure variations
(speech) as a continuous signal
• Digital devices (e.g. computers, DAT) first
convert continuous signals into discrete signals
(A-to-D conversion)
• File format:
– .wav, .aiff, .ds, .au, .sph,…
– Conversion programs, e.g. sox
• Storage
– Function of how much information we store
about speech in digitization
• Higher quality, closer to original
• More space (1000s of hours of speech take up a
lot of space)
Sampling
• Sampling rate: how often do we need to
sample?
– At least 2 samples per cycle to capture
periodicity of a waveform component at a
given frequency
• 100 Hz waveform needs 200 samples per sec
• Nyquist frequency: highest-frequency component
captured with a given sampling rate (half the
sampling rate)
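The sampling-rate arithmetic is simple enough to write down directly. The two helper names below are hypothetical, and the sketch follows the slide's "at least 2 samples per cycle" rule as stated.

```python
def min_sampling_rate(max_freq_hz):
    """Lowest sampling rate that gives 2 samples per cycle at max_freq_hz."""
    return 2 * max_freq_hz

def nyquist_frequency(sample_rate_hz):
    """Highest frequency component captured at a given sampling rate."""
    return sample_rate_hz / 2

rate_for_100hz = min_sampling_rate(100)     # 200 samples per second
top_at_8k = nyquist_frequency(8000)         # 4000.0 Hz
```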
Sampling/storage tradeoff
• Human hearing: ~20 kHz top frequency
– Do we really need to store 40K samples per
second of speech?
• Telephone speech: 300-4K Hz (8 kHz sampling)
– But some speech sounds (e.g. the fricatives /f/ and
/s/, and the bursts of the stops /p/, /t/, /d/) have energy above 4 kHz!
– Peter/teeter/Dieter
• 44.1 kHz (CD quality audio) vs. 16-22 kHz (usually good
enough to study pitch, amplitude, duration, …)
Sampling Errors
• Aliasing:
– Occurs when the signal contains frequencies higher
than half the sampling rate; these are misrepresented
as lower frequencies
– Solutions:
• Increase the sampling rate
• Filter out frequencies above half the sampling rate
(anti-aliasing filter)
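A small sketch can show what aliasing does to a tone above half the sampling rate. The helper `alias_frequency` is a hypothetical name, and the 5000 Hz tone and 8000 Hz rate are example values. The formula folds the tone's frequency around the nearest multiple of the sampling rate, which is the standard result for sampled real-valued sinusoids.

```python
import math

def alias_frequency(freq_hz, sample_rate_hz):
    """Apparent frequency of a pure tone after sampling at sample_rate_hz."""
    nearest_multiple = sample_rate_hz * round(freq_hz / sample_rate_hz)
    return abs(freq_hz - nearest_multiple)

sample_rate = 8000                  # Nyquist frequency is 4000 Hz
apparent = alias_frequency(5000, sample_rate)

# Sampled at 8000 Hz, a 5000 Hz cosine yields the same sample values
# as a cosine at the alias frequency, so the two cannot be told apart.
high = [math.cos(2 * math.pi * 5000 * n / sample_rate) for n in range(16)]
low = [math.cos(2 * math.pi * apparent * n / sample_rate) for n in range(16)]
```

Tones at or below half the sampling rate come back unchanged from `alias_frequency`, which is why an anti-aliasing filter applied before sampling avoids the problem.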
Quantization
• Measuring the amplitude at sampling points:
what resolution to choose?
– Integer representation
– 8, 12 or 16 bits per sample
• Noise due to quantization steps is reduced by
higher resolution -- but that requires more storage
– How many different amplitude levels do we
need to distinguish?
– Choice depends on data and application (44.1 kHz
16-bit stereo requires ~10 MB of storage per minute)
– But clipping occurs when input volume is
greater than range representable in digitized
waveform
• Increase the resolution
• Decrease the amplitude
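The storage cost and the clipping behaviour can both be sketched in a few lines. The example assumes amplitudes are floats nominally in the range -1.0 to 1.0 and are stored as signed integers. The function names are illustrative, and the one-minute duration is an assumption used to make the storage figure concrete.

```python
def storage_bytes(sample_rate_hz, bits_per_sample, channels, seconds):
    """Uncompressed storage needed for a recording, in bytes."""
    return sample_rate_hz * bits_per_sample * channels * seconds / 8

def quantize(value, bits):
    """Map a float amplitude to a signed integer level, clipping out-of-range input."""
    max_level = 2 ** (bits - 1) - 1       # e.g. 32767 for 16 bits
    min_level = -(2 ** (bits - 1))        # e.g. -32768 for 16 bits
    level = round(value * max_level)
    return max(min_level, min(max_level, level))

# One minute of 44.1 kHz, 16-bit stereo audio: 10,584,000 bytes, roughly 10.6 MB.
one_minute_cd = storage_bytes(44100, 16, 2, 60)

in_range = quantize(0.0, 16)      # representable amplitude
clipped = quantize(1.5, 16)       # input louder than the representable range
```

Any input beyond the representable range collapses to the same extreme level, which is what flattens the peaks of a clipped waveform.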
What can we do if our data is ‘noisy’?
• Acoustic filters block out certain frequencies of
sounds
– Low-pass filter blocks high frequency
components of a waveform
– High-pass filter blocks low frequencies
– Reject band (what to block) vs. pass band
(what to let through)
• But if the frequencies of two sounds
overlap… source separation is needed
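One of the crudest low-pass filters is a moving average, used here only as a stand-in to show the idea of blocking high frequencies. It is not the filter design a tool such as Praat would use. The signal frequencies, amplitudes, and the 4-sample width are assumptions picked so the effect is easy to see.

```python
import math

def moving_average(samples, width):
    """Replace each run of `width` consecutive samples by its mean."""
    return [
        sum(samples[i:i + width]) / width
        for i in range(len(samples) - width + 1)
    ]

sample_rate = 8000
slow = [math.sin(2 * math.pi * 50 * n / sample_rate) for n in range(800)]
fast = [0.5 * math.sin(2 * math.pi * 2000 * n / sample_rate) for n in range(800)]
noisy = [s + f for s, f in zip(slow, fast)]

# A 4-sample window spans exactly one cycle of the 2000 Hz component,
# so that component averages to zero while the 50 Hz component survives.
smoothed = moving_average(noisy, 4)
```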
How can we capture pitch contours, pitch
range?
• What is the pitch contour of this utterance? Is
the pitch range of X greater than that of Y?
• Pitch tracking: Estimate F0 over time as a function of
vocal fold vibration
• A periodic waveform is correlated with itself
– One period looks much like another (cat.wav)
– Find the period by finding the ‘lag’ (offset)
between two windows on the signal for which
the correlation of the windows is highest
– Lag duration (T) is 1 period of waveform
– Inverse is F0 (1/T)
• Errors to watch for:
– Halving: shortest lag calculated is too long
(underestimate pitch)
– Doubling: shortest lag too short (overestimate
pitch)
– Microprosody effects (e.g. /v/)
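The lag search described above can be sketched as a bare-bones autocorrelation pitch estimator for a single analysis window. This is an illustration under stated assumptions, not the algorithm of any particular pitch tracker. The 75-400 Hz search range is an assumed default, there is no voicing decision, and real trackers add normalization and smoothing to reduce halving and doubling errors.

```python
import math

def autocorrelation(samples, lag):
    """Sum of products of the signal with a copy of itself shifted by `lag` samples."""
    return sum(samples[n] * samples[n + lag] for n in range(len(samples) - lag))

def estimate_f0(samples, sample_rate, f0_min=75.0, f0_max=400.0):
    """Return sample_rate / T, where T is the lag with the highest autocorrelation."""
    min_lag = int(sample_rate / f0_max)     # shortest period considered
    max_lag = int(sample_rate / f0_min)     # longest period considered
    best_lag = max(
        range(min_lag, max_lag + 1),
        key=lambda lag: autocorrelation(samples, lag),
    )
    return sample_rate / best_lag

# Test signal: a 200 Hz fundamental plus a weaker harmonic at 400 Hz.
sample_rate = 8000
window = [
    math.sin(2 * math.pi * 200 * n / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 400 * n / sample_rate)
    for n in range(400)
]
f0 = estimate_f0(window, sample_rate)
```

Restricting the lag range to plausible periods for the speaker is one simple guard against the halving and doubling errors listed above.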
Sample Analysis File: Pitch Track Header
• version 1
• type_code 4
• frequency 12000.000000
• samples 160768
• start_time 0.000000
• end_time 13.397333
• bandwidth 6000.000000
• dimensions 1
• maximum 9660.000000
• minimum -17384.000000
• time Sat Nov 2 15:55:50 1991
• operation record: padding xxxxxxxxxxxx
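Several header fields can be cross-checked against each other. The sketch below copies five fields from the header above and assumes that `frequency` is the sampling rate in Hz and `samples` is the number of samples in the file. That interpretation is not stated in the slides, although the arithmetic is consistent with it. The `parse_header` helper is made up for this example.

```python
header_text = """\
frequency 12000.000000
samples 160768
start_time 0.000000
end_time 13.397333
bandwidth 6000.000000
"""

def parse_header(text):
    """Turn 'name value' lines into a dict of floats."""
    fields = {}
    for line in text.splitlines():
        name, value = line.split(None, 1)
        fields[name] = float(value)
    return fields

header = parse_header(header_text)

# Duration = number of samples / samples per second; compare with end_time.
duration_s = header["samples"] / header["frequency"]

# Half the sampling rate (the Nyquist frequency); compare with bandwidth.
nyquist_hz = header["frequency"] / 2
```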
Sample Analysis File: Pitch Track Data
F0        Pvoicing   Energy    A/C Score
147.896   1          2154.07   0.902643
140.894   1          1544.93   0.967008
138.05    1          1080.55   0.92588
130.399   1          745.262   0.595265
0         0          567.153   0.504029
0         0          638.037   0.222939
0         0          670.936   0.370024
0         0          790.751   0.357141
141.215   1          1281.1    0.904345
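Pitch-track rows like these are easy to summarize once the unvoiced frames are set aside. The sketch below uses the nine rows shown above and assumes the columns are F0, voicing flag, energy, and autocorrelation score, in that order. It computes the mean F0 and the pitch range over voiced frames only, since F0 is reported as 0 where there is no voicing.

```python
rows_text = """\
147.896 1 2154.07 0.902643
140.894 1 1544.93 0.967008
138.05 1 1080.55 0.92588
130.399 1 745.262 0.595265
0 0 567.153 0.504029
0 0 638.037 0.222939
0 0 670.936 0.370024
0 0 790.751 0.357141
141.215 1 1281.1 0.904345
"""

def parse_rows(text):
    """Parse whitespace-separated rows into tuples of floats."""
    return [tuple(float(field) for field in line.split()) for line in text.splitlines()]

rows = parse_rows(rows_text)

# Keep F0 only for frames flagged as voiced.
voiced_f0 = [f0 for f0, voicing, energy, score in rows if voicing == 1]

mean_f0 = sum(voiced_f0) / len(voiced_f0)
pitch_range = max(voiced_f0) - min(voiced_f0)
```

Including the zero-valued unvoiced frames would drag the mean down and inflate the range, so filtering on the voicing flag matters.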
Pitch Perception
• But do pitch trackers capture what humans perceive?
• Auditory system’s perception of pitch is non-linear
– Sounds at lower frequencies with same difference in
absolute frequency sound more different than those at
higher frequencies (male vs. female speech)
– Bark scale (Zwicker) and other models of perceived
difference
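To make the non-linearity concrete, the sketch below converts Hz to Bark using one published closed-form approximation (Zwicker and Terhardt, 1980). Other approximations exist and the slides do not say which one is intended, so treat the formula choice as an assumption. The example compares the same 100 Hz step at a low and at a high frequency.

```python
import math

def hz_to_bark(freq_hz):
    """Approximate Bark value for a frequency in Hz (Zwicker & Terhardt 1980 formula)."""
    return 13.0 * math.atan(0.00076 * freq_hz) + 3.5 * math.atan((freq_hz / 7500.0) ** 2)

# The same 100 Hz difference, measured on the Bark scale:
low_step = hz_to_bark(200) - hz_to_bark(100)      # low in the speech range
high_step = hz_to_bark(3100) - hz_to_bark(3000)   # much higher up

# low_step comes out several times larger than high_step.
```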
How do we capture loudness/intensity?
• Is one utterance louder than another?
• Energy closely correlated experimentally with
perceived loudness
• For each window, square the amplitude values
of the samples, take their mean, and take the
root of that mean (RMS energy)
– What size window?
– Longer windows produce smoother amplitude
traces but miss sudden acoustic events
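The RMS computation and the window-size tradeoff can be sketched as follows. The function names, window sizes, and the artificial "burst" signal are assumptions made for illustration.

```python
import math

def rms(window):
    """Square the samples, take their mean, then take the square root."""
    return math.sqrt(sum(s * s for s in window) / len(window))

def rms_track(samples, window_size, hop_size):
    """RMS energy for successive windows, stepping hop_size samples each time."""
    return [
        rms(samples[start:start + window_size])
        for start in range(0, len(samples) - window_size + 1, hop_size)
    ]

# A steady tone of amplitude 1 has RMS 1/sqrt(2), about 0.707.
tone = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(400)]
tone_rms = rms_track(tone, 80, 80)

# Silence with a 20-sample burst: a short window registers the burst at
# full strength, while a long window averages it away.
burst = [0.0] * 200 + [1.0] * 20 + [0.0] * 180
short_trace = rms_track(burst, 20, 20)
long_trace = rms_track(burst, 200, 200)
```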
Perception of Loudness
• But the relation is non-linear: sones or decibels (dB)
– Differences in soft sounds more salient than loud
– Intensity is proportional to the square of amplitude,
so the intensity of a sound with pressure $x$ relative to a reference
sound with pressure $r$ is $x^2/r^2$
– bel: base 10 log of that ratio
– decibel: one tenth of a bel
– $\mathrm{dB} = 10 \log_{10}(x^2/r^2) = 20 \log_{10}(x/r)$
– Absolute threshold (20 µPa, lowest audible pressure fluctuation of a
1000 Hz tone): typical threshold level for a tone at that frequency
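The decibel formula can be checked with a few lines of arithmetic. The sketch takes 20 micropascals as the reference pressure, the conventional threshold-of-hearing reference for sound pressure level. The function name and the example pressures are illustrative.

```python
import math

REFERENCE_UPA = 20.0   # reference pressure in micropascals (assumed convention)

def pressure_to_db(pressure_upa, reference_upa=REFERENCE_UPA):
    """Decibels = 10 * log10 of the squared pressure ratio."""
    return 10.0 * math.log10(pressure_upa ** 2 / reference_upa ** 2)

threshold_db = pressure_to_db(20.0)          # the reference itself: 0 dB
tenfold_db = pressure_to_db(200.0)           # 10x the pressure: about 20 dB
doubling_db = pressure_to_db(40.0)           # 2x the pressure: about 6 dB
```

Each tenfold increase in pressure adds 20 dB, so a wide range of pressures maps onto a compact scale.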
How do we capture….
For utterances X and Y
• Pitch contour: Same or different?
• Pitch range: Is X larger than Y?
• Duration: Is utterance X longer than utterance
Y?
• Speaker rate: Is the speaker of X speaking
faster than the speaker of Y?
• Voice quality….
Next Class
• Tools for the Masses: Read the Praat tutorial
• Download Praat from the course syllabus page
and play with a speech file (e.g.
http://www.cs.columbia.edu/~julia/cs4706/cc_001_sadness_1669.04_August-second-.wav
or record your own)
• Bring a laptop and headphones to class if you
have them