Acoustics of Speech
Lecture 4 Spoken Language Processing Prof. Andrew Rosenberg
Overview
• What is in a speech signal?
• Defining cues to phonetic segments and intonation.
• Techniques to extract these cues.
Phone Recognition
• Goal: Distinguishing One Phoneme from Another…Automatically
• ASR: Did the caller say “I want to fly to Newark” or “I want to fly to New York”?
• Forensic Linguistics: Did that person say “Kill him” or “Bill him”?
• What evidence is available in the speech signal?
– How accurately and reliably can we extract it?
– What qualities make this difficult? Easy?
Prosody and Intonation
• How things are said is sometimes critical and often useful for understanding
• Forensic Linguistics: “Kill him!” vs. “Kill him?”
• TTS: “Travelling from Boston?” vs. “Travelling from Boston.”
• What information do we need to extract from/generate in the speech signal?
• What tools do we have to do this?
Speech Features
• What cues are important?
– Spectral Features
– Fundamental Frequency (pitch)
– Amplitude/energy (loudness)
– Timing (pauses, rate)
– Voice Quality
• How do we extract these?
– Digital Signal Processing
– Tools and Algorithms
  • Praat
  • Wavesurfer
  • Xwaves
Sound Production
• Pressure fluctuations in the air caused by a voice, a musical instrument, a car horn, etc.
– Sound waves propagate through material: air, but also solids, etc.
– Cause eardrum (tympanum) to vibrate
– Auditory system translates this into neural impulses
– Brain interprets these as sound
• Represent sounds as change in pressure over time
How “loud” are sounds?
Event               Pressure (µPa)   dB
Absolute silence    20               0
Whisper             200              20
Quiet office        2K               40
Conversation        20K              60
Bus                 200K             80
Subway              2M               100
Thunder             20M              120
*Hearing Damage*    200M             140
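The dB column follows from the pressure column if we assume the pressures are in micropascals and take the "absolute silence" row (20 µPa) as the reference. A minimal sketch under that assumption, using the standard 20·log10(P/P0) definition; the function name `pressure_to_db` is just for illustration:

```python
import math

P0_MICROPASCAL = 20.0  # assumed reference: the "absolute silence" row

def pressure_to_db(pressure_upa, p0_upa=P0_MICROPASCAL):
    """Sound pressure level in dB relative to p0 (both in micropascals)."""
    return 20.0 * math.log10(pressure_upa / p0_upa)

# Pressure column of the table, in micropascals.
table_pressures = [20, 200, 2e3, 20e3, 200e3, 2e6, 20e6, 200e6]
levels = [round(pressure_to_db(p)) for p in table_pressures]
print(levels)  # [0, 20, 40, 60, 80, 100, 120, 140]
```

Each tenfold increase in pressure adds 20 dB, which is why the dB column grows in even steps.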
Voiced Sounds are (mostly) Periodic
• Simple Periodic Waves (sine waves) defined by
– Frequency: how often does the pattern repeat per time unit
  • Cycle: one repetition
  • Period: duration of a cycle
  • Frequency: #cycles per time unit (usually second)
– Frequency in Hertz (Hz): cycles per second or 1 / period
– E.g. 400 Hz = 1/0.0025 (a cycle has a period of 0.0025 seconds; 400 cycles complete in a second)
• Zero crossing: where the waveform crosses the x axis
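To make the 400 Hz example concrete, here is a small sketch that samples a sine wave and counts zero crossings. The 16 kHz sample rate, the small phase offset and the helper names are assumptions chosen for illustration, not values from the slides:

```python
import math

def sine_wave(freq_hz, duration_s, sample_rate=16000, phase_rad=0.1):
    """Sample a unit-amplitude sine wave. The small phase offset keeps
    samples from landing exactly on zero."""
    n = int(round(duration_s * sample_rate))
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate + phase_rad)
            for i in range(n)]

def count_zero_crossings(samples):
    """Count sign changes between consecutive samples."""
    return sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))

freq_hz = 400
period_s = 1 / freq_hz                  # duration of one cycle
wave = sine_wave(freq_hz, duration_s=1.0)
crossings = count_zero_crossings(wave)  # a sine crosses zero twice per cycle
estimated_hz = crossings / 2 / 1.0      # crossings per second, halved
print(period_s, crossings, estimated_hz)
```

Counting zero crossings gives a rough frequency estimate for a simple sine wave; it is not reliable for complex waveforms.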
Voiced Sounds are (mostly) Periodic
• Simple Periodic Waves (sine waves) defined by
– Amplitude: peak deviation of pressure from normal atmospheric pressure
– Phase: timing of a waveform relative to a reference point
Phase Differences
Complex Periodic Waves
• Cyclic but composed of multiple sine waves
• Fundamental Frequency (F0): rate at which the largest pattern repeats
– Also GCD of component frequencies
• Harmonics: rate of shorter patterns
• Any complex waveform can be analyzed into its component sine waves with their frequencies, amplitudes and phases (Fourier theorem, in 2 lectures)
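The "GCD of component frequencies" point can be checked numerically. The sketch below sums three sine waves and verifies that the result repeats at the GCD rate. The component frequencies, amplitudes, phases and the 8 kHz sample rate are made-up illustration values:

```python
import math
from functools import reduce

def fundamental_hz(component_freqs_hz):
    """F0 as the greatest common divisor of integer component frequencies."""
    return reduce(math.gcd, component_freqs_hz)

def complex_wave(components, n_samples, sample_rate=8000):
    """Sum sine waves given as (freq_hz, amplitude, phase_rad) tuples."""
    return [sum(amp * math.sin(2 * math.pi * f * i / sample_rate + ph)
                for f, amp, ph in components)
            for i in range(n_samples)]

components = [(200, 1.0, 0.0), (300, 0.5, 0.0), (500, 0.25, 1.0)]
f0 = fundamental_hz([f for f, _, _ in components])   # 100
wave = complex_wave(components, n_samples=400)
period_samples = 8000 // f0                          # 80 samples per cycle
repeats = all(abs(wave[i] - wave[i + period_samples]) < 1e-9
              for i in range(len(wave) - period_samples))
print(f0, period_samples, repeats)
```

Note that no 100 Hz component is present in this example, yet the overall pattern still repeats 100 times per second.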
2 sine waves -> 1 complex wave
4 sine waves -> 1 complex wave
Power Spectra and Spectrograms
• Frequency components of a complex waveform are represented in the power spectrum.
– Plots frequency and amplitude of each component sine wave
• Adding temporal dimension -> Spectrogram
• Obtained via Fast Fourier Transform (FFT), Linear Predictive Coding (LPC)…
– Useful for analysis, coding and synthesis.
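As a rough illustration of the two ideas, the sketch below computes a power spectrum for one frame and stacks spectra of successive frames into a spectrogram. It uses a naive discrete Fourier transform (not the FFT algorithm, and with no windowing) to stay dependency-free. The frame length, hop size and test tones are assumptions chosen for illustration:

```python
import cmath
import math

def power_spectrum(frame):
    """Power at bins 0..N/2 using a naive O(N^2) DFT."""
    n = len(frame)
    powers = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame))
        powers.append(abs(coeff) ** 2 / n)
    return powers

def spectrogram(samples, frame_len=128, hop=64):
    """One power spectrum per frame: the added time dimension."""
    return [power_spectrum(samples[s:s + frame_len])
            for s in range(0, len(samples) - frame_len + 1, hop)]

def tone(freq_hz, n_samples, sample_rate):
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n_samples)]

sample_rate = 8000
signal = tone(500, 256, sample_rate) + tone(1500, 256, sample_rate)
frames = spectrogram(signal)
bin_hz = sample_rate / 128  # frequency spacing between bins
peaks = [max(range(len(f)), key=f.__getitem__) * bin_hz for f in frames]
print(peaks[0], peaks[-1])
```

The strongest bin moves from the first tone to the second as the frames advance, which is what a spectrogram displays as a change over time.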
Example Power spectrum
Australian male /i:/ from “heed”, FFT analysis window 12.8ms
http://clas.mq.edu.au/acoustics/speech_spectra/fft_lpc_settings.html
Example Spectrogram
Terms
• Spectral Slice: plots the amplitude at each frequency
• Spectrograms: plot amplitude and frequency over time
• Harmonics: components of a complex waveform that are multiples of the fundamental frequency (F0)
• Formants: frequency bands that are most amplified in speech.
Aperiodic Waveforms
• Waveforms with random or non-repeating patterns
– Random aperiodic waveforms: white noise
  • Flat spectrum: equal amplitude for all frequency components.
– Transients: sudden bursts of pressure (clicks, pops, lip smacks, door slams, etc.)
  • Flat spectrum at a single impulse
– Voiceless consonants
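The "flat spectrum at a single impulse" claim can be verified directly. This sketch compares the magnitude spectrum of a single-sample impulse with that of a sine wave, using a naive DFT. The 64-sample length and the impulse position are arbitrary illustration choices:

```python
import cmath
import math

def dft_magnitudes(samples):
    """Magnitude at bins 0..N/2 using a naive O(N^2) DFT."""
    n = len(samples)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(samples)))
            for k in range(n // 2 + 1)]

n = 64
impulse = [0.0] * n
impulse[10] = 1.0                       # a single click
sine = [math.sin(2 * math.pi * 8 * i / n) for i in range(n)]  # 8 cycles

impulse_mags = dft_magnitudes(impulse)  # same magnitude in every bin
sine_mags = dft_magnitudes(sine)        # energy concentrated in bin 8
print(min(impulse_mags), max(impulse_mags))
print(max(range(len(sine_mags)), key=sine_mags.__getitem__))
```

The impulse spreads its energy equally over all frequencies, while the periodic sine wave concentrates it in one bin.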
Speech Waveforms
• Lungs plus vocal fold vibration is filtered by resonance of the vocal tract to produce complex, periodic waveforms.
– Pitch range, mean, max: cycles per sec of lowest frequency periodic component of a signal = “Fundamental frequency (F0)”
– Loudness
  • RMS amplitude
  • Intensity: in dB relative to a reference pressure P0
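The two loudness measures can be sketched as follows. The formulas are the standard textbook definitions (root of the mean squared sample value, and 20·log10 of the RMS over P0) and are an assumption here, since the slide's own equations are not shown; the value of `p0` is a placeholder in arbitrary units:

```python
import math

def rms_amplitude(samples):
    """Root mean square of the sample values."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def intensity_db(samples, p0):
    """Intensity in dB relative to a reference pressure p0 (assumed formula)."""
    return 20.0 * math.log10(rms_amplitude(samples) / p0)

# Ten whole cycles of a sine wave, at two amplitudes.
quiet = [math.sin(2 * math.pi * i / 100) for i in range(1000)]
loud = [2.0 * x for x in quiet]

p0 = 1e-3  # placeholder reference in the same arbitrary units as the samples
quiet_rms = rms_amplitude(quiet)
gain_db = intensity_db(loud, p0) - intensity_db(quiet, p0)
print(quiet_rms, gain_db)
```

Doubling the amplitude raises the intensity by about 6 dB regardless of the reference chosen, since the reference cancels in the difference.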
Collecting speech for analysis?
• Recording conditions
– A quiet office, a sound booth, an anechoic chamber
• Microphones convert sound into electrical current
– Oscillations of air pressure are converted to oscillations of current
– Analog devices (e.g. tape recorders) store these as a continuous signal
– Digital devices (e.g. DAT, computers) convert to a digital signal (digitizing)
Digital Sound Representation
• A microphone is a mechanical eardrum, capable of measuring change in air pressure over time.
• Digital recording converts analog (smoothly continuous) changes in air pressure over time to a digital signal.
• The digital representation:
– measures the pressure at a fixed time interval (sampling rate)
– represents pressure as an integer value (bit depth)
• The analog to digital conversion results in a loss of information.
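A minimal sketch of the two steps, sampling at a fixed interval and rounding each sample to an integer, shows where the loss of information comes from. The 440 Hz test signal, the 16 kHz sample rate and the function names are assumptions for illustration only:

```python
import math

def quantize(value, bit_depth):
    """Map a value in [-1, 1] to a signed integer of the given bit depth."""
    max_level = 2 ** (bit_depth - 1) - 1
    return round(value * max_level)

def digitize(signal_fn, duration_s, sample_rate, bit_depth):
    """Sample signal_fn(t) every 1/sample_rate seconds, then quantize."""
    n = int(round(duration_s * sample_rate))
    return [quantize(signal_fn(i / sample_rate), bit_depth) for i in range(n)]

def max_quantization_error(signal_fn, duration_s, sample_rate, bit_depth):
    """Largest gap between the true value and its quantized version."""
    max_level = 2 ** (bit_depth - 1) - 1
    digital = digitize(signal_fn, duration_s, sample_rate, bit_depth)
    return max(abs(signal_fn(i / sample_rate) - q / max_level)
               for i, q in enumerate(digital))

def analog(t):
    """Stand-in for a continuous pressure signal."""
    return math.sin(2 * math.pi * 440 * t)

samples_16bit = digitize(analog, 0.01, 16000, 16)
err8 = max_quantization_error(analog, 0.01, 16000, 8)
err16 = max_quantization_error(analog, 0.01, 16000, 16)
print(len(samples_16bit), err8, err16)
```

The rounding error is bounded by half a quantization step, so it shrinks as bit depth grows but never reaches zero.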
Waveform – “Name”
Analog to Digital Conversion
• “Quantization” or “Discretization”
Analog to Digital Conversion
• Bit depth impact
– 16-bit sound (CD quality)
– 8-bit sound
• Sampling rate impact
– 44.1kHz
– 16kHz
– 8kHz
– 4kHz
Nyquist Rate
• At least 2 samples per cycle are necessary to capture the periodicity of a waveform at a given frequency
• 100Hz needs 200 samples per sec
• Nyquist Frequency or Nyquist Rate
– Highest frequency that can be captured with a given sampling rate (half the sampling rate)
– 8kHz sampling rate (Telephone speech) can capture frequencies up to 4kHz
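What goes wrong above the limit can be shown with a short sketch: a tone above half the sampling rate produces exactly the same samples (up to sign) as a lower tone, so the two cannot be told apart after sampling. The 5 kHz and 3 kHz tones are illustration values chosen to pair with the 8 kHz telephone rate:

```python
import math

def nyquist_frequency(sample_rate_hz):
    """Highest frequency representable at this sampling rate."""
    return sample_rate_hz / 2

def sample_tone(freq_hz, sample_rate_hz, n_samples):
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate_hz)
            for i in range(n_samples)]

sample_rate = 8000
limit = nyquist_frequency(sample_rate)       # 4000.0

too_high = sample_tone(5000, sample_rate, 64)  # above the limit
alias = sample_tone(3000, sample_rate, 64)     # 8000 - 5000

# The 5 kHz samples are the 3 kHz samples with inverted sign.
indistinguishable = all(abs(h + a) < 1e-9 for h, a in zip(too_high, alias))
print(limit, indistinguishable)
```

This effect is called aliasing, and it is why energy above the limit is usually filtered out before sampling.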
Sampling/storage trade off
• Human hearing: ~20kHz top frequency
– Should we store 40kHz samples?
• Telephone speech 300Hz-4kHz (8kHz sampling)
– But some speech sounds (e.g., fricatives, stops) have energy above 4kHz
– Peter, Teeter, Dieter
• 44kHz (CD quality) vs. 16-22kHz
– Usually good enough to study speech, amplitude, duration, pitch, etc.
• Golden Ears.
Filtering
• Acoustic filters block out certain frequencies of sounds
– Low-pass filter blocks high frequency components
– High-pass filter blocks low frequencies
– Band-pass filter blocks both high and low, around a band
– Reject band (what to block) vs. pass band (what to let through)
• What if the frequencies of two sounds overlap?
– Source Separation
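One simple way to illustrate pass bands and reject bands is to zero out unwanted bins in the frequency domain and transform back. This is a sketch of the idea only, using a naive DFT; practical filters are usually designed differently, and the function names, cutoff and test tones are assumptions:

```python
import cmath
import math

def dft(x):
    n = len(x)
    return [sum(v * cmath.exp(-2j * math.pi * k * i / n) for i, v in enumerate(x))
            for k in range(n)]

def idft(spectrum):
    n = len(spectrum)
    return [(sum(v * cmath.exp(2j * math.pi * k * i / n)
                 for k, v in enumerate(spectrum)) / n).real
            for i in range(n)]

def band_pass(samples, sample_rate, low_hz, high_hz):
    """Keep bins inside [low_hz, high_hz] (pass band), zero the rest."""
    n = len(samples)
    spectrum = dft(samples)
    for k in range(n):
        freq = min(k, n - k) * sample_rate / n  # fold mirrored upper bins
        if not (low_hz <= freq <= high_hz):
            spectrum[k] = 0.0
    return idft(spectrum)

def low_pass(samples, sample_rate, cutoff_hz):
    return band_pass(samples, sample_rate, 0.0, cutoff_hz)

def high_pass(samples, sample_rate, cutoff_hz):
    return band_pass(samples, sample_rate, cutoff_hz, sample_rate / 2)

sample_rate, n = 8000, 128
low_tone = [math.sin(2 * math.pi * 250 * i / sample_rate) for i in range(n)]
high_tone = [math.sin(2 * math.pi * 3000 * i / sample_rate) for i in range(n)]
mixture = [a + b for a, b in zip(low_tone, high_tone)]

kept_low = low_pass(mixture, sample_rate, 1000)
kept_high = high_pass(mixture, sample_rate, 1000)
print(max(abs(a - b) for a, b in zip(kept_low, low_tone)))
```

Because the two tones occupy separate bands, each filter recovers one of them. When the frequencies of two sounds overlap, no choice of band separates them.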
Estimating pitch
• Pitch Tracking: Estimate F0 over time as a function of vocal fold vibration
• How? Autocorrelation approach
– A periodic waveform is correlated with itself, since one period looks like another
– Find the period by finding the “lag” (offset) between two windows of the signal where the correlation of the windows is highest
– Lag duration, T, is one period of the waveform
– F0 is the inverse: 1/T
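The autocorrelation steps above can be sketched as follows. This is a simplified illustration, not the algorithm of any particular tool. The search range (75 to 500 Hz), window length, normalization and the synthetic "voiced" signal are all assumptions:

```python
import math

def normalized_correlation(a, b):
    """Correlation of two equal-length windows, scaled to [-1, 1]."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den > 0 else 0.0

def estimate_f0(samples, sample_rate, window_len, f0_min=75.0, f0_max=500.0):
    """Find the lag whose window best matches the first window; F0 = 1/T.
    Needs len(samples) >= window_len + sample_rate / f0_min."""
    min_lag = int(sample_rate / f0_max)
    max_lag = int(sample_rate / f0_min)
    reference = samples[:window_len]
    best_lag = max(
        range(min_lag, max_lag + 1),
        key=lambda lag: normalized_correlation(
            reference, samples[lag:lag + window_len]))
    period_s = best_lag / sample_rate  # lag duration T
    return 1.0 / period_s

sample_rate = 8000
harmonics = ((1, 1.0), (2, 0.6), (3, 0.3))  # (multiple of F0, amplitude)
voiced = [sum(amp * math.sin(2 * math.pi * h * 100 * i / sample_rate)
              for h, amp in harmonics)
          for i in range(800)]
f0 = estimate_f0(voiced, sample_rate, window_len=400)
print(f0)
```

For this synthetic 100 Hz signal the best lag should be 80 samples, one full period. A pitch tracker repeats this estimate frame by frame to follow F0 over time.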
Pitch Issues
• Microprosody effects of consonants (e.g. /v/)
• Creaky voice -> no pitch track, or noisy estimate
• Errors to watch for:
– Halving: shortest lag calculated is too long, by one or more cycles.
  • Since the estimated lag is too long, the pitch is too low (underestimation of pitch)
– Doubling: shortest lag is too short. Second half of the cycle is similar to the first
  • Estimates a short lag, counts too many cycles per second (overestimation of pitch)
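One way both errors can arise is a lag search range that excludes the true period. The sketch below is a hypothetical illustration: it uses a simple correlation-based estimator and a synthetic 100 Hz signal whose half-cycles look alike (a weak fundamental plus a strong second harmonic). The search ranges are chosen deliberately to provoke each error:

```python
import math

def normalized_correlation(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den > 0 else 0.0

def estimate_f0(samples, sample_rate, window_len, f0_min, f0_max):
    """Pick the lag in the allowed range with the highest correlation."""
    min_lag = int(sample_rate / f0_max)
    max_lag = int(sample_rate / f0_min)
    reference = samples[:window_len]
    best_lag = max(
        range(min_lag, max_lag + 1),
        key=lambda lag: normalized_correlation(
            reference, samples[lag:lag + window_len]))
    return sample_rate / best_lag

sample_rate = 8000
# True F0 is 100 Hz, but the second half of each cycle resembles the first.
signal = [0.2 * math.sin(2 * math.pi * 100 * i / sample_rate)
          + math.sin(2 * math.pi * 200 * i / sample_rate)
          for i in range(800)]

correct = estimate_f0(signal, sample_rate, 400, f0_min=75, f0_max=500)
# Shortest allowed lag is longer than one period: two periods match best.
halved = estimate_f0(signal, sample_rate, 400, f0_min=40, f0_max=90)
# Longest allowed lag is shorter than one period: the half-cycle matches best.
doubled = estimate_f0(signal, sample_rate, 400, f0_min=150, f0_max=500)
print(correct, halved, doubled)
```

With a suitable range the estimate should be 100 Hz; the two restricted ranges should give about 50 Hz and 200 Hz. Real trackers can make the same errors for other reasons, such as noise or thresholding.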
Pitch Doubling and Halving
Figure panels: Halving Error, Doubling Error
Next Class
• Speech Recognition Overview
• Reading: J&M 9.1, 9.2, 5.5