Acoustics of Speech Lecture 4 Spoken Language Processing Prof. Andrew Rosenberg Overview • What is in a speech signal? • Defining cues to phonetic segments.


Acoustics of Speech

Lecture 4 Spoken Language Processing Prof. Andrew Rosenberg

Overview

• What is in a speech signal?

• Defining cues to phonetic segments and intonation.

• Techniques to extract these cues.


Phone Recognition

• Goal: Distinguishing one phoneme from another, automatically
  – ASR: Did the caller say “I want to fly to Newark” or “I want to fly to New York”?
  – Forensic Linguistics: Did that person say “Kill him” or “Bill him”?
• What evidence is available in the speech signal?
  – How accurately and reliably can we extract it?
  – What qualities make this difficult? Easy?

Prosody and Intonation

• How things are said is sometimes critical and often useful for understanding
  – Forensic Linguistics: “Kill him!” vs. “Kill him?”
  – TTS: “Travelling from Boston?” vs. “Travelling from Boston.”
• What information do we need to extract from/generate in the speech signal?
• What tools do we have to do this?

Speech Features

• What cues are important?

  – Spectral Features
  – Fundamental Frequency (pitch)
  – Amplitude/energy (loudness)
  – Timing (pauses, rate)
  – Voice Quality
• How do we extract these?
  – Digital Signal Processing
  – Tools and Algorithms
    • Praat
    • Wavesurfer
    • Xwaves

Sound Production

• Pressure fluctuations in the air caused by a voice, a musical instrument, a car horn, etc.
  – Sound waves propagate through material: air, but also solids, etc.
  – Cause the eardrum (tympanum) to vibrate
  – Auditory system translates this into neural impulses
  – Brain interprets these as sound
• Represent sounds as change in pressure over time

How “loud” are sounds?

Event               Pressure (µPa)   dB
Absolute silence    20               0
Whisper             200              20
Quiet office        2K               40
Conversation        20K              60
Bus                 200K             80
Subway              2M               100
Thunder             20M              120
*Hearing Damage*    200M             140
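The dB column follows from the pressure column if the scale is sound pressure level relative to the 20 µPa threshold of hearing. The slide does not state that formula, so the sketch below assumes it; `spl_db` is a hypothetical helper name.

```python
import math

P0_MICROPASCAL = 20.0  # assumed reference pressure: threshold of hearing, 20 µPa

def spl_db(pressure_upa: float) -> float:
    """Sound pressure level in dB relative to the 20 µPa reference."""
    return 20.0 * math.log10(pressure_upa / P0_MICROPASCAL)

# A few rows of the table, with pressure given in µPa.
for event, pressure in [("Absolute silence", 20),
                        ("Conversation", 20_000),
                        ("Thunder", 20_000_000)]:
    print(f"{event}: {spl_db(pressure):.0f} dB")
```

Each tenfold increase in pressure adds 20 dB, which is why the table steps by 20.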

Voiced Sounds are (mostly) Periodic

• Simple Periodic Waves (sine waves) defined by
  – Frequency: how often does the pattern repeat per time unit
    • Cycle: one repetition
    • Period: duration of a cycle
    • Frequency: #cycles per time unit (usually second)
  – Frequency in Hertz (Hz): cycles per second or 1 / period
  – E.g. 400 Hz = 1/0.0025 (a cycle has a period of 0.0025 seconds; 400 cycles complete in a second)
• Zero crossing: where the waveform crosses the x axis

Voiced Sounds are (mostly) Periodic

• Simple Periodic Waves (sine waves) defined by
  – Amplitude: peak deviation of pressure from normal atmospheric pressure
  – Phase: timing of a waveform relative to a reference point
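Frequency, amplitude and phase are enough to generate a sine wave. As a rough illustration, the sketch below samples the 400 Hz example and recovers its frequency by counting zero crossings. The 16 kHz sample rate, the 0.3 rad phase and the function names are assumptions made for the example.

```python
import math

def sine_wave(freq_hz, amplitude, phase_rad, sample_rate, duration_s):
    """Sample amplitude * sin(2*pi*f*t + phase) at a fixed sample rate."""
    n = int(round(sample_rate * duration_s))
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate + phase_rad)
            for i in range(n)]

def count_zero_crossings(samples):
    """Count sign changes between consecutive samples."""
    return sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))

SAMPLE_RATE = 16000  # assumed sampling rate in Hz
DURATION_S = 1.0

wave = sine_wave(freq_hz=400, amplitude=1.0, phase_rad=0.3,
                 sample_rate=SAMPLE_RATE, duration_s=DURATION_S)

period_s = 1 / 400                           # duration of one cycle
crossings = count_zero_crossings(wave)       # a sine crosses zero twice per cycle
estimated_hz = crossings / 2 / DURATION_S
print(period_s, crossings, estimated_hz)
```

Changing `phase_rad` shifts the waveform in time but leaves the crossing count, and so the frequency estimate, essentially unchanged.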

Phase Differences


Complex Periodic Waves

• Cyclic but composed of multiple sine waves: a fundamental and its harmonics
• Fundamental Frequency (F0): rate at which the largest pattern repeats
  – Also GCD of component frequencies
• Harmonics: rate of shorter patterns
• Any complex waveform can be analyzed into its component sine waves with their frequencies, amplitudes and phases (Fourier theorem – in 2 lectures)
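The "GCD of component frequencies" remark can be checked numerically. The sketch below sums three sine waves, takes the GCD of their frequencies as F0, and confirms that the sum repeats after one F0 period. The component frequencies and the 8 kHz sample rate are made up for illustration.

```python
import math
from functools import reduce

COMPONENTS_HZ = [200, 300, 500]  # hypothetical component frequencies
SAMPLE_RATE = 8000               # assumed sampling rate in Hz

def fundamental_hz(component_freqs_hz):
    """F0 of a sum of sines with integer frequencies: their greatest common divisor."""
    return reduce(math.gcd, component_freqs_hz)

def complex_wave(i):
    """Value of the summed waveform at sample index i."""
    t = i / SAMPLE_RATE
    return sum(math.sin(2 * math.pi * f * t) for f in COMPONENTS_HZ)

f0 = fundamental_hz(COMPONENTS_HZ)
period_samples = SAMPLE_RATE // f0

# The waveform one F0 period later should match the waveform now.
max_diff = max(abs(complex_wave(i) - complex_wave(i + period_samples))
               for i in range(200))
print(f0, period_samples, max_diff)
```

Note that none of the three components is at F0 itself. The largest repeating pattern is still 1/F0 long.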

2 sine waves -> 1 complex wave

4 sine waves -> 1 complex wave


Power Spectra and Spectrograms

• Frequency components of a complex waveform are represented in the power spectrum.
  – Plots frequency and amplitude of each component sine wave
• Adding temporal dimension -> Spectrogram
• Obtained via Fast Fourier Transform (FFT), Linear Predictive Coding (LPC)…
  – Useful for analysis, coding and synthesis.
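To make the spectrum idea concrete, here is a minimal sketch that recovers the frequency and amplitude of each component of a two-sine signal. It uses a naive discrete Fourier transform for readability. An FFT computes the same values faster. The signal, sample rate and function name are assumptions for the example.

```python
import cmath
import math

def amplitude_spectrum(samples, sample_rate):
    """Return (frequency_hz, amplitude) pairs for bins from 0 Hz up to the Nyquist frequency."""
    n = len(samples)
    pairs = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(samples))
        # Bins strictly between 0 Hz and Nyquist hold half of a sine's amplitude
        # (the other half sits at the negative frequency), so double them.
        edge = k == 0 or (n % 2 == 0 and k == n // 2)
        scale = 1.0 if edge else 2.0
        pairs.append((k * sample_rate / n, scale * abs(coeff) / n))
    return pairs

SAMPLE_RATE = 1000  # assumed sampling rate in Hz
N = 100             # analysis window of 100 samples -> 10 Hz per bin

signal = [1.0 * math.sin(2 * math.pi * 100 * i / SAMPLE_RATE)
          + 0.5 * math.sin(2 * math.pi * 200 * i / SAMPLE_RATE)
          for i in range(N)]

spectrum = amplitude_spectrum(signal, SAMPLE_RATE)
peaks = [(f, round(a, 3)) for f, a in spectrum if a > 0.1]
print(peaks)
```

Repeating this over successive short windows and stacking the results over time is what yields a spectrogram.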

Example Power spectrum

Australian male /i:/ from “heed”, FFT analysis window 12.8ms

http://clas.mq.edu.au/acoustics/speech_spectra/fft_lpc_settings.html

Example Spectrogram


Terms

• Spectral Slice: plots the amplitude at each frequency
• Spectrograms: plots amplitude and frequency over time
• Harmonics: components of a complex waveform that are multiples of the fundamental frequency (F0)
• Formants: frequency bands that are most amplified in speech.

Aperiodic Waveforms

• Waveforms with random or non-repeating patterns
  – Random aperiodic waveforms: white noise
    • Flat spectrum: equal amplitude for all frequency components.
  – Transients: sudden bursts of pressure (clicks, pops, lip smacks, door slams, etc.)
    • Flat spectrum at a single impulse
  – Voiceless consonants

Speech Waveforms

• Lungs plus vocal fold vibration is filtered by resonance of the vocal tract to produce complex, periodic waveforms.
  – Pitch range, mean, max: cycles per sec of lowest frequency periodic component of a signal = “Fundamental frequency (F0)”
  – Loudness
    • RMS amplitude
    • Intensity: in dB relative to P0, where P0 is a reference pressure
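The formulas for the two loudness measures did not survive in this transcript. The sketch below therefore uses common definitions as an assumption, not necessarily the ones on the slide: RMS amplitude is the square root of the mean squared sample, and intensity is 20·log10(RMS / P0). The reference value `P0 = 1.0` and the test signals are placeholders.

```python
import math

def rms_amplitude(samples):
    """Root-mean-square amplitude: sqrt of the mean of the squared samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def intensity_db(samples, p0):
    """Intensity in dB relative to the reference p0 (assumed definition)."""
    return 20.0 * math.log10(rms_amplitude(samples) / p0)

P0 = 1.0  # placeholder reference; real recordings need a calibrated value

# Ten full cycles of a sine wave with peak amplitude 0.1, and a copy twice as large.
quiet = [0.1 * math.sin(2 * math.pi * i / 40) for i in range(400)]
loud = [2 * x for x in quiet]

print(rms_amplitude(quiet), intensity_db(quiet, P0), intensity_db(loud, P0))
```

Under this definition, doubling the amplitude raises the intensity by about 6 dB whatever reference is chosen.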

Collecting speech for analysis?

• Recording conditions
  – A quiet office, a sound booth, an anechoic chamber
• Microphones convert sound into electrical current
  – oscillations of air pressure are converted to oscillations of current
  – Analog devices (e.g. tape recorders) store these as a continuous signal
  – Digital devices (e.g. DAT, computers) convert to a digital signal (digitizing)

Digital Sound Representation

• A microphone is a mechanical eardrum, capable of measuring change in air pressure over time.
• Digital recording converts analog (smoothly continuous) changes in air pressure over time to a digital signal.
• The digital representation:
  – measures the pressure at a fixed time interval (sampling rate)
  – represents pressure as an integer value (bit depth)
• The analog to digital conversion results in a loss of information.

Waveform – “Name”


Analog to Digital Conversion

• “Quantization” or “Discretization”

Analog to Digital Conversion

• Bit depth impact
  – 16bit sound (CD Quality)
  – 8bit sound
• Sampling rate impact
  – 44.1kHz
  – 16kHz
  – 8kHz
  – 4kHz
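The effect of bit depth can be seen by rounding a waveform to the integer levels available and measuring the worst-case rounding error. This is a simplified sketch that assumes samples scaled to [-1, 1] and a symmetric signed-integer range. The test tone and function names are illustrative.

```python
import math

def quantize(x, bits):
    """Round x in [-1, 1] to the nearest level of a signed integer with `bits` bits."""
    max_int = 2 ** (bits - 1) - 1      # e.g. 127 for 8 bits, 32767 for 16 bits
    return round(x * max_int) / max_int

def max_quantization_error(bits, sample_rate=8000, freq_hz=440.0, n=800):
    """Largest absolute rounding error over n samples of a test tone."""
    wave = [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]
    return max(abs(x - quantize(x, bits)) for x in wave)

err8 = max_quantization_error(8)
err16 = max_quantization_error(16)
print(f"8-bit max error:  {err8:.6f}")
print(f"16-bit max error: {err16:.6f}")
```

The error is bounded by half a quantization step, so each extra bit roughly halves it.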

Nyquist Rate

• At least 2 samples per cycle are necessary to capture the periodicity of a waveform at a given frequency
  – 100Hz needs 200 samples per sec
• Nyquist Frequency (often loosely called the Nyquist Rate)
  – Highest frequency that can be captured with a given sampling rate: half the sampling rate
  – 8kHz sampling rate (Telephone speech) can capture frequencies up to 4kHz
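What happens above the Nyquist frequency can be shown with a small aliasing sketch: a 5 kHz tone sampled at 8 kHz produces exactly the same samples as a 3 kHz tone. The helper names are hypothetical.

```python
import math

SAMPLE_RATE = 8000  # telephone-style sampling rate in Hz

def nyquist_frequency(sample_rate_hz):
    """Highest frequency representable at this sampling rate."""
    return sample_rate_hz / 2

def apparent_frequency(freq_hz, sample_rate_hz):
    """Frequency a sampled pure tone appears to have, folded into 0..Nyquist."""
    folded = freq_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)

def sampled_cosine(freq_hz, n=16):
    """First n samples of a cosine at freq_hz."""
    return [math.cos(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]

tone_5k = sampled_cosine(5000)
tone_3k = sampled_cosine(3000)
largest_gap = max(abs(a - b) for a, b in zip(tone_5k, tone_3k))

print(nyquist_frequency(SAMPLE_RATE), apparent_frequency(5000, SAMPLE_RATE), largest_gap)
```

Because the two sample sequences match, nothing applied after sampling can tell the tones apart. This is why energy above the Nyquist frequency is normally filtered out before digitizing.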

Sampling/storage trade off

• Human hearing: ~20kHz top frequency
  – Should we store 40kHz samples?
• Telephone speech 300Hz-4kHz (8kHz sampling)
  – But some speech sounds (e.g., fricatives, stops) have energy above 4kHz
  – Peter, Teeter, Dieter
• 44kHz (CD quality) vs. 16-22kHz
  – Usually good enough to study speech, amplitude, duration, pitch, etc.
• Golden Ears.
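The storage side of the trade-off is simple arithmetic: uncompressed audio costs sample rate × bit depth × channels. A small sketch, assuming mono recordings and a hypothetical helper name:

```python
def bytes_per_second(sample_rate_hz, bit_depth, channels=1):
    """Uncompressed storage cost of one second of audio, in bytes."""
    return sample_rate_hz * bit_depth * channels // 8

# Settings mentioned on the slides, assuming one channel.
settings = [("CD quality, 44.1kHz / 16bit", 44100, 16),
            ("Speech research, 16kHz / 16bit", 16000, 16),
            ("Telephone, 8kHz / 8bit", 8000, 8)]

for name, rate, depth in settings:
    per_minute = 60 * bytes_per_second(rate, depth)
    print(f"{name}: {per_minute / 1e6:.2f} MB per minute")
```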

Filtering

• Acoustic filters block out certain frequencies of sounds
  – Low-pass filter blocks high frequency components
  – High-pass filter blocks low frequencies
  – Band-pass filter blocks both high and low, around a band
  – Reject band (what to block) vs. pass band (what to let through)
• What if the frequencies of two sounds overlap?
  – Source Separation
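As one simple illustration of a low-pass filter, a moving average smooths out fast oscillations while leaving slow ones nearly untouched. This is only a sketch. The slides do not name a filter design, and the window width, tone frequencies and function names are assumptions.

```python
import math

SAMPLE_RATE = 8000  # assumed sampling rate in Hz

def tone(freq_hz, n=800):
    """n samples of a unit-amplitude sine at freq_hz."""
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]

def moving_average(samples, width):
    """Crude low-pass filter: replace each sample by the mean of a short window."""
    return [sum(samples[i:i + width]) / width
            for i in range(len(samples) - width + 1)]

def rms(samples):
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def gain(freq_hz, width=8):
    """Ratio of output level to input level for a pure tone."""
    original = tone(freq_hz)
    return rms(moving_average(original, width)) / rms(original)

low_gain = gain(100)    # inside the pass band
high_gain = gain(2500)  # inside the reject band
print(f"100 Hz gain: {low_gain:.2f}, 2500 Hz gain: {high_gain:.2f}")
```

Subtracting the low-passed signal from the original would give a crude high-pass filter, and chaining the two ideas gives a band-pass.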

Estimating pitch

• Pitch Tracking: Estimate F0 over time as a function of vocal fold vibration
• How? Autocorrelation approach
  – A periodic waveform is correlated with itself, since one period looks like another
  – Find the period by finding the “lag” (offset) between two windows of the signal where the correlation of the windows is highest
  – Lag duration, T, is one period of the waveform
  – F0 is the inverse: 1/T
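The autocorrelation steps above can be sketched in a few lines. This is a bare-bones version for a single analysis frame, not a production pitch tracker such as the one in Praat. The 75-500 Hz search range, the synthetic test signal and the function name are assumptions.

```python
import math

def estimate_f0(samples, sample_rate, min_f0=75.0, max_f0=500.0):
    """Estimate F0 of one frame as sample_rate / (lag with the highest autocorrelation)."""
    min_lag = int(sample_rate / max_f0)   # shortest period considered
    max_lag = int(sample_rate / min_f0)   # longest period considered
    best_lag, best_corr = None, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Correlate the frame with a copy of itself shifted by `lag` samples.
        corr = sum(samples[i] * samples[i + lag]
                   for i in range(len(samples) - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return sample_rate / best_lag

SAMPLE_RATE = 8000  # assumed sampling rate in Hz

# Synthetic "voiced" frame: a 200 Hz fundamental plus a weaker harmonic at 400 Hz.
frame = [math.sin(2 * math.pi * 200 * i / SAMPLE_RATE)
         + 0.5 * math.sin(2 * math.pi * 400 * i / SAMPLE_RATE)
         for i in range(400)]

f0 = estimate_f0(frame, SAMPLE_RATE)
print(f0)
```

Longer lags are summed over fewer sample pairs here, which gives the shortest matching lag a slight edge over its multiples. Real speech is noisier than this test frame, so that built-in preference is not always enough.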

Pitch Issues

• Microprosody effects of consonants (e.g. /v/)
• Creaky voice -> no pitch track, or noisy estimate
• Errors to watch for:
  – Halving: shortest lag calculated is too long, by one or more cycles.
    • Since the estimated lag is too long, the pitch is too low (underestimation of pitch)
  – Doubling: shortest lag is too short. Second half of the cycle is similar to the first
    • Estimates a short lag, counts too many cycles per second (overestimation of pitch)
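The arithmetic behind the two error types is short enough to spell out. The sketch assumes a true period of 40 samples at an 8 kHz sample rate. Both values are made up for illustration.

```python
SAMPLE_RATE = 8000     # assumed sampling rate in Hz
TRUE_LAG = 40          # true period in samples (200 Hz)

def f0_from_lag(lag_samples, sample_rate=SAMPLE_RATE):
    """F0 is the inverse of the period: sample_rate / lag."""
    return sample_rate / lag_samples

correct = f0_from_lag(TRUE_LAG)        # lag matches one cycle
halved = f0_from_lag(2 * TRUE_LAG)     # lag spans two cycles -> pitch halving
doubled = f0_from_lag(TRUE_LAG // 2)   # lag spans half a cycle -> pitch doubling
print(correct, halved, doubled)
```

Sudden jumps by a factor of two in a pitch track are therefore a strong hint of a lag error and not a real change in the voice.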

Pitch Doubling and Halving

[Figure panels: Halving Error, Doubling Error]

Next Class

• Speech Recognition Overview
• Reading: J&M 9.1, 9.2, 5.5