
Speech & Audio Coding
TSBK01 Image Coding and Data Compression
Lecture 11, 2003
Jörgen Ahlberg
Outline
• Part I - Speech
– Speech
– History of speech synthesis & coding
– Speech coding methods
• Part II – Audio
– Psychoacoustic models
– MPEG-4 Audio
Speech Production
• The human vocal apparatus consists of:
  – lungs
  – trachea (windpipe)
  – larynx, which contains two folds of skin called vocal cords that blow apart and flap together as air is forced through
  – oral tract
  – nasal tract
The Speech Signal
Elements of the speech signal:
• spectral resonance (formants, moving)
• periodic excitation (voicing, pitched) + pitch contour
• noise excitation (fricatives, unvoiced, no pitch)
• transients (stop-release bursts)
• amplitude modulation (nasals, approximants)
• timing
The Speech Signal
Vowels are characterised by formants and are generally voiced; the tongue and lips
shape them (e.g. the effect of rounding). Examples of vowels: a, e, i, o, u, ah, oh.
Vibration of the vocal cords: male 50 – 250 Hz, female up to 500 Hz. Vowels have
on average a much longer duration than consonants, and most of the acoustic
energy of a speech signal is carried by vowels.
F1–F2 chart (formant positions)
History of Speech Coding
• 1926 - PCM - first conceived by Paul M. Rainey and independently by
Alex Reeves (AT&T Paris) in 1937. Deployed in US PSTN in 1962
• 1939 - Channel vocoder - first analysis-by-synthesis system developed
by Homer Dudley of AT&T labs - VODER
VODER – the architecture
OVE formant synthesis (Gunnar Fant, KTH), 1953
History of Speech Coding
• 1952 - delta modulation proposed, differential PCM invented
• 1957 - µ-law encoding proposed (standardised for the telephone network
in 1972 (G.711))
• 1974 - ADPCM developed
• 1984 - CELP vocoder proposed (majority of coding standards for speech
signal today use a variation on CELP)
Source-filter Model of Speech Production
• A signal from a source is filtered by a time-varying filter with resonant properties
  similar to those of the vocal tract.
• The gain controls A_V and A_N determine the intensity of the voiced and unvoiced
  excitation.
• The frequencies of the higher formants are attenuated by −12 dB/octave (due to
  the nature of our speech organs).
• This is an oversimplified model of speech production, but it is very often
  adequate for understanding the basic principles.
Speech Coding Strategies
1. PCM
• Invented 1926, deployed 1962.
• The speech signal is sampled at 8 kHz.
• Uniform quantization requires >10 bits/sample.
• Non-uniform quantization (G.711, 1972): the signal is companded by a
  logarithmic curve y = f(x) before uniform quantization.
• Quantizing y to 8 bits ⇒ 64 kbit/s.
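The idea behind the non-uniform quantization can be sketched with the continuous µ-law companding curve (µ = 255, as in the North American variant of G.711). Note this is a simplified sketch: the deployed G.711 codec uses a piecewise-linear approximation of this curve, and the helper names here are illustrative.

```python
import math

MU = 255  # mu-law constant used in the North American G.711 variant

def mu_law_compress(x, mu=MU):
    """Compand a sample x in [-1, 1]: small amplitudes get finer resolution."""
    return math.copysign(math.log(1 + mu * abs(x)) / math.log(1 + mu), x)

def mu_law_expand(y, mu=MU):
    """Inverse of the companding curve."""
    return math.copysign(((1 + mu) ** abs(y) - 1) / mu, y)

def quantize(y, bits=8):
    """Uniform quantization of y in [-1, 1] to 2**bits levels."""
    step = 2.0 / (2 ** bits)
    return round(y / step) * step

# 8 kHz sampling x 8 bits/sample = 64 kbit/s (the G.711 rate).
x = 0.01                                   # a quiet sample
x_hat = mu_law_expand(quantize(mu_law_compress(x)))
```

For the quiet sample, companding before quantizing gives a much smaller error than quantizing uniformly, which is exactly why non-uniform quantization gets away with 8 bits where uniform needs more than 10.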
Speech Coding Strategies
2. Adaptive DPCM
• Example: G.726 (1974)
• Adaptive predictor based on six previous
differences.
• Gain-adaptive quantizer with 15 levels ⇒ 32 kbit/s.
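As a sketch of the DPCM principle (G.726 itself uses an adaptive pole-zero predictor over the six previous differences and a gain-adaptive quantizer; this toy version uses a fixed first-order predictor and a fixed step size, both illustrative values):

```python
def dpcm(samples, a=0.9, step=0.05):
    """Toy DPCM: predict each sample from the previous reconstruction,
    quantize only the prediction error (which has a smaller range than
    the signal itself), and reconstruct like the decoder would."""
    pred = 0.0
    codes, recon = [], []
    for x in samples:
        e = x - pred                 # prediction error
        q = round(e / step)          # integer code that gets transmitted
        y = pred + q * step          # decoder-side reconstruction
        codes.append(q)
        recon.append(y)
        pred = a * y                 # next prediction, from data the decoder has
    return codes, recon

codes, recon = dpcm([0.0, 0.1, 0.2, 0.15, 0.1])
```

Because the encoder predicts from the *reconstructed* signal, encoder and decoder stay in sync, and the per-sample error is bounded by half the quantizer step.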
Speech Coding Strategies
3. Model-based Speech Coding
• Advanced speech coders are based on
models of how speech is produced:
Excitation
source
Vocal
tract
An Excitation Source
Noise
generator
Pitch
Pulse
generator
Vocal Tract Filter 1: A Fixed Filter Bank
• A bank of bandpass (BP) filters, each with its own gain g1, g2, …, gn.
Vocal Tract Filter 2: A Controllable Filter
Linear Predictive Coding (LPC)
• The controllable filter is modelled as
      y_n = Σ_i a_i · y_(n-i) + G · ε_n
  where ε_n is the input (excitation) signal and y_n is the output.
• We need to estimate the vocal tract parameters (a_i
  and G) and the excitation parameters (pitch, v/uv).
• Typically the source signal is divided into short
  segments and the parameters are estimated for each
  segment.
• Example: the speech signal is sampled at 8 kHz and
  divided into segments of 180 samples (22.5
  ms/segment).
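A minimal sketch of this all-pole synthesis model, with made-up filter coefficients and pitch value purely for illustration:

```python
def lpc_synthesize(a, excitation, gain=1.0):
    """All-pole synthesis: y[n] = sum_i a[i] * y[n-1-i] + gain * e[n]."""
    y = []
    for n, e in enumerate(excitation):
        acc = gain * e
        for i, ai in enumerate(a):
            if n - 1 - i >= 0:
                acc += ai * y[n - 1 - i]
        y.append(acc)
    return y

# Voiced excitation: an impulse train at the pitch period (here 80 samples,
# i.e. 100 Hz at 8 kHz sampling -- an arbitrary example value).
period = 80
excitation = [1.0 if n % period == 0 else 0.0 for n in range(240)]
speech = lpc_synthesize([1.3, -0.8], excitation)  # toy 2-tap resonant filter
```

Swapping the impulse train for white noise gives the unvoiced branch of the model with the same filter.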
Typical Scheme of an LPC Coder
• The v/uv decision switches between a noise generator and a pulse
  generator (driven by the pitch); the selected excitation is scaled by the
  gain and fed to the vocal tract filter, which is controlled by the filter
  coeffs.
Estimating the Parameters
• v/uv estimation
  – Based on energy and frequency spectrum.
• Pitch-period estimation
  – Look for periodicity, either via the acf or some other measure, for
    example
        D(p) = Σ_n | y_n − y_(n-p) |,
    which takes its minimum value when p equals the pitch period.
  – Typical pitch periods: 20 – 160 samples.
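One common choice of such a measure is the average magnitude difference function (AMDF), D(p) = Σ |y_n − y_(n−p)|, which can be sketched as a brute-force search over candidate periods. The test signal here is a synthetic pulse train, not real speech:

```python
def amdf_pitch(y, p_min=20, p_max=160):
    """Find the candidate period p minimizing D(p) = sum |y[n] - y[n-p]|."""
    best_p, best_d = p_min, float("inf")
    for p in range(p_min, p_max + 1):
        d = sum(abs(y[n] - y[n - p]) for n in range(p, len(y)))
        d /= len(y) - p                 # normalize by the number of terms
        if d < best_d:
            best_p, best_d = p, d
    return best_p

# A voiced-like toy signal: impulse train with period 40 samples
# (200 Hz at 8 kHz sampling).
pulses = [1.0 if n % 40 == 0 else 0.0 for n in range(400)]
```

On real speech a little care is needed to avoid picking a multiple of the true period; practical coders add tricks for that.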
Estimating the Parameters
• Vocal tract filter estimation
  – Find the filter coefficients a_i that minimize the squared error
        ε² = Σ_n ( y_n − Σ_i a_i · y_(n-i) )²
    (the excitation term is unknown and is left out of the minimization).
  – Compare to the computation of optimal predictors
    (Lecture 7).
Estimating the Parameters
• Assuming a stationary signal, the minimizing coefficients are given by
  the normal equations
      R a = p,
  where R and p contain acf values.
• This is called the autocorrelation method.
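Because R is a Toeplitz matrix of acf values, the normal equations can be solved in O(p²) operations with the Levinson-Durbin recursion. A sketch (as a by-product, the intermediate k-values are the PARCOR reflection coefficients):

```python
def levinson_durbin(r, order):
    """Solve the Toeplitz normal equations R a = p, given autocorrelation
    values r[0..order], for the LPC coefficients a[1..order]."""
    a = [0.0] * order
    err = r[0]                          # prediction error power
    for i in range(order):
        acc = r[i + 1] - sum(a[j] * r[i - j] for j in range(i))
        k = acc / err                   # reflection (PARCOR) coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(i):
            new_a[j] = a[j] - k * a[i - 1 - j]
        a = new_a
        err *= 1.0 - k * k              # error shrinks at each order
    return a

# The autocorrelation of an ideal AR(1) process y[n] = 0.5*y[n-1] + e[n]
# is proportional to 0.5**lag, so the recursion should recover a = [0.5, 0]:
coeffs = levinson_durbin([1.0, 0.5, 0.25], 2)
```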
Estimating the Parameters
• Alternatively, in the case of a non-stationary signal:
      Φ a = ψ,
  where the entries φ(i,k) = Σ_n y_(n-i) · y_(n-k) are covariances
  computed over the segment.
• This is called the autocovariance method.
Example
• Coding of parameters using LPC10 (1984), per 180-sample frame:

    v/uv               1 bit
    Pitch              6 bits
    Voiced filter     46 bits
    Unvoiced filter   46 bits
    Synchronization    1 bit

  Sum: 54 bits per frame (only one of the two filters is sent) ⇒ 2.4 kbit/s
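The frame arithmetic can be verified directly:

```python
frame_bits = 1 + 6 + 46 + 1      # v/uv + pitch + one filter + sync
frame_s = 180 / 8000             # 180 samples at 8 kHz = 22.5 ms per frame
bitrate = frame_bits / frame_s   # bits per second
```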
The Vocal Tract Filter
• Different representations:
– LPC parameters
– PARCOR (Partial Correlation Coefficients)
– LSF (Line Spectrum Frequencies)
Code Excited Linear Prediction Coding (CELP)
Encoding:
• LPC analysis ⇒ V(z)
• Define perceptual weighting filter. This permits more noise at formant
  frequencies, where it will be masked by the speech.
• Synthesise speech using each codebook entry in turn as the input to V(z).
• Calculate the optimum gain to minimise the perceptually weighted error
  energy in the speech frame.
• Select the codebook entry that gives the lowest error.
• Transmit the LPC parameters and the codebook index.
Decoding:
• Receive the LPC parameters and the codebook index.
• Re-synthesise speech using V(z) and the codebook entry.
Performance:
• 16 kbit/s: MOS = 4.2, delay = 1.5 ms, 19 MIPS
• 8 kbit/s: MOS = 4.1, delay = 35 ms, 25 MIPS
• 2.4 kbit/s: MOS = 3.3, delay = 45 ms, 20 MIPS
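The analysis-by-synthesis codebook search can be sketched as follows; the synthesis filter V(z) is passed in as a function, the perceptual weighting is omitted for brevity, and all names and demo values are illustrative:

```python
def celp_search(target, codebook, synth):
    """Try every codebook entry through the synthesis filter, compute the
    optimal gain in closed form, and keep the entry with the lowest
    error energy against the target speech frame."""
    best = (None, 0.0, float("inf"))           # (index, gain, error)
    for idx, code in enumerate(codebook):
        y = synth(code)                        # candidate excitation through V(z)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue
        g = sum(t * v for t, v in zip(target, y)) / energy  # optimal gain
        err = sum((t - g * v) ** 2 for t, v in zip(target, y))
        if err < best[2]:
            best = (idx, g, err)
    return best                                # transmit idx and g

# Toy demo: identity "filter" and a 3-entry codebook.
idx, gain, err = celp_search([2.0, 0.0],
                             [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                             lambda c: c)
```

The closed-form gain is just the least-squares projection of the target onto the filtered candidate, which is why only the index and the gain need to be transmitted.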
Examples
• G.728
  – V(z) is chosen as a large FIR filter (M ≈ 50).
  – The gain and FIR parameters are estimated recursively
    from previously received samples.
  – The code book contains 127 sequences.
• GSM
  – The code book contains regular pulse trains with variable
    frequency and amplitudes.
• MELP
  – Mixed excitation linear prediction.
  – The code book is combined with a noise generator.
Other Variations
• SELP – Self Excited Linear Prediction
• MPLP – Multi-Pulse Excited Linear Prediction
• MBE – Multi-Band Excitation Coding
Quality Levels

    Quality level            Bandwidth       Bitrate
    Broadcast quality        10 kHz          >64 kbit/s
    Network (toll) quality   300 – 3400 Hz   16 – 64 kbit/s
    Communication quality                    4 – 16 kbit/s
    Synthetic quality                        <4 kbit/s
Subjective Assessment
• In digital communications speech quality is classified into four general
  categories: broadcast, network (toll), communications, and synthetic.
• Broadcast wideband speech – high quality ”commentary” speech – is generally
  achieved at rates above 64 kbit/s.
• MOS (Mean Opinion Score): the result of averaging opinion scores from a set of
  20 to 60 untrained subjects.
• They rate the quality from 1 to 5 (1-bad, 2-poor, 3-fair, 4-good, 5-excellent).
• MOS of 4 or higher defines good or toll (network) quality
  – the reconstructed signal is generally indistinguishable from the original.
• MOS between 3.5 and 4.0 defines communication quality – telephone
  communications.
• MOS between 2.5 and 3.5 implies synthetic quality.
Subjective Assessment
• DRT (Diagnostic Rhyme Test): listeners must recognise one of the two
  possible words in a set of rhyming pairs (e.g. meat/heat).
• DAM (Diagnostic Acceptability Measure): trained listeners
  judge various factors, e.g. muffledness, buzziness, intelligibility.
Quality versus data rate (8 kHz sampling rate)