Speech & Audio Coding
TSBK01 Image Coding and Data Compression, Lecture 11, 2003
Jörgen Ahlberg

Outline
• Part I – Speech
  – Speech
  – History of speech synthesis & coding
  – Speech coding methods
• Part II – Audio
  – Psychoacoustic models
  – MPEG-4 Audio

Speech Production
• The human vocal apparatus consists of:
  – lungs
  – trachea (windpipe)
  – larynx
    • contains two folds of skin called vocal cords, which blow apart and flap together as air is forced through
  – oral tract
  – nasal tract

The Speech Signal
Elements of the speech signal:
• spectral resonance (formants, moving)
• periodic excitation (voicing, pitched) + pitch contour
• noise excitation (fricatives, unvoiced, no pitch)
• transients (stop-release bursts)
• amplitude modulation (nasals, approximants)
• timing

The Speech Signal
• Vowels are characterised by formants and are generally voiced.
• Tongue and lips: effect of rounding.
• Examples of vowels: a, e, i, o, u, ah, oh.
• Vibration of vocal cords: male 50 – 250 Hz, female up to 500 Hz.
• Vowels have on average a much longer duration than consonants.
• Most of the acoustic energy of a speech signal is carried by vowels.

[Figure: F1-F2 chart]
[Figure: Formant positions]
[Figure: VODER – the architecture]
[Figure: OVE formant synthesis (Gunnar Fant, KTH), 1953]

History of Speech Coding
• 1926 - PCM - first conceived by Paul M. Rainey and independently by Alec Reeves (ITT Paris) in 1937.
  Deployed in the US PSTN in 1962.
• 1939 - Channel vocoder - first analysis-synthesis system, developed by Homer Dudley of AT&T labs - VODER
• 1952 - delta modulation proposed, differential PCM invented
• 1957 - μ-law encoding proposed (standardised for the telephone network in 1972 (G.711))
• 1974 - ADPCM developed
• 1984 - CELP vocoder proposed (the majority of coding standards for speech signals today use a variation on CELP)

Source-filter Model of Speech Production
• The signal from a source is filtered by a time-varying filter with resonant properties similar to those of the vocal tract.
• The gain controls A_v and A_N determine the intensity of the voiced and unvoiced excitation.
• The higher formants are attenuated by 12 dB/octave (due to the nature of our speech organs).
• This is an oversimplified model of speech production. However, it is very often adequate for understanding the basic principles.

Speech Coding Strategies
1. PCM
• Invented 1926, deployed 1962.
• The speech signal is sampled at 8 kHz.
• Uniform quantization requires >10 bits/sample.
• Non-uniform quantization (G.711, 1972).
• Quantizing the companded signal y to 8 bits -> 64 kbit/s.

Speech Coding Strategies
2. Adaptive DPCM
• Example: G.726 (1974)
• Adaptive predictor based on six previous differences.
• Gain-adaptive quantizer with 15 levels -> 32 kbit/s.

Speech Coding Strategies
3. Model-based Speech Coding
• Advanced speech coders are based on models of how speech is produced: an excitation source followed by a vocal tract filter.

An Excitation Source
[Figure: noise generator and pitch-controlled pulse generator]

Vocal Tract Filter 1: A Fixed Filter Bank
[Figure: bank of band-pass filters (BP) with gains g1, g2, ..., gn]

Vocal Tract Filter 2: A Controllable Filter

Linear Predictive Coding (LPC)
• The controllable filter is modelled as
  $y_n = \sum_{i=1}^{M} a_i y_{n-i} + G \epsilon_n$
  where $\epsilon_n$ is the input signal and $y_n$ is the output.
• We need to estimate the vocal tract parameters ($a_i$ and $G$) and the excitation parameters (pitch, v/uv).
• Typically the source signal is divided into short segments and the parameters are estimated for each segment.
• Example: The speech signal is sampled at 8 kHz and divided into segments of 180 samples (22.5 ms/segment).

Typical Scheme of an LPC Coder
[Figure: a noise generator and a pulse generator (controlled by the pitch) are selected by the v/uv decision, scaled by the gain, and fed to the vocal tract filter (controlled by the filter coefficients)]

Estimating the Parameters
• v/uv estimation
  – Based on energy and frequency spectrum.
• Pitch-period estimation
  – Look for periodicity, either via the autocorrelation function (acf) or some other measure, for example the average magnitude difference $\frac{1}{N} \sum_n |y_n - y_{n-p}|$, which gives a minimum value when p equals the pitch period.
  – Typical pitch periods: 20 - 160 samples.

Estimating the Parameters
• Vocal tract filter estimation
  – Find the filter coefficients that minimize the energy of the prediction error,
    $\sum_n e_n^2 = \sum_n \left( y_n - \sum_{i=1}^{M} a_i y_{n-i} \right)^2$
    (under the LPC model, the residual $e_n$ equals $G \epsilon_n$).
  – Compare to the computation of optimal predictors (Lecture 7).

Estimating the Parameters
• Assuming a stationary signal:
  $R a = p$
  where R and p contain acf values: $R_{ij} = R_{yy}(|i-j|)$ and $p_i = R_{yy}(i)$.
• This is called the autocorrelation method.

Estimating the Parameters
• Alternatively, in case of a non-stationary signal:
  $C a = c$
  where $C = [c_{ij}]$, $c = [c_{i0}]$ and $c_{ij} = \sum_n y_{n-i} y_{n-j}$, summed over the samples of the current segment.
• This is called the autocovariance method.

Example
• Coding of parameters using LPC10 (1984):

  Parameter          Bits
  v/uv               1
  Pitch              6
  Voiced filter      46
  Unvoiced filter    46
  Synchronization    1
  Sum                54 -> 2.4 kbit/s

  (A frame uses either the voiced or the unvoiced filter, so the sum counts 46 bits once: 54 bits per 22.5 ms segment.)

The Vocal Tract Filter
• Different representations:
  – LPC parameters
  – PARCOR (Partial Correlation Coefficients)
  – LSF (Line Spectrum Frequencies)

Code Excited Linear Prediction Coding (CELP)
Encoding:
• LPC analysis -> V(z)
• Define perceptual weighting filter.
  This permits more noise at formant frequencies, where it will be masked by the speech.
• Synthesise speech using each codebook entry in turn as the input to V(z).
• Calculate the optimum gain to minimise the perceptually weighted error energy in the speech frame.
• Select the codebook entry that gives the lowest error.
• Transmit LPC parameters and codebook index.
Decoding:
• Receive LPC parameters and codebook index.
• Re-synthesise speech using V(z) and the codebook entry.
Performance:
• 16 kbit/s: MOS=4.2, Delay=1.5 ms, 19 MIPS
• 8 kbit/s: MOS=4.1, Delay=35 ms, 25 MIPS
• 2.4 kbit/s: MOS=3.3, Delay=45 ms, 20 MIPS

Examples
• G.728
  – V(z) is chosen as a large FIR filter (M ≈ 50).
  – The gain and FIR parameters are estimated recursively from previously received samples.
  – The code book contains 127 sequences.
• GSM
  – The code book contains regular pulse trains with variable frequency and amplitudes.
• MELP
  – Mixed excitation linear prediction
  – The code book is combined with a noise generator.

Other Variations
• SELP – Self Excited Linear Prediction
• MPLP – Multi-Pulse Excited Linear Prediction
• MBE – Multi-Band Excitation Coding

Quality Levels
  Quality level             Bandwidth        Bitrate
  Broadcast quality         10 kHz           >64 kbit/s
  Network (toll) quality    300 – 3400 Hz    16 – 64 kbit/s
  Communication quality                      4 – 16 kbit/s
  Synthetic quality                          <4 kbit/s

Subjective Assessment
• In digital communications speech quality is classified into four general categories, namely: broadcast, network or toll, communications, and synthetic.
• Broadcast wideband speech – high quality "commentary" speech – generally achieved at rates above 64 kbit/s.
• MOS (Mean Opinion Score): the result of averaging opinion scores from a set of 20 – 60 untrained subjects.
• They rate the quality 1 to 5 (1-bad, 2-poor, 3-fair, 4-good, 5-excellent).
• MOS of 4 or higher defines good or toll quality (network quality) - reconstructed signal generally indistinguishable from the original.
• MOS between 3.5 – 4.0 defines communication quality – telephone communications.
• MOS between 2.5 – 3.5 implies synthetic quality.

Subjective Assessment
• DRT (Diagnostic Rhyme Test): listeners should recognise one of the two possible words in a set of rhyming pairs (e.g. meat/heat).
• DAM (Diagnostic Acceptability Measure): trained listeners judge various factors, e.g. muffledness, buzziness, intelligibility.

[Figure: Quality versus data rate (8 kHz sampling rate)]
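
Sketch: non-uniform quantization for PCM

The PCM slide states that non-uniform quantization to 8 bits gives 64 kbit/s at 8 kHz sampling. The following is a minimal sketch of that idea using the continuous μ-law formula with μ = 255. This is an assumption for illustration: the slides do not give the formula, and the actual G.711 standard uses a segmented, piecewise-linear approximation rather than this curve. The function names are hypothetical.

```python
import math

MU = 255  # assumption: the usual mu-law constant; not stated in the slides


def mu_law_compress(x, mu=MU):
    """Map x in [-1, 1] to the companded value y in [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_expand(y, mu=MU):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def quantize_uniform(y, bits=8):
    """Uniform midrise quantizer on [-1, 1] with 2**bits levels."""
    levels = 2 ** bits
    step = 2.0 / levels
    index = min(int((y + 1.0) / step), levels - 1)
    return -1.0 + (index + 0.5) * step


def codec(x, bits=8):
    """Compress, quantize uniformly, then expand again."""
    return mu_law_expand(quantize_uniform(mu_law_compress(x), bits))


SAMPLE_RATE_HZ = 8000
BITS_PER_SAMPLE = 8
bitrate = SAMPLE_RATE_HZ * BITS_PER_SAMPLE  # 64000 bit/s = 64 kbit/s

quiet = 0.01  # a low-amplitude sample
err_companded = abs(codec(quiet) - quiet)
err_uniform = abs(quantize_uniform(quiet) - quiet)
print(bitrate, err_companded, err_uniform)
```

For the quiet sample the companded path gives a clearly smaller error than plain 8-bit uniform quantization, which is the reason uniform quantization needs more than 10 bits/sample for comparable quality.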
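
Sketch: the autocorrelation method for LPC

The LPC slides estimate the filter coefficients a_i and the gain G per segment from acf values. One way to solve the resulting normal equations is the Levinson-Durbin recursion; the slides do not say which solver is used, so this choice, the biased acf estimate and all function names below are assumptions. The demo segment is a synthetic second-order signal, not speech.

```python
import math
import random


def autocorr(y, max_lag):
    """Biased acf estimates R(0), ..., R(max_lag) of one segment."""
    n = len(y)
    return [sum(y[t] * y[t - k] for t in range(k, n)) / n
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve R a = p for the predictor coefficients.

    Returns (a, err): a[i - 1] multiplies y[n - i], and err is the
    remaining prediction error power. Assumes r[0] > 0 (non-silent segment).
    """
    a = []
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] - sum(a[i] * r[m - 1 - i] for i in range(m - 1))
        k = acc / err  # reflection (PARCOR) coefficient of stage m
        a = [a[i] - k * a[m - 2 - i] for i in range(m - 1)] + [k]
        err *= 1.0 - k * k
    return a, err


def lpc_analyze(segment, order):
    """Estimate (a, G) for y_n = sum_i a_i y_{n-i} + G * eps_n."""
    r = autocorr(segment, order)
    a, err = levinson_durbin(r, order)
    return a, math.sqrt(err)


# Demo: one 180-sample segment from a known second-order model.
rng = random.Random(0)
y = [0.0, 0.0]
for _ in range(180):
    y.append(1.3 * y[-1] - 0.6 * y[-2] + rng.gauss(0.0, 1.0))
a_hat, gain = lpc_analyze(y[2:], order=2)
print(a_hat, gain)
```

With only 180 samples the estimates scatter around the true values (1.3, -0.6); longer segments give tighter estimates but track the changing vocal tract less well. The intermediate values k are the PARCOR coefficients listed among the vocal tract filter representations.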
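
Sketch: pitch-period estimation by a minimum search

The parameter estimation slide describes looking for a measure that reaches a minimum when the lag p equals the pitch period, with typical periods of 20 - 160 samples. The sketch below uses the average magnitude difference function (AMDF) as that measure; this is an assumption, since the transcript does not preserve which measure the slide showed. The test signal is a toy sawtooth, not speech, and the function names are hypothetical.

```python
def amdf(y, p):
    """Average magnitude difference between y and y delayed by p samples."""
    n = len(y)
    return sum(abs(y[t] - y[t - p]) for t in range(p, n)) / (n - p)


def estimate_pitch_period(y, p_min=20, p_max=160):
    """Return the lag in [p_min, p_max] with the smallest AMDF.

    Ties resolve to the smallest lag, because min() keeps the first minimum.
    """
    p_max = min(p_max, len(y) - 1)
    return min(range(p_min, p_max + 1), key=lambda p: amdf(y, p))


SAMPLE_RATE_HZ = 8000
segment = [t % 50 for t in range(180)]  # toy sawtooth, period 50 samples

period = estimate_pitch_period(segment)  # -> 50
pitch_hz = SAMPLE_RATE_HZ / period       # -> 160.0
print(period, pitch_hz)
```

On real speech the minimum is never exactly zero and multiples of the true period give similar values, so a practical coder needs extra logic to avoid pitch doubling; that is outside the scope of this sketch.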