Advancing the Vibrancy of Computing

Speech Coding
EE 516 Spring 2009
Alex Acero
Acknowledgments
• Thanks to Allen Gersho for some slides…
Outline
• Quality vs Bit rate
• Types of speech coders
• Waveform Coding
• Speech production and vocoders
• Analysis by Synthesis
• VoIP
Voice Quality
• Bandwidth is easily quantified
– Voice quality is subjective
• MOS, Mean Opinion Score
– ITU-T Recommendation P.800
• Excellent – 5
• Good – 4
• Fair – 3
• Poor – 2
• Bad – 1
– A minimum of 30 people
– Listen to voice samples or in conversations
Voice Quality
• P.800 recommendation
– The selection of participants
– The test environment
– Explanations to listeners
– Analysis of results
• Toll quality
– A MOS of 4.0 or higher
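As a small illustration of how a MOS figure is obtained, the sketch below averages listener ratings on the 5-to-1 scale and compares the result with the 4.0 toll-quality threshold. The ratings are made up for illustration, and the function names are hypothetical rather than part of P.800.

```python
from statistics import mean

TOLL_QUALITY_MOS = 4.0  # threshold quoted on the slide


def mean_opinion_score(ratings):
    """Average listener ratings on the 5 (Excellent) to 1 (Bad) scale."""
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("each rating must be an integer from 1 to 5")
    return mean(ratings)


def is_toll_quality(ratings):
    """True when the averaged score reaches the toll-quality threshold."""
    return mean_opinion_score(ratings) >= TOLL_QUALITY_MOS


# Hypothetical panel of 30 listeners (invented ratings, illustration only)
ratings = [5] * 8 + [4] * 16 + [3] * 6
mos = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f}, toll quality: {is_toll_quality(ratings)}")
```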
Quality Measurements
• Subjective and objective quality-testing techniques
• PSQM – Perceptual Speech Quality Measurement
– ITU-T P.861
– Aims to faithfully represent human judgement and perception
– Algorithmic comparison between the output signal and a known input
– Factors considered: type of speaker, loudness, delay, active/silence frames, clipping, environmental noise
Evolution of Speech Coder Performance
[Figure: speech quality (Bad, Poor, Fair, Good, Excellent) versus Bit Rate (kb/s), with 1980, 1990 and 2000 profiles; labels: Secure Telephony, Cellular Standards, North American TDMA, ITU Recommendations]
Speech Coding (Telephony)
[Figure label: Ceiling]
• More complicated than Moore’s Law
– Many Dimensions: Bit Rate, Quality, Complexity and Delay
– Quality ceiling (imposed by telephone standards)
• Easy to reach the ceiling at high bit rates (≥ 8 kb/s)
• More room for progress at low bit rates (≤ 8 kb/s)
• Moore’s Law Time Constant
– Bit rates halve every decade (≤ 8 kb/s)
– Relatively slow by Moore’s Law standards (not hyper-inflation)
• Performance doubles every decade
• Like disk seek or money in the bank (normal inflation)
– Limited more by physics than investment
• Potential compression opportunity
– At most 10x: 8 kb/s → 2 kb/s → 1 kb/s (?) → 50 bits per sec (??)
• Speech (2 kb/s) >> text (2 bits/char): 10-1000 times more bits
– Speech coding will not close this gap for foreseeable future
Outline
• Quality vs Bit rate
• Types of speech coders
• Waveform Coding
• Speech production and vocoders
• Analysis by Synthesis
• VoIP
Type of Speech Coders
• Waveform codecs
– Sample and code
– High-quality and not complex
– Large amount of bandwidth
• Source codecs (vocoders)
– Match the incoming signal to a math model
– Linear-predictive filter model of the vocal tract
– A voiced/unvoiced flag for the excitation
– The information is sent rather than the signal
– Low bit rates, but sounds synthetic
– Higher bit rates do not improve much
Type of Speech Coders
• Hybrid codecs
– Attempt to provide the best of both
– Perform a degree of waveform matching
– Utilize the sound production model
– Quite good quality at low bit rate
Outline
• Quality vs Bit rate
• Types of speech coders
• Waveform Coding
• Speech production and vocoders
• Analysis by Synthesis
• VoIP
Waveform coders
• High quality, high bitrate
• Pulse Code Modulation (PCM)
– Sample input waveform
– Quantization
• Differential PCM
– Sample input waveform
– Encode difference between adjacent samples
• Adaptive DPCM
– Adapt step size for quantization based on
speech statistics
Voice Sampling
• A-to-D
– discrete samples of the waveform and represent each sample
by some number of bits
– A signal can be reconstructed if it is sampled at a minimum of
twice the maximum frequency (Nyquist Theorem)
• Human speech
– 300-3800 Hz
– 8000 samples per second
– Each sample is encoded into an 8-bit PCM code word (e.g. 01100101)
=> 8000 x 8 bit/s = 64 kbit/s
Quantization
• How many bits are used to represent each sample
• Quantization noise
– The difference between the actual level of the input analog signal and the quantized level that represents it
• More bits reduce the noise
– Diminishing returns
• Uniform quantization levels
– Louder talkers sound better
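To make the last two points concrete, here is a minimal sketch that quantizes a sine tone with a uniform quantizer and measures the signal-to-noise ratio. The test tone, the mid-rise quantizer and the function names are assumptions chosen for illustration. Under these assumptions each extra bit adds roughly 6 dB, and a quiet talker (small amplitude) gets a much lower SNR than a loud one at the same number of bits.

```python
import math


def uniform_quantize(x, bits, full_scale=1.0):
    """Mid-rise uniform quantizer over [-full_scale, full_scale)."""
    levels = 2 ** bits
    step = 2.0 * full_scale / levels
    index = math.floor(x / step)
    index = max(-levels // 2, min(levels // 2 - 1, index))  # clip to range
    return (index + 0.5) * step


def snr_db(amplitude, bits, n=8000, freq=440.0, fs=8000.0):
    """SNR in dB of a quantized sine tone with the given peak amplitude."""
    signal_power = noise_power = 0.0
    for i in range(n):
        x = amplitude * math.sin(2 * math.pi * freq * i / fs)
        e = x - uniform_quantize(x, bits)
        signal_power += x * x
        noise_power += e * e
    return 10 * math.log10(signal_power / noise_power)


for bits in (8, 12):
    print(f"{bits} bits: loud {snr_db(0.9, bits):.1f} dB, "
          f"quiet {snr_db(0.05, bits):.1f} dB")
```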
Non-uniform quantization
• % quantization error is larger for smaller values of x(t)
• Goal: choose quantization levels that give a smaller % error at small signal values, comparable to the % error at large ones.
• This process is called “companding” at the source encoding end and “decompanding” at the decoding (D/A) end.
• The net effect is to make the sum of the quantization errors smaller and more uniform percentage-wise.
• Logarithmic scaling (A-law in Europe and µ-law in US)
Non-uniform quantization
• Smaller quantization steps at smaller signal levels
• Spread signal-to-noise ratio more evenly
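The effect of logarithmic scaling can be sketched with the continuous µ-law curve (µ = 255). This is an illustration only: G.711 itself uses a segmented approximation of this curve, and the helper names below are hypothetical. The sketch compares the relative error of 8-bit uniform quantization with 8-bit quantization applied after companding.

```python
import math

MU = 255.0  # value used by North American mu-law


def compress(x, mu=MU):
    """Map x in [-1, 1] to [-1, 1] with logarithmic (mu-law) scaling."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def expand(y, mu=MU):
    """Inverse of compress()."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def quantize_uniform(v, bits=8):
    """Round v in [-1, 1] to the nearest of 2**bits - 1 uniform levels."""
    half = 2 ** (bits - 1) - 1
    return round(v * half) / half


def mu_law_codec(x, bits=8):
    """Compress, quantize uniformly, then expand again."""
    return expand(quantize_uniform(compress(x), bits))


for x in (0.002, 0.02, 0.2, 0.9):
    err_uniform = abs(quantize_uniform(x) - x) / x
    err_mu = abs(mu_law_codec(x) - x) / x
    print(f"x={x:<6} uniform error {err_uniform:7.2%}   mu-law error {err_mu:7.2%}")
```

Small inputs that a uniform 8-bit quantizer rounds to zero keep a relative error of a few percent after companding, at the cost of a slightly larger error for large inputs.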
G.711
• The most commonplace codec
– Used in circuit-switched telephone network
– PCM, Pulse-Code Modulation
• If uniform quantization
– 12 bits * 8 k/sec = 96 kbps
• Non-uniform quantization
– 64 kbps DS0 rate
– mu-law
• North America
– A-law
• Other countries, a little friendlier to lower signal levels
– An MOS of about 4.3
DPCM
• DPCM, Differential PCM
– Only transmit the difference between the predicted value and the actual value
– Voice changes relatively slowly
– It is possible to predict the value of a sample based on the values of previous samples
– The receiver performs the same prediction
– The simplest form
• No prediction
• No algorithmic delay
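A minimal sketch of the simplest form, assuming the "prediction" is just the previously reconstructed sample and the difference is quantized with a fixed step (both assumptions for illustration, not taken from a standard). The encoder tracks the decoder's output so that both sides stay in step.

```python
import math


def dpcm_encode(samples, step):
    """Quantize the difference between each sample and its prediction."""
    codes, prediction = [], 0.0
    for x in samples:
        code = round((x - prediction) / step)  # quantized difference
        codes.append(code)
        prediction += code * step              # what the decoder will rebuild
    return codes


def dpcm_decode(codes, step):
    """Rebuild the signal by accumulating the received differences."""
    out, prediction = [], 0.0
    for code in codes:
        prediction += code * step
        out.append(prediction)
    return out


step = 1.0 / 128
samples = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(80)]
codes = dpcm_encode(samples, step)
decoded = dpcm_decode(codes, step)
print("largest code magnitude:", max(abs(c) for c in codes))
```

Because the 200 Hz tone changes slowly from sample to sample, the codes stay far smaller than the ±128 range that direct PCM with the same step would need, which is where the bit savings come from.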
ADPCM (Adaptive DPCM)
• Predicts sample values based on
– Past samples
– Factoring in some knowledge of how speech varies over time
• The error is quantized and transmitted
– Fewer bits required
• G.721
– 32 kbps
• G.726
– A-law/mu-law PCM -> 16, 24, 32, 40 kbps
– An MOS of about 4.0 at 32 kbps
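The adaptation idea can be sketched as DPCM whose step size grows when the codes saturate and shrinks when they are small. This is not the G.721/G.726 algorithm: the multipliers (1.5 and 0.8), the step limits and the first-order predictor are assumptions picked for illustration. With 4 bits per sample at 8000 samples per second the rate is 32 kbps.

```python
import math


def _update(prediction, step, code, max_code):
    """State update shared by encoder and decoder so they stay in sync."""
    prediction += code * step
    if abs(code) >= max_code - 1:   # codes near saturation: grow the step
        step *= 1.5
    elif abs(code) <= 1:            # tiny codes: shrink for finer resolution
        step *= 0.8
    step = min(max(step, 1e-4), 0.5)
    return prediction, step


def adpcm_encode(samples, bits=4, step=0.01):
    """Quantize each prediction error with a step that adapts over time."""
    max_code = 2 ** (bits - 1) - 1
    codes, prediction = [], 0.0
    for x in samples:
        code = max(-max_code, min(max_code, round((x - prediction) / step)))
        codes.append(code)
        prediction, step = _update(prediction, step, code, max_code)
    return codes


def adpcm_decode(codes, bits=4, step=0.01):
    """Mirror of the encoder: same predictor, same step adaptation."""
    max_code = 2 ** (bits - 1) - 1
    out, prediction = [], 0.0
    for code in codes:
        prediction, step = _update(prediction, step, code, max_code)
        out.append(prediction)
    return out


samples = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(400)]
codes = adpcm_encode(samples)
decoded = adpcm_decode(codes)
rms_error = math.sqrt(sum((a - b) ** 2 for a, b in zip(samples, decoded)) / len(samples))
print(f"bit rate: {4 * 8000 // 1000} kbps, RMS error: {rms_error:.4f}")
```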
Subjective quality metrics for speech
METRIC | BASIS (better-worse) | RANGE FOR TOLL | RANGE FOR SYNTHETIC
Mean Opinion Score | 5-1 | 4.0 – 3.5 | 3.5 – 2.5
Diagnostic Rhyme Test (consonants) | 100 | ~95 | ~90
Diagnostic Acceptability Measure | 100 | ~73 | ~54
Common Waveform Coders
Type | Quality (MOS – DRT – DAM) | Kbit/sec | MIPS | Complexity
PCM | Toll | 96 | | Very low
log PCM | 4.3 – 95 – 73 | 64 | 0.01 |
ADM/CVSD | Toll | 40 | | Low
Outline
• Quality vs Bit rate
• Types of speech coders
• Waveform Coding
• Speech production and vocoders
• Analysis by Synthesis
• VoIP
Information rate of speech
• Phonetic content at a rate of about 72 bits/second:
– 6 bits sufficient for 40-50 different phonemes
– Average speaking rate is about 12 phonemes/second
• This neglects:
– Intonation (no pitch transmitted)
– Emotion
– Individual characterization of speech (the ability to recognize
the speaker)
– Phone durations are different
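The 72 bits/second figure follows from the two numbers on this slide. The short calculation below reproduces it and compares it with the 64 kbit/s log PCM rate from the G.711 slide; the variable names are only for illustration.

```python
import math

phoneme_inventory = 50        # upper end of the 40-50 range on the slide
phonemes_per_second = 12      # average speaking rate on the slide

bits_per_phoneme = math.ceil(math.log2(phoneme_inventory))
phonetic_rate = bits_per_phoneme * phonemes_per_second   # bit/s
pcm_rate = 8000 * 8                                      # bit/s for log PCM

print(f"{bits_per_phoneme} bits/phoneme -> {phonetic_rate} bit/s")
print(f"log PCM carries about {pcm_rate / phonetic_rate:.0f} times more bits")
```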
Redundancies in speech
• Our sampling frequency Fs is much higher than the vocal tract's rate of change (with the exception of closures)
• F0 (or perceived pitch) changes slowly as compared to
windowing rate
• Adjacent windows correlate rather well
• Spectral waveform changes slowly and most of the energy is
at the low end of frequencies so it changes even more slowly
there (important part of speech)
• It is possible to model phones as periodic/noisy filtered
excitation and still obtain reasonable quality
• Speech parameters may be weighted since they occur
nonuniformly (different probabilities)
• The ear is insensitive to phase so it can be discarded
Average power spectrum of speech
[Figure caption: Notice that the frequency scale is logarithmic in this figure. Speech has in general higher power at the lower frequencies for sonorants and less power above 3.3 kHz, as shown here.]
Human Speech Production System
• Air flow forced from lungs to vocal tract
– short-term correlations
– Filter with resonances (called formants)
• Speech sound classes
– Voiced sounds
• Vocal cord vibration
• Long-term periodicity
– Unvoiced sounds
• Constriction in the vocal tract
• No long-term periodicity
– Plosive sounds
• Release of air pressure behind mouth
A Little About Speech
• Speech
– Air pushed from the lungs past the vocal cords and along the
vocal tract
– The basic vibrations – vocal cords
– The sound is altered by the disposition of the vocal tract (tongue and mouth)
• Model the vocal tract as a filter
– The shape changes relatively slowly
• The vibrations at the vocal cords
– The excitation signal
Voiced Speech
• The vocal cords vibrate open and close
• Interrupt the air flow
• Quasi-periodic pulses of air
• The rate of the opening and closing – the pitch
• A high degree of periodicity at the pitch period
– 2-20 ms
Voiced Speech
• Voiced speech
• Power spectrum density
Unvoiced Speech
• Forcing air at high velocities through a constriction
• The glottis is held open
• Noise-like turbulence
• Show little long-term periodicity
• Short-term correlations still present
Unvoiced Speech
• unvoiced speech
• Power spectrum density
Stops
• Plosive sounds
– A complete closure in the vocal tract
– Air pressure is built up and released suddenly
• A vast array of sounds
– The speech signal is relatively predictable over time
– The reduction of transmission bandwidth can be significant
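Linear Predictive Coding (LPC)
• Predict current sample as linear combination of past samples
$$x[n] = \sum_{k=1}^{p} a_k\, x[n-k] + e[n]$$
• An all-pole model:
$$H(z) = \frac{X(z)}{E(z)} = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}} = \frac{1}{A(z)}$$
• Minimize squared error
$$E_m = \sum_n e_m^2[n] = \sum_n \left( x_m[n] - \tilde{x}_m[n] \right)^2 = \sum_n \Big( x_m[n] - \sum_{j=1}^{p} a_j\, x_m[n-j] \Big)^2$$
• Orthogonality principle
$$\langle e_m, x_m^i \rangle = \sum_n e_m[n]\, x_m[n-i] = 0, \qquad 1 \le i \le p$$
• Solution
$$\sum_{j=1}^{p} a_j\, \phi_m[i,j] = \phi_m[i,0], \qquad \phi_m[i,j] = \sum_n x_m[n-i]\, x_m[n-j]$$
[Figure: block diagram of the filter 1/A(z); vector diagram of the orthogonality principle with x_m, x_m^1, x_m^2, the prediction of x_m, and the error e_m]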
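One common way to solve the normal equations on this slide is the autocorrelation method, which assumes that φ_m[i, j] depends only on |i − j| (so φ_m[i, j] = R[|i − j|]) and then applies the Levinson-Durbin recursion. The slides do not say which method is intended, so treat the sketch below as one possible instance; the test signal and function names are assumptions.

```python
import random


def autocorrelation(x, max_lag):
    """R[k] = sum_n x[n] * x[n - k] for k = 0..max_lag."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]


def levinson_durbin(r, p):
    """Solve sum_j a_j R[|i-j|] = R[i], i = 1..p. Returns ([a_1..a_p], error)."""
    a = [0.0] * (p + 1)   # a[0] is unused
    err = r[0]
    for i in range(1, p + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k   # prediction error energy shrinks at each order
    return a[1:], err


# Demo: recover the coefficients of a known 2nd-order all-pole model.
random.seed(0)
true_a = [1.3, -0.6]
x = [0.0, 0.0]
for _ in range(5000):
    x.append(true_a[0] * x[-1] + true_a[1] * x[-2] + random.gauss(0.0, 1.0))
r = autocorrelation(x, 2)
a_est, residual_energy = levinson_durbin(r, 2)
print("estimated coefficients:", [round(a, 2) for a in a_est])
```

The estimates should land close to the coefficients used to generate the signal, with the sign convention matching x[n] = Σ a_k x[n−k] + e[n].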
Vocoders (source coders)
• Linear prediction model for human voice system
• Medium quality, low bitrate
Vector Quantization
• Example
• Key challenge
– Given a source distribution, how to select codebook (*) and
partitions (---) to result in smallest average distortion
• Solution:
– Divide and conquer
– Two codes → four → eight …
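The "two codes, then four, then eight" idea matches the splitting approach often called the LBG algorithm: start from one centroid, split every codeword into a perturbed pair, then refine with nearest-neighbour assignment and centroid updates. The sketch below is a minimal version under that assumption; the training data, perturbation size and iteration count are invented for illustration.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest(codebook, v):
    """Index of the codeword closest to v."""
    return min(range(len(codebook)), key=lambda i: squared_distance(codebook[i], v))


def centroid(vectors):
    return tuple(sum(c) / len(vectors) for c in zip(*vectors))


def lbg(training, size, eps=0.01, iterations=20):
    """Grow a codebook 1 -> 2 -> 4 -> ... -> size (size must be a power of 2)."""
    codebook = [centroid(training)]
    while len(codebook) < size:
        # Split every codeword into a slightly perturbed pair
        codebook = [tuple(c + d for c in code)
                    for code in codebook for d in (eps, -eps)]
        for _ in range(iterations):
            cells = [[] for _ in codebook]
            for v in training:
                cells[nearest(codebook, v)].append(v)
            # Move each codeword to the centroid of its cell (keep it if empty)
            codebook = [centroid(cell) if cell else code
                        for cell, code in zip(cells, codebook)]
    return codebook


# Hypothetical 2-D training set: four tight clusters
centers = [(0, 0), (4, 1), (10, 3), (14, 4)]
offsets = [(-0.2, 0), (0.2, 0), (0, -0.2), (0, 0.2), (0, 0)]
training = [(cx + dx, cy + dy) for cx, cy in centers for dx, dy in offsets]
codebook = sorted(lbg(training, 4))
print(codebook)
```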
Outline
• Quality vs Bit rate
• Types of speech coders
• Waveform Coding
• Speech production and vocoders
• Analysis by Synthesis
• VoIP
Analysis-by-Synthesis (AbS) Codecs
• Hybrid method
– Vocoder’s linear prediction model
– Careful selection of excitation signal to reconstruct original
waveform
– High quality, low bitrate!
• The most successful and commonly used
• Time-domain AbS codecs
– Not a simple two-state, voiced/unvoiced
– Different excitation signals are attempted
– Closest to the original waveform is selected
• Types:
– MPE, Multi-Pulse Excited
– RPE, Regular-Pulse Excited
– CELP, Code-Excited Linear Predictive
Linear-Prediction-based Analysis-by-Synthesis
• How it works
– Segment speech into frames (typically 20 ms long)
– Find the filter parameters for each frame
– Find the excitation that minimizes the prediction error
• Perceptual weighting
• More accuracy where speech energy is low
– Transmit the filter parameters and excitation signal
• Vector quantization
LPAS Classification
• Three classes
– Multi-Pulse Excited (MPE)
– Regular-Pulse Excited (RPE)
– Code-Excited Linear Predictive (CELP)
• Difference lies in representation of excitation signal
Multi-Pulse Excited (MPE)
• Excitation is given by a fixed number of pulses
• Position and amplitude of the pulses are computed to
minimize error and transmitted to decoder
• Finding the best match is theoretically possible but not
practical
• Suboptimal estimations are given
• Typically about 4 pulses per 5 ms are used
Regular-Pulse Excited (RPE)
• Multiple pulses used like in MPE
• Regularly spaced at fixed period
• Only needs to transmit the first pulse’s position and all pulses’ amplitudes
– More pulses are allowed for better quality at same bitrate
– Around 10 pulses per 5 ms
Code-Excited Linear Predictive (CELP)
• Excitation is given by
– an entry from a large vector quantizer codebook
– A gain term for its power (amplitude)
• Key challenge
– Searching for the right excitation entries in realtime
– Solution: restructure the codebook optimized for searching (such
as a tree)
• Performance
– 4.8kbps or lower bitrate with good quality
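The excitation search can be sketched as a brute-force analysis-by-synthesis loop: pass every codebook entry through the synthesis filter, compute the least-squares gain, and keep the entry with the smallest error. This is a simplified illustration that assumes a zero initial filter state, no perceptual weighting and an exhaustive search (the slide's tree-structured search is not shown); all names and the random codebook are hypothetical.

```python
import random


def synthesize(excitation, a):
    """All-pole synthesis: y[n] = excitation[n] + sum_k a[k] * y[n-1-k]."""
    y = []
    for n, e in enumerate(excitation):
        y.append(e + sum(a[k] * y[n - 1 - k]
                         for k in range(len(a)) if n - 1 - k >= 0))
    return y


def search_codebook(target, codebook, a):
    """Return (index, gain, error) of the entry that best matches target."""
    best = None
    for index, code in enumerate(codebook):
        y = synthesize(code, a)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue
        gain = sum(t * v for t, v in zip(target, y)) / energy  # least-squares gain
        error = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best


rng = random.Random(0)
a = [1.3, -0.6]                                   # example LP coefficients
codebook = [[rng.gauss(0.0, 1.0) for _ in range(10)] for _ in range(16)]
target = [2.5 * v for v in synthesize(codebook[7], a)]
index, gain, error = search_codebook(target, codebook, a)
print(index, round(gain, 3))
```

Only the index and the gain need to be sent for the excitation, since the decoder holds the same codebook.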
Further Improvements on CELP
• Representation of pitch period
– Adaptive Long-term prediction + short-term adjustment
• Coding of LP filter
– Vector quantization of filter representation
• Multimode coding
– Dynamic bit allocation between excitation, LP filter and pitch
G.728 LD-CELP
• CELP codecs
– A filter; its characteristics change over time
– A codebook of acoustic vectors
• A vector = a set of elements representing various characteristics of the excitation
– Transmit
• Filter coefficients, gain, a pointer to the vector chosen
• Low Delay CELP
– Backward-adaptive coder
• Use previous samples to determine filter coefficients
• Operates on five samples at a time
– Delay < 1 ms
• Only the pointer is transmitted
G.728 LD-CELP
• 1024 vectors in the code book
• 10-bit pointer (index)
• 16 kbps
• LD-CELP encoder
– Minimize a frequency-weighted mean-square error
G.728 LD-CELP
– MOS score of about 3.9
– One-quarter of G.711 bandwidth (16 kbps)
– 30 MIPS
– 2 kilobytes of RAM is needed for codebooks
– 50th order LPC filter.
– Lower delays are obtained by making the excitation vectors very short (~5 samples or 0.625 ms)
Algebraic CELP (ACELP)
• In Algebraic CELP (ACELP), the residual samples are not vector-quantized; they are derived directly from an algebraic computation and used to excite the LP synthesizer, which accelerates the search for the optimal excitation
• The main advantage is that the algebraic codebook can be very large (> 50 bits) without running into storage or CPU time problems. A 16-bit algebraic codebook is used in the innovative codebook search, the aim of which is to find the best innovation and gain parameters
• The innovation vector contains, at most, four non-zero pulses. In ACELP a block of N speech samples is synthesized by filtering an appropriate innovation sequence from a codebook, scaled by a gain factor, through two time-varying filters: a long-term (pitch) synthesis filter and a short-term synthesis filter.
• Conjugate Structure ACELP yields toll quality with a 10th-order LPC
G.723.1 ACELP
• 6.3 or 5.3 kbps
– Both mandatory
– Can change from one to another during a conversation
• The coder
– A band-limited input speech signal
– Sampled at 8 kHz, 16-bit uniform PCM quantization
– Operate on blocks of 240 samples at a time
– A look-ahead of 7.5 ms
– A total algorithmic delay of 37.5 ms + other delays
– A high-pass filter to remove any DC component
G.723.1 ACELP
• Various operations to determine the appropriate filter
coefficients
• 5.3 kbps, Algebraic Code-Excited Linear Prediction
• 6.3 kbps, Multi-pulse Maximum Likelihood Quantization
• The transmission
– Linear prediction coefficients
– Gain parameters
– Excitation codebook index
– 24-octet frames at 6.3 kbps, 20-octet frames at 5.3 kbps
G.723.1 ACELP
• G.723.1 Annex A
– Silence Insertion Descriptor (SID) frames of size four octets
• The two lsbs of the first octet indicate the frame type
– 00: 6.3 kbps, 24 octets/frame
– 01: 5.3 kbps, 20 octets/frame
– 10: SID frame, 4 octets/frame
• An MOS of about 3.8
– At least 37.5 ms delay
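A receiver can use the two least-significant bits of the first octet to tell how long each frame is. The sketch below walks a byte stream using only the three codes listed on this slide; the function names are hypothetical, and any other code is rejected because the slide does not describe it. It also shows that 24 and 20 octets per 30 ms frame (240 samples at 8 kHz) correspond to 6.4 and about 5.33 kbps on the wire, slightly above the nominal 6.3 and 5.3 kbps because frames are padded to whole octets.

```python
FRAME_TYPES = {
    0b00: ("6.3 kbps speech", 24),
    0b01: ("5.3 kbps speech", 20),
    0b10: ("SID", 4),
}
FRAME_DURATION_MS = 30  # 240 samples at 8 kHz


def frame_info(first_octet):
    """Return (description, frame size in octets) from a frame's first octet."""
    kind = first_octet & 0b11  # two least-significant bits
    if kind not in FRAME_TYPES:
        raise ValueError(f"frame type {kind:02b} is not covered by the slide")
    return FRAME_TYPES[kind]


def split_frames(stream):
    """Yield (description, frame bytes) for each frame in a byte string."""
    pos = 0
    while pos < len(stream):
        description, size = frame_info(stream[pos])
        yield description, stream[pos:pos + size]
        pos += size


def wire_rate_kbps(octets):
    """Bit rate of a frame of the given size, one frame per 30 ms."""
    return octets * 8 / FRAME_DURATION_MS


# Hypothetical stream: one frame of each kind, payload bits set to zero
stream = bytes([0b00] + [0] * 23 + [0b01] + [0] * 19 + [0b10] + [0] * 3)
frames = list(split_frames(stream))
for description, frame in frames:
    print(f"{description}: {len(frame)} octets, {wire_rate_kbps(len(frame)):.2f} kbps")
```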
G.729
• 8 kbps
• Input frames of 10 ms, 80 samples for 8 kHz sampling rate
• 5 ms look-ahead
– Algorithmic delay of 15 ms
• An 80-bit frame for 10 ms of speech
• A complex codec
– G.729.A (Annex A), a number of simplifications
– Same frame structure
– Encoder/decoder interoperable between G.729 and G.729.A
– Slightly lower quality
G.729
• G.729.B
– VAD, Voice Activity Detection
• Based on analysis of several parameters of the input
• The current frames plus two preceding frames
– DTX, Discontinuous Transmission
• Send nothing or send an SID frame
• SID frame contains information to generate comfort
noise
– CNG, Comfort Noise Generation
• G.729, an MOS of about 4.0
• G.729A an MOS of about 3.7
G.729
• G.729 Annex D
– a lower-rate extension
– 6.4 kbps; 10 ms speech samples, 64 bits/frame
– MOS  6.3 kbps G.723.1
• G.729 Annex E
– a higher bit rate enhancement
– the linear prediction filter of G.729 has 10 coef.
– that of G.729 Annex E has 30 coef.
– the codebook of G.729 has 35 bits
– that of G.729 Annex E has 44 bits
– 118 bits/frame; 11.8 kbps
CDMA QCELP (IS-733)
• Variable-rate coder
• Two most common rates
– The high rate, 13.3 kbps
– A lower rate, 6.2 kbps
• Silence suppression
• For use with RTP, RFC 2658
GSM Enhanced Full-Rate (EFR)
• GSM 06.60
• An enhanced version of GSM Full-Rate
• ACELP-based codec
• The same bit rate and the same overall packing structure
– 12.2 kbps
• Support discontinuous transmission
• For use with RTP, RFC 1890
GSM Adaptive Multi-Rate (AMR)
• 20 ms coding delay
• Eight different modes
• 4.75 kbps to 12.2 kbps
• 12.2 kbps, GSM EFR
• 7.4 kbps, IS-641 (TDMA cellular systems)
• Change the mode at any time
• Offer discontinuous transmission
– The SID (Silence Descriptor) is sent in every 8th frame and is 5 bytes in size
• The coding choice of many 3G wireless networks
VSELP
• Vector-Sum-Excited Linear Prediction:
– In IS-54 standard TDMA cell phones in US and a variation in
Japan
– In the first version of RealAudio for audio over the Internet
• Data rate:
– Data rate of 7.95 kbit/s: 20 ms of speech into 159-bit frames
– In an actual TDMA cell phone, the vocoder output is packaged
with error correction and signaling information, resulting in an
over-the-air data rate of 16.2 kbit/s
– For internet audio, each 159-bit frame is stored in 20 bytes,
leaving 1 bit unused. The resulting file thus has a data rate of
exactly 8 kbit/s
• Limited ability to encode non-speech sounds
– Performs poorly in the presence of background noise
Outline
• Quality vs Bit rate
• Types of speech coders
• Waveform Coding
• Speech production and vocoders
• Analysis by Synthesis
• VoIP
Effects of packetization
Advantages of VoIP
• Cost savings:
– Avoid the need for separate voice and data networks
– Conference calling, IVR, call forwarding, automatic redial, and caller ID features
(traditional telcos normally charge extra)
– While regular telephone calls are billed by the minute or second, VoIP calls are
billed per megabyte
• Flexibility:
– Simple way to add an extra telephone line to a home or office.
– Secure calls using standardized protocols (such as Secure Real-time Transport
Protocol)
– Location independence: call center agents using VoIP phones can work from
anywhere
– Integration with other services available over the Internet, including video
conversation, message or data file exchange in parallel with the conversation,
audio conferencing, managing address books
• Potential quality improvements:
– Break the legacy 8 kHz sampling limit => 16 kHz, stereo, 5.1
Challenges with VoIP
• Quality of Service (QoS)
– Latency
– Packet loss
• Play silence?
• Audio “healing”
– Transcoding leads to quality degradations
• Susceptibility to power failure
• 911 calls
• Security: hackers in your phone
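For the packet-loss bullet, one simple alternative to playing silence is to repeat the last received frame with a fade. The sketch below assumes that form of concealment; the slide does not specify what audio "healing" involves, and the function name and fade factor are hypothetical. A lost packet is represented by None.

```python
def conceal(frames, frame_len, fade=0.5):
    """Replace lost frames (None) with a faded copy of the last output frame."""
    out, last = [], [0.0] * frame_len
    for frame in frames:
        if frame is None:                   # lost packet
            last = [fade * s for s in last]  # repeat, progressively quieter
        else:
            last = list(frame)
        out.append(last)
    return out


received = [[1.0, 1.0], None, None, [0.2, 0.2]]
print(conceal(received, frame_len=2))
```

A loss before any frame has arrived falls back to silence, since the initial "last frame" is all zeros.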
DTX and Comfort Noise
• DTX is Discontinuous Transmission
• Voice activity detector (VAD) detects if there is active speech
or not.
• When there is no active speech different DTX procedures can
be used:
– No Transmission at all
– Comfort Noise (CN) using RFC 3389
– Codec built-in CN, like AMR SID (Silence Descriptor)
• Frequency of Comfort Noise packets varies but is usually
some fraction of normal packet rate
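The DTX decision can be sketched per frame as follows. The voice activity detector here is a bare energy threshold, which is a deliberate simplification (the G.729.B slide notes that real VADs analyse several parameters), and the threshold and SID interval are assumed values; the interval of 8 echoes the AMR slide.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)


def dtx_schedule(frames, threshold=1e-4, sid_interval=8):
    """Decide per frame what to send: 'SPEECH', 'SID' or 'NONE'."""
    decisions, silent_run = [], 0
    for frame in frames:
        if frame_energy(frame) >= threshold:   # VAD says active speech
            decisions.append("SPEECH")
            silent_run = 0
        else:
            # First silent frame and every sid_interval-th one carry an SID
            decisions.append("SID" if silent_run % sid_interval == 0 else "NONE")
            silent_run += 1
    return decisions


loud = [0.1] * 160     # hypothetical 20 ms frames at 8 kHz
quiet = [0.0] * 160
schedule = dtx_schedule([loud, loud] + [quiet] * 10 + [loud])
print(schedule)
```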
Tones, Signal, and DTMF Digits
• The hybrid codecs are optimized for human speech
– Other data may need to be transmitted
– Tones: fax tones, dialing tone, busy tone
– DTMF digits for two-stage dialing or voice-mail
• Through G.711 they come out OK
• Through G.723.1 and G.729 they can be unintelligible
• The ingress gateway needs to intercept
– The tones and DTMF digits
– Use an external signaling system