Voice over Packet

Speech Compression
Y(J) Stein  VoP3
Quick Overview
Simple coders
– G.711 A-law / µ-law
– Delta
– ADPCM
Other methods
– MBE
– MELP
– STC
– Waveform Interpolation
CELP coders
– LPC-10
– RELP/GSM
– CELP

Encoder Criteria
Encoders can be compared in many ways; the most important are:
Bit rate (Kbps)
Speech quality (MOS)
Delay (algorithmic [frame+lookahead] + computational + propagation)
Computational Complexity
Often less important:
Bit exactness (interoperability)
Transcoding robustness
Behavior on non-speech (babble noise, tones, music)
Bit error robustness

PSTN Quality Coders
Rate       ITU-T encoder   Approach
128 Kbps   -               16-bit linear sampling
64 Kbps    G.711           A-law/µ-law 8-bit log sampling
32 Kbps    G.726           ADPCM
16 Kbps    G.728           LD-CELP
8 Kbps     G.729*          CS-ACELP
4 Kbps     SG16 Q21*       ???
* toll-quality MOS rating, but higher delay

Digital Cellular Standards
Coder      Rate    Approach  Quality  Complexity  Delay
GSM FR     13      RPE-LTP   3.5      Low         40
GSM HR     5.6     VSELP     <3.5     High        45
GSM EFR    12.2    ACELP     4.0      Medium      45
GSM AMR    4-12    ACELP     3.5-4.0  Medium      45
TIA IS54   8       VSELP     3.5      Medium      45
TIA IS641  8       ACELP     4.0      Medium      45
TIA IS96   8*      QCELP     <3.5     Medium      45
TIA EVRC   8*      ACELP     4.0      High        50
TIA Q13    13*     QCELP     4.0      Med-High    45?
* = Variable rate

Military / Satellite Standards
Coder             Rate (Kb/s)  Approach  Quality (MOS)  Complexity  Delay (ms)
FS-1015 (LPC-10)  2.4          LPC       2.5            Low-med     13.5
FS-1016           4.8          CELP      3.0            High        67.5
MELP              2.4          MELP      3.3            Med-high    67
Satellite 1       4.8          IMBE      3.3-3.5        Medium      100
Satellite 2       2.4-3.6      AMBE      3.3-3.5        Medium      100

Simple coders
G.711
16-bit linear sampling at 8 kHz means 128 Kbps
Minimal toll-quality linear sampling is 12 bits (96 Kbps)
8-bit linear sampling (256 levels) is noticeably noisy
Due to
– the prevalence of low amplitudes
– the logarithmic response of the ear
we can use logarithmic sampling
Different standards for different places

G.711 - cont.
North America: µ-law, with µ = 255
Rest of world: A-law, with A = 87.56

µ-law:  F(x) = sgn(x) ln(1 + µ|x|) / ln(1 + µ)
A-law:  F(x) = sgn(x) A|x| / (1 + ln A)             for |x| < 1/A
        F(x) = sgn(x) (1 + ln(A|x|)) / (1 + ln A)   for 1/A <= |x| <= 1

Although very different looking, they are nearly identical
G.711 approximates these expressions by 16 staircase straight-line segments
(8 negative and 8 positive)
µ-law: horizontal segment through the origin; A-law: vertical segment
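The µ-law curve above can be sketched directly. This is only the continuous compression characteristic (the actual G.711 8-bit segmented quantization is omitted); the round-trip shows the expansion exactly inverts the compression.

```python
import math

MU = 255.0  # North American mu-law parameter

def mu_law_compress(x: float) -> float:
    """Compress a sample x in [-1, 1] with the mu-law characteristic."""
    return math.copysign(math.log(1.0 + MU * abs(x)) / math.log(1.0 + MU), x)

def mu_law_expand(y: float) -> float:
    """Invert the compression."""
    return math.copysign(((1.0 + MU) ** abs(y) - 1.0) / MU, y)

# The curve is steep near the origin, so quantizing the compressed value
# to 8 bits gives near-12-bit quality at the prevalent low amplitudes.
x = 0.01
y = mu_law_compress(x)
assert abs(mu_law_expand(y) - x) < 1e-12
assert y > 0.2  # a tiny input maps to a much larger compressed value
```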
DPCM
Due to low-pass character of speech
differences are usually smaller than signal values
and hence require fewer bits to quantize
Simplest Delta-PCM (DPCM): quantize the first difference signal Δ
Predictive DPCM: quantize the difference between the signal and a prediction
ŝn = p(sn-1, sn-2, …, sn-N) = Σi pi sn-i
If we predict using a linear combination (FIR filter), this is linear prediction
Delta modulation (DM): use only the sign of the difference (1-bit DPCM)
Sigma-delta (1-bit): oversample, then DM; trade off rate for bits

DPCM with prediction
If the linear prediction works well, then the prediction error
en = sn - ŝn
will be lower in energy and whiter than sn itself!
Only the error is needed for reconstruction,
since the predictable portion can be predicted: sn = ŝn + en
[Block diagram: the encoder subtracts the prediction-filter output ŝn from sn to get en; the decoder adds its own prediction ŝn back to en to recover sn]
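As a sketch of the energy claim above, a trivial first-order predictor on a synthetic low-pass signal (the sinusoid and sampling rate are illustrative) leaves a prediction error with far less energy than the signal itself:

```python
import math

# Synthetic low-pass "speech-like" signal: a slow sinusoid at fs = 8 kHz.
N = 800
s = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(N)]

# Trivial first-order predictor: predict each sample by its predecessor.
e = [s[n] - s[n - 1] for n in range(1, N)]

energy_s = sum(x * x for x in s[1:])
energy_e = sum(x * x for x in e)

# The prediction error carries much less energy than the signal,
# so it needs fewer bits to quantize to the same accuracy.
assert energy_e / energy_s < 0.1
```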
DPCM - post-filtering
Simplest case:
if highly oversampled,
then the previous sample sn-1 predicts sn well,
so we can use DM:
if sgn(en) < 0 then -Δ else +Δ
For DM there is no way to encode zero prediction error,
so the decoded signal oscillates wildly
The standard remedy is a post-filter that low-pass filters this noise
But there is a bigger problem!
Open-loop Prediction
The encoder's linear predictor is also present in the decoder,
but there it runs in feedback
The decoder's predictions would be accurate given the precise error en,
but it receives only the quantized error, and the models diverge!
[Block diagram: encoder: sn → prediction filter → quantizer Q → quantized en; decoder: inverse quantizer IQ → prediction filter → reconstructed sn]
Side Information
There are two ways to solve the problem ...
The first way is to send the prediction coefficients
from the encoder to the decoder
and not to let the decoder derive them
The coefficients sent are called side-information
Using side-information means higher bit-rate
(since both en and coefficients must be sent)
The second way does not require increasing bit rate

Closed-loop Prediction
To ensure that the encoder and decoder stay "in sync"
we put the decoder inside the encoder
Thus the encoder's predictions are identical to the decoder's
and no model difference accumulates
[Block diagram: the encoder quantizes en = sn - ŝn, then runs the same IQ and prediction filter as the decoder on its own quantized output]
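A minimal closed-loop DPCM sketch (the uniform quantizer and its fixed step size are illustrative): the encoder predicts from its own reconstructed samples, so encoder and decoder never drift apart.

```python
import math

STEP = 0.05  # fixed quantizer step size (illustrative)

def encode(samples):
    """Closed-loop DPCM: the encoder predicts from its own *reconstructed*
    samples, i.e. it contains a copy of the decoder."""
    recon, codes = 0.0, []
    for s in samples:
        q = round((s - recon) / STEP)   # quantize the prediction error
        codes.append(q)
        recon += q * STEP               # exactly what the decoder computes
    return codes

def decode(codes):
    recon, out = 0.0, []
    for q in codes:
        recon += q * STEP
        out.append(recon)
    return out

sig = [math.sin(2 * math.pi * n / 50) for n in range(200)]
max_err = max(abs(a - b) for a, b in zip(sig, decode(encode(sig))))
assert max_err <= STEP  # error stays bounded: the models never diverge
```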

Two types of error
For DM there are two types of error, depending on step size:
[Figure: Δ too small gives slope overload; Δ OK tracks the signal; Δ too large gives granular noise]
Adaptive Step Size
Speech signals are very nonstationary
We need to adapt the step size to match signal behavior
– Increase Δ when the signal changes rapidly
– Decrease Δ when the signal is relatively constant
Simplest method (for DM only):
– If the present bit is the same as the previous one, multiply Δ by K (K = 1.5)
– If the present bit is different, divide Δ by K
– Constrain Δ to a predefined range
More general method:
– Collect N samples in a buffer (N = 128 … 512)
– Compute the standard deviation in the buffer
– Set Δ to a fraction of the standard deviation
• Send Δ to the decoder as side-information, or
• Use backward adaptation (closed-loop Δ computation)
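The simplest DM adaptation rule above can be sketched as follows (the step bounds and the ramp test signal are illustrative). Because the rule depends only on the transmitted bits, the decoder can apply it in closed loop, with no side-information:

```python
K = 1.5                      # step multiplier from the slide
D_MIN, D_MAX = 0.005, 0.5    # illustrative predefined range for the step

def adapt(d, bit, prev):
    """Grow the step while bits repeat, shrink it when they alternate."""
    if prev is not None:
        d = d * K if bit == prev else d / K
    return min(max(d, D_MIN), D_MAX)

def adm_encode(samples):
    bits, recon, d, prev = [], 0.0, D_MIN, None
    for s in samples:
        b = 1 if s >= recon else 0
        d = adapt(d, b, prev)
        recon += d if b else -d
        bits.append(b)
        prev = b
    return bits

def adm_decode(bits):
    out, recon, d, prev = [], 0.0, D_MIN, None
    for b in bits:
        d = adapt(d, b, prev)   # same rule, so the decoder stays in sync
        recon += d if b else -d
        out.append(recon)
        prev = b
    return out

ramp = [0.01 * n for n in range(300)]   # slowly rising test signal
rec = adm_decode(adm_encode(ramp))
err = [abs(a - b) for a, b in zip(ramp, rec)]
assert max(err[50:]) < 0.1   # after start-up, the step tracks the slope
```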

ADPCM
G.726 has
– an adaptive predictor
– an adaptive quantizer and inverse quantizer
– adaptation speed control
– a tone and transition detector
– a mechanism to prevent loss from tandeming
Computational complexity is relatively high (10 MIPS)
24 and 16 Kbps modes are defined, but are not toll quality
G.727: same rates, but embedded coding for packetized networks
ADPCM only exploits the general low-pass characteristic of speech
What is the next step?
Scalar Quantization
Standard A/D has preset, evenly distributed levels
G.711 has preset, non-evenly distributed levels
With a criterion we can make an adaptive quantizer
Simplest criterion: minimum mean squared quantization error
en = sn - ŝn     E = ⟨en²⟩
Need an algorithm to find the optimal placement of levels [EM-type algorithms]

Vector Quantization
We can do the same thing in higher dimensions
Here we wish to match input data xi, i = 1 .. N
to a codebook of codewords Cj, j = 1 .. M
with minimal mean squared error
E = Σi=1..N | xi - C(xi) |²
where C(xi) is the codeword closest to xi in the codebook
[Figure: codewords C1 .. C4 partition the plane; each input xi maps to its nearest codeword]

LBG Algorithm for VQ
Input: xi, i = 1 .. N   [clustering, unsupervised learning]
Randomly initialize the codebook Cj, j = 1 .. M
Loop until convergence:
  Classification step
    for i = 1 .. N
      for j = 1 .. M compute Dij² = | xi - Cj |²
      classify xi to the Cj with minimal Dij²
  Expectation step
    for j = 1 .. M
      correct the center: Cj = (1/Nj) Σi∈Cj xi
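A minimal sketch of the LBG loop in pure Python (the sample data and codebook size are illustrative):

```python
import random

def lbg(points, M, iters=50, seed=0):
    """One LBG round per iteration: classify every point to its nearest
    codeword, then move each codeword to the mean of its class."""
    rng = random.Random(seed)
    codebook = rng.sample(points, M)          # random initialization
    for _ in range(iters):
        classes = [[] for _ in range(M)]
        for x in points:                      # classification step
            j = min(range(M), key=lambda j: sum(
                (a - b) ** 2 for a, b in zip(x, codebook[j])))
            classes[j].append(x)
        for j, cls in enumerate(classes):     # expectation step
            if cls:                           # guard against empty cells
                codebook[j] = tuple(sum(c) / len(cls) for c in zip(*cls))
    return codebook

# Two obvious clusters; the codebook converges to their means.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
       (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
cb = lbg(pts, 2)
```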
Speech Application of VQ
OK, I understand what to do with scalar quantization,
but what is VQ good for?
We could try to simply VQ frames of speech samples,
but this doesn't work well!
We can VQ spectra or sub-band components
We often VQ parameter sets (e.g. LPC coefficients)
We also VQ model error signals

CELP coders

LPC-10
Based on 10th-order LPC (obviously) [Bishnu Atal]
180-sample blocks are encoded into 54 bits:
– Pitch + U/V decision (found using AMDF)   7 bits
– Gain                                      5 bits
– 10 reflection coefficients found by the covariance method
  • first two coefficients converted to log area ratios
  • L1, L2, a3, a4: 5 bits each; a5-a8: 4 bits each; a9: 3 bits; a10: 2 bits
  total                                     41 bits
– Sync                                      1 bit
54 bits sent 44.44 times per second gives 2400 bps
By using VQ the bit rate could be reduced to under 1 Kbps!
LPC-10 speech is intelligible, but synthetic sounding,
and much of the speaker identity is lost!
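The bit budget above can be checked arithmetically; a quick sanity check of the slide's numbers:

```python
# LPC-10 bit budget, as listed on the slide.
bits = {
    "pitch + U/V": 7,
    "gain": 5,
    # L1, L2, a3, a4 at 5 bits; a5-a8 at 4 bits; a9 at 3; a10 at 2
    "reflection coefficients": 4 * 5 + 4 * 4 + 3 + 2,
    "sync": 1,
}
frame_bits = sum(bits.values())
frames_per_sec = 8000 / 180        # 180-sample frames at fs = 8 kHz

assert bits["reflection coefficients"] == 41
assert frame_bits == 54
assert round(frame_bits * frames_per_sec) == 2400   # bps, exactly
```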
The Residual
Recover sn by adding back the residual error signal:
sn = ŝn + en
So if we send en as side-information we can recover sn
en is smaller than sn, so it may require fewer bits!
But en is whiter than sn, so it may require many bits!
The question has now become:
how can we compress the residual?
Encoding the Residual
RELP (6-9.6 Kbps)
– low-pass filter and downsample the residual to 1 kHz
– encode using ADPCM
VQ-RELP (4.8 Kbps)
– VQ coding of the residual
RELP (4.8 Kbps)
– perform an FFT on the residual
– baseband coding
RPE-LTP (GSM-FR at 13 Kbps)
– Regular Pulse Excitation - Long Term Prediction
– perform long-term prediction (pitch recovery)
– subtract to obtain a new residual
– decimate by 3, using the phase with maximum energy
– extract a 6-bit overall gain
– encode the remainder with 3 bits/sample

Residual and Excitation
Synthesis filter (all-pole): the excitation en drives  sn = en + Σm am sn-m
Analysis filter (all-zero): the residual is  rn = sn - Σm am sn-m
Note: the all-zero filter is the inverse of the all-pole filter,
so rn = en!
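The inverse relationship can be sketched directly (the coefficients and signal here are illustrative): the all-zero analysis filter followed by the all-pole synthesis filter reproduces the input exactly.

```python
def analysis(s, a):
    """All-zero analysis filter: r[n] = s[n] - sum_m a[m] * s[n-1-m]."""
    r = []
    for n in range(len(s)):
        pred = sum(a[m] * s[n - 1 - m]
                   for m in range(len(a)) if n - 1 - m >= 0)
        r.append(s[n] - pred)
    return r

def synthesis(r, a):
    """All-pole synthesis filter: s[n] = r[n] + sum_m a[m] * s[n-1-m]."""
    s = []
    for n in range(len(r)):
        pred = sum(a[m] * s[n - 1 - m]
                   for m in range(len(a)) if n - 1 - m >= 0)
        s.append(r[n] + pred)
    return s

a = [1.2, -0.6]                         # illustrative LPC coefficients
sig = [0.0, 1.0, 0.8, -0.3, 0.5, 0.1]   # illustrative signal
rec = synthesis(analysis(sig, a), a)    # residual in, original out
assert all(abs(x - y) < 1e-12 for x, y in zip(sig, rec))
```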

CELP
Atal's idea:
find a way to efficiently encode the excitation!
[Diagram: excitation en → LPC synthesis filter → sn]
Questions:
How can we find the excitation?
– theoretically, by algebra (invert the filter!)
How can we efficiently encode the residual?
– VQ: Code Excited Linear Prediction
How can we efficiently find the best codeword?
– exhaustive search

CELP - cont.
Atal and friends (Schroeder, Remde, Singhal, etc.) discovered:
– even random codebooks work well [Gaussian, uniform]
– large codebooks are not needed [e.g. 1024 codewords for 40 samples]
– center-clipping loses little
– a codebook with constant amplitudes is almost as good
So we can use codebooks with structure (and save storage/search/bits):
– Multipulse (MP)
– Regular Pulse (RP)
– Constant Amplitude Pulse
Special Excitations
The shift technique reduces random-CB operations from O(N²) to O(N):
[a b c d e f] [c d e f g h] [e f g h i j] ...
Using a small number of ±1-amplitude pulses leads to a MIPS reduction:
– since most values are zero, there are few operations
– since amplitudes are ±1, no true multiplications
– in a CB containing both CW and -CW we can save half
Algebraic codebooks exploit algebraic structure
– example: choose pulses according to a Hadamard matrix
– using the FHT reduces computation
Conjugate structure codebooks
– the excitation is the sum of codewords from two related CBs

Analysis by Synthesis
Finding the best codeword by exhaustive search:
[Block diagram: each codebook entry is passed through the LPC synthesis filter, subtracted from sn, the error energy computed, and the minimum found]
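A toy analysis-by-synthesis search, using the unweighted error-energy criterion from the figure (codebook size, LPC order, and the test setup are all illustrative):

```python
import random

def synthesize(exc, a):
    """All-pole LPC synthesis: s[n] = exc[n] + sum_m a[m] * s[n-1-m]."""
    s = []
    for n in range(len(exc)):
        pred = sum(a[m] * s[n - 1 - m]
                   for m in range(len(a)) if n - 1 - m >= 0)
        s.append(exc[n] + pred)
    return s

def best_codeword(target, codebook, a):
    """Exhaustive search: synthesize every codeword and keep the one
    with minimum error energy (no perceptual weighting here)."""
    def err(cw):
        syn = synthesize(cw, a)
        return sum((t - y) ** 2 for t, y in zip(target, syn))
    return min(range(len(codebook)), key=lambda j: err(codebook[j]))

rng = random.Random(1)
a = [0.5]                                   # toy 1st-order "LPC"
codebook = [[rng.gauss(0, 1) for _ in range(10)] for _ in range(64)]
target = synthesize(codebook[17], a)        # target made from codeword 17
assert best_codeword(target, codebook, a) == 17
```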

Perceptual Weighting
The criterion for selecting the best codeword should be perceptual,
not simply the energy of the difference signal!
We perceptually weight both the signal and the synthesized signal
[Block diagram: sn and each CB → LPC synthesis output pass through a perceptual weighting (PW) filter before subtraction; since PW is a filter, we need apply it only once, to the difference]
Perceptual Weighting - cont.
The most important PW effect is masking
Coding error energy near formants is not heard anyway,
so we allow higher error near the formants
but demand lower perceivable error energy
To do this we de-emphasize according to the LPC spectrum!
The simplest filter is 1 - Σ ai z^-i where the ai are the LPC coefficients
How do we take the critical bandwidth into account?
We perform bandwidth expansion (denominator expansion > numerator):
W(z) = (1 - Σ γ1^i ai z^-i) / (1 - Σ γ2^i ai z^-i)   with 1 > γ1 > γ2 > 0
Typical values: γ1 = 0.9, γ2 = 0.6
The bandwidth added to a pole at radius γ is BW = -ln(γ) Fs / π
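The γ-scaling above amounts to scaling each LPC coefficient by γ^i (the coefficients in this sketch are illustrative):

```python
import math

def bw_expand(a, g):
    """Replace a_i by g**i * a_i (i counted from 1): this pulls the
    filter's poles/zeros toward the origin, widening their bandwidth."""
    return [g ** (i + 1) * ai for i, ai in enumerate(a)]

g1, g2 = 0.9, 0.6               # typical values from the slide
a = [1.2, -0.6]                 # illustrative LPC coefficients
numerator = bw_expand(a, g1)    # coefficients of 1 - sum g1^i a_i z^-i
denominator = bw_expand(a, g2)  # coefficients of 1 - sum g2^i a_i z^-i

def added_bw_hz(g, fs=8000.0):
    """The slide's rule of thumb: BW = -ln(g) * Fs / pi."""
    return -math.log(g) * fs / math.pi

# Smaller g means the poles move further inward: more expansion.
assert added_bw_hz(0.9) > added_bw_hz(0.99)
```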
Post-filter
Not related to the subject, but since we are already here …
To increase the subjective quality of many coders,
post-filters are often used to emphasize the formant structure
These have the same form as the perceptual weighting filter
– but with 1 > γ2 > γ1 > 0, typical values γ1 = 0.5, γ2 = 0.75
  (denominator expansion < numerator!)
– the post-filter also reinforces spectral tilt,
  which should then be compensated by an IIR filter
– since the spectral valleys are de-emphasized,
  we should change the PW filter parameters γ1 and γ2
Originally proposed for ADPCM!

Subframes
Coders with large frames (> 10 ms) need a long excitation signal,
and hence a lot of bits to encode it
An alternative is to divide the frame into (2-4) subframes,
each of which has its own codeword excitation
[Diagram: one set of LPC coefficients per frame, one codeword (CW) per subframe 1-4]
We really should recompute the LPC per subframe,
but we can get away with interpolating!

Lookahead
If we are already dividing up the frame,
we can compute the LPC based on a shifted frame
[Diagram: LPC analysis windows straddle the frame boundaries, covering the subframe codewords they serve]
This is called lookahead, and it adds processing delay!
To decrease delay we can use a backward-looking IIR filter,
and then we needn't send/store the LPC coefficients at all!
What happened to the pitch?
Unlike LPC, the ABS CELP coder is excited by a codebook
Where does the pitch come from?
– Random CB: minimization will prefer a "good" excitation
– Regular/multipulse: pulse spacing (but not enough pulses for high pitch)
But this is usually not enough (the residual has pitch periodicity)
Two solutions:
– adaptive codebook (Kleijn et al.)
– long-term prediction (Atal + Singhal)
Both of these reinforce the pitch component

Adaptive CB
The adaptive codebook consists of repetitions of previous excitations
The total excitation is a weighted sum of the stochastic CB (random, MP, RP, etc.)
and the adaptive CB
[Diagram: adaptive CB scaled by gain Ga plus fixed CB scaled by gain Gs drives the LPC synthesis filter]

Long Term Prediction
Use both long-term (pitch predictor) and short-term (LPC) prediction
The long-term predictor may have only one delay, but then a non-integer one:
1 / (1 - β z^-d)
[Block diagram: codebook → gain → pitch predictor → LPC synthesis; the output is subtracted from sn, perceptually weighted, and the error energy computed]
Federal Standard CELP
FS-1016 at 4.8 Kbps has MOS 3.2
Developed by AT&T Bell Labs for the DoD; 144 bits / 30 ms frame
– 10th-order LPC on a 30 ms Hamming window;
  no pre-emphasis; an additional 15 Hz BW expansion (quality and LSP robustness)
– conversion to LSP and nonuniform scalar quantization: 34 bits
– 4 subframes (7.5 ms) with LSP interpolation
– 512-entry fixed CB (static -1,0,+1 from center-clipped Gaussian)
  + 5-bit nonuniformly quantized gain: 56 bits
– 256-entry adaptive CB (8 bits) + 5-bit nonuniformly quantized gain: 48 bits;
  optional non-integer delays
– perceptual weighting
– postfilter + spectral tilt compensation, removable for noise or tandeming
– FEC 4 bits, SYNC 1 bit, reserved 1 bit
G.728
16 Kbps with MOS similar to G.726 at 32 Kbps
Low 5-sample (0.625 ms) delay
High computational complexity (about 30 MIPS)
CELP with backward LPC:
– LPC order 50 (why not? we don't transmit side-information!)
– frame of 2.5 ms (20 samples)
– 4 subframes of 0.625 ms (5 samples)
– perceptual weighting
– only the 10-bit index into the fixed CB is transmitted
10 bits per 0.625 ms is 16 Kbps!

G.729
8 Kbps toll-quality coder for DSVD and VoIP
Computational complexity 20 MIPS, but G.729a is about 10 MIPS
– frame 10 ms (80 samples), lookahead 5 ms (1 subframe)
– LPC, LSP, VQ, LSP interpolation
– CS-ACELP CB (interleaved single-pulse permutation), 4 [+1] pulses / subframe
– closed-loop pitch prediction and adaptive CB (delay + gain)
– 2 (40-sample) subframes per frame
For each frame the encoder outputs 80 bits:
LSF coefficients 18, adaptive CB 5, pulse positions 26,
pitch 8, parity check 1, pulse signs 8, gain CB 14

G.729 annexes
A: compatible reduced-complexity encoder with minimal MOS reduction
B: VAD and CNG
C: floating-point implementation
D: 6.4 Kbps version
   similar to G.729 but 64 output bits per frame; quality better than G.726 at 24 Kbps
   LSF coefficients 18b, pitch + adaptive CB 8+4b, gain CB 12b, fixed CB 22b
E: 11.8 Kbps coder for high quality and music
G.723.1
6.3 Kbps (MP-MLQ) and 5.3 Kbps (ACELP) rates
About 18 MIPS on a DSP
– frame 30 ms (240 samples), lookahead 15 ms
– LPC on 30 ms (240-sample) frames, LSP and VQ
– open-loop pitch computation on half-frames (120 samples)
– excitation on 4 subframes (60 samples each) per frame
– perceptual weighting and harmonic noise weighting
– fifth-order closed-loop pitch predictor
– MP-MLQ: 5 or 6 [+1] pulses / subframe, positions all even or all odd
– ACELP: 4 [+1] pulses / subframe, positions differing by 8
Annex A: VAD-CNG; Annex B: floating-point implementation

Other Coders

MBE coder
LPC-10 makes a hard U/V decision: no mixed voicing
Multi-Band Excitation uses a different excitation:
– harmonics of the pitch frequency
– a frequency-dependent binary U/V decision
– a large number of sub-bands (> 16)
[Figure: voicing decision V as a function of frequency f]
Simultaneous ABS estimation of pitch and spectral envelope
Then the U/V decisions are made based on the spectral fit
Dynamic programming is used for pitch tracking
MBE coder - cont.
DVSI made various MBE, AMBE and IMBE coders for satellite use (INMARSAT)
Bit rates 2.4 - 9.6 Kbps (toll quality at 3.6 Kbps)
Integral FEC for bit-error robustness
As an example, 128 bits for each 20 ms frame:
– pitch 8 bits
– U/V decisions K bits (K < 12)
– spectral amplitudes (DCT) 75-K bits
– FEC (Golay codes) 45 bits
MELP
The DoD wanted a new 2.4 Kbps coder with MOS similar to FS-1016
Main problems with LPC-10:
– voicing determination errors
– no handling of partially voiced speech
Unlike MBE, MELP uses the standard LPC model
The MELP excitation is a pulse train plus random noise
Soft voicing decisions in a small number (5) of sub-bands
– frame 22.5 ms (180 samples)
– 10th-order LPC, 15 Hz BW expansion, LSF, interpolation, VQ
– pitch refinement
– 5 sub-bands (0-500-1000-2000-3000-4000 Hz) with pitch and noise excitation
– FEC
Sinusoidal Transform Coder
McAulay and Quatieri's model:
instead of LPC, use a sum of sine waves
sn = Σi=1..N Ai cos(ωi n + φi)
For each analysis frame (10-20 ms) we need to extract N and the Ai, ωi, φi
Voiced speech:
use the pitch and important harmonics [from the pitch-synchronized STFT]
Unvoiced speech:
use the peaks of the STFT [points where the slope changes from + to -]
At high bit-rates keep magnitudes, frequencies and phases
At low bit-rates the frequencies are constrained and the phases modeled
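A minimal sketch of the synthesis side of the model above (the pitch, harmonic amplitudes, and phases are illustrative):

```python
import math

def stc_synthesize(params, N):
    """Sum-of-sinusoids synthesis: s[n] = sum_i A_i cos(w_i n + phi_i)."""
    return [sum(A * math.cos(w * n + phi) for A, w, phi in params)
            for n in range(N)]

# A crude voiced frame: pitch 200 Hz at fs = 8 kHz with three harmonics.
fs = 8000.0
w0 = 2 * math.pi * 200 / fs
params = [(1.0, w0, 0.0), (0.5, 2 * w0, 0.3), (0.25, 3 * w0, 1.1)]
frame = stc_synthesize(params, 160)   # one 20 ms frame
assert len(frame) == 160
```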

STC - cont.
• The sparse spectrum is updated at regularly spaced times
• Amplitude is linearly interpolated between updates
• The interpolated phase must obey 4 conditions (matching ω and φ at both update points)
[Block diagram: sn → overlapped windowing → FFT → peak picker → spectrum encoder (e.g. all-pole model); decoder: spectrum decoder → sum of sinusoids → sn]

STC - cont.
Tracking the sinusoidal components:
[Figure: frequency tracks over time; tracks are born and die as components appear and disappear]

Waveform Interpolation
Voiced speech is a sequence of pitch-cycle waveforms
The characteristic waveform usually changes slowly with time
It is useful to think of the waveform in 2-D:
[Figure: characteristic waveforms plotted against time and phase within the pitch period]
This waveform can be the speech signal or the LPC residual

WI - cont.
• Per frame, LPC and pitch are extracted
• The characteristic waveform (CW) is represented by features (e.g. DFT coefficients)
• Alignment is by circular shift until maximum correlation
• Voiced and unvoiced segments get separate treatment
[Block diagram: sn → LPC + pitch tracking → characteristic waveform extraction → 2-D CW alignment → quantization; decoder: waveform interpolation → conversion to 1-D → sn]
We can go even lower
The Shannon entropy of speech is 200-500 bps,
so even deeper compression should be possible
There are even lower-rate speech coders (300-600 bps),
but they are not practical at this point:
• extremely high MIPS and memory requirements
• very long latency
• very sensitive to background noise