Voice Compression - National Tsing Hua University


Introduction to Voice Compression
VC Lab 2009
Outline
• Why digitize voice?
• Introduction
• Speech properties
• Performance measurement
• The channel vocoders and LPC-10
• Sinusoidal coders
• Analysis-by-synthesis linear predictive coders and G.723.1
• Quality of service in voice over IP (VoIP)
Why Digitize Voice?
• When voice is digitized:
– Multiplexing is easier
– Signaling is easier
– PCs and other computers can be used
– The voice switch becomes a big computer, and all lines are digital
– Lines are not as noisy
– Line quality can be monitored closely
– New services can be provided
– Lines are more tolerant of noise
– Encryption is possible
• Three steps to digits: sampling, quantization, and coding
– PCM, DPCM, DM, ADPCM, ADM
Introduction
• Sampling: 4 kHz speech bandwidth, 8,000 samples per second
• Bit rate
– original (PCM): 64 kbits/s
– medium-rate: 8-16 kbits/s
– low-rate: 2.4-8 kbits/s
– very-low-rate: < 2.4 kbits/s
• The speech model: speech is produced by forcing air first through an elastic opening, the vocal cords, and then through the laryngeal, oral, nasal, and pharynx passages, and finally out through the mouth cavity.
• Analysis-synthesis process: open-loop and closed-loop
Speech Compression Techniques
• Parametric representations: speech-specific or non-speech-specific
– non-speech-specific (waveform) coders: faithfully reconstruct the time-domain waveform.
– speech-specific (voice) coders: rely on speech models and focus on producing perceptually intelligible speech without necessarily matching the waveform.
• Waveform coding methods
• The channel vocoder methods
• Sinusoidal analysis-synthesis methods
• Analysis-by-synthesis linear predictive methods
[Figure: the glottis (聲門)]
• No matter what language is being spoken, the speech is generated using machinery that is not very different from person to person.
• This machinery has to obey certain physical laws that substantially limit the behavior of the outputs.
Speech Properties
• Speech signals are non-stationary; at best they can be considered quasi-stationary over short segments, typically 5-20 ms.
• Voiced speech is quasi-periodic in the time domain and harmonically structured in the frequency domain, while unvoiced speech is random-like and broadband. The energy of voiced segments is generally higher than that of unvoiced segments.
• The short-time spectrum of voiced speech is characterized by its fine harmonic structure and its formant structure. The fine harmonic structure is a consequence of the quasi-periodicity of speech and may be attributed to the vibrating vocal cords. The formant structure (spectral envelope) is due to the interaction of the source and the vocal tract. The vocal tract consists of the pharynx and the mouth cavity.
[Figure: waveforms and spectra of unvoiced and voiced speech, illustrating the formant structure and the quasi-periodicity of voiced segments]
[Figure: the sound /e/ in "test" (voiced) and the sound /s/ in "test" (unvoiced)]
• The shape of the spectral envelope that "fits" the short-time spectrum of voiced speech is associated with the transfer characteristics of the vocal tract and the spectral tilt (6 dB/octave) due to the glottal pulse.
• The spectral envelope is characterized by a set of peaks, which are called formants. The formants are the resonant modes of the vocal tract.
• For the average vocal tract, there are three to five formants below 5 kHz. The amplitudes and locations of the first three formants, usually occurring below 3 kHz, are quite important in both speech synthesis and perception.
• Higher formants are also important for wideband and unvoiced speech representations.
• The properties of speech are related to the physical speech production system. Voiced speech is produced by exciting the vocal tract with periodic glottal air pulses generated by the vibrating vocal cords. The frequency of the periodic pulses is referred to as the fundamental frequency or pitch.
• Unvoiced speech is produced by forcing air through a constriction in the vocal tract. Nasal sounds (e.g., /n/) are due to the acoustical coupling of the nasal tract to the vocal tract, and plosive sounds (e.g., /p/) are produced by abruptly releasing air pressure that was built up behind a closure in the tract.
Historical Perspective
• The first analysis-synthesis method [Dudley 1939]: analyze speech in terms of its pitch and spectrum, and synthesize it by exciting a bank of ten analog band-pass filters (representing the vocal tract) with periodic (buzz, voiced) or random (hiss, unvoiced) excitation.
• Pulse Code Modulation (PCM), Differential PCM, Adaptive DPCM
• Linear speech source-system production model [Fant, 1960]
• Linear prediction analysis: a process where the present speech sample is predicted by a linear combination of previous samples.
• Homomorphic analysis: a method that can be used for separating signals that have been combined by convolution.
• Short-Time Fourier Transform (STFT) [Flanagan and Golden]: analysis-synthesis of speech using the STFT.
• Transform coding, sub-band coding
• Sinusoidal analysis-synthesis of speech [McAulay and Quatieri]
• Multiband excitation vocoders [Griffin and Lim]
• Multi-pulse and vector excitation schemes for LPC [Atal et al.]
• Vector quantization (VQ) [Gersho and Gray]
– Vector quantization proved to be very useful in encoding LPC parameters.
• Code Excited Linear Prediction (CELP): Atal and Schroeder proposed a linear prediction algorithm with stochastic vector excitation. The stochastic excitation in CELP is determined using a perceptually weighted closed-loop (analysis-by-synthesis) optimization.
• Groupe Spécial Mobile (GSM): a standard that uses a 13 kbits/s regular pulse excitation algorithm.
Performance Measurement
• Issues: the bit rate, the quality of reconstructed speech, the complexity of the algorithm, and the delay introduced.
• Four kinds of speech quality: broadcast (~64 kbits/s), network or toll (~16 kbits/s), communications (~4.8 kbits/s), and synthetic (< 4.0 kbits/s).
• Signal-to-noise ratio (SNR) and segmental SNR (SEGSNR): SEGSNR computes the SNR for each N-point segment and averages the results (see the sketch below).
• Perceptual criteria: Diagnostic Rhyme Test (DRT), Diagnostic Acceptability Measure (DAM), and Mean Opinion Score (MOS) are based on listener ratings.
• MOS: involves 12 to 24 listeners who are instructed to rate phonetically balanced records according to a 5-level quality scale. Excellent speech quality implies that coded speech is indistinguishable from the original and without perceptible noise.
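As an illustration of the SEGSNR measure above, here is a minimal C sketch (not from the original slides): it computes the SNR over consecutive N-point segments and averages the per-segment values in the dB domain, which is what distinguishes SEGSNR from a single global SNR.

#include <math.h>
#include <stddef.h>

/* Segmental SNR in dB: average of per-segment SNRs.
   x: original samples, y: decoded samples, len: total length,
   N: segment size (e.g., 160 samples = 20 ms at 8 kHz). */
double segsnr(const short *x, const short *y, size_t len, size_t N)
{
    double sum_db = 0.0;
    size_t nseg = 0;
    for (size_t s = 0; s + N <= len; s += N) {
        double sig = 0.0, err = 0.0;
        for (size_t i = s; i < s + N; i++) {
            double d = (double)x[i] - (double)y[i];
            sig += (double)x[i] * (double)x[i];
            err += d * d;
        }
        if (err > 0.0 && sig > 0.0) {          /* skip silent or error-free segments */
            sum_db += 10.0 * log10(sig / err);
            nseg++;
        }
    }
    return nseg ? sum_db / (double)nseg : 0.0;
}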
[Table: codec bit rates — iLBC: 13.3 (or 15.2) kbits/s; used in Skype]
[Figure: codec comparison. Source: Speech Coding: A Tutorial Review]
Outline
• Why digitize voice?
• Introduction
• Speech properties
• Performance measurement
• The channel vocoders and LPC-10
• Analysis-by-synthesis linear predictive coders and G.723.1
• Quality of service in voice over IP (VoIP)
• iLBC
The Channel Vocoder: the Original
• In the channel vocoder, each segment of input speech is analyzed using a bank of band-pass filters called the analysis filters.
• The energy at the output of each filter is estimated at fixed intervals (see the sketch below).
• A decision is made as to whether the speech is voiced or unvoiced.
• The period of the fundamental harmonic is called the pitch period.
• The synthesized output matches the frequency profile of the input speech.
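To make the analysis step concrete, here is a small C sketch (an illustration, not part of the original lecture): it estimates the short-time energy of each band-pass filter output over one analysis interval, which is the per-band information a channel vocoder quantizes and transmits. The band-pass filtering itself is assumed to have been done already, with bands[b] holding the filtered samples of band b.

#include <stddef.h>

#define NBANDS 16   /* illustrative number of analysis bands */

/* Mean energy of each band-filtered signal over one analysis interval.
   bands[b][0..N-1] are the output samples of band-pass filter b. */
void band_energies(const double bands[NBANDS][160], size_t N,
                   double energy[NBANDS])
{
    for (int b = 0; b < NBANDS; b++) {
        double e = 0.0;
        for (size_t n = 0; n < N; n++)
            e += bands[b][n] * bands[b][n];
        energy[b] = e / (double)N;   /* quantize and transmit this value */
    }
}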
The Channel Vocoder Receiver
[Figure: block diagram of the channel vocoder receiver]
The Linear Predictive Coder
• Instead of the vocal tract being modeled by a bank of filters, it is modeled as a single linear filter whose output y_n is related to the input ε_n by

y_n = Σ_{i=1}^{M} a_i y_{n−i} + G ε_n,

where G is called the gain of the filter.
• The input to the vocal tract filter is either the output of a random noise generator or a periodic pulse generator.
The Model for Speech Synthesis
[Figure: LPC speech synthesis model — a periodic pulse generator (voiced) or random noise generator (unvoiced) excites the vocal tract filter with gain G]
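The synthesis model above can be summarized in a few lines of C. This is a minimal sketch under stated assumptions (filter order M, coefficients a[1..M] following the sign convention of the difference equation above, and an illustrative pitch period in samples); it illustrates the difference equation itself, not any particular standard.

#include <stdlib.h>

#define M 10   /* filter order */

/* One frame of LPC synthesis: y[n] = sum_{i=1..M} a[i]*y[n-i] + G*e[n].
   voiced != 0: periodic pulse excitation with the given pitch period;
   otherwise white noise. Samples before the frame are taken as zero. */
void lpc_synth(const double a[M + 1], double G, int voiced,
               int pitch, double y[], int len)
{
    for (int n = 0; n < len; n++) {
        double e = voiced ? (n % pitch == 0 ? 1.0 : 0.0)
                          : ((double)rand() / RAND_MAX - 0.5);
        double s = G * e;
        for (int i = 1; i <= M; i++)
            if (n - i >= 0) s += a[i] * y[n - i];
        y[n] = s;
    }
}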
LPC-10
• 8000 samples per second, 180-sample segments, corresponding to 22.5 ms.
• The V/U decision: energy and the number of zero crossings.
• The voicing decisions of the neighboring frames are considered to avoid isolated single voiced frames.
• Estimating the pitch period: the average magnitude difference function (AMDF),

AMDF(P) = (1/N) Σ_{i=k0+1}^{k0+N} | y_i − y_{i−P} |.
[Figure: AMDF function for the sound /e/ in "test"; AMDF function for the sound /s/ in "test"]
AMDF
• For a voiced signal, not only do we have a minimum when P equals the pitch period, but the difference between the minimum and average values is quite small.
• We do not have to evaluate the AMDF for all possible values of P. The pitch period lies between 2.5 and 19.5 ms, i.e., between 20 and 160 samples at 8000 samples per second (see the sketch below).
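A direct C sketch of the AMDF-based pitch search described above (illustrative, not LPC-10's actual routine): it evaluates AMDF(P) only for lags between 20 and 160 samples and returns the lag with the smallest value.

#include <stdlib.h>

/* AMDF pitch search over lags 20..160 (2.5-19.5 ms at 8 kHz).
   y points at the current segment; N samples are averaged per lag.
   Samples y[-160..-1] (the previous segment) must be valid. */
int amdf_pitch(const short *y, int N)
{
    int best_P = 20;
    double best = 1e30;
    for (int P = 20; P <= 160; P++) {
        double sum = 0.0;
        for (int i = 0; i < N; i++)
            sum += abs(y[i] - y[i - P]);
        double amdf = sum / N;
        if (amdf < best) { best = amdf; best_P = P; }
    }
    return best_P;   /* estimated pitch period in samples */
}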
Vocal Tract Filter (1)
• In the analysis phase, the filter coefficients that best match the segment being analyzed in the mean squared error sense are calculated:

e_n^2 = ( y_n − Σ_{i=1}^{M} a_i y_{n−i} − G ε_n )^2.

• Setting the derivative with respect to each a_j to zero gives M equations:

∂/∂a_j E[ ( y_n − Σ_{i=1}^{M} a_i y_{n−i} − G ε_n )^2 ] = 0
⇒ −2 E[ ( y_n − Σ_{i=1}^{M} a_i y_{n−i} − G ε_n ) y_{n−j} ] = 0
⇒ Σ_{i=1}^{M} a_i E[ y_{n−i} y_{n−j} ] = E[ y_n y_{n−j} ].    (*)
Vocal Tract Filter (2)
• In order to solve (*), we need to be able to estimate E[y_{n−i} y_{n−j}]. Two methods are in use: autocorrelation and autocovariance.
• In the autocorrelation approach, we assume that the {y_n} sequence is stationary, and therefore

E[y_{n−i} y_{n−j}] = R_yy(|i−j|).

We also assume that the {y_n} sequence is zero outside the segment. Thus, the autocorrelation function is estimated as

R_yy(k) = Σ_{n=n0+1+k}^{n0+N} y_n y_{n−k}.
Vocal Tract Filter (3)
• The M equations can be written in matrix form as

RA = P,

where

R = [ R_yy(0)     R_yy(1)     R_yy(2)     ...  R_yy(M−1)
      R_yy(1)     R_yy(0)     R_yy(1)     ...  R_yy(M−2)
      R_yy(2)     R_yy(1)     R_yy(0)     ...  R_yy(M−3)
      ...
      R_yy(M−1)   R_yy(M−2)   R_yy(M−3)   ...  R_yy(0)  ],

A = [ a_1, ..., a_M ]^T   and   P = [ R_yy(1), ..., R_yy(M) ]^T.
Vocal Tract Filter (4)
• The matrix equation can be solved directly to find the filter coefficients: A = R^{-1} P.
• Note that R is Toeplitz, so we can obtain a recursive solution that is computationally very efficient; the Levinson-Durbin algorithm is the most efficient one (see the LPC code near the end of these notes for an implementation).
• The assumption of stationarity is not valid for speech signals. If we discard this assumption, the equations change: the term E[y_{n−i} y_{n−j}] is now a function of both i and j.
Vocal Tract Filter (5)
• The matrix equation becomes CA = S, where c_ij = E[y_{n−i} y_{n−j}]:

C = [ c_11  c_12  ...  c_1M
      c_21  c_22  ...  c_2M
      ...
      c_M1  c_M2  ...  c_MM ],     S = [ c_10, ..., c_M0 ]^T.

• The elements are estimated as

c_ij = Σ_{n=n0+1}^{n0+N} y_{n−i} y_{n−j}.

• Note that C is symmetric but no longer Toeplitz. The equations are generally solved by the Cholesky decomposition (a sketch follows this slide).
• LPC-10 uses the covariance method. If both of the first two coefficients have very small values, the voicing decision is unvoiced.
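For completeness, here is a compact C sketch of solving the covariance normal equations CA = S by Cholesky decomposition. This is a generic symmetric-positive-definite solver (not LPC-10's actual source): factor C = L L^T, then solve L z = S by forward substitution and L^T A = z by back substitution.

#include <math.h>

#define M 10   /* filter order */

/* Solve C a = s for symmetric positive definite C via Cholesky.
   Returns 0 on success, -1 if C is not positive definite. */
int cholesky_solve(double C[M][M], const double s[M], double a[M])
{
    double L[M][M] = {{0}};
    for (int i = 0; i < M; i++) {
        for (int j = 0; j <= i; j++) {
            double sum = C[i][j];
            for (int k = 0; k < j; k++) sum -= L[i][k] * L[j][k];
            if (i == j) {
                if (sum <= 0.0) return -1;
                L[i][i] = sqrt(sum);
            } else {
                L[i][j] = sum / L[j][j];
            }
        }
    }
    double z[M];
    for (int i = 0; i < M; i++) {               /* forward: L z = s */
        double sum = s[i];
        for (int k = 0; k < i; k++) sum -= L[i][k] * z[k];
        z[i] = sum / L[i][i];
    }
    for (int i = M - 1; i >= 0; i--) {          /* backward: L^T a = z */
        double sum = z[i];
        for (int k = i + 1; k < M; k++) sum -= L[k][i] * a[k];
        a[i] = sum / L[i][i];
    }
    return 0;
}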
Transmitting the Parameters
• Voicing decision: 1 bit.
• Pitch period: quantized to one of 60 different values using a log quantizer.
• Vocal tract filter parameters: a 10th-order filter for voiced speech and a 4th-order filter for unvoiced speech.
• Gain G: found from the root mean squared (rms) value of the segment and quantized using 5-bit log quantization.
• In total 54 bits per frame; at 8000/180 ≈ 44.4 frames per second, this gives 54 × 8000/180 = 2400 bits per second.
Synthesis, LPC-10
• The voiced frames are generated by exciting the received vocal tract filter with a locally stored waveform.
• This waveform is 40 samples long. It is truncated or padded with zeros depending on the pitch period.
• If the frame is unvoiced, the vocal tract filter is excited by a pseudorandom number generator.
• The use of only two kinds of excitation signals gives the voice an artificial quality. This approach also suffers when used in noisy environments.
Excitation Signals
• The most important factor in generating natural-sounding speech is the excitation signal.
• Solutions: code-excited LP (CELP), the sinusoidal coders, multi-pulse LP coders (MP-LPC), etc.
• CELP makes use of a codebook of excitation signals.
• The sinusoidal coders make use of an excitation signal that is the sum of sine waves of arbitrary amplitudes, frequencies, and phases.
• Standards: CELP: FS 1016, G.728, G.729; MP-LPC: G.723.1; RPE-LTP: GSM.
G.728
• A CELP coder with a coder delay of about 2 ms operating at 16 kbits/s.
• To lower the coding delay, the size of each segment has to be reduced significantly; G.728 uses a segment of five samples.
• G.728 does away with the pitch filter; instead it uses a 50th-order vocal tract filter. The algorithm obtains the vocal tract filter parameters in a backward-adaptive manner; they are updated every fourth frame, i.e., every 20 samples.
• 10 bits code the excitation signal of each five-sample segment: 3 bits encode the gain using a predictive encoding scheme, and 7 bits give the codebook index (a 128-entry codebook). This checks out against the rate: 10 bits per 5 samples is 2 bits/sample, and at 8000 samples/s that is 16 kbits/s.
[Figure: G.728 16 kbits/s speech encoder]
[Figure: G.728 16 kbits/s speech decoder]
Mixed Excitation Linear Prediction (MELP)
The excitation signal is no longer simply noise or a periodic pulse but a multi-band mixed excitation.
Analysis-by-Synthesis vs. Analysis-and-Synthesis
• AaS methods deliver acceptable quality at bit rates of 9.6-16 kbits/s; however, due to the lack of a feedback control mechanism, they cannot avoid error propagation.
• AbS (closed-loop analysis): the parameters are extracted and encoded by explicitly minimizing a measure of the difference between the original and the currently reconstructed speech.
• AbS-LPC consists of a time-varying filter, excitation signal processing, and a perceptually based minimization mechanism. The time-varying filter is composed of the LPC filter (1/A(z)) and the long-term prediction filter (adaptive codebook); excitation signal processing makes use of a fixed codebook or multi-pulse excitation.
– LTP captures the longer-term redundancies at the pitch frequency.
Analysis-by-Synthesis LPC
[Figure: AbS-LPC block diagram — adaptive codebook (Adapt. CB) and fixed codebook (Fixed CB) excitations feed the synthesis filter 1/Aw(z); the difference from the weighted input W(z) drives the error minimization]
LTP Residual Signal
[Figure: uv.pcm — waveforms of y[n] (source), t[n] (perceptually weighted signal), a[n] (adaptive codebook output), and r[n] = t[n] − a[n] (LTP residual) over 960 samples]
Simulated Decoder
[Figure 1/G.723.1: block diagram of the speech coder; for each block the corresponding reference number is indicated. Blocks: Framer (2.2), High Pass Filter (2.3), LPC Analysis (2.4), LSP Quantizer (2.5), LSP Decoder (2.6), LSP Interpolator (2.7), Formant Perceptual Weighting (2.8), Pitch Estimator (2.9), Harmonic Noise Shaping (2.11), Impulse Response Calculator (2.12), Zero Input Response (2.13), Pitch Predictor (2.14), MP-MLQ/ACELP (2.15, 2.16), Excitation Decoder (2.17), Pitch Decoder (2.18), Memory Update (2.19). Signals include y[n], s[n], x[n], e[n], w[n], z[n], p[n], t[n], v[n], u[n], r[n], f[n], and the pitch lags and gains L_i, b_i.]
Variables
• y[n]: source signal, 4 frames (16 sub-frames)
• t[n]: perceptually weighted signal
• a[n]: the output of the adaptive codebook (long-term prediction)
• r[n]: t[n] − a[n], the residual signal to be predicted by the fixed codebook
Main Steps
1. Initialization of the LPC filter and the long-term prediction filter: zeros or random numbers.
2. LPC analysis on a source frame y[n]: LPC coefficients.
3. Divide the frame into several sub-frames; for each sub-frame:
• Compute the long-term prediction filter (LTP coefficients) in closed loop and get the residual signal r[n] = t[n] − a[n].
• Find an approximate excitation signal for this residual, predicted by the fixed codebook (see the sketch below).
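The closed-loop LTP search in step 3 can be illustrated with a toy C sketch (a schematic under simplifying assumptions, not G.723.1 source code): for each candidate integer lag, the past excitation is used as the adaptive-codebook vector, the optimal gain is computed in closed loop, and the lag minimizing the weighted error energy is kept. In the real coder the candidate vector is first filtered through the weighted synthesis filter before matching; that filtering is omitted here for brevity.

/* Toy closed-loop adaptive-codebook (LTP) search.
   t:   perceptually weighted target for this sub-frame (length L)
   exc: excitation history; exc[-MAX_LAG..-1] must be valid past excitation
   Finds the lag in [MIN_LAG, MAX_LAG] and the gain minimizing
   ||t - g * exc(shifted by -lag)||^2. */
#define L 60
#define MIN_LAG 18
#define MAX_LAG 145

void ltp_search(const double *t, const double *exc,
                int *best_lag, double *best_gain)
{
    double best_err = 1e30;
    for (int lag = MIN_LAG; lag <= MAX_LAG; lag++) {
        double ct = 0.0, cc = 0.0;
        for (int n = 0; n < L; n++) {
            int idx = n - lag;
            while (idx >= 0) idx -= lag;   /* periodic extension for short lags */
            double c = exc[idx];           /* adaptive codebook vector sample */
            ct += c * t[n];
            cc += c * c;
        }
        if (cc <= 0.0) continue;
        double g = ct / cc;                /* closed-loop optimal gain */
        double err = -ct * ct / cc;        /* error energy minus ||t||^2 */
        if (err < best_err) { best_err = err; *best_lag = lag; *best_gain = g; }
    }
}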
G.723.1 (high-rate) for Signal 'a'
[Figure: jj5_960a.pcm — waveforms of y[n], t[n], a[n], and r[n] over 960 samples]
G.723.1 (high-rate) for Signal 'sh'
[Figure: unvoice.pcm — waveforms of y[n], t[n], a[n], and r[n] over 960 samples]
t[n] is similar to r[n]; in this case the excitation signal is approximately generated by the fixed codebook.
G.723.1 (high-rate) for Signal ('he' to 'i' in 'she is')
[Figure: uv.pcm — waveforms of y[n], t[n], a[n], and r[n] over 960 samples]
MP-LPC Approach
[Figure: excitation signal convolved with the impulse response approximates the residual signal]
The size of the full excitation search space is
#(odd/even grid) × #(positions) × #(signs) × #(gains)
= 2 × C(30, 6) × 2^6 × 24
= 2 × 593,775 × 64 × 24 = 1,824,076,800,
far too many combinations to search exhaustively.
G.723.1 MP-MLQ
• Sub-optimal pulse-by-pulse sequential search.
• Select the candidate pulse position using the cross-correlation between the residual signal and the impulse response.
• Use the first pulse to determine the possible gain values (−3.2 dB, +0 dB, +3.2 dB, +6.4 dB).
• Only 8 candidate excitation signals are selected (see the sketch below).
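The pulse-by-pulse search can be sketched in C as below (a schematic of the technique, not the reference implementation; real MP-MLQ restricts positions to an odd/even grid and uses one quantized gain for all pulses, determined from the first pulse). After each pulse is placed, its contribution is removed from the cross-correlation c using the autocorrelation phi of the impulse response, which is exactly the recursion c^(i+1) = c^(i) − G·phi derived on the later MP-MLQ slides.

#include <math.h>
#include <stdlib.h>

#define N 60       /* sub-frame length */
#define NPULSES 6  /* pulses per sub-frame (high rate, even sub-frames) */

/* Sequential multi-pulse search (schematic).
   c[j]  : cross-correlation of the residual with the impulse response at j
   phi[k]: autocorrelation of the impulse response, k = 0..N-1 (phi[0] > 0)
   pos[], gain[]: selected pulse positions and gains */
void mp_search(double c[N], const double phi[N],
               int pos[NPULSES], double gain[NPULSES])
{
    for (int p = 0; p < NPULSES; p++) {
        int best = 0;
        for (int j = 1; j < N; j++)          /* position maximizing |c[j]| */
            if (fabs(c[j]) > fabs(c[best])) best = j;
        pos[p]  = best;
        gain[p] = c[best] / phi[0];          /* optimal gain for this pulse */
        for (int j = 0; j < N; j++)          /* remove this pulse's contribution */
            c[j] -= gain[p] * phi[abs(j - best)];
    }
}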
MP-MLQ Coding Result
[Figure: uv.pcm — waveforms of r[n], h[n], m[n], c[n], and d[n] over 960 samples]
G.723.1 MP-MLQ
The cross-correlation between the residual r and the impulse response m is

c_i = Σ_{x=i}^{n−1} r_x m_{x−i},

or, in matrix form,

[ c_0     ]   [ r_0      r_1      r_2  ...  r_{n−1} ] [ m_0     ]
[ c_1     ]   [ r_1      r_2     ...   r_{n−1}   0  ] [ m_1     ]
[ ...     ] = [ ...                                 ] [ ...     ]
[ c_{n−2} ]   [ r_{n−2}  r_{n−1}  0   ...        0  ] [ m_{n−2} ]
[ c_{n−1} ]   [ r_{n−1}  0        0   ...        0  ] [ m_{n−1} ]

i.e., c^(1) = R_0 m, where (R_0)_{j,k} = r_{j+k} (zero when j+k > n−1).
G.723.1 MP-MLQ
After a pulse of gain G_{P1} is placed at position P1, its contribution is removed from the residual, and the correlation matrix is updated as

R_1 = R_0 − G_{P1} M_{P1},

where M_i has the same structure as R_0 but is built from the impulse response with its index shifted by i:

(M_i)_{j,k} = m_{j+k−i}   if 0 ≤ j+k−i ≤ n−1,   0 otherwise,

e.g.,

M_0 = [ m_0      m_1      m_2  ...  m_{n−1}
        m_1      m_2      ...  m_{n−1}   0
        ...
        m_{n−1}  0        0   ...       0 ].
G.723.1 MP-MLQ
The correlations for successive pulses are then obtained recursively:

c^(1) = R_0 m
c^(2) = R_1 m = (R_0 − G_{P1} M_{P1}) m = R_0 m − G_{P1} M_{P1} m = c^(1) − G_{P1} φ_{P1}
c^(3) = R_2 m = (R_1 − G_{P1} M_{P2}) m = R_1 m − G_{P1} M_{P2} m = c^(2) − G_{P1} φ_{P2}
...
c^(6) = c^(5) − G_{P1} φ_{P5},

where φ_{Pi} = M_{Pi} m, so placing each new pulse only requires subtracting a scaled, shifted autocorrelation of the impulse response.
The QoS Services in VoIP
• The QoS parameters:
– Bandwidth (minimum)
– Delay (maximum)
– Jitter (delay variation; an estimator sketch follows this slide)
– Packetization Interval (PI)
– Information loss: error detection, error recovery
– Availability (reliability): How much is enough? How quickly must service be restored?
– Security
• Bandwidth: the number of bits in the frame divided by the time elapsed from the moment the first bit of a given frame leaves the network until the moment the last bit leaves the network.
• Delay: the time elapsed from the moment the first bit of a given frame enters the network until the moment the first bit leaves the network.
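Jitter (delay variation) is often tracked with the running estimator defined for RTP in RFC 3550; the C sketch below shows that formula. The transit time is the difference between the arrival time and the RTP timestamp, and the estimate is smoothed with a gain of 1/16.

/* RTP-style interarrival jitter estimate (RFC 3550, section 6.4.1).
   transit = arrival time - RTP timestamp, in timestamp units. */
static double jitter = 0.0;
static double prev_transit = 0.0;

void update_jitter(double transit)
{
    double d = transit - prev_transit;
    prev_transit = transit;
    if (d < 0) d = -d;
    jitter += (d - jitter) / 16.0;   /* J = J + (|D| - J)/16 */
}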
Outline
• Why digitize voice?
• Introduction
• Speech properties
• Performance measurement
• The channel vocoders and LPC-10
• Analysis-by-synthesis linear predictive coders and G.723.1
• Quality of service in voice over IP (VoIP)
• iLBC
iLBC
• Exploits long-term linear predictive coding without introducing data dependency across frame boundaries.
• Loss issues in CELP:
– The adaptive codebook in the decoder will differ from the one in the encoder.
– CELP encoding results in a decoded signal with a noisy character in voiced regions, especially for high-pitched voices.
• iLBC applies the adaptive codebook both forward and backward in time, starting from a segment inside the speech frame (the start state vector).
– This segment will typically capture at least one dominating pitch pulse when the segment contains voiced speech.
– The location and waveform of the start state are encoded and transmitted for each frame.
iLBC, 240-sample case
[Figure: a 240-sample PCM frame]
• A 240-sample frame is divided into six 40-sample subframes.
• Two LPC analyses are performed: one for a window centered toward the beginning of the frame and one for a window centered toward the end of the frame.
• The two consecutive subframes having the highest residual energy are identified.
• Subsequently, the 57 leading or trailing samples of these two subframes are chosen as the target for start state encoding. The start state is encoded using 6 bits for a scale normalization and 3 bits per sample.
Bit Allocation
[Table: iLBC bit allocation per frame]
Robustness vs. G.729A
[Audio examples: original, iLBC, G.729A]
iLBC
• For each of the LPC analyses, a set of line-spectral frequencies (LSFs) is obtained, quantized, and interpolated to obtain LSF coefficients for each sub-block.
• Subsequently, the LPC residual is computed by using the quantized and interpolated LPC analysis filters.
• A dynamic codebook encoding procedure is used to encode
– the 23/22 (20 ms/30 ms) remaining samples in the two sub-blocks containing the start state;
– the sub-blocks after the start state in time;
– the sub-blocks before the start state in time.
– Thus, the encoding target can be either the 23/22 samples remaining of the two sub-blocks containing the start state or a 40-sample sub-block.
iLBC
• The codebook coding is based on an adaptive codebook built from a codebook memory that contains decoded LPC excitation samples from the already encoded part of the block.
• The codebook search method employs noise shaping derived from the LPC filters, and the main decision criterion is to minimize the squared error between the target vector and the code vectors.
• As codebook encoding with squared-error matching is known to produce a coded signal of less power than the scalar-quantized start state signal, a gain re-scaling method is implemented by a refined search for a better set of codebook gains in terms of power matching after encoding.
iLBC - Decoder
• iLBC first decodes the start state, then the subframes forward in time, and finally the subframes backward in time.
• When a packet is not received in time for playback, packet loss concealment is applied.
• Source code: http://www.ietf.org/rfc/rfc3951.txt
MOS
[Figure: MOS comparison of codecs]
LPC Code

The analysis-synthesis model implemented below:

y[n] = x[n] − Σ_{i=1}^{Order} a_i y[n−i]
H(z) = Y(z)/X(z) = 1 / (1 + Σ_{i=1}^{Order} a_i z^{−i})

/* Memory-mapped DSP serial-port data registers (receive / transmit). */
#define DRR ((volatile unsigned int *) 0x0020)
#define DXR ((volatile unsigned int *) 0x0021)

extern int Wave;
int c_stack[40];          /* scratch/stack area (target-specific) */

/*==== Global variables ====*/
#define Frame 80          /* samples per frame */
#define Order 12          /* LPC order */
int Cursor = 0, Flag = 0, Data0[Frame], Data1[Frame], *Data, i, j, k, l;
float Lpc[Order+1], *Coeff, fBuf[Order+1] = {0};
float r[Order+1], sum, rho[Order+1], temp, aa[Order+1][Order+1], Sum;
extern void Mult();

void main() { }           /* all work is done in the Mult handler */

/* Called once per sample: double-buffered I/O plus, at each frame
   boundary, LPC analysis of the captured frame and resynthesis. */
void Mult()
{
    /* Ping-pong I/O: play out one buffer while recording into the other. */
    if (Flag == 0) { *DXR = Data0[Cursor] & 0xFFFC; Data1[Cursor] = *DRR; Cursor++; }
    else           { *DXR = Data1[Cursor] & 0xFFFC; Data0[Cursor] = *DRR; Cursor++; }

    if (Cursor >= Frame) {
        Cursor = 0; Flag ^= 1;
        if (Flag == 1) { Data = (int *)Data0; Coeff = (float *)Lpc; }
        else           { Data = (int *)Data1; Coeff = (float *)Lpc; }

        /*===== Estimate the LPC model =====*/
        /* Autocorrelation coefficients r[k] */
        for (k = 0; k <= Order; k++) {
            sum = 0.;
            for (j = 0; j < Frame - k; j++)
                sum += (float)Data[j] * (float)Data[j+k];
            r[k] = sum / (float)Frame;
        }

        /* Solve the Yule-Walker equations with the Levinson-Durbin algorithm */
        aa[1][1] = -r[1] / r[0];
        rho[1] = (1.0 - aa[1][1]*aa[1][1]) * r[0];
        for (i = 2; i <= Order; i++) {
            temp = -r[i];
            for (l = 1; l < i; l++) temp -= aa[l][i-1] * r[i-l];
            aa[i][i] = temp / rho[i-1];
            for (k = 1; k < i; k++) aa[k][i] = aa[k][i-1] + aa[i][i]*aa[i-k][i-1];
            rho[i] = (1.0 - aa[i][i]*aa[i][i]) * rho[i-1];
        }
        for (i = 1; i <= Order; i++) Coeff[i-1] = aa[i][Order];

        /*===== Synthesize speech with the LPC model =====*/
        /* Excitation: an impulse of amplitude 1000 every 20 samples
           (a 400 Hz pitch at 8 kHz sampling). */
        for (i = 0; i < Frame; i++) {
            if ((i % 20) == 0) Sum = 1000.0; else Sum = 0.0;
            for (j = 0; j < Order; j++) Sum -= fBuf[j] * Coeff[j];
            for (k = Order; k > 0; k--) fBuf[k] = fBuf[k-1];   /* shift filter memory */
            fBuf[0] = Sum;
            Data[i] = (int)(Sum + 0.5);
        }
    }
}
Encoder Principles (1)
• The input to the encoder SHOULD be 16-bit uniform PCM sampled at 8 kHz.
• It SHOULD be partitioned into blocks of BLOCKL=160/240 samples for the 20/30 ms frame sizes.
• Each block is divided into NSUB=4/6 consecutive sub-blocks of SUBL=40 samples each.
• For the 30 ms frame size, the encoder performs two LPC_FILTERORDER=10 linear-predictive coding (LPC) analyses.
– The first analysis applies a smooth window centered over the second sub-block and extending to the middle of the fifth sub-block.
– The second LPC analysis applies a smooth asymmetric window centered over the fifth sub-block and extending to the end of the sixth sub-block.
Encoder Principles (2)
• For each of the LPC analyses, a set of line-spectral frequencies (LSFs) is obtained, quantized, and interpolated to obtain LSF coefficients for each sub-block.
• Subsequently, the LPC residual is computed by using the quantized and interpolated LPC analysis filters.
• The two consecutive sub-blocks of the residual exhibiting the maximal weighted energy are identified.
• Within these two sub-blocks, the start state (segment) is selected from two choices: the first 57/58 samples or the last 57/58 samples of the two consecutive sub-blocks.
• The selected segment is the one of higher energy. The start state is encoded with scalar quantization.
iLBC encoder flow (1)
1. Pre-process speech with a HP filter, if needed.
2. Compute LPC parameters, quantize, and interpolate.
3. Use the analysis filters on the speech to compute the residual.
4. Select the position of the 57/58-sample start state.
5. Quantize the 57/58-sample start state with scalar quantization.
6. Search the codebook for each sub-frame. Start with the 23/22-sample block, then encode sub-blocks forward in time, and then encode sub-blocks backward in time.
7. Packetize the bits into the payload.
iLBC encoder flow (2)
1. Decode the part of the residual that has been encoded so far, using the codebook without perceptual weighting.
2. Set up the memory by taking data from the decoded residual. This memory is used to construct codebooks. For blocks preceding the start state, both the decoded residual and the target are time-reversed.
3. Filter the memory and the target with the perceptual weighting filter.
4. Search for the best match between the target and the codebook vectors. Compute the optimal gain for this match and quantize that gain.
5. Update the perceptually weighted target by subtracting the contribution from the selected codebook vector (quantized gain times selected vector) from the perceptually weighted memory.
6. Repeat steps 4 and 5 for the two additional stages.
7. Calculate the energy loss due to encoding of the residual. If needed, compensate for this loss by an upscaling and requantization of the gain for the first stage.
iLBC decoder flow
1. Extract the parameters from the bitstream.
2. Decode the LPC and interpolate.
3. Construct the 57/58-sample start state.
4. Set up the memory by using data from the decoded residual. This memory is used for codebook construction.
5. Construct the residual of this sub-frame:
gain[0]*cbvec[0] + gain[1]*cbvec[1] + gain[2]*cbvec[2].
Repeat steps 4 and 5 until the residual of all sub-blocks has been constructed.
6. Enhance the residual with the post filter.
7. Synthesize speech from the residual.
8. Post-process with a HP filter, if desired.
Flow of the iLBC decoder: if a frame was lost
1. If the block is not received, the block substitution is based on a pitch-synchronous repetition of the excitation signal, which is filtered by the last LP filter of the previous block. The previous block's information is stored in the decoder state structure.
2. A correlation analysis is performed on the previous block's excitation signal in order to detect the amount of pitch periodicity and a pitch value.
3. The correlation measure is also used to decide on the voicing level. The excitation in the previous block is used to create an excitation for the block to be substituted, such that the pitch of the previous block is maintained. Therefore, the new excitation is constructed in a pitch-synchronous manner.
Flow of the iLBC decoder: if a frame was lost (cont.)
4. In order to avoid a buzzy-sounding substituted block, a random excitation is mixed with the new pitch-periodic excitation, and the relative use of the two components is computed from the correlation measure (voicing level).
5. For the block to be substituted, the newly constructed excitation signal is then passed through the LP filter to produce the speech that will be substituted for the lost block. For several consecutive lost blocks, the packet loss concealment continues in a similar manner.
6. The correlation measure of the last block received is still used, along with the same pitch value. The LP filters of the last block received are also used again. The energy of the substituted excitation for consecutive lost blocks is decreased, leading to a dampened excitation and therefore to dampened speech.
Dynamic codebook encoding procedure (1)
1. The encoding target can be either the 23/22 samples remaining of the two sub-blocks containing the start state or a 40-sample sub-block.
2. The codebook coding is based on an adaptive codebook built from a codebook memory that contains decoded LPC excitation samples from the already encoded part of the block.
3. These samples are indexed in the same time direction as the target vector, ending at the sample instant prior to the first sample instant represented in the target vector.
Dynamic codebook encoding procedure (2)
4. The codebook is used in CB_NSTAGES=3 stages in a successive refinement approach, and the resulting three code vector gains are encoded with 5-, 4-, and 3-bit scalar quantization, respectively.
5. The codebook search method employs noise shaping derived from the LPC filters, and the main decision criterion is to minimize the squared error between the target vector and the code vectors.
6. Each code vector in this codebook comes from one of CB_EXPAND=2 codebook sections.
– The first section is filled with delayed, already encoded residual vectors.
– The code vectors of the second codebook section are constructed by predefined linear combinations of vectors in the first section of the codebook.
7. As codebook encoding with squared-error matching is known to produce a coded signal of less power than the scalar-quantized start state signal, a gain re-scaling method is implemented by a refined search for a better set of codebook gains in terms of power matching after encoding.
Speech enhancement unit
1. Pitch estimation of each of the two/three new 80-sample blocks.
2. Find the pitch-period-synchronous sequence n (for block k) by a search around the estimated pitch value. Do this for n = 1, 2, 3, -1, -2, -3.
3. Calculate the smoothed residual generated by the six pitch-period-synchronous sequences from the prior step.
4. Check whether the smoothed residual satisfies the criterion.
5. Use the constraint to calculate the mixing factor.
6. Mix the smoothed signal with the unenhanced residual (pssq(n), n=0).
Adaptive Multi-Rate Codec in TD-SCDMA
• Based on ACELP.
• The adaptive codebook vector v(n) is computed by interpolating the past excitation signal u(n) at the optimal integer delay k and phase (fractional delay) t:

v(n) = Σ_{i=0}^{9} u(n−k−i) b60(t+i·6) + Σ_{i=0}^{9} u(n−k+1+i) b60(6−t+i·6),   n = 0,...,39,  t = 0,...,5,

where b60 is the interpolation filter (a sketch follows below).
• The algebraic codebook structure is based on an interleaved single-pulse permutation (ISPP) design. The pulse amplitudes and optimal pulse positions are limited to satisfy the algebraic structure and bit-allocation requirements.
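A C sketch of this fractional-delay interpolation (illustrative only; the actual 60-tap b60 coefficients come from the AMR specification and are represented here by a placeholder array, and it is assumed that u[] has been extended so that every referenced sample is valid):

/* Fractional-delay interpolation of the past excitation at sixth-sample
   resolution. The b60 values below are placeholders, NOT the real
   spec-defined windowed-sinc coefficients. */
static const double b60[61] = { /* spec-defined interpolation taps */ 0 };

/* v[0..39]: adaptive codebook vector; u points at the current sub-frame,
   so u[n-k-i] etc. reference already-reconstructed excitation; t in 0..5. */
void adaptive_cb_vector(double v[40], const double *u, int k, int t)
{
    for (int n = 0; n < 40; n++) {
        double s = 0.0;
        for (int i = 0; i < 10; i++) {
            s += u[n - k - i]     * b60[t + i * 6];
            s += u[n - k + 1 + i] * b60[6 - t + i * 6];
        }
        v[n] = s;
    }
}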