Implementation of a double StereoDipole system on a DSP


Implementation of real-time partitioned convolution on a DSP board
Enrico Armelloni, Christian Giottoli, Angelo Farina.
Industrial Engineering Department - University of Parma
Parco Area delle Scienze 181/A, 43100 Parma – Italy
[email protected]
20 October 2003
WASPAA 2003 - New Paltz, NY
Industrial Engineering Dept.
University of Parma – Italy
Outline:
• Linear convolution;
• Overlap & Save method;
• Uniformly-partitioned Overlap & Save method;
• Software implementation on a DSP board.
Convolution (1):
Convolution of a continuous input signal x(t) with a linear filter characterized by an Impulse Response h(t) yields an output signal y(t):

$$ y(t) = x(t) \otimes h(t) = \int x(t - \tau)\, h(\tau)\, d\tau $$

If the input signal and the Impulse Response are digitally sampled (t = i·Δt) and the Impulse Response has finite length N, we can write:

$$ y(i) = \sum_{j=0}^{N-1} x(i - j)\, h(j) $$
Convolution (2):

$$ y(i) = \sum_{j=0}^{N-1} x(i - j)\, h(j) $$

Multiply and ACcumulate (MAC):
y := 0;
FOR j := 0 TO N-1 DO
    y := y + h[j]·x[i-j];
On a DSP board each MAC instruction is performed in one cycle.
• Core clock = 100 MHz
• Sampling frequency fS = 48 kHz
Upper limit is about 2000 MACs per sample (100 MHz / 48 kHz ≈ 2083).
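The MAC loop and the cycle budget above can be sketched in Python. This is only an illustration of the arithmetic, not the DSP code: the names direct_convolve and mac_budget are made up for this example, and samples before the start of the input are assumed to be zero.

```python
def direct_convolve(x, h):
    """Direct-form FIR filter: y(i) = sum over j of x(i - j) * h(j), taking x(i) = 0 for i < 0."""
    y = []
    for i in range(len(x)):
        acc = 0.0
        for j in range(len(h)):        # one multiply-accumulate (MAC) per tap
            if i - j >= 0:
                acc += h[j] * x[i - j]
        y.append(acc)
    return y


CLOCK_HZ = 100_000_000    # core clock quoted on the slide
SAMPLE_RATE_HZ = 48_000   # sampling frequency quoted on the slide

# With one MAC per cycle, this many taps fit between two consecutive samples.
mac_budget = CLOCK_HZ // SAMPLE_RATE_HZ

demo_output = direct_convolve([1.0, 2.0, 3.0], [1.0, 0.5])
print(demo_output, mac_budget)
```

The budget ignores every other cycle the processor spends (I/O, loop overhead), so the usable number of taps is somewhat lower in practice.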
Convolution (3):
If the Impulse Response is very long, i.e. 2 seconds or more, like an IR measured inside a theatre, h(t) is 96000 points long or more @ 48 kHz, well beyond the limit of about 2000 MACs per sample.
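A quick check of the numbers, as a sketch that assumes one MAC per clock cycle and nothing else running on the core (variable names are illustrative):

```python
SAMPLE_RATE_HZ = 48_000
CLOCK_HZ = 100_000_000
IR_SECONDS = 2

taps = IR_SECONDS * SAMPLE_RATE_HZ               # IR length in points
macs_per_second = taps * SAMPLE_RATE_HZ          # direct convolution, one channel
budget_per_sample = CLOCK_HZ // SAMPLE_RATE_HZ   # MACs available per sample
shortfall = taps / budget_per_sample             # how many times over budget

print(taps, macs_per_second, round(shortfall, 1))
```

Direct time-domain convolution of a 2-second response would need roughly 46 times the available cycles, which is the motivation for moving to the frequency domain.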
Filtering in the frequency domain:
It could be better to operate in the frequency domain:
x(n) → FFT → X(k) → X(k) · H(k) = Y(k) → IFFT → y(n) = x(n) ⊗ h(n)
Problems:
• Filtering can be performed only when all data are available.
• Order of FFT is too high.
Solution:
• Overlap & Save algorithm.
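As a rough illustration of whole-signal frequency-domain filtering, the sketch below uses NumPy (an assumption of this example; the slides do not mention it). Both sequences are zero-padded to a common FFT length so that the spectral product yields a linear rather than a circular convolution. The function name fft_convolve is hypothetical.

```python
import numpy as np


def fft_convolve(x, h):
    """Linear convolution via one FFT, one spectral product and one IFFT."""
    n_out = len(x) + len(h) - 1
    n_fft = 1 << (n_out - 1).bit_length()    # next power of two >= n_out
    X = np.fft.rfft(x, n_fft)                # zero-padded transforms
    H = np.fft.rfft(h, n_fft)
    return np.fft.irfft(X * H, n_fft)[:n_out]


rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
h = rng.standard_normal(64)
max_err = np.max(np.abs(fft_convolve(x, h) - np.convolve(x, h)))
print(max_err)
```

The whole input must be available before the FFT can start, and the FFT length grows with the signal, which are the two problems listed on the slide.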
Overlap & Save algorithm (1):
Multiplication of two DFTs corresponds to a circular convolution of their time
domain sequences. A procedure for converting a circular convolution into a
linear convolution is the Overlap&Save algorithm.
1. Perform an N-point FFT of the zero-padded IR h(n) and store it:

$$ h(n) = \begin{cases} h(n), & n = 0, 1, \dots, Q-1 \\ 0, & n = Q, Q+1, \dots, N-1 \end{cases} $$

2. Select N points from x(n) based on the following expression, in order to create the signal section x_m(n):

$$ x_m(n) = x\left[ n + (m-1)(N-Q+1) - Q + 1 \right] $$

where:
n = 0, 1, 2, …, N-1
m = 1, 2, 3, …
N = FFT length
Q = IR length
[Figure: successive N-point sections of x(n), each overlapping the previous one by Q – 1 points.]
Overlap & Save algorithm (2):
3. Multiply the stored frequency response of h(n) by the FFT of input signal batch m.
4. Perform an N-point IFFT of the product.
5. Discard the first (Q-1) points from each successive output of step 4, and append the remaining outputs to y(n):

y(n) = y1(n), y2(n), …, ym(n), …

$$ y(n) = \begin{cases} y_1(n), & n = Q-1, \dots, N-1 \\ y_2[n-(N-Q+1)], & n = N, \dots, 2N-Q \\ \quad\vdots \\ y_m[n-(m-1)(N-Q+1)], & n = (m-1)(N-Q+1)+(Q-1), \dots, (m-1)(N-Q+1)+(N-1) \\ \quad\vdots \end{cases} $$
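Steps 1 to 5 can be sketched in Python with NumPy (an assumption of this example, as is the function name overlap_save). The first section is taken to start Q – 1 points before the input, filled with zeros, so that the first output sample lines up with the first input sample.

```python
import numpy as np


def overlap_save(x, h, n_fft):
    """Overlap & Save: N-point FFTs, Q-point IR, N - Q + 1 new output points per section."""
    q = len(h)
    hop = n_fft - q + 1                      # new input points consumed per section
    if hop <= 0:
        raise ValueError("FFT length must be at least the IR length")
    H = np.fft.fft(h, n_fft)                 # step 1: FFT of the zero-padded IR, stored
    n_sections = -(-len(x) // hop)           # ceiling division
    # Q - 1 leading zeros for the first section, trailing zeros to complete the last one.
    padded = np.concatenate([np.zeros(q - 1), x, np.zeros(n_sections * hop - len(x))])
    y = np.empty(n_sections * hop)
    for m in range(n_sections):
        start = m * hop
        xm = padded[start:start + n_fft]               # step 2: select N points
        ym = np.fft.ifft(np.fft.fft(xm) * H).real      # steps 3 and 4
        y[start:start + hop] = ym[q - 1:]              # step 5: drop first Q - 1 points
    return y[:len(x)]


rng = np.random.default_rng(1)
x = rng.standard_normal(1000)
h = rng.standard_normal(100)
err = np.max(np.abs(overlap_save(x, h, 256) - np.convolve(x, h)[:len(x)]))
print(err)
```

The discarded Q – 1 points are the ones corrupted by the wrap-around of the circular convolution; the remaining N – Q + 1 points of each section equal the linear convolution.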
Overlap & Save algorithm (3):
Overlap & Save convolution process:
x_m(n) → N-point FFT → X_m(k)
h(n) → N-point FFT → H(k)
X_m(k) · H(k) → IFFT → select last N – Q + 1 samples → append to y(n)
Problems:
• Latency between input and output data is too high.
• Management problem with internal memory of the DSP.
Solution:
• Uniformly-partitioned Overlap & Save algorithm.
Uniformly-partitioned O&S algorithm (1):
The impulse response h(n) is partitioned into a reasonable number P of equally-sized blocks (e.g. P = 4), where each block is K points long.
[Figure: impulse response divided into 1st, 2nd, 3rd and 4th blocks.]
Uniformly-partitioned O&S algorithm (2):
[Figure: the input stream is subdivided into partially overlapped blocks of L points (1st, 2nd, …, T-th). Each block is FFT-transformed and its spectrum is multiplied by the filter segments S1, S2, S3; the products are summed at output indices 0, K, 2K, …; an IFFT is applied to each sum, and the last L – K points of each result are appended to the output stream.]
• Each block is treated as a separate IR, zero-padded to L and transformed with FFT in order to obtain a collection of frequency-domain filters S (e.g. P = 3).
• Every filter Si is convolved, using the Overlap&Save method, with L-point blocks of input data (each block begins L – K points after the previous).
• The results of the multiplications of the P filters S with the FFTs of the latest P input blocks are summed in P frequency-domain accumulators, and an IFFT is done at the end on the content of the first accumulator for producing a block of output data. Only the last L – K points of the block have to be kept.
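A minimal sketch of the uniformly-partitioned scheme, assuming L = 2K (so each input block begins K points after the previous one) and using NumPy. Instead of P accumulators, this version keeps the spectra of the latest P input blocks and sums the P products for each output block, which gives the same result; the function name and this arrangement are choices made for the example, not taken from the slides.

```python
import numpy as np


def partitioned_convolve(x, h, k):
    """Uniformly-partitioned Overlap & Save with K-point IR blocks and L = 2K FFTs."""
    l = 2 * k
    p = -(-len(h) // k)                                  # number of IR blocks
    h_padded = np.concatenate([h, np.zeros(p * k - len(h))])
    # Filters S_i: each K-point block zero-padded to L points and transformed.
    S = np.fft.rfft(h_padded.reshape(p, k), n=l, axis=1)
    n_blocks = -(-len(x) // k)
    x_padded = np.concatenate([np.zeros(k), x, np.zeros(n_blocks * k - len(x))])
    spectra = np.zeros((p, l // 2 + 1), dtype=complex)   # latest P input spectra, newest first
    y = np.empty(n_blocks * k)
    for b in range(n_blocks):
        block = x_padded[b * k:b * k + l]                # overlaps the previous block by K points
        spectra = np.roll(spectra, 1, axis=0)
        spectra[0] = np.fft.rfft(block)                  # one forward FFT per input block
        total = np.sum(spectra * S, axis=0)              # frequency-domain summation
        y[b * k:(b + 1) * k] = np.fft.irfft(total, n=l)[k:]   # one IFFT, keep last L - K points
    return y[:len(x)]


rng = np.random.default_rng(2)
x = rng.standard_normal(5000)
h = rng.standard_normal(1000)
err = np.max(np.abs(partitioned_convolve(x, h, 128) - np.convolve(x, h)[:len(x)]))
print(err)
```

Each output block costs one forward FFT, P spectral multiplications and one IFFT, regardless of how the P products are stored.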
Uniformly-partitioned O&S algorithm (3):
• Total number of FFTs is minimized: each block of input data is FFT-transformed just once, and a single IFFT is performed per output block, after the frequency-domain summation.
• Latency of the whole filtering process is just L points instead of N. The I/O delay is therefore kept low, provided that the impulse response is partitioned into a sensible number of chunks (8 – 32).
Analog Devices DSP platform’s features:
ADDS 21161N Ez-Kit Lite board
• 100 MHz (10 ns) SIMD SHARC DSP core.
• 600 MFLOPS (32-bit floating-point data).
• 600 MOPS (32-bit fixed-point data).
• Single-cycle instruction execution, including SIMD operation in two parallel computational units (ALUs).
• 4 channels INPUT, 8 channels OUTPUT.
• AD1836 and AD1852 24-bit audio converters, 48 or 96 kHz sampling frequency.
[Figure: board connectors: 8-channel output, 4-channel input, SPDIF input.]
Impulse Response processing:
Impulse Response is:
• downloaded on DSP;
• partitioned into P blocks, where each block is K points long (K = 4096);
• each block is zero-padded to a length of L points (L = 8192);
• transformed by the standard FFT procedure supplied by Analog Devices and stored in the external memory.
[Figure: impulse response h split into P blocks of K points, total N points.]
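The preparation of the impulse response can be sketched as follows. NumPy's FFT stands in for the Analog Devices routine, and prepare_filters is a hypothetical name; only K = 4096 and L = 8192 come from the slide.

```python
import numpy as np

K = 4096   # block length from the slide
L = 8192   # zero-padded FFT length from the slide


def prepare_filters(h, k=K, l=L):
    """Split h into K-point blocks, zero-pad each to L points and FFT it."""
    p = -(-len(h) // k)               # number of blocks, last one padded if needed
    padded = np.zeros(p * k)
    padded[:len(h)] = h
    blocks = padded.reshape(p, k)
    return np.fft.fft(blocks, n=l, axis=1)


# Example: a 110592-point IR that is a unit impulse.
h = np.zeros(110592)
h[0] = 1.0
filters = prepare_filters(h)
print(filters.shape)
```

This step runs once, when the response is loaded, so its cost does not affect the real-time budget.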
I/O data stream processing:
A ping-pong I/O buffer was used in the implementation of the algorithm.
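The general idea of a ping-pong (double) buffer can be sketched as below. The figure from the original slide is not available, so this is a generic illustration with made-up names (PingPongBuffer, write), not a description of the authors' buffer layout.

```python
class PingPongBuffer:
    """Two equally sized buffers: one is filled by the I/O side while the other is processed."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.buffers = [[0.0] * block_size, [0.0] * block_size]
        self.fill_index = 0      # buffer currently being filled
        self.position = 0        # next free slot in that buffer

    def write(self, sample):
        """Store one sample; return the buffer that just filled up, or None."""
        self.buffers[self.fill_index][self.position] = sample
        self.position += 1
        if self.position < self.block_size:
            return None
        full = self.buffers[self.fill_index]
        self.fill_index ^= 1     # swap roles: "ping" becomes "pong"
        self.position = 0
        return full


pp = PingPongBuffer(block_size=4)
completed = []
for s in range(10):
    block = pp.write(float(s))
    if block is not None:
        completed.append(list(block))   # copy before the buffer is reused
print(completed)
```

The processing side must finish with a full buffer before the other one fills up, otherwise samples are overwritten.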
Filtering procedure:
[Figure: each block taken from input_buffer is FFT-transformed (FFT[A], then FFT[B]) and multiplied by Filter[0] … Filter[3]; the products A0 … A3 and B0 … B3 are added into the computation circular buffer; the IFFT of the leftmost block is sent to output_buffer.]
• FFT[A], FFT[B] = FFT of the processing stream.
• Filter[i] = P blocks containing the FFT of the IR (e.g. P = 4).
• IFFT[A], IFFT[B] = IFFT of the leftmost block of the circular buffer (A0, then B0 + A1).
• Last L-K points of IFFT[A], IFFT[B] are sent to Output_Buffer.
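The circular buffer of frequency-domain accumulators can be sketched in Python with NumPy. The sketch assumes L = 2K, and all names (partitioned_convolve_acc, acc, head) are illustrative; the actual DSP code is not shown in the slides.

```python
import numpy as np


def partitioned_convolve_acc(x, h, k):
    """Partitioned convolution using a circular buffer of P spectral accumulators."""
    l = 2 * k
    p = -(-len(h) // k)
    h_padded = np.zeros(p * k)
    h_padded[:len(h)] = h
    filters = np.fft.rfft(h_padded.reshape(p, k), n=l, axis=1)   # Filter[0] ... Filter[P-1]
    n_blocks = -(-len(x) // k)
    x_padded = np.concatenate([np.zeros(k), x, np.zeros(n_blocks * k - len(x))])
    acc = np.zeros((p, l // 2 + 1), dtype=complex)   # computation circular buffer
    head = 0                                         # "leftmost" slot: the next output block
    y = np.empty(n_blocks * k)
    for b in range(n_blocks):
        spectrum = np.fft.rfft(x_padded[b * k:b * k + l])        # FFT of the current block
        for i in range(p):
            # Product with Filter[i] contributes to the output block i steps ahead.
            acc[(head + i) % p] += spectrum * filters[i]
        y[b * k:(b + 1) * k] = np.fft.irfft(acc[head], n=l)[k:]  # last L - K points
        acc[head] = 0                                # free the slot for a later block
        head = (head + 1) % p
    return y[:len(x)]


rng = np.random.default_rng(3)
x = rng.standard_normal(4000)
h = rng.standard_normal(500)
err = np.max(np.abs(partitioned_convolve_acc(x, h, 128) - np.convolve(x, h)[:len(x)]))
print(err)
```

Advancing head instead of shifting the accumulators avoids copying P spectra for every block, which matters when the buffer lives in slow external memory.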
Results (1):
Ch IN   Ch OUT   Number of blocks   IR length (points)
1       1        27                 110592
2       2        11                 45056
2       4        5                  20480
4       8        2                  8192

• 110592 points @ 48 kHz → IR length 2.3 seconds.
• 45056 points @ 48 kHz → IR length 0.94 seconds.
• In 2x2 mode (4 filters), the available IR length is far in excess of the requirements for good cross-talk cancelling filters (typically 4096 taps).
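The table entries follow from the block size K = 4096 given earlier; a short check, with illustrative variable names:

```python
K = 4096       # points per IR block
FS = 48_000    # sampling frequency in Hz

blocks_per_filter = [27, 11, 5, 2]                   # from the results table
ir_lengths = [p * K for p in blocks_per_filter]      # IR length in points
ir_seconds = [n / FS for n in ir_lengths]            # IR length in seconds

for p, n, s in zip(blocks_per_filter, ir_lengths, ir_seconds):
    print(p, n, round(s, 2))
```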
Results (2):
[Figure: efficiency of the algorithm (L = 8192 samples). Total length of processed IR (samples) and number of blocks processed, plotted against K / L (%).]
• Tests performed demonstrated that the maximum efficiency is reached when the overlap between two input streams is around half of the FFT length, L.
• Using L = 8192 points and a sampling frequency fs = 48 kHz, latency between input and output is 170 ms.
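The quoted latency is one FFT block of L points at the sampling rate. A short check, compared against a latency of N points for the longest response handled (variable names are illustrative):

```python
L = 8192       # FFT length in points
N = 110592     # longest IR handled, in points
FS = 48_000    # sampling frequency in Hz

latency_ms = 1000 * L / FS                   # partitioned: L points
latency_unpartitioned_ms = 1000 * N / FS     # a latency of N points, for comparison

print(round(latency_ms, 1), round(latency_unpartitioned_ms, 1))
```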
Conclusion:
• Successful implementation of real-time partitioned convolution on the ADDS 21161N Ez-Kit Lite board, operating with 1 to 4 input channels @ 48 kHz.
• Impulse Responses of 110592 points were managed, with latency between input and output data limited to 170 ms.
• When a light, compact system with a small number of channels is required, a DSP is a sensible solution; otherwise a PC provides a significantly better price/performance ratio.