Distributed Arithmetic
Dr Sumam David S.
Dept. of E&C, NITK Surathkal
Slides courtesy of Xilinx Professor’s Workshop Resources
Objective
Distributed arithmetic
What?
Where?
How?
What is DA?
Multiplication using a LUT
Used to implement multipliers in LUT-rich FPGAs
Two's complement multiplication
One bit at a time:
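As an illustration of multiplying one bit at a time, here is a minimal Python sketch. It is not taken from the slides. The function name bit_serial_multiply and its parameters are hypothetical, and the sample is assumed to be a two's complement number of a known width.

```python
def bit_serial_multiply(c, x, bits):
    """Multiply coefficient c by the `bits`-bit two's complement sample x,
    consuming x one bit at a time, LSB first."""
    x_bits = x & ((1 << bits) - 1)        # raw bit pattern of the sample
    acc = 0
    for b in range(bits):
        if (x_bits >> b) & 1:
            partial = c << b              # partial product at weight 2**b
            # The MSB of a two's complement number has weight -2**(bits-1),
            # so its partial product is subtracted instead of added.
            acc += -partial if b == bits - 1 else partial
    return acc
```

For example, bit_serial_multiply(-7, 7, 4) walks through the bits 1, 1, 1, 0 of the sample and accumulates -7, -14 and -28.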
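SDA 1-Tap FIR Filter
[Figure: SDA 1-tap FIR filter]
N-bit-wide sample data X0 enters a parallel-to-serial converter.
The serial bit drives address line A0 of the partial product ROM.
Partial product ROM (LUT contains two locations):
  A0 = 0: 00000...0
  A0 = 1: C0
The ROM output feeds a scaling accumulator (+/- adder with Z^-1 feedback).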
Distributed Arithmetic for a 2-Tap Filter
Partial products of equal weight are added together before being summed to the next higher partial product weight.
Create a look-up table of summed partial products.

Bit weights: -2^3  2^2  2^1  2^0
C0 = 1 0 0 1 (-7)    X0 = 0 1 1 1 (7)
C1 = 0 1 1 0 ( 6)    X1 = 0 1 0 1 (5)

Sample bit     C0 x X0           C1 x X1           Equal-weight sum, sign extended (weighted value)
bit 0 (2^0)    1001              0110              1111 (-1)
bit 1 (2^1)    1001              0000              1001 (-14)
bit 2 (2^2)    1001              0110              1111 (-4)
bit 3 (-2^3)   0000              0000              0000 (0)
Result         11001111 (-49)    00011110 (30)     11101101 (-19)
(Serial-Data / Tap-Parallel Multiply)
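The 2-tap example can be checked with a short Python sketch. It uses the coefficient and sample values from the example (C0 = -7, C1 = 6, X0 = 7, X1 = 5, 4-bit samples). The helper names equal_weight_sums and combine are hypothetical, chosen only for this illustration.

```python
C = [-7, 6]    # C0 = 1001, C1 = 0110
X = [7, 5]     # X0 = 0111, X1 = 0101
BITS = 4       # sample width in bits


def equal_weight_sums(coeffs, samples, bits):
    """For each bit position, add the coefficients whose sample has that bit set."""
    mask = (1 << bits) - 1
    sums = []
    for b in range(bits):
        sums.append(sum(c for c, x in zip(coeffs, samples)
                        if ((x & mask) >> b) & 1))
    return sums


def combine(sums, bits):
    """Weight each equal-weight sum by 2**b; the sign bit has negative weight."""
    total = 0
    for b, s in enumerate(sums):
        weighted = s << b
        total += -weighted if b == bits - 1 else weighted
    return total


sums = equal_weight_sums(C, X, BITS)
result = combine(sums, BITS)
```

Here sums holds the unweighted values for bits 0 to 3, and result matches the sum of the two separate products, -49 + 30.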
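SDA 2-Tap FIR Filter
[Figure: SDA 2-tap FIR filter]
N-bit-wide sample data X0 and X1 are serialized onto address lines A0 and A1 of the partial product ROM.
Partial product ROM contents by address (A1 A0):
  00: 0000...0
  01: C0
  10: C1
  11: C0 + C1
The ROM output feeds a scaling accumulator (+/- adder with Z^-1 feedback).
LUT contains all possible sums of the partial products.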
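A minimal Python sketch of the look-up table and scaling accumulator follows. It is an illustration under stated assumptions, not the Xilinx implementation: build_da_lut and sda_fir_output are hypothetical names, address bit k is assumed to come from sample k, and the accumulator shifts each LUT word left by the bit position instead of shifting the running sum right as hardware typically would.

```python
def build_da_lut(coeffs):
    """LUT[address] = sum of coeffs[k] for every set bit k of the address."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]


def sda_fir_output(coeffs, samples, bits):
    """Compute sum(coeffs[k] * samples[k]) one sample bit per step."""
    lut = build_da_lut(coeffs)
    mask = (1 << bits) - 1
    acc = 0
    for b in range(bits):                      # one "clock cycle" per bit
        addr = 0
        for k, x in enumerate(samples):
            addr |= (((x & mask) >> b) & 1) << k   # bit b of sample k
        term = lut[addr] << b                  # scale by the bit weight
        # Subtract on the sign bit (the "+/-" of the scaling accumulator).
        acc += -term if b == bits - 1 else term
    return acc
```

With the coefficients of the 2-tap example, build_da_lut([-7, 6]) gives the four locations 0, C0, C1 and C0 + C1.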
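SDA 4-Tap FIR Filter
[Figure: SDA 4-tap FIR filter]
N-bit-wide sample data X0 to X3 are serialized onto address lines A0 to A3.
Each address line selects between 0000...0 and its coefficient (C0 to C3), and the selected values are added together. This select-and-add logic forms the partial product ROM.
The ROM output feeds a scaling accumulator (+/- adder with Z^-1 feedback).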
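SDA 8-Tap FIR Filter
[Figure: SDA 8-tap FIR filter]
N-bit-wide sample data X0 to X7 are serialized.
X0 to X3 drive address lines A0 to A3 of the first partial product ROM; X4 to X7 drive address lines A0 to A3 of the second partial product ROM.
A pre-adder sums the two ROM outputs, and the result feeds a scaling accumulator (+/- adder with Z^-1 feedback).
Each 4-input LUT contains all possible sums of the partial products.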
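The 8-tap structure, with two 4-input LUTs and a pre-adder, can be sketched in Python as follows. The names build_lut and sda_fir_8tap are hypothetical, and the left shift stands in for the scaling accumulator, so treat this as a behavioural sketch and not as a description of the hardware.

```python
def build_lut(coeffs):
    """LUT[address] = sum of coeffs[k] for every set bit k of the address."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]


def sda_fir_8tap(coeffs, samples, bits):
    """8-tap serial DA using two 16-entry LUTs instead of one 256-entry LUT."""
    if len(coeffs) != 8 or len(samples) != 8:
        raise ValueError("this sketch expects exactly 8 taps")
    lut_lo = build_lut(coeffs[:4])    # taps 0 to 3
    lut_hi = build_lut(coeffs[4:])    # taps 4 to 7
    mask = (1 << bits) - 1
    acc = 0
    for b in range(bits):
        addr_lo = addr_hi = 0
        for k in range(4):
            addr_lo |= (((samples[k] & mask) >> b) & 1) << k
            addr_hi |= (((samples[k + 4] & mask) >> b) & 1) << k
        pre = lut_lo[addr_lo] + lut_hi[addr_hi]    # pre-adder
        term = pre << b
        acc += -term if b == bits - 1 else term    # subtract on the sign bit
    return acc
```

Splitting the taps this way keeps the table storage at 2 x 16 words, at the cost of one extra adder.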
Xilinx DA FIR Performance
[Figure: two plots of FIR performance versus filter length (taps, 0 to 250)]
Plot 1 (vertical axis 0 to 60): Single MAC, DA FIR B=8, DA FIR B=12, DA FIR B=16, Serial FPGA FIR
Plot 2 (vertical axis 0 to 6000): Dual MAC, DA FIR B=8, DA FIR B=12, DA FIR B=16, Serial FPGA FIR
fclk = 200 MHz for both processor and FPGA
B = data sample precision for FPGA
Trade Clock Cycles for Logic Area
[Figure: from Serial-DA (20 Ms/s), through multiple bits per clock cycle, to Parallel-DA (160 Ms/s), for an 8-bit sample b7...b0]
Hardware over-sampling = 8: The sample is serialized and processed 1 bit per clock cycle. 8 clock cycles are thus required to process the whole sample.
Hardware over-sampling = 4: The sample is serialized and processed 2 bits per clock cycle. 4 clock cycles are thus required to process the whole sample.
Hardware over-sampling = 2: The sample is serialized and processed 4 bits per clock cycle.
Hardware over-sampling = 1: The sample is processed in parallel, 8 bits per clock cycle.
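The trade between clock cycles and logic can be illustrated with a Python sketch in which several sample bits are handled per clock cycle. The function da_fir and its bits_per_clock parameter are hypothetical. Each bit handled in the same cycle is modelled as its own LUT lookup, which corresponds to replicating the LUT in hardware.

```python
def build_lut(coeffs):
    """LUT[address] = sum of coeffs[k] for every set bit k of the address."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]


def da_fir(coeffs, samples, bits, bits_per_clock):
    """Return (result, clock_cycles) for a DA filter that consumes
    bits_per_clock bits of every sample in each clock cycle."""
    if bits % bits_per_clock:
        raise ValueError("bits must be a multiple of bits_per_clock")
    lut = build_lut(coeffs)
    mask = (1 << bits) - 1
    acc = 0
    cycles = 0
    for base in range(0, bits, bits_per_clock):
        # One clock cycle: bits_per_clock lookups happen side by side.
        for b in range(base, base + bits_per_clock):
            addr = 0
            for k, x in enumerate(samples):
                addr |= (((x & mask) >> b) & 1) << k
            term = lut[addr] << b
            acc += -term if b == bits - 1 else term   # sign bit is subtracted
        cycles += 1
    return acc, cycles
```

For an 8-bit sample, bits_per_clock values of 1, 2, 4 and 8 correspond to hardware over-sampling factors of 8, 4, 2 and 1.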
Conclusion
Efficiency of computation
Slow, as it is bit-serial
Memory requirements
References
The Role of Distributed Arithmetic in FPGA-based Signal Processing, www.xilinx.com