Distributed Arithmetic

Dr Sumam David S.

Dept. of E&C, NITK Surathkal

Courtesy for slides: Xilinx Professor’s Workshop Resources

Objective

 Distributed arithmetic
   What?
   Where?
   How?

What is DA?

 Multiplication using look-up tables (LUTs)
 Used to implement multipliers in LUT-rich FPGAs

Two's Complement Multiplication

 One bit at a time:
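The one-bit-at-a-time idea can be sketched in a few lines of Python. This is an illustration only, not taken from the slides: the function name `bit_serial_multiply` and the operand values are assumptions chosen for the example. Each bit of the two's complement multiplier selects either 0 or the multiplicand, and the sign bit carries a negative weight.

```python
def bit_serial_multiply(c, x, bits):
    """Multiply signed c by signed x, consuming x one bit at a time (LSB first).

    x must fit in `bits` bits of two's complement. Hypothetical helper,
    written for illustration.
    """
    acc = 0
    for b in range(bits):
        bit = (x >> b) & 1            # bit b of x's two's complement pattern
        partial = c * bit             # partial product is either 0 or c
        if b == bits - 1:
            acc -= partial << b       # the sign bit has weight -2^(bits-1)
        else:
            acc += partial << b       # every other bit has weight +2^b
    return acc


print(bit_serial_multiply(-7, 7, 4))
print(bit_serial_multiply(6, 5, 4))
print(bit_serial_multiply(6, -3, 4))
```

Because each partial product is only ever 0 or the multiplicand, it can be read from a two-entry table instead of being computed, which is the step that leads to the LUT-based filters.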

SDA 1-Tap FIR Filter

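[Figure: SDA 1-tap FIR filter. The N-bit-wide sample data X0 passes through a parallel-to-serial converter. The serial bit A0 addresses the partial product ROM: A0 = 0 selects 00000...0, A0 = 1 selects C0. The ROM output feeds a scaling accumulator (+/-, Z^-1).]

LUT contains two locations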

Distributed Arithmetic for a 2-Tap Filter

 Partial products of equal weight are added together before being summed to the next higher partial product weight
 Create a look-up table of summed partial products

Bit weights: -2^3  2^2  2^1  2^0

C0 = 1 0 0 1 (-7)    X0 = 0 1 1 1 (7)
C1 = 0 1 1 0 ( 6)    X1 = 0 1 0 1 (5)

Partial products, one row per bit of X (LSB first):

Weight   C0 x X0 bit   C1 x X1 bit   Sum (sign extended)   Weighted value
2^0      1 0 0 1       0 1 1 0       1 1 1 1               (-1)
2^1      1 0 0 1       0 0 0 0       1 0 0 1               (-14)
2^2      1 0 0 1       0 1 1 0       1 1 1 1               (-4)
2^3      0 0 0 0       0 0 0 0       0 0 0 0               (0)

C0 x X0           = 1 1 0 0 1 1 1 1 (-49)
C1 x X1           = 0 0 0 1 1 1 1 0 ( 30)
C0 x X0 + C1 x X1 = 1 1 1 0 1 1 0 1 (-19)

(Serial-Data / Tap-Parallel Multiply)

SDA 2-Tap FIR Filter

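[Figure: SDA 2-tap FIR filter. The N-bit-wide sample data X0 and X1 are serialized. Their serial bits A0 and A1 address the partial product ROM, whose output feeds a scaling accumulator (+/-, Z^-1).]

Partial product ROM contents:

A1 A0   Output
0  0    0000...0
0  1    C0
1  0    C1
1  1    C0 + C1

LUT contains all possible sums of the partial products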
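A minimal Python sketch of the 2-tap serial DA datapath follows. The function names `build_da_lut` and `serial_da` are hypothetical, and the sketch shifts the LUT output left by the bit position, whereas a hardware scaling accumulator would typically shift the running sum instead; the arithmetic result is the same. The operands are C0 = -7, C1 = 6, X0 = 7, X1 = 5.

```python
def build_da_lut(coeffs):
    """Precompute all possible sums of coefficients.

    Address bit k set means coeffs[k] is included in the stored sum.
    """
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]


def serial_da(coeffs, samples, bits):
    """Compute sum(coeffs[k] * samples[k]) one sample bit per 'clock cycle'."""
    lut = build_da_lut(coeffs)
    acc = 0
    for b in range(bits):                      # LSB first
        addr = 0
        for k, x in enumerate(samples):
            addr |= ((x >> b) & 1) << k        # bit b of sample k -> address bit k
        term = lut[addr] << b                  # scale by the bit weight 2^b
        acc += -term if b == bits - 1 else term  # sign bit is subtracted
    return acc


print(build_da_lut([-7, 6]))            # [0, -7, 6, -1]
print(serial_da([-7, 6], [7, 5], 4))    # -19
```

The table has 2^K entries for K taps, so no multiplier is needed: each clock cycle costs one table read and one add or subtract.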

SDA 4-Tap FIR Filter

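[Figure: SDA 4-tap FIR filter. The N-bit-wide sample data X0 to X3 are serialized into address bits A0 to A3. Each address bit selects either 0000...0 or its coefficient (C0 to C3), and the selected values are added together. This sum is the output of the partial product ROM, which feeds a scaling accumulator (+/-, Z^-1).]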

SDA 8-Tap FIR Filter

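[Figure: SDA 8-tap FIR filter. The N-bit-wide sample data X0 to X7 are serialized. The bits of X0 to X3 form the address A0 to A3 of one partial product ROM, and the bits of X4 to X7 form the address A0 to A3 of a second partial product ROM. A pre-adder sums the two ROM outputs and feeds a scaling accumulator (+/-, Z^-1).]

Each 4-input LUT contains all possible sums of its partial products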
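The split into two 4-input tables plus a pre-adder can be sketched as follows. The coefficient and sample values are made up for illustration, and the helper names are hypothetical. Two 16-entry tables (32 entries in total) replace the 256 entries a single 8-input table would need.

```python
def build_lut(coeffs):
    """All possible sums of the given coefficients, indexed by address bits."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]


def address(samples, b):
    """Pack bit b of each sample into a LUT address (sample k -> address bit k)."""
    addr = 0
    for k, x in enumerate(samples):
        addr |= ((x >> b) & 1) << k
    return addr


def da_fir_8tap(coeffs, samples, bits):
    """8-tap serial DA using two 4-input LUTs and a pre-adder."""
    lut_lo = build_lut(coeffs[:4])     # taps 0..3, 16 entries
    lut_hi = build_lut(coeffs[4:])     # taps 4..7, 16 entries
    acc = 0
    for b in range(bits):              # one bit of every sample per cycle
        pre_add = lut_lo[address(samples[:4], b)] + lut_hi[address(samples[4:], b)]
        term = pre_add << b            # scale by the bit weight
        acc += -term if b == bits - 1 else term  # sign bit is subtracted
    return acc


# Hypothetical coefficients and signed 8-bit samples, for illustration only.
coeffs = [3, -1, 4, 1, -5, 9, 2, -6]
samples = [12, -7, 100, -128, 55, 0, -1, 127]
direct = sum(c * x for c, x in zip(coeffs, samples))
print(da_fir_8tap(coeffs, samples, 8) == direct)   # True
```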

Xilinx DA FIR Performance

[Figure: two plots against Filter Length (Taps), horizontal axis 0 to 250. First plot (vertical scale 0 to 60), labels: Single MAC, DA FIR B=8, DA FIR B=12, DA FIR B=16, Serial FPGA FIR. Second plot (vertical scale 0 to 6000), labels: Dual MAC, DA FIR B=8, DA FIR B=12, DA FIR B=16, Serial FPGA FIR.]

fclk = 200 MHz for both processor and FPGA

B = data sample precision for FPGA

Trade Clock Cycles for Logic Area

[Figure: an 8-bit sample (b7 to b0) processed with multiple bits per clock cycle, ranging from Serial-DA (20 Ms/s) to Parallel-DA (160 Ms/s).]

Hardware over-sampling = 8: the sample is serialized and processed 1 bit per clock cycle. 8 clock cycles are thus required to process the whole sample.

Hardware over-sampling = 4: the sample is serialized and processed 2 bits per clock cycle. 4 clock cycles are thus required to process the whole sample.

Hardware over-sampling = 2: the sample is serialized and processed 4 bits per clock cycle (b7 to b4 and b3 to b0).

Hardware over-sampling = 1: the sample is processed in parallel, 8 bits per clock cycle.
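The trade-off can be tabulated with a short Python sketch. Two things here are assumptions rather than statements from the slides: the 160 MHz clock, chosen only because it reproduces the 20 Ms/s and 160 Ms/s end points for 8-bit samples, and the `lut_copies` figure, which assumes one copy of the partial product LUT for every bit processed in the same clock cycle. The function name `da_tradeoff` is hypothetical.

```python
import math


def da_tradeoff(sample_bits, bits_per_clock, fclk_hz):
    """Estimate cycles per sample, LUT replication, and throughput.

    Assumes one LUT copy per bit handled in parallel (illustrative model).
    """
    cycles = math.ceil(sample_bits / bits_per_clock)   # hardware over-sampling
    return {
        "cycles_per_sample": cycles,
        "lut_copies": bits_per_clock,
        "samples_per_second": fclk_hz / cycles,
    }


# Assumed 160 MHz clock and 8-bit samples.
for k in (1, 2, 4, 8):
    print(k, da_tradeoff(8, k, 160e6))
```

Under these assumptions, each doubling of the bits processed per clock halves the cycles per sample and doubles both the throughput and the number of LUT copies.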

Conclusion

 Efficiency of computation
 Slow, as it is bit serial
 Memory requirements

References

The Role of Distributed Arithmetic in FPGA-based Signal Processing, www.xilinx.com