High Speed FIR Filter Implementation Using Add and Shift Method Shahnam Mirzaei, Anup Hosangadi, Ryan Kastner University of California, Santa Barbara ICCD 2006 San Jose,

High Speed FIR Filter Implementation
Using Add and Shift Method
Shahnam Mirzaei, Anup Hosangadi, Ryan Kastner
University of California, Santa Barbara
ICCD 2006
San Jose, California
October 2006
UC Santa Barbara
ICCD 2006
Outline
- Introduction
- FIR filter implementation
  - Traditional methods
    - MAC (Multiply Accumulate) implementation
    - DA (Distributed Arithmetic) implementation
  - New method
    - Add and Shift method and CSE (Common Subexpression Elimination)
- Experiments and results
  - Resource utilization
  - Power consumption
- Conclusion
Introduction

- Extensive use of FPGAs in computationally intensive applications such as DSP
  - More available logic resources in current FPGAs
  - Broad applications of FIR filters in multimedia and communications
  - Need for efficient design methods to save area/power
- Research motivation
  - Develop a more efficient implementation method for FIR filters that consumes less area at comparable performance.
  - Develop a unified tool for performing redundancy elimination, scheduling and module assignment.
  - Perform physically aware optimizations.
  - Architecture design exploration for ASIC and FPGA implementations (Distributed Arithmetic based, adder-shifter based, multiplier-adder based).
FIR Filter
MAC Implementation

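L tap FIR filter
Convolution of the latest L input samples. L is the number of coefficients h[k] of the filter, and x[n] represents the input time series.
$$y[n] = \sum_{k=0}^{L-1} h[k]\, x[n-k]$$
[Figure: MAC implementation of the FIR filter. The input X[n] feeds multipliers with coefficients h_{L-1}, h_{L-2}, h_{L-3}, ..., h_1, h_0; their outputs are combined by a chain of adders and z^-1 delay elements to produce y[n].]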
Disadvantages
- Large area on the FPGA due to the multipliers, even though the full flexibility of general purpose multipliers is not required
- Limited number of embedded resources such as MAC engines, multipliers, etc. in FPGAs
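The convolution sum on this slide can be evaluated directly with one multiply-accumulate per tap. The following is a minimal sketch, not the authors' implementation; the function name `fir_mac`, the example coefficients, and the choice to treat samples before the start of the input as zero are assumptions made for illustration.

```python
def fir_mac(h, x):
    """Direct evaluation of y[n] = sum_{k=0}^{L-1} h[k] * x[n-k].

    Samples before the start of x are taken as zero (an assumption of
    this sketch, not something stated on the slide).
    """
    L = len(h)
    y = []
    for n in range(len(x)):
        acc = 0
        for k in range(L):
            if n - k >= 0:
                acc += h[k] * x[n - k]  # one multiply-accumulate per tap
        y.append(acc)
    return y


h = [1, 2, 3]     # hypothetical 3-tap coefficient set
x = [1, 0, 0, 4]  # hypothetical input samples
y = fir_mac(h, x)
```

Each output sample costs L multiplications here, which is the per-tap multiplier cost that the disadvantages listed on this slide refer to.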
FIR Filter
DA (Distributed Arithmetic) Implementation

An alternative to the MAC implementation. DA is the most common FPGA FIR implementation because of the LUT-rich architecture of FPGAs.
The output is the inner product of N coefficients c[n] and N input samples x[n]:
$$y = \sum_{n=0}^{N-1} c[n]\, x[n]$$
Variable x[n] can be represented by:
$$x[n] = \sum_{b=0}^{B-1} x_b[n]\, 2^b, \qquad x_b[n] \in \{0, 1\}$$
where x_b[n] is the b-th bit of x[n] and B is the input width. The inner product can be rewritten as follows:
FIR Filter
DA (Distributed Arithmetic) Implementation (cont’d)
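$$
\begin{aligned}
y &= \sum_{n=0}^{N-1} c[n] \sum_{b=0}^{B-1} x_b[n]\, 2^b \\
  &= c[0]\,\bigl(x_{B-1}[0]\,2^{B-1} + x_{B-2}[0]\,2^{B-2} + \dots + x_0[0]\,2^0\bigr) \\
  &\quad + c[1]\,\bigl(x_{B-1}[1]\,2^{B-1} + x_{B-2}[1]\,2^{B-2} + \dots + x_0[1]\,2^0\bigr) \\
  &\quad + \dots \\
  &\quad + c[N-1]\,\bigl(x_{B-1}[N-1]\,2^{B-1} + x_{B-2}[N-1]\,2^{B-2} + \dots + x_0[N-1]\,2^0\bigr) \\
  &= \bigl(c[0]\,x_{B-1}[0] + c[1]\,x_{B-1}[1] + \dots + c[N-1]\,x_{B-1}[N-1]\bigr)\,2^{B-1} \\
  &\quad + \bigl(c[0]\,x_{B-2}[0] + c[1]\,x_{B-2}[1] + \dots + c[N-1]\,x_{B-2}[N-1]\bigr)\,2^{B-2} \\
  &\quad + \dots \\
  &\quad + \bigl(c[0]\,x_0[0] + c[1]\,x_0[1] + \dots + c[N-1]\,x_0[N-1]\bigr)\,2^0 \\
  &= \sum_{b=0}^{B-1} 2^b \sum_{n=0}^{N-1} c[n]\, x_b[n]
\end{aligned}
$$
where n = 0, 1, …, N-1 and b = 0, 1, …, B-1.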
DA (Distributed Arithmetic) Implementation
Serial
- n+1 clock cycles are needed for an n bit input symmetrical filter to generate the output.
- Performance is limited by the fact that the next input sample can be processed only after every bit of the current input sample has been processed.
- The tradeoff here is performance for area.
[Figure: A Serial DA Filter Block Diagram. Bits x0[i]..x3[i] and x4[i]..x7[i] address two LUTs; an adder combines the LUT outputs, and a scaling accumulator (adder, shifter and register) accumulates the result.]
LUT contents:
Address   Data
0000      0
0001      C0
0010      C1
…         …
1111      C0+C1+C2+C3
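The LUT and scaling accumulator of the serial architecture can be modeled in software as follows. This is a minimal sketch under stated assumptions: samples are unsigned, a single LUT covers all coefficients, and the shift is applied to the LUT output rather than to the accumulator (arithmetically equivalent, but not necessarily how the hardware is wired). The names `build_da_lut` and `serial_da` are illustrative.

```python
def build_da_lut(coeffs):
    """LUT entry at address a holds the sum of coeffs[n] for every set bit n of a."""
    return [sum(c for n, c in enumerate(coeffs) if (addr >> n) & 1)
            for addr in range(1 << len(coeffs))]


def serial_da(coeffs, samples, width):
    """Bit-serial inner product sum(c[n] * x[n]) for unsigned `width`-bit samples."""
    lut = build_da_lut(coeffs)
    acc = 0
    for b in range(width):  # one bit position of every sample per iteration
        # Address bit n is bit b of sample n.
        addr = sum(((x >> b) & 1) << n for n, x in enumerate(samples))
        acc += lut[addr] << b  # scaling accumulator: weight the lookup by 2^b
    return acc


coeffs = [3, 5, 7, 2]   # hypothetical C0..C3
samples = [1, 2, 3, 4]  # hypothetical 3-bit inputs
result = serial_da(coeffs, samples, width=3)
```

With four coefficients the LUT has 16 entries, matching the address range 0000 to 1111 in the table on this slide; no multiplier is used anywhere in the loop.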
DA (Distributed Arithmetic) Implementation
Parallel
- The performance of the circuit can be improved by modifying the architecture to a parallel architecture which processes the data bits in groups.
- Increasing the number of bits sampled has a significant effect on resource utilization on the FPGA.
  - More LUTs
  - Larger size scaling accumulator
[Figure: A 2 bit parallel DA Filter Block Diagram. Four LUTs are addressed by x0[i]..x3[i], x4[i]..x7[i], x0[i+1]..x3[i+1] and x4[i+1]..x7[i+1]; adders combine their outputs ahead of the scaling accumulator.]
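The trade between steps and lookups in the parallel architecture can be illustrated with a small model that consumes several bit positions per step. This is a sketch only: unsigned samples, a single LUT shared by all lookups in software, and the names `parallel_da` and `bits_per_cycle` are assumptions made for illustration.

```python
def parallel_da(coeffs, samples, width, bits_per_cycle):
    """DA inner product that consumes `bits_per_cycle` bit positions per step.

    Returns (result, steps). Unsigned samples are assumed.
    """
    lut = [sum(c for n, c in enumerate(coeffs) if (addr >> n) & 1)
           for addr in range(1 << len(coeffs))]
    acc = 0
    steps = 0
    for base in range(0, width, bits_per_cycle):
        partial = 0
        # Every bit position in the group needs its own lookup; in hardware
        # that means an extra LUT copy per additional bit processed.
        for b in range(base, min(base + bits_per_cycle, width)):
            addr = sum(((x >> b) & 1) << n for n, x in enumerate(samples))
            partial += lut[addr] << (b - base)
        acc += partial << base  # wider partial sums reach the accumulator
        steps += 1
    return acc, steps


coeffs = [3, 5, 7, 2]    # hypothetical coefficients
samples = [9, 6, 15, 0]  # hypothetical 4-bit inputs
serial = parallel_da(coeffs, samples, width=4, bits_per_cycle=1)
two_bit = parallel_da(coeffs, samples, width=4, bits_per_cycle=2)
```

Both calls return the same inner product; the 2 bit version finishes in half the steps but performs two lookups per step, which mirrors the "More LUTs" and "Larger size scaling accumulator" costs listed on the slide.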
CSE (Common Subexpression Elimination)

Linear systems can be modeled using polynomials.
Expressions consist of +,-,<< operators.

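Polynomial formulation
$$C \times X = \sum_i \pm X L^i$$
where L^i denotes a left shift by i bits. For example:
$$
\begin{aligned}
(14)_{10} \times X &= (1110)_2 \times X \\
                   &= (X \ll 3) + (X \ll 2) + (X \ll 1) \\
                   &= XL^3 + XL^2 + XL^1
\end{aligned}
$$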
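The decomposition of a constant multiplication into shifted copies of X can be sketched in a few lines. The helper names below are illustrative, and only non-negative constants in plain binary are handled; the ± in the polynomial formulation also permits subtracted terms, which this sketch does not generate.

```python
def shift_add_terms(c):
    """Return the shift amounts i for which bit i of the constant c is set."""
    return [i for i in range(c.bit_length()) if (c >> i) & 1]


def multiply_by_constant(x, c):
    """Compute c * x for a non-negative integer constant c using shifts and adds."""
    return sum(x << i for i in shift_add_terms(c))


terms = shift_add_terms(14)            # shifts used for 14 = (1110)_2
product = multiply_by_constant(5, 14)  # 5 * 14 without a multiplier
```

For the constant 14 this uses three shifted terms. With signed terms the same constant can be written as (X << 4) - (X << 1), which needs only two.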
CSE
Example
$$
\begin{bmatrix} Y_0 \\ Y_1 \\ Y_2 \\ Y_3 \end{bmatrix}
=
\begin{bmatrix}
1 & 1 & 1 & 1 \\
2 & 1 & -1 & -2 \\
1 & -1 & -1 & 1 \\
1 & -2 & 2 & -1
\end{bmatrix}
\begin{bmatrix} X_0 \\ X_1 \\ X_2 \\ X_3 \end{bmatrix}
$$
Y0 = X0 + X1 + X2 + X3
Y1 = 2X0 + X1 – X2 – 2X3
Y2 = X0 – X1 – X2 + X3
Y3 = X0 – 2X1 + 2X2 – X3
Written with the shift operator L:
Y0 = X0 + X1 + X2 + X3
Y1 = X0L + X1 – X2 – X3L
Y2 = X0 – X1 – X2 + X3
Y3 = X0 – X1L + X2L – X3
CSE
Example
Original expressions:
Y0 = X0 + X1 + X2 + X3
Y1 = X0L + X1 - X2 - X3L
Y2 = X0 - X1 - X2 + X3
Y3 = X0 - X1L + X2L - X3
After extracting D0:
Y0 = D0 + X1 + X2
Y1 = X0L + X1 - X2 - X3L
Y2 = D0 - X1 - X2
Y3 = X0 - X1L + X2L - X3
Common subexpressions:
D0 = (X0 + X3)
D1 = (X1 – X2)
CSE
Example
After extracting D1:
Y0 = D0 + X1 + X2
Y1 = X0L + D1 - X3L
Y2 = D0 - X1 - X2
Y3 = X0 - D1L - X3
After extracting D2:
Y0 = D0 + D2
Y1 = X0L + D1 - X3L
Y2 = D0 - D2
Y3 = X0 - D1L - X3
Common subexpressions:
D2 = (X1 + X2)
D3 = (X0 – X3)
CSE
Example
Original expressions (12 additions, 4 shifts):
Y0 = X0 + X1 + X2 + X3
Y1 = X0L + X1 - X2 - X3L
Y2 = X0 - X1 - X2 + X3
Y3 = X0 - X1L + X2L - X3
After common subexpression elimination (8 additions, 2 shifts):
D0 = X0 + X3
D1 = X1 – X2
D2 = X1 + X2
D3 = X0 - X3
Y0 = D0 + D2
Y1 = D1 + D3L
Y2 = D0 - D2
Y3 = D3 – D1L
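The equivalence of the original and the optimized expressions in this example can be checked numerically. The sketch below transcribes both forms, with L implemented as a left shift by one bit; the function names are illustrative and integer inputs are assumed.

```python
def transform_direct(x0, x1, x2, x3):
    """Original expressions: 12 additions/subtractions and 4 shifts."""
    y0 = x0 + x1 + x2 + x3
    y1 = (x0 << 1) + x1 - x2 - (x3 << 1)
    y2 = x0 - x1 - x2 + x3
    y3 = x0 - (x1 << 1) + (x2 << 1) - x3
    return y0, y1, y2, y3


def transform_cse(x0, x1, x2, x3):
    """After extracting D0..D3: 8 additions/subtractions and 2 shifts."""
    d0 = x0 + x3
    d1 = x1 - x2
    d2 = x1 + x2
    d3 = x0 - x3
    y0 = d0 + d2
    y1 = d1 + (d3 << 1)
    y2 = d0 - d2
    y3 = d3 - (d1 << 1)
    return y0, y1, y2, y3


sample = (7, -2, 5, 1)  # hypothetical input vector
same = transform_direct(*sample) == transform_cse(*sample)
```

Counting the operators in each function body reproduces the figures on the slide: four subexpressions plus four outputs give 8 additions, and only D3 and D1 are shifted.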
FIR Filter Add/Shift Implementation
Replacing Constant Multiplication by Multiplier Block
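[Figure: Top: MAC FIR filter with one multiplier per coefficient h_{L-1}, h_{L-2}, h_{L-3}, ..., h_1, h_0, followed by a chain of adders and z^-1 delay elements producing y[n]. Bottom: the multipliers are replaced by a single Multiplier Block built from adders, which computes y0, y1, y2, ..., yL-1 from X[n]; a Delay Block of adders and Z^-1 registers combines them into y[n].]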
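The split into a multiplier block and a delay block can be sketched as a transposed-form filter in which every product is formed from shifts and adds. This is an illustration under stated assumptions, not the authors' tool flow: coefficients are non-negative integers, no subexpressions are shared between products, and the names `multiplier_block` and `fir_add_shift` are hypothetical.

```python
def multiplier_block(x, coeffs):
    """Compute every product h[k] * x using shifts and adds only.

    Non-negative integer coefficients are assumed. Each product is formed
    independently here, with no sharing of common subexpressions.
    """
    products = []
    for c in coeffs:
        p = 0
        for i in range(c.bit_length()):
            if (c >> i) & 1:
                p += x << i
        products.append(p)
    return products


def fir_add_shift(h, samples):
    """Transposed-form FIR: multiplier block followed by a delay/adder chain."""
    L = len(h)
    regs = [0] * (L - 1)  # z^-1 registers of the delay block
    out = []
    for x in samples:
        p = multiplier_block(x, h)
        out.append(p[0] + (regs[0] if regs else 0))
        # Each register takes its product plus the (old) register behind it.
        for k in range(L - 1):
            upstream = regs[k + 1] if k + 1 < L - 1 else 0
            regs[k] = p[k + 1] + upstream
    return out


y = fir_add_shift([1, 2, 3], [1, 0, 0, 4])  # hypothetical taps and input
```

Because all products depend on the same input sample, the multiplier block is the natural place to apply common subexpression elimination across coefficients.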
FIR Filter Add/Shift Implementation
Registered Adder at no Additional Cost
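[Figure: An adder s = X + y and the same adder followed by a z^-1 register (output s'). Both map onto the same logic blocks: Logic Block 1 handles bit 0 (X0, y0) and Logic Block 2 handles bit 1 (X1, y1), each containing a LUT, a carry connection and a D flip-flop. The unregistered outputs s0, s1 bypass the flip-flops; the registered outputs s'0, s'1 use them.]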
Extracting Common Subexpressions
F1 = A + B + C + D
F2 = A + B + C + E
[Figure: Optimization. Three panels: Unoptimized Expression Trees; Extracting Common Expression (A + B + C); Extracting Common Expression (A + B).]
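The two extraction choices for F1 and F2 trade adder count against adder depth. The sketch below counts both for three candidate structures. The exact tree shapes are assumptions (the figure itself is not recoverable here): balanced trees for the unoptimized case, a chain for the (A + B + C) extraction, and balanced trees sharing only (A + B).

```python
def adders_and_depth(design):
    """Return (number of adders, longest adder chain).

    `design` maps each adder's output name to its two operand names;
    any name that is not a key is a primary input.
    """
    def depth(name):
        if name not in design:
            return 0
        a, b = design[name]
        return 1 + max(depth(a), depth(b))

    return len(design), max(depth(name) for name in design)


# Assumed structures for F1 = A + B + C + D and F2 = A + B + C + E.
unoptimized = {
    "t1": ("A", "B"), "t2": ("C", "D"), "F1": ("t1", "t2"),
    "t3": ("A", "B"), "t4": ("C", "E"), "F2": ("t3", "t4"),
}
share_abc = {
    "t1": ("A", "B"), "t2": ("t1", "C"),
    "F1": ("t2", "D"), "F2": ("t2", "E"),
}
share_ab = {
    "t1": ("A", "B"), "t2": ("C", "D"), "t3": ("C", "E"),
    "F1": ("t1", "t2"), "F2": ("t1", "t3"),
}

results = {name: adders_and_depth(d) for name, d in
           [("unoptimized", unoptimized), ("share_abc", share_abc), ("share_ab", share_ab)]}
```

Under these assumed shapes, extracting (A + B + C) uses the fewest adders but the longest chain, while extracting only (A + B) saves one adder without lengthening the chain.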
Synchronization

Extra registers are needed to synchronize the intermediate values, such that new values for A, B, C, D, E, F can be read in every clock cycle.
[Figure: Calculating registers required for fastest evaluation.]
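One way to estimate the number of synchronization registers is to assign each adder a pipeline stage and delay any operand that arrives early. The model below is an assumption of this sketch, not a description of the authors' algorithm: every adder registers its result, primary inputs are available at stage 0, registers are counted per word rather than per bit, and the example expression is a hypothetical pair of outputs sharing (A + B + C).

```python
def sync_registers(design, outputs):
    """Count word-wide balancing registers for a pipeline of registered adders.

    An adder placed at stage s needs both operands as they were at stage
    s - 1; an operand produced earlier must be delayed by the difference.
    """
    stage = {}

    def stage_of(name):
        if name not in design:  # primary input
            return 0
        if name not in stage:
            stage[name] = 1 + max(stage_of(op) for op in design[name])
        return stage[name]

    # Longest delay each signal needs; consumers can tap one shared chain.
    delay = {}
    for name, operands in design.items():
        for op in operands:
            need = stage_of(name) - 1 - stage_of(op)
            delay[op] = max(delay.get(op, 0), need)

    # Outputs that finish early are delayed to the final stage.
    last = max(stage_of(o) for o in outputs)
    for o in outputs:
        delay[o] = max(delay.get(o, 0), last - stage_of(o))
    return sum(delay.values())


# Hypothetical example: F1 = ((A+B)+C)+D and F2 = ((A+B)+C)+E.
chain = {"t1": ("A", "B"), "t2": ("t1", "C"), "F1": ("t2", "D"), "F2": ("t2", "E")}
# Hypothetical balanced alternative sharing only (A+B).
balanced = {"t1": ("A", "B"), "t2": ("C", "D"), "t3": ("C", "E"),
            "F1": ("t1", "t2"), "F2": ("t1", "t3")}

chain_regs = sync_registers(chain, ["F1", "F2"])
balanced_regs = sync_registers(balanced, ["F1", "F2"])
```

In this model the chained structure needs delay registers on C, D and E, whereas the balanced structure needs none, so fewer adders does not automatically mean fewer registers.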
Experiment Results
Resource Utilization/Performance

            Filter Implementation Using           Filter Implementation Using
            Add and Shift Method                  Xilinx Coregen (PDA)
Filter      Slices  LUTs   FFs     Performance    Slices  LUTs    FFs     Performance
(# taps)                           (Msps)                                 (Msps)
6           264     213    509     251            524     774     1012    245
10          474     406    916     222            781     1103    1480    222
13          386     334    749     252            929     1311    1775    199
20          856     705    1650    250            1191    1631    2288    199
28          1294    1145   2508    227            1774    2544    3381    199
41          2154    1719   4161    223            2475    3642    4748    222
61          3264    2591   6303    192            3528    5335    6812    199
119         6009    4821   11551   203            6484    9754    12539   205
151         7579    6098   14611   180            8274    12525   15988   199
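The percentage reductions implied by the resource table can be recomputed directly from its entries. The sketch below uses the slice, LUT and FF counts exactly as tabulated; the variable and function names are illustrative, and the averages are simple unweighted means over the nine filters.

```python
# (taps, slices, LUTs, FFs) as listed in the resource utilization table
add_shift = [
    (6, 264, 213, 509), (10, 474, 406, 916), (13, 386, 334, 749),
    (20, 856, 705, 1650), (28, 1294, 1145, 2508), (41, 2154, 1719, 4161),
    (61, 3264, 2591, 6303), (119, 6009, 4821, 11551), (151, 7579, 6098, 14611),
]
coregen = [
    (6, 524, 774, 1012), (10, 781, 1103, 1480), (13, 929, 1311, 1775),
    (20, 1191, 1631, 2288), (28, 1774, 2544, 3381), (41, 2475, 3642, 4748),
    (61, 3528, 5335, 6812), (119, 6484, 9754, 12539), (151, 8274, 12525, 15988),
]


def percent_reduction(ours, baseline):
    """Reduction of `ours` relative to `baseline`, in percent."""
    return 100.0 * (baseline - ours) / baseline


def average_reductions(ours_rows, base_rows):
    """Unweighted mean reduction for slices, LUTs and FFs (in that order)."""
    sums = [0.0, 0.0, 0.0]
    for ours, base in zip(ours_rows, base_rows):
        assert ours[0] == base[0]  # rows must describe the same filter size
        for col in range(3):
            sums[col] += percent_reduction(ours[col + 1], base[col + 1])
    return [s / len(ours_rows) for s in sums]


avg_slices, avg_luts, avg_ffs = average_reductions(add_shift, coregen)
```

The LUT average comes out at roughly 58.7%, and the slice and FF averages at roughly a quarter, in line with the figures quoted in the conclusion.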
Experiment Results
Resource Utilization
[Figure: Reduction in Resources. % reduction in slices, LUTs and FFs versus # of taps.]
Experiment Results
Power Consumption
[Figure: Dynamic Power Consumption. Power (mW) versus filter size (# of taps) for the Add/Shift and Coregen implementations.]
Creating MAC Filters Using Xilinx Coregen
Experiment Results
Comparison with MAC Filters Using Multiplier Blocks

Filter      Add Shift Method      MAC filter
(# taps)    Slices   Msps         Slices   Msps
6           264      296          219      262
10          475      296          418      253
13          387      296          462      253
20          851      271          790      251
28          1303     305          886      251
41          2178     296          1660     243
61          3284     247          1947     242
119         6025     294          3581     241
151         7623     294          7631     215
Experiment Results
Comparison with MAC Filters Using Multiplier Blocks – Resource Utilization
[Figure: Resource utilization. # of slices for the MAC and Add and Shift implementations, plotted against # of taps for the nine filters.]
Experiment Results
Comparison with MAC Filters Using Multiplier Blocks – Performance
[Figure: Performance. Msps for the Add and Shift and MAC implementations, plotted against # of taps for the nine filters.]
Conclusion/Observations


- Presented a multiplierless technique, based on the add and shift method and common subexpression elimination, for low area, low power and high speed implementations of FIR filters.
- Validated our techniques on Virtex II/IV devices, where we observed significant area and power reductions over traditional Distributed Arithmetic based techniques:
  - An average reduction of 58.7% in the number of LUTs, and about 25% reduction in the number of slices and FFs.
  - Better performance in most of the cases, even though our algorithm does not optimize for performance.
  - Up to 50% reduction in dynamic power consumption.
  - Higher performance as the filter size increases.
- The critical path in our design consists of adders, while in the MAC method the critical path consists of multipliers and adders.