Optimizing high speed arithmetic circuits using three-term extraction Anup Hosangadi Ryan Kastner ECE Department University of California, Santa Barbara Farzan Fallah Fujitsu Laboratories of America.


Optimizing high speed arithmetic
circuits using three-term extraction
Anup Hosangadi
Ryan Kastner
ECE Department
University of California, Santa Barbara
Farzan Fallah
Fujitsu Laboratories
of America
Outline
• Carry Save Arithmetic
• Related Work
• Problem formulation
• Algebraic methods
• Delay aware optimization
• Experimental results
Carry Save Arithmetic
• Multi-Operand addition
  • F=A+B+C+D+E+F
  • Carry propagation is the major bottleneck
  • Fast adders such as the Carry Lookahead Adder (CLA) and Carry Select Adders are not fast enough
• Solution: Defer carry propagation to the final step
  • Generate Sums and Carries separately
  • Treat them as separate numbers
  • Keep adding till only two numbers remain
  • Add the two numbers using a fast adder (CLA)
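The four steps above can be sketched in a few lines of Python. This is only an illustration on non-negative integers, not code from the talk: the function names `carry_save_add` and `multi_operand_add` are hypothetical, and Python's built-in `+` stands in for the final CLA.

```python
def carry_save_add(a, b, c):
    """3:2 compression of three non-negative ints.

    Returns (s, cy) with s + cy == a + b + c; no carry ripples across bits.
    """
    s = a ^ b ^ c                               # per-bit sum
    cy = ((a & b) | (b & c) | (a & c)) << 1     # per-bit carry, moved to its weight
    return s, cy


def multi_operand_add(operands):
    """Reduce the operands with CSAs until two remain, then add them once."""
    nums = list(operands)
    csa_count = 0
    while len(nums) > 2:
        s, cy = carry_save_add(*nums[:3])
        nums = nums[3:] + [s, cy]               # sums and carries are just more operands
        csa_count += 1
    return sum(nums), csa_count                 # sum() plays the role of the final CLA


total, csas = multi_operand_add([3, 5, 7, 11, 13, 17])
print(total, csas)
```

Each CSA turns three numbers into two, so N operands need N - 2 CSAs before the single carry-propagate addition.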
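Carry Save Arithmetic
[Figure: CSA tree adding the six operands A to F with four CSAs in three levels; the final sum (S) and carry (C) outputs are added by a CLA to produce F]
Delay = 3 + log_2(M + 3)
  3 = height of CSA tree
  M = bitwidth of operands
Tree height = log_1.5(N/2)
  N = number of operands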
Carry Save arithmetic
Using Ripple carry adders (RCAs)
[Figure: chain of five RCAs whose output widths grow from (M + 1) to (M + 5) bits]
Delay = (M + 5) + 4
Delay through CSA network = 3 + log_2(M + 3)
Area comparison
[Figure: area (full adder units) vs. # of operands (2 to 50), RCA vs. CSA]
Delay comparison
[Figure: delay (full adder delays) vs. # of operands (2 to 50), RCA vs. CSA]
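Related Work
• Kim et al., “Arithmetic optimization using Carry Save Adders”, DAC’98
[Figure: a network of two-operand adders (+) over A, B, C, D, E producing F, shown next to versions of the same computation that use CSAs followed by an adder]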
Related Work
• Kim et al., “Optimal allocation of CSAs”, ICCAD’99
  • Delay aware CSA allocation
• Kim et al., “High performance, low power synthesis”, DAC’2000
• Synopsys™ Behavioral optimization for arithmetic (BOA)
• A. Verma and P. Ienne, “Improved use of the carry save representation for the synthesis of complex arithmetic circuits”, ICCAD’2004
Arithmetic Optimizer?
Problem formulation
• No methodology for detecting redundancy in CSA computations
  • Can reduce the number of CSAs
  • Can reduce the number of wires
• Common subexpression elimination
  • Standard compiler technique
  • Applied to 2-term arithmetic operations
    – Polynomial expressions (ICCAD’04, VLSI’05)
    – Constant multiplications (ASAP’04, ASPDAC’05)
• CSA expressions (Common 3-term subexpressions)
Problem formulation
Y1 = X1 + X1<<2 + X2 + X2<<1 + X2<<2
Y2 = X1<<2 + X2<<2 + X2<<3
Common 3-term subexpression, computed by one CSA with outputs D1S (sum) and D1C (carry):
D1 = X1 + X2 + X2<<1
Y1 = (D1S + D1C) + X1<<2 + X2<<2
Y2 = (D1S + D1C)<<2
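As a sanity check on this example, the shared form can be compared against the direct form for small inputs. The sketch below is illustrative only; `carry_save_add`, `y_direct` and `y_shared` are hypothetical helper names, and it relies on the fact that Y2 is D1 shifted left by two.

```python
def carry_save_add(a, b, c):
    """3:2 compression: returns (sum, carry) with sum + carry == a + b + c."""
    return a ^ b ^ c, ((a & b) | (b & c) | (a & c)) << 1


def y_direct(x1, x2):
    """Y1 and Y2 written out term by term."""
    y1 = x1 + (x1 << 2) + x2 + (x2 << 1) + (x2 << 2)
    y2 = (x1 << 2) + (x2 << 2) + (x2 << 3)
    return y1, y2


def y_shared(x1, x2):
    """Y1 and Y2 sharing the single CSA that computes D1 = X1 + X2 + X2<<1."""
    d1s, d1c = carry_save_add(x1, x2, x2 << 1)
    y1 = (d1s + d1c) + (x1 << 2) + (x2 << 2)
    y2 = (d1s + d1c) << 2          # D1 reused with a shift, no extra CSA for its terms
    return y1, y2


ok = all(y_direct(a, b) == y_shared(a, b) for a in range(64) for b in range(64))
print(ok)
```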
Algebraic methods
• Polynomial transformation
  • X<<i = XL^i
  • C × X = Σ_i (±X × L^i)
      (14)_10 × X = (1110)_2 × X
                  = X<<3 + X<<2 + X<<1
                  = XL^3 + XL^2 + XL^1
      (100-10)_CSD × X = XL^4 – XL^1
  • Detects shifted common subexpressions and also extends to multiple variables
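The transformation of a constant multiplication into signed powers of L can be sketched as below. The talk does not say how the CSD digits are obtained; the conversion here is the standard canonical signed digit recoding, and the names `csd_digits` and `polynomial_terms` are hypothetical.

```python
def csd_digits(c):
    """Canonical signed digit recoding of a positive int.

    Returns digits in {-1, 0, 1}, least significant first,
    with no two adjacent non-zero digits.
    """
    digits = []
    while c != 0:
        if c & 1:
            d = 2 - (c & 3)      # +1 when c % 4 == 1, -1 when c % 4 == 3
            c -= d
        else:
            d = 0
        digits.append(d)
        c >>= 1
    return digits


def polynomial_terms(c):
    """Terms of C * X as (sign, i) pairs, each meaning sign * X * L^i."""
    return [(d, i) for i, d in enumerate(csd_digits(c)) if d != 0]


print(polynomial_terms(14))      # terms of 14 * X
```

For C = 14 this yields a negative term at L^1 and a positive term at L^4, matching (100-10)_CSD × X = XL^4 – XL^1.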
Algebraic methods
• 3-term divisors = All potential common subexpressions
• Divisor generation
  • One for every combination of 3 terms
  • e.g. F1 = X1 + X1L^2 + X2 + X2L + X2L^2
  • d1 = X1L^2 + X2L + X2L^2
  • MinL = L
  • Divisor D1 = d1/L = X1L + X2 + X2L
  • # of divisors = C(N, 3), i.e. N choose 3
• Theorem:
  • There exists a 3-term common subexpression iff there exists a non-overlapping intersection among the set of 3-term divisors
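Divisor generation for one expression can be sketched as follows. The representation is an assumption made for illustration: each term is a (variable, exponent of L) pair, all terms are taken as positive, and `three_term_divisors` is a hypothetical name.

```python
from itertools import combinations


def three_term_divisors(terms):
    """Enumerate the 3-term divisors of one expression.

    terms: list of (variable, shift) pairs, e.g. ("X1", 2) for X1*L^2.
    Returns a list of (divisor, min_shift) where the divisor is the
    3-term combination divided by its smallest power of L.
    """
    result = []
    for combo in combinations(terms, 3):
        min_shift = min(shift for _, shift in combo)
        divisor = tuple(sorted((var, shift - min_shift) for var, shift in combo))
        result.append((divisor, min_shift))
    return result


# F1 = X1 + X1L^2 + X2 + X2L + X2L^2
F1 = [("X1", 0), ("X1", 2), ("X2", 0), ("X2", 1), ("X2", 2)]
divisors = three_term_divisors(F1)
print(len(divisors))
```

Dividing by the minimum power of L is what lets two occurrences of the same subexpression at different shifts map to the same divisor.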
Algebraic methods
• Greedy Iterative algorithm
  • Extracts the “best” 3-term divisor
  • Rewrites the expressions containing it
      F1 = a + b + c + d + e
      F2 = a + b + c + d + f
      >> D1 = a + b + c
      F1 = D1S + D1C + d + e
      F2 = D1S + D1C + d + f
      >> D2 = D1S + D1C + d
      F1 = D2S + D2C + e
      F2 = D2S + D2C + f
  • Terminates when there are no more common subexpressions
Algebraic methods
• Algorithm details
Optimize ({Pi})
{
    {Pi} = Set of expressions in polynomial form;
    {D} = Set of divisors = ∅;
    // Step 1. Creating divisors and their frequency statistics
    for each expression Pi in {Pi}
    {
        {Dnew} = Divisors(Pi);
        Update frequency statistics of divisors in {D};
        {D} = {D} ∪ {Dnew};
    }
    // Step 2. Iterative selection and elimination of best divisor
    while (1)
    {
        Find d = divisor in {D} with most number of non-overlapping intersections;
        if (d == NULL) break;
        Rewrite affected expressions in {Pi} using d;
        Remove divisors in {D} that have become invalid;
        Update frequency statistics of affected divisors;
        {Dnew} = Set of new divisors from new terms added by division;
        {D} = {D} ∪ {Dnew};
    }
}
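A much simplified version of this loop is sketched below. It is not the authors' implementation: it ignores shifts and signs, counts a divisor at most once per expression, and recounts all divisors in every iteration instead of maintaining frequency statistics incrementally. The name `extract_three_term` and the tie-breaking rule (highest count, then alphabetical) are assumptions.

```python
from collections import Counter
from itertools import combinations


def extract_three_term(expressions):
    """Greedy 3-term extraction on sums of named terms.

    expressions: dict mapping an expression name to its list of terms.
    Returns (rewritten expressions, extracted divisors).
    """
    exprs = {name: list(terms) for name, terms in expressions.items()}
    divisors = {}
    while True:
        # Step 1: count in how many expressions each 3-term combination occurs.
        counts = Counter()
        for terms in exprs.values():
            for combo in set(combinations(sorted(terms), 3)):
                counts[combo] += 1
        shared = [(combo, n) for combo, n in counts.items() if n >= 2]
        if not shared:
            break                                   # no common subexpression left
        # Step 2: pick the most frequent divisor and rewrite the expressions.
        best, _ = min(shared, key=lambda item: (-item[1], item[0]))
        name = f"D{len(divisors) + 1}"
        divisors[name] = best
        for terms in exprs.values():
            remaining = list(terms)
            try:
                for term in best:
                    remaining.remove(term)
            except ValueError:
                continue                            # divisor not present here
            terms[:] = remaining + [name + "S", name + "C"]
    return exprs, divisors


rewritten, found = extract_three_term({
    "F1": ["a", "b", "c", "d", "e"],
    "F2": ["a", "b", "c", "d", "f"],
})
print(found)
print(rewritten)
```

On this input the sketch first extracts a + b + c and then the divisor formed by its sum and carry outputs together with d, leaving one 3-term sum per expression: four CSAs in total instead of three per expression.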
Algebraic methods
• Algorithm complexity
  • M expressions, each with N terms
  • Divisor generation = M × C(N, 3) = O(MN^3)
  • Iterative algorithm, worst case
    – N terms reduced to 2 terms = (N – 2) steps
    – M expressions = O(MN) steps
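To get a feel for these bounds, the two counts can be evaluated for concrete sizes. The helper names below are hypothetical, and the step count assumes the worst case stated on the slide, where every expression is reduced from N terms to 2.

```python
from math import comb


def divisor_count(num_expressions, terms_per_expression):
    """Number of 3-term divisors generated: M * C(N, 3)."""
    return num_expressions * comb(terms_per_expression, 3)


def worst_case_steps(num_expressions, terms_per_expression):
    """Worst-case number of extraction steps: M * (N - 2)."""
    return num_expressions * max(terms_per_expression - 2, 0)


for m, n in [(2, 5), (8, 16), (8, 32)]:
    print(m, n, divisor_count(m, n), worst_case_steps(m, n))
```

Doubling N multiplies the divisor count by roughly eight, which is the cubic growth in the O(MN^3) term.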
Delay aware optimization
• Sharing subexpressions can increase the total delay
  • Traditional high level synthesis approach: Reduce delay by Tree Height Reduction (THR)
  • Our solution: Control delay during optimization itself
      F1 = a(2) + b(0) + c(0) + d(0) + e(0)
      F2 = a(2) + b(0) + c(0) + d(0) + f(0)
      (the number in parentheses is the arrival time of the signal)
  • Optimal delay CSA allocation (T. Kim, J. Um, “Timing driven synthesis”, ASPDAC’2000)
    – Use this to get minimum possible delay
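The minimum delay for an expression with given arrival times can be estimated with an earliest-first greedy rule. The sketch assumes this rule as a stand-in for the cited allocation method, and also assumes that a CSA adds one full adder delay to both of its outputs; `min_csa_delay` is a hypothetical name.

```python
import heapq


def min_csa_delay(arrival_times, csa_delay=1):
    """Arrival time of the last of the two signals fed to the final adder.

    Repeatedly feeds the three earliest-arriving signals into a CSA whose
    sum and carry outputs are ready csa_delay after its latest input.
    """
    heap = list(arrival_times)
    heapq.heapify(heap)
    while len(heap) > 2:
        a, b, c = (heapq.heappop(heap) for _ in range(3))
        ready = max(a, b, c) + csa_delay
        heapq.heappush(heap, ready)     # sum output
        heapq.heappush(heap, ready)     # carry output
    return max(heap)


# F1 = a(2) + b(0) + c(0) + d(0) + e(0)
print(min_csa_delay([2, 0, 0, 0, 0]))
```

For the arrival times of F1 the result is 3 CSA levels before the final adder, so any extraction that pushes the expression beyond that has made the delay worse.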
Delay aware optimization
• Optimal allocation
[Figure: CSA allocation for F1 and F2, three CSAs per expression followed by a final adder (+), with the arrival time of every signal annotated; the slide also carries the label “Delay ignorant extraction”]
Delay(F1) = Delay(F2) = 3 + D(Add)
Delay aware extraction
• Control delay during optimization
  • Evaluate each candidate divisor for delay
  • Only consider those divisors that do not increase the delay
      F1 = a(2) + b(0) + c(0) + d(0) + e(0)
      F2 = a(2) + b(0) + c(0) + d(0) + f(0)
      >> D1(3) = a(2) + b(0) + c(0)
      F1 = D1S(3) + D1C(3) + d(0) + e(0)    Delay = 5 + D(Add)
      F2 = D1S(3) + D1C(3) + d(0) + f(0)    Delay = 5 + D(Add)
  D1 raises the delay above the optimal 3 + D(Add), so it is not considered.
Delay aware extraction
• Control delay during optimization
  • Evaluate each candidate divisor for delay
  • Only consider those divisors that do not increase the delay
      F1 = a(2) + b(0) + c(0) + d(0) + e(0)
      F2 = a(2) + b(0) + c(0) + d(0) + f(0)
      >> D2(1) = b(0) + c(0) + d(0)
      F1 = D2S(1) + D2C(1) + e(0) + a(2)    Delay = 3 + D(Add)
      F2 = D2S(1) + D2C(1) + f(0) + a(2)    Delay = 3 + D(Add)
  D2 keeps the delay at the optimal 3 + D(Add), so it can be extracted.
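The delay check on a candidate divisor can be sketched as below. It assumes unit CSA delay and an earliest-first allocation for the remaining terms; the function names are hypothetical and this is not the authors' implementation.

```python
import heapq


def min_csa_delay(arrival_times, csa_delay=1):
    """Earliest-first CSA allocation; returns the latest input of the final adder."""
    heap = list(arrival_times)
    heapq.heapify(heap)
    while len(heap) > 2:
        a, b, c = (heapq.heappop(heap) for _ in range(3))
        ready = max(a, b, c) + csa_delay
        heapq.heappush(heap, ready)
        heapq.heappush(heap, ready)
    return max(heap)


def delay_with_divisor(expr, divisor, csa_delay=1):
    """Delay of expr when the three signals in divisor are forced into one CSA.

    expr: dict mapping signal name to arrival time.
    """
    ready = max(expr[s] for s in divisor) + csa_delay
    rest = [t for s, t in expr.items() if s not in divisor]
    return min_csa_delay(rest + [ready, ready], csa_delay)


F1 = {"a": 2, "b": 0, "c": 0, "d": 0, "e": 0}
baseline = min_csa_delay(F1.values())
candidates = [("a", "b", "c"), ("b", "c", "d")]
accepted = [d for d in candidates if delay_with_divisor(F1, d) <= baseline]
print(baseline, accepted)
```

With these arrival times a + b + c gives a delay of 5 and is filtered out, while b + c + d stays at 3 and is kept.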
Experimental results
• Comparing # of CSAs
[Figure: bar chart of # CSAs, Original vs. Optimized, for the examples H.264, DCT8, IDCT8, 6 tap FIR, 20 tap FIR, 41 tap FIR, and Average]
Average 38.4% reduction
Experimental results
• Synthesis for Standard Cell Designs
  • Synopsys™ Design Compiler
  • 0.25 micron library
  • Synthesized for minimum delay
• Area results
[Figure: bar chart of Area for the examples H.264, DCT8, IDCT8, 6 tap FIR, 20 tap FIR, 41 tap FIR, and Average; the two series are labeled only Series1 and Series2 in the source]
Avg 32.7% Area reduction
Avg 3.7% increase in delay
Experimental results
• FPGA synthesis
  • Virtex II FPGAs
  • Synthesized designs and performed place & route
• Reduction in LUTs and slices
[Figure: bar chart of % Reduction in LUTs and Slices for the examples H.264, DCT8, IDCT8, 6 tap FIR, 20 tap FIR, 41 tap FIR, and Average]
Avg 14.1% reduction in # Slices and Avg 12.9% reduction in # LUTs
Avg 5.7% increase in the delay
Experimental results
• Evaluate Delay aware extraction algorithm
  • Consider different arrival times of the signals
  • Assume delay dominated by gate delay (FA delay)
  • Only consider best case delay

Example     # of CSAs                       Delay (FA units)
            Delay ignorant   Delay aware    Delay ignorant   Delay aware
H.264       78               79             9                8
DCT8        222              232            14               13
IDCT8       195              201            14               13
FIR6tap     11               15             5                4
FIR20tap    34               45             6                5
FIR41tap    79               91             6                5
Average     103.2            110.5          9                8

Best delay with 15.5% increase in # of CSAs
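The summary figures can be recomputed from the table rows. The sketch below only restates the table; the reading that 15.5% is the mean of the per-example increases is an inference from the numbers, not something the slide states.

```python
# (CSAs delay-ignorant, CSAs delay-aware, delay ignorant, delay aware) per example
results = {
    "H.264":    (78, 79, 9, 8),
    "DCT8":     (222, 232, 14, 13),
    "IDCT8":    (195, 201, 14, 13),
    "FIR6tap":  (11, 15, 5, 4),
    "FIR20tap": (34, 45, 6, 5),
    "FIR41tap": (79, 91, 6, 5),
}

rows = list(results.values())
avg_ignorant = sum(r[0] for r in rows) / len(rows)
avg_aware = sum(r[1] for r in rows) / len(rows)
avg_delay_ignorant = sum(r[2] for r in rows) / len(rows)
avg_delay_aware = sum(r[3] for r in rows) / len(rows)

# Percentage increase in CSAs per example, then averaged over the examples.
increases = [100 * (aware - ignorant) / ignorant for ignorant, aware, _, _ in rows]
mean_increase = sum(increases) / len(increases)

print(round(avg_ignorant, 1), round(avg_aware, 1))
print(avg_delay_ignorant, avg_delay_aware)
print(round(mean_increase, 1))
```

In every example the delay aware version is one full adder delay faster, at the cost of the additional CSAs.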
Conclusions
• First methodology for common subexpression elimination for Carry Save Arithmetic
• Significant area/power reduction
• Delay aware optimization algorithm also developed
• Can be combined with CSA tree extraction methods for actual application improvement
Thank you!!
• Questions?