SPEED AND POWER TRADE-OFFS

Design of Power Efficient VLSI
Arithmetic: Speed and Power
Trade-offs
Vojin G. Oklobdzija, Ram Krishnamurthy
Intel AMR / ACSEL Laboratory
Intel Corp/ University of California Davis
www.ece.ucdavis.edu/acsel
Tutorial Presentation
16th International Symposium on Computer
Arithmetic
Santiago de Compostela, SPAIN
June 18, 2003
Issues to be addressed
• How do we compare different topologies for their efficiency?
• How do we estimate the speed and efficiency of our algorithm?
• What criteria should we use when developing a new algorithm?
• How does power enter into this equation?
Additional Issues
• Determine which topology is best for a given power or delay budget
• Determine which topology can stretch the furthest in terms of speed or power
Metric
Previously used estimates
Counting the number of gates (logic levels): not accurate
[Figure: carry-lookahead adder with inputs ai, bi, Cin and carries C4 through C28. Individual adders generate gi, pi, and sum Si. Carry-lookahead blocks of 4 bits generate Gi, Pi, and Cin for the adders. Carry-lookahead super-blocks of 4-bit blocks generate G*i, P*i, and Cin for the 4-bit blocks. A final group produces the carries Cout and C16.]
Critical path delay = (for gi,pi)+2x2 (for G,P)+3x2 (for Cin)+1XOR- (for Sum) = appx. 12of delay
Critical path in Motorola's 64-bit CLA
[Figure: block diagram of the 64-bit CLA. PG blocks take the bit-level signals P0, G0 through P63, G63 and produce 4-bit group signals (P3:0, G3:0 through P63:60, G63:60). Carry blocks combine these into wider groups (P15:0, G15:0; P31:0, G31:0; P47:0, G47:0; P63:0, G63:0) and produce the carries C4, C8, C12, C16, C32, C48, C52, C56, C60 and C64 from C0.]
Critical path: A, B - G0 - G3:0 - G15:0 - G47:0 - C48 - C60 - C63 - S63
Motorola's 64-bit CLA
Modified PG Block
Intermediate propagate signals Pi:0 are generated to speed up C3
Fan-In and Fan-Out Dependency
(Oklobdzija, Barnes: IBM 1985)
Delay Comparison: Variable Block Adder
(Oklobdzija, Barnes: IBM 1985)
[Figure: plot with axes Delay and Complexity]
Design Objective
• Design takes time:
– finding out the results afterward is not of much value
• There is a disconnect between the measures used in computer arithmetic when developing an algorithm and what is obtained after implementation
– we want estimates that are as close as possible to the measured results
• A simple tool that can evaluate different design trade-offs for a given technology is needed
• The power trade-off is the most important
– speed and power are tradable
Logical Effort Theory
• “Back of the Envelope” complexity: good for estimating speed
• Gate delay = linear function of load
– Slope: logical effort → gate driving characteristics
– Intercept: parasitic delay → gate internal load
• “Logical Effort” accuracy is not sufficient
– We needed to extend and refine the method
– However, that becomes more than “Back of the Envelope”
• Logical Effort does not account for possible power-delay trade-offs
Logical Effort Theory
• Excel – a platform of choice (ARITH-16)
– Simple enough
– Can provide computation quickly
– Easy to enter a given design
• Technology characterization is needed:
– This needs to be done only once: available for every design afterwards
– Domino gate = 2 stages, dynamic and static
• Different driving characteristics of these stages
• Multi-output gates (carry-lookahead, Ling/conditional sum)
• Energy model needs to be included
Energy Motivation
*courtesy of Intel Corp.
[Figure: processor thermal map (temperature in °C) with labels: cache, execution core, AGU, 120°C]
AGUs: performance and peak-current limiters
High activity → thermal hotspot
Goal: high-performance energy-efficient design
Kogge-Stone Adder
[Figure: 32-bit Kogge-Stone adder, bits 31 to 0; legend: PG, carry-merge gates, XOR]
Critical path = PG + 5 + XOR = 7 gate stages
Generate, Propagate fanout of 2, 3
Maximum interconnect spans 16b
Energy inefficient
Sparse-tree Adder Architecture
[Figure: 32-bit sparse-tree adder, bits 31 to 0, with carries C3, C7, C11, C15, C19, C23 and C27]
Generate every 4th carry in parallel
Side-path: 4-bit conditional sum generator
73% fewer carry-merge gates → energy-efficient
Kogge-Stone adder (8-stage)
Design Parameters
Adder pitch (um):                              10
Interconnect cap (fF/um):                      0.157
Gate cap (fF/um):                              1.15
Avg. input cap per gate (um):                  14
% interconnect to gate cap per pitch (I):      10%
Inv. L.E.:                                     2.24
Parasitic delay:                               3.8

$D = 8\,(GBH)^{1/8} \cdot 2.2 + 3.8\,P$

Kogge-Stone Adder
Stage   Logical      Branch       Int. pitch   Effective branch   Parasitic
        effort (G)   effort (B)   (C)          effort (B + I·C)   comp.
PG      0.6          2            1            2.1                1.3
CM0     1.48         2            2            2.2                2.5
CM1     0.59         2            4            2.4                1.6
CM2     1.48         2            8            2.8                2.5
CM3     0.59         2            16           3.6                1.6
CM4     1.48         1            0            1.0                2.5
XOR     1.69         1            0            1.0                3.0
Inv     1            1            0            1.0                1.0

Path branch effort (ΠBi) = 108.92
Path logical effort (ΠGi) = 1.14
Path effort = 124.63
Path delay = 93.97 ps
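The path efforts in the Kogge-Stone table can be recomputed with a minimal Python sketch. It assumes that I is the interconnect capacitance of one bit pitch divided by the average gate input capacitance (which evaluates to about 10%, as on the slide), and that the path electrical effort H is 1. The names `STAGES` and `effective_branch` are illustrative and not part of the original spreadsheet.

```python
from math import prod

# Design parameters as listed on the slide.
ADDER_PITCH_UM = 10.0        # adder bit pitch (um)
WIRE_CAP_FF_PER_UM = 0.157   # interconnect capacitance (fF/um)
GATE_CAP_FF_PER_UM = 1.15    # gate capacitance (fF/um)
AVG_INPUT_WIDTH_UM = 14.0    # average input capacitance per gate (um)

# Assumed definition of I: wire cap of one bit pitch / average gate input cap.
I = (ADDER_PITCH_UM * WIRE_CAP_FF_PER_UM) / (AVG_INPUT_WIDTH_UM * GATE_CAP_FF_PER_UM)

# (stage, logical effort G, branch effort B, wire span C in bit pitches)
STAGES = [
    ("PG", 0.60, 2, 1),
    ("CM0", 1.48, 2, 2),
    ("CM1", 0.59, 2, 4),
    ("CM2", 1.48, 2, 8),
    ("CM3", 0.59, 2, 16),
    ("CM4", 1.48, 1, 0),
    ("XOR", 1.69, 1, 0),
    ("Inv", 1.00, 1, 0),
]


def effective_branch(b, span, i=I):
    """Branch effort including the wire load: B + I*C."""
    return b + i * span


path_branch = prod(effective_branch(b, c) for _, _, b, c in STAGES)
path_logical = prod(g for _, g, _, _ in STAGES)
path_effort = path_logical * path_branch  # H = 1 assumed

print(f"I = {I:.4f}, B = {path_branch:.2f}, G = {path_logical:.2f}, F = {path_effort:.2f}")
```

Under these assumptions the products agree with the tabulated path branch effort, path logical effort and path effort to the printed precision.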
MXA2 – Architecture & Result
[Figure: 64-bit MXA2 adder. PG4 blocks and 4-bit sum blocks (S4) for the bit groups 0..3 through 60..63, a PG group block producing G03 and P03, and multiplexer circuits producing Sum0 to Sum3 from Cin.]
• Multiplexer-based
• Generate carries using radix-2 (P,G)
• 4-bit conditional sum selected by carries
• 4-b cell width = 17 μm
• 9-stage critical path
Per-stage effort = 3.7
Total effort delay = 33.3
Total parasitic = 22.5
Total delay = 55.8
HC2 – Architecture
[Figure: 64-bit HC2 adder, bits 63 to 0, levels L1 to L6. Even and odd bits pass through (p,g) generation (XOR2, NAND2), carry-merge stages CM1 to CM6 built from NOR2/OAI and NAND2/AOI gates, an odd-carry stage CMo, and the sum stage (XOR2).]
• Generate even carries using radix-2 (P,G)
• Generate odd carries from even carries
• CMOS adder for sum
• 1-b cell width: 4 μm
• 10-stage critical path
HC2 – Circuits & Results
[Figure: HC2 circuit schematics for the (p,g) generation, carry-merge and sum gates]
Static
Per-Stage Effort:     2.8τ
Total Effort Delay:   28.0τ
Total Parasitic:      34.5τ
Total Delay:          62.5τ
KS2 – Architecture & Results
[Figure: 64-bit KS2 adder, bits 63 to 0, levels L1 to L6, followed by Inv and Sum stages]
• Generate carries using radix-2 (P,G)
• CMOS adder for sum
• Similar circuits as HC2
• 1-b cell width: 4 μm
• 9-stage critical path
                      Static    Dynamic
Per-Stage Effort      3.0τ      2.11τ
Total Effort Delay    27.0τ     19.0τ
Total Parasitic       30.6τ     23.6τ
Total Delay           57.6τ     42.6τ
KS4 – Architecture
[Figure: 64-bit KS4 adder, bits 63 to 0, with stages G4/P4, G16/P16, Co and Sum]
• Generate carries using redundant radix-4 (P,G)
• Dynamic circuit
• 1-b cell width: 4 μm
• 6-stage critical path
KS4 – Circuits & Result
[Figure: KS4 dynamic circuit schematics for the G4, P4, G16, P16, HS/HSN and Sum gates, clocked by CK, with STB]
Dynamic
Per-Stage Effort:     2.3τ
Total Effort Delay:   13.8τ
Total Parasitic:      16.3τ
Total Delay:          30.1τ
CLA4 – Architecture
• Generate carries using radix-4 (P,G,C)
• 1-b cell width: 4 μm
• 15-stage critical path
[Figure: (P,G,C) network of PGC blocks over bits b0 to b63, with a P-path and a G-path, producing the carries C4 through C60 from Cin = C0]
CLA4 – Circuits & Result
[Figure: CLA4 dynamic circuit schematics: bit-level G, P, K generation from A, AN, B, BN; group signals P1:0 to P3:0 and G1:0 to G3:0; carries C0 to C3; and the sum gate with Ci, CiN and STB]
Dynamic
Per-Stage Effort:     1.4τ
Total Effort Delay:   21.0τ
Total Parasitic:      33.3τ
Total Delay:          54.3τ
LNG4 – Architecture
• Generate carries using Ling pseudo-carries
• Conditional sums selected by local & long carries
• 1-b cell width: 5.1 μm; 9-stage critical path
LNG4 – Circuits & Result
[Figure: LNG4 dynamic circuit schematics for the G3, G4, P4 and K gates, the long-carry gates (LC, LCH, LCL), and the conditional-sum gates SumL and SumH selected by C0L, C1L, C0H and C1H]
Dynamic
Per-Stage Effort:     2.4τ
Total Effort Delay:   21.6τ
Total Parasitic:      22.3τ
Total Delay:          43.9τ
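The totals on the preceding result slides follow the same pattern: total effort delay is the number of stages times the per-stage effort, and total delay adds the total parasitic. The Python sketch below only re-checks that arithmetic with the figures quoted on the slides (all in units of τ). The stage counts are taken from the "N-stage critical path" bullets, and the tuple layout is an illustrative choice.

```python
# (adder, stages N, per-stage effort, total effort delay, total parasitic, total delay)
RESULTS = [
    ("MXA2",          9,  3.7,  33.3, 22.5, 55.8),
    ("HC2 (static)",  10, 2.8,  28.0, 34.5, 62.5),
    ("KS2 (static)",  9,  3.0,  27.0, 30.6, 57.6),
    ("KS2 (dynamic)", 9,  2.11, 19.0, 23.6, 42.6),
    ("KS4",           6,  2.3,  13.8, 16.3, 30.1),
    ("CLA4",          15, 1.4,  21.0, 33.3, 54.3),
    ("LNG4",          9,  2.4,  21.6, 22.3, 43.9),
]


def total_delay(n_stages, stage_effort, parasitic):
    """Logical-effort path delay: N * f + P."""
    return n_stages * stage_effort + parasitic


for name, n, f, effort, parasitic, total in RESULTS:
    estimate = total_delay(n, f, parasitic)
    print(f"{name:14s} N*f = {n * f:5.2f}  N*f + P = {estimate:5.2f}  (slide: {total})")
```

Comparing adders this way separates the effort delay, which sizing can trade against energy, from the parasitic delay, which is fixed by the choice of gates.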
Results from Simulation
[Figure: bar chart of HSPICE delay and difference from the logical effort estimate (FO4) for the KS, KS-2, Ling, CS, HC, KS-4 and CLA adders]
Type      Adder   # Stages   LE (FO4)   SPICE (FO4)   Diff (FO4)
Static    KS2     9          11.8       10.9          -0.88
Static    MX2     9          11.4       12.8           1.41
Static    HC2     10         12.8       13.3           0.46
Dynamic   KS4     6          6.2        7.4            1.27
Dynamic   KS2     9          8.7        9.2            0.44
Dynamic   LNG4    9          9.0        9.5            0.51
Dynamic   HC2     10         9.8        9.9            0.08
Dynamic   CLA4    16         11.4       14.2           2.74
• Fairly consistent with logical effort analysis
• Per-stage delay
– 1.4 FO4 (static)
– 0.8 FO4 (dynamic)
Delay of Representative 64-b Adders
[Figure: bar chart of total delay (FO4), static and dynamic, for MXA2, HC2, KS2, QTA2, KS4 and LNG4]
What happens when power is considered?
[Figure: energy vs. delay curves for Adder A and Adder B, with points A and B and Regions 1 and 2 marked along the delay axis]
Energy-Delay Space
[Figure: energy vs. delay curves for different adders, with labels: speed barrier, power limit, Emin, Dmin]
Logical Effort
Delay in a Logic Gate
Delay of a logic gate has two components:
d = f + p
– f: effort delay (stage effort)
– p: parasitic delay
f = gh
– g: logical effort
– h: electrical effort = Cout/Cin (electrical effort is also called “fanout”)
• Logical effort describes relative ability of gate topology to deliver current (defined to be 1 for an inverter)
• Electrical effort is the ratio of output to input capacitance
*from Mathew Sanu / D. Harris
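As a minimal illustration of the gate delay model on this slide, the following Python sketch evaluates d = gh + p. The function names are illustrative, and the capacitances in the example are arbitrary numbers in a common unit.

```python
def electrical_effort(c_out, c_in):
    """Electrical effort (fanout) h = Cout / Cin."""
    return c_out / c_in


def gate_delay(g, h, p):
    """Gate delay d = f + p, with effort delay f = g * h."""
    return g * h + p


# Example: a gate with g = 1 and p = 1 driving four times its own input capacitance.
h = electrical_effort(c_out=4.0, c_in=1.0)
d = gate_delay(g=1.0, h=h, p=1.0)
print(h, d)  # 4.0 5.0
```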
Logical Effort Parameters: Inverter
[Figure: inverter delay vs. fanout h = Cout/Cin; straight line d = gh + p with p = 3.8 ps (parasitic delay)]
• d = gh + p
• Delay increases linearly with fanout
• More complex gates have greater g and p
*from Mathew Sanu / D. Harris
Normalized Logical Effort: Inverter
[Figure: normalized inverter delay d vs. fanout h = Cout/Cin; the line d = gh + p = h + 1 (g = 1, p = 1) is split into effort delay and parasitic delay]
*from Mathew Sanu / D. Harris
• Define delay of unloaded inverter = 1
• Define logical effort ‘g’ of inverter = 1
• Delay of complex gates can be defined w.r.t. d = 1
Computing Logical Effort
DEF: Logical effort is the ratio of the input capacitance of a gate to the input capacitance of an inverter delivering the same output current
• Measured from delay vs. fanout plots of simulated gates
• Or estimated, counting capacitance in units of transistor W
*from Mathew Sanu / D. Harris
L.E for Adder Gates
*from Mathew Sanu / D. Harris
[Figure: delay (ps) vs. fanout for Inverter, Static CM, Dyn PG, Dyn CM and Mux]
• Logical effort parameters obtained from simulation for std cells
• Define logical effort ‘g’ of inverter = 1
• Delay of complex gates can be defined w.r.t. d = 1
Normalized L.E
Gate type     Logical Eff. (g)   Parasitics (Pinv)
Inverter      1                  1
Dyn. Nand     0.6                1.34
Dyn. CM       0.6                1.62
Dyn. CM-4N    1                  3.71
Static CM     1.48               2.53
Mux           1.68               2.93
XOR           1.69               2.97
• Logical effort & parasitic delay normalized to that of inverter
*from Mathew Sanu
Delay of a string of gates
• Delay of a path: $D = \sum_i d_i = \sum_i g_i h_i + \sum_i p_i$
• $g_i$ and $p_i$ are constants
• To minimize path delay, optimal values of $h_i$ are to be determined
D is minimized when each stage bears the same effort, i.e. $g_i h_i = g_{i+1} h_{i+1}$
*from Mathew Sanu / D. Harris
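A short worked case suggests why equal stage efforts minimize D. It assumes a two-stage path whose path effort F = f_1 f_2 is fixed; the symbols f_1 and f_2 (the stage efforts g_1 h_1 and g_2 h_2) and p_1, p_2 are introduced here for the sketch only.

```latex
\begin{aligned}
D(f_1) &= f_1 + \frac{F}{f_1} + (p_1 + p_2), \\
\frac{dD}{df_1} &= 1 - \frac{F}{f_1^{2}} = 0
\;\Longrightarrow\; f_1 = \sqrt{F} = f_2 .
\end{aligned}
```

The second derivative, 2F/f_1^3, is positive, so this stationary point is a minimum.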
Minimizing path delay

• Logical effort of a string of gates: $G = \prod_i g_i$
• Path electrical effort: $H = \prod_i h_i = C_{out(path)} / C_{in(path)}$
• Branching effort: $b = (C_{on\text{-}path} + C_{off\text{-}path}) / C_{on\text{-}path}$
• Path branching effort: $B = \prod_i b_i$
• Path effort: $F = GBH$
Delay is minimized when each stage bears the same effort: $f = g_i h_i = F^{1/N}$
The minimum delay of an N-stage path is: $N F^{1/N} + P$
*from Mathew Sanu / D. Harris
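The path formulas on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' spreadsheet: the function names are hypothetical, and the backward sizing step assumes b_i is the branching at the output of stage i.

```python
from math import prod


def min_path_delay(g, b, H, p):
    """Return (F, f_opt, D_min) for an N-stage path.

    g: per-stage logical efforts, b: per-stage branching efforts,
    H: path electrical effort, p: per-stage parasitic delays.
    """
    n = len(g)
    F = prod(g) * prod(b) * H      # path effort F = G * B * H
    f_opt = F ** (1.0 / n)         # equal effort per stage
    return F, f_opt, n * f_opt + sum(p)


def stage_input_caps(g, b, f_opt, c_load):
    """Size the path backwards from the load: Cin_i = g_i * b_i * Cout_i / f_opt."""
    caps = []
    c = c_load
    for gi, bi in zip(reversed(g), reversed(b)):
        c = gi * bi * c / f_opt
        caps.append(c)
    return list(reversed(caps))


# Example: three inverters (g = 1, p = 1, no branching) driving 64x the input cap.
F, f_opt, d_min = min_path_delay([1, 1, 1], [1, 1, 1], 64, [1, 1, 1])
print(F, round(f_opt, 6), round(d_min, 6))
print(stage_input_caps([1, 1, 1], [1, 1, 1], 4.0, 64.0))
```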
Inclusion of Wire Delay
into Logical Effort
Wiring Load
• Wiring in hand analysis
– Only lumped capacitance included
• Wiring in HSPICE
– Short wire: 1-segment π-model RC network
– Long wire: 4-segment π-model RC network
– Using worst-case wire capacitance
• Wire length
– Estimated from most critical 1-bit pitch
Modeling interconnect cap.
• Include interconnect cap in branching factor
[Figure: a PG gate driving CM0 gates one adder bitpitch apart, shown without and with the interconnect capacitance Cint]
Without interconnect: $b = (C_{on\text{-}path} + C_{off\text{-}path}) / C_{on\text{-}path} = 2$
With interconnect: $b = (C_{on\text{-}path} + C_{off\text{-}path} + C_{int}) / C_{on\text{-}path} = 2 + C_{int} / C_{on\text{-}path} = 2 + I$
I: % interconnect cap to gate cap in 1 adder bitpitch
Branching
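[Figure: input CIN branches into two paths: gates g0, g1 (stage efforts f0, f1) driving COUT1, and gates g2, g3 (stage efforts f2, f3) driving COUT2]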
Logical Effort assumes the “branching” factor of this circuit to
be 2. This is incorrect and can create inaccuracies
Correction on Branching
[Figure: two-branch circuit: CIN drives gates g0, g1 (stage efforts f0, f1) to COUT1 and gates g2, g3 (stage efforts f2, f3) to COUT2]
f0 = f1 , f2 = f3
Td1 = (f0 + f1 + parasitics) 
Td2 = (f2 + f3 + parasitics) 
Minimum Delay occurs when Td1 = Td2
“Real” Branching Calculation
For the two branches driven from Cin, with $F_1 = g_0 g_1\,out_1$ and $F_2 = g_2 g_3\,out_2$:
$B_1 = \frac{F_1 + F_2}{F_1} = \frac{g_0 g_1\,out_1 + g_2 g_3\,out_2}{g_0 g_1\,out_1}$
$B_2 = \frac{F_1 + F_2}{F_2} = \frac{g_0 g_1\,out_1 + g_2 g_3\,out_2}{g_2 g_3\,out_2}$
Branching only equals 2 when: $g_0 g_1\,out_1 = g_2 g_3\,out_2$
This explains why we had to resort to Excel!
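The branching relation on this slide is read here as B1 = (F1 + F2)/F1, with F1 = g0·g1·out1 and F2 = g2·g3·out2. That reading is an assumption, since the slide's layout is only partly recoverable. Under it, the "real" branching of each branch can be sketched in Python; `real_branching` is a hypothetical helper name.

```python
from math import prod


def real_branching(g_this, out_this, g_other, out_other):
    """Branching effort seen by one branch when both are sized for equal delay.

    Assumed relation: B = (F_this + F_other) / F_this, with F = prod(g) * out.
    """
    f_this = prod(g_this) * out_this
    f_other = prod(g_other) * out_other
    return (f_this + f_other) / f_this


# Symmetric branches: the usual branching factor of 2.
print(real_branching([1, 1], 10.0, [1, 1], 10.0))  # 2.0

# Asymmetric loads: the lightly loaded branch sees a branching factor above 2.
b1 = real_branching([1, 1], 10.0, [1, 1], 30.0)
b2 = real_branching([1, 1], 30.0, [1, 1], 10.0)
print(b1, b2)
```

Because each branching factor depends on the loads and gates of the other branch, the sizing has to be recomputed whenever either branch changes, which is where a spreadsheet becomes convenient.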
Technology Characterization
Characterization Setup
• Logical Effort Requirements:
– Equalize input and output transitions.
• Logical Effort is characterized by varying the
h (Cout/Cin) of a gate. By using a variable
load of inverters each gate can be
characterized over the same range of loads.
• The Logical Effort of each gate is
characterized for each input.
• Energy is characterized for each output
transition of the gate caused by each input
transition.
i.e. for an inverter: energy is measured for tLH and tHL
LE Characterization Setup for
Static Gates
[Figure: input In drives a chain of gates terminated by a variable load; measured quantities: tLH, tHL, average, energy]
LE Characterization Setup for
Dynamic Gates
[Figure: input In drives a chain of gates terminated by a variable load; measured quantities: tHL, energy]
LE Table (Static CMOS)
• Technology: P/N ratio = 2 → τINV = 3.67, pINV = 4.29
• Measured on worst-case single-input switching
          Delay at fan-out                      g      p       g        p
Gate      2      3      4      6      8        (ps)   (ps)    (norm)   (norm)
INV       11.6   15.3   19.0   26.4   33.6     3.67   4.29    1.00     1.00
NAND2     16.3   20.0   24.0   32.4   40.6     4.08   7.90    1.11     1.84
NAND3     22.2   26.6   31.2   40.6   50.0     4.65   12.74   1.27     2.97
NOR2      20.5   25.4   30.6   41.1   51.9     5.25   9.77    1.43     2.28
TGXORi    34.9   42.6   50.2   64.4   79.8     7.43   20.19   2.03     4.71
TGXORs    22.3   28.2   34.2   45.7   56.5     5.71   11.12   1.56     2.59
TGMUXi    8.0    9.9    12.0   16.0   20.2     2.04   3.85    0.55     0.90
TGMUXs    26.0   33.0   39.0   53.0   68.0     6.97   11.76   1.90     2.74
AOI       23.2   28.5   34.1   45.3   56.7     5.60   11.82   1.52     2.76
OAI       21.3   26.7   32.1   43.6   55.3     5.68   9.69    1.55     2.26
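The g and p columns of the static CMOS table can be reproduced from the delay-vs-fan-out rows by a straight-line fit. The Python sketch below uses an ordinary least-squares fit, which is an assumption (the slides do not say how the lines were fitted), applied to the INV and NAND2 rows; the function name is illustrative.

```python
def fit_logical_effort(fanouts, delays):
    """Least-squares line delay = g * fanout + p; returns (g, p)."""
    n = len(fanouts)
    mean_h = sum(fanouts) / n
    mean_d = sum(delays) / n
    sxx = sum((h - mean_h) ** 2 for h in fanouts)
    sxy = sum((h - mean_h) * (d - mean_d) for h, d in zip(fanouts, delays))
    slope = sxy / sxx
    return slope, mean_d - slope * mean_h


FANOUTS = [2, 3, 4, 6, 8]
INV = [11.6, 15.3, 19.0, 26.4, 33.6]      # INV delays from the table
NAND2 = [16.3, 20.0, 24.0, 32.4, 40.6]    # NAND2 delays from the table

g_inv, p_inv = fit_logical_effort(FANOUTS, INV)
g_nand2, p_nand2 = fit_logical_effort(FANOUTS, NAND2)

print(f"INV:   g = {g_inv:.2f} ps, p = {p_inv:.2f} ps")
print(f"NAND2: g = {g_nand2:.2f} ps, p = {p_nand2:.2f} ps")
# Normalizing to the inverter gives the (norm) columns.
print(f"NAND2 normalized: g = {g_nand2 / g_inv:.2f}, p = {p_nand2 / p_inv:.2f}")
```

The fitted values agree with the tabulated g and p for these two gates to within rounding.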
Static CMOS Gates: Delay Graphs
[Figure: two plots of delay vs. fanout for static CMOS gates. One shows INV, NAND2, NAND3, NOR2, AOI and OAI; the other shows INV, TGXORi, TGXORs, TGMUXi and TGMUXs.]
Static Gates: Pull-up Delay Graph
[Figure: pull-up delay vs. fanout for INV, NAND2, NAND3, NOR2, AOI and OAI]
LE Table (Dynamic CMOS)
• Technology:
• Minimum-sized keeper included
• Measured on all-input switching of worst path
          Delay at fan-out                      g      p       g        p
Gate      2      3      4      6      8        (ps)   (ps)    (norm)   (norm)
DN2       9.9    12.6   16.0   21.7   27.3     2.92   4.04    0.80     0.94
DN3       12.7   14.7   18.3   24.7   31.2     3.15   5.82    0.86     1.36
DN4       16.0   19.1   23.2   30.2   37.8     3.65   8.46    1.00     1.97
Dk1ND2    13.7   16.7   20.7   27.9   36.1     3.75   5.76    1.02     1.34
Dk1NR2    10.6   13.2   16.7   23.2   29.5     3.19   3.95    0.87     0.92
DAOI_A    10.1   12.1   14.7   20.0   24.8     2.49   4.86    0.68     1.13
DOAI_O    8.8    11.3   14.0   19.2   24.0     2.55   3.75    0.69     0.87
Dynamic CMOS: Delay Graphs
[Figure: two plots of delay vs. fanout for dynamic CMOS gates. One shows N2, N3, N4, k1ND2, k1NR2, AOI_A and OAI_O; the other shows G4, P4, C4 and STBSum.]
Dynamic CMOS: Delay Graphs
[Figure: two plots of delay vs. fanout for dynamic CMOS gates, with curves labeled LG3, LP4, LC, Lsum, KSG4, KSP4, KSG16, KSP16, KSSum, G4 and P4]
Energy Calculation
Energy Calculation
16X Minimal Size Dyn-NAND
8X Minimal Size Dyn-NAND
Energy Calculation
Offset (parasitic + wiring energy) vs. size (in multiples of the gate size)
[Figure: offset vs. gate size (x) for inv, dgck, oai_o, daoi, tgxor, aoi_o, na2s and tgmuxs, each with a linear fit. Fitted lines shown: y = 3.89x + 14.5; y = 1.6382x + 11.988; y = 1.2559x + 6.762; y = 0.5538x + 12.338; y = 1.1413x + 10.22; y = 1.9595x + 9.621; y = 0.8931x + 4.6411; y = 1.0592x + 1.71]
Energy Calculation
[Figure: inverter energy [fJ] as a function of size and load [u]]
Energy Calculation
NAND-2
      Output Capacitance (u), rows 1 to 20        Energy [fJ], rows 1 to 20
M     1       5       10      15      20          1      5      10     15     20
0     1.12    5.6     11.2    16.8    22.4        2.51   12.6   25.1   37.7   50.2
1     2.24    11.2    22.4    33.6    44.8        3.70   18.5   37.0   55.4   73.9
2     3.36    16.8    33.6    50.4    67.2        4.85   24.2   48.5   72.7   97.0
3     4.48    22.4    44.8    67.2    89.6        6.16   30.8   61.6   92.4   123
4     5.6     28      56      84      112         7.45   37.3   74.5   112    149
5     6.72    33.6    67.2    100.8   134.4       8.74   43.7   87.4   131    175
6     7.84    39.2    78.4    117.6   156.8       10.2   50.8   102    152    203
7     8.96    44.8    89.6    134.4   179.2       11.5   57.5   115    172    230
8     10.08   50.4    100.8   151.2   201.6       12.7   63.6   127    191    254
9     11.2    56      112     168     224         14.2   70.8   142    213    283
10    12.32   61.6    123.2   184.8   246.4       15.5   77.6   155    233    310
11    13.44   67.2    134.4   201.6   268.8       16.9   84.4   169    253    337
12    14.56   72.8    145.6   218.4   291.2       18.1   90.5   181    271    362
13    15.68   78.4    156.8   235.2   313.6       19.7   98.5   197    296    394
14    16.8    84      168     252     336         20.9   104    209    313    418
15    17.92   89.6    179.2   268.8   358.4       22.6   113    226    339    452
16    19.04   95.2    190.4   285.6   380.8       23.9   120    239    359    479
17    20.16   100.8   201.6   302.4   403.2       25.3   127    253    380    506
18    21.28   106.4   212.8   319.2   425.6       26.7   134    267    401    534
19    22.4    112     224     336     448         28.1   140    281    421    561

Energy Factors
Output Capacitance Factor: 1.211300121
Multiplier Factor: 7.39E-01
Examples
64-Bit Adders
• Han-Carlson (prefix-2, HC2): Static and Dynamic
• Han-Carlson (prefix-2, HC2-2): Dynamic-Static
• Kogge-Stone (prefix-2, KS2): Static and Dynamic
• Kogge-Stone (prefix-2, KS2-2): Dynamic-Static
• Quaternary-Tree (prefix-2, QT2): Static and Dynamic

Len (um)   Delay (ps)
10         0.01
20         0.04
30         0.09
40         0.17
60         0.38
80         0.67
120        1.50
160        2.67
240        6.01
320        10.7
480        24.1

Included wire delay: $t_{delay} = 0.7\,R_{wire} C_{wire}$
Included wire energy: $E_w = C_{wire} V^2$
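Since both R_wire and C_wire grow linearly with wire length, the 0.7·R·C delay grows with the square of the length, which is the trend in the Len/Delay table. The Python sketch below checks this. The r·c product per unit length squared is not given on the slide, so it is calibrated here from the 480 um entry; that calibration and the function names are assumptions made for illustration.

```python
def wire_delay(length_um, rc_per_um2):
    """Wire delay 0.7 * R_wire * C_wire, with R and C each proportional to length."""
    return 0.7 * rc_per_um2 * length_um ** 2


def wire_energy(c_wire, vdd):
    """Wire energy E_w = C_wire * V^2."""
    return c_wire * vdd ** 2


# Len (um) -> Delay (ps), as tabulated on the slide.
TABLE = {10: 0.01, 20: 0.04, 30: 0.09, 40: 0.17, 60: 0.38, 80: 0.67,
         120: 1.50, 160: 2.67, 240: 6.01, 320: 10.7, 480: 24.1}

# Assumed calibration: choose r*c so that the longest wire matches the table.
RC_PER_UM2 = TABLE[480] / (0.7 * 480 ** 2)

for length, tabulated in TABLE.items():
    print(f"{length:4d} um: model {wire_delay(length, RC_PER_UM2):6.3f} ps, table {tabulated} ps")
```

With this single calibration point the quadratic model stays close to every tabulated entry, so doubling a wire's length roughly quadruples its delay contribution.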
Test Setup
[Figure: adder with inputs A0 to A63 and outputs S0 to S63; labels: 1 mm wire, Cwire]
H = (Cin + Cwire)/Cin
Energy-Delay Estimates
Adders: Energy
Energy vs. Delay
Cout = 1 mm wire (160u gate cap)
For Cin = ~minimum input to 50*minimum input
[Figure: energy [pJ] vs. delay [ps] for HC Dynamic (2-2), KS Dynamic (2-0), HC Dynamic (2-0), KS Dynamic (2-2), KS Static Prefix 2, HC Static Prefix 2, Quaternary Dynamic (2-2) and Quaternary Static. Curve groups are labeled Dynamic (KS, HC), Dynamic-Static, Static (KS, HC) and QT.]
Dynamic Static Implementation
of Carry-Merge stage
[Figure: carry-merge stage schematics, Regular Domino Implementation vs. Compound-Domino Implementation, with inputs Pi to Pi-2 and Gi to Gi-3 and clocks Clk and Delayed Clk; labels: inverters to be eliminated, Static Gate]
Energy-Delay comparison of 64-bit
KS, HC and QT adders
[Figure: normalized energy vs. normalized delay for KS, HC and QT adders in static and compound-domino implementations]
Adders: Critical Path Energy
Critical Path Energy vs. Delay (no internal wire energy)
Cout = 1 mm wire (160u gate cap)
For Cin = ~minimum input to 50*minimum input
[Figure: critical-path energy [fJ] vs. delay for HC Dynamic (2-2), KS Dynamic (2-0), HC Dynamic (2-0), KS Dynamic (2-2), KS Static Prefix 2, HC Static Prefix 2, Quaternary (2-2) and Quaternary Static (2-2). Curves are labeled QT dynamic-static, HC dynamic, KS dynamic, HC dynamic-static, KS dynamic-static, QT static, HC static and KS static.]
Intel 32-bit Adder 0.13u 1.2V [VLSI-2002]
Comparison with Intel Measured Data
[Figure: energy [fJ] vs. delay [ps] for the estimated Kogge-Stone (2-0) and Quaternary (2-2) adders against the Intel Kogge-Stone (2-0) and Intel Quaternary (2-2) data]
Energy-Delay comparison of 32-bit QT and KS adders: estimated vs. simulation in 0.10 μm technology
[Figure: energy [pJ] vs. delay [ps] for KS Estimate, QT Estimate, KS [9] and QT [9], with annotations 55% and 35%]
Est. Results: All Adders w/o Wires
[Figure: estimated energy (J) vs. delay (FO4) for sKS, sHC, sQT9, dKS, dHC, dQT9, dQT7, dCLA, dIBM and dLNG]
Est. Results: All Adders w/ Wires
[Figure: estimated energy (J) vs. delay (FO4) for sKS_LE, sHC_LE, sQT9_LE, dKS_LE, dHC_LE, dQT9_LE, dQT7_LE, dIBM_LE and dLNG_LE]
Conclusion
• Using realistic measures for comparing various designs leads to better design choices
• Power is as important as speed
• Making comparisons in Energy-Delay space is necessary:
– power can always be traded for speed and vice versa
• Wire effects are significant
• Leakage currents?