Proposal Slides - Technion – Israel Institute of Technology

Directions in Low-Power CAD
Dennis Sylvester
University of Michigan
[email protected]
http://vlsida.eecs.umich.edu
With acknowledgements to: Prof. David Blaauw, Dr. Sarvesh
Kulkarni, Saumil Shah, Kavi Chopra
Topics
 A new dual-Vth assignment formulation
 Dual-Vdd power distribution
 Approaches to parametric yield optimization: statistical
leakage + delay
Motivation
 We require high-performance yet low-power circuits
 Leakage power contributes significantly to total power
 All-high-Vth implementation too slow
 All-low-Vth implementation too leaky
 Dual-Vth processes popular
S. Narendra et al. [ICCAD ’03]
Gate Sizing + Vth Assignment Problem
 Problem Definition
  Minimize
    Total Circuit Power (switching + subthreshold leakage)
  Subject to
    Circuit Delay Constraint
    Sizing Constraints
  Optimization Variables
    Gate Sizes
    Gate Threshold Voltages
Prior Work
 Traditionally a discrete problem
 Previous approaches
  - Separate Sizing and Vth Assignment
  - Mixed Integer Non-Linear Programming
  - Sensitivity-based methods (DUET, etc.)
  - Continuous formulation [Chen, ASP-DAC ‘05]
      Very reliant on discretization heuristic
Proposed Approach – Self-snapping formulation
 Continuous formulation – Use of large variety of
algorithms/powerful non-linear optimizers possible
 Solution has almost all gates assigned to one of the
two available threshold voltages
 Small fraction of gates with intermediate Vth’s, can
be handled heuristically
 Discretization algorithm has negligible power impact
and can be very simple
Proposed Approach – Mixed-Vth Gates
 Consider each gate to be a parallel combination of high-Vth and low-Vth gates
 RC Delay Model
  $D = R_{eff} C_l$
  $R_{eff} = \dfrac{(R_l/W_l)(R_h/W_h)}{R_l/W_l + R_h/W_h} = \dfrac{R_l R_h}{R_l W_h + R_h W_l}$
  $C_l = C_{Load} + K_{SL}(W_l + W_h)$
 Linear Power Model
  $P = P_{LVt} + P_{HVt} = P_l W_l + P_h W_h$
[Figure: a mixed gate drawn as an LVt gate in parallel with an HVt gate]
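The mixed-gate delay and power models on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation; the function name `mixed_gate` and all numeric parameter values are hypothetical and unitless.

```python
def mixed_gate(w_l, w_h, r_l, r_h, c_load, k_sl, p_l, p_h):
    """Delay and power of a gate modeled as low-Vth and high-Vth parts in parallel.

    w_l, w_h : widths of the low-Vth and high-Vth parts (at least one > 0)
    r_l, r_h : resistance per unit width of each part (R = r / W)
    c_load   : external load capacitance
    k_sl     : self-loading capacitance per unit width
    p_l, p_h : power per unit width of each part
    """
    if w_l < 0 or w_h < 0 or (w_l == 0 and w_h == 0):
        raise ValueError("widths must be non-negative and not both zero")
    # Parallel resistances add as conductances: 1/R_eff = W_l/R_l + W_h/R_h.
    r_eff = 1.0 / (w_l / r_l + w_h / r_h)
    c_l = c_load + k_sl * (w_l + w_h)      # C_l = C_Load + K_SL (W_l + W_h)
    delay = r_eff * c_l                    # D = R_eff * C_l
    power = p_l * w_l + p_h * w_h          # P = P_l W_l + P_h W_h
    return delay, power


# Hypothetical, unitless parameters chosen only for illustration.
PARAMS = dict(r_l=1.0, r_h=1.5, c_load=4.0, k_sl=0.5, p_l=10.0, p_h=1.0)

all_low = mixed_gate(2.0, 0.0, **PARAMS)    # pure low-Vth gate
all_high = mixed_gate(0.0, 3.0, **PARAMS)   # pure high-Vth gate
mixed = mixed_gate(1.0, 1.5, **PARAMS)      # some of each
```

With these made-up numbers the pure low-Vth gate is slightly faster but far more power hungry than the pure high-Vth gate, and the mixed gate lies in between.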
Complete Dual-Vth Problem Formulation
 Similar to single-Vth gate sizing problem, with simple gate delays replaced with high-Vth/low-Vth parallel combinations
 Minimize
  $\sum_{i \in G} \left( P_{l,i} W_{l,i} + P_{h,i} W_{h,i} \right)$
 Subject to:
  $a_j \le A_0$
  $a_j + D_i \le a_i, \quad \forall i \in (\{1,\dots,n\} \setminus \{\text{inputs}\}),\ \forall j \in \{\text{input}(i)\}$
  $D_i \le a_i, \quad \forall i \in \{\text{inputs}\}$
  $0 \le W_{l,i}, \quad i = 1,\dots,n$
  $0 \le W_{h,i}, \quad i = 1,\dots,n$
  $L_i \le W_{l,i} + W_{h,i} \le U_i, \quad i = 1,\dots,n$
Proof of Discretized Solution
 Conceptually separate optimization process into two distinct phases:
  - D-Phase: Fix delays of all gates
  - W-Phase: Find the minimum-power sizing solution that satisfies the chosen D vector
 Hypothetical separation for proof only; not implemented in the actual optimization procedure
W-Phase
 Proof of discrete optimal solution under arbitrary D-vector is sufficient
 W-Phase formulation
 Minimize
  $\sum_{i \in G} \left( P_{l,i} W_{l,i} + P_{h,i} W_{h,i} \right)$
 Subject to:
  $R_{l,i} W_{h,i} + R_{h,i} W_{l,i} = \dfrac{R_{l,i} R_{h,i}}{D_i} \left( \sum_{j \in fanout(i)} C_{inp,j} (W_{l,j} + W_{h,j}) + K_{SL} (W_{l,i} + W_{h,i}) \right), \quad i = 1,\dots,n$
  $W_{l,i} \ge 0, \quad i = 1,\dots,n$
  $W_{h,i} \ge 0, \quad i = 1,\dots,n$
W-Phase
 Linear programming problem
 n basic variables, n non-basic variables
 Therefore, only n non-zero variables
 Every gate snapped to either high-Vth or low-Vth
 Addition of upper and lower bounds on total size
leads to some non-snapped gates
 Number extremely small – simple heuristic achieves
good results
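The snapping argument can be illustrated on a single gate driving a fixed load. The sketch below assumes the gate delay is fixed at `d` as an equality; all names and numbers are hypothetical. The feasible widths then lie on a straight line in the (W_l, W_h) plane, so a linear power objective is minimized at an endpoint where one of the two widths is zero.

```python
def w_phase_single_gate(d, c_load, k_sl, r_l, r_h, p_l, p_h):
    """Minimum-power (w_l, w_h) for one gate whose delay is fixed at d.

    Constraint (linear in the widths):
        r_l*w_h + r_h*w_l == (r_l*r_h/d) * (c_load + k_sl*(w_l + w_h))
    Setting w_h = 0 gives w_l = r_l*c_load / (d - r_l*k_sl), and
    setting w_l = 0 gives w_h = r_h*c_load / (d - r_h*k_sl).
    The linear cost p_l*w_l + p_h*w_h is minimized at one of these endpoints.
    """
    candidates = []
    if d > r_l * k_sl:                      # pure low-Vth gate can meet d
        w_l = r_l * c_load / (d - r_l * k_sl)
        candidates.append((p_l * w_l, (w_l, 0.0)))
    if d > r_h * k_sl:                      # pure high-Vth gate can meet d
        w_h = r_h * c_load / (d - r_h * k_sl)
        candidates.append((p_h * w_h, (0.0, w_h)))
    if not candidates:
        raise ValueError("delay target is below the self-loading limit")
    return min(candidates)[1]               # endpoint with the lowest power


# Hypothetical, unitless parameters chosen only for illustration.
GATE = dict(c_load=4.0, k_sl=0.5, r_l=1.0, r_h=1.5, p_l=5.0, p_h=1.0)

relaxed = w_phase_single_gate(d=2.5, **GATE)   # loose delay target
tight = w_phase_single_gate(d=0.8, **GATE)     # tight delay target
```

With these numbers the loose target snaps the gate to high-Vth and the tight target snaps it to low-Vth; in neither case is a mixture optimal.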
Practical Constraint – Fixed-Width
Input Drivers
 Sequential elements driving the combinational circuit
 Delay of these elements affected by primary input
widths
 Modeled as fixed-width drivers
Extension of Discretization
Analysis
 m+n constraints in the optimization problem
 n+m basic variables, n-m non-basic variables
 Therefore, n+m positive variables
 Total number of non-snapped gates bounded by
number of inputs


Once again, small in number; can be handled
heuristically
In practice, number of non-snapped gates found to be
much less than the number of inputs
Discretization Heuristics
 Iterative snapping
  - Round gates to the closer Vth and re-optimize until no non-snapped gates remain
 Single-pass Vth assignment
  - Fix all gates to the closer Vth and re-optimize only for gate sizes
 Second heuristic is faster, with negligible power impact
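A minimal sketch of the single-pass assignment step follows. It assumes that the "closer Vth" of a mixed gate is the flavor holding the larger share of its width, which is one plausible reading and not something the slides state; the helper name is hypothetical, and the subsequent size-only re-optimization is not shown.

```python
def snap_single_pass(widths):
    """Assign each gate to low-Vth ('L') or high-Vth ('H') in one pass.

    widths: list of (w_l, w_h) pairs from the continuous solution.
    Assumption: the closer Vth is the one with the larger width share.
    Returns a list of (vth, total_width) pairs; total_width is only a
    starting point for the size-only re-optimization that would follow.
    """
    snapped = []
    for w_l, w_h in widths:
        vth = "L" if w_l >= w_h else "H"
        snapped.append((vth, w_l + w_h))
    return snapped


# Two already-snapped gates and one gate with an intermediate Vth.
solution = [(2.0, 0.0), (0.0, 3.0), (1.2, 0.9)]
assignment = snap_single_pass(solution)
```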
Results
 Snapping properties of some circuits
[Figure: % of non-snapped gates vs. timing backoff (%), 0 to 20, for c5315 and c7552]
[Figure: % of total non-snapped gates, and % of total non-snapped gates due to fixed-width input drivers, per circuit (c2670, c3540, c5315, c6288, c7552, i8, i9, i10)]
 # of non-snapped gates is very small
 Dominated by gates at upper and lower size bounds
 Approach is easily extendable to multi-Vth AND multi-Lgate
Results
 Power and runtime comparisons between proposed approach and sensitivity-based algorithm (SBA) at 2% timing backoff (results shown for larger circuits only)
 Average: 31% leakage reduction vs. previous approaches

          SBA             Continuous Formulation   % Improvement    Runtime (s)
Ckt       Static  Dyn.    Static  Dyn.    Total    Static  Total    SBA    Cont
C3540     0.26    0.74    0.16    0.78    0.94     38.14   6.46     28     51
C5315     0.22    0.78    0.15    0.80    0.95     30.53   5.11     52     133
C6288     0.35    0.65    0.26    0.65    0.91     24.69   9.04     136    443
C7552     0.31    0.69    0.24    0.68    0.91     23.93   8.87     94     171
i8        0.24    0.76    0.19    0.75    0.94     21.57   5.87     24     35
i9        0.20    0.80    0.16    0.77    0.94     17.65   6.47     9      21
i10       0.31    0.69    0.23    0.69    0.92     24.8    7.69     287    373
Topics
 A new dual-Vth assignment formulation
 Dual-Vdd power distribution
 Approaches to parametric yield optimization: statistical
leakage + delay
Multiple supply design
[Figure: flip-flop (FF) bounded logic with VDDH and VDDL gates; a VDDL-swing input (IN) driving a VDDH gate draws DC current, hence the need for level conversion]
 Relies on applying a lower supply (VDDL) to gates along non-critical paths, thus reducing power while meeting timing
 A flexible fine-grained VDD assignment scheme promises best power reduction
  - Gate-level Extended Clustered Voltage Scaling
 However, physical design and power delivery are complicated
Implications of using multiple supplies
[Diagram: critical and non-critical paths from IN to OUT, surrounded by the coupled issues listed below]
 Coupled issues
  - Circuits: level shifting
  - Algorithms: VDD assignment (CVS, ECVS)
  - Physical design: VDD granularity (fine-grained, islanding)
  - Power delivery: distribution, generation
Power delivery for dual-VDD circuits
 Power grid integrity vital for circuit performance
 Dual-VDD circuits require two supply voltages for operation
 Fine-grained dual-VDD can place VDDL/VDDH gates arbitrarily on the die
 Implications at the board, package and die level
  - Fixed resources need to be split between VDDL and VDDH
 However, the load on each supply is lower than on the original single supply: the power supply current demanded by a dual-VDD circuit is significantly lower than that of the corresponding single-VDD circuit, allowing robust power delivery within available resources (decap, C4, wiring)
Reduced current load on VDDL/VDDH
 Gate level comparison
  - Avg. 54% (33%) for VDDL = 0.8V (0.6V)

           Single-VDD            Dual-VDD: VDDL=0.8V    Dual-VDD: VDDL=0.6V
Gate       Low-VTH   High-VTH    Low-VTH   High-VTH     Low-VTH   High-VTH
INVX10     1.00      0.90        0.57      0.49         0.36      0.27
NAND2X2    1.00      0.85        0.54      0.45         0.34      0.23
NAND3X6    1.00      0.88        0.55      0.47         0.35      0.24
NOR2X1     1.00      0.86        0.52      0.39         0.30      0.19
NOR3X4     1.00      0.85        0.50      0.37         0.29      0.18
AVERAGE    1.00      0.88        0.54      0.44         0.33      0.23
 Circuit level comparison (ECVS)
  - Avg. 49% (51%) and 28% (14%) for VDDH and VDDL for 0.8V (0.6V)

             Single VDD   Dual VDD: VDDL=0.8V   Dual VDD: VDDL=0.6V
Circuit      VDD          VDDH      VDDL        VDDH      VDDL
c880         9.7          5.6       2.2         5.9       1.3
c2670        23.6         11.9      6.5         10.1      3.0
c5315        36.7         20.9      7.2         20.9      3.6
c7552        47.9         13.9      19.4        20.4      8.5
AVERAGE %    100.0        48.5      27.7        50.7      13.5
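The AVERAGE % row of the circuit-level comparison can be checked with a short calculation. The sketch assumes, as the 100.0 entry for single VDD suggests, that each average is the mean over the four circuits of the dual-VDD current divided by the single-VDD current; the table values are copied from the slide and the names are hypothetical.

```python
# Supply currents per circuit, in the column order of the slide:
# (single VDD, VDDH @ VDDL=0.8V, VDDL @ 0.8V, VDDH @ VDDL=0.6V, VDDL @ 0.6V)
CURRENTS = {
    "c880":  (9.7, 5.6, 2.2, 5.9, 1.3),
    "c2670": (23.6, 11.9, 6.5, 10.1, 3.0),
    "c5315": (36.7, 20.9, 7.2, 20.9, 3.6),
    "c7552": (47.9, 13.9, 19.4, 20.4, 8.5),
}


def average_percent(column):
    """Mean over circuits of column current / single-VDD current, in %."""
    ratios = [row[column] / row[0] for row in CURRENTS.values()]
    return 100.0 * sum(ratios) / len(ratios)


averages = [average_percent(col) for col in range(1, 5)]
```

The computed averages agree with the slide's 48.5, 27.7, 50.7 and 13.5 to within about 0.1 percentage point, which is consistent with rounding of the tabulated currents.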
Package level results
 Two VRMs on board to supply VDDL and VDDH
 Ground path can be shared by VDDL and VDDH
 Decoupling capacitance divided in the ratio of current loads
[Figure: board/socket/package/die RLC power-delivery model (Lmb/Rmb, Lskt/Rskt, bulk, high-frequency and package decoupling capacitors, Lpkg/Rpkg, Rdie/Cdie) feeding the load from VDD, or from separate VDDH and VDDL paths, in three configurations: (1) Single-VDD, (2) Dual-VDD with VDDL = 0.6V, (3) Dual-VDD with VDDL = 0.8V]

Power supply noise (rows PK and QS as labeled on the slide)

Rail: VDD or VDDH & VDDL
      Single-VDD     Dual-VDD, VDDL = 0.6V          Dual-VDD, VDDL = 0.8V
      VDD            VDDH           VDDL            VDDH           VDDL
      mV     %       mV     %       mV     %        mV     %       mV     %
PK    92.7   7.7     63.0   5.3     18.0   3.0      63.0   5.3     37.0   4.6
QS    65.0   5.4     34.0   2.8     9.0    1.5      32.0   2.7     18.0   2.3

Rail: GND
      Single-VDD     Dual-VDD, VDDL = 0.6V          Dual-VDD, VDDL = 0.8V
      VDD            VDDH           VDDL            VDDH           VDDL
      mV     %       mV     %       mV     %        mV     %       mV     %
PK    92.7   7.7     68.9   5.7     68.9   11.5     77.8   6.5     77.8   9.7
QS    65.0   5.4     40.7   3.4     40.7   6.8      46.0   3.8     46.0   5.7

 Similar power supply noise with same resources as single-VDD case (decoupling capacitance, C4s)
Intel, “Intel Pentium 4 processor in the 432 pin/Intel 850 Chipset Platform,” 2002.
Dual-VDD physical design alternatives
[Figure: row-based layouts for Single-VDD, Dual-VDD segregated and Dual-VDD fine-grained placement, with VDDH, VDDL and GND rails and rows labeled "VDDH + VDDL row"]
 Segregated placement constrains the placer, leading to higher core area and wirelength
C. Yeh, et al., “Layout techniques supporting the use of dual supply voltages for cell-based designs,” Proc. DAC, 1999.
M. Igarashi, et al., “A low-power design method using multiple supply voltages,” Proc. ISLPED, 1997.
Dual-VDD power grid alternatives
 Routing the power supply rails
[Figure: power grids for Single-VDD (VDD, GND), Dual-VDD Shared-GND (VDDH, VDDL, shared GND) and Dual-VDD Dual-GND (VDDH, GNDH, VDDL, GNDL)]
 Dual-VDD standard cell topologies
[Figure: 3-rail cell (VDDH, VDDL, shared GND) and 4-rail cell (VDDH, GNDH, VDDL, GNDL)]
 Dual-VDD Dual-GND requires two separate grounds off-chip and complicates timing analysis and design of the board itself
 Multi-rail standard cells can be used to realize the Dual-VDD grids; this allows the placer to operate with no constraints
Dual-VDD on-chip power grid design
 Guidelines while designing the dual-VDD grid:
  - Scale wires with respect to the single-VDD grid, considering how the current demand has scaled
  - VDDL gates are more sensitive to grid noise; important since ground is shared
      120mV noise is 10% for a 1.2V gate, but 20% for a 0.6V gate
  - Placement of VDDL and VDDH gates: assign more wiring resources to the VDDL grid in areas where there is more demand for VDDL current
  - Consider effects that arise from the board and package level such as shared C4s
      Fewer C4s leads to higher effective package R, L
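The noise-sensitivity guideline is simple arithmetic: the same absolute ground bounce is a larger fraction of a lower supply. A minimal sketch, with a hypothetical helper name:

```python
def noise_percent(noise_mv, supply_v):
    """Grid noise expressed as a percentage of the gate's supply voltage."""
    return 100.0 * (noise_mv / 1000.0) / supply_v


# The same 120 mV of noise on a shared ground, seen by gates on
# different supplies.
at_1v2 = noise_percent(120.0, 1.2)   # gate on a 1.2 V supply
at_0v6 = noise_percent(120.0, 0.6)   # gate on a 0.6 V supply
```

This reproduces the 10% and 20% figures quoted in the guideline.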
Proposed technique: D-Place
 Let $\alpha = I(VDDH)/I(VDD)$ and $\beta = I(VDDL)/I(VDD)$
 Scale the VDDH, VDDL and GND wire widths ($W_{VDDH}$, $W_{VDDL}$, $W_{GND}$) relative to the single-VDD wire width $W$ according to $\alpha$ and $\beta$
 Partition the chip floorplan into global, regional and local areas
 Obtain effective $\alpha$ and $\beta$ as follows ($\beta_{eff}$ analogously):
  $\alpha_{eff} = \dfrac{\alpha_{local} + \alpha_{regional}\,\frac{Area_{local}}{Area_{regional}} + \alpha_{global}\,\frac{Area_{local}}{Area_{global}}}{1 + \frac{Area_{local}}{Area_{regional}} + \frac{Area_{local}}{Area_{global}}}$
 Flow
  1. Start from the original single-VDD design and obtain the dual-VDD design
  2. Obtain current consumption of single/dual-VDD designs (SPICE)
  3. Break down die into “local” & “regional” areas
  4. Calculate local, regional, global & effective α & β for each wire segment
  5. Size each wire segment in each local area using effective α, β & simulate grid
  6. Measure voltage droop/bounce and wire congestion
 Inputs: single-VDD lib file, dual-VDD lib file, placement database (Cadence)
[Figure: chip floorplan partitioned into global, regional and local areas, with VDDH, VDDL and GND wires]
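The effective current ratio used to size each wire segment can be sketched as an area-weighted average. The weighting below is an assumption: it is one plausible reading of the slide's expression, in which the regional and global ratios are weighted by Area_local/Area_regional and Area_local/Area_global. The function name and numbers are hypothetical.

```python
def effective_ratio(local, regional, global_, area_local, area_regional, area_global):
    """Area-weighted blend of local, regional and global current ratios.

    Assumed form:
        (local + regional*w_r + global_*w_g) / (1 + w_r + w_g)
    with w_r = area_local / area_regional and w_g = area_local / area_global.
    """
    w_r = area_local / area_regional
    w_g = area_local / area_global
    return (local + regional * w_r + global_ * w_g) / (1.0 + w_r + w_g)


# Hypothetical alpha = I(VDDH)/I(VDD) values at the three scales.
alpha_eff = effective_ratio(0.8, 0.6, 0.5,
                            area_local=1.0, area_regional=4.0, area_global=16.0)

# If all three scales agree, the effective ratio equals that common value.
uniform = effective_ratio(0.5, 0.5, 0.5, 1.0, 4.0, 16.0)
```

Under this assumed form the local ratio dominates, and larger surrounding areas contribute progressively less.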
Peak voltage drop comparisons

VDDL = 0.6V
             c880            c2670           c5315           c7552
             MAX     AVG     MAX     AVG     MAX     AVG     MAX     AVG
Single VDD   16.9%   9.5%    25.6%   15.9%   29.6%   21.6%   26.8%   22.2%
DVDG         30.9%   14.7%   35.5%   19.8%   38.2%   23.4%   34.2%   21.0%
D-Vanilla    16.4%   9.6%    32.2%   15.2%   37.4%   20.2%   34.5%   21.1%
D-Place      18.6%   9.5%    25.5%   14.5%   32.0%   19.8%   29.4%   18.7%

VDDL = 0.8V
             c880            c2670           c5315           c7552
             MAX     AVG     MAX     AVG     MAX     AVG     MAX     AVG
Single VDD   16.9%   9.5%    25.6%   15.9%   29.6%   21.6%   26.8%   22.2%
DVDG         30.3%   15.9%   36.1%   22.1%   38.1%   25.4%   31.4%   24.9%
D-Vanilla    16.3%   9.7%    27.6%   15.8%   33.0%   20.1%   31.6%   22.3%
D-Place      19.5%   9.8%    27.0%   15.3%   31.8%   20.3%   28.7%   20.1%
 D-Place grids better than single-VDD grids in AVG cases
 Inferior by < 2.6% (≈15mV) in some MAX cases
 0.6V VDDL as robust as 0.8V
 0.6V also provides higher power savings
 Proposed approach better by 2-7% (AVG) and 7-12% (MAX) compared to
prior approaches
Voltage variation across die
 Voltage drop contours
[Figure: voltage drop contour maps over the die (X axis and Y axis in mm, 0.0 to 0.7) for the single-VDD grid and the D-Place dual-VDD grid; contour levels span 12.00 to 28.00 in one panel and 15.00 to 25.00 in the other]
 Wiring congestion similar for dual-VDD vs. single-VDD grids
 Lower current demands can lead to smaller amounts of decoupling cap and hence lower leakage (or use same decap for better performance)
 Dual-VDD grid no less robust than single-VDD grid
Topics
 A new dual-Vth assignment formulation
 Dual-Vdd power distribution
 Approaches to parametric yield optimization:
statistical leakage + delay
Introduction
 Sources of variation: optical proximity effects, chemical mechanical polishing
[Figure: distribution P over the process parameter-space mapped to the chip performance-space; low-leakage, poor-timing dies are timing yield loss and good-timing, high-leakage dies are power yield loss]
 This Work: Optimize the timing and power yield using gate sizing
Problem Description
 Nonlinear Continuous Optimization
  - Objective: Maximize Timing and Power Yield
  - Yield: A utility function defined w.r.t. the JPDF of leakage and timing
  - Decision Variables: Gate Size
[Figure: joint PDF with the yield region bounded by Pconst and Tconst]
 Efficient implementation requires
  - Computing yield as a function of the decision variables (gate size)
  - Fast and accurate gradient computation
Power and Timing Yield Analysis (see DAC05 for more detail)
 Timing Analysis [Sapatnekar03, Chandu05]
  $Delay = d_0 + \sum_{i=1}^{n} d_i X_i + d_{n+1} R$, giving $(\mu_d, \sigma_d)$
 Power Analysis
  $\log(Leakage) = l_0 + \sum_{i=1}^{n} l_i X_i + l_{n+1} R$, giving $(\mu_l, \sigma_l)$
 Correlation (1 parameter)
  $\rho$, determined by $\sum_{i=1}^{n} d_i l_i$
 Delay and Power Bivariate JPDF
  $(\mu_d, \sigma_d, \mu_l, \sigma_l, \rho)$
[Figure: bivariate JPDF over Delay and Log(Leakage)]
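The five JPDF parameters can be computed directly from the two canonical forms. The sketch below assumes that the shared variables X_i are independent standard normals and that the R terms of delay and log-leakage are independent standard normals uncorrelated with each other; these assumptions, the function names and the coefficients are illustrative and not taken from the slides.

```python
import math


def canonical_stats(coeffs):
    """Mean and sigma of c0 + sum(c_i * X_i) + c_{n+1} * R.

    coeffs = [c0, c1, ..., cn, c_{n+1}]; X_i and R are assumed to be
    independent standard normal variables.
    """
    mean = coeffs[0]
    sigma = math.sqrt(sum(c * c for c in coeffs[1:]))
    return mean, sigma


def delay_leakage_correlation(d, l):
    """Correlation between delay and log-leakage.

    Only the shared X_i terms contribute to the covariance, because the
    two R terms are assumed to be independent of each other.
    """
    _, sigma_d = canonical_stats(d)
    _, sigma_l = canonical_stats(l)
    shared = sum(di * li for di, li in zip(d[1:-1], l[1:-1]))
    return shared / (sigma_d * sigma_l)


# Hypothetical coefficients: parameters that slow the circuit (positive
# delay sensitivity) also reduce leakage (negative sensitivity).
d = [1.0, 0.03, 0.04, 0.0]
l = [2.0, -0.3, -0.4, 0.0]

mu_d, sigma_d = canonical_stats(d)
mu_l, sigma_l = canonical_stats(l)
rho = delay_leakage_correlation(d, l)
```

With no independent R contribution, as in this example, delay and log-leakage are perfectly inversely correlated.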
Cut Set SSTA: Intuition
 Consider Timing Graph Cut Edge Time (CT)
  - Arrival Time (AT)
  - Required Arrival Time (RT)
[Figure: 10-node timing graph in which gate 7 is sized up; a cut separates the unperturbed left sub-graph from the unperturbed right sub-graph; traditional incremental timing is contrasted with taking the max cut edge time (CT)]
 If forward SSTA matches reverse SSTA, then Cut Set SSTA will give exactly the same sensitivities as the naïve approach that recomputes yield over all nodes, most of which are unchanged
Statistical Yield Optimization Results
Constraints: $D < D_{\mu,initial}$, $P < P_{\mu,initial}$

Circuit   Yield without Lgate biasing (%)   Yield with Lgate biasing (%)
c432      45.4                              80.2
c499      39.2                              59.0
c880      49.3                              83.2
c1908     47.9                              82.8
c2670     51.1                              85.3
c3540     51.2                              87.1
c5315     50.0                              87.3
c6288     50.3                              86.5
c7552     51.2                              80.8

 Initial yield ~0-2% due to inverse correlation
 Gate sizing alone provides good improvements
 Combined with Lgate biasing, provides outstanding results
Chopra, et al., ICCAD05
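The very low initial yield follows from the inverse correlation between delay and leakage. As a sketch, assume delay and log-leakage are jointly normal with correlation rho and that both constraints sit exactly at the means of that pair. The probability of meeting both constraints is then given by the standard orthant formula for a bivariate normal, 1/4 + arcsin(rho)/(2*pi). The function name is hypothetical.

```python
import math


def yield_at_mean_constraints(rho):
    """P(delay < mean delay and log-leakage < mean log-leakage).

    Assumes a bivariate normal pair with correlation rho and both
    constraints placed at the means (orthant probability).
    """
    if not -1.0 <= rho <= 1.0:
        raise ValueError("correlation must lie in [-1, 1]")
    return 0.25 + math.asin(rho) / (2.0 * math.pi)


independent = yield_at_mean_constraints(0.0)    # uncorrelated metrics
inverse = yield_at_mean_constraints(-0.99)      # strong inverse correlation
```

For uncorrelated metrics the yield would be 25%, while a correlation of -0.99 gives roughly 2%, in line with the initial yield quoted on the slide.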
Another approach to statistical optimization
 General statistical optimization

Method relies on efficient deterministic formulations and
variation space sampling to drive statistical optimization

Applicable to many mainstream VLSI design problems: gate
sizing, Vth assignment, Leff biasing as well as potential new
levers
Statistically Optimized Body Bias Clustering for Post-Silicon Tuning
 Concept: Speed up critical gates using FBB and slow down non-critical gates using RBB to meet timing and power constraints
  $V_{th} = V_{th0} + \gamma \left( \sqrt{2\phi_F + V_{sb}} - \sqrt{2\phi_F} \right)$
 Traditional view: Centralized body bias generator controlling different die regions
  - Ineffective for compensating intra-die variations
  - Highly suboptimal power
[Figure: critical and non-critical gates; die regions driven by a central BB controller]
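The body-effect relation behind the Concept bullet can be evaluated numerically to show how forward and reverse body bias shift the threshold voltage. This is a minimal sketch; the device parameters (vth0, gamma, phi_f) are hypothetical values chosen for illustration, not data from the slides.

```python
import math


def vth_with_body_bias(vth0, gamma, phi_f, v_sb):
    """Threshold voltage under source-to-body bias v_sb (body effect).

    Vth = Vth0 + gamma * (sqrt(2*phi_F + Vsb) - sqrt(2*phi_F))
    v_sb > 0 models reverse body bias (RBB), v_sb < 0 forward body bias (FBB).
    """
    arg = 2.0 * phi_f + v_sb
    if arg < 0:
        raise ValueError("forward bias too large for this model")
    return vth0 + gamma * (math.sqrt(arg) - math.sqrt(2.0 * phi_f))


# Hypothetical device parameters (volts, sqrt(volts), volts).
DEVICE = dict(vth0=0.30, gamma=0.2, phi_f=0.4)

nominal = vth_with_body_bias(v_sb=0.0, **DEVICE)
rbb = vth_with_body_bias(v_sb=0.5, **DEVICE)     # slower, less leaky
fbb = vth_with_body_bias(v_sb=-0.3, **DEVICE)    # faster, more leaky
```

Reverse bias raises the threshold above its nominal value and forward bias lowers it, which is the lever used to slow down non-critical gates and speed up critical ones.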
Coarse Body Bias Assignment
[Figure: "ONE BIAS FOR ALL GATES": frequency histograms of DELAY and POWER]
 Simplified assignment minimizing routing overheads
 Biasing dictated by placement instead of gate criticality
 Disregards complex dependence of gate criticality on:
  - Circuit topology
  - Correlations in process variations
 Effective in tightening delay but leads to high power
 Important to cluster gates to leverage ABB effectively
Proposed New Optimization Framework
 Generate sample scenarios
  - Each scenario (‘1’, ‘2’, ..., ‘x’) assigns an Leff value to every gate (Leff_1.x to Leff_7.x in the 7-gate example)
 Solve BB assignment for each scenario
  - DETERMINISTICALLY optimize each scenario (i.e., tune each gate for each die scenario)
 Generate PDFs of optimal actions (BB-PDF, ρi,j)
 Clustering
 Post-silicon tuning
[Figure: power [uW] vs. delay [ns] for the dual-Vth design and the proposed work, with the timing target marked]
Results vs. Traditional Dual-Vth
Dual-Vth vs. 2-4 ABB clusters
 Leakage power
  - Avg. 28-38% (51-59%) lower μ (95th)
 Delay
  - 3-9X tighter σ
 Area
  - Capo generates contiguous regions of similarly clustered cells while minimally displacing cells
  - 5-8% increase in wirelength and area
A few conclusions
 Parametric yield is a critical design objective going forward
  - Requires accurate estimation and fast optimization approaches to this key metric
  - Envision all tools in 4-6 years being yield-driven, rather than timing or power alone
 Lots of room for improvement in many ‘well-studied’ CAD problems today
  - Recent examples: dual-Vth+sizing, placement (Cong, et al.)
