Transcript ppt - ISPD

Optimization for Leakage Power Reduction using Multi-Threshold Voltages for High Performance Microprocessors

ISPD 2007, Austin
Jeegar Shah, Marius Evers, Jeff Trull, Alper Halbutogullari
AMD, Sunnyvale, CA
March 19, 2007

Agenda

• Justification for threshold voltage selection for leakage power reduction and multi-corner cycle time adjustments
• Multi-Threshold voltage selection flow
• Heuristic VTH selection algorithm
• Dynamic Forward traversal VTH selection algorithm
• Results
• Conclusions
• Q & A

Motivation

• Reduce leakage power by increasing the threshold voltages of non-critical gates
• Meet aggressive timing constraints
• Support the above constraints for multiple process corners
• Optimize extremely rigid designs at the post-route step to handle process variability
• Support multi-VTH flows (scalable as more VTH libraries are made available)
• Generate design variants with power-performance tradeoff

METHODOLOGY & OPTIMIZATION FLOW


Methodology Flow

1. Start with unoptimized design
2. Read in constraints for multiple corners
3. Run Static Timing Analysis (STA) for each of these corners
4. Optimize first to meet aggressive timing constraints for each corner by down-swapping (selecting lower VTH cells for critical path gates)
5. Then optimize to reduce leakage power by up-swapping (selecting higher VTH cells for non-critical path gates)
6. Let multiple corners interact
7. Iterate steps 3-6
8. Static Timing Analysis check
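The two-phase loop above can be sketched in Python under a toy timing model. Everything in this sketch is hypothetical: the `DELAY` table, the `REQUIRED` times, the two paths in `PATHS`, and the function names stand in for a real STA engine and cell libraries. Corners "interact" here only in the sense that every swap is checked against all corners at once.

```python
# Toy sketch of down-swap then up-swap; all numbers and names are assumptions.
VTH_ORDER = ["LVT", "MVT", "HVT"]  # fastest/leakiest -> slowest/least leaky

# Hypothetical per-corner cell delay (ps) for each VTH flavor.
DELAY = {
    "corner1": {"LVT": 8, "MVT": 10, "HVT": 13},
    "corner2": {"LVT": 9, "MVT": 12, "HVT": 16},
}
REQUIRED = {"corner1": 40, "corner2": 46}  # hypothetical required path delay (ps)
PATHS = [["a", "b", "c", "d"], ["e", "f"]]  # cell names along each timing path


def path_slack(cells, corner, path):
    """Stand-in for STA: required time minus summed cell delays."""
    return REQUIRED[corner] - sum(DELAY[corner][cells[n]] for n in path)


def worst_slack(cells):
    return min(path_slack(cells, c, p) for c in DELAY for p in PATHS)


def failing_cells(cells):
    """Cells on any path with negative slack in any corner."""
    out = []
    for corner in DELAY:
        for path in PATHS:
            if path_slack(cells, corner, path) < 0:
                out.extend(n for n in path if n not in out)
    return out


def swap(cells, name, step):
    """Return a copy with one cell moved by `step` in VTH_ORDER, or None."""
    idx = VTH_ORDER.index(cells[name]) + step
    if 0 <= idx < len(VTH_ORDER):
        return {**cells, name: VTH_ORDER[idx]}
    return None


def optimize(cells):
    # Phase 1 (down-swap): lower VTH on failing paths until all corners pass.
    while worst_slack(cells) < 0:
        for name in failing_cells(cells):
            trial = swap(cells, name, -1)
            if trial is not None:
                cells = trial
                break
        else:
            break  # nothing left to down-swap
    # Phase 2 (up-swap): raise VTH wherever every corner still meets timing.
    changed = True
    while changed:
        changed = False
        for name in sorted(cells):
            trial = swap(cells, name, +1)
            if trial is not None and worst_slack(trial) >= 0:
                cells, changed = trial, True
    return cells


result = optimize({name: "MVT" for path in PATHS for name in path})
```

In this toy case only one cell on the critical path moves to LVT, while both cells on the short path move to HVT, which is the intended split between critical and non-critical gates.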

Simultaneous optimizations across multiple corners

[Diagram: Corner 1 and Corner 2 each run STA 1 followed by Optimization Iteration 1, exchanging swaps as they are computed. The resulting new design feeds STA 2 in each corner.]

Multi-Threshold VTH selection flow

1. Start with an MVT cell design with a few protected user-defined cells
2. Run Static Timing Analysis
3. Determine which cells to change to LVT based on the heuristic and smart swap algorithms
4. Run the optimization engine on the design
5. Swap selected MVT cells to LVT cells
6. Done swapping MVT cells to LVT? If no, return to step 2; if yes, continue
7. Swap remaining MVT cells to HVT cells
8. Run Static Timing Analysis
9. Determine which HVT cells to change to MVT cells using the 2 algorithms
10. Run the optimization engine on the design
11. Swap selected HVT cells to MVT cells
12. Done swapping HVT cells to MVT? If no, return to step 8; if yes, finish

Optimization flow – Multi corner + design variant

[Diagram: the un-optimized design is optimized against four corners, each with its own library (Lib). Corners 1 and 2 use the mobile constraints and produce the optimized mobile design; corners 3 and 4 use the desktop constraints and produce the optimized desktop design.]

Multi VTH scalable – 3 VTH example

Step 1: Meet timing constraints (down-swap)
Start from the un-optimized MVT design and fix critical paths by changing cells to LVT, giving a design with MVT and LVT cells.

Step 2: Reduce leakage power (up-swap)
[Diagram labels: Un-optimized HVT Design; Un-optimized MVT Design; HVT + LVT; Extract; HVT + MVT; HVT + LVT; Final Design]

Heuristic VTH Selection Algorithm

Heuristic Algorithm

• Sensitivity-analysis-based heuristic approach
• Picks instances that have the most impact on performance with reasonable leakage costs
• Instances picked affect multiple paths
• Circuit topology aware
• Works best for the first few optimization iterations
• Flexibility to choose an instance selection window size to fine-grain the optimization

Heuristic algorithm – Pros and Cons

Pros
• Extremely fast
• Efficiently selects instances that affect multiple critical paths
• Changing only these instances to low VTH cells helps meet aggressive timing constraints at very low leakage power cost
• Parameterizable instance selection windows
• Topology-aware algorithm

Cons
• Effective only in the first few iterations
• Not the best choice when fine-grain optimization is required
• No timing update or analysis is done to improve results within a single iteration
• Each iteration picks a window of instances for VTH selection. Timing information is not updated with every swap within the same selection group.

Pseudocode for heuristic algorithm

list all launching flops
foreach flop f
    do depth first recursive forward traversal
        calculate time benefit if swapped from libraries
        determine total VTH layout width (cost)
        calculate benefit/cost score
        for each immediate o/p pin
            prorate each score based on o/p pin criticality with other relatively critical pins
        stop at the register capture flop
        [recursively get downstream scores]
        add downstream scores to current inst score
for each flop from list of capture flops
    do depth first recursive reverse traversal
        calculate time benefit if swapped from libraries
        determine total VTH layout width (cost)
        calculate benefit/cost score
        for each immediate i/p pin
            prorate each score based on i/p pin criticality with other relatively critical pins
        [recursively get upstream scores]
        add upstream scores to current inst score
list all instances in decreasing final scores
pick top x% of instances and swap them to lower VTH
update database and perform STA
repeat
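The final ranking step of the pseudocode (sort instances by final score, then pick the top x%) can be sketched as follows. This is a minimal illustration, assuming the final scores are already computed; the function name `pick_top_instances`, the parameter `window_pct`, and the sample scores are hypothetical and not taken from the presentation.

```python
import math


def pick_top_instances(scores, window_pct):
    """Return instance names with the highest final scores, limited to window_pct percent."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Round up so that a non-empty design always yields at least one instance.
    count = max(1, math.ceil(len(ranked) * window_pct / 100))
    return ranked[:count]


# Hypothetical final scores (own score plus upstream and downstream contributions).
scores = {"u1": 0.9, "u2": 0.2, "u3": 1.4, "u4": 0.5, "u5": 0.1}
selected = pick_top_instances(scores, window_pct=40)
```

The selected instances would then be swapped to the lower VTH cell before the database update and the next STA run.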

Definition of Instance score

Score_m is the benefit/cost score of instance m, computed from the cell delays delay_a and delay_b and the summed transistor widths Σ_p Width_p^a and Σ_p Width_p^b, where:

a: Original cell
b: Potential cell selection
m: Instance under consideration
p: Each transistor within cell 'a' or cell 'b'

Updated topological instance score

Updated score (inst) = Individual score from sensitivity analysis (inst) + Scores of instances downstream (inst) + Scores of instances upstream (inst)
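As a rough illustration of how an instance's own score can be combined with its downstream and upstream cone scores, the sketch below accumulates scores over a small acyclic netlist. It is a simplification under stated assumptions: cone terms are plain unweighted sums (no width or slack weighting, no proration), reconvergent logic is counted once per path, and the netlist `FANOUT` and the scores in `OWN` are invented for the example.

```python
from functools import lru_cache

# Hypothetical netlist between flops: instance -> list of fanout instances.
FANOUT = {"g1": ["g2", "g3"], "g2": ["g4"], "g3": ["g4"], "g4": []}
# Hypothetical individual scores from sensitivity analysis.
OWN = {"g1": 1.0, "g2": 0.5, "g3": 0.25, "g4": 2.0}

# Derive the fanin map from the fanout map.
FANIN = {name: [] for name in FANOUT}
for src, dsts in FANOUT.items():
    for dst in dsts:
        FANIN[dst].append(src)


@lru_cache(maxsize=None)
def down_score(inst):
    """Sum of the scores of all instances reachable through the fanout cone."""
    return sum(OWN[n] + down_score(n) for n in FANOUT[inst])


@lru_cache(maxsize=None)
def up_score(inst):
    """Sum of the scores of all instances reachable through the fanin cone."""
    return sum(OWN[n] + up_score(n) for n in FANIN[inst])


def total_score(inst):
    return OWN[inst] + down_score(inst) + up_score(inst)


totals = {name: total_score(name) for name in FANOUT}
```

In this toy netlist, g1 ends up with the highest total because its fanout cone reaches the high-scoring g4 along two paths, which mirrors the idea that instances affecting multiple paths are preferred.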

Computing DownCone scores

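\[
\mathrm{downScore}_m = \sum_{n = \mathrm{FO}(m)} \Bigg( \sum_{\substack{o = \mathrm{FI}(\mathrm{Gate}(n)) \\ \text{s.t. } slk_n - slk_o < q}} \mathrm{downScore}_n \times C_0 \sum_{p = Vt} \mathrm{Width}(m)_p \cdot \frac{slk_n}{slk_o} \Bigg)
\]

m: instance being considered for selection
n: Fanout gate of m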

Computing UpCone scores

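\[
\mathrm{upScore}_m = \sum_{n = \mathrm{Gate}(\mathrm{FI}(m))} \Bigg( \sum_{\substack{o = \mathrm{FI}(n) \\ \text{s.t. } slk_{\mathrm{FI}(m)} - slk_o < q}} \mathrm{upScore}_n \times C_0 \sum_{p = Vt} \mathrm{Width}(m)_p \cdot \frac{slk_{\mathrm{FI}(m)}}{slk_o} \Bigg)
\]

m: instance being considered for selection
n: Fanin gate of m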

Upscore proration

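With Proration

\[
\mathrm{upScore}_m = \sum_{n = \mathrm{Gate}(\mathrm{FI}(m))} \Bigg( \sum_{\substack{o = \mathrm{FI}(n) \\ \text{s.t. } slk_{\mathrm{FI}(m)} - slk_o < q}} \mathrm{upScore}_n \times C_0 \sum_{p = Vt} \mathrm{Width}(m)_p \cdot \frac{slk_{\mathrm{FI}(m)}}{slk_o} \Bigg)
\]

Without Proration

\[
\mathrm{upScore}_m = \sum_{n = \mathrm{Gate}(\mathrm{FI}(m))} \mathrm{upScore}_n \times \Big( C_0 \sum_{p = Vt} \mathrm{Width}(m)_p \Big)
\]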

Downscore proration

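With Proration

\[
\mathrm{downScore}_m = \sum_{n = \mathrm{FO}(m)} \Bigg( \sum_{\substack{o = \mathrm{FI}(\mathrm{Gate}(n)) \\ \text{s.t. } slk_n - slk_o < q}} \mathrm{downScore}_n \times C_0 \sum_{p = Vt} \mathrm{Width}(m)_p \cdot \frac{slk_n}{slk_o} \Bigg)
\]

Without Proration

\[
\mathrm{downScore}_m = \sum_{n = \mathrm{FO}(m)} \mathrm{downScore}_n \times \Big( C_0 \sum_{p = Vt} \mathrm{Width}(m)_p \Big)
\]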

Advantage of proration

[Chart: leakage power, normalized with respect to non-prorated cones (y-axis, 0 to 1.2), versus timing slack considered for optimization in ps (x-axis, -10 to 0), comparing results without prorated cones and with prorated cones.]

Dynamic Path Traversing VTH Swap algorithm

Dynamic Path Traversing

• Regular forward traversal algorithm
• Breadth-first search from flop to flop
• Works with a power and timing budget to do VTH selection
• Only forward traversal, though backward traversal could be implemented
• Stops optimizing when either the power or the timing budget is exhausted
• Budgets scaled for every path based on a linear formulation of combinational logic depth and effective fanout
• Works best for the last few iterations where fine-grain optimization is required

Pros and Cons

Pros
• Simple implementation
• Constantly works with a power and timing budget
• After every VTH selection, the budgets are updated
• Timing between swaps is more up-to-date compared to the Heuristic algorithm
• Timing paths can be differentiated based on combinational depth and fanout

Cons
• Not as fast as the Heuristic algorithm
• Complementary to the Heuristic algorithm
• Works best for fine-grain selection. Not good at selecting the most 'influential' instances.
• Since it traverses forward and is budget limited, it ends up selecting instances closer to the launching flop
• No circuit topology information

Pseudocode for dynamic algorithm

list all launching flops
decide worst slack to consider (e.g. wslk = -40 ps)
foreach launching flop f
    start with worst slack at o/p pin (path slack)
    start with an approximate swap cost budget
    do breadth first recursive forward traversal
        for each instance failing timing
            calculate time benefit if swapped from libraries
            determine leakage delta (cost)
            swap this instance to its lower VTH version
            new timing budget = slack of path – time benefit of inst
            new power budget = budget – delta power of this inst
            update design database for new VTH cells
            exit loop if timing met (wslk)
            exit if receiving flop reached
            exit loop if budget exhausted
            exit loop if path is unconstrained
update design database
perform STA and repeat with new wslk
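A minimal sketch of the budgeted breadth-first traversal for a single launching flop is shown below. It assumes a toy model in which each swap recovers a fixed number of picoseconds and costs a fixed amount of leakage; the function name `dynamic_swap`, its parameters, and the sample netlist are hypothetical. The timing budget is tracked as a positive deficit (the magnitude of the negative path slack), and a capture flop is simply not traversed past, which is one possible reading of "exit if receiving flop reached".

```python
from collections import deque


def dynamic_swap(launch_flop, fanout, benefit_ps, leak_cost,
                 timing_deficit_ps, power_budget, capture_flops):
    """Toy breadth-first forward traversal with timing and power budgets."""
    swapped = []
    queue = deque(fanout.get(launch_flop, []))
    seen = set(queue)
    while queue:
        inst = queue.popleft()
        if timing_deficit_ps <= 0:           # timing met for this path
            break
        if inst in capture_flops:            # receiving flop: do not go past it
            continue
        if leak_cost[inst] > power_budget:   # power budget exhausted
            break
        swapped.append(inst)                 # swap inst to its lower VTH version
        timing_deficit_ps -= benefit_ps[inst]
        power_budget -= leak_cost[inst]
        for nxt in fanout.get(inst, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return swapped, timing_deficit_ps, power_budget


# Hypothetical path: flop f0 -> a -> {b, c} -> flop f1
fanout = {"f0": ["a"], "a": ["b", "c"], "b": ["f1"], "c": ["f1"]}
benefit_ps = {"a": 15, "b": 10, "c": 20}
leak_cost = {"a": 1.0, "b": 0.5, "c": 2.0}
swapped, deficit, budget = dynamic_swap(
    "f0", fanout, benefit_ps, leak_cost,
    timing_deficit_ps=30, power_budget=3.0, capture_flops={"f1"},
)
```

In this example the traversal swaps `a` and `b` and then stops at `c` because the remaining power budget cannot cover it, so the instances chosen sit close to the launching flop, as the cons list for this algorithm notes.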

Flow iteration (scalable)

Swap from MVT to LVT (11 iterations):
H-2, H-4, H-8, D-60, H-15, D-40, H-20, D-20, H-8, D-10, D-0

Swap from HVT to MVT with LVT swaps included (11 iterations):
H-2, H-4, H-8, D-60, H-15, D-40, H-20, D-20, H-8, D-10, D-0

Swap from VHVT to HVT with LVT and MVT swaps included (11 iterations):
H-2, H-4, H-8, D-60, H-15, D-40, H-20, D-20, H-8, D-10, D-0

H-4 => Heuristic flow with 4% instance window
D-40 => Dynamic algorithm with worst slack of -40 ps
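The schedule notation can be decoded mechanically. The sketch below parses one 11-step schedule into (algorithm, parameter) pairs using the legend given on the slide; the function name `parse_step` and the dictionary keys `window_pct` and `worst_slack_ps` are illustrative choices, not part of the original flow.

```python
SCHEDULE = "H-2, H-4, H-8, D-60, H-15, D-40, H-20, D-20, H-8, D-10, D-0"


def parse_step(token):
    """Decode one schedule token such as 'H-4' or 'D-40'."""
    kind, value = token.strip().split("-", 1)
    if kind == "H":
        # Heuristic flow: the number is the instance window in percent.
        return ("heuristic", {"window_pct": int(value)})
    if kind == "D":
        # Dynamic algorithm: the number is the magnitude of the worst slack in ps.
        return ("dynamic", {"worst_slack_ps": -int(value)})
    raise ValueError(f"unknown step: {token!r}")


steps = [parse_step(token) for token in SCHEDULE.split(",")]
```

Read this way, the schedule alternates coarse heuristic passes with dynamic passes whose worst-slack target tightens from -60 ps toward 0 ps.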

Slack Distribution after optimization


Experiments

Ex 1: Initial unoptimized design not meeting timing constraints
Ex 2: Quick implementation of backward followed by forward (Front-based technique [12]*)
Ex 3: 6-step iteration using only the Dynamic swapper algorithm
Ex 4: 6-step iteration using only the Heuristic swapper algorithm
Ex 5: 6-step iteration using alternating combinations of the Dynamic and Heuristic swapper algorithms

* [12] Srivastava, "Minimizing total power by simultaneous Vdd/VTH assignment", IEEE Transactions on Computer Aided Design, 2004

Results

        HVT (%)   MVT (%)   LVT (%)   Total Leakage Power (W)
Ex 1    8.5       90.4      0.3       2.278
Ex 2    22.9      37.1      39.2      6.560
Ex 3    31.2      52.3      15.7      3.554
Ex 4    39.9      45.7      13.6      3.122
Ex 5    47.1      40.2      12        2.834
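To put the leakage numbers in relative terms, the short calculation below compares Ex 3 to Ex 5 against Ex 2. It assumes the leakage values map to the experiments in order (2.278, 6.560, 3.554, 3.122 and 2.834 W for Ex 1 through Ex 5), which is an interpretation of the flattened table, and it uses Ex 2 as the baseline only because Ex 1 does not meet timing. The variable names are illustrative.

```python
# Total leakage power (W) per experiment, read in experiment order (assumption).
leakage_w = {"Ex 2": 6.560, "Ex 3": 3.554, "Ex 4": 3.122, "Ex 5": 2.834}

baseline = leakage_w["Ex 2"]  # front-based technique used as the reference
savings_pct = {
    name: 100 * (baseline - watts) / baseline
    for name, watts in leakage_w.items()
    if name != "Ex 2"
}

for name, pct in savings_pct.items():
    print(f"{name}: {pct:.1f}% lower leakage than Ex 2")
```

Under that reading, the alternating Dynamic and Heuristic flow (Ex 5) gives the largest reduction of the three, ahead of either algorithm used alone.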

Conclusions

• Described here is a post-route optimization flow for VTH selection that supports multiple corners
• This iterative flow uses 2 complementary instance selection techniques: a heuristic algorithm and a budget-based forward traversal algorithm
• The flow is not limited to 2-3 VTH levels but is scalable to any number of levels
• The Heuristic algorithm is a unique non-solver-based, topologically aware heuristic that optimizes over multiple paths simultaneously by including the effects of the upstream and downstream logic cones
• Can handle huge full-chip microprocessor designs with more than 5 million stdcell gates
• No extensive probabilistic stdcell characterization is required
• Process corners can simulate inter-chip variations that are not currently handled by statistical methods
• Multiple process corner optimizations occur in parallel and optimization results are shared between different servers in real time. This reduces the number of iterations and improves the quality of the optimization.
• Solver-based techniques failed to handle full-chip industrial-size designs. These designs were handled by this flow.

Thanks

Trademark Attribution

AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners.

© 2006 Advanced Micro Devices, Inc. All rights reserved.


Backup Slides


Solver based statistical tools

• Inaccurate sensitivity models based on delta VTH variation of transistor widths
• Difficulty in translating transistor model sensitivities of power based on variational parameters to huge libraries
• Lack of inter-chip variation and consideration of only intra-chip variations
• Virtual memory constraints for linear solvers on industrial-size designs and modeling approximations involved in non-linear solvers
• No topological information taken into consideration in path-based heuristic approaches
• Inappropriate consideration of logic fanouts
• In statistical methods, the optimization step is usually decoupled from the library characterization step

Downstream Score


Upstream Score
