Low power design


ELEN 468 Advanced Logic Design

Lecture 29 Low Power Design

Power Dissipation

[Figure: power (log scale, 0.1 to 100) vs. year, 1971 to 2000, for Intel processors 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium® and P6. Courtesy, Intel]

Power increases despite Vdd decrease

Power Density

[Figure: power density (log scale, 1 to 10000) vs. year, 1970 to 2010, for Intel processors 4004 through Pentium® and P6, with reference levels marked for Hot Plate, Nuclear Reactor and Rocket Nozzle. Courtesy, Intel]

Why Power Increased

Growing die size, fast frequency scaling

[Figure: Clock Frequency (MHz, log scale, 10 to 10000) vs. year, 1985 to 2005]

Gate Power Dissipation

• Leakage power
• Dynamic power
• Short circuit power

Dynamic Power

Occurs at each switching
$P_d = C_L \cdot V_{dd}^2 \cdot f_p$
$f_p$: switching frequency
[Figure: inverter schematics with labels Linear, Saturation, Vdd, out]
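As a rough numerical illustration of the dynamic power relation above, the following Python sketch evaluates P_d = C_L·Vdd²·f_p. The function name and the example values (10 pF, 1.5 V and 1.2 V, 400 MHz) are illustrative assumptions, not figures from the lecture.

```python
def dynamic_power(c_load_farads, vdd_volts, f_switch_hz):
    """Dynamic power P_d = C_L * Vdd^2 * f_p, in watts."""
    return c_load_farads * vdd_volts ** 2 * f_switch_hz


# Hypothetical example values, chosen only for illustration.
p_high = dynamic_power(10e-12, 1.5, 400e6)
p_low = dynamic_power(10e-12, 1.2, 400e6)

print(f"P_d at 1.5 V: {p_high * 1e3:.2f} mW")
print(f"P_d at 1.2 V: {p_low * 1e3:.2f} mW")
print(f"ratio: {p_low / p_high:.2f}")
```

Because the dependence on Vdd is quadratic, lowering Vdd from 1.5 V to 1.2 V at the same load and frequency scales dynamic power by (1.2/1.5)² = 0.64.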

Leakage Power

Static
Leakage current $= a \cdot V_{dd} = b / V_t$
Killer to CMOS technology
[Figure: inverter schematics with labels Leakage, Linear, Saturation, Vdd, out]

Short Circuit Power

During switching, there is a short moment when both PMOS and NMOS are partially on
$P_s = Q \cdot (V_{dd} - V_t)^3 \cdot t_r \cdot f_p$
$t_r$: rising time
[Figure: inverter schematics for input rising and input falling, with labels Vdd, out]

Where Does Power Go?

[Figure, two panels: power percentages (0% to 100%) split into Active power, Cache leakage, Gate leakage and Core transistor leakage]

Total chip power based on ITRS roadmap

In 2004, we are just breaking even

[Kim, et al, Computer 2003]

Scalable X86 CPU Design for 90nm

Low VT devices are <1% of total non-memory transistor width

[J. Schultz and C. Webb, ISSCC 2004]

Energy – Performance Space

Every design is a point on a 2-D plane

[Figure: energy-performance plane; surviving axis label: Performance]

Low Power Design

Reduce dynamic power
• α (activity factor): clock gating, sleep mode
• C: small transistors (esp. on clock), short wires
• VDD: lowest suitable voltage
• f: lowest suitable frequency
Reduce static power
• Selectively use low Vt devices
• Power gating, MTCMOS
• Stacked devices
• Body bias

Clock Gating

Gate off clock to idle functional units
• e.g., floating point units
• need logic to generate disable signal
  • increases complexity of control logic
  • consumes power
  • timing critical to avoid clock glitches at OR gate output
• additional gate delay on clock signal
  • gating OR gate can replace a buffer in the clock distribution tree
[Figure: clock and disable signals feed an OR gate whose output clocks the register (Reg) of the functional unit]
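The OR-based gating described above can be sketched as a small behavioral model in Python. The sampled waveforms and helper names are hypothetical; the model only counts clock edges and ignores gate delay. The disable signal is changed only while the clock is high, which is the timing condition that keeps the OR output free of glitches.

```python
def gated_clock(clk, disable):
    """OR-based gating: the output is held high whenever disable is 1."""
    return [c | d for c, d in zip(clk, disable)]


def rising_edges(signal):
    """Count 0 -> 1 transitions in a sampled signal."""
    return sum(1 for prev, cur in zip(signal, signal[1:]) if prev == 0 and cur == 1)


# Hypothetical waveforms, two samples per clock cycle (low phase, high phase).
clk = [0, 1] * 8
# disable is asserted for samples 5..12; both of its transitions fall on
# samples where clk is high, so the OR output does not glitch.
disable = [0] * 5 + [1] * 8 + [0] * 3

gclk = gated_clock(clk, disable)
print("edges on free-running clock:", rising_edges(clk))
print("edges on gated clock:", rising_edges(gclk))
```

Fewer rising edges reach the functional unit's registers while it is idle, which is where the dynamic power saving comes from.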

Active Power Reduction - Supply Voltage Reduction

Static
[Figure: blocks labeled Slow, Fast, Slow, fed from low and high supply voltages]
Pros:
• Always active in saving
Cons:
• Additional power delivery network
• Needs special care of interface between power domains
• Signals close to Vt – excessive leakage and reduced noise margins

Dynamic
Adjusting operation voltage and frequency to performance requirements:
• High performance – high Vdd & frequency
• Power saving – low Vdd & frequency
Pros:
• Doesn’t limit performance
Cons:
• Penalty of transition between different power states can be high (in performance and power)
• Additional control logic

Voltage Islands (Multi-Vdd)

[Figure: Vddh / Vddl layouts from Usami+ JSSC’98, Lackey+ ICCAD’02 and GVI DAC’03]
• Allow both macro and cell voltage assignment
• Allow different voltage islands in the same circuit row
• Lift unnatural layout restrictions
• Minimal placement disturbance

Level Converter

Interface circuit when Vddl drives Vddh, to avoid leakage

[Figure: a Vddl gate driving a Vddh gate directly (transistor marked "weak on!"); conventional dual supply level converter; new single supply level converter]

Adjacency Metrics for Clustering

Logic adjacency metric (LAM): Vddl fanin cone of level shifter without going through Vddh

Physical adjacency metric (PAM): for each candidate cell, compute total size of its neighbor Vddl cells

[Figure: Vddh and Vddl regions with level converters LC1, LC2, LC3]

LAM to guide logic aware voltage assignment

PAM to guide placement aware voltage re-assignment

Level Converter Optimizations

Logic replacement (or gate sizing)

[Figure: level converters (LC) around MUX and DEC logic, output Z]

LC/Buffer co-optimization

[Figure: two arrangements of buffer (B) and level converter (LC) between A and B]

Placement to Form Voltage Islands with Power Grid Co-design

Based on Vddl and Vddh cell placement after voltage assignment, define Vddl/Vddh power grids on demand

Detailed placement to form Vddl/Vddh voltage islands that can hit their corresponding power supplies

[Figure: power grids on demand, with alternating Vddl and Vddh supply rails]

Example of Voltage Islands

- IBM Cu11
- 0.13um
- 400 MHz
- Vddh = 1.5V, Vddl = 1.2V

No timing degradation, no area increase!

(courtesy IBM)

Dynamic Frequency and Voltage Scaling

Always run at the lowest supply voltage that meets the timing constraints
• DFS (dynamic frequency scaling) saves only power
• DVS (dynamic voltage scaling) + DFS saves both energy and power

A DVS+DFS system requires the following
• A programmable clock generator (PLL)
  • PLL from 200MHz → 700MHz in increments of 33MHz
• A supply regulation loop that sets the minimum VDD necessary for operation at the desired frequency
  • 32 levels of VDD from 1.1V to 1.6V
• An operating system that sets the required frequency + supply voltage to meet the task completion deadlines
  • heavier load → ramp up VDD, when stable speed up clock
  • lighter load → slow down clock, when PLL locks onto new rate, ramp down VDD
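The claim that DFS saves only power while DVS + DFS saves both energy and power can be checked with a small Python sketch. It assumes dynamic power only (leakage ignored) and a task with a fixed cycle count; the capacitance, the cycle count and the pairing of 1.1 V with 200 MHz are hypothetical values picked from the ranges quoted above.

```python
def dynamic_power(c_eff, vdd, freq_hz):
    """Average dynamic power in watts: C * VDD^2 * f."""
    return c_eff * vdd ** 2 * freq_hz


def task_energy(c_eff, vdd, cycles):
    """Dynamic energy in joules for a fixed cycle count: C * VDD^2 * N."""
    return c_eff * vdd ** 2 * cycles


C_EFF = 1e-9     # hypothetical effective switched capacitance (F)
CYCLES = 700e6   # hypothetical task length (clock cycles)

# (VDD in volts, clock in Hz); pairing 1.1 V with 200 MHz is an assumption.
modes = {
    "full speed": (1.6, 700e6),
    "DFS only": (1.6, 200e6),
    "DVS + DFS": (1.1, 200e6),
}

results = {
    name: (dynamic_power(C_EFF, vdd, f), task_energy(C_EFF, vdd, CYCLES))
    for name, (vdd, f) in modes.items()
}
for name, (power, energy) in results.items():
    print(f"{name:10s}  power = {power:.3f} W  energy = {energy:.3f} J")
```

With frequency scaling alone the task simply takes longer, so the energy per task is unchanged; only lowering VDD reduces the energy spent per cycle.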

Leakage Reduction Techniques

• Stack effect
• Dual Vt partitioning: high Vt devices and low Vt devices
• Variable threshold (VTCMOS): Vnwell ≥ Vdd, Vpwell ≤ 0
• Multi-threshold (MTCMOS): low Vt logic between virtual Vdd and virtual Gnd, with HVT sleep transistors

Natural Transistor Stacks

How?

• Reduce the leakage by stacking the devices
  • Reduced Vds
  • Negative Vgs
  • Negative Vbs

Design with Dual Vth

[Figure panels: dual Vth evaluation; dual Vth design]
• Two flavors of transistors: slow – high Vth, fast – low Vth
• Low Vth devices are faster, but have ≈10X leakage

Impacts of Variable VT

Reducing the VT increases the subthreshold leakage current (exponentially)
$V_T = V_{T0} + \gamma \left( \sqrt{|-2\phi_F + V_{SB}|} - \sqrt{|-2\phi_F|} \right)$
where $V_{T0}$ is the threshold voltage at $V_{SB} = 0$, $V_{SB}$ is the source-bulk (substrate) voltage, and $\gamma$ is the body-effect coefficient
But, reducing VT decreases gate delay (increases performance)
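To make the body-effect relation concrete, here is a minimal Python sketch that evaluates V_T = V_T0 + γ(√|−2φ_F + V_SB| − √|−2φ_F|). The parameter values (V_T0 = 0.45 V, γ = 0.4, φ_F = −0.3 V) are assumed for illustration and are not taken from the lecture. The sign convention is also an assumption: the reverse bias is passed in as a positive source-to-bulk voltage.

```python
import math


def threshold_voltage(v_sb, vt0=0.45, gamma=0.4, phi_f=-0.3):
    """Threshold voltage with body effect.

    v_sb  : source-bulk voltage in volts (positive = reverse bias, assumed)
    vt0   : threshold voltage at v_sb = 0 (assumed value)
    gamma : body-effect coefficient (assumed value)
    phi_f : Fermi potential in volts (assumed value)
    """
    return vt0 + gamma * (
        math.sqrt(abs(-2 * phi_f + v_sb)) - math.sqrt(abs(-2 * phi_f))
    )


for v_sb in (0.0, 0.5, 1.0, 1.5, 2.0, 2.5):
    print(f"V_SB = {v_sb:.1f} V -> V_T = {threshold_voltage(v_sb):.3f} V")
```

With these assumed parameters the threshold rises monotonically as the reverse bias grows, which is the lever that body biasing uses.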

Variable VT through Body Bias

• For NMOS, the substrate is normally tied to ground (VSB = 0)
• A negative bias on VSB causes VT to increase
• Adjusting the substrate bias at runtime is called adaptive body-biasing (ABB) or dynamic threshold scaling (DTS)
• Requires a triple well fab process

[Figure: VT (0.4 to 0.9) vs. VSB (V), from -2.5 to 0]

Forward/Reverse Body Biasing

RBB (Reverse Body Bias): zero body bias in active mode, a deep reverse bias in standby mode.
Disadvantages:
• Increases PN junction reverse leakage
• Technology scaling worsens short channel effects and weakens the Vth modulation capability

FBB (Forward Body Bias): high Vth in standby mode, forward body biasing to achieve better current drive in active mode.
Disadvantages:
• Larger junction capacitance
• High body effect for stacked devices

Implementation of Dynamic Vth Scaling (DTS)

How?
• When the critical path replica frequency is less than the reference CLK, adjust bias to decrease Vth.
• Otherwise adjust bias to increase Vth.

Results:
• The lowest Vth is delivered (NBB, no body bias) if the highest performance is required.
• When the performance demand is low, clock frequency is lowered and Vth is raised via RBB to reduce the run time leakage power dissipation.
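The DTS rule above is a simple feedback loop, sketched below in Python. Only the comparison rule comes from the slide; the step size, the Vth limits and the linear replica-frequency model are hypothetical stand-ins for the real bias generator and critical path replica.

```python
def dts_step(replica_freq_hz, ref_clk_hz, vth, step=0.01, vth_min=0.20, vth_max=0.50):
    """One control iteration: compare replica speed with the reference clock."""
    if replica_freq_hz < ref_clk_hz:
        return max(vth_min, vth - step)   # too slow: decrease Vth
    return min(vth_max, vth + step)       # fast enough: increase Vth to cut leakage


def replica_freq(vth, f_at_min=700e6, slope_hz_per_v=1.0e9, vth_min=0.20):
    """Hypothetical linear model: the replica path slows down as Vth rises."""
    return f_at_min - slope_hz_per_v * (vth - vth_min)


vth = 0.20
for _ in range(40):
    vth = dts_step(replica_freq(vth), ref_clk_hz=500e6, vth=vth)

print(f"Vth settles near {vth:.2f} V for a 500 MHz reference clock")
```

In this toy model the loop raises Vth until the replica is only just fast enough for the reference clock, then dithers around that point.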

Power Gating Using Sleep Transistors

Or can reduce leakage by gating the supply rails when the circuit is in sleep mode
• in normal mode, sleep = 0 and the sleep transistors must present as small a resistance as possible (via sizing)
• in sleep mode, sleep = 1, the transistor stack effect reduces leakage by orders of magnitude

Or can eliminate leakage by switching off the power supply (but lose the memory state)

Example of Power Gating

• Can reduce power 1000X
• Smaller voltage swing (IR drop on sleep transistors)
  • Lower performance
  • Increased noise coupling
• Local power grid design
[Figure: power switch control signals, embedded power switches, rows of standard cells]

Power Dissipation on Variation Tolerance

Conventional variation tolerance
• Using large timing safety margin
• Implies aggressive timing target
• Greater power dissipation

Observation
• Near-worst-case variations occur rarely
• Safety margin is applied continuously to guard against the small chance of variations
• Poor power efficiency

Question..

Can we deal with errors instead of preventing them from occurring by conservative binning/clocking?

How far can we speed up the circuit while keeping the error rate in a manageable range?

Fault tolerant system

Begin with reference values

Introduce redundancy
• Hardware: Triple Modular Redundancy
• Time: Repeated process
• Information: Code
• Software: various algorithms

How about for delay fault?
• How do we detect (and maybe correct?) errors?
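As a reminder of what hardware redundancy looks like in its simplest form, the sketch below models the voter of a Triple Modular Redundancy scheme in Python. The function name and the 4-bit example values are illustrative assumptions; the three module outputs are taken as integers and voted bit by bit.

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority of three redundant module outputs."""
    return (a & b) | (a & c) | (b & c)


golden = 0b1011            # hypothetical correct output
faulty = golden ^ 0b0100   # one module suffers a single-bit error

voted = majority_vote(golden, faulty, golden)
print(f"voted output: {voted:04b}")
```

A single faulty module is outvoted, but the cost is roughly three copies of the hardware, which motivates the cheaper delay-fault-specific schemes discussed next.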

Delay fault tolerant system

Delay fault detection
• Redundant timing margin in signal path
  • +: Second sampling at increased clock period
  • -: Decrease delay of reference signal between pipeline registers
[Figure: timing diagram on time axis t showing the timing margin, with sampling at t1 and 2nd sampling at t2]

Delay fault tolerant system

Delay fault removal
• Reference signal ($S_R$)
• Reprocessing at slower clock period (t’)
[Figure: timing diagram on time axis t showing the timing margin between t1 and t2, the reference signal $S_R$ and the slower period t’]

Delay fault tolerant system: Example RAZOR*

• Dynamic Voltage Scaling Design
• Reduce supply voltage down to manageable failure rate
[Figure: timing diagram showing the timing margin between t1 and t2]

* Razor: a low-power pipeline based on circuit-level timing speculation, D. Ernst et al, 36th Annual IEEE/ACM International Symposium on Microarchitecture 2003
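The double-sampling idea behind this kind of detection can be sketched behaviorally in Python: the data is captured once at t1 and again at a later time t2, and a mismatch flags a delay fault. This is a simplified model under stated assumptions, not the Razor circuit itself; the function name and the timing values are hypothetical, and the new value is assumed to differ from the old one.

```python
def double_sample(arrival_time, t1, t2, old_value, new_value):
    """Sample a signal at t1 and again at t2 (t2 > t1).

    A sample sees new_value only if the signal arrived by that sampling time.
    Returns (main sample, second sample, error flag).
    """
    main = new_value if arrival_time <= t1 else old_value
    second = new_value if arrival_time <= t2 else old_value
    return main, second, main != second


# Hypothetical timing in ns: nominal sampling at t1 = 1.0, 2nd sampling at t2 = 1.3.
for arrival in (0.8, 1.1, 1.5):
    main, second, error = double_sample(arrival, 1.0, 1.3, old_value=0, new_value=1)
    print(f"arrival {arrival} ns: main={main} second={second} error={error}")
```

A path that arrives inside the timing margin (between t1 and t2) is caught, while one that arrives after t2 goes undetected, which is why the margin bounds how far voltage or clock period can be pushed.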

Delay fault tolerant system: Example RAZOR continued

• Implemented to 120MHz clock frequency
• But for high speed circuits…
  • Managing two clocks
  • Minimum path delay constraint
  • Delay of MUX

Delay fault tolerant system: Example Parity coding

• Parity generation based on output correlation
• Avoid well-correlated outputs for pairing
[Figure: timing diagram on time axis t showing the timing margin]

Now.. Let’s look at delay distribution(s)

Clock speed achieved for contained error rate

Delay fault tolerant system: Example Parity coding (continued)

• Complexity
• Example: C499 ISCAS Benchmark

Recently Proposed Design

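Fault detection
• Partial hardware and time redundancy
[Figure: pipeline stage between latches $L_n$ and $L_{n+1}$, with logic split into FL and BL (gates $g_0$, $g_i$, $g_m$), plus a duplicate BL’ and latch $L'_{n+1}$; timing margin shown on time axis t]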

Proposed Design

Fault removal
• Pipeline flush & reprocessing at lower clock
[Figure: pipeline stage between latches $L_n$ and $L_{n+1}$, with FL and BL (gates $g_0$, $g_i$, $g_m$), duplicate BL’ and latch $L'_{n+1}$]

Proposed Design

Division of FL and BL
[Figure: PI → FL → BL → PO, with a latch feeding a second BL and a comparison block (CP) that produces the "Error?" signal]

Proposed Design

Division of FL and BL
• Considerations
  • The effects on the original circuit should be minimal
  • Maximize delay fault detection coverage
  • Minimize added complexity

Proposed Design

Division of FL and BL
• First, POs to BL
  • Gate with longest delay to gate with shortest delay
• For the gates connected to BL,
  • Choose the gate with maximum delay
  • Then, any gate whose number of fanouts > number of fanins

Proposed Design

Delay fault detection coverage
• $d_{FL}$: delay from PI to any gate in FL
• $d_i$: delay from PI to any gate in original circuit

$C_F = \frac{\max\{d_{FL}\}}{\max\{d_i\}}$

Add graphical view

Proposed Design

Delay simulation
• SPICE simulation
  • TSMC 0.18um tech. Vcc=1.6V
  • Gate delay for rising and falling signal
  • Load: inverter
  • Different input combinations are considered
• Delay simulation
  • Randomly generated test vectors
  • $10^6$ ~ $10^8$ according to number of primary inputs (PI)

Proposed Design

Area complexity
• $N_{gate}$: Number of gates in the original circuit
• $N_{ff}$: Number of ffs in each pipeline, $(N_{PI} + N_{PO})/2$
• $N_{gate\_BL}$: Number of gates in BL
• $N_{gate\_CP}$: Number of gates in comparison block
• $N_{Latch}$: Number of latches = Number of connections between FL and BL

[Equation: area complexity $C_A$ in terms of $N_{gate\_BL}$, $N_{gate\_CP}$, $N_{Latch}$, $N_{gate}$, $N_{ff}$ and $w$]

Fault Coverage vs. Complexity

[Figure, four panels: Fault Detection Coverage vs. Added Complexity for C432, C499, C6288 and C880; x-axis: fault detection coverage $C_F$]

Complexity

Effective complexity penalty
• Depends on application
• More than half of area is cache
• Speed critical part: integer unit

[Equation: effective complexity $C_{AE}$ in terms of $C_A$ and a factor of 0.5]

Estimation of Complexity

[Figure: Intel® Pentium® 4 Processor on 90 nm Process; labeled blocks: Data Cache, Align Mux, ALUs & AGU, Registers]

Conclusion

Delay fault tolerant design is proposed
• Possible operation clock frequency gain is estimated from modeling and experiments
• Delay fault detection coverage and complexity are analyzed for optimal implementation
• It shows that 10% clock frequency gain is possible with the proposed design at a moderate (8-25%) complexity increase