Transcript (.ppt)

MAPG:
Memory Access
Power Gating
Kwangok Jeong, Andrew B. Kahng, Seokhyeong Kang,
Tajana S. Rosing, and Richard Strong
University of California, San Diego
The Problem
• Relative increase of leakage power in advanced nodes
• Leakage cost even when cores stall
• Cores can stall quite often when accessing memory!
– Memory access (L1->DDR3->L1) can take 80ns
– Access latency grows as more threads contend for the memory resource
• Previous work
– Fine-grain power gating of functional units: 4-6ns wake-up latency but wastes
leakage in other parts of core
– High-overhead core power gating: 100ms to wake up due to state restoration
from main memory
1
Motivation: Stalled Core Energy Waste
• Leakage power percentage increasing in smaller technology nodes


In-order Core MAX Energy Waste: 52.3%
EV6 Core MAX Energy Waste: 39.1%
2
Goals of This Work
• Power gate cores during long memory accesses to reduce
leakage waste
• New mechanism:
– Power-gate whole core with minimum wake-up cost
– Avoid long restore latency: keep cache on, store key data in e.g.,
retention FFs
• Low overhead wake-up
– Within the threshold latency to save energy during a memory access
• Satisfy voltage noise constraints
– Active core voltage drop: < 5%
– Idle core voltage drop: < 40%
– PDN core peak current limits
3
Power Gating Introduction
[Figure: power, voltage (Vdd_core, Vdd_int, Vss), and current of a logic block over time t, shown without power gating and with power gating (pg_enable). Phases: active, idle, sleep, wake-up.]
Energy regions: sleep energy; A = stall energy, B = retention energy, C = wake-up
• Break-even point (minimum time) decreases in advanced technology nodes
4

PPGS Design: Wake-up Delay and Noise
[Figure: rush current vs. time against Ilimit for three cases: best possible wake-up, two-signal wake-up (enable_few and enable_rest from the power-gating controller), and safer wake-up with longer time.]
• Optimal wake-up profile requires complex control logic
• With two control signals, current profile is programmable
• PPGS: provides multiple wake-up modes subject to the Ilimit that may be used to charge core logic capacitance
5
PPGS: Programmable Power Gating Switch
[Figure: PPGS circuit. Mode bits m[0] to m[9] pass through 0/1 multiplexers, selected by mode (1, 2, ..., 10), to outputs mout[0] to mout[9], which produce enable_rest.]
• Depending on mode, time difference between enable_few and enable_rest changes
[Figure: rush current (Ilimit to 10 Ilimit) vs. time for enable_few and enable_rest. Mode bit settings shown: m[0]=0, m[1-9]=1; m[0-1]=0, m[2-9]=1; …; m[0-9]=0. Curves: Mode 10, Mode 2, Mode 1. Time axis marks: t1/10, t1/2, t1.]
6
Core Model and PDN Analysis
● McPAT
– Core area
– Core transistor counts
– Core power
● ITRS 2009-2010 (update McPAT)
– Transistor capacitance
– VDD
● Qcore = (Clogic + Cint) VDD
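The charge relation above can be worked through numerically. The sketch below computes Qcore and the shortest time in which that charge could be delivered if the rush current were held exactly at a current limit. The capacitance, voltage, and current values are placeholders chosen for illustration, not figures from the slides, and the function names are hypothetical.

```python
def core_charge(c_logic_f, c_int_f, vdd_v):
    """Qcore = (Clogic + Cint) * VDD, in coulombs."""
    return (c_logic_f + c_int_f) * vdd_v


def min_charge_time(q_c, i_limit_a):
    """Lower bound on charge time if current is held at i_limit_a (t = Q / I)."""
    return q_c / i_limit_a


# Placeholder values (assumptions, not taken from the slides).
q_core = core_charge(c_logic_f=2e-9, c_int_f=3e-9, vdd_v=0.9)
t_min = min_charge_time(q_core, i_limit_a=0.5)
print(f"Qcore = {q_core:.2e} C, minimum charge time = {t_min * 1e9:.1f} ns")
```

Because the charge depends only on capacitance and supply voltage, a tighter current limit can only lengthen the wake-up time.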
7
Core Wake-up Latency Results
● PPGS wake-up modes for a 32nm HP in-order core

mode          1      2      3      4      5      6     7     8     9     10
Latency (ns)  14.16  13.14  12.12  11.11  10.09  9.08  8.06  7.05  6.03  5.02

● Core wake-up latency for varying system utilization
[Figure: Core Wake-up Latency (ns), 0 to 16, vs. Number of Idle Cores; series: 8-cores, 6-cores, 4-cores, 2-cores]
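The mode-to-latency values on this slide can be treated as a lookup table. The sketch below stores them, computes the average latency step between adjacent modes, and picks the slowest mode that still fits a latency budget. That selection rule is an assumption made for illustration (a slower mode implies a gentler rush current); the slides do not state how the mode is chosen, and the names are hypothetical.

```python
# Wake-up latency per PPGS mode (ns), 32nm HP in-order core, from the slide.
PPGS_WAKEUP_LATENCY_NS = {
    1: 14.16, 2: 13.14, 3: 12.12, 4: 11.11, 5: 10.09,
    6: 9.08, 7: 8.06, 8: 7.05, 9: 6.03, 10: 5.02,
}


def mean_step_ns(table):
    """Average latency difference between adjacent modes."""
    modes = sorted(table)
    return (table[modes[0]] - table[modes[-1]]) / (len(modes) - 1)


def slowest_mode_within(table, budget_ns):
    """Assumed policy: slowest mode whose latency fits the budget, else None."""
    candidates = [m for m, t in table.items() if t <= budget_ns]
    if not candidates:
        return None
    return max(candidates, key=lambda m: table[m])


print(round(mean_step_ns(PPGS_WAKEUP_LATENCY_NS), 3))
print(slowest_mode_within(PPGS_WAKEUP_LATENCY_NS, 10.0))
```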
8
Core State Retention and Restoration
Interface for power gating and data retention
[Figure: core block diagram. Labels: Vdd, head switch, switch control, Controller (PPGS), retention control, RET, level shifter, clock, Vss; Collapsible Domain with pipeline stages IF, ID, EX, MEM, WB and their flip-flops; Retention Domain with retention flip-flops for the PC and the architectural and misc. priv. registers; SRAM on Vdd(sram) for register files and I$/D$.]
• Three power domains
– Collapsible domain: supply voltage is disconnected during power gating
– Retention domain: retain data with supply voltage
– SRAM domain: source biasing during standby mode
9
WUC: Wake-up Controller
PPGS State Diagram
[Figure: PPGS states: Core Off; PPGS Waiting for WUC; PPGS Wakeup Mode Ready]
(1) Query WUC
(2) WUC Provides Wake-up Mode
(3) PG Mem Stall
(4) Release Wake-up to WUC
(5) WUC Updates All Core Wakeup Modes
10
MAPG-Controller
Controller Design
if (cur_stall_cycles > latency_LLC-hit-response):
    latency_pred-stall = latency_row-buffer-miss + δ
    β = latency_stall - latency_pred-stall
    if (β < 0): δ = δ + β            (avoid future performance hit)
    else: δ = 0.8·δ + 0.2·β          (adapt to increasing mem latency)
[Figure: controller flow. Labels: Mem Request; Core Stalls; Core Saves State; PPGS Power Gates Core; MAPG Controller Predicted Wake-up; In-rush Current; Restore State & Fill Pipeline; Mem Response, Update δ]
[Figure: Power States (Active, Stalled, Power Gated) over time, 0 to 80 nanoseconds]
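The δ update rule on this slide can be written as a small function. This is a minimal sketch that follows the pseudocode literally; the function name, the use of nanoseconds rather than cycles, and the example latencies are assumptions for illustration.

```python
def update_delta(delta_ns, actual_stall_ns, row_buffer_miss_ns):
    """Apply one MAPG-Controller update after a memory response.

    Returns (new_delta_ns, predicted_stall_ns).
    """
    predicted_ns = row_buffer_miss_ns + delta_ns
    beta = actual_stall_ns - predicted_ns
    if beta < 0:
        # Stall ended earlier than predicted: shrink delta by the full
        # error to avoid a future performance hit.
        new_delta = delta_ns + beta
    else:
        # Stall lasted longer than predicted: blend the error in slowly
        # to adapt to increasing memory latency.
        new_delta = 0.8 * delta_ns + 0.2 * beta
    return new_delta, predicted_ns


# Example with placeholder latencies (not from the slides).
delta, predicted = update_delta(0.0, actual_stall_ns=60.0, row_buffer_miss_ns=50.0)
print(delta, predicted)
```

The asymmetry is the point of the rule: an over-prediction is corrected in one step, while an under-prediction is only partially absorbed.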
11
Methodology
● TOOLS
– GEM5: architectural simulation
– DRAMSIM2: memory hierarchy tool
– McPAT: area and power analysis tool
– HSPICE: core wake-up and PDN analysis
● System
– 4 in-order cores @ 2GHz (2IALU, 1IMULT, 1FPALU)
– 32KB-2way 0.5ns L1, 256KB-8way 4ns L2, 8MB-16way 13ns L3
– DDR3 50ns Memory
– 32nm HP
– Benchmarks: SPEC2006
● Comparison points
– FUPG: functional unit power gating
– Oracle: PPGS with oracle core stall knowledge
– MAPG-Counter: PPGS with practical controller
4-Aug-16
Your Name / Affiliation
12
Results: Energy Comparison
● Oracle saves 8.8% energy on average, up to 38% max
● MAPG saves 1.68X the energy savings of FUPG
[Figure: energy savings chart; callouts: 38%, 21%, 14%, -0.2%, -2.0%]
13
Results: Time Breakdown
● 0.08% average execution overhead
● 11% average power gate time (47% MAX for lbm)
● 0.6% and 1% average core restore and wake-up time
14
Conclusions
● Developed new power gating mechanism in between FUPG and long latency wake-up core power gating
● Composed of PPGS, WUC, predictive MAPG-Controller
● Modeled safe core wake-up latencies between 5.02ns and 14.16ns
● Showed oracle energy savings as high as 38%
● Demonstrated practical MAPG energy savings as high as 21%
● Currently, we are extending our work to:
– Power gate out-of-order cores
– Create a model for wake-up delay given core states and location
– Analyze the benefits of staggered core wake-up
– Apply our technique to thermal management
15
Thank You
16
Backup Slides
• BACKUP SLIDES
17
Methodology
Core
– ISA: DEC-Alpha
– Model: EV4
– Execution: In-order
– Clock: 2.0GHz
– Icache: 32kB-2way
– Dcache: 32KB-2way
– Width: 2
– Functional Units: 2IALU 1IMULT 1FPALU
Memory Hierarchy
– L2 Cache: 256kB-8way 4ns
– L3 Cache: 8MB-16way 13ns
– DDR3 Latency: 50ns
– DDR3 Size: 2GB
System Configuration
– Total Cores: 4,8,16
– Tech Node: 32nm
– Private Caches: L1, L2
– Shared Cache: L3
– OS: Vanilla-Linux2.6.27
● TOOLS
– GEM5: architectural simulation
– DRAMSIM2: memory hierarchy tool
– McPAT: area and power analysis tool
– HSPICE: core wake-up and PDN analysis
18
Results: Energy Comparison
Oracle: oracle knowledge of core stall periods
TAP: Token-Based Adaptive Power Gating
MAPG-Counter: Adaptive Stall Counter Mechanism
FUPG: Functional Unit Power Gating
In-Order Core
Wake-up Mode: 8ns Charge Delay
0% Performance Hit!
TAP In-Order: Up to 25.26% energy savings
TAP EV6: Up to 23.18% energy savings
EV6 Core
1.5% of Max Oracle Savings
2x the energy savings of FUPG
19
Summary
• PPGS provides a flexible mechanism for reducing core
leakage power
• TAP provides core stall duration information to allow the
PPGS to power gate
• WUC manages reliability constraints to prevent core logic
corruption
• Power gating will be an important mechanism for providing
energy proportional processors
• Waking up a core from a power-gated state is possible in less
than 10ns.
20
PPGS: Programmable Power Gating Switch
• Wake-up time vs. rush current: header case
– During sleep mode, charges in all circuit nodes are discharged
– During wake-up, all nodes need to be charged to the correct states
• The amount of charge depends on design size (not wake-up time)
• Fast (slow) wake-up -> large (small) rush current
[Figure: rush current vs. time against Ilimit for slow wake-up, fast wake-up, and optimal wake-up; same area under each curve]
• Optimal wake-up: how to make this waveform? -> needs a good wake-up control technique with fewest #signals!
21
Power Gating/Wake-up Sequence
[Figure: timing diagram across ACTIVE MODE, POWER DOWN, WAKE UP, RESTORE, and ACTIVE MODE. Signals: CLOCK, power down, power down trigger, power up trigger, enable few, enable rest, retention, clamp, async-reset. Numbered events 1 to 8 mark the power down sequence and the wake up sequence; intervals shown are 1T, Tcharge, and Trestore.]
– Tcharge (between 4 & 5): charge time for Vdd_int node (10 – 50 cycles)
– Trestore: pipeline refill (6 cycles)
– We exploit variable wake-up time at different system utilization levels
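The cycle counts above translate into time once a clock frequency is fixed. The sketch below converts them to nanoseconds, assuming the 2.0GHz clock listed in the methodology slides applies here. It ignores the single-cycle (1T) steps in the sequence, so the totals are a simplified estimate rather than the exact wake-up overhead; the function names are hypothetical.

```python
CLOCK_GHZ = 2.0  # assumed from the methodology slides


def cycles_to_ns(cycles, clock_ghz=CLOCK_GHZ):
    """Convert clock cycles to nanoseconds."""
    return cycles / clock_ghz


def wakeup_overhead_ns(tcharge_cycles, trestore_cycles=6):
    """Tcharge + Trestore in ns, ignoring the 1T handshake steps."""
    return cycles_to_ns(tcharge_cycles + trestore_cycles)


print(cycles_to_ns(10), cycles_to_ns(50))              # Tcharge range in ns
print(wakeup_overhead_ns(10), wakeup_overhead_ns(50))  # Tcharge + Trestore
```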
22
Safe Wake-up Mode Analysis
• Minimum wake-up time for EV6 16-core (SPICE simulation)
[Figure: eight core-placement cases (A, Wa, Wd, Wn) with minimum wake-up times: (a) 7.8ns, (b) 11.1ns, (c) 13.8ns, (d) 16.1ns, (e) 16.5ns, (f) 3.9ns, (g) 3.4ns, (h) 8.3ns]
A: critical active core, Wa: adjacent woken-up core,
Wd: woken-up core in the diagonal position, Wn: non-adjacent woken-up core
• Minimum wake-up latency model: $T = T_0 (w + \beta x + \gamma y + \delta z)^{\alpha}$
– $T_0$, $\alpha$, $\beta$, $\gamma$, $\delta$: coefficients
– $w$, $x$, $y$: # of Wa, Wd, Wn
– $z$: # of woken-up cores in edge
• Modeled wake-up latency and error from SPICE
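The latency model on this slide can be evaluated directly once coefficients are known. The sketch below reads the model as T = T0 (w + βx + γy + δz)^α, with α taken as an exponent; that reading is inferred from the slide text. The coefficient values are placeholders, since the fitted values are not given in the slides.

```python
def min_wakeup_latency_ns(t0, alpha, beta, gamma, delta, w, x, y, z):
    """T = T0 * (w + beta*x + gamma*y + delta*z) ** alpha.

    w, x, y: number of adjacent (Wa), diagonal (Wd), and non-adjacent (Wn)
    woken-up cores; z: number of woken-up cores in edge positions.
    """
    return t0 * (w + beta * x + gamma * y + delta * z) ** alpha


# Placeholder coefficients (assumptions, not fitted values from the slides).
coeffs = dict(t0=4.0, alpha=0.5, beta=0.5, gamma=0.25, delta=1.0)
print(min_wakeup_latency_ns(**coeffs, w=4, x=0, y=0, z=0))
print(min_wakeup_latency_ns(**coeffs, w=2, x=2, y=4, z=0))
```

With these placeholder weights, two diagonal and four non-adjacent cores contribute as much as two adjacent cores, which is the kind of trade-off the weighted sum is meant to capture.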
23
To Dependency
24
Staggered Wake-up
• Different start time between two woken-up cores
• Staggered wake-up can reduce wake-up time significantly
• Minimum wake-up latency with some interval time (delta)
25
TAP: Adapting to Memory Latency
• DDR3 access latency experiences variability from:
– Bank queue length
– Row buffer hit / row buffer miss
– Channel contention
– Refresh cycle
• TAP adapts to this variability via a two-step process:
– On a last-level cache miss, TAP sends a token with unknown ETA and PPGS power gates the core immediately.
– The memory controller sends an updated ETA once the memory operation is scheduled; PPGS then schedules core wake-up.
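The two-step process above can be sketched as a pair of event handlers. The class and method names are hypothetical. The scheduling rule, which starts wake-up one wake-up latency before the ETA and never earlier than the current time, is an assumption about what scheduling core wake-up means, not something the slides specify.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GatedCore:
    wake_latency_ns: float
    gated_at_ns: Optional[float] = None
    wake_start_ns: Optional[float] = None

    def on_llc_miss(self, now_ns: float) -> None:
        # Step 1: token arrives with unknown ETA; gate the core immediately.
        self.gated_at_ns = now_ns
        self.wake_start_ns = None

    def on_eta_update(self, now_ns: float, eta_ns: float) -> float:
        # Step 2: ETA is now known; start waking so the core is ready at ETA.
        self.wake_start_ns = max(now_ns, eta_ns - self.wake_latency_ns)
        return self.wake_start_ns


core = GatedCore(wake_latency_ns=8.06)
core.on_llc_miss(now_ns=0.0)
print(core.on_eta_update(now_ns=10.0, eta_ns=60.0))
```

If the ETA arrives too late to hide the full wake-up latency, the `max` clamps the start to the current time and the core resumes slightly after the data returns.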
26
Results: Core Execution Time Breakdown
• TAP has 0% performance impact, as all execution time is normalized to the original unmodified execution time
• TAP power gates in-order and out-of-order cores for 16.23% and 8.25% of simulation time, respectively, when averaged across all benchmarks
• TAP max power gated time is 64.4% for lbm on the in-order core and 49.98% for mcf on the EV6 core
• TAP spends 5.62% and 1.20% of time waking up and restoring the in-order and EV6 cores, respectively, when averaged across all benchmarks.
EV6 Core
In-order Core
27
Results: Energy Savings Vs Wake-up Latency
Note: Higher wake-up modes denote higher
wake-up latency.
In-Order Core
System: 4 core CMP
As wake-up latency of the core increases from 2ns to 16ns, max energy savings decrease from:
• TAP In-Order: 31.5% to 22.3% (lbm)
• TAP EV6: 25.8% to 20.1% (mcf)
• Greater than 20% energy savings even when core wake-up latency is 16ns!
EV6 Core
28
Results: Staggered Wake-up Energy Savings
System: 16 EV6 Core CMP
• A stagger of 0.9ns can reduce wake-up latency by 7.7ns and improve energy savings from 18.92% to 22.06% for mcf
[Figure: Wake-up Time (ns) for the 8-core and 16-core cases; Energy Savings (-1.00% to 24.00%) for no stagger, 0.3ns stagger, 0.6ns stagger, and 0.9ns stagger]
29
Results: Adapting to Memory Latency
From 1 to 32 threads, average core stall latency increases from 36.77ns to 287.63ns
TAP can increase core power gated time from 10.12% to 34.23% of total execution time.
[Figure: average stall duration (nanoseconds, 0 to 350) and Core Power Gate Time (%, 0.00% to 40.00%) vs. Number of Threads (0 to 35); series: Average Stall Duration, TAP, MAPG-Counter]
30
Power Gating and Data Retention
[Figure: power gating interface. Labels: head switch, Vdd_core, Vdd_int, Vdd_sram, Vss, enable_few, enable_rest, CORE, retention flip-flops (RET, D, Q, clk, reset), level shifter, clamp, SRAM, clock, controller.]
• Interface
– Three power domains
– Vdd_int domain is collapsible during power gating
– Vdd_core domain supplies retention latches
– Vdd_sram domain supplies SRAM, and source biasing can be used during standby mode
• Power gating/wake-up sequence
– Tcharge (between 4 & 5): charge time for Vdd_int node
– Trestore: cycles for data restoration
– We exploit variable wake-up time at different system utilization levels
[Figure: timing diagram across ACTIVE MODE, POWER DOWN, WAKE UP, RESTORE, and ACTIVE MODE. Signals: CLOCK, power down, power down trigger, power up trigger, enable few, enable rest, retention, clamp, async-reset. Numbered events 1 to 8 mark the power down sequence and the wake up sequence; intervals shown are 1T, Tcharge, and Trestore.]
31
Power Gating Design: Enable Signal
• Enable signal topologies: single daisy chain, star, fish bone
[Figure: routing of the enable signal to power gating switches and groups of switches for each topology]
• With two-signal wake-up, each cell needs to be controlled as fast as possible
– Rush current is controlled by the time difference between enable_few and enable_rest, not by the topology
32
Core Modeling
Power Gating Strategies -> Energy, Performance
[Figure: core modeling flow. Labels grouped as follows.]
– Design / Tech Spec: Freq.: 2 GHz; #tr; 7.8M
– ITRS device parameters: M1 half pitch, Lgate, Vdd, Cg,total, Jg,limit, Ioff
– Logic gate model (following ITRS MPU power/freq. model)
– McPAT: logic area, runtime dynamic power, peak dynamic power, leakage power
– IR-drop rule; Switch: Ron, Ioff
– Derived quantities: total gate cap., total interconnect cap., total charge (Q), #switches, rush current, wake-up time, energy, break-even point
33
Safe Mode w.r.t Location
• Multi-core wake-up case
[Figure: safe wake-up mode per core location for several multi-core wake-up cases; the modes shown are 5, 6, 7, and 8, mostly 7 and 8]
• Mode does not differ much w.r.t. location
• Regarding turn-on location, is temperature analysis possible? (e.g., S. Reda's work)
34
Wake-UP Time Control With WUC
• WUC determines the optimal wake-up mode and staggered offset for PPGS
• Packet interface between WUC and PPGS
– PACKET: 1. wake-up mode, 2. staggered offset
– PACKET: expected latency (memory miss / expected latency)
[Figure: block diagram with memory controller, CORE, PPGS, and WUC]
• Run time (dynamic) wake-up scenario with core information
[Figure: ON/OFF timelines for CORE 1 and CORE 2 over time. Events: memory miss on CORE 1, memory miss on CORE 2, wake-up mode update due to CORE 2 (victim core), interval for staggered wake-up. Intervals: TOFF,1, TWAKE,1, T'WAKE,1, TON,1, TOFF,2, TWAKE,2, TON,2, expected memory latency.]
35