Heterogeneous Microarchitectures Trump Voltage Scaling for Low-Power Cores Andrew Lukefahr, Shruti Padmanabha, Reetuparna Das, Ronald Dreslinski Jr., Thomas F.


Heterogeneous Microarchitectures
Trump Voltage Scaling
for Low-Power Cores
Andrew Lukefahr, Shruti Padmanabha, Reetuparna Das,
Ronald Dreslinski Jr., Thomas F. Wenisch, and Scott Mahlke
University of Michigan
PACT’23
August 26th, 2014
Low-Power Cores
1996: Nokia 9000
• 800 mAh battery
• Intel 80386 @ 24MHz
2013: Nexus 5
• 2300 mAh battery
• Krait 400 [email protected]
?: HD video + Web surfing, all day battery?
How do we get there?
2
The Big Picture
• Study efficiency of heterogeneous architectures
• Efficiency depends on the schedule
• Factor out scheduling efficiency
• Study architectural efficiency
3
Heterogeneity
[Figure: an application divided into quanta (epochs), each with different behavior (ILP, MLP, pointers, branch misses, floating point), running on a core.]
Decompose Core?
4
Heterogeneity
• Single-ISA Heterogeneous Microarchitectures (HMs)
– Migrate core microarchitecture (i.e. Big and Little) [Kumar’03]
– Big Core (OoO), Little Core (InOrder)
• Dynamic Voltage/Frequency Scaling (DVFS)
– Change voltage/frequency points to improve efficiency [Horowitz’94]
– Example points in figure: 1000 MHz, 250 MHz
• More dimensions?
5
Quanta Size
• Coarse-grained DVFS: off-chip regulators, 20-70 µs [Mazouz’13]
• Coarse-grained HMs: cache transfer, 15-25 µs [Greenhalgh’11]
• Fine-grained DVFS: on-chip regulators, 10-20 ns [Kim’12]
• Fine-grained HMs: shared caches, transfer regs, 10-30 ns [Padmanabha’13]
• Quantum: 10M insts (coarse-grained), 1K insts (fine-grained)
[Figure labels: Big @ 1GHz; Big/Little @ 750MHz.]
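To put the switching latencies on this slide in proportion to the quantum sizes, the sketch below converts a quantum measured in instructions into time and reports the fraction spent switching. The 1 GHz clock and IPC of 1.0 are illustrative assumptions, not values from the slides. The latencies are the upper ends of the quoted ranges, and their pairing with each mechanism is inferred from the citations.

```python
# Sketch: fraction of each quantum lost to one mode switch.
# ASSUMPTIONS (not from the slides): 1 GHz clock, IPC of 1.0.
CLOCK_HZ = 1e9
IPC = 1.0


def switch_overhead(quantum_insts, switch_latency_s):
    """Return the switch latency as a fraction of the quantum's run time."""
    quantum_time_s = quantum_insts / (IPC * CLOCK_HZ)
    return switch_latency_s / quantum_time_s


# (mechanism, quantum in instructions, upper-bound switch latency in seconds)
cases = [
    ("coarse DVFS", 10_000_000, 70e-6),
    ("coarse HMs", 10_000_000, 25e-6),
    ("fine DVFS", 1_000, 20e-9),
    ("fine HMs", 1_000, 30e-9),
]

overheads = {name: switch_overhead(q, lat) for name, q, lat in cases}
for name, frac in overheads.items():
    print(f"{name}: {frac:.2%} of the quantum")
```

Under these assumptions the switch cost stays in the low single-digit percent range in all four cases, which is why nanosecond-scale mechanisms are needed before 1K-instruction quanta become practical.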
6
The Goal
• Coarse-grained DVFS: yesterday’s cores
• Coarse-grained HMs: today’s cores
• Fine-grained DVFS, fine-grained HMs, and combinations: future cores?
Which is most efficient?
• DVFS vs. HMs
• Coarse vs. Fine
7
Schedules
[Figure: IPC over time, divided into quanta (epochs), showing performance on the Big core and on the Little core; below it, a schedule that picks Little or Big for each quantum.]
Most efficient schedule?
8
Pareto-Optimal Schedules
              Quantum 1        Quantum 2        Quantum 3
              Delay   Energy   Delay   Energy   Delay   Energy
Big Core      10 ms   50 mJ    20 ms   60 mJ    30 ms   60 mJ
Little Core   20 ms   20 mJ    40 ms   50 mJ    35 ms   40 mJ

Number   Schedule   Delay   Energy
1        {B,B,B}    60 ms   170 mJ
2        {B,B,L}    65 ms   150 mJ
3        {B,L,B}    80 ms   160 mJ
4        {B,L,L}    85 ms   140 mJ
5        {L,B,B}    70 ms   140 mJ
6        {L,B,L}    75 ms   120 mJ
7        {L,L,B}    90 ms   130 mJ
8        {L,L,L}    95 ms   110 mJ

Pick one core for each quantum
Schedule: {L,L,B} => ( 90 ms, 130 mJ )
How efficient is this schedule?
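The schedule table follows mechanically from the per-quantum numbers: each schedule's delay and energy are the sums over its quanta. A minimal sketch of that enumeration, with names (`COSTS`, `evaluate`) chosen here for illustration:

```python
from itertools import product

# Per-quantum (delay ms, energy mJ) for each core, from the slide's table.
COSTS = {
    "B": [(10, 50), (20, 60), (30, 60)],  # Big core
    "L": [(20, 20), (40, 50), (35, 40)],  # Little core
}


def evaluate(schedule):
    """Sum delay and energy over the quanta of one schedule, e.g. ('L', 'L', 'B')."""
    delay = sum(COSTS[core][q][0] for q, core in enumerate(schedule))
    energy = sum(COSTS[core][q][1] for q, core in enumerate(schedule))
    return delay, energy


# Enumerate all 2^3 schedules in the slide's order ({B,B,B} first, {L,L,L} last).
schedules = {s: evaluate(s) for s in product("BL", repeat=3)}
for s, (d, e) in schedules.items():
    print("{" + ",".join(s) + "}", f"{d} ms", f"{e} mJ")
```

Exhaustive enumeration works for three quanta and two cores, but the number of schedules grows exponentially with the number of quanta.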
9
Pareto-Optimal Schedules
Number   Schedule   Delay   Energy   Pareto optimal?
1        {B,B,B}    60 ms   170 mJ   yes
2        {B,B,L}    65 ms   150 mJ   yes
3        {B,L,B}    80 ms   160 mJ   no
4        {B,L,L}    85 ms   140 mJ   no
5        {L,B,B}    70 ms   140 mJ   yes
6        {L,B,L}    75 ms   120 mJ   yes
7        {L,L,B}    90 ms   130 mJ   no
8        {L,L,L}    95 ms   110 mJ   yes
[Figure: energy (mJ) vs. delay (ms) for schedules 1-8, marked as Pareto optimal or non-Pareto optimal; lower energy and lower delay are better.]
• Some schedules are just better (#6 > #3)
• Pareto-optimal schedules determine the best tradeoffs for a given architecture
• Schedule efficiency affects architectural efficiency
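Which of the eight schedules are Pareto optimal can be checked directly: a schedule is dropped if some other schedule is no worse in both delay and energy. A small sketch of that dominance test (the function names are illustrative):

```python
# (delay ms, energy mJ) for schedules 1-8, from the slide's table.
SCHEDULES = {
    1: (60, 170), 2: (65, 150), 3: (80, 160), 4: (85, 140),
    5: (70, 140), 6: (75, 120), 7: (90, 130), 8: (95, 110),
}


def dominates(a, b):
    """True if a is no worse than b in both delay and energy, and differs from b."""
    return a[0] <= b[0] and a[1] <= b[1] and a != b


def pareto_optimal(points):
    """Return the ids of points that no other point dominates."""
    return sorted(
        i for i, p in points.items()
        if not any(dominates(q, p) for q in points.values())
    )


frontier = pareto_optimal(SCHEDULES)
print(frontier)  # [1, 2, 5, 6, 8]
```

Schedule #3, for example, is dominated by #6, which is both faster (75 ms vs. 80 ms) and cheaper (120 mJ vs. 160 mJ).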
10
Schedule Efficiency
Lowest energy for given performance level
K Modes x N Quanta
Total Schedules: K^N (12^1000000)
Find most efficient schedule for given performance level
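To get a feel for why exhaustive search is hopeless, the sketch below counts the decimal digits of K^N without materializing the number. It assumes the slide's figure reads as 12 modes over 1,000,000 quanta; that reading of the garbled exponent is an assumption.

```python
import math


def schedule_count_digits(k_modes, n_quanta):
    """Number of decimal digits in k_modes ** n_quanta, without computing it."""
    return math.floor(n_quanta * math.log10(k_modes)) + 1


# The three-quantum, two-core example has 2**3 = 8 schedules.
small = 2 ** 3

# ASSUMPTION: the slide's figure is 12 ** 1000000 (12 modes, one million quanta).
digits = schedule_count_digits(12, 1_000_000)
print(small, digits)
```

Under that reading the schedule count has more than a million decimal digits, so the most efficient schedule has to be found without enumerating candidates.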
11
Regions
Can we break into regions?
[Figure: energy vs. delay frontiers for region (0..m) and region [m..n), summed into a frontier for (0..n).]
Combine regions?
Merging regions requires exponential complexity…

12
Approximate Regions
[Figure: each region's Pareto frontier is bracketed by a best-case frontier (best energy & delay) and a worst-case frontier (worst energy & delay), with the gap limited to ≤ΔE and ≤ΔD; the approximate frontiers for (0..m) and [m..n) are summed into one for (0..n).]
• Pareto-optimal region
• Limit error to +/- 2.5%
• Best region energy/delay tradeoffs! ( with bounded error )
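One way to picture the idea of merging regions with bounded error is sketched below. It is an illustration under stated assumptions, not the algorithm from the paper: frontiers of consecutive regions are combined by adding delays and energies, dominated points are pruned (exact), and the result is then thinned so that every dropped point has a kept point with no worse delay and energy within a relative tolerance `tol` (approximate). All function names are hypothetical.

```python
def pareto(points):
    """Keep non-dominated (delay, energy) points, sorted by increasing delay."""
    frontier = []
    for d, e in sorted(set(points)):
        if not frontier or e < frontier[-1][1]:
            frontier.append((d, e))
    return frontier


def merge(region_a, region_b):
    """Frontier of two consecutive regions: delays and energies add."""
    sums = [(da + db, ea + eb) for da, ea in region_a for db, eb in region_b]
    return pareto(sums)


def thin(frontier, tol):
    """Drop points whose energy is not at least `tol` (relative) below the
    last kept point; the kept point is faster and at most ~tol worse in energy."""
    kept = [frontier[0]]
    for d, e in frontier[1:]:
        if e < kept[-1][1] * (1 - tol):
            kept.append((d, e))
    return kept


# Per-quantum options from the three-quantum example: (delay ms, energy mJ).
q1 = [(10, 50), (20, 20)]
q2 = [(20, 60), (40, 50)]
q3 = [(30, 60), (35, 40)]

exact = merge(q1, merge(q2, q3))
approx = thin(exact, 0.10)
print(exact)
print(approx)
```

On the three-quantum example the exact merge reproduces the five Pareto-optimal schedules, and a 10% tolerance (chosen only to make the thinning visible; the slide quotes 2.5%) keeps three of them.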
13
Evaluation - DVFS
• DVFS
– 28nm node
– Low-Power Fully Depleted Silicon-on-Insulator (FDSOI)
[Figure: frequency (0-2500 MHz) vs. voltage (0.5-1.2 V); labeled operating points: 600MHz @ 0.6V and 2000MHz @ 1.1V.]
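The two labeled operating points can be compared with the textbook CMOS relation for dynamic power, P ~ C * V^2 * f. The sketch below is a back-of-the-envelope estimate only: it assumes switched capacitance and activity are unchanged and ignores leakage, neither of which is stated on the slide.

```python
# Sketch: relative dynamic power and energy per cycle between two DVFS points,
# using the textbook relation P_dyn ~ C * V^2 * f.
# ASSUMPTIONS: capacitance and activity are unchanged; leakage is ignored.


def dynamic_power_ratio(v_low, f_low, v_high, f_high):
    """Dynamic power at the low point relative to the high point."""
    return (v_low / v_high) ** 2 * (f_low / f_high)


def energy_per_cycle_ratio(v_low, v_high):
    """Dynamic energy per cycle scales with V^2 only."""
    return (v_low / v_high) ** 2


power = dynamic_power_ratio(0.6, 600, 1.1, 2000)
energy = energy_per_cycle_ratio(0.6, 1.1)
print(f"power: {power:.3f}x, energy/cycle: {energy:.3f}x, speed: {600 / 2000:.2f}x")
```

Under these assumptions, dropping from 2000MHz @ 1.1V to 600MHz @ 0.6V cuts dynamic power to about 9% and dynamic energy per cycle to about 30%, at 0.3x the clock rate.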
14
Evaluation - HMs
• HMs modeled off ARM’s big.LITTLE
– Little (A7): 2-issue in-order core
– Big (A15): 3-issue out-of-order core
• Validation (Dhrystone):
System     Evaluation    Δ Performance   Δ Energy
Industry   big.LITTLE    1.9x            3.5x
Modeled    gem5+McPAT    2.09x           3.01x
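How closely the model tracks the industry numbers can be read off as a signed relative error. The sketch below is plain arithmetic on the four values in the validation table; the dictionary layout is an illustrative choice.

```python
# Δ values from the validation table (Dhrystone).
industry = {"performance": 1.9, "energy": 3.5}   # big.LITTLE
modeled = {"performance": 2.09, "energy": 3.01}  # gem5 + McPAT


def relative_error(model, reference):
    """Signed error of the model relative to the reference."""
    return (model - reference) / reference


errors = {k: relative_error(modeled[k], industry[k]) for k in industry}
for metric, err in errors.items():
    print(f"{metric}: {err:+.1%}")
```

The model overestimates the performance ratio by 10% and underestimates the energy ratio by 14%.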
15
Coarse-Grained Comparison
[Figure: energy savings (0-100%) vs. slowdown (0-250%) for DVFS, HMs, and DVFS + HMs.]
• Normalized to highest performance core (Big @ 2GHz)
• 100% slowdown = 2x runtime
• Pareto-optimal schedules result in best possible tradeoffs
• Most efficient schedule for 50% slowdown
• HMs provide better benefits, especially for smaller slowdowns
• Lower DVFS levels are less efficient than HMs
• DVFS allows continued scaling
• DVFS+HMs provides minimal benefits
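Each point on such a curve answers one question: for a given slowdown budget relative to the fastest configuration, which Pareto-optimal schedule uses the least energy? The sketch below illustrates that selection on the toy three-quantum example from earlier in the talk, not on the measured data behind the plot; the function name and the all-Big baseline are assumptions made for illustration.

```python
# Pareto-optimal schedules from the three-quantum example: (delay ms, energy mJ).
FRONTIER = {"BBB": (60, 170), "BBL": (65, 150), "LBB": (70, 140),
            "LBL": (75, 120), "LLL": (95, 110)}
BASELINE = "BBB"  # ASSUMPTION: baseline is always running on the Big core


def best_for_slowdown(frontier, max_slowdown):
    """Lowest-energy schedule whose delay is within max_slowdown of the baseline.

    max_slowdown is a fraction: 1.0 means 100% slowdown, i.e. 2x runtime.
    Returns (schedule, slowdown, energy_savings).
    """
    base_delay, base_energy = frontier[BASELINE]
    limit = base_delay * (1 + max_slowdown)
    feasible = {s: (d, e) for s, (d, e) in frontier.items() if d <= limit}
    name = min(feasible, key=lambda s: feasible[s][1])
    d, e = feasible[name]
    return name, d / base_delay - 1, 1 - e / base_energy


choice = best_for_slowdown(FRONTIER, 0.25)
print(choice)
```

With a 25% slowdown budget the sketch picks {L,B,L}, which saves roughly 29% energy relative to running every quantum on the Big core.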
16
Fine-Grained Architecture
• DVFS – On-chip voltage regulators
– Neglect efficiency losses
• HMs – Composite Cores architecture
– Shared L1 caches, frontend
• HMs incur ~7% power overheads
– Leakage => clock-gating (not power-gating)
– Dynamic => Over-provisioned hardware
17
Fine-Grained Comparison
[Figure: energy savings (0-100%) vs. slowdown (0-100%) for DVFS, HMs, and DVFS + HMs.]
• Start with Coarse-Grained
• Fine-Grained DVFS ≈ Coarse-Grained DVFS
• HMs beats Coarse-Grained DVFS + HMs until ~50% slowdown
• DVFS + HMs provides benefits, but not additive
• Chart annotations: ~10%, ~5% (“higher savings”, “savings for free”)
18
Benchmarks
• Energy savings for a 5% slowdown
[Figure: per-benchmark energy savings (0-50%) for DVFS, HMs, and DVFS + HMs.]
• Not always a clear winner
• DVFS+HMs has different benefits
19
Summary
[Figure: energy savings (0-100%) of DVFS, HMs, and DVFS + HMs at coarse and fine granularity, for 5%, 25%, and 100% slowdown; chart annotations: 5%, 6%.]
• Fine-Grained HMs provide most benefits for small slowdowns
[Overlapping callout, not fully recoverable: “Fine-Grained DVFS”, “+”, “provide no benefits”, “large slowdowns”.]
20
More Details in Paper
• Assumptions of fine-grained architectures
• Overheads analysis for HMs
– Switching Overheads
– Power Overheads
• Detailed benchmark analysis
21
Conclusions
Questions?
[Figure: 2x2 grid of Coarse Grained / Fine Grained vs. DVFS / HMs.]
• HMs trump DVFS for small slowdowns
• DVFS + HMs best for large slowdowns
22
Switching Overheads
[Figure: energy savings (0-60%) for DVFS and HMs at 5%, 10%, and 25% slowdown, with switching overheads of 0 ns, 20 ns, 50 ns, 100 ns, and 200 ns.]
24
Leakage Overheads
[Figure: energy savings (0-50%) at 5%, 10%, and 25% slowdown for leakage overheads of 5%, 10%, 20%, 30%, and 40% Little, compared with Ideal DVFS.]
25
Benchmarks I
[Figure: energy savings vs. slowdown (0-25%) for hmmer, xalancbmk, and mcf; series: DVFS, HMs, DVFS + HMs (legend variants: Coarse DVFS + HMs, Fine Grained HMs, Fine Grained DVFS).]
26
Benchmarks II
[Figure: energy savings (0-60%) vs. slowdown (0-25%) for perlbench and omnetpp; series: DVFS, HMs, DVFS + HMs (legend variants: Coarse Grained HMs, Fine Grained HMs).]
27
Limitation: State Transfer
[Figure: separate Big and Little pipelines (Fetch, Decode), each with its own iCache, iTLB, Branch Pred, Reg File, dTLB, and dCache. State sizes: caches 10s of KB, Reg File <1 KB.]
• State transfer costs can be very high: ~20K cycles (ARM’s big.LITTLE)
• Limits switching to coarse granularity: 100M Instructions ( Kumar’04)
28
Creating a Composite Core
[Figure: Big uEngine (Decode, RAT, O3 Execute) and Little uEngine (Decode, inO Execute) combined into one core with a Controller; components shown include Fetch, iCache, iTLB, Branch Pred, Load/Store Queue, Reg File (<1KB), Mem, dTLB, and dCache.]
• State transfer overheads: 20K to ~20 cycles
• Switching granularity: 100M to 1000 instructions
• Little pays ~8% energy overhead
29
Low-Power Cores
Are everywhere…
1 Billion smartphones in 2014 [1,2]
[1] http://www.gartner.com/newsroom/id/2665715
[2] http://www.gartner.com/newsroom/id/2665715
http://www.dialaphone.co.uk/blog/2008/06/17/a-funny-look-back-at-some-old-cell-phones/
http://www.businesskorea.co.kr/article/1687/local-market-saturation-korea%E2%80%99s-smartphone-market-forecast-negative-growth-year
30
Cites
• M. Horowitz, T. Indermaur, and R. Gonzalez, “Low-power digital design,” in IEEE Symp. Low Power Electron. (ISLPE’94) Digest of Tech. Papers, Oct. 1994, pp. 8–11.
• R. Kumar, K. I. Farkas, N. P. Jouppi, P. Ranganathan, and D. M. Tullsen, “Single-ISA heterogeneous multi-core architectures: The potential for processor power reduction,” in Proc. 36th Annual IEEE/ACM Int. Symp. Microarchitecture (MICRO-36), Dec. 2003, pp. 81–92.
[1] https://software.intel.com/sites/default/files/ftalat.pdf
[2] http://www.arm.com/files/downloads/big_LITTLE_Final_Final.pdf
[3] “A fully-integrated 3-level dc-dc converter for nanosecond-scale dvfs”
[4] “Composite cores: pushing heterogeneity within a core”
31