Heterogeneous Microarchitectures Trump Voltage Scaling for Low-Power Cores
Andrew Lukefahr, Shruti Padmanabha, Reetuparna Das, Ronald Dreslinski Jr., Thomas F. Wenisch, and Scott Mahlke
University of Michigan
PACT’23, August 26th, 2014

Low-Power Cores
             1996: Nokia 9000       2013: Nexus 5     Next: ?
Battery      800 mAh                2300 mAh
Processor    Intel 80386 @ 24MHz    Krait 400
How do we get there? HD video + Web surfing. All day battery?

The Big Picture
• Study efficiency of heterogeneous architectures
• Efficiency depends on the schedule
• Factor out scheduling efficiency
• Study architectural efficiency

Heterogeneity
[Figure: an application divided into quanta (epochs) with differing behavior (ILP, MLP, pointers, branch misses, floating point) running on one core]
Decompose Core?

Heterogeneity
• Dynamic Voltage/Frequency Scaling (DVFS)
  – Change voltage/frequency points to improve efficiency [Horowitz’94]
• Single-ISA Heterogeneous Microarchitectures (HMs)
  – Migrate core microarchitecture (i.e. Big and Little) [Kumar’03]
• More dimensions?
[Figure: a core at 1000 MHz and at 250 MHz for DVFS; a Big (OoO) core and a Little (InOrder) core for HMs]

Quanta Size
           Coarse Grained                 Fine Grained
DVFS       Off-chip regulators            On-chip regulators
           20-70 µs [Mazouz’13]           10-20 ns [Kim’12]
HMs        Cache transfer                 Shared caches
           15-25 µs [Greenhalgh’11]       10-30 ns [Padmanabha’13]
Quantum    10M Insts                      1K Insts
[Figure: core diagrams labeled Big @ 1GHz and Big/Little @ 750MHz]

The Goal
[Figure: grid of DVFS vs. HMs against coarse-grained vs. fine-grained switching, with cells labeled Yesterday’s Cores, Today’s Cores, and Future Cores?]
Which is most efficient?
• DVFS vs. HMs
• Coarse vs. Fine

Schedules
[Figure: IPC over time across quanta (epochs), showing performance on the Big core and on the Little core; example schedule: Little, Big, Little, Big, Little, Big]
Most efficient schedule?
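The question posed by the Schedules slide can be made concrete with a small brute-force sketch. It is only an illustration, not the authors' tool: the per-quantum delay and energy numbers are the ones from the three-quantum example in the Pareto-Optimal Schedules slide that follows, while the names `COSTS`, `evaluate`, and `pareto_frontier` are hypothetical. The sketch enumerates every Big/Little assignment, sums delay and energy per schedule, and keeps the schedules that no other schedule beats on both metrics.

```python
from itertools import product

# Per-quantum (delay in ms, energy in mJ) for each core, copied from the
# three-quantum example in the talk. "B" = Big core, "L" = Little core.
# The dictionary layout and all names here are illustrative assumptions.
COSTS = {
    "B": [(10, 50), (20, 60), (30, 60)],
    "L": [(20, 20), (40, 50), (35, 40)],
}
NUM_QUANTA = 3


def evaluate(schedule):
    """Sum delay and energy over all quanta for one schedule, e.g. ('L', 'L', 'B')."""
    delay = sum(COSTS[core][q][0] for q, core in enumerate(schedule))
    energy = sum(COSTS[core][q][1] for q, core in enumerate(schedule))
    return delay, energy


def pareto_frontier(points):
    """Return schedules that no other schedule beats on both delay and energy."""
    frontier = []
    best_energy = float("inf")
    # Walk in order of increasing delay; a schedule survives only if it
    # uses strictly less energy than every faster schedule seen so far.
    for schedule, (delay, energy) in sorted(points.items(), key=lambda kv: kv[1]):
        if energy < best_energy:
            frontier.append(schedule)
            best_energy = energy
    return frontier


# Brute force: K modes over N quanta gives K**N schedules (2**3 = 8 here).
all_schedules = {s: evaluate(s) for s in product("BL", repeat=NUM_QUANTA)}
frontier = pareto_frontier(all_schedules)

for s in frontier:
    print("".join(s), all_schedules[s])
```

Exhaustive enumeration is only workable for toy inputs like this one, since the number of schedules grows as K^N with the number of quanta.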

Pareto-Optimal Schedules
Pick one core for each quantum.

               Quantum 1          Quantum 2          Quantum 3
               Delay    Energy    Delay    Energy    Delay    Energy
Big Core       10 ms    50 mJ     20 ms    60 mJ     30 ms    60 mJ
Little Core    20 ms    20 mJ     40 ms    50 mJ     35 ms    40 mJ

Number    Schedule    Delay    Energy
1         {B,B,B}     60 ms    170 mJ
2         {B,B,L}     65 ms    150 mJ
3         {B,L,B}     80 ms    160 mJ
4         {B,L,L}     85 ms    140 mJ
5         {L,B,B}     70 ms    140 mJ
6         {L,B,L}     75 ms    120 mJ
7         {L,L,B}     90 ms    130 mJ
8         {L,L,L}     95 ms    110 mJ

Schedule: {L,L,B} => ( 90 ms, 130 mJ )
How efficient is this schedule?

Pareto-Optimal Schedules
[Plot: energy (mJ) vs. delay (ms) for schedules 1 to 8 from the table above, marking Pareto-optimal and non-Pareto-optimal schedules; lower energy and lower delay are better. Summing the table shows schedules 1, 2, 5, 6, and 8 are Pareto-optimal.]
• Some schedules are just better ( #6 > #3 )
• Pareto-optimal schedules determine the best tradeoffs for a given architecture
• Schedule efficiency affects architectural efficiency tradeoffs

Schedule Efficiency
• Lowest energy for a given performance level
• K Modes x N Quanta
• Total Schedules: K^N (12^1000000)
• Find the most efficient schedule for a given performance level

Regions
[Figure: energy vs. delay plots for quanta (0..m) and [m..n) are summed to give the plot for (0..n)]
• Can we break into regions?
• Combine regions?
• Merging regions requires exponential complexity…

Approximate Regions
[Figure: a Pareto-optimal region bounded by a best-case and a worst-case Pareto frontier (best and worst energy & delay), at most ΔE and ΔD apart; the regions for (0..m) and [m..n) sum to the region for (0..n)]
• Best region energy/delay tradeoffs! ( with bounded error )
• Limit error to +/- 2.5%

Evaluation - DVFS
• DVFS
  – 28nm node
  – Low-Power Fully Depleted Silicon-on-Insulator (FDSOI)
[Plot: frequency (MHz) vs. voltage (V), from 600MHz @ 0.6V up to 2000MHz @ 1.1V]

Evaluation - HMs
• HMs modeled off ARM’s big.LITTLE
  – Little (A7): 2-issue in-order core
  – Big (A15): 3-issue out-of-order core
• Validation (Dhrystone):

System      Evaluation    Δ Performance    Δ Energy
Industry    big.LITTLE    1.9x             3.5x
Modeled     gem5+McPAT    2.09x            3.01x

Coarse-Grained Comparison
[Plot: energy savings (0% to 100%) vs. slowdown (0% to 250%) for DVFS, HMs, and DVFS + HMs]
• Normalized to highest performance core: Big @ 2GHz
• 100% slowdown = 2x runtime
• Pareto-optimal schedules result in best possible tradeoffs
• Most efficient schedule for 50% slowdown
• HMs provide better benefits, especially for smaller slowdowns
• Lower DVFS levels are less efficient than HMs
• DVFS + HMs provides minimal benefits
• DVFS + HMs allows continued scaling

Fine-Grained Architecture
• DVFS
  – On-chip voltage regulators
  – Neglect efficiency losses
• HMs
  – Composite Cores architecture
  – Shared L1 caches, frontend
• HMs incur ~7% power overheads
  – Leakage => clock-gating (not power-gating)
  – Dynamic => Over-provisioned hardware

Fine-Grained Comparison
[Plot: energy savings (0% to 100%) vs. slowdown (0% to 100%) for DVFS, HMs, and DVFS + HMs]
• Start with Coarse-Grained DVFS + HMs
• Fine-Grained DVFS ≈ Coarse-Grained DVFS
• HMs beats Coarse-Grained until ~50% slowdown
• DVFS + HMs provides benefits, but not additive
• Callouts on the plot: ~10% savings for free; ~5% higher savings

Benchmarks
[Plot: per-benchmark energy savings (0% to 50%) for DVFS, HMs, and DVFS + HMs]
• Energy savings for a 5% slowdown
• Not always a clear winner
• DVFS + HMs has different benefits

Summary
[Plot: energy savings (0% to 100%) for DVFS, HMs, and DVFS + HMs, coarse vs. fine grained, at 5%, 25%, and 100% slowdown; callouts of 5%, 6%, and 5%]
• Fine-Grained HMs provide most benefits for small slowdowns
• Fine-Grained DVFS + HMs provide no benefits for large slowdowns

More Details in Paper
• Assumptions of fine-grained architectures
• Overheads analysis for HMs
  – Switching Overheads
  – Power Overheads
• Detailed benchmark analysis

Conclusions
[Figure: grid of DVFS vs. HMs against Coarse Grained vs. Fine Grained]
• HMs trump DVFS for small slowdowns
• DVFS + HMs best for large slowdowns
Questions?

Switching Overheads
[Plot: energy savings (0% to 60%) for DVFS and HMs at 5%, 10%, and 25% slowdown, with switching overheads of 0 ns, 20 ns, 50 ns, 100 ns, and 200 ns]

Leakage Overheads
[Plot: energy savings (0% to 50%) at 5%, 10%, and 25% slowdown for leakage overheads of 5%, 10%, 20%, 30%, and 40% Little, compared against Ideal DVFS]

Benchmarks I
[Plots: energy savings vs. slowdown (0% to 25%) for Hmmer, Xalancbmk, and mcf; series include DVFS, HMs, DVFS + HMs, Coarse DVFS + HMs, Fine Grained HMs, and Fine Grained DVFS]

Benchmarks II
[Plots: energy savings (0% to 60%) vs. slowdown (0% to 25%) for perlbench and omnetpp; series include DVFS, HMs, DVFS + HMs, Coarse Grained HMs, and Fine Grained HMs]

Limitation: State Transfer
[Figure: separate Big and Little pipelines (Fetch, Decode), each with its own iCache, iTLB, branch predictor, register file, dTLB, and dCache; caches hold 10s of KB, the register file <1 KB]
• State transfer costs can be very high: ~20K cycles (ARM’s big.LITTLE)
• Limits switching to coarse granularity: 100M Instructions (Kumar’04)

Creating a Composite Core
[Figure: a Big uEngine (O3 execute, RAT, load/store queue) and a Little uEngine (inO execute) with their own fetch, decode, and register files (<1KB), sharing iCache, iTLB, branch predictor, dCache, and dTLB under a controller]
• State transfer overheads: 20K to ~20 cycles
• Switching granularity: 100M to 1000 instructions
• Little pays ~8% energy overhead

Low-Power Cores Are everywhere…
1 Billion smartphones in 2014 [1,2]
[1] http://www.gartner.com/newsroom/id/2665715
[2] http://www.gartner.com/newsroom/id/2665715
http://www.dialaphone.co.uk/blog/2008/06/17/a-funny-look-back-at-some-old-cell-phones/
http://www.businesskorea.co.kr/article/1687/local-market-saturation-korea%E2%80%99s-smartphone-market-forecast-negative-growth-year

Cites
• M. Horowitz, T. Indermaur, and R. Gonzalez, “Low-power digital design,” in IEEE Symp. Low Power Electron. (ISLPE’94) Digest of Tech. Papers, Oct. 1994, pp. 8–11.
• R. Kumar, K. I. Farkas, N. P. Jouppi, P. Ranganathan, and D. M. Tullsen, “Single-ISA heterogeneous multi-core architectures: The potential for processor power reduction,” in Proc. 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-36), Dec. 2003, pp. 81–92.
[1] https://software.intel.com/sites/default/files/ftalat.pdf
[2] http://www.arm.com/files/downloads/big_LITTLE_Final_Final.pdf
[3] “A fully-integrated 3-level dc-dc converter for nanosecond-scale dvfs”
[4] “Composite cores: pushing heterogeneity within a core”