A Unified View of Non-monotonic Core Selection and Application Steering in Heterogeneous Chip Multiprocessors
Sandeep Navada, Niket K. Choudhary, Salil Wadhavkar, Eric Rotenberg Department of Electrical and Computer Engineering North Carolina State University Sandeep Navada © 2013 1
Single-ISA HCMP
• Same ISA
• Different microarchitectures
  – Superscalar width
  – Structure sizes
  – Frequency
• Cores have different performance and power
• New run-time optimization lever
Monotonic HCMP
• Cores can be ranked independent of application
• Core 1 is faster than Core 2 for any application
[Figure: Core 1 and Core 2 compared across applications A, B, C, D]
Monotonic HCMP example
HCMP literature
• Focus
  – Monotonic cores
  – Cores are preordained
  – Scheduling
• Single thread
  – Minimize energy for a given performance degradation threshold w.r.t. the highest-ranked core
• Multiple threads
  – Maximize throughput/Watt/mm²
Going beyond monotonic HCMP
• Cores can’t be ranked independent of application
• Cores are designed from the ground up, not pre-existing
[Figure: Core 1 and Core 2 compared across applications A, B, C, D]
Non-monotonic HCMP
• High-contention scenario (optimize throughput): Kumar, et al., Core Architecture Optimization for Single ISA Heterogeneous Multiprocessors
• Low-contention scenario (optimize latency): our work
Optimize latency
Performance = IPC × frequency
Complexity↑ => IPC↑, frequency↓
[Figure: IPC, frequency, and performance versus complexity, shown for App A and App B]
This tradeoff plays out differently for different apps and depends on the ILP characteristics of the app.
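The relation Performance = IPC × frequency can be made concrete with a small sketch. With the clock period in nanoseconds, the frequency in GHz is its reciprocal, so BIPS = IPC / clock period. The two cores and all IPC values below are hypothetical, chosen only to illustrate how a more complex, slower-clocked core can win for one app and lose for another; they are not measurements from the talk.

```python
def bips(ipc: float, clock_period_ns: float) -> float:
    """Billions of instructions per second: IPC x frequency (GHz)."""
    return ipc / clock_period_ns


# Hypothetical cores and IPC values (assumptions, not data from the talk).
simple_core = {"clock_period_ns": 0.5, "ipc": {"app_a": 1.0, "app_b": 0.9}}
complex_core = {"clock_period_ns": 0.7, "ipc": {"app_a": 2.0, "app_b": 1.0}}

for app in ("app_a", "app_b"):
    simple = bips(simple_core["ipc"][app], simple_core["clock_period_ns"])
    complex_ = bips(complex_core["ipc"][app], complex_core["clock_period_ns"])
    winner = "complex" if complex_ > simple else "simple"
    print(f"{app}: simple={simple:.2f} BIPS, complex={complex_:.2f} BIPS -> {winner}")
```

In this made-up case app_a gains enough IPC from the complex core to offset the slower clock, while app_b does not.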
Non-monotonic HCMP challenges
Core Selection
How to pick the core types comprising the heterogeneous design?
Application Steering
How to steer the applications to the best core?
CORE SELECTION
Core design space
Parameter | Value range | Number of values
Front end width | 2, 3, 4, 5, 6, 7, 8 | 7
Issue width | 2, 3, 4, 5, 6, 7, 8 | 7
Physical register file size | 64, 128, 192, 256, 384, 512 | 6
Issue queue size | 16, 24, 32, 48, 64, 96, 128 | 7
Load queue/Store queue size | 8/8, 16/16, 24/24, 32/32, 40/40, 48/48, 56/56, 64/64 | 8
L1 I$ size | 8, 16, 32, 64, 128 KB | 5
L1 D$ size | 8, 16, 32, 64, 128 KB | 5
L2$ size | 2 MB | 1
Clock period | 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2 ns | 8
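As a rough sketch of how large this space is before pruning, the per-parameter counts from the table can be multiplied together. This assumes every parameter is chosen independently, which the slide does not state explicitly; the dictionary keys are illustrative names, not identifiers from the authors' tools.

```python
from math import prod

# Number of candidate values per parameter, taken from the table's last column.
values_per_parameter = {
    "front_end_width": 7,
    "issue_width": 7,
    "physical_register_file": 6,
    "issue_queue": 7,
    "load_store_queue": 8,
    "l1_icache": 5,
    "l1_dcache": 5,
    "l2_cache": 1,
    "clock_period": 8,
}

# Assumption: parameters vary independently, so the counts multiply.
design_points = prod(values_per_parameter.values())
print(design_points)  # 3292800
```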
Core selection
[Flow diagram. Recoverable labels:]
– Inputs: core design space, SPEC benchmarks
– Tools: pruning script, SimPoint tool, FabScalar toolset (IPC, frequency, power)
– Intermediate data: pruned design space, phases, performance of every phase on every design point (the labels "39" and "10M" also appear here)
– Searches: N=1, N=2, N=3, and N=4 HCMP, yielding the optimal 1-, 2-, 3-, and 4-core-type HCMP
N: number of core types
Search for Optimal 4-core-type HCMP
BIPS of each phase on each core type:
Core type | Phase 1 | Phase 2
A | 1.5 | 0.5
B | 3.2 | 2.3
C | 1.3 | 2.5
D | 2.2 | 1.9
E | 1.6 | 3.1
F | 1.7 | 1.8
G | 1.3 | 2.0
H | 2.0 | 1.2

Candidate combinations:
Core 1 | Core 2 | Core 3 | Core 4 | Performance
A | B | C | D | HMEAN(3.2, 2.5) = 2.81
E | B | C | D | HMEAN(3.2, 3.1) = 3.15
A | F | C | D | HMEAN(2.2, 2.5) = 2.34
E | F | C | D | HMEAN(2.2, 3.1) = 2.57
E | F | G | H | HMEAN(2.0, 3.1) = 2.43
…
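The search illustrated above can be sketched as an exhaustive scan. The scoring rule is read off the slide's example: each phase runs on whichever of the chosen core types gives it the highest BIPS, and the combination's performance is the harmonic mean across phases. Equal weighting of phases and the function names `hcmp_score` and `search` are assumptions made for this sketch.

```python
from itertools import combinations
from statistics import harmonic_mean

# BIPS of (phase 1, phase 2) on each core type, copied from the slide's example.
BIPS = {
    "A": (1.5, 0.5), "B": (3.2, 2.3), "C": (1.3, 2.5), "D": (2.2, 1.9),
    "E": (1.6, 3.1), "F": (1.7, 1.8), "G": (1.3, 2.0), "H": (2.0, 1.2),
}


def hcmp_score(core_types):
    """Harmonic mean over phases of the best BIPS among the chosen core types."""
    num_phases = len(next(iter(BIPS.values())))
    best_per_phase = [max(BIPS[c][p] for c in core_types) for p in range(num_phases)]
    return harmonic_mean(best_per_phase)


def search(n):
    """Score every n-core-type combination and return one with the top score."""
    return max(combinations(sorted(BIPS), n), key=hcmp_score)


print(round(hcmp_score(("A", "B", "C", "D")), 2))  # 2.81
print(round(hcmp_score(("E", "F", "G", "H")), 2))  # 2.43
best = search(4)
print(best, round(hcmp_score(best), 2))
```

In this toy table every combination that contains both B and E ties for the best score, so the search returns just one of them.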
Kiviat diagram
• Visualize core parameters
[Kiviat diagram with three axes: Frequency (higher frequency), Width (increased superscalar width), Window (larger structures)]
Optimal 1-core-type HCMP
[Kiviat diagram: core A on the Frequency, Width, and Window axes]
“A” core is an average core that strikes a good balance between IPC and frequency.
Optimal 2-core-type HCMP
[Kiviat diagram: cores A and LW on the Frequency, Width, and Window axes]
“A” core is still selected!
“LW” (larger, wider) core targets window and width bottlenecks in “A” core.
Optimal 3-core-type HCMP
[Kiviat diagram: cores A, LW, and N on the Frequency, Width, and Window axes]
“A” core is still selected!!
“LW” core is still selected.
“N” core targets the frequency bottleneck.
Optimal 4-core-type HCMP
[Kiviat diagram: cores A, L, W, and N on the Frequency, Width, and Window axes]
“A” and “N” are selected, again.
“LW” got split into “L” and “W”.
LW split
[Kiviat diagram: cores A, LW, L, and W on the Frequency, Width, and Window axes]
Optimal HCMP
Core type | Clock period (ns) | ILP-extracting buffers | Widths | Caches (KB)
A | 0.6 | 32, 128, 128 | 3, 4 | 64, 64
N | 0.5 | 32, 64, 64 | 2, 2 | 16, 16
L | 0.7 | 48, 128, 384 | 4, 4 | 128, 128
W | 0.7 | 32, 128, 128 | 6, 6 | 128, 32
The optimal HCMP consists of:
1. An average core, which is the best homogeneous core
2. Accelerator cores that relieve distinct bottlenecks in the average core
APPLICATION STEERING
Bottleneck-driven steering
• Application is continuously diagnosed for bottlenecks on the current core using performance counters
• Migrate to a different core when bottlenecks change
  – To an accelerator core that relieves any diagnosed bottleneck and doesn’t worsen any diagnosed bottleneck
  – To the average core if no accelerator meets this condition, or if there are no bottlenecks
Bottleneck-driven steering
Track performance counters → Diagnose bottlenecks → Steer phase
Track performance counters
Counter | Description
Width_ctr | Ready instruction not issued due to limited issue width.
Window_ctr | Instruction not dispatched due to issue queue or reorder buffer full.
I$_ctr | Instruction stalled due to instruction cache miss.
D$_ctr | Load instruction stalled due to data cache miss.
Misp_ctr | Mispredicted branch.
L2_ctr | Instruction stalled due to L2 cache miss.
Cycle_ctr | Number of cycles.
Diagnose bottlenecks
• Every 10K instructions, evaluate bottlenecks using performance counters and thresholds
• Performance counters are normalized with respect to the cycle count
• If the normalized performance counter value is above threshold, then the corresponding resource is a bottleneck
Diagnose bottlenecks
Bottleneck | Expression
bool Width | Width = (Width_ctr > Width_thresh)
bool Window | Window = (Window_ctr > Window_thresh)
bool Frequency | Frequency = (Misp_ctr > Misp_thresh) || (L2_ctr > L2_thresh)
bool I$ | I$ = (I$_ctr > I$_thresh)
bool D$ | D$ = (D$_ctr > D$_thresh)
Thresholds are determined empirically using a training process.
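A minimal sketch of the diagnosis step, assuming the counters are sampled once per 10K-instruction interval and normalized by the cycle count before comparison. The `Counters` class, the `diagnose` function, and every threshold value are hypothetical; the talk only says that the real thresholds come from a training process.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    """Event counts for one interval; field names are illustrative."""
    width: int
    window: int
    icache: int
    dcache: int
    misp: int
    l2: int
    cycles: int


# Hypothetical thresholds in events per cycle (placeholders, not trained values).
THRESHOLDS = {"width": 0.20, "window": 0.20, "icache": 0.05,
              "dcache": 0.10, "misp": 0.01, "l2": 0.02}


def diagnose(c: Counters, thresholds=THRESHOLDS) -> dict:
    """Flag each resource whose cycle-normalized counter exceeds its threshold."""
    def over(name: str) -> bool:
        return getattr(c, name) / c.cycles > thresholds[name]

    return {
        "Width": over("width"),
        "Window": over("window"),
        "Frequency": over("misp") or over("l2"),
        "I$": over("icache"),
        "D$": over("dcache"),
    }


sample = Counters(width=3000, window=500, icache=100, dcache=200,
                  misp=50, l2=10, cycles=8000)
print(diagnose(sample))
```

With these made-up numbers only the Width flag is set, since 3000 / 8000 is the only ratio above its placeholder threshold.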
Steer phase
Core | Bottlenecks relieved | Bottlenecks worsened
W | Width | Frequency
L | Window | Frequency
N | Frequency | Width, Window
A | n/a | n/a

Steering logic:
  if (Width && !Frequency) W
  else if (Window && !Frequency) L
  else if (Frequency && !(Width || Window)) N
  else A
Paper shows full steering logic with I$ and D$ bottlenecks included.
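The steering rule can also be written as a table lookup: pick the first accelerator that relieves at least one diagnosed bottleneck and worsens none, otherwise fall back to the average core. The sketch below covers only the Width, Window, and Frequency bottlenecks from the slide; the priority order W, L, N is an assumption taken from the order of the slide's if/else chain, and the function names are illustrative.

```python
from itertools import product

# (core, bottlenecks relieved, bottlenecks worsened), in assumed priority order.
ACCELERATORS = [
    ("W", {"Width"}, {"Frequency"}),
    ("L", {"Window"}, {"Frequency"}),
    ("N", {"Frequency"}, {"Width", "Window"}),
]


def steer(diagnosed: set) -> str:
    """Return the first accelerator that relieves a diagnosed bottleneck
    without worsening any diagnosed bottleneck, else the average core A."""
    for core, relieved, worsened in ACCELERATORS:
        if relieved & diagnosed and not (worsened & diagnosed):
            return core
    return "A"


def steer_slide(width: bool, window: bool, frequency: bool) -> str:
    """The if/else chain exactly as written on the slide."""
    if width and not frequency:
        return "W"
    if window and not frequency:
        return "L"
    if frequency and not (width or window):
        return "N"
    return "A"


# Check that both formulations agree on all eight bottleneck combinations.
for width, window, frequency in product([False, True], repeat=3):
    flags = [("Width", width), ("Window", window), ("Frequency", frequency)]
    diagnosed = {name for name, flag in flags if flag}
    assert steer(diagnosed) == steer_slide(width, window, frequency)
```

The table-driven form makes it easier to add rows for further core types, though the paper's full logic with I$ and D$ may differ from this simplification.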
RESULTS
Methodology
• Benchmarks: SPEC 2000
  – Simulate first 4 billion instructions
• Metrics
  – Performance: BIPS
  – Efficiency: BIPS³/Watt
• Migration overhead
  – Default: 100 cycles
  – Sensitivity study: 1K, 10K cycles
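The efficiency metric weights performance cubically against power. A small sketch of the formula follows; the function name and the sample inputs are illustrative, not values from the talk.

```python
def efficiency(bips: float, watts: float) -> float:
    """BIPS^3 / Watt: performance counts cubically, power linearly."""
    return bips ** 3 / watts


# Doubling performance at eight times the power leaves the metric unchanged.
print(efficiency(1.0, 1.0))  # 1.0
print(efficiency(2.0, 8.0))  # 1.0
```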
Steering algorithms
Algorithm | Description
Baseline | Run the entire 4B instructions on the average core
Sampling | Run on each core type for the sampling interval and then on the best core type for the switching interval
Bottleneck | Run current 10K instruction segment based on the bottlenecks of the prior 10K segment
Optimal | Run every 10K instruction segment on the best core type of the prior 10K segment
Oracle | Run every 10K instruction segment on the best core type
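The difference between the "Optimal" and "Oracle" rows can be sketched on a made-up trace. All per-segment BIPS values below are hypothetical, migration overhead is ignored, and starting the prior-segment policy on the average core A is an assumption. Because segments have equal instruction counts, overall BIPS is the harmonic mean of the per-segment values.

```python
from statistics import harmonic_mean

# Hypothetical BIPS of four consecutive 10K-instruction segments per core type.
SEGMENT_BIPS = [
    {"A": 2.0, "L": 2.4, "W": 2.1, "N": 1.8},
    {"A": 2.0, "L": 2.5, "W": 2.0, "N": 1.7},
    {"A": 1.9, "L": 1.6, "W": 1.7, "N": 2.6},
    {"A": 2.1, "L": 1.5, "W": 1.8, "N": 2.7},
]


def best_core(segment: dict) -> str:
    return max(segment, key=segment.get)


def oracle_schedule(segments):
    """Oracle: every segment runs on its own best core type."""
    return [best_core(s) for s in segments]


def prior_best_schedule(segments, start="A"):
    """'Optimal' row: each segment runs on the prior segment's best core type."""
    return [start] + [best_core(prev) for prev in segments[:-1]]


def achieved_bips(segments, schedule):
    """Overall BIPS for equal-length segments (migration overhead ignored)."""
    return harmonic_mean([s[c] for s, c in zip(segments, schedule)])


for name, schedule in [("oracle", oracle_schedule(SEGMENT_BIPS)),
                       ("prior-best", prior_best_schedule(SEGMENT_BIPS))]:
    print(name, schedule, round(achieved_bips(SEGMENT_BIPS, schedule), 2))
```

The prior-segment policy loses ground only at the start and at the phase change between the second and third segments, where the previous segment's best core is no longer the right choice.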
4-core-type HCMP
• 4-core-type HCMP outperforms homogeneous CMP by up to 76%, and by 15% on average
• Our steering algorithm is able to capture most of this gain
Sampling vs. bottleneck steering
Occupancy
Occupancy pattern varies dramatically across different applications.
Efficiency
• Sampling performs 25% better than the average core
• Bottleneck steering performs 33% better than the average core
SUMMARY
Summary
• First proposal to architect and orchestrate multiple core types for latency reduction.
• With N core types, the optimal HCMP consists of an average core type coupled with N-1 accelerator core types.
• In the complementary steering algorithm, the application is continuously diagnosed for bottlenecks and is migrated to the core type that relieves the bottlenecks.
Future work
• HCMPs open up a whole new direction of microarchitecture research.
• Many microarchitecture optimizations don’t provide universal benefits.
• As each core-type targets a narrow workload space, HCMP provides a great platform to reconsider these optimizations.