A Unified View of Non-monotonic Core Selection and Application


A Unified View of Non-monotonic Core Selection and Application Steering in Heterogeneous Chip Multiprocessors

Single-ISA HCMP

Monotonic HCMP

• Cores can be ranked independent of application
• Core 1 is faster than Core 2 for any application

Monotonic HCMP example

HCMP literature

• Focus
  – Monotonic cores
  – Cores are preordained
  – Scheduling
• Single thread
  – Minimize energy for a given performance degradation threshold w.r.t. the highest-ranked core
• Multiple threads
  – Maximize throughput/Watt/mm²

Sandeep Navada © 2013

Going beyond monotonic HCMP


Non-monotonic HCMP

• High-contention scenario (optimize throughput): Kumar, et al., Core Architecture Optimization for Single ISA Heterogeneous Multiprocessors
• Low-contention scenario (optimize latency): our work

Optimize latency

Performance = IPC × frequency
Complexity↑ => IPC↑, frequency↓

[Figure: App A and App B]
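The trade-off can be illustrated with a minimal sketch. The IPC values and the two core names below are hypothetical, chosen only to show how a more complex core (higher IPC, lower frequency) can win for one application and lose for another; they are not taken from the slides.

```python
def bips(ipc: float, clock_period_ns: float) -> float:
    """Billions of instructions per second = IPC x frequency (GHz)."""
    frequency_ghz = 1.0 / clock_period_ns
    return ipc * frequency_ghz

# Hypothetical cores: name -> clock period in ns (assumption, not slide data).
clock_period = {"complex": 1.0, "simple": 0.5}

# Hypothetical IPC of each application on each core (assumption).
ipc = {
    "App A": {"complex": 2.4, "simple": 1.0},
    "App B": {"complex": 1.1, "simple": 0.9},
}

# Pick, per application, the core with the highest BIPS.
best_core = {
    app: max(clock_period, key=lambda core: bips(per_core[core], clock_period[core]))
    for app, per_core in ipc.items()
}
print(best_core)
```

With these made-up numbers the two applications prefer different cores, so the cores cannot be ranked independent of application.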

Non-monotonic HCMP challenges

Core Selection

How to pick the core types comprising the heterogeneous design?

Application Steering

How to steer the applications to the best core?

CORE SELECTION

Core design space

Parameter                        Value range                                                 Number
Front end width                  2, 3, 4, 5, 6, 7, 8                                         7
Issue width                      2, 3, 4, 5, 6, 7, 8                                         7
Physical register file size      64, 128, 192, 256, 384, 512                                 6
Issue queue size                 16, 24, 32, 48, 64, 96, 128                                 7
Load queue/Store queue size      8/8, 16/16, 24/24, 32/32, 40/40, 48/48, 56/56, 64/64        8
L1 I$ size                       8, 16, 32, 64, 128 KB                                       5
L1 D$ size                       8, 16, 32, 64, 128 KB                                       5
L2$ size                         2 MB                                                        1
Clock period                     0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2 ns                   8
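As a rough size check, the sketch below multiplies the number of values per parameter. It assumes every combination counts as a design point before pruning, which the slides do not state; the dictionary keys are illustrative names.

```python
from math import prod

# Value ranges as listed in the core design space table.
design_space = {
    "front_end_width": [2, 3, 4, 5, 6, 7, 8],
    "issue_width": [2, 3, 4, 5, 6, 7, 8],
    "phys_reg_file_size": [64, 128, 192, 256, 384, 512],
    "issue_queue_size": [16, 24, 32, 48, 64, 96, 128],
    "lq_sq_size": [8, 16, 24, 32, 40, 48, 56, 64],  # LQ and SQ are sized together
    "l1_icache_kb": [8, 16, 32, 64, 128],
    "l1_dcache_kb": [8, 16, 32, 64, 128],
    "l2_cache_mb": [2],
    "clock_period_ns": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2],
}

# Number of choices per parameter, and the size of the full cross product.
counts = {name: len(values) for name, values in design_space.items()}
total_points = prod(counts.values())
print(counts)
print(total_points)
```

The cross product is in the millions, which suggests why a pruning step precedes the search.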

Core selection

[Figure: core selection flow, where N is the number of core types. Elements: core design space, pruning script, pruned design space; SPEC benchmarks, SimPoint tool, phases; FabScalar toolset supplying IPC, frequency, and power; performance of every phase on every design point; searches for N = 1, 2, 3, and 4, each yielding the optimal N-core-type HCMP. Other figure labels: "39", "10M phases".]

BIPS of each phase on each core type

Phase    A      B      C      D      E      F
1        1.5    3.2    1.3    2.2    1.6    1.7
2        0.5    2.3    2.5    1.9    3.1    1.8

[Columns G and H: surviving entries 1.3, 2.0, 1.2; layout not recoverable]

Search for Optimal 4-core-type HCMP

Core 1   Core 2   Core 3   Core 4   Performance
A        B        C        D        HMEAN(3.2, 2.5) = 2.81
E        B        C        D        HMEAN(3.2, 3.1) = 3.15
A        F        C        D        HMEAN(2.2, 2.5) = 2.34
E        F        C        D        HMEAN(2.2, 3.1) = 2.57
E        F        G        H        HMEAN(2.0, 3.1) = 2.43
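The search in this example can be sketched as an exhaustive enumeration: each phase runs on whichever selected core type gives it the highest BIPS, and a candidate HCMP is scored by the harmonic mean across phases. This is a minimal sketch, not the authors' tool. It uses only the BIPS entries for core types A through F as read from the example (the G and H entries are left out because they are not fully legible), and the function names are illustrative.

```python
from itertools import combinations
from statistics import harmonic_mean

# BIPS per core type as (phase 1, phase 2), read from the example table.
bips = {
    "A": (1.5, 0.5), "B": (3.2, 2.3), "C": (1.3, 2.5),
    "D": (2.2, 1.9), "E": (1.6, 3.1), "F": (1.7, 1.8),
}

def hcmp_performance(core_types):
    """Harmonic mean over phases of the best BIPS among the chosen core types."""
    num_phases = len(next(iter(bips.values())))
    best_per_phase = [max(bips[c][p] for c in core_types) for p in range(num_phases)]
    return harmonic_mean(best_per_phase)

def search(n):
    """Try every n-core-type combination and return the best-scoring one."""
    return max(combinations(sorted(bips), n), key=hcmp_performance)

print(round(hcmp_performance(("A", "B", "C", "D")), 2))
best = search(4)
print(best, round(hcmp_performance(best), 2))
```

The first score reproduces the HMEAN(3.2, 2.5) row of the example. Exhaustive enumeration is workable for six candidates; a real design space would need the pruned set as input.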

Kiviat diagram

Optimal 1-core-type HCMP

[Kiviat diagram: axes Frequency, Width, Window; core A]

The “A” core is an average core that strikes a good balance between IPC and frequency.

Optimal 2-core-type HCMP

[Kiviat diagram: axes Frequency, Width, Window; cores A and LW (larger, wider)]

The “LW” core targets the window and width bottlenecks of the “A” core.

Optimal 3-core-type HCMP

[Kiviat diagram: axes Frequency, Width, Window]

The “N” core targets the frequency bottleneck.

Optimal 4-core-type HCMP

[Kiviat diagram: axes Frequency, Width, Window; cores A, L, W, N]

• “A” and “N” are selected again.
• “LW” is split into “L” and “W”.

LW split

[Kiviat diagram: axes Frequency, Window; cores A, LW, L, W]

Optimal HCMP

Core type    ILP-extracting buffers    Widths    Caches
A            32, 128, 128              3, 4      64, 64
N            32, 64, 64                2, 2      16, 16
L            48, 128, 384              4, 4      128, 128
W            32, 128, 128              6, 6      128, 32

Clock period: 0.6, 0.5, 0.7 [one of the four values is missing from the slide text]

The optimal HCMP consists of:
1. An average core, which is the best homogeneous core
2. Accelerator cores that relieve distinct bottlenecks in the average core

APPLICATION STEERING

Bottleneck-driven steering

• The application is continuously diagnosed for bottlenecks on the current core using performance counters
• Migrate to a different core when the bottlenecks change
  – To an accelerator core that relieves any diagnosed bottleneck and doesn’t worsen any diagnosed bottleneck
  – To the average core if no accelerator meets this condition, or if there are no bottlenecks

Bottleneck-driven steering

Track performance counters

Counter       Description
Width_ctr     Ready instruction not issued due to limited issue width.
Window_ctr    Instruction not dispatched due to issue queue or reorder buffer full.
I$_ctr        Instruction stalled due to instruction cache miss.
D$_ctr        Load instruction stalled due to data cache miss.
Misp_ctr      Mispredicted branch.
L2_ctr        Instruction stalled due to L2 cache miss.
Cycle_ctr     Number of cycles.

Diagnose bottlenecks

• Every 10K instructions, evaluate bottlenecks using performance counters and thresholds
• Performance counters are normalized with respect to the cycle count
• If the normalized performance counter value is above its threshold, the corresponding resource is a bottleneck

Diagnose bottlenecks

Bottleneck        Expression
bool Width        Width = (Width_ctr > Width_thresh)
bool Window       Window = (Window_ctr > Window_thresh)
bool Frequency    Frequency = (Misp_ctr > Misp_thresh) || (L2_ctr > L2_thresh)
bool I$           I$ = (I$_ctr > I$_thresh)
bool D$           D$ = (D$_ctr > D$_thresh)
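The diagnosis step can be sketched as follows: counters sampled over an interval are divided by the cycle count and compared against per-counter thresholds. The threshold values and sample counts are placeholders (the slides give none), and the `Bottlenecks` class and `diagnose` function are illustrative names, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class Bottlenecks:
    width: bool
    window: bool
    frequency: bool
    icache: bool
    dcache: bool

def diagnose(counters: dict, thresholds: dict) -> Bottlenecks:
    """Normalize each counter by the cycle count and compare to its threshold."""
    cycles = counters["Cycle_ctr"]  # assumed to be non-zero for a sampled interval
    normalized = {k: v / cycles for k, v in counters.items() if k != "Cycle_ctr"}

    def over(name: str) -> bool:
        return normalized[name] > thresholds[name]

    return Bottlenecks(
        width=over("Width_ctr"),
        window=over("Window_ctr"),
        # Frequency is flagged by either mispredictions or L2 misses.
        frequency=over("Misp_ctr") or over("L2_ctr"),
        icache=over("I$_ctr"),
        dcache=over("D$_ctr"),
    )

# Placeholder thresholds and one made-up sample interval (assumptions).
thresholds = {"Width_ctr": 0.25, "Window_ctr": 0.25, "I$_ctr": 0.05,
              "D$_ctr": 0.05, "Misp_ctr": 0.01, "L2_ctr": 0.01}
sample = {"Width_ctr": 4000, "Window_ctr": 500, "I$_ctr": 100, "D$_ctr": 200,
          "Misp_ctr": 50, "L2_ctr": 20, "Cycle_ctr": 8000}
result = diagnose(sample, thresholds)
print(result)
```

In this made-up interval only the width counter exceeds its threshold, so width is the single diagnosed bottleneck.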

Steer phase

Core    Bottlenecks relieved    Bottlenecks worsened
W       Width                   Frequency
L       Window                  Frequency
N       Frequency               Width, Window
A       n/a                     n/a

Steering logic:
  if (Width && !Frequency) W
  else if (Window && !Frequency) L
  else if (Frequency && !(Width || Window)) N
  else A

Paper shows full steering logic with I$ and D$ bottlenecks included.
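The simplified steering logic on this slide translates directly into a small function. This sketch covers only the Width, Window, and Frequency flags shown here; the I$ and D$ cases from the paper are not reproduced, and the function name is illustrative.

```python
def steer(width: bool, window: bool, frequency: bool) -> str:
    """Return the core type (W, L, N, or A) for the diagnosed bottlenecks."""
    if width and not frequency:
        return "W"  # wide core relieves width, but would worsen frequency
    if window and not frequency:
        return "L"  # large-window core relieves window, but would worsen frequency
    if frequency and not (width or window):
        return "N"  # N core relieves frequency, but would worsen width and window
    return "A"      # no bottleneck, or no accelerator avoids worsening one

# Enumerate every combination of the three flags.
for width in (False, True):
    for window in (False, True):
        for frequency in (False, True):
            print(width, window, frequency, "->", steer(width, window, frequency))
```

Two consequences of the ordering: when both Width and Window are flagged without Frequency, W is chosen because its test comes first, and any mix of Frequency with Width or Window falls through to the average core.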

RESULTS

Methodology

• Benchmarks: SPEC 2000
  – Simulate first 4 billion instructions
• Metrics
  – Performance: BIPS
  – Efficiency: BIPS³/Watt
• Migration overhead
  – Default: 100 cycles
  – Sensitivity study: 1K, 10K cycles

Steering algorithms

Algorithm     Description
Baseline      Run the entire 4B instructions on the average core
Sampling      Run on each core type for the sampling interval and then on the best core type for the switching interval
Bottleneck    Run the current 10K-instruction segment based on the bottlenecks of the prior 10K segment
Optimal       Run every 10K-instruction segment on the best core type of the prior 10K segment
Oracle        Run every 10K-instruction segment on the best core type
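To make the difference between these policies concrete, the sketch below compares three of them on a made-up trace: staying on the average core, using the best core type of the prior segment, and using the best core type of the current segment (oracle). The per-segment BIPS values are hypothetical, only two core types are modeled, and migration overhead is ignored; none of this comes from the slides' results.

```python
def oracle_choices(trace):
    """Best core type for each segment, known in advance."""
    return [max(segment, key=segment.get) for segment in trace]

def prior_best_choices(trace, start_core):
    """Each segment runs on the core type that was best for the prior segment."""
    choices = [start_core]
    for previous in trace[:-1]:
        choices.append(max(previous, key=previous.get))
    return choices

def overall_bips(trace, choices):
    """Equal-length segments: overall rate is the harmonic mean of segment rates."""
    return len(trace) / sum(1.0 / segment[core] for segment, core in zip(trace, choices))

# Hypothetical BIPS per 10K-instruction segment on core types A and N (assumption).
trace = [{"A": 2.0, "N": 1.0}, {"A": 2.0, "N": 1.0},
         {"A": 1.0, "N": 2.0}, {"A": 1.0, "N": 2.0}]

baseline = overall_bips(trace, ["A"] * len(trace))
prior = overall_bips(trace, prior_best_choices(trace, "A"))
oracle = overall_bips(trace, oracle_choices(trace))
print(baseline, prior, oracle)
```

In this toy trace the prior-segment policy loses one segment at the phase change, which is the gap the oracle closes.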

4-core-type HCMP

Sampling vs. bottleneck steering

Occupancy

Efficiency

SUMMARY

Summary

• First proposal to architect and orchestrate multiple core types for latency reduction.

• With N core types, the optimal HCMP consists of an average core type coupled with N-1 accelerator core types.

• In the complementary steering algorithm, the application is continuously diagnosed for bottlenecks and is migrated to the core type that relieves the bottlenecks.

Future work

• HCMPs open up a whole new direction of microarchitecture research.

• Many microarchitecture optimizations don’t provide universal benefits.

• As each core-type targets a narrow workload space, HCMP provides a great platform to reconsider these optimizations.
