
A Unified View of Non-monotonic Core Selection and Application Steering in Heterogeneous Chip Multiprocessors

Sandeep Navada, Niket K. Choudhary, Salil Wadhavkar, Eric Rotenberg
Department of Electrical and Computer Engineering
North Carolina State University
Sandeep Navada © 2013

Single-ISA HCMP

• Same ISA
• Different microarchitectures
  – Superscalar width
  – Structure sizes
  – Frequency
• Cores have different performance and power
• New run-time optimization lever

Monotonic HCMP

• Cores can be ranked independent of application
• Core 1 faster than Core 2 for any application

[Figure: Core 1 and Core 2 compared across applications A, B, C, D]

Monotonic HCMP example


HCMP literature

• Focus
  – Monotonic cores
  – Cores are preordained
  – Scheduling
• Single thread
  – Minimize energy for given performance degradation threshold w.r.t. highest ranked core
• Multiple threads
  – Maximize throughput/Watt/mm²

Going beyond monotonic HCMP

• Cores can’t be ranked independent of application
• Cores designed from ground-up, not pre-existing

[Figure: Core 1 and Core 2 compared across applications A, B, C, D]

Non-monotonic HCMP

• High-contention scenario (Optimize throughput): Kumar, et al., Core Architecture Optimization for Single ISA Heterogeneous Multiprocessors
• Low-contention scenario (Optimize latency): Our work

Optimize latency

Performance = IPC × frequency
Complexity↑ => IPC↑, frequency↓

[Figure: IPC, frequency, and performance versus complexity, one panel each for App A and App B]

This tradeoff plays out differently for different apps and is dependent on the ILP characteristics of the app.
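As a rough illustration of this tradeoff, the sketch below evaluates Performance = IPC × frequency for two applications on a simple and a complex core. All IPC values, clock periods, and the core names "simple" and "complex" are hypothetical and chosen only to show how the winner can flip between applications; they are not results from the talk.

```python
# Hypothetical numbers for illustration only; not results from the talk.
# BIPS = IPC x frequency (GHz) = IPC / clock period (ns).

def bips(ipc: float, clock_period_ns: float) -> float:
    """Billions of instructions per second for a given IPC and clock period."""
    return ipc / clock_period_ns

# Assumed clock periods: the more complex core has the longer cycle.
CLOCK_PERIOD_NS = {"simple": 0.5, "complex": 0.7}

# Assumed IPC of each app on each core. App A has little ILP to extract,
# so added complexity barely raises its IPC; App B has plenty.
IPC = {
    "App A": {"simple": 1.0, "complex": 1.2},
    "App B": {"simple": 1.0, "complex": 2.0},
}

def best_core(app: str) -> str:
    """Core with the highest BIPS for the given app."""
    return max(CLOCK_PERIOD_NS,
               key=lambda core: bips(IPC[app][core], CLOCK_PERIOD_NS[core]))

for app in IPC:
    scores = {core: round(bips(IPC[app][core], CLOCK_PERIOD_NS[core]), 2)
              for core in CLOCK_PERIOD_NS}
    print(app, scores, "->", best_core(app))
```

With these assumed numbers, App A runs fastest on the simple core and App B on the complex core, so neither core can be ranked above the other independent of the application.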

Non-monotonic HCMP challenges

Core Selection

How to pick the core types comprising the heterogeneous design?


Application Steering

How to steer the applications to the best core?


CORE SELECTION


Core design space

Parameter                      Value Range                                             Number
Front end width                2, 3, 4, 5, 6, 7, 8                                     7
Issue width                    2, 3, 4, 5, 6, 7, 8                                     7
Physical register file size    64, 128, 192, 256, 384, 512                             6
Issue queue size               16, 24, 32, 48, 64, 96, 128                             7
Load queue/Store queue size    8/8, 16/16, 24/24, 32/32, 40/40, 48/48, 56/56, 64/64    8
L1 I$ size                     8, 16, 32, 64, 128 KB                                   5
L1 D$ size                     8, 16, 32, 64, 128 KB                                   5
L2$ size                       2 MB                                                    1
Clock period                   0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2 ns               8
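The size of the unpruned design space follows from the per-parameter counts. The sketch below multiplies them, assuming every combination of parameter values counts as a candidate design point; the slides do not say whether some combinations are excluded before the pruning step.

```python
from math import prod

# Number of values per parameter, copied from the "Number" column.
VALUES_PER_PARAMETER = {
    "Front end width": 7,
    "Issue width": 7,
    "Physical register file size": 6,
    "Issue queue size": 7,
    "Load queue/Store queue size": 8,
    "L1 I$ size": 5,
    "L1 D$ size": 5,
    "L2$ size": 1,
    "Clock period": 8,
}

# Assumption: the parameters vary independently, so the counts multiply.
design_points = prod(VALUES_PER_PARAMETER.values())
print(design_points)  # 3292800
```

Under that assumption there are roughly 3.3 million design points, which suggests why a pruning step precedes the search.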

Core selection

[Figure: core selection flow]
• Core design space → Pruning script → Pruned design space
• SPEC bench → SimPoint tool → 39 10M phases
• FabScalar toolset → IPC, freq, power → Performance of every phase on every design point
• Search N=1 HCMP → Optimal 1-core-type HCMP
• Search N=2 HCMP → Optimal 2-core-type HCMP
• Search N=3 HCMP → Optimal 3-core-type HCMP
• Search N=4 HCMP → Optimal 4-core-type HCMP
N: Number of core types

Search for Optimal 4-core-type HCMP

BIPS      Core Types
Phases    A     B     C     D     E     F     G     H
1         1.5   3.2   1.3   2.2   1.6   1.7   1.3   2.0
2         0.5   2.3   2.5   1.9   3.1   1.8   2.0   1.2

Core 1   Core 2   Core 3   Core 4   Performance
A        B        C        D        HMEAN(3.2, 2.5) = 2.81
E        B        C        D        HMEAN(3.2, 3.1) = 3.15
A        F        C        D        HMEAN(2.2, 2.5) = 2.34
E        F        C        D        HMEAN(2.2, 3.1) = 2.57
E        F        G        H        HMEAN(2.0, 3.1) = 2.43
…
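The search in this example can be sketched as an exhaustive enumeration of core-type combinations. The code assumes, as the HMEAN arguments on the slide suggest, that each phase runs on whichever selected core gives it the highest BIPS and that the combination's score is the harmonic mean of those per-phase maxima. The function names `hcmp_performance` and `search` are illustrative, and the real search runs over the pruned design space and many more phases than this two-phase toy table.

```python
from itertools import combinations
from statistics import harmonic_mean

# BIPS of each core type on phases 1 and 2, copied from the slide's table.
BIPS = {
    "A": (1.5, 0.5), "B": (3.2, 2.3), "C": (1.3, 2.5), "D": (2.2, 1.9),
    "E": (1.6, 3.1), "F": (1.7, 1.8), "G": (1.3, 2.0), "H": (2.0, 1.2),
}

def hcmp_performance(cores):
    """Harmonic mean over phases of the best BIPS any chosen core achieves."""
    n_phases = len(next(iter(BIPS.values())))
    best_per_phase = [max(BIPS[c][p] for c in cores) for p in range(n_phases)]
    return harmonic_mean(best_per_phase)

def search(n):
    """Score every n-core-type combination and return one with the top score."""
    return max(combinations(sorted(BIPS), n), key=hcmp_performance)

print(round(hcmp_performance(("A", "B", "C", "D")), 2))  # 2.81
print(round(hcmp_performance(("E", "F", "G", "H")), 2))  # 2.43
best = search(4)
print(best, round(hcmp_performance(best), 2))
```

In this toy table, any combination containing both B and E reaches the top score, so ties are common; `max` simply returns the first such combination it encounters.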

Kiviat diagram

• Visualize core parameters

[Figure: Kiviat diagram with three axes: Frequency (higher frequency), Width (increase superscalar width), Window (larger structures)]

Optimal 1-core-type HCMP

[Figure: Kiviat diagram (Frequency, Width, Window) showing core A]

“A” core is an average core which strikes a good balance between IPC and frequency.

Optimal 2-core-type HCMP

[Figure: Kiviat diagram (Frequency, Width, Window) showing cores A and LW]

“A” core is still selected!
“LW” (larger, wider) core targets window and width bottlenecks in “A” core.

Optimal 3-core-type HCMP

[Figure: Kiviat diagram (Frequency, Width, Window) showing cores A, LW, and N]

“A” core is still selected!!
“LW” core is still selected.
“N” core targets frequency bottleneck.

Optimal 4-core-type HCMP

[Figure: Kiviat diagram (Frequency, Width, Window) showing cores A, L, W, and N]

“A” and “N” are selected, again.
“LW” got split into “L” and “W”.

LW split

[Figure: Kiviat diagram (Frequency, Width, Window) showing cores A, LW, L, and W]

Optimal HCMP

Core Type   Clock Period (ns)   ILP-extracting buffers   Widths   Caches
A           0.6                 32, 128, 128             3, 4     64, 64
N           0.5                 32, 64, 64               2, 2     16, 16
L           0.7                 48, 128, 384             4, 4     128, 128
W           0.7                 32, 128, 128             6, 6     128, 32

The optimal HCMP consists of
1. Average core which is the best homogeneous core
2. Accelerator cores that relieve distinct bottlenecks in the average core

APPLICATION STEERING


Bottleneck-driven steering

• Application is continuously diagnosed for bottlenecks on the current core using performance counters
• Migrate to a different core when bottlenecks change
  – To an accelerator core that relieves any diagnosed bottleneck and doesn’t worsen any diagnosed bottleneck
  – To the average core if no accelerator meets this condition, or if there are no bottlenecks

Bottleneck-driven steering

Track performance counters → Diagnose bottlenecks → Steer phase

Track performance counters

Counter      Description
Width_ctr    Ready instruction not issued due to limited issue width.
Window_ctr   Instruction not dispatched due to issue queue or reorder buffer full.
I$_ctr       Instruction stalled due to instruction cache miss.
D$_ctr       Load instruction stalled due to data cache miss.
Misp_ctr     Mispredicted branch.
L2_ctr       Instruction stalled due to L2 cache miss.
Cycle_ctr    Number of cycles.

Diagnose bottlenecks

• Every 10K instructions, evaluate bottlenecks using performance counters and thresholds
• Performance counters are normalized with respect to the cycle count
• If the normalized performance counter value is above threshold, then the corresponding resource is a bottleneck

Diagnose bottlenecks

Bottleneck       Expression
bool Width       Width = (Width_ctr > Width_thresh)
bool Window      Window = (Window_ctr > Window_thresh)
bool Frequency   Frequency = (Misp_ctr > Misp_thresh) || (L2_ctr > L2_thresh)
bool I$          I$ = (I$_ctr > I$_thresh)
bool D$          D$ = (D$_ctr > D$_thresh)

Thresholds are determined empirically using a training process
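A minimal sketch of the diagnosis step, assuming the counters arrive as a dictionary keyed by the counter names on the slides. The threshold values, the `Bottlenecks` class, and the `diagnose` function are hypothetical; the talk states only that the real thresholds come from an empirical training process.

```python
from dataclasses import dataclass

@dataclass
class Bottlenecks:
    width: bool
    window: bool
    frequency: bool
    icache: bool
    dcache: bool

# Placeholder thresholds (events per cycle). These numbers are made up;
# the real ones are determined by training.
THRESHOLDS = {
    "Width_ctr": 0.30, "Window_ctr": 0.30, "I$_ctr": 0.05,
    "D$_ctr": 0.10, "Misp_ctr": 0.01, "L2_ctr": 0.01,
}

def diagnose(counters: dict) -> Bottlenecks:
    """Normalize each counter by Cycle_ctr and compare it with its threshold."""
    cycles = counters["Cycle_ctr"]
    over = {name: counters[name] / cycles > limit
            for name, limit in THRESHOLDS.items()}
    return Bottlenecks(
        width=over["Width_ctr"],
        window=over["Window_ctr"],
        # Frequency is flagged by either mispredictions or L2 misses.
        frequency=over["Misp_ctr"] or over["L2_ctr"],
        icache=over["I$_ctr"],
        dcache=over["D$_ctr"],
    )

# Hypothetical counter snapshot for one 10K-instruction interval.
sample = {"Width_ctr": 4000, "Window_ctr": 500, "I$_ctr": 100,
          "D$_ctr": 200, "Misp_ctr": 20, "L2_ctr": 10, "Cycle_ctr": 8000}
print(diagnose(sample))
```

With this invented snapshot, only the width counter exceeds its threshold after normalization, so width is the sole diagnosed bottleneck.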

Steer phase

Core   Bottlenecks relieved   Bottlenecks worsened
W      Width                  Frequency
L      Window                 Frequency
N      Frequency              Width, Window
A      n/a                    n/a

Steering logic
  if (Width && !Frequency) W
  else if (Window && !Frequency) L
  else if (Frequency && !(Width || Window)) N
  else A

Paper shows full steering logic with I$ and D$ bottlenecks included.
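The simplified steering logic on this slide transcribes directly into a small function. Like the slide, this sketch omits the I$ and D$ bottlenecks that the paper's full logic includes; the function name `steer` and its boolean parameters are illustrative.

```python
def steer(width: bool, window: bool, frequency: bool) -> str:
    """Return the core type (W, L, N, or A) for the next interval."""
    if width and not frequency:
        return "W"  # relieves Width; skipped when Frequency is a bottleneck
    if window and not frequency:
        return "L"  # relieves Window; skipped when Frequency is a bottleneck
    if frequency and not (width or window):
        return "N"  # relieves Frequency; skipped when Width or Window is one
    return "A"      # no bottleneck, or no accelerator qualifies

for flags in [(True, False, False), (False, True, False),
              (False, False, True), (True, False, True),
              (False, False, False)]:
    print(flags, "->", steer(*flags))
```

Each accelerator is chosen only when it relieves a diagnosed bottleneck without worsening another one, which is why a phase limited by both width and frequency falls back to the average core.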


RESULTS


Methodology

• Benchmarks: SPEC 2000
  – Simulate first 4 billion instructions
• Metrics
  – Performance: BIPS
  – Efficiency: BIPS³/Watt
• Migration overhead
  – Default: 100 cycles
  – Sensitivity study: 1K, 10K cycles

Steering algorithms

Algorithm    Description
Baseline     Run the entire 4B instructions on the average core
Sampling     Run on each core type for the sampling interval and then on the best core type for the switching interval
Bottleneck   Run current 10K instruction segment based on the bottlenecks of the prior 10K segment
Optimal      Run every 10K instruction segment on the best core type of the prior 10K segment
Oracle       Run every 10K instruction segment on the best core type
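To make the difference between the Baseline, Optimal, and Oracle policies concrete, the sketch below applies them to a short trace. The trace values are hypothetical, migration overhead is ignored, and starting the Optimal policy on the average core A is an assumption, since the slides do not say where the first segment runs.

```python
from statistics import harmonic_mean

# Hypothetical BIPS of each 10K-instruction segment on each core type.
TRACE = [
    {"A": 2.0, "W": 2.6, "L": 1.8, "N": 1.9},
    {"A": 2.0, "W": 2.5, "L": 1.9, "N": 1.8},
    {"A": 1.5, "W": 1.2, "L": 1.3, "N": 2.4},
    {"A": 1.6, "W": 1.3, "L": 1.2, "N": 2.3},
]

def best(segment):
    """Core type with the highest BIPS for one segment."""
    return max(segment, key=segment.get)

def baseline(trace):
    """Every segment runs on the average core A."""
    return ["A"] * len(trace)

def optimal(trace, start="A"):
    """Each segment runs on the core that was best for the prior segment."""
    return [start] + [best(seg) for seg in trace[:-1]]

def oracle(trace):
    """Each segment runs on its own best core."""
    return [best(seg) for seg in trace]

def overall_bips(trace, schedule):
    # Segments have equal instruction counts, so rates combine harmonically.
    return harmonic_mean([seg[core] for seg, core in zip(trace, schedule)])

for policy in (baseline, optimal, oracle):
    schedule = policy(TRACE)
    print(policy.__name__, schedule, round(overall_bips(TRACE, schedule), 2))
```

In this invented trace the best core changes once, so the Optimal policy mispredicts exactly one segment after the phase change and lands between the Baseline and the Oracle.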

4-core-type HCMP

• The 4-core-type HCMP outperforms the homogeneous CMP by up to 76%, and by 15% on average
• Our steering algorithm is able to capture most of this gain

Sampling vs. bottleneck steering


Occupancy

Occupancy pattern varies dramatically across different applications

Efficiency

• Sampling performs 25% better than the average core
• Bottleneck steering performs 33% better than the average core


SUMMARY


Summary

• First proposal to architect and orchestrate multiple core types for latency reduction.

• With N core types, the optimal HCMP consists of an average core type coupled with N-1 accelerator core types.

• In the complementary steering algorithm, the application is continuously diagnosed for bottlenecks and is migrated to the core type which relieves the bottlenecks.

Future work

• HCMPs open up a whole new direction of microarchitecture research.

• Many microarchitecture optimizations don’t provide universal benefits.

• As each core-type targets a narrow workload space, HCMP provides a great platform to reconsider these optimizations.
