Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors. Abhishek Bhattacharjee, Margaret Martonosi, Princeton University.


Thread Criticality Predictors
for Dynamic Performance, Power,
and Resource Management
in Chip Multiprocessors
Abhishek Bhattacharjee
Margaret Martonosi
Princeton University
Why Thread Criticality Prediction?
[Figure: execution timeline of threads T0–T3, showing instructions executed, D-cache misses, I-cache misses, and barrier stall time]
• Sources of variability: algorithm, process variations, thermal emergencies, etc.
• Threads 1 & 3 are critical → performance degradation, energy inefficiency
• With thread criticality prediction:
  1. Task stealing for performance
  2. DVFS for energy efficiency
  3. Many others …
Related Work
 Instruction criticality [Fields et al., Tune et al. 2001, etc.]
 Thrifty barrier [Li et al. 2005]
   Faster cores transitioned into low-power mode based on prediction of barrier stall time
 DVFS for energy efficiency at barriers [Liu et al. 2005]
 Meeting points [Cai et al. 2008]
   DVFS non-critical threads by tracking loop iteration completion rate across cores (parallel loops)
Our Approach:
1. Also handles non-barrier code
2. Works on constant or variable loop iteration size
3. Predicts criticality at any point in time, not just barriers
Thread Criticality Prediction Goals
Design Goals → Design Decisions
1. Accuracy (absolute and relative TCP accuracy) → Find a suitable architectural metric
2. Low-overhead implementation (simple HW that allows SW policies to be built on top) → History-based local approach versus thread-comparative approach
3. One predictor, many uses → This paper: TBB, DVFS; other uses: shared LLC management, SMT and memory priority, …
Outline of this Talk
 Thread Criticality Predictor Design
 Methodology
 Identify µarchitectural events impacting thread criticality
 Introduce basic TCP hardware
 Thread Criticality Predictor Uses
 Apply to Intel’s Threading Building Blocks (TBB)
 Apply for energy-efficiency in barrier-based programs
Methodology
 Evaluations on a range of architectures: high-performance and embedded domains
 Full-system including OS
 Detailed power/energy studies using FPGA emulator
Infrastructure: GEMS Simulator | ARM Simulator | FPGA Emulator
Domain: High-performance, wide-issue, out-of-order | Embedded, in-order | Embedded, in-order
System: 16-core CMP with Solaris 10 | 4-32 core CMP | 4-core CMP with Linux 2.6
Cores: 4-issue SPARC | 2-issue ARM | 1-issue SPARC
Caches: 32KB L1, 4MB L2 | 32KB L1, 4MB L2 | 4KB I-Cache, 8KB D-Cache
Why not History-Based TCPs?
+ Info local to core: no communication
-- Requires repetitive barrier behavior
-- Problem for in-order pipelines: variant IPCs
[Figure: normalized stall and compute time (relative to iteration 0) across iterations 0–8 of Ocean, barrier 8]
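For contrast, a history-based TCP lets each core predict its upcoming barrier stall time from its own recent barriers, with no cross-core communication; the figure above shows why this breaks down when iteration behavior is not repetitive. A minimal sketch of such a local predictor follows; the window size and the simple-average policy are illustrative assumptions, not a design from the paper.

#include <array>
#include <cstddef>
#include <cstdint>
#include <numeric>

// Hypothetical history-based predictor: each core keeps the stall times it
// observed at its last few barriers and predicts the next one as their mean.
// No information from other cores is needed, but the prediction is only as
// good as the repetitiveness of the barrier behavior.
class HistoryStallPredictor {
 public:
  void record(uint64_t observed_stall_cycles) {
    history_[next_ % kWindow] = observed_stall_cycles;
    next_++;
  }
  uint64_t predict() const {
    size_t n = next_ < kWindow ? next_ : kWindow;
    if (n == 0) return 0;  // no history yet
    uint64_t sum = std::accumulate(history_.begin(), history_.begin() + n, uint64_t{0});
    return sum / n;        // simple average of recent barrier stalls
  }
 private:
  static constexpr size_t kWindow = 4;  // illustrative window size
  std::array<uint64_t, kWindow> history_{};
  size_t next_ = 0;
};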
Thread-Comparative Metrics for TCP: Instruction Counts
[Figure: % error of in-order instruction count in tracking compute time, across SPLASH-2 (LU, Ocean, Barnes, Water-Nsq, Volrend, Water-Sp, Radix, FFT, Cholesky) and PARSEC (Streamcluster, Blackscholes, Fluidanimate, Swaptions) benchmarks]
Thread-Comparative Metrics for TCP: L1 D Cache Misses
[Figure: % error in tracking compute time for in-order instruction count versus in-order L1 D-cache misses per instruction]
Thread-Comparative Metrics for TCP: L1 I & D Cache Misses
[Figure: % error in tracking compute time, adding in-order L1 I & D cache misses per instruction]
Thread-Comparative Metrics for TCP: All L1 and L2 Cache Misses
[Figure: % error in tracking compute time, adding in-order L1 & L2 cache misses per instruction]
Thread-Comparative Metrics for TCP: All L1 and L2 Cache Misses
[Figure: % error in tracking compute time for all metrics, including out-of-order L1 & L2 cache misses per instruction]
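These results motivate a thread-comparative metric built from cache misses per instruction. As a rough illustration, the score below combines L1 and L2 miss rates and flags the thread with the highest score as critical; the specific weights are placeholders standing in for relative miss penalties, not values taken from the paper.

#include <cstddef>
#include <cstdint>
#include <vector>

// Per-thread counts sampled over the same execution interval.
struct ThreadCounts {
  uint64_t instructions;
  uint64_t l1_misses;   // I-cache + D-cache misses
  uint64_t l2_misses;
};

// Hypothetical criticality score: weighted cache misses per instruction.
// kL1Weight/kL2Weight are illustrative, not the paper's calibrated values.
double criticality_score(const ThreadCounts& t,
                         double kL1Weight = 1.0, double kL2Weight = 10.0) {
  if (t.instructions == 0) return 0.0;
  return (kL1Weight * t.l1_misses + kL2Weight * t.l2_misses) /
         static_cast<double>(t.instructions);
}

// Thread-comparative prediction: the thread with the largest score is
// expected to reach the barrier last.
size_t most_critical(const std::vector<ThreadCounts>& threads) {
  size_t argmax = 0;
  double best = -1.0;
  for (size_t i = 0; i < threads.size(); ++i) {
    double s = criticality_score(threads[i]);
    if (s > best) { best = s; argmax = i; }
  }
  return argmax;
}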
Outline of this Talk
 Thread Criticality Predictor Design
 Methodology
 Identify µarchitectural events impacting thread criticality
 Introduce basic TCP hardware
 Thread Criticality Predictor Uses
 Apply to Intel’s Threading Building Blocks (TBB)
 Apply for energy-efficiency in barrier-based programs
Basic TCP Hardware
[Diagram: 4-core CMP; each core has an L1 I$ and D$ and shares an L2 cache. The TCP hardware sits at the L2 controller, where per-core criticality counters are updated as cores suffer L1 I$, L1 D$, and L2 misses.]
• Per-core criticality counters track poorly cached, slow threads
• Counters are incremented when a core's L1 I$, L1 D$, or L2 accesses miss
• An Interval Bound Register periodically refreshes the criticality counters
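A behavioral model of this hardware is easy to express in software: per-core counters sit beside the L2 controller, are bumped on every L1 or L2 miss the controller observes, and are cleared whenever the interval bound expires. In the sketch below, the miss weights and the interval length are illustrative assumptions, not the paper's calibrated parameters.

#include <algorithm>
#include <cstdint>
#include <vector>

// Behavioral sketch of the TCP hardware at the shared-L2 controller.
// Each core has one criticality counter; misses observed by the controller
// increment the owning core's counter, and an interval bound register
// periodically clears all counters so stale history does not accumulate.
class CriticalityCounters {
 public:
  enum class Miss { L1I, L1D, L2 };

  explicit CriticalityCounters(unsigned cores, uint64_t interval_cycles = 100000)
      : counters_(cores, 0), interval_(interval_cycles) {}

  void on_miss(unsigned core, Miss type) {
    // Illustrative weights standing in for relative miss penalties.
    uint64_t inc = (type == Miss::L2) ? 10 : 1;
    counters_[core] += inc;
  }

  void on_cycle(uint64_t now) {
    if (now - last_refresh_ >= interval_) {  // interval bound expired
      std::fill(counters_.begin(), counters_.end(), 0);
      last_refresh_ = now;
    }
  }

  unsigned most_critical_core() const {
    unsigned argmax = 0;
    for (unsigned c = 1; c < counters_.size(); ++c)
      if (counters_[c] > counters_[argmax]) argmax = c;
    return argmax;
  }

 private:
  std::vector<uint64_t> counters_;
  uint64_t interval_;
  uint64_t last_refresh_ = 0;
};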
Outline of this Talk
 Thread Criticality Predictor (TCP) Design
 Methodology
 Identify µarchitectural events impacting thread criticality
 Introduce basic TCP hardware
 Thread Criticality Predictor Uses
 Apply to Intel’s Threading Building Blocks (TBB)
 Apply for energy-efficiency in barrier-based programs
TBB Task Stealing & Thread Criticality
 TBB dynamic scheduler distributes tasks
 Each thread maintains software queue filled with tasks
 Empty queue – thread “steals” task from another thread’s queue
 Approach 1: Default TBB uses random task stealing
 More failed steals at higher core counts → poor performance
 Approach 2: Occupancy-based task stealing [Contreras, Martonosi 2008] (sketched below)
 Steal based on number of items in SW queue
 Must track and compare max. occupancy counts
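For reference, the occupancy-based baseline picks its victim by comparing queue sizes rather than criticality. A minimal sketch of that victim selection follows; the queue representation is a stand-in for illustration, not TBB's actual internals.

#include <cstddef>
#include <deque>
#include <vector>

struct Task { int id; };
using TaskQueue = std::deque<Task>;  // stand-in for TBB's per-thread queue

// Occupancy-based victim selection [Contreras, Martonosi 2008]:
// steal from the thread whose software queue currently holds the most tasks,
// which requires tracking and comparing occupancy counts across threads.
size_t pick_victim_by_occupancy(const std::vector<TaskQueue>& queues, size_t self) {
  size_t victim = self;
  size_t best = 0;
  for (size_t t = 0; t < queues.size(); ++t) {
    if (t == self) continue;
    if (queues[t].size() > best) { best = queues[t].size(); victim = t; }
  }
  return victim;  // == self means no non-empty queue was found
}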
TCP-Guided TBB Task Stealing
[Diagram: 4-core CMP with per-core software task queues (SW Q0–SW Q3). The per-core criticality counters, Interval Bound Register, and TCP control logic sit at the shared L2 cache; scan-for-max logic picks the most critical core and directs an idle core's steal request toward it (here, steal from Core 3).]
• TCP initiates steals from the critical thread
• Modest message overhead: L2 access latency
• Scalable: 14-bit criticality counters → 114 bytes of storage @ 64 cores
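In software terms, the policy reduces to: when a worker's queue is empty, ask the TCP for the core with the maximum criticality counter and steal from that core's queue. A hedged sketch of the victim-selection step follows; the queue type and TCP query interface are made-up names for illustration, not TBB or paper APIs.

#include <cstddef>
#include <cstdint>
#include <deque>
#include <optional>
#include <vector>

struct Task { int id; };
using TaskQueue = std::deque<Task>;  // stand-in for TBB's per-thread queue

// Illustrative TCP query: returns the core whose criticality counter is largest.
// In hardware this is the scan-for-max logic at the L2 controller.
size_t query_most_critical(const std::vector<uint64_t>& criticality_counters) {
  size_t victim = 0;
  for (size_t c = 1; c < criticality_counters.size(); ++c)
    if (criticality_counters[c] > criticality_counters[victim]) victim = c;
  return victim;
}

// Criticality-guided steal: an idle worker targets the most critical core's
// queue instead of picking a victim at random.
std::optional<Task> steal_for(size_t idle_core,
                              std::vector<TaskQueue>& queues,
                              const std::vector<uint64_t>& criticality_counters) {
  size_t victim = query_most_critical(criticality_counters);
  if (victim == idle_core || queues[victim].empty())
    return std::nullopt;               // failed steal; caller may retry
  Task t = queues[victim].front();     // thief takes the oldest queued task
  queues[victim].pop_front();
  return t;
}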
TCP-Guided TBB Performance
[Figure: % performance improvement versus random task stealing for the occupancy-based and criticality-based approaches at 4, 8, 16, and 32 cores, on Blackscholes, Fluidanimate, Swaptions, and Streamcluster]
• Avg. improvement over Random (32 cores) = 21.6%
• Avg. improvement over Occupancy (32 cores) = 13.8%
• TCP access penalized with L2 latency
Outline of this Talk
 Thread Criticality Predictor Design
 Methodology
 Identify µarchitectural events impacting thread criticality
 Introduce basic TCP hardware
 Thread Criticality Predictor Uses
 Apply to Intel’s Threading Building Blocks (TBB)
 Apply for energy-efficiency in barrier-based programs
Adapting TCP for Energy Efficiency in Barrier-Based Programs
[Figure: execution timeline of threads T0–T3 up to a barrier; T1 suffers an L2 D$ miss and is critical, so T0, T2, and T3 are DVFS'd down]
Approach: DVFS non-critical threads to eliminate barrier stall time
Challenges:
• Relative criticalities
• Misprediction costs
• DVFS overheads
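A simple way to picture the policy: once the TCP identifies the critical thread, each non-critical thread is slowed so that it is expected to reach the barrier at roughly the same time as the critical one. The proportional-slowdown rule and frequency bounds below are illustrative assumptions; the paper's controller must additionally account for misprediction costs and DVFS transition overheads.

#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative per-core DVFS decision for one barrier interval.
// remaining_work[c] is an estimate of how much work core c still has before
// the barrier (e.g., derived from the criticality counters); the critical core
// stays at full frequency and the others are scaled down proportionally.
std::vector<double> pick_frequencies(const std::vector<double>& remaining_work,
                                     double f_max, double f_min) {
  double critical = *std::max_element(remaining_work.begin(), remaining_work.end());
  std::vector<double> freq(remaining_work.size(), f_max);
  if (critical <= 0.0) return freq;  // nothing left to balance
  for (size_t c = 0; c < remaining_work.size(); ++c) {
    // Slow core c so it finishes about when the critical core does.
    double target = f_max * (remaining_work[c] / critical);
    freq[c] = std::clamp(target, f_min, f_max);  // respect the platform's DVFS range
  }
  return freq;
}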
TCP for DVFS: Results
[Figure: normalized energy savings (relative to the original benchmark) across benchmarks; average 15% energy savings]
 FPGA platform with 4 cores, 50% fixed leakage cost
 See paper for details: TCP mispredictions, DVFS overheads, etc.
Conclusions
 Goal 1: Accuracy
 Accurate TCPs based on simple cache statistics
 Goal 2: Low-overhead hardware
 Scalable per-core criticality counters used
 TCP in central location where cache info. is already available
 Goal 3: Versatility
 TBB improved by 13.8% over best known approach @ 32 cores
 DVFS used to achieve 15% energy savings
 Two uses shown, many others possible…