Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors Abhishek Bhattacharjee Margaret Martonosi Princeton University.
Thread Criticality Predictors
for Dynamic Performance, Power,
and Resource Management
in Chip Multiprocessors
Abhishek Bhattacharjee
Margaret Martonosi
Princeton University
Why Thread Criticality Prediction?
[Figure: execution timelines for threads T0–T3 showing instructions executed, D-cache misses, I-cache misses, and stall time at a barrier]
• Sources of variability: algorithm, process variations, thermal emergencies, etc.
• Threads 1 & 3 are critical → performance degradation, energy inefficiency
• With thread criticality prediction:
1. Task stealing for performance
2. DVFS for energy efficiency
3. Many others …
Related Work
Instruction criticality [Fields et al., Tune et al. 2001, etc.]
Thrifty barrier [Li et al. 2005]
Faster cores transitioned into low-power mode based on prediction of barrier stall time
DVFS for energy efficiency at barriers [Liu et al. 2005]
Meeting points [Cai et al. 2008]
DVFS non-critical threads by tracking loop-iteration completion rate across cores (parallel loops)
Our Approach:
1. Also handles non-barrier code
2. Works on constant or variable loop iteration size
3. Predicts criticality at any point in time, not just barriers
Thread Criticality Prediction Goals
Design Goals → Design Decisions
1. Accuracy (absolute and relative TCP accuracy) → Find a suitable architectural metric
2. Low-overhead implementation (simple HW, allowing SW policies to be built on top) → History-based local approach versus thread-comparative approach
3. One predictor, many uses → This paper: TBB, DVFS. Other uses: shared LLC management, SMT and memory priority, …
Outline of this Talk
Thread Criticality Predictor Design
Methodology
Identify µarchitectural events impacting thread criticality
Introduce basic TCP hardware
Thread Criticality Predictor Uses
Apply to Intel’s Threading Building Blocks (TBB)
Apply for energy-efficiency in barrier-based programs
Methodology
Evaluations on a range of architectures: high-performance and embedded domains
Full-system including OS
Detailed power/energy studies using FPGA emulator

Infrastructure | GEMS Simulator | ARM Simulator | FPGA Emulator
Domain | High-performance, wide-issue, out-of-order | Embedded, in-order | Embedded, in-order
System | 16-core CMP with Solaris 10 | 4-32 core CMP | 4-core CMP with Linux 2.6
Cores | 4-issue SPARC | 2-issue ARM | 1-issue SPARC
Caches | 32KB L1, 4MB L2 | 32KB L1, 4MB L2 | 4KB I-Cache, 8KB D-Cache
Why not History-Based TCPs?
+ Info local to core: no communication
– Requires repetitive barrier behavior
– Problem for in-order pipelines: variant IPCs
[Chart: stall vs. compute time per iteration, normalized to iteration 0, for Ocean (barrier 8), iterations 0–8]
Thread-Comparative Metrics for TCP: Instruction Counts
[Chart: % error of the in-order instruction count metric in tracking compute time, across SPLASH-2 benchmarks (LU, Ocean, Barnes, Water-Nsq, Volrend, Water-Sp, Radix, FFT, Cholesky) and PARSEC benchmarks (Streamcluster, Blackscholes, Fluidanimate, Swaptions)]
Thread-Comparative Metrics for TCP: L1 D Cache Misses
[Chart: % error in tracking compute time, adding in-order L1 D-cache misses per instruction alongside in-order instruction count, same benchmarks]
Thread-Comparative Metrics for TCP: L1 I & D Cache Misses
[Chart: % error in tracking compute time, adding in-order L1 I & D cache misses per instruction, same benchmarks]
Thread-Comparative Metrics for TCP: All L1 and L2 Cache Misses
[Chart: % error in tracking compute time, adding in-order L1 & L2 cache misses per instruction, same benchmarks]
Thread-Comparative Metrics for TCP: All L1 and L2 Cache Misses
[Chart: % error in tracking compute time, adding out-of-order L1 & L2 cache misses per instruction to the in-order metrics, same benchmarks]
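The charts above converge on combined L1 and L2 cache misses per instruction as the metric that best tracks per-thread compute time. As a rough illustration of how such a thread-comparative metric could rank threads, here is a minimal sketch; the penalty weights are illustrative assumptions, not the paper's calibrated values.

```python
# Estimating relative thread criticality from cache-miss counts,
# in the spirit of the thread-comparative metrics above.
# L1_PENALTY and L2_PENALTY are assumed, illustrative weights.

L1_PENALTY = 1    # assumed relative cost of an L1 miss
L2_PENALTY = 10   # assumed relative cost of an L2 miss (goes to memory)

def criticality_score(l1i_misses, l1d_misses, l2_misses):
    """Weighted miss count: a higher score suggests a more critical thread."""
    return L1_PENALTY * (l1i_misses + l1d_misses) + L2_PENALTY * l2_misses

def most_critical(per_thread_counts):
    """per_thread_counts: {thread_id: (l1i, l1d, l2)} sampled over an interval."""
    return max(per_thread_counts,
               key=lambda t: criticality_score(*per_thread_counts[t]))

# Hypothetical per-thread miss counts over one interval:
counts = {0: (10, 40, 2), 1: (15, 80, 12), 2: (8, 30, 1), 3: (20, 90, 9)}
print(most_critical(counts))  # -> 1 (its L2 misses dominate the score)
```

Because only relative ordering matters for TCP decisions, the exact weights are less important than weighting L2 misses far above L1 misses.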
Outline of this Talk
Thread Criticality Predictor Design
Methodology
Identify µarchitectural events impacting thread criticality
Introduce basic TCP hardware
Thread Criticality Predictor Uses
Apply to Intel’s Threading Building Blocks (TBB)
Apply for energy-efficiency in barrier-based programs
Basic TCP Hardware
[Figure: 4-core CMP (Core 0–3, each with L1 I$ and L1 D$) sharing an L2 cache; animation shows L1 and L2 cache misses on each core incrementing that core's criticality counter at the L2 controller]
• Per-core criticality counters track poorly cached, slow threads
• Counters are refreshed periodically, with the period set by an Interval Bound Register
• TCP hardware sits at the shared L2 controller, where cache-miss information already arrives
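The counter-update and refresh behavior described above can be sketched as follows. The event weights and interval length are illustrative assumptions; real TCP hardware would implement this with simple adders and a register, not software.

```python
# Minimal sketch of per-core criticality counters with periodic refresh
# via an interval bound register. Weights and interval are assumed values.

class TCP:
    def __init__(self, n_cores, interval=1000, l1_weight=1, l2_weight=10):
        self.counters = [0] * n_cores   # per-core criticality counters
        self.interval = interval        # interval bound register value
        self.elapsed = 0
        self.l1_weight = l1_weight
        self.l2_weight = l2_weight

    def on_l1_miss(self, core):
        self.counters[core] += self.l1_weight

    def on_l2_miss(self, core):
        self.counters[core] += self.l2_weight

    def tick(self, cycles=1):
        # Periodic refresh: zero the counters so stale history does not
        # dominate the current criticality estimate.
        self.elapsed += cycles
        if self.elapsed >= self.interval:
            self.counters = [0] * len(self.counters)
            self.elapsed = 0

    def most_critical_core(self):
        return max(range(len(self.counters)), key=self.counters.__getitem__)
```

Keeping the counters beside the L2 controller means no extra coherence traffic is needed: the miss events already flow through that point.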
Outline of this Talk
Thread Criticality Predictor (TCP) Design
Methodology
Identify µarchitectural events impacting thread criticality
Introduce basic TCP hardware
Thread Criticality Predictor Uses
Apply to Intel’s Threading Building Blocks (TBB)
Apply for energy-efficiency in barrier-based programs
TBB Task Stealing & Thread Criticality
TBB's dynamic scheduler distributes tasks: each thread maintains a software queue filled with tasks. On an empty queue, a thread "steals" a task from another thread's queue.
Approach 1: Default TBB uses random task stealing
More failed steals at higher core counts → poor performance
Approach 2: Occupancy-based task stealing [Contreras, Martonosi, 2008]
Steal based on number of items in SW queue
Must track and compare max. occupancy counts
TCP-Guided TBB Task Stealing
[Figure: 4-core CMP with per-core software task queues (SW Q0–Q3); Core 2's steal request goes to TCP steal control logic at the shared L2, which scans the criticality counters for the maximum value and initiates the steal from Core 3]
• TCP initiates steals from the critical thread
• Modest message overhead: L2 access latency
• Scalable: 14-bit criticality counters → 114 bytes of storage @ 64 cores
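The victim-selection step above (scan the counters for the maximum) can be sketched as below. The queue and counter representations are illustrative, not TBB's actual internals.

```python
# Sketch of TCP-guided victim selection for task stealing: an idle
# thread steals from the core with the highest criticality counter
# that still has queued work. Data structures are illustrative.

from collections import deque

def pick_victim(criticality, queues, thief):
    """Return the most critical core (other than the thief) with work,
    or None if no core has stealable tasks."""
    candidates = [c for c in range(len(queues))
                  if c != thief and queues[c]]
    if not candidates:
        return None
    return max(candidates, key=lambda c: criticality[c])

queues = [deque(), deque(["task1", "task2"]), deque(), deque(["task7"])]
criticality = [0, 5, 2, 14]            # core 3 is most critical
print(pick_victim(criticality, queues, thief=2))  # -> 3
```

Compared with random stealing, this targets the thread most likely to be holding up the computation, which is why failed steals drop at high core counts.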
TCP-Guided TBB Performance
[Chart: % performance improvement versus random task stealing for the occupancy-based and criticality-based approaches, on Blackscholes, Fluidanimate, Swaptions, and Streamcluster at 4, 8, 16, and 32 cores]
• Avg. improvement over Random (32 cores) = 21.6%
• Avg. improvement over Occupancy (32 cores) = 13.8%
• TCP access penalized with L2 latency
Outline of this Talk
Thread Criticality Predictor Design
Methodology
Identify µarchitectural events impacting thread criticality
Introduce basic TCP hardware
Thread Criticality Predictor Uses
Apply to Intel’s Threading Building Blocks (TBB)
Apply for energy-efficiency in barrier-based programs
Adapting TCP for Energy Efficiency in Barrier-Based Programs
[Figure: threads T0–T3 running to a barrier; T1 suffers an L2 D$ miss and is critical, so T0, T2, and T3 are DVFS'd to arrive at the barrier just as T1 does]
Approach: DVFS non-critical threads to eliminate barrier stall time
Challenges:
• Relative criticalities
• Misprediction costs
• DVFS overheads
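The idea can be sketched as a simple scaling rule: run the critical thread at full frequency and slow each other thread just enough to finish at the same time. The frequency levels and the linear work-vs-frequency model below are simplifying assumptions for illustration.

```python
# Sketch of criticality-driven DVFS at a barrier. Assumes work scales
# linearly with frequency; F_LEVELS are assumed available f/f_max ratios.

F_LEVELS = [0.5, 0.625, 0.75, 0.875, 1.0]

def dvfs_settings(remaining_work, f_max=1.0):
    """remaining_work: predicted remaining work per thread (e.g. derived
    from criticality counters). The most critical thread runs at f_max;
    each other thread picks the slowest level that still meets the
    critical thread's finish time."""
    critical_time = max(remaining_work) / f_max
    settings = []
    for w in remaining_work:
        needed = w / critical_time if critical_time > 0 else f_max
        # slowest available frequency that is still fast enough
        settings.append(min(f for f in F_LEVELS if f >= needed))
    return settings

print(dvfs_settings([40, 100, 60, 55]))  # -> [0.5, 1.0, 0.625, 0.625]
```

Mispredictions matter here: slowing a thread that turns out to be critical stretches the whole barrier interval, which is why misprediction cost is listed as a challenge.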
TCP for DVFS: Results
[Chart: normalized energy savings (relative to original benchmark), per benchmark, ranging up to roughly 0.25–0.3]
• Average 15% energy savings
• FPGA platform with 4 cores, 50% fixed leakage cost
• See paper for details: TCP mispredictions, DVFS overheads, etc.
Conclusions
Goal 1: Accuracy
• Accurate TCPs based on simple cache statistics
Goal 2: Low-overhead hardware
• Scalable per-core criticality counters
• TCP in a central location where cache information is already available
Goal 3: Versatility
• TBB improved by 13.8% over the best known approach @ 32 cores
• DVFS used to achieve 15% energy savings
• Two uses shown, many others possible…