
Techniques for Efficient Processing
in Runahead Execution Engines
Onur Mutlu
Hyesoon Kim
Yale N. Patt
Talk Outline

- Background on Runahead Execution
- The Problem
- Causes of Inefficiency and Eliminating Them
- Evaluation
- Performance Optimizations to Increase Efficiency
- Combined Results
- Conclusions
Background on Runahead Execution

- A technique to obtain the memory-level parallelism benefits of a large instruction window
- When the oldest instruction is an L2 miss:
  - Checkpoint architectural state and enter runahead mode
- In runahead mode:
  - Instructions are speculatively pre-executed
  - The purpose of pre-execution is to generate prefetches
  - L2-miss dependent instructions are marked INV and dropped
- Runahead mode ends when the original L2 miss returns
  - Checkpoint is restored and normal execution resumes
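To make the mechanism concrete, a minimal C++ sketch of the entry/exit control flow is given below. The types and names (Checkpoint, Core, on_oldest_inst_l2_miss, ...) are illustrative assumptions, not the interface of the simulator used in the talk.

```cpp
#include <cstdint>

// Minimal sketch of the runahead entry/exit control flow described above.
struct Checkpoint { /* architectural registers, branch history, etc. */ };

struct Core {
    bool          in_runahead      = false;
    Checkpoint    saved_state;
    std::uint64_t runahead_miss_id = 0;   // id of the L2 miss that triggered runahead

    // Called when the instruction at the head of the window is an L2 miss.
    void on_oldest_inst_l2_miss(std::uint64_t miss_id) {
        if (in_runahead) return;               // already running ahead
        saved_state      = take_checkpoint();  // checkpoint architectural state
        runahead_miss_id = miss_id;
        in_runahead      = true;               // instructions are now speculatively pre-executed;
                                               // L2-miss dependent results are marked INV and dropped
    }

    // Called when an outstanding L2 miss is serviced by memory.
    void on_l2_miss_return(std::uint64_t miss_id) {
        if (in_runahead && miss_id == runahead_miss_id) {
            restore(saved_state);              // flush the pipeline and restore the checkpoint
            in_runahead = false;               // resume normal execution
        }
    }

private:
    Checkpoint take_checkpoint() { return Checkpoint{}; }  // placeholder
    void restore(const Checkpoint&) {}                     // placeholder
};
```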
Runahead Example

[Timeline comparison. Small window: Compute, Load 1 misses in L2 and the processor stalls for Miss 1; Compute resumes, then Load 2 misses and the processor stalls again for Miss 2. Runahead: Compute, Load 1 misses and the processor enters runahead, pre-executing Load 2 so that Miss 2 overlaps with Miss 1; after Miss 1 returns, both Load 1 and Load 2 hit and the second stall is avoided (saved cycles).]
The Problem

- A runahead processor pre-executes some instructions speculatively
- Each pre-executed instruction consumes energy
- Runahead execution significantly increases the number of executed instructions, sometimes without providing a significant performance improvement
The Problem (cont.)

[Per-benchmark chart: % increase in IPC and % increase in executed instructions due to runahead execution across the SPEC CPU2000 benchmarks (bzip2 through wupwise). On average, runahead improves IPC by 22.6% while executing 26.5% more instructions; for some benchmarks the instruction increase is far larger, up to 235%.]
Efficiency of Runahead Execution

Efficiency = (% Increase in IPC) / (% Increase in Executed Instructions)

- Goals:
  - Reduce the number of executed instructions without reducing the IPC improvement
  - Increase the IPC improvement without increasing the number of executed instructions
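As a worked example using the average numbers reported later in the talk:

  Efficiency (baseline runahead) = 22.6% / 26.5% ≈ 0.85
  Efficiency (all techniques)    = 22.1% / 6.2%  ≈ 3.56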
Talk Outline

- Background on Runahead Execution
- The Problem
- Causes of Inefficiency and Eliminating Them
- Evaluation
- Performance Optimizations to Increase Efficiency
- Combined Results
- Conclusions
Causes of Inefficiency

- Short runahead periods
- Overlapping runahead periods
- Useless runahead periods
Short Runahead Periods

- The processor can initiate runahead mode due to an already in-flight L2 miss generated by the prefetcher, a wrong-path access, or a previous runahead period

[Timeline: the processor enters runahead on Load 1's miss although Miss 1 is already partly serviced, so the period is short; Load 2 misses during the brief period, and Load 1 hits soon after runahead ends.]

- Short periods
  - are less likely to generate useful L2 misses
  - have high overhead due to the flush penalty at runahead exit
Eliminating Short Runahead Periods

- Mechanism to eliminate short periods:
  - Record the number of cycles C an L2 miss has been in flight
  - If C is greater than a threshold T for an L2 miss, disable entry into runahead mode due to that miss
  - T can be determined statically (at design time) or dynamically
  - T = 400 works well for a minimum main memory latency of 500 cycles
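A minimal sketch of this check is shown below; the MissStatusEntry structure and field names are illustrative assumptions.

```cpp
#include <cstdint>

// Illustrative threshold; the talk reports T = 400 working well for a
// 500-cycle minimum main memory latency.
constexpr std::uint32_t kRunaheadThresholdT = 400;

// Hypothetical bookkeeping for one outstanding L2 miss.
struct MissStatusEntry {
    std::uint64_t block_addr       = 0;
    std::uint32_t cycles_in_flight = 0;  // incremented every cycle the miss is outstanding
};

// A miss that has already been in flight longer than T will return soon, so a
// runahead period started for it would be short and low-benefit; disable entry.
bool should_enter_runahead(const MissStatusEntry& miss) {
    return miss.cycles_in_flight <= kRunaheadThresholdT;
}
```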
Overlapping Runahead Periods

- Two runahead periods that execute the same instructions

[Timeline: during the first runahead period (triggered by Load 1's miss), Load 2 is INV; after runahead exit, Load 2 misses and triggers a second runahead period that re-executes the same instructions (the OVERLAP region).]

- The second period is inefficient
Overlapping Runahead Periods (cont.)

- Overlapping periods are not necessarily useless
  - The availability of a new data value can result in the generation of useful L2 misses
  - But this does not happen often enough
- Mechanism to eliminate overlapping periods:
  - Keep track of the number of pseudo-retired instructions R during a runahead period
  - Keep track of the number of fetched instructions N since the exit from the last runahead period
  - If N < R, do not enter runahead mode
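A minimal sketch of this bookkeeping follows; the OverlapFilter structure and its member names are illustrative assumptions.

```cpp
#include <cstdint>

// Tracks whether a new runahead period would merely re-execute instructions
// already pre-executed by the previous period.
struct OverlapFilter {
    std::uint64_t pseudo_retired_R      = 0;  // instructions pseudo-retired in the last runahead period
    std::uint64_t fetched_since_exit_N  = 0;  // instructions fetched since that period ended

    void on_runahead_enter()            { pseudo_retired_R = 0; }
    void on_pseudo_retire()             { ++pseudo_retired_R; }      // called during runahead
    void on_runahead_exit()             { fetched_since_exit_N = 0; }
    void on_fetch(std::uint64_t insts)  { fetched_since_exit_N += insts; }  // normal mode

    // If fewer instructions have been fetched than were pseudo-retired last
    // time (N < R), a new period would overlap with the previous one.
    bool allow_runahead_entry() const   { return fetched_since_exit_N >= pseudo_retired_R; }
};
```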
Useless Runahead Periods

- Periods that do not result in prefetches for normal mode

[Timeline: the processor enters runahead on Load 1's miss, but the period generates no new L2 misses; after Miss 1 returns, Load 1 hits and normal execution continues.]

- They exist due to the lack of memory-level parallelism
- Mechanism to eliminate useless periods:
  - Predict whether a period will generate useful L2 misses
  - Estimate a period to be useful if it generated an L2 miss that cannot be captured by the instruction window
  - Useless period predictors are trained based on this estimation
Predicting Useless Runahead Periods

- Prediction based on the past usefulness of runahead periods caused by the same static load instruction
  - A 2-bit state machine records the past usefulness of a load
- Prediction based on too many INV loads
  - If the fraction of INV loads in a runahead period is greater than T, exit runahead mode
- Sampling (phase) based prediction
  - If the last N runahead periods generated fewer than T L2 misses, do not enter runahead for the next M runahead opportunities
- Compile-time profile-based prediction
  - If runahead periods caused by a load were not useful in the profiling run, mark it as a non-runahead load
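A minimal sketch of the per-load 2-bit usefulness predictor is given below. The table size, the PC-based indexing, and the initial "weakly useful" state are illustrative assumptions, not the parameters used in the talk.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Per-static-load 2-bit saturating counters recording whether runahead periods
// caused by that load were useful in the past.
class RunaheadCausePredictor {
public:
    RunaheadCausePredictor() { counters_.fill(2); }   // start weakly useful

    // Predict: allow runahead entry only if the triggering load's counter is
    // in a "useful" state.
    bool predict_useful(std::uint64_t load_pc) const {
        return counters_[index(load_pc)] >= 2;
    }

    // Train at runahead exit. A period is estimated useful if it generated an
    // L2 miss that could not be captured by the instruction window.
    void train(std::uint64_t load_pc, bool period_was_useful) {
        std::uint8_t& c = counters_[index(load_pc)];
        if (period_was_useful) { if (c < 3) ++c; }
        else                   { if (c > 0) --c; }
    }

private:
    static constexpr std::size_t kEntries = 4096;
    static std::size_t index(std::uint64_t load_pc) { return (load_pc >> 2) % kEntries; }

    std::array<std::uint8_t, kEntries> counters_{};
};
```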
Talk Outline

- Background on Runahead Execution
- The Problem
- Causes of Inefficiency and Eliminating Them
- Evaluation
- Performance Optimizations to Increase Efficiency
- Combined Results
- Conclusions
Baseline Processor

- Execution-driven Alpha simulator
- 8-wide superscalar processor
- 128-entry instruction window, 20-stage pipeline
- 64 KB, 4-way, 2-cycle L1 data and instruction caches
- 1 MB, 32-way, 10-cycle unified L2 cache
- 500-cycle minimum main memory latency
- Aggressive stream-based prefetcher
- 32 DRAM banks, 32-byte wide processor-memory bus (4:1 frequency ratio), 128 outstanding misses
  - Detailed memory model
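For reference, the machine parameters above can be collected into a single configuration record, sketched below; the struct and field names are illustrative, not the simulator's actual configuration interface.

```cpp
// Baseline machine parameters from the slide above (illustrative packaging).
struct BaselineConfig {
    int fetch_width            = 8;     // 8-wide superscalar
    int window_entries         = 128;   // instruction window
    int pipeline_stages        = 20;
    int l1_size_kb             = 64;    // each of I-cache and D-cache
    int l1_ways                = 4;
    int l1_latency_cycles      = 2;
    int l2_size_kb             = 1024;  // unified L2
    int l2_ways                = 32;
    int l2_latency_cycles      = 10;
    int min_mem_latency_cycles = 500;
    int dram_banks             = 32;
    int bus_width_bytes        = 32;    // processor-memory bus, 4:1 frequency ratio
    int max_outstanding_misses = 128;
};
```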
Impact on Efficiency

[Bar chart: increase in executed instructions and increase in IPC over the baseline out-of-order processor, for baseline runahead and for runahead with the short-, overlapping-, and useless-period elimination techniques applied individually and in combination (short+overlapping+useless). Baseline runahead shows a 26.5% increase in executed instructions and a 22.6% increase in IPC; the remaining data labels for the technique bars are 20.1%, 15.3%, 14.9%, 11.8%, and 6.7%.]
Performance Optimizations for Efficiency

- Both efficiency AND performance can be increased by increasing the usefulness of runahead periods
- Three optimizations:
  - Turning off the Floating Point Unit (FPU) in runahead mode
  - Optimizing the update policy of the hardware prefetcher (HWP) in runahead mode
  - Early wake-up of INV instructions (in paper)
Turning Off the FPU in Runahead Mode

- FP instructions do not contribute to the generation of load addresses
- FP instructions can be dropped after decode
  - Spares processor resources for more useful instructions
  - Increases performance by enabling faster progress
  - Enables dynamic/static energy savings
  - Results in an unresolvable branch misprediction if a mispredicted branch depends on an FP operation (rare)
- Overall: increases IPC and reduces executed instructions
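A minimal sketch of dropping FP instructions at decode during runahead mode is shown below; the DecodedUop type and the is_fp flag are illustrative assumptions.

```cpp
#include <vector>

struct DecodedUop {
    bool is_fp;        // decoded as a floating-point operation
    // ... opcode, sources, destination ...
};

// In runahead mode, FP uops are filtered out after decode: they cannot help
// generate load addresses, so removing them frees window/scheduler resources
// for instructions that might produce prefetches.
std::vector<DecodedUop> filter_for_runahead(const std::vector<DecodedUop>& decoded,
                                            bool in_runahead) {
    if (!in_runahead) return decoded;
    std::vector<DecodedUop> kept;
    for (const auto& uop : decoded)
        if (!uop.is_fp) kept.push_back(uop);
    return kept;
}
```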
HWP Update Policy in Runahead Mode

- Aggressive hardware prefetching in runahead mode may hurt performance if the prefetcher accuracy is low
  - Runahead requests are more accurate than prefetcher requests
- Three policies:
  - Do not update the prefetcher state
  - Update the prefetcher state just like in normal mode
  - Only train existing streams, but do not create new streams
- Runahead mode improves the timeliness of the prefetcher in many benchmarks
- Only training the existing streams is the best policy
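The three policies can be sketched as follows; the StreamPrefetcher interface and hook names are illustrative assumptions, not a real prefetcher API.

```cpp
#include <cstdint>

// Minimal sketch of the three prefetcher-update policies in runahead mode.
enum class RunaheadHwpPolicy {
    NoUpdate,          // do not update prefetcher state in runahead mode
    UpdateAsNormal,    // update exactly as in normal mode
    TrainExistingOnly  // train existing streams, but do not allocate new ones
};

struct StreamPrefetcher {
    // Illustrative hooks.
    bool matches_existing_stream(std::uint64_t addr) const { return false; }
    void train(std::uint64_t addr) {}
    void allocate_stream(std::uint64_t addr) {}

    void on_runahead_access(std::uint64_t addr, RunaheadHwpPolicy policy) {
        switch (policy) {
        case RunaheadHwpPolicy::NoUpdate:
            return;
        case RunaheadHwpPolicy::UpdateAsNormal:
            if (matches_existing_stream(addr)) train(addr);
            else                               allocate_stream(addr);
            return;
        case RunaheadHwpPolicy::TrainExistingOnly:   // best policy per the talk
            if (matches_existing_stream(addr)) train(addr);
            return;
        }
    }
};
```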
Talk Outline

- Background on Runahead Execution
- The Problem
- Causes of Inefficiency and Eliminating Them
- Evaluation
- Performance Optimizations to Increase Efficiency
- Combined Results
- Conclusions
Overall Impact on Executed Instructions

[Per-benchmark chart: increase in executed instructions for baseline runahead vs. all techniques across the SPEC CPU2000 benchmarks (bzip2 through wupwise). The average increase drops from 26.5% with baseline runahead to 6.2% with all techniques; the largest per-benchmark increase under baseline runahead is 235%.]
Overall Impact on IPC

[Per-benchmark chart: increase in IPC for baseline runahead vs. all techniques across the SPEC CPU2000 benchmarks. The average IPC improvement is essentially preserved: 22.6% with baseline runahead vs. 22.1% with all techniques; individual benchmarks see improvements of up to 116%.]
Conclusions

- Three major causes of inefficiency in runahead execution: short, overlapping, and useless runahead periods
- Simple efficiency techniques can effectively reduce the three causes of inefficiency
- Simple performance optimizations can increase efficiency by increasing the usefulness of runahead periods
- The proposed techniques:
  - reduce the extra instructions from 26.5% to 6.2%, without significantly affecting performance
  - are effective for a variety of memory latencies ranging from 100 to 900 cycles
Backup Slides
Baseline IPC

[Per-benchmark chart: IPC of the SPEC CPU2000 benchmarks for four configurations: no prefetcher, baseline, runahead, and perfect L2 (y-axis: IPC, 0.0 to 5.5).]
Memory Latency (Executed Instructions)

[Chart: increase in executed instructions for baseline runahead vs. all techniques as the minimum memory latency varies from 100 to 900 cycles (y-axis: 0% to 50%).]
Memory Latency (IPC Delta)

[Chart: increase in IPC for baseline runahead vs. all techniques as the minimum memory latency varies from 100 to 900 cycles (y-axis: 0% to 50%).]
Cache Sizes (Executed Instructions)

[Chart: increase in executed instructions for baseline runahead vs. all techniques with L2 cache sizes of 512 KB, 1 MB, 2 MB, and 4 MB (y-axis: 0% to 45%).]
Cache Sizes (IPC Delta)

[Chart: increase in IPC for baseline runahead vs. all techniques with L2 cache sizes of 512 KB, 1 MB, 2 MB, and 4 MB (y-axis: 0% to 45%).]
INT (Executed Instructions)

[Chart: increase in executed instructions on the integer benchmarks for runahead (INT) vs. all techniques (INT), with memory latency varying from 100 to 900 cycles (y-axis: 0% to 40%).]
INT (IPC Delta)

[Chart: increase in IPC on the integer benchmarks for runahead (INT) vs. all techniques (INT), with memory latency varying from 100 to 900 cycles (y-axis: 0% to 40%).]
FP (Executed Instructions)

[Chart: increase in executed instructions on the floating-point benchmarks for runahead (FP) vs. all techniques (FP), with memory latency varying from 100 to 900 cycles (y-axis: 0% to 65%).]
FP (IPC Delta)

[Chart: increase in IPC on the floating-point benchmarks for runahead (FP) vs. all techniques (FP), with memory latency varying from 100 to 900 cycles (y-axis: 0% to 65%).]
Early INV Wake-up

- Keep track of INV status of an instruction in the scheduler.
- Scheduler wakes up the instruction if any source is INV.
  + Enables faster progress during runahead mode by removing the useless INV instructions faster.
  - Increases the number of executed instructions.
  - Increases the complexity of the scheduling logic.
- Not worth implementing due to small IPC gain
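A minimal sketch of the modified wake-up rule follows; the two-source SchedulerEntry and its fields are illustrative assumptions.

```cpp
#include <array>

struct Source { bool ready; bool inv; };   // inv = depends on the pending L2 miss

struct SchedulerEntry {
    std::array<Source, 2> src;             // two-source uop, for illustration

    // Conventional wake-up: all sources must be ready.
    bool ready_conventional() const {
        return src[0].ready && src[1].ready;
    }

    // Early INV wake-up: the instruction is also woken as soon as any source is
    // INV, so it can be marked INV itself and removed from the window quickly.
    bool ready_early_inv() const {
        return ready_conventional() || src[0].inv || src[1].inv;
    }
};
```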