Performance-Aware Speculation Control
using Wrong Path Usefulness Prediction
Chang Joo Lee
Hyesoon Kim*
Onur Mutlu**
Yale N. Patt
HPS Research Group
University of Texas at Austin
*School of Computer Science
Georgia Institute of Technology
**Microsoft Research
2015-11-06

Outline

Motivation
Mechanism
Experimental Evaluation
Conclusion

Fetch Gating (Pipeline Gating)

Proposed by Manne et al. [ISCA98]
Stops fetching instructions on the wrong path to save energy.
Assumes wrong-path instructions consume energy but do not contribute to performance.

Various fetch gating mechanisms
Baniasadi and Moshovos [ISLPED01], Karkhanis et al. [ISLPED02], Aragon et al. [HPCA03], Buyuktosunoglu et al. [GLSVLSI03], Collins et al. [MICRO04]

Limitations of Previous Mechanisms

Hardware complexity
Branch confidence estimator, changes to critical/power-hungry structures.
→ Additional hardware can offset energy savings due to fetch gating.

Assumption
Wrong-path execution consumes energy but is useless for performance.

Is Wrong Path Execution Really Useless?

[Figure: IPC and energy delta (%) with perfect fetch gating for gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf, and AVG]

Performance of most benchmarks increases with perfect fetch gating.
mcf: Performance degrades by 30% and energy consumption increases by 15%.
parser: Energy consumption decreases by 28% but performance degrades by 5%.

Why Does Performance Degrade with Perfect Fetch Gating?

[Figure: L2 cache fills (%) for gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, and twolf, broken down into correct path fills, unused wrong path fills, and used wrong path fills; annotated with MPKI: 36.6 and MPKI: 1.5]

mcf: almost all wrong-path L2 fills are used; memory intensive (MPKI: 36.6)
→ 30% performance degradation with perfect fetch gating
parser: 37% of L2 fills are used wrong path fills, 14% are unused wrong path fills
→ 5% performance degradation with perfect fetch gating

Wrong path execution can prefetch useful data
Butler [Thesis93], Pierce and Mudge [IPPS94, MICRO96], Mutlu et al. [IEEE TC05]

Why Can Wrong Path Execution Be Useful?

[Figure: control-flow graph from mcf. BB2 ends in a branch (BR) with taken and not-taken successors BB3 and BB4. BB3 contains Load A, Load B, ..., JMP BB5; BB4 contains Load A, Load B, ...; BB5 contains Load C, .... The branch is mispredicted toward BB3, where the loads incur L2 cache misses; after misprediction recovery, execution continues in BB4 and the same loads are cache hits.]

From mcf
Hammock structure within a frequently executed loop
BR in BB2 is frequently mispredicted
Since memory latency is large, wrong path prefetching benefit can be significant

Taking into account wrong-path usefulness is important

Outline

Motivation
Mechanism
Experimental Evaluation
Conclusion

Our Solution: Performance-Aware Speculation Control

Hardware complexity: Simple low cost fetch gating mechanism
Wrong-path usefulness: Low cost Wrong Path Usefulness Predictor (WPUP)

[Figure: block diagram of Performance-Aware Speculation Control with Fetch Gating, WPUP, and Fetch Engine blocks and Lookup, Useful, Branch Count, and Gate Enable signals]

Fetch gate only when wrong path execution is useless

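The rule "fetch gate only when wrong path execution is useless" can be sketched as a single decision function. This is a minimal illustration, not the hardware from the talk: the function name and its three parameters are hypothetical stand-ins for the Branch Count, threshold, and Useful signals in the block diagram.

```python
def gate_enable(branch_count: int, threshold: int, wrong_path_useful: bool) -> bool:
    """Return True when the fetch engine should be gated.

    Hypothetical sketch: the fetch gating logic asks for gating when the
    branch count exceeds the threshold, and the WPUP vetoes that request
    when it predicts that wrong-path execution is useful.
    """
    wants_to_gate = branch_count > threshold
    return wants_to_gate and not wrong_path_useful


# Gating is requested and the wrong path is predicted useless: gate.
print(gate_enable(branch_count=10, threshold=7, wrong_path_useful=False))  # True
# Gating is requested but the wrong path is predicted useful: keep fetching.
print(gate_enable(branch_count=10, threshold=7, wrong_path_useful=True))   # False
```
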
Our Fetch Gating Mechanism

Branch-count based mechanism
More branches → higher chance of misprediction.
Fetch gate if (# of Branches) > Threshold

Mispredictions show phase behavior.
Threshold is determined by branch prediction accuracy for a certain period.
→ Higher accuracy → Higher threshold
→ No need for complex logic (e.g. confidence estimator)

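The branch-count rule above can be sketched as a small counter. The slide does not say exactly which branches are counted, so this sketch assumes "# of Branches" means branches that have been fetched but not yet resolved; the class and method names are hypothetical.

```python
class BranchCountGate:
    """Hypothetical branch-count fetch gate.

    Assumption (not stated on the slide): the count covers branches that
    have been fetched but are not yet resolved.
    """

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.branch_count = 0

    def on_branch_fetched(self) -> None:
        self.branch_count += 1

    def on_branch_resolved(self) -> None:
        # Guard against underflow if events arrive out of balance.
        if self.branch_count > 0:
            self.branch_count -= 1

    def should_gate(self) -> bool:
        # "Fetch gate if (# of Branches) > Threshold"
        return self.branch_count > self.threshold


gate = BranchCountGate(threshold=3)
for _ in range(4):
    gate.on_branch_fetched()
print(gate.should_gate())  # True: 4 > 3
gate.on_branch_resolved()
print(gate.should_gate())  # False: 3 > 3 does not hold
```

Only a counter and a comparator are needed, which is the point of the "no need for complex logic" bullet.
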
Two WPUP Mechanisms

Branch PC-based WPUP (Fine grained)
Phase-based WPUP (Coarse grained)
Can be combined with other fetch gating mechanisms.

Branch PC-based WPUP

Basic idea
Identifies and records conditional branch PCs that lead to useful wrong-path memory references
→ If the fetched branch is recorded as useful, do not fetch gate

Branch PC-based WPUP

Implementation
Fetch Engine
  Latest Branch PC Register (LBPC, 16 bits)
  LBPC value carried through pipeline
Miss Status Holding Registers (MSHR)
  Branch ID field (BID, 10 bits): already used for branch misprediction recovery
  Branch PC field (BPC, 16 bits)
  Wrong Path field (WP, 1 bit)
WPUP Cache
  4-way set-associative, no data store, LRU

Branch PC-Based WPUP (Training)

[Figure: training example on the hammock from mcf. BR 2 at PC2 in BB2 (LBPC: PC 2, BID 2) is mispredicted. Load A and Load B in BB3 and Load C in BB5 incur L2 cache misses and allocate MSHR entries with PC 2 and BID 2. BID 2 from the branch unit marks those entries as wrong path. After misprediction recovery, Load A in BB4 hits in the MSHR.]

MSHR
Addr   BID   BPC   WP
A      2     PC2   0 → 1
B      2     PC2   0 → 1
C      2     PC2   0 → 1

MSHR hit; wrong path was useful.
BPC 2 is stored in WPUP cache.

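The training flow above can be sketched in software as follows. This is an illustrative model under simplifying assumptions, not the hardware described in the talk: only entries whose BID exactly matches the mispredicted branch are marked, MSHR capacity and entry deallocation are ignored, and all class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class MshrEntry:
    addr: int          # missing block address
    bid: int           # branch ID recorded when the miss was issued
    bpc: int           # 16-bit branch PC copied from the LBPC
    wp: bool = False   # set once the entry is known to be wrong path


class Mshr:
    """Simplified MSHR model used only to illustrate WPUP training."""

    def __init__(self) -> None:
        self.entries: Dict[int, MshrEntry] = {}

    def allocate(self, addr: int, bid: int, lbpc: int) -> None:
        # Keep only the low 16 bits, matching the 16-bit BPC field.
        self.entries[addr] = MshrEntry(addr, bid, lbpc & 0xFFFF)

    def on_branch_mispredicted(self, bid: int) -> None:
        # Simplification: mark only entries tagged with this exact BID.
        for entry in self.entries.values():
            if entry.bid == bid:
                entry.wp = True

    def on_correct_path_access(self, addr: int) -> Optional[int]:
        """Return the BPC to train on if a wrong-path entry is hit, else None."""
        entry = self.entries.get(addr)
        if entry is not None and entry.wp:
            return entry.bpc
        return None


mshr = Mshr()
for addr in (0xA0, 0xB0, 0xC0):           # loads A, B, C miss under branch BID 2
    mshr.allocate(addr, bid=2, lbpc=0x401234)
mshr.on_branch_mispredicted(bid=2)        # WP bits go from 0 to 1
print(hex(mshr.on_correct_path_access(0xA0)))  # 0x1234
```

The returned branch PC is what would be stored in the WPUP cache.
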
Branch PC-Based WPUP (Prediction)

[Figure: prediction example on the same hammock. When BR 2 at PC2 in BB2 is fetched (LBPC: PC 2), the fetch-gate decision looks up PC2 in the WPUP cache (address tags with LRU state). BB3 (Load A, Load B, ..., JMP) and BB5 (Load C, ...) form the wrong-path execution; BB4 (Load A, Load B, ...) is the other successor.]

WPUP cache hit; do not fetch gate.

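A tag-only, 4-way set-associative cache with LRU replacement can be sketched as below. The slides give the associativity and the 16-bit branch PC but not the number of sets or the index function, so `num_sets=32` and modulo indexing are assumptions, and the class and method names are hypothetical.

```python
from collections import OrderedDict


class WpupCache:
    """Tag-only set-associative cache with LRU replacement (illustrative).

    Assumptions: num_sets and the modulo index function are made up for
    this sketch; only "4-way, no data store, LRU" comes from the slides.
    """

    def __init__(self, num_sets: int = 32, ways: int = 4) -> None:
        self.num_sets = num_sets
        self.ways = ways
        # One OrderedDict per set: keys are tags, order tracks recency.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def _locate(self, branch_pc: int):
        pc = branch_pc & 0xFFFF                     # 16-bit branch PC
        return self.sets[pc % self.num_sets], pc // self.num_sets

    def train(self, branch_pc: int) -> None:
        """Record a branch PC whose wrong path produced a useful reference."""
        cache_set, tag = self._locate(branch_pc)
        if tag in cache_set:
            cache_set.move_to_end(tag)              # refresh recency
            return
        if len(cache_set) >= self.ways:
            cache_set.popitem(last=False)           # evict least recently used
        cache_set[tag] = None                       # no data store, tag only

    def is_useful(self, branch_pc: int) -> bool:
        """A hit means: do not fetch gate after this branch."""
        cache_set, tag = self._locate(branch_pc)
        if tag in cache_set:
            cache_set.move_to_end(tag)
            return True
        return False


wpup = WpupCache()
wpup.train(0x1234)
print(wpup.is_useful(0x1234))  # True: hit, do not fetch gate
print(wpup.is_useful(0x4321))  # False: miss, fetch gating stays allowed
```
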
Phase-based WPUP

Basic idea
Predict if the current phase will provide useful wrong-path memory references
If so, do not fetch gate

[Figure: number of useful wrong-path references (0 to 800) over time (0 to 2000, x100K cycles)]

Phase-based WPUP

Implementation
Wrong Path Usefulness Counter (WPUC, 5 bits)
→ Incremented for each useful wrong-path memory reference
→ Reset periodically
→ Do not fetch gate if WPUC > threshold
→ BPC fields or WPUP cache not needed

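The counter described above can be sketched as follows. The slides give the 5-bit width but not the threshold, the reset period, or the overflow behavior, so the values used here are placeholders and saturation at 31 is an assumption; the names are hypothetical.

```python
class PhaseWpup:
    """Illustrative phase-based WPUP built around a 5-bit counter (WPUC).

    Assumptions: the counter saturates rather than wraps, and the
    threshold and reset period are placeholder values.
    """

    def __init__(self, threshold: int, period_cycles: int, bits: int = 5) -> None:
        self.threshold = threshold
        self.period_cycles = period_cycles
        self.max_count = (1 << bits) - 1   # 31 for a 5-bit counter
        self.wpuc = 0
        self.elapsed = 0

    def on_useful_wrong_path_reference(self) -> None:
        if self.wpuc < self.max_count:
            self.wpuc += 1

    def tick(self, cycles: int = 1) -> None:
        """Advance time and reset the counter at each period boundary."""
        self.elapsed += cycles
        if self.elapsed >= self.period_cycles:
            self.elapsed = 0
            self.wpuc = 0

    def allow_gating(self) -> bool:
        # "Do not fetch gate if WPUC > threshold"
        return self.wpuc <= self.threshold


phase = PhaseWpup(threshold=2, period_cycles=100)
for _ in range(3):
    phase.on_useful_wrong_path_reference()
print(phase.allow_gating())  # False: 3 > 2, the phase looks useful
phase.tick(100)              # period ends, counter resets
print(phase.allow_gating())  # True
```
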
Outline

Motivation
Mechanism
Experimental Evaluation
Conclusion

Simulation Methodology

Alpha ISA execution driven simulator
Baseline processor configuration
  2GHz, 8-wide issue, out-of-order, 128-entry ROB
  Hybrid branch predictor (64K-entry gshare and 64K-entry PAs)
  11 stages (minimum branch misprediction penalty)
  1MB, 8-way unified L2 cache
  32 L2 MSHRs, 300 cycle memory latency
  Stream prefetcher
Wattch power model: 100 nm, 1.2V technology
Manne’s fetch gating
  Gating threshold: 3 low confidence branches
  JRS confidence estimator (4K-entry, 4bit-MDC)
  Tuned for the best energy-delay product
Branch Count-based fetch gating

BP Acc (%)   100~99   99~97   97~95   95~93   93~90   90~85   85~0
Threshold    18       16      13      12      11      7       3

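The accuracy-to-threshold table above maps directly onto a lookup. In this sketch the threshold values come from the table, while the handling of boundary values (an accuracy of exactly 99% falls in the 100~99 bucket) and of a period with no branches are assumptions; the function and constant names are hypothetical.

```python
# (lower bound of the accuracy bucket in %, gating threshold), from the table.
THRESHOLDS = [
    (99.0, 18),
    (97.0, 16),
    (95.0, 13),
    (93.0, 12),
    (90.0, 11),
    (85.0, 7),
    (0.0, 3),
]


def gating_threshold(correct_predictions: int, total_branches: int) -> int:
    """Pick the gating threshold for the next period from measured accuracy.

    Assumptions: bucket lower bounds are inclusive, and a period with no
    branches uses the highest (least aggressive) threshold.
    """
    if total_branches == 0:
        return THRESHOLDS[0][1]
    accuracy = 100.0 * correct_predictions / total_branches
    for lower_bound, threshold in THRESHOLDS:
        if accuracy >= lower_bound:
            return threshold
    return THRESHOLDS[-1][1]


print(gating_threshold(995, 1000))  # 18 (99.5% accuracy)
print(gating_threshold(960, 1000))  # 13 (96.0% accuracy)
print(gating_threshold(800, 1000))  # 3  (80.0% accuracy)
```
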
Branch-Count Based Fetch Gating

[Figure: IPC delta (%) and energy delta (%) for perfect, manne, and fg-br across gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf, and the average (hmean for IPC, AVG for energy)]

Performance and energy savings are higher than Manne’s.
Manne’s and our fetch gating degrade performance of mcf and parser.

WPUP Mechanisms

[Figure: IPC delta (%) and energy delta (%) for fg-br, fg-br/pc-wpup, and fg-br/phase-wpup across gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf, and AVG]

Improves performance of mcf and parser
Improves performance and energy savings compared to Manne’s

Hardware Cost
Performance-Aware Speculation Control vs. Manne’s Fetch Gating

Hardware cost       Fetch Gating   WPUP   Total
Manne               2049B          -      2049B
FG-BR/PC-WPUP       6B             260B   266B
FG-BR/PHASE WPUP    6B             45B    51B

Comparison with Manne’s Fetch Gating

[Figure: IPC delta (%) and energy delta (%) for manne, manne/pc-wpup, and fg-br/pc-wpup across gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf, and AVG]

WPUPs improve performance and energy efficiency of Manne’s
2.5% less performance degradation, 1.0% more energy savings

Energy-Delay Product

[Figure: EDP delta (%) for manne, fg-br/pc-wpup, and fg-br/phase-wpup across gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf, and AVG]

Improves Energy-Delay Product (2.6% compared to Manne’s)

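For readers relating the IPC and energy deltas to the EDP delta, a small worked calculation may help. It assumes the same instruction count and clock frequency in both runs, so that delay scales as 1/IPC; the function name and the example inputs are made up for illustration and are not results from the talk.

```python
def edp_delta_pct(ipc_delta_pct: float, energy_delta_pct: float) -> float:
    """Percent change in energy-delay product relative to the baseline.

    Assumes identical instruction count and clock frequency, so the
    delay ratio is the inverse of the IPC ratio.
    """
    delay_ratio = 1.0 / (1.0 + ipc_delta_pct / 100.0)
    energy_ratio = 1.0 + energy_delta_pct / 100.0
    return (energy_ratio * delay_ratio - 1.0) * 100.0


# Hypothetical inputs: 5% lower IPC and 10% lower energy.
print(round(edp_delta_pct(-5.0, -10.0), 2))  # -5.26
```

A negative EDP delta is an improvement: here the energy saving outweighs the longer execution time.
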
Conclusion

Performance-Aware Speculation Control
Branch count-based fetch gating
  Simple and low cost.
Introduced Wrong Path Usefulness Prediction
  Recovers performance loss due to fetch gating by executing useful wrong-path instructions.
  → Can be combined with other fetch gating mechanisms.
Reduces performance loss due to fetch gating and also saves energy.

Questions?