Performance-Aware Speculation Control
using Wrong Path Usefulness Prediction
Chang Joo Lee
Hyesoon Kim*
Onur Mutlu**
Yale N. Patt
HPS Research Group
University of Texas at Austin
*School of Computer Science
Georgia Institute of Technology
**Microsoft Research
2015-11-06
Outline
Motivation
Mechanism
Experimental Evaluation
Conclusion
Fetch Gating (Pipeline Gating)
Proposed by Manne et al. [ISCA98]
Stops fetching instructions on wrong path
to save energy.
Assumes wrong-path instructions consume energy
without contributing to performance.
Various fetch gating mechanisms
Baniasadi and Moshovos [ISLPED01],
Karkhanis et al. [ISLPED02], Aragon et al. [HPCA03],
Buyuktosunoglu et al. [GLSVLSI03],
Collins et al. [MICRO04]
Limitations of Previous Mechanisms
Hardware complexity
Branch confidence estimator,
changes to critical/power-hungry structures.
Additional hardware can offset
energy savings due to fetch gating.
Assumption
Wrong-path execution consumes energy
but is useless for performance.
Is Wrong Path Execution
Really Useless?
Perfect fetch gating
[Figure: IPC and energy delta (%) with perfect fetch gating for gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf, and AVG]
Performance of most benchmarks increases with perfect fetch gating.
mcf: Performance degrades by 30% and energy consumption increases by 15%.
parser: Energy consumption decreases by 28% but performance degrades by 5%.
Why Does Performance Degrade
with Perfect Fetch Gating?
[Figure: L2 cache fills (%) for gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, and twolf, split into correct path fills, unused wrong path fills, and used wrong path fills; bars annotated MPKI: 36.6 and MPKI: 1.5]
mcf: Almost all wrong-path L2 fills are used; memory intensive (MPKI: 36.6)
30% performance degradation with perfect fetch gating
parser: 37% of fills are used wrong path fills, 14% are unused wrong path fills
5% performance degradation with perfect fetch gating
Wrong path execution can prefetch useful data
Butler [Thesis93], Pierce and Mudge [IPPS94, MICRO96], Mutlu et al. [IEEE TC05]
Why Can Wrong Path Execution
Be Useful?
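[Diagram: hammock control flow. BB2 ends with the branch BR, whose Taken and Not-taken successors are BB3 and BB4. BB3 contains Load A, Load B, ..., JMP BB5; BB4 contains Load A, Load B, ...; BB5 contains Load C. BR is mispredicted, so BB3 and BB5 execute on the wrong path and their loads miss in the L2 cache. After misprediction recovery, the same loads on the correct path hit in the cache.]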
From mcf
Hammock structure within a
frequently executed loop
BR in BB2 is frequently
mispredicted
Since memory latency is large,
wrong path prefetching benefit
can be significant
Taking into account wrong-path usefulness is important
Outline
Motivation
Mechanism
Experimental Evaluation
Conclusion
Our Solution: Performance-Aware
Speculation Control
Hardware complexity: Simple low cost fetch gating mechanism
Wrong-path Usefulness: Low cost Wrong Path Usefulness Predictor (WPUP)
[Diagram: Performance-Aware Speculation Control block containing Fetch Gating and WPUP, connected to the Fetch Engine through the Lookup, Useful, Gate Enable, and Branch Count signals]
Fetch gate only when wrong path execution is useless
Our Fetch Gating Mechanism
Branch-count based mechanism
More branches → higher chance of misprediction.
Fetch gate if (# of Branches) > Threshold
Mispredictions show phase behavior.
Threshold is determined by branch prediction accuracy for a certain period.
Higher accuracy → Higher threshold
No need for complex logic (e.g. confidence estimator)
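The rule on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' hardware: the class and method names are hypothetical, "# of Branches" is assumed to mean fetched but unresolved branches, the measurement period is counted in resolved branches, and the bin boundaries are one reading of the "BP Acc(%) / Threshold" table on the Simulation Methodology slide.

```python
# (minimum accuracy %, threshold) pairs taken from the BP Acc(%) / Threshold
# table in the deck; which bin a boundary value falls into is an assumption.
THRESHOLDS = [
    (99.0, 18),
    (97.0, 16),
    (95.0, 13),
    (93.0, 12),
    (90.0, 11),
    (85.0, 7),
    (0.0, 3),
]


def threshold_for(accuracy_pct):
    """Map a branch prediction accuracy (percent) to a gating threshold."""
    for lower_bound, threshold in THRESHOLDS:
        if accuracy_pct >= lower_bound:
            return threshold
    return THRESHOLDS[-1][1]


class BranchCountGate:
    """Hypothetical model: gate fetch when too many branches are in flight."""

    def __init__(self, period=1000, initial_threshold=3):
        self.period = period              # resolved branches per period (assumed unit)
        self.threshold = initial_threshold
        self.in_flight = 0                # fetched but unresolved branches (assumed meaning)
        self.resolved = 0
        self.correct = 0

    def on_branch_fetched(self):
        self.in_flight += 1

    def on_branch_resolved(self, mispredicted, squashed_branches=0):
        # The resolved branch leaves the count, together with any younger
        # branches squashed by a misprediction.
        self.in_flight = max(0, self.in_flight - 1 - squashed_branches)
        self.resolved += 1
        if not mispredicted:
            self.correct += 1
        if self.resolved == self.period:
            # Higher accuracy in the last period gives a higher threshold.
            self.threshold = threshold_for(100.0 * self.correct / self.resolved)
            self.resolved = 0
            self.correct = 0

    def should_gate(self):
        return self.in_flight > self.threshold
```

The only state is a handful of small counters, which is the point the slide makes about avoiding a confidence estimator.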
Two WPUP Mechanisms
Branch PC-based WPUP (Fine grained)
Phase-based WPUP (Coarse grained)
Can be combined with other fetch gating mechanisms.
Branch PC-based WPUP
Basic idea
Identifies and records conditional branch PCs
that lead to useful wrong-path memory
references
If the fetched branch is recorded as useful, do
not fetch gate
Branch PC-based WPUP
Implementation
Fetch Engine
Latest Branch PC Register (LBPC, 16bits)
LBPC value carried through pipeline
Miss Status Holding Registers (MSHR)
Branch ID field (BID, 10bits)
Already used for branch misprediction recovery
Branch PC field (BPC, 16bits)
Wrong Path field (WP, 1bit)
WPUP Cache
4 way set-associative, No Data Store, LRU
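A tag-only, 4-way set-associative structure with LRU replacement could be modeled roughly as below. The slide gives the associativity, the absence of a data store, and the 16-bit branch PC; the number of sets, the indexing function, storing the whole 16-bit PC as the tag, and updating LRU state on a lookup hit are all assumptions made for illustration, and the names are hypothetical.

```python
class WPUPCache:
    """Hypothetical tag-only cache of branch PCs that led to useful wrong paths."""

    def __init__(self, num_sets=16, ways=4):
        self.num_sets = num_sets          # assumed; the slide does not give the size
        self.ways = ways
        # Each set is a list of tags ordered from least to most recently used.
        self.sets = [[] for _ in range(num_sets)]

    def _set_for(self, bpc):
        return self.sets[bpc % self.num_sets]

    def insert(self, bpc):
        bpc &= 0xFFFF                     # 16-bit branch PC field
        tags = self._set_for(bpc)
        if bpc in tags:
            tags.remove(bpc)              # refresh an existing entry
        elif len(tags) == self.ways:
            tags.pop(0)                   # evict the least recently used tag
        tags.append(bpc)

    def lookup(self, bpc):
        bpc &= 0xFFFF
        tags = self._set_for(bpc)
        if bpc in tags:
            tags.remove(bpc)              # a hit makes the tag most recently used
            tags.append(bpc)
            return True
        return False
```

Because there is no data store, a hit carries one bit of information: the branch was previously seen to lead to a useful wrong-path reference.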
Branch PC-Based WPUP (Training)
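[Diagram: BB2 ends with the branch BR at PC2, which has BID 2; LBPC holds PC2. BB3 (Load A, Load B, ..., JMP) and BB5 (Load C, ...) are on the mispredicted path; BB4 (Load A, Load B, ...) is reached after misprediction recovery.]
Load A and Load B in BB3 and Load C in BB5 miss in the L2 cache and are recorded with PC 2 and BID 2.
BID 2 from the branch unit on the misprediction sets WP in the matching entries.
MSHR
Addr  BID  BPC  WP
A     2    PC2  0 -> 1
B     2    PC2  0 -> 1
C     2    PC2  0 -> 1
Load A in BB4: MSHR hit; Wrong Path was useful.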
BPC 2 is stored in WPUP cache.
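The training sequence can be traced with a small model. This is a sketch under assumptions rather than the authors' design: names are hypothetical, a plain set stands in for the WPUP cache, entries are marked wrong-path only when their BID equals the mispredicted branch's BID (as in the slide's example), and any later request that hits an entry with WP set is treated as a correct-path request.

```python
from dataclasses import dataclass


@dataclass
class MSHREntry:
    addr: int
    bid: int          # ID of the latest branch when the miss was issued
    bpc: int          # 16-bit PC of that branch (copied from LBPC)
    wp: bool = False  # set once the branch is known to be mispredicted


class MSHRTrainer:
    """Hypothetical model of WPUP training through the L2 MSHRs."""

    def __init__(self):
        self.entries = {}                # addr -> MSHREntry
        self.useful_branch_pcs = set()   # stand-in for the WPUP cache

    def on_l2_miss(self, addr, bid, lbpc):
        entry = self.entries.get(addr)
        if entry is None:
            self.entries[addr] = MSHREntry(addr, bid, lbpc & 0xFFFF)
        elif entry.wp:
            # MSHR hit on an outstanding wrong-path miss: the wrong path
            # was useful, so remember the branch that led to it.
            self.useful_branch_pcs.add(entry.bpc)

    def on_misprediction(self, bid):
        # The branch unit reports the mispredicted branch's ID.
        for entry in self.entries.values():
            if entry.bid == bid:
                entry.wp = True

    def on_fill(self, addr):
        self.entries.pop(addr, None)     # the miss completed; free the entry
```

Replaying the slide's example: misses to A, B, and C are recorded with BID 2 and PC2, the misprediction of BID 2 sets WP on all three, and the later Load A hits its entry, which records PC2 as useful.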
Branch PC-Based WPUP (Prediction)
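[Diagram: BR at PC2 in BB2 is fetched and LBPC holds PC2. Fetch gate? The WPUP cache (address tags with LRU state) is looked up with PC2 and holds an entry for PC2, so wrong-path execution of BB3 (Load A, Load B, ..., JMP) and BB5 (Load C, ...) is allowed to continue instead of being gated; BB4 (Load A, Load B, ...) is the correct path.]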
Hit; Do not fetch gate.
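Put together with the branch-count rule, the prediction step reduces to a single condition. The function below is a hypothetical sketch: a Python set stands in for the WPUP cache, and the parameter names are illustrative rather than taken from the talk.

```python
def fetch_gate_enable(branch_count, threshold, lbpc, useful_branch_pcs):
    """Gate fetch only if the branch-count rule fires and WPUP does not object."""
    wpup_hit = (lbpc & 0xFFFF) in useful_branch_pcs   # lookup with the latest branch PC
    return branch_count > threshold and not wpup_hit


# A WPUP hit overrides the branch-count rule; a miss leaves it in force.
print(fetch_gate_enable(5, 3, 0x1234, {0x1234}))
print(fetch_gate_enable(5, 3, 0x1111, {0x1234}))
```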
Phase-based WPUP
Basic idea
Predict if the current phase will provide useful
wrong-path memory references
If so, do not fetch gate
[Figure: number of useful wrong-path references over time (x100K cycles)]
Phase-based WPUP
Implementation
Wrong Path Usefulness Counter
(WPUC, 5bits)
Incremented for each useful wrong-path
memory reference
Reset periodically
Do not fetch gate if WPUC > threshold
BPC fields or WPUP cache not needed
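A minimal sketch of the counter, assuming it saturates at its 5-bit maximum and that the reset period is measured in cycles. The threshold and period values are placeholders, not numbers from the talk, and the names are hypothetical.

```python
class PhaseWPUP:
    """Hypothetical model of the phase-based predictor built around WPUC."""

    WPUC_MAX = (1 << 5) - 1   # 5-bit counter; saturation is an assumption

    def __init__(self, threshold=8, period=100_000):
        self.threshold = threshold   # placeholder value
        self.period = period         # placeholder reset interval, in cycles
        self.wpuc = 0
        self.cycles = 0

    def on_useful_wrong_path_reference(self):
        self.wpuc = min(self.WPUC_MAX, self.wpuc + 1)

    def tick(self, cycles=1):
        self.cycles += cycles
        if self.cycles >= self.period:
            self.cycles = 0
            self.wpuc = 0            # periodic reset starts a new phase

    def allow_gating(self):
        # Do not fetch gate while the phase looks useful.
        return self.wpuc <= self.threshold
```

Compared with the branch PC-based scheme, the only predictor state here is the counter itself, which matches the slide's note that neither BPC fields nor a WPUP cache are required.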
Outline
Motivation
Mechanism
Experimental Evaluation
Conclusion
Simulation Methodology
Alpha ISA execution driven simulator
Baseline processor configuration
2GHz, 8-wide issue, out-of-order, 128-entry ROB
Hybrid branch predictor (64K-entry gshare and 64K-entry PAs)
11 stages (minimum branch misprediction penalty)
1MB, 8-way unified L2 cache
32 L2 MSHRs, 300 cycle memory latency
Stream prefetcher
Wattch power model: 100 nm, 1.2V technology
Manne’s fetch gating
Gating threshold: 3 low confidence branches
JRS confidence estimator (4K-entry, 4bit-MDC)
Tuned for the best energy-delay product
Branch Count-based fetch gating
BP Acc(%)   Threshold
100~99      18
99~97       16
97~95       13
95~93       12
93~90       11
90~85       7
85~0        3
Branch-Count Based Fetch Gating
[Figure: IPC delta (%) and energy delta (%) for perfect, manne, and fg-br across gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, and twolf, with hmean for IPC and AVG for energy]
Performance and energy savings are higher than Manne’s.
Manne’s and our fetch gating degrade performance of mcf and parser
WPUP Mechanisms
[Figure: IPC delta (%) and energy delta (%) for fg-br, fg-br/pc-wpup, and fg-br/phase-wpup across gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf, and AVG]
Improves performance of mcf and parser
Improves performance and energy savings compared to Manne’s
Hardware Cost
Performance-Aware Speculation Control vs. Manne’s Fetch Gating
Hardware cost       Fetch Gating   WPUP   Total
Manne               2049B          -      2049B
FG-BR/PC-WPUP       6B             260B   266B
FG-BR/PHASE WPUP    6B             45B    51B
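The totals in the table can be checked with a few lines of arithmetic. The per-mechanism byte counts are the ones on the slide; the observation that a 4K-entry, 4-bit JRS estimator accounts for 2048 of Manne's 2049 bytes is an inference from the methodology slide, and what the remaining byte covers is not stated in the deck.

```python
# 4K entries x 4 bits each, converted to bytes.
jrs_bytes = 4 * 1024 * 4 // 8

# (fetch gating bytes, WPUP bytes) as listed on the slide.
costs = {
    "Manne": (2049, 0),
    "FG-BR/PC-WPUP": (6, 260),
    "FG-BR/PHASE WPUP": (6, 45),
}

totals = {name: gating + wpup for name, (gating, wpup) in costs.items()}

# Storage reduction of each proposed scheme relative to Manne's, in percent.
reduction_pct = {
    name: round(100 * (1 - total / totals["Manne"]), 1)
    for name, total in totals.items()
    if name != "Manne"
}

print(jrs_bytes, totals, reduction_pct)
```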
Comparison with Manne’s Fetch Gating
[Figure: IPC delta (%) and energy delta (%) for manne, manne/pc-wpup, and fg-br/pc-wpup across gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf, and AVG]
WPUPs improve performance and energy efficiency of Manne’s
2.5% less performance degradation, 1.0% more energy savings
Energy-Delay Product
[Figure: EDP delta (%) for manne, fg-br/pc-wpup, and fg-br/phase-wpup across gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf, and AVG]
Improves Energy-Delay Product (2.6% compared to Manne’s)
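For readers less familiar with the metric, energy-delay product is energy multiplied by execution time, so an EDP delta follows from an energy delta and an IPC delta. The helper below is a sketch that assumes the same instruction count and clock frequency in both runs (so delay scales as 1/IPC); the function name and the input values are hypothetical and are not results from the talk.

```python
def edp_delta_pct(energy_delta_pct, ipc_delta_pct):
    """EDP change (%) given energy and IPC changes (%) relative to a baseline."""
    energy_ratio = 1 + energy_delta_pct / 100
    delay_ratio = 1 / (1 + ipc_delta_pct / 100)   # same work, same clock assumed
    return (energy_ratio * delay_ratio - 1) * 100


# Hypothetical inputs: 10% less energy at a 2% IPC loss.
print(round(edp_delta_pct(-10, -2), 2))   # -8.16
```

The example shows why a mechanism that saves energy but loses performance improves EDP by less than its raw energy savings.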
Conclusion
Performance-Aware Speculation Control
Branch count-based fetch gating
Simple and low cost.
Introduced Wrong Path Usefulness Prediction
Recovers performance loss due to fetch gating by
executing useful wrong-path instructions.
Can be combined with other fetch gating mechanisms.
Reduces performance loss due to fetch gating and
also saves energy.
Questions?