EOLE: Paving the Way for an Effective
Implementation of Value Prediction
Arthur Perais & André Seznec - ISCA 2014
7/18/2015
Increasing Sequential Performance is Hard.
- The "natural" way: increase the superscalar width.
  This raises complexity, power, and timing issues.
- Currently: try to maximize the utilization of the resources we can implement:
• Branch prediction to feed the execution core.
• Memory dependency prediction to increase ILP.
- Value Prediction to increase ILP.
Outline
1. Value Prediction Today.
2. Introducing the EOLE Architecture.
3. Lighter Value Prediction with EOLE.
4. Results.
5. Conclusion.
1
Value Prediction Today
What we have.
Value Prediction [Lipasti96][Mendelson97]
- Breaks true data dependencies to extract more ILP.
[Figure: an example sequence of instructions I1 to I5, shown as is and then as it becomes if I3 is predicted.]
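To make the effect on ILP concrete, here is a minimal sketch that measures the longest dependency chain with and without a predicted instruction. The dependency edges between I1 and I5 are purely hypothetical (the original diagram is not available), and `critical_path` is a name introduced only for this illustration.

```python
from functools import lru_cache

# Hypothetical dependency graph: each instruction maps to the producers it reads.
# These edges are an assumption made for illustration, not the slide's example.
DEPS = {
    "I1": [],
    "I2": ["I1"],
    "I3": ["I2"],
    "I4": ["I3"],
    "I5": ["I3"],
}


def critical_path(deps, predicted=frozenset()):
    """Length, in instructions, of the longest dependency chain.

    Consumers of a predicted instruction do not wait for it, so edges that
    start at a predicted producer are ignored. The predicted instruction
    itself still waits for its own producers.
    """
    @lru_cache(maxsize=None)
    def depth(insn):
        waits = [depth(p) for p in deps[insn] if p not in predicted]
        return 1 + max(waits, default=0)

    return max(depth(insn) for insn in deps)


print(critical_path(DEPS))                      # 4: I1 -> I2 -> I3 -> I4
print(critical_path(DEPS, frozenset({"I3"})))   # 3: I4 and I5 no longer wait for I3
```

With these assumed edges, predicting I3 lets I4 and I5 start right away, so the longest chain shrinks from four instructions to three.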
Why is VP not Implemented Yet?
- Predictors:
• Stride-based and FCM: how do you deal with the speculative window?
• FCM: timing issues with instructions in tight loops.
- Validation & Recovery:
• Validate in the OoO core.
• Selective replay to absorb the cost of a misprediction.
- Too complex.
A First Solution [Perais&Seznec@HPCA14]
- A new predictor leveraging branch history, VTAGE:
• No speculative window required.
• No issues with tight loops.
- Validation & Recovery at retirement:
• Validate outside the OoO core, in-order.
• Recover by squashing, which is affordable when predictor accuracy is very high.
- Actually still too complex.
The – Slightly – Hidden Costs of VP
[Figure: pipeline with Fetch, a value predictor (VPredict) fed by the PC, and an n-issue out-of-order engine (ROB, IQ, PRF, FUs); Validation + Squashing take place @commit.]
More ports on the PRF:
• Write ports to write predictions.
• Read ports to validate/train.
Let’s Count.
- Baseline 8-wide, 6-issue:
• 12 read ports, 6 write ports (2 source reads and 1 result write per issued instruction).
- VP 8-wide, 6-issue:
• 12R/6W for OoO execution.
• 8W to write 8 predictions/cycle in the PRF.
• 8R to validate/train 8 instructions/cycle.
- 12R/6W vs. 20R/14W!
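The port arithmetic can be written down as a small sketch. The function name `prf_ports` and its parameters are hypothetical; the model assumes two source reads and one result write per issued instruction, plus one prediction write and one validation read per instruction of pipeline width.

```python
def prf_ports(width, issue_width, value_prediction):
    """Return (read_ports, write_ports) for a monolithic PRF."""
    reads = 2 * issue_width      # two source operands per issued instruction
    writes = issue_width         # one result per issued instruction
    if value_prediction:
        writes += width          # write one prediction per instruction per cycle
        reads += width           # read one result per instruction to validate/train
    return reads, writes


print(prf_ports(8, 6, value_prediction=False))   # (12, 6)
print(prf_ports(8, 6, value_prediction=True))    # (20, 14)
```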
2
The EOLE Architecture
What we propose.
Leveraging the – Slightly – Hidden Benefits of VP
- Value Prediction provides:
• Instructions with ready operands flowing from the value predictor.
• Predicted instructions that do not need to be executed before retirement.
- Offload execution to other, in-order parts of the core to reduce complexity in the out-of-order core, and save PRF ports in the process.
Introducing Early Execution
[Figure: front-end pipeline (Fetch, Decode, Rename, Dispatch) feeding the out-of-order engine, with the value predictor (VPred) and an Early Execution block next to Rename.]
Execute ready single-cycle instructions in parallel with Rename, in-order.
Do not dispatch them to the IQ.
Early Execution Hardware
[Figure: Early Execution block, with inputs from Decode and the Value Predictor and outputs to Dispatch.]
Values come from:
• Decode (Immediate)
• Value Predictor
• Bypass Network
- Execute what you can, and write the results to the PRF through the ports provisioned for VP.
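A minimal sketch of the early-execution decision, under simplifying assumptions: the `Insn` record, the `early_execute` function, and the register names are hypothetical, every ALU operation is modelled as an addition, and the bypass network is assumed to cover only the current rename group.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Insn:
    dest: Optional[str]                 # destination register, if any
    srcs: Tuple[str, ...] = ()          # source registers
    imm: int = 0                        # immediate provided by Decode
    single_cycle: bool = True           # simple ALU operation?
    prediction: Optional[int] = None    # value from the predictor, if any


def early_execute(group):
    """Walk one rename group in order and early-execute what is ready.

    Returns (indices of early-executed instructions, register values known
    to the front end). Every ALU operation is modelled as an addition.
    """
    known = {}    # values reachable through the predictor or the bypass network
    early = []
    for idx, insn in enumerate(group):
        ready = insn.single_cycle and all(s in known for s in insn.srcs)
        if ready:
            value = sum(known[s] for s in insn.srcs) + insn.imm
            early.append(idx)                 # never dispatched to the IQ
        else:
            value = insn.prediction           # None means: wait for the OoO engine
        if insn.dest is not None:
            if value is None:
                known.pop(insn.dest, None)    # value unknown until OoO execution
            else:
                known[insn.dest] = value
    return early, known


group = [
    Insn(dest="r1", imm=5),                                             # immediate only
    Insn(dest="r2", srcs=("r9",), single_cycle=False, prediction=100),  # predicted load
    Insn(dest="r3", srcs=("r1", "r2")),                                 # both operands known
    Insn(dest="r4", srcs=("r7",)),                                      # r7 unknown: goes to the IQ
]
early, known = early_execute(group)
print(early)   # [0, 2]
print(known)   # {'r1': 5, 'r2': 100, 'r3': 105}
```

In this toy group the third instruction is early-executed because one operand comes from an early result (bypass) and the other from the value predictor.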
Introducing Late Execution
[Figure: back-end pipeline with the out-of-order engine followed by a Late Execution/Validation stage (with comparators, CMP) and Retire; predictions come from VPredict through a Prediction FIFO Queue.]
Execute single-cycle predicted instructions just before retirement, in-order.
Do not dispatch them to the IQ either.
Late Execution Hardware
[Figure: Late Execution and Validation hardware: Late Exec Control, Late Execution, the Prediction FIFO Queue, the PRF, and comparators (CMP) producing I1Correct and I2Correct, which go to VPred.]
- Execute just before validation and retirement by leveraging the ports provisioned for validation.
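The commit-time step can be sketched as follows. This is a toy model, not the hardware described in the talk: `CommitEntry` and `validate_at_commit` are hypothetical names, the late-execution ALU is modelled as an addition, and the PRF is a plain dictionary.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class CommitEntry:
    dest: str
    srcs: Tuple[str, ...]
    prediction: Optional[int]   # head of the prediction FIFO, None if not predicted
    late_executable: bool       # single-cycle and predicted: skipped the OoO engine


def validate_at_commit(entry, prf):
    """Return True if the instruction can retire, False if a squash is needed."""
    if entry.prediction is None:
        return True                                   # nothing to validate
    if entry.late_executable:
        actual = sum(prf[s] for s in entry.srcs)      # late execution (toy adder)
    else:
        actual = prf[entry.dest]                      # result produced out of order
    prf[entry.dest] = actual                          # keep the computed value
    return actual == entry.prediction                 # the comparator (CMP)


prf = {"r1": 2, "r2": 3, "r3": 5, "r4": 9}
ok = validate_at_commit(CommitEntry("r3", ("r1", "r2"), 5, True), prf)
bad = validate_at_commit(CommitEntry("r3", ("r1", "r2"), 7, True), prf)
ooo = validate_at_commit(CommitEntry("r4", ("r1",), 9, False), prf)
print(ok, bad, ooo)   # True False True
```

The late-executed path reads two source operands where plain validation reads one result, which is why late execution puts extra pressure on the read ports.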
{Early | OoO | Late} Execution: EOLE
- Far fewer instructions enter the IQ, so we may be able to reduce the issue width:
• Simpler IQ.
• Fewer ports on the PRF.
• Less bypass.
- Simpler OoO.
- Non-critical predictions become useful, as the instructions can be late-executed.
- What about hardware cost?
Hardware Cost of EOLE
- Early Execution:
• A single rank of simple ALUs.
• Associated bypass network.
• No additional PRF ports.
- Late Execution & Validation:
• Rank of simple ALUs and comparators (to validate).
• No bypass.
• The n read ports needed to validate become 2n to late-execute n instructions per cycle (two source operands each): 16R for an 8-wide pipeline.
- From 20R/14W for an 8-wide, 6-issue pipeline with VP, we now need 28R/14W! The baseline needs only 12R/6W…
3
Lighter Value Prediction with EOLE
What we can optimize.
Reducing the Issue Width
- If fewer instructions enter the IQ, then we can reduce the issue width (and maybe the IQ size):
• From 6 to 4 (-4R and -2W): 24R/12W.
- The remaining issue capacity is offloaded to the Early/Late Execution stages.
- Still too many ports.
Banking the Physical Register File
- Prediction and Validation are done in-order.
- Bank the PRF and assign consecutive predictions to consecutive banks.
[Figure: a PRF split into Bank 0 to Bank 3, receiving 8 predictions/cycle and serving 8 validations/cycle.]
- 2 write ports per bank instead of 8 for a 4-bank file.
- Read port savings are not as straightforward because of Late Execution.
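A short sketch of why in-order allocation bounds the write ports per bank. The mapping `bank = register index mod number of banks` and the function `writes_per_bank` are assumptions made for this illustration; the slide only states that consecutive predictions go to consecutive banks.

```python
from collections import Counter


def writes_per_bank(first_reg, n_predictions, n_banks):
    """Count prediction writes per bank when consecutive destination
    registers are taken from consecutive banks (index mod n_banks)."""
    return Counter((first_reg + i) % n_banks for i in range(n_predictions))


counts = writes_per_bank(first_reg=5, n_predictions=8, n_banks=4)
print(max(counts.values()))   # 2
```

Whatever the starting register, 8 consecutive predictions spread over 4 banks never put more than 2 writes on the same bank in a cycle.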
Read Port Sharing
- 8 instructions can be validated with 2R per bank…
- …but Late Execution needs 16R per bank to process 8 instructions (2 source operands each, which in the worst case all sit in the same bank).
- Fortunately, not all instructions are predictable (e.g. stores) or late-executable (e.g. loads).
- Constrain the number of read ports and share them between late execution and validation as needed: 4R per bank is a good tradeoff.
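One possible way to picture the sharing is an in-order arbiter that stops at the first instruction whose reads no longer fit. This is only a sketch under assumptions: the function `handled_this_cycle`, the stall-until-next-cycle policy, and the example bank numbers are hypothetical and not taken from the talk.

```python
def handled_this_cycle(reads, n_banks=4, ports_per_bank=4):
    """reads[i] lists the banks that instruction i must read at commit
    (one read to validate, or two source reads to late-execute).
    Returns how many leading instructions fit in the read-port budget."""
    used = [0] * n_banks
    for count, banks in enumerate(reads):
        needed = [banks.count(b) for b in range(n_banks)]
        if any(u + n > ports_per_bank for u, n in zip(used, needed)):
            return count                     # the rest waits for the next cycle
        used = [u + n for u, n in zip(used, needed)]
    return len(reads)


validation_only = [[i % 4] for i in range(8)]       # 8 results in consecutive banks
mixed = [[0], [1, 1], [], [1, 1], [1], [2]]         # two late executions hitting bank 1
print(handled_this_cycle(validation_only))   # 8
print(handled_this_cycle(mixed))             # 4
```

In the second group, the fifth instruction would need a fifth read on bank 1, so only the first four are handled in that cycle.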
Let’s Count, Again.
- 4-issue out-of-order engine (8R/4W per bank).
- 8 predictions per cycle (2W per bank).
- Constrained late-execution/validation (4R per bank).
- 12R/6W per bank in total.
- From 28R/14W, we now only need 12R/6W! This is the same number of ports as the PRF without VP, while the issue width is 33% smaller.
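The successive port counts can be reproduced with a small sketch. Both function names and their parameters are hypothetical; the model assumes two source reads per issued or late-executed instruction and one write per issued or predicted instruction.

```python
def unbanked_eole_ports(width, issue_width):
    """(reads, writes) for a monolithic PRF with unconstrained Late Execution."""
    reads = 2 * issue_width + 2 * width   # OoO sources + late-execution sources
    writes = issue_width + width          # OoO results + predictions/early results
    return reads, writes


def banked_eole_ports(width, issue_width, n_banks, shared_commit_reads):
    """(reads, writes) per bank with banking and shared commit-time read ports."""
    reads = 2 * issue_width + shared_commit_reads   # the OoO engine may hit any bank
    writes = issue_width + width // n_banks         # predictions are spread over banks
    return reads, writes


print(unbanked_eole_ports(8, 6))        # (28, 14)
print(unbanked_eole_ports(8, 4))        # (24, 12)
print(banked_eole_ports(8, 4, 4, 4))    # (12, 6)
```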
Putting It All Together
[Figure: the complete EOLE pipeline. Fetch (PC) and VPredict feed Rename and Early Exec; the out-of-order engine (ROB, IQ, FUs), now less than n-issue, sits on a PRF split into Bank 0 to Bank 3; Validation + Late Execution and Squashing happen @commit. Callouts:
• Predictions flow through Early Execution.
• Predictions/Early results are written to the PRF at Dispatch.
• Regular out-of-order execution.
• Single-cycle predicted instructions are late-executed by reading operands from the PRF.
• All predicted instructions are validated at commit time.]
Putting It All Together
- EOLE provides a way to nullify the pressure applied by VP on the PRF (assuming banking is cheap).
- It reduces the complexity of the OoO engine itself: smaller issue width, simpler Wakeup & Select, less bypass.
- EOLE needs VP to provide instructions to early/late execute, while VP needs EOLE to mitigate the complexity it introduces.
- The two features are complementary.
4
Experimental Results
What we get.
Experimental Framework
- Simulator: gem5 (x86_64).
- 4GHz, 8-wide, 6-issue, 20-cycle minimum branch misprediction penalty, 192-entry ROB, 64-entry IQ, 48-entry LQ/48-entry SQ, 256 INT/256 FP registers, 32KB L1D/L1I, 2MB unified L2 with stride prefetcher, 4GB DDR3-1600 (min. ~75 cycles).
- Hybrid predictor: VTAGE (8K-entry base predictor + 6 1K-entry tagged components) + 8K-entry 2-delta Stride, with Forward Probabilistic Counters [Perais&Seznec14].
Experimental Framework
- Single-thread benchmarks: subset of SPEC’00 and SPEC’06 (19 benchmarks), ref inputs.
- Simpoint: one slice per benchmark, warm up for 50M instructions, run for 100M instructions.
Speedup over Baseline 8-wide/6-issue
[Figure: speedup of VTAGE-2D-Str (a.k.a. Simple_VP_6I_64IQ) over the baseline; y-axis ticks from 0.9 to 1.6.]
Early Executed – Late Executed
[Figure: chart with series Early Executed, LE: High-Confidence Branches, and LE: Value-predicted; y-axis ticks from 0 to 0.7. Annotation: "Low EOLE potential".]
Reducing the Issue Width
[Figure: chart comparing Simple_VP_4I_64IQ, EOLE_4I_64IQ, and EOLE_6I_64IQ; y-axis ticks from 0.8 to 1.4. Annotations: "Slowdown in almost all cases", "Slight speedup in general", "Slowdown in a single case".]
Reducing the IQ Size
[Figure: chart comparing Simple_VP_6I_48IQ, EOLE_6I_48IQ, and EOLE_6I_64IQ; y-axis ticks from 0.85 to 1.2. Annotations: "Slowdown in all cases", "Noticeable slowdown in many cases".]
Limited Issue and PRF Ports
[Figure: chart comparing No_VP_6I_64IQ, EOLE_4I_64IQ, and EOLE_4I_64IQ_4Rports_4banks; y-axis ticks from 0.6 to 1.2. Annotations: "Without VP", "Same performance for 4R/bank as ideal".]
5
Concluding Remarks
What remains to be done.
VP In a Processor with EOLE?
- Pros:
• No additional ports on the PRF, assuming enough banks.
• Simpler out-of-order engine.
• Performance very similar to the baseline VP pipeline.
- Cons:
• Additional hardware (Early and Late Execution, Predictor).
• Impact on power consumption is unclear.
Future Work
- What about the predictor?
• 8-wide fetch -> 8 predictions/cycle -> 8-ported tables?
• Hybrid with Stride: how do you implement the speculative window?
- The remaining complexity is really in the predictor.
Questions?