EOLE: Paving the Way for an Effective
Implementation of Value Prediction
Arthur Perais & André Seznec
Arthur Perais & André Seznec - ISCA 2014
7/18/2015
Increasing Sequential Performance is Hard.
"Natural" way: increase the superscalar width.
Complexity, power, timing issues.
Currently: try to maximize the utilization of the
resources we can implement:
• Branch prediction to feed the execution core.
• Memory dependency prediction to increase ILP.
Value Prediction to increase ILP.
Outline
1. Value Prediction Today.
2. Introducing the EOLE Architecture.
3. Lighter Value Prediction with EOLE.
4. Results.
5. Conclusion.
1. Value Prediction Today
What we have.
Value Prediction [Lipasti96][Mendelson97]
Breaks true data dependencies to extract more ILP,
e.g.:
[Figure: dependent instructions I1 to I5, and the same instructions as they become if I3 is predicted.]
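The effect on the critical path can be sketched with a small dependency-graph model. The slide's figure is not recoverable, so the five-instruction chain below is hypothetical, as are the names `deps` and `critical_path` and the one-cycle latency. The only idea taken from the slide is that consumers of a predicted instruction no longer wait for it.

```python
def critical_path(deps, predicted=frozenset(), latency=1):
    """Length in cycles of the longest dependency chain.

    deps maps each instruction to the instructions it reads from.
    Consumers of a predicted instruction do not wait for it, because
    they read the predicted value instead.
    """
    finish = {}

    def finish_time(insn):
        if insn not in finish:
            # A consumer waits only for producers that are not predicted.
            waits = [finish_time(p) for p in deps[insn] if p not in predicted]
            finish[insn] = max(waits, default=0) + latency
        return finish[insn]

    return max(finish_time(i) for i in deps)


# Hypothetical chain: I1 -> I2 -> I3 -> I4 -> I5 (each reads the previous one).
deps = {"I1": [], "I2": ["I1"], "I3": ["I2"], "I4": ["I3"], "I5": ["I4"]}

print(critical_path(deps))                    # 5
print(critical_path(deps, predicted={"I3"}))  # 3
```

In this toy chain, I3 still executes (to validate the prediction), but I4 and I5 start immediately, so the longest chain drops from five cycles to three.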
Why is VP not Implemented Yet?
Predictors:
• Stride-based and FCM: How do you deal with the
speculative window?
• FCM: Timing issues with instructions in tight loops.
Validation & Recovery:
• Validate in the OoO core.
• Selective replay to absorb the cost of a misprediction.
Too complex.
A First Solution [Perais&Seznec@HPCA14]
A new predictor leveraging branch history, VTAGE:
• No speculative window required.
• No issues with tight loops.
Validation & Recovery at retirement:
• Validate outside the OoO core, in-order.
• Recover by squashing, relying on very high predictor accuracy.
Actually still too complex.
The – Slightly – Hidden Costs of VP
[Figure: pipeline with an n-issue out-of-order engine (Fetch, ROB, IQ, PRF, FUs), a value predictor (VPredict) with the PC, and Validation + Squashing @commit.]
More ports on the PRF:
• Write ports to write predictions.
• Read ports to validate/train.
Let’s Count.
Baseline 8-wide, 6-issue:
• 12 read ports, 6 write ports.
VP 8-wide, 6-issue:
• 12R/6W for OoO execution.
• 8W to write 8 predictions/cycle in the PRF.
• 8R to validate/train 8 instructions/cycle.
12R/6W vs. 20R/14W!
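The count above can be reproduced with a few lines of arithmetic. This is only a sketch: the function name `prf_ports` is made up for illustration, and the rule of 2 reads and 1 write per issue slot is inferred from the slide's 12R/6W figure for a 6-issue baseline.

```python
def prf_ports(issue_width, pipeline_width, value_prediction=False):
    """Return (read_ports, write_ports) for a monolithic PRF.

    Assumes 2 source reads and 1 result write per issued instruction,
    which matches the slide's 12R/6W for a 6-issue baseline.
    """
    reads = 2 * issue_width
    writes = issue_width
    if value_prediction:
        writes += pipeline_width  # one prediction written per instruction
        reads += pipeline_width   # one result read back to validate/train
    return reads, writes


print(prf_ports(6, 8))                         # (12, 6)
print(prf_ports(6, 8, value_prediction=True))  # (20, 14)
```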
2. The EOLE Architecture
What we propose.
Leveraging the – slightly – Hidden Benefits of VP
Value Prediction provides:
• Instructions with ready operands flowing from the value
predictor.
• Predicted instructions not needing to be executed before
retirement.
Offload execution to some other in-order parts of the core
to reduce complexity in the out-of-order core. Save PRF
ports in the process.
Introducing Early Execution
[Figure: pipeline stages Fetch, Decode, Rename, Dispatch, then the out-of-order engine, with VPred and Early Execution blocks.]
Execute ready single-cycle instructions in parallel with Rename, in-order.
Do not dispatch to the IQ.
Early Execution Hardware
[Figure: Early Execution block, with inputs from Decode and the Value Predictor and output to Dispatch.]
Values come from:
• Decode (Immediate)
• Value Predictor
• Bypass Network
Execute what you can, write to the PRF with the ports provisioned for VP.
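The three operand sources listed above can be illustrated with a toy model of one rename group. Everything here is a simplification made for illustration: the `Insn` record, the `early_execute` function, and the choice of `add` as the only single-cycle operation are hypothetical, and the bypass is modeled as a dictionary that only covers the current group.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Insn:
    dest: str
    op: str                          # "add" or "load" in this sketch
    srcs: tuple                      # register names (str) or immediates (int)
    predicted: Optional[int] = None  # value from the predictor, if any


def early_execute(group):
    """Return {dest: value} for the instructions executed early, in order."""
    available = {}  # register -> value visible to younger instructions
    executed = {}
    for insn in group:
        operands = []
        for src in insn.srcs:
            if isinstance(src, int):      # immediate from Decode
                operands.append(src)
            elif src in available:        # value predictor or bypass
                operands.append(available[src])
        ready = len(operands) == len(insn.srcs)
        if insn.op == "add" and ready:    # ready single-cycle ALU operation
            value = sum(operands)
            executed[insn.dest] = value
            available[insn.dest] = value  # bypass to younger instructions
        elif insn.predicted is not None:
            available[insn.dest] = insn.predicted
        else:
            available.pop(insn.dest, None)  # value unknown from here on
    return executed


group = [
    Insn("r1", "load", ("r9",), predicted=40),  # predicted, not early-executed
    Insn("r2", "add", ("r1", 2)),               # prediction + immediate
    Insn("r3", "add", ("r2", "r8")),            # r8 unknown: goes to the IQ
    Insn("r4", "add", ("r2", "r2")),            # operands from the bypass
]
print(early_execute(group))  # {'r2': 42, 'r4': 84}
```

Only `r2` and `r4` are early-executed in this example; `r3` has an operand that is neither an immediate, a prediction, nor a bypassed result, so it would be dispatched to the out-of-order engine as usual.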
Introducing Late Execution
[Figure: out-of-order engine followed by a Validation/Late Execution stage (with CMP) before Retire, plus VPredict and a Prediction FIFO Queue.]
Execute single-cycle predicted instructions just before retirement, in-order.
Do not dispatch to the IQ either.
Late Execution Hardware
[Figure: Late Execution and Validation blocks under Late Exec Control, with the PRF, the Prediction FIFO Queue, comparators (CMP) producing I1Correct and I2Correct, and an output to VPred.]
Execute just before validation and retirement by leveraging the ports provisioned for validation.
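A rough software model of this stage may help. It is a sketch under stated assumptions, not the authors' hardware: the function `commit_group`, the dictionary-based instruction records, and the use of addition as the only late-executable operation are all hypothetical. What it keeps from the slides is that predictions wait in a FIFO queue, late-executed instructions compute their result from PRF operands at commit, and every predicted instruction is compared against its prediction in order.

```python
from collections import deque


def commit_group(group, prf, prediction_fifo):
    """Validate predicted instructions in order.

    Returns the index of the first mispredicted instruction (where a
    squash would be triggered), or None if every prediction is correct.

    group: list of dicts with keys "dest", "srcs", "predicted" (bool)
           and "late_exec" (bool). Only additions are modeled.
    prf:   dict mapping register name to value.
    """
    for idx, insn in enumerate(group):
        if not insn["predicted"]:
            continue
        prediction = prediction_fifo.popleft()
        if insn["late_exec"]:
            # Never entered the IQ: compute the result now from the PRF.
            actual = sum(prf[s] for s in insn["srcs"])
        else:
            # Executed out of order: its result already sits in the PRF.
            actual = prf[insn["dest"]]
        if actual != prediction:
            return idx
    return None


prf = {"r1": 40, "r2": 2, "r3": 42, "r4": 7}
fifo = deque([42, 9])
group = [
    {"dest": "r3", "srcs": ("r1", "r2"), "predicted": True, "late_exec": True},
    {"dest": "r4", "srcs": ("r9",), "predicted": True, "late_exec": False},
]
print(commit_group(group, prf, fifo))  # 1
```

The first instruction is late-executed (40 + 2 matches the prediction 42); the second was executed out of order and produced 7 where 9 was predicted, so the squash point is index 1.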
{Early | OoO | Late} Execution: EOLE
Far fewer instructions enter the IQ, so we may be able to reduce the issue width:
• Simpler IQ.
• Fewer ports on the PRF.
• Less bypass.
Simpler OoO.
Non-critical predictions become useful as the instructions can be late-executed.
What about hardware cost?
Hardware Cost of EOLE
Early Execution:
• A single rank of simple ALUs.
• Associated bypass network.
• No additional PRF ports.
Late Execution & Validation:
• Rank of simple ALUs and comparators (to validate).
• No bypass.
• The n read ports for validation become 2n to handle n instructions per cycle: 16R for an 8-wide pipeline.
From 20R/14W for an 8-wide, 6-issue with VP, we now
need 28R/14W! Only 12R/6W for the baseline…
3. Lighter Value Prediction with EOLE
What we can optimize.
Reducing the Issue Width
If fewer instructions enter the IQ, then we can reduce the issue width (and maybe the IQ size):
• From 6 to 4 (-4R and -2W): 24R/12W.
The remaining issue capacity is offloaded to the
Early/Late Execution stages.
Still too many ports.
Banking the Physical Register File
Prediction and Validation are done in-order.
Bank the PRF and attribute predictions to consecutive
banks.
[Figure: a PRF split into Bank 0 to Bank 3, receiving 8 predictions per cycle and serving 8 validations per cycle.]
2 write ports per bank instead of 8 for a 4-bank file.
Read port savings are not as straightforward because
of Late Execution.
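The write-port figure can be checked with a short sketch. The mapping used here, prediction i of a cycle going to bank `(first_slot + i) % num_banks`, is an assumption made for illustration (the slide only says that predictions are attributed to consecutive banks), and `writes_per_bank` is a hypothetical name.

```python
from collections import Counter


def writes_per_bank(first_slot, predictions_per_cycle=8, num_banks=4):
    """Count the prediction writes landing in each bank in one cycle.

    Assumes consecutive in-order predictions use consecutive banks,
    starting from bank first_slot % num_banks.
    """
    banks = Counter((first_slot + i) % num_banks
                    for i in range(predictions_per_cycle))
    return [banks[b] for b in range(num_banks)]


print(writes_per_bank(0))  # [2, 2, 2, 2]
print(writes_per_bank(3))  # [2, 2, 2, 2]
```

Because predictions are produced in order, the load is balanced wherever the group starts: with 8 predictions and 4 banks, no bank ever sees more than 2 writes in a cycle.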
Read Port Sharing
8 instructions can be validated with 2R per bank…
…but Late Execution needs 16R per bank to process 8 instructions.
Fortunately, not all instructions are predictable (e.g. stores) or late-executable (e.g. loads).
Constrain the number of read ports and share them between late execution and validation as needed: 4R per bank is a good tradeoff.
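One way to picture the sharing is a per-cycle budget of reads on each bank. The sketch below is illustrative only: the function `commit_with_shared_ports`, the encoding of each instruction as a list of bank indices, and the policy of stopping at the first instruction that does not fit are assumptions, not details given in the slides.

```python
def commit_with_shared_ports(group, num_banks=4, reads_per_bank=4):
    """Return how many instructions of `group` can be processed this cycle.

    group: in-order list of lists. Each inner list holds the bank index
           of every PRF read the instruction needs at commit: empty for
           an instruction that is neither predicted nor late-executed,
           one entry to validate a result, two entries to late-execute
           a two-operand instruction.
    """
    used = [0] * num_banks
    for count, banks in enumerate(group):
        needed = [0] * num_banks
        for b in banks:
            needed[b] += 1
        if any(used[b] + needed[b] > reads_per_bank for b in range(num_banks)):
            return count  # in-order: younger instructions wait
        for b in range(num_banks):
            used[b] += needed[b]
    return len(group)


# Reads spread over the banks: all 8 instructions fit within 4R per bank.
spread = [[0, 1], [1, 2], [], [2], [3, 0], [], [1], [3]]
print(commit_with_shared_ports(spread))     # 8

# Reads concentrated on bank 0: the third instruction has to wait.
clustered = [[0, 0], [0, 0], [0, 1], [], [2], [3, 3], [1], []]
print(commit_with_shared_ports(clustered))  # 2
```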
Let’s Count, Again.
4-issue out-of-order engine (4W/8R per bank).
8 predictions per cycle (2W per bank).
Constrained late-execution/validation (4R per bank).
12R/6W per bank in total.
From 28R/14W, we now only need 12R/6W per bank! This is the same number of ports as the baseline PRF without VP, with an issue width that is 33% smaller.
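The successive port counts quoted in this part of the talk (28R/14W, then 24R/12W, then 12R/6W per bank) can be recomputed with a small helper. The function `eole_ports` and its parameters are hypothetical; the per-slot costs are the ones implied by the slides' own numbers.

```python
def eole_ports(issue_width, pipeline_width=8, num_banks=1, shared_reads=None):
    """Return (reads, writes) per PRF bank for EOLE with value prediction.

    Assumptions of this sketch: 2 reads and 1 write per issue slot on
    every bank, prediction writes spread evenly over the banks, and late
    execution/validation needing 2 reads per instruction unless capped
    by `shared_reads`.
    """
    reads = 2 * issue_width
    writes = issue_width + pipeline_width // num_banks
    reads += shared_reads if shared_reads is not None else 2 * pipeline_width
    return reads, writes


print(eole_ports(6))                               # (28, 14)
print(eole_ports(4))                               # (24, 12)
print(eole_ports(4, num_banks=4, shared_reads=4))  # (12, 6)
```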
Putting It All Together
[Figure: the EOLE pipeline. Fetch, PC, VPredict, Rename with Early Exec, ROB, IQ, FUs, a PRF split into Bank 0 to Bank 3, and Validation + Late Execution + Squashing @commit, around a less than n-issue out-of-order engine.]
• Predictions flow through Early Execution.
• Predictions/Early results are written to the PRF at Dispatch.
• Regular Out-of-order Execution.
• All predicted instructions are validated at commit time.
• Single-cycle predicted instructions are late executed by reading operands from the PRF.
Putting It All Together
EOLE provides a way to nullify the pressure applied by VP
on the PRF (assuming banking is cheap).
It reduces the complexity of the OoO engine itself: smaller
issue width, simple Wakeup & Select, less bypass.
EOLE needs VP to provide instructions to early/late
execute while VP needs EOLE to mitigate the complexity
it introduces.
The two features are complementary.
4. Experimental Results
What we get.
Experimental Framework
Simulator: gem5 (x86_64).
Core: 4GHz, 8-wide, 6-issue, 20-cycle minimum branch misprediction penalty, 192-entry ROB, 64-entry IQ, 48-entry LQ/48-entry SQ, 256 INT/256 FP registers.
Memory: 32KB L1D/L1I, 2MB unified L2 with stride prefetcher, 4GB DDR3-1600 (min. ~75 cycles).
Value predictor: hybrid of VTAGE (8K-entry base predictor + six 1K-entry tagged components) and an 8K-entry 2-delta Stride predictor, with Forward Probabilistic Counters [Perais&Seznec14].
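For readers unfamiliar with the stride component, here is a minimal sketch of a single 2-delta stride entry as the scheme is commonly described: the stride used for prediction is replaced only after the same new stride has been observed twice in a row. The class `TwoDeltaStride` is illustrative and is not the authors' implementation; table indexing, tags, the confidence counters, and the speculative window mentioned earlier in the talk are all left out.

```python
class TwoDeltaStride:
    """One entry of a 2-delta stride value predictor (no tag, no confidence)."""

    def __init__(self):
        self.last = 0     # last value produced by the instruction
        self.stride1 = 0  # most recently observed stride
        self.stride2 = 0  # stride actually used to predict

    def predict(self):
        return self.last + self.stride2

    def update(self, value):
        stride = value - self.last
        if stride == self.stride1:
            self.stride2 = stride  # same stride seen twice in a row
        self.stride1 = stride
        self.last = value


entry = TwoDeltaStride()
for value in (10, 14, 18):
    entry.update(value)
print(entry.predict())  # 22

entry.update(100)       # a one-off irregular value
print(entry.predict())  # 104: the prediction stride is still 4
```

The second prediction shows the point of keeping two strides: a single irregular value changes the last value but does not disturb the stride used for prediction.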
Experimental Framework
Single-thread benchmarks: Subset of SPEC’00 and
SPEC’06 (19 benchmarks) – ref inputs.
Simpoint: one slice per benchmark, warm up for 50M instructions, run for 100M instructions.
Speedup over Baseline 8-wide/6-issue
VTAGE-2D-Str (a.k.a. Simple_VP_6I_64IQ)
[Figure: speedup plot; y-axis from 0.9 to 1.6.]
Early Executed – Late Executed
[Figure: plot with y-axis from 0 to 0.7. Series: Early Executed; LE: High-Confidence Branches; LE: Value-predicted. Annotation: Low EOLE potential.]
Reducing the Issue Width
[Figure: plot with y-axis from 0.8 to 1.4. Series: Simple_VP_4I_64IQ, EOLE_4I_64IQ, EOLE_6I_64IQ. Annotations: Slowdown in almost all cases; Slight speedup in general; Slowdown in a single case.]
Reducing the IQ Size
[Figure: plot with y-axis from 0.85 to 1.2. Series: Simple_VP_6I_48IQ, EOLE_6I_48IQ, EOLE_6I_64IQ. Annotations: Slowdown in all cases; Noticeable slowdown in many cases.]
Limited Issue and PRF Ports
[Figure: plot with y-axis from 0.6 to 1.2. Series: No_VP_6I_64IQ, EOLE_4I_64IQ, EOLE_4I_64IQ_4Rports_4banks. Annotations: Without VP; Same performance for 4R/bank as ideal.]
5. Concluding Remarks
What remains to be done.
VP In a Processor with EOLE?
Pros:
• No additional ports on the PRF, assuming enough
banks.
• Simpler Out-of-order engine.
• Performance very similar to the baseline VP pipeline.
Cons:
• Additional hardware (Early and Late Execution,
Predictor).
• Impact on power consumption is unclear.
Future Work
What about the predictor?
• 8-wide fetch -> 8 predictions/cycle -> 8-ported
tables?
• Hybrid with Stride, how do you implement the
speculative window?
The remaining complexity is really in the predictor.
Questions?