Presentation - Rutgers University
Download
Report
Transcript Presentation - Rutgers University
Neural Methods for
Dynamic Branch Prediction
Daniel A. Jiménez
Dept. of Computer Science
Rutgers University
Calvin Lin
Dept. of Computer Science
Univ. of Texas Austin
Presented by:
Rohit Mittal
Overview
Branch prediction background
Applying machine learning to branch prediction
Results and analysis
Future work and conclusions
2
Branch Prediction Background
3
Outline
What are branches?
Reducing branch penalties
Branch prediction
Why is branch prediction necessary?
Branch prediction basics
Issues which affect accurate branch prediction
Examples of real predictors
4
Branches
Instructions which can alter the flow of instruction execution
in a program
Direct
Indirect
Conditional
Unconditional
if - then- else
for loops
(bez, bnez, etc)
procedure calls (jal)
goto (j)
return (jr)
virtual function lookup
5
The Context
How can we exploit program behavior to make it go
faster?
Remove
control dependences
Increase
instruction-level parallelism
6
An Example
The inner loop of this code executes two statements each time
through the loop.
int foo (int w[], bool v[], int n) {
int
sum = 0;
for (int i=0; i<n; i++) {
if (v[i])
sum += w[i];
else
sum += ~w[i];
}
return sum;
}
7
An Example continued
This C++ code computes the same thing with three statements
in the loop.
int foo2 (int w[], bool v[], int n) {
int
sum = 0;
for (int i=0; i<n; i++) {
int a = w[i];
int b = - (int) v[i];
sum += ~(a ^ b);
}
return sum;
}
This version is 55% faster on a Pentium 4.
Previous version had many mispredicted branch instructions.
8
Branch Prediction
To speed up the process, pipelining overlaps execution of
multiple instructions, exploiting parallelism between
instructions.
Conditional branches create a problem for pipelining: the next
instruction can't be fetched until the branch has executed,
several stages later.
A branch predictor allows the processor to speculatively fetch
and execute instructions down the predicted path.
Branch predictors must be highly accurate to avoid mispredictions!
9
Why good Branch Prediction is necessary..
Branches are frequent - 15-25%
Today’s pipelines are deeper and wider
Higher performance penalty for stalling
High Misprediction Penalty
A lot of cycles can be wasted!!!
10
Branch Predictors Must Improve
The cost of a misprediction is proportional to pipeline depth
As pipelines deepen, we need more accurate branch predictors
Pentium 4 pipeline has 20 stages
Future pipelines will have > 32 stages
Deeper pipelines allow higher clock
rates by decreasing the delay of each
pipeline stage
Decreasing misprediction rate from
9% to 4% results in 31% speedup for
32 stage pipeline
Simulations with SimpleScalar/Alpha
11
Branch Prediction
Predicting the outcome of a branch
Direction:
Taken / Not Taken
Direction predictors
Target Address
PC+offset (Taken)/ PC+4 (Not Taken)
Target address predictors
Branch Target Address Cache (BTAC) or Branch Target Buffer
(BTB)
12
Why do we need branch prediction?
Branch prediction
Increases the number of instructions available for the
scheduler to issue. Increases instruction level parallelism
(ILP)
Allows useful work to be completed while waiting for the
branch to resolve
13
Branch Prediction Strategies
Static
Decided before runtime
Examples:
Always-Not Taken
Always-Taken
Backwards Taken, Forward Not Taken (BTFNT)
Profile-driven prediction
Dynamic
Prediction decisions may change during the execution of
the program
14
Dynamic Branch Prediction
Performance = ƒ(accuracy, cost of misprediction)
Branch History Table (BHT) is simplest
Also called a branch-prediction buffer
Lower bits of branch address index table of 1-bit
values
Says whether or not branch taken last time
If branch was taken last time, then take again
Initially, bits are set to predict that all branches are
taken
1-bit Branch History Table
Problems :
Two branches can have the same low-order bits.
In a loop, 1-bit BHT will cause two mispredictions:
End of loop case, when it exits instead of looping as
before
First time through loop on next time through code, when
it predicts exit instead of looping
LOOP: LOAD R1, 100(R2)
MUL R6, R6, R1
SUBI R2, R2, #4
BNEZ R2, LOOP
16
2-bit Predictor
Solution : 2-bit predictor scheme where change prediction only if
mispredict twice in a row
T
NT
Predict Taken
T
Predict Not
Taken
NT
T
Predict Taken
NT
T
Predict Not
NT Taken
•This idea can be extended to n-bit saturating counters
–Increment counter when branch is taken
–Decrement counter when branch is not taken
–If counter <= 2n-1, then predict the branch is taken; else not taken.
17
Correlating Branches
Often the behavior of one branch is correlated with the
behavior of other branches.
Example C code
B3 can be predicted with
if (aa == 2)
B1
100% accuracy based on the
aa = 0;
outcomes of B1 and B2
if (bb == 2)
B2
bb = 0;
if (aa != bb)
B3
cc = 4;
If the first two branches are not taken, the third one will be.
18
Correlating Branches – contd.
Hypothesis: recent branches are correlated; that is, behavior of
recently executed branches affects prediction of current branch
Idea: record m most recently executed branches as taken or
not taken, and use that pattern to select the proper branch
history table
In general, (m,n) predictor means record last m branches to
select between 2m history tables each with n-bit counters
Old 2-bit BHT is then a (0,2) predictor
19
Need Address at same time as Prediction
Branch Target Buffer (BTB): Address of branch index to get
prediction AND branch address (if taken)
Note: must check for branch match now, since can’t use wrong
branch address
Branch PC
Predicted PC
PC of instruction
FETCH
=?
Predict taken or not taken
Return instruction addresses predicted with stack
20
Branch Target Buffer
A branch-target buffer or branch-target cache stores
the predicted address of branches that are predicted to
be taken.
Values not in the buffer are predicted to be not taken.
The branch-target buffer is accessed during the IF
stage, based on the k low order bits of the branch
address.
If the branch-target is in the buffer and is predicted
correctly, the one cycle stall is eliminated.
21
Branch Predictor Accuracy
Larger tables and smarter organizations yield better accuracy
Longer histories provide more context for finding correlations
Table size is exponential in history length
The cost is increased access delay and chip area
22
Alpha 21264
8-stage pipeline, mispredict penalty 7 cycle
64 KB, 2-way instruction cache with line and way prediction
bits (Fetch)
Each 4-instruction fetch block contains a prediction for the next fetch
block
Hybrid predictor (Fetch)
12-bit GAg (4K-entry PHT, 2 bit counters)
10-bit PAg (1K-entry BHT, 1K-entry PHT, 3-bit counters)
23
Ultra Sparc III
14-stage pipeline, branch prediction accessed in instruction
fetch stages 2-3
16K-entry 2-bit counter Gshare predictor
Bimodal predictor which XOR’s PC bits with global
history register (except 3 lower order bits) to reduce
aliasing
Miss queue
Halves mispredict penalty by providing instructions for
immediate use
24
Pentium III
Dynamic branch prediction
512-entry BTB predicts direction and target, 4-bit
history used with PC to derive direction
Static branch predictor for BTB misses
Branch Penalties:
Not Taken: no penalty
Correctly predicted taken: 1 cycle
Mispredicted: at least 9 cycles, as many as 26, average
10-15 cycles
25
AMD Athlon K7
10-stage integer, 15-stage fp pipeline, predictor accessed in
fetch
2K-entry bimodal predictor, 2K-entry BTB
Branch Penalties:
Correct Predict Taken: 1 cycle
Mispredict penalty: at least 10 cycles
26
Applying Machine Learning to
Branch Prediction
27
Branch Prediction is a
Machine Learning Problem
So why not apply a machine learning algorithm?
Replace 2-bit counters with a more accurate predictor
Tight constraints on prediction mechanism
Must be fast and small enough to work as a component of a
microprocessor
Artificial neural networks
Simple model of neural networks in brain cells
Learn to recognize and classify patterns
Most neural nets are slow and complex relative to tables
For branch prediction, we need a small and fast neural method
28
A Neural Method for Branch Prediction
Several neural methods were investigated
Most were too slow, too big, or not accurate enough
The perceptron [Rosenblatt `62, Block `62]
Very high accuracy for branch prediction
Prediction and update are quick, relative to other neural methods
Sound theoretical foundation; perceptron convergence theorem
Proven to work well for many classification problems
29
Branch-Predicting Perceptron
Inputs (x’s) are from branch history register
Weights (w’s) are small integers learned by on-line training
Output (y) gives prediction; dot product of x’s and w’s
Training finds correlations between history and outcome
w0 – bias, independent of the history
30
Training Algorithm
31
Training Perceptrons
W’ – i.e. new weights vector, might be a worse set of
weights for any other training example. It is not evident
that this is a useful algorithm.
Perception Convergence Theorem:
If any set of weights exist that correctly classify a finite set
of training examples, then perceptron learning will come
up with a (possibly different) set of weights that also
correctly classifies all examples after a finite number of
change steps, for a finite separable set of training
examples.
32
Linear Separability
A limitation of perceptrons is that they are only capable of
learning linearly separable functions
A boolean function over variables xi..n is linearly separable
iff there exist values for wi..n such that all the true instances
can be separated from all the false instances by a
hyperplane defined by the solution of:
n
w0 + ∑ xi wi = 0
i=1
• i.e. If n = 2, the hyperplane is a line.
33
Linear Separability – contd.
Example: a perceptron can learn the logical AND for two
inputs but not the XOR.
A perceptron can still give good predictions for
inseparable functions but will not achieve 100% accuracy.
In contrast a two level PHT (pattern history table) scheme
like gshare can learn any boolean function if given enough
time.
34
Putting it all together – perceptron based
predictor
1.
2.
3.
4.
5.
6.
The Branch address is hashed into the table of
perceptrons
The ith perceptron is fetched, into a vector register, P1..n
of weights.
The value of y is computed as the dot product of P and
the global history register
The branch is predicted not taken if y is negative, or
taken otherwise
Once this branch is resolved, the outcome is used by the
training algorithm to update P
P is written back to the ith entry in the table
35
Organization of the Perceptron Predictor
Keeps a table of perceptrons, indexed by branch address
Inputs are from branch history register
Predict taken if output 0, otherwise predict not taken
Key intuition: table size isn't exponential in history length, so
we can consider much longer histories
36
Results and Analysis for the
Perceptron Predictor
37
Results: Predictor Accuracy
Perceptron outperforms competitive hybrid predictor by 36%
at ~4KB; 1.71% vs. 2.66%
38
Results: Large Hardware Budgets
Multi-component hybrid was the most accurate fully dynamic
predictor known in the literature [Evers 2000]
Perceptron predictor is even more accurate
39
Results: IPC with high clock rate
Pentium 4-like: 20 cycle misprediction penalty, 1.76 GHz
15.8% higher IPC than gshare, 5.7% higher than hybrid
40
Analysis: History Length
The fixed-length path branch predictor can also use long
histories [Stark, Evers & Patt `98]
41
Analysis: Training Times
Perceptron “warms up’’ faster
42
Future Work and Conclusions
43
Future Work with Perceptron Predictor
Let's make the best predictor even better
Better representation
Better training algorithm
Latency is a problem
How can we eliminate the latency of the perceptron
predictor?
44
Future Work with Perceptron Predictor
Value prediction
Predict which set of values is likely to be the result of a
load operation to mitigate memory latency
Indirect branch prediction
Virtual dispatch
Switch statements in C
45
Future Work
Characterizing Predictability
Branch predictability, value predictability
How can we characterize algorithms in terms of their predictability?
Given an algorithm, how can we transform it so that its branches and
values are easier to predict?
How much predictability is inherent in the algorithm, and how much is
an artifact of the program structure?
How can we compare different algorithms' predictability?
46
Conclusions
Neural predictors can improve performance for deeply
pipelined microprocessors
Perceptron learning is well-suited for microarchitectural
implementation
There is still a lot of work left to be done on the perceptron
predictor in particular and microarchitectural prediction in
general
47
The End
48