Presentation - Rutgers University

Neural Methods for
Dynamic Branch Prediction
Daniel A. Jiménez
Dept. of Computer Science
Rutgers University
Calvin Lin
Dept. of Computer Science
Univ. of Texas Austin
Presented by:
Rohit Mittal
Overview

Branch prediction background

Applying machine learning to branch prediction

Results and analysis

Future work and conclusions
Branch Prediction Background
Outline

What are branches?
Reducing branch penalties
Branch prediction
Why is branch prediction necessary?
Branch prediction basics
Issues that affect accurate branch prediction
Examples of real predictors
Branches

Instructions which can alter the flow of instruction execution in a program

Conditional, direct: if-then-else, for loops (bez, bnez, etc.)
Unconditional, direct: procedure calls (jal), goto (j)
Unconditional, indirect: return (jr), virtual function lookup
The Context

How can we exploit program behavior to make it go faster?
  Remove control dependences
  Increase instruction-level parallelism
An Example

The inner loop of this code executes two statements each time through the loop.

int foo (int w[], bool v[], int n) {
    int sum = 0;
    for (int i=0; i<n; i++) {
        if (v[i])
            sum += w[i];
        else
            sum += ~w[i];
    }
    return sum;
}
An Example continued

This C++ code computes the same thing with three statements in the loop.

int foo2 (int w[], bool v[], int n) {
    int sum = 0;
    for (int i=0; i<n; i++) {
        int a = w[i];
        int b = - (int) v[i];
        sum += ~(a ^ b);
    }
    return sum;
}

This version is 55% faster on a Pentium 4.
The previous version had many mispredicted branch instructions.
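The branch-free version works because `-(int) v[i]` is 0 when `v[i]` is false and -1 (all bits set, assuming two's-complement integers) when it is true, so `~(a ^ b)` evaluates to `~a` or `a` without a conditional branch. A minimal sketch that checks the two forms agree; the helper names `with_branch`, `branch_free`, and `forms_agree` are hypothetical and not from the slides.

```cpp
#include <cassert>

// Per-element contribution of the branching version (foo).
int with_branch(int a, bool v) {
    return v ? a : ~a;
}

// Per-element contribution of the branch-free version (foo2).
// b is 0 when v is false and -1 (all bits set) when v is true, so
// a ^ b is a or ~a, and the outer ~ turns that into ~a or a.
int branch_free(int a, bool v) {
    int b = -static_cast<int>(v);
    return ~(a ^ b);
}

// Returns true if both forms agree for every value in [lo, hi].
bool forms_agree(int lo, int hi) {
    assert(lo <= hi);
    for (int a = lo; a <= hi; ++a) {
        if (with_branch(a, true) != branch_free(a, true)) return false;
        if (with_branch(a, false) != branch_free(a, false)) return false;
    }
    return true;
}
```

For instance, `forms_agree(-1000, 1000)` compares both forms over a small range of inputs.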
Branch Prediction

To speed up the process, pipelining overlaps execution of
multiple instructions, exploiting parallelism between
instructions.

Conditional branches create a problem for pipelining: the next
instruction can't be fetched until the branch has executed,
several stages later.

A branch predictor allows the processor to speculatively fetch
and execute instructions down the predicted path.
Branch predictors must be highly accurate: every misprediction discards the speculatively fetched instructions and wastes the cycles spent on them.
Why Good Branch Prediction Is Necessary

Branches are frequent: 15-25% of instructions
Today’s pipelines are deeper and wider
  Higher performance penalty for stalling
  High misprediction penalty
A lot of cycles can be wasted!
Branch Predictors Must Improve

The cost of a misprediction is proportional to pipeline depth
As pipelines deepen, we need more accurate branch predictors

Pentium 4 pipeline has 20 stages
Future pipelines will have > 32 stages
Deeper pipelines allow higher clock rates by decreasing the delay of each pipeline stage

Decreasing misprediction rate from 9% to 4% results in 31% speedup for a 32-stage pipeline
  Simulations with SimpleScalar/Alpha
Branch Prediction

Predicting the outcome of a branch
Direction: Taken / Not Taken
  Direction predictors
Target address: PC+offset (Taken) / PC+4 (Not Taken)
  Target address predictors
  Branch Target Address Cache (BTAC) or Branch Target Buffer (BTB)
Why do we need branch prediction?

Branch prediction
  Increases the number of instructions available for the scheduler to issue, which increases instruction-level parallelism (ILP)
  Allows useful work to be completed while waiting for the branch to resolve
Branch Prediction Strategies

Static
  Decided before runtime
  Examples:
    Always-Not Taken
    Always-Taken
    Backwards Taken, Forward Not Taken (BTFNT)
    Profile-driven prediction

Dynamic
  Prediction decisions may change during the execution of the program
Dynamic Branch Prediction

Performance = ƒ(accuracy, cost of misprediction)
Branch History Table (BHT) is simplest
  Also called a branch-prediction buffer
  Lower bits of the branch address index a table of 1-bit values
  Each bit says whether or not the branch was taken last time
  If the branch was taken last time, predict taken again
  Initially, bits are set to predict that all branches are taken

1-bit Branch History Table

Problems:
Two branches can have the same low-order address bits.
In a loop, a 1-bit BHT causes two mispredictions:
  At the end of the loop, when the branch exits instead of looping as before
  On the first iteration the next time the loop is entered, when it predicts exit instead of looping

LOOP: LOAD R1, 100(R2)
      MUL  R6, R6, R1
      SUBI R2, R2, #4
      BNEZ R2, LOOP
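The two mispredictions per loop execution can be reproduced with a small simulation. This is a minimal sketch that models a single 1-bit table entry, assuming no other branch aliases to it; the function name `one_bit_mispredictions` and its parameters are hypothetical.

```cpp
#include <cassert>

// Counts mispredictions of one 1-bit predictor entry for a loop-closing
// branch that is taken (iterations - 1) times and then falls through,
// with the whole loop executed `runs` times. The entry starts at "taken",
// matching the initial state described on the slide.
int one_bit_mispredictions(int iterations, int runs) {
    assert(iterations >= 2 && runs >= 1);
    bool last_taken = true;   // the single history bit: the last outcome
    int mispredictions = 0;
    for (int r = 0; r < runs; ++r) {
        for (int i = 0; i < iterations; ++i) {
            bool taken = (i != iterations - 1);   // the last iteration exits
            if (last_taken != taken) ++mispredictions;
            last_taken = taken;                   // remember only this outcome
        }
    }
    return mispredictions;
}
```

With `iterations = 10` and `runs = 3` the count is 5: one misprediction on the first exit, then two for each later execution of the loop (the re-entry and the exit).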
2-bit Predictor

Solution: a 2-bit predictor scheme that changes its prediction only after two mispredictions in a row

[Figure: state diagram of the 2-bit predictor, with two "Predict Taken" states and two "Predict Not Taken" states connected by T (taken) and NT (not taken) transitions]

•This idea can be extended to n-bit saturating counters
–Increment counter when branch is taken
–Decrement counter when branch is not taken
–If counter >= 2^(n-1), then predict the branch is taken; else not taken.
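An n-bit saturating counter can be sketched in a few lines. This is an illustration under stated assumptions, not a hardware description: the names `SaturatingCounter` and `loop_mispredictions` are hypothetical, the counter starts at its maximum (strongly taken), and the prediction threshold is the midpoint 2^(n-1).

```cpp
#include <cassert>

// An n-bit saturating counter holding a value in [0, 2^n - 1].
struct SaturatingCounter {
    int bits;
    int value;

    explicit SaturatingCounter(int n) : bits(n), value((1 << n) - 1) {
        assert(n >= 1 && n <= 16);   // counter starts at "strongly taken"
    }

    int max_value() const { return (1 << bits) - 1; }

    // Predict taken when the counter is in the upper half of its range.
    bool predict() const { return value >= (1 << (bits - 1)); }

    // Move toward the observed outcome, saturating at both ends.
    void update(bool taken) {
        if (taken) { if (value < max_value()) ++value; }
        else       { if (value > 0) --value; }
    }
};

// Mispredictions for a loop-closing branch that is taken on every
// iteration except the last, with the loop executed `runs` times.
int loop_mispredictions(int n_bits, int iterations, int runs) {
    assert(iterations >= 2 && runs >= 1);
    SaturatingCounter c(n_bits);
    int wrong = 0;
    for (int r = 0; r < runs; ++r) {
        for (int i = 0; i < iterations; ++i) {
            bool taken = (i != iterations - 1);
            if (c.predict() != taken) ++wrong;
            c.update(taken);
        }
    }
    return wrong;
}
```

For a 10-iteration loop executed 3 times, the 2-bit counter mispredicts only the three loop exits, while a 1-bit counter mispredicts 5 times.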
Correlating Branches

Often the behavior of one branch is correlated with the behavior of other branches.

Example C code:

if (aa == 2)     /* B1 */
    aa = 0;
if (bb == 2)     /* B2 */
    bb = 0;
if (aa != bb)    /* B3 */
    cc = 4;

When neither B1 nor B2 is taken, B3 can be predicted with 100% accuracy from their outcomes.
Assuming each if compiles to a branch that skips its assignment when taken: if the first two branches are not taken, aa and bb are both set to 0, so the third branch will be taken.
Correlating Branches – contd.

Hypothesis: recent branches are correlated; that is, the behavior of recently executed branches affects the prediction of the current branch
Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table
In general, an (m,n) predictor records the last m branches to select between 2^m history tables, each with n-bit counters
  The old 2-bit BHT is then a (0,2) predictor
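The (m,n) scheme can be sketched as follows. This is a minimal illustration, assuming a single global history register, 4-byte instructions (so the two low PC bits are dropped), and counters initialized to strongly taken; the names `CorrelatingPredictor` and `alternating_mispredictions` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of an (m, n) correlating predictor: the last m global branch
// outcomes select one of 2^m tables of n-bit saturating counters, and the
// low-order bits of the branch address select the counter in that table.
class CorrelatingPredictor {
public:
    CorrelatingPredictor(int m, int n, int entries)
        : m_(m), n_(n), entries_(static_cast<std::uint32_t>(entries)), history_(0) {
        assert(m >= 0 && m <= 16 && n >= 1 && n <= 8 && entries >= 1);
        counters_.assign((std::size_t(1) << m) * entries_, (1 << n) - 1);
    }

    bool predict(std::uint32_t pc) const {
        return counters_[index(pc)] >= (1 << (n_ - 1));
    }

    void update(std::uint32_t pc, bool taken) {
        int& c = counters_[index(pc)];
        const int max = (1 << n_) - 1;
        if (taken) { if (c < max) ++c; } else { if (c > 0) --c; }
        // Shift the outcome into the m-bit global history (m = 0 keeps it 0).
        history_ = ((history_ << 1) | (taken ? 1u : 0u)) & ((1u << m_) - 1u);
    }

private:
    std::size_t index(std::uint32_t pc) const {
        return static_cast<std::size_t>(history_) * entries_ + (pc >> 2) % entries_;
    }

    int m_, n_;
    std::uint32_t entries_;
    std::uint32_t history_;
    std::vector<int> counters_;
};

// Mispredictions on a strictly alternating branch (T, N, T, N, ...),
// ignoring the first `warmup` outcomes. Uses 2-bit counters.
int alternating_mispredictions(int m, int steps, int warmup) {
    CorrelatingPredictor p(m, 2, 16);
    int wrong = 0;
    for (int i = 0; i < steps; ++i) {
        bool taken = (i % 2 == 0);
        if (i >= warmup && p.predict(0x400) != taken) ++wrong;
        p.update(0x400, taken);
    }
    return wrong;
}
```

On the alternating pattern, a (1,2) predictor stops mispredicting once it has warmed up, while the (0,2) predictor keeps mispredicting every not-taken outcome.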
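Need Address at Same Time as Prediction

Branch Target Buffer (BTB): the address of the branch indexes the buffer to get the prediction AND the branch target address (if taken)
  Note: the fetch PC must be checked against the stored branch PC, since the target of a different branch can’t be used

[Figure: BTB lookup. The PC of the instruction in FETCH is compared (=?) against the stored Branch PC; a matching entry supplies the Predicted PC and the taken / not taken prediction]

Return instruction addresses are predicted with a stack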
Branch Target Buffer

A branch-target buffer or branch-target cache stores the predicted target address of branches that are predicted to be taken.
Branches not in the buffer are predicted to be not taken.
The branch-target buffer is accessed during the IF stage, based on the k low-order bits of the branch address.
If the branch target is in the buffer and is predicted correctly, the one-cycle stall is eliminated.
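A direct-mapped BTB with the tag check described above might look like the sketch below. It assumes 4-byte instructions, stores the full branch PC as the tag, and removes an entry when its branch resolves not taken; the class name `BranchTargetBuffer` and its methods are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of a direct-mapped branch target buffer indexed by the k
// low-order bits of the (word-aligned) branch address.
class BranchTargetBuffer {
public:
    explicit BranchTargetBuffer(int k) : k_(k) {
        assert(k >= 1 && k <= 20);
        entries_.resize(std::size_t(1) << k);   // value-initialized: all invalid
    }

    // Lookup during fetch. A hit means "predict taken, fetch from target";
    // a miss means "predict not taken, fetch the next sequential address".
    bool lookup(std::uint32_t pc, std::uint32_t& target) const {
        const Entry& e = entries_[index(pc)];
        if (e.valid && e.branch_pc == pc) {     // the "=?" tag comparison
            target = e.target;
            return true;
        }
        return false;
    }

    // After the branch resolves: remember taken branches, drop not-taken ones.
    void update(std::uint32_t pc, bool taken, std::uint32_t target) {
        Entry& e = entries_[index(pc)];
        if (taken) {
            e.valid = true;
            e.branch_pc = pc;
            e.target = target;
        } else if (e.valid && e.branch_pc == pc) {
            e.valid = false;
        }
    }

private:
    struct Entry {
        bool valid;
        std::uint32_t branch_pc;
        std::uint32_t target;
    };

    std::size_t index(std::uint32_t pc) const {
        return (pc >> 2) & ((std::uint32_t(1) << k_) - 1u);
    }

    int k_;
    std::vector<Entry> entries_;
};
```

The tag comparison is what keeps a branch at an aliasing address (same low-order bits, different PC) from being redirected to another branch's target.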
Branch Predictor Accuracy


Larger tables and smarter organizations yield better accuracy
Longer histories provide more context for finding correlations


Table size is exponential in history length
The cost is increased access delay and chip area
Alpha 21264

8-stage pipeline, mispredict penalty 7 cycles
64 KB, 2-way instruction cache with line and way prediction bits (Fetch)
  Each 4-instruction fetch block contains a prediction for the next fetch block
Hybrid predictor (Fetch)
  12-bit GAg (4K-entry PHT, 2-bit counters)
  10-bit PAg (1K-entry BHT, 1K-entry PHT, 3-bit counters)
UltraSPARC III

14-stage pipeline, branch prediction accessed in instruction fetch stages 2-3
16K-entry 2-bit counter gshare predictor
  Bimodal predictor which XORs PC bits with the global history register (except the 3 lower-order bits) to reduce aliasing
Miss queue
  Halves mispredict penalty by providing instructions for immediate use
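The gshare indexing idea, XORing PC bits with the global history to pick a 2-bit counter, can be sketched as below. This is a generic illustration and not the UltraSPARC III design: it XORs all index bits rather than skipping the 3 low-order ones, assumes 4-byte instructions, and starts counters at weakly taken. The names `Gshare` and `gshare_alternating_mispredictions` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of a gshare predictor: one table of 2-bit saturating counters,
// indexed by (PC bits XOR global history).
class Gshare {
public:
    explicit Gshare(int index_bits) : bits_(index_bits), history_(0) {
        assert(index_bits >= 1 && index_bits <= 20);
        counters_.assign(std::size_t(1) << index_bits, 2);   // weakly taken
    }

    bool predict(std::uint32_t pc) const { return counters_[index(pc)] >= 2; }

    void update(std::uint32_t pc, bool taken) {
        int& c = counters_[index(pc)];
        if (taken) { if (c < 3) ++c; } else { if (c > 0) --c; }
        // Shift the outcome into the global history register.
        history_ = ((history_ << 1) | (taken ? 1u : 0u)) & mask();
    }

private:
    std::uint32_t mask() const { return (std::uint32_t(1) << bits_) - 1u; }

    std::size_t index(std::uint32_t pc) const {
        return ((pc >> 2) ^ history_) & mask();
    }

    int bits_;
    std::uint32_t history_;
    std::vector<int> counters_;
};

// Mispredictions on a strictly alternating branch (T, N, T, N, ...),
// ignoring the first `warmup` outcomes.
int gshare_alternating_mispredictions(int index_bits, int steps, int warmup) {
    Gshare g(index_bits);
    int wrong = 0;
    for (int i = 0; i < steps; ++i) {
        bool taken = (i % 2 == 0);
        if (i >= warmup && g.predict(0x400) != taken) ++wrong;
        g.update(0x400, taken);
    }
    return wrong;
}
```

Because the history differs before a taken and a not-taken outcome, the two cases land on different counters and the alternating pattern is learned after a short warm-up.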
Pentium III

Dynamic branch prediction
  512-entry BTB predicts direction and target; 4-bit history used with PC to derive direction
Static branch predictor for BTB misses
Branch penalties:
  Not taken: no penalty
  Correctly predicted taken: 1 cycle
  Mispredicted: at least 9 cycles, as many as 26, average 10-15 cycles
AMD Athlon K7

10-stage integer, 15-stage fp pipeline, predictor accessed in fetch
2K-entry bimodal predictor, 2K-entry BTB
Branch penalties:
  Correctly predicted taken: 1 cycle
  Mispredict penalty: at least 10 cycles
Applying Machine Learning to
Branch Prediction
Branch Prediction is a Machine Learning Problem

So why not apply a machine learning algorithm?
  Replace 2-bit counters with a more accurate predictor
  Tight constraints on prediction mechanism
  Must be fast and small enough to work as a component of a microprocessor

Artificial neural networks
  Simple model of the networks of neurons in the brain
  Learn to recognize and classify patterns
  Most neural nets are slow and complex relative to tables
  For branch prediction, we need a small and fast neural method
A Neural Method for Branch Prediction

Several neural methods were investigated
  Most were too slow, too big, or not accurate enough

The perceptron [Rosenblatt '62, Block '62]
  Very high accuracy for branch prediction
  Prediction and update are quick, relative to other neural methods
  Sound theoretical foundation; perceptron convergence theorem
  Proven to work well for many classification problems
Branch-Predicting Perceptron

Inputs (x’s) are from the branch history register
Weights (w’s) are small integers learned by on-line training
Output (y) gives the prediction; it is the dot product of the x’s and w’s
Training finds correlations between history and outcome
w0 is the bias weight, independent of the history
Training Algorithm
Training Perceptrons

The new weight vector W’ might be a worse set of weights for some other training example, so it is not evident that this is a useful algorithm.

Perceptron Convergence Theorem:
If any set of weights exists that correctly classifies a finite set of training examples, then perceptron learning will find a (possibly different) set of weights that also correctly classifies all examples after a finite number of weight updates.
Linear Separability

A limitation of perceptrons is that they are only capable of learning linearly separable functions
A boolean function over variables x_1..x_n is linearly separable iff there exist values for w_0..w_n such that all the true instances can be separated from all the false instances by a hyperplane defined by the solution of:

$$w_0 + \sum_{i=1}^{n} x_i w_i = 0$$

For example, if n = 2, the hyperplane is a line.
Linear Separability – contd.

Example: a perceptron can learn the logical AND of two inputs but not the XOR.

A perceptron can still give good predictions for inseparable functions but will not achieve 100% accuracy.
In contrast, a two-level PHT (pattern history table) scheme like gshare can learn any boolean function if given enough time.
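The AND versus XOR contrast can be checked with a tiny two-input perceptron. The sketch below assumes bipolar inputs and targets (-1 for false, +1 for true) and the basic update-on-error rule; the names `Example`, `train_and_score`, `score_and`, and `score_xor` are hypothetical.

```cpp
#include <array>
#include <cassert>

// One training example: two inputs and a target, all in {-1, +1}.
struct Example { int x1, x2, t; };

// Trains a two-input perceptron for `epochs` passes over the data and
// returns how many of the four examples the final weights classify
// correctly. Weights change only when an example is misclassified.
int train_and_score(const std::array<Example, 4>& data, int epochs) {
    assert(epochs >= 1);
    int w0 = 0, w1 = 0, w2 = 0;
    for (int e = 0; e < epochs; ++e) {
        for (const Example& ex : data) {
            int y = w0 + w1 * ex.x1 + w2 * ex.x2;
            int predicted = (y >= 0) ? 1 : -1;
            if (predicted != ex.t) {
                w0 += ex.t;            // bias weight
                w1 += ex.t * ex.x1;
                w2 += ex.t * ex.x2;
            }
        }
    }
    int correct = 0;
    for (const Example& ex : data) {
        int y = w0 + w1 * ex.x1 + w2 * ex.x2;
        if (((y >= 0) ? 1 : -1) == ex.t) ++correct;
    }
    return correct;
}

// Logical AND: true only when both inputs are true.
int score_and(int epochs) {
    std::array<Example, 4> d = {{ {-1, -1, -1}, {-1, 1, -1}, {1, -1, -1}, {1, 1, 1} }};
    return train_and_score(d, epochs);
}

// Logical XOR: true when exactly one input is true.
int score_xor(int epochs) {
    std::array<Example, 4> d = {{ {-1, -1, -1}, {-1, 1, 1}, {1, -1, 1}, {1, 1, -1} }};
    return train_and_score(d, epochs);
}
```

AND is linearly separable, so training reaches all four examples correct; for XOR no weight setting classifies all four, so the score stays below 4 however long training runs.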
Putting It All Together – Perceptron-Based Predictor

1. The branch address is hashed into the table of perceptrons
2. The ith perceptron is fetched into a vector register, P0..n, of weights
3. The value of y is computed as the dot product of P and the global history register
4. The branch is predicted not taken if y is negative, or taken otherwise
5. Once this branch is resolved, the outcome is used by the training algorithm to update P
6. P is written back to the ith entry in the table
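The six steps can be sketched in software as follows. This is an illustration under several assumptions that the slides do not state: the hash is a simple modulo of the word-aligned address, history bits are stored as +1 (taken) and -1 (not taken), weights are unbounded integers rather than small saturating ones, and training uses the common rule "update when the prediction was wrong or |y| is at most a threshold". The names `PerceptronPredictor` and `perceptron_alternating_mispredictions` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

class PerceptronPredictor {
public:
    PerceptronPredictor(int table_size, int history_length, int threshold)
        : n_(history_length), theta_(threshold) {
        assert(table_size >= 1 && history_length >= 1 && threshold >= 0);
        weights_.assign(table_size, std::vector<int>(history_length + 1, 0));
        history_.assign(history_length, -1);   // -1 means "not taken"
    }

    // Steps 1-3: hash the branch address, fetch the weights, compute y.
    int output(std::uint32_t pc) const {
        const std::vector<int>& p = weights_[index(pc)];
        int y = p[0];                           // w0, the bias weight
        for (int i = 0; i < n_; ++i) y += p[i + 1] * history_[i];
        return y;
    }

    // Step 4: predict taken unless y is negative.
    bool predict(std::uint32_t pc) const { return output(pc) >= 0; }

    // Steps 5-6: train on the resolved outcome, updating the entry in place.
    void update(std::uint32_t pc, bool taken) {
        const int y = output(pc);
        const int t = taken ? 1 : -1;
        std::vector<int>& p = weights_[index(pc)];
        if ((y >= 0) != taken || std::abs(y) <= theta_) {   // assumed rule
            p[0] += t;
            for (int i = 0; i < n_; ++i) p[i + 1] += t * history_[i];
        }
        // Shift the outcome into the global history (newest at position 0).
        for (int i = n_ - 1; i > 0; --i) history_[i] = history_[i - 1];
        history_[0] = t;
    }

private:
    std::size_t index(std::uint32_t pc) const {
        return (pc >> 2) % weights_.size();
    }

    int n_;
    int theta_;
    std::vector<std::vector<int> > weights_;
    std::vector<int> history_;
};

// Mispredictions on a strictly alternating branch (T, N, T, N, ...),
// ignoring the first `warmup` outcomes.
int perceptron_alternating_mispredictions(int steps, int warmup) {
    PerceptronPredictor p(64, 4, 10);
    int wrong = 0;
    for (int i = 0; i < steps; ++i) {
        bool taken = (i % 2 == 0);
        if (i >= warmup && p.predict(0x400) != taken) ++wrong;
        p.update(0x400, taken);
    }
    return wrong;
}
```

Each table entry holds only history_length + 1 weights, so storage grows linearly with history length rather than exponentially as in a counter table.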
Organization of the Perceptron Predictor

Keeps a table of perceptrons, indexed by branch address
Inputs are from branch history register
Predict taken if output ≥ 0, otherwise predict not taken
Key intuition: table size isn't exponential in history length, so we can consider much longer histories
Results and Analysis for the
Perceptron Predictor
Results: Predictor Accuracy

Perceptron outperforms a competitive hybrid predictor by 36% at a ~4 KB hardware budget: 1.71% vs. 2.66% misprediction rate
Results: Large Hardware Budgets

Multi-component hybrid was the most accurate fully dynamic
predictor known in the literature [Evers 2000]

Perceptron predictor is even more accurate
Results: IPC with high clock rate


Pentium 4-like: 20 cycle misprediction penalty, 1.76 GHz
15.8% higher IPC than gshare, 5.7% higher than hybrid
Analysis: History Length

The fixed-length path branch predictor can also use long histories [Stark, Evers & Patt '98]
Analysis: Training Times

Perceptron "warms up" faster
Future Work and Conclusions
Future Work with Perceptron Predictor

Let's make the best predictor even better
  Better representation
  Better training algorithm

Latency is a problem
  How can we eliminate the latency of the perceptron predictor?
Future Work with Perceptron Predictor

Value prediction
  Predict which set of values is likely to be the result of a load operation to mitigate memory latency

Indirect branch prediction
  Virtual dispatch
  Switch statements in C
Future Work: Characterizing Predictability

Branch predictability, value predictability
  How can we characterize algorithms in terms of their predictability?
  Given an algorithm, how can we transform it so that its branches and values are easier to predict?
  How much predictability is inherent in the algorithm, and how much is an artifact of the program structure?
  How can we compare different algorithms' predictability?
Conclusions

Neural predictors can improve performance for deeply pipelined microprocessors
Perceptron learning is well-suited for microarchitectural implementation
There is still a lot of work left to be done on the perceptron predictor in particular and microarchitectural prediction in general
The End