
Branch Prediction
High-Performance Computer Architecture
Joe Crop
Oregon State University
School of Electrical Engineering and
Computer Science
Control Hazard

[Pipeline diagram: beq r1,r3,label, followed by and r2,r3,r5, or r6,r1,r7, and add r8,r1,r9, each advancing through the Ifetch, Reg, ALU, DMem, and Reg stages one cycle apart; the target instruction label: xor r10,r1,r11 cannot be fetched until the branch outcome is known, creating a control hazard.]

Chapter 2: A Five Stage RISC Pipeline
Branch Penalty Impact
• If CPI = 1 and 30% of instructions are branches,
  a 3-cycle stall per branch => new CPI = 1.9!
• Two-part solution:
  – Determine branch taken or not sooner, AND
  – Compute taken-branch address earlier
• MIPS branch tests if register = 0 or ≠ 0
  – beqz R4, name
• MIPS solution:
  – Move the zero test to the ID/RF stage
  – Add an adder to calculate the new PC in the ID/RF stage
  – 1 clock cycle penalty for a branch versus 3
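The CPI arithmetic above can be checked with a couple of lines (a minimal sketch; the 30% branch frequency and stall counts are the slide's numbers):

```python
def cpi_with_stalls(base_cpi, branch_frac, stall_cycles):
    """Average CPI when every branch stalls the pipeline."""
    return base_cpi + branch_frac * stall_cycles

# Slide's numbers: CPI = 1, 30% branches, 3-cycle stall -> ~1.9.
print(cpi_with_stalls(1.0, 0.30, 3))

# With the zero test moved to ID/RF, the penalty drops to 1 cycle -> ~1.3.
print(cpi_with_stalls(1.0, 0.30, 1))
```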
Modified MIPS Datapath

[Datapath diagram: the five stages — Instruction Fetch (PC, Instruction Memory, IF/ID), Instr. Decode / Reg. Fetch (Register File, Sign Extend of Imm, ID/EX), Execute / Addr. Calc (ALU, EX/MEM), Memory Access (Data Memory, MEM/WB), and Write Back (WB Data). The Zero? test and the branch-target adder sit in the ID/RF stage, so a MUX can select Next PC between Next SEQ PC and the computed branch target.]
Branch Resolved in ID Stage

[Pipeline diagram: beq r1,r3,label followed by and r2,r3,r5 and later instructions; with the branch resolved in the ID stage, the target Label: xor r10,r1,r11 is fetched after only one lost cycle.]
Branch Prediction
• Predict Branch Not Taken
  – Execute successor instructions in sequence.
  – “Squash” instructions in the pipeline if the branch is actually taken.
  – 47% of MIPS branches not taken on average.
  – PC+4 already calculated, so use it to get the next instruction.
• Predict Branch Taken
  – 53% of MIPS branches taken on average.
  – But the branch target address hasn't been calculated yet
    • MIPS still incurs a 1-cycle branch penalty
    • Other machines: branch target known before outcome
• Delayed Branch Technique
Delay Branches
• This technique relies on software to make the delay slots
  valid and useful: the n instructions after the branch are
  executed regardless of whether the branch is taken.
      branch instruction
      sequential successor_1
      sequential successor_2
      ........
      sequential successor_n    } branch delay of length n
      branch target if taken
• 1 delay slot suffices in the 5-stage pipeline, where the branch
  decision and branch target address are ready after one cycle
• MIPS uses this.
Performance Effect of Branch Penalty
Let
  p_b = the probability that an instruction is a branch
  p_t = the probability that a branch is taken
  b   = the branch penalty
  CPI = the average number of cycles per instruction.
Then
  CPI = (1 - p_b) + p_b[p_t(1 + b) + (1 - p_t)]
  CPI = 1 + b·p_t·p_b
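The simplification from the first line to the second can be verified numerically (the probability values below are made up for illustration):

```python
def cpi_full(pb, pt, b):
    # CPI = (1 - pb) + pb*(pt*(1 + b) + (1 - pt))
    return (1 - pb) + pb * (pt * (1 + b) + (1 - pt))

def cpi_simplified(pb, pt, b):
    # CPI = 1 + b*pt*pb
    return 1 + b * pt * pb

# Hypothetical values: 20% branches, 60% taken, 3-cycle penalty.
print(cpi_full(0.2, 0.6, 3), cpi_simplified(0.2, 0.6, 3))  # both ~1.36
```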
Delay Branch Technique
Delay Branch Technique (1)
“From before”
A:=B+C
If B>C Then Goto Next
Delay Slot
...
Next:
becomes
If B>C Then Goto Next
A:=B+C
....
Next:
Delay Branch Technique (2)
“From target”
    Next: X := Y * Z
    ...
    B := A + C
    If B > C Then Goto Next
    Delay Slot
becomes
    Next: X := Y * Z
    ...
    B := A + C
    If B > C Then Goto Next
    X := Y * Z        <- copied into the delay slot
(May need to duplicate the instruction; it must be OK to execute
when the branch is not taken.)
Delay Branch Technique (3)
“From fall through”
    B := A + C
    If B > C Then Goto Next
    Delay Slot
    X := Y * Z
    ...
    Next:
becomes
    B := A + C
    If B > C Then Goto Next
    X := Y * Z        <- moved into the delay slot
    ...
    Next:
(Must be OK to execute when the branch is taken.)
Delay Branch Technique (cont.)
The performance of delayed branches can be modeled by the
following equation:
  CPI = 1 + b·p_b·p_nop
where p_nop is the fraction of the b delay slots filled with
nops. Thus, if f_i is the probability that delay slot i is
filled with a useful instruction, then
  p_nop = 1 - (f_1 + f_2 + … + f_b)/b
Example: Suppose we have the following characteristics:
  b = 4, f_1 = 0.6, f_2 = 0.1, f_3 = f_4 = 0, p_b = 0.2
We have
  CPI = 1 + 4 × 0.2 × 0.825 = 1.66
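The example above can be reproduced directly (a small sketch of the slide's model; `fills` holds the f_i values):

```python
def delayed_branch_cpi(b, fills, pb):
    """CPI = 1 + b*pb*p_nop, where p_nop = 1 - sum(f_i)/b."""
    p_nop = 1 - sum(fills) / b
    return 1 + b * pb * p_nop

# Slide's example: b=4, f=(0.6, 0.1, 0, 0), pb=0.2 -> p_nop=0.825.
cpi = delayed_branch_cpi(4, (0.6, 0.1, 0.0, 0.0), 0.2)
print(round(cpi, 2))  # -> 1.66
```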
Delay Branch Technique (cont.)
The concept of squashing or annulling can be used in
conjunction with delayed branches.
    Next: X := Y * Z
    ...
    B := A + C
    If B > C Then Goto Next
    X := Y * Z    => this instruction is nullified if the
                     branch is not taken
With an annulling branch such as  bne,a rs,rt,label :

    a bit | Branch outcome | Delay inst. executed?
    ------+----------------+----------------------
          | taken          | yes
          | not taken      | yes
      a   | taken          | yes
      a   | not taken      | no (annulled)
Delay Branch Technique (cont.)
• For processors with this capability, the performance can be
  modeled as
    CPI = 1 + b·p_b[p_nop(1 - p_null) + p_null]
  where p_null = (1 - p_t) for nullify-on-branch-not-taken.
• Suppose b = 4, f_1 = 0.8, f_2 = 0.3, f_3 = 0.1, f_4 = 0,
  p_b = 0.2, p_null = 0.35
  => CPI = 1.644
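The annulling model and its example can be checked the same way the slide computes them (a minimal sketch using the slide's numbers):

```python
def annulling_cpi(b, fills, pb, p_null):
    """CPI = 1 + b*pb*(p_nop*(1 - p_null) + p_null)."""
    p_nop = 1 - sum(fills) / b
    return 1 + b * pb * (p_nop * (1 - p_null) + p_null)

# Slide's example: b=4, f=(0.8, 0.3, 0.1, 0), pb=0.2, p_null=0.35.
cpi = annulling_cpi(4, (0.8, 0.3, 0.1, 0.0), 0.2, 0.35)
print(round(cpi, 3))  # -> 1.644
```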
Delayed Branch Performance
• Compiler effectiveness for single branch delay
slot:
– Fills about 60% of branch delay slots.
– About 80% of instructions executed in branch delay
slots useful in computation.
– About 50% (60% x 80%) of slots usefully filled.
Evaluating Branch Alternatives

  Pipeline speedup = Pipeline depth / (1 + Branch frequency × Branch penalty)

Suppose conditional & unconditional branches = 14% of
instructions, and 65% of them change the PC:

  Scheduling scheme | Branch penalty |  CPI | speedup v. unpipelined | speedup v. stall
  ------------------+----------------+------+------------------------+-----------------
  Stall pipeline    |       3        | 1.42 |          3.5           |       1.0
  Predict taken     |       1        | 1.14 |          4.4           |       1.26
  Predict not taken |       1        | 1.09 |          4.5           |       1.29
  Delayed branch    |      0.5       | 1.07 |          4.6           |       1.31
Reducing Branch Penalty
Branch penalty in dynamically scheduled processors:
wasted cycles due to pipeline flushing on mispredicted branches
Reduce branch penalty:
1. Predict branch/jump instructions AND branch direction
(taken or not taken)
2. Predict branch/jump target address (for taken
branches)
3. Speculatively execute instructions along the predicted
path
What to Use and What to Predict
Available info:
  – Current predicted PC
  – Past branch history (direction and target)
What to predict:
  – Conditional branch inst: branch direction and target address
  – Jump inst: target address
  – Procedure call/return: target address
May need the instruction pre-decoded.

[Diagram: the PC indexes the instruction memory (IM) and the
predictors; the predictors produce pred_PC, and PC & instruction
information feed prediction info back to the predictors.]
Mis-prediction Detection and Feedback
Detections:
• At the end of decoding
  – Target address known at decoding, and does not match
  – Flush fetch stage
• At commit (most cases)
  – Wrong branch direction, or target address does not match
  – Flush the whole pipeline
Feedbacks:
• Any time a mis-prediction is detected
• At a branch's commit
  (at EXE: called speculative update)

[Diagram: pipeline stages FETCH (with predictors) -> RENAME ->
SCHD -> EXE -> WB -> COMMIT, with a REB/ROB; feedback paths run
from the detection points back to the predictors.]
Branch Direction Prediction
• Predict branch direction: taken or not taken (T/NT)
      BNE R1, R2, L1     (taken -> L1; not taken -> fall through)
      …
  L1: …
• Static prediction: compilers decide the direction
• Dynamic prediction: hardware decides the direction using
  dynamic information
  1. 1-bit Branch-Prediction Buffer
  2. 2-bit Branch-Prediction Buffer
  3. Correlating Branch Prediction Buffer
  4. Tournament Branch Predictor
  5. and more …
Predictor for a Single Branch
General Form:
  1. Access: the PC indexes the predictor state
  2. Predict: output T/NT
  3. Feedback: the actual outcome (T/NT) updates the state

1-bit prediction:
  State 1 (Predict Taken):     stay on T, go to state 0 on NT
  State 0 (Predict Not Taken): stay on NT, go to state 1 on T
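The 1-bit scheme above can be sketched in a few lines (a toy model, not any particular hardware):

```python
class OneBitPredictor:
    """Single 1-bit predictor: state 1 = predict taken, state 0 = predict not taken."""
    def __init__(self, state=0):
        self.state = state

    def predict(self):
        return self.state == 1  # True = predict taken

    def feedback(self, taken):
        self.state = 1 if taken else 0  # always follow the last outcome

p = OneBitPredictor()
predictions = []
for taken in [True, True, False, True]:
    predictions.append(p.predict())
    p.feedback(taken)
print(predictions)  # [False, True, True, False]
```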
Branch History Table of 1-bit Predictors
• The BHT is also called the Branch Prediction Buffer in the textbook
• Could use only one 1-bit predictor, but accuracy is low
• BHT: use a table of simple predictors, indexed by k bits from
  the PC (2^k entries)
• Similar to a direct-mapped cache
• More entries mean more cost, but fewer conflicts and higher
  accuracy
• The BHT can contain complex predictors

[Diagram: k bits of the branch address index a 2^k-entry table;
the selected entry supplies the prediction.]
1-bit BHT Weakness
• Example: in a loop, a 1-bit BHT will cause 2 mispredictions
  per loop execution
• Consider a loop of 9 iterations before exit:
  for (…){
    for (i=0; i<9; i++)
      a[i] = a[i] * 2.0;
  }
  – End-of-loop case: it exits instead of looping as before
  – First time through the loop on the next pass through the
    code: it predicts exit instead of looping
  – Only 80% accuracy even though the branch is taken 90% of
    the time
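The 80% figure can be reproduced by simulating a 1-bit predictor on the loop branch (assuming, as the slide does, a branch that is taken 9 times and then not taken once per loop execution):

```python
def one_bit_accuracy(outcomes):
    state, correct = 0, 0            # 1-bit predictor, initialized to not taken
    for taken in outcomes:
        correct += (state == 1) == taken
        state = 1 if taken else 0    # follow the last outcome
    return correct / len(outcomes)

# Loop branch: taken 9x then not taken, over many loop executions.
trace = ([True] * 9 + [False]) * 100
print(one_bit_accuracy(trace))  # 0.8: 2 mispredictions per 10 branches
```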
2-bit Saturating Counter
• Solution: a 2-bit scheme where the prediction changes only
  after two consecutive mispredictions: (Figure 3.7, p. 249)

    11  Predict Taken
    10  Predict Taken
    01  Predict Not Taken
    00  Predict Not Taken

  On a taken branch (T) the counter moves toward 11; on a
  not-taken branch (NT) it moves toward 00.
• Gray (in the original figure): stop, not taken
• Blue (in the original figure): go, taken
• Adds hysteresis to the decision-making process
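A 2-bit counter can be modeled similarly; the sketch below uses the common saturating increment/decrement variant (the figure's exact transition arrows were lost in extraction, so this is an assumption), and shows the accuracy on the same loop branch rising from 80% to 90%:

```python
class TwoBitPredictor:
    """Saturating 2-bit counter: states 0,1 predict not taken; 2,3 predict taken.
    (Common increment/decrement variant; the figure's FSM may differ slightly.)"""
    def __init__(self, state=3):
        self.state = state

    def predict(self):
        return self.state >= 2

    def feedback(self, taken):
        if taken:
            self.state = min(3, self.state + 1)  # move toward 11
        else:
            self.state = max(0, self.state - 1)  # move toward 00

# Loop branch (taken 9x, not taken once): now only 1 misprediction
# per loop execution, so 90% accuracy instead of 80%.
p = TwoBitPredictor()
trace = ([True] * 9 + [False]) * 100
correct = 0
for taken in trace:
    correct += p.predict() == taken
    p.feedback(taken)
print(correct / len(trace))  # -> 0.9
```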
Correlating Branches
Code example showing the potential:
    if (d==0)
        d=1;
    if (d==1)
        …
Assembly code:
        BNEZ   R1, L1       ; branch b1
        DADDIU R1, R0, #1
    L1: DADDIU R3, R1, #-1
        BNEZ   R3, L2       ; branch b2
        …
    L2: …
Observation: if the first BNEZ is not taken, then the second
BNEZ is also not taken.
(1, 1) Predictor
• (1,1) predictor: 1 bit of last-branch history, 1-bit prediction
• We use a pair of bits, where the first bit is the prediction if
  the last branch in the program was not taken, and the second bit
  is the prediction if the last branch was taken.

  Prediction bits | If last branch not taken | If last branch taken
  ----------------+--------------------------+---------------------
  NT/NT           | Not Taken                | Not Taken
  NT/T            | Not Taken                | Taken
  T/NT            | Taken                    | Not Taken
  T/T             | Taken                    | Taken
Chapter 3 - Exploiting ILP
(1, 1) Predictor: Example
• Consider the following code, assuming d is assigned to R1.
      if (d==0)
          d=1;
      if (d==1)
          ...

          bnez R1, L1        ; branch b1 (d!=0)
          addi R1, R0, #1    ; d==0, so d=1
      L1: subi R3, R1, #1
          bnez R3, L2        ; branch b2 (d!=1)
          ...
      L2:
• Suppose d alternates between 2 and 0, with the (1, 1) predictor
  initialized to not taken. The prediction used is the bit selected
  by the previous branch's outcome.

  d=? | b1 pred | b1 action | new b1 pred | b2 pred | b2 action | new b2 pred
  ----+---------+-----------+-------------+---------+-----------+------------
   2  |  NT/NT  |     T     |    T/NT     |  NT/NT  |     T     |    NT/T
   0  |  T/NT   |    NT     |    T/NT     |  NT/T   |    NT     |    NT/T
   2  |  T/NT   |     T     |    T/NT     |  NT/T   |     T     |    NT/T
   0  |  T/NT   |    NT     |    T/NT     |  NT/T   |    NT     |    NT/T

• The only mispredictions are on the first iteration, when d=2,
  because b1 had not yet been correlated with the previous outcome
  of b2.
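The table can be reproduced with a short simulation (my own minimal model of the (1,1) scheme: each branch keeps two 1-bit predictions, one used when the previous branch was not taken and one when it was):

```python
def simulate_11(d_values):
    # One [NT-selected, T-selected] 1-bit pair per branch, initialized NT/NT.
    pred = {"b1": [False, False], "b2": [False, False]}
    last_taken = False
    rows = []
    for d in d_values:
        b1_taken = (d != 0)
        d_after = 1 if d == 0 else d          # the 'd=1' assignment
        b2_taken = (d_after != 1)
        for name, taken in (("b1", b1_taken), ("b2", b2_taken)):
            guess = pred[name][last_taken]    # bit selected by last branch
            rows.append((name, guess, taken))
            pred[name][last_taken] = taken    # update only the selected bit
            last_taken = taken
    return rows

rows = simulate_11([2, 0, 2, 0])
mispredicts = [r for r in rows if r[1] != r[2]]
print(mispredicts)  # only b1 and b2 of the first iteration (d=2)
```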
(1, 1) Predictor: Example
• If we had used a 1-bit predictor:

  d=? | b1 pred | b1 action | new b1 pred | b2 pred | b2 action | new b2 pred
  ----+---------+-----------+-------------+---------+-----------+------------
   2  |   NT    |     T     |      T      |   NT    |     T     |      T
   0  |   T     |    NT     |     NT      |   T     |    NT     |     NT
   2  |   NT    |     T     |      T      |   NT    |     T     |      T
   0  |   T     |    NT     |     NT      |   T     |    NT     |     NT

• We would have had all the branches mispredicted!
(m, n) Predictor
In general, an (m, n) predictor uses the behavior of the last m
branches (kept in a shift register) to choose from among 2^m
branch predictors, each of which is an n-bit predictor for a
single branch.
Performance of (2, 2) Predictor
• Improvement is most noticeable in the integer benchmarks.
• The (m, n) predictor outperforms the 2-bit predictor, even when
  the 2-bit predictor is given unlimited entries!

[Figure: misprediction rates across SPEC benchmarks; the gap is
largest on the integer benchmarks.]
Tournament Predictors
• Use multiple predictors, usually one based on local information
  and one based on global information.
  – Local predictors are better for some branches
  – Global predictors are better at exploiting correlation
• A selector is used to choose among the predictors, usually a
  2-bit saturating counter.

[Selector state diagram: states 00, 01, 10, 11, with transitions
labeled n/m, where n says whether the left predictor was correct
and m whether the right predictor was correct (0 = incorrect,
1 = correct).]
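A selector of this kind can be sketched as follows (a toy model; the `OneBit` component and its predict/feedback interface are illustrative, not the Alpha hardware):

```python
class OneBit:
    """Minimal 1-bit component predictor (illustrative interface)."""
    def __init__(self):
        self.s = 0
    def predict(self):
        return self.s == 1
    def feedback(self, taken):
        self.s = 1 if taken else 0

class Tournament:
    """2-bit saturating selector between two component predictors.
    The selector moves toward whichever predictor was correct when
    exactly one of the two was right."""
    def __init__(self, pred_a, pred_b):
        self.a, self.b = pred_a, pred_b
        self.sel = 0  # 0,1 -> use predictor a; 2,3 -> use predictor b

    def predict(self):
        return self.a.predict() if self.sel < 2 else self.b.predict()

    def feedback(self, taken):
        a_ok = self.a.predict() == taken
        b_ok = self.b.predict() == taken
        if a_ok and not b_ok:
            self.sel = max(0, self.sel - 1)
        elif b_ok and not a_ok:
            self.sel = min(3, self.sel + 1)
        self.a.feedback(taken)
        self.b.feedback(taken)

t = Tournament(OneBit(), OneBit())
for taken in [True] * 5:
    t.feedback(taken)
print(t.predict())  # True after a run of taken branches
```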
Example: Alpha 21264 Branch Predictor
The 21264 uses one of the most sophisticated branch predictors of
its generation.

[Diagram: a tournament predictor. A 2-bit selector chooses between
a global predictor (2-bit saturating counters indexed by the last
12 outcomes of all branches) and a local predictor (3-bit
saturating counters indexed by the last 10 outcomes of this
branch).]
Tournament Predictor in Alpha 21264
• The local predictor is a 2-level predictor:
  – Top level: a local history table of 1024 10-bit entries
    (1K × 10 bits); each entry records the most recent 10
    outcomes of the branches mapping to it. A 10-bit history
    allows patterns of up to 10 branches to be discovered and
    predicted.
  – Next level: the selected entry from the local history table
    indexes a table of 1K 3-bit saturating counters (1K × 3 bits),
    which provide the local prediction.
• Total size: 4K×2 + 4K×2 + 1K×10 + 1K×3 = 29K bits!
  (~180K transistors)
% of Predictions from Local Predictor in the Tournament Scheme

[Figure: fraction of predictions taken from the local predictor —
nasa7 98%, matrix300 100%, tomcatv 94%, doduc 90%, spice 55%,
fpppp 76%, gcc 72%, espresso 63%, eqntott 37%, li 69%.]
Accuracy of Branch Prediction

[Figure 3.40: branch-prediction accuracy (roughly 70%–100%) of
profile-based, 2-bit counter, and tournament predictors on SPEC
benchmarks (tomcatv, doduc, fpppp, li, espresso, gcc); the
tournament predictor is the most accurate.]

• Profile: branch profile from a previous execution (static in
  that it is encoded in the instruction, but based on a profile)
Accuracy v. Size (SPEC89)

[Figure: conditional branch misprediction rate (0–10%) versus
total predictor size (0–128 Kbits) for local 2-bit counters, the
correlating (2,2) scheme, and the tournament predictor. The
tournament predictor has the lowest misprediction rate at every
size, and all three curves flatten out beyond a modest size.]
Power Consumption

[Figure: BlueRISC's compiler-driven power-aware branch prediction
compared with a 512-entry BTAC bimodal predictor (patent-pending).
Copyright 2007 CAM & BlueRISC.]
Pitfall: Sometimes dumber is better
• The Alpha 21264 uses a tournament predictor (29 Kbits)
• The earlier 21164 uses a simple 2-bit predictor with 2K entries
  (a total of 4 Kbits)
• On SPEC95 benchmarks, the 21264 outperforms:
  – 21264 avg. 11.5 mispredictions per 1000 instructions
  – 21164 avg. 16.5 mispredictions per 1000 instructions
• Reversed for transaction processing (TP)!
  – 21264 avg. 17 mispredictions per 1000 instructions
  – 21164 avg. 15 mispredictions per 1000 instructions
• TP code is much larger, and the 21164 holds 2X as many branch
  predictions based on local behavior (2K vs. 1K local-predictor
  entries in the 21264)
• What about power?
  – Large predictors give some increase in prediction rate, but
    at a large power cost
Branch Target Buffer
The BTB acts as a cache for branch target addresses (BTAs),
eliminating the cycles otherwise wasted on each branch to
calculate the BTA.
BTB (cont.)
The BTA and the outcome of the branch are known by the end of the
ID stage, but are not relayed until the EX stage.
BTB (cont.)

  Penalty (cycles) | Taken | Not Taken
  -----------------+-------+----------
  Buffer hit       |   0   |    2
  Buffer miss      |   2   |    0

The performance of the BTB can be modeled by

  CPI = 1 + b·p_b(1 - p_m)·p_w + b·p_b·p_m·p_w + c·p_m·p_b·p_t

where b is the normal branch penalty, c is the number of cycles
required to service a BTB miss, and p_m is the probability of a
BTB miss. The probability of a wrong prediction p_w depends on
whether there was a BTB miss or a hit. In the case of a BTB hit,
p_w = 1 - p_t; for a BTB miss, p_w = p_t. Thus

  CPI = 1 + b·p_b(1 - p_m)(1 - p_t) + (b + c)·p_b·p_m·p_t
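The closed-form model above is easy to evaluate (the parameter values below are hypothetical, chosen only to illustrate):

```python
def btb_cpi(b, c, pb, pt, pm):
    """CPI = 1 + b*pb*(1-pm)*(1-pt) + (b+c)*pb*pm*pt (closed form above)."""
    return 1 + b * pb * (1 - pm) * (1 - pt) + (b + c) * pb * pm * pt

# Hypothetical numbers: b=2 (branch penalty), c=1 (miss-service cycles),
# 20% branches, 60% taken, 10% BTB miss rate.
print(round(btb_cpi(2, 1, 0.2, 0.6, 0.1), 3))  # ~1.18
```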
Return Address Prediction
The BTB and BPB do a good job of predicting how past behavior
will repeat. However, the subroutine call/return paradigm makes
correct prediction difficult.

  100  jal subr
  104  ...
  108  ...
  112  jal subr
  116  ...
  ...
  500  subr: ...
       ...
  520  jr $31

The BTB then contains the following after the second subroutine
call:

  Inst. Addr | Target Addr
  -----------+------------
     100     |    500
     520     |    104
     112     |    500

When we return from subr, we get a hit on a valid entry in the
BTB (Inst. Addr = 520) and predict that we will return to address
104. However, this is not correct: the next instruction should
be 116!
Subroutine Return Stack
To avoid such mispredictions, a subroutine return stack can be
used to augment the BTB.
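A return-address stack can be sketched in a few lines (a toy model; the fixed depth and overflow policy are assumptions, not a specific design):

```python
class ReturnAddressStack:
    """Sketch of a return-address stack: calls push the return address,
    returns pop it; a small fixed depth models real hardware."""
    def __init__(self, depth=8):
        self.stack, self.depth = [], depth

    def push(self, return_addr):          # on a call (e.g. jal)
        if len(self.stack) == self.depth:
            self.stack.pop(0)             # overflow: drop the oldest entry
        self.stack.append(return_addr)

    def predict_return(self):             # on a return (e.g. jr $31)
        return self.stack.pop() if self.stack else None

# The slide's scenario: calls at 100 and 112; the BTB alone would have
# predicted 104 for the second return, but the stack gets both right.
ras = ReturnAddressStack()
ras.push(104)                 # jal at 100
addr1 = ras.predict_return()  # jr: correctly predicts 104
ras.push(116)                 # jal at 112
addr2 = ras.predict_return()  # jr: correctly predicts 116, not 104
print(addr1, addr2)  # 104 116
```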
Performance of SRS
[Figure: performance of the subroutine return stack on SPEC 95.]
Pentium 4’s Branch Predictor
• “Unveiling the Intel Branch Predictors”
– Pentium 4
– http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1597026
Neural Branch Predictors
• “Towards a High Performance Neural
Branch Predictor”
– http://webspace.ulbsibiu.ro/lucian.vintan/html/USA.pdf
– The main advantage of the neural predictor is its ability to
exploit long histories while requiring only linear resource
growth
– Used in IA-64 simulators
Core 2’s Branch Predictor?
• TAGE: TAgged GEometric history-length predictor
TAGE Performance
To Learn More