Dynamic Branch Prediction and Speculation



ENGS 116 Lecture 9
Dynamic Branch Prediction
and Speculation
Vincent H. Berk
October 10, 2005
Reading for today: Chapter 3.2 – 3.6
Reading for Wednesday: Chapter 3.7-3.9, 4.1
Homework #2: due Friday 14th, 2.8, A.2, A.13, 3.6a&b, 3.10, 4.5,
4.8, (4.13 optional)
Project Proposals due Wednesday!
Project Proposals
• 2 pages
• Names and Title
• Introduction to problem domain
• Research question
– goal of your project
• Work plan
– e.g.: 2 weeks programming, 1 week experiments, 1 week writing
• References
– books, websites, articles
Dynamic Branch Prediction
• Control dependences limit ILP
• Performance = (accuracy, cost of misprediction)
• Branches arrive much faster when multiple instructions are issued
per clock
– Amdahl’s Law
• Want to predict outcome of branch as early as possible
• Methods:
– Branch history table (1 or more bits)
– Correlated branches
– Branch target buffer
Dynamic Branch Prediction: A Simple Approach
• Branch History Table (BHT) (aka Branch Prediction Buffer)
– Lower bits of PC address index table of 1-bit values
– Entry says whether or not branch was taken last time
– No address check
• Problem: In a loop, 1-bit BHT will cause two mispredictions
– First iteration, the next time the loop is entered, when it predicts exit instead of looping
– End of loop case, when it exits instead of looping as before
[Figure: lower bits of the PC index a table of 1-bit entries, each holding T or NT]
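The two mispredictions per loop execution can be checked with a small simulation. This is a minimal sketch, not taken from the lecture: it models a single 1-bit entry and assumes a loop-closing branch that is taken on every iteration except the last. The names `simulate_one_bit` and `loop_trace` are hypothetical.

```python
def simulate_one_bit(outcomes, initial=False):
    """Count mispredictions of a single 1-bit BHT entry over a branch trace."""
    last_taken = initial  # the 1-bit entry: was the branch taken last time?
    mispredictions = 0
    for taken in outcomes:
        if last_taken != taken:
            mispredictions += 1
        last_taken = taken  # the entry always records the most recent outcome
    return mispredictions


def loop_trace(iterations, executions):
    """Assumed loop branch: taken (iterations - 1) times, then not taken at exit."""
    return ([True] * (iterations - 1) + [False]) * executions


# A 10-iteration loop executed 5 times, with the entry initially set to "taken".
trace = loop_trace(iterations=10, executions=5)
total = simulate_one_bit(trace, initial=True)
print(total, "mispredictions in", len(trace), "branches")
```

After the first execution of the loop, every later execution costs two mispredictions: one on re-entry and one on exit.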
Dynamic Branch Prediction: A Better Way
Solution: 2-bit scheme where the prediction changes only after two
consecutive mispredictions.
Helps when target is known before result of condition.
[Figure: state diagram of the 2-bit prediction scheme, with two "Predict taken" states and two "Predict not taken" states; transitions are labeled Taken and Not taken]
BHT General Case
• n-bit predictor:
– counter can hold values between 0 and 2^n - 1
– predict taken when value is greater than or equal to half
of maximum value: 2^(n-1)
– The counter is incremented on each taken branch
– and decremented on each not taken branch (saturating at both ends)
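The n-bit counter described above can be sketched as follows. This is an illustrative model only: the class name `SaturatingCounter`, the helper `count_mispredictions`, and the loop trace are assumptions, not part of the lecture.

```python
class SaturatingCounter:
    """n-bit saturating counter: predict taken when value >= 2^(n-1)."""

    def __init__(self, bits=2, value=0):
        self.max_value = (1 << bits) - 1   # 2^n - 1
        self.threshold = 1 << (bits - 1)   # 2^(n-1), half of the range
        self.value = value

    def predict(self):
        return self.value >= self.threshold

    def update(self, taken):
        # Increment on taken, decrement on not taken, saturating at both ends.
        if taken:
            self.value = min(self.value + 1, self.max_value)
        else:
            self.value = max(self.value - 1, 0)


def count_mispredictions(counter, outcomes):
    misses = 0
    for taken in outcomes:
        if counter.predict() != taken:
            misses += 1
        counter.update(taken)
    return misses


# Assumed trace: a 10-iteration loop branch (9 taken, 1 not taken), run 5 times.
loop = ([True] * 9 + [False]) * 5
two_bit_misses = count_mispredictions(SaturatingCounter(bits=2, value=3), loop)
one_bit_misses = count_mispredictions(SaturatingCounter(bits=1, value=1), loop)
print(two_bit_misses, one_bit_misses)
```

On this trace the 2-bit counter mispredicts only the loop exit, because a single not-taken outcome is not enough to flip its prediction.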
BHT Accuracy
• Mispredict because either:
– Wrong guess for that branch
– Got branch history of wrong branch from index table
• 4096-entry table: programs vary from 1% misprediction
(nasa7, tomcatv) to 18% (eqntott), with spice at 9% and
gcc at 12%.
• 4096 entries about as good as infinite number of entries
• 2-bit predictors work nearly as well as more-bit predictors
Correlating Branches
• Hypothesis: recent branches are correlated; that is, behavior of
recently-executed branches affects prediction of current branch
if (d == 0)
d = 1;
if (d == 1)
…
Correlated Branch Prediction
• Idea: record m most recently executed branches as taken or not taken,
and use that pattern to select the proper n-bit branch history table
• In general, (m,n) predictor means record last m branches to select
between 2^m history tables, each with n-bit counters
– Thus, old 2-bit BHT is a (0,2) predictor
• Global Branch History: m-bit shift register keeping T/NT status of last
m branches.
• Each entry in table has 2^m n-bit predictors.
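An (m,n) predictor along these lines might look like the sketch below. It is a simplified model under stated assumptions: the class name `CorrelatingPredictor`, the `index_bits` parameter, the `run` helper, and the two-branch trace (a second branch whose outcome always matches the first, which alternates) are all hypothetical and chosen only to show the effect of global history.

```python
class CorrelatingPredictor:
    """(m,n) predictor: m bits of global history select one of 2^m n-bit counters."""

    def __init__(self, m=2, n=2, index_bits=4):
        self.max_value = (1 << n) - 1
        self.threshold = 1 << (n - 1)
        self.index_mask = (1 << index_bits) - 1
        self.history_mask = (1 << m) - 1
        self.history = 0  # m-bit global shift register, newest outcome in bit 0
        # One row per index; each row holds 2^m n-bit counters.
        self.table = [[0] * (1 << m) for _ in range(1 << index_bits)]

    def predict(self, pc):
        return self.table[pc & self.index_mask][self.history] >= self.threshold

    def update(self, pc, taken):
        row = self.table[pc & self.index_mask]
        if taken:
            row[self.history] = min(row[self.history] + 1, self.max_value)
        else:
            row[self.history] = max(row[self.history] - 1, 0)
        # Shift the outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


def run(predictor, trace):
    misses = 0
    for pc, taken in trace:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


# Assumed trace: two branches with identical outcomes that alternate T, NT, ...
trace = []
for i in range(20):
    outcome = (i % 2 == 0)
    trace += [(0x10, outcome), (0x14, outcome)]

misses_02 = run(CorrelatingPredictor(m=0, n=2), trace)  # plain 2-bit BHT
misses_22 = run(CorrelatingPredictor(m=2, n=2), trace)
print(misses_02, misses_22)
```

With m = 0 the history register is always zero and each row holds one counter, so the sketch reduces to the plain 2-bit BHT.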
Correlating Branches
(2,2) predictor
– Behavior of recent branches selects between four predictions of next branch, updating just that prediction
[Figure: the branch address (4 bits) selects a set of four 2-bit per-branch predictors; the 2-bit global branch history selects which of the four supplies the prediction]
Accuracy of Different Schemes
(FROM SECOND EDITION)
[Figure: bar chart of frequency of mispredictions (axis from 0% to 20%) for the benchmarks nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, and li, comparing three predictors: 4,096 entries with 2 bits per entry; unlimited entries with 2 bits per entry; and 1,024 entries (2,2)]
Tournament Predictors
• Multilevel branch predictor
• Use n-bit saturating counter to choose between predictors
• Usual choice between global and local predictors
Tournament Predictors: DEC Alpha 21264
Tournament predictor using 4K 2-bit counters indexed by local branch
address. Chooses between:
• Global predictor
– 4K entries indexed by history of last 12 branches (2^12 = 4K)
– Each entry is a standard 2-bit predictor
• Local predictor
– Local history table: 1024 10-bit entries recording last 10 branches,
indexed by branch address
– The pattern of the last 10 occurrences of that particular branch is used
to index a table of 1K entries with 3-bit saturating counters
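The structure described above can be sketched in a few lines. This is a simplified model that follows the slide's table sizes, not a description of the real hardware: the class and method names are hypothetical, the index `pc >> 2` assumes 4-byte instructions, and the chooser policy (train toward whichever predictor was right when the two disagree) is an assumption.

```python
def bump(value, taken, max_value):
    """Saturating increment on taken, decrement on not taken."""
    return min(value + 1, max_value) if taken else max(value - 1, 0)


class TournamentPredictor:
    def __init__(self):
        self.global_history = 0          # outcomes of the last 12 branches
        self.global_table = [0] * 4096   # 2-bit counters, indexed by global history
        self.local_history = [0] * 1024  # 10-bit history per branch-address slot
        self.local_table = [0] * 1024    # 3-bit counters, indexed by local pattern
        self.chooser = [0] * 4096        # 2-bit counters; >= 2 means "use global"

    def _both(self, pc):
        global_pred = self.global_table[self.global_history] >= 2
        pattern = self.local_history[(pc >> 2) % 1024]
        local_pred = self.local_table[pattern] >= 4
        return global_pred, local_pred

    def predict(self, pc):
        global_pred, local_pred = self._both(pc)
        use_global = self.chooser[(pc >> 2) % 4096] >= 2
        return global_pred if use_global else local_pred

    def update(self, pc, taken):
        global_pred, local_pred = self._both(pc)
        if global_pred != local_pred:
            # Assumed policy: move the chooser toward the predictor that was right.
            slot = (pc >> 2) % 4096
            self.chooser[slot] = bump(self.chooser[slot], global_pred == taken, 3)
        gh = self.global_history
        self.global_table[gh] = bump(self.global_table[gh], taken, 3)
        slot = (pc >> 2) % 1024
        pattern = self.local_history[slot]
        self.local_table[pattern] = bump(self.local_table[pattern], taken, 7)
        # Shift the outcome into both history registers.
        self.local_history[slot] = ((pattern << 1) | int(taken)) & 0x3FF
        self.global_history = ((gh << 1) | int(taken)) & 0xFFF


predictor = TournamentPredictor()
cold_prediction = predictor.predict(0x40)
for _ in range(40):
    predictor.update(0x40, True)   # train on an always-taken branch
warm_prediction = predictor.predict(0x40)
print(cold_prediction, warm_prediction)
```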
Branch Target Buffers
• Branch target calculation is costly and stalls the instruction fetch.
• BTB stores PCs the same way as caches
• The PC of a branch is sent to the BTB
• When a match is found the corresponding Predicted PC is returned
• If the branch was predicted taken, instruction fetch continues at the
returned predicted PC
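The lookup described above can be sketched as a small direct-mapped, tagged table. This is a minimal illustration under assumptions: the class name `BranchTargetBuffer`, the 16-entry size, the `pc >> 2` index (4-byte instructions), and the fall-through of `pc + 4` are all hypothetical choices.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB: each slot holds (branch PC, predicted target PC)."""

    def __init__(self, entries=16):
        self.entries = entries
        self.table = [None] * entries

    def _slot(self, pc):
        return (pc >> 2) % self.entries  # assumes 4-byte instructions

    def lookup(self, pc):
        entry = self.table[self._slot(pc)]
        # Like a cache, the stored PC acts as a tag and must match exactly.
        if entry is not None and entry[0] == pc:
            return entry[1]
        return None

    def insert(self, pc, target):
        self.table[self._slot(pc)] = (pc, target)

    def delete(self, pc):
        slot = self._slot(pc)
        if self.table[slot] is not None and self.table[slot][0] == pc:
            self.table[slot] = None


def next_fetch_pc(btb, pc):
    """Fetch continues at the predicted PC on a hit, else at the next instruction."""
    predicted = btb.lookup(pc)
    return predicted if predicted is not None else pc + 4


btb = BranchTargetBuffer()
miss_pc = next_fetch_pc(btb, 0x100)   # no entry yet: fall through
btb.insert(0x100, 0x80)               # record a taken branch and its target
hit_pc = next_fetch_pc(btb, 0x100)    # hit: fetch from the predicted target
print(hex(miss_pc), hex(hit_pc))
```

Because the full PC is compared, a different instruction that maps to the same slot does not pick up another branch's target.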
Figure 3.20 The steps involved in handling an instruction with a branch-target buffer
• IF: Send PC to memory and branch-target buffer. Entry found in branch-target buffer?
• No entry found:
– ID: Is instruction a taken branch?
– If no: normal instruction execution
– If yes (EX): enter branch instruction PC and next PC into branch-target buffer
• Entry found:
– ID: Send out predicted PC
– EX: Branch taken?
– If no: mispredicted branch, kill fetched instruction; restart fetch at other target; delete entry from target buffer
– If yes: branch correctly predicted; continue execution with no stalls
Multiple Issue Machines
• Superscalar: multiple parallel dedicated pipelines:
– Varying number of instructions per cycle, scheduled by compiler
and/or by hardware (Tomasulo)
– IBM PowerPC, Sun UltraSparc, DEC Alpha, IA32 Pentium
• VLIW (Very Long Instruction Word): multiple operations encoded in
instruction:
– Instructions have wide template (4-16 operations)
– IA-64 Itanium
Getting CPI < 1: Issuing Multiple Instructions/Cycle
• Superscalar DLX: 2 instructions, 1 FP & 1 anything else
– Fetch 64-bits/clock cycle; integer on left, FP on right
– Can only issue 2nd instruction if 1st instruction issues
– More ports for FP registers to do FP load & FP op in a pair
Type              Pipe Stages
Int. instruction  IF  ID  EX  MEM WB
FP instruction    IF  ID  EX  MEM WB
Int. instruction      IF  ID  EX  MEM WB
FP instruction        IF  ID  EX  MEM WB
Int. instruction          IF  ID  EX  MEM WB
FP instruction            IF  ID  EX  MEM WB
• 1 cycle load delay expands to 3 instructions in superscalar DLX
– Instruction in right half can’t use it, nor instructions in next slot
Multiple Issue Challenges
• While Integer/FP split is simple for the HW, get CPI of 0.5 only for
programs with:
– Exactly 50% FP operations
– No hazards
• If more instructions issued at same time, greater difficulty in decode
and issue
– Even 2-way superscalar => examine 2 opcodes, 6 register specifiers, & decide if
1 or 2 instructions can issue
• VLIW: tradeoff instruction space for simple decoding
– The long instruction word has room for many operations
– By definition, all the operations the compiler puts in the long instruction
word are independent => execute in parallel
– E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch => 16 to 24
bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
– Need compiling technique that schedules across several branches
Limits to Multi-Issue Machines
• Inherent limitations of instruction-level parallelism
– 1 branch in 5: How to keep a 5-way VLIW busy?
– Latencies of units: many operations must be scheduled
– Easy: More instruction bandwidth
– Easy: Duplicate functional units to get parallel execution
– Hard: Increase ports to register file (bandwidth)
• VLIW example needs 7 reads and 3 writes for integer registers
& 5 reads and 3 writes for FP registers
– Harder: Increase ports to memory (bandwidth)
– Decoding superscalar and impact on clock rate, pipeline depth?
Hardware-Based Speculation
• Instead of just instruction fetch and decode, also execute instructions
based on prediction of branch.
• Execute instructions out of order as soon as their operands are
available.
• Delay instruction commit until the branch outcome is known.
• Re-order instructions after execution and commit them in order
– reorder buffer or ROB
– register file not updated until commit
• Do not raise exceptions until instruction is committed
• ROB holds and provides operands until commit.
Tomasulo with Speculation
1. Issue – Requires an empty reservation station and an empty ROB slot. Send
operands to the reservation station from the register file or from the ROB. This
stage is often referred to as dispatch.
2. Execute – Monitor the CDB for operands, checking for RAW hazards. When
both operands are available, execute.
3. Write Result – When the result is available, write it on the CDB to the ROB
and to any waiting reservation stations. Stores write to the value field in
the ROB.
4. Commit – Three cases:
• Normal Commit: write registers, in order commit
• Store: update memory
• Incorrect branch: flush ROB, reservation stations and restart
execution at correct PC
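The three commit cases above can be illustrated with a toy reorder buffer. This is a deliberately reduced sketch: it leaves out reservation stations, the CDB, and restarting fetch, and the class name `ReorderBuffer`, the entry fields, and the instruction kinds ("alu", "store", "branch") are hypothetical.

```python
from collections import deque


class ReorderBuffer:
    """Toy ROB: results may arrive out of order, but commit is strictly in order."""

    def __init__(self):
        self.entries = deque()
        self.registers = {}  # architectural register file, updated only at commit
        self.memory = {}     # memory, updated only when a store commits

    def issue(self, kind, dest=None):
        entry = {"kind": kind, "dest": dest, "value": None,
                 "ready": False, "mispredicted": False}
        self.entries.append(entry)
        return entry

    def write_result(self, entry, value=None, mispredicted=False):
        entry["value"] = value
        entry["mispredicted"] = mispredicted
        entry["ready"] = True

    def commit(self):
        """Commit ready entries from the head; return the kinds committed."""
        committed = []
        while self.entries and self.entries[0]["ready"]:
            entry = self.entries.popleft()
            committed.append(entry["kind"])
            if entry["kind"] == "alu":
                self.registers[entry["dest"]] = entry["value"]   # normal commit
            elif entry["kind"] == "store":
                self.memory[entry["dest"]] = entry["value"]      # update memory
            elif entry["kind"] == "branch" and entry["mispredicted"]:
                self.entries.clear()  # flush everything issued after the branch
                break
        return committed


rob = ReorderBuffer()
add = rob.issue("alu", "r1")
branch = rob.issue("branch")
speculative = rob.issue("alu", "r2")  # issued past the unresolved branch

rob.write_result(speculative, 7)      # finishes first, but cannot commit yet
first = rob.commit()
rob.write_result(add, 5)
second = rob.commit()
rob.write_result(branch, mispredicted=True)
third = rob.commit()
print(first, second, third, rob.registers)
```

The speculative result for r2 never reaches the register file: it waits behind the branch and is discarded when the misprediction commits.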
Problems with speculation
• Multi Issue Machines:
– Must be able to commit multiple instructions from ROB
– More registers, more renaming
• How much speculation:
– How many branches deep?
– What to do on a cache miss?
– TLB miss?
– Cache interference due to incorrect branch prediction
Figure 3.41: Number of registers available for renaming.
Figure 3.45: Window size: the number of instructions the issue unit may
look ahead and schedule from.
HW Support for More ILP
• Avoid branch prediction by turning branches
into conditionally executed instructions:
If (X) then A = B op C else NOP
– If false, then neither store result nor cause
exception
– Expanded ISA of Alpha, MIPS, PowerPC,
SPARC have conditional move; PA-RISC can
annul any following instruction.
– IA-64: 64 1-bit condition (predicate) fields allow
conditional execution of any instruction
• Drawbacks to conditional instructions
– Still takes a clock even if “annulled”
– Stall if condition evaluated late
– Complex conditions reduce effectiveness;
condition becomes known late in pipeline