CS152: Computer Architecture and Engineering
CENG 450
Computer Systems and Architecture
Lecture 11
Amirali Baniasadi
[email protected]
1
This Lecture
Branch Prediction
Multiple Issue
2
Branch Prediction
Predicting the outcome of a branch
Direction:
Taken / Not Taken
Direction predictors
Target Address
PC+offset (Taken)/ PC+4 (Not Taken)
Target address predictors
• Branch Target Buffer (BTB)
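A minimal sketch of the target-address side, assuming a direct-mapped BTB and 4-byte instructions. The class name `BranchTargetBuffer`, the helper `next_fetch_pc`, the entry count, and the use of the full PC as tag are illustrative assumptions, not taken from the slides:

```python
class BranchTargetBuffer:
    """Direct-mapped BTB sketch: remembers the last target seen for a branch PC."""

    def __init__(self, entries=512):
        self.entries = entries
        self.tags = [None] * entries   # full PC stored as tag (assumption)
        self.targets = [0] * entries

    def _index(self, pc):
        # Drop the byte offset; assumes 4-byte, aligned instructions.
        return (pc >> 2) % self.entries

    def lookup(self, pc):
        i = self._index(pc)
        return self.targets[i] if self.tags[i] == pc else None

    def update(self, pc, target):
        i = self._index(pc)
        self.tags[i] = pc
        self.targets[i] = target


def next_fetch_pc(btb, pc, predict_taken):
    """Predicted taken with a BTB hit: fetch from the target. Otherwise: PC + 4."""
    target = btb.lookup(pc)
    if predict_taken and target is not None:
        return target
    return pc + 4
```

On a BTB miss the sketch falls back to PC + 4 even when the direction predictor says taken, since no target is known yet.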
3
Why do we need branch prediction?
Branch prediction
Increases the number of instructions available for the scheduler
to issue. Increases instruction level parallelism (ILP)
Allows useful work to be completed while waiting for the branch
to resolve
4
Branch Prediction Strategies
Static
Decided before runtime
Examples:
Always-Not Taken
Always-Taken
Backwards Taken, Forward Not Taken (BTFNT)
Profile-driven prediction
Dynamic
Prediction decisions may change during the execution of the program
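As a rough illustration of the static strategies listed above, the following sketch expresses each one as a function of the branch PC and its target. The function names are hypothetical; BTFNT is modelled simply as "target address lower than the branch address":

```python
def always_not_taken(pc, target):
    return False


def always_taken(pc, target):
    return True


def btfnt(pc, target):
    """Backwards Taken, Forward Not Taken.

    A backward branch (target below the branch PC) is usually a loop
    back-edge, so it is predicted taken; forward branches are not.
    """
    return target < pc
```

For example, a loop-closing branch at 0x1040 that jumps back to 0x1000 is predicted taken, while a forward skip from 0x1040 to 0x1080 is predicted not taken.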
5
What happens when a branch is predicted?
On misprediction:
No speculative state may commit
Squash instructions in the pipeline
Must not allow stores in the pipeline to occur
• Cannot allow stores which would not have happened to
commit
• Even for good branch predictors more
than half of the fetched instructions are
squashed
6
A Generic Branch Predictor
[Figure: in execution order, Fetch uses f(PC, x) to produce the predicted stream (PC, T or NT); Resolve produces the actual stream, which is compared against the predicted stream.]
f(PC, x) = T or NT
- What is f(PC, x)?
- x can be any relevant information; thus far x was empty
7
Bimodal Branch Predictors
Dynamically store information about the branch behaviour
Branches tend to behave in a fixed way
Branches tend to behave in the same way across program
execution
Index a Pattern History Table using the branch address
1 bit: branch behaves as it did last time
Saturating 2 bit counter: branch behaves as it usually does
8
Saturating-Counter Predictors
Consider strongly biased branch with infrequent outcome
TTTTTTTTNTTTTTTTTNTTTT
Last-outcome will mispredict twice per infrequent outcome encounter:
TTTTTTTTNTTTTTTTTNTTTT
Idea: Remember most frequent case
Saturating-Counter: Hysteresis
often called bi-modal predictor
Captures Temporal Bias
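The difference can be checked on the example stream with a small sketch. The function names and the initial states (last outcome = T, counter = 3, i.e. strongly taken) are assumptions chosen for illustration:

```python
def last_outcome_misses(outcomes, last="T"):
    """1-bit scheme: predict whatever the branch did last time."""
    misses = 0
    for actual in outcomes:
        if actual != last:
            misses += 1
        last = actual
    return misses


def two_bit_counter_misses(outcomes, counter=3):
    """2-bit saturating counter: 0,1 predict N; 2,3 predict T."""
    misses = 0
    for actual in outcomes:
        predicted = "T" if counter >= 2 else "N"
        if predicted != actual:
            misses += 1
        # Saturate at 0 and 3 instead of wrapping around.
        counter = min(counter + 1, 3) if actual == "T" else max(counter - 1, 0)
    return misses


stream = "TTTTTTTTNTTTTTTTTNTTTT"
print(last_outcome_misses(stream))     # 4: two per infrequent N
print(two_bit_counter_misses(stream))  # 2: one per infrequent N
```

The counter needs two consecutive surprises to flip its prediction, which is the hysteresis mentioned above.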
9
Bimodal Prediction
Table of 2-bit saturating counters
Predict the most common direction
[Figure: the PC indexes the PHT, a table of 2-bit saturating counters. Each counter is a four-state machine (00, 01, 10, 11): a Taken outcome moves it toward 11, a Not Taken outcome moves it toward 00. States 10 and 11 predict taken (T); states 00 and 01 predict not taken (NT).]
Advantages: simple, cheap, “good” accuracy
Bimodal will mispredict once per infrequent outcome encounter:
TTTTTTTTNTTTTTTTTNTTTT
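A table-based version can be sketched as follows. The class name `BimodalPredictor`, the table size, the index function (PC with the byte offset dropped, modulo the table size), and the initial counter value are assumptions for illustration:

```python
class BimodalPredictor:
    """PHT of 2-bit saturating counters, indexed by the branch address."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.counters = [2] * entries  # start weakly taken (assumption)

    def _index(self, pc):
        return (pc >> 2) % self.entries  # assumes 4-byte instructions

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(self.counters[i] + 1, 3)
        else:
            self.counters[i] = max(self.counters[i] - 1, 0)


def count_misses(predictor, trace):
    """trace is a list of (pc, taken) pairs in execution order."""
    misses = 0
    for pc, taken in trace:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


# A loop branch taken 9 times, then not taken once, repeated 10 times.
loop_trace = ([(0x400, True)] * 9 + [(0x400, False)]) * 10
print(count_misses(BimodalPredictor(), loop_trace))  # 10: one miss per loop exit
```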
10
Correlating Predictors
From program perspective:
Different Branches may be correlated
if (aa == 2) aa = 0;
if (bb == 2) bb = 0;
if (aa != bb) then …
Can be viewed as a pattern detector
Instead of keeping aggregate history information
I.e., most frequent outcome
Keep exact history information
Pattern of n most recent outcomes
Example:
BHR: n most recent branch outcomes
Use PC and BHR (xor?) to access prediction table
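The correlation in the code fragment above can be made concrete with a short sketch. The function `run_fragment` is a hypothetical transcription of the three `if` statements that records each condition's outcome; it shows that when the first two conditions both hold, the third is fully determined:

```python
def run_fragment(aa, bb):
    """Return the outcomes of the three conditions, in program order."""
    b1 = aa == 2
    if b1:
        aa = 0
    b2 = bb == 2
    if b2:
        bb = 0
    b3 = aa != bb
    return b1, b2, b3


# If both earlier conditions were true, aa == bb == 0, so aa != bb is false.
for aa in range(-3, 4):
    for bb in range(-3, 4):
        b1, b2, b3 = run_fragment(aa, bb)
        if b1 and b2:
            assert not b3
```

A predictor that sees the two most recent outcomes in a history register can learn this, while a per-branch counter cannot.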
11
Pattern-based Prediction
Nested loops:
for i = 0 to N
for j = 0 to 3
…
Branch Outcome Stream for j-for branch
• 11101110111011101110
Patterns:
• 111 -> 0
• 110 -> 1
• 101 -> 1
• 011 -> 1
100% accuracy
Learning time 4 instances
Table Index (PC, 3-bit history)
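The learning behaviour can be reproduced with a minimal sketch that models a single branch, so the PC part of the table index is omitted. The function name, the use of 2-bit counters per pattern, and the initial state (weakly taken, empty history) are assumptions:

```python
def simulate_pattern_predictor(outcomes, history_bits=3):
    """Return the positions at which the prediction was wrong."""
    mask = (1 << history_bits) - 1
    counters = [2] * (1 << history_bits)  # one 2-bit counter per pattern
    history = 0                           # n most recent outcomes, newest in bit 0
    misses = []
    for i, taken in enumerate(outcomes):
        predicted = counters[history] >= 2
        if predicted != bool(taken):
            misses.append(i)
        if taken:
            counters[history] = min(counters[history] + 1, 3)
        else:
            counters[history] = max(counters[history] - 1, 0)
        history = ((history << 1) | int(taken)) & mask
    return misses


stream = [1, 1, 1, 0] * 10          # outcome stream of the inner-loop branch
print(simulate_pattern_predictor(stream))  # [3]: only the first loop exit is missed
```

With these initial values the only miss is the first occurrence of pattern 111; after that the table predicts the stream perfectly.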
12
Two-level Branch Predictors
A branch outcome depends on the outcomes of previous branches
First level: Branch History Registers (BHR)
Global history / Branch correlation: past executions of all branches
Self history / Private history: past executions of the same branch
Second level: Pattern History Table (PHT)
Use first level information to index a table
Possibly XOR with the branch address
PHT: Usually saturating 2 bit counters
Also private, shared or global
13
Gshare Predictor (McFarling)
Branch History Table
Global BHR
PC
f
Prediction
PC and BHR can be
concatenated
completely overlapped
partially overlapped
xored, etc.
How deep should the BHR be?
Really depends on program
But, deeper increases learning time
May increase quality of information
14
Hybrid Prediction
Combining branch predictors
Use two different branch predictors
Access both in parallel
A third table determines which prediction to use
Two or more predictor components are combined
[Figure: the PC is sent to GSHARE, Bimodal, ... in parallel; each component produces T/NT, and a Selector chooses the final T/NT.]
Different branches benefit from different types of history
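One way to sketch the selector is as a third table of 2-bit counters that is trained only when the two components disagree. Everything here (class names, table sizes, initial counter values, the "counter >= 2 means trust gshare" convention) is an assumption for illustration, not the specific design on the slide:

```python
def bump(counter, up):
    """Saturating 2-bit counter update."""
    return min(counter + 1, 3) if up else max(counter - 1, 0)


class Bimodal:
    def __init__(self, bits=10):
        self.mask = (1 << bits) - 1
        self.table = [1] * (1 << bits)

    def predict(self, pc):
        return self.table[(pc >> 2) & self.mask] >= 2

    def update(self, pc, taken):
        i = (pc >> 2) & self.mask
        self.table[i] = bump(self.table[i], taken)


class Gshare:
    def __init__(self, bits=10):
        self.mask = (1 << bits) - 1
        self.table = [1] * (1 << bits)
        self.history = 0

    def _index(self, pc):
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        self.table[i] = bump(self.table[i], taken)
        self.history = ((self.history << 1) | int(taken)) & self.mask


class Hybrid:
    def __init__(self, bits=10):
        self.mask = (1 << bits) - 1
        self.bimodal, self.gshare = Bimodal(bits), Gshare(bits)
        self.selector = [1] * (1 << bits)  # >= 2 means "trust gshare"

    def predict(self, pc):
        use_gshare = self.selector[(pc >> 2) & self.mask] >= 2
        return self.gshare.predict(pc) if use_gshare else self.bimodal.predict(pc)

    def update(self, pc, taken):
        b, g = self.bimodal.predict(pc), self.gshare.predict(pc)
        if b != g:  # train the selector only when the components disagree
            i = (pc >> 2) & self.mask
            self.selector[i] = bump(self.selector[i], g == taken)
        self.bimodal.update(pc, taken)
        self.gshare.update(pc, taken)


def count_late_misses(predictor, outcomes, pc=0x400, skip=100):
    """Mispredictions after the first `skip` outcomes (warm-up ignored)."""
    misses = 0
    for n, taken in enumerate(outcomes):
        if n >= skip and predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses
```

On an alternating branch the bimodal component keeps flipping while gshare locks on, so the selector drifts toward gshare for that branch.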
15
Issues Affecting Accurate Branch Prediction
Aliasing
More than one branch may use the same BHT/PHT entry
Constructive
• Prediction that would have been incorrect, predicted
correctly
Destructive
• Prediction that would have been correct, predicted
incorrectly
Neutral
• No change in the accuracy
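Destructive aliasing can be illustrated with two oppositely biased branches that share one counter. In this sketch the addresses, table sizes, and initial counter value are assumptions picked so that the two branches collide in a 16-entry table but not in a 32-entry one:

```python
def count_misses(entries, rounds=20):
    """Interleave an always-taken and a never-taken branch over one PHT."""
    counters = [2] * entries                   # 2-bit counters, weakly taken
    branches = [(0x1000, True), (0x1040, False)]
    misses = 0
    for _ in range(rounds):
        for pc, taken in branches:
            i = (pc >> 2) % entries            # same index => aliasing
            if (counters[i] >= 2) != taken:
                misses += 1
            if taken:
                counters[i] = min(counters[i] + 1, 3)
            else:
                counters[i] = max(counters[i] - 1, 0)
    return misses


print(count_misses(entries=16))  # 20: both branches share entry 0
print(count_misses(entries=32))  # 1: separate entries, one warm-up miss
```

In the shared case the taken branch keeps pulling the counter back up, so the not-taken branch is mispredicted every time.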
16
More Issues
Training time
Need to see enough branches to uncover pattern
Need enough time to reach steady state
“Wrong” history
Incorrect type of history for the branch
Stale state
Predictor is updated after information is needed
Operating system context switches
More aliasing caused by branches in different programs
17
Performance Metrics
Misprediction rate
Mispredicted branches per executed branch
Unfortunately, this is the metric most commonly reported
Instructions per mispredicted branch
Gives a better idea of the program behaviour
Branches are not evenly spaced
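The two metrics can be computed from simple event counts. The function names and the example counts below are made up for illustration:

```python
def misprediction_rate(mispredicted, branches):
    """Mispredicted branches per executed branch."""
    return mispredicted / branches


def instructions_per_misprediction(instructions, mispredicted):
    """Average distance, in instructions, between mispredictions."""
    return instructions / mispredicted


# Hypothetical run: 1,000,000 instructions, 200,000 branches, 10,000 mispredicted.
print(misprediction_rate(10_000, 200_000))                # 0.05
print(instructions_per_misprediction(1_000_000, 10_000))  # 100.0
```

Two programs with the same misprediction rate can differ widely in the second metric if one of them executes branches far more often.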
18
Upper Limit to ILP: Ideal Machine
Amount of parallelism when there are no branch mispredictions and we’re limited only by data dependencies.
[Figure: bar chart of instruction issues per cycle (IPC), i.e. instructions that could theoretically be issued per cycle, for the programs gcc, espresso, li, fpppp, doducd, tomcatv. Integer: 18 - 60; FP: 75 - 150. Bar labels, in descending order: 150.1, 118.7, 75.2, 62.6, 54.8, 17.9.]
19
Impact of Realistic Branch Prediction
Limiting the type of branch prediction.
[Figure: bar chart of instruction issues per cycle (IPC) for the programs gcc, espresso, li, fpppp, doducd, tomcatv under five prediction schemes: Perfect, Selective predictor, Standard 2-bit, Static, None. FP: 15 - 45; Integer: 6 - 12.]
20
Pentium III
Dynamic branch prediction
512-entry BTB predicts direction and target, 4-bit history used with
PC to derive direction
Mispredicted: at least 9 cycles, as many as 26, average 10-15
cycles
21
AMD Athlon K7
10-stage integer, 15-stage fp pipeline, predictor accessed in fetch
2K-entry bimodal, 2K-entry BTB
Branch Penalties:
Mispredict penalty: at least 10 cycles
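To see what such a penalty costs, a common first-order model adds the stall cycles per instruction to a base CPI. This model and all numbers except the 10-cycle penalty are assumptions, not taken from the slides:

```python
def cpi_with_mispredictions(base_cpi, branch_fraction, mispredict_rate, penalty):
    """First-order model: every mispredicted branch costs `penalty` cycles."""
    return base_cpi + branch_fraction * mispredict_rate * penalty


# Assumed: base CPI of 1.0, 20% branches, 5% of them mispredicted.
cpi = cpi_with_mispredictions(1.0, 0.20, 0.05, 10)
print(round(cpi, 3))  # 1.1
```

Under these assumed numbers a 10-cycle penalty adds about 10% to the CPI; the cost grows linearly with pipeline depth and misprediction rate.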
22
Multiple Issue
• Multiple Issue is the ability of the processor to start more than one instruction in a given cycle.
• Superscalar processors
• Very Long Instruction Word (VLIW) processors
23
1990’s: Superscalar Processors
Bottleneck: CPI >= 1
Limit on scalar performance (single instruction issue)
Hazards
Superpipelining? Diminishing returns (hazards + overhead)
How can we make the CPI = 0.5?
Multiple instructions in every pipeline stage (super-scalar)
Cycle   1    2    3    4    5    6    7
Inst0   IF   ID   EX   MEM  WB
Inst1   IF   ID   EX   MEM  WB
Inst2        IF   ID   EX   MEM  WB
Inst3        IF   ID   EX   MEM  WB
Inst4             IF   ID   EX   MEM  WB
Inst5             IF   ID   EX   MEM  WB
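The cycle count of the 2-wide, 5-stage pipeline on this slide follows from a simple formula, sketched here under the assumption of no hazards or stalls. The function name `ideal_cycles` is hypothetical:

```python
import math


def ideal_cycles(num_instructions, width, stages=5):
    """Cycles to finish all instructions with `width` issued per cycle, no stalls."""
    return stages + math.ceil(num_instructions / width) - 1


print(ideal_cycles(6, width=2))            # 7
print(ideal_cycles(6, width=1))            # 10 for a scalar pipeline
print(ideal_cycles(1000, width=2) / 1000)  # 0.504, approaching CPI = 0.5
```

For long instruction streams the fill time of the pipeline becomes negligible and the ideal CPI tends to 1/width.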
24
Superscalar Vs. VLIW
Religious debate, similar to RISC vs. CISC
Wisconsin + Michigan (superscalar) vs. Illinois (VLIW)
Q. Who can schedule code better, hardware or software?
25
Hardware Scheduling
High branch prediction accuracy
Dynamic information on latencies (cache misses)
Dynamic information on memory dependences
Easy to speculate (& recover from mis-speculation)
Works for generic, non-loop, irregular code
Ex: databases, desktop applications, compilers
Limited reorder buffer size limits “lookahead”
High cost/complexity
Slow clock
26
Software Scheduling
Large scheduling scope (full program), large “lookahead”
Can handle very long latencies
Simple hardware with fast clock
Only works well for “regular” codes (scientific, FORTRAN)
Low branch prediction accuracy
Can improve by profiling
No information on latencies like cache misses
Can improve by profiling
Pain to speculate and recover from mis-speculation
Can improve with hardware support
27
Superscalar Processors
Pioneer: IBM (America => RIOS, RS/6000, Power-1)
Superscalar instruction combinations
1 ALU or memory or branch + 1 FP (RS/6000)
Any 1 + 1 ALU (Pentium)
Any 1 ALU or FP + 1 ALU + 1 load + 1 store + 1 branch (Pentium II)
Impact of superscalar
More opportunity for hazards (why?)
More performance loss due to hazards (why?)
28
Superscalar Processors
• Issues varying number of instructions per clock
• Scheduling: Static (by the compiler) or dynamic (by the hardware)
• Superscalar has a varying number of instructions/cycle (1 to 8), scheduled by compiler or by HW (Tomasulo).
• IBM PowerPC, Sun UltraSparc, DEC Alpha, HP PA-8000
29
Elements of Advanced Superscalars
High performance instruction fetching
Good dynamic branch and jump prediction
Multiple instructions per cycle, multiple branches per cycle?
Scheduling and hazard elimination
Dynamic scheduling
Not necessarily: Alpha 21064 & Pentium were statically scheduled
Register renaming to eliminate WAR and WAW
Parallel functional units, paths/buses/multiple register ports
High performance memory systems
Speculative execution
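Register renaming from the list above can be sketched as giving every destination a fresh physical register while sources read the current mapping. The register names, the `(dst, src1, src2)` instruction format, and the unbounded physical register pool are simplifying assumptions:

```python
def rename(instructions, num_arch_regs=8):
    """Rename (dst, src1, src2) triples; each write gets a new physical register."""
    mapping = {f"r{i}": f"p{i}" for i in range(num_arch_regs)}
    next_phys = num_arch_regs
    renamed = []
    for dst, src1, src2 in instructions:
        # Read sources first so an instruction like r1 = r1 + r2 sees the old r1.
        s1, s2 = mapping[src1], mapping[src2]
        mapping[dst] = f"p{next_phys}"
        next_phys += 1
        renamed.append((mapping[dst], s1, s2))
    return renamed


program = [
    ("r1", "r2", "r3"),  # r1 = r2 op r3
    ("r2", "r4", "r5"),  # WAR on r2 with the first instruction
    ("r1", "r2", "r6"),  # WAW on r1 with the first, RAW on r2 with the second
]
for inst in rename(program):
    print(inst)
```

After renaming, only the true (RAW) dependence through p9 remains; the WAR and WAW conflicts disappear because no physical register is written twice.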
30
SS + DS + Speculation
Superscalar + Dynamic scheduling + Speculation
Three great tastes that taste great together
CPI >= 1?
Overcome with superscalar
Superscalar increases hazards
Overcome with dynamic scheduling
RAW dependences still a problem?
Overcome with a large window
Branches a problem for filling large window?
Overcome with speculation
31
The Big Picture
[Figure: static program -> fetch & branch predict -> issue -> execution -> reorder & commit]
32
Readings
New paper on branch prediction is online. READ.
Material will be used in the THIRD quiz
33