CS152: Computer Architecture and Engineering


CENG 450
Computer Systems and Architecture
Lecture 11
Amirali Baniasadi
[email protected]
1
This Lecture
 Branch Prediction
 Multiple Issue
2
Branch Prediction
 Predicting the outcome of a branch
 Direction:
Taken / Not Taken
Direction predictors
 Target Address
PC+offset (Taken)/ PC+4 (Not Taken)
Target address predictors
• Branch Target Buffer (BTB)
3
Why do we need branch prediction?
 Branch prediction
Increases the number of instructions available for the scheduler to issue, increasing instruction-level parallelism (ILP)
Allows useful work to be completed while waiting for the branch to resolve
4
Branch Prediction Strategies
 Static
 Decided before runtime
 Examples:
Always-Not Taken
Always-Taken
Backwards Taken, Forward Not Taken (BTFNT)
Profile-driven prediction
 Dynamic
 Prediction decisions may change during the execution of the program
5
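The BTFNT rule above can be sketched in a few lines (a hypothetical helper, not from the slides): backward branches, which usually close loops, are predicted taken.

```python
# Sketch of the BTFNT static rule (hypothetical helper name):
# loop-closing branches usually jump backwards, so predict "taken"
# whenever the target address is below the branch address.
def btfnt_predict(branch_pc, target_pc):
    """Return True (predict taken) for backward branches."""
    return target_pc < branch_pc

# A loop branch at 0x40C jumping back to 0x400 is predicted taken;
# a forward skip to 0x420 is predicted not taken.
print(btfnt_predict(0x40C, 0x400))  # True
print(btfnt_predict(0x40C, 0x420))  # False
```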
What happens when a branch is predicted?
 On misprediction:
No speculative state may commit
Squash instructions in the pipeline
Stores on the mispredicted path must not reach memory
• A store that would not have executed on the correct path cannot be allowed to commit
• Even for good branch predictors, more than half of the fetched instructions are squashed on a misprediction
6
A Generic Branch Predictor
[Figure: Fetch applies f(PC, x) to produce the predicted stream; the branch resolves later in execution order, giving the actual stream]
f(PC, x) = T or NT
- What's f(PC, x)?
- x can be any relevant info; thus far x was empty
7
Bimodal Branch Predictors
 Dynamically store information about the branch behaviour
Branches tend to behave in a fixed way
Branches tend to behave in the same way across program
execution
 Index a Pattern History Table using the branch address
1 bit: branch behaves as it did last time
Saturating 2 bit counter: branch behaves as it usually does
8
Saturating-Counter Predictors
 Consider strongly biased branch with infrequent outcome
 TTTTTTTTNTTTTTTTTNTTTT
 Last-outcome will mispredict twice per infrequent outcome encountered:
 TTTTTTTTNTTTTTTTTNTTTT
 Idea: Remember most frequent case
 Saturating-Counter: Hysteresis
 often called bi-modal predictor
 Captures Temporal Bias
9
Bimodal Prediction
 Table of 2-bit saturating counters
 Predict the most common direction
[Figure: PC indexes the PHT; each entry is a 2-bit saturating counter whose high bit gives the T/NT prediction]
State machine of one counter:
11 (strongly taken): Taken -> 11, Not Taken -> 10
10 (weakly taken): Taken -> 11, Not Taken -> 01
01 (weakly not taken): Taken -> 10, Not Taken -> 00
00 (strongly not taken): Taken -> 01, Not Taken -> 00
States 11 and 10 predict Taken; states 01 and 00 predict Not Taken
 Advantages: simple, cheap, “good” accuracy
 Bimodal will mispredict once per infrequent outcome encountered:
TTTTTTTTNTTTTTTTTNTTTT
10
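The one-versus-two mispredictions claim can be checked with a small simulation (function names and the loop are mine; the 2-bit counter is assumed to start strongly taken):

```python
def bimodal(outcomes, counter=3):
    """One 2-bit saturating counter (0..3); predict taken when counter >= 2."""
    mispredicts = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            mispredicts += 1
        # Saturating update: move the counter toward the actual outcome.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return mispredicts

def last_outcome(outcomes, last=True):
    """1-bit predictor: predict whatever the branch did last time."""
    mispredicts = 0
    for taken in outcomes:
        if last != taken:
            mispredicts += 1
        last = taken
    return mispredicts

# Stream from the slide: eight Ts, one NT, eight Ts, one NT, four Ts.
stream = [True] * 8 + [False] + [True] * 8 + [False] + [True] * 4
print(bimodal(stream))       # 2: once per infrequent NT
print(last_outcome(stream))  # 4: twice per infrequent NT
```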
Correlating Predictors
 From program perspective:
 Different Branches may be correlated
 if (aa == 2) aa = 0;
 if (bb == 2) bb = 0;
 if (aa != bb) then …
 Can be viewed as a pattern detector
 Instead of keeping aggregate history information
I.e., most frequent outcome
 Keep exact history information
Pattern of n most recent outcomes
 Example:
 BHR: n most recent branch outcomes
 Use PC and BHR (xor?) to access prediction table
11
Pattern-based Prediction
 Nested loops:
for i = 0 to N
for j = 0 to 3
…
 Branch Outcome Stream for j-for branch
• 11101110111011101110
 Patterns:
• 111 -> 0
• 110 -> 1
• 101 -> 1
• 011 -> 1
 100% accuracy
 Learning time: 4 instances
 Table Index (PC, 3-bit history)
12
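A rough sketch of this pattern detector (the dict-based table and names are mine, not the slide's hardware): a 3-outcome history indexes a table that remembers the outcome that followed each pattern; unseen patterns count as mispredictions, so the learning cost is visible.

```python
def pattern_predictor(outcomes, hist_bits=3):
    """Count mispredicts for a single branch using n-bit local history."""
    table = {}       # pattern -> outcome last seen after that pattern
    history = ()
    mispredicts = 0
    for taken in outcomes:
        if len(history) == hist_bits:
            if table.get(history) != taken:   # unseen patterns mispredict
                mispredicts += 1
            table[history] = taken
        history = (history + (taken,))[-hist_bits:]
    return mispredicts

# j-loop branch stream 1110 repeated: once the four patterns
# 111->0, 110->1, 101->1, 011->1 are learned, accuracy is 100%.
stream = [True, True, True, False] * 6
print(pattern_predictor(stream))  # 4 mispredicts, all during learning
```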
Two-level Branch Predictors
 A branch outcome depends on the outcomes of previous branches
 First level: Branch History Registers (BHR)
 Global history / Branch correlation: past executions of all branches
 Self history / Private history: past executions of the same branch
 Second level: Pattern History Table (PHT)
 Use first level information to index a table
Possibly XOR with the branch address
 PHT: Usually saturating 2 bit counters
 Also private, shared or global
13
Gshare Predictor (McFarling)
[Figure: the global BHR and the PC are combined by a hash f to index the Branch History Table, producing the prediction]
 PC and BHR can be
 concatenated
 completely overlapped
 partially overlapped
 xored, etc.
 How deep should the BHR be?
 Really depends on program
 But, deeper increases learning time
 May increase quality of information
14
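A minimal gshare sketch, assuming a 10-bit index, XOR combination, and 2-bit counters initialized weakly taken (all parameters are illustrative, not McFarling's exact configuration):

```python
TABLE_BITS = 10  # illustrative table size

class Gshare:
    def __init__(self):
        self.pht = [2] * (1 << TABLE_BITS)  # 2-bit counters, weakly taken
        self.ghr = 0                        # global branch history register

    def index(self, pc):
        # XOR the history with low PC bits to spread patterns across the PHT.
        return (pc ^ self.ghr) & ((1 << TABLE_BITS) - 1)

    def predict(self, pc):
        return self.pht[self.index(pc)] >= 2

    def update(self, pc, taken):
        i = self.index(pc)
        self.pht[i] = min(self.pht[i] + 1, 3) if taken else max(self.pht[i] - 1, 0)
        # Shift the resolved outcome into the history register.
        self.ghr = ((self.ghr << 1) | int(taken)) & ((1 << TABLE_BITS) - 1)
```

Deeper history (more GHR bits) distinguishes more patterns but, as the slide notes, takes longer to train.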
Hybrid Prediction
 Combining branch predictors
 Use two different branch predictors
Access both in parallel
 A third table (the selector) determines which prediction to use
 Two or more predictor components combined
[Figure: the PC indexes both a GSHARE and a Bimodal component; each produces a T/NT prediction, and the selector chooses which one is used]
 Different branches benefit from different types of history
15
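A hybrid predictor can be sketched as below (the component interface and selector policy are my assumptions): a per-PC 2-bit selector is nudged toward whichever component predicted correctly when the two disagree.

```python
class Hybrid:
    """Selector-based combination of two component predictors."""
    def __init__(self, pred_a, pred_b, bits=10):
        self.a, self.b = pred_a, pred_b
        self.selector = [2] * (1 << bits)   # >= 2 means "trust A"
        self.mask = (1 << bits) - 1

    def predict(self, pc):
        use_a = self.selector[pc & self.mask] >= 2
        return self.a.predict(pc) if use_a else self.b.predict(pc)

    def update(self, pc, taken):
        a_ok = self.a.predict(pc) == taken
        b_ok = self.b.predict(pc) == taken
        i = pc & self.mask
        if a_ok and not b_ok:               # only move when they disagree
            self.selector[i] = min(self.selector[i] + 1, 3)
        elif b_ok and not a_ok:
            self.selector[i] = max(self.selector[i] - 1, 0)
        self.a.update(pc, taken)
        self.b.update(pc, taken)
```

In a real design the components would be the gshare and bimodal tables from the previous slides.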
Issues Affecting Accurate Branch Prediction
 Aliasing
 More than one branch may use the same BHT/PHT entry
Constructive
• Prediction that would have been incorrect, predicted
correctly
Destructive
• Prediction that would have been correct, predicted
incorrectly
Neutral
• No change in the accuracy
16
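Aliasing follows directly from truncating the PC to form the table index; a tiny example (the table size is illustrative):

```python
TABLE_BITS = 10  # illustrative PHT size

def pht_index(pc):
    # Only the low 10 bits of the PC select the entry, so branches whose
    # addresses differ only above bit 9 share (alias to) the same counter.
    return pc & ((1 << TABLE_BITS) - 1)

print(hex(pht_index(0x0123)))  # 0x123
print(hex(pht_index(0x4123)))  # 0x123 -> same entry: these branches alias
```

Whether the shared counter helps (constructive), hurts (destructive), or makes no difference (neutral) depends on whether the aliased branches behave alike.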
More Issues
 Training time
 Need to see enough branches to uncover pattern
 Need enough time to reach steady state
 “Wrong” history
 Incorrect type of history for the branch
 Stale state
 Predictor is updated after information is needed
 Operating system context switches
 More aliasing caused by branches in different programs
17
Performance Metrics
 Misprediction rate
 Mispredicted branches per executed branch
Unfortunately the most commonly reported metric
 Instructions per mispredicted branch
 Gives a better idea of the program behaviour
Branches are not evenly spaced
18
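The two metrics can tell different stories; a worked example with made-up run statistics:

```python
def misprediction_rate(mispredicts, branches):
    """Mispredicted branches per executed branch."""
    return mispredicts / branches

def insts_per_mispredict(instructions, mispredicts):
    """Instructions executed per mispredicted branch."""
    return instructions / mispredicts

# Made-up run: 1,000,000 instructions, 200,000 branches, 10,000 mispredicts.
print(misprediction_rate(10_000, 200_000))      # 0.05 (5% per branch)
print(insts_per_mispredict(1_000_000, 10_000))  # 100.0 instructions per mispredict
```

The second metric accounts for how densely branches occur, which the rate alone hides.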
Upper Limit to ILP: Ideal Machine
Amount of parallelism when there are no branch mispredictions and we’re limited only by data dependencies.
[Figure: instruction issues per cycle (IPC) per benchmark on the ideal machine]
Program    IPC
gcc         54.8
espresso    62.6
li          17.9
fpppp       75.2
doduc      118.7
tomcatv    150.1
Integer: 18 - 60   FP: 75 - 150
Instructions that could theoretically be issued per cycle.
19
Impact of Realistic Branch Prediction
Limiting the type of branch prediction.
[Figure: instruction issues per cycle (IPC) for gcc, espresso, li, fpppp, doduc, and tomcatv under five prediction schemes: Perfect, Selective predictor, Standard 2-bit, Static, and None. FP: 15 - 45 issues per cycle; Integer: 6 - 12]
20
Pentium III
 Dynamic branch prediction
 512-entry BTB predicts direction and target, 4-bit history used with
PC to derive direction
 Misprediction penalty: at least 9 cycles, as many as 26, average 10-15 cycles
21
AMD Athlon K7
 10-stage integer, 15-stage fp pipeline, predictor accessed in fetch
 2K-entry bimodal, 2K-entry BTB
 Branch Penalties:
 Mispredict penalty: at least 10 cycles
22
Multiple Issue
• Multiple Issue is the ability of the processor to start more than one instruction in a given cycle.
• Superscalar processors
• Very Long Instruction Word (VLIW) processors
23
1990’s: Superscalar Processors
 Bottleneck: CPI >= 1
 Limit on scalar performance (single instruction issue)
Hazards
Superpipelining? Diminishing returns (hazards + overhead)
 How can we make the CPI = 0.5?
 Multiple instructions in every pipeline stage (super-scalar)
Cycle    1    2    3    4    5    6    7
Inst0    IF   ID   EX   MEM  WB
Inst1    IF   ID   EX   MEM  WB
Inst2         IF   ID   EX   MEM  WB
Inst3         IF   ID   EX   MEM  WB
Inst4              IF   ID   EX   MEM  WB
Inst5              IF   ID   EX   MEM  WB
24
Superscalar Vs. VLIW
 Religious debate, similar to RISC vs. CISC
Wisconsin + Michigan (Superscalar) Vs. Illinois (VLIW)
Q. Who can schedule code better, hardware or software?
25
Hardware Scheduling
 High branch prediction accuracy
 Dynamic information on latencies (cache misses)
 Dynamic information on memory dependences
 Easy to speculate (& recover from mis-speculation)
 Works for generic, non-loop, irregular code
 Ex: databases, desktop applications, compilers
 Limited reorder buffer size limits “lookahead”
 High cost/complexity
 Slow clock
26
Software Scheduling
 Large scheduling scope (full program), large “lookahead”
 Can handle very long latencies
 Simple hardware with fast clock
 Only works well for “regular” codes (scientific, FORTRAN)
 Low branch prediction accuracy
Can improve by profiling
 No information on latencies like cache misses
Can improve by profiling
 Pain to speculate and recover from mis-speculation
Can improve with hardware support
27
Superscalar Processors
 Pioneer: IBM (America => RIOS, RS/6000, Power-1)
 Superscalar instruction combinations
1 ALU or memory or branch + 1 FP (RS/6000)
Any 1 + 1 ALU (Pentium)
Any 1 ALU or FP + 1 ALU + 1 load + 1 store + 1 branch (Pentium II)
 Impact of superscalar
 More opportunity for hazards (why?)
 More performance loss due to hazards (why?)
28
Superscalar Processors
• Issues varying number of instructions per clock
• Scheduling: static (by the compiler) or dynamic (by the hardware)
• Superscalar has a varying number of instructions/cycle (1 to 8), scheduled by compiler or by HW (Tomasulo).
• IBM PowerPC, Sun UltraSparc, DEC Alpha, HP 8000
29
Elements of Advanced Superscalars
 High performance instruction fetching
 Good dynamic branch and jump prediction
 Multiple instructions per cycle, multiple branches per cycle?
 Scheduling and hazard elimination
 Dynamic scheduling
 Not necessarily: Alpha 21064 & Pentium were statically scheduled
 Register renaming to eliminate WAR and WAW
 Parallel functional units, paths/buses/multiple register ports
 High performance memory systems
 Speculative execution
30
SS + DS + Speculation
 Superscalar + Dynamic scheduling + Speculation
Three great tastes that taste great together
 CPI >= 1?
Overcome with superscalar
 Superscalar increases hazards
Overcome with dynamic scheduling
 RAW dependences still a problem?
Overcome with a large window
Branches a problem for filling large window?
Overcome with speculation
31
The Big Picture
[Figure: Static program -> Fetch & branch predict -> issue & execution -> Reorder & commit]
32
Readings
 New paper on branch prediction is online. READ.
 Material will be used in the THIRD quiz
33