Lect 11: Prediction Intro/Projects


CS 505: Computer Structures
Lecture 4: Branch Prediction
Thu D. Nguyen
Spring 2003
Computer Science
Rutgers University
Rutgers University, Spring 2005
1
CS 505: Thu D. Nguyen
Case for Branch Prediction
1. Branches will arrive up to n times faster in an n-issue processor
2. Amdahl’s Law => relative impact of control
stalls will be larger with the lower potential CPI of
an n-issue processor
Conversely, we need branch prediction to ‘see’ potential
parallelism
Branch Prediction Schemes
• 1-bit Branch-Prediction Buffer
• 2-bit Branch-Prediction Buffer
• Correlating Branch Prediction Buffer
• Tournament Branch Predictor
Dynamic Branch Prediction
• Performance = ƒ(accuracy, cost of misprediction)
• Branch History Table: Lower bits of PC address
index table of 1-bit values
– Says whether or not branch taken last time
– No address check (saves HW, but may not be right branch)
• Problem: in a loop, a 1-bit BHT will cause
2 mispredictions per loop execution (avg is 9 iterations before exit):
– End of loop case, when it exits instead of looping as before
– First time through the loop on the next time through the code, when it
predicts exit instead of looping
– Only 80% accuracy even if the branch loops 90% of the time
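The arithmetic behind these two mispredictions can be checked with a small simulation; this is an illustrative sketch (function name and loop shape are invented), not any particular hardware design:

```python
# Illustrative sketch of a single 1-bit branch history entry: the
# predictor remembers only the last outcome of the branch.
def mispredicts_1bit(outcomes, initial=True):
    state = initial       # True = predict taken
    misses = 0
    for taken in outcomes:
        if state != taken:
            misses += 1
        state = taken     # record the last outcome
    return misses

# A loop branch taken 9 times, then not taken at exit, run through twice:
# the exit mispredicts, and so does the first iteration of the next pass.
two_passes = ([True] * 9 + [False]) * 2
print(mispredicts_1bit(two_passes))  # 3 (1 on the first pass, 2 per pass after)
```

In steady state that is 2 misses per 10 executions, i.e. the 80% accuracy quoted above.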
Review: Dynamic Branch Prediction
(Jim Smith, 1981)
• Better solution: a 2-bit scheme where we change the
prediction only if we mispredict twice:
[State diagram: four states in two pairs, Predict Taken (strong and weak)
and Predict Not Taken (strong and weak); each taken (T) outcome moves one
step toward strong Predict Taken, each not-taken (NT) outcome moves one
step toward strong Predict Not Taken]
• Red: stop, not taken
• Green: go, taken
• Adds hysteresis to the decision-making process
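A minimal sketch of the 2-bit scheme as a saturating counter (the 0-3 encoding assumed here is one common choice, not the only one):

```python
# Minimal 2-bit saturating counter sketch: states 0-1 predict not taken,
# states 2-3 predict taken; it takes two consecutive mispredictions to
# flip the prediction (the hysteresis noted above).
def mispredicts_2bit(outcomes, state=3):
    misses = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            misses += 1
        # saturate at 0 and 3 rather than wrapping
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses

# Same two loop passes as the 1-bit case: only the loop exits mispredict
# now, since one not-taken outcome no longer flips the prediction.
print(mispredicts_2bit(([True] * 9 + [False]) * 2))  # 2
```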
BHT Accuracy
• Mispredict because either:
– Wrong guess for that branch
– Got branch history of the wrong branch when indexing the table
• With a 4096-entry table, programs vary from 1%
misprediction (nasa7, tomcatv) to 18% (eqntott),
with spice at 9% and gcc at 12%
• 4096 entries about as good as an infinite table
(in Alpha 21164)
Correlating Branches
• Hypothesis: recent branches are correlated; that
is, behavior of recently executed branches affects
prediction of current branch
• Two possibilities: the current branch depends on:
– Last m most recently executed branches anywhere in program
(global)
– Last m most recent outcomes of same branch (local)
• Idea: record m most recently executed branches
as taken or not taken, and use that pattern to
select the proper branch history table entry
– A single history table shared by all branches indexed by history
value.
– Branch address is used along with history to select table entry
Correlating Branches
• For instance, consider a global-history, set-indexed
BHT. That gives us a GAs history table.
• (2,2) GAs predictor:
– First 2 means that we keep
two bits of history
– Second 2 means that we have 2-
bit counters in each slot
– Then behavior of recent
branches selects between,
say, four predictions of the next
branch, updating just that
prediction
– Note that the original two-bit
counter solution would be a
(0,2) GAs predictor
– Note also that aliasing is
possible here...
[Figure: (2,2) GAs predictor: the branch address and a 2-bit global
branch history register together select a slot among the 2-bits-per-branch
predictors; each slot is a 2-bit counter that supplies the prediction]
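A toy model of such a (2,2) predictor may make the indexing concrete; the class name, table size, and modulo index hash are invented for the sketch:

```python
# Illustrative (2,2) GAs predictor: a 2-bit global history register
# selects one 2-bit counter within the set chosen by the branch address.
class GAsPredictor:
    def __init__(self, entries=1024, hist_bits=2):
        self.hist_bits = hist_bits
        self.history = 0  # global branch history register
        # one 2-bit counter per (address slot, history value) pair,
        # initialized to weakly taken (2)
        self.table = [[2] * (1 << hist_bits) for _ in range(entries)]

    def predict(self, pc):
        return self.table[pc % len(self.table)][self.history] >= 2

    def update(self, pc, taken):
        counters = self.table[pc % len(self.table)]
        h = self.history
        if taken:
            counters[h] = min(counters[h] + 1, 3)
        else:
            counters[h] = max(counters[h] - 1, 0)
        # shift the outcome into the global history register
        self.history = ((h << 1) | int(taken)) & ((1 << self.hist_bits) - 1)

# An alternating branch (T, NT, T, NT, ...) defeats a plain 2-bit counter
# but is learned quickly once the history register is part of the index.
p = GAsPredictor()
hits = 0
for taken in [True, False] * 50:
    hits += (p.predict(0x40) == taken)
    p.update(0x40, taken)
print(hits)  # 99 of 100: one miss while the counters warm up
```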
Accuracy of Different Schemes
(Figure 4.21, p. 272)
[Chart: frequency of mispredictions for nasa7, matrix300, tomcatv,
doducd, spice, fpppp, gcc, espresso, eqntott, and li under three schemes:
4,096 entries with 2 bits per entry, unlimited entries with 2 bits per
entry, and 1,024 entries of a (2,2) predictor. Misprediction rates range
from 0% (nasa7, matrix300, tomcatv) up to 18% (eqntott under both 2-bit
BHTs); the (2,2) predictor is generally best, e.g. 11% on eqntott]
Tournament Predictors
• Motivation for correlating branch predictors: the
2-bit predictor failed on important branches;
by adding global information, performance
improved
• Tournament predictors: use 2 predictors, 1
based on global information and 1 based on
local information, and combine with a selector
• The hope is to select the right predictor for the right
branch (or the right context of a branch)
Tournament Predictor in Alpha
21264
• 4K 2-bit counters to choose from among a global
predictor and a local predictor
• Global predictor also has 4K entries and is indexed by
the history of the last 12 branches; each entry in the
global predictor is a standard 2-bit predictor
– 12-bit pattern: ith bit 0 => ith prior branch not taken;
ith bit 1 => ith prior branch taken
• Local predictor consists of a 2-level predictor:
– Top level: a local history table consisting of 1024 10-bit
entries; each 10-bit entry records the most recent
10 branch outcomes for that entry. The 10-bit history allows
patterns of up to 10 branches to be discovered and predicted.
– Next level: the selected entry from the local history table is
used to index a table of 1K entries consisting of 3-bit
saturating counters, which provide the local prediction
• Total size: 4K*2 + 4K*2 + 1K*10 + 1K*3 = 29K bits!
(~180,000 transistors)
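The selector can be sketched as a 2-bit chooser counter, in the spirit of (but much simpler than) the 21264's scheme; the function and variable names are invented:

```python
# Tournament selector sketch: a 2-bit chooser counter picks between the
# global and local predictions, and is nudged toward whichever predictor
# was right, but only when the two disagree.
def tournament_step(chooser, global_pred, local_pred, taken):
    prediction = global_pred if chooser >= 2 else local_pred
    if global_pred != local_pred:          # train only on disagreement
        if global_pred == taken:
            chooser = min(chooser + 1, 3)  # move toward "use global"
        else:
            chooser = max(chooser - 1, 0)  # move toward "use local"
    return prediction, chooser

# Starting from "strongly local" (0), two global wins flip the choice.
pred, c = tournament_step(0, True, False, True)   # local chosen: wrong
pred, c = tournament_step(c, True, False, True)   # still local: wrong
pred, c = tournament_step(c, True, False, True)   # now global: right
print(pred, c)  # True 3
```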
% of Predictions from Local
Predictor in Tournament
Prediction Scheme
[Chart, 0% to 100% scale: fraction of predictions supplied by the local
predictor per benchmark: nasa7 98%, matrix300 100%, tomcatv 94%,
doduc 90%, spice 55%, fpppp 76%, gcc 72%, espresso 63%, eqntott 37%,
li 69%]
Accuracy of Branch Prediction
(fig 3.40)
[Chart, 0% to 100% scale: branch prediction accuracy for tomcatv, doduc,
fpppp, li, gcc, and espresso under three schemes: profile-based, 2-bit
counter, and tournament. The tournament predictor is the most accurate on
every benchmark (up to ~100% on tomcatv); profile-based is the lowest,
down to ~70% on gcc]
• Profile: branch profile from last execution
(static in that it is encoded in the instruction, but profile-based)
Accuracy v. Size (SPEC89)
[Figure: prediction accuracy as a function of predictor size for the
SPEC89 benchmarks]
Need Address
at Same Time as Prediction
• Branch Target Buffer (BTB): the address of the branch indexes the
buffer to get the prediction AND the branch target address (if taken)
– Note: must check for a branch match now, since we can't use the wrong
branch's address
(Figure 4.22, p. 273)
[Figure: the PC of the instruction being fetched indexes the BTB; the
stored branch PC is compared (=?) against it, yielding the predicted PC
and a taken/untaken prediction]
• Return instruction addresses predicted with a stack
• Remember branch folding (Crisp processor)?
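A minimal sketch of the BTB lookup with its tag check; the class name and sizes are invented for illustration:

```python
# Minimal BTB sketch: a direct-mapped table indexed by the low bits of
# the fetch PC. The stored tag must match the full PC, since an index
# collision would otherwise return the wrong branch's target.
class BTB:
    def __init__(self, entries=256):
        self.entries = entries
        self.table = {}  # index -> (branch PC tag, predicted target PC)

    def lookup(self, pc):
        entry = self.table.get(pc % self.entries)
        if entry is not None and entry[0] == pc:  # tag check (=?)
            return entry[1]  # predicted next PC if taken
        return None          # no prediction: fall through to PC + 4

    def update(self, pc, target):
        self.table[pc % self.entries] = (pc, target)

btb = BTB()
btb.update(0x1000, 0x2000)
assert btb.lookup(0x1000) == 0x2000  # hit: prediction available at fetch
assert btb.lookup(0x1400) is None    # same index, different branch: miss
```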
Re-evaluating Correlation
• Several of the SPEC benchmarks have less than a
dozen branches responsible for 90% of taken
branches:
program     branch %   # static branches   # covering 90%
compress    14%        236                 13
eqntott     25%        494                 5
gcc         15%        9531                2020
mpeg        10%        5598                532
real gcc    13%        17361               3214
• Real programs + OS more like gcc
• Small benefits beyond benchmarks for correlation?
problems with branch aliases?
Predicated Execution
• Avoid branch prediction by turning branches
into conditionally executed instructions:
if (x) then A = B op C else NOP
– If x is false, then neither store the result nor cause an exception
– Expanded ISAs of Alpha, MIPS, PowerPC, and SPARC have a
conditional move; PA-RISC can annul any following
instruction
– IA-64: 64 1-bit predicate fields allow conditional
execution of any instruction
– This transformation is called “if-conversion”
• Drawbacks to conditional instructions
– Still takes a clock even if “annulled”
– Stall if condition evaluated late
– Complex conditions reduce effectiveness;
condition becomes known late in pipeline
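The transformation can be mimicked in Python; this only illustrates the semantics, since real if-conversion is done by the compiler for a predicated ISA, and the function names here are invented:

```python
# If-conversion illustrated: the converted form always computes B op C
# but only commits the result when the predicate x is true.
def branchy(x, a, b, c):
    if x:                    # the branch we want to eliminate
        a = b + c            # "B op C" with op = +
    return a

def if_converted(x, a, b, c):
    t = b + c                # executed unconditionally (must not fault)
    a = t if x else a        # conditional move on predicate x
    return a

print(branchy(True, 0, 2, 3), if_converted(True, 0, 2, 3))    # 5 5
print(branchy(False, 7, 2, 3), if_converted(False, 7, 2, 3))  # 7 7
```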
Dynamic Branch Prediction
Summary
• Prediction is becoming an important part of scalar
execution.
– Prediction is exploiting “information compressibility” in execution
• Branch History Table: 2 bits for loop accuracy
• Correlation: Recently executed branches correlated
with next branch.
– Either different branches (GA)
– Or different executions of same branches (PA).
• Branch Target Buffer: include branch address &
prediction
• Predicated Execution can reduce number of
branches, number of mispredicted branches
Problems with scalar approach to
ILP extraction
• Limits to conventional exploitation of ILP:
– pipelined clock rate: at some point, each increase in clock rate has a
corresponding CPI increase (branches, other hazards)
– branch prediction: branches get in the way of wide issue; they
are too unpredictable
– instruction fetch and decode: at some point, it's hard to fetch and
decode more instructions per clock cycle
» How wide can we go?
– register renaming: rename logic gets really complicated for many
instructions
» Increasing ports to the register file is also hard
– cache hit rate: some long-running (scientific) programs have very
large data sets accessed with poor locality; others have continuous
data streams (multimedia) and hence poor locality
Limits to ILP
• Conflicting studies of amount
– Benchmarks (vectorized Fortran FP vs. integer C programs)
– Hardware sophistication
– Compiler sophistication
• How much ILP is available using existing
mechanisms with increasing HW budgets?
• Do we need to invent new HW/SW mechanisms to
keep on processor performance curve?
– Intel MMX
– Motorola AltiVec
– SuperSPARC multimedia ops, etc.
Limits to ILP
• Assumptions for ideal/perfect machine to start:
1. Register renaming: infinite virtual registers and all WAW & WAR
hazards are avoided
2. Branch prediction: perfect; no mispredictions
3. Jump prediction: all jumps perfectly predicted => machine with
perfect speculation & an unbounded buffer of instructions
available
4. Memory-address alias analysis: addresses are known & a store
can be moved before a load provided the addresses are not equal
• 1 cycle latency for all instructions; unlimited
number of instructions issued per clock cycle
Upper Limit to ILP: Ideal
Machine
(Figure 3.35)
[Chart: instruction issues per cycle (IPC) on the ideal machine:
gcc 54.8, espresso 62.6, li 17.9, fpppp 75.2, doducd 118.7,
tomcatv 150.1. Integer: 18 - 60; FP: 75 - 150]
More Realistic HW: Branch Impact
(Figure 3.39)
Change from an infinite window to a 2000-instruction window and a
maximum issue of 64 instructions per clock cycle.
[Chart: IPC for gcc, espresso, li, fpppp, doducd, and tomcatv under five
branch-prediction schemes: perfect, selective (tournament) predictor,
standard 2-bit, static, and none. FP: 15 - 45; Integer: 6 - 12 under the
realistic predictors]
More Realistic HW: Register Impact
(Figure 4.44, Page 328)
Change to a 2000-instruction window, 64-instruction issue, 8K 2-level
prediction.
[Chart: IPC for gcc, espresso, li, fpppp, doducd, and tomcatv with
infinite, 256, 128, 64, 32, and no renaming registers. FP: 11 - 45;
Integer: 5 - 15]
More Realistic HW: Alias Impact
(Figure 3.44)
Change to a 2000-instruction window, 64-instruction issue, 8K 2-level
prediction, 256 renaming registers.
[Chart: IPC for gcc, espresso, li, fpppp, doducd, and tomcatv under four
memory-alias models: perfect, global/stack perfect, inspection, and none.
FP: 4 - 45 (Fortran, no heap); Integer: 4 - 9]
Interesting Trade-Offs
• 8-scalar IBM Power-2 @ 71.5 MHz (5-stage pipe)
vs. 2-scalar Alpha @ 200 MHz (7-stage pipe)
[Chart: SPECMarks, 0 - 900, for both machines on espresso, li, eqntott,
compress, sc, gcc, spice, doduc, mdljdp2, wave5, tomcatv, ora, alvinn,
ear, mdljsp2, swm256, su2cor, hydro2d, nasa, fpppp]
Interesting Trade-Offs
SuperSPARC: flexible 3-issue, 60 MHz
HP PA 7100: constrained 2-issue, 99 MHz
Cost-performance of simple vs. OOO

MIPS MPUs              R5000     R10000         10k/5k
Clock Rate             200 MHz   195 MHz        1.0x
On-Chip Caches         32K/32K   32K/32K        1.0x
Instructions/Cycle     1 (+ FP)  4              4.0x
Pipe stages            5         5-7            1.2x
Model                  In-order  Out-of-order   ---
Die Size (mm2)         84        298            3.5x
  without cache, TLB   32        205            6.3x
Development (man yr.)  60        300            5.0x
SPECint_base95         5.7       8.8            1.6x
Alternative Model:
Vector Processing
• Vector processors have high-level operations that
work on linear arrays of numbers: "vectors"
[Figure: SCALAR (1 operation): r1 + r2 -> r3, i.e. add r3, r1, r2;
VECTOR (N operations): v1 + v2 -> v3 element-wise over the vector
length, i.e. add.vv v3, v1, v2]
Properties of Vector Processors
• Each result independent of previous result
=> long pipeline, compiler ensures no dependencies
=> high clock rate
• Vector instructions access memory with known
pattern
=> highly interleaved memory
=> amortize mem. latency over multiple elements
=> no (data) caches required!
• Reduces branches and branch problems in pipelines
• Single vector instruction implies lots of work
=> fewer instruction fetches
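The scalar/vector contrast can be sketched in Python; this is purely illustrative, since vector hardware executes the N adds in a pipelined functional unit rather than a software loop:

```python
# Scalar vs. vector form of the same computation: the vector version
# stands in for a single add.vv v3, v1, v2 covering the whole vector
# length, where the scalar loop issues one add per element.
def scalar_add(v1, v2):
    v3 = []
    for x, y in zip(v1, v2):  # one "add r3, r1, r2" per element
        v3.append(x + y)
    return v3

def vector_add(v1, v2):
    # one vector instruction's worth of work: N independent adds
    return [x + y for x, y in zip(v1, v2)]

print(vector_add([1, 2, 3], [10, 20, 30]))  # [11, 22, 33]
```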
Operation & Instruction Count:
RISC v. Vector Processor
(from F. Quintana, U. Barcelona.)

Spec92fp     Operations (Millions)      Instructions (Millions)
Program      RISC   Vector   R / V      RISC   Vector   R / V
swim256      115    95       1.1x       115    0.8      142x
hydro2d      58     40       1.4x       58     0.8      71x
nasa7        69     41       1.7x       69     2.2      31x
su2cor       51     35       1.4x       51     1.8      29x
tomcatv      15     10       1.4x       15     1.3      11x
wave5        27     25       1.1x       27     7.2      4x
mdljdp2      32     52       0.6x       32     15.8     2x

Vector reduces ops by 1.2X, instructions by 20X
Vector Advantages
• Easy to get high performance; N operations:
– are independent
– use same functional unit
– access disjoint registers
– access registers in same order as previous instructions
– access contiguous memory words or known pattern
– can exploit large memory bandwidth
– hide memory latency (and any other latency)
• Scalable: get higher performance by adding HW resources
• Compact: describe N operations with 1 short instruction
• Predictable: performance vs. statistical performance (cache)
• Multimedia ready: N * 64b, 2N * 32b, 4N * 16b, 8N * 8b
• Mature, developed compiler technology
• Vector Disadvantage: Out of Fashion?
– Hard to say. Many irregular loop structures still seem hard to
vectorize automatically.
Summary #1
• Precise exceptions/Speculation: Out-of-order
execution, In-order commit (reorder buffer)
• Superscalar and VLIW: CPI < 1 (IPC > 1)
– Dynamic issue vs. Static issue
– More instructions issue at same time => larger hazard penalty
– Limitation is often the number of instructions that you can
successfully fetch and decode per cycle => the “Flynn barrier”
• Branch prediction is one of the most crucial factors
to performance of superscalar execution.
Summary #2
• Vector model accommodates long memory latency;
doesn’t rely on caches as do out-of-order
superscalar/VLIW designs
• Much easier for hardware: more powerful
instructions, more predictable memory accesses,
fewer hazards, fewer branches, fewer
mispredicted branches, ...
• What % of computation is vectorizable?
• Is vector a good match to new apps such as
multimedia, DSP?