
5. Microarchitecture of Superscalars (3)
Branch Prediction
Dezső Sima
Fall 2006
 D. Sima, 2006
Branch prediction
• 1. Introduction
• 2. Basic branch prediction mechanisms
• 3. Auxiliary branch prediction mechanisms
• 4. Accessing the branch target path
1.1 The branch processing problem of pipelining (1)
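[Timing diagram: stages F, D, E, W over cycles t_i ... t_i+4 for the branch b, the sequential instructions i_i+1 and i_i+2, and the branch target instruction (BTI) i_j. Annotated steps: branch fetching, branch detection, BTA calculation, BTI fetching; 2 bubbles.]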
Figure 1.1: Straightforward processing of an unconditional branch on a four stage pipeline
1.1 The branch processing problem of pipelining (2)
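[Timing diagram: stages F, D, E, W over cycles t_i ... t_i+5 for the conditional branch bc, the sequential instructions i_i+1 to i_i+3, and the branch target instruction (BTI) i_j. Annotated steps: bc fetching, bc detection, condition checking (branch!), BTA calculation, BTI fetching; 3 bubbles.]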
Figure 1.2: Straightforward processing of a conditional branch on a four stage pipeline
with immediate condition resolution
1.1 The branch processing problem of pipelining (3)
[Timing diagram: stages F, D, E, W over cycles t_i ... t_i+4 and t_j ... t_j+3 for the conditional branch bc, the sequential instructions i_i+1 and i_i+2, and the branch target instruction (BTI) i_j. Annotated steps: bc fetching, bc detection, repeated condition checking (dynamic stop), final condition checking (branch!) with BTA calculation, BTI fetching; large number of bubbles.]
Figure 1.3: Straightforward processing of a conditional branch on a four stage pipeline,
with delayed condition resolution
1.1 The branch processing problem of pipelining (4)
[Chart: number of pipeline stages (vertical axis, 10 to 40) versus year (horizontal axis, 1990 to 2005). Data points:]
Pentium (5)
Pentium Pro (~12)
K6 (6)
Pentium 4 (~20)
Athlon (6)
P4 Prescott (~30)
Athlon-64 (12)
Core Duo, Conroe (14)
Figure 1.4: Number of pipeline stages in Intel’s and AMD’s processors
1.2 Branch statistics (1)
Figure 1.5: Dynamic ratio of branches
1.2 Branch statistics (2)
Figure 1.6: Ratio of the main instruction types
Source: Stephens et al. „Instruction level profiling and evaluation of the IBM RS/6000”,
Proc. 18th ISCA, pp. 137-146
1.2 Branch statistics (3)
Branches
  Unconditional branches (~ 1/3)
    Simple unconditional branch
    Branch to subroutine
    Return from subroutine
  Conditional branches
    Loop-closing conditional branch (~ 1/3)
      Taken for the first (n-1) iterations
    Other conditional branches (~ 1/3)
      Taken: ~ 1/6
      Not taken: ~ 1/6
Overall: Taken ~ 5/6, Not taken ~ 1/6
Figure 1.7: Grohoski’s estimate of branch statistics
Source: Grohoski, G.F, IBM J. Res. Develop., 34 Jan. pp. 37-58
1.2 Branch statistics (3)
Reference               Frequency of taken branches   Frequency of not taken branches
Lee, Smith 1984         57 - 99 %                     1 - 43 %
Edenfield & al. 1990    75 %                          25 %
Grohoski 1990           ~ 5/6                         ~ 1/6
Figure 1.8: Frequency of taken and not taken branches
Source: Sima, D et. al., ACA, Addison Wesley, 1997, pp. 303
1.3 The principle of branch prediction (1)
[Timing diagram: stages F, D, E, W over cycles t_i ... t_i+4 and t_j ... t_j+3 for the conditional branch bc, the sequential instructions i_i+1 to i_i+3, and the speculatively fetched BTI i_j. Annotated steps: bc fetching, bc detection, branch prediction (branch!), BTA calculation, BTA fetching, BTI decode, repeated condition checking until the condition resolves (branch!), dynamic stop, speculative execution acknowledged; 2 bubbles.]
Figure 1.9: Correctly predicted conditional branch with delayed condition resolution
on a four stage pipeline
1.3 The principle of branch prediction (2)
[Timing diagram: stages F, D, E, W over cycles t_i ... t_i+4 and t_j ... t_j+4 for the conditional branch bc, the sequential instructions i_i+1 to i_i+3, the speculatively fetched BTI i_j and i_j+1. Annotated steps: bc fetching, bc detection, branch prediction (branch!), BTA calculation, BTA fetching, BTI decode, repeated condition checking until the condition resolves (no branch!), dynamic stop, i_i+1 fetching; a large number of bubbles.]
Figure 1.10: Incorrectly predicted conditional branch with delayed condition resolution
on a four stage pipeline
1.3 The principle of branch prediction (3)
[Timing diagram of a long pipeline with stages F1, F2, F3, D1, D2, ..., E1, E2, W over cycles t_i ... t_i+5 and t_j ... t_j+3 for the conditional branch bc, the sequential instructions i_i+1 ... i_i+n, and the BTI. Annotated steps: bc fetching, bc detection, branch prediction (no branch!), BTA calculation, condition checking (mispred.!, branch!), BTI fetching; the interval in between is marked as the misprediction penalty.]
Figure 1.11: Branch misprediction penalty on a long pipeline
1.4 Branch prediction accuracy/penalty (1)
Processor: Am 29000 (1987)
  Guessing method (relevant for prediction accuracy): Implicit dynamic
  Implementation: 32-entry two-way set associative BTIC
  Prediction accuracy: 60 % for repetitive branches
  Reference: Weiss 1987
Processor: MC 88110 (1991)
  Guessing method: Implicit dynamic, overridden by opcode-based static
  Implementation: 32-entry fully associative BTIC
  Prediction accuracy: 70 % on SPEC
  Reference: Diefendorff, Allen 1992
Processor: MC 68060 (1993)
  Guessing method: 2-bit dynamic
  Implementation: 256-entry BTAC
  Prediction accuracy: > 90 %
  Reference: Circello, Goodrich 1993
Processor: MIPS R10000 (1996)
  Guessing method: 2-bit dynamic
  Implementation: 512-entry BHT
  Prediction accuracy: 90 %
  Reference: Halfhill, 1994
Processor: PowerPC 620 (1995)
  Guessing method: Implicit dynamic, augmented with 2-bit dynamic
  Implementation: 256-entry fully associative BTAC, 2-K entry BHT
  Prediction accuracy: 90 %
  Reference: Thomson, Ryan 1994
Processor: PA-8000 (1995)
  Guessing method: Implicit dynamic, overridden by 3-bit dynamic or compiler based static
  Implementation: 32-entry fully associative BTAC, 256-entry BHT
  Prediction accuracy: 80 % on SPECint92
  Reference: Gwennap 1994
Processor: UltraSparc (1995)
  Guessing method: 2-bit dynamic
  Implementation: 2 K-entries in the IC, each shared among two instructions
  Prediction accuracy: 88 % on SPECint92, 94 % on SPECfp92
  Reference: Wayner 1994
BHT: Branch history table
BTIC: Branch target instruction cache
BTAC: Branch target address cache
IC: Instruction cache
Figure 1.12: Branch prediction accuracy
Source: Sima, D et al., ACA, Addison Wesley, 1997, pp. 340
1.4 Prediction accuracy/penalty (2)
Effective penalty of branch processing (simplified)
$P = f_c \cdot P_c + f_m \cdot P_m$
f_c: Probability (frequency) of correctly predicted branches
f_m: Probability (frequency) of mispredicted branches
P_c: Penalty of correctly predicted branches
P_m: Penalty of mispredicted branches
If $P_c = 0$:
$P = f_m \cdot P_m$
Examples:
Processor        P (cycles)   f_m    P_m
PPro             1            0.1    10 cycles
P4 Willamette    1            0.05   20 cycles
P4 Prescott      1.5          0.05   30 cycles
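The three example rows can be rechecked with a few lines of Python. This is only a sketch of the simplified formula above; the function name `effective_penalty` is illustrative, and `f_c = 1 - f_m` is an assumption (every branch counts as either correctly predicted or mispredicted).

```python
def effective_penalty(f_m, p_m, p_c=0.0):
    """Average penalty per branch in cycles: P = f_c * P_c + f_m * P_m."""
    f_c = 1.0 - f_m  # assumption: a branch is either predicted correctly or mispredicted
    return f_c * p_c + f_m * p_m


# (f_m, P_m) pairs taken from the examples on the slide
examples = {
    "PPro": (0.10, 10),
    "P4 Willamette": (0.05, 20),
    "P4 Prescott": (0.05, 30),
}

for name, (f_m, p_m) in examples.items():
    print(f"{name}: P = {effective_penalty(f_m, p_m):.2f} cycles")
```

With P_c = 0 the result depends only on the product f_m * P_m, which is why halving the misprediction rate in the Willamette only compensates for its doubled misprediction penalty.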
2. Basic branch prediction mechanisms
2.1 Introduction (1)
Branch processing
• Branch detection
• Branch prediction
• Accessing the branch target path
2.1 Introduction (2)
Branch prediction mechanisms
• Basic branch prediction mechanism
• Auxiliary branch prediction mechanism
2.1 Introduction (2)
Basic branch prediction mechanism
• Processor based: Local
• Compiler hints
Local prediction: the prediction depends only on the behaviour of the branch considered.
Figure 2.1.: Local prediction
2.1 Introduction (2)
Basic branch prediction mechanism
• Processor based: Local, Global (2-level)
• Compiler hints
[Diagram: two execution paths (Path 1, Path 2) with different sequences of branch outcomes (1/0) leading to the branch to be predicted (?).]
Global prediction: the prediction depends on the actual execution path, that is, on all branches executed.
Figure 2.2.: Global prediction
2.1 Introduction (2)
Basic branch prediction mechanism
Processor based
Local
Global
(2-level)
Compiler hints
Combined
(Choice prediction)
2.2. Local prediction (1)
Local prediction
1-level
2-level
2.2. Local prediction (2)
1-level (local) prediction
Fixed prediction
Always the same prediction
'Always not taken' 'Always taken'
approach
approach
Dynamic prediction
Static prediction
Based on the object code
Displacement-based
Opcode-based
Based on the execution history
1-bit
prediction
80486 (1989)
MC 68040 (1990)
SuperSparc (1992)
R4000 (1992)
POWER1 (1990)
POWER2 (1993)
R8000 (1994)
PPC 601 (1993)
PPC: PowerPC
PPC 601 (1993)
2.2. Local prediction (3)
[Diagram: the IFA selects an entry of the BHT (Branch History Table); each entry holds one bit x.]
x: 0: sequential continuation
   1: branch
Figure 2.3: Principle of the 1-bit dynamic prediction
2.2. Local prediction (4)
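[State diagram with two states, "Not taken" and "Taken": a T outcome leads to (or keeps) the "Taken" state, an NT outcome leads to (or keeps) the "Not taken" state.]
T: Branch has been taken
NT: Branch has not been taken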
Figure 2.4: State transition diagram of the 1-bit dynamic prediction
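The 1-bit scheme can be sketched in a few lines of Python. This is a minimal illustration, not a model of any listed processor: the table size, the modulo indexing and the names `OneBitPredictor` and `count_mispredictions` are assumptions made for the example.

```python
class OneBitPredictor:
    """BHT with one history bit per entry: 0 = sequential continuation, 1 = branch."""

    def __init__(self, entries=16):
        self.entries = entries          # hypothetical table size
        self.bht = [0] * entries

    def _index(self, ifa):
        return ifa % self.entries       # assumed indexing by low-order address bits

    def predict(self, ifa):
        return self.bht[self._index(ifa)] == 1

    def update(self, ifa, taken):
        # The bit simply records the last outcome of the branch.
        self.bht[self._index(ifa)] = 1 if taken else 0


def count_mispredictions(predictor, ifa, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict(ifa) != taken:
            misses += 1
        predictor.update(ifa, taken)
    return misses


# A loop-closing branch: taken three times, then not taken; the loop runs twice.
loop = ([True] * 3 + [False]) * 2
misses_1bit = count_mispredictions(OneBitPredictor(), 0x40, loop)
print(misses_1bit)
```

In this run the predictor misses both at each loop exit and again at each loop re-entry, the well-known weakness of remembering only the last outcome.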
2.2. Local prediction (6)
1-level (local) prediction
Fixed prediction
Always the same prediction
'Always not taken' 'Always taken'
approach
approach
Dynamic prediction
Static prediction
Based on the object code
Displacement-based
Opcode-based
Based on the execution history
1-bit
prediction
80486 (1989)
Pentium (1993)
MC 68040 (1990)
MC 68060 (1993)
SuperSparc (1992)
UltraSparc (1995)
R4000 (1992)
POWER1 (1990)
POWER2 (1993)
2-bit
prediction
R8000 (1994)
PPC 601 (1993)
PPC: PowerPC
PPC 601 (1993)
R10000 (1996)
PPC 604 (1995)
PPC 620 (1996)
2.2. Local prediction (7)
[Diagram: the IFA selects an entry of the BHT; each entry holds two bits xx.]
xx: 00, 01: sequential continuation
    10, 11: branch
BHT: Branch History Table
Figure 2.6: Principle of the 2-bit dynamic prediction
2.2. Local prediction (8)
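[State diagram of a 2-bit saturating counter with four states:]
11: Strongly taken      (Prediction: "Taken")
10: Weakly taken        (Prediction: "Taken")
01: Weakly not taken    (Prediction: "Not Taken")
00: Strongly not taken  (Prediction: "Not Taken")
An AT outcome moves the state one step towards 11, an ANT outcome one step towards 00; the end states 11 and 00 saturate.
The state is initialised when a branch is taken first.
Branch has been:
AT: actually taken
ANT: actually not taken
Figure 2.7: State transition diagram of the most frequently used
2-bit dynamic prediction (Smith algorithm)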
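The saturating behaviour of the 2-bit scheme can be tried out with the following Python sketch. The class name `TwoBitCounter` and the chosen start state are assumptions for the example; the state encoding follows the figure (00/01 predict not taken, 10/11 predict taken).

```python
class TwoBitCounter:
    """2-bit saturating counter: 00/01 predict 'not taken', 10/11 predict 'taken'."""

    def __init__(self, state=0b00):
        self.state = state

    def predict(self):
        return self.state >= 0b10

    def update(self, taken):
        if taken:
            self.state = min(self.state + 1, 0b11)   # saturate at strongly taken
        else:
            self.state = max(self.state - 1, 0b00)   # saturate at strongly not taken


# Same loop-closing branch as a typical test case: T, T, T, NT, executed twice.
counter = TwoBitCounter(state=0b11)   # assumed start state for this example
loop = ([True] * 3 + [False]) * 2
misses = 0
for taken in loop:
    if counter.predict() != taken:
        misses += 1
    counter.update(taken)
print(misses, bin(counter.state))
```

A single not-taken outcome at the loop exit only weakens the state from 11 to 10, so the next loop entry is still predicted as taken.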
2.2. Local prediction (5)
Accessing BHTs/BTACs
• Indexed access: part of the IFA is used as an index into the BHT (C: counters).
  For large tables most branches will map to a unique entry.
  For smaller tables multiple branches may map to the same entry, resulting
  in interference and thus in degraded prediction accuracy.
  Examples:
  16K entry local BHT (Power4)
  16K entry global BHT (Power4)
  16K entry selector table (Power4)
• Cache-like access (direct / set associative): the index part of the IFA selects a set, the tag part is compared with the stored tags (e.g. two-way set associative).
  Reduces interference but increases cost.
  Examples:
  128*4 way BHT/BTAC (Pentium Pro)
  1K*4 way BHT/BTAC (Pentium II, III, 4)
  128*2 way BTAC (Power3)
• Associative access: the IFA is compared with the tags of all entries.
  Avoids interference but strongly increases cost.
  Example:
  64 entry BTAC (PPC 604)
Figure 2.5: Alternatives for accessing Branch History Tables or Branch Target Address Buffers
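The interference effect of indexed access can be illustrated with a small Python sketch. The helper `bht_index`, the two branch addresses and the assumption of word-aligned (4-byte) instructions are all hypothetical and only serve the example.

```python
def bht_index(ifa, entries):
    """Index a tagless BHT with low-order bits of the IFA (assumes 4-byte instructions)."""
    assert entries & (entries - 1) == 0, "entries must be a power of two"
    return (ifa >> 2) & (entries - 1)


branch_a, branch_b = 0x1000, 0x1100   # two hypothetical branch addresses

small = (bht_index(branch_a, 16), bht_index(branch_b, 16))
large = (bht_index(branch_a, 16 * 1024), bht_index(branch_b, 16 * 1024))

print("16-entry table :", small)   # both branches share one entry (interference)
print("16K-entry table:", large)   # each branch gets its own entry
```

In the small table both branches update the same counter, so each one disturbs the history of the other; a set associative or fully associative table would tell them apart by their tags.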
2.2. Local prediction (9)
1-level (local) prediction
Fixed prediction
Always the same prediction
'Always not taken' 'Always taken'
approach
approach
Dynamic prediction
Static prediction
Based on the object code
Displacement-based
Opcode-based
Based on the execution history
1-bit
prediction
80486 (1989)
Pentium (1993)
MC 68040 (1990)
MC 68060 (1993)
SuperSparc (1992)
UltraSparc (1995)
R4000 (1992)
POWER1 (1990)
POWER2 (1993)
2-bit
prediction
R8000 (1994)
PPC 601 (1993)
PPC 601 (1993)
R10000 (1996)
PPC 604 (1995)
PPC 620 (1996)
PPC: PowerPC
Figure 2.8: Early branch prediction mechanisms and their trends indicated
by subsequent models of pipelined, 1. and 2. generation superscalars
3-bit
prediction
2.2. Local prediction (10)
Local prediction
1-level
2-level
Fixed
prediction
Static
prediction
Dynamic
prediction
Always the same
prediction
Based on the
object code
Based on the
execution history
2.2. Local prediction (11)
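2-level local branch prediction
(1st level: branch patterns, 2nd level: history bits)
• Shared counters: with a shared global history table for all patterns (Alpha 21264)
  [Diagram: the IFA selects an entry of a local BHT (e.g. 1K×10 bit) holding the branch pattern, e.g. 1100101001; the pattern selects an entry of a second local BHT (e.g. 1K×3 bit), e.g. 101, which gives the branch prediction.]
  The 21264 uses 3-bit saturating counters
  whose most significant bit provides the prediction.
• Individual counters: with individual history tables for different patterns (Pentium Pro)
  [Diagram: the IFA selects an entry of a local BHT (e.g. 128×4 bit, e.g. 4 ways each) holding a 4-bit pattern, e.g. 0110 (= 6); the pattern selects one counter in a local BHT (e.g. 16×2 bit), e.g. 10, which gives the branch prediction.]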
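A 2-level local predictor with shared counters can be sketched as follows. The table sizes mirror the "1K×10 bit / 1K×3 bit" example above, but the class name, the modulo indexing and the update order are assumptions of this sketch, not a description of the actual Alpha 21264 logic.

```python
class TwoLevelLocalPredictor:
    """Level 1: per-branch history patterns. Level 2: counters shared by all branches."""

    def __init__(self, hist_entries=1024, hist_bits=10, ctr_bits=3):
        self.hist_entries = hist_entries
        self.hist_mask = (1 << hist_bits) - 1
        self.ctr_max = (1 << ctr_bits) - 1
        self.ctr_msb = 1 << (ctr_bits - 1)
        self.histories = [0] * hist_entries       # e.g. 1K x 10 bit
        self.counters = [0] * (1 << hist_bits)    # e.g. 1K x 3 bit

    def predict(self, ifa):
        pattern = self.histories[ifa % self.hist_entries]
        # The most significant counter bit provides the prediction.
        return self.counters[pattern] >= self.ctr_msb

    def update(self, ifa, taken):
        slot = ifa % self.hist_entries
        pattern = self.histories[slot]
        ctr = self.counters[pattern]
        self.counters[pattern] = min(ctr + 1, self.ctr_max) if taken else max(ctr - 1, 0)
        # Shift the new outcome into the branch's own history pattern.
        self.histories[slot] = ((pattern << 1) | int(taken)) & self.hist_mask


# A loop-closing branch with the repeating outcome pattern T, T, T, NT.
predictor = TwoLevelLocalPredictor()
outcomes = ([True] * 3 + [False]) * 100
late_misses = 0
for n, taken in enumerate(outcomes):
    if n >= 360 and predictor.predict(0x40) != taken:
        late_misses += 1
    predictor.update(0x40, taken)
print(late_misses)
```

Once the counters are trained, the history pattern before the loop exit differs from the patterns inside the loop, so even the exit is predicted correctly, which a plain 2-bit counter cannot achieve.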
2.2. Local prediction (12)
[Diagram: the tag and index fields of the (linear) address select one of 128 sets (0 to 127) with four ways (Way 0 to Way 3); each way holds tags and a 4-bit history. The 4-bit history, e.g. 0101, selects one of 16 two-bit counters (0 to 15).]
xx: 00/01 not taken
    10/11 taken
Figure 2.9.: The principle of Pentium Pro’s 128x4 way set associative BHT
2.2. Local prediction (13)
[Diagram: 128 sets (0 to 127), each with four ways; every way stores a tag, a history field (H) and counters (C).]
Figure 2.10.: The actual layout of Pentium Pro’s 128x4 way set associative BHT
2.3. Global prediction (1)
Basic branch prediction mechanism
Processor based
Local
Global
(2-level)
Compiler hints
Combined
(Choice prediction)
2.3. Global prediction (1)
Global prediction
Simple global
2.3. Global prediction (1)
[Diagram: the global history (a shift register holding the branch history, e.g. 0110011) indexes the BHT, whose entry x gives the prediction.]
Figure 2.11.: Simple global prediction
2.3. Global prediction (1)
Global prediction
Simple global
Gshare
2.3. Global prediction (1)
[Diagram: the global history (branch history), e.g. 0110011, is XORed with bits of the IFA, e.g. ...1001100; the result indexes the BHT, whose entry x gives the prediction.]
Figure 2.12.: Principle of the Gshare prediction
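The Gshare index computation is a single XOR, as the following Python sketch shows. The function name `gshare_index` and the 7-bit example values are illustrative assumptions.

```python
def gshare_index(ifa, global_history, index_bits):
    """Gshare: XOR the global history with low-order IFA bits to index the BHT."""
    mask = (1 << index_bits) - 1
    return (ifa ^ global_history) & mask


# Illustrative 7-bit values
history = 0b0110011
ifa_bits = 0b1001100
index = gshare_index(ifa_bits, history, 7)
print(bin(index))

# The same branch reached over a different path maps to a different BHT entry.
other_path = gshare_index(ifa_bits, 0b0110010, 7)
print(bin(other_path))
```

Because both the address and the history influence every index bit, the whole table is used even when only a few branches or a few history patterns are active.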
2.3. Global prediction (1)
Global prediction
Simple global
Gshare
Gselect
2.3. Global prediction (1)
[Diagram: bits of the global history (branch history) and bits of the IFA are concatenated to form the index into the BHT, whose entry x gives the prediction.]
Figure 2.13.: Principle of the Gselect prediction
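For comparison with Gshare, the Gselect index can be sketched as a concatenation. The function name `gselect_index`, the field widths and the choice to place the history bits above the address bits are assumptions of this sketch.

```python
def gselect_index(ifa, global_history, addr_bits, hist_bits):
    """Gselect: concatenate global history bits with low-order IFA bits."""
    addr_part = ifa & ((1 << addr_bits) - 1)
    hist_part = global_history & ((1 << hist_bits) - 1)
    # Assumed layout: history bits in the upper part of the index, address bits below.
    return (hist_part << addr_bits) | addr_part


index = gselect_index(ifa=0b10110, global_history=0b011, addr_bits=4, hist_bits=3)
print(bin(index))
```

With a fixed table size, every bit given to the history is taken away from the address part, which is the trade-off that the XOR of Gshare avoids.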
2.4. Combined prediction (1)
Basic branch prediction mechanism
Processor based
Local
Global
(2-level)
Compiler hints
Combined
(Choice prediction)
2.4. Combined prediction (2)
[Diagram: the IFA indexes the local BHT, the global history indexes the global BHT; a "best choice" BHT selects between the local prediction and the global prediction to form the resulting prediction. The local prediction, the global prediction and the actual prediction are fed back for updating.]
Figure 2.14.: Principle of the combined local and global prediction
(as used in the Alpha 21264, or the POWER 4)
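The choice mechanism can be sketched in Python as below. This is a deliberately simplified model: the local component is a plain table of 2-bit counters rather than a 2-level predictor, and the class name, table sizes, initial counter values and update policy are assumptions of the sketch, not details of the Alpha 21264 or the POWER 4.

```python
def bump(counter, up, maximum=3):
    """Move a saturating counter one step up or down."""
    return min(counter + 1, maximum) if up else max(counter - 1, 0)


class CombinedPredictor:
    def __init__(self, local_entries=1024, history_bits=12):
        self.local_entries = local_entries
        self.history_mask = (1 << history_bits) - 1
        self.history = 0                               # global history shift register
        self.local = [1] * local_entries               # 2-bit counters, indexed by IFA
        self.global_ = [1] * (1 << history_bits)       # 2-bit counters, indexed by history
        self.choice = [1] * (1 << history_bits)        # >= 2 selects the global prediction

    def predict(self, ifa):
        local_pred = self.local[ifa % self.local_entries] >= 2
        global_pred = self.global_[self.history] >= 2
        return global_pred if self.choice[self.history] >= 2 else local_pred

    def update(self, ifa, taken):
        li, gi = ifa % self.local_entries, self.history
        local_pred = self.local[li] >= 2
        global_pred = self.global_[gi] >= 2
        if local_pred != global_pred:
            # Train the choice table towards whichever component was right.
            self.choice[gi] = bump(self.choice[gi], global_pred == taken)
        self.local[li] = bump(self.local[li], taken)
        self.global_[gi] = bump(self.global_[gi], taken)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


# An alternating branch (T, NT, T, NT, ...) defeats the local 2-bit counter,
# but the global history captures it and the choice table learns to prefer it.
predictor = CombinedPredictor()
outcomes = [i % 2 == 0 for i in range(200)]
late_misses = 0
for n, taken in enumerate(outcomes):
    if n >= 150 and predictor.predict(0x40) != taken:
        late_misses += 1
    predictor.update(0x40, taken)
print(late_misses)
```

The choice counters are only updated when the two components disagree, so a branch that both predict equally well does not bias the selection.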
2.4. Combined prediction (3)
Combined prediction
Alpha 21264
  1. prediction: 2-level local dynamic prediction with a shared counter table for all patterns (1K * 10 bits/1K * 3 bits)
  2. prediction: Simple 2-level global prediction (12-bit global history/4K * 2 bits)
  Choice: Global history referenced choice table (12-bit global history/4K * 2-bits)
Figure 2.15.: Implementation alternatives of the combined prediction
2.4. Combined prediction (4)
• Minimum branch penalty: 7 cycles
• Typical branch penalty: 11+ cycles (IQ delay)
• 48K bits of target addresses stored in I-cache
• 32-entry return address stack
• Predictor tables are reset on a context switch
Figure 2.16.: The combined predictor of the Alpha 21264
Source: Microprocessor Report, 10/28/96
2.4. Combined prediction (5)
Combined prediction
Alpha 21264
  1. prediction: 2-level local dynamic prediction with a shared counter table for all patterns (1K * 10 bits/1K * 3 bits)
  2. prediction: Simple 2-level global prediction (12-bit global history/4K * 2 bits)
  Choice: Global history referenced choice table (12-bit global history/4K * 2-bits)
POWER 4
  1. prediction: 1-level local dynamic prediction (16K * 1-bit)
  2. prediction: 2-level Gshare global prediction (11-bit global history is hashed with the IFA, 16K * 1-bit counter table)
  Choice: Accessed in the same way as the global counter table (16K * 1-bit)
Figure 2.17.: Implementation alternatives of the combined prediction
2.4. Combined prediction (6)
[Diagram: the IFA indexes a 16K*1bit BHT (local history) that gives the local prediction. The 11-bit global history (1 bit per group) is XORed with bits of the IFA; the resulting 14-bit index accesses a 16K*1bit global history table, which gives the global prediction, and a 16K*1bit selector table, which selects the better of the two predictions. All three tables are updated.]
Figure 2.18.: The principle of the combined predictor of the POWER 4
2.5. Overview of the basic branch prediction mechanisms
Basic prediction mechanism
Local
Global
Fixed
prediction
Static
prediction
Dynamic
1-bit
Pentium 1
2-level
2-level
1-level
Shared
counters
2-bit
Combined
(Choice
prediction)
Individual
counters
Simple
global
Gshare
Gselect
3-bit
Pentium
(256*2)
Pentium Pro
(512*2)
Pentium Pro
P4 Will/Northw.
(4K*2)
P4 Will/Northw.
P4 Prescott
(4K*2)
P4 Prescott
K6
K6
(8K*2)
K7
K7
K8
K8
(16K*2)
PPC 604
PPC 604
(512*2)
PPC 620
PPC 620
(2K*2)
POWER 3
POWER 3
(2K*2)
(POWER 4)
(11-bit/16K*1)
(POWER 4)
(16K*1)
POWER 4
1
Alpha 21164
Alpha 21164
(2K*2)
(Alpha 21264)
(1K*10/1K*3)
Alpha 21264
PA-8000
(Alpha 21264)
(12 bits/4K*2)
1
1. generation superscalars
Alpha 21264
PA-8000
(256*3)
PA-8500/8700
UltraSPARC-III
POWER 4
PA-8500/8700
UltraSPARC-III
(12-bits/16K*2)
Figure 2.20.: Trends of branch prediction schemes used in 2. and 3. generation superscalars
3. Auxiliary branch prediction mechanisms
Auxiliary branch prediction mechanisms
Backup use
of static
prediction
Pentium
1
Pentium
Pentium Pro
Pentium Pro
P4 Will/Northw.
P4 Will/Northw.
P4 Prescott
P4 Prescott
K6
K7
K8
PPC 604
PPC 620
POWER 3
POWER 4
POWER 5
Alpha 21164
1
Alpha 21264
PA-8000
PA-8500/8700
UltraSPARC-III
1:
1. generation superscalars
1
2:
Supported by compiler hints
RAS: Return Address Stack
Figure 3.1.: Overview of auxiliary branch prediction mechanisms in 2. and 3. generation superscalars
Figure 3.2: Static branch prediction algorithm of the Pentium Pro
Source: Shanley T., „Pentium Pro Processor System Architecture„, Addison-Wesley Developers Press, 1996
3. Auxiliary branch prediction mechanisms
Auxiliary branch prediction mechanisms
Backup use
of static
prediction
Pentium
1
Preemptive
use of
compiler hints
Pentium
Pentium Pro
P4 Will/Northw.
P4 Prescott
P4 Prescott
P4 Will/Northw.
P4 Prescott
K6
K7
K8
PPC 604
PPC 620
P4 Will/Northw.
P4 Prescott
K6
(16-entries)
K7
(12-entries)
K8
(12-entries)
PPC 620
POWER 3
POWER 3
POWER 4
POWER 4
POWER 5
Alpha 21164
RAS
Pentium Pro
Pentium Pro
P4 Will/Northw.
Dedicated prediction
POWER 5
1
Alpha 21264
PA-8000
POWER 4 2
POWER 52
Alpha 21164
(12-entries)
Alpha 21264
(32-entries)
PA-8000
PA-8500/8700
UltraSPARC-III
UltraSPARC-III
1:
1. generation superscalars
1
2:
Supported by compiler hints
UltraSPARC-III
(8-entries)
RAS: Return Address Stack
Figure 3.1.: Overview of auxiliary branch prediction mechanisms in 2. and 3. generation superscalars
Return Address Stack (RAS)
• RAS: PUSH the return address on a CALL, POP the return address on a RET;
  used to continue execution speculatively from the popped return address.
• Architectural stack: PUSH the return address on a CALL, POP the return address on a RET,
  with preserved sequential consistency.
The problem of RASs:
A procedure such as printf() may be called from many different locations,
so it has many different return addresses.
During speculative out-of-order execution, however,
the logical sequence of the related PUSH and POP operations (CALLs and RETs) may be disturbed,
so the predicted return address may be wrong.
To check the prediction, the RET instruction is executed,
and on a misprediction a repair mechanism is activated
(to cancel wrongly executed instructions and repair the corrupted RAS).
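The PUSH/POP behaviour of a RAS can be sketched as a bounded stack. The class name `ReturnAddressStack`, the default depth of 12 and the overflow policy (dropping the oldest entry) are assumptions for illustration; the repair mechanism mentioned above is not modelled.

```python
class ReturnAddressStack:
    """Small hardware-like stack that predicts the target of RET instructions."""

    def __init__(self, depth=12):
        self.depth = depth
        self.entries = []

    def on_call(self, return_address):
        if len(self.entries) == self.depth:
            self.entries.pop(0)          # assumed overflow policy: drop the oldest entry
        self.entries.append(return_address)

    def on_ret(self):
        """Return the predicted return address, or None if the stack is empty."""
        return self.entries.pop() if self.entries else None


ras = ReturnAddressStack()
ras.on_call(0x1004)      # hypothetical caller A calls f
ras.on_call(0x2008)      # f calls printf
first = ras.on_ret()     # printf returns into f
second = ras.on_ret()    # f returns into caller A
print(hex(first), hex(second))
```

A BTAC entry for the RET of printf would hold only the last return address, whereas the stack supplies the address that matches the most recent call.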
3. Auxiliary branch prediction mechanisms
Auxiliary branch prediction mechanisms
Backup use
of static
prediction
Pentium
1
Preemptive
use of
compiler hints
RAS
Loop detector
Indirect
branch pred.
Pentium
Pentium Pro
Pentium Pro
Pentium Pro
P4 Will/Northw.
P4 Will/Northw.
P4 Prescott
P4 Prescott
P4 Will/Northw.
P4 Prescott
K6
K7
K8
PPC 604
P4 Will/Northw.
P4 Prescott
P4 Prescott
K6
(16-entries)
K7
(12-entries)
K8
(12-entries)
PPC 604
PPC 620
PPC 620
PPC 620
POWER 3
POWER 3
POWER 4
POWER 4
POWER 5
Alpha 21164
Dedicated prediction
POWER 5
1
Alpha 21264
PA-8000
POWER 4 2
POWER 52
Alpha 21164
(12-entries)
Alpha 21264
(32-entries)
POWER 4 2
POWER 5 2
POWER 4
PA-8000
PA-8500/8700
UltraSPARC-III
UltraSPARC-III
1:
1. generation superscalars
1
2:
Supported by compiler hints
UltraSPARC-III
(8-entries)
RAS: Return Address Stack
Figure 3.1.: Overview of auxiliary branch prediction mechanisms in 2. and 3. generation superscalars
4. Accessing the branch target path (1)
4.1. Overview
BTA
Calculated on the fly
Figure 4.1.: Alternatives to generate the BTA
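[Diagram: the instruction fetch address register (IFAR) supplies the instruction fetch address (IFA) to the I-cache, which delivers the instructions I, I+1, I+2, I+3. The next IFA is either the incremented sequential address (IIFA) or the BTA computed from a fetched branch; from the BTA the I-cache delivers BTI, BTI+1, BTI+2, BTI+3.]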
This scheme is employed in earlier scalar (pipelined) processors as well as in a number of
superscalar processors, such as:
Z 80000 (1984)
i486 (1989)
MC 68040 (1990)
Sparc CY7C601 (1988), SuperSparc (1992p),
Power PC 601 (1993), 603 (1993), Power1 (1990), Power2 (1993),
POWER4 (2001), POWER5 (2005)
21064 (1992), 21064A (1994), 21164 (1995),
R4000 (1992), R 10000 (1996)
Ultra SPARC III (2003)
Figure 4.2.: Principle of calculating the BTA on the fly
Source: Sima, D et. al., ACA, Addison Wesley, 1997, pp. 303
4. Accessing the branch target path (1)
4.1. Overview
BTA
Calculated on the fly
Accessed from the BTAC
Figure 4.1.: Alternatives to generate the BTA
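[Diagram: the instruction fetch address register (IFAR) supplies the instruction fetch address (IFA) to the I-cache and to the BTAC. The I-cache delivers I, I+1, I+2, I+3 or, from the branch target address, BTI, BTI+1, BTI+2, BTI+3. The BTAC is accessed with BA-1 and delivers the BTA; the next IFA is either this BTA or the incremented sequential address (IIFA).]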
The Branch Target Address Cache (BTAC) contains branch target addresses (BTAs). These BTAs are read
from the BTAC when the instruction immediately preceding a branch is fetched. (The addresses of these
preceding instructions are designated as BA-1.)
Figure 4.3.: Principle of the BTAC scheme to access the branch target path
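The role of the BTAC in selecting the next fetch address can be sketched as follows. This is a simplified model: the table is a fully associative dictionary keyed by the fetch address itself (not by BA-1), and the class name `BTAC`, the 4-byte fetch step and the create/delete policy are assumptions of the sketch.

```python
class BTAC:
    """Fully associative branch target address cache: fetch address -> BTA."""

    def __init__(self):
        self.targets = {}

    def next_fetch_address(self, ifa, predict_taken, step=4):
        bta = self.targets.get(ifa)
        if bta is not None and predict_taken:
            return bta                 # BTAC hit: continue along the taken path
        return ifa + step              # BTAC miss or predicted not taken: sequential address

    def update(self, ifa, taken, bta):
        if taken:
            self.targets[ifa] = bta    # create or refresh the entry
        else:
            self.targets.pop(ifa, None)  # assumed policy: delete the entry


btac = BTAC()
before = btac.next_fetch_address(0x100, predict_taken=True)   # miss: sequential
btac.update(0x100, taken=True, bta=0x200)
after = btac.next_fetch_address(0x100, predict_taken=True)    # hit: BTA
print(hex(before), hex(after))
```

Since the target comes from a table lookup rather than from decoding and address calculation, the target path can be accessed without waiting for the branch to pass through the pipeline.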
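[Diagram: the IFA from the IFAR accesses the I$, the BHT (counters C) and the BTAC (tags, BTA) in parallel. The next IFA is the BTA if the BTAC hits, the incremented address (IIFA) if the BTAC misses, or the BTA supplied after a misprediction. Fetched instructions go to the IB for further processing. The BHT is updated with the branch result; the BTAC is updated with the BTA (a BTAC entry is created or deleted) if the BHT initiates it.]
Figure 4.4.: The principle of branch prediction using both a BHT and a BTAC
(C: counter)
(Designated as BTB (Branch Target Buffer) by Intel)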
Processor                          Number of BTAC entries   Implementation of the BTAC
ES/9000 520-based procs (1992p)    4K                       2-way associative
Pentium (1994)                     256                      Fully associative
Pentium Pro                        512                      4-way associative
Pentium 4                          4K                       4-way associative
MC 68060 (1993)                    256                      4-way associative
R 8000 (1994) [1]                  1K                       -
PA 8000 (1995)                     32                       Fully associative
Power PC 604 (1994)                64                       Fully associative
Power PC 620 (1995)                256                      Fully associative
[1]: Each entry is shared among 4 instructions
Figure 4.5.: Examples of processors using the BTAC scheme
Figure 4.6.: The physical implementation of branch prediction
in Intel’s P4 Northwood and Prescott cores
Source: de Vries H., „Looking at Intel’s Prescott die, part II.”, http://www.chip-architect.com, April 2003
4. Accessing the branch target path (1)
4.1. Overview
BTA
Calculated on the fly
Accessed from BTAC
Figure 4.1.: Alternatives to generate the BTA
From the I$
[Diagram: the instruction fetch address (IFA) from the IFAR accesses the I-cache and the BTIC in parallel. A BTIC entry holds the branch address (BA), the branch target instruction (BTI) and the address following the BTI (BTA+). Either the instruction from the I-cache or the BTI from the BTIC is passed on to decoding; the next IFA is either the incremented address or BTA+.]
The BTIC contains the addresses of the most recently taken branches (BA), the corresponding branch
target instructions (BTI) and the addresses of the instructions following the BTIs (BTA+). When there
is an entry in the BTIC for the actual IFA, the corresponding BTI is fetched from the BTIC and
selected for decoding instead of the instruction from the I-cache. The address of the subsequent
instruction along the taken path is also read from the BTIC and becomes the next IFA.
Examples:
Gmicro/200 (1988), AM 29000 (1988), MC 88110 (1993).
Figure 4.7.: Principle of the BTIC scheme to access the branch target path
4. Accessing the branch target path (1)
4.1. Overview
BTA
Calculated on the fly
Accessed from BTAC
From the I$
PPro/PII/PIII/P4
21264
Examples
Ultra SPARC III
K6
Power 4, 5
K7/K8
Power 3
Figure 4.8.:Trends to generate the BTA
4.2. Case example 1: K7 (1)
To each 16-byte fetch block a 16-bit selector block is allocated as follows:
[Diagram: bytes 15 to 0 of the fetch block (16 bytes) are mapped pairwise to bits 15 to 0 of the selector block (16 bits); the BTA is supplied by instruction execution.]
The selector block identifies the branches included in the associated fetch block.
Two bits of the selector block correspond to two bytes of the fetch block.
RETs are a single byte long; all other branches are at least two bytes long.
Assuming at most a single RET in the fetch block, there may be at most one
branch ending in any pair of bytes.
A fetch block may hold at most a single RET and two non-RET branches.
More branches in a fetch block lead to conflicts in the prediction logic.
4.2. Case example 1: K7 (2)
Each two-bit entry indicates whether or not there is a branch ending in the
corresponding two bytes of the fetch block; if so, it identifies the type of
the branch as well. A branch instruction that crosses the 16-byte boundary
is assigned to the second 16-byte window.
Coding of the two bits (assumed)
00: no branch
01: RET
10: There is a conditional branch whose BTA is in the BTA0 field of the BTAC
11: There is a conditional branch whose BTA is in the BTA1 field of the BTAC
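The (assumed) two-bit coding above can be decoded with a short Python sketch. The bit ordering (entry k in bits 2k+1..2k, covering bytes 2k and 2k+1 of the fetch block), the function names and the example value are assumptions made for illustration only.

```python
SELECTOR_MEANING = {
    0b00: "no branch",
    0b01: "RET",
    0b10: "cond. branch, BTA in BTA0",
    0b11: "cond. branch, BTA in BTA1",
}


def decode_selector(block):
    """Split a 16-bit selector block into eight 2-bit entries (entry k = byte pair k)."""
    assert 0 <= block < (1 << 16)
    return [(block >> (2 * k)) & 0b11 for k in range(8)]


def branches_in_block(block):
    """List (byte pair, meaning) for every byte pair in which a branch ends."""
    return [(k, SELECTOR_MEANING[e]) for k, e in enumerate(decode_selector(block)) if e]


# Hypothetical selector block with two conditional branches.
example = 0x0230
print(decode_selector(example))
print(branches_in_block(example))
```

Eight 2-bit entries suffice because, under the stated assumptions, at most one branch can end in any pair of bytes.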
4.2. Case example 1: K7 (3)
Characteristic examples of selector settings:
Selector setting           Content of the fetch block              Next fetch address
xx 00 00 00 00 00 00 00    No branch                               IFA+16
xx 00 01 00 00 00 00 00    A RET instruction                       Return address of the RET
xx 00 00 00 10 00 00 00    A cond. branch (its BTA is in           BTA0 if taken, else IFA+16
                           the BTAC 0 field)
xx 00 00 10 00 11 00 00    Two cond. branches (their BTAs are      BTA0 if BC1 is taken, else BTA1 if
                           in the BTAC 0 and BTAC 1 fields)        BC2 is taken, else IFA+16
During predecoding, instruction boundaries as well as branch instructions
are detected and the appropriate selector entries are marked accordingly.
Predecoding is performed at no more than 4 bytes/cycle.
If a cache line (64 bytes = 4 fetch blocks) is replaced, all associated
selector blocks are invalidated.
4.2. Case example 1: K7 (4)
The selector table is shared between the upper and lower parts of the I$,
and an extra address bit (A) identifies whether the entry belongs to the
upper or the lower part of the I$.
Source: Kaiser, A. ,”K7 Branch Prediction”, Dec. 1999, http://www.s.netic.de
4.2. Case example 1: K7 (5)
[Block diagram: the IFA from the IFAR is split into tag and index fields. IFA[14:4] indexes the 2-way set associative I$ (Way 0 and Way 1, each 1K*16B fetch blocks with predecode bits (16B+P) and tags compared with IFA[31:15]); IFA[3:0] addresses the bytes of the 16 B fetch block. IFA[13:4] indexes the BTAC (1K x 2 addr.: BTA 0 and BTA 1, 32-bit, written by the BTA from execution) and the selector table (16-bit selector blocks, shared for the upper and lower parts of the I$, written by the fetch unit during predecoding; entry fields A, W, C). The selector entry addressed by IFA[3:1] chooses the next fetch address: sequential (+16, no branch), the RET address from the 12-entry RAT, or BTA 0 / BTA 1, where a conditional branch is taken or not according to the global prediction and an unconditional branch is taken. Instructions are decoded and issued beginning with the given address.]
Figure 4.9.: Assumed simplified scheme of accessing the branch target path in the K7,
without showing the global prediction (A: address bit, C: Conditional branch, W: Way)
4.2. Case example 2: K8 (1)
The K8 doubled the size of the selector table, so each fetch block has its
own selector entry.
The K8 allows any mix of up to 3 branches (CALL, JMP, RET, conditional) per
fetch block; the coding of the selector entries is modified accordingly.
When instruction cache lines are evicted to the L2 cache, branch selectors
and predecode information are also stored in the L2 cache.
The K8 uses 48-bit addresses, but the BTAC keeps only the 15 least
significant bits to identify the next address.
Each BTA entry identifies the least significant 15 bits of the IFA as well as
additional information, such as
3-bit old IFA (bits 16,15)
W bit: way identifier
4.2. Case example 2: K8 (2)
[Block diagram: the IFA from the IFAR is split into tag and index fields. IFA[14:4] indexes the 2-way set associative I$ (Way 0 and Way 1, each 1K*16B fetch blocks with predecode bits (16B+P) and tags compared with IFA[31:15]); the start address SA[3:0] addresses the bytes of the 16 B fetch block. IFA[12:4] indexes the BTAC (512 x 4 addr.: BTA 0, BTA 1, BTA 2 and one further field marked "?"), which is written by a BTA calculator; each BTA entry holds the new IFA bits 14 to 0 together with old IFA bits 15 and 16, a W bit and R and C flags. The selector table holds one 16-bit selector block per fetch block, written during predecoding. The selector entry addressed by IFA[3:1] chooses the next fetch address: sequential (+16, no branch), BTA0/RET, BTA1/RET or BTA2/RET, with the RET address taken from the 12-entry RAT; a conditional branch is taken or not according to the global prediction, an unconditional branch is taken. Instructions are decoded and issued beginning with the given address.]
Figure 4.10.: Assumed simplified scheme of accessing the branch target path in the K8,
without showing the global prediction (C: Conditional branch, R: Return, W: Way 0/1, SA: Start address)
4.2. Case example 2: K8 (3)
Figure 4.11.: Logical view of Opteron’s (K8’s) instruction fetch and decode stages
Source: de Vries H., „Understanding the detailed Architecture of AMD’s 64 bit Core”, http://www.chip-archtect.com, Sept., 2003
4.2. Case example 2: K8 (4)
Figure 4.12.: Physical implementation of Opteron’s (K8’s) instruction cache and decoding
Source: de Vries H., „Understanding the detailed Architecture of AMD’s 64 bit Core”, http://www.chip-archtect.com, Sept., 2003