A 256Kbits L-TAGE branch predictor

André Seznec
Caps Team, IRISA/INRIA/HIPEAC
Directly derived from:
"A case for (partially) tagged branch predictors",
A. Seznec and P. Michaud, JILP, Feb. 2006
+ Tricks:
• Loop predictor
• Kernel/user histories
TAGE:
TAgged GEometric history length predictors
The genesis
Back around 2003
• 2bcgskew was state-of-the-art, but:
  • it was lagging behind neural-inspired predictors on a few benchmarks
• Just wanted to get the best of both behaviors and maintain:
  • Reasonable implementation cost:
    • Use only global history
    • Medium number of tables
  • In-time response
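The basis: a multiple-length global history predictor
[Figure: tables T0 to T4, indexed with global history lengths L(0) to L(4); how to combine their outputs into one prediction is left open ("?")]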
GEometric History Length predictor
The set of history lengths forms a geometric series:
$L(0) = 0$
$L(i) = \alpha^{i-1} L(1)$
Example: {0, 2, 4, 8, 16, 32, 64, 128}
• Capture correlation on very long histories
• Most of the storage for short histories!
What is important: $L(i) - L(i-1)$ is drastically increasing
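The geometric series can be illustrated with a short Python sketch. The function name `geometric_lengths` and the rounding to the nearest integer are assumptions made for this example; the slides only give the formula and the sample set of lengths.

```python
def geometric_lengths(alpha: float, l1: int, n_tables: int) -> list[int]:
    """Return [L(0), ..., L(n_tables - 1)] with L(0) = 0 and
    L(i) = alpha**(i - 1) * L(1), rounded to the nearest integer."""
    lengths = [0]  # L(0) = 0: the first table uses no history
    for i in range(1, n_tables):
        lengths.append(int(alpha ** (i - 1) * l1 + 0.5))
    return lengths


lengths = geometric_lengths(alpha=2.0, l1=2, n_tables=8)
print(lengths)  # [0, 2, 4, 8, 16, 32, 64, 128]

# The gap between consecutive lengths grows with i.
gaps = [b - a for a, b in zip(lengths, lengths[1:])]
print(gaps)  # [2, 2, 4, 8, 16, 32, 64]
```

With alpha = 2 and L(1) = 2 this reproduces the example set from the slide; most tables cover short histories while the last few reach far back.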
Combining multiple predictions?
• Classical solution:
  • Use of a meta-predictor
    "wasting" storage!?!
    choosing among 5 or 10 predictions??
• Neural-inspired predictors, Jimenez and Lin 2001
  • Use an adder tree instead of a meta-predictor
• Partial matching
  • Use tagged tables and the longest matching history
    Chen et al. 96, Michaud 2005
CBP-1 (2004): OGEHL
Final computation through a sum
[Figure: tables T0 to T4, indexed with history lengths L(0) to L(4), feed an adder (∑); Prediction = Sign of the sum]
12 components: 3.670 misp/KI
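The summation-based combination can be sketched as follows. This is only an illustration of "prediction = sign of a sum of counters": the hash function `index`, the table size, the five history lengths and the choice to treat a zero sum as taken are all assumptions, and the OGEHL update rule is not shown.

```python
HISTORY_LENGTHS = [0, 2, 4, 8, 16]  # L(0)..L(4); illustrative values only
TABLE_BITS = 10                      # 1024 counters per table (assumption)


def index(pc: int, history: int, length: int) -> int:
    """Hypothetical hash of pc with the `length` most recent history bits."""
    h = history & ((1 << length) - 1)
    return (pc ^ h ^ (h >> TABLE_BITS)) & ((1 << TABLE_BITS) - 1)


def predict(tables: list[list[int]], pc: int, history: int) -> tuple[bool, int]:
    """Sum one signed counter per table; the sign of the sum is the prediction."""
    total = sum(t[index(pc, history, n)] for t, n in zip(tables, HISTORY_LENGTHS))
    return total >= 0, total  # a zero sum is treated as taken (assumption)


tables = [[0] * (1 << TABLE_BITS) for _ in HISTORY_LENGTHS]
tables[0][index(0x1234, 0b1011, 0)] = 3    # T0 leans taken
tables[4][index(0x1234, 0b1011, 16)] = -4  # T4 leans not-taken more strongly
print(predict(tables, 0x1234, 0b1011))     # (False, -1)
```

No meta-predictor is needed: every table contributes to the sum, and a strongly biased counter can outvote weaker ones.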
TAGE
Geometric history length + PPM-like
+ optimized update policy
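[Figure: a tagless base predictor indexed with pc, plus tagged tables indexed with hashes of pc and h[0:L1], h[0:L2], h[0:L3]; each tagged entry holds ctr, tag and u fields; tag comparators (=?) drive a chain of multiplexers that selects the final prediction]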
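[Figure: the same multiplexer chain with tag-match outcomes Miss, Hit, Hit; the longest-history component that hits supplies Pred, the next component that hits supplies Altpred]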
Prediction computation
• General case:
  • Longest matching component provides the prediction
• Special case:
  • Many mispredictions on newly allocated entries: weak Ctr
    On many applications, Altpred is more accurate than Pred
  • Property dynamically monitored through a single 4-bit counter
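The selection between Pred and Altpred might look like the sketch below. It assumes a signed 3-bit counter (taken when `ctr >= 0`), treats "weak" as `ctr` in {-1, 0}, and uses the midpoint 8 of the 4-bit counter as the switch-over threshold; none of these encodings are given on the slides, and the names `Entry`, `tage_predict` and `update_use_alt` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    ctr: int      # signed 3-bit counter in [-4, 3]; ctr >= 0 means taken
    tag: int = 0
    u: int = 0


def is_weak(e: Entry) -> bool:
    return e.ctr in (-1, 0)


def tage_predict(base_pred: bool, hits: list[Optional[Entry]],
                 use_alt: int) -> tuple[bool, bool]:
    """hits[i] is the tag-matching entry of tagged table i (shortest history
    first) or None on a miss. Returns (final prediction, altpred)."""
    matching = [e for e in hits if e is not None]
    if not matching:
        return base_pred, base_pred          # only the base predictor answers
    provider = matching[-1]                   # longest matching history
    pred = provider.ctr >= 0
    altpred = matching[-2].ctr >= 0 if len(matching) > 1 else base_pred
    if is_weak(provider) and use_alt >= 8:    # special case from the slide
        return altpred, altpred
    return pred, altpred


def update_use_alt(use_alt: int, provider: Entry, pred: bool,
                   altpred: bool, taken: bool) -> int:
    """Monitor, on weak provider entries, whether altpred beats pred."""
    if is_weak(provider) and pred != altpred:
        return min(use_alt + 1, 15) if altpred == taken else max(use_alt - 1, 0)
    return use_alt
```

For example, `tage_predict(True, [Entry(ctr=-3), None, Entry(ctr=2)], 0)` returns `(True, False)`: the third table provides Pred and the first one provides Altpred.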
TAGE update policy
• General principle:
  Minimize the footprint of the prediction.
  • Just update the longest history matching component and allocate at most one entry on mispredictions
A tagged table entry
• Ctr: 3-bit prediction counter
• U: 2-bit useful counter
  • Was the entry recently useful?
• Tag: partial tag
[Figure: entry layout with fields U, Tag, Ctr]
Updating the U counter
If (Altpred ≠ Pred) then
• Pred = taken (the prediction was correct): U = U + 1
• Pred ≠ taken (the prediction was wrong): U = U - 1
Graceful aging:
• Periodic shift of all U counters
  • implemented through the reset of a single bit
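A minimal sketch of the U counter rules above, where `taken` is the actual branch outcome. The saturation bounds follow from the 2-bit width; which bit is cleared at each aging period is not stated on the slide, so `age` takes it as a parameter (an assumption of this example).

```python
U_MAX = 3  # 2-bit useful counter


def update_u(u: int, pred: bool, altpred: bool, taken: bool) -> int:
    """Update U only when the provider and the alternate prediction disagree."""
    if altpred != pred:
        if pred == taken:
            return min(u + 1, U_MAX)  # the entry was useful
        return max(u - 1, 0)          # the entry was harmful
    return u


def age(u_counters: list[int], bit: int) -> list[int]:
    """Graceful aging: clear one bit (0 = LSB, 1 = MSB) in every U counter."""
    mask = ~(1 << bit) & U_MAX
    return [u & mask for u in u_counters]


print(update_u(1, pred=True, altpred=False, taken=True))  # 2
print(age([3, 2, 1, 0], bit=1))                           # [1, 0, 1, 0]
```

Clearing a single bit column is cheap in hardware and still lets recently useful entries keep part of their protection.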
Allocating a new entry on a misprediction
• Find a single "useless" entry with a longer history:
  • Privilege the smallest possible history
    • To minimize footprint
  • But not too much
    • To avoid ping-pong phenomena
• Initialize Ctr as weak and U as zero
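One possible reading of this allocation policy is sketched below. The slide does not say how "but not too much" is implemented; here each longer-history candidate is half as likely to be chosen as the previous one, and when no candidate has U = 0 the U counters of the longer-history candidates are decremented. Both choices, and the names `Entry` and `allocate`, are assumptions of this example.

```python
import random
from dataclasses import dataclass

WEAK_TAKEN, WEAK_NOT_TAKEN = 0, -1  # weak states of a signed 3-bit counter


@dataclass
class Entry:
    tag: int = 0
    ctr: int = 0
    u: int = 0


def allocate(candidates: list[Entry], provider: int, new_tag: int,
             taken: bool, rng: random.Random):
    """candidates[i] is the entry selected in tagged table i (shortest
    history first); provider is the index of the providing table, or -1
    for the base predictor. Returns the allocated table index or None."""
    useless = [i for i in range(provider + 1, len(candidates))
               if candidates[i].u == 0]
    if not useless:
        # Nothing to steal: age the longer-history candidates instead.
        for e in candidates[provider + 1:]:
            e.u = max(e.u - 1, 0)
        return None
    # Favor short histories, but not always the shortest one.
    weights = [2.0 ** -k for k in range(len(useless))]
    chosen = rng.choices(useless, weights=weights, k=1)[0]
    entry = candidates[chosen]
    entry.tag, entry.u = new_tag, 0
    entry.ctr = WEAK_TAKEN if taken else WEAK_NOT_TAKEN
    return chosen


rng = random.Random(0)
cands = [Entry(u=1), Entry(u=0), Entry(u=2), Entry(u=0)]
idx = allocate(cands, provider=0, new_tag=0x5A, taken=True, rng=rng)
print(idx in (1, 3))  # True: only tables 1 and 3 hold a useless entry
```

At most one entry is written per misprediction, which keeps the footprint of each branch small.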
Improve the global history
• Address + conditional branch history:
  • path confusion on short histories
• Address + path:
  • Direct hashing leads to path confusion
1. Represent all branches in the branch history
2. Also use path history (1 bit per branch, limited to 16 bits)
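The two history registers could be maintained as in this sketch. The slide does not specify which bit an unconditional branch contributes or which address bit feeds the path history; inserting 1 for unconditional branches and using the lowest pc bit are assumptions, as is the class name `Histories`.

```python
PATH_BITS = 16  # path history is limited to 16 bits


class Histories:
    def __init__(self, max_len: int = 640) -> None:
        self.max_len = max_len
        self.ghist = 0  # branch history, bit 0 = most recent branch
        self.phist = 0  # path history, one address bit per branch

    def update(self, pc: int, taken: bool, conditional: bool = True) -> None:
        # Every branch is represented, not only the conditional ones.
        bit = int(taken) if conditional else 1
        self.ghist = ((self.ghist << 1) | bit) & ((1 << self.max_len) - 1)
        self.phist = ((self.phist << 1) | (pc & 1)) & ((1 << PATH_BITS) - 1)


h = Histories(max_len=4)
h.update(0x11, taken=True)
h.update(0x10, taken=False)
h.update(0x13, taken=False, conditional=False)  # e.g. a jump
print(bin(h.ghist), bin(h.phist))  # 0b101 0b101
```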
Design tradeoff for CBP2 (1)
• 13 components:
  • Bring the best accuracy on the distributed traces
    • 8 components are not very far behind!
• History length:
  • Min = 4, Max = 640
    Could use any Min in [2, 6] and any Max in [300, 2000]
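As a worked example, the history lengths of the 12 tagged components (T1 to T12) can be derived from Min and Max alone by spreading them geometrically. The helper name `history_lengths` and the rounding to the nearest integer are assumptions of this sketch; the slides only state Min, Max and the number of components.

```python
def history_lengths(l_min: int, l_max: int, n_tagged: int) -> list[int]:
    """Lengths for the tagged tables, forming a geometric series that
    starts at l_min and ends at l_max."""
    ratio = l_max / l_min
    return [int(l_min * ratio ** (i / (n_tagged - 1)) + 0.5)
            for i in range(n_tagged)]


print(history_lengths(4, 640, 12))
# [4, 6, 10, 16, 25, 40, 64, 101, 160, 254, 403, 640]
```

The implied common ratio is (640 / 4) ** (1 / 11), about 1.59, so the wide tolerance on Min and Max mostly shifts the ratio slightly.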
Design tradeoff for CBP2 (2)
• Tag width tradeoff:
  • (destructive) false match is better tolerated on shorter history
  • 7 bits on T1 to 15 bits on T12
• Tuning the number of table entries:
  • Smaller number for very long histories
  • Smaller number for very short histories
Adding a loop predictor
• The loop predictor captures the number of iterations of a loop
  • When it successively encounters the same number of iterations 4 times, the loop predictor provides the prediction.
• Advantages:
  • Very reliable
  • Small storage budget: 256 52-bit entries
• Complexity?
  • Might be difficult to manage speculative iteration numbers on deep pipelines
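The behavior described above can be sketched in a few lines. This is a simplified model, not the submitted design: entries live in a dictionary keyed by pc instead of a 256-entry tagged table, a loop is assumed to be a branch taken N times and then not taken once, and speculative state is ignored. The class names are hypothetical.

```python
class LoopEntry:
    def __init__(self) -> None:
        self.past_iter = 0  # taken count of the last completed loop execution
        self.cur_iter = 0   # taken count so far in the current execution
        self.conf = 0       # successive executions with the same taken count


class LoopPredictor:
    CONF_THRESHOLD = 4  # from the slide: same iteration count seen 4 times

    def __init__(self) -> None:
        self.entries: dict[int, LoopEntry] = {}  # keyed by pc (assumption)

    def predict(self, pc: int):
        """Return True/False when confident, None to defer to TAGE."""
        e = self.entries.get(pc)
        if e is None or e.conf < self.CONF_THRESHOLD:
            return None
        return e.cur_iter < e.past_iter  # taken until the count is reached

    def update(self, pc: int, taken: bool) -> None:
        e = self.entries.setdefault(pc, LoopEntry())
        if taken:
            e.cur_iter += 1
            if e.cur_iter > e.past_iter:
                e.conf = 0  # loop ran longer than recorded: stop trusting it
            return
        # Not taken: the loop exits, compare with the recorded count.
        if e.cur_iter == e.past_iter:
            e.conf = min(e.conf + 1, self.CONF_THRESHOLD)
        else:
            e.past_iter, e.conf = e.cur_iter, 1
        e.cur_iter = 0


print(256 * 52)  # 13312 bits of storage, i.e. 13 Kbits
```

After four executions of a loop whose branch is taken three times and then exits, `predict` returns True, True, True, False on the fifth execution; before that it returns None.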
Using a kernel history and a user history
• Traces mix user and kernel activities:
  • Kernel activity after exception
    • Global history pollution
• Solution: use two separate global histories
  • User history is updated only in user mode
  • Kernel history is updated in both modes
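A minimal sketch of the two histories follows. The update rules are the ones on the slide; selecting the kernel history for predictions made in kernel mode and the user history otherwise is an assumption, as are the class name `SplitHistory` and the `in_kernel` flag (how the mode is detected is not covered here).

```python
class SplitHistory:
    def __init__(self, length: int = 640) -> None:
        self.mask = (1 << length) - 1
        self.user = 0
        self.kernel = 0

    def update(self, taken: bool, in_kernel: bool) -> None:
        # Kernel history sees every branch, in both modes.
        self.kernel = ((self.kernel << 1) | int(taken)) & self.mask
        # User history is frozen while the kernel runs.
        if not in_kernel:
            self.user = ((self.user << 1) | int(taken)) & self.mask

    def current(self, in_kernel: bool) -> int:
        """History used to index the predictor (assumed selection rule)."""
        return self.kernel if in_kernel else self.user


h = SplitHistory(length=8)
for taken, in_kernel in [(True, False), (False, False),
                         (True, True), (True, True), (True, False)]:
    h.update(taken, in_kernel)
print(bin(h.current(False)), bin(h.current(True)))  # 0b101 0b10111
```

When execution returns to user mode, the user history is exactly what it was before the exception, so the kernel branches do not pollute it.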
L-TAGE submission accuracy
(distributed traces)
3.314 misp/KI
Reducing L-TAGE complexity
• Included 241.5 Kbits TAGE predictor: 3.368 misp/KI
• Loop predictor beneficial only on gzip:
  Might not be worth the extra complexity
Using fewer tables
• 8 components 256 Kbits TAGE predictor: 3.446 misp/KI
TAGE prediction computation time?
• 3 successive steps:
  • Index computation
  • Table read
  • Partial match + multiplexer
• Does not fit in a single cycle:
  • But can be ahead pipelined!
Ahead pipelining a global history branch predictor (principle)
• Initiate branch prediction X+1 cycles in advance to provide the prediction in time
  • Use information available:
    • X-block ahead instruction address
    • X-block ahead history
• To ensure accuracy:
  • Use intermediate path information
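The principle can be illustrated on a single table with a two-step lookup. This sketch assumes one particular organization: the early step uses the old address and history to read a group of adjacent entries, and the late step uses a few intermediate path bits to pick one of them. The hash, the table size, the number of path bits and the function names are all hypothetical.

```python
TABLE_BITS = 10        # 1024-entry table (assumption)
PATH_SELECT_BITS = 2   # intermediate path bits used for the late selection


def ahead_read(table: list[int], old_pc: int, old_history: int) -> list[int]:
    """Early step: with the X-block ahead address and history, read a
    group of 2**PATH_SELECT_BITS adjacent entries."""
    group = (old_pc ^ old_history) & ((1 << (TABLE_BITS - PATH_SELECT_BITS)) - 1)
    base = group << PATH_SELECT_BITS
    return table[base: base + (1 << PATH_SELECT_BITS)]


def late_select(candidates: list[int], path_bits: int) -> int:
    """Late step: intermediate path information picks one candidate."""
    return candidates[path_bits & ((1 << PATH_SELECT_BITS) - 1)]


table = list(range(1 << TABLE_BITS))  # entry i holds the value i, for clarity
cands = ahead_read(table, old_pc=0x3, old_history=0x1)
print(cands)                       # [8, 9, 10, 11]
print(late_select(cands, 0b10))    # 10
```

The slow table read starts before the intermediate blocks are known; only the final, fast selection depends on them.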
Practice
[Figure: consecutive blocks A, B, C, with labels "bc" and "Ha"]
Ahead pipelined TAGE: 4 parallel prediction computations
3-branch ahead pipelined
8 component 256 Kbits TAGE
3.552 misp/KI
A final case for the Geometric History Length predictors
• delivers state-of-the-art accuracy
• uses only global information:
  • Very long history: 300+ bits!!
• can be ahead pipelined
• many effective design points
  • OGEHL or TAGE
  • Nb of tables, history lengths
The End