A Penalty-Sensitive Branch Predictor

Yue Hu
David M. Koppelman
Lu Peng
Department of Electrical and Computer Engineering
Louisiana State University
1. Motivation
Typical branch predictors aim to decrease the misprediction rate (MR),
e.g. Two-level adaptive (Yeh & Patt), Neural (Vintan & Jimenez),
and LTAGE (Seznec).
However, performance can also be improved even if MR doesn’t decrease.
[Figure: timelines of Run 1 and Run 2, the same program on the same computer but with different branch predictors. Marked intervals show the time that a mispredicted branch is on the wrong path: low penalty (LP) or high penalty (HP).]
Why not favor HP branches to decrease their MR?
Even if total MR doesn't decrease, performance could still be improved.
2. Design Overview
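[Figure 1. Overall structure of our predictor: (1) a penalty predictor (inputs: PC and resolve cycles; output: penalty); (2) the main predictor, a two-class TAGE predictor (inputs: PC and history); (3) the assistant predictor, a loop predictor (input: PC). A "Loop enabled?" check (Yes/No) selects the source of the final prediction.]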
1: Predicts whether a branch is HP or LP;
2: Based on TAGE; favors HP branches while providing only
normal operation for LP branches;
3: Enabled only when beneficial.
2.1 Penalty Predictor
Penalty table: each entry has an 8-bit penalty counter (CNT) and a 1-bit penalty state (STA)
Initial: CNT = 0; STA = LP
Update with the observed penalty:
Penalty >= 120 cycles? Yes: CNT += 8; then, if CNT >= 192, STA = HP
No: if CNT == 0, STA = LP; otherwise CNT--
Once set, the high-penalty state persists for at least hundreds of executions,
so subsequent executions of the HP branch can benefit.
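The counter and state update above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the table size, the PC-modulo indexing, and the saturation of CNT at 255 are hypothetical, and checking CNT == 0 before decrementing is one reading of the flattened flowchart.

```python
from dataclasses import dataclass

LP, HP = 0, 1
PENALTY_THRESHOLD = 120  # cycles (from the slide)
CNT_STEP = 8             # increment on a high-penalty observation (from the slide)
HP_THRESHOLD = 192       # CNT value that switches STA to HP (from the slide)
CNT_MAX = 255            # assumption: the 8-bit counter saturates


@dataclass
class PenaltyEntry:
    cnt: int = 0   # 8-bit penalty counter (CNT)
    sta: int = LP  # 1-bit penalty state (STA)

    def update(self, penalty_cycles: int) -> None:
        if penalty_cycles >= PENALTY_THRESHOLD:
            self.cnt = min(self.cnt + CNT_STEP, CNT_MAX)
            if self.cnt >= HP_THRESHOLD:
                self.sta = HP
        elif self.cnt == 0:
            self.sta = LP
        else:
            self.cnt -= 1


class PenaltyPredictor:
    def __init__(self, num_entries: int = 1024) -> None:  # hypothetical size
        self.table = [PenaltyEntry() for _ in range(num_entries)]

    def _entry(self, pc: int) -> PenaltyEntry:
        # Hypothetical indexing: low bits of the PC.
        return self.table[pc % len(self.table)]

    def predict(self, pc: int) -> int:
        return self._entry(pc).sta

    def update(self, pc: int, penalty_cycles: int) -> None:
        self._entry(pc).update(penalty_cycles)
```

With these constants, 24 consecutive high-penalty observations (24 x 8 = 192) switch an entry to HP, and about 192 low-penalty observations are then needed before it returns to LP, which is in line with the "hundreds of executions" remark.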
2.2 Two-class TAGE Predictor
[Only rough idea]
Prediction:
Hash (History, PC) -> Index: points to one entry in each bank;
-> Tag: checks whether that entry hits (H) or misses (M);
Higher bank: longer history, wider tag -> more accurate
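[Figure: TAGE prediction. Bank 0 is a 2-bit bimodal predictor indexed by PC; Banks 1 to 6 are tagged banks, each indexed by a hash of history and PC. Each tagged entry holds a 3-bit prediction counter (pred), a 2-bit usefulness counter (U), and a [9-16]-bit tag, with wider tags in higher banks. Each bank is marked hit (H) or miss (M), with U0/U1/U2 showing entry usefulness, and the banks feed the final prediction.]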
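The lookup described above can be sketched as follows. The slide gives only a rough idea, so everything concrete here is hypothetical: the hash (an XOR fold), the bank size, the history lengths, and the tag widths. Taking the prediction from the hitting bank with the longest history is standard TAGE behavior; the alternate-prediction logic of a full TAGE is omitted.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TaggedEntry:
    tag: Optional[int] = None  # None marks a never-allocated entry
    pred: int = 0              # 3-bit counter in [-4, 3]; >= 0 predicts taken
    use: int = 0               # 2-bit usefulness counter (U)


class TageSketch:
    HIST_LENS = (4, 8, 16, 32, 64, 128)  # hypothetical, for Banks 1..6
    TAG_BITS = (9, 10, 12, 13, 15, 16)   # hypothetical widths within [9-16]
    LOG_ENTRIES = 10                     # hypothetical: 1024 entries per bank

    def __init__(self) -> None:
        size = 1 << self.LOG_ENTRIES
        self.bimodal = [2] * size  # Bank 0: 2-bit counters, 2 = weakly taken
        self.banks = [[TaggedEntry() for _ in range(size)]
                      for _ in self.HIST_LENS]

    @staticmethod
    def _fold(value: int, bits: int) -> int:
        """XOR-fold a non-negative integer down to `bits` bits."""
        mask = (1 << bits) - 1
        out = 0
        while value:
            out ^= value & mask
            value >>= bits
        return out

    def slots(self, pc: int, history: int) -> List[Tuple[int, int]]:
        """Return the (index, tag) pair for each tagged bank."""
        result = []
        for bank, (hlen, tbits) in enumerate(zip(self.HIST_LENS, self.TAG_BITS)):
            h = history & ((1 << hlen) - 1)  # keep the newest hlen history bits
            index = self._fold(pc ^ (h << 1) ^ bank, self.LOG_ENTRIES)
            tag = self._fold(pc ^ (h * 3), tbits)
            result.append((index, tag))
        return result

    def predict(self, pc: int, history: int) -> Tuple[bool, int]:
        """Return (taken, provider); provider 0 is the bimodal bank."""
        slots = self.slots(pc, history)
        for bank in reversed(range(len(self.banks))):  # longest history first
            index, tag = slots[bank]
            entry = self.banks[bank][index]
            if entry.tag == tag:  # hit (H)
                return entry.pred >= 0, bank + 1
        return self.bimodal[pc & ((1 << self.LOG_ENTRIES) - 1)] >= 2, 0
```

In this sketch, list position 0 of `banks` corresponds to Bank 1 on the slide, which is why `predict` reports the provider as `bank + 1`.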
Update:
[Figure: TAGE update after a misprediction (mispred), over Banks 0 to 6 with hit (H), miss (M), and usefulness (U0/U1/U2) marks. Banks whose entry is occupied are not used. The first allocation goes to a bank with a U0 entry; the second allocation, for HP only, goes to another bank with a U0 entry.]
New entries are allocated in higher banks on a misprediction.
LP: only one entry is allocated;
HP: a second entry is allocated, subject to two limitations:
1. The bank has a useless entry;
2. The last two allocations in the bank were
one-entry allocations.
With these limits, HP’s double-entry allocation doesn’t
harm LP’s allocation too much.
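The allocation rule above can be sketched as follows. This is one reading of the slide, with stated assumptions: each `Bank` object stands for the entry this branch maps to in that bank, the per-bank record of the last two allocations (`recent_double`) is a hypothetical bookkeeping structure, and it starts as if both earlier allocations were one-entry allocations. What happens when no useless entry exists is not covered.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List


@dataclass
class Bank:
    use: int = 0  # usefulness (U) of the entry this branch maps to
    # True where one of the last two allocations in this bank was part of a
    # double-entry allocation (hypothetical bookkeeping).
    recent_double: Deque[bool] = field(
        default_factory=lambda: deque([False, False], maxlen=2))


def allocate(banks: List[Bank], provider: int, is_hp: bool) -> List[int]:
    """Return the banks that receive a new entry after a misprediction.

    `provider` is the position of the bank that supplied the prediction;
    only banks above it are candidates.
    """
    chosen: List[int] = []
    for b in range(provider + 1, len(banks)):
        bank = banks[b]
        if bank.use != 0:
            continue  # occupied by a useful entry: not used
        if not chosen:
            chosen.append(b)  # first allocation, for LP and HP alike
            if not is_hp:
                break
        elif not any(bank.recent_double):
            chosen.append(b)  # second allocation, HP only
            break
    was_double = len(chosen) == 2
    for b in chosen:
        banks[b].recent_double.append(was_double)
    return chosen
```

For example, with usefulness values (2, 1, 0, 1, 0, 0) and the provider at position 0, an LP branch allocates only at position 2, while an HP branch allocates at positions 2 and 4. Repeating the HP allocation moves the second entry to position 5 and then stops allocating second entries, which is how limitation 2 keeps double-entry allocation from crowding a bank.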
Two cases for U0:
1. The entry itself has not been useful recently, if ever;
2. The entry is newly allocated and its usefulness hasn’t been established yet.
Double-entry allocation favors HP branches so that their new
entries can survive longer and establish their usefulness.
3. Performance Analysis
3.1 Penalty Predictor
[Figure: per-benchmark percentage chart (y-axis from -10 to 100%) over CL01-CL16, INT01-INT06, MM01-MM07, SER01-SER05, WS01-WS06, and Average.]
1. Predicted to be HP (50.2%); covers 98.7% of actual HP;
2. Among all branches, actual HP (27%);
3. Predicted LP but turned out to be HP (1.3%).
Average penalty of branches predicted LP: 121 cycles;
predicted HP: 212 cycles
3.2 Two-class TAGE predictor
[Figure: MR of LTAGE vs. PSLTAGE at storage budgets of 8K, 16K, 32K, 64K, 128K, and 256K, in two panels: low-penalty branches and high-penalty branches. The MR axis runs from 0.03 to 0.039. Each budget is annotated with the MR change of PSLTAGE relative to LTAGE; the changes for high-penalty branches are all negative.]
1. The MR of HP branches is about 10% higher than that of LP branches
(e.g. loop branches; branches with cache misses);
2. The Penalty-Sensitive (PS) method effectively favors HP branches;
3. 64KB: HP, -6E-5; LP, +3E-5.
Overall, it is beneficial.
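To see why a small MR gain on HP branches can outweigh a smaller MR loss on LP branches, the 64KB deltas can be weighted by the average penalties. This is an illustrative back-of-the-envelope sketch, not a result from the slides. It assumes that each class's MR is measured per branch of that class, that the predicted-HP share (50.2%) is the class weight, and that the average penalties (212 and 121 cycles) apply to the added or removed mispredictions.

```python
# Figures quoted on the slides.
FRACTION_HP = 0.502   # share of branches predicted HP
PENALTY_HP = 212      # average penalty of predicted-HP branches, in cycles
PENALTY_LP = 121      # average penalty of predicted-LP branches, in cycles
DELTA_MR_HP = -6e-5   # MR change for HP branches at 64KB
DELTA_MR_LP = +3e-5   # MR change for LP branches at 64KB


def delta_penalty_cycles_per_branch(fraction_hp: float = FRACTION_HP) -> float:
    """Estimated change in misprediction cycles per executed branch.

    Assumption: per-class MR deltas scale with the class share and with the
    class's average penalty; negative values mean fewer wasted cycles.
    """
    hp_term = fraction_hp * DELTA_MR_HP * PENALTY_HP
    lp_term = (1.0 - fraction_hp) * DELTA_MR_LP * PENALTY_LP
    return hp_term + lp_term


if __name__ == "__main__":
    print(f"{delta_penalty_cycles_per_branch():+.6f} cycles per branch")
```

Under these assumptions the HP term is more than three times the size of the LP term, so the net change in wasted cycles is negative even though LP accuracy gets slightly worse.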
4. Summary
Our penalty-sensitive branch predictor works.
Penalty predictor: 50.2% of branches predicted HP; covers 98.7% of actual HP
Average penalty (HP vs. LP = 212 : 121 cycles)
Two-class TAGE predictor: favors HP branches; globally beneficial,
but limited
Limited favoring mechanism:
Double-entry allocation for HP branches increases the chance that
their new entries survive long enough to establish usefulness.
Future: a more helpful favoring mechanism is needed
Conclusion:
1. Mispredicted HP branches are more harmful;
2. Even if total MR doesn’t decrease, performance could still be
improved by favoring HP branches;
3. The approach can be applied to any predictor once an effective
favoring mechanism is found.
Thanks!
Backup Slides
Penalty Predictor
[Figure: average penalty in cycles per benchmark (CL01-CL16, INT01-INT06, MM01-MM07, SER01-SER05, WS01-WS06, Average) for two series, Lo_AvgPen and Hi_AvgPen. The y-axis runs from 0 to 300; two value labels, 317 and 1830, exceed the axis range.]
Two-class TAGE predictor
[Figure: MR of LTAGE vs. PSLTAGE at storage budgets of 8K, 16K, 32K, 64K, 128K, and 256K, in two panels: low-penalty branches and high-penalty branches. The high-penalty panel highlights two MR changes: -6E-5 and -4.7E-4.]
(-6E-5) / (-4.7E-4) = 12.8%
On HP branches, Penalty-Sensitive achieves 12.8% of the MR improvement
that doubling the storage budget would achieve.
Loop Predictor
[Figure: MPPKI per benchmark (Client01-Client16, int01-int06, mm01-mm07, server01-server05, ws01-ws06, Average) for PSTAGE (without loop) and PSLTAGE. The y-axis runs from 0 to 1000, with the average MPPKI normalized to 1000. Value labels, read as PSTAGE/PSLTAGE pairs: 3673/3596, 1643/1643, 2208/2208, 2839/2839, 8204/7920, 6624/6630, 2592/2515, and 1000/987 for the average.]
1.3% improvement with only 0.53KB: very efficient.