
Superscalar Processors
Prof. Mateo Valero
Las Palmas de Gran Canaria
November 26, 1999
Initial developments
• Mechanical machines
• 1854: Boolean algebra by G. Boole
• 1904: Diode vacuum tube by J.A. Fleming
• 1945: Stored program by J. von Neumann
• 1946: ENIAC by J.P. Eckert and J. Mauchly
• 1949: EDSAC by M. Wilkes
• 1952: UNIVAC I and IBM 701
M. Valero
2
ENIAC 1946
EDSAC 1949
Pipeline
Superscalar Processor

[Figure: superscalar pipeline: Fetch, Decode, Rename, Instruction Window, Wakeup+select, Register file, Bypass, Data Cache.]
Fetch of multiple instructions every cycle.
Register renaming to eliminate false (name) dependencies.
Instructions wait for source operands and for functional units.
Out-of-order execution, but in-order graduation.
Scalable Pipes
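The renaming step described above can be sketched in a few lines of Python. This is an illustrative sketch only (register names and the map-table layout are invented for the example), not the actual hardware algorithm:

```python
# Minimal register-renaming sketch: each write to an architectural
# register gets a fresh physical register, removing WAR/WAW (name)
# dependencies so only true (RAW) dependencies remain.
def rename(instructions, num_logical=32):
    map_table = {f"r{i}": f"p{i}" for i in range(num_logical)}
    next_phys = num_logical
    renamed = []
    for dest, srcs in instructions:          # (dest, [sources])
        srcs = [map_table[s] for s in srcs]  # read current mappings
        new_dest = f"p{next_phys}"           # fresh physical register
        next_phys += 1
        map_table[dest] = new_dest
        renamed.append((new_dest, srcs))
    return renamed

# Two writes to r1: the second gets a different physical register,
# so the WAW dependency between them disappears.
prog = [("r1", ["r2"]), ("r1", ["r3"])]
print(rename(prog))
```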
Technology Trends and Impact
[Figure: delay in psec of the Bypass, Wakeup+select, and Rename logic for 0.80, 0.35, and 0.18 micron technologies; two configurations: Issue Width = 4 / ROB Size = 32 and Issue Width = 8 / ROB Size = 64.]
S. Palacharla et al., "Complexity Effective…". ISCA 1997, Denver.
Physical Scalability

[Figure: percentage of die reachable in 1, 2, 4, 8, and 16 clocks versus processor generation (0.25, 0.18, 0.13, 0.1, 0.08, and 0.06 microns).]
Doug Matzke, "Will Physical Scalability…". IEEE Computer, Sept. 1997, pp. 37-39.
Register influence on ILP
• 8-way fetch/issue
• Window of 256 entries
• Up to 1 taken branch per cycle
• G-share, 64K entries
• One-cycle latency
• Spec95

[Figure: IPC versus register file size (48, 64, 96, 128, 160, 192, 224, 256) for SpecInt95 and SpecFP95.]
Register File Latency
[Figure: IPC with 1-cycle versus 2-cycle register file latency for SpecInt95 (compress, gcc, go, ijpeg, li, m88ksim, perl, vortex, Hmean) and SpecFP95 (applu, apsi, fpppp, hydro2d, mgrid, su2cor, swim, tomcatv, turb3d, wave5, Hmean).]
– 66% and 20% performance improvement when moving from 2-cycle to 1-cycle latency
Outline
• Virtual-physical registers
• A register file cache
• VLIW architectures
Virtual-Physical Registers
• Motivation
– Conventional renaming scheme: the physical register is allocated at decode/rename but holds no value until write-back (register unused)
– Virtual-Physical Registers: a physical register is occupied only while it actually holds a value (register used)

[Figure: instruction lifetime from Icache through Decode&Rename to Commit, showing the unused-register interval under conventional renaming.]
Example
Original code:                 After renaming:
load  f2, 0(r4)                load  p1, 0(r4)
fdiv  f2, f2, f10              fdiv  p2, p1, p10
fmul  f2, f2, f12              fmul  p3, p2, p12
fadd  f2, f2, 1                fadd  p4, p3, 1

Latencies: cache miss 20, fdiv 20, fmul 10, fadd 5 cycles

– Register pressure (average registers in use per cycle)
• Conventional: 3.6
• Virtual-Physical: 0.7

[Figure: allocation intervals of p1-p4 over cycles 0-55 under both schemes.]
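The register-pressure metric above can be computed from allocation intervals. A minimal sketch; the cycle values below are hypothetical lifetimes chosen to illustrate the early-versus-late allocation contrast, not the slide's exact data:

```python
# Average register pressure: mean number of physical registers that
# are simultaneously allocated per cycle over an interval.
def avg_pressure(intervals, total_cycles):
    # intervals: list of (alloc_cycle, release_cycle), half-open
    busy = sum(rel - alloc for alloc, rel in intervals)
    return busy / total_cycles

# Hypothetical lifetimes for a 4-instruction example: conventional
# renaming allocates at decode (early), virtual-physical only at
# write-back (late), so registers are held for far fewer cycles.
conventional = [(0, 25), (1, 45), (2, 55), (3, 55)]
virtual_physical = [(20, 25), (40, 45), (50, 55), (54, 55)]
print(avg_pressure(conventional, 55))       # high pressure
print(avg_pressure(virtual_physical, 55))   # much lower
```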
Percentage of Used/Wasted Registers
[Figure: percentage of used versus wasted registers for SpecInt95 (compress, gcc, go, li, perl, Hmean) and SpecFP95 (applu, hydro2d, mgrid, swim, tomcatv, Hmean).]
Virtual-Physical register
• Physical registers play two different roles
– Keep track of dependences (decode)
– Provide a storage location for results (writeback)
• Proposal: Three types of registers
– Logical: Architected registers
– Virtual-Physical (VP): Keep track of
dependences
– Physical: Store values
• Approach
– Decode: rename from logical to VP
– Write-back (or issue): rename from VP to
physical
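The decode/write-back split above can be sketched as two map tables. This is an illustrative sketch (the class and table names are invented), not the hardware design:

```python
# Two-level renaming sketch for virtual-physical registers:
# decode maps logical -> VP tag (tracks dependencies only);
# write-back maps VP tag -> physical register (storage).
class VPRename:
    def __init__(self, num_phys):
        self.general_map = {}          # logical -> VP tag
        self.phys_map = {}             # VP tag -> physical register
        self.free_phys = list(range(num_phys))
        self.next_vp = 0

    def decode(self, logical_dest):
        vp = f"vp{self.next_vp}"       # no physical register consumed yet
        self.next_vp += 1
        self.general_map[logical_dest] = vp
        return vp

    def writeback(self, vp):
        if not self.free_phys:
            return None                # no free register: re-execute later
        p = self.free_phys.pop(0)      # allocate storage only at write-back
        self.phys_map[vp] = p
        return p

r = VPRename(num_phys=2)
tag = r.decode("f2")
print(tag, r.writeback(tag))
```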
Virtual-Physical Registers
• Hardware support

[Figure: pipeline stages Fetch, Decode, Issue, Write-back, Commit. At decode, the General Map Table renames logical registers (Lreg) to virtual-physical tags (VPreg); instructions wait in the instruction queue with their source tags (Src1, Src2). At write-back, the Physical Map Table maps each VPreg to a physical register (Preg); the ROB tracks the mappings until commit.]
Virtual-Physical Registers
• No free physical register
– Re-execute later… but what if it is the oldest instruction?
– Avoiding deadlock
• A number (NRR) of registers is reserved for the oldest instructions
• 21% speedup for Spec95 on an 8-way issue machine [HPCA-4]
– Conclusions
• Optimal NRR is different for each program
• For a given program, the best NRR may be different for different sections of code
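The NRR reservation can be sketched as an allocation check. A hypothetical sketch (the function name and argument conventions are invented); the idea is only that younger instructions must leave NRR registers free:

```python
# Deadlock-avoidance sketch: NRR physical registers are reserved for
# the NRR oldest in-flight instructions, so the oldest instruction
# can always eventually obtain a register and commit.
def may_allocate(free_regs, age_rank, nrr):
    # age_rank: 0 = oldest in-flight instruction
    if age_rank < nrr:
        return free_regs > 0           # reserved pool: oldest always eligible
    return free_regs > nrr             # younger ones must leave NRR free

print(may_allocate(free_regs=3, age_rank=0, nrr=4))   # oldest: allowed
print(may_allocate(free_regs=3, age_rank=10, nrr=4))  # young: must wait
```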
Virtual-Physical Registers
– Performance evaluation
• SimpleScalar OoO with modified renaming
• 8-way issue
• RUU: 128 entries
• FU (latency): 8 simple int. (1), 4 int. mult. (7), 6 simple FP (4), 4 FP mult. (4), 4 FP div. (16), 4 mem ports
• L1 Dcache: 32 KB, 2-way, 32 B/line, 1 cycle
• L1 Icache: 32 KB, 2-way, 64 B/line, 1 cycle
• L2 cache: 1 MB, 2-way, 64 B/line, 12 cycles
• Main memory: 50 cycles
• Branch prediction: 18-bit Gshare, 2 taken branches per cycle
• Benchmarks: SPEC95, Compaq/DEC compilers -O5
Virtual-Physical Registers
– Performance evaluation

[Figure: % speedup over conventional renaming for 64 registers; per-benchmark bars (compress, gcc, go, li, perl, applu, hydro2d, mgrid, swim, tomcatv) with values ranging from 0 to 42.]
IPC and NRR
[Figure: IPC versus NRR (1, 4, 8, 16, 24, 36) for li and applu.]
Virtual-Physical Registers
• What is the optimal allocation policy?
– Approximation
• Registers should be allocated to the instructions that can use them earliest (avoid unused registers)
• If some instruction must stall because of the lack of registers, stall the latest instructions (delaying the earliest would also delay the commit of the latest)
– Implementation
• Each instruction allocates a physical register at write-back. If none is available, it steals the register from the latest instruction after the current one
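The stealing rule can be sketched as follows. An illustrative sketch only (the function shape and sequence-number convention are invented); it shows the policy of taking from the youngest later instruction:

```python
# Allocation-policy sketch: at write-back an instruction takes a free
# physical register; if none is free, it steals the register held by
# the latest (youngest) instruction after itself, which must then
# redo its write-back later.
def allocate(free_regs, holders, my_seq):
    # holders: {seq_number: register} for instructions that already
    # obtained a register at write-back; lower seq = older
    if free_regs:
        return free_regs.pop(), None
    youngest = max(holders)
    if youngest > my_seq:              # steal only from a later instruction
        return holders.pop(youngest), youngest
    return None, None                  # stall: all holders are older

reg, victim = allocate([], {3: "p7", 9: "p2"}, my_seq=5)
print(reg, victim)                     # steals from instruction 9
```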
DSY Performance
[Figure: IPC of conventional, vp-original, and vp-dsy for SpecInt95 (compress, gcc, go, li, perl, Hmean) and SpecFP95 (applu, swim, hydro2d, mgrid, tomcatv, Hmean).]
Performance and Number of Registers
[Figure: IPC versus number of registers (48, 64, 80, 96, 128, 160) for conventional, vp-original, and vp-dsy; SpecInt95 and SpecFP95.]
Outline
• Virtual-physical registers
• A register file cache
• VLIW architectures
Register Requirements
[Figure: cumulative percentage of register requirements versus number of registers (0-32) for SpecFP95 and SpecInt95, distinguishing values needed by any instruction from values needed by ready instructions.]
Register File Latency
[Figure: IPC with 1-cycle versus 2-cycle register file latency for SpecInt95 and SpecFP95 benchmarks (same data as the earlier Register File Latency slide).]
– 66% and 20% performance improvement when moving from 2-cycle to 1-cycle latency
Register File Bypass
[Figure: SpecInt95 IPC for 1-cycle/1-bypass level, 2-cycle/2-bypass levels, and 2-cycle/1-bypass level register files (compress, gcc, go, ijpeg, li, m88ksim, perl, vortex, Hmean).]
Register File Bypass
[Figure: SpecFP95 IPC for 1-cycle/1-bypass level, 2-cycle/2-bypass levels, and 2-cycle/1-bypass level register files (applu, apsi, fpppp, hydro2d, mgrid, su2cor, swim, tomcatv, turb3d, wave5, Hmean).]
Register File Cache
• Organization
– Bank 1 (Register File, RF)
• All registers (128)
• 2-cycle latency
– Bank 2 (Register File Cache, RFC)
• A subset of registers (16)
• 1-cycle latency
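The two-bank organization can be sketched directly. An illustrative sketch (class name, sizes as arguments, and the caching decision passed in by the caller are assumptions); the FIFO replacement matches the first caching policy described later:

```python
# Register file cache sketch: a small upper bank (RFC, 1-cycle) in
# front of the full register file (RF, 2-cycle). Reads hit the RFC
# when the value was cached; otherwise they pay the RF latency.
class RegFileCache:
    def __init__(self, rfc_size=16):
        self.rf = {}                    # all registers live here
        self.rfc = {}                   # small subset, 1-cycle access
        self.rfc_size = rfc_size

    def write(self, reg, value, cache_it):
        self.rf[reg] = value
        if cache_it:
            if len(self.rfc) >= self.rfc_size:
                self.rfc.pop(next(iter(self.rfc)))   # FIFO replacement
            self.rfc[reg] = value

    def read(self, reg):
        if reg in self.rfc:
            return self.rfc[reg], 1     # (value, latency in cycles)
        return self.rf[reg], 2

f = RegFileCache()
f.write("p1", 42, cache_it=True)
f.write("p2", 7, cache_it=False)
print(f.read("p1"), f.read("p2"))       # RFC hit vs. RF access
```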
Experimental Framework
– OoO simulator
• 8-way issue/commit
• Functional units (latency): 2 simple integer (1), 3 complex integer (mult. 2, div. 14), 4 simple FP (2), 2 FP div. (14), 3 branch (1), 4 load/store
• 128-entry ROB
• 16-bit Gshare
– Icache and Dcache
• 64 KB
• 2-way set-associative
• 1/8-cycle hit/miss
• Dcache: lockup-free, 16 outstanding misses
– Benchmarks
• Spec95
• DEC compiler -O4 (int.), -O5 (FP)
• 100 million instructions after initialization
– Access time and area models
• Extension to Wilton & Jouppi models
Caching Policy (1 of 3)
• First policy
– Many values (85% int and 84% FP) are used at most once
– Thus, only non-bypassed values are cached
– FIFO replacement
Performance
[Figure: IPC of RFC.1 versus 1-cycle and 2-cycle register files for SpecInt95 and SpecFP95 benchmarks.]
– 20% and 4% improvement over 2-cycle
– 29% and 13% degradation over 1-cycle
Caching Policy (2 of 3)
• Second policy
– Cache values that are sources of any non-issued instruction with all its operands ready
• Not issued because of lack of functional units
• or, the other operand is in the main register file
Performance
[Figure: IPC of RFC.2 versus 1-cycle and 2-cycle register files for SpecInt95 and SpecFP95 benchmarks.]
– 24% and 5% improvement over 2-cycle
– 25% and 12% degradation over 1-cycle
Caching Policy (3 of 3)
• Third policy
– Cache values that are sources of any non-issued instruction with all its operands ready
– Prefetching
• A table that, for each physical register, indicates the other operand of the first instruction that uses it
– Replacement: give priority to those values already read at least once
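The prefetch table can be sketched from the renamed instruction stream. A hypothetical sketch (the builder function and its input shape are invented for illustration):

```python
# Caching-policy sketch (third policy): record, for each physical
# register, the other source operand of the first instruction that
# uses it; when one operand arrives, the other can be prefetched
# from the main register file into the RFC.
def build_prefetch_table(instructions):
    table = {}
    for _dest, srcs in instructions:     # (dest, [src1, src2])
        if len(srcs) == 2:
            a, b = srcs
            table.setdefault(a, b)       # first user of each operand wins
            table.setdefault(b, a)
    return table

prog = [("p5", ["p1", "p2"]), ("p6", ["p1", "p3"])]
print(build_prefetch_table(prog))        # p1 pairs with p2 (its first use)
```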
Performance
[Figure: IPC of RFC.3 versus 1-cycle and 2-cycle register files for SpecInt95 and SpecFP95 benchmarks.]
– 27% and 7% improvement over 2-cycle
– 24% and 11% degradation over 1-cycle
Speed for Different RFC Architectures
Taking access time into account

[Figure: SpecInt95 speed for RFC architectures C1-C4, comparing 1-cycle, 2-cycle with one bypass, and non-bypass caching + prefetch-first-pair.]
Speed for Different RFC Architectures
[Figure: SpecFP95 speed for RFC architectures C1-C4, comparing 1-cycle, 2-cycle with one bypass, and non-bypass caching + prefetch-first-pair.]
Conclusions
– Register file access time is critical
– Virtual-physical registers significantly
reduce the register pressure
• 24% improvement for SpecFP95
– A register file cache can reduce the average
access time
• 27% and 7% improvement for a two-level,
locality-based partitioning architecture
High performance instruction fetch through a software/hardware cooperation
Alex Ramirez
Josep Ll. Larriba-Pey
Mateo Valero
UPC-Barcelona
Superscalar Processor

[Figure: superscalar pipeline: Fetch, Decode, Rename, Instruction Window, Wakeup+select, Register file, Bypass, Data Cache.]
Fetch of multiple instructions every cycle.
Register renaming to eliminate false (name) dependencies.
Instructions wait for source operands and for functional units.
Out-of-order execution, but in-order graduation.
J.E. Smith and S. Vajapeyam, "Trace Processors…". IEEE Computer, Sept. 1997, pp. 68-74.
Motivation
[Figure: instruction fetch & decode feeding instruction queue(s) and instruction execution, with branch/jump outcomes fed back to the fetch stage.]
• Instruction Fetch rate important not only in steady
state
– Program start-up
– Misspeculation points
– Program segments with little ILP
Motivation
• Instruction fetch effectively limits the performance
of superscalar processors
– Even more relevant at program startup points
• More aggressive processors need higher fetch
bandwidth
– Multiple basic block fetching becomes necessary
• Current solutions need extensive additional
hardware
– Branch address cache
– Collapsing buffer: multi-ported cache
– Trace cache: special purpose cache
PostgreSQL
64KB I1, 64KB D1, 256KB L2

[Figure: Postgres fetch performance across fetch configurations (32KB, 64KB, F4, F8, F16, PBr, Pic, Bw4, Bw8, Bw16, PF-, PF4).]
Programs Behaviour
64KB I1, 64KB D1, 256KB L2

[Figure: fetch performance of Postgres, Gcc, and Vortex across the same fetch configurations.]
The Fetch Unit (1 of 3)
• Scalar Fetch Unit
– Few instructions per cycle
– 1 branch
• Limitations
– Prediction accuracy
– I-cache miss rate
• Previous work: code reordering (software, reduce cache misses)
– Fisher (IEEE Tr. on Comp. '81)
– Hwu and Chang (ISCA'89)
– Pettis and Hansen (Sigplan'90)
– Torrellas et al. (HPCA'95)
– Kalamatianos et al. (HPCA'98)

[Figure: scalar fetch unit: branch prediction mechanism and instruction cache (i-cache) feed next-address logic and shift & mask, producing instructions to decode and the next fetch address.]
The Fetch Unit (2 of 3)
• Aggressive Fetch Unit
– Many instructions per cycle
– Several branches
• Limitations
– Prediction accuracy
– Sequentiality
– I-cache miss rate
• Previous work: trace building (hardware, form traces at run time)
– Yeh et al. (ICS'93)
– Conte et al. (ISCA'95)
– Rotenberg et al. (MICRO'96)
– Friendly et al. (MICRO'97)

[Figure: aggressive core fetch unit: instruction cache (i-cache), branch target buffer, return address stack, and multiple-branch predictor feed next-address logic and shift & mask.]
Trace Cache
A trace is a sequence of logically contiguous instructions.
A trace cache line stores a segment of the dynamic instruction trace across multiple, potentially taken branches (b1-b2-b4, b1-b3-b7, …).
It is indexed by fetch address and branch outcomes.
History-based fetch mechanism.

[Figure: control-flow graph with blocks b0-b8 illustrating alternative traces.]
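The index scheme above can be sketched as a dictionary keyed by address plus branch outcomes. An illustrative sketch (the class and its methods are invented; real trace caches use set-associative arrays and partial-matching):

```python
# Trace-cache lookup sketch: a trace line is indexed by the fetch
# address plus the predicted outcomes of the branches inside it,
# so b1-b2-b4 and b1-b3-b7 occupy different entries.
class TraceCache:
    def __init__(self):
        self.lines = {}

    def key(self, fetch_addr, outcomes):
        return (fetch_addr, tuple(outcomes))   # outcomes: taken bits

    def fill(self, fetch_addr, outcomes, trace):
        self.lines[self.key(fetch_addr, outcomes)] = trace

    def lookup(self, fetch_addr, outcomes):
        return self.lines.get(self.key(fetch_addr, outcomes))

tc = TraceCache()
tc.fill(0x400, [True, False], ["b1", "b2", "b4"])
print(tc.lookup(0x400, [True, False]))   # same path: hit
print(tc.lookup(0x400, [True, True]))    # different path: miss
```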
The Fetch Unit (3 of 3)
The trace cache aims at forming traces at run time.

[Figure: the aggressive core fetch unit extended with a trace cache (t-cache) and a fill buffer; the fill buffer forms traces at run time from fetch or commit, and the trace cache competes with the i-cache to supply instructions to decode.]
Our Contribution
• Mixed software-hardware approach
– Optimize performance at compile-time
• Use profiling information
• Make optimum use of the available hardware
– Avoid redundant work at run-time
• Do not repeat what was done at compile-time
• Adapt hardware to the new software
• Software Trace Cache
– Profile-directed code reordering & mapping
• Selective Trace Storage
– Fill Unit modification
Our Work
• Workload analysis
– Temporal locality
– Sequentiality
• Software Trace Cache
– Seed selection
– Trace building
– Trace mapping
– Results
• Selective Trace Storage
– Counting blue traces
– Implementation
– Results

[Figure: FIPA for gcc, li, and postgres with a 32KB instruction cache and 64KB trace cache, comparing Base, TC, STC, and STS.]
Workload Analysis (Reference Locality)
Benchmark   Dynamic references          Code size
            75%     90%     99%
swim        148     232     763         110350
hydro2d     1223    1977    5371        125946
applu       2407    5060    10509       132803
m88ksim     458     1006    2863        51341
li          325     563     1365        38126
gcc         9595    22098   57878       349382
compress    243     338     525         21991
postgres    2716    5221    11748       374399

• Considerable amount of reference locality
Workload Analysis (Sequentiality)
Benchmark   Unpredictable   Predictable
swim        45.3            54.7
mgrid       19.9            81.1
apsi        22.1            77.9
m88ksim     37.3            62.7
li          49.2            50.8
gcc         60.1            39.9
ijpeg       70.2            29.8
postgres    23.8            76.2

Predictable: fall-through, unconditional branches, conditional branches with fixed behaviour, subroutine calls.
Unpredictable: loop branches, indirect jumps, subroutine returns, unpredictable conditional branches.
Software Trace Cache
• Profile directed code reordering
– Obtain a weighted control flow graph
– Select seeds or starting basic blocks
– Build basic block traces
• Map dynamically consecutive basic blocks to physically
contiguous storage
• Move unused basic blocks out of the execution path
– Carefully map these traces in memory
• Avoid conflict misses in the most popular traces
• Minimize conflicts among the rest
• Increased role of the instruction cache
– Able to provide longer instruction traces
STC : Seed Selection
• All procedure entry points
– Ordered by popularity
– Starts building traces on the most popular procedures
• Knowledge based selection
– Based on source code knowledge
– Leads to longer sequences
• Inlining of the main path of found procedures
– Loses temporal locality
• Less popular basic blocks surround the most popular ones
STC : Trace Building
• Greedy algorithm
– Follow the most likely path out of a basic block
– Add secondary seeds for all other targets
• Two threshold values
– Branch (transition) threshold: do not follow unlikely transitions
– Execution threshold: do not include unpopular basic blocks
• Iterate the process with less restrictive thresholds

[Figure: weighted control-flow graph (blocks A1-A8, B1, C1-C5) with execution counts and transition probabilities, showing secondary seeds marked "valid, visit later" and the branch and execution thresholds in action.]
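The greedy algorithm with its two thresholds can be sketched as follows. An illustrative sketch under assumptions: the CFG encoding, counts, and example numbers below are invented (loosely inspired by the figure), and secondary-seed collection and iteration are omitted:

```python
# Greedy trace-building sketch with the two thresholds from the
# slide: do not follow a successor whose transition probability is
# below branch_threshold, and do not include a basic block whose
# execution count is below exec_threshold.
def build_trace(cfg, counts, seed, branch_threshold, exec_threshold):
    # cfg: {block: [(successor, probability), ...]}
    trace, block = [], seed
    while block is not None and block not in trace:
        if counts[block] < exec_threshold:
            break                         # unpopular block: stop the trace
        trace.append(block)
        succs = sorted(cfg.get(block, []), key=lambda s: -s[1])
        block = None
        if succs and succs[0][1] >= branch_threshold:
            block = succs[0][0]           # follow the most likely path

    return trace

cfg = {"A1": [("B1", 0.9), ("A2", 0.1)], "B1": [("C1", 0.55), ("A5", 0.45)]}
counts = {"A1": 10, "B1": 30, "C1": 20, "A5": 4}
print(build_trace(cfg, counts, "A1", branch_threshold=0.5, exec_threshold=5))
```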
STC : Trace Mapping
[Figure: i-cache layout: the most popular traces are mapped into a reserved area (CFA) of the i-cache, the least popular traces elsewhere, with no code mapped in between; the CFA is sized relative to the i-cache.]
I-cache Miss Rate
[Figure: core fetch unit (i-cache, BTB, RAS, BP, next-address logic, exchange/shift & mask) with an i-cache/CFA split.]

[Table: i-cache miss rates for 8KB, 32KB, and 64KB i-caches with several CFA sizes, comparing the Base layout (6.5 / 2.7 / 1.4) with the P&H, Torrellas, and Auto code layouts, an Ops layout, and 2-way and victim cache organizations; only partially recoverable from the transcript.]
Fetch Bandwidth
[Figure: core fetch unit (i-cache, BTB, RAS, BP, next-address logic, exchange/shift & mask) with an i-cache/CFA split.]

[Table: fetch bandwidth (with an IDEAL reference) for 8KB, 32KB, and 64KB i-caches with several CFA sizes, comparing the Base layout with the P&H, Torrellas, and Auto code layouts, an Ops layout, and 16KB trace caches; only partially recoverable from the transcript.]
STC : Results
32KB instruction cache, 64KB trace cache

[Figure: FIPC for gcc, li, and postgres under Base, STC, TC, and S/HTC (values ranging from 2.2 to 5.64).]
STC: Conclusions
• STC increases the role of the core fetch unit
– Build traces at compile-time
• Increases code sequentiality
– Map them carefully in memory
• Reduces instruction cache miss rate
• Increased core fetch unit performance
– Trace cache-like performance with no additional
hardware cost
• Compile-time solution
or ...
– Optimum results with a small supporting trace cache
• Better fail-safe mechanism on a trace cache miss
Selective Trace Storage
• The STC constructed traces at compile time
– Blue traces
• Built at compile-time
• Traces containing only consecutive instructions
• May be provided by the instruction cache in a single cycle
– Red traces
• Built at run-time
• Traces containing taken branches
• Can be provided by the trace cache in a single cycle
• Blue traces need not be stored in the trace cache
– Better usage of the storage space
• Better performance with same cost
• Equivalent performance at lower cost
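The blue/red distinction above reduces to a simple fill-unit predicate. An illustrative sketch (the function and trace encoding are invented for the example):

```python
# Selective Trace Storage sketch: the fill unit stores a trace in the
# trace cache only if it is "red" (contains at least one taken
# branch); "blue" traces are purely sequential, and the instruction
# cache can already provide them in a single cycle.
def should_store(trace):
    # trace: list of (instruction, taken_branch) pairs
    return any(taken for _inst, taken in trace)

blue = [("i1", False), ("i2", False), ("i3", False)]
red = [("i1", False), ("br", True), ("i5", False)]
print(should_store(blue), should_store(red))
```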
STS: Counting Blue Traces
[Figure: breakdown of traces by number of sequence breaks (0, 1, 2, 3+) for gcc, li, and postgres, original versus STC-reordered code. Reordering reduces the number of breaks; there is a high degree of redundancy even in the original code.]
STS: Implementation
[Figure: fetch unit (BTB, return address stack, multiple-branch predictor, next-address logic, exchange/shift & mask) with the fill unit filtering: blue (redundant) traces are filtered out in the fill unit, and only red trace components are stored in the trace cache.]
STS: FIPA - Realistic Branch Predictor

[Figure: FIPA for Gcc, Li, and Postgres across several i-cache/trace-cache configurations, with and without STC reordering and STS filtering.]

STS: FIPC - Realistic BP - 64KB i-cache

[Figure: FIPC for Gcc, Li, and Postgres across the same configurations.]

STS: FIPA - Perfect Branch Predictor

[Figure: FIPA for Gcc, Li, and Postgres across the same configurations.]
STS: Conclusions
• Minor hardware modification
– Filter out blue traces in the fill unit
• Avoid redundant run-time work
• Better usage of the storage space
– Higher performance with the same cost
– Equivalent performance at much lower cost
• Benefits of STS increase when used with STC
– The more work done at compile-time, the less work left
to do at run-time
Conclusions
• Instruction fetch is better approached using both
software and hardware techniques
– Compile-time code reorganization
• Increase code sequentiality
• Minimize instruction cache misses
– Avoid run-time redundant work
• Do not store the same traces twice
• High fetch unit performance with little additional
hardware
– Small 2KB complementary trace cache & smart fill unit
Future Work
• Further increasing fetch performance
– Increase i-cache performance
• Reduce miss ratio
• Reduce miss penalty
– Increase quality of provided instructions
• Better branch prediction accuracy
– Faster recovery after mispredictions
• Take the path of least resistance
– Simplicity of design
– Software approach whenever possible
The End