Scavenger:
A New Last Level Cache
Architecture with Global Block
Priority
Arkaprava Basu, IIT Kanpur
Mainak Chaudhuri, IIT Kanpur
Nevin Kirman, Cornell
Meyrem Kirman, Cornell
Jose F. Martinez, Cornell
Talk in one slide

 Observation #1: a large number of blocks miss repeatedly in the last-level cache
 Observation #2: the number of evictions between an eviction-reuse pair is too large to be captured by a conventional fully associative victim file
 How to exploit this temporal behavior with a “large period”?
– Our solution prioritizes blocks evicted from the last-level cache by their miss frequencies
– The top k most frequently missing blocks are scavenged and retained in a fast k-entry victim file

Scavenger (IITK-Cornell)
Sketch

 Observations and hypothesis
 Scavenger overview (Contributions)
 Scavenger architecture
– Frequency estimator
– Priority queue
– Victim file
 Simulation environment
 Simulation results
 Related work
 Summary
Observations and hypothesis

Figure: percentage of ROB stall cycles (0-100%) attributable to blocks that miss 1, 2-9, 10-99, 100-999, and >= 1000 times in the last-level cache, shown for 512 KB 8-way and 1 MB 8-way configurations across sixteen SPEC 2000 applications (gz, wu, sw, ap, vp, gc, me, ar, mc, eq, cr, am, pe, bz, tw, aps).
Observations and hypothesis

Figure: a victim file large enough to capture the observed reuse distances is what we wish for, but it is too large to build fully associative (FA), while an FA victim file of practical size is too small.
Observations and hypothesis

 Block addresses repeat in the miss address stream of the last-level cache
 Repeating block addresses in the miss stream cause significant ROB stall
 Hypothesis: identifying and retaining the most frequently missing blocks in a victim file should be beneficial, but ...
 The number of evictions between an eviction-reuse pair is very large
– Temporal behavior happens at too large a scale to be captured by any reasonably sized fully associative victim file
Scavenger overview (Contributions)

 Functional requirements
– Determine the frequency of occurrence of an evicted block address in the miss stream seen so far
– Determine (preferably in O(1) time) the minimum frequency among the top k most frequently missing blocks; if the frequency of the current block is greater than or equal to this minimum, replace the minimum, insert this frequency, and compute the new minimum quickly
– Allocate a new block in the victim file by replacing the minimum-frequency block, irrespective of the addresses of these blocks
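These requirements can be sketched functionally in a few lines. In this hypothetical model, a plain dictionary stands in for the Bloom-filter frequency estimator and Python's `heapq` stands in for the hardware min-heap; stale heap frequencies and the victim file data array are deliberately ignored.

```python
import heapq

class ScavengerSketch:
    """Illustrative model of the victim-file admission policy:
    retain the k most frequently missing block addresses."""

    def __init__(self, k):
        self.k = k
        self.freq = {}         # stands in for the Bloom-filter estimator
        self.heap = []         # (frequency, block_address) min-heap
        self.resident = set()  # addresses currently in the victim file

    def on_l2_eviction(self, addr):
        """Returns True if the evicted block is scavenged into the VF."""
        # Update the miss-frequency estimate for this address.
        self.freq[addr] = self.freq.get(addr, 0) + 1
        f = self.freq[addr]
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, (f, addr))
            self.resident.add(addr)
            return True
        min_f, min_addr = self.heap[0]       # O(1) access to the minimum
        if f >= min_f:
            # Replace the minimum-priority block, irrespective of address.
            heapq.heapreplace(self.heap, (f, addr))
            self.resident.discard(min_addr)
            self.resident.add(addr)
            return True
        return False                         # block bypasses the victim file
```

Note that this sketch does not update a block's heap priority while it sits in the victim file; the real design handles that with back pointers between the heap and the VF tags.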
Scavenger overview (L2 eviction)

Figure: the block address evicted from the L2 tag & data array queries the Bloom filter, which returns the block's miss frequency (Freq.). The min-heap supplies the minimum priority (Min.) over the victim file contents. If Freq. >= Min., the minimum-priority entry is replaced and the new frequency inserted ("replace min., insert new"); otherwise the evicted block is sent to the memory controller (MC).
Scavenger overview (L1 miss)

Figure: the L1 miss address looks up the L2 tag & data array and, via the Bloom filter, the victim file. On a victim file hit, the block's entry is de-allocated (through the min-heap) and the block is supplied to L1; a miss in both structures goes to the memory controller (MC).
Miss frequency estimator

Figure: the block address is sliced into bit fields [25:23], [24:19], [22:15], [18:9], and [14:0], which index five Bloom filters (BF4-BF0); the miss-frequency estimate is the minimum of the resulting counts.
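The min-of-counters estimate can be sketched as follows. The field boundaries are taken from the slide, but the directly indexed counter tables (rather than the paper's exact Bloom filter organization) and the class name are illustrative assumptions.

```python
class MissFrequencyEstimator:
    """Sketch of a min-of-counters frequency estimate: each bit field
    of the block address indexes one counter table, and the estimate
    is the minimum count across the tables."""

    # (high_bit, low_bit) slices of the block address, from the slide.
    FIELDS = [(25, 23), (24, 19), (22, 15), (18, 9), (14, 0)]

    def __init__(self):
        # One counter table per field, sized by the field width.
        self.tables = [[0] * (1 << (hi - lo + 1)) for hi, lo in self.FIELDS]

    def _slots(self, addr):
        for (hi, lo), table in zip(self.FIELDS, self.tables):
            mask = (1 << (hi - lo + 1)) - 1
            yield table, (addr >> lo) & mask

    def record_miss(self, addr):
        for table, idx in self._slots(addr):
            table[idx] += 1

    def estimate(self, addr):
        # Aliasing can only inflate a counter, so the minimum is the
        # tightest bound and never underestimates the true count.
        return min(table[idx] for table, idx in self._slots(addr))
```

Because every counter an address touches is at least its true miss count, taking the minimum limits the damage from any single aliased field.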
Priority queue (min-heap)

Figure: an example 15-node min-heap stored as an array T1-T15 in level order: 5; 6, 6; 10, 10, 13, 11; 11, 9, 15, 12, 15, 18, 11, 13. Each node holds Priority | VPTR (a pointer to the victim file tag). For the node at index i, the left child is at i << 1 and the right child at (i << 1) | 1.
Pipelined min-heap

 Both insertion and de-allocation require O(log k) steps for a k-entry heap
– Each step involves read, comparison, and write operations; step latency: r+c+w cycles
– A latency of (r+c+w)·log(k) cycles is too high to cope with bursty cache misses
– Both insertion and de-allocation must be pipelined
– We unify insertion and de-allocation into a single pipelined operation called replacement
 De-allocation is the same as an insertion of priority zero
Pipelined heap replacement

Figure (animation, four frames): replacing the root priority 5 of the example heap (5; 6, 6; 10, 10, 13, 11; 11, 9, 15, 12, 15, 18, 11, 13) with 20. The 20 percolates down one level per step, swapping with the smaller child each time (with 6, then 10, then 9), leaving 6 at the root. Each step's read (R), compare (C), and write (W) stages are pipelined across heap levels, so successive replacements can enter the heap back to back.
Victim file

 Functional requirements
– Should be able to replace the block with minimum priority by a block of higher or equal priority, irrespective of addresses (fully associative functionality)
– Should offer fast lookup (a conventional fully associative design won't do)
– On a hit, should de-allocate the block and move it to the main L2 cache (different from conventional victim caches)
Victim file organization

 Tag array
– Direct-mapped hash table with collisions (i.e., conflicts) resolved by chaining
– Each tag entry contains an upstream (toward head) and a downstream (toward tail) pointer, and a head (H) and a tail (T) bit
– Victim file lookup at address A walks the tag list sequentially, starting at the direct-mapped index of A
 Each tag lookup has latency equal to that of a direct-mapped cache of the same size
– A replacement delinks the replaced tag from its list and links it into the list of the new tag
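The tag array behaves like a chained hash table. A minimal software analogue follows; the class name, list-based chain representation, and block-offset width are illustrative assumptions, not the hardware's pointer format.

```python
BLOCK_OFFSET_BITS = 6  # 64 B blocks, per the configurations in this talk

class VictimFileTags:
    """Sketch of the victim file tag array: a direct-mapped hash table
    over k entries, with conflicts resolved by chaining."""

    def __init__(self, k):
        assert k & (k - 1) == 0, "k must be a power of two"
        self.k = k
        self.chains = [[] for _ in range(k)]  # one chain per index

    def _index(self, addr):
        # (A >> BO) & (k - 1): direct-mapped index of address A
        return (addr >> BLOCK_OFFSET_BITS) & (self.k - 1)

    def insert(self, addr):
        self.chains[self._index(addr)].append(addr)

    def lookup(self, addr):
        """Walk the tag list starting at the direct-mapped index of addr.
        Returns the number of tag accesses on a hit, or None on a miss."""
        for accesses, tag in enumerate(self.chains[self._index(addr)], 1):
            if tag == addr:
                return accesses
        return None
```

The common case of a one-access hit is what gives the victim file its direct-mapped lookup latency, as the characterization results later in the talk show.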
Victim file lookup

Figure: the lookup index into the k-entry VF tag array is (A >> BO) & (k-1). Tag entries are chained between their Head and Tail bits, and the walk follows the chain until a hit or an invalid entry (miss). On a hit, zero priority is inserted into the block's heap node, which requires a back pointer from the tag entry to the heap; the VF data array is then accessed.
Simulation environment

 Single-stream evaluation in this paper
 Configs differ only in L2 cache arch.
 Common attributes (more in paper)
– 4 GHz, 4-4-6 pipe, 128-entry ROB, 160 i/fp RF
– L1 caches: 32 KB/4-way/64B/LRU/0.75 ns
– L2 cache miss latency (load-to-use): 121 ns
– 16-stream stride prefetcher between L2 cache and memory with max. stride 256B
– Applications: 1 billion representative dynamic instructions from sixteen SPEC 2000 applications (will discuss results for nine memory-bound applications; rest in paper)
Simulation environment

 L2 cache configurations
– Baseline: 1 MB/8-way/64B/LRU/2.25 ns/15.54 mm²
– Scavenger: 512 KB/8-way/64B/LRU/2 ns conventional L2 cache + 512 KB VF (8192 entries × 64 B/entry)/0.5 ns, 0.75 ns + auxiliary data structures (8192-entry priority queue, BFs, pointer RAMs)/0.5 ns; 16.75 mm²
– 16-way: 1 MB/16-way/64B/LRU/2.75 ns/26.4 mm²
– 512KB-FA-VC: 512 KB/8-way/64B/LRU/2 ns conventional L2 cache + 512 KB/FA/64B/Random/3.5 ns conventional VC
Victim file characteristics

 Number of tag accesses per L1 cache miss request
– Mean below 1.5 for 14 applications
– Mode (common case) is one for 15 applications (these enjoy direct-mapped latency)
– More than 90% of requests require at most three accesses for 15 applications
Performance (Speedup)

Figure: speedup over the baseline (higher is better, y-axis 0.9-1.4) for wu, sw, ap, vp, ar, mc, eq, am, tw. Legend (mean speedups): 16-way (1.01, 1.00); 512 KB-FA-VC (1.01, 1.01); Scavenger (1.14, 1.08). Scavenger's best bar, labeled 1.63, exceeds the axis range.
Performance (L2 cache misses)

Figure: L2 cache misses normalized to the baseline (lower is better, y-axis 0.6-1.1) for wu, sw, ap, vp, ar, mc, eq, am, tw. Legend (means): 16-way (0.98, 0.98); 512 KB-FA-VC (0.94, 0.96); Scavenger (0.85, 0.90).
L2 cache misses in recent proposals

Figure: normalized L2 cache misses (lower is better, y-axis 0.4-1.00) for wu, sw, ap, vp, ar, mc, eq, am, tw. DIP [ISCA'07] (0.84) beats Scavenger in art and mcf only; V-way [ISCA'05] (0.87) beats Scavenger only in ammp; Scavenger (0.84) shows improvement across the board. Annotation: the bottleneck is the BFs.
Summary of Scavenger

 A last-level cache architecture with algorithms to discover global block priority
 Divides the storage into a conventional set-associative cache and a large, fast VF that offers the functionality of a FA VF without using any CAM
 Insertion into the VF is controlled by a priority queue backed by a cache block miss frequency estimator
 Offers an IPC improvement of up to 63%, and 8% on average, for a set of sixteen SPEC 2000 applications
Scavenger:
A New Last Level Cache
Architecture with Global Block
Priority
THANK YOU!
Arkaprava Basu, IIT Kanpur
Mainak Chaudhuri, IIT Kanpur
Nevin Kirman, Cornell
Meyrem Kirman, Cornell
Jose F. Martinez, Cornell