Transcript Document

Adaptive Insertion Policies for High-Performance Caching
Moinuddin K. Qureshi
Yale N. Patt
Aamer Jaleel
Simon C. Steely Jr.
Joel Emer
International Symposium on Computer Architecture (ISCA) 2007
Background
Fast processor + slow memory → cache hierarchy

Proc → L1 (~2 cycles) → L2 (~10 cycles) → Memory (L2 miss: ~300 cycles)

L1 misses → short latency, can be hidden
L2 misses → long latency, hurt performance
It is important to reduce last-level (L2) cache misses
Motivation
- L1 for latency, L2 for capacity
- Traditionally, the L2 is managed like the L1 (typically LRU)
- L1 filters temporal locality → poor locality at the L2
- LRU causes thrashing when the working set > cache size

Most lines remain unused between insertion and eviction
Dead on Arrival (DoA) Lines
[Figure: % of DoA lines per benchmark]
DoA Lines: Lines unused between insertion and eviction
For the 1MB 16-way L2, 60% of lines are DoA → ineffective use of cache space
Why DoA Lines?
- Streaming data → never reused. L2 caches don't help.
- Working set of the application is greater than the cache size

[Figure: misses per 1000 instructions vs. cache size (MB) for art and mcf]

Solution: if working set > cache size, retain some of the working set
Overview
Problem: LRU replacement is inefficient for L2 caches

Goal: A replacement policy with:
1. Low hardware overhead
2. Low complexity
3. High performance
4. Robustness across workloads

Proposal: A mechanism that reduces misses by 21% and has a total storage overhead of less than two bytes
Outline
- Introduction
- Static Insertion Policies
- Dynamic Insertion Policies
- Summary
Cache Insertion Policy
Two components of cache replacement:
1. Victim selection:
   Which line should be replaced to make room for the incoming line?
   (e.g., LRU, Random, FIFO, LFU)
2. Insertion policy:
   Where is the incoming line placed in the replacement list?
   (e.g., insert the incoming line at the MRU position)
Simple changes to insertion policy can greatly improve
cache performance for memory-intensive workloads
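To make the two components concrete, here is a minimal sketch in C (illustrative only; WAYS, lru_set_t, select_victim, and insert_line are hypothetical names, not taken from the paper):

```c
#define WAYS 16  /* 16-way set, matching the 1MB 16-way L2 used above */

/* Replacement list for one cache set: stack[0] is the MRU position,
 * stack[WAYS-1] is the LRU position; entries are way indices. */
typedef struct {
    int stack[WAYS];
} lru_set_t;

/* 1. Victim selection: LRU evicts the line at the LRU position. */
static int select_victim(const lru_set_t *s) {
    return s->stack[WAYS - 1];
}

/* 2. Insertion policy: place the incoming way (the victim's way, now
 * holding the new line) at position 'pos' in the list, where 0 means
 * MRU insertion and WAYS-1 means LRU insertion. Entries below 'pos'
 * shift down one slot, dropping the old LRU entry. */
static void insert_line(lru_set_t *s, int way, int pos) {
    for (int i = WAYS - 1; i > pos; i--)
        s->stack[i] = s->stack[i - 1];
    s->stack[pos] = way;
}
```

Victim selection stays fixed throughout; only the insertion position changes in the policies that follow.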
LRU-Insertion Policy (LIP)
Replacement list (MRU … LRU): a b c d e f g h

Reference to 'i' with the traditional LRU policy:
evict h, insert i at MRU → i a b c d e f g

Reference to 'i' with LIP:
choose victim h, do NOT promote i to MRU → a b c d e f g i

Lines do not enter non-LRU positions unless reused
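In terms of the hypothetical sketch above, LIP changes only the insertion call (a sketch, not the paper's code):

```c
/* Traditional LRU: the incoming line is inserted at the MRU position. */
static void lru_insert(lru_set_t *s, int incoming_way) {
    insert_line(s, incoming_way, 0);
}

/* LIP: the incoming line stays at the LRU position; it moves toward
 * MRU only if it is reused (promotion on a hit is unchanged). */
static void lip_insert(lru_set_t *s, int incoming_way) {
    insert_line(s, incoming_way, WAYS - 1);
}
```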
Bimodal-Insertion Policy (BIP)
LIP never ages older lines; BIP fixes this by infrequently inserting lines at the MRU position:

Let e = bimodal throttle parameter
if (rand() < e)
    insert at MRU position;
else
    insert at LRU position;

For small e, BIP retains the thrashing protection of LIP while responding to changes in the working set
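The pseudocode translates directly to C on top of the same hypothetical helpers (the 1/32 throttle matches the e = 1/32 configuration evaluated next; rand() stands in for whatever cheap pseudo-random source, e.g. an LFSR, a real design would use):

```c
#include <stdlib.h>

#define BIP_THROTTLE 32  /* e = 1/32 */

/* BIP insertion on a miss: rarely insert at MRU, usually at LRU. */
static void bip_insert(lru_set_t *s, int incoming_way) {
    if (rand() % BIP_THROTTLE == 0)
        insert_line(s, incoming_way, 0);         /* MRU, probability e */
    else
        insert_line(s, incoming_way, WAYS - 1);  /* LRU, LIP behavior  */
}
```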
Circular Reference Model
[Smith & Goodman, ISCA'84]

The reference stream has T blocks and repeats N times. The cache has K blocks (K < T and N >> T).

Hit rate during each epoch:

Policy        | (a1 a2 a3 … aT)^N | (b1 b2 b3 … bT)^N
LRU           | 0                 | 0
OPT           | (K-1)/(T-1)       | (K-1)/(T-1)
LIP           | (K-1)/T           | 0
BIP (small e) | ≈ (K-1)/T         | ≈ (K-1)/T
For small e, BIP retains the thrashing protection of LIP while adapting to changes in the working set
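A quick worked instance of the table (K = 16 and T = 32 are example values, not from the paper):

```latex
% Hit rate per epoch for a K = 16 block cache and a T = 32 block loop:
\begin{align*}
\text{LRU}      &: \; 0 \quad \text{(thrashes: every reference misses)} \\
\text{OPT}      &: \; \frac{K-1}{T-1} = \frac{15}{31} \approx 48\% \\
\text{LIP, BIP} &: \; \frac{K-1}{T}   = \frac{15}{32} \approx 47\%
\end{align*}
```

So LIP and BIP come within about one percentage point of OPT on the repeating loop, and BIP additionally recovers that hit rate after the working set changes to the b-blocks.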
Results for LIP and BIP
[Figure: % reduction in L2 MPKI per benchmark for LIP and BIP (e = 1/32)]

Changes to the insertion policy increase misses for LRU-friendly workloads
Outline
- Introduction
- Static Insertion Policies
- Dynamic Insertion Policies
- Summary
Dynamic-Insertion Policy (DIP)
Two types of workloads: LRU-friendly or BIP-friendly
DIP can be implemented by:
1. Monitor both policies (LRU and BIP)
2. Choose the best-performing policy
3. Apply the best policy to the cache
Need a cost-effective implementation → “Set Dueling”
DIP via “Set Dueling”
Divide the cache into three groups of sets:
- Dedicated LRU sets
- Dedicated BIP sets
- Follower sets (use the winner of LRU vs. BIP)

An n-bit saturating counter tracks the duel:
- Miss in a dedicated LRU set: counter++
- Miss in a dedicated BIP set: counter--

The counter's MSB decides the policy for the follower sets:
- MSB = 0: use LRU
- MSB = 1: use BIP

[Figure: misses in the LRU sets and BIP sets increment/decrement the n-bit counter; its MSB steers the follower sets]

monitor → choose → apply (using a single counter)
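A minimal sketch of the selector (assuming 1024 sets, a 10-bit counter, and a simple modulo rule for choosing dedicated sets; all constants and names here are illustrative, not the paper's exact design):

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS  1024
#define PSEL_BITS 10
#define PSEL_MAX  ((1 << PSEL_BITS) - 1)

static uint16_t psel = PSEL_MAX / 2;  /* n-bit saturating counter */

/* Illustrative: 32 dedicated sets per policy out of 1024. */
static bool is_lru_set(int set) { return set % 32 == 0; }
static bool is_bip_set(int set) { return set % 32 == 1; }

/* Called on every miss: misses in LRU sets push the counter up,
 * misses in BIP sets push it down (saturating at both ends). */
static void update_psel(int set) {
    if (is_lru_set(set) && psel < PSEL_MAX) psel++;
    else if (is_bip_set(set) && psel > 0)   psel--;
}

/* Dedicated sets keep their fixed policy; follower sets use the
 * counter's MSB (0 = LRU is winning, 1 = BIP is winning). */
static bool use_bip(int set) {
    if (is_lru_set(set)) return false;
    if (is_bip_set(set)) return true;
    return (psel >> (PSEL_BITS - 1)) & 1;
}
```

The cost argument is visible here: one counter for the whole cache, and the dedicated sets need no per-set storage because their policy is fixed by their index.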
Bounds on Dedicated Sets
How many dedicated sets are required for “Set Dueling”?

Let μLRU, σLRU, μBIP, σBIP = average misses and standard deviation for LRU and BIP,
n = number of dedicated sets,
Z = a standard Gaussian variable,
r = |μLRU − μBIP| / √(σLRU² + σBIP²)

The probability of selecting the best policy is:
P(Best) = P(Z < r√n)

For a majority of workloads r > 0.2 → 32-64 dedicated sets are sufficient
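Plugging representative numbers into the bound (a back-of-envelope check, using the r = 0.2 figure above):

```latex
% P(Best) = P(Z < r * sqrt(n)) with r = 0.2:
\begin{align*}
n = 32 &: \; P\big(Z < 0.2\sqrt{32}\big) = P(Z < 1.13) \approx 0.87 \\
n = 64 &: \; P\big(Z < 0.2\sqrt{64}\big) = P(Z < 1.60) \approx 0.95
\end{align*}
```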
Results for DIP
[Figure: % reduction in L2 MPKI per benchmark for BIP and DIP (32 dedicated sets)]
DIP reduces average MPKI by 21% and requires less than two bytes of storage overhead
DIP vs. Other Policies
[Figure: % reduction in average L2 MPKI for the hybrid policies (LRU+LFU), (LRU+MRU), (LRU+RND), DIP, OPT, and doubling the cache to 2MB]
DIP bridges two-thirds of the gap between LRU and OPT
IPC Improvement with DIP
[Figure: % IPC improvement per benchmark]
Processor: 4-wide, 32-entry window. Memory: 270 cycles. L2: 1MB 16-way LRU.
DIP improves IPC by 9.3% on average
Outline
- Introduction
- Static Insertion Policies
- Dynamic Insertion Policies
- Summary
Summary
LRU inefficient for L2 caches. Most lines remain
unused between insertion and eviction
Proposed changes to cache insertion policy (DIP) has:
overhead
 1. Low hardware
Requires < two bytes storage overhead
 2. Low complexity
Trivial to implement. No changes to cache structure
performance
 3. High Reduces
misses by 21%. Two-thirds as good as OPT
across workloads
 4. RobustAlmost
as good as LRU for LRU-friendly workloads
Source code: www.ece.utexas.edu/~qk/dip

Questions?
DIP vs. LRU Across Cache Sizes
[Figure: MPKI relative to 1MB LRU (%), smaller is better, for LRU and DIP at 1MB, 2MB, 4MB, and 8MB; benchmarks: art, mcf, equake, swim, health, and Avg_16]
MPKI decreases until the workload fits in the cache
DIP with 1MB 8-way L2 Cache
[Figure: % reduction in L2 MPKI per benchmark]
MPKI reduction with 8-way (19%) is similar to 16-way (21%)
Interaction with Prefetching
[Figure: % reduction in L2 MPKI with a PC-based stride prefetcher: DIP-NoPref, LRU-Pref, DIP-Pref]
DIP also works well in the presence of prefetching
[Backup slides: mcf snippet, art snippet, health mpki, swim mpki, DIP Bypass, DIP (design and implementation)]
Random Replacement (Success Function)
The cache contains K blocks and the reference stream contains T blocks
Probability that a block in the cache survives one eviction = (1 − 1/K)
Number of evictions between successive references to a block = (T−1)·Pmiss
Therefore: Phit = (1 − 1/K)^((T−1)·Pmiss)
With Pmiss = 1 − Phit: Phit = (1 − 1/K)^((T−1)(1−Phit))
Iterative solution, starting from Phit = 0:
1. Phit = (1 − 1/K)^(T−1)
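A small fixed-point iteration makes the solution concrete (a sketch; the K = 16, T = 32 values are arbitrary examples):

```c
#include <math.h>
#include <stdio.h>

/* Iterate P_hit = (1 - 1/K)^((T-1)(1 - P_hit)) from P_hit = 0. */
int main(void) {
    const double K = 16.0, T = 32.0;  /* example cache/stream sizes */
    double p_hit = 0.0;

    for (int i = 0; i < 100; i++) {
        double next = pow(1.0 - 1.0 / K, (T - 1.0) * (1.0 - p_hit));
        if (fabs(next - p_hit) < 1e-12) break;  /* converged */
        p_hit = next;
    }
    printf("P_hit = %.4f for K = %.0f, T = %.0f\n", p_hit, K, T);
    return 0;
}
```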