Reuse-based Online Models for Caches
1
RATHIJIT SEN
DAVID A. WOOD
ACM SIGMETRICS 2013 @ CMU, Pittsburgh, PA
6/20/2013
The Problem
2
Caches: power vs performance
Reconfigurable caches
e.g., IvyBridge
[Figure: cores paired with LLC banks; an LLC miss triggers a fetch from DRAM]
The Problem:
Which configuration to select?
e.g., to get the best energy-efficiency?
Cache Performance Prediction
3
We propose a framework
h = (r · B) · φ
h: hit ratio
r: reuse-distance distribution (novel hardware support)
B: stochastic Binomial matrix
φ: hit function (LRU, PLRU, RANDOM, NMRU)
Case study:
Energy-Delay Product (EDP) within 7% of minimum
Agenda
4
The Problem
Framework
Locality (r)
Matrix transformations (B)
Hit functions (φ)
h = (r · B) · φ
Hardware support
Case Study
Cache Overview
5
Limited storage
Sets of (usually 64-byte) blocks
#blocks/set = associativity (#ways)
Set Index + Address tags identify data
[Figure: cache lookup. The address selects a set and its tag is compared with the stored tags (Match? Y: hit, N: miss). The array holds S sets of A blocks (b), where A is the associativity.]
Workload Variation
6
[Figure: Last-Level Cache (LLC) misses per 1000 instructions (0 to 30) vs. LLC size (2MB, 4MB, 8MB, 16MB, 32MB). Workloads: swim, mgrid, apache, zeus, oltp, jbb, equake, gafort, wupwise, fma3d, ammp, blackscholes, bodytrack, fluidanimate, freqmine, swaptions]
Bad configurations hurt!
7
[Figure: maximum EDP (energy-delay product) relative to minimum EDP, y-axis from 1 to 3.5; annotations: 27% worse, 218% worse]
Problem Summary
8
Reconfigurable caches
[Figure: cache array of blocks (b) with S sets and associativity A]
Multiple replacement policies
Goal: Online miss-ratio prediction
Indexing Assumption
9
Mapping of unique addresses to cache sets
Assumption: independent, uniform [Smith, 1978]
Unique accesses as Bernoulli trials
(Partial) Hashing
POWER4, POWER5, POWER6, Xeon
Simple XOR-based function [similar to Cypher, 2008]
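The hashing bullet can be illustrated with a small sketch that XOR-folds the block number into a set index. The fold scheme, the 64-byte block size, and the name `xor_set_index` are assumptions made for illustration only; this is not the hash used by any of the processors listed, nor necessarily the one evaluated in the talk.

```python
def xor_set_index(addr: int, num_sets: int, block_bytes: int = 64) -> int:
    """Map a byte address to a set index by XOR-folding its block number.

    Hypothetical example of a "simple XOR-based function"; num_sets must be
    a power of two.
    """
    if num_sets <= 0 or num_sets & (num_sets - 1):
        raise ValueError("num_sets must be a power of two")
    if num_sets == 1:
        return 0
    index_bits = num_sets.bit_length() - 1
    mask = num_sets - 1
    block = addr // block_bytes          # drop the block-offset bits
    index = 0
    while block:
        index ^= block & mask            # fold the next index_bits-wide chunk
        block >>= index_bits
    return index


# Blocks 1 and 1025 differ only in the bits above the index field.
print(xor_set_index(1 * 64, 1024), xor_set_index(1025 * 64, 1024))
```

Folding the upper address bits into the index is one way to move the mapping closer to the independent, uniform assumption, since strided addresses no longer collide on the low bits alone.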
Agenda
10
The Problem
Framework
Locality (r)
Matrix transformations (B)
Hit functions (φ)
h = (r · B) · φ
Hardware support
Case Study
Temporal Locality Metrics
11
Unique Reuse Distance (URD)
#unique intervening addresses
x y z z y x : URD(x)=2
= Stack Distance [Mattson, 1970] – 1
Large cache ⇒ large distances to track
r: vector with entries r_i = P(URD=i). Size?
Absolute Reuse Distance (ARD)
#intervening addresses
x y z z y x : ARD(x)=4
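The two definitions can be checked on the slide's example trace with a minimal sketch. The function name `reuse_distances` and the tuple layout are assumptions for illustration; the quadratic-time scan is chosen for clarity, not efficiency.

```python
def reuse_distances(trace):
    """Return (address, URD, ARD) for every reuse in the trace.

    URD counts the unique addresses between two accesses to the same
    address; ARD counts all intervening accesses.
    """
    last_pos = {}
    result = []
    for pos, addr in enumerate(trace):
        if addr in last_pos:
            between = trace[last_pos[addr] + 1 : pos]
            result.append((addr, len(set(between)), len(between)))
        last_pos[addr] = pos
    return result


for addr, urd, ard in reuse_distances(list("xyzzyx")):
    print(addr, "URD =", urd, "ARD =", ard)
```

For the trace x y z z y x, the reuse of x has two unique intervening addresses (y, z) and four intervening accesses, matching URD(x)=2 and ARD(x)=4.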
Per-set Locality, r(S)
12
r: vector with entries P(URD=i)
r(S) is “compressed” as S (#sets) increases
Less of the tail is important
[Figure: probability and cumulative probability of P(URD=i) vs. per-set URD (unique reuse distance), 0 to 32, for S = 2^10, 2^11, 2^12, 2^13, 2^14; inset compares #sets S' > S with #sets S]
Agenda
13
The Problem
Framework
Locality (r)
Matrix transformations (B)
Hit functions (φ)
h = (r · B) · φ
Hardware support
Case Study
Estimating per-set locality
14
Generalized stochastic Binomial matrices [Strum, 1977]
r(S) = r(1) · B(1 – 1/S, 1/S)
r: row vector with entries P(URD=i)
B: lower-triangular matrix; the entry in row i, column k is P(k successes in i trials), i.e., P(k of i to the same set)
Composition:
r(S') = r(S) · B(1 – S/S', S/S')
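A small sketch of this transformation, under the assumption that B(x, y) has entries C(i,k) · x^(i–k) · y^k, which is the binomial probability of k successes in i trials when x + y = 1. The helper names (`binomial_matrix`, `vec_mat`, `per_set_locality`) are hypothetical, and the vectors are truncated to a short length for readability.

```python
from math import comb


def binomial_matrix(x, y, n):
    """n x n lower-triangular matrix with entry [i][k] = C(i,k) * x^(i-k) * y^k."""
    return [
        [comb(i, k) * x ** (i - k) * y ** k if k <= i else 0.0 for k in range(n)]
        for i in range(n)
    ]


def vec_mat(r, B):
    """Row vector times matrix."""
    n = len(r)
    return [sum(r[i] * B[i][k] for i in range(n)) for k in range(n)]


def per_set_locality(r1, num_sets):
    """r(S) = r(1) * B(1 - 1/S, 1/S) for a truncated URD distribution r1."""
    return vec_mat(r1, binomial_matrix(1 - 1 / num_sets, 1 / num_sets, len(r1)))


r1 = [0.0, 0.0, 1.0]                 # every reuse has URD = 2
r2 = per_set_locality(r1, 2)         # per-set view with 2 sets
r4_direct = per_set_locality(r1, 4)  # per-set view with 4 sets
# Composition: go from 2 sets to 4 sets with B(1 - 2/4, 2/4).
r4_composed = vec_mat(r2, binomial_matrix(1 - 2 / 4, 2 / 4, len(r1)))
print(r2, r4_direct, r4_composed)
```

With two sets, each of the two intervening addresses lands in the reference set with probability 1/2, so the per-set URD is 0, 1, or 2 with probabilities 0.25, 0.5, 0.25. The composed and direct results for four sets agree, which is the property the composition rule relies on.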
Computation reuse & speedup
15
“Shorter” tail ⇒ smaller matrices
r: vector with entries P(URD=i). Size?
[Diagram: r(1), r(2^10), r(2^11), r(2^12), r(2^13), r(2^14) computed as a chain; Poisson approximation]
r(1) → r(2^10): now: compute; later: hardware support
Size of r(2^10)?
16
Prediction with r(2^10) limited to URD < n
[Figure: miss ratio (0 to 0.3) vs. LLC size (2MB to 32MB) and associativity (2-way to 32-way); predictions for n = 32, 64, 128, 256, 512 compared with actual]
Agenda
17
The Problem
Framework
Locality (r)
Matrix transformations (B)
Hit functions (φ)
h = (r · B) · φ
Hardware support
Case Study
Hit Function, φ
18
[Diagram: x … (not x) … x]
φ_k: P(x will hit | URD(x)=k)
Monotonically decreasing model
Intuition: larger URD ⇒ same or larger eviction probability
φ_0 = 1
φ_k ≤ φ_{k-1}
φ_∞ = 0
Hit Function, φ
19
Example: A=8
[Figure: hit probability (0 to 1) vs. unique reuse distance (0 to 32) for LRU, PLRU, NMRU, RANDOM]
Formulating φ
20
φ(LRU): step-function
(r · B) · φ(LRU) [Smith, 1978], [Hill & Smith, 1989]
φ(PLRU):
Assumes on average, traffic evenly divided between subtrees
φ(RANDOM):
Estimates #intervening misses using ARD
φ(NMRU): similar to φ(RANDOM) except φ1=1
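As an illustration of the simplest case, the sketch below builds the LRU step function and checks the hit-function conditions stated on the earlier Hit Function slide (φ at distance 0 equals 1, and φ never increases with distance). It assumes the step sits at per-set URD < A, which follows from the LRU stack property; the names `phi_lru` and `is_valid_hit_function` are hypothetical. The PLRU, RANDOM, and NMRU formulations are not reproduced here.

```python
def phi_lru(assoc, n):
    """Step hit function of length n: hit iff fewer than `assoc` unique
    blocks mapped to the same set since the previous access."""
    return [1.0 if k < assoc else 0.0 for k in range(n)]


def is_valid_hit_function(phi):
    """Check phi[0] == 1 and that phi is monotonically non-increasing."""
    return phi[0] == 1.0 and all(b <= a for a, b in zip(phi, phi[1:]))


phi = phi_lru(assoc=2, n=4)
print(phi, is_valid_hit_function(phi))
```

For A=2 and four tracked distances this yields [1.0, 1.0, 0.0, 0.0]: reuses with per-set URD 0 or 1 hit, all others miss.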
Agenda
21
The Problem
Framework
Locality (r)
Matrix transformations (B)
Hit functions (φ)
h = (r · B) · φ
Hardware support
Case Study
Prediction Accuracy
22
LRU, PLRU(A=2), NMRU(A=2): exact per-set model
Others: approximate per-set model
[Figure: cumulative probability (0 to 1) of the relative miss-ratio error, abs((predicted-actual)/actual), from 0% to 6%, for LRU, PLRU, RANDOM, NMRU]
Overheads
23
r' = r · B : 6–80 μsec
Binomial → Poisson approximation for each row of B
h = (r · B) · φ : 20–30 μsec
Average over 24 configurations
B applied 8 times
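The Poisson approximation mentioned above can be sketched as follows, assuming row i of B holds Binomial(i, p) probabilities with p = 1/S and is approximated by a Poisson distribution with mean i·p. The function names and the chosen sizes are illustrative only, and nothing here reflects the timing figures on the slide.

```python
from math import comb, exp, factorial


def binomial_row(i, p, n):
    """First n entries of row i: P(k successes in i trials), k = 0..n-1."""
    return [comb(i, k) * p ** k * (1 - p) ** (i - k) if k <= i else 0.0
            for k in range(n)]


def poisson_row(i, p, n):
    """Poisson approximation of the same row, with mean i * p."""
    lam = i * p
    return [exp(-lam) * lam ** k / factorial(k) for k in range(n)]


exact = binomial_row(2048, 1 / 1024, 16)
approx = poisson_row(2048, 1 / 1024, 16)
print(max(abs(a - b) for a, b in zip(exact, approx)))
```

The approximation is attractive when S is large and p = 1/S is small, because each entry then needs only a power, an exponential, and a factorial rather than a large binomial coefficient.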
Agenda
24
The Problem
Framework
Locality (r)
Matrix transformations (B)
Hit functions (φ)
h = (r · B) · φ
Hardware support
Case Study
Computation reuse & speedup
25
“Shorter” tail ⇒ smaller matrices
r: vector with entries P(URD=i). Size=512
[Diagram: r(1), r(2^10), r(2^11), r(2^12), r(2^13), r(2^14) computed as a chain; Poisson approximation]
r(1) → r(2^10): now hardware support (instead of compute)
Insights
26
r: vector with entries P(URD=i)
x y z z y x : URD(x)=2
Unique ⇒ “remember” addresses
Only cardinality, not full addresses
Bloom filter for compact (approximate) representation
r(2^10) is seen by any set of a cache with S=2^10
Filter address stream
Hardware Support for estimating r(2^10)
27
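[Block diagram. Components: set filter, reference address register, address comparator (Addr match?), 1024-bit Bloom filter with 2 hash fns, 9-bit counter, 512-entry histogram array, control logic. Labels: Start Sample (load reference address register); access → set filter → filtered access → Addr match?; N: Unique? via Bloom filter read, Y (not hit): insert (remember) and inc counter; Y: read counter, inc histogram entry, reset, End Sample.]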
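A software model can make the roles of the components named on this slide (set filter, reference address register, Bloom filter, counter, histogram array) more concrete. The sketch below is an assumption-laden illustration, not the hardware design from the talk: the class name `ReuseSampler`, the two hash functions, the choice to start a new sample on the next access, and the saturating counter are all hypothetical, and no timeout is modelled for samples whose reference address is never reused.

```python
class ReuseSampler:
    """Software model of a sampled per-set URD histogram (illustrative only)."""

    def __init__(self, num_sets=1024, bloom_bits=1024, hist_entries=512,
                 block_bytes=64):
        self.num_sets = num_sets
        self.bloom_bits = bloom_bits
        self.block_bytes = block_bytes
        self.hist = [0] * hist_entries      # histogram array
        self.ref = None                     # reference address register
        self.bloom = [False] * bloom_bits   # Bloom filter bits
        self.counter = 0                    # unique-address counter

    def _hashes(self, block):
        # Two hypothetical hash functions over the bits above the set index.
        tag = block // self.num_sets
        return tag % self.bloom_bits, ((tag * 0x9E3779B1) >> 16) % self.bloom_bits

    def access(self, addr):
        block = addr // self.block_bytes
        if self.ref is None:                # start sample: load reference
            self.ref = block
            self.bloom = [False] * self.bloom_bits
            self.counter = 0
            return
        if block % self.num_sets != self.ref % self.num_sets:
            return                          # set filter: other sets ignored
        if block == self.ref:               # address match: end sample
            self.hist[self.counter] += 1
            self.ref = None
            return
        h1, h2 = self._hashes(block)
        if not (self.bloom[h1] and self.bloom[h2]):   # unique (not hit)
            self.bloom[h1] = self.bloom[h2] = True    # remember
            self.counter = min(self.counter + 1, len(self.hist) - 1)


sampler = ReuseSampler(num_sets=4)
for block in [0, 4, 1, 8, 4, 0]:            # blocks 0, 4, 8 share a set
    sampler.access(block * 64)
print(sampler.hist[:4])
```

In this small run the reference block 0 is reused after two unique same-set blocks (4 and 8), so one sample is recorded at per-set URD 2; block 1 is dropped by the set filter and the second access to block 4 is recognised by the Bloom filter as already seen.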
Agenda
28
The Problem
Framework
Locality (r)
Matrix transformations (B)
Hit functions (φ)
h = (r · B) · φ
Hardware support
Case Study + way counters
LRU Way Counters [Suh, et al. 2002]
29
One counter per logical way (stack position)
Determining logical position is hard
not totally (re-)ordered with every access
heuristics, e.g., for PLRU [Kedzierski, et al. 2010]
Other Limitations
Inclusion property
Fixed #sets
S' = S : special case of reuse framework
S' ≠ S ? Use B
provided enough tail of r(S) is available
Min. EDP configuration
30
EDP within 7% of minimum
Reuse models outperform PLRU way counters in most cases
[Figure: EDP relative to min. EDP (1 to 1.08) for Reuse Model and PLRU Way Counters]
Summary
31
The Problem:
Online miss-rate estimation for reconfigurable caches
We propose a framework
h = (r · B) · φ
h: hit-ratio
r: reuse-distance distribution (novel hardware support)
B: stochastic Binomial matrix
φ: hit function (LRU, PLRU, RANDOM, NMRU)
Case study: EDP within 7% of minimum
Future work: More policies, applications/case studies
Also in the paper
32
r: lossy summarization of the address trace
Estimation for ARD
Optimizations for LRU
Conditions for PLRU eviction
More details on models & evaluation
Reuse-based Online Models for Caches
33
Questions?
Example LLC performance
34
OLTP (TPC-C + IBM DB2)
[Figure: miss ratio (0 to 0.4) vs. LLC size (2MB to 32MB) and associativity (2-way to 32-way) for RANDOM, NMRU, PLRU, LRU]
Estimating cache performance
35
Hit ratio = hits/access
= ∑_i P(URD=i) · P(hit|URD=i)
= r · φ, where r_i = P(URD=i) and φ_i = P(hit|URD=i)
Miss ratio = misses/access
= 1 – hit ratio
Miss rate = misses/instruction
= miss ratio × accesses/instruction
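The arithmetic on this slide can be traced with a short sketch. The function names and all numeric inputs are made up for illustration; the miss rate is scaled to misses per 1000 instructions only to match the unit used on the Workload Variation slide.

```python
def hit_ratio(r, phi):
    """Sum over i of P(URD=i) * P(hit | URD=i)."""
    return sum(p * h for p, h in zip(r, phi))


def misses_per_kilo_instruction(r, phi, accesses_per_instruction):
    """Miss rate = miss ratio * accesses/instruction, per 1000 instructions."""
    miss_ratio = 1.0 - hit_ratio(r, phi)
    return 1000.0 * miss_ratio * accesses_per_instruction


r = [0.5, 0.25, 0.25]        # hypothetical per-set URD distribution
phi = [1.0, 1.0, 0.0]        # hypothetical hit function
print(hit_ratio(r, phi), misses_per_kilo_instruction(r, phi, 0.02))
```

Because the miss ratio is computed as one minus the hit ratio, any probability mass not covered by r (for example, accesses with no earlier reuse) is counted as misses.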
URD vs ARD
36
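Access sequence between two accesses to x with URD = k:
x z_0 {z_0}* z_1 {z_0,z_1}* z_2 {z_0,z_1,z_2}* z_3 … z_{k-1} {z_0,z_1,z_2,...,z_{k-1}}* x
d_k: length of this sequence (ARD) for URD = k
Approximation: d_k = d_{k-1} + 1 / ∑_{i=k}^{∞} r_i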