Near-Optimal Cache Block Placement with
Reactive Nonuniform Cache Architectures
Nikos Hardavellas, Northwestern University
Team: M. Ferdman, B. Falsafi, A. Ailamaki
Northwestern, Carnegie Mellon, EPFL
Moore’s Law Is Alive And Well
[Figure: a 90nm transistor (Intel, 2005) shown next to the Swine Flu A/H1N1 virus (CDC) for scale; process roadmap 65nm, 45nm, 32nm, 22nm, 16nm across 2007-2019]
Device scaling continues for at least another 10 years
2
© Hardavellas
Moore’s Law Is Alive And Well
Good Days Ended Nov. 2002 [Yelick09]
“New” Moore’s Law: 2x cores with every generation
On-chip cache grows commensurately to supply all cores with data
3
© Hardavellas
Larger Caches Are Slower Caches
[Figure: L2 cache size (KB, log scale) vs. year (1990-2010), showing increasingly large caches; L2 hit latency (cycles, 0-25) vs. year (1990-2010), showing increasingly slow access]
Increasing access latency forces caches to be distributed
4
© Hardavellas
Cache design trends
• As caches become bigger, they get slower
• Split the cache into smaller “slices”
• Balance cache slice access latency with network latency
5
© Hardavellas
Modern Caches: Distributed
[Diagram: eight cores, each with its own L2 slice]
Split cache into “slices”, distribute across die
6
© Hardavellas
Data Placement Determines Performance
[Diagram: tiled multicore with a cache slice next to each core]
Goal: place data on chip close to where they are used
7
© Hardavellas
Our proposal: R-NUCA
Reactive Nonuniform Cache Architecture
• Data may exhibit arbitrarily complex behaviors
 ...but few that matter!
• Learn the behaviors at run time & exploit their characteristics
 Make the common case fast, the rare case correct
 Resolve conflicting requirements
8
© Hardavellas
Reactive Nonuniform Cache Architecture
[Hardavellas et al, ISCA 2009]
[Hardavellas et al, IEEE-Micro Top Picks 2010]
• Cache accesses can be classified at run-time
 Each class amenable to different placement
• Per-class block placement
 Simple, scalable, transparent
 No need for HW coherence mechanisms at LLC
 Up to 32% speedup (17% on average)
 -5% on avg. from an ideal cache organization
• Rotational Interleaving
 Data replication and fast single-probe lookup
9
© Hardavellas
Outline
• Introduction
• Why do Cache Accesses Matter?
• Access Classification and Block Placement
• Reactive NUCA Mechanisms
• Evaluation
• Conclusion
10
© Hardavellas
Cache accesses dominate execution
[Hardavellas et al, CIDR 2007]
[Figure: CPI (lower is better) vs. L2 cache size (0-30 MB) on a 4-core CMP running DSS (TPC-H on DB2, 1GB database), broken into L2-hit stalls, memory stalls, total, and an ideal cache]
Bottleneck shifts from memory to L2-hit stalls
11
© Hardavellas
How much do we lose?
[Figure: normalized throughput (higher is better) vs. L2 cache size (0-30 MB) for DSS-const and DSS-real on a 4-core CMP running DSS (TPC-H on DB2, 1GB database)]
We lose half the potential throughput
12
© Hardavellas
Outline
• Introduction
• Why do Cache Accesses Matter?
• Access Classification and Block Placement
• Reactive NUCA Mechanisms
• Evaluation
• Conclusion
13
© Hardavellas
Terminology: Data Types
[Diagram: three data classes. Private: read or written by a single core. Shared Read-Only: read by multiple cores. Shared Read-Write: read and written by multiple cores.]
14
© Hardavellas
Distributed shared L2
[Diagram: tiled multicore; a block's address mod <#slices> selects its slice, giving a unique location for any block (private or shared)]
Maximum capacity, but slow access (30+ cycles)
15
© Hardavellas
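As a concrete reading of the "address mod <#slices>" rule above, here is a minimal sketch in C of how a home slice could be selected; the 64-byte block size and 16-slice count are assumptions for illustration, not taken from the talk.

#include <stdint.h>

#define BLOCK_BYTES 64   /* cache block size (assumed)      */
#define N_SLICES    16   /* one L2 slice per core (assumed) */

/* Shared L2: every block has exactly one home slice,
 * chosen by address interleaving ("address mod <#slices>"). */
static inline int home_slice(uint64_t paddr)
{
    uint64_t block = paddr / BLOCK_BYTES;   /* drop the block offset */
    return (int)(block % N_SLICES);
}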
Distributed private L2
[Diagram: tiled multicore; on every access, data are allocated at the local L2 slice, so private data stay at the requesting core's slice]
Fast access to core-private data
16
© Hardavellas
Distributed private L2: shared-RO access
[Diagram: tiled multicore; on every access, data are allocated at the local L2 slice, so shared read-only data get replicated across L2 slices]
Wastes capacity due to replication
17
© Hardavellas
Distributed private L2: shared-RW access
[Diagram: tiled multicore; on every access, data are allocated at the local L2 slice, so shared read-write data require coherence via indirection through a directory (dir)]
Slow for shared read-write
Wastes capacity (dir overhead) and bandwidth
18
© Hardavellas
Conventional Multi-Core Caches
[Diagram: a shared tiled cache beside a private tiled cache with a distributed directory]
Shared:
 Address-interleave blocks
 + High capacity
 − Slow access
Private:
 Each block cached locally
 + Fast access (local)
 − Low capacity (replicas)
 − Coherence: via indirection (distributed directory)
We want: high capacity (shared) + fast access (private)
19
© Hardavellas
Where to Place the Data?
[Diagram: placement decision (migrate, share, or replicate) as a function of read-write intensity and number of sharers]
• Close to where they are used!
• Accessed by single core: migrate locally
• Accessed by many cores: replicate (?)
 If read-only, replication is OK
 If read-write, coherence a problem
 Low reuse: evenly distribute across sharers
20
© Hardavellas
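A minimal sketch in C of the per-class placement decision the slide above describes; the enum names and the reuse flag are illustrative, not from the talk.

/* Per-class placement decision sketched from the slide above. */
typedef enum { PLACE_LOCAL, PLACE_REPLICATE, PLACE_INTERLEAVE } placement_t;

placement_t choose_placement(int num_sharers, int is_read_only, int high_reuse)
{
    if (num_sharers <= 1)
        return PLACE_LOCAL;        /* accessed by a single core: migrate locally */
    if (is_read_only && high_reuse)
        return PLACE_REPLICATE;    /* shared read-only: replication is OK        */
    return PLACE_INTERLEAVE;       /* read-write or low reuse: distribute evenly
                                      across the sharers (address interleaving)  */
}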
Methodology
Flexus: Full-system cycle-accurate timing simulation
[Hardavellas et al, SIGMETRICS-PER 2004
Wenisch et al, IEEE Micro 2006]
Workloads
• OLTP: TPC-C 3.0 100 WH
 IBM DB2 v8
 Oracle 10g
• DSS: TPC-H Qry 6, 8, 13
 IBM DB2 v8
• SPECweb99 on Apache 2.0
• Multiprogrammed: Spec2K
• Scientific: em3d
Model Parameters
• Tiled, LLC = L2
• Server/Scientific wrkld.
 16-cores, 1MB/core
• Multi-programmed wrkld.
 8-cores, 3MB/core
• OoO, 2GHz, 96-entry ROB
• Folded 2D-torus
 2-cycle router, 1-cycle link
• 45ns memory
21
© Hardavellas
Cache Access Classification Example
• Each bubble: cache blocks shared by x cores
• Size of bubble proportional to % L2 accesses
• y axis: % blocks in bubble that are read-write
[Figure: % read-write blocks in each bubble vs. number of sharers (0-20), with bubbles for Instructions, Data-Private, and Data-Shared]
22
© Hardavellas
Cache Access Clustering
[Figure: two panels (Server Apps, Scientific/MP Apps) plotting % read-write blocks in each bubble vs. number of sharers; private-data clusters map to "migrate locally", shared read-write clusters to "share (addr-interleave)", and read-only instruction clusters to "replicate"]
Accesses naturally form 3 clusters
23
© Hardavellas
Instruction Replication
• Instruction working set too large for one cache slice
[Diagram: tiled multicore divided into clusters of neighboring L2 slices]
Distribute in cluster of neighbors, replicate across
24
© Hardavellas
Reactive NUCA in a nutshell
• Classify accesses
 private data: like private scheme (migrate)
 shared data: like shared scheme (interleave)
 instructions: controlled replication (middle ground)
To place cache blocks, we first need to classify them
25
© Hardavellas
Outline
• Introduction
• Access Classification and Block Placement
• Reactive NUCA Mechanisms
• Evaluation
• Conclusion
26
© Hardavellas
Classification Granularity
• Per-block classification
 High area/power overhead (cut L2 size by half)
 High latency (indirection through directory)
• Per-page classification (utilize OS page table)
 Persistent structure
 Core accesses the page table for every access anyway (TLB)
 Utilize already existing SW/HW structures and events
 Page classification is accurate (<0.5% error)
Classify entire data pages, page table/TLB for bookkeeping
27
© Hardavellas
Classification Mechanisms
• Instructions classification: all accesses from L1-I (per-block)
• Data classification: private/shared per-page at TLB miss
[Diagram: on the 1st access (Ld A, TLB miss) by core i, the OS marks page A as "Private to i"; on a later access (Ld A, TLB miss) by another core j, the OS re-marks A as "Shared"]
Bookkeeping through OS page table and TLB
28
© Hardavellas
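The classification itself is a tiny state machine driven by TLB misses. A sketch in C under assumed structure and field names; the talk specifies only the transitions (first toucher makes a page Private, an access by a different core re-classifies it as Shared).

/* Page classification driven by TLB misses -- a sketch; names are illustrative.
 * The backup "Data Classification Timeline" slide covers the shootdown and
 * eviction steps of the Private -> Shared transition. */
enum page_class { CLASS_INVALID, CLASS_PRIVATE, CLASS_SHARED };

struct page_info {
    enum page_class cls;      /* P / S / I bits kept in the page table entry */
    int             owner_l2; /* L2 slice of the first toucher (valid if P)  */
};

void classify_on_tlb_miss(struct page_info *pg, int requesting_core)
{
    if (pg->cls == CLASS_INVALID) {           /* 1st access: private to "i"  */
        pg->cls      = CLASS_PRIVATE;
        pg->owner_l2 = requesting_core;
    } else if (pg->cls == CLASS_PRIVATE && pg->owner_l2 != requesting_core) {
        pg->cls = CLASS_SHARED;               /* access by another core      */
    }
    /* the TLB fill returns pg->cls to the requesting core */
}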
Page Table and TLB Extensions
• Core accesses the page table for every access anyway (TLB)
 Pass information from the “directory” to the core
• Utilize already existing SW/HW structures and events
TLB entry: [ P/S (1 bit) | vpage | ppage ]
Page table entry: [ P/S/I (2 bits) | L2 id (log(n) bits) | vpage | ppage ]
Page granularity allows simple + practical HW
29
© Hardavellas
Data Class Bookkeeping and Lookup
• private data: place in local L2 slice
 Page table entry: [ P | L2 id | vpage | ppage ]
 TLB entry: [ P | vpage | ppage ]
• shared data: place in aggregate L2 (addr interleave)
 Page table entry: [ S | L2 id | vpage | ppage ]
 TLB entry: [ S | vpage | ppage ]
 Physical Addr.: [ tag | L2 id | cache index | offset ]
30
© Hardavellas
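Putting the two cases together, lookup reduces to one of two slice computations. A sketch in C assuming 64-byte blocks, a 1,024-set slice index, and 16 slices (all illustrative); the "L2 id" bit position follows the physical-address layout shown above.

#include <stdint.h>

#define N_SLICES   16   /* slices on chip (assumed)              */
#define BLOCK_BITS  6   /* 64-byte blocks (assumed)              */
#define INDEX_BITS 10   /* per-slice cache index width (assumed) */

enum tlb_class { TLB_PRIVATE, TLB_SHARED };

struct tlb_info {                /* what a TLB hit hands back           */
    enum tlb_class cls;          /* the P/S bit                         */
    int            owner_l2;     /* local slice id recorded for P pages */
};

/* Which L2 slice to probe for a data access: private pages go to the recorded
 * local slice, shared pages to the slice named by the "L2 id" address bits. */
int l2_slice_for(uint64_t paddr, const struct tlb_info *tlb)
{
    if (tlb->cls == TLB_PRIVATE)
        return tlb->owner_l2;
    return (int)((paddr >> (BLOCK_BITS + INDEX_BITS)) & (N_SLICES - 1));
}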
Coherence: No Need for HW Mechanisms at LLC
• Reactive NUCA placement guarantee
 Each R/W datum in unique & known location
Shared data: addr-interleave
Private data: local slice
[Diagram: tiled multicore shown for each placement]
Fast access, eliminates HW overhead, SIMPLE
31
© Hardavellas
Instructions Lookup: Rotational Interleaving
[Figure: 4×8 tile grid with rotational IDs (RIDs) 0-3; the RID increases by +1 going east and by +log2(k) going south; a lookup for PC 0xfa480 goes to the local slice or to one of 3 neighbors]
size-4 clusters: each slice caches the same blocks on behalf of any cluster; a lookup goes to the local slice or one of 3 neighbors
Destination = (Addr - RID + 1) & (n - 1)
Fast access (nearest-neighbor, simple lookup)
Balance access latency with capacity constraints
Equal capacity pressure at overlapped slices
32
© Hardavellas
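A minimal sketch in C of a rotational-interleaving lookup for size-4 clusters. The mesh width, the RID assignment (+1 east, +2 south), and the mapping of the offset D to {west, self, east, south} are assumptions consistent with the figure and formula above, not spelled out in the talk; mesh-edge wrap-around is omitted for brevity.

#include <stdint.h>

#define N_CLUSTER 4          /* slices per cluster (n)              */
#define MESH_W    8          /* tiles per row of the mesh (assumed) */

static inline int rid_of(int tile_id)
{
    /* RID = (column + 2*row) mod 4 reproduces the +1-east / +2-south pattern */
    int row = tile_id / MESH_W, col = tile_id % MESH_W;
    return (col + 2 * row) & (N_CLUSTER - 1);
}

/* Destination tile for an instruction block, looked up from the center tile. */
int rotational_lookup(uint64_t block_addr, int center_tile)
{
    int rid_dest   = (int)(block_addr & (N_CLUSTER - 1)); /* low addr bits name the RID */
    int rid_center = rid_of(center_tile);
    int d = (rid_dest - rid_center + 1) & (N_CLUSTER - 1); /* offset within the cluster */

    switch (d) {                   /* assumed D -> neighbor encoding */
    case 0:  return center_tile - 1;        /* west  (RID_center - 1) */
    case 1:  return center_tile;            /* self                   */
    case 2:  return center_tile + 1;        /* east  (RID_center + 1) */
    default: return center_tile + MESH_W;   /* south (RID_center + 2) */
    }
}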
Outline
• Introduction
• Access Classification and Block Placement
• Reactive NUCA Mechanisms
• Evaluation
• Conclusion
33
© Hardavellas
Evaluation
[Figure: speedup over the private scheme (-20% to 60%) for Shared (S), R-NUCA (R), and Ideal (I) on OLTP DB2, Apache, DSS Qry6, DSS Qry8, DSS Qry13, em3d, OLTP Oracle, and MIX, grouped into private-averse and shared-averse workloads]
Delivers robust performance across workloads
Shared: same for Web, DSS; 17% for OLTP, MIX
Private: 17% for OLTP, Web, DSS; same for MIX
34
© 2009 Hardavellas
Conclusions
• Data may exhibit arbitrarily complex behaviors
 ...but few that matter!
• Learn the behaviors that matter at run time
 Make the common case fast, the rare case correct
• Reactive NUCA: near-optimal cache block placement
 Simple, scalable, low-overhead, transparent, no coherence
 Robust performance
 Matches best alternative, or 17% better; up to 32%
 Near-optimal placement (-5% avg. from ideal)
35
© Hardavellas
Thank You!
For more information:
• N. Hardavellas, M. Ferdman, B. Falsafi and A. Ailamaki. Near-Optimal Cache Block Placement with Reactive Nonuniform Cache Architectures. IEEE Micro Top Picks, Vol. 30(1), pp. 20-28, January/February 2010.
• N. Hardavellas, M. Ferdman, B. Falsafi and A. Ailamaki. Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches. ISCA 2009.
http://www.eecs.northwestern.edu/~hardav/
36
© Hardavellas
BACKUP SLIDES
37
© 2009 Hardavellas
Why Are Caches Growing So Large?
• Increasing number of cores: cache grows commensurately
 Fewer but faster cores have the same effect
• Increasing datasets: faster than Moore’s Law!
• Power/thermal efficiency: caches are “cool”, cores are “hot”
 So, it's easier to fit more cache in a power budget
• Limited bandwidth: large cache == more data on chip
 Off-chip pins are used less frequently
38
© Hardavellas
Backup Slides
ASR
39
© 2009 Hardavellas
ASR vs. R-NUCA Configurations
                                 ASR-1      ASR-2      R-NUCA
Core Type                        In-Order   OoO        OoO
L2 Size (MB)                     4          16         16
Memory latency (cycles)          150        500        90
Local L2 latency (cycles)        12         20         16
Avg. Shared L2 latency (cycles)  25         44         22
Memory / Local L2                12.5×      25.0×      5.6×
Shared L2 / Local L2             2.1×       2.2×       38%
40
© 2009 Hardavellas
ASR design space search
[Figure: speedup over private (-6% to 6%) for ASR and fixed allocation levels (Alloc 0%, 25%, 50%, 75%, 100%) on OLTP DB2, Apache, DSS Qry8, em3d, OLTP Oracle, and MIX]
41
© Hardavellas
Backup Slides
Prior Work
42
© 2009 Hardavellas
Prior Work
• Several proposals for CMP cache management
 ASR, cooperative caching, victim replication,
CMP-NuRapid, D-NUCA
• ...but suffer from shortcomings
 complex, high-latency lookup/coherence
 don’t scale
 lower effective cache capacity
 optimize only for subset of accesses
We need:
Simple, scalable mechanism for fast access to all data
43
© Hardavellas
Shortcomings of prior work
• L2-Private
 Wastes capacity
 High latency (3 slice accesses + 3 hops on shr.)
• L2-Shared
 High latency
• Cooperative Caching
 Doesn’t scale (centralized tag structure)
• CMP-NuRapid
 High latency (pointer dereference, 3 hops on shr)
• OS-managed L2
 Wastes capacity (migrates all blocks)
 Spill to neighbors useless (all run same code)
44
© Hardavellas
Shortcomings of Prior Work
• D-NUCA
 No practical implementation (lookup?)
• Victim Replication
 High latency (like L2-Private)
 Wastes capacity (home always stores block)
• Adaptive Selective Replication (ASR)
 High latency (like L2-Private)
 Capacity pressure (replicates at slice granularity)
 Complex (4 separate HW structures to bias coin)
45
© Hardavellas
Backup Slides
Classification and Lookup
46
© 2009 Hardavellas
Data Classification Timeline
[Diagram: core i loads A (TLB miss); the OS records the page as "P, i" and core i allocates A. Later, core j (j ≠ i) loads A (TLB miss); the OS invalidates core i's TLB entry, core i's L2 evicts A, the page is re-marked "S", and the reply lets core j (and later core k) allocate A]
Fast & simple lookup for data
47
© 2009 Hardavellas
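A sketch in C of the Private-to-Shared transition in the timeline above, expanding the earlier classification sketch with the shootdown and eviction steps; tlb_shootdown() and l2_evict_page() are hypothetical stand-ins for the "inval A" and "evict A" actions on the slide, and only the ordering is taken from the talk.

#include <stdint.h>

enum page_class { CLASS_INVALID, CLASS_PRIVATE, CLASS_SHARED };

struct page_entry {
    enum page_class cls;
    int             owner;   /* core recorded on first touch */
};

/* Hypothetical stand-ins for the OS/hardware actions shown on the slide. */
static void tlb_shootdown(int core)                 { (void)core;  /* "inval A" */ }
static void l2_evict_page(int core, uint64_t vpage) { (void)core; (void)vpage; /* "evict A" */ }

void data_tlb_miss(struct page_entry *pe, uint64_t vpage, int core)
{
    if (pe->cls == CLASS_INVALID) {                  /* core i's 1st access   */
        pe->cls   = CLASS_PRIVATE;
        pe->owner = core;
    } else if (pe->cls == CLASS_PRIVATE && pe->owner != core) {
        tlb_shootdown(pe->owner);                    /* invalidate i's entry  */
        l2_evict_page(pe->owner, vpage);             /* flush A from i's L2   */
        pe->cls = CLASS_SHARED;                      /* re-mark page shared   */
    }
    /* the reply ("reply A") carries pe->cls to the requesting core j */
}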
Misclassifications at Page Granularity
[Figure: breakdown of total L2 accesses per workload (OLTP DB2, OLTP Oracle, Apache, DSS Qry6, DSS Qry8, DSS Qry13, em3d, MIX) into correctly classified accesses vs. misclassifications: pages mixing instructions and data, pages mixing private and shared data, and private data classified as shared]
Access misclassifications
• A page may service multiple access types
• But, one type always dominates accesses
Classification at page granularity is accurate
48
© Hardavellas
Backup Slides
Placement
49
© 2009 Hardavellas
Private Data Placement
[Figure: CDF of total L2 accesses vs. private-data footprint (KB, log scale up to 1,048,576) for OLTP DB2, OLTP Oracle, Apache, DSS Qry6, DSS Qry8, DSS Qry13, em3d, and MIX]
• Spill to neighbors if working set too large?
 NO!!! Each core runs similar threads
Store in local L2 slice (like in private cache)
50
© Hardavellas
Private Data Working Set
[Figure: CDF of total L2 accesses vs. private-data footprint (KB, log scale) per workload]
• OLTP: Small per-core work. set (3MB/16 cores = 200KB/core)
• Web: primary wk. set <6KB/core, remaining <1.5% L2 refs
• DSS: Policy doesn’t matter much
(>100MB work. set, <13% L2 refs → very low reuse on private)
51
© Hardavellas
Shared Data Placement
[Figure: per-workload breakdown of total L2 accesses to shared data by reuse (1st, 2nd, 3rd-4th, 5th-8th, 9+ access), and CDF of total L2 accesses vs. shared-data footprint (KB, log scale)]
• Read-write + large working set + low reuse
 Unlikely to be in local slice for reuse
• Also, next sharer is random [WMPI’04]
Address-interleave in aggregate L2 (like shared cache)
52
© Hardavellas
Shared Data Working Set
[Figure: CDF of total L2 accesses vs. shared-data footprint (KB, log scale) per workload]
53
© Hardavellas
Instruction Placement
[Figure: per-workload breakdown of total L2 instruction accesses by reuse (1st, 2nd, 3rd-4th, 5th-8th, 9+ access), and CDF of total L2 accesses vs. instruction footprint (KB, log scale)]
• Working set too large for one slice
 Slices store private & shared data too!
 Sufficient capacity with 4 L2 slices
Share in clusters of neighbors, replicate across
54
© Hardavellas
Instructions Working Set
[Figure: CDF of total L2 accesses vs. instruction footprint (KB, log scale up to 4,096) per workload]
55
© Hardavellas
Backup Slides
Rotational Interleaving
56
© 2009 Hardavellas
Instruction Classification and Lookup
• Identification: all accesses from L1-I
• But, working set too large to fit in one cache slice
[Diagram: tiled multicore with overlapping clusters of neighboring L2 slices]
Share within neighbors’ cluster, replicate across
57
© 2009 Hardavellas
Rotational Interleaving
[Figure: 4×8 grid of TileIDs 0-31 with RotationalIDs 0-3; the RotationalID increases by +1 going east and by +log2(k) going south]
Addr → RotationalID_dest
D = (RotationalID_dest - RotationalID_center + 1) & (n - 1)
(D, RotationalID_center, TileID_center) → TileID_dest
Fast access (nearest-neighbor, simple lookup)
Equalize capacity pressure at overlapping slices
58
© 2009 Hardavellas
Nearest-neighbor size-8 clusters
[Figure: tile grid with rotational IDs 0-7 forming overlapping nearest-neighbor size-8 clusters; C marks the requesting (center) tile and D a destination tile]
59
© Hardavellas