Amoeba-Cache
Adaptive Blocks for Eliminating
Waste in the Memory Hierarchy
Snehasish Kumar
Arrvindh Shriraman
Eric Matthews
Lesley Shannon
Hongzhou Zhao
Sandhya Dwarkadas
On-chip Storage
Amoeba Cache : Adaptive blocks for
Eliminating Waste in the Memory Hierarchy
Fixed granularity cache
Tag Array
Data Array
Cache data utilization
Tag Array
Data Array
Utilization = fraction of words touched in a cache block at the time of eviction
[Figure: tags and data in the cache, with untouched data highlighted]
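The utilization definition above can be made concrete with a small sketch. It assumes a 64-byte block of eight 8-byte words and a per-block bitmap of touched words; the bitmap, `WORDS_PER_BLOCK`, and both function names are illustrative and do not come from the slides.

```c
#include <stdint.h>

#define WORDS_PER_BLOCK 8  /* assumption: 64-byte block, 8-byte words */

/* Count the set bits in an 8-bit touched-word bitmap (hypothetical bookkeeping). */
static int count_touched(uint8_t touched)
{
    int n = 0;
    while (touched) {
        n += touched & 1u;
        touched >>= 1;
    }
    return n;
}

/* Utilization of one block at eviction time, as an integer percentage. */
static int block_utilization_pct(uint8_t touched)
{
    return 100 * count_touched(touched) / WORDS_PER_BLOCK;
}
```

For instance, a block evicted with four of its eight words touched would report 50% under this definition.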
Cache utilization
64K L1 – 4 ways – 64B/block
[Bar chart: cache utilization (0% to 100%) for cann., mcf, lbm, h2, jbb, apache, x264, firefox, tpcc and eclipse]
Block Distribution
64K – 64B/block
[Pie charts for Firefox, Canneal, Eclipse and Apache: share of blocks by # words touched (1-2, 3-4, 5-6, 7-8)]
Block Distribution
[Pie charts for Canneal at 64K – 64B/block and at 1M – 64B/block: share of blocks by # words touched (1-2, 3-4, 5-6, 7-8)]
Factors affecting cache utilization
• Application specific behaviour
  - Inefficient data structure access patterns
• Interaction with cache geometry
  - Way conflicts reduce block lifetime and cause poor utilization
Application Specific Behaviour
struct TIE {
  long long X, Y, Z;
  long long V, H;
  long long data[3];
} Imperial[1024];

Access in a loop
for (int i=0; i<1024; i++)
{
  Imperial[i].X = …;
  Imperial[i].Y = …;
  Imperial[i].Z = …;
  Imperial[i].V = …;
}

Data Array: X | Y | Z | V | H | Data[3]
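As a rough worked example of why this access pattern wastes cache space, the sketch below computes how much of each `struct TIE` element the loop writes. It assumes the loop touches only X, Y, Z and V, as shown on the slide; `WORD_OF` and `tie_touched_pct` are illustrative names.

```c
#include <stddef.h>

struct TIE {
    long long X, Y, Z;
    long long V, H;
    long long data[3];
};

/* Index of the long-long-sized word that holds a given field. */
#define WORD_OF(field) (offsetof(struct TIE, field) / sizeof(long long))

/* Percentage of the struct's words written by the loop,
   assuming it writes only X, Y, Z and V. */
static int tie_touched_pct(void)
{
    size_t total_words = sizeof(struct TIE) / sizeof(long long);  /* 8 words */
    size_t touched_words = 4;                                     /* X, Y, Z, V */
    return (int)(100 * touched_words / total_words);
}
```

With 8-byte `long long`, each element fills one 64-byte block, so a fixed-granularity cache would fetch eight words per element while the loop uses four of them.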
Cache Geometry
Data Array – 4 ways
[Figure: blocks 1 to 5 mapping to the same set]
Problem: Lots of data map to the same set
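The conflict problem can be illustrated with a conventional set-index calculation. This is a sketch for the 64K, 4-way, 64B/block L1 used earlier in the talk, and it assumes the usual modulo indexing; the slides do not state the indexing function, and the names below are illustrative.

```c
#include <stdint.h>

enum {
    CACHE_BYTES = 64 * 1024,
    WAYS        = 4,
    BLOCK_BYTES = 64,
    NUM_SETS    = CACHE_BYTES / (WAYS * BLOCK_BYTES)  /* 256 sets */
};

/* Conventional modulo set indexing (an assumption, not taken from the slides). */
static unsigned set_index(uint64_t addr)
{
    return (unsigned)((addr / BLOCK_BYTES) % NUM_SETS);
}

/* Addresses this many bytes apart land in the same set. */
static uint64_t conflict_stride(void)
{
    return (uint64_t)NUM_SETS * BLOCK_BYTES;  /* 16 KB */
}
```

Under these assumptions, five blocks spaced 16 KB apart compete for the four ways of a single set, so one of them is evicted early regardless of how much of it has been used.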
Implications
1. Shrinks effective cache space
2. Increases miss rate
3. Wastes on-chip bandwidth
4. Increases on-chip cache energy consumption
Target Metrics
Amoeba Cache targets:
• Bandwidth
• Space Utilisation
• Miss Rate
Variable Granularity Blocks
Tag Array
Data Array
How to support a variable # of blocks per set?
How to support variable granularity for each block?
Our Approach : Amoeba Cache
Unified SRAM Array
Amoeba Cache
• Insert
• Lookup
• Partial Miss
• Overheads
SRAM Array
Bitmaps: Valid? and Tag?, one bit per word
Block layout in the SRAM Array:
  Tag (1 word): Region Tag, Start, End
  Data Block (1+ words)
Tag - Regions
Memory is divided into regions of RMAX bytes.
64 bit address, from the top: Region Tag | Set Index | Start / End (3 bits) | Byte (3 bits)
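The address split can be sketched as follows. This is only an illustration under stated assumptions: 8-byte words (3 byte-offset bits), an RMAX of 64 bytes (3 word-offset bits, the values a Start or End field would hold), and a hypothetical 8 set-index bits. `SET_BITS`, `struct addr_fields`, and `split_address` are invented names.

```c
#include <stdint.h>

enum {
    BYTE_BITS = 3,  /* byte within an 8-byte word */
    WORD_BITS = 3,  /* word within the region; range of Start / End */
    SET_BITS  = 8   /* hypothetical: 256 sets */
};

struct addr_fields {
    uint64_t region_tag;
    unsigned set_index;
    unsigned word;
    unsigned byte;
};

/* Peel the fields off the address from the least significant end. */
static struct addr_fields split_address(uint64_t addr)
{
    struct addr_fields f;
    f.byte = (unsigned)(addr & ((1u << BYTE_BITS) - 1));
    addr >>= BYTE_BITS;
    f.word = (unsigned)(addr & ((1u << WORD_BITS) - 1));
    addr >>= WORD_BITS;
    f.set_index = (unsigned)(addr & ((1u << SET_BITS) - 1));
    addr >>= SET_BITS;
    f.region_tag = addr;
    return f;
}
```

Because every word of a region shares the same set index, all blocks cut from one region compete for space in the same set.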
Example
struct TIE {
  long long X, Y, Z;
  long long V, H;
  long long data[3];
} Imperial;

Imperial.X = … ;
Miss
Invoke Spatial Granularity Predictor (PC/Region based)
Fetch: Tag | X | Y | Z | V
Amoeba Cache – Insert (8 words/set)
Miss: insert 4+1 words (Tag | X | Y | Z | V)
1. Find free space in the SRAM Array / Set: substring(00000) on the Valid? bitmap, Pos: 0
     Valid? 00000000
     Tag?   00000000
2. Update the bitmaps
     Valid? 11111000
     Tag?   10000000
3. Refill the SRAM Array / Set: Tag | X | Y | Z | V
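The insert steps can be modelled in software as a search over the Valid? bitmap followed by a bitmap update. The sketch below is a sequential stand-in for the substring() search and is not the hardware design; it assumes word position 0 is the most significant bit, matching the way the bitmaps are drawn, and every name in it is illustrative.

```c
#include <stdint.h>

#define WORDS_PER_SET 8

struct amoeba_set {
    uint8_t valid;  /* Valid? bitmap, word position 0 = most significant bit */
    uint8_t tag;    /* Tag? bitmap, same bit order */
};

/* Mask with `len` ones starting at word position `pos`. */
static uint8_t run_mask(int pos, int len)
{
    return (uint8_t)((((1u << len) - 1u) << (WORDS_PER_SET - pos - len)) & 0xFFu);
}

/* First-fit search for `len` contiguous free words; returns the position or -1. */
static int find_free_run(uint8_t valid, int len)
{
    for (int pos = 0; pos + len <= WORDS_PER_SET; pos++) {
        if ((valid & run_mask(pos, len)) == 0)
            return pos;
    }
    return -1;
}

/* Reserve one tag word plus `data_words` data words and mark the bitmaps.
   Returns the position of the tag word, or -1 if no run is free. */
static int insert_block(struct amoeba_set *s, int data_words)
{
    int len = data_words + 1;
    int pos = find_free_run(s->valid, len);
    if (pos < 0)
        return -1;
    s->valid |= run_mask(pos, len);
    s->tag   |= run_mask(pos, 1);
    return pos;
}
```

Starting from an empty set, inserting the 4+1 word block from the slide leaves Valid? at 11111000 and Tag? at 10000000. When no free run exists, a real design would first have to evict; this sketch simply reports failure.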
Example
struct TIE {
  long long X, Y, Z;
  long long V, H;
  long long data[3];
} Imperial;

Imperial.Y = … ;
Lookup: data from the cache
Cached block: Tag | X | Y | Z | V
Struct layout: X | Y | Z | V | H | Data[3]
Amoeba Cache – Lookup (8 words/set)
Address: Region Tag | Set Index | Word (W)
SRAM Array / Set: Tag | X | Y | Z | V, Tag? bitmap 10000000
1. Read the set into the Output Buffer; 2x1 multiplexers driven by the Tag? bitmap pick out the tag words (critical path)
2. Addr ∈ Tag: Region == Region Tag, Start ≤ W, End > W. Hit?
3. Word Selector
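The hit condition on this slide (matching region, Start ≤ W, End > W) can be written as a short software model. This is a sketch only: it keeps tag contents in a side array instead of inside the SRAM words, treats End as one past the last cached word, and uses invented names throughout.

```c
#include <stdint.h>
#include <stdbool.h>

#define WORDS_PER_SET 8

struct tag_word {
    uint64_t region;  /* Region Tag */
    uint8_t  start;   /* first word of the region held by the block */
    uint8_t  end;     /* assumed exclusive, so a hit needs end > W */
};

struct amoeba_set {
    uint8_t tag_bitmap;                   /* Tag? bitmap, position 0 = MSB */
    struct tag_word tags[WORDS_PER_SET];  /* meaningful only where Tag? is set */
};

/* Returns the position of the matching tag word, or -1 on a miss. */
static int lookup(const struct amoeba_set *s, uint64_t region, unsigned w)
{
    for (int pos = 0; pos < WORDS_PER_SET; pos++) {
        bool is_tag = (s->tag_bitmap >> (WORDS_PER_SET - 1 - pos)) & 1u;
        if (!is_tag)
            continue;
        const struct tag_word *t = &s->tags[pos];
        if (t->region == region && t->start <= w && t->end > w)
            return pos;
    }
    return -1;
}

/* Word selector: position in the set of word `w` for a block whose tag is at tag_pos. */
static int word_position(const struct amoeba_set *s, int tag_pos, unsigned w)
{
    return tag_pos + 1 + (int)(w - s->tags[tag_pos].start);
}
```

The hardware would evaluate all tag positions in parallel; the loop here only models the outcome. For the example block (Tag | X | Y | Z | V at position 0), a lookup of Y hits and selects word position 2.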
Partial Miss: Identify Sub-Blocks (Step 1 of 2)
Fetch New: Tag | X | Y | Z | V | H
1. New ∩ Tags: identify resident sub-blocks that overlap the new block (Tag | X | Y and Tag | V | H)
2. Evict Overlap to the MSHR
Partial Miss: Insert New Block (Step 2 of 2)
3. Allocate 6 words: Tag | X | Y | Z | V | H
4. Miss; the MSHR holds X | Y | Z? | V | H
5. Patch Missing ?'s
Occurs in ≈ 5 of 1000 accesses
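The overlap test behind a partial miss amounts to range intersection over word offsets. The sketch below is a software illustration under assumed conventions (half-open word ranges, bit i of a mask standing for word i); the types and function names are hypothetical.

```c
#include <stdbool.h>

struct word_range {
    unsigned start;  /* first word of the block */
    unsigned end;    /* one past the last word */
};

/* True when two blocks of the same region share at least one word. */
static bool ranges_overlap(struct word_range a, struct word_range b)
{
    return a.start < b.end && b.start < a.end;
}

/* Words of `fresh` already held by the resident block `old` (bit i = word i). */
static unsigned overlap_words(struct word_range fresh, struct word_range old)
{
    unsigned mask = 0;
    for (unsigned w = fresh.start; w < fresh.end; w++) {
        if (w >= old.start && w < old.end)
            mask |= 1u << w;
    }
    return mask;
}

/* Words of `fresh` that are not covered by `held_mask`. */
static unsigned missing_words(struct word_range fresh, unsigned held_mask)
{
    unsigned all = 0;
    for (unsigned w = fresh.start; w < fresh.end; w++)
        all |= 1u << w;
    return all & ~held_mask;
}
```

In the slide's example the new block covers X to H (words 0 to 4) while the cache already holds X, Y and V, H, which leaves only Z (word 2) marked as missing.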
Hardware Overheads
Extra Metadata: Valid? and Tag? bitmaps alongside the SRAM Array, 1 KB
Amoeba Critical Path: Latency +4%
Evaluation
• Parameters for latency and energy
• Workloads
Latency Parameters (cycles)
Cache Granularity: Fixed 1, Amoeba 1.04 (Latency +4%)
[Figure: CPU, 64K L1 and 1M LLC, with latencies 3, 20 and 300]
On-Chip Energy Parameters (pJ)
Amoeba Cache vs Fixed Granularity
64K L1: 101 and 105
1M LLC: 230 and 238
≈ 7 / word
Workloads
• 22 diverse workloads from
• PARSEC
• SPEC-CPU 2000 & 2006
• DaCapo ( Java Benchmarks )
• Apache, Firefox and PostgreSQL
Results
% Improvement in L1 Miss-Rate
[Bar chart: axis 0% to 40%]
Reduces L1 and L2 miss rate by 18%
% Improvement in L1 Miss-Bandwidth
[Bar chart: axis 0% to 75%]
Reduces on-chip bandwidth by 46%
Reduces off-chip bandwidth by 38%
% Improvement in memory energy
[Bar chart: axis 0% to 40%]
Reduces energy by 11%
% Improvement in execution time
[Bar chart: axis 0% to 20%, with one bar labelled 21%]
Improves performance by 10%
Results Summary
Amoeba-Cache
• Reduce cache pollution for applications with low cache utilization
• Improve performance for moderate cache utilization
• Maintain performance for high cache utilization workloads
• Save energy for streaming applications by keeping out unused words
Additional Results
• Lookup as an extra cache pipeline stage vs. throttling the CPU
  - With an extra pipeline stage, 8 of 22 applications show improvement
• Spatial Granularity Predictor
  - Indexing: address region better for 18 of 22
  - Training: evictions and first touch
  - Table size: 256 (PC) and 1024 (Region)
Additional Results
• Multicore Shared Cache
  - Reduces miss rate (avg 18%) and LLC miss bandwidth (16%-39%)
• Comparison against other designs
  - Fixed Granularity 2X
  - Sector Cache variants
  - Multi-$
Amoeba Cache
What?
—Enable variable granularity data caching
Why?
—Eliminate waste
How?
—Unify tag and data into a single SRAM array
—Afforded by recent technology trends
Where?
—Definitely at the L2, possibly at the L1
Frequently Asked Questions
1. Multiple threads?
2. Compare against other designs
3. Spatial Pattern Predictor
4. Replacement Policy
Multicore Shared Cache
Mix                              Miss T1   Miss T2   Miss T3   Miss T4   BW (All)
jbb x2, tpc-c x2                 12.38%    12.38%    22.29%    22.37%    39.07%
Firefox x2, x264 x2              3.82%     3.61%     –2.44%    0.43%     15.71%
cactus, fluid., omnet., sopl.    1.01%     1.86%     22.38%    0.59%     18.62%
canneal, astar, ferret, milc     4.85%     2.75%     19.39%    –4.07%    17.77%
Comparison
Designs: Multi-$, Sector Variants, Amoeba Cache
Criteria:
• Impact on Miss-Rate
• Impact on Bandwidth
• Low tag overhead
• Tradeoff data and tag space
• Dynamically resize blocks
[Table: per-design ratings (Yes / No / ~) for each criterion]
Comparison – Moderate Group – 64K
[Scatter plot: Bandwidth Ratio (0.4 to 1.0) against Miss Rate Ratio (1.0 to 1.6) for Fixed-2X, Amoeba, Multi$-25, Multi$-50, Sector-Pre and Sector (x:2.9)]
Spatial Pattern Predictor
Predictor History Table
  Index         Pattern
  PC / Region   01011111
  PC / Region   00011101
1. PC : Read Addr indexes the table and returns a pattern (00011101), with the Critical Word marked
2. What to do when there is no entry? Policy-Miss vs Policy-Bandwidth
Predictor Training
Data Array: add / update entry on evict
  Index         Pattern
  PC / Region   01011111
  PC / Region   00011101
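A pattern history table of this kind can be sketched in a few lines. The version below is a simplification and not the evaluated design: it is direct-mapped, stores the full key for matching, and takes the no-entry fallback pattern as a caller-supplied parameter. Every name and the table organisation are assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

#define TABLE_ENTRIES 1024  /* assumed size for this sketch */

struct pht_entry {
    bool     valid;
    uint64_t key;      /* PC or region address used as the index */
    uint8_t  pattern;  /* one bit per word of the region */
};

static struct pht_entry pht[TABLE_ENTRIES];

static unsigned pht_index(uint64_t key)
{
    return (unsigned)(key % TABLE_ENTRIES);
}

/* Training: on eviction, record which words of the block were touched. */
static void pht_train(uint64_t key, uint8_t touched_pattern)
{
    struct pht_entry *e = &pht[pht_index(key)];
    e->valid = true;
    e->key = key;
    e->pattern = touched_pattern;
}

/* Prediction: return the stored pattern, or the fallback when there is no entry. */
static uint8_t pht_predict(uint64_t key, uint8_t fallback_pattern)
{
    const struct pht_entry *e = &pht[pht_index(key)];
    if (e->valid && e->key == key)
        return e->pattern;
    return fallback_pattern;
}
```

The fallback argument is where a no-entry policy would plug in: a full-region pattern favours miss rate, while a pattern containing only the critical word favours bandwidth.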
Predictor – L1 Miss Rate (1 of 2)
[Bar chart: MPKI (0 to 10) for canne., eclip., firef., h2, tpc-c and x264; series: Aligned, Finite, Infinite, Finite+FT, History]
Predictor – L1 Miss Rate (2 of 2)
[Bar chart: MPKI (0 to 140) for apac., lbm, mcf and jbb; series: Aligned, Finite, Infinite, Finite+FT, History]
Predictor – L1 Miss Bandwidth (1 of 2)
[Bar chart: Bandwidth Rate (0 to 1800) for canne., eclip., firef., h2, tpc-c and x264; series: Aligned, Finite, Infinite, Finite+FT, History]
Predictor – L1 Miss Bandwidth (2 of 2)
[Bar chart: Bandwidth Rate (0 to 10000) for apac., lbm, mcf and jbb; series: Aligned, Finite, Infinite, Finite+FT, History]
Predictor – Summary
• For the majority of applications, Region Predictor with
  - 1024 entry table
  - Table with 8 ways x 128 sets
• PC Predictor is good for 5 applications
  - apache, art, mcf, lbm and omnetpp
Pseudo LRU Replacement
Way 0
Way 1
• Logically partition the set into N ways
• Pick a block at random from the way
• Unset the T? (Tag) and V? (Valid) bits
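The replacement steps can be sketched on the same bitmap model used for insertion. This is an illustration under assumptions: 8 words per set, word position 0 as the most significant bit, a logical way defined as an equal slice of word positions, and a caller-supplied random number. How the way itself is chosen is left out, and all names are hypothetical.

```c
#include <stdint.h>

#define WORDS_PER_SET 8

struct amoeba_set {
    uint8_t valid;  /* V? bitmap, word position 0 = most significant bit */
    uint8_t tag;    /* T? bitmap, same bit order */
};

/* Mask with `len` ones starting at word position `pos`. */
static uint8_t run_mask(int pos, int len)
{
    return (uint8_t)((((1u << len) - 1u) << (WORDS_PER_SET - pos - len)) & 0xFFu);
}

/* Pick, at random, a block whose tag word lies inside logical way `way`.
   Returns the tag position, or -1 if no block starts in that way. */
static int pick_victim(uint8_t tag, int way, int n_ways, unsigned rnd)
{
    int words_per_way = WORDS_PER_SET / n_ways;
    int first = way * words_per_way;
    int candidates[WORDS_PER_SET];
    int count = 0;
    for (int pos = first; pos < first + words_per_way; pos++) {
        if (tag & run_mask(pos, 1))
            candidates[count++] = pos;
    }
    return count ? candidates[rnd % (unsigned)count] : -1;
}

/* Evict a block of `len` words (tag word included) by clearing its bits. */
static void evict_block(struct amoeba_set *s, int tag_pos, int len)
{
    s->valid &= (uint8_t)~run_mask(tag_pos, len);
    s->tag   &= (uint8_t)~run_mask(tag_pos, 1);
}
```

Only the bitmaps change on eviction; the data words can stay in place until the space is reused, since lookups consult the T? and V? bits first.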
Word distribution for 64K L1
Access Distribution for L1
[Stacked bar chart: Words Accessed (%), 0 to 100, split into 1-2 Words, 3-4 Words, 5-6 Words and 7-8 Words]
Bar labels (%), in workload order:
apache 45, art 20, astar 39, cactus 79, cann… 30, eclipse 80, facesi… 77, ferret 82, firefox 49, fluid. 62, freq. 55, h2 38, jbb 40, lbm 32, mcf 29, milc 81, omnet. 33, soplex 21, tpc-c. 53, trade. 73, twolf 29, x264 46, mean 50
Block distribution for 64K L1
Amoeba block size distribution for L1
[Stacked bar chart: % of Amoeba Blocks, 0 to 100, split into 1-2 Words, 3-4 Words, 5-6 Words and 7-8 Words]
Bar labels (%), in workload order:
apache 92, art 80, astar 98, cactus 100, cann… 67, eclipse 98, facesi… 88, ferret 99, firefox 78, fluid. 100, freq. 94, h2 82, jbb 89, lbm 89, mcf 93, milc 100, omnet. 83, soplex 91, tpc-c. 91, trade. 97, twolf 70, x264 91, mean 90
L1 FSM
Miss-Rate ( 64K L1 )
[Bar chart: Fixed vs Amoeba, axis 0 to 80]
Miss Bandwidth Rate ( 64K L1 )
[Bar chart: Fixed vs Amoeba, axis 0 to 10000]
Energy Rate ( L1 + LLC ) – (nJ/KI)
[Bar chart: Fixed vs Amoeba, axis 0 to 100]
Reduction in execution time
[Bar chart: Fixed vs Amoeba, axis 0 to 16000]