Cooperative Cache Scrubbing
Jennifer B. Sartor, Wim Heirman, Steve Blackburn*, Lieven Eeckhout, Kathryn S. McKinley^
PACT 2014
Multicore Challenge
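Objects rapidly allocated and short-lived
[Diagram: software stack (Application, Managed language runtime environment, Operating System) on a chip with four processors (P), each with its own cache ($), a last-level cache (LLC), and memory (DRAM)]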
p. 2
Problem: Allocation Wall
Objects rapidly allocated and short-lived
[Diagram: software stack (Application, Managed language runtime environment, Operating System) on a chip with four processors (P), each with its own cache ($); DEAD lines shown in every cache and in the LLC, above memory (DRAM)]
p. 3
Problem: Bandwidth & Power Wall
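Objects rapidly allocated and short-lived
[Diagram: software stack (Application, Managed language runtime environment, Operating System) on a chip with four processors (P), each with its own cache ($); DEAD lines in every cache; DEAD lines and zero-initialization lines (0000000) shown between the LLC and memory (DRAM)]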
p. 4
Cooperative Cache Scrubbing
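Objects rapidly allocated and short-lived
[Diagram: software stack (Application, Managed language runtime environment, Operating System) on a chip with four processors (P), each with its own cache ($); DEAD lines in every cache; a DEAD line and zero-initialization lines (0000000) shown at the LLC, above memory (DRAM)]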
p. 5
Generational Garbage Collection
Heap split into mature space and nursery
Young objects die quickly
Nursery
Traced for live objects
Live objects copied to mature space
Reclaimed ‘en masse’
[Diagram: DEAD lines shown in the 8MB LLC]
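The nursery steps listed above (trace for live objects, copy survivors to mature space, reclaim everything else at once) can be sketched as a toy model. This is a minimal illustration, not the collector used in the talk: the names `Obj`, `GenerationalHeap`, `alloc`, and `collect_nursery` are hypothetical, and the remembered sets and write barriers a real generational collector needs are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Obj:
    name: str
    size: int
    refs: list = field(default_factory=list)  # outgoing references


class GenerationalHeap:
    """Toy two-generation heap: a bump-allocated nursery plus a mature space."""

    def __init__(self, nursery_bytes):
        self.nursery_bytes = nursery_bytes
        self.nursery = []   # young objects, in allocation order
        self.used = 0       # bump-pointer offset into the nursery
        self.mature = []    # survivors copied out of the nursery

    def alloc(self, name, size, roots):
        """Allocate in the nursery, collecting first if it would overflow."""
        if self.used + size > self.nursery_bytes:
            self.collect_nursery(roots)
        obj = Obj(name, size)
        self.nursery.append(obj)
        self.used += size
        return obj

    def collect_nursery(self, roots):
        """Trace from the roots, copy live young objects to mature space,
        then reclaim the whole nursery at once. Returns the dead byte count."""
        reached, stack = set(), list(roots)
        while stack:
            obj = stack.pop()
            if id(obj) not in reached:
                reached.add(id(obj))
                stack.extend(obj.refs)
        survivors = [o for o in self.nursery if id(o) in reached]
        dead_bytes = self.used - sum(o.size for o in survivors)
        self.mature.extend(survivors)
        self.nursery, self.used = [], 0   # 'en masse' reclamation
        return dead_bytes


heap = GenerationalHeap(nursery_bytes=64)
roots = []
roots.append(heap.alloc("keep", 16, roots))
for i in range(3):
    heap.alloc(f"tmp{i}", 16, roots)      # never referenced again
dead = heap.collect_nursery(roots)
print(dead, [o.name for o in heap.mature])
```

In this run three of the four young objects are unreachable, so most of the nursery's bytes are dead at collection time, which is the situation the dead-line slides describe.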
p. 6
Dead Lines in LLC (8MB)
p. 7
Dead Data Written Back?
[Diagram: software stack (Application, Managed language runtime environment, Operating System) on a chip with four processors (P), each with its own cache ($); DEAD lines in every cache and in the LLC, above memory (DRAM)]
p. 8
Useless Write Backs (8MB LLC)
p. 9
Cooperative Cache Scrubbing
Communicate managed language’s semantic information to hardware
Caches
‘Scrub’ dead lines (avoids writes to DRAM)
Invalidate
Unset dirty bit
Zero lines without fetch (avoids reads from DRAM)
Result
Better cache management
Avoid traffic to DRAM
Save DRAM energy
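To see why scrubbing and fetch-free zeroing cut DRAM traffic, here is a small counting model. It is a sketch under strong assumptions: a fully associative write-back cache with FIFO eviction, a nursery twice the cache size, and every nursery line dead at collection time. `ToyCache`, `scrub`, `zero_without_fetch`, and `run` are hypothetical names for illustration, not the hardware interface from the talk.

```python
class ToyCache:
    """Fully associative write-back cache with FIFO eviction (toy model)."""

    def __init__(self, capacity_lines):
        self.capacity = capacity_lines
        self.lines = {}        # address -> dirty flag, in insertion order
        self.dram_reads = 0
        self.dram_writes = 0

    def _evict_if_full(self):
        if len(self.lines) >= self.capacity:
            addr = next(iter(self.lines))   # oldest line is the victim
            if self.lines.pop(addr):        # dirty victim -> write-back
                self.dram_writes += 1

    def write(self, addr):
        """Ordinary store: a miss fetches the line from DRAM first."""
        if addr not in self.lines:
            self._evict_if_full()
            self.dram_reads += 1
        self.lines[addr] = True

    def scrub(self, addr):
        """Line is known dead: clear its dirty bit so it is never written back."""
        if addr in self.lines:
            self.lines[addr] = False

    def zero_without_fetch(self, addr):
        """Install an all-zero line in the cache without reading DRAM."""
        if addr not in self.lines:
            self._evict_if_full()
        self.lines[addr] = True


def run(cooperative, nursery_lines=8, cache_lines=4, collections=2):
    cache = ToyCache(cache_lines)
    for _ in range(collections):
        for addr in range(nursery_lines):       # allocate through the nursery
            if cooperative:
                cache.zero_without_fetch(addr)
            else:
                cache.write(addr)
        if cooperative:                         # nursery collected: all dead
            for addr in range(nursery_lines):
                cache.scrub(addr)
    return cache.dram_reads, cache.dram_writes


print("baseline    (reads, writes):", run(False))
print("cooperative (reads, writes):", run(True))
```

The model only counts line transfers. It says nothing about timing or energy, and the reduction it shows depends entirely on the assumed sizes.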
p. 10
Dead Data Written in Cache?
Heap split into mature space and nursery
Young objects die quickly
Nursery
Traced for live objects
Live objects copied to mature space
Reclaimed ‘en masse’
[Diagram: DEAD lines and a zeroed (0000000) line shown in the LLC]
p. 11
Dead Lines Written in LLC (8MB)
p. 12
SW-HW Cooperative Scrubbing
Software
Identify cache line-aligned dead/zero region
Generational Immix collector (stop-the-world)
After nursery collection, call scrub instruction on each line in entire range
Call zero instructions to zero region (32KB)
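The software steps above amount to walking a region one cache line at a time. The sketch below assumes a 64-byte line (the slides do not state the line size) and uses callbacks in place of the scrub and zero instructions. `line_aligned_range`, `scrub_range`, and `zero_chunk` are hypothetical helpers, and shrinking the range inward so that partially covered lines are left alone is an assumption about how alignment would be handled.

```python
LINE = 64                  # assumed cache line size in bytes
ZERO_CHUNK = 32 * 1024     # region zeroed at a time (32KB on the slide)


def line_aligned_range(start, end, line=LINE):
    """Shrink [start, end) inward so it covers only whole cache lines."""
    first = (start + line - 1) // line * line   # round start up
    last = end // line * line                   # round end down
    return (first, last) if first < last else (first, first)


def scrub_range(start, end, scrub_line, line=LINE):
    """Call scrub_line(addr) once per whole cache line in the dead region."""
    first, last = line_aligned_range(start, end, line)
    for addr in range(first, last, line):
        scrub_line(addr)
    return (last - first) // line


def zero_chunk(start, zero_line, size=ZERO_CHUNK, line=LINE):
    """Zero one allocation chunk line by line; start must be line-aligned."""
    if start % line:
        raise ValueError("chunk start must be cache-line aligned")
    for addr in range(start, start + size, line):
        zero_line(addr)
    return size // line


scrubbed, zeroed = [], []
n_scrubbed = scrub_range(100, 400, scrubbed.append)
n_zeroed = zero_chunk(0, zeroed.append)
print(n_scrubbed, scrubbed)
print(n_zeroed)
```

In a real runtime the callbacks would be the new instructions issued by the collector. Here they only record addresses so the walk can be inspected.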
Hardware
Scrubbing (LLC)
clinvalidate: invalidates cache line (cf. PowerPC’s dcbi, ARM)
clundirty: clears dirty bit
clclean: clears dirty bit, moves line to LRU
Zeroing (L2)
clzero: zero cache line without fetch (cf. PowerPC’s dcbz)
Modifications to MESI cache coherence protocol
Back-propagation from LLC to L1/L2 cache levels
Local coherence transitions (no off-chip)
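The three scrubbing instructions differ mainly in what happens to the line's place in the replacement order. The following sketch models a single cache set with LRU replacement, using the one-line descriptions above as the only specification. `LruSet`, `access`, and `demo` are hypothetical names, and back-propagation to L1/L2 and coherence are ignored.

```python
from collections import OrderedDict


class LruSet:
    """One cache set: first entry is LRU (next victim), last entry is MRU."""

    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()   # tag -> dirty flag
        self.writebacks = 0

    def access(self, tag, write=False):
        if tag in self.lines:
            dirty = self.lines.pop(tag) or write
        else:
            if len(self.lines) == self.ways:
                _, victim_dirty = self.lines.popitem(last=False)  # evict LRU
                self.writebacks += int(victim_dirty)
            dirty = write
        self.lines[tag] = dirty      # (re)insert at the MRU position

    def clinvalidate(self, tag):
        self.lines.pop(tag, None)    # drop the line, no write-back

    def clundirty(self, tag):
        if tag in self.lines:
            self.lines[tag] = False  # clear dirty bit, keep LRU position

    def clclean(self, tag):
        if tag in self.lines:
            self.lines[tag] = False                   # clear dirty bit
            self.lines.move_to_end(tag, last=False)   # move to LRU position


def demo(instr):
    """Dirty 'live' and 'dead' lines in a 2-way set, then one new access."""
    s = LruSet(ways=2)
    s.access("live", write=True)
    s.access("dead", write=True)
    if instr:
        getattr(s, instr)("dead")
    s.access("new")
    return s.writebacks, sorted(s.lines)


for instr in (None, "clundirty", "clclean", "clinvalidate"):
    print(instr, demo(instr))
```

In this small scenario clundirty leaves the dead line in place, so the older live line is still the victim. clclean and clinvalidate free the dead line's way first, and the live line survives.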
p. 14
MESI Coherence Transitions
[State diagram: MESI states M, E, S, I; edges labeled instruction/bus action with clinvalidate/-, clclean/-, and a combined clinvalidate/clclean/-, where "-" denotes no bus action]
p. 15
MESI Coherence Transitions
[State diagram: MESI states M, E, S, I; clzero edges labeled clzero/- and clzero/BusInvalidate; an external BusInvalidate arrives from another LLC]
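One way to read the two transition diagrams is as a lookup table from (state, instruction) to (next state, bus action). The table below is an interpretation, not a specification taken from the slides. It assumes that clinvalidate always ends in I, that clundirty and clclean turn M into E and leave other states unchanged, that clzero always ends in M and raises BusInvalidate only from S or I, and that an external BusInvalidate sends other caches to I. `TRANSITIONS` and `step` are hypothetical names.

```python
STATES = ("M", "E", "S", "I")

# (state, instruction) -> (next_state, bus_action); None means no bus traffic.
TRANSITIONS = {}
for s in STATES:
    TRANSITIONS[(s, "clinvalidate")] = ("I", None)
    # Assumed: a modified line becomes clean-exclusive, other states are unchanged.
    cleaned = "E" if s == "M" else s
    TRANSITIONS[(s, "clundirty")] = (cleaned, None)
    TRANSITIONS[(s, "clclean")] = (cleaned, None)
    # Assumed: zeroing takes ownership; other copies may exist only in S or I.
    bus = "BusInvalidate" if s in ("S", "I") else None
    TRANSITIONS[(s, "clzero")] = ("M", bus)


def step(states, who, instr):
    """Apply instr at cache `who`; `states` holds one MESI state per cache."""
    nxt, bus = TRANSITIONS[(states[who], instr)]
    out = list(states)
    out[who] = nxt
    if bus == "BusInvalidate":
        # Every other cache observes the external BusInvalidate and drops the line.
        out = [st if i == who else "I" for i, st in enumerate(out)]
    return out, bus


states = ["S", "S"]
for instr in ("clzero", "clclean", "clinvalidate"):
    states, bus = step(states, 0, instr)
    print(instr, states, bus)
```

Under this reading only clzero on a shared or invalid line produces coherence traffic.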
p. 16
Methodology
Sniper simulator
4 cores, 8MB shared L3 (LLC), McPAT
Extensions for JVM
Works with JIT compiler
Emulate system calls (futex & nanosleep)
JVM-simulator communication with new instruction
Jikes RVM 3.1.2 and DaCapo benchmarks
Generational Immix garbage collector
4 application threads, 4 GC threads
2x minimum heap
Replay compilation, 2nd invocation
p. 17
DRAM Writes (8MB nursery)
[Bar chart: DRAM writes relative to baseline (%), y-axis 0 to 120; series: clinvalidate, clundirty, clclean, clzero, clclean+clzero]
p. 18
DRAM Reads (8MB nursery)
[Bar chart: DRAM reads relative to baseline (%), y-axis 0 to 225; series: clinvalidate, clundirty, clclean, clzero, clclean+clzero]
p. 21
Dynamic DRAM Energy (8MB nursery)
[Bar chart: dynamic DRAM energy reduction (%), y-axis 0 to 80, including the mean; series: clinvalidate, clundirty, clclean, clzero, clclean+clzero]
p. 26
Total DRAM Energy
[Bar chart: total DRAM energy reduction (%), x-axis 4M, 8M, 16M, y-axis -5 to 25, annotation: -22%; series: clinvalidate, clundirty, clclean, clzero, clclean+clzero]
p. 28
Total DRAM Traffic
[Bar chart: total DRAM traffic reduction (%), x-axis 4M, 8M, 16M, y-axis -50 to 100, annotation: -14x; series: clinvalidate, clundirty, clclean, clzero, clclean+clzero]
p. 30
clclean+clzero Improvements
[Chart: clclean+clzero improvements, x-axis 4MB, 8MB, 16MB, y-axis 0% to 100%]
p. 31
Related Work
Cooperative cache management
ESKIMO by Isen & John, MICRO 09
Useless reads and writes to DRAM by sequential C programs
Reduce energy
Require large map in hardware, extra cache bits
Wang et al., PACT 02 / ISCA 03; Sartor et al., 05
C & Fortran static analysis to give cache hints to evict or keep data
Zero initialization [Yang et al., OOPSLA 11]
Studied costs in time, cache and traffic
Use non-temporal writes to DRAM, increase bandwidth
p. 32
Conclusions
Software-hardware cooperative cache scrubbing
Leverages region allocation semantics
Changes to MESI coherence protocol
New multicore architectural simulation methodology
Reductions
59% traffic
14% DRAM energy
4.6% execution time
http://users.elis.ugent.be/~jsartor/
p. 33
p. 34
Execution Time (8MB nursery)
[Bar chart: execution time reduction (%), y-axis 0 to 7, including the mean; series: clinvalidate, clundirty, clclean, clzero, clclean+clzero]
p. 35
Changes to MESI coherence protocol
State | clinvalidate | clundirty/clclean | clzero | BusInvalidate
M | invalidate L1/L2 (no WB), I | invalidate L1/L2 (no WB), E (clclean LRU) | ⁄ | invalidate L1/L2 (no WB), I
E | invalidate L1/L2, I | invalidate L1/L2 (clclean LRU) | M | invalidate L1/L2, I
S | invalidate L1/L2, I | invalidate L1/L2 (clclean LRU) | BusInvalidate, M | invalidate L1/L2, I
I | ⁄ | ⁄ | BusInvalidate, M | ⁄
p. 36
Total DRAM Energy (8MB nursery)
[Bar chart: total DRAM energy reduction (%), y-axis -10 to 60; series: clinvalidate, clundirty, clclean, clzero, clclean+clzero]
p. 37
Execution Time Across Nurseries
p. 38
Execution Time
p. 39
Dynamic DRAM Energy 8MB Nursery
p. 40