Cooperative Cache Scrubbing
Jennifer B. Sartor, Wim Heirman, Steve Blackburn*, Lieven Eeckhout, Kathryn S. McKinley^
PACT 2014
Multicore Challenge
Objects rapidly allocated and short-lived
[Diagram: software stack (Application, Managed language runtime environment, Operating System) running on a chip with four processors (P), four caches ($) and an LLC, backed by memory (DRAM).]
p. 2
Problem: Allocation Wall
Objects rapidly allocated and short-lived
[Diagram: software stack (Application, Managed language runtime environment, Operating System) on a chip with four processors (P), four caches ($) and an LLC, backed by memory (DRAM); lines marked DEAD fill the cache hierarchy.]
p. 3
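Problem: Bandwidth & Power Wall
Objects rapidly allocated and short-lived
[Diagram: software stack (Application, Managed language runtime environment, Operating System) on a chip with four processors (P), four caches ($) and an LLC, backed by memory (DRAM); lines marked DEAD appear in the caches and in memory, alongside zero-initialized lines (0000000) labeled "Zero initialization".]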
p. 4
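Cooperative Cache Scrubbing
Objects rapidly allocated and short-lived
[Diagram: software stack (Application, Managed language runtime environment, Operating System) on a chip with four processors (P), four caches ($) and an LLC, backed by memory (DRAM); lines marked DEAD and zero-initialized lines (0000000, labeled "Zero initialization") are shown in the cache hierarchy.]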
p. 5
Generational Garbage Collection
- Young objects die quickly
- Nursery
  - Traced for live objects
  - Copy to mature space
  - Reclaimed ‘en masse’
[Diagram: heap split into Mature and Nursery spaces; lines marked DEAD in an 8MB LLC.]
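
A minimal sketch of the nursery collection steps above (trace, copy, reclaim), assuming a toy heap in which objects are Python objects and the two spaces are plain lists. The names `Obj` and `collect_nursery` are hypothetical, and tracing everything from the roots is a simplification: a real generational collector would typically use a remembered set rather than walking mature objects.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Obj:
    """A toy heap object; `refs` holds its outgoing references."""
    name: str
    refs: list = field(default_factory=list)


def collect_nursery(roots, nursery, mature):
    """Trace live nursery objects, copy them to the mature space, then
    reclaim the whole nursery at once. Returns the number of dead objects."""
    in_nursery = {id(o) for o in nursery}
    live, seen, stack = [], set(), list(roots)
    while stack:                       # trace for live objects
        obj = stack.pop()
        if id(obj) in seen:
            continue
        seen.add(id(obj))
        if id(obj) in in_nursery:
            live.append(obj)
        stack.extend(obj.refs)
    dead = len(nursery) - len(live)
    mature.extend(live)                # copy survivors to the mature space
    nursery.clear()                    # reclaim the nursery 'en masse'
    return dead
```

After the call, every address range that used to hold the nursery contains only dead data, which is the property the scrubbing approach in this talk relies on.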
p. 6
Dead Lines in LLC (8MB)
p. 7
Dead Data Written Back?
[Diagram: software stack (Application, Managed language runtime environment, Operating System) on a chip with four processors (P), four caches ($) and an LLC, backed by memory (DRAM); lines marked DEAD sit in the cache hierarchy.]
p. 8
Useless Write Backs (8MB LLC)
p. 9
Cooperative Cache Scrubbing
Communicate managed language’s semantic information to hardware
- Caches
- ‘Scrub’ dead lines
  - Invalidate (writes)
  - Unset dirty bit (writes)
  - Zero lines without fetch (reads)
- Result
  - Better cache management
  - Avoid traffic to DRAM
  - Save DRAM energy
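
To make the traffic argument concrete, here is a toy accounting model of the scrubbing operations listed above. It is only a sketch under stated assumptions: a single write-allocate cache with no capacity limit, and hypothetical names (`ToyCache`, `run`) that do not come from the talk.

```python
class ToyCache:
    """A tiny cache model that only counts DRAM reads and writes."""

    def __init__(self):
        self.lines = {}          # line address -> dirty flag
        self.dram_reads = 0
        self.dram_writes = 0

    def write(self, addr):
        """Ordinary store: a miss fetches the line from DRAM first."""
        if addr not in self.lines:
            self.dram_reads += 1
        self.lines[addr] = True

    def zero_without_fetch(self, addr):
        """Install an all-zero dirty line without reading DRAM."""
        self.lines[addr] = True

    def unset_dirty(self, addr):
        """Mark a dead line clean so eviction skips the write-back."""
        if addr in self.lines:
            self.lines[addr] = False

    def evict_all(self):
        """Evict every line; only dirty lines are written back."""
        self.dram_writes += sum(1 for dirty in self.lines.values() if dirty)
        self.lines.clear()


def run(scrub):
    """Allocate four lines, let them die, evict; return (reads, writes)."""
    cache = ToyCache()
    for addr in range(4):
        if scrub:
            cache.zero_without_fetch(addr)
        else:
            cache.write(addr)
    if scrub:
        for addr in range(4):
            cache.unset_dirty(addr)   # the runtime knows these lines are dead
    cache.evict_all()
    return cache.dram_reads, cache.dram_writes


print(run(scrub=False))  # (4, 4)
print(run(scrub=True))   # (0, 0)
```

In this model the baseline pays one fetch and one write-back per dead line, while the scrubbed run touches DRAM not at all; real savings depend on cache capacity and on how many lines are still resident when they die.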
p. 10
Dead Data Written in Cache?
- Young objects die quickly
- Nursery
  - Traced for live objects
  - Copy to mature space
  - Reclaimed ‘en masse’
[Diagram: heap split into Mature and Nursery spaces; a zeroed line (0000000) is written among lines marked DEAD in the LLC.]
p. 11
Dead Lines Written in LLC (8MB)
p. 12
SW-HW Cooperative Scrubbing
- Software
  - Identify cache line-aligned dead/zero region
  - Generational Immix collector (stop-the-world)
    - After nursery collection, call scrub instruction on each line in entire range
    - Call zero instructions to zero region (32KB)
- Hardware
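
The software side described above amounts to address arithmetic plus one instruction per cache line. The sketch below illustrates that loop under assumptions that are not in the slides: a 64-byte line size, and callbacks `scrub_line` and `zero_line` standing in for the scrub and zero instructions. All function names are hypothetical.

```python
LINE_SIZE = 64            # bytes; assumed, the slides do not state a line size
ZERO_CHUNK = 32 * 1024    # bytes; the 32KB zeroing region from the slide


def aligned_lines(start, end, line_size=LINE_SIZE):
    """Return the addresses of cache lines lying entirely inside [start, end)."""
    first = -(-start // line_size) * line_size   # round start up to a line
    last = (end // line_size) * line_size        # round end down to a line
    return range(first, last, line_size)


def scrub_region(start, end, scrub_line):
    """After a nursery collection, scrub every whole line in the dead range.
    Returns the number of lines scrubbed."""
    count = 0
    for addr in aligned_lines(start, end):
        scrub_line(addr)
        count += 1
    return count


def zero_next_chunk(cursor, zero_line):
    """Zero the next 32KB chunk (cursor assumed line-aligned) and return
    the new cursor."""
    for addr in aligned_lines(cursor, cursor + ZERO_CHUNK):
        zero_line(addr)
    return cursor + ZERO_CHUNK
```

Rounding the range inward matters: a partially covered line may still hold live data from a neighbouring region, so only whole lines are safe to scrub or zero.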
p. 13
SW-HW Cooperative Scrubbing
- Software
- Hardware
  - Scrubbing (LLC)
    - clinvalidate: invalidates cache line (cf. PowerPC’s dcbi, ARM)
    - clundirty: clears dirty bit
    - clclean: clears dirty bit, moves line to LRU position
  - Zeroing (L2)
    - clzero: zero cache line without fetch (cf. PowerPC’s dcbz)
  - Modifications to MESI cache coherence protocol
    - Back-propagation from LLC to L1/L2 cache levels
    - Local coherence transitions (no off-chip)
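
The difference between the four instructions is easiest to see on a single cache set with an LRU order. The following is a sketch only, assuming a set modelled as an ordered map from least to most recently used; the class `CacheSet`, its methods, and the 64-byte line size are illustrative and not the simulator's implementation.

```python
from collections import OrderedDict


class CacheSet:
    """One cache set; the OrderedDict runs from LRU (first) to MRU (last)."""

    def __init__(self):
        self.lines = OrderedDict()   # tag -> {"dirty": bool, "data": bytes}

    def store(self, tag, data):
        self.lines[tag] = {"dirty": True, "data": data}
        self.lines.move_to_end(tag)                  # line becomes MRU

    def clinvalidate(self, tag):
        self.lines.pop(tag, None)                    # drop line, no write-back

    def clundirty(self, tag):
        if tag in self.lines:
            self.lines[tag]["dirty"] = False         # LRU position unchanged

    def clclean(self, tag):
        if tag in self.lines:
            self.lines[tag]["dirty"] = False
            self.lines.move_to_end(tag, last=False)  # becomes next victim

    def clzero(self, tag, line_size=64):
        # Install zeros without fetching the old contents from memory.
        self.lines[tag] = {"dirty": True, "data": bytes(line_size)}
        self.lines.move_to_end(tag)

    def victim(self):
        """Tag of the line that would be evicted next, or None if empty."""
        return next(iter(self.lines), None)
```

In this model clundirty only saves the write-back, while clclean and clinvalidate also free the way sooner, which is one plausible reason for the variants to behave differently in the results.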
p. 14
MESI Coherence Transitions
[State diagram: scrubbing transitions among the M, E, S and I states. Edge labels: clinvalidate/-, clclean/-, clinvalidate/clclean/-.]
p. 15
MESI Coherence Transitions
[State diagram: zeroing transitions among the M, E, S and I states. Edge labels: clzero/-, clzero/BusInvalidate, and BusInvalidate (external: from another LLC).]
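
The transitions in the two diagrams above can be written as a lookup table. The sketch below encodes one plausible reading of the slides and should not be taken as the authors' protocol specification: the exact source and target states are my interpretation, clundirty is grouped with clclean as an assumption, and the names `build_transitions` and `step` are hypothetical.

```python
STATES = "MESI"


def build_transitions():
    """Map (state, event) -> (next_state, bus_message or None)."""
    table = {}
    for s in STATES:
        # Invalidating a dead line needs no bus traffic and no write-back.
        table[(s, "clinvalidate")] = ("I", None)
        # An invalidate sent by another cache's clzero removes the local copy.
        table[(s, "BusInvalidate")] = ("I", None)
        # Clearing the dirty bit turns Modified into Exclusive; clean
        # states are assumed to stay where they are.
        for op in ("clundirty", "clclean"):
            table[(s, op)] = ("E" if s == "M" else s, None)
        # Zeroing yields a Modified line; other copies can exist only when
        # the line was Shared or Invalid locally, so only then notify the bus.
        table[(s, "clzero")] = ("M", "BusInvalidate" if s in "SI" else None)
    return table


TRANSITIONS = build_transitions()


def step(state, event):
    """Apply one event to a line in `state`."""
    return TRANSITIONS[(state, event)]
```

For example, `step("S", "clzero")` yields `("M", "BusInvalidate")`, while `step("E", "clzero")` yields `("M", None)` because an Exclusive line has no other sharers to notify.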
p. 16
Methodology
- Sniper simulator
  - 4 cores, 8MB shared L3 (LLC), McPAT
- Extensions for JVM
  - Works with JIT compiler
  - Emulate system calls (futex & nanosleep)
  - JVM-simulator communication with new instruction
- Jikes RVM 3.1.2 and DaCapo benchmarks
  - Generational Immix garbage collector
  - 4 application threads, 4 GC threads
  - 2x minimum heap
  - Replay compilation, 2nd invocation
p. 17
DRAM Writes (8MB nursery)
[Bar chart: DRAM writes relative to baseline (%), y-axis 0 to 120, for clinvalidate, clundirty, clclean, clzero and clclean+clzero.]
p. 18
DRAM Reads (8MB nursery)
[Bar chart: DRAM reads relative to baseline (%), y-axis 0 to 225, for clinvalidate, clundirty, clclean, clzero and clclean+clzero.]
p. 21
Dynamic DRAM Energy (8MB nursery)
[Bar chart: dynamic DRAM energy reduction (%), y-axis 0 to 80, for clinvalidate, clundirty, clclean, clzero and clclean+clzero, including a Mean group.]
p. 26
Total DRAM Energy
[Bar chart: total DRAM energy reduction (%), y-axis -5 to 25, for clinvalidate, clundirty, clclean, clzero and clclean+clzero at nursery sizes 4M, 8M and 16M; one annotation reads -22%.]
p. 28
Total DRAM Traffic
[Bar chart: total DRAM traffic reduction (%), y-axis -50 to 100, for clinvalidate, clundirty, clclean, clzero and clclean+clzero at nursery sizes 4M, 8M and 16M; one annotation reads -14x.]
p. 30
clclean+clzero Improvements
[Chart: improvements with clclean+clzero, y-axis 0% to 100%, for nursery sizes 4MB, 8MB and 16MB.]
p. 31
Related Work
- Cooperative cache management
  - ESKIMO by Isen & John, Micro 09
    - Useless reads and writes to DRAM by sequential C programs
    - Reduce energy
    - Require large map in hardware, extra cache bits
  - Wang et al., PACT 02 / ISCA 03; Sartor et al., 05
    - C & Fortran static analysis to give cache hints to evict or keep data
- Zero initialization [Yang et al., OOPSLA 11]
  - Studied costs in time, cache and traffic
  - Use non-temporal writes to DRAM, increase bandwidth
p. 32
Conclusions
- Software-hardware cooperative cache scrubbing
- Leverages region allocation semantics
- Changes to MESI coherence protocol
- New multicore architectural simulation methodology
- Reductions
  - 59% traffic
  - 14% DRAM energy
  - 4.6% execution time
http://users.elis.ugent.be/~jsartor/
p. 33
p. 34
Execution Time (8MB nursery)
[Bar chart: execution time reduction (%), y-axis 0 to 7, for clinvalidate, clundirty, clclean, clzero and clclean+clzero, including a Mean group.]
p. 35
Changes to MESI coherence protocol
(each cell gives the action, then the next state where the slide lists one; ⁄ marks an empty cell)

State | clinvalidate                   | clundirty/clclean                            | clzero             | BusInvalidate
M     | invalidate L1/L2 (no WB) -> I  | invalidate L1/L2 (no WB) -> E (clclean LRU)  | ⁄                  | invalidate L1/L2 (no WB) -> I
E     | invalidate L1/L2 -> I          | invalidate L1/L2 (clclean LRU)               | -> M               | invalidate L1/L2 -> I
S     | invalidate L1/L2 -> I          | invalidate L1/L2 (clclean LRU)               | BusInvalidate -> M | invalidate L1/L2 -> I
I     | ⁄                              | ⁄                                            | BusInvalidate -> M | ⁄
p. 36
Total DRAM Energy (8MB nursery)
[Bar chart: total DRAM energy reduction (%), y-axis -10 to 60, for clinvalidate, clundirty, clclean, clzero and clclean+clzero.]
p. 37
Execution Time Across Nurseries
p. 38
Execution Time
p. 39
Dynamic DRAM Energy 8MB Nursery
p. 40