Cooperative Cache Scrubbing
Jennifer B. Sartor, Wim Heirman, Steve Blackburn*, Lieven Eeckhout, Kathryn S. McKinley^
PACT 2014
[Affiliation logos for * and ^]

Multicore Challenge (p. 2)
  - Objects rapidly allocated and short-lived
  [Diagram: software stack (Application, Managed language runtime environment, Operating System) above a Chip with four P blocks, four $ blocks, and an LLC, connected to memory (DRAM).]

Problem: Allocation Wall (p. 3)
  - Objects rapidly allocated and short-lived
  [Diagram: same system, with DEAD lines in the $ blocks, the LLC, and memory (DRAM).]

Problem: Bandwidth & Power Wall (p. 4)
  - Objects rapidly allocated and short-lived
  [Diagram: same system, with DEAD lines in the $ blocks, the LLC, and memory (DRAM), plus zero initialization (0000000) traffic.]

Cooperative Cache Scrubbing (p. 5)
  - Objects rapidly allocated and short-lived
  [Diagram: same system, with DEAD lines and zero initialization (0000000) confined to the $ blocks and the LLC.]

Generational Garbage Collection (p. 6)
  - Heap spaces: Mature, Nursery
  - Young objects die quickly
  - Nursery
      - Traced for live objects
      - Copy to mature space
      - Reclaimed 'en masse'
  [Diagram: DEAD lines in an 8MB LLC.]

Dead Lines in LLC (8MB) (p. 7)
  [Chart: no axis labels or data recoverable.]

Dead Data Written Back? (p. 8)
  [Diagram: same system as p. 3, with DEAD lines in the $ blocks and the LLC, next to memory (DRAM).]

Useless Write Backs (8MB LLC) (p. 9)
  [Chart: no axis labels or data recoverable.]

Cooperative Cache Scrubbing (p. 10)
  - Communicate managed language's semantic information to hardware
  - 'Scrub' dead lines
      - Invalidate
      - Unset dirty bit
  - Zero lines without fetch
  - Result
      - Better cache management
      - Avoid traffic to DRAM
      - Save DRAM energy
  [Diagram labels: Caches, writes, reads.]

Dead Data Written in Cache? (p. 11)
  - Heap spaces: Mature, Nursery
  - Young objects die quickly
  - Nursery
      - Traced for live objects
      - Copy to mature space
      - Reclaimed 'en masse'
  [Diagram: DEAD lines and a zeroed line (0000000) in the LLC.]

Dead Lines Written in LLC (8MB) (p. 12)
  [Chart: no axis labels or data recoverable.]

SW-HW Cooperative Scrubbing (p. 13)
  - Software
      - Identify cache line-aligned dead/zero region
          - Generational Immix collector (stop-the-world)
      - After nursery collection, call scrub instruction on each line in entire range
      - Call zero instructions to zero region (32KB)
  - Hardware

SW-HW Cooperative Scrubbing (p. 14)
  - Software
  - Hardware
      - Scrubbing (LLC)
          - clinvalidate: invalidates cache line
          - clundirty: clears dirty bit
          - clclean: clears dirty bit, moves line to LRU
          - Existing analogue: PowerPC's dcbi, ARM
      - Zeroing (L2)
          - clzero: zero cache line without fetch
          - Existing analogue: PowerPC's dcbz
      - Modifications to MESI cache coherence protocol
          - Back-propagation from LLC to L1/L2 cache levels
          - Local coherence transitions (no off-chip)

MESI Coherence Transitions (p. 15)
  [State diagram over M, E, S, I. Edge labels: clinvalidate/clclean/-, clclean/-, clinvalidate/-. The source and target of each edge are not recoverable here; the table on p. 36 lists the transitions.]

MESI Coherence Transitions (p. 16)
  [State diagram over M, E, S, I. Edge labels: clzero/BusInvalidate, clzero/-, BusInvalidate. Note on the slide: "external: from another LLC".]

Methodology (p. 17)
  - Sniper simulator
      - 4 cores, 8MB shared L3 (LLC), McPAT
  - Extensions for JVM
      - Works with JIT compiler
      - Emulate system calls (futex & nanosleep)
      - JVM-simulator communication with new instruction
  - Jikes RVM 3.1.2 and DaCapo benchmarks
      - Generational Immix garbage collector
      - 4 application, 4 GC threads
      - 2x minimum heap
      - Replay compilation, 2nd invocation

DRAM Writes (8MB nursery) (p. 18-20)
  [Chart: y-axis Writes/Baseline (%), 0 to 120. Series: clinvalidate, clundirty, clclean, clzero, clclean+clzero.]

DRAM Reads (8MB nursery) (p. 21-25)
  [Chart: y-axis Reads/Baseline (%), 0 to 225. Series: clinvalidate, clundirty, clclean, clzero, clclean+clzero.]

Dynamic DRAM Energy (8MB nursery) (p. 26-27)
  [Chart: y-axis Energy Reduction (%), 0 to 80, with a Mean group. Series: clinvalidate, clundirty, clclean, clzero, clclean+clzero.]

Total DRAM Energy (p. 28-29)
  [Chart: y-axis Energy Reduction (%), -5 to 25. Groups: 4M, 8M, 16M. Series: clinvalidate, clundirty, clclean, clzero, clclean+clzero. One off-scale bar is annotated -22%.]

Total DRAM Traffic (p. 30)
  [Chart: y-axis Traffic Reduction (%), -50 to 100. Groups: 4M, 8M, 16M. Series: clinvalidate, clundirty, clclean, clzero, clclean+clzero. One off-scale bar is annotated -14x.]

clclean+clzero Improvements (p. 31)
  [Chart: y-axis 0% to 100%. Groups: 4MB, 8MB, 16MB.]

Related Work (p. 32)
  - Cooperative cache management
      - ESKIMO by Isen & John, Micro 09
          - Useless reads and writes to DRAM by sequential C programs
          - Reduce energy
          - Require large map in hardware, extra cache bits
      - Wang et al., PACT 02 / ISCA 03; Sartor et al., 05
          - C & Fortran static analysis to give cache hints to evict or keep data
  - Zero initialization [Yang et al., OOPSLA 11]
      - Studied costs in time, cache and traffic
      - Use non-temporal writes to DRAM, increase bandwidth

Conclusions (p. 33)
  - Software-hardware cooperative cache scrubbing
      - Leverages region allocation semantics
      - Changes to MESI coherence protocol
  - New multicore architectural simulation methodology
  - Reductions
      - 59% traffic
      - 14% DRAM energy
      - 4.6% execution time
  http://users.elis.ugent.be/~jsartor/

(p. 34: blank slide)

Execution Time (8MB nursery) (p. 35)
  [Chart: y-axis Execution Time Reduction (%), 0 to 7, with a Mean group. Series: clinvalidate, clundirty, clclean, clzero, clclean+clzero.]

Changes to MESI coherence protocol (p. 36)
  Each entry gives the action, then the next state where the slide lists one. "⁄" marks an empty cell.
  State M
      clinvalidate:       invalidate L1/L2 (no WB), next state I
      clundirty/clclean:  invalidate L1/L2 (no WB), next state E (clclean LRU)
      clzero:             ⁄
      BusInvalidate:      invalidate L1/L2 (no WB), next state I
  State E
      clinvalidate:       invalidate L1/L2, next state I
      clundirty/clclean:  invalidate L1/L2 (clclean LRU)
      clzero:             next state M
      BusInvalidate:      invalidate L1/L2, next state I
  State S
      clinvalidate:       invalidate L1/L2, next state I
      clundirty/clclean:  invalidate L1/L2 (clclean LRU)
      clzero:             BusInvalidate, next state M
      BusInvalidate:      invalidate L1/L2, next state I
  State I
      clinvalidate:       ⁄
      clundirty/clclean:  ⁄
      clzero:             BusInvalidate, next state M
      BusInvalidate:      ⁄

Total DRAM Energy (8MB nursery) (p. 37)
  [Chart: y-axis Energy Reduction (%), -10 to 60. Series: clinvalidate, clundirty, clclean, clzero, clclean+clzero.]

Execution Time Across Nurseries (p. 38)
  [Chart: no axis labels or data recoverable.]

Execution Time (p. 39)
  [Chart: no axis labels or data recoverable.]

Dynamic DRAM Energy 8MB Nursery (p. 40)
  [Chart: no axis labels or data recoverable.]
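The software step on p. 13 (identify a cache line-aligned dead/zero region, then issue one scrub or zero instruction per line) can be sketched as below. This is a minimal illustration, not the Jikes RVM implementation: the 64-byte line size and the names `aligned_lines` and `scrub_region` are assumptions made for the example, and `scrub` is a stand-in callback for whichever instruction (clinvalidate, clundirty, clclean, or clzero) would be issued.

```python
LINE_SIZE = 64  # bytes; assumed, the slides do not state the line size


def aligned_lines(start, end, line_size=LINE_SIZE):
    """Return the addresses of cache lines that lie entirely inside [start, end).

    Partial lines at either edge are skipped, because scrubbing them could
    discard bytes that belong to live data outside the region.
    """
    first = (start + line_size - 1) // line_size * line_size  # round start up
    last = end // line_size * line_size                       # round end down
    return range(first, last, line_size)


def scrub_region(start, end, scrub, line_size=LINE_SIZE):
    """Call scrub(addr) once per fully covered line and return the line count."""
    count = 0
    for addr in aligned_lines(start, end, line_size):
        scrub(addr)  # hypothetical hook for one scrub or zero instruction
        count += 1
    return count
```

Under the assumed 64-byte line, the 32KB zeroing region mentioned on p. 13 would take 512 calls, one per line, and a region that covers no complete line results in no calls at all.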
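To make the effect of the four instructions on p. 14 concrete, here is a toy model of a single LLC line. It is a sketch under stated assumptions, not the simulator code: the class `Line`, the helper `evict`, and the 64-byte line size are invented for the example, and the state changes follow one reading of the p. 36 table. Back-propagation to L1/L2 and bus traffic are not modelled.

```python
from dataclasses import dataclass

LINE_SIZE = 64  # bytes; assumed, not stated on the slides


@dataclass
class Line:
    state: str = "I"               # MESI state: "M", "E", "S" or "I"
    data: bytes = bytes(LINE_SIZE)
    lru: bool = False              # True when the line is the preferred victim


def clinvalidate(line):
    # Invalidate the line; a Modified line is dropped without a write-back.
    line.state = "I"


def clundirty(line):
    # Clear the dirty bit: Modified becomes Exclusive, other states are kept.
    if line.state == "M":
        line.state = "E"


def clclean(line):
    # Clear the dirty bit and move a valid line to the LRU position.
    clundirty(line)
    if line.state != "I":
        line.lru = True


def clzero(line):
    # Install an all-zero line without fetching it from DRAM; it ends Modified.
    line.data = bytes(LINE_SIZE)
    line.state = "M"
    line.lru = False


def evict(line):
    """Evict the line and return the number of DRAM write-backs (0 or 1)."""
    writebacks = 1 if line.state == "M" else 0
    line.state = "I"
    return writebacks
```

In this model, evicting a Modified line costs one DRAM write, while the same line evicted after `clinvalidate`, `clundirty`, or `clclean` costs none, which is the write traffic that scrubbing dead nursery lines is meant to avoid.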