Near-Optimal Cache Block Placement with Reactive Nonuniform Cache Architectures
Nikos Hardavellas, Northwestern University
Team: M. Ferdman, B. Falsafi, A. Ailamaki (Northwestern, Carnegie Mellon, EPFL)

Moore's Law Is Alive And Well
[Figure: a 90nm transistor (Intel, 2005) shown beside the Swine Flu A/H1N1 virus (CDC) at comparable scale]
• Process roadmap: 65nm, 45nm, 32nm, 22nm, 16nm (2007-2019)
Device scaling continues for at least another 10 years

Moore's Law Is Alive And Well... but the good days ended Nov. 2002 [Yelick09]
• "New" Moore's Law: 2x cores with every generation
• On-chip cache grows commensurately to supply all cores with data

Larger Caches Are Slower Caches
[Figure: L2 cache size (KB, log scale) and L2 hit latency (cycles) vs. year, 1990-2010: large caches, slow access]
Increasing access latency forces caches to be distributed

Cache design trends
• As caches become bigger, they get slower
• Split the cache into smaller "slices"
• Balance cache slice access with network latency

Modern Caches: Distributed
[Figure: tiled multicore; each tile pairs a core with an L2 slice]
Split the cache into "slices", distribute across the die

Data Placement Determines Performance
[Figure: 4x8 tiled multicore with one cache slice highlighted]
Goal: place data on chip close to where they are used

Our proposal: R-NUCA (Reactive Nonuniform Cache Architecture)
• Data may exhibit arbitrarily complex behaviors ...but few that matter!
• Learn the behaviors at run time & exploit their characteristics
Make the common case fast, the rare case correct
Resolve conflicting requirements

Reactive Nonuniform Cache Architecture
[Hardavellas et al., ISCA 2009] [Hardavellas et al., IEEE Micro Top Picks 2010]
• Cache accesses can be classified at run time; each class is amenable to a different placement
• Per-class block placement: simple, scalable, transparent; no need for HW coherence mechanisms at the LLC
• Up to 32% speedup (17% on average); -5% on avg. from an ideal cache organization
• Rotational interleaving: data replication and fast single-probe lookup

Outline
• Introduction
• Why do Cache Accesses Matter?
• Access Classification and Block Placement
• Reactive NUCA Mechanisms
• Evaluation
• Conclusion

Cache accesses dominate execution [Hardavellas et al., CIDR 2007]
[Figure: CPI vs. L2 cache size (MB) on a 4-core CMP running DSS (TPC-H on DB2, 1GB database); CPI broken into L2-hit stalls, memory stalls, and total, with an ideal lower bound; lower is better]
Bottleneck shifts from memory to L2-hit stalls

How much do we lose?
[Figure: normalized throughput vs. L2 cache size (MB) for DSS-const and DSS-real on the same 4-core CMP; higher is better]
We lose half the potential throughput
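The two charts above can be summarized with a simple stall model. The sketch below is illustrative only; the base CPI, access rate, and latency numbers are hypothetical, not from the talk. It shows how growing the L2 cuts the miss ratio but raises hit latency, moving the bottleneck from memory stalls to L2-hit stalls.

```c
#include <stdio.h>

/* Illustrative CPI decomposition: base CPI plus L2-hit and memory stall
 * components. All inputs are hypothetical. */
static double cpi(double base, double apki, double miss_ratio,
                  double l2_hit_lat, double mem_lat) {
    double acc = apki / 1000.0;                   /* L2 accesses per instruction */
    return base
         + acc * (1.0 - miss_ratio) * l2_hit_lat  /* L2-hit stall component */
         + acc * miss_ratio * mem_lat;            /* memory stall component */
}

int main(void) {
    /* A bigger L2 cuts misses (0.30 -> 0.05) but raises hit latency
     * (10 -> 25 cycles): total CPI drops, yet L2-hit stalls now dominate. */
    printf("small L2: CPI = %.2f\n", cpi(0.8, 50.0, 0.30, 10.0, 300.0));
    printf("large L2: CPI = %.2f\n", cpi(0.8, 50.0, 0.05, 25.0, 300.0));
    return 0;
}
```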
Outline
• Introduction
• Why do Cache Accesses Matter?
• Access Classification and Block Placement
• Reactive NUCA Mechanisms
• Evaluation
• Conclusion

Terminology: Data Types
• Private: read or written by a single core
• Shared Read-Only: read by multiple cores
• Shared Read-Write: read and written by multiple cores

Distributed shared L2
• Placement: address mod <#slices>, giving a unique location for any block, private or shared (see the code sketch after the Methodology slide)
• Maximum capacity, but slow access (30+ cycles)

Distributed private L2
• On every access, allocate the data at the local L2 slice
• Private data: allocated at the local slice; fast access to core-private data

Distributed private L2: shared-RO access
• Shared read-only data: replicated across L2 slices
• Wastes capacity due to replication

Distributed private L2: shared-RW access
• Shared read-write data: coherence maintained via indirection through a directory (dir)
• Slow for shared read-write; wastes capacity (directory overhead) and bandwidth

Conventional Multi-Core Caches
• Shared: address-interleave blocks; + high capacity, - slow access
• Private: each block cached locally; + fast (local) access, - low capacity (replicas), - coherence via indirection (distributed directory)
We want: high capacity (shared) + fast access (private)

Where to Place the Data?
• Close to where they are used!
• Accessed by a single core: migrate locally
• Accessed by many cores: replicate (?) If read-only, replication is OK; if read-write, coherence is a problem; with low reuse, evenly distribute across sharers
[Figure: placement policy (migrate / share / replicate) as a function of sharer count and read-write intensity]

Methodology
Flexus: full-system cycle-accurate timing simulation [Hardavellas et al., SIGMETRICS-PER 2004; Wenisch et al., IEEE Micro 2006]
Workloads
• OLTP: TPC-C 3.0, 100 WH; IBM DB2 v8, Oracle 10g
• DSS: TPC-H Qry 6, 8, 13; IBM DB2 v8
• SPECweb99 on Apache 2.0
• Multiprogrammed: SPEC2K
• Scientific: em3d
Model Parameters
• Tiled, LLC = L2
• Server/scientific workloads: 16 cores, 1MB/core
• Multiprogrammed workload: 8 cores, 3MB/core
• OoO, 2GHz, 96-entry ROB
• Folded 2D torus: 2-cycle router, 1-cycle link
• 45ns memory
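As referenced on the distributed shared L2 slide above, here is a minimal sketch of the two conventional placement rules being contrasted. BLOCK_BITS, N_SLICES, and the plain modulo interleave are assumptions for illustration, not details from the talk.

```c
#include <stdint.h>
#include <stdio.h>

#define BLOCK_BITS 6    /* assumed 64-byte blocks */
#define N_SLICES   16   /* assumed 16 tiles */

/* Distributed shared L2: address-interleave ("address mod #slices").
 * Every block, private or shared, has one unique home slice. */
static unsigned shared_slice(uint64_t paddr) {
    return (unsigned)((paddr >> BLOCK_BITS) % N_SLICES);
}

/* Distributed private L2: allocate at the requester's local slice,
 * regardless of address; sharing then needs directory-based coherence. */
static unsigned private_slice(uint64_t paddr, unsigned core_id) {
    (void)paddr;    /* placement ignores the address entirely */
    return core_id;
}

int main(void) {
    uint64_t a = 0x7f3240;
    printf("shared home: slice %u\n", shared_slice(a));
    printf("private home for core 5: slice %u\n", private_slice(a, 5));
    return 0;
}
```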
Cache Access Classification Example
[Figure: bubble plot; series: instructions, private data, shared data; x-axis: number of sharers (0-20); y-axis: % of read-write blocks]
• Each bubble: cache blocks shared by x cores
• Size of bubble proportional to % of L2 accesses
• y-axis: % of blocks in the bubble that are read-write

Cache Access Clustering
[Figure: the same bubble plots for server apps and scientific/MP apps, annotated with placement policies: instructions, replicate; private data, migrate locally; shared read-write data, share (address-interleave)]
Accesses naturally form 3 clusters

Instruction Replication
• Instruction working set too large for one cache slice
[Figure: tiled multicore with instructions distributed across a neighborhood of slices]
Distribute in a cluster of neighbors, replicate across clusters

Reactive NUCA in a nutshell
• Classify accesses: private data, like the private scheme (migrate); shared data, like the shared scheme (interleave); instructions, controlled replication (middle ground)
To place cache blocks, we first need to classify them

Outline
• Introduction
• Access Classification and Block Placement
• Reactive NUCA Mechanisms
• Evaluation
• Conclusion

Classification Granularity
• Per-block classification: high area/power overhead (cuts L2 size in half); high latency (indirection through a directory)
• Per-page classification (utilize the OS page table): persistent structure; the core already consults the page table/TLB on every access; reuses existing SW/HW structures and events; page classification is accurate (<0.5% error)
Classify entire data pages; page table/TLB for bookkeeping

Classification Mechanisms
• Instruction classification: all accesses from the L1-I (per-block)
• Data classification: private/shared, per page, at TLB miss
[Figure: on the 1st access, core i takes a TLB miss on Ld A and the OS marks A private to "i"; on a later access by core j, the OS re-marks A shared]
Bookkeeping through the OS page table and TLB

Page Table and TLB Extensions
• The core already consults the page table/TLB on every access; pass information from the "directory" to the core
• Utilize already existing SW/HW structures and events
• TLB entry: P/S (1 bit), vpage, ppage
• Page table entry: P/S/I (2 bits), L2 id (log n bits), vpage, ppage
Page granularity allows simple + practical HW
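A minimal sketch of the extended entries just described. The exact field widths and ordering are assumptions; the slide specifies only the 1-bit P/S flag in the TLB and the 2-bit class plus log2(n)-bit L2 id in the page table.

```c
#include <stdint.h>
#include <stdio.h>

enum pclass { PC_PRIVATE = 0, PC_SHARED = 1, PC_INSTR = 2 };  /* P/S/I: 2 bits */

/* Page table entry extension: class + owner L2 id (log2(16) = 4 bits here). */
struct pte_ext {
    uint64_t ppage  : 40;   /* assumed physical page number width */
    uint64_t l2_id  : 4;    /* owner slice for private pages */
    uint64_t pclass : 2;    /* P/S/I */
};

/* TLB entry extension: a single P/S bit suffices, since instructions are
 * identified at the L1-I rather than through the TLB. */
struct tlb_ext {
    uint64_t vpage  : 40;
    uint64_t ppage  : 40;
    uint64_t shared : 1;
};

int main(void) {
    struct pte_ext e = { .ppage = 0x1234, .l2_id = 7, .pclass = PC_PRIVATE };
    printf("ppage=%llx owner=L2 %u class=%u\n",
           (unsigned long long)e.ppage, (unsigned)e.l2_id, (unsigned)e.pclass);
    return 0;
}
```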
Data Class Bookkeeping and Lookup
• Private data: place in the local L2 slice (page table entry: P, L2 id, vpage, ppage; TLB entry: P, vpage, ppage)
• Shared data: place in the aggregate L2, address-interleaved (page table entry: S, L2 id, vpage, ppage; TLB entry: S, vpage, ppage)
• Physical address: tag | L2 id | cache index | offset
(See the lookup sketch at the end of this section.)

Coherence: No Need for HW Mechanisms at the LLC
• Reactive NUCA placement guarantee: each R/W datum is in a unique & known location
• Shared data: address-interleaved; private data: local slice
Fast access, eliminates HW overhead, SIMPLE

Instructions Lookup: Rotational Interleaving
[Figure: 4x8 tile grid with rotational IDs (RIDs) 0-3; size-4 clusters: each slice caches the same blocks on behalf of any overlapping cluster, so a core finds any instruction block in its local slice or one of 3 neighbors; example lookup for PC 0xfa480]
• Destination: the nearby slice whose RID matches the block's interleaving bits, RID_dest = Addr & (k - 1)
• Fast access (nearest-neighbor, simple lookup)
• Balances access latency with capacity constraints
• Equal capacity pressure at overlapped slices

Outline
• Introduction
• Access Classification and Block Placement
• Reactive NUCA Mechanisms
• Evaluation
• Conclusion

Evaluation
[Figure: speedup over the private design for (S) shared, (R) R-NUCA, and (I) ideal across OLTP DB2, Apache, DSS Qry6/8/13, em3d, OLTP Oracle, and MIX, grouped into private-averse and shared-averse workloads; range roughly -20% to 60%]
• Delivers robust performance across workloads
• vs. shared: same for Web, DSS; 17% better for OLTP, MIX
• vs. private: 17% better for OLTP, Web, DSS; same for MIX

Conclusions
• Data may exhibit arbitrarily complex behaviors ...but few that matter!
• Learn the behaviors that matter at run time; make the common case fast, the rare case correct
• Reactive NUCA: near-optimal cache block placement; simple, scalable, low-overhead, transparent, no coherence
• Robust performance: matches the best alternative or is 17% better; up to 32%
• Near-optimal placement (-5% avg. from ideal)

Thank You!
For more information:
• N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki. Near-Optimal Cache Block Placement with Reactive Nonuniform Cache Architectures. IEEE Micro Top Picks, Vol. 30(1), pp. 20-28, January/February 2010.
• N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki. Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches. ISCA 2009.
http://www.eecs.northwestern.edu/~hardav/

BACKUP SLIDES

Why Are Caches Growing So Large?
• Increasing number of cores: cache grows commensurately (fewer but faster cores have the same effect)
• Increasing datasets: growing faster than Moore's Law!
• Power/thermal efficiency: caches are "cool", cores are "hot", so it's easier to fit more cache in a power budget
• Limited bandwidth: a large cache means more data on chip, so off-chip pins are used less frequently

Backup Slides: ASR

ASR vs. R-NUCA Configurations

                          ASR-1      ASR-2     R-NUCA
Core Type                 In-Order   OoO       OoO
L2 Size (MB)              4          16        16
Memory (cycles)           150        500       90
Avg. Local L2 (cycles)    12         20        16
Avg. Shared L2 (cycles)   25         44        22
Memory / Local L2         12.5x      25.0x     5.6x
Shared L2 / Local L2      2.1x       2.2x      +38%
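Tying together the "Data Class Bookkeeping and Lookup" and "Coherence" slides earlier in this section, a minimal sketch of the data-lookup rule. The bit positions are simplified; the slide places the L2-id field between the tag and the cache index, while this sketch interleaves on the low block-address bits.

```c
#include <stdint.h>
#include <stdio.h>

#define BLOCK_BITS 6
#define N_SLICES   16   /* assumed power of two */

/* R-NUCA data lookup: the TLB's P/S bit selects the target slice directly,
 * with no directory indirection. Every R/W block thus has one unique,
 * known location, so the LLC needs no HW coherence mechanism. */
static unsigned rnuca_data_slice(uint64_t paddr, int shared,
                                 unsigned local_slice) {
    if (!shared)
        return local_slice;    /* private page: always the local slice */
    /* shared page: address-interleave across the aggregate L2 */
    return (unsigned)((paddr >> BLOCK_BITS) & (N_SLICES - 1));
}

int main(void) {
    printf("private: slice %u, shared: slice %u\n",
           rnuca_data_slice(0xbeef00, 0, 3),
           rnuca_data_slice(0xbeef00, 1, 3));
    return 0;
}
```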
ASR design space search
[Figure: speedup over private for ASR vs. static allocation points (Alloc 0%, 25%, 50%, 75%, 100%) on OLTP DB2, Apache, DSS Qry8, em3d, OLTP Oracle, and MIX; range roughly -6% to 6%]

Backup Slides: Prior Work

Prior Work
• Several proposals for CMP cache management: ASR, cooperative caching, victim replication, CMP-NuRapid, D-NUCA
• ...but they suffer from shortcomings: complex, high-latency lookup/coherence; don't scale; lower effective cache capacity; optimize only for a subset of accesses
We need: a simple, scalable mechanism for fast access to all data

Shortcomings of prior work
• L2-Private: wastes capacity; high latency (3 slice accesses + 3 hops on shared data)
• L2-Shared: high latency
• Cooperative Caching: doesn't scale (centralized tag structure)
• CMP-NuRapid: high latency (pointer dereference, 3 hops on shared data)
• OS-managed L2: wastes capacity (migrates all blocks); spilling to neighbors is useless (all cores run the same code)

Shortcomings of Prior Work (cont.)
• D-NUCA: no practical implementation (lookup?)
• Victim Replication: high latency (like L2-Private); wastes capacity (the home always stores the block)
• Adaptive Selective Replication (ASR): high latency (like L2-Private); capacity pressure (replicates at slice granularity); complex (4 separate HW structures to bias a coin)

Backup Slides: Classification and Lookup

Data Classification Timeline
[Figure: Core i loads A, takes a TLB miss, and the OS marks A private to "i"; Core j later loads A, the OS invalidates A in core i's TLB, core i evicts A, and the OS re-marks A shared (i != j); core k's subsequent accesses see the shared mapping]
Fast & simple lookup for data (see the code sketch at the end of this section)

Misclassifications at Page Granularity
[Figure: % of total L2 accesses per workload from pages serving one class vs. multiple classes (instructions+data, private+shared data), and correct vs. private-data-as-shared classifications]
• A page may service multiple access types
• But one type always dominates its accesses
Classification at page granularity is accurate

Backup Slides: Placement

Private Data Placement
[Figure: CDF of total L2 accesses vs. private-data footprint (KB), per workload]
• Spill to neighbors if the working set is too large? NO!!! Each core runs similar threads
Store in the local L2 slice (like in a private cache)

Private Data Working Set
• OLTP: small per-core working set (3MB / 16 cores = 200KB/core)
• Web: primary working set <6KB/core; the remainder is <1.5% of L2 refs
• DSS: policy doesn't matter much (>100MB working set, <13% of L2 refs, very low reuse on private data)
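As referenced on the "Data Classification Timeline" slide above, here is a sketch of the OS-side transition. The helper names are hypothetical stand-ins for the real TLB-shootdown and slice-eviction hooks.

```c
#include <stdio.h>

/* Hypothetical stand-ins for the real TLB/cache hooks. */
static void tlb_shootdown(int core)    { printf("inval TLB entry on core %d\n", core); }
static void evict_from_slice(int core) { printf("evict blocks from slice %d\n", core); }

struct page_info { int shared; int owner; };   /* owner < 0: untouched */

/* Invoked from the OS TLB-miss handler, following the timeline slide. */
static void os_tlb_miss(struct page_info *p, int core) {
    if (p->owner < 0) {
        p->owner = core;              /* 1st access: A private to "i" */
    } else if (!p->shared && p->owner != core) {
        tlb_shootdown(p->owner);      /* inval A at the old owner */
        evict_from_slice(p->owner);   /* evict A from its local slice */
        p->shared = 1;                /* A: shared from now on */
    }
    /* reply with the (re)classified mapping; TLB hits need no OS work */
}

int main(void) {
    struct page_info A = { 0, -1 };
    os_tlb_miss(&A, 3);   /* core i: A becomes private to core 3 */
    os_tlb_miss(&A, 7);   /* core j: A is re-classified as shared */
    return 0;
}
```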
Shared Data Placement
[Figure: CDF of total L2 accesses vs. shared-data footprint (KB), and access-reuse breakdown (1st, 2nd, 3rd-4th, 5th-8th, 9+ accesses), per workload]
• Read-write + large working set + low reuse: unlikely to be in the local slice for reuse
• Also, the next sharer is effectively random [WMPI'04]
Address-interleave in the aggregate L2 (like a shared cache)

Shared Data Working Set
[Figure: CDF of total L2 accesses vs. shared-data footprint (KB), per workload]

Instruction Placement
[Figure: CDF of total L2 accesses vs. instruction footprint (KB), and access-reuse breakdown, per workload]
• Working set too large for one slice, and slices store private & shared data too!
• Sufficient capacity with 4 L2 slices
Share in clusters of neighbors, replicate across clusters

Instructions Working Set
[Figure: CDF of total L2 accesses vs. instruction footprint (KB), per workload]

Backup Slides: Rotational Interleaving

Instruction Classification and Lookup
• Identification: all accesses from the L1-I
• But the working set is too large to fit in one cache slice
Share within a neighbors' cluster, replicate across clusters

Rotational Interleaving
[Figure: 8x4 tile grid listing each tile's TileID (0-31) and RotationalID; RIDs repeat so that every size-4 neighborhood contains RIDs 0-3]
• RID_dest = Addr mod k; the destination tile is offset from the requesting (center) tile by a fixed function D of (RID_dest - RID_center) & (n - 1), i.e. TileID_dest = TileID_center + D (sketched in code below)
• Fast access (nearest-neighbor, simple lookup)
• Equalizes capacity pressure at overlapping slices

Nearest-neighbor size-8 clusters
[Figure: 8x8 grid of RIDs 0-7 showing a size-8 cluster around a center tile C and a destination tile D]
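A sketch of size-4 rotational interleaving under one RID assignment consistent with the grid on the earlier lookup slide (RID(x,y) = (2x + y) mod 4, so every 2x2 window of tiles contains all four RIDs). This is an assumption-laden simplification: the real mechanism computes the destination arithmetically and wraps around the folded torus, whereas this version searches the window and ignores edge wrap-around.

```c
#include <stdint.h>
#include <stdio.h>

#define K 4   /* cluster size */

/* Assumed rotational-ID assignment: +2 per column, +1 per row (mod 4). */
static unsigned rid(unsigned x, unsigned y) { return (2 * x + y) % K; }

/* Destination tile for an instruction block, as seen from tile (x,y):
 * RID_dest comes from the block address; the matching slice always lies
 * within the 2x2 window anchored at the requesting tile, so every block
 * is found with a single nearest-neighbor probe. */
static void instr_dest(uint64_t block_addr, unsigned x, unsigned y,
                       unsigned *dx, unsigned *dy) {
    unsigned want = (unsigned)(block_addr % K);   /* RID_dest = Addr mod k */
    for (unsigned j = 0; j < 2; j++)
        for (unsigned i = 0; i < 2; i++)
            if (rid(x + i, y + j) == want) { *dx = x + i; *dy = y + j; return; }
}

int main(void) {
    unsigned dx, dy;
    instr_dest(0xabcd, 2, 1, &dx, &dy);   /* request from tile (2,1) */
    printf("block maps to tile (%u,%u), RID %u\n", dx, dy, rid(dx, dy));
    return 0;
}
```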