Hybrid Main Memory Literature Survey
HanBin Yoon, Justin Meza, Rachata Ausavarungnirun

Talk Outline
- Project Overview
- Literature Review: Row Buffer Locality, Selective Promotion, Reuse
- Synthesis

Project Overview
Use DRAM as a smart cache in front of PCM.
Before: CPU -> DRAM (main memory). After: CPU -> DRAM (cache) -> PCM (main memory).

Literature Review
Row Buffer Locality
- "Micro-Pages: Increasing DRAM Efficiency with Locality-Aware Data Placement," K. Sudan et al., ASPLOS 2010
Selective Promotion
- "CHOP: Adaptive Filter-Based DRAM Caching for CMP Server Platforms," X. Jiang et al., HPCA 2010
Reuse
- "Adaptive Insertion Policies for High Performance Caching," M. K. Qureshi et al., ISCA 2007
- "Adaptive Insertion Policies for Managing Shared Caches," A. Jaleel et al., PACT 2008

Row-Buffer Locality
"Micro-Pages: Increasing DRAM Efficiency with Locality-Aware Data Placement," K. Sudan et al., ASPLOS 2010

Observation: within an OS page, memory accesses typically concentrate on a small, contiguous chunk of cache blocks in the DRAM row.
Solution: gather hot chunks of data and map them to the same DRAM row to provide better row buffer locality.

Possible Solutions
OS-managed mechanism: reduce the OS page size
- Page size is reduced from 4 KB to 1 KB.
- Hot pages are migrated via DRAM copy so that they sit in the same row.
- Cold pages are promoted to superpages to reduce the TLB thrashing caused by the smaller page size.
Hardware-managed mechanism: indirection
- An additional level of address mapping places hot pages into the same row.
- A mapping table stores the indirection and is updated every epoch.

Tradeoffs
Advantages
- Clustering hot pages into the same row can potentially increase row buffer locality.
- The OS-managed technique can be done purely in software.
Disadvantages
- The additional indirection adds a lookup to the critical path of memory accesses.
- Clustering hot pages into the same row does not necessarily mean more row buffer locality.

Selective Promotion
"CHOP: Adaptive Filter-Based DRAM Caching for CMP Server Platforms," X. Jiang et al., HPCA 2010

Motivation
- Many-core architectures provide many cores, but limited pin count restricts bandwidth to off-chip memory.
- Goal: reduce off-chip bandwidth by identifying and caching hot pages in on-chip or 3D-stacked DRAM, without requiring a large tag store.

CHOP Architecture
- Keep a filter table of recently accessed pages (a sketch follows below, after the tradeoffs).
- On a page access: add an entry if the page is not already in the table; if it is, increment its counter.
- When an entry's counter reaches a threshold, cache the page on chip.

Results
- 30% performance improvement over a traditional DRAM cache.
- Small (?) storage overhead (800 KB) compared to a traditional DRAM cache: data is tracked only for recently accessed pages, not for all pages in DRAM.
- Hot pages stay closer to the CPU and do not need to be re-fetched.
- Lower bandwidth consumption than a traditional DRAM cache: caching hot pages means going to main memory less often.

Tradeoffs
Pros
- Reduces off-chip bandwidth.
- Smaller tag store than one tracking all of the DRAM cache.
Cons
- Requires changes to the chip / on-chip DRAM to work.
- The tag store is still somewhat large.
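To make the filter table concrete, here is a minimal sketch in Python of a CHOP-style counting filter, assuming an LRU-managed table; the capacity and hot threshold are illustrative values, not the parameters evaluated in the paper.

    from collections import OrderedDict

    class FilterTable:
        """Sketch of a CHOP-style counting filter for recently accessed pages."""

        def __init__(self, capacity=1024, threshold=32):
            # capacity and threshold are assumptions for this sketch,
            # not values from the CHOP paper
            self.capacity = capacity
            self.threshold = threshold
            self.counters = OrderedDict()  # page id -> access count, LRU-ordered

        def access(self, page):
            """Record one access; return True when `page` should be cached on chip."""
            if page in self.counters:
                self.counters[page] += 1
                self.counters.move_to_end(page)        # refresh recency
            else:
                if len(self.counters) >= self.capacity:
                    self.counters.popitem(last=False)  # evict the coldest entry
                self.counters[page] = 1
            if self.counters[page] >= self.threshold:  # the page is hot
                del self.counters[page]                # stop tracking it
                return True                            # caller caches it on chip
            return False

A DRAM-cache controller would call access() on every page-granularity request and allocate the page on chip when it returns True; because only recently accessed pages are tracked, the filter is far smaller than a tag store covering every page in the DRAM cache.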
Reuse
"Adaptive Insertion Policies for High Performance Caching," M. K. Qureshi et al., ISCA 2007
"Adaptive Insertion Policies for Managing Shared Caches," A. Jaleel et al., PACT 2008

Adaptive Insertion Policies
- Problem: cache thrashing.
- Key insight: account for reuse.
- Solution 1: LRU Insertion (LIP): insert incoming blocks at the LRU position instead of the MRU position.
- Solution 2: Bimodal Insertion (BIP): insert at MRU only occasionally, where e is a random value in [0, 1):
      if (e > 1/32) LRU Insertion else MRU Insertion
- Solution 3: Dynamic Insertion (DIP): choose between MRU insertion and Bimodal Insertion at runtime via set dueling. A few dedicated sets always use each policy; a saturating PSEL counter, updated on misses to those sets, selects the policy for the remaining follower sets.

Results: [figure in the original slides]

Adaptive Insertion Policies: Tradeoffs
Advantages
- Retains cache blocks with high reuse.
- Adapts to program phase changes.
- Higher cache hit rate than MRU insertion: avoids thrashing.
- Aware of individual workload characteristics.
- Low implementation overhead.
Disadvantages
- Slightly harmful for workloads with good temporal locality.

Synthesis

Micro-Pages
- Goal: improve row buffer locality.
- Memory hierarchy: main memory.
- Granularity: 1 KB u-pages in a 4 KB row buffer.
- Row buffer locality: increases it.
- Reuse: top N most frequently accessed u-pages are placed in dedicated rows.
- Applies to PCM: applicable, but with wear and energy concerns.

CHOP
- Goal: reduce off-chip memory bandwidth; keep the tag store small.
- Memory hierarchy: on-chip DRAM cache.
- Granularity: OS page.
- Row buffer locality: does not consider it.
- Reuse: threshold based on the number of accesses.
- Applies to PCM: use PCM instead of a large DRAM main memory.

DIP
- Goal: retain high-reuse blocks; avoid thrashing.
- Memory hierarchy: last-level cache.
- Granularity: 64-byte cache blocks.
- Row buffer locality: N/A.
- Reuse: low-reuse blocks get evicted quickly due to LRU insertion.
- Applies to PCM: possible to adapt the insertion policy to account for reuse.

Thank You!

Appendix

Adaptive Insertion Policies: Solution for Multi-Core
- Decide on MRU or Bimodal Insertion on a per-thread basis.

Adaptive Insertion Policies: Problem: Cache Thrashing
Access pattern: 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, ... on a 4-way set with MRU insertion:
    MRU [4 3 2 1] LRU
    MRU [5 4 3 2] LRU
    MRU [1 5 4 3] LRU
    MRU [2 1 5 4] LRU
Every access misses: the policy takes no account of reuse.

Adaptive Insertion Policies: Key Insight: Account for Reuse
Solution 1: LRU Insertion, same access pattern 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, ...:
    MRU [4 3 2 1] LRU
    MRU [4 3 2 5] LRU
    MRU [4 3 2 1] LRU
    MRU [2 4 3 1] LRU
    MRU [3 2 4 1] LRU
Part of the working set now stays resident, so most accesses hit.
Access pattern: 6, 7, 8, 9, 10, 6, 7, 8, 9, 10, ...: when the working set changes, pure LRU insertion never promotes the new blocks past the LRU position, so the stale blocks linger; this is what Bimodal Insertion's occasional MRU insertions fix.
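The appendix examples can be reproduced with a small simulation. The sketch below, in Python, models a single 4-way set and compares MRU insertion, LRU insertion (LIP), and Bimodal Insertion (BIP) on the cyclic access pattern; epsilon = 1/32 follows the pseudocode above, while the associativity and pattern length are illustrative assumptions.

    import random

    WAYS = 4          # associativity of the single set modeled (assumption)
    EPSILON = 1 / 32  # BIP's probability of inserting at MRU (from the slides)

    def simulate(policy, pattern, seed=0):
        """Return the hit rate of 'mru', 'lip', or 'bip' on an access pattern."""
        rng = random.Random(seed)
        stack = []                      # index 0 is MRU, index -1 is LRU
        hits = 0
        for block in pattern:
            if block in stack:
                hits += 1
                stack.remove(block)
                stack.insert(0, block)  # all policies promote to MRU on a hit
            else:
                if len(stack) == WAYS:
                    stack.pop()         # evict the LRU block
                if policy == 'mru' or (policy == 'bip' and rng.random() < EPSILON):
                    stack.insert(0, block)   # insert at MRU
                else:
                    stack.append(block)      # insert at LRU (LIP, usually BIP)
        return hits / len(pattern)

    # The thrashing pattern from the appendix: 1..5 cycled through a 4-way set.
    pattern = list(range(1, 6)) * 1000
    for policy in ('mru', 'lip', 'bip'):
        print(policy, simulate(policy, pattern))

On this pattern MRU insertion essentially never hits, while LIP and BIP converge to roughly a 60% hit rate; appending the 6..10 pattern to the trace also shows BIP recovering after the working-set change, where pure LIP stays stuck on the stale blocks.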
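The set-dueling chooser from the Dynamic Insertion slide can be sketched in the same spirit. This is a minimal sketch only: the counter width and the choice of which set indices are dedicated to each policy are assumptions, not the tuned values from the DIP paper.

    class SetDuelingSelector:
        """Sketch of DIP's PSEL-based set-dueling chooser."""

        def __init__(self, bits=10, mru_sets=range(0, 32), bip_sets=range(32, 64)):
            # dedicated-set ranges and counter width are assumptions
            self.max = (1 << bits) - 1
            self.psel = self.max // 2       # saturating policy-selector counter
            self.mru_sets = set(mru_sets)   # sets hardwired to MRU insertion
            self.bip_sets = set(bip_sets)   # sets hardwired to BIP

        def on_miss(self, set_index):
            """Called on every cache miss with the missing set's index."""
            if set_index in self.mru_sets:      # MRU insertion just missed
                self.psel = min(self.max, self.psel + 1)
            elif set_index in self.bip_sets:    # BIP just missed
                self.psel = max(0, self.psel - 1)

        def follower_policy(self):
            # Follower sets copy whichever policy is currently missing less.
            return 'bip' if self.psel > self.max // 2 else 'mru'

On each miss the cache calls on_miss(), and on each insertion into a follower set it consults follower_policy(), so the bulk of the cache tracks whichever insertion policy suits the running workload; this is how DIP adapts across program phases.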