Transcript Document
Scalable High Performance Main Memory System Using PCM Technology
Moinuddin K. Qureshi, Viji Srinivasan, and Jude Rivers
IBM T. J. Watson Research Center, Yorktown Heights, NY
International Symposium on Computer Architecture (ISCA-2009)
18-Jul-15 © 2007 IBM Corporation

Main Memory Capacity Wall
More cores in system, more concurrency, larger working set
Demand for main memory capacity continues to increase
Main memory systems consisting of DRAM are hitting:
1. Cost wall: Major % of cost of large servers is main memory
2. Scaling wall: DRAM scaling to small technology is a challenge
3. Power wall:
IBM P670 Server           Processor    Memory
Small (4 proc, 16GB)      384 Watts    314 Watts
Large (16 proc, 128GB)    840 Watts    1223 Watts
Source: Lefurgy et al. IEEE Computer 2003
Need a practical solution to increase main-memory capacity

The Technology Hierarchy
More capacity by cheaper, denser, (slower) technology
[Figure: L1 (SRAM), EDRAM, DRAM, PCM, Flash, and HDD placed along an axis of typical access latency in processor cycles (@ 4 GHz), from 2^1 to 2^23; region labels as extracted: High-Performance Disk, Memory System]
Phase Change Memory (PCM) is a promising candidate for large capacity main memory

Outline
Introduction
What is PCM?
Hybrid Memory System
Evaluation
Lifetime Analysis
Summary

What is Phase Change Memory?
Phase change material (chalcogenide glass) exists in two states:
1. Amorphous: high resistivity
2. Crystalline: low resistivity
Materials can be switched between states reliably, quickly, and a large number of times
[Figure: PCM cell array with bit line and word lines]
PCM stores data in terms of resistance
• Low resistance (SET state) = 1
• High resistance (RESET state) = 0

How does PCM work?
Switching by heating using electrical pulses
SET: sustained current to heat cell above Tcryst
RESET: cell heated above Tmelt and quenched
[Figure: temperature vs. time (ns) for the two programming pulses; labels: RESET (Tmelt, large current), SET (Tcryst, small current). Photo of memory element and access device courtesy of Bipin Rajendran, IBM]
SET: low resistance, 10^3-10^4 Ω
RESET: high resistance, 10^6-10^7 Ω

Key Characteristics of PCM
+ Scales better than DRAM, small cell size
  Prototypes as small as 3nm x 20nm fabricated and tested [Raoux+ IBMJRD’08]
+ Can store multiple bits/cell
  More density in the same area
  Prototypes with 2 bits/cell in ISSCC’08. >2 bits/cell expected soon.
+ Non-volatile memory technology
  Data retention of 10 years
  Power implications, system implications
Challenges:
- More latency compared to DRAM
- Limited endurance (~10 million writes per cell)
- Write bandwidth constrained, so better to write less often

Outline
Introduction
What is PCM?
Hybrid Memory System
Evaluation
Lifetime Analysis
Summary

Hybrid Memory System
[Figure: Processor, DRAM Buffer with tag store (T = Tag-Store), PCM Write Queue, PCM Main Memory, and Flash or HDD, connected by data paths]
Hybrid Memory System:
1. DRAM as cache to tolerate PCM Rd/Wr latency and Wr bandwidth
2.
PCM as main memory to provide large capacity at good cost/power

Lazy Write Architecture
Problem: Double PCM writes to dirty pages on install
[Figure: Processor, DRAM Buffer, PCM with write queue (WRQ), and Flash/Disk]
For example, the Daxpy kernel: Y[i] = Y[i] + X[i]
Baseline has 2 writes for Y[i] and 1 for X[i]
Lazy write has 1 write for Y[i] and 1 for X[i]

Line Level Write Back
Problem: Not all lines in a dirty page are dirty
Solution: Dirty bits per line in DRAM buffer, and write back only dirty lines from DRAM to PCM
[Figure: number of writes per dirty line (millions) vs. Line_id (0-15) for db1 and db2, with the average marked]
Problem: With LLWB, not all lines in dirty pages are written uniformly

Fine Grained Wear Leveling
Solution: Fine Grained Wear Leveling (FGWL)
- When a page gets allocated, the page is rotated by a random shift value
- The rotate value remains constant while the page remains in memory
- On replacement of a page, a new random value is assigned for the new page
- Over time, the write traffic per line becomes uniform
[Figure: number of writes per dirty line (millions) vs. Line_id (0-15) for db1 and db2, with the average marked]
FGWL makes writes across lines in a dirty page uniform

Outline
Introduction
What is PCM?
Hybrid Memory System
Evaluation
Lifetime Analysis
Summary

Evaluation Framework
Trace-driven simulator: 16-core system (simple core), 8GB DRAM main memory at 320 cycles, HDD (2 ms) with Flash (32 us), with Flash hit rate of 99%
Workloads: Database workloads & data parallel kernels
1. Database workloads: db1 and db2
2. Unix utilities: qsort and binary search
3. Data mining: K-means and Gauss-Seidel
4.
Streaming: DAXPY and Vector Dot Product
Assumption: PCM 4X denser & 4X slower than DRAM (32GB @ 1280 cycle read latency)

Reduction in Page Faults
[Figure: page faults normalized to the 8GB system for 4GB, 8GB, 16GB, and 32GB configurations, across db1, db2, qsort, bsearch, kmeans, gauss, daxpy, and vdotp; annotations: Benefit from capacity, Need >16GB, Streaming]

Impact on Execution Time
[Figure: normalized execution time for 8GB DRAM, 32GB PCM, 32GB DRAM, and 32GB PCM + 1GB DRAM, across db1, db2, qsort, bsearch, kmeans, gauss, daxpy, vdotp, and gmean]
PCM with DRAM buffer performs similar to equal capacity DRAM storage

Impact of PCM Latency
[Figure: average normalized execution time for DRAM-8GB (1X), PCM-32GB (2X, 4X, 8X, 16X), HYBRID (1+32)GB (2X, 4X, 8X, 16X), and DRAM-32GB (1X)]
Hybrid memory system is relatively insensitive to PCM latency

Power Evaluations
[Figure: power, energy, and energy x delay, normalized to 8GB DRAM, for 8GB DRAM, Hybrid (32GB PCM + 1GB DRAM), and 32GB DRAM]
Significant power and energy savings with PCM-based hybrid memory system

Outline
Introduction
What is PCM?
Hybrid Memory System
Evaluation
Lifetime Analysis
Summary

Impact of Write Endurance
B: Bytes/cycle written to PCM
S: PCM capacity in bytes
Wmax: Max writes per PCM cell
F: Frequency of system (4GHz)
Y: Number of years (lifetime)
Assuming uniform writes to PCM:
Endurance (in cycles) = $(S/B) \cdot W_{max}$
Num. cycles in Y years = $Y \cdot F \cdot 2^{25}$
$Y = \frac{(S/B) \cdot W_{max}}{F \cdot 2^{25}}$
There are approximately $2^{25}$ seconds in a year
For a 4GHz system, a 32GB PCM written at 1 byte per cycle: $Y = W_{max} / (4 \text{ million})$
If Wmax = 10 million, PCM will last for 2.5 years

Lifetime Results
Table shows average bytes per cycle written to PCM and average lifetime of PCM assuming Wmax = 10 million
Configuration Avg. Bytes/Cycle Avg.
Lifetime
1GB DRAM + 32GB PCM        0.807    3.0 yrs
+ Lazy Write               0.725    3.4 yrs
+ Line Level Write Back    0.316    7.6 yrs
+ Bypass Streaming Apps    0.247    9.7 yrs
Proposed filtering techniques reduce write traffic to PCM by 3.2X, increasing its lifetime from 3 to 9.7 years

Outline
Introduction
What is PCM?
Hybrid Memory System
Evaluation
Lifetime Analysis
Summary

Summary
Need more main memory capacity: DRAM hitting power, cost, scaling wall
PCM is an emerging technology – 4x denser than DRAM but with slower access time and limited write endurance
We propose a Hybrid Memory System (DRAM+PCM) that provides significant power and performance benefits
Proposed write filtering techniques reduce writes by 3x and increase PCM lifetime from 3 years to 9 years
Not touched in this talk but important: Exploiting non-volatile memories for system enhancement & related OS issues.

Thanks!
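
As a rough check on the Impact of Write Endurance slide, the lifetime formula Y = (S/B) x Wmax / (F x 2^25) can be sketched in a few lines of Python. This is only an illustration under the slide's stated assumptions (uniform writes to PCM, about 2^25 seconds in a year); the function and variable names are hypothetical and do not come from the talk.

```python
# Sketch of the slide's lifetime formula: Y = (S/B) * Wmax / (F * 2^25).
# All names here are illustrative assumptions, not taken from the talk.

SECONDS_PER_YEAR = 2 ** 25  # the slide's power-of-two approximation of one year


def pcm_lifetime_years(capacity_bytes, bytes_per_cycle, w_max, freq_hz):
    """Years until every cell has seen w_max writes, assuming uniform writes."""
    # Cycles needed to write the whole capacity w_max times at the given rate.
    endurance_cycles = (capacity_bytes / bytes_per_cycle) * w_max
    # Processor cycles that elapse in one (approximate) year.
    cycles_per_year = freq_hz * SECONDS_PER_YEAR
    return endurance_cycles / cycles_per_year


if __name__ == "__main__":
    S = 32 * 2 ** 30      # 32GB PCM capacity in bytes
    F = 4e9               # 4GHz system
    W_MAX = 10_000_000    # 10 million writes per cell
    rates = [
        ("1 byte/cycle", 1.0),
        ("baseline average", 0.807),
        ("all filters average", 0.247),
    ]
    for label, b in rates:
        years = pcm_lifetime_years(S, b, W_MAX, F)
        print(f"{label}: {years:.2f} years")
```

At 1 byte per cycle this evaluates to about 2.56 years, in line with the slide's rounded figure of 2.5 years. Plugging the table's average bytes/cycle into the formula is not expected to reproduce the table's lifetimes exactly, since the table reports average lifetime rather than the lifetime at the average write rate.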