Scalable High Performance Main Memory
System Using PCM Technology
Moinuddin K. Qureshi
Viji Srinivasan and Jude Rivers
IBM T. J. Watson Research Center, Yorktown Heights, NY
International Symposium on Computer Architecture (ISCA-2009)
18-Jul-15
© 2007 IBM Corporation
Main Memory Capacity Wall
More cores in system → More concurrency → Larger working set
Demand for main memory capacity continues to increase
Main memory systems consisting of DRAM are hitting:
1. Cost wall: Major % of cost of large servers is main memory
2. Scaling wall: Scaling DRAM to smaller technology nodes is a challenge
3. Power wall:
   IBM P670 Server          Processor    Memory
   Small (4 proc, 16GB)     384 Watts    314 Watts
   Large (16 proc, 128GB)   840 Watts    1223 Watts
   Source: Lefurgy et al. IEEE Computer 2003
Need a practical solution to increase main-memory capacity
The Technology Hierarchy
More capacity by cheaper, denser, (slower) technology
[Figure: technologies placed by typical access latency in processor cycles (@ 4 GHz), on an axis from 2^1 to 2^23: L1 (SRAM), EDRAM, DRAM, PCM, Flash, HDD. Axis regions are labeled "Memory System" and "High-Performance Disk".]
Phase Change Memory (PCM) is a promising candidate
for large-capacity main memory
Outline
• Introduction
• What is PCM?
• Hybrid Memory System
• Evaluation
• Lifetime Analysis
• Summary
What is Phase Change Memory?
Phase change material (chalcogenide glass) exists in two states:
1. Amorphous: high resistivity
2. Crystalline: low resistivity
The material can be switched between states
reliably, quickly, and a large number of times
[Figure: PCM cell array with bit line and word lines.]
PCM stores data in terms of resistance
• Low resistance (SET state) = 1
• High resistance (RESET state) = 0
How does PCM work?
Switching by heating using electrical pulses
SET: sustained current to heat cell above Tcryst
RESET: cell heated above Tmelt and quenched
[Figure: temperature vs. time (ns) for the RESET pulse (above Tmelt) and the SET pulse (above Tcryst). Cell diagram labels: memory element, access device, large current, small current.]
SET state: low resistance, 10^3-10^4 Ω
RESET state: high resistance, 10^6-10^7 Ω
Photo Courtesy: Bipin Rajendran, IBM
Key Characteristics of PCM
+ Scales better than DRAM, small cell size
Prototypes as small as 3 nm x 20 nm fabricated and tested [Raoux+ IBMJRD’08]
+ Can store multiple bits/cell → More density in the same area
Prototypes with 2 bits/cell in ISSCC’08. >2 bits/cell expected soon.
+ Non-Volatile Memory Technology
Data retention of 10 years → Power implications, system implications
Challenges:
- Higher latency than DRAM
- Limited endurance (~10 million writes per cell)
- Write bandwidth is constrained, so it is better to write less often
Hybrid Memory System
[Figure: block diagram of the hybrid memory system. Labels: Processor, DRAM Buffer with tag store (T = Tag-Store), PCM Write Queue, PCM Main Memory, Flash or HDD.]
Hybrid Memory System:
1. DRAM as cache to tolerate PCM Rd/Wr latency and Wr bandwidth
2. PCM as main-memory to provide large capacity at good cost/power
Lazy Write Architecture
Problem: Double PCM writes to dirty pages on install
[Figure: block diagram with Processor, DRAM Buffer, PCM with write queue (WRQ), and Flash/Disk.]
For example: Daxpy Kernel: Y[i] = Y[i] + X[i]
Baseline has 2 writes for Y[i] and 1 for X[i]
Lazy write has 1 write for Y[i] and 1 for X[i]
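The write counts in the DAXPY example can be reproduced with a small Python sketch. It assumes, as one reading of this slide rather than something it states, that the baseline installs a faulted page into both PCM and the DRAM buffer, that lazy write installs it only in the DRAM buffer, and that on eviction a page is written to PCM if it is dirty or not yet present in PCM. The function name and flags are hypothetical.

```python
def pcm_page_writes(dirtied: bool, lazy_write: bool) -> int:
    """Count PCM writes for one page, from page-fault install to DRAM eviction.

    Assumed policy (illustrative only):
      - baseline: the page is written to PCM when it is installed
      - lazy write: the install goes to the DRAM buffer only
      - eviction: write to PCM if the page is dirty or not yet in PCM
    """
    writes = 0
    present_in_pcm = False

    if not lazy_write:
        writes += 1            # baseline install also writes PCM
        present_in_pcm = True

    if dirtied or not present_in_pcm:
        writes += 1            # write-back on eviction from the DRAM buffer

    return writes


# DAXPY: Y[i] is read and written (dirty), X[i] is only read (clean).
for policy, lazy in (("baseline", False), ("lazy write", True)):
    y_writes = pcm_page_writes(dirtied=True, lazy_write=lazy)
    x_writes = pcm_page_writes(dirtied=False, lazy_write=lazy)
    print(f"{policy}: Y pages = {y_writes} write(s), X pages = {x_writes} write(s)")
```

Under these assumptions the sketch gives 2 and 1 writes for the baseline and 1 and 1 for lazy write, matching the counts on the slide.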
Line Level Write Back
Problem: Not all lines in a dirty page are dirty
Solution: Dirty bits per line in DRAM buffer and
write-back only dirty lines from DRAM to PCM
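A minimal Python sketch of the per-line dirty-bit idea follows. The class and method names are hypothetical, and the 16 lines per page is an assumption taken from the line ids 0 to 15 on the figure axis, not a figure stated on this slide.

```python
LINES_PER_PAGE = 16  # assumption: line ids 0..15, as on the figure axis


class BufferedPage:
    """A page held in the DRAM buffer, with one dirty bit per line."""

    def __init__(self) -> None:
        self.dirty = [False] * LINES_PER_PAGE

    def write_line(self, line_id: int) -> None:
        # Any write to a line in the DRAM buffer marks that line dirty.
        self.dirty[line_id] = True

    def lines_to_write_back(self) -> list:
        # On eviction, only dirty lines are sent to PCM.
        return [i for i, is_dirty in enumerate(self.dirty) if is_dirty]


page = BufferedPage()
for line_id in (3, 3, 7):          # repeated writes to a line still cost one write-back
    page.write_line(line_id)

print(page.lines_to_write_back())  # lines written to PCM instead of all 16
```

In this example only lines 3 and 7 go to PCM on eviction, where a page-granularity write-back would write all 16 lines.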
[Figure: average number of writes per line (millions, y-axis 0 to 20) versus Line_id (0 to 15), for db1 and db2.]
Problem: With LLWB, not all lines in dirty pages are written uniformly
Fine Grained Wear Leveling
Solution: Fine Grained Wear Leveling (FGWL)
- When a page is allocated, it is rotated by a random shift value
- The rotate value remains constant while the page remains in memory
- On replacement of a page, a new random value is assigned for the new page
- Over time, the write traffic per line becomes uniform
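The rotation steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the rotation is taken to be at line granularity within a 16-line page (from the line ids 0 to 15 on the figure axis), each allocation is modeled as contributing one write to a single hot logical line, and all names are hypothetical.

```python
import random

LINES_PER_PAGE = 16  # assumption: line ids 0..15, as on the figure axis


def physical_line(logical_line: int, rotate: int) -> int:
    """Map a logical line to its physical slot under a per-page rotate value."""
    return (logical_line + rotate) % LINES_PER_PAGE


def wear_after(allocations: int, hot_line: int, use_fgwl: bool, seed: int = 0) -> list:
    """Per-physical-line write counts when one logical line takes every write.

    A new rotate value is drawn at each page allocation and stays fixed
    until that page is replaced, following the steps listed above.
    """
    rng = random.Random(seed)
    wear = [0] * LINES_PER_PAGE
    for _ in range(allocations):
        rotate = rng.randrange(LINES_PER_PAGE) if use_fgwl else 0
        wear[physical_line(hot_line, rotate)] += 1
    return wear


without = wear_after(16000, hot_line=3, use_fgwl=False)
with_fgwl = wear_after(16000, hot_line=3, use_fgwl=True)
print("no FGWL  :", max(without), "writes on the most-worn line")
print("with FGWL:", max(with_fgwl), "writes on the most-worn line")
```

Without rotation every write lands on the same physical line. With a fresh random rotate value per allocation, the same traffic spreads across all 16 physical lines as allocations accumulate.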
[Figure: average number of writes per line (millions, y-axis 0 to 20) versus Line_id (0 to 15), for db1 and db2, with FGWL.]
FGWL makes writes across lines in a dirty page uniform
Evaluation Framework
Trace Driven Simulator:
16-core system (simple core), 8GB DRAM main-memory at 320 cycles
HDD (2 ms) with Flash (32 us); Flash hit rate of 99%
Workloads:
Database workloads & Data parallel kernels
1. Database workloads: db1 and db2
2. Unix utilities: qsort and binary search
3. Data Mining: K-means and Gauss-Seidel
4. Streaming: DAXPY and Vector Dot Product
Assumption:
PCM 4X denser & 4X slower than DRAM → 32GB @ 1280 cycle read latency
Reduction in Page Faults
[Figure: page faults normalized to the 8GB system (y-axis 0 to 2.2) for 4GB, 8GB, 16GB, and 32GB memory. Workload groups: db1, db2, qsort, bsearch ("Benefit from capacity"); kmeans, gauss ("Need >16GB"); daxpy, vdotp ("Streaming").]
Impact on Execution Time
[Figure: normalized execution time (y-axis 0 to 1.1) for 8GB DRAM, 32GB PCM, 32GB DRAM, and 32GB PCM + 1GB DRAM, across db1, db2, qsort, bsearch, kmeans, gauss, daxpy, vdotp, and gmean.]
PCM with DRAM buffer performs similarly to equal-capacity DRAM storage
Impact of PCM Latency
[Figure: average normalized execution time (y-axis 0 to 1.1) for DRAM-8GB (1X), PCM-32GB at 2X, 4X, 8X, and 16X latency, HYBRID (1+32)GB at 2X, 4X, 8X, and 16X latency, and DRAM-32GB (1X).]
Hybrid memory system is relatively insensitive to PCM Latency
Power Evaluations
[Figure: Power, Energy, and Energy x Delay, normalized to 8GB DRAM (y-axis 0 to 2.2), for 8GB DRAM, Hybrid (32GB PCM + 1GB DRAM), and 32GB DRAM.]
Significant Power and Energy savings with PCM based hybrid memory system
Impact of Write Endurance
B = Bytes/cycle written to PCM
S = PCM capacity in bytes
Wmax = Max writes per PCM cell
F = Frequency of system (4 GHz)
Y = Number of years (lifetime)
Assuming uniform writes to PCM:
Endurance (in cycles) = $(S/B) \cdot W_{max}$
Num. cycles in Y years = $Y \cdot F \cdot 2^{25}$ (there are approximately $2^{25}$ seconds in a year)
$$Y = \frac{(S/B) \cdot W_{max}}{F \cdot 2^{25}}$$
For a 4 GHz system with a 32GB PCM written at 1 byte per cycle:
$$Y = \frac{W_{max}}{4 \text{ million}}$$
If Wmax = 10 million, PCM will last for 2.5 years
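The lifetime relation above can be checked numerically with a short Python sketch. The function and parameter names are hypothetical, and the sketch takes 4 GHz as 4e9 cycles per second and 32GB as 32 x 2^30 bytes, which are assumptions about how the slide's numbers are meant.

```python
SECONDS_PER_YEAR = 2 ** 25  # the slide's power-of-two approximation


def pcm_lifetime_years(capacity_bytes: float, bytes_per_cycle: float,
                       w_max: float, freq_hz: float) -> float:
    """Y = (S/B) * Wmax / (F * 2^25), assuming writes are spread uniformly."""
    endurance_cycles = (capacity_bytes / bytes_per_cycle) * w_max
    cycles_per_year = freq_hz * SECONDS_PER_YEAR
    return endurance_cycles / cycles_per_year


years = pcm_lifetime_years(
    capacity_bytes=32 * 2 ** 30,  # 32GB PCM
    bytes_per_cycle=1.0,          # 1 byte written per cycle
    w_max=10_000_000,             # 10 million writes per cell
    freq_hz=4e9,                  # 4 GHz
)
print(f"{years:.2f} years")       # 2.56 years
```

With these inputs the result is 2.56 years, which the slide rounds to 2.5 years. Halving the write rate B doubles the lifetime, which is why the write-filtering techniques matter.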
Lifetime Results
Table shows average bytes per cycle written to PCM and
average lifetime of PCM assuming Wmax = 10 million

Configuration              Avg. Bytes/Cycle   Avg. Lifetime
1GB DRAM + 32GB PCM        0.807              3.0 yrs
+ Lazy Write               0.725              3.4 yrs
+ Line Level Write Back    0.316              7.6 yrs
+ Bypass Streaming Apps    0.247              9.7 yrs
Proposed filtering techniques reduce write traffic to PCM by 3.2X,
increasing its lifetime from 3 to 9.7 years
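As a rough consistency check on the table, the Python sketch below computes the overall write reduction and rescales the first row's lifetime by the ratio of write rates, since lifetime is inversely proportional to bytes written per cycle. The rescaling is an approximation introduced here: the reported lifetimes are averages, so an exact match is not expected. The names are hypothetical.

```python
# (configuration, average bytes/cycle written to PCM, reported lifetime in years)
RESULTS = [
    ("1GB DRAM + 32GB PCM", 0.807, 3.0),
    ("+ Lazy Write", 0.725, 3.4),
    ("+ Line Level Write Back", 0.316, 7.6),
    ("+ Bypass Streaming Apps", 0.247, 9.7),
]


def write_reduction(results) -> float:
    """Ratio of the first configuration's write rate to the last one's."""
    return results[0][1] / results[-1][1]


def scaled_lifetimes(results) -> list:
    """Estimate each lifetime by scaling the first row's lifetime by 1/B."""
    base_rate, base_years = results[0][1], results[0][2]
    return [base_years * base_rate / rate for _, rate, _ in results]


print(f"write reduction: {write_reduction(RESULTS):.2f}x")
for (name, _, reported), estimate in zip(RESULTS, scaled_lifetimes(RESULTS)):
    print(f"{name:26s} reported {reported:4.1f} yrs, scaled estimate {estimate:4.1f} yrs")
```

The write reduction comes out a little above 3.2x, and the scaled estimates stay within about 0.1 years of the reported lifetimes.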
Summary
• Need more main memory capacity: DRAM is hitting power, cost, and scaling walls
• PCM is an emerging technology: 4x denser than DRAM, but with slower access time and limited write endurance
• We propose a Hybrid Memory System (DRAM+PCM) that provides significant power and performance benefits
• Proposed write filtering techniques reduce writes by 3x and increase PCM lifetime from 3 years to 9 years
Not touched in this talk but important: exploiting non-volatile memories for system enhancement, and related OS issues.
Thanks!