Designing a Fast and Adaptive Error Correction Scheme for Increasing the Lifetime of Phase Change Memories
Rudrajit Datta and Nur A. Touba
Computer Engineering Research Center
Dept. of Electrical and Computer Engineering
University of Texas at Austin
Introduction
Challenges for traditional memories
• Scalability
• Device leakage
• Retention time
Phase Change Memories (PCM) – a possible substitute
• Non-volatile
• Amenable to process scaling
• High density – 4x DRAM [Seznec 10]
Phase Change Memories
Crystalline state
• Low resistance – ‘1’
Amorphous state
• High resistance – ‘0’
Thermally induced state changes
Scalable
Disadvantages
• Relatively quick degradation [Fantini 06]
– ~10^7 writes [Ferreira 10]
• Slow writes
PCM in place of DRAM – fix PCM reliability
Previous Work
Hybrid PCM/DRAM [Zhang 09]
• OS level paging scheme
• BCH code correcting up to 7 errors
– Slow
Spread/minimize PCM writes
• [Ferreira 10] – minimize PCM writes
• [Lee 09] – buffer reorganization and partial writes
Previous Work
Architectural solutions so far
None using novel error correction code (ECC)
• PCM errors increasing function of time
– Function of writes/cell
• Very different from traditional DRAM
– Increasing permanent errors
Proposed Scheme
Adaptive Error Correction
• OS monitors errors corrected
• Signals memory controller
– Increase number of check bits
Physical line size of memory unchanged
• More check bits, fewer data bits
Main memory to cache bandwidth affected
• Gradually decreasing cache line size
• Minimal performance impact
Orthogonal Latin Square (OLS) codes used
• Fast – single step decode
• Modular
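The monitoring loop described above could be sketched roughly as follows. This is only an illustration: the `LineEccState` structure, the `record_read` function, and the switching rule (switch once a read needs the full correction capability of the regular code) are assumptions made for the example, not details taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class LineEccState:
    """Per-line bookkeeping an OS might keep (hypothetical structure)."""
    enhanced: bool = False   # True once the line uses the enhanced ECC layout
    max_corrected: int = 0   # largest number of errors corrected in a single read


def record_read(state: LineEccState, errors_corrected: int, regular_t: int = 3) -> bool:
    """Update the state after a read.

    Returns True when the memory controller should be signalled to switch
    this line to the enhanced layout. Assumed policy: switch as soon as a
    read uses the full correction capability (regular_t) of the regular code.
    """
    state.max_corrected = max(state.max_corrected, errors_corrected)
    if not state.enhanced and errors_corrected >= regular_t:
        state.enhanced = True
        return True
    return False
```

A real policy would likely be tuned to the observed error growth; the threshold here is a placeholder.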
Proposed Scheme
[Figure: a memory line holding Word 1 to Word 4 plus OLS check bits; under enhanced ECC the line holds Word 1 to Word 3 plus OLS check bits, and Word 1 to Word 3 are passed into the cache]
Proposed Scheme
[Figure: block diagram with the labels Data, Regular Check-bit Generator, Enhanced Check-bit Generator, Signal from OS, Main Memory, Information Bits, Check Bits, and Corrected Data]
Orthogonal Latin Square Codes
Latin Square
• m x m array
• Each row and each column is a permutation of the digits 0, 1, …, m-1
Orthogonal Latin Squares
• Each ordered pair of elements (r, c, s) appears only once
m^2 data bits, 2tm check bits, t-error correctable
[Hsiao 70]
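To make the definitions concrete, here is a minimal sketch that builds Latin squares and checks orthogonality. It assumes m is prime and uses the standard construction with entry (k*i + j) mod m; the function names are illustrative, and this construction does not cover non-prime sizes such as m = 16. The last helper simply evaluates the 2tm check-bit count quoted above.

```python
def latin_square(m: int, k: int) -> list:
    """m x m square with entry (k*i + j) mod m.

    This is a Latin square when m is prime and 1 <= k < m.
    """
    return [[(k * i + j) % m for j in range(m)] for i in range(m)]


def is_latin(square: list) -> bool:
    """Check that every row and every column is a permutation of 0..m-1."""
    m = len(square)
    digits = set(range(m))
    rows_ok = all(set(row) == digits for row in square)
    cols_ok = all({square[i][j] for i in range(m)} == digits for j in range(m))
    return rows_ok and cols_ok


def are_orthogonal(a: list, b: list) -> bool:
    """Two squares are orthogonal if superimposing them yields m*m distinct ordered pairs."""
    m = len(a)
    pairs = {(a[i][j], b[i][j]) for i in range(m) for j in range(m)}
    return len(pairs) == m * m


def ols_check_bits(m: int, t: int) -> int:
    """Check bits of a t-error-correcting OLS code on m^2 data bits."""
    return 2 * t * m
```

For example, `latin_square(5, 1)` and `latin_square(5, 2)` are both Latin squares and are orthogonal to each other, while a square is never orthogonal to itself for m > 1.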
Adaptive ECC
Increase number of check bits per line
Break up line into small segments
• Based on number of data bits
Implement ECC separately on each segment
• Constraint – original line size unchanged
• (Data + ECC)_Original = Σ_Segments (Data_Segment + ECC_Segment)
Overall error tolerance goes up
Adaptive ECC
[Figure: line layouts. Original line: Word 1 to Word 4 plus ECC_OLS. Enhanced ECC: Word 1 to Word 3 plus ECC_OLS. Enhanced Adaptive ECC: Segment 1 ECC1, Segment 2 ECC2, Segment 3 ECC3, Segment 4 ECC4]
Adaptive ECC – Numerical example
Original configuration
• 3-bit OLS code on 256-bit line – total 352 bits
• Corrects all error patterns of 3 bits or fewer
Increased check-bits
• 25% of the data bits now store ECC, leaving 192 data bits
– 2 64-bit data segments
– 4 16-bit data segments
• Check-bits – (352 – 192) = 160
– 3-bit OLS on the 64-bit segments
– 2-bit OLS on the 16-bit segments
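The bit counts in this example can be checked against the OLS formula (m^2 data bits, 2tm check bits). The sketch below assumes the segment sizes correspond to m = 16, 8, and 4; the helper names are illustrative only.

```python
def ols_bits(m: int, t: int) -> tuple:
    """(data bits, check bits) for a t-error-correcting OLS code: m^2 and 2tm."""
    return m * m, 2 * t * m


def totals(segments: list) -> tuple:
    """Sum (data, check, data + check) over all segments of a line."""
    data = sum(d for d, _ in segments)
    check = sum(c for _, c in segments)
    return data, check, data + check


# Original configuration: one 256-bit segment (m = 16) with t = 3.
original = [ols_bits(16, 3)]

# Enhanced configuration: two 64-bit segments (m = 8, t = 3)
# and four 16-bit segments (m = 4, t = 2).
enhanced = [ols_bits(8, 3)] * 2 + [ols_bits(4, 2)] * 4

print(totals(original))  # (256, 96, 352)
print(totals(enhanced))  # (192, 160, 352)
```

Both layouts fill the same 352-bit physical line, which is the constraint that the original line size stays unchanged.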
Adaptive ECC – Numerical example
Enhanced ECC configuration corrects
• 99.97% of 3-bit errors
• 99.73% of 4-bit errors
• …..
• Small fraction of 14-bit errors
Segmented ECC implementation boosts error tolerance
Results
Memory    Fraction of Memory Used for Storing Extra Check-bits
Size      0.0      0.25     0.5      0.75
128MB     0.008    0.015    0.213    1.190
256MB     0.006    0.042    0.205    1.117
1GB       0.005    0.026    0.154    0.989
4GB       0.003    0.020    0.125    0.916

Error Tolerance (no. of errors / no. of bits * 100) for varying memory sizes
Results
Percentage of operational memory lines versus
number of errors injected out of 100,000 experiments
Results
[Figure: Bit-error Rate Tolerance (%) versus Time, comparing Proposed_Scheme with 7-error BCH [Zhang 09]]
Results
SPEC2006 Benchmarks
Results
SPEC2006 Benchmark – bzip2
Conclusion
Novel error correction scheme for PCM
• Fast
• Adaptive
– Graceful decrease in memory capacity
• Increases PCM lifetime
– Switching period (to enhanced ECC) on the order of years