The Efficacy of Error Mitigation Techniques for DRAM Retention Failures Samira Khan, Donghyuk Lee, Yoongu Kim, Alaa R.


The Efficacy of
Error Mitigation Techniques
for DRAM Retention Failures
Samira Khan, Donghyuk Lee, Yoongu Kim,
Alaa R. Alameldeen, Chris Wilkerson, and
Onur Mutlu
Motivation
[Figure: technology scaling shrinks DRAM cells]
Scaling DRAM cells results in more failures
• Longer manufacture-time tests
• Lower yield
• Higher cost
2
Vision: Online Profiling
[Figure: the system detects and mitigates errors in DRAM cells]
Detect and mitigate errors after the system has become operational
Reduces cost of testing, increases yield, enables scaling
What is the effectiveness of system-level detection and mitigation techniques?
3
Summary
• We analyze the efficacy of testing, guardbanding,
ECC, and recent techniques
– Using experimental data from real DRAMs
• Key Conclusions
– Testing alone cannot guarantee reliable operation
– A combination of ECC, testing, and guardbanding is
more effective
– Testing+ECC-based techniques block memory for
significant time → Performance degradation
• We propose a possible online profiling
mechanism
4
Outline
• DRAM Scaling Problem
• Online Profiling as a Solution
• Efficacy of System-Level Detection and Mitigation
– Simple Techniques
– Recently Proposed Techniques
• Towards an Online Profiling System
• Conclusion
5
Retention Failure
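[Figure: a DRAM cell built from a switch and a capacitor; the capacitor leaks charge, so the cell is refreshed every 64 ms. A timeline compares the retention time of cells against the 64 ms refresh interval]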
8
Intermittent Retention Failure
DRAM Cells
• Some retention failures are intermittent
• Two characteristics of intermittent retention failures
1 Data Pattern Sensitivity
2 Variable Retention Time
9
1 Data Pattern Sensitivity
[Figure: interference (noise) from neighboring cells; the same cell fails with one neighboring data pattern and does not fail with another]
Some cells can fail depending on the data stored in neighboring cells
10
2 Variable Retention Time
[Figure: retention time (ms, 0 to 640) of a cell plotted over time]
The retention time of some cells changes at random points in time
11
Testing for Retention Failures
[Figure: manufacturing-time testing sorts DRAM chips into PASS and FAIL]
Manufacturers perform exhaustive testing of the DRAM chips
Chips failing tests are discarded
12
DRAM Scaling Problem
[Figure: manufacturing-time testing sorts DRAM chips into PASS and FAIL]
More interference in smaller technology nodes leads to lower yield and higher cost
13
System-Level Online Profiling
1. Not fully tested during manufacture-time
2. Ship modules with possible failures
3. Detect and mitigate failures online
Increases yield, reduces cost, enables scaling
15
System-Level Online Profiling
What is the effectiveness of detection and mitigation
techniques for retention failures?
Our goal is to analyze the efficacy of
1. Simple Techniques
• Testing, Guardbanding, ECC
2. Recently Proposed Techniques
• ArchShield, RAIDR, SECRET, RAPID, VS-ECC, Hi-ECC
We analyze the effectiveness of these techniques
using experimental data from real DRAM
ArchShield ISCA'13, RAIDR ISCA'12, SECRET ICCD’12, RAPID HPCA'06,
VS-ECC ISCA'11, Hi-ECC ISCA’10
16
Methodology
FPGA-based testing infrastructure
Evaluated 96 chips from three major vendors
17
Efficacy of Simple Techniques
1 Testing
2 Guardbanding
3 Error Correcting Code
19
1 Testing
1. Write some pattern in the module
2. Wait for the refresh interval
3. Read and verify
Repeat
Test each module with different patterns for many rounds
Zeros (0000), Ones (1111), Tens (1010), Fives (0101), Random
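The write, wait, read-and-verify loop above can be sketched in Python. This is a minimal simulation, not the FPGA-based test code used for the measurements: the `SimulatedModule` class, its random failure model for weak cells, and every function name here are hypothetical, and the wait step is a method call rather than a real delay.

```python
import random

# Data patterns named on the slide; each maps a cell index to a bit.
PATTERNS = {
    "zeros": lambda i: 0,
    "ones": lambda i: 1,
    "tens": lambda i: (i + 1) % 2,   # 1010...
    "fives": lambda i: i % 2,        # 0101...
}


class SimulatedModule:
    """Hypothetical stand-in for a DRAM module (illustration only)."""

    def __init__(self, num_cells, weak_cells, fail_prob, seed=0):
        self.cells = [0] * num_cells
        self.weak_cells = set(weak_cells)  # cells that fail intermittently
        self.fail_prob = fail_prob         # chance a weak cell flips per wait
        self.rng = random.Random(seed)

    def write(self, bits):
        self.cells = list(bits)

    def wait_refresh_interval(self):
        # Assumed failure model: a weak cell loses its value at random.
        for i in self.weak_cells:
            if self.rng.random() < self.fail_prob:
                self.cells[i] ^= 1

    def read(self):
        return list(self.cells)


def run_rounds(module, num_cells, rounds, seed=1):
    """Write, wait, read and verify, for every pattern in every round."""
    rng = random.Random(seed)
    failing = set()
    for _ in range(rounds):
        patterns = dict(PATTERNS)
        # The random pattern uses fresh random bits in each round.
        random_bits = [rng.randint(0, 1) for _ in range(num_cells)]
        patterns["random"] = lambda i, bits=random_bits: bits[i]
        for make_bit in patterns.values():
            expected = [make_bit(i) for i in range(num_cells)]
            module.write(expected)           # 1. write some pattern
            module.wait_refresh_interval()   # 2. wait
            observed = module.read()         # 3. read and verify
            failing.update(
                i for i in range(num_cells) if observed[i] != expected[i]
            )
    return failing


module = SimulatedModule(num_cells=64, weak_cells=[3, 17, 42], fail_prob=0.05)
found = run_rounds(module, num_cells=64, rounds=200)
```

Because the weak cells in this toy model fail only occasionally, a single round can miss them, which is why the procedure is repeated for many rounds.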
20
Efficacy of Testing
[Figure: number of failing cells found (0 to 200,000) versus number of rounds (0 to 1000), for the ZERO, ONE, TEN, FIVE, and RAND patterns and for all patterns combined]
Only a few rounds can discover most of the failures
Even after hundreds of rounds, a small number of new cells keep failing
Testing alone cannot detect all possible failures
21
2 Guardbanding
• Adding a safety margin on the refresh interval
• Can avoid VRT failures
[Figure: 2X and 4X guardbands relative to the refresh interval]
Effectiveness of guardbanding depends on the difference between the retention times of a cell
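A small Python sketch can illustrate why the gap between a cell's retention times matters. It assumes a hypothetical two-state model of a variable-retention-time cell, and assumes that a guardbanded test checks the cell at `guardband` times the refresh interval while the cell happens to be in its high-retention state. The `VrtCell` class, the function names, and all numbers are illustrative, not measured data.

```python
from dataclasses import dataclass


@dataclass
class VrtCell:
    """Hypothetical two-state model of a variable-retention-time cell."""
    high_ms: float  # retention time in the high-retention state
    low_ms: float   # retention time in the low-retention state


def escapes_guardband(cell, refresh_ms, guardband):
    """True if the cell passes a guardbanded test yet can fail in operation.

    Assumes the cell is tested in its high state at guardband * refresh_ms
    and later drops to its low state during normal operation.
    """
    passes_test = cell.high_ms >= guardband * refresh_ms
    fails_later = cell.low_ms < refresh_ms
    return passes_test and fails_later


def count_escapes(cells, refresh_ms, guardband):
    return sum(escapes_guardband(c, refresh_ms, guardband) for c in cells)


refresh_ms = 64.0
cells = [
    VrtCell(high_ms=100.0, low_ms=60.0),   # retention times close together
    VrtCell(high_ms=200.0, low_ms=60.0),   # larger difference
    VrtCell(high_ms=500.0, low_ms=200.0),  # never below the refresh interval
]
escapes = {g: count_escapes(cells, refresh_ms, g) for g in (1, 2, 4)}
```

In this toy data set, a cell whose two retention times are close together is caught by a small guardband, while the cell with the larger difference is only caught by the 4X guardband.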
22
Efficacy of Guardbanding
[Figure: number of failing cells (log scale, 1 to 1,000,000) versus retention time (0 to 20 seconds)]
Most of the cells exhibit close-by retention times
There are few cells with large differences in retention times
Even a large guardband (5X) cannot detect 5-15% of the intermittently failing cells
23
3 Error Correcting Code
• Error Correcting Code (ECC)
– Additional information to detect errors and correct data
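As a rough illustration of why even single-error correction helps, the sketch below computes the probability that a word contains more failing bits than its code can correct. It assumes independent bit failures, which is a simplification. The per-bit failure probability is a made-up value, the 72-bit word corresponds to the common SECDED layout of 64 data bits plus 8 check bits, and the function name is hypothetical.

```python
from math import comb


def prob_uncorrectable(bit_fail_prob, word_bits, correctable):
    """Probability that more than `correctable` bits of a word fail.

    Sums the binomial tail directly, assuming independent bit failures.
    """
    return sum(
        comb(word_bits, k)
        * bit_fail_prob ** k
        * (1 - bit_fail_prob) ** (word_bits - k)
        for k in range(correctable + 1, word_bits + 1)
    )


p = 1e-6  # hypothetical per-bit failure probability
no_ecc = prob_uncorrectable(p, word_bits=64, correctable=0)
secded = prob_uncorrectable(p, word_bits=72, correctable=1)
```

With these assumed inputs the unprotected word fails with probability of roughly 6.4e-5, while the SECDED-protected word fails only when two or more bits fail together, roughly 2.6e-9.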
24
Effectiveness of ECC
[Figure: probability of new failure (log scale, 1E-18 to 1E+00) versus number of rounds (1 to 1000), for No ECC, SECDED, and SECDED with 2X guardband]
SECDED code reduces error rate by 100 times
Adding a 2X guardband reduces error rate by 1000 times
Combination of techniques reduces error rate by 10^7 times
A combination of mitigation techniques is much more effective
25
Efficacy of Recent Techniques
1 Bit Repair Techniques (in the paper)
2 Variable-Strength ECC (in the paper)
3 Higher-Strength ECC
27
Higher Strength ECC (Hi-ECC)
No testing, use strong ECC
But amortize cost of ECC over larger data chunk
Can potentially tolerate errors at the cost of
higher strength ECC
Hi-ECC ISCA'10
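To see how a larger data chunk amortizes the cost of strong ECC, the sketch below estimates check-bit overhead with the usual bound for binary BCH codes: at most m check bits per corrected error, where 2^m - 1 is at least the codeword length, plus one parity bit for the extra detected error. The chunk sizes are illustrative assumptions and are not taken from the slide. The function name is hypothetical.

```python
def bch_check_bits(data_bits, correctable):
    """Estimated check bits for a code correcting `correctable` errors and
    detecting one more: m * t + 1, with m the smallest integer such that
    2**m - 1 >= data_bits + m * t.
    """
    m = 1
    while (1 << m) - 1 < data_bits + m * correctable:
        m += 1
    return m * correctable + 1


# Overhead of a 4-error-correcting code (4EC5ED) for assumed chunk sizes.
overhead = {
    data_bits: bch_check_bits(data_bits, 4) / data_bits
    for data_bits in (64, 512, 8192)  # 8 B word, 64 B line, 1 KB chunk
}
```

The estimate reproduces the familiar 8 check bits for SECDED and 15 for DECTED on 64-bit words. For the four-error-correcting case, the relative overhead drops from about 45% on a 64-bit word to under 1% on a 1 KB chunk.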
28
Efficacy of Hi-ECC
[Figure: time to failure (in years, log scale, 1E-05 to 1E+25, with a 10-year reference line) versus number of rounds (1 to 10000), for 4EC5ED, 3EC4ED, DECTED, and SECDED, each with 2X guardband]
After starting with 4EC5ED, can reduce to 3EC4ED code after 2 rounds of tests
Can reduce to DECTED code after 10 rounds of tests
Can reduce to SECDED code after 7000 rounds of tests (4 hours)
Testing can help to reduce the ECC strength
29
Towards an Online Profiling System
Key Observations:
• Testing alone cannot detect all possible failures
• Combination of ECC and other mitigation
techniques is much more effective
– But degrades performance
• Testing can help to reduce the ECC strength
– Even when starting with a higher strength ECC
31
Towards an Online Profiling System
1. Initially protect DRAM with strong ECC
2. Periodically test parts of DRAM
3. Mitigate errors and reduce ECC
Run tests periodically after a short interval at smaller regions of memory
32
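One way to picture the three steps is the Python sketch below. It is only an illustration under stated assumptions, not the proposed mechanism itself: the `Region` class, the round-robin schedule, the `run_region_test` callback, and the rule that a region steps down one ECC level after a fixed number of test rounds are all hypothetical. The round thresholds are placeholder values borrowed from the Hi-ECC results in this deck.

```python
# ECC levels from strongest to weakest.
ECC_LEVELS = ["4EC5ED", "3EC4ED", "DECTED", "SECDED"]
# Assumed number of test rounds before a region may step down to a level.
ROUNDS_TO_REACH = {"3EC4ED": 2, "DECTED": 10, "SECDED": 7000}


class Region:
    """A small region of memory that is profiled independently."""

    def __init__(self, region_id):
        self.region_id = region_id
        self.ecc = ECC_LEVELS[0]        # step 1: start with strong ECC
        self.rounds_tested = 0
        self.mitigated_cells = set()


def profile_step(region, run_region_test):
    """Steps 2 and 3 for one region: test it, mitigate, maybe reduce ECC.

    `run_region_test` is a hypothetical callback that returns the set of
    failing cell addresses found in one round of tests on the region.
    """
    failures = run_region_test(region.region_id)
    region.mitigated_cells |= failures  # record newly found failing cells
    region.rounds_tested += 1
    level = ECC_LEVELS.index(region.ecc)
    if level + 1 < len(ECC_LEVELS):
        weaker = ECC_LEVELS[level + 1]
        if region.rounds_tested >= ROUNDS_TO_REACH[weaker]:
            region.ecc = weaker


def run_profiler(regions, run_region_test, steps):
    """Test one small region per step, round-robin across all regions."""
    for step in range(steps):
        profile_step(regions[step % len(regions)], run_region_test)


regions = [Region(0), Region(1)]
run_profiler(regions, lambda region_id: {region_id * 100}, steps=4)
```

Testing one small region per step keeps the rest of memory available to running programs, which is the motivation for profiling small regions at short intervals.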
Conclusion
• We analyze the efficacy of testing, guardbanding,
ECC, and recent techniques at system-level
– Using experimental data from real DRAMs
• Key Conclusions
– Testing alone cannot guarantee reliable operation
– A combination of techniques is more effective
– Testing+ECC-based techniques block memory for
significant time → Performance degradation
• We propose online profiling that runs in the
background without disrupting current programs
– Runs periodically on smaller regions of memory
34
Thank you
Full data set for 96 chips is available at
http://www.ece.cmu.edu/~safari/tools/dram-sigmetrics2014-fulldata.html
1 Bit Repair Techniques
1. Test DRAM module at boot up
2. Mitigate failures by repairing the bits
These techniques are vulnerable to new intermittent failures
ArchShield ISCA'13, RAIDR ISCA'12, SECRET ICCD’12, RAPID HPCA'06
48
Efficacy of Bit Repair Techniques
[Figure: time to failure (in days, 0 to 25) versus number of rounds (1 to 10^7, log scale), for No Guardband and 2X Guardband]
Will fail immediately even after initial testing of 10^4 rounds
System fails within 13 days, even after initial testing of 10^7 rounds
System fails within 23 days, even after initial testing of 10^7 rounds
Even longer tests are not sufficient to guarantee reliable operation
51
2 Variable-Strength ECC (VS-ECC)
1. Test DRAM module at boot up
2. Protect failed lines with strong ECC
Will fail as soon as there are two bit errors in SECDED lines
VS-ECC ISCA'11
52
Efficacy of VS-ECC
[Figure: time to failure (in years, log scale, 1E-08 to 1E+02, with a 10-year reference line) versus number of rounds (0 to 1000), for No Guardband and 2X Guardband]
Memory blocked for 19 minutes in 2GB DRAM
Memory blocked for 7 minutes in 2GB DRAM
With higher capacity DRAM, memory will be blocked for an unacceptable amount of time
55
Challenges and Opportunities
Challenges:
• Performance Overhead
• Mitigation Overhead
Testing
Opportunities:
• Enable Failure-aware Optimizations
56
Reduction in Error Rate in all Modules
[Figure: reduction in new failure rate (log scale, 1 to 100000) versus number of rounds (0 to 1000)]
57
Difference in Modules
[Figure: number of failing cells (in millions, 0 to 20) versus refresh interval (2 to 20 seconds), for modules A1-A4, B1-B4, and C1-C4]
58
Tested DRAM Modules
Manufacturer  Module Name  Assembly Date (Year-Week)  Number of Chips
A             A1           2013-18                    8
A             A2           2012-26                    8
A             A3           2013-18                    8
A             A4           2014-08                    8
B             B1           2012-37                    8
B             B2           2012-37                    8
B             B3           2012-41                    8
B             B4           2012-20                    8
C             C1           2012-29                    8
C             C2           2012-29                    8
C             C3           2013-22                    8
C             C4           2012-29                    8
59
Time to Test
Operation                 Time (2GB)   Time (64GB)
Write/Read a Row          667.5 ns     667.5 ns
Write/Read Whole Module   174.98 ms    5.59 s
1 round, 1 pattern        413.96 ms    11.24 s
1 round, 5 patterns       2.06 s       56.22 s
1000 rounds, 5 patterns   34 m         15.6 hours
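The figures in this table can be approximately reproduced with a short calculation. The sketch below rests on two assumptions that are not stated on the slide: a row size of 8 KB, and a wait of one 64 ms refresh interval between the write pass and the read pass. Both were chosen because they match the 2GB column. The function names are hypothetical.

```python
ROW_TIME_S = 667.5e-9   # time to write or read one row (from the table)
ROW_BYTES = 8 * 1024    # assumed row size; not stated on the slide
WAIT_S = 0.064          # assumed wait of one 64 ms refresh interval
GIB = 2 ** 30


def module_pass_time(capacity_bytes):
    """Time in seconds to write (or read) every row of the module once."""
    return (capacity_bytes // ROW_BYTES) * ROW_TIME_S


def profiling_time(capacity_bytes, rounds=1, patterns=1):
    """Write the module, wait, read it back, for each pattern and round."""
    one_pattern = 2 * module_pass_time(capacity_bytes) + WAIT_S
    return rounds * patterns * one_pattern


pass_2gb = module_pass_time(2 * GIB)                # about 0.175 s
one_round_2gb = profiling_time(2 * GIB)             # about 0.414 s
full_2gb_min = profiling_time(2 * GIB, 1000, 5) / 60
full_64gb_hours = profiling_time(64 * GIB, 1000, 5) / 3600
```

Under these assumptions the 2GB column is reproduced closely, and the 64GB results land within a fraction of a percent of the listed values. The test time grows linearly with capacity because every row must be written and read in each round.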
60
Temperature Controlled Environment
61
Dependence of Retention Time on Temperature
[Figure: fraction of cells that exhibited retention time failure at any tWAIT for any data pattern at 50 °C; normalized retention times of the same cells at 55 °C and at 70 °C; best-fit exponential curves for retention time change with temperature]
Slide Courtesy Onur Mutlu ISCA’13
62
Dependence of Retention Time on Temperature
Relationship between retention time and temperature is consistently bounded (predictable) within a device
Every 10 °C temperature increase → 46.5% reduction in retention time in the worst case
63
Effect of Temperature
• Worst-fit curve for retention time at different temperatures corresponds to e^(-0.0625 T), where T is the temperature [ISCA’13]
• A 10 °C increase in temperature results in a reduction of 1 - e^(-0.0625*10) = 46.5%
• 1 second at 45 °C → 82 ms at 85 °C
• 20 seconds at 45 °C → 1640 ms at 85 °C
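The conversions above follow directly from the exponential worst-case fit. The short Python sketch below applies it, with a hypothetical function name. The only input taken from the slide is the 0.0625 per degree Celsius exponent.

```python
import math

DECAY_PER_DEG_C = 0.0625  # worst-case exponent from the slide


def scale_retention(retention_ms, from_temp_c, to_temp_c):
    """Scale a retention time between temperatures with the worst-case fit."""
    delta = to_temp_c - from_temp_c
    return retention_ms * math.exp(-DECAY_PER_DEG_C * delta)


# Fractional reduction for a 10 degree increase: 1 - e^(-0.625).
reduction_per_10c = 1.0 - scale_retention(1.0, 0.0, 10.0)

# 1 second and 20 seconds at 45 C, expressed as equivalents at 85 C.
one_second_at_85 = scale_retention(1000.0, 45.0, 85.0)
twenty_seconds_at_85 = scale_retention(20000.0, 45.0, 85.0)
```

A 40 °C increase multiplies retention time by e^(-2.5), about 0.082, which is where the 82 ms and 1640 ms equivalents come from.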
64
Characteristics not Dependent on Refresh Interval
[Figure: probability of new bit failure (log scale, 1E-12 to 1E-03) versus number of rounds (0 to 1000), for refresh intervals of 2 s, 4 s, 5 s, and 10 s]
65
Expected Number of Multi-Bit Failures
[Figure: expected number of words (8B) (log scale, 1E-24 to 1E+06) versus number of rounds (1 to 1000), for 1-bit, 2-bit, and 3-bit failures, each with and without 2X guardband]
66