A Large-Scale Study of Flash Memory Errors in the Field Justin Meza Qiang Wu Sanjeev Kumar Onur Mutlu.

Overview
First study of flash reliability:
▪ at a large scale
▪ in the field
New reliability trends:
▪ SSD lifecycle: Early detection lifecycle period distinct from hard disk drive lifecycle.
▪ Read disturbance: We do not observe the effects of read disturbance errors in the field.
▪ Temperature: Throttling SSD usage helps mitigate temperature-induced errors.
▪ Access pattern dependence: We quantify the effects of the page cache and write amplification in the field.
Outline
▪ background and motivation
▪ server SSD architecture
▪ error collection/analysis methodology
▪ SSD reliability trends
▪ summary
Background and motivation
Flash memory
▪ persistent
▪ high performance
▪ hard disk alternative
▪ used in solid-state drives (SSDs)
▪ prone to a variety of errors
  ▪ wearout, disturbance, retention
Our goal
Understand SSD reliability:
▪ at a large scale
  ▪ millions of device-days, across four years
▪ in the field
  ▪ realistic workloads and systems
Server SSD architecture
PCIe
Flash chips
SSD controller
▪ translates addresses
▪ schedules accesses
▪ performs wear leveling
[Figure: flash contents shown as bit patterns, labeled as user data and ECC metadata]
Types of errors
Small errors
▪ ~10's of flipped bits per KB
▪ silently corrected by SSD controller
Large errors
▪ ~100's of flipped bits per KB
▪ corrected by host using driver
▪ referred to as SSD failure
We examine large errors (SSD failures) in this study.
Error collection/analysis methodology
SSD data measurement
▪ metrics stored on SSDs
▪ measured across SSD lifetime
SSD characteristics
▪ 6 different system configurations
  ▪ 720GB, 1.2TB, and 3.2TB SSDs
  ▪ servers have 1 or 2 SSDs
  ▪ this talk: representative systems
▪ 6 months to 4 years of operation
▪ 15TB to 50TB read and written
Bit error rates (BER)
▪ BER = bit errors per bits transmitted
▪ 1 error per 385M bits transmitted to 1 error per 19.6B bits transmitted
  ▪ averaged across all SSDs in each system type
▪ 10x to 1000x lower than prior studies
  ▪ large errors, SSD performs wear leveling
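The two endpoints quoted above can be converted into rates directly from the definition of BER. The following is a minimal sketch; the function name `ber_from_bits_per_error` is illustrative and not taken from the study.

```python
def ber_from_bits_per_error(bits_per_error: float) -> float:
    """BER = bit errors / bits transmitted, so one error per N bits gives 1/N."""
    if bits_per_error <= 0:
        raise ValueError("bits_per_error must be positive")
    return 1.0 / bits_per_error


# Endpoints quoted on the slide.
highest_ber = ber_from_bits_per_error(385e6)   # 1 error per 385M bits
lowest_ber = ber_from_bits_per_error(19.6e9)   # 1 error per 19.6B bits

print(f"highest BER: {highest_ber:.2e}")
print(f"lowest BER:  {lowest_ber:.2e}")
print(f"ratio between endpoints: {highest_ber / lowest_ber:.1f}x")
```

The ratio printed at the end is the spread between system types in this study, not the comparison against prior studies.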
A few SSDs cause most errors
What factors contribute to SSD failures in the field?
Analytical methodology
▪ not feasible to log every error
▪ instead, analyze lifetime counters
▪ snapshot-based analysis
[Figure: snapshot-based analysis. A snapshot of each SSD's lifetime counters (errors and data written) is taken on 2014-11-1. Example values shown: errors 54,326, 0, 2, 10; data written 10TB, 2TB, 5TB, 6TB. SSDs are grouped into buckets, and errors are plotted against data written.]
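The snapshot-based analysis can be sketched roughly as follows: each SSD contributes one pair of lifetime counters, SSDs are grouped into buckets by data written, and the error counts within each bucket are averaged. This is a sketch under stated assumptions: the `Snapshot` type, the bucket width, and the pairing of the example error counts with the example data-written values are all illustrative and not taken from the study.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Snapshot:
    """Lifetime counters read from one SSD at snapshot time (hypothetical type)."""
    errors: int
    data_written_tb: float


def bucket_errors(snapshots, bucket_width_tb=4.0):
    """Group SSDs by data written and return the mean error count per bucket.

    Keys of the returned dict are the lower edge of each bucket, in TB.
    """
    buckets = defaultdict(list)
    for s in snapshots:
        lower_edge = (s.data_written_tb // bucket_width_tb) * bucket_width_tb
        buckets[lower_edge].append(s.errors)
    return {edge: mean(errs) for edge, errs in sorted(buckets.items())}


# Illustrative counters; the pairing of errors with data written is assumed.
fleet = [
    Snapshot(errors=0, data_written_tb=2),
    Snapshot(errors=2, data_written_tb=5),
    Snapshot(errors=10, data_written_tb=6),
    Snapshot(errors=54326, data_written_tb=10),
]
result = bucket_errors(fleet)
for edge, avg in result.items():
    print(f"{edge:>5.1f} TB and up: mean errors = {avg}")
```

Because only one snapshot per SSD is used, a trend across buckets describes how SSDs at different usage levels compare, not how a single SSD evolves over time.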
SSD reliability trends
New reliability trends
▪ SSD lifecycle
▪ Read disturbance
▪ Temperature
▪ Access pattern dependence
Storage lifecycle background: the bathtub curve for disk drives [Schroeder+, FAST'07]
[Figure: failure rate vs. usage, showing an early failure period, a useful life period, and a wearout period]
Do SSDs display similar lifecycle periods?
Use data written to flash to examine SSD lifecycle
(time-independent utilization metric)
[Figure panels: 720GB, 1 SSD; 720GB, 2 SSDs. X-axis: data written (TB), 0 to 80. Annotated periods: early detection period, early failure period, useful life period, wearout period.]
SSD lifecycle
Early detection lifecycle period distinct from hard disk drive lifecycle.
Read disturbance
▪ reading data can disturb contents
▪ failure mode identified in lab setting
▪ under adversarial workloads
Does read disturbance affect SSDs in the field?
Examine SSDs with high flash R/W ratios and most data read to understand read effects
(isolate effects of read vs. write errors)
[Figure panels: 3.2TB, 1 SSD (average R/W = 2.14); 1.2TB, 1 SSD (average R/W = 1.15). X-axis: data read (TB), 0 to 200.]
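The average R/W ratios in the panel titles are consistent with the average data read and written per system listed under System characteristics in the backup slides. A small check, in which the dictionary layout and the function name `rw_ratio` are illustrative:

```python
# Average data written and read (TB), taken from the System characteristics
# backup slide for the two single-SSD systems shown in the panels.
SYSTEMS = {
    ("3.2TB", 1): {"written_tb": 23.9, "read_tb": 51.1},
    ("1.2TB", 1): {"written_tb": 37.8, "read_tb": 43.4},
}


def rw_ratio(read_tb: float, written_tb: float) -> float:
    """Ratio of data read to data written."""
    if written_tb <= 0:
        raise ValueError("written_tb must be positive")
    return read_tb / written_tb


ratios = {
    system: round(rw_ratio(c["read_tb"], c["written_tb"]), 2)
    for system, c in SYSTEMS.items()
}
for (capacity, ssds), ratio in ratios.items():
    print(f"{capacity}, {ssds} SSD: average R/W = {ratio}")
```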
Read disturbance
We do not observe the effects of read disturbance errors in the field.
Temperature sensor
[Figure panels: 720GB, 1 SSD; 720GB, 2 SSDs; 1.2TB, 1 SSD; 3.2TB, 1 SSD]
High temperature: may throttle or shut down
Temperature
Throttling SSD usage helps mitigate temperature-induced errors.
Access pattern effects
System buffering
▪ data served from OS caches
▪ decreases SSD usage
Write amplification
▪ updates to small amounts of data
▪ increases erasing and copying
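Write amplification is commonly expressed as the ratio of data written to flash cells to data written by the host. The sketch below uses that definition together with a deliberately naive model in which applying any update rewrites an entire erase block. The model, the page size, and the block size are assumptions for illustration and do not describe the SSDs in this study.

```python
PAGE_BYTES = 4096        # hypothetical page size
PAGES_PER_BLOCK = 64     # hypothetical erase-block size, in pages


def write_amplification(flash_bytes_written: int, host_bytes_written: int) -> float:
    """Ratio of bytes written to flash cells to bytes written by the host."""
    if host_bytes_written <= 0:
        raise ValueError("host_bytes_written must be positive")
    return flash_bytes_written / host_bytes_written


def naive_block_rewrite_bytes(pages_updated: int) -> int:
    """Flash bytes written if the whole erase block is copied to apply an update.

    Naive worst-case model: real controllers avoid much of this copying.
    """
    if not 1 <= pages_updated <= PAGES_PER_BLOCK:
        raise ValueError("pages_updated must fit within one block")
    return PAGES_PER_BLOCK * PAGE_BYTES


# Updating one page vs. updating every page of the block.
small_update = write_amplification(naive_block_rewrite_bytes(1), 1 * PAGE_BYTES)
full_update = write_amplification(naive_block_rewrite_bytes(64), 64 * PAGE_BYTES)
print(small_update, full_update)
```

Under this model the small update pays for copying the 63 untouched pages, which is the erasing and copying overhead the slide refers to.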
[Figure: OS and page cache]
System caching reduces the impact of SSD writes
[Figure panels: 1.2TB, 2 SSDs; 3.2TB, 2 SSDs. X-axis: data written to OS (TB), 0 to 30.]
[Figure panel: 720GB, 2 SSDs. Y-axis: data written to flash cells (TB), 0 to 60. X-axis: data written to OS (TB), 0 to 30.]
Flash devices use a translation layer to locate data
[Figure: the translation layer sits between the OS's logical address space and the physical address space, holding entries of the form <offset1, size1>, <offset2, size2>, ...]
Sparse data layout
more translation metadata
potential for higher write amplification
Dense data layout
less translation metadata
potential for lower write amplification
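To illustrate why a sparse layout needs more translation metadata than a dense one, the sketch below coalesces written logical pages into <offset, size> entries. The extent-style table is an assumption suggested by the <offset, size> notation on the translation layer slide; the actual translation layer format is not described here, and `to_extents` is a hypothetical helper.

```python
def to_extents(logical_pages):
    """Coalesce page numbers into (offset, size) runs of contiguous pages."""
    extents = []
    for page in sorted(set(logical_pages)):
        if extents and page == extents[-1][0] + extents[-1][1]:
            # Page directly follows the last run, so extend that run.
            offset, size = extents[-1]
            extents[-1] = (offset, size + 1)
        else:
            # Gap before this page, so a new entry is needed.
            extents.append((page, 1))
    return extents


# Eight pages written contiguously vs. eight pages written with gaps.
dense = to_extents(range(0, 8))
sparse = to_extents([0, 2, 4, 6, 8, 10, 12, 14])

print(f"dense layout:  {len(dense)} entry")
print(f"sparse layout: {len(sparse)} entries")
```

The same amount of data needs one entry when laid out densely and eight when laid out sparsely, which is the intuition behind using translation data size as a proxy for data layout.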
Use translation data size to examine effects of data layout
(relates to application access patterns)
[Figure panel: 720GB, 1 SSD. X-axis: translation data (GB), 0 to 2, from denser to sparser.]
Write amplification in the field
[Figure panels: graph search; key-value store. X-axis: translation data (GB), 0.25 to 0.45.]
Access pattern dependence
We quantify the effects of the page cache and write amplification in the field.
More results in paper
▪ Block erasures and discards
▪ Page copies
▪ Bus power consumption
Summary
▪ Large scale
▪ In the field
New reliability trends:
▪ SSD lifecycle: Early detection lifecycle period distinct from hard disk drive lifecycle.
▪ Read disturbance: We do not observe the effects of read disturbance errors in the field.
▪ Temperature: Throttling SSD usage helps mitigate temperature-induced errors.
▪ Access pattern dependence: We quantify the effects of the page cache and write amplification in the field.
Backup slides
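System characteristics
SSD capacity | PCIe   | Average age (years) | SSDs per server | Average written (TB) | Average read (TB)
720GB        | v1, x4 | 2.4                 | 1               | 27.2                 | 23.8
720GB        | v1, x4 | 2.4                 | 2               | 48.5                 | 45.1
1.2TB        | v2, x4 | 1.6                 | 1               | 37.8                 | 43.4
1.2TB        | v2, x4 | 1.6                 | 2               | 18.9                 | 30.6
3.2TB        | v2, x4 | 0.5                 | 1               | 23.9                 | 51.1
3.2TB        | v2, x4 | 0.5                 | 2               | 14.8                 | 18.2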
[Figure: results grouped by SSD capacity (720GB, 1.2TB, 3.2TB) and devices per server (1, 2)]
Channels operate in parallel
DRAM buffer
▪ stores address translations
▪ may buffer writes
[Figure panels: 1.2TB, 2 SSDs; 3.2TB, 2 SSDs]