A Large-Scale Study of Flash Memory Errors in the Field
Justin Meza, Qiang Wu, Sanjeev Kumar, Onur Mutlu
Overview

First study of flash reliability:
▪ at a large scale
▪ in the field

New reliability trends:
▪ SSD lifecycle: early detection lifecycle period distinct from hard disk drive lifecycle.
▪ Read disturbance: we do not observe the effects of read disturbance errors in the field.
▪ Temperature: throttling SSD usage helps mitigate temperature-induced errors.
▪ Access pattern dependence: we quantify the effects of the page cache and write amplification in the field.

Outline
▪ background and motivation
▪ server SSD architecture
▪ error collection/analysis methodology
▪ SSD reliability trends
▪ summary

Background and motivation

Flash memory
▪ persistent
▪ high performance
▪ hard disk alternative
▪ used in solid-state drives (SSDs)
▪ prone to a variety of errors
  ▪ wearout, disturbance, retention

Our goal
Understand SSD reliability:
▪ at a large scale
  ▪ millions of device-days, across four years
▪ in the field
  ▪ realistic workloads and systems

Server SSD architecture

[Figure: server SSD with PCIe interface, flash chips, and SSD controller]

SSD controller
▪ translates addresses
▪ schedules accesses
▪ performs wear leveling

[Figure: example bit patterns stored in flash]
[Figure: stored bits labeled as user data and ECC metadata]

Types of errors

Small errors
▪ ~10's of flipped bits per KB
▪ silently corrected by SSD controller

Large errors
▪ ~100's of flipped bits per KB
▪ corrected by host using driver
▪ referred to as SSD failure

We examine large errors (SSD failures) in this study.

Error collection/analysis methodology

SSD data measurement
▪ metrics stored on SSDs
▪ measured across SSD lifetime

SSD characteristics
▪ 6 different system configurations
  ▪ 720GB, 1.2TB, and 3.2TB SSDs
  ▪ servers have 1 or 2 SSDs
  ▪ this talk: representative systems
▪ 6 months to 4 years of operation
▪ 15TB to 50TB read and written

Bit error rates (BER)
▪ BER = bit errors per bits transmitted
▪ 1 error per 385M bits transmitted to 1 error per 19.6B bits transmitted
  ▪ averaged across all SSDs in each system type
▪ 10x to 1000x lower than prior studies
  ▪ large errors, SSD performs wear leveling

A few SSDs cause most errors

What factors contribute to SSD failures in the field?
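As a worked illustration of the BER definition on the bit error rates slide, the sketch below converts the two "1 error per N bits" figures into conventional error rates. It assumes "M" means 10^6 and "B" means 10^9, which the slides do not state explicitly; the function name is hypothetical.

```python
def ber_from_bits_per_error(bits_per_error: float) -> float:
    """Return BER (bit errors per bit transmitted) for one error every `bits_per_error` bits."""
    if bits_per_error <= 0:
        raise ValueError("bits_per_error must be positive")
    return 1.0 / bits_per_error


# Assumption: 385M = 385e6 bits and 19.6B = 19.6e9 bits.
highest_ber = ber_from_bits_per_error(385e6)
lowest_ber = ber_from_bits_per_error(19.6e9)

print(f"highest BER: {highest_ber:.2e}")  # highest BER: 2.60e-09
print(f"lowest BER:  {lowest_ber:.2e}")   # lowest BER:  5.10e-11
print(f"spread between system types: {highest_ber / lowest_ber:.1f}x")
```

Under these assumptions the per-system averages span roughly a factor of 50.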
Analytical methodology
▪ not feasible to log every error
▪ instead, analyze lifetime counters
▪ snapshot-based analysis

[Figure: snapshot taken 2014-11-1 of per-SSD lifetime counters (errors, data written), with SSDs grouped into buckets by data written. Example counter values shown: errors 54,326, 0, 2, 10; data written 10TB, 2TB, 5TB, 6TB.]

SSD reliability trends

SSD lifecycle

Storage lifecycle background: the bathtub curve for disk drives [Schroeder+, FAST'07]

[Figure: failure rate vs. usage, showing an early failure period, a useful life period, and a wearout period]

Do SSDs display similar lifecycle periods?

Use data written to flash to examine SSD lifecycle (time-independent utilization metric)

[Figure: panels "720GB, 1 SSD" and "720GB, 2 SSDs"; x-axis: data written (TB), 0 to 80; annotated with early detection period, early failure period, useful life period, and wearout period]

Early detection lifecycle period distinct from hard disk drive lifecycle.
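The snapshot-based analysis above can be sketched as follows: each SSD contributes one pair of lifetime counters, SSDs are grouped into buckets by data written, and a rate is computed per bucket. This is a minimal sketch, not the authors' pipeline. The record layout, the bucket width, the sample values, and the choice of "fraction of SSDs with at least one error" as the per-bucket metric are all assumptions made for illustration.

```python
from collections import defaultdict


def failure_rate_by_bucket(snapshots, bucket_tb):
    """Group SSDs by data written and return the fraction with any error per bucket.

    snapshots: iterable of (errors, data_written_tb) lifetime counters, one per SSD.
    bucket_tb: bucket width in TB (hypothetical parameter).
    """
    totals = defaultdict(int)
    failed = defaultdict(int)
    for errors, written_tb in snapshots:
        bucket_start = int(written_tb // bucket_tb) * bucket_tb
        totals[bucket_start] += 1
        if errors > 0:
            failed[bucket_start] += 1
    return {start: failed[start] / totals[start] for start in sorted(totals)}


# Illustrative counters only; these are not measurements from the study.
snapshots = [(54326, 10.0), (0, 2.0), (2, 5.0), (10, 6.0), (0, 3.0), (0, 7.5)]
rates = failure_rate_by_bucket(snapshots, bucket_tb=4)
for start, rate in rates.items():
    print(f"{start}-{start + 4} TB written: {rate:.2f} of SSDs saw an error")
```

Plotting such per-bucket rates against data written is one way to look for lifecycle periods without depending on wall-clock age.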
Read disturbance
▪ reading data can disturb contents
▪ failure mode identified in lab setting
▪ under adversarial workloads

Does read disturbance affect SSDs in the field?

Examine SSDs with high flash R/W ratios and most data read to understand read effects (isolate effects of read vs. write errors)

[Figure: panels "3.2TB, 1 SSD (average R/W = 2.14)" and "1.2TB, 1 SSD (average R/W = 1.15)"; x-axis: data read (TB), 0 to 200]

We do not observe the effects of read disturbance errors in the field.

Temperature

[Figure: temperature sensor data; panels "720GB, 1 SSD", "720GB, 2 SSDs", "1.2TB, 1 SSD", and "3.2TB, 1 SSD"; annotation: at high temperature, SSDs may throttle or shut down]

Throttling SSD usage helps mitigate temperature-induced errors.
Access pattern effects

System buffering
▪ data served from OS caches
▪ decreases SSD usage

Write amplification
▪ updates to small amounts of data
▪ increases erasing and copying

[Figure: OS page cache]

System caching reduces the impact of SSD writes

[Figure: panels "720GB, 2 SSDs", "1.2TB, 2 SSDs", and "3.2TB, 2 SSDs"; axes: data written to OS (TB), 0 to 30, and data written to flash cells (TB)]

Flash devices use a translation layer to locate data

[Figure: the translation layer maps the OS logical address space to the physical address space using entries of the form <offset1, size1>, <offset2, size2>, ...]

Sparse data layout
▪ more translation metadata
▪ potential for higher write amplification

Dense data layout
▪ less translation metadata
▪ potential for lower write amplification

Use translation data size to examine effects of data layout (relates to application access patterns)

[Figure: panel "720GB, 1 SSD"; x-axis: translation data (GB), 0 to 2, from denser to sparser]

Write amplification in the field

[Figure: panels "Graph search" and "Key-value store"; x-axis: translation data (GB), 0.25 to 0.45]

We quantify the effects of the page cache and write amplification in the field.
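The two access pattern effects above pull in opposite directions: system buffering lowers the amount of data that reaches flash, while write amplification raises it. One rough way to see which dominates for a given SSD is to divide data written to flash cells by data written to the OS, the two quantities on the plot axes. The sketch below does this with made-up numbers. The function name and inputs are hypothetical, and this ratio is not the conventional write amplification factor, which divides flash writes by the writes that actually reach the device.

```python
def flash_to_os_write_ratio(flash_written_tb: float, os_written_tb: float) -> float:
    """Ratio of data written to flash cells to data written to the OS.

    Above 1.0: amplification inside the SSD outweighs what the page cache absorbed.
    Below 1.0: buffering absorbed more writes than amplification added.
    """
    if os_written_tb <= 0:
        raise ValueError("os_written_tb must be positive")
    return flash_written_tb / os_written_tb


# Illustrative values only; these are not measurements from the study.
examples = [
    ("amplification dominates", 60.0, 30.0),
    ("buffering dominates", 12.0, 15.0),
]
for label, flash_tb, os_tb in examples:
    ratio = flash_to_os_write_ratio(flash_tb, os_tb)
    print(f"{label}: ratio = {ratio:.2f}")
```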
More results in paper
▪ Block erasures and discards
▪ Page copies
▪ Bus power consumption

Summary

First study of flash reliability:
▪ Large scale
▪ In the field

New reliability trends:
▪ SSD lifecycle: early detection lifecycle period distinct from hard disk drive lifecycle.
▪ Read disturbance: we do not observe the effects of read disturbance errors in the field.
▪ Temperature: throttling SSD usage helps mitigate temperature-induced errors.
▪ Access pattern dependence: we quantify the effects of the page cache and write amplification in the field.

Backup slides

System characteristics

SSD capacity   PCIe     Average age (years)   SSDs per server   Average written (TB)   Average read (TB)
720GB          v1, x4   2.4                   1                 27.2                   23.8
720GB          v1, x4   2.4                   2                 48.5                   45.1
1.2TB          v2, x4   1.6                   1                 37.8                   43.4
1.2TB          v2, x4   1.6                   2                 18.9                   30.6
3.2TB          v2, x4   0.5                   1                 23.9                   51.1
3.2TB          v2, x4   0.5                   2                 14.8                   18.2

[Figures: results grouped by SSD capacity (720GB, 1.2TB, 3.2TB) and number of devices (1 or 2)]

Channels operate in parallel

DRAM buffer
▪ stores address translations
▪ may buffer writes

[Figure: panels "1.2TB, 2 SSDs" and "3.2TB, 2 SSDs"]
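As a cross-check on the system characteristics in the backup slides, the sketch below divides average data read by average data written for each configuration. The values are as read from the flattened table, so the row pairing is an assumption. The results for the two single-SSD systems used in the read disturbance comparison appear consistent with the R/W ratios quoted there (2.14 and 1.15), although a ratio of averages need not equal an average of per-SSD ratios.

```python
# (capacity, SSDs per server) -> (average written TB, average read TB),
# as read from the system characteristics table; row pairing is assumed.
systems = {
    ("720GB", 1): (27.2, 23.8),
    ("720GB", 2): (48.5, 45.1),
    ("1.2TB", 1): (37.8, 43.4),
    ("1.2TB", 2): (18.9, 30.6),
    ("3.2TB", 1): (23.9, 51.1),
    ("3.2TB", 2): (14.8, 18.2),
}

# Read/write ratio computed from the per-system averages.
rw_ratio = {key: read / written for key, (written, read) in systems.items()}

for (capacity, ssd_count), ratio in rw_ratio.items():
    print(f"{capacity}, {ssd_count} SSD(s): R/W = {ratio:.2f}")
```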