08-11-2015, 12:14 #1
[PDF] Proactively Protecting Against Disk Failures
Characterizing, Monitoring, and Proactively Protecting Against Disk Failures
Ao Ma, Fred Douglis, Guanlin Lu, Darren Sawyer, Surendar Chandra, Windsor Hsu
 EMC Corporation,  Datrium, Inc.
Modern storage systems orchestrate a group of disks to achieve their performance and reliability goals. Even though such systems are designed to withstand the failure of individual disks, failure of multiple disks poses a unique set of challenges. We empirically investigate disk failure data from a large number of production systems, specifically focusing on the impact of disk failures on RAID storage systems. Our data covers about one million SATA disks from 6 disk models for periods up to 5 years. We show how observed disk failures weaken the protection provided by RAID. The count of reallocated sectors correlates strongly with impending failures. With these findings we designed RAIDSHIELD, which consists of two components. First, we have built and evaluated an active defense mechanism that monitors the health of each disk and replaces those that are predicted to fail imminently. This proactive protection has been incorporated into our product and is observed to eliminate 88% of triple disk errors, which are 80% of all RAID failures. Second, we have designed and simulated a method of using the joint failure probability to quantify and predict how likely a RAID group is to face multiple simultaneous disk failures, which can identify disks that collectively represent a risk of failure even when no individual disk is flagged in isolation. We find in simulation that RAID-level analysis can effectively identify most vulnerable RAID-6 systems, improving the coverage to 98% of triple errors.
PDF (17p) https://www.usenix.org/system/files/...5-paper-ma.pdf
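The RAID-level idea in the abstract, flagging a group whose disks are individually unremarkable but collectively risky, comes down to a joint failure probability: the chance that more failures occur than the parity scheme tolerates. A minimal sketch of that computation, assuming independent per-disk failure probabilities (the paper's actual model may differ):

```python
from itertools import combinations
from math import prod

def raid_group_failure_prob(disk_probs, tolerated=2):
    """Probability that more than `tolerated` disks in the group fail
    together, assuming independent per-disk failure probabilities.
    Sums P(exactly k disks fail) over all k > tolerated. tolerated=2
    corresponds to RAID-6, which survives any two concurrent failures."""
    n = len(disk_probs)
    total = 0.0
    for k in range(tolerated + 1, n + 1):
        for failed in combinations(range(n), k):
            failed_set = set(failed)
            total += prod(
                disk_probs[i] if i in failed_set else 1 - disk_probs[i]
                for i in range(n)
            )
    return total

# An 8-disk RAID-6 group: no single disk crosses a replacement
# threshold, but many moderately degraded disks raise the joint risk.
healthy = [0.01] * 8
degraded = [0.15] * 8
print(raid_group_failure_prob(healthy))   # tiny
print(raid_group_failure_prob(degraded))  # orders of magnitude larger
```

The exhaustive enumeration is exponential in group size, which is fine for typical RAID groups of 8-16 disks; a real implementation would likely use a Poisson-binomial recurrence instead.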
08-11-2015, 12:17 #2
Flash Storage Failure Rates From A Large Population
James Hamilton
I love real data. Real data is so much better than speculation and, as I’ve learned from years of staring at production systems, real data from the field is often surprisingly different from popular opinion. Disk failure rates are higher than manufacturer specifications, ECC memory faults happen all the time, and events that are just about impossible actually happen surprisingly frequently in large populations. Two papers that, nearly 8 years later, remain relevant and well worth reading are: 1) Failure Trends in a Large Disk Drive Population and 2) Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You? Both of these classics were published at the same FAST 2007 conference.
Flash memory and flash failure rates were an issue we barely dealt with back in 2007. Today most megaclouds have petabytes of Flash storage deployed. As always there are specs on failure rates and strong opinions from experts but there really hasn’t been much public data for large fleets.
I recently came across a paper that does a fairly detailed study of the Facebook SSD population. This study isn’t perfect: it reports over a large but unspecified number of devices, there are 5 different device models in the population, the devices operate in different server types, the lifetime of the different devices varies, and the fault tracking is external to the device and doesn’t see the detailed device-internal failure data. However, it does study devices over nearly 4 years with “many millions of operational hours” in aggregate. The population is clearly large enough to be relevant and, even with many uncontrolled dimensions, it’s a good early look at flash device lifetimes. I found their findings of interest:
- Flash-based SSDs do not fail at a monotonically increasing rate with wear. They instead go through several distinct reliability periods corresponding to how failures emerge and are subsequently detected. Unlike the monotonically-increasing failure trends for individual flash chips, across a large number of flash-based SSDs, we observe early-detection, early failure, usable life, and wear out periods.
- Read disturbance errors (i.e. errors caused in neighboring pages due to a read) are not prevalent in the field. SSDs that have read the most data do not show a statistically significant increase in failure rates.
- Sparse logical data layout across an SSD’s physical address space (e.g., non-contiguous data), as measured by the amount of SSD-internal DRAM buffer usage for flash translation layer metadata, greatly affects device failure rate. In addition, dense logical data layout with adversarial patterns (e.g., small sparse writes) also negatively affects SSD reliability.
- Higher temperatures lead to higher failure rates, but techniques used in modern SSDs that throttle SSD operation (and consequently, the amount of data written to flash chips) appear to greatly reduce the reliability impact of higher temperatures by reducing access rates to raw flash chips. [JRH: This point seems self-evident in that temperature mitigation techniques would reduce the impact of higher temperatures].
- The amount of data written by the operating system to the SSD is not the same as the amount of data that is eventually written to the flash cells. This is due to system-level buffering and wear-reduction techniques employed in the storage software stack and in the SSDs. [JRH: This highlights the problem of studying flash error rates outside of the storage devices: we aren’t able to see the impact of write amplification, and it’s more difficult to see the impact of buffering layers between the application and the device].
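That last point, host writes not matching flash writes, is the write-amplification effect. A toy model (page size and numbers are illustrative, not from the paper): flash is programmed in whole pages, so each host write is rounded up to full pages, and small sparse writes cost far more flash wear than their payload suggests.

```python
PAGE = 4096  # assumed flash page size in bytes; real devices vary

def flash_bytes_for_host_writes(write_sizes):
    """Toy write-amplification model: every host write is rounded up
    to whole flash pages. Returns (host bytes, flash bytes, WAF).
    Real SSDs also add garbage-collection copies and metadata writes,
    and buffering can push the factor below this lower bound."""
    host = sum(write_sizes)
    flash = sum(-(-s // PAGE) * PAGE for s in write_sizes)  # ceil to pages
    return host, flash, flash / host

# 1000 sparse 512-byte writes vs. one 512 KB sequential write:
print(flash_bytes_for_host_writes([512] * 1000))  # WAF 8.0
print(flash_bytes_for_host_writes([512 * 1024]))  # WAF 1.0
```

This is only the page-rounding component; it illustrates why counting bytes at the OS layer understates (or with buffering, overstates) the wear actually induced on flash cells.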
08-11-2015, 12:35 #3
Facebook: A Large Scale Study of Flash Memory Failures in the Field
Carnegie Mellon University
Servers use flash memory based solid state drives (SSDs) as a high-performance alternative to hard disk drives to store persistent data. Unfortunately, recent increases in flash density have also brought about decreases in chip-level reliability. In a data center environment, flash-based SSD failures can lead to downtime and, in the worst case, data loss. As a result, it is important to understand flash memory reliability characteristics over flash lifetime in a realistic production data center environment running modern applications and system software. This paper presents the first large-scale study of flash-based SSD reliability in the field. We analyze data collected across a majority of flash-based solid state drives at Facebook data centers over nearly four years and many millions of operational hours in order to understand failure properties and trends of flash-based SSDs. Our study considers a variety of SSD characteristics, including: the amount of data written to and read from flash chips; how data is mapped within the SSD address space; the amount of data copied, erased, and discarded by the flash controller; and flash board temperature and bus power.
Based on our field analysis of how flash memory errors manifest when running modern workloads on modern SSDs, this paper is the first to make several major observations: (1) SSD failure rates do not increase monotonically with flash chip wear; instead they go through several distinct periods corresponding to how failures emerge and are subsequently detected, (2) the effects of read disturbance errors are not prevalent in the field, (3) sparse logical data layout across an SSD's physical address space (e.g., non-contiguous data), as measured by the amount of metadata required to track logical address translations stored in an SSD-internal DRAM buffer, can greatly affect SSD failure rate, (4) higher temperatures lead to higher failure rates, but techniques that throttle SSD operation appear to greatly reduce the negative reliability impact of higher temperatures, and (5) data written by the operating system to flash-based SSDs does not always accurately indicate the amount of wear induced on flash cells due to optimizations in the SSD controller and buffering employed in the system software. We hope that the findings of this first large-scale flash memory reliability study can inspire others to develop other publicly-available analyses and novel flash reliability solutions.
PDF (14p) http://users.ece.cmu.edu/~omutlu/pub...gmetrics15.pdf
08-11-2015, 12:54 #4
data written by the operating system to flash-based SSDs does not always accurately indicate the amount of wear induced on flash cells due to optimizations in the SSD controller ...
Understanding Flash: Garbage Collection Matters
Understanding Flash: The Write Cliff
Understanding Flash: Unpredictable Write Performance
Understanding Flash: Floating Gates and Wear
Understanding Flash: The Flash Translation Layer
Last edited by 5ms; 08-11-2015 at 12:56.
08-11-2015, 17:58 #5
The count of reallocated sectors correlates strongly with impending failures.
Through the comparison we find that for some disk models (such as A-1, A-2, and B-1), a certain fraction of failed disks (usually 30%) develop a similar amount of pending and uncorrectable sectors. Failed drives of the other disk models, including C-1, C-2, and D-1, develop pending sector errors but none of them have uncorrectable sector errors, implying most pending errors have been addressed by the drives' internal protection mechanisms. No working disks show these two types of sector errors, revealing that once disks develop these two types of errors, they are very likely to fail.
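The quoted observation, that working disks show essentially none of these sector errors while failed disks do, is what makes a simple threshold policy viable for proactive replacement. A minimal sketch of such a check; the attribute names and the reallocated-sector threshold are illustrative assumptions, not values from the paper:

```python
def should_replace(smart):
    """Flag a disk for proactive replacement based on sector-level
    SMART counters. Per the quoted finding, working disks show no
    pending or uncorrectable sectors, so any non-zero count there is
    treated as a strong failure signal. The reallocated-sector
    threshold is an assumed placeholder, not the paper's value."""
    if smart.get("pending_sectors", 0) > 0:
        return True
    if smart.get("uncorrectable_sectors", 0) > 0:
        return True
    if smart.get("reallocated_sectors", 0) > 100:  # assumed threshold
        return True
    return False

print(should_replace({"reallocated_sectors": 3}))    # False
print(should_replace({"pending_sectors": 2}))        # True
print(should_replace({"reallocated_sectors": 250}))  # True
```

In practice the thresholds would be tuned per disk model, since the paper shows the different models (A-1 through D-1) develop these errors at very different rates.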