From the Archives: Disk failures in the real world
In this post, originally written by Glenn Paulley and posted to sybase.com in October of 2009, Glenn talks about traditional hard drive reliability. An Ars Technica article from earlier this year indicates that not much has changed in terms of failure rates since this post was written. Of course, taking regular backups and testing recovery scenarios to guard against disk failure are critical if you care about your data.
One thing that the sheer scale of the computing landscape has contributed to the field of Computer Science is the opportunity to study these systems statistically – and in particular to prove or disprove various aspects of hardware and software reliability.
With respect to disk drives, several large studies of disk drive reliability [2,3,4,7] have been published in the last few years. In particular, the study done at Google [4] showed a steep increase in failure rates, to between 6 and 10 percent, once a drive passed three years of usage. That is an interesting point, since many disk drive manufacturers offer three-year warranties. Their study also showed a lower correlation between heat and drive failure in later-model drives, something that James Hamilton has written about recently in the push towards using less air conditioning within data centers. Recently at FAST 2009, Alyssa Henry of Amazon [6] said in her conference keynote that the Amazon Simple Storage Service (S3) sees a hard disk failure rate of 3-5 percent per year across the board, though I am sure, given Google’s survey results, that Amazon’s failure experience is not uniformly distributed across all disk drive manufacturers. Iliadis and Hu [3] believe that the trend towards lower-cost magnetic media results in higher failure rates, a conclusion also reached [7] by Remzi Arpaci-Dusseau and his team at the University of Wisconsin in Madison. To some extent, at least, you do get what you pay for.
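For a sense of scale, here is a back-of-the-envelope sketch comparing the annual failure rate implied by a datasheet MTTF with the 3-5 percent per year quoted above. The 1,000,000-hour MTTF is taken from the title of reference [2]; the fleet size of 1,000 drives and the function names are illustrative assumptions, and the conversion AFR ≈ 8,760 / MTTF is only an approximation that holds when the MTTF is much longer than one year.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def afr_from_mttf(mttf_hours: float) -> float:
    """Approximate annualized failure rate implied by a datasheet MTTF."""
    return HOURS_PER_YEAR / mttf_hours


def expected_failures(drive_count: int, afr: float) -> float:
    """Expected number of drive failures per year in a fleet."""
    return drive_count * afr


FLEET_SIZE = 1_000  # hypothetical fleet, for illustration only

datasheet_afr = afr_from_mttf(1_000_000)
print(f"Datasheet AFR: {datasheet_afr:.2%}")
print(f"Expected failures/year at datasheet rate: "
      f"{expected_failures(FLEET_SIZE, datasheet_afr):.1f}")

# Observed range quoted in the post: 3-5 percent per year.
for observed_afr in (0.03, 0.05):
    ratio = observed_afr / datasheet_afr
    print(f"Observed AFR {observed_afr:.0%}: "
          f"{expected_failures(FLEET_SIZE, observed_afr):.0f} failures/year, "
          f"about {ratio:.1f}x the datasheet figure")
```

Under these assumptions the datasheet figure works out to just under 0.9 percent per year, so a 3-5 percent observed rate is several times higher than the number on the spec sheet would suggest.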
The actual failure rates reported in these studies are vastly different from the reliability metrics offered by disk drive manufacturers. Moreover, disk hardware failure is only part of the story. Previous work by Remzi Arpaci-Dusseau and his research team at Wisconsin found that transient errors with magnetic disk media were commonplace. Here is a quote from the summary of the Linux Storage & Filesystem Workshop, held in San Jose in February 2008:
Ric Wheeler (aside: now with RedHat) introduced the perennial error-handling topic with the comment that bad sector handling had markedly improved over the “total disaster” it was in 2007. He moved on to silent data corruption and noted that the situation here was improving with data checksumming now being built into filesystems (most notably BTRFS and XFS) and emerging support for T10 DIF. The “forced unmount” topic provoked a lengthy discussion, with James Bottomley claiming that, at least from a block point of view, everything should just work (surprise ejection of USB storage was cited as the example). Ric countered that NFS still doesn’t work and others pointed out that even if block I/O works, the filesystem might still not release the inodes. Ted Ts’o closed the debate by drawing attention to a yet to be presented paper at FAST ’08 showing over 1,300 cases where errors were dropped or lost in the block and filesystem layers. (emphasis added)
Reference [5] below studies the absence or misreporting of both transient and “hard” filesystem errors across several filesystems. Here is the first paragraph of the paper’s abstract:
The reliability of file systems depends in part on how well they propagate errors. We develop a static analysis technique, EDP, that analyzes how file systems and storage device drivers propagate error codes. Running our EDP analysis on all file systems and 3 major storage device drivers in Linux 2.6, we find that errors are often incorrectly propagated; 1153 calls (13%) drop an error code without handling it.
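To make "dropping an error code" concrete, here is a minimal sketch of the bug pattern the abstract describes. It is not the EDP analysis itself, which is a static analysis of kernel C code. The device class and both commit functions are hypothetical, and the negative-errno return value only imitates the Linux kernel convention.

```python
import errno


class FlakyDevice:
    """Hypothetical stand-in for a block device whose writes always fail."""

    def write(self, data: bytes) -> int:
        # Kernel-style convention: a negative errno value signals failure.
        return -errno.EIO


def commit_dropping_error(dev: FlakyDevice, record: bytes) -> int:
    """Buggy: the return value of write() is ignored, so the EIO is lost."""
    dev.write(record)
    return 0  # the caller is told everything went fine


def commit_propagating_error(dev: FlakyDevice, record: bytes) -> int:
    """Correct: the error code is checked and passed up to the caller."""
    rc = dev.write(record)
    if rc < 0:
        return rc
    return 0


dev = FlakyDevice()
print("dropping:   ", commit_dropping_error(dev, b"journal record"))
print("propagating:", commit_propagating_error(dev, b"journal record"))
```

The first function reports success even though nothing reached the disk, which is exactly the situation in which an application, or a database server, carries on as if its data were safe.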
Write caching or out-of-order writes can cause additional problems. The use of EXT3 on Linux systems, in particular, can result in a corrupt filesystem upon a catastrophic hardware failure, because EXT3 does not checksum its journal writes; EXT4 does support journal checksumming. Arpaci-Dusseau and his research team at Wisconsin have just recently taken this error analysis to the next level [1]. They purposefully and systematically introduced errors into a MySQL database to determine the server’s ability to recover from the sorts of hard and transient failures known to occur on the filesystems studied previously. Their results, coupled with the sweeping disk failure studies mentioned above, should give all DBAs reason to worry. I would encourage DBAs to review the papers below. And keep those backups handy.
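As a footnote on why journal checksums matter, here is a minimal sketch of a checksummed journal record. The record layout (length, CRC32, payload) and the function names are assumptions made for illustration; this is not the on-disk format used by EXT4 or its journaling layer.

```python
import struct
import zlib

HEADER = struct.Struct("<II")  # payload length, CRC32 of payload


def pack_record(payload: bytes) -> bytes:
    """Prefix a payload with its length and CRC32 checksum."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def unpack_record(record: bytes) -> bytes:
    """Return the payload, or raise if the record is torn or corrupted."""
    length, crc = HEADER.unpack_from(record)
    payload = record[HEADER.size:HEADER.size + length]
    if len(payload) != length or zlib.crc32(payload) != crc:
        raise ValueError("journal record failed checksum; refusing to replay")
    return payload


good = pack_record(b"update inode 42")
print(unpack_record(good))

# Simulate a single corrupted byte, e.g. from a partial or misdirected write.
damaged = bytearray(good)
damaged[-1] ^= 0xFF
try:
    unpack_record(bytes(damaged))
except ValueError as exc:
    print("detected:", exc)
```

Without the checksum, recovery code has no way to tell the damaged record from a valid one and would replay it into the filesystem.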
[1] Sriram Subramanian, Yupu Zhang, Rajiv Vaidyanathan, Haryadi S. Gunawi, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, and Jeffrey F. Naughton (April 2010). Impact of Disk Corruption on Open-Source DBMS. In Proceedings, 2010 IEEE International Conference on Data Engineering, Long Beach, California. To appear.
[2] Bianca Schroeder and Garth A. Gibson (February 2007). Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? In Proceedings, 5th USENIX Conference on File and Storage Technologies, San Jose, California, pp. 1-16.
[3] Ilias Iliadis and Xiao-Yu Hu (June 2008). Reliability Assurance of RAID Storage Systems for a Wide Range of Latent Sector Errors. In Proceedings of the International Conference on Networking, Architecture, and Storage, Chongqing, China. IEEE Computer Society, ISBN 978-0-7695-3187-8.
[4] Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso (February 2007). Failure Trends in a Large Disk Drive Population. In Proceedings, 5th USENIX Conference on File and Storage Technologies, San Jose, California, pp. 17-28.
[5] Haryadi S. Gunawi, Cindy Rubio-González, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, and Ben Liblit (February 2008). EIO: Error Handling is Occasionally Correct. In Proceedings, 6th USENIX Conference on File and Storage Technologies, San Jose, California, pp. 207-222.
[6] Alyssa Henry (February 2009). Cloud Storage FUD (Failure, Uncertainty, and Durability). Keynote address, 7th USENIX Conference on File and Storage Technologies, San Francisco, California.
[7] Lakshmi N. Bairavasundaram, Garth R. Goodson, Bianca Schroeder, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau (February 2008). An Analysis of Data Corruption in the Storage Stack. In Proceedings, 6th USENIX Conference on File and Storage Technologies (FAST ’08), San Jose, California, pp. 223-238.