Disk failures, data corruptions

An Analysis of Data Corruption in the Storage Stack

Table 1: Corruption classes summary
- This summary overturned my illusion that “once the checksum has been verified, I can trust the data”. In fact, a block can pass checksum verification and still be inconsistent, either because the firmware returns a false negative or because of a logic error such as a misdirected write
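
To make the misdirected-write case concrete, here is a minimal sketch of why a data-only checksum can pass on the wrong block. It is a toy model, not the mechanism described in the paper: the dictionary `disk`, the helpers `write_block`, `read_plain_ok` and `read_identity_ok`, and the idea of mixing the block address into the CRC are all assumptions made for illustration.

```python
import zlib

def plain_checksum(data: bytes) -> int:
    # Checksum over the payload only: says nothing about where the data belongs.
    return zlib.crc32(data)

def identity_checksum(lba: int, data: bytes) -> int:
    # Hypothetical variant: mix the intended block address into the checksum.
    return zlib.crc32(lba.to_bytes(8, "big") + data)

def write_block(disk: dict, target_lba: int, intended_lba: int, data: bytes) -> None:
    # The data lands at target_lba, but checksums are computed for intended_lba.
    # A correct write has target_lba == intended_lba.
    disk[target_lba] = (
        data,
        plain_checksum(data),
        identity_checksum(intended_lba, data),
    )

def read_plain_ok(disk: dict, lba: int) -> bool:
    data, plain, _ = disk[lba]
    return plain_checksum(data) == plain

def read_identity_ok(disk: dict, lba: int) -> bool:
    data, _, ident = disk[lba]
    return identity_checksum(lba, data) == ident

disk = {}
write_block(disk, 7, 7, b"old contents of block 7")
# Misdirected write: meant for block 3, but it overwrites block 7.
write_block(disk, 7, 3, b"new contents meant for block 3")

plain_ok = read_plain_ok(disk, 7)        # data and checksum travelled together
identity_ok = read_identity_ok(disk, 7)  # address mismatch is caught
print(plain_ok, identity_ok)  # prints "True False"
```

The payload-only check passes because the data and its checksum were misdirected together; only a check that also covers the block's identity notices that block 7 holds data meant for block 3.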

Disk failures in the real world: What does an MTTF of 1 million hours mean to you?

Figure 2: Lifecycle failure pattern for hard drives
- Surprisingly, there is an “infant mortality” phenomenon: a high failure rate early in a hard drive's life. This contradicts my intuition that failures follow an exponential distribution, which implies a constant failure rate. In fact, the article proposes that the field replacement data looks more like a Weibull distribution
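
The difference between the two distributions is easiest to see in the hazard (instantaneous failure) rate. The sketch below uses the standard Weibull hazard h(t) = (k/λ)(t/λ)^(k-1): with shape k = 1 it reduces to the exponential case with a constant rate, and with k < 1 the rate is highest at the start and falls with age. The shape and scale values here are illustrative assumptions, not parameters fitted in the article.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Hazard rate h(t) = (k / lam) * (t / lam) ** (k - 1), defined for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)

ages = [0.5, 1.0, 2.0, 4.0]  # drive age in arbitrary time units

# shape = 1.0 is the exponential distribution: the same rate at every age.
constant = [weibull_hazard(t, shape=1.0, scale=10.0) for t in ages]

# shape < 1.0 (illustrative value): the rate decreases as the drive ages.
decreasing = [weibull_hazard(t, shape=0.7, scale=10.0) for t in ages]

for t, c, d in zip(ages, constant, decreasing):
    print(f"age={t:4.1f}  exponential={c:.4f}  weibull(k=0.7)={d:.4f}")
```

A shape parameter above 1 would give the opposite picture, a rate that rises with age, so a single Weibull family can describe either early-life or wear-out behaviour depending on the fit.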

An Analysis of Latent Sector Errors in Disk Drives

Figure 10: The distribution of observed latent sector errors per day
- The advantage of Enterprise disks over Nearline disks becomes clear only after an age of about 1 year. So we do not need to worry too much about sector errors on Nearline disks in their first year: we could replace the cheap disks every year, and an additional software-layer checksum would still be there to ensure data correctness