Facebook Study Into SSDs Finds ‘Several Distinct Failure Periods’

But paper in partnership with Carnegie Mellon University claims that read disturbance errors do not increase failure rates

A joint study by Facebook engineers and Carnegie Mellon University experts into SSD failure rates has found that SSDs go through “several distinct failure periods” corresponding to the amount of data written to flash chips.

For the paper, which its authors said is the “first comprehensive study of flash-based SSD reliability trends”, six SSD platforms used by Facebook were examined for failure causes.

Although the SSDs studied were not disclosed, the paper said that the components were “similar to those” used in server hardware available from firms such as Fusion-io, Hitachi, Intel, OCZ, Seagate and Virident.

‘Several distinct failure periods’

“We observe that SSDs go through several distinct failure periods – early detection, early failure, usable life, and wearout – during their lifecycle, corresponding to the amount of data written to flash chips,” was the authors’ first conclusion.

The researchers advised that additional error correction at the start of an SSD’s life would go some way towards reducing failure rates during the early detection period.

Another observation from the researchers was that SSDs that do not use throttling techniques to manage temperature are more likely to fail.

“Higher temperatures lead to increased failure rates, but do so most noticeably for SSDs that do not employ throttling techniques,” said the study. “In general, we find techniques like throttling, which may be employed to reduce SSD temperature, to be effective at reducing the failure rate of SSDs.”

The most interesting finding from the study, which examined the SSDs over a four-year period, was that read disturbance errors are “not prevalent in the field”. The researchers said that SSDs that have read the most data do not show a statistically significant increase in failure rates.

“We find that the effect of read disturbance errors is not a predominant source of errors in the SSDs we examine,” said the two Facebook and two Carnegie Mellon researchers.

“While prior work has shown that such errors can occur under certain access patterns in controlled environments… we do not observe this effect across the SSDs we examine.”
