We’ve noticed an increase in a type of disk failure on some of the storage nodes that ultimately has a severe negative impact on storage performance. In particular, we observe that certain models of drives in certain manufacturing date ranges seem to be more prone to failure.
As a result, we’re looking more closely at our logs to gauge how widespread this is. Most of the older storage seems fine; the failures have been concentrated in some of the newer storage units using both 2TB and 4TB drives. The 2TB drives are the more surprising to us, as the model line involved has generally performed as expected, with many older storage units using the same drives without exhibiting these issues.
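For those curious what this kind of log review amounts to, a minimal sketch is below. It assumes a Linux host with smartmontools installed and simply groups drives by model, flagging the ones whose SMART health check does not pass; it is illustrative only, not our actual tooling.

```python
#!/usr/bin/env python3
"""Rough sketch: inventory drives by model and flag SMART health failures.

Assumes a Linux host with smartmontools installed; illustrative only.
"""
import re
import subprocess
from collections import defaultdict

def list_disks():
    # lsblk -d lists whole disks (no partitions); -n drops the header row
    out = subprocess.run(["lsblk", "-dn", "-o", "NAME,TYPE"],
                         capture_output=True, text=True, check=True).stdout
    return [f"/dev/{line.split()[0]}" for line in out.splitlines()
            if line.split() and line.split()[1] == "disk"]

def smart_info(dev):
    # smartctl -i prints drive identity, -H the overall health assessment
    out = subprocess.run(["smartctl", "-i", "-H", dev],
                         capture_output=True, text=True).stdout
    model = re.search(r"Device Model:\s+(.*)", out)
    health = re.search(r"overall-health self-assessment test result:\s+(\w+)", out)
    return (model.group(1).strip() if model else "unknown",
            health.group(1) if health else "unknown")

def main():
    by_model = defaultdict(lambda: {"total": 0, "failed": 0})
    for dev in list_disks():
        model, health = smart_info(dev)
        by_model[model]["total"] += 1
        if health != "PASSED":
            by_model[model]["failed"] += 1
    for model, counts in sorted(by_model.items()):
        print(f"{model}: {counts['failed']}/{counts['total']} drives flagged")

if __name__ == "__main__":
    main()
```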
We are also engaging our vendor to find out whether they are seeing this elsewhere, and we are keeping a close eye on our stock of replacement drives so we can deal with these failures promptly.
We’ve gone ahead and replaced some disks in your storage, as the type of failure they are generating right now causes dramatic slowdowns in I/O performance for the disk arrays.
As a result of the replacements, the arrays will remain slow for roughly 5 hours while they rebuild to restore full redundancy.
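If you want to watch the rebuild yourself, the sketch below shows one way to read progress. It assumes the arrays are Linux software RAID (md) exposed via /proc/mdstat; hardware controllers and ZFS report progress through their own tools instead, so treat this purely as an illustration.

```python
#!/usr/bin/env python3
"""Rough sketch: report md RAID rebuild progress from /proc/mdstat.

Assumes Linux software RAID (md); other array types use different tools.
"""
import re

def rebuild_status(mdstat_path="/proc/mdstat"):
    with open(mdstat_path) as f:
        text = f.read()
    # Progress lines look like:
    #   [=>....]  recovery = 12.3% (123/999) finish=45.6min speed=7890K/sec
    pattern = re.compile(
        r"(recovery|resync)\s*=\s*([\d.]+)%.*finish=([\d.]+)min")
    results = []
    for match in pattern.finditer(text):
        kind, pct, finish = match.groups()
        results.append((kind, float(pct), float(finish)))
    return results

if __name__ == "__main__":
    status = rebuild_status()
    if not status:
        print("No rebuild in progress.")
    for kind, pct, finish in status:
        print(f"{kind}: {pct:.1f}% complete, ~{finish:.0f} minutes remaining")
```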
We’ll be keeping an eye on this problem, as we have noticed a spike in the number of these events recently.