You read it here first.
Last night, while working on updates to the site and talking in a PM, we suddenly lost our entire backup array. While trying to recover that backup array and figure out where it went and why, I noticed our primary array, the one housing all the master content, was running on a single drive with a single point of failure. When I tell you the last drop of blood drained from my head, it is not an exaggeration. I went into full recovery mode before we lost everything. The data center and I limped the machine offline and into recovery mode to contain the fallout, and then the long night began.
Task one: survey the carnage and get a surface-level check on the hardware. The first scan said it's all here, but with one failed drive in the primary array. Next we could see the storage array; it was unmounted but appeared to be viable, good news. Drive health was still in question until everything was up, mounted, and back in production. The data center replaced the failed hard drive in the primary array and rebuilt that array to safeguard our primary data stack. It took 7 hours for the primary drives to get back in sync and functioning as a pair. You guys upload a ton of shit. Once the primary array was rebuilt, we moved on to the storage array, which had vanished mid-write.
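For the curious, here's roughly what babysitting a rebuild looks like. This is a hedged sketch only, assuming plain Linux software RAID where the kernel reports progress in /proc/mdstat; our gear sits behind a hardware RAID controller, so the data center used the vendor's tool instead, but the idea is the same. The device name (md0) and the one-minute poll are made-up example values, not our actual setup.

```python
#!/usr/bin/env python3
"""Rough sketch: poll /proc/mdstat to watch a Linux software-RAID rebuild.

Assumes Linux md RAID; a hardware controller needs its vendor CLI instead.
The device name (md0) and poll interval are illustrative only.
"""
import re
import time

MDSTAT = "/proc/mdstat"

def rebuild_progress(device="md0"):
    """Return the resync/recovery percentage for `device`, or None if idle."""
    with open(MDSTAT) as f:
        text = f.read()
    # /proc/mdstat prints lines like:
    #   [=>...........]  recovery = 12.3% (123456/999999) finish=74.2min
    block = text.split(device, 1)[-1]
    match = re.search(r"(recovery|resync)\s*=\s*([\d.]+)%", block)
    return float(match.group(2)) if match else None

if __name__ == "__main__":
    while True:
        pct = rebuild_progress("md0")
        if pct is None:
            print("No rebuild in progress (or array not found).")
            break
        print(f"Rebuild at {pct:.1f}%")
        time.sleep(60)  # check once a minute; a 7-hour resync is a long wait
```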
We were successful in recovering all the drives and confirmed the journal on each of them. While all this was going on, I couldn't help but think: why did this happen? It's rare to see this in a professional environment when the gear in use is all commercial grade. I was left wondering: do we have bad cabling? Is a backplane not seated? Is the power securely connected to the devices? Do we have a heat or memory problem on either the controller or the motherboard? All possible, but we didn't know. Was this problem just going to sit there quietly until the next time it took everything offline, silently logging its destructive behavior in the background? Sticky note that one; it's a fix not made yet, but you can guarantee it's coming. Bottom line, as this is already too lengthy (a sign of being sleep deprived): the data checked out, and the cabling, drives, power, and heat all came back good. So, out of an abundance of caution, the data center opted to apply fresh thermal paste to the CPUs and replace the RAID controller and riser.
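For anyone wondering how you "confirm" drives after a scare like this: the gist is to ask each disk for its SMART health status (and check the filesystem journal read-only) before trusting it again. Below is a minimal sketch of the drive-health half, assuming smartmontools is installed and the drives are directly visible to the OS; the device list is illustrative, not our real layout, and disks hidden behind a hardware RAID controller typically need smartctl's -d passthrough option.

```python
#!/usr/bin/env python3
"""Rough sketch: run SMART health checks across a handful of drives.

Assumes smartmontools is installed and the drives are directly visible to
the OS. The device list below is an example only.
"""
import subprocess

DRIVES = ["/dev/sda", "/dev/sdb", "/dev/sdc"]  # example devices only

def smart_health(device):
    """Return True if smartctl reports the overall health test as PASSED."""
    result = subprocess.run(
        ["smartctl", "-H", device],
        capture_output=True, text=True,
    )
    return "PASSED" in result.stdout

if __name__ == "__main__":
    for dev in DRIVES:
        status = "OK" if smart_health(dev) else "CHECK ME"
        print(f"{dev}: {status}")
```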
Eleven hours and no sleep later, our machines churned back online.
God, I hate moving equipment. Someone take me out behind the woodshed and plant me six feet under if I ever have to live through this again.