Data Center Meltdown

Last week, the host that I’ve been using for our production server (ServerBeach) had a catastrophic failure that brought down their entire data center. Apparently, a power failure in Virginia caused the host to switch to battery backup. Battery backup powers the servers and networking equipment, but NOT the air conditioners. Then the switch to connect the data center to the backup generators failed to function. As the system ran without air conditioning, temperatures in the data center soared, quickly reaching the point where there was a danger of hardware failure.

The ServerBeach team reacted quickly and did the only responsible thing in the circumstances: they shut down the data center. Their version of events is here.
The timing from Uzanto’s perspective could not have been worse: we were in the middle of a critical exercise with a very important client. It was incredibly important for us that it go off without a hitch. This was a hitch, a big embarrassing one. As the message boards on ServerBeach filled up with angry complaints and accusations, and the (thankfully understanding) phone calls from the client came in, I made cell phone calls to our Indian dev team and woke them up.
In a 2 AM (IST) Skype conference call, we decided to:
a) immediately prep our test server with production-quality code, in case the data center did not come back online;
b) monitor ServerBeach closely and get the system up again the second the data center came back up;
c) pull the trigger on a high-end managed host to minimize the risk of data center failures in the future; and
d) look into architectures that can survive a data center failure (I’m especially interested in using DNS for this: a rough sketch is below, and I’ll blog about it in more detail in the coming weeks).
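To make the DNS idea in (d) a little more concrete, here is the rough shape I have in mind: a small health-check loop that probes the primary data center and, if it stops responding, repoints a low-TTL DNS record at a standby server in a different facility. This is only a sketch under made-up assumptions: the IPs, the hostname, and the update_dns_record stub are hypothetical placeholders, not any real DNS provider’s API.

```python
import socket
import time

# Hypothetical example values -- not our real hosts or addresses.
PRIMARY_IP = "192.0.2.10"       # server in data center A
STANDBY_IP = "198.51.100.20"    # server in data center B
CHECK_PORT = 80                 # port to probe for liveness
CHECK_INTERVAL = 30             # seconds between health checks
FAILURES_BEFORE_SWITCH = 3      # don't flap on a single blip


def is_alive(ip, port, timeout=5):
    """Return True if a TCP connection to ip:port succeeds."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def update_dns_record(ip):
    """Placeholder: push the new A record to the DNS provider.

    In practice this would call whatever API the DNS host exposes;
    the record's TTL should be kept low (say 60s) so the change
    propagates quickly.
    """
    print(f"Pointing www.example.com at {ip}")


def monitor():
    failures = 0
    current = PRIMARY_IP
    while True:
        if is_alive(PRIMARY_IP, CHECK_PORT):
            failures = 0
            if current != PRIMARY_IP:
                current = PRIMARY_IP
                update_dns_record(PRIMARY_IP)   # fail back once the primary recovers
        else:
            failures += 1
            if failures >= FAILURES_BEFORE_SWITCH and current != STANDBY_IP:
                current = STANDBY_IP
                update_dns_record(STANDBY_IP)   # fail over to the other data center
        time.sleep(CHECK_INTERVAL)


if __name__ == "__main__":
    monitor()
```

The obvious catch is DNS caching: clients and resolvers that ignore a low TTL will keep hitting the dead address for a while after a failover, which is part of what I want to dig into in that future post.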
Once external power was restored, it took ServerBeach a few hours to restart their data center. Then a second power failure caused the exact same thing to happen again! Thankfully, the recovery period was shorter the second time around, but it was still a couple of hours before network speed was back to normal. We were lucky and our server recovered quickly from the second outage, but some poor bastards didn’t get their servers online until days afterwards.
I don’t blame ServerBeach for the outage: they don’t bill themselves as a top-of-the-line hosting company, but as the best cheap dedicated hosting provider. Their post-mortem of the outage is here.
But it seems clear to me that a higher-end host would have been (I hope!) more likely to have tested the switch that connects the generator to the data center. The data center failure made me look like an amateur to my client, something that might be acceptable in a free service but is completely unacceptable when providing a B2B service to an enterprise client.
I’ve already switched to a host that I’m much happier with. For now I’m keeping my ServerBeach machine: as we grow Project X, we’ve quickly found we need a demo server we can use to show our latest code to clients. The moral of the story: don’t be cheap. In business, you usually get what you pay for.