There was a time when I was unsure whether I should place battery backup devices in colocation racks. My first thought was that “you can never be too careful.” Then I grew complacent. My reasoning was that the colocation facility would be vastly more capable of protecting the power system than I was. If there is a power outage they can’t stop, then surely there’s nothing I can do to stop the damage.
Do you see the error in that thinking? Certainly a colocation facility has vast resources, in both money and experience, to maintain a top-notch power system. Of course, I’m speaking only of colocation environments that are themselves top-notch, not Mom-N-Pop’s Huntsmans’ Mercantile and Datacenter Solutions. You must first choose a capable datacenter before you can reasonably place faith in its infrastructure. Even then, major catastrophes can and do happen. Errors in engineering will never cease. Datacenters can and do lose power to their floor.
Rimuhosting, a New Zealand hosting provider of no mean reputation, recently had a total power outage in their Dallas datacenter. Their Dallas colocation center, Colo4, released information concerning the outage:
What Happened: On Wednesday, August 10, 2011 at 11:01AM CDT, the Colo4 facility at 3000 Irving Boulevard experienced an equipment failure with one of the automatic transfer switches (ATS) at service entrance #2, which supports some of our long-term customers. The ATS device was damaged and did not allow either commercial or generator power automatically — or through bypass mode. Thus, to restore the power connection, a temporary replacement ATS was required to be put into service.
Colo4’s standard redundant power offering has commercial power backed up by diesel generator and UPS. Each of our six ATSs reports to its own generator and service entrance. The five other ATSs and service entrances at the facility were unaffected.
The ATS failure at service entrance #2 affected customers who had single circuit connectivity (one power supply). For customers who had redundant circuits (or A/B dual power supplies), they access two ATS switches, so the B circuit automatically handled the load. (A few customers with A/B power experienced initial downtime due to a separate switch that was connected to two PDUs and the same service entrance. Power was quickly restored.)
Assessment: As part of our after-action assessment, the Colo4 management team has debriefed with all on-site technical team and electrical contractors as well as the equipment manufacturer, UPS contractors and general contractors to provide assessments on the ATS failure. While an ATS failure is rare, it is even rarer for an ATS to fail and not allow it to go into bypass mode.
While the ATS could be repaired, we made the decision to order a new replacement ATS. This is certainly a more expensive option, but it is the option that provides the best solution for the long-term stability for our customers.
Bad things happen in this world. Be prepared.
This does not mean you should protect yourself against 67 hours of lost power, however. That would be… costly. In a large power outage, you’re likely to experience some network loss as well, so your priority will probably not be keeping your customers’ systems completely free from disruption. The goal is to make the recovery smoother. A sudden loss of power to your racks will likely end in more corruption than a Chicago city council meeting. Only you can determine how long you should be able to sustain a power loss at your colocation, but thirty minutes or less seems like a reasonable window in which to decide whether you need to shut down your systems.
As a result of this incident (by which I was not directly affected), my mindset toward customer-provided battery-backed power in a datacenter has changed. Once I was cautious about it, then I slacked off and ignored it. Now I’m more a proponent of it than ever. True, you will likely not be able to reach your servers remotely to perform graceful shutdowns if the outage affects the datacenter’s network equipment. In that case, hopefully you’ll be given physical access, provided the building keeps its physical security systems on battery backup (which reminds me: that is the kind of thing to ask about before choosing a colo).
If it’s a worst-case scenario where you have no remote or physical access, make sure you have proper shutdown procedures and scripts in place to gracefully shut down all of your systems once the batteries drain to a certain remaining charge. There’s no need to add data corruption to the problems of missed SLAs, business downtime, and angry users and customers.
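As a sketch of what such a script’s decision logic might look like, here is a hypothetical Python example that parses the kind of `key: value` status report produced by NUT’s `upsc` tool. The `ups.status`, `battery.charge`, and `battery.runtime` variable names are standard NUT fields, but the thresholds, the sample report, and the function names are my own illustrative assumptions, not any particular vendor’s API:

```python
def parse_ups_status(report: str) -> dict:
    """Parse `key: value` lines (NUT `upsc`-style output) into a dict."""
    status = {}
    for line in report.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            status[key.strip()] = value.strip()
    return status


def should_shut_down(status: dict, min_charge: int = 40,
                     min_runtime: int = 600) -> bool:
    """Begin a graceful shutdown only when we are on battery ("OB")
    AND either the charge percentage or the estimated runtime in
    seconds has fallen below its threshold. Thresholds are arbitrary
    examples; tune them to your own racks."""
    on_battery = "OB" in status.get("ups.status", "").split()
    charge = int(status.get("battery.charge", 100))
    runtime = int(status.get("battery.runtime", 10**6))
    return on_battery and (charge < min_charge or runtime < min_runtime)


# Simulated report, roughly what `upsc myups@localhost` might print
# while running on battery (myups is a placeholder name):
sample = """\
ups.status: OB DISCHRG
battery.charge: 35
battery.runtime: 420
"""

print(should_shut_down(parse_ups_status(sample)))  # True: charge below 40%
```

In a real deployment you would poll the UPS on a timer and, when `should_shut_down` returns true, invoke your orderly shutdown sequence (databases first, then application servers) rather than letting the batteries run flat.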
What do you do in your colocation space? Do you provide your own battery-backed power, or do you trust the colocation not to let you down?