Categorized | Technology


Rackspace explains what went wrong

Posted on 30 June 2009

Let me start off by saying I have been using Rackspace for our hosting for some time and have been very impressed by the support they give us. Still, yesterday’s downtime made for a very frustrating day for the many folks who weathered the outage. Below is a copy of the report Rackspace offered its customers less than 30 minutes ago.


 

INCIDENT REPORT

 


06/29/2009

Grapevine, DFW Power Interruption





At approximately 3:15pm CDT, a portion of our DFW Datacenter experienced a power outage.
 
The breaker on the primary utility feeder tripped, initiating a sequence of events that ultimately caused a power interruption in Phase I and Phase II of the data center. All systems initially came up on generator power without customer impact. The ‘A’ bank of generators, which supports UPS clusters A and B in Phase I and UPS cluster E in Phase II, then experienced excitation failure, which escalated to the point where the generators were no longer able to maintain the electrical load. Rackspace then attempted to switch to our secondary utility feeder, but was unable to do so due to an issue in the Pad Mounted Switch (PMS). At approximately 3:15pm CDT, power supply through UPS clusters A, B and E was lost when the batteries in those clusters discharged, and equipment receiving power through those clusters experienced an interruption in service.

 

Once the primary utility feed was restored, Rackspace brought cluster E up on utility power. Devices supported by clusters A and B were brought up on generator power, as the generators were able to hold the reduced electrical load. During this transition, the batteries in UPS clusters A, B and E were recharged.

 

Rackspace then initiated steps to bring UPS clusters A and B online and complete the transition back to utility power. Cluster B was moved to utility power with battery protection. Cluster A required repair to module 2 of the UPS, and remained on generator power. The generators experienced a subsequent excitation failure, forcing the transition of cluster A back to the primary utility feed prior to the completion of UPS repairs.

 

Once repairs were completed, that module was re-introduced into the A cluster for redundancy. As of the writing of this Incident Report, the infrastructure behind UPS clusters A, B, and E is being fed via the primary utility feed with UPS protection. The generator vendor and UPS/battery vendor will continue to troubleshoot issues and conduct further root cause analysis.

 

Conclusion:


Rackspace will provide you with more information as it becomes available. Please let us know if you have any further questions or comments about this incident.













This post was written by:

Justin Scott - who has written 11 posts on CIO Mojo.

Justin Scott has owned a few technology consulting companies over the years, been a principal with numerous start-ups, and currently heads up the technology efforts for Sitehawk, a leading provider of web-based chemical management software and services. Mr. Scott is fascinated by the inner workings of what makes a well-run technology team. He has managed large onshore as well as offshore projects, is considered an avid Scrum practitioner, and has served in senior data architect and application roles at earlier points in his career. He is a native Floridian currently living in Tennessee and, when home from work, loves spending time with his beautiful wife and 2 daughters. He has also been known to fish, coach soccer, play racquetball, and play the guitar when time permits.


One Response to “Rackspace explains what went wrong”

  1. I left this comment on July 2nd, but it doesn’t seem to have been posted:

    Rackspace’s outage explanation is an amazingly (bad) example of jargon-laden, un-customer-oriented IT communication. There’s no discussion of real root cause (um, the breaker tripped, but WHY?) or action plan, just a bunch of technobabble that lets no one understand what really happened, why it happened, and what will ensure that it won’t happen again. It’s kind of like the old joke punchline, “Everything you have told me is technically correct, but it’s no use to anyone.” This inspires negative confidence among one’s customers, needless to say.

    I’ve written in my blog about the need (and the ways) to “air the dirty laundry.” See http://www.peterkretzman.com/2008/01/15/why-the-cio-should-air-his-dirty-laundry/

