Thoughts on the recent colo outage

On the afternoon of Sunday 09Aug2009, our colo overheated and shut down. The gory details are here, but basically when the air conditioners failed, the room quickly overheated to unsafe levels, and machines took themselves offline before they were physically damaged. All our build/unittest/talos infrastructure, along with large portions of the rest of Mozilla's infrastructure, came to an abrupt halt.

Matthew (mrz) phoned me soon after the colo went offline, just to give me a heads up, so I was able to forewarn others in the group. The rough timeline was:

  • 13:30 PDT Sunday afternoon: colo offline
  • 21:30 PDT Sunday evening: Mozilla back online
  • 01:00 PDT Monday morning: RelEng declares build infrastructure back online

While it’s bad for a colo provider to have failures like this, it was impressive to watch how the RelEng and IT groups pitched in together to get things going again so quickly – reviving ~420 RelEng machines in under 12 hours was quite a feat.

3 thoughts on “Thoughts on the recent colo outage”

  1. Just wanted to say thanks for making a post about this. I suspect there are no other comments because what you said is really straightforward (or you have them protected and haven’t approved any!). I know if I found this interesting, lots of other people must have as well.

  2. hi Lucy;

    Glad you found the post interesting. Sometimes it’s hard to tell if these blog posts are too detailed, not detailed enough, or just right, so thanks for the feedback.

    (and yes, I do review/moderate to catch comment-spam; Akismet is great, but doesn’t stop them all.)


