Recovering from a datacenter outage…

Tuesday night was going to be exciting because it was the code freeze for Firefox 3.0 beta 5… instead, our entire San Jose datacenter went offline at 8pm PST… a whole different type of excitement. Details are in Justin’s blog, but it seems we hit a network storm caused by a faulty switch in the colo. The network problem was resolved by 9:25pm. A drive mount problem on the cvs server was repaired just after 1am.

However, the Firefox tree remained closed until approximately 11am. We worked on recovery all night, so why did it take so long to reopen the tree?

  • Once the network and cvs server problems were fixed, some machines recovered and came back online automatically, but many did not. There were so many build/unittest/talos machines offline or burning that no one felt safe reopening the tree for checkins until these were back online. Bug#423809 has details of the repair/recovery work we did on various build/unittest/talos machines.
  • Somehow, one VM got totally corrupted by the network outage, so we ended up having to recreate the VM from scratch. Details in bug#423850. It seemed strange to me that a VM could be corrupted by a network outage… [UPDATE: Since all the VMs live on network-attached storage, the instant network failure was just as catastrophic as ripping the disk drive out of a running machine! Thanks to mrz for the explanation.]
  • Some unittest failures started a few hours *before* the network outage, and were not noticed. After the network outage, we brought these unittest machines back up, discovered the failures, and assumed they were caused by a code regression. However, it turned out to be a regression caused by a totally unrelated change we made to the unittest machine setup earlier in the day; not a code issue and not a network outage issue. Confirming all this took time. Ideally, once unittests started failing, no more changes would have landed. That would have made it quick & easy to find the real root cause, and would likely have resolved everything before the network outage complicated the situation.
  • The longer PGO build times meant that, once a machine was back online, it took longer for a burning machine to generate a new build, and therefore show up as green on the tinderbox page.

While we’ve made great improvements with our automation infrastructure in the last few months, Tuesday’s outage proved how much work we still have to do towards getting machines to boot up in a clean, ready-to-use state.

(Bonus: The same auto-boot-clean-configuration work would also help us when provisioning new machines, and help IT with late night Tier1 support…)
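To make the idea concrete, here is a minimal sketch of the kind of boot-time self-check a build/unittest machine could run before rejoining the pool. This is purely illustrative, not our actual tooling; the paths, hostnames, and port numbers are made-up placeholders.

    #!/usr/bin/env python
    # Hypothetical boot-time self-check (illustrative sketch only):
    # verify the basics a build/unittest machine needs before it starts
    # taking work, instead of coming up half-configured and burning.
    import os
    import socket
    import sys

    # Placeholder values for illustration; real machines would have their
    # own list of required mounts and servers.
    REQUIRED_MOUNTS = ["/builds"]
    REQUIRED_HOSTS = [("cvs-server.example.com", 22)]

    def mounts_ok():
        """Check that each required path exists and is writable."""
        return all(os.path.isdir(p) and os.access(p, os.W_OK)
                   for p in REQUIRED_MOUNTS)

    def network_ok(timeout=5):
        """Check that each required host:port is reachable."""
        for host, port in REQUIRED_HOSTS:
            try:
                socket.create_connection((host, port), timeout).close()
            except (socket.error, OSError):
                return False
        return True

    if __name__ == "__main__":
        if mounts_ok() and network_ok():
            print("self-check passed: safe to start the build client")
            sys.exit(0)
        print("self-check failed: staying out of the pool")
        sys.exit(1)

Something this simple, run automatically at boot, would have let machines that lost their network or mounts keep themselves out of the pool until the underlying problem was fixed, rather than needing a human to notice each burning machine.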
