Thunderbird 2.0.0.12 by the (wall-clock) numbers

Mozilla released Thunderbird 2.0.0.12 on Tuesday 26-feb-2008, at 16:40 PST. From “Dev says go” to “release is now available to the public” was just over 14 days (14d 7h 45m) wall-clock time, of which Build & Release took just over 6 days (6d 4h 20m). (The date arithmetic is sketched after the timeline below.)

08:55 12feb: Dev says “go” for rc1
13:55 12feb: 2.0.0.12rc1 builds started
20:55 13feb: 2.0.0.12rc1 linux builds handed to QA
20:55 13feb: 2.0.0.12rc1 mac builds handed to QA
08:05 14feb: 2.0.0.12rc1 signed win32 builds handed to QA
07:30 18feb: 2.0.0.12rc1 update snippets available on betatest update channel
15:30 19feb: QA says “go to beta”.
16:10 19feb: update snippets on beta update channel
08:45 26feb: Dev & QA say “go” for Release; Build had already completed final signing and bouncer entries
09:25 26feb: mirror replication started
13:25 26feb: mirror absorption good for testing to start on releasetest channel
14:20 26feb: QA completes testing releasetest.
15:30 26feb: website changes finalized and visible. Build given “go” to make update snippets live.
16:00 26feb: update snippets available on live update channel
16:40 26feb: release announced
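The wall-clock figures above are just date arithmetic between timeline entries. As a minimal sketch (Python used purely for illustration; the two endpoint timestamps are taken from the timeline, all times PST), the overall number works out like this:

```python
from datetime import datetime

# Endpoints taken from the timeline above (all times PST, 2008).
dev_go       = datetime(2008, 2, 12,  8, 55)   # Dev says "go" for rc1
release_done = datetime(2008, 2, 26, 16, 40)   # release announced

delta = release_done - dev_go
hours, remainder = divmod(delta.seconds, 3600)
minutes = remainder // 60

# Prints "14d 7h 45m" -- the "just over 14 days" wall-clock figure quoted above.
print(f"{delta.days}d {hours}h {minutes}m")
```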

Notes:

1) We’re still doing Thunderbird builds manually, as we’ve not yet had a chance to test the Build Automation used in FF releases. It *should* work, but needs to be tested properly before we switch to using automation in production for Thunderbird. Producing Thunderbird manually explains some of the delay in producing updates above – there was a weekend in there! Now that Rick has joined MailCo, he’s starting to get up to speed and help out. We’ll still be doing TB2.0.0.13 manually, but hope to do TB2.0.0.14 using automation. Watch this space!

2) For better or worse, we are making all our blow-by-blow scribbles public, so the curious can read about it, warts and all, here.

3) As usual, we waited until morning to start pushing to mirrors, so that mirror absorption completed just as QA were arriving in the office to start testing the update channels. We did this to reduce the time files sit on the mirrors untested; in the past, overly excited people have posted the locations of the files as “released” on public forums, even though the last of the sanity checks had not yet finished. Coordinating the mirror push like this reduces that likelihood just a bit. I’m counting that wait time as “Build time”, even though that might be a little unfair to the Build team.

4) Mirror absorption took 4 hours to reach good values, a little longer than usual; it’s unclear exactly why. (A rough sketch of the kind of mirror check involved is below.)
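For anyone curious what “good values” for mirror absorption means in practice: it boils down to enough mirrors actually serving the release files. Here is a rough, purely hypothetical sketch of polling a sample of mirrors; the hostnames and file path below are made up for illustration and this is not our actual tooling:

```python
import urllib.request

# Hypothetical mirror hostnames and file path -- illustrative only.
MIRRORS = [
    "http://mirror1.example.org",
    "http://mirror2.example.org",
    "http://mirror3.example.org",
]
RELEASE_FILE = "/pub/thunderbird/releases/2.0.0.12/linux-i686/en-US/thunderbird-2.0.0.12.tar.gz"

def mirror_has_file(base_url, path, timeout=10):
    """Return True if this mirror already serves the release file."""
    request = urllib.request.Request(base_url + path, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        return False

serving = sum(mirror_has_file(m, RELEASE_FILE) for m in MIRRORS)
print(f"absorption: {serving}/{len(MIRRORS)} sampled mirrors serving the file")
```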

take care

John.

Recovering from a datacenter outage…

Tuesday night was going to be exciting because it was the code freeze for Firefox 3.0beta5… instead, our entire San Jose datacenter went offline at 8pm PST… a whole different type of excitement. Details are in Justin’s blog, but it seems we hit a network storm caused by a faulty switch in the colo. The network problem was resolved by 9:25pm. A drive mount problem on the cvs server was repaired just after 1am.

However, the Firefox tree remained closed until approx 11am. We worked all night doing recovery work, so why did it take so long to reopen the tree?

  • Once the network problem and cvs server problem were fixed, some machines recovered and came back online automatically, but many did not. There were so many build/unittest/talos machines offline or burning that no-one felt safe reopening the tree for checkins until these were back online. Bug#423809 has details of the repair/recovery work we did on various build/unittest/talos machines.
  • Somehow, one VM got totally corrupted by the network outage, so we ended up having to recreate the VM from scratch. Details in bug#423850. Seemed strange to me that a VM could be corrupted by a network outage… [UPDATE: Since all the VMs live on network attached storage, the instant network failure was just as catastrophic as ripping the disk drive out of a running machine! Thanks to mrz for the explanation.]
  • Some unittest failures started a few hours *before* the network outage, and were not noticed. After the network outage, we brought these unittest machines back up, discovered the failures, and assumed they were caused by a code regression. However, it turned out to be a regression caused by a totally unrelated change we made to the unittest machine setup earlier in the day; not a code issue and not a network outage issue. Confirming all this took time. Ideally, once unittests started failing, no more changes would have landed. That would have made it quick & easy to find the real root cause, and would likely have resolved everything before the network outage complicated the situation.
  • The longer PGO-build times meant that, once a machine was back online, it took longer for a burning machine to generate a new build, and therefore to show up as green on the tinderbox page.

While we’ve made great improvements with our automation infrastructure in the last few months, Tuesday’s outage proved how much work we still have to do towards getting machines to boot up in a clean, ready-to-use state.

(Bonus: The same auto-boot-clean-configuration work would also help us when provisioning new machines, and help IT with late night Tier1 support…)
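To make “clean, ready-to-use state” a bit more concrete, here is a purely hypothetical sketch of the kind of boot-time self-check a build/unittest machine could run before rejoining the pool; the tools, mount point and buildmaster host below are made up and are not our real configuration:

```python
import os
import shutil
import socket
import sys

# Hypothetical requirements -- illustrative only, not the real machine setup.
REQUIRED_TOOLS  = ["make", "gcc", "python"]
REQUIRED_MOUNTS = ["/builds"]                      # build scratch area
MASTER_HOST, MASTER_PORT = "buildmaster.example.com", 9989

def run_checks():
    for tool in REQUIRED_TOOLS:
        yield f"tool {tool}", shutil.which(tool) is not None
    for mount in REQUIRED_MOUNTS:
        yield f"mount {mount}", os.path.ismount(mount)
    try:
        socket.create_connection((MASTER_HOST, MASTER_PORT), timeout=5).close()
        reachable = True
    except OSError:
        reachable = False
    yield f"reach {MASTER_HOST}:{MASTER_PORT}", reachable

failures = [name for name, ok in run_checks() if not ok]
if failures:
    print("NOT ready, staying out of the pool:", ", ".join(failures))
    sys.exit(1)
print("clean and ready -- safe to start taking jobs")
```

The point of a check like this is that a machine which comes back from an outage in a bad state simply stays out of the pool, instead of burning builds on tinderbox until a human notices.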

Housekeeping: 1.8 nightlies now on release automation machines

Last week, we finally moved the 1.8 nightlies over to running on the release automation machines. If we’ve done our homework right, no-one noticed anything! And seriously, if you do notice something wrong with the mozilla1.8 / FF2.0.0.x nightlies, or have any questions, please update bug#417147, and/or contact Rob, myself or anyone on build@m.o.

This is a really big deal for us, and has taken many months of preparation and homework to get here. Out of paranoia, we’ve been generating two complete sets of nightlies in parallel: one set on the traditional nightly machines, one set on the release automation machines. And we’ve done this for each of the 3 supported operating systems since 15feb2008. If all goes well, we’ll power down the old build machines soon in bug#422298.
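As an illustration of what that parallel-nightlies paranoia involves (the directory layout and the 5% size tolerance below are hypothetical, not our actual verification scripts), the basic sanity check amounts to comparing what the two sets of machines produced each night:

```python
import os

def artifact_sizes(directory):
    """Map each file in one night's output directory to its size in bytes."""
    return {
        name: os.path.getsize(os.path.join(directory, name))
        for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    }

# Hypothetical local copies of one night's output from each set of machines.
old_machines = artifact_sizes("nightlies/old-machines/2008-02-15")
new_machines = artifact_sizes("nightlies/automation-machines/2008-02-15")

# Builds from different machines won't match byte-for-byte (timestamps, build
# ids), so check that the same artifacts exist and are roughly the same size.
only_in_one = sorted(set(old_machines) ^ set(new_machines))
suspicious  = sorted(
    name for name in set(old_machines) & set(new_machines)
    if abs(old_machines[name] - new_machines[name])
       > 0.05 * max(old_machines[name], new_machines[name])
)

print("artifacts present in only one set:", only_in_one or "none")
print("artifacts with suspiciously different sizes:", suspicious or "none")
```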

That’s a lot of homework for no visible gain, so why bother? Why is this so important to us? To me, this feels like our third major automation milestone since we started the build automation rollout:

  • 1st milestone was shipping FF2.0.0.7 using release automation
  • 2nd milestone was shipping FF3.0beta2 using release automation
  • This 3rd milestone reduces the number of clusters of “similar-but-not-quite-identical” build machines we have to maintain, and reduces some of the weird code paths we support. Both of these help us by reducing the overall complexity of the remaining systems… which speeds up our future code cleanup.
  • Our next (4th) milestone is to do the same with the 1.9/trunk nightlies (bug#421411), which will bring similar improvements.

Stay tuned…