Housekeeping: Moving 1.8 nightlies to release automation machines

I wanted to draw attention to Rob Helmer’s blog post from last week. In short, we are moving the 1.8 nightlies from running on TinderboxClient to running on Buildbot. There are a few reasons this is really important.

  • reliability: by using a pool of shared identical slaves, we expect to be able to survive hardware failures of a generic slave much better than our current situation, where a hardware failure of a dedicated slave will close the tree.
  • easier cleanup: bringing the different nightly and release systems into sync with each other greatly simplifies our future cleanup work. Until now, each change had to be reviewed for possible breakage in several different paths through different build processes, running on differently configured machines. Real mind-warp stuff. This change also makes it easier to spot obsolete build processes that can be trimmed.
  • buildbot: this introduces buildbot into the nightly build process for the first time. Until now, we’ve had our new buildbot/bootstrap enhancements on release builds, but not yet on nightlies, so this is an important milestone for us.
  • this gets us one step closer to being able to change from build-continuously to build-on-checkin.
  • misc annoying headaches: occasionally, nightly/clobber builds collide with hourly/incremental builds because of bad timing. Similarly, for machines that used to do double duty of hourly/nightly builds *and* release builds, we had to remember to manually stop nightly builds before starting release builds, or risk crashing out with broken release builds. This change resolves all that, because buildbot handles the scheduling for us (see the sketch below).
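
For the curious, here’s roughly what that looks like. The fragment below is only a minimal sketch of a Buildbot master.cfg, written against the current Buildbot plugin API rather than what we actually run; the worker and builder names, branch, commands and schedule are all illustrative assumptions, not our real 1.8 production config.

```python
# Illustrative fragment of a Buildbot master.cfg -- not the real 1.8 config.
from buildbot.plugins import schedulers, steps, util, worker

c = BuildmasterConfig = {}

# A pool of identical workers: losing any single machine no longer closes the tree.
c['workers'] = [worker.Worker("moz18-slave%d" % i, "password", max_builds=1)
                for i in range(1, 4)]

# One shared build process that any worker in the pool can run.
nightly_factory = util.BuildFactory([
    steps.ShellCommand(name="checkout", command=["make", "-f", "client.mk", "checkout"]),
    steps.ShellCommand(name="build", command=["make", "-f", "client.mk", "build"]),
])

c['builders'] = [
    util.BuilderConfig(name="linux-1.8-nightly",
                       workernames=["moz18-slave1", "moz18-slave2", "moz18-slave3"],
                       factory=nightly_factory),
]

# The scheduler owns the timing: nightlies fire once a day at a known time,
# instead of colliding with hourly or release builds started by hand or cron.
c['schedulers'] = [
    schedulers.Nightly(name="nightly-1.8",
                       branch="MOZILLA_1_8_BRANCH",
                       builderNames=["linux-1.8-nightly"],
                       hour=3, minute=0),
]
```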

This is a big exciting change for us in the Build team, and has taken months of behind-the-scenes work to make happen. We’ve been running the two production systems in parallel since 14feb, and if all is still ok, we’ll shut down the old Tinderbox builders on 25feb. If we’ve done our homework right, hopefully no one will notice anything different. However, if you do see anything different, or have any questions, please update bug#417147, and/or contact Rob, me, or anyone on build@m.o.

(Assuming all goes ok with this transition on the 1.8 branch, we’ll start doing the same on the 1.9 branch next.)

Firefox 3.0beta3 by the (wall-clock) numbers

Mozilla released Firefox 3.0beta3 on Tuesday 12-feb-2008, at 17:45 PST. From “Dev says go” to “release is now available to public” was 8 days (8d 0h 25m) wall-clock time, of which Build&Release took 2 days 9 hours.

17:20 04feb: Dev says “go” for rc1
18:30 04feb: 3.0b3rc1 builds started
10:10 05feb: Build declared rc1 bad because of mozconfig mismatch after bug#407794.
10:55 05feb: 3.0b3rc2 builds started
13:05 05feb: 3.0b3rc2 mac builds handed to QA
22:35 05feb: 3.0b3rc2 linux and signed-win32 builds handed to QA
23:15 05feb: Tier1 locales discovered broken at time of “Dev go to Build”.
15:20 06feb: figured out how to do rc3 for these locales without invalidating completed testing on rc2 builds
15:20 06feb: 3.0b3rc3 builds started
17:35 06feb: 3.0b3rc3 mac & linux builds handed to QA
21:20 06feb: 3.0b3rc3 signed-win32 builds handed to QA
16:05 07feb: 3.0b3rc3 update snippets available on betatest update channel
17:00 11feb: Dev & QA says “go” for Release; Build already completed final signing, bouncer entries
07:10 12feb: mirror replication started
09:55 12feb: mirror absorption good for testing to start on releasetest channel
12:50 12feb: QA completes testing releasetest.
14:00 12feb: website changes finalized and visible. Build given “go” to make update snippets live.
14:45 12feb: update snippets available on live update channel
17:45 12feb: QA completes testing beta channel. Release announced

Notes:

1) The Build Automation used in FF3.0b3 included a bunch of fixes landed after FF3.0b2, which helped make things smoother. Despite the number of respins, all the housekeeping of the last few weeks paid off.

2) For better or worse, we are making all our blow-by-blow scribbles public, so the curious can read about it, warts and all, here. Those Build Notes also link to our tracking bug#409880.

3) Build declared rc1 invalid because the release mozconfig did not include a set of changes made to the nightly mozconfig. The changes to the nightly mozconfig landed on 13-dec-2007 as part of bug#407794, but corresponding changes were not made to the release mozconfigs. It’s the first time we’ve had mozconfig changes land since we started using automation, and this uncovered a hole in our process: neither the automation nor any humans verified these two mozconfigs before starting builds. We will now manually diff them before starting, and are working to automate this in bug#386338.
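
As a rough illustration of the kind of pre-flight check we have in mind, here’s a small sketch that diffs two mozconfigs and refuses to proceed on a mismatch. The script name, file paths and option handling are illustrative assumptions only; in practice the nightly and release mozconfigs differ on purpose in a few places, so a real check would whitelist expected differences. The actual automated check is what bug#386338 is tracking.

```python
# check_mozconfigs.py -- illustrative sketch only, not the actual release automation.
# Compares the effective option lines of the nightly and release mozconfigs so
# that a mismatch (like the one behind bug#407794) is caught before builds start.
import sys

def read_options(path):
    """Return the set of non-blank, non-comment lines in a mozconfig."""
    with open(path) as f:
        return {line.strip() for line in f
                if line.strip() and not line.lstrip().startswith("#")}

def check_mozconfigs(nightly_path, release_path):
    nightly, release = read_options(nightly_path), read_options(release_path)
    for line in sorted(nightly - release):
        print("only in nightly:  %s" % line)
    for line in sorted(release - nightly):
        print("only in release:  %s" % line)
    return nightly == release

if __name__ == "__main__":
    # usage: python check_mozconfigs.py <nightly mozconfig> <release mozconfig>
    sys.exit(0 if check_mozconfigs(sys.argv[1], sys.argv[2]) else 1)
```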

4) At the time Build was given a “go” to start builds, the locales es-ES, jp, jp-mac and pa-IN were all broken. As some of these were Tier1 locales, quite some effort was put into figuring out how to rebuild just those locales without invalidating the test work already completed on all the other locales. Once we figured out how to do that, we agreed to respin to pick up these locales. For clarity, *all* locales were called rc3, even though most were identical to rc2, and the only difference between rc2 and rc3 was the addition of these 4 locales.

5) Like before, we waited until morning to start pushing to mirrors. This was done so mirror absorption completed as QA were arriving in the office to start testing update channels. We did this because we wanted to reduce the time files sat on the mirrors untested; in the past, overly excited people have posted the locations of the files as “released” on public forums, even though the last of the sanity checks had not finished. Coordinating the mirror push like this reduced that likelihood just a bit.

6) Mirror absorption took just under 3 hours to reach all values >= 60%, our usual threshold.

take care

John.

Firefox 2.0.0.12 by the (wall-clock) numbers

Mozilla released Firefox 2.0.0.12 on Thursday 07-feb-2008, at 16:45 PST. From “Dev says go” to “release is now available to public” was almost 11 days (10d 21h 30m) wall-clock time, of which Build&Release took almost 4 days (95h20m). Sadly, not all important milestones were recorded, so if anyone has info for the times marked ??:?? below, please let me know and I’ll update.

19:10 28jan: Dev says “go” for rc1
19:45 28jan: 2.0.0.12rc1 builds started
22:45 28jan: 2.0.0.12rc1 linux builds handed to QA
02:05 29jan: 2.0.0.12rc1 mac builds handed to QA
04:00 29jan: 2.0.0.12rc1 win32 signed builds handed to QA
12:00 29jan: update snippets available on betatest update channel
13:00 30jan: bug#414856 declared showstopper, Dev says “go” for rc2.
14:50 30jan: 2.0.0.12rc2 builds started
17:00 30jan: 2.0.0.12rc2 linux builds handed to QA
18:55 30jan: 2.0.0.12rc2 mac builds handed to QA
??:?? ??jan: 2.0.0.12rc2 unsigned-win32 builds were waiting to be signed when QA discovered that bug#413250 had another showstopper exploit path
14:40 31jan: Dev says “go” for rc3
14:41 31jan: 2.0.0.12rc3 builds started
16:45 31jan: 2.0.0.12rc3 linux builds handed to QA
18:35 31jan: 2.0.0.12rc3 mac builds handed to QA
10:05 01feb: 2.0.0.12rc3 signed-win32 builds handed to QA
18:05 01feb: bug#415292 declared showstopper….
20:00 01feb: Dev says “go” for rc4
20:40 01feb: 2.0.0.12rc4 builds started
01:25 02feb: 2.0.0.12rc4 linux & mac builds handed to QA
04:35 02feb: 2.0.0.12rc4 signed-win32 builds handed to QA
08:20 02feb: 2.0.0.12rc4 update snippets available on betatest update channel
10:30 04feb: QA says “go to beta”.
11:25 04feb: update snippets on beta update channel
09:00 07feb: Dev & QA says “go” for Release; Build already completed final signing, bouncer entries
09:20 07feb: mirror replication started
11:45 07feb: mirror absorption good for testing to start on releasetest channel
14:40 07feb: QA completes testing releasetest.
??:?? 07feb: website changes finalized and visible. Build given “go” to make update snippets live.
??:?? 07feb: update snippets available on live update channel
16:45 07feb: release announced

Notes:

1) The Build Automation used in FF2.0.0.12 included a bunch of fixes landed after FF2.0.0.11, which helped make things smoother. Despite the number of respins, it felt like all the housekeeping of the last few weeks paid off.

2) For better or worse, we are making all our blow-by-blow scribbles public, so the curious can read about it, warts and all, here. Some highlights were:

  • win32 builds failed out the first time because the buildbot slave was not restarted correctly. These slaves need to be configured to auto-reboot cleanly.
  • mac build failed out the first time and had to be restarted. This has happened to us intermittently before, and we don’t yet know how to reproduce or fix it.
  • Long standing, but previously unknown, bug#414966 was fixed. This involved portions of the tag/respin code using UTC, while other portions used PST. If the initial rc1 build was started after 4pm PST (by which point it is already the next day in UTC), some parts of the automation would expect one date while other parts expected another, which showed up as a problem if we did a respin. As far as we can tell, this bug has always been present, and we have just been “lucky” until now. There’s a small sketch of the mismatch after this list.
  • Long standing bug#388524 was finally fixed. This caused users on “beta” channel to be offered complete updates, even when partial updates were available for them. We now serve partial updates to beta channel users whenever possible.
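
To show the date mismatch concretely, here’s a tiny sketch; it is purely illustrative and not our actual respin code, but it shows how a build started after 4pm PST gets two different “build dates” depending on whether the code asks in PST or in UTC.

```python
# Illustrative only -- not the actual release automation behind bug#414966.
from datetime import datetime, timedelta, timezone

PST = timezone(timedelta(hours=-8))  # Pacific Standard Time, UTC-8

def build_dates(start):
    """Return the build date as seen in PST and as seen in UTC."""
    local_date = start.strftime("%Y-%m-%d")
    utc_date = start.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return local_date, utc_date

# A build kicked off at 20:40 PST on 01-feb (like 2.0.0.12rc4):
start = datetime(2008, 2, 1, 20, 40, tzinfo=PST)
print(build_dates(start))  # ('2008-02-01', '2008-02-02') -- the two halves of
                           # the automation now disagree about the build date.
```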

3) Like before, we waited until morning to start pushing to mirrors. This was done so mirror absorption completed as QA were arriving in the office to start testing update channels. We did this because we wanted to reduce the time files sat on the mirrors untested; in the past, overly excited people have posted the locations of the files as “released” on public forums, even though the last of the sanity checks had not finished. Coordinating the mirror push like this reduced that likelihood just a bit.

4) Mirror absorption took 2.5 hours to reach all values >= 50%. However, this felt a little low, so we’ll go back to a 60% threshold for future releases.

take care

John.

Housekeeping: Moving performance tests from Tinderbox to Talos (followup)

Last week, I blogged about us looking at old running systems and figuring out which ones could be shut down. One set of machines were some old tinderbox performance machines, set up as a quick workaround well over a year ago(!).

As of yesterday, Alice has these same tests running on some new Talos machines, in parallel with our existing tinderbox perf machines. We’re going to leave the two sets of machines running in parallel for another week and a half, to make sure there are no surprises, and expect to power down the old machines on Monday 25th February. Many thanks to Alice for bringing up multiple sets of new machines so that we could do this transition and also keep turnaround times similar to what people currently experience on the older tinderbox systems. The details involved are tricky, but the curious can follow along in bug#413695, and the whole set of bugs linked to from that.

Housekeeping: Moving performance tests from Tinderbox to Talos

One of the housekeeping tasks we’ve been doing this month is figuring out which machines we no longer need to be running, and closing them down if possible.

For over a year now, some machines owned by Build have been running performance tests using the Tinderbox framework. These machines were originally a short-term workaround while the Talos performance machines were being brought online and the performance test suites were migrated to work within the Talos framework. At this point, over a year later:

  • 3 test suites are being run in Talos *and* in Tinderbox (txul, ts, tdhtml). We can stop running these on Tinderbox at any time.
  • 5 newer suites are being run in Talos *only* (Tp3, tgfx, tsvg, tjss, sunspider).
  • 2 older test suites are being run on Tinderbox *only* (Tp, Tp2). Moving these last two suites over from Tinderbox to Talos means we can then shutdown these old Tinderbox machines.

RobCee and Alice are working on moving those Tp, Tp2 suites. The plan is to get these test suites running on Talos, run the tests on Talos *and* Tinderbox for a week just to make sure all is ok, and then finally close down the Tinderbox machines. All the test results will still show up on the graph server, no change there. It’s worth pointing out that the migrated tests will be run on different hardware, with a different framework, so they will give different results. The new Talos-based results should track the previous Tinderbox-based results, but they will be different. For any historical point people care about, we can recreate historical data. Otherwise, we were just planning to redo important milestones and the last couple of weeks of test runs – that seems to be what most people use.
The details involved are tricky, but the curious can follow along in bug#413695, and the whole set of bugs linked to from that.