Three “go to build” emails

Today we shipped Firefox 4.0 beta10. Lots of cool features in there, details already covered here.

We hope you like this latest beta. No doubt you’ve noticed it’s only been 11 days since 4.0beta9. We’re picking up the cadence as we get closer to the final release, doing betas more and more frequently, each with lots of improvements. Of course, please file bugs if you hit any problems!

Meanwhile, there’s one behind-the-scenes detail that I’m most proud of with beta10.

RelEng got the “go to build” for each of Firefox3.5.17, Firefox3.6.14 and Firefox4.0beta10 all within 50mins of each other, all on Friday afternoon. We were able to generate all three releases concurrently, and hand builds over to QA Friday afternoon / Monday morning without any incident.

This is a great testament to how our release infrastructure has improved with the move to buildbot 0.8.x as well as the last 3 months of refactoring and general bug fixing. Of course, there are still lots more improvements we need to make. The next big step is underway: moving all the Fennec release automation code to buildbot 0.8.x and consolidating it with the Firefox release automation code. This will enable us to do multiple Fennec releases at the same time as multiple Firefox releases, something we feel is strategically really important for Mozilla in 2011.

Meanwhile, it was really great to see this infrastructure coming together, and how work done by RelEng so far has made handling those three emails on Friday feasible.

Improving Release Automation: closed bug#478420

We keep finding new things to improve in our automation, so are always filing new dependent bugs and then fixing them. In the 23 months since tracking bug#478420 was created, it has accumulated 163 dependent bugs, of which we’ve fixed 95 and still have 68 open.

For the sake of clarity I’ve left the 95 fixed dependent bugs here, closed bug#478420, and moved the remaining 68 open dependent bugs to a new tracking bug#627271. This was not our first “Improve Release Automation” bug, and it will not be our last. We still have lots of exciting work ahead of us, and more improvements to consider, and we’ll spin off yet another new tracking bug when needed.

While doing all this, it was interesting to grab a coffee and spend a few minutes skimming through the closed bugs remembering the dramas we’d solved, and being reminded how much our infrastructure and capabilities have improved compared to 23 months ago – for RelEng and for Mozilla. Very very cool.

change to Fennec, Firefox & XULRunner directories on ftp.m.o

Here’s a proposal to change the directory structure on ftp.m.o for new Firefox, Fennec and XULrunner builds going forward. To reduce disruption, existing builds would remain where they currently are, until they are aged off as usual.

This fixes an intermittent problem we hit with respins-of-nightly-builds, brings us one step closer to building cool regression-hunting tools, and streamlines RelEng automation as we consolidate Firefox+Fennec automation.


BIKESHED ALERT: There are lots of potential opinions here. To avoid infinite loops, please read this entire doc, and the discussions in the two bugs, before commenting. Also, I’ve cross-posted to a few groups, to make sure this is widely seen. However, please respond here in dev.planning, or if appropriate, in the related bugs:
https://bugzilla.mozilla.org/show_bug.cgi?id=449607
https://bugzilla.mozilla.org/show_bug.cgi?id=487036

Details:
On ftp.m.o, this proposal would only change files under http://ftp.mozilla.org/pub/mozilla.org/firefox, http://ftp.mozilla.org/pub/mozilla.org/xulrunner and http://ftp.mozilla.org/pub/mozilla.org/mobile. Some concrete examples would be helpful:

before: firefox/tinderbox-builds/{branchname}-{OS}/{seconds-since-epoch}/
after: firefox/tinderbox-builds/{branchname}/{YYYYMMDDHHMMSS}/{OS}

before: firefox/nightly/{YYYY-MM-DD-HH}-{branchname}
after: firefox/nightly/{branchname}/{YYYYMMDDHHMMSS}/{OS}

before: mobile/tinderbox-builds/{branchname}-{OS}/{seconds-since-epoch}/
after: mobile/tinderbox-builds/{branchname}/{YYYYMMDDHHMMSS}/{OS}

before: mobile/nightly/{YYYY-MM-DD-HH}-{branchname}
after: mobile/nightly/{branchname}/{YYYYMMDDHHMMSS}/{OS}

As an example, this would change from: http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-linux/1283011618/ …to: http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central/20100828160658/linux

…and change from: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2011-01-03-03-mozilla-central/firefox-4.0b9pre.en-US.win32.zip …to: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/mozilla-central/20110103035959/win32/firefox-4.0b9pre.en-US.win32.zip
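To make the tinderbox-builds mapping concrete, here is a minimal Python sketch of the conversion. It is only an illustration, not RelEng's actual automation: the function names and the `KNOWN_OS` list are made up for this example, and it assumes the new timestamp is the old seconds-since-epoch value rendered in UTC, which is what the mozilla-central example above works out to.

```python
from datetime import datetime, timezone

# Hypothetical set of OS suffixes, needed because branch names also contain
# hyphens. The real list of platform names is an assumption here.
KNOWN_OS = ("linux64", "linux", "macosx64", "macosx", "win32")


def epoch_to_buildid(seconds):
    """Render seconds-since-epoch as YYYYMMDDHHMMSS (UTC assumed)."""
    stamp = datetime.fromtimestamp(int(seconds), tz=timezone.utc)
    return stamp.strftime("%Y%m%d%H%M%S")


def convert_tinderbox_path(old_path):
    """Map {product}/tinderbox-builds/{branch}-{OS}/{epoch}/ to the proposed layout."""
    product, kind, branch_os, epoch = old_path.strip("/").split("/")
    for os_name in KNOWN_OS:
        suffix = "-" + os_name
        if branch_os.endswith(suffix):
            branch = branch_os[: -len(suffix)]
            return f"{product}/{kind}/{branch}/{epoch_to_buildid(epoch)}/{os_name}"
    raise ValueError(f"no known OS suffix in {branch_os!r}")


print(convert_tinderbox_path(
    "firefox/tinderbox-builds/mozilla-central-linux/1283011618/"))
```

Having to guess where the branch name ends and the OS begins is itself a small argument for the new layout, where the two are separate directory levels.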

Why change?
1) A common use case is when someone reports a problem with a buildID, and we want to find that specific build on ftp.m.o. The current process is inefficient: manually trying to find out approximately when the build was created and then converting to epoch, or manually eyeballing the timestamps on files on ftp. With this change, we would immediately be able to find that build. We could later build tools that directly link to the build on ftp.m.o.

2) Builds created with the same BuildID, for every OS, will be in the same directory. We already do this for nightly builds.

3) This full BuildID corresponds to the full BuildID in the txt file we already create alongside each build we post on ftp.m.o. For developers, this txt file also includes the changeset info. For example:
http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2010-08-29-04-mozilla-central/firefox-4.0b5pre.en-US.win32.txt contains:
20100829040614 414ff9016349

4) This avoids using changesets as the unique directory identifier.
Changesets are unique, which is good. However, there are significant drawbacks:
4a) changesets do not sort sequentially, which makes it harder to do a binary search on the filesystem to find a regression.
4b) using changesets raises a different problem about how to handle respin-of-same-changeset. Using BuildID handles respins. However, using changesets would require an additional solution, like creating subdirs numbered build1, build2, or subdirs numbered by BuildIDs/timestamp. That seems even more complicated, and anyway still uses BuildIDs/timestamp info. Even for cases where we do not respin, we’d need to create this subdir anyway, to avoid having respin-logic need to move files (and break links that point to the old location).
4c) using changesets is usually advocated by people trying to figure out what changed between two specific builds. That is better resolved by bug#487036 (see below).

5) This helps fix a set of interconnected bugs
bug#431905 Change build process to generate consistent BuildIDs
bug#449607 change dated dirs on ftp.m.o to use new longer BuildID
bug#496549 relbranch names should have a finer resolution than 1 day
bug#487036 write tool to read buildbot db for BuildID+changesets of nightlies, and then construct URL to feed to hg pushlog
bug#538540 stop putting hour number in nightly directories
bug#584178 list hourly tinderbox builds by changeset on ftp.mozilla.org

6) Semi-related, bug#570814 “Nightly builds should all use the same revision” was fixed recently, so now all the builds for the same night on the same branch get the same BuildID. This should further help tidy up the build directories on ftp.m.o.

7) If RelEng is asked to respin a nightly, and we do so within the same hour as the first nightly (rare but it has happened), the new nightly overwrites the old. Not great, and it causes problems for people getting updates, which then need manual RelEng repair work.

8) Using {OS} as a directory makes it easy to delete the directory and recreate it as part of posting the files of the build. This fixes the recurring unhappiness whenever filenames change (like between betas), which causes problems for nightly.m.o.

9) This makes the structure for Firefox, Fennec and XULrunner builds consistent. This makes the structure for incremental builds and nightly builds consistent. This consistency allows RelEng to further streamline automation.
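Points 1, 3 and 4a above can be sketched in a few lines of Python. This is a hedged illustration only: the helper names are hypothetical, the URL shape is the one proposed in this post (not something that exists on ftp.m.o today), and the txt file is assumed to hold the BuildID followed by the changeset, as in the example in point 3.

```python
from bisect import bisect_left
from datetime import datetime

BASE = "http://ftp.mozilla.org/pub/mozilla.org"


def parse_build_txt(text):
    """Split the 'BuildID changeset' contents of the .txt file next to a build."""
    buildid, changeset = text.split()
    # Raises ValueError if the BuildID is not a YYYYMMDDHHMMSS timestamp.
    datetime.strptime(buildid, "%Y%m%d%H%M%S")
    return buildid, changeset


def nightly_dir_url(product, branch, buildid, os_name):
    """Build a directory URL in the proposed nightly layout."""
    return f"{BASE}/{product}/nightly/{branch}/{buildid}/{os_name}/"


def first_build_at_or_after(buildids, wanted):
    """BuildIDs sort chronologically as plain strings, so bisecting just works."""
    ordered = sorted(buildids)
    index = bisect_left(ordered, wanted)
    return ordered[index] if index < len(ordered) else None


buildid, changeset = parse_build_txt("20100829040614 414ff9016349")
print(nightly_dir_url("firefox", "mozilla-central", buildid, "win32"))
```

The same lookup keyed on changesets would need an extra index, because changeset hashes carry no ordering information.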


Open question:

While we are doing this change, it seems like a good time to also rename the directory “tinderbox-builds”. We are no longer using any tinderbox clients to build/test, and we are almost complete with the switchover from tinderbox-waterfall to TBPL, so this term no longer seems valid. I’m suggesting “continuous” or maybe “continuous-builds” as a better name to store all the incremental build-on-checkin work we do throughout the day.

(Alternatives already suggested that I’d prefer to avoid: “buildbot-builds” (in case we ever switch from buildbot), “builds” (too vague/overloaded), “depend_build” (what happens if we do a clobber in the day?) or “per_checkin_build” (what happens if we collapse build queues to have multiple checkins per build?). What alternatives can you come up with?)

Hope all that makes sense. There’s a lot of background and details, so if I missed something, do let me know. Also, if you have comments or concerns, please chime in on the dev.planning newsgroup, in either of the bugs at the top, or even here as a comment on this post.

Thanks for reading this far!
John.

xpcshell and reftest are slower on Win7, faster on WinXP

Why, oh why, would xpcshell and reftest run so significantly slower on Win7 vs on WinXP? The other unittest suites give comparable performance except for:

  • reftest: 50% slower (1,488 seconds on WinXP but 2,234 on Win7)
  • xpcshell: 75% slower (1,248 seconds on WinXP but 2,190 on Win7)

This was measured using the same binary build of Firefox, on identical hardware, for both OSes.
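For anyone checking the percentages, they follow directly from the raw timings above. A tiny Python sketch (the function name is just for illustration):

```python
def percent_slower(baseline_seconds, other_seconds):
    """How much longer `other` takes, as a percentage of the baseline."""
    return (other_seconds - baseline_seconds) / baseline_seconds * 100


# WinXP is the baseline, Win7 is the slower run.
print(f"reftest:  {percent_slower(1488, 2234):.0f}% slower")
print(f"xpcshell: {percent_slower(1248, 2190):.0f}% slower")
```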

Also, this difference is *after* Armen and Jimm already landed one fix which really helped, but there’s obviously more to do. Details can be found in Armen’s blog and also in bug#617503. Can you help?

Given the number of checkins (and hence tests) we run daily, any help fixing this will be a big win (groan!) for our Win7 test waittimes, which impacts us all.

Infrastructure load for November 2010

Summary:

There were 2,322 pushes in November 2010. This is a continued drop from September (2,436 pushes) and October (2,360 pushes). This continued drop in number of checkins is expected, considering the prolonged lockdown for FF4.0beta7, immediately followed by the lockdown for FF4.0beta8.

The numbers for this month are:

  • 2,322 code changes to our mercurial-based repos, which triggered 292,035 jobs:
  • 43,738 build jobs, or ~61 jobs per hour.
  • 138,585 unittest jobs, or ~192 jobs per hour.
  • 109,712 talos jobs, or ~152 talos jobs per hour.
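The per-hour figures are simply each monthly total divided by the hours in November (30 days of 24 hours). A quick Python sketch of that arithmetic, using only the numbers above; the jobs-per-push figure at the end is derived from them and is not something reported separately:

```python
HOURS_IN_NOVEMBER = 30 * 24  # 720 hours

pushes = 2322
jobs = {"build": 43738, "unittest": 138585, "talos": 109712}

total_jobs = sum(jobs.values())
jobs_per_hour = {kind: round(count / HOURS_IN_NOVEMBER) for kind, count in jobs.items()}
jobs_per_push = total_jobs / pushes

print(total_jobs)      # all jobs triggered by the month's pushes
print(jobs_per_hour)   # rounded hourly rate per job type
print(round(jobs_per_push))
```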

An interesting side effect of these lockdowns is the significant increase in TryServer usage. This is the first time that TryServer has become significantly more than half the overall load for the entire RelEng infrastructure. It feels like developers who were blocked from landing were continuing to work by developing and testing patches using TryServer, but that’s just conjecture.

Details:

  • The long-running lockdown for FF4.0beta7, and then for FF4.0beta8, definitely took their toll on who was able to check in, and where/when.
  • We are still double-running unittests for some OS; running unittest-on-builder and also unittest-on-tester. This continues while developers and QA work through the issues. Whenever unittest-on-test-machine is live and green, we disable unittest-on-builders to reduce wait times for builds. Any help with these tests would be great!
  • The entire series of these infrastructure load blogposts can be found here.
  • We are still not tracking any l10n repacks, nightly builds, release builds or any “idle-timer” builds.

Detailed breakdown is:

Here’s how the math works out (descriptions of the build, unittest and performance jobs triggered by each individual push are here):

The end of 2010: naughty and nice

Did I ever say how much I love xkcd.com?

Hope everyone had a great time off.

It’s easy to get swept along with the commercialism and hype of the season. To me, events like Christmas and New Years are not just about gifts-under-the-tree and countdown-to-champagne-at-midnight; they are an important chance to pause and reflect back on the milestones throughout the year. Good and bad. Funny and sad. In work and in personal life.

I can dream, hope, about what the coming year will bring. I hope some things will go better than planned. No doubt some things will not. And some other things will probably completely surprise us. How we handle all these will help us grow as people, and as a community, throughout the year. To everyone who helped me along the way in 2010, I thank you, and here’s to doing our utmost to help make a great 2011 together!