Infrastructure load for October 2010

Summary:

There were 2,360 pushes in October 2010. This is a slight drop below September’s 2,436 pushes. Considering the lockdown for FF4.0b7, I’d expected the number of checkins this month to be lower.

The numbers for this month are:

  • 2,360 code changes to our mercurial-based repos, which triggered 299,632 jobs:
  • 44,884 build jobs, or ~60 jobs per hour.
  • 140,970 unittest jobs, or ~189 jobs per hour.
  • 113,778 talos jobs, or ~153 talos jobs per hour.
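
As a quick sanity check on the per-hour figures above, here is a minimal Python sketch. It assumes the rates are simple averages over the 744 hours in October (31 days × 24 hours); the variable and function names are just for illustration.

```python
# October has 31 days, so the month spans 31 * 24 = 744 hours.
HOURS_IN_OCTOBER = 31 * 24

# Monthly job counts, taken from the list above.
jobs = {
    "build": 44_884,
    "unittest": 140_970,
    "talos": 113_778,
}


def jobs_per_hour(count, hours=HOURS_IN_OCTOBER):
    """Average number of jobs per hour over the month."""
    return count / hours


rates = {kind: jobs_per_hour(count) for kind, count in jobs.items()}
for kind, rate in rates.items():
    print(f"{kind}: ~{rate:.0f} jobs per hour")
```

Rounded to whole jobs, this gives roughly 60, 189 and 153 jobs per hour, matching the figures quoted in the list.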

Yet again, TryServer accounts for almost half the load of all branches combined on the entire infrastructure.

Details:

  • The long-running lockdown for FF4.0beta7 definitely took its toll on who was able to check in, and where/when.
  • We are still double-running unittests for some OS: both unittest-on-builder and unittest-on-tester. This continues while developers and QA work through the issues. Whenever unittest-on-test-machine is live and green, we disable unittest-on-builders to reduce wait times for builds. Any help with these tests would be great!
  • The entire series of these infrastructure load blogposts can be found here.
  • We are still not tracking any l10n repacks, nightly builds, release builds or any “idle-timer” builds.

Detailed breakdown:

Here’s how the math works out (descriptions of the build, unittest and performance jobs triggered by each individual push are here):

Minefield nightly builds on Fedora14

In case you missed it, Armen recently blogged some exciting news about the work he’s been doing with Tarin, Brett van Gennip and Vitaly at Seneca as well as Chris Tyler of Seneca and Fedora fame!

Fedora14 users are now able to use yum to get nightly builds of Firefox. And then every day, those Fedora14 users will get updated to the newest nightly build!! If you are on Fedora and want to use the latest and greatest Firefox in the run-up to Firefox4.0, this is for you.

Armen’s post has all the details of how to configure your Fedora install for these nightly “Minefield” builds here.

Of course, this is just the tip of the iceberg. There are still lots of loose ends to tidy up. Moving the yum repo to a more scalable location… Figuring out how to handle beta and release builds… Figuring out what to do with other versions of Fedora… etc, etc… If you find any problems, please file bugs in mozilla.org/Release Engineering.

Stay tuned for more progress reports on this project. However, in the meantime, this first visible milestone is a really cool breakthrough for Fedora14 users. Very very nice.

Busy morning yesterday

Yesterday morning, we did a 4-way sim-ship – we simultaneously shipped four different products: Firefox4.0beta8, Fennec4.0beta3, FirefoxHome1.1, and Sync Addon for Firefox 1.6.

The cool new features in each of those releases are already covered elsewhere, so I’ll just focus on the mechanics and processes we went through to make this 4-way sim-ship happen.

  1. We’ve become used to our new ability to sim-ship different versions of Firefox smoothly and quickly (for example, shipping Firefox 3.5.x and 3.6.x security releases within 17 hours). However, yesterday was a very different experience for us. We did four releases instead of two. And more importantly, we did different products, not different versions of the same product – which meant different release processes for each of the four releases had to be cross-coordinated.
  2. Firefox 4.0beta8 was bumpy because of bugs we hit in some new RelEng automation code. Sadly, all respins for Firefox 4.0b8 were caused by bugs in our RelEng automation. (More details to come soon in a separate blogpost, after our postmortem.)
  3. While debugging one of these Firefox 4.0beta8 respins, we were distracted by a real fire alarm – the building fire alarms went off and we all had to evacuate while the fire department went running in looking for the fire. Luckily, while we were waiting outside, Rail discovered he was still within wifi range, so he was able to continue work on fixing the blocker problem. (kudos to Dustin for his impromptu extra support!)
  4. Fennec 4.0beta3 went really smoothly, until a late-breaking problem was discovered as we uploaded Fennec to the Google Marketplace. Fixing this caused Aki to do *two* complete rebuilds of Fennec, and then some further late night hacking afterward… all in ~10 hours. This super-fast turnaround was only possible because of months of preparation by Aki. Amazing work, Aki, truly amazing.

This was the first 4-way sim-ship we’ve done, which is impressive by itself, and we’ve also learned lots. In addition to the usual release mechanics, there was a lot of additional cross-project coordination to keep us on our toes. It’s easy to ignore all the things that went smoothly, and focus on what we need to do better next time, but we should remember it *all*. I know there will be a next time, and I know we will do even better. As Murphy’s Law would predict, all these releases happened while several people were out sick or flying off on family vacations, while the Mozilla AllHands was in full swing, and just before the Christmas vacations. For me, I was most impressed by how all the different people across Mozilla pitched in to help, all trusted each other’s professionalism and all worked together to get these releases out to our users. This was a great experience, so thank you to everyone!

RelEng BrownBag (Dec2010 edition)

This week’s Mozilla All Hands was an excellent opportunity to help more new people understand our Continuous Integration systems, and also to let people who know how these systems *used* to work find out how they have changed.

Armen worked over my old slides, adding a bunch of new diagrams and more info to help explain things more clearly. The new and *way* improved slides are here.

As usual, if you have any comments or complaints about this PDF, please let me know. 🙂 All kudos and compliments should go to Armen for a very cool presentation.

All nightly builds are now created equally!

Summary: Mozilla nightly builds were originally not set up to do nightly builds for a branch using the same code revision across all OS. This complicated any attempt to use nightly builds to track down a platform-specific bug. This has now been fixed. Send beer and chocolate to catlee.

If you are curious for details, read on!

Because of how Mozilla originally set up the nightly build system, there were three little-known quirks from the very beginning:

1) The nightly build was of the tip of the code at the time the build started.
This sounds good, but this meant that anyone who checked in late at night ran the risk of breaking the build, and then not being able to back out the change quickly enough, causing the nightly builds to fail.

2) The machines for each OS started builds at different times
A nightly was started whenever a machine finished its previous build and the next build was the first one started after 3am. That first build after 3am would build using the tip of the code, and be published as a nightly build. However, “first job after 3am” when there was only one build machine per OS meant starting the nightly build at different times for different OS; this window of possible changes was ~3 hours (longest build time minus 1 min). Anyone who checked in during that period would get their change included in the nightly build for any OS that had not yet started, but not in the nightly build for any OS already started.

3) Nightly builds on different OS went into different directories
Because the nightly builds started at different times, the generated builds got different BuildIDs, and were posted into different directories on ftp.m.o. This complicated regression hunting work.

All confusing.

Catlee landed some changes recently which mean that now:

1) The nightly build is of the most recent “good” code.
The nightly build now does not automatically build “tip”. This sounds counterintuitive at first, but actually makes sense – read on. The nightly build now starts by attempting to find a changeset that is newer than the previous nightly, and which is also known to be a good changeset. Right now the definition of “good changeset” is “compiles+links”. Eventually, as there are fewer intermittent tests, the plan is to change that definition to “compiles+links+passes-tests”. Worst case, if *no* changeset has successfully built since the previous nightly, then we’ll fall back to current behavior, and attempt to build tip even though we expect it to fail.
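
To make that selection rule concrete, here is a minimal Python sketch. It is not catlee’s actual code: the `Changeset` class, the `pick_nightly_changeset` function and the sample revisions are all hypothetical, and it assumes we want the *newest* good changeset since the previous nightly.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Changeset:
    rev: str        # hypothetical changeset identifier
    order: int      # position in history; higher means newer
    built_ok: bool  # "good" here means it compiled and linked


def pick_nightly_changeset(history: List[Changeset],
                           last_nightly_order: int) -> Changeset:
    """Pick the newest good changeset newer than the previous nightly.

    If nothing has built successfully since then, fall back to tip,
    even though that build is expected to fail.
    """
    good = [c for c in history
            if c.built_ok and c.order > last_nightly_order]
    if good:
        return max(good, key=lambda c: c.order)
    return max(history, key=lambda c: c.order)  # fallback: build tip


history = [
    Changeset("aaa111", 1, True),   # previous nightly was built from this
    Changeset("bbb222", 2, True),
    Changeset("ccc333", 3, True),
    Changeset("ddd444", 4, False),  # tip, broken by a late-night checkin
]
chosen = pick_nightly_changeset(history, last_nightly_order=1)
print(chosen.rev)
```

In this made-up history, the broken tip is skipped and the nightly is built from the newest changeset that compiled and linked.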

2) Each nightly build is told its BuildID and changeset
The buildbot master tells the build slave which BuildID and changeset to use for the nightly builds. This means the nightly builds are created with the same BuildID for each OS – which means that the nightly builds for each OS show up in the *same* directory on ftp.m.o. No more finding last night’s mozilla-central nightly for linux, mac and win32 in different directories!
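
The difference is easy to see in a small Python sketch. Everything here is made up for illustration: the start times, the timestamp-style BuildID format and the helper names are assumptions, not the real buildbot code or the real directory layout on ftp.m.o.

```python
from collections import defaultdict
from datetime import datetime


def build_id_from(start: datetime) -> str:
    """Hypothetical timestamp-style BuildID, for illustration only."""
    return start.strftime("%Y%m%d%H%M%S")


def group_by_build_id(builds):
    """Map each BuildID to the platforms published under it."""
    dirs = defaultdict(list)
    for platform, build_id in builds:
        dirs[build_id].append(platform)
    return dict(dirs)


# Old behaviour: each machine starts at a different time
# and derives its own BuildID from its own start time.
starts = {
    "linux": datetime(2010, 12, 1, 3, 2),
    "mac": datetime(2010, 12, 1, 4, 15),
    "win32": datetime(2010, 12, 1, 5, 40),
}
old = group_by_build_id((p, build_id_from(t)) for p, t in starts.items())

# New behaviour: the master picks one BuildID and hands it to every slave.
master_build_id = build_id_from(datetime(2010, 12, 1, 3, 0))
new = group_by_build_id((p, master_build_id) for p in starts)

print(len(old), "directories before;", len(new), "directory after")
```

With per-machine BuildIDs the three platforms land in three places; with a single master-assigned BuildID they all land in one.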

All obvious goodness!

“Why wasn’t this fixed years ago???” I hear you ask. It has only become possible after all the other recent changes and scaling done in RelEng, as well as detangling what “build start time” and “BuildID” mean in Tinderbox, Makefile and MozillaBuildSystem. Fixing this very long standing annoyance should help developers and QA triage problems with nightly builds, and also makes me happy. For the curious, further details are in bug#570814.

Food experiments in Japan: Cheese KitKats

No visit to Japan would be complete without some strange (to me) food experiments.

Last time I was here, I was surprised to find I actually liked Strawberry KitKat, and RedBean KitKat, so was lulled into a false sense of confidence when I bought these.

Turns out they are white-chocolate-combined-with-cheddar-cheese flavor. Vote: Instant Yuk. Tried a couple of times over the next few days, but there was no way I could make these palatable, so had to toss the rest of the pack out.