20 Jan 2011
JohnMozilla
We keep finding new things to improve in our automation, so we are always filing new dependent bugs and then fixing them. In the 23 months since tracking bug#478420 was created, it has accumulated 163 dependent bugs, of which we’ve fixed 95 and still have 68 open.

For the sake of clarity I’ve left the 95 fixed dependent bugs here, closed bug#478420, and moved the remaining 68 open dependent bugs to a new tracking bug#627271. This was not our first “Improve Release Automation” bug, and it will not be our last. We still have lots of exciting work ahead of us, and more improvements to consider, and we’ll spin off yet another new tracking bug when needed.
While doing all this, it was interesting to grab a coffee and spend a few minutes skimming through the closed bugs remembering the dramas we’d solved, and being reminded how much our infrastructure and capabilities have improved compared to 23 months ago – for RelEng and for Mozilla. Very very cool.
11 Jan 2011
JohnMozilla
Here’s a proposal to change the directory structure on ftp.m.o for new Firefox, Fennec and XULrunner builds going forward. To reduce disruption, existing builds would remain where they currently are, until they are aged off as usual.
This fixes an intermittent problem we hit with respins-of-nightly-builds, brings us one step closer to building cool regression-hunting tools, and streamlines RelEng automation as we consolidate Firefox+Fennec automation.
BIKESHED ALERT: There are lots of potential opinions here. To avoid infinite loops, please read this entire doc, and the discussions in the two bugs, before commenting. Also, I’ve cross-posted to a few groups, to make sure this is widely seen. However, please respond here in dev.planning, or if appropriate, in the related bugs:
https://bugzilla.mozilla.org/show_bug.cgi?id=449607
https://bugzilla.mozilla.org/show_bug.cgi?id=487036
Details:
On ftp.m.o, this proposal would only change files under http://ftp.mozilla.org/pub/mozilla.org/firefox, http://ftp.mozilla.org/pub/mozilla.org/xulrunner and http://ftp.mozilla.org/pub/mozilla.org/mobile. Some concrete examples would be helpful:
before: firefox/tinderbox-builds/{branchname}-{OS}/{seconds-since-epoch}/
after: firefox/tinderbox-builds/{branchname}/{YYYYMMDDHHMMSS}/{OS}
before: firefox/nightly/{YYYY-MM-DD-HH}-{branchname}
after: firefox/nightly/{branchname}/{YYYYMMDDHHMMSS}/{OS}
before: mobile/tinderbox-builds/{branchname}-{OS}/{seconds-since-epoch}/
after: mobile/tinderbox-builds/{branchname}/{YYYYMMDDHHMMSS}/{OS}
before: mobile/nightly/{YYYY-MM-DD-HH}-{branchname}
after: mobile/nightly/{branchname}/{YYYYMMDDHHMMSS}/{OS}
As an example, this would change from: http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-linux/1283011618/ …to: http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central/20100828160658/linux
…and change from: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2011-01-03-03-mozilla-central/firefox-4.0b9pre.en-US.win32.zip …to: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/mozilla-central/20110103035959/win32/firefox-4.0b9pre.en-US.win32.zip
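As a rough illustration of the tinderbox-builds mapping above, here is a minimal Python sketch. The function names are hypothetical, and it makes two assumptions: the timestamp is formatted in UTC (which reproduces the mozilla-central-linux example above), and the {OS} is whatever follows the last hyphen in the old directory name.

```python
from datetime import datetime, timezone


def epoch_to_buildid(seconds_since_epoch: int) -> str:
    """Format an epoch timestamp as a 14-digit YYYYMMDDHHMMSS string.

    Assumption: UTC, because that reproduces the example in the post.
    """
    moment = datetime.fromtimestamp(seconds_since_epoch, tz=timezone.utc)
    return moment.strftime("%Y%m%d%H%M%S")


def new_tinderbox_path(old_path: str) -> str:
    """Map {product}/tinderbox-builds/{branchname}-{OS}/{epoch}/ to the proposed layout.

    Assumption: the OS is everything after the last hyphen of the
    {branchname}-{OS} directory name.
    """
    product, kind, branch_os, epoch = old_path.strip("/").split("/")
    branch, _, os_name = branch_os.rpartition("-")
    return f"{product}/{kind}/{branch}/{epoch_to_buildid(int(epoch))}/{os_name}"


print(new_tinderbox_path("firefox/tinderbox-builds/mozilla-central-linux/1283011618/"))
```

An OS name that itself contains a hyphen would need a smarter split than this sketch does.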
Why change?
1) A common use case is when someone reports a problem with a buildID, and we want to find that specific build on ftp.m.o. The current process, of manually trying to find out approximately when the build was created and then converting to epoch, or manually eyeballing the timestamps on files on ftp, is inefficient. With this change, we would immediately be able to find that build. We could later build tools that directly link to the build on ftp.m.o.
2) Builds created with the same BuildID, for every OS, will be in the same directory. We already do this for nightly builds.
3) This full BuildID corresponds to the full BuildID in the txt file we already create alongside each build we post on ftp.m.o. For developers, this txt file also includes the changeset info. For example:
http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2010-08-29-04-mozilla-central/firefox-4.0b5pre.en-US.win32.txt contains:
20100829040614 414ff9016349
4) This avoids using changesets as the unique directory identifier.
Changesets are unique, which is good. However, there are significant drawbacks:
4a) changesets do not sort sequentially, which makes it harder to do a binary divide on filesystem to find a regression.
4b) using changesets raises a different problem about how to handle respin-of-same-changeset. Using BuildID handles respins. However, using changesets would require an additional solution, like creating subdirs numbered build1, build2, or subdirs numbered by BuildIDs/timestamp. That seems even more complicated, and anyway still uses BuildIDs/timestamp info. Even for cases where we do not respin, we’d need to create this subdir anyway, to avoid having respin-logic need to move files (and break links that point to the old location).
4c) using changesets is usually advocated by people trying to figure out what changed between two specific builds. That is better resolved by bug#487036 (see below).
5) This helps fix a set of interconnected bugs
bug#431905 Change build process to generate consistent BuildIDs
bug#449607 change dated dirs on ftp.m.o to use new longer BuildID
bug#496549 relbranch names should have a finer resolution than 1 day
bug#487036 write tool to read buildbot db for BuildID+changesets of nightlies, and then construct URL to feed to hg pushlog
bug#538540 stop putting hour number in nightly directories
bug#584178 list hourly tinderbox builds by changeset on ftp.mozilla.org
6) Semi-related, bug#570814 “Nightly builds should all use the same revision” was fixed recently, so now all the builds for the same night on the same branch get the same BuildID. This should further help tidy up the build directories on ftp.m.o.
7) If RelEng is asked to respin a nightly, and we do so within the same hour as the first nightly (rare, but it has happened), the new nightly overwrites the old. This is not great, and it causes problems for people getting updates, which then need manual RelEng repair work.
8) Using {OS} as a directory makes it easy to delete the directory and recreate it as part of posting the files of the build. This fixes the recurring unhappiness whenever filenames change (for example between betas), which causes problems for nightly.m.o.
9) This makes the structure for Firefox, Fennec and XULrunner builds consistent. This makes the structure for incremental builds and nightly builds consistent. This consistency allows RelEng to further streamline automation.
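To illustrate point 4a: because YYYYMMDDHHMMSS directory names sort chronologically as plain strings, a regression hunt can binary-search a directory listing directly. The sketch below is only an illustration under stated assumptions. The function name and the is_bad callback are hypothetical (in practice the callback might download and test a build), the newest build is assumed to be bad, and every build after the first bad one is assumed to be bad too.

```python
from typing import Callable, Sequence


def first_bad_build(build_ids: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest BuildID for which is_bad() returns True.

    Assumes build_ids is non-empty, the newest build is bad, and that
    once a build is bad every later build is bad as well.
    """
    ordered = sorted(build_ids)  # 14-digit BuildIDs sort chronologically
    low, high = 0, len(ordered) - 1
    while low < high:
        mid = (low + high) // 2
        if is_bad(ordered[mid]):
            high = mid       # regression is at mid or earlier
        else:
            low = mid + 1    # regression landed after mid
    return ordered[low]


# Hypothetical listing of one branch directory, in no particular order.
listing = ["20110103035959", "20100829040614", "20101201000000", "20100828160658"]
print(first_bad_build(listing, lambda build_id: build_id >= "20101201000000"))
```

The same search over changeset-named directories would first need a lookup to put the changesets in order.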
Open question:
While we are doing this change, it seems like a good time to also rename the directory “tinderbox-builds”. We are no longer using any tinderbox clients to build/test, and we have almost completed the switchover from tinderbox-waterfall to TBPL, so this term no longer seems valid. I’m suggesting “continuous” or maybe “continuous-builds” as a better name to store all the incremental build-on-checkin work we do throughout the day.
(Alternatives already suggested that I’d prefer to avoid: “buildbot-builds” (in case we ever switch from buildbot), “builds” (too vague/overloaded), “depend_build” (what happens if we do a clobber in the day?) or “per_checkin_build” (what happens if we collapse build queues to have multiple checkins per build?). What alternatives can you come up with?)
Hope all that makes sense – there’s a lot of background and details, so if I am missing something, do let me know. Also if you have comments or concerns, please chime in on the dev.planning newsgroup, in either of the bugs at the top, or even here as a comment on this post.
Thanks for reading this far!
John.
11 Jan 2011
JohnMozilla
Why, oh why, would xpcshell and reftest run so significantly slower on Win7 vs on WinXP? The other unittest suites give comparable performance except for:
- reftest: 50% slower (1,488 seconds on WinXP but 2,234 on Win7)
- xpcshell: 75% slower (1,248 seconds on WinXP but 2,190 on Win7)
This was measured using the same binary build of Firefox, and the same identical hardware being used on both OS.
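For anyone checking the percentages, here is a small Python sketch of the arithmetic, using only the timings quoted above (the function name is just for illustration):

```python
def percent_slower(baseline_seconds: float, other_seconds: float) -> float:
    """How much longer other_seconds is than baseline_seconds, as a percentage."""
    return (other_seconds - baseline_seconds) / baseline_seconds * 100


# (suite, seconds on WinXP, seconds on Win7), as quoted in the post
for suite, winxp, win7 in [("reftest", 1488, 2234), ("xpcshell", 1248, 2190)]:
    print(f"{suite}: {percent_slower(winxp, win7):.0f}% slower on Win7")
```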
Also, this difference is *after* Armen and Jimm already landed one fix, which really helped, but there’s obviously more to do – details can be found in Armen’s blog and also in bug#617503. Can you help?
Given the number of checkins (and hence tests) we run daily, any help fixing this will be a big win (groan!) for our Win7 test waittimes, which impacts us all.
03 Jan 2011
JohnMozilla
Summary:
There were 2,322 pushes in November 2010. This is a continued drop from September (2,436 pushes) and October (2,360 pushes). This continued drop in number of checkins is expected, considering the prolonged lockdown for FF4.0beta7, immediately followed by the lockdown for FF4.0beta8.
The numbers for this month are:
- 2,322 code changes to our mercurial-based repos, which triggered 292,035 jobs:
- 43,738 build jobs, or ~61 jobs per hour.
- 138,585 unittest jobs, or ~192 jobs per hour.
- 109,712 talos jobs, or ~152 talos jobs per hour.
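The per-hour figures can be reproduced with a short Python sketch. It assumes the rates are averaged over every hour of the calendar month (30 days for November 2010), which matches the numbers quoted above; the variable and function names are only illustrative.

```python
import calendar


def jobs_per_hour(jobs: int, year: int, month: int) -> float:
    """Average jobs per hour, assuming every hour of the calendar month counts."""
    days_in_month = calendar.monthrange(year, month)[1]
    return jobs / (days_in_month * 24)


# Job counts for November 2010, as quoted in the post
november = {"build": 43_738, "unittest": 138_585, "talos": 109_712}
total = sum(november.values())
rates = {kind: jobs_per_hour(count, 2010, 11) for kind, count in november.items()}

print(f"total jobs: {total:,}")
for kind, rate in rates.items():
    print(f"{kind}: ~{rate:.0f} jobs per hour")
```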
An interesting side effect of these lockdowns is the significant increase in TryServer usage. This is the first time that TryServer has become significantly more than half the overall load for the entire RelEng infrastructure. It feels like developers who were blocked from landing were continuing to work by developing and testing patches using TryServer, but that’s just conjecture.
Details:
- The long-running lockdown for FF4.0beta7, and then for FF4.0beta8, definitely took their toll on who was able to checkin, and where/when.
- We are still double-running unittests for some OS; running unittest-on-builder and also unittest-on-tester. This continues while developers and QA work through the issues. Whenever unittest-on-test-machine is live and green, we disable unittest-on-builders to reduce wait times for builds. Any help with these tests would be great!
- The entire series of these infrastructure load blogposts can be found here.
- We are still not tracking any l10n repacks, nightly builds, release builds or any “idle-timer” builds.
Detailed breakdown is:


Here’s how the math works out (descriptions of build, unittest and performance jobs triggered by each individual push are here):

01 Jan 2011
JohnMozilla
Did I ever say how much I love xkcd.com?

Hope everyone had a great time off.
It’s easy to get swept along with the commercialism and hype of the season. To me, events like Christmas and New Years are not just about gifts-under-the-tree and countdown-to-champagne-at-midnight; they are an important chance to pause and reflect back on the milestones throughout the year. Good and bad. Funny and sad. In work and in personal life.
I can dream, hope, about what the coming year will bring. I hope some things will go better than planned. No doubt some things will not. And some other things will probably completely surprise us. How we handle all these will help us grow as people, and as a community, throughout the year. To everyone who helped me along the way in 2010, I thank you, and here’s to doing our utmost to help make a great 2011 together!
30 Dec 2010
JohnMozilla
Summary:
There were 2,360 pushes in October 2010. This is a slight drop below September’s 2,436 pushes. Considering the lockdown for FF4.0b7, I’d expected the number of checkins this month to be lower.
The numbers for this month are:
- 2,360 code changes to our mercurial-based repos, which triggered 299,632 jobs:
- 44,884 build jobs, or ~60 jobs per hour.
- 140,970 unittest jobs, or ~189 jobs per hour.
- 113,778 talos jobs, or ~153 talos jobs per hour.
Yet again, TryServer continues to be almost half the load of all branches combined on the entire infrastructure.
Details:
- The long-running lockdown for FF4.0beta7 definitely took its toll on who was able to checkin, and where/when.
- We are still double-running unittests for some OS; running unittest-on-builder and also unittest-on-tester. This continues while developers and QA work through the issues. Whenever unittest-on-test-machine is live and green, we disable unittest-on-builders to reduce wait times for builds. Any help with these tests would be great!
- The entire series of these infrastructure load blogposts can be found here.
- We are still not tracking any l10n repacks, nightly builds, release builds or any “idle-timer” builds.
Detailed breakdown is:


Here’s how the math works out (descriptions of build, unittest and performance jobs triggered by each individual push are here):

24 Dec 2010
JohnMozilla
In case you missed it, Armen recently blogged some exciting news about the work he’s been doing with Tarin, Brett van Gennip and Vitaly at Seneca as well as Chris Tyler of Seneca and Fedora fame!
Fedora14 users are now able to use yum to get nightly builds of Firefox. And then every day, those Fedora14 users will get updated to the newest nightly build!! If you are on Fedora and want to use the latest and greatest Firefox in the approach to Firefox4.0, this is for you.
Armen’s post has all the details of how to configure your Fedora install for these nightly “Minefield” builds here.
Of course, this is just the tip of the iceberg. There’s still lots of loose ends to tidy up. Moving the yum repo to a more scalable location… Figuring how to handle beta and release builds… Figuring what to do with other versions of Fedora… etc, etc… If you find any problems, please file bugs in mozilla.org/Release Engineering.
Stay tuned for more progress reports on this project. However, in the meanwhile, this first visible milestone is a really cool breakthrough for Fedora14 users. Very very nice.
24 Dec 2010
JohnMozilla
Yesterday morning, we did a 4-way sim-ship – we simultaneously shipped four different products: Firefox4.0beta8, Fennec4.0beta3, FirefoxHome1.1, Sync Addon for Firefox 1.6.
The cool new features in each of those releases are already covered elsewhere, so I’ll just focus on the mechanics and processes we went through to make this 4-way sim-ship happen.
- We’ve become used to our new ability to sim-ship different versions of Firefox smoothly and quickly (for example, shipping Firefox 3.5.x and 3.6.x security releases within 17 hours). However, yesterday was a very different experience for us. We did four releases instead of two. And more importantly, we did different products, not different versions of the same product – which meant different release processes for each of the four releases had to be cross-coordinated.
- Firefox 4.0beta8 was bumpy because of bugs we hit in some new RelEng automation code. Sadly all respins for Firefox 4.0b8 were caused by bugs in our RelEng automation. (More details to come soon in separate blogpost, after our postmortem.)
While debugging one of these Firefox 4.0beta8 respins, we were distracted by a real fire alarm – the building fire alarms went off and we all had to evacuate while the fire department went running in looking for the fire. Luckily, while we were waiting outside, Rail discovered he was still within wifi range, so he was able to continue work on fixing the blocker problem. (kudos to Dustin for his impromptu extra support!)
- Fennec 4.0beta3 went really smoothly, until a late-breaking problem was discovered as we uploaded Fennec to the Google Marketplace. Fixing this caused Aki to do *two* complete rebuilds of Fennec, and then some further late night hacking afterward… all in ~10 hours. This super-fast turnaround was only possible because of months of preparation by Aki. Amazing work, Aki, truly amazing.
This was the first 4-way sim-ship we’ve done, which is impressive by itself, and we’ve also learned lots. In addition to the usual release mechanics, there was a lot of additional cross-project coordination to keep us on our toes. It’s easy to ignore all the things that went smoothly, and focus on what we need to do better next time, but we should remember it *all*. I know there will be a next time, and I know we will do even better. As Murphy’s Law would predict, all these releases happened while several people were out sick or flying off to family vacations, while the Mozilla AllHands was in full swing, and just before the Christmas vacations. For me, I was most impressed by how all the different people across Mozilla pitched in to help, all trusted each other’s professionalism and all worked together to get these releases out to our users. This was a great experience, so thank you to everyone!
17 Dec 2010
JohnMozilla
This week’s Mozilla All Hands was an excellent opportunity to help more new people understand our Continuous Integration systems, and also to help people who know how these systems *used* to work find out how they have changed.
Armen worked over my old slides, adding a bunch of new diagrams and more info to help explain things more clearly. The new and *way* improved slides are here.
As usual, if you have any comments or complaints about this PDF, please let me know.
All kudos and compliments should go to Armen for a very cool presentation.
14 Dec 2010
JohnMozilla
Found these photos between Tokyo and Nagoya, and am providing them without any comment:
‘nuf said!