AOSA(vol2) and Mozilla’s Release Engineering: now on kindle and nook!

“The Architecture of Open Source Applications (vol2)” was published in paperback format in May 2012. More details here. A few days ago, we were also published in ebook format, so now you can download the same book from Amazon (for your kindle) or from Barnes & Noble (for your nook).

Very cool to see this, and big thanks to Amy Brown for making this happen.

(As always, proceeds from book sales go to Amnesty International.)

linux on a Boeing 777

Last week, at SFO, while waiting for clearance to pushback from the gate, I noticed that the headrest entertainment system had crashed and was rebooting…and it was linux!

I couldn't read all the messages – the bootup messages scrolled by so quickly – but there was at least one error message about not finding a config file?!? The crew had to reboot the system a few times before it finally booted up successfully. And yes, once booted, it stayed up and running for the entire flight to LHR.

Here’s the best I could do with some rushed photos, in poor lighting, while they were telling me to turn off all electronic devices.

(click images for enlarged photos)

Oh, and while writing this post, I found someone else’s photo of linux booting on another Boeing 777.

Calling all remoties

tl;dr: If you are a remotie…or if you work with someone who is a remotie… I’d love to hear from you.

Whenever “remoties” come up in discussion, I continue to be surprised by the level of interest people have about this.

It's not just a polite “oh, that's interesting”. It's a sudden, intense outpouring of personal war stories – “oh really? Let me tell you about the time when…”. Some of those stories were told as validation (“yes, we did what you do, and we're happy it worked for us also” or “we didn't do what you do, and it ended badly“). Some were told in denial (“we tried that once, it didn't work out, which proves it is not ever possible“). Some were told in despair (“…so now my company won't hire any remoties“). But all of these stories were told with intense personal fervor, sometimes years after the fact!

This shouldn't have surprised me. As Homa Bahrami pointed out when I met her at the Mozilla Summit in 2011, and again in meetings this summer, working with remoties is a hard people-and-organization problem, not a software problem. Homa also pointed out the intense, long-term impact this can have on someone's personal life and entire career, which explains some of the passionate responses I've received so far.

Stepping back, I realized that while most of the people I’ve talked with so far are in the computer business, I’ve also heard similar stories from university lecturers, book publishers, public relations people, medical doctors and traveling sales reps.

This got me thinking about how to contact even more people who work remotely… hence this blog post.

If you are a remotie…or if you work with someone who is a remotie… I’d be really interested to hear from you.

  • Do you have examples of things that did (or did not!) work for you?
  • Do you have ideas of things you haven’t tried, but which you think might help?

As usual, you can post comments below. I do also understand this is a personal thing, especially if you are still working in the situation. Therefore, if you would rather email me privately instead, please do, and put “remoties” somewhere in the subject line. I will, of course, honor any requests to keep feedback anonymous; all I ask is that you give me a working email address, in case something is unclear and I want to contact you with followup questions.


Meanwhile, here’s a collection of useful links I’ve found about working remotely. If you know of others, please let me know.

(UPDATED: added another link, joduinn 16mar2014, 09nov2014)

Increasing capacity

tl;dr: We’ve had a lot of infrastructure changes go live in the last 2-3 weeks, so now build and test wait times are MUCH better. The changes made were:

1) Turn off obsolete or broken jobs.
2) Re-image some linux32/linux64/win32 machines as extra Win2008 (64bit) build machines.
3) Enable pymake in production.
4) Turn on more tegras.
5) Re-image 40 OSX10.5 test machines as 20 WinXP and 20 Win7 test machines.

After all the recent meetings and newsgroup posts, I thought it would be useful to follow up with a quick summary of the last few weeks of behind-the-scenes work on how RelEng infrastructure is handling the increased checkin load, as well as the increased number of builds/tests per checkin. Oh, and it's worth noting for the record that all these changes were done without needing a downtime. 🙂

1) Turn off obsolete or broken jobs.
This is not glamorous work, but there were lots of these and it had a massive impact in terms of reducing load in some areas, which allowed us to reshuffle machines to other areas. Specifically, we’ve disabled:

  • perma-orange/red test suites
  • android-xul tests, no longer needed
  • standalone talos v8 test suite, now included in the dromaeojs suite
  • B2G-on-gingerbread builds. These were intentionally broken as part of setting up B2G-on-icecreamsandwich builds, but were left running hidden on tbpl.m.o. This was bad because a) the broken B2G-on-gingerbread builds continued to waste CPU cycles, and b) they caused B2G IceCreamSandwich nightly builds to not run, because automation thought there were no good/green changesets for B2G. Disabling the broken B2G-on-gingerbread builds fixed both of these issues (bug#780915).
  • osx10.5 tests on mozilla-central, related project branches and try, because FF17 no longer supports osx10.5. However, we still need some osx10.5 machines in order to run osx10.5 tests for mozilla-aurora, mozilla-beta, mozilla-release and ESR.
  • There is still an open question about linux32/64 builds/tests. Could we reduce our linux test capacity on one architecture and re-use those machines as test machines for other operating systems? From the thread, it seems the preference would be to turn down/off 32-bit builds/tests, but if you are interested, please respond in the dev.planning thread.
  • Of course, if you know of other builds or tests that aren't used, or are perma-red/orange, let RelEng or a Sheriff know and we can disable them until they can be fixed! Tests still to be disabled are tracked in bug#784681.
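To illustrate the nightly-scheduling point in the B2G bullet above, here's a hypothetical sketch (not RelEng's actual automation code, and the builder names are made up): if the nightly scheduler wants the newest changeset whose builds are all green, then a hidden, perma-broken build starves it of candidates, and disabling that build restores them.

```python
# Hypothetical sketch of why a hidden perma-red build can block nightlies:
# the scheduler wants the newest changeset with all-green builds.

def pick_nightly_changeset(changesets):
    """Return the id of the newest changeset whose builds are all green, else None."""
    for cs in reversed(changesets):  # list is ordered oldest -> newest
        if all(result == "green" for result in cs["builds"].values()):
            return cs["id"]
    return None

history = [
    {"id": "rev1", "builds": {"b2g-ics": "green", "b2g-gingerbread": "red"}},
    {"id": "rev2", "builds": {"b2g-ics": "green", "b2g-gingerbread": "red"}},
]

# While the broken (but still-running) gingerbread builds are in the mix,
# no changeset ever qualifies, so no nightly gets scheduled:
assert pick_nightly_changeset(history) is None

# Disabling the broken builds removes them from consideration entirely:
for cs in history:
    del cs["builds"]["b2g-gingerbread"]
assert pick_nightly_changeset(history) == "rev2"
```

The same shape applies to any "pick the last known-good revision" automation: a job that can never go green is worse than no job at all.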

2) Re-image some linux32/linux64/win32 machines as extra Win2008 (64bit) build machines.

  • linux32/linux64 builds continue to migrate to AWS, which frees up linux32/linux64 build machines for reimaging as win2008 (64bit) build machines. This requires changing the desktop toolchain, so we have to be careful about not breaking binary compatibility, and needed to roll this change out on the trains. Details in bug#772446. As of today, we've increased from ~40 to 102 win2008 (64bit) build machines in production, with even more coming online soon. Details in bug#780022 and bug#784891.
  • This is important because the Win2008 (64bit) machines are used to do both win32 *and* win64 builds.
  • win32 l10n nightlies have code dependencies that require win32. Fixing this to run on win2008 (64bit) will free up even more win32 machines to convert to win2008 (64bit) builders. More details in bug#784848.

3) Enable pymake in production (bug#593585).
This reduced build times on windows significantly. Combined with the extra windows machines, this is all good. Coop blogged more details here, and if you like eyecandy, he even has a cool graph showing the reduced duration of builds!

4) Turn on more tegras
86 new tegras have now been added to our test pool, bringing our current total up to 284 tegras (bug#767456).

5) Re-image 40 OSX10.5 test machines as 20 WinXP and 20 Win7 test machines.
This brings us up to 70 machines for WinXP and 70 machines for Win7x32, which helped improve WinXP and Win7 test wait times.

We've still got lots to do, of course. After all, the faster our infrastructure can process builds and tests, the more checkins developers will do, which means the more builds and tests we will need to handle… but things are definitely better, and our numbers in dev.tree-management for the last two weeks show that!

Firefox: now testing on OSX 10.8

Late last week, we started showing two extra rows of green unittest+talos results. Those appeared because our existing OSX 32bit/64bit builds, which we already test on OSX10.6 and OSX10.7, are now also being tested on OSX10.8!

We test these opt and debug builds on OSX10.8: incremental builds per checkin during the day, full clobber builds every night, and all builds available on ftp.m.o. All the usual goodness, just like we do for OSX10.6 and OSX10.7. For anyone interested in more details, check out kmoir's blogpost, or bug#731278. Congrats to kmoir on all the behind-the-scenes work needed to cat-herd this large, multiheaded project across the production line, only a few months after joining Mozilla.

This was our second big OSX10.8 project. In early July, bhearsum and edransch got OSX signing automation into production in preparation for the release of OSX10.8. For details of that work, the curious can read bug#730924 and bhearsum’s blog.

Infrastructure load for August 2012

  • #checkins-per-month: We had 5,803 checkins in August 2012, another new record, which breaks last month's record (5,635 checkins), which in turn broke the previous month's record (5,194)…
  • #checkins-per-day: We had consistently high load across the month, 17-of-31 days had over 200 checkins-per-day.
  • #checkins-per-hour: The peak this month was 12.3 checkins per hour, and throughout the month, we sustained over 11 checkins per hour for 5 out of 24 hours in a day.

mozilla-inbound, fx-team:
Ratios this month are within one percentage point of last month's. Again, mozilla-inbound continues to be heavily used as an integration branch, with 25.7% of all checkins, far more than the other integration branches fx-team (1.4% of checkins) or mozilla-central (2.9% of checkins). For comparison, I note that more people landed on mozilla-aurora than on mozilla-central.

mozilla-aurora, mozilla-beta:

  • 3.1% of our total monthly checkins landed into mozilla-aurora.
  • 1.9% of our total monthly checkins landed into mozilla-beta.

(Standard disclaimer: I'm always glad whenever we catch a problem *before* we ship a release; it avoids a chemspill release, and we ship better code to our Firefox users in the first place.)

misc other details:

  • Pushes per day
    • You can clearly see weekends through the month.

  • Pushes by hour of day
    • It is worth noting that for 5 hours in every 24-hour day, we did over 11 checkins per hour. For two of those hours, we did over 12 checkins per hour. Phrased another way, that's roughly one checkin every 5 minutes for 5 hours.
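The checkins-per-hour figures above convert to average gaps between checkins with simple arithmetic; this small sketch (just illustrative, not anything from our reporting scripts) makes the conversion explicit:

```python
# Arithmetic behind the rates quoted above: at a steady rate of N
# checkins per hour, the average gap between checkins is 60/N minutes.

def avg_minutes_between_checkins(checkins_per_hour):
    """Average gap, in minutes, between checkins at a steady hourly rate."""
    return 60.0 / checkins_per_hour

# Peak hour this month: 12.3 checkins/hour, i.e. one roughly every 4.9 minutes.
peak_gap = avg_minutes_between_checkins(12.3)

# Sustained load: over 11 checkins/hour, i.e. one roughly every 5.5 minutes
# (which is why "one checkin every 5 minutes" is a fair rounding).
sustained_gap = avg_minutes_between_checkins(11)
```

Keep in mind these are averages: real checkins arrive in bursts, so the infrastructure has to absorb much tighter clusters than the steady-state gap suggests.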