At Mozilla, ReleaseEngineering == Release Automation + Continuous Integration

Recently, I was asked to lead a discussion with a few VPs within Mozilla about the scope of Release Engineering at Mozilla. Each VP was well established in their career, technically seasoned, and smart, and each brought their own preconceived notions of what RelEng means, with their own terminology and perspectives shaped by their previous companies. To make things even more interesting, different organizations have different ideas and terminology for what they mean by “Release Engineering”, so getting everyone on the same page was going to be interesting… and important to get right if we were all to work well together.

This blogpost is a quick summary; if you’re curious, PDFs of the slides are here.

At Mozilla, Release Engineering covers two main topics:

1) Release Automation:
People who are not day-to-day developers typically think of this first. How efficient is the software delivery pipeline within a software organization? How long does it take from “go to build a release” to “users can start downloading updates”? The faster and more reliable this software delivery pipeline, the more competitive the company can be in the marketplace. This used to be where Mozilla’s RelEng, as a group, spent most of their time, sleeping in the office, getting bribes for releases, and all that drama. Now, thankfully, our automation is really great, so chemspills are super-quick (great for our users) and mostly hands-off (great for the humans in RelEng). There’s still lots to improve, and always some adjustments because of changing product requirements, but it’s already improved night-and-day since 2007. It continues to improve even since we wrote about it in a book!

2) Continuous Integration:
Day-to-day developers think of this, and deal with this, every single day. Anyone landing code changes at Mozilla keeps an eager eye on tbpl.m.o to see if their change is all green (good!), so they can close out their bug as FIXED and move on to the next bug. Making the Continuous Integration process more efficient has allowed Mozilla to hire more developers to do more checkins, transition developers from all-on-one-tip development to multi-project-branch development, and change the organization from traditional releases to a rapid-release model. This required RelEng to scale up significantly in less than 6 years, from a humble 86 machines to ~3,400 machines spread across 4 physical Mozilla colos as well as 3 Amazon AWS regions. Here’s a quick summary diagram of how all these machines are interconnected, which RelEng knows by heart, but which I couldn’t find posted anywhere, so I drew it as part of doing this presentation.
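For readers who have never seen the plumbing behind a pool like this: these build and test pools are driven by buildbot, and a buildbot master’s configuration is just Python. Below is a minimal, hypothetical sketch of a buildbot-0.8-style master.cfg wiring a couple of slaves (one in a physical colo, two in AWS) into a single per-checkin builder. The slave names, passwords, branch and commands are invented for illustration; this is not taken from our actual buildbot-configs, which generate their builders and slave lists programmatically across many masters, but the basic shape is the same.

```python
# master.cfg -- a minimal, hypothetical buildbot (0.8.x) master configuration.
# All names, passwords and commands below are made up for illustration.
from buildbot.buildslave import BuildSlave
from buildbot.changes.filter import ChangeFilter
from buildbot.config import BuilderConfig
from buildbot.process.factory import BuildFactory
from buildbot.schedulers.basic import SingleBranchScheduler
from buildbot.steps.shell import ShellCommand

c = BuildmasterConfig = {}

# One pool of slaves, spread across a physical colo and AWS regions.
c['slaves'] = [
    BuildSlave('bld-colo1-linux64-001', 'password'),  # physical colo
    BuildSlave('bld-use1-linux64-001', 'password'),   # AWS us-east-1
    BuildSlave('bld-usw2-linux64-001', 'password'),   # AWS us-west-2
]
c['slavePortnum'] = 9989
c['db_url'] = 'sqlite:///state.sqlite'

# The sequence of steps run for every checkin on this builder.
build_factory = BuildFactory()
build_factory.addStep(ShellCommand(command=['hg', 'pull', '-u'],
                                   description='update'))
build_factory.addStep(ShellCommand(command=['make', '-f', 'client.mk', 'build'],
                                   description='compile'))

c['builders'] = [
    BuilderConfig(name='linux64-build',
                  slavenames=['bld-colo1-linux64-001',
                              'bld-use1-linux64-001',
                              'bld-usw2-linux64-001'],
                  factory=build_factory),
]

# Scheduler: trigger the builder whenever something lands on the branch.
c['schedulers'] = [
    SingleBranchScheduler(name='on-checkin',
                          change_filter=ChangeFilter(branch='mozilla-central'),
                          treeStableTimer=None,
                          builderNames=['linux64-build']),
]

c['status'] = []
```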

This was a fun meeting. My favorite quotes from the lively back-and-forth were: “every software company lives-or-dies by the efficiency of its development process and its software delivery pipeline” …and… “everyone interacts with different parts of the elephant, so everyone has very different ideas of what they are looking at”.

Hopefully, others find this interesting too. Of course, if you have questions or comments, please post them below, or drop me an email.

Firefox 19.0.2 by the (wall-clock) numbers

(It’s been a while since my last “by the wall-clock numbers” post. After last week’s CanSecWest, I thought people might be interested in how much Mozilla’s pipeline for delivering code to users continues to improve. This was even noted by the pwn2own contest sponsors during the CanSecWest conference!)

Firefox 19.0.2 was released on Thursday 07-mar-2013, at 16:40 PST. From “go to build” to “release is now available to public” was 16h 26m wall-clock time, of which the Release Engineering portion was 11h 27m. The wall-clock times were:

00:21 07mar: ReleaseCoordinators say “go” for FF19.0.2
02:14 07mar: FF19.0.2 builds started
04:06 07mar: FF19.0.2 android signed multi-locale builds handed to QA
07:07 07mar: FF19.0.2 linux builds handed to QA
08:03 07mar: FF19.0.2 mac builds handed to QA
11:40 07mar: FF19.0.2 signed-win32 builds handed to QA
11:58 07mar: FF19.0.2 update snippets available on test update channel
13:05 07mar: ReleaseCoordinators say “go for release”; “ok to start mirror absorption”
13:11 07mar: ReleaseCoordinators say “go for push to Google Play store”
13:29 07mar: FF19.0.2 android pushed to Google Play store
13:55 07mar: mirror absorption started
14:00 07mar: mirror absorption good enough for testing
14:31 07mar: QA signoff on updates on test channel
15:26 07mar: ReleaseCoordinators say “go” to make updates snippets live.
15:40 07mar: update snippets available on live update channel
15:54 07mar: QA signoff on updates on live release channel
16:40 07mar: release announced; all done.
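If you like poking at these numbers yourself, here is a small throwaway Python sketch (not a script we actually use) that computes elapsed wall-clock time from the “go” for each milestone above. The timestamps in the list are rounded to the minute, so totals computed this way may not exactly match the figures quoted above.

```python
# Throwaway sketch: elapsed wall-clock time from "go" for each milestone.
# Timestamps are the minute-granularity values quoted in the post above.
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"
GO = "2013-03-07 00:21"  # ReleaseCoordinators say "go" for FF19.0.2

milestones = [
    ("2013-03-07 02:14", "builds started"),
    ("2013-03-07 04:06", "android signed multi-locale builds handed to QA"),
    ("2013-03-07 07:07", "linux builds handed to QA"),
    ("2013-03-07 08:03", "mac builds handed to QA"),
    ("2013-03-07 11:40", "signed-win32 builds handed to QA"),
    ("2013-03-07 11:58", "update snippets on test update channel"),
    ("2013-03-07 13:05", "go for release; ok to start mirror absorption"),
    ("2013-03-07 15:40", "update snippets on live update channel"),
    ("2013-03-07 16:40", "release announced; all done"),
]

go = datetime.strptime(GO, FMT)
for when, what in milestones:
    delta = datetime.strptime(when, FMT) - go
    hours, rem = divmod(int(delta.total_seconds()), 3600)
    print("%2dh %02dm  %s" % (hours, rem // 60, what))
```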

In addition to FF19.0.2, I note that we also had to ship FF17.0.4esr, FF20.0b4, Thunderbird 17.0.4, Thunderbird 17.0.4esr, Thunderbird 20.0b1 (build2) in the same super-fast way. Obviously, we don’t want to ship this number of products, this quickly, all the time, but it’s nice to know that we can if we have to. Really, really, nice. And yes, we quietly continue work to make this delivery pipeline even more efficient! 🙂

Notes:

1) I continue to measure the time between “dev says go” to “release is available”. Explicitly, I do *not* measure from “fix is reported” to “release is available”, because I don’t want to put any further time pressure on a developer who is already trying to fix a problem under pressure. It feels much better to me to work a little longer to get the fix right than to add even more time pressure chasing a quick fix, and then have to do another emergency release a few days later to fix the “quick fix”.

2) As usual, if you are curious about the details of the actual work done, you can follow along in tracking bug #848753 and the various linked bugs.

Thank you to everyone in OpSec, RelEng, QA, IT and ReleaseCoordinators who make this all possible. It was a really busy few hours, but great to see everyone calmly pile in, doing what they could to help out. The end result is something we can be proud to put in front of our users.

John.

Infrastructure load for February 2013

  • #checkins-per-month: We had 5,382 checkins in February 2013. This drop from last month surprised me. Maybe January was abnormally high because of the first-week-back-after-holidays rush, combined with the B2G workweek? Maybe February was abnormally low because it was a short month, combined with restrictions to checkins as we approached B2Gv1.0.0, B2Gv1.0.1 and Mobile World Congress? Next month’s numbers will help show the trend here, but meanwhile, if you have opinions, I’d be curious to hear them.



    As usual, our build pool handled the load well, with >95% of all builds consistently being started within 15mins.

    Our test pool situation continues to improve, but is not yet as great as the situation with our builders. We’re making good progress, but the rate of checkins, the improved capacity of the build machines to generate more builds that need testing, the ever-increasing number of test suites to run on each build, and the hardware-specific nature of some test suites all make this test capacity problem harder to solve. New hardware is still (slowly) coming. Meanwhile, RelEng, ATeam and devs continue the work of finding test suites which should (in theory!) be able to run on AWS, then fixing them to make them run green. Once a test suite runs green on AWS, RelEng stops scheduling that test suite on physical machines. This means double goodness: the AWS-based test suites have great wait times on AWS, and the remaining physical-hardware-based test suites have slightly improved wait times because fewer jobs are being scheduled on our scarce hardware.

    Of course, some tests *need* hardware, so we’re continuing work to buy and power up more test machines to increase test capacity anyway; please continue to bear with us while this happens. Oh, and of course, if you know of any test suites that no longer need to be run per-checkin, please let us know so we can immediately reduce the load a little. Every little bit helps put scarce test CPU to better use.

  • #checkins-per-day: During February, 18-of-28 days had over 200 checkins-per-day, and 8-of-28 days had over 250 checkins-per-day (the high-water-mark for the month was 20feb with 270 checkins).
  • #checkins-per-hour: Checkins are still mostly mid-day PT/afternoon ET, but the load has increased across the day. For 7 of every 24 hours (almost 30% of every day), we sustained over 10 checkins per hour. Heaviest load times this month were 10-11am PT (14 checkins-per-hour – a new record, exceeding our previous record of 13.36 checkins-per-hour set in November 2012!). If you want to tally per-hour numbers like these yourself, see the sketch just after this list.
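For anyone who wants to play with this kind of data, here is a rough Python sketch that tallies pushes by hour-of-day (PT) for a single branch using the public json-pushes feed on hg.mozilla.org. Treat the URL parameters and JSON shape as assumptions about that interface, and note that the monthly numbers above are tallied across all branches, so this is purely an illustration, not the script that produced them.

```python
# Rough sketch (Python 2, matching the era of the buildbot pools above):
# bucket pushes to one branch by hour-of-day PT, via json-pushes.
# URL parameters and JSON shape are assumptions about that interface.
import json
import urllib2
from collections import Counter
from datetime import datetime, timedelta

URL = ("https://hg.mozilla.org/integration/mozilla-inbound/json-pushes"
       "?startdate=2013-02-01&enddate=2013-03-01")
PT_OFFSET = timedelta(hours=-8)  # PST during February 2013

pushes = json.load(urllib2.urlopen(URL))  # dict: pushid -> push info

by_hour = Counter()
for push in pushes.values():
    when = datetime.utcfromtimestamp(push["date"]) + PT_OFFSET
    by_hour[when.hour] += 1

for hour in range(24):
    print("%02d:00 PT  %4d pushes" % (hour, by_hour[hour]))
```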

mozilla-inbound, mozilla-central, fx-team:
Ratios of checkins across these branches remain fairly consistent. mozilla-inbound continues to be heavily used as an integration branch, with 28.3% of all checkins, consistently far more than the other integration branches combined. As usual, fx-team has ~1% of checkins, and mozilla-central has 2.2% of checkins. The lure of sheriff assistance on mozilla-inbound continues to be consistently popular, and as usual, very few people land directly on mozilla-central these days.

mozilla-aurora, mozilla-beta, mozilla-b2g18, gaia-central:
Of our total monthly checkins:

  • 2.5% landed into mozilla-aurora. This is slightly lower than normal
    aurora levels, and expected since b2g changes are no longer being landed
    into aurora and beta.

  • 1.3% landed into mozilla-beta. This is slightly lower than normal
    beta levels, and expected since b2g changes are no longer being landed
    into aurora and beta.

  • 1.8% landed into mozilla-b2g18. These checkins are *only* for the
    B2G releases, so worth calling out here.

  • 3.1% landed into gaia-central, making gaia-central the third
    busiest branch overall, after try and mozilla-inbound. Obviously, these
    checkins are *only* for the B2G releases, so worth calling out here.

misc other details:

  • Pushes per day
    • You can clearly see weekends through the month.

  • Pushes by hour of day
    • Mid-morning PT is consistently the biggest spike of checkins, although this month the checkin load stayed high throughout the entire PT working day, and particularly spiked between 10-11am PT.