Making life better for localizers

We now produce l10n nightly updates. It's already been announced elsewhere [1] [2], but I thought people might be interested in some context and additional details.

At night, we make “nightly builds” – complete builds of source code, containing all changes made during the day. Developers and testers use that nightly build to see if a new feature/bugfix works. Which is great – and also part of the problem. Every night we make a new nightly build. If someone wants to keep using the latest code, they have to keep manually downloading and installing a new Firefox build every day. This quickly gets annoying, and after a while, only the most dedicated will continue doing this manually. To make it easier for people to stay on the latest nightly, we generate nightly updates. This means you install a Firefox nightly build once, then every morning you will get updated to the newest nightly build.

This works great, and is an important part of the development process at Mozilla.
We make nightly builds for en-US, and for each of the 75+ locales, on each OS. However, we only made nightly updates for en-US; we never made nightly updates for any localized builds.
This means that people working on en-US nightly builds got updated each morning, but localizers who wanted to stay on the latest nightly build had to keep manually installing a new build every morning. If we generate nightly updates for en-US, why wasn't this already done for l10n? …and if we've never done it reliably before, how hard could it be to start doing this?

Well…

Turns out there was a *lot* of systems refactoring needed to make this possible. Those of you who were at the Mozilla Summit in Whistler might remember this presentation summarizing what was known about the project back then. Refactoring existing l10n code and integrating with the rest of release infrastructure. Migrating/refactoring l10n nightly repack code from various dedicated l10n systems into the production pool-of-slaves. Reconciling toolchain differences. Solving edge cases of l10n nightlies missing newly added strings. Learning how nightly updates are generated and how that is different from the way release updates are generated! The list goes on and on and on… For gory details, have a look at the interlinked bugs starting with this one. (There might be other disconnected bugs – it's been quite a project!)

To be fair, we’re still wrapping up some loose ends. For example, creating updates for these l10n nightlies takes time. We’ve gone from generating 3 nightly updates to generating 225 nightly updates (3 OSes x 75 locales). Per branch. Per night. As of last week, there are l10n nightly updates available on mozilla-central and mozilla-1.9.2, and as you can imagine, generating these 450 nightly updates serially takes time. Anyone used to getting a new en-US nightly update first thing in the morning is now seeing a delay of a few hours. This is temporary, so please bear with us. Once we finish the transition from a dedicated nightly update machine to concurrent jobs on the production pool-of-slaves, this delay should be gone, and we should also be able to start producing l10n nightly updates on other branches as requested.
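The jump in update-generation workload is simple to tally; here's a quick sketch, using only the figures quoted above (3 OSes, 75 locales, 2 branches):

```python
# Sketch of the nightly-update workload growth described above.
# All figures (3 OSes, 75 locales, 2 branches) come from the post itself.
oses = 3
locales = 75
branches = 2  # mozilla-central and mozilla-1.9.2

updates_per_branch = oses * locales
total_updates = updates_per_branch * branches

print(updates_per_branch)  # 225
print(total_updates)       # 450
```

Generating those 450 updates one after another is what produces the multi-hour delay; running them as concurrent jobs divides that wall-clock time by the number of slaves available.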

(One remaining question is: which branches are being used by localizers and testers? Some prefer to work on mozilla-1.9.1, some on mozilla-1.9.2, and some on mozilla-central. If you are working on localization with Mozilla, which branch would be most helpful for you?)

De-tangling all this was a huge project, but crucial to Mozilla’s global localization efforts.

In terms of man-hours, this is the 2nd biggest project the group has taken on in the last few years. It is now the end of August 2009; Armen has been working on this since May 2008, and Coop since Nov 2008.

This is all super cool stuff; well-deserved hugs, kudos and beer deliveries should be sent to Armen, Coop, Axel and Seth.

take care

John.

Major update to Firefox 3.5 (after 59 days)

Two weeks ago, we prompted users with a Major Update offer to upgrade from FF3.0.x->FF3.5.x. Now that it's been out for two weeks, I took a quick look at how many users upgraded, and how that compares with the previous major update release.
FF3.0.x -> FF3.5 major update:

  • 45 days from release to 1st prompted MU
  • measure after 14 days of prompted MU
  • 24.7% of users on latest branch before prompted MU
  • (~65% of these users upgraded by doing pave-over download-and-install; ~35% upgraded by manually doing CheckForUpdates)
  • 37.3% of users on latest branch two weeks after prompted MU

FF2.0->FF3.0.x major update:

  • 70 days from release to 1st prompted MU
  • measure after 14 days of prompted MU
  • 35.4% of users on latest branch before prompted MU
  • (100% of these users upgraded by doing pave-over install)
  • 61.4% of users on latest branch two weeks after prompted MU
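One way to compare the two scenarios at a glance is the percentage-point gain over the two-week prompted-MU window. A quick sketch, using only the figures quoted above:

```python
# Percentage-point gain in users on the latest branch over the
# two-week prompted-MU window, using the figures quoted above.
def gain(before_pct, after_pct):
    return round(after_pct - before_pct, 1)

ff35 = gain(24.7, 37.3)  # FF3.0.x -> FF3.5 major update
ff30 = gain(35.4, 61.4)  # FF2.0   -> FF3.0.x major update

print(ff35)  # 12.6
print(ff30)  # 26.0
```

By this measure the earlier FF2.0->FF3.0.x offer moved roughly twice as many percentage points of users in the same two-week window, which matches the "slower rate" observation below.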

We’re still working out the math here, so bear with us if these numbers get tweaked; it's not as easy to figure out as you might hope. However, if these numbers are accurate, it looks like:

  • we made the major update offer sooner after the release
  • people are major-upgrading at a slower, but consistent, rate
  • people have been using CheckForUpdates at about the same rate through each of the dot-releases, not just at the initial release. This confirms the value of doing this, so we will continue to always have unprompted Major updates available for people who want to do manual CheckForUpdates.

It's worth noting that, while these two scenarios are the closest I could find for comparison, there are lots of differences between them:

  • The FF3.0 release had more outreach and publicity, e.g. download day, compared to the FF3.5 release.
  • FF2.0->FF3.0 had more visible improvements than FF3.0->FF3.5, hence more incentive to upgrade (or perversely, more resistance to upgrading?).
  • After FF2.0, it took 18 months to ship FF3.0. After FF3.0, it took 12 months to ship FF3.5. As we continue to speed up the release cycle, is this a factor?
  • The number 3.5 sounds like a smaller upgrade. Would more people have upgraded if the exact same code had been called 4.0? Would fewer people have upgraded if it had been called 3.1?
  • Anything else people think might be a factor?

Thoughts on the recent colo outage

On the afternoon of Sunday 09Aug2009, our colo overheated and shut down. The gory details are here, but basically when the air conditioners failed, the room quickly overheated to unsafe levels, and machines took themselves offline before they could be physically damaged. All our build/unittest/talos infrastructure, along with large portions of the rest of Mozilla's infrastructure, came to an abrupt halt.

Matthew (mrz) phoned me soon after the colo went offline, just to give me a heads up, so I was able to forewarn others in the group. The rough timeline was:

  • 13:30 PDT Sunday afternoon: colo offline
  • 21:30 PDT Sunday evening: Mozilla back online
  • 01:00 PDT Monday morning: RelEng declares build infrastructure back online

While it's bad for a colo provider to have failures like this, it was impressive to watch how the RelEng and IT groups pitched in together to get things going again so quickly – reviving ~420 RelEng machines in under 12 hours was quite a feat.

Mozilla now has 9 active branches!

The mozilla-1.9.2 branch (aka the Firefox 3.6 branch) went live last week, quietly and with no fuss. My last post is now out of date. The newly increased list of active branches is:

  • Firefox 3.0.x (aka cvs-trunk, mozilla-1.9.0)
  • Firefox 3.5.x (aka mozilla-1.9.1)
  • Firefox 3.6.x (aka mozilla-1.9.2)
  • Firefox 3.next (aka mozilla-central)
  • mobile-browser
  • TraceMonkey
  • Places
  • Electrolysis
  • Thunderbird 2.0.0.x (aka mozilla-1.8.1)

What struck me the most about setting up this mozilla-1.9.2 branch was how smoothly it went. Even with the other distractions and complications going on at the time, setting up this new mozilla-1.9.2 branch felt to me like the smoothest new branch setup I’d seen so far. An encouraging metric for how our infrastructure is scaling up.

Nice, very nice.

Updated: Aki pointed out that I’d forgotten to include the mobile-browser codeline. 🙁 Now added. joduinn 24aug2009

Infrastructure load for July 2009

Summary:

  • There were 1,295 code changes to our mercurial-based repos. Over the month, these triggered:
    • 13,428 build/unittest jobs, or ~18 jobs per hour.
    • 7,039 talos jobs, or ~9.5 talos jobs per hour.
  • The places and electrolysis project branches were added during July and are now being tracked.
  • The mobile-browser branch is now being tracked.
  • The mozilla-1.9.2 branch was postponed until early August, so it should appear in next month’s report.
  • We are still not tracking any l10n repacks, nightly builds, release builds or any “idle-timer” builds.
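The per-hour rates in the summary follow directly from the monthly totals; a quick sanity check, assuming a full 31-day July:

```python
# Sanity-check the per-hour job rates quoted above (July = 31 days).
hours_in_july = 31 * 24  # 744 hours

build_unittest_jobs = 13428
talos_jobs = 7039

print(round(build_unittest_jobs / hours_in_july, 1))  # 18.0
print(round(talos_jobs / hours_in_july, 1))           # 9.5
```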

Details:

Here’s how the math works out:

Now that talos is pooled, just like the build and unittest systems, it is easier to calculate the build/unittest/talos jobs triggered by each individual push, as follows:

  • mozilla-central: 12 jobs per push (L/M/W opt, L/M/W leaktest, L/M/W unittest, linux64 opt, linux-arm, WinMo) and 6 talos jobs
  • mozilla-1.9.1: 12 jobs per push (L/M/W opt, L/M/W leaktest, L/M/W unittest, linux64 opt, linux-arm, WinMo) and 5 talos jobs
  • electrolysis: 12 jobs per push (L/M/W opt, L/M/W leaktest, L/M/W unittest, linux64 opt, linux-arm, WinMo) and no talos.
  • mobile-browser: 5 jobs per push (WinMO m-c, linux-arm m-c, Fennec linux desktop, linux-arm tracemonkey, WinMo electrolysis) and 2 talos jobs.
  • places: 12 jobs per push (L/M/W opt, L/M/W leaktest, L/M/W unittest, linux64 opt, linux-arm) and 6 talos jobs.
  • tracemonkey: 10 jobs per push (L/M/W opt, L/M/W leaktest, L/M/W unittest, linux-arm) and 6 talos jobs.
  • try: 8 jobs per push (L/M/W opt, L/M/W unittest, linux-arm, WinMo) and 6 talos jobs.
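The per-push breakdown above can be encoded as a small lookup table, which makes it easy to estimate total infrastructure load for any mix of pushes. The per-branch figures are straight from the list above; the push counts in the example at the end are made up for illustration:

```python
# (build/unittest jobs, talos jobs) triggered per push, per branch,
# as listed in the post above.
jobs_per_push = {
    "mozilla-central": (12, 6),
    "mozilla-1.9.1":   (12, 5),
    "electrolysis":    (12, 0),
    "mobile-browser":  (5, 2),
    "places":          (12, 6),
    "tracemonkey":     (10, 6),
    "try":             (8, 6),
}

def total_jobs(pushes):
    """Total (build/unittest, talos) jobs for a dict of branch -> push count."""
    build = sum(jobs_per_push[b][0] * n for b, n in pushes.items())
    talos = sum(jobs_per_push[b][1] * n for b, n in pushes.items())
    return build, talos

# Hypothetical month: 100 pushes to mozilla-central, 50 pushes to try.
print(total_jobs({"mozilla-central": 100, "try": 50}))  # (1600, 900)
```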

UPDATE: fixed math typo. joduinn 13aug2009

We now have 7 active code lines!?!

In the last couple of weeks, Lukas has set up 2 new project branches (details here and here), and filled in some missing mobile builds in TraceMonkey. Our list of active code lines is now:

  • Firefox 3.0.x (aka cvs-trunk, mozilla-1.9.0)
  • Firefox 3.5.x (aka mozilla-1.9.1)
  • Firefox 3.next (aka mozilla-central)
  • TraceMonkey
  • Places
  • Electrolysis
  • Thunderbird 2.0.0.x (aka mozilla-1.8.1)

…with another two code-lines coming soon. Almost all of these are doing the full set of builds (opt/debug/test), unittests and talos, across 6 different OSes (Linux, linux-arm, Mac, Win32, WinMo, with WinCE on the way)… all per checkin.
These new project branches are important because:

  • they confirm that our infrastructure is designed to scale per-developer-checkins, not per-active-code-line.
  • moving bigger (scarier?) work to these project branches helps the Mozilla project do more frequent releases, on a more predictable schedule, while still innovating on significant new features, each in its own more predictable environment. In theory, it's the best of both worlds!!
  • it's nice to see that we're getting faster at spinning up each of these project branches. Our checklist really helps developers and RelEng figure out all the gotchas before we start. Even though it looks like a really long list of questions, they are all things that have tripped us up during setup of previous project branches. Figuring out this checklist before we start lets us avoid having to stop and redo work mid-way through the setup, which is less annoying and time-wasting for everyone!