Mozilla now has 9 active branches!

The mozilla-1.9.2 branch (aka the Firefox 3.6 branch) went live last week, quietly and with no fuss.  My last post is now out of date. The newly increased list of active branches is:

  • Firefox 3.0.x (aka cvs-trunk, mozilla-1.9.0)
  • Firefox 3.5.x (aka mozilla-1.9.1)
  • Firefox 3.6.x (aka mozilla-1.9.2)
  • Firefox 3.next (aka mozilla-central)
  • mobile-browser
  • TraceMonkey
  • Places
  • Electrolysis
  • Thunderbird 2.0.0.x (aka mozilla-1.8.1)

What struck me the most about setting up this mozilla-1.9.2 branch was how smoothly it went. Even with the other distractions and complications going on at the time, setting up this new mozilla-1.9.2 branch felt to me like the smoothest new branch setup I’d seen so far. An encouraging metric for how our infrastructure is scaling up.

Nice, very nice.

Updated: Aki pointed out that I’d forgotten to include the mobile-browser codeline. 🙁 Now added. joduinn 24aug2009

Infrastructure load for July 2009

Summary:

  • There were 1,295 code changes to our mercurial-based repos. Over the month, these triggered:
    • 13,428 build/unittest jobs, or ~18 jobs per hour.
    • 7,039 talos jobs, or ~9.5 talos jobs per hour.
  • The places and electrolysis project branches were added during July and are now being tracked.
  • The mobile-browser branch is now being tracked.
  • mozilla-1.9.2 branch was postponed until early August, so should be in next month’s report.
  • We are still not tracking any l10n repacks, nightly builds, release builds or any “idle-timer” builds.
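For anyone who wants to check the per-hour figures, here is a minimal sketch of the arithmetic. It assumes the rate is simply the monthly total divided by the number of hours in the calendar month; the function name `jobs_per_hour` is made up for illustration.

```python
import calendar


def jobs_per_hour(total_jobs: int, year: int, month: int) -> float:
    """Average jobs per hour across a whole calendar month."""
    days_in_month = calendar.monthrange(year, month)[1]
    return total_jobs / (days_in_month * 24)


# Monthly totals from the July 2009 summary above.
build_rate = jobs_per_hour(13_428, 2009, 7)
talos_rate = jobs_per_hour(7_039, 2009, 7)

print(f"build/unittest: {build_rate:.1f} jobs per hour")
print(f"talos: {talos_rate:.1f} jobs per hour")
```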

Details:

Here’s how the math works out:

Now that talos is pooled, just like the build and unittest systems, it is easier to calculate the build/unittest/talos jobs triggered by each individual push, as follows:

  • mozilla-central: 12 jobs per push (L/M/W opt, L/M/W leaktest, L/M/W unittest, linux64 opt, linux-arm, WinMo) and 6 talos jobs
  • mozilla-1.9.1: 12 jobs per push (L/M/W opt, L/M/W leaktest, L/M/W unittest, linux64 opt, linux-arm, WinMo) and 5 talos jobs
  • electrolysis: 12 jobs per push (L/M/W opt, L/M/W leaktest, L/M/W unittest, linux64 opt, linux-arm, WinMo) and no talos.
  • mobile-browser: 5 jobs per push (WinMo m-c, linux-arm m-c, Fennec linux desktop, linux-arm tracemonkey, WinMo electrolysis) and 2 talos jobs.
  • places: 12 jobs per push (L/M/W opt, L/M/W leaktest, L/M/W unittest, linux64 opt, linux-arm) and 6 talos jobs.
  • tracemonkey: 10 jobs per push (L/M/W opt, L/M/W leaktest, L/M/W unittest, linux-arm) and 6 talos jobs.
  • try: 8 jobs per push (L/M/W opt, L/M/W unittest, linux-arm, WinMo) and 6 talos jobs.

UPDATE: fixed math typo. joduinn 13aug2009

We now have 7 active code lines!?!

In the last couple of weeks, Lukas has set up 2 new project branches (details here and here), and filled in some missing mobile builds in TraceMonkey. Our list of active code-lines is now:

  • Firefox 3.0.x (aka cvs-trunk, mozilla-1.9.0)
  • Firefox 3.5.x (aka mozilla-1.9.1)
  • Firefox 3.next (aka mozilla-central)
  • TraceMonkey
  • Places
  • Electrolysis
  • Thunderbird 2.0.0.x (aka mozilla-1.8.1)

…with two more code-lines coming soon. Almost all of these are doing the full set of builds (opt/debug/test), unittests and talos, across 6 different o.s. (Linux, linux-arm, Mac, Win32, WinMo, with WinCE on the way)… all per checkin.

These new project branches are important because:

  • they confirm that our infrastructure is designed to scale per-developer-checkins, not per-active-code-line.
  • moving bigger (scarier?) work to these project branches helps the Mozilla project do more frequent releases, on a more predictable schedule, while still innovating significant new features each in their own more predictable environment. In theory, it’s the best of both worlds!!
  • it’s nice to see that we’re getting faster at spinning up each of these project branches. Our checklist really helps Developers and RelEng figure out all the gotchas before we start. Even though it looks like a really long list of questions, they are all things that have tripped us up during setup of previous project branches. Figuring out this checklist before we start lets us avoid having to stop/redo work mid-way through the setup, which is less annoying and time-wasting for everyone!

RelEng group gathering, July 2009

Last week was the first time that the entire RelEng group was together in the one place at the one time. We’ve grown a lot over the last couple of years, but the few times we met, we’ve always had some people who couldn’t make it. This time, we *all* were in one place at one time!

Armen managed to snag this photo before we scattered for various airports. In case you missed the introductions in last Monday’s Mozilla Foundation call, here they are again! From left to right:


John (jhford), Chris (catlee), Lukas (lsblakk), Aki (aki), Armen (armenzg), John (joduinn), Nick (nthomas), Ben (bhearsum), Alice (alice), Chris (coop).

Several large projects wrapped up at the end of Q2, and we were already well into Q3 work, so this was a perfect time for us to step back and figure out what next big projects to tackle in Q4, Q1. The group has grown rapidly. The scale of the infrastructure is growing fast (409 machines as of today, with more still being powered up). The complexity of the infrastructure is also growing, with a new mobile platform and 3 new project branches, all added just in July. And a bunch of new o.s. requests came in today.

All very exciting.

In addition to the scheduled topics, there were lots of impromptu discussions and whiteboard drawing going on. Oh, alright, yes, there was also late night food, drinks and rockband! 🙂

For other photos of the week, and the creative-commons licence for this photo, check out Armen’s flickr stream here.

Infrastructure load for June 2009

Summary:

  • We pushed 1,018 code changes to our mercurial-based repos. Over the month, this translates into:
    • 11,279 build/unittest jobs, or ~15.7 jobs per hour.
    • 5,837 talos jobs, or ~8 talos jobs per hour.
  • TryServer now does linux-arm and WinMo builds, which creates extra load per push.
  • The mozilla-1.9.1 branch has a few bursts of activity, but you can see attention moving to mozilla-central, as post-3.5 work starts to open up.
  • Aki just pointed out that we’re not yet tracking mobile-browser branch or the l10n repacks. Also, in July we are spinning up new places branch, electrolysis branch, mozilla-1.9.2 branch, so expect to see them in next month’s report.

Details:

As each of these pushes triggers multiple different types of builds/unittest jobs, the math is:

  • mozilla-central: 12 jobs per push (L/M/W opt, L/M/W leaktest, L/M/W unittest, linux64 opt, linux-arm, WinMo)
  • mozilla-1.9.1: 12 jobs per push (L/M/W opt, L/M/W leaktest, L/M/W unittest, linux64 opt, linux-arm, WinMo)
  • tracemonkey: 7 jobs per push (L/M/W opt, L/M/W unittest, linux64 opt)
  • try: 11 jobs per push (L/M/W opt, L/M/W leaktest, L/M/W unittest, linux-arm, WinMo)
  • theoretical total: (515 x 12) + (130 x 12) + (141 x 7) + (232 x 11) = 11,279 jobs per month = 15.7 jobs per hour.
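The theoretical total is just a weighted sum: pushes per branch multiplied by jobs per push. Here is a small sketch that reproduces it from the June numbers in the list above; the names `JUNE_BUILD_LOAD` and `theoretical_total` are made up for illustration.

```python
# branch -> (pushes in June 2009, build/unittest jobs per push),
# copied from the list above.
JUNE_BUILD_LOAD = {
    "mozilla-central": (515, 12),
    "mozilla-1.9.1": (130, 12),
    "tracemonkey": (141, 7),
    "try": (232, 11),
}


def theoretical_total(load: dict) -> int:
    """Sum of pushes x jobs-per-push over every branch."""
    return sum(pushes * jobs for pushes, jobs in load.values())


total_pushes = sum(pushes for pushes, _ in JUNE_BUILD_LOAD.values())
total_jobs = theoretical_total(JUNE_BUILD_LOAD)
hours_in_june = 30 * 24

print(total_pushes)                             # pushes across all four repos
print(total_jobs)                               # theoretical jobs for the month
print(round(total_jobs / hours_in_june, 1))     # average jobs per hour
```

The same function works for any other month: swap in that month's push counts and per-push job counts.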

Now that we have pooled all the talos slaves, we can count triggered talos jobs for every opt build as follows:

  • mozilla-central: 6 jobs per push (ubuntu, osx10.4, osx10.5, WinVista, WinXP, linux-arm)
  • mozilla-1.9.1: 5 jobs per push (ubuntu, osx10.4, osx10.5, WinVista, WinXP)
  • tracemonkey: 5 jobs per push (ubuntu, osx10.4, osx10.5, WinVista, WinXP)
  • try: 5 jobs per push (ubuntu, osx10.4, osx10.5, WinVista, WinXP)
  • theoretical total: (515 x 6) + (130 x 5) + (141 x 5) + (232 x 6) = 5837 jobs per month = 8 jobs per hour.

Infrastructure load during May

Sorry for the delay in getting this posted.

Summary:

  • We pushed 1,134 code changes to our mercurial-based repos. This translates into 12,345 build/unittest jobs, or ~16.6 jobs per hour, over the month.
  • We hit 108 pushes on May 19th. This is a new high-water-mark, by far the biggest load in any one single day since we started measuring. This was the rush of checkins to beat the mid-May code freeze date.
  • Mozilla-191 and TryServer now do linux-arm and WinMo builds, which creates extra load per push.
  • The mozilla-1.9.1 branch was more active than usual, as expected in the lead up to FF3.5.0 release.
  • The chart below shows no data for TryServer for the week of 1st – 7th May. This was because of our resetting of the repo, and is expected. The TryServer was up and being used, but this means TryServer numbers are too low.
  • We’re still not measuring load on Talos.

Details:

As each of these pushes triggers multiple different types of builds/unittest jobs, the *theoretical* total amount of work done by the pool-of-slaves in May was 12,345 jobs. For each push, we do:

  • mozilla-central: 12 jobs per push (L/M/W opt, L/M/W leaktest, L/M/W unittest, linux64 opt, linux-arm, WinMo)
  • mozilla-1.9.1: 12 jobs per push (L/M/W opt, L/M/W leaktest, L/M/W unittest, linux64 opt, linux-arm, WinMo)
  • tracemonkey: 7 jobs per push (L/M/W opt, L/M/W unittest, linux64 opt)
  • try: 9 jobs per push (L/M/W opt, L/M/W leaktest, L/M/W unittest)
  • theoretical total: (523 x 12) + (282 x 12) + (138 x 7) + (191 x 9) = 12,345 jobs per month = 16.6 jobs per hour.

UPDATE: TryServer support for linux-arm and WinMo was enabled in the last few days of the month, so I’ve excluded them from the math, and recalculated number-of-jobs.  John. 17July2009

Major update to Firefox 3.5 (the day after)

Looking at the people who moved to FF3.5.0 yesterday:

  • 65% downloaded the FF3.5.0 installer and installed from it
  • 35% manually did Help->CheckForUpdates

That is a large percentage of people doing CheckForUpdates.

Considering this was our first time having Major Update available on release day, and considering there was no user prompting of this new major update ability, I found these percentages quite delightfully stunning.

Major update to Firefox 3.5

We’re doing something quite new with updates as part of the FF3.5 release. Something that, as far as I know, has never been done before in Mozilla, and which is really really cool.

  1. On the day of the Firefox 3.5 release, existing Firefox3.0 users will be able to upgrade to FF3.5 simply by doing “Help->CheckForUpdate”.
  2. The release of FF3.5 starts a 6-month End-of-Life period for FF3.0.x. For those 6 months, we’ll have major update offers available all the time to those FF3.0.x users.

Sounds boring, and kinda simple. To understand just how massive this improvement is, we need to compare this with what happened for FF3.0 and FF2.

                                                  2.0.0.x   3.0.x   3.5.x
Days between release and initial MU offer:        248       65      0
Percentage of EOL period that MU was available:   0%        21%     100%

This means people should be able to migrate from FF3.0->FF3.5 faster than we have historically seen people migrate from FF1.5->FF2.0 or FF2.0->FF3.0. How much faster will people migrate? We don’t know yet, we’ve never done this before.

Obviously, we need to make a product that is compelling and which people *want* to migrate to. But at least now, with these infrastructure changes in place, any user who wants to upgrade will always be able to!

Let us know what you think over the coming months.

For the curious, here are the details on how we did it:

1) Background data on previous dates

Firefox1.5 -> Firefox2:
=======================
24Oct2006: FF2.0.0.0 released, start of FF1.5 End-of-Life.
30May2007: end of FF1.5 End-Of-Life.
In those 219 days, users were never able to major update. Our first major update available for FF1.5 users was 29June2007, a month *after* the formal End-Of-Life.

Or, put another way: Major updates were available for 0% of the End-of-Life period.

Firefox2.0 -> Firefox3.0:
=========================
17Jun2008: FF3.0.0 released, start of FF2 End-Of-Life.
17Dec2008: end of FF2 End-Of-Life.
In those 183 days, users could only major update on 39 of them. None of those 39 days were during the initial peak of public attention around release day.

Major updates were available for 21% of the End-of-Life period.

Firefox3.0 -> Firefox3.5:
=========================
30Jun2009: FF3.5.0 released, start of FF3.0 End-Of-Life.
31Dec2009: (approx) end of FF3.0 End-Of-Life.
In those 184 days, we expect major update to always be available, including during initial public attention around release day.

Major updates should be available 100% of the End-of-Life period.
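The day counts and percentages above can be double-checked with a few lines of date arithmetic. This is only a sketch using the dates quoted in this post; the variable names are made up for illustration, and the 39 days of availability for FF2 users is taken as given rather than derived.

```python
from datetime import date


def days_between(start: date, end: date) -> int:
    """Number of days from start to end (end date not counted)."""
    return (end - start).days


# Firefox1.5 -> Firefox2: release day to the first major update offer.
days_to_first_offer_ff15 = days_between(date(2006, 10, 24), date(2007, 6, 29))

# Firefox2.0 -> Firefox3.0: length of the EOL period, and the share of it
# (39 days, as stated above) during which a major update was on offer.
ff2_eol_days = days_between(date(2008, 6, 17), date(2008, 12, 17))
ff2_offer_percent = round(100 * 39 / ff2_eol_days)

# Firefox3.0 -> Firefox3.5: length of the (approximate) EOL period.
ff30_eol_days = days_between(date(2009, 6, 30), date(2009, 12, 31))

print(days_to_first_offer_ff15, ff2_eol_days, ff2_offer_percent, ff30_eol_days)
```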

2) WebDev made a small, but important, change to the update infrastructure. This change means that manual CheckForUpdates major update can now be throttled differently to “background-idle” major update.

This means we can issue, and re-issue, major updates as often as we like to users who manually CheckForUpdates… without having to worry that we are annoying “background-idle” users by re-prompting them again and again with a major update dialog box each time.

Users who passively wait for major updates will now only be shown a major update dialog box when Beltzner/Sam ask for the “background-idle” major update to be unthrottled. They can make that decision based on what they think is best for the product, the user experience, and their user-update-fatigue discussions.

Furthermore, as most of the RelEng and QA work was already done earlier, as part of the CheckForUpdates work, this means that Beltzner/Sam can make those “background-idle” decisions without worrying about causing much extra work for RelEng or QA.
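To make the idea concrete, here is a purely illustrative sketch of what throttling the two kinds of update check separately could look like. It is not the actual update server code: the class `MajorUpdateThrottle`, its fields, and the function `offers_major_update` are all hypothetical, and it assumes a throttle can be modelled as the fraction of requests that get the offer.

```python
from dataclasses import dataclass


@dataclass
class MajorUpdateThrottle:
    """Hypothetical settings: fraction (0.0 to 1.0) of requests offered the update."""
    manual_fraction: float = 1.0      # user clicked Help->CheckForUpdates
    background_fraction: float = 0.0  # "background-idle" check


def offers_major_update(throttle: MajorUpdateThrottle,
                        is_manual_check: bool,
                        roll: float) -> bool:
    """Decide whether one request sees the major update offer.

    roll is a number in [0, 1), for example random.random(); the offer is
    made when it falls below the fraction for that kind of check.
    """
    if is_manual_check:
        fraction = throttle.manual_fraction
    else:
        fraction = throttle.background_fraction
    return roll < fraction


# Release day in this sketch: manual checks always get the offer,
# background-idle checks never do.
release_day = MajorUpdateThrottle()
manual_sees_offer = offers_major_update(release_day, True, roll=0.42)
background_sees_offer = offers_major_update(release_day, False, roll=0.42)

# Later, once the background-idle offer is unthrottled.
unthrottled = MajorUpdateThrottle(background_fraction=1.0)
background_sees_offer_later = offers_major_update(unthrottled, False, roll=0.42)
```

The point of keeping two separate settings is that changing one never affects the other, which is what lets the manual offer be re-issued freely.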

(For details on race conditions where people don’t see the major update dialog box and on the “update fatigue” debate, see: here, here, here, here, and finally here.)

3) Nick Thomas led a bunch of significant cleanup in RelEng infrastructure, so we can now create major updates quite easily and reliably.

We used this improved infrastructure to create the FF2.0.x-> FF3.0.x major update offers.

We also used this to create “fake” FF3.0.x -> FF3.5beta/rc major update offers several times over the last 6 months in advance of the FF3.5 release. QA were able to test each of these, and file blockers in FF3.5 as needed. By the time it comes to release day, QA have already tested major update several times, including on the latest FF3.5rc3.

We will also be using this to create a new major update offer from FF3.0.x -> FF3.5.x every time we ship a new FF3.0.x security release.

There are a few different scenarios we had to handle for that (see the photo of whiteboard before we moved office, for red lines in scenarios A, B, C, D!) but they’re all covered.
This change is important because it fixes a problem described here where users could see a major update offer only until we shipped the next security release. The new security release blocked access to the pre-existing major update. Now, by re-issuing a new major update at the same time as the new security release, users will *always* be able to see a major update offer.

That’s it.

Hopefully all that made sense. I know it’s an obscure corner in the infrastructure, but I hope this post explains why all this is strategically important to Mozilla and to Firefox.

No more missing entities in the l10n nightlies

While most attention is focused on FF3.5, I wanted to echo what Coop said recently about a boring-sounding, yet really important, change to how we produce l10n nightly builds.

Every l10n nightly is now guaranteed to not be missing any entities.

What’s that mean? Why is that so important?

  1. All the nightly builds (whether they are en-US or any other locale) have the same actual code functionality. ok, so what?
  2. When running the en-US version of Firefox, the code displays the en-US version of a string. When running the es-ES version of Firefox, the code gets the es-ES version of the string. Yeah, ok, so what?

There is an interesting race condition problem here though.

Between the time when the new string is added in en-US and when a localizer gets to land the equivalent localized string in their locale, we are still producing nightly builds. This means that for some days/weeks, the generated l10n nightly build has problems when Firefox goes to load the localized string to display, and finds the string missing for that locale. The only symptom the nightly l10n user will see is an internal error message, blocked functionality, a refuse-to-start bustage, or a crash-with-stackdump.

This has been a problem with l10n nightlies since before I got here, and as far as I know, has always been a problem since the Mozilla project started. This problem also has some significant consequences, detailed below.

What have we changed?

When a new string is added to an en-US build, the en-US nightly has that new string (as you would expect). The l10n nightlies will now also have that new en-US string (as you might not expect) but only until the localized string is created by the localizer. (It’s also tied into Axel’s L10n dashboard so he can track those non-localized strings, and make sure we don’t accidentally ship with them!) Once a localized version of the string is added, the new l10n string will be used.
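In code terms, the fallback described above amounts to a dictionary merge. The sketch below is only an illustration of that idea, not the real l10n build tooling: the function `merge_entities` and the entity names are hypothetical.

```python
def merge_entities(en_us: dict, localized: dict) -> tuple:
    """Return (strings to ship, entities still waiting for a translation).

    Every en-US entity is present in the result; a localized value wins
    whenever the localizer has landed one.
    """
    merged = {**en_us, **localized}
    untranslated = sorted(set(en_us) - set(localized))
    return merged, untranslated


# Hypothetical entity names, for illustration only.
en_us = {"menu.file": "File", "menu.private": "Private Browsing"}
es_es = {"menu.file": "Archivo"}

strings, untranslated = merge_entities(en_us, es_es)
print(strings["menu.file"])     # localized value is used
print(strings["menu.private"])  # en-US value fills the gap
print(untranslated)             # what a dashboard would need to track
```

The `untranslated` list is the part that matters for release tracking: it is the set of strings still showing in en-US that should not ship that way.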

There are 4 important consequences of this change:

  1. Localizers can now safely download the latest nightly without having to first manually read through the checkin logs and l10n dashboard to figure out if an l10n nightly is safe to use, or whether that new l10n nightly will crash out with missing strings.
  2. Localizers can now see the exact location and usage of what they are translating, in the exact context. Much better than looking at a list of strings in a text file, and having to install en-US to see where the new string is being used. This is a really big deal when figuring out language subtleties.
  3. Fixing this was the last big pre-req before we can start producing nightly updates for l10n builds. Recall that >50% of users are on non-en-US Firefox. However, only en-US nightlies have nightly updates. Until now, localizers had to manually download a new nightly after they had manually figured out if it was safe to install. Now that we know each nightly has all entities, it is safe to offer automatic nightly updates for l10n, just like we do for en-US. And now that other infrastructure cleanup has been done, this is finally possible. Here’s even a French nightly update that Armen has running in staging right now. 🙂 The curious can follow along in bug#449828.
  4. This *might* help simplify how localizers get changes into place during the release cycle before a string freeze is announced for a release. Until now, some localizers preferred to wait until all new strings were in place and string freeze was declared before doing all translations as quickly as possible. Now, with each l10n nightly being as safe to use as the en-US nightly, localizers might start using l10n nightlies as their default browser. That means translated strings could be added when the localizer has time, the automatic update to the next nightly would show that translated string in use, and the localizer could adjust if needed for screen size, font bugs, etc.

All in all, it’s awesome stuff by Armen, Coop and Axel.

Taking Mobile Unittest and Talos offline

Quick note repeating what I mentioned in the Monday Foundation call, the Tuesday developer meeting and what Aki blogged about here.

We’re powering off all the mobile linux-arm Unittest and Talos machines tomorrow (Friday 5th June 2009) to box them up and move them to their new home. With any luck they’ll be back online late Friday, but it might take until Monday 8th, depending on a bunch of stuff beyond our control right now. They’ll be in a server room in the new building, and Aki can finally get some desk space! 🙂

Please be gentle with the mobile linux-arm builds while these devices are offline!