5 releases at the same time

On Friday, one of the predictions in our “RelEng 2010 story” came true. We’re now doing 5 releases at the same time.

  • FF3.7a3
  • FF3.6.2
  • FF3.5.9
  • FF3.0.19
  • TB2.0.0.24 (work mostly done, but still on our plates until we ship it)

(Fennec 1.0.1 is likely to start this week also, but it wasn’t counted, because we haven’t been given a “go” yet!)

This is a major milestone for us. A couple of years ago, every time RelEng had to work on *one* release, it was a big deal; the idea of doing 5 releases simultaneously was simply not an option.

Don’t get me wrong; doing 5 releases simultaneously will not be trivial. There are bound to be gotchas and surprises. However, the mere fact that we can now do this at all is really wonderful to see. Being able to do other work at the same time, well… it speaks volumes about how the group has grown, how the infrastructure has scaled, and how all the behind-the-scenes improvements have helped streamline the release process here at Mozilla.

One way or another, this week will be exciting. Wish us luck!

The RelEng story for 2010

…and here is our 2010 story. For perspective, click to see our 2009 story and our 2008 story.

Let us know what you think! Does this feel like the right focus? Does this address what feels most important to you?

tc
John.
=====
This story should help us make sense of our goals across the four quarters of 2010. As usual, we focused on 4 areas during the year:

1) Continued streamlining of release infrastructure:
We now ship major releases every six months, all with partner builds, major updates, etc. To keep up with this faster major release cycle, we continue to rely on our streamlined automation. For security releases, we now routinely provide 4-way, 5-way and occasionally 6-way “simultaneous-ship” releases.

This fast-paced major release cycle was possible because of our continuing behind-the-scenes work on release automation, full-featured project branches, and our scalable machine infrastructure spread across multiple locations. In numbers, our growth path was:

  • May2007: 3 OS on 2 code lines on 89 machines
  • Dec2008: 7 OS on 5 code lines on 253 machines
  • Dec2009: 12 OS on 11 code lines on 550 machines
  • Dec2010: 19 OS on 20 code lines on 1000 machines

2) Continued to refine and simplify:
(…or “reducing our drag coefficient” so we can move faster.)
During 2010, we finally turned off the legacy Thunderbird 2 systems, after supporting those users on MoMo’s behalf since early 2008. We also turned off the Firefox 3.0 systems. These were the last of our cvs-based releases. Dropping both of them, combined with some enhancements we upstreamed to buildbot, meant we could continue to make improvements and reduce complexity in our automation. While the frantic nature of this work has eased a bit since 2009, there’s still plenty of room for improvement that pays us back every time we do a release.

3) Outreach:
In addition to doing builds/tests/perf and releases, there are other ways we can use our infrastructure to help Mozilla. In 2009, weekly code coverage runs were the first jobs on our machines outside of the traditional Firefox build/test/perf work. In 2010, we extended that further by running fuzzer jobs and other code hygiene tools during idle times. We also helped Labs, and some other xulrunner partner projects, quickly scale and support their users by running their jobs on our infrastructure, saving them from reinventing the wheel.

4) Continued to improve Quality of Life:
The larger team has settled in together and continues to work together well under stress. Our shared skills keep our bus factor good, and our quality of life healthy. We all did good work we were proud of, learned new things at conferences, taught each other new things, took vacations and improved our lives.

The RelEng story for 2009

John Lilly has been encouraging us to use the idea of “a retrospective story looking back on a year” as a way to help frame which quarterly goals make sense for an upcoming year. It’s been useful so far, so we keep doing it.

Our 2008 story is here.

While our 2009 story circulated in emails and group meetings, I forgot to post it here until I noticed it missing just now. It was interesting to read in late December 2009, but re-reading it now, as I post it, reminds me of how far we’ve come since last year.

Next I’ll post the 2010 story.

take care
John.
=====
The fun, and the risk, of writing our story at the beginning of 2009 was wondering how those dreams & plans would look in cold clear hindsight. Not to mention things we never even considered but which changed our plans completely.

We’ve had amazing growth recently:

  • 4 people with 89 machines on 2 active code lines at end2007
  • 9 people with 253 machines on 5 active code lines at end2008
  • 11 people with 275 machines on 8 active code lines at end2009

For such a large (small?) group, this year we focused on 4 areas:

1) Strategically improved infrastructure:
FF3.1 shipped nine months after FF3.0; FF3.2 shipped six months after FF3.1, and each release had major new features.

This fast-paced major release cycle was made possible by work on branch-on-demand capabilities and failover-from-one-slave-to-another. We grew capable of supporting:

  • 2 active code lines in May2007
  • 5 active code lines in 2008
  • 8 active code lines in 2009.

…and with fewer downtimes, even as we added machines. It’s worth noting that each of those 8 branches had fully equal capabilities (failover machines, builds, unittest, talos), something we couldn’t do until late 2008. Powering up new branches on demand enabled developers to do parallel development, which meant Mozilla released major new features more often and more predictably, and also allowed Mozilla to react better to marketplace changes.

2) Continued to refine and simplify:
…or “reducing our drag coefficient” so we can move faster. For the early parts of 2009, the cleanup work, pruning of old systems, and automation work continued frantically. Each change made our infrastructure, and our group, a little more nimble and lean, improving our ability to make further changes. In 2008, the big example of this was removing the tinderbox client from release automation as part of the move from cvs to hg. This was needed to make project branches possible and to make systems more reliable, but it also simplified handling the unscheduled requests that came our way, like WinCE, Win-nonSSE, linux64, shark builds, etc.

3) Developed new capabilities:
We automated several recurring “one-off” projects, so we now produce automated major update offers, automated partner builds, and xulrunner releases, to name a few.

In a sign that we finally turned a corner in 2009, we developed a few new features that we never had before:
  • a more resilient buildbot master (if one master fails, another master takes over with no downtime)
  • better development support (automated code coverage reports)
  • a better dashboard (one place to see the health of all build/unittest/talos infrastructure, simplifying triage, regression hunting, and the “are the machines ok” checks we all do daily across multiple sites), which we use to measure and report infrastructure uptime, which in turn helps us improve further.

4) Continued to improve Quality of Life:
Internally, the larger team has settled in together. Each person brought experiences learnt elsewhere, and provided insights and perspectives that made us all better. The cross-training improved our bus factor, and our quality of life. We all learned new things, did good work we were proud of, took vacations in 2009 and improved our lives. Our burn-out rate continued to improve!

We proudly believe that the scale of turnaround achieved in the last 2 years is unique. It’s also unique that we are able to talk about it publicly, and provide improvements upstream for others to see and use. In 2009, we were finally able to spend more time explaining to folks, both inside and outside of Mozilla, how to make software development better and ship better products.

Infrastructure load for February 2010

Summary:

The number of pushes finally started increasing again, after the Firefox 3.6.0 and Fennec 1.0 releases. Try Server usage surpassed all other branches this month.

  • The numbers for this month are:
    • 1,264 code changes to our mercurial-based repos, which triggered 133,807 jobs:
    • 14,003 build jobs, or ~21 jobs per hour.
    • 58,110 unittest jobs, or ~86 jobs per hour.
    • 61,694 talos jobs, or ~92 talos jobs per hour.
  • Our Unittest and Talos load remains high, like last month, and we expect it to jump further as more OS are added to Talos.
  • Once we start running Unittests on all the Talos OS, we expect load to jump again. In advance of that, we’re spinning up more machines to handle this future spike in load.
  • The trend of “what time of day is busiest” changed again this month. Not sure what this means, but worth pointing out that each month seems to be different. This makes finding a “good” time for a downtime almost impossible.
  • The entire series of these infrastructure load blogposts can be found here.
  • We are still not tracking any l10n repacks, nightly builds, release builds or any “idle-timer” builds.

Detailed breakdown:

Here’s how the math works out. (Descriptions of the build, unittest and performance jobs triggered by each individual push are here.)
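To make that arithmetic concrete, here’s a minimal sketch (hypothetical code, not our actual reporting scripts) that reproduces the per-hour figures above from this month’s raw job counts:

```python
# A minimal sketch of the arithmetic behind these monthly load posts:
# each job count divided by the number of hours in the month.
# The counts are February 2010's totals from the summary above.

hours_in_month = 28 * 24  # February 2010: 672 hours

jobs = {
    "build": 14003,
    "unittest": 58110,
    "talos": 61694,
}

print("total jobs triggered: {:,}".format(sum(jobs.values())))  # 133,807
for kind, count in jobs.items():
    # e.g. 58,110 unittest jobs / 672 hours is ~86 jobs per hour
    print("{}: ~{:.0f} jobs per hour".format(kind, count / hours_in_month))
```

The same division, using 744 hours (31 days), gives the January numbers in the next post.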

Infrastructure load for January 2010

Summary:

The number of pushes continued trending downward, maybe related to the Firefox 3.6.0 and Fennec 1.0 releases that month. Meanwhile, our overall infrastructure load went up, almost doubling. This was caused by RelEng filling out all the different project branches to run the same unittest/performance suites (a frequent request from developers), and also by running Talos on additional new OS.

  • The numbers for this month are:
    • 1,189 code changes to our mercurial-based repos, which triggered 124,831 jobs:
    • 13,853 build jobs, or ~19 jobs per hour.
    • 54,786 unittest jobs, or ~73 jobs per hour.
    • 56,192 talos jobs, or ~76 talos jobs per hour.
  • Our Unittest and Talos load almost doubled this month – caused by adding Talos on OSX10.6, filling out Talos on linux64, and getting the full complement of Unittest and Talos running on all project branches. More OS are still being added to Talos, so expect this to jump again soon.
  • Different project branches continue to get different load at different times. This is to be expected when you consider that developers change from one focus area to another as projects wrap up and other projects start up. The barcharts below illustrate this nicely.
  • The trend of “what time of day is busiest” changed again this month. Not sure what this means, but worth pointing out that each month seems to be different. This makes finding a “good” time for a downtime almost impossible.
  • We are still not tracking any l10n repacks, nightly builds, release builds or any “idle-timer” builds.

Detailed breakdown:

Here’s how the math works out. (Descriptions of the build, unittest and performance jobs triggered by each individual push are here.)

Talos recalibration (status 02mar2010)

If you don’t care about Talos performance results, or Talos hardware, stop reading now! If you do care, this is the last in this series of posts.

Soon after my last post, we started running the new 2.26GHz minis concurrently with the older 1.83GHz minis. For several weeks now, every build has been performance tested on *both* sets of Talos machines, on all OS, and the graphs plotted on graphserver. Test results have been faster (obviously) and the machines significantly more reliable (because of newer OS levels), but we’ve also noticed that overall test setup time is a bit longer, which we suspect is because these Talos machines are in 650castro, whereas the builds are on ftp.m.o. It’s survivable, and we’re working on ways to improve it, but still worth noting. Most importantly, though, the two sets of machines track performance changes in Firefox in the same way.
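As an aside, that last check is easy to describe in code. Here’s a minimal sketch with made-up numbers (the real comparisons live on graphserver): absolute results differ between the two pools, but the build-over-build changes should line up.

```python
# Hypothetical per-build Talos results (in ms) for the same sequence of
# builds, run on both pools. Numbers are invented for illustration only.
old_minis = [415.0, 417.2, 409.8, 431.5, 430.1]  # older 1.83GHz rev2 pool
new_minis = [302.1, 303.9, 297.4, 318.8, 317.6]  # newer 2.26GHz rev3 pool

def deltas(series):
    """Build-over-build change, as a fraction of the previous result."""
    return [(b - a) / a for a, b in zip(series, series[1:])]

# If both pools track performance changes in Firefox the same way, the
# per-build deltas should agree even though the absolute times differ.
for old_d, new_d in zip(deltas(old_minis), deltas(new_minis)):
    print("old: {:+.2%}   new: {:+.2%}".format(old_d, new_d))
```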

Last week, we had enough rock-solid concurrent data from both sets of machines to feel safe disconnecting the older rev2 minis. Out of (mild?) paranoia, we left them powered on, and ready to throw back into production at a moment’s notice, just in case we’d missed something weird with the new rev3 minis. And we patiently waited a week just in case…

Yesterday, we began powering down and recycling the old minis. The “talos-rev2-*” machines are no more.

At major milestones like this, it’s easy to get nostalgic – those machines carried us through a lot of major events. Talos changed from dedicated-slaves-per-branch to pool-of-slaves… Talos on TryServer… a whole collection of new Talos test suites… and of course the FF3.0, FF3.5, FF3.6 releases are the big events that spring to my pre-caffeinated mind. We thank them for all they’ve done for us, and recycle them as part of the next big step for Talos and also for unittests – bug#545568 and bug#548768. All exciting stuff!!

Please welcome Mike “Bear” Taylor

Mike “Bear” Taylor joins Release Engineering this morning.

Mike is coming to Mozilla from Seesmic (a mobile-specific startup). However, many of you may already know Mike from his years of RelEng work at OSAF on Chandler, and his module-owner work on Bonsai and Tinderbox2. He’ll be based in Pennsylvania, but on irc you can find him as “bear”.

Welcome aboard, Bear.