The RelEng story for 2009

John Lilly has been encouraging us to use the idea of “a retrospective story looking back on a year” as a way to help frame which quarterly goals make sense for an upcoming year. It’s been useful so far, so we keep doing it.

Our 2008 story is here.

While our 2009 story circulated in emails and group meetings, I forgot to post it here until I noticed it missing just now. It was interesting to read in late December 2009, but re-reading it now, as I post, reminds me of how far we’ve come since last year.

Next I’ll post the 2010 story.

take care
John.
=====
The fun, and the risk, of writing our story at the beginning of 2009 was wondering how those dreams & plans would look in cold clear hindsight. Not to mention things we never even considered but which changed our plans completely.

We’ve had amazing growth recently:

  • 4 people with 89 machines on 2 active code lines at the end of 2007
  • 9 people with 253 machines on 5 active code lines at the end of 2008
  • 11 people with 275 machines on 8 active code lines at the end of 2009

For such a large (small?) group, this year we focused on 4 areas:

1) Strategically improved infrastructure:
FF3.1 shipped nine months after FF3.0; FF3.2 shipped six months after FF3.1, and each release had major new features.

This fast-paced major release cycle was made possible by work on branch-on-demand capabilities and failover-from-one-slave-to-another. We grew capable of:

  • 2 active code lines in May 2007
  • 5 active code lines in 2008
  • 8 active code lines in 2009.

…and with fewer downtimes even as we added machines. It’s worth noting that each of those 8 branches had full, equal capabilities: failover machines, builds, unittest, talos, something we couldn’t do until late 2008. Powering up new branches on demand enabled developers to do parallel development, meaning Mozilla released major new features more often and more predictably, and could react better to marketplace changes.

2) Continued to refine and simplify:
…or “reducing our drag coefficient” so we can move faster. For the early parts of 2009, the cleanup work, pruning of old systems, and automation work continued frantically. Each change made our infrastructure, and our group, a little more nimble and lean, improving our ability to make further changes. In 2008 the big example of this was removing tinderbox client from release automation as part of the move from cvs -> hg. This was needed to make project branches possible and to make systems more reliable, but it also simplified handling unscheduled requests that came our way, like WinCE, Win-nonSSE, linux64, shark builds, etc.

3) Developed new capabilities:
We automated several recurring “one-off projects”, so we now produce automated major update offers, automated partner builds, and xulrunner releases, to name a few.

In a sign that we finally turned a corner in 2009, we developed a few new features that we never had before:
* a more resilient buildbot master (if one master fails another master takes over with no downtime)
* better development support (automated code coverage reports)
* a better dashboard (one place to see the health of all build/unittest/talos infrastructure, simplifying triage and regression hunting, as well as the “are the machines ok” questions that we all ask daily across multiple sites). We also use it to measure and report infrastructure uptimes, which helps us improve further.

4) Continued to improve Quality of Life:
Internally, the larger team has settled in together. Each person brought lessons learned, and provided insights and perspectives that made us all better. The cross-training improved our bus factor, and our quality-of-life. We all learned new things, did good work we were proud of, took vacations in 2009 and improved our lives. Our burn-out-rate continued to improve!

We proudly believe that the scale of turnaround achieved in the last 2 years is unique. It’s also unique that we are able to talk about it publicly, and provide improvements upstream for others to see and use. In 2009, we were finally able to spend more time explaining to folks, both inside and outside of Mozilla, how to make software development better and ship better products.
