A website honoring Godzilla, using haiku?!?
Maybe this is old news to everyone else, but I found godzillahaiku.tumblr.com fascinating. This one was my favorite, because of the cross reference to Blade Runner.
On Friday, one of the predictions in our “RelEng 2010 story” came true. We’re now doing 5 releases at the same time.
(Fennec 1.0.1 is likely to start this week also, but it wasn’t counted, because we haven’t been given a “go” yet!)
This is a major milestone for us. A couple of years ago, every time RelEng had to work on *one* release, it was a big deal; the idea of doing 5 releases simultaneously was simply not an option.
Don’t get me wrong; doing 5 releases simultaneously will not be trivial. There are bound to be gotchas and surprises. However, the mere fact that we can now do this at all is really wonderful to see. Being able to do other work at the same time, well… it speaks volumes about how the group has grown, how the infrastructure has scaled, and how all the behind-the-scenes improvements have helped streamline the release process here at Mozilla.
One way or another, this week will be exciting. Wish us luck!
…and here is our 2010 story. For perspective, click to see our 2009 story and our 2008 story.
Let us know what you think! Does this feel like the right focus? Does this address what feels most important to you?
tc
John.
=====
This story should help us make sense of our goals across the four quarters of 2010. As usual, we focused on 4 areas during the year:
1) Continued streamlining of release infrastructure:
We now ship major releases every six months, all with partner builds, major updates etc. To keep up with this faster major release cycle, we continue to showcase our streamlined automation. For security releases, we now routinely provide 4-way, 5-way and occasionally 6-way “simultaneous-ship” releases.
This fast-paced major release cycle was possible because of our continuing behind-the-scenes work on release automation, full-featured project branches, and our scalable machine infrastructure spread across multiple locations. In numbers, our growth path was:
2) Continue to refine and simplify:
(…or “reducing our drag coefficient” so we can move faster.)
During 2010, we finally turned off the legacy Thunderbird 2 systems, after supporting those users on MoMo’s behalf since early 2008. We also turned off the Firefox 3.0 systems. Both of these were the last of our cvs-based releases. Dropping both of those, combined with some enhancements we upstreamed to buildbot, meant we could continue to make improvements and reduce complexity in our automation. While the frantic nature of this work has eased a bit since 2009, there’s still plenty of room for improvement that repays us every time we do releases.
3) Outreach:
In addition to doing builds/tests/perf and releases, there are other ways we can use our infrastructure to help Mozilla. In 2009, we ran weekly code coverage jobs as the first jobs run on our machines outside of the traditional Firefox build/test/perf jobs. In 2010, we extended that further by running fuzzer jobs, and other code hygiene tools during idle times. We also helped Labs, and some other xulrunner partner projects, quickly scale and support users by running their jobs on our infrastructure – thus helping them avoid reinventing the wheel.
4) Continued to improve Quality of Life:
The larger team has settled in together and continues to work together well under stress. Our shared skills keep our bus factor good, and our quality-of-life healthy. We all did good work we were proud of, learned new things at conferences, taught each other new things, took vacations and improved our lives.
John Lilly has been encouraging us to use the idea of “a retrospective story looking back on a year” as a way to help frame which quarterly goals make sense for an upcoming year. It’s been useful so far, so we keep doing it.
Our 2008 story is here.
While our 2009 story was in emails, and group meetings, I forgot to post it here, until I noticed it missing just now. It was interesting to read in late December 2009, but re-reading it now, as I post, reminds me of how far we’ve come since last year.
Next I’ll post the 2010 story.
take care
John.
=====
The fun, and the risk, of writing our story at the beginning of 2009 was wondering how those dreams & plans would look in cold clear hindsight. Not to mention things we never even considered but which changed our plans completely.
We’ve had amazing growth recently:
For such a large (small?) group, this year we focused on 4 areas:
1) Strategically improved infrastructure:
FF3.1 shipped nine months after FF3.0; FF3.2 shipped six months after FF3.1, and each release had major new features.
This fast-paced major release cycle was made possible by work for branch-on-demand capabilities and failover-from-one-slave-to-another. We grew capable of:
…and with fewer downtimes even as we added machines. It’s worth noting that each of those 8 branches had full equal capabilities: failover machines, builds, unittest, talos, something we couldn’t do until late 2008. Powering up new branches on demand enabled developers to do parallel development, meaning Mozilla released major new features more often and more predictably, and could better react to marketplace changes.
2) continued to refine and simplify:
…or “reducing our drag coefficient” so we can move faster. For the early parts of 2009, the cleanup work, pruning old systems, and automation work continued frantically. Each change made our infrastructure, and our group, a little more nimble and lean, improving our ability to make further changes. In 2008 the big example of this was removing tinderbox client from release automation as part of the move from cvs -> hg. This was needed to make project branches possible, and make systems more reliable, but also simplified handling unscheduled requests that came our way, like WinCE, Win-nonSSE, linux64, shark builds, etc.
3) developed new capabilities:
We automated several recurring “one off projects”, so we now produce automated major update offers, automated partner builds, and xulrunner releases, to name a few.
In a sign that we finally turned a corner in 2009, we developed a few new features that we never had before:
* a more resilient buildbot master (if one master fails another master takes over with no downtime)
* better development support (automated code coverage reports)
* a better dashboard (one place to see health of all build/unittest/talos infrastructure, simplifying triage and regression hunting, as well as “are the machines ok” questions that we all do daily across multiple sites) which we use to measure and report infrastructure uptimes, which helps us improve further.
4) continued to improve Quality of Life:
Internally, the larger team has settled in together. Each brought experiences learnt, and provided insights and perspectives to make us all better. The cross training improved our bus factor, and our quality-of-life. We all learned new things, did good work we were proud of, took vacations in 2009 and improved our lives. Our burn-out-rate continued to improve!
We proudly believe that the scale of turnaround achieved in the last 2 years is unique. It’s also unique that we are able to talk about it publicly, and provide improvements upstream for others to see and use. In 2009, we were finally able to spend more time explaining to folks, both inside and outside of Mozilla, how to make software development better and ship better products.
Summary:
The number of pushes started increasing finally, after Firefox 3.6.0 and Fennec 1.0 releases. Try Server usage surpassed all other branches this month.
Here’s how the math works out (descriptions of the build, unittest and performance jobs triggered by each individual push are here):
Summary:
The number of pushes continued trending downward, maybe related to the Firefox 3.6.0 and Fennec 1.0 releases that month. Meanwhile, our overall infrastructure load went up, almost doubling. This was caused by RelEng filling out all the different project branches to run the same unittests/performance suites, a frequent request by developers, and also by running Talos on new additional OS.
Here’s how the math works out (descriptions of the build, unittest and performance jobs triggered by each individual push are here):
If you don’t care about Talos performance results, or Talos hardware, stop reading now! If you do care, this is the last in this series of posts.
Soon after my last post, we started running the new 2.26GHz minis, concurrently with the older 1.83GHz minis. For several weeks now, every build has been performance tested on *both* sets of Talos machines, on all OS, and the graphs plotted on graphserver. Test results have been faster (obviously) and machines significantly more reliable (because of newer OS levels) but we’ve also noticed that overall test setup time is a bit longer, which we suspect is because these Talos machines are in 650castro, whereas the builds are on ftp.m.o. It’s survivable, and we’re working on ways to improve it, but still worth noting. Most importantly, though, the two sets of machines track performance changes in Firefox in the same way.
Last week, we had enough rock solid concurrent data from both sets of machines to feel safe disconnecting the older rev2 minis. Out of (mild?) paranoia, we left them powered on, and ready to throw back into production at a moment’s notice, just in case we’d missed something weird with the new rev3 minis. And we patiently waited a week just in case…
Yesterday, we began powering down and recycling the old minis. The “talos-rev2-*” machines are no more.
At major milestones like this, it’s easy to get nostalgic – those machines carried us through a lot of major events. Talos changed from dedicated-slaves-per-branch to pool-of-slaves… Talos on TryServer… a whole collection of new Talos test suites… and of course the FF3.0, FF3.5, FF3.6 releases are the big events that spring to my pre-caffeinated mind. We thank them for all they’ve done for us and recycle them as part of the next big step for Talos and also for unittests – bug#545568 and bug#548768. All exciting stuff!!
Mike “Bear” Taylor joins Release Engineering this morning.
Mike is coming to Mozilla from Seesmic (a mobile-specific startup). However, many of you may already know Mike from his years of RelEng work in OSAF on Chandler, and his module owner work for Bonsai and Tinderbox2. He’ll be based in Pennsylvania, but on irc you can find him as “bear”.
Welcome aboard, Bear.