John O'Duinn's Soapbox


…and here is our 2010 story. For perspective, click to see our 2009 story and our 2008 story.

Let us know what you think! Does this feel like the right focus? Does this address what feels most important to you?

tc
John.
=====
This story should help us make sense of our goals across the four quarters of 2010. As usual, we focused on 4 areas during the year:

1) Continued streamlining of release infrastructure:
We now ship Major releases every six months, all with partner builds, major updates etc. To keep up with this faster major release cycle, we continue to showcase our streamlined automation. For security releases, we now routinely provide 4-way, 5-way and occasionally 6-way “simultaneous-ship” releases.

This fast-paced major release cycle was possible because of our continuing behind-the-scenes work on release automation, full featured project-branches, and our scalable machine infrastructure spread across multiple locations. In numbers, our growth path was:

  • May2007: 3 OS on 2 code lines on 89 machines
  • Dec2008: 7 OS on 5 code lines on 253 machines
  • Dec2009: 12 OS on 11 code lines on 550 machines
  • Dec2010: 19 OS on 20 code lines on 1000 machines

2) Continue to refine and simplify:
(…or “reducing our drag coefficient” so we can move faster.)
During 2010, we finally turned off the legacy Thunderbird 2 systems, after supporting those users on MoMo’s behalf since early 2008. We also turned off the Firefox 3.0 systems. Both of these were the last of our cvs-based releases. Dropping both of those, combined with some enhancements we upstreamed to buildbot, meant we could continue to make improvements and reduce complexity in our automation. While the frantic nature of this work has reduced a bit since 2009, there’s still plenty of room for improvement that pays us back every time we do releases.

3) Outreach:
In addition to doing builds/tests/perf and releases, there are other ways we can use our infrastructure to help Mozilla. In 2009, we ran weekly code coverage jobs as the first jobs run on our machines outside of the traditional Firefox build/test/perf jobs. In 2010, we extended that further by running fuzzer jobs, and other code hygiene tools during idle times. We also helped Labs, and some other xulrunner partner projects, quickly scale and support users by running their jobs on our infrastructure – thus helping them avoid reinventing the wheel.

4) continued to improve Quality of Life:
The larger team has settled in together and continues to work together well under stress. Our shared skills keep our bus factor good, and our quality-of-life healthy. We all did good work we were proud of, learned new things at conferences, taught each other new things, took vacations and improved our lives.

John Lilly has been encouraging us to use the idea of “a retrospective story looking back on a year” as a way to help frame which quarterly goals make sense for an upcoming year. It’s been useful so far, so we keep doing it.

Our 2008 story is here.

While our 2009 story was in emails, and group meetings, I forgot to post it here, until I noticed it missing just now. It was interesting to read in late December 2009, but re-reading it now, as I post, reminds me of how far we’ve come since last year.

Next I’ll post the 2010 story.

take care
John.
=====
The fun, and the risk, of writing our story at the beginning of 2009 was wondering how those dreams & plans would look in cold clear hindsight. Not to mention things we never even considered but which changed our plans completely.

We’ve had amazing growth recently:

  • 4 people with 89 machines on 2 active code lines at end2007
  • 9 people with 253 machines on 5 active code lines at end2008
  • 11 people with 275 machines on 8 active code lines at end2009

For such a large (small?) group, this year we focused on 4 areas:

1) Strategically improved infrastructure:
FF3.1 shipped nine months after FF3.0; FF3.2 shipped six months after FF3.1, and each release had major new features.

This fast-paced major release cycle was made possible by work for branch-on-demand capabilities and failover-from-one-slave-to-another. We grew capable of:

  • 2 active code lines in May2007
  • 5 active code lines in 2008
  • 8 active code lines in 2009.

…and with fewer downtimes even as we added machines. It’s worth noting that each of those 8 branches had full equal capabilities: failover machines, builds, unittest, talos, something we couldn’t do until late 2008. Powering up new branches on demand enabled developers to do parallel development, meaning Mozilla released major new features more often and more predictably, and could react better to marketplace changes.

2) continued to refine and simplify:
…or “reducing our drag coefficient” so we can move faster. For the early parts of 2009, the cleanup work, pruning old systems, and automation work continued frantically. Each change made our infrastructure, and our group, a little more nimble and lean, improving our ability to make further changes. In 2008 the big example of this was removing tinderbox client from release automation as part of the move from cvs -> hg. This was needed to make project branches possible, and make systems more reliable, but also simplified handling unscheduled requests that came our way, like WinCE, Win-nonSSE, linux64, shark builds, etc.

3) developed new capabilities:
We automated several recurring “one off projects”, so we now produce automated major update offers, automated partner builds and xulrunner releases, to name a few.

In a sign that we finally turned a corner in 2009, we developed a few new features that we never had before:
* a more resilient buildbot master (if one master fails another master takes over with no downtime)
* better development support (automated code coverage reports)
* a better dashboard (one place to see health of all build/unittest/talos infrastructure, simplifying triage and regression hunting, as well as “are the machines ok” questions that we all do daily across multiple sites) which we use to measure and report infrastructure uptimes, which helps us improve further.

4) continued to improve Quality of Life:
Internally, the larger team has settled in together. Each brought experiences learnt, and provided insights and perspectives to make us all better. The cross training improved our bus factor, and our quality-of-life. We all learned new things, did good work we were proud of, took vacations in 2009 and improved our lives. Our burn-out-rate continued to improve!

We proudly believe that the scale of turnaround achieved in the last 2 years is unique. It’s also unique that we are able to talk about it publicly, and provide improvements upstream for others to see and use. In 2009, we were finally able to spend more time explaining to folks, both inside and outside of Mozilla, how to make software development better and ship better products.

Summary:

Overall load since Jan 2009

The number of pushes finally started increasing, after the Firefox 3.6.0 and Fennec 1.0 releases. Try Server usage surpassed all other branches this month.

  • The numbers for this month are:
    • 1,264 code changes to our mercurial-based repos, which triggered 133,897 jobs:
    • 14,003 build jobs, or ~107 jobs per hour.
    • 58,110 unittest jobs, or ~86 jobs per hour.
    • 61,694 talos jobs, or ~92 talos jobs per hour.
  • Our Unittest and Talos load remains high, like last month, and we expect this to jump further as more OS are still being added to Talos.
  • Once we start running Unittests on all the Talos OS, we expect load to jump again. In advance of that, we’re spinning up more machines to handle this future spike in load.
  • The trend of “what time of day is busiest” changed again this month. Not sure what this means, but worth pointing out that each month seems to be different. This makes finding a “good” time for a downtime almost impossible.
  • The entire series of these infrastructure load blogposts can be found here.
  • We are still not tracking any l10n repacks, nightly builds, release builds or any “idle-timer” builds.
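
As a rough check on the unittest and talos per-hour figures above, the rate is just the monthly job count divided by the number of hours in the month. The sketch below assumes a 28-day month (an assumption; the month length is not stated above), and `jobs_per_hour` is a hypothetical helper name, not part of any Mozilla tooling.

```python
def jobs_per_hour(jobs_in_month: int, days_in_month: int = 28) -> float:
    """Average jobs per hour over a month of the given length.

    The 28-day default is an assumption made for this illustration.
    """
    return jobs_in_month / (days_in_month * 24)


# Monthly totals taken from the list above.
unittest_rate = jobs_per_hour(58_110)
talos_rate = jobs_per_hour(61_694)

print(f"unittest: ~{unittest_rate:.0f} jobs per hour")
print(f"talos:    ~{talos_rate:.0f} jobs per hour")
```

Passing a different `days_in_month` gives the equivalent average for a longer month.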

Detailed breakdown is:
#Pushes this month

#Pushes per hour

what times of what days are busiest in the month

Here’s how the math works out (descriptions of build, unittest and performance jobs triggered by each individual push are here):
the math behind the graphs

Summary:

Overall load since Jan 2009

The number of pushes continued trending downward, maybe related to the Firefox 3.6.0 and Fennec 1.0 releases that month. Meanwhile, our overall infrastructure load went up, almost doubling. This was caused by RelEng filling out all the different project branches to run the same unittests/performance suites, a frequent request by developers, and also by running Talos on new additional OS.

  • The numbers for this month are:
    • 1,189 code changes to our mercurial-based repos, which triggered:
    • 13,853 build jobs, or ~92 jobs per hour.
    • 54,786 unittest jobs, or ~73 jobs per hour.
    • 56,192 talos jobs, or ~76 talos jobs per hour.
  • Our Unittest and Talos load almost doubled this month – caused by adding Talos on OSX10.6, filling out Talos on linux64, and getting a full complement of Unittest and Talos running on all project branches. There are more OS still being added to Talos, so expect this to jump again soon.
  • Different project branches continue to get different load at different times. This is to be expected when you consider that developers change from one focus area to another as projects wrap up, and other projects start up. The barcharts below illustrate that nicely.
  • The trend of “what time of day is busiest” changed again this month. Not sure what this means, but worth pointing out that each month seems to be different. This makes finding a “good” time for a downtime almost impossible.
  • We are still not tracking any l10n repacks, nightly builds, release builds or any “idle-timer” builds.

Detailed breakdown is:
#Pushes this month

#Pushes per hour

what times of what days are busiest in the month

Here’s how the math works out (descriptions of build, unittest and performance jobs triggered by each individual push are here):
the math behind the graphs

If you don’t care about Talos performance results, or Talos hardware, stop reading now! If you do care, this is the last in this series of posts.

Soon after my last post, we started running the new 2.26GHz minis, concurrently with the older 1.83GHz minis. Every build for several weeks now has been performance tested on *both* sets of Talos machines, on all OS, and the graphs plotted on graphserver.

Talos TP4 being run on rev2 and rev3 machines

Test results have been faster (obviously) and machines significantly more reliable (because of newer OS levels) but we’ve also noticed that overall test setup time is a bit longer, which we suspect is because these Talos machines are in 650castro, whereas the builds are on ftp.m.o. It’s survivable, and we’re working on ways to improve it, but still worth noting. Most importantly, though, the two sets of machines track performance changes in Firefox in the same way.

Last week, we had enough rock solid concurrent data from both sets of machines to feel safe disconnecting the older rev2 minis. Out of (mild?) paranoia, we left them powered on, and ready to throw back into production at a moment’s notice, just in case we’d missed something weird with the new rev3 minis. And we patiently waited a week just in case…

Yesterday, we began powering down and recycling the old minis. The “talos-rev2-*” machines are no more.

At major milestones like this, it’s easy to get nostalgic – those machines carried us through a lot of major events. Talos changed from dedicated-slaves-per-branch to pool-of-slaves… Talos on TryServer… a whole collection of new Talos test suites… and of course the FF3.0, FF3.5, FF3.6 releases are the big events that spring to my pre-caffeinated mind. We thank them for all they’ve done for us and recycle them as part of the next big step for Talos and also for unittests – bug#545568 and bug#548768. All exciting stuff!!

Mike “Bear” Taylor joins Release Engineering this morning.

Mike is coming to Mozilla from Seesmic (a mobile-specific startup). However, many of you may already know Mike from his years of RelEng work in OSAF on Chandler, and his module owner work for Bonsai and Tinderbox2. He’ll be based in Pennsylvania, but on irc you can find him as “bear”.

Welcome aboard, Bear.

It’s late, but I’ve just finished upgrading my wordpress install. The previous install was so out of date that none of the usual migration guides would support it anymore, so I had to hack around the previous data in SQL to get everything carried over correctly.

At this time, I *think* all the loose ends are sorted out, except for some missing categories on pre-existing posts. If I’ve got it right, you should see this post on planet.m.o as usual, and all the pre-existing links should continue to work fine. However, if you notice something that I missed, please let me know, ok?

(also – what do you think of the new theme?)

Today was a holiday here in California, which meant technically no work today – or more realistically, only a few hours work in the morning. So far 2010 has been a very hectic year so it was great to spend my first quiet day of 2010… sitting indoors, doing my taxes while looking out at the beautiful sunny day?!

All was going well until, under “Other Income Adjustments”, I was asked if I earned any income from:

  • Reward from a crime hotline
  • False Imprisonment Compensation
  • Ottoman Turkish Empire Settlement Payment

For the record, my answer was “No”, to all of those questions. But I found it surreal enough that I decided to stop, go outside and catch the last of the sunshine. I’ll try again tomorrow, but it has me still wondering what other unexpected questions might be lurking in the IRS tax codes of the USA.

Since the AllHands, we’ve held this brownbag with a few different groups. Each time, we tweak it further based on comments and questions asked. There might still be further changes, but at this point, it was worthwhile updating the PDF in this blogpost.

If you are interested in being an early “tester” of this BrownBag, let me know – I’d be happy to hold a brown bag session and go over it anytime. Also, if you have any comments about this PDF, please let me know. :-)

Late last week, we finally completed a long running project: we found a new, better, home for our growing array of mobile phones in our continuous integration pool. Here’s what I presented at the weekly Mozilla Foundation call Monday morning about that project; hopefully it makes sense.

1) Each checkin gets:

  • 40 hours of build/unittest/talos on a desktop computer, and

  • 25 hours of unittest/talos on mobile phones.
  • Instead of one phone doing 25 hours of testing, we could have 25 phones each do one hour of testing. Or 50 phones do half-hour of testing…
  • we get 100-120 checkins per day – that is a lot of phones.

2) Phones are really sensitive to wireless noise

  • interference causes intermittent orange/red test results, also manual hang/reset work for RelEng.
  • Building K office was better, because we were far from anyone, but the new office is in downtown Mountain View, and has lots more wireless noise.
  • To fix this, we built a “Faraday cage”, a shielded room to eliminate outside wireless interference. Wikipedia has a great description and diagrams, if you follow that link.

  • Hopefully this should give us better stability for test results. Time will tell.

3) How to arrange the phones in the room

  • lots of running devices, each transmitting radio signal. We’re planning for at least 400 phones all fitting inside the wireless room, so how to deal with cross interference. Not big enough to just arrange on desks / floor like we did while we were waiting for the Faraday cage to be built. Have to think 3D.

  • however, hard to attach to walls, cannot puncture walls – cannot breach cage.
  • whatever we do use within the room must be non-metallic, to minimize wireless disruption within the room
  • phones cannot be *too* close to each other, early testing with cross-interference showed failures if bunched too close.
  • as the number of phones increases, we might find more cross-interference, so whatever we do needs to be possible to move around.
  • must not touch the front / sides of the devices, which might accidentally press a button
  • must be easy to see the screen on the running phone
  • tolerant of heat, either from the phone or the phone charger
  • easy to take dead phones out, open, remove batteries, reimage and replace (we do this ~10 times per week)
  • must be compatible with phones we haven’t bought yet

4) How did we solve it?

  • BedBathandBeyond.com sell rolling mobile bamboo garment racks, which take 20 screws each and can be assembled in minutes.
  • They also sell hanging shoe racks which velcro to the cross-rail, are made of heat-resistant fabric, have plastic as a firm base to each shelf and hold 20 phones each.
  • Each garment rack can hang 40 phones (two shoe racks with 20 phones each). The power strips rest on the bottom wooden shelf and run cables up the back of the shoe racks. In this photo, you can see racks for 120 phones, of which 90+ are already in use.
  • Bonus: each phone is about 3-5 inches apart from the nearest phone in the adjacent shoe rack, which is the same spacing we had when the phones were just arranged on some empty desks in the office.

5) Brought to you by: aki, dmoore, jhford and mrz.

  • They all did *tons* of invisible behind-the-scenes work to make this happen. Every time you see a mobile phone reporting test results on TinderboxPushLog or drawing a dot on GraphServer, give thanks to aki, dmoore, jhford and mrz.

  • If you are curious for a video of the very quick presentation (3.5 minutes, according to Jono and his stopwatch), see Monday’s Mozilla Foundation weekly call here. I mention it in case the video, with its verbal explanation and hand-waving, is easier to follow than this blog post.
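
For anyone who wants to play with the numbers in sections 1 and 4, here is a minimal sketch of the arithmetic in Python. It assumes every phone is kept busy around the clock with no setup or reimaging overhead, which is optimistic, and the function names are made up for this illustration.

```python
import math


def phones_needed(checkins_per_day: int, phone_hours_per_checkin: int = 25) -> int:
    """Minimum phones to keep up, assuming each phone tests 24 hours a day."""
    phone_hours_per_day = checkins_per_day * phone_hours_per_checkin
    return math.ceil(phone_hours_per_day / 24)


def racks_needed(phones: int, phones_per_rack: int = 40) -> int:
    """Garment racks required, at two 20-phone shoe racks per garment rack."""
    return math.ceil(phones / phones_per_rack)


# 100-120 checkins per day, 25 hours of phone testing per checkin.
for checkins in (100, 120):
    print(checkins, "checkins/day ->", phones_needed(checkins), "phones")

# Racks for the 400 phones we are planning for.
print("racks for 400 phones:", racks_needed(400))
```

Treat the phone count as a lower bound only: dead phones, reimaging and uneven checkin timing all push the real number higher.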