Congratulations to Chris AtLee and Chris Cooper

It gives me great pleasure to formally announce that both Chris AtLee and Chris Cooper have been promoted to Engineering Managers. This formalizes what we’ve been doing in RelEng for multiple quarters already, so in one sense, there is nothing new happening here. However, it is great to recognize the awesome hard work that both Chris and Chris have been doing, and to formalize it going forward.

Congrats on the well deserved promotions, catlee and coop!

John.

Rethinking one-on-one meetings

It’s easy to skip a blog post about how to run meetings – yawn – and move on to the more exciting posts about some shiny new tech topic. Don’t make the same mistake I did. This is a quick read and will change your working life.

Deb did a blogpost a while ago about how to run a more efficient 1×1 meeting. To be honest, I saw the post and skipped over it: “Don’t have time to read that, and anyway, I’ve done lots of 1x1s – each unique to the needs of each individual – and they go just fine, thanks anyway”.

Then Coop, in his own polite understated way, told me we were going to
try this format. It worked great for Coop’s 1×1 with Armen, and he
thought it might improve Coop’s 1×1 with me. Our first meeting took
longer than usual, but that was each of us getting used to the
changeover. The second meeting, and all meetings since then, have been
much shorter than usual, and far more productive for both people!

Maybe it’s just something unique to coop and myself?

But it felt worth trying with a few other people, which I did over the
next couple of weeks. At that point, I was totally convinced. We now use
this for all my 1x1s in RelEng.

Why does it work so well?

  • Set the agenda a day in advance
    • too many 1x1s are impromptu, unprepared and therefore inefficient.
    • making sure the agenda is *not* set by the manager is important; this
      means people can make sure what they need is covered, and the meeting is
      productive to them.
  • Sorted by time
    • the past: Talking about what you just accomplished helps set context,
      and helps even the most modest person discuss recent successes.

    • the present: What’s on your mind right now, typically blockers.
    • the near future: plans for the upcoming week help ensure both people agree the priorities are right.

    • the “far” future: Keeping the current work in context of a person’s career path, and in context of a group’s quarterly goals, is tricky. It’s easy for this to get pushed aside in the day-to-day rush of work, but this format helps keep it in view for everyone.
  • Require video
    • it’s easy to get distracted in our constant-interrupt environment, and
      the video helps keep people focused on the person they are talking with.
      This in turn helps the meeting run much quicker.

    • some “remoties” resisted using video at first – “too intrusive” was a
      common reaction. However, it only takes a couple of meetings this way
      before everyone sees how 1x1s with video run more smoothly than
      phone-call-only. Facial and body-language cues are super
      important – just ask anyone who’s got into a misunderstanding on irc or
      email!

    • this 1×1 can be the most direct human contact “remoties” have with the
      rest of Mozilla all week. Video is a great reminder that the voice on
      the line is a real human, and some of the saved time at the end can turn
      into seemingly-unimportant-but-actually-vital non-work chitchat. The
      kitchencams are popular for a similar reason.

The brilliance of Deb’s approach is that it is super low-tech and super easy to use. As engineers, we’re always tempted to look for a technical solution to any problem, but the few attempts I’ve seen so far have all added complexity and got in the way. By contrast, Deb stepped back and revisited the essence of the original problem from a completely different perspective, and I love what she came up with.

Try her suggestion. If it doesn’t work for you, go back to what you did before, no harm done. But maybe, just maybe, you will love it, and find yourself giving a silent “Thank you, Deb.” after every 1×1, just like I do.

[UPDATE: Ben Horowitz just blogged about this also. joduinn 04-sep-2012]

RelEng gathering in Toronto

The flight disruptions in Europe complicated the RelEng gathering in Toronto last week. Rail’s flights were canceled, and it took a while to find alternate flights that worked – he finally made it to Toronto, and it was great for everyone to meet in person. All united at last.

The week together was awesome. The advance planning is a bit of a headache, but the time spent together and the brainstorming of knotty problems make it all well worthwhile. With so many “remoties”, we’re used to being a very distributed group, yet there were a bunch of problems that we worked through in just the few days we were together. As always, I find myself leaving these group gatherings excited by the things we’ve done and the major projects we’re working on next, and proud of the wide range of smart, unique people in the group.

[Photo: RelEng gathering in Toronto. From L->R; standing: coop, bhearsum, rail, joduinn, catlee, bear, lsblakk, jhford, nthomas, aki; sitting: armenzg, alice. Photo thanks to Aki!]

ps: Rail’s going to stay in Toronto for next week also, to work with catlee and bhearsum, so if you see him in the Mozilla office, please do take a moment to say hi!

Flight disruptions because of Icelandic volcano

The Icelandic volcano eruption is still causing significant travel disruptions in Europe, and looks set to get worse. The news is covered with stories of entire countries closing their airspace for the first time, photos of stranded travelers in airports, stories of people taking taxis from England to Switzerland – it all sounds bad. Even RelEng is impacted by these flight disruptions: we’re all meeting in Toronto this week, but sadly Rail is stuck in Moscow.

This picture from flightradar24.com gave a more understandable summary of the scale of the disruption. The combination of flight data with maps summed up the situation in a very intuitive way, and I really liked how they did this. Nice job, flightradar24.com.

(Oh, and before you ask why close entire country airspace for “some dust”, you should check out the stories about BritishAirways Flight#9 and KLM Flight#867 during other volcanic eruptions. Both ended well, but still…)

Watch this space – at some point the focus of the news will shift from the flight and economic disruption of this eruption to how it will change weather patterns.

Updates for project branches

Summary:
We produce nightly builds for mozilla-central and each of the release branches (1.9.0, 1.9.1, 1.9.2), along with updates to keep users on the latest nightly.

Project branches are different. We create nightly builds, but creating updates to bring users on a project branch to the next nightly build on the project branch turned out to be hard.

This is now fixed. 🙂

Lorentz nightly users have been getting nightly updates for a couple of weeks now, and that worked fine. We’re now enabling nightly updates for other project branches that have enough daily changes to make it worthwhile.

This is a big milestone in Mozilla’s transition to concurrent branch development, and kudos to nthomas and coop for making this happen.

More details for the curious:
The problem sounds easy at first: we already have updates for mozilla-central, mozilla191, mozilla192… how hard can it be?

Well, like a lot of RelEng, the devil is in the details.

What version number do you call a particular project branch? The same as the branch it’s based on? Something unique? Oh, and there are some gotchas:

  • pretending that a project branch was a really old version doesn’t work. A safety feature of the updater logic is that you cannot update to an older version. Users on “1.9.4” would hit problems if they tried to “upgrade” to the released “1.9.3”. For example, existing Firefox “1.9.2” users could not upgrade to “1.9.0.999”, or “-1.1.1.1”. And we’d have to be careful not to accidentally bump into some old, valid updates still live for, say, 1.5.0.12.
  • pretending that tracemonkey was a newer, future number wasn’t perfect either. When we do have a *real* mozilla-1.9.4 in the future, those builds will get updated to the fake-number-for-projectbranch builds and be broken. Planting landmines for ourselves in the future seemed a bad idea. I call this polluting our update-namespace.
  • the updater system uses numbers, and it’s unknown what happens if you try something like “1.9.2tracemonkey” or “x.y.z”. Investigating this required digging into our release automation code, as well as the AUS server logic and the Firefox client updater code. Several attempts at this ended up as scary, hard-to-prove-correct code, and caused us to drop this goal in earlier quarters.

The breakthrough came when we realized we could make updates for project branches have the same version number as their parent, yet be on a different update channel. This is similar to how we do partner updates; with some small tweaks around how we handle fallbacks, we could do something similar here. For example, the parent of lorentz is mozilla-192, so we gave lorentz the same *version* as 192, but a different lorentz-nightly channel.

Bingo. This worked first time.
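
To make the idea concrete, here is a toy sketch in Python – purely illustrative, not the real AUS server or Firefox client updater code, and the helper names are my own – of why an “old” version number fails but a same-version/different-channel scheme works:

    # Toy sketch only -- not the real AUS or client updater logic.
    def accepts_update(current_version, offered_version):
        # Downgrade protection: never move a user to an older version.
        def as_tuple(v):
            return tuple(int(part) for part in v.split("."))
        return as_tuple(offered_version) >= as_tuple(current_version)

    # Pretending lorentz is an "old" version fails: 1.9.2 users can't take it.
    print(accepts_update("1.9.2", "1.9.0.999"))   # False -> update rejected

    # Instead, lorentz keeps its parent's version (1.9.2) but asks a different
    # channel, so regular 1.9.2 nightly users never see lorentz updates.
    update_snippets = {
        ("1.9.2", "nightly"):         "mozilla-1.9.2 nightly update",
        ("1.9.2", "lorentz-nightly"): "lorentz nightly update",
    }
    print(update_snippets[("1.9.2", "lorentz-nightly")])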

After quite some testing to make sure we weren’t missing a horrible pothole, we used this approach. Once the updates-on-slaves project was live in production, we had the spare CPU cycles to generate these updates every night. Lorentz nightly users were highest priority, so we tried it there first, and all were happy.

The Lorentz branch is going away, but we know we can now do this for the Electrolysis and Tracemonkey branches. We’re not going to enable these updates for every project branch; if there’s not at least a change or two per day, it’s really just not worthwhile. It’s not free to generate the update snippets, and preserving them on AUS takes quite some space and time in perpetuity. These same systems still offer updates to any slow-upgrading FF1.5.0.12 users, so we don’t want to put anything on here unless it provides real value.

Having these updates available for developers on Tracemonkey and Electrolysis to dogfood their own work gives us the usual goodness of faster development, faster bugfixing and faster landing of project branch work into new releases. This is a big milestone in our ability to transition from single-track development to concurrent branch development, and is great news… for developers, QA, users and Firefox.

Further details at https://bugzilla.mozilla.org/show_bug.cgi?id=534954 and the various linked bugs from there.

Updates now generated on pool-of-slaves

Summary:
We used to generate all updates on one dedicated machine (prometheus-vm). We now generate updates as jobs queued to the pool of slaves. This makes our current work faster, and unblocks us to do some awesome stuff. /me bows with gratitude to coop.

More details for the curious:
Why bother refactoring how we create updates? We have had one old machine doing nightly build updates for years, and creating updates is quick – just 15 minutes a night:

2.5 minutes per update x 3 OS x 1 en-US locale x 2 branches = 15 minutes

Having the one machine doing this for the two active code branches was trivial, so why not just leave it alone? There are plenty of other things to fix, right?

The problem is that 15 minutes covers en-US nightly updates only. We wanted to treat l10n builds as equals, so we figured out how to produce nightly updates for l10n builds as well. This had never been done at Mozilla before, and was great for the l10n community. However, when we turned on l10n nightly updates in production, it changed the math significantly. What used to take 15 minutes now took:

2.5 minutes per update x 3 OS x 75 locales x 2 branches = 1,125 minutes = 18.75 hours.

Compounding the situation, RelEng was being asked to support 3 active, fully localized releases (FF3.0, FF3.5, FF3.6) and also 5 project branches that all wanted nightly updates. And an increase from 3 OS to 7 OS on most of those branches.
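
A quick back-of-envelope calculation – assuming the 2.5-minutes-per-update figure above still holds, and that the project branches only need en-US updates – shows the scale of the problem:

    # Back-of-envelope only; assumes 2.5 min/update (from above), 7 OS,
    # 75 locales on the 3 localized release branches, and en-US-only
    # nightly updates on the 5 project branches.
    minutes_per_update = 2.5

    release_minutes = minutes_per_update * 7 * 75 * 3   # localized release branches
    project_minutes = minutes_per_update * 7 * 1 * 5    # en-US-only project branches

    total_hours = (release_minutes + project_minutes) / 60
    print(total_hours)   # ~67 hours of update generation per day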

Clearly one machine couldn’t do all this in a 24-hour day.

One approach we considered was just to clone this machine, doing half the work on one and half on another identical machine. This *might* have worked, but it wasn’t risk-free: the old system isn’t documented anywhere, and we’d have to verify the two systems could not trip each other up, corrupt the updates and break users.

Whatever we changed here had to be so well understood that we would be confident it was not breaking any user updates. And it had to scale. And solve the “single point of failure” problem to be reliable for our needs.

Coop figured out how the old system worked, how it could be broken into independent concurrent chunks, and how it could be integrated into buildbot so these could be run after each nightly. Details for the curious are here. It’s been tricky: the code is fiddly, there are many sharp edges, and the risk is high – any bug would generate bad updates that completely break a user – so confidence in the accuracy of the updates has to be rock solid. And it has to not break other users of the same patch generation code, like Seamonkey, Camino, etc.
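
I’m glossing over the real buildbot configuration here, but the core of the chunking idea is easy to sketch: each (branch, OS, locale) update is an independent piece of work, so it can be queued as its own job and picked up by any slave in the pool. The names below are illustrative only:

    # Illustrative sketch of the chunking idea only -- the real work lives in
    # our buildbot configs and release automation, not in this script.
    from itertools import product

    branches = ["mozilla-1.9.1", "mozilla-1.9.2"]
    oses     = ["linux", "macosx", "win32"]
    locales  = ["en-US", "de", "fr", "ja"]        # 75 locales in production

    # Each combination becomes its own job, runnable on any slave in the pool
    # once the nightly builds finish.
    jobs = [
        {"branch": b, "os": o, "locale": l, "task": "generate-update-snippet"}
        for b, o, l in product(branches, oses, locales)
    ]

    # With N slaves pulling from the queue, wall-clock time drops from
    # len(jobs) * 2.5 minutes to roughly (len(jobs) / N) * 2.5 minutes.
    print(len(jobs), "independent update-generation jobs")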

Coop rolled this into production a week ago, after months of testing in staging. From the outside looking in, no-one noticed any difference, except that 18 hours of update generation was now being done in < 5 hours. For such a huge change, having nobody notice any problems is a great accomplishment.

Within RelEng, this change means that nightly updates for a branch are now done in under a third of the time. It also means we can now do multiple sets of updates concurrently, spread across the pool, which scales our ability to generate updates. Because of this, we can now generate updates for the new linux64 and osx10.6 64-bit builds, as announced here. Because of this, in a 24-hour day, we can now generate updates for 3 fully localized release branches and 5 non-localized project branches… but I’m getting ahead of myself! There’s more in the next post…

Firefox: now building on 64bit OSX 10.6.2 and linux64

Yesterday, TinderBoxPushLog started showing little green squares under “Mac OSX64” and “Linux64”. Guess what – those are OSX 10.6 64-bit and linux 64-bit builds!


There are opt and debug builds, with incremental builds triggered by checkin during the day, full clobber builds run every night and all builds available on ftp.m.o. All the usual goodness.

If you are running a 64-bit OS and want to help, you can download the nightly builds from here and try them out. We tested like crazy in staging, so we think it all works, but if you hit any problems, we’d love to hear about them, particularly with nightly updating or crash reporting.

As I type, the rest of the mechanics are still being rolled out to production in digestible chunks – unittests, talos, and release-build-automation are some highlights – but it is exciting to see this new-desktop-OS work roll out to production. The curious can follow the 10.6 64-bit work here and the linux64 work here.

There’s been a ton of behind-the-scenes work to make this happen. Please send chocolate/beer/kudos to Bear, Armen, Coop, Josh. Also a special shout out to jlazaro and jdow for all the work imaging up 90 machines, so these new OSes came online with enough slaves in the pool-of-slaves to keep wait times healthy.

Exciting new things

Lots of big exciting stuff came online in RelEng land last week, and even more this week.

We’ve been so busy rolling all these into production, we’ve not had time to talk about them much. Time to fix that – but where to start… ?

(Note: this is a *great* problem to have!)

Firefox 3.6.3: now that was fast!

From a RelEng point of view, Firefox 3.6.3 was our fastest release yet.

  • 05:28 PST 01-apr-2010: “go to build” after fix landed and builds/tests/talos all ran green
  • 18:01 PST 01-apr-2010: downloads available on mozilla.org, updates available for users doing CheckForUpdates
  • 12 hours 33 mins elapsed time

While this is great to see, this was not a perfect release for us, and we know we can do better:

  • bandwidth problems between MPT and 650castro cost us wasted time Wednesday night, chasing burning builds only to figure out the bustage was not a code problem after all. This was before the “go to build”, so it is not included above, but it was a delay – and it’s unclear how much of a delay this caused.
  • manual errors in the config files when starting release automation caused us to interrupt/stop/restart the release automation. The error was caught early, so no harm done while recovering from that, but we forgot to manually retrigger the source tarball step. It’s now all fixed, but still…
  • we were caught really short-handed. Most of the group was heading out on Easter vacation; one person was sick, one already out on vacation. This caused us to defer some other proposed releases for a few days. It also caused a delay in starting the mirror push mid-Thursday because of a handover from one RelEng person to another – something we try to avoid doing *during* a release.

We’ll go through this in more detail in our postmortem later today, and file bugs for any other gotchas… all to try and improve the *next* release.

Note: as usual, I do not include the time it takes to *develop* and *test* a fix for a bug. I’m totally happy with developers taking the time needed to develop a good fix, and QA taking the time needed to verify the fix works as expected. If you start rushing any of that, it’s easy to accidentally compromise the quality of the fix. Which is bad. However, the time taken from “go to build” to “release is out” is a measure of the efficiency of the release automation process, something we’re constantly working on improving *between* releases. The faster we can safely release, the better.

ps: The speed at which Mozilla got this bug fixed, tested and distributed to users was noticed here
– very cool!

UPDATE: fixed typo in URL. joduinn 05apr2010