Welcome Kim Moir

On Monday, we were delighted to have Kim Moir join Mozilla’s Release Engineering group. She’ll be working with coop, who is (coincidentally) also based out of Ottawa, but the rest of us can find her on irc as “kmoir”. Please do say “hi” and welcome her to Mozilla.

Kim brings great perspective to the group, as she has worked on Eclipse release engineering for years, has worked with distributed groups, and has also done lots to raise awareness about release engineering with the open source community. If you are not already familiar with her work, you should read her blog here (great title, by the way!).

Welcome, Kim!

ps: Yes, we’re still hiring. Our Release Engineering group helps Mozilla’s developer community write great code, and then efficiently gets that code into the hands of our Firefox users. If you are passionate about open source, and about building large complex distributed infrastructure, we’d love to hear from you.

Cleaning up the build process for Fennec developers

A little over a week ago, ted, kyle, joey, jhford and I got together to see how we could improve the Makefiles, specifically to help Fennec developers.

I thought people might be interested in a quick progress report.

The typical compile-package-deploy-test cycle for Fennec developers is time-consuming and tricky, mostly because of layers of workarounds for dependency misfires in the Makefile logic. Because developers do this countless times every day, every developer wants this compile-package-deploy-test cycle to be as quick as possible, so they can make progress quickly. Over time, every developer grows their own trusted workarounds to generate valid builds as quickly as possible, which is a fair solution to suboptimal Makefiles. However, badly documented, poorly understood, fragile workarounds can be a recurring cause of stress and delays, especially when one misstep in the workarounds can send you down a long false debugging trail. Not what people need right now.

To figure out where to start, we:

  1. Asked a few developers for specific pain points that they would like us to start on first. Everyone has their own pet peeve. But if there was something *all* developers mentioned, we looked there first. If there wasn’t a bug already filed, we filed one.
  2. Used some very cool tools Joey created to generate reports on time-usage-by-directories, to help us decide where the biggest time-sink is, and hence where to focus first.
  3. Studied the “how to build fennec” wiki page. This let us see what all new Fennec developers learn first, as well as learn some of the more commonly used workarounds and gotchas. Over time, developers have learned their own undocumented workarounds for different gotchas, so different people follow different build process steps, but this wiki was a great starting point.
  4. Triaged through existing Core:BuildConfig bugs to see if there are any unresolved bugs which identify problem areas.
  5. Timed full clobber builds. On my MBP, clobber Fennec builds complete in 20-40mins. It’s not yet clear to me why there is such a range in timings. Joey, jhford and I all got a similar range of timings, which I found even more interesting because our machines range from MBP laptops up to high-spec 8-way desktop machines.
  6. Timed depend builds. Of course depend builds are trickier to measure – what gets built depends on what you touch. So I tried the “simplest” case. If I do a full clobber build, change nothing, and start a depend build, how long does that depend build take? Note: there was no change after the previous clobber build, so if everything was working right, this should take minimal time to traverse dependencies and should not recompile/relink anything. On my MBP, “nothing-to-do depend builds” take between 2m45s and 3m15s. And worst of all, imho, they always did a bunch of recompiling, rebuilding manifests… even though we know nothing has changed… urgh.
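For anyone who wants to repeat the timing measurements in steps 5 and 6, here is a minimal sketch of how repeated runs could be timed. It is not the tooling we used; the function names are made up for illustration, and the command in the example is a harmless placeholder that you would swap for your own build command.

```python
import subprocess
import sys
import time


def time_command(cmd, runs=3):
    """Run cmd `runs` times; return each run's wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises CalledProcessError if the command fails,
        # so a broken build is never mistaken for a fast one.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


def format_duration(seconds):
    """Format a duration in seconds in the style used above, e.g. '2m45s'."""
    minutes, secs = divmod(int(round(seconds)), 60)
    return f"{minutes}m{secs:02d}s"


def summarize(durations):
    """Report the fastest and slowest run as a range."""
    return f"{format_duration(min(durations))} -> {format_duration(max(durations))}"


if __name__ == "__main__":
    # Placeholder command (an assumption): replace with your real build command.
    placeholder = [sys.executable, "-c", "pass"]
    print(summarize(time_command(placeholder)))
```

Running the same command several times and reporting the range, rather than a single number, makes a spread like the 20-40min one above visible straight away.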

To make things better, so far, we:

  • Set up, tested and deployed the new Android r7b NDK and the faster “gold” linker to production RelEng machines (bug#675572). We also added this same NDK and linker to the posted pre-built toolchain for easy developer usage (bug#745956). As well as being faster to compile and link, this also gives us support for Android 4.0 (Ice Cream Sandwich).
  • Filed and landed bug#746741 to add a new makefile target “build_and_deploy” to encapsulate the rebuild/repackage/install steps on Android. While this might seem like an odd place to start, this is important because there is a lot of confusion caused by the different makefile workarounds that each developer has evolved for themselves over time. Some of these workarounds become folklore which people do on faith, even when the original need has since been fixed. Figuring out what is “supposed to be done”, and publishing one clear target which does that consistently, gives us something to keep working on. As we improve things inside the Makefiles, this “build_and_deploy” target will keep getting better, including handling more of the current workarounds, and getting faster every time. As developers discover that this one supported “build_and_deploy” target safely does everything that they need, and is a fast, safe alternative to their workarounds, developers will gradually no longer feel a need to do workarounds… which means developers can instead focus on making the shipping product better.
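To illustrate the idea behind one supported entry point, here is a rough sketch of running rebuild/repackage/install steps in a fixed order and stopping at the first failure. This is not the makefile target from bug#746741; the function name is hypothetical and the step commands are stand-ins that do nothing, since the real commands live in the Makefiles.

```python
import subprocess
import sys


def run_steps(steps):
    """Run (name, command) pairs in order, stopping at the first failure.

    Returns the names of the steps that completed successfully.
    """
    completed = []
    for name, cmd in steps:
        print(f"==> {name}")
        result = subprocess.run(cmd)
        if result.returncode != 0:
            # Stop here: later steps depend on this one having worked.
            print(f"step '{name}' failed with exit code {result.returncode}")
            break
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Stand-in command (an assumption): each real step would call into the build.
    noop = [sys.executable, "-c", "pass"]
    run_steps([("rebuild", noop), ("repackage", noop), ("install", noop)])
```

The point of the sketch is the shape, not the commands: one published sequence, always run in the same order, so nobody has to remember which step their personal workaround skips.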

Having said all that, we’re still only scratching the surface. There’s lots more to fix. Much much more. So we filed bug#748452 to track the list of most-urgent things we’ve found so far.

If you are building Fennec and know of a problem with the Makefiles that impacts your ability to get work done, please file a bug in Core:BuildConfig for us. Of course, the trick is to fix these without breaking any other groups that use the same Makefiles. If you already filed a bug before, and it’s still open, please cc us on the bug, OR at least put a brief description of the problem into an email to any of us, and we’ll triage.

Thanks for the patience. More news soon as we make things better.

Kilimanjaro: “trains, planes and automobiles”

Until recently, Mozilla has mostly focused on shipping one product – Firefox. (Yes, I know Mozilla shipped other products like Thunderbird, SeaMonkey and Camino, but they used the same tool chain, and the same/similar release cadence, so they can be thought of as similar, if not identical, for the purposes of this discussion.) I think the tight formation flying of the Blue Angels is a good analogy!

Times are changing.

Mozilla now ships multiple very different products: Firefox, Fennec, Sync, BrowserID and soon Boot2Gecko. Each of these products is built by different groups of people, with different toolchains, different features, different processes for tracking blocking/shipping criteria, and most importantly, different release schedules.

If we want to ship a new feature that requires work coordinated across different products, we need all the different parts of the feature to ship at the same time. Coordinating this means each product needs to plan backwards in time, to coordinate when they start working on their part of the shared feature. Also, any schedule slip in any one product needs to be cross-coordinated across all products.

Coordinating the different parts of a new feature into each of these different products is tricky.

Having all products ship their part of the overall feature in a coordinated manner is even trickier.

To me, it feels like arranging transportation for a family event. Some guests live in the same town and can walk over at short notice. Some guests will drive. Some guests will take a train. Some guests will fly in – and some of those will have to get visas. All have to arrive in the same location by the “release date” (ie the day of the family event). This does not mean everyone starts using airplane ticket agents to pay for train tickets or to refuel their cars. However, this does mean everyone has to plan, according to their own release schedules, and transportation of choice, when they need to start making travel plans in order to still arrive at the event on time.

As Mozilla starts to build more complex features across our range of shipping products, we’ll need to learn this new cross-product coordination skill and get better at it, so we can do it again. And again. And again.

Kilimanjaro is the first “coordinate-what-parts-need-to-go-into-which-products-and-by-when-so-they-all-ship-as-one-coordinated-feature” project. There will be others; it’s cool to see the start of this new phase for Mozilla.

New Mozilla mirror in Cambodia

A few weeks ago, sabay.com.kh became an official mirror node for Mozilla in Cambodia. This will help Firefox users in Cambodia have faster downloads of security updates, as well as anyone in Cambodia looking to download fresh installs of Firefox. Given the market presence and bandwidth capacity of Sabay (and parent company CIDC-IT) in Cambodia, this is great news.

Many thanks to Mike Gaertner, COO of CIDC-IT, for taking the time to meet with me in Phnom Penh in January 2012, and then taking the personal interest to work through the mechanics to make this new mirror node a reality.

We are all “remoties” (Apr2012 edition)

[UPDATE: The newest version of this presentation is here. joduinn 12feb2014, 09nov2014]

At the Mozilla Summit in sept2011, we ran a session on working remotely at Mozilla.

I was surprised/stunned/honored by needing to run this session *twice* because of popular demand, the sheer volume of interaction in each session and the ongoing interest since the summit.

Writing these slides, I realize how much I care about this topic… and how many careful subtle habits we’ve developed within RelEng over the last ~5 years.

During the summit, and again last week in Toronto, I had a chance to meet with Homa Bahrami (Senior Lecturer, Haas Management of Organizations Group, Haas School of Business, Berkeley). Apart from being a great person to talk with, she has lots of organizational and behavioral science background to help explain why the things that we felt were helping were, in fact, things she would expect to help!

(click image for PDF of slides; keynote available on request, but it’s large!)

As I said at the start of each session, at first it felt odd for a Release Engineer to be talking about work habits of distributed groups… until you think about how physically distributed Mozilla’s Release Engineering group is. I note, for the record, that *none* of RelEng are “in headquarters”. While there are occasional miscommunications, RelEng is fairly well plugged into what’s going on… after all, we *need* to be in order to do our job of shipping software quickly, reliably and accurately.

To me, this feels like it actually is about working together in clearly understood ways. The suggestions here have helped “remote” RelEng people in clear and obvious ways, but they *also* help “local” RelEng people work together better.

Please let me know what you think. And of course, if you have ideas or suggestions that I missed, I’d love to hear them.

(Apologies to those who’ve been pestering me to post these over the last few months. Last week’s “remoties” day reminded me how important this is to post – even in its rough state. I’ve fixed the most egregious errors/typos, and merged in some feedback I got in the Q&A sessions. However, these slides still need further work. If you spot anything to fix, please let me know!)

Infrastructure load for March 2012

  • #checkins-per-month: March set another new record with 4,508 checkins for the month. That is now 5 months in a row setting new records: February2012 (4,027 checkins), January2012 (3,962 checkins), December2011 (3,262 checkins), and November2011 (3,209 checkins).
  • #checkins-per-day: We set a new record of 238 checkins per-day on 21-mar-2012. Also worth noting was that, in March, all 22 of the 22 working days had between 135 and 238 checkins per day, and of those, 2 days had over 200 checkins per day (214 on 12mar, 238 on 21mar).
  • #checkins-per-hour: We set a new record of 9.8 checkins-per-hour. And if that wasn’t enough, we did it twice. For the first time, we saw double peak load times in a day: 11:00-noon PDT and 14:00-15:00 PDT. Note these records include weekends, which are mostly idle, so the real checkins-per-hour on work days is higher.
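For the curious, the month-over-month growth behind those monthly totals can be worked out with a few lines. This is just arithmetic on the five totals quoted above; the function names are made up for illustration.

```python
# Monthly checkin totals as quoted in the list above, oldest first.
monthly_checkins = {
    "Nov 2011": 3209,
    "Dec 2011": 3262,
    "Jan 2012": 3962,
    "Feb 2012": 4027,
    "Mar 2012": 4508,
}


def month_over_month(counts):
    """Return (month, count, percent change vs previous month) tuples.

    The first month has no previous month, so its change is None.
    """
    rows = []
    previous = None
    for month, count in counts.items():
        change = None if previous is None else 100.0 * (count - previous) / previous
        rows.append((month, count, change))
        previous = count
    return rows


def each_month_beats_previous(counts):
    """True if every month's total is higher than the month before it."""
    values = list(counts.values())
    return all(later > earlier for earlier, later in zip(values, values[1:]))


if __name__ == "__main__":
    for month, count, change in month_over_month(monthly_checkins):
        suffix = "" if change is None else f" ({change:+.1f}%)"
        print(f"{month}: {count}{suffix}")
    print("each month beat the one before:", each_month_beats_previous(monthly_checkins))
```

Note that this only checks each month against the one before it, using the numbers listed here; it says nothing about months earlier than November 2011.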

mozilla-inbound, fx-team:
mozilla-inbound continues to be heavily used as an integration branch, with 26% of all checkins. By comparison, however, the fx-team branch only had 2% of the checkins, much less than mozilla-central’s 4%.

  • In the chart above, note that the number of mozilla-central checkins (192 or 4%) remained about the same, while the number of checkins on mozilla-inbound (1186 or 26%) continues to increase.
  • In the past, whenever we had to unwind a large backlog of pending checkins, or back out a complicated bustage, we’d keep the mozilla-central tree closed for all checkins for the duration… which blocked all landings. However, now with mozilla-inbound and fx-team as integration branches in addition to mozilla-central, developers have the option to continue landing their unrelated patches on an open integration branch while the cleanup work continues on the closed branch. In theory, if you give smart humans an easy way to route around a blockage, they’ll quickly start to use it so they can continue to get things done. In reality, very cool to see it actually happening.
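The branch percentages quoted above follow directly from the checkin counts and the monthly total of 4,508. A quick sketch of that arithmetic, with a made-up helper name:

```python
def share(branch_checkins, total_checkins):
    """Return a branch's checkins as a percentage of the monthly total."""
    return 100.0 * branch_checkins / total_checkins


# Counts quoted in this post for March 2012.
total_checkins = 4508
branch_checkins = {
    "mozilla-inbound": 1186,
    "mozilla-central": 192,
}

if __name__ == "__main__":
    for branch, count in branch_checkins.items():
        print(f"{branch}: {count} checkins, {share(count, total_checkins):.1f}% of total")
```

Only the two branches with explicit counts are included; the other branches are quoted here as rounded percentages only, so they are left out rather than guessed.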

mozilla-aurora, mozilla-beta:

  • ~2.5% of our total monthly checkins landed into mozilla-aurora, slightly down from ~3% last month.
  • We’re back down to ~1% of our total monthly checkins landing into mozilla-beta, which matches the very consistent ~1% trend we have seen since we started. Not sure why February was an anomaly, or if it was the start of a trend, so let’s watch this.

(Standard disclaimer: I’m always glad whenever we catch a problem *before* we ship a release; it avoids us having to do a chemspill release and we ship better code to our Firefox users.)

misc other details:

  • Pushes per day

  • Pushes by hour of day

Welcome Jordan Lund

Yesterday, Jordan started in Release Engineering. He is our first intern from DIT, and this makes the trip back to Ireland well worth it, even if it was flooding at the time.

Armen will be mentoring Jordan through all the ways of RelEng, as well as the MoTo office, and has a more detailed intro here. If you see Jordan in the MoTo office, or on irc as “jlund”, please say hi, but because of Jordan’s previous volunteer work, I would ask people to be considerate when talking about the trees burning.