“Release Engineering as a Force Multiplier” at Google TechTalks

I was super honored to present as part of the Google TechTalks series. Specifically, Boris Debic at Google asked me to talk about the improvements we’ve made to Mozilla’s software delivery pipeline, and they recorded the entire thing – you can watch the video here:

This was the same keynote I gave last month at the Release Engineering conference, held as part of ICSE 2013, so if you missed it, click here for more context about the RelEng conference, and see the slides used.

Given the super-big scale of Google’s RelEng infrastructure, presenting to the very technical audience at Google was exciting. It is testimony to the work done by Mozilla’s RelEng that this was considered worthy of inclusion in the Google TechTalks series. The audience loved it, and the questions before/after were great.

Thanks to Boris for making this happen, and also for the great chats about Release Engineering at scale. It’s great to meet other people who are working passionately to solve software delivery pipeline problems at scale; it’s even greater when you find that you “click” and find you like each other as humans too. All very cool – thanks again, Boris.

Infrastructure load for May 2013

  • #checkins-per-month: We had 6,041 checkins in May 2013. This is down from last month’s 6,364 checkins, and down from our record of 6,433 in Mar2013.


    Overall load since Jan 2009

  • #checkins-per-day: We had 293 checkins on 21may. During May, 21-of-31 days had over 200 checkins-per-day – that’s almost every working day. Only 7-of-31 days had over 250 checkins-per-day, much lower than usual. Unusually, we never exceeded 300 checkins-per-day any day this month – first time in many months.
  • #checkins-per-hour: Checkins are still mostly mid-day PT/afternoon ET. For 5 of every 24 hours, we sustained over 10 checkins per hour. Heaviest load time this month was 1pm-2pm PT (12.5 checkins-per-hour).
  • As usual, our build pool handled the load well, with >95% of all builds consistently being started within 15mins. Our test pool continues to improve. All the hard work by RelEng, ATeam and IT is paying off: we’re seeing more test jobs being handled with better response times. The peak for May was 54,195 test jobs on 23may. Still more work to be done here, but encouraging.

    As always, if you know of any test suites that no longer need to be run per-checkin, please let us know so we can immediately reduce the load a little. Also, if you know of any test suites which are perma-orange, and hidden on tbpl.m.o, please let us know – that’s the worst of both worlds – using up scarce CPU time and not being displayed. We’ll make sure to file bugs to get tests fixed, or disabled – every little bit helps put scarce test CPU to better use.
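
As a quick sanity check on the monthly numbers above, here is a minimal Python sketch of the arithmetic. It uses only the three monthly totals quoted in these posts; the names (`MONTHLY_CHECKINS`, `percent_change`) are made up for illustration and are not part of any RelEng tooling.

```python
# Monthly checkin totals quoted in the posts (March, April, May 2013).
MONTHLY_CHECKINS = {"2013-03": 6433, "2013-04": 6364, "2013-05": 6041}


def percent_change(new: int, old: int) -> float:
    """Return the relative change from `old` to `new`, in percent."""
    return (new - old) / old * 100.0


may = MONTHLY_CHECKINS["2013-05"]
vs_april = percent_change(may, MONTHLY_CHECKINS["2013-04"])
vs_record = percent_change(may, MONTHLY_CHECKINS["2013-03"])
avg_per_day = may / 31  # May has 31 calendar days, weekends included

print(f"May vs April:        {vs_april:.1f}%")
print(f"May vs March record: {vs_record:.1f}%")
print(f"Average per calendar day: {avg_per_day:.1f} checkins")
```

The average includes weekends, which is why it sits just below the 200-per-working-day mark mentioned above.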

mozilla-inbound, birch/b2g-inbound, mozilla-central, fx-team:
Ratios of checkins across these branches remain fairly consistent. mozilla-inbound continues to be heavily used as an integration branch, dropping slightly to 24.2% of all checkins, yet still consistently far more than all other integration branches combined. mozilla-central increased slightly to 2.3% of checkins.

The “birch as b2g-inbound” experiment continued to be a great success, and with 4.2% of this month’s checkins landing here, birch has already become the 4th busiest branch (after try, mozilla-inbound and gaia-central). Birch is also helping reduce the pain of any mozilla-inbound closures, and further proving the lure of sheriff-assisted-landings to developers. Given the success of this experiment, bug#875989 tracks setting up b2g-inbound on a permanent basis.

The lure of sheriff assistance continues to be consistently popular, and as usual, very few people land directly on mozilla-central these days. The fx-team branch saw an increase to 2.1% of checkins, which I believe can be attributed to the increased help with landings from sheriffs (and sometimes gps) this month.

Infrastructure load by branch

mozilla-aurora, mozilla-beta, mozilla-b2g18, gaia-central:
Of our total monthly checkins:

  • 1.7% landed into mozilla-aurora, slightly lower than last month.
  • 1.2% landed into mozilla-beta, slightly lower than last month.
  • 1.7% landed into mozilla-b2g18, slightly higher than last month.
  • 5.9% landed into gaia-central, same as last month. gaia-central continues to be the third busiest branch overall, after try and mozilla-inbound.
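
To make these percentages a little more concrete, the sketch below converts the shares quoted in this post into approximate checkin counts, using the May total of 6,041. The counts are rounded estimates derived from the published percentages, not figures from the actual data. Grouping birch, mozilla-central and fx-team as the "other integration branches" is my reading of the section heading, and the share for try and the remaining branches is not broken out in the post, so it is left out.

```python
TOTAL_CHECKINS_MAY = 6041

# Percent of May 2013 checkins per branch, as quoted in the post.
BRANCH_SHARE_PERCENT = {
    "mozilla-inbound": 24.2,
    "gaia-central": 5.9,
    "birch/b2g-inbound": 4.2,
    "mozilla-central": 2.3,
    "fx-team": 2.1,
    "mozilla-aurora": 1.7,
    "mozilla-b2g18": 1.7,
    "mozilla-beta": 1.2,
}


def approx_checkins(share_percent: float, total: int = TOTAL_CHECKINS_MAY) -> int:
    """Estimate a checkin count from a percentage share of the monthly total."""
    return round(total * share_percent / 100)


# Print branches from busiest to quietest.
for branch, share in sorted(
    BRANCH_SHARE_PERCENT.items(), key=lambda item: item[1], reverse=True
):
    print(f"{branch:20s} {share:5.1f}%  ~{approx_checkins(share)} checkins")

# Assumed grouping: the integration branches other than mozilla-inbound.
other_integration = sum(
    BRANCH_SHARE_PERCENT[name]
    for name in ("birch/b2g-inbound", "mozilla-central", "fx-team")
)
print(f"Other integration branches combined: {other_integration:.1f}%")
```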

misc other details:

  • Pushes per day
    • You can clearly see weekends through the month. It’s worth noting that we had >200 checkins-per-day almost every working day in May. This has been true for a few months now, so it is starting to feel like 200 checkins-per-day is the new “normal” for Mozilla.

    • Pushes by hour of day
        Mid-day PT is consistently the biggest volume of checkins; this month the peak was between 1pm-2pm PT, with 12.5 checkins-per-hour.

HOWTO use an unlocked Android phone in Portugal

Here is what I used on my trip to Portugal in Jun2013, in case others find this helpful:


Disclaimer:

  • In the US, buying a cellphone “out-of-contract” is not the same as buying a cellphone “unlocked”. All of the following only works for an unlocked phone. Make sure your phone is unlocked before you get on the plane.
  • Different cellphone companies have different policies on this. AT&T declared that, despite my being a multi-year customer, with no contract, they would not unlock my phone per policy. T-Mobile said upfront that they would need ~40days from date-of-purchase of an “out-of-contract” phone before I could ask to have it unlocked. On the 40th day, when I asked T-Mobile to unlock my phone, they sent me the phone unlock codes within 48hours.
  • Make sure your phone supports GSM. Sounds obvious, but still needs to be said, as most countries use GSM.

  • Buy a “Lycamobile” pay-as-you-go SIM card. I bought mine at the train station in Lisbon, but they are also for sale at most small street-corner stores. While there are several mobile companies selling pay-as-you-go, I went with Lycamobile because they had the best price for all-you-can-use data at 4G speeds, great high-speed coverage everywhere I went, and no hassle about using your cellphone as a hotspot. Oh, and comparable prices for voice calls and text messaging.
  • Disassemble your phone to swap out the SIM card, insert the new Lycamobile SIM card and power up the phone.
  • On the phone, enter “*#123#” and press dial (typically, the green handset button). This connects you to an automated service that tells you your balance.
  • To find out what your Lycamobile phone number is, dial “*#122#”.
  • Assuming that all works, you should now attempt to call any local number. By habit, I call the mobile phone of the person at the store selling me the SIM card.
  • Cultural tip: I never set up voicemail – as discovered on my other recent trips, most people don’t bother leaving voice messages on cellphones anymore – if they can’t reach you when they phone, they hang up and send you a text message instead.
  • Once you can make/receive calls, the next step is data. To make my Android 2.2 phone transmit/receive data, I had to add the following APN settings:
    * on home screen, go into “settings”
    * go into “wireless & network settings”
    * go into “mobile networks”
    * go into “access point names”
    * if there is not already a “data.lycamobile.pt” APN, then create one as follows:
    ** Name == data.lycamobile.pt
    ** APN == data.lycamobile.pt
    ** Proxy == Not set
    ** Port == Not set
    ** Username == impt
    ** Password == impt
    ** Server == Not set
    ** MMSC == Not set
    ** MMS proxy == Not set
    ** MMS port == Not set
    ** MCC == 268
    ** MNC == 04
    ** Authentication Type == Not set
    ** APN Type == Not set
    …hit save, and go back to “Access Point Names”.

  • verify that this new “data.lycamobile.pt” APN is present, and is selected.
  • verify that “Use only 2G networks” is not selected.
  • If data still does not work, reboot the phone to see if that helps.
  • At this point you should be able to make/receive calls, send/receive text messages, surf the web, use your cellphone as GPS, and use your cellphone as a wifi hotspot.
  • To check your account balance dial “*122#”.
  • When you need additional credits, buy a one-time use scratch-refill “top up” card at almost any corner store, and follow the instructions on the back. You’ll receive a text message with the new balance when the credits are added to your account.
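
If you would rather keep the APN values above as structured data (for example, to double-check them before typing them into the phone), here is a minimal Python sketch. I entered everything by hand through the settings screens, so this is only an illustration. The attribute names (`carrier`, `apn`, `user`, `password`, `mcc`, `mnc`) are chosen to resemble the ones I believe Android’s `apns-conf.xml` uses, which is an assumption on my part, and `validate_apn` and `to_apn_xml` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Values copied from the settings list above. Fields the list shows as
# "Not set" are simply omitted.
LYCAMOBILE_PT_APN = {
    "carrier": "data.lycamobile.pt",  # the "Name" field
    "apn": "data.lycamobile.pt",
    "user": "impt",
    "password": "impt",
    "mcc": "268",
    "mnc": "04",
}


def validate_apn(settings: dict) -> list:
    """Return a list of problems found in the settings (empty if none)."""
    problems = []
    for key in ("carrier", "apn"):
        if not settings.get(key):
            problems.append(f"missing {key}")
    mcc = settings.get("mcc", "")
    if not (mcc.isdigit() and len(mcc) == 3):
        problems.append("mcc must be 3 digits")
    mnc = settings.get("mnc", "")
    if not (mnc.isdigit() and len(mnc) in (2, 3)):
        problems.append("mnc must be 2 or 3 digits")
    return problems


def to_apn_xml(settings: dict) -> str:
    """Render the settings as a single <apn .../> element (illustrative only)."""
    element = ET.Element("apn", attrib=settings)
    return ET.tostring(element, encoding="unicode")


print(validate_apn(LYCAMOBILE_PT_APN) or "settings look well-formed")
print(to_apn_xml(LYCAMOBILE_PT_APN))
```

The check only catches typos in the shape of the values (for example a two-digit MCC); it cannot tell you whether the carrier will actually accept them.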

“Release Engineering as a Force Multiplier” keynote at RelEngCon 2013

The world’s first ever Release Engineering conference was held in San Francisco on 20may2013, as part of ICSE 2013.

It was a great honor for me to be invited to give the opening keynote. This was a rare opportunity to outline some of the industry-changing RelEng-at-scale work being done at Mozilla. It also allowed me to describe some very important non-technical tactics that we used to turn around Mozilla’s situation from “company-threatening-and-human-burnout” to “company-enabling-and-human-sustainable”. All stuff that is applicable to other software companies. Presenting the first session at the first RelEng conference helped set the tone for the event, so being down-to-earth and practical felt important.

My presentation focused on how effective RelEng helped Mozilla compete against larger, better funded, companies. This is what I mean by “Release Engineering as a Force Multiplier”. To give some perspective of scale, I showed:

  • company logos scaled by headcount for each of Apple, Google, Microsoft and Mozilla. I noted that using revenue/profits instead of headcount would be just as disproportionately out of scale.

  • company logos scaled by browser market share for each of Apple, Google, Microsoft and Mozilla, based on publicly available market share data

(The full set of slides is available here, or by clicking on either of the two thumbnails above. If you want the original 25MB keynote file, let me know.)

Anyone who has ever talked with me about RelEng knows I feel very strongly that:

  • Release Engineering is important to the success of every software project. Writing a popular v1.0 product is just the first step. If you want to retain your initial users by shipping v1.0.1 fixes, or grow your user base by shipping new v2.0 features to your existing users, you need a reproducible pipeline for accurately delivering software. Smaller projects may not have someone with a formal RelEng title, but there’s always someone doing RelEng work. Otherwise, your product, and your organization, will not survive.
  • Building an effective RelEng pipeline requires a different mindset to writing a great software product. Both are non-trivial, and understanding this difference in mindset enables you to hire wisely.
  • The importance of Release Engineering for the sustainability of a software company is only beginning to be recognized. Release Engineering is not taught as a discipline in any CompSci course that I know of. This has the unfortunate side effect that most Release Engineers only learn on-the-job, usually as a side-effect of helping out during an organizational emergency. This limits the effectiveness of Release Engineering to what can be learned on the job, and because information isn’t widely shared, it’s hard to learn from other people’s mistakes or successes. By contrast, when I was studying CompSci for my undergrad and postgrad degrees, there were all sorts of course modules, books and published papers detailing which sorting algorithm was best suited for which types of data, which graphics algorithms were best suited for which types of images, etc… but nothing about Release Engineering. Nothing about how to build a pipeline to deliver that software to users… even though every software company lives-or-dies by the efficiency of their delivery-pipeline.

This conference was a great start to helping raise the understanding of the industry on these three points, and many more.

Big hat-tip to the organizers (Bram, Christian, Foutse, Kim) – they did an awesome job putting together a unique and special event. Other industry speakers included Google, LinkedIn, Microsoft Research, Netflix and Puppet Labs, as well as a wide range of academic speakers – see full program details here and here. In the past, I’ve attended plenty of industry conferences which have a sales-touch to them (“our technology is great, you should buy our stuff” or “our technology is great, you should come work here”), or academic conferences which have academic-accuracy-but-can-lack-urgent-practicality. This conference was very different. All the speakers spoke candidly, humbly and objectively about the technology and tactical successes that worked for them at their companies, and equally candidly about their failures as “teaching moments” for everyone to learn from. The level of raw honesty from all these speakers, across all these different companies and academia, made for a very collaborative atmosphere. Additionally, the sheer volume, and the quality, of attendees blew everyone away… the discussions during breaks were just electric. Very very refreshing. I don’t know if this magic was because of the carefully chosen mix of industry-vs-academia speakers, or if this was because the organizers were also a mix of industry and academia, but regardless – the end result was simply fantastic.

There are already plans underway to have this RelEngConference again next year… if you build software delivery pipelines for your company, or if you work in a software company that has software delivery needs, I recommend you follow @relengcon and plan on attending next year. You’ll be very glad you did!

Infrastructure load for April 2013

  • #checkins-per-month: We had 6,364 checkins in April 2013. This is only slightly below our record of 6,433 in Mar2013. Every working day was consistently busy (>200 checkins per working day) and load-per-day was busy across longer periods of each day.

  • #checkins-per-day: On 09apr, we had 311 checkins – our second-busiest day on record (the record remains 323 checkins-per-day on 18mar2013). During April, 22-of-30 days had over 200 checkins-per-day – that’s every working day. 12-of-30 days had over 250 checkins-per-day (2-of-30 days had over 300 checkins-per-day!).
  • #checkins-per-hour: Checkins are still mostly mid-day PT/afternoon ET, but the load has increased across the day. For 11 of every 24 hours, we sustained over 10 checkins per hour. Heaviest load time this month was 11am-noon PT (12.77 checkins-per-hour). It’s interesting to note we had an atypical spike in load at 5am – possibly from CET-based contributors or the B2G workweek.
  • As usual, our build pool handled the load well, with >95% of all builds consistently being started within 15mins. Our test pool is handling this load much better too. In an encouraging sign that all the hard work by RelEng, ATeam and IT is paying off, we’re seeing more test jobs being handled with better response times… The peak for April was 52,118 test jobs on 23apr – our first time handling over 50,000 test jobs in a 24-hour-day. Still more work to be done here, but very encouraging progress.

    As always, if you know of any test suites that no longer need to be run per-checkin, please let us know so we can immediately reduce the load a little. Also, if you know of any test suites which are perma-orange, and hidden on tbpl.m.o, please let us know – that’s the worst of both worlds – using up scarce CPU time and not being displayed. We’ll make sure to file bugs to get tests fixed, or disabled – every little bit helps put scarce test CPU to better use.
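
For anyone who wants to compare how busy the days were in April against the May 2013 post, here is a small Python sketch that turns the “N-of-M days” counts from both posts into percentages. It uses only the counts quoted in those posts; the structure and names (`DAYS_OVER_THRESHOLD`, `share_of_days`) are invented for illustration.

```python
# Number of days above each checkins-per-day threshold, as quoted in the
# April 2013 and May 2013 posts.
DAYS_OVER_THRESHOLD = {
    "April 2013": {"days_in_month": 30, "over": {200: 22, 250: 12, 300: 2}},
    "May 2013": {"days_in_month": 31, "over": {200: 21, 250: 7, 300: 0}},
}


def share_of_days(count: int, days_in_month: int) -> float:
    """Return the percentage of calendar days that exceeded a threshold."""
    return count / days_in_month * 100.0


for month, data in DAYS_OVER_THRESHOLD.items():
    for threshold, count in data["over"].items():
        pct = share_of_days(count, data["days_in_month"])
        print(f"{month}: {pct:5.1f}% of days had over {threshold} checkins")
```

These are shares of calendar days, so weekends pull every percentage down.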

mozilla-inbound, mozilla-central, fx-team:
Ratios of checkins across these branches remain fairly consistent. mozilla-inbound continues to be heavily used as an integration branch, with 26.1% of all checkins, consistently far more than the other integration branches combined.

In mid-April, we started using birch as a b2g-inbound for the B2G workweek. This experiment was a great success, and has been continued. 1.7% of checkins landed on birch, which is impressive considering it was only in use for half the month!

As usual, fx-team has ~1% of checkins, and mozilla-central has 1.8% of checkins.

The lure of sheriff assistance on mozilla-inbound (and now birch/b2g-inbound) continues to be consistently popular, and as usual, very few people land directly on mozilla-central these days.

mozilla-aurora, mozilla-beta, mozilla-b2g18, gaia-central:
Of our total monthly checkins:

  • 2.0% landed into mozilla-aurora, very similar to last month.
  • 1.4% landed into mozilla-beta, very similar to last month.
  • 2.1% landed into mozilla-b2g18, slightly higher than last month.
  • 5.9% landed into gaia-central, slightly higher than last month. gaia-central continues to be the third busiest branch overall, after try and mozilla-inbound. Obviously, these checkins are *only* for the B2G releases, so worth calling out here.

misc other details:

  • Pushes per day
    • You can clearly see weekends through the month. It’s worth noting that we had >200 checkins-per-day every working day in April.

    • Pushes by hour of day
        Mid-morning PT is consistently the biggest volume of checkins, although this month the checkin load stayed high throughout the entire PT working day, and particularly spiked between 11am-noon PT, with 12.77 checkins-per-hour. I found the spike in activity at 5am to be interesting, as it is unusual. My theory is that this shows an increase in checkins from people in Europe / CET timezone; partially explained by the B2G workweek in Madrid.