Mitchell Baker and Firefox 27.0 beta2

Just before the holiday break, Mitchell and I sat down together to fulfill a long-standing promise I made years ago: to have Mitchell start a Firefox release herself.

After starting Mozilla just over 15 years ago, and dealing with all aspects of running a large organization, Mitchell finally kicked off a Firefox release herself last week – for the very first time. Specifically, she was going to start the official release automation for Firefox 27.0 beta2 and Fennec 27.0 beta2.

Timing was tricky. We didn’t want to disrupt the usual beta release cadence, especially just before the holidays. And Mitchell only had 25 minutes free between meetings, so we spent a few minutes saying hi, getting settled, and then we jumped right into the details.

To kick off Firefox and Fennec releases, there are only a handful of fields a human has to fill in for each product. They are (almost) all fairly self-evident, and a good number of the fields are populated by picking-from-a-list, so we made fast progress. The “Gimme a Firefox” and “Gimme a Fennec” buttons caused a laugh!
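For the curious, here is a purely hypothetical sketch of the kind of per-product inputs such a kickoff form collects before handing off to automation. The field names and validation below are illustrative only, not the actual tool's schema.

```python
# Purely illustrative: the handful of human-entered fields a release kickoff
# might collect. Field names and values here are hypothetical placeholders.

FIREFOX_BETA_KICKOFF = {
    "product": "firefox",          # picked from a list
    "version": "27.0b2",
    "build_number": 1,
    "branch": "releases/mozilla-beta",
    "revision": "abcdef123456",    # placeholder changeset id
}

def kickoff_release(form: dict) -> None:
    """Validate the handful of human-entered fields, then hand off
    to release automation (the 'Gimme a Firefox' button)."""
    required = {"product", "version", "build_number", "branch", "revision"}
    missing = required - form.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    # ... submit to the release automation queue here ...
    print(f"Starting {form['product']} {form['version']} build{form['build_number']}")

if __name__ == "__main__":
    kickoff_release(FIREFOX_BETA_KICKOFF)
```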

8 minutes is all it took.

That 8 minutes included the time to explain what each field did, what value should go into each of the various fields, and why. We even took the time to re-verify everything. After all, this was not just a “demo”… this was real. Mozilla shipped this Firefox 27.0b2 and Fennec 27.0b2 to all our real-live beta users before closing down for the holiday break.

Because it was so quick, we had spare time to chat about how much the infrastructure has improved since the Directors meeting in Building S 6.5 years ago when this promise was originally made. Obviously, there’s plenty of complexity involved in shipping a product release – the daily bug triage meetings about what fixes should/shouldn’t be included in a release, the actual landing of code fixes by developers, the manual spot-checking by QA, the press and PR coordination, the list goes on… – but the fact that such a “simple” user interface could trigger release automation running a couple of hundred compute hours across many machines to reliably ship products to millions of users is a noteworthy measure of Mozilla’s Release Engineering infrastructure. Mitchell was suitably impressed!

And then Mitchell left, with a wave and a smile, a few minutes early for her next meeting, while the various Release Engineering systems sprang into life generating builds, localization repacks, and updates for all our users.

We took this photo afterwards to commemorate the event! Thank you, Mitchell!

AOSA(vol2) and Mozilla’s Release Engineering: now in Russian!

Mozilla’s Release Engineering was part of “The Architecture of Open Source Applications (vol2)”, published in paperback in May 2012 and then as electronic downloads from Amazon and Barnes & Noble in September 2012. Earlier this week, Rail was delighted to discover that the book has now been translated into Russian. It is very cool to see this.

As best I can tell, this translation work was led by А. Панин (A. Panin), and they did a great job, even taking the time to recreate the images with embedded translated text. That is tricky, painstaking work, and it is great to see. Thanks to Mr Panin for making this happen.

You can download the entire book, or just the Mozilla RelEng portion. (As always, proceeds from book sales go to Amnesty International.)

This makes me wonder – are there any other translations, or translations-in-progress, out there?

The financial cost of a checkin (part 2)

My earlier blog post showed how much we spend per checkin using AWS “on demand” instances. Rail and catlee have been working on using much cheaper AWS spot instances (see here and here for details).

As of today, AWS spot instances are being used for 30-40% of our Linux test jobs in production. We continue to monitor closely, and ramp up more every day. Builds are still being done on on-demand instances, but even so, we’re already seeing this work reduce our AWS costs – our AWS cost per checkin is now USD$26.40 (down from $30.60), broken out as follows: USD$8.44 (down from $11.93) for Firefox builds/tests, USD$5.31 (unchanged) for Fennec builds/tests, and USD$12.65 (down from $13.36) for B2G builds/tests.
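For the curious, the breakdown above is simple arithmetic. The sketch below just re-checks the published per-checkin figures against November’s checkin count (reported later in this post); it assumes the per-checkin cost is averaged over all checkins, and it is not how our billing reports are actually generated.

```python
# Back-of-the-envelope check of the per-checkin figures above, using the
# November 2013 checkin count reported later in this post. This is just
# arithmetic on the published numbers, not actual billing data.

monthly_checkins = 7601            # November 2013 total checkins

cost_per_checkin_usd = {
    "firefox": 8.44,
    "fennec": 5.31,
    "b2g": 12.65,
}

total_per_checkin = sum(cost_per_checkin_usd.values())   # ~26.40
print(f"total per checkin: ${total_per_checkin:.2f}")

# Rough implied monthly AWS spend, *if* the per-checkin cost is averaged
# over all checkins (an assumption for illustration only):
print(f"implied monthly spend: ${total_per_checkin * monthly_checkins:,.0f}")
```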

It is worth noting that our AWS bill for November was *down*, even though our checkin load was *up*. While spot instances get more attention because they are more technically interesting, for the record only a small percentage of this cost saving came from using spot instances. Most of the cost savings in November came from the less-glamorous work of identifying and turning off unwanted tests – roughly 20% of our overall load was no longer needed.

Note:

  • A spot instance can be deleted out from under you at zero notice, killing your job-in-progress, if someone else bids more than you for that instance. We’re only seeing ~1% of jobs on spot instances being killed. We’ve changed our scheduling automation so that any spot job which is killed will automatically be re-triggered on a *non-spot* instance. While a developer might tolerate a delay/restart once, because of the significant cost savings, they would quickly become frustrated if a job was unlucky enough to be killed multiple times.
  • When you ask for an on-demand instance, it is almost instant. By contrast, when you ask for a spot instance, you have no guarantee on how quickly Amazon will provide it. We’ve seen delays of up to 20-30 minutes, all totally unpredictable. Handling delays like this requires changes to our automation logic (see the sketch after this list). That work is well in progress, but there is still work to be done.
  • This post only includes costs for AWS jobs, but we run a lot more builds+tests on inhouse machines. Cshields continues to work with mmayo to calculate TCO (Total Cost of Ownership) numbers for the different physical machines Mozilla runs in Mozilla colos. Until we have accurate numbers from IT, I’ve set those inhouse costs to $0.00. This is obviously unrealistic, but it felt better than confusing this post with inaccurate data.
  • The Amazon prices used here are “OnDemand” prices. For context, Amazon Web Services offers 4 different price brackets for each type of machine available:
    ** OnDemand Instance: The most expensive. No need to prepay. Get an instance in your requested region within a few seconds of asking. Very high reliability – out of the hundreds of instances that RelEng runs daily, we’ve only lost a few instances over the last ~18 months. Our OnDemand builders cost us $0.45 per hour, while our OnDemand testers cost us $0.12 per hour.
    ** 1 year Reserved Instance: Pay in advance for 1 year of use, get a discount from the OnDemand price. Functionally identical to OnDemand; the only change is in billing. Using 1 year Reserved Instances, our builders would cost us $0.25 per hour, while our testers would cost us $0.07 per hour.
    ** 3 year Reserved Instances: Pay in advance for 3 years of use, get a discount from the OnDemand price. Functionally identical to OnDemand; the only change is in billing. Using 3 year Reserved Instances, our builders would cost us $0.20 per hour, while our testers would cost us $0.05 per hour.
    ** Spot Instances: The cheapest. No need to prepay. Like a live auction, you bid how much you are willing to pay, and as long as you are the highest bidder, you’ll get an instance. The price varies throughout the day, depending on what demand other companies place on that AWS region. We’re using a fixed bid of $0.025, and in future it might be possible to save even more by doing trickier dynamic bidding.
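To make the spot-versus-on-demand trade-off concrete, here is a minimal sketch of the general pattern the automation needs: bid a fixed price for a spot instance, wait a bounded time for the request to be fulfilled, and fall back to on-demand if it is not. It uses the boto3 AWS client and placeholder AMI/instance-type values; it is a sketch, not our actual scheduling code.

```python
# A minimal sketch (not our production automation) of the general pattern:
# bid a fixed price for a spot instance, wait a bounded time, and fall back
# to on-demand if the request is not fulfilled. AMI id and instance type
# below are placeholders.
import time
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

LAUNCH_SPEC = {"ImageId": "ami-00000000", "InstanceType": "m1.medium"}

def get_instance(max_spot_bid="0.025", timeout_secs=1200):
    """Try a spot instance at a fixed bid; fall back to on-demand on timeout."""
    resp = ec2.request_spot_instances(
        SpotPrice=max_spot_bid,
        InstanceCount=1,
        LaunchSpecification=LAUNCH_SPEC,
    )
    request_id = resp["SpotInstanceRequests"][0]["SpotInstanceRequestId"]

    deadline = time.time() + timeout_secs
    while time.time() < deadline:
        status = ec2.describe_spot_instance_requests(
            SpotInstanceRequestIds=[request_id]
        )["SpotInstanceRequests"][0]
        if status.get("InstanceId"):           # spot request fulfilled
            return status["InstanceId"], "spot"
        time.sleep(30)

    # Give up on the spot request and pay the on-demand price instead.
    ec2.cancel_spot_instance_requests(SpotInstanceRequestIds=[request_id])
    ondemand = ec2.run_instances(MinCount=1, MaxCount=1, **LAUNCH_SPEC)
    return ondemand["Instances"][0]["InstanceId"], "on-demand"
```

A similar fallback applies when a spot job is killed mid-run: the same kind of logic re-triggers the job on a non-spot instance, as described above.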

More news as we have it.

John.

On Leaving Mozilla

tl;dr: On 18nov, I gave my notice to Brendan and Bob that I will be leaving Mozilla, and sent an email internally at Mozilla on 26nov. I’m here until 31dec2013. That’s a lot of notice, yet it feels right – it’s important to me that this is a smooth, stable transition.

After they got over the shock, the RelEng team is stepping up wonderfully. It’s great to see them all pitching in and sharing out the workload. They will do well. Obviously, at times like this, there are lots of details to transition, so please be patient and understanding with catlee, coop, hwine and bmoss. I have high confidence this transition will continue to go smoothly.

In writing this post, I realized I’ve been here 6.5 years, so thought people might find the following changes interesting:

1) How quickly can Mozilla ship a zero-day security release?
was: 4-6 weeks
now: 11 hours

2) How long to ship a “new feature” release?
was: 12-18 months
now: 12 weeks

3) How many checkins per day?
was: ~15 per day
now: 350-400 per day (peak 443 per day)

4) Mozilla hired more developers
increased number of developers x8
increased number of checkins x21
The point here being that the infrastructure improved faster than Mozilla could hire developers.

5) Mozilla added mobile+b2g:
was: desktop only
now: desktop + mobile + phoneOS – many of which ship from the *exact* same changeset

6) updated tools
was: cvs
now: hg *and* git (aside, I don’t know any other organization that ships product from two *different* source-code revision systems.)

7) Lifespan of human Release Engineers
was: 6-12 months
now: two-losses-in-6-years (3 including me)
This team stability allowed people to focus on larger, longer-term improvements – something new hires generally can’t do while learning how to keep the lights on.

This is the best infrastructure and team in the software industry that I know of – if anyone reading this knows of better, please introduce me! (Disclaimer: there’s a big difference between people who update website(s) vs people who ship software that gets installed on desktop or mobile clients… or even entire phoneOS!)

Literally, Release Engineering is a force multiplier for Mozilla – this infrastructure allows us to work with, and compete against, much bigger companies. As an organization, we now have business opportunities that were previously just not possible.

Finally, I want to say thanks:

  • I’ve been here longer than a few of my bosses. Thanks to bmoss for his counsel and support over the last couple of years.
  • Thanks to Debbie Cohen for making LEAD happen – causing organizational change is big and scary, and I know it’s impacted many of us here, including me.
  • Thanks to John Lilly and Mike Schroepfer (“schrep”) – for allowing me to prove there was another, better, way to ship software. Never mind that it hadn’t been done before. And thanks to aki, armenzg, bhearsum, catlee, coop, hwine, jhopkins, jlund, joey, jwood, kmoir, mgerva, mshal, nthomas, pmoore, rail, sbruno, for building it, even when it sounded crazy or hadn’t been done before.
  • Finally, thanks to Brendan Eich, Mitchell Baker, and Mozilla – for making the “people’s browser” a reality… putting humans first. Mozilla ships all 90+ locales, even Khmer, all OS, same code, same fixes… all at the same time… because we believe all humans are equal. It’s a living example of the “we work for mankind, not for the man” mindset here, and is something I remain super proud to have been a part of.

take care
John.

Infrastructure load for November 2013

  • We’re back to typical load again in November.
  • #checkins-per-month: We had 7,601 checkins in November 2013. This is our 2nd heaviest load on record, and is back in the expected range. For the curious, our heaviest month on record was August 2013 (7,771 checkins) and our previous 2nd heaviest month was September 2013 (7,580 checkins).

    Overall load since Jan 2009

  • #checkins-per-day: Overall load was consistently high throughout the month, with a slight dip for US Thanksgiving. In November, 18-of-30 days had over 250 checkins-per-day, 13-of-30 days had over 300 checkins-per-day, and 1-of-30 days had over 400 checkins-per-day.
    Our heaviest day had 431 checkins on 18nov; close to our single-day record of 443 checkins on 26aug2013.
  • #checkins-per-hour: Checkins are still mostly mid-day PT/afternoon ET. For 10 of every 24 hours, we sustained over 11 checkins per hour. Our heaviest load time this month was 11am-12noon PT, at 15.6 checkins-per-hour (a checkin every 3.8 minutes!) – slightly below our record of 15.73 checkins-per-hour. (See the sketch after this list for how figures like these can be derived.)
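The per-day and per-hour figures above are simple aggregations over checkin timestamps. Here is a minimal sketch of that bucketing, assuming you already have the timestamps (for example, pulled from the hg pushlog); it is not the actual reporting script.

```python
# A minimal sketch (not the actual reporting script) of how per-day and
# per-hour checkin figures can be derived from a list of checkin timestamps,
# e.g. pulled from the hg pushlog.
from collections import Counter
from datetime import datetime

def checkin_stats(timestamps: list[datetime]) -> dict:
    """Bucket checkins by calendar day and by hour-of-day."""
    per_day = Counter(ts.date() for ts in timestamps)
    per_hour = Counter(ts.hour for ts in timestamps)

    busiest_day, busiest_count = per_day.most_common(1)[0]
    days_over_300 = sum(1 for n in per_day.values() if n > 300)

    return {
        "total": len(timestamps),
        "busiest_day": (busiest_day, busiest_count),
        "days_over_300": days_over_300,
        # average checkins landing in each hour-of-day, per day in the sample
        "avg_per_hour_of_day": {h: n / len(per_day) for h, n in sorted(per_hour.items())},
    }
```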

mozilla-inbound, b2g-inbound, fx-team:

  • mozilla-inbound had 16.6% of all checkins. This continues to be heavily used as an integration branch. As developers use other *-inbound branches, the share of checkins on mozilla-inbound has declined over recent months, and is stabilizing around the mid-teens of overall usage.
  • b2g-inbound had 11.5% of all checkins. This continues to be a successful integration branch, with usage slightly increased over last month’s 10.3% and a sign that usage of this branch is also stabilizing.
  • fx-team had 6% of all checkins. This continues to be a very active third integration branch for developers. Usage is almost identical to last month, and shows that usage of this branch is also stabilizing.
  • The combined total of these 3 integration branches is 34.1%, which is slightly higher than last month yet fairly consistent. Put another way, sheriff-moderated branches consistently handle approx 1/3 of all checkins (while Try handles approx 1/2 of all checkins). The use of multiple *-inbounds is clearly helping improve bottlenecks (see pie chart below), and congestion on mozilla-inbound is being reduced significantly as people switch to using other *-inbound branches instead. Overall, this configuration reduces stress and backlog headaches for sheriffs and developers, which is good. All very cool to see working at scale like this.

    Infrastructure load by branch

mozilla-aurora, mozilla-beta, mozilla-b2g18, gaia-central:
Of our total monthly checkins:

  • 2.6% landed into mozilla-central, slightly lower than last month. As usual, most people land on sheriff-assisted branches instead of landing directly on mozilla-central.
  • 1.4% landed into mozilla-aurora, lower than last month’s abnormally high load. This is consistent with the B2G branching, which had B2G v1.2 checkins landing on mozilla-aurora and which have now moved to mozilla-b2g26_v1_2.
  • 0.9% landed into mozilla-beta, slightly higher than last month.
  • 0.0% landed into mozilla-b2g18, slightly lower than last month. This dropped to almost zero (a total of 8 checkins) as we move B2G to gecko26.
  • 3.3% landed into mozilla-b2g26_v1_2, as part of the B2G v1.2 branching involving Firefox 25. As predicted, this is significantly more than last month, and is expected to continue until we move focus to B2G v1.3 on gecko28.
  • Note: gaia-central, and all other gaia-* branches, are not counted here anymore. For details, see here.

misc other details:
As usual, our build pool handled the load well, with >95% of all builds consistently being started within 15 minutes. Our test pool is getting up to par, and we’re seeing more test jobs being handled with better response times. Trimming out obsolete builds and tests continues. As always, if you know of any test suites that no longer need to be run per-checkin, please let us know so we can immediately reduce the load a little. Also, if you know of any test suites which are perma-orange and hidden on tbpl.m.o, please let us know – those are the worst of both worlds, using up scarce CPU time *and* not being displayed for people to make use of. We’ll make sure to file bugs to get those tests fixed – or disabled. Every little bit helps put scarce test CPU to better use.