In case you missed the announcements, RelEngConf 2014 is now officially open for registrations. This follows the inaugural and wildly successful Release Engineering conference, held in San Francisco on 20may2013 as part of ICSE 2013. More background here.
Last year’s event was great. The mixture of attendees and speakers, from academia and battle-hardened industry, made for some riveting topics. So I already had high expectations for this year… no pressure on the organizers! Then I heard this year’s will be held at Google HQ in Mountain View, and will feature opening keynotes from Chuck Rossi (RelEng, Facebook, click for linkedin profile) and Dinah McNutt (RelEng, Google, click for linkedin profile). Looks like RelEngConf 2014 is already shaping up to be special too.
If you build software delivery pipelines for your company, or if you work in a software company that has software delivery needs, I recommend you follow @relengcon, block off April 11th, 2014 on your calendar and book now. It will be well worth your time.
See you there!
(Context: In case people missed this transition, my last day at Mozilla was Dec31, so obviously, I’m not going to be doing these monthly infrastructure load posts anymore. I started this series of posts in Jan2009, because the data and analysis gave important context for everyone in Mozilla engineering to step back and sanity-check the scale, usage patterns and overall health of Mozilla’s developer infrastructure. The data in these posts have shaped conversations and strategy within Mozilla over the years, so it is important that they continue. I want to thank Armen for eagerly taking over this role from me during my transition out of Mozilla. Those of you who know Armen know that he’ll do this exceedingly well, in his own inimitable style, and I’m super happy he’s taken this on. I’ve already said this to Armen privately over the last few months of transition details, but am repeating here publicly for the record – thank you, Armen, for taking on the responsibility of this blog-post-series.)
December saw a big drop in overall load – 6,063 checkins is our lowest load in almost half a year. However, this is no surprise, given that all Mozilla employees were offline for 10-14 of the 31 days – basically a third of the month. At the rate people were doing checkins for the first two-thirds of the month, December 2013 was on track to be our first month ever over 8,000 checkins-per-month.
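As a rough sanity check of that "on track for 8,000" claim, here is a back-of-the-envelope extrapolation. This is just a sketch – it assumes the 6,063 checkins all landed at an even pace over roughly the first 21 days, before the holiday shutdown:

```python
# Back-of-the-envelope extrapolation: if December's 6,063 checkins
# landed in the first ~21 days (before the holiday shutdown), what
# would a full 31-day month at that pace look like?
checkins = 6063
active_days = 21  # assumption: roughly two-thirds of the 31-day month

daily_rate = checkins / active_days
projected = daily_rate * 31
print(round(projected))  # comfortably over the 8,000 mark
```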
January saw people jump straight back into work at full speed. 7,710 checkins is our second heaviest load on record (slightly behind the current record of 7,771 checkins in August 2013).
Those are my quick highlights. For more details, you should go read Armen’s post for Dec2013 and post for Jan2014 yourself. He has changed the format a little, but the graphs, data and analysis are all there. And hey, Armen even makes the raw data available in html and json formats, so now you can generate your own reports and graphs if interested. A very nice touch, Armen.
John (still cheering from the sidelines).
(My life has been hectic on several other fronts, so I only just now noticed that I never actually published this blog post. Sorry!!)
On 07-nov-2013, I was invited to present “We are all remoties” in Twilio’s headquarters here in San Francisco as part of their in-house tech talk series.
For context, it’s worth noting that Twilio is doing great as a company, which means they are hiring. They are also outgrowing their current space, so one option they were investigating was to keep the current space and open a second office elsewhere in the Bay Area. As they’d always been used to working in one location, this “split into two offices” was top of everyone’s mind… hence the invitation from Thomas to give this company-wide talk about remoties.
Twilio’s entire office is a large, SOMA-style-warehouse-converted-into-open-plan-offices layout, packed with lots of people. The area I was to present in was their big “common area”, where they typically host company all-hands meetings, Friday socials and other big company-wide events. Quite, quite large. I’ve no idea how many people were there, but it felt huge, and was wall-to-wall packed. The size gave an echo-y audio effect off the super-high concrete ceilings and far-distant bare concrete walls, with an awkward pair of structural pillars right in the middle of the room. Despite my best intentions, during the session I found myself trying to “peer around” the pillars, aware of the people blocked from view.
It’s great to see the response from folks when slides in a presentation *exactly* hit what is top-of-mind for them. One section, about companies moving to multiple locations, clearly hit home with everyone… not too surprising, given the context. Another section, about a trusted employee moving out of the office to become a 100% remote employee, hit a very personal note – someone in the 2nd row, a long-trusted employee, was actually about to embark on this exact change. He got quite the attention from everyone around him, and we stopped everything for a few minutes to talk about his exact situation. As far as I can tell, he found the entire session very helpful, but only time will tell how things work out for him.
The great interactions, the lively Q&A, and the crowd of questions afterwards were all lots of fun and quite informative.
Big thanks to Thomas Wilsher @ Twilio for putting it all together. I found it a great experience, and the lively discussions before, during and after led me to believe others did too.
PS: For a PDF copy of the presentation, click on the smiley faces! For the sake of my poor blogsite, the much, much, larger keynote file is available on request.
Just before the holiday break, Mitchell and I sat down together to fulfill a long standing promise I made years ago: to have Mitchell start a Firefox release herself.
After starting Mozilla just over 15 years ago, and dealing with all aspects of running a large organization, Mitchell finally kicked off a Firefox release herself last week – for the very first time. Specifically, she started the official release automation for Firefox 27.0 beta2 and Fennec 27.0 beta2.
Timing was tricky. We didn’t want to disrupt the usual beta release cadence, especially just before the holidays. And Mitchell only had 25 minutes free between meetings, so we spent a few minutes saying hi, getting settled, and then we jumped right into the details.
To kick off Firefox and Fennec releases, there are only a handful of fields a human has to fill in for each product. They are (almost) all fairly self-evident, and a good number of the fields are populated by picking-from-a-list, so we made fast progress. The “Gimme a Firefox” and “Gimme a Fennec” buttons caused a laugh!
8 minutes is all it took.
That 8 minutes included the time to explain what each field did, what value should go into each of the various fields, and why. We even took the time to re-verify everything. After all, this was not just a “demo”… this was real. Mozilla shipped this Firefox 27.0b2 and Fennec 27.0b2 to all our real-live beta users before closing down for the holiday break.
Because it was so quick, we had spare time to chat about how much the infrastructure has improved since the Directors meeting in Building S 6.5 years ago when this promise was originally made. Obviously, there’s plenty of complexity involved in shipping a product release – the daily bug triage meetings about which fixes should/shouldn’t be included in a release, the actual landing of code fixes by developers, the manual spot-checking by QA, the press and PR coordination, the list goes on… – but the fact that such a “simple” user interface can trigger release automation running a couple of hundred compute hours across many machines to reliably ship products to millions of users is a noteworthy measure of Mozilla’s Release Engineering infrastructure. Mitchell was suitably impressed!
And then Mitchell left, with a wave and a smile, a few minutes early for her next meeting, while the various Release Engineering systems sprang into life generating builds, localization repacks, and updates for all our users.
We took this photo afterwards to commemorate the event! Thank you, Mitchell!
Mozilla’s Release Engineering was part of “The Architecture of Open Source Applications (vol2)” published in paperback in May2012 and then as electronic downloads from Amazon and Barnes&Noble in Sept2012. Earlier this week, Rail was delighted to discover that the book has now been translated into Russian. It is very cool to see this.
As best I can tell, this translation work was led by А. Панин (A. Panin), and they did a great job – even taking the time to recreate the images with embedded translated text. Tricky, painstaking work, and very cool to see. Thanks to Mr Panin for making this happen.
You can download the entire book, or just the Mozilla RelEng portion. (As always, proceeds from book sales go to Amnesty International.)
This makes me wonder – are there any other translations, or translations-in-progress, out there?
My earlier blog post shows how much we spend per checkin using AWS “on demand” instances. Rail and catlee have been working on using much cheaper AWS spot instances (see here and here for details).
As of today, AWS spot instances are being used for 30-40% of our linux test jobs in production. We continue to monitor closely, and ramp up more every day. Builds are still being done on on-demand instances, but even so, we’re already seeing this work reduce our costs on AWS – our AWS cost per checkin is now USD$26.40 (down from $30.60), broken out as follows: USD$8.44 (down from $11.93) for Firefox builds/tests, USD$5.31 (unchanged) for Fennec builds/tests, and USD$12.65 (down from $13.36) for B2G builds/tests.
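For anyone who wants to double-check that breakdown, the per-checkin total is just the sum of the three product lines – a quick sketch using the figures from this post:

```python
# Per-checkin AWS cost breakdown from this post (USD).
firefox = 8.44   # Firefox builds/tests (down from $11.93)
fennec  = 5.31   # Fennec builds/tests (unchanged)
b2g     = 12.65  # B2G builds/tests (down from $13.36)

total = firefox + fennec + b2g
print(f"${total:.2f} per checkin")  # $26.40 per checkin
```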
It is worth noting that our AWS bill for November was *down*, even though our checkin load was *up*. Spot instances get more attention because they are more technically interesting, but for the record, only a small percentage of this cost saving came from using spot instances. Most of the cost savings in November came from the less-glamorous work of identifying and turning off unwanted tests – ~20% of our overall load was no longer needed.
- A spot instance can be deleted out from under you with zero notice, killing your job-in-progress, if someone else bids more than you for that instance. We’re only seeing ~1% of jobs on spot instances being killed. We’ve changed our scheduling automation so that any spot job which is killed is automatically re-triggered on a *non-spot* instance. While a developer might tolerate a delay/restart once, because of the significant cost savings, they would quickly become frustrated if a job was unlucky enough to be killed multiple times.
- When you ask for an on-demand instance, it is almost instant. By contrast, when you ask for a spot instance, you have no guarantee on how quickly Amazon will provide the instance. We’ve seen delays of up to 20-30 minutes, all totally unpredictable. Handling delays like this requires changes to our automation logic. This work is well in progress, but there is still work to be done.
- This post only includes costs for AWS jobs, but we run a lot more builds+tests on inhouse machines. Cshields continues to work with mmayo to calculate TCO (Total Cost of Ownership) numbers for the different physical machines Mozilla runs in Mozilla colos. Until we have accurate numbers from IT, I’ve set those inhouse costs to $0.00. This is obviously unrealistic, but it felt better than confusing this post with inaccurate data.
- The Amazon prices used here are “OnDemand” prices. For context, Amazon Web Services has four different price brackets for each type of machine:
** OnDemand Instance: The most expensive. No need to prepay. Get an instance in your requested region, within a few seconds of asking. Very high reliability – out of the hundreds of instances that RelEng runs daily, we’ve only lost a few instances over the last ~18months. Our OnDemand builders cost us $0.45 per hour, while our OnDemand testers cost us $0.12 per hour.
** 1 year Reserved Instance: Pay in advance for 1 year of use, get a discount from the OnDemand price. Functionally totally identical to OnDemand; the only change is in billing. Using 1 year Reserved Instances, our builders would cost us $0.25 per hour, while our testers would cost us $0.07 per hour.
** 3 year Reserved Instances: Pay in advance for 3 years of use, get a bigger discount from the OnDemand price. Functionally totally identical to OnDemand; the only change is in billing. Using 3 year Reserved Instances, our builders would cost us $0.20 per hour, while our testers would cost us $0.05 per hour.
** Spot Instances: The cheapest. No need to prepay. Like a live auction, you bid how much you are willing to pay, and so long as you are the highest bidder, you’ll get an instance. This price varies throughout the day, depending on the demand other companies place on that AWS region. We’re using a fixed bid of $0.025, and in future, it might be possible to save even more by doing trickier dynamic bidding.
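The fallback policy described earlier (any killed spot job is automatically re-triggered on a non-spot instance) can be sketched roughly like this. To be clear, `run_on_spot` and `run_on_demand` are hypothetical stand-ins, not our actual scheduling code – the point is the policy logic:

```python
# Sketch of the spot-with-on-demand-fallback scheduling policy.
# run_on_spot / run_on_demand are hypothetical stand-ins for the
# real automation.

class SpotInstanceKilled(Exception):
    """Raised when AWS reclaims a spot instance mid-job."""

def run_job(job, run_on_spot, run_on_demand):
    """Try the cheap spot pool first; if the spot instance is killed
    out from under us (~1% of jobs), re-trigger once on an on-demand
    instance so no job is ever killed twice."""
    try:
        return run_on_spot(job)
    except SpotInstanceKilled:
        # A second kill would frustrate developers, so fall back to
        # the reliable (more expensive) on-demand pool.
        return run_on_demand(job)
```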
More news as we have it.
tl;dr: On 18nov, I gave my notice to Brendan and Bob that I will be leaving Mozilla, and sent an email internally at Mozilla on 26nov. I’m here until 31dec2013. That’s a lot of notice, yet it feels right – it’s important to me that this is a smooth, stable transition.
After they got over the shock, the RelEng team is stepping up wonderfully. It’s great to see them all pitching in and sharing out the workload. They will do well. Obviously, at times like this, there are lots of details to transition, so please be patient and understanding with catlee, coop, hwine and bmoss. I have high confidence this transition will continue to go smoothly.
In writing this post, I realized I’ve been here 6.5 years, so thought people might find the following changes interesting:
1) How quickly can Mozilla ship a zero-day security release?
was: 4-6 weeks
now: 11 hours
2) How long to ship a “new feature” release?
was: 12-18 months
now: 12 weeks
3) How many checkins per day?
was: ~15 per day
now: 350-400 per day (peak 443 per day)
4) Mozilla hired more developers
increased number of developers x8
increased number of checkins x21
The point here being that the infrastructure improved faster than Mozilla could hire developers.
5) Mozilla added mobile+b2g:
was: desktop only
now: desktop + mobile + phoneOS – many of which ship from the *exact* same changeset
6) updated tools
now: hg *and* git (aside: I don’t know of any other organization that ships product from two *different* source-code revision systems.)
7) Lifespan of human Release Engineers
was 6-12 months
now: two-losses-in-6-years (3 including me)
This team stability allowed people to focus on larger, longer-term improvements – something new hires generally can’t do while learning how to keep the lights on.
This is the best infrastructure and team in the software industry that I know of – if anyone reading this knows of better, please introduce me! (Disclaimer: there’s a big difference between people who update website(s) vs people who ship software that gets installed on desktop or mobile clients… or even entire phoneOS!)
Release Engineering is a force multiplier for Mozilla – this infrastructure allows us to work with, and compete against, much bigger companies. As an organization, we now have business opportunities that were previously just not possible.
Finally, I want to say thanks:
- I’ve been here longer than a few of my bosses. Thanks to bmoss for his counsel and support over the last couple of years.
- Thanks to Debbie Cohen for making LEAD happen – causing organizational change is big and scary, and I know it’s impacted many of us here, including me.
- Thanks to John Lilly and Mike Schroepfer (“schrep”) – for allowing me to prove there was another, better, way to ship software. Never mind that it hadn’t been done before. And thanks to aki, armenzg, bhearsum, catlee, coop, hwine, jhopkins, jlund, joey, jwood, kmoir, mgerva, mshal, nthomas, pmoore, rail, sbruno, for building it, even when it sounded crazy or hadn’t been done before.
- Finally, thanks to Brendan Eich, Mitchell Baker, and Mozilla – for making the “people’s browser” a reality… putting humans first. Mozilla ships all 90+ locales, even Khmer, all OS, same code, same fixes… all at the same time… because we believe all humans are equal. It’s a living example of the “we work for mankind, not for the man” mindset here, and is something I remain super proud to have been a part of.
- We’re back to typical load again in November.
- #checkins-per-month: We had 7,601 checkins in November 2013. This is our 2nd heaviest load on record, and is back at expected range. For the curious, our heaviest month on record was in August 2013 (7,771 checkins) and our previous 2nd heaviest month was September2013 (7,580 checkins).
- #checkins-per-day: Overall load was consistently high throughout the month, with a slight dip for US Thanksgiving. In November, 18-of-30 days had over 250 checkins-per-day, 13-of-30 days had over 300 checkins-per-day, and 1-of-30 days had over 400 checkins-per-day. Our heaviest day had 431 checkins on 18nov, close to our single-day record of 443 checkins on 26aug2013.
- #checkins-per-hour: Checkins are still mostly mid-day PT/afternoon ET. For 10 of every 24 hours, we sustained over 11 checkins per hour. Our heaviest load time this month was 11am-12noon PT, with 15.6 checkins-per-hour (a checkin every 3.8 minutes!) – slightly below our record of 15.73 checkins-per-hour.
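(For the curious, the interval figure above is just unit conversion – a quick sketch:)

```python
# Convert checkins-per-hour into minutes between checkins.
peak_rate = 15.6  # checkins per hour (11am-12noon PT this month)
minutes_between = 60 / peak_rate
print(round(minutes_between, 1))  # 3.8 minutes between checkins
```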
mozilla-inbound, b2g-inbound, fx-team:
- mozilla-inbound had 16.6% of all checkins. This continues to be heavily used as an integration branch. As developers use other *-inbound branches, the use of mozilla-inbound has declined in recent months, and is stabilizing around the mid-teens of overall usage.
- b2g-inbound had 11.5% of all checkins. This continues to be a successful integration branch, with usage slightly increased over last month’s 10.3% and a sign that usage of this branch is also stabilizing.
- fx-team had 6% of all checkins. This continues to be a very active third integration branch for developers. Usage is almost identical to last month, and shows that usage of this branch is also stabilizing.
- The combined total of these 3 integration branches is 34.1%, which is slightly higher than last month yet fairly consistent. Put another way, sheriff-moderated branches consistently handle approx 1/3 of all checkins (while Try handles approx 1/2 of all checkins). The use of multiple *-inbounds is clearly helping improve bottlenecks (see pie chart below), and congestion on mozilla-inbound is being reduced significantly as people switch to using other *-inbound branches instead. Overall, this configuration reduces stress and backlog headaches for sheriffs and developers, which is good. All very cool to see working at scale like this.
mozilla-aurora, mozilla-beta, mozilla-b2g18, gaia-central:
Of our total monthly checkins:
- 2.6% landed into mozilla-central, slightly lower than last month. As usual, most people land on sheriff-assisted branches instead of landing directly on mozilla-central.
- 1.4% landed into mozilla-aurora, lower than last month’s abnormally high load. This is consistent with the B2G branching, which had B2G v1.2 checkins landing on mozilla-aurora, and has now moved to mozilla-b2g26_v1_2.
- 0.9% landed into mozilla-beta, slightly higher than last month.
- 0.0% landed into mozilla-b2g18, slightly lower than last month. This dropped to almost zero (a total of 8 checkins) as we move B2G to gecko26.
- 3.3% landed into mozilla-b2g26_v1_2, as part of the B2G v1.2 branching involving Firefox 25. As predicted, this is significantly more than last month, and is expected to continue until we move focus to B2G v1.3 on gecko28.
- Note: gaia-central, and all other gaia-* branches, are not counted here anymore. For details, see here.
misc other details:
As usual, our build pool handled the load well, with >95% of all builds consistently being started within 15mins. Our test pool is getting up to par and we’re seeing more test jobs being handled with better response times. Trimming out obsolete builds and tests continues. As always, if you know of any test suites that no longer need to be run per-checkin, please let us know so we can immediately reduce the load a little. Also, if you know of any test suites which are perma-orange, and hidden on tbpl.m.o, please let us know – those are the worst of both worlds – using up scarce CPU time *and* not being displayed for people to make use of. We’ll make sure to file bugs to get tests fixed – or disabled – every little bit helps put scarce test CPU to better use.
Last week, 18-22 November, RelEng gathered in Boston. As usual for these work weeks, it was jam-packed: there was group planning, and lots of group sprints – coop took on the task of blogging with details for each specific day (Mon, Tue, Wed, Thu, Fri). The meetings with Bocoup were a happy, unplanned surprise.
Given the very distributed nature of the group, and the high-stress nature of the job, a big part of the week is making sure we maintain our group cohesion so we can work well together under pressure after we return to our respective homes. When all together in person, the trust, respect and love for each other are self-evident, and something I’m truly in awe of. I don’t know how else to describe this except as “magic” – this is super important to me, and something I’m honored to be a part of.
Every gathering needs a group photo, and these are never first-shot-good-enough-ship-it, so while aki was taking a group photo, Massimo quietly set up his GoPro to timelapse the fun.
This is Mozilla’s Release Engineering group – aki, armenzg, bhearsum, callek, catlee, coop, hwine, joey, jhopkins, jlund, joduinn, kmoir, mgerva, mshal, nthomas, pmoore, simone, rail. All proudly wearing our “Ship it” shirts.
Every RelEng work week is always an exhausting hectic week, and yet, at the end of each week, as we are saying our goodbyes and heading for various planes/cars/homes, I find myself missing everyone deeply and feeling so so so proud of them all.
tl;dr: In order to improve our osx10.6 test capacity and to quickly start osx10.9 testing, we’re planning to make the following changes to our OSX build-and-test infrastructure.
1) convert all 10.7 test machines into 10.6 test machines in order to increase our 10.6 capacity. Details in bug#942299.
2) convert all 10.8 test machines into 10.9 test machines.
3) do most 10.7 builds as osx-cross-compiling-on-linux-on-AWS, and repurpose the 10.7 builder machines as additional 10.9 test machines. This cross-compiler work is ongoing; it will take time to complete and to transition into production, hence it is listed last. The curious can follow bug#921040.
Each of these items is a large stand-alone project involving the same people across multiple groups, so we’ll roll each out in the aforementioned sequence.
1) Removing specific versions of an OS from our continuous integration systems based on vendor support and/or usage data is not a new policy. We have done this several times in the past. For example, we have dropped WinXPsp0/sp1/sp2 for WinXPsp3; dropped WinVista for Win7; dropped Win7 x64 for Win8 x64; and soon we will drop Win8.0 for Win8.1; …
** Note for the record that this does *NOT* mean that Mozilla is dropping support for osx10.7 or 10.8; it just means we think *automated* testing on 10.6,10.9 is more beneficial.
2) To see Firefox’s minimum OS requirements see: https://www.mozilla.org/en-US/firefox/25.0.1/system-requirements
3) Apple is offering osx10.9 as a free upgrade to all users of osx10.7 and osx10.8. Also, note that 10.9 runs on any machine that can run 10.7 or 10.8. Because the osx10.9 release is a free upgrade, users are quickly upgrading. We are seeing a drop in both 10.7 and 10.8 users and in just a month since the 10.9 release, we already have more 10.9 users than 10.8 users.
4) Distribution of Firefox users from the most to the least (data from 15-nov-2013):
10.6 – 34%
10.7 – 23% – slightly decreasing
10.8 – 21% – notably decreasing
10.9 – 21% – notably increasing
more info: http://armenzg.blogspot.ca/2013/11/re-thinking-our-mac-os-x-continuous.html
5) Apple is no longer providing security updates for 10.7; any user looking for OS security updates will need to upgrade to 10.9. Because osx10.9 is a free upgrade for 10.8 users, we expect 10.8 to be in a similar situation soon.
6) If a developer lands a patch that works on 10.9, but it fails somehow on 10.7 or 10.8, it is unlikely that we would back out the fix; we would instead tell users to upgrade to 10.9 anyway, for the security fixes.
7) It is no longer possible to buy any more 10.6 machines (known as revision 4 minis), as they are long desupported. Recycling 10.7 test machines means that we can continue to support osx10.6 testing at scale without needing to buy and rack new hardware and recalibrate test and performance results.
8) Like all other large OS changes, this change would ride the trains. Most 10.7 and 10.8 test machines would be reimaged when we make these changes live on mozilla-central and try, while we’d leave a few behind. The few remaining would be reimaged at each 6-week train migration.
If we move quickly, this reimaging work can be done by IT before they all get busy with the 650-Castro -> Evelyn move.
For further details, see armen’s blog http://armenzg.blogspot.ca/2013/11/re-thinking-our-mac-os-x-continuous.html. To make sure this is not missed, I’ve cross-posted this to dev.planning, dev.platform and also this blog. If you know of anything we have missed, please reply in the dev.planning thread.
[UPDATED 29-nov-2013 with link to bug#942299, as the 10.7->10.6 portion of this work just completed.]