(“Panorama” is the very-serious-current-affairs program of the British Broadcasting Corporation, and has been running continuously since 1953, making it the longest running current affairs program in the world.)
On 1st April, 1957, Panorama ended its show with a brief ~3 minute segment on the early harvest of the spaghetti trees along the Swiss-Italian border.
It is believed to be one of the first times an April Fools’ joke was played on television viewers, and it caused quite the stir at the time. It is excellently put together, with great attention to detail and a script echoing an earlier segment about the French wine harvest; I found it a great, fun 3 minute watch.
More details on Wikipedia and The BBC.
In case you missed the announcements, RelEngConf 2014 is officially now open for registrations. This follows the inaugural and wildly successful Release Engineering conference, held in San Francisco on 20may2013, as part of ICSE 2013. More background here.
Last year’s event was great. The mixture of attendees and speakers, from academia and battle-hardened industry, made for some riveting topics. So I already had high expectations for this year… no pressure on the organizers! Then I heard this year’s event will be held at Google HQ in Mountain View, and will feature opening keynotes from Chuck Rossi (RelEng, Facebook) and Dinah McNutt (RelEng, Google). Looks like RelEngConf 2014 is already lining up to be special too.
If you build software delivery pipelines for your company, or if you work in a software company that has software delivery needs, I recommend you follow @relengcon, block off April 11th, 2014 on your calendar and book now. It will be well worth your time.
See you there!
(Context: In case people missed this transition, my last day at Mozilla was Dec31, so obviously, I’m not going to be doing these monthly infrastructure load posts anymore. I started this series of posts in Jan2009, because the data, and analysis, gave important context for everyone in Mozilla engineering to step back and sanity-check the scale, usage patterns and overall health of Mozilla’s developer infrastructure. The data in these posts have shaped conversations and strategy within Mozilla over the years, so are important to continue. I want to give thanks to Armen for eagerly taking over this role from me during my transition out of Mozilla. Those of you who know Armen know that he’ll do this exceedingly well, in his own inimitable style, and I’m super happy he’s taken this on. I’ve already said this to Armen privately over the last few months of transition details, but am repeating here publicly for the record – thank you, Armen, for taking on the responsibility of this blog-post-series.)
December saw a big drop in overall load – 6,063 is our lowest load in almost half-a-year. However, this is no surprise given that all Mozilla employees were offline for 10-14 days out of the 31 days – basically 1/3rd of the month. At the rate people were doing checkins for the first 2/3rds of the month, December2013 was on track to be our first month ever over 8,000 checkins-per-month.
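For anyone who wants to sanity-check that "on track" claim, here is a minimal sketch of the arithmetic. It assumes (my simplification, not something the raw data confirms) that all 6,063 checkins landed in the working 2/3rds of the month and that the same rate would have held for the rest of it. The function name is just for illustration.

```python
def project_full_month(checkins_so_far: int, fraction_of_month_worked: float) -> float:
    """Scale an observed checkin count up to a full month at the same rate."""
    if not 0 < fraction_of_month_worked <= 1:
        raise ValueError("fraction_of_month_worked must be in (0, 1]")
    return checkins_so_far / fraction_of_month_worked


# Assumption: the 6,063 December checkins all happened in ~2/3rds of the month.
december_checkins = 6063
projected = project_full_month(december_checkins, 2 / 3)
print(f"Projected full-month load: roughly {projected:,.0f} checkins")
```

Even with a more conservative estimate of the time off, the projection stays comfortably above 8,000.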
January saw people jump straight back into work full speed. 7,710 is our second heaviest load on record (slightly behind the current record 7,771 checkins in August2013).
Those are my quick highlights. For more details, you should go read Armen’s post for Dec2013 and post for Jan2014 yourself. He has changed the format a little, but the graphs, data and analysis are all there. And hey, Armen even makes the raw data available in html and json formats, so now you can generate your own reports and graphs if interested. A very nice touch, Armen.
John (still cheering from the sidelines).
(My life has been hectic on several other fronts, so I only just now noticed that I never actually published this blog post. Sorry!!)
On 07-nov-2013, I was invited to present “We are all remoties” in Twilio’s headquarters here in San Francisco as part of their in-house tech talk series.
For context, it’s worth noting that Twilio is doing great as a company, which means they are hiring and outgrowing their current space. One option they were investigating was to keep the current space and open up a second office elsewhere in the Bay Area. As they’d always been used to working in the one location, this “split into two offices” was top of everyone’s mind… hence the invitation from Thomas to give this company-wide talk about remoties.
Twilio’s entire office is a large, SOMA-style-warehouse-converted-into-open-plan-offices layout, packed with lots of people. The area I was to present in was their big “common area”, where they typically host company all-hands meetings, Friday socials and other big company-wide events. Quite, quite large. I’ve no idea how many people were there but it felt huge, and was wall-to-wall packed. The size gave an echo-y audio effect off the super-high concrete ceilings and far-distant bare concrete walls, with a weird couple of structural pillars right in the middle of the room. Despite my best intentions, during the session, I found myself trying to “peer around” the pillars, aware of the people blocked from view.
It’s great to see the response from folks when slides in a presentation *exactly* hit on what is top-of-mind for them. One section, about companies moving to multiple locations, clearly hit home with everyone… not too surprising, given the context. Another section, about a trusted employee moving out of the office to become a 100% remote employee, hit a very personal note – there was someone in the 2nd row who was a long-trusted employee actually about to embark on this exact change. He got quite the attention from everyone around him, and we stopped everything for a few minutes to talk about his exact situation. As far as I can tell, he found the entire session very helpful, but only time will tell how things work out for him.
The great interactions, the lively Q+A, and the crowd of questions afterwards were all lots of fun and quite informative.
Big thanks to Thomas Wilsher @ Twilio for putting it all together. I found it a great experience, and the lively discussions before+during+after lead me to believe others did too.
PS: For a PDF copy of the presentation, click on the smiley faces! For the sake of my poor blogsite, the much, much, larger keynote file is available on request.
Just before the holiday break, Mitchell and I sat down together to fulfill a long standing promise I made years ago: to have Mitchell start a Firefox release herself.
After starting Mozilla just over 15 years ago, and dealing with all aspects of running a large organization, Mitchell finally kicked off a Firefox release herself last week – for the very first time. Specifically, she was going to start the official release automation for Firefox 27.0 beta2 and Fennec 27.0 beta2.
Timing was tricky. We didn’t want to disrupt the usual beta release cadence, especially just before the holidays. And Mitchell only had 25 minutes free between meetings, so we spent a few minutes saying hi, getting settled, and then we jumped right into the details.
To kick off Firefox and Fennec releases, there are only a handful of fields a human has to fill in for each product. They are (almost) all fairly self-evident, and a good number of the fields are populated by picking-from-a-list, so we made fast progress. The “Gimme a Firefox” and “Gimme a Fennec” buttons caused a laugh!
8 minutes is all it took.
That 8 minutes included the time to explain what each field did, what value should go into each of the various fields, and why. We even took the time to re-verify everything. After all, this was not just a “demo”… this was real. Mozilla shipped this Firefox 27.0b2 and Fennec 27.0b2 to all our real-live beta users before closing down for the holiday break.
Because it was so quick, we had spare time to chat about how much the infrastructure has improved since the Directors meeting in Building S 6.5 years ago when this promise was originally made. Obviously, there’s plenty of complexity involved in shipping a product release – the daily bug triage meetings about what fixes should/shouldn’t be included in a release, the actual landing of code fixes by developers, the manual spot-checking by QA, the press and PR coordination, the list goes on… – but the fact that such a “simple” user interface could trigger release automation running a couple of hundred compute hours across many machines to reliably ship products to millions of users is a note-worthy measure of Mozilla’s Release Engineering infrastructure. Mitchell was suitably impressed!
And then Mitchell left, with a wave and a smile, a few minutes early for her next meeting, while the various Release Engineering systems sprang into life generating builds, localization repacks, and updates for all our users.
We took this photo afterwards to commemorate the event! Thank you, Mitchell!
Mozilla’s Release Engineering was part of “The Architecture of Open Source Applications (vol2)” published in paperback in May2012 and then as electronic downloads from Amazon and Barnes&Noble in Sept2012. Earlier this week, Rail was delighted to discover that the book has now been translated into Russian. It is very cool to see this.
As best as I can tell, this translation work was led by А. Панин (A. Panin), and they did a great job, even taking the time to recreate the images with embedded translated text. Tricky, hard work, and great to see. Thanks to Mr Panin for making this happen.
You can download the entire book, or just the Mozilla RelEng portion. (As always, proceeds from book sales go to Amnesty International.)
This makes me wonder – are there any other translations, or translations-in-progress, out there?
My earlier blog post shows how much we spend per checkin using AWS “on demand” instances. Rail and catlee have been working on using much cheaper AWS spot instances (see here and here for details).
As of today, AWS spot instances are now being used for 30-40% of our Linux test jobs in production. We continue to monitor closely, and ramp up more every day. Builds are still being done on on-demand instances, but even so, we’re already seeing this work reduce our costs on AWS – now our AWS cost per checkin is USD$26.40 (down from $30.60), broken out as follows: USD$8.44 (down from $11.93) for Firefox builds/tests, USD$5.31 (unchanged) for Fennec builds/tests and USD$12.65 (down from $13.36) for B2G builds/tests.
It is worth noting that our AWS bill for November was *down*, even though our checkin load was *up*. Spot instances get more attention because they are more technically interesting, but for the record, only a small percentage of this cost saving came from using spot instances. Most of the cost savings in November were from the less-glamorous work of identifying and turning off unwanted tests – ~20% of our overall load was no-longer-needed.
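As a quick cross-check of the per-checkin numbers above, here is a small sketch that adds up the per-product costs and works out the overall saving. The dollar figures come straight from this post; the variable names and the use of `Decimal` (to avoid floating-point rounding on currency) are my own choices for illustration.

```python
from decimal import Decimal

# (previous, current) AWS cost per checkin in USD, by product, as quoted above.
cost_per_checkin = {
    "Firefox": (Decimal("11.93"), Decimal("8.44")),
    "Fennec": (Decimal("5.31"), Decimal("5.31")),
    "B2G": (Decimal("13.36"), Decimal("12.65")),
}

previous_total = sum(prev for prev, _ in cost_per_checkin.values())
current_total = sum(cur for _, cur in cost_per_checkin.values())
saving_pct = (previous_total - current_total) / previous_total * 100

print(f"previous: ${previous_total}  current: ${current_total}")
print(f"saving per checkin: {saving_pct:.1f}%")
```

The three product figures do add up to the quoted totals, and the reduction works out to a little under 14% per checkin.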
- A spot instance can be deleted out from under you at zero notice, killing your job-in-progress, if someone else bids more than you for that instance. We’re only seeing ~1% of jobs on spot-instances being killed. We’ve changed our scheduling automation so that now, any spot job which is killed will automatically be re-triggered on a *non-spot* instance. While a developer might tolerate a delay/restart once, because of the significant cost savings, they would quickly become frustrated if a job was unlucky enough to be killed multiple times.
- When you ask for an on-demand instance, it is almost instant. By contrast, when you ask for a spot-instance, you do not have any guarantee on how quickly Amazon will provide the instance. We’ve seen delays of up to 20-30 minutes, all totally unpredictable. Handling delays like this requires changes to our automation logic. All work well in progress, but still, work to be done.
- This post only includes costs for AWS jobs, but we run a lot more builds+tests on inhouse machines. Cshields continues to work with mmayo to calculate TCO (Total Cost of Ownership) numbers for the different physical machines Mozilla runs in Mozilla colos. Until we have accurate numbers from IT, I’ve set those inhouse costs to $0.00. This is obviously unrealistic, but felt better than confusing this post with inaccurate data.
- The Amazon prices used here are “OnDemand” prices. For context, Amazon WebServices has 4 different price brackets available, for each different type of machine available:
** OnDemand Instance: The most expensive. No need to prepay. Get an instance in your requested region, within a few seconds of asking. Very high reliability – out of the hundreds of instances that RelEng runs daily, we’ve only lost a few instances over the last ~18months. Our OnDemand builders cost us $0.45 per hour, while our OnDemand testers cost us $0.12 per hour.
** 1 year Reserved Instance: Pay in advance for 1 year of use, get a discount from OnDemand price. Functionally, totally identical to OnDemand, the only change is in billing. Using 1 year Reserved Instances, our builders would cost us $0.25 per hour, while our testers would cost us $0.07 per hour.
** 3 year Reserved Instances: Pay in advance for 3 years of use, get a discount from OnDemand price. Functionally, totally identical to OnDemand, the only change is in billing. Using 3 year Reserved Instances, our builders would cost us $0.20 per hour, while our testers would cost us $0.05 per hour.
** Spot Instances: The cheapest. No need to prepay. Like a live auction, you bid how much you are willing to pay for it, and so long as you are the highest bidder, you’ll get an instance. This price varies throughout the day, depending on what demand other companies place on that AWS region. We’re using a fixed $0.025 bid, and in future, it might be possible to save even more by doing trickier dynamic bidding.
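To make the "killed spot jobs get re-triggered on a non-spot instance" rule from the notes above a bit more concrete, here is a minimal sketch of that kind of fallback. To be clear, this is not our actual scheduling code: `Job`, `Outcome`, `run_with_spot_fallback` and `fake_runner` are all hypothetical names, and treating a spot request that is not fulfilled in time the same way as a killed job is my assumption about one possible way to handle the delay problem.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class Outcome(Enum):
    COMPLETED = "completed"
    KILLED = "killed"            # spot instance reclaimed mid-job
    NO_INSTANCE = "no_instance"  # assumption: spot request not fulfilled in time


@dataclass
class Job:
    name: str
    attempts: List[str] = field(default_factory=list)


def run_with_spot_fallback(job: Job, run_on: Callable[[Job, str], Outcome]) -> str:
    """Try a spot instance once, then fall back to an on-demand instance.

    Returns the name of the pool that completed the job.
    """
    for pool in ("spot", "on-demand"):
        job.attempts.append(pool)
        if run_on(job, pool) is Outcome.COMPLETED:
            return pool
    raise RuntimeError(f"{job.name} did not complete on any pool")


def fake_runner(job: Job, pool: str) -> Outcome:
    # Stand-in for real infrastructure: every spot attempt gets killed.
    return Outcome.KILLED if pool == "spot" else Outcome.COMPLETED


job = Job("linux-test-job")
final_pool = run_with_spot_fallback(job, fake_runner)
print(final_pool, job.attempts)
```

The point of the design is that a job is exposed to spot-instance risk at most once, so a developer never sees the same job killed twice.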
More news as we have it.
tl;dr: On 18nov, I gave my notice to Brendan and Bob that I will be leaving Mozilla, and sent an email internally at Mozilla on 26nov. I’m here until 31dec2013. That’s a lot of notice, yet feels right – it’s important to me that this is a smooth stable transition.
After they got over the shock, the RelEng team is stepping up wonderfully. It’s great to see them all pitching in, sharing out the workload. They will do well. Obviously, at times like this, there are lots of details to transition, so please be patient and understanding with catlee, coop, hwine and bmoss. I have high confidence this transition will continue to go smoothly.
In writing this post, I realized I’ve been here 6.5 years, so thought people might find the following changes interesting:
1) How quickly can Mozilla ship a zero-day security release?
was: 4-6 weeks
now: 11 hours
2) How long to ship a “new feature” release?
was: 12-18 months
now: 12 weeks
3) How many checkins per day?
was: ~15 per day
now: 350-400 per day (peak 443 per day)
4) Mozilla hired more developers
increased number of developers x8
increased number of checkins x21
The point here being that the infrastructure improved faster than Mozilla could hire developers.
5) Mozilla added mobile+b2g:
was: desktop only
now: desktop + mobile + phoneOS – many of which ship from the *exact* same changeset
6) updated tools
now: hg *and* git (aside, I don’t know any other organization that ships product from two *different* source-code revision systems.)
7) Lifespan of human Release Engineers
was: 6-12 months
now: two-losses-in-6-years (3 including me)
This team stability allowed people to focus on larger, longer term, improvements – something new hires generally can’t do while learning how to keep the lights on.
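As an aside on item 4 above, the two multipliers can be combined into a rough "checkins per developer" figure. This is only a back-of-the-envelope sketch using the x8 and x21 numbers from the list; the variable names are mine, and it assumes both multipliers cover the same time period.

```python
# Growth multipliers quoted in item 4 above.
developer_growth = 8
checkin_growth = 21

# If checkins grew faster than headcount, each developer is landing more.
checkins_per_developer_growth = checkin_growth / developer_growth
print(f"Checkins per developer grew roughly x{checkins_per_developer_growth:.1f}")
```

In other words, under those assumptions the average developer is landing well over twice as many checkins as before.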
This is the best infrastructure and team in the software industry that I know of – if anyone reading this knows of better, please introduce me! (Disclaimer: there’s a big difference between people who update website(s) vs people who ship software that gets installed on desktop or mobile clients… or even entire phoneOS!)
Literally, Release Engineering is a force multiplier for Mozilla – this infrastructure allows us to work with, and compete against, much bigger companies. As an organization, we now have business opportunities that were previously just not possible.
Finally, I want to say thanks:
- I’ve been here longer than a few of my bosses. Thanks to bmoss for his counsel and support over the last couple of years.
- Thanks to Debbie Cohen for making LEAD happen – causing organizational change is big and scary, I know it’s impacted many of us here, including me.
- Thanks to John Lilly and Mike Schroepfer (“schrep”) – for allowing me to prove there was another, better, way to ship software. Never mind that it hadn’t been done before. And thanks to aki, armenzg, bhearsum, catlee, coop, hwine, jhopkins, jlund, joey, jwood, kmoir, mgerva, mshal, nthomas, pmoore, rail, sbruno, for building it, even when it sounded crazy or hadn’t been done before.
- Finally, thanks to Brendan Eich, Mitchell Baker, and Mozilla – for making the “people’s browser” a reality… putting humans first. Mozilla ships all 90+ locales, even Khmer, all OS, same code, same fixes… all at the same time… because we believe all humans are equal. It’s a living example of the “we work for mankind, not for the man” mindset here, and is something I remain super proud to have been a part of.
- We’re back to typical load again in November.
- #checkins-per-month: We had 7,601 checkins in November 2013. This is our 2nd heaviest load on record, and is back at expected range. For the curious, our heaviest month on record was in August 2013 (7,771 checkins) and our previous 2nd heaviest month was September2013 (7,580 checkins).
- #checkins-per-day: Overall load was consistently high throughout the month, with a slight dip for US Thanksgiving. In November, 18-of-30 days had over 250 checkins-per-day, 13-of-30 days had over 300 checkins-per-day, and 1-of-30 days had over 400 checkins-per-day. Our heaviest day had 431 checkins on 18nov; close to our single-day record of 443 checkins on 26aug2013.
- #checkins-per-hour: Checkins are still mostly mid-day PT/afternoon ET. For 10 of every 24 hours, we sustained over 11 checkins per hour. Our heaviest load time this month was 11am-12noon PT, with 15.6 checkins-per-hour (a checkin every 3.8 minutes!) – slightly below our record of 15.73 checkins-per-hour.
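If you want to reproduce the "a checkin every 3.8 minutes" style of number from the rates above, the conversion is simple. A minimal sketch follows; the function names are hypothetical, and the inputs are the figures quoted in this post.

```python
def minutes_between_checkins(checkins_per_hour: float) -> float:
    """Average gap between checkins, in minutes, for a given hourly rate."""
    return 60.0 / checkins_per_hour


def average_per_day(checkins_in_month: int, days_in_month: int) -> float:
    """Average daily checkins across a whole month."""
    return checkins_in_month / days_in_month


peak_gap = minutes_between_checkins(15.6)      # this month's busiest hour
record_gap = minutes_between_checkins(15.73)   # all-time record hour
november_daily_average = average_per_day(7601, 30)

print(f"peak hour: one checkin every {peak_gap:.1f} minutes")
print(f"record hour: one checkin every {record_gap:.1f} minutes")
print(f"November average: {november_daily_average:.0f} checkins per day")
```

Note the daily average includes weekends and the Thanksgiving dip, which is why it sits well below the 300+ days mentioned above.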
mozilla-inbound, b2g-inbound, fx-team:
- mozilla-inbound had 16.6% of all checkins. This continues to be heavily used as an integration branch. As developers use other *-inbound branches, the use of mozilla-inbound has reduced over recent months, and is stabilizing around mid-teens of overall usage.
- b2g-inbound had 11.5% of all checkins. This continues to be a successful integration branch, with usage slightly increased over last month’s 10.3% and a sign that usage of this branch is also stabilizing.
- fx-team had 6% of all checkins. This continues to be a very active third integration branch for developers. Usage is almost identical to last month, and shows that usage of this branch is also stabilizing.
- The combined total of these 3 integration branches is 34.1%, which is slightly higher than last month yet fairly consistent. Put another way, sheriff moderated branches consistently handle approx 1/3 of all checkins (while Try handles approx 1/2 of all checkins). The use of multiple *-inbounds is clearly helping improve bottlenecks (see pie chart below) and the congestion on mozilla-inbound is being reduced significantly as people switch to using other *-inbound branches instead. Overall, this configuration reduces stress and backlog headaches on sheriffs and developers, which is good. All very cool to see working at scale like this.
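For the curious, the branch percentages above can be turned back into approximate checkin counts. This is a rough sketch only: the percentages are already rounded to one decimal place, so the per-branch counts are estimates rather than the real numbers, and the names used here are just for illustration.

```python
from decimal import Decimal

NOVEMBER_CHECKINS = 7601  # total checkins in November 2013

# Share of all checkins per integration branch, as quoted above (rounded).
branch_share_pct = {
    "mozilla-inbound": Decimal("16.6"),
    "b2g-inbound": Decimal("11.5"),
    "fx-team": Decimal("6.0"),
}

combined_pct = sum(branch_share_pct.values())

# Estimated counts only: derived from rounded percentages.
estimated_checkins = {
    branch: round(NOVEMBER_CHECKINS * pct / 100)
    for branch, pct in branch_share_pct.items()
}

print(f"combined share: {combined_pct}%")
for branch, count in estimated_checkins.items():
    print(f"{branch}: ~{count} checkins")
```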
mozilla-aurora, mozilla-beta, mozilla-b2g18, gaia-central:
Of our total monthly checkins:
- 2.6% landed into mozilla-central, slightly lower than last month. As usual, most people land on sheriff-assisted branches instead of landing directly on mozilla-central.
- 1.4% landed into mozilla-aurora, lower than last month’s abnormally high load. This is consistent with the B2G branching, which had B2G v1.2 checkins landing on mozilla-aurora, and now moved to mozilla-b2g26_v1_2.
- 0.9% landed into mozilla-beta, slightly higher than last month.
- 0.0% landed into mozilla-b2g18, slightly lower than last month. This dropped to almost zero (total of 8 checkins) as we move B2G to gecko26.
- 3.3% landed into mozilla-b2g26_v1_2, as part of the B2G v1.2 branching involving Firefox25. As predicted this is significantly more than last month, and is expected to continue until we move focus to B2G v1.3 on gecko28.
- Note: gaia-central, and all other gaia-* branches, are not counted here anymore. For details, see here.
misc other details:
As usual, our build pool handled the load well, with >95% of all builds consistently being started within 15mins. Our test pool is getting up to par and we’re seeing more test jobs being handled with better response times. Trimming out obsolete builds and tests continues. As always, if you know of any test suites that no longer need to be run per-checkin, please let us know so we can immediately reduce the load a little. Also, if you know of any test suites which are perma-orange, and hidden on tbpl.m.o, please let us know – those are the worst of both worlds – using up scarce CPU time *and* not being displayed for people to make use of. We’ll make sure to file bugs to get tests fixed – or disabled – every little bit helps put scarce test CPU to better use.
Last week, 18-22 November, RelEng gathered in Boston. As usual for these work weeks, it was jam-packed; there was group planning, and lots of group sprints – coop took on the task of blogging with details for each specific day (Mon and Mon, Tue, Wed, Thu, Fri). The meetings with Bocoup were a happy, unplanned, surprise.
Given the very distributed nature of the group, and the high-stress nature of the job, a big part of the week is making sure we maintain our group cohesion so we can work well together under pressure after we return to our respective homes. When all together in person, the trust, respect, love for each other is self-evident and something I’m truly in awe of. I don’t know how else to describe this except “magic” – this is super important to me, and something I’m honored to be a part of.
Every gathering needs a group photo, and these are never first-shot-good-enough-ship-it, so while aki was taking a group photo, Massimo quietly set up his GoPro to timelapse the fun.
This is Mozilla’s Release Engineering group – aki, armenzg, bhearsum, callek, catlee, coop, hwine, joey, jhopkins, jlund, joduinn, kmoir, mgerva, mshal, nthomas, pmoore, simone, rail. All proudly wearing our “Ship it” shirts.
Every RelEng work week is always an exhausting hectic week, and yet, at the end of each week, as we are saying our goodbyes and heading for various planes/cars/homes, I find myself missing everyone deeply and feeling so so so proud of them all.