This earlier blog post allowed us to do some interesting math. Now, we can mark each different type of job with its cost-per-minute to run, and finally calculate that a checkin costs us at least USD$30.60; the cost was broken out as follows: USD$11.93 for Firefox builds/tests, USD$5.31 for Fennec builds/tests and USD$13.36 for B2G builds/tests.
- This post assumes that all inhouse build/test systems have zero cost, which is obviously incorrect. Cshields is working with mmayo to calculate TCO (Total Cost of Ownership) numbers for the different physical machines Mozilla runs in our colos. Once those TCO costs are figured out, I can plug them into this grid, and create an updated blogpost with revised costs. Meanwhile, however, calculating this TCO continues to take time, so for now I’ve intentionally excluded all cost of running on any inhouse machines. They are not “free”, so this is obviously unrealistic, but better than confusing this post with inaccurate data. Put another way, the costs which *are* here are an underreported part of the overall cost.
- Each AWS region has different prices for instances. The Amazon prices used here are for the regions that RelEng is already using. We already use the two cheapest AWS regions (US-west-2 and US-east-1) for daily production load, and keep a third region on hot-backup just in case we need it.
- The Amazon prices used here are “OnDemand” prices. For context, Amazon WebServices has 4 different price brackets available, for each different type of machine available:
** OnDemand Instance: The most expensive. No need to prepay. Get an instance in your requested region, within a few seconds of asking. Very high reliability – out of the hundreds of instances that RelEng runs daily, we’ve only lost a few instances over the last ~18 months. Our OnDemand builders cost us $0.45 per hour, while our OnDemand testers cost us $0.12 per hour.
** 1 year Reserved Instance: Pay in advance for 1 year of use, get a discount from OnDemand price. Functionally identical to OnDemand; the only change is in billing. Using 1 year Reserved Instances, our builders would cost us $0.25 per hour, while our 1 year Reserved Instance testers would cost us $0.07 per hour.
** 3 year Reserved Instances: Pay in advance for 3 years of use, get a discount from OnDemand price. Functionally identical to OnDemand; the only change is in billing. Using 3 year Reserved Instances, our builders would cost us $0.20 per hour, while our 3 year Reserved Instance testers would cost us $0.05 per hour.
** Spot Instances: The cheapest. No need to prepay. Like a live auction, you bid how much you are willing to pay for it, and so long as you are the highest bidder, you’ll get an instance. This price varies throughout the day, depending on what demand other companies place on that AWS region. Unlike the other types above, a spot instance can be deleted out from under you at zero notice, killing your job-in-progress, if someone else bids more than you. This requires additional automation to detect and retrigger the aborted jobs on another instance. Also unlike the others, a spot instance can take anywhere from a few seconds to 25-30mins to be created, so it requires additional automation to handle this unpredictability. The next post will detail the costs when Mozilla RelEng is running with spot instances in production.
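To make the price brackets above concrete, here is a minimal Python sketch that uses only the hourly rates quoted in this post. The names (`HOURLY_RATES`, `job_cost`, `discount_vs_ondemand`) are my own illustrative choices, not part of any RelEng tooling. Spot prices are left out because they vary through the day, and the sketch assumes cost scales linearly with elapsed time, which ignores any billing granularity.

```python
# Hourly prices (USD) quoted in this post; spot prices vary, so they are omitted.
HOURLY_RATES = {
    "ondemand": {"builder": 0.45, "tester": 0.12},
    "reserved_1yr": {"builder": 0.25, "tester": 0.07},
    "reserved_3yr": {"builder": 0.20, "tester": 0.05},
}


def job_cost(elapsed_seconds, hourly_rate):
    """Cost in USD of one job that keeps an instance busy for elapsed_seconds."""
    return elapsed_seconds / 3600 * hourly_rate


def discount_vs_ondemand(bracket, machine):
    """Fractional saving of a bracket relative to the OnDemand price."""
    ondemand = HOURLY_RATES["ondemand"][machine]
    return 1 - HOURLY_RATES[bracket][machine] / ondemand


# Hypothetical example: a 90 minute job on an OnDemand builder.
print(f"${job_cost(90 * 60, HOURLY_RATES['ondemand']['builder']):.3f}")

for bracket in ("reserved_1yr", "reserved_3yr"):
    for machine in ("builder", "tester"):
        saving = discount_vs_ondemand(bracket, machine)
        print(f"{bracket} {machine}: {saving:.0%} cheaper than OnDemand")
```

Summing `job_cost` over every job triggered by a checkin is, roughly, how a per-checkin figure like the one above can be built up from a grid of job durations.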
Being able to answer “how much did that checkin actually cost Mozilla” has interesting consequences. Cash has a strange cross-cultural effect – it helps focus discussions.
Now we can see the financial cost of running a specific build or test.
Now it’s easy to see the cold financial saving of speeding up a build, or the cost saving gained by deleting invalid/broken tests.
Now we can determine approximately how much money we expect to save with some cleanup work, and can use that information to decide how much human developer time is worth spending on cleanup/pruning.
Now we can make informed tradeoff decisions between the financial & market value of working on new features and the financial value of cheaper+faster infrastructure.
Now, it is no longer just about emotional, “feel good for doing right” advocacy statements… now each cleanup work has a clear cold hard cash value for us all to see and to help justify the work as a tradeoff against other work.
All in all, it’s a big, big deal, and we can now ask “Was that all worth at least $30.60 to Mozilla?”.
(ps: Thanks to Anders, catlee and rail for their help with this.)
- Overall this month was quieter than usual. I’d guess that this was caused by a combination of fatigue after the September B2G workweek, the October stabilization+lockdown period for B2Gv1.2, and Canadian Thanksgiving. Oh, and of course, Mozilla’s AllHands Summit in early October. Data for November is already higher, back towards more typical numbers. A big win was turning off obsolete builds and tests which reduced our load by 20%.
- #checkins-per-month: We had 6,807 checkins in October 2013. This is ~10% below last month’s 7,580 checkins.
- #checkins-per-day: Overall load was down throughout the month. In October, 15-of-31 days had over 250 checkins-per-day, 8-of-31 days had over 300 checkins-per-day. No day in October was over 400 checkins-per-day. Our heaviest day had 344 checkins on 28oct; impressive by most standards, yet well below our single-day record of 443 checkins on 26aug.
- #checkins-per-hour: Checkins are still mostly mid-day PT/afternoon ET. For 7 of every 24 hours, we sustained over 11 checkins per hour. Our heaviest load time this month was 2pm-3pm PT, with 12.77 checkins-per-hour (a checkin every 4.7 min) – below our record of 15.73 checkins-per-hour.
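The percentages and intervals in these bullets are simple arithmetic on the raw counts. A small Python sketch, with helper names that are mine and purely illustrative:

```python
def percent_change(current, previous):
    """Signed percentage change from previous to current."""
    return (current - previous) / previous * 100


def minutes_between_checkins(checkins_per_hour):
    """Average gap in minutes between checkins at a given hourly rate."""
    return 60 / checkins_per_hour


# October 2013 vs September 2013 monthly checkins, from the text above.
print(f"{percent_change(6807, 7580):.1f}%")
# Heaviest hour this month, and the record hour.
print(f"{minutes_between_checkins(12.77):.1f} min")
print(f"{minutes_between_checkins(15.73):.1f} min")
```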
mozilla-inbound, b2g-inbound, fx-team:
- mozilla-inbound continues to be heavily used as an integration branch. As developers use other *-inbound branches, the use of mozilla-inbound, at 15.8% of all checkins, is much reduced from typical, and also reduced from last month – which was itself the lowest ever usage of mozilla-inbound. The use of multiple *-inbounds is clearly helping relieve bottlenecks (see pie chart below) and the congestion on mozilla-inbound is being reduced significantly as people switch to using other *-inbound branches instead. This also reduces stress and backlog headaches on sheriffs, which is good. All very cool to see.
- b2g-inbound continues to be a great success, now up to 10.3% of this month’s checkins landing here, a healthy increase over last month’s 8.8% and further evidence that use of this branch is helping.
- With sheriff coverage, fx-team is clearly a very active third place for developers, with 5.6% of checkins this month. This is almost identical to last month, and may become the stable point for this branch.
- The combined total of these 3 integration branches is 30.2%, which is fairly consistent. Put another way, sheriff moderated branches consistently handle approx 1/3 of all checkins (while Try handles approx 1/2 of all checkins).
mozilla-aurora, mozilla-beta, mozilla-b2g18, gaia-central:
Of our total monthly checkins:
- 2.6% landed into mozilla-central, slightly higher than last month. As usual, very few people land directly on mozilla-central these days, when there are sheriff-assisted branches available instead.
- 3.2% landed into mozilla-aurora, much higher than usual. I believe this was caused by the B2G branching, which had B2G v1.2 checkins landing on mozilla-aurora.
- 0.8% landed into mozilla-beta, slightly higher than last month.
- 0.2% landed into mozilla-b2g18, slightly lower than last month. This should quickly drop to zero as we move B2G to gecko26.
- 0.4% landed into mozilla-b2g26_v1_2, which was only enabled for checkins as part of the B2Gv1.2 branching involving Firefox25. This should quickly grow in usage until we move focus to B2G v1.3 on gecko28.
- Note: gaia-central, and all other gaia-* branches, are not counted here anymore. For details, see here.
misc other details:
As usual, our build pool handled the load well, with >95% of all builds consistently being started within 15mins. Our test pool is getting up to par and we’re seeing more test jobs being handled with better response times. Trimming out obsolete builds and tests reduced our load by 20% – or put another way – got us 20% extra “free” capacity. Still more work to be done here, but very encouraging progress. As always, if you know of any test suites that no longer need to be run per-checkin, please let us know so we can immediately reduce the load a little. Also, if you know of any test suites which are perma-orange, and hidden on tbpl.m.o, please let us know – those are the worst of both worlds – using up scarce CPU time *and* not being displayed for people to make use of. We’ll make sure to file bugs to get tests fixed – or disabled – every little bit helps put scarce test CPU to better use.
[UPDATE: added mention of Mozilla Summit in first paragraph. Thanks to coop for catching that omission! joduinn 12nov2013.]
While researching this “better display for compute hours per checkin” post, I noticed that we now “only” consume 207 compute hours of builds and tests per checkin. A month ago, we handled 254 compute-hours-per-checkin, so this is a reduction of 47 compute-hours-per-checkin.
No “magic silver bullet” here, just people quietly doing detailed unglamorous work finding, confirming and turning off no-longer-needed-jobs. For me, the biggest gains were turning off “talos dirtypaint” and “talos rafx” across all desktop OS, a range of b2g device builds, all Android no-ionmonkey builds and tests, and a range of Android armv6, armv7 builds and tests. At Mozilla’s volume-of-checkins, saving 47 hours-per-checkin is a big big deal.
This reduced our overall load by 23%. Or put another way – this work gave us 23% extra “spare” capacity to better handle the remaining builds and tests that people *do* care about.
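A percentage like this depends on which baseline is used. The sketch below is plain Python using only the 254 and 207 compute-hour figures from this post (the variable names are mine). It shows both readings: the saving relative to the old load, and the freed capacity relative to the new load.

```python
old_hours = 254  # compute hours per checkin a month ago
new_hours = 207  # compute hours per checkin now
saved = old_hours - new_hours

# Reduction measured against the old load.
reduction_vs_old = saved / old_hours * 100
# Capacity freed up, measured against the new (smaller) load.
extra_vs_new = saved / new_hours * 100

print(saved, f"{reduction_vs_old:.1f}%", f"{extra_vs_new:.1f}%")
```

The second reading, spare capacity relative to what we run today, is the one that rounds to 23%.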
Great, great work by sheriffs and RelEng. Thank. You.
How many hours of builds and tests do we run per commit?
- 207 compute hours = ~8.6 compute *days* (nov2013)
- 254 compute hours = ~10.6 compute *days* (sep2013)
- 137 compute hours = ~5.7 compute *days* (aug2012)
- 110 compute hours = ~4.6 compute *days* (jan2012)
- ~40 compute hours = ~1.7 compute *days* (2009)
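The conversion in the list above is just compute hours divided by 24. A minimal Python sketch for anyone who wants to redo it with newer numbers; the dictionary and function names are illustrative only:

```python
# Compute hours per checkin, as listed above.
HOURS_PER_CHECKIN = {
    "nov2013": 207,
    "sep2013": 254,
    "aug2012": 137,
    "jan2012": 110,
    "2009": 40,  # approximate figure in the text
}


def compute_days(hours):
    """Convert compute hours into 24-hour compute days."""
    return hours / 24


for label, hours in HOURS_PER_CHECKIN.items():
    print(f"{label}: {hours} compute hours = ~{compute_days(hours):.1f} compute days")
```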
There’s still more goodness to come, as even more jobs continue to be trimmed; the curious can follow bug#784681. Of course, if you see any build/test here which is no longer needed, or is perma-failing-and-hidden on tbpl.mozilla.org, please file a bug linked to bug#784681 and we’ll investigate/disable/fix as appropriate.
After my last post about our compute-load-per-checkin, I received an email that made me sit up and smile. Andershol had “a quick script” that quickly and easily displayed the same information in a grid format. Not just a suggestion – the actual code that ran, with real output. I found this format super helpful. We’ve refined this a few times now, and I think others would also find this useful, hence this post.
- Each vertical column is the operating system used.
- Each horizontal row is the job type (which build-type, which test-suite,…).
- Each white cell is the elapsed time taken by that specific job on that specific operating system, so for example running “mochitest browser chrome” on linux 32bit opt build took 1h:53m:13s.
It is now easy to quickly see the total time spent on a given OS, by looking at the total in the gray column header (for example, Firefox desktop linux 32bit builds and tests took 21h:44m).
Similarly, it’s easy to see the total time spent on a given job (build/test), across all OS, by looking at the total in the gray row header (for example, running “mochitest browser chrome” took 4h:54m on opt, 13h:13m on debug, for a total of 18h:07m).
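The row and column totals are sums of elapsed times. As a rough illustration (this is my own sketch in Python, not Andershol's actual script, and the function names are hypothetical), durations written in the grid's style can be parsed and added like this:

```python
import re

# Matches the grid's duration style, e.g. "4h:54m" or "1h:53m:13s".
DURATION = re.compile(r"(\d+)h:(\d+)m(?::(\d+)s)?")


def to_seconds(text):
    """Parse a duration like '4h:54m' or '1h:53m:13s' into seconds."""
    match = DURATION.fullmatch(text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes, seconds = match.groups(default="0")
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds)


def to_text(total_seconds):
    """Format seconds back into the grid's 'Hh:MMm' style, dropping seconds."""
    hours, remainder = divmod(total_seconds, 3600)
    return f"{hours}h:{remainder // 60:02d}m"


# Row total for "mochitest browser chrome": opt + debug.
row_total = to_seconds("4h:54m") + to_seconds("13h:13m")
print(to_text(row_total))
```

This prints `18h:07m`, matching the row total quoted above.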
The three major products (Firefox-for-desktop, Firefox-for-Android, FirefoxOS) are each shown in their own grid, but it’s worth noting that the jobs in *each* of the *three* grids are being processed per checkin. The combined total of all three grids is the overall compute load that RelEng is running per checkin.
This display format was super helpful to me, so big thanks to Andershol for making this a reality!
Also, it’s great to see no-longer-needed builds and testsuites being turned off… reducing load from 254 to 207 hours per checkin. Biggest highlights were turning off “talos dirtypaint” and “talos rafx” across all desktop OS, turning off all Android no-ionmonkey builds and tests, and turning off a range of Android armv6, armv7 builds and tests. At Mozilla’s volume-of-checkins, those savings quickly add up.
Of course, if you notice anything else being run which you think is no longer needed, please file a bug and we’ll take care of it.
ps: Andershol has posted the code to https://github.com/andershol/buildtasks; if you have ideas, or would like to suggest enhancements, he’s happily accepting patches!
Last week, I had the distinct privilege of being invited back to present “We are all remoties” in UCBerkeley’s “New Manager Bootcamp” series at Haas.
The auditorium was packed with ~90 people, from a range of different companies and different industries. After my experiences at Mozilla Summit, I started by asking two specific questions:
1) How many of you are remote? (only ~5% of hands went up).
2) How many of you routinely work with people who are not in the same geographical location as yourself (100% of the hands went up!).
I found it interesting that few thought of themselves as “remotie”, yet all were working in geo-distributed teams.
This was similar to what came up during the “We are all remoties” sessions at MozillaSummit just a few days before, as well as at other previous “We are all remoties” sessions I’ve done elsewhere. Somehow, physically working in an office tricks some people into believing they don’t need to think of themselves as “remote”, and hence don’t think “We are all remoties” is relevant to them!?
People were fully engaged, asking tons of great questions right from the start, and were clearly excited by practical tips to working more effectively in distributed groups. The organizers planned ahead, and specifically put this session immediately before lunch, so that the Q+A could continue overtime… and a separate crowded room of 15-20 people continued the great back/forth over food.
After lunch, I was part of a 4-person panel, where the class got to set direction and ask all the questions – no holds barred. As the class, and the panelists, all came from different backgrounds, different cultures, different careers, it was no surprise that the Q+A uncovered different perspectives and attitudes. The class were agreeing/disagreeing with each other and with the panelists. We even had panelists asking each other questions?!?! As individual panelists, we didn’t always agree on the mechanics of what we did, but we all agreed on the motivations of *why* we did what we did: doing a good job, while also taking care of the lives and careers of the individuals, the group, and the overall organization.
The trust and honesty in the room was great, and it was quickly evident that everyone was down-to-earth, asking brutally honest questions simply because they wanted to do right with their new roles and responsibilities. Even while being on the spot with some awkward questions, I admired their sincere desire to do well in their new role, and to treat people well. It gave me hope, and I thank them all for that.
Big thanks to Homa and Kim for putting it all together. I found it a great experience, and the lively discussions during+after lead me to believe others did too.
PS: For a PDF copy of the presentation, click on the smiley faces! For the sake of my poor blogsite, the much, much, larger keynote file is available on request.
Last weekend, during Mozilla Summit, “We are all Remoties” was held *4* times: Brussels (catlee), Toronto (Armen and Kadir) and Santa Clara (myself, twice!). Big props to Kadir for joining in with his data – it’s always great to meet others who are also thinking about how to best work together in a growing and geographically-distributed Mozilla.
I was happy to see that these different speakers, in different locations, all covered the session well, in their own personal style, and all had great responses and interactions. From all accounts, people really found this topic helpful, which is very nice to hear.
The one feedback that did surprise me, from all these sessions, was that most of the people attending were already working remotely, yet very few people based in offices attended, even if their entire group was geo-distributed. The topics covered addressed people in offices too, and several times people who were remoties said to me that they wished their office-based-co-workers had attended.
It’s possible that the title makes people think the session only applies to non-office-based people. One earlier title I had was “working effectively in geo-distributed teams”, but that sounded very PHB. Another title (“If you are a remotie, or if you are in an office, working with a remotie…”) was too long, but it brought me to the current title. If everyone who is on a geo-distributed team considered themselves all to be on the same level playing field, then “we are ALL remoties!”.
Spreading the word, including to more people in physical offices, is important to make everyone’s work life more effective. If you’ve any ideas/suggestions, please let me know. And thanks again for the great support in all four summit sessions!
[For a PDF copy of the entire presentation, click here or on the smiley faces! For the sake of my poor blogsite, the much, much, larger keynote files are available on request.]
(This post is unusual, in that I am “reviewing” a book before I have read the final print. Maybe “previewing” is more accurate?)
I’ve had the great fortune of repeatedly training on the mat with many world-class Aikido practitioners. Two of these, Linda Holiday Sensei (6th dan, runs Aikido of Santa Cruz dojo) and Motomichi Anno Sensei (8th dan, direct student of OSensei the founder of Aikido, recipient of Japan’s Distinguished Service Award, and ran the Kumano Juku Dojo in Shingu, Japan for ~40 years.) have just published a book they have been working on for literally *years*.
This is exciting.
Training with both of these authors has been pivotal for me, on and off the mat. Over the years, I’ve heard readings of various passages, and even been present for some interviews gathering source material. All random snippets, in various drafts, and out of sequence, which makes it hard to predict how the final form will pull together. What I’ve heard so far has been very meaningful to me, so I’m eager to get my hands on a signed 1st edition of this book on Saturday.
More info in the San Francisco Chronicle’s recent interview with Linda Holiday or the book’s official website. If you are interested, there’s a (free!) open-to-the-public book reading by Linda Holiday with live Aikido demonstrations in San Francisco this Saturday.
Onegai shimasu!
- September was special. Our previous record was to run 52,000 test jobs in a 24 hour day on 27aug… impressive by any standards. But in September, we blew past that record twice: we handled 66,456 test jobs on 11sep, and then we handled 73,453 test jobs in a 24 hour day on 17sep. Stunning, simply stunning.
- #checkins-per-month: We had 7,580 checkins in September 2013. This is ~2% below last month’s record 7,771 checkins.
- #checkins-per-day: We hit 416 checkins on 03sep; impressive, yet still below our previous single-day record of 443 checkins on 26aug. During September, yet again all working days were over 200 checkins per day… In fact, if you exclude Friday 08sep and Monday 15sep when people were traveling for the b2g workweek, our weekday load throughout the month was 285 checkins per day, or higher. 19-of-30 days had over 250 checkins-per-day, 13-of-30 days had over 300 checkins-per-day. 2-of-30 days had over 400 checkins-per-day.
- #checkins-per-hour: Checkins are still mostly mid-day PT/afternoon ET. For 8 of every 24 hours, we sustained over 12 checkins per hour. Our heaviest load time this month was 10am-11am PT, with 15.73 checkins-per-hour (a checkin every 3.8 min), which is a new record.
mozilla-inbound, b2g-inbound, fx-team:
- mozilla-inbound continues to be heavily used as an integration branch. As developers start to use other *-inbound branches, use of mozilla-inbound, at 17.4% of all checkins, is still much reduced from typical, yet only slightly higher than last month, which was the lowest ever usage of mozilla-inbound. The use of multiple *-inbounds is clearly helping relieve bottlenecks (see pie chart below) and the congestion on mozilla-inbound is being reduced significantly as people switch to using other *-inbound branches instead. This also reduces stress and backlog headaches on sheriffs, which is good. All very cool to see and a definite part of the reason we continue to hit new records this month.
- b2g-inbound continues to be a great success, with 8.8% of this month’s checkins landing here, a slight increase over last month’s 8.2% and further evidence that use of this branch is stabilizing.
- With sheriff coverage, fx-team is clearly a very active third place for developers, with 5.5% of checkins this month. This is a slight drop from last month, but use also appears to be stabilizing. Having sheriff coverage clearly made a difference.
- The combined total of these 3 integration branches is 31.7%, which is fairly consistent. Put another way, sheriff moderated branches consistently handle approx 1/3 of all checkins.
mozilla-aurora, mozilla-beta, mozilla-b2g18, gaia-central:
Of our total monthly checkins:
- 2.3% landed into mozilla-central, slightly higher than last month. As usual, very few people land directly on mozilla-central these days, when there are sheriff-assisted branches available instead.
- 1.7% landed into mozilla-aurora, about the same as last month.
- 0.7% landed into mozilla-beta, slightly lower than last month.
- 0.3% landed into mozilla-b2g18, slightly lower than last month. This should quickly drop to zero as we move to gecko26.
- Note: gaia-central, and all other gaia-* branches, are not counted here anymore. For details, see here.
misc other details:
As usual, our build pool handled the load well, with >95% of all builds consistently being started within 15mins. Our test pool is getting up to par and we’re seeing more test jobs being handled with better response times. The peak per-day test load for September was insane: our previous record was 52,000 test jobs on 27aug… which we blew right past when we handled 66,456 test jobs on 11sep, and then again when we handled 73,453 test jobs a week later on 17sep. Still more work to be done here, but very encouraging progress.
As always, if you know of any test suites that no longer need to be run per-checkin, please let us know so we can immediately reduce the load a little. Also, if you know of any test suites which are perma-orange, and hidden on tbpl.m.o, please let us know – that’s the worst of both worlds – using up scarce CPU time and not being displayed for people to make use of. We’ll make sure to file bugs to get tests fixed – or disabled – every little bit helps put scarce test CPU to better use.
Summit is coming.
Summit is exciting. With so many people scattered around the world, this gathering of Mozillians… this summit… is a rare chance for people to get together face-to-face.
Summit is scary and stressful. It is a total change in location and routine, which can be stressful. It forces everyone into a high-volume-of-contact… not anonymous contact like a crowded street in New York… high-volume-and-intense-contact with lots of people you work with, closely or intermittently, on a shared project that we all care about passionately. It’s exciting. It’s invigorating. It’s overwhelming. In the coming days, even extrovert people will need a quiet time or two… more introverted people doubly so. Add some small factors like: jet-lag, sleep deprivation, language barriers, change-of-routine, and it’s easy for people to get frayed at the edges.
With that context, I’d like to offer the following thoughts:
- Respect of self (1): Despite all the great things going on, keep a mental track of how *you* are doing. If you are feeling stressed/overwhelmed with everything, take a few minutes to walk outside in the sunshine, read a book in your room, go for a jog in the sunshine, call family back home, go for a swim… everyone is different, so do whatever works for you. I’ve done this at every conference I attend over the years, and it really helps me recenter. It also lets me mentally process all the inputs so far, and gives me time to remind myself what is important that I still need to do when I go back in the crowd. After all, we’re all here to connect.
- Respect of self (2): Don’t quietly put up with unacceptable behavior. If a conversation or a situation is making you uncomfortable, make a mental note of it, regardless of whether it’s directed at you, or something you observe/hear being directed at someone else. Politely say “I’m starting to feel uncomfortable“. It may not be intended, so this is a great way to give others a chance to quickly learn, self-correct and grow (without risking offense to either party). If that doesn’t fix things, politely excuse yourself with “That’s an interesting opinion, but I have to leave now” and disengage. Some people, at Mozilla and elsewhere, enjoy trolling… but keep in mind that you don’t have to feed the trolls if you don’t want to. Nicole’s presentation is just great, I re-watch it often. If you think the situation merits it, please do let any of the Mozilla Conductors or Site Hosts know.
- Respect of others: Lively, honest, debate is a great way for smart people to quickly solve complex problems. When it works, it’s magic. True magic. And to be encouraged. Sometimes, however, these can spiral out-of-control. The difference, as far as I can tell, is respect. Don’t impose your thoughts/intentions where they are not welcome. To be clear, I’m not saying that people should stop having honest conversations, and suddenly be all super-politically-correct. Just be respectful. If you find yourself in a heated discussion with someone, and you’re not getting anywhere, try the following:
- “Wait, wait, wait. We’re repeating ourselves here, and clearly not agreeing, so let’s take a pause and reset.”
- Then wait a few seconds, and take a few deep breaths!
- “OK, to reset context, can we assume that we both are professionals in our areas? Can we assume that we both want the best outcome for Mozilla? Agree?” (It is important to have these be asked, and answered, honestly and with “yes” from both! If you cannot even agree to this, you’ve got a different situation to resolve.)
- Once you get a “yes”, then speaking calmly, ask “ok, so using different words, can you tell me why you care about xxxxxx? And I promise to not say *anything* until you tell me you’re finished. Then afterwards, we’ll switch, so I’ll speak without interruption, and you listen. But you first…“.
- Listen. Take notes if it helps. Allow the other person time to pause and collect their thoughts without interruption. Literally no interrupting.
- When they finally say they’re all done, then say “ok, here’s what I heard you say – is this correct?” and paraphrase it all back to them. Adjust for corrections and repeat if needed, but make sure to state the full end-to-end one last time after last corrections, so they clearly hear you say their entire opinion/concerns *once* perfectly, in one uncorrected pass.
- Now, reverse roles. “ok, now it’s my turn to speak without interruption, while you listen“.
- Make sure they can paraphrase back to you, accurately like you did for them.
- Almost every time I do this, we instantly find that we were actually solving unrelated *different* problems… problems which just happened to overlap in one small area. No wonder we couldn’t agree! We were two smart professional people who were each actually solving very different problems. This tactic helped debug *which* problem we were each solving, and typically cleared things up right away.
- Respect of Mozilla: I didn’t create Mozilla, but I’m super glad that Mitchell, Brendan and others did years ago. Imagine for a second… if this was an organization that you had created, and nurtured over the years, how would you want yourself, and everyone else, to treat each other? With that thought in mind, go out into the great crowd and engage.
Hopefully people find these thoughts helpful. Disclaimer, this is an area I’m still working on myself, so any feedback/suggestions/improvements are very very welcome… either here or in email or (yes!) in person!
Travel safe, see (some of) you soon, and let’s have a great Summit!
ps: Some additional links I found helpful are: Bob Sutton’s No Asshole Rule and Laura Forrest’s “5 Hacks to make the most of Summit”, bsmedberg’s “Mozilla Summit: Listen Hard”… and yes, of course, I would be remiss to not include this great song:
(Followup from Tuesday’s Platform meeting as well as bhearsum’s blog post#1 and blog post#2.)
Updating our users is something we do very carefully.
Updating *how* we update our users is something we do very *very* VERY carefully.
On Monday 30sept, nightly users on mozilla-central will be served updates from the new AUS server. Users don’t need to do anything different – it should “just work”. (Of course, if you see any problems with updating, we’d like to hear about it… please file a bug!)
Users on aurora, beta and release channels are *not* being switched over yet. All in good time.
Update servers have to work accurately, securely, consistently and at scale. One of the big scary things that any software company has to do is update the system by which all of the company’s users are served updates. The same is true here at Mozilla. After all, if anything goes wrong, we can’t physically go around to each user’s house/office, everywhere around the world, to fix the problem caused by a bad update to their Firefox or Thunderbird installation. (aside: Mobile-app-only software companies avoid this, by manually uploading to Apple/Google stores, and relying on Apple/Google to do the update distribution for them. This helps small companies that *only* ship mobile apps keep their users up-to-date, but is not without risk. Offtopic, so watch for separate blog post.)
Since before 2007, Mozilla has been serving all updates using AUS (Application Update Service), an update server that was originally written by @Morgamic. Even when I joined Mozilla in 2007, there were ongoing jaded discussions about a repeatedly deferred project called “AUS2 – The next big rewrite”. All to say, the current production AUS code has served (all of Mozilla’s users and Mozilla’s RelEng!) well, even so many years after it was originally written and put into production. Big hat tip to @Morgamic.
As we scaled up the rest of RelEng infrastructure, some early design decisions that used to be fine became trickier for us. Manually updating the live code on the live production server with different version numbers was ok-to-do when Mozilla was back on the old-traditional-slower-release-cadence… but became a fragile concern when we moved to rapid-release-cadence. New requirements like being able to throttle specific operating systems at different update rates. New requirements like custom updates for users of specific custom builds. New requirements like dropping support for specific sub-variants of an OS…
Finally, we found time last year and again earlier this year to dedicate people in RelEng to make concrete progress on this. As we got interrupted by other big externally-facing projects, we’d suspend work until later. Then resume work a few months later. Then suspend again. Then resume again. In so doing, the name and requirements kept evolving: AUS2…AUS3…AUS4…BALROG. More details in bug#832454 and bug#583244. RelEng (and some brave volunteers) have been dogfooding for a while, using Balrog to get updates for nightly mozilla-central builds. We’re ready to roll this out to the next most adventurous population – nightly users on mozilla-central.
This has been the result of years of planning, and quiet focused coding by bhearsum, rail, catlee, nthomas over multiple quarters. This is a ReallyBigDeal ™ – for RelEng, for all Mozilla developers, and for Mozilla’s users – so it is exciting to see come into production.