04 Jun 2014
In April 1997, Netscape ReleaseEngineers wrote, and started running, the world’s first? second? continuous integration server. Now, just over 17 years later, in May 2014, the tinderbox server was finally turned off. Permanently.
This is a historic moment for Mozilla, and for the software industry in general, so I thought people might find it interesting to get some background, as well as outline the assumptions we changed when designing the replacement Continuous Integration and Release Engineering infrastructure now in use at Mozilla.
At Netscape, developers would checkin a code change, and then go home at night, without knowing if their change broke anything. There were no builds during the day.
Instead, developers would have to wait until the next morning to find out if their change caused any problems. At 10am each morning, Netscape RelEng would gather all the checkins from the previous day, and manually start to build. Even if a given individual change was “good”, it was frequently possible for a combination of “good” changes to cause problems. In fact, as this was the first time that all the checkins from the previous day were compiled together, or “integrated” together, surprise build breakages were common.
This integration process was so fragile that all developers who did checkins in a day had to be in the office before 10am the next morning to immediately help debug any problems that arose with the build. Only after the 10am build completed successfully were Netscape developers allowed to start checking-in more code changes on top of what was now proven to be good code. If you were lucky, this 10am build worked first time, took “only” a couple of hours, and allowed new checkins to start lunchtime-ish. However, this 10am build was frequently broken, causing checkins to remain blocked until the gathered developers and release engineers figured out which change caused the problem and fixed it.
Fixing build bustages like this took time, and lots of people, to figure out which of all the checkins that day caused the problem. Worst case, some checkins were fine by themselves, but cause problems when combined with, or integrated with, other changes, so even the best-intentioned developer could still “break the build” in non-obvious ways. Sometimes, it could take all day to debug and fix the build problem – no new checkins happened on those days, halting all development for the entire day. More rare, but not unheard of, was that the build bustage halted development for multiple days in a row. Obviously, this was disruptive to the developers who had landed a change, to the other developers who were waiting to land a change, and to the Release Engineers in the middle of it all…. With so many people involved, this was expensive to the organization in terms of salary as well as opportunity cost.
If you could do builds twice a day, you only had half-as-many changes to sort through and detangle, so you could more quickly identify and fix build problems. But doing builds more frequently would also be disruptive because everyone had to stop and help manually debug-build-problems twice as often. How to get out of this vicious cycle?
In these desperate times, Netscape RelEng built a system that grabbed the latest source code, generated a build, displayed the results in a simple linear time-sorted format on a webpage where everyone could see status, and then start again… grab the latest source code, build, post status… again. And again. And again. Not just once a day. At first, this was triggered every hour, hence the phrase “hourly build”, but that was quickly changed to starting a new build immediately after finishing the previous build.
All with no human intervention.
By integrating all the checkins and building continuously like this throughout the day, it meant that each individual build contained fewer changes to detangle if problems arose. By sharing the results on a company-wide-visible webserver, it meant that any developer (not just the few Release Engineers) could now help detangle build problems.
What do you call a new system that continuously integrates code checkins? Hmmm… how about “a continuous integration server“?! Good builds were colored “green”. The vertical columns of green reminded people of trees, giving rise to the phrase “the tree is green” when all builds looked good and it was safe for developers to land checkins. Bad builds were colored “red”, and gave rise to “the tree is burning” or “the tree is closed”. As builds would break (or “burn” with flames) with seemingly little provocation, the web-based system for displaying all this was called “tinderbox“.
Pretty amazing stuff in 1997, and a pivotal moment for Netscape developers. When Netscape moved to open source Mozilla, all this infrastructure was exposed to the entire industry and the idea spread quickly. This remains a core underlying principle in all the various continuous integration products, and agile / scrum development methodologies in use today. Most people starting a software project in 2014 would first setup a continuous integration system. But in 1997, this was unheard of and simply brilliant.
(From talking to people who were there 17 years ago, there’s some debate about whether this was originally invented at Netscape or inspired by a similar system at SGI that was hardwired into the building’s public announcement system using a synthesized voice to declare: “THE BUILD IS BROKEN. BRENDAN BROKE THE BUILD.” If anyone reading this has additional info, please let me know and I’ll update this post.)
If tinderbox server is so awesome, and worked so well for 17 years, why turn it off? Why not just fix it up and keep it running?
In mid-2007, an important criteria for the reborn Mozilla RelEng group was to significantly scale up Mozilla’s developer infrastructure – not just incrementally, but by orders of magnitude. This was essential if Mozilla was to hire more developers, gather many more community members, tackle a bunch of major initiatives, ship releases more predictably and to have these new additional Mozilla’s developers and community contributors be able to work effectively. When we analyzed how tinderbox worked, we discovered a few assumptions from 1997 no longer applied, and were causing bottlenecks we needed to solve.
1) Need to run multiple jobs-of-the-same-type at a time
2) Build-on-checkin, not build-continuously.
3) Display build results arranged by developer checkin not by time.
1) Need to run multiple jobs-of-the-same-type at a time
The design of this tinderbox waterfall assumed that you only had one job of a given type in progress at a time. For example, one linux32 opt build had to finish before the next linux32 opt build could start.
Mechanically, this was done by having only one machine dedicated to doing linux opt builds, and that one machine could only generate one build at a time. The results from one machine were displayed in one time-sorted column on the website page. If you wanted an additional different type of build, say linux32 debug builds, you needed another dedicate machine displaying results in another dedicated column.
For a small (~15?) number of checkins per day, and a small number of types of builds, this approach works fine. However, when you increase the checkins per day, many “hourly” build has almost as many checkins as Netscape had each day in 1997. By 2007, Mozilla was routinely struggling with multi-hour blockages as developers debugged integration failures.
Instead of having only one machine do linux32 opt builds at a time, we setup a pool of identically configured machines, each able to do a build-per-checkin, even while the previous build was still in progress. In peak load situations, we might still get more-then-one-checkin-per-build, but now we could start the 2nd linux32 opt build, even while the 1st linux32 opt build was still in progress. This got us back to having very small number of checkins, ideally only one checkin, per build… identifying which checkin broke the build, and hence fixing that build, was once again quick and easy.
Another related problem here was that there were ~86 different types of machines, each dedicated to running different types of jobs, on their own OS and each reporting to different dedicated columns on the tinderbox. There was a linux32 opt builder, a linux32 debug builder, a win32 opt builder, etc. This design had two important drawbacks.
Each different type of build took different times to complete. Even if all jobs started at the same time on day1, the continuous looping of jobs of different durations meant that after a while, all the jobs were starting/stopping at different times – which made it hard for a human to look across all the time-sorted waterfall columns to determine if a particular checkin had caused a given problem. Even getting all 86 columns to fit on a screen was a problem.
It also made each of these 86 machines a single point of failure to the entire system, a model which clearly would not scale. Building out pools of identical machines from 86 machines to ~5,500 machines allowed us to generate multiple jobs-of-the-same-type at the same time. It also meant that whenever one of these set-of-identical machines failed, it was not a single point of failure, and did not immediately close the tree, because another identically-configured machine was available to handle that type of work. This allowed people time to correctly diagnose and repair the machine properly before returning it to production, instead of being under time-pressure to find the quickest way to band-aid the machine back to life so the tree could reopen, only to have the machine fail again later when the band-aid repair failed.
All great, but fixing that uncovered the next hidden assumption.
2) Build-per-checkin, not build-continuously.
The “grab latest source code, generated a build, displayed the results” loop of tinderbox never looked to check if anything had actually changed. Tinderbox just started another build – even if nothing had changed.
Having only one machine available to do a given job meant that machine was constantly busy, so this assumption was not initially obvious. And given that the machine was on anyway, what harm in having it doing an unnecessary build or two?
Generating extra builds, even when nothing had changed, complicated the manual what-change-broke-the-build debugging work. It also meant introduced delays when a human actually did a checkin, as a build containing that checkin could only start after the unneccessary-nothing-changed-build-in-progress completed.
Finally, when we changed to having multiple machines run jobs concurrently, having the machines build even when there was no checkin made no sense. We needed to make sure each machine only started building when a new checkin happened, and there was something new to build. This turned into a separate project to build out an enhanced job scheduler system and machine-tracking system which could span multiple 4 physical colos, 3 amazon regions, assign jobs to the appropriate machines, take sick/dead machines out of production, add new machines into rotation, etc.
3) Display build results arranged by developer checkin not by time.
Tinderbox sorted results by time, specifically job-start-time and job-end-time. However, developers typically care about the results of their checkin, and sometimes the results of the checkin that landed just before them.
Further: Once we started generating multiple-jobs-of-the-same-type concurrently, it uncovered another hidden assumption. The design of this cascading waterfall assumed that you only had one build of a given type running at a time; the waterfall display was not designed to show the results of two linux32 opt builds that were run concurrently. As a transition, we hacked our new replacement systems to send tinderbox-server-compatible status for each concurrent builds to the tinderbox server… more observant developers would occasionally see some race-condition bugs with how these concurrent builds were displayed on the one column of the waterfall. These intermittent display bugs were confusing, hard to debug, but usually self corrected.
As we supported more OS, more build-types-per-OS and started to run unittests and perf-tests per platform, it quickly became more and more complex to figure out whether a given change had caused a problem across all the time-sorted-columns on the waterfall display. Complaints about the width of the waterfall not fitting on developers monitors were widespread. Running more and more of these jobs concurrently make deciphering the waterfall even more complex.
Finding a way to collect all the results related to a specific developer’s checkin, and display these results in a meaningful way was crucial. We tried a few ideas, but a community member (Markus Stange) surprised us all by building a prototype server that everyone instantly loved. This new server was called “tbpl”, because it scraped the TinderBox server Push Logs to gather its data.
Over time, there’s been improvements to tbpl.mozilla.org to allow sheriffs to “star” known failures, link to self-service APIs, link to the commits in the repo, link to bugs and most importantly gather all the per-checkin information directly from the buildbot scheduling database we use to schedule and keep track of job status… eliminating the intermittent race-condition bugs when scraping HTML page on tinderbox server. All great, but the user interface has remained basically the same since the first prototype by Markus – developers can easily and quickly see if a developer checkin has caused any bustage.
Fixing these 3 root assumptions in tinderbox.m.o code would be “non-trivial” – basically a re-write – so we instead focused on gracefully transitioning off tinderbox. Since Sept2012, all Mozilla RelEng systems have been off tinderbox.m.o and using tbpl.m.o plus buildbot instead.
Making the Continuous Integration process more efficient has allowed Mozilla to hire more developers who can do more checkins, transition developers from all-on-one-tip-development to multi-project-branch-development, and change the organization from traditional releases to rapid-release model. Game changing stuff. Since 2007, Mozilla has grown the number of employee engineers by a factor of 8, while the number of checkins that developers did has increased by a factor of 21. Infrastructure improvements have outpaced hiring!
On 16 May 2014, with the last Mozilla project finally migrated off tinderbox, so the tinderbox server was powered off. Tinderbox was the first of its kind, and helped changed how the software industry developed software. As much as we can gripe about tinderbox server’s various weaknesses, it has carried Mozilla from 1997 until 2012, and spawned an industry of products that help developers ship better software. Given it’s impact, it feels like we should look for a pedestal to put this on, with a small plaque that says “This changed how software companies develop software, thank you Tinderbox”… As it has been a VM for several years now, maybe this blog post counts as a virtual pedestal?! Regardless, if you are a software developer, and you ever meet any of the original team who built tinderbox, please do thank them.
I’d like to give thanks to some original Netscape folks (Tara Hernandez, Terry Weissman, Lloyd Tabb, Leaf, jwz) as well as aki, brendan, bmoss, chofmann, dmose, myk and rhelmer for their help researching the origins of Tinderbox. Also, thank you to lxt, catlee, bhearsum, rail and others for inviting me back to attend the ceremonial final-powering-off event… After the years of work leading up to this moment, it meant a lot to me to be there at the very end.
ps: the curious can view the cvs commit history for tinderbox webpage here (My favorite is v1.10!) …and the cvs commit history for tinderbox server here (UPDATE: Thanks to @justdave for the additional link.)
pps: When a server has been running for so long, figuring out what other undocumented systems might break when tinderbox is turned off is tricky. Here’s my “upcoming end-of-life” post from 02-apr-2013 when we thought we were nearly done. Surprise dependencies delayed this shutdown several times and frequently uncovered new, non-trivial, projects that had to be migrated. You can see the various loose ends that had to be tracked down in bug#843383, and all the many many linked bugs.
ppps: Here’s what MozillaDeveloperNetwork and Wikipedia have to say about Tinderbox server.
(UPDATE: add links to wikipedia, MDN, and to fix some typos. joduinn 28jun2014)
28 May 2014
HDP2.1 shipped on 22apr2014.
This was the first significant feature release shipped since I joined Hortonworks at the start of the year. There’s lots of interesting new features, and functionality in this HDP2.1 release – already well covered by others in great detail here. Oh, and of course, you can
download it from here.
In this post, I’ll instead focus on some of the behind-the-scenes mechanics. There were lots of major accomplishments in this release, but the ones that really stood out to me were:
1) sim-ship windows and linux.
This was the first HDP release where all OS were built from the same changeset and shipped at the same time. Making this happen was a hectic first priority in January. As well as the plumbing/mechanics within RelEng, it also took lots of coordination changes across different groups within Hortonworks to make this happen. The payoff on this was great. We sim-shipped, which is great and massively important for HWX as a company. Even more importantly, we set things up so we could sim-ship for every HDP2.1-and-above release going forward… and we proved it by sim-shipping the quick followup HDP22.214.171.124 release on 02may2014.
2) adding 5 new components.
HDP2.1 contained 17 components, compared to HDP 2.0 (with 12 components) and HDP 1.3 (with 10 components), making HDP2.1 the largest growth of components ever?!? Oh, and in addition to the new components, every one of the 12 pre-existing components were also significantly updated to newer versions. That meant that each required significant new integration work, new installers on all supported OS (…remember the “sim-ship” goal?). Oh, and we were to ship all this new functionality at the fastest cadence yet.
3) improving support for other trains.
In January, we were learning how to support 3 active trains of code: supporting 1.3 and 2.0 maintenance work, while also building out infrastructure for 2.1 new-product-development-work… even while the 2.1 development work was in progress, which obviously complicated things for developers. Today, we’re supporting 4 active trains: maintenance work for 1.3, 2.0 and 2.1, as well as the 2.2 new-product-development-work. This time, the 2.2 infrastructure was built out and live before developers finished working on 2.1… enabling the developers! Things are not perfect yet, by any means, but today (with 4 trains) feels calmer and more organized then earlier this year (with â€œonlyâ€ 3 trains).
All great improvements to see up close, and all important to us as we scale. Big thanks to everyone for their help… and do stay tuned for even more improvements already underway.
12 Feb 2014
(My life been hectic on several other fronts, so I only just now noticed that I never actually published this blog post. Sorry!!)
On 07-nov-2013, I was invited to present “We are all remoties” in Twilio’s headquarters here in San Francisco as part of their in-house tech talk series.
For context, its worth noting that Twilio is doing great as a company, which means they are hiring. And outgrowing their current space, so one option they were investigating was to keep the current space, and open up a second office elsewhere in the bay area. As they’d always been used to working in the one location, this “split into two offices” was top of everyone’s mind… hence the invitation from Thomas to give this company-wide talk about remoties.
Twilio’s entire office is a large, SOMA-style-warehouse-converted-into-open-plan-offices layout, packed with lots of people. The area I was to present in was their big “common area”, where they typically host company all-hand meetings, Friday socials and other big company-wide events. Quite, quite large. I’ve no idea how many people were there but it felt huge, and was wall-to-wall packed. The size gave an echo-y audio effect off the super-high high concrete ceilings and far-distant bare concrete walls, with a weird couple of structural pillars right in the middle of the room. Despite my best intentions, during the session, I found myself trying to “peer around” the pillars, aware of the people blocked from view.
Its great to see the response from folks when slides in a presentation *exactly* hit onto what is on top-of-their-minds. One section, about companies moving to multiple locations, clearly hit home with everyone… not too surprising, given the context. Another section, about a trusted employee moving out from office to start being a 100% remote employee, hit a very personal note – there was someone in the 2nd row who was a long-trusted employee actually about to embark on this exact change. He got quite the attention from everyone around him, and we stopped everything for a few minutes to talk about his exact situation. As far as I can tell, he found the entire session very helpful, but only time will tell how things work out for him.
The very great interactions, the lively Q+A, and the crowd of questions afterwards were all lots of fun and quite informative.
Big thanks to Thomas Wilsher @ Twilio for putting it all together. I found it a great experience, and the lively discussions before+during+after lead me to believe others did too.
PS: For a PDF copy of the presentation, click on the smiley faces! For the sake of my poor blogsite, the much, much, larger keynote file is available on request.
(Update: fixed broken links. joduinn 26jun2014)
29 Dec 2013
Just before the holiday break, Mitchell and I sat down together to fulfill a long standing promise I made years ago: to have Mitchell start a Firefox release herself.
After starting Mozilla just over 15 years ago, and dealing with all aspects of running a large organization, Mitchell finally kicked off a Firefox release herself last week – for the very first time. Specifically, she was going to start the official release automation for Firefox 27.0 beta2 and Fennec 27.0 beta2.
Timing was tricky. We didn’t want to disrupt the usual beta release cadence, especially just before the holidays. And Mitchell only had 25 minutes free between meetings, so we spent a few minutes saying hi, getting settled, and then we jumped right into the details.
To kick off Firefox and Fennec releases, there are only a handful of fields a human has to fill in for each product. They are (almost) all fairly self-evident, and a good number of the fields are populated by picking-from-a-list, so we made fast progress. The “Gimme a Firefox” and “Gimme a Fennec” buttons caused a laugh!
8 minutes is all it took.
That 8 minutes included the time to explain what each field did, what value should go into each of the various fields, and why. We even took the time to re-verify everything. After all, this was not just a “demo”… this was real. Mozilla shipped this Firefox 27.0b2 and Fennec 27.0b2 to all our real-live beta users before closing down for the holiday break.
Because it was so quick, we had spare time to chat about how much the infrastructure has improved since the Directors meeting in Building S 6.5 years ago when this promise was originally made. Obviously, there’s plenty of complexity involved in shipping a product release – the daily bug triage meetings about what fixes should/shouldn’t be included in a release, the actual landing of code fixes by developers, the manual spot-checking by QA, the press and PR coordination, the list goes on… – but the fact that such a “simple” user interface could trigger release automation running a couple of hundred compute hours across many machines to reliably ship products to millions of users is a note-worthy measure of Mozilla’s Release Engineering infrastructure. Mitchell was suitably impressed!
And then Mitchell left, with a wave and a smile, a few minutes early for her next meeting, while the various Release Engineering systems sprang into life generating builds, localization repacks, and updates for all our users.
We took this photo afterwards to commemorate the event! Thank you, Mitchell!