04 Jun 2014
In April 1997, Netscape ReleaseEngineers wrote, and started running, the world’s first? second? continuous integration server. Now, just over 17 years later, in May 2014, the tinderbox server was finally turned off. Permanently.
This is a historic moment for Mozilla, and for the software industry in general, so I thought people might find it interesting to get some background, as well as outline the assumptions we changed when designing the replacement Continuous Integration and Release Engineering infrastructure now in use at Mozilla.
At Netscape, developers would check in a code change, and then go home at night, without knowing if their change broke anything. There were no builds during the day.
Instead, developers would have to wait until the next morning to find out if their change caused any problems. At 10am each morning, Netscape RelEng would gather all the checkins from the previous day, and manually start to build. Even if a given individual change was “good”, it was frequently possible for a combination of “good” changes to cause problems. In fact, as this was the first time that all the checkins from the previous day were compiled together, or “integrated” together, surprise build breakages were common.
This integration process was so fragile that all developers who did checkins in a day had to be in the office before 10am the next morning to immediately help debug any problems that arose with the build. Only after the 10am build completed successfully were Netscape developers allowed to start checking-in more code changes on top of what was now proven to be good code. If you were lucky, this 10am build worked first time, took “only” a couple of hours, and allowed new checkins to start lunchtime-ish. However, this 10am build was frequently broken, causing checkins to remain blocked until the gathered developers and release engineers figured out which change caused the problem and fixed it.
Fixing build bustages like this took time, and lots of people, to figure out which of all the checkins that day caused the problem. Worst case, some checkins were fine by themselves, but caused problems when combined with, or integrated with, other changes, so even the best-intentioned developer could still “break the build” in non-obvious ways. Sometimes, it could take all day to debug and fix the build problem – no new checkins happened on those days, halting all development for the entire day. More rare, but not unheard of, was that the build bustage halted development for multiple days in a row. Obviously, this was disruptive to the developers who had landed a change, to the other developers who were waiting to land a change, and to the Release Engineers in the middle of it all… With so many people involved, this was expensive to the organization in terms of salary as well as opportunity cost.
If you could do builds twice a day, you only had half-as-many changes to sort through and detangle, so you could more quickly identify and fix build problems. But doing builds more frequently would also be disruptive because everyone had to stop and help manually debug-build-problems twice as often. How to get out of this vicious cycle?
In these desperate times, Netscape RelEng built a system that grabbed the latest source code, generated a build, displayed the results in a simple linear time-sorted format on a webpage where everyone could see status, and then started again… grab the latest source code, build, post status… again. And again. And again. Not just once a day. At first, this was triggered every hour, hence the phrase “hourly build”, but that was quickly changed to starting a new build immediately after finishing the previous build.
All with no human intervention.
By integrating all the checkins and building continuously like this throughout the day, it meant that each individual build contained fewer changes to detangle if problems arose. By sharing the results on a company-wide-visible webserver, it meant that any developer (not just the few Release Engineers) could now help detangle build problems.
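The loop described above can be sketched in a few lines. This is only an illustration, not the original tinderbox code: the function name `continuous_build_loop` and the `fetch_latest` and `build` callables are hypothetical stand-ins for "grab the latest source code" and "generate a build".

```python
from typing import Callable, List, Tuple


def continuous_build_loop(
    fetch_latest: Callable[[], str],
    build: Callable[[str], bool],
    cycles: int,
) -> List[Tuple[str, bool]]:
    """Run `cycles` back-to-back builds; return the status log, newest first.

    Hypothetical sketch: a real server would loop forever and publish
    the log on a web page instead of returning it.
    """
    status_log: List[Tuple[str, bool]] = []
    for _ in range(cycles):
        revision = fetch_latest()             # grab the latest source code
        ok = build(revision)                  # generate a build
        status_log.insert(0, (revision, ok))  # post status, newest on top
        # No waiting here: the next build starts as soon as this one ends.
    return status_log
```

Note that the loop never asks whether the source changed between iterations; it simply builds whatever is latest, every time.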
What do you call a new system that continuously integrates code checkins? Hmmm… how about “a continuous integration server“?! Good builds were colored “green”. The vertical columns of green reminded people of trees, giving rise to the phrase “the tree is green” when all builds looked good and it was safe for developers to land checkins. Bad builds were colored “red”, and gave rise to “the tree is burning” or “the tree is closed”. As builds would break (or “burn” with flames) with seemingly little provocation, the web-based system for displaying all this was called “tinderbox“.
Pretty amazing stuff in 1997, and a pivotal moment for Netscape developers. When Netscape moved to open source Mozilla, all this infrastructure was exposed to the entire industry and the idea spread quickly. This remains a core underlying principle in all the various continuous integration products, and agile / scrum development methodologies in use today. Most people starting a software project in 2014 would first set up a continuous integration system. But in 1997, this was unheard of and simply brilliant.
(From talking to people who were there 17 years ago, there’s some debate about whether this was originally invented at Netscape or inspired by a similar system at SGI that was hardwired into the building’s public address system using a synthesized voice to declare: “THE BUILD IS BROKEN. BRENDAN BROKE THE BUILD.” If anyone reading this has additional info, please let me know and I’ll update this post.)
If tinderbox server is so awesome, and worked so well for 17 years, why turn it off? Why not just fix it up and keep it running?
In mid-2007, an important criterion for the reborn Mozilla RelEng group was to significantly scale up Mozilla’s developer infrastructure – not just incrementally, but by orders of magnitude. This was essential if Mozilla was to hire more developers, gather many more community members, tackle a bunch of major initiatives, ship releases more predictably, and have these additional Mozilla developers and community contributors be able to work effectively. When we analyzed how tinderbox worked, we discovered a few assumptions from 1997 no longer applied, and were causing bottlenecks we needed to solve.
1) Need to run multiple jobs-of-the-same-type at a time
2) Build-per-checkin, not build-continuously.
3) Display build results arranged by developer checkin not by time.
1) Need to run multiple jobs-of-the-same-type at a time
The design of this tinderbox waterfall assumed that you only had one job of a given type in progress at a time. For example, one linux32 opt build had to finish before the next linux32 opt build could start.
Mechanically, this was done by having only one machine dedicated to doing linux32 opt builds, and that one machine could only generate one build at a time. The results from one machine were displayed in one time-sorted column on the website page. If you wanted an additional different type of build, say linux32 debug builds, you needed another dedicated machine displaying results in another dedicated column.
For a small (~15?) number of checkins per day, and a small number of types of builds, this approach works fine. However, when you increase the checkins per day, many “hourly” builds have almost as many checkins as Netscape had each day in 1997. By 2007, Mozilla was routinely struggling with multi-hour blockages as developers debugged integration failures.
Instead of having only one machine do linux32 opt builds at a time, we set up a pool of identically configured machines, each able to do a build-per-checkin, even while the previous build was still in progress. In peak load situations, we might still get more-than-one-checkin-per-build, but now we could start the 2nd linux32 opt build, even while the 1st linux32 opt build was still in progress. This got us back to having a very small number of checkins, ideally only one checkin, per build… identifying which checkin broke the build, and hence fixing that build, was once again quick and easy.
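To make the effect of pooling concrete, here is a toy model, not anything taken from Mozilla's systems. The function name `checkins_per_build`, the fixed build duration, and the rule that a build picks up every checkin waiting at the moment it starts are all simplifying assumptions.

```python
from typing import List


def checkins_per_build(
    checkin_times: List[int], build_minutes: int, machines: int
) -> List[int]:
    """Return how many checkins end up in each build (toy model).

    Assumption: a build starts as soon as a machine is free and at least
    one checkin is waiting, and it contains every waiting checkin.
    """
    free_at = [0] * machines          # minute at which each machine is idle
    pending = sorted(checkin_times)   # checkins not yet in any build
    builds: List[int] = []
    while pending:
        # Pick the machine that becomes free first.
        m = min(range(machines), key=lambda i: free_at[i])
        start = max(free_at[m], pending[0])
        batch = [t for t in pending if t <= start]
        pending = pending[len(batch):]
        builds.append(len(batch))
        free_at[m] = start + build_minutes
    return builds
```

With a checkin every 10 minutes for an hour and 60-minute builds, `checkins_per_build([0, 10, 20, 30, 40, 50], 60, 1)` gives `[1, 5]`, while a pool of six machines gives `[1, 1, 1, 1, 1, 1]`: one checkin per build, which is what makes the culprit easy to identify.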
Another related problem here was that there were ~86 different types of machines, each dedicated to running different types of jobs, on their own OS and each reporting to different dedicated columns on the tinderbox. There was a linux32 opt builder, a linux32 debug builder, a win32 opt builder, etc. This design had two important drawbacks.
Each different type of build took different times to complete. Even if all jobs started at the same time on day1, the continuous looping of jobs of different durations meant that after a while, all the jobs were starting/stopping at different times – which made it hard for a human to look across all the time-sorted waterfall columns to determine if a particular checkin had caused a given problem. Even getting all 86 columns to fit on a screen was a problem.
It also made each of these 86 machines a single point of failure to the entire system, a model which clearly would not scale. Building out pools of identical machines from 86 machines to ~5,500 machines allowed us to generate multiple jobs-of-the-same-type at the same time. It also meant that whenever one of these set-of-identical machines failed, it was not a single point of failure, and did not immediately close the tree, because another identically-configured machine was available to handle that type of work. This allowed people time to correctly diagnose and repair the machine properly before returning it to production, instead of being under time-pressure to find the quickest way to band-aid the machine back to life so the tree could reopen, only to have the machine fail again later when the band-aid repair failed.
All great, but fixing that uncovered the next hidden assumption.
2) Build-per-checkin, not build-continuously.
The “grab latest source code, generate a build, display the results” loop of tinderbox never looked to check if anything had actually changed. Tinderbox just started another build – even if nothing had changed.
Having only one machine available to do a given job meant that machine was constantly busy, so this assumption was not initially obvious. And given that the machine was on anyway, what harm in having it doing an unnecessary build or two?
Generating extra builds, even when nothing had changed, complicated the manual what-change-broke-the-build debugging work. It also introduced delays when a human actually did a checkin, as a build containing that checkin could only start after the unnecessary-nothing-changed-build-in-progress completed.
Finally, when we changed to having multiple machines run jobs concurrently, having the machines build even when there was no checkin made no sense. We needed to make sure each machine only started building when a new checkin happened, and there was something new to build. This turned into a separate project to build out an enhanced job scheduler system and machine-tracking system which could span 4 physical colos and 3 amazon regions, assign jobs to the appropriate machines, take sick/dead machines out of production, add new machines into rotation, etc.
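The core of the build-per-checkin idea is a single comparison: only schedule work when the latest revision differs from the one last built. The sketch below shows just that check; the function name `builds_to_schedule` is hypothetical, and it is not a description of the real scheduler, which also had to handle machine assignment and health.

```python
from typing import List, Optional


def builds_to_schedule(observed_revisions: List[str]) -> List[str]:
    """Given the revision seen at each poll, return the revisions to build.

    Hypothetical sketch: a revision is scheduled only when it differs
    from the last one scheduled, so idle polls trigger nothing.
    """
    scheduled: List[str] = []
    last_built: Optional[str] = None
    for revision in observed_revisions:
        if revision != last_built:   # something new to build
            scheduled.append(revision)
            last_built = revision
        # else: nothing changed, leave the machines idle
    return scheduled
```

For example, six polls that see `["a", "a", "b", "b", "b", "c"]` result in only three builds, for `a`, `b` and `c`, where a build-continuously loop would have run six.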
3) Display build results arranged by developer checkin not by time.
Tinderbox sorted results by time, specifically job-start-time and job-end-time. However, developers typically care about the results of their checkin, and sometimes the results of the checkin that landed just before them.
Further: Once we started generating multiple-jobs-of-the-same-type concurrently, it uncovered another hidden assumption. The design of this cascading waterfall assumed that you only had one build of a given type running at a time; the waterfall display was not designed to show the results of two linux32 opt builds that were run concurrently. As a transition, we hacked our new replacement systems to send tinderbox-server-compatible status for each concurrent build to the tinderbox server… more observant developers would occasionally see some race-condition bugs with how these concurrent builds were displayed on the one column of the waterfall. These intermittent display bugs were confusing, hard to debug, but usually self-corrected.
As we supported more OS, more build-types-per-OS and started to run unittests and perf-tests per platform, it quickly became more and more complex to figure out whether a given change had caused a problem across all the time-sorted-columns on the waterfall display. Complaints about the width of the waterfall not fitting on developers’ monitors were widespread. Running more and more of these jobs concurrently made deciphering the waterfall even more complex.
Finding a way to collect all the results related to a specific developer’s checkin, and display these results in a meaningful way was crucial. We tried a few ideas, but a community member (Markus Stange) surprised us all by building a prototype server that everyone instantly loved. This new server was called “tbpl”, because it scraped the TinderBox server Push Logs to gather its data.
Over time, there have been improvements to tbpl.mozilla.org to allow sheriffs to “star” known failures, link to self-service APIs, link to the commits in the repo, link to bugs and most importantly gather all the per-checkin information directly from the buildbot scheduling database we use to schedule and keep track of job status… eliminating the intermittent race-condition bugs when scraping HTML pages on the tinderbox server. All great, but the user interface has remained basically the same since the first prototype by Markus – developers can easily and quickly see if a developer checkin has caused any bustage.
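The change from a time-sorted view to a per-checkin view amounts to regrouping the same job results under a different key. The sketch below illustrates that regrouping only; it is not how tbpl is implemented, and the record fields (`revision`, `job`, `status`) and the function name `group_by_checkin` are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_checkin(job_results: List[Dict[str, str]]) -> Dict[str, Dict[str, str]]:
    """Collect job results under the checkin (revision) they belong to.

    Hypothetical sketch: each record is a dict with 'revision', 'job'
    and 'status' keys, in whatever order the jobs happened to finish.
    """
    grouped: Dict[str, Dict[str, str]] = defaultdict(dict)
    for result in job_results:
        grouped[result["revision"]][result["job"]] = result["status"]
    return dict(grouped)
```

Given results that arrive interleaved by finish time, a developer can then look up one revision and see every job for it in a single row, regardless of when each job started or ended.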
Fixing these 3 root assumptions in tinderbox.m.o code would be “non-trivial” – basically a re-write – so we instead focused on gracefully transitioning off tinderbox. Since Sept2012, all Mozilla RelEng systems have been off tinderbox.m.o and using tbpl.m.o plus buildbot instead.
Making the Continuous Integration process more efficient has allowed Mozilla to hire more developers who can do more checkins, transition developers from all-on-one-tip-development to multi-project-branch-development, and change the organization from traditional releases to a rapid-release model. Game changing stuff. Since 2007, Mozilla has grown the number of employee engineers by a factor of 8, while the number of checkins that developers did has increased by a factor of 21. Infrastructure improvements have outpaced hiring!
On 16 May 2014, with the last Mozilla project finally migrated off tinderbox, the tinderbox server was powered off. Tinderbox was the first of its kind, and helped change how the software industry developed software. As much as we can gripe about tinderbox server’s various weaknesses, it has carried Mozilla from 1997 until 2012, and spawned an industry of products that help developers ship better software. Given its impact, it feels like we should look for a pedestal to put this on, with a small plaque that says “This changed how software companies develop software, thank you Tinderbox”… As it has been a VM for several years now, maybe this blog post counts as a virtual pedestal?! Regardless, if you are a software developer, and you ever meet any of the original team who built tinderbox, please do thank them.
I’d like to give thanks to some original Netscape folks (Tara Hernandez, Terry Weissman, Lloyd Tabb, Leaf, jwz) as well as aki, brendan, bmoss, chofmann, dmose, myk and rhelmer for their help researching the origins of Tinderbox. Also, thank you to lxt, catlee, bhearsum, rail and others for inviting me back to attend the ceremonial final-powering-off event… After the years of work leading up to this moment, it meant a lot to me to be there at the very end.
ps: the curious can view the cvs commit history for tinderbox webpage here (My favorite is v1.10!) …and the cvs commit history for tinderbox server here (UPDATE: Thanks to @justdave for the additional link.)
pps: When a server has been running for so long, figuring out what other undocumented systems might break when tinderbox is turned off is tricky. Here’s my “upcoming end-of-life” post from 02-apr-2013 when we thought we were nearly done. Surprise dependencies delayed this shutdown several times and frequently uncovered new, non-trivial, projects that had to be migrated. You can see the various loose ends that had to be tracked down in bug#843383, and all the many many linked bugs.
(UPDATE: add links to wikipedia, MDN, and to fix some typos. joduinn 28jun2014)