21 Sep 2014
Found this earlier this month while on the way to work. The color scheme really threw me off, so at first I couldn’t even tell it was a Jaguar. I remain speechless.
05 Sep 2014
“xkcd: volume 0” by Randall Munroe
What can I say? After all the years of reading xkcd.com, buying the book seemed like an obvious “huh, how did I not buy this already” moment.
This was a great wander down memory lane. I found a great many of my favorite xkcd comics, including bobby-drop-tables (therapeutic for anyone with an apostrophe in their surname!), locked-out-of-house, i’m-compiling-code and “the one that makes every Release Engineer I know cringe”.
Somehow, there were even some I’d never seen before, a very happy discovery: chess-coaster (which in turn inspired real life http://xkcd.com/chesscoaster!), the why-i’m-barred-from-speaking-at-crypto-conferences series, girls-on-the-internet, ninjas-vs-stallman, counting sheep…
All in all, a great fun read, and I found the “extra” sidebar cartoons equally fun… especially the yakshaver! If you like xkcd, and don’t already have this book, go get it.
ps: He’s got a new book coming out in a few days, a book tour in progress, and a really subtle turtles-all-the-way-down comic which nudges about the new book… if you look *really* closely! I’m looking forward to getting my hands on it!
01 Sep 2014
After all the fun reading “Meanwhile in San Francisco”, I looked to see if this duo had co-written any other books. Sure enough, they had.
“Lost Cat” tells the true story of how an urban cat owner (one of the authors) loses her cat, then has the cat casually walk back in the door weeks later healthy and well. The book details various experiments the authors did using GPS trackers and tiny “CatCam” cameras to figure out where her cat actually went. Overlaying that data onto Google Maps surprised them both – they never knew their cats roamed so far and wide across the city. The detective work they did to track down and then meet with “Cat Stealer A” and “Cat Stealer B” made for a fun read… Just like “Meanwhile in San Francisco”, the illustrations are all paintings. Literally. My all-time favorite painting of any cat ever is on page 7.
A fun read… and a great gift to any urban cat owners you know.
04 Jun 2014
In April 1997, Netscape Release Engineers wrote, and started running, the world’s first? second? continuous integration server. Now, just over 17 years later, in May 2014, the tinderbox server was finally turned off. Permanently.
This is a historic moment for Mozilla, and for the software industry in general, so I thought people might find it interesting to get some background, as well as outline the assumptions we changed when designing the replacement Continuous Integration and Release Engineering infrastructure now in use at Mozilla.
At Netscape, developers would check in a code change, and then go home at night, without knowing if their change broke anything. There were no builds during the day.
Instead, developers would have to wait until the next morning to find out if their change caused any problems. At 10am each morning, Netscape RelEng would gather all the checkins from the previous day, and manually start to build. Even if a given individual change was “good”, it was frequently possible for a combination of “good” changes to cause problems. In fact, as this was the first time that all the checkins from the previous day were compiled together, or “integrated” together, surprise build breakages were common.
This integration process was so fragile that all developers who did checkins in a day had to be in the office before 10am the next morning to immediately help debug any problems that arose with the build. Only after the 10am build completed successfully were Netscape developers allowed to start checking-in more code changes on top of what was now proven to be good code. If you were lucky, this 10am build worked first time, took “only” a couple of hours, and allowed new checkins to start lunchtime-ish. However, this 10am build was frequently broken, causing checkins to remain blocked until the gathered developers and release engineers figured out which change caused the problem and fixed it.
Fixing build bustages like this took time, and lots of people, to figure out which of all the checkins that day caused the problem. Worst case, some checkins were fine by themselves, but caused problems when combined with, or integrated with, other changes, so even the best-intentioned developer could still “break the build” in non-obvious ways. Sometimes, it could take all day to debug and fix the build problem – no new checkins happened on those days, halting all development for the entire day. More rare, but not unheard of, was that the build bustage halted development for multiple days in a row. Obviously, this was disruptive to the developers who had landed a change, to the other developers who were waiting to land a change, and to the Release Engineers in the middle of it all… With so many people involved, this was expensive to the organization in terms of salary as well as opportunity cost.
If you could do builds twice a day, you only had half-as-many changes to sort through and detangle, so you could more quickly identify and fix build problems. But doing builds more frequently would also be disruptive because everyone had to stop and help manually debug-build-problems twice as often. How to get out of this vicious cycle?
In these desperate times, Netscape RelEng built a system that grabbed the latest source code, generated a build, displayed the results in a simple linear time-sorted format on a webpage where everyone could see status, and then started again… grab the latest source code, build, post status… again. And again. And again. Not just once a day. At first, this was triggered every hour, hence the phrase “hourly build”, but that was quickly changed to starting a new build immediately after finishing the previous build.
All with no human intervention.
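The loop described above can be sketched in a few lines. This is only an illustration, not the original tinderbox code: the names `fetch_latest`, `build` and `publish` are hypothetical stand-ins for the checkout, compile and status-page steps, and `max_builds` exists only so the sketch terminates.

```python
from typing import Callable, List, Tuple


def continuous_build_loop(
    fetch_latest: Callable[[], str],
    build: Callable[[str], bool],
    publish: Callable[[str, bool], None],
    max_builds: int,
) -> None:
    """Grab the latest source, build it, post the status, and start again."""
    for _ in range(max_builds):
        revision = fetch_latest()      # grab the latest source code
        succeeded = build(revision)    # generate a build
        publish(revision, succeeded)   # post status where everyone can see it
        # No waiting and no human in the loop: the next build starts immediately.


# Toy usage with in-memory stand-ins for the real steps (all hypothetical).
revisions = iter(["r1", "r2", "r3"])
status_page: List[Tuple[str, str]] = []

continuous_build_loop(
    fetch_latest=lambda: next(revisions),
    build=lambda rev: rev != "r2",     # pretend r2 breaks the build
    publish=lambda rev, ok: status_page.append((rev, "green" if ok else "red")),
    max_builds=3,
)
```

After the run, `status_page` holds one time-ordered entry per build, which is roughly what a single waterfall column displayed.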
By integrating all the checkins and building continuously like this throughout the day, it meant that each individual build contained fewer changes to detangle if problems arose. By sharing the results on a company-wide-visible webserver, it meant that any developer (not just the few Release Engineers) could now help detangle build problems.
What do you call a new system that continuously integrates code checkins? Hmmm… how about “a continuous integration server”?! Good builds were colored “green”. The vertical columns of green reminded people of trees, giving rise to the phrase “the tree is green” when all builds looked good and it was safe for developers to land checkins. Bad builds were colored “red”, and gave rise to “the tree is burning” or “the tree is closed”. As builds would break (or “burn” with flames) with seemingly little provocation, the web-based system for displaying all this was called “tinderbox”.
Pretty amazing stuff in 1997, and a pivotal moment for Netscape developers. When Netscape moved to open source Mozilla, all this infrastructure was exposed to the entire industry and the idea spread quickly. This remains a core underlying principle in all the various continuous integration products, and agile / scrum development methodologies in use today. Most people starting a software project in 2014 would first set up a continuous integration system. But in 1997, this was unheard of and simply brilliant.
(From talking to people who were there 17 years ago, there’s some debate about whether this was originally invented at Netscape or inspired by a similar system at SGI that was hardwired into the building’s public announcement system using a synthesized voice to declare: “THE BUILD IS BROKEN. BRENDAN BROKE THE BUILD.” If anyone reading this has additional info, please let me know and I’ll update this post.)
If tinderbox server is so awesome, and worked so well for 17 years, why turn it off? Why not just fix it up and keep it running?
In mid-2007, an important criterion for the reborn Mozilla RelEng group was to significantly scale up Mozilla’s developer infrastructure – not just incrementally, but by orders of magnitude. This was essential if Mozilla was to hire more developers, gather many more community members, tackle a bunch of major initiatives, ship releases more predictably and have these additional Mozilla developers and community contributors be able to work effectively. When we analyzed how tinderbox worked, we discovered a few assumptions from 1997 no longer applied, and were causing bottlenecks we needed to solve.
1) Need to run multiple jobs-of-the-same-type at a time
2) Build-on-checkin, not build-continuously.
3) Display build results arranged by developer checkin not by time.
1) Need to run multiple jobs-of-the-same-type at a time
The design of this tinderbox waterfall assumed that you only had one job of a given type in progress at a time. For example, one linux32 opt build had to finish before the next linux32 opt build could start.
Mechanically, this was done by having only one machine dedicated to doing linux opt builds, and that one machine could only generate one build at a time. The results from one machine were displayed in one time-sorted column on the website page. If you wanted an additional different type of build, say linux32 debug builds, you needed another dedicated machine displaying results in another dedicated column.
For a small (~15?) number of checkins per day, and a small number of types of builds, this approach works fine. However, as the number of checkins per day increased, many “hourly” builds contained almost as many checkins as Netscape had each day in 1997. By 2007, Mozilla was routinely struggling with multi-hour blockages as developers debugged integration failures.
Instead of having only one machine do linux32 opt builds at a time, we set up a pool of identically configured machines, each able to do a build-per-checkin, even while the previous build was still in progress. In peak load situations, we might still get more-than-one-checkin-per-build, but now we could start the 2nd linux32 opt build, even while the 1st linux32 opt build was still in progress. This got us back to having a very small number of checkins, ideally only one checkin, per build… identifying which checkin broke the build, and hence fixing that build, was once again quick and easy.
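As a rough sketch of the pooling idea, assuming a hypothetical `BuilderPool` class (this is not Mozilla’s actual scheduler code): each checkin gets its own build while an idle machine exists, and checkins only get coalesced into one build when every machine in the pool is busy.

```python
from collections import deque
from typing import Deque, Dict, List


class BuilderPool:
    """Hypothetical pool of identically configured machines for one job type."""

    def __init__(self, machines: List[str]) -> None:
        self.idle: Deque[str] = deque(machines)
        self.running: Dict[str, List[str]] = {}  # machine -> checkins in its build
        self.pending: List[str] = []             # checkins waiting for a machine

    def on_checkin(self, checkin: str) -> None:
        self.pending.append(checkin)
        self._dispatch()

    def on_build_finished(self, machine: str) -> None:
        del self.running[machine]
        self.idle.append(machine)
        self._dispatch()

    def _dispatch(self) -> None:
        # Start a build only when there is both work and a free machine.
        if self.pending and self.idle:
            machine = self.idle.popleft()
            # Normally this is a single checkin; under peak load it is
            # every checkin that queued up while the pool was busy.
            self.running[machine] = self.pending
            self.pending = []


# Toy usage: two machines, four checkins arriving faster than builds finish.
pool = BuilderPool(["m1", "m2"])
pool.on_checkin("A")           # m1 builds A alone
pool.on_checkin("B")           # m2 builds B alone, while A is still building
pool.on_checkin("C")           # no idle machine: C waits
pool.on_checkin("D")           # D waits alongside C
pool.on_build_finished("m1")   # m1 is free again and builds C+D together
```

With one machine per job type, every checkin after A would have queued behind it; with the pool, only the overflow (C and D here) shares a build.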
Another related problem here was that there were ~86 different types of machines, each dedicated to running different types of jobs, on their own OS and each reporting to different dedicated columns on the tinderbox. There was a linux32 opt builder, a linux32 debug builder, a win32 opt builder, etc. This design had two important drawbacks.
Each different type of build took different times to complete. Even if all jobs started at the same time on day1, the continuous looping of jobs of different durations meant that after a while, all the jobs were starting/stopping at different times – which made it hard for a human to look across all the time-sorted waterfall columns to determine if a particular checkin had caused a given problem. Even getting all 86 columns to fit on a screen was a problem.
It also made each of these 86 machines a single point of failure to the entire system, a model which clearly would not scale. Building out pools of identical machines from 86 machines to ~5,500 machines allowed us to generate multiple jobs-of-the-same-type at the same time. It also meant that whenever one of these set-of-identical machines failed, it was not a single point of failure, and did not immediately close the tree, because another identically-configured machine was available to handle that type of work. This allowed people time to correctly diagnose and repair the machine properly before returning it to production, instead of being under time-pressure to find the quickest way to band-aid the machine back to life so the tree could reopen, only to have the machine fail again later when the band-aid repair failed.
All great, but fixing that uncovered the next hidden assumption.
2) Build-on-checkin, not build-continuously.
The “grab latest source code, generate a build, display the results” loop of tinderbox never looked to check if anything had actually changed. Tinderbox just started another build – even if nothing had changed.
Having only one machine available to do a given job meant that machine was constantly busy, so this assumption was not initially obvious. And given that the machine was on anyway, what harm in having it doing an unnecessary build or two?
Generating extra builds, even when nothing had changed, complicated the manual what-change-broke-the-build debugging work. It also introduced delays when a human actually did a checkin, as a build containing that checkin could only start after the unnecessary-nothing-changed-build-in-progress completed.
Finally, when we changed to having multiple machines run jobs concurrently, having the machines build even when there was no checkin made no sense. We needed to make sure each machine only started building when a new checkin happened, and there was something new to build. This turned into a separate project to build out an enhanced job scheduler system and machine-tracking system which could span 4 physical colos and 3 Amazon regions, assign jobs to the appropriate machines, take sick/dead machines out of production, add new machines into rotation, etc.
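The core of build-on-checkin is a single check before starting work. The sketch below assumes a hypothetical `OnCheckinScheduler` that is told the current tip revision each time it polls; the real scheduler described above did far more (machine tracking, multiple colos), none of which is modelled here.

```python
from typing import List, Optional


class OnCheckinScheduler:
    """Hypothetical scheduler that builds only when the tip revision changes."""

    def __init__(self) -> None:
        self.last_built: Optional[str] = None
        self.builds_started: List[str] = []

    def poll(self, tip_revision: str) -> bool:
        """Return True if a build was started for this poll."""
        if tip_revision == self.last_built:
            return False                     # nothing new: leave the machine idle
        self.last_built = tip_revision
        self.builds_started.append(tip_revision)
        return True


# Toy usage: five polls, but only two distinct revisions.
scheduler = OnCheckinScheduler()
polled = ["r1", "r1", "r1", "r2", "r2"]
started = [scheduler.poll(rev) for rev in polled]
```

A build-continuously loop would have produced five builds for these five polls; here only `r1` and `r2` are built, so every build on the status page corresponds to a real change.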
3) Display build results arranged by developer checkin not by time.
Tinderbox sorted results by time, specifically job-start-time and job-end-time. However, developers typically care about the results of their checkin, and sometimes the results of the checkin that landed just before them.
Further: Once we started generating multiple-jobs-of-the-same-type concurrently, it uncovered another hidden assumption. The design of this cascading waterfall assumed that you only had one build of a given type running at a time; the waterfall display was not designed to show the results of two linux32 opt builds that were run concurrently. As a transition, we hacked our new replacement systems to send tinderbox-server-compatible status for each concurrent build to the tinderbox server… more observant developers would occasionally see some race-condition bugs with how these concurrent builds were displayed on the one column of the waterfall. These intermittent display bugs were confusing, hard to debug, but usually self corrected.
As we supported more OSes, more build-types-per-OS and started to run unittests and perf-tests per platform, it quickly became more and more complex to figure out whether a given change had caused a problem across all the time-sorted-columns on the waterfall display. Complaints about the width of the waterfall not fitting on developers’ monitors were widespread. Running more and more of these jobs concurrently made deciphering the waterfall even more complex.
Finding a way to collect all the results related to a specific developer’s checkin, and display these results in a meaningful way was crucial. We tried a few ideas, but a community member (Markus Stange) surprised us all by building a prototype server that everyone instantly loved. This new server was called “tbpl”, because it scraped the TinderBox server Push Logs to gather its data.
Over time, there have been improvements to tbpl.mozilla.org to allow sheriffs to “star” known failures, link to self-service APIs, link to the commits in the repo, link to bugs and most importantly gather all the per-checkin information directly from the buildbot scheduling database we use to schedule and keep track of job status… eliminating the intermittent race-condition bugs when scraping HTML pages on the tinderbox server. All great, but the user interface has remained basically the same since the first prototype by Markus – developers can easily and quickly see if a developer checkin has caused any bustage.
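To illustrate the difference between the two views, here is a minimal sketch that regroups a flat list of job results by checkin instead of by time. The `JobResult` record and its field names are assumptions made for this example; this is not how tbpl itself is implemented.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class JobResult(NamedTuple):
    revision: str   # the checkin the job ran against
    job_type: str   # e.g. "linux32 opt"
    end_time: int   # only meaningful for the time-sorted view
    status: str     # "green" or "red"


def group_by_checkin(results: List[JobResult]) -> Dict[str, Dict[str, str]]:
    """Collect every job status under the checkin it belongs to."""
    grouped: Dict[str, Dict[str, str]] = defaultdict(dict)
    for result in results:
        grouped[result.revision][result.job_type] = result.status
    return dict(grouped)


# Toy data: results for two checkins interleave when sorted by time.
results = [
    JobResult("rev_a", "linux32 opt", 100, "green"),
    JobResult("rev_b", "linux32 opt", 130, "red"),
    JobResult("rev_a", "win32 opt", 150, "green"),
    JobResult("rev_b", "win32 opt", 170, "red"),
]
per_checkin = group_by_checkin(results)
```

In the time-sorted list, a developer has to scan across job types to find the rows for their checkin; in `per_checkin`, one lookup by revision shows that `rev_b` is red on both platforms.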
Fixing these 3 root assumptions in tinderbox.m.o code would be “non-trivial” – basically a re-write – so we instead focused on gracefully transitioning off tinderbox. Since Sept2012, all Mozilla RelEng systems have been off tinderbox.m.o and using tbpl.m.o plus buildbot instead.
Making the Continuous Integration process more efficient has allowed Mozilla to hire more developers who can do more checkins, transition developers from all-on-one-tip-development to multi-project-branch-development, and change the organization from traditional releases to rapid-release model. Game changing stuff. Since 2007, Mozilla has grown the number of employee engineers by a factor of 8, while the number of checkins that developers did has increased by a factor of 21. Infrastructure improvements have outpaced hiring!
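As a back-of-the-envelope check on those two growth factors, the ratio can be worked out directly. Treat the result loosely: checkins also come from community contributors, so this is not a strict per-employee figure.

```python
engineer_growth = 8    # growth in employee engineers since 2007, from the post
checkin_growth = 21    # growth in checkins since 2007, from the post

# How much faster checkins grew than the number of employee engineers.
relative_growth = checkin_growth / engineer_growth
print(f"{relative_growth:.3f}x")  # prints "2.625x"
```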
On 16 May 2014, with the last Mozilla project finally migrated off tinderbox, the tinderbox server was powered off. Tinderbox was the first of its kind, and helped change how the software industry developed software. As much as we can gripe about tinderbox server’s various weaknesses, it carried Mozilla from 1997 until 2012, and spawned an industry of products that help developers ship better software. Given its impact, it feels like we should look for a pedestal to put this on, with a small plaque that says “This changed how software companies develop software, thank you Tinderbox”… As it has been a VM for several years now, maybe this blog post counts as a virtual pedestal?! Regardless, if you are a software developer, and you ever meet any of the original team who built tinderbox, please do thank them.
I’d like to give thanks to some original Netscape folks (Tara Hernandez, Terry Weissman, Lloyd Tabb, Leaf, jwz) as well as aki, brendan, bmoss, chofmann, dmose, myk and rhelmer for their help researching the origins of Tinderbox. Also, thank you to lxt, catlee, bhearsum, rail and others for inviting me back to attend the ceremonial final-powering-off event… After the years of work leading up to this moment, it meant a lot to me to be there at the very end.
ps: the curious can view the cvs commit history for tinderbox webpage here (My favorite is v1.10!) …and the cvs commit history for tinderbox server here (UPDATE: Thanks to @justdave for the additional link.)
pps: When a server has been running for so long, figuring out what other undocumented systems might break when tinderbox is turned off is tricky. Here’s my “upcoming end-of-life” post from 02-apr-2013 when we thought we were nearly done. Surprise dependencies delayed this shutdown several times and frequently uncovered new, non-trivial, projects that had to be migrated. You can see the various loose ends that had to be tracked down in bug#843383, and all the many many linked bugs.
(UPDATE: add links to wikipedia, MDN, and to fix some typos. joduinn 28jun2014)
12 May 2014
I stumbled across this book by accident recently, and really enjoyed it. One of the reasons I love to travel is because of the different cultural norms… what is “normal” in one location would be considered downright “odd/strange/unusual” in another location. Since I first moved to San Francisco, the different types of people, from different backgrounds, who each call this town “home” continue to fascinate me… and all in a small 7mile x 7mile area.
This book is painted (yes really!) by a San Francisco resident, and does an excellent job of describing the heart of many different aspects of this unique town: Mah Jong in Chinatown, the SF City Library’s fulltime employee who is a social worker for homeless people, Frank Chu, Critical Mass, dogwalkers, Mission Hipsters, Muni drivers … and of course, everything you need to know about a Mission burrito!
A fun read… and a great gift to anyone who has patiently listened while you’ve tried to explain what makes San Francisco so special.
01 Apr 2014
(“Panorama” is the very-serious-current-affairs program of the British Broadcasting Corporation, and has been running continuously since 1953, making it the longest running current affairs program in the world.)
On 1st April, 1957, Panorama ended its show with a brief ~3minute segment on the early harvest of the Spaghetti trees along the Swiss-Italian border.
It is believed to be one of the first times an April Fools’ joke was played on television viewers, and caused quite the stir at the time. Excellently put together, with great attention to detail, and a script echoing an earlier segment about the French wine harvest, I found it a great fun 3minute watch.
08 Mar 2014
In case you missed the announcements, RelEngConf 2014 is officially now open for registrations. This follows the inaugural and wildly successful Release Engineering conference, held in San Francisco on 20may2013, as part of ICSE 2013. More background here.
Last year’s event was great. The mixture of attendees and speakers, from academia and battle-hardened industry, made for some riveting topics. So I already had high expectations for this year… no pressure on the organizers! Then I heard this year’s will be held at Google HQ in Mountain View, and feature opening keynotes from Chuck Rossi (RelEng, Facebook, click for linkedin profile), and Dinah McNutt (RelEng, Google, click for linkedin profile). Looks like RelEngConf 2014 is already lining up to be special also.
If you build software delivery pipelines for your company, or if you work in a software company that has software delivery needs, I recommend you follow @relengcon, block off April 11th, 2014 on your calendar and book now. It will be well worth your time.
See you there!
23 Feb 2014
(Context: In case people missed this transition, my last day at Mozilla was Dec31, so obviously, I’m not going to be doing these monthly infrastructure load posts anymore. I started this series of posts in Jan2009, because the data, and analysis, gave important context for everyone in Mozilla engineering to step back and sanity-check the scale, usage patterns and overall health of Mozilla’s developer infrastructure. The data in these posts have shaped conversations and strategy within Mozilla over the years, so are important to continue. I want to give thanks to Armen for eagerly taking over this role from me during my transition out of Mozilla. Those of you who know Armen know that he’ll do this exceedingly well, in his own inimitable style, and I’m super happy he’s taken this on. I’ve already said this to Armen privately over the last few months of transition details, but am repeating here publicly for the record – thank you, Armen, for taking on the responsibility of this blog-post-series.)
December saw a big drop in overall load – 6,063 is our lowest load in almost half-a-year. However, this is no surprise given that all Mozilla employees were offline for 10-14 days out of the 31 days – basically 1/3rd of the month. At the rate people were doing checkins for the first 2/3rds of the month, December2013 was on track to be our first month ever over 8,000 checkins-per-month.
January saw people jump straight back into work full speed. 7,710 is our second heaviest load on record (slightly behind the current record 7,771 checkins in August2013).
Those are my quick highlights. For more details, you should go read Armen’s post for Dec2013 and post for Jan2014 yourself. He has changed the format a little, but the graphs, data and analysis are all there. And hey, Armen even makes the raw data available in html and json formats, so now you can generate your own reports and graphs if interested. A very nice touch, Armen.
John (still cheering from the sidelines).
12 Feb 2014
[UPDATE: The newest version of this presentation is here. joduinn 09nov2014]
(My life has been hectic on several other fronts, so I only just now noticed that I never actually published this blog post. Sorry!!)
On 07-nov-2013, I was invited to present “We are all remoties” in Twilio’s headquarters here in San Francisco as part of their in-house tech talk series.
For context, it’s worth noting that Twilio is doing great as a company, which means they are hiring. They were also outgrowing their current space, so one option they were investigating was to keep the current space, and open up a second office elsewhere in the bay area. As they’d always been used to working in the one location, this “split into two offices” was top of everyone’s mind… hence the invitation from Thomas to give this company-wide talk about remoties.
Twilio’s entire office is a large, SOMA-style-warehouse-converted-into-open-plan-offices layout, packed with lots of people. The area I was to present in was their big “common area”, where they typically host company all-hand meetings, Friday socials and other big company-wide events. Quite, quite large. I’ve no idea how many people were there but it felt huge, and was wall-to-wall packed. The size gave an echo-y audio effect off the super-high high concrete ceilings and far-distant bare concrete walls, with a weird couple of structural pillars right in the middle of the room. Despite my best intentions, during the session, I found myself trying to “peer around” the pillars, aware of the people blocked from view.
It’s great to see the response from folks when slides in a presentation *exactly* hit onto what is on top-of-their-minds. One section, about companies moving to multiple locations, clearly hit home with everyone… not too surprising, given the context. Another section, about a trusted employee moving out from office to start being a 100% remote employee, hit a very personal note – there was someone in the 2nd row who was a long-trusted employee actually about to embark on this exact change. He got quite the attention from everyone around him, and we stopped everything for a few minutes to talk about his exact situation. As far as I can tell, he found the entire session very helpful, but only time will tell how things work out for him.
The very great interactions, the lively Q+A, and the crowd of questions afterwards were all lots of fun and quite informative.
Big thanks to Thomas Wilsher @ Twilio for putting it all together. I found it a great experience, and the lively discussions before+during+after lead me to believe others did too.
PS: For a PDF copy of the presentation, click on the smiley faces! For the sake of my poor blogsite, the much, much, larger keynote file is available on request.
(Update: fixed broken links. joduinn 26jun2014)
29 Dec 2013
Just before the holiday break, Mitchell and I sat down together to fulfill a long standing promise I made years ago: to have Mitchell start a Firefox release herself.
After starting Mozilla just over 15 years ago, and dealing with all aspects of running a large organization, Mitchell finally kicked off a Firefox release herself last week – for the very first time. Specifically, she was going to start the official release automation for Firefox 27.0 beta2 and Fennec 27.0 beta2.
Timing was tricky. We didn’t want to disrupt the usual beta release cadence, especially just before the holidays. And Mitchell only had 25 minutes free between meetings, so we spent a few minutes saying hi, getting settled, and then we jumped right into the details.
To kick off Firefox and Fennec releases, there are only a handful of fields a human has to fill in for each product. They are (almost) all fairly self-evident, and a good number of the fields are populated by picking-from-a-list, so we made fast progress. The “Gimme a Firefox” and “Gimme a Fennec” buttons caused a laugh!
8 minutes is all it took.
That 8 minutes included the time to explain what each field did, what value should go into each of the various fields, and why. We even took the time to re-verify everything. After all, this was not just a “demo”… this was real. Mozilla shipped this Firefox 27.0b2 and Fennec 27.0b2 to all our real-live beta users before closing down for the holiday break.
Because it was so quick, we had spare time to chat about how much the infrastructure has improved since the Directors meeting in Building S 6.5 years ago when this promise was originally made. Obviously, there’s plenty of complexity involved in shipping a product release – the daily bug triage meetings about what fixes should/shouldn’t be included in a release, the actual landing of code fixes by developers, the manual spot-checking by QA, the press and PR coordination, the list goes on… – but the fact that such a “simple” user interface could trigger release automation running a couple of hundred compute hours across many machines to reliably ship products to millions of users is a note-worthy measure of Mozilla’s Release Engineering infrastructure. Mitchell was suitably impressed!
And then Mitchell left, with a wave and a smile, a few minutes early for her next meeting, while the various Release Engineering systems sprang into life generating builds, localization repacks, and updates for all our users.
We took this photo afterwards to commemorate the event! Thank you, Mitchell!