Chuck Rossi and Dinah McNutt keynotes at RelEng Conf 2014


The same great people who brought RelEng Conf 2013 did it again earlier this year with the sold-out-wait-listed RelEng Conf 2014. Hosted at Google’s HQ campus, it was a great combination of academic and battle-hardened down-to-earth no-holds-barred industry presentations and panel sessions.

Unlike my keynote last year, this year I had no presentations, so I was able to relax and soak up the great keynote presentations by Chuck Rossi (RelEng at Facebook), and Dinah McNutt (RelEng at Google), as well as others. These are online, and well worth the watch:

Chuck Rossi’s keynote is here:

Dinah McNutt’s keynote is here:

Closing panel discussion is here:

Two years in a row of greatness from Bram Adams, Christian, Foutse, Kim, Stephany Bellomo, Akos Frohner and Boris Debic means that I’m already looking forward to RelEng Conf 2015. Watch the official conference website, follow @relengcon and book your spot immediately to avoid that sad “oh, I’m still on the wait list” feeling.

John.

Calling all Release Engineers, Hortonworks is hiring


The Hortonworks Release Engineering team is growing, so we’re hiring!

We’re passionate about open source, and ensure that 100% of the code in a Hortonworks HDP release is open sourced in the Apache Software Foundation Hadoop project. We work with other large organizations to help them upstream their contributions to the Apache project, which helps accelerate the general Hadoop community. It’s so important to us that it is part of the Hortonworks Manifesto.

We’re proud of our HDP releases. Our clients rely on HDP in production environments where phrases like “petabytes per day” and “zettabytes” are common. We sim-ship on centos5, centos6, ubuntu, debian, suse and windows – all from the same changeset. Building and testing at this scale brings its own special challenges, and is exciting. In the rare case where customers hit production issues, we are able to deliver supported fixes super-quickly.

The Hortonworks Release Engineering team works hard behind the scenes to design, build and maintain the infrastructure-at-scale needed to make this possible. For more details, and to apply, click here.

Note: The current team is spread across 3 cities, so remoties are welcome, even encouraged! Hardly a surprise if you read the other remoties posts on my blog, but worth stating explicitly!

Farewell to Tinderbox, the world’s 1st? 2nd? Continuous Integration server


In April 1997, Netscape Release Engineers wrote, and started running, the world’s first? second? continuous integration server. Now, just over 17 years later, in May 2014, the tinderbox server was finally turned off. Permanently.

This is a historic moment for Mozilla, and for the software industry in general, so I thought people might find it interesting to get some background, as well as outline the assumptions we changed when designing the replacement Continuous Integration and Release Engineering infrastructure now in use at Mozilla.


At Netscape, developers would check in a code change, and then go home at night, without knowing if their change broke anything. There were no builds during the day.

Instead, developers would have to wait until the next morning to find out if their change caused any problems. At 10am each morning, Netscape RelEng would gather all the checkins from the previous day, and manually start the build. Even if a given individual change was “good”, it was frequently possible for a combination of “good” changes to cause problems. In fact, as this was the first time that all the checkins from the previous day were compiled together, or “integrated” together, surprise build breakages were common.

This integration process was so fragile that all developers who did checkins in a day had to be in the office before 10am the next morning to immediately help debug any problems that arose with the build. Only after the 10am build completed successfully were Netscape developers allowed to start checking-in more code changes on top of what was now proven to be good code. If you were lucky, this 10am build worked first time, took “only” a couple of hours, and allowed new checkins to start lunchtime-ish. However, this 10am build was frequently broken, causing checkins to remain blocked until the gathered developers and release engineers figured out which change caused the problem and fixed it.

Fixing build bustages like this took time, and lots of people, to figure out which of all the checkins that day caused the problem. Worst case, some checkins were fine by themselves, but caused problems when combined with, or integrated with, other changes, so even the best-intentioned developer could still “break the build” in non-obvious ways. Sometimes, it could take all day to debug and fix the build problem – no new checkins happened on those days, halting all development for the entire day. Rarer, but not unheard of, was build bustage that halted development for multiple days in a row. Obviously, this was disruptive to the developers who had landed a change, to the other developers who were waiting to land a change, and to the Release Engineers in the middle of it all… With so many people involved, this was expensive to the organization in terms of salary as well as opportunity cost.

If you could do builds twice a day, you only had half-as-many changes to sort through and detangle, so you could more quickly identify and fix build problems. But doing builds more frequently would also be disruptive because everyone had to stop and help manually debug-build-problems twice as often. How to get out of this vicious cycle?

In these desperate times, Netscape RelEng built a system that grabbed the latest source code, generated a build, displayed the results in a simple linear time-sorted format on a webpage where everyone could see status, and then started again… grab the latest source code, build, post status… again. And again. And again. Not just once a day. At first, this was triggered every hour, hence the phrase “hourly build”, but that was quickly changed to starting a new build immediately after finishing the previous build.

All with no human intervention.

By integrating all the checkins and building continuously like this throughout the day, it meant that each individual build contained fewer changes to detangle if problems arose. By sharing the results on a company-wide-visible webserver, it meant that any developer (not just the few Release Engineers) could now help detangle build problems.
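
For readers who think in code, here is a minimal sketch of that loop, written in Python rather than the Perl/shell of the era. The commands, timing and status reporting are purely illustrative stand-ins, not the actual tinderbox implementation:

```python
import subprocess
import time

def run(cmd):
    """Run a shell command; True means it exited cleanly."""
    return subprocess.run(cmd, shell=True).returncode == 0

def continuous_build_loop():
    """Endlessly: grab the latest source, build, post status, repeat.
    Loosely modelled on the loop described above; 'cvs update', 'make'
    and the waterfall reporting are stand-ins, not the real setup."""
    while True:
        run("cvs update")                 # grab the latest source code
        ok = run("make")                  # generate a build
        status = "green" if ok else "red"
        print(f"post '{status}' to the waterfall page for everyone to see")
        time.sleep(1)                     # ...then immediately start again
```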

What do you call a new system that continuously integrates code checkins? Hmmm… how about “a continuous integration server“?! Good builds were colored “green”. The vertical columns of green reminded people of trees, giving rise to the phrase “the tree is green” when all builds looked good and it was safe for developers to land checkins. Bad builds were colored “red”, and gave rise to “the tree is burning” or “the tree is closed”. As builds would break (or “burn” with flames) with seemingly little provocation, the web-based system for displaying all this was called “tinderbox“.

Pretty amazing stuff in 1997, and a pivotal moment for Netscape developers. When Netscape open sourced its code as Mozilla, all this infrastructure was exposed to the entire industry and the idea spread quickly. This remains a core underlying principle in all the various continuous integration products, and agile / scrum development methodologies, in use today. Most people starting a software project in 2014 would first set up a continuous integration system. But in 1997, this was unheard of and simply brilliant.

(From talking to people who were there 17 years ago, there’s some debate about whether this was originally invented at Netscape or inspired by a similar system at SGI that was hardwired into the building’s public announcement system using a synthesized voice to declare: “THE BUILD IS BROKEN. BRENDAN BROKE THE BUILD.” If anyone reading this has additional info, please let me know and I’ll update this post.)


If the tinderbox server was so awesome, and worked so well for 17 years, why turn it off? Why not just fix it up and keep it running?

In mid-2007, an important criterion for the reborn Mozilla RelEng group was to significantly scale up Mozilla’s developer infrastructure – not just incrementally, but by orders of magnitude. This was essential if Mozilla was to hire more developers, gather many more community members, tackle a bunch of major initiatives, ship releases more predictably, and have these additional Mozilla developers and community contributors be able to work effectively. When we analyzed how tinderbox worked, we discovered a few assumptions from 1997 no longer applied, and were causing bottlenecks we needed to solve.


1) Need to run multiple jobs-of-the-same-type at a time.
2) Build-per-checkin, not build-continuously.
3) Display build results arranged by developer checkin, not by time.


1) Need to run multiple jobs-of-the-same-type at a time
The design of this tinderbox waterfall assumed that you only had one job of a given type in progress at a time. For example, one linux32 opt build had to finish before the next linux32 opt build could start.

Mechanically, this was done by having only one machine dedicated to doing linux opt builds, and that one machine could only generate one build at a time. The results from that one machine were displayed in one time-sorted column on the web page. If you wanted an additional, different type of build, say linux32 debug builds, you needed another dedicated machine displaying results in another dedicated column.

For a small (~15?) number of checkins per day, and a small number of types of builds, this approach worked fine. However, as the number of checkins per day increased, each “hourly” build came to contain almost as many checkins as Netscape had in an entire day in 1997. By 2007, Mozilla was routinely struggling with multi-hour blockages as developers debugged integration failures.

Instead of having only one machine do linux32 opt builds at a time, we set up a pool of identically configured machines, each able to do a build-per-checkin, even while the previous build was still in progress. In peak load situations, we might still get more-than-one-checkin-per-build, but now we could start the 2nd linux32 opt build even while the 1st linux32 opt build was still in progress. This got us back to having a very small number of checkins, ideally only one checkin, per build… identifying which checkin broke the build, and hence fixing that build, was once again quick and easy.

Another related problem here was that there were ~86 different types of machines, each dedicated to running different types of jobs, on their own OS and each reporting to different dedicated columns on the tinderbox. There was a linux32 opt builder, a linux32 debug builder, a win32 opt builder, etc. This design had two important drawbacks.

Each different type of build took a different amount of time to complete. Even if all jobs started at the same time on day 1, the continuous looping of jobs of different durations meant that after a while, all the jobs were starting/stopping at different times – which made it hard for a human to look across all the time-sorted waterfall columns to determine if a particular checkin had caused a given problem. Even getting all 86 columns to fit on a screen was a problem.

It also made each of these 86 machines a single point of failure for the entire system, a model which clearly would not scale. Building out the pools of identical machines from 86 machines to ~5,500 machines allowed us to generate multiple jobs-of-the-same-type at the same time. It also meant that whenever one machine in a set of identical machines failed, it was not a single point of failure, and did not immediately close the tree, because another identically-configured machine was available to handle that type of work. This allowed people time to correctly diagnose and repair the machine properly before returning it to production, instead of being under time-pressure to find the quickest way to band-aid the machine back to life so the tree could reopen, only to have the machine fail again later when the band-aid repair failed.
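
To illustrate the pool idea in miniature: any idle worker in a set of identical machines can take the next build of that type, so several builds of the same type run concurrently and no single machine failure closes the tree. This is my own toy sketch (a thread pool standing in for buildbot masters and thousands of build machines), not Mozilla’s actual code:

```python
from concurrent.futures import ThreadPoolExecutor

def build(changeset):
    """Stand-in for one linux32-opt build of a single checkin."""
    print(f"building {changeset} ...")
    # compile, package, run checks, report status ...
    return f"{changeset}: green"

def dispatch(checkins, pool_size=4):
    """Hand each checkin to any idle worker in a pool of identical machines;
    several builds of the same type can be in progress at once."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        return list(pool.map(build, checkins))

print(dispatch(["rev101", "rev102", "rev103"]))
```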

All great, but fixing that uncovered the next hidden assumption.


2) Build-per-checkin, not build-continuously.

The “grab the latest source code, generate a build, display the results” loop of tinderbox never checked whether anything had actually changed. Tinderbox just started another build – even if nothing had changed.

Having only one machine available to do a given job meant that machine was constantly busy, so this assumption was not initially obvious. And given that the machine was on anyway, what harm was there in having it do an unnecessary build or two?

Generating extra builds, even when nothing had changed, complicated the manual what-change-broke-the-build debugging work. It also introduced delays when a human actually did a checkin, as a build containing that checkin could only start after the unnecessary nothing-changed build-in-progress completed.

Finally, when we changed to having multiple machines run jobs concurrently, having the machines build even when there was no checkin made no sense. We needed to make sure each machine only started building when a new checkin happened, and there was something new to build. This turned into a separate project to build out an enhanced job scheduler and machine-tracking system which could span 4 physical colos and 3 Amazon regions, assign jobs to the appropriate machines, take sick/dead machines out of production, add new machines into rotation, etc.
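
In other words, the scheduler’s core loop changed from “always be building” to “only build when the tree changed”. A toy sketch of that idea follows (polling shown for simplicity; the function names and revisions are invented for illustration, not the buildbot scheduler itself):

```python
import random
import time

def latest_changeset():
    # Stand-in for asking the repository for its tip revision; a real
    # system would query hg/git or react to push notifications.
    return random.choice(["rev100", "rev100", "rev101"])

def queue_build(changeset):
    print(f"queueing build jobs for {changeset}")

def schedule_on_checkin(poll_seconds=1, iterations=5):
    """Only queue builds when something new actually landed,
    instead of building continuously whether or not anything changed."""
    last_built = None
    for _ in range(iterations):
        tip = latest_changeset()
        if tip != last_built:         # something new to build
            queue_build(tip)
            last_built = tip
        time.sleep(poll_seconds)      # nothing changed: do nothing at all

schedule_on_checkin()
```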


3) Display build results arranged by developer checkin not by time.

Tinderbox sorted results by time, specifically job-start-time and job-end-time. However, developers typically care about the results of their own checkin, and sometimes the results of the checkin that landed just before theirs.

Further: once we started generating multiple jobs-of-the-same-type concurrently, it uncovered another hidden assumption. The design of this cascading waterfall assumed that you only had one build of a given type running at a time; the waterfall display was not designed to show the results of two linux32 opt builds run concurrently. As a transition, we hacked our new replacement systems to send tinderbox-server-compatible status for each concurrent build to the tinderbox server… more observant developers would occasionally see race-condition bugs in how these concurrent builds were displayed in the one column of the waterfall. These intermittent display bugs were confusing and hard to debug, but usually self-corrected.

As we supported more OSes, more build-types-per-OS, and started to run unittests and perf-tests per platform, it quickly became more and more complex to figure out whether a given change had caused a problem across all the time-sorted columns on the waterfall display. Complaints about the width of the waterfall not fitting on developers’ monitors were widespread. Running more and more of these jobs concurrently made deciphering the waterfall even more complex.

Finding a way to collect all the results related to a specific developer’s checkin, and display these results in a meaningful way was crucial. We tried a few ideas, but a community member (Markus Stange) surprised us all by building a prototype server that everyone instantly loved. This new server was called “tbpl”, because it scraped the TinderBox server Push Logs to gather its data.

Over time, there have been improvements to tbpl.mozilla.org to allow sheriffs to “star” known failures, link to self-service APIs, link to the commits in the repo, link to bugs, and most importantly gather all the per-checkin information directly from the buildbot scheduling database we use to schedule and track job status… eliminating the intermittent race-condition bugs from scraping HTML pages on the tinderbox server. All great, but the user interface has remained basically the same since the first prototype by Markus – developers can easily and quickly see if a checkin has caused any bustage.
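
The underlying change in how results are organized is easy to show: instead of sorting job results by start/end time, group every job by the push (developer checkin) it was built from, so one glance answers “did my checkin break anything?”. A toy sketch with invented data (tbpl itself reads this information from the buildbot scheduling database):

```python
from collections import defaultdict

# Invented example results; each records which push it was built from.
results = [
    {"push": "rev101", "job": "linux32 opt",   "status": "green"},
    {"push": "rev101", "job": "win32 opt",     "status": "red"},
    {"push": "rev102", "job": "linux32 debug", "status": "green"},
]

def group_by_push(results):
    """Arrange results by developer checkin, not by job start/end time."""
    by_push = defaultdict(list)
    for r in results:
        by_push[r["push"]].append((r["job"], r["status"]))
    return by_push

for push, jobs in group_by_push(results).items():
    broken = [job for job, status in jobs if status != "green"]
    print(push, "BUSTED: " + ", ".join(broken) if broken else "all green")
```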


Fixing these 3 root assumptions in tinderbox.m.o code would be “non-trivial” – basically a re-write – so we instead focused on gracefully transitioning off tinderbox. Since Sept2012, all Mozilla RelEng systems have been off tinderbox.m.o and using tbpl.m.o plus buildbot instead.

Making the Continuous Integration process more efficient has allowed Mozilla to hire more developers who can do more checkins, transition developers from all-on-one-tip development to multi-project-branch development, and change the organization from traditional releases to a rapid-release model. Game-changing stuff. Since 2007, Mozilla has grown the number of employee engineers by a factor of 8, while the number of checkins developers do has increased by a factor of 21. Infrastructure improvements have outpaced hiring!

On 16 May 2014, with the last Mozilla project finally migrated off tinderbox, the tinderbox server was powered off. Tinderbox was the first of its kind, and helped change how the software industry develops software. As much as we can gripe about the tinderbox server’s various weaknesses, it carried Mozilla from 1997 until 2012, and spawned an industry of products that help developers ship better software. Given its impact, it feels like we should look for a pedestal to put it on, with a small plaque that says “This changed how software companies develop software, thank you Tinderbox”… As it has been a VM for several years now, maybe this blog post counts as a virtual pedestal?! Regardless, if you are a software developer, and you ever meet any of the original team who built tinderbox, please do thank them.

I’d like to give thanks to some original Netscape folks (Tara Hernandez, Terry Weissman, Lloyd Tabb, Leaf, jwz) as well as aki, brendan, bmoss, chofmann, dmose, myk and rhelmer for their help researching the origins of Tinderbox. Also, thank you to lxt, catlee, bhearsum, rail and others for inviting me back to attend the ceremonial final-powering-off event… After the years of work leading up to this moment, it meant a lot to me to be there at the very end.

John.

ps: the curious can view the cvs commit history for tinderbox webpage here (My favorite is v1.10!) …and the cvs commit history for tinderbox server here (UPDATE: Thanks to @justdave for the additional link.)

pps: When a server has been running for so long, figuring out what other undocumented systems might break when tinderbox is turned off is tricky. Here’s my “upcoming end-of-life” post from 02-apr-2013 when we thought we were nearly done. Surprise dependencies delayed this shutdown several times and frequently uncovered new, non-trivial, projects that had to be migrated. You can see the various loose ends that had to be tracked down in bug#843383, and all the many many linked bugs.

ppps: Here’s what MozillaDeveloperNetwork and Wikipedia have to say about Tinderbox server.

(UPDATE: add links to wikipedia, MDN, and to fix some typos. joduinn 28jun2014)

Hortonworks HDP 2.1 shipped!


HDP2.1 shipped on 22apr2014.

This was the first significant feature release shipped since I joined Hortonworks at the start of the year. There are lots of interesting new features and functionality in this HDP2.1 release – already well covered by others in great detail here. Oh, and of course, you can download it from here.

In this post, I’ll instead focus on some of the behind-the-scenes mechanics. There were lots of major accomplishments in this release, but the ones that really stood out to me were:

1) sim-ship windows and linux.
This was the first HDP release where all OSes were built from the same changeset and shipped at the same time. Making this happen was a hectic first priority in January. As well as the plumbing/mechanics within RelEng, it also took lots of coordination changes across different groups within Hortonworks to make this happen. The payoff was great. We sim-shipped, which is massively important for HWX as a company. Even more importantly, we set things up so we could sim-ship for every HDP2.1-and-above release going forward… and we proved it by sim-shipping the quick followup HDP2.1.2.0 release on 02may2014.

2) adding 5 new components.
HDP2.1 contained 17 components, compared to HDP 2.0 (with 12 components) and HDP 1.3 (with 10 components), making HDP2.1 the largest growth of components ever?!? Oh, and in addition to the new components, every one of the 12 pre-existing components was also significantly updated to a newer version. That meant each required significant new integration work, and new installers on all supported OSes (…remember the “sim-ship” goal?). Oh, and we were to ship all this new functionality at the fastest cadence yet.

3) improving support for other trains.
In January, we were learning how to support 3 active trains of code: 1.3 and 2.0 maintenance work, while also building out infrastructure for the 2.1 new-product-development work… even while the 2.1 development work was in progress, which obviously complicated things for developers. Today, we’re supporting 4 active trains: maintenance work for 1.3, 2.0 and 2.1, as well as the 2.2 new-product-development work. This time, the 2.2 infrastructure was built out and live before developers finished working on 2.1… enabling the developers! Things are not perfect yet, by any means, but today (with 4 trains) feels calmer and more organized than earlier this year (with “only” 3 trains).

All great improvements to see up close, and all important to us as we scale. Big thanks to everyone for their help… and do stay tuned for even more improvements already underway.

John.

RelEngCon 2014 registration is now open!


In case you missed the announcements, RelEngConf 2014 is officially now open for registrations. This follows the inaugural and wildly successful Release Engineering conference, held in San Francisco on 20may2013 as part of ICSE 2013. More background here.

Last year’s event was great. The mixture of attendees and speakers, from academia and battle-hardened industry, made for some riveting topics. So I already had high expectations for this year… no pressure on the organizers! Then I heard that this year’s will be held at Google HQ in Mountain View, and feature opening keynotes from Chuck Rossi (RelEng, Facebook, click for linkedin profile) and Dinah McNutt (RelEng, Google, click for linkedin profile). Looks like RelEngConf 2014 is already lining up to be special also.

If you build software delivery pipelines for your company, or if you work in a software company that has software delivery needs, I recommend you follow @relengcon, block off April 11th, 2014 on your calendar and book now. It will be well worth your time.

See you there!
John.

Infrastructure load for December 2013 and January 2014


(Context: In case people missed this transition, my last day at Mozilla was Dec31, so obviously, I’m not going to be doing these monthly infrastructure load posts anymore. I started this series of posts in Jan2009, because the data, and analysis, gave important context for everyone in Mozilla engineering to step back and sanity-check the scale, usage patterns and overall health of Mozilla’s developer infrastructure. The data in these posts have shaped conversations and strategy within Mozilla over the years, so are important to continue. I want to give thanks to Armen for eagerly taking over this role from me during my transition out of Mozilla. Those of you who know Armen know that he’ll do this exceedingly well, in his own inimitable style, and I’m super happy he’s taken this on. I’ve already said this to Armen privately over the last few months of transition details, but am repeating here publicly for the record – thank you, Armen, for taking on the responsibility of this blog-post-series.)

December saw a big drop in overall load – 6,063 is our lowest load in almost half a year. However, this is no surprise, given that all Mozilla employees were offline for 10-14 days out of the 31 days – basically 1/3rd of the month. At the rate people were doing checkins for the first 2/3rds of the month, December 2013 was on track to be our first month ever over 8,000 checkins-per-month.
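
As a rough back-of-the-envelope check of that claim (my own arithmetic, assuming checkins were spread evenly across the working part of the month):

```python
checkins_in_december = 6063
fraction_of_month_worked = 2 / 3          # ~10-14 offline days out of 31
projected_full_month = checkins_in_december / fraction_of_month_worked
print(round(projected_full_month))        # ~9094, comfortably over 8,000
```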

January saw people jump straight back into work full speed. 7,710 is our second heaviest load on record (slightly behind the current record 7,771 checkins in August2013).


Overall load since Jan 2009

Those are my quick highlights. For more details, you should go read Armen’s post for Dec2013 and post for Jan2014 yourself. He has changed the format a little, but the graphs, data and analysis are all there. And hey, Armen even makes the raw data available in html and json formats, so now you can generate your own reports and graphs if interested. A very nice touch, Armen.

John (still cheering from the sidelines).

“Release Engineering as a Force Multiplier” keynote at RelEngCon 2013


The world’s first ever Release Engineering conference was held in San Francisco on 20may2013, as part of ICSE 2013.

It was a great honor for me to be invited to give the opening keynote. This was a rare opportunity to outline some of the industry-changing RelEng-at-scale work being done at Mozilla. It also allowed me to describe some very important non-technical tactics that we used to turn around Mozilla’s situation from “company-threatening-and-human-burnout” to “company-enabling-and-human-sustainable”. All stuff that is applicable to other software companies. Presenting the first session at the first RelEng conference helped set the tone for the event, so being down-to-earth and practical felt important.

My presentation focused on how effective RelEng helped Mozilla compete against larger, better funded, companies. This is what I mean by “Release Engineering as a Force Multiplier”. To give some perspective of scale, I showed:

  • company logos scaled by headcount for each of Apple, Google, Microsoft and Mozilla. I noted that using revenue/profits instead of headcount would be just as disproportionately out of scale.

  • company logos scaled by browser market share for each of Apple, Google, Microsoft and Mozilla, based on publicly available market share data

(The full set of slides is available here, or by clicking on either of the two thumbnails above. If you want the original 25MB keynote file, let me know.)

Anyone who has ever talked with me about RelEng knows I feel very strongly that:

  • Release Engineering is important to the success of every software project. Writing a popular v1.0 product is just the first step. If you want to retain your initial users by shipping v1.0.1 fixes, or grow your user base by shipping new v2.0 features to your existing users, you need a reproducible pipeline for accurately delivering software. Smaller projects may not have someone with formal RelEng title, but there’s always someone doing RelEng work. Otherwise, your product, and your organization, will not survive.
  • Building an effective RelEng pipeline requires a different mindset to writing a great software product. Both are non-trivial, and understanding this difference in mindset enables you to hire wisely.
  • The importance of Release Engineering for the sustainability of a software company is only beginning to be recognized. Release Engineering is not taught as a discipline in any CompSci course that I know of. This has the unfortunate side effect that most Release Engineers only learn on-the-job, usually as a side-effect of helping out during an organizational emergency. This limits the effectiveness of Release Engineering to what can be learned on the job, and because information isn’t widely shared, it’s hard to learn from other people’s mistakes or successes. By contrast, when I was studying CompSci for my undergrad and postgrad degrees, there were all sorts of course modules, books and published papers detailing which sorting algorithm was best suited for which types of data, which graphics algorithms were best suited for which types of images, etc… but nothing about Release Engineering. Nothing about how to build a pipeline to deliver that software to users… even though every software company lives-or-dies by the efficiency of its delivery pipeline.

This conference was a great start to helping raise the understanding of the industry on these three points, and many more.

Big hat-tip to the organizers (Bram, Christian, Foutse, Kim) – they did an awesome job putting together a unique and special event. Other industry speakers included Google, LinkedIn, MicrosoftResearch, Netflix, PuppetLabs as well as a wide range of academic speakers – see full program details here and here. In the past, I’ve attended plenty of industry conferences which have a sales-touch to them (“our technology is great, you should buy our stuff” or “our technology is great, you should come work here”), or academic conferences which have academic accuracy but can lack urgent practicality. By contrast, this conference was very different. All the speakers spoke candidly, humbly and objectively about the technology and tactical successes that worked for them at their companies, and equally candidly about their failures as “teaching moments” for everyone to learn from. The level of raw honesty from all these speakers, across all these different companies and academia, was remarkable and very collaborative. Additionally, the sheer volume, and the quality, of attendees blew everyone away… the discussions during breaks were just electric. Very very refreshing. I don’t know if this magic was because of the carefully chosen mix of industry-vs-academia speakers, or because the organizers were also a mix of industry and academia, but regardless – the end result was simply fantastic.

There are already plans underway to have this RelEng Conference again next year… if you build software delivery pipelines for your company, or if you work in a software company that has software delivery needs, I recommend you follow @relengcon and plan on attending next year. You’ll be very glad you did!

Nostalgia and excitement


…seems the best description of the last few days.

Friday was a big day for several reasons. We:

  • started building FF3.5beta99 (build#1)
  • aborted FF3.5beta99 (build#1) after a blocker was found, and started FF3.5beta99 (build#2)
  • pushed FF3.0.11build#2 to beta users
  • started building TB2.0.0.22
  • oh, and moved office.

The first four would have counted as a busy day. A really busy day. Add to that the contingency planning to make sure that we could still be ready whenever we finally got the “go” to start FF3.5rc1, regardless of when the physical building move really happened. Both the FF3.5rc1 date and the building move date changed quite a bit, so we just made plans for the worst case – doing it all on the same day. There were some last-minute changes to the contingency plans when we added FF3.5beta99 to the schedule late last week.

While I don’t normally like respins, in this one case I was happy for the FF3.5beta99 respin, as it suddenly gave us ~3 hours before we would need the signing machine again. So, Aki, John Ford and I quickly moved the mobile devices from Aki’s desk, and the signing machine keymaster out of the server room, into two cars, and drove over in a careful slow convoy to the new building. Thankfully Aki thought to put all the mobile devices into a portable guitar pedal case, which made it “easy” to carry.

(Healthy paranoia caused *me* to carry keymaster, and the signing keys, so I would deal with the consequences if it got dropped in the move. )

…although keymaster looked too unsecured in the back seat like that, so Aki sat in the back, and physically held it for the drive.

Earlier today, I went back to Building K, tracking down some loose ends. It was weird and nostalgic walking around the ghost of the empty building all by myself. It’s been my home-from-home for the last two years, and it was surreal to see it all empty like this.

ps: if anyone knows who this crutch belongs to, could they let us know?

Big tip of the hat to Rhian, Chris Beard, Karen, Erica and IT for a phenomenal job on all this. I’ve done moves like this in previous companies, and there’s always a million-and-one loose details. But they seemed to have everything all calmly taken care of. Quite amazing!

Thunderbird 2.0.0.9 by the (wall-clock) numbers


Mozilla released Thunderbird 2.0.0.9 on Wednesday 14-nov-2007, at 5.10pm PST.

From “Dev says code ready to release” to “release is now available to public” was 15 days 22.5 hours wall-clock time, of which the Beta period took 6 days 8 hours, and Build&Release took just over 2.5 days (62.5 hours).

17:30 30oct: Dev say go
09:40 31oct: mac builds handed to QA
10:00 31oct: linux builds handed to QA
17:55 31oct: win32 signed builds handed to QA
06:50 02nov: update snippets available on betatest update channel
14:30 06nov: QA says “go” for Beta
16:10 06nov: update snippets available on beta update channel
00:30 13nov: Dev & QA says “go” for Release; Build starts final signing, bouncer entries
08:25 13nov: final signing, bouncer entries done; mirror replication started
09:40 13nov: Build announced enough mirror coverage for QA to use releasetest channel
12:40 13nov: win32 installer bug#403670 discovered
14:00 13nov: declare bug#403670 as showstopper, put TB2.0.0.9 on hold.
18:20 13nov: root cause and fix of bug#403670 found.
05:05 14nov: one rebuilt win32 installer handed to QA to verify bugfix
05:40 14nov: QA confirmed new win32 installer is ok.
08:30 14nov: all rebuilt win32 installers handed to QA
10:10 14nov: QA signoff on rebuilt win32 installers, mirror replication started
15:00 14nov: mirror replication confirmed complete on new win32 installers
16:00 14nov: update snippets available on release update channel (for end users)
17:10 14nov: release announced

1) This was not a “human free” release. The automation work done for FF2.0.0.9 has not been tested for TB2.0.0.9. In theory it should work just fine, but we just haven’t had time to test it, so we chose to play safe and do this release manually. Hence this took more time for Build to produce. All of that time was manually intensive Build work.
2) bug#403670 was caused by a combination of factors. One factor was human error: I incorrectly set up a workarea on a signing machine; the same incorrect setup works fine for Firefox releases. The signing doc has now been updated. The other factor was a long-standing-but-previously-unknown error-handling problem in one of our signing scripts; how to fix this is being debated within the Build team. Note: this problem was with the windows installer only, not with any Thunderbird code, and not the linux/mac installers. Overall, this delayed the release by approx 22 hours.
3) Mirror absorption times were messed up by the stop-and-restart caused by bug#403670.
4) The daylight savings change (PDT to PST) happened during this release, giving us an extra hour. That is counted in the overall times above.

take care
John

Keeping perspective: 34hours vs 37hours


It took 34 hours to produce Firefox3.0beta1 rc1.

Those 34 hours were frantic. Two people, tag-teaming day & night, working with the nervous tension of knowing that a single one-character typo could invalidate the entire build, and force us to start all over again. Those 34 hours only got us as far as producing unsigned builds on each platform – roughly 1/3 of the overall Build work needed to do a release – before we hit a problem. A typo. At the beginning of it all, one person typed PDT into one computer, while the other person typed PST into another computer. That typo meant rc1 did not include an important last-minute bugfix. So, we scrapped rc1 and started all over again, building rc2. (I note that the D and S are even next to each other on the keyboard [sigh!]. And if it wasn’t for the timezone change last week, it would not have mattered either [sigh! sigh!])
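
For the curious, that one-character difference really is a full hour. A small sketch (the date and cutoff time below are made up for illustration; only the PST/PDT offsets matter):

```python
from datetime import datetime, timezone, timedelta

PST = timezone(timedelta(hours=-8), "PST")   # Pacific Standard Time, UTC-8
PDT = timezone(timedelta(hours=-7), "PDT")   # Pacific Daylight Time, UTC-7

# The same wall-clock cutoff, typed with a different 3-letter zone...
cutoff_pst = datetime(2007, 11, 7, 9, 0, tzinfo=PST)
cutoff_pdt = datetime(2007, 11, 7, 9, 0, tzinfo=PDT)

print(cutoff_pst - cutoff_pdt)   # 1:00:00 -- enough to drop a last-minute fix
```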

To put that 34 hours in perspective, Build took 37 hours to do everything needed for the complete FF2.0.0.9 release… and most of that was actually just watching the automation chugging along. Active human work was down to a handful of hours for signing, bouncer/mirror updates, and a little nervous manual rechecking of the automated checks, just to be sure, to be sure.

Why the night and day difference?

We’ve been focusing on automation for the FF2.0.0.x branch over the last few months, shipping FF2.0.0.7, FF2.0.0.8 and FF2.0.0.9 each time with automation improved from the previous release. Sadly, none of this automation work is live on trunk yet. All the trunk releases, like the alphas, and now this FF3.0beta1, are done the old fashioned way. By hand. One command at a time.

This week was a stark reminder of what things used to be like, and gave perspective on how much we’ve accomplished so far this year.
