Thunderbird by the (wall-clock) numbers

Mozilla released Thunderbird on Wednesday 14-nov-2007, at 5.10pm PST.

From “Dev says code ready to release” to “release is now available to public” was 15 days 22.5 hours wall-clock time, of which the Beta period took 6 days 8 hours, and Build&Release took just over 2.5 days (62.5 hours).

17:30 30oct: Dev say go
09:40 31oct: mac builds handed to QA
10:00 31oct: linux builds handed to QA
17:55 31oct: win32 signed builds handed to QA
06:50 02nov: update snippets available on betatest update channel
14:30 06nov: QA says “go” for Beta
16:10 06nov: update snippets available on beta update channel
00:30 13nov: Dev & QA says “go” for Release; Build starts final signing, bouncer entries
08:25 13nov: final signing, bouncer entries done; mirror replication started
09:40 13nov: Build announced enough mirror coverage for QA to use releasetest channel
12:40 13nov: win32 installer bug#403670 discovered
14:00 13nov: declare bug#403670 as showstopper, put TB2.0.0.9 on hold.
18:20 13nov: root cause and fix of bug#403670 found.
05:05 14nov: one rebuilt win32 installer handed to QA to verify bugfix
05:40 14nov: QA confirmed new win32 installer is ok.
08:30 14nov: all rebuilt win32 installers handed to QA
10:10 14nov: QA signoff on rebuilt win32 installers, mirror replication started
15:00 14nov: mirror replication confirmed complete on new win32 installers
16:00 14nov: update snippets available on release update channel (for end users)
17:10 14nov: release announced

1) This was not a “human free” release. The automation work done for FF2.0.0.9 has not been tested for TB2.0.0.9. In theory it should work just fine, but we just haven’t had time to test it, so we chose to play safe and do this release manually. Hence this took more time for Build to produce. All of that time was manually intensive Build work.
2) bug#403670 was caused by a combination of factors. One factor was human error: I incorrectly set up a work area on a signing machine. (The same incorrect setup works fine for Firefox releases; the signing doc has now been updated.) The other factor was a long-standing-but-previously-unknown error handling problem in one of our signing scripts; how to fix this is being debated within the Build team. Note: this problem was with the windows installer only, not with any Thunderbird code, and not with the linux/mac installers. Overall, this delayed the release by approx 22 hours.
3) Mirror absorption times were messed up by the stop-and-restart caused by bug#403670.
4) The daylight saving change (PDT to PST) happened during this release, giving us an extra hour. That is counted in the overall times above.
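
The interval arithmetic behind a timeline like this can be rechecked with a few lines of Python. This is only a sketch under stated assumptions: the timestamps are taken to be US Pacific local time, the clock change is taken to be 04-nov-2007, and the two UTC offsets are hard-coded instead of coming from a timezone database. The names `PDT`, `PST` and `elapsed_hours` are made up for this illustration, and measuring the bug#403670 delay from "discovered" to "QA signoff on rebuilt installers" is just one plausible choice of endpoints.

```python
from datetime import datetime, timedelta, timezone

# Assumption: timeline entries are US Pacific local time.
PDT = timezone(timedelta(hours=-7), "PDT")  # assumed in effect before 04-nov-2007
PST = timezone(timedelta(hours=-8), "PST")  # assumed in effect from 04-nov-2007


def elapsed_hours(start, end):
    """Elapsed hours between two datetimes (both aware, or both naive)."""
    return (end - start).total_seconds() / 3600


# Beta period: snippets on beta channel -> "go" for Release.
beta_start = datetime(2007, 11, 6, 16, 10, tzinfo=PST)
beta_end = datetime(2007, 11, 13, 0, 30, tzinfo=PST)

# bug#403670: discovered -> QA signoff on rebuilt win32 installers.
bug_found = datetime(2007, 11, 13, 12, 40, tzinfo=PST)
bug_cleared = datetime(2007, 11, 14, 10, 10, tzinfo=PST)

# An interval that spans the clock change: Dev "go" -> QA "go" for Beta.
dev_go = datetime(2007, 10, 30, 17, 30, tzinfo=PDT)
qa_go_beta = datetime(2007, 11, 6, 14, 30, tzinfo=PST)

# Subtracting the clock readings alone (offsets dropped) misses the extra hour.
naive_hours = elapsed_hours(dev_go.replace(tzinfo=None),
                            qa_go_beta.replace(tzinfo=None))
aware_hours = elapsed_hours(dev_go, qa_go_beta)

print(f"beta period:      {elapsed_hours(beta_start, beta_end):.1f} hours")
print(f"bug#403670 delay: {elapsed_hours(bug_found, bug_cleared):.1f} hours")
print(f"extra hour from clock change: {aware_hours - naive_hours:.1f}")
```

The point of carrying the offset on every timestamp is that any interval crossing the clock change automatically includes the extra hour, without anyone having to remember to add it.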

take care

Keeping perspective: 34hours vs 37hours

It took 34 hours to produce Firefox3.0beta1 rc1.

Those 34 hours were frantic. Two people, tag teaming day & night, working with the nervous tension of knowing that a single one-character typo could invalidate the entire build, and force us to start all over again. Those 34 hours only got us as far as producing unsigned builds on each platform – roughly 1/3 of the overall Build work needed to do a release – before we hit a problem. A typo. At the beginning of it all, one person typed PDT into one computer, while the other person typed PST into another computer. That typo meant rc1 did not include a last minute important bugfix. So, we scrapped rc1 and started all over again, building rc2. (I note that the D and S are even next to each other on the keyboard [sigh!]. And if it wasn’t for the timezone change last week, it would not have mattered either [sigh! sigh!])
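
To see why one letter matters, here is a small sketch of how the same clock reading labelled PDT versus PST names two instants an hour apart. The cutoff and checkin times below are hypothetical; the post does not give the real ones, nor say which machine had which label. The offsets are hard-coded and the `included` helper is invented for the illustration.

```python
from datetime import datetime, timedelta, timezone

PDT = timezone(timedelta(hours=-7), "PDT")
PST = timezone(timedelta(hours=-8), "PST")

# Hypothetical code cutoff, typed with two different zone labels.
cutoff_as_pst = datetime(2007, 11, 15, 12, 0, tzinfo=PST)
cutoff_as_pdt = datetime(2007, 11, 15, 12, 0, tzinfo=PDT)

# Hypothetical last-minute checkin, half an hour before the intended cutoff.
late_fix = datetime(2007, 11, 15, 11, 30, tzinfo=PST)


def included(checkin, cutoff):
    """True if a checkin landed at or before the cutoff instant."""
    return checkin <= cutoff


print("gap between the two cutoffs:", cutoff_as_pst - cutoff_as_pdt)
print("fix included with PST cutoff:", included(late_fix, cutoff_as_pst))
print("fix included with PDT cutoff:", included(late_fix, cutoff_as_pdt))
```

Under these assumed times, the machine using the PDT label stops an hour early and silently leaves the late fix out, which is the kind of mismatch described above.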

To put that 34 hours in perspective, Build took 37 hours to do everything needed for the complete FF2.0.0.9 release… and most of that was actually just watching the automation chugging along. Active human work was down to a handful of hours for signing, bouncer/mirror updates, and a little nervous manual rechecking of the automated checks, just to be sure, to be sure.

Why the night and day difference?

We’ve been focusing on automation for the FF2.0.0.x branch over the last few months, shipping FF2.0.0.7, FF2.0.0.8 and FF2.0.0.9 each time with automation improved from the previous release. Sadly, none of this automation work is live on trunk yet. All the trunk releases, like the alphas, and now this FF3.0beta1, are done the old fashioned way. By hand. One command at a time.

This week was a stark reminder of what things used to be like, and gave perspective on how much we’ve accomplished so far this year.

Free Software builds now also available…

… at

This special build of Firefox2.0.0.9 uses the exact same code cutoff time and cvs branch as the regular Firefox2.0.0.9 release, but was compiled with branding, logos and talkback removed.

As an aside, I didn’t know much about this special build until recently, hence there was no plan to include this in our build automation work. However, looking back, I see quite a few of them and, asking around, I learned they were done manually once the dust settled on a given Firefox release. We are now tracking automating these FreeSoftware builds in bug#385783, with some related cleanup in bug#402582.

Firefox by the (wall-clock) numbers

Mozilla released Firefox on Thursday 01-nov-2007, at 5.40pm PST.

From “do we need a release” to “release is now available to public” was 11 days 2 hours wall-clock time, of which the Beta period took 2.75 days, and Build&Release took 37 hours.

15:35 22oct: decide regressions introduced in FF2008 justify producing a quick FF2009 to address
12:30 25oct: Dev says “go”
14:40 25oct: 2009rc1 builds started
20:00 25oct: linux builds handed to QA
22:00 25oct: mac builds handed to QA
01:00 26oct: win32 signed builds handed to QA
19:40 26oct: update snippets on betatest update channel
16:30 29oct: QA says “go” for Beta
16:50 29oct: update snippets on beta update channel
10:40 01nov: Dev & QA says “go” for Release; Build starts final signing, bouncer entries
14:15 01nov: final signing, bouncer entries done; mirror replication started
17:15 01nov: update snippets on live update channel; announced

While Build Automation in FF2009 was much smoother than FF2008, this was not yet a “human free” release:
1) The talkback server had been renamed after the FF2.0.0.8 release shipped and before FF2.0.0.9 started, so our first automation run timed out at the end of the build, waiting for humans to answer the RSA “are you sure you want to connect to this machine” login question?! 🙁 We didn’t detect this until the build overran the estimated completion time, and then, after a quick fix, we were forced to rerun the entire build again. This would have been caught if our nightlies were part of the same build automation (see bug#401936).
2) We still manually do signing, adding bouncer entries, starting mirror replication and monitoring mirror replication, pushing snippets to beta channel, pushing snippets to release channel. Combined, these took 6.5 hours of the Build time, and are worthy of automation attention. Pushing update snippets to betatest channel has been automated since the FF2008 release.
3) Mirror absorption took 3 hours to reach 72-80%. The mac DMG files always straggle much lower than everything else for mirror absorption, apparently a known problem with how webservers handle that file type, but new details are emerging in bug#402141. Experiments continue, but every time we do a release, we always give thanks to morgamic for giving us the tools to measure with!
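
On point 1, one general way to stop an unattended job from waiting on an interactive host-key question is to tell OpenSSH never to prompt. The sketch below is not how our automation actually invokes ssh (the post does not show that); it only illustrates the fail-fast idea. `BatchMode`, `StrictHostKeyChecking` and `ConnectTimeout` are standard OpenSSH client options; the function names and the host name in the usage note are hypothetical.

```python
import subprocess


def ssh_command(host, remote_cmd, connect_timeout=30):
    """Build an ssh command line that fails instead of prompting a human."""
    return [
        "ssh",
        "-o", "BatchMode=yes",              # never ask for passwords/passphrases
        "-o", "StrictHostKeyChecking=yes",  # refuse unknown or changed host keys
        "-o", f"ConnectTimeout={connect_timeout}",
        host,
        remote_cmd,
    ]


def run_remote(host, remote_cmd, overall_timeout=600):
    """Run the command; raises on a non-zero exit status or on timeout."""
    return subprocess.run(
        ssh_command(host, remote_cmd),
        check=True,
        timeout=overall_timeout,
    )
```

With something like `run_remote("talkback.example.org", "true")` as an early step, a renamed server would show up as an immediate error at the start of the run, not as a hang discovered hours later.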

take care

The Baby Owners Manual

Bought this book again recently, and thought it was finally time to post a review of it.

I first found this in a bookshop years ago, just when some engineer friends of mine had their first baby, so I bought it as an impulse joke gift for them. It was easy to read, informative, and entertaining. I’m an engineer, with no prior baby experience, as were my two newly-parented friends; obviously the author’s target audience.

The book itself was written by a father-and-son combination (a doctor and a parent) in the style of a computer manual – you know… the manual you never read… the manual which comes with your new PC… full of simplified diagrams, with bubbles and arrows, showing you how to plug in the printer? and troubleshooting techniques if the mouse doesn’t work?… well, this book is exactly that, except it’s all about how to pick up a baby, burp a baby, change a baby’s diaper (different instructions for boy and girl!), wrap a baby, simple medical issues, while sending you to your nearest Baby Service Provider for more complex problems.

They smiled politely when I gave them the book, but you could tell they thought I was a little nuts.

Weeks later, they each pulled me aside and confided that they learnt lots from the book, loved it and were busy recommending it to other parents. It had become their first book to reach for, exactly because of its quick-troubleshooting design, and they learnt lots of practical tips just browsing through. Wow, funny and really useful. That settled it. Over the years, it’s become a kinda tradition now for me to buy it for any engineer friends who are having their first baby. So, Monday night, I delivered a copy of this book, along with some other gifts, to a proud new parent at Mozilla. At this point, I’ve bought maybe a dozen copies, mostly through amazon, so who knows what that is doing to my own account profile! 🙂

The publishers must think it’s successful, because they have recently started a series of books in a similar vein: The Dog Owner’s Manual, The Cat Owner’s Manual, The Toddler Owner’s Manual, The Home Owner’s Manual, etc…

Firefox by the (wall-clock) numbers

Mozilla released Firefox on Thursday 18-oct-2007, at 5.30pm PST.

From “code freeze” to “fix available to public” was 14 days 2 hours wall-clock time, which included a 7day Beta period (this was a non-firedrill release). Build&Release took 68 hours.

15:00 04oct: Dev says “go”
15:33 04oct: 2008rc1 builds started
18:20 04oct: linux builds handed to QA
19:45 04oct: mac builds handed to QA
12:45 05oct: win32 signed builds handed to QA
20:05 05oct: update snippets on betatest update channel
11:30 08oct: 2008rc1 halted. Respin declared for bugs 398422 and 398837
15:20 08oct: Dev says “go”
16:05 08oct: 2008rc2 builds started
19:50 08oct: linux builds handed to QA
22:05 08oct: mac builds handed to QA
00:45 09oct: win32 signed builds handed to QA
01:00 10oct: update snippets on betatest update channel
15:05 10oct: QA says “go” for Beta
16:05 10oct: update snippets on beta update channel
11:55 18oct: Dev & QA says “go” for Release; Build starts final signing, bouncer entries
14:25 18oct: final signing, bouncer entries done; mirror replication started
17:30 18oct: update snippets on live update channel; announced

While Build Automation in FF2008 was much smoother than FF2007, this was not yet a “human free” release:
1) signing still done manually in two places. This is known and expected.

2) As the initial build steps get automated, the steps near the end of the process become more visible. Steps like pushing-updates-snippets-to-channels, adding bouncer entries, starting mirror replication and monitoring mirror replication are now worthy of automation attention. Combined, these took 6.5 hours of the Build time, and were all manual.

3) It was interesting to note that we needed only 3 hours of mirror replication time to reach 65-72% mirror absorption. There’s been quite a lot of folklore around how long it takes for mirror replication, but as mirrors have changed, we’ve been measuring to get concrete data. Even for a mirror replication in daytime, like in this release, we saw quick absorption around 60% within the first 2 hours. We are still experimenting with IT to find out how much absorption is “enough”, so decided to wait until absorption hit around 70%, just to play safe. This is definitely not a science, and we will continue experimenting with this in future releases… any comments/feedback very very welcome!

take care

The Tipping Point by Malcolm Gladwell

It felt to me like he was covering a bunch of different topics, or short essays, all in the same book. Some resonated with me much more than others. In particular, these two:

Chapter1: Epidemics:
I had always thought of epidemics in the medical sense: flu outbreaks, avian flu, etc. However, I was fascinated by how the same study of epidemics could be applied to other completely unrelated fields. Human fashion. Graffiti. Litter. Teenage smoking. One example he detailed was a gonorrhea outbreak in Colorado Springs, Colorado (population 100,000+), which tipped over from statistically insignificant background noise, to epidemic, because of the activity of 168 people in 6 local bars in 4 small neighborhoods of the town. A statistically insignificant small group of people. Another epidemic example he detailed was how “The Broken Window Theory” was applied to the New York City subway, dealing with graffiti, and fare-evaders. I particularly like the two cultural insights behind how they dealt with fare-evaders.

The first cultural change was with the cops. Seems the cops preferred to chase bigger fish, instead of wasting the afternoon doing paperwork on one trivial misdemeanor fare-evader arrest. However, a few simple ideas changed things dramatically. Instead of doing onesy-twosy arrests, they had 10+ plain clothes cops handcuff fare evaders to each other on the platform like a large daisy chain, and then only come up from the subway station with a “full” daisy chain. Instead of driving each suspect through traffic to the police station, they converted a bus into a mobile police station so that paperwork, fingerprints, background checks could be done on site without a slow trip to the police station. Instead of just fining someone for fare-evading, and letting them go, they always ran a full background criminal check on each fare evader – and found something interesting: 1 out of 7 fare-evaders had outstanding arrest warrants; 1 out of 20 had illegal weapons. Suddenly, the cops on the street felt it was not “just a fare-evader”… now a daisy-chain of 20 fare-evaders was a really interesting surprise bonanza box, and the easiest way in the world to catch “real” bad guys.

The second cultural change was with the subway riders. Seems that the general public attitude had deteriorated to “why should I pay, if everyone else is evading”. Even people who would not normally break the rules, who would never consider themselves as criminals, were avoiding paying fares “because everyone else did it too”. However, the daisy-chain of handcuffed fare-evaders was a clearly visible deterrent, a reminder of what the rules were, and how society expected people to behave. Quickly, the number of people trying to fare-evade dropped. Which meant more people paid. It also created the perception of the subway being safer, so more people felt safe to choose to travel on the subway. So even more people paid. All this meant they had more money to fix other problems, like old rolling stock, tracks, ticketing systems, etc.
Of course, the NYC subway is not perfect, then and now. However, a handful of small, carefully chosen physical changes triggered a couple of critical cultural changes, which turned around a problem that had previously been almost given up for lost. I realized it’s really easy to trick yourself into thinking that a big reward requires a big effort project, and then with those mental blinkers on, only allow yourself to consider big ticket items. “How little things can make a big difference” is a really good subtitle for this book.

Chapter 5: Human group size:
Gladwell contends that human groups, and intra-group human loyalties, only scale up to about 150 people. He suggests it’s a function of the limits of the human brain to handle all the combinations of relationships between everyone in the group. For groups under 150 people, the inter-personal relationships, friendships, peer pressure, of everyone knowing each other tends to keep people focused and working together towards a common group goal. However, groups that grow over 150 quickly lose internal cohesion, internal focus, because they are just too big for everyone to really know everyone else. Once that cohesion breaks down, people instead start forming smaller subgroups, trusting their own subgroup, questioning motives of those other subgroups, and focusing on their own personal agendas. In a Western-business culture setting, it would be called internal-company-rivalry.

I once worked in a company as it grew from 42 people to 180 people, and experienced that change of internal cohesion myself. At the time, I just knew “things had changed”, but only later, looking back, I figured out it was to do with how many people were in the company, and not feeling like we were all working together anymore. Personally, I’d always thought the change from “we tight knit small band of brothers in a small company” to “I’m just a nameless cog in a faceless bureaucracy” started to happen somewhere around 100 people, but that was just a gut guess. However, Gladwell shows examples of hunter gatherer tribes, ranging from Australia to Greenland, all averaging just under 150 people per village. The Hutterites (a religious group with similar roots to the Amish) have a strict policy that once a community approaches 150 people, it splits into two equal separate communities. In business, the same principle is followed by Gore Associates (the manufacturer of GoreTex)… and they believe that contributes to why they have employee turnover 1/3 of the industry average, have been profitable for 35 years in a row (and counting), and are constantly successfully innovating new products and markets… all without formal management structures. In groups under 150, he suggests that “personal loyalties and direct man-to-man contacts” keep everyone focused on doing the right thing for the organisation.
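
A quick back-of-envelope way to feel the “combinations of relationships” point is to count the distinct pairs in a group. The pair count n(n-1)/2 is only an illustration chosen here, not necessarily the calculation Gladwell uses, and the function name is invented for the sketch.

```python
def pairwise_relationships(people):
    """Number of distinct two-person pairs in a group of the given size."""
    return people * (people - 1) // 2


# Group sizes mentioned above: 42, ~100, 150 and 180 people.
for size in (42, 100, 150, 180):
    print(f"{size:>3} people -> {pairwise_relationships(size):>6} pairs")
```

Growing from 42 to 180 people is a bit more than 4x the headcount, but the number of pairs to keep track of grows by nearly 19x.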

There’s quite a lot of the book given to describing Mavens, Connectors and Salesmen. I’m still thinking that part over, not convinced yet.

Would I recommend this book? Yes, especially these two chapters.

Firefox 3alpha8 by the (wall-clock) numbers

Mozilla released Firefox3a8 on Thursday, 20-sep-2007, at 08:30am PST.

This was a manual build run (not automated on trunk yet), and an alpha release (not a high-priority security release), so the numbers are quite different to the earlier Firefox2.0.0.7 release. Even as an apples-to-baseballs comparison, I thought the numbers were interesting and worth sharing. From “code freeze” to “available for public download” was 14.33 days wall-clock time. Of that time Build&Release took 2.25 days (55 hours including the respin).

00:01 06-sept: M8 code freeze, tree closed
18:48 11-sept: Dev verifies last fix landed, and gives “go” to build
20:46 11-sept: Build starts building
01:26 12-sept: blocker bug#395862 filed
08:40 12-sept: blocker patch landed
09:22 12-sept: Build restarts building
13:49 12-sept: linux & mac builds handed to QA
18:01 12-sept: signed-win32 build handed to QA
11:17 18-sept: QA signed off on all builds
00:01 19-sept: Build supposed to finish signing and publish builds externally
02:58 20-sept: files available externally for download
08:31 20-sept: mirror absorption completed and release announced

There were a few interesting points about this release:
1) There was a 5.75 day delay between when the code freeze started, and when the tree was first deemed ready for builds to start.
2) After builds started, a last minute blocker bug caused those builds to be abandoned and new builds started. This respin cost Build 12 hours.
3) Between 13-17sept inclusive, both Build and QA switched to work on FF2.0.0.7 (a higher priority security firedrill release). This caused wall-clock delays.
4) After QA signoff, we delayed releasing Firefox3a8 from 18sept to 19sept, to avoid traffic load of releasing Firefox3a8 on the same day as Firefox2.0.0.7.
5) There was a 1 day delay between when QA signed off on the builds and when Build group ran the remaining manual steps (signing installer, pushing bits externally, etc). These remaining Build steps only took a handful of hours to complete. However, the person doing those remaining manual steps (ie me!), was sidetracked with other non-release work.
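
The delays quoted in these points can be recomputed straight from the timeline text. The parser below is a minimal sketch that assumes every entry has the form `HH:MM DD-mon: description`, that all entries fall in 2007, and that no clock change happens inside the window; `MONTHS`, `parse_line` and `hours_between` are names invented for the example.

```python
from datetime import datetime

MONTHS = {"sept": 9}  # only the month abbreviation used in this timeline

TIMELINE = """\
00:01 06-sept: M8 code freeze, tree closed
18:48 11-sept: Dev verifies last fix landed, and gives “go” to build
20:46 11-sept: Build starts building
01:26 12-sept: blocker bug#395862 filed
08:40 12-sept: blocker patch landed
09:22 12-sept: Build restarts building
13:49 12-sept: linux & mac builds handed to QA
18:01 12-sept: signed-win32 build handed to QA
11:17 18-sept: QA signed off on all builds
00:01 19-sept: Build supposed to finish signing and publish builds externally
02:58 20-sept: files available externally for download
08:31 20-sept: mirror absorption completed and release announced
"""


def parse_line(line, year=2007):
    """Split one timeline entry into (datetime, description)."""
    stamp, _, text = line.partition(": ")
    clock, day_month = stamp.split()
    day, month = day_month.split("-")
    hour, minute = (int(part) for part in clock.split(":"))
    return datetime(year, MONTHS[month], int(day), hour, minute), text


events = [parse_line(line) for line in TIMELINE.splitlines()]


def hours_between(first, second):
    """Elapsed hours between two entries, given their positions in the list."""
    return (events[second][0] - events[first][0]).total_seconds() / 3600


freeze_to_go_days = hours_between(0, 1) / 24      # code freeze -> Dev "go"
respin_hours = hours_between(2, 5)                # first build start -> restart
total_days = hours_between(0, len(events) - 1) / 24

print(f"freeze to go: {freeze_to_go_days:.2f} days")
print(f"respin cost:  {respin_hours:.1f} hours")
print(f"freeze to announce: {total_days:.2f} days")
```

Keeping the raw timeline as text and deriving the numbers from it makes it easy to recheck the summary whenever an entry is corrected.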

Firefox by the (wall-clock) numbers

Mozilla released Firefox on Tuesday 18-sep-2007, at 3pm PST. For background on this security firedrill, see here.

This was our first production run using the new automation, so I thought the following wall-clock numbers might be interesting. From “initial report” to “fix available to public” was 6.25 days wall-clock time. Of that, Build&Release took just under 2 days (45 hours).

09:00 Wed: bug reported 9am (or 8.30am?). Dev start working on fix
13:40 Fri: fix landed on 1.8 branch
14:30 Fri: build started
18:30 Fri: linux builds handed to QA
22:30 Fri: mac builds handed to QA
22:30 Fri: win32 unsigned builds handed to QA
11:58 Sat: win32 signed builds handed to QA (1st time)
01:30 Sun: win32 signed builds handed to QA (2nd time, rebuilt on old
12:10 Sun: update snippets pushed to beta update channel
15:00 Tue: update snippets pushed to live update channel; announced

Full disclaimer, while this fast turnaround kept Mike Shaver happy, it was not yet a “human free” release. We hit 4 issues, which required manual intervention:

1) last minute question about possible CVS-cross-branch tagging problem in automation scripts. Problem unconfirmed, but decided to manually tag anyway, just to be safe. Problem still unconfirmed, but test case now designed to clarify for future releases (see bug#396290)

2) l10n builds on win32 had the wrong cr-lf settings in README, EULA. The root cause of this was an internal communications snafu within the Build&Release group. Historically, we build l10n win32 on different machines to win32 en-US machines. As part of automation rollout, some folks thought the l10n win32 builds were now being done on same machines as en-US for 2005+2006, some thought l10n win32 was still being built on different machines. Because these different machines have different cygwin cr-lf settings, this problem first surfaced as a problem where text files like README, EULA had the wrong cr-lf settings. It was caught by a recently added test. Rather than debug/fix the problem, we just built on the old l10n machine and shipped that for win32. This miscommunication has been clarified. Still checking if there’s anything else here we missed.

3) signing still done manually. This is known and expected. Note: as the step-before-signing finished late at night, the automation waited overnight until a human woke up and did the signing the next morning.

4) manually copying bits from stage to build-console after each step completed. This was a known issue that we expected to have fixed for the scheduled 2007 release, but was not yet in place when this Firefox2.0.0.7 firedrill started. After each step finished, we had to manually copy files between “stage” and “build-console”, so that the next step would find the files it was expecting. Was intrusive and annoying. On track to be completed before end sept. (see bug#396438)
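
For issue 2, a line-ending check of the kind that caught the README/EULA problem might look like the sketch below. This is not the actual test from our tree (the post does not show it); the function names are invented, and treating CRLF as the expected style for win32 text files is an assumption exposed as a parameter.

```python
def line_ending_style(data: bytes) -> str:
    """Classify line endings as 'crlf', 'lf', 'cr', 'mixed' or 'none'."""
    crlf = data.count(b"\r\n")
    lone_lf = data.count(b"\n") - crlf   # LF not preceded by CR
    lone_cr = data.count(b"\r") - crlf   # CR not followed by LF
    if not (crlf or lone_lf or lone_cr):
        return "none"
    if crlf and not (lone_lf or lone_cr):
        return "crlf"
    if lone_lf and not (crlf or lone_cr):
        return "lf"
    if lone_cr and not (crlf or lone_lf):
        return "cr"
    return "mixed"


def wrong_line_endings(files, expected="crlf"):
    """Given {name: bytes}, return the names whose style is not the expected one.

    Files with no line breaks at all are treated as acceptable.
    """
    return sorted(
        name
        for name, data in files.items()
        if line_ending_style(data) not in (expected, "none")
    )
```

Running something like `wrong_line_endings({"README.txt": readme_bytes, "EULA.txt": eula_bytes})` against the unpacked installer contents on every build would flag a machine with different cygwin settings as soon as it produced a package.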

ps: After the release, we’ve heard a few questions about the new GPG key. The previous key had expired sept2006, and was still being used, until this new key was available in August2007. We used the new key in Firefox3a7, and also in Firefox2007. After the Firefox2007 release, there were some questions about how to confirm the new public signing key on key servers. We’ve reviewed the keys on key servers, and they seem ok, but are still investigating. (see bug#377781).