Samantha the Jill-in-the-box

Not really work related, but I thought others might like this photo.

First, there was Mozilla letting you bring dogs to the office.

Then, enter Samantha.
Then, Samantha started playing with a basketball.

Then, Samantha started playing with a tennis ball.

Then, we got table tennis at the Mozilla office. (…you can see where this is going…)

Table tennis is very popular with everyone here – including Samantha! As usual, Samantha still wants to get the ball… She can *hear* the ball fine, but is too short to look over the tabletop, so she can never see exactly where it is. So, she excitedly hops up like a Jill-in-the-box to see which way the ball is going, and then drops down and rushes to that side, then hops up again, then drops down and rushes to the other side, then hops up again… It took several photos before I got the timing right, and got this photo of Samantha at the peak of her hop!

Obviously, if the ball falls off the table, or someone fumbles a shot, it’s a quick race between both players and Samantha to see who gets the ball first!

In more recent weeks, we’ve all got better at table tennis, and also, she’s now a little bored with it all, so we’re able to relax a bit more when playing!

Build-always vs Build-on-checkin (continued)

My initial blog on this topic attracted a bunch of comments, and some great blogs on this, especially Rob Helmer’s blog and Rob Campbell’s blog.

One thing that quickly became apparent is that if we do change from “build-always” to “build-on-checkin”, we should have the pool of slaves already set up beforehand. We had originally planned to start off with one slave per o.s., and then grow to multiple slaves per o.s. However, setting up multiple slaves *before* changing the build-mode is also fine to do, so we’ll change plans to that. For details, see bug#411629.

I wanted to quickly thank everyone for their comments so far; they were really helpful. We’re still digging through all the other posts & comments, but if people have any further thoughts, please keep them coming…

Comment spam and Akismet

Wherever there are people, you’ll find other people trying to sell them stuff. From overpriced tshirts, bootleg recordings and scalped tickets outside rock concerts, to spam-on-house-phone, spam-on-fax, spam-on-email and now spam-blog-comments.

I was struggling, moderating blog comments the “traditional” way, until Sam Sidler showed me a better way. Since installing Akismet, I can now a) allow real people to add comments, and b) not have to wade through all the spam-comments every day. Very cool. Thanks Sam!

ps: Trivia from Akismet: 91% of all blog-comments are spam. At what point will spam-blog-comments drown out useful blog-comments, to the point where people start to avoid blogs? Would it ever overrun the medium? Emails are not quite at that stage yet, but house phones all have callerID & answering machines to screen us from spam-phone calls, and fax machines continue to churn out pages of spam-faxes.

Localized dreams?

Different dreams depending on your locale? Huh?

There’s a theory that people dream as a way to rehearse reacting to surprises/threats/problems in waking life. Which means that a Western city dweller might dream of being mugged, or a kid running out in front of a car, or showing up at a meeting unprepared. By contrast, the Amazonian Mehinaku tribe dream of being bitten by snakes, wasps, etc. Children who haven’t yet been exposed to society enough for all this start with “default” dreams of being chased by monsters – an evolutionary carryover. Nightmares after watching a horror movie are explained as being caused by your brain being fooled by the movie into thinking it was a plausible threat, and then wanting to rehearse for that new type of threat.

For details, see Psychology Today

Planning ahead… but how far ahead?

Today, I made it through the day, working on my ToDo list of what I thought needed to be done today, in between the random interrupts and distractions that constantly surprise us all. Revisiting my ToDo list several times during the day helps keep me on track. If all goes well, by the end of the day, I’ve made progress on most of my ToDo list, *and* managed to handle the various interrupts. Otherwise, sometimes I feel like my busy day was all unfocused churn; busy but not productive – the all-too-familiar cry of “I did lots today, but got nothing done”!

So, that’s today. But does today’s work fit in with tomorrow’s work? And the day after? And the day after that? For me, making a mental plan of the week, and then trying to guesstimate what can be done before what is tricky. Easy, right?

Does this week’s work fit with next week’s work? And the week after?

Does this month fit with next month?

Does this quarter fit with next quarter?

In theory, if we plot where we are today, across four quarterly goals, we should land exactly at the pre-decided goals for 2008, plotted in a perfectly straight line from start to finish. In theory. Heh. Yeah, right. Not surprisingly, as the time span lengthens, it all gets dizzyingly complex really, really quickly. A quick scan of the business section of the local bookshop proves how common this problem is. A project is more complex than expected, and takes longer to do. The cross dependencies with other people’s work grow. The risks of unexpected surprises grow. People change. None of those goals exist yet. You can try your hardest to reduce the variables, but in reality it’s impossible to predict a bunch of things over the next 12 months, so it’s all really a giant guessing game.

Despite all the possible pitfalls, distractions and disclaimers, the idea here is to have some idea of what we’re all aiming towards. It makes planned work in a given quarter make sense, compared to the quarters before and after. Also, we break big projects down into small chunks, but at some point, we need to expect all the chunks to come together making a completed big project. Where does that happen? Here. To stop people freaking out about the scale of it all, John & Mitchell asked that we tell it all as a story: pretend it’s the end of 2008, and we’re looking back over the entire year, telling a story of what we did. Here is the Build story.

We might not quite hit everything, despite the best of planning. Or we might hit it on the nail, but by taking a completely different path. Or miss something completely, and nail some unexpected last-minute project instead. Who knows. But at least, this forced us to stop, look up, and think a little about the bigger picture. All good, imho.

Let us know what you think?!

tc

John.

=====
[snip]

Well, 2008 was quite an amazing year.

From a tactical perspective, we rolled out automation on all the remaining FF and TB branches, shipped the FF3 release, expanded the scope of the automation to cover all of the random last-minute items that crop up, expanded the TB automation to also include Lightning, kept the TB releases going until we transitioned them over to a self-sufficient MailCo, worked through the FF Mercurial transition, and did more firedrill security releases than we care to count. We still have lots to do in terms of improving turnaround time, integration with test automation, performance automation, etc, but still… not bad.

But from a strategic perspective, the highlight of the year was how we did all that *and* at the same time, morphed from being a burn-out-and-replace group into a sustainable, humane, efficient team proud of what we’ve proven we can do. That change has finally given us the time to catch our breath, look beyond the next firedrill deadline, and realize there’s something special about our situation.

Frequently, in this industry, the Build group is an afterthought, hidden deep in the bowels of a software organization, somehow treated differently from Dev, QA and IT. Opaque to the rest of the company, the rest of the world and even worse, the Build groups in other companies. This means every Build group is forced to reinvent the wheel, learning from their own mistakes as they go along, rather than being able to learn from the mistakes/successes of others. The few people talking publicly in this area are typically selling software/consulting/books.

We, however, are in a unique situation. We have a rare ability to let people see how Build&Release works. A transparent Build & Release infrastructure is a unique luxury we have, which most people simply cannot have. By adding Build&Release to our open source projects, like browser/email/calendar, we can build up the pool of knowledge in this industry by showing other Build groups what worked for us, as well as what didn’t work. We stay impartial, because we’re not selling anything! We keep things practical because we use our work in production – we’re not just writing theoretical whitepapers. Heck, people might even decide to use our work as their open-source-build-platform! We can make Mozilla a public case study of how to make things better… or at least, how to not make things worse!

[snip]

Build-always vs Build-on-checkin

We currently build all the time. Fair enough. But I mean ALL the time. We actually generate and publish builds even if we know nothing has changed since the last build. If the computer is sitting idle anyway, what’s the harm in that, right!? Well, actually…

1) When a developer lands a change, they’d like to see a new build generated right away containing their change. If the build machines are already busy doing a build, then the developer has to wait for the currently-in-progress-build to finish first. Fair enough. But waiting for a build which contains nothing new is just a waste of time, imho. While we think of our build time as being approx 20mins long, it’s actually 20mins starting counting after the current build has finished. An unlucky person making a change 5 minutes after a build starts would have to wait 35mins to see their change in a build. These diagrams might help explain:

Obviously, even with this change, a developer could still be unlucky and land a change just after someone else’s change triggered a build-in-progress, in which case, things are no better/worse than they already are today. But if the developer is lucky and finds systems idle, then builds start immediately, and the build turnaround time is much improved.
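To make the arithmetic above concrete, here is a rough sketch of the wait times in both modes. The function names are made up for illustration, and the 20 minute build time is just the approximate figure quoted above; real builds vary.

```python
BUILD_MINUTES = 20  # approximate build duration quoted above (an assumption)


def wait_build_always(minutes_into_current_build, build_minutes=BUILD_MINUTES):
    """Minutes until a build containing the change is finished when builds
    run back to back: the rest of the current build, plus one full build."""
    remaining = build_minutes - minutes_into_current_build
    return remaining + build_minutes


def wait_build_on_checkin(machine_idle, minutes_into_current_build=0,
                          build_minutes=BUILD_MINUTES):
    """Same question for build-on-checkin: an idle machine starts at once,
    a busy machine behaves exactly like build-always."""
    if machine_idle:
        return build_minutes
    return wait_build_always(minutes_into_current_build, build_minutes)


print(wait_build_always(5))             # 35: the unlucky case above
print(wait_build_on_checkin(True))      # 20: lucky, machine was idle
print(wait_build_on_checkin(False, 5))  # 35: no better/worse than today
```

The busy-machine case deliberately falls through to the build-always calculation, which matches the "no better/worse" point above.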

2) After Build generates a build, the QA/performance machines take the build and measure build performance. Typically these performance results are then manually compared with previous builds, looking for deviations, etc. “Hey, this new build is x% faster/slower compared to last build”. Whenever there’s a deviation, humans try to figure out if it’s caused by code change, build change, perf infrastructure change, or unreproducible-time-space-continuum problems(!). This human analysis would be simplified (a little) if it was easy to automatically tell when builds were the same or contained different code.

So, what to do here?

We’re proposing changing our build machines to trigger a build only when a checkin is detected. Once a checkin is detected, an idle machine would start building immediately. Once a build finishes, we’d check to see if there had been any changes in the interim, and if yes, then immediately start building again…but if no changes, then just sit idle, waiting for changes.
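As a rough sketch of that proposal (not how our buildbot setup actually implements it), here is a tiny simulation of one build machine. The function name, the time units and the rule that a checkin landing exactly at build start is included in that build are all assumptions for illustration.

```python
def simulate_build_on_checkin(checkin_times, build_minutes=20):
    """Return (start, end) times, in minutes, for each build on ONE machine.

    Rules sketched from the proposal above:
      * an idle machine starts building as soon as a checkin lands;
      * a busy machine finishes its build, then builds again only if
        something was checked in meanwhile;
      * otherwise it sits idle, waiting for changes.
    """
    pending = sorted(checkin_times)
    builds = []
    free_at = 0  # time at which the machine next becomes idle
    i = 0
    while i < len(pending):
        # Start when the checkin lands, or when the machine frees up.
        start = max(pending[i], free_at)
        end = start + build_minutes
        builds.append((start, end))
        # Every checkin that landed by the start time is in this build.
        while i < len(pending) and pending[i] <= start:
            i += 1
        free_at = end
    return builds


# Checkins at minute 0, 5 and 50 of some hypothetical day.
print(simulate_build_on_checkin([0, 5, 50]))
# [(0, 20), (20, 40), (50, 70)]: the machine is idle from minute 40 to 50
```

Note that the checkin at minute 5 still waits for the build-in-progress, while the checkin at minute 50 finds an idle machine and starts immediately.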

Rolling this out will require changes in both the Build and QA/perf infrastructure, and we’re still figuring out all the gotchas, but we think it’s well worth the effort. Dev get builds faster. Build get cleaner infrastructure. QA/Perf analysis gets (slightly) simplified. For more details, see Rob Helmer’s blog (here and here).

(footnote: Running builds in parallel is a suggestion we are still investigating, which should help even further. The initial idea, of running ‘n’ continuously building processes staggered a few minutes apart, does reduce developer wait time, but also generates ‘n’ times the number of builds over the course of a 24 hour day. Most of these are identical, and all of this would have a ripple-on impact on IT storage capacity and QA/performance testing infrastructure. By contrast, by changing infrastructure to build-on-checkin, everyone gains immediately during the more quiet times of the day, and for the busier times this would bring us closer to having a pool of multiple buildbot slaves, available to start building in parallel during busy times, and be idle most of the time. More on this in another blog post.)
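For a back-of-the-envelope feel for the footnote's "‘n’ times the number of builds" point, here is a small sketch. The 20 minute build time is the approximate figure used earlier, and the 30 checkins per day is a made-up number purely for illustration, not a measurement.

```python
MINUTES_PER_DAY = 24 * 60


def builds_per_day_always(n_builders, build_minutes=20):
    """Builds produced per day when every builder builds continuously."""
    return n_builders * (MINUTES_PER_DAY // build_minutes)


def max_builds_per_day_on_checkin(checkins_per_day, n_builders,
                                  build_minutes=20):
    """Upper bound for build-on-checkin: each build needs at least one new
    checkin, and the pool cannot exceed its continuous-build capacity."""
    capacity = builds_per_day_always(n_builders, build_minutes)
    return min(checkins_per_day, capacity)


print(builds_per_day_always(1))               # 72
print(builds_per_day_always(3))               # 216
print(max_builds_per_day_on_checkin(30, 3))   # 30 (hypothetical checkin count)
```

Under these assumed numbers, three staggered always-on builders would publish 216 builds a day for storage and QA/perf to absorb, while the same pool building only on checkin would publish at most one build per checkin.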

Trimming wall-clock times – what’s next?

We made lots of progress with automation in 2007. Most of that effort was focused on automating the existing processes, speeding things up by taking the human out of the loop, and reducing late night human errors. And most importantly, buying us human time to automate further.

While there is still work to do there, our focus is now shifting a little to things like:

  • Finish cleaning up all the automation bugs found during FF3.0beta2. We were so overrun with other releases ongoing at the same time as the all-manual FF3.0beta1 that we lost track of some automation problems. It was really frustrating, and quite expensive timewise, to have beta2 automation hit problems that we already knew about, had forgotten about, and then had to manually intervene to recover from.
  • Improving handover between groups. All the wall clock times recorded in blogs so far include time spent by Build doing work. That’s fair enough. But those wall clock times also include any transition time to/from the Build group; for example, time taken for Build people to notice that a Build task has completed and notify other groups, or time taken for the Build people to see an email telling Build to start work. Was this brutal? Maybe. But the reality is that both of these can frequently cost several hours at a time, if people are sleeping or in meetings (or both!). We transition between groups a lot during a release; Damon and Polvi managed to put together this accurate work-flow diagram! The point is that all these transition times really add up. As other parts of the Build & Release hubris are calming down, these transition times are now becoming a significant portion of “Build time”. Improving these requires some notification work, and also some cleanup of our verification steps.
  • Continued de-tangling of various build systems. The more we can streamline and simplify our build systems, the easier our Build Automation work will be. And the easier our ongoing support calls will be. For example, now that we’ve shipped TB1.5.0.14, we can start looking at mothballing the various 1.8.0 branch machines. How we do the source tarball is really (unnecessarily?) complex. Making nightly builds more like release builds. There are countless other examples.
  • Improving human coordination across the entire release team. While FF3.0beta2 went so smoothly that we were ready to ship 3 days early, we discovered a human communication snafu. The docs and website folks were *way* ahead of schedule for the planned Friday release, but were never notified that the schedule had moved up to release early. They were caught on the hop Monday evening when they discovered the release had moved up, and was only hours away. Everyone scrambled to catch up, and we still shipped 3 days early. But it cost us a few hours delay, and caused a lot of unnecessary scrambling, stress and catchup that we could have avoided if we’d coordinated the schedule change better.

As engineers, it’s easy to focus on the interesting engineering problems. However, for Build Automation so far, we’re approaching this from a slightly different angle: what “reasonably” easy thing can we do quickly for the most gain? Think global time optimization. Or the low hanging fruit cliche. Or baby steps. Call it what you like; doing this means we get visible improvements very quickly, it means we get some of the more distracting silly things out of the way, and it buys us that most rare of resources: uninterrupted time to focus on solving the remaining knotty problems.

Fingers crossed for 2008!