Trimming wall-clock times – what's next?

We made lots of progress with automation in 2007. Most of that effort was focused on automating the existing processes, speeding things up by taking the human out of the loop, and reducing late-night human errors. And, most importantly, buying us human time to automate further.

While there is still work to do there, our focus is now shifting a little to things like:

  • Finish cleaning up all the automation bugs found during FF3.0beta2. We were so overrun with other releases ongoing at the same time as the all-manual FF3.0beta1 that we lost track of some automation problems. It was really frustrating, and quite expensive timewise, to have beta2 automation hit problems that we already knew about, had forgotten about, and then had to manually intervene to recover from.
  • Improving handover between groups. All the wall-clock times recorded in blogs so far include time spent by Build doing work. That's fair enough. But those wall-clock times also include any transition time to/from the Build group; for example, the time taken for Build people to notice that a Build task has completed and notify other groups, or the time taken for Build people to see an email telling Build to start work. Was this brutal? Maybe. But the reality is that both of these can frequently cost several hours at a time, if people are sleeping or in meetings (or both!). We transition between groups a lot during a release; Damon and Polvi managed to put together this accurate work-flow diagram! The point is that all these transition times really add up. As other parts of the Build & Release hubris are calming down, these transition times are now becoming a significant portion of “Build time”. Improving these requires some notification work, and also some cleanup of our verification steps.
  • Continued de-tangling of various build systems. The more we can streamline and simplify our build systems, the easier our Build Automation work will be, and the easier our ongoing support calls will be. For example, now that we’ve shipped TB1.5.0.14, we can start looking at mothballing the various 1.8.0 branch machines. How we do the source tarball is really (unnecessarily?) complex. Nightly builds could be made more like release builds. There are countless other examples.
  • Improving human coordination across the entire release team. While FF3.0beta2 went so smoothly that we were ready to ship 3 days early, we discovered a human communication snafu. The docs and website folks were *way* ahead of schedule for the planned Friday release, but were never notified that the release had been moved up. They were caught on the hop Monday evening when they discovered the release had moved up and was only hours away. Everyone scrambled to catch up, and we still shipped 3 days early. But it cost us a few hours' delay, and caused a lot of unnecessary scrambling, stress and catch-up that we could have avoided if we’d coordinated the schedule change better.
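
To make the handover point concrete, here is a minimal sketch of how wall-clock time could be split into "time spent working" and "time spent waiting for the next group to notice". Everything in it is an assumption for illustration: the event log format, the timestamps, and the function name `split_wall_clock` are all invented, and none of it comes from our actual release tooling.

```python
from datetime import datetime, timedelta

# Hypothetical event log for one release: (timestamp, group, event).
# "start" = a group began a task; "done" = that group finished it.
# All values are made up for illustration only.
EVENTS = [
    ("2008-01-07 09:00", "Build", "start"),
    ("2008-01-07 13:00", "Build", "done"),
    ("2008-01-07 16:30", "QA", "start"),
    ("2008-01-08 10:30", "QA", "done"),
    ("2008-01-08 12:00", "Build", "start"),
    ("2008-01-08 14:00", "Build", "done"),
]


def split_wall_clock(events):
    """Return (work, handover) as timedeltas.

    A start -> done gap counts as work; a done -> start gap counts as
    handover (nobody has picked up the next task yet).
    """
    work = timedelta()
    handover = timedelta()
    previous_time, previous_kind = None, None
    for stamp, _group, kind in events:
        now = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
        if previous_time is not None:
            gap = now - previous_time
            if previous_kind == "start" and kind == "done":
                work += gap
            elif previous_kind == "done" and kind == "start":
                handover += gap
        previous_time, previous_kind = now, kind
    return work, handover


if __name__ == "__main__":
    work, handover = split_wall_clock(EVENTS)
    print("work:", work, "handover:", handover)
```

With a log like this, the handover total is the number that notification work is meant to shrink, while the work total is what automation shrinks.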

As engineers, it's easy to focus on the interesting engineering problems. However, for Build Automation so far, we’re approaching this from a slightly different angle: what “reasonably” easy thing can we do quickly for the most gain? Think global time optimization. Or the low-hanging-fruit cliché. Or baby steps. Call it what you like. Doing this means we get visible improvements very quickly, it means we get some of the more distracting silly things out of the way, and it buys us that most rare of resources: uninterrupted time to focus on solving the remaining knotty problems.

Fingers crossed for 2008!

One thought on “Trimming wall-clock times – what's next?”

  1. Love the diagram. It certainly helps to see what could be speeded up. For example, what’s wrong with speculatively generating the update snippets while QA are testing the full builds? If the QA passes, you are ready to go. If not, you haven’t lost anything.