What are all these machines telling us?

Some exciting news.

Tomorrow’s Open Design lunchtime session will start exploring how we use all the various build systems (Tinderbox, Bonsai, hgweb, graphserver, talos, buildbot, etc.). It will be at noon PST Thursday 19th Feb, downstairs in building K; for remote folks, the dial-in info is: +1 650 903 0800 x92 Conf# 201 (US/International), +1 800 707 2533 (pin 369) Conf# 201 (US Toll Free). This is really just the start of a series of meetings/discussions/blogs on this topic, so regardless of whether you can make it tomorrow or not, any and all ideas are very welcome. This is a really complex area and will take time to detangle. Below are some quick notes which we’ll be using to set context and get people thinking beforehand.

Historically, we’ve used bonsai+tinderbox to monitor the status of our one active code line, and we were able to stretch this infrastructure to two active code lines. However, in the last year we have grown to support:

  • 2 different repos (cvs + hg)
  • 5 active code lines (up from 2 active code lines)
  • additional OS platforms (linux64, linux-arm, winCE…)
  • unit tests
  • talos performance + graphserver
  • try server
  • code coverage (coming)
  • approx 290 machines (up from 80+ machines)
  • a *whole* bunch more developers scattered around the world

After spending the last year bringing all these systems online, and then working on stabilizing them, the next question is: how on earth do we figure out what’s going on?

Developers have to do the tinderbox -> graphserver -> hgweb -> bugzilla dance multiple times a day (and that’s tricky to explain to new contributors). Sheriffs do that same dance continuously – and good luck to them on regression hunting. QA have no easy way to see whether an intermittent test failure is caused by a code bug, a test bug or an infrastructure bug. RelEng have no clear way to see the health of all these systems. Release drivers have to maintain custom bugzilla queries, a breathtaking memory for the details of previous releases and an ability to be in all places at all times…

There’s got to be a better way to monitor the health of our code base, the health of our infrastructure, the health of a given release and hey, even to measure the overall efficiency of our development process.
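To make that a bit more concrete, here’s the roughest possible strawman of an “is the infrastructure up at all?” check. This is purely an illustrative sketch of mine: the URLs are placeholders rather than real endpoints, and “healthy” here just means “the front page answered”, whereas a real check would care about queue depths, pending jobs and last-report timestamps.

    # Strawman only: the URLs below are placeholders, not real endpoints,
    # and "healthy" just means "the front page answered within a few seconds".
    import urllib.request

    SERVICES = {
        "tinderbox":   "https://example.org/tinderbox",
        "graphserver": "https://example.org/graphs",
        "hgweb":       "https://example.org/hg",
        "buildbot":    "https://example.org/buildbot",
    }

    def check_health(timeout=5):
        """Return {service_name: True/False} from a simple reachability check."""
        status = {}
        for name, url in SERVICES.items():
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    status[name] = (resp.status == 200)
            except Exception:
                status[name] = False
        return status

    if __name__ == "__main__":
        for name, ok in sorted(check_health().items()):
            print("%-12s %s" % (name, "ok" if ok else "DOWN"))

Even something this crude only answers “is it up right now?” – the more interesting questions are about trends over time.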

To help start discussions, here are some questions…

  • How can we tell if the code base is broken or safe to check in to?
  • How do we know if something is broken by a checkin?
  • Which tests are intermittently failing, and since when? (trending data for test results, tied to possible guilty checkins or faulty machines; see the sketch after this list)
  • When did a performance change first start… and what changed at that time?
  • What are end-to-end turnaround times?
  • How can we tell if our machines have spare capacity, or if jobs are blocked waiting for free machines?
  • Are all machines healthy and reporting correctly?
  • What is the “velocity” of development work – how many changes can a developer safely land in a day?
  • What is the “health” of our code base – how often do builds break or tests fail?
  • What about release metrics – how many remaining blockers? Do we need another beta? (Trending data for releases, so we can better predict future releases)
  • Tracking changes: “I’ve fixed this bug in tracemonkey, how can I confirm it’s fixed in mozilla-191?”, and “is my follow-on bustage patch in tracemonkey also in mozilla-191?”
  • Do we need a custom/home-grown product to figure all this out, or is there some combination of off-the-shelf 3rd-party software we can use? If we go custom/home-grown, how do we maintain it?
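To make one of those questions concrete – “which tests are intermittently failing, and since when?” – here is a hypothetical sketch of the simplest trending I can imagine. It assumes per-run results can be exported as plain (test, push, pass/fail) records, which is my assumption rather than anything Tinderbox or graphserver exposes today, and the window and thresholds are invented purely for illustration.

    # Hypothetical sketch: assumes per-run results are available as
    # (test_name, push_id, passed) tuples, ordered oldest-to-newest.
    # The window and thresholds are invented for illustration only.
    from collections import defaultdict

    def intermittent_tests(runs, window=50, low=0.02, high=0.90):
        """Flag tests that fail sometimes (but not always) in their last `window` runs.

        Returns {test_name: (failure_rate, first_failing_push)} so a sheriff
        can jump straight to the earliest suspicious push.
        """
        history = defaultdict(list)
        for test, push, passed in runs:
            history[test].append((push, passed))

        flagged = {}
        for test, results in history.items():
            recent = results[-window:]
            failing = [push for push, passed in recent if not passed]
            rate = len(failing) / len(recent)
            if low <= rate <= high:   # fails occasionally, but not on every run
                flagged[test] = (rate, failing[0])
        return flagged

The same per-push records, once we have them, are what would let us tie a new intermittent failure back to a possible guilty checkin or a faulty machine.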

…and some links of interest (in no particular order):

So – what suggestions do you have?

One thought on “What are all these machines telling us?”

  1. One of the biggest problems I see is the fact that we still haven’t got unit test boxes on TryServer.

    Supposedly I’m meant to run about 10 hours of tests (3 platforms, all TUnit, mochitest, crashtest) before I check in, to catch possible regressions; even if I only did one platform, it could be 3-4 hours depending on my machine.

    I believe this is one of the most significant reasons (apart from people being lazy) that the tree is not green more of the time.

    Let’s get our act together and get some test boxes on the TryServer.
