Making unittest life better…

During the Open Design Lunch last week, one topic that came up frequently was unittests. Most questions were variations of “intermittent unittest failures block developers from landing” and “unittests take too long to run”.

Hopefully this blog post will explain some of the work already done/in progress to make this better.

The short answer is:

  • fix up unittest machines & toolchain
  • fix unittest framework so each unittest run does not require a rebuild
  • run unrelated unittest suites concurrently
  • split out big suites like mochitest into multiple smaller suites

Solving these problems will get us much improved end-to-end turnaround time, simplify debugging intermittent failures, and allow us to start running unittests on nightly and release builds.

A longer, more detailed answer needs more text, some diagrams… and obviously coffee!

Each “unittest run” actually does the following steps sequentially: pull tip-of-tree, build (with modified mozconfig), TUnit, reftest, crashtest, mochitest, mochichrome, browserchrome, a11y. However, this means if you run unittests twice in a row, even without any code change, you are actually doing: pull tip-of-tree, build (with modified mozconfig), TUnit, reftest, crashtest, mochitest, mochichrome, browserchrome, a11y, pull tip-of-tree, build (with modified mozconfig), TUnit, reftest, crashtest, mochitest, mochichrome, browserchrome, a11y. Note the double pull, and double build.
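To make that coupling concrete, here is a tiny Python sketch of the current shape of a run. This is not our actual buildbot code; the helper functions are just stand-ins for the real steps.

```python
SUITES = ["TUnit", "reftest", "crashtest", "mochitest",
          "mochichrome", "browserchrome", "a11y"]

def pull_tip_of_tree():
    print("pull tip-of-tree")                # stand-in for the real hg pull/update

def build(mozconfig):
    print("full build (mozconfig=%s)" % mozconfig)

def run_suite(suite):
    print("run %s" % suite)

def unittest_run():
    # today: the build and every suite are welded into one sequential job
    pull_tip_of_tree()
    build(mozconfig="unittest")               # full rebuild, every single time
    for suite in SUITES:
        run_suite(suite)                      # suites run serially, never concurrently

# re-running "just the tests" today actually repeats the pull and the build:
unittest_run()
unittest_run()   # second pull + second build, even with zero code changes
```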

This causes several important problems:

  1. each unittest cycle takes a long time, because it's doing a build every time.
  2. it was not practical to run each unittest suite as a separate concurrent job, because:
    • each unittest suite would need its own build step (costing more overall CPU time) and
    • each build would have its own BuildID (complicating the work of reassembling all the test results afterwards).
  3. developers have to wait until the *last* suite is completed before they see results from *any* suite.
  4. it has never been possible to run unittests on nightly builds or release builds. (because it would require rebuilding, which defeats the purpose!)
  5. it complicates debugging intermittent failures because:
    • crashes for each rebuild get different memory stackdumps
    • each build pulls tip-of-tree, so if a change lands while you are re-running tests, each build could get a different pull of tip-of-the-tree source code, and you’d be testing different things.
    • each build has a different BuildID, so it's harder to confirm that all builds have the same code.
    • having new builds each time makes it hard to spot any machine or compiler problems.
    • the typical way to find an intermittent problem is to run the test ‘n’ times. If you run “reftest” 5 times in a row, that's quick and useful. However, the wasted time of rebuilding and then running all suites serially, even if you are only interested in rerunning just one suite, really adds up. Running build+all unittest suites 5 times in a row quickly becomes impractical, especially when you require the tip-of-tree to remain constant for the duration. (See the rough arithmetic sketch after this list.)
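Here is that rough arithmetic. All the durations are invented placeholders, not measurements from our machines; they are only there to show why the rebuild dominates the cost.

```python
# rough back-of-the-envelope numbers; every duration below is a made-up
# placeholder purely to show the shape of the problem
build_minutes      = 60    # hypothetical: one full rebuild
reftest_minutes    = 15    # hypothetical: one reftest run
all_suites_minutes = 90    # hypothetical: all suites run serially

runs = 5    # the usual "run it n times" when chasing an intermittent failure

just_reftest = runs * reftest_minutes
full_cycle   = runs * (build_minutes + all_suites_minutes)

print("re-run reftest alone %d times:   %d minutes" % (runs, just_reftest))   # 75
print("rebuild + all suites %d times:   %d minutes" % (runs, full_cycle))     # 750
```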

Our plan to fix these is:

  1. Make sure that the spec of machines/VMs being used was sufficient for either build or unittest jobs. Also, consolidate both toolchains into one toolchain suitable for both builds and unittests.
    • There was lots of work done by lsblakk, robcee, schrep and others during summer 2008 to make unittest machines identical to build machines in one general purpose pool-of-slaves.
    • There was also a lot of work done by robcee, schrep, mrz, justin and myself to see if the intermittent tests would be solved by moving to faster VMs or dedicated physical hardware. While it's true that we can always make incremental improvements in turnaround time by spec-ing faster VMs or buying faster dedicated physical machines, those experiments found (different!) intermittent unittest failures each time.
    • I assert that fixing the system design problems outlined above will get us significantly better turnaround time, and also solve other problems that brute force just can't fix, so it should be done first. Only after that global (large) optimization is done should we revisit the discussion about local (smaller) optimizations.
  2. Consolidate the two toolchains, and consolidate the two sets of machines in one production pool-of-slaves. This was finished just before Christmas 2008 and means that:
    • all build slaves and unittest slaves are now part of the one pool-of-slaves, and all able to do either builds *or* unittests.
    • we can enable unittests on any new branch at the same time as we enable builds on that branch
    • we have more machines to scale up and handle build&unittest load on whatever branch is the most active branch.
    • we can now run unittests everywhere we can run builds. We’re already running unittests on each active code line. We’re nearly finished enabling unittests on try server (see bug#445611)
  3. Separate out build from unittest
    • consolidate build mozconfig with unittest mozconfig
    • clean up test setup assumptions about what files/environment settings are needed by a unittest suite (being done by Ted in bug#421611).
    • one by one, as each suite is separated out, we enable that standalone suite running by itself in the pool-of-slaves, and remove that suite from the “build-and-remaining-unittest-suites” job (see bug#383136). A rough sketch of the end state is below.
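Here is that rough sketch of the end state. This is not our actual buildbot configs, and the helper names are hypothetical: build once, then fan the identical packaged build out to independent per-suite jobs that can run concurrently across the pool-of-slaves.

```python
from concurrent.futures import ThreadPoolExecutor

SUITES = ["TUnit", "reftest", "crashtest", "mochitest",
          "mochichrome", "browserchrome", "a11y"]

def run_suite_against(build_id, suite):
    # hypothetical helper: a slave fetches the already-built package
    # identified by build_id and runs exactly one suite against it
    print("BuildID %s: running %s" % (build_id, suite))
    return (suite, "PASS")

def run_all_suites(build_id):
    # every suite sees the identical build, so all results share one BuildID
    with ThreadPoolExecutor(max_workers=len(SUITES)) as pool:
        futures = [pool.submit(run_suite_against, build_id, s) for s in SUITES]
        return [f.result() for f in futures]

print(run_all_suites("20090219042000"))   # made-up BuildID, for illustration only
```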

Once we have all the unittest suites running without requiring a build step, then we can:

  1. quickly re-run test suites on the same *identical* build, get easy-to-compare stack traces, and have no concerns about unexpected landings changing what we build from tip-of-tree.
  2. re-run just the specific test suite of interest, much more quickly (if you only care about reftest, only rerun reftest… see the sketch after this list).
  3. run tests on older builds to figure out when a test started failing intermittently.
  4. run each separate test suite concurrently on different machines, and post results for each suite as each individual suite completes.
  5. split the longest running suites into smaller bite-size suites, for better efficiency.
  6. start running unittests on nightly and release builds.
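For example, re-running just one suite against one identical build could look something like the sketch below. fetch_build() and run_reftest() are hypothetical stand-ins for downloading an already-packaged build and invoking the real harness; the pass/fail behaviour here is faked just to make the tally interesting.

```python
import random

def fetch_build(build_id):
    # hypothetical stand-in for downloading an already-packaged build
    return "/tmp/firefox-%s" % build_id

def run_reftest(build_dir):
    # hypothetical stand-in for the real reftest harness;
    # fakes an intermittent failure roughly 1 run in 10
    return random.random() > 0.1

build_dir = fetch_build("20090219042000")            # made-up BuildID
results = [run_reftest(build_dir) for _ in range(5)]
print("reftest passed %d/5 runs on the identical build" % sum(results))
```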

All in all, this is very exciting stuff! Not sure how much of that came across in the Open Design Lunch, but hopefully that all makes sense – let me know if you have questions/comments?

tc
John.
=====
ps: An early attempt at reducing the build+unittest time was to adjust some compile options to reduce build time, but that actually complicates matters (Win32 unittest tests non-PGO builds; Mac unittest tests intel-only builds, not universal builds, etc). We're still investigating what to do here; any suggestions?

Thank you for the Open Design lunch

Yesterday’s Open Design lunch was really exciting. Lots of people, plenty of ideas, a guest appearance by Rob Helmer, lights, cameras, and of course pizza! 🙂 Thanks to Asa, Jono and Rhian for making this happen, it was great!!

Some of the ideas turned out to be for projects already in progress, and some ideas were completely new. There were also loads of emails and IRC before/after, which was great. It's going to take a bit to sort through all the ideas that came up; expect to see lots of lively followup discussions in mozilla.dev.tree-management.

Also, apologies to those who missed it because they didn't find out in time; that was my bad. If I'd had my act together, I'd have posted about this in dev-tree-management beforehand, and also given people more advance notice, so they could plan accordingly. There will certainly be more Open Design lunches on this whole area, and I'll make sure to get that right next time. Sorry. Meanwhile, Asa posted the video and irc logs of the entire lunch on air.mozilla.com.

ps: After all the jokes yesterday about Lufthansa napkins, I thought people would be curious to see where all the napkin scribbling took place.

SFO->MUC is a long flight, so in quiet times, the flight crew would curiously stop by to see what that guy was doing, scribbling in the corner of the galley for hours on end. After a few minutes of pointing to diagrams and excitedly talking about cross-branch merges, incoming load variance, pool capacity planning, parallelization, and trending data, they'd nod, smile politely and go away – probably thinking I was a complete nut or something. Oh well. I don't mind, it was wonderfully productive. And the entire flight crew were fantastic, totally totally fantastic. Note the proximity of the coffee machine and chocolate! :-D

What are all these machines telling us?

Some exciting news.

Tomorrow’s Open Design lunchtime session will start exploring how we use all the various build systems (Tinderbox, Bonsai, hgweb, graphserver, talos, buildbot, etc). It will be at noon PST Thursday 19th Feb, downstairs in building K, and for remote folks, dial in info is: +1 650 903 0800 x92 Conf# 201 (US/International), +1 800 707 2533 (pin 369) Conf# 201 (US Toll Free). This is really just the start of a series of meetings/discussions/blogs on this topic, so regardless of whether you can make it tomorrow or not, any/all ideas are very welcome. This is a really complex area and will take time to detangle. Below are some quick notes which we’ll be using to help set context and help get people thinking beforehand.

Historically, we've used bonsai+tinderbox to monitor the status of our one active code line, and we were able to stretch this infrastructure to two active code lines. However, in the last year we've grown to support:

  • 2 different repos (cvs + hg)
  • 5 active code lines (up from 2 active code lines)
  • additional o.s. platforms (linux64, linux-arm, winCE…)
  • unit tests
  • talos performance + graphserver
  • try server
  • code coverage (coming)
  • approx 290 machines (up from 80+ machines)
  • a *whole* bunch more developers scattered around the world

After spending the last year bringing all these systems online, and then working on stabilizing them, the next question is: how on earth do we figure out what's going on?

Developers have to do the tinderbox -> graphserver -> hgweb -> bugzilla dance multiple times a day (and that's tricky to explain to new contributors). Sheriffs do that same dance continuously – and good luck to them on regression hunting. QA have no easy way to see if an intermittent test is caused by a code bug, a test bug or an infrastructure bug. RelEng have no clear way to see the health of all these systems. Release drivers have to have custom bugzilla queries, a breathtaking memory of details of previous releases and an ability to be in all places at all times…

There’s got to be a better way to monitor the health of our code base, health of our infrastructure, health of a given release and hey, even measure the overall efficiency of our development process.

To help start discussions, here are some questions…

  • How can we tell if code base is broken or safe to checkin to?
  • How do we know if something is broken by a checkin?
  • Which tests are intermittently failing, and since when? (trending data for test results, tied to possible guilty checkins, faulty machines)
  • When did a performance change first start… and what changed at that time?
  • What are end-to-end turnaround times?
  • How can we tell if our machines have spare capacity, or if jobs are blocked waiting for free machines?
  • Are all machines healthy and reporting correctly?
  • What is the “velocity” of development work – how many changes can a developer safely land in a day?
  • What is the “health” of our code base – how often do builds break, tests fail?
  • What about release metrics – how many remaining blockers? Do we need another beta? (Trending data for releases, so we can better predict future releases)
  • Tracking changes: “I’ve fixed this bug in tracemonkey, how can I confirm it’s fixed in mozilla-191?”, and “is my follow-on bustage-patch in tracemonkey also in mozilla-191?”
  • Do we need a custom/home-grown product to figure all this out or is there some combination of off-the-shelf 3rd party software we can use? If we do custom/home-grown, how do we maintain it?


So – what suggestions do you have?

Jan 2009 data, by time-of-day

Following on from my last post, I was curious about what a good time for downtimes would be, so I did some further digging. Here’s the same “pushes-for-the-month-of-Jan” data broken down instead by time of day.

Here’s the same data broken down by average #pushes per hour, across the entire month.

Basically, the inflow of pushes never stops! It looks, afaict, like we've officially grown into a literal 24×7 project now, and there's no real good time for a downtime anymore. If January is “normal”, then midnight-5am PST on some weekends might be the least-disruptive time for a downtime, but even so, that wasn't true for every weekend.

The move from dedicated-machines to pool-of-slaves is really paying off here. While we still need downtimes for some types of maintenance, a lot of maintenance on slaves can be done *without* closing the tree; instead, while the tree remains open, we simply take one slave out of the pool, fix it, put it back in the pool, take the next slave out, etc, while the rest of the pool-of-slaves continue working as normal. Doing this takes more time from a RelEng point of view, but it's less disruptive for developers because the tree remains open and jobs are still being processed throughout. This was simply not an option when we were running on dedicated machines, and it is more and more important now that we don't really have a “good time for a downtime” anymore.
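For illustration, the rolling-maintenance idea boils down to something like this little sketch; the slave names and do_maintenance() are made up for the example, not real RelEng tooling.

```python
from collections import deque

pool = deque(["slave%02d" % i for i in range(1, 6)])   # hypothetical slave names

def do_maintenance(slave):
    # stand-in for whatever fix the slave needs (patching, cleanup, etc)
    print("fixing %s while the rest of the pool keeps taking jobs" % slave)

for _ in range(len(pool)):
    slave = pool.popleft()     # take one slave out of rotation
    do_maintenance(slave)      # fix it; the tree stays open the whole time
    pool.append(slave)         # put it back, then move on to the next one
```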

Measuring infrastructure load for Jan 2009

To help with capacity planning, I pulled together some numbers for January. I’m still sorting through all this, but thought these early results were worth sharing.

In January, people pushed 1,128 code changes into the mercurial-based repos here in Mozilla.

As each of these pushes triggers multiple different types of build/unittest jobs, the *theoretical* total amount of work done by the pool-of-slaves in January was 11,511 jobs. For each push, we do:

  • mozilla-central: 11 jobs per push (L/M/W opt, L/M/W leaktest, L/M/W unittest, linux64 opt, linux-arm)
  • mozilla-1.9.1: 10 jobs per push (L/M/W opt, L/M/W leaktest, L/M/W unittest, linux64 opt)
  • tracemonkey: 7 jobs per push (L/M/W opt, L/M/W unittest, linux64 opt)
  • theoretical total: (681 x 11) + (297 x 10) + (150 x 7) = 11,511 jobs. Or ~371 jobs per day. Or ~15 jobs per hour. (Considering how many of our jobs take over an hour to complete, this is quite scary! See the quick calculation after this list.)
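Here is that quick calculation, using the push counts above:

```python
pushes        = {"mozilla-central": 681, "mozilla-1.9.1": 297, "tracemonkey": 150}
jobs_per_push = {"mozilla-central": 11,  "mozilla-1.9.1": 10,  "tracemonkey": 7}

total_jobs = sum(pushes[repo] * jobs_per_push[repo] for repo in pushes)
print(total_jobs)                         # 11511 jobs in January
print(round(total_jobs / 31.0))           # ~371 jobs per day (31 days in January)
print(round(total_jobs / (31 * 24.0)))    # ~15 jobs per hour
```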

I say “theoretical total” because there are two complications here, which would slightly reduce the numbers, so I don't yet have *actual* numbers:

  • if two pushes arrive on the same repo on hg.m.o within 2 minutes of each other, we count them as one push, not two (a rough sketch of this coalescing is below).
  • if the entire pool-of-slaves is busy, then any pending build/unittest jobs get queued up for the next available slave. To stop the slaves from falling behind in peak times like that, we “collapse the queue”, and have the next available slave take *all* pending jobs. This is good from the point of view of keeping turnaround times as quick as possible, and keeping up with incoming jobs. However, it complicates regression hunting. Part of the reason for getting these numbers is to measure and see what we should do here.

…but this theoretical total is very close. I’m still working on this.
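For what it's worth, one reasonable interpretation of that 2-minute coalescing rule looks like the sketch below; the timestamps are made up, and the real bookkeeping in our setup may differ.

```python
def count_pushes(push_times, window=120):
    # push_times: unix timestamps (seconds) of pushes to one repo
    counted = 0
    last_counted = None
    for t in sorted(push_times):
        if last_counted is None or t - last_counted > window:
            counted += 1
            last_counted = t
    return counted

# three pushes; the second lands 60s after the first, so it folds into it
print(count_pushes([1000, 1060, 2000]))   # -> 2
```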
Some other details:

  • a developer making ‘n’ changesets in a local repo and pushing them all up to hg.m.o at one time is counted as only one push. Put another way, this only counts changes landed into the mozilla-191/mozilla-central/tracemonkey repos based at hg.mozilla.org; this explicitly excludes “hg commits” to a developer's local repo – which is what you see if you use “hg log”.
  • it's interesting to see the focus of activity, and the number of pushes, to a given repo change over time. This matches the gut sense you get from irc/bugzilla, seeing people focus on one area and then move to another, but that's just my guess. Having the pool-of-slaves dynamically shift from one repo to another as-needed is really working well here.
  • I’ve excluded all talos jobs, because those machines are organized differently, and I’ll need different math for that. Also excluded are all try-server jobs. Also excluded are all changes to FF2.0.0.x, TB2.0.0.x, FF3.0.x. Once I get the hg-based numbers going routinely, I’ll start to look at the cvs-based numbers.

Hopefully people find this interesting; I'll keep digging here.

Fennec on WinCE

Aki got automated WinCE builds going yesterday! Isn't this awesome?!

Some quick details:

  • These builds are available for download from ftp.mozilla.org since yesterday.
  • The one broken build since yesterday afternoon/evening was caused by a patch merge conflict, which the developer has since fixed, so builds are back to green again.
  • We are building using the mozilla-central and mobile-browser repos that will be used for the upcoming fennec-on-wince alpha.
  • Disclaimer: the builds produce a zip file. Ted, Wolfe and Aki are working in bug#474530 to figure out Windows CAB installers. We also don't yet have nightly updates available; that's still being worked out.
  • belzner showed me a link to a demo of a recent WinCE build.
  • Aki is swiftly developing a taste for fine Irish whiskey. 🙂

Nice work, Aki!

Watching history in the making…

Two weeks ago, I added http://www.whitehouse.gov/blog/ to my daily RSS feeds.

Since then, each morning, I get to glimpse the volume, and sheer dynamic range, of work going on there. The speed of context switching makes my head spin. Lobbying restrictions. Child health care. Plans for Guantanamo. Weather emergencies in Arkansas and Kentucky. Interviews on Al Arabiya. Freedom of Information Act. A gender-equal payment law… Nice going for just two weeks, and that’s just the highlights of the unclassified stuff. For extra fun, you can now also see the full text of the new law, and even comment on it.

“…The law is now up on our website, where you can review its full text and submit your thoughts, comments, and ideas.”

Lots of this information was probably available in different forms in the past, if you knew where to look and who to ask. But not easily accessible to mere mortals. Getting this out of the hands of gatekeepers or media spin-doctors, and instead making it available directly to the public, feels to me like how government is supposed to be. The phrase “…by the people, for the people…” seems refreshingly appropriate once again.
I really never thought I’d see the day, and I have to say this openness really makes me proud.
As an aside: If you take top-of-the-world financial institutions, and drive them literally to bankruptcy (or in some cases to rescue by government handouts), how exactly you can justify getting a bonus in that situation is beyond me. That $18 billion (yes, billion with a B) could instead pay for a lot of salaries. I wonder how many of those laid-off financial workers, including some computer engineer friends of mine, would still have a job if that “bonus” money was used instead for employee salaries. These are the same financial institutions that make me sit silently every time I look at the carnage in my 401k. Shameful is one word for it.