Increasing capacity


tl;dr: We’ve had a lot of infrastructure changes go live in the last 2-3 weeks, so now build and test wait times are MUCH better. The changes made were:

1) Turn off obsolete or broken jobs.
2) Re-image some linux32/linux64/win32 machines as extra Win2008 (64bit) build machines.
3) Enable pymake in production
4) Turn on more tegras
5) Re-imaging 40 OSX10.5 test machines as 20 WinXP and 20 Win7 test machines.


After all the recent meetings and newsgroup posts, I thought it useful to follow up with a quick summary of the last few weeks of behind-the-scenes work on how RelEng infrastructure is handling the increased checkin load, as well as the increased number of builds/tests per checkin. Oh, and it's worth noting for the record that all these changes have been done without needing a downtime. :-)

1) Turn off obsolete or broken jobs.
This is not glamorous work, but there were lots of these and it had a massive impact in terms of reducing load in some areas, which allowed us to reshuffle machines to other areas. Specifically, we’ve disabled:

  • perma-orange/red test suites
  • android-xul tests, no longer needed
  • standalone talos v8 test suite, now included in the dromaeojs suite
  • B2G-on-gingerbread builds. These were intentionally broken as part of setting up B2G-on-icecreamsandwich builds, but were left running, hidden, on tbpl.m.o. This was bad because a) the B2G-on-gingerbread builds continued to waste CPU cycles and b) having broken B2G-on-gingerbread builds caused B2G IceCreamSandwich nightly builds to not be run, because automation thought that there were no good/green changesets for B2G. Disabling the broken B2G-on-gingerbread builds fixed both these issues (bug#780915).
  • Disabled osx10.5 tests on mozilla-central, related project branches and try, because FF17 no longer supports osx10.5. However, we still need some osx10.5 machines in order to run osx10.5 tests for mozilla-aurora, mozilla-beta, mozilla-release and ESR.
  • There is still an open question about linux32/64 builds/tests. Can we reduce our linux test capacity on one architecture and use those machines as test machines for other OSes? From the thread, it seems the preference would be to turn down/off 32-bit builds/tests, but if you are interested, please respond in the dev.planning thread.
  • Of course, if you know of other builds or tests that aren’t used, or are perma-red/orange, let RelEng or the Sheriffs know and we can disable them until they can be fixed! Tests still to be disabled are tracked in bug#784681.

2) Re-image some linux32/linux64/win32 machines as extra Win2008 (64bit) build machines.

  • linux32/linux64 builds continue to migrate to AWS, which frees up linux32/linux64 build machines for reimaging as win2008 (64bit) build machines. This requires changing the desktop toolchain, so we have to be careful not to break binary compatibility, and we needed to roll this change out on the trains. Details in bug#772446. As of today, we’ve increased from ~40 to 102 win2008 (64bit) build machines in production, with even more coming online soon. Details in bug#780022 and bug#784891.

  • This is important because the Win2008 (64bit) machines are used to do both win32 *and* win64 builds.
  • win32 l10n nightlies have code dependencies that require win32. Fixing this to run on win2008 (64bit) will free up even more win32 machines to convert to win2008 (64bit) builders. More details in bug#784848.

3) Enable pymake in production (bug#593585).
This reduced build time on Windows significantly. Combined with the extra Windows machines, this is all good. Coop blogged more details here, and if you like eyecandy, he even has a cool graph showing the reduced duration of builds!

4) Turn on more tegras
86 new tegras have now been added to our test pool, bringing our current total up to 284 tegras (bug#767456).

5) Re-imaging 40 OSX10.5 test machines as 20 WinXP and 20 Win7 test machines.
This brings us up to 70 machines for WinXP and 70 machines for Win7x32, which helped improve WinXP and Win7 test wait times.
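For anyone who wants to sanity-check the pool numbers above, here is a minimal sketch of the arithmetic. It uses only the figures quoted in this post; the earlier pool sizes are derived by subtraction rather than taken from any RelEng inventory, and the names (`changes`, `pool_before`, `growth_percent`) are made up for illustration.

```python
# Machines added to each test pool and the resulting pool size,
# as quoted in this post. Nothing here comes from a real inventory.
changes = {
    "tegra": {"added": 86, "now": 284},
    "winxp": {"added": 20, "now": 70},
    "win7": {"added": 20, "now": 70},
}


def pool_before(added, now):
    """Pool size before the change, derived by simple subtraction."""
    return now - added


def growth_percent(added, now):
    """Growth of the pool relative to its derived earlier size."""
    return 100.0 * added / pool_before(added, now)


# Map each pool to (derived earlier size, growth in percent, 1 decimal).
summary = {
    name: (pool_before(**c), round(growth_percent(**c), 1))
    for name, c in changes.items()
}

for name, (before, pct) in summary.items():
    print(f"{name}: {before} -> {changes[name]['now']} (+{pct}%)")
```

The win2008 (64bit) build pool is left out on purpose, since the post only gives an approximate starting point (~40) for it.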

We’ve still got lots to do, of course. After all, the faster our infrastructure can process builds and tests, the more checkins developers will do, which means the more builds and tests we will need to handle… but things are definitely better, and our numbers in dev.tree-management for the last two weeks show that!

2 Comments

  1. njn
    13 Sep 2012 @ 02:43:54

    Great work.


  2. Justin Lebar
    13 Sep 2012 @ 04:37:27

    What’s the change in the average and 90th percentile of wait times for a complete tryserver build, before and after these infra changes?

