Measuring infrastructure load for Jan 2009

To help with capacity planning, I pulled together some numbers for January. I’m still sorting through all this, but thought these early results were worth sharing.

In January, people pushed 1,128 code changes into the Mercurial-based repos here at Mozilla.

Since each of these pushes triggers several different types of build/unittest jobs, the *theoretical* total amount of work done by the pool-of-slaves in January was 11,511 jobs. For each push, we do:

  • mozilla-central: 11 jobs per push (L/M/W opt, L/M/W leaktest, L/M/W unittest, linux64 opt, linux-arm)
  • mozilla-1.9.1: 10 jobs per push (L/M/W opt, L/M/W leaktest, L/M/W unittest, linux64 opt)
  • tracemonkey: 7 jobs per push (L/M/W opt, L/M/W unittest, linux64 opt)
  • theoretical total: (681 x 11) + (297 x 10) + (150 x 7) = 11,511 jobs. Or ~371 jobs per day. Or ~15 jobs per hour. (Considering how many of our jobs take over an hour to complete, this is quite scary!)
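As a quick check of the arithmetic above, here is a minimal Python sketch. The push counts and jobs-per-push figures come straight from the list; the variable and function names are only illustrative, and the per-day figure assumes the 31 days of January.

```python
# Pushes per repo in January 2009, and jobs triggered by each push,
# copied from the list above.
PUSHES = {"mozilla-central": 681, "mozilla-1.9.1": 297, "tracemonkey": 150}
JOBS_PER_PUSH = {"mozilla-central": 11, "mozilla-1.9.1": 10, "tracemonkey": 7}

DAYS_IN_JANUARY = 31


def theoretical_jobs(pushes, jobs_per_push):
    """Sum pushes x jobs-per-push over every repo."""
    return sum(pushes[repo] * jobs_per_push[repo] for repo in pushes)


total_pushes = sum(PUSHES.values())
total_jobs = theoretical_jobs(PUSHES, JOBS_PER_PUSH)
jobs_per_day = total_jobs / DAYS_IN_JANUARY
jobs_per_hour = jobs_per_day / 24

print(f"{total_pushes} pushes -> {total_jobs} jobs")
print(f"~{jobs_per_day:.0f} jobs per day, ~{jobs_per_hour:.0f} jobs per hour")
```

Keeping the two tables separate makes it easy to rerun the numbers for another month, or after a job type is added to a repo.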

I say “theoretical total” because there are two complications here, each of which would slightly reduce the numbers, so I don’t yet have *actual* numbers:

  • if two pushes to the same repo arrive at hg.m.o within 2 minutes of each other, we count them as one push, not two.
  • if the entire pool-of-slaves is busy, then any pending build/unittest jobs get queued up for the next available slave. To stop the slaves from falling behind in peak times like that, we “collapse the queue”, and have the next available slave take *all* pending jobs. This is good from the point of view of keeping turnaround times as quick as possible, and keeping up with incoming jobs. However, it complicates regression hunting. Part of the reason for getting these numbers is to measure and see what we should do here.
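To get a feel for how these two complications pull the actual numbers below the theoretical total, here is a rough Python sketch. It is not the real scheduler code: both function names are made up for illustration, the 2-minute window is assumed to be measured from the previous push (so a steady stream of pushes keeps merging), and the queue is modeled as a plain list of pending pushes.

```python
from datetime import datetime, timedelta


def count_effective_pushes(push_times, window=timedelta(minutes=2)):
    """Count pushes to one repo, merging a push into the previous one
    when the gap between them is no longer than `window`.

    Assumption: the window is measured from the previous push, not from
    the first push of the merged group.
    """
    count = 0
    previous = None
    for t in sorted(push_times):
        if previous is None or t - previous > window:
            count += 1  # gap is long enough: this starts a new push
        previous = t
    return count


def collapse_queue(pending_pushes):
    """Hand every pending push to the next free slave as one batch,
    leaving the queue empty."""
    batch = list(pending_pushes)
    pending_pushes.clear()
    return batch


# Four raw pushes, three of them close together.
times = [
    datetime(2009, 1, 15, 10, 0, 0),
    datetime(2009, 1, 15, 10, 1, 30),
    datetime(2009, 1, 15, 10, 3, 0),
    datetime(2009, 1, 15, 10, 10, 0),
]
effective = count_effective_pushes(times)

# Three pushes waiting while every slave is busy.
queue = ["push-a", "push-b", "push-c"]
batch = collapse_queue(queue)
print(effective, batch, queue)
```

The second function also shows why regression hunting gets harder: if the batch build breaks, all you know is that one of the pushes in the batch caused it.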

…but this theoretical total is very close to the actual number. I’m still working on this.

Some other details:

  • a developer making ‘n’ changesets in a local repo and pushing them all up to hg.m.o at one time is counted as only one push. Put another way, this only counts changes landed into the mozilla-1.9.1/mozilla-central/tracemonkey repos on hg.m.o; it explicitly excludes “hg commits” to a developer’s local repo, which is what you see if you use “hg log”.
  • it’s interesting to see the focus of activity, and the number of pushes to a given repo, change over time. This matches the gut sense you get from irc/bugzilla of people focusing on one area and then moving to another, but that’s just my guess. Having the pool-of-slaves dynamically shift from one repo to another as needed is working really well here.
  • I’ve excluded all talos jobs, because those machines are organized differently, and I’ll need different math for that. Also excluded are all try-server jobs. Also excluded are all changes to FF2.0.0.x, TB2.0.0.x, FF3.0.x. Once I get the hg-based numbers going routinely, I’ll start to look at the cvs-based numbers.

Hopefully people find this interesting; I’ll keep digging here.
