Infrastructure load for July 2012

  • #checkins-per-month: We had 5,635 checkins in July 2012, another new record, well above our previous record of 5,246 checkins in May 2012.
  • #checkins-per-day: We had consistently high load across the month, and 19 of 30 days had over 200 checkins per day. Put another way, we had over 200 checkins per day every work day in July except Canada Day (02jul2012) and US Independence Day (04jul2012).
  • #checkins-per-hour: The peak this month was 11.35 checkins per hour, and throughout the month we sustained over 10 checkins per hour for 5 out of the 24 hours in a day.
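The per-hour numbers above can be reproduced from raw push timestamps (e.g. from the hg.mozilla.org pushlog). A minimal sketch, assuming we already have a list of Unix-epoch push times for the month; the function names are my own, not an existing tool:

```python
from collections import Counter
from datetime import datetime, timezone

def checkins_per_hour(push_epochs, days_in_month=31):
    """Bucket push timestamps (Unix epochs, UTC) by hour of day and
    return the average number of checkins for each hour across the month."""
    counts = Counter(datetime.fromtimestamp(t, tz=timezone.utc).hour
                     for t in push_epochs)
    return {hour: counts[hour] / days_in_month for hour in range(24)}

def sustained_hours(avg_by_hour, threshold=10):
    """Hours of the day whose monthly average exceeds the threshold."""
    return sorted(h for h, avg in avg_by_hour.items() if avg > threshold)
```

Feeding July's pushes through `sustained_hours` with a threshold of 10 is how a "5 hours per day over 10 checkins/hour" claim would be checked.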

mozilla-inbound, fx-team:
mozilla-inbound continues to be heavily used as an integration branch, with 26% of all checkins, far more than the other integration branches fx-team (1.5% of checkins) or mozilla-central (~3% of checkins). For comparison, I note that more people landed on mozilla-aurora than on mozilla-central.

mozilla-aurora, mozilla-beta:

  • 3.8% of our total monthly checkins landed into mozilla-aurora.
  • 2.1% of our total monthly checkins landed into mozilla-beta. This is higher than in previous months, which I suspect is related to the NativeFennec-landing-on-beta work this month.

(Standard disclaimer: I’m always glad whenever we catch a problem *before* we ship a release; it avoids a chemspill release, and we ship better code to our Firefox users in the first place.)

misc other details:

  • Pushes per day
    • You can clearly see weekends through the month, as well as the impact of the national holidays on 02jul2012, 04jul2012.

  • Pushes by hour of day
    • It is worth noting that for 5 hours in every 24-hour day, we did over 10 checkins per hour. Phrased another way, that’s one checkin every 6 minutes for 5 hours.

4 thoughts on “Infrastructure load for July 2012”

  1. Just out of curiosity, how do you handle backouts in your stats? From an infra point of view, backouts are just like all other checkins (they need to trigger builds, etc.); but from a development point of view, backouts mean extra noise (since both the backout push and the things it backed out need to be ignored to figure out activity, sort of).

    Of course, from the sidelines, it always looks like you need massively more build slaves :p (But that’s because any slaves added won’t cost me anything; from your actually-have-to-pay-for-equipment-and-people view, it’s probably much different.)
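One way the stats could separate backouts from regular pushes is the commit-message convention ("Back out …" / "Backed out changeset …"). A hedged sketch; the `desc` field and the list-of-dicts shape are assumptions about the push data, not an official pushlog API:

```python
import re

# Matches the common backout message convention: "Back out", "Backed out",
# "Backing out". This is a heuristic over commit messages, nothing more.
BACKOUT_RE = re.compile(r"^back(ed|ing)?\s?out\b", re.IGNORECASE)

def split_backouts(pushes):
    """Partition pushes into (regular, backouts) based on the first line
    of each push's commit message (assumed to be in push['desc'])."""
    regular, backouts = [], []
    for push in pushes:
        first_line = push["desc"].splitlines()[0]
        (backouts if BACKOUT_RE.search(first_line) else regular).append(push)
    return regular, backouts
```

From the infra-load point of view both lists cost the same; from the developer-activity point of view you would subtract the backouts (and, ideally, the pushes they reverted).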

  2. As a sheriff, the amount of coalescing that has been happening on inbound is certainly unpleasant. We’ve had a few instances recently where we had to close the tree for long periods of time because something broke and figuring out what did it was non-trivial due to missing tests on a long string of pushes. Also, Try is getting backed up so badly during the week now that many are (understandably, but regrettably) just landing their patches on inbound without testing and hoping for the best.

    So I would agree with Mook’s sentiment 🙂

  3. I agree with you, Mook, but only if you’re measuring developer activity. For developer activity, you’d want pushes that are backed out to negate one another (or perhaps only count once). However, both the push and the backout ran some number of tests (hopefully, when the need for a backout was realized, we canceled the tests on the original push), so from an infrastructure load perspective, the backout and the push should be counted the same — they both cause the same amount of infrastructure load.

    And we do need more slaves. Our current production pool was optimized for simplicity by having all three OSes served off the same hardware. In hindsight that decision has been both a boon and a cross. A boon because we can do things like recycling 10.5 OS X machines and converting them to Windows machines without worrying about disrupting performance numbers. A cross because we can’t just go buy more off the shelf (the hardware is no longer available).

    The high amount of coalescing that is occurring on inbound is not sustainable for our tools or for the sanity of our sheriffs. We need to broaden the slave pool to reduce the coalescing and to properly serve try with a decent turnaround time (I’d like to shoot for 2 hours).

    Because it literally takes months to add a slave to the pool (from purchase, to delivery, to racking, to power-on, to certification testing, to production), we need to solve this by recycling the machines we have to serve the current bottlenecks (like 10.5 slaves –> Windows slaves) and by optimizing for total turnaround time (the buildfaster project that Coop is leading). At the same time, we need to already be ordering slaves to help out in November, because the infrastructure load is not going down, by any measurement.
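The coalescing both commenters describe can be shown with a toy model: when a slave finally frees up, it tests only the newest pending push, and every older pending push is merged into it and never gets its own results. This is an illustration of the behavior, not Buildbot's actual scheduler code:

```python
def coalesce_pending(pending_pushes):
    """Toy model of build-request coalescing: return the one push that
    actually gets tested (the newest) and the list of older pushes that
    were coalesced into it and receive no test results of their own."""
    if not pending_pushes:
        return None, []
    newest = pending_pushes[-1]
    skipped = pending_pushes[:-1]
    return newest, skipped
```

If the tested push turns out broken, every push in `skipped` is a suspect — which is exactly the "figuring out what did it was non-trivial" problem the sheriffs describe, and why a bigger pool (fewer pending pushes per free slave) reduces the pain.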

  4. ctalbert:

    Yeah, backouts are by no means free – see the start of my second sentence above 😉 I just wanted to make sure what’s being measured is clear, so that people don’t end up drawing the wrong conclusions. (A good conclusion is “infra needs more resources”; a not-obviously-true one is “Firefox is progressing very fast”, though it certainly is by other metrics.)

    Difficulty in acquiring the right kind of test slaves is definitely a problem, though; sadly, I can’t think of any useful solutions off the top of my head. What might help is to make sure all future dev/misc machine purchases (i.e. things that don’t need to be standardized) buy to the same spec, so the machines can be used as test slaves once they stop being sold… That just delays the migration pain to the next test slave ref platform, though, since it’s bound to happen some time. (And it doesn’t help with the current test slave pain, just the next one.)

    I suspect having different reference hardware for different platforms isn’t going to actually help, since that just means you have three sets of hardware to hit the problem at different times (i.e. more interruption).

    What might temporarily help a bit is to use different slaves for tests that don’t depend on timing – reftests and crashtests probably don’t need to be on the talos slaves. (This is mainly to free up the limited number of talos slaves for testing, not because I expect those tests to take a large amount of time… though I have no timing information, to be honest.)

    … Sorry for the rambling, this is going nowhere, I’ll stop now :p