Beefing up try server to clear the backlog… (update: backlog cleared)

Quick followup after my last blogpost:

  • 09:15 Fri: moved 6 win32 slaves to tryserver from production pool-of-slaves as an emergency loan.
  • 14:34 Fri: tryserver backlog on win32 fully cleared. New jobs were being processed as soon as they appeared in the try server queue.
  • 17:00 Fri: all 4 new win32 slaves for try server now online, and the loaned 6 win32 slaves returned to the production pool-of-slaves.

As far as we can tell, TryServer has been able to keep up with the incoming jobs since. Looks like the 4 new win32 slaves were enough to keep up. We’ll continue to keep an eye on it for a while longer.

Sadly, we don’t have a way to automatically monitor this queue yet, so it needs manual inspection. But you can help.

If you submit a job to TryServer and wait more than 15 minutes to see your job on the TryServer Tinderbox waterfall, please file a blocker bug. The best place to file is mozilla.org/ServerOperations, as for any other critical server; bugs there are triaged 24×7. If it’s something broken that’s not covered in the existing support docs, IT can escalate to RelEng, which means calling my cellphone out-of-hours. We’re currently at 0-2 minutes wait time, but I’d ask people to wait 15 minutes before filing, just in case of brief delays in reporting, spikes in traffic, etc.
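
If you’d rather script the rule of thumb above than watch the clock, here is a rough sketch in Python. It is not a real monitor and does not talk to TryServer or Tinderbox: the function name and its arguments are made up for illustration, and you would have to supply the submission time and the "seen on waterfall" flag yourself.

```python
from datetime import datetime, timedelta

# Threshold from this post: wait 15 minutes before filing a blocker.
FILING_THRESHOLD = timedelta(minutes=15)


def should_file_blocker(submitted_at, now, seen_on_waterfall):
    """Hypothetical helper: True if a job has waited more than
    15 minutes without showing up on the waterfall."""
    if seen_on_waterfall:
        # The job appeared, so there is nothing to report.
        return False
    return (now - submitted_at) > FILING_THRESHOLD


if __name__ == "__main__":
    submitted = datetime.now() - timedelta(minutes=20)
    print(should_file_blocker(submitted, datetime.now(), False))
```

A job that has been waiting exactly 15 minutes does not trip the check; only a strictly longer wait does, which matches the "wait 15 minutes before filing" request.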