Beefing up try server to clear the backlog… (update: backlog cleared)

Quick followup after my last blogpost:

  • 09:15 Fri: moved 6 win32 slaves to tryserver from production pool-of-slaves as an emergency loan.
  • 14:34 Fri: tryserver backlog on win32 fully cleared. New jobs were being processed as soon as they appeared in the try server queue.
  • 17:00 Fri: all 4 new win32 slaves for try server now online, and the loaned 6 win32 slaves returned to the production pool-of-slaves.

As far as we can tell, TryServer has been able to keep up with the incoming jobs since. Looks like the 4 new win32 slaves were enough to keep up. We’ll continue to keep an eye on it for a while longer.

Sadly, we don’t have a way to automatically monitor this queue yet, so it needs manual inspection. But you can help.

If you submit a job to TryServer and wait more than 15 minutes for it to appear on the TryServer Tinderbox waterfall, please file a blocker bug. The best place to file is mozilla.org/ServerOperations, as for any other critical server; bugs there are triaged 24×7. If it’s something broken that’s not covered in the existing support docs, IT can escalate to RelEng, which means calling my cellphone out-of-hours. We’re currently at 0-2 minutes wait time, but I’d ask people to wait 15 minutes before filing, just in case of brief delays in reporting, spikes in traffic, etc.
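Until there is real automated monitoring, the 15-minute rule can be written down as a small check. This is only a sketch of the rule of thumb, not anything that runs on TryServer: the function name, its parameters, and the timestamps in the usage note are all made up for illustration, and you would have to supply the times yourself from when you pushed and when you looked at the waterfall.

```python
from datetime import datetime, timedelta
from typing import Optional

# Threshold taken from the post: wait 15 minutes before filing a bug.
FILE_BUG_AFTER = timedelta(minutes=15)


def should_file_bug(
    submitted_at: datetime,
    now: datetime,
    seen_on_waterfall_at: Optional[datetime] = None,
) -> bool:
    """Hypothetical helper: True if the job has still not shown up on the
    waterfall and more than 15 minutes have passed since submission."""
    if seen_on_waterfall_at is not None:
        # The job did appear, so the queue is being processed.
        return False
    return now - submitted_at > FILE_BUG_AFTER
```

For example, with an arbitrary submission time `t0 = datetime(2009, 1, 1, 9, 0)`, calling `should_file_bug(t0, t0 + timedelta(minutes=2))` returns `False`, while `should_file_bug(t0, t0 + timedelta(minutes=16))` returns `True`.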

Beefing up try server to clear the backlog

Since enabling unittests on the try server, the existing try-server pool-of-slaves has been getting more work to do. This is because:

  • more people are using try server now
  • each request to try server is doing more stuff (i.e. unittests-which-include-an-additional-build, in addition to the standalone builds!)

We had some spare capacity beforehand, so mac and linux have been keeping up, but the win32 machines were not. To clear the win32 backlog, this morning we switched 6 win32 slaves from the main production pool-of-slaves over to the try-server pool-of-slaves.
We’ll move those slaves back to main production pool-of-slaves once the backlog is cleared, and the previously requested 4 additional try slaves are online (see bug#485883, bug#485885 for details).
This means that for a few hours today, we’ll have 6 fewer slaves on production pool-of-slaves, so some win32 jobs might take longer than usual. Please be patient. We’re hoping that we’ll have caught up on the try-server backlog and returned these 6 win32 slaves in a few hours, so maybe it won’t even be noticeable. If you do see problems, please let us know in bug#486672.
More later…

John.
