Speed up TryServer by using TryChooser

If you haven’t already started using TryChooser, please try (pun intended!) it the next time you push to TryServer. If you have any questions about the syntax, please use the handy command line generator here, or read the docs. If you still have difficulty after doing either of those, ping the RelEng buildduty person.

Background

Earlier this summer, we redesigned TryServer so that each push would run every type of job available in mozilla-central. This was the easiest and fastest way to get developers using TryServer as quickly as possible, which helped reduce breakages on mozilla-central. People loved TryServer, usage rocketed, and it quickly became the heaviest load on our infrastructure. However, this also meant that every push to TryServer ran a lot of builds/unittests/talos jobs – whether the developer wanted them or not. A lot of CPU time was being wasted on jobs that people did not actually want, which in turn delayed the other jobs waiting in the queue.

A month ago, lsblakk set up TryChooser [https://bugzilla.mozilla.org/show_bug.cgi?id=473184]. This allows developers to use their push-to-try commit comments to request only the builds and tests that they actually want. This avoids wasting CPU cycles on unwanted jobs – which speeds up TryServer wait times for everyone. Great! Some people quickly started using TryChooser, which freed up CPU cycles. However, the uptake has leveled off. Last week, and also the week before, only about half of the people using TryServer used TryChooser to specify which jobs they wanted. The piechart shows who did/didn’t use TryChooser in the period between Mon, 20 Sep 2010 00:00:00 -0700 (PDT) and Sun, 26 Sep 2010 00:00:00 -0700 (PDT).

Please use TryChooser to help save CPU cycles and make TryServer faster for everyone!

Infrastructure load for August 2010

Summary:

There were 2,707 pushes in August 2010. This is well above our previous record of 1,971 in January, and nearly 50% above the 1,838 pushes we handled last month. TryServer continues to be the busiest branch of the entire infrastructure, and it’s worth noting that we did more pushes to TryServer this month than we did to the entire RelEng infrastructure, combined across all branches, in any given month during the first half of 2009.

The numbers for this month are:

  • 2,707 code changes to our mercurial-based repos, which triggered 336,910 jobs:
      • 51,217 build jobs, or ~69 jobs per hour.
      • 162,909 unittest jobs, or ~219 jobs per hour.
      • 122,784 talos jobs, or ~165 jobs per hour.
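As a rough check on the per-hour figures, here is a minimal sketch that redoes the arithmetic from the counts listed above. It assumes the rates are simple averages over every hour of a 31-day month (744 hours); the variable and function names are only illustrative. Under that assumption the build and unittest rates come out at about 69 and 219 per hour, and the talos count of 122,784 works out to about 165 per hour.

```python
# Counts copied from the August 2010 summary above.
PUSHES = 2_707
JOBS = {"build": 51_217, "unittest": 162_909, "talos": 122_784}

# Assumption: rates are averaged over every hour of the month (31 days).
HOURS_IN_AUGUST = 31 * 24  # 744


def jobs_per_hour(count, hours=HOURS_IN_AUGUST):
    """Average number of jobs per wall-clock hour."""
    return count / hours


total_jobs = sum(JOBS.values())
rates = {kind: round(jobs_per_hour(n)) for kind, n in JOBS.items()}

for kind, n in JOBS.items():
    print(f"{kind:>8}: {n:>7,} jobs, ~{rates[kind]} per hour")
print(f"   total: {total_jobs:,} jobs, ~{total_jobs / PUSHES:.0f} per push")
```

The last line also gives the average number of jobs triggered per push (about 124), which is the cost TryChooser is meant to bring down.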

Details:

  • You can clearly see the drop in load over the last few days in August – caused by a US national holiday, and a Canadian national holiday, on the same long weekend.
  • The trend of “what time of day is busiest” changed again this month. Not sure what this means, but worth pointing out that each month seems to be different. This makes finding a “good” time for a downtime almost impossible.
  • We are still double-running unittests for some OSes: running unittest-on-builder and also unittest-on-tester. This continues while developers and QA work through the issues. Whenever unittest-on-test-machine is live and green, we disable unittest-on-builders to reduce wait times for builds.
  • The entire series of these infrastructure load blogposts can be found here.
  • We are still not tracking any l10n repacks, nightly builds, release builds or any “idle-timer” builds.
  • Anamaria is getting closer to having dashboard reports like this generated automatically – something I’ll rejoice over!

Detailed breakdown:

Here’s how the math works out (descriptions of build, unittest and performance jobs triggered by each individual push are here):