1,797 makefiles?!

Catlee made an interesting discovery while digging through historical data in the buildbot db. It's not just that builds feel slower; they *are* slower!

It's important to point out a few things about this chart:

  1. The machines used over the year are identical for each OS.
  2. The times are only for the compile+link of full clobber nightly mozilla-central builds. Time spent doing “hg clone” beforehand, or uploading completed builds afterwards, is explicitly excluded.
  3. Full clobber builds were measured because incremental builds take wildly different times depending on what was being changed.
  4. Nothing else is running on these machines.

Linux times wobbled for a bit but remain about the same, while OSX and win32 times basically doubled over the last year. Win32 went from ~1h25m to over 3 hours, and then back down to ~2h30m!? OSX went from ~1h15m to >2h30m, with an expected dip as we transitioned from “PPC+intel32” to “intel64” to “intel64+intel32” builds. Sure, we’ve added more code for Firefox 4.0, but I find it hard to believe that we added *that* much, and only on OSX and Win32!

What's going on? Well, therein lies the problem. It's hard to tell what is actually happening during the compile-and-link. Because the hardware, OS, and toolchain were consistent, I find myself looking at the makefiles with fresh interest. A quick scan of my mozilla-central clone on my laptop finds 1,797 files (Makefile, Makefile.in and *.mk files) with a combined total of 152,123 lines, and I'm not sure I even found everything?!?
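For anyone who wants to repeat this count on their own clone, here is a rough sketch (the clone path is an assumption; point it at wherever your mozilla-central checkout lives):

```python
import os

# Count Makefile, Makefile.in and *.mk files in a local mozilla-central
# clone, plus their combined line count. "mozilla-central" is assumed to
# be the path of your clone; change it to suit.
ROOT = "mozilla-central"

files = 0
lines = 0
for dirpath, dirnames, filenames in os.walk(ROOT):
    dirnames[:] = [d for d in dirnames if d != ".hg"]  # skip repo metadata
    for name in filenames:
        if name in ("Makefile", "Makefile.in") or name.endswith(".mk"):
            files += 1
            with open(os.path.join(dirpath, name), "rb") as f:
                lines += sum(1 for _ in f)

print("%d makefiles, %d total lines" % (files, lines))
```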

In the past we’ve stumbled across and fixed some bugs in Makefiles which helped speed up compile/link time, but this tangled web of makefiles needs some serious spring cleaning. We don’t know where to start yet, but the payback will be totally worth it. If you are interested in helping, or have any ideas, please let me know.

Steampunk Palin by Jim Felker

After I saw Aza Raskin mention this, I couldn’t get it out of my head, no matter how hard I tried. So I bought the comic, hoping that would scratch the itch and help me forget.

No luck.

A summary of the plot might help here. Sarah Palin survives an assassination attempt, but wakes up after a coma to discover doctors had to rebuild her as part robot. She teams up with McCain, Obama and a robot army to fight the evil Oil and Nuclear industry that is now polluting Alaska.

My, oh my. I still don’t know what to say.

Speeding up “hg clone”

If you use TryServer, or ever check code into any RelEng-supported branch, you need to read this quick post from a few days ago.

On Friday, catlee enabled “hg share” on our RelEng slaves. Sounds boring (or exciting) depending on your perspective. What matters to most people reading here is that it reduced the wall-clock time for every try build by about 25 minutes. To be precise, it removed ~25 minutes from the ~30-minute “hg clone” step, which happens before the compile and link phase can start… each and every time we build.
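For those unfamiliar with the share extension, the rough idea is: instead of re-cloning the full history from hg.mozilla.org for every job, each slave keeps one long-lived local clone and creates lightweight working directories that share its .hg/store. Here is a minimal sketch; the paths, URLs and revision are illustrative only, not the actual RelEng configuration.

```python
import subprocess

# Minimal sketch of what "hg share" buys us. Paths, URLs and REV are
# hypothetical; the real RelEng setup differs in detail.
# The share extension ships with Mercurial but must be enabled in hgrc:
#   [extensions]
#   share =

BASE = "/builds/hg-shared/mozilla-central"   # long-lived local clone, full history
WORKDIR = "/builds/slave/tryserver/build"    # fresh working dir for this job
REV = "tip"                                  # in practice, the pushed changeset

def run(args):
    print("$ hg " + " ".join(args))
    subprocess.check_call(["hg"] + args)

# One-time, slow: clone the full history onto the slave (no working copy).
run(["clone", "-U", "https://hg.mozilla.org/mozilla-central", BASE])

# Per-job, fast: a new working dir that shares BASE's .hg/store, then pull
# only the few new changesets for this push and check them out.
run(["share", "-U", BASE, WORKDIR])
run(["-R", WORKDIR, "pull", "https://hg.mozilla.org/try"])
run(["-R", WORKDIR, "update", "-r", REV])
```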

This is great for three reasons:

  • everyone gets their try builds faster (great).
  • by completing this current job quicker, the same slaves are available sooner to start working on the next try job. (extra greatness!).
  • this reduces load on hg.m.o, which means that the remaining cloning is completed quicker by the less-heavily-loaded hg.m.o server. (even extra goodness!!).

NOTE: To start with, this is only on linux and OSX10.6 (coming soon to win32 and OSX10.5), and for now it's only on Tryserver builds (coming soon to nightly, release, etc. builds). Every time this change is rolled out across another portion of the RelEng infrastructure, expect to see everything get just a little speedier.

Send flowers, chocolate, beer or even just a brief thank you note to catlee and bhearsum!

Infrastructure load for January 2011

Summary:

Interesting!! We had 2,636 pushes in January 2011. This is a significant jump from the last few months, and almost hit our previous record (2,707 pushes in August 2010). It is also interesting that a few branches were really busy while most branches had zero checkins.

Details:

  • Shipping Fennec4.0beta4, Firefox4.0beta9, Firefox4.0beta10 and now Firefox4.0beta11 in quick succession, and with very short lockdowns, seemed to help unjam the checkin backlog this month. A great relief for everyone!
  • This faster cadence seems to have helped focus efforts, with less need to work on a project branch while waiting for a clear time to land in m-c. Also, as we get closer to the actual shipping of Firefox 4.0, it feels like most of the bigger pieces are done, and the remaining fixes still landing are smaller, do not need a project branch, and can be done on tryserver. Of course, that is just my interpretation… if you have other interpretations of the same data, let me know!
  • The load on TryServer jumped to 53% of our overall load. It looks like more people are now doing a TryServer run before landing, which means the patches that do land are less risky, and the tree stays green more often!
  • The numbers for this month are:
    • 2,636 code changes to our mercurial-based repos, which triggered 335,210 jobs:
    • 49,971 build jobs, or ~67 jobs per hour.
    • 158,121 unittest jobs, or ~213 jobs per hour.
    • 127,118 talos jobs, or ~171 talos jobs per hour.
  • We are still double-running unittests for some OSes: running unittest-on-builder as well as unittest-on-tester. This continues while developers and QA work through the issues. Whenever unittest-on-tester is live and green, we disable unittest-on-builder to reduce wait times for builds. Any help with these tests would be great!
  • The entire series of these infrastructure load blogposts can be found here.
  • These numbers still do not track any l10n repacks, nightly builds, release builds or any “idle-timer” builds.

Detailed breakdown is:

Here’s how the math works out (descriptions of the build, unittest and performance jobs triggered by each individual push are here):
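For anyone who wants to check the arithmetic, a quick sketch (January has 31 days, i.e. 744 hours; the job totals are the ones listed above):

```python
# Jobs-per-hour arithmetic for January 2011 (31 days = 744 hours).
def jobs_per_hour(jobs, days=31):
    return round(jobs / (days * 24.0))

print(49971 + 158121 + 127118)   # 335210 jobs, triggered by 2,636 pushes
print(jobs_per_hour(49971))      # ~67 build jobs per hour
print(jobs_per_hour(158121))     # ~213 unittest jobs per hour
print(jobs_per_hour(127118))     # ~171 talos jobs per hour
```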

Infrastructure load for December 2010

Summary:

There were 1,766 pushes in December 2010. This is a continued and significant drop from September (2,436 pushes), October (2,360 pushes) and November (2,322 pushes). This continued drop in the number of checkins is expected, considering the prolonged lockdown for FF4.0beta8, immediately followed by the lockdown for FF4.0beta9, and then the Christmas/New Year holidays.

The numbers for this month are:

  • 1,766 code changes to our mercurial-based repos, which triggered 220,238 jobs:
  • 33,232 build jobs, or ~45 jobs per hour.
  • 105,396 unittest jobs, or ~142 jobs per hour.
  • 81,610 talos jobs, or ~110 talos jobs per hour.

Details:

  • The long-running lockdown for FF4.0beta8, and then for FF4.0beta9, definitely took their toll on who was able to check in, and where/when.
  • The load on TryServer dropped back to ~50% of our overall load. So far, I do not know why. Anyone got suggestions?
  • We are still double-running unittests for some OSes: running unittest-on-builder as well as unittest-on-tester. This continues while developers and QA work through the issues. Whenever unittest-on-tester is live and green, we disable unittest-on-builder to reduce wait times for builds. Any help with these tests would be great!
  • The entire series of these infrastructure load blogposts can be found here.
  • These numbers still do not track any l10n repacks, nightly builds, release builds or any “idle-timer” builds.

Detailed breakdown is:

Here’s how the math works out (descriptions of the build, unittest and performance jobs triggered by each individual push are here):
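The same arithmetic as in the January post, applied to December's totals (also a 31-day month):

```python
# Jobs-per-hour arithmetic for December 2010 (31 days = 744 hours).
print(33232 + 105396 + 81610)    # 220238 jobs, triggered by 1,766 pushes
print(round(33232 / 744.0))      # ~45 build jobs per hour
print(round(105396 / 744.0))     # ~142 unittest jobs per hour
print(round(81610 / 744.0))      # ~110 talos jobs per hour
```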