Where does time go when you build?

4 Comments

Following on from this blogpost about how much time is spent in Makefiles, pcwalton sent me a link to his blog. He has a great breakdown of Mac OS X builds, showing where all those minutes go in each build.

We spend ~55% of the time compiling (“g++-4.2”, “gcc-4.2”, “gcc”), but what’s in the other 45%, and why? Some investigation work lies ahead of us. Thanks to pcwalton for doing all this work; data like this really helps figure out where to focus attention first.

Branch mechanics for Firefox4.0, Fennec4.0

No Comments

To cater for some last-minute code changes, we had to make last-minute changes to our branching and automation plans for Firefox4.0 and Fennec4.0.

For mechanical branching details like this, a diagram of the revised plan is worth at least a thousand words – hopefully this photo of the whiteboard near my desk makes sense:

To set context, existing repos/branches are red lines and planned FirefoxNext/FennecNext repos/branches are green lines. Within that context,

FF4.0 (black lines):
* will build using bits from mozilla-2.0 and l10n-mozilla-2.0
* to avoid having developers and localizers land patches twice, for FF4.0rc1, RelEng did the following at “go to build”:
** copied over from l10n-central to http://hg.mozilla.org/releases/l10n-mozilla-2.0
** copied over from mozilla-central to http://hg.mozilla.org/releases/mozilla-2.0
* now that Fennec-specific changes have landed on mozilla-central, RelEng will no longer copy everything over from mozilla-central to mozilla-2.0. Instead, developers with any last-minute fixes for Firefox4.0 will have to double land on mozilla-central and mozilla-2.0 after approval, just like they currently do for mozilla-1.9.2 and mozilla-1.9.1.
* after the last 4.0RC, these same repos will be used for FF4.0.1, FF4.0.2…

Fennec4.0 (blue lines):
* will build using bits from mozilla-2.1, mobile-2.0, l10n-mozilla-2.0 (previous betas were built from mozilla-central, mobile-browser, l10n-central)
* Fennec4.0RC1 cannot build from mozilla-2.0, because of last-minute Fennec fixes that are too risky for Firefox4.0. Instead, Fennec4.0RC1 will be built from the newly created mozilla-2.1.
* Fennec4.0RC1 was originally planned to build from mobile-browser, but will now build from releases/mobile-2.0 instead. These changes ensure that all infrastructure is lined up in case we need to do a fast Fennec4.0.1 release immediately after Fennec4.0.
* to avoid having developers and localizers land patches twice, for Fennec4.0rc1, RelEng will do the following at “go to build”:
** copy over from mobile-browser to http://hg.mozilla.org/releases/mobile-2.0
** copy over from l10n-central to http://hg.mozilla.org/releases/l10n-mozilla-2.0
** RelEng will repeat this until Fennec4.0rcN is signed off and released.
* after the last 4.0RC, these same repos will be used for Fennec4.0.1, Fennec4.0.2…
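
For readers who prefer data to whiteboards, the plan above can be written down as a small lookup table. This is only a sketch: the repo names come from the bullets above, while the dict layout and the `shared_repos` helper are made up for illustration and are not part of any RelEng tooling.

```python
# Repo names are taken from the post; the structure itself is hypothetical.
RELEASE_REPOS = {
    "Firefox4.0": ["mozilla-2.0", "l10n-mozilla-2.0"],
    "Fennec4.0": ["mozilla-2.1", "mobile-2.0", "l10n-mozilla-2.0"],
}

# Copies RelEng makes at "go to build" (source -> destination under hg.mozilla.org).
# For Firefox4.0 the mozilla-central copy applied to rc1 only; later fixes are
# double landed by developers instead.
GO_TO_BUILD_COPIES = {
    "Firefox4.0": {
        "l10n-central": "releases/l10n-mozilla-2.0",
        "mozilla-central": "releases/mozilla-2.0",
    },
    "Fennec4.0": {
        "mobile-browser": "releases/mobile-2.0",
        "l10n-central": "releases/l10n-mozilla-2.0",
    },
}


def shared_repos(product_a, product_b):
    """Return the repos that both products build from, in sorted order."""
    common = set(RELEASE_REPOS[product_a]) & set(RELEASE_REPOS[product_b])
    return sorted(common)


if __name__ == "__main__":
    print(shared_repos("Firefox4.0", "Fennec4.0"))
```

Written this way, it is easy to see that the two products share only the l10n repo, which is why the code repos can diverge for the risky Fennec fixes.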

There’s a lot going on here, so hope all this makes sense! Of course, questions/comments very very welcome.

ps: Debate continues on consolidating mobile-browser into mozilla-central immediately after Fennec4.0 ships, and also on the larger question of a faster cadence of feature releases. We can revisit those topics later, but I’m explicitly ignoring both of them here, because I’m focused on what we need to do in order to *ship* Firefox4.0 and Fennec4.0 in the first place. Nothing here blocks those discussions, and it’s important to be able to release, and also to be able to ship security fixes immediately.

pps: for the curious, until recently, here’s what we were *expecting* to do for the Firefox4.0, Fennec4.0 releases,

and here’s the other option we considered – not as organized, but fewer last minute infrastructure changes and would also have worked to a certain extent.

“Go to build” Firefox 4.0rc1

2 Comments

We finally hit zero blockers, and stayed there, long enough that Beltzner felt it was safe to start building FF4.0rc1. So, today, at 12:09PST, beltzner sent the “go to build” email, and I took this picture.

At this point, some of the builds are already handed to QA, and we’ll have the rest to QA by first thing in the (PST) morning. You can feel the excitement in the corridors and in the IRC channels.

Of course, this close to the finish line, it’s easy to get tempted to rush things, to take a shortcut to do things quickly – but that would only invite problems. Instead, we’re doing what we’ve done for every release – calmly, quietly, confidently, following a process that we’ve been refining and testing with every alpha, beta and security-release leading up to today.

(Oh, by the way, today we’re also calmly doing a chemspill Firefox 3.6.15 at the same time!)

This poster is just perfectly appropriate.

GODDZLA in the Mission

2 Comments

Walked past this Nissan GT-R parked on the side of the street in the Mission district of San Francisco. Spent several minutes walking around it, admiring. The license plate was a nice touch – wonder if the owner is involved with Mozilla? All the jokes about “a monster from Japan” are appropriate for this beast when you realize this 3.8L V6 Twin-turbo produces 485HP!

1,797 makefiles?!

11 Comments

Catlee made an interesting discovery while digging through historical data in the buildbot db. It’s not just that builds feel slower; they *are* slower!

It’s important to point out a few things about this chart:

  1. The machines used over the year are identical for each OS.
  2. The times explicitly are for only compile+link of full clobber nightly mozilla-central builds. Times for doing “hg clone” beforehand, or for uploading completed builds afterwards, are explicitly excluded.
  3. Full clobber builds were measured because incremental builds take wildly different times depending on what was being changed.
  4. Nothing else is running on these machines.

Linux times wobbled for a bit but ended up at about the same duration, whereas OSX and win32 times basically doubled in the last year. Win32 went from ~1h25m to over 3 hours, and then back down to 2h30m!? OSX went from ~1h15m to >2h30m, with an expected dip as we transitioned from “PPC+intel32” to “intel64” to “intel64+intel32” builds. Sure, we’ve added more code for Firefox 4.0, but I find it hard to believe that we added *that* much, and only on OSX and Win32!

What’s going on? Well, therein lies the problem. It’s hard to tell what is actually happening during the compile-and-link. Because the hardware, OS, and toolchain were consistent, I find myself looking at the makefiles with fresh interest. A quick scan of my mozilla-central clone on my laptop finds 1,797 files (Makefile, Makefile.in and *.mk files) with a combined total of 152,123 lines, and I’m not sure I found everything?!?
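
If you want to repeat that quick scan on your own clone, something like the following would do it. This is a rough sketch, not the exact command I ran: the three filename patterns come from the paragraph above, while the function name and the decision to skip the `.hg` directory are assumptions, so your totals may differ.

```python
import fnmatch
import os
import sys

# Patterns named in the post: Makefile, Makefile.in and *.mk files.
PATTERNS = ("Makefile", "Makefile.in", "*.mk")


def count_makefiles(root):
    """Return (file_count, line_count) for makefile-like files under root."""
    files = 0
    lines = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Assumption: Mercurial metadata should not be counted.
        dirnames[:] = [d for d in dirnames if d != ".hg"]
        for name in filenames:
            if any(fnmatch.fnmatchcase(name, p) for p in PATTERNS):
                files += 1
                # Read as bytes so odd encodings cannot break the count.
                with open(os.path.join(dirpath, name), "rb") as fh:
                    lines += sum(1 for _ in fh)
    return files, lines


if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    n_files, n_lines = count_makefiles(root)
    print(f"{n_files} files, {n_lines} lines")
```

Run it with the path to a mozilla-central clone as the only argument. Adding more patterns to `PATTERNS` is the easy way to check whether anything was missed.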

In the past we’ve stumbled across and fixed some bugs in Makefiles which helped speed up compile/link time, but this tangled web of makefiles needs some serious spring cleaning. We don’t know where to start yet, but the payback will be totally worth it. If you are interested in helping, or have any ideas, please let me know.

Steampunk Palin by Jim Felker

3 Comments

After I saw Aza Raskin mention this, I couldn’t get this out of my head – no matter how hard I tried. So I bought the comic, hoping that would scratch the itch and help me forget.

No luck.

A summary of the plot might help here. Sarah Palin survives an assassination attempt, but wakes up after a coma to discover doctors had to rebuild her as part robot. She teams up with McCain, Obama and a robot army to fight the evil Oil and Nuclear industry that is now polluting Alaska.

My, oh my. I still don’t know what to say.

Speeding up “hg clone”

1 Comment

If you use TryServer, or ever check code into any RelEng-supported branch, you need to read this quick post from a few days ago.

On Friday, catlee enabled “hg share” on our RelEng slaves. That sounds boring (or exciting), depending on your perspective. What matters to most people reading here is that it reduced the wall-clock time for every try build by about 25mins. To be very precise, it removed ~25mins from the ~30minute “hg clone” step, which happens before the compile and link phase can start… each and every time we build.

This is great for three reasons:

  • everyone gets their try builds faster (great).
  • by completing this current job quicker, the same slaves are available sooner to start working on the next try job. (extra greatness!).
  • this reduces load on hg.m.o, which means that the remaining cloning is completed quicker by the less-heavily-loaded hg.m.o server. (even extra goodness!!).

NOTE: To start with, this is only on linux and OSX10.6 (coming soon to win32 and OSX10.5) and for now, it’s only on TryServer builds (coming soon to nightly, release, etc. builds). Every time this change is rolled out across another portion of the RelEng infrastructure, expect to see everything get just a little speedier.
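
For the curious, here is roughly why sharing is so much cheaper than cloning. This is a minimal sketch and not the actual buildbot code: the function name, the directory arguments and the exact flags are my assumptions. The idea is to keep one long-lived local clone per slave, pull only new changesets into it, and then create each working directory with “hg share” so the history is not downloaded again.

```python
import os


def checkout_commands(repo_url, shared_dir, work_dir):
    """Return the hg commands (as argv lists) for a share-based checkout.

    Hypothetical helper: shared_dir is a long-lived clone kept on the
    slave, work_dir is the per-build working directory (it must not
    already exist when "hg share" runs).
    """
    commands = []
    if os.path.isdir(os.path.join(shared_dir, ".hg")):
        # A local copy exists, so only new changesets need to be fetched.
        commands.append(["hg", "pull", "-R", shared_dir, repo_url])
    else:
        # First build on this slave: pay for one full clone.
        commands.append(["hg", "clone", "--noupdate", repo_url, shared_dir])
    # The share extension ships with Mercurial but has to be enabled.
    commands.append(
        ["hg", "--config", "extensions.share=", "share", shared_dir, work_dir]
    )
    return commands


if __name__ == "__main__":
    for cmd in checkout_commands(
        "http://hg.mozilla.org/mozilla-central", "/tmp/hg-shared", "/tmp/build"
    ):
        print(" ".join(cmd))
```

Only the first build on a slave pays for the full clone; every later build does a small pull, which also explains the reduced load on hg.m.o.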

Send flowers, chocolate, beer or even just a brief thank you note to catlee and bhearsum!

Infrastructure load for January 2011

3 Comments

Summary:

Interesting!! We had 2,636 pushes in January 2011. This is a significant jump from the last few months, and almost hit our previous record (2,707 pushes in August 2010). Also interesting that a few branches were really busy but most branches had zero checkins.

Overall load since Jan 2009

Infrastructure load by branch

Details:

  • Shipping Fennec4.0beta4, Firefox4.0beta9, Firefox4.0beta10 and now Firefox4.0beta11 in quick succession, and with very short lockdowns, seemed to help unjam checkins backlog this month. A great relief for everyone!
  • This faster cadence seems to have helped focus efforts, with less need for working on a project branch while waiting for a clear time to land in m-c. Also, as we get closer to the actual shipping of Firefox 4.0, it feels like most of the bigger pieces are done, and the fixes still landing are smaller ones, which do not need a project branch and can be done on TryServer. Of course, that is just my interpretation… if you have other interpretations of the same data, let me know!
  • The load on TryServer jumped to 53% of our overall load. Looks like more people are now doing a TryServer run before landing, which means the patches that do land are less risky, and the tree stays green more often!
  • The numbers for this month are:
    • 2,636 code changes to our mercurial-based repos, which triggered 335,210 jobs:
    • 49,971 build jobs, or ~67 jobs per hour.
    • 158,121 unittest jobs, or ~213 jobs per hour.
    • 127,118 talos jobs, or ~171 talos jobs per hour.
  • We are still double-running unittests for some OS; running unittest-on-builder and also unittest-on-tester. This continues while developers and QA work through the issues. Whenever unittest-on-test-machine is live and green, we disable unittest-on-builders to reduce wait times for builds. Any help with these tests would be great!
  • The entire series of these infrastructure load blogposts can be found here.
  • We are still not tracking any l10n repacks, nightly builds, release builds or any “idle-timer” builds.
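
The per-hour figures in the list above can be reproduced with a few lines of arithmetic. This is a sketch under one assumption: that the rates are simple averages over all 31 × 24 = 744 hours of January. The post does not state the denominator, but that choice matches the published numbers. The variable and function names are only illustrative.

```python
# Job counts for January 2011, copied from the list above.
JOBS = {"build": 49_971, "unittest": 158_121, "talos": 127_118}

# Assumption: rates are averaged over every hour of the 31-day month.
HOURS_IN_JANUARY = 31 * 24


def jobs_per_hour(count, hours=HOURS_IN_JANUARY):
    """Average number of jobs per hour, rounded to the nearest whole job."""
    return round(count / hours)


if __name__ == "__main__":
    print("total jobs:", sum(JOBS.values()))
    for kind, count in JOBS.items():
        print(f"{kind}: ~{jobs_per_hour(count)} jobs per hour")
```

The three job types also add up to the 335,210 total, which is a handy sanity check on the breakdown.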

Detailed breakdown:
#Pushes this month

#Pushes per hour

Here’s how the math works out (descriptions of build, unittest and performance jobs triggered by each individual push are here):
the math behind the graphs

Infrastructure load for December 2010

1 Comment

Summary:

There were 1,766 pushes in December 2010. This is a continued and significant drop from September (2,436 pushes), October (2,360 pushes) and November (2,322 pushes). This continued drop in the number of checkins is expected, considering the prolonged lockdown for FF4.0beta8, immediately followed by the lockdown for FF4.0beta9, and then the Christmas/New Year holidays.

Overall load since Jan 2009

The numbers for this month are:

  • 1,766 code changes to our mercurial-based repos, which triggered 220,238 jobs:
  • 33,232 build jobs, or ~45 jobs per hour.
  • 105,396 unittest jobs, or ~142 jobs per hour.
  • 81,610 talos jobs, or ~110 talos jobs per hour.
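
To put the December drop in perspective, here is a small sketch that turns the push counts quoted above into percentage drops, and works out how many jobs each push triggered on average. The numbers are from this post; the function names are just for illustration.

```python
# Push counts quoted in the summary above.
PUSHES = {"September": 2_436, "October": 2_360, "November": 2_322}
DECEMBER_PUSHES = 1_766
DECEMBER_JOBS = 220_238


def percent_drop(before, after):
    """Percentage decrease going from `before` to `after`."""
    return 100.0 * (before - after) / before


def jobs_per_push(jobs, pushes):
    """Average number of jobs triggered by a single push."""
    return jobs / pushes


if __name__ == "__main__":
    for month, pushes in PUSHES.items():
        drop = percent_drop(pushes, DECEMBER_PUSHES)
        print(f"December vs {month}: {drop:.1f}% fewer pushes")
    per_push = jobs_per_push(DECEMBER_JOBS, DECEMBER_PUSHES)
    print(f"~{per_push:.0f} jobs per push")
```

The jobs-per-push figure is the more useful one for capacity planning, since it shows how much work lands on the infrastructure every time someone checks in.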

Infrastructure load by branch

Details:

  • The long-running lockdowns for FF4.0beta8, and then for FF4.0beta9, definitely took their toll on who was able to check in, and where/when.
  • The load on TryServer reduced back to ~50% of our overall load. So far, I do not know why. Anyone got suggestions?
  • We are still double-running unittests for some OS; running unittest-on-builder and also unittest-on-tester. This continues while developers and QA work through the issues. Whenever unittest-on-test-machine is live and green, we disable unittest-on-builders to reduce wait times for builds. Any help with these tests would be great!
  • The entire series of these infrastructure load blogposts can be found here.
  • We are still not tracking any l10n repacks, nightly builds, release builds or any “idle-timer” builds.

Detailed breakdown:
#Pushes this month

#Pushes per hour

Here’s how the math works out (descriptions of build, unittest and performance jobs triggered by each individual push are here):
the math behind the graphs

Three “go to build” emails

No Comments

Today we shipped Firefox 4.0 beta10. Lots of cool features in there, details already covered here.

We hope you like this latest beta. No doubt you’ve noticed it’s only been 11 days since 4.0beta9. We’re picking up the cadence as we get closer to the final release, doing betas more and more frequently, each with lots of improvements. Of course, please file bugs if you hit any problems!

Meanwhile, there’s one behind-the-scene detail that I’m most proud of with beta10.

RelEng got the “go to build” for each of Firefox3.5.17, Firefox3.6.14 and Firefox4.0beta10 all within 50mins of each other, all on Friday afternoon. We were able to generate all three releases concurrently, and hand builds over to QA Friday afternoon / Monday morning without any incident.

This is a great testimonial on how our release infrastructure has improved with the move to buildbot 0.8.x as well as the last 3 months of refactoring and general bug fixing. Of course, there are still lots more improvements we need to do – the next big step is underway, moving all the Fennec release automation code to buildbot 0.8.x and consolidating it with the Firefox release automation code. This will enable us to do multiple Fennec releases at the same time as multiple Firefox releases – something we feel is strategically really important for Mozilla in 2011.

Meanwhile, it was really great to see this infrastructure coming together, and how work done by RelEng so far has made handling those three emails on Friday feasible.
