First sighting of Firefox 64bit builds on Win64

Armen was delighted to see this today. So was I.

This is still just a very, very early experimental build, but with that disclaimer, if you want to try using it, you can get it here. There are lots of open questions about compiler versions, toolchain, mozconfig settings, etc., which need to be figured out before we start imaging up infrastructure like we have done for OSX10.6 64bit and linux64. However, it is great to finally see some concrete, visible progress after all the work put into this.

Many thanks to Makoto, dmandelin and danderson for their help to Armen so far. We’ve a long way still to go, so any help would be GREAT! If you are interested in helping, or just following the blow-by-blow details, have a look at bug#558448.

Power off and recycle the Firefox 3.0 machines

1st June 2010 will be a big day for RelEng. In addition to the FF3.6.4 and FF3.5.10 releases, we’ll also finally be able to power off the FF3.0 infrastructure. The CVS-based machines listed in bug#554226 have been supported in production for over 3.5 years, so we’ll be sad (and happy!) to see them go.

After all those years, it’s quite possible that people are relying on those machines in ways we do not even know about. Hence this widespread notice. If you have any reasons these Firefox 3.0 machines should be left running, please let us know by commenting in bug#554226. After we power off these machines, they can be restored from tape backup if needed, but doing that is non-trivial, so it should only be considered an extreme last resort.

    What will change:

  • No FF3.0.x incremental/depend/hourly builds will be produced.
  • No FF3.0.x clobber/nightly builds will be produced.
  • No FF3.0.x release builds will be produced.
  • The FF3.0 waterfall page will be removed from tinderbox. Specifically, this page http://tinderbox.mozilla.org/showbuilds.cgi?tree=Firefox3.0 will go away as it will be empty.

    What will *not* change:

  • Existing FF3.0.x builds will still be available for download from http://ftp.mozilla.org/pub/mozilla.org/firefox/releases/
  • Existing update offers will still be available. For example:
    • FF3.0.14 users can still update to FF3.0.19.
    • FF3.0.19 users can still update to latest FF3.6.x release (which is FF3.6.3 as of this writing).
  • Newly revised major update offers, like from FF3.0.19 -> a future FF3.6.9 release, could still be produced as needed (because these are produced on the FF3.6.x infrastructure, not on the powered-off FF3.0 infrastructure).
  • Any mozilla-1.9.0 machines which are not Firefox specific should continue to run as usual.

    Why do this:

  • Redeploy some of these machines to the production pool-of-slaves or the try pool-of-slaves, where there is more demand.
  • Reduce manual support workload and systems complexity for RelEng and IT.
  • Allows us to speed up making changes to infrastructure code, as there’s no longer a need to special-case and retest FF3.0-specific situations. As soon as we power off the Thunderbird2.0 machines, we can stop having to support both CVS *and* Mercurial throughout build automation.
  • For the curious, Mozilla’s 6-month end-of-life support policy can be seen here (https://wiki.mozilla.org/ReleaseRoadmap) and is also mentioned on the Firefox “all-older” download page here: http://www.mozilla.com/en-US/firefox/all-older.html

If you have any reasons that these Firefox3.0 machines should continue running, please comment in bug#554226. Now.

Yes, really.

Now.

Thanks
John.

Little Brother by Cory Doctorow

Aki pushed this “young adult” book my way recently, and I liked it because:

  • The story is set in and around San Francisco. As far as I can tell, all the locations mentioned are accurate. This is true for both famous landmarks, and small local-only landmarks in my neighborhood.
  • The computer hacking portions of the story were detailed and realistic, without getting in the way of the story.
  • The topics of privacy, as well as competing state-vs-federal jurisdictions during major emergencies, were all covered in a very informative and readable manner. Not a surprise to find out that the author was Director of European Affairs for the Electronic Frontier Foundation.

Oh, and yes, the story was good too! Thumbs up from me.

(ps: Thanks for the loan, Aki!)

Fennec: now building on Android

Yesterday, Android builds started showing up on Tinderbox and they were green!

Incremental builds are triggered by checkins during the day, full clobber builds run every night, and all builds are available at ftp://ftp.mozilla.org/pub/mozilla.org/mobile. If you have an Android phone, and want to help, you can download these nightly builds and try them out. Note: we still do not have nightly updates for Android figured out, so for now you have to remember to come back and re-download a newer nightly build to see newer fixes.
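If you want to script that “come back and re-download” step while nightly updates are missing, here is a minimal, illustrative sketch (not part of our build automation) that simply lists whatever is currently published under the FTP path above using Python’s ftplib; picking out the newest nightly from the listing is left to you, since the exact directory layout below that path isn’t spelled out here:

```python
# Illustrative only: list what is currently published under the mobile FTP
# path mentioned above, so you can spot a newer nightly by hand.
# Assumes anonymous FTP access to ftp.mozilla.org.
from ftplib import FTP

HOST = "ftp.mozilla.org"
PATH = "pub/mozilla.org/mobile"

ftp = FTP(HOST)
ftp.login()                    # anonymous login
ftp.cwd(PATH)
for name in sorted(ftp.nlst()):
    print(name)                # inspect the listing for the newest build
ftp.quit()
```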

The curious can follow the rest of the Android mechanics work in bug#538524 as it is rolled out to production in digestible chunks – unittests, talos, and release-build-automation are some highlights still being worked on.

It is exciting to see support for this new OS roll into production. Please send encouragement/chocolate/beer to Bear, Armen, Coop, Aki, Vlad and MichaelWu for the tons of behind-the-scenes work they did to make this happen.

Postmortem of tree closure

All trees were closed from Friday morning until mid-day Monday. It’s all been fixed since just before noon PDT on Monday, so if you don’t care about the details, you can stop reading now.

Details for the curious:

  • There was a scheduled IT downtime Thursday night to change VPN configurations between 650castro and MPT. This was to fix the 8-hour disconnect problem in bug#555794.
  • This change caused more frequent (every 3-5 minutes) VPN disconnects between masters in MPT and slaves in 650castro. These disconnects would interrupt builds-in-progress, causing burning builds from late Thursday night.
  • During Friday, several fixes were tried, without success, each of which took time to implement and then verify whether connections were working. By Friday night, IT started reverting back to the old VPN configuration. The revert completed at approx. 11am PDT Saturday morning. However, we were still seeing VPN disconnects and build failures.
  • Over the weekend, RelEng tried two changes to reduce WAN traffic:
    • Set up a new master in Castro so that machines wouldn’t have to traverse the VPN for most test jobs. Migrating machines to the new master took until Monday morning. This does make our WAN traffic more efficient, so we’ll leave this in place anyway.
    • Disconnect fast build machines in Castro from masters in MPT. Disconnecting these machines significantly reduced our build capacity, but it was worth trying to see if it would let us reopen the tree. This made no difference, and once the WAN connection was fixed, we reverted this change.
  • RelEng and IT met Monday morning; Derek changed router configurations so that no VPN was needed for communication between build machines in 650castro and MPT. By bypassing the VPN entirely and wiring part of the WAN circuit directly into the build networks at 650castro and MPT, this router reconfig made the random disconnects go away.
  • This has been holding since mid-day Monday, and patches have been landing as fast as sheriffs could coordinate with developers.
  • We’re not totally out of the woods yet; we’ve seen two cache corruption problems that we didn’t see before. These cause tests to fail because the wrong file is grabbed across the WAN. Details are being tracked in bug#555794.
  • IT are still figuring out what actually went wrong with the VPN/router changes. At this point, we know that the VPN performance problems were related to an IPSec interoperability problem between Juniper and Cisco hardware. Juniper acknowledged the possibility of a bug on Monday. More info in bug#555794 as we find out.
  • Separately, other people in RelEng were adding new machines to production as part of bug#557294. This was unrelated, but it added some confusion to debugging on Friday night / Saturday morning.

Hope all that makes sense. It’s been a rough few days for RelEng and IT, so thanks for the patience. Let me know if there are any questions.

Infrastructure load for April 2010

Summary:

April 2010 had the 2nd-highest number of pushes since we started recording load in Jan 2009; slightly down from last month’s record high. Try Server usage continues to be 1/3 of our load – mozilla-central is 1/3, and all the other project branches combined make up the remaining 1/3.

The numbers for this month are:

  • 1,746 code changes to our mercurial-based repos, which triggered 187,592 jobs:
  • 21,194 build jobs, or ~29 jobs per hour.
  • 82,824 unittest jobs, or ~115 jobs per hour.
  • 83,574 talos jobs, or ~116 talos jobs per hour.

Details:

  • The number of builds we generate per checkin changed this month: we turned off WinMO builds everywhere, and enabled maemo5gtk and maemo5qt builds on specific branches.
  • Our Unittest and Talos load continues to be high, like last month, and we expect this to jump further as more OSes are still being added to Talos.
  • Once we start running Unittests on all the Talos OSes, we expect load to jump again. Once live and green, we’ll disable unittest-on-builders, and I’ll update the math here. In advance of that, we’re spinning up more machines to handle this future spike in load.
  • The trend of “what time of day is busiest” changed again this month. Not sure what this means, but worth pointing out that each month seems to be different. This makes finding a “good” time for a downtime almost impossible.
  • The entire series of these infrastructure load blogposts can be found here.
  • We are still not tracking any l10n repacks, nightly builds, release builds or any “idle-timer” builds in these numbers.

Here’s how the detailed breakdown and the math work out (descriptions of the build, unittest and performance jobs triggered by each individual push are here):
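As a rough, illustrative sanity check of the April numbers quoted above (a sketch only; it assumes the “~N jobs per hour” figures are simply the monthly totals divided by April’s 720 hours):

```python
# Rough sanity check of the April 2010 numbers quoted above.
# Assumption: the per-hour figures are monthly totals divided by the hours
# in April (30 days * 24 hours = 720).
builds, unittests, talos = 21194, 82824, 83574

print(builds + unittests + talos)        # 187592, matching the total jobs triggered

hours_in_april = 30 * 24                 # 720
for label, count in [("build", builds), ("unittest", unittests), ("talos", talos)]:
    print(label, round(count / float(hours_in_april), 1))
# roughly 29.4, 115.0 and 116.1 jobs per hour, matching the ~29/~115/~116 above
```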

Infrastructure load for March 2010

Summary:

March 2010 sets a new record for the number of pushes since we started recording load in Jan 2009. Try Server usage continues to mount, surpassing all other branches again.

The numbers for this month are:

  • 1,971 code changes to our mercurial-based repos, which triggered 214,066 jobs:
  • 23,787 build jobs, or ~32 jobs per hour.
  • 95,493 unittest jobs, or ~128 jobs per hour.
  • 94,786 talos jobs, or ~128 talos jobs per hour.
  • It is interesting to note that for several months now, our load is roughly broken into 3 parts: 1/3 TryServer, 1/3 mozilla-central, and 1/3 all-other-branches-combined.

Details:

  • Our Unittest and Talos load continues to be high, like last month, and we expect this to jump further as more OSes are still being added to Talos.
  • Once we start running Unittests on all the Talos OSes, we expect load to jump again. In advance of that, we’re spinning up more machines to handle this future spike in load.
  • The trend of “what time of day is busiest” changed again this month. Not sure what this means, but worth pointing out that each month seems to be different. This makes finding a “good” time for a downtime almost impossible.
  • The entire series of these infrastructure load blogposts can be found here.
  • We are still not tracking any l10n repacks, nightly builds, release builds or any “idle-timer” builds in these numbers.

Here’s how the detailed breakdown and the math work out (descriptions of the build, unittest and performance jobs triggered by each individual push are here):
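The same rough, illustrative check applied to the March numbers quoted above (again, assuming the per-hour figures are the monthly totals divided by March’s 744 hours):

```python
# Rough sanity check of the March 2010 numbers quoted above.
# Assumption: the per-hour figures are monthly totals divided by the hours
# in March (31 days * 24 hours = 744).
builds, unittests, talos = 23787, 95493, 94786

print(builds + unittests + talos)        # 214066, matching the total jobs triggered

hours_in_march = 31 * 24                 # 744
for label, count in [("build", builds), ("unittest", unittests), ("talos", talos)]:
    print(label, round(count / float(hours_in_march), 1))
# roughly 32.0, 128.4 and 127.4 jobs per hour, close to the ~32/~128/~128 above
```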

UPDATED: thanks to Jesse for spotting a math typo, now fixed. joduinn 05may2010.