John O'Duinn's Soapbox

Just another WordPress weblog
  • 17-jun-2008: Launch of Firefox3.0
  • 30-jun-2009: Launch of Firefox3.5
  • 30-dec-2009: End-of-life for Firefox3.0
  • 22-jun-2010: Formal “ok to poweroff” Firefox3.0 machines
  • 23-jun-2010: Firefox 3.0 machines are finally offline.

Personally, I have mixed feelings here – some of these are machines that rhelmer and I set up as I was first joining Mozilla. But probably the biggest feeling for me is relief! Removing these cvs-based, dedicated-unique machines will simplify the support work of RelEng and IT.

We’re recycling the physical hardware where possible, and the VMs in the meta-physical-bits-sense into new builder VMs in the pool-o-slaves where they can be more useful. All good, and of course you can follow details in bug#554226.

The King is dead. Long live the King!!

Please join me in welcoming Anamaria to RelEng as an intern.

She just arrived in California over the weekend, and today was her first day at the office! She’ll be working with Catlee, nthomas (and with local assist from lsblakk) on using python+SQLAlchemy to generate some interesting dashboards for the live-to-the-second status of RelEng infrastructure. If her work here, here or here is anything to go by, she’ll rock the place. :-)

You can follow along on her blog – and of course, if you are in 650Castro, or in #build, please do stop by and say hello.

Each time we do this, we find new systems added, new OS and tests added, and get new feedback on how to make this better, so updating this again. As usual, if you have any comments about this PDF, please let me know. :-)

The Burning Man Film Festival was in the Red Vic theatre on Haight Street this weekend; I almost missed it, but stopped by tonight to watch a few hours of assorted short films. This was a good way for me to remember the sights and sounds of it all – and of course, there was the inevitable mix of funny, sad, strange and very personal stories.

One story that struck me particularly was “Burn on the Bayou” about Burning Man 2005, when Hurricane Katrina hit New Orleans and the rest of the Gulf Coast. This brought memories flooding back of people leaving Burning Man as news of the destruction spread; some driving all the way from Burning Man in Nevada, some flying to the nearest still-working airport, then figuring out something; some were trained disaster professionals going to do what they’d been training for, some were just going because they had to do something to help. Most ended up living there for months – one person from my camp moved there for a few years – and this became the start of Burners Without Borders.

Five years later, the reconstruction continues. There are still Burners Without Borders volunteers helping along the Gulf, now facing the new problems caused by the BP oil spill. There are also Burners Without Borders volunteers in Haiti and other locations. If you are able to donate time or equipment or money, check out their website; these are hardworking folks in very trying circumstances making a difference each and every day.

Since the awesome redesign of TryServer that Lukas did recently, we’ve seen the load on TryServer jump significantly as people use it more. Since the start of the year, TryServer had ~15-35 pushes per day. Last week, TryServer had 91 pushes on Thursday, and 137 pushes on Friday.

That’s great proof of how much people like the new TryServer – cool – we want people to love it, and use it. After all, we made it better because we hoped developers would find it even more valuable. We even added extra machines to handle possible growth throughout the rest of the year. However, the recent jump in usage over the last few weeks is already approaching the extra growth we’d predicted for the *year*.

We still want people to continue using TryServer, but from a scan through TryServer usage logs we’ve found two things that everyone can do to help avoid accidentally wasting TryServer capacity:

1) If you want to rerun a specific test suite on a specific build on TryServer, please DO NOT re-push the same source code to TryServer again because:

  • resubmitting source code tells TryServer to build new binaries and then rerun all tests on the new binaries, which may defeat the entire purpose of retesting!
  • you have to wait significantly longer for results, because it has to recompile everything first.
  • this will run a bunch of build/unittest/talos jobs you don’t care about – well over 65 hours of computer time *per* push doing 32bit+64bit builds, opt+debug builds, unittests, talos, etc – which ties up those TryServer slaves from doing work for other people.

Instead, it is much faster, and more efficient, to file a bug in mozilla.org:ReleaseEngineering and have the RelEng buildduty person manually trigger just the test suites you want on your *existing* build. This is an interim step until we have a self-service UI for this. This same UI will also allow you to choose what you want run for your job on TryServer, an improvement over the current situation where we default to running everything.

2) Consider refreshing your local repo to the latest changeset *before* pushing to TryServer. If your patch works as you hoped, you can then ask to land it on mozilla-central. By contrast, if you were based on a 6-week-old repo when you pushed to TryServer, and everything worked, it doesn’t really tell you much. You’re 6 weeks out of date – who knows what other code has changed in 6 weeks – so you’ll still have to refresh your local repo forward, and resubmit to TryServer *again* anyway to make sure your code is valid before asking to land on mozilla-central.

Both of these changes could really help – every CPU cycle saved helps!!
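For a rough sense of scale, here is a minimal sketch (Python, chosen only for illustration) that multiplies last week's push counts by the 65-hour figure above. The real cost is "well over" 65 hours per push, so treat the results as lower bounds. The names `HOURS_PER_PUSH` and `machine_hours` are made up for this example and are not part of any RelEng tooling.

```python
# Figures quoted in this post. 65 hours is a lower bound
# ("well over 65 hours of computer time per push").
HOURS_PER_PUSH = 65
pushes = {"Thursday": 91, "Friday": 137}


def machine_hours(push_count, hours_per_push=HOURS_PER_PUSH):
    """Lower-bound estimate of computer time consumed by full pushes."""
    return push_count * hours_per_push


for day, count in pushes.items():
    hours = machine_hours(count)
    # Dividing by 24 gives the number of machines that would be
    # busy around the clock just to absorb that one day's pushes.
    print(f"{day}: {count} pushes -> at least {hours} machine-hours "
          f"(~{hours / 24:.0f} machines busy all day)")
```

By this estimate, every re-push avoided frees up at least 65 machine-hours for somebody else's jobs.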

Summary:

May 2010 had lower load compared to recent months, but Try Server usage continues to grow quickly, jumping from 33% to 44% of overall load in one month.

Overall load since Jan 2009

The numbers for this month are:

  • 1,584 code changes to our mercurial-based repos, which triggered 164,380 jobs:
  • 18,510 build jobs, or ~25 jobs per hour.
  • 69,702 unittest jobs, or ~94 jobs per hour.
  • 76,168 talos jobs, or ~102 talos jobs per hour.
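The per-hour rates above can be reproduced with a few lines of Python. This is only a sketch, and it assumes the rates are simple averages over all 31 days of May (744 hours); the variable names are illustrative.

```python
# Job counts quoted above for May 2010.
jobs = {"build": 18_510, "unittest": 69_702, "talos": 76_168}

# Assumption: rates are averaged over every hour of the 31-day month.
HOURS_IN_MAY = 31 * 24

total_jobs = sum(jobs.values())
per_hour = {kind: round(count / HOURS_IN_MAY) for kind, count in jobs.items()}

print(total_jobs)  # 164380, matching the total quoted above
print(per_hour)    # {'build': 25, 'unittest': 94, 'talos': 102}
```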

Infrastructure load by branch

Details:

  • The number of builds we generate per checkin changed throughout the month: linux64, 10.6 64bit, win64 and android came online at different dates. To be conservative, I’m omitting these from this month, and I’ll track all these new OS in June.
  • Our Unittest and Talos load continues high, like last month, and is increasing as we continue to add more OS to Talos. Again, being conservative, those part-month numbers are excluded from this month.
  • Running Unittests on all the Talos OS is still progressing. Getting unittests running green on the new Talos OS was tough. However, it’s proven harder than we’d like to get the unittest-on-builders turned off, as we keep finding new surprise dependencies that need reworking. Until these are all resolved, we continue to be in the worst-case situation of running unittest-on-builders AND unittest-on-test-minis. Once we finally disable unittest-on-builders, I’ll update the math here. We added 160 more machines to help with this load, but this continues to cause us significant load issues daily.
  • The trend of “what time of day is busiest” changed again this month. Not sure what this means, but worth pointing out that each month seems to be different. This makes finding a “good” time for a downtime almost impossible.
  • The entire series of these infrastructure load blogposts can be found here.
  • We are still not tracking any l10n repacks, nightly builds, release builds or any “idle-timer” builds.

Detailed breakdown is:
#Pushes this month

#Pushes per hour

Here’s how the math works out (descriptions of build, unittest and performance jobs triggered by each individual push are here):
the math behind the graphs

Towel Day is held in memory of Douglas Adams (1952 – 2001). This year, Tiffney arranged a lunchtime reading of The Hitchhiker’s Guide to the Galaxy for Towel Day.

In between the work meetings and interrupts at the office, it was really great to take time to pause, sit, re-read portions of these oh-so-familiar books, and this time read them out loud to others. The recipe for Pan Galactic Gargle Blasters and also the “S.E.P. at Lord’s Cricket Ground” are personal favourites, but it’s all great. Also, for the rest of the entire day, it was fun to see someone walk by with a towel casually slung over their shoulder, and know they were also fans. Mozilla being Mozilla, there were several people who saw my towel and bathrobe who instantly said “oh no, I forgot it was Douglas Adams Towel day”, and only one person who stopped me and asked quizzically – “ummm….are you wearing a bathrobe”!?!

Big thanks to Tiffney for making this happen!

Armen was delighted to see this today. So was I.

This is still just a very very very early experimental build, but with that disclaimer, if you want to try using it, you can get it here. There are lots of open questions about compiler versions, toolchain, mozconfig settings, etc, which need to be figured out before we start imaging up infrastructure like we have done for OSX10.6 64bit and linux64. However, it is great to finally see some concrete visible progress after all the work put into this.

Many thanks to Makoto, dmandelin and danderson for their help to Armen so far. We’ve a long way still to go, so any help would be GREAT! If you are interested in helping, or just following the blow-by-blow details, have a look at bug#558448.

1st June 2010 will be a big day for RelEng. In addition to the FF3.6.4 and FF3.5.10 releases, we’ll also finally be able to power off the FF3.0 infrastructure. The CVS-based machines listed in bug#554226 have been supported in production for over 3.5 years, so we’ll be sad (and happy!) to see them go.

After all those years, it’s quite possible that people are relying on those machines in ways we do not even know about. Hence this widespread notice. If you have any reasons these Firefox 3.0 machines should be left running, please let us know by commenting in bug#554226. After we power off these machines, they can be restored from tape backup if needed, but doing that is non-trivial, so should only be considered an extreme last resort.

    What will change:

  • No FF3.0.x incremental/depend/hourly builds will be produced.
  • No FF3.0.x clobber/nightly builds will be produced.
  • No FF3.0.x release builds will be produced.
  • The FF3.0 waterfall page will be removed from tinderbox. Specifically, this page http://tinderbox.mozilla.org/showbuilds.cgi?tree=Firefox3.0 will go away as it will be empty.

    What will *not* change:

  • Existing FF3.0.x builds would still be available for download from http://ftp.mozilla.org/pub/mozilla.org/firefox/releases/
  • Existing update offers would still be available. For example:
    • FF3.0.14 users can still update to FF3.0.19.
    • FF3.0.19 users can still update to latest FF3.6.x release (which is FF3.6.3 as of this writing).
  • Newly revised major update offers, like from FF3.0.19 -> a future FF3.6.9 release, could still be produced as needed (because these are produced on the FF3.6.x infrastructure, not on the powered off FF3.0 infrastructure.)
  • Any mozilla-1.9.0 machines which are not Firefox specific should continue to run as usual.

    Why do this:

  • Reuse some of these machines over to production pool-of-slaves or try pool-of-slaves, where there is more demand.
  • Reduce manual support workload and systems complexity for RelEng and IT.
  • Allows us to speed up making changes to infrastructure code, as there’s now no longer a need to special-case and retest FF3.0 specific situations. As soon as we power off the Thunderbird2.0 machines, we can stop having to support both cvs *and* Mercurial throughout build automation.
  • For the curious, Mozilla’s 6-month End-of-Life support policy can be seen here (https://wiki.mozilla.org/ReleaseRoadmap) and is also mentioned on the Firefox “all-older” download page here: http://www.mozilla.com/en-US/firefox/all-older.html

If you have any reasons that these Firefox3.0 machines should continue running, please comment in bug#554226. Now.

Yes, really.

Now.

Thanks
John.

Aki pushed this “young adult” book my way recently, and I liked it because:

  • The story is set in and around San Francisco. As far as I can tell, all the locations mentioned are accurate. This is true for both famous landmarks, and small local-only landmarks in my neighborhood.
  • The computer hacking portions of the story were detailed and realistic, without getting in the way of the story.
  • The topics of privacy, as well as competing state-vs-federal jurisdictions during major emergencies, were all covered in a very informative and readable manner. Not a surprise to find out that the author was Director of European Affairs for the Electronic Frontier Foundation.

Oh, and yes, the story was good too! Thumbs up from me.

(ps: Thanks for the loan, Aki!)