End of an era – no more Thunderbird 2.0 machines, or cvs machines!

05-apr-2007: Launch of Thunderbird 2.0
08-dec-2009: Launch of Thunderbird 3.0
24-jun-2010: Launch of Thunderbird 3.1
25-jun-2010: Formal “ok to poweroff” Thunderbird 2.0 machines
29-jun-2010: Thunderbird 2.0 machines are finally offline

Powering off these machines is a massive milestone for RelEng, because:
1) These are the last cvs-based machines being used in production. Because these were cvs-based, they had a really early version of release-automation, which means it’s a great relief not to have to do any more TB2.0.0.x releases.
2) These are also the last of the dedicated-unique machines – everything else is using our shared pool-o-slaves infrastructure. It’s a great relief not to have to worry about keeping spare long-since-discontinued PPC xserves around in case an old dedicated-unique machine dies in production and closes the tree without warning.

Thunderbird 3.1.x work continues at full pace, over on hg. 🙂 You can get more details here. As usual, any remaining users on TB2 can do a major update to TB3.1 simply by doing “Help->CheckForUpdates”.

NOTE: if you think you need any of these machines for something else, please comment in bug#574901. Now. Right now! Before I reach for my trusty axe.

tc
John.

Firefox 3.6.6 by the (wall-clock) numbers

As most of you already know, Firefox 3.6.6 was released on Saturday 26-jun-2010, at 20:42 PST. However, did you know this was our fastest-ever turnaround on a Firefox release? It was our first time shipping a release within a single 24-hour day.

From “Dev says go” to “release is now available to the public” was 22h 33m wall-clock time. The Release Engineering portion of that was 10h 15m. By comparison, our previous fastest release turnaround was FF3.5.5 (3d 4h 45m from start to finish, with Release Engineering taking 13-16 hours). For FF3.6.6, the times were:

22:09 25jun: Dev says “go” for FF3.6.6
22:18 25jun: FF3.6.6 builds started
00:17 26jun: FF3.6.6 linux, mac, unsigned-win32 builds handed to QA
02:20 26jun: FF3.6.6 signed-win32 builds handed to QA
07:40 26jun: FF3.6.6 update snippets available on test update channel
17:15 26jun: Dev & QA say “go” for release; Build had already completed final signing and bouncer entries
17:35 26jun: mirror replication started
18:00 26jun: mirror absorption good enough for testing
20:30 26jun: website changes finalized and visible. Build given “go” to make update snippets live.
20:32 26jun: update snippets available on live update channel
20:42 26jun: release announced
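(For the curious, here’s a quick sanity check of the end-to-end arithmetic. This is just my own back-of-the-envelope sketch using the timestamps above; it is not part of the release automation.)

```python
# Back-of-the-envelope check of the FF3.6.6 wall-clock numbers above
# (all times PST, taken from the timeline in this post).
from datetime import datetime

dev_go = datetime(2010, 6, 25, 22, 9)      # "Dev says go"
announced = datetime(2010, 6, 26, 20, 42)  # "release announced"

elapsed = announced - dev_go
hours, rem = divmod(int(elapsed.total_seconds()), 3600)
print(f"{hours}h {rem // 60}m end to end")  # 22h 33m -> inside a single 24-hour day
```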

Notes:

1) This is an awesome new record for the fastest Firefox release since I started recording wall-clock times. It’s even more awesome when you add complications like:
* it was a firedrill release we had no advance warning about.
* it started late Friday night – the worst possible time in terms of RelEng’s almost-global timezone coverage.
* lack of responsiveness from external partners delayed verification of the fix by a few hours.

2) As usual, our blow-by-blow scribbles are public, so you can read all the details here or in tracking bug#574906.

This super-super fast release turnaround showed how the ongoing release-automation work continues to improve times – and also how well the teams worked together on this, including the smooth handoffs back-and-forth across timezones!

Awesome. Truly awesome.

Thank you
John.

End of an era – no more Firefox3.0 machines

  • 17-jun-2008: Launch of Firefox3.0
  • 30-jun-2009: Launch of Firefox3.5
  • 30-dec-2009: End-of-life for Firefox3.0
  • 22-jun-2010: Formal “ok to poweroff” Firefox3.0 machines
  • 23-jun-2010: Firefox 3.0 machines are finally offline.

Personally, I have mixed feelings here – some of these are machines that rhelmer and I set up when I was first joining Mozilla. But probably the biggest feeling for me is relief! Removing these cvs-based, dedicated-unique machines will simplify the support work of RelEng and IT.

We’re recycling the physical hardware where possible, and recycling the VMs (in the meta-physical-bits sense) into new builder VMs in the pool-o-slaves, where they can be more useful. All good, and of course you can follow the details in bug#554226.

The King is dead. Long live the King!!

Welcome to Anamaria!

Please join me in welcoming Anamaria to RelEng as an intern.

She just arrived in California over the weekend, and today was her first day at the office! She’ll be working with Catlee, nthomas (and with local assist from lsblakk) on using python+SQLAlchemy to generate some interesting dashboards for the live-to-the-second status of RelEng infrastructure. If her work here, here or here is anything to go by, she’ll rock the place. 🙂
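(To give a flavour of what “python+SQLAlchemy dashboards” might look like, here is a minimal made-up sketch – not her actual project code. The jobs table, its columns and the connection string are all hypothetical.)

```python
# Hypothetical sketch of a dashboard-style query over a "jobs" table;
# table name, columns and connection string are all made up for illustration.
from sqlalchemy import (Column, DateTime, Integer, MetaData, String, Table,
                        create_engine, func, select)

engine = create_engine("sqlite:///status.db")  # placeholder database
metadata = MetaData()

jobs = Table(
    "jobs", metadata,
    Column("id", Integer, primary_key=True),
    Column("builder", String),     # e.g. "linux-opt", "win32-debug"
    Column("status", String),      # e.g. "pending", "running", "complete"
    Column("started_at", DateTime),
)
metadata.create_all(engine)

# A few fake rows so the query has something to report on.
with engine.begin() as conn:
    conn.execute(jobs.insert(), [
        {"builder": "linux-opt", "status": "complete"},
        {"builder": "linux-opt", "status": "running"},
        {"builder": "win32-debug", "status": "pending"},
    ])

# Count jobs per builder and status - the raw numbers a dashboard would chart.
per_builder = (
    select(jobs.c.builder, jobs.c.status, func.count().label("n_jobs"))
    .group_by(jobs.c.builder, jobs.c.status)
)
with engine.connect() as conn:
    for builder, status, n_jobs in conn.execute(per_builder):
        print(f"{builder:<12} {status:<10} {n_jobs}")
```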

You can follow along on her blog – and of course, if you are in 650Castro, or in #build, please do stop by and say hello.

Burning Man Film Festival, San Francisco

The Burning Man Film Festival was in the Red Vic theatre on Haight Street this weekend; I almost missed it, but stopped by tonight to watch a few hours of assorted short films. This was a good way for me to remember the sights and sounds of it all – and of course, there was the inevitable mix of funny, sad, strange and very personal stories.

One story that struck me particularly was “Burn on the Bayou” (video: http://www.youtube-nocookie.com/v/6vMmNe0-hSA), about Burning Man 2005, when Hurricane Katrina hit New Orleans and the rest of the Gulf Coast. This brought memories flooding back of people leaving Burning Man as news of the destruction spread: some driving all the way from Burning Man in Nevada, some flying to the nearest still-working airport and then figuring something out; some were trained disaster professionals going to do what they’d been trained for, some were just going because they had to do something to help. Most ended up living there for months – one person from my camp moved there for a few years – and this became the start of Burners Without Borders.

Five years later, the reconstruction continues. There are still Burners Without Borders volunteers helping along the Gulf, now facing the new problems caused by the BP oil spill. There are also Burners Without Borders volunteers in Haiti and other locations. If you are able to donate time or equipment or money, check out their website; these are hardworking folks in very trying circumstances making a difference each and every day.

Please be gentle with the shiny new TryServer

Since the awesome redesign of TryServer that Lukas did recently, we’ve seen the load on TryServer jump significantly as people use it more. Since the start of the year, TryServer had been getting ~15-35 pushes per day. Last week, TryServer had 91 pushes on Thursday, and 137 pushes on Friday.

That’s great proof of how much people like the new TryServer – cool – we want people to love it, and use it. After all, we made it better because we hoped developers would find it even more valuable. We even added extra machines to handle possible growth throughout the rest of the year. However, the jump in usage over the last few weeks is already approaching the extra growth we’d predicted for the *year*.

We still want people to continue using TryServer, but from a scan through TryServer usage logs we’ve found two things that everyone can do to help avoid accidentally wasting TryServer capacity:

1) If you want to rerun a specific test suite on a specific build on TryServer, please DO NOT re-push the same source code to TryServer again because:

  • resubmitting source code tells TryServer to build new binaries and then rerun all tests on the new binaries, which may defeat the entire purpose of retesting!
  • you have to wait significantly longer for results, because it has to recompile everything first.
  • this will run a bunch of build/unittest/talos jobs you don’t care about – well over 65 hours of computer time *per* push doing 32bit+64bit builds, opt+debug builds, unittests, talos, etc – which ties up TryServer slaves that could be doing work for other people.

Instead, it is much faster and more efficient to file a bug in mozilla.org:ReleaseEngineering and have the RelEng buildduty person manually trigger just the test suites you want on your *existing* build. This is an interim step until we have a self-service UI for this. That same UI will also let you choose what you want run for your job on TryServer, an improvement over the current situation where we default to running everything.

2) Consider refreshing your local repo to the latest changeset *before* pushing to TryServer. If your patch works as you hoped, you can then ask to land it on mozilla-central. By contrast, if you were based on a 6-week-old repo when you pushed to TryServer, and everything worked, it doesn’t really tell you much. You’re 6 weeks out of date – who knows what other code has changed in 6 weeks – so you’ll still have to refresh your local repo forward, and resubmit to TryServer *again* anyway, to make sure your code is valid before asking to land on mozilla-central.
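(If it helps, here’s a tiny script – my own illustration, not an official RelEng tool – for checking how stale your local repo is before you push to try. The mozilla-central URL and the messages are just examples.)

```python
# Quick, unofficial check of how far behind mozilla-central your local repo is;
# run it from inside your local clone. URL is illustrative.
import subprocess

UPSTREAM = "http://hg.mozilla.org/mozilla-central/"

result = subprocess.run(
    ["hg", "incoming", "-q", UPSTREAM],   # one line per changeset you don't have yet
    capture_output=True, text=True,
)
behind = len(result.stdout.splitlines())
if behind:
    print(f"Your repo is {behind} changesets behind {UPSTREAM}.")
    print("Consider 'hg pull -u' (and rebasing your patch) before pushing to try.")
else:
    print("Up to date - push away.")
```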

Both of these changes could really help – every CPU cycle saved helps!!

Infrastructure load for May 2010

Summary:

May 2010 had lower load compared to recent months, but Try Server usage continues to grow quickly, jumping from 33% to 44% of overall load in one month.

The numbers for this month are:

  • 1,584 code changes to our mercurial-based repos, which triggered 164,380 jobs:
  • 18,510 build jobs, or ~25 jobs per hour.
  • 69,702 unittest jobs, or ~94 jobs per hour.
  • 76,168 talos jobs, or ~102 talos jobs per hour.
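(The per-hour figures above are just the monthly totals averaged over the 744 hours in May; here’s the back-of-the-envelope math, as my own sketch rather than anything official.)

```python
# Per-hour arithmetic for the May 2010 totals above, averaged over the
# whole month (31 days = 744 hours).
HOURS_IN_MAY = 31 * 24

jobs = {"build": 18510, "unittest": 69702, "talos": 76168}

print("total jobs:", sum(jobs.values()))          # 164380
for kind, count in jobs.items():
    print(f"{kind:>8}: ~{count / HOURS_IN_MAY:.0f} jobs per hour")
    # build ~25, unittest ~94, talos ~102
```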

Details:

  • The number of builds we generate per checkin changed throughout the month: linux64, 10.6 64bit, win64 and android came online at different dates. To be conservative, I’m omitting these from this month, and I’ll track all these new OS in June.
  • Our Unittest and Talos load continues to be high, like last month, and is increasing as we continue to add more OS to Talos. Again, being conservative, those part-month numbers are excluded from this month.
  • Running Unittests on all the Talos OS is still progressing. Getting unittests running green on the new Talos OS was tough. However, it’s proven harder than we’d like to get unittest-on-builders turned off, as we keep finding new surprise dependencies that need reworking. Until these are all resolved, we continue to be in the worst-case situation of running unittest-on-builders AND unittest-on-test-minis. Once we finally disable unittest-on-builders, I’ll update the math here. We added 160 more machines to help with this load, but this continues to cause us significant load issues daily.
  • The trend of “what time of day is busiest” changed again this month. Not sure what this means, but it’s worth pointing out that each month seems to be different. This makes finding a “good” time for a downtime almost impossible.
  • The entire series of these infrastructure load blogposts can be found here.
  • We are still not tracking any l10n repacks, nightly builds, release builds or any “idle-timer” builds in these numbers.

Detailed breakdown is:

Here’s how the math works out (descriptions of the build, unittest and performance jobs triggered by each individual push are here):

Towel Day 2010 @ Mozilla

Towel Day is held in memory of Douglas Adams (1952 – 2001). This year, Tiffney arranged a lunchtime reading of The Hitchhiker’s Guide to the Galaxy for Towel Day.

In between work meetings and interrupts at the office, it was really great to take time to pause, sit, re-read portions of these oh-so-familiar books, and this time read them out loud to others. The recipe for Pan Galactic Gargle Blasters and the “S.E.P. at Lord’s Cricket Ground” are personal favourites, but it’s all great. Also, for the rest of the entire day, it was fun to see someone walk by with a towel casually slung over their shoulder, and know they were also fans. Mozilla being Mozilla, there were several people who saw my towel and bathrobe and instantly said “oh no, I forgot it was Douglas Adams Towel Day”, and only one person who stopped me and asked quizzically – “ummm… are you wearing a bathrobe?!”

Big thanks to Tiffney for making this happen!