John O'Duinn's Soapbox

The Tinderbox waterfall now has two new links at the top.

  • What jobs are ahead of you in the queue? If you landed a patch, but TBPL doesn’t show it running yet, you should look here. This will show you what build/unittest/talos jobs, on what OS, are ahead of you in the queue, and give you an approximate idea of how long you’ll have to wait to get to the top of the queue.
  • What jobs are running right now? Self-explanatory, and has more in-progress details than Tinderbox/TBPL while the job is still running.

For both of these, you can search by changeset, and you can sort on the different column headings. If people find this helpful, please send beer+chocolate to nthomas. :-)

The pace towards the Firefox 4.0 release is definitely picking up. FF4.0beta4 went out today. FF4.0beta5 is right behind it. The fast cadence of betas is one measure of how fast things are moving. Another way to measure how busy developers have been is to look at #checkins:

  • 2,216: so far in August (as-of Sunday night 22nd, and we’ve still about 1/3 of the month to go!!)
  • 1,971: March 2010 (entire month – and our previous record)
  • 1,892: May 2010 (entire month)
  • 1,838: July 2010 (entire month)

For comparison, we used to be able to handle ~1,000 checkins per month back in early 2009.

If you are curious about all the numbers going back to January 2009, see here.

Hexayurt

Trying an experiment this year instead of the usual tent.

Some friends of mine had these last year, and they were great. Obviously, well insulated means warm at night, and cool in the day – all wonderful things at BurningMan. However, they also kept the dust down, and kept the light out, so you could actually get some sleep after the beginning of sunrise.

Let’s see how this experiment goes. So far, we’ve got all the parts cut and taped. We’ve even tried some initial test placements, but haven’t actually put it all together yet. Just in case, we’re still bringing tents from last year – after all, “what could possibly go wrong”!?!

Note:

Stay tuned – I’ll let you know how it went.

Every year at Burning Man, Emergency Services handles a range of incidents. Here’s an infograph showing incident data for the last 3 years, broken down by incident type.

The source data is freely published on afterburn.burningman.com, but I really like how they visualize the data. This layout is immediately familiar to burners and is visually intuitive – more incidents of a specific type == larger area for that type. Click on the thumbnail for a larger version, and spend a few minutes skimming details; it was interesting reading!

The authors (GOOD and Hyperakt) end with “Try not to get flown out by helicopter”!

Excellent advice! :-)

The countdown for Burning Man is well underway, so this infograph was a timely discovery.

Amidst all the other data, the comparison with other large events struck a chord with me. The complexity of logistics at Burning Man makes 50,000 people seem like a lot of people… until you see it alongside Glastonbury Festival (137,000), Woodstock Festival (500,000), the Hajj (1.6 million) and Kumbh Mela (40 million). Wikipedia (being Wikipedia!) has a page listing the largest gatherings in human history – a fascinating read!

Thanks to xmason for putting this infograph together, and to abillings for drawing this to my attention.

Summary:

July 2010 logged 1,838 pushes – very similar to last month’s 1,892 and almost our previous record of 1,971 in January. You can clearly see the drop in load between 3rd and 10th of July, caused by Mozilla Summit 2010. Oh, and yet again, TryServer was still the busiest branch of the entire infrastructure.

Overall load since Jan 2009

The numbers for this month are:

  • 1,838 code changes to our mercurial-based repos, which triggered 233,634 jobs:
  • 35,239 build jobs, or ~47 jobs per hour.
  • 111,603 unittest jobs, or ~150 jobs per hour.
  • 86,792 talos jobs, or ~117 talos jobs per hour.
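
The per-hour figures in the list above are just the monthly totals divided by the hours in the month. Here is a minimal Python sketch of that arithmetic, assuming the average is taken over all 744 hours of a 31-day July and rounded to the nearest whole job. The names are made up for illustration; this is not the script RelEng uses to produce these reports.

```python
# Monthly job totals for July 2010, copied from the list above.
JULY_JOBS = {"build": 35_239, "unittest": 111_603, "talos": 86_792}

# Assumption: the rate is averaged over every hour of the calendar month.
HOURS_IN_JULY = 31 * 24


def jobs_per_hour(total_jobs: int, hours: int) -> float:
    """Average number of jobs per wall-clock hour."""
    return total_jobs / hours


total = sum(JULY_JOBS.values())
print(f"total jobs: {total:,}")
for kind, count in JULY_JOBS.items():
    print(f"{kind}: ~{jobs_per_hour(count, HOURS_IN_JULY):.0f} jobs per hour")
```

The three job types add up to the 233,634 total, and the rounded rates come out at ~47, ~150 and ~117 jobs per hour.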

Infrastructure load by branch

Details:

  • There’s been lots of progress, but we are still double-running unittests for some OS; running unittest-on-builder and also unittest-on-tester. This continues while developers and QA work through the issues. Whenever unittest-on-test-machine is live and green, we disable unittest-on-builders to reduce wait times for builds.
  • The trend of “what time of day is busiest” changed again this month. Not sure what this means, but worth pointing out that each month seems to be different. This makes finding a “good” time for a downtime almost impossible.
  • The entire series of these infrastructure load blogposts can be found here.
  • We are still not tracking any l10n repacks, nightly builds, release builds or any “idle-timer” builds.
  • Anamaria is getting closer to having dashboard reports like this generated automatically – something I’ll rejoice about!

Detailed breakdown is:
#Pushes this month

#Pushes per hour

Here’s how the math works out (descriptions of the build, unittest and performance jobs triggered by each individual push are here):
the math behind the graphs

Firefox3.6.8 was released on Friday 23-jul-2010, at 13:31PST. That was our second time shipping a release inside 24 hours.

From “Dev says go” to “release is now available to public” was 23h 17m wall-clock time. The Release Engineering portion of that was 15h 12m. This was slightly slower than our fastest ever release, FF3.6.6, but still well inside 24 hours from start to finish. For FF3.6.8, the times were:

14:14 22jul: Dev says “go” for FF3.6.8
14:37 22jul: FF3.6.8 builds started
17:04 22jul: FF3.6.8 linux, mac, unsigned-win32 builds handed to QA
19:55 22jul: FF3.6.8 signed-win32 builds handed to QA
03:55 23jul: FF3.6.8 update snippets available on test update channel
09:51 23jul: Dev & QA says “go” for Release; ok to start mirror absorption
10:20 23jul: mirror absorption started
11:17 23jul: mirror absorption good enough for testing
13:06 23jul: website changes finalized and visible. Build given “go” to make updates snippets live.
13:11 23jul: update snippets available on live update channel
13:31 23jul: release announced
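
The 23h 17m wall-clock figure can be checked directly from the first and last entries in the timeline above. A small Python sketch, assuming both stamps are in the same timezone, that the year is 2010 (the stamps themselves don't carry it), and that the lowercase month abbreviations parse under an English locale. The helper names are hypothetical.

```python
from datetime import datetime


def parse_stamp(stamp: str, year: int = 2010) -> datetime:
    """Parse a timeline entry such as '14:14 22jul'; the year is an assumption."""
    return datetime.strptime(f"{stamp} {year}", "%H:%M %d%b %Y")


def wall_clock(start: str, end: str) -> str:
    """Elapsed time between two timeline stamps, formatted like '23h 17m'."""
    minutes = int((parse_stamp(end) - parse_stamp(start)).total_seconds() // 60)
    return f"{minutes // 60}h {minutes % 60:02d}m"


# "Dev says go" to "release announced" for FF3.6.8.
print(wall_clock("14:14 22jul", "13:31 23jul"))  # prints "23h 17m"
```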

Notes:

1) This was an interesting release in that it started off as a super-low urgency just-in-case release, so was being worked on in/around other time-critical housekeeping in progress. Mid-way through, the release was declared a chemspill release, and became top priority for all groups involved. If this release had been declared a chemspill release from the outset, the initial RelEng portions would have been treated as high priority, and FF3.6.8 would have been yet another record-breaking release, even faster than FF3.6.6.

2) As usual, our blow-by-blow scribbles are public, so you can read all the details here or in tracking bug#581165.

Being able to consistently ship releases with such a fast turnaround shows how FF3.6.6 and FF3.6.8 were not unusual – they are the new reality. Not that we want to do that all the time – however, it’s nice to know that we can move fast if we have to. Really really nice.

Thank you
John.

Summary:

June 2010 logged 1,892 pushes – almost our previous record of 1,971 in January. Note this number for June is *under* reporting TryServer usage, as we accidentally lost Try Server usage logs from 01-10june. We assert, without proof, that we would have easily set a new record if we had the missing 10 days of data for TryServer, our busiest branch. Even missing 10-of-30 days of TryServer in June, TryServer was still the busiest branch of the entire infrastructure compared with full month data for other branches.

Overall load since Jan 2009

The numbers for this month are:

  • 1,892 code changes to our mercurial-based repos, which triggered 234,387 jobs:
  • 35,308 build jobs, or ~49 jobs per hour.
  • 111,513 unittest jobs, or ~154 jobs per hour.
  • 87,566 talos jobs, or ~121 talos jobs per hour.
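
One more way to read these totals is as a crude average of jobs triggered per push, across all branches. The sketch below takes the June figures from the list above and the July figures from the July summary. The function name is hypothetical, and the June ratio should be treated as rough, since the June push count is missing ten days of TryServer logs.

```python
# (pushes, total jobs) copied from the June and July 2010 summaries.
MONTHLY = {
    "June 2010": (1_892, 234_387),
    "July 2010": (1_838, 233_634),
}


def jobs_per_push(pushes: int, jobs: int) -> float:
    """Average number of build/unittest/talos jobs per push, all branches combined."""
    return jobs / pushes


for month, (pushes, jobs) in MONTHLY.items():
    print(f"{month}: ~{jobs_per_push(pushes, jobs):.0f} jobs per push")
```

Both months come out at well over a hundred jobs per push, which gives a feel for how much load each individual checkin puts on the infrastructure.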

Infrastructure load by branch

Details:

  • Losing logs for 1/3 of the month for our busiest branch means we are underreporting for June. Hopefully the work catlee/nthomas/anamarias are doing to automate reports will be live soon, to prevent this happening again.
  • Our Unittest and Talos load remains high, like last month, and we expect this to jump further as more OS are still being added to Talos.
  • We’re still double-running unittests for some OS; running unittest-on-builder and also unittest-on-tester while developers and QA work through the issues. Whenever unittest-on-test-machine is live and green, we disable unittest-on-builders to reduce wait times for builds.
  • The trend of “what time of day is busiest” changed again this month. Not sure what this means, but worth pointing out that each month seems to be different. This makes finding a “good” time for a downtime almost impossible.
  • The entire series of these infrastructure load blogposts can be found here.
  • We are still not tracking any l10n repacks, nightly builds, release builds or any “idle-timer” builds.

Detailed breakdown is:
#Pushes this month

#Pushes per hour

Here’s how the math works out (descriptions of the build, unittest and performance jobs triggered by each individual push are here):
the math behind the graphs

[UPDATE: thanks to jhford for catching some copy-paste typos! joduinn 15-jul-2010]

05-apr-2007: Launch of Thunderbird 2.0
08-dec-2009: Launch of Thunderbird 3.0
24-jun-2010: Launch of Thunderbird 3.1
25-jun-2010: Formal “ok to poweroff” Thunderbird 2.0 machines
29-jun-2010: Thunderbird 2.0 machines are finally offline

Powering off these machines is a massive milestone for RelEng, because:
1) These are the last cvs-based machines being used in production. Because these were cvs-based, they had a really early version of release-automation, which means it’s a great relief to not have to do any more TB2.0.0.x releases.
2) These are also the last of the dedicated-unique machines – everything else is using our shared pool-o-slaves infrastructure. It’s a great relief to not have to worry about keeping spare long-since-discontinued PPC xserves around in case an old dedicated-unique machine dies in production and closes the tree without warning.

Thunderbird 3.1.x work continues full pace, over on hg. :-) You can get more details here. As usual, any remaining users on TB2 can major update to TB3.1 simply by doing “Help->CheckForUpdates”.

NOTE: if you think you need any of these machines for something else, please comment in bug#574901. Now. Right now! Before I reach my trusty axe.

tc
John.

As most of you already know, Firefox3.6.6 was released on Saturday 26-jun-2010, at 20:42PST. However, did you know this was our fastest ever turnaround on a Firefox release? That was our first time shipping a release inside a 24-hour day.

From “Dev says go” to “release is now available to public” was 22h 33m wall-clock time. The Release Engineering portion of that was 10h 15m. By comparison, our previous fastest release turnaround was FF3.5.5 (3d 4h 45m from start to finish, with Release Engineering taking 13-16 hours). For FF3.6.6, the times were:

22:09 25jun: Dev says “go” for FF3.6.6
22:18 25jun: FF3.6.6 builds started
00:17 26jun: FF3.6.6 linux, mac, unsigned-win32 builds handed to QA
02:20 26jun: FF3.6.6 signed-win32 builds handed to QA
07:40 26jun: FF3.6.6 update snippets available on test update channel
17:15 26jun: Dev & QA says “go” for Release; Build already completed final signing, bouncer entries
17:35 26jun: mirror replication started
18:00 26jun: mirror absorption good enough for testing
20:30 26jun: website changes finalized and visible. Build given “go” to make updates snippets live.
20:32 26jun: update snippets available on live update channel
20:42 26jun: release announced
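
To see where the 22h 33m went, the gaps between consecutive milestones in the timeline above can be computed mechanically. A rough Python sketch, with the milestone labels shortened by me, the year assumed to be 2010, and all stamps assumed to be in the same timezone. The helper names are made up for illustration.

```python
from datetime import datetime

# Milestones copied from the FF3.6.6 timeline above (labels shortened).
MILESTONES = [
    ("22:09 25jun", "Dev says go"),
    ("22:18 25jun", "builds started"),
    ("00:17 26jun", "linux, mac, unsigned-win32 builds to QA"),
    ("02:20 26jun", "signed-win32 builds to QA"),
    ("07:40 26jun", "update snippets on test channel"),
    ("17:15 26jun", "Dev & QA say go for release"),
    ("17:35 26jun", "mirror replication started"),
    ("18:00 26jun", "mirrors good enough for testing"),
    ("20:30 26jun", "website changes live"),
    ("20:32 26jun", "update snippets on live channel"),
    ("20:42 26jun", "release announced"),
]


def parse_stamp(stamp, year=2010):
    """Parse a stamp such as '22:09 25jun'; the year is an assumption."""
    return datetime.strptime(f"{stamp} {year}", "%H:%M %d%b %Y")


def stage_gaps(milestones):
    """Return (label, minutes since the previous milestone) for every step after the first."""
    times = [parse_stamp(stamp) for stamp, _ in milestones]
    return [
        (label, int((now - prev).total_seconds() // 60))
        for (_, label), prev, now in zip(milestones[1:], times, times[1:])
    ]


for label, minutes in stage_gaps(MILESTONES):
    print(f"{minutes // 60:2d}h {minutes % 60:02d}m  {label}")
```

The gaps sum to the same 22h 33m total, and the longest single one is the 9h 35m between update snippets landing on the test channel and the “go” for release.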

Notes:

1) This is an awesome new record for the fastest Firefox release since I started recording wall-clock times. It’s even more awesome when you add complications like:
* it was a firedrill release we had no advance warning about.
* it started late Friday night – the worst possible time in terms of RelEng’s almost-global timezone coverage.
* lack-of-responsiveness from external partners delayed verification of fix by a few hours.

2) As usual, our blow-by-blow scribbles are public, so you can read all the details here or in tracking bug#574906.

This super-super fast release turnaround showed how the ongoing release-automation work continues to improve times – and also how well the teams worked together on this, including the smooth handoffs back-and-forth across timezones!

Awesome. Truly awesome.

Thank you
John.