John O’Duinn’s Soapbox

Thu, 26-Jun-08

De-tangling timestamps: part2

Filed under: Mozilla — John @ 16:13:35 PST, 26-Jun-08 (Thu)

Yesterday, Alice de-tangled one more part of the messy time stamp problem by fixing bug#419487.

All data points for time on Talos and graph server now use “build time”, not “testrun time”.

This will greatly simplify a lot of manual regression triage work for people looking at performance graphs on graph server. Now, if you are tracking down what change caused a perf regression:

  • Before 8am PDT Wednesday, 25 June 2008, all charts use “testrun time”. This means debugging a regression requires manually padding the regression range multiple hours wider: enough to cover the time from the start of the build, through the queue, to the job starting on an available slave. Different operating systems take different amounts of time, and any machine hiccups complicate this padding guesswork further. If you get it wrong, you can incorrectly rule out bad changes, so pad out more than you think you need. It means extra triage work, but is safer.
  • After 8am PDT Wednesday, 25 June 2008, all charts use “build time”. We’re still fixing other problems with timestamps, so you still need *some* extra range padding, but much less padding than before. At most, manually pad out to the next/previous hour. The need for this padding should go away once the BuildID changes in bug#431270 and bug#431905 have landed.
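For anyone who prefers to see the padding rule written down, here is a minimal sketch in Python. It is not part of Talos or graph server. The function names are made up for illustration, and the 4 hour padding for “testrun time” data is an assumed stand-in for “multiple hours”. Times are in UTC, so the 8am PDT cutover becomes 15:00 UTC.

```python
from datetime import datetime, timedelta

# Cutover from the post: 8am PDT on Wednesday, 25 June 2008 (PDT is UTC-7).
CUTOVER_UTC = datetime(2008, 6, 25, 15, 0)

# Assumption: the post only says "multiple hours"; 4 is an illustrative value.
TESTRUN_PADDING = timedelta(hours=4)


def floor_to_hour(t):
    """Round a timestamp down to the start of its hour."""
    return t.replace(minute=0, second=0, microsecond=0)


def ceil_to_hour(t):
    """Round a timestamp up to the next full hour (unchanged if already on the hour)."""
    floored = floor_to_hour(t)
    return floored if floored == t else floored + timedelta(hours=1)


def padded_regression_range(last_good, first_bad):
    """Widen a (last_good, first_bad) range read off a graph, both in UTC.

    Points before the cutover are "testrun time", so pad generously.
    Points after the cutover are "build time", so pad to the hour.
    """
    if last_good < CUTOVER_UTC:
        start = last_good - TESTRUN_PADDING
    else:
        start = floor_to_hour(last_good)

    if first_bad < CUTOVER_UTC:
        end = first_bad + TESTRUN_PADDING
    else:
        end = ceil_to_hour(first_bad)

    return start, end
```

Padding both ends of a “testrun time” range is deliberately conservative; the start of the range is the end that matters most, because the test always runs some time after the build was created.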

Anyone curious for details should read Alice’s recent post to mozilla.dev.builds and mozilla.dev.performance (“change in talos time stamps (as of 8am PDT June 25th 2008)”), bug#291167, bug#417633 and bug#419487. This is a continuation of the work described (in tedious detail!) in my previous blog post.

This sounds like a small simple change, but it was not. It’s a tricky, complex area in the infrastructure, with lots relying on it, and lots of different people with different assumptions about how time is used here. There was lots of behind-the-scenes homework on this, and to avoid causing any confusion, we held off landing this until after Firefox3 shipped.

Tip of the hat to Alice for pushing this through to production so smoothly.

Mon, 23-Jun-08

We take our dress code seriously…

Filed under: Mozilla — John @ 22:28:31 PST, 23-Jun-08 (Mon)

…as Armen discovered when he decided to *not* wear a Hawaiian shirt to the office.

Armen surprised by Mozilla fashion police

Tue, 17-Jun-08

“netapp woes, and bug#435134” FIXED! FIXED! FIXED!

Filed under: Mozilla — John @ 13:57:26 PST, 17-Jun-08 (Tue)

While everyone else is talking about the Firefox3 release, here’s a behind the scenes story just before the Firefox3 release.

Followup to my earlier post about netapp woes. For the 21 days since we filed bug#435134, we’ve been struggling to keep trees open and various machines up and running. What was happening?

Technical details of the root causes are here.

From our point of view here in RelEng, as “users” of VMware/NetApp, we would see 8-12 of our 84 VMs randomly lock up at the same instant, for 15-45 seconds at a time.

The first couple of times, we incorrectly thought this hiccup was caused by network outages. However, each time IT confirmed the network was healthy and we never had any way to track the problem, so we’d repair the VMs and just move on. Looking back, I’ve found bug#435052, bug#429406, but there were other times where we never even filed bugs, so no record of them.

Once the interrupted VMs resumed, the o.s. within each of those VMs would come back to life… usually in a broken state. The 15-45 second lockup was enough for the o.s. to timeout connections to what it thought was the machine’s local hard disk, just as if someone had unplugged the disk of a running computer and then plugged it back in. Depending on the length of lockup, the VM would resume with:

  • missing or corrupted disks, which we’d have to manually reconstruct. If that failed, we’d delete the VM and recreate it from scratch.
  • disks that had become read-only, which a clean reboot would fix, although you could then still have…
  • disks that were just fine, but with corrupted application files on the disk (i.e. the build or unittest in progress), which caused subsequent runs to fail out with unusual errors. This required understanding where the different application level files were buried, and then manually cleaning up until the applications on the VM started working again. Depending on what they were doing, they would fail out of builds with weird compiler / linker errors, or fail out of unittests with what looked like random unittest errors.
  • no problems at all. This was a rare and very pleasant surprise, whenever it happened. It seemed to depend on how long the timeout was, but that situation was very very rare, and we didn’t even count those!

We’d have to investigate each broken VM, and repair as appropriate. Best case, a builder VM could be repaired from light damage before any other unittest and talos machines noticed a problem. Repair time: a few minutes. Worst case, a buildbot master or unittest master VM would get corrupted, require lengthy repairs, and take down all the slaves attached to it for the duration. Repair time: 5-6 hours.

Sometimes, we’d still be reviving dead VMs when another set of VMs would die… taking down some of the VMs we’d just revived.

It’s worth repeating that this was a problem with *all* our VMs, regardless of branch or purpose. It didn’t matter whether the VM was doing builds or unittests, running on win32 or linux, running as slave or master, running on 1.8/1.9/moz2 tinderbox trees. And each failure gave different symptoms every time.

After a few days of this continuous behavior, our lives had deteriorated into manically watching tinderbox, getting screenfuls of new nagios emails every time we checked email, and scrambling to prop up whatever VM just died.

So long as we could fix machines faster than they died, and so long as we worked 24 hours a day, 7 days a week, we could keep the trees open with only occasional machine burning problems being visible to developers.

As the failure rate got higher, it turned into a losing battle, and finally late afternoon Sunday 8th, we had to give up and just close the trees. Not just one tree. Close *all* trees.

By Tuesday (10th), Justin had stable ESXHosts and NetApps, so we started reviving / repairing all our VMs. And this time, the VMs stayed up! :-) By Monday (16th), we’d repaired the last of the broken VMs and life returned to normal after a never-boring-for-one-minute 21 days.

Many many thanks to bhearsum and nthomas for all their work continuously reviving VMs. Because of their non-stop repair work, we were able to keep the trees open during the FF3.0rc2 and FF3.0rc3 releases.

…and thanks also to Justin and mrz for all their work chasing this down. Debugging 3 different interwoven problems is not fun.
tc
John.
=====
ps: Confusing the matter was bug#407796, where a linux o.s. kernel update was needed in the VM o.s. to prevent the VM disk from going read-only. Doing this kernel update required scheduling downtime for that tinderbox, doing config updates, and a restart. Only after some “updated” VMs re-failed, did we finally get confirmation that the kernel version we needed was different to what we were told to use. We were reupdating kernels to the new “correct” kernel when they also started failing… in sets, just like a network outage…

Wed, 04-Jun-08

How to use Jawbone headset with Skype/SJPhone on a MacBookPro

Filed under: Tech tips, Mozilla — John @ 17:04:21 PST, 04-Jun-08 (Wed)

I wanted to set up a headset for my work VOIP phone calls from my laptop. I already had a Jawbone headset for my cellphone, so why not use that?
Literally all I had to do was:

  • make Jawbone discoverable (when powered off, press the black shiny section with raised lettering, until the LED starts alternating Red/White)
  • on Mac, in Bluetooth menu, “Set up Bluetooth Device”, pick “Headset” and walk through the dialogs to find devices. Enter the passcode, which defaults to ‘0000’.
  • in Skype, preferences dialog, audio tab, set the “Audio Output”, “Audio Input” and “Ringing” options to each use the “Jawbone” menu item.
  • in SJPhone preferences dialog, audio tab, set the “Output” and “Input” options to each use the “Jawbone” menu item.

That was it.
It all just worked first time, and was literally all up and running in two minutes. It would have been even faster except I had to dig up the instructions on making Jawbone discoverable! It took me much longer to write this blog post, but thought this info might be useful to others.

For the record, I was using the following:

  • MacBookPro running OSX 10.4.11
  • Skype v2.7.0.330
  • SJPhone v1.60.299a
  • Jawbone headset(!)

Mon, 02-Jun-08

General Fuzz and Creative Commons

Filed under: Soapbox, Mozilla — John @ 08:37:46 PST, 02-Jun-08 (Mon)

A friend of mine has been creating original music for years. James is passionate about his music, happily shares it with others, and is truly delighted whenever someone likes it!

While donations are always happily received, he explicitly did not want to charge money for his music. His CD release parties usually involve handing out free copies of the new CD to everyone at the venue who would like to listen to it, and encouraging people to download entire albums from his website. At this point he’s released 5 albums, with another one due to release soon. You can download all of these albums, in their entirety, from his website.

Despite not charging for his music, he also did not want anyone stealing his original composition music. So, a few years ago, after quite a bit of homework, he settled on using Creative Commons on all his music.

Now, thanks to James’s Creative Commons licensing and with help from justdave, whoever first dials into a Release meeting can now listen to “Reflective Moments” by General Fuzz. If you like it, you can find all his other works, in their entirety, at www.generalfuzz.net. It would make James happy! :-)

Wed, 28-May-08

De-tangling timestamps: part1

Filed under: Mozilla — John @ 19:29:36 PST, 28-May-08 (Wed)

Alice, Rob Helmer and Nick recently fixed an important, long standing problem about how Build Infrastructure handed off builds to the Talos Infrastructure. Their work fixed:

  • an intermittent couple-of-times-a-day talos outage, which has been happening ever since we started using Talos in production.
  • intermittent cases where Talos would occasionally skip over a build without testing it.

That’s enough reasons to make this an important fix, but it’s also important because it makes some future timestamp cleanup work possible. For the curious, here are some background details:

  • When builds were produced by each o.s. builder, the build infrastructure copied generated builds into a specific directory.
  • When Talos machines wanted to test a build, they copied builds from that same specific build directory in order to start testing. Talos would then plot test data on the graph server using “testrun time”  (time stamp of when Talos started running the test), *not* using the start time of when the build was created. This is an important point, and at the root of a bunch of regression triage complexities.
  • Because new builds would be copied into the *same* specific directory, they would overwrite the previous build. Which means that, when testing a build, we didn’t know when that build was actually created. All we could tell was what time the “testrun” started for that build. So long as we tested as quickly as we produced new builds, it was close enough.

…but when we ramped up volume of builds and tests to production levels, we discovered:

  • New builds being copied into the same specific directory could collide with Talos downloading the previous build, and cause Talos to fail out with an error. The next Talos attempt would work fine, but because each Talos run takes so long to complete, it would appear that Talos was burning for a couple of hours, until the next test run completed successfully. This happened intermittently a few times every day. This is now fixed.
  • Builds are generated at different speeds; linux builds quicker than win32, for example. This means that the contents of the specific directory are refreshed at different rates. The linux build in the dir almost always contains code of a different timestamp from the win32 build in the dir. Enabling PGO caused win32 build times to double, which made this discrepancy even worse. This is now improved, but not fully fixed.
  • In situations where the builds were generated quickly enough, and tests ran slowly enough, we could see: a 1st build becomes available, Talos starts testing 1st build, a new 2nd build becomes available, a 3rd build becomes available, overwriting 2nd build. When Talos finishes testing 1st build, Talos detects and starts testing the available build (the 3rd build, skipping over the 2nd build completely). This is now fixed. There is another, similar sounding but unrelated bug about how Buildbot optimizes pending-requests, by collapsing them all together, see bug#436213.
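The skipped-build case is easier to see in a toy simulation. The sketch below is not Talos code: the class names, the build id strings and the “test every untested build in order” policy are all assumptions made for illustration. It contrasts a single fixed drop directory, where each new build overwrites the last, with one directory per build.

```python
class SingleDirStore:
    """One fixed drop directory: each new build overwrites the previous one."""

    def __init__(self):
        self.latest = None

    def publish(self, build_id):
        self.latest = build_id

    def next_build(self, last_tested):
        # Only the most recent build is ever visible to the tester.
        if self.latest is not None and self.latest != last_tested:
            return self.latest
        return None


class DatedDirStore:
    """One directory per build, keyed by a sortable (hypothetical) build id."""

    def __init__(self):
        self.builds = []

    def publish(self, build_id):
        self.builds.append(build_id)

    def next_build(self, last_tested):
        # Return the oldest build that has not been tested yet.
        for build_id in sorted(self.builds):
            if last_tested is None or build_id > last_tested:
                return build_id
        return None


def run_scenario(store):
    """Builds 2 and 3 both land while the tester is still busy with build 1."""
    tested = []
    store.publish("2008052801")
    current = store.next_build(None)
    tested.append(current)

    # Two more builds arrive before the first test run finishes.
    store.publish("2008052802")
    store.publish("2008052803")

    while True:
        following = store.next_build(current)
        if following is None:
            break
        tested.append(following)
        current = following
    return tested
```

Running `run_scenario(SingleDirStore())` tests the 1st and 3rd builds only, while `run_scenario(DatedDirStore())` tests all three in order.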

That’s it for this fix. There’s still plenty more cleanup needed around how time/date is stored in different parts of the infrastructure, but this was an important big first step.

Next steps will include:

  • fixing how talos handles re-runs/duplicate data
  • have the dated dir be based on yet-to-be-enhanced BuildID
  • changing Talos and graph server to use “build time”, not “testrun time”. This will greatly simplify a lot of manual regression triage work for people.
  • simplify underlying code that lines up builds and test results on tinderbox/waterfall pages.
  • figuring out when is a good time to flip the switch in Talos&graph server, marking all data before a certain point as “testrun time”, and data after that point as “build time”.

Anyone curious for details should read Alice’s recent post to mozilla.dev.apps.firefox and mozilla.dev.performance (“time stamps of talos performance results & finding regressions”), bug#291167, bug#417633 and bug#419487. BuildID changes are being discussed in bug#431270 and bug#431905.

It’s a tricky, complex area in the infrastructure, so hopefully all that makes sense?!!?

Tue, 27-May-08

netapp woes, and bug#435134

Filed under: Mozilla — John @ 01:18:34 PST, 27-May-08 (Tue)

Since middle of last week, we’ve been struggling with bug#435134, a problem where a random set of build machines would all lose file / network connections at the same time. In each case, the VM would fail out with differently weird errors, like cvs merge conflicts even though no-one had landed any changes… or system header files with corrupted contents, causing compiler errors… or compilers throwing internal errors…

The failing VMs were on different branches, running different o.s. and doing different builds. The only thing that made us think these were related was that the failures were all detected within minutes of each other, and that no-one had landed any code changes anywhere because of pure luck of timing in the various Mozilla releases.

Simple reboots were not enough; in each case we had to delete out the working area completely and then restart. Then the machines would run successfully/green for a couple of cycles… only to then fail out in other weird-yet-similar ways a few hours later. It made for a very exciting (or very annoying!?) few days for Justin, mrz, nthomas and myself; it certainly didn’t help social plans for anyone over the long weekend here in the US.

The problem is not yet fixed, so we’ll need to do further debugging. However, now that Justin has us avoiding the likely culprit, one head on netapp-c, we have been able to keep the VMs up and building happily for 24 hours now, which is great progress.

Big tip of the hat to Justin, mrz and nthomas for all their help getting things stable before today’s go/nogo meeting for FF3.0rc2.

Mon, 19-May-08

A close call…

Filed under: Soapbox, Mozilla — John @ 17:52:02 PST, 19-May-08 (Mon)

Found this while catching up on the news today.

Most war footage shown in the US is very de-personalized, planes blowing up bridges, missiles blowing up buildings, etc. Body bags and returning wounded personnel get scant coverage. If you don’t look for the details, you might think it was all a glossy action movie, where no one gets hurt, and all the actors go home for dinner once the cameras stop. By contrast, this photo shows the very real dangers on a very personal level.


(click image for original image and story on Christian Science Monitor website)

The things I noticed were:
- he is not wearing a helmet. Or body armor. Depending on the weapon shooting at him, unclear if those would help anyway, but still… a tshirt..?!?
- he is wearing a wedding ring.

Wed, 14-May-08

We have *how* many machines? (“dedicated specialised slaves” vs “pool of identical slaves”)

Filed under: Mozilla — John @ 23:58:32 PST, 14-May-08 (Wed)

On 1.9/trunk, it’s important to point out that almost all of these 88 machines need to remain up, and working perfectly, in order to keep the 1.9/trunk tinderbox tree open. If one of these machines dies, we usually have to close the tree.

This is because most of these machines are specialized unique machines, built and assigned to do only one thing. For example, bm-xserve08 only does Firefox mac depend/nightly builds; if the hard disk dies, we don’t automatically load balance over to another identically configured machine that’s already up and running in parallel. Instead, we close the tree and quickly try to repair that broken specialized unique machine. Or manually build up a new machine to be as close as possible to the unique dead machine. All in a rush, so we can reopen the tree as soon as possible. Looks like this: dedicated unique slaves

Obviously, the more machines we bring online, the more vulnerable we are to routine hardware failures, network hiccups, etc. Kinda like a string of Christmas tree lights which goes dark when any one bulb burns out. The longer your string of Christmas tree lights, the more bulbs you have, the more chance you have of a single bulb burning out, and the more your chances of the tree going dark.
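To put rough numbers on the Christmas tree lights problem, here is a back-of-the-envelope sketch in Python. It assumes machines fail independently of each other and that every machine is required, and the 99% per-machine uptime is a made-up figure for illustration, not a measured one.

```python
def chance_tree_stays_open(machines, per_machine_uptime):
    """Probability that every required machine is up at the same moment.

    Assumes failures are independent and that any single failure
    closes the tree, as with the dedicated unique slaves described above.
    """
    return per_machine_uptime ** machines


if __name__ == "__main__":
    # Hypothetical 99% uptime per machine, for strings of different lengths.
    for machines in (10, 44, 88):
        chance = chance_tree_stays_open(machines, 0.99)
        print(f"{machines} machines: {chance:.1%} chance all are up")
```

Under those assumptions, 88 machines at 99% uptime each are all up at the same time only a little over 40% of the time, which is why adding more bulbs to the string makes things worse, not better.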

When we started working on moz2 infrastructure, the conversation went something like “what do you want on moz2?”, “everything we have on FF3”, “ummm… everything? really?”, “yes, the full set. Oh, and we’ll need a few sets of them for a few active different mercurial branches running concurrently”.

So, how do we scale our infrastructure and also improve reliability? One of the big changes in how we are building out the moz2 infrastructure was to *not* have specialized unique machines. Instead, we have a pool of identical slaves for each o.s., each slave equally able to handle whatever bundled work is handed to it. This has a couple of important consequences:

  • if one generic slave dies, we dynamically and automatically re-allocate the work to happen on one of the remaining slaves in the pool. Builds would turn around slower, and we’d obviously start repairing the machine, but at least work would continue smoothly, and the tree would not close!
  • if we decide we want to add an additional branch, or if we feel the current number of slaves are not able to handle the workload, we can simply add new identical slaves to the pool, and automatically dynamically re-allocate the work across the enlarged pool.

Looks like this:

pool of identical slaves
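As a rough illustration of the pool idea, here is a toy scheduler in Python. It is not Buildbot’s API; the class, its method names and the slave/job names are all hypothetical. It only shows the two consequences listed above: work from a dead slave is handed to the rest of the pool, and new identical slaves can be added at any time.

```python
from collections import deque


class SlavePool:
    """Toy pool of identical slaves: any idle slave can take any pending job."""

    def __init__(self, slaves):
        self.idle = deque(slaves)
        self.running = {}       # slave name -> job it is working on
        self.pending = deque()  # jobs waiting for a free slave

    def _dispatch(self):
        # Hand out pending jobs for as long as there are idle slaves.
        while self.pending and self.idle:
            slave = self.idle.popleft()
            self.running[slave] = self.pending.popleft()

    def submit(self, job):
        self.pending.append(job)
        self._dispatch()

    def job_finished(self, slave):
        self.running.pop(slave)
        self.idle.append(slave)
        self._dispatch()

    def slave_died(self, slave):
        # Remove the slave from the pool and re-queue its job at the front.
        job = self.running.pop(slave, None)
        if slave in self.idle:
            self.idle.remove(slave)
        if job is not None:
            self.pending.appendleft(job)
        self._dispatch()

    def add_slave(self, slave):
        self.idle.append(slave)
        self._dispatch()
```

With two slaves and three jobs, killing one busy slave leaves its job queued rather than lost; the job starts again as soon as another slave frees up or a new one is added to the pool. Turnaround is slower, but nothing has to close.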

Adding 88 new unique machines for each of 3-5 new additional active branches would be painful to set up, and just about impossible to maintain. And we’d be *guessing* how much development work there would be in the next 18 months, and then building the infrastructure out. Instead of having to SWAG our needs for the next 18 months and then set up frantically now, this shared pool approach allows us to grow gradually as needed. Oh, and it should be more robust. :-)

(Many thanks to BenT for the christmas tree lights analogy. I was saying “a chain is only as strong as the weakest link”, but BenT’s analogy offers much better possibilities for awful tree jokes.)

Fri, 09-May-08

Some new faces in ReleaseEngineering

Filed under: Mozilla — John @ 15:11:34 PST, 09-May-08 (Fri)

Belatedly, I’d like to welcome two new faces here in Release Engineering. Armen Zambrano Gasparnian (armenzg on irc) started last week, and he will be working on untangling some of our l10n infrastructure. Lukas Blakk (lsblakk on irc) started here this week, and she will be working with Robcee on the unittest automation infrastructure.

They’re already digging into various problems, and in their spare time, have even started tempting us with homemade baking (a yummy chocolate with pecan and almond cake that didn’t last 2 hours), and promises of more to come over the summer. It’s our first time having interns here in the group, so it’s quite exciting for all of us. Next time you are in Building K, do stop by upstairs, and say hi.

(Oh, and I’m also glad to report that Armen seems to be enjoying the official dress code.)


email: john (at) oduinn (dot) com
All content on this website (c) John O'Duinn, 1998-2007