De-tangling timestamps: part 2

Yesterday, Alice de-tangled one more part of the messy timestamp problem by fixing bug#419487.

All time values on Talos and graph server data points now use “build time”, not “testrun time”.

This will greatly simplify a lot of manual regression triage work for people looking at performance graphs on graph server. Now, if you are tracking down what change caused a perf regression:

  • Before 8am PDT Wednesday, 25 June 2008, all charts use “testrun time”. This means debugging a regression requires manually padding the regression range multiple hours wider – enough to catch everything from the start of the build, through the wait in the queue, to the job starting on an available slave. Different OSes take different amounts of time, and any machine hiccups complicate this padding guesswork even further. If you get it wrong, you can incorrectly rule out bad changes, so pad out more than you think you need. It means extra triage work, but is safer.
  • After 8am PDT Wednesday, 25 June 2008, all charts use “build time”. We’re still fixing other problems with timestamps, so you still need *some* extra range padding, but much less than before. At most, manually pad out to the next/previous hour (see the sketch after this list). Even that padding should become unnecessary once the BuildID changes in bug#431270 and bug#431905 have landed.
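
To make that “pad to the hour” rule concrete, here is a minimal sketch. It is a hypothetical helper of my own, not part of Talos or graph server: given a suspect data point’s “build time”, it widens the regression range out to the surrounding hour boundaries.

    # Hypothetical helper, not part of Talos/graph server: widen a
    # suspect data point's "build time" out to the surrounding hour
    # boundaries - the "at most pad to the next/previous hour" rule.
    from datetime import datetime, timedelta

    def padded_range(build_time):
        # Round down to the previous hour boundary...
        start = build_time.replace(minute=0, second=0, microsecond=0)
        # ...and up to the next hour boundary.
        end = start + timedelta(hours=1)
        return start, end

    suspect = datetime(2008, 6, 25, 9, 23)  # example "build time"
    start, end = padded_range(suspect)
    print(start, "->", end)  # 2008-06-25 09:00:00 -> 2008-06-25 10:00:00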

Anyone curious for details should read Alice’s recent post to mozilla.dev.builds and mozilla.dev.performance (“change in talos time stamps (as of 8am PDT June 25th 2008)”), bug#291167, bug#417633 and bug#419487. This is a continuation of the work described (in tedious detail!) in my previous blog post.

This sounds like a small, simple change, but it was not. It’s a tricky, complex area of the infrastructure, with a lot relying on it, and lots of different people with different assumptions about how time is used here. There was plenty of behind-the-scenes homework on this, and to avoid causing any confusion, we held off landing it until after Firefox3 shipped.

Tip of the hat to Alice for pushing this through to production so smoothly.

“No Parking” Signs in San Francisco

Parking in San Francisco is a serious no-holds-barred business. If you manage to get a garage space, selling your car to buy a car that fits is “normal”. I have friends who figured that ‘n’ parking tickets a month was cheaper than the cost of renting a garage, even if they could find one. So, people who do finally manage to get a parking garage space get frustrated by street parkers blocking their driveway. Towing companies make a lot of money around here.
Signs like this are routinely ignored:

Traditional No Parking sign

…but signs like this are rarely ignored!

San Francisco No Parking Sign

“netapp woes, and bug#435134” FIXED! FIXED! FIXED!

While everyone else is talking about the Firefox3 release, here’s a behind-the-scenes story from just before it shipped.

A followup to my earlier post about netapp woes. For the 21 days after we filed bug#435134, we struggled to keep trees open and various machines up and running. What was happening?

Technical details of the root causes are here.

From our point of view here in RelEng, as “users” of VMware/NetApp, we would see 8-12 of our 84 VMs randomly lock up at the same instant, for 15-45 seconds at a time.

The first couple of times, we incorrectly thought this hiccup was caused by network outages. However, each time, IT confirmed the network was healthy, and we had no way to track the problem, so we’d repair the VMs and just move on. Looking back, I’ve found bug#435052 and bug#429406, but there were other times where we never even filed bugs, so there’s no record of them.

Once the interrupted VMs resumed, the o.s. within each of those VMs would come back to life… usually in a broken state. The 15-45 second lockup was enough for the o.s. to time out connections to what it thought was the machine’s local hard disk, just as if someone had unplugged the disk of a running computer and then plugged it back in. Depending on the length of the lockup, the VM would resume with:

  • missing or corrupted disks, which we’d have to manually reconstruct. If that failed, we’d delete the VM and recreate it from scratch.
  • disks that had become read-only, which a clean reboot would fix (see the sketch after this list), although you could then still have…
  • disks that were just fine, but with the application files on the disk (i.e. the build or unittest in progress) corrupted, which caused subsequent runs to fail out with unusual errors. This required understanding where the different application-level files were buried, and then manually cleaning up until the applications on the VM started working again. Depending on what they were doing, the VMs would fail out of builds with weird compiler/linker errors, or fail out of unittests with what looked like random unittest errors.
  • no problems at all. This was a rare and very pleasant surprise whenever it happened. It seemed to depend on how long the timeout was, but that situation was very, very rare, and we didn’t even count those!
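
As a concrete example of the read-only failure mode, here is a minimal triage sketch. It is a hypothetical helper of my own, not our actual RelEng tooling, and it assumes a linux VM with /proc/mounts:

    # Hypothetical triage helper, not the actual RelEng tooling: after
    # a linux VM resumes, list any filesystems that came back read-only
    # (the "clean reboot will fix it" failure mode above).
    def readonly_mounts(mounts_file="/proc/mounts"):
        ro = []
        with open(mounts_file) as f:
            for line in f:
                device, mountpoint, fstype, options = line.split()[:4]
                if "ro" in options.split(","):
                    ro.append((device, mountpoint))
        return ro

    for device, mountpoint in readonly_mounts():
        print("%s (%s) is read-only; schedule a clean reboot"
              % (mountpoint, device))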

We’d have to investigate each broken VM, and repair as appropriate. Best case, a builder VM could be repaired from light damage before any other unittest or talos machines noticed a problem. Repair time: a few minutes. Worst case, a buildbot master or unittest master VM would get corrupted, require lengthy repairs, and take down all the slaves attached to it for the duration. Repair time: 5-6 hours.

Sometimes, we’d still be reviving dead VMs when another set of VMs would die… taking down some of the VMs we’d just revived.

It’s worth repeating that this was a problem with *all* our VMs, regardless of branch or purpose. It didn’t matter whether the VM was doing builds or unittests, running on win32 or linux, running as slave or master, or running on the 1.8/1.9/moz2 tinderbox trees. And each failure gave different symptoms.

After a few days of this continuous behavior, our lives had deteriorated into manically watching tinderbox, getting screenfuls of new nagios emails every time we checked email, and scrambling to prop up whatever VM had just died.

So long as we could fix machines faster than they died, and so long as we worked 24 hours a day, 7 days a week, we could keep the trees open with only the occasional machine-burning problem visible to developers.

As the failure rate climbed, it turned into a losing battle, and finally, late afternoon on Sunday the 8th, we had to give up and just close the trees. Not just one tree. Close *all* trees.

By Tuesday (10th), Justin had stable ESX hosts and NetApps, so we started reviving/repairing all our VMs. And this time, the VMs stayed up! 🙂 By Monday (16th), we’d repaired the last of the broken VMs, and life returned to normal after a never-boring-for-one-minute 21 days.

Many many thanks to bhearsum and nthomas for all their work continuously reviving VMs. Because of their non-stop repair work, we were able to keep the trees open during the FF3.0rc2 and FF3.0rc3 releases.

…and thanks also to Justin and mrz for all their work chasing this down. Debugging 3 different interwoven problems is not fun.
tc
John.
=====
ps: Confusing the matter was bug#407796, where a linux kernel update was needed in the VM o.s. to prevent the VM disk from going read-only. Doing this kernel update required scheduling downtime for that tinderbox, doing config updates, and a restart. Only after some “updated” VMs re-failed did we finally get confirmation that the kernel version we needed was different from what we had been told to use. We were re-updating kernels to the new “correct” kernel when they also started failing… in sets, just like a network outage…

How to use Jawbone headset with Skype/SJPhone on a MacBookPro

I wanted to set up a headset for my work VOIP phone calls from my laptop. I already had a Jawbone headset for my cellphone, so why not use that?
Literally all I had to do was:

  • make the Jawbone discoverable (while it is powered off, press the black shiny section with raised lettering until the LED starts alternating red/white)
  • on the Mac, in the Bluetooth menu, choose “Set up Bluetooth Device”, pick “Headset” and walk through the dialogs to find devices. Enter the passcode, which defaults to ‘0000’.
  • in Skype, preferences dialog, audio tab, set the “Audio Output”, “Audio Input” and “Ringing” options to each use the “Jawbone” menu item.
  • in SJPhone preferences dialog, audio tab, set the “Output” and “Input” options to each use the “Jawbone” menu item.
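
If you want to double-check the pairing from the command line afterwards, here is a small sketch using macOS’s built-in system_profiler tool. This is a hypothetical extra of my own, not something the steps above require:

    # Hypothetical sanity check, not required for the steps above: ask
    # macOS's built-in system_profiler for its Bluetooth report and
    # look for the headset by name.
    import subprocess

    report = subprocess.run(
        ["system_profiler", "SPBluetoothDataType"],
        capture_output=True, text=True, check=True,
    ).stdout

    if "Jawbone" in report:
        print("Jawbone is paired and visible to the OS")
    else:
        print("Jawbone not found; re-run the pairing steps above")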

That was it.
It all just worked the first time, and was literally all up and running in two minutes. It would have been even faster, except I had to dig up the instructions on making the Jawbone discoverable! It took me much longer to write this blog post, but I thought this info might be useful to others.

For the record, I was using the following:

  • MacBookPro running OSX 10.4.11
  • Skype v2.7.0.330
  • SJPhone v1.60.299a
  • Jawbone headset(!)

General Fuzz and Creative Commons

A friend of mine has been creating original music for years. James is passionate about his music, happily shares it with others, and is truly delighted whenever someone likes it!

While donations are always happily received, he explicitly did not want to charge money for his music. His CD release parties usually involve handing out free copies of the new CD to everyone at the venue who would like one, and encouraging people to download entire albums from his website. At this point he’s released 5 albums, with another due out soon, and you can download every one of them, in its entirety, from his website.

Despite not charging for his music, he also did not want anyone stealing his original compositions. So, a few years ago, after quite a bit of homework, he settled on releasing all his music under a Creative Commons license.

Now, thanks to James’s Creative Commons licensing, and with help from justdave, whoever first dials in to a Release meeting can listen to “Reflective Moments” by General Fuzz. If you like it, you can find all his other works, in their entirety, at www.generalfuzz.net. It would make James happy! 🙂