John O’Duinn’s Soapbox

Thu, 24-Jul-08

Brown bag: Intro to branching and versioning

Filed under: Mozilla — John @ 12:54:46 PST, 24-Jul-08 (Thu)

Last week, I did a quick brown bag presentation here in the office. The idea was to answer some of the questions I’ve been asked frequently since we shipped Firefox3.0 and moved future work from cvs to Mercurial.

Tony’s suggested title of “All you wanted to know about the Moco Build system” was just a little overambitious - that is a talk I’d like to *attend*! :-) For this brown bag, I focused on branching and versioning issues, including things like:

- what repo/branch is used for what release
- the different version #s
- the different flags used in bugzilla during triage and releases

You can download the presentation in PDF format from here. It’s only 10 pages long, and it’s missing the verbal narrative, but it still feels worthwhile posting this for others to see.

If people find this interesting, I’d be happy to do other brown bags, so if you’ve any suggestions, please chime in, and I’ll see what I can do. This was also my first time giving a brown bag presentation here, so I would love to hear comments / feedback / ideas on things to do better, before I find myself giving other brown bags!

(Many thanks to Gerv, Beltzner, BenHearsum, NickThomas for their help keeping me honest.)

New graph server in production!

Filed under: Mozilla — John @ 01:33:50 PST, 24-Jul-08 (Thu)

During Tuesday night’s scheduled maintenance window, Alice pushed the new graph server live into production.

There are still lots of things to fix, but it’s already way better than the “classic” graph server being replaced. If you haven’t already seen it, go browse around http://graphs.mozilla.org and let us know what you think!

The switchover went nice and smoothly. So smoothly, it’s hard to know that Alice and Morgamic had *tons* of behind the scenes cleanup work to do, in order to get all the code into one place, and make this switchover possible. Because of all the cleanup, we can now happily say:

No more “I can’t easily build graphserver because graphserver code is scattered between hg & cvs”.

No more “unknown private patches on graphs-stage”.

No more “maintaining old graphs.m.o codebase”.

At last *now* the real graph server development work can pick up speed. Big tip-o-the-hat to Alice and Morgamic for making this happen!

update:  fixed broken URL, thanks to Dan and Rob for spotting that.
update: added Mark to last sentence! Sorry, Mark, my bad. :-)

Mon, 21-Jul-08

Firefox 2.0.0.16 *and* Firefox 3.0.1….at the same time!

Filed under: Mozilla — John @ 18:11:00 PST, 21-Jul-08 (Mon)

Last week, we hit an important milestone for our group.

One person was able to start a security release for FF3.0.1 *and* a security release for FF2.0.0.16 both on the same morning. The usual wall-clock times blog-post is coming, but the important point is that one person could do the two releases, by himself, at the same time, without any delay to either release…

…oh, and all while Ben was also finishing off the Thunderbird2.0.0.x release-automation with Nick, and still having a normal life!

This is a huge huge *huge* milestone for our group, and worth taking a moment’s pause.

Thu, 10-Jul-08

SF Fire Department by the (wall-clock) numbers

Filed under: Cars & Driving, Mozilla — John @ 02:47:54 PST, 10-Jul-08 (Thu)

Today, while I was driving out of town on vacation:

  • 13:19: Sitting at red light, I’m in the front car in a left turn lane. The lights just turned red when I was approaching, so I have time to watch the world around me. It’s another warm sunny day here in San Francisco, and everyone is out enjoying the weather. I’m at a busy intersection in the Mission district, multiple lanes of fast moving traffic, lots of cars, a bus at the corner bus stop, and crowded sidewalks.
  • 13:21: Bus starts pulling out of bus stop, heading away from intersection. A woman running along sidewalk at full speed, without slowing, runs out across the busy road, in the intersection, dodging traffic to try to catch the departing bus. She was crossing the road *behind* the bus, so the bus driver never saw her, and continued driving away. She continues running through traffic anyway, dodging narrow gaps between cars, determined to catch the bus, and makes it most of the way across the road before she stumbles and falls awkwardly. She lands half in the road, half on the sidewalk at the corner, and doesn’t get up. I watch her lying there, clutching her leg in pain. Somehow all the cars missed her. Stuck on the other side of the busy intersection, I watch oncoming cars turning right-on-red barely miss her without even seeing her lying there half in the road.
  • 13:23: Lights finally change to green. I turn through the intersection, and pull into corner bus stop, positioning the car so anyone turning the corner would go wide to avoid my car (and at the same time the woman). Another car pulled in to help at the same time. Another pedestrian stops to help also.
  • 13:25: Finish quick assessment of her state, and start call to 911. Location. Female, 20s. Sprained/twisted ankle. Small bloody scrapes to her face and a broken nose ring, from when her head hit the sidewalk in the fall. Never lost consciousness. Breathing and pulse reasonable. Awake and talking coherently. No visible bleeding. No major skull bruising. Mild shock, crying, concerned for her lunch appointment, lack of medical insurance.
  • 13:27: Finish call with 911.
  • 13:31: SFFD unit rolled through intersection with lights & sirens. We try to flag them down, thinking they don’t see us. They pointed acknowledgment back, but were already in transit to another call and didn’t stop.
  • 13:33: SFFD fire tender arrived on scene
  • 13:38: SFFD ambulance arrived on scene
  • 13:53: Patient being loaded into ambulance, on backboard, with head and leg bound. SFFD crews finished repacking equipment before both units depart. After some last questions, I’m cleared to leave, so drive away.

From when I finished the call with 911, to when the first unit arrived on scene was 6mins. Also, it was really great that two other people stopped to help.

Sun, 06-Jul-08

Installing Ruby On Rails on osx10.4…

Filed under: Tech tips, Mozilla — John @ 23:15:46 PST, 06-Jul-08 (Sun)

These steps worked first time for me: http://hivelogic.com/articles/2007/02/ruby-rails-mongrel-mysql-osx …with only two minor hiccups:

1)  The step to install RubyGems specifies v0.9.2, but this should be v0.9.4. The instructions work identically, just change the version#.

2) If you hit the error below while installing rails, run “sudo gem update” and then retry the install:

$ sudo gem install rails --include-dependencies
Bulk updating Gem source index for: http://gems.rubyforge.org
ERROR:  While executing gem ... (Gem::GemNotFoundException)
Could not find rails (> 0) in any repository
$
$ sudo gem update
$ sudo gem install rails --include-dependencies
...

I later discovered both of these nits were already noted in the comments. There were so many spam comments there, I only found them after I’d already figured out the solutions myself! :-( Regardless of those two very minor nits, I found Dan’s set of instructions on hivelogic.com to be outstanding. I went from complete zero to having a hello-world up and running in just a few minutes… all thanks to Dan’s hivelogic doc.

Thu, 26-Jun-08

De-tangling timestamps: part2

Filed under: Mozilla — John @ 16:13:35 PST, 26-Jun-08 (Thu)

Yesterday, Alice de-tangled one more part of the messy time stamp problem by fixing bug#419487.

All data points for time on Talos and graph server now use “build time”, not “testrun time”.

This will greatly simplify a lot of manual regression triage work for people looking at performance graphs on graph server. Now, if you are tracking down what change caused a perf regression:

  • Before 8am PDT Wednesday, 25 June 2008, all charts use “testrun time”. This means debugging regressions requires manually padding a regression range multiple hours wider - enough to cover the time from start of build, through the queue, to the job starting on an available slave. Different operating systems take different amounts of time, and any machine hiccups complicate this padding guesswork further. If you get it wrong, you can incorrectly rule out bad changes, so pad out more than you think. It means extra triage work, but is safer.
  • After 8am PDT Wednesday, 25 June 2008, all charts use “build time”. We’re still fixing other problems with timestamps, so you still need *some* extra range padding, but much less padding than before. At most, manually pad out to the next/previous hour. This padding should no longer be needed once the BuildID changes in bug#431270 and bug#431905 have landed.
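
For anyone who would rather script the padding than do it by hand, here is a rough sketch of the two rules above. It is only an illustration, not anything Talos or graph server provides: the function names are made up, the 4 hour padding is a guess standing in for “multiple hours”, and it pads both ends of the range to be safe.

```python
from datetime import datetime, timedelta

# Cutover from the post: 8am PDT, Wednesday 25 June 2008 (naive local time).
CUTOVER = datetime(2008, 6, 25, 8, 0)

# Hypothetical value: the post only says "multiple hours", not how many.
TESTRUN_PADDING = timedelta(hours=4)


def floor_hour(t):
    """Round a timestamp down to the start of its hour."""
    return t.replace(minute=0, second=0, microsecond=0)


def ceil_hour(t):
    """Round a timestamp up to the next hour (unchanged if already on the hour)."""
    floored = floor_hour(t)
    return floored if floored == t else floored + timedelta(hours=1)


def pad_regression_range(start, end):
    """Widen a (start, end) regression range read off a perf graph.

    Points before the cutover are "testrun time", so each such endpoint
    is padded by several hours. Points after the cutover are "build time",
    so each is only pushed out to the previous/next hour.
    """
    if start < CUTOVER:
        start = start - TESTRUN_PADDING
    else:
        start = floor_hour(start)
    if end < CUTOVER:
        end = end + TESTRUN_PADDING
    else:
        end = ceil_hour(end)
    return start, end
```

Each endpoint is handled on its own, so a range that straddles the cutover gets the wide padding on the old side and the hour rounding on the new side.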

Anyone curious for details should read Alice’s recent post to mozilla.dev.builds and mozilla.dev.performance (“change in talos time stamps (as of 8am PDT June 25th 2008)”), bug#291167, bug#417633 and bug#419487. This is a continuation of the work described (in tedious detail!) in my previous blog post.

This sounds like a small simple change, but it was not. It’s a tricky, complex area in the infrastructure, with lots relying on it, and lots of different people with different assumptions about how time is used here. There was lots of behind-the-scenes homework on this, and to avoid causing any confusion, we held off landing this until after Firefox3 shipped.

Tip of the hat to Alice for pushing this through to production so smoothly.

Mon, 23-Jun-08

We take our dress code seriously…

Filed under: Mozilla — John @ 22:28:31 PST, 23-Jun-08 (Mon)

…as Armen discovered when he decided to *not* wear a Hawaiian shirt to the office.

[Image: Armen surprised by Mozilla fashion police]

Wed, 18-Jun-08

“No Parking” Signs in San Francisco

Filed under: Travel, Cars & Driving — John @ 21:14:53 PST, 18-Jun-08 (Wed)

Parking in San Francisco is a serious no-holds-barred business. If you manage to get a garage space, selling your car to buy a car that fits is “normal”. I have friends who figured that ‘n’ parking tickets a month was cheaper than the cost of renting a garage, even if they could find one. So, people who do finally manage to have a parking garage space get frustrated by street parking blocking in their driveway. Towing companies make a lot of money around here.
Signs like this are routinely ignored:

[Image: Traditional No Parking sign]

…but signs like this are rarely ignored!

[Image: San Francisco No Parking Sign]

Tue, 17-Jun-08

“netapp woes, and bug#435134” FIXED! FIXED! FIXED!

Filed under: Mozilla — John @ 13:57:26 PST, 17-Jun-08 (Tue)

While everyone else is talking about the Firefox3 release, here’s a behind the scenes story just before the Firefox3 release.

Followup to my earlier post about netapp woes. For the 21 days since we filed bug#435134, we’ve been struggling to keep trees open and various machines up and running. What was happening?

Technical details of the root causes are here.

From our point of view here in RelEng, as “users” of VMware/NetApp, we would see 8-12 of our 84 VMs randomly lock up at the same instant, for 15-45 seconds at a time.

The first couple of times, we incorrectly thought this hiccup was caused by network outages. However, each time IT confirmed the network was healthy and we never had any way to track the problem, so we’d repair the VMs and just move on. Looking back, I’ve found bug#435052, bug#429406, but there were other times where we never even filed bugs, so there is no record of them.

Once the interrupted VMs resumed, the o.s. within each of those VMs would come back to life… usually in a broken state. The 15-45 second lockup was enough for the o.s. to time out connections to what it thought was the machine’s local hard disk, just as if someone had unplugged the disk of a running computer and then plugged it back in. Depending on the length of lockup, the VM would resume with:

  • missing or corrupted disks, which we’d have to manually reconstruct. If that failed, we’d delete the VM and recreate it from scratch.
  • disks that had become read-only, which a clean reboot would fix, although you could then still have…
  • disks being just fine, but the application files on the disk (i.e. the build or unittest in progress) were corrupted, which caused subsequent runs to fail out with unusual errors. This required understanding where the different application level files were buried, and then manually cleaning up until the applications on the VM started working again. Depending on what they were doing, they would fail out of builds with weird compiler / linker errors, or fail out of unittests with what looked like random unittest errors.
  • no problems at all. This was a rare and very pleasant surprise, whenever it happened. It seemed to depend on how long the timeout was, but that situation was very very rare, and we didn’t even count those!
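
Of the symptoms above, the read-only disk is the easiest to check for from inside a Linux VM. Here is a minimal sketch of such a check, assuming Python on a Unix-like o.s.; it is not what we actually ran at the time, and the function names and mount points are made up for illustration. It only spots read-only filesystems, not missing disks or corrupted build files.

```python
import os


def is_read_only(path):
    """Return True if the filesystem holding `path` is mounted read-only."""
    flags = os.statvfs(path).f_flag
    return bool(flags & os.ST_RDONLY)


def read_only_mounts(paths):
    """Filter a list of mount points down to those currently read-only."""
    return [p for p in paths if is_read_only(p)]


if __name__ == "__main__":
    # Hypothetical mount points; a real slave would list its own build disks.
    for mount in read_only_mounts(["/", "/tmp"]):
        print("read-only:", mount)
```

A check like this could be run right after a suspected lockup to decide which VMs need a clean reboot before anything else is tried.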

We’d have to investigate each broken VM, and repair as appropriate. Best case, a builder VM could be repaired from light damage before any other unittest and talos machines noticed a problem. Repair time: a few minutes. Worst case, a buildbot master or unittest master VM would get corrupted, require lengthy repairs, and take down all the slaves attached to it for the duration. Repair time: 5-6 hours.

Sometimes, we’d be still reviving dead VMs, when another set of VMs would die…taking down some of the VMs we’d just revived.

It’s worth repeating that this was a problem with *all* our VMs, regardless of branch or purpose. It didn’t matter whether the VM was doing builds or unittests, running on win32 or linux, running as slave or master, running on 1.8/1.9/moz2 tinderbox trees. And each failure gave different symptoms every time.

After a few days of this continuous behavior, our lives had deteriorated into manically watching tinderbox, getting screenfuls of new nagios emails every time we checked email, and scrambling to prop up whatever VM just died.

So long as we could fix machines faster than they died, and so long as we worked 24hours a day, 7 days a week, we could keep the trees open with only occasional machine burning problems being visible to developers.

As the failure rate got higher, it turned into a losing battle, and finally late afternoon Sunday 8th, we had to give up and just close the trees. Not just one tree. Close *all* trees.

By Tuesday (10th), Justin had stable ESXHosts and NetApps, so we started reviving / repairing all our VMs. And this time, the VMs stayed up! :-) By Monday (16th), we’d repaired the last of the broken VMs and life returned to normal after a never-boring-for-one-minute 21 days.

Many many thanks to bhearsum and nthomas for all their work continuously reviving VMs. Because of their non-stop repair work, we were able to keep the trees open during the FF3.0rc2 and FF3.0rc3 releases.

…and thanks also to Justin and mrz for all their work chasing this down. Debugging 3 different interwoven problems is not fun.
tc
John.
=====
ps: Confusing the matter was bug#407796, where a linux o.s. kernel update was needed in the VM o.s. to prevent the VM disk from going read-only. Doing this kernel update required scheduling downtime for that tinderbox, doing config updates, and a restart. Only after some “updated” VMs re-failed, did we finally get confirmation that the kernel version we needed was different to what we were told to use. We were reupdating kernels to the new “correct” kernel when they also started failing… in sets, just like a network outage…

Wed, 04-Jun-08

How to use Jawbone headset with Skype/SJPhone on a MacBookPro

Filed under: Tech tips, Mozilla — John @ 17:04:21 PST, 04-Jun-08 (Wed)

I wanted to set up a headset for my work VOIP phone calls from my laptop. I already had a Jawbone headset for my cellphone, so why not use that?
Literally all I had to do was:

  • make Jawbone discoverable (when powered off, press the black shiny section with raised lettering, until the LED starts alternating Red/White)
  • on Mac, in Bluetooth menu, “Set up Bluetooth Device”, pick “Headset” and walk through the dialogs to find devices. Enter passcode, which defaults to ‘0000’.
  • in Skype, preferences dialog, audio tab, set the “Audio Output”, “Audio Input” and “Ringing” options to each use the “Jawbone” menu item.
  • in SJPhone preferences dialog, audio tab, set the “Output” and “Input” options to each use the “Jawbone” menu item.

That was it.
It all just worked first time, and was literally all up and running in two minutes. It would have been even faster except I had to dig up the instructions on making Jawbone discoverable! It took me much longer to write this blog post, but I thought this info might be useful to others.

For the record, I was using the following:

  • MacBookPro running OSX 10.4.11
  • Skype v2.7.0.330
  • SJPhone v1.60.299a
  • Jawbone headset(!)

email: john (at) oduinn (dot) com
All content on this website (c) John O'Duinn, 1998-2007