netapp woes, and bug#435134

Since the middle of last week, we’ve been struggling with bug#435134, a problem where a random set of build machines would all lose file/network connections at the same time. In each case, the VM would fail with a different weird error: cvs merge conflicts even though no-one had landed any changes… or system header files with corrupted contents, causing compiler errors… or compilers throwing internal errors…

The failing VMs were on different branches, running different operating systems and doing different builds. The only things that made us think these failures were related were that they were all detected within minutes of each other, and that, by pure luck of timing in the various Mozilla releases, no-one had landed any code changes anywhere.

Simple reboots were not enough; in each case we had to delete the working area completely and then restart. The machines would then run successfully (green) for a couple of cycles… only to fail in other weird-yet-similar ways a few hours later. It made for a very exciting (or very annoying!?) few days for Justin, mrz, nthomas and myself; it certainly didn’t help anyone’s social plans over the long weekend here in the US.

The problem is not yet fixed, so we’ll need to do further debugging. However, now that Justin has us avoiding the likely culprit, one head on netapp-c, we have been able to keep the VMs up and building happily for 24 hours now, which is great progress.

Big tip of the hat to Justin, mrz and nthomas for all their help getting things stable before today’s go/nogo meeting for FF3.0rc2.
