On Friday 15th May, the failover capabilities of the pool-o-slaves paid off, yet again.
As you may recall, on 12th May we were able to take down 76 VMs for scheduled maintenance without closing the tree. With that many systems offline, wait times were longer, but everything kept working, and people could still check in and see build, test, and performance results as usual. Quite impressive, really.
On 15th May, a totally unrelated DHCP server failed without warning. This took out 4 ESX hosts running approximately 30 VMs for several hours. The builds and tests that were in progress at the time of the failure were lost, but otherwise no one noticed a thing. Already-queued jobs were allocated to the remaining machines, automatically working around the outage, and our infrastructure just kept working while IT revived the DHCP server. Bug#493181 has details, for the curious.
We’re a long way from claiming five-nines (99.999%) uptime, but the structural improvements are really paying off. Once a few other projects wrap up, we can start seriously talking about SLAs. We’ve come a long way in the last 2 years, and this is all very exciting stuff…