Last night, we took 76 VMs offline (22% of our 342 machines) so we could do a major firmware upgrade on the EqualLogic arrays. That's a lot of work, and it all went smoothly. But that's not the important part this time.
The major milestone is that we did this *without* needing to close the tree for mozilla-1.9.1, mozilla-central or tracemonkey. Throughout the firmware upgrade, developers were still able to land patches, triggering builds, unittests and talos runs on all operating systems, as well as use TryServer on all operating systems. This was an intended feature in the design of our new infrastructure, but last night was the first time we really tried it out in a controlled way. And it worked perfectly.
Put another way: with the old infrastructure, losing 22% of our machines would have been a massively disruptive, all-hands-on-deck, tree-closing event. Last night showed how much things have improved.
Some details for the curious:
- As slaves became idle before 7pm, we told them to shut down gracefully. This meant that they would not accept new jobs, and could be powered off without developers getting reports of burning/broken builds at 7pm.
- Once all the VMs using EqualLogic arrays were powered off, Aravind was able to start the firmware upgrade. He’ll blog about it separately, but this was a major update in all senses of the word, so it took a full 4 hours for him to get through.
- As soon as the firmware upgrades looked good, at approx 11pm, we started powering all the VMs back up.
- All the moz2 VMs are configured to autoboot back in a working state, so they automatically reconnected to the master, and started accepting queued jobs, with no human intervention at all. These all came up smoothly first time.
- The Firefox3.0 and Thunderbird2.0 machines don't come up in a fully working state, so they needed some manual work, but this was relatively quick and on only a few machines. All came up smoothly.
- We expected to be finished by 7am, but were in fact all done before 1am, 6 hours early.
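For anyone who wants to picture the "graceful shutdown" step, here is a rough sketch of the idea in Python. It is not our actual tooling and not Buildbot's API; the `Slave` class, its fields, and `drain_step` are all made-up names for illustration. The only point it shows is the rule we followed: an idle slave stops accepting jobs and gets powered off, while a busy slave is left alone until its job finishes.

```python
from dataclasses import dataclass


@dataclass
class Slave:
    """Hypothetical model of one build VM (not a real Buildbot class)."""
    name: str
    busy: bool = False        # currently running a job
    accepting: bool = True    # willing to take new jobs from the master
    powered_on: bool = True


def drain_step(slaves):
    """Run one pass of a graceful drain over the pool.

    Idle slaves stop accepting jobs and are powered off.
    Busy slaves are left untouched so their jobs can finish.
    Returns the names of slaves that are still powered on.
    """
    still_running = []
    for slave in slaves:
        if not slave.powered_on:
            continue  # already drained on an earlier pass
        if slave.busy:
            still_running.append(slave.name)
        else:
            slave.accepting = False
            slave.powered_on = False
    return still_running
```

Calling `drain_step` repeatedly until it returns an empty list mirrors what we did in the run-up to 7pm: no job is killed mid-run, so nobody sees a broken build caused by the maintenance.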
All in all, really a great evening, and great to see how RelEng and IT worked together on this – last night and also in the weeks of prep leading up to last night. Big tip of the hat to mrz, aravind, phong, catlee, nthomas and bhearsum for all their work.
That was awesome, thank you.
John.
=====
(Full disclosure: while moz2 trees remained open, we did close the trees for Firefox3.0 and Thunderbird2.0. This was an intentional decision because a) very few people are using them, and b) we wanted to focus as much as possible of our remaining resources on keeping the active code lines running as smoothly as possible.)
[…] As you may recall, on 12th May, we were able to take down 76 VMs for scheduled maintenance, without closing the tree. With that many systems offline, we had longer wait times, but everything kept working, and people could still do checkins, see builds/tests/performance results like usual. Quite impressive, really. […]