John O’Duinn’s Soapbox

Thu, 08-May-08

We have *how* many machines… and whaddya mean, it’s not enough?

Filed under: Mozilla — John @ 10:27:09 PST, 08-May-08 (Thu)

While the sheer number of machines in my previous post surprised all of us, it’s more interesting to note that it’s not enough. It’s simply not enough. Even today, we’re constantly under the gun, bringing new machines online as fast as possible.

  • The 30 machines marked idle/waiting-to-mothball will all be recycled and used for blocked projects that need machines.
  • Justin’s group recently brought another VMware host online, and built out extra disk space so we have space to create 30+ new VMs. 6 new VMs are coming online this week, in addition to what’s listed in the previous blog post.
  • We’re ordering another batch of 80 mac minis, as we’ve already used up the previous batch of 50 minis, after we used the initial batch of 30 minis.

We hope it’s enough machines for a while.

Never mind the cost of all these machines. Pretend they were all free.

All of these machines need rented colo rack space, network bandwidth, electricity, a/c, humans to install and support them, humans to configure them and bring them online. As a ripple effect, the more builds we produce, the more diskspace and infrastructure we need for ftp, downloads, virus scanning, tinderbox servers, etc.

That’s just to have them come online. Then starts the human time for the constant care and feeding that each of these unique individual machines needs. For one or two machines, it’s easy. When you look at 200 machines, and then an additional 150 or so machines, it’s a no-brainer that this approach does not scale.

Wed, 07-May-08

We have *how* many machines?

Filed under: Mozilla — John @ 23:20:55 PST, 07-May-08 (Wed)

As best as I can tell, it looks like we have the following machines running on each branch:

02 machines for 1.8.0
+ 29 machines for 1.8
+ 88 machines for 1.9/trunk
+ 33 machines for moz2
====
152 machines in use today
+ 10 ref-images
+ 30 machines idle/waiting-to-mothball
====
192 machines total

1) These numbers do not include any community machines yet. We’re still working on this.
2) The 88 machines on 1.9/trunk are made up of 40 builder, 23 unittest and 25 talos machines.
3) Most of those 30 machines marked “idle/waiting-to-mothball” were only discovered during this housekeeping. Some of these now have bugs to track mothballing and being recycled… we’re still working through the list. It was interesting to find out how many people were still using machines that they thought were supported, but which we did not even know existed, or which we thought were long desupported!
4) It’s taken weeks to collate this data, and I’m still not certain we’ve identified everything. We need a central list that can be the single-source-of-truth for all these machines. Instead of doing this on various wiki pages, we’re talking with Justin, mrz and Jeremy to see if we can use the same asset tracking db they use when they install machines into the colo. That would work much better for this, but needs some customization. Stay tuned…
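
As a rough illustration of what a single list would buy us, here is a minimal Python sketch that recomputes the totals above from one table. The counts come straight from this post; the dictionary layout and the names (`IN_USE`, `OTHER`, `TRUNK_ROLES`, `tally`) are made up for illustration and are not the schema of the colo asset tracking db.

```python
# Hypothetical single list of machines, using the counts from this post.
IN_USE = {"1.8.0": 2, "1.8": 29, "1.9/trunk": 88, "moz2": 33}
OTHER = {"ref-images": 10, "idle/waiting-to-mothball": 30}

# Breakdown of the 1.9/trunk machines from note 2.
TRUNK_ROLES = {"builder": 40, "unittest": 23, "talos": 25}


def tally(in_use, other):
    """Return (machines in use, machines total)."""
    used = sum(in_use.values())
    return used, used + sum(other.values())


used, total = tally(IN_USE, OTHER)
print(f"{used} machines in use today, {total} machines total")

# Cross-check: the per-role breakdown should add up to the branch count.
print(sum(TRUNK_ROLES.values()) == IN_USE["1.9/trunk"])
```

With one list like this, the totals and the per-role breakdown can be checked against each other instead of being collated by hand from wiki pages.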

We’re still gathering more info…to be continued in another blog post.

Sun, 04-May-08

No wall-clock numbers for Thunderbird 2.0.0.14

Filed under: Mozilla — John @ 23:26:12 PST, 04-May-08 (Sun)

We used the Thunderbird 2.0.0.14 release to get Rick Tessner at MozillaMessagingCo up to speed. There are a lot of Build mechanics to take in, so it’s not fair to add extra pressure by measuring all the wall-clock times.

Rick is also working to have the existing release automation we use for Firefox used for Thunderbird too. In theory, it should just work, and initial experiments seem promising, but we’ll need a full test cycle on this before we can switch over in production. The curious can follow along in bug#427769.

Firefox 2.0.0.14 by the (wall-clock) numbers

Filed under: Mozilla — John @ 23:12:57 PST, 04-May-08 (Sun)

Mozilla released Firefox 2.0.0.14 on Wednesday 16-apr-2008, at 3:05pm PST. From “Dev says go” to “release is now available to public” was just over 12 days (12d 3h 20m) wall-clock time, of which Build&Release took just over 3.5 days (3d 14h 35m).

11:45 04apr: Dev says “go” for rc1
13:20 04apr: FF2.0.0.14 builds started
16:50 05apr: FF2.0.0.14 linux and mac builds handed to QA
03:40 07apr: FF2.0.0.14 signed-win32 builds handed to QA
10:20 07apr: FF2.0.0.14 update snippets available on betatest update channel
16:40 08apr: Dev & QA says “go” for Beta
17:00 08apr: update snippets on beta update channel
19:40 15apr: Dev & QA says “go” for Release; Build already completed final signing, bouncer entries
07:30 16apr: mirror replication started
11:15 16apr: mirror absorption good for testing to start on releasetest channel
13:10 16apr: QA completes testing releasetest.
14:20 16apr: website changes finalized and visible. Build given “go” to make updates snippets live.
14:25 16apr: update snippets available on live update channel
15:05 16apr: release announced
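
The headline wall-clock figure can be double-checked from the first and last entries of the timeline above. A minimal Python sketch, assuming every timestamp is in 2008 and in the same timezone; the helper names `parse` and `elapsed` are made up for this example:

```python
from datetime import datetime

# Only the months that appear in these release timelines.
MONTHS = {"feb": 2, "mar": 3, "apr": 4}


def parse(stamp, year=2008):
    """Turn a timeline stamp such as '11:45 04apr' into a datetime."""
    clock, day = stamp.split()
    hour, minute = map(int, clock.split(":"))
    return datetime(year, MONTHS[day[2:]], int(day[:2]), hour, minute)


def elapsed(start, end):
    """Return (days, hours, minutes) of wall-clock time between two stamps."""
    delta = parse(end) - parse(start)
    minutes = delta.seconds // 60
    return delta.days, minutes // 60, minutes % 60


# "Dev says go" to "release announced" for FF2.0.0.14.
print(elapsed("11:45 04apr", "15:05 16apr"))  # (12, 3, 20)
```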

Notes:

1) Our blow-by-blow scribbles are public, so the curious can read about it, warts and all, here. Those Build Notes also link to our tracking bug#426307.

2) While this was a firedrill release, and it went quite smoothly, it still had some non-technical delays that made the wall-clock numbers longer than usual.

  • The code fix landed mid-day Friday, and builds started at lunchtime Friday. However, the Build and QA groups explicitly did not work the weekend, after a recent series of working weekends, adding an artificial delay waiting for manual announcements and signing.
  • We decided to extend the beta period from 14apr until 16apr, to avoid possibly disrupting people’s online US tax submissions on 15apr.
  • Like before, we waited until morning to start pushing to mirrors, even though we got the formal “go” the night before. This was done so mirror absorption completed as QA were arriving in the office to start testing update channels. We did this because we wanted to reduce the time files were on the mirrors untested; in the past, overly excited people have posted the locations of the files as “released” on public forums, even though the files had not finished the last of the sanity checks. We suspect that coordinating the mirror push like this reduced that likelihood just a bit, but it feels like we should verify that. We continue to count this waiting time as “Build&Release time”, even though we are all just waiting.

3) Mirror absorption took just over 3 hours to reach all values >= 65%, a higher than usual threshold.

take care

John.

Thu, 24-Apr-08

So, how exactly do all the automated build and test systems connect together?

Filed under: Mozilla — John @ 08:51:25 PST, 24-Apr-08 (Thu)

Trying to describe how our various build, unittest and talos systems connect together is tricky. The Release Engineering group spent a week all together recently, with lots of diagrams on whiteboards, just to explain it to each other.

Trying to describe it *without* a whiteboard is even more tricky… and there’s always lots of hand waving.

Trying to describe it in a clear, concise article is…wow. Ben Hearsum and John Resig did a really nice overview here. Well worth a read, in case you missed it.

Thu, 17-Apr-08

“Software Update Channel” != “Software Distribution Channel”

Filed under: Soapbox, Mozilla — John @ 18:14:45 PST, 17-Apr-08 (Thu)

Recent blog posts by John, Asa and Matt happened as my home WinXP computer offered to “update” Safari… something I have never installed!?!

Most comments on their blogs can be paraphrased as “you’re only complaining because it’s a competing browser”… or “you’re only complaining because it somehow costs Mozilla money”.

That’s missing the point completely.

Here’s a quick non-browser example.

Suppose Microsoft Windows Automatic Updates (which delivers O.S. security fixes) suddenly also offered to download and install Microsoft’s GearsOfWar game? And defaulted to “yes”. Even if you never owned that game before. If you have your preferences set to “ask me”, then you get a chance to uncheck the checkbox, *if* you notice. But if your preferences are set to “apply automatically”, which is the default, you’ll just get GearsOfWar installed automatically.

The very first time this happens to me, I’d assume that the vendor considers “software update channel” to be the same as “software distribution channel”, and they want to sell me their other products. So, I’d turn off updates. Which, by the way, means I no longer get O.S. security fixes. If I was really annoyed, I might turn off updates for other vendors while I’m at it, so I no longer get Norton Anti-Virus updates either.

Agreeing to receive updates is agreeing to let a trusted other person quickly fix problems on my computer, before I even know it’s a problem. Sometimes it fixes bugs in software, so users don’t keep hitting problems that were fixed last year; anyone remember downloading patches for Win31? (heck, anyone remember ftp-ing downloads pre-1995?) Sometimes, the speed at which the fix is distributed is critical to protect users; anti-virus updates, browser security fixes, and O.S. security fixes are great examples of this.

If people stop trusting updates, because a few vendors abuse that trust, it’s bad for the software industry and it’s bad for users.

It’s that simple.

Sat, 12-Apr-08

Firefox 2.0.0.13 by the (wall-clock) numbers

Filed under: Mozilla — John @ 17:28:35 PST, 12-Apr-08 (Sat)

Mozilla released Firefox 2.0.0.13 on Tuesday 25-mar-2008, at 16:30 PST. From “Dev says go” to “release is now available to public” was 15.25 days (15d 5h 55m) wall-clock time, of which Build&Release took just over 2.33 days (2d 8h 10m).

10:35 10mar: Dev says “go” for rc1
14:50 11mar: FF2.0.0.13 builds started
16:55 11mar: FF2.0.0.13 linux builds handed to QA
19:00 11mar: FF2.0.0.13 mac builds handed to QA
07:10 12mar: FF2.0.0.13 signed-win32 builds handed to QA
14:40 12mar: FF2.0.0.13 update snippets available on betatest update channel
11:30 18mar: Dev & QA says “go” for Beta
12:25 18mar: update snippets on beta update channel
09:10 25mar: Dev & QA says “go” for Release; Build already completed final signing, bouncer entries
10:25 25mar: mirror replication started
11:20 25mar: mirror absorption good for testing to start on releasetest channel
14:20 25mar: QA completes testing releasetest.
15:00 25mar: website changes finalized and visible. Build given “go” to make updates snippets live.
16:00 25mar: update snippets available on live update channel
16:30 25mar: release announced

Notes:

1) This was Ben Hearsum’s first time doing a release. He works in the Release group, and he’s smart, but he’s never done a release for Mozilla. Ever. The fact that he jumped into doing this release with absolutely no advance notice, and was able to use our existing automation without needing to ask any questions at all says lots about both Ben and how things are improving.

2) From Build’s point of view, this was a fast release. We took 2 days 8 hours, which is one of our fastest releases ever. Note: between the “Dev says go to build” and “build started” was a delay of 1 day 4 hours where Build did nothing. This delay was because we were busy with 3.0beta4 and also trying to balance out some other workloads across the group. I counted this delay as part of our 2 days 8 hours, but I have to point out that if we had been ready, our total time for FF2.0.0.13 would actually have been halved; we would have only needed a totally screaming fast 1 day 4 hours.

3) For better or worse, we are making all our blow-by-blow scribbles public, so the curious can read about it, warts and all, here. Those Build Notes also link to our tracking bug#422122.

4) Like before, we waited until morning to start pushing to mirrors. This was done so mirror absorption completed as QA were arriving in the office to start testing update channels. We did this because we wanted to reduce the time files were on the mirrors untested; in the past, overly excited people have posted the locations of the files as “released” on public forums, even though the files had not finished the last of the sanity checks. Coordinating the mirror push like this reduced that likelihood just a bit.

5) Mirror absorption took 1 hour to reach all values >= 50%, slightly faster and slightly lower than our usual threshold.
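
The “halved” arithmetic in note 2 can be spelled out. A small Python sketch using the rounded figures from that note (2 days 8 hours total, 1 day 4 hours of waiting); the variable names are only for illustration:

```python
from datetime import timedelta

build_total = timedelta(days=2, hours=8)  # Build&Release time, rounded as in note 2
idle_delay = timedelta(days=1, hours=4)   # wait between "go" and builds starting

# Time Build would have needed if the builds had started right away.
active = build_total - idle_delay

print(active)                # 1 day, 4:00:00
print(active / build_total)  # 0.5
```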

take care

John.

Wed, 09-Apr-08

Firefox 3.0beta4 by the (wall-clock) numbers

Filed under: Mozilla — John @ 01:46:45 PST, 09-Apr-08 (Wed)

Mozilla released Firefox 3.0beta4 on Monday 10-mar-2008, at 17:25 PST. From “Dev says go” to “release is now available to public” was just over 7 days (7d 6h 10m) wall-clock time, of which Build&Release took just over 3 days (3d 2h 05m).

11:15 03mar: Dev says “go” for rc1
16:10 03mar: 3.0b4rc1 builds started
23:15 03mar: 3.0b4rc1 mac builds handed to QA
00:05 04mar: 3.0b4rc1 linux builds handed to QA
06:00 04mar: 3.0b4rc1 signed-win32 builds handed to QA
11:15 04mar: 3.0b4rc1 three missing linux locales were resolved and handed to QA. See bug#419771 and bug#407796 for details.
15:35 04mar: 3.0b4rc1 update snippets available on betatest update channel
08:35 07mar: 3.0b4rc1 showstopper: discovered win32 was compiled without PGO. Need to respin win32 builds. Mac and linux confirmed ok.
11:50 07mar: 3.0b4rc2 win32 builds started
00:05 08mar: 3.0b4rc2 signed-pgo-win32 builds handed to QA
14:00 08mar: 3.0b4rc2 update snippets available on betatest update channel
20:00 09mar: Dev & QA says “go” for Beta; Build already completed final signing, bouncer entries
07:00 10mar: mirror replication started
09:15 10mar: mirror absorption good for testing to start on releasetest channel
13:15 10mar: QA completes testing releasetest.
14:45 10mar: website changes finalized and visible. Build given “go” to make updates snippets live.
15:50 10mar: update snippets available on live beta update channel
17:25 10mar: QA completes testing beta channel. Release announced

Notes:

1) The Build Automation used in FF3.0b4 included a bunch of fixes landed after FF3.0b3, which helped make things smoother. Despite the respin, yet again, all the housekeeping of the last few weeks paid off.

2) For better or worse, we are making all our blow-by-blow scribbles public, so the curious can read about it, warts and all, here. Those Build Notes also link to our tracking bug#418926.

3) It took us much longer than usual to start the builds. We had been distracted on other projects during the prior week, and had not done *any* of the prerequisite setup work in advance of this release.

4) We hit bug#419771 and bug#407796 as fallout from the recent kernel updates on this machine, which delayed announcing win32 builds by a few hours.

5) In 3.0b4rc1, the win32 builds were confirmed to be compiled *without* the PGO compiler optimizer. This was a problem caused by how the new PGO compiler was being enabled in tinderbox, and was completely a Build snafu. The same changes were required to two copies of an identical config file, but we only updated one, and forgot about the other. We had to completely rebuild the win32 builds from the beginning, and verified the bits as they were being produced. Note that mac and linux builds did not have to be rebuilt, but to avoid confusion, we symlinked linux-rc1 -> linux-rc2 and mac-rc1 -> mac-rc2.

6) Like before, we waited until morning to start pushing to mirrors. This was done so mirror absorption completed as QA were arriving in the office to start testing update channels. We did this because we wanted to reduce the time files were on the mirrors untested; in the past, overly excited people have posted the locations of the files as “released” on public forums, even though the files had not finished the last of the sanity checks. Coordinating the mirror push like this reduced that likelihood just a bit.

7) Mirror absorption took 2 hours 15mins to reach all values >= 60%, slightly faster than our usual threshold.
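
The snafu in note 5 (two copies of an identical config file, only one of them updated) is the kind of thing a tiny check could catch before builds start. A minimal Python sketch, not something we actually run; the helper name `copies_match`, the file names and the `pgo` setting in the demo are all made up:

```python
import tempfile
from pathlib import Path


def copies_match(path_a, path_b):
    """True if the two files have byte-for-byte identical contents."""
    return Path(path_a).read_bytes() == Path(path_b).read_bytes()


# Demo with throwaway files standing in for the two config copies.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "config-copy-1"
    second = Path(tmp) / "config-copy-2"
    first.write_text("pgo = on\n")
    second.write_text("pgo = off\n")  # the copy that was forgotten
    print(copies_match(first, second))  # False

    second.write_text("pgo = on\n")  # both copies now updated
    print(copies_match(first, second))  # True
```

A better fix is to have only one copy of the file, but a check like this is cheap while two copies still exist.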

take care

John.

Fri, 28-Mar-08

Thunderbird 2.0.0.12 by the (wall-clock) numbers

Filed under: Mozilla — John @ 18:10:45 PST, 28-Mar-08 (Fri)

Mozilla released Thunderbird 2.0.0.12 on Tuesday 26-feb-2008, at 16:40 PST. From “Dev says go” to “release is now available to public” was just over 14 days (14d 7h 45m) wall-clock time, of which Build&Release took just over 6 days (6d 4h 20m).

08:55 12feb: Dev says “go” for rc1
13:55 12feb: 2.0.0.12rc1 builds started
20:55 13feb: 2.0.0.12rc1 linux builds handed to QA
20:55 13feb: 2.0.0.12rc1 mac builds handed to QA
08:05 14feb: 2.0.0.12rc1 signed-win32 builds handed to QA
07:30 18feb: 2.0.0.12rc1 update snippets available on betatest update channel
15:30 19feb: QA says “go to beta”.
16:10 19feb: update snippets on beta update channel
08:45 26feb: Dev & QA says “go” for Release; Build already completed final signing, bouncer entries
09:25 26feb: mirror replication started
13:25 26feb: mirror absorption good for testing to start on releasetest channel
14:20 26feb: QA completes testing releasetest.
15:30 26feb: website changes finalized and visible. Build given “go” to make updates snippets live.
16:00 26feb: update snippets available on live update channel
16:40 26feb: release announced

Notes:

1) We’re still doing Thunderbird builds manually, as we’ve not had a chance to test the Build Automation used in FF releases. It *should* work, but needs to be tested properly before we switch to using automation in production for Thunderbird. Producing Thunderbird manually explains some of the delay in producing updates above - there was a weekend in there! Now that Rick has joined MailCo, he’s starting to get up to speed, and help out. We’ll still be doing TB2.0.0.13 manually, but hope to do TB2.0.0.14 using automation. Watch this space!

2) For better or worse, we are making all our blow-by-blow scribbles public, so the curious can read about it, warts and all, here.

3) As usual, we waited until morning to start pushing to mirrors. This was done so mirror absorption completed as QA were arriving in the office to start testing update channels. We did this because we wanted to reduce the time files were on the mirrors untested; in the past, overly excited people have posted the locations of the files as “released” on public forums, even though the files had not finished the last of the sanity checks. Coordinating the mirror push like this reduced that likelihood just a bit. I’m counting that wait time as “Build time” even though that might be a little unfair to the Build team.

4) Mirror absorption took 4 hours to reach good values. A little longer than usual, unclear exactly why.

take care

John.

Fri, 21-Mar-08

Recovering from a datacenter outage…

Filed under: Mozilla — John @ 03:08:29 PST, 21-Mar-08 (Fri)

Tuesday night was going to be exciting because it was the code freeze for Firefox 3.0beta5… instead, our entire San Jose datacenter went offline at 8pm PST…a whole different type of excitement. Details in Justin’s blog, but it seems we hit a network storm caused by a faulty switch in the colo. The network problem was resolved by 9:25pm. A drive mount problem on the cvs server was repaired just after 1am.

However, the Firefox tree remained closed until approx 11am. We worked all night doing recovery work, so why did it take so long to reopen the tree?

  • Once the network problem and cvs server problem were fixed, some machines recovered and came back online automatically, but many did not. There were so many build/unittest/talos machines offline or burning that no-one felt safe reopening the tree for checkins until these were back online. Bug#423809 has details of the repair/recovery work we did on various build/unittest/talos machines.
  • Somehow, one VM got totally corrupted by the network outage, so we ended up having to recreate the VM from scratch. Details in bug#423850. Seemed strange to me that a VM could be corrupted by a network outage… [UPDATE: Since all the VMs live on network attached storage, the instant network failure was just as catastrophic as ripping the disk drive out of a running machine! Thanks to mrz for the explanation.]
  • Some unittest failures started a few hours *before* the network outage, and were not noticed. After the network outage, we brought these unittest machines back up, discovered the failures, and assumed they were caused by a code regression. However, it turned out to be a regression caused by a totally unrelated change we made to the unittest machine setup earlier in the day; not a code issue and not a network outage issue. Confirming all this took time. Ideally, once unittests started failing, no more changes would have landed. That would have made it quick & easy to find the real root cause, and would likely have resolved everything before the network outage complicated the situation.
  • The longer PGO-build times mean that, once a machine was back online, it took longer for a burning machine to generate a new build, and therefore show up as green on the tinderbox page.

While we’ve made great improvements with our automation infrastructure in the last few months, Tuesday’s outage proved how much work we still have to do towards getting machines to boot up in a clean, ready-to-use state.

(Bonus: The same auto-boot-clean-configuration work would also help us when provisioning new machines, and help IT with late night Tier1 support…)


email: john (at) oduinn (dot) com
All content on this website (c) John O'Duinn, 1998-2007