De-tangling timestamps: part 1

Alice, Rob Helmer and Nick recently fixed an important, long-standing problem with how the Build Infrastructure handed off builds to the Talos Infrastructure. Their work fixed:

  • an intermittent couple-of-times-a-day Talos outage, which had been happening ever since we started using Talos in production.
  • intermittent cases where Talos would skip over a build without testing it.

That’s enough reason to make this an important fix, but it’s also important because it makes some future timestamp cleanup work possible. For the curious, here are some background details:

  • When builds were produced by each o.s. builder, the build infrastructure copied generated builds into a specific directory.
  • When Talos machines wanted to test a build, they copied builds from that same specific build directory in order to start testing. Talos would then plot test data on the graph server using “testrun time” (the timestamp of when Talos started running the test), *not* the time when the build was created. This is an important point, and at the root of a bunch of regression-triage complexities.
  • Because new builds were copied into the *same* specific directory, each one overwrote the previous build. This meant that, when testing a build, we didn’t know when that build was actually created; all we could tell was what time the “testrun” started for that build. So long as we tested as quickly as we produced new builds, it was close enough.

…but when we ramped up the volume of builds and tests to production levels, we discovered:

  • A new build being copied into the same specific directory could collide with Talos downloading the previous build, causing Talos to fail out with an error. The next Talos attempt would work fine, but because each Talos run takes so long to complete, it would appear that Talos was burning for a couple of hours, until the next test run completed successfully. This happened intermittently a few times every day. This is now fixed.
  • Builds are generated at different speeds; linux builds quicker than win32, for example. This means that the contents of the specific directory are refreshed at different rates: the linux code in the dir almost always carries a different timestamp than the win32 code in the dir. Enabling PGO caused win32 build times to double, which made this discrepancy even worse. This is now improved, but not fully fixed.
  • In situations where builds were generated quickly enough, and tests ran slowly enough, we could see: a 1st build becomes available; Talos starts testing the 1st build; a 2nd build becomes available; a 3rd build becomes available, overwriting the 2nd build. When Talos finishes testing the 1st build, it detects and starts testing the only available build (the 3rd build), skipping over the 2nd build completely. This is now fixed. There is another, similar-sounding but unrelated bug about how Buildbot optimizes pending requests by collapsing them all together; see bug#436213.
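To make the fix concrete, here is a minimal sketch, in Python, of the dated-directory idea (the paths and function names are made up for illustration; this is not the actual build or Talos code). Each build lands in its own timestamped directory, so an incoming build can never overwrite one that Talos is still downloading, and Talos can walk the untested directories in order without skipping any:

    # A minimal sketch of the dated-directory idea, not the actual build or
    # Talos code; the paths and function names are made up for illustration.
    import os
    import shutil
    import time

    INCOMING_ROOT = "/builds/incoming"    # hypothetical drop-off area

    def publish_build(build_path, build_time=None):
        """Copy a finished build into its own timestamped directory,
        instead of overwriting the one shared directory."""
        if build_time is None:
            build_time = time.time()
        dated_dir = os.path.join(
            INCOMING_ROOT,
            time.strftime("%Y%m%d-%H%M%S", time.localtime(build_time)))
        os.makedirs(dated_dir)            # one directory per build, never reused
        shutil.copy(build_path, dated_dir)
        return dated_dir

    def untested_builds(last_tested=None):
        """List every build directory newer than the last one Talos tested,
        oldest first, so no intermediate build gets silently skipped."""
        dirs = sorted(os.listdir(INCOMING_ROOT))
        if last_tested is not None:
            dirs = [d for d in dirs if d > os.path.basename(last_tested)]
        return [os.path.join(INCOMING_ROOT, d) for d in dirs]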

That’s it for this fix. There’s still plenty more cleanup needed around how time/date is stored in different parts of the infrastructure, but this was a big, important first step.

Next steps will include:

  • fixing how Talos handles re-runs/duplicate data
  • basing the dated dir on the yet-to-be-enhanced BuildID (see the sketch below)
  • changing Talos and the graph server to use “build time”, not “testrun time”. This will greatly simplify a lot of manual regression-triage work for people.
  • simplifying the underlying code that lines up builds and test results on tinderbox/waterfall pages.
  • figuring out when is a good time to flip the switch in Talos & graph server, marking all data before a certain point as “testrun time”, and data after that point as “build time”.
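To give a rough feel for the BuildID piece, here is a small sketch. It assumes the enhanced BuildID ends up being a full date-and-time stamp, which is exactly what the BuildID bugs mentioned below are still deciding, and the helper names are hypothetical:

    # Sketch only: assumes the enhanced BuildID is a full YYYYMMDDHHMMSS
    # timestamp; the exact format is still being decided in the BuildID bugs.
    from datetime import datetime

    def build_time_from_buildid(buildid):
        """Recover the build's start time from a BuildID like '20080527134501'."""
        return datetime.strptime(buildid, "%Y%m%d%H%M%S")

    def dated_dir_for(buildid):
        """Name the dated dir directly after the build time encoded in the BuildID."""
        return build_time_from_buildid(buildid).strftime("%Y-%m-%d-%H-%M-%S")

    # The graph server would then plot against this "build time", rather than
    # whatever time Talos happened to start the test run.
    print(build_time_from_buildid("20080527134501"))   # 2008-05-27 13:45:01
    print(dated_dir_for("20080527134501"))             # 2008-05-27-13-45-01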

Anyone curious for details should read Alice’s recent post to mozilla.dev.apps.firefox and mozilla.dev.performance (“time stamps of talos performance results & finding regressions”), bug#291167, bug#417633 and bug#419487. BuildID changes are being discussed in bug#431270 and bug#431905.

It’s a tricky, complex area of the infrastructure, so hopefully all that makes sense?!!?

netapp woes, and bug#435134

Since the middle of last week, we’ve been struggling with bug#435134, a problem where a random set of build machines would all lose file/network connections at the same time. In each case, the VM would fail out with different weird errors: cvs merge conflicts even though no one had landed any changes… or system header files with corrupted contents, causing compiler errors… or compilers throwing internal errors…

The failing VMs were on different branches, running different o.s. and doing different builds. The only things that made us think these failures were related were that they were all detected within minutes of each other, and that, by pure luck of timing in the various Mozilla releases, no one had landed any code changes anywhere.

Simple reboots were not enough; in each case we had to delete the working area completely and then restart. The machines would then run successfully/green for a couple of cycles… only to fail out in other weird-yet-similar ways a few hours later. It made for a very exciting (or very annoying!?) few days for Justin, mrz, nthomas and myself; it certainly didn’t help anyone’s social plans over the long weekend here in the US.

The problem is not yet fixed, so we’ll need to do further debugging. However, now that Justin has us avoiding the likely culprit, one head on netapp-c, we have been able to keep the VMs up and building happily for 24 hours now, which is great progress.

Big tip of the hat to Justin, mrz and nthomas for all their help getting things stable before today’s go/nogo meeting for FF3.0rc2.

A close call…

Found this while catching up on the news today.

Most war footage shown in the US is very de-personalized: planes blowing up bridges, missiles blowing up buildings, and so on. Body bags and returning wounded personnel get scant coverage. If you don’t look for the details, you might think it was all a glossy action movie, where no one gets hurt, and all the actors go home for dinner once the cameras stop. By contrast, this photo shows the very real dangers on a very personal level.


(click image for original image and story on Christian Science Monitor website)

The things I noticed were:
– he is not wearing a helmet. Or body armor. Depending on the weapon shooting at him, unclear if those would help anyway, but still… a tshirt..?!?
– he is wearing a wedding ring.

We have *how* many machines? (“dedicated specialised slaves” vs “pool of identical slaves”)

On 1.9/trunk, it’s important to point out that almost all of these 88 machines need to remain up, and working perfectly, in order to keep the 1.9/trunk tinderbox tree open. If one of these machines dies, we usually have to close the tree.

This is because most of these machines are specialized, unique machines, built and assigned to do only one thing. For example, bm-xserve08 only does Firefox mac depend/nightly builds; if its hard disk dies, we don’t automatically load-balance over to another identically configured machine that’s already up and running in parallel. Instead, we close the tree and quickly try to repair that broken, specialized, unique machine, or manually build up a new machine to be as close as possible to the unique dead machine. All in a rush, so we can reopen the tree as soon as possible. Looks like this: dedicated unique slaves

Obviously, the more machines we bring online, the more vulnerable we are to routine hardware failures, network hiccups, etc. Kinda like a string of Christmas tree lights that goes dark when any one bulb burns out: the longer your string of lights, the more bulbs you have, the more chance of a single bulb burning out, and the greater the chance of the tree going dark.
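To put a made-up number on that: if each dedicated machine independently had, say, a 1% chance of failing on a given day (a figure picked purely for illustration, not measured), the chance of at least one failure, and hence a closed tree, grows quickly:

    # Back-of-the-envelope only: the 1% per-day failure rate is a made-up
    # number, purely to show how fast the odds stack up against a long chain
    # of single points of failure.
    def chance_tree_closes(num_machines, p_one_machine_fails=0.01):
        """Probability that at least one dedicated machine fails on a given day."""
        return 1 - (1 - p_one_machine_fails) ** num_machines

    for n in (1, 10, 88):
        print(n, round(chance_tree_closes(n), 2))
    # 1 -> 0.01, 10 -> 0.1, 88 -> 0.59: with 88 machines that each must stay
    # up, even rare individual failures make a closed tree more likely than not.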

When we started working on the moz2 infrastructure, the conversation went something like: “what do you want on moz2?”, “everything we have on FF3”, “ummm… everything? really?”, “yes, the full set. Oh, and we’ll need a few sets of them, for a few different active mercurial branches running concurrently”.

So, how do we scale our infrastructure and also improve reliability? One of the big changes in how we are building out the moz2 infrastructure was to *not* have specialized unique machines. Instead, we have a pool of identical slaves for each o.s., each slave equally able to handle whatever bundled work is handed to it. This has a couple of important consequences:

  • if one generic slave dies, we dynamically and automatically re-allocate its work to one of the remaining slaves in the pool. Builds would turn around more slowly, and we’d obviously start repairing the machine, but at least work would continue smoothly, and the tree would not close!
  • if we decide we want to add an additional branch, or if we feel the current number of slaves cannot handle the workload, we can simply add new identical slaves to the pool, and automatically, dynamically re-allocate the work across the enlarged pool.

Looks like this:

pool of identical slaves
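For the code-minded, here is a toy sketch of the allocation logic behind that diagram. It is nothing like the real Buildbot configuration, just the pool idea in miniature, with made-up names:

    # Toy sketch of the "pool of identical slaves" idea, not real Buildbot
    # configuration; just the allocation logic in miniature.
    from collections import deque

    class SlavePool:
        def __init__(self, slaves):
            self.idle = deque(slaves)    # every slave is interchangeable
            self.busy = {}               # slave -> job it is currently running

        def assign(self, job):
            """Hand the job to any idle slave; if none are idle the job simply
            waits in the queue: slower turnaround, but the tree stays open."""
            if not self.idle:
                return None
            slave = self.idle.popleft()
            self.busy[slave] = job
            return slave

        def slave_died(self, slave):
            """Losing a slave only shrinks the pool; return its job for re-queueing."""
            if slave in self.idle:
                self.idle.remove(slave)
            return self.busy.pop(slave, None)

        def add_slave(self, slave):
            """Adding capacity (or a new branch's worth of load) is just this."""
            self.idle.append(slave)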

Adding 88 new unique machines for each of 3-5 additional active branches would be painful to set up, and just about impossible to maintain. And we’d be *guessing* how much development work there would be in the next 18 months, and then building the infrastructure out to match. Instead of having to SWAG our needs for the next 18 months and then frantically set everything up now, this shared-pool approach allows us to grow gradually as needed. Oh, and it should be more robust. 🙂

(Many thanks to BenT for the Christmas tree lights analogy. I was saying “a chain is only as strong as the weakest link”, but BenT’s analogy offers much better possibilities for awful tree jokes.)

Some new faces in ReleaseEngineering

Belatedly, I’d like to welcome two new faces here in Release Engineering. Armen Zambrano Gasparnian (armenzg on irc) started last week, and he will be working to untangle some of our l10n infrastructure. Lukas Blakk (lsblakk on irc) started here this week, and she will be working with Robcee on the unittest automation infrastructure.

They’re already digging into various problems and, in their spare time, have even started tempting us with homemade baking (a yummy chocolate, pecan and almond cake that didn’t last 2 hours), and promises of more to come over the summer. It’s our first time having interns here in the group, so it’s quite exciting for all of us. Next time you are in Building K, do stop by upstairs and say hi.

(Oh, and I’m also glad to report that Armen seems to be enjoying the official dress code.)

We have *how* many machines… and whaddya mean, it’s not enough?

While the sheer number of machines in my previous post surprised all of us, it’s more interesting to note that it’s not enough. It’s simply not enough. Even today, we’re constantly under the gun, bringing new machines online as fast as possible.

  • The 30 machines marked idle/waiting-to-mothball will all be recycled and used for blocked projects that need machines.
  • Justin’s group recently brought another VMware host online, and built out extra disk space, so we now have room to create 30+ new VMs – 6 new VMs are coming online this week, in addition to what’s listed in the previous blog post.
  • We’re ordering another batch of 80 mac minis, as we’ve already used up the previous batch of 50 minis, after we used the initial batch of 30 minis.

We hope it’s enough machines for a while.

Never mind the cost of all these machines. Pretend they were all free.

All of these machines need rented colo rack space, network bandwidth, electricity, a/c, humans to install and support them, and humans to configure them and bring them online. In a knock-on effect, the more builds we produce, the more diskspace and infrastructure we need for ftp, downloads, virus scanning, tinderbox servers, etc.

That’s just to get them online. Then comes the human time for the constant care and feeding that each of these unique, individual machines needs. For one or two machines, it’s easy. When you look at 200 machines, and then an additional 150 or so machines, it’s a no-brainer that this approach does not scale.

We have *how* many machines?

As best as I can tell, it looks like we have the following machines running on each branch:

02 machines for 1.8.0
+ 29 machines for 1.8
+ 88 machines for 1.9/trunk
+ 33 machines for moz2
====
152 machines in use today
+ 10 ref-images
+ 30 machines idle/waiting-to-mothball
====
192 machines total

1) These numbers do not include any community machines yet. We’re still working on this.
2) The 88 machines on 1.9/trunk are made up of 40 builder, 23 unittest and 25 Talos machines.
3) Most of those 30 machines marked “idle/waiting-to-mothball” were only discovered during this housekeeping. Some of these now have bugs to track mothballing and being recycled… we’re still working through the list. It was interesting to find out how many people were still using machines that they thought were supported, but which we did not even know existed, or which we thought were long desupported!
4) It’s taken weeks to collate this data, and I’m still not certain we’ve identified everything. We need a central list that can be the single source of truth for all these machines. Instead of doing this on various wiki pages, we’re talking with Justin, mrz and Jeremy to see if we can use the same asset-tracking db they use when they install machines into the colo. That would work much better for this, but needs some customization. Stay tuned…

We’re still gathering more info…to be continued in another blog post.

No wall-clock numbers for Thunderbird 2.0.0.14

We used the Thunderbird 2.0.0.14 release to get Rick Tessner at Mozilla Messaging up to speed. There’s a lot of Build mechanics to take in, so it’s not fair to add extra pressure by measuring all the wall-clock times.

Rick is also working to have the existing release automation we use for Firefox be used for Thunderbird also. In theory, it should just work, and initial experiments seem promising, but we’ll need a full test cycle on this before we can switch over in production. The curious can follow along in bug#427769.

Firefox 2.0.0.14 by the (wall-clock) numbers

Mozilla released Firefox 2.0.0.14 on Wednesday 16-apr-2008, at 3:05pm PST. From “Dev says go” to “release is now available to the public” was just over 12 days (12d 3h 20m) wall-clock time, of which Build&Release took just over 3.5 days (3d 14h 35m).

11:45 04apr: Dev says “go” for rc1
13:20 04apr: FF2.0.0.14 builds started
16:50 05apr: FF2.0.0.14 linux and mac builds handed to QA
03:40 07apr: FF2.0.0.14 signed-win32 builds handed to QA
10:20 07apr: FF2.0.0.14 update snippets available on betatest update channel
16:40 08apr: Dev & QA says “go” for Beta
17:00 08apr: update snippets on beta update channel
19:40 15apr: Dev & QA says “go” for Release; Build already completed final signing, bouncer entries
07:30 16apr: mirror replication started
11:15 16apr: mirror absorption good for testing to start on releasetest channel
13:10 16apr: QA completes testing releasetest.
14:20 16apr: website changes finalized and visible. Build given “go” to make updates snippets live.
14:25 16apr: update snippets available on live update channel
15:05 16apr: release announced
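For anyone who wants to check the headline number, the overall wall-clock time falls straight out of the first and last entries above (a trivial sketch; the 3d 14h 35m Build&Release figure comes from summing just the Build&Release-owned chunks, which aren’t individually labelled in the timeline):

    # Quick check of the headline number, taken straight from the first and
    # last entries in the timeline above.
    from datetime import datetime

    dev_says_go = datetime(2008, 4, 4, 11, 45)     # 11:45 04apr: Dev says "go" for rc1
    announced   = datetime(2008, 4, 16, 15, 5)     # 15:05 16apr: release announced

    print(announced - dev_says_go)                 # 12 days, 3:20:00, i.e. 12d 3h 20m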

Notes:

1) Our blow-by-blow scribbles are public, so the curious can read about it, warts and all, here. Those Build Notes also link to our tracking bug#426307.

2) While this was a firedrill release, and it went quite smoothly, it still had some non-technical delays that made the wall-clock numbers longer than usual.

  • The code fix landed mid-day Friday, and builds started lunchtime Friday. However, the Build and QA groups explicitly did not work the weekend, after a recent series of working weekends; this added an artificial delay while we waited for manual announcements and signing.
  • We decided to extend the beta period from 14apr until 16apr, to avoid possibly disrupting people’s online US tax submissions on 15apr.
  • Like before, we waited until morning to start pushing to mirrors, even though we got the formal “go” the night before. This was done so that mirror absorption completed just as QA were arriving in the office to start testing update channels. We did this because we wanted to reduce the time files sat on the mirrors untested; in the past, overly excited people have posted the locations of the files as “released” on public forums, even though the last of the sanity checks had not finished. We suspect that coordinating the mirror push like this reduced that likelihood just a bit, but it feels like we should verify that. We continue to count this waiting time as “Build&Release time”, even though we are all just waiting.

3) Mirror absorption took just over 3 hours to reach all values >= 65%, a higher-than-usual threshold.

take care

John.