Improving end-to-end times

What does “end-to-end time” mean, and how are we making it better?

Basically, it's a rough measure of how long it takes a developer to find out whether their landed patch is good or not. It's the time measured from when someone lands a patch in hg until the final Talos results are posted; along the way we:

  • wait for a slave to be available
  • do the build
  • wait for a slave to be available
  • run unittests and post results
  • wait for a Talos slave to be available
  • run Talos and post results.

Some of these steps can be run concurrently, but most are run serially. Here’s a diagram which might help:

Exact times vary, depending on volume of checkins and which OS you are looking at. However, the basic structure is the same for all checkins on all OSes, for production and for TryServer:

  • t0: a patch lands into hg repo, build job queued
  • t1: next available build/unittest slave starts building
  • t2: build slave finishes, build published to ftp.m.o, talos job queued
  • t3: next available talos slave starts running talos on that build
  • t4: talos slave finishes, results appear on graphserver

During normal periods, the waiting times (t0-t1, and t2-t3) are the biggest chunks of the end-to-end time. During crunch periods before a release, these waiting times totally dwarf everything else in the end-to-end time. So we’ve been focusing our efforts on these wait times first.
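
To make that breakdown concrete, here's a minimal sketch in Python of how the end-to-end time splits into waiting vs. working segments. The timestamps and segment values are made-up illustrations of the t0-t4 points above, not real data from our systems:

```python
# Minimal sketch: split end-to-end time into the t0-t4 segments above and rank
# them by size. Timestamps are purely illustrative example values.
from datetime import datetime

t = {
    "t0_patch_landed":  datetime(2009, 4, 20, 9, 0),
    "t1_build_started": datetime(2009, 4, 20, 10, 30),
    "t2_build_done":    datetime(2009, 4, 20, 12, 0),
    "t3_talos_started": datetime(2009, 4, 20, 16, 0),
    "t4_talos_done":    datetime(2009, 4, 20, 17, 0),
}

segments = {
    "wait for build slave (t0-t1)": t["t1_build_started"] - t["t0_patch_landed"],
    "build + unittest (t1-t2)":     t["t2_build_done"] - t["t1_build_started"],
    "wait for talos slave (t2-t3)": t["t3_talos_started"] - t["t2_build_done"],
    "talos run (t3-t4)":            t["t4_talos_done"] - t["t3_talos_started"],
}

end_to_end = t["t4_talos_done"] - t["t0_patch_landed"]
for name, dur in sorted(segments.items(), key=lambda kv: kv[1], reverse=True):
    pct = 100.0 * dur.total_seconds() / end_to_end.total_seconds()
    print("%-30s %s  (%.0f%% of end-to-end)" % (name, dur, pct))
```

With these example numbers, the two waiting segments come out on top, which is exactly the pattern described above.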

  • t0-t1: To fix this, we’ve already added some slaves, which helped reduce this wait time, and are adding even more. This is being measured in the “wait time” posts to the mozilla.dev.tree-management newsgroup.
  • t2-t3: To fix this, we’re changing Talos from the “triplicates of dedicated slaves” model to a “pool of slaves” model. We don’t yet have a good way to measure this wait time, but we manually saw many jobs waiting 12+ hours in just this one step during the leadup to FF3.5beta4.
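
As an aside on measuring that t2-t3 gap: once every Talos job records a queued time and a started time, the wait-time report is just the difference between the two. A rough sketch of that calculation follows; the job records and field names here are hypothetical, not our actual logging format:

```python
# Rough sketch: measure the "waiting for a talos slave" gap, assuming we log a
# queued-time and a started-time for every talos job. Job data is made up.
from datetime import datetime, timedelta

jobs = [
    {"id": "talos-linux-001", "queued": datetime(2009, 4, 20, 12, 0),
     "started": datetime(2009, 4, 20, 12, 10)},
    {"id": "talos-win32-002", "queued": datetime(2009, 4, 20, 12, 5),
     "started": datetime(2009, 4, 21, 1, 30)},   # a 13+ hour wait
]

threshold = timedelta(minutes=15)
waits = [(j["id"], j["started"] - j["queued"]) for j in jobs]
within = sum(1 for _, w in waits if w <= threshold)

for job_id, wait in waits:
    print("%s waited %s" % (job_id, wait))
print("%d of %d jobs started within %s of being queued"
      % (within, len(waits), threshold))
```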

Once we get these two big chunks of waiting time optimized out, the end-to-end times should be much reduced. We’ll then re-calc to see where the next biggest chunks of time are in end-to-end times, and re-focus our optimization efforts there.

Hope all that makes sense?

John.

=====

Note: for the sake of diagram simplicity, I’m avoiding:

  • how long unittests take to run, and how unittests are being changed to run suites concurrently on separate slaves. See details here.
  • how multiple builds are queued after the one checkin, but based on slave availability, all jobs might/mightn't get allocated slaves at the same time. In the diagram, I've shown the 3 builds starting at the same time, because it was simpler to draw and explain!  🙂

Preparing for “Bike To Work” Day

Now back in California, and enjoying the start of summer here. Julie’s been encouraging me to dust off my bike, and start getting ready for “Bike To Work” day. Without planning it, we were both in Firefox jerseys, so I couldn’t resist stopping for this photo from the Marin headlands, looking back to the Golden Gate Bridge with San Francisco in the background.

The weather, and the views, were glorious!! A perfect, perfect day.

Talos improvements

In case you missed this, it's worth highlighting. The work catlee did in bug#468731 fixed 3 important recurring Talos problems:

1) Talos needed long downtimes/tree closures

We used to have to schedule really long downtimes anytime we touched Talos machines; even a 5-second reboot could force us to close the tree for 3+ hours. This was because, after the reboot, we had to wait for new builds to start, be produced, be detected by Talos, run through Talos, and have Talos report green before we could safely reopen the tree. Now we can just do the reboot, re-queue an existing build to Talos, run Talos, and have Talos report green before we reopen the tree. This significantly reduces how long a downtime we need when working on production Talos machines from now on.

2) Rerun Talos on same build

This same change lets us resubmit the same build to Talos. This is ideal for cases where there is a Talos failure/regression reported, and no-one knows if it is an intermittent code problem, a Talos framework problem, or a physical machine problem. Now, we can requeue the same build and see if it fails again, and if it fails on a different machine. Very very useful. No public interface for this yet, so for now we have to do this manually, on request. Please file a bug in RelEng, with details of the build you want re-run, and we’ll manually kick it off for you.

3) Talos sometimes skipped a queued build

This replaces the original code Talos used to detect whether a new build was available. That code always had race-condition bugs that caused Talos to occasionally skip over some queued builds… which was (rightly) frustrating to developers. All that code is now gone, what's left is easier to maintain, all builds are processed in the order they were queued, and the Talos systems are now slightly more integrated with the rest of the build/unittest infrastructure.
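
The ordering guarantee in (3) boils down to treating the pending builds as a plain FIFO queue, consumed by whichever slave frees up next. Here's a toy sketch of that idea; the function and names are invented for illustration, and this is not the actual buildbot/Talos code:

```python
# Illustrative only: hand queued builds to free slaves strictly in the order
# the builds were queued, so nothing can be skipped.
from collections import deque

def run_queue(queued_builds, free_slaves):
    """queued_builds: build ids, oldest first. free_slaves: idle slave names."""
    pending = deque(queued_builds)   # FIFO: the oldest build always goes first
    idle = deque(free_slaves)
    assignments = []
    while pending and idle:
        build = pending.popleft()
        slave = idle.popleft()
        assignments.append((build, slave))
    return assignments, list(pending)

done, still_pending = run_queue(["build-101", "build-102", "build-103"],
                                ["talos-r3-01", "talos-r3-02"])
print(done)           # [('build-101', 'talos-r3-01'), ('build-102', 'talos-r3-02')]
print(still_pending)  # ['build-103'] waits for the next free slave
```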

All in all, quite a big win – way to go, catlee!

barbershop terminology, and queuing theory, in Tokyo

Normally I buzz/cut my own hair, so haven't been to a barbershop in years. However, traveling for a month in Japan with only carry-on bags meant no hairclippers, so I went looking for a barbershop.

In the US and Ireland, for buzzcuts, the terminology is:
#1 (3mm length)
#2 (6mm length)
#3 (9.5mm length)

In Japan, for buzzcuts, it seems the terminology is:
#1 (1mm length)
#2 (2mm length)
#3 (3mm length)

I discovered this difference while *in* the barber's seat, and yes, thankfully I was able to sort it out in time, despite the language barrier!

The place I went to near my hotel (QBHouse) put a lot of thought into making a haircut as quick and cheap as possible. For example:

  • Each shop has a green/orange/red traffic light outside – you can see it from blocks away. Green = no wait. Orange = wait time of 5-10 mins. Red = wait time > 15 mins, so go do something else in the area and come back in a few minutes. Because of this traffic light system, they don't need much space for waiting customers, and customers also feel like it takes less time to get a haircut, so they return more frequently.
  • All haircuts are the same price, and you pay by putting money (1000 Yen ~= $10 USD, exact amount only) into a machine at the door as you come in. Tipping is not allowed. There's no cashier and they don't take credit cards, hence lower overheads.
  • Instead of washing hair (and then having to dry it afterwards), they use a retractable vacuum and sterilizing equipment. This speeds up haircut time, and also saves on plumbing costs.
  • They aim to get you seated-cut-and-out within 10 mins, so slow/complicated haircuts like bleach/dyes are simply not done, which keeps the traffic-light predictions accurate.
  • Each hair-cutting-station is cleverly designed to be very compact, and also something the barber can get totally clean in a few seconds between customers… further reducing wait times.
  • Instead of having a few large stores, they have many small stores (2-3 hair-cutting-stations, and waiting space for 3-4 people, seemed typical per store). They can then afford to have multiple smaller stores in the same area. This makes it more likely that there is a store within a few minutes' walk of you whenever you decide to get a cut. If one is busy, there's going to be another store nearby you can try instead, and because of the external traffic lights, you can tell whether it's busy by looking from a distance.
  • By being open for long hours (10am->8pm, not closed for lunch), they spread the customer load to reduce wait times even further.
  • All the focus on improving efficiency and reducing overhead means that each store can quickly be profitable, even if there are other branches nearby. Also, because each store is small, it's easy to experiment in new areas, and less painful to cut losses on unprofitable stores.
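
As a small aside, the traffic-light trick amounts to publishing a coarse wait-time estimate instead of making people guess. A toy sketch of that mapping, using the shop's green/orange/red thresholds described above (everything else here is invented):

```python
# Toy version of the QBHouse traffic light: turn an estimated wait time into a
# coarse, publishable status. Thresholds approximate the rules described above.
def traffic_light(estimated_wait_minutes):
    if estimated_wait_minutes <= 0:
        return "green"    # no wait, walk right in
    if estimated_wait_minutes < 15:
        return "orange"   # short wait, roughly 5-10 mins
    return "red"          # 15+ mins: come back later

# The estimate could come from something like pending_jobs * avg_minutes / slaves.
for wait in (0, 7, 45):
    print("estimated wait %3d mins -> %s" % (wait, traffic_light(wait)))
```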

I’ve been meaning to post this for a while, but after last week’s fun and games with wait times for queued pending jobs, I dug this up again, as the analogies seem interesting. Cheap to use. Low wait times. Streamlined setup/cleanup between jobs. Lots of small “cheap” stores make it easy to scale up, or scale down, as needed.
What do you think?

[UPDATED: fixed links to external sites that had moved, updated photo. joduinn 15nov2010.]

Infrastructure load for Mar 2009

Summary:

  • We pushed 1,016 code changes into the mercurial-based repos here in March. This translates into 10,196 build/unittest jobs, or 14.5 jobs per hour.
  • WinCE builds are being triggered for every push to mozilla-central, starting late Feb.
  • We’re not measuring load on TryServer or Talos yet. I’m looking into that, and will post more later.

Details:

As each of these pushes triggers multiple different types of build/unittest jobs, the *theoretical* total amount of work done by the pool-of-slaves in March was 10,781 jobs. For each push, we do:

  • mozilla-central: 12 jobs per push (L/M/W opt, L/M/W leaktest, L/M/W unittest, linux64 opt, linux-arm, WinCE)
  • mozilla-1.9.1: 10 jobs per push (L/M/W opt, L/M/W leaktest, L/M/W unittest, linux64 opt)
  • tracemonkey: 7 jobs per push (L/M/W opt, L/M/W unittest, linux64 opt)
  • theoretical total: (585 x 12) + (248 x 10) + (183 x 7) = 10,781 jobs per month = 14.5 jobs per hour.
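
The same arithmetic, spelled out in a few lines of Python (push counts and jobs-per-push are the figures listed above; the jobs-per-hour figure assumes a 31-day month):

```python
# Recompute the March 2009 totals from the per-repo push counts and the
# jobs-per-push figures listed above.
pushes_per_repo = {"mozilla-central": 585, "mozilla-1.9.1": 248, "tracemonkey": 183}
jobs_per_push   = {"mozilla-central": 12,  "mozilla-1.9.1": 10,  "tracemonkey": 7}

total_pushes = sum(pushes_per_repo.values())
total_jobs = sum(pushes_per_repo[r] * jobs_per_push[r] for r in pushes_per_repo)
hours_in_month = 31 * 24

print("pushes:", total_pushes)                                    # 1016
print("theoretical jobs:", total_jobs)                            # 10781
print("jobs/hour: %.1f" % (total_jobs / float(hours_in_month)))   # ~14.5
```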

…and added more build/unittest slaves to TryServer

The main production pool-of-slaves for build/unittest is holding up today (just about). However, we’re still catching up on TryServer. To help, we added the following slaves to TryServer today:

  • 4 new linux slaves
  • 4 new mac slaves
  • 5 loaner mac slaves from production pool-of-slaves.

At this time, the TryServer build/unittest slaves are all caught up for win32 and linux, but we've still got pending mac jobs (14 builds, plus another 14 builds+unittests). Try Talos is still swamped, and will be for a while; it's much harder to add additional Talos slaves there, and there's still a big backlog of pending jobs (32 windows, 28 linux, 8 mac).
To add to the fun, we hit a new problem last night with the EqualLogic disk arrays throwing errors under load. Until we get this sorted out, we don't have disk space to create additional slaves. The curious can follow along in this bug.

16 new slaves – a drop in the pool

During Monday+Tuesday morning, we added 12 new slaves to the main production pool-of-slaves. This was in addition to the 4 new slaves we added to tryserver pool-of-slaves last week. Despite adding those 16 slaves, we were totally unable to keep up with the volume of incoming jobs today. This was a problem for Build/Unittest/Talos production servers and TryServers.

There’s no nice way to say this – the backlog of incoming jobs today was ugly.

I don’t know if it was because it was a Tuesday (typically a busy day), made worse by people surging back from Easter vacations… or because of a rush of checkins in the lead-up to FF3.5b4. All I know is that, despite the new extra machines, we were totally overrun today – we simply did not have enough machines to keep up with the sheer number of incoming build/unittest/talos jobs.

  1. Sorry. I know it's frustrating (to put it very mildly).
  2. Believe me, we are working flat out to ramp up capacity to handle this. The 4 new ESX hosts that IT+RelEng installed last week, along with extra disk space, gave us much-needed extra capacity to set up more slaves.
  3. Another 8 slaves are still finishing their move from staging into tryserver production tonight. Given how today went, we’re already working on bringing up another set of slaves as quickly as possible.

Please hang in there.
John.

Followup to yesterday’s Mozilla Foundation call (or “Please file bugs”!)

In the last two weeks, we’ve hit problems where critical servers were down, the people who relied on them were (rightly) frustrated waiting for them to come back online, and at the same time, the people who should have been repairing them didn’t know there was a problem, or thought someone else was working on it.

Some recent examples are:

* Jonauth’s dev dashboard was blocked from accessing graph server, after causing graph server to crash. Some people thought bug#485928 was tracking the issue, but the one sentence in comment#9 was lost in the noise of the rest of the bug. Fixed within hours of filing bug#486662.
* TryServer not displaying builds on the waterfall. This happened about the same time as the iscsi outage. Interestingly, TryServer was actually processing jobs fine, but developers had no way to see this. Bug#485380 got lost in the weeds. Fixed within hours of filing bug#485869.
* TryServer builds being queued for >24hours. No bug filed originally. Fixed within hours of filing bug#485869.

In each case, once an explicit bug was filed, the problem was fixed within a few hours.

Obviously, automated monitoring of all critical systems would be ideal, and we continue to get more and more systems under the watchful eye of nagios all the time. However, in the meanwhile, if you see a critical system having problems, please file a blocker bug describing exactly what you see is broken. Don't worry about debugging where the root cause is, or whether you can work around it; the important thing is to make sure the people who can fix it know about it. If you can't quickly/easily find an existing bug focused on just that problem, file a bug. If we already know about it, we will happily DUP it and make sure it's being worked on with the right priority. If we *didn't* already know about it, we'll make sure the right folks in RelEng or IT jump on it right away!

Please, don’t be shy about filing bugs… and yes, you can quote me on that! 😉

John.

Power off and recycle the last of the old Firefox2 machines

After the FF2 EOL on 17dec2008, and my earlier posts here and here, we’re finally getting ready to power off and recycle the FF2 machines listed below. The actual work is being tracked in bug#487235.

While these machines could be restored from tape backup if needed, doing that is non-trivial, so should only be considered as last resort.

What will change:

  • no longer produce FF2.0.0.x incremental/depend/hourly builds
  • no longer produce FF2.0.0.x clobber/nightly builds
  • no longer produce FF2.0.0.x release builds
  • remove the FF2.0.0.x waterfall page on tinderbox, as it will be empty.

What will *not* change:

  • FF2.0.0.20 builds would still be available for download from http://www.mozilla.com/en-US/firefox/all-older.html.
  • existing update offers would still be available. For example:
    • FF2.0.0.14 users who do check for update will still get updated to FF2.0.0.20.
    • FF2.0.0.20 users who do check for update will still get updated to FF3.0.5.
  • newly revised major update offers, like from FF2.0.0.20 -> FF3.0.9, could still be produced as needed (because these are produced on the FF3.0.x infrastructure, not on the powered off FF2 infrastructure.)
  • Thunderbird 2.0.0.x uses 9 other machines for doing builds, etc. These are not being touched, and will continue to be supported as usual until EOL after Thunderbird 3.0 ships.

Why do this:

  • reuse some of these machines in the production pool-of-slaves or try pool-of-slaves, where there is more demand
  • reduce manual support workload for RelEng and IT
  • allow us to speed up making changes to infrastructure code, as there's no longer a need to special-case and retest FF2-specific situations.

What machines are we talking about:
balsa-18branch
bm-xserve03
bm-xserve04
production-pacifica
production-pacifica-vm02
production-prometheus-vm
production-prometheus-vm02
staging-crazyhorse
staging-pacifica-vm
staging-pacifica-vm02
staging-patrocles
staging-prometheus-vm
staging-prometheus-vm02

================