Improving end-to-end times

What does “end-to-end time” mean, and how are we making it better?

Basically, it's a rough measure of how long it takes a developer to find out whether their landed patch is good or not. It's the time measured from:

  • when someone lands a patch in hg
  • to wait for a slave to be available
  • to do a build
  • to wait for a slave to be available
  • to run unittests and post results
  • to wait for a talos slave to be available
  • to run talos and post results
  • to stop.

Some of these steps can be run concurrently, but most are run serially. Here’s a diagram which might help:

Exact times vary depending on the volume of checkins and which OS you are looking at. However, the basic structure is the same for all checkins on all OSes, for production and for TryServer:

  • t0: a patch lands into hg repo, build job queued
  • t1: next available build/unittest slave starts building
  • t2: build slave finishes, build published to ftp.m.o, talos job queued
  • t3: next available talos slave starts running talos on that build
  • t4: talos slave finishes, results appear on graphserver
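
To make the arithmetic concrete, here's a minimal sketch of how those five timestamps turn into the wait times and the end-to-end time. This is just an illustration, not our actual reporting code, and the timestamp values are made up:

```python
from datetime import datetime

# Hypothetical timestamps for one checkin on one platform (values invented).
t0 = datetime(2009, 4, 20, 10, 0)   # patch lands in hg, build job queued
t1 = datetime(2009, 4, 20, 11, 30)  # build/unittest slave starts building
t2 = datetime(2009, 4, 20, 13, 0)   # build published to ftp.m.o, talos job queued
t3 = datetime(2009, 4, 20, 16, 0)   # talos slave starts running talos
t4 = datetime(2009, 4, 20, 17, 0)   # talos results appear on graphserver

build_wait = t1 - t0   # time spent waiting for a build slave
talos_wait = t3 - t2   # time spent waiting for a talos slave
end_to_end = t4 - t0   # what the developer actually experiences

print("waiting for build slave:", build_wait)
print("waiting for talos slave:", talos_wait)
print("total end-to-end time:  ", end_to_end)
```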

During normal periods, the waiting times (t0-t1 and t2-t3) are the biggest chunks of the end-to-end time. During crunch periods before a release, these waiting times totally dwarf everything else. So we’ve been focusing our efforts on these wait times first.

  • t0-t1: To fix this, we’ve already added some slaves, which helped reduce this wait time, and are adding even more. This is being measured in the “wait time” posts to the mozilla.dev.tree-management newsgroup.
  • t2-t3: To fix this, we’re changing Talos from the “triplicates of dedicated slaves” model to a “pool of slaves” model (the toy simulation below sketches why pooling helps). We don’t yet have a good way to measure this wait time, but during the lead-up to FF3.5beta4 we manually saw many jobs waiting 12+ hours in just this one step.
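
To see why pooling should help, here's a toy simulation. It is purely my own illustration, not the real scheduler: the same nine slaves handle a day's worth of talos jobs, first hard-wired three-per-platform, then as one shared pool. In most random runs the shared pool shows a noticeably lower average wait, because an idle slave can always pick up whichever platform happens to be backed up:

```python
import random

def simulate(jobs, slaves_for):
    """Greedy scheduler: each job (arrival, platform, duration) runs on the
    earliest-free slave it is allowed to use; free_at[s] is when slave s
    next becomes idle.  Returns the average wait (start time - arrival)."""
    free_at = {}
    waits = []
    for arrival, platform, duration in sorted(jobs):
        allowed = slaves_for(platform)
        slave = min(allowed, key=lambda s: free_at.get(s, 0.0))
        start = max(arrival, free_at.get(slave, 0.0))
        free_at[slave] = start + duration
        waits.append(start - arrival)
    return sum(waits) / len(waits)

random.seed(1)
platforms = ["linux", "mac", "win32"]
# 60 talos jobs arriving at random over a 24-hour day, ~2 hours each.
# The per-platform load fluctuates, which is what makes dedicated slaves queue up.
jobs = [(random.uniform(0, 24), random.choice(platforms), 2.0) for _ in range(60)]

# "Triplicates of dedicated slaves": 3 slaves hard-wired to each platform.
dedicated_wait = simulate(jobs, lambda p: [f"{p}-{i}" for i in range(3)])
# "Pool of slaves": the same 9 slaves, but any one of them can take any job.
pooled_wait = simulate(jobs, lambda p: [f"pool-{i}" for i in range(9)])

print(f"average wait, dedicated triplicates: {dedicated_wait:.1f} hours")
print(f"average wait, shared pool:           {pooled_wait:.1f} hours")
```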

Once we optimize away these two big chunks of waiting time, the end-to-end times should be much reduced. We’ll then re-measure to see where the next biggest chunks of end-to-end time are, and refocus our optimization efforts there.

Hope all that makes sense?

John.

=====

Note: for the sake of diagram simplicity, I’m leaving out:

  • how long unittests take to run, and how unittests are being changed to run suites concurrently on separate slaves. See details here.
  • how multiple builds are queued after the one checkin, but, depending on slave availability, the jobs might or might not all get allocated slaves at the same time (see the toy illustration after this note). In the diagram, I’ve shown the 3 builds starting at the same time, because it was simpler to draw and explain!  🙂
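
Here's a tiny illustration of that last point (the backlog numbers are invented): one checkin queues a build per platform, but each build only starts when a slave for that platform is actually free, so the three builds can start at quite different times:

```python
# Toy numbers, in hours after the checkin; the per-platform backlog is made up.
checkin = 0.0
next_free = {"linux": 0.2, "mac": 1.5, "win32": 4.0}  # when a slave next frees up

for platform, free_at in next_free.items():
    start = max(checkin, free_at)
    print(f"{platform:6} build: queued at t={checkin:.1f}h, starts at t={start:.1f}h")
```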

One thought on “Improving end-to-end times”

  1. I don’t really understand from the above what the debug build is used for. Wouldn’t you gain if you had only one build (reducing the build load, and so reducing the time until a build slave is available to do the work)?
