Auto-rebooting Talos

Since Feb 2008, we’ve:

  • fixed a Talos redness that hit us 2-3 times *every* day since Talos first started (caused by a synchronization problem between the build and Talos systems).
  • made it possible for IT to support these Talos machines as part of their oncall work (by making all the Talos machines boot cleanly into a working state).
  • simplified (at least a little) how to set up new Talos machines on a new project branch.

Building on those fixes, we’re now focusing again on reducing the variance between the different Talos machines. We’ve always tried to make the machines as identical to each other as possible (sequential serial numbers, carefully controlling what is installed, etc.), but even so, there was still a lot of variance in the test results. Last week, however, after weeks of testing, Alice and Chris AtLee made a breakthrough:

Having Talos machines cleanly reboot after completing every 5th job, and before accepting another job, means that developers never see any burning. There are still details to be worked out before we can roll this out across all Talos machines on all branches, but so far this looks really encouraging.
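To make the policy concrete, here is a rough sketch of the counting logic only. It is not the actual Talos or buildbot code: `run_jobs`, `reboot`, and `reboot_every` are hypothetical names invented for illustration, and a real machine would need to keep the job count somewhere that survives the reboot and would stop accepting work while it restarts.

```python
from typing import Callable, Iterable

# The "every 5th job" figure comes from the post; all names below are hypothetical.
REBOOT_EVERY_N_JOBS = 5


def run_jobs(jobs: Iterable[Callable[[], None]],
             reboot: Callable[[], None],
             reboot_every: int = REBOOT_EVERY_N_JOBS) -> int:
    """Run jobs in order, calling reboot() after every `reboot_every`-th job.

    The reboot is requested after a job completes and before the next one
    is started. Returns the number of reboots requested.
    """
    completed = 0
    reboots = 0
    for job in jobs:
        job()                      # run one job to completion
        completed += 1
        if completed % reboot_every == 0:
            reboot()               # clean reboot before accepting another job
            reboots += 1
    return reboots
```

Passing the reboot action in as a callable keeps the sketch easy to try out without restarting anything: a stand-in that just records the call is enough to check that the reboot lands between the 5th and 6th jobs.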

For background info, have a look at bug 463020 and the live graph on graphserver. If you have any suggestions or ideas that might help, we’d love to hear them.
