In case you missed this, it's worth highlighting. The work catlee did in bug#468731 fixed three important recurring Talos problems:
1) Talos needed long downtimes/tree closures
We used to have to schedule really long downtimes any time we touched Talos machines; even a 5-second reboot could force us to close the tree for 3+ hours. This was because, after the reboot, we had to wait for new builds to start, be produced, and be detected by Talos, and then for Talos to run and report green, before we could safely reopen the tree. Now we can just do the reboot, re-queue an existing build to Talos, wait for Talos to run and report green, and then reopen the tree. This significantly reduces the downtime we need when working on production Talos machines from now on.
2) Rerun Talos on same build
This same change lets us resubmit the same build to Talos. This is ideal for cases where a Talos failure/regression is reported and no one knows whether it is an intermittent code problem, a Talos framework problem, or a physical machine problem. Now we can requeue the same build and see whether it fails again, and whether it fails on a different machine. Very, very useful. There is no public interface for this yet, so for now we have to do this manually, on request. Please file a bug in RelEng with details of the build you want re-run, and we’ll manually kick it off for you.
3) Talos sometimes skipped a queued build
This change replaces the original code Talos used to detect whether a new build was available. That code always had race-condition bugs that caused Talos to occasionally skip over some builds, which was (rightly) frustrating to developers. All that code is now gone. What's left is easier to maintain, all builds are processed in the order they were queued, and the Talos systems are now slightly more integrated with the rest of the build/unittest infrastructure.
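To make the difference concrete, here is a toy sketch in Python. It is not the actual Talos or buildbot code. The function names (`poll_latest`, `drain_queue`) and the build names are made up for illustration, and the polling model is only an assumption about how a "detect the newest build" approach can miss entries. It shows why polling can skip a build that lands between two polls, while an explicit queue processes every build in order and makes re-running a build as simple as queuing it again.

```python
from collections import deque


def poll_latest(published, poll_times):
    """Hypothetical polling model: at each poll, test only the newest build."""
    tested = []
    for t in poll_times:
        # Builds that have been published by the time of this poll.
        available = [name for (ts, name) in sorted(published) if ts <= t]
        # Only the newest build is considered; anything older is never tested.
        if available and (not tested or tested[-1] != available[-1]):
            tested.append(available[-1])
    return tested


def drain_queue(published, requeue=()):
    """Hypothetical queue model: every build is tested in the order queued."""
    queue = deque(name for (_, name) in sorted(published))
    # Re-running a build is just adding it to the queue again.
    queue.extend(requeue)
    tested = []
    while queue:
        tested.append(queue.popleft())
    return tested


# (publish time, build name); build-b lands between the two polls.
builds = [(1, "build-a"), (2, "build-b"), (3, "build-c")]

print(poll_latest(builds, poll_times=[1, 3]))
# ['build-a', 'build-c']  (build-b was skipped)

print(drain_queue(builds, requeue=["build-b"]))
# ['build-a', 'build-b', 'build-c', 'build-b']  (nothing skipped, build-b re-run)
```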
All in all, quite a big win – way to go, catlee!