Postmortem of tree closure

All trees were closed from Friday morning until mid-day Monday. It’s all been fixed since just before noon Monday PDT, so if you don’t care about the details, you can stop reading now.

Details for the curious:

  • There was a scheduled IT downtime Thursday night to change VPN
    configurations between 650castro and MPT. This was intended to fix the 8-hour disconnect problem in bug#555794.
  • This change caused more frequent (every 3–5 minutes) VPN disconnects between masters in MPT and slaves in 650castro. These disconnects interrupted builds in progress, causing burning builds starting late Thursday night.
  • During Friday, several fixes were tried without success; each took time to implement and then to verify whether connections were working. By Friday night, IT started reverting to the old VPN configuration. The revert completed at approximately 11am PDT Saturday morning. However, we were still seeing VPN disconnects and build failures.
  • Over the weekend, RelEng tried two changes to reduce WAN traffic:
    • Set up a new master in Castro so that machines wouldn’t have to traverse the VPN for most test jobs. Migrating machines to the new master took until Monday morning. This makes our WAN traffic more efficient, so we’ll leave it in place anyway.
    • Disconnected fast build machines in Castro from masters in MPT. Disconnecting these machines significantly reduced our build capacity, but it was worth trying to see if we could reopen the tree with anything. It made no difference, and once the WAN connection was fixed, we reverted it.
  • RelEng and IT met Monday morning; Derek changed router configurations so that no VPN was needed for communication between build machines in 650castro and MPT. This router reconfiguration bypassed the VPN entirely by wiring part of the WAN circuit directly into the build networks at 650castro and MPT, and the random disconnects are gone.
  • This has been holding since mid-day Monday, and patches have been landing as fast as sheriffs could coordinate with developers.
  • Not totally out of the woods yet; we’ve seen two cache corruption problems that we didn’t see before. These cause tests to fail because they grab the wrong file across the WAN. Details are being tracked in bug#555794.
  • IT is still figuring out what actually went wrong with the VPN/router changes. At this point, we know that the VPN performance problems were related to an IPSec interoperability problem between Juniper and Cisco hardware. Juniper acknowledged the possibility of a bug on Monday. More info will land in bug#555794 as we find out.
  • Unrelated, but adding to the confusion: other people in RelEng were bringing new machines into production as part of bug#557294, which complicated debugging on Friday night and Saturday morning.
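As an aside, the "every 3–5 minutes" flapping above was the key symptom that distinguished this from the original 8-hour disconnect problem. Here is a minimal sketch of how one might count link flaps like that; the `probe` callable is a hypothetical stand-in for whatever actual VPN health check you have (a TCP connect, a ping, a master heartbeat), not a description of our tooling:

```python
import time

def count_flaps(probe, samples, interval=0.0):
    """Count up->down transitions over a series of link probes.

    `probe` is any zero-argument callable returning True while the
    link is up; each transition from up to down counts as one flap.
    """
    flaps = 0
    was_up = True
    for _ in range(samples):
        up = probe()
        if was_up and not up:
            flaps += 1
        was_up = up
        if interval:
            time.sleep(interval)
    return flaps

# Simulated link that drops every few probes, roughly like the
# every-few-minutes VPN flapping described above.
flaky = iter([True, True, False, True, True, False, True])
print(count_flaps(lambda: next(flaky), 7))  # 2 flaps
```

Run against real probes at a fixed interval, a counter like this gives you a flap rate you can compare before and after each configuration change, instead of waiting for builds to burn.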
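On the cache corruption point: the failure mode is a test silently receiving the wrong file across the WAN. A minimal illustration of the kind of integrity check that turns that into a loud, early failure is to compare digests on fetch; the function name and sample data here are hypothetical, not our actual build tooling:

```python
import hashlib

def artifact_ok(data: bytes, expected_sha256: str) -> bool:
    """Compare a fetched file's SHA-256 digest against the expected one,
    so a wrong or corrupted file from a cache fails immediately instead
    of producing a mysterious test failure later."""
    return hashlib.sha256(data).hexdigest() == expected_sha256

good = b"example-build-artifact"
digest = hashlib.sha256(good).hexdigest()
print(artifact_ok(good, digest))             # True
print(artifact_ok(b"stale cache", digest))   # False
```

Checking the digest at the point of download pins the blame on the transfer or cache layer, which is exactly the ambiguity we hit while debugging over the weekend.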

Hope all that makes sense. It’s been a rough few days for RelEng and IT, so thanks for the patience. Let me know if you have any questions.
