RelEng production systems go hybrid… now available on AWS

As of Friday afternoon (06jul2012), RelEng started generating a small number of production builds and try builds on Amazon Web Services.

(Terminology alert: this means Mozilla’s network of RelEng machines is now a mix of a private cloud and a public cloud, which is called a hybrid cloud.)

catlee already covered this in Mozilla’s Platform Meeting, but this multi-month project is a massively important step forward for Mozilla’s Release Engineering infrastructure as well as for all Mozilla’s developers, so it is worth calling attention to three important details:

  • Security
  • Seamless integration
  • Dynamic allocation

Security

The security of our RelEng infrastructure is obviously important to Mozilla, so we set up these Amazon-based VMs inside a Virtual Private Cloud (VPC). While it is technically possible to have the VMs inside the VPC connect directly to the external internet, we felt it was safer to prohibit any direct access between the VPC and the internet. Therefore the only connection we have to/from our VPC is over a VPN link directly into Mozilla’s existing Build Network, within Mozilla’s secured infrastructure.

If an Amazon VM needs to reach an external site for any reason, it can only do this by going from Amazon over VPN to Mozilla’s Build Network and then out through Mozilla’s firewalls. If a Mozilla person wants to access one of our Amazon VMs, they have to do this by going through Mozilla’s Build Network over the VPN link to the VPC. We designed this very restricted access to help protect these vital systems. It was reassuring to also see all the security audits that Amazon has done.


Seamless integration
We integrated Amazon’s VMs into our existing mix of VMs and physical machines in the Mozilla build network. The easiest way to see if your specific build was handled by an Amazon VM (called an “EC2 instance” in Amazon-speak) is to look at the machine name on tbpl.mozilla.org.

The only other way that you can tell we are using AWS for some of the builds is that the additional compute capacity is helping reduce wait times for our builds!


Dynamic allocation
As you can see from looking at our monthly load posts, load on our RelEng infrastructure varies over different times of the day, and over different days of the week. To handle this efficiently, we now dynamically add and remove Amazon VMs from production at any given time to match the demand at that time. We do this as follows:

  • Our automation monitors the queue for pending builds
  • If there is a backlog of pending builds in the queue, our automation dynamically starts reviving enough VMs in our Virtual Private Cloud to handle the backlog.
  • As each of these VMs comes online, it connects to a buildbot master, indicating it is idle and ready to process jobs.
  • A buildbot master assigns a pending build job to the newly available idle slave.
  • Once the build job is completed, the slave goes back to the master looking for another job.
  • If there is no backlog of pending jobs for 60 minutes, then our automation starts suspending the idle Amazon VMs. Suspending VMs like this allows us to bring VMs back into production in a few seconds to handle any new backlog, while also reducing costs during low load times. Also, note that the 60-minute threshold worked well for us in staging, but we’ll likely adjust this in the near future as we gain more experience with real-world load.
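
The decision logic in the steps above can be sketched roughly as follows. This is only an illustration, not our actual automation: the `VM` class, the `plan` function and its parameters are hypothetical names invented for this example, and the real system talks to buildbot and Amazon rather than working on in-memory lists. The only value taken from this post is the 60-minute idle threshold.

```python
from dataclasses import dataclass

# 60-minute threshold mentioned in the post; everything else here is assumed.
IDLE_TIMEOUT_SECS = 60 * 60


@dataclass
class VM:
    """Hypothetical stand-in for one Amazon VM known to the automation."""
    name: str
    state: str = "suspended"  # either "suspended" or "running"
    busy: bool = False        # True while the VM is processing a build job


def plan(pending_jobs, vms, secs_without_backlog, idle_timeout=IDLE_TIMEOUT_SECS):
    """Return (names_to_resume, names_to_suspend) for one monitoring pass."""
    idle_running = [vm for vm in vms if vm.state == "running" and not vm.busy]
    suspended = [vm for vm in vms if vm.state == "suspended"]

    if pending_jobs > 0:
        # Backlog: resume just enough suspended VMs to cover the jobs that
        # the already-idle running VMs cannot pick up.
        shortfall = max(0, pending_jobs - len(idle_running))
        return [vm.name for vm in suspended[:shortfall]], []

    if secs_without_backlog >= idle_timeout:
        # No backlog for the whole threshold: suspend the idle VMs.
        return [], [vm.name for vm in idle_running]

    # No backlog, but the threshold has not elapsed yet: change nothing.
    return [], []


if __name__ == "__main__":
    pool = [VM("a", "running", busy=True), VM("b", "running"), VM("c"), VM("d")]
    print(plan(3, pool, 0))
    print(plan(0, pool, IDLE_TIMEOUT_SECS))
```

Keeping the decision in one pure function like this makes the threshold easy to tune, which matters because we expect to adjust it once we see real-world load.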

As of today, we only let some B2G builds overflow onto AWS like this, and we continue to monitor builds and the dynamic allocation carefully. Assuming this continues to work well, we will soon let the rest of the B2G builds overflow to AWS. Next will be fennec/android builds, and then linux desktop builds. Our focus in the immediate short term will be to siphon excess load from our Mozilla build machines over to AWS, allowing us to better handle the increased number of B2G and Fennec builds being enabled in production recently. This also allows us to reimage some/all of our physical linux builders as physical win64 builders to immediately help with our win64 builder wait times. Eventually, we may start running win64 builds, and maybe even some unittests, on AWS but that needs further investigation – stay tuned!


It’s hard to overstate how important this is for us.

The increase in build types for B2G and fennec and desktop, combined with the increase in number of checkins per day, has RelEng systems continually under heavy load. We first tried using AWS in 2008, but the Amazon VMs that we were using kept being restarted, usually before the build completed; the build would automatically restart everything once revived a few seconds later, but it still blocked us from actually being able to use these in production. Some renewed experiments in summer 2011, and discussions with other companies who were doing similar investigations, looked promising, so we started work on this in full force in Feb2012.

We hope you like this, and of course, if you see any problems, let us know asap, or file a bug!

[UPDATED: fixed typos, joduinn 17jul2012]

New builds in production: Fennec-Armv6, B2G-Armv7 and B2G-desktop

In the last two weeks, the following new build types were enabled on our production infrastructure:

  • arm v6: These Fennec builds run on Arm v6 chips. Mobile developers asked for these builds because so many people still use Arm v6 phones. We generate these builds as well as continue to generate the existing NativeFennec Arm v7 builds and XULFennec builds. You can find these on tree.mozilla.org as “Android Armv6 opt”. More details in bug#723946.
  • B2G Arm v7: These boot2gecko builds run on Arm v7 chips. We generate both Opt and Debug builds. You can find these on tree.mozilla.org as “Armv7a GB opt” and “Armv7a GB debug”. More details in bug#758425.
  • B2G desktop: These boot2gecko builds are compiled specifically to run on *desktop* machines, not on boot2gecko devices. The intended users for these builds are developers / QA / localizers and community people who are helping work on B2G and can do a large portion of their work without access to physical devices. You can find these on tree.mozilla.org as “Ng” for each of win/mac/linux desktop platforms. More details in bug#744008.

Next time you see them, give thanks to armenzg and bhearsum for their speedy behind-the-scenes work to make all this happen.

John.

Infrastructure load for June 2012

  • #checkins-per-month: We had 5,194 checkins in June2012. This is on par with last month’s record of 5,246 checkins in May2012.
  • #checkins-per-day: We had consistently high load across the month, and 15-of-30 days had over 200 checkins-per-day.
  • #checkins-per-hour: The peak this month was 11.5 checkins per hour. It is worth noting that throughout the month, we sustained over 10-checkins-per-hour for 5 out of 24 hours in a day.
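
To put these numbers in perspective, here is a quick back-of-the-envelope calculation. It uses only the 5,194 checkins and the 30 days of June quoted above; the function names are made up for this illustration, and the averages are plain division, so they say nothing about how load is actually distributed across the day.

```python
def average_rates(checkins, days):
    """Return (average checkins per day, average checkins per hour)."""
    per_day = checkins / days
    per_hour = per_day / 24
    return per_day, per_hour


def minutes_between_checkins(checkins_per_hour):
    """Average gap, in minutes, between checkins at a given hourly rate."""
    return 60 / checkins_per_hour


if __name__ == "__main__":
    per_day, per_hour = average_rates(5194, 30)
    print(f"average per day:  {per_day:.1f}")
    print(f"average per hour: {per_hour:.1f}")
    print(f"gap at 10 checkins/hour: {minutes_between_checkins(10):.0f} minutes")
```

The monthly average works out to roughly 173 checkins per day, or about 7 per hour, so the hours where we sustain over 10 checkins per hour are well above the average.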

mozilla-inbound, fx-team:
mozilla-inbound continues to be heavily used as an integration branch, with 23% of all checkins, by comparison with the fx-team branch (~2% of checkins) or mozilla-central (~4% of checkins). These ratios have been fairly consistent over the last few months.

mozilla-aurora, mozilla-beta:

  • 3.8% of our total monthly checkins landed into mozilla-aurora, which is higher than previous months. I suspect this is caused by checkins for nativefennec-beta-on-aurora.
  • ~2.6% of our total monthly checkins landed into mozilla-beta. This is higher than previous months, which I suspect is related to the NativeFennec-landing-on-beta work this month.

(Standard disclaimer: I’m always glad whenever we catch a problem *before* we ship a release; it avoids us having to do a chemspill release and also we ship better code to our Firefox users in the first place.)

misc other details:

  • Pushes per day

  • Pushes by hour of day
    • It is worth noting that for 5 hours in every 24 hour day, we did over 10 checkins-per-hour. Or a checkin every 6 minutes, if that’s easier to wrap your brain around. 🙂