RelEng “group gathering” in the Toronto office

Last week, the entire RelEng group gathered in Toronto. Everyone in the Toronto office was amazingly welcoming to people camping out in conference rooms and on sofas; many many thanks!
Random highlights:

  • Welcoming Chris AtLee to Mozilla and to the RelEng group. He just started, and survived the first-day hazing orientation, so this seemed like a good time for us all to get together in person. He’s already wading through some long-standing items, and figuring out his way around. Expect to hear good things from Chris soon.
  • Getting the entire group together in one place at one time was great. The only time we’ve ever managed this before was in March 2008, and it was also great. In the 6 months since we last met in person, we’ve had people move countries and have babies, helped Mozilla move from cvs->hg, shipped the FF3.0 release, and shipped the FF2->FF3 major update. Needless to say, we had lots to talk about, both in scheduled meetings, and in those all-important spontaneous two-people-doodling-on-a-whiteboard “meetings” that crop up all over the place. This entire week reaffirmed to me just how valuable that face-to-face time can be.
  • Airplane on a treadmill. This came up early in the week, and continued for the entire week. Without further comment, here’s what Mythbusters had to say.
  • Discovering that Médecins Sans Frontières / Doctors Without Borders was located in the same building as Mozilla. I’ve always had great respect for the work of MSF, and somehow found it inspiring that they were even in the same building as Mozilla.
  • Learning just how important coffee & food were to the Mozilla Toronto office. Every morning started with a long discussion about the virtues of this bean roast vs that bean roast, while powering up the various caffeine dispensers (see Madhava’s photo on flickr). There is also an interesting low-tech, but effective, post-it note on the wall tracking monthly coffee consumption for the office. Hopefully, they’ll remember to discount the spike in late October from their math!
  • Half way between the hotel and the office was a “Second Cup” coffee shop. By the 2nd day, I was being recognised by name as I walked in the door. By the 4th day, they would make my drink when they saw me walking down the sidewalk towards the cafe, so it was ready by the time I reached the counter.

That’s it. Again, thanks to everyone in Toronto… next time, Rockband! 🙂

How big is BurningMan?

If you’ve never been to Burning Man, the sheer scale of it all is hard to wrap your head around. Heck, if you have been to Burning Man, it’s still hard to grasp. I’ve been looking for concrete numbers to help wrap my head around it.

  • Area: just over 5.5 square miles (up from 4 sq. miles last year; San Francisco is 46.7 sq. miles)
  • Population: 49,600 (San Francisco is 800,000)

…but I find those numbers still too big to grasp.

  • Number of porta potties: 1000

One thousand – that’s a lot of porta potties! Still too many to easily visualize, but it got me thinking. I’ve seen what a line of 20 porta potties looks like – but what would a line of 1000 porta potties look like?

A standard porta potty is 44″ wide. Standing them all immediately beside each other makes a line of porta potties 44,000″ long, or 0.6944 miles long. Huh. By comparison, look at Charleston Road – the main street nearest the Mozilla office here in Mountain View. If you placed the first porta potty at the corner of Charleston Road and Rengstorff, the continuous unbroken line of blue porta potties would stretch all the way to the corner of Charleston Road and Shoreline…blocking all side streets and office entrances along the way. See details in this map. It would take you approx 14 minutes just to walk from the start to the end of that long line of blue porta potties with white roofs… The Great Blue PortaPotty Wall of Mountain View?
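For the curious, the arithmetic above can be checked in a few lines of Python. The 44″ width and 1000 units come from the post; the ~3 mph walking pace is my assumption, chosen to match the quoted 14 minutes:

```python
# Checking the porta potty line arithmetic from the post.
# Assumptions: 44-inch-wide units (from the post), ~3 mph walking pace (mine).
INCHES_PER_MILE = 5280 * 12   # 63,360 inches per mile

units = 1000
width_inches = 44

line_inches = units * width_inches
line_miles = line_inches / INCHES_PER_MILE

walk_minutes = line_miles / 3.0 * 60  # at 3 mph

print(line_inches)            # 44000
print(round(line_miles, 4))   # 0.6944
print(round(walk_minutes))    # 14
```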

Taking this thought one step further.

Imagine if someone asked you to organize renting 1000 porta potties. Now to add to the logistics, consider how to deliver those 1000 porta potties to a remote desert location multiple hours drive from the nearest town in Nevada. Oh, and hire a large group of people who will work literally around the clock keeping them all continuously cleaned and restocked for weeks on end. Oh, and they’ll need food and lodging and large equipment too. Then bring the people and the porta potties all out there, and then bring it all back.

If all that doesn’t sound daunting to you, well… have you ever considered volunteering at Burning Man? The logistics and coordination that go into making Burning Man possible are really quite something, and more help is always welcome! Oh, and don’t worry, it’s not all about porta potties. 🙂 There are many other volunteer roles for medics, firefighters, counselors, airplane specialists for the airport, large-scale construction workers, solar power technicians, ice sales-folks, greeters, etc, etc, etc… see the full list at:


[UPDATE: fixed error in size of San Francisco city area, caused by error over “square miles vs miles squared”; thanks to skierpage for spotting the error. Also calculated area of Burning Man more precisely. John 24oct2008]

how many slaves are “enough”?

Last week, in the frenzied lead-up to the FF3.1b1 code freeze, we got into a state where there were too many changes and too many builds being queued for the available pool of slaves to keep up. Literally, new builds were being requested faster than we could generate builds. Worst hit was win32, because win32 builds take much longer than the other platforms.

Short answer: Since early summer, we’ve used double the number of slaves we had for FF3.0 – which was more than enough until last week. We’ve since added even more slaves to the pool, which cleared out the backlog and should also prevent this from happening again. At peak demand, jobs were never lost; they just got consolidated together. Having multiple changes consolidated into the same build means the overloaded machines can keep up, but makes it harder to figure out which specific change caused a regression.
Long answer:

First, make some coffee, then keep reading…

If you recall from this earlier post, we’ve moved from having one dedicated machine per build purpose, to now using a pool of shared, identical machines. Pending jobs/builds are queued up, and then allocated to the next available idle slave, without caring whether it’s an opt build, debug build, etc. Any slave can do the work. More importantly, failure of any one slave does not close the tree.
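The idea behind the pool can be sketched in a few lines. This is a toy model with hypothetical names, not the real system (which is buildbot); it just shows the "any idle slave takes the next pending job" behaviour:

```python
from collections import deque

class SlavePool:
    """Toy model of pool-of-slaves scheduling: any idle slave takes
    the next pending job, regardless of build type."""

    def __init__(self, slaves):
        self.idle = deque(slaves)   # slaves waiting for work
        self.pending = deque()      # queued jobs, oldest first

    def queue(self, job):
        self.pending.append(job)
        return self.dispatch()

    def dispatch(self):
        # Pair up idle slaves with pending jobs until one runs out.
        assigned = []
        while self.idle and self.pending:
            assigned.append((self.idle.popleft(), self.pending.popleft()))
        return assigned

    def finished(self, slave):
        # A freed slave immediately picks up any backlog.
        self.idle.append(slave)
        return self.dispatch()
```

With two slaves, a third queued job simply waits; when either slave finishes, it immediately picks up the pending job, and if one slave dies, the others keep the queue moving instead of closing the tree.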

A related topic: it’s tough to predict how many slaves will be enough for future demand. We started in early summer 2008 with twice as many build machines as we had for Firefox 3.0. That guesstimate was based on the following factors:

  • using a shared pool across 2 active code lines (mozilla-central, actionmonkey). This has since changed to 3 active code lines (mozilla-central, trace-monkey, mobile-browser), with different volumes of traffic on each.
  • assuming that the combined number of changes landing across all active code lines would be similar to what we saw in FF3.0. We didn’t have project branches back then, but we had approx the same number of developers/community landing changes at about the same rate.
  • changing from “build-continuously” to “build-on-checkin”. This greatly reduced the number of “waste” builds using up capacity. We still generate some “waste” builds (no code change, but needed to stop builds falling off tinderbox, and to keep talos slaves busy). The question is how many of these “waste” builds are really needed, and can we reduce them further?

This worked fine until last week when a lot of changes landed in the rush to FF3.1beta1, and then regressions forced a lot of back-out-and-try-again builds. Here’s a graph that might help:

Once we realised the current pool of machines was not able to keep up with demand:

  1. We added new build machines to the pool. This really helped. As each new slave was added to the production pool, it was immediately assigned one of the pending builds and started working – helping deal with the backlog of jobs. By the time we added the last slave to the production pool, there were no pending jobs, so there was nothing for it to do, and it remained idle until new builds were queued and immediately processed.
  2. On the TraceMonkey branch, we had been triggering a “waste” “nothing changed” build every 2 hours. This was done for mozilla-central to ensure that Talos machines are always testing something, and that builds don’t fall off the tinderbox waterfall. We originally set up TraceMonkey to build on the same schedule as mozilla-central, but as we didn’t have any Talos machines on TraceMonkey, we could safely reduce that frequency. In bug#458158, dbaron and nthomas increased the gap between “waste” “nothing changed” builds on the TraceMonkey branch from 2 hours to 10 hours, which is the longest gap we could have while still being frequent enough to prevent builds falling off tinderbox. They also turned off PGO for win32, which seemed fine, as there are no talos machines measuring performance on the TraceMonkey branch anyway; turning off PGO reduced the TraceMonkey build time, which meant that the slave would be freed up sooner to deal with another pending job. I tried to visualise this drop in pending jobs by the notch in the graph above.

Looking at the graph again, we don’t know if we will see “future#1” or “future#2”. We’re still unable to predict how many slaves is enough for future demand. We’re adding new project branches, hiring people, and adding tests. We’re obsoleting other project branches. Once we fix some infrastructure bugs, we can stop “waste” builds completely. Either way, the infrastructure is designed to handle this flexibility, and we have plenty of room for quick expansion if the need arises…

We don’t yet have an easy way to track how heavily loaded the pool-of-slaves is, so I have to ask for some help.

Until we get a dashboard/console working, can I ask people to watch for the following: whenever you land a change, an idle slave should detect it and start building within 2 minutes (it’s intentionally not immediate – there’s a tree stable timer of a couple of minutes). If we have enough slaves, there should be one build produced per changeset. It’s possible, but rare, that people land within 2 minutes of each other, and therefore correctly get included in the same build; more usually, each checkin would be in a different build. If you start seeing lots of changesets in the one build, and especially if you see this for a few builds in a row, it may mean that the pool-of-slaves cannot keep up, and queued jobs are being consolidated together. In that situation, please let us know by filing a bug that includes the buildID of the build, details on your changeset and the other changesets that were also there, which o.s., etc., and we’ll investigate. There are many other factors which could be at play, but it *might* be an indicator that there are not enough slaves, and if so, we’ll quickly add some more.
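The check I’m asking people to do by eye could be sketched like this. The function name, data shape, and thresholds are all hypothetical (we don’t have such a tool yet); it just flags a run of consecutive builds that each contain multiple changesets:

```python
# Hypothetical sketch: flag builds containing multiple changesets,
# which *might* indicate the pool can't keep up and jobs are consolidating.
def suspicious_builds(builds, threshold=2, streak=3):
    """builds: list of (build_id, [changeset, ...]) pairs, oldest first.
    Flag builds once `streak` consecutive builds each contain at least
    `threshold` changesets."""
    flagged, run = [], []
    for build_id, changesets in builds:
        if len(changesets) >= threshold:
            run.append(build_id)
            if len(run) >= streak:
                # Flag the whole run, avoiding duplicates as it grows.
                flagged.extend(b for b in run if b not in flagged)
        else:
            run = []  # a one-changeset build resets the streak
    return flagged
```

A single build with two changesets is probably just two people landing within the tree stable timer; only a sustained run is worth a bug report.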

Hope all that makes sense, but please ping me if there are any questions, ok?

Thanks for reading this far!

[updated 13oct2008 to fix broken english syntax.]

Welcome to Chris AtLee

Quick introduction.

Chris joined Release Engineering today, and will be based in Toronto… assuming he passes the strict acceptance tests of the Toronto office.

He’s a cool guy, and we’re delighted to have him join us. In the Toronto area, you’ll find him on the guitar, trying to prepare for Ted’s upcoming visit. On irc you’ll find him as “catlee”, which we’ve decided to pronounce “cat-lee”. 🙂

Please do stop by and say hi!

Alice now on planet…

Earlier this week, Alice started blogging, after finally being nudged out into the blogosphere. For more details on Talos than you might ever think you wanted to know, just keep reading her posts on planet! 🙂

Her most recent post sounds fairly straightforward, but the devil really is in the details here. This fix really made a huge difference to IT’s late night pager support of Talos problems… and by extension, makes life much better for all developers using Talos. In my opinion, this is as important as the fix for the 2-3 per day intermittent Talos burnage…which was solved late May.

Obviously, these make Talos more reliable for others, and drastically reduce potential downtimes. These two fixes also mean that Alice and the rest of RelEng can spend less time on maintenance, and more time on constructive development of new features, and that IT’s support life gets a little easier.

All very good news!! :-)

Now creating L10n nightlies a whole new way!

Since last Wed (01oct2008), we’ve been producing *two* sets of L10n nightly builds every night on the FF3.0 line, built in slightly different ways.

The builds produced the “new” way are at:

The builds produced the “old” way are at:

If you notice any problem with the new l10n nightly builds, please first check to see if the “old” nightly build has the same problem. If you find a problem in the new l10n nightlies that is not a problem in the old l10n nightlies, please file a bug, and we’ll get right on it.

Nothing has come up so far, in our weeks of running on staging, in testing, or even in Friday’s testday focus on these builds. If all continues to go well, we’re planning to stop producing the 1.9 nightlies the “old” way this coming Wednesday (08oct2008). We’ll probably wait a few more days before mothballing all the various custom machines that were being used for the “old” way. After that, we’ll begin similar changes on the mozilla-central / FF3.1 systems.

This is a really really huge deal, and very exciting for RelEng and for the l10n community. Details are below, but trust me, this is big. 🙂

Some background for the curious:

(Some details covered during Monday’s Weekly Update call.)
(Some details in Armen’s blog and Seth’s blog.)
(Lots more details are in this presentation from the Mozilla2008 summit.)

Every night, we do a full compile and link of the en-US version of Firefox. We then take that en-US version, open it up, replace the locale strings with the strings for a different locale, and rebundle everything back together. This is called a “nightly l10n repack”. We then repeat the process for all the other locales, on all o.s. If there are string changes during the day, we have a slightly different system that does repacks during daytime. For official releases we have yet another slightly different system that does repacks for releases.

The change we are rolling out is important for several reasons:

– Doing repacks is really slow. Each individual repack is quick… approx 1 minute. However, we now have 60+ locales (and counting). Multiply that by 3 different o.s. and it adds up to 180+ repacks. One minute per repack adds up to 3 hours for the entire set.
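The arithmetic, spelled out (using the post’s own round numbers):

```python
# Repack arithmetic from the post: ~1 minute per repack.
locales = 60           # 60+ and counting
platforms = 3          # linux, mac, win32
minutes_per_repack = 1

total_repacks = locales * platforms                    # 180
serial_hours = total_repacks * minutes_per_repack / 60.0

print(total_repacks)   # 180
print(serial_hours)    # 3.0
```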

– The “old” system treats all 180+ repacks all as one giant “block”, starting alphabetically with linux/af and ending with win32/zh-TW. With the “old” system, a locale arriving after code freeze would force us to throw out all repacks and start the entire set for 180+ repacks again. Or keep that set, and repack the late arriving locales manually, which is error prone and not that much faster. The “new” system treats all repacks as independent jobs which can be done in parallel. The consequences of this are huge:
— We can now handle a locale arriving after code freeze during a release as just another job-in-the-queue, without disrupting the existing set of locales in progress, and without needing to do the late arriving repack manually.
— We can share the 180+ repacks across the pool of available build machines, which will give us huge time gains. Instead of trying to improve overall time by trying to shave a few seconds from each repack, we’re breaking this into discrete pieces that can be tackled in parallel.
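To see the scale of that gain, here is a toy model. The pool sizes are illustrative only (I’m not quoting actual machine counts), but the shape of the result holds: with independent jobs, wall-clock time shrinks roughly linearly with the number of machines:

```python
import math

def wall_clock_minutes(jobs, machines, minutes_per_job=1):
    """Wall-clock time when independent jobs are spread evenly across
    a pool: the busiest machine sets the pace."""
    return math.ceil(jobs / machines) * minutes_per_job

print(wall_clock_minutes(180, 1))   # 180 minutes: the old serial "block"
print(wall_clock_minutes(180, 10))  # 18 minutes: same work, pooled
```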

– Now, all L10n repacks (whether for releases, nightlies or incremental during the day) will all be created the same way, running on the same pool of machines that are also producing en-US release/nightly/incremental builds. This eliminates bugs caused when a nightly l10n machine is somehow slightly differently configured than a release machine. (During the FF2.0.0.7 release, for example, we had to scramble to fix a surprise CR-LF problem on win32 l10n builds caused by exactly this.)

– Moving from using dedicated specialized machines to a pool of machines is important for reliability and uptimes. Until now, if a specific l10n machine crashed, there was no failover; it just closed the tree, and required a late night page for IT and RelEng to fix before the tree could be reopened. Now if a machine in the pool dies, repacks will continue on another available machine in the pool.

– This brings us closer to the ultimate goal here… producing updates for people using l10n nightlies. We currently produce nightly updates for users of en-US nightlies, but we’ve never produced updates for l10n nightlies… yet! Stay tuned….

(Oh, and there’s always the good hygiene of removing lots of clunky/obsolete code. We’ve got so many legacy/complex systems making it hard to figure out what’s safe to change, that any cleanup/streamlining really helps simplify other work being done by others in the group. This recurring payback is great!)

There’s a ton of cleanup work behind the scenes here that made this possible. I have to point out that Armen has been working on this, and only this, non-stop since late April. It’s been amazing having him patiently pick through the various tangled knots to figure out how to make all this happen. Also, many thanks to axel, bhearsum, coop, nthomas and seth for their help untangling folklore and various historically important systems and code weirdnesses to get to this point.

(If I’ve missed anyone over the last 5 months of this project, sorry, poke me and I’ll correct!)