Recalibrating Talos performance results on newer hardware

If you don’t care about Talos performance results, or Talos hardware, stop reading now!

In the last dev platform meeting, I mentioned we’d be doing a lot of Talos recalibration work in early January 2010, and promised updates as we figured out the plan. We’re still working out some details, but here’s what we know so far.

Talos runs performance tests on Mac mini hardware. The 1.83GHz minis we use today were all spec’d 2+ years ago, when we migrated from the previous 1.66GHz minis. During 2009, we kept buying and powering up more 1.83GHz minis to keep up with increasing demand from:

  • having more developers doing more frequent checkins
  • adding more talos suites to run per checkin and
  • running Talos on additional new OS.

However, these 1.83GHz minis have long since been discontinued by Apple, so we can’t buy any more. The “trade-in” program we ran in November got us another ~42 minis, bringing us up to 159 minis – barely enough to support the extra load needed through the FF3.6 release cycle.

At this point, our only option is to buy a whole new, larger set of newer spec minis, and recalibrate Talos performance results on them. The new spec minis are 2.26GHz, with 2GB RAM and 160GB disk. We’ll recalibrate the Talos results by running the new minis concurrently with the old minis for a week or two. After we verify there are no inconsistencies in the results, we can then power off the old Talos minis, and recycle them for use elsewhere.
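One way to picture the side-by-side calibration is as a per-suite scale factor between old and new hardware, plus a check that neither set of results is too noisy to trust. The sketch below is only an illustration, not the actual Talos tooling: the function name `calibrate`, the sample timings, and the 5% spread threshold are all assumptions made up for this example.

```python
from statistics import mean, pstdev

def calibrate(old_runs, new_runs, max_spread=0.05):
    """Compare one suite's results on old vs new hardware (hypothetical helper).

    Returns (scale, consistent), where scale = mean(new) / mean(old) and
    consistent is True when the relative standard deviation of both sets
    of runs is within max_spread (an assumed threshold).
    """
    old_mean = mean(old_runs)
    new_mean = mean(new_runs)
    scale = new_mean / old_mean
    old_spread = pstdev(old_runs) / old_mean
    new_spread = pstdev(new_runs) / new_mean
    consistent = old_spread <= max_spread and new_spread <= max_spread
    return scale, consistent

# Made-up page-load timings in milliseconds, for illustration only.
old_mini_runs = [400.0, 410.0, 390.0, 400.0]
new_mini_runs = [300.0, 305.0, 295.0, 300.0]

scale, consistent = calibrate(old_mini_runs, new_mini_runs)
print(f"scale factor: {scale:.2f}, consistent: {consistent}")
```

With a factor like this per suite, a historical result from an old mini could be roughly translated onto the new hardware's scale, though a real recalibration would want far more runs than this.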

To avoid possible confusion about performance results, this is being done soon after FF3.6 ships and before FF3.7 work ramps up. In the meantime, we’re now starting the behind-the-scenes unboxing, racking, networking and configuration work.

This means a few things:

  • changing Talos hardware means re-calibrating all Talos results, including results for a list of selected important historical milestones. We’re still figuring that list out.
  • it’s important that all Talos machines, across all branches and all OS, are identical, so developers can do like-with-like comparison of results across the board. This means we need to replace all the minis at one time.
  • for OSX, we’ll have to change from using OSX 10.5.2 to using OSX 10.5.8. This is because 10.5.2 does not work on the new minis. This OS change *might* modify performance results so we held off doing it until this time.
  • these new minis will be able to support OSX 10.6 when we add 10.6 support to Talos.
  • we’re working with IT to figure out how to rack, power and network these machines here in 650castro, as the MPT colo is already full.
  • we’re also using this time to move these machines from the QA network (where Talos was originally created) over to the Build network along with the rest of the build and unittest systems. We’ll also bring the logins, passwords, etc. in sync to make them all easier to maintain.

Obviously, there are lots of details in this project to be careful of. If you think we’ve missed something please let us know. I’m cross-posting this to a couple of newsgroups to make sure this important upcoming change is noticed. However, please post followups in dev.planning or in bug#537065.

Thanks for reading this far, and take care
John.

Complicity by Iain Banks

This is another murder-mystery story based in Scotland. While this book started off in a similar vein to The Crow Road, it turned out to be darker and more graphically violent. Part of the story was written from the viewpoint of a serial killer, in a very convincing manner. Part was written from the “normal” world of people reacting to the police investigation about the murders while going about their lives, and I found that equally convincing.

Overall, I found the book a lot more disturbing. At the same time, I also found it impossible to put down the book, and I *had* to finish it. Still not sure how to rate it, but obviously Iain Banks is able to spin a very compelling story.

Mozilla the mighty micro multinational?

(Maybe this is old news to everyone else, but I stumbled across this article recently, and found it quite interesting.)
http://money.cnn.com/magazines/business2/business2_archive/2006/07/01/8380230/index.htm

The term “mighty micro multinational” was new to me, but after reading this article from 2006, I think the description fits Mozilla fairly well. Basically, as I understand it, the writer makes the distinction between:

  • traditional companies which are based in one location until forced to react to change. Examples include ExxonMobil and IBM. This change can be because of acquisition (buying a company in a different city/state/country), offshoring (moving “less critical” work to cheaper locations, while keeping “critical” work in HQ), real estate (outgrowing the physical HQ, and not being able to find adjacent office space in the neighborhood), or a whole range of other external factors.
  • a “new” category of companies which are multi-location, and multi-national, from their very inception. Literally they are location agnostic. And they do this by choice. Examples include Skype, MySQL and VistaPrint; interestingly enough, Mozilla was not mentioned in the article at all.

The difference between these two types of companies is important in internal coordination, communication, attitudes to different cultures, and most importantly hiring. Being location agnostic means being able to hire “the best person for the job”, not just “the best person for the job who lives nearby, or is willing to relocate for the job”. Quote: “…it doesn’t matter much where [the developer] is physically as long as he has a broadband connection.” To handle the extra complexity, the company is held together by having people who are naturally able to bridge the different cultures, keep all the communication flowing across time zones, and are skilled in detangling communication snafus whenever they arise. He calls that role “the magic expatriate”, and as far as I can tell, Mozilla has an amazing collection of magic expatriates!

Software engineers scattered around the world, each with a laptop, VOIP headset and IM/IRC, sounds slightly futuristic in the business-focused article, yet it’s been normal life in Mozilla for years now. And it’s very cool.

(As an exercise to the reader, just try saying “Mozilla the mighty micro multinational” 3 times in a row quickly!)

BrownBag introduction to RelEng

The development environment at Mozilla is fairly complex. As a new Mozilla developer, it can be tricky just learning where to land code changes, and figuring out if your changes landed ok and worked as you hoped! If you think back to your first day contributing code changes at Mozilla – is there anything you wish was explained to you on the first day, which would have helped you get up to speed quicker and made your life easier?

Until recently we’ve been relying on MDC, some scattered wiki docs, and lots of word-of-mouth – seasoned veterans mentoring newcomers until the newcomers can, in turn, mentor other newcomers. But this is tough to scale as more and more people start contributing. Also, as we modify, streamline and scale up RelEng infrastructure, we occasionally see veterans teach newcomers how things *used* to be, not how things are *now* !

Here’s a brown bag that we hope will help with all that.

We did a trial run of this during the AllHands, with some pre-existing Mozilla developers, and tweaked it a bit based on comments. Once we’ve been through a few iterations, we’ll put together a video clip, but for now you’ll have to imagine my voice doing the voiceover. Now it’s time to find some real newcomers and see if it makes sense… so, what do you think?

update: we’ve done this brownbag a few times now since the AllHands, and each time, we tweak it further based on comments and questions asked. There might still be further changes, but at this point, it was worthwhile updating the PDF in this blogpost. joduinn 15feb2010, again 14jun2010 and again 17dec2010.

Infrastructure load for November 2009

Summary:

  • Overall, total numbers are approximately the same as last month. However, it is interesting to see fewer checkins on mozilla-central, but a significant increase in mozilla-191 and mozilla-192 checkins. I suspect this is caused by increased focus on security releases and the upcoming FF3.6 release. The increased load on try is probably also related to this.
  • The numbers for this month are:
    • 1,675 code changes to our mercurial-based repos, which triggered:
    • 20,029 build jobs, or ~84 jobs per hour.
    • 40,421 unittest jobs, or ~56 jobs per hour.
    • 41,467 talos jobs, or ~58 talos jobs per hour.
  • 18th November was our third busiest day of the year, with 103 checkins. For comparison, this high level was only exceeded on 22nd Sept (116 checkins) and 20th May (108 checkins).
  • We enabled new Talos suites, and also disabled some unittest suites at different times during the month. For simplicity, I’ve ignored those changes for now, and will include in next month’s data.
  • We are still not tracking any l10n repacks, nightly builds, release builds or any “idle-timer” builds.
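As a rough cross-check of the per-hour figures in the list above, the arithmetic can be sketched as follows. This assumes the rates are simple averages over all 720 hours of a 30-day November; that averaging basis is my assumption, and the variable names are made up for this example.

```python
# Assumption: rates are averaged over every hour of a 30-day November.
HOURS_IN_NOVEMBER = 30 * 24  # 720

# Job totals as quoted in the summary above.
jobs = {"build": 20029, "unittest": 40421, "talos": 41467}
code_changes = 1675

def jobs_per_hour(total, hours=HOURS_IN_NOVEMBER):
    """Average number of jobs per wall-clock hour."""
    return total / hours

rates = {kind: round(jobs_per_hour(count)) for kind, count in jobs.items()}
jobs_per_change = sum(jobs.values()) / code_changes

print(rates)                   # {'build': 28, 'unittest': 56, 'talos': 58}
print(round(jobs_per_change))  # 61
```

On that basis the unittest and talos rates match the ~56 and ~58 quoted above. The build total works out nearer 28 per hour than 84, so the quoted build rate may have been measured on a different basis.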

Here’s how it looks compared to the year so far:

Detailed breakdown is:

Here’s how the math works out:

The types of build, unittest and performance jobs triggered by each individual push are best described here.

The Writing on the Wall

Another round of construction started here recently in the Mountain View office. They’re trying really hard to keep the dust and disruption to a minimum, so they hung plastic sheeting over doorways, and taped plastic over the carpets in the corridors; it’s even inside the elevators.

It’s funny how quickly you can get used to working in what feels like the set of a bad SciFi movie! However, while swiping my ID card on the way back to my desk, the following made me stop, double-take and then start carefully looking around me.

“Demo: Not now”

To be clear, in this context:

demo != demonstration
demo == demolition

Turns out, the entire corridor I was standing in was going to disappear… just not now.

linux64: now with extra builds and talos!

Some of you may have noticed a new item on the menu on GraphServer.

There’s been a lot of work with linux64 over the last few weeks behind the scenes.

1) There are now nightly and per-checkin builds available for mozilla-central, mozilla-192, mozilla-191, tracemonkey and electrolysis. Because we only have 10 linux64 build slaves, we don’t have builders on Places, TryServer or the cvs-based mozilla-190/Firefox3.0.

2) We’ve got a pool of linux64 talos slaves running all the usual Talos suites, per build, on those same branches. You can now see those results on graphs.mozilla.org, listed just like any other OS. Just like it should be. 🙂

3) Caveats:

  • For the sake of speed, we’ve cloned the *one* preexisting linux64 machine (which dbaron? set up), without generating a clean, new refimage with a fully identified toolchain. If you see any toolchain problems, please let us know, but as it’s identical to what’s been in place before, hopefully it will continue to be good enough for now.
  • Unittests are not yet being run on linux64. This is being worked on as part of a bigger problem: unittests used to require doing a build first. This in turn meant we could only run unittests on platforms that we supported for builds, so we don’t have unittests on 10.4, xp, vista, etc. More on this as it develops, but it’s not complete yet.
  • We’re still working out some TBPL display updates to get linux64 showing up on TBPL. For now, you must use the Tinderbox waterfall to see the linux64 builds. (The curious can follow bug#532560.)

Spinning up this new OS took work from most people in the group, and is the first new desktop platform we’ve supported in years. Very very cool work and a great way to end the week. Enjoy!

Another Major Update from FF2.0.0.20->FF3.0.15

Last week, we gave Firefox 2 (yes 2!) users a Major Update offer to Firefox 3.0.15. This was despite our official End Of Life for Firefox 2 way back in December 2008.

While most attention is naturally focused on new releases, and on new security releases, there were 5.3% of our users still using Firefox 2. Those users were not getting new fixes and features; even worse, these users were all using versions of the browser that had known, published, exploits – exploits that were already fixed in later supported releases of Firefox.

The previous major update offer was intentionally left available, so any FF2 user who did manual CheckForUpdates would get upgraded to FF3.0.6. However, few did. As most of these Firefox2 users were on FF2.0.0.20, they were obviously willing and able to upgrade when security releases prompted them to. It seemed worth the effort to prompt them again, with a new Major Update offer, and see how many would upgrade.

In the first 7 days after publishing those new major update snippets, 16% of FF2 users have upgraded. It’s a slower rate of upgrading than we get for normal security releases. However, it’s still a significant amount, and it’s great to see those users get back onto supported, more secure, releases. I’ll continue to monitor uptake, and keep you posted.

(ps: It was really cool that nthomas and abillings were able to find the time to squeeze yet another release into the schedule in the midst of all the releases for FF3.0, FF3.5, FF3.6 beta/RCs and Fennec beta/RCs. To keep this work quick and safe, we did a FF2->FF3.0 MU offer, rather than attempting FF2.0->FF3.5, which would require a bigger testing cycle, details in . On behalf of those users who are only now discovering the Awesome Bar, our faster performance and all the new JIT work, I thank you both!!)

The Crow Road by Iain Banks

“It was the day my grandmother exploded.”

A great opening line, and it made me stop my browsing in the bookshop to read on, a little curious. By the end of the first chapter, I was hooked and needed to buy the book. This coming-of-age story in rural Scotland is interwoven with social commentary and a family murder mystery. There were a surprising number of similarities with growing up in rural Ireland, and I found this book a really good read. Even if you did not grow up in rural Scotland (or Ireland), I think you’d still enjoy the book; you just might not get all the inside jokes or cultural references.

While I had heard of the author before, I always thought he wrote science fiction books that just didn’t work for me. This was my first time discovering that he wrote non-science fiction also, and I liked this book.

Firefox 3.5.5 by the (wall-clock) numbers

Firefox3.5.5 was released on Thursday 05-nov-2009, at 16:00PST.

This was our fastest turnaround on a release. By far.

From “Dev says go” to “release is now available to public” was approx 3 days (3d 4h 45m) wall-clock time. Release Engineering took 13-16 hours. By comparison, the next fastest release turnarounds were FF3.5.3 (~37 hours) and FF2.0.0.9 (~37 hours).

11:13 02nov: Dev says “go” for FF3.5.5
13:06 02nov: FF3.5.5 builds started
17:05 02nov: FF3.5.5 linux, mac builds handed to QA
20:03 02nov: FF3.5.5 signed-win32 builds handed to QA
00:28 03nov: FF3.5.5 update snippets available on test update channel
22:00 04nov: Dev & QA says “go” for Beta, and for Release; Build already completed final signing, bouncer entries
07:30 05nov: mirror replication started
10:55 05nov: mirror absorption good enough for testing
14:40 05nov: website changes finalized and visible. Build given “go” to make updates snippets live.
14:51 05nov: update snippets available on live update channel
16:00 05nov: release announced
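The wall-clock gaps in this timeline can be checked with a few lines of date arithmetic. This is just a sketch: the `stamp` helper and the variable names are made up for this example, and it assumes every entry falls in November 2009, as the release date above implies.

```python
from datetime import datetime

def stamp(text, year=2009, month=11):
    """Parse a 'HH:MM DDnov' timeline entry (hypothetical helper).

    Assumes every entry is in November 2009, so only the day and
    the clock time are read from the text.
    """
    clock, day = text.split()
    hour, minute = map(int, clock.split(":"))
    return datetime(year, month, int(day[:2]), hour, minute)

dev_go = stamp("11:13 02nov")
go_to_mirrors = stamp("22:00 04nov")
mirrors_started = stamp("07:30 05nov")
mirrors_ready = stamp("10:55 05nov")
announced = stamp("16:00 05nov")

total = announced - dev_go                     # whole release, end to end
mirror_wait = mirrors_started - go_to_mirrors  # waiting before the mirror push
mirror_fill = mirrors_ready - mirrors_started  # time for mirrors to absorb

print(total)        # 3 days, 4:47:00
print(mirror_wait)  # 9:30:00
print(mirror_fill)  # 3:25:00
```

The end-to-end figure lands within a couple of minutes of the approx 3 days quoted above, and the two mirror gaps correspond to the 9.5 hours of waiting and the ~3.5 hours of mirror absorption discussed in the notes.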

Notes:

1) As we continue streamlining this process, the long pole is now communication between the groups, and also how the website release notes are assembled and published. For this release, there were 9.5 hours of waiting between “go to mirrors” and “mirror push started”. Most of Thursday was spent updating release notes on websites. Meanwhile, we populated the mirrors, which takes ~3.5 hours of watching mirrors, but only took two brief commands on our part.

2) Our blow-by-blow scribbles are public, so the curious can read all about it here. Those Build Notes also link to our tracking bug#525814.

This super-super fast release turnaround was handled calmly and efficiently. It was a real credit to the team to see how well everyone worked together on this, including smooth handoffs back-and-forth across timezones so everyone still had a life! 🙂

take care
John.