RelEng goes to see Nine Inch Nails

To wrap up our 2009 RelEng intern orientation program, John Ford went with Aki and me to see Nine Inch Nails live at Shoreline on Friday night, just a few minutes’ walk from the office.

Aki’s a real fan, as you might guess. It was only my first time seeing them live, despite having all their albums for years. When we discovered John Ford didn’t know them at all… well… we just had to fix that!

Despite the crowds, Shoreline was cold – so much for warm California weather – and while I was slightly cold in my Firefox fleece, John Ford kept claiming he was fine in his T-shirt because “he’s Canadian”!?!? 🙂

NIN did a great job of picking all the tracks I really liked the most (!), from across all their albums, with really great transitions from one piece to another. Lots of photos here on flickr. All in all, a great night out, and a great way to start the Memorial Day long weekend here – many thanks to Aki for getting tickets before they sold out.

ps: We definitely had nothing to do with the Mountain View police car stolen at the concert and then found two days later.

Semi-invisible outage on Friday 15th May

On Friday 15th May, the failover capabilities of the pool-o-slaves paid off, yet again.

As you may recall, on 12th May, we were able to take down 76 VMs for scheduled maintenance, without closing the tree. With that many systems offline, we had longer wait times, but everything kept working, and people could still do checkins, see builds/tests/performance results like usual. Quite impressive, really.

On 15th May, a totally unrelated DHCP server failed without warning. This took out 4 ESX hosts running approx 30 VMs for several hours. The builds/tests that were in progress at the time of the failure were lost, but otherwise no-one noticed a thing. Already queued jobs were allocated to remaining machines, automatically working around the outage, and our infrastructure just kept working while IT revived the DHCP server. Bug#493181 has details, for the curious.
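For the curious, the "automatically working around the outage" behaviour can be sketched in a few lines. This is only a toy model under my own assumptions: the function name, the VM names and the round-robin policy are all hypothetical, and it is not how the real scheduler is implemented.

```python
from collections import deque


def dispatch(jobs, workers, offline):
    """Assign queued jobs round-robin to the workers that are still online.

    Hypothetical helper for illustration; not the real scheduler.
    """
    online = [w for w in workers if w not in offline]
    if not online:
        raise RuntimeError("no workers available")
    assignments = {w: [] for w in online}
    queue = deque(jobs)
    i = 0
    while queue:
        worker = online[i % len(online)]
        assignments[worker].append(queue.popleft())
        i += 1
    return assignments


# Made-up pool of six VMs, two of which sit on the failed hosts.
workers = [f"vm{n:02d}" for n in range(1, 7)]
offline = {"vm02", "vm05"}
plan = dispatch(["build", "unittest", "talos", "l10n"], workers, offline)
print(plan)
```

The point of the sketch is that nothing in the queue cares which machine picks up a job, so losing some machines only lengthens wait times.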

We’re a long way from claiming 5-9s uptime, but the structural improvements are really paying off. Once a few other projects wrap up, we can start seriously talking about SLAs. We’ve come a long way in the last 2 years, and this is all very exciting stuff…
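To put "5-9s" in perspective, here is a quick back-of-envelope calculation of how much downtime per year each level of availability allows. The function name is just for illustration, and the year is taken as 365 days.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignoring leap years


def allowed_downtime_minutes(nines: int) -> float:
    """Minutes of downtime per year permitted by N-nines availability."""
    unavailability = 10 ** (-nines)  # e.g. 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


for n in (3, 4, 5):
    print(f"{n} nines: {allowed_downtime_minutes(n):.1f} minutes/year")
```

Five nines works out to roughly five minutes of downtime a year, which is why a multi-hour outage of 30 VMs, however invisible to developers, keeps us well short of that claim.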

Welcome (back) Lukas!

Last Monday was a Canadian holiday, and this coming Monday will be a US holiday, so with no Mozilla Foundation meeting, I’m resorting to announcing this in a blog post.

Lukas Blakk has joined RelEng as a fulltime employee.

If you’ve never met Lukas, you should know that last summer, Lukas arrived for her internship, and had barely unpacked when we handed her the entire unittest infrastructure without warning. With very little guidance, she took it all on. Lukas worked non-stop stabilizing and streamlining machine configs in a million-and-one little details, chasing intermittent unittest failures to figure out if they were caused by code bugs, testware bugs, RelEng bugs or IT bugs… or a combination! She was so great, we hired her as a part-time contractor when she was heading back to Seneca. In that role, she worked with catlee and me to consolidate all the build and unittest machines into one shared pool (details here). This project massively simplified life in RelEng and improved end-to-end turnaround time for developers. More importantly, it was a pre-req to getting unittests running on TryServer, and also to separating out build from unittest for faster turnaround times (details here, here and here). All massive stuff.

We’re delighted she’s graduated and is coming back to Mozilla. If you’ve never met Lukas before, go find her in the Toronto office or on irc (lsblakk) and say hello.

Welcome (back) Lukas… oh, and by the way, we’ve got a project called xulrunner we’d like you to have a look at! 🙂

Bike to Work Day

  • 5:15am: fall out of bed
  • 6:00am: left my house
  • 6:40am: left 24th & mission in a pack of 50+? other cyclists; “feels like critical mass”
  • 9:50am: arrive 42 miles later
  • 10:00am: arrive in meeting, after a quick rush cleanup

The route went along the east coast of the peninsula (see route here) and was *much* better than the last time I did bike to work day. Sadly, because it was cold and foggy all the way, my Firefox cycling jersey stayed hidden under all the layers.

With a more efficient start, it feels like this could be done more routinely. The SF2G folks helped make it lots of fun, so I’ll try this again during the summer.

ps: Very glad that today was not called “Bike to and from work day” – I’d never make it. :-)

Last night was a major milestone

Last night, we took 76 VMs offline (22% of our 342 machines), so we could do a major firmware upgrade on the EqualLogic arrays. That’s a lot of work, and it all went smoothly. But that’s not the important part this time.

The major milestone is that we did this *without* needing to close the tree for mozilla-1.9.1, mozilla-central or tracemonkey. Throughout the firmware upgrade, developers were still able to land patches, triggering builds, unittests and talos runs on all o.s., as well as using TryServer on all o.s. This was an intended feature in the design of our new infrastructure, but last night was the first time we really tried it out in a controlled way. And it worked perfectly.

Put another way: with the old infrastructure, losing 22% of our machines would have been a massively disruptive all-hands-on-deck tree-closing event. Last night showed how much things are improved.

Some details for the curious:

  • As slaves became idle before 7pm, we told them to shut down gracefully. This meant that they would not accept new jobs, and could be powered off without developers getting reports of burning/broken builds at 7pm.
  • Once all the VMs using EqualLogic arrays were powered off, Aravind was able to start the firmware upgrade. He’ll blog about it separately, but this was a major update in all senses of the word, so took a full 4 hours for him to get through.
  • As soon as the firmware upgrades looked good, approx 11pm, we started powering all the VMs back up.
  • All the moz2 VMs are configured to autoboot back in a working state, so they automatically reconnected to the master, and started accepting queued jobs, with no human intervention at all. These all came up smoothly first time.
  • The Firefox3.0 and Thunderbird2.0 machines don’t come up in a fully working state, so needed some manual work, but this was relatively quick and on only a few machines. All came up smoothly.
  • We expected to be finished by 7am, but were in fact all done before 1am, 6 hours early.
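The graceful-shutdown step in the list above is the bit that kept the tree green, so here is a minimal sketch of the idea. It is a toy model only: the `Slave` class and its methods are hypothetical names I made up for illustration, not the API of the software we actually run.

```python
class Slave:
    """Toy model of a build slave; hypothetical, for illustration only."""

    def __init__(self, name):
        self.name = name
        self.current_job = None
        self.accepting = True
        self.powered_on = True

    def request_graceful_shutdown(self):
        # Stop taking new work; power off right away only if already idle.
        self.accepting = False
        if self.current_job is None:
            self.powered_on = False

    def offer(self, job):
        # The master offers a job; a draining or busy slave declines it.
        if not (self.accepting and self.powered_on) or self.current_job is not None:
            return False
        self.current_job = job
        return True

    def finish_job(self):
        # Once the in-flight job completes, a draining slave can power off.
        self.current_job = None
        if not self.accepting:
            self.powered_on = False


busy = Slave("vm-a")
idle = Slave("vm-b")
busy.offer("mozilla-central build")
busy.request_graceful_shutdown()   # keeps running until its job finishes
idle.request_graceful_shutdown()   # powers off immediately
```

Because a draining slave never abandons a job halfway, nobody sees a build go red just because its machine was scheduled for maintenance.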

All in all, really a great evening, and great to see how RelEng and IT worked together on this – last night and also in the weeks of prep leading up to last night. Big tip of the hat to mrz, aravind, phong, catlee, nthomas and bhearsum for all their work.

That was awesome, thank you.
John.
=====
(full disclosure: While moz2 trees remained open, we did close the trees for Firefox3.0, and Thunderbird2.0. This was an intentional decision because a) very few people are using them, and b) we wanted to focus as much as possible of our remaining resources on keeping the active code lines running as smoothly as possible.)

Welcome (back) Armen!

It feels like only yesterday that Armen finished his internship, and went back to Seneca. Ever since then he’s been continuing to rock, working with coop to untangle our old l10n systems, set up parallel l10n repacks running on our production pool-of-slaves, shaving hours off our release times and generally making life better.
Well, now he’s graduated. And he’s back, this time working from the Toronto office!

Welcome (back) Armen, it’s great to have you at Mozilla again! :-)

End of an era – no more Firefox 2 machines

  • 24-oct-2006: Launch of Firefox2.0
  • 17-jun-2008: Launch of Firefox3.0
  • 17-dec-2008: End-of-life for Firefox2.0
  • 08-may-2009: Close bug#487235, as the Firefox 2.0 machines are finally gone.

We’re recycling the physical hardware to fix holes in our unittest coverage for Firefox3.0 and 3.5 - you just can’t buy PPC-based xserves anymore! 🙂 The VMs were “recycled” in the meta-physical-bits-sense, and new VMs are coming online to help with Firefox3.5. And obviously, removing these machines simplifies the life of RelEng and IT as we don’t have to support them anymore. All good.

Yet, to be honest, I feel a mixture of joy and nostalgia. They did serve us well over the years; just look how much Mozilla and Firefox and Thunderbird have changed between 24-oct-2006 and today!!

Somehow, “The King is dead. Long live the King!” seems appropriate.

Welcome John Ford!

Quick note to welcome John Ford from Seneca as an intern in RelEng. He started here in Mountain View on Monday, and is already helping Aki get mobile builds going on TryServer. You can follow his adventures here or by looking for RelEng bugs assigned to “jford at mozilla dot com”.

This should be quite exciting – stay tuned!

Talos now measures shutdown times

In last week’s downtime, Alice re-tried enabling TShutdown tests in production Talos, and this time it all went smoothly. This change is important for two reasons:

1) Fixes an intermittent Talos orange problem
Basically, each Talos test assumed that the previous suite had already ended and exited the browser successfully. However, sometimes (usually on Vista!), we found that closing a healthy browser took longer than expected. This would cause the next Talos suite to fail because of the lingering process left by the previous Talos suite.

This fix should greatly reduce intermittent oranges from Talos in mozilla-central, mozilla-1.9.1 and tracemonkey. In the few days since it’s been enabled, things look much better already!

2) Users care about shutdown times
Just like we measure startup time, it feels right to measure shutdown time. It was never measured before, but once the idea came up, this felt like a good thing to measure. There are also some edge cases where users exit and quickly restart Firefox, and the restart can misbehave if the old browser process is still slowly closing down.
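To make both points concrete, here is a rough sketch of timing how long a process takes to exit, and making sure it really is gone before anything else starts. This is not how Talos does it; the helper name and the 30-second timeout are my own assumptions, and a sleeping Python child process stands in for the browser.

```python
import subprocess
import sys
import time


def measure_shutdown(proc, timeout=30.0):
    """Ask a process to exit and return how many seconds it took.

    If the process has not exited within `timeout` seconds it is killed,
    so a lingering process cannot break whatever runs next.
    """
    start = time.monotonic()
    proc.terminate()
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
    return time.monotonic() - start


# Stand-in for a browser: a child Python process that just sleeps.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
elapsed = measure_shutdown(child)
print(f"shutdown took {elapsed:.3f}s")
```

The returned number is the shutdown time worth tracking, and because the function only returns once the process has exited, the next suite never trips over a leftover browser.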

The curious can find more details in Alice’s blogpost here.

HOWTO: travel on the Tokyo metro

The Tokyo subway and train system is massive; as someone who could not read/write/speak Japanese, I found it a little daunting at first. However, with the following techniques, I quickly found it very easy to get around.

1) Print out this PDF of the subway map on a *color* printer. Or download the official Tokyo Metro Android app. If you plan to travel by train outside Tokyo, I found this app really helpful.

Carry it always! I found it invaluable when lost, asking for directions, or even just trying to confirm if I was on the correct train going the right direction. When language barriers get in the way, pointing politely to a printout map does wonders!

2) Learn the codes for your planned route.

On the subway map, each route has a different color. Also, each station has a name in Kanji, a name in ASCII, and a letter-plus-two-digit code. For example, in the bottom left corner of the map, you can see the station “Nishi-magome” is on the red “Asakusa” line, and has the code “A01”. To be precise, it’s really one code per line per station, so some bigger stations have multiple codes: for example, Shibuya has three subway lines, so the same one station is called “Z01”, “F16” and “G01”, depending on which subway line you are using.
These letter-plus-two-digit codes are clearly posted in every station, and on all maps. I found these codes much easier to remember than the real Japanese names of the stations, so these codes became essential for me to quickly figure out if I had missed my stop, if we were now arriving at my station, or if I was on a train going the wrong way.

For example:

  • from my hotel to the Mozilla office: go from “Z01” to “Z05”.
  • from my hotel to Hombu Aikido dojo: go from “F16” to “F12”, change platforms to the “E” platform, where the same station is now called “E02”, and take the train to “E03”.
  • from my hotel to Akihabara “Electronics town”: go from “G01” to “G09”, change platforms, and then go from “H08” to “H15”.
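The nice thing about these codes is that they turn route-following into simple arithmetic. Here is a small sketch of that idea. It only handles the one-letter-plus-two-digits format described above, it assumes station numbers along a line are consecutive, and the function names are just for illustration.

```python
import re


def parse_code(code):
    """Split a station code like 'F16' into ('F', 16)."""
    m = re.fullmatch(r"([A-Z])(\d{2})", code)
    if m is None:
        raise ValueError(f"not a station code: {code!r}")
    return m.group(1), int(m.group(2))


def stops_between(start, end):
    """Number of stops from start to end on the same line.

    The sign gives the direction: positive means the numbers count up.
    """
    line_a, num_a = parse_code(start)
    line_b, num_b = parse_code(end)
    if line_a != line_b:
        raise ValueError("codes are on different lines; change trains first")
    return num_b - num_a


# The Akihabara trip from the list above, as two same-line legs.
legs = [("G01", "G09"), ("H08", "H15")]
total = sum(abs(stops_between(a, b)) for a, b in legs)
print(total)
```

If the number on the platform sign is moving away from your destination number instead of toward it, you are on a train going the wrong way.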

3) Get a commuter ticket.

This lets you avoid the hassle of buying tickets at crowded ticket machines, and having to figure out exact fares on each subway trip. If you are in Tokyo more than a day or two, it’s well worth it for convenience alone!

There are two big brands of commuter tickets: “Suica” and “Pasmo”. Within Tokyo, either can be used on any subway. I’ve been told they both also work on buses, and can even be used in some shops like a debit card. If you are going outside of Tokyo, “Suica” can also be used on trains in some other cities; check for details here.

  • You can buy Suica or Pasmo cards at any train station. Official train company offices seem to want some simple paperwork filled in. Instead I bought my Suica card at a newspaper stand on the west side of Shibuya station. Prices are all the same.
  • when entering the subway, wave the card over the sensor in the turnstile as you enter. (This works even if your card is in a wallet/handbag!) As you walk through, the display on the far end of the turnstile shows you how much credit you have left.
  • when exiting the subway, wave the card over the sensor in the turnstile. As you walk through, the display at the far end of the turnstile shows you how much the fare was for your trip, and how much credit you have left.
  • to recharge your card, look for a ticket machine with the Suica or Pasmo logo, press the “English” button on the top-right corner of the display, then just follow the prompts.
  • more details, and photos here.

4) Note carefully which station entrance and exit you need.
I never really thought of this before Tokyo, but the train stations are huge – multiple city blocks. If you come out the wrong exit without paying attention, you can be very lost, and very far from where you were going. In frustration, I’d find myself walking back to the station, reentering, and then walking around inside the station until I found the correct exit. Save yourself this headache by looking for the name of the exit before you start your journey.

5) Note carefully the platform marking, and follow those instructions.
The overhead signs were bewildering to me. However, the color-coded markings on the tile floors were really helpful. Follow what others are doing, stand in the right color-coded platform area, and you will be perfectly located when the next incoming train stops and opens its doors. Every time. Yes, really.

With a map, a memorized series of station-codes and a commuter card, I found getting around Tokyo on the metro super easy and super efficient.

(UPDATED to add references to the new Tokyo Metro official Android app and the Hitachi national rail app. joduinn 25mar2015. Fixed broken links 10apr2018)