10 Oct 2011
JohnMozilla
(I raised this in the Enterprise Working Group (EWG) call this week, and it resonated strongly with some people. Therefore, I’m posting this out more widely to hopefully get more feedback.)
After all the recent discussions about “what enterprise users wanted”, I found myself wondering if we were all even attempting to solve the same problem, so I stepped back, and re-read *lots* of posts from different enterprises over the last few months.
I now believe Mozilla, and the enterprises in the Enterprise Working Group, are working to solve three related but distinct problems.
1) Cost of verifying that a new version of Firefox is safe to deploy.
Some enterprises verify with a quick run of an ACID test. Some SaaS vendors verify by doing wider testing, and deploying bugfixes to their products. One complication for SaaS vendors is that end users may be running on newer versions of Firefox anyway, on non-enterprise machines. This can cause problems that make both the SaaS vendor, and Mozilla, look bad. We haven’t spent much time on this so far.
(I still wonder if we could design a compatibility test suite, in the same mindset as HTML5, JavaCompatibilityKit, etc., that might help speed up this verification step?)
2) Cost of deploying a new version of Firefox to all supported users
Once an enterprise has verified a specific version of Firefox, how much effort does it take to deploy that new version onto all their machines/users? This discussion typically quickly focuses on MSI and similar technologies for doing widespread deployments, although there are some other options like an in-house AUS or equivalent. Regardless of the technology used, the idea here is to have a centralized way to move all users forward to a newer version of Firefox, without having to walk/drive/fly a human to every computer in order to manually do a new install. Sometimes this also includes discussions about silent updates.
3) Frequency of doing this all over again
The frequency of the Firefox release cadence directly impacts how often enterprises have to go back to do (1) and (2) all over again.
The verify+deploy work is typically so painful that most enterprises only do this for “new feature” releases, and not for “security only dot-releases”. For most enterprises, it seems that Mozilla’s cadence of “new feature” releases every 12-18-24 months was infrequent enough that the verify+deploy work was tolerable. However, Mozilla’s more frequent feature releases mean a more frequent cost of verifying+deploying, which can become a significant business problem.
The ESR proposal is attempting to address this increased recurring cost and this is where most of the discussions have been taking place so far.
(It’s worth noting that everyone involved from Mozilla and different enterprises understands and agrees that Mozilla’s faster cadence of “new feature” releases is important for Mozilla to remain relevant in the browser marketplace.)
Just my thoughts, but I’d be curious to hear what others think.
John.
10 Oct 2011
JohnMozilla
As some of you reading this may already know, there’s a proposal under discussion for how Mozilla could support releases for longer durations as requested by some enterprises.
I’d like to modify the latest Extended Support Release (ESR) proposal as follows:
1) Mozilla would not generate overlapping Extended Support Release (ESR) builds
2) Change the timing of when enterprises start to deploy new versions of Firefox.
The details of this are subtle, so please bear with me while I try to explain with a brief example:
1) Mozilla anoints a specific Firefox release to be supported for a total of 42 weeks.
For the purposes of discussion, let’s say this is Firefox 8.0. This means Firefox 8.0 users would be guaranteed to receive seven scheduled security-only dot-releases (plus, of course, any unplanned security chemspills that came up in that 42-week timeframe). Before the end of the 42 weeks, Mozilla would anoint another release to be supported for 42 weeks. To continue this example, after Firefox 8.0, the next release to be anointed would be Firefox 15.
Schedule-wise, this means:
** 8.0.1 would sim-ship with 9.0.
** 8.0.2 would sim-ship with 10.0.
…
** 8.0.7 would sim-ship with 15.0
** 15.0.1 would sim-ship with 16.0
…
2) Enterprises would start to verify/certify with the 8.0.0 release.
However, enterprises would *not* deploy 8.0.0. Specifically, enterprises would only start deployments of 8.0.1 at the time that 9.0 is released. (This is important for mechanical details about how updates are served – see more below.) (To be precise, enterprises deploy the latest 8.0.x available at the time 9.0 is released; if there are no chemspills, this would be 8.0.1, but if there are chemspills, it is always the latest 8.0.x available at the time of the 9.0 release.)
3) When doing releases, RelEng makes a small change to how we publish updates between releases:

* 8.0.1 would sim-ship with 9.0.
** Mozilla would NOT enable updates from 8.0.0 -> 8.0.1
** Mozilla would enable updates from 8.0.0 -> 9.0.0
* 8.0.2 would sim-ship with 10.0.
** Mozilla would enable updates from 8.0.1 -> 8.0.2
** Mozilla would enable updates from 9.0.0 -> 10.0.0
…
* 8.0.7 would sim-ship with 15.0
** Mozilla would enable updates from 8.0.6 -> 8.0.7
** Mozilla would enable updates from 14.0.0 -> 15.0.0
* 15.0.1 would sim-ship with 16.0
** Mozilla would NOT enable updates from 15.0.0 -> 15.0.1
** Mozilla would enable updates from 8.0.7 -> 15.0.1
** Mozilla would enable updates from 15.0.0 -> 16.0.0
* 15.0.2 would sim-ship with 17.0
** Mozilla would enable updates from 15.0.1 -> 15.0.2
** Mozilla would enable updates from 16.0.0 -> 17.0.0
…
That’s it.
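The update rules above can be sketched in a few lines of Python. This is only an illustration of the example schedule, not RelEng's actual automation: the function name `updates_for_release` and the constant `CYCLE` are hypothetical, it assumes a new ESR is anointed every 7 mainline releases (42 weeks at 6 weeks per release), and it ignores chemspills, which would shift the dot-release numbers.

```python
from typing import List, Tuple

# Assumption: one release is anointed every 7 mainline releases (8, 15, 22, ...).
CYCLE = 7


def updates_for_release(mainline: int, first_esr: int = 8) -> Tuple[str, List[Tuple[str, str]]]:
    """Return the ESR build that sim-ships with `mainline`.0, plus the
    (from, to) update paths that would be enabled on that day."""
    if mainline <= first_esr:
        raise ValueError("mainline must be newer than the first anointed release")
    # The ESR base is the most recent anointed release strictly older than
    # the mainline release going out today.
    base = first_esr + ((mainline - first_esr - 1) // CYCLE) * CYCLE
    dot = mainline - base
    esr = f"{base}.0.{dot}"
    paths: List[Tuple[str, str]] = []
    if dot == 1:
        # base.0.0 -> base.0.1 is deliberately NOT enabled. Instead, users on
        # the previous ESR's final dot-release (if any) move to the new ESR.
        if base > first_esr:
            paths.append((f"{base - CYCLE}.0.{CYCLE}", esr))
    else:
        paths.append((f"{base}.0.{dot - 1}", esr))
    # Mainline users always move forward to the new mainline release.
    paths.append((f"{mainline - 1}.0.0", f"{mainline}.0.0"))
    return esr, paths


print(updates_for_release(9))   # ('8.0.1', [('8.0.0', '9.0.0')])
print(updates_for_release(16))  # ('15.0.1', [('8.0.7', '15.0.1'), ('15.0.0', '16.0.0')])
```

Note how the only special case is the first dot-release of each ESR: that is the one day on which the "normal" x.0.0 -> x.0.1 offer is withheld.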
There are a few reasons why I recommend these modifications to the proposal:
1) Minimal changes to RelEng release automation or our update infrastructure. This means mechanically, we can put this new proposal into action more easily.
2) No need for any metrics infrastructure changes – all current infrastructure should just work as-is.
3) No need for Mozilla to generate overlapping concurrent ESR releases. This is significant because:
3a) the original proposal would have Mozilla sim-ship Firefox13.0, Firefox8.0.5esr and Firefox13.0esr at the same time as we also migrate aurora->beta and central->aurora. This is a significant increase in the mechanical work for RelEng in a *very* tight timeframe.
3b) this reduces the number of landings developers have to do for security fixes for 12-weeks-in-every-42-weeks. This also reduces the number of release builds to be generated if we have a chemspill in that 12-weeks-in-every-42-weeks window.
I believe these modifications still meet all the same objectives of the original proposal, yet are mechanically easier to implement. Therefore, I believe Mozilla could put this modified ESR proposal into action more easily than the existing ESR proposal.
Let me know if I missed anything, or if you have any concerns.
Thanks
John.
29 Sep 2011
JohnMozilla
After a great internship, we sadly waved goodbye to mjessome a few weeks ago when he went back to UWaterloo to continue his studies.

Marc spent the summer working with Lukas Blakk on integrating two large parts of our developer infrastructure: TryServer and Bugzilla. Marc’s work is running in staging right now, and if all continues to go well, this should start to see light-of-day in production soon.
While it might sound strange to tie these two very different systems together like this, the idea here is quite fundamentally important to everyone who uses TryServer for Mozilla, and should roll out to other RelEng production systems soon afterwards. The curious can click here for more details or follow bug#657828.
I am happy to report that Marc seemed to still have lots of fun during his internship. Marc’s fire eating class at The Crucible was a side effect of a discussion he and I had during the RelEng work week in Toronto. By comparing these photos with photos taken when I took the class, you can easily tell that mjessome is much better at this than I ever was. I also discovered that interview questions about juggling and riding a unicycle were somehow missing from RelEng’s list of interview questions, but I’ve now fixed that.



Thanks for all your great work, Marc, we really enjoyed working with you, and hope we cross paths again soon.
ps: if you are reading this, and are interested in an internship at Mozilla, have a look at our internship job postings.
21 Sep 2011
JohnMozilla
On July 25-30, all of RelEng held a work week in Toronto. As usual for these work weeks, we do project planning for next quarter(s), brainstorm open issues, review major projects completed since the last work week, do face-to-face reviews, and even (this time) practice some public speaking as well as sign up to write part of a book (more details soon)!
Between everything else, we made time to tour Mozilla’s still-under-construction new Toronto office. This photo was taken at the freight elevator, at the back of the office, standing between piles of construction supplies.
This is Mozilla’s Release Engineering group.

In case you don’t know all these smiling faces, they are:
standing L->R: bear, joduinn, lsblakk, bhearsum, rail, mjessome, catlee, jhford, joey.
kneeling L->R: coop, aki, armenzg, nthomas.
Given the high-stress nature of the job, and also the very distributed nature of the group, our group cohesion, as well as our trust, respect and love for each other, is self-evident and something I’m truly in awe of. This means more to me than I can find the right words for.
Every RelEng work week is an exhausting, hectic week, and yet I always find it really invigorating too. At the end of each week, as we are saying our goodbyes and heading for various planes/homes, I find myself missing everyone deeply and starting to look forward to the next gathering… currently scheduled for February 2012.
20 Sep 2011
JohnMozilla
I just realised it’s been almost 2 months since my last blogpost. There has been a lot of exciting stuff going on, and I apologize for not writing to share it all.
I will do better.
22 Jul 2011
JohnMozilla
I’ve just renewed the GPG keys which RelEng use for signing builds with our release automation. The details are in bug#673281, but I thought crossposting might be of help to others. If you don’t care about GPG keys and signatures, skip now.
0) login to signing machine
1) Verify you are in a clean working directory and have a good gpg install.
$ cd
$ mv ~/.gnupg ~/.gnupg.backup
$ mkdir ~/.gnupg
$ cd ~/.gnupg
$ gpg --version
gpg (GnuPG) 1.4.7
$
2) Create new key, and two sub keys.
$ gpg --gen-key
gpg (GnuPG) 1.4.7; Copyright (C) 2006 Free Software Foundation, Inc.
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions. See the file COPYING for details.
gpg: keyring `/Users/john/.gnupg/secring.gpg' created
Please select what kind of key you want:
(1) DSA and Elgamal (default)
(2) DSA (sign only)
(5) RSA (sign only)
Your selection? 2
DSA keypair will have 1024 bits.
Please specify how long the key should be valid.
0 = key does not expire
<n> = key expires in n days
<n>w = key expires in n weeks
<n>m = key expires in n months
<n>y = key expires in n years
Key is valid for? (0) 2y
Key expires at Sat Jul 20 20:06:32 2013 PDT
Is this correct? (y/N) y
You need a user ID to identify your key; the software constructs the user ID from the Real Name, Comment and Email Address in this form:
"Heinrich Heine (Der Dichter) <heinrichh@duesseldorf.de>"
Real name: Mozilla Software Releases
Email address: releases@mozilla.org
Comment:
You selected this USER-ID:
"Mozilla Software Releases <releases@mozilla.org>"
Change (N)ame, (C)omment, (E)mail or (O)kay/(Q)uit? O
You need a Passphrase to protect your secret key.
We need to generate a lot of random bytes. It is a good idea to perform
some other action (type on the keyboard, move the mouse, utilize the
disks) during the prime generation; this gives the random number
generator a better chance to gain enough entropy.
...
gpg: key 1797CA3D marked as ultimately trusted
public and secret key created and signed.
gpg: checking the trustdb
gpg: 3 marginal(s) needed, 1 complete(s) needed, PGP trust model
gpg: depth: 0 valid: 1 signed: 0 trust: 0-, 0q, 0n, 0m, 0f, 1u
gpg: next trustdb check due at 2013-07-21
pub 1024D/1797CA3D 2011-07-22 [expires: 2013-07-21]
Key fingerprint = C60B CDD2 9B91 A82F B837 A467 C0F5 550C 1797 CA3D
uid Mozilla Software Releases <releases@mozilla.org>
Note that this key cannot be used for encryption. You may want to use
the command "--edit-key" to generate a subkey for this purpose.
Command>
Command> quit
$
$ gpg --list-keys
/Users/john/.gnupg/pubring.gpg
------------------------------
pub 1024D/1797CA3D 2011-07-22 [expires: 2013-07-21]
uid Mozilla Software Releases <releases@mozilla.org>
$
$ echo "so far so good"
$
$ gpg --edit-key releases@mozilla.org
gpg (GnuPG) 1.4.7; Copyright (C) 2006 Free Software Foundation, Inc.
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions. See the file COPYING for details.
Secret key is available.
pub 1024D/1797CA3D created: 2011-07-22 expires: 2013-07-21 usage: SC
trust: ultimate validity: ultimate
[ultimate] (1). Mozilla Software Releases <releases@mozilla.org>
Command>
Command>
Command> addkey
Key is protected.
You need a passphrase to unlock the secret key for
user: "Mozilla Software Releases <releases@mozilla.org>"
1024-bit DSA key, ID 1797CA3D, created 2011-07-22
Please select what kind of key you want:
(2) DSA (sign only)
(4) Elgamal (encrypt only)
(5) RSA (sign only)
(6) RSA (encrypt only)
Your selection? 2
DSA keypair will have 1024 bits.
Please specify how long the key should be valid.
0 = key does not expire
<n> = key expires in n days
<n>w = key expires in n weeks
<n>m = key expires in n months
<n>y = key expires in n years
Key is valid for? (0) 2y
Key expires at Sat Jul 20 20:14:05 2013 PDT
Is this correct? (y/N) y
Really create? (y/N) y
We need to generate a lot of random bytes. It is a good idea to perform
some other action (type on the keyboard, move the mouse, utilize the
disks) during the prime generation; this gives the random number
generator a better chance to gain enough entropy.
.....
pub 1024D/1797CA3D created: 2011-07-22 expires: 2013-07-21 usage: SC
trust: ultimate validity: ultimate
sub 1024D/B7D648C4 created: 2011-07-22 expires: 2013-07-21 usage: S
[ultimate] (1). Mozilla Software Releases <releases@mozilla.org>
Command>
Command>
Command> addkey
Key is protected.
You need a passphrase to unlock the secret key for
user: "Mozilla Software Releases <releases@mozilla.org>"
1024-bit DSA key, ID 1797CA3D, created 2011-07-22
Please select what kind of key you want:
(2) DSA (sign only)
(4) Elgamal (encrypt only)
(5) RSA (sign only)
(6) RSA (encrypt only)
Your selection? 4
ELG-E keys may be between 1024 and 4096 bits long.
What keysize do you want? (2048)
Requested keysize is 2048 bits
Please specify how long the key should be valid.
0 = key does not expire
<n> = key expires in n days
<n>w = key expires in n weeks
<n>m = key expires in n months
<n>y = key expires in n years
Key is valid for? (0) 2y
Key expires at Sat Jul 20 20:14:53 2013 PDT
Is this correct? (y/N) y
Really create? (y/N) y
We need to generate a lot of random bytes. It is a good idea to perform
some other action (type on the keyboard, move the mouse, utilize the
disks) during the prime generation; this gives the random number
generator a better chance to gain enough entropy.
...
pub 1024D/1797CA3D created: 2011-07-22 expires: 2013-07-21 usage: SC
trust: ultimate validity: ultimate
sub 1024D/B7D648C4 created: 2011-07-22 expires: 2013-07-21 usage: S
sub 2048g/46784661 created: 2011-07-22 expires: 2013-07-21 usage: E
[ultimate] (1). Mozilla Software Releases <releases@mozilla.org>
Command>
Command> list
pub 1024D/1797CA3D created: 2011-07-22 expires: 2013-07-21 usage: SC
trust: ultimate validity: ultimate
sub 1024D/B7D648C4 created: 2011-07-22 expires: 2013-07-21 usage: S
sub 2048g/46784661 created: 2011-07-22 expires: 2013-07-21 usage: E
[ultimate] (1). Mozilla Software Releases <releases@mozilla.org>
Command>
Command> quit
Save changes? (y/N) y
$
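A side note on the dates in the transcript above: a validity of "2y" entered on 2011-07-22 gives an expiry of 2013-07-21, not 2013-07-22. As far as I understand it, gpg counts a year as a flat 365 days, and 2012 was a leap year. The sketch below reproduces that arithmetic in Python. The helper `expiry` and the table `DAYS_PER_UNIT` are made up for illustration (the 30-day month in particular is my assumption about gpg's behaviour), and the creation timestamp is inferred from the "Key expires at" line, not read from the key.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed day counts behind gpg's <n>, <n>w, <n>m, <n>y validity prompts.
DAYS_PER_UNIT = {"": 1, "d": 1, "w": 7, "m": 30, "y": 365}


def expiry(created: datetime, spec: str) -> Optional[datetime]:
    """Return the expiry time for a validity spec like '2y', or None for '0'."""
    if spec == "0":
        return None  # key does not expire
    unit = spec[-1] if spec[-1].isalpha() else ""
    count = int(spec[:-1] if unit else spec)
    return created + timedelta(days=count * DAYS_PER_UNIT[unit])


# Creation time inferred from the transcript: Thu Jul 21 20:06:32 2011 PDT.
created = datetime(2011, 7, 22, 3, 6, 32, tzinfo=timezone.utc)
expires = expiry(created, "2y")
print(expires.date())  # 2013-07-21
pdt = timezone(timedelta(hours=-7))
print(expires.astimezone(pdt))  # 2013-07-20 20:06:32-07:00
```

That matches both the "expires: 2013-07-21" shown in the key listing (UTC) and the "Sat Jul 20 20:06:32 2013 PDT" shown at the prompt.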
3) create the public key file.
[snip]
Create a new text file “KEY” containing the following boilerplate text:
This file contains the PGP keys of various developers that work on
Mozilla and its subprojects (such as Firefox and Thunderbird).
Please don’t use these keys for email unless you have asked the owner
because some keys are only used for code signing.
Please realize that this file itself or the public key servers may be
compromised. You are encouraged to validate the authenticity of these keys in an out-of-band manner.
[snip]
3a) Append the following to “KEY” text file:
$ gpg --fingerprint --list-sigs releases@mozilla.org >> KEY
$ gpg --armor --export releases@mozilla.org >> KEY
4) Verify the private key / public key pair work
4a) on signing machine:
*) create a small readme.txt file
*) $ gpg --armor --detach-sig readme.txt
*) transfer KEY, readme.txt, readme.txt.asc to another machine
4b) on another machine
$ gpg --import KEY
$ gpg --verify readme.txt.asc readme.txt
gpg: Signature made Thu Jul 21 22:08:21 2011 PDT using DSA key ID C52175E2
gpg: Good signature from "Mozilla Software Releases <releases@mozilla.org>"
gpg: WARNING: This key is not certified with a trusted signature!
gpg: There is no indication that the signature belongs to the owner.
Primary key fingerprint: 9D03 193D 6BDC 541B D796 C4E4 7F4D 6645 1EBC AB3A
Subkey fingerprint: 247C A658 AA95 F617 1EB0 F13E A7D7 5CC7 C521 75E2
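If you want to script the check in step 4b rather than eyeball the output, gpg can emit machine-readable status lines via `--status-fd`. Here is a minimal Python sketch, assuming gpg is on the PATH and KEY has already been imported. The function names `good_signature_key` and `verify` are my own, and this is not what our release automation does; it is just one way to test for a "GOODSIG" status line.

```python
import subprocess
from typing import Optional


def good_signature_key(status_output: str) -> Optional[str]:
    """Return the key ID from the first GOODSIG status line, or None."""
    for line in status_output.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "[GNUPG:]" and parts[1] == "GOODSIG":
            return parts[2]
    return None


def verify(signature: str, data: str) -> Optional[str]:
    """Run `gpg --verify` and return the signing key ID if the signature is good."""
    result = subprocess.run(
        ["gpg", "--status-fd", "1", "--verify", signature, data],
        capture_output=True,
        text=True,
    )
    # Status lines go to stdout (fd 1); the human-readable text stays on stderr.
    return good_signature_key(result.stdout)


if __name__ == "__main__":
    key_id = verify("readme.txt.asc", "readme.txt")
    print("good signature from key", key_id if key_id else "(none)")
```

A good signature only says the file matches the key; as the WARNING lines in the transcript point out, you still need to check the fingerprint out-of-band before trusting the key itself.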
5) Post the template public keyfile “KEY” as a patch for review, and checkin. This checked in file will later be posted by the automation alongside the signed builds.
6) Post the template public keyfile to http://pgp.mit.edu, http://wwwkeys.pgp.net and other keymasters.
7) all done – declare victory!
13 Jul 2011
JohnMozilla
On ftp.m.o, we now use the full BuildID in the directory names for nightly builds. For example:
Before: ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2011-06-20-12-mozilla-central
After: ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2011-07-13-03-07-41-mozilla-central/
This is actually more important than it might first appear, because it means our automation will avoid user-visible problems we’ve hit in the past. Occasionally, we get two nightly/clobber builds within an hour, and hence both builds were placed into the same dated directory. This caused problems for people who downloaded one build and then got updates expecting the other build, leaving them with a broken install.
Further, we have lots of munging code that deals with different directory-name-formats on ftp.m.o. Each time we fix a bug like this, it means we can then trim and refactor our automation code even further, making the remaining code cleaner, easier to maintain and more reliable.
There were a million-and-one little details to keep straight, in order to make sure nothing broke during this change. Most of these were surprise, undocumented dependencies – all fun to figure out and debug. Apart from a brief problem with updates, nothing broke when coop rolled this out. Nice work coop, thank you!
(If curious, there are lots more details in bug#449607, and in coop’s blogpost#1, blogpost#2.)
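To make the collision concrete, here is a small Python sketch of the two naming schemes. It assumes a BuildID is a 14-digit YYYYMMDDHHMMSS timestamp; the helper names `old_dir` and `new_dir` and the second BuildID are made up for illustration, and this is not the actual automation code.

```python
from datetime import datetime

BUILD_ID_FORMAT = "%Y%m%d%H%M%S"  # assumed BuildID layout: YYYYMMDDHHMMSS


def old_dir(build_id: str, branch: str = "mozilla-central") -> str:
    """Old layout: only the date and the hour make it into the directory name."""
    ts = datetime.strptime(build_id, BUILD_ID_FORMAT)
    return ts.strftime("%Y-%m-%d-%H-") + branch


def new_dir(build_id: str, branch: str = "mozilla-central") -> str:
    """New layout: the full BuildID, down to the second."""
    ts = datetime.strptime(build_id, BUILD_ID_FORMAT)
    return ts.strftime("%Y-%m-%d-%H-%M-%S-") + branch


first = "20110713030741"   # matches the "After" example above
second = "20110713034502"  # hypothetical second build in the same hour

print(old_dir(first), old_dir(second))
# 2011-07-13-03-mozilla-central 2011-07-13-03-mozilla-central
print(new_dir(first), new_dir(second))
# 2011-07-13-03-07-41-mozilla-central 2011-07-13-03-45-02-mozilla-central
```

With the old scheme both builds land in the same directory; with the full BuildID they cannot, unless two builds somehow start in the same second.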
(Closing this also closes our 5th oldest bug – filed on 07aug2008 – which makes seeing coop complete this even sweeter! Thanks to coop for grabbing this bug from me, and driving it down.)
06 Jul 2011
JohnMozilla
If that question doesn’t make sense, I should explain: behind the scenes, there are two big projects underway for Android:
1) coordinate getting the remaining orange tests fixed for Android.
These are tests that pass green on desktop, and for Maemo, but are failing on Android. Some are intermittently orange; some are perma-orange. This involves some RelEng fixes, but mostly involves coordinating work by ATeam, Developers and QA. The current list of bugs is here. We’d always love any help we can get fixing those tests!
2) increase the number of tegras we have in production.
Currently we own 96 tegras, of which 86 are online, and the rest are physically broken or waiting for reimaging. Our reimage/reboot/status-tracking for these boards is going fairly well now, and we’ve been able to keep this rate of machines up for weeks now.
However, now that the imaging/production process is stable, we can focus on the next part.
There is a limit on how many jobs these slow boards can process in a day. That means developers wait a relatively long time for Android test results. To keep the wait times from being even worse, RelEng restricts which branches have Android tests enabled on them. Now that we have racks to put them in, we’re getting 200 more tegra boards made for us. They don’t have these in stock, so we’re receiving them in batches as they are made. Once we get these into production, and we can enable Android tests across the board, we’ll have a better idea of what our real load profile is, and can order more if needed.
To speed up delivery, I physically drove with jhford over to nvidia to collect this first batch of boards.
(Aside: After all the formal emails, and purchasing paperwork, it was great to chat with the guys in shipping, who didn’t know what to make of me, but were really really helpful. I love directions that end with “… ok, so then drive past the rolldown doors, park between the two dumpsters, and knock on the unmarked door”. 30mins later, there’s a group of us outside playing real-life Tetris to get all the boxes to fit in the car. It worked! Thanks Genaro!!)

This first batch proved that you can fit 40 tegras, as well as two Release Engineers into an Audi, and still have room to see out the windows for the drive back to Mozilla!!
01 Jul 2011
JohnMozilla
By now everyone knows that Firefox5.0 shipped, for desktop and for mobile, on 21June2011. That has already been covered elsewhere in great detail. However, now that the dust has settled, there are some behind-the-scenes details that I felt were important to draw attention to.
1) The Firefox5.0 and Fennec5.0 releases were both based on the *same* identical changeset.
That wasn’t a coincidence – since 5.0beta2, every beta leading up to the Firefox5.0 release and Fennec 5.0 release was built from the same *identical* changeset. This was a major new milestone, made possible by streamlined infrastructures, and has important consequences for Mozilla’s ability to quickly find/fix/deliver security releases to protect our users.
2) We shipped Firefox5.0, Fennec5.0 and Firefox 3.6.18 all on the same day.
Shipping a major “new feature” release is tricky business – there’s a lot of fiddly details, and it’s typically “all hands on deck”. Because of this, we used to make sure *nothing* else was scheduled anywhere near a major release day. However, shipping a major release and announcing the security fixes in it, without also shipping a fix for older branches, can be seen as a mixed message to our users on the older branches. Simultaneously shipping the same security fixes in security releases for the older supported branches is what’s best for our entire set of users. However, this is tricky to do, and requires a lot of extra planning.
The first time we felt organized enough to safely ship a major “new feature” release alongside security releases for older branches was when we shipped Firefox4.0 and Firefox3.6.x and Firefox3.5.x on the one day. It was a very long, hectic 14 hour day (from 7am – ~9pm PDT), but it meant we could protect all Firefox users at the same time from a late breaking security exploit. Fennec4.0 had to wait and ship a week later. By contrast, when we shipped Firefox5.0, Fennec5.0 and Firefox3.6.18, we were all done in a calm orderly 4 hours (from 7am – ~11am PDT).
3) In the days leading up to the Firefox5.0/Fennec5.0/3.6.18 release, we did nine last-minute betas/releases in quick succession.
This fast turnaround was only possible because RelEng’s ongoing automation improvements have reduced our build times from 45hours down to 8hours. It was only because of this fast turnaround that Mozilla could accommodate some last minute fixes without impacting the release schedule. I feel it’s important to point out that, while there were tight deadlines, and a lot of very precise fast moving footwork, this was all done without burning out the humans in RelEng working those releases.
As Gary pointed out in a brief celebration speech Tuesday, it’s a big deal to change Mozilla’s development culture from “ship one big bang product when it’s ready” to “ship lots of smaller incremental products much more frequently”. Doing that transition in such a short timeframe is super impressive, and I believe was made possible by the infrastructure work from RelEng. Switching from the one-track model (trunk in cvs) to the multiple-concurrent-tracks model (currently ~35 active project branches in mercurial). Stabilizing the new mobile product infrastructure, and integrating it with the stable desktop product automation. Relentlessly improving our automation. “baby steps… relentless baby steps”. The list goes on and on…
I’m immensely proud of the quiet behind-the-scenes work that RelEng has done over the last 4 years to make this faster-release-cadence environment possible here at Mozilla. Thank you aki, armen, bear, bhearsum, catlee, coop, dustin, jhford, lsblakk, rail, nthomas.
27 Jun 2011
JohnMozilla
Now that Firefox5.0 shipped, we’ve got time to go back and do some cleanup. We’re going to disable the Firefox3.5 jobs this week.
This was already announced in last week’s platform meeting, but after all those years, it’s quite possible that people are relying on those jobs in ways we do not even know about. Hence this widespread notice. If you have any reasons these Firefox 3.5 jobs should be left running, please let us know by commenting in bug#666407.
What will change:
- No FF3.5.x incremental/depend/hourly builds will be produced.
- No FF3.5.x clobber/nightly builds will be produced.
- No FF3.5.x release builds will be produced.
- The FF3.5 waterfall page will be removed from tinderbox. Specifically, this page http://tinderbox.mozilla.org/showbuilds.cgi?tree=Firefox3.5 will go away as it will be empty.
What will *not* change:
- Existing FF3.5.x builds would still be available for download from http://ftp.mozilla.org/pub/mozilla.org/firefox/releases/
- Existing update offers would still be available. For example:
- FF3.5.14 users can still update to FF3.5.19.
- FF3.5.19 users can still update to latest FF3.6.x release (which is FF3.6.18 as of this writing).
- Newly revised major update offers, like from FF3.5.19 -> a future FF3.6.19 release, could still be produced if needed
- Any mozilla-1.9.1 machines which are not Firefox specific should continue to run as usual.
Why do this:
- Free up compute cycles in the shared production pool-of-slaves or try pool-of-slaves. This will help make life better for all other jobs.
- Reduce manual support workload and systems complexity for RelEng and IT.
- Allows us to speed up making changes to infrastructure code, as there’s no longer a need to special-case and retest FF3.5 specific situations.
If you have any reasons that these Firefox3.5 jobs should continue running, please comment in bug#666407. Now.
Yes, really.
Now.
Thanks
John.