UC Berkeley “New Manager Bootcamp”

Earlier this week, I had the distinct privilege of being invited to join a panel at UC Berkeley’s “New Manager Bootcamp”.

This was my first time participating on an “expert panel” like this, so I really wasn’t sure what I was getting myself into.

The auditorium was packed with ~90 people, all seasoned professionals from a range of different companies and different industries. They’d spent a bunch of time in workshops, listening and learning in an intensive crash-course. Now the tables were turned – they got to set the pace, and ask all the questions. After intros, and one “warm up” question from the organizer, the free-flow open questions started. From all corners of the room. Non-stop. For 75mins.

The trust and honesty in the room were great, and it was quickly evident that everyone was down-to-earth, asking brutally honest questions simply because they wanted to do right by their new roles and responsibilities.

The first few questions were “easy” black-and-white type questions. Things quickly got interesting with tricky gray-zone questions for the rest of the session. Each panelist responded super-honestly about how we’d each handled those tricky situations. Given that we all came from different backgrounds, different cultures, different careers, it was no surprise that we had different perspectives and attitudes for these gray-zone questions. We even had panelists asking each other questions, live on stage!?! As individual panelists, we didn’t always agree on the mechanics of what we did, but we all agreed on the motivations of *why* we did what we did: taking care of people’s lives, and careers, individually, as part of the group, and as part of the company.

I found this educational, and I hope it was useful for the people asking the questions! Afterwards, I spent time in a nearby coffee shop quietly thinking about the questions, and reliving the different experiences behind the answers I shared on stage.

Unexpectedly, I was also asked to come back the next day, to talk about “we are all remoties”. Turns out that geo-distributed groups were a popular topic of discussion throughout the bootcamp, but I was still surprised at the level of interest when Homa asked for a quick show of “who would be willing to skip lunch for an extra session on remoties” and almost everyone jumped up! The “remoties” presentation was rushed, squeezed between grabbing food-to-go, making sure not to delay the other scheduled sessions, and the flood of questions. Yet, people were fully engaged, sitting on the floor with food, asking great questions, and really excited by what was possible for distributed groups when the mechanics were debugged.

Distributed work groups are obviously a big issue, not just in open source software projects, but also in a lot of other companies in the bay area.

Big thanks to Homa and Kim for putting it all together. The timing of this was fortuitous, and I found myself thinking about possible ideas for Mozilla’s ManagerHacking series that morgamic revived recently and will be coming up again in a few weeks.

Infrastructure load for March 2013

  • #checkins-per-month: We had 6,433 checkins in March 2013. This is well past our previous record of 6,247 in Jan2013. Every working day was consistently busy (>200 checkins per working day) and load-per-day was busy across longer periods of each day.

  • #checkins-per-day: On 18mar, we had 323 checkins – a new record for a single day, breaking our previous record of 307 checkins-per-day on 06jan2013. During March, 20-of-31 days had over 200 checkins-per-day – that’s every working day except 28mar (because of Easter weekend?). 13-of-31 days had over 250 checkins-per-day (3-of-31 days had over 300 checkins-per-day!).
  • #checkins-per-hour: Checkins are still mostly mid-day PT/afternoon ET, but the load has increased across the day. For 9 of every 24 hours, we sustained over 10 checkins per hour, the heaviest sustained use we’ve seen so far across our day. Heaviest load times this month were 2-3pm PT (13.22 checkins-per-hour).
  • As usual, our build pool handled the load well, with >95% of all builds consistently being started within 15mins.

    Our test pool situation continues to improve, as we continue migrating any test jobs that do not *require* hardware to AWS. As before, any test suite which we can run on AWS means double goodness: the AWS-based test suites have great wait times, and the remaining physical-hardware-based test suites have slightly improved wait times because fewer jobs are being scheduled on our scarce hardware. Even so, it’s not yet as great as the situation with our builders. For the tests that *do* require hardware, it continues to be a slow process to bring those additional physical machines online. Meanwhile, RelEng, ATeam and devs continue the work of finding test suites which should (in theory!) be able to run on AWS, then fixing them to make them run green. Once a test suite runs green on AWS, RelEng stops scheduling that test suite on physical machines.

    If you know of any test suites that no longer need to be run per-checkin, please let us know so we can immediately reduce the load a little. Also, if you know of any test suites which are perma-orange, and hidden on tbpl.m.o, please let us know – that’s the worst of both worlds – using up scarce CPU time and not being displayed. Every little bit helps put scarce test CPU to better use.

mozilla-inbound, mozilla-central, fx-team:
Ratios of checkins across these branches remain fairly consistent. mozilla-inbound continues to be heavily used as an integration branch, with 27.9% of all checkins, consistently far more than the other integration branches combined. As usual, fx-team has ~1% of checkins, mozilla-central has 1.6% of checkins.

The lure of sheriff assistance on mozilla-inbound continues to be consistently popular, and as usual, very few people land directly on mozilla-central these days.

mozilla-aurora, mozilla-beta, mozilla-b2g18, gaia-central:
Of our total monthly checkins:

  • 2.4% landed into mozilla-aurora, very similar to last month.
  • 1.6% landed into mozilla-beta, very similar to last month.
  • 1.5% landed into mozilla-b2g18, very similar to last month.
  • 4.8% landed into gaia-central, slightly higher than last month. gaia-central continues to be the third busiest branch overall, after try and mozilla-inbound. Obviously, these checkins are *only* for the B2G releases, so worth calling out here.

misc other details:

  • Pushes per day
    • You can clearly see weekends through the month. It’s worth noting that we had >200 checkins-per-day every working day in March except 28mar (because of Easter weekend?).

    • Pushes by hour of day
        Mid-morning PT is consistently the biggest spike of checkins, although this month the checkin load stayed high throughout the entire PT working day, and particularly spiked between 2-3pm PT, with 13.22 checkins-per-hour.
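The “pushes by hour of day” numbers above come from bucketing push timestamps into hour-of-day slots in Pacific time. As a minimal sketch of that kind of aggregation (the timestamps, the helper name `pushes_by_hour`, and the fixed UTC-7 PDT offset are all illustrative assumptions, not the actual reporting code):

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Illustrative assumption: fixed UTC-7 offset for PDT (March 2013 was on daylight time).
PDT = timezone(timedelta(hours=-7))

def pushes_by_hour(push_times_utc):
    """Count pushes per PT hour-of-day (0..23) from a list of UTC datetimes."""
    counts = Counter(t.astimezone(PDT).hour for t in push_times_utc)
    # Return a full 24-hour histogram, with zero for empty hours.
    return {hour: counts.get(hour, 0) for hour in range(24)}

# Example with made-up timestamps:
pushes = [
    datetime(2013, 3, 18, 21, 5, tzinfo=timezone.utc),   # 2:05pm PT
    datetime(2013, 3, 18, 21, 40, tzinfo=timezone.utc),  # 2:40pm PT
    datetime(2013, 3, 18, 16, 15, tzinfo=timezone.utc),  # 9:15am PT
]
hist = pushes_by_hour(pushes)
print(hist[14], hist[9])  # → 2 1
```

A per-hour histogram like this, averaged over the working days in a month, is what yields figures such as “13.22 checkins-per-hour between 2-3pm PT”.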

Behind the scenes prep for B2G workweek

In case anyone missed this during this morning’s Mozilla Foundation call – here’s a quick summary of all the invisible prep-work that helped make last week’s B2G workweek so awesome.

1) Nightly builds
* now generated for Arm (panda boards), Otoro, Unagi, Unagi-ENG, Inari, Hamachi, Leo
* for that set of devices, we generate “nightly” builds twice a day: one timed for 8am PDT, and one timed for 8am CET in Madrid.
* … on each of mozilla-central, mozilla-b2g18, mozilla-b2g18_v1_0_1

2) Stood up an extra 250 slaves. More importantly, created 22 masters in AWS so we now have 70 masters total (with 30 in AWS) and can quickly burst-grow-capacity to create more slaves if needed.
* Reimaged 80 in-house build & test machines to optimize for Firefox OS development, based on watching load and usage at the last workweek.

3) Set up an alternate “birch” branch as a stand-in for mozilla-inbound. By having b2g workweek developers land on “birch” instead of mozilla-inbound, they got a faster, less crowded branch to land on, and we reduced the risk of them being blocked whenever a non-b2g change closed mozilla-inbound.

Did all that work help? By all accounts yes. But of course, the proof is in the numbers. Last week, 1490 checkins were landed, and all systems stayed super-responsive (>95% of jobs handled on time throughout the week, with one dip to just over 90%!). Impressive to see the infrastructure handle the load like that.

Please give a big hug and thanks to RelEng/ATeam/IT, especially the following:

catlee, rail, hwine, armenzg (RelEng)
ctalbert, jmaher, jgriffin, edmorley, ryanvm (ATeam)
dmoore, arr, fox2mike, vinh, jakem, solarce, sheeri, klibby, sal, van (IT)