An Elegant Puzzle- Systems of Engineering Management Read online

Page 2


  The guiding principles I use for sizing teams are:

  • Managers should support six to eight engineers

  This gives them enough time for active coaching, coordinating, and furthering their team’s mission by writing strategies,2 leading change,3 and so on.

  Tech Lead Managers (TLMs). Managers supporting fewer than four engineers tend to function as TLMs, taking on a share of design and implementation work. For some folks this role can uniquely leverage their strengths, but it’s a role with limited career opportunities. To progress as a manager, they’ll want more time to focus on developing their management skills. Alternatively, to progress toward staff engineering roles, they’ll find it difficult to spend enough time on the technical details.

  Coaches. Managers supporting more than eight or nine engineers typically act as coaches and safety nets for problems. They are too busy to actively invest in their team or their team’s area of responsibility. It’s reasonable to ask managers to support larger teams during the transition to a more stable configuration, but it is a bad status quo.

  • Managers-of-managers should support four to six managers

  This gives them enough time to coach, to align with stakeholders, and to do a reasonable amount of investment in their organization. On the other hand, it will also keep them busy enough that they won’t be tempted to create work for their team.

  Ramping up. Managers supporting fewer than four other managers should be in a period of active learning on either the problem domain or on transitioning from supporting engineers to supporting managers. In the steady state, this can lead to folks feeling underutilized, or being tempted to meddle in daily operations.

  Coaches. Similar to supporting a large team of engineers, supporting a large team of managers leaves you functioning purely as a problem-solving coach.

  Figure 2.2

  A team composed of two working groups, an on-call rotation, and different tenured engineers.

  • On-call rotations want eight engineers

  For production on-call responsibilities,4 I’ve found that two-tier 24/7 support requires eight engineers. As teams holding their own pagers have become increasingly mainstream, this has become an important sizing constraint, and I try to ensure that every engineering team’s steady state is eight people.

  Shared rotations. It is sometimes necessary to pool multiple teams together to reach the eight engineers necessary for a 24/7 on-call rotation. This is an effective intermediate step toward teams owning their own on-call rotations, but it is not a good long-term solution. Most folks find being on-call for components that they’re unfamiliar with to be disproportionately stressful.

  • Small teams (fewer than four members) are not teams

  I’ve sponsored quite a few teams of one or two people, and each time I’ve regretted it. To repeat: I have regretted it every single time. An important property of teams is that they abstract the complexities of the individuals that compose them. Teams with fewer than four individuals are a sufficiently leaky abstraction that they function indistinguishably from individuals. To reason about a small team’s delivery, you’ll have to know about each on-call shift, vacation, and interruption.

  They are also fragile, with one departure easily moving them from innovation back into toiling to maintain technical debt.

  Keep innovation and maintenance together. A frequent practice is to spin up a new team to innovate while existing teams are bogged down in maintenance. I’ve historically done this myself, but I’ve moved toward innovating within existing teams.5 This requires very deliberate decision-making and some bravery, but in exchange you’ll get higher morale and a culture of learning, and will avoid creating a two-tiered class system of innovators and maintainers.

  Fitting together those guiding principles, the playbook that I’ve developed is surprisingly simple and effective:

  Teams should be six to eight during steady state.

  To create a new team, grow an existing team to eight to ten, and then bud into two teams of four or five.

  Never create empty teams.

  Never leave managers supporting more than eight individuals.

  Like all guidelines, this is a structure to aid thinking through sizing problems, not a straitjacket to restrict every exception. The context of any situation deserves careful examination, but increasingly I’ve found that the long-term costs of exceptions outweigh what I once considered their strengths.

  2.2 Staying on the path to high-performing teams

  A friend is six months into supporting a 60 person engineering group. Perhaps unsurprisingly, most of their teams believe that they have urgent hiring needs. Should my friend spread hiring equally across the teams in need, or focus hiring on just one or two teams until their needs are fully staffed? That is the question.

  It’s a great question, and captures a deeply challenging aspect of leading an organization. It’s fun to do initial discovery, learning from and about everyone. The rare moment when you choose to reorganize the team6 is painful, but concludes quickly. What’s much harder is keeping the faith when you’ve played your cards and need to find space for your plans to come to fruition. Staying the course is particularly fraught when it comes to growing an organization, because some teams always need more than you choose to provide.

  When you talk about growing an organization, the conversation usually leads to hiring. While I believe that hiring is a very important approach to growing organizations, I also believe that we reach for it too often. In order to prioritize hiring for scenarios in which it’ll do the most good, over the past year I’ve developed a loose framework for reasoning about what a given team needs to increase performance.

  Figure 2.3

  Four states of a team.

  2.2.1 Four states of a team

  The framework starts with a vocabulary for describing teams and their performance within their surrounding context.

  Teams are slotted into a continuum of four states:

  A team is falling behind if each week their backlog is longer than it was the week before. Typically, people are working extremely hard but not making much progress, morale is low, and your users are vocally dissatisfied.

  A team is treading water if they’re able to get their critical work done, but are not able to start paying down technical debt or begin major new projects. Morale is a bit higher, but people are still working hard, and your users may seem happier because they’ve learned that asking for help won’t go anywhere.

  A team is repaying debt when they’re able to start paying down technical debt, and are beginning to benefit from the debt repayment snowball: each piece of debt you repay leads to more time to repay more debt.

  A team is innovating when their technical debt is sustainably low, morale is high, and the majority of work is satisfying new user needs.

  Teams want to climb from falling behind to innovating, while entropy drags them backward. Each state requires a different tact.

  Figure 2.4

  Four stages of a team’s progress, from falling behind to innovating.

  2.2.2 System fixes and tactical support

  In this framework, teams transition to a new state exclusively by adopting the appropriate system solution for their current state. As a manager, your obligation is to identify the correct system solution for a given transition, initiate that solution, and then support the team as best you can to create space for the solutions to work their magic. If you skip to supporting the team tactically before initiating the correct system solution, you’ll exhaust yourself with no promise of salvation. For each state, here is the strategic solution that I’ve found most effective, along with some ideas about how to support the team while that solution comes to fruition:

  When the team is falling behind, the system fix is to hire more people until the team moves into treading water. Provide tactical support by setting expectations with users, beating the drum around the easy wins you can find, and injecting optimism.

  As a cavea
t, the system fix is to hire net new people, increasing the overall capacity of the company. Sometimes people instead attempt to capture more resources from the existing company, and I’m pretty negative on that. People are not fungible, and generally folks end up in useful places, so I’m skeptical of reassigning existing individuals to drive optimality. By nature, it’s also impossible for this kind of discussion to not become political, even when everyone involved has deep trust in and respect for each other.

  When the team is treading water, the system fix is to consolidate the team’s efforts to finish more things, and to reduce concurrent work until they’re able to begin repaying debt (e.g., limit work in progress). Tactically, the focus here is on helping people transition from a personal view of productivity to a team view.

  When the team is repaying debt, the system fix is to add time. Everything is already working, you just need to find space to allow the compounding value of paying down technical debt to grow. Tactically try to find ways to support your users while also repaying debt, to avoid disappearing into technical debt repayment from your users’ perspective. Especially for a team that started out falling behind and is now repaying debt, your stakeholders are probably antsy waiting for the team to start delivering new stuff, and your obligation is to prevent that impatience from causing a backslide!

  Innovating is a bit different, because you’ve nominally reached the end of the continuum, but there is still a system fix! In this case, it’s to maintain enough slack in your team’s schedule that the team can build quality into their work, operate continuously in innovation, and avoid backtracking. Tactically, ensure that the work your team is doing is valued: the quickest path out of innovation is to be viewed as a team that builds science projects, which inevitably leads to the team being defunded.

  I can’t stress enough that these fixes are slow. This is because systems accumulate months or years of static, and you have to drain that all away. Conversely, the same properties that make these fixes slow to fix make them extremely durable once in effect!

  The hard part is maintaining faith in your plan—both your faith and the broader organization’s faith. At some point, you may want to launder accountability through a reorg, or maybe skip out to a new job, but if you do that you’re also skipping the part where you get to learn. Stay the path.

  Figure 2.5

  Return on investment when consolidating versus spreading investment.

  2.2.3 Consolidate your efforts

  As an organizational leader, you’ll be dealing with a number of teams, each of which is in a different place on this continuum. You’ll also have limited resources to apply, and they’ll usually be insufficient to simultaneously move every team down the continuum. Many folks try to move all teams at the same time, peanut buttering7 their limited resources, but resist that indecision-framed-as-fairness: it’s not a fair outcome if no one gets anything.

  For each constraint, prioritize one team at a time. If most teams are falling behind, then hire onto one team until it’s staffed enough to tread water, and only then move to the next. While this is true for all constraints, it’s particularly important for hiring.

  Adding new individuals to a team disrupts that team’s gelling process, so I’ve found it much easier to have rapid growth periods for any given team, followed by consolidation/gelling periods during which the team gels. The organization will never stop growing, but each team will.

  2.2.4 Durable excellence

  This approach to nurturing great organizations is the opposite of a quick fix. While it’s slow, I’ve found that it consistently leads to enduring, real improvement in the happiness and throughput of an organization. Most importantly, these improvements stick around long enough to compound, creating a durable excellence.

  2.3 A case against top-down global optimization

  After I wrote “Staying on the Path to High-Performance Teams,”8 quite a few people asked the same follow-up question: “Once a team has repaid its technical debt, shouldn’t the now surplus team members move to other teams?”

  This makes a lot of sense, because the team, with so little technical debt left, is now overstaffed relative to its global priority. Repeated across many teams, this could lead to an organization having far too many engineers allocated against last year’s problems, and too few against today’s.

  This is an important problem to address!

  First, let me explain why I’m skeptical of reallocating individuals to address global priority shifts, and then I’ll suggest a couple alternative approaches to this conundrum.

  2.3.1 Team first

  Fundamentally, I believe that sustained productivity comes from high-performing teams, and that disassembling a high-performing team leads to a significant loss of productivity, even if the members are fully retained. In this worldview, high-performing teams are sacred, and I’m quite hesitant to disassemble them.

  Teams take a long time to gel. When a group has been working together for a few years, they understand each other and know how to set each other up for success in a truly remarkable way. Shifting individuals across teams can reset the clock on gelling, especially for teams in the early stages of gelling, and when there are significant differences in team culture. That’s not to say that you want teams to never change—that leads to stagnation—but perhaps preserving a team’s gelled state requires restrained growth.

  Sometimes you will want to grow faster than a gelled team allows, and that’s okay! The lesson is that you have to account for re-gelling costs after periods of change, not that you should never change them. This is part of why my proposed model9 recommends rapidly hiring into teams loaded down by technical debt, not into innovating teams, which avoids incurring re-gelling costs on high-performing teams.

  2.3.2 Fixed costs

  Another reason that I lean away from moving folks off high-performing teams is that most teams have high fixed costs and relatively small variable costs: moving one person can shift an innovating team back into falling behind, and now neither team is doing particularly well. This is especially true on teams responsible for products and services.

  My rule of thumb is that it takes eight engineers on a team to support a two-tier on-call rotation, so I’m generally reluctant to move any team with membership below that line. However, fixed costs come in many other varieties: “keeping the lights on” work, precommitted contracts, support questions from other teams, etc.

  There are some teams with very low fixed costs—a startup without any users, a team supporting a product that you’ve turned off entirely—and I suspect that the rules for those teams are different. I also suspect that such teams are quite uncommon in successful companies.

  2.3.3 Slack

  The premise of moving folks to optimize global efficiency also implies a deeper understanding of how productivity is generated than I’ve ever personally achieved. I’m a strong believer in not adding more resources to a team with visible slack, but I’m less convinced that the inverse applies.

  The expected time to complete a new task approaches infinity as a team’s utilization approaches 100 percent, and most teams have many dependencies on other teams. Together, these facts mean you can often slow a team down by shifting resources to it, because doing so creates new upstream constraints.

  In further defense of slack, I find that teams put spare capacity to great use by improving areas within their aegis, in both incremental and novel ways. As a bonus, they tend to do these improvements with minimal coordination costs, such that the local productivity doesn’t introduce drag on the surrounding system.

  Most importantly, “slackful” teams function as an organizational debugger: you don’t have to consider them when debugging the overall organizational throughput. I’ve found it much easier to work a couple constraints at a time, solving forward without needing to revisit previous constraints.

  The Goal by Eliyahu M. Goldratt10 and Thinking in Systems: A Primer by Donella H. Meadows11 are both phenomenal books on thi
s topic.

  Figure 2.6

  Fixed costs of running a team.

  2.3.4 Shift scope; rotate

  Okay, so what does work? I’ve found it most fruitful to move scope between teams, preserving the teams themselves. If a team has significant slack, then incrementally move responsibility to them, at which point they’ll start locally optimizing their expanded workload. It’s best to do this slowly to maintain slack in the team, but if it’s a choice of moving people rapidly or shifting scope rapidly, I’ve found that the latter is more effective and less disruptive.

  Shifting scope works better than moving people because it avoids re-gelling costs, and it preserves system behavior. Preserving behavior keeps your existing mental model intact, and if it doesn’t work out, you can always revert a workload change with less disruption than would be caused by a staffing change.

  The other approach that I’ve seen work well is to rotate individuals for a fixed period into an area that needs help. The fixed duration allows them to retain their identity and membership in their current team, giving their full focus to helping out, rather than splitting their focus between performing the work and finding membership in the new team. This is also a safe way to measure how much slack the team really has!

  A coworker of mine suggested that some companies have very successfully moved toward the swarming model (at the organization level, not just at the team level), and I hope that I eventually get a chance to hear from people who’ve successfully gone the other direction! One of the most exciting aspects of organizational design is that there are so many different approaches that work well.