Jess Blevins

Posted on Mar 29, 2021

Tanking For Your Team: A Tech Lead's Guide to the On-Call Developer Role

#oncall #devops #productivity #webdev

The Purpose of This Blog Post

This blog post is a comprehensive look at what I believe it means to be an effective on-call developer for your team. The on-call developer can be a rewarding and deeply impactful role for every team to have alongside other scrum team roles. I will share my thoughts about why the on-call developer can and should do more than respond to pager duty.

This blog post is also a love letter to team-based ownership of code features and microservices. The on-call developer can only be so effective without having ownership of the code they are responsible for maintaining.

This blog post is a little bit about what we can learn from role-playing games.

"Iconic Party" from Dungeons & Dragons Player Handbook, 5th edition.

Introduction

What does it mean for a developer to be on-call? I'll start with a definition from Atlassian Incident Management:

On call is the practice of designating specific people to be available at specific times to respond in the event of an urgent service issue, even though they are not formally on duty.

Usually, an "urgent service issue" refers to a high severity incident affecting end users. When the Network Operations Center or Site Reliability Engineers are paging the on-call developer, it's probably an urgent service issue.

However, non-urgent-but-still-more-urgent-than-next-sprint issues come up all the time. That's why we can expand the on-call developer role to include shielding the rest of the development team by taking care of any interruptions during a sprint.

I believe we can solve a lot of organizational problems by creating room for the On-Call Developer role alongside the other important scrum team roles (e.g. Product Owner, Scrum Master). It is possible to make the role work so that your organization’s knowledge management and incident response times scale well. Most importantly, this role gets rid of the need for singularly heroic acts from tech leads.

Tanking For Your Team – An Analogy

Johanna, Crusader of Zakarum, from "Heroes of the Storm." Blizzard Entertainment.

What is a tank?

In role-playing games (RPGs), a team’s “tank” is the character class designed first to withstand damage for the team and, secondly, to draw the enemy’s attention to itself.

What does this have to do with being an on-call developer?

In writing this post, I realized more and more that the on-call developer works as the team’s tank. The rest of the development team are damage dealers, trying to burn down sprint goals. Using this analogy, the on-call developer needs to:

Be able to protect themselves (and our business) from the damage of costly interruptions.
“Pull aggro” - draw attention to themselves so that nobody else is unnecessarily distracted.

When a team’s enemies are focused on your tank, your team’s damage dealers are free to maximize their damage per second (DPS) on the team’s main targets. By balancing dungeon encounter responsibilities across the different classes, the entire team can achieve glorious victory!

This is also true for your scrum team. 🌈

"Damage" Mitigation -- Ensuring the On-Call Developer is Supported

Reinhardt and Lúcio from "Overwatch." Blizzard Entertainment.

Before setting up your on-call rotation, how do you ensure your developers have the support they need to handle high severity issues? Being on-call is stressful, and a long Mean Time To Repair (MTTR) for a high-severity incident can cost the company a lot of money.

In other words, how can we keep our Tank from immediately getting squished by a large mob? What is the developer version of plate armor?

I have some suggested prerequisites for putting together an on-call rotation. These prerequisites are designed to support your on-call developer and make this process as painless as possible:

Your tech lead has onboarded the dev team to the features and/or microservices that your team owns.
- Ideally, those features and microservices are well-documented.
You have runbooks (i.e. how-to guides) for handling high severity issues.
You have a system in place for knowledge transfers across rotation changes.
On-call rotations should not be too large (or too small). The ideal size is probably ~5 developers. On a much larger rotation, developers do not perform on-call duties often enough to gain expertise and sharpen the skills that reduce MTTR.
At the same time, individuals should not be on-call more than 25% of the time, or there is a risk of burnout.
You have created a psychologically safe environment.
- At a minimum, you should provide psychological safety for your on-call rotation by using blameless post-mortems. You should also cultivate a culture of teaching and learning.
The team’s Engineering Manager (EM) and Product Owner (PO) understand that middle-of-the-night pages are MAJOR issues and should not occur regularly. If an on-call developer is responding to pages outside working hours, that person needs to be compensated appropriately.
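The rotation-size guidance above comes down to simple arithmetic. Here is a minimal sketch, assuming exactly one developer is on-call at a time and shifts are split evenly across the rotation (both are my assumptions, as are the function names and the one-week default shift length):

```python
from fractions import Fraction

# The 25% ceiling on individual on-call time suggested in the list above.
MAX_ON_CALL_SHARE = Fraction(1, 4)


def on_call_share(rotation_size: int) -> Fraction:
    """Share of time each developer spends on-call.

    Assumes one developer is on-call at a time and shifts are split evenly.
    """
    if rotation_size < 1:
        raise ValueError("rotation_size must be at least 1")
    return Fraction(1, rotation_size)


def weeks_between_shifts(rotation_size: int, shift_weeks: int = 1) -> int:
    """Weeks from the start of one shift to the start of the same person's next."""
    if rotation_size < 1 or shift_weeks < 1:
        raise ValueError("rotation_size and shift_weeks must be at least 1")
    return rotation_size * shift_weeks


if __name__ == "__main__":
    for size in (3, 5, 14):
        share = on_call_share(size)
        verdict = "ok" if share <= MAX_ON_CALL_SHARE else "burnout risk"
        print(
            f"{size} devs: on-call {float(share):.0%} of the time, "
            f"one shift every {weeks_between_shifts(size)} weeks ({verdict})"
        )
```

Under these assumptions, a rotation of five puts each developer on-call 20% of the time, which stays under the 25% ceiling, while a rotation of three exceeds it.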

Onboarding, Runbooks, and Shift Notes

The first form of risk mitigation is to have your tech leads or Subject Matter Experts (SMEs) train the rest of the development team on the code components or owned microservices. Set an expectation that the SME is not going to be the primary responder to all incidents. Moving forward, that person is there to provide back-up when asked or needed by the on-call developer.

This acts as a forcing function for the SME to write documentation for others to follow.

The next form of mitigation is to have production operation runbooks.

Runbooks are a specific type of documentation tailored to the types of issues you may see happen in production.

Writing a runbook sounds hard – how do you write step-by-step instructions for handling a problem you don't know about yet?

In my experience, most teams kind of do it already! We write documents outlining the team's risk assessment and mitigation plans for features. We write ticket comments about how to handle code that might cause a problem in production. We have release day monitoring practices. We have a vague idea of what to do if something looks bad on a dashboard.

Write that stuff down in the form of a runbook for your on-call developer to reference!

Many times, it’s as simple as having a documentation page for your feature that says, “if x is happening, turn off this feature toggle.” Other times, you want to think about monitoring things like production error logs or a specific Grafana dashboard.

Write. It. Down.
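If it helps to picture what "written down" could look like, here is a rough sketch of the "if x is happening, do y" idea as structured data. Everything in it is hypothetical: the `RunbookEntry` fields, the example symptoms, and the toggle and dashboard names are made up for illustration, and a plain wiki page works just as well.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunbookEntry:
    symptom: str      # what the on-call developer observes
    check: str        # where to look to confirm it
    action: str       # the first mitigation step to take
    escalate_to: str  # who provides back-up if the action does not help


# Hypothetical entries; replace with your team's real failure modes.
RUNBOOK = [
    RunbookEntry(
        symptom="Checkout error rate is climbing",
        check="Production error logs and the checkout Grafana dashboard",
        action="Turn off the new-checkout feature toggle",
        escalate_to="Checkout SME",
    ),
    RunbookEntry(
        symptom="Search requests are timing out",
        check="Search service latency dashboard",
        action="Roll back the most recent search release",
        escalate_to="Search SME",
    ),
]


def find_entries(runbook: list[RunbookEntry], keyword: str) -> list[RunbookEntry]:
    """Return entries whose symptom mentions the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [entry for entry in runbook if needle in entry.symptom.lower()]


if __name__ == "__main__":
    for entry in find_entries(RUNBOOK, "checkout"):
        print(f"If: {entry.symptom}")
        print(f"Check: {entry.check}")
        print(f"Then: {entry.action}")
        print(f"Back-up: {entry.escalate_to}")
```

The point of the structure is only that every entry answers the same four questions, so the on-call developer never has to guess what to check, what to do first, or whom to ask.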

Write down shift notes

Finally, be sure to keep everyone in your rotation up-to-date with shift notes. My current team keeps a record of all Production Incident Notes. We have a template to record the following:

Incident description
Resolution
Root cause
Takeaways and next steps
Timeline of events
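The template fields above could be captured in a small data structure so every incident note has the same shape. This is only a sketch: the `IncidentNote` class and its `render` method are names I made up, and my team's actual template is a document, not code.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentNote:
    description: str
    resolution: str
    root_cause: str
    takeaways: list[str] = field(default_factory=list)
    # Each timeline item is a (time, event) pair, kept as plain strings.
    timeline: list[tuple[str, str]] = field(default_factory=list)

    def render(self) -> str:
        """Render the note as plain text, one template section after another."""
        lines = [
            f"Incident description: {self.description}",
            f"Resolution: {self.resolution}",
            f"Root cause: {self.root_cause}",
            "Takeaways and next steps:",
        ]
        lines += [f"  - {item}" for item in self.takeaways]
        lines.append("Timeline of events:")
        lines += [f"  {when}  {what}" for when, what in self.timeline]
        return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical incident, for illustration only.
    note = IncidentNote(
        description="Checkout page returned errors for some users",
        resolution="Turned off the new-checkout feature toggle",
        root_cause="Unhandled null in the new pricing code path",
        takeaways=["Add an alert on checkout error rate", "Update the runbook"],
        timeline=[("09:02", "Page received"), ("09:15", "Toggle turned off")],
    )
    print(note.render())
```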

On a previous team that owned several microservices, our shift notes included entries about operational duties and maintaining our staging environments.

Keep Your Rotation Small (But Not Too Small!)

For us developers to gain confidence and strengthen our on-call skills, we need to do it regularly! Just like anything else, we need to practice being on-call so that we can get good at it.

In her blog post, “Making On-Call Not Suck,” Molly Struve writes about the characteristics of a broken on-call system and how to fix it. In describing the broken on-call system, she writes:

Eventually, the team was so big that people were going on-call once every 3-4 months. This may seem like a dream come true, but in reality, it was far from it. […] The large rotation meant that on-call shifts were so infrequent that devs were not able to get the experience and reps they needed to know how to handle on-call issues effectively. […] As backward as it may sound, being on-call more is a benefit because devs have become a lot more comfortable with it and are able to really figure out a strategy that works best for them.

Molly's experience rings true to my own at a large enterprise company. I am much more confident and comfortable responding to pages for my specific team than I am for the larger organization's rotation.

In summary, support your on-call rotation by keeping the size at around 5 developers so that people get experience but do not burn out.

But Who's the Healer?!

Anduin Wrynn from "World of Warcraft: Battle for Azeroth." Blizzard Entertainment.

I don't know. Probably the EM.

Pulling Aggro for the Team

In role-playing games, aggro is a mechanism used by enemies to prioritize which characters to attack. The player who generates the most aggro will be preferentially targeted by the enemy. This is referred to as “pulling aggro.”

Eowyn, Shieldmaiden of Rohan, pulling aggro for Théoden.

In a development team, the person with the highest baseline priority for interruption is typically the tech lead. But the tech lead should not always be the team’s tank. There are a couple of reasons for this:

It leads to a low Bus Factor.
It impedes the growth of the entire development team.

To handle this, the tech lead needs to cede responsibilities to the on-call developer and allow that person to pull aggro for the team. This means the on-call developer should be the primary developer responsible for context-switching out of sprint work and handling interruptions. Such responsibilities may include:

Responding to all incident pages.
Release day monitoring.
Helping the team’s PO and EM triage incoming bugs.
Reviewing any non-owner changes made to the team’s owned code components.
Helping the PO and EM respond to stakeholder questions about owned functionality.

Responding to stakeholder interruptions

Sometimes, stakeholders (customer service managers, POs, other developers) can interrupt scrum teams with non-trivial technical questions. And if you work on a team that owns any kind of framework or utility used by other developers, you probably get hit with a lot of questions. (And this is despite your best efforts to document everything for your downstream users! 😢)

I have spent most of my career working on teams responsible for enabling other development teams. Through trial and error, I have found that the most effective way of handling these kinds of interruptions is by explicitly assigning someone the duty of answering them. And it shouldn’t always be the same person.

I suggest the on-call developer because they’re already being interrupted with costly context switching.

Everyone else on the development team can stay concentrated on sprint goals, and the stakeholder gets a prompt and meaningful answer.

With a collective responsibility model, it is not always clear who should take the time to respond, let alone perform any kind of research or analysis. Sometimes it’s the tech lead answering everything, which leads to a low Bus Factor. Sometimes there’s a bystander effect, where anyone or no one may take responsibility for answering incoming questions.

Stakeholder waiting for an answer from a team with a collective responsibility model.

Your team needs to take responsibility for unblocking stakeholders without hurting your own team’s needs. The on-call developer can do that for you while the rest of the development team focuses on the main target – your sprint goals.

Maximizing DPS – Ensuring the rest of your team is effective at burning down sprint goals

Onyxia's Lair from "World of Warcraft." Blizzard Entertainment.

In this last section, I want to talk about the little ways a tank supports their team and makes everyone else better at their job of killing the dragon. (The dragon is your sprint goal.)

Ok... My tanking analogy is falling apart here.

But here’s a section about how the on-call developer role makes everyone else a better and more productive developer working on sprint goals.

Being on-call encourages developers to build a better product

Being on-call makes us better developers when we’re not on-call. By handling production incidents and triaging bugs directly, everyone in the rotation gains the benefit of insight into how our code affects production and our clients. This makes us better day-to-day developers as we work on improving the product or building new ones.

Developers should follow their work downstream – by seeing customer difficulties firsthand, they make better and more informed decisions in their daily work. By doing this, we create feedback on the non-functional aspects of our code – all the elements that are not related to the customer-facing feature – and identify ways that we can improve manageability, operability, and so on. - DevOps Handbook, pg 233

On-call developers can fix things hindering the development team

Ideally, on-call developers should not be expected to perform sprint work. They are already responsible for so much context-switching that anything they can accomplish towards burning down sprint goals is nice-to-have.

I think a better use of the on-call developer’s “free” time is to work on things that improve the on-call experience. This includes things like automating manual processes, improving alerts, writing dev tools -- anything that lowers the operational burden of being on-call is fair game.

I have been smitten with this idea since I saw it in Charity Majors’ blog post, “On-Call Shouldn’t Suck: A Guide for Managers”:

When an engineer is on call, they are not responsible for normal project work — period. That time is sacred and devoted to fixing things, building tooling, and creating guard-rails to protect people from themselves. If nothing is on fire, the engineer can take the opportunity to fix whatever has been annoying them. Allow for plenty of agency and following one’s curiosity, wherever it may lead, and it will be a special treat.

How is this maximizing your sprint output potential? The on-call developer is improving everyone’s output by working on improving our environments and tools. 🛠

Conclusion

Thank you for reading my very geeky blog post about creating an effective on-call developer role for your team. Thanks very much to those who have written about this topic before me.