DEV Community: Jess Blevins

How to develop a team-based onboarding experience

Jess Blevins — Wed, 10 Aug 2022 16:08:00 +0000

Background/Motivation

In a large enterprise environment, you may have a high-level onboarding experience for all developers at your company. However, such an experience will not necessarily prepare new hires or transfers for actually working in their individual teams.

Within a large company, you may have teams using lots of different technology stacks or practices. There may be no single developer onboarding experience at the company level.

Individual teams have the greatest impact on whether new developers have the material and support they need to do their work. This is a guide for writing your team's specific onboarding material.

Tips and Best Practices for Onboarding New Developers to Your Team

Document everything! The goal is to reduce verbal "tribal knowledge" by writing things down. By documenting everything, you enable developers to work asynchronously and learn things when team members are not available to help. This is especially important for organizations that allow people to work flexible hours or across different time zones. 🌎🌏
At the same time, recognize that people have different learning styles! It's not always going to be appropriate to send a bunch of documentation links to the new member of your team. Try to strike a balance between sharing documentation and talking to your new teammate so that they can learn to help themselves while still connecting with you, their new team.
Make minimal assumptions about what people "should" know! New hires or transfers may join your team with zero experience working with the specific tools and technologies that your team uses. This goes for programming languages, infrastructure, design patterns, Computer Science concepts, etc. Don't worry - this does not mean that you need to write something like a Beginner's Guide to JavaScript. It means that you should be mindful that your new teammate may not be super familiar with a given topic.
Use links to direct people where to learn more about a topic. A "references" section at the top or bottom of your documentation is a helpful place to include links for learning more about topics.
Writing about code is hard! Consider using diagrams! ⏹🔄⏺
Spend some time considering which processes and technologies your code depends on. What do developers need to understand about these dependencies? Do you know who to contact or where to look if you need to learn more about something your team doesn't directly own or maintain?

Team Onboarding Documentation Checklist 📝

Your team has its own documentation page in the space where R&D documentation is kept (e.g. Confluence)
Team contact information. Things to consider:
- How to contact the team for escalations or production incidents 🚨
- Where to ask for help / non-urgent questions 🤔
- How to ask for feature requests 🙏🏻
Team members and their roles
Team Charter or Mission Statement (usually 1-2 sentences describing what your team does)
Team's current Objectives and Key Results
List of owned features and links to their documentation
Slack or Microsoft Teams Channels that team members should know about
- Required channels ⭐️
- Optional channels
Email distribution lists to join
Things to get access to (e.g. authorization roles)
Team processes (if different from company processes)
- What is expected in your team's code review and QA testing process?
- Developer on-call duties 📟
  - Rotation Schedule
  - Requirements (get access to X, subscribe to status updates with Y, etc)
  - Responsibilities (respond to alerts, write on-call shift notes, etc)
  - What to do if you can't respond to a page
  - Links to any feature-level on-call runbooks
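For the "Rotation Schedule" item above, here is a minimal sketch of how a round-robin schedule could be generated before it is pasted into your team page. The team names, the start date, and the one-week shift length are all assumptions for illustration; most paging tools can generate this for you, so treat it as a way to reason about the schedule rather than a replacement for your tooling.

```python
from datetime import date, timedelta
from itertools import cycle, islice


def build_rotation(developers, first_shift_start, shifts, shift_days=7):
    """Assign developers to consecutive shifts in round-robin order.

    Returns a list of (shift_start, shift_end, developer) tuples, where
    shift_end is the last day covered by that shift.
    """
    if not developers:
        raise ValueError("rotation needs at least one developer")
    schedule = []
    for index, developer in enumerate(islice(cycle(developers), shifts)):
        start = first_shift_start + timedelta(days=index * shift_days)
        end = start + timedelta(days=shift_days - 1)
        schedule.append((start, end, developer))
    return schedule


# Hypothetical team and start date, for illustration only.
team = ["Ana", "Ben", "Chen", "Dee", "Eli"]
for start, end, developer in build_rotation(team, date(2022, 8, 15), shifts=6):
    print(f"{start} to {end}: {developer}")
```

With five names and six shifts, the sixth shift wraps back around to the first developer, which makes it easy to see how often each person comes up.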

References

Production-Ready Microservices: Building Standardized Systems Across an Engineering Organization by Susan Fowler
The DevOps Handbook: How to Create World-Class Agility, Reliability, & Security in Technology Organizations by Gene Kim, Jez Humble, John Willis, Patrick Debois
OKR: Learn Google's Goal System with Examples and Templates

Production-Ready Feature Documentation Checklist

Jess Blevins — Wed, 10 Aug 2022 15:44:00 +0000

Background

This list is intended to cover everything developers need to know to maintain a production-ready feature (or microservice). In an enterprise environment, documentation may sometimes just mean linking to existing policies rather than writing your own. Some things in the list may not apply to your feature or microservice at all!

Feature Documentation Checklist 📝

READMEs and Code Comments are Not Documentation
Many developers limit the documentation of their microservices to a README file in their repository or to comments scattered throughout the code. While having a README is essential, and all microservice code should contain appropriate comments, this is not production-ready documentation and requires that developers check out and search through the code.
- Production-Ready Microservices, page 119

The feature or microservice has its own Documentation Page wherever your company keeps R&D Documentation (e.g. Confluence).
A high-level description of the feature.
Contact information
- Paging Information to contact for Production Escalations 🚨
- If applicable, contact info includes a link to your team's website or separate documentation page.
An architecture diagram 🎨

This diagram should detail the architecture of the service, including its components, its endpoints, the request flow, its dependencies (both upstream and downstream), and information about any databases or caches... Architecture diagrams are essential for several reasons. It's nearly impossible to understand how and why a microservice works just by reading through the code, and so a well-designed architecture diagram is an easily understandable visual description and summary of the microservice.
- Production-Ready Microservices, page 121

Information about the component's request flow(s), endpoints, dependencies, and stakeholders
- Document all API endpoints of the service or feature.
- Upstream Dependencies: which features or technologies does your component rely on? For each dependency, consider documenting:
  - Links to documentation for the upstream dependencies
  - Frequency of maintenance updates required due to these dependencies (e.g. if you are using something like Node.js or Java, how often should you be checking for version updates?)
- Downstream Stakeholders (e.g. specific features or other teams that rely on your feature)
Decision Log
On-call Runbook / Pager Playbook
- Operations SOPs
- Monitoring / Telemetry Dashboards
- Alerts SOPs (step-by-step instructions that explain how to review logs and debug the issue that caused the alert)
Developer's (or DevOps) Guide to the Feature
- Links to relevant Developer Tools and Resources needed for the feature
- Links to beginner's or quickstart guides to the tools and resources
Automated Testing Suites - where to find them, how to run them, best practices, etc
List of infrastructure resources that are deployed with the feature or microservice
Step-by-step guide to setting up the feature for local development
- Where is a good "entry point" for developers looking to read through the feature code? (e.g. file name, function name, etc)
- How to set up local environment for development. Consider whether you need to document things like config files that are used, which authorization credentials are needed, etc
- How to start the feature and step through the workflows of the feature. Include all commands or scripts that need to be run
- How to verify that the feature is up and running correctly
Links to documented Code Conventions, Style Guides used, or Lint Tool Configs
Development Cycle and Deployment Pipeline
- How do developers commit code to your feature? What are your expectations for pull requests?
- How often is your code deployed to production? Do you follow a CI/CD model?
- Do you use feature toggles to gate changes?
Deep Dive on Technical Designs and Documentation. This is where you document your architecture design.
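The checklist above asks whether you use feature toggles to gate changes. If new developers have not worked with toggles before, a tiny example in your docs can help. The sketch below assumes toggles are read from environment variables named `FEATURE_<NAME>`; that naming scheme, the `toggle_enabled` helper, and the `greeting` function are hypothetical. Many teams use a toggle service or config system instead, so document whatever your feature really uses.

```python
import os


def toggle_enabled(name, env=None, default=False):
    """Return True when the toggle's environment variable holds a truthy value."""
    env = os.environ if env is None else env
    raw = env.get(f"FEATURE_{name.upper()}")
    if raw is None:
        return default  # an unset toggle falls back to the default
    return raw.strip().lower() in {"1", "true", "on", "yes"}


def greeting(user, env=None):
    """Hypothetical feature code: the new wording is gated behind a toggle."""
    if toggle_enabled("new_greeting", env):
        return f"Welcome back, {user}!"  # new behavior, only when toggled on
    return f"Hello, {user}."  # existing behavior remains the default


# Passing a plain dict stands in for the real environment in this example.
print(greeting("Sam", env={}))
print(greeting("Sam", env={"FEATURE_NEW_GREETING": "true"}))
```

Keeping the old code path as the default means that turning the toggle off is always a safe rollback.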

References

Production-Ready Microservices: Building Standardized Systems Across an Engineering Organization by Susan Fowler
Documenting Software Architectures: Views and Beyond by Judith Stafford, Robert Nord, Paulo Merson, Reed Little, James Ivers, David Garlan, Len Bass, Felix Bachmann, Paul Clements
Feature Toggles (aka Feature Flags) by Martin Fowler

Tanking For Your Team: A Tech Lead's Guide to the On-Call Developer Role

Jess Blevins — Mon, 29 Mar 2021 14:16:15 +0000

The Purpose of This Blog Post

This blog post is a comprehensive look at what I believe it means to be an effective on-call developer for your team. The on-call developer can be a rewarding and deeply impactful role for every team to have alongside other scrum team roles. I will share my thoughts about why the on-call developer can and should do more than respond to pager duty.

This blog post is also a love letter to team-based ownership of code features and microservices. The on-call developer can only be so effective without having ownership of the code they are responsible for maintaining.

This blog post is a little bit about what we can learn from role-playing games.

"Iconic Party" from Dungeons & Dragons Player Handbook, 5th edition.

Introduction

What does it mean for a developer to be on-call? I'll start with a definition from Atlassian Incident Management:

On call is the practice of designating specific people to be available at specific times to respond in the event of an urgent service issue, even though they are not formally on duty.

Usually, an "urgent service issue" refers to a high severity incident affecting end users. When the Network Operations Center or Site Reliability Engineers are paging a developer on-call, it's probably an urgent service issue.

However, non-urgent, but-still-more-urgent-than-next-sprint issues come up all the time. And that's why we can expand the on-call developer role to shield the rest of the development team by taking care of any interruptions during a sprint.

I believe we can solve a lot of organizational problems by creating room for the On-Call Developer role alongside the other important scrum team roles (e.g. Product Owner, Scrum Master). It is possible to make the role work so that your organization’s knowledge management and incident response times scale well. Most importantly, this role gets rid of the need for singularly heroic acts from tech leads.

Tanking For Your Team – An Analogy

Johanna, Crusader of Zakarum, from "Heroes of the Storm." Blizzard Entertainment.

What is a tank?

In role-playing games (RPGs), a team’s “tank” is the character class designed, first, to withstand damage for the team and, second, to draw the enemy’s attention to themselves.

What does this have to do with being an on-call developer?

In writing this post, I realized more and more that the on-call developer works as the team’s tank. The rest of the development team are damage dealers, trying to burn down sprint goals. Using this analogy, the on-call developer needs to:

Be able to shield themselves (and our business) from the damage of costly interruptions.
“Pull aggro” - draw attention to themselves so that nobody else is unnecessarily distracted.

When a team’s enemies are focused on your tank, your team’s damage dealers are free to maximize their damage per second (DPS) on the team’s main targets. By balancing dungeon encounter responsibilities across the different classes, the entire team can achieve glorious victory!

This is also true for your scrum team. 🌈

"Damage" Mitigation -- Ensuring the On-Call Developer is Supported

Reinhardt and Lúcio from "Overwatch." Blizzard Entertainment.

Before setting up your on-call rotation, how do you ensure your developers have the support they need to handle high severity issues? Being on-call is stressful, and a long Mean Time To Repair (MTTR) for a high-severity incident can cost the company a lot of money.

In other words, how can we keep our Tank from immediately getting squished by a large mob? What is the developer version of plate armor?

I have some suggested prerequisites for putting together an on-call rotation. These prerequisites are designed to support your on-call developer and make this process as painless as possible:

Your tech lead has onboarded the dev team to the features and/or microservices that your team owns.
- Ideally, those features and microservices are well-documented.
You have runbooks (i.e. how-to guides) for handling high severity issues.
You have a system in place for knowledge transfers across rotation changes.
On-call rotations should not be too large (or too small). The ideal size is probably ~5 developers. On a much larger rotation, developers do not perform on-call duties often enough to gain expertise and sharpen the skills that reduce MTTR.
At the same time, individuals should not be on-call more than 25% of the time, or there is a risk of burnout.
You have created a psychologically safe environment.
- At a minimum, you should provide psychological safety for your on-call rotation by using blameless post-mortems. You should also cultivate a culture of teaching and learning.
The team’s Engineering Manager (EM) and Product Owner (PO) understand that middle-of-the-night pages are MAJOR issues and should not occur regularly. If an on-call developer is responding to pages outside working hours, that person needs to be compensated appropriately.

Onboarding, Runbooks, and Shift Notes

The first form of risk mitigation is to have your tech leads or Subject Matter Experts (SMEs) train the rest of the development team on the code components or owned microservices. Set an expectation that the SME is not going to be the primary responder to all incidents. Moving forward, that person is there to provide back-up when asked or needed by the on-call developer.

This acts as a forcing function for the SME to write documentation for others to follow.

The next form of mitigation is to have production operation runbooks.

Runbooks are a specific type of documentation tailored to the types of issues you may see happen in production.

Writing a runbook sounds hard – how do you write step-by-step instructions for handling a problem you don't know about yet?

In my experience, most teams kind of do it already! We write documents outlining the team's risk assessment and mitigation plans for features. We write ticket comments about how to handle code that might cause a problem in production. We have release day monitoring practices. We have a vague idea of what to do if something looks bad on a dashboard.

Write that stuff down in the form of a runbook for your on-call developer to reference!

Many times, it’s as simple as having a documentation page for your feature that says, “if x is happening, turn off this feature toggle.” Other times, you want to think about monitoring things like production error logs or a specific Grafana dashboard.
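To make the "if x is happening, do y" idea concrete, here is a minimal sketch of runbook entries as structured data with a simple keyword lookup. Everything in it is hypothetical: the `RunbookEntry` fields, the symptoms, the toggle and dashboard names. A wiki page with the same three columns works just as well; the point is that each entry pairs an observable symptom with a first action and a place to look.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunbookEntry:
    symptom: str       # what the on-call developer observes
    first_action: str  # the first mitigation step to take
    where_to_look: str  # dashboard or log to confirm the diagnosis


# Hypothetical entries; real ones come from your team's risk assessments,
# ticket comments, and release day monitoring habits.
RUNBOOK = [
    RunbookEntry(
        symptom="checkout error rate above normal",
        first_action="turn off the new-checkout feature toggle",
        where_to_look="checkout errors dashboard",
    ),
    RunbookEntry(
        symptom="message queue depth keeps growing",
        first_action="restart the consumer workers",
        where_to_look="production error logs for the consumer service",
    ),
]


def find_entries(runbook, keyword):
    """Return entries whose symptom mentions the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [entry for entry in runbook if needle in entry.symptom.lower()]


for entry in find_entries(RUNBOOK, "checkout"):
    print(f"If {entry.symptom}: {entry.first_action} "
          f"(then check the {entry.where_to_look})")
```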

Write. It. Down.

Write down shift notes

Finally, be sure to keep everyone in your rotation up-to-date with shift notes. My current team keeps a record of all Production Incident Notes. We have a template to record the following:

Incident description
Resolution
Root cause
Takeaways and next steps
Timeline of events
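If your team prefers to keep these notes in a repository or generate them from a script, the template above could be sketched as a small data class. This is only an illustration under my own assumptions: the class name, field names, and plain-text layout are hypothetical, and the example incident is made up.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentNote:
    description: str
    resolution: str
    root_cause: str
    takeaways: list = field(default_factory=list)  # next steps, one per item
    timeline: list = field(default_factory=list)   # (time, event) pairs

    def render(self):
        """Render the note as plain text, following the template's field order."""
        lines = [
            f"Incident description: {self.description}",
            f"Resolution: {self.resolution}",
            f"Root cause: {self.root_cause}",
            "Takeaways and next steps:",
        ]
        lines += [f"- {item}" for item in self.takeaways]
        lines.append("Timeline of events:")
        lines += [f"- {when} {event}" for when, event in self.timeline]
        return "\n".join(lines)


# A made-up incident, for illustration only.
note = IncidentNote(
    description="Users could not save drafts",
    resolution="Turned off the autosave feature toggle",
    root_cause="Autosave requests exceeded the API rate limit",
    takeaways=["Add an alert on rate-limit responses"],
    timeline=[("09:02", "page received"), ("09:20", "toggle turned off")],
)
print(note.render())
```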

On a previous team that owned several microservices, our shift notes included entries about operational duties and maintaining our staging environments.

Keep Your Rotation Small (But Not Too Small!)

For developers to gain confidence and strengthen our skills in being on-call, we need to do it regularly! Just like anything else, we need to practice being on-call so that we can get good at it.

In her blog post, “Making On-Call Not Suck,” Molly Struve writes about the characteristics of a broken on-call system and how to fix it. In describing the broken on-call system, she writes:

Eventually, the team was so big that people were going on-call once every 3-4 months. This may seem like a dream come true, but in reality, it was far from it. […] The large rotation meant that on-call shifts were so infrequent that devs were not able to get the experience and reps they needed to know how to handle on-call issues effectively. […] As backward as it may sound, being on-call more is a benefit because devs have become a lot more comfortable with it and are able to really figure out a strategy that works best for them.

Molly's experience rings true to my own at a large enterprise company. I am much more confident and comfortable responding to pages for my specific team than I am for the larger organization's rotation.

In summary, support your on-call rotation by keeping the size at around 5 developers so that people get experience but do not burn out.
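The arithmetic behind these numbers is simple enough to sketch. The snippet below assumes an evenly shared rotation with one-week shifts (my assumption, not something every team does) and compares a few rotation sizes against the 25% guideline mentioned earlier.

```python
def on_call_share(rotation_size):
    """Fraction of time each developer is on-call in an evenly shared rotation."""
    if rotation_size < 1:
        raise ValueError("rotation_size must be at least 1")
    return 1 / rotation_size


def weeks_between_shifts(rotation_size, shift_weeks=1):
    """Weeks from the start of one shift to the start of the same person's next."""
    return rotation_size * shift_weeks


for size in (3, 4, 5, 15):
    share = on_call_share(size)
    verdict = "over" if share > 0.25 else "within"
    print(f"{size} developers: {share:.0%} on-call each "
          f"({verdict} the 25% guideline), "
          f"one shift every {weeks_between_shifts(size)} weeks")
```

Under these assumptions, a rotation of five keeps each person at a fifth of their time on-call, while a rotation of fifteen means a shift only about once every fifteen weeks, which is roughly the "once every 3-4 months" situation described in the quote above.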

But Who's the Healer?!

Anduin Wrynn from "World of Warcraft: Battle for Azeroth." Blizzard Entertainment.

I don't know. Probably the EM.

Pulling Aggro for the Team

In role-playing games, aggro is a mechanism used by enemies to prioritize which characters to attack. The player who generates the most aggro will be preferentially targeted by the enemy. This is referred to as “pulling aggro.”

Eowyn, Shieldmaiden of Rohan, pulling aggro for Théoden.

In a development team, the person with the highest baseline priority for interruption is typically the tech lead. But the tech lead should not always be the team’s tank. There are a couple of reasons for this:

It leads to a low Bus Factor.
It impedes the growth of the entire development team.

To handle this, the tech lead needs to cede responsibilities to the on-call developer and allow that person to pull aggro for the team. This means the on-call developer should be the primary developer responsible for context-switching out of sprint work and handling interruptions. Such responsibilities may include:

Responding to all incident pages.
Release day monitoring.
Helping the team’s PO and EM triage incoming bugs.
Reviewing any non-owner changes made to the team’s owned code components.
Helping the PO and EM respond to stakeholder questions about owned functionality.

Responding to stakeholder interruptions

Sometimes, stakeholders (customer service managers, POs, other developers) can interrupt scrum teams with non-trivial technical questions. And if you work on a team that owns any kind of framework or utility used by other developers, you probably get hit with a lot of questions. (And this is despite your best efforts to document everything for your downstream users! 😢)

I have spent most of my career working on teams responsible for enabling other development teams. Through trial and error, I have found that the most effective way of handling these kinds of interruptions is by explicitly assigning someone the duty of answering them. And it shouldn’t always be the same person.

I suggest the on-call developer because they’re already being interrupted with costly context switching.

Everyone else on the development team can stay concentrated on sprint goals, and the stakeholder gets a prompt and meaningful answer.

With a collective responsibility model, it is not always clear who should take the time to respond, let alone perform any kind of research or analysis. Sometimes it’s the tech lead answering everything, which leads to a low Bus Factor. Sometimes there’s a bystander effect, where anyone or no one may take responsibility for answering incoming questions.

Stakeholder waiting for an answer from a team with a collective responsibility model.

Your team needs to take responsibility for unblocking stakeholders without hurting your own team’s needs. The on-call developer can do that for you while the rest of the development team focuses on the main target – your sprint goals.

Maximizing DPS – Ensuring the rest of your team is effective at burning down sprint goals

Onyxia's Lair from "World of Warcraft." Blizzard Entertainment.

In this last section, I want to talk about the little ways a tank supports their team and makes everyone else better at their job of killing the dragon. (The dragon is your sprint goal).

Ok... My tanking analogy is falling apart here.

But here’s a section about how the on-call developer role makes everyone else a better and more productive developer working on sprint goals.

Being on-call encourages developers to build a better product

Being on-call makes us better developers when we’re not on-call. By handling production incidents and triaging bugs directly, everyone in the rotation gains the benefit of insight into how our code affects production and our clients. This makes us better day-to-day developers as we work on improving the product or building new ones.

Developers should follow their work downstream – by seeing customer difficulties firsthand, they make better and more informed decisions in their daily work. By doing this, we create feedback on the non-functional aspects of our code – all the elements that are not related to the customer-facing feature – and identify ways that we can improve manageability, operability, and so on. - DevOps Handbook, pg 233

On-call developers can fix things hindering the development team

Ideally, on-call developers should not be expected to perform sprint work. They are already responsible for so much context-switching that anything they can accomplish towards burning down sprint goals is nice-to-have.

I think a better use of the on-call developer’s “free” time is to work on things that improve the on-call experience. This includes things like automating manual processes, improving alerts, writing dev tools -- anything that lowers the operational burden of being on-call is fair game.

I have been smitten with this idea since I saw it in Charity Majors’ blog post, “On-Call Shouldn’t Suck: A Guide for Managers”:

When an engineer is on call, they are not responsible for normal project work — period. That time is sacred and devoted to fixing things, building tooling, and creating guard-rails to protect people from themselves. If nothing is on fire, the engineer can take the opportunity to fix whatever has been annoying them. Allow for plenty of agency and following one’s curiosity, wherever it may lead, and it will be a special treat.

How is this maximizing your sprint output potential? The on-call developer is improving everyone’s output by working on improving our environments and tools. 🛠

Conclusion

Thank you for reading my very geeky blog post about creating an effective on-call developer role for your team. Thanks very much to those that have written about this topic before me.

Postscript

I am not a Tank main. I prefer ranged DPS and healer roles, but sometimes you need to play the role your team needs. 😏

References

https://www.geeknative.com/48262/art-inside-dd-5e-players-handbook/
https://www.atlassian.com/incident-management/on-call
https://en.wikipedia.org/wiki/Tank_(video_games)
https://www.icy-veins.com/wow/tanking-guide
https://www.dungeonsolvers.com/2019/05/10/i-fight-for-my-friends-how-to-be-a-tank-in-dd-5e/
https://dotesports.com/overwatch/news/reinhardt-camera-control-patch-17485
https://sre.google/sre-book/being-on-call/
https://www.atlassian.com/incident-management/postmortem/blameless
https://www.atlassian.com/incident-management/on-call/improving-on-call
https://dev.to/molly/making-on-call-not-suck-490
https://en.wikipedia.org/wiki/Hate_(video_games)
https://en.wikipedia.org/wiki/Bus_factor
https://en.wikipedia.org/wiki/Bystander_effect
https://medium.com/opsgenie/why-every-team-in-your-organization-should-embrace-on-call-1b43b31f6d8a
Gene Kim, Patrick Debois, Jez Humble, John Willis. The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations.
https://charity.wtf/2020/10/03/on-call-shouldnt-suck-a-guide-for-managers/
