These are the notes from Chapter 30: Embedding an SRE to Recover from Operational Overload from the book Site Reliability Engineering, How Google Runs Production Systems.
This post is part of a series. The previous post can be seen here:
Your job while embedded with the team is to articulate why processes and habits contribute to, or detract from, the service's scalability.
Remember that your job is to make the service work, not to shield the development team from alerts.
After scoping the dynamics and pain points of the team, lay the groundwork for improvement through best practices like postmortems and by identifying sources of toil and how to best address them.
Sort the team fires into toil and not-toil. When you're finished, present the list to the team and clearly explain why each fire is either work that should be automated or acceptable overhead for running the service.
Your first goal for the team should be writing a service level objective (SLO), if one doesn't already exist. The SLO is important because it provides a quantitative measure of the impact of outages, in addition to how important a process change could be.
Once your embedded assignment concludes, you should remain available for design and code reviews. Keep an eye on the team for the next few months to confirm that they're slowly improving their capacity planning, emergency response, and rollout processes.
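To make the point about the SLO as "a quantitative measure of the impact of outages" concrete, here is a minimal sketch of error-budget arithmetic. It is not taken from the book chapter: the function names, the 30-day window, and the 99.9% availability target are illustrative assumptions.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the downtime (in minutes) an availability SLO allows over a window.

    slo_target is a fraction such as 0.999; window_days is an assumed
    rolling window, not something the chapter prescribes.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_consumed(outage_minutes: float, slo_target: float, window_days: int = 30) -> float:
    """Return the fraction of the error budget that an outage used up."""
    return outage_minutes / error_budget_minutes(slo_target, window_days)


if __name__ == "__main__":
    # Hypothetical numbers: a 99.9% target and a single 20-minute outage.
    budget = error_budget_minutes(0.999)
    used = budget_consumed(20, 0.999)
    print(f"Error budget: {budget:.1f} minutes per 30 days")
    print(f"A 20-minute outage consumes {used:.0%} of it")
```

With numbers like these, a team can compare a proposed process change against the share of the budget that past outages consumed, rather than arguing from impressions.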
As you might have guessed already, the recommendations quoted above apply not just to SREs but to any experienced engineer who ends up mentoring or guiding one or more teams at their company.
If you haven’t been in this situation before, keep these recommendations in mind, as they might come in handy when the time comes.
If you liked this post, consider subscribing to my newsletter Bit Maybe Wise.
You can also follow me on Twitter and Mastodon.
Photo by Daria Nepriakhina on Unsplash