DEV Community

Cover image for SRE book notes: Being On-Call
Hercules Lemke Merscher
Hercules Lemke Merscher

Posted on • Originally published at bitmaybewise.substack.com

SRE book notes: Being On-Call

These are the notes from Chapter 11: Being On-Call from the book Site Reliability Engineering, How Google Runs Production Systems.

This is a post of a series. The previous post can be seen here:


The SRE teams are quite different from purely operational teams in that they place heavy emphasis on the use of engineering to approach problems.


The quantity of on-call can be calculated by the percent of time spent by engineers on on-call duties. The quality of on-call can be calculated by the number of incidents that occur during an on-call shift.


We strongly believe that the "E" in "SRE" is a defining characteristic of our organization, so we strive to invest at least 50% of SRE time into engineering: of the remainder, no more than 25% can be spent on-call, leaving up to another 25% on other types of operational, nonproject work.


To make sure that the engineers are in the appropriate frame of mind to leverage the latter mindset, it’s important to reduce the stress related to being on-call.

The “latter mindset” mentioned here refers to the way of thinking on being rational, focused, and deliberate cognitive functions.

Stress hormones like cortisol and corticotropin-releasing hormone (CRH) are known to cause behavioral consequences—including fear—that can impair cognitive functions and cause suboptimal decision making.

It’s always good to see content like this supported by scientific research.


Heuristics are very tempting behaviors when one is on-call. For example, when the same alert pages for the fourth time in the week, and the previous three pages were initiated by an external infrastructure system, it is extremely tempting to exercise confirmation bias by automatically associating this fourth occurrence of the problem with the previous cause.

this is fine meme


The most important on-call resources are:

  • Clear escalation paths
  • Well-defined incident-management procedures
  • A blameless postmortem culture

Mistakes happen, and software should make sure that we make as few mistakes as possible. Recognizing automation opportunities is one of the best ways to prevent human errors.


Noisy alerts that systematically generate more than one alert per incident should be tweaked to approach a 1:1 alert/incident ratio. Doing so allows the on-call engineer to focus on the incident instead of triaging duplicate alerts.


Being out of touch with production for long periods of time can lead to confidence issues, both in terms of overconfidence and underconfidence, while knowledge gaps are discovered only when an incident occurs.


If you liked this post, consider subscribing to my newsletter Bit Maybe Wise.

You can also follow me on Twitter and Mastodon.


Photo by LumenSoft Technologies on Unsplash

Top comments (1)

Collapse
 
indika_wimalasuriya profile image
Indika_Wimalasuriya

Great post. Being on call can be a highly demanding aspect SRE. To minimize the frequency of callouts, it is essential to focus on improving the availability and reliability of your systems. Even in the event of a callout, it is crucial to be prepared and have a plan in place to address any issues that may arise.