Hercules Lemke Merscher

Posted on Jan 26, 2023 • Originally published at bitmaybewise.substack.com

SRE book notes: Being On-Call

#sre #books #oncall

These are notes on Chapter 11, Being On-Call, from the book Site Reliability Engineering: How Google Runs Production Systems.

This post is part of a series. The previous post can be found here:

SRE book notes: Practical Alerting

Hercules Lemke Merscher ・ Jan 25 ・ 2 min read

#sre #books #alerting

The SRE teams are quite different from purely operational teams in that they place heavy emphasis on the use of engineering to approach problems.

The quantity of on-call can be calculated by the percent of time spent by engineers on on-call duties. The quality of on-call can be calculated by the number of incidents that occur during an on-call shift.

We strongly believe that the "E" in "SRE" is a defining characteristic of our organization, so we strive to invest at least 50% of SRE time into engineering: of the remainder, no more than 25% can be spent on-call, leaving up to another 25% on other types of operational, nonproject work.
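
To make these numbers concrete, here is a minimal sketch of how the time split and the incidents-per-shift measure could be checked. The `TimeLog` categories and the function names are my own assumptions for illustration; the book gives the percentages, not an implementation.

```python
from dataclasses import dataclass


@dataclass
class TimeLog:
    """Hours one engineer logged in a period (hypothetical categories)."""
    engineering: float
    on_call: float
    other_ops: float

    @property
    def total(self) -> float:
        return self.engineering + self.on_call + self.other_ops


def budget_violations(log: TimeLog) -> list[str]:
    """Return which of the chapter's time limits this log breaks."""
    if log.total <= 0:
        raise ValueError("no hours logged")
    problems = []
    # At least 50% of time should go to engineering work.
    if log.engineering / log.total < 0.50:
        problems.append("engineering below 50%")
    # No more than 25% of time should be spent on-call.
    if log.on_call / log.total > 0.25:
        problems.append("on-call above 25%")
    return problems


def incidents_per_shift(incident_counts: list[int]) -> float:
    """Average incidents per on-call shift (the 'quality' measure)."""
    if not incident_counts:
        raise ValueError("no shifts recorded")
    return sum(incident_counts) / len(incident_counts)
```

For example, `budget_violations(TimeLog(engineering=60, on_call=60, other_ops=40))` flags both limits, since engineering and on-call each make up 37.5% of the logged time.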

To make sure that the engineers are in the appropriate frame of mind to leverage the latter mindset, it’s important to reduce the stress related to being on-call.

The “latter mindset” mentioned here refers to thinking in a rational, focused, and deliberate way, as opposed to acting intuitively, automatically, and rapidly.

Stress hormones like cortisol and corticotropin-releasing hormone (CRH) are known to cause behavioral consequences—including fear—that can impair cognitive functions and cause suboptimal decision making.

It’s always good to see content like this supported by scientific research.

Heuristics are very tempting behaviors when one is on-call. For example, when the same alert pages for the fourth time in the week, and the previous three pages were initiated by an external infrastructure system, it is extremely tempting to exercise confirmation bias by automatically associating this fourth occurrence of the problem with the previous cause.

The most important on-call resources are:

Clear escalation paths

Well-defined incident-management procedures

A blameless postmortem culture

Mistakes happen, and software should make sure that we make as few mistakes as possible. Recognizing automation opportunities is one of the best ways to prevent human errors.

Noisy alerts that systematically generate more than one alert per incident should be tweaked to approach a 1:1 alert/incident ratio. Doing so allows the on-call engineer to focus on the incident instead of triaging duplicate alerts.
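
As a rough sketch of how the alert/incident ratio could be measured, the snippet below counts alerts per incident and lists the incidents that paged more than once. It assumes each alert record already carries an `incident_id` field, which is a hypothetical simplification; in practice, mapping alerts to incidents depends on your tooling.

```python
from collections import Counter


def alerts_per_incident(alerts: list[dict]) -> dict[str, int]:
    """Count how many alerts fired for each incident."""
    return dict(Counter(a["incident_id"] for a in alerts))


def alert_incident_ratio(alerts: list[dict]) -> float:
    """Overall alerts-to-incidents ratio; 1.0 is the target."""
    counts = alerts_per_incident(alerts)
    if not counts:
        raise ValueError("no alerts recorded")
    return len(alerts) / len(counts)


def noisy_incidents(alerts: list[dict]) -> list[str]:
    """Incidents that produced more than one alert, sorted by id."""
    counts = alerts_per_incident(alerts)
    return sorted(i for i, n in counts.items() if n > 1)


# Hypothetical sample data: two alerts fired for the same incident.
sample = [
    {"name": "HighLatency", "incident_id": "inc-1"},
    {"name": "ErrorRate", "incident_id": "inc-1"},
    {"name": "DiskFull", "incident_id": "inc-2"},
]
```

With this sample, the ratio is 1.5 and `inc-1` is the incident whose alerts are candidates for tuning or merging.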

Being out of touch with production for long periods of time can lead to confidence issues, both in terms of overconfidence and underconfidence, while knowledge gaps are discovered only when an incident occurs.

If you liked this post, consider subscribing to my newsletter Bit Maybe Wise.

You can also follow me on Twitter and Mastodon.

Photo by LumenSoft Technologies on Unsplash


Top comments (1)

Indika_Wimalasuriya • Jan 27 '23

Great post. Being on call can be a highly demanding aspect of SRE. To minimize the frequency of callouts, it is essential to focus on improving the availability and reliability of your systems. Even in the event of a callout, it is crucial to be prepared and have a plan in place to address any issues that may arise.
