DEV Community

Cover image for SRE book notes: Managing Incidents
Hercules Lemke Merscher
Hercules Lemke Merscher

Posted on • Originally published at bitmaybewise.substack.com

SRE book notes: Managing Incidents

These are the notes from Chapter 14: Managing Incidents from the book Site Reliability Engineering, How Google Runs Production Systems.

This is a post of a series. The previous post can be seen here:

This chapter puts you in the shoes of some personas dealing with different incidents and how they deal with the situation at hand.

It ends by summarizing the best practices for incident management:


Prioritize. Stop the bleeding, restore service, and preserve the evidence for root-causing.

Prepare. Develop and document your incident management procedures in advance, in consultation with incident participants.

Trust. Give full autonomy within the assigned role to all incident participants.

Introspect. Pay attention to your emotional state while responding to an incident. If you start to feel panicky or overwhelmed, solicit more support.

Consider alternatives. Periodically consider your options and re-evaluate whether it still makes sense to continue what you’re doing or whether you should be taking another tack in incident response.

Practice. Use the process routinely so it becomes second nature.

Change it around. Were you incident commander last time? Take on a different role this time. Encourage every team member to acquire familiarity with each role.


If you liked this post, consider subscribing to my newsletter Bit Maybe Wise.

You can also follow me on Twitter and Mastodon.


Photo by Matt C on Unsplash

Heroku

Build apps, not infrastructure.

Dealing with servers, hardware, and infrastructure can take up your valuable time. Discover the benefits of Heroku, the PaaS of choice for developers since 2007.

Visit Site

Top comments (0)

A Workflow Copilot. Tailored to You.

Pieces.app image

Our desktop app, with its intelligent copilot, streamlines coding by generating snippets, extracting code from screenshots, and accelerating problem-solving.

Read the docs