Incident Automation: What to Automate, What to Leave to Humans

#sre #devops #automation #incident

Incident response automation can be a trap. Some things should be automated; others absolutely should not be. Drawing the line in the wrong place can be worse than automating nothing.

What to automate

1. Alert enrichment. Before a human sees an alert, automate pulling in related data: recent deploys, dependent service health, historical correlation. Save the human 10 minutes of context-gathering.

2. Known-good remediations. If an alert always has the same fix (restart service X, clear cache Y), automate the fix. But require human confirmation for the first 30 days before switching to full auto.

3. Communication scaffolding. When an incident starts, auto-create the Slack channel, invite the on-call, post an initial status template, update the status page with a placeholder. Humans fill in the details.

4. Post-incident paperwork. Auto-generate a post-mortem template with timeline pulled from chat and monitoring. Humans edit and refine.

5. Routine handoff. When on-call rotates, auto-summarize what's been happening and who's been paged. Saves the incoming engineer 15 minutes of catching up.
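
To make the 30-day rule from item 2 concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the Remediation class, the run_remediation function, and the confirm callback (which stands in for however your team approves an action, such as a chat button or a CLI prompt) are hypothetical names, not part of any real tool.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Callable

# Trial window: the 30 days suggested in item 2. Tune to taste.
CONFIRMATION_WINDOW = timedelta(days=30)


@dataclass
class Remediation:
    name: str                    # e.g. "restart service X"
    action: Callable[[], None]   # the fix itself (hypothetical callable)
    enabled_on: date             # the day this automation was first turned on


def run_remediation(
    rem: Remediation,
    today: date,
    confirm: Callable[[str], bool],
) -> str:
    """Run the fix, asking a human first while the automation is on trial."""
    on_trial = today - rem.enabled_on < CONFIRMATION_WINDOW
    if on_trial and not confirm(f"Run '{rem.name}'?"):
        # The human said no: do nothing and report it.
        return "skipped"
    rem.action()
    return "ran-with-confirmation" if on_trial else "ran-automatically"
```

During the trial window the fix only runs if a human approves it; after 30 days the same call runs it unattended. Returning a status string rather than nothing means the outcome can be posted straight into the incident channel.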

What not to automate

1. Root cause analysis. AI or rules-based systems can suggest causes, but the final call has to be human. A wrong root cause, once written into the post-mortem, misleads future responders.

2. Impact assessment. 'How many users are affected' needs context only humans have. Automation will miss business-critical customer segments.

3. Executive communication. Your VP of customer success doesn't want a templated bot message. They want 'here's what's happening, here's what we're doing, here's when I'll update you next' from a human.

4. Deciding severity. Yes, automate the initial guess. But a human has to confirm. Severity drives organizational response, and 'a bot marked this as sev-3' will be questioned the moment it matters.

5. The critical decision during an outage. Should we roll back? Should we fail over? Should we scale up? These are judgment calls with consequences. Don't hand them to a bot.
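
Item 4 describes a split: the bot proposes a severity, a human confirms it. One way to keep that split honest is to model it in the data, so an unconfirmed guess can never be displayed as a settled decision. The sketch below is an assumption of how that might look; SeverityCall and its fields are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SeverityCall:
    proposed: str                       # the bot's initial guess, e.g. "sev-3"
    confirmed_by: Optional[str] = None  # the human who owns the decision
    final: Optional[str] = None         # set only once a human has confirmed

    def confirm(self, engineer: str, severity: Optional[str] = None) -> None:
        """A human accepts the proposal, or overrides it with their own call."""
        self.confirmed_by = engineer
        self.final = severity or self.proposed

    @property
    def effective(self) -> str:
        """What to display: an unconfirmed guess is always labeled as one."""
        if self.final is None:
            return f"{self.proposed} (proposed, unconfirmed)"
        return f"{self.final} (confirmed by {self.confirmed_by})"
```

Because every confirmed severity carries a name, nobody can later say "a bot marked this as sev-3": a person either agreed with the guess or changed it.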

The principle

Automate the mechanical, human the judgmental. If the task has a clear right answer and little downside if it's wrong, automate. If it requires context, accountability, or judgment, leave it to humans.
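
As a rough illustration, the rule above can be written as a checklist you run against each candidate task. This is a sketch, not a scoring system: Task, its fields, and who_owns are hypothetical names, and the fallback for tasks that match neither sentence of the rule is my assumption, since the rule does not say what to do with them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    clear_right_answer: bool
    harmless_if_wrong: bool
    needs_context: bool = False
    needs_accountability: bool = False
    needs_judgment: bool = False


def who_owns(task: Task) -> str:
    """Return "automate" or "human" for a candidate task."""
    # Any need for context, accountability, or judgment keeps it with humans.
    if task.needs_context or task.needs_accountability or task.needs_judgment:
        return "human"
    # Mechanical and safe to get wrong: automate.
    if task.clear_right_answer and task.harmless_if_wrong:
        return "automate"
    # Assumption: the rule is silent on the middle ground, so default to humans.
    return "human"
```

For example, who_owns(Task("create incident channel", True, True)) returns "automate", while a rollback decision flagged with needs_judgment=True returns "human".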

The measurement

After adding automation, ask two questions:

Are humans faster to resolution?
Are humans feeling more in control, or less?

If resolution is faster but humans feel like they've lost visibility, you over-automated. Pull back.
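
If you want to track this rather than eyeball it, here is a small sketch that compares both signals before and after an automation change. The inputs are assumptions: time-to-resolution in minutes per incident, and a hypothetical 1-to-5 "how in control did you feel" survey score from responders. Only the first verdict comes from the rule above; the other two are my own fill-ins.

```python
from statistics import median
from typing import Sequence


def assess_automation(
    ttr_before: Sequence[float],      # minutes to resolution, before the change
    ttr_after: Sequence[float],       # minutes to resolution, after the change
    control_before: Sequence[float],  # survey scores (1-5), before
    control_after: Sequence[float],   # survey scores (1-5), after
) -> str:
    """Compare medians of both signals and return a verdict."""
    faster = median(ttr_after) < median(ttr_before)
    less_control = median(control_after) < median(control_before)
    if faster and less_control:
        # The case described above: faster, but humans lost visibility.
        return "over-automated: pull back"
    if faster:
        # Assumption: faster with no loss of control counts as a win.
        return "keep"
    # Assumption: no speed-up means the automation has not earned its place yet.
    return "not faster: revisit"
```

Medians are used so one marathon incident does not swing the verdict. With only a handful of incidents on each side, treat the result as a conversation starter, not proof.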

The emotional piece

Humans run incidents because humans are accountable. Automation that undermines that accountability creates downstream problems. Build tools that make humans faster and more confident, not tools that replace them. There is a big difference.

Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com