Niketa Sharma

Your Agent Just Handled That SEV2. Now What?

The outage isn't the hard part. What's hard is the minute after the alert fires. Who's leading? Which Slack channel? What do we tell support? The fix is usually quick. The coordination is what kills you.

What agents actually do today

Teams are starting to route coordination work to agents. An engineer gets a Datadog alert; the agent checks who's on call, acknowledges, pulls recent deploys, pages the right person, and logs everything. The engineer never leaves the editor.

The agent didn't fix anything. It removed the 20 minutes of friction before anyone starts debugging.
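
Concretely, that pre-debugging flow looks something like the sketch below. This is a minimal sketch, not Runframe's implementation: every client function in it (acknowledgeAlert, fetchOnCall, fetchRecentDeploys, pageEngineer, appendToTimeline) is a hypothetical stub standing in for your real monitoring, paging, and deploy tooling.

```typescript
interface Alert { id: string; service: string; firedAt: Date; summary: string }
interface Deploy { sha: string; author: string; deployedAt: Date }

// Stub integrations -- replace with your real monitoring/paging/deploy clients.
const acknowledgeAlert = async (id: string) => console.log(`ack ${id}`);
const fetchOnCall = async (_service: string) => "oncall-engineer";
const fetchRecentDeploys = async (_service: string, _since: Date): Promise<Deploy[]> =>
  [{ sha: "a1b2c3d", author: "someone", deployedAt: new Date() }];
const pageEngineer = async (who: string, context: string) =>
  console.log(`paging ${who} with:\n${context}`);
const appendToTimeline = async (id: string, entry: string) =>
  console.log(`[timeline ${id}] ${entry}`);

async function triage(alert: Alert) {
  await acknowledgeAlert(alert.id);                // stop the alert from re-firing
  const onCall = await fetchOnCall(alert.service); // who should see this

  // Pull everything that shipped in the hour before the alert.
  const since = new Date(alert.firedAt.getTime() - 3600_000);
  const deploys = await fetchRecentDeploys(alert.service, since);

  // Page with the full picture, not a bare alert.
  const context = [
    alert.summary,
    ...deploys.map(d => `deploy ${d.sha} by ${d.author} at ${d.deployedAt.toISOString()}`),
  ].join("\n");
  await pageEngineer(onCall, context);

  // By the time a human arrives, the timeline is already written.
  await appendToTimeline(alert.id, `acked, paged ${onCall}, linked ${deploys.length} recent deploy(s)`);
}
```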

One engineer told me about their first agent-handled SEV2:

"I woke up to a full picture. Agent had pulled the deploy from 20 minutes ago, logged the spike, paged me. I fixed it in ten minutes instead of spending thirty figuring out what was happening."

Deciding what to hand off

Before the next incident, you need to answer this: what can the agent do without asking?

Some things are obvious. Acknowledging incidents, looking up who's on call, logging to the timeline, gathering context from deploys and logs, paging with full context. Let the agent do those.

Some things should stay with humans. Rollback decisions, customer comms, postmortem conclusions, declaring "we're resolved." These need judgment that agents don't have yet.

Then there's the gray zone: escalation timing and severity classification. For escalation, pick a number and write it down. For severity, let the agent triage, but have a human confirm anything customer-facing.
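
If you want that boundary to survive contact with a real incident, write it down as data, not as a wiki page. A minimal sketch, with placeholder action names and a made-up 15-minute escalation number:

```typescript
type Permission = "autonomous" | "needs_human" | "triage_then_confirm";

// Delegation boundaries as data, not tribal knowledge.
// Action names are placeholders -- use whatever your agent can actually do.
const policy: Record<string, Permission> = {
  acknowledge_incident: "autonomous",
  lookup_oncall:        "autonomous",
  log_timeline:         "autonomous",
  gather_context:       "autonomous",
  page_with_context:    "autonomous",

  rollback_deploy:      "needs_human",
  send_customer_comms:  "needs_human",
  write_postmortem:     "needs_human",
  resolve_incident:     "needs_human",

  // Gray zone: the agent proposes, a human confirms anything customer-facing.
  classify_severity:    "triage_then_confirm",
};

// "Pick a number and write it down." 15 is an example -- choose your own.
const ESCALATE_AFTER_MINUTES = 15;

const canActAlone = (action: string) => policy[action] === "autonomous";
```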

When the agent is better than your process

One team told me the agent didn't improve their incidents. It exposed how broken the process already was. Nobody remembers ten steps at 2 AM. The process was asking humans to do machine work, and the agent made that visible.

Fewer tools, not more

More tools means more ambiguity. Hand an agent seventy tools and it ends up choosing between list_incidents, get_incidents, search_incidents, and query_incidents: four names for the same thing. Now the agent is guessing.

Narrower surfaces make agents more dependable. We kept Runframe's MCP server to sixteen tools around incident workflows. If it doesn't help run an incident, it's out.
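
One way to get a narrower surface: collapse overlapping verbs into one tool whose parameters do the narrowing. The sketch below is illustrative, not Runframe's actual tool schema.

```typescript
// Instead of list_incidents / get_incidents / search_incidents / query_incidents,
// expose one tool and let parameters do the narrowing.
interface FindIncidentsParams {
  id?: string;                  // exact lookup
  query?: string;               // free-text search
  status?: "open" | "resolved"; // filter
  limit?: number;
}

const tools = {
  find_incidents: {
    description: "Find incidents by id, free-text query, or status.",
    handler: async (_params: FindIncidentsParams) => {
      // One code path the agent can't mispick -- route on which params are set.
    },
  },
  // ...each remaining tool tied to one concrete incident-workflow step
};
```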

What I'd do

Start with SEV3s. Low severity, cheap mistakes, good place to watch how the agent behaves. Don't let agents near SEV0 until you've seen them work on dozens of lower-severity incidents first.

Write down delegation boundaries. Not runbooks that say "do X then Y," but guardrails: never do X without human approval, escalate after Y minutes. Test those before you need them.
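
Guardrails written as data are testable in a way runbooks aren't. Another illustrative sketch, with hypothetical action names and thresholds: assert the dangerous paths are blocked before the incident, not at 2 AM.

```typescript
// Guardrails, not runbooks. Names and numbers here are examples, not a spec.
const guardrails = {
  needsHumanApproval: new Set(["rollback_deploy", "send_customer_comms", "resolve_incident"]),
  escalateAfterMinutes: 15, // "escalate after Y minutes" -- written down
  minSevForAutonomy: 3,     // agent acts alone on SEV3 and lower-severity only
};

function allowed(action: string, sev: number, humanApproved: boolean): boolean {
  if (sev < guardrails.minSevForAutonomy) return humanApproved; // SEV0-SEV2: a human drives
  if (guardrails.needsHumanApproval.has(action)) return humanApproved;
  return true;
}

// Test the boundaries before you need them.
console.assert(!allowed("rollback_deploy", 3, false), "rollback must need approval");
console.assert(allowed("log_timeline", 3, false), "logging should be autonomous on a SEV3");
console.assert(!allowed("log_timeline", 1, false), "nothing autonomous on a SEV1 yet");
```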

And pay attention when the agent does something unexpected. That's it telling you what you forgot to define.


This is the short version. Full post on runframe.io/blog/your-agent-just-handled-that-sev2

We built Runframe around these ideas. If you're thinking about agent-driven incident management, the MCP server is open source.
