Paulo Victor Leite Lima Gomes

Posted on Jun 29

Agentic incident response is where autonomy meets the pager

#ai #agents #incidentresponse #devops

The riskiest place to put an AI agent is not always the code editor.

Sometimes it is the incident channel.

AWS has been talking about agentic AI for operational work, including autonomous incident response with AWS DevOps Agent and patterns for distributed agentic workloads. That is a natural next step. If agents can read logs, inspect metrics, understand recent deployments, suggest runbook steps, and draft remediation plans, of course vendors will point them at outages.

I understand the appeal.

Incidents are messy. The clock is loud. People are tired. Dashboards disagree. Half the context lives in old runbooks, old pull requests, and someone who is on a plane. A tool that can gather evidence quickly and propose the next move sounds genuinely useful.

But incident response is a much harder test than code generation.

When a coding agent makes a bad change, the team can usually review it before merge. When an operations agent makes a bad call during an outage, the blast radius can arrive before the postmortem template opens.

That does not mean agents should be kept away from the pager.

It means the pager is where the fantasy version of autonomy has to grow up.

incidents punish confidence

The thing I dislike most in incident response tooling is fake certainty.

Production systems rarely fail in clean textbook shapes. The error rate climbs, but only for one region. Latency is high, but only behind a specific customer path. The database looks fine until someone checks lock wait time. The deployment was thirty minutes ago, but the symptom began after a cache expired. The cloud provider status page is green because of course it is.

Human responders learn to be suspicious of first explanations.

Agents need the same humility, but software usually expresses humility through constraints, not personality.

A useful incident agent should say what it knows, where the evidence came from, what is missing, and which actions are reversible. It should separate observation from hypothesis. It should make it easy for a human to reject the proposed path without losing the gathered evidence.
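Here is a minimal sketch of what that separation could look like as a data structure. Every name in it (Observation, Hypothesis, IncidentReport) is hypothetical and made up for illustration; it is not taken from any vendor's agent. The point is only that rejecting a hypothesis does not touch the gathered evidence.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Observation:
    """Something the agent actually read, with where it came from."""
    statement: str
    source: str  # hypothetical label, e.g. a dashboard or log query name


@dataclass(frozen=True)
class Hypothesis:
    """A guess, tied to the observations that support it."""
    statement: str
    supported_by: tuple[int, ...]  # indexes into IncidentReport.observations


@dataclass
class IncidentReport:
    observations: list[Observation] = field(default_factory=list)
    hypotheses: list[Hypothesis] = field(default_factory=list)
    missing_evidence: list[str] = field(default_factory=list)

    def reject_hypothesis(self, index: int) -> Hypothesis:
        """Drop a hypothesis. The observations are left untouched."""
        return self.hypotheses.pop(index)

    def render(self) -> str:
        """Plain-text summary that keeps the three categories apart."""
        lines = ["OBSERVED"]
        for i, obs in enumerate(self.observations):
            lines.append(f"  [{i}] {obs.statement} (source: {obs.source})")
        lines.append("HYPOTHESES")
        for hyp in self.hypotheses:
            refs = ", ".join(str(i) for i in hyp.supported_by)
            lines.append(f"  {hyp.statement} (based on: {refs})")
        lines.append("MISSING")
        for gap in self.missing_evidence:
            lines.append(f"  {gap}")
        return "\n".join(lines)
```

A responder who disagrees with the agent calls reject_hypothesis and still has every observation and its source in hand.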

That sounds obvious, but many demos collapse those steps into a confident answer.

"I found the problem and fixed it" is a great demo sentence.

It is also a terrifying production default.

read-only first is not optional

The first mode for an incident agent should be read-only.

Not because read-only tools are boring. Because read-only is how you earn trust.

An agent that can quickly collect recent deploys, alarms, logs, traces, feature flag changes, dependency status, Kubernetes events, database metrics, and customer impact is already valuable. Most incidents begin with a context scramble. Reducing that scramble is real leverage.

But gathering context is different from mutating production.

The line between "show me the likely failing deployment" and "roll back the likely failing deployment" should be bright. The line between "identify pods with restart loops" and "delete the pods" should be bright. The line between "find the expensive query" and "kill sessions" should be bright.

For low-risk actions, maybe teams eventually allow carefully scoped automation. Restart a known worker pool. Scale a non-critical queue consumer inside limits. Flip a kill switch that already exists for this class of failure.

Fine.

But that should come after the system has proven itself in observation mode.

The dangerous path is letting a tool graduate from "assistant" to "operator" because the demo was impressive and the dashboard has a button.
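One way to keep that line bright is to enforce it in the tool layer instead of in the prompt. The sketch below is an assumption about how such a gate might look, not a description of any real agent framework: ToolRegistry, Mode, and ToolDenied are hypothetical names. Read-only is the default, and even in write mode a mutating tool must be on an explicit allowlist.

```python
from enum import Enum
from typing import Callable


class Mode(Enum):
    READ_ONLY = "read_only"
    SCOPED_WRITE = "scoped_write"


class ToolDenied(Exception):
    """Raised when the agent asks for a tool it is not allowed to run."""


class ToolRegistry:
    def __init__(
        self,
        mode: Mode = Mode.READ_ONLY,
        allowed_mutations: frozenset[str] = frozenset(),
    ) -> None:
        self.mode = mode
        self.allowed_mutations = allowed_mutations
        # name -> (callable, whether it mutates production)
        self._tools: dict[str, tuple[Callable[..., object], bool]] = {}

    def register(self, name: str, fn: Callable[..., object], mutates: bool) -> None:
        self._tools[name] = (fn, mutates)

    def call(self, name: str, *args: object, **kwargs: object) -> object:
        fn, mutates = self._tools[name]
        if mutates:
            if self.mode is Mode.READ_ONLY:
                raise ToolDenied(f"{name} mutates production; registry is read-only")
            if name not in self.allowed_mutations:
                raise ToolDenied(f"{name} is not on the mutation allowlist")
        return fn(*args, **kwargs)
```

With this shape, "identify pods with restart loops" and "delete the pods" are two separately registered tools, and graduating from assistant to operator is a deliberate configuration change rather than a button on a dashboard.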

approval gates need to be specific

Human approval is not a magic safety layer.

Anyone who has responded to a serious incident knows how easy it is to click the plausible thing under pressure. The chat is moving, executives are asking for updates, customers are affected, and the agent has a neat explanation with three green checkmarks.

If the approval prompt says "approve remediation," that is not enough.

The approval should say exactly what will happen.

Which service will change? Which region? Which command? Which credentials? Which feature flag? Which deployment? What is the expected customer impact? What is the rollback path? What evidence supports this action? What evidence argues against it?

That is not bureaucracy. That is the difference between judgment and a rubber stamp.

Agents can help here if we design the workflow well. They can turn a messy pile of telemetry into a structured action proposal. They can link the relevant graph, log sample, deployment diff, and runbook. They can say, "this is reversible in two minutes" or "this requires database migration rollback and should not be done from chat."

That is useful.

But the system has to make the human approve a concrete operation, not a vibe.
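A rough sketch of what "approve a concrete operation" could mean in code, under my own assumptions: the approval is bound to a fingerprint of the exact proposal, so an edited command or a different region needs a fresh approval. ActionProposal, approve, and is_approved are hypothetical names, and the field values are invented for the example.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class ActionProposal:
    service: str
    region: str
    command: str
    credentials_role: str
    expected_customer_impact: str
    rollback_path: str
    evidence_for: tuple[str, ...]
    evidence_against: tuple[str, ...] = ()

    def missing_fields(self) -> list[str]:
        """Names of required fields that were left empty."""
        optional = {"evidence_against"}
        return [k for k, v in asdict(self).items() if k not in optional and not v]

    def fingerprint(self) -> str:
        """Stable hash over every field of the proposal."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def approve(proposal: ActionProposal, approver: str) -> dict:
    missing = proposal.missing_fields()
    if missing:
        raise ValueError(f"cannot approve, missing: {', '.join(missing)}")
    return {"approver": approver, "fingerprint": proposal.fingerprint()}


def is_approved(proposal: ActionProposal, approval: dict) -> bool:
    """True only if the approval was given for this exact proposal."""
    return approval["fingerprint"] == proposal.fingerprint()


# Example values are invented for illustration.
proposal = ActionProposal(
    service="checkout-api",
    region="us-east-1",
    command="rollback deployment to previous release",
    credentials_role="incident-rollback",
    expected_customer_impact="brief elevated latency during rollback",
    rollback_path="redeploy current release",
    evidence_for=("error rate rose after latest deploy",),
)
approval = approve(proposal, approver="oncall-alice")
edited = replace(proposal, region="eu-west-1")
```

Here is_approved(proposal, approval) holds, while is_approved(edited, approval) does not, because the region changed after the human signed off.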

the audit trail is part of the fix

Incident response already has an evidence problem.

During the incident, people move fast. After the incident, everyone wants a timeline. The team reconstructs who saw what, who changed what, when the metric moved, which customer reports mattered, and which decision actually improved the system.

Add agents and the timeline gets another layer.

What did the agent read? Which logs did it sample? Which time range did it choose? Which runbook did it follow? Which commands did it propose? Which commands did a human approve? Did it ignore a warning? Did the model summarize a dashboard incorrectly? Did the operator edit the command before running it?

If those details are not captured automatically, they will not exist when the postmortem needs them.

And without them, the organization learns the wrong lesson.

Maybe the agent helped. Maybe it introduced noise. Maybe it found the right clue but suggested the wrong action. Maybe the human used the agent as a search tool and made the real decision independently. Maybe the tool was excellent, but the runbook it followed was stale.

Those are different outcomes.

They require different fixes.

An incident agent without a durable audit trail is not an operational tool. It is a very confident participant in a conversation nobody can replay.
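To make "replayable" concrete, here is a small sketch of an append-only audit log where each entry carries a hash of the previous one, so later edits to the record are detectable. This is my own illustration under stated assumptions: AuditLog and its field names are hypothetical, the log lives in memory, and a real system would need durable storage and access control on top.

```python
import hashlib
import json
from datetime import datetime, timezone


def _digest(body: dict) -> str:
    """Hash of an entry body; detail must be JSON-serializable."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


class AuditLog:
    def __init__(self) -> None:
        self._entries: list[dict] = []

    def record(self, actor: str, kind: str, detail: dict) -> dict:
        """Append one event, e.g. kind='read', 'proposed', 'approved', 'edited'."""
        prev_hash = self._entries[-1]["hash"] if self._entries else ""
        entry = {
            "ts": datetime.now(timezone.utc).isoformat(),
            "actor": actor,
            "kind": kind,
            "detail": detail,
            "prev_hash": prev_hash,
        }
        entry["hash"] = _digest(entry)  # computed before the hash key exists
        self._entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute the chain; False if any entry was altered or reordered."""
        prev_hash = ""
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(body):
                return False
            prev_hash = entry["hash"]
        return True

    def timeline(self) -> list[str]:
        """One line per event, in order, for the postmortem."""
        return [f'{e["ts"]} {e["actor"]} {e["kind"]}' for e in self._entries]
```

The agent's reads, its proposals, the human's approval, and any operator edit to a command each become one record call, which is what lets the postmortem tell "the agent helped" apart from "the human decided independently."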

runbooks become executable interfaces

The boring opportunity here is runbooks.

Most teams have some mix of markdown runbooks, wiki pages, dashboard links, tribal memory, and shell commands copied from the last incident. Some runbooks are good. Many are aspirational. Some contain commands that only work if you already know the missing step.

Agents will expose that quality gap quickly.

If a runbook is clear, scoped, current, and testable, an agent can help execute the investigative parts and prepare action proposals. If the runbook is vague, stale, or full of implicit assumptions, the agent may make the wrong thing look structured.

That changes how I think about operational documentation.

Runbooks are no longer just pages for humans to read at 3 a.m. They are becoming interfaces for automation. They need inputs, preconditions, permissions, expected outputs, rollback notes, escalation paths, and known-dangerous steps.

That does not mean turning every runbook into code.

It means writing runbooks as if another actor will follow them literally.

Because now one might.
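This is not the runbook turned into code. It is a sketch of a lint over runbook metadata, assuming each step is described with the fields listed above. RunbookStep and lint_step are hypothetical names, and the rules are examples of the kind of check I have in mind, not a standard.

```python
from dataclasses import dataclass, field


@dataclass
class RunbookStep:
    title: str
    inputs: list[str] = field(default_factory=list)
    preconditions: list[str] = field(default_factory=list)
    required_permission: str = ""
    expected_output: str = ""
    rollback: str = ""
    escalation: str = ""
    mutates: bool = False    # does the step change production?
    dangerous: bool = False  # known-dangerous step, needs a human


def lint_step(step: RunbookStep) -> list[str]:
    """Return the gaps a literal-minded actor would trip over."""
    problems = []
    if not step.expected_output:
        problems.append("no expected output, so success cannot be checked")
    if not step.required_permission:
        problems.append("no permission stated")
    if step.mutates and not step.preconditions:
        problems.append("mutating step has no preconditions")
    if step.mutates and not step.rollback:
        problems.append("mutating step has no rollback note")
    if step.dangerous and not step.escalation:
        problems.append("dangerous step has no escalation path")
    return problems
```

Running a check like this over existing runbooks is a cheap way to find the steps that only work if you already know the missing part, before an agent finds them during an outage.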

measure noise, not magic

The success metric for an incident agent should not be "number of autonomous fixes."

That metric will create the wrong product.

I would rather measure whether the agent reduces time to useful context, improves the quality of incident timelines, lowers repeated diagnostic toil, catches missing runbook steps, and helps responders make better reversible decisions.

Did it reduce mean time to understanding?

Did it reduce wrong turns?

Did it preserve evidence?

Did responders trust it more after three months of use, or did they quietly stop reading its suggestions?

That last question matters. Incident tools either earn attention or spend it. A noisy agent in an outage is worse than no agent at all, because responders still have to spend attention deciding whether to ignore it.

The pager is already a scarce-attention environment.

Do not add a chatbot that needs its own incident commander.
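As a rough illustration of measuring noise instead of autonomous fixes, the sketch below summarizes a few per-incident numbers. Everything here is an assumption: IncidentRecord and its fields are hypothetical, and how a team decides that context became "useful" or that a suggestion was a "wrong turn" is a human judgment this code does not make.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class IncidentRecord:
    minutes_to_useful_context: float  # judged by responders after the incident
    suggestions_shown: int
    suggestions_acted_on: int
    suggestions_judged_wrong: int


def summarize(records: list[IncidentRecord]) -> dict:
    """Aggregate attention-focused metrics over a set of incidents."""
    if not records:
        raise ValueError("need at least one incident record")
    shown = sum(r.suggestions_shown for r in records)
    acted = sum(r.suggestions_acted_on for r in records)
    wrong = sum(r.suggestions_judged_wrong for r in records)
    return {
        "median_minutes_to_context": median(
            r.minutes_to_useful_context for r in records
        ),
        # Share of suggestions responders actually used.
        "acted_on_rate": acted / shown if shown else 0.0,
        # Share of suggestions that sent responders the wrong way.
        "wrong_turn_rate": wrong / shown if shown else 0.0,
    }
```

Tracked over months, a falling acted_on_rate is an early sign that responders have quietly stopped reading the agent's suggestions.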

the punchline

Agentic incident response is coming because the value is obvious.

Operational work is full of context gathering, correlation, repetitive checks, runbook lookups, status drafting, and careful decision support. Agents can help with that. I want them to help with that.

But production does not care that a remediation plan was generated elegantly.

Production cares whether the action was correct, scoped, reversible, approved, observable, and explainable afterward.

That is why the first real design questions are not about model cleverness. They are about boundaries. Read-only defaults. Explicit approval gates. Tool permissions. Evidence capture. Rollback paths. Runbook quality. Audit trails. Trust earned over boring incidents before anyone asks for autonomy during scary ones.

The best incident agents will probably feel less like heroic operators and more like very fast SRE assistants with excellent notes and limited hands.

Good.

That is exactly the shape I would want near the pager.

