AWS published a post recently about building an end-to-end agentic SRE, and I had two reactions at the same time.
The first one was: yes, obviously. Incident response is full of repetitive investigation work that agents should help with.
The second one was: oh no, we are absolutely going to hurt ourselves with this.
Not because SRE agents are a bad idea. I think they are one of the more useful AI directions, actually. But the pager is a very different environment from a coding task on a quiet Tuesday afternoon. Production incidents are where vague automation, incomplete context, bad permissions, and confident summaries turn from annoying into expensive.
## incident response is mostly context gathering
A lot of incident work is not heroic debugging. It is context gathering under pressure.
You check dashboards. You compare deploy timestamps. You look at logs. You inspect error rates. You ask whether one region is worse than another. You check whether a dependency is degraded. You search Slack for the last person who touched this thing. You read a runbook that is probably 70% correct and 30% archaeology.
That is exactly the kind of messy, tool-heavy workflow where agents can help.
An agent that can pull CloudWatch metrics, query traces, summarize logs, inspect recent deployments, and prepare a timeline could save real minutes. And minutes matter when customers are down and everyone is pretending to be calm in the incident channel.
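For the read-only half, the tooling is genuinely simple. Here is a minimal sketch of one investigation step, assuming boto3 with AWS credentials already configured; the namespace and metric names are placeholders for whatever your service actually emits, not a real contract.

```python
import datetime

import boto3  # assumes AWS credentials are already configured

cloudwatch = boto3.client("cloudwatch")

def recent_5xx(namespace: str, metric: str, minutes: int = 30) -> list[dict]:
    """Pull the last N minutes of a metric, one datapoint per minute.
    Strictly read-only: the agent can look, not touch."""
    now = datetime.datetime.now(datetime.timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace=namespace,   # e.g. "MyApp/Checkout" (placeholder)
        MetricName=metric,     # e.g. "HTTP5xxCount" (placeholder)
        StartTime=now - datetime.timedelta(minutes=minutes),
        EndTime=now,
        Period=60,
        Statistics=["Sum"],
    )
    return sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])
```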
Stack Overflow also had a good piece on observability and human intuition in an AI world, and I think that framing is important. The goal is not to replace intuition. The goal is to give humans a better starting point for judgment.
A good incident agent should make the human sharper, not more passive.
## the dangerous part is the verb
The problem starts when the agent moves from "look" to "do."
There is a huge difference between:
- "summarize the last 30 minutes of elevated 5xx errors"
- "find likely related deploys"
- "compare this service against last week's baseline"
- "rollback the deployment"
- "scale the service"
- "change the retry policy"
- "disable this feature flag"
- "restart the cluster"
The first group is investigation. The second group is operation.
Both are useful. Only one of them can make the incident worse in three seconds.
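You can draw that boundary in code before you draw it anywhere else. A minimal sketch with entirely hypothetical action names: read-only verbs run freely, mutating verbs refuse to run without a named human approver.

```python
# Hypothetical action names, grouped by the verb boundary. The agent may run
# anything in INVESTIGATE on its own; anything in OPERATE needs a human.
INVESTIGATE = {"summarize_errors", "find_related_deploys", "compare_baseline"}
OPERATE = {"rollback_deployment", "scale_service", "change_retry_policy",
           "disable_feature_flag", "restart_cluster"}

def dispatch(action: str, approved_by: str | None = None) -> str:
    if action in INVESTIGATE:
        return f"read-only: running {action}"
    if action in OPERATE:
        if approved_by is None:
            raise PermissionError(f"{action} mutates production and needs human approval")
        return f"mutating: running {action}, approved by {approved_by}"
    raise ValueError(f"unknown action: {action}")
```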
This is where a lot of AI demos become misleading. In a demo, the agent diagnoses the problem, proposes a fix, runs the action, and the graph turns green. Nice. In production, the agent may diagnose a symptom as the root cause, apply a fix that hides the signal, or take an action that works for one customer path while breaking another.
Humans do this too, of course. The difference is that humans tend to be slower, more socially accountable, and easier to interrupt. An agent with broad permissions can be wrong very efficiently.
Efficiency is not always your friend during an incident.
## observability is the safety system
If you want agentic SRE, observability is not a nice add-on. It is the safety system.
The agent needs reliable telemetry, but the humans need telemetry about the agent too; one way to capture it is sketched after this list:
- What data did it inspect?
- Which queries did it run?
- What assumptions did it make?
- Which actions did it propose?
- Which actions did it execute?
- Who approved them?
- What changed after the action?
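One low-tech way to answer those questions is a structured log line per agent step. A minimal sketch; the field names are illustrative, not a standard schema.

```python
import datetime
import json

def audit_record(step: str, detail: dict) -> str:
    """One JSON line per agent step, so humans can replay the investigation."""
    record = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "step": step,  # e.g. "query", "assumption", "proposal", "execution"
        **detail,
    }
    return json.dumps(record)

# The agent logs the exact query it ran, not just its conclusion.
print(audit_record("query", {"tool": "cloudwatch", "query": "sum of 5xx, last 30m"}))
print(audit_record("assumption", {"text": "spike correlates with deploy abc123"}))
```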
If the agent says "the database is the bottleneck," I want to know whether it looked at saturation, lock waits, connection pool exhaustion, disk latency, downstream timeouts, or just one sad-looking CPU graph.
This is why I am skeptical of incident agents that only produce beautiful natural-language summaries. Summaries are useful, but they can also compress away uncertainty. During an incident, uncertainty is not noise. It is part of the signal.
A good SRE agent should show its work like a nervous staff engineer in a postmortem.
## permissions should match the phase of the incident
The easiest bad design is to give the agent one big production role and trust it to be careful.
Please do not do that.
Incident response has phases, and the permissions should match them.
For normal operation, an agent should mostly be read-only. Let it inspect metrics, logs, traces, deploy metadata, feature flag state, config history, runbooks, and recent alerts. This alone is already valuable.
For mitigation, allow a small set of reversible actions: create an incident timeline, draft a rollback command, propose a feature-flag change, open a PR, page the owning team, or prepare a runbook step. Maybe some teams allow low-risk automated actions, but they should be explicit and boring.
For high-impact operations, require human approval. Rollbacks, traffic shifting, database failovers, permission changes, and infrastructure mutation should not be hidden behind "the AI thought it was best."
This is not anti-automation. This is how grown-up automation works. The blast radius decides the approval model.
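Concretely, this can be as dull as a table mapping incident phase to allowed verbs, with the high-impact tier empty by default. A sketch, with hypothetical phase and action names:

```python
# Hypothetical phases and verbs. The empty high-impact set is the point:
# rollbacks, traffic shifts, and failovers never run without a human.
PHASE_PERMISSIONS: dict[str, set[str]] = {
    "normal": {"read_metrics", "read_logs", "read_traces", "read_flags", "read_runbooks"},
    "mitigation": {"draft_rollback", "propose_flag_change", "open_pr", "page_owning_team"},
    "high_impact": set(),
}

def allowed(phase: str, action: str) -> bool:
    """Deny by default: unknown phases and unknown actions both return False."""
    return action in PHASE_PERMISSIONS.get(phase, set())

assert allowed("normal", "read_logs")
assert not allowed("mitigation", "rollback_deployment")
```

The data structure is trivial on purpose. What matters is that the list of autonomous actions is explicit, reviewable, and boring.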
## runbooks become executable contracts
One thing I like about the agentic SRE direction is that it may finally force teams to clean up their runbooks.
A runbook written only for humans can be vague:
> Check the dashboard and restart the service if it looks stuck.
A runbook used by an agent needs better structure:
- Which dashboard?
- Which metrics define "stuck"?
- What threshold matters?
- What command restarts the service?
- Is restart safe during a deploy?
- Who approves it?
- How do we verify recovery?
- What should never be restarted automatically?
That is healthy pressure.
The same happened with CI/CD. Once deployment became automated, teams had to make the release process explicit. Agentic SRE could do the same for operations. Not because the agent is magical, but because automation punishes ambiguity.
If your runbook cannot be followed by a careful junior engineer at 3 AM, it probably cannot be safely followed by an agent either.
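One way to apply that pressure is to make the runbook step a typed object that cannot exist without answers to those questions. A minimal sketch; every field value here is illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RestartStep:
    dashboard_url: str          # which dashboard
    stuck_metric: str           # which metric defines "stuck"
    stuck_threshold: float      # what threshold matters
    restart_command: str        # what command restarts the service
    safe_during_deploy: bool    # is restart safe during a deploy?
    approver_role: str          # who approves it
    recovery_check: str         # how we verify recovery
    auto_restart_allowed: bool  # whether this may ever run automatically

CHECKOUT_RESTART = RestartStep(
    dashboard_url="https://grafana.example.com/d/checkout",
    stuck_metric="queue_lag_seconds",
    stuck_threshold=300.0,
    restart_command="kubectl rollout restart deploy/checkout",
    safe_during_deploy=False,
    approver_role="incident-commander",
    recovery_check="queue_lag_seconds < 60 for 5 consecutive minutes",
    auto_restart_allowed=False,
)
```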
## the pager is not a benchmark
The most important thing I would avoid is turning incidents into an AI leaderboard.
"The agent resolved 42% of incidents automatically" sounds impressive until you ask which incidents, which actions, how many false positives, how many hidden regressions, and how many humans quietly cleaned up afterward.
Better metrics would be more boring; the first one is sketched in code after this list:
- time to useful first summary
- percentage of incidents with complete timelines
- reduction in repeated manual diagnostic steps
- approval rate for agent-proposed actions
- rollback or revert rate after agent-assisted mitigation
- postmortem findings caused by missing context
- number of times the agent escalated correctly instead of guessing
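None of these require fancy evaluation infrastructure. The first metric, for example, is a subtraction and a median; the incident records here are hypothetical.

```python
import statistics

# Hypothetical incident records: seconds from page to the agent's first
# summary that responders actually used.
incidents = [
    {"paged_at": 0.0, "first_useful_summary_at": 240.0},
    {"paged_at": 0.0, "first_useful_summary_at": 150.0},
    {"paged_at": 0.0, "first_useful_summary_at": 600.0},
]

def median_time_to_summary(records: list[dict]) -> float:
    return statistics.median(
        r["first_useful_summary_at"] - r["paged_at"] for r in records
    )

print(median_time_to_summary(incidents))  # 240.0
```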
I care much more about an agent that reliably saves ten minutes of investigation than one that occasionally performs a heroic autonomous fix and occasionally makes everyone sweat.
Hero automation is fun in demos. Boring assistance is what survives production.
## what i would build first
If I were adding agentic SRE to a team today, I would start with the least glamorous version.
Read-only incident assistant. No mutation. No secret powers.
It would join the incident channel, collect telemetry, build a timeline, link recent deploys, summarize symptoms, identify likely owners, and keep a running "known facts vs guesses" list.
Then I would add proposed actions, not executed actions. The agent can draft the rollback command, but a human runs it. The agent can suggest the feature flag, but a human flips it. The agent can propose scaling, but it has to show the evidence.
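The shape of a proposal matters more than the shape of an executor. A minimal sketch, assuming nothing beyond the standard library; the Proposal fields and values are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Proposal:
    command: str                       # drafted by the agent, run by a human
    rationale: str
    evidence: list[str] = field(default_factory=list)

rollback = Proposal(
    command="kubectl rollout undo deploy/checkout",
    rationale="5xx rate rose four minutes after deploy abc123 reached 100%",
    evidence=["5xx sum, last 30m", "deploy timeline", "sample of error logs"],
)
print(f"PROPOSED (a human must run this): {rollback.command}")
print(f"because: {rollback.rationale}")
```

Because the agent only ever emits text plus evidence, the human keeps the verb.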
Only after that works for a while would I consider limited automated mitigation. And even then, I would start with narrow actions that are reversible, logged, and already accepted as safe by the team.
The boring maturity model is something like:
- read-only summarizer
- timeline and evidence builder
- runbook navigator
- action recommender
- human-approved operator
- narrow autonomous mitigator
Skipping from step one to step six is how you get a postmortem with the phrase "unexpected agent behavior" in it.
## the real shift
The bigger story is not that AWS, or any vendor, can build an SRE chatbot.
The bigger story is that operations is becoming another place where agents participate in the workflow. Not as magic coworkers. As tool-using processes with access, memory, logs, permissions, and failure modes.
That means platform teams need to design around them.
The same questions keep coming back: what can the agent see, what can it change, how do we review it, how do we observe it, how do we roll it back, and who owns the mess when it is wrong?
Agentic SRE is exciting because it attacks real toil. It is dangerous for the same reason. The work is real, the systems are real, and the pager does not care that the demo looked amazing.
So yes, bring agents into incident response.
Just make them earn trust the same way every other operational tool does: read-only first, observable always, reversible where possible, and very careful around anything that can turn a small fire into a bigger one.
