Everyone says AI agents will reduce on-call fatigue.
So I added one to a real production incident workflow, not to replace engineers, but to assist with triage, summarization, and next-step recommendations.
It helped in some places.
It failed in others.
And the biggest lesson had less to do with the model and more to do with system design.
The Setup
I integrated an AI agent into a typical incident response flow:
- Incoming alerts from monitoring systems
- Initial triage and classification
- Root cause hypothesis
- Suggested remediation steps
The agent was allowed to:
- Summarize alerts
- Group duplicate incidents
- Suggest possible causes
- Draft remediation steps
The agent was NOT allowed to:
- Execute production changes
- Restart services
- Modify configs
- Trigger escalations automatically
This was intentional. I wanted to see where it adds value without risking production.
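To make that boundary concrete: the safest version I've found is to enforce the split in code, not in the prompt. Here's a minimal sketch of that permission gate. The action names and the `AgentAction` wrapper are hypothetical, not tied to any specific framework.

```python
from dataclasses import dataclass, field

# Read-only actions the agent may perform on its own.
ALLOWED_ACTIONS = {
    "summarize_alert",
    "group_incidents",
    "suggest_causes",
    "draft_remediation",
}

# Actions that always require a human, no matter what the agent asks for.
BLOCKED_ACTIONS = {
    "execute_change",
    "restart_service",
    "modify_config",
    "trigger_escalation",
}

@dataclass
class AgentAction:
    name: str
    payload: dict = field(default_factory=dict)

def authorize(action: AgentAction) -> bool:
    """Allow read-only work; hard-block anything that touches production."""
    if action.name in BLOCKED_ACTIONS:
        return False
    return action.name in ALLOWED_ACTIONS

# The agent can summarize, but a restart request is rejected at the gate.
assert authorize(AgentAction("summarize_alert", {"alert_id": "a-123"}))
assert not authorize(AgentAction("restart_service", {"service": "checkout"}))
```

The detail that matters: the blocklist wins over the allowlist, so any action name the gate has never seen defaults to denied.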
What Worked Surprisingly Well
1. Alert Summarization
The agent reduced noisy alerts into clean summaries.
Instead of reading through logs, I got:
“High latency observed in service X after deployment Y. Likely related to dependency Z.”
This alone saved time during high-pressure incidents.
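Under the hood this step is mostly prompt construction: collapse the raw alert payloads into a compact request for a two-sentence summary. A simplified sketch, where `call_llm` is a placeholder for whatever model client you use and the alert fields are assumptions about your alert schema:

```python
def build_summary_prompt(alerts: list[dict]) -> str:
    """Collapse raw alert payloads into one compact summarization request."""
    lines = [
        f"- [{a['severity']}] {a['service']}: {a['message']} ({a['timestamp']})"
        for a in alerts
    ]
    return (
        "Summarize these related alerts in two sentences. "
        "Name the affected service, the likely trigger (e.g. a recent deployment), "
        "and any dependency mentioned in the messages.\n"
        + "\n".join(lines)
    )

# call_llm() is a stand-in for your model client of choice.
# summary = call_llm(build_summary_prompt(alerts))
```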
2. Duplicate Incident Grouping
It grouped alerts that were actually the same issue.
This reduced alert fatigue and helped focus on the real root cause faster.
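Worth noting: a lot of the grouping win doesn't even need a model. A crude fingerprint gets you most of the way, and the agent then only has to confirm or split the groups. A sketch, assuming each alert carries a service name, an error class, and an epoch timestamp:

```python
from collections import defaultdict

def fingerprint(alert: dict) -> tuple:
    """Crude dedup key: same service + same error class within a 5-minute window."""
    bucket = alert["timestamp"] // 300  # epoch seconds -> 5-minute bucket
    return (alert["service"], alert["error_class"], bucket)

def group_alerts(alerts: list[dict]) -> dict[tuple, list[dict]]:
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for alert in alerts:
        groups[fingerprint(alert)].append(alert)
    return dict(groups)

# Forty alerts from the same failing dependency collapse into one group,
# which is the thing the on-call engineer actually needs to look at.
```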
3. Drafting Next Steps
It suggested reasonable first actions:
- Check recent deployments
- Validate dependency health
- Inspect error spikes
Not perfect, but a solid starting point.
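What made those drafts reviewable was forcing them into a structure instead of free text, so a human can scan action, rationale, risk, and evidence at a glance. A sketch of the shape I'd aim for; the field names are illustrative, not from any real schema:

```python
from dataclasses import dataclass

@dataclass
class RemediationStep:
    action: str          # what to do, e.g. "roll back deployment Y"
    rationale: str       # why the agent thinks it helps
    risk: str            # "low" | "medium" | "high"
    evidence: list[str]  # alert IDs or log lines that back the suggestion

def needs_human_approval(step: RemediationStep) -> bool:
    """Anything riskier than 'low', or any step with no evidence, goes to a human."""
    return step.risk != "low" or not step.evidence
```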
What Broke Almost Immediately
1. Wrong Prioritization
The agent sometimes treated low-impact issues as critical.
Severity is not just data. It is context.
And context is hard.
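Concretely, the model was missing everything it couldn't see: which tier the service is in, whether customers are affected, whether a change freeze is on. The fix is to enrich the alert with that context before asking for a priority at all. A sketch, with a hypothetical service catalog standing in for your CMDB or service registry:

```python
# Hypothetical service catalog; in reality this comes from your CMDB / service registry.
SERVICE_CATALOG = {
    "checkout": {"tier": 1, "customer_facing": True},
    "batch-reports": {"tier": 3, "customer_facing": False},
}

def enrich(alert: dict) -> dict:
    """Attach business context so severity isn't judged on the raw signal alone."""
    meta = SERVICE_CATALOG.get(alert["service"], {"tier": 3, "customer_facing": False})
    return {**alert, **meta}

def priority(alert: dict) -> str:
    a = enrich(alert)
    if a["tier"] == 1 and a["customer_facing"]:
        return "P1"
    if a["tier"] <= 2:
        return "P2"
    return "P3"  # low impact: should not page anyone at 2 a.m.
```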
2. False Confidence
The responses sounded highly confident even when they were wrong.
This is dangerous in production systems.
Confidence ≠ correctness.
3. Noisy Recommendations
Some suggestions were technically valid but operationally useless.
Example:
- “Restart the service”
In production, that is not always acceptable without deeper checks.
4. Escalation Confusion
It struggled to decide when to involve humans.
Too early → noise
Too late → risk
That balance is harder than it looks.
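My current take: don't let the model make this call at all. Encode the escalation trigger as a boring, explicit rule and let the agent only feed inputs into it. A sketch, with placeholder thresholds; `opened_at` is expected to be timezone-aware:

```python
from datetime import datetime, timedelta, timezone

ACK_TIMEOUT = timedelta(minutes=10)  # placeholder threshold

def should_escalate(priority: str, opened_at: datetime, acked: bool) -> bool:
    """Deterministic escalation: top priority, or nobody acknowledged in time."""
    if priority == "P1":
        return True
    overdue = datetime.now(timezone.utc) - opened_at > ACK_TIMEOUT
    return not acked and overdue

# The agent can argue for a priority; the escalation itself stays rule-driven.
```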
The Real Problem: System Design
After a week, it became clear:
The AI agent was not the main problem.
The real issues were:
- Weak incident workflows
- Poor escalation design
- Lack of structured context
- No clear guardrails
If your system is messy, the AI will reflect that mess faster.
The Architecture That Works Better
Here is what I would recommend instead:
- Alert comes in
- AI summarizes + groups signals
- AI suggests possible causes
- Human validates context
- AI drafts remediation options
- Human approves final action
AI as a co-pilot, not an autopilot.
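Stitched together, that flow is just a pipeline with two mandatory human gates. Here's a sketch reusing the hypothetical helpers from the earlier snippets; it's the shape of the control flow, not a real framework:

```python
def handle_incident(alerts: list[dict], call_llm, ask_human) -> None:
    """Co-pilot flow: the model proposes, a human signs off on every irreversible step."""
    for grouped in group_alerts(alerts).values():                 # AI groups signals
        summary = call_llm(build_summary_prompt(grouped))         # AI summarizes
        causes = call_llm(f"List likely causes for: {summary}")   # AI hypothesizes

        # Gate 1: a human validates the framing before anything else happens.
        if not ask_human(f"Does this context look right?\n{summary}\n{causes}"):
            continue

        draft = call_llm(f"Draft remediation options for: {summary}")

        # Gate 2: a human approves the final action; the agent never executes it.
        if ask_human(f"Approve these steps?\n{draft}"):
            print("Handing approved steps to the on-call engineer")
```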
Key Takeaways
- AI is great at summarization and pattern detection
- It struggles with context and real-world constraints
- Confidence can be misleading
- System design matters more than model capability
Most teams trying AI in incident response are not failing because of the model.
They are failing because their workflow is not designed for AI.
Final Thought
AI can absolutely improve incident response.
But if your escalation paths, permissions, and observability are weak,
the agent will not fix your system.
It will expose it.
Question for You
Would you allow an AI agent in your on-call workflow?
- Recommendation only
- Limited action with approval
- Full automation
Curious to hear how others are approaching this.
Top comments (18)
Great article! You made a valid point. I think agents have their limitations. You need a human in the loop to verify the code or the decisions of the AI. I think OpenClaw is probably the most autonomous agent at the moment. You still need to verify the accuracy of the agent's work.
Appreciate that, Benjamin — completely agree.
Human-in-the-loop is still critical, especially when decisions have real impact. In my experience, the challenge isn’t just accuracy; it’s knowing when the system is confident versus when it’s guessing.
That’s where I see a lot of these agent systems struggling today — they can automate steps, but they still need clear boundaries and validation points.
I like your point on autonomous agents as well. Feels like we’re getting closer, but we’re not at a stage where you can fully trust them without guardrails.
You made a valid point. That is the reason I am cautious about agents these days. I agree with you 100% on having guardrails for agents, or for AGI.
That’s a fair take, and honestly a healthy mindset right now.
I think the real shift is treating guardrails as a core part of the system, not an afterthought. The more critical the workflow, the more you need clear boundaries, validation, and fallback paths.
Agents are powerful, but without those controls, they can fail in very non-obvious ways.
That is true!
Great write-up - and yes, I see AI as promising in this field ...
Appreciate that!
I’m bullish on AI here too, but I think we’re still in the phase where it works best as a “visibility layer” rather than a decision-maker.
It’s great at summarizing and surfacing signals, but the moment it has to reason across messy, real-world systems, things get interesting 😄
Have you tried using it in any production workflows yet?
Not personally, but I agree it can be really powerful (and a time saver) when used for certain things - and it can be a time and money drain for other things ... it all depends on WHEN and HOW you use it, and that's definitely an "evolving art" - people come up with clever ideas and approaches all the time ...
For instance, I just came across this article, which describes a simple but really brilliant approach (for software/app development, in this case):
boristane.com/blog/how-i-use-claud...
The main takeaway, for me, is how he's asking the AI to produce "artifacts" in the form of MD files to document the results - and he then goes into a loop, adding notes/comments, and asking the AI to refine the document based on that, etc - until he's satisfied ...
And using that approach he goes through a "research" => "plan" => "implement" pipeline, with each stage resulting in tangible 'artifacts' (the MD files), which also serve as the Context - rather than having messy/ephemeral/unstructured chat sessions with endless "prompting" and retrying, with the Context not really being tangible, and with the developer not really feeling 'in control' ...
Eye opener, in the category of "why didn't I think of that myself?" ;-)
This is a great example — thanks for sharing.
The “artifact-first” approach really resonates. I’ve seen similar patterns work well when the AI is forced to externalize its thinking instead of keeping everything in a chat loop. It makes the process auditable and, more importantly, gives humans something concrete to validate.
In incident workflows, I think this could translate into things like structured runbooks, timelines, or even evolving incident summaries rather than just transient responses.
The interesting challenge is making sure those artifacts stay grounded in real system context, otherwise they can still drift.
Curious — do you see this approach scaling well in more dynamic environments like on-call or incident response?
This is a brilliant way to put it:
"the AI is forced to externalize its thinking instead of keeping everything in a chat loop"
and this:
"It makes the process auditable"
and:
"do you see this approach scaling well in more dynamic environments like on-call or incident response"
Yes I do, in the ways you're hinting at ... this idea of 'artifacts', to broaden the idea: it's all about adding structure to your process, control, accountability, observability, etc ...
People are now starting to realize that the holy grail is not in "prompting harder" (and burning more tokens), you need a more disciplined/structured/intelligent approach ...
Really appreciate that — you captured it perfectly.
I completely agree, the shift is less about “better prompting” and more about building structured workflows around the AI. The moment you treat outputs as artifacts instead of responses, it changes how you design the entire system.
In my experience, that’s also where reliability starts to improve — you get traceability, iteration, and a clearer feedback loop.
The interesting next step is figuring out how to keep that structure lightweight enough for fast-moving environments like on-call, without slowing engineers down. That balance is where most systems either succeed or fall apart.
"building structured workflows around the AI" - that's a good way to summarize it!
Exactly — that’s the shift I’m seeing too.
Once you start thinking in terms of structured workflows instead of prompts, AI becomes much more predictable and usable.
Feels like we’re moving from “prompt engineering” to more of a “system design” mindset for AI.
Hi Ravi,
I was wondering if you'd like to connect with me on LinkedIn?
Hi Benjamin,
Absolutely, happy to connect. Here’s my LinkedIn:
linkedin.com/in/ravi-teja-reddy-ma...
It sounds good!
Tried something similar in our setup — summarization was a big win, but prioritization was always off without real context. Feels like AI helps only when your incident workflow is already clean, otherwise it just makes the gaps more obvious.
100% agree with this. That was exactly my takeaway too.
The agent didn’t fail because it was “bad”; it failed because our workflow had hidden gaps that humans were compensating for without realizing it.
Once those gaps showed up, the AI just amplified them instead of fixing them.
Curious, did you end up adding more structured context (like better incident metadata or runbooks) after that?