Most conversations around AI agents focus on model performance.
In real production environments, that is rarely the limiting factor.
After working closely with production systems, I keep seeing the same pattern:
AI does not fail because of intelligence limitations.
It fails because of system design gaps.
Let’s break this down from a systems engineering perspective.
1. Signal Quality > Model Quality
AI systems rely entirely on input signals.
But most production environments expose:
- logs without context
- metrics without causality
- alerts without correlation
This creates fragmented visibility.
Even a highly capable model cannot make reliable decisions on inconsistent signals.
In practice, poor observability architecture is the first failure point.
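To make the correlation point concrete, here is a minimal sketch of the fix: give every signal an explicit correlation key so logs, metrics, and alerts can be joined into one view instead of arriving as three disconnected streams. The `Signal` shape, field names, and trace IDs below are illustrative, not any particular vendor's schema.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class Signal:
    kind: str       # "log" | "metric" | "alert"
    trace_id: str   # explicit correlation key tying signals together
    service: str
    payload: dict

def correlate(signals):
    """Group raw signals by trace_id so a consumer (human or AI)
    sees one coherent unit per incident instead of three silos."""
    grouped = defaultdict(list)
    for s in signals:
        grouped[s.trace_id].append(s)
    return dict(grouped)

signals = [
    Signal("log", "t-42", "checkout", {"msg": "timeout calling payments"}),
    Signal("metric", "t-42", "payments", {"p99_ms": 4800}),
    Signal("alert", "t-42", "payments", {"rule": "latency_high"}),
]
view = correlate(signals)
# All three signals for trace t-42 now land in a single correlated group.
```

The model never gets smarter here; the input simply stops being fragmented.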
2. Missing System Abstractions
Human operators rely on implicit understanding:
- service dependencies
- failure blast radius
- historical patterns
AI systems do not have this intuition.
If your architecture does not explicitly define:
- service relationships
- ownership boundaries
- failure domains
then the system is not machine-interpretable.
AI requires structured abstractions. Most systems were never designed for that.
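One way to picture "explicit abstractions": encode the dependency graph as data, so blast radius becomes something a machine can compute rather than something an operator intuits. This is a hypothetical sketch; the service names and the flat dict representation are illustrative.

```python
# Service relationships made explicit as data. In a real system this
# would come from a service catalog, not a hand-written dict.
DEPENDS_ON = {
    "frontend":  ["checkout", "search"],
    "checkout":  ["payments", "inventory"],
    "search":    ["inventory"],
    "payments":  [],
    "inventory": [],
}

def blast_radius(failed_service):
    """Return every service that transitively depends on the failed one."""
    impacted = set()
    changed = True
    while changed:
        changed = False
        for svc, deps in DEPENDS_ON.items():
            if svc not in impacted and (
                failed_service in deps or impacted & set(deps)
            ):
                impacted.add(svc)
                changed = True
    return impacted

# A payments outage impacts checkout, and through checkout the frontend.
```

Once the graph exists as data, "what breaks if payments goes down?" is a query, not a judgment call.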
3. Non-Deterministic Workflows
Incident response in many teams is:
- partially documented
- context-driven
- experience-heavy
This works well for humans.
But AI systems require:
- deterministic steps
- clearly defined decision paths
- reproducible workflows
Without this, automation becomes unreliable and unpredictable.
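A deterministic workflow can be as simple as a runbook expressed in code: explicit steps, explicit thresholds, explicit decision paths. The step names and threshold values below are invented for illustration; the point is that the same inputs always produce the same action.

```python
# Hypothetical runbook: ordered rules with explicit conditions,
# instead of tribal knowledge. Steps and thresholds are illustrative.
RUNBOOK = [
    {"check": "error_rate",     "if_above": 0.05, "then": "roll_back"},
    {"check": "latency_p99_ms", "if_above": 2000, "then": "scale_out"},
]

def decide(metrics):
    """Walk the runbook in order; return the first action whose
    condition fires, or 'observe' if none do. Fully reproducible."""
    for rule in RUNBOOK:
        if metrics.get(rule["check"], 0) > rule["if_above"]:
            return rule["then"]
    return "observe"
```

Whether a human or an agent executes `decide`, the path taken is the same and can be replayed after the incident.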
4. The Hidden Constraint: System Readiness
Before introducing AI into production, there is a more important question to ask:
Is the system ready for AI?
A production system is “AI-ready” only if:
- signals are consistent and correlated
- dependencies are explicitly modeled
- workflows are structured and repeatable
Without these, AI will amplify system weaknesses instead of solving them.
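The three criteria above can even be treated as a pre-flight gate before any agent is wired in. This is a sketch only; the boolean checks stand in for real audits of your observability, catalog, and runbook coverage.

```python
# Hypothetical readiness gate: the "AI-ready" criteria as explicit checks.
# Each flag is a stand-in for a real audit, not an automated detection.
def ai_readiness(system):
    checks = {
        "signals_correlated":   bool(system.get("correlation_ids")),
        "dependencies_modeled": bool(system.get("dependency_graph")),
        "workflows_repeatable": bool(system.get("runbooks")),
    }
    return all(checks.values()), checks
```

Failing the gate is useful output in itself: it names which system gap the AI would otherwise amplify.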
Key Insight
We are trying to apply AI to systems that were never designed to be understood by machines.
That is the core problem.
A Better Approach
Instead of asking:
“How do we improve the AI model?”
We should ask:
“How do we redesign systems to be machine-interpretable?”
That shift changes everything.
For engineers already experimenting with AI in production:
What has been the hardest challenge so far: signal quality, dependency visibility, or workflow reliability?
Top comments (6)
We tried adding AI on top of messy logs, and it just made wrong decisions faster. Fixing observability and clear workflows helped way more than tweaking the model.
That’s a great point and honestly one of the most common failure patterns I’ve seen too.
AI ends up amplifying existing system weaknesses instead of fixing them. If observability, data quality, and workflows aren’t solid, the agent just accelerates bad decisions.
I like how you framed it, fixing the system often gives far more ROI than tweaking the model. AI should sit on top of well-structured signals, not compensate for missing ones.
Thanks for sharing, really valuable insight! It would be great to hear your thoughts on designing systems from scratch that are well optimized for AI-assisted incident management. Sounds like a great topic for an article 🙂
Thank you, really appreciate that!
That’s exactly where I think the conversation needs to go next. Designing systems with AI in mind from day one changes everything, especially around observability, feedback loops, and decision boundaries.
I’m planning to write a follow-up piece specifically on “AI-ready system design for incident management”, covering things like structured signals, human-in-the-loop workflows, and safe automation patterns.
Would love to hear your perspective as well when I publish it 🙂
Great to hear that, looking forward to the follow-up!
Thank you, Marina! Really appreciate the support. I’ll make sure the follow-up goes deeper into practical design patterns and real-world applications. Looking forward to sharing it soon and getting your thoughts on it.