
Neethu E V

I have talked to dozens of AI teams about production. The same things keep breaking.

I have been a PM at Netra long enough to have had the same conversation about 40 times.
An AI team reaches out. They're building something serious in the agent space: a customer-facing agent. Things are moving fast. They're shipping.
And somewhere in the first ten minutes, they describe the same problem in different words.
"We got to know that the quality had dropped from a user complaint."
"Our costs spiked and we saw it in the monthly finance review."
"We changed the prompt three weeks ago and we're not sure if that's what caused this."
Every time. Different team. Different product. Same pattern.

What's actually happening?
These aren't careless teams. Most of them have logging. A few have dashboards they check regularly.
But there's a consistent gap between what their tools tell them and what they actually need to know.
Their tools answer: is the system running? Did something break? What happened on this specific request?

What they actually need to know: is the agent's performance getting better? Are users getting a worse experience this week than they were before the last release? If the agent drifts from its expected behaviour, would we catch it before a user does?
Those are different questions. And most of the tooling AI teams have right now was built for the first set, not the second.
The gap isn't a monitoring gap. It's a visibility gap. Most AI teams can tell you if their AI system is healthy. Very few can tell you if it's improving.

The three failure modes I keep seeing:

After enough of these conversations, the patterns get predictable.

  1. The prompt change nobody could validate. A developer updates the system prompt, runs a few manual checks, it looks good, ships it. Three weeks later a complaint comes in: a user says the agent has been giving weird answers lately. The developer goes looking. No baseline. No before-and-after comparison. No way to connect the complaint to the release. Just logs from a period when everything appeared to be working fine.

  2. The cost spike that showed up in a meeting. A new model version gets deployed. More capable, slightly more expensive per call. Usage patterns shift in ways nobody anticipated. The cost delta doesn't show up in anyone's dashboard. It shows up two weeks later when finance flags it, and the engineering team gets a message asking them to explain the increase.

  3. The quality drop nobody noticed. Gradually, imperceptibly, the agent starts giving slightly worse answers. No exceptions thrown. No error rates spiking. The system is technically healthy by every measure the team has. Users just slowly have a worse experience. Someone on customer success eventually notices a pattern in support conversations. By then it's been weeks.

Every single team I've talked to has a version of at least one of these stories.

What changes when evals, simulations and alerts are in place
I'm not going to pretend this is a magic fix. It takes defining what good looks like for your specific product. That part is on you.
But here's what I've seen consistently on the other side of that work.
Teams stop finding out about quality drops from users. They catch them first, sometimes weeks before a user would have noticed. In Netra, that means scored comparisons against real user queries, run before and after each change and tied to the exact version shipped.
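
To make that concrete, here's a minimal sketch of what a before-and-after eval can look like. Everything in it is a hypothetical stand-in for illustration (run_agent, score_answer, the baseline queries, the 0.02 tolerance), not Netra's API: wire it to your own agent call and your own definition of good.

```python
from statistics import mean

# Hypothetical stand-ins, not Netra's API: replace run_agent with your real
# agent call and score_answer with your grading logic (LLM-as-judge, rubric,
# exact-match checks, whatever fits your product's definition of "good").
def run_agent(prompt_version: str, query: str) -> str:
    canned = {
        "prompt-v12": "Sorry, I can't help with that.",
        "prompt-v13": "Sure: go to Settings > Account and follow the steps.",
    }
    return canned[prompt_version]

def score_answer(query: str, answer: str) -> float:
    # Toy scorer: penalize refusals. A real scorer grades actual correctness.
    return 0.0 if "can't help" in answer else 1.0

def eval_version(prompt_version: str, queries: list[str]) -> float:
    # Average score across the frozen baseline set for one prompt version.
    return mean(score_answer(q, run_agent(prompt_version, q)) for q in queries)

# Real user queries sampled from production logs, frozen so every release
# is compared against the same baseline.
baseline = ["how do I reset my password?", "cancel my subscription"]

before = eval_version("prompt-v12", baseline)
after = eval_version("prompt-v13", baseline)
print(f"prompt-v12={before:.2f}  prompt-v13={after:.2f}  delta={after - before:+.2f}")

# Release gate: ship the new version only if it doesn't regress past tolerance.
assert after >= before - 0.02, "prompt-v13 regressed against the baseline set"
```

The details that matter are the frozen baseline set and the version tags: scoring the same real queries against both versions is what turns "seems fine" into a number you can gate a release on.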
And cost spikes stop being surprises. Alert rules tuned to AI-specific signals mean the team finds out first, not accounting.
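
Here's what one such rule can look like, again as a hedged sketch: the 14-day window, the 1.5x ratio, and the alert plumbing are assumptions for illustration, not any particular product's defaults.

```python
from collections import deque

# Hypothetical alert rule, for illustration only: compare today's average
# cost per request against a trailing baseline and fire on a sustained jump,
# not on one expensive call.
WINDOW_DAYS = 14
SPIKE_RATIO = 1.5  # fire if today's avg exceeds 1.5x the trailing mean

history = deque(maxlen=WINDOW_DAYS)  # daily avg cost per request, in dollars

def alert(message: str) -> None:
    # Stand-in for paging or Slack; the point is engineering hears it first.
    print("ALERT:", message)

def record_day(avg_cost_per_request: float) -> None:
    if len(history) == WINDOW_DAYS:
        baseline = sum(history) / len(history)
        if avg_cost_per_request > SPIKE_RATIO * baseline:
            alert(f"cost/request ${avg_cost_per_request:.4f} is "
                  f"{avg_cost_per_request / baseline:.1f}x the {WINDOW_DAYS}-day baseline")
    history.append(avg_cost_per_request)

# Example: two steady weeks at ~$0.002/request, then a new model version
# shifts usage patterns and the per-request cost jumps. The final day
# triggers the alert.
for daily_avg in [0.002] * WINDOW_DAYS + [0.0035]:
    record_day(daily_avg)
```

The single expensive call isn't the signal; the sustained shift in cost per request after a model or prompt change is, and that's exactly the delta that never shows up in an infrastructure dashboard.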
The shift isn't just operational. It changes how teams make decisions. "We think it's better" becomes "we can show it's better." That's a different kind of confidence going into every release.

Why do I keep bringing this up to developers?
Honestly? Because most developers I talk to are building AI products right now on an observability layer that was designed for something else, and still validating changes with manual testing, the age-old approach.
Infrastructure monitoring was built for deterministic systems. Up or down. Pass or fail. AI products aren't deterministic. They drift. They degrade. They change in ways that don't look like outages but matter just as much to users.
Evals and alerts are the layer built specifically for that problem. Not a workaround. Not a repurposed tool. Built for the non-deterministic, probabilistic nature of AI agents in production.
You can't catch probabilistic drift with deterministic monitoring. That's not a team problem. That's a tooling problem.
If you're shipping AI changes right now and your quality signal is "seems fine," that's the gap. Not because you're not paying attention. Because the tools you have weren't built for this question.

40 conversations. Different teams, different products, different industries. The same gap in every single one.
The tools most AI teams are using were built for a different problem. Infrastructure monitoring tells you when things break. It was never designed to tell you when things drift, degrade, or quietly get worse in ways that don't look like failures.

Evaluations, simulations and alerts were built for that problem specifically. Not as an add-on. As the foundation for shipping AI with actual confidence rather than hope.

That's the pattern. That's why I keep talking about it. That's what we built Netra to solve. If you want to see what it looks like in practice, try it out: getnetra.ai
