Anupam Sekhar C

My colleague's AI agent kept breaking in production. Here's what we found when we looked closer.

A couple of months ago a PM I know messaged me out of the blue. She knew I was on the Netra team and figured we might have something that could help with a problem she was stuck on.
Her team had built a customer support agent. It was passing all its tests. Evals looked good. But they kept getting escalations from users who said the agent had resolved their issue incorrectly. She'd been through the conversation logs herself and couldn't find anything obviously wrong.
We were still building Netra's simulation feature at the time, honestly still figuring out what it needed to catch in the real world. But this felt like exactly the kind of problem worth running it against.
I said: let's set up a session and see what comes up.

What the logs showed
When she walked me through the conversation I understood why nothing obvious had jumped out.
A user had contacted support about an account issue. The agent asked a clarifying question it needed answered correctly to resolve the issue properly. The user responded with something like "yeah, doesn't matter."
The agent accepted it. Treated it as answered and moved on. Closed the ticket.
It wasn't a technical failure. It wasn't making things up. It was the agent handling an unexpected input in the worst possible way: treating a non-answer as a valid one and carrying on.
In every test case they had for that flow, the user gave a direct answer. "Yes." "No." A specific value. The agent handled all of them correctly.
Nobody had ever tested what happened when a user gave an ambiguous, indirect response instead.
The agent wasn't broken. The test suite was just never designed to encounter the user who doesn't give you what you need.

What we ran
We set up a simulation with a specific type of user in mind — someone who answers indirectly, uses casual language, and doesn't always give the agent what it needs. Not a hostile user. Not a dramatic edge case. Just someone who doesn't follow the flow the agent was designed for.
We ran it against the same customer support flow.
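Leaving Netra's internals aside, the shape of that setup is easy to sketch. Here's a minimal, hypothetical version in Python. The persona, the `run_simulation` loop, and the agent interface (`respond()` returning `.text` and `.done`) are all invented for illustration; none of this is Netra's actual API.

```python
import random

# Hypothetical persona: answers indirectly instead of giving the agent
# the value it asked for. Casual, non-committal, human.
AMBIGUOUS_REPLIES = [
    "yeah, doesn't matter",
    "whatever works",
    "hmm, not sure, you pick",
    "either way I guess",
]

def ambiguous_user(agent_message: str) -> str:
    """Simulated user who never quite answers the question."""
    return random.choice(AMBIGUOUS_REPLIES)

def run_simulation(agent, simulated_user, max_turns: int = 10) -> list[tuple[str, str]]:
    """Drive a multi-turn conversation between the agent under test and a
    simulated user, recording the full transcript for later checks."""
    transcript: list[tuple[str, str]] = []
    user_message = "hey, something's wrong with my account"
    for _ in range(max_turns):
        transcript.append(("user", user_message))
        reply = agent.respond(user_message)   # hypothetical agent interface
        transcript.append(("agent", reply.text))
        if reply.done:                        # the agent believes it's resolved
            break
        user_message = simulated_user(reply.text)
    return transcript
```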
In the first run we found three more places where the agent accepted incomplete or ambiguous responses and moved on. Different questions, same pattern. The agent would receive something that wasn't quite an answer and treat it as one. Each time the conversation technically completed. Each time the outcome would have been wrong for the user.
None of these appeared in their evals. None of them would have. Evals check whether the agent handles a given input correctly. They don't check what happens when the input isn't what the agent was designed to receive.
Evals test the agent against inputs you predicted. Simulation tests it against users you didn't.
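To make that distinction concrete, here's what a behavioral check over one of those transcripts might look like. It's a deliberately crude heuristic, purely illustrative: flag any turn where the user gave a non-answer and the agent's next message wasn't a follow-up question.

```python
NON_ANSWERS = {"yeah, doesn't matter", "whatever works", "either way i guess"}

def accepted_non_answers(transcript: list[tuple[str, str]]) -> list[int]:
    """Return the turn indices where the agent treated a non-answer as an
    answer and moved on instead of re-asking."""
    flagged = []
    for i in range(len(transcript) - 1):
        speaker, text = transcript[i]
        next_speaker, next_text = transcript[i + 1]
        if (speaker == "user"
                and text.lower().strip() in NON_ANSWERS
                and next_speaker == "agent"
                and "?" not in next_text):    # agent didn't ask a follow-up
            flagged.append(i)
    return flagged
```

An eval asserts on the agent's output for an input you wrote down in advance. A check like this asserts on the agent's behavior for inputs you never wrote down at all.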

What we both learned
Here's the part I think about most.
She came out of that session with a clear set of failures to fix before the next release. But I came out of it with something I hadn't expected: a much sharper sense of what simulation actually needs to do.
Before that session I was thinking about simulation mostly in terms of behavioral patterns. Different user types. Edge case coverage. What I hadn't fully understood was that the most important thing isn't the behavior pattern. It's the ambiguity.
Real users don't fail agents by being difficult. They fail agents by being human. Indirect. Casual. Not always precise. The agent accepting "yeah, doesn't matter" wasn't a dramatic failure. It was a quiet one. Exactly the kind that doesn't show up until someone calls back two hours later.
That shaped how we built things in Netra, the fact checker specifically. It validates what information the agent actually captured across the full conversation, not just whether it technically reached the end. An agent can complete every step and still leave the user with the wrong outcome; the fact checker is built to catch exactly that. And it's not something you can replicate by manually testing with a few different users. It runs automatically across every turn of every simulation. The multi-turn structure matters for the same reason: it shows you where things quietly went wrong, not just where they loudly broke.
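This isn't Netra's implementation, just the core idea in a few lines. A minimal sketch, assuming the simulation exposes a `captured` dict of facts the agent believes it collected during the conversation (the field names are invented):

```python
REQUIRED_FACTS = ("account_id", "issue_type", "chosen_resolution")  # invented names

def fact_check(captured: dict, transcript: list[tuple[str, str]]) -> list[str]:
    """Validate what the agent actually captured across the full conversation,
    not just whether it reached the closing step."""
    problems = []
    for fact in REQUIRED_FACTS:
        value = captured.get(fact)
        if value is None:
            problems.append(f"{fact}: never captured")
        elif not any(str(value).lower() in text.lower()
                     for speaker, text in transcript if speaker == "user"):
            # the agent recorded a value the user never actually stated
            problems.append(f"{fact}: value {value!r} not grounded in the conversation")
    return problems
```

A conversation that "completed" but still returns problems from a check like this is exactly the quiet failure described above: every step ran, and the outcome is still wrong.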
You can't build a good simulation tool without running it against real problems. The feature gets better when it meets things it wasn't designed to catch.

Why this still matters
That conversation is the clearest example I have of why simulation sits in a different category from evals.
Evals measure whether the agent gave the right answer. Simulation tests whether the agent completed its goal when the user didn't cooperate. That second category, behavioral testing for AI agents, is what most teams are missing. Both matter, and both should be part of how you test before you ship. But they answer different questions, and the gap between them is exactly where real users live.
If you're building an AI agent right now and relying only on evals to catch problems before shipping, you're probably covering your outputs well. The question is whether you've tested for the inputs you didn't predict.

That's the question simulation was built to answer. And honestly, I understood it properly for the first time because a PM I know had a broken agent and we looked at it together.
If you want to see how Netra's simulation works in practice, getnetra.ai is a good place to start.
