I thought my AI agent was solid.
It passed every test I threw at it. Clean prompts, expected inputs, edge cases I could think of. I tweaked the prompt, adjusted temperature, ran it a dozen times. It worked. So I shipped it.
Then I tested it the way real users interact with systems, and it started failing almost immediately.
Not in dramatic ways. Subtle ones. The kind that don’t show up in demos, but absolutely show up in production.
That’s what this post is about.
Most of us test AI agents on what I’ve started calling the “happy path.” We give the agent a clean input, maybe a couple of variations, see a reasonable response, and move on. If you’re doing evals, you might score correctness or similarity against a dataset. If you’re doing observability, you’ll catch issues once users are already hitting them.
The problem is that none of this really answers a more important question: how does this agent behave when assumptions break?
Real users don’t type perfect prompts. They make typos, repeat themselves, paste partial instructions, get frustrated, or phrase things in ways you didn’t anticipate. Some of them will try to manipulate the system. Others will just be weird in perfectly normal human ways. On top of that, LLMs are non-deterministic. The same input doesn’t always produce the same behavior over time.
An agent that “works” once is not the same thing as an agent that’s reliable.
What finally made this obvious to me was taking a single prompt I trusted and mutating it in lots of small, realistic ways. Not synthetic benchmark data. Just variations that reflect how people actually interact with systems: tone changes, small noise, longer context, partial encoding, instruction overrides.
That’s when things started breaking.
I saw latency spikes I hadn’t noticed. I saw outputs drift in ways that violated assumptions I thought were stable. I saw cases where the agent followed user instructions it absolutely shouldn’t have. None of these showed up in my original tests.
This isn’t a new lesson if you’ve worked on distributed systems. We learned a long time ago that reliability doesn’t come from writing more unit tests. It comes from intentionally stressing systems and observing how they fail. Chaos engineering exists because real systems don’t fail along neat boundaries.
AI agents aren’t any different. We’ve just been treating them like static functions instead of probabilistic systems interacting with messy humans.
That gap is what led me to build Flakestorm. Not as a replacement for eval frameworks or observability tools, but as something that sits earlier in the lifecycle. The goal isn’t to measure how “good” an answer is in the abstract. It’s to expose failure modes before users do.
The approach is simple: start with a prompt you care about, generate adversarial but realistic variations, run them against your agent, and assert things you actually depend on. Response shape. Latency. Safety rules. Semantic intent. Then look at where it breaks.
Sometimes the result is reassuring. Often it isn’t. Either way, you learn something concrete.
I’ve found this kind of testing useful precisely because it’s uncomfortable. It forces you to confront assumptions you didn’t realize you were making. It also changes how you think about prompts and system design, because you stop optimizing for a single “correct” response and start thinking in terms of behavior under pressure.
If you’re already using evals, this complements them. If you rely on observability, this helps you catch issues before they show up in dashboards at 2 a.m. And even if you don’t use Flakestorm specifically, I’d strongly recommend adopting this mindset.
If you only test happy paths, your agent is already broken. You just haven’t seen where yet.
For anyone curious, Flakestorm is open source and runs locally. The repo is here: https://github.com/flakestorm/flakestorm. Even if you don’t use it, I hope the way of thinking is useful.
Top comments (0)