Our voice agent passed every test and still woke me up at 3am

#ai #voice #testing #llm

Replaying real call transcripts as your test set is a trap. The failures come from the inputs a user produces exactly once.

TL;DR: Our voice-agent regression suite was 312 recorded production calls, all passing. The page at 3am came from a caller who switched between English and Hindi mid-sentence, a pattern that appeared zero times in those 312 calls. Replaying real transcripts tests the behavior you are already confident in. It does not test the inputs that actually break you. We moved to simulating adversarial callers, and below is what I learned trying five tools to generate and grade those simulated conversations (as of June 2026).

The test set was real, and that was the problem
For about four months our regression set was 312 recorded production calls. It felt rigorous. Real audio, real ASR output, real user intents, replayed on every deploy. Green for weeks.

Then the 3am page. A caller switched between English and Hindi inside single sentences. Our ASR mis-segmented the mixed-language audio, the intent classifier saw garbage, and the agent fell into a clarification loop it could not exit. The caller hung up. The dashboards were fine the whole time.

I went looking for that pattern in the 312 calls. It was not there. Not once. The people who code-switch like that had mostly churned months earlier, so the behavior was absent from the recordings precisely because we had never handled it. A test set built from past traffic contains what already happened, weighted toward the common case. The failures that page you are rare by definition, and rare inputs tend to be missing from any finite sample of the past.

Why replay gives false confidence
Replaying recorded calls is a regression test for behavior you have already seen. That is useful. It catches the case where a deploy breaks something that used to work. What it cannot do is produce an input you have never received. For that you have to manufacture the input on purpose: the fast talker who never pauses, the caller who interrupts the agent two words in, the code-switcher, the person who changes their mind halfway through a sentence, the line with a TV on in the background. That is simulation, and it is a different activity from replay.
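
To make "manufacture the input on purpose" concrete, here is a minimal Python sketch that treats each adversarial caller as a transformation applied to a plain scripted call. Everything in it is hypothetical and for illustration only: the `CallerPersona` type, the three mutation functions, and especially the three-word `EN_TO_HI` lexicon, which is nowhere near a realistic model of code-switching. It works on text turns; audio effects such as background noise or barge-in timing would have to be applied in the ASR/TTS layer instead.

```python
from dataclasses import dataclass
from typing import Callable, List

Turns = List[str]

# Hypothetical toy lexicon, for illustration only. Real code-switching is
# far richer than word-for-word substitution.
EN_TO_HI = {"tomorrow": "kal", "today": "aaj", "please": "kripya"}


@dataclass(frozen=True)
class CallerPersona:
    name: str
    mutate: Callable[[Turns], Turns]  # rewrites a scripted call's caller turns


def _swap_word(word: str) -> str:
    # Keep trailing punctuation, swap the word itself if the lexicon knows it.
    core = word.rstrip(".,?!")
    tail = word[len(core):]
    return EN_TO_HI.get(core.lower(), core) + tail


def code_switcher(turns: Turns) -> Turns:
    """Mix Hindi words into otherwise English sentences."""
    return [" ".join(_swap_word(w) for w in t.split()) for t in turns]


def never_pauses(turns: Turns) -> Turns:
    """Collapse every caller turn into one unpunctuated run-on turn."""
    return [" ".join(t.rstrip(".,?! ") for t in turns)]


def changes_mind(turns: Turns) -> Turns:
    """Cut each sentence off halfway and retract it."""
    out = []
    for t in turns:
        words = t.split()
        half = " ".join(words[: max(1, len(words) // 2)])
        out.append(f"{half}, actually no, scratch that")
    return out


PERSONAS = [
    CallerPersona("code_switcher", code_switcher),
    CallerPersona("never_pauses", never_pauses),
    CallerPersona("changes_mind", changes_mind),
]

BASE_SCRIPT = [
    "I want to book an appointment tomorrow.",
    "Please send the bill today.",
]

for persona in PERSONAS:
    print(persona.name, persona.mutate(BASE_SCRIPT))
```

The point of the shape is that one boring base script fans out into as many hostile variants as there are personas, none of which had to occur in production first.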

What I tried to generate and grade simulated calls
Five tools, roughly a week each, against the same eight adversarial caller profiles, plus one hosted layer that sits on top of one of them. None of these is voice-specific; I drove them from transcripts with a separate ASR/TTS layer. Honest notes; your mileage will differ:

Promptfoo: fast to wire into CI and good for red-teaming a prompt with generated variants. The fiddly part was that I had to build conversation state across turns by hand.
LangSmith: dataset versioning and the trace view were the best of the set. The simulation half I had to assemble myself.
Future AGI Simulate: persona-based (as of June 2026). You define caller personas and it runs them through the agent, which matched how I already thought about adversarial callers. Voice was not first-class, so I ran it on transcripts with ASR and TTS bolted on.
Braintrust: the nicest UI for eyeballing where a run diverged. Persona definitions lived outside it, in my code.
DeepEval: the most knobs for synthetic-conversation generation. Tuning the synthesizer to stop producing unrealistic turns took a while.
Confident AI: a reasonable hosted layer on top of DeepEval, though it is another account and key to manage.

I am deliberately not crowning one. Braintrust had the UI I liked, DeepEval had the most generation control, and the persona abstraction in Future AGI's Simulate (part of their open work at github.com/future-agi) lined up with how I list out adversarial callers. Any of them can run a persona once you have written the persona.

The thing that actually moved the needle was not the tool
It was the persona list. Once we had written eight adversarial callers (the angry caller, the two-words-then-silence caller, the code-switcher, the background-noise line, and so on), every tool above could run them and grade the results. The leverage was in naming the failure modes, not in the framework that executed them. We spent two days arguing about the personas and twenty minutes wiring the runner.
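
A tool-agnostic sketch of why the runner is the cheap part: once the personas exist as data, running and grading them is a short loop. The code below is an assumption-laden illustration, not the API of any tool named above. `toy_agent` is a hypothetical stand-in for the real agent, the marker phrases in `CLARIFY_MARKERS` are made up, and the single grader only checks for the failure from the incident, a run of consecutive clarification requests.

```python
from typing import Callable, Dict, List

Agent = Callable[[str], str]  # caller utterance in, agent reply out

# Hypothetical phrases that mark a reply as a clarification request.
CLARIFY_MARKERS = ("could you repeat", "i didn't catch that")


def is_clarification(reply: str) -> bool:
    text = reply.lower()
    return any(marker in text for marker in CLARIFY_MARKERS)


def longest_clarification_run(replies: List[str]) -> int:
    """Length of the longest streak of back-to-back clarification replies."""
    longest = current = 0
    for reply in replies:
        current = current + 1 if is_clarification(reply) else 0
        longest = max(longest, current)
    return longest


def run_persona(agent: Agent, caller_turns: List[str]) -> List[str]:
    """Feed one persona's scripted turns to the agent and collect replies."""
    return [agent(turn) for turn in caller_turns]


def grade(agent: Agent, personas: Dict[str, List[str]],
          max_clarifications: int = 2) -> Dict[str, bool]:
    """Map persona name to pass (True) or fail (False)."""
    return {
        name: longest_clarification_run(run_persona(agent, turns))
        <= max_clarifications
        for name, turns in personas.items()
    }


def toy_agent(utterance: str) -> str:
    # Stand-in agent that cannot handle the Hindi word "kal".
    if "kal" in utterance.lower().split():
        return "Sorry, could you repeat that?"
    return "Okay, booked."


PERSONAS = {
    "code_switcher": ["book appointment kal", "haan kal morning", "kal please"],
    "plain_caller": ["book appointment tomorrow", "morning", "thanks"],
}

results = grade(toy_agent, PERSONAS)
print(results)
```

Almost all of the judgment lives in `PERSONAS` and in the choice of what the grader looks for; the loop itself would look much the same inside any of the frameworks.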

The open question I still have
The space of adversarial callers is infinite, and we maintain eight. We chose those eight from incident postmortems, which means we are still only simulating failures we have already been burned by at least once. The genuinely novel failure, the next 3am page, is still unguarded. I do not have a principled way to pick simulation personas before the incident teaches me the persona. If you have one, that is the comment I want to read.

FAQ
Why not just add each failed call to the regression set after the incident?
We do. It is still reactive. The replay suite trails production by one outage, permanently.

Doesn't simulated traffic drift away from what real users do?
Yes, and that is a real cost. We re-sample the real call distribution monthly and adjust how often each persona fires. Simulation supplements replay; it does not replace it.
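
One way such a monthly adjustment could work, sketched in Python under stated assumptions: each persona's firing weight follows how often its pattern showed up in the latest sample of real calls, with a floor so that a persona observed zero times still fires. The floor, the function name `persona_weights`, and the example counts are all hypothetical choices for illustration, not a description of an actual pipeline.

```python
import random
from typing import Dict


def persona_weights(observed: Dict[str, int], floor: float = 0.05) -> Dict[str, float]:
    """Turn observed pattern counts into firing weights that sum to 1.

    Every persona gets at least `floor`, so a pattern that is absent from
    this month's sample (the dangerous case) is never silenced entirely.
    """
    n = len(observed)
    if n == 0:
        return {}
    if floor * n > 1.0:
        raise ValueError("floor is too large for this many personas")
    total = sum(observed.values())
    spare = 1.0 - floor * n  # probability mass left after the floors
    weights = {}
    for name, count in observed.items():
        share = count / total if total else 1.0 / n
        weights[name] = floor + spare * share
    return weights


# Made-up counts from a hypothetical monthly sample of real calls.
observed = {"background_noise": 60, "interrupter": 40, "code_switcher": 0}
weights = persona_weights(observed, floor=0.1)
print(weights)

# Draw which persona fires for each of the next ten simulated calls.
rng = random.Random(7)
schedule = rng.choices(list(weights), weights=list(weights.values()), k=10)
print(schedule)
```

The floor is the design choice that matters here: weighting purely by observed frequency would reproduce the replay problem, because the pattern nobody produces any more would drop to zero.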

Is any of this voice-specific?
Most of it applies to text agents too. Voice just adds two more failure surfaces: ASR segmentation and barge-in timing. The code-switching incident was really an ASR segmentation failure that a text agent would never have hit.
