Kartik N V J K

Posted on Jun 30 • Originally published at futureagi.com

How I caught the voice-agent failures my eval dashboard kept missing

#voiceai #ai #llm #testing

I was working on a retail support voice agent, and on paper it looked great. Transcription quality, green. Conversation coherence, green. Task completion, green. Every standard metric I had was happy.

The problem was that it kept getting order numbers wrong. It would confidently read back the wrong number, the call would still "complete," and my dashboard would still show green. The checks I had simply did not know what an order number was, or why getting it wrong was the whole ballgame.

That is the gap I want to write about. The default metrics catch the failures every voice agent shares. They cannot catch the failures that belong to your specific domain. Here is what I learned about writing checks that do.

Built-in metrics catch the universal failures, not yours

Every voice agent, in any business, can fail in the same handful of ways:

Did it transcribe what the user said correctly?
Did the conversation stay coherent across turns?
Did it actually finish the task by the end of the call?
Did it stay safe: no leaked personal data, no obvious prompt injection?

Off-the-shelf checks handle all of that well. What they cannot know is your taxonomy. A retail agent lives or dies on order numbers. A clinical scribe lives or dies on drug names. An insurance agent lives or dies on coverage codes. None of that is in a generic metric, so none of it shows up on a generic dashboard.

Know when to reach for a custom check

The rule I settled on is simple:

If the failure is universal, use a built-in check.
If the failure is specific to your domain, write a custom one.
In production, run both on every call.

The last point matters more than it sounds. The two layers catch completely different kinds of errors. A call can pass the universal checks and still fail the domain one, which is exactly what was happening to me. You want both signals, not one.
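A minimal sketch of running both layers on the same call, under stated assumptions: the `Call` shape, the check names, and the toy "task completed" heuristic are all hypothetical stand-ins for illustration, not any real tool's API.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Call:
    transcript: str         # what the agent said, as text
    expected_order_id: str  # ground truth from the order system (assumed available)


# Hypothetical stand-in for an off-the-shelf "task completion" check.
def task_completed(call: Call) -> bool:
    return "anything else" in call.transcript.lower()


# Hypothetical domain check: was the correct order number read back?
def order_id_read_back(call: Call) -> bool:
    return call.expected_order_id in call.transcript


UNIVERSAL: Dict[str, Callable[[Call], bool]] = {"task_completed": task_completed}
DOMAIN: Dict[str, Callable[[Call], bool]] = {"order_id_read_back": order_id_read_back}


def evaluate(call: Call) -> Dict[str, bool]:
    """Run every universal and domain check and keep each result separately."""
    all_checks = {**UNIVERSAL, **DOMAIN}
    return {name: check(call) for name, check in all_checks.items()}


# The agent swaps two digits, yet the call still "completes".
call = Call("Your order A1B2-9931 has shipped. Anything else?", "A1B2-9913")
results = evaluate(call)
print(results)  # {'task_completed': True, 'order_id_read_back': False}
```

Keeping the two results as separate keys, rather than folding them into one score, is what lets a green universal layer and a red domain layer show up side by side.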

What a custom check actually looks like

It helps to see a few real ones, in plain terms, no code:

Drug names for a medical scribe. Not just "was the word spelled right?" but did the drug, the dose, the frequency, and the route all survive together? Getting three of four right is still a dangerous note.
Brand voice for a marketing line. Did it open with the approved greeting, avoid the banned phrases, and keep the right tone? A coherent call can still sound nothing like the brand.
Quote correctness for an insurance claim. Check each field (the deductible, the limit, the exclusions) against the actual policy, and flag any mismatch for a human to look at.

In every case the check encodes something only my team knows. That is the whole point of writing your own.
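For readers who do want to see one in code, here is a rough sketch of the medical-scribe rule: all four fields must match, so three of four is a fail. The `MedicationOrder` type, the normalization, and the sample values are assumptions made up for illustration; a real check would need proper clinical normalization (units, abbreviations, synonyms).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MedicationOrder:
    drug: str
    dose: str
    frequency: str
    route: str


def _norm(value: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not fail the check."""
    return " ".join(value.lower().split())


def medication_survived(spoken: MedicationOrder, noted: MedicationOrder) -> bool:
    """Pass only if drug, dose, frequency, and route all match together."""
    fields = ("drug", "dose", "frequency", "route")
    return all(_norm(getattr(spoken, f)) == _norm(getattr(noted, f)) for f in fields)


# Illustrative values only: the note gets three of four fields right.
spoken = MedicationOrder("Metoprolol", "25 mg", "twice daily", "oral")
noted = MedicationOrder("metoprolol", "25 mg", "once daily", "oral")
print(medication_survived(spoken, noted))  # False
```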

Pick the shape of the score before you write it

Before writing the check, decide what kind of answer it gives back:

Yes or no, for clean pass-or-fail rules ("did it read the right order number").
A category label, when you want to sort calls into buckets.
A score from 0 to 1, when quality is a gradient, like how closely the brand voice matched.

Pick the one that matches what you will do with it. A pass-rate dashboard wants yes or no. A trend line wants the gradient. Choosing the shape first saves you from rewriting the check later.
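The three shapes map naturally onto three return types. The sketch below is illustrative only: the function names and the bucket labels are hypothetical, and the phrase-counting score is a deliberately crude proxy for "how closely the brand voice matched".

```python
from enum import Enum
from typing import List, Optional


class ReadbackBucket(Enum):
    CORRECT = "correct"
    WRONG_NUMBER = "wrong_number"
    NO_READBACK = "no_readback"


def readback_passed(read_back: Optional[str], expected: str) -> bool:
    """Yes or no: did the agent read back the right order number?"""
    return read_back == expected


def readback_bucket(read_back: Optional[str], expected: str) -> ReadbackBucket:
    """Category label: sort the call into a bucket."""
    if read_back is None:
        return ReadbackBucket.NO_READBACK
    if read_back == expected:
        return ReadbackBucket.CORRECT
    return ReadbackBucket.WRONG_NUMBER


def brand_voice_score(transcript: str, required_phrases: List[str]) -> float:
    """Score from 0 to 1: fraction of required phrases that appear in the transcript."""
    if not required_phrases:
        return 1.0
    text = transcript.lower()
    hits = sum(1 for phrase in required_phrases if phrase.lower() in text)
    return hits / len(required_phrases)


print(readback_passed("AB123465", "AB123456"))   # False
print(readback_bucket(None, "AB123456").value)   # no_readback
print(brand_voice_score(
    "Thanks for calling Acme. How can I help?",
    ["thanks for calling acme", "have a great day"],
))  # 0.5
```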

Two ways to actually write one

There are two honest paths, and they suit different people.

The first is to write the rule yourself, in plain English: spell out exactly what counts as a pass, what counts as a fail, and where the grey area is. Best when you, the engineer, already know the rule cold.

The second is to let a tool propose one. Some tooling reads your failed calls, groups the similar ones together, and drafts a candidate rule with example pass and fail cases pulled from real calls. You then edit and accept it. This is faster when the person reviewing is a domain expert rather than a coder.

Whichever path you take, keep one rule: a human approves the check before it goes live. Do not let anything auto-promote its own scoring into production.

The part that actually makes it work: calibrate against humans

This is the step I underestimated. Your first version of a check will score some calls wrong. Calibration is how you close that gap.

Build a small set of calls and have a domain expert label each one with the "true" score.
Run your check against that set and see where it disagrees with the expert.
Fix the wording or the examples, run it again, and repeat until it mostly agrees.

Plan for a few rounds, not one. And keep a human approving every change, with a record of who approved what. In anything regulated, that paper trail is the thing that makes the whole setup usable at all.
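One round of that loop can be sketched as a simple agreement report: compare the check's verdicts to the expert's labels and list the calls where they differ. The function name, the call IDs, and the labels below are made up for illustration, and the sketch assumes a yes-or-no check.

```python
from typing import Dict, List, Tuple


def agreement_report(
    check_scores: Dict[str, bool],
    expert_labels: Dict[str, bool],
) -> Tuple[float, List[str]]:
    """Return (agreement rate, call IDs where the check disagrees with the expert)."""
    shared = sorted(check_scores.keys() & expert_labels.keys())
    if not shared:
        return 0.0, []
    disagreements = [cid for cid in shared if check_scores[cid] != expert_labels[cid]]
    rate = 1 - len(disagreements) / len(shared)
    return rate, disagreements


# Illustrative data: the check passes call c2, the expert says it should fail.
check_scores = {"c1": True, "c2": True, "c3": False, "c4": True}
expert_labels = {"c1": True, "c2": False, "c3": False, "c4": True}

rate, to_review = agreement_report(check_scores, expert_labels)
print(rate, to_review)  # 0.75 ['c2']
```

The list of disagreeing call IDs is the useful output here: those are the calls to read before rewording the check and running the next round.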

Pair the judge with a hard check whenever you can

A check that leans on a language model to judge will drift, and it quietly lets false passes pile up over time. So wherever a real, verifiable test exists, run it alongside the judge:

a number that has to fall in a valid range,
a format or schema that has to validate,
an exact entity that has to match.

The judge handles the fuzzy part, the hard check handles the part that has a right answer, and together they catch what either one alone would miss.
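A sketch of that pairing for the order-number case, with loud assumptions: the order-ID format (two capital letters plus six digits) is invented, and `lenient_judge` is a stub standing in for a language-model judge that has drifted into passing everything.

```python
import re
from typing import Callable

# Assumed order-ID format for illustration: two capital letters, then six digits.
ORDER_ID = re.compile(r"\b[A-Z]{2}\d{6}\b")


def hard_check(transcript: str, expected_order_id: str) -> bool:
    """Exact-entity test: an order ID was spoken, and every one spoken is the right one."""
    found = ORDER_ID.findall(transcript)
    return bool(found) and all(order_id == expected_order_id for order_id in found)


def combined_check(
    transcript: str,
    expected_order_id: str,
    judge: Callable[[str], bool],
) -> bool:
    """Pass only if both the hard check and the fuzzy judge pass."""
    return hard_check(transcript, expected_order_id) and judge(transcript)


def lenient_judge(transcript: str) -> bool:
    """Stub for an LLM judge that lets a false pass through."""
    return True


transcript = "I have your order as AB123465, is that right?"
print(combined_check(transcript, "AB123456", lenient_judge))  # False
```

Even with a judge that approves everything, the swapped digits fail the call, because the part with a right answer is checked deterministically.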

Re-run new versions on old calls

Every time I tweaked a check, I asked one question: how does the new version score the last month of calls compared to the old version? Re-running a check over your call history gives you that before-and-after diff directly. It is the easiest way to catch a "small" wording change that secretly moved everyone's scores.
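That before-and-after diff can be sketched in a few lines, assuming call history is available as a mapping from call ID to transcript and each check version is a plain function. The two check versions below are toy examples, not real rules.

```python
from typing import Callable, Dict, List, Tuple


def diff_versions(
    calls: Dict[str, str],
    old_check: Callable[[str], bool],
    new_check: Callable[[str], bool],
) -> List[Tuple[str, bool, bool]]:
    """Return (call_id, old_result, new_result) for every call whose verdict changed."""
    flipped = []
    for call_id, transcript in calls.items():
        before, after = old_check(transcript), new_check(transcript)
        if before != after:
            flipped.append((call_id, before, after))
    return flipped


# Toy versions: a "small" wording change from "order" to "order number".
def old_check(transcript: str) -> bool:
    return "order" in transcript.lower()


def new_check(transcript: str) -> bool:
    return "order number" in transcript.lower()


history = {
    "c1": "Your order number is AB123456.",
    "c2": "Your order has shipped.",
    "c3": "Thanks for calling.",
}
flipped = diff_versions(history, old_check, new_check)
print(flipped)  # [('c2', True, False)]
```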

Two tradeoffs I made on purpose

Calibration stays manual. A human signs off on every change to a check. It is slower, but it is the only version of this I actually trusted in a regulated workflow.
I only sample a slice of calls for the heavier scoring, to keep cost down and limit how much call data gets processed. Rare failures still surface, because grouping failed calls together brings the uncommon ones up even when you are not scoring everything.
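One way to pick that slice, sketched under assumptions: hash the call ID so the same call is always in or out of the sample, which keeps re-runs comparable. The function name and the 10% rate are hypothetical. This covers only the sampling, not the grouping of failed calls.

```python
import hashlib


def in_sample(call_id: str, rate: float) -> bool:
    """Deterministically decide whether a call falls in the sampled slice.

    rate is the fraction of calls to score, between 0.0 and 1.0.
    """
    digest = hashlib.sha256(call_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < int(rate * 2**64)


call_ids = [f"call-{i}" for i in range(1000)]
sampled = [cid for cid in call_ids if in_sample(cid, 0.10)]
print(len(sampled))  # roughly 100; the exact count depends on the hash values
```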

The metric that finally caught my order-number problem was never going to be on a default dashboard. I had to write it, and then spend more time calibrating it against a human than writing it in the first place. That ratio surprised me, but it is also the reason the check ended up trustworthy.

If you run voice agents, I would love to hear which domain failure your standard metrics never caught. For me it will always be that confidently wrong order number.

Top comments (1)

Adam Lewis • Jul 1

The order-number miss is the useful part. "Task completion, green" was never really an acceptance check; it graded whether the call ended, not whether it did the one thing the call was for. Once you write the check that knows what an order number is, you've moved the actual requirement out of your head and into something the agent can be graded against. Your calibration point is the same thing from the other side: a check that can't ever disagree with the green dashboard isn't testing anything yet.