Jay
From 66% to 96%: How I Fixed a Drive-Thru Voice Agent Before It Took a Single Real Call

I've been building voice agents for a while. The hardest part isn't the STT or TTS layer.

It's this: how do you test edge cases before you have real users?

The default answer is the vibe-check loop. You call your own agent, order a burger, say "yeah that felt okay," and move on. I did this for longer than I should have.

The Scenario

I built a drive-thru voice agent called "Future Burger." Requirements were simple: take orders fast, stay concise, skip the small talk.

The architecture was brain-first. STT and TTS are just the ears and mouth, interchangeable peripherals. The LLM handles reasoning, context switching, and tool calling.

If the agent can't figure out that "Actually, make that a Sprite" means replacing the previous drink, no amount of voice synthesis polish saves the interaction. So I focused entirely on the intelligence layer.
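That replace-vs-add decision is the core of the intelligence layer. A minimal sketch of the state update the agent has to get right, with hypothetical names (`Cart`, `apply_update`) that aren't from any real SDK:

```python
# Hypothetical sketch of the replace-vs-add decision — illustrative only.
from dataclasses import dataclass, field

DRINKS = {"Coke", "Sprite", "Lemonade"}

@dataclass
class Cart:
    items: list[str] = field(default_factory=list)

    def apply_update(self, item: str, replace: bool) -> None:
        """Add an item, or swap it for the most recent item of the same category."""
        if replace:
            # "Actually, make that a Sprite" should overwrite the prior drink,
            # not append a second one.
            for i in range(len(self.items) - 1, -1, -1):
                if (self.items[i] in DRINKS) == (item in DRINKS):
                    self.items[i] = item
                    return
        self.items.append(item)

cart = Cart()
cart.apply_update("Cheeseburger", replace=False)
cart.apply_update("Coke", replace=False)
cart.apply_update("Sprite", replace=True)   # replaces the Coke
print(cart.items)  # ['Cheeseburger', 'Sprite']
```

If the LLM misclassifies `replace`, the bug shows up as a double-charged drink, no matter how good the voice sounds.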

Step 1: Synthetic Data (Skipping the Cold Start)

Instead of waiting weeks for real call logs, I used FutureAGI's Dataset feature to build a ground-truth set. You define a schema and it produces structured input/output pairs.

I asked for two fields: user_transcript (what the user says) and expected_order (what the agent should actually book).

Prompt used:

"Generate 500 diverse drive-thru interactions. Include complex orders like 'Cheeseburger no pickles', combo meals, and modifications."

In seconds I had 500 labeled pairs ready for evaluation. What surprised me here was how fast this exposed gaps I hadn't even thought to test. Mid-sentence order changes, multilingual switches, impatient customers. Edge cases I always meant to write but never did.
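The record shape below is illustrative, not FutureAGI's actual output format; it just shows the two-field schema I asked for, plus the kind of minimal validity check worth running before a record enters the eval set:

```python
# Illustrative shape of one generated ground-truth pair (field names are
# the schema from the prompt; the nested structure is an assumption).

record = {
    "user_transcript": "Can I get a cheeseburger, no pickles, and a medium Sprite?",
    "expected_order": {
        "items": [
            {"name": "Cheeseburger", "modifiers": ["no pickles"]},
            {"name": "Sprite", "size": "medium"},
        ]
    },
}

def is_valid(rec: dict) -> bool:
    """Minimal schema check before a record enters the eval set."""
    return (
        isinstance(rec.get("user_transcript"), str)
        and isinstance(rec.get("expected_order"), dict)
        and len(rec["expected_order"].get("items", [])) > 0
    )

print(is_valid(record))  # True
```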

Step 2: Baseline Prompt (Workbench + Experiments)

Before touching latency or audio quality, I needed to confirm the logic holds. I drafted the initial system prompt (v0.1) in the Prompt Workbench, saved it as a versioned template, and ran an experiment across those 500 scenarios using three models: gpt-5-nano, Gemini-3-Flash, and gpt-5-mini.

Result: 80% accuracy. Decent. But the responses were wall-of-text paragraphs. Every reply opened with something like:

"Certainly! I have updated your order to include a cheeseburger without pickles and a medium Sprite. Is there anything else I can help you with today?"

Fine for a chatbot. For a voice agent where every word adds latency, it's a failure mode.
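For context, an accuracy number like that 80% reduces to something simple: run each transcript through the agent and exact-match the structured order against the label. A sketch with the agent stubbed out (the real run calls the model):

```python
# Sketch of the accuracy metric: exact-match on the normalized order.
# agent_fn is a stub here; in the real experiment it calls the LLM.

def normalize(order: dict) -> frozenset:
    """Canonical form so item ordering and casing don't affect the match."""
    return frozenset(
        (item["name"].lower(), tuple(sorted(item.get("modifiers", []))))
        for item in order["items"]
    )

def accuracy(dataset, agent_fn) -> float:
    hits = sum(
        normalize(agent_fn(row["user_transcript"])) == normalize(row["expected_order"])
        for row in dataset
    )
    return hits / len(dataset)

# Tiny fake run: a stub "agent" that always orders a plain burger.
data = [
    {"user_transcript": "a burger", "expected_order": {"items": [{"name": "Burger"}]}},
    {"user_transcript": "a sprite", "expected_order": {"items": [{"name": "Sprite"}]}},
]
stub = lambda _: {"items": [{"name": "burger"}]}
print(accuracy(data, stub))  # 0.5
```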

Step 3: Simulation (The Stress Test)

I connected the agent and ran a simulation with layered scenario types: hesitant users, stuttering, mid-order changes, rushed and angry customers.

The results were immediate:

  1. Latency issues. The agent was too wordy. It started every response with "Certainly!" and ran three sentences too long.
  2. Logic breaks. When a user changed their mind, the agent added both items to the cart instead of replacing the first.
  3. Success rate: 66%.

One in three conversations ending in failure is not a quirk to patch later. That's a production blocker.
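The layering above is just a cross product of personas and behaviors. This isn't the platform's real config format, only the dimensions I combined:

```python
# Hypothetical scenario spec for the stress test — the real simulator's
# config format differs; this shows how the layers multiply.
import itertools

personas = ["hesitant", "rushed", "angry"]
behaviors = ["stutters", "changes_order_mid_sentence", "switches_language"]

scenarios = [
    {"persona": p, "behavior": b, "max_turns": 12}
    for p, b in itertools.product(personas, behaviors)
]
print(len(scenarios))  # 9 scenario combinations
```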

Step 4: Automated Optimization

This is the part I found most useful. Instead of manually editing the system prompt and guessing which instruction caused which failure, I let the optimization engine analyze the conversation logs directly.

I defined 10 evaluation criteria specific to this agent, including:

  • Context_Retention
  • Objection_Handling
  • Language_Switching

Because the platform evaluates native audio rather than transcripts alone, it recognized failure patterns across hundreds of simulated conversations and surfaced two actionable fixes:

  • Fix 1 (High Latency): "Reduce decision tree depth for menu inquiries and remove redundant validation steps."
  • Fix 2 (Hallucination): "Restrict generative capabilities to the defined menu_items vector store to prevent inventing dishes."

I selected the failed simulation runs and ran ProTeGi optimization with two objectives:

  • Task_Completion
  • Customer_Interruption_Handling

The system iterated on the system prompt automatically, testing variants like "Be extremely brief" or "If user changes mind, overwrite previous item." It ran each variant against the simulator in a loop until the metrics climbed.

I've done this manually on other projects. It takes hours. Watching it run in a loop was a genuinely different experience.
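The shape of that loop is simple even if the internals aren't. A toy sketch of a ProTeGi-style greedy loop, with `score_fn` and `edit_fn` as stand-ins for the simulator run and the LLM-generated prompt edit:

```python
# Minimal shape of a ProTeGi-style loop: score a prompt, apply a textual
# edit derived from failures, keep the variant if the metric improves.
# score_fn and edit_fn are toy stubs, not the platform's internals.

def optimize(prompt: str, score_fn, edit_fn, rounds: int = 3):
    best, best_score = prompt, score_fn(prompt)
    for _ in range(rounds):
        candidate = edit_fn(best)           # e.g. append "Be extremely brief."
        s = score_fn(candidate)
        if s > best_score:                  # greedy keep-if-better
            best, best_score = candidate, s
    return best, best_score

# Toy stand-ins: the "simulator" rewards brevity instructions.
score_fn = lambda p: 0.66 + 0.1 * p.count("Be extremely brief.")
edit_fn = lambda p: p + " Be extremely brief."

best, s = optimize("Take drive-thru orders.", score_fn, edit_fn)
print(round(s, 2))  # 0.96
```

The real engine evaluates candidates against hundreds of simulated calls per round instead of a one-line scoring function, but the keep-if-better loop is the same idea.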

Step 5: Results

  • Before: Polite, slow, failed to track mid-order changes
  • After: Crisp. "Burger, no pickles. Got it." 96% accuracy on the "Indecisive" scenario

Going from 66% to 96% without writing a single new instruction manually validated the loop: Dataset > Simulate > Evaluate > Optimize.

What I Took From This

The cold start problem for voice agents is real. You can't get quality data without users, and you can't get users without quality behavior. Synthetic simulation breaks that dependency.

The bigger shift for me was realizing that most prompt debugging is just pattern matching on logs. You run the agent, it fails, you guess why, you edit, you repeat. That process is automatable. The hard part is setting up the right evaluation criteria upfront.

If you're still in the vibe-check phase and want to see what the full evaluation infrastructure looks like, the architecture walkthrough is here.


Curious what evaluation criteria others track for voice agents in production. Context retention and objection handling were obvious for this use case, but I'd like to know what else people actually measure.

