I work on the AI layer at BrainPack - agents that run against real enterprise systems, not clean demo sandboxes.
One failure mode we hit with production voice agents was not "the model gave a bad answer."
It was quieter than that.
The agent needed to call a business tool, wait on a database-backed workflow, and then come back with a useful spoken answer. In a demo, that looks fine. In production, the user hears silence.
Silence is a failure mode.
For text chat, a 5-second delay is usually tolerable if the UI shows a loading state. For voice, even 2 or 3 seconds of unexplained silence makes the call feel broken. If the tool call takes longer, users start repeating themselves, interrupting, or assuming the call dropped.
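To make that concrete, here is a minimal sketch of modality-specific silence budgets. The names (`Modality`, `SILENCE_BUDGET_S`, `needs_status_update`) are hypothetical; the thresholds are the ones above.

```python
from enum import Enum

class Modality(Enum):
    TEXT = "text"
    VOICE = "voice"

# Seconds of unexplained waiting a user tolerates before the
# experience starts to feel broken.
SILENCE_BUDGET_S = {
    Modality.TEXT: 5.0,   # tolerable with a visible loading state
    Modality.VOICE: 2.0,  # beyond this, speak a filler or status update
}

def needs_status_update(modality: Modality, elapsed_s: float) -> bool:
    """Return True once a pending operation has outlived its silence budget."""
    return elapsed_s > SILENCE_BUDGET_S[modality]
```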
In our LiveKit voice stack, the production configuration had several moving parts:
- STT provider and model
- LLM provider and model
- TTS provider and model
- Turn detection
- Voice activity detection
- Tool execution
- Transcript storage
- User information capture
- Optional Talk-to-DB access
- Call recording metadata
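Collapsed into one explicit config object, that stack looks roughly like the sketch below. This is not the actual BrainPack or LiveKit schema; every field name here is illustrative.

```python
from dataclasses import dataclass

@dataclass
class VoiceAgentConfig:
    # Speech-to-text, LLM, and text-to-speech providers and models
    stt_provider: str
    stt_model: str
    llm_provider: str
    llm_model: str
    tts_provider: str
    tts_model: str
    # Conversation mechanics
    turn_detection: str        # e.g. a server-side turn detector
    vad_enabled: bool          # voice activity detection
    # Capabilities and persistence
    tools_enabled: bool
    store_transcripts: bool
    capture_user_info: bool
    talk_to_db_enabled: bool   # optional Talk-to-DB access
    record_call_metadata: bool
```

The point of writing it down like this: every one of these fields is a place where production can diverge from the demo.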
The first version treated tool latency mostly as a backend problem.
That was wrong.
Tool latency is also a conversation design problem.
What Changed
1. We added mandatory pre-tool speech
Before a tool runs, the agent now gives a short spoken update like:
"Let me check that for you."
Not a paragraph. Not fake confidence. Just a small signal that the call is alive.
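A minimal sketch of the pattern, assuming an async `say` function like the one a voice SDK typically exposes; the wrapper and parameter names are hypothetical. The key move is starting the speech and the tool together, so the spoken update covers the tool's latency instead of adding to it.

```python
import asyncio
from typing import Any, Awaitable, Callable

async def call_tool_with_preamble(
    say: Callable[[str], Awaitable[None]],
    run_tool: Callable[[], Awaitable[Any]],
    preamble: str = "Let me check that for you.",
) -> Any:
    # Speak and start the tool concurrently, so the user never
    # hears dead air while the backend works.
    speech = asyncio.create_task(say(preamble))
    result = await run_tool()
    await speech
    return result
```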
2. We separated normal conversation from data-backed questions
The assistant should not call Talk-to-DB for greetings, policy questions, or small talk. It should only route to data when the user is clearly asking about records, reports, counts, trends, metrics, filters, comparisons, or other database-backed facts.
This reduced unnecessary tool calls.
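In practice this gate lives in the prompt and routing logic; as a simplified stand-in, here is what the decision boils down to. The marker list is illustrative, not our production set.

```python
DATA_INTENT_MARKERS = (
    "record", "report", "count", "trend", "metric",
    "filter", "compare", "how many", "last month",
)

def should_route_to_db(utterance: str) -> bool:
    """Only route to Talk-to-DB when the user is clearly asking for data."""
    text = utterance.lower()
    return any(marker in text for marker in DATA_INTENT_MARKERS)

# Greetings and small talk stay in the normal conversational path:
assert not should_route_to_db("Hi, how are you today?")
assert should_route_to_db("How many open tickets do we have this week?")
```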
3. We added a voice-specific path for oversized answers
Some database answers are too large to read aloud. The system now detects when a Talk-to-DB response is too large and can move the full answer into a PDF email flow instead of forcing a bad voice experience.
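A sketch of that decision, with a hypothetical length threshold and hypothetical `speak` / `send_pdf_report` helpers standing in for the real delivery paths:

```python
MAX_SPOKEN_CHARS = 600  # roughly what stays tolerable when read aloud

def deliver_db_answer(answer: str, speak, send_pdf_report, user_email: str) -> str:
    if len(answer) <= MAX_SPOKEN_CHARS:
        speak(answer)
        return "spoken"
    # Too long to read aloud: ship the full answer as a PDF and say so.
    send_pdf_report(answer, to=user_email)
    speak("That answer is long, so I've emailed you the full report as a PDF.")
    return "emailed"
```

The spoken fallback line matters as much as the routing: the user still gets an immediate, honest answer instead of a five-minute readout.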
4. We made the waiting state audible
The LiveKit agent config includes ambient and busy audio settings:
- Ambient volume: 0.5
- Busy volume: 1.0
- Keyboard busy probability: 0.8
- Mouse busy probability: 0.2
These numbers are not magic. They are just explicit controls so the waiting state can be tuned and measured instead of left to vibes.
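Expressed as a config object (the field names are illustrative; the values are the ones above):

```python
from dataclasses import dataclass

@dataclass
class BusyAudioConfig:
    ambient_volume: float = 0.5             # low background bed, always on
    busy_volume: float = 1.0                # foreground "working" sounds
    keyboard_busy_probability: float = 0.8  # chance of typing sounds while busy
    mouse_busy_probability: float = 0.2     # chance of mouse clicks while busy
```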
5. We treated the assistant as a maintained system
The agent has a tracked worker state:
idle, building, starting, running, stopping, stopped, unhealthy, error
That matters because voice agents fail in operational ways too. Containers stop. Providers change behavior. Prompt instructions decay as new tools are added. Turn detection that works in one environment can behave badly in another.
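A sketch of that lifecycle as an explicit state machine, so illegal transitions fail loudly instead of silently. The transition table is an assumption inferred from the state names, not our exact implementation.

```python
from enum import Enum

class WorkerState(Enum):
    IDLE = "idle"
    BUILDING = "building"
    STARTING = "starting"
    RUNNING = "running"
    STOPPING = "stopping"
    STOPPED = "stopped"
    UNHEALTHY = "unhealthy"
    ERROR = "error"

# Which states each state may legally move to.
ALLOWED = {
    WorkerState.IDLE: {WorkerState.BUILDING},
    WorkerState.BUILDING: {WorkerState.STARTING, WorkerState.ERROR},
    WorkerState.STARTING: {WorkerState.RUNNING, WorkerState.ERROR},
    WorkerState.RUNNING: {WorkerState.STOPPING, WorkerState.UNHEALTHY, WorkerState.ERROR},
    WorkerState.UNHEALTHY: {WorkerState.RUNNING, WorkerState.STOPPING, WorkerState.ERROR},
    WorkerState.STOPPING: {WorkerState.STOPPED, WorkerState.ERROR},
    WorkerState.STOPPED: {WorkerState.BUILDING},
    WorkerState.ERROR: {WorkerState.BUILDING},
}

def transition(current: WorkerState, target: WorkerState) -> WorkerState:
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```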
The Lesson
Production voice AI is not just model selection.
The model is one part of a longer chain:
speech in -> transcript -> intent -> tool decision -> external system -> response planning -> speech out
Every link can fail.
At BrainPack, fully managed AI means we keep watching those links after launch. We monitor transcripts, tool behavior, silence points, model drift, prompt behavior, and worker health. Then we re-prompt, re-evaluate, and adjust the system when production exposes something the demo did not.
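"Silence points," for example, can be flagged straight from transcript timestamps. This sketch assumes a simple turn-record shape, and the threshold is illustrative:

```python
SILENCE_ALERT_S = 2.5

def find_silence_points(turns: list[dict], threshold_s: float = SILENCE_ALERT_S):
    """turns: [{'speaker': 'user'|'agent', 'start': float, 'end': float}, ...]
    sorted by start time. Returns (gap_seconds, turn_index) pairs worth reviewing."""
    flagged = []
    for i in range(len(turns) - 1):
        cur, nxt = turns[i], turns[i + 1]
        # Only user-then-agent gaps count: that is where silence feels broken.
        if cur["speaker"] == "user" and nxt["speaker"] == "agent":
            gap = nxt["start"] - cur["end"]
            if gap > threshold_s:
                flagged.append((gap, i))
    return flagged
```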
Most voice agent failures do not look dramatic in logs.
Sometimes the real bug is a user waiting in silence.
That is still a bug.