I work on the AI layer at BrainPack - agents that run against real enterprise systems, not clean demo sandboxes.
One failure mode we hit with production voice agents was not "the model gave a bad answer."
It was quieter than that.
The agent needed to call a business tool, wait on a database-backed workflow, and then come back with a useful spoken answer. In a demo, that looks fine. In production, the user hears silence.
Silence is a failure mode.
For text chat, a 5-second delay is usually tolerable if the UI shows a loading state. For voice, even 2 or 3 seconds of unexplained silence makes the call feel broken. If the tool call takes longer, users start repeating themselves, interrupting, or assuming the call dropped.
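To make that concrete, here is a minimal sketch of modality-specific silence budgets. The names (`Modality`, `SILENCE_BUDGET_S`, `needs_status_update`) are hypothetical; the thresholds are the ones above.

```python
from enum import Enum

class Modality(Enum):
    TEXT = "text"
    VOICE = "voice"

# Seconds of unexplained waiting a user tolerates before the
# experience starts to feel broken.
SILENCE_BUDGET_S = {
    Modality.TEXT: 5.0,   # tolerable with a visible loading state
    Modality.VOICE: 2.0,  # beyond this, speak a filler or status update
}

def needs_status_update(modality: Modality, elapsed_s: float) -> bool:
    """Return True once a pending operation has outlived its silence budget."""
    return elapsed_s > SILENCE_BUDGET_S[modality]
```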
In our LiveKit voice stack, the production configuration had several moving parts:
- STT provider and model
- LLM provider and model
- TTS provider and model
- Turn detection
- Voice activity detection
- Tool execution
- Transcript storage
- User information capture
- Optional Talk-to-DB access
- Call recording metadata
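Collapsed into one explicit config object, that stack looks roughly like the sketch below. This is not the actual BrainPack or LiveKit schema; every field name here is illustrative.

```python
from dataclasses import dataclass

@dataclass
class VoiceAgentConfig:
    # Speech-to-text, LLM, and text-to-speech providers and models
    stt_provider: str
    stt_model: str
    llm_provider: str
    llm_model: str
    tts_provider: str
    tts_model: str
    # Conversation mechanics
    turn_detection: str        # e.g. a server-side turn detector
    vad_enabled: bool          # voice activity detection
    # Capabilities and persistence
    tools_enabled: bool
    store_transcripts: bool
    capture_user_info: bool
    talk_to_db_enabled: bool   # optional Talk-to-DB access
    record_call_metadata: bool
```

The point of writing it down like this: every one of these fields is a place where production can diverge from the demo.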
The first version treated tool latency mostly as a backend problem.
That was wrong.
Tool latency is also a conversation design problem.
What Changed
1. We added mandatory pre-tool speech
Before a tool runs, the agent now gives a short spoken update like:
"Let me check that for you."
Not a paragraph. Not fake confidence. Just a small signal that the call is alive.
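A minimal sketch of the pattern, assuming an async `say` function like the one a voice SDK typically exposes; the wrapper and parameter names are hypothetical. The key move is starting the speech and the tool together, so the spoken update covers the tool's latency instead of adding to it.

```python
import asyncio
from typing import Any, Awaitable, Callable

async def call_tool_with_preamble(
    say: Callable[[str], Awaitable[None]],
    run_tool: Callable[[], Awaitable[Any]],
    preamble: str = "Let me check that for you.",
) -> Any:
    # Speak and start the tool concurrently, so the user never
    # hears dead air while the backend works.
    speech = asyncio.create_task(say(preamble))
    result = await run_tool()
    await speech
    return result
```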
2. We separated normal conversation from data-backed questions
The assistant should not call Talk-to-DB for greetings, policy questions, or small talk. It should only route to data when the user is clearly asking about records, reports, counts, trends, metrics, filters, comparisons, or other database-backed facts.
This reduced unnecessary tool calls.
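In practice this gate lives in the prompt and routing logic; as a simplified stand-in, here is what the decision boils down to. The marker list is illustrative, not our production set.

```python
DATA_INTENT_MARKERS = (
    "record", "report", "count", "trend", "metric",
    "filter", "compare", "how many", "last month",
)

def should_route_to_db(utterance: str) -> bool:
    """Only route to Talk-to-DB when the user is clearly asking for data."""
    text = utterance.lower()
    return any(marker in text for marker in DATA_INTENT_MARKERS)

# Greetings and small talk stay in the normal conversational path:
assert not should_route_to_db("Hi, how are you today?")
assert should_route_to_db("How many open tickets do we have this week?")
```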
3. We added a voice-specific path for oversized answers
Some database answers are too large to read aloud. The system now detects when a Talk-to-DB response is too large and can move the full answer into a PDF email flow instead of forcing a bad voice experience.
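A sketch of that decision, with a hypothetical length threshold and hypothetical `speak` / `send_pdf_report` helpers standing in for the real delivery paths:

```python
MAX_SPOKEN_CHARS = 600  # roughly what stays tolerable when read aloud

def deliver_db_answer(answer: str, speak, send_pdf_report, user_email: str) -> str:
    if len(answer) <= MAX_SPOKEN_CHARS:
        speak(answer)
        return "spoken"
    # Too long to read aloud: ship the full answer as a PDF and say so.
    send_pdf_report(answer, to=user_email)
    speak("That answer is long, so I've emailed you the full report as a PDF.")
    return "emailed"
```

The spoken fallback line matters as much as the routing: the user still gets an immediate, honest answer instead of a five-minute readout.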
4. We made the waiting state audible
The LiveKit agent config includes ambient and busy audio settings:
- Ambient volume: 0.5
- Busy volume: 1.0
- Keyboard busy probability: 0.8
- Mouse busy probability: 0.2
These numbers are not magic. They are just explicit controls so the waiting state can be tuned and measured instead of left to vibes.
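Expressed as a config object (the field names are illustrative; the values are the ones above):

```python
from dataclasses import dataclass

@dataclass
class BusyAudioConfig:
    ambient_volume: float = 0.5             # low background bed, always on
    busy_volume: float = 1.0                # foreground "working" sounds
    keyboard_busy_probability: float = 0.8  # chance of typing sounds while busy
    mouse_busy_probability: float = 0.2     # chance of mouse clicks while busy
```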
5. We treated the assistant as a maintained system
The agent has a tracked worker state:
idle, building, starting, running, stopping, stopped, unhealthy, error
That matters because voice agents fail in operational ways too. Containers stop. Providers change behavior. Prompt instructions decay as new tools are added. Turn detection that works in one environment can behave badly in another.
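A sketch of that lifecycle as an explicit state machine, so illegal transitions fail loudly instead of silently. The transition table is an assumption inferred from the state names, not our exact implementation.

```python
from enum import Enum

class WorkerState(Enum):
    IDLE = "idle"
    BUILDING = "building"
    STARTING = "starting"
    RUNNING = "running"
    STOPPING = "stopping"
    STOPPED = "stopped"
    UNHEALTHY = "unhealthy"
    ERROR = "error"

# Which states each state may legally move to.
ALLOWED = {
    WorkerState.IDLE: {WorkerState.BUILDING},
    WorkerState.BUILDING: {WorkerState.STARTING, WorkerState.ERROR},
    WorkerState.STARTING: {WorkerState.RUNNING, WorkerState.ERROR},
    WorkerState.RUNNING: {WorkerState.STOPPING, WorkerState.UNHEALTHY, WorkerState.ERROR},
    WorkerState.UNHEALTHY: {WorkerState.RUNNING, WorkerState.STOPPING, WorkerState.ERROR},
    WorkerState.STOPPING: {WorkerState.STOPPED, WorkerState.ERROR},
    WorkerState.STOPPED: {WorkerState.BUILDING},
    WorkerState.ERROR: {WorkerState.BUILDING},
}

def transition(current: WorkerState, target: WorkerState) -> WorkerState:
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```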
The Lesson
Production voice AI is not just model selection.
The model is one part of a longer chain:
speech in -> transcript -> intent -> tool decision -> external system -> response planning -> speech out
Every link can fail.
At BrainPack, fully managed AI means we keep watching those links after launch. We monitor transcripts, tool behavior, silence points, model drift, prompt behavior, and worker health. Then we re-prompt, re-evaluate, and adjust the system when production exposes something the demo did not.
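"Silence points," for example, can be flagged straight from transcript timestamps. This sketch assumes a simple turn-record shape, and the threshold is illustrative:

```python
SILENCE_ALERT_S = 2.5

def find_silence_points(turns: list[dict], threshold_s: float = SILENCE_ALERT_S):
    """turns: [{'speaker': 'user'|'agent', 'start': float, 'end': float}, ...]
    sorted by start time. Returns (gap_seconds, turn_index) pairs worth reviewing."""
    flagged = []
    for i in range(len(turns) - 1):
        cur, nxt = turns[i], turns[i + 1]
        # Only user-then-agent gaps count: that is where silence feels broken.
        if cur["speaker"] == "user" and nxt["speaker"] == "agent":
            gap = nxt["start"] - cur["end"]
            if gap > threshold_s:
                flagged.append((gap, i))
    return flagged
```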
Most voice agent failures do not look dramatic in logs.
Sometimes the real bug is a user waiting in silence.
That is still a bug.