The cleanest illustration of why this matters comes from a small, ordinary failure on a small, ordinary outbound campaign that I've been reading about: roughly one day, a few hundred cold-call attempts, and about $100 of telephony plus STT plus TTS plus model spend, evaporated by a voice agent that occasionally found itself dialing into someone else's voicemail or IVR or, the most expensive case, another voice agent. The exchange the operator screenshotted is the kind of thing that reads as a comedy bit until you remember it's billing the whole time:
— Hello.
— Hello, how can I help you?
— I'm calling because…
— Hello, how can I help you?
— Sure, could you tell me…
In a chat window this would be a funny screenshot. On a phone the meter is running on both endpoints at once (telephony, STT, TTS and model tokens), with each side confidently polite and neither programmed to recognise the other as a peer and hang up. The lesson the operator pulled from this, and the one I want to walk through, is the larger one: a voice agent is not a chatbot with a phone number. It's a realtime system, and almost every "voice agent failure in production" I've now read about reduces to chat-architecture assumptions being applied to a medium that doesn't tolerate them.
Let me unpack what specifically doesn't translate.
Latency in chat and latency on a call are different objects
In a chat the unit of cost is "time until the model starts streaming a reply." A two-second pause is fine. The user is reading the previous turn or sipping coffee or alt-tabbed away. In a phone call the unit of cost is time of silence on an open audio channel, and that has a perceptual budget set by human conversational physiology, not by your latency dashboards.
The specific budget is well-studied. Levinson and Torreira's 2015 paper "Timing in turn-taking and its implications for processing models of language", drawing on a corpus across ten languages, reports that the typical gap between turns in natural conversation is around 200 milliseconds, with modal values clustering in the 100–300ms range, and that overlap is more common than long pauses. The authors note the cognitive trick that makes this possible: speakers begin planning their response before the previous turn ends. Two hundred milliseconds is an interaction signature, not a latency target you choose.
Once you exceed that, perceptual breakdowns happen on a sliding scale. The voice-AI industry — see, e.g., AssemblyAI's "300ms rule" writeup — converges on a perceptual gradient: by 300–400ms the listener is starting to notice the silence; by 500ms they're starting to assume something is wrong with their own line; sub-500ms is the working threshold below which an agent feels live. Retell AI, one of the larger commercial platforms, claims about 600ms end-to-end and frames that as competitive. It is competitive, and that also tells you the ceiling: even the leading systems are sitting just above the perceptual breakdown line, not below it.
Now look at what the chat-style architecture has to fit inside that budget on every turn:
- streaming STT to recognise the user's speech;
- LLM call (with potentially several tool calls — CRM lookup, calendar check, database query);
- response generation;
- streaming TTS, with first-byte audio out the door before the rest is ready.
In a chat you can spend several seconds on this and the user waits. On a call you cannot, because the other party is not waiting; they are filling the silence with "hello?" and starting to repeat themselves and asking if you're still there. The streaming transcript captures all of it. The model now has to respond to a turn that is partly the original question and partly the interruption-and-repeat, and the conversation begins to liquefy.
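To make the budget concrete, here is a minimal sketch of the per-turn arithmetic. Every number in it is a hypothetical placeholder rather than a measurement from the operator or from any platform, and the stage names are my own; the point is only that the stages add up sequentially before the caller hears anything, so a single tool call can blow the roughly 500ms threshold discussed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageLatency:
    name: str
    ms: int  # time this stage adds before the caller hears any audio


def time_to_first_audio(stages: list[StageLatency]) -> int:
    """Sum the sequential stage latencies for one conversational turn."""
    return sum(stage.ms for stage in stages)


def over_budget(stages: list[StageLatency], budget_ms: int) -> int:
    """Milliseconds by which the turn exceeds the budget (0 if it fits)."""
    return max(0, time_to_first_audio(stages) - budget_ms)


# Hypothetical figures, chosen only to illustrate the arithmetic.
BUDGET_MS = 500
turn = [
    StageLatency("stt_endpointing", 150),
    StageLatency("llm_first_token", 250),
    StageLatency("crm_lookup_tool_call", 300),
    StageLatency("tts_first_byte", 120),
]

total = time_to_first_audio(turn)         # 150 + 250 + 300 + 120 = 820
excess = over_budget(turn, BUDGET_MS)     # 820 - 500 = 320
print(f"{total} ms to first audio, {excess} ms over budget")
```

With these assumed numbers the turn is 320ms over budget, and removing the tool call still leaves it 20ms over, which is one reason tool access is worth restricting per stage.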
The big-prompt problem doesn't translate
The chat reflex when an agent isn't reasoning well is to make the prompt longer. Add more rules. Add more examples. Add more tools. A long context-rich system prompt is the standard chat-deployment pattern.
In voice, this fails for a separate reason, distinct from the context-rot problem Anthropic and the Lost-in-the-Middle line of research have written about elsewhere (although that problem is also present). The voice-specific failure is goal drift mid-call. The operator I'm retelling here used a low-latency Gemini Flash–class model for one project (Google documents a separate Live Preview line for native realtime audio; it's not clear from the source whether the operator was on Live Preview or on the standard Flash variant adapted to a voice pipeline). What the operator observed was that the model could keep up with the latency budget but, given a long playbook stuffed into one prompt, would lose track within a few turns of which stage of the call it was in: had it asked about budget yet, was it still confirming identity, was it allowed to close. The model wasn't slow; it was disoriented. A fast model with a long prompt is not the same as a fast, focused model.
The substitution that works isn't a smarter model. It's an explicit graph.
Calls are graphs, not soup
A voice agent that holds up in production does not look like a single "be helpful and talk to the customer" prompt. It looks like a set of named stages with explicit transitions, each stage carrying a short instruction, restricted tool access, and explicit fallbacks. The platforms that ship voice agents (Retell's flow editor, ElevenLabs's Conversational AI workflow editor) make this graph structure visible, because that's what works:
[Greeting]
│
▼
[Identity check] ── wrong person ──▶ [Apologise] ─▶ [End call]
│
▼ identity confirmed
[Consent] ── not given ──▶ [Apologise] ─▶ [End call]
│
▼ consent given
[Question 1] ─▶ [Question 2] ─▶ [Question 3]
│
▼
[Closing]
│
▼
[End call]
Fallbacks (any state):
voicemail detected ─▶ [Leave message] ─▶ [End call]
human IVR ─▶ [Press digit / wait for transfer]
technical issue ─▶ [Apologise + "we'll call back"] ─▶ [End call]
another bot detected ─▶ [End call]
budget cap reached ─▶ [End call] (hard limit; not a prompt instruction)
This is dull engineering. It is also the engineering that turns "the agent sometimes gets confused" into "the agent's behaviour is auditable and the failure modes are named." Each stage has a budget — both in tokens and in real seconds — and each transition is explicit. No stage's instruction is "use your judgement"; if a stage needs judgement, that's a sign it should be split into two stages.
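As a rough sketch of what that graph could look like in code, the snippet below encodes the stages and fallbacks from the diagram as plain data and refuses any transition that isn't declared. The stage names, event names and the `crm_update` tool are all hypothetical; this is not how Retell or ElevenLabs represent flows internally. The IVR branch is left out for brevity and the "we'll call back" apology is folded into the ordinary apology stage.

```python
from dataclasses import dataclass

END = "end_call"


@dataclass(frozen=True)
class Stage:
    instruction: str                     # short, stage-specific prompt
    transitions: dict[str, str]          # event name -> next stage
    tools: frozenset[str] = frozenset()  # most stages get no tools at all


# Hypothetical stage and event names mirroring the diagram above.
GRAPH: dict[str, Stage] = {
    "greeting": Stage("Greet and say who is calling.", {"answered": "identity_check"}),
    "identity_check": Stage(
        "Confirm you are speaking to the named contact.",
        {"confirmed": "consent", "wrong_person": "apologise"},
    ),
    "consent": Stage(
        "Ask permission to continue.",
        {"given": "question_1", "not_given": "apologise"},
    ),
    "question_1": Stage("Ask question 1.", {"answered": "question_2"}),
    "question_2": Stage("Ask question 2.", {"answered": "question_3"}),
    "question_3": Stage(
        "Ask question 3.", {"answered": "closing"}, frozenset({"crm_update"})
    ),
    "closing": Stage("Thank the caller and close.", {"done": END}),
    "apologise": Stage("Apologise briefly.", {"done": END}),
    "leave_message": Stage("Leave the scripted voicemail.", {"done": END}),
}

# Fallbacks apply from any stage and are checked before stage transitions.
FALLBACKS: dict[str, str] = {
    "voicemail_detected": "leave_message",
    "technical_issue": "apologise",
    "other_bot_detected": END,
    "budget_cap_reached": END,
}


def next_stage(current: str, event: str) -> str:
    """Return the next stage, or raise if the transition is not declared."""
    if event in FALLBACKS:
        return FALLBACKS[event]
    transitions = GRAPH[current].transitions
    if event not in transitions:
        raise ValueError(f"illegal transition: {current!r} on {event!r}")
    return transitions[event]


def check_graph() -> None:
    """Fail fast if any transition points at a stage that does not exist."""
    targets = set(FALLBACKS.values())
    for stage in GRAPH.values():
        targets.update(stage.transitions.values())
    missing = targets - set(GRAPH) - {END}
    if missing:
        raise ValueError(f"undefined stages: {sorted(missing)}")


check_graph()
```

The design choice worth noticing is that an undeclared event raises instead of falling through to the model's judgement, which is what makes the failure modes nameable.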
What works in voice (and what doesn't)
The same operator's piece is candid about which categories of voice deployment they made work and which they couldn't. The patterns are clean enough to tabulate; what's interesting is why the column splits look the way they do.
| Category | What changes vs. chat | Result |
|---|---|---|
| Inbound lead qualification (small fixed questionnaire) | Closed-world flow; user has consented to the call by submitting the form; small graph with a clear success criterion | Worked. ~40 hours/week saved on a four-rep team. |
| Webinar attendance reminders (call N minutes before start) | Single objective, single FAQ branch ("who are you / what's the webinar about"), short call | Worked. Attendance lifted from ~10% to ~30%. |
| Cold outbound (open-world dial) | Voicemail, IVR, gatekeepers, other bots, "send us an email instead," "I don't make those decisions," "who gave you my number" — each needs explicit behaviour | Did not work. $100/day burning on indeterminate paths. |
The pattern is structural, not coincidental. Inbound and reminders have a closed world: you control the flow because you also control the entry point. The user dialled or opted in; they're inside your graph from second one. Cold outbound has the opposite property: the world dials back, and the world contains things your graph does not. The right default for cold outbound is therefore not a smarter agent or a better prompt; it's a more aggressive exit policy — every recognisable open-world input maps to a transition that ends the call without burning cycles.
The hidden cost is that every one of those open-world inputs has to be recognised before you can transition on it. Recognising "this is voicemail and not a person" is itself a hard signal-processing task, and getting it wrong on either side is expensive: false positives end calls with real prospects, false negatives leave the agent monologuing to a beep for the maximum call duration the platform allows. (And if the platform has no maximum, which is somebody's first oversight, the bill is the limit.)
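Voicemail detection proper is an audio problem and I won't pretend to sketch it here. The bot-versus-bot loop from the opening exchange, though, leaves a much simpler trace in the transcript: the far end repeats the same sentence. Below is a minimal heuristic, assuming you have the far-end transcript per turn; the class name, window size and thresholds are my own illustrative choices and would need tuning against real calls.

```python
import re
from collections import deque


def normalise(utterance: str) -> str:
    """Lower-case and strip punctuation so trivial variations compare equal."""
    return re.sub(r"[^a-z0-9 ]", "", utterance.lower()).strip()


class RepeatDetector:
    """Flags a far end that keeps repeating the same line (a likely bot or IVR)."""

    def __init__(self, window: int = 4, max_repeats: int = 2, min_words: int = 3):
        self.recent: deque[str] = deque(maxlen=window)
        self.max_repeats = max_repeats
        self.min_words = min_words  # ignore short lines: humans repeat "hello?" too

    def observe(self, utterance: str) -> bool:
        """Record one far-end turn; return True if it looks like a loop."""
        text = normalise(utterance)
        if len(text.split()) < self.min_words:
            return False
        self.recent.append(text)
        return self.recent.count(text) >= self.max_repeats


detector = RepeatDetector()
far_end_turns = ["Hello.", "Hello, how can I help you?", "Hello, how can I help you?"]
flags = [detector.observe(turn) for turn in far_end_turns]
print(flags)  # [False, False, True]
```

The `min_words` guard is there because of the false-positive cost described above: a real person saying "hello?" twice should not get hung up on.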
Why managed voice platforms are not just "Twilio with a wrapper"
You can build all of this directly on Twilio's media streams and your own STT/TTS/LLM pipes. The case for using a managed voice-agent platform (Retell, ElevenLabs, or one of several others that have appeared in the last 18 months) isn't that these platforms are hard to imitate. It's that the things they ship under the hood are exactly the things that make the difference between a demo and a production deployment, and you tend to realise that only after rediscovering each one the hard way:
- Interruption handling. When the user talks over the agent, the TTS has to actually stop, the STT has to absorb the new turn, and the agent state has to update. "The TTS stops mid-syllable" is not a free behaviour; it's the result of a tightly integrated audio pipeline.
- Streaming STT/TTS coordination with first-byte targets. Generating a full response and then sending it to TTS is fatal for latency. Streaming the text as it's generated, and beginning TTS on the first sentence, is just as fatal if it ships untested. There is no architecture-on-paper that gets this right; it has to be tuned.
- Regression tests for prompts and tool calls. When you change the wording in the consent stage, you want to know that the budget-question stage didn't silently start failing. The platforms ship saved-conversation regression tests precisely because hand-written voice tests are unreasonably hard to maintain.
- Hard limits on call duration and spend. Not a prompt — a limit. If the agent enters an infinite politeness loop with another bot, the call has to end because the limit said so, not because the agent reasoned its way out.
- Post-call extraction. A consistent set of fields pulled from the transcript at end-of-call, rather than asked of the model live.
What the platforms are actually selling is the boring stuff that turns out to be load-bearing. It is much cheaper to buy this than to discover what each piece is for and rebuild it badly.
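To show why interruption handling is not free, here is a toy sketch of barge-in using only Python's `asyncio`. It assumes playback is a cancellable task and that some upstream component (voice-activity detection, say) sets an event when the caller starts talking; `speak` and `run_turn` are hypothetical names, and the sleep stands in for real audio playback.

```python
import asyncio


async def speak(chunks: list[str], played: list[str]) -> None:
    """Stand-in for streaming TTS playback, one audio chunk at a time."""
    for chunk in chunks:
        await asyncio.sleep(0.05)  # pretend each chunk takes 50 ms to play
        played.append(chunk)


async def run_turn(chunks: list[str], user_spoke: asyncio.Event) -> tuple[list[str], bool]:
    """Play the reply, but stop as soon as the caller starts talking."""
    played: list[str] = []
    tts = asyncio.create_task(speak(chunks, played))
    barge_in = asyncio.create_task(user_spoke.wait())

    # Whichever finishes first wins: playback completing, or the caller speaking.
    await asyncio.wait({tts, barge_in}, return_when=asyncio.FIRST_COMPLETED)
    interrupted = not tts.done()

    # Cancel whatever is still pending and wait for the cancellation to land.
    for task in (tts, barge_in):
        task.cancel()
    await asyncio.gather(tts, barge_in, return_exceptions=True)
    return played, interrupted


async def demo() -> tuple[list[str], bool]:
    user_spoke = asyncio.Event()
    # Simulate the caller talking over the agent 120 ms into the reply.
    asyncio.get_running_loop().call_later(0.12, user_spoke.set)
    return await run_turn(["Sure,", "I", "can", "help", "with", "that."], user_spoke)


played, interrupted = asyncio.run(demo())
print(f"interrupted={interrupted}, chunks played before stopping: {len(played)}")
```

What the toy leaves out is the hard part: the agent state also has to record which chunks the caller actually heard, so the next turn doesn't assume the full reply was delivered.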
The pre-launch checklist
If I were starting on a voice agent today, these are the questions I'd want answered, in this order, before I picked a model:
- What are the named stages of a typical successful call?
- What's the one objective of each stage?
- What inputs does each stage expect, and what data does it have access to?
- Which tools are valid in which stages? (Most stages should have zero.)
- What are the legal transitions? Which transitions are explicitly forbidden?
- What counts as success? What counts as a dead end?
- When is the agent required to end the call?
- How is voicemail recognised? IVR? Another bot?
- What's the latency budget for each stage, and how do we know we hit it?
- Which conversations do we save as regression tests?
- What's the per-call spend cap that auto-terminates? (This is not optional.)
- What's the per-day spend cap on the campaign?
The last two are not jokes. The classic voice-agent incident is the cost-disaster one — not because the agent did something dramatic, but because nobody set the limit.
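A minimal sketch of what "a limit, not a prompt" might mean in practice: a meter that lives outside the model, is ticked by the call loop, and returns a termination reason the moment any cap is hit. The class names, the blended per-second rate and the cap values are all assumptions for illustration; real billing is per-provider and less tidy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CallLimits:
    max_seconds: float   # hard cap on a single call's duration
    max_call_usd: float  # hard cap on a single call's spend
    max_day_usd: float   # hard cap on the whole campaign for the day


@dataclass
class CallMeter:
    limits: CallLimits
    usd_per_second: float        # assumed blended telephony + STT + TTS + model rate
    day_spend_usd: float = 0.0   # campaign spend so far today, across calls
    elapsed_s: float = 0.0

    def tick(self, seconds: float) -> Optional[str]:
        """Advance the meter; return a reason if the call must end now."""
        self.elapsed_s += seconds
        self.day_spend_usd += seconds * self.usd_per_second
        call_spend = self.elapsed_s * self.usd_per_second
        if self.elapsed_s >= self.limits.max_seconds:
            return "max_duration"
        if call_spend >= self.limits.max_call_usd:
            return "call_spend_cap"
        if self.day_spend_usd >= self.limits.max_day_usd:
            return "day_spend_cap"
        return None


# Hypothetical numbers: a 3-minute cap, $0.50 per call, $100 per day.
limits = CallLimits(max_seconds=180, max_call_usd=0.50, max_day_usd=100.0)
meter = CallMeter(limits, usd_per_second=0.002)

reasons = [meter.tick(60) for _ in range(3)]
print(reasons)  # [None, None, 'max_duration']
```

The call loop hangs up whenever `tick` returns anything other than `None`, regardless of what the model thinks the conversation needs.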
What I'm taking from this
The framing I keep coming back to is that voice agents are not the natural successor to chatbots. They're a different class of system that happens to share an LLM. The chat lineage tells you to hand the model a long prompt, give it broad tool access, and let it figure out the conversation; the voice medium punishes every one of those choices. The systems that work in voice tend to be small, explicit graphs with named stages, narrow tool grants per stage, hard time-and-money limits, and an aggressive exit policy when the world doesn't behave like the graph expected.
The one-line summary the operator I read closes on is the one I'd keep: making the agent call is not the hard part; making it stop calling, in the right way, at the right time, when it's clearly off the rails, is the hard part. That's the engineering. Everything before it is plumbing.