I watched 30 users talk to the same voice agent
Same script. Same questions. The only thing I changed was the response latency: 300ms, 500ms, 800ms.
At 300ms, people just talked. No awkward pauses, no confusion. One user didn't even realize it was an AI until I told her afterward.
At 500ms, something shifted. Users started talking over the agent. They'd ask a question, wait half a second, then rephrase it -- which reset the entire processing pipeline and made the delay even worse.
At 800ms, it was painful. "Hello? Can you hear me?" One guy just hung up.
The experience didn't degrade gradually. It fell off cliffs. I'd love to tell you I predicted this. I didn't. I just watched 30 people get increasingly annoyed at my code.
Three cliffs, not a slope
Most latency discussions treat response time as a sliding scale: faster is better, slower is worse. That's true in a vague sense, but it misses something important about voice specifically.
Voice AI has three hard thresholds where user behavior changes abruptly. Cross one, and you're not dealing with a slightly worse experience -- you're dealing with a different kind of interaction entirely.
| Latency | What users do | What you need to build |
|---|---|---|
| 0-300ms | Talk naturally, forget it's AI | Nothing. You're golden |
| 300-500ms | Notice the gap, but tolerate it | Consider filler responses |
| 500-800ms | Talk over the agent, repeat themselves | Fillers mandatory, explicit turn-taking |
| 800ms-1.5s | "Can you hear me?" | Progress indicators required |
| 1.5-4s | Start thinking about hanging up | Stream partial responses |
| 4s+ | Gone | Your design is broken |
Let me walk through the three cliffs.
Cliff 1: 300ms -- the conversation boundary
Below 300ms, a voice agent passes as conversational. Not "good for a computer" -- actually conversational. Users stay in the flow of dialogue without becoming aware they're waiting for a machine.
AssemblyAI calls this the "300ms rule," and their benchmark data backs it up. Below this threshold, users behave the same way they would talking to another person. Above it, the spell breaks. They become conscious that something is processing their words, and their speech patterns change.
This maps to what we know about human conversation. Stivers et al. measured turn-taking gaps across 10 languages (published in PNAS, 2009), and the median is around 200ms. The variation across cultures was small -- the timing looks like a human universal, not a local convention. Our brains expect responses in that window.
300ms gives you a 100ms buffer on top of the human baseline. It's tight, but it's enough.
In 2026, hitting this target is no longer theoretical. Hume's EVI 3 delivers speech-to-speech responses under 300ms. Cartesia Sonic reports around 40ms time-to-first-audio. Deepgram's speech-to-text alone runs sub-300ms. On the open-source side, Kokoro -- an 82M-parameter TTS model -- runs natively on a MacBook Neural Engine or smartphone NPU with near-zero latency. The pieces exist. The challenge is assembling the full pipeline (STT + LLM + TTS) without blowing the budget.
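What "assembling the pipeline" means in practice: stream every stage instead of waiting for complete outputs. Here's a minimal sketch of that shape -- the stt/llm/tts functions are placeholders for whatever vendor SDKs you're actually running, not a specific API.

```typescript
// Minimal streaming-pipeline sketch. The three stage functions are
// placeholders -- swap in your vendor SDKs. The point is the data flow:
// TTS starts on the first complete clause, not the full LLM response.

type SttFn = (audio: AsyncIterable<Uint8Array>) => Promise<string>;
type LlmFn = (prompt: string) => AsyncIterable<string>;
type TtsFn = (text: string) => Promise<void>;

async function respond(
  audioIn: AsyncIterable<Uint8Array>,
  stt: SttFn,
  llm: LlmFn,
  tts: TtsFn,
): Promise<void> {
  const transcript = await stt(audioIn);        // stage 1: speech-to-text
  let clause = "";
  for await (const token of llm(transcript)) {  // stage 2: tokens stream in
    clause += token;
    // Flush at clause boundaries so the user hears audio while the
    // LLM is still generating the rest of the answer.
    if (/[.!?]\s*$/.test(clause)) {
      await tts(clause);                        // stage 3: text-to-speech
      clause = "";
    }
  }
  if (clause) await tts(clause);                // flush the tail
}
```

The win: your effective first-audio latency becomes STT plus the LLM's first clause plus TTS time-to-first-audio, not the sum of three complete stages.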
Cliff 2: 500ms -- the overlap trap
This one's sneaky, because it creates a feedback loop that makes everything worse.
When silence hits 500ms in a conversation, humans interpret it as a turn signal. "They're not going to respond, so it's my turn now." This isn't a conscious decision -- it's baked into how we process dialogue.
So when your voice agent takes 520ms to start responding, the user jumps in. "I said, what's the weather in --" And now your speech-to-text engine receives new audio input. Depending on your architecture, this either:
- Resets the processing pipeline entirely (new input = start over)
- Creates a garbled transcript that confuses the LLM
- Gets queued behind the first response, creating a pile-up
All three outcomes increase latency on the next turn. The user notices the longer delay, talks over the agent again, and you've got a death spiral.
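The architectural fix for the pile-up is to make every in-flight response cancellable, so new speech always wins. A sketch using AbortController -- generateResponse stands in for your own pipeline entry point:

```typescript
// Sketch: barge-in handling. When the user speaks again, abort the
// stale pipeline run instead of queuing behind it.
// generateResponse is a placeholder for your pipeline entry point.

declare function generateResponse(
  text: string,
  opts: { signal: AbortSignal },
): Promise<void>;

let inFlight: AbortController | null = null;

async function handleUtterance(text: string): Promise<void> {
  inFlight?.abort();                     // new input always wins
  inFlight = new AbortController();
  try {
    await generateResponse(text, { signal: inFlight.signal });
  } catch (err) {
    // Aborts are expected; anything else is a real failure.
    if ((err as Error).name !== "AbortError") throw err;
  }
}
```

This keeps turn N+1 from inheriting turn N's backlog, which is what breaks the death spiral.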
I saw this pattern in 8 out of 12 users in the 500ms test group. The ones who didn't overlap were the patient ones -- the kind of people who wait three seconds after a traffic light turns green. You can't design for that demographic.
The fix at this level is explicit turn-taking signals. A quick "mmhmm" or "let me check" buys you the time the silence would otherwise eat. Vapi AI's analysis found that even a simple filler sound cut overlap incidents by over 60%.
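A minimal version of that filler logic: arm a timer when processing starts, and only let the filler fire if real audio misses the window. playAudio is a stand-in for your playback layer, and the 350ms trigger is a starting guess you'd tune, not a magic number.

```typescript
// Sketch: time-boxed filler. It plays only if the real response misses
// the window, so fast turns stay filler-free.
// playAudio is a placeholder for your audio playback layer.

declare function playAudio(clip: string): void;

function armFiller(firstAudioReady: Promise<void>, windowMs = 350): void {
  const timer = setTimeout(() => playAudio("let-me-check.wav"), windowMs);
  // If the real response lands inside the window, the filler never plays.
  firstAudioReady.finally(() => clearTimeout(timer)).catch(() => {
    /* errors belong to whoever owns firstAudioReady */
  });
}
```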
Cliff 3: 800ms -- conversation collapse
800ms is four times the natural human turn-taking gap. At this point, users stop treating the interaction as a conversation and start treating it as a broken phone connection. I know this threshold intimately because two of my own prototypes lived here for months before I figured out why nobody wanted to use them.
You've been there. International calls with satellite delay, where you and the other person keep stepping on each other's sentences, then both go silent, then both start again. That's what 800ms feels like to your users.
Retell AI's benchmark data shows that at 800ms+, users exhibit three consistent behaviors:
- Repeat the question (assuming they weren't heard)
- Meta-check ("Are you still there?" / "Hello?")
- Abandon (hang up or close the app)
Cresta's research found that beyond 1.5 seconds, experience degradation becomes steep enough that recovery is nearly impossible. Users who hit 1.5s+ latency in the first exchange have much higher drop-off rates for the entire session -- even if subsequent responses are faster.
The damage is front-loaded. Your first response sets the user's mental model for the whole interaction.
The echo problem: the hidden fourth cliff
There's a compounding factor most teams ignore until it's too late: echo.
When latency is high, the user's own voice can bounce back to them with a 1-2 second delay. If you've ever heard yourself on a slight delay while talking -- maybe through a monitor speaker in a conference room -- you know how disorienting it is. Speech researchers call this delayed auditory feedback, and most people can't keep talking normally through it. Try it sometime -- have someone play your voice back to you at a one-second offset. You'll stumble within five words.
This means high-latency systems don't just feel slow -- they actively disrupt the user's ability to communicate. Echo cancellation quality becomes a make-or-break factor once you cross the 800ms cliff. You're no longer just optimizing for speed; you're preventing a physiological interference pattern.
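If you're capturing audio in a browser, you at least get the platform's acoustic echo canceller for free -- but only if you ask for it. These are standard MediaTrackConstraints; browsers treat them as hints, so verify what you actually got.

```typescript
// Browser-side capture with echo cancellation requested explicitly.
// Constraints are hints, not guarantees -- check with getSettings().

const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    echoCancellation: true,
    noiseSuppression: true,
    autoGainControl: true,
  },
});

const settings = stream.getAudioTracks()[0].getSettings();
console.log("AEC active:", settings.echoCancellation === true);
```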
Where the industry actually stands in 2026
Vendor benchmarks are generous. When ElevenLabs reports 75ms for Flash v2.5, that's model inference time -- not the end-to-end latency your user experiences. Trillet's independent benchmarks from early 2026 measured 532ms TTFB for short prompts and 906ms for longer conversational turns once you factor in network round-trip, API auth, and encoding overhead.
The full voice pipeline has three stages, each eating clock:
- Speech-to-text: 100-300ms (Deepgram, AssemblyAI lead here)
- LLM inference: 200-800ms (this is where 70% of total latency hides)
- Text-to-speech: 40-150ms (Cartesia Sonic, ElevenLabs Flash, Qwen3-TTS at 97ms TTFA)
Add those up and you're looking at 340ms best-case for a simple response, 1,250ms for anything requiring real reasoning. The 300ms cliff is reachable for short, predictable exchanges. The 500ms cliff is where most production systems actually live.
Edge computing is closing the gap. Audio tokenization improvements have cut average voice agent latency from 2,500ms to around 600ms over the past year. Model quantization, speculative decoding, and prompt caching each shave off another 10-15%.
But here's the uncomfortable truth: if your LLM needs to think for 400ms, no amount of TTS optimization will save you from the 500ms cliff. I spent two weeks optimizing TTS before realizing the bottleneck was upstream. Two weeks I'd like back.
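Per-stage instrumentation is how you find that out in an afternoon instead of two weeks. A generic wrapper like this -- nothing vendor-specific -- logs where each turn's budget actually went:

```typescript
// Generic per-stage timer. Wrap each pipeline stage so production logs
// show which stage is eating the latency budget on every turn.

async function timed<T>(name: string, work: () => Promise<T>): Promise<T> {
  const t0 = performance.now();
  try {
    return await work();
  } finally {
    console.log(`${name}: ${Math.round(performance.now() - t0)}ms`);
  }
}

// Usage (stt/llm/tts are your own stage functions):
//   const text  = await timed("stt", () => stt(audio));
//   const reply = await timed("llm", () => llm(text));
//   await timed("tts", () => tts(reply));
```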
What this means for what you build
If you're building a voice agent today, the three cliffs give you a framework for prioritization:
If you're above 800ms, nothing else matters until you fix latency. No feature, no personality tuning, no prompt engineering will compensate for users who can't hold a conversation with your product.
If you're between 500ms and 800ms, implement fillers and turn-taking signals immediately. A well-timed "let me look that up" is worth more than shaving 50ms off your TTS.
If you're between 300ms and 500ms, focus on the first response. Front-load your fastest path. Cache common opening exchanges -- see the sketch below. Make the first 3 seconds of the interaction feel instant, even if later turns are slightly slower.
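One cheap way to do that caching: pre-synthesize the handful of openers that cover most first turns, so the first response skips TTS entirely. synthesize() here is a placeholder for your TTS client, not a specific SDK.

```typescript
// Sketch: pre-synthesized opener cache. Pay the TTS cost once at boot;
// first responses that hit the cache skip TTS entirely.
// synthesize() is a placeholder for your TTS client.

declare function synthesize(text: string): Promise<Uint8Array>;

const openerCache = new Map<string, Uint8Array>();

async function warmOpeners(): Promise<void> {
  for (const text of [
    "Hi! What can I help you with?",
    "Sure, one moment.",
    "Sorry, could you say that again?",
  ]) {
    openerCache.set(text, await synthesize(text));
  }
}

async function speak(text: string): Promise<Uint8Array> {
  return openerCache.get(text) ?? synthesize(text); // cache hit = 0ms TTS
}
```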
If you're below 300ms, congratulations -- you're in the conversation zone. Now you can worry about personality, tone, and everything else that makes a voice agent actually useful.
Measure your p95 latency, not your median. Your cliff-crossing moments happen on the slow tail, and that's where users form their worst impressions.
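Computing it is trivial; the discipline is in looking at it. A nearest-rank p95 over a rolling window of per-turn latencies is all you need:

```typescript
// Nearest-rank p95 over per-turn latency samples (in ms).

function p95(samplesMs: number[]): number {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  return sorted[Math.max(0, Math.ceil(0.95 * sorted.length) - 1)];
}

// A healthy-looking median can hide a tail that crosses the cliffs:
const turns = [240, 260, 280, 290, 310, 320, 450, 520, 610, 940];
console.log(p95(turns)); // 940 -- the turn users actually remember
```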
📘 If you want to go deeper
The 300ms Threshold: Why Talking to AI Feels Wrong -- Kindle English edition. Covers the full latency optimization stack across 12 chapters: human conversation baselines, the three cliffs framework, pipeline architecture (STT/LLM/TTS), filler design, echo cancellation, and edge deployment strategies.