Barge-In Is the Voice-Agent Feature Nobody Benchmarks. I Added It and Lost 120ms.

I have read more voice-agent benchmarks than I would like to admit. They all measure the same thing: how many milliseconds from "user stops talking" to "agent starts talking." Stack comparisons, P95 charts, the whole genre. Every one of them treats the conversation as a relay race where only one runner is moving at a time.

Then I shipped barge-in: the ability for a user to talk over the agent and have it shut up gracefully. And I discovered the thing none of those benchmarks measure. Letting the user interrupt is not free. On my own pipeline, turning it on added 120ms to the exact latency number every chart obsesses over. Nobody benchmarks the cost of barge-in, because barge-in is the feature that makes your headline number worse, and no one wants to publish that.

This is the field report.

What barge-in actually is

Barge-in is the agent yielding the floor when the user starts speaking mid-response. It is the difference between a conversation and a press conference. Without it, the agent finishes its sentence no matter what you do, and you end up talking over a machine that cannot hear you over itself.

Here is the part that surprised me. Every benchmark I had read measures a half-duplex world: the user talks, then the agent talks, cleanly alternating. Barge-in breaks that model entirely. To support it, the agent has to keep listening while it is speaking. That is full-duplex, and full-duplex is where the hidden cost lives.

The numbers that get published measure turn-taking latency: human conversation hands off in roughly 200-300ms, while most agents land somewhere between 800ms and 1500ms (Gradium, Softcery). Those are the numbers in the charts. The number nobody charts is how long it takes the agent to stop once you cut in.

The thing I shipped, and the thing it broke

My setup is a fairly ordinary cascade: WebRTC transport, streaming STT, an LLM, streaming TTS, with an orchestrator wiring the frames together. End-to-end I was sitting in a respectable place. Then the complaints came in, and they were the same complaints everyone gets: the agent steamrolls you, it answers before you finish, it keeps talking when you try to correct it.

So I added barge-in. The mechanism is not exotic. While TTS is playing, you keep a VAD running on the inbound mic stream. The instant it fires, you duck the agent audio and decide whether to yield the turn. The common production move is to drop TTS gain by about 24dB the moment VAD fires, without killing the stream, so you can recover if it was a false alarm (Future AGI).
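
A minimal sketch of the ducking step, assuming TTS audio arrives as frames of float samples. The `DuckingTts` class and its method names are hypothetical and do not come from any particular framework; the only number taken from the text is the roughly 24dB attenuation.

```python
DUCK_DB = -24.0  # attenuation quoted in the text; tune for your own pipeline


def db_to_gain(db: float) -> float:
    """Convert a decibel change into a linear amplitude multiplier."""
    return 10.0 ** (db / 20.0)


class DuckingTts:
    """Hypothetical wrapper that scales outgoing TTS frames instead of stopping them."""

    def __init__(self, duck_db: float = DUCK_DB) -> None:
        self.duck_gain = db_to_gain(duck_db)
        self.gain = 1.0  # full volume until VAD fires

    def on_vad_start(self) -> None:
        # VAD fired on the inbound mic: duck, but keep the stream alive.
        self.gain = self.duck_gain

    def on_false_alarm(self) -> None:
        # The sound was not a real interruption: restore full volume.
        self.gain = 1.0

    def process(self, frame: list[float]) -> list[float]:
        # Apply the current gain to one frame of samples.
        return [sample * self.gain for sample in frame]
```

A 24dB drop works out to a linear gain of about 0.063, so the agent is still faintly audible while the yield decision is pending. A real implementation would likely ramp the gain over a few milliseconds to avoid clicks; that is left out here.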

It worked. The agent stopped steamrolling. And my latency went up, because of the part I had not thought hard enough about: false barge-ins.

Why naive barge-in is a disaster

The first version fired on everything. A cough. A door. The user saying "mm-hm" to agree, which is not an interruption at all, it is a backchannel. Worst of all, it fired on the agent's own voice leaking back through the mic. The agent heard itself, decided someone was interrupting, and went silent. A machine startled by its own echo, like a dog barking at the mirror and then losing the staring contest.

The fix for the echo problem is acoustic echo cancellation: feed the speaker output back as a reference signal, subtract it from the mic input, and you are left with just the human. That is table stakes for full-duplex and I will not relitigate it here.

The fix for the cough-and-backchannel problem is where the latency went. You cannot trust a single VAD frame. Energy-based VAD does not know the difference between "I disagree, stop" and someone clearing their throat in a coffee shop. Background noise pushing energy above threshold is exactly the failure mode the field keeps naming (Future AGI). So you add a guard. You require the interrupting speech to persist for a minimum duration before you commit to yielding.

That guard is the 120ms. And it buys something real. A minimum-duration guard can cut the false-barge-in rate by 60-80%, but it adds roughly 200ms to the barge-in path (Future AGI). I tuned mine tighter than that: it took +120ms before my false-positive rate dropped under the 5% I was aiming for. The published target for barge-in is brutal in both directions: 95%+ accuracy, under 5% false positives, under 5% missed real interruptions (Future AGI). You do not get there for free, and the currency you pay in is the same milliseconds your benchmark is bragging about.
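
A minimum-duration guard can be sketched as a counter over VAD frames. This assumes the VAD emits one boolean per fixed-size frame; the 20ms frame size and the `MinDurationGuard` name are illustrative choices, not something the pipeline above is known to use.

```python
import math


class MinDurationGuard:
    """Commit to a barge-in only after speech persists for min_speech_ms."""

    def __init__(self, min_speech_ms: int = 120, frame_ms: int = 20) -> None:
        # Number of consecutive speech frames needed before yielding.
        self.frames_needed = math.ceil(min_speech_ms / frame_ms)
        self.run = 0  # length of the current unbroken run of speech frames

    def update(self, is_speech: bool) -> bool:
        """Feed one VAD frame; return True once the guard is satisfied."""
        self.run = self.run + 1 if is_speech else 0
        return self.run >= self.frames_needed


guard = MinDurationGuard(min_speech_ms=120, frame_ms=20)
cough = [True, True, True, False]   # 60ms burst, then silence
interruption = [True] * 6           # 120ms of sustained speech
decisions = [guard.update(frame) for frame in cough + interruption]
```

In this toy run the cough never trips the guard, and the sustained speech trips it only on its sixth frame. That sixth frame is exactly where the added latency comes from: the agent cannot commit any earlier than the guard length.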

The two timers nobody puts on the same chart

Here is what I think the benchmarks get structurally wrong. There is not one latency in a voice agent. There are two, and they pull in opposite directions.

Timer	What it measures	Who reports it
Turn-taking latency	User stops -> agent starts	Every chart
Barge-in latency	User cuts in -> agent stops	Almost nobody

Turn-taking latency is the relay-race number. Barge-in latency is the interrupt-handling number, and the field is starting to put real targets on it: interruption response under 200ms, measured from user-speech onset to TTS suppression (Future AGI). The trap is that these two timers fight. Make the agent quicker to yield and you generate more false stops. Add a guard to kill the false stops and you slow the yield. You are not optimizing a number. You are choosing a point on a tradeoff curve, and the benchmark that reports only the first timer is hiding the second axis entirely.

The research framing I found most honest measures the minimum latency required to reach 90% barge-in accuracy, rather than reporting latency and accuracy as if they were independent (Future AGI). That is the joint metric. That is what a barge-in benchmark should look like, and almost nobody publishes it.
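
One way to compute that joint metric, sketched under strong simplifying assumptions: the detector is a pure duration guard, barge-in latency is approximated by the guard length, and every inbound sound has a human label. The `SpeechEvent` type and the event durations are invented for illustration and are not measurements.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence


@dataclass(frozen=True)
class SpeechEvent:
    duration_ms: int        # how long the inbound sound lasted
    is_interruption: bool   # human label: real barge-in or not


def accuracy_at_guard(events: Sequence[SpeechEvent], guard_ms: int) -> float:
    """Fraction of events a duration guard of guard_ms classifies correctly."""
    correct = sum(
        (event.duration_ms >= guard_ms) == event.is_interruption
        for event in events
    )
    return correct / len(events)


def min_latency_for_accuracy(
    events: Sequence[SpeechEvent],
    guards_ms: Iterable[int],
    target: float = 0.90,
) -> Optional[int]:
    """Smallest guard (a proxy for latency) that reaches the target accuracy."""
    for guard_ms in sorted(guards_ms):
        if accuracy_at_guard(events, guard_ms) >= target:
            return guard_ms
    return None  # no candidate guard reaches the target


# Toy labels: four non-interruptions (coughs, a backchannel), six real cut-ins.
toy_events = (
    [SpeechEvent(d, False) for d in (60, 80, 100, 150)]
    + [SpeechEvent(d, True) for d in (200, 300, 400, 500, 600, 700)]
)
best_guard = min_latency_for_accuracy(toy_events, [50, 100, 150, 200])
```

The point of the shape is that latency is reported at a fixed accuracy rather than on its own. Swapping in a smarter detector changes `accuracy_at_guard`, not the metric.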

Where the +120ms actually fits in the budget

To be clear about scale: in a cascade, the latency that gets all the attention is the STT-to-LLM-to-TTS chain, which even at its fastest is a few hundred milliseconds of stacked work. The barge-in path is a separate budget. It runs in parallel, on the listening side, the whole time the agent is talking. The response chain and the listening path never touch.

So the +120ms does not lengthen your response. It lengthens the interruption. When a user cuts in, that is the delay before the agent goes quiet. And that delay has a much lower tolerance than response latency does. People forgive an agent that takes 600ms to answer. They do not forgive an agent that keeps talking for 600ms after they have clearly told it to stop, because at that point it is not slow, it is rude. The barge-in timer is the one your users feel as a personality flaw.

What 2026 turn-taking models change

The honest version of this story is that the brute-force guard is the old way, and the field has moved. The fix for "VAD is too dumb to tell a cough from a correction" is to stop using a bare energy threshold and use a model that understands turns.

This is the shift everyone is making right now. The 2026 production stack is migrating from energy-threshold VAD toward dedicated turn-taking models that classify backchannel versus barge-in versus continued silence as a learned signal (Future AGI). The named players:

Deepgram Flux does model-native end-of-turn detection using acoustic, semantic, and conversational context instead of silence thresholds, landing around 250ms end-of-turn and removing the need for a separate VAD-plus-endpointing stack (Deepgram).
Krisp Turn Prediction v3 pushes end-of-turn latency below 200ms, and in May 2026 benchmarks its accuracy curve sat below LiveKit's built-in and Deepgram Flux's across the operating range (Krisp).
LiveKit Agents ships adaptive interruption handling at 86% precision and 100% recall, with the barge-in and backchannel-suppression logic living in the orchestrator, not the ASR model (Inworld).

That last point reframed the whole problem for me. Barge-in quality lives in your orchestrator, not your speech-to-text model. The model tells you what it heard; the orchestrator decides what to do about it, and that decision is the entire game (AssemblyAI). I had been tuning the wrong layer for a week.
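
The yield-or-hold decision can be sketched as a small policy function that sits above whatever model produces the turn signal. Everything here is hypothetical: the three signal classes mirror the backchannel, barge-in, and silence split described above, while the action names and the 0.8 confidence threshold are placeholders rather than any vendor's API.

```python
from enum import Enum, auto


class TurnSignal(Enum):
    BACKCHANNEL = auto()  # "mm-hm", "yeah": the user is agreeing, not cutting in
    BARGE_IN = auto()     # the user wants the floor
    SILENCE = auto()      # nothing relevant on the mic


class Action(Enum):
    KEEP_SPEAKING = auto()
    DUCK_AND_WAIT = auto()  # lower TTS volume and wait for more evidence
    YIELD_TURN = auto()     # stop TTS and hand the floor to the user


def decide(signal: TurnSignal, confidence: float,
           yield_threshold: float = 0.8) -> Action:
    """Map a classified turn signal to what the orchestrator should do."""
    if signal is TurnSignal.BARGE_IN:
        if confidence >= yield_threshold:
            return Action.YIELD_TURN
        return Action.DUCK_AND_WAIT
    # Backchannels and silence never take the floor away from the agent.
    return Action.KEEP_SPEAKING
```

Keeping this policy separate from the speech model means the threshold and the set of actions can be tuned without retraining or swapping the model.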

A semantic turn detector earns back most of my 120ms because it does not need a long duration guard. It can tell that "actually—" is an interruption and "yeah, mm-hm" is not, from the prosody and the words, not from how long the sound lasted. The guard was a crutch for a dumb VAD. A model that understands the turn lets you commit to the decision sooner with the same accuracy, which is the only way to move down the tradeoff curve instead of along it. Combining audio and text this way is what closes the gap to roughly 300ms without cutting users off mid-thought (Future AGI).

What I would tell myself before shipping it

Three things rearranged in my head, and they are the things I wish a benchmark had told me.

Measure the second timer. If your dashboard only has turn-taking latency, you are flying with one instrument. Add barge-in latency, measured from user-speech onset to TTS suppression, and watch them as a pair. The moment you start optimizing one in isolation, you are quietly wrecking the other.
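
A minimal sketch of logging both timers side by side, assuming all four timestamps come from one shared clock in milliseconds. The `TurnLog` fields are hypothetical names, and the percentile uses the simple nearest-rank method.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class TurnLog:
    """Timestamps in milliseconds for one exchange; unset fields did not occur."""
    user_stopped: Optional[float] = None    # end of the user's utterance
    agent_started: Optional[float] = None   # first TTS audio out
    user_cut_in: Optional[float] = None     # onset of interrupting speech
    tts_suppressed: Optional[float] = None  # agent audio ducked or stopped


def turn_taking_ms(log: TurnLog) -> Optional[float]:
    if log.user_stopped is None or log.agent_started is None:
        return None
    return log.agent_started - log.user_stopped


def barge_in_ms(log: TurnLog) -> Optional[float]:
    if log.user_cut_in is None or log.tts_suppressed is None:
        return None
    return log.tts_suppressed - log.user_cut_in


def percentile(values: list, pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def paired_p95(logs: list) -> dict:
    """Report both timers together so neither is tuned in isolation."""
    turn = [v for v in map(turn_taking_ms, logs) if v is not None]
    barge = [v for v in map(barge_in_ms, logs) if v is not None]
    return {
        "turn_taking_p95_ms": percentile(turn, 95) if turn else None,
        "barge_in_p95_ms": percentile(barge, 95) if barge else None,
    }
```

Returning the two values in one record is deliberate: a dashboard built on it cannot show one timer without the other.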

The guard is a tax, not a feature. A minimum-duration guard is the cheapest way to stop false barge-ins and the most expensive way to feel responsive. It is fine as a first pass. It is a bad place to live. If you are still paying a 120-200ms guard tax six months in, you have not solved barge-in, you have postponed it.

Barge-in is an orchestrator problem. I spent days assuming a better STT model would fix my interruptions. It would not have. The yield-or-hold decision lives above the model, and that is where the engineering actually is. Pick your transport and orchestrator for how they handle interruption events, because that is the layer your users will judge.

The number nobody puts on the chart is the number your users feel first. An agent that answers fast but will not stop talking is not a fast agent. It is a fast bulldozer. I would rather lose 120ms and have it know when to shut up.

I pulled the latency-budget framing and the cascade anatomy behind this from my book The 300ms Voice-AI UX Problem, which is where I worked out why turn-taking is the part of the budget that does not behave like the rest of it. This post is what happened when I stopped reading about turn gaps and started measuring the one in the other direction.