Marcus Chen

Posted on May 26

Voice agent latency is a lie. The number you care about is barge-in interrupt rate.

#ai #rust #performance #llm

Last quarter we shipped our voice agent into production. The p99 end-to-end latency was 280 milliseconds. Our largest competitor's was 450 milliseconds. On every dashboard, we were faster.

Our user research panel said our agent felt slower.

The "felt slower" gap was 8 percentage points on a 5-point Likert scale. Statistically significant. We had been measuring the wrong thing.

It took us two weeks to figure out what the panel was actually measuring, and four weeks after that to fix the right number. The wrong number was end-to-end latency. The right number was barge-in interrupt rate.

Where the dashboard lied

Voice agent benchmarks measure response time. ASR converts speech to text, the LLM produces a response, TTS turns it into audio, you ship it. The end-to-end clock is what gets reported.

That clock is not what users experience as "speed."

What users experience is the loop between starting to interrupt the agent and the agent shutting up. If they say "wait" mid-sentence and the agent finishes the sentence first, that is a one-to-two-second pause from the user's perspective.

That gap, the barge-in delay, was 380 milliseconds for us. Our competitor's was 60 milliseconds. Users felt that gap on every interruption.

How we measured barge-in interrupt rate

The metric: of attempts where the user starts speaking during agent speech, what percentage result in the agent yielding within X milliseconds?
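A minimal sketch of that definition, assuming you already have one yield latency in milliseconds per interruption attempt. The function name and the sample values are made up for illustration and are not taken from our pipeline.

```python
def barge_in_interrupt_rate(latencies_ms, threshold_ms):
    """Share of interruption attempts where the agent yielded within threshold_ms.

    latencies_ms holds one entry per attempt. None marks an attempt where the
    agent never yielded, which counts as a miss.
    """
    if not latencies_ms:
        raise ValueError("need at least one interruption attempt")
    hits = sum(1 for ms in latencies_ms if ms is not None and ms <= threshold_ms)
    return hits / len(latencies_ms)


# Made-up latencies, for illustration only.
sample = [40.0, 95.0, 120.0, 240.0, None]
rate_100 = barge_in_interrupt_rate(sample, 100)  # 2 of 5 attempts
rate_250 = barge_in_interrupt_rate(sample, 250)  # 4 of 5 attempts
```

Counting never-yielded attempts as misses, rather than dropping them, keeps the rate from looking better than what users hear.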

Two methods.

Synthetic. A corpus of 500 recorded interruption attempts pulled from prior support calls. We fed each audio segment into a copy of the agent and measured time-from-first-syllable to agent-stops-speaking.

python
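# barge_in_eval.py (simplified)
import time

def measure_barge_in(agent, recording):
    """Return ms from the user's first syllable to the agent going silent."""
    start = time.monotonic_ns()
    agent.play(recording.agent_response_audio)  # assumed non-blocking
    # Schedule the interruption at its recorded offset into the agent's speech.
    interrupt_t = start + recording.interrupt_offset_ns
    play_user_audio(recording.user_audio, at=interrupt_t)
    # Harness helper: blocks until the agent is silent, returns a monotonic_ns timestamp.
    stop_t = wait_for_agent_silence()
    return (stop_t - interrupt_t) / 1_000_000  # ns -> ms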

Real. We instrumented the production audio pipeline to emit one span when VAD (voice activity detection) fires and another when TTS interrupts. Both go to OpenTelemetry (OTel). Subtracting the timestamps gives the per-call barge-in latency.
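The subtraction can be sketched as follows, assuming the spans have been exported as `(call_id, name, timestamp_ns)` tuples. The span names `vad_fire` and `tts_interrupt` and the tuple layout are hypothetical, chosen for the example rather than copied from our instrumentation.

```python
def barge_in_latencies_ms(spans):
    """Pair each VAD-fire span with the next TTS-interrupt span on the same call.

    spans: iterable of (call_id, name, timestamp_ns) tuples.
    Returns a list of (call_id, latency_ms) in timestamp order.
    """
    pending = {}  # call_id -> timestamp of a VAD fire still awaiting a TTS interrupt
    results = []
    for call_id, name, ts_ns in sorted(spans, key=lambda s: s[2]):
        if name == "vad_fire":
            # Keep the earliest unmatched fire; repeats belong to the same attempt.
            pending.setdefault(call_id, ts_ns)
        elif name == "tts_interrupt" and call_id in pending:
            results.append((call_id, (ts_ns - pending.pop(call_id)) / 1_000_000))
    return results


# Made-up spans, for illustration only.
spans = [
    ("call-a", "vad_fire", 1_000_000_000),
    ("call-b", "vad_fire", 1_200_000_000),
    ("call-a", "tts_interrupt", 1_085_000_000),
    ("call-b", "tts_interrupt", 1_510_000_000),
]
latencies = barge_in_latencies_ms(spans)
```

A number computed this way starts at the VAD fire, not at the first syllable, so it leaves out however long VAD took to notice the user.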

Our barge-in interrupt rate at the 100ms threshold was 41%. At 250ms it was 89%, but 250ms is too slow to feel responsive.

The three things we changed

1. Pin the audio buffer pages

Our agent ran in a long-lived Tokio runtime. The audio buffers were allocated on the heap and occasionally got paged out to swap when the model weights were active. We locked the buffer pages into RAM with mlock:

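use libc::{c_void, mlock};

/// Lock the pages backing `buf` into RAM so they cannot be swapped out.
/// Can fail (for example with ENOMEM) if the process exceeds RLIMIT_MEMLOCK.
fn pin_buffer(buf: &[u8]) -> std::io::Result<()> {
    // SAFETY: `buf` is a valid slice, so the pointer and length describe
    // memory owned by this process. mlock does not read or write it.
    let ret = unsafe { mlock(buf.as_ptr() as *const c_void, buf.len()) };
    if ret != 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(())
}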

After this, VAD detected user speech within 25ms of first syllable.

2. VAD threshold tuning

We tested VAD thresholds from 0.4 to 0.65 on the synthetic corpus. 0.5 was best: 4% earlier detection than 0.6, with only a 1.2% increase in false positives.
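A sweep like this can be sketched in a few lines, assuming each recording gives per-frame speech probabilities from the VAD plus a hand-labeled onset frame. The data layout, the 10ms frame size, and the sample values are assumptions for the example, not our corpus format.

```python
def sweep_vad_threshold(recordings, thresholds, frame_ms=10):
    """Return {threshold: (mean_detection_delay_ms, false_positive_rate)}.

    recordings: list of (probs, onset_frame), where probs is the VAD speech
    probability per frame and onset_frame is where the user really starts.
    """
    results = {}
    for th in thresholds:
        delays, false_pos = [], 0
        for probs, onset in recordings:
            # A fire before the labeled onset is a false positive.
            if any(p >= th for p in probs[:onset]):
                false_pos += 1
            # Detection delay: first frame at or after onset that crosses th.
            first_hit = next(
                (i for i in range(onset, len(probs)) if probs[i] >= th), None
            )
            if first_hit is not None:
                delays.append((first_hit - onset) * frame_ms)
        mean_delay = sum(delays) / len(delays) if delays else None
        results[th] = (mean_delay, false_pos / len(recordings))
    return results


# Made-up probabilities, for illustration only. Both onsets are at frame 2.
recordings = [
    ([0.1, 0.2, 0.55, 0.7, 0.9], 2),
    ([0.1, 0.45, 0.3, 0.62, 0.8], 2),
]
results = sweep_vad_threshold(recordings, [0.4, 0.6])
```

Lower thresholds fire earlier but also fire on noise before the user speaks, which is the trade-off the sweep makes visible.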

3. TTS interrupt path

The killer. Our TTS streamed audio in 200ms chunks. When VAD fired, the audio queue still held 400ms of buffered audio that played to completion. Users heard the agent finish a fragment of a sentence before silence.

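async fn handle_barge_in(state: &mut AgentState) {
    // Stop generation so nothing new reaches TTS.
    state.llm_handle.cancel();
    // Drop audio that is queued but not yet played.
    state.tts_queue.clear();
    // Halt playback on the output device.
    state.audio_out.stop().await;
}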

We dropped chunk size to 30ms and flushed the queue immediately on VAD fire.
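The arithmetic behind that change can be written as a rough model. It assumes a chunk already handed to the output device cannot be cut short, which is a simplification and may not hold for every audio backend. The function name is hypothetical.

```python
def residual_playout_ms(chunk_ms, queued_ms, flush_on_barge_in):
    """Worst-case audio the user still hears after VAD fires.

    Without a flush, everything buffered plays to completion. With a flush,
    at most the chunk already in flight is heard (simplifying assumption).
    """
    if flush_on_barge_in:
        return min(chunk_ms, queued_ms)
    return queued_ms


# Figures from the text: 200ms chunks with 400ms buffered and no flush,
# then 30ms chunks with an immediate flush.
before = residual_playout_ms(chunk_ms=200, queued_ms=400, flush_on_barge_in=False)
after = residual_playout_ms(chunk_ms=30, queued_ms=400, flush_on_barge_in=True)
```

Under this model the flush removes the queued audio and the smaller chunk size bounds what is left, which is why both changes were needed.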

Results

Four weeks of work. Barge-in interrupt rate at 100ms threshold moved from 41% to 89%. The "felt slower" gap closed within one user research cycle.

Our actual p99 latency went up slightly (280ms to 305ms) because of the smaller TTS chunks. The dashboard number got worse. The user-felt number got dramatically better.

The number that mattered

Voice agent latency is the dashboard number. Barge-in interrupt rate is the user number.

Most voice agent teams I have talked to do not measure barge-in interrupt rate. They measure end-to-end latency, they get a number that feels low, they ship. Then their users say "your agent sucks" and the team cannot reconcile what the dashboard says with what the user says.

The reconciliation is the metric you are not tracking.

What I am still tuning

Eight months in, I have stopped trusting the dashboard more than the user research panel. The dashboard wants to be right. The panel just is.

The barge-in threshold itself is the part I am least sure about. 100ms is our target. 60ms is our competitor's. Whether 60ms gives a meaningful UX delta over 100ms for the users we serve, I genuinely cannot tell yet.

Distinguishing intentional from filler interrupts is the next obvious area. Yielding on "wait" is correct. Yielding on "mhm" is wrong. We currently treat both the same.

And the felt-slower measurement is the one I am most aware of being weak on. Our 5-point Likert is the best we have, and it is not great. If anyone is running rigorous voice agent UX studies, the methodology would be more useful to me than the dashboard ever will.

Top comments (2)

Harjot Singh • May 31

This is a masterclass in measuring-the-wrong-thing, and it generalizes way past voice. You won the metric you could see (p99 280ms vs 450ms) and lost the experience the user actually felt, because the dashboard measured a proxy and the panel measured reality. Barge-in interrupt rate is the right number precisely because it captures the moment that breaks the illusion of a conversation: the user starts talking and the agent keeps going, or cuts them off, and no end-to-end latency figure tells you that happened. The deeper lesson for anyone shipping agents: the easy-to-measure metric and the metric-that-determines-whether-users-trust-it are usually different, and optimizing the visible one to a beautiful number can actively mask a degrading experience. The discipline is to find the metric that correlates with the human judgment and gate on that, even when it's harder to instrument. That measure-what-matters-not-what's-easy principle is core to how I think about evals in Moonshift. How are you capturing barge-in rate now, VAD overlap detection, or post-hoc analysis of who-was-talking-when in the transcript?

Joakim William Hauge • May 26

This is a really good example of why operational AI systems become difficult to reason about purely through traditional infrastructure metrics.

The system was technically “faster,” but runtime behavior during interaction made the experience feel worse.

Feels very similar to what happens in autonomous workflows generally:
the workflow may remain technically operational while retries compound, execution paths drift, or runtime behavior quietly deteriorates underneath the surface.

The interesting problems increasingly seem to emerge in the gap between system metrics and perceived operational quality.