DEV Community

Marcus Chen

Voice agent latency is a lie. The number you care about is barge-in interrupt rate.

Last quarter we shipped our voice agent into production. The p99 end-to-end latency was 280 milliseconds. Our largest competitor's was 450 milliseconds. On every dashboard, we were faster.

Our user research panel said our agent felt slower.

The "felt slower" gap was 8 percentage points on a 5-point Likert scale, and it was statistically significant. We had been measuring the wrong thing.

It took us two weeks to figure out what the panel was actually measuring, and four weeks after that to fix the right number. The wrong number was end-to-end latency. The right number was barge-in interrupt rate.

Where the dashboard lied

Voice agent benchmarks measure response time. ASR (automatic speech recognition) converts speech to text, the LLM produces a response, TTS (text-to-speech) turns it into audio, you ship it. The end-to-end clock is what gets reported.

That clock is not what users experience as "speed."

What users experience is the loop between starting to interrupt the agent and the agent shutting up. If they say "wait" mid-sentence and the agent keeps talking, it can feel like a one-to-two-second pause from the user's perspective, even when the measured delay is a fraction of that.

That gap, the barge-in delay, was 380 milliseconds for us. Our competitor's was 60 milliseconds. Users felt that gap on every interruption.

How we measured barge-in interrupt rate

The metric: of attempts where the user starts speaking during agent speech, what percentage result in the agent yielding within X milliseconds?
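
A minimal sketch of that definition, assuming the per-attempt barge-in latencies have already been collected in milliseconds. The function name and the sample values are illustrative only; they are not taken from our pipeline.

```python
from typing import Optional, Sequence


def barge_in_interrupt_rate(
    latencies_ms: Sequence[Optional[float]], threshold_ms: float
) -> float:
    """Percentage of interruption attempts where the agent yielded within threshold_ms.

    A latency of None means the agent never yielded; it counts as a miss.
    """
    if not latencies_ms:
        raise ValueError("need at least one interruption attempt")
    yielded = sum(
        1 for ms in latencies_ms if ms is not None and ms <= threshold_ms
    )
    return 100.0 * yielded / len(latencies_ms)


# Made-up latencies for five interruption attempts (not real data).
sample = [60.0, 95.0, 180.0, 240.0, None]
rate_100 = barge_in_interrupt_rate(sample, 100.0)  # 2 of 5 attempts
rate_250 = barge_in_interrupt_rate(sample, 250.0)  # 4 of 5 attempts
```

Counting "never yielded" as a miss rather than dropping it keeps the denominator honest: the rate is over all attempts, not only the ones that eventually succeeded.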

Two methods.

Synthetic. A corpus of 500 recorded interruption attempts pulled from prior support calls. We fed each audio segment into a copy of the agent and measured time-from-first-syllable to agent-stops-speaking.

python
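# barge_in_eval.py (simplified)
import time

def measure_barge_in(agent, recording):
    """Return the barge-in latency in ms for one recorded interruption."""
    start = time.monotonic_ns()
    agent.play(recording.agent_response_audio)  # non-blocking: agent starts speaking
    # Inject the user's speech at its recorded offset into the agent audio.
    interrupt_t = start + recording.interrupt_offset_ns
    play_user_audio(recording.user_audio, at=interrupt_t)
    # Harness helper: blocks until agent output goes silent, returns monotonic ns.
    stop_t = wait_for_agent_silence()
    return (stop_t - interrupt_t) / 1_000_000  # ns -> ms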

Real. We instrumented the production audio pipeline to emit one span when VAD (Voice Activity Detection) fires and another when TTS playback is interrupted. Both go to OpenTelemetry (OTel). Subtracting the first timestamp from the second gives the barge-in latency for each interruption.
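
The subtraction can be sketched as follows, assuming each span has been exported as a plain record with a call ID, a span name, and a start timestamp in nanoseconds. The record layout and the span names `vad.fire` and `tts.interrupt` are placeholders for illustration; they are not our actual schema and not an OTel API.

```python
from typing import Dict, Iterable, List, TypedDict


class SpanRecord(TypedDict):
    call_id: str
    name: str      # hypothetical span names: "vad.fire" or "tts.interrupt"
    start_ns: int  # span start time in nanoseconds


def barge_in_latencies_ms(spans: Iterable[SpanRecord]) -> List[float]:
    """Pair each VAD-fire span with the next TTS-interrupt span on the same call."""
    pending: Dict[str, int] = {}  # call_id -> earliest unmatched VAD fire
    latencies: List[float] = []
    for span in sorted(spans, key=lambda s: s["start_ns"]):
        call = span["call_id"]
        if span["name"] == "vad.fire":
            # Keep the first fire; later fires before the interrupt are ignored.
            pending.setdefault(call, span["start_ns"])
        elif span["name"] == "tts.interrupt" and call in pending:
            latencies.append((span["start_ns"] - pending.pop(call)) / 1_000_000)
    return latencies


# Invented spans for two calls (not real data).
example_spans: List[SpanRecord] = [
    {"call_id": "a", "name": "vad.fire", "start_ns": 1_000_000_000},
    {"call_id": "b", "name": "vad.fire", "start_ns": 2_000_000_000},
    {"call_id": "a", "name": "tts.interrupt", "start_ns": 1_380_000_000},
    {"call_id": "b", "name": "tts.interrupt", "start_ns": 2_090_000_000},
]
example_latencies = barge_in_latencies_ms(example_spans)
```

A VAD fire with no matching interrupt is dropped here; when computing a rate it should be counted as a miss instead.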

Before any changes, our barge-in interrupt rate at the 100ms threshold was 41%. At 250ms it was 89%, but 250ms is too slow to feel responsive.

The three things we changed

1. Pin the audio buffer pages

Our agent ran in a long-lived Tokio runtime. The audio buffers were allocated on the heap and were occasionally paged out to swap while the model weights were active, so we locked them into RAM with mlock.

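use libc::{c_void, mlock};

/// Lock the pages backing `buf` into RAM so they cannot be swapped out.
/// Fails (for example with ENOMEM) if the process exceeds RLIMIT_MEMLOCK.
fn pin_buffer(buf: &[u8]) -> std::io::Result<()> {
    // SAFETY: `buf` is a live slice, so pointer and length describe valid memory.
    let ret = unsafe { mlock(buf.as_ptr() as *const c_void, buf.len()) };
    if ret != 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(())
}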

After this, VAD detected user speech within 25ms of first syllable.

2. VAD threshold tuning

We swept the VAD threshold from 0.4 to 0.65 on the synthetic corpus. 0.5 was best: detection came 4% earlier than at 0.6, at the cost of only a 1.2% increase in false positives.
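
A sweep like this can be sketched offline, assuming the VAD emits one speech probability per fixed-length audio frame and fires on the first frame at or above the threshold. The 10ms frame length and the probabilities below are invented for illustration; they are not our VAD's output.

```python
from typing import Optional, Sequence

FRAME_MS = 10  # assumed frame length; real VADs vary


def detection_delay_ms(probs: Sequence[float], threshold: float) -> Optional[int]:
    """Time from clip start to the end of the first frame whose speech
    probability reaches the threshold, or None if the VAD never fires."""
    for i, p in enumerate(probs):
        if p >= threshold:
            return (i + 1) * FRAME_MS
    return None


# Invented probabilities for one clip where speech ramps up gradually.
clip = [0.10, 0.30, 0.45, 0.52, 0.61, 0.70, 0.90]
sweep = {t: detection_delay_ms(clip, t) for t in (0.4, 0.5, 0.6, 0.65)}
```

This only captures the "earlier detection" half of the trade-off. The false-positive half needs the same function run over clips labeled as non-speech, counting how often each threshold fires at all.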

3. TTS interrupt path

The killer. Our TTS streamed audio in 200ms chunks. When VAD fired, the audio queue still held 400ms of buffered audio that played to completion. Users heard the agent finish a fragment of a sentence before silence.

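async fn handle_barge_in(state: &mut AgentState) {
    // Stop generating text the user will never hear.
    state.llm_handle.cancel();
    // Drop synthesized chunks that have not reached the audio device yet.
    state.tts_queue.clear();
    // Halt playback of whatever is already on the device.
    state.audio_out.stop().await;
}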

We dropped chunk size to 30ms and flushed the queue immediately on VAD fire.
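
The arithmetic behind that change, as a rough model: if the queue is left to drain, the user hears every buffered chunk; if it is flushed, they hear at most the chunk already handed to the audio device. The function below is an illustrative simplification. It assumes a chunk in flight always plays to its end, which may not hold for every audio backend, and the chunk counts are made up.

```python
def worst_case_tail_ms(chunk_ms: int, buffered_chunks: int, flush_on_vad: bool) -> int:
    """Upper bound on agent audio heard after VAD fires.

    buffered_chunks includes the chunk currently playing. The worst case is
    that this chunk has only just started when the user interrupts.
    """
    if buffered_chunks <= 0:
        return 0
    if flush_on_vad:
        # Only the in-flight chunk survives the flush.
        return chunk_ms
    # Without a flush, everything buffered plays to completion.
    return buffered_chunks * chunk_ms


before = worst_case_tail_ms(chunk_ms=200, buffered_chunks=2, flush_on_vad=False)
after = worst_case_tail_ms(chunk_ms=30, buffered_chunks=2, flush_on_vad=True)
```

Under these assumptions the tail drops from 400ms to 30ms. Both parts of the fix matter: smaller chunks alone still leave the whole queue to drain, and flushing alone still leaves up to one 200ms chunk playing.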

Results

Four weeks of work. Barge-in interrupt rate at 100ms threshold moved from 41% to 89%. The "felt slower" gap closed within one user research cycle.

Our actual p99 latency went up slightly (280ms to 305ms) because of the smaller TTS chunks. The dashboard number got worse. The user-felt number got dramatically better.

The number that mattered

Voice agent latency is the dashboard number. Barge-in interrupt rate is the user number.

Most voice agent teams I have talked to do not measure barge-in interrupt rate. They measure end-to-end latency, they get a number that feels low, they ship. Then their users say "your agent sucks" and the team cannot reconcile what the dashboard says with what the user says.

The reconciliation is the metric you are not tracking.

What I am still tuning

Eight months in, I have stopped trusting the dashboard more than the user research panel. The dashboard wants to be right. The panel just is.

The barge-in threshold itself is the part I am least sure about. 100ms is our target. 60ms is our competitor's. Whether 60ms gives a meaningful UX delta over 100ms for the users we serve, I genuinely cannot tell yet.

Distinguishing intentional from filler interrupts is the next obvious area. Yielding on "wait" is correct. Yielding on "mhm" is wrong. We currently treat both the same.

And the felt-slower measurement is the one I am most aware of being weak on. Our 5-point Likert is the best we have, and it is not great. If anyone is running rigorous voice agent UX studies, the methodology would be more useful to me than the dashboard ever will.
