The 4-layer voice-agent latency stack, traced with OTel spans

#ai #observability #voice #rust

How I instrument ASR, LLM, TTS, and the client with OpenTelemetry, and which number in each layer I actually look at

TL;DR. A voice agent is four moving parts stuck together: speech to text, the model that writes the reply, text to speech, and the client that plays the audio back. End-to-end latency hides which of those four is slow on any given turn, so I stopped tracking it as one number and started tracing each stage as its own OTel span with a shared session id. The number I watch hardest is barge-in: when the user starts talking over the agent, how many milliseconds until the agent actually stops sending audio. In our setup we want that under 200ms, and when p95 barge-in creeps past that, the agent feels like it is talking at you instead of with you. Everything below is how I wire the spans, what attributes go on each one, and the p95 I page on per layer.

The thing I keep saying, and the thing that keeps being true: voice agents fail in production not because of raw latency but because nobody simulated the audio and LLM pipeline together. You can have a fast ASR, a fast model, a fast TTS, and a voice agent that still feels broken, because the failure lives in the seams between them and in the parts (barge-in, jitter) that no single-stage benchmark touches. Tracing is how I get the seams to show up.

A note before the layers. This is just the setup we run, the spans we emit, and the mistakes that made us add each attribute. Some of it is probably specific to our stack and will not transfer. I will flag that where I can.

The shape of a turn, and why one span is not enough

One turn is: user says a thing, agent says a thing back. Underneath that is roughly: audio frames come in, ASR turns them into text (streaming partials as it goes); the text plus history goes to the LLM, which streams tokens back; as text comes out, TTS turns it into audio, also streaming; the client receives audio frames and plays them, with some buffering to smooth out jitter.

If you wrap the whole turn in a single span and call it voice.turn, you get a duration and almost no ability to act on it. A 1,400ms turn could be a slow first token, or TTS waiting on the full sentence before it starts, or the client buffering too aggressively. Same total, three different fixes.
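To make "same total, three different fixes" concrete, here is a tiny sketch with invented numbers. The turn names and millisecond values are made up for illustration, and it pretends the stages run back to back, which a streaming pipeline does not, so read it only as a picture of the attribution problem.

```python
# Hypothetical per-stage durations (ms) for three turns with the same total.
turns = {
    "slow_first_token": {"asr": 150, "llm": 900, "tts": 250, "client": 100},
    "tts_waits_for_sentence": {"asr": 150, "llm": 300, "tts": 850, "client": 100},
    "client_overbuffers": {"asr": 150, "llm": 300, "tts": 250, "client": 700},
}

def slowest_stage(stages):
    """Return the name of the stage that took the most time in one turn."""
    return max(stages, key=stages.get)

for name, stages in turns.items():
    # Every turn prints the same total, but a different stage to go fix.
    print(name, sum(stages.values()), slowest_stage(stages))
```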

So the parent span is voice.turn, and each stage is a child span. Every span carries the same audio.session_id and an audio.turn_id, so I can pull one turn out of Tempo and see all four stages laid out in time. The attribute I care about most on the streaming stages is not total duration. It is first byte: how long until the stage produced its first useful output. First byte is what the user feels, because all three stages are streaming and the user starts perceiving progress at the first byte, not the last.

import time
from contextlib import contextmanager
from opentelemetry import trace

tracer = trace.get_tracer("voice.pipeline")

@contextmanager
def stage_span(stage, session_id, turn_id):
    """Open a span for one pipeline stage (a child of the current span, if any)."""
    span = tracer.start_span(f"audio.{stage}")
    span.set_attribute("audio.stage", stage)
    span.set_attribute("audio.session_id", session_id)
    span.set_attribute("audio.turn_id", turn_id)
    started = time.monotonic()
    state = {"fb": False}

    def mark_first_byte():
        # Only the first call records a value; later calls are no-ops.
        if state["fb"]:
            return
        span.set_attribute("audio.first_byte_ms", round((time.monotonic() - started) * 1000.0, 1))
        state["fb"] = True

    try:
        # end_on_exit=False: the span is ended in the finally block below.
        with trace.use_span(span, end_on_exit=False):
            yield mark_first_byte
    except Exception:
        # use_span has already recorded the exception and set the status to ERROR.
        span.set_attribute("audio.error", True)
        raise
    finally:
        span.set_attribute("audio.total_ms", round((time.monotonic() - started) * 1000.0, 1))
        if not state["fb"]:
            # Sentinel for "produced nothing"; a float so the attribute type stays consistent.
            span.set_attribute("audio.first_byte_ms", -1.0)
        span.end()

Calling it around the LLM stage: you call first_byte() inside the streaming loop the first time a token shows up, and the wrapper does the timing math.

async def run_llm_stage(session_id, turn_id, messages, llm_client):
    chunks = []
    with stage_span("llm", session_id, turn_id) as first_byte:
        async for token in llm_client.stream(messages):
            first_byte()           # no-op after the first call
            chunks.append(token)
    return "".join(chunks)

I use time.monotonic() and not time.time() on purpose. Wall clock can jump (NTP corrections), and on a sub-second budget a backwards clock gives you negative latencies that poison the percentiles. One more thing I learned the annoying way: audio.session_id is high cardinality, so I keep it as a span attribute for trace lookup, but I do not turn it into a metric label. Stage goes on the metric label. Session id stays on the trace.
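One way to enforce the label rule is an allow-list that decides which span attributes may also become metric labels. This is a minimal sketch, not a description of any library API: METRIC_LABEL_KEYS, metric_labels, and the turn id value are hypothetical names introduced for illustration.

```python
# Hypothetical allow-list: only low-cardinality keys may become metric labels.
METRIC_LABEL_KEYS = frozenset({"audio.stage"})

def metric_labels(span_attributes):
    """Keep only the attributes that are safe to use as metric labels."""
    return {k: v for k, v in span_attributes.items() if k in METRIC_LABEL_KEYS}

attrs = {
    "audio.stage": "llm",
    "audio.session_id": "sess_8f21c0",  # stays on the trace only
    "audio.turn_id": "turn_3",          # made-up value; also trace only
}
labels = metric_labels(attrs)
print(labels)
```

The full attrs dict would still go on the span. Only labels would be passed along when recording the metric.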

ASR: measure first partial, not final transcript

The mistake I made first was timing ASR as audio-in to final-transcript-out. That number is real but it is not the one that matches what the user feels, because a streaming ASR gives you a partial transcript fast and then refines it. So the span gets two numbers: audio.first_byte_ms is time to first partial, and I stash time to final separately.

The other ASR attribute that earned its place is whether the final transcript disagreed badly with the last partial. We had an incident where ASR turned a customer saying they wanted to confirm an order into the word cancel, and the agent acted on it. After that I started recording a rough measure of how much the final revised the partial, so big late revisions show up in traces instead of only in an angry support ticket.

What I look at for ASR: p95 of time to first partial. In our setup that sits under 150ms most days, and when it drifts up it is almost always the audio frames not arriving on time from the client, not the ASR model. A nice example of why you trace the whole thing.
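A rough revision measure like the one described above could be sketched with the standard library's difflib, comparing the last partial and the final transcript word by word. This is an illustration under assumptions, not necessarily the measure we record: the function name revision_ratio and the example sentences are invented.

```python
from difflib import SequenceMatcher

def revision_ratio(last_partial, final):
    """Return 0.0 when the final matches the last partial, up to 1.0 when no words match."""
    a = last_partial.lower().split()
    b = final.lower().split()
    if not a and not b:
        return 0.0
    # ratio() is 1.0 for identical word sequences, so invert it to get "how much changed".
    return round(1.0 - SequenceMatcher(None, a, b).ratio(), 3)

# Invented example: a single word changes between the last partial and the final.
print(revision_ratio("i want to confirm my order", "i want to cancel my order"))
```

A one-word swap scores low on this scale even when it flips the meaning, so a threshold on the ratio alone would need tuning against real transcripts.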

The LLM: first token is the whole ballgame, and barge-in lives here too

For the model stage, total generation time barely matters for the felt experience, because TTS consumes tokens as they arrive. What matters is time to first token. If the model takes 600ms before the first token, the user hears 600ms of silence after they stopped talking, and that feels like the agent froze. So the LLM span's headline attribute is time to first token.

Barge-in is the part people forget to instrument, and the part I would instrument first if I were starting over. It is what happens when the user starts talking while the agent is still speaking. The metric: from the moment voice-activity detection fires, to the moment the agent's outbound audio actually goes quiet. The first time we measured it, it was around 500ms and felt terrible, and the breakdown showed most of the time was not detection. It was buffered TTS audio we had already shipped toward the client and could not un-send. We had buffered aggressively to fight jitter, and that same buffer made barge-in slow. Tracing let me see the two goals were fighting. We are at roughly 180ms p95 now.

def run_barge_in(session_id, turn_id, vad, agent_audio):
    with stage_span("barge_in", session_id, turn_id) as first_byte:
        span = trace.get_current_span()
        t0 = time.monotonic()
        vad.wait_for_user_speech()
        fired = time.monotonic()  # VAD fired: the barge-in clock starts here
        # Time spent waiting for VAD to fire, measured from t0.
        span.set_attribute("audio.vad_detect_ms", round((fired - t0) * 1000.0, 1))
        agent_audio.cancel_generation()
        first_byte()
        # cancel_ms and silence_ms are measured from the VAD firing, matching the metric above.
        span.set_attribute("audio.cancel_ms", round((time.monotonic() - fired) * 1000.0, 1))
        agent_audio.flush_downstream_buffers()
        span.set_attribute("audio.silence_ms", round((time.monotonic() - fired) * 1000.0, 1))

The number I keep on the wall for the model layer is honestly two numbers: p95 first token, and p95 barge-in silence. Both have to be good.

TTS: first audio chunk, and the gap between sentences

TTS is streaming too, so the attribute that matters is first byte, the first chunk of playable audio. We page when p95 first-byte on TTS goes above 350ms, because past that the pause between the user finishing and the agent starting gets long enough that testers describe it as the agent thinking too hard. There is a second TTS thing a single first-byte number misses: the gaps between chunks once audio is flowing. If TTS stalls mid-sentence the user hears a stutter, and average latency looks fine. So I record the largest inter-chunk gap on the TTS span.
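Tracking the largest inter-chunk gap takes very little code. Here is a minimal sketch, assuming you can hook a callback where each TTS chunk arrives. ChunkGapTracker and its injectable clock are hypothetical names for illustration, not part of any TTS SDK.

```python
import time

class ChunkGapTracker:
    """Tracks the largest gap between consecutive TTS audio chunks."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock      # injectable so the tracker can be tested
        self._last = None        # arrival time of the previous chunk
        self.max_gap_ms = 0.0

    def on_chunk(self):
        """Call once per audio chunk as it arrives from TTS."""
        now = self._clock()
        if self._last is not None:
            gap_ms = (now - self._last) * 1000.0
            self.max_gap_ms = max(self.max_gap_ms, gap_ms)
        self._last = now
```

At the end of the stage, max_gap_ms would go on the TTS span under whatever attribute name you choose, next to audio.first_byte_ms.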

I keep ASR, the model, and TTS all using the exact same audio.first_byte_ms attribute name on purpose, even though "first byte" means a slightly different physical thing for each. Same name means one query pulls first-byte across all three stages and I compare them on one screen.

The client: jitter is the number, and you cannot see it from the server

Everything above is server side. The client receives audio over a network you do not control and plays it. The enemy is jitter: frames arriving unevenly. From the server everything can look healthy while the user hears choppy audio. So the client emits its own span per turn, with the jitter it measured and the buffer depth it settled on, shipped to the same collector with the same audio.session_id. Now a glitchy call shows the jitter right next to the three server spans. The honest caveat: client clocks are not synced to your server, so treat client timestamps as approximate. I trust the client span for the jitter and buffer values it reports about itself, not for lining its clock up to the millisecond.
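Jitter can be computed from the client's own clock alone, which is why the clock-sync caveat does not hurt it. The sketch below is one simple definition, the mean absolute deviation of inter-arrival gaps from the nominal frame spacing. It is not necessarily the one our client uses, and the 20ms default frame interval is an assumption.

```python
def arrival_jitter_ms(arrival_times_s, frame_interval_ms=20.0):
    """Mean absolute deviation of inter-arrival gaps from the nominal frame interval.

    arrival_times_s: client-side monotonic timestamps (seconds), one per frame.
    frame_interval_ms: assumed nominal spacing between frames (hypothetical default).
    """
    gaps_ms = [
        (later - earlier) * 1000.0
        for earlier, later in zip(arrival_times_s, arrival_times_s[1:])
    ]
    if not gaps_ms:
        return 0.0  # fewer than two frames: nothing to compare
    return sum(abs(gap - frame_interval_ms) for gap in gaps_ms) / len(gaps_ms)
```

Because it only uses differences between timestamps taken on the same device, the result does not depend on the client clock agreeing with the server.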

This is the TraceQL I keep saved. It pulls p95 of first-byte latency, grouped by stage.

{ span.audio.stage != "" && span.audio.first_byte_ms >= 0 }
  | select(span.audio.stage, span.audio.first_byte_ms)
  | quantile_over_time(span.audio.first_byte_ms, 0.95) by (span.audio.stage)

The >= 0 filter is there because a stage that produced nothing gets first_byte_ms = -1, and I do not want those poisoning the percentile.

To go from aggregate to a single bad call I filter by session: { span.audio.session_id = "sess_8f21c0" }. That gives every span for that session in time order, which is the entire reason I put session_id on every span.

A word on percentiles, because it changes what you do: p50 first token might be 280ms and look fine, p99 might be 1,900ms, and in voice that p99 is a real human who had a two-second silence and probably said "hello? are you there?" into the void. Averages I mostly ignore.
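To see why the average hides the tail, here is a small sketch using Python's statistics module on invented data: 94 turns at 280ms and 6 turns at 1,900ms. The sample counts are made up and latency_summary is a hypothetical helper.

```python
import statistics

def latency_summary(samples_ms):
    """Return mean, p50, p95, and p99 for a list of latency samples in ms."""
    # n=100 yields 99 cut points; index 49 is p50, 94 is p95, 98 is p99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(samples_ms),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }

# Invented data: 94 healthy turns and 6 turns with a long silence.
samples = [280.0] * 94 + [1900.0] * 6
summary = latency_summary(samples)
print(summary)
```

On this data the mean lands well under 400ms and p50 stays at 280ms, while p95 and p99 both sit at 1,900ms, which is the silence a real caller heard.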

What I am still chewing on

How do you set the client playout buffer when you cannot see the user's network until the call is already happening? Is barge-in even the right model, when VAD fires on a cough, an "mm-hm", the user's dog? And the question under all of it: I can trace every layer now, but I still do not have a number for "this call felt natural" that does not eventually come down to a human listening to it. The tracing tells me where time went. It does not tell me whether the conversation was any good.

If you are instrumenting a voice agent and you only have time to add one span this week, add barge-in. It is the one nobody measures and the one users feel the fastest.