<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Marcus Chen</title>
    <description>The latest articles on DEV Community by Marcus Chen (@realmarcuschen).</description>
    <link>https://dev.to/realmarcuschen</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3940517%2F7b3654df-2cab-42a2-a56a-eae04985c9a4.png</url>
      <title>DEV Community: Marcus Chen</title>
      <link>https://dev.to/realmarcuschen</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/realmarcuschen"/>
    <language>en</language>
    <item>
      <title>Our voice agent passed every test and still woke me up at 3am</title>
      <dc:creator>Marcus Chen</dc:creator>
      <pubDate>Thu, 11 Jun 2026 10:35:24 +0000</pubDate>
      <link>https://dev.to/realmarcuschen/our-voice-agent-passed-every-test-and-still-woke-me-up-at-3am-37dc</link>
      <guid>https://dev.to/realmarcuschen/our-voice-agent-passed-every-test-and-still-woke-me-up-at-3am-37dc</guid>
      <description>&lt;h2&gt;
  
  
  Replaying real call transcripts as your test set is a trap. The failures come from the inputs a user produces exactly once.
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Our voice-agent regression suite was 312 recorded production calls, all passing. The page at 3am came from a caller who switched between English and Hindi mid-sentence, a pattern that appeared zero times in those 312 calls. Replaying real transcripts tests the confidence you already have. It does not test the inputs that actually break you. We moved to simulating adversarial callers, and below is what I learned trying five tools to generate and grade those simulated conversations (as of June 2026).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The test set was real, and that was the problem&lt;/strong&gt;&lt;br&gt;
For about four months our regression set was 312 recorded production calls. It felt rigorous. Real audio, real ASR output, real user intents, replayed on every deploy. Green for weeks.&lt;/p&gt;

&lt;p&gt;Then the 3am page. A caller switched between English and Hindi inside single sentences. Our ASR mis-segmented the mixed-language audio, the intent classifier saw garbage, and the agent fell into a clarification loop it could not exit. The caller hung up. The dashboards were fine the whole time.&lt;/p&gt;

&lt;p&gt;I went looking for that pattern in the 312 calls. It was not there. Not once. The people who code-switch like that had mostly churned months earlier, so the behavior was absent from the recordings precisely because it was a problem we never handled. A test set built from past traffic contains what already happened, weighted toward the common case. The failures that page you are rare by definition, and rare things are missing from a sample of the past.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why replay gives false confidence&lt;/strong&gt;&lt;br&gt;
Replaying recorded calls is a regression test for behavior you have already seen. That is useful. It catches the case where a deploy breaks something that used to work. What it cannot do is produce an input you have never received. For that you have to manufacture the input on purpose: the fast talker who never pauses, the caller who interrupts the agent two words in, the code-switcher, the person who changes their mind halfway through a sentence, the line with a TV on in the background. That is simulation, and it is a different activity from replay.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I tried to generate and grade simulated calls&lt;/strong&gt;&lt;br&gt;
Five tools plus one hosted layer, roughly a week each, same eight adversarial caller profiles. None of these is voice-specific; I drove them off transcripts plus a separate ASR/TTS layer. Honest notes, your mileage will differ:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Promptfoo:&lt;/strong&gt; fast to wire into CI and good for red-teaming a prompt with generated variants. The fiddly part was that conversation state across turns was a manual build.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangSmith:&lt;/strong&gt; dataset versioning and the trace view were the best of the set. The simulation half I had to assemble myself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Future AGI Simulate:&lt;/strong&gt; persona-based: you define caller personas and it runs them through the agent, which matched how I already thought about adversarial callers. As of June 2026 voice was not first-class, so I ran it on transcripts with ASR and TTS bolted on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Braintrust:&lt;/strong&gt; the nicest UI for eyeballing where a run diverged. Persona definitions lived outside it, in my code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DeepEval:&lt;/strong&gt; the most knobs for synthetic-conversation generation. Tuning the synthesizer to stop producing unrealistic turns took a while.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confident AI:&lt;/strong&gt; a reasonable hosted layer on top of DeepEval, though it is another account and key to manage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I am deliberately not crowning one. Braintrust had the UI I liked, DeepEval had the most generation control, and the persona abstraction in Future AGI's Simulate (part of their open work at github.com/future-agi) lined up with how I list out adversarial callers. Any of them can run a persona once you have written the persona.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The thing that actually moved the needle was not the tool&lt;/strong&gt;&lt;br&gt;
It was the persona list. Once we had written eight adversarial callers (the angry caller, the two-words-then-silence caller, the code-switcher, the background-noise line, and so on), every tool above could run them and grade the results. The leverage was in naming the failure modes, not in the framework that executed them. We spent two days arguing about the personas and twenty minutes wiring the runner.&lt;/p&gt;
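&lt;p&gt;To make "write the persona first" concrete, here is a minimal sketch of a persona list with a runner that is independent of any tool above. Everything in it is hypothetical: the Persona fields, the sample utterances, the toy_agent stand-in, and the "too many clarifications" pass criterion are illustrations, not our production harness and not any tool's API.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    """One adversarial caller: a name plus the scripted turns it sends."""
    name: str
    turns: tuple
    max_clarifications: int = 2  # hypothetical pass criterion


# Illustrative personas only; real ones would come out of postmortems.
PERSONAS = [
    Persona("code_switcher", ("mujhe order cancel karna hai, the one from Monday",)),
    Persona("two_words_then_silence", ("my order", "")),
    Persona("mind_changer", ("cancel it, actually no, keep it",)),
]


def toy_agent(utterance):
    """Stand-in agent: asks for clarification on empty input, else acknowledges."""
    return "Sorry, could you repeat that?" if not utterance.strip() else "Got it."


def run_persona(persona, agent):
    """Feed every scripted turn to the agent and count clarification replies."""
    replies = [agent(turn) for turn in persona.turns]
    clarifications = sum(1 for reply in replies if reply.startswith("Sorry"))
    return {
        "persona": persona.name,
        "clarifications": clarifications,
        # Pass when the agent stayed within the allowed number of clarifications.
        "passed": clarifications in range(persona.max_clarifications + 1),
    }


results = [run_persona(p, toy_agent) for p in PERSONAS]
for row in results:
    print(row)
```

&lt;p&gt;Swapping toy_agent for a call into the real agent is the twenty-minute part. The PERSONAS list is the part worth arguing about.&lt;/p&gt;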

&lt;p&gt;&lt;strong&gt;The open question I still have&lt;/strong&gt;&lt;br&gt;
The space of adversarial callers is infinite, and we maintain eight. We chose those eight from incident postmortems, which means we are still only simulating failures we have already been burned by at least once. The genuinely novel failure, the next 3am page, is still unguarded. I do not have a principled way to pick simulation personas before the incident teaches me the persona. If you have one, that is the comment I want to read.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FAQ&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Why not just add each failed call to the regression set after the incident?&lt;/strong&gt;&lt;br&gt;
We do. It is still reactive. The replay suite trails production by one outage, permanently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Doesn't simulated traffic drift away from what real users do?&lt;/strong&gt;&lt;br&gt;
Yes, and that is a real cost. We re-sample the real call distribution monthly and adjust how often each persona fires. Simulation supplements replay; it does not replace it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is any of this voice-specific?&lt;/strong&gt;&lt;br&gt;
Most of it applies to text agents too. Voice just adds two more failure surfaces: ASR segmentation and barge-in timing. The code-switching incident was really an ASR segmentation failure that a text agent would never have hit.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>voice</category>
      <category>testing</category>
      <category>llm</category>
    </item>
    <item>
      <title>The 4-layer voice-agent latency stack, traced with OTel spans</title>
      <dc:creator>Marcus Chen</dc:creator>
      <pubDate>Tue, 09 Jun 2026 09:28:48 +0000</pubDate>
      <link>https://dev.to/realmarcuschen/the-4-layer-voice-agent-latency-stack-traced-with-otel-spans-37hp</link>
      <guid>https://dev.to/realmarcuschen/the-4-layer-voice-agent-latency-stack-traced-with-otel-spans-37hp</guid>
      <description>

&lt;h2&gt;
  
  
  How I instrument ASR, LLM, TTS, and the client with OpenTelemetry, and which number in each layer I actually look at
&lt;/h2&gt;


&lt;p&gt;&lt;strong&gt;TL;DR.&lt;/strong&gt; A voice agent is four moving parts stuck together: speech to text, the model that writes the reply, text to speech, and the client that plays the audio back. End-to-end latency hides which of those four is slow on any given turn, so I stopped tracking it as one number and started tracing each stage as its own OTel span with a shared session id. The number I watch hardest is barge-in: when the user starts talking over the agent, how many milliseconds pass until the agent actually stops sending audio. In our setup we want that under 200ms, and when p95 barge-in creeps past that, the agent feels like it is talking at you instead of with you. Everything below is how I wire the spans, what attributes go on each one, and the p95 I page on per layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The thing I keep saying, and the thing that keeps being true:&lt;/strong&gt; voice agents fail in production not because of raw latency but because nobody simulated the audio and LLM pipeline together. You can have a fast ASR, a fast model, a fast TTS, and a voice agent that still feels broken, because the failure lives in the seams between them and in the parts (barge-in, jitter) that no single-stage benchmark touches. Tracing is how I get the seams to show up.&lt;/p&gt;

&lt;p&gt;A note before the layers. This is just the setup we run, the spans we emit, and the mistakes that made us add each attribute. Some of it is probably specific to our stack and will not transfer. I will flag that where I can.&lt;/p&gt;

&lt;h2&gt;
  
  
  The shape of a turn, and why one span is not enough
&lt;/h2&gt;

&lt;p&gt;One turn is: user says a thing, agent says a thing back. Underneath that is roughly: audio frames come in, ASR turns them into text (streaming partials as it goes); the text plus history goes to the LLM, which streams tokens back; as text comes out, TTS turns it into audio, also streaming; the client receives audio frames and plays them, with some buffering to smooth out jitter.&lt;/p&gt;

&lt;p&gt;If you wrap the whole turn in a single span and call it voice.turn, you get a duration and almost no ability to act on it. A 1,400ms turn could be a slow first token, or TTS waiting on the full sentence before it starts, or the client buffering too aggressively. Same total, three different fixes.&lt;/p&gt;

&lt;p&gt;So the parent span is voice.turn, and each stage is a child span. Every span carries the same audio.session_id and an audio.turn_id, so I can pull one turn out of Tempo and see all four stages laid out in time. The attribute I care about most on the streaming stages is not total duration. It is first byte: how long until the stage produced its first useful output. First byte is what the user feels, because all three stages are streaming and the user starts perceiving progress at the first byte, not the last.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;contextlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;contextmanager&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;

&lt;span class="n"&gt;tracer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_tracer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;voice.pipeline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@contextmanager&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stage_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;turn_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;stage&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio.stage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio.session_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio.turn_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;turn_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;started&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;monotonic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;mark_first_byte&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio.first_byte_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;monotonic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;started&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1000.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end_on_exit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;mark_first_byte&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;record_exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio.error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt;
    &lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio.total_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;monotonic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;started&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1000.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio.first_byte_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# produced nothing
&lt;/span&gt;        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



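&lt;p&gt;&lt;strong&gt;Calling it around the LLM stage:&lt;/strong&gt; the context manager yields its mark_first_byte callback, bound here as first_byte. You call first_byte() inside the streaming loop; only the first call, when the first token shows up, records anything, and the wrapper does the timing math.&lt;br&gt;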
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_llm_stage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;turn_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;llm_client&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;stage_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;turn_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;first_byte&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;llm_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="nf"&gt;first_byte&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;           &lt;span class="c1"&gt;# no-op after the first call
&lt;/span&gt;            &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I use time.monotonic() and not time.time() on purpose. Wall clock can jump (NTP corrections), and on a sub-second budget a backwards clock gives you negative latencies that poison the percentiles. One more thing I learned the annoying way: audio.session_id is high cardinality, so I keep it as a span attribute for trace lookup, but I do not turn it into a metric label. Stage goes on the metric label. Session id stays on the trace.&lt;/p&gt;

&lt;h2&gt;
  
  
  ASR: measure first partial, not final transcript
&lt;/h2&gt;

&lt;p&gt;The mistake I made first was timing ASR as audio-in to final-transcript-out. That number is real but it is not the one that matches what the user feels, because a streaming ASR gives you a partial transcript fast and then refines it. So the span gets two numbers: audio.first_byte_ms is time to first partial, and I stash time to final separately.&lt;/p&gt;

&lt;p&gt;The other ASR attribute that earned its place is whether the final transcript disagreed badly with the last partial. We had an incident where ASR turned a customer saying they wanted to confirm an order into the word cancel, and the agent acted on it. After that I started recording a rough measure of how much the final revised the partial, so big late revisions show up in traces instead of only in an angry support ticket. What I look at for ASR: p95 of time to first partial. In our setup that sits under 150ms most days, and when it drifts up it is almost always the audio frames not arriving on time from the client, not the ASR model. A nice example of why you trace the whole thing.&lt;/p&gt;
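&lt;p&gt;For the "how much did the final revise the partial" number, a minimal sketch using only the standard library is below. The revision_score name and the word-level diff are assumptions for illustration, not the exact measure we run; the point is only that the value is cheap to compute at the moment the final transcript arrives and can then be set as a span attribute.&lt;/p&gt;

```python
from difflib import SequenceMatcher


def revision_score(last_partial, final):
    """Return 0.0 when the final matches the last partial, rising toward 1.0 as they diverge."""
    partial_words = last_partial.lower().split()
    final_words = final.lower().split()
    similarity = SequenceMatcher(None, partial_words, final_words).ratio()
    return round(1.0 - similarity, 3)


# Hypothetical transcripts modelled on the confirm/cancel incident.
stable = revision_score("i want to confirm my order", "i want to confirm my order")
flipped = revision_score("i want to confirm my order", "i want to cancel my order")
print(stable, flipped)
```

&lt;p&gt;A caveat on this particular measure: a one-word flip in a six-word sentence scores low even though it inverts the intent. A word-level diff is a rough signal for large late revisions; if the words that carry intent matter most, they probably need their own check.&lt;/p&gt;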

&lt;h2&gt;
  
  
  The LLM: first token is the whole ballgame, and barge-in lives here too
&lt;/h2&gt;

&lt;p&gt;For the model stage, total generation time barely matters for the felt experience, because TTS consumes tokens as they arrive. What matters is time to first token. If the model takes 600ms before the first token, the user hears 600ms of silence after they stopped talking, and that feels like the agent froze. So the LLM span's headline attribute is time to first token.&lt;/p&gt;

&lt;p&gt;Barge-in is the part people forget to instrument, and the part I would instrument first if I were starting over. It is what happens when the user starts talking while the agent is still speaking. The metric: from the moment voice-activity detection fires, to the moment the agent's outbound audio actually goes quiet. The first time we measured it, it was around 500ms and felt terrible, and the breakdown showed most of the time was not detection. It was buffered TTS audio we had already shipped toward the client and could not un-send. We had buffered aggressively to fight jitter, and that same buffer made barge-in slow. Tracing let me see the two goals were fighting. We are at roughly 180ms p95 now.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_barge_in&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;turn_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vad&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent_audio&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;stage_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;barge_in&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;turn_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;first_byte&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_current_span&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;t0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;monotonic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;vad&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait_for_user_speech&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;t_vad&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;monotonic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# barge-in clock starts when VAD fires&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio.vad_detect_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t_vad&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;t0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1000.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;agent_audio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cancel_generation&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;first_byte&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio.cancel_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;monotonic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;t_vad&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1000.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;agent_audio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flush_downstream_buffers&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio.silence_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;monotonic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;t_vad&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1000.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the model layer I honestly keep two numbers on the wall: p95 first token, and p95 barge-in silence. Both have to be good.&lt;/p&gt;

&lt;h2&gt;
  
  
  TTS: first audio chunk, and the gap between sentences
&lt;/h2&gt;

&lt;p&gt;TTS is streaming too, so the attribute that matters is first byte, the first chunk of playable audio. We page when p95 first-byte on TTS goes above 350ms, because past that the pause between the user finishing and the agent starting gets long enough that testers describe it as the agent thinking too hard. There is a second TTS thing a single first-byte number misses: the gaps between chunks once audio is flowing. If TTS stalls mid-sentence the user hears a stutter, and average latency looks fine. So I record the largest inter-chunk gap on the TTS span.&lt;/p&gt;
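&lt;p&gt;The largest inter-chunk gap is a small calculation once you keep the arrival time of each chunk. A minimal sketch, assuming you collect time.monotonic() readings as chunks arrive; the largest_gap_ms name and the sample timestamps are made up for illustration.&lt;/p&gt;

```python
def largest_gap_ms(chunk_times_s):
    """Largest gap, in ms, between consecutive chunk arrival times given in seconds."""
    gaps = [later - earlier for earlier, later in zip(chunk_times_s, chunk_times_s[1:])]
    # default=0.0 covers a stage that produced zero or one chunk.
    return round(max(gaps, default=0.0) * 1000.0, 1)


# Hypothetical arrival times for five TTS chunks; the stall sits before the fourth.
arrivals = [0.00, 0.04, 0.08, 0.50, 0.54]
print(largest_gap_ms(arrivals))
```

&lt;p&gt;In this made-up run the first chunk arrives immediately and the average spacing looks tolerable, while the one stall is the thing the caller would hear as a stutter.&lt;/p&gt;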

&lt;p&gt;I keep ASR, the model, and TTS all using the exact same audio.first_byte_ms attribute name on purpose, even though "first byte" means a slightly different physical thing for each. Same name means one query pulls first-byte across all three stages and I compare them on one screen.&lt;/p&gt;

&lt;h2&gt;
  
  
  The client: jitter is the number, and you cannot see it from the server
&lt;/h2&gt;

&lt;p&gt;Everything above is server side. The client receives audio over a network you do not control and plays it. The enemy is jitter: frames arriving unevenly. From the server everything can look healthy while the user hears choppy audio. So the client emits its own span per turn, with the jitter it measured and the buffer depth it settled on, shipped to the same collector with the same audio.session_id. Now a glitchy call shows the jitter right next to the three server spans. The honest caveat: client clocks are not synced to your server, so treat client timestamps as approximate. I trust the client span for the jitter and buffer values it reports about itself, not for lining its clock up to the millisecond.&lt;/p&gt;
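&lt;p&gt;"Jitter" has more than one definition, so here is one simple version as a sketch: the mean absolute deviation of inter-arrival gaps from the nominal frame interval. The 20ms frame interval, the jitter_ms name, and the sample arrival times are assumptions for illustration; your client may well use a different estimator.&lt;/p&gt;

```python
def jitter_ms(arrival_times_ms, frame_interval_ms=20.0):
    """Mean absolute deviation of inter-arrival gaps from the nominal frame interval."""
    gaps = [b - a for a, b in zip(arrival_times_ms, arrival_times_ms[1:])]
    if not gaps:
        return 0.0
    return round(sum(abs(gap - frame_interval_ms) for gap in gaps) / len(gaps), 2)


# Two hypothetical calls: same five frames, same 80 ms total, different spacing.
steady = [0, 20, 40, 60, 80]
choppy = [0, 35, 40, 75, 80]
print(jitter_ms(steady), jitter_ms(choppy))
```

&lt;p&gt;Both calls deliver the same frames in the same total time, which is why this only shows up when the client measures arrivals itself. The value depends only on differences between the client's own timestamps, so it does not need the client clock to agree with the server.&lt;/p&gt;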

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhxynoyobrt4c7hi8n8ho.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhxynoyobrt4c7hi8n8ho.png" alt=" " width="776" height="234"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the TraceQL I keep saved. It pulls p95 of first-byte latency, grouped by stage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{ span.audio.stage != "" &amp;amp;&amp;amp; span.audio.first_byte_ms &amp;gt;= 0 }
  | select(span.audio.stage, span.audio.first_byte_ms)
  | quantile_over_time(span.audio.first_byte_ms, 0.95) by (span.audio.stage)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &amp;gt;= 0 filter is there because a stage that produced nothing gets first_byte_ms = -1, and I do not want those poisoning the percentile. To go from aggregate to a single bad call I filter by session: { span.audio.session_id = "sess_8f21c0" }. That gives every span for that session in time order, which is the entire reason I put session_id on every span.&lt;/p&gt;

&lt;p&gt;A word on percentiles, because it changes what you do: p50 first token might be 280ms and look fine, p99 might be 1,900ms, and in voice that p99 is a real human who had a two-second silence and probably said "hello? are you there?" into the void. Averages I mostly ignore.&lt;/p&gt;
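
&lt;p&gt;To make the percentile point concrete, here is a small Python sketch with made-up samples chosen to match the 280ms and 1,900ms figures above. It uses a nearest-rank percentile, which is an assumption for illustration and not necessarily how Tempo computes quantiles. The helper names are hypothetical.&lt;/p&gt;

```python
import math


def drop_sentinels(samples_ms):
    """Remove the -1 values emitted by stages that produced nothing."""
    return [ms for ms in samples_ms if ms != -1]


def percentile(values, pct):
    """Nearest-rank percentile: the value at rank ceil(pct percent of n)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up data: 98 fast turns, 2 slow ones, 3 turns that produced nothing.
samples = [280] * 98 + [1900] * 2 + [-1] * 3
valid = drop_sentinels(samples)

p50 = percentile(valid, 50)
p99 = percentile(valid, 99)
mean = sum(valid) / len(valid)
print(p50, p99, mean)
```

&lt;p&gt;Here p50 is 280ms, p99 is 1,900ms, and the mean is 312.4ms. The mean barely moves, and it hides the two callers who sat through almost two seconds of silence.&lt;/p&gt;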

&lt;h2&gt;
  
  
  What I am still chewing on
&lt;/h2&gt;

&lt;p&gt;How do you set the client playout buffer when you cannot see the user's network until the call is already happening? Is barge-in even the right model, when VAD fires on a cough, an "mm-hm", the user's dog? And the question under all of it: I can trace every layer now, but I still do not have a number for "this call felt natural" that does not eventually come down to a human listening to it. The tracing tells me where time went. It does not tell me whether the conversation was any good.&lt;/p&gt;

&lt;p&gt;If you are instrumenting a voice agent and you only have time to add one span this week, add barge-in. It is the one nobody measures and the one users feel the fastest.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>observability</category>
      <category>voice</category>
      <category>rust</category>
    </item>
    <item>
      <title>3 OTel span attributes I tag on every voice-pipeline span</title>
      <dc:creator>Marcus Chen</dc:creator>
      <pubDate>Tue, 02 Jun 2026 17:50:16 +0000</pubDate>
      <link>https://dev.to/realmarcuschen/3-otel-span-attributes-i-tag-on-every-voice-pipeline-span-4pg7</link>
      <guid>https://dev.to/realmarcuschen/3-otel-span-attributes-i-tag-on-every-voice-pipeline-span-4pg7</guid>
      <description>&lt;p&gt;Voice pipelines have 4 stages that need separate latency stories: ASR (speech to text), LLM (the response prompt), TTS (text to speech), and client (jitter on the receiving end). When we wired OTel across all 4, the spans without consistent attributes were useless for queries. 3 attributes ended up on every span and earn their keep.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;audio.stage&lt;/strong&gt;: An enum with four values: asr, llm, tts, client. The single most-queried attribute. The Grafana query for p95 latency by stage is one filter. Without this, you are scrolling raw traces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;audio.session_id&lt;/strong&gt;: The full conversation. Lets you query "what did the user actually experience" end-to-end. We use a UUID generated at session start and propagated to every downstream call. Tempo's &lt;code&gt;traces by tag&lt;/code&gt; lookup is fast on this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;audio.first_byte_ms&lt;/strong&gt;: The time from request start to the first audio byte returned, recorded on the streaming ASR and TTS stages. This is what catches barge-in latency regressions before the dashboard's aggregate alert does. We page when p95 first_byte_ms goes above 350ms on TTS.&lt;/p&gt;
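
&lt;p&gt;A sketch of keeping these three attributes consistent by building them in one place. It is written in Python for brevity. The helper name and its validation rule are assumptions, not our production code; only the three attribute names and the four stage values come from the list above.&lt;/p&gt;

```python
STAGES = ("asr", "llm", "tts", "client")


def voice_span_attributes(stage, session_id, first_byte_ms=None):
    """Build the attribute dict that every voice-pipeline span carries."""
    if stage not in STAGES:
        # Reject typos early so queries grouped by stage stay clean.
        raise ValueError(f"unknown audio.stage: {stage!r}")
    attrs = {"audio.stage": stage, "audio.session_id": session_id}
    if first_byte_ms is not None:
        attrs["audio.first_byte_ms"] = first_byte_ms
    return attrs


attrs = voice_span_attributes("tts", "sess_example", first_byte_ms=212)
print(attrs)
```

&lt;p&gt;A dict like this can be handed to whatever creates the span. In the OpenTelemetry Python API, for example, span creation accepts an attributes argument. Funnelling every stage through one helper is what keeps a misspelled stage name from silently dropping out of the by-stage query.&lt;/p&gt;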

&lt;p&gt;Honorable mention attributes that didn't survive the first cleanup: &lt;code&gt;audio.codec&lt;/code&gt; (covered by the service info), &lt;code&gt;audio.session_turn_index&lt;/code&gt; (covered by parent-span linkage), &lt;code&gt;audio.user_id&lt;/code&gt; (privacy concerns at scale; left out).&lt;/p&gt;

&lt;p&gt;If you are starting voice-pipeline observability: tag stage + session_id on every span from day one. The first_byte_ms is the one you will add after the first production incident; you might as well add it on day two.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>observability</category>
      <category>voice</category>
      <category>rust</category>
    </item>
    <item>
      <title>Voice agent latency is a lie. The number you care about is barge-in interrupt rate.</title>
      <dc:creator>Marcus Chen</dc:creator>
      <pubDate>Tue, 26 May 2026 15:58:32 +0000</pubDate>
      <link>https://dev.to/realmarcuschen/voice-agent-latency-is-a-lie-the-number-you-care-about-is-barge-in-interrupt-rate-38fc</link>
      <guid>https://dev.to/realmarcuschen/voice-agent-latency-is-a-lie-the-number-you-care-about-is-barge-in-interrupt-rate-38fc</guid>
      <description>&lt;p&gt;Last quarter we shipped our voice agent into production. The p99 end-to-end latency was 280 milliseconds. Our largest competitor's was 450 milliseconds. On every dashboard, we were faster.&lt;/p&gt;

&lt;p&gt;Our user research panel said our agent felt slower.&lt;/p&gt;

&lt;p&gt;The "felt slower" gap was 8 percentage points on a 5-point Likert. Statistically significant. We had been measuring the wrong thing.&lt;/p&gt;

&lt;p&gt;It took us two weeks to figure out what the panel was actually measuring, and four weeks after that to fix the right number. The wrong number was end-to-end latency. The right number was barge-in interrupt rate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the dashboard lied
&lt;/h2&gt;

&lt;p&gt;Voice agent benchmarks measure response time. ASR converts speech to text, the LLM produces a response, TTS turns it into audio, you ship it. The end-to-end clock is what gets reported.&lt;/p&gt;

&lt;p&gt;That clock is not what users experience as "speed."&lt;/p&gt;

&lt;p&gt;What users experience is the loop between starting to interrupt the agent and the agent shutting up. If they say "wait" mid-sentence and the agent finishes the sentence first, that is a one-to-two-second pause from the user's perspective.&lt;/p&gt;

&lt;p&gt;That gap, the barge-in delay, was 380 milliseconds for us. Our competitor's was 60 milliseconds. Users felt that gap on every interruption.&lt;/p&gt;

&lt;h2&gt;
  
  
  How we measured barge-in interrupt rate
&lt;/h2&gt;

&lt;p&gt;The metric: of attempts where the user starts speaking during agent speech, what percentage result in the agent yielding within X milliseconds?&lt;/p&gt;

&lt;p&gt;Two methods.&lt;/p&gt;

&lt;p&gt;Synthetic. A corpus of 500 recorded interruption attempts pulled from prior support calls. We fed each audio segment into a copy of the agent and measured time-from-first-syllable to agent-stops-speaking.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
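&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# barge_in_eval.py (simplified)
import time

def measure_barge_in(agent, recording):
    """Return barge-in latency in ms for one recorded interruption."""
    start = time.monotonic_ns()
    agent.play(recording.agent_response_audio)  # assumed non-blocking
    interrupt_t = start + recording.interrupt_offset_ns
    # Harness helpers, not shown: inject the user audio at interrupt_t, then
    # block until the agent goes silent and return that monotonic_ns timestamp.
    play_user_audio(recording.user_audio, at=interrupt_t)
    stop_t = wait_for_agent_silence()
    return (stop_t - interrupt_t) / 1_000_000  # ns to ms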
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Real. We instrumented the production audio pipeline to emit one span when VAD (Voice Activity Detection) fires and another when TTS playback is interrupted. Both go to OTel. Subtracting the timestamps gives the per-call barge-in latency.&lt;/p&gt;

&lt;p&gt;Our barge-in interrupt rate at the 100ms threshold was 41%. At 250ms it was 89%, but 250ms is too slow to feel responsive.&lt;/p&gt;
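
&lt;p&gt;The metric itself is a few lines. A minimal Python sketch, assuming you already have the two timestamps per interruption in nanoseconds. The function names and the sample latencies are made up for illustration.&lt;/p&gt;

```python
from bisect import bisect_right


def barge_in_latency_ms(vad_fired_ns, tts_stopped_ns):
    """Per-interruption latency: TTS stop time minus VAD fire time, in ms."""
    return (tts_stopped_ns - vad_fired_ns) / 1_000_000


def interrupt_rate(latencies_ms, threshold_ms):
    """Percentage of interruption attempts where the agent yielded in time."""
    if not latencies_ms:
        return 0.0
    ordered = sorted(latencies_ms)
    # bisect_right returns how many values are at or below the threshold.
    within = bisect_right(ordered, threshold_ms)
    return 100.0 * within / len(ordered)


# Ten made-up interruption latencies, in ms.
latencies = [60, 90, 100, 180, 240, 380, 95, 210, 130, 70]
rate_100 = interrupt_rate(latencies, 100)
rate_250 = interrupt_rate(latencies, 250)
print(rate_100, rate_250)
```

&lt;p&gt;On this toy data the rate is 50% at 100ms and 90% at 250ms. Reporting the rate at more than one threshold matters, because a single threshold hides how far the misses are from the target.&lt;/p&gt;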

&lt;h2&gt;
  
  
  The three things we changed
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Pin the audio buffer pages
&lt;/h3&gt;

&lt;p&gt;Our agent ran in a long-lived Tokio runtime. The audio buffers were allocated on the heap and occasionally got paged to swap when the model weights were active.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;libc&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;mlock&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;c_void&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;unsafe&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;pin_buffer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;u8&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;io&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;ret&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;mlock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="nf"&gt;.as_ptr&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="nb"&gt;c_void&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="nf"&gt;.len&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ret&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;io&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;last_os_error&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After this, VAD detected user speech within 25ms of first syllable.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. VAD threshold tuning
&lt;/h3&gt;

&lt;p&gt;We A/B tested VAD thresholds from 0.4 to 0.65 on the synthetic corpus. 0.5 was best: 4% earlier detection than 0.6, with only a 1.2% increase in false positives.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. TTS interrupt path
&lt;/h3&gt;

&lt;p&gt;The killer. Our TTS streamed audio in 200ms chunks. When VAD fired, the audio queue still held 400ms of buffered audio that played to completion. Users heard the agent finish a fragment of a sentence before silence.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;handle_barge_in&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="py"&gt;.llm_handle&lt;/span&gt;&lt;span class="nf"&gt;.cancel&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="py"&gt;.tts_queue&lt;/span&gt;&lt;span class="nf"&gt;.clear&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="py"&gt;.audio_out&lt;/span&gt;&lt;span class="nf"&gt;.stop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We dropped chunk size to 30ms and flushed the queue immediately on VAD fire.&lt;/p&gt;
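
&lt;p&gt;The arithmetic behind that change, as a rough Python sketch. It assumes a queue depth of two chunks (inferred from 400ms of buffered audio at 200ms per chunk) and assumes a chunk already handed to the output device plays to its end. If stopping the output cuts a chunk off mid-play, the real number is lower. The function name is hypothetical.&lt;/p&gt;

```python
def audio_after_barge_in_ms(chunk_ms, queued_chunks, flush_on_vad):
    """Upper bound on agent audio still heard after VAD fires, in ms."""
    if flush_on_vad:
        # The queue is dropped, so at most the chunk in flight plays out.
        return chunk_ms
    # Without a flush, everything already queued plays to completion.
    return chunk_ms * queued_chunks


before = audio_after_barge_in_ms(200, 2, flush_on_vad=False)
after = audio_after_barge_in_ms(30, 2, flush_on_vad=True)
print(before, after)
```

&lt;p&gt;Under those assumptions the worst case goes from 400ms of trailing audio to 30ms. Both changes matter: flushing alone would still leave up to 200ms, and smaller chunks alone would still leave whatever was queued.&lt;/p&gt;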

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;Four weeks of work. Barge-in interrupt rate at 100ms threshold moved from 41% to 89%. The "felt slower" gap closed within one user research cycle.&lt;/p&gt;

&lt;p&gt;Our actual p99 latency went up slightly (280ms to 305ms) because of the smaller TTS chunks. The dashboard number got worse. The user-felt number got dramatically better.&lt;/p&gt;

&lt;h2&gt;
  
  
  The number that mattered
&lt;/h2&gt;

&lt;p&gt;Voice agent latency is the dashboard number. Barge-in interrupt rate is the user number.&lt;/p&gt;

&lt;p&gt;Most voice agent teams I have talked to do not measure barge-in interrupt rate. They measure end-to-end latency, they get a number that feels low, they ship. Then their users say "your agent sucks" and the team cannot reconcile what the dashboard says with what the user says.&lt;/p&gt;

&lt;p&gt;The reconciliation is the metric you are not tracking.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I am still tuning
&lt;/h2&gt;

&lt;p&gt;Eight months in, I have stopped trusting the dashboard more than the user research panel. The dashboard wants to be right. The panel just is.&lt;/p&gt;

&lt;p&gt;The barge-in threshold itself is the part I am least sure about. 100ms is our target. 60ms is our competitor's. Whether 60ms gives a meaningful UX delta over 100ms for the users we serve, I genuinely cannot tell yet.&lt;/p&gt;

&lt;p&gt;Distinguishing intentional from filler interrupts is the next obvious area. Yielding on "wait" is correct. Yielding on "mhm" is wrong. We currently treat both the same.&lt;/p&gt;

&lt;p&gt;And the felt-slower measurement is the one I am most aware of being weak on. Our 5-point Likert is the best we have, and it is not great. If anyone is running rigorous voice agent UX studies, the methodology would be more useful to me than the dashboard ever will.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rust</category>
      <category>performance</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
