<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nick Lackman</title>
    <description>The latest articles on DEV Community by Nick Lackman (@nick_lackman).</description>
    <link>https://dev.to/nick_lackman</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3815621%2F442b019c-4264-463a-ad97-f776220e3dbd.png</url>
      <title>DEV Community: Nick Lackman</title>
      <link>https://dev.to/nick_lackman</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nick_lackman"/>
    <language>en</language>
    <item>
      <title>Python Is Lying to You: Async Pitfalls in Real-Time Audio Pipelines</title>
      <dc:creator>Nick Lackman</dc:creator>
      <pubDate>Fri, 12 Jun 2026 04:14:07 +0000</pubDate>
      <link>https://dev.to/nick_lackman/python-is-lying-to-you-async-pitfalls-in-real-time-audio-pipelines-40lk</link>
      <guid>https://dev.to/nick_lackman/python-is-lying-to-you-async-pitfalls-in-real-time-audio-pipelines-40lk</guid>
      <description>&lt;p&gt;If you're building voice AI with async Python, your audio quality is probably worse than it needs to be — and the bottleneck isn't where you think it is.&lt;/p&gt;

&lt;p&gt;I've spent the last two years building production voice AI systems — real-time pipelines handling thousands of calls per day over PSTN telephony. The stack is what you'd expect: STT, LLM reasoning, TTS, Speech-to-Speech, guardrails, tool calls, all wired together with async Python and streaming over websockets.&lt;/p&gt;

&lt;p&gt;Along the way I've hunted down a specific category of bug that I think is more common than people realize: code that works correctly, passes every test, and makes the AI sound terrible. Not wrong answers — terrible &lt;em&gt;audio&lt;/em&gt;. Stutters, gaps, unnatural pauses. The kind of artifacts that make callers hang up even though the AI gave the right answer.&lt;/p&gt;

&lt;p&gt;These bugs often trace back to the same root cause: Python's async model doesn't actually guarantee what most developers think it guarantees. And in a domain where timing matters at the millisecond level, those gaps in the guarantee become audible.&lt;/p&gt;

&lt;p&gt;Here's what I found, what caused it, and how I fixed it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Promise
&lt;/h2&gt;

&lt;p&gt;Here's the setup: you've built a voice AI pipeline. STT (or an S2S model) feeds into an LLM, the LLM reasons and calls tools, TTS (or the S2S model itself) turns the response into speech, and guardrails (PII detection, content filtering, maybe sentiment analysis) run alongside everything because responsible AI isn't optional. It's all &lt;code&gt;async def&lt;/code&gt;. It's all &lt;code&gt;await&lt;/code&gt;ed properly. The demo sounds great.&lt;/p&gt;

&lt;p&gt;Python's async model feels like it was built for this. Non-blocking I/O, coroutines, event-driven architecture. You've got ML inference, LLM calls, NLP guardrails, and audio streaming all sharing the same event loop, and it maps cleanly onto a streaming voice pipeline.&lt;/p&gt;

&lt;p&gt;Then you go to production with real concurrent call volume and the audio quality falls apart. Not immediately, and not on every call — but enough that it's a problem.&lt;/p&gt;

&lt;p&gt;What follows are the specific issues I've tracked down, the code that caused them, and the fixes that brought audio quality back to where it needed to be.&lt;/p&gt;




&lt;h2&gt;
  
  
  "But I Thought It Was Async"
&lt;/h2&gt;

&lt;p&gt;This one cost me real debugging time and it's the most important concept in this entire post: &lt;code&gt;async def&lt;/code&gt; does not mean non-blocking. It's a contract that Python does not enforce.&lt;/p&gt;

&lt;h3&gt;
  
  
  The PII Guardrail That Killed Audio Quality
&lt;/h3&gt;

&lt;p&gt;We have an inline guardrail that runs NLP analysis for PII detection on every transcript chunk. It's a compliance requirement — you can't skip it. The method is &lt;code&gt;async def&lt;/code&gt;. It's being &lt;code&gt;await&lt;/code&gt;ed correctly. Everything looks right.&lt;/p&gt;

&lt;p&gt;But the NLP inference inside that method is CPU-bound. It's doing regex matching and model inference on every chunk of transcribed text. It never yields back to the event loop because there's nothing to yield &lt;em&gt;to&lt;/em&gt; — it's doing computation, not I/O. Meanwhile, the audio frames that need to ship on a 20ms cadence are sitting in a queue, waiting for the event loop to come back around to them.&lt;/p&gt;

&lt;p&gt;The guardrail runs, returns &lt;code&gt;False&lt;/code&gt; (no PII detected, everything's fine), and in the 80-200ms it took to reach that conclusion, the caller heard dead air where smooth audio should have been.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# BEFORE: Looks async. Isn't.
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_pii&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transcript_chunk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# All CPU-bound work. No awaits, no yields.
&lt;/span&gt;    &lt;span class="c1"&gt;# The event loop is frozen while this executes.
&lt;/span&gt;    &lt;span class="n"&gt;patterns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_compile_patterns&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;matches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_scan_patterns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transcript_chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;patterns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

    &lt;span class="c1"&gt;# NLP model inference — the expensive part
&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nlp_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transcript_chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;contains_pii&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The mental model most Python developers carry is: "I used &lt;code&gt;await&lt;/code&gt;, so it's non-blocking." But &lt;code&gt;await&lt;/code&gt; only yields control if the called coroutine actually &lt;em&gt;suspends&lt;/em&gt; — meaning it hits a real I/O wait or an explicit yield point. An &lt;code&gt;async def&lt;/code&gt; that does CPU work without any suspension points is functionally synchronous. Python won't stop you from writing it. Python won't warn you. The only thing that tells you something is wrong is the audio quality.&lt;/p&gt;
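&lt;p&gt;A minimal, self-contained sketch of that effect. The names &lt;code&gt;fake_guardrail&lt;/code&gt;, &lt;code&gt;ticker&lt;/code&gt;, and &lt;code&gt;worst_overshoot&lt;/code&gt; are made up for illustration, and the busy loop is only a stand-in for real CPU-bound work. A ticker tries to wake every 20ms while an awaited coroutine spins without ever suspending:&lt;/p&gt;

```python
import asyncio
import time


async def fake_guardrail(busy_seconds: float) -> bool:
    # Hypothetical stand-in for CPU-bound work: spins without ever awaiting.
    deadline = time.perf_counter() + busy_seconds
    while deadline > time.perf_counter():
        pass
    return False


async def ticker(interval: float, ticks: int) -> float:
    # Tries to wake every `interval` seconds and records the worst overshoot.
    worst = 0.0
    for _ in range(ticks):
        start = time.perf_counter()
        await asyncio.sleep(interval)
        worst = max(worst, time.perf_counter() - start - interval)
    return worst


async def worst_overshoot(busy_seconds: float) -> float:
    tick_task = asyncio.create_task(ticker(0.02, 10))
    await asyncio.sleep(0.05)  # let a few ticks run cleanly first
    await fake_guardrail(busy_seconds)  # awaited, but it never suspends
    return await tick_task


if __name__ == "__main__":
    overshoot = asyncio.run(worst_overshoot(0.1))
    print(f"worst tick overshoot: {overshoot * 1000:.0f} ms")
```

&lt;p&gt;The ticker is the audio sender in miniature: every tick that lands while the "guardrail" is spinning arrives late by roughly the remaining spin time, even though everything was awaited correctly.&lt;/p&gt;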

&lt;p&gt;The important constraint here: you can't remove the guardrail. It's compliance. You can't make the NLP inference faster — it takes what it takes. The fix is getting it off the audio hot path entirely:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
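&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;

&lt;span class="c1"&gt;# AFTER: Actually non-blocking. CPU work moves to a thread (asyncio.to_thread needs Python 3.9+).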
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_pii&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transcript_chunk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_thread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_detect_pii_sync&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transcript_chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_detect_pii_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Runs in a thread pool, not on the event loop.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;patterns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_compile_patterns&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;matches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_scan_patterns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;patterns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nlp_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;contains_pii&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;asyncio.to_thread()&lt;/code&gt; moves the CPU-bound work to a thread and yields control back to the event loop immediately. The PII check still runs. Compliance is still met. But the event loop is free to keep shipping audio frames while the NLP model does its work in the background.&lt;/p&gt;

&lt;p&gt;The call signature didn't change, so none of the calling code had to. The caller stopped hearing gaps.&lt;/p&gt;

&lt;p&gt;What makes this worth writing about is the irony: the guardrail protecting the user experience was the thing degrading it. In a typical web application, 80-200ms of CPU work means a slightly slower HTTP response. Nobody notices. In a streaming audio pipeline, it means the event loop can't service the coroutines responsible for sending audio frames on time, and the caller hears it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F19egnk4nc97rqxr1ro06.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F19egnk4nc97rqxr1ro06.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Your Logger Is in the Hot Path
&lt;/h2&gt;

&lt;p&gt;This one is subtle enough that I think most Python developers don't know about it.&lt;/p&gt;

&lt;p&gt;Python's standard &lt;code&gt;logging&lt;/code&gt; module installs a &lt;code&gt;StreamHandler&lt;/code&gt; when you call &lt;code&gt;logging.basicConfig()&lt;/code&gt; with no arguments. Here's what actually happens every time a &lt;code&gt;logger.info()&lt;/code&gt; call reaches that handler:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;Handler.handle()&lt;/code&gt; acquires the handler's threading lock before calling &lt;code&gt;emit()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;StreamHandler.emit()&lt;/code&gt; calls &lt;code&gt;self.stream.write()&lt;/code&gt; and then flushes the stream, both blocking I/O operations&lt;/li&gt;
&lt;li&gt;Even writing to stderr or stdout is blocking in CPython&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's a lock acquisition plus a synchronous write on every log call. In a web server, this is noise — your response already takes 50-200ms, and a few microseconds of lock contention doesn't register. On an audio hot path where frames need to ship every 20ms, you've added synchronous I/O to the critical timing loop.&lt;/p&gt;

&lt;p&gt;During development you add &lt;code&gt;logger.debug()&lt;/code&gt; throughout the audio path because you need visibility into what's happening. Reasonable. But every one of those calls is synchronous I/O that can introduce jitter. In a REST API, it doesn't matter. In a streaming audio pipeline, it does.&lt;/p&gt;

&lt;p&gt;Logging isn't the only thing hiding in plain sight either. On the audio hot path, nothing is safe to assume is free:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;json.dumps()&lt;/code&gt; / &lt;code&gt;json.loads()&lt;/code&gt;&lt;/strong&gt; on large payloads — CPU-bound, holds the GIL the entire time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DNS resolution inside aiohttp&lt;/strong&gt; — can block if the system resolver is slow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File I/O that looks async but isn't&lt;/strong&gt; — many "async" wrappers delegate to a thread pool, and if that pool is saturated, you wait&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python's string interpolation with file reads&lt;/strong&gt; — we hit this one loading AI tool definitions dynamically in our orchestration package. Trying to be clever about caching actually introduced synchronous file I/O on a path that needed to be fast&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The fix is two-fold:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First, audit your log levels.&lt;/strong&gt; Production doesn't need &lt;code&gt;DEBUG&lt;/code&gt; on the hot path. You'd be surprised how many &lt;code&gt;logger.debug()&lt;/code&gt; calls survive into production with a permissive log level config. Remove what you don't need, bump the rest to levels that won't fire in production. This is free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second, replace &lt;code&gt;StreamHandler&lt;/code&gt; with &lt;code&gt;QueueHandler&lt;/code&gt;:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;logging.handlers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;QueueHandler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;QueueListener&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;queue&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Queue&lt;/span&gt;

&lt;span class="n"&gt;log_queue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Queue&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Drops log records into the queue and returns immediately
&lt;/span&gt;&lt;span class="n"&gt;queue_handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;QueueHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_queue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Separate thread drains the queue and writes at its own pace
&lt;/span&gt;&lt;span class="n"&gt;stream_handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;StreamHandler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;listener&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;QueueListener&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_queue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stream_handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;listener&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;voice_pipeline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;queue_handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setLevel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INFO&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;QueueHandler&lt;/code&gt; drops the log record into an in-memory queue and returns immediately. A &lt;code&gt;QueueListener&lt;/code&gt; on a separate thread drains the queue at its own pace. The hot path never blocks on I/O. The log still gets written — just not synchronously.&lt;/p&gt;

&lt;p&gt;Neither of these fixes requires rearchitecting anything. Together they remove synchronous I/O from every frame of your audio pipeline.&lt;/p&gt;
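&lt;p&gt;One detail worth sketching: the listener thread should be stopped at shutdown so queued records are flushed, and the hot-path logger should not propagate to a root logger that still has a blocking handler attached. The helper name &lt;code&gt;start_queued_logging&lt;/code&gt; below is an assumption for illustration, not part of the &lt;code&gt;logging&lt;/code&gt; API:&lt;/p&gt;

```python
import logging
import queue
from logging.handlers import QueueHandler, QueueListener


def start_queued_logging(name: str, target: logging.Handler) -> QueueListener:
    # Hypothetical helper: routes one logger through an in-memory queue.
    log_queue = queue.Queue()  # unbounded, so enqueueing never waits for space
    listener = QueueListener(log_queue, target)
    listener.start()  # background thread that feeds `target`

    logger = logging.getLogger(name)
    logger.addHandler(QueueHandler(log_queue))
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records away from blocking root handlers
    return listener


if __name__ == "__main__":
    listener = start_queued_logging("voice_pipeline", logging.StreamHandler())
    try:
        logging.getLogger("voice_pipeline").info("frame sent")
    finally:
        listener.stop()  # drains the queue before the process exits
```

&lt;p&gt;Calling &lt;code&gt;listener.stop()&lt;/code&gt; once, in a &lt;code&gt;finally&lt;/code&gt; block or your framework's shutdown hook, is what keeps the last few records from being lost.&lt;/p&gt;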




&lt;h2&gt;
  
  
  The GIL Doesn't Care About Your Deadlines
&lt;/h2&gt;

&lt;p&gt;Even if you get your async discipline perfect — every CPU-bound operation offloaded to a thread, every logger swapped, every stdlib call audited — the GIL is still there.&lt;/p&gt;

&lt;p&gt;The Global Interpreter Lock means only one thread executes Python bytecode at a time. Most Python concurrency advice hand-waves past this because for web workloads it genuinely doesn't matter much. Threads spend most of their time waiting on I/O, the GIL is released during I/O waits, and everyone gets a turn.&lt;/p&gt;

&lt;p&gt;Real-time audio is different. You have CPU-bound work happening in threads (we just moved PII detection to a thread). You have the event loop on the main thread scheduling audio frame callbacks. The GIL means these take turns, not run in parallel. When the PII thread is holding the GIL for NLP inference, the event loop thread can't run. When the event loop can't run, audio frames don't ship.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswkfzvt7mvt5upv10rs2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswkfzvt7mvt5upv10rs2.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What this looks like in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Latency spikes that only appear under load — one concurrent call is fine, fifty and you see jitter&lt;/li&gt;
&lt;li&gt;Symptoms look identical to network jitter in your metrics, but it's scheduling contention inside your own process&lt;/li&gt;
&lt;li&gt;Doesn't reproduce locally because your dev machine isn't running fifty concurrent sessions&lt;/li&gt;
&lt;li&gt;The p50 and p99 look acceptable. The p99.9 is bad. And in voice, the p99.9 is what callers remember&lt;/li&gt;
&lt;/ul&gt;
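&lt;p&gt;One way to tell in-process scheduling contention apart from network jitter is to measure event loop lag directly. This is a rough sketch, not the instrumentation from my production system: &lt;code&gt;measure_loop_lag&lt;/code&gt;, &lt;code&gt;cpu_spin&lt;/code&gt;, and &lt;code&gt;lag_with_threaded_work&lt;/code&gt; are hypothetical names, and the busy loop stands in for pure-Python CPU work running in a thread.&lt;/p&gt;

```python
import asyncio
import time


async def measure_loop_lag(interval: float, samples: int) -> list[float]:
    # Sleep for a fixed interval and record how late each wake-up was.
    loop = asyncio.get_running_loop()
    lags = []
    for _ in range(samples):
        expected = loop.time() + interval
        await asyncio.sleep(interval)
        lags.append(max(0.0, loop.time() - expected))
    return lags


def cpu_spin(seconds: float) -> None:
    # Hypothetical pure-Python CPU work; it competes for the GIL while running.
    deadline = time.perf_counter() + seconds
    while deadline > time.perf_counter():
        pass


async def lag_with_threaded_work(spin_seconds: float) -> float:
    probe = asyncio.create_task(measure_loop_lag(0.01, 20))
    await asyncio.to_thread(cpu_spin, spin_seconds)  # off the loop, not off the GIL
    return max(await probe)


if __name__ == "__main__":
    worst = asyncio.run(lag_with_threaded_work(0.1))
    print(f"worst loop lag with threaded CPU work: {worst * 1000:.1f} ms")
```

&lt;p&gt;On a standard GIL build, the threaded spin can still show up as loop lag even though nothing blocks the loop directly. Exporting a probe like this as a metric per process makes the contention visible next to your network numbers.&lt;/p&gt;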

&lt;p&gt;The honest answer is that the GIL makes Python fundamentally limited for work where sub-millisecond scheduling guarantees matter. But "rewrite it in Rust" isn't practical for most teams, and the rest of the stack — orchestration, LLM integration, business logic — is genuinely well-served by Python. The practical approach is knowing the constraint exists and designing around it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Free-Threaded Python: The Light at the End of the Tunnel
&lt;/h3&gt;

&lt;p&gt;Python is finally making the GIL optional. PEP 703 introduced a free-threaded build, which shipped as experimental in Python 3.13, and as of Python 3.14 it's officially supported, though still opt-in rather than the default.&lt;/p&gt;

&lt;p&gt;In practice: you can build CPython with &lt;code&gt;--disable-gil&lt;/code&gt; and get true multi-threaded parallelism. The PII detection thread could run in parallel with the event loop thread instead of taking turns. Several items on the fix hierarchy below could potentially collapse.&lt;/p&gt;

&lt;p&gt;The caveats are real though. The ecosystem is still catching up — C extensions that assumed the GIL would protect shared state may not be thread-safe without it. Libraries that haven't been updated may re-enable the GIL automatically. And race conditions that the GIL was silently preventing will surface the moment you remove it.&lt;/p&gt;
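&lt;p&gt;Because a library can quietly turn the GIL back on, it's worth checking at startup what you are actually running. A small sketch, with &lt;code&gt;gil_status&lt;/code&gt; as a made-up helper name, using the &lt;code&gt;Py_GIL_DISABLED&lt;/code&gt; build variable and &lt;code&gt;sys._is_gil_enabled()&lt;/code&gt; (available on Python 3.13 and later):&lt;/p&gt;

```python
import sys
import sysconfig


def gil_status() -> str:
    # Py_GIL_DISABLED is set to 1 only in free-threaded builds of CPython.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() exists on 3.13+; older versions always have a GIL.
    check = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = check() if check is not None else True
    if not free_threaded_build:
        return "standard build: GIL always on"
    if gil_enabled:
        return "free-threaded build, but the GIL was re-enabled at runtime"
    return "free-threaded build: GIL off"


if __name__ == "__main__":
    print(gil_status())
```

&lt;p&gt;Logging this after all imports have run is the useful part, since that's when an extension module would have had the chance to re-enable the GIL.&lt;/p&gt;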

&lt;p&gt;For real-time audio, this is genuinely promising. But it's a migration measured in years, not a weekend upgrade. Your STT clients, TTS clients, LLM SDKs, and orchestration frameworks all need to support it before you can flip the switch in production. Worth tracking closely, and something I look forward to experimenting with.&lt;/p&gt;




&lt;h2&gt;
  
  
  "Just Use Threads" — The Trap
&lt;/h2&gt;

&lt;p&gt;At this point the instinct is obvious: &lt;code&gt;ThreadPoolExecutor&lt;/code&gt; is right there, move everything off the event loop.&lt;/p&gt;

&lt;p&gt;Sometimes that's the right call — &lt;code&gt;asyncio.to_thread()&lt;/code&gt; fixed the PII guardrail cleanly. But "just use threads" as a general strategy is a trap:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shared state becomes a problem.&lt;/strong&gt; Audio pipelines have state — playback buffers, conversation context, agent state, connection metadata. Threading means reasoning about what's shared. Race conditions in audio manifest as once-in-a-thousand glitches: a frame sent out of order, a buffer read during a write, piping audio into a different user's websocket, a state update that arrives late. Nearly impossible to reproduce in testing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thread safety overhead can reintroduce latency.&lt;/strong&gt; Locks fix shared state problems but introduce contention, which reintroduces the timing issues you were trying to solve.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The GIL means threads aren't truly parallel&lt;/strong&gt; for CPU-bound work anyway. You've added complexity without gaining real concurrency where it matters most.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Actually Works
&lt;/h3&gt;

&lt;p&gt;When I find something blocking the audio hot path, my first questions are about the product, not the code:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Do we need this on the hot path at all?"&lt;/strong&gt; This is the question that junior engineers skip. They go straight to "how do I make this faster" when the answer is often "don't do it here." Can this work happen after the audio frame ships? Before the call starts? Does it need to run on every chunk, or can it batch?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Where else can I put it?"&lt;/strong&gt; If it needs to happen during the call, can it move to a background task without breaking anything? Often the answer is yes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"If I have to move it, who do I talk to about feature parity?"&lt;/strong&gt; Sometimes moving work off the hot path changes how a feature behaves. That's a product conversation, not just an engineering one.&lt;/p&gt;

&lt;p&gt;Then the technical options, in order of preference:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Background tasks&lt;/strong&gt; — fire-and-forget with &lt;code&gt;asyncio.create_task()&lt;/code&gt; if you don't need the result immediately. Lowest cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Threads&lt;/strong&gt; — &lt;code&gt;asyncio.to_thread()&lt;/code&gt; for isolated CPU work like the PII guardrail. Keep the surface area small.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiprocessing&lt;/strong&gt; — escapes the GIL, but IPC overhead adds its own latency. Worth it for heavy, long-running work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Separate process&lt;/strong&gt; — full isolation. Hot path and heavy processing don't share a GIL or memory. Real architectural cost, but real isolation guarantees.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Event loop inversion&lt;/strong&gt; — give the audio hot path its own dedicated event loop. Nothing else runs on it, so nothing can starve it. This is the nuclear option.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The right choice depends on how close the work is to the audio stream. A guardrail that doesn't gate playback? Background task. Audio pacing logic that controls frame timing? Might need its own event loop.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Free Lunch: uvloop
&lt;/h2&gt;

&lt;p&gt;After all of the above, here's something you can do in five minutes that actually helps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/MagicStack/uvloop" rel="noopener noreferrer"&gt;uvloop&lt;/a&gt; is a drop-in replacement for asyncio's event loop, written in Cython on top of libuv — the same library that powers Node.js. It's faster at everything the event loop does: iterating, dispatching callbacks, resolving timers, handling I/O.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uvloop&lt;/span&gt;
&lt;span class="n"&gt;uvloop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;install&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# That's it.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a real-time audio pipeline where the event loop drives frame timing, faster iteration means tighter scheduling means smoother audio. With the default asyncio event loop, I was seeing event loop stalls ranging from 10 ms to 3 seconds, meaning the loop was stuck for that long, unable to run anything else. After switching to uvloop, stalls dropped to single-digit milliseconds. Still some congestion, but much better than waiting seconds for the event loop to run the next scheduled operation.&lt;/p&gt;

&lt;p&gt;What it doesn't fix: the GIL, blocking code, CPU-bound coroutines that starve the loop. Every problem from the previous sections still applies. uvloop makes a healthy event loop faster — it can't fix a broken one.&lt;/p&gt;

&lt;p&gt;But after spending hours tracking down blocking calls and refactoring thread strategies, a one-line change that measurably improves scheduling is a nice win.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;Python is the right language for building voice AI systems. For orchestration, LLM integration, business logic, and rapid prototyping, the ecosystem is unmatched, though TypeScript's AI ecosystem is becoming more robust by the day.&lt;/p&gt;

&lt;p&gt;But the audio hot path operates under timing constraints that Python's async model wasn't designed for. The issues I've described here aren't Python bugs — they're assumption gaps. Assumptions that &lt;code&gt;async def&lt;/code&gt; means non-blocking, that the stdlib is fast enough to be invisible, that the GIL only matters for batch processing, that threads give you parallelism.&lt;/p&gt;

&lt;p&gt;Every one of those assumptions is reasonable in the context of a web application. Every one of them will degrade your production audio quality in a voice AI pipeline.&lt;/p&gt;

&lt;p&gt;The fix isn't rewriting everything in Rust/C++ or abandoning Python. It's knowing where the constraints are, designing around them, keeping CPU-bound work off the path where milliseconds are audible, and architecting your system with these constraints in mind.&lt;/p&gt;

&lt;p&gt;More on that in the next blog post: The Audio Gateway.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is part of a series on building production voice AI systems. Previously: &lt;a href="https://dev.to/nick_lackman/dude-wheres-my-response-cutting-600ms-from-every-voice-ai-turn-with-local-vad-8j5"&gt;Dude, Where's My Response? Cutting 700ms from Every Voice AI Turn with Local VAD&lt;/a&gt; | &lt;a href="https://dev.to/nick_lackman/voice-ai-fast-and-dumb-or-slow-and-smart-why-not-both-8d5"&gt;Your Voice Agent Needs Two Brains: Building Multi-Thinker on OpenAI's Realtime API&lt;/a&gt; | &lt;a href="https://dev.to/nick_lackman/i-tested-our-websocket-audio-pipeline-with-webrtc-heres-why-i-switched-it-back-3g1j"&gt;I Tested Our WebSocket Audio Pipeline with WebRTC. Here's Why I Switched It Back.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>voiceai</category>
      <category>python</category>
      <category>realtimeai</category>
    </item>
    <item>
      <title>Voice AI: Fast and Dumb or Slow and Smart — Why Not Fast and Smart?</title>
      <dc:creator>Nick Lackman</dc:creator>
      <pubDate>Mon, 06 Apr 2026 23:26:29 +0000</pubDate>
      <link>https://dev.to/nick_lackman/voice-ai-fast-and-dumb-or-slow-and-smart-why-not-both-8d5</link>
      <guid>https://dev.to/nick_lackman/voice-ai-fast-and-dumb-or-slow-and-smart-why-not-both-8d5</guid>
      <description>&lt;p&gt;Many voice AI demos connect the browser directly to a real-time audio model API and let the server decide when you've stopped talking. That's a demo architecture with a built-in latency tax that quickly breaks down in production. Here's the production alternative: a backend-mediated, multi-thinker voice system with local voice activity detection that owns the entire audio pipeline end-to-end.&lt;/p&gt;

&lt;p&gt;I spent the last year and a half building production voice AI systems that handle thousands of calls per day. This post covers the architecture I wish someone had documented when I started: how to make your voice AI product fast and smart, what the Responder-Thinker pattern is, why single-thinker breaks, how to build multi-thinker with your backend in the middle, and why local VAD is the key to making it feel instant.&lt;/p&gt;

&lt;p&gt;The companion repo is fully functional — clone it, run it, talk to it (OpenAI API Key Required): &lt;a href="https://github.com/lackmannicholas/responder-thinker" rel="noopener noreferrer"&gt;github.com/lackmannicholas/responder-thinker&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Latency Budget You Can't Meet
&lt;/h2&gt;

&lt;p&gt;Before the Realtime API existed, voice AI meant chaining three models in series: speech-to-text, an LLM, then text-to-speech. The math doesn't work.&lt;/p&gt;

&lt;p&gt;STT endpointing and recognition eats 500-1000ms. The LLM's time-to-first-token adds another 500-1500ms. TTS synthesis takes 200-500ms. You're at &lt;strong&gt;1.2-3 seconds minimum&lt;/strong&gt; before the caller hears a single syllable — and conversational turn-taking breaks down around 800ms of silence.&lt;/p&gt;
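
&lt;p&gt;As a quick sanity check on those numbers, here's a small sketch that sums the stage ranges. The stage labels are names chosen for illustration; the ranges and the 800ms threshold are the ones quoted above.&lt;/p&gt;

```python
# Stage latency ranges in milliseconds, taken from the paragraph above.
STAGES_MS = {
    "stt": (500, 1000),
    "llm_first_token": (500, 1500),
    "tts": (200, 500),
}

# Rough point from the text where turn-taking starts to feel broken.
TURN_TAKING_LIMIT_MS = 800


def serial_budget(stages):
    """Return (best_case, worst_case) total latency for stages run in series."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


best_ms, worst_ms = serial_budget(STAGES_MS)
print(f"best case: {best_ms} ms, worst case: {worst_ms} ms")
print(f"over budget even in the best case: {best_ms > TURN_TAKING_LIMIT_MS}")
```

&lt;p&gt;Because the stages run in series, their latencies add. Even the best case lands well past the point where a pause starts to feel wrong.&lt;/p&gt;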

&lt;p&gt;In &lt;a href="https://dev.to/lackmannicholas/dude-wheres-my-response-cutting-700ms-from-every-voice-ai-turn-with-local-vad-41jn"&gt;my previous post&lt;/a&gt;, I showed that server-side voice activity detection alone adds 500ms+ of unnecessary overhead to every turn. But even after fixing that, the serial pipeline architecture is the bottleneck. You can't engineer your way to natural conversation speed with a pipeline. The architecture has to change.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Realtime API: Fast, But Not Smart Enough
&lt;/h2&gt;

&lt;p&gt;OpenAI's Realtime API collapses the STT → LLM → TTS pipeline into a single API call. Latency drops to sub-second. The conversation finally feels naturalish.&lt;/p&gt;

&lt;p&gt;But there's a tradeoff. The realtime model is conversational and fast, but compared to text-based models like GPT-5.4, it struggles with complex multi-step instructions, structured tool use, and domain-specific accuracy. It hallucinates more. Its instruction-following degrades as the system prompt grows.&lt;/p&gt;

&lt;p&gt;A voice agent that responds instantly but gives wrong information is worse than one that takes two seconds and gets it right. &lt;strong&gt;The Realtime API solved the latency problem and created an intelligence problem.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Enter Responder-Thinker
&lt;/h2&gt;

&lt;p&gt;The Responder-Thinker pattern resolves this by splitting responsibilities:&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Responder&lt;/strong&gt; (Realtime API) is "always on". It handles conversational flow — greetings, acknowledgments, stalling, turn-taking. It's fast and socially intelligent. When the user asks something that needs real data or complex reasoning, the Responder classifies the intent and hands off to a Thinker.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Thinker&lt;/strong&gt; (text-based model) runs in the background. It has a focused system prompt, domain-specific tools, and the reasoning capability to get the answer right. When it's done, the result is injected back into the Realtime API conversation, and the Responder delivers it naturally.&lt;/p&gt;

&lt;p&gt;The insight: &lt;strong&gt;you don't need your real-time voice to be smart. You need it to be &lt;em&gt;present&lt;/em&gt; while the smart thing works in the background.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This pattern comes from OpenAI — their &lt;a href="https://github.com/openai/openai-realtime-agents" rel="noopener noreferrer"&gt;openai-realtime-agents&lt;/a&gt; repo calls it "Chat-Supervisor." The concept isn't new. Making it production-grade is the hard part.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Single-Thinker Breaks
&lt;/h2&gt;

&lt;p&gt;The simplest implementation has one generalist Thinker handling everything — weather, stocks, news, FAQ, escalation. In my experience, this breaks fast.&lt;/p&gt;

&lt;p&gt;The system prompt grows to accommodate every domain, and quality degrades across all of them. A weather lookup and a complex knowledge question go through the same agent with the same overhead. You can't tune one domain without risking regressions in the others. You can't use a cheaper model for simple lookups and a smarter model for hard reasoning; it's one model for everything. You have to scale model capability vertically to fit your most complex task, which leaves lighter tasks "over-provisioned" in terms of model usage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single-thinker is a monolith. Multi-thinker is microservices.&lt;/strong&gt; The voice AI industry is learning the same architectural lessons backend engineering learned fifteen years ago.&lt;/p&gt;

&lt;p&gt;In a multi-thinker architecture, each Thinker owns a domain with a focused prompt and its own tools. Weather uses &lt;code&gt;gpt-5.4-mini&lt;/code&gt; with a live weather API. News uses &lt;code&gt;gpt-5.4&lt;/code&gt; because summarization requires more reasoning. Each can be tested, cached, and optimized independently.&lt;/p&gt;
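
&lt;p&gt;One way to picture the per-domain split is a small registry that maps each domain to its own configuration. This is a sketch, not the repo's actual code: &lt;code&gt;ThinkerSpec&lt;/code&gt;, &lt;code&gt;THINKERS&lt;/code&gt;, and &lt;code&gt;spec_for&lt;/code&gt; are hypothetical names, and the prompts and cache TTL values are made up for illustration. Only the two model names come from the text above.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThinkerSpec:
    """Per-domain configuration (hypothetical shape for illustration)."""
    model: str
    system_prompt: str
    cache_ttl_seconds: int  # illustrative values, not measured


THINKERS = {
    "weather": ThinkerSpec(
        model="gpt-5.4-mini",
        system_prompt="You answer weather questions using the weather tool.",
        cache_ttl_seconds=300,
    ),
    "news": ThinkerSpec(
        model="gpt-5.4",
        system_prompt="You summarize current news accurately and briefly.",
        cache_ttl_seconds=600,
    ),
}


def spec_for(domain: str) -> ThinkerSpec:
    """Look up the configuration for one domain."""
    return THINKERS[domain]


print(spec_for("weather").model)
print(spec_for("news").model)
```

&lt;p&gt;Changing the weather prompt or model tier here touches one entry and nothing else, which is the whole point of the split.&lt;/p&gt;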

&lt;h2&gt;
  
  
  The Realistic Production Architecture
&lt;/h2&gt;

&lt;p&gt;Here's where this implementation diverges from most tutorials you'll find.&lt;/p&gt;

&lt;p&gt;Many demos connect the browser directly to OpenAI's Realtime API via WebRTC. The browser gets an ephemeral token, establishes a peer connection, and audio flows between the user and OpenAI with nothing in between. &lt;strong&gt;It's not how production voice systems work.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In production — Twilio, SIP trunks, contact centers — audio always flows through your backend. This architecture puts your backend in the middle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Browser ←—WebRTC—→ Python Backend ←—WebSocket—→ OpenAI Realtime API
                        │
                   Thinker Agents
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The browser connects to a FastAPI server via WebRTC (using &lt;a href="https://github.com/aiortc/aiortc" rel="noopener noreferrer"&gt;aiortc&lt;/a&gt; for server-side WebRTC). The backend opens a WebSocket to OpenAI's Realtime API and streams audio bidirectionally, resampling between 48kHz (WebRTC) and 24kHz (Realtime API) using &lt;code&gt;libswresample&lt;/code&gt; for proper anti-aliased conversion.&lt;/p&gt;

&lt;p&gt;What this gives you that direct connection doesn't:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Interception&lt;/strong&gt;: the backend sees every event between the user and the model. Tool calls route to your server-side agents, not browser JavaScript. This is important for conversation aggregation, metrics, and downstream analytics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State management&lt;/strong&gt;: Redis-backed conversation history, cross-session user memory, per-domain result caching.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local VAD&lt;/strong&gt;: your backend owns turn detection, not OpenAI's servers. This is where hundreds of milliseconds live.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt;: API keys never touch the browser.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transport flexibility&lt;/strong&gt;: the same backend works for WebRTC browsers and telephony SIP trunks.&lt;/li&gt;
&lt;/ul&gt;
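
&lt;p&gt;To make the per-domain result caching from the list above concrete, here's a minimal in-memory sketch. It assumes a plain dictionary in place of Redis, and &lt;code&gt;DomainCache&lt;/code&gt; with its TTL table is a hypothetical name, not something from the repo. The clock is injectable so expiry can be exercised without sleeping.&lt;/p&gt;

```python
import time


class DomainCache:
    """Tiny TTL cache keyed by (domain, query); an in-memory stand-in for Redis."""

    def __init__(self, ttl_by_domain, clock=time.monotonic):
        self._ttl_by_domain = ttl_by_domain
        self._clock = clock
        self._entries = {}

    def get(self, domain, query):
        """Return the cached value, or None if missing or expired."""
        entry = self._entries.get((domain, query))
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[(domain, query)]
            return None
        return value

    def put(self, domain, query, value):
        """Cache a value only if the domain has a positive TTL configured."""
        ttl = self._ttl_by_domain.get(domain, 0)
        if ttl > 0:
            self._entries[(domain, query)] = (value, self._clock() + ttl)


# Simulated clock so the example is deterministic.
now = [0.0]
cache = DomainCache({"weather": 300}, clock=lambda: now[0])
cache.put("weather", "austin today", "72F and sunny")
cache.put("stocks", "AAPL", "not cached: no TTL configured")
print(cache.get("weather", "austin today"))
now[0] = 301.0
print(cache.get("weather", "austin today"))
```

&lt;p&gt;A domain with no configured TTL is simply never cached, which is a reasonable default for fast-moving data.&lt;/p&gt;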

&lt;h2&gt;
  
  
  Local VAD: Owning Turn Detection End-to-End
&lt;/h2&gt;

&lt;p&gt;This is the piece that makes the architecture feel instant.&lt;/p&gt;

&lt;p&gt;Most implementations of the OpenAI Realtime API use &lt;code&gt;semantic_vad&lt;/code&gt; or &lt;code&gt;server_vad&lt;/code&gt; in the session config and let OpenAI decide when the user stopped talking. That means every audio frame travels to OpenAI's servers, their VAD processes it, they decide the turn is over, and only &lt;em&gt;then&lt;/em&gt; does the model start generating a response. That round-trip is hundreds of milliseconds you're paying on every single turn.&lt;/p&gt;

&lt;p&gt;My implementation replaces this entirely with &lt;strong&gt;local voice activity detection&lt;/strong&gt;. The backend runs a &lt;a href="https://github.com/AgoraIO/ten_vad" rel="noopener noreferrer"&gt;TEN VAD&lt;/a&gt; model that processes audio locally and makes the turn detection decision on your own hardware, with zero network round-trip:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# When local VAD is active, server-side turn detection is completely disabled.
# The backend owns the full pipeline: detect speech end → commit buffer → trigger response.
&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_vad_gate&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_vad_gate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pcm16_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Speech onset: interrupt if audio is still playing
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;speech_started&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_response_active&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;has_queued_audio&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_handle_interrupt&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Speech end: commit and request response immediately
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;speech_ended&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_commit_and_respond&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;chunks_to_send&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;pcm16_bytes&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# fallback: send everything, let OpenAI decide
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The VAD gate uses a three-state machine — SILENCE, SPEECH, and HANGOVER — with a pre-roll buffer that preserves audio from just before speech onset. When speech ends, the backend immediately commits the audio buffer and sends &lt;code&gt;response.create&lt;/code&gt;. No server-side VAD involved. No round-trip. The Realtime API starts generating the instant it receives the committed buffer.&lt;/p&gt;
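
&lt;p&gt;Here's a simplified sketch of that three-state machine with a pre-roll buffer. It is not the repo's implementation: this &lt;code&gt;process&lt;/code&gt; takes an &lt;code&gt;is_speech&lt;/code&gt; flag that I'm assuming has already been produced by the VAD model for each frame, and the frame counts are illustrative defaults.&lt;/p&gt;

```python
from collections import deque
from dataclasses import dataclass, field
from enum import Enum


class VadState(Enum):
    SILENCE = "silence"
    SPEECH = "speech"
    HANGOVER = "hangover"


@dataclass
class VadResult:
    speech_started: bool = False
    speech_ended: bool = False
    chunks_to_send: list = field(default_factory=list)


class VadGate:
    def __init__(self, preroll_frames=10, hangover_frames=15):
        self.state = VadState.SILENCE
        self.hangover_frames = hangover_frames
        # Bounded buffer: keeps only the last few frames heard during silence.
        self._preroll = deque(maxlen=preroll_frames)
        self._silent_frames = 0

    def process(self, frame, is_speech):
        result = VadResult()
        if self.state is VadState.SILENCE:
            if is_speech:
                # Onset: forward the pre-roll so the first syllable isn't clipped.
                self.state = VadState.SPEECH
                result.speech_started = True
                result.chunks_to_send = list(self._preroll) + [frame]
                self._preroll.clear()
            else:
                self._preroll.append(frame)
            return result

        # SPEECH or HANGOVER: keep forwarding audio upstream.
        result.chunks_to_send = [frame]
        if is_speech:
            self.state = VadState.SPEECH
            self._silent_frames = 0
            return result

        # Non-speech frame: count it against the hangover window.
        self.state = VadState.HANGOVER
        self._silent_frames += 1
        if self._silent_frames >= self.hangover_frames:
            self.state = VadState.SILENCE
            self._silent_frames = 0
            result.speech_ended = True
        return result


gate = VadGate(preroll_frames=2, hangover_frames=2)
for frame, flag in [(b"a", False), (b"b", False), (b"c", True), (b"d", False), (b"e", False)]:
    r = gate.process(frame, flag)
    print(frame, gate.state.name, r.speech_started, r.speech_ended, r.chunks_to_send)
```

&lt;p&gt;The hangover window is what keeps a short pause mid-sentence from ending the turn. Tuning it is a tradeoff between responsiveness and cutting people off.&lt;/p&gt;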

&lt;p&gt;The &lt;code&gt;_commit_and_respond&lt;/code&gt; method uses the same &lt;code&gt;_response_create_lock&lt;/code&gt; that protects thinker result injection and idle nudges, because all of them compete for the same &lt;code&gt;response.create&lt;/code&gt; API constraint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_commit_and_respond&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_realtime_ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_audio_buffer.commit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_response_create_lock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_response_done&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_running&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_realtime_ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response.create&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result: it feels like the agent starts responding before you've finished talking. It isn't really, but the gap between end-of-speech and first audio byte is so small that it feels that way. This is &lt;a href="https://dev.to/lackmannicholas/dude-wheres-my-response-cutting-700ms-from-every-voice-ai-turn-with-local-vad-41jn"&gt;the same VAD research I published previously&lt;/a&gt; — 689ms improvement measured in controlled testing — now integrated into a full production architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Routing: The Dumbest Model Makes the Most Important Decision
&lt;/h2&gt;

&lt;p&gt;The Responder classifies intent via a single tool call — &lt;code&gt;route_to_thinker(domain, query)&lt;/code&gt;. The domain is constrained to a fixed enum:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ROUTE_TO_THINKER_TOOL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;route_to_thinker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;domain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enum&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stocks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;news&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;knowledge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s question, rephrased for the specialist.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is architecturally interesting because &lt;strong&gt;your dumbest model is making the most important decision&lt;/strong&gt;. And that's the right tradeoff. Routing needs to be fast — 100ms, not 2 seconds. The Responder already has full conversational context. And "what kind of question is this?" is a dramatically simpler task than "what's the answer?" Constraining routing to a fixed enum makes misclassification rare and fallback trivial: unknown domains go to the Knowledge Thinker.&lt;/p&gt;
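
&lt;p&gt;The fallback can be as small as the sketch below. It assumes the tool call arguments arrive as a JSON string; &lt;code&gt;parse_route&lt;/code&gt;, &lt;code&gt;KNOWN_DOMAINS&lt;/code&gt;, and &lt;code&gt;FALLBACK_DOMAIN&lt;/code&gt; are hypothetical names, and the domain list mirrors the enum in the tool definition.&lt;/p&gt;

```python
import json

# Mirrors the enum in the route_to_thinker tool definition.
KNOWN_DOMAINS = {"weather", "stocks", "news", "knowledge", "research"}
FALLBACK_DOMAIN = "knowledge"


def parse_route(arguments_json: str) -> tuple[str, str]:
    """Parse route_to_thinker arguments; fall back to the knowledge domain."""
    try:
        args = json.loads(arguments_json)
    except json.JSONDecodeError:
        args = {}
    if not isinstance(args, dict):
        args = {}

    domain = args.get("domain")
    if not isinstance(domain, str) or domain not in KNOWN_DOMAINS:
        domain = FALLBACK_DOMAIN

    query = str(args.get("query", ""))
    return domain, query


print(parse_route('{"domain": "weather", "query": "Will it rain today?"}'))
print(parse_route('{"domain": "sports", "query": "Who won last night?"}'))
```

&lt;p&gt;Malformed arguments and unknown domains both end up at the generalist, so a routing mistake degrades the answer instead of dropping the turn.&lt;/p&gt;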

&lt;p&gt;The bridge intercepts the tool call and dispatches the Thinker concurrently so the Responder keeps talking:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;case&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response.function_call_arguments.done&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_handle_tool_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Production Failure States and Three Guards Against Them
&lt;/h2&gt;

&lt;p&gt;When the Thinker returns a result, you can't just inject it and call &lt;code&gt;response.create&lt;/code&gt;. Three things can go wrong when handling real users:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Guard 1: The user interrupted.&lt;/strong&gt; While the Thinker was working, the user barged in with a new question. The Thinker's result is stale. You still submit the tool output (the API requires it), but you don't ask the Responder to speak a stale answer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dispatched_turn_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_turn_id&lt;/span&gt;  &lt;span class="c1"&gt;# snapshot before dispatch
&lt;/span&gt;
&lt;span class="c1"&gt;# ... thinker runs ...
&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_turn_id&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;dispatched_turn_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt;  &lt;span class="c1"&gt;# stale — user moved on
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Guard 2: The Responder is still talking.&lt;/strong&gt; The Realtime API silently drops &lt;code&gt;response.create&lt;/code&gt; while it's already generating a response — like the "let me check on that" filler. This is the primary cause of the "thinker came back but nothing happened" bug. You have to wait, and you have to serialize all callers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_response_create_lock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait_for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_response_done&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The lock serializes every &lt;code&gt;response.create&lt;/code&gt; caller — thinker results, the local VAD commit path, idle nudges, and disconnect goodbyes — because they all compete for the same API constraint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Guard 3: The user interrupted during the wait.&lt;/strong&gt; After Guard 2 releases, check staleness again. The user could have barged in while you were blocked.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_turn_id&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;dispatched_turn_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;  &lt;span class="c1"&gt;# stale after wait
&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_realtime_ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response.create&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Real callers interrupt, change their minds, and don't wait politely for the AI to finish thinking. You need to handle each one to have a system that feels as close to a human conversation as possible.&lt;/p&gt;
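
&lt;p&gt;Put together, the three guards might look like the sketch below. This is a self-contained illustration, not the repo's bridge: &lt;code&gt;FakeSocket&lt;/code&gt;, &lt;code&gt;Bridge&lt;/code&gt;, and &lt;code&gt;deliver&lt;/code&gt; are hypothetical names, the socket only records messages, and giving up on timeout is my assumption about what should happen when the wait expires.&lt;/p&gt;

```python
import asyncio
import json


class FakeSocket:
    """Stand-in for the Realtime WebSocket; records what would be sent."""

    def __init__(self):
        self.sent = []

    async def send(self, payload):
        self.sent.append(json.loads(payload))


class Bridge:
    def __init__(self, ws):
        self._ws = ws
        self._turn_id = 0
        self._response_done = asyncio.Event()
        self._response_done.set()  # no response in flight at start
        self._response_create_lock = asyncio.Lock()

    async def deliver(self, call_id, dispatched_turn_id, answer):
        """Submit a thinker result; request speech only if it is still relevant."""
        # The tool output is always submitted, even when the answer is stale.
        await self._ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {"type": "function_call_output", "call_id": call_id, "output": answer},
        }))
        if self._turn_id != dispatched_turn_id:
            return False  # Guard 1: user moved on while the thinker ran

        async with self._response_create_lock:
            try:
                # Guard 2: wait for any active response to finish.
                await asyncio.wait_for(self._response_done.wait(), timeout=10.0)
            except asyncio.TimeoutError:
                return False  # assumption: give up rather than talk over it
            if self._turn_id != dispatched_turn_id:
                return False  # Guard 3: user barged in during the wait
            await self._ws.send(json.dumps({"type": "response.create"}))
            return True


async def demo():
    ws = FakeSocket()
    bridge = Bridge(ws)
    fresh = await bridge.deliver("call_1", bridge._turn_id, "72F and sunny")
    snapshot = bridge._turn_id
    bridge._turn_id += 1  # simulated barge-in before the second result lands
    stale = await bridge.deliver("call_2", snapshot, "old answer")
    return fresh, stale, [message["type"] for message in ws.sent]


print(asyncio.run(demo()))
```

&lt;p&gt;The second result still submits its tool output but never triggers &lt;code&gt;response.create&lt;/code&gt;, so the caller never hears the stale answer.&lt;/p&gt;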

&lt;h2&gt;
  
  
  Barge-In Handling
&lt;/h2&gt;

&lt;p&gt;When local VAD detects speech onset while the Responder is outputting audio, the bridge does three things:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_handle_interrupt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. Invalidate in-flight thinker tasks
&lt;/span&gt;    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_turn_id&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="c1"&gt;# 2. Cancel the active response
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_response_active&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_realtime_ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response.cancel&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_response_active&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_response_done&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# 3. Flush queued audio so the speaker stops immediately
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;audio_track&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;audio_track&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_track&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;clear&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Incrementing &lt;code&gt;_turn_id&lt;/code&gt; is the key move. Every in-flight thinker task holds a snapshot of the turn ID from when it was dispatched. When it returns, Guard 1 catches the mismatch and discards the result. No stale answers, no race conditions, no complex cancellation logic.&lt;/p&gt;

&lt;p&gt;With local VAD, barge-in detection is also local — the backend sees speech onset in the VAD state machine before any audio reaches OpenAI. The interrupt fires faster than server-side detection could.&lt;/p&gt;

&lt;h2&gt;
  
  
  Context Is Not Just Conversation History
&lt;/h2&gt;

&lt;p&gt;A caller asking "is a two-bedroom available?" means nothing without property context. "Same unit as last time" means nothing without user context. In production, managing multiple types of structured context beyond raw conversation history is paramount to giving your conversation a personal feel as well as better model performance.&lt;/p&gt;

&lt;p&gt;The repo demonstrates this with a typed &lt;code&gt;UserContext&lt;/code&gt; model persisted in Redis — preferences, memory facts, conversation summaries, and behavioral signals — keyed by browser fingerprint for cross-session persistence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;UserContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;preferences&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Preferences&lt;/span&gt;       &lt;span class="c1"&gt;# name, location, temp unit, watched tickers
&lt;/span&gt;    &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MemoryStore&lt;/span&gt;            &lt;span class="c1"&gt;# inferred facts, deduped, capped at 20
&lt;/span&gt;    &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Summary&lt;/span&gt;               &lt;span class="c1"&gt;# rolling LLM-generated conversation summary
&lt;/span&gt;    &lt;span class="n"&gt;signals&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Signals&lt;/span&gt;               &lt;span class="c1"&gt;# topic counts, session count, last active
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Thinkers return a &lt;code&gt;ThinkResult&lt;/code&gt; that includes an optional &lt;code&gt;ContextUpdate&lt;/code&gt; — a class describing what the thinker learned. The router applies updates after the thinker returns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ThinkResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;context_update&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ContextUpdate&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Weather Thinker persists the user's location. The Knowledge Thinker picks up on it without being told. Context isn't trapped in a single agent's conversation. It's a shared, typed resource that any thinker can read from and contribute to. When context changes, the Responder's system prompt is updated mid-session via &lt;code&gt;session.update&lt;/code&gt; so it immediately knows what the thinkers learned.&lt;/p&gt;
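
&lt;p&gt;A minimal sketch of how a router might fold an update into the shared context. Plain dataclasses stand in for the Pydantic models so the snippet runs on its own, and the fields on &lt;code&gt;ContextUpdate&lt;/code&gt; (&lt;code&gt;location&lt;/code&gt;, &lt;code&gt;new_facts&lt;/code&gt;) and the &lt;code&gt;apply_update&lt;/code&gt; helper are assumptions for illustration, not code from the repo. Only the dedupe and the cap of 20 come from the comments above.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import Optional

MEMORY_CAP = 20  # the "capped at 20" limit from the UserContext comment


@dataclass
class ContextUpdate:
    # Hypothetical fields: what a thinker learned during one call.
    location: Optional[str] = None
    new_facts: list = field(default_factory=list)


@dataclass
class UserContext:
    # Simplified stand-in for the typed context described in the article.
    location: Optional[str] = None
    facts: list = field(default_factory=list)


def apply_update(ctx: UserContext, update: Optional[ContextUpdate]) -> bool:
    """Merge an update into the context; return True if anything changed."""
    if update is None:
        return False
    changed = False
    if update.location is not None and update.location != ctx.location:
        ctx.location = update.location
        changed = True
    for fact in update.new_facts:
        if fact not in ctx.facts:  # dedupe inferred facts
            ctx.facts.append(fact)
            changed = True
    # Keep only the most recent MEMORY_CAP facts.
    del ctx.facts[:-MEMORY_CAP]
    return changed
```

&lt;p&gt;The boolean return is a design convenience: the router only needs to push a fresh system prompt via &lt;code&gt;session.update&lt;/code&gt; when it comes back &lt;code&gt;True&lt;/code&gt;.&lt;/p&gt;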

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The cost of implementing your own turn detection with a local VAD is well worth it.&lt;/strong&gt; The latency improvement isn't incremental — it's the difference between "this feels like talking to a computer" and "this feels like talking to someone." Owning the turn detection pipeline means you control the most latency-sensitive decision in the entire system. If you're building on the Realtime API and not doing local VAD, you're leaving hundreds of milliseconds on the table on every turn.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The routing decision matters more than the reasoning quality.&lt;/strong&gt; A perfectly accurate Thinker routed to the wrong domain produces a wrong answer. A slightly less accurate Thinker routed correctly produces a useful one. Invest in your routing prompt and your domain enum. Simple, strict rules help the less capable realtime model route reliably. You could also classify with a separate LLM call, but with only a handful of potential tool calls, the Realtime API handles routing just fine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stalling is a prompt engineering problem, not a code problem.&lt;/strong&gt; The Realtime API naturally acknowledges the user before executing the tool call. Your system prompt just needs to tell it how. The Research Thinker in the repo simulates a 30-second delay specifically to stress-test this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-thinker is worth the complexity.&lt;/strong&gt; Independent prompts, independent model tiers, independent caching TTLs, independent testing. The overhead of managing multiple agents is far less than the quality cost of a bloated single-thinker prompt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backend mediation is not optional for production.&lt;/strong&gt; Direct browser-to-OpenAI works for demos. The moment you need state, security, observability, local VAD, or telephony support, your backend has to be in the middle. The upfront work will save you time in the long run.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The three guards make it feel alive.&lt;/strong&gt; The "thinker returned but nothing happened" bug is frustrating to debug in production; Guard 2 ensures the user isn't left hanging no matter what. The stale-result-after-interrupt bug only manifested when callers talked fast; Guards 1 and 3 make sure they get the answer built on the fullest context. These are things I wish I had known up front, or discovered without the pain of production issues.&lt;/p&gt;




&lt;p&gt;The full implementation — local VAD, multi-thinker routing, typed user context, LangSmith observability, Docker deployment, and a 30-second research thinker for stress-testing stalling behavior — is at &lt;a href="https://github.com/lackmannicholas/responder-thinker" rel="noopener noreferrer"&gt;github.com/lackmannicholas/responder-thinker&lt;/a&gt;. Clone it, run it, talk to it.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Previously: &lt;a href="https://dev.to/lackmannicholas/dude-wheres-my-response-cutting-700ms-from-every-voice-ai-turn-with-local-vad-41jn"&gt;Cutting 600ms from Every Voice AI Turn with Local VAD&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Coming next: Adding guardrails and voice quality evals to the Responder-Thinker pattern.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>voiceai</category>
      <category>realtimeai</category>
      <category>openai</category>
      <category>python</category>
    </item>
    <item>
      <title>Dude, Where's My Response? Cutting 600ms from Every Voice AI Turn with Local VAD</title>
      <dc:creator>Nick Lackman</dc:creator>
      <pubDate>Sat, 21 Mar 2026 02:45:02 +0000</pubDate>
      <link>https://dev.to/nick_lackman/dude-wheres-my-response-cutting-600ms-from-every-voice-ai-turn-with-local-vad-8j5</link>
      <guid>https://dev.to/nick_lackman/dude-wheres-my-response-cutting-600ms-from-every-voice-ai-turn-with-local-vad-8j5</guid>
      <description>&lt;p&gt;If you're building voice AI on OpenAI's Realtime API, your agent is slower than it needs to be — the main bottleneck is certainly inference but there's additional overhead to cut.&lt;/p&gt;

&lt;p&gt;I spent the past week instrumenting a production telephony voice pipeline, measuring where latency actually lives, and testing whether local voice activity detection (VAD) could meaningfully reduce response time. The answer is yes — by &lt;strong&gt;689ms per turn on substantive responses&lt;/strong&gt; — and the methodology is cleaner than I expected.&lt;/p&gt;

&lt;p&gt;Here's what I found, how I measured it, and why it matters for anyone building conversational AI on the Realtime API.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hidden Latency Tax
&lt;/h2&gt;

&lt;p&gt;When you build a voice agent on OpenAI's Realtime API — whether you're using the OpenAI Agents SDK, a custom WebSocket implementation, or any orchestration framework — the audio pipeline follows the same path:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The user speaks, and your telephony provider (Twilio, in my case) streams audio frames to your server&lt;/li&gt;
&lt;li&gt;Your server forwards every audio frame to OpenAI's Realtime API via WebSocket (&lt;code&gt;input_audio_buffer.append&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;OpenAI's server-side VAD (&lt;code&gt;semantic_vad&lt;/code&gt;, the default) processes the audio and decides when the user has stopped talking&lt;/li&gt;
&lt;li&gt;Only after the server-side VAD commits the audio buffer does the LLM begin generating a response&lt;/li&gt;
&lt;li&gt;The generated audio streams back to your server and out to the caller&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The problem is step 3. Every VAD decision requires a network round-trip. The audio has to travel to OpenAI's server, get processed by their turn detection model, and the commit decision happens server-side. Your code doesn't even participate — if you look at the OpenAI Agents SDK source, &lt;code&gt;input_audio_buffer.speech_stopped&lt;/code&gt; is handled as an informational notification. The server has already committed and started response generation by the time your code hears about it.&lt;/p&gt;

&lt;p&gt;This adds an irreducible network latency plus server-side model deliberation time on every single turn. And in a conversational AI system, latency after the user stops speaking is the most perceptible kind — it's the moment they're actively waiting.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Approach: Local VAD + Manual Turn Control
&lt;/h2&gt;

&lt;p&gt;The Realtime API supports disabling server-side turn detection entirely. When you set &lt;code&gt;turn_detection&lt;/code&gt; to &lt;code&gt;null&lt;/code&gt;, the server stops making autonomous commit decisions, and you take control of when to send &lt;code&gt;input_audio_buffer.commit&lt;/code&gt; and &lt;code&gt;response.create&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This means you can run a VAD model locally on your server, process the same audio frames as they arrive from Twilio — before they're even sent to OpenAI — and commit the turn the moment &lt;em&gt;you&lt;/em&gt; detect silence. The audio is already on your machine. There's no round-trip to wait for.&lt;/p&gt;

&lt;p&gt;I used &lt;a href="https://huggingface.co/TEN-framework/ten-vad" rel="noopener noreferrer"&gt;TEN VAD&lt;/a&gt; (by Agora) as the local model, running via ONNX Runtime. More on why TEN VAD below.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Not Just Use Silero?
&lt;/h2&gt;

&lt;p&gt;I evaluated three tiers of VAD before settling on TEN VAD:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Energy-based VAD&lt;/strong&gt; (WebRTC VAD, fast-vad) uses signal processing — energy levels, spectral characteristics, zero-crossing rates — to make binary speech/no-speech decisions. Extremely fast, but can't distinguish speech energy from background noise. WebRTC VAD misses roughly 1 out of every 2 speech frames at a 5% false positive rate. Not viable for production turn detection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Silero VAD&lt;/strong&gt; is the industry-standard ML-based VAD — an LSTM-based architecture trained on 6,000+ languages, available as an ONNX model. Significantly more accurate than energy-based approaches. But it has a meaningful limitation for conversational AI: it suffers from a multi-hundred-millisecond delay when detecting speech-to-silence transitions. The recurrent architecture needs several silence frames to shift its internal state, which translates directly to turn detection delay.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TEN VAD&lt;/strong&gt; (by Agora) is purpose-built for real-time conversational AI turn detection. Agora has 10+ years of experience in real-time voice infrastructure, and it shows. In my testing, TEN VAD detected speech-to-silence transitions with a median head start of &lt;strong&gt;722ms&lt;/strong&gt; over OpenAI's server-side VAD, compared to &lt;strong&gt;342ms&lt;/strong&gt; for Silero under the same conditions. It also achieves a 32% lower Real-Time Factor and 86% smaller library footprint than Silero, which matters when you're running VAD alongside everything else in a voice pipeline.&lt;/p&gt;

&lt;p&gt;The key advantage for turn detection is transition speed. TEN VAD operates on 16kHz audio with 10ms frame hops, giving it finer temporal resolution than Silero's minimum 32ms chunks. It correctly identifies short silent durations between adjacent speech segments that Silero misses entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Test Methodology
&lt;/h2&gt;

&lt;p&gt;Measuring this correctly turned out to be the hardest part. The naive approach — comparing "local VAD detected silence at time X" vs "server sent speech_stopped at time Y" — has a fundamental bias: the server's &lt;code&gt;speech_stopped&lt;/code&gt; event arrives &lt;em&gt;after&lt;/em&gt; the server has already begun processing, so it makes server-side VAD look artificially fast.&lt;/p&gt;

&lt;p&gt;The solution: &lt;strong&gt;use local VAD as a passive timestamp observer in both configurations.&lt;/strong&gt; In the server-side VAD test runs, TEN VAD runs locally but doesn't commit or trigger responses — it only records when it detects silence. This gives both configurations the same "true speech end" anchor point.&lt;/p&gt;

&lt;p&gt;The test protocol:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;50 turns per configuration&lt;/strong&gt; — local VAD + commit vs server-side &lt;code&gt;semantic_vad&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scripted test calls&lt;/strong&gt; from a cell phone through production Twilio PSTN infrastructure (8kHz µ-law audio)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Common measurement anchor&lt;/strong&gt;: both configurations measure perceived latency from the true moment speech ends, as detected by the passive local TEN VAD observer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Controlled quiet-room environment&lt;/strong&gt; to isolate the VAD comparison from acoustic variability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Perceived latency&lt;/strong&gt; defined as: true speech end → first audio byte emitted to the caller&lt;/li&gt;
&lt;/ul&gt;
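
&lt;p&gt;To make the anchor concrete, here is a rough sketch of a per-turn timing record covering the three intervals defined under Methodology Details. The class and field names are hypothetical, and it assumes every timestamp comes from the same monotonic clock on the server.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class TurnTiming:
    # All values are time.monotonic() readings in seconds (assumed clock).
    speech_end: float   # passive local VAD marked the true end of speech
    commit_ack: float   # input_audio_buffer.committed arrived from the server
    first_delta: float  # first audio delta arrived from OpenAI
    first_emit: float   # first audio byte written out to the caller

    @property
    def perceived_ms(self) -> float:
        """True speech end to first audio byte emitted to the caller."""
        return (self.first_emit - self.speech_end) * 1000

    @property
    def commit_ms(self) -> float:
        """True speech end to the server's commit acknowledgment."""
        return (self.commit_ack - self.speech_end) * 1000

    @property
    def llm_ms(self) -> float:
        """Commit acknowledgment to first audio delta."""
        return (self.first_delta - self.commit_ack) * 1000
```

&lt;p&gt;Because &lt;code&gt;speech_end&lt;/code&gt; is always stamped by the passive local observer, the same record works unchanged for both configurations.&lt;/p&gt;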

&lt;h3&gt;
  
  
  Filler Response Segmentation
&lt;/h3&gt;

&lt;p&gt;An important methodological consideration: the LLM non-deterministically generates "filler" responses (e.g., "Let me look that up for you") that respond in under 1 second. Server-side VAD received 44% fillers vs 32% for local VAD in my test runs, which biases the unsegmented comparison. I present results segmented by response type to control for this.&lt;/p&gt;
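
&lt;p&gt;The segmentation itself is simple. The sketch below assumes each turn is a &lt;code&gt;(perceived_ms, llm_ms)&lt;/code&gt; pair and uses the 1000ms LLM-latency threshold given under Methodology Details; the function names are illustrative.&lt;/p&gt;

```python
from statistics import median

FILLER_THRESHOLD_MS = 1000  # LLM latency below this counts as a filler turn


def segment_turns(turns):
    """Split (perceived_ms, llm_ms) pairs into filler and substantive lists."""
    fillers = [t for t in turns if FILLER_THRESHOLD_MS > t[1]]
    substantive = [t for t in turns if t[1] >= FILLER_THRESHOLD_MS]
    return fillers, substantive


def median_perceived(turns):
    """Median perceived latency for one segment."""
    return median(t[0] for t in turns)
```

&lt;p&gt;Comparing medians within each segment, rather than across all turns, keeps a configuration that happened to draw more fillers from looking faster than it is.&lt;/p&gt;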

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Non-Filler Turns (Primary Comparison)
&lt;/h3&gt;

&lt;p&gt;These are substantive AI responses where the LLM performs real inference. LLM latency is closely matched between configurations, isolating the VAD effect.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Local VAD&lt;/th&gt;
&lt;th&gt;Server VAD&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Sample size&lt;/td&gt;
&lt;td&gt;34 turns&lt;/td&gt;
&lt;td&gt;28 turns&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Perceived latency (median)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2,412ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3,101ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-689ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Perceived latency (mean)&lt;/td&gt;
&lt;td&gt;2,396ms&lt;/td&gt;
&lt;td&gt;3,216ms&lt;/td&gt;
&lt;td&gt;-820ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM latency (median)&lt;/td&gt;
&lt;td&gt;2,183ms&lt;/td&gt;
&lt;td&gt;2,263ms&lt;/td&gt;
&lt;td&gt;~equal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cohen's d&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.04 (large)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Significance&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;p &amp;lt; 0.001&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;t = 3.93&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;22% reduction in perceived latency&lt;/strong&gt; with closely matched LLM latency, confirming the improvement is attributable to the VAD change, not LLM variance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Filler Turns (Cleanest Proof of VAD Effect)
&lt;/h3&gt;

&lt;p&gt;Filler turns provide the cleanest isolation because LLM latency is virtually identical — the entire improvement is pure VAD overhead.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Local VAD&lt;/th&gt;
&lt;th&gt;Server VAD&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Sample size&lt;/td&gt;
&lt;td&gt;16 turns&lt;/td&gt;
&lt;td&gt;22 turns&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Perceived latency (median)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;679ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1,134ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-454ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM latency (mean)&lt;/td&gt;
&lt;td&gt;519ms&lt;/td&gt;
&lt;td&gt;517ms&lt;/td&gt;
&lt;td&gt;~equal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cohen's d&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.74 (very large)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Significance&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;p &amp;lt; 0.001&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;t = 5.81&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;40% reduction.&lt;/strong&gt; With LLM latency at 519ms vs 517ms (effectively identical), the entire 454ms improvement is pure VAD overhead eliminated. This is the irreducible cost of server-side turn detection made visible.&lt;/p&gt;

&lt;h3&gt;
  
  
  Response Time Distribution
&lt;/h3&gt;

&lt;p&gt;The distribution shift tells the most compelling story:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Threshold&lt;/th&gt;
&lt;th&gt;Local VAD&lt;/th&gt;
&lt;th&gt;Server VAD&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Under 1 second&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;28%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Under 1.5 seconds&lt;/td&gt;
&lt;td&gt;42%&lt;/td&gt;
&lt;td&gt;36%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Under 2.5 seconds&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;78%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;54%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Under 3 seconds&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;92%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;70%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;28% of local VAD turns respond in under 1 second vs just 4% for server-side VAD.&lt;/strong&gt; Sub-second response time is a qualitatively different user experience: the difference between a conversation that feels like talking to a person and one that feels like waiting for a system.&lt;/p&gt;

&lt;p&gt;Over a 10-turn call, the cumulative improvement is approximately &lt;strong&gt;5–7 seconds&lt;/strong&gt;.&lt;/p&gt;
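
&lt;p&gt;The range falls out of the two median deltas reported above. As a quick check, assuming every turn in the call is either all filler or all substantive:&lt;/p&gt;

```python
turns_per_call = 10
filler_saving_ms = 454       # median delta on filler turns
substantive_saving_ms = 689  # median delta on non-filler turns

# Bounds: a call made entirely of fillers vs entirely of substantive turns.
low_s = turns_per_call * filler_saving_ms / 1000
high_s = turns_per_call * substantive_saving_ms / 1000
print(f"{low_s:.1f}s to {high_s:.1f}s saved per call")  # 4.5s to 6.9s saved per call
```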

&lt;h2&gt;
  
  
  How to Implement This
&lt;/h2&gt;

&lt;p&gt;The Realtime API makes this straightforward. The key is setting &lt;code&gt;turn_detection&lt;/code&gt; to &lt;code&gt;null&lt;/code&gt; in your session configuration, which puts you in manual turn control mode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
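&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="c1"&gt;# Disable server-side VAD (`websocket` is an already-open Realtime API connection)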
&lt;/span&gt;&lt;span class="n"&gt;session_update&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session.update&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;realtime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;turn_detection&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  &lt;span class="c1"&gt;# Manual turn control
&lt;/span&gt;            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;websocket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_update&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# When your local VAD detects end of speech:
&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;websocket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_audio_buffer.commit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}))&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;websocket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response.create&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_modalities&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
&lt;span class="p"&gt;}))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're using the OpenAI Agents SDK (Python), the same mechanism works through the session's manual turn control:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_audio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;commit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The approach works identically regardless of your orchestration framework — it's all the same Realtime API WebSocket protocol underneath.&lt;/p&gt;

&lt;p&gt;For the local VAD model, &lt;a href="https://github.com/TEN-framework/ten-vad" rel="noopener noreferrer"&gt;TEN VAD&lt;/a&gt; is available on GitHub and Hugging Face with ONNX weights and Python bindings. &lt;a href="https://github.com/snakers4/silero-vad" rel="noopener noreferrer"&gt;Silero VAD&lt;/a&gt; is the more established alternative if you want a simpler setup, though you'll see slower transition detection.&lt;/p&gt;
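
&lt;p&gt;Whichever model you pick, you still need a small state machine to turn per-frame output into a commit decision. The sketch below assumes the VAD yields a speech probability per frame. The &lt;code&gt;EndOfTurnDetector&lt;/code&gt; class, the 0.5 threshold, and the 30-frame hangover are illustrative values, not the settings used in the tests above.&lt;/p&gt;

```python
class EndOfTurnDetector:
    """Fires once when speech is followed by enough consecutive silent frames."""

    def __init__(self, threshold=0.5, hangover_frames=30):
        self.threshold = threshold              # speech probability cutoff (assumed)
        self.hangover_frames = hangover_frames  # e.g. 30 hops of 10ms = 300ms
        self.in_speech = False
        self.silent_run = 0

    def push(self, speech_prob: float) -> bool:
        """Feed one frame's probability; return True when the turn has ended."""
        if speech_prob >= self.threshold:
            self.in_speech = True
            self.silent_run = 0
            return False
        if not self.in_speech:
            return False  # silence before any speech: nothing to commit
        self.silent_run += 1
        if self.silent_run >= self.hangover_frames:
            self.in_speech = False
            self.silent_run = 0
            return True
        return False
```

&lt;p&gt;When &lt;code&gt;push&lt;/code&gt; returns &lt;code&gt;True&lt;/code&gt;, send &lt;code&gt;input_audio_buffer.commit&lt;/code&gt; followed by &lt;code&gt;response.create&lt;/code&gt; as in the snippet above. A shorter hangover commits sooner but risks cutting callers off mid-pause.&lt;/p&gt;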

&lt;h2&gt;
  
  
  What's Next: Speculative Response Generation
&lt;/h2&gt;

&lt;p&gt;With local VAD handling turn detection, the remaining bottleneck is LLM inference (~2.2s median on non-filler turns). The next optimization I'm exploring is &lt;strong&gt;speculative response generation&lt;/strong&gt;: using the local VAD's early silence detection to trigger LLM inference &lt;em&gt;before&lt;/em&gt; we're fully certain the user has finished speaking. That permits a much more aggressive local VAD configuration, one that wouldn't be safe in production without OpenAI's server-side VAD as confirmation.&lt;/p&gt;

&lt;p&gt;The generated audio would be buffered rather than played immediately. If the user continues speaking, we discard the speculative response. If they're done, the response is already generated and plays almost instantly.&lt;/p&gt;
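
&lt;p&gt;One way to picture the buffering, as a sketch rather than the implementation I'm prototyping: the &lt;code&gt;SpeculativeResponse&lt;/code&gt; class and its method names are hypothetical, and the confirmation signal could come from whichever VAD you trust more.&lt;/p&gt;

```python
class SpeculativeResponse:
    """Holds generated audio until the end of turn is confirmed."""

    def __init__(self):
        self.chunks = []
        self.confirmed = False

    def on_audio_delta(self, chunk: bytes) -> list:
        """Return chunks to play now; buffer them while unconfirmed."""
        if self.confirmed:
            return [chunk]
        self.chunks.append(chunk)
        return []

    def on_turn_confirmed(self) -> list:
        """The user is done: release everything buffered so far."""
        self.confirmed = True
        released, self.chunks = self.chunks, []
        return released

    def on_user_resumed(self) -> None:
        """The user kept talking: throw the speculative audio away."""
        self.chunks.clear()
        self.confirmed = False
```

&lt;p&gt;Discarding the buffer is only half of the cleanup. The in-flight generation would presumably also need to be cancelled on the API side so it doesn't keep streaming audio for a turn that never ended.&lt;/p&gt;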

&lt;p&gt;The Realtime API supports a hybrid configuration for this: set &lt;code&gt;turn_detection.create_response = false&lt;/code&gt; and &lt;code&gt;turn_detection.interrupt_response = false&lt;/code&gt;. This keeps semantic_vad running as a signal while leaving response timing under your control — the best of both worlds.&lt;/p&gt;
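
&lt;p&gt;As a sketch, the hybrid setup might look like the payload below. It assumes the same nested session shape as the &lt;code&gt;session.update&lt;/code&gt; example shown earlier; verify the field placement against the API version you're on.&lt;/p&gt;

```python
import json

# Assumed shape, mirroring the session.update payload shown earlier.
hybrid_session_update = {
    "type": "session.update",
    "session": {
        "type": "realtime",
        "audio": {
            "input": {
                "turn_detection": {
                    "type": "semantic_vad",
                    "create_response": False,     # server VAD won't start a response
                    "interrupt_response": False,  # server VAD won't cut one off
                }
            }
        },
    },
}

payload = json.dumps(hybrid_session_update)  # ready to send over the WebSocket
```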

&lt;p&gt;Early prototyping suggests this could save an additional 200–300ms, potentially bringing total response latency consistently under 2 seconds. But the edge cases are real — still working through the interplay between local VAD and OpenAI's server-side VAD.&lt;/p&gt;

&lt;h2&gt;
  
  
  Methodology Details
&lt;/h2&gt;

&lt;p&gt;For those who want to reproduce this or poke holes in it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Perceived latency&lt;/strong&gt; is defined as the interval from true speech end (local TEN VAD detection) to first audio byte emitted to the telephony provider. Both configurations are measured from the same anchor point — this eliminates the measurement bias inherent in using the server's &lt;code&gt;speech_stopped&lt;/code&gt; event.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Commit latency&lt;/strong&gt; (local VAD mode only): true speech end → server acknowledgment of &lt;code&gt;input_audio_buffer.committed&lt;/code&gt;. Median 122ms — this is the WebSocket round-trip overhead that local VAD adds. A small price for a large gain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM latency&lt;/strong&gt;: server commit acknowledgment → first audio delta from OpenAI. This is the model inference time, independent of VAD choice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Filler segmentation threshold&lt;/strong&gt;: LLM latency &amp;lt; 1000ms. Filler responses are non-deterministic LLM behavior (e.g., "Let me find that for you") and are not controllable by VAD configuration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Statistical tests&lt;/strong&gt;: Welch's two-sample t-test (unequal variances), Cohen's d for effect size. All p-values are two-tailed.&lt;/p&gt;
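
&lt;p&gt;For anyone reproducing the numbers, both statistics can be computed with the standard library alone. This is a sketch: it assumes Cohen's d uses the pooled standard deviation, and it stops at the t statistic and degrees of freedom because the standard library has no t-distribution CDF for the p-value.&lt;/p&gt;

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of each sample mean
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def cohens_d(a, b):
    """Cohen's d using the pooled standard deviation (assumed variant)."""
    na, nb = len(a), len(b)
    pooled = sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled
```

&lt;p&gt;Pass the local VAD latencies as &lt;code&gt;a&lt;/code&gt; and the server VAD latencies as &lt;code&gt;b&lt;/code&gt;; a negative t then means local VAD was faster.&lt;/p&gt;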

&lt;p&gt;&lt;strong&gt;Environment&lt;/strong&gt;: Controlled quiet-room conditions. Scripted test calls from cell phone through production Twilio PSTN infrastructure (8kHz µ-law, ~20ms frames). Test dates: March 20–21, 2026.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I build real-time AI voice systems — telephony pipelines, streaming audio, LLM orchestration. If you're working on similar problems, I'd love to hear what latency challenges you're seeing. Reach out on &lt;a href="https://www.linkedin.com/in/lackmannicholas" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>openai</category>
      <category>voiceai</category>
      <category>realtimeai</category>
      <category>websockets</category>
    </item>
    <item>
      <title>I Tested Our WebSocket Audio Pipeline with WebRTC. Here's Why I Switched It Back.</title>
      <dc:creator>Nick Lackman</dc:creator>
      <pubDate>Wed, 11 Mar 2026 04:01:01 +0000</pubDate>
      <link>https://dev.to/nick_lackman/i-tested-our-websocket-audio-pipeline-with-webrtc-heres-why-i-switched-it-back-3g1j</link>
      <guid>https://dev.to/nick_lackman/i-tested-our-websocket-audio-pipeline-with-webrtc-heres-why-i-switched-it-back-3g1j</guid>
      <description>&lt;p&gt;There's a prevailing assumption in the voice AI space that WebRTC is inherently better than WebSockets for real-time audio. Better latency, better quality, better everything. I built a full proof-of-concept to test that assumption on an enterprise scale production AI voice system.&lt;/p&gt;

&lt;p&gt;I found a few things surprising.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;Our system takes inbound phone calls, pipes the audio through an AI agent (OpenAI Realtime API), and sends the response back to the caller. The current architecture uses Twilio Programmable Voice with WebSocket media streams carrying G.711 μ-law audio at 8kHz.&lt;/p&gt;

&lt;p&gt;The hypothesis was straightforward: replace the WebSocket media path with WebRTC via LiveKit, and we'd get lower latency (UDP instead of TCP, no WebSocket framing overhead) and better audio quality (Opus codec at 48kHz instead of G.711 at 8kHz).&lt;/p&gt;

&lt;p&gt;I built the full integration: LiveKit Cloud as the media server, Twilio Elastic SIP Trunking for the PSTN connection, a transport abstraction layer so both paths could run side by side, and a real-time audio pacer to handle frame timing. The key was adding the new transport path without changing any of the LLM orchestration, agent configuration, or tools. It should behave exactly like production, except for using LiveKit/SIP/WebRTC rather than Twilio/Programmable Voice/WebSockets.&lt;/p&gt;

&lt;p&gt;Measuring the delta between the two paths was necessary to draw any meaningful conclusions from this proof-of-concept.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Latency Result
&lt;/h2&gt;

&lt;p&gt;Median response latency (time from when the caller stops speaking to when the AI starts responding):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WebSocket path: ~1,920ms&lt;br&gt;
WebRTC path: ~2,060ms&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Essentially identical. The theoretical 50–150ms savings from eliminating WebSocket overhead is real, but invisible against 2+ seconds of LLM response time. The transport layer accounts for less than 5% of total conversational latency. The bottleneck is the model, not the pipe. What I found interesting is how this squares with the conversation around WebSockets vs. WebRTC for real-time AI, where the general consensus is that “WebRTC is always better.” WebRTC is the superior transport for real-time communications (it's literally in the name), but the efficiency benefits are hard to see when model inference takes 500ms to 4s.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Audio Quality Result
&lt;/h2&gt;

&lt;p&gt;Both paths delivered the same audio quality, because both paths carry the same audio. When a caller dials from a phone, the audio enters the PSTN as G.711 μ-law at 8kHz. That's a hard ceiling imposed by the telephone network. It doesn't matter whether those bytes travel over a WebSocket or a WebRTC connection; the frequency content is identical. You can't recover information that was never captured at the source. Said a different way, you can't go from a low-quality audio encoding to a high-quality one and expect better-sounding output.&lt;/p&gt;
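
&lt;p&gt;The ceiling follows directly from the sample rate: audio sampled at a given rate can only represent frequencies up to half of it (the Nyquist limit). A quick calculation for the two sample rates in this post:&lt;/p&gt;

```python
def nyquist_hz(sample_rate_hz: int) -> float:
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2


pstn_ceiling = nyquist_hz(8_000)    # G.711 on the PSTN
opus_ceiling = nyquist_hz(24_000)   # the browser path tested later
print(pstn_ceiling, opus_ceiling)   # 4000.0 12000.0
```

&lt;p&gt;The measured 99% bandwidths reported further down (~3,969 Hz for the PSTN paths, 8,438 Hz for the browser path) both sit under their respective limits, as expected.&lt;/p&gt;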

&lt;h3&gt;
  
  
  Spectral Analysis
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl3z2wnzo225n7nni7ss7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl3z2wnzo225n7nni7ss7.png" alt=" " width="800" height="581"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Surprise: WebRTC Sounded Worse at First
&lt;/h2&gt;

&lt;p&gt;The initial WebRTC implementation actually sounded worse than WebSocket — choppy audio, dropped words, audible artifacts. It took real debugging to figure out why.&lt;/p&gt;

&lt;p&gt;WebRTC's jitter buffer is designed for network jitter: it smooths out packets that arrive with variable timing from a remote peer over UDP. It is not designed to handle an application dumping large bursts of AI-generated audio into the WebRTC stack all at once.&lt;/p&gt;

&lt;p&gt;When the LLM generates a response, the audio arrives in variable-sized chunks (sometimes 50ms of audio, sometimes 500ms), delivered as fast as the model can produce it. The OpenAI Realtime API delivers fairly consistent audio chunks, but the timing isn't exact and doesn't match the steady cadence PSTN playback expects. Our WebSocket implementation had a strict real-time pacer that metered these chunks out at exactly one frame per 20ms with prebuffering and underrun detection. Without that same pacer on the WebRTC path, the audio sounded terrible.&lt;/p&gt;

&lt;p&gt;The fix was porting the same pacer architecture to the WebRTC path. Once both paths had identical frame-level timing discipline, the audio quality matched. The lesson: application-level pacing of AI-generated audio is your responsibility regardless of transport. WebRTC handles network timing, not application timing.&lt;/p&gt;
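
&lt;p&gt;A stripped-down sketch of the idea, not our production pacer: it assumes 8kHz μ-law (160 bytes per 20ms frame), a &lt;code&gt;None&lt;/code&gt; sentinel marking the end of a response, and an async &lt;code&gt;send&lt;/code&gt; callable supplied by the transport. All names and defaults here are illustrative.&lt;/p&gt;

```python
import asyncio
import time

FRAME_BYTES = 160       # 20ms of 8kHz mu-law, one byte per sample
FRAME_SECONDS = 0.020


def split_frames(buffer: bytearray, chunk: bytes) -> list:
    """Append a chunk and slice off as many whole frames as are available."""
    buffer.extend(chunk)
    frames = []
    while len(buffer) >= FRAME_BYTES:
        frames.append(bytes(buffer[:FRAME_BYTES]))
        del buffer[:FRAME_BYTES]
    return frames


async def pace(queue, send, prebuffer=3, interval=FRAME_SECONDS, max_wait=0.2):
    """Send one frame per interval; return the number of underruns seen."""
    waited = 0.0
    while prebuffer > queue.qsize() and max_wait > waited:
        await asyncio.sleep(interval)   # build a small cushion before starting
        waited += interval
    underruns = 0
    next_deadline = time.monotonic()
    while True:
        try:
            frame = queue.get_nowait()
        except asyncio.QueueEmpty:
            underruns += 1              # ran dry mid-response
            frame = await queue.get()   # block until audio shows up again
            next_deadline = time.monotonic()
        if frame is None:               # sentinel: response finished
            return underruns
        await send(frame)
        next_deadline += interval       # schedule against the clock, not the last send
        await asyncio.sleep(max(0.0, next_deadline - time.monotonic()))
```

&lt;p&gt;Scheduling each frame against an absolute deadline, rather than sleeping a fixed 20ms after every send, keeps small delays from accumulating into drift over a long response.&lt;/p&gt;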

&lt;h2&gt;
  
  
  Where WebRTC Actually Wins
&lt;/h2&gt;

&lt;p&gt;I also tested a WebRTC-native path with no PSTN involved — a browser client connecting directly to the AI agent via LiveKit with Opus at 24kHz. The difference was dramatic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;99% audio bandwidth: 8,438 Hz (vs. ~3,969 Hz for PSTN paths)&lt;/li&gt;
&lt;li&gt;More than 2x the frequency content: you can hear breathiness, sibilants, natural voice texture&lt;/li&gt;
&lt;li&gt;Fewest signal artifacts of all three paths&lt;/li&gt;
&lt;li&gt;Same latency as the other paths (still LLM-bound)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;WebRTC is transformatively better when the caller isn't on a phone. The technology delivers on its promise — just not for PSTN calls.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;The right question isn't "should we use WebRTC?" It's "where is the bottleneck?" For PSTN-based AI voice calls today, the telephone network limits quality, and the LLM limits speed. Changing the transport layer between those two bottlenecks doesn't move the needle.&lt;/p&gt;

&lt;p&gt;WebRTC becomes the right answer when one of these changes: callers move to VoIP/browser/app clients (removing the PSTN quality ceiling), LLM response times drop by an order of magnitude (making transport latency a meaningful fraction of total latency), or wideband codecs become available end-to-end on SIP trunks.&lt;/p&gt;

&lt;p&gt;While WebRTC is the de facto real-time communication protocol, we have millions of phone numbers and deeply ingrained Twilio Programmable Voice integrations. Switching would mean setting up new infrastructure, changing the call routing logic, and taking on the overhead of either managing a media server ourselves or paying for a cloud service like LiveKit. SIP/WebRTC needed to be a significant improvement over Twilio/WebSockets to justify the migration, and it turned out to be about the same.&lt;/p&gt;

&lt;p&gt;If you are already deeply integrated with Twilio and their Programmable Voice, the boring WebSocket pipeline with a well-tuned audio pacer is the right architecture. Sometimes the best engineering decision is knowing when not to ship.&lt;/p&gt;

</description>
      <category>websockets</category>
      <category>webrtc</category>
      <category>ai</category>
      <category>twilio</category>
    </item>
  </channel>
</rss>
