DEV Community: Nick Lackman

The Audio Gateway: The Production Pattern for Real-Time Voice AI

Nick Lackman — Thu, 16 Jul 2026 22:30:48 +0000

If your voice agent stutters under load, talks over callers, or freezes mid-sentence while a tool call runs — the fix probably isn't a better model. It's an architectural split.

In my last post, Python Is Lying to You, I walked through every way Python's async model quietly sabotages a real-time audio pipeline: the async-looking guardrail that froze the event loop, the logger doing blocking I/O on every frame, GIL contention that looks exactly like network jitter. The fixes worked. But they are all defensive. Move this off the hot path. Offload that to a thread. Audit every innocent-looking call. You're playing whack-a-mole against your own process.

There's a point where defense stops scaling. The real fix isn't defending the hot path inside one process; it's removing everything else from the process entirely.

This post is about that architecture: an audio gateway that owns the media hot path, a business plane that owns everything else, and a small gRPC contract between them. It's the pattern we're building right now, and it's already in use across the realtime voice industry. I've built a full, runnable, open-source reference implementation you can clone and watch work: ai-audio-gateway. No API key is required, though an OpenAI key is strongly suggested.

Two workloads that should never have shared a process

A voice AI system contains two workloads with opposite requirements:

The audio path is hard real time. A 20ms frame every 20ms, forever, or the caller hears a glitch. It can't retry, can't batch, can't hide latency behind a spinner. Its scheduling requirements are measured in single-digit milliseconds.

Reasoning is the opposite. Slow, bursty, variable. An LLM call might take 500ms or five seconds. A tool call hits a database, an internal API, another service. Multi-step agent loops fan out unpredictably. This work is slow by nature, at least until we can do deep reasoning at lightning speed, and we're not there yet.

Put both in one process and the slow thing janks the fast thing. Every bug in my previous post is a specific instance of this one collision: CPU-bound guardrail work starving the event loop, an LLM SDK blocking at the wrong moment, a checkpoint write landing mid-frame. You can fix each instance individually while everything still runs in one process, or you can partition the workloads by their characteristics.

The production answer is to split the system into two planes:

A media plane — the audio gateway — that owns hard real-time work: telephony or WebRTC audio I/O, VAD, playback pacing, barge-in, codec handling.
A business plane that owns meaning: prompts, tools, orchestration, multi-step reasoning, data access.

Between them, a bidirectional gRPC stream carrying a small, typed contract.

One rule governs the whole design: audio timing never waits on business work. The business plane can delay a tool result. It cannot block barge-in, response cancellation, VAD, or an outbound audio frame. Everything else in this post is a consequence of that rule.

You don't have to take my word for it

This split is what the serious voice AI shops converge on, independently, because production pushes everyone to the same place.

LiveKit — the infrastructure OpenAI used to ship ChatGPT's Advanced Voice Mode, now serving billions of calls a year — is the cleanest example. Their media server is an SFU that forwards audio packets without ever decoding them; agents run as entirely separate processes that join a room like any other participant. Their own framing: media transport and AI inference are independent layers. LiveKit stays deliberately dumb about meaning; everything related to understanding, reasoning, or storage lives outside the media layer.

Vapi describes itself, verbatim, as "an orchestration layer" — and their data-flow documentation draws exactly this boundary: endpointing, interruption detection, and backchanneling run exclusively on Vapi's infrastructure, while your custom LLM and tools run on your servers, reached by webhook. Vapi is, functionally, a media/orchestration plane sold as a service, with your server URL as the business plane.

Pipecat (Daily's open-source framework) is built on the same separation — transports on one side, frame processors on the other.

And the deepest precedent predates all of them: telephony has separated signaling from media since SIP existed. The industry keeps rediscovering this split because real-time media and application logic have never belonged in the same process. The pattern isn't novel. That's the point: it's convergent. What this post adds is what the boundary looks like when the thing behind it is an agent: how tools, reasoning, and interruption coordination cross the wire.

The contract: two enums you can read in a minute

There's no .proto in the reference implementation. The wire format is Pydantic envelopes serialized as JSON over a gRPC stream_stream with identity serializers. This gives you HTTP/2 multiplexing and bidirectional streaming, with a payload you can log and read. For an example repo, that's perfect to demonstrate the concept.

Two enums are the contract:

class GatewayEventType(str, enum.Enum):
    """Gateway -> Business. Things that *happened* in the media plane."""
    CALL_STARTED = "call.started"
    USER_SPEECH_STARTED = "user.speech_started"
    USER_SPEECH_STOPPED = "user.speech_stopped"
    TOOL_CALL_REQUESTED = "tool_call.requested"
    BARGE_IN = "barge_in"
    RESPONSE_DONE = "response.done"
    # ...

class GatewayCommandType(str, enum.Enum):
    """Business -> Gateway. Things the business plane asks the media plane to do."""
    SESSION_CONFIGURE = "session.configure"
    TOOL_CALL_OUTPUT = "tool_call.output"
    RESPONSE_CREATE = "response.create"
    RESPONSE_CANCEL = "response.cancel"
    # ...

Events are past tense: the media plane owns reality and reports it. Commands are imperative: the business plane owns meaning and directs the media plane through a small vocabulary. When the vocabulary is this small, the boundary stays legible; when a new capability doesn't fit either enum cleanly, that's usually a sign it's on the wrong side of the wire.
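To make the wire format concrete, here is a minimal sketch of an envelope round trip. It uses stdlib dataclasses as a stand-in for the Pydantic models, and the envelope fields (`call_id`, `payload`) and the `encode`/`decode` helpers are assumptions for illustration, not necessarily the repo's actual schema.

```python
import enum
import json
from dataclasses import dataclass, field, asdict
from typing import Any


class GatewayEventType(str, enum.Enum):
    TOOL_CALL_REQUESTED = "tool_call.requested"
    BARGE_IN = "barge_in"


@dataclass
class GatewayEvent:
    type: GatewayEventType
    call_id: str                                   # hypothetical field
    payload: dict[str, Any] = field(default_factory=dict)


def encode(event: GatewayEvent) -> bytes:
    """Produce the JSON bytes an identity-serialized gRPC stream would carry."""
    body = asdict(event)
    body["type"] = event.type.value                # plain string on the wire
    return json.dumps(body).encode("utf-8")


def decode(raw: bytes) -> GatewayEvent:
    """Parse bytes back into a typed envelope; unknown types raise ValueError."""
    body = json.loads(raw.decode("utf-8"))
    return GatewayEvent(
        type=GatewayEventType(body["type"]),
        call_id=body["call_id"],
        payload=body.get("payload", {}),
    )
```

Because the bytes are plain JSON, you can log a frame and read it directly, which is the debugging benefit of skipping a .proto in an example repo.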

Hollow proxy tools: the schema crosses, the behavior doesn't

The realtime model needs tools. The tools need databases, internal APIs, and business logic. None of which belongs on the audio hot path. The mechanism that reconciles this is the piece of the architecture I like most.

A tool has two halves that usually live together: a schema (name, description, JSON parameters: the stuff the model needs to decide to call it) and an implementation (the code that does the work). This architecture splits them across the wire.

At call setup, the business plane sends the gateway a list of tool specs — schema only, no behavior:

class ToolSpec(BaseModel):
    name: str
    description: str
    params_json_schema: dict[str, Any]
    strict_json_schema: bool = True
    # Note what's absent: invoke().

The gateway builds one proxy per spec: an object the realtime model can call, whose entire body is "relay this call across the wire and await the result."

async def _relay_tool_call(self, name: str, arguments: dict) -> Any:
    tool_call_id = "tc_" + uuid.uuid4().hex[:10]
    fut = asyncio.get_running_loop().create_future()
    self._pending[tool_call_id] = fut          # correlate by id

    await self._send_event(GatewayEvent(
        type=GatewayEventType.TOOL_CALL_REQUESTED,
        payload={"name": name, "tool_call_id": tool_call_id,
                 "arguments_json": json.dumps(arguments),
                 "turn_id": self._turn_id},    # more on this below
    ))
    return await fut                            # resolved by the read loop

The model believes it has local tools. It calls them the way it natively wants to — no special-casing. But strip the business plane away and the proxies relay to nothing. The gateway holds shapes, not behavior.
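The relay above is only half of the mechanism; the other half is the read loop that resolves the pending future when the result comes back. A minimal sketch of both halves together, assuming a hypothetical `ToolRelay` class, a `send` callable standing in for the gRPC stream write, and an `output` payload key that I made up for the example:

```python
import asyncio
import json
import uuid
from typing import Any, Awaitable, Callable


class ToolRelay:
    """Hypothetical stand-in for the gateway's proxy bookkeeping."""

    def __init__(self, send: Callable[[dict], Awaitable[None]]) -> None:
        self._send = send
        self._pending: dict[str, asyncio.Future] = {}

    async def call(self, name: str, arguments: dict) -> Any:
        tool_call_id = "tc_" + uuid.uuid4().hex[:10]
        fut = asyncio.get_running_loop().create_future()
        self._pending[tool_call_id] = fut
        try:
            await self._send({
                "type": "tool_call.requested",
                "payload": {"name": name, "tool_call_id": tool_call_id,
                            "arguments_json": json.dumps(arguments)},
            })
            return await fut                 # parked until the read loop resolves it
        finally:
            self._pending.pop(tool_call_id, None)

    def on_command(self, command: dict) -> None:
        """Called by the read loop for every inbound command."""
        if command.get("type") != "tool_call.output":
            return
        payload = command["payload"]
        fut = self._pending.get(payload["tool_call_id"])
        if fut is not None and not fut.done():   # ignore unknown or late results
            fut.set_result(payload["output"])


async def demo() -> Any:
    async def fake_business_plane(event: dict) -> None:
        # Reply on a later loop iteration, as a real read loop would.
        reply = {"type": "tool_call.output",
                 "payload": {"tool_call_id": event["payload"]["tool_call_id"],
                             "output": {"items": ["latte"]}}}
        asyncio.get_running_loop().call_soon(relay.on_command, reply)

    relay = ToolRelay(fake_business_plane)
    return await relay.call("get_menu", {})
```

Running `asyncio.run(demo())` returns the fake menu. The `finally` block is a design choice of this sketch: it keeps the pending map from leaking entries when a call is cancelled.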

The proof: two agents, one gateway, zero gateway changes

The reference repo ships two agents against the identical gateway, because the directory listing alone makes the argument:

The single agent exposes the four café tools (get_menu, place_order, ...) directly. The gateway builds four proxies. Every tool call crosses the wire. Flat.

The responder–thinker agent exposes one tool: the thinker. The gateway builds a single proxy. When the model calls it, one envelope crosses the wire, and then the thinker — a hand-rolled agent loop running in the business plane — fans out into its own tool calls: get_menu, validate, place_order. None of those nested calls cross back. The gateway sees one envelope out and one envelope back; behind it, a whole tree executed that the gateway is blind to.

That blindness into tool execution is the entire point of where the boundary sits. The media plane relays shapes; whether resolving a request takes one round trip or a dozen nested calls is the business plane's business. This is also why the architecture survived swapping agent patterns without a gateway deploy.

If you've read my Responder-Thinker post, this is that pattern, now with the plane boundary drawn through it: the responder lives with the realtime model behind the gateway, and the thinker's entire fan-out is invisible to the media plane.

Barge-in and turn staleness: coordinating two processes with one integer

This is one of the more difficult pieces of a natural sounding conversation. When the caller interrupts, two things have to happen fast: the assistant must stop talking, and any reasoning already in flight for the old turn must not come back and speak over the new one.

The first problem lives entirely in the gateway. The gateway paces assistant audio out at real time even though the model streams it much faster, which means the outbound queue is the backlog of not-yet-heard speech. Barge-in clears that queue instantly. It also has to fire while audio is still draining, not just while the model is generating, because a fast model finishes generating long before the caller finishes hearing. In OpenAI mode the gateway additionally cancels the response, drops in-flight deltas from the cancelled response, and truncates the model's conversation item to roughly what was actually played so the next turn isn't grounded in words the caller never heard.
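As a rough illustration of "the outbound queue is the backlog of not-yet-heard speech," here is a minimal sketch of a playback queue that tracks how much audio was actually played and how much a barge-in discards. The `PlaybackQueue` class and its method names are hypothetical; the real gateway also has to pace `next_frame()` against a clock, which is omitted here.

```python
from collections import deque
from typing import Optional

FRAME_MS = 20  # one frame of audio, per the 20ms cadence described earlier


class PlaybackQueue:
    """Holds assistant audio that was generated but not yet heard."""

    def __init__(self) -> None:
        self._frames: deque = deque()
        self.played_ms = 0                  # what the caller actually heard

    def push(self, frame: bytes) -> None:
        self._frames.append(frame)          # the model streams faster than real time

    def next_frame(self) -> Optional[bytes]:
        """Called once per 20ms tick by the pacing loop."""
        if not self._frames:
            return None
        self.played_ms += FRAME_MS
        return self._frames.popleft()

    @property
    def backlog_ms(self) -> int:
        return len(self._frames) * FRAME_MS

    def barge_in(self) -> int:
        """Drop everything unheard; return how many ms were discarded."""
        dropped = self.backlog_ms
        self._frames.clear()
        return dropped
```

After a barge-in, `played_ms` is the number you would use to truncate the model's conversation item, so the next turn is grounded only in what the caller heard.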

The second problem crosses the wire, and the mechanism is small but impactful. The gateway owns a turn_id that increments on every barge-in. That integer rides on every tool_call.requested frame. The business plane mirrors it, snapshots it before slow work, and checks it after:

# In the thinker's loop, between every step:
if is_stale(snapshot_turn_id):
    return {"stale": True}   # abandon; don't speak over a moved conversation

Two processes. No shared memory. Coordinating real-time state through one integer on the wire.

(If that mechanism feels familiar, it's a degenerate Lamport clock. A monotonic counter establishing happened-before across processes is one of the oldest tools in distributed systems, and it's a good reminder that a well-placed integer often beats a clever protocol.)

In the reference implementation this is threaded through a ToolContext handed to every tool invocation, carrying the call id, the turn id, and a live is_stale(). The thinker passes the same context down into its nested calls, so a barge-in detected three levels deep into the fan-out abandons the whole tree.
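A minimal sketch of how that context could look. Only the idea (call id, turn id, a live `is_stale()`) comes from the reference implementation; the `TurnMirror` class, the `make_context` helper, and the assumption that the business plane learns the new turn id from inbound events are mine.

```python
from dataclasses import dataclass
from typing import Any, Callable


class TurnMirror:
    """Business-plane copy of the gateway's turn counter."""

    def __init__(self) -> None:
        self.current = 0

    def observe(self, turn_id: int) -> None:
        # The counter only moves forward, so out-of-order frames are harmless.
        self.current = max(self.current, turn_id)


@dataclass(frozen=True)
class ToolContext:
    tool_call_id: str
    turn_id: int                       # snapshot taken when the call was requested
    is_stale: Callable[[], bool]       # live check against the mirror


def make_context(mirror: TurnMirror, tool_call_id: str, turn_id: int) -> ToolContext:
    return ToolContext(tool_call_id, turn_id, lambda: mirror.current > turn_id)


def run_thinker(ctx: ToolContext, steps: list) -> dict:
    """Run steps in order, abandoning the rest once the turn has moved on."""
    results: list = []
    for step in steps:
        if ctx.is_stale():
            return {"stale": True}
        results.append(step(ctx))      # nested calls receive the same context
    return {"stale": False, "results": results}
```

Passing `ctx` into every step is what lets a nested call deep in the fan-out notice the barge-in without any extra plumbing.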

The bug that proved the point

While building the reference implementation, I hit a bug. It's the same lesson as the entire previous article, recurring one layer up.

My first business-plane handler processed inbound events sequentially: one async for loop, handle each event, move on. Tool calls were awaited inline. It looked clean, and it was totally broken: while the handler awaited a slow thinker run, it couldn't receive the barge_in event. The staleness check could never fire because the message that creates staleness was stuck behind the work it needed to invalidate.

The fix was to run tool execution as concurrent tasks so the event pump never blocks. Which is, of course, "don't block the loop," except this time the loop being blocked was a gRPC stream handler instead of an audio callback. The failure mode follows you across process boundaries. It may have seemed like the split would let you stop caring about async discipline in your business plane, but the discipline is still needed. It's just isolated to a few distinct places rather than spread across the entire process.
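The shape of that fix, as a minimal sketch. The `pump` function, the dict-shaped events, and the callback names are simplifications for illustration, not the repo's actual handler.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable


async def pump(
    events: AsyncIterator[dict],
    run_tool: Callable[[dict], Awaitable[None]],
    on_barge_in: Callable[[dict], None],
) -> None:
    """Read every inbound event promptly; never await tool work inline."""
    tasks: set = set()
    async for event in events:
        if event["type"] == "barge_in":
            on_barge_in(event)                        # handled immediately
        elif event["type"] == "tool_call.requested":
            task = asyncio.create_task(run_tool(event))
            tasks.add(task)                           # keep a reference alive
            task.add_done_callback(tasks.discard)
    if tasks:
        await asyncio.gather(*tasks)                  # drain at stream end


async def demo() -> list:
    order: list = []

    async def events():
        yield {"type": "tool_call.requested", "payload": {"turn_id": 0}}
        yield {"type": "barge_in", "payload": {"turn_id": 1}}

    async def slow_tool(event: dict) -> None:
        await asyncio.sleep(0.05)                     # stand-in for a thinker run
        order.append("tool_done")

    await pump(events(), slow_tool, lambda e: order.append("barge_in"))
    return order
```

In the demo the barge-in is recorded before the slow tool finishes. With the tool awaited inline, the order would be reversed, which is exactly the bug.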

Endpointing: one clock, one authority

Endpointing is a component of barge-in. Deciding when the caller stopped talking is where responsiveness is won or lost, and the operating rule is that exactly one component owns that decision.

In the reference implementation, the preferred authority is local VAD in the gateway (TEN VAD – the same local-VAD approach I benchmarked in an earlier post, worth roughly 600ms per turn against OpenAI server-side semantic VAD). When local VAD is active, the gateway disables the realtime provider's server-side turn detection entirely; its own VAD commits the audio buffer and requests the response. Run both authorities at once and every utterance triggers duplicate responses. Just fyi, you don't want to do that.

The critical tuning knob in the whole system lives here: the hangover. No, not your post-weekend headache or the movie. It's how much trailing silence must accumulate before the utterance is committed. It's a hard floor on response latency, and it's the parameter that background noise attacks: noise holds the gate open and the agent appears to stall. To make it measurable rather than vibes-based, the gateway emits a turn_latency event per turn (utterance commit → first assistant audio). If you take one operational idea from this post, it's that metric: latency per turn, measured at the boundary you control.
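To show why the hangover is a hard floor on latency, here is a minimal sketch of a hangover gate fed one VAD decision per 20ms frame. The `HangoverEndpointer` class and the 300ms default are assumptions for illustration, not the gateway's actual values.

```python
FRAME_MS = 20


class HangoverEndpointer:
    """Commit an utterance only after enough trailing silence has accumulated."""

    def __init__(self, hangover_ms: int = 300) -> None:   # assumed default
        self.hangover_frames = hangover_ms // FRAME_MS
        self._in_speech = False
        self._silence = 0

    def push(self, is_speech: bool) -> bool:
        """Feed one per-frame VAD decision; return True on the committing frame."""
        if is_speech:
            self._in_speech = True
            self._silence = 0          # any speech (or noise) resets the countdown
            return False
        if not self._in_speech:
            return False               # silence before any speech: nothing to commit
        self._silence += 1
        if self._silence >= self.hangover_frames:
            self._in_speech = False
            self._silence = 0
            return True
        return False
```

The commit can never arrive sooner than `hangover_ms` after the last speech frame, and a single noisy frame classified as speech restarts the countdown. That reset is how background noise holds the gate open.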

"But what if one good model makes all this obsolete?"

This is the most compelling objection I've heard, and the one that most makes me question this approach. Speech-to-speech models are getting dramatically better. If one model can eventually handle both the conversation and the reasoning, then there's no need for a responder-thinker split and no separate text-model calls. Doesn't this whole gateway become overhead?

The objection has two claims. Let's review each.

Claim one: a capable-enough audio model collapses the responder-thinker pattern into one model. Plausibly true, eventually, and likely already here with duplex architectures (see https://openai.com/index/introducing-gpt-live/ and https://docs.x.ai/developers/model-capabilities/audio/voice-agent). That's the direction the frontier is moving, and the multi-model pattern may well be transitional.

Claim two: therefore you don't need the gateway. This doesn't follow, because the gateway and the agent pattern solve different problems at different layers. The gateway is about where audio is processed; the agent pattern is about how many models reason behind it. Collapse the brain count to one, and you still have: telephony or WebRTC I/O, jitter buffering, pacing, VAD and endpointing, barge-in and queue-clearing, codec handling, reconnection, and tool execution, which still needs databases and internal APIs that don't belong in the media plane. The proxy relay is completely indifferent to how many models sit behind it. One model calling tools through hollow proxies is the same architecture as two.

And there's a second argument that consolidation actually strengthens: betting your entire audio experience on a single external model means you want an abstraction layer more, not less. The gateway is what lets you swap providers, fall back during an outage, or A/B a new model without touching business logic. The two-agents-one-gateway demo is exactly this. We changed the entire reasoning topology based on a configuration parameter passed in by the UI and the gateway didn't change. The layer that survives your architectural bets is the layer worth building.

Treat the agent pattern as the part that keeps changing and the hard physical realities of audio as the invariant. They're independent decisions, and conflating them is how you end up rebuilding the coupling we've just escaped.

Where MCP fits

If your organization is moving tools to MCP servers the question becomes: who holds the MCP connections?

Two options:

You can hand MCP server configs to the gateway and let the media plane connect directly. This has the fewest hops, but now the process with 20ms deadlines owns tool-infrastructure connections, auth, and failure handling. We just added non-realtime concerns to our realtime voice gateway. That's the exact coupling this architecture exists to prevent, and you lose the policy/observability control point.

Or you can make the business plane the MCP host. Describing it as a passthrough proxy to MCP is accurate, but "an extra proxy" invites the why-not-go-direct objection, and host is the better framing: the business plane owns the MCP client connections, aggregates tools across servers, and presents one unified toolset. The gateway never learns MCP exists. The proxies relay to the agent service exactly as before; whether the tool behind the interface is a local function or an MCP call is an implementation detail of the meaning plane.

The clean test of the boundary: we could swap the entire tool layer from local functions to MCP servers, and the gateway contract wouldn't move an inch. When a tool-strategy migration doesn't touch your media plane, the planes are actually decoupled.
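One way to picture the host role is a registry that aggregates specs from several backends and routes each call by name. This is a sketch under assumptions: `ToolBackend`, `LocalBackend`, and `ToolHost` are hypothetical names, and no real MCP client library is used. An MCP-backed backend would implement the same two methods with whatever client you run.

```python
from typing import Any, Callable, Protocol


class ToolBackend(Protocol):
    """Anything that can list tool specs and run a tool by name."""

    def list_specs(self) -> list[dict]: ...
    def invoke(self, name: str, arguments: dict) -> Any: ...


class LocalBackend:
    """Tools implemented as plain Python functions."""

    def __init__(self, functions: dict[str, Callable[..., Any]]) -> None:
        self._functions = functions

    def list_specs(self) -> list[dict]:
        return [{"name": name, "description": fn.__doc__ or ""}
                for name, fn in self._functions.items()]

    def invoke(self, name: str, arguments: dict) -> Any:
        return self._functions[name](**arguments)


class ToolHost:
    """Presents one unified toolset; the gateway only ever sees specs()."""

    def __init__(self, backends: list[ToolBackend]) -> None:
        self._route: dict[str, ToolBackend] = {}
        self._specs: list[dict] = []
        for backend in backends:
            for spec in backend.list_specs():
                if spec["name"] in self._route:
                    raise ValueError(f"duplicate tool name: {spec['name']}")
                self._route[spec["name"]] = backend
                self._specs.append(spec)

    def specs(self) -> list[dict]:
        return list(self._specs)           # schema only crosses the wire

    def invoke(self, name: str, arguments: dict) -> Any:
        return self._route[name].invoke(name, arguments)
```

Swapping a `LocalBackend` for an MCP-backed one changes nothing the gateway can observe, since `specs()` and the relay envelopes stay the same.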

The same fork exists even if you never build a gateway. Black-box voice providers offer both topologies, and which one you configure determines who holds your credentials. Vapi's server-URL model inverts the auth: when the model calls a tool, Vapi calls your server, authenticating to you with an API key or OAuth2 client credentials that your endpoint validates. Your plane stays the host; your internal secrets never leave. Their native MCP client is the other topology: you hand Vapi your MCP server URL (which their own docs say to treat as a credential), and Vapi connects at call start, discovers your tools, injects the schemas into the model, and invokes your MCP server directly on every tool call. That's the vendor's plane becoming your MCP host, with a credentialed doorway into your tool infrastructure. Neither is wrong, but the decision framework above applies to buy exactly as much as build, and the choice shouldn't be made by accident.

What else lives in the business plane

Tools are the flashiest wire-crossers, but they're not the reason the boundary earns its keep. The stronger examples are the responsibilities that can't be an MCP server — the ones that need the conversation itself.

Guardrails are the sharpest case. A policy check on what the agent is about to say — off-limits topics, compliance language, PII — is business logic that must observe every transcript in near-real-time and occasionally veto audio. It cannot live in the hot path (it's exactly the CPU-bound work my last post was about), and it can't be a tool the model politely decides to call. The architecture's answer: transcripts already cross the wire as events, the guardrail runs in the business plane, and the small command vocabulary (response.cancel) is its veto. Decision in the meaning plane, mechanism in the media plane. The same concept as everything else here.
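A minimal sketch of that veto path: a transcript event comes in, and the only thing that can go back is a command from the existing vocabulary. The `assistant.transcript` event name, the payload fields, and the blocked phrases are hypothetical; only `response.cancel` comes from the contract shown earlier.

```python
import re
from typing import Optional

# Hypothetical policy: phrases the agent must never say.
BLOCKED = re.compile(r"\b(guaranteed returns|medical diagnosis)\b", re.IGNORECASE)


def review_transcript(event: dict) -> Optional[dict]:
    """Return a response.cancel command if the transcript violates policy."""
    if event.get("type") != "assistant.transcript":    # hypothetical event name
        return None
    match = BLOCKED.search(event["payload"]["text"])
    if match is None:
        return None
    return {
        "type": "response.cancel",
        "payload": {"reason": "policy", "matched": match.group(0).lower()},
    }
```

In a real business plane this check would run off the event pump (in a task or thread) for the same reason tool calls do. The gateway only ever sees the cancel command, never the policy.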

The same applies to post-call processing, pushing transcripts to analytics, conversation summarization, CRM writes. None of it belongs within a mile of a 20ms frame deadline, all of it needs the conversation stream, and the event side of the contract is how it gets fed without the gateway knowing any of it exists.

What it costs

None of this is free, and not addressing the cost is bad engineering.

You now run two services instead of one: two deploys, two failure domains, distributed debugging. (Mitigation: propagate trace context across the gRPC stream so business-plane tool spans nest under the gateway spans that requested them — one trace per call, both planes visible.) Every wire-crossing tool call pays serialization and a network hop; keep the planes network-close and this is small against LLM latency, but measure it. The dev loop changes: engineers iterating on prompts and tools shouldn't need the audio stack running, so the gRPC seam has to be testable from both sides. The reference repo ships harnesses for exactly this. The same seam that a skeptic might call overhead is what makes each plane independently testable. The contract becomes a real interface you have to version and maintain with the discipline of a public API, because that's what it is now.

An aside: "why gRPC, though?"

Because this is the question I get right after "why two services," here's the receipts version.

Typed streaming RPC between planes is exactly how LiveKit coordinates internally: they built psrpc — protobuf service definitions with bidirectional streaming — and use it for signaling relay, room management, and agent job dispatch across their mesh. They chose a message-bus transport (Redis/NATS) instead of point-to-point HTTP/2 because their problem is a global multi-node mesh that needs topics, fan-out, and affinity routing. Ours is two co-located services, so point-to-point gRPC is the correct simplification of the same invariant: a typed, bidirectional-streaming contract between planes. The invariant is the contract; the transport follows the topology.

If the skepticism is "but gRPC can't handle real-time audio," let's review some use-cases: NVIDIA Riva ships its entire speech stack as gRPC streaming microservices running thousands of parallel streams, and Google Cloud Speech-to-Text's streaming recognition is a bidirectional gRPC stream, audio chunks in, interim transcripts back mid-sentence. That said, note what this architecture actually does: audio never crosses the bridge. The gRPC stream carries events, commands, and tool relays; audio flows gateway↔caller and gateway↔model. LiveKit makes the same move — media rides WebRTC, psrpc carries control. The bridge is a control plane, and control planes are exactly what gRPC is for.

(What the closed shops like Vapi, Deepgram, ElevenLabs run internally isn't public. Their WebSocket-heavy edges are a browser-and-telephony compatibility choice, not an internal architecture statement.)

Worth it? At demo scale, probably not. One process is simpler and the collisions haven't started. At production call volume, where one CPU spike in business logic becomes audible artifacts across every concurrent call on the box, the split stops being optional. We learned the workload-collision lesson the hard way at GA and fixed it with a cruder separation under pressure; the gateway is the deliberate version of that lesson, built before the next fire instead of during it.

The takeaway

The bugs in my last post were all one bug: two workloads with incompatible timing requirements sharing a process. You can defend the hot path forever — or you can give it its own process, its own event loop, its own deploy, and a small typed contract to the plane where slowness is allowed.

The whole reference implementation — both agents, the mock realtime model, the barge-in and staleness machinery, the browser demo — is at github.com/lackmannicholas/ai-audio-gateway. It runs with no API key: docker compose up, open the browser, order a coffee, and watch the wire.

Part of a series on building production voice AI. Previously: Local VAD · Responder-Thinker · WebRTC vs WebSockets · Python Is Lying to You.

Python Is Lying to You: Async Pitfalls in Real-Time Audio Pipelines

Nick Lackman — Fri, 12 Jun 2026 04:14:07 +0000

If you're building voice AI with async Python, your audio quality is probably worse than it needs to be — and the bottleneck isn't where you think it is.

I've spent the last two years building production voice AI systems — real-time pipelines handling thousands of calls per day over PSTN telephony. The stack is what you'd expect: STT, LLM reasoning, TTS, Speech-to-Speech, guardrails, tool calls, all wired together with async Python and streaming over websockets.

Along the way I've hunted down a specific category of bug that I think is more common than people realize: code that works correctly, passes every test, and makes the AI sound terrible. Not wrong answers — terrible audio. Stutters, gaps, unnatural pauses. The kind of artifacts that make callers hang up even though the AI gave the right answer.

These often trace back to the same root cause: Python's async model doesn't actually guarantee what most developers think it guarantees. And in a domain where timing matters at the millisecond level, those gaps in the guarantee become audible.

Here's what I found, what caused it, and how I fixed it.

The Promise

Here's the setup: you've built a voice AI pipeline. S2S or STT feeds into an LLM, the LLM reasons and calls tools, TTS (or the S2S model itself) turns the response into speech, and guardrails (PII detection, content filtering, maybe sentiment analysis) run alongside everything because responsible AI isn't optional. It's all async def. It's all awaited properly. The demo sounds great.

Python's async model feels like it was built for this. Non-blocking I/O, coroutines, event-driven architecture. You've got ML inference, LLM calls, NLP guardrails, and audio streaming all sharing the same event loop, and it maps cleanly onto a streaming voice pipeline.

Then you go to production with real concurrent call volume and the audio quality falls apart. Not immediately, and not on every call — but enough that it's a problem.

What follows are the specific issues I've tracked down, the code that caused them, and the fixes that brought audio quality back to where it needed to be.

"But I Thought It Was Async"

This one cost me real debugging time and it's the most important concept in this entire post: async def does not mean non-blocking. It's a contract that Python does not enforce.

The PII Guardrail That Killed Audio Quality

We have an inline guardrail that runs NLP analysis for PII detection on every transcript chunk. It's a compliance requirement — you can't skip it. The method is async def. It's being awaited correctly. Everything looks right.

But the NLP inference inside that method is CPU-bound. It's doing regex matching and model inference on every chunk of transcribed text. It never yields back to the event loop because there's nothing to yield to — it's doing computation, not I/O. Meanwhile, the audio frames that need to ship on a 20ms cadence are sitting in a queue, waiting for the event loop to come back around to them.

The guardrail runs, returns False (no PII detected, everything's fine), and in the 80-200ms it took to reach that conclusion, the caller heard dead air where smooth audio should have been.

# BEFORE: Looks async. Isn't.
async def check_pii(self, transcript_chunk: str) -> bool:
    # All CPU-bound work. No awaits, no yields.
    # The event loop is frozen while this executes.
    patterns = self._compile_patterns()
    matches = self._scan_patterns(transcript_chunk, patterns)
    if matches:
        return True

    # NLP model inference — the expensive part
    result = self.nlp_model.predict(transcript_chunk)
    return result.contains_pii

The mental model most Python developers carry is: "I used await, so it's non-blocking." But await only yields control if the called coroutine actually suspends — meaning it hits a real I/O wait or an explicit yield point. An async def that does CPU work without any suspension points is functionally synchronous. Python won't stop you from writing it. Python won't warn you. The only thing that tells you something is wrong is the audio quality.

The important constraint here: you can't remove the guardrail. It's compliance. You can't make the NLP inference faster — it takes what it takes. The fix is getting it off the audio hot path entirely:

# AFTER: Actually non-blocking. CPU work moves to a thread.
async def check_pii(self, transcript_chunk: str) -> bool:
    result = await asyncio.to_thread(self._detect_pii_sync, transcript_chunk)
    return result

def _detect_pii_sync(self, text: str) -> bool:
    """Runs in a thread pool, not on the event loop."""
    patterns = self._compile_patterns()
    matches = self._scan_patterns(text, patterns)
    if matches:
        return True
    result = self.nlp_model.predict(text)
    return result.contains_pii

asyncio.to_thread() moves the CPU-bound work to a thread and yields control back to the event loop immediately. The PII check still runs. Compliance is still met. But the event loop is free to keep shipping audio frames while the NLP model does its work in the background.

One line changed; the call signature didn't. The caller stopped hearing gaps.

What makes this worth writing about is the irony: the guardrail protecting the user experience was the thing degrading it. In a typical web application, 80-200ms of CPU work means a slightly slower HTTP response. Nobody notices. In a streaming audio pipeline, it means the event loop can't service the coroutines responsible for sending audio frames on time, and the caller hears it.
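If you want to see the difference rather than take it on faith, here is a minimal measurement sketch. A ticker coroutine wakes every 5ms and records the longest gap between wake-ups while a guardrail runs. `fake_inference` is a stand-in that uses `time.sleep`, which releases the GIL, so the offloaded number is a best case. Real CPU-bound inference in a thread still contends for the GIL.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def max_gap_during(work: Callable[[], Awaitable[None]]) -> float:
    """Tick every 5ms while work() runs; return the longest gap between ticks."""
    gaps: list = []

    async def ticker() -> None:
        last = time.perf_counter()
        while True:
            await asyncio.sleep(0.005)
            now = time.perf_counter()
            gaps.append(now - last)
            last = now

    task = asyncio.create_task(ticker())
    await asyncio.sleep(0.02)            # let the ticker establish a baseline
    await work()
    await asyncio.sleep(0.02)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return max(gaps)


def fake_inference() -> None:
    time.sleep(0.1)                      # stand-in for 100ms of blocking work


async def blocking_guardrail() -> None:
    fake_inference()                     # async def, but never yields


async def offloaded_guardrail() -> None:
    await asyncio.to_thread(fake_inference)
```

Compare `asyncio.run(max_gap_during(blocking_guardrail))` with the offloaded version: the blocking variant shows a gap of at least the full 100ms, while the offloaded one should stay close to the tick interval.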

Your Logger Is in the Hot Path

This one is subtle enough that I think most Python developers don't know about it.

Python's standard logging module uses StreamHandler by default. Here's what actually happens every time you call logger.info():

Handler.handle() acquires the handler's threading lock before it calls StreamHandler.emit()
emit() calls self.stream.write() and then flushes the stream, both blocking I/O operations
Even writing to stdout or stderr is blocking in CPython

That's a lock acquisition plus a synchronous write on every log call. In a web server, this is noise — your response already takes 50-200ms, and a few microseconds of lock contention doesn't register. On an audio hot path where frames need to ship every 20ms, you've added synchronous I/O to the critical timing loop.

During development you add logger.debug() throughout the audio path because you need visibility into what's happening. Reasonable. But every one of those calls is synchronous I/O that can introduce jitter. In a REST API, it doesn't matter. In a streaming audio pipeline, it does.

Logging isn't the only thing hiding in plain sight either. On the audio hot path, nothing is safe to assume is free:

json.dumps() / json.loads() on large payloads — CPU-bound, holds the GIL the entire time
DNS resolution inside aiohttp — can block if the system resolver is slow
File I/O that looks async but isn't — many "async" wrappers delegate to a thread pool, and if that pool is saturated, you wait
Python's string interpolation with file reads — we hit this one loading AI tool definitions dynamically in our orchestration package. Trying to be clever about caching actually introduced synchronous file I/O on a path that needed to be fast

The fix is two-fold:

First, audit your log levels. Production doesn't need DEBUG on the hot path. You'd be surprised how many logger.debug() calls survive into production with a permissive log level config. Remove what you don't need, bump the rest to levels that won't fire in production. This is free.

Second, replace StreamHandler with QueueHandler:

import logging
from logging.handlers import QueueHandler, QueueListener
from queue import Queue

log_queue = Queue()

# Drops log records into the queue and returns immediately
queue_handler = QueueHandler(log_queue)

# Separate thread drains the queue and writes at its own pace
stream_handler = logging.StreamHandler()
listener = QueueListener(log_queue, stream_handler)
listener.start()

logger = logging.getLogger("voice_pipeline")
logger.addHandler(queue_handler)
logger.setLevel(logging.INFO)

QueueHandler drops the log record into an in-memory queue and returns immediately. A QueueListener on a separate thread drains the queue at its own pace. The hot path never blocks on I/O. The log still gets written — just not synchronously.

Neither of these fixes requires rearchitecting anything. Together they remove synchronous I/O from every frame of your audio pipeline.

The GIL Doesn't Care About Your Deadlines

Even if you get your async discipline perfect — every CPU-bound operation offloaded to a thread, every logger swapped, every stdlib call audited — the GIL is still there.

The Global Interpreter Lock means only one thread executes Python bytecode at a time. Most Python concurrency advice hand-waves past this because for web workloads it genuinely doesn't matter much. Threads spend most of their time waiting on I/O, the GIL is released during I/O waits, and everyone gets a turn.

Real-time audio is different. You have CPU-bound work happening in threads (we just moved PII detection to a thread). You have the event loop on the main thread scheduling audio frame callbacks. The GIL means these take turns, not run in parallel. When the PII thread is holding the GIL for NLP inference, the event loop thread can't run. When the event loop can't run, audio frames don't ship.

What this looks like in production:

Latency spikes that only appear under load — one concurrent call is fine, fifty and you see jitter
Symptoms look identical to network jitter in your metrics, but it's scheduling contention inside your own process
Doesn't reproduce locally because your dev machine isn't running fifty concurrent sessions
The p50 and p99 look acceptable. The p99.9 is bad. And in voice, the p99.9 is what callers remember

The honest answer is that the GIL makes Python fundamentally limited for work where sub-millisecond scheduling guarantees matter. But "rewrite it in Rust" isn't practical for most teams, and the rest of the stack — orchestration, LLM integration, business logic — is genuinely well-served by Python. The practical approach is knowing the constraint exists and designing around it.

Free-Threaded Python: The Light at the End of the Tunnel

Python is finally making the GIL optional. PEP 703 introduced a free-threaded build, and as of Python 3.14 it's officially supported — though still opt-in, not the default.

In practice: you can build CPython with --disable-gil and get true multi-threaded parallelism. The PII detection thread could run in parallel with the event loop thread instead of taking turns. Several items on the fix hierarchy below could potentially collapse.

The caveats are real though. The ecosystem is still catching up — C extensions that assumed the GIL would protect shared state may not be thread-safe without it. Libraries that haven't been updated may re-enable the GIL automatically. And race conditions that the GIL was silently preventing will surface the moment you remove it.
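
Since a library can silently re-enable the GIL, it helps to check at runtime rather than trust the build flag. The snippet below is a minimal sketch: `gil_status` is a hypothetical helper name, and it relies on `sys._is_gil_enabled()`, an underscore-prefixed CPython function that only exists on Python 3.13 and later (hence the `getattr` fallback).

```python
import sys
import sysconfig


def gil_status() -> dict:
    """Report whether this is a free-threaded build and whether the GIL is active now."""
    # Py_GIL_DISABLED is 1 on free-threaded builds; 0 or None on regular builds.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

    # sys._is_gil_enabled() exists on CPython 3.13+. Older versions always run with the GIL.
    check = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = check() if check is not None else True

    return {"free_threaded_build": free_threaded_build, "gil_enabled": gil_enabled}


print(gil_status())
```

Logging this once at startup, after all imports, would show whether an extension module switched the GIL back on in a free-threaded interpreter.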

For real-time audio, this is genuinely promising. But it's a migration measured in years, not a weekend upgrade. Your STT clients, TTS clients, LLM SDKs, and orchestration frameworks all need to support it before you can flip the switch in production. Worth tracking closely, and something I look forward to experimenting with.

"Just Use Threads" — The Trap

At this point the instinct is obvious: ThreadPoolExecutor is right there, move everything off the event loop.

Sometimes that's the right call — asyncio.to_thread() fixed the PII guardrail cleanly. But "just use threads" as a general strategy is a trap:

Shared state becomes a problem. Audio pipelines have state — playback buffers, conversation context, agent state, connection metadata. Threading means reasoning about what's shared. Race conditions in audio manifest as once-in-a-thousand glitches: a frame sent out of order, a buffer read during a write, piping audio into a different user's websocket, a state update that arrives late. Nearly impossible to reproduce in testing.

Thread safety overhead can reintroduce latency. Locks fix shared state problems but introduce contention, which reintroduces the timing issues you were trying to solve.

The GIL means threads aren't truly parallel for CPU-bound work anyway. You've added complexity without gaining real concurrency where it matters most.

What Actually Works

When I find something blocking the audio hot path, my first questions are about the product, not the code:

"Do we need this on the hot path at all?" This is the question that junior engineers skip. They go straight to "how do I make this faster" when the answer is often "don't do it here." Can this work happen after the audio frame ships? Before the call starts? Does it need to run on every chunk, or can it batch?

"Where else can I put it?" If it needs to happen during the call, can it move to a background task without breaking anything? Often the answer is yes.

"If I have to move it, who do I talk to about feature parity?" Sometimes moving work off the hot path changes how a feature behaves. That's a product conversation, not just an engineering one.

Then the technical options, in order of preference:

Background tasks — fire-and-forget with asyncio.create_task() if you don't need the result immediately. Lowest cost.
Threads — asyncio.to_thread() for isolated CPU work like the PII guardrail. Keep the surface area small.
Multiprocessing — escapes the GIL, but IPC overhead adds its own latency. Worth it for heavy, long-running work.
Separate process — full isolation. Hot path and heavy processing don't share a GIL or memory. Real architectural cost, but real isolation guarantees.
Event loop inversion — give the audio hot path its own dedicated event loop. Nothing else runs on it, so nothing can starve it. This is the nuclear option.

The right choice depends on how close the work is to the audio stream. A guardrail that doesn't gate playback? Background task. Audio pacing logic that controls frame timing? Might need its own event loop.
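
The two cheapest options can be sketched side by side. Everything named here (`fire_and_forget`, `record_metrics`, `cpu_heavy_check`, `handle_frame`) is hypothetical and stands in for real pipeline code. Note that the threaded call still shares the GIL with the event loop, so this only helps when the work is isolated and short.

```python
import asyncio
import hashlib

# Hold strong references so fire-and-forget tasks are not garbage-collected mid-flight.
_background_tasks: set[asyncio.Task] = set()


def fire_and_forget(coro) -> asyncio.Task:
    """Schedule a coroutine without awaiting it on the hot path."""
    task = asyncio.create_task(coro)
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)
    return task


def cpu_heavy_check(text: str) -> str:
    """Hypothetical stand-in for CPU-bound work such as a PII scan."""
    digest = text.encode()
    for _ in range(10_000):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


async def record_metrics(store: list, frame_no: int) -> None:
    """Hypothetical stand-in for work whose result the hot path never waits on."""
    await asyncio.sleep(0)
    store.append(frame_no)


async def handle_frame(frame_no: int, transcript: str, store: list) -> str:
    # Background task: the frame handler does not wait for the result.
    fire_and_forget(record_metrics(store, frame_no))
    # Thread: the event loop keeps scheduling other callbacks while this runs.
    return await asyncio.to_thread(cpu_heavy_check, transcript)
```

Keeping the task set is a small detail that matters: the event loop only holds weak references to tasks, so a task with no other reference can disappear before it finishes.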

The Free Lunch: uvloop

After all of the above, here's something you can do in five minutes that actually helps.

uvloop is a drop-in replacement for asyncio's event loop, written in Cython on top of libuv — the same library that powers Node.js. It's faster at everything the event loop does: iterating, dispatching callbacks, resolving timers, handling I/O.

import uvloop
uvloop.install()
# That's it.

In a real-time audio pipeline where the event loop drives frame timing, faster iteration means tighter scheduling means smoother audio. With the default asyncio event loop, I was seeing event loop congestion anywhere from 10 ms to 3 seconds, meaning the event loop was stuck for that long, unable to do anything else. After switching to uvloop, that number dropped to single-digit milliseconds. Still some congestion, but much better than waiting seconds for the event loop to run the next scheduled operation.
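
One way to put a number on event loop congestion is a lag monitor: sleep for a fixed interval and record how late each wake-up arrives. This is a minimal sketch with hypothetical names (`measure_loop_lag`, `demo`), not the instrumentation behind the figures above. The `time.sleep` call is a deliberate blocking call so the effect is visible.

```python
import asyncio
import time


async def measure_loop_lag(interval: float, samples: int) -> list[float]:
    """Sleep for `interval` repeatedly and record how late each wake-up was, in seconds."""
    lags = []
    for _ in range(samples):
        start = time.perf_counter()
        await asyncio.sleep(interval)
        lags.append(max(0.0, time.perf_counter() - start - interval))
    return lags


async def demo() -> float:
    monitor = asyncio.create_task(measure_loop_lag(interval=0.01, samples=10))
    await asyncio.sleep(0.015)  # let the monitor take its first sample
    time.sleep(0.1)             # blocking call: the loop cannot run for about 100 ms
    lags = await monitor
    return max(lags)


worst = asyncio.run(demo())
print(f"worst wake-up delay: {worst * 1000:.1f} ms")
```

Running the same monitor alongside real traffic, before and after a change such as installing uvloop, gives a like-for-like comparison of scheduling delay.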

What it doesn't fix: the GIL, blocking code, CPU-bound coroutines that starve the loop. Every problem from the previous sections still applies. uvloop makes a healthy event loop faster — it can't fix a broken one.

But after spending hours tracking down blocking calls and refactoring thread strategies, a one-line change that measurably improves scheduling is a nice win.

The Takeaway

Python is the right language for building voice AI systems. The orchestration, the LLM integration, the business logic, the rapid prototyping: the ecosystem is unmatched, though TypeScript's AI ecosystem is becoming more robust by the day.

But the audio hot path operates under timing constraints that Python's async model wasn't designed for. The issues I've described here aren't Python bugs — they're assumption gaps. Assumptions that async def means non-blocking, that the stdlib is fast enough to be invisible, that the GIL only matters for batch processing, that threads give you parallelism.

Every one of those assumptions is reasonable in the context of a web application. Every one of them will degrade your production audio quality in a voice AI pipeline.

The fix isn't rewriting everything in Rust/C++ or abandoning Python. It's knowing where the constraints are, designing around them, keeping CPU-bound work off the path where milliseconds are audible, and architecting your system with these constraints in mind.

More on that in the next blog post: The Audio Gateway.

This is part of a series on building production voice AI systems. Previously: Dude, Where's My Response? Cutting 600ms from Every Voice AI Turn with Local VAD | Your Voice Agent Needs Two Brains: Building Multi-Thinker on OpenAI's Realtime API | I Tested Our WebSocket Audio Pipeline with WebRTC. Here's Why I Switched It Back.

Voice AI: Fast and Dumb or Slow and Smart — Why Not Fast and Smart?

Nick Lackman — Mon, 06 Apr 2026 23:26:29 +0000

Many voice AI demos connect the browser directly to a real-time audio model API and let the server decide when you've stopped talking. That's a demo architecture with a built-in latency tax that quickly breaks down in production. Here's the production alternative: a backend-mediated, multi-thinker voice system with local voice activity detection that owns the entire audio pipeline end-to-end.

I spent the last year and a half building production voice AI systems that handle thousands of calls per day. This post covers the architecture I wish someone had documented when I started: how to make your voice AI product fast and smart, what the Responder-Thinker pattern is, why single-thinker breaks, how to build multi-thinker with your backend in the middle, and why local VAD is the key to making it feel instant.

The companion repo is fully functional — clone it, run it, talk to it (OpenAI API Key Required): github.com/lackmannicholas/responder-thinker

The Latency Budget You Can't Meet

Before the Realtime API existed, voice AI meant chaining three models in series: speech-to-text, an LLM, then text-to-speech. The math doesn't work.

STT endpointing and recognition eats 500-1000ms. The LLM's time-to-first-token adds another 500-1500ms. TTS synthesis takes 200-500ms. You're at 1.2-3 seconds minimum before the caller hears a single syllable — and conversational turn-taking breaks down around 800ms of silence.

In my previous post, I showed that server-side voice activity detection alone adds 500ms+ of unnecessary overhead to every turn. But even after fixing that, the serial pipeline architecture is the bottleneck. You can't engineer your way to natural conversation speed with a pipeline. The architecture has to change.

The Realtime API: Fast, But Not Smart Enough

OpenAI's Realtime API collapses the STT → LLM → TTS pipeline into a single API call. Latency drops to sub-second. The conversation finally feels naturalish.

But there's a tradeoff. The realtime model is conversational and fast, but compared to text-based models like GPT-5.4, it struggles with complex multi-step instructions, structured tool use, and domain-specific accuracy. It hallucinates more. Its instruction-following degrades as the system prompt grows.

A voice agent that responds instantly but gives wrong information is worse than one that takes two seconds and gets it right. The Realtime API solved the latency problem and created an intelligence problem.

Enter Responder-Thinker

The Responder-Thinker pattern resolves this by splitting responsibilities:

The Responder (Realtime API) is "always on". It handles conversational flow — greetings, acknowledgments, stalling, turn-taking. It's fast and socially intelligent. When the user asks something that needs real data or complex reasoning, the Responder classifies the intent and hands off to a Thinker.

The Thinker (text-based model) runs in the background. It has a focused system prompt, domain-specific tools, and the reasoning capability to get the answer right. When it's done, the result is injected back into the Realtime API conversation, and the Responder delivers it naturally.

The insight: you don't need your real-time voice to be smart. You need it to be present while the smart thing works in the background.

This pattern comes from OpenAI — their openai-realtime-agents repo calls it "Chat-Supervisor." The concept isn't new. Making it production-grade is the hard part.

Why Single-Thinker Breaks

The simplest implementation has one generalist Thinker handling everything — weather, stocks, news, FAQ, escalation. In my experience, this breaks fast.

The system prompt grows to accommodate every domain, and quality degrades across all of them. A weather lookup and a complex knowledge question go through the same agent with the same overhead. You can't tune one domain without risking regressions in the others. You can't use a cheaper model for simple lookups and a smarter model for hard reasoning; it's one model for everything. You have to vertically scale the model capability based on your most complex task, so lighter tasks are "over-provisioned" in terms of model usage.

Single-thinker is a monolith. Multi-thinker is microservices. The voice AI industry is learning the same architectural lessons backend engineering learned fifteen years ago.

In a multi-thinker architecture, each Thinker owns a domain with a focused prompt and its own tools. Weather uses gpt-5.4-mini with a live weather API. News uses gpt-5.4 because summarization requires more reasoning. Each can be tested, cached, and optimized independently.

The Realistic Production Architecture

Here's where this implementation diverges from most tutorials you'll find.

Many demos connect the browser directly to OpenAI's Realtime API via WebRTC. The browser gets an ephemeral token, establishes a peer connection, and audio flows between the user and OpenAI with nothing in between. It's not how production voice systems work.

In production — Twilio, SIP trunks, contact centers — audio always flows through your backend. This architecture puts your backend in the middle:

Browser ←—WebRTC—→ Python Backend ←—WebSocket—→ OpenAI Realtime API
                        │
                   Thinker Agents

The browser connects to a FastAPI server via WebRTC (using aiortc for server-side WebRTC). The backend opens a WebSocket to OpenAI's Realtime API and streams audio bidirectionally, resampling between 48kHz (WebRTC) and 24kHz (Realtime API) using libswresample for proper anti-aliased conversion.

What this gives you that direct connection doesn't:

Interception: the backend sees every event between the user and the model. Tool calls route to your server-side agents, not browser JavaScript. This is important for conversation aggregation, metrics, and downstream analytics.
State management: Redis-backed conversation history, cross-session user memory, per-domain result caching.
Local VAD: your backend owns turn detection, not OpenAI's servers. This is where hundreds of milliseconds live.
Security: API keys never touch the browser.
Transport flexibility: the same backend works for WebRTC browsers and telephony SIP trunks.

Local VAD: Owning Turn Detection End-to-End

This is the piece that makes the architecture feel instant.

Most implementations of the OpenAI Realtime API use semantic_vad or server_vad in the session config and let OpenAI decide when the user stopped talking. That means every audio frame travels to OpenAI's servers, their VAD processes it, they decide the turn is over, and only then does the model start generating a response. That round-trip is hundreds of milliseconds you're paying on every single turn.

My implementation replaces this entirely with local voice activity detection. The backend runs a TEN VAD model that processes audio locally and makes the turn detection decision on your own hardware, with zero network round-trip:

# When local VAD is active, server-side turn detection is completely disabled.
# The backend owns the full pipeline: detect speech end → commit buffer → trigger response.

if self._vad_gate is not None:
    result = self._vad_gate.process(pcm16_bytes)

    # Speech onset: interrupt if audio is still playing
    if result.speech_started:
        if self._response_active or has_queued_audio:
            await self._handle_interrupt()

    # Speech end: commit and request response immediately
    if result.speech_ended:
        asyncio.create_task(self._commit_and_respond())
else:
    chunks_to_send = [pcm16_bytes]  # fallback: send everything, let OpenAI decide

The VAD gate uses a three-state machine — SILENCE, SPEECH, and HANGOVER — with a pre-roll buffer that preserves audio from just before speech onset. When speech ends, the backend immediately commits the audio buffer and sends response.create. No server-side VAD involved. No round-trip. The Realtime API starts generating the instant it receives the committed buffer.
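
The state machine described above can be sketched as follows. This is an illustration, not the repo's implementation: `VadGateSketch`, `GateResult`, and the frame-count parameters are hypothetical, and the per-frame `is_speech` decision is passed in where the real gate would get it from the VAD model.

```python
from collections import deque
from dataclasses import dataclass, field
from enum import Enum, auto


class State(Enum):
    SILENCE = auto()
    SPEECH = auto()
    HANGOVER = auto()


@dataclass
class GateResult:
    speech_started: bool = False
    speech_ended: bool = False
    chunks_to_send: list = field(default_factory=list)


class VadGateSketch:
    """Three-state gate with a pre-roll buffer. Frame counts are illustrative assumptions."""

    def __init__(self, pre_roll_frames: int = 3, hangover_frames: int = 5):
        self.state = State.SILENCE
        self.pre_roll = deque(maxlen=pre_roll_frames)  # audio from just before onset
        self.hangover_frames = hangover_frames
        self.silent_run = 0

    def process(self, frame: bytes, is_speech: bool) -> GateResult:
        result = GateResult()
        if self.state is State.SILENCE:
            if is_speech:
                # Onset: flush the pre-roll so the first syllable is not clipped.
                result.chunks_to_send = [*self.pre_roll, frame]
                self.pre_roll.clear()
                result.speech_started = True
                self.state = State.SPEECH
            else:
                self.pre_roll.append(frame)
            return result

        # SPEECH or HANGOVER: keep forwarding audio.
        result.chunks_to_send = [frame]
        if is_speech:
            self.state = State.SPEECH
            self.silent_run = 0
        else:
            self.state = State.HANGOVER
            self.silent_run += 1
            if self.silent_run >= self.hangover_frames:
                # Enough consecutive silence: the turn is over.
                result.speech_ended = True
                self.state = State.SILENCE
                self.silent_run = 0
        return result
```

The HANGOVER state is what keeps a short pause mid-sentence from ending the turn: speech resuming before the silent run reaches the threshold drops the gate straight back into SPEECH.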

The _commit_and_respond method uses the same _response_create_lock that protects thinker result injection and idle nudges, because all of them compete for the same response.create API constraint:

async def _commit_and_respond(self):
    await self._realtime_ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
    async with self._response_create_lock:
        await self._response_done.wait()
        if self._running:
            await self._realtime_ws.send(json.dumps({"type": "response.create"}))

The result: it feels like the agent starts responding before you've finished talking. It isn't really, but the gap between end-of-speech and first audio byte is so small that it feels that way. This is the same VAD research I published previously — 689ms improvement measured in controlled testing — now integrated into a full production architecture.

Routing: The Dumbest Model Makes the Most Important Decision

The Responder classifies intent via a single tool call — route_to_thinker(domain, query). The domain is constrained to a fixed enum:

ROUTE_TO_THINKER_TOOL = {
    "type": "function",
    "name": "route_to_thinker",
    "parameters": {
        "type": "object",
        "properties": {
            "domain": {
                "type": "string",
                "enum": ["weather", "stocks", "news", "knowledge", "research"],
            },
            "query": {
                "type": "string",
                "description": "The user's question, rephrased for the specialist.",
            },
        },
    },
}

This is architecturally interesting because your dumbest model is making the most important decision. And that's the right tradeoff. Routing needs to be fast — 100ms, not 2 seconds. The Responder already has full conversational context. And "what kind of question is this?" is a dramatically simpler task than "what's the answer?" Constraining routing to a fixed enum makes misclassification rare and fallback trivial: unknown domains go to the Knowledge Thinker.
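
The fallback can be as small as a dictionary lookup with a default. A minimal sketch, assuming each thinker is an async callable taking the query; the registry, the two stub thinkers, and their return strings are hypothetical:

```python
import asyncio
from typing import Awaitable, Callable

Thinker = Callable[[str], Awaitable[str]]


async def weather_thinker(query: str) -> str:
    return f"[weather] {query}"


async def knowledge_thinker(query: str) -> str:
    return f"[knowledge] {query}"


# Hypothetical registry keyed by the enum values in the tool schema.
THINKERS: dict[str, Thinker] = {
    "weather": weather_thinker,
    "knowledge": knowledge_thinker,
}


async def route_to_thinker(domain: str, query: str) -> str:
    # Anything outside the registry falls back to the generalist.
    thinker = THINKERS.get(domain, knowledge_thinker)
    return await thinker(query)


print(asyncio.run(route_to_thinker("astrology", "what's my sign?")))
```

Because the enum in the tool schema already limits what the model can emit, the default branch should rarely fire, but it means a schema and registry mismatch degrades to a generalist answer instead of an exception.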

The bridge intercepts the tool call and dispatches the Thinker concurrently so the Responder keeps talking:

case "response.function_call_arguments.done":
    asyncio.create_task(self._handle_tool_call(event))

Production Failure States and Three Guards Against Them

When the Thinker returns a result, you can't just inject it and call response.create. Three things can go wrong when handling real users:

Guard 1: The user interrupted. While the Thinker was working, the user barged in with a new question. The Thinker's result is stale. You still submit the tool output (the API requires it), but you don't ask the Responder to speak a stale answer.

dispatched_turn_id = self._turn_id  # snapshot before dispatch

# ... thinker runs ...

if self._turn_id != dispatched_turn_id:
    return  # stale — user moved on

Guard 2: The Responder is still talking. The Realtime API silently drops response.create while it's already generating a response — like the "let me check on that" filler. This is the primary cause of the "thinker came back but nothing happened" bug. You have to wait, and you have to serialize all callers:

async with self._response_create_lock:
    await asyncio.wait_for(self._response_done.wait(), timeout=10.0)

The lock serializes every response.create caller — thinker results, the local VAD commit path, idle nudges, and disconnect goodbyes — because they all compete for the same API constraint.

Guard 3: The user interrupted during the wait. After Guard 2 releases, check staleness again. The user could have barged in while you were blocked.

    if self._turn_id != dispatched_turn_id:
        return  # stale after wait

    await self._realtime_ws.send(json.dumps({"type": "response.create"}))

Real callers interrupt, change their minds, and don't wait politely for the AI to finish thinking. You need to handle each one to have a system that feels as close to a human conversation as possible.
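
Put together, the three guards fit in one method. The sketch below is a trimmed-down illustration, not the repo's code: `BridgeSketch` and `deliver_thinker_result` are hypothetical names, `send` stands in for the WebSocket, and treating a timeout as "skip the response" is my assumption.

```python
import asyncio
import json


class BridgeSketch:
    """Hypothetical bridge holding only the state the guards need."""

    def __init__(self, send):
        self._send = send  # async callable standing in for the WebSocket send
        self._turn_id = 0
        self._response_create_lock = asyncio.Lock()
        self._response_done = asyncio.Event()
        self._response_done.set()  # no response in flight initially

    async def deliver_thinker_result(self, dispatched_turn_id: int, call_id: str, output: str) -> bool:
        """Submit the tool output, then request a spoken response only if still relevant."""
        # The tool output is always submitted, even when the result is stale.
        await self._send(json.dumps({
            "type": "conversation.item.create",
            "item": {"type": "function_call_output", "call_id": call_id, "output": output},
        }))

        # Guard 1: the user moved on while the thinker was working.
        if self._turn_id != dispatched_turn_id:
            return False

        async with self._response_create_lock:
            # Guard 2: wait until the current response has finished.
            try:
                await asyncio.wait_for(self._response_done.wait(), timeout=10.0)
            except asyncio.TimeoutError:
                return False  # assumption: give up rather than send a dropped request

            # Guard 3: the user may have barged in while we were waiting.
            if self._turn_id != dispatched_turn_id:
                return False

            await self._send(json.dumps({"type": "response.create"}))
            return True
```

The return value makes the outcome explicit for logging: `True` means the Responder was asked to speak the result, `False` means the result was submitted but intentionally not spoken.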

Barge-In Handling

When local VAD detects speech onset while the Responder is outputting audio, the bridge does three things:

async def _handle_interrupt(self):
    # 1. Invalidate in-flight thinker tasks
    self._turn_id += 1

    # 2. Cancel the active response
    if self._response_active:
        await self._realtime_ws.send(json.dumps({"type": "response.cancel"}))
        self._response_active = False
        self._response_done.set()

    # 3. Flush queued audio so the speaker stops immediately
    if self.audio_track:
        self.audio_track.output_track.clear()

Incrementing _turn_id is the key move. Every in-flight thinker task holds a snapshot of the turn ID from when it was dispatched. When it returns, Guard 1 catches the mismatch and discards the result. No stale answers, no race conditions, no complex cancellation logic.

With local VAD, barge-in detection is also local — the backend sees speech onset in the VAD state machine before any audio reaches OpenAI. The interrupt fires faster than server-side detection could.

Context Is Not Just Conversation History

A caller asking "is a two-bedroom available?" means nothing without property context. "Same unit as last time" means nothing without user context. In production, managing multiple types of structured context beyond raw conversation history is paramount to giving your conversation a personal feel as well as better model performance.

The repo demonstrates this with a typed UserContext model persisted in Redis — preferences, memory facts, conversation summaries, and behavioral signals — keyed by browser fingerprint for cross-session persistence:

class UserContext(BaseModel):
    preferences: Preferences       # name, location, temp unit, watched tickers
    memory: MemoryStore            # inferred facts, deduped, capped at 20
    summary: Summary               # rolling LLM-generated conversation summary
    signals: Signals               # topic counts, session count, last active

Thinkers return a ThinkResult that includes an optional ContextUpdate — a class describing what the thinker learned. The router applies updates after the thinker returns:

class ThinkResult(BaseModel):
    response: str
    context_update: ContextUpdate | None = None

The Weather Thinker persists the user's location. The Knowledge Thinker picks up on it without being told. Context isn't trapped in a single agent's conversation. It's a shared, typed resource that any thinker can read from and contribute to. When context changes, the Responder's system prompt is updated mid-session via session.update so it immediately knows what the thinkers learned.
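
Applying an update might look like the sketch below. It uses plain dataclasses as a stand-in for the Pydantic models, and the field names (`location`, `new_facts`, `facts`) and the keep-most-recent eviction policy are my assumptions; only the dedupe and the cap of 20 come from the description above.

```python
from dataclasses import dataclass, field
from typing import Optional

MEMORY_CAP = 20  # matches the "capped at 20" note on MemoryStore


@dataclass
class ContextUpdateSketch:
    """Hypothetical shape of what a thinker learned during one call."""
    location: Optional[str] = None
    new_facts: list = field(default_factory=list)


@dataclass
class UserContextSketch:
    """Hypothetical, flattened stand-in for UserContext."""
    location: Optional[str] = None
    facts: list = field(default_factory=list)


def apply_update(ctx: UserContextSketch, update: Optional[ContextUpdateSketch]) -> UserContextSketch:
    """Merge a thinker's update into the shared context."""
    if update is None:
        return ctx
    if update.location is not None:
        ctx.location = update.location
    for fact in update.new_facts:
        if fact not in ctx.facts:  # dedupe
            ctx.facts.append(fact)
    ctx.facts = ctx.facts[-MEMORY_CAP:]  # assumption: keep the most recent facts
    return ctx
```

Keeping the merge in one function, owned by the router rather than by each thinker, means every thinker contributes through the same typed path.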

What I Learned

The cost of implementing your own turn detection with a local VAD is well worth it. The latency improvement isn't incremental — it's the difference between "this feels like talking to a computer" and "this feels like talking to someone." Owning the turn detection pipeline means you control the most latency-sensitive decision in the entire system. If you're building on the Realtime API and not doing local VAD, you're leaving hundreds of milliseconds on the table on every turn.

The routing decision matters more than the reasoning quality. A perfectly accurate Thinker routed to the wrong domain produces a wrong answer. A slightly less accurate Thinker routed correctly produces a useful one. Invest in your routing prompt and your domain enum. Simple and strict rules help the dumb realtime model perform routing well. An additional consideration is using a separate LLM call to classify, but with only a handful of potential tool calls, the realtime API can do that just fine.

Stalling is a prompt engineering problem, not a code problem. The Realtime API naturally acknowledges the user before executing the tool call. Your system prompt just needs to tell it how. The Research Thinker in the repo simulates a 30-second delay specifically to stress-test this.

Multi-thinker is worth the complexity. Independent prompts, independent model tiers, independent caching TTLs, independent testing. The overhead of managing multiple agents is far less than the quality cost of a bloated single-thinker prompt.

Backend mediation is not optional for production. Direct browser-to-OpenAI works for demos. The moment you need state, security, observability, local VAD, or telephony support, your backend has to be in the middle. The upfront work will save you time in the long run.

The three guards make it feel alive. The "thinker returned but nothing happened" bug is a frustrating one to debug in production; Guard 2 fixes it, so the user isn't left hanging no matter what. The stale-result-after-interrupt bug only manifested when callers talked fast; Guards 1 and 3 catch it, so callers get the answer that reflects their latest question. These are things I wish I had known or discovered without the pain of production issues.

The full implementation — local VAD, multi-thinker routing, typed user context, LangSmith observability, Docker deployment, and a 30-second research thinker for stress-testing stalling behavior — is at github.com/lackmannicholas/responder-thinker. Clone it, run it, talk to it.

Previously: Cutting 600ms from Every Voice AI Turn with Local VAD
Coming next: Adding guardrails and voice quality evals to the Responder-Thinker pattern.

Dude, Where's My Response? Cutting 600ms from Every Voice AI Turn with Local VAD

Nick Lackman — Sat, 21 Mar 2026 02:45:02 +0000

If you're building voice AI on OpenAI's Realtime API, your agent is slower than it needs to be — the main bottleneck is certainly inference but there's additional overhead to cut.

I spent the past week instrumenting a production telephony voice pipeline, measuring where latency actually lives, and testing whether local voice activity detection (VAD) could meaningfully reduce response time. The answer is yes — by 689ms per turn on substantive responses — and the methodology is cleaner than I expected.

Here's what I found, how I measured it, and why it matters for anyone building conversational AI on the Realtime API.

The Hidden Latency Tax

When you build a voice agent on OpenAI's Realtime API — whether you're using the OpenAI Agents SDK, a custom WebSocket implementation, or any orchestration framework — the audio pipeline follows the same path:

The user speaks, and your telephony provider (Twilio, in my case) streams audio frames to your server
Your server forwards every audio frame to OpenAI's Realtime API via WebSocket (input_audio_buffer.append)
OpenAI's server-side VAD (semantic_vad, the default) processes the audio and decides when the user has stopped talking
Only after the server-side VAD commits the audio buffer does the LLM begin generating a response
The generated audio streams back to your server and out to the caller

The problem is step 3. Every VAD decision requires a network round-trip. The audio has to travel to OpenAI's server, get processed by their turn detection model, and the commit decision happens server-side. Your code doesn't even participate — if you look at the OpenAI Agents SDK source, input_audio_buffer.speech_stopped is handled as an informational notification. The server has already committed and started response generation by the time your code hears about it.

This adds an irreducible network latency plus server-side model deliberation time on every single turn. And in a conversational AI system, latency after the user stops speaking is the most perceptible kind — it's the moment they're actively waiting.

The Approach: Local VAD + Manual Turn Control

The Realtime API supports disabling server-side turn detection entirely. When you set turn_detection to null, the server stops making autonomous commit decisions, and you take control of when to send input_audio_buffer.commit and response.create.

This means you can run a VAD model locally on your server, process the same audio frames as they arrive from Twilio — before they're even sent to OpenAI — and commit the turn the moment you detect silence. The audio is already on your machine. There's no round-trip to wait for.

I used TEN VAD (by Agora) as the local model, running via ONNX Runtime. More on why TEN VAD below.

Why Not Just Use Silero?

I evaluated three tiers of VAD before settling on TEN VAD:

Energy-based VAD (WebRTC VAD, fast-vad) uses signal processing — energy levels, spectral characteristics, zero-crossing rates — to make binary speech/no-speech decisions. Extremely fast, but can't distinguish speech energy from background noise. WebRTC VAD misses roughly 1 out of every 2 speech frames at a 5% false positive rate. Not viable for production turn detection.

Silero VAD is the industry-standard ML-based VAD — an LSTM-based architecture trained on 6,000+ languages, available as an ONNX model. Significantly more accurate than energy-based approaches. But it has a meaningful limitation for conversational AI: it suffers from a multi-hundred-millisecond delay when detecting speech-to-silence transitions. The recurrent architecture needs several silence frames to shift its internal state, which translates directly to turn detection delay.

TEN VAD (by Agora) is purpose-built for real-time conversational AI turn detection. Agora has 10+ years of experience in real-time voice infrastructure, and it shows. In my testing, TEN VAD detected speech-to-silence transitions with a median head start of 722ms over OpenAI's server-side VAD, compared to 342ms for Silero under the same conditions. It also achieves a 32% lower Real-Time Factor and 86% smaller library footprint than Silero, which matters when you're running VAD alongside everything else in a voice pipeline.

The key advantage for turn detection is transition speed. TEN VAD operates on 16kHz audio with 10ms frame hops, giving it finer temporal resolution than Silero's minimum 32ms chunks. It correctly identifies short silent durations between adjacent speech segments that Silero misses entirely.

Test Methodology

Measuring this correctly turned out to be the hardest part. The naive approach — comparing "local VAD detected silence at time X" vs "server sent speech_stopped at time Y" — has a fundamental bias: the server's speech_stopped event arrives after the server has already begun processing, so it makes server-side VAD look artificially fast.

The solution: use local VAD as a passive timestamp observer in both configurations. In the server-side VAD test runs, TEN VAD runs locally but doesn't commit or trigger responses — it only records when it detects silence. This gives both configurations the same "true speech end" anchor point.

The test protocol:

50 turns per configuration — local VAD + commit vs server-side semantic_vad
Scripted test calls from a cell phone through production Twilio PSTN infrastructure (8kHz µ-law audio)
Common measurement anchor: both configurations measure perceived latency from the true moment speech ends, as detected by the passive local TEN VAD observer
Controlled quiet-room environment to isolate the VAD comparison from acoustic variability
Perceived latency defined as: true speech end → first audio byte emitted to the caller

Filler Response Segmentation

An important methodological consideration: the LLM non-deterministically generates "filler" responses (e.g., "Let me look that up for you") that respond in under 1 second. Server-side VAD received 44% fillers vs 32% for local VAD in my test runs, which biases the unsegmented comparison. I present results segmented by response type to control for this.

Results

Non-Filler Turns (Primary Comparison)

These are substantive AI responses where the LLM performs real inference. LLM latency is closely matched between configurations, isolating the VAD effect.

Metric	Local VAD	Server VAD	Delta
Sample size	34 turns	28 turns
Perceived latency (median)	2,412ms	3,101ms	-689ms
Perceived latency (mean)	2,396ms	3,216ms	-820ms
LLM latency (median)	2,183ms	2,263ms	~equal
Cohen's d	1.04 (large)
Significance	p < 0.001	t = 3.93

22% reduction in perceived latency with closely matched LLM latency, confirming the improvement is attributable to the VAD change, not LLM variance.

Filler Turns (Cleanest Proof of VAD Effect)

Filler turns provide the cleanest isolation because LLM latency is virtually identical — the entire improvement is pure VAD overhead.

Metric	Local VAD	Server VAD	Delta
Sample size	16 turns	22 turns
Perceived latency (median)	679ms	1,134ms	-454ms
LLM latency (mean)	519ms	517ms	~equal
Cohen's d	1.74 (very large)
Significance	p < 0.001	t = 5.81

40% reduction. With LLM latency at 519ms vs 517ms (effectively identical), the entire 454ms improvement is pure VAD overhead eliminated. This is the irreducible cost of server-side turn detection made visible.

Response Time Distribution

The distribution shift tells the most compelling story:

Threshold	Local VAD	Server VAD
Under 1 second	28%	4%
Under 1.5 seconds	42%	36%
Under 2.5 seconds	78%	54%
Under 3 seconds	92%	70%

28% of local VAD turns respond in under 1 second vs just 4% for server-side VAD. Sub-second response time is a qualitatively different user experience: it's the difference between a conversation that feels like talking to a person and one that feels like waiting for a system.
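A table like this can be produced from raw per-turn latencies with a few lines. The function below is a minimal sketch; the input list is illustrative, not the data behind the table.

```python
def share_under(latencies_ms, thresholds_ms):
    """Percentage of turns whose perceived latency falls under each threshold."""
    n = len(latencies_ms)
    return {
        t: 100 * sum(1 for x in latencies_ms if x < t) / n
        for t in thresholds_ms
    }


# Illustrative latencies only
example = [500, 900, 1200, 2600]
distribution = share_under(example, [1000, 1500, 2500, 3000])
print(distribution)
```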

Over a 10-turn call, the cumulative improvement is approximately 5–7 seconds.

How to Implement This

The Realtime API makes this straightforward. The key is setting turn_detection to null in your session configuration, which puts you in manual turn control mode:

import json

# Disable server-side VAD (assumes websocket is an already-open Realtime API connection)
session_update = {
    "type": "session.update",
    "session": {
        "type": "realtime",
        "audio": {
            "input": {
                "turn_detection": None  # Manual turn control
            }
        }
    }
}
await websocket.send(json.dumps(session_update))

# When your local VAD detects end of speech:
await websocket.send(json.dumps({
    "type": "input_audio_buffer.commit"
}))
await websocket.send(json.dumps({
    "type": "response.create",
    "response": {"output_modalities": ["audio"]}
}))

If you're using the OpenAI Agents SDK (Python), the same mechanism works through the session's manual turn control:

await session.send_audio(audio_bytes, commit=True)

The approach works identically regardless of your orchestration framework — it's all the same Realtime API WebSocket protocol underneath.

For the local VAD model, TEN VAD is available on Hugging Face with ONNX weights and Python bindings. Silero VAD is the more established alternative if you want a simpler setup, though you'll see slower transition detection.
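Whichever model you pick, something has to turn its per-frame output into a single "end of speech" event. The sketch below assumes the VAD yields one speech probability per hop; the class name, the 0.5 threshold, and the 300ms silence window are placeholder assumptions, not the values used in these tests and not part of either library's API.

```python
class EndOfSpeechDetector:
    """Fires once when speech is followed by enough consecutive silent frames."""

    def __init__(self, hop_ms: int = 10, threshold: float = 0.5,
                 min_silence_ms: int = 300):
        self.threshold = threshold
        self.silence_frames_needed = max(1, min_silence_ms // hop_ms)
        self.in_speech = False
        self.silent_run = 0

    def update(self, speech_prob: float) -> bool:
        """Feed one frame's speech probability; return True at end of speech."""
        if speech_prob >= self.threshold:
            self.in_speech = True
            self.silent_run = 0
            return False
        if not self.in_speech:
            return False  # silence before any speech: nothing to end
        self.silent_run += 1
        if self.silent_run >= self.silence_frames_needed:
            self.in_speech = False
            self.silent_run = 0
            return True
        return False


detector = EndOfSpeechDetector(hop_ms=10, min_silence_ms=30)
probs = [0.1, 0.9, 0.9, 0.2, 0.1, 0.1, 0.1]
events = [detector.update(p) for p in probs]
print(events)
```

The frame where `update` returns True is where the input_audio_buffer.commit and response.create messages shown earlier would be sent. A shorter silence window responds faster but risks cutting callers off mid-pause.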

What's Next: Speculative Response Generation

With local VAD handling turn detection, the remaining bottleneck is LLM inference (~2.2s median on non-filler turns). The next optimization I'm exploring is speculative response generation: using the local VAD's early silence detection to trigger LLM inference before we're fully certain the user has finished speaking. This allows a much tighter local VAD configuration than would be safe in production on its own, because OpenAI's server-side VAD still confirms the end of the turn.

The generated audio would be buffered rather than played immediately. If the user continues speaking, we discard the speculative response. If they're done, the response is already generated and plays almost instantly.
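The buffer-then-decide logic can be sketched as a small state machine. This is a hypothetical illustration of the idea, not the prototype's code; the class and method names are assumptions.

```python
class SpeculativeResponse:
    """Holds speculative audio until the end of turn is confirmed or refuted."""

    def __init__(self):
        self._chunks = []
        self.state = "idle"  # idle -> speculating -> confirmed | discarded

    def start(self) -> None:
        """Local VAD fired early: begin buffering instead of playing."""
        self._chunks.clear()
        self.state = "speculating"

    def on_audio_delta(self, chunk: bytes) -> None:
        if self.state == "speculating":
            self._chunks.append(chunk)

    def on_user_resumed(self) -> None:
        """The caller kept talking: throw the speculative audio away."""
        if self.state == "speculating":
            self._chunks.clear()
            self.state = "discarded"

    def on_turn_confirmed(self) -> bytes:
        """End of turn confirmed: release everything buffered so far."""
        if self.state != "speculating":
            return b""
        audio = b"".join(self._chunks)
        self._chunks.clear()
        self.state = "confirmed"
        return audio


spec = SpeculativeResponse()
spec.start()
spec.on_audio_delta(b"abc")
spec.on_audio_delta(b"def")
released = spec.on_turn_confirmed()
print(released)
```

A real implementation would also need to cancel the in-flight response on the server when discarding, which this sketch leaves out.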

The Realtime API supports a hybrid configuration for this: set turn_detection.create_response = false and turn_detection.interrupt_response = false. This keeps semantic_vad running as a signal while leaving response timing under your control — the best of both worlds.
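A sketch of what that session update might look like, assuming the same nesting as the session.update message shown earlier in this post. The field placement is an assumption carried over from that example and should be checked against the current API reference.

```python
import json

# Hybrid mode: semantic_vad keeps emitting turn signals, but the server
# neither creates responses nor interrupts them on its own.
hybrid_session_update = {
    "type": "session.update",
    "session": {
        "type": "realtime",
        "audio": {
            "input": {
                "turn_detection": {
                    "type": "semantic_vad",
                    "create_response": False,
                    "interrupt_response": False,
                }
            }
        },
    },
}

payload = json.dumps(hybrid_session_update)
print(payload)
```

With this in place, the server's speech_stopped events become one more input to your own turn logic rather than the trigger for a response.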

Early prototyping suggests this could save an additional 200–300ms, potentially bringing total response latency consistently under 2 seconds. But the edge cases are real, and I'm still working through the interplay between local VAD and OpenAI's server-side VAD.

Methodology Details

For those who want to reproduce this or poke holes in it:

Perceived latency is defined as the interval from true speech end (local TEN VAD detection) to first audio byte emitted to the telephony provider. Both configurations are measured from the same anchor point — this eliminates the measurement bias inherent in using the server's speech_stopped event.

Commit latency (local VAD mode only): true speech end → server acknowledgment of input_audio_buffer.committed. Median 122ms — this is the WebSocket round-trip overhead that local VAD adds. A small price for a large gain.

LLM latency: server commit acknowledgment → first audio delta from OpenAI. This is the model inference time, independent of VAD choice.
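The three intervals defined above reduce to differences between four timestamps per turn. Here is a minimal sketch, assuming timestamps captured from one monotonic clock in seconds; the record type, field names, and sample values are illustrative.

```python
from dataclasses import dataclass


@dataclass
class TurnTimestamps:
    speech_end: float      # passive local VAD detects silence (shared anchor)
    commit_ack: float      # input_audio_buffer.committed received
    first_delta: float     # first audio delta received from the model
    first_byte_out: float  # first audio byte emitted to the telephony provider


def perceived_latency_ms(t: TurnTimestamps) -> float:
    return (t.first_byte_out - t.speech_end) * 1000


def commit_latency_ms(t: TurnTimestamps) -> float:
    # Reported for local VAD mode only in this methodology
    return (t.commit_ack - t.speech_end) * 1000


def llm_latency_ms(t: TurnTimestamps) -> float:
    return (t.first_delta - t.commit_ack) * 1000


# Illustrative timestamps only
turn = TurnTimestamps(speech_end=100.000, commit_ack=100.120,
                      first_delta=102.300, first_byte_out=102.400)
print(round(perceived_latency_ms(turn)),
      round(commit_latency_ms(turn)),
      round(llm_latency_ms(turn)))
```

Using a monotonic clock such as `time.monotonic()` for all four stamps avoids wall-clock adjustments corrupting the intervals.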

Filler segmentation threshold: LLM latency < 1000ms. Filler responses are non-deterministic LLM behavior (e.g., "Let me find that for you") and are not controllable by VAD configuration.

Statistical tests: Welch's two-sample t-test (unequal variances), Cohen's d for effect size. All p-values are two-tailed.
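For anyone reproducing the statistics, here is a standard-library sketch of the Welch t statistic, its degrees of freedom, and Cohen's d. The post does not state which variant of Cohen's d was used; the pooled-standard-deviation form below is an assumption. The sample lists are toy data.

```python
import math
from statistics import mean, variance


def welch_t(a, b) -> float:
    """Welch's t statistic for two samples with unequal variances."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def welch_df(a, b) -> float:
    """Welch-Satterthwaite approximation of the degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    return (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))


def cohens_d(a, b) -> float:
    """Cohen's d using the pooled standard deviation (assumed variant)."""
    na, nb = len(a), len(b)
    pooled = math.sqrt(
        ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    )
    return (mean(a) - mean(b)) / pooled


# Toy data only
local = [1, 2, 3, 4, 5]
server = [2, 4, 6, 8, 10]
print(welch_t(local, server), welch_df(local, server), cohens_d(local, server))
```

The standard library has no t-distribution CDF, so the p-value itself is easier to get from `scipy.stats.ttest_ind(a, b, equal_var=False)`, which performs the same Welch test.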

Environment: Controlled quiet-room conditions. Scripted test calls from cell phone through production Twilio PSTN infrastructure (8kHz µ-law, ~20ms frames). Test dates: March 20–21, 2026.

I build real-time AI voice systems — telephony pipelines, streaming audio, LLM orchestration. If you're working on similar problems, I'd love to hear what latency challenges you're seeing. Reach out on LinkedIn.

I Tested Our WebSocket Audio Pipeline with WebRTC. Here's Why I Switched It Back.

Nick Lackman — Wed, 11 Mar 2026 04:01:01 +0000

There's a prevailing assumption in the voice AI space that WebRTC is inherently better than WebSockets for real-time audio. Better latency, better quality, better everything. I built a full proof-of-concept to test that assumption on an enterprise-scale production AI voice system.

I found a few things surprising.

The Setup

Our system takes inbound phone calls, pipes the audio through an AI agent (OpenAI Realtime API), and sends the response back to the caller. The current architecture uses Twilio Programmable Voice with WebSocket media streams: G.711 μ-law audio at 8kHz, carried over a WebSocket connection.

The hypothesis was straightforward: replace the WebSocket media path with WebRTC via LiveKit, and we'd get lower latency (UDP instead of TCP, no WebSocket framing overhead) and better audio quality (Opus codec at 48kHz instead of G.711 at 8kHz).

I built the full integration: LiveKit Cloud as the media server, Twilio Elastic SIP Trunking for the PSTN connection, a transport abstraction layer so both paths could run side by side, and a real-time audio pacer to handle frame timing. The key was adding this new transport path without changing any of the LLM orchestration, agent configuration, or tools. It should work exactly the same as production, except for using LiveKit/SIP/WebRTC rather than Twilio/Programmable Voice/WebSockets.

Measuring the delta was necessary to draw any meaningful insights from this proof-of-concept.

The Latency Result

Median response latency (time from when the caller stops speaking to when the AI starts responding):

WebSocket path: ~1,920ms
WebRTC path: ~2,060ms

Essentially identical. The theoretical 50–150ms savings from eliminating WebSocket overhead is real, but invisible against 2+ seconds of LLM response time. The transport layer accounts for less than 5% of total conversational latency. The bottleneck is the model, not the pipe. What I found interesting is how this cuts against the usual WebSockets vs. WebRTC conversation for real-time AI, where the general consensus is that "WebRTC is always better." WebRTC is the superior transport for real-time communications (it's literally in the name), but its efficiency benefits are hard to see when model inference takes anywhere from 500ms to 4s.

The Audio Quality Result

Both paths delivered the same audio quality, because both paths carry the same audio. When a caller dials from a phone, the audio enters the PSTN as G.711 μ-law at 8kHz. That's a hard ceiling imposed by the telephone network. It doesn't matter whether those bytes travel over a WebSocket or a WebRTC connection; the frequency content is identical. You can't recover information that was never captured at the source. Said a different way, you can't go from a low-quality audio encoding to a high-quality one and expect better-sounding output.

[Figure: Spectral Analysis]

The Surprise: WebRTC Sounded Worse at First

The initial WebRTC implementation actually sounded worse than WebSocket — choppy audio, dropped words, audible artifacts. It took real debugging to figure out why.

WebRTC's jitter buffer is designed for network jitter. It smooths out packets that arrive with variable timing from a remote peer over UDP. It is not designed to handle an application dumping large bursts of AI-generated audio into the WebRTC stack all at once.

When the LLM generates a response, the audio arrives in variable-sized chunks (sometimes 50ms of audio, sometimes 500ms), delivered as fast as the model can produce it. The OpenAI Realtime API's chunks are fairly consistent, but not exact, and they don't arrive at the steady cadence the PSTN expects. Our WebSocket implementation had a strict real-time pacer that metered these chunks out at exactly one frame per 20ms, with prebuffering and underrun detection. Without that same pacer on the WebRTC path, the audio sounded terrible.

The fix was porting the same pacer architecture to the WebRTC path. Once both paths had identical frame-level timing discipline, the audio quality matched. The lesson: application-level pacing of AI-generated audio is your responsibility regardless of transport. WebRTC handles network timing, not application timing.
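To show the shape of such a pacer, here is a minimal sketch for the 8kHz μ-law case (160 bytes per 20ms frame). It is not the production pacer: the class name, the three-frame prebuffer, and the choice to re-prime after an underrun are all assumptions made for illustration.

```python
import asyncio
import time

FRAME_MS = 20
FRAME_BYTES = 160                      # 8000 samples/s * 0.020 s * 1 byte/sample
ULAW_SILENCE = b"\xff" * FRAME_BYTES   # 0xFF encodes silence in G.711 μ-law


class FramePacer:
    """Turns bursty model audio into fixed 20ms frames."""

    def __init__(self, prebuffer_frames: int = 3):
        self._buf = bytearray()
        self._prebuffer_bytes = prebuffer_frames * FRAME_BYTES
        self._primed = False
        self.underruns = 0

    def push(self, chunk: bytes) -> None:
        """Accept a variable-sized chunk as fast as the model produces it."""
        self._buf.extend(chunk)

    def next_frame(self) -> bytes:
        """Return exactly one frame: real audio if available, else silence."""
        if not self._primed:
            if len(self._buf) < self._prebuffer_bytes:
                return ULAW_SILENCE    # still prebuffering
            self._primed = True
        if len(self._buf) < FRAME_BYTES:
            self.underruns += 1        # ran dry mid-stream
            self._primed = False       # rebuild the prebuffer before resuming
            return ULAW_SILENCE
        frame = bytes(self._buf[:FRAME_BYTES])
        del self._buf[:FRAME_BYTES]
        return frame


async def run_pacer(pacer: FramePacer, send_frame, frames: int) -> None:
    """Emit frames on absolute deadlines so sleep jitter does not accumulate."""
    start = time.monotonic()
    for i in range(frames):
        await send_frame(pacer.next_frame())
        delay = start + (i + 1) * FRAME_MS / 1000 - time.monotonic()
        if delay > 0:
            await asyncio.sleep(delay)
```

Scheduling against absolute deadlines rather than sleeping a fixed 20ms each iteration keeps the long-run rate exact even when individual sleeps overshoot. A real pacer would also need an end-of-response signal so the final partial buffer is flushed instead of counted as an underrun.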

Where WebRTC Actually Wins

I also tested a WebRTC-native path with no PSTN involved — a browser client connecting directly to the AI agent via LiveKit with Opus at 24kHz. The difference was dramatic:

99% audio bandwidth: 8,438 Hz (vs. ~3,969 Hz for PSTN paths)
2x+ frequency content — you can hear breathiness, sibilants, natural voice texture
Fewest signal artifacts of all three paths
Same latency as the other paths (still LLM-bound)

WebRTC is transformatively better when the caller isn't on a phone. The technology delivers on its promise — just not for PSTN calls.
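The post does not spell out how "99% audio bandwidth" was computed. One common reading is the frequency below which 99% of the signal's spectral energy lies, and the sketch below assumes that definition. It uses a naive pure-Python DFT on a synthetic two-tone signal, so it is only practical for short buffers.

```python
import cmath
import math


def power_spectrum(samples):
    """Naive DFT power for bins 0..N/2 (O(N^2); fine for short buffers)."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(samples))) ** 2
        for k in range(n // 2 + 1)
    ]


def bandwidth_99_hz(samples, sample_rate_hz: float) -> float:
    """Lowest frequency at which cumulative spectral energy reaches 99%."""
    power = power_spectrum(samples)
    total = sum(power)
    running = 0.0
    for k, p in enumerate(power):
        running += p
        if running >= 0.99 * total:
            return k * sample_rate_hz / len(samples)
    return sample_rate_hz / 2  # Nyquist limit


def two_tones(high_amplitude: float, n: int = 256, rate: int = 8000):
    """Synthetic test signal: 500 Hz plus a 3000 Hz tone of given amplitude."""
    return [
        math.sin(2 * math.pi * 500 * i / rate)
        + high_amplitude * math.sin(2 * math.pi * 3000 * i / rate)
        for i in range(n)
    ]


strong = bandwidth_99_hz(two_tones(1.0), 8000)
faint = bandwidth_99_hz(two_tones(0.05), 8000)
print(strong, faint)
```

Whatever the exact definition, the result can never exceed half the sample rate. That is why the 8kHz PSTN paths top out just under 4,000 Hz no matter which transport carries them.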

The Takeaway

The right question isn't "should we use WebRTC?" It's "where is the bottleneck?" For PSTN-based AI voice calls today, the telephone network limits quality, and the LLM limits speed. Changing the transport layer between those two bottlenecks doesn't move the needle.

WebRTC becomes the right answer when one of these changes: callers move to VoIP/browser/app clients (removing the PSTN quality ceiling), LLM response times drop by an order of magnitude (making transport latency a meaningful fraction of total latency), or wideband codecs become available end-to-end on SIP trunks.

While WebRTC is the de facto real-time communication protocol, we have millions of phone numbers and deeply ingrained Twilio Programmable Voice integrations. Switching would mean setting up new infrastructure, changing the call routing logic, and taking on the overhead of either managing a media server ourselves or paying for a cloud service like LiveKit. SIP/WebRTC needed to be a significant improvement over Twilio/WebSockets to justify the migration, and it was about the same.

If you are already deeply integrated with Twilio and their Programmable Voice, the boring WebSocket pipeline with a well-tuned audio pacer is the right architecture. Sometimes the best engineering decision is knowing when not to ship.