The 320-Millisecond Conversation

How AI voice mode went from a five-second Q&A bot to something you can interrupt mid-sentence, and everything that had to be rebuilt to get there.

Open ChatGPT's voice mode, or Gemini Live, and try talking over it mid-sentence. The AI just stops. Not after it finishes the sentence it was on, not after an awkward half-second lag. It stops the way a person stops when they realize you've started talking over them.

A couple of years ago, this same interaction was painful. You'd ask a question, watch a spinner, wait three to five seconds, and get back a flat, oddly paced reading of a paragraph. Now it feels closer to a phone call. Same companies, same basic idea ("talk to an AI"), but somewhere along the way it stopped feeling like using a tool and started feeling like talking to something.

So what actually changed? Not "the model got smarter" — though it did, a bit. The bigger story is a near-total rebuild of how audio moves between your phone and the model, touching everything from how these models are trained down to which network protocol your browser happens to be using.

Three Models Playing Telephone

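Until recently, nearly every AI voice assistant, including the first version of ChatGPT's voice mode, worked the same way under the hood: three separate models bolted together in a row, each one handing its output to the next. A speech-to-text model transcribed what you said, a text-only LLM wrote a reply, and a text-to-speech (TTS) model read that reply aloud.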

Each handoff adds latency, and it adds up fast: by OpenAI's own numbers, voice mode took an average of 2.8 seconds to respond with GPT-3.5 and 5.4 seconds with GPT-4. But honestly, the delay wasn't even the worst part. The real loss was information. The moment your voice became text, everything that wasn't literally the words (your tone, your pacing, whether you sounded annoyed or amused) was gone. The LLM never heard any of it. And the TTS model at the very end had no idea what emotion the reply should carry, because all it ever got was a string of text too.

It's the classic telephone-game problem: three models that don't share a brain, passing notes to each other through a narrow, text-shaped slot.
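To make the telephone game concrete, here is a toy sketch of the cascade. Every name in it (Utterance, speech_to_text, language_model, text_to_speech) is made up for illustration, and the three functions are stand-ins rather than real models. The point is the shape of the interfaces: the only thing that crosses each boundary is a string.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """What the microphone captured: the words plus how they were said."""
    words: str
    tone: str        # e.g. "annoyed", "amused" (hypothetical label)
    pace_wpm: int    # speaking rate in words per minute


def speech_to_text(utterance: Utterance) -> str:
    # Stage 1: only the words survive transcription.
    return utterance.words


def language_model(prompt: str) -> str:
    # Stage 2: a stand-in for the LLM; it receives a plain string and nothing else.
    return f"You said: {prompt}"


def text_to_speech(reply: str) -> dict:
    # Stage 3: no tone signal ever reaches this stage, so it falls back to a default.
    return {"text": reply, "tone": "neutral"}


def cascaded_reply(utterance: Utterance) -> dict:
    return text_to_speech(language_model(speech_to_text(utterance)))


spoken = Utterance(words="great, it crashed again", tone="annoyed", pace_wpm=190)
result = cascaded_reply(spoken)
print(result)
```

The "annoyed" label is set on the way in and never shows up on the way out, because no function signature in the chain has anywhere to carry it.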

One Model, Audio In, Audio Out

The fix — which OpenAI shipped first with GPT-4o in mid-2024, and Google followed not long after with Gemini 2.5 Flash's native-audio models — collapses all three stages into a single model.

OpenAI's own system card describes GPT-4o as "an autoregressive omni model, which accepts as input any combination of text, audio, image, and video and generates any combination of text, audio, and image outputs… trained end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network." In plain terms: there's no handoff anymore. The network that hears you is the same one that decides what to say, and the same one that says it.

How the Audio Actually Moves

Making this work in real time meant OpenAI had to build something new: the Realtime API. Instead of the usual request-response pattern, where every message is its own fresh API call, it opens one persistent connection — either a WebSocket or a WebRTC connection — and keeps it open for the whole conversation.
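Here is a rough sketch of what "one connection, many small messages" looks like from the client side. It only builds the JSON messages and does not open a socket. The event type input_audio_buffer.append and its base64 audio field follow OpenAI's Realtime documentation as I understand it, so check the current reference before relying on the exact shape. The helper names and the chunk size are my own assumptions.

```python
import base64
import json


def audio_append_event(pcm_chunk: bytes) -> str:
    """Wrap one chunk of raw PCM audio as a JSON message for the open connection."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })


def stream_events(pcm: bytes, chunk_bytes: int):
    """Yield one message per chunk; every message would reuse the same connection."""
    for start in range(0, len(pcm), chunk_bytes):
        yield audio_append_event(pcm[start:start + chunk_bytes])


# 100 ms of silence at 24 kHz, 16-bit mono, sent as five 20 ms chunks.
silence = bytes(4800)
messages = list(stream_events(silence, chunk_bytes=960))
print(len(messages), "messages,", len(messages[0]), "characters each")
```

Compared with request-response, there is no per-message setup cost here: the session is established once, and after that the client just keeps pushing small events down the same pipe.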

The audio itself usually moves as 16-bit PCM at 16kHz or 24kHz (Gemini's Live API specifically wants 16kHz PCM in, 24kHz PCM out). G.711 works too, for telephony, but there's a catch: converting its 8kHz audio up to the model's native 24kHz tacks on another 50–100ms and makes it sound worse.
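Those sample rates translate directly into bytes on the wire. The sketch below works out frame sizes and bitrates for the three formats mentioned above. The 20 ms frame length and mono audio are assumptions on my part (both common for voice), and the function names are illustrative.

```python
def pcm_frame_bytes(sample_rate_hz: int, frame_ms: int = 20,
                    bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Size in bytes of one audio frame of uncompressed samples."""
    samples = sample_rate_hz * frame_ms // 1000
    return samples * bytes_per_sample * channels


def bitrate_kbps(sample_rate_hz: int, bytes_per_sample: int = 2,
                 channels: int = 1) -> int:
    """Raw bitrate in kilobits per second."""
    return sample_rate_hz * bytes_per_sample * 8 * channels // 1000


# 16-bit PCM at the two rates the article mentions.
print("24 kHz PCM:", pcm_frame_bytes(24_000), "bytes/frame,", bitrate_kbps(24_000), "kbit/s")
print("16 kHz PCM:", pcm_frame_bytes(16_000), "bytes/frame,", bitrate_kbps(16_000), "kbit/s")

# G.711 stores one companded byte per sample at 8 kHz.
print("G.711:     ", pcm_frame_bytes(8_000, bytes_per_sample=1), "bytes/frame,",
      bitrate_kbps(8_000, bytes_per_sample=1), "kbit/s")
```

G.711 carries a third of the samples per second that 24 kHz PCM does, which is why upsampling it cannot bring back detail that was never captured.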

Knowing When to Stop Talking

That "stops the instant you talk" behavior comes down to Voice Activity Detection (VAD). The basic version, server VAD, just watches for silence — pause for about 500ms and it assumes you're done. Semantic VAD is the newer, more interesting one: it's a classifier that judges whether what you just said sounds finished. Trail off with "ummm…" and it waits longer. End on something that sounds like a complete thought, and it jumps in almost immediately.
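The silence-watching version is simple enough to sketch. This is not OpenAI's or Google's implementation, just a minimal energy-based detector under a few assumptions of mine: 20 ms frames of 16-bit little-endian mono PCM, an arbitrary loudness threshold of 500, and the roughly 500 ms pause described above. Semantic VAD is a trained classifier and is not something a few lines can reproduce.

```python
import math
import struct


def rms(frame: bytes) -> float:
    """Root-mean-square level of one frame of 16-bit little-endian mono PCM."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[:count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def end_of_turn_index(frames, frame_ms=20, silence_ms=500, threshold=500.0):
    """Return the index of the frame where the turn is declared over, or None.

    The turn ends once speech has been heard and `silence_ms` worth of
    consecutive quiet frames follow it. Any loud frame resets the count.
    """
    needed = silence_ms // frame_ms
    heard_speech = False
    quiet_run = 0
    for i, frame in enumerate(frames):
        if rms(frame) >= threshold:
            heard_speech = True
            quiet_run = 0
        elif heard_speech:
            quiet_run += 1
            if quiet_run >= needed:
                return i
    return None


loud = struct.pack("<480h", *([3000] * 480))   # one 20 ms frame of "speech"
quiet = bytes(960)                             # one 20 ms frame of silence

# Leading silence, 200 ms of speech, then a long pause.
frames = [quiet] * 5 + [loud] * 10 + [quiet] * 30
print(end_of_turn_index(frames))
```

A detector like this cannot tell a thoughtful pause from a finished sentence, which is exactly the gap semantic VAD is meant to close.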

When you actually interrupt, the in-flight response is cancelled (response.cancel in OpenAI's Realtime API), and whatever audio the model hadn't finished playing yet gets truncated out of the conversation history, so the model only "remembers" what it actually said out loud, not what it was about to say. WebRTC handles this part automatically, since the server already knows exactly how much audio has played. Over a plain WebSocket, your client has to track that itself and tell the server where the cut happened.
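For the WebSocket case, the bookkeeping might look something like this. The PlaybackTracker class is hypothetical, and it assumes 24 kHz 16-bit mono output. The conversation.item.truncate event with item_id, content_index, and audio_end_ms fields follows OpenAI's Realtime documentation as I understand it, and the item ID below is a placeholder.

```python
import json


class PlaybackTracker:
    """Tracks how much of one assistant audio item the user has actually heard."""

    def __init__(self, item_id: str, sample_rate_hz: int = 24_000,
                 bytes_per_sample: int = 2):
        self.item_id = item_id
        self.bytes_per_ms = sample_rate_hz * bytes_per_sample / 1000
        self.played_bytes = 0

    def on_frame_played(self, frame: bytes) -> None:
        # Call this when a frame has really gone to the speaker, not when it arrived.
        self.played_bytes += len(frame)

    def played_ms(self) -> int:
        return int(self.played_bytes / self.bytes_per_ms)

    def truncate_event(self) -> str:
        """Message telling the server where the user cut the assistant off."""
        return json.dumps({
            "type": "conversation.item.truncate",
            "item_id": self.item_id,
            "content_index": 0,
            "audio_end_ms": self.played_ms(),
        })


tracker = PlaybackTracker(item_id="item_placeholder")
for _ in range(30):                 # 30 frames of 20 ms each reached the speaker
    tracker.on_frame_played(bytes(960))
print(tracker.truncate_event())
```

The detail that matters is counting audio that was played rather than audio that was received. The model usually generates faster than real time, so the two numbers can be seconds apart.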

The Trade-offs

First trade-off: you lose the transcript you could inspect. When everything ran through text, moderation was simple — you just read the text. With audio-native models, OpenAI still runs that old text-based check on a transcription, but adds a second, real-time classifier that watches the audio as it's being generated and checks whether the voice still matches one of the approved presets, shutting things down if it drifts. OpenAI says this setup "catches 100% of meaningful deviations from the system voice" in their internal tests — mostly there to stop the model from ever accidentally cloning a user's own voice.

Second trade-off, and it's bigger than most people realize: voice is just expensive. Here's what the original Realtime API pricing looked like, per million tokens:

Text input: $5
Text output: $20
Audio input: $100
Audio output: $200

Audio comes out to 20x the price of text on input and 10x on output. This is a big part of why voice features sit behind paid tiers with daily limits while text chat feels basically free: every second you spend talking has a meter running on it in a way that typing never did.
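The arithmetic is worth running once. The figures below are the launch prices from OpenAI's "Introducing the Realtime API" post ($5 and $20 per million text tokens in and out, $100 and $200 per million audio tokens), along with the approximate per-minute rates quoted in the same announcement. Prices have changed since, so treat this as a sketch of the original numbers. The conversation_cost helper is a simplification of mine that ignores text tokens and any re-sent context.

```python
# USD per 1M tokens, original Realtime API launch pricing.
PRICE_PER_M = {
    "text_in": 5.00, "text_out": 20.00,
    "audio_in": 100.00, "audio_out": 200.00,
}

# Approximate USD per minute of audio, as quoted in the same announcement.
USD_PER_MIN = {"audio_in": 0.06, "audio_out": 0.24}

input_ratio = PRICE_PER_M["audio_in"] / PRICE_PER_M["text_in"]      # 20.0
output_ratio = PRICE_PER_M["audio_out"] / PRICE_PER_M["text_out"]   # 10.0


def implied_tokens_per_minute(kind: str) -> float:
    """Audio tokens per minute implied by the per-minute and per-token prices."""
    return USD_PER_MIN[kind] / (PRICE_PER_M[kind] / 1_000_000)


def conversation_cost(user_minutes: float, assistant_minutes: float) -> float:
    """Audio-only cost of a conversation, ignoring text tokens and re-sent context."""
    return (user_minutes * USD_PER_MIN["audio_in"]
            + assistant_minutes * USD_PER_MIN["audio_out"])


print(f"audio vs text: {input_ratio:.0f}x on input, {output_ratio:.0f}x on output")
print(f"implied tokens/min in:  {implied_tokens_per_minute('audio_in'):.0f}")
print(f"implied tokens/min out: {implied_tokens_per_minute('audio_out'):.0f}")
print(f"10-minute chat, split evenly: ${conversation_cost(5, 5):.2f}")
```

A dollar or two per conversation sounds small until you multiply it by millions of daily users, which is where the daily limits come from.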

Third one: switching to WebRTC isn't automatically a win either. Some teams have swapped a working WebSocket pipeline for WebRTC expecting better audio by default, and gotten the opposite. WebRTC's jitter buffer is built to smooth out unpredictable timing between two network peers, but AI-generated audio doesn't arrive on that kind of schedule. Without rebuilding the same frame-pacing logic (one 20ms frame at a time, with prebuffering), the audio actually sounds worse over WebRTC than it did over a plain socket.
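Frame pacing with prebuffering can be sketched without any networking at all. The FramePacer class below is illustrative, not taken from any of the write-ups: it accepts audio in whatever bursts the model produces, cuts it into fixed 20 ms frames, waits until a few frames are queued, and then releases them on a steady schedule. The clock is passed in as a number so the behavior is easy to follow. The 960-byte frame assumes 24 kHz 16-bit mono, and the three-frame prebuffer is an arbitrary choice.

```python
from collections import deque


class FramePacer:
    """Releases fixed-size audio frames on a steady clock after a short prebuffer."""

    def __init__(self, frame_bytes: int = 960, frame_ms: int = 20,
                 prebuffer_frames: int = 3):
        self.frame_bytes = frame_bytes
        self.frame_ms = frame_ms
        self.prebuffer_frames = prebuffer_frames
        self._pending = bytearray()   # bytes not yet cut into a full frame
        self._frames = deque()        # full frames waiting to be sent
        self._next_due_ms = None      # None while (re)filling the prebuffer
        self._closed = False

    def push(self, chunk: bytes) -> None:
        """Accept a chunk of any size, exactly as it arrives from the model."""
        self._pending.extend(chunk)
        while len(self._pending) >= self.frame_bytes:
            self._frames.append(bytes(self._pending[:self.frame_bytes]))
            del self._pending[:self.frame_bytes]

    def close(self) -> None:
        """Signal that no more audio is coming, so the tail can drain without a prebuffer."""
        self._closed = True

    def pop_due(self, now_ms: int) -> list:
        """Return the frames whose scheduled send time has arrived."""
        if self._next_due_ms is None:
            if len(self._frames) < self.prebuffer_frames and not self._closed:
                return []             # still prebuffering
            self._next_due_ms = now_ms
        due = []
        while self._frames and self._next_due_ms <= now_ms:
            due.append(self._frames.popleft())
            self._next_due_ms += self.frame_ms
        if not self._frames:
            self._next_due_ms = None  # underrun: rebuild the prebuffer before resuming
        return due


pacer = FramePacer()
pacer.push(bytes(960 * 6))            # the model delivers 120 ms of audio in one burst
for now in (0, 10, 20, 40):
    print(now, len(pacer.pop_due(now)))
```

In a real client the timestamps would come from a monotonic clock and each released frame would be handed to the audio track. The prebuffer trades a little extra latency for protection against gaps when the next burst arrives late.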

The Philosophical Close

Here's the thing that gets me about all this: almost none of it is about the model getting "smarter." Better reasoning, more facts, a bigger context window: none of that is the story here. This is about closing the gap between thinking and speaking until you can't feel it anymore, and then rebuilding an entire safety and pricing model around the fact that text, the one format these companies could cheaply moderate, log, and price, might not exist as a middle step anymore.

It's a strange trade when you sit with it: betting that feeling present in a conversation — interruptible, emotionally responsive, instant — is worth more than being able to check what the AI actually said.

A 320-millisecond average response, or 232 milliseconds at best, isn't really the achievement, if you think about it. The achievement is that getting there meant these companies had to make their own systems harder to inspect on purpose, then build a whole new layer of classifiers just to make up for it. Talking to an AI that actually feels like a person turns out to be inseparable from a much harder problem: trusting a system you can no longer just read.

Sources

OpenAI — GPT-4o System Card (arXiv:2410.21276)
OpenAI — "Introducing the Realtime API"
OpenAI — "Introducing gpt-realtime and Realtime API updates for production voice agents"
OpenAI Platform Docs — Realtime conversations, Voice Activity Detection, Managing costs
Google AI for Developers — Gemini Live API overview & Gemini 2.5 Flash Native Audio
Latent Space — "OpenAI Realtime API: The Missing Manual"
Production engineering write-ups: Skywork, Effloow, Forasoft, DEV Community (WebRTC/WebSocket pacing)