When I was building intervu.dev, an AI mock coding interviewer that conducts full voice interviews in the browser, one of the most annoying problems I ran into was a noticeable gap between when the AI finished speaking and when the microphone actually went live for the candidate to respond.
The AI would finish its sentence, and then there'd be this dead pause of around 850ms before the mic activated. In a real interview, that kind of delay feels broken. It kills the conversational flow and makes the whole thing feel like a chatbot, not an interviewer.
Here's what was causing it and how pre-warming the WebSocket connection during TTS playback got it down to under 400ms.
## The problem
The turn-taking loop in intervu.dev works like this:
1. The AI generates a response and sends it to TTS
2. TTS audio streams back and plays in the browser
3. Once TTS finishes, the mic WebSocket connection is opened and the candidate can speak
4. Audio is streamed to the backend over that WebSocket, transcribed in real time, and the turn continues
The issue was in step 3. Opening a fresh WebSocket connection after TTS finishes takes time: DNS resolution, the TCP (and TLS) handshake, then the WebSocket upgrade. On a typical connection that overhead sat around 800-900ms. It was mostly tolerable in local testing, but on slower connections or under load it was noticeably bad.
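If you want to see this cost for yourself, you can time the `open` event on a cold connection. A minimal sketch; the `WS` parameter is an assumption I've added here so the function can be exercised with a fake socket outside the browser:

```javascript
// Measure how long a cold WebSocket open takes (DNS + TCP + TLS + upgrade).
// WS defaults to the browser's WebSocket constructor; injectable for testing.
function measureSocketOpen(url, WS = WebSocket) {
  return new Promise((resolve, reject) => {
    const t0 = performance.now();
    const ws = new WS(url);
    ws.onopen = () => {
      ws.close();
      resolve(performance.now() - t0); // elapsed ms until the socket is usable
    };
    ws.onerror = reject;
  });
}
```

Run it a few times against your own endpoint and you'll get a realistic picture of the handshake cost users on cold connections actually pay.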
## The fix: pre-warm the connection during TTS playback
The insight is simple: TTS playback gives you a natural window of time where the user is listening and not yet expected to speak. That window is completely idle from a WebSocket perspective. Instead of waiting until TTS finishes to open the mic connection, you can open it during playback and have it ready to go the moment the AI stops speaking.
```javascript
// Module-level state for the pre-warmed connection
let micSocket = null;
let micConnectionReady = false;

// When TTS playback starts, immediately begin opening the mic WebSocket
function onTTSPlaybackStart() {
  prewarmMicConnection();
}

function prewarmMicConnection() {
  // Open the connection early but don't activate the mic yet
  micSocket = new WebSocket(MIC_WEBSOCKET_URL);
  micSocket.onopen = () => {
    micConnectionReady = true;
  };
  micSocket.onerror = () => {
    // Fall back to opening it on-demand if the pre-warm fails
    micConnectionReady = false;
  };
}

// When TTS finishes playing
function onTTSPlaybackEnd() {
  if (micConnectionReady && micSocket?.readyState === WebSocket.OPEN) {
    // Connection already open, activate the mic immediately
    activateMicrophone(micSocket);
  } else {
    // Pre-warm didn't complete in time, open on-demand
    openMicConnectionAndActivate();
  }
}
```
The pre-warm happens in the background while audio is playing. By the time TTS finishes, even if the AI only speaks for a second or two, the WebSocket handshake is almost always done.
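For completeness, here's one way the fallback path could look. This is a hypothetical sketch, not the exact intervu.dev code: the `url`, `activate`, and `WS` parameters are assumptions I've added for testability, and `activate` stands in for whatever starts streaming mic audio over the socket.

```javascript
let micSocket = null;
let micConnectionReady = false;

// On-demand fallback: open the socket now and activate the mic as soon
// as the handshake completes. The user pays the handshake latency here,
// which is why this path is the ~400ms worst case.
function openMicConnectionAndActivate(url, activate, WS = WebSocket) {
  const socket = new WS(url);
  socket.onopen = () => {
    micSocket = socket;
    micConnectionReady = true;
    activate(socket); // start streaming mic audio immediately
  };
  socket.onerror = () => {
    micConnectionReady = false;
    // Surface a connection error to the UI rather than failing silently
  };
}
```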
## The numbers
**Before:** ~850ms gap between TTS end and mic activation, measured as the time between the TTS `ended` event firing and the first audio chunk arriving at the backend.

**After:** under 400ms consistently, and closer to 50-80ms on good connections where the pre-warm had plenty of time to complete.
The 400ms worst case comes from the fallback path. If the AI says something very short (under about 1 second), TTS can finish before the pre-warm completes and we fall back to opening on-demand. For anything longer than a second of speech, the connection is reliably ready.
## A few things worth knowing
**Close the connection if the turn gets interrupted.** If the user clicks stop or the interview ends mid-TTS, you need to close the pre-warmed socket cleanly. Leaving stale open connections is the obvious failure mode here.
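A minimal cleanup helper, assuming the same `micSocket` / `micConnectionReady` state as the snippet above. The numeric `3` is the standard `WebSocket.CLOSED` readyState value, used here so the sketch doesn't depend on the browser global:

```javascript
let micSocket = null;
let micConnectionReady = false;

// Call this on stop / interrupt / interview-end so a pre-warmed socket
// never outlives the turn it was opened for.
function cancelPrewarm() {
  if (micSocket && micSocket.readyState !== 3 /* WebSocket.CLOSED */) {
    micSocket.close(1000, 'turn interrupted'); // 1000 = normal closure
  }
  micSocket = null;
  micConnectionReady = false;
}
```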
**One connection per turn.** Don't reuse the pre-warmed connection across multiple turns. Open a fresh one each time TTS starts. This avoids state leakage and keeps the server-side logic simple.
**Server-side idle timeout.** Your backend will see a WebSocket connection open and then sit idle for a few seconds before the client sends any audio. Make sure your idle timeout is long enough to survive that window. I set mine to 30 seconds, which is comfortably longer than any realistic TTS utterance.
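Server-side, the idle timeout can be as simple as a timer that resets on every message. A framework-agnostic sketch (`createIdleTimer` is a name I made up, not a real API):

```javascript
// Reset-on-activity idle timer: onIdle fires if touch() isn't called
// within timeoutMs. Attach one per connection and call touch() on
// every incoming audio chunk.
function createIdleTimer(timeoutMs, onIdle) {
  let handle = setTimeout(onIdle, timeoutMs);
  return {
    touch() {
      clearTimeout(handle);
      handle = setTimeout(onIdle, timeoutMs);
    },
    stop() {
      clearTimeout(handle);
    },
  };
}
```

Per-connection usage: create the timer with 30,000ms when the socket opens, call `touch()` on every message, and close the socket in the `onIdle` callback.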
**It degrades gracefully.** The fallback to on-demand opening means users on very slow connections still get a working experience, just with slightly more latency. The pre-warm is purely an optimization, not a hard dependency.
## Why this matters for voice AI apps
Any application that alternates between AI speech output and user speech input has this same dead-time window during playback. The pattern generalizes well: use the playback window to pre-warm whatever connection or resource you need for the next user action. In intervu.dev's case it's a WebSocket for audio streaming, but the same idea applies to pre-fetching context, warming up a transcription session, or pre-loading the next state in a conversational flow.
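The general shape of the pattern can be sketched as a tiny helper: start acquiring the resource when playback begins, then take it if it's ready (or fall back) when the user's turn starts. The names here are mine, not a real API:

```javascript
// Generic pre-warm: kick off an async acquisition now, consume it later.
// open() starts acquiring the resource; take(fallback) returns the
// pre-warmed resource if it settled in time, else runs the fallback.
function prewarm(open) {
  let resource = null;
  const pending = open()
    .then((r) => (resource = r))
    .catch(() => null); // a failed pre-warm just means we fall back
  return {
    async take(fallback) {
      if (resource) return resource;
      // Give an already-settled pre-warm one microtask to win the race
      const raced = await Promise.race([pending, Promise.resolve(null)]);
      return raced ?? fallback();
    },
  };
}
```

The same shape works whether the resource is a WebSocket, a transcription session, or pre-fetched context.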
If you're building anything with voice turn-taking and you're seeing a gap at the handoff point, this is almost certainly part of what's causing it.
I'm building intervu.dev as a solo project: an AI that conducts real FAANG-style mock coding interviews in the browser, voice and all. If you're curious about the broader architecture (Docker-in-Docker for code sandboxing, LLM prompt state machines, real-time STT), I wrote about the full build here.