Konstantin

How I Built a Real-Time Voice AI Interview System with Gemini Live API and WebSockets (and What Almost Broke Me)

When I started building GoNoGo.team -- a platform that uses AI to validate startup ideas through voice interviews -- I thought the hardest part would be the business logic. Turns out, the hardest part was keeping a duplex audio stream alive across three layers of abstraction without everything falling apart.

This is a technical post-mortem of the voice AI system I built solo. I'll cover the architecture, the ugly edge cases, and the specific patterns that finally made it stable enough to run 500+ validation interviews.


The Core Problem: Bidirectional Audio at Low Latency

The concept: a founder speaks, Gemini listens and responds with follow-up questions, in real time. No STT/TTS pipeline -- direct speech-to-speech using Gemini Live API (native audio). The pipeline looks like this:

Browser Mic -> ScriptProcessor (16kHz PCM) -> WebSocket (base64) -> Python FastAPI -> Gemini Live API
                                                                                            |
Browser Speaker <- AudioContext (24kHz PCM) <- WebSocket (base64) <- Audio Chunks <---------+

Every arrow in that diagram is a potential failure point. And in production, every single one of them failed at least once.


Step 1: Capturing Audio in the Browser

The browser side uses a ScriptProcessorNode (yes, it's deprecated -- but moving capture to an AudioWorklet means shuttling every chunk back to the main thread before the WebSocket send, an extra hop I couldn't afford for real-time conversation). We capture 16kHz mono PCM in 512-sample chunks -- exactly 32ms per chunk.

// Audio capture setup (simplified from useAudioInput.ts)
const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    echoCancellation: true,
    noiseSuppression: true,
    autoGainControl: true,
    sampleRate: 16000,
  }
});

const audioContext = new AudioContext({ sampleRate: 16000 });
const source = audioContext.createMediaStreamSource(stream);
const processor = audioContext.createScriptProcessor(512, 1, 1);

// Analyser for RMS-based voice activity detection
const analyser = audioContext.createAnalyser();
source.connect(analyser);
analyser.connect(processor);

processor.onaudioprocess = (event) => {
  const pcmData = event.inputBuffer.getChannelData(0);
  const rms = Math.sqrt(
    pcmData.reduce((sum, x) => sum + x * x, 0) / pcmData.length
  );

  // VAD gate: only send if voice detected (RMS > 0.05)
  if (rms > 0.05 && ws.readyState === WebSocket.OPEN) {
    // Convert Float32 to Int16 PCM, then base64 encode
    const int16 = new Int16Array(pcmData.length);
    for (let i = 0; i < pcmData.length; i++) {
      int16[i] = Math.max(-32768, Math.min(32767, pcmData[i] * 32768));
    }
    ws.send(JSON.stringify({
      type: "audio",
      data: btoa(String.fromCharCode(...new Uint8Array(int16.buffer)))
    }));
  }
};

The 32ms chunk interval was a hard-won choice. It gives Gemini enough data per packet to process efficiently while keeping perceived latency under 300ms end-to-end. The VAD threshold of 0.05 RMS filters out background noise without clipping soft speech.
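The same RMS gate is easy to reason about outside the browser. Here is the logic in Python, mirroring the JavaScript above (a sketch for illustration, not production code):

```python
import math

VAD_THRESHOLD = 0.05  # same gate as the browser code above

def rms(chunk: list[float]) -> float:
    """Root-mean-square level of a float PCM chunk in [-1.0, 1.0]."""
    if not chunk:
        return 0.0
    return math.sqrt(sum(x * x for x in chunk) / len(chunk))

def should_send(chunk: list[float]) -> bool:
    """VAD gate: forward the chunk only if it likely contains speech."""
    return rms(chunk) > VAD_THRESHOLD

# Near-silence stays below the gate; a clearly audible signal passes it.
assert not should_send([0.001] * 512)
assert should_send([0.3, -0.3] * 256)  # RMS = 0.3
```

The gate is intentionally crude: RMS over 32ms is not a real voice activity detector, but at this chunk size it is cheap enough to run on every buffer with no added latency.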


Step 2: The Python Backend (FastAPI + WebSockets)

The backend is Python FastAPI, deployed on Google Cloud Run. Python was the right call because Gemini's client libraries are Python-first, and the entire analysis pipeline (market research, competitor scraping with Playwright, report generation) lives in the same codebase.

# WebSocket handler (simplified from server.py)
import asyncio
import base64

from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/ws_live")
async def websocket_live(ws: WebSocket):
    await ws.accept()
    # GeminiLiveSession is our thin wrapper around the Gemini Live API client
    session = GeminiLiveSession(model="gemini-2.5-flash-exp")

    async def forward_to_gemini():
        # Browser audio -> Gemini
        async for msg in ws.iter_json():
            if msg["type"] == "audio":
                pcm_bytes = base64.b64decode(msg["data"])
                await session.send_audio(pcm_bytes)  # 16kHz PCM

    async def forward_to_browser():
        # Gemini audio -> Browser
        async for event in session.events():
            if event.type == "audio":
                # Gemini returns 24kHz PCM
                chunk_b64 = base64.b64encode(event.data).decode()
                await ws.send_json({
                    "type": "audio",
                    "data": chunk_b64
                })
            elif event.type == "tool_call":
                result = await execute_tool(event)
                await session.send_tool_response(result)

    await asyncio.gather(forward_to_gemini(), forward_to_browser())

The asymmetric sample rates (16kHz in, 24kHz out) aren't a mistake -- Gemini natively outputs at 24kHz, and downsampling would lose audio quality. The browser's AudioContext handles the sample rate mismatch transparently.
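As a sanity check on the asymmetric rates: what has to match across the pipeline is chunk duration, not sample count. A quick sketch (constants are illustrative, matching the rates above):

```python
IN_RATE = 16_000   # browser mic -> Gemini
OUT_RATE = 24_000  # Gemini -> browser speaker

def chunk_duration_ms(samples: int, rate: int) -> float:
    """Duration of a mono PCM chunk in milliseconds."""
    return samples / rate * 1000

def samples_for_duration(ms: float, rate: int) -> int:
    """Samples needed to cover `ms` milliseconds at `rate`."""
    return round(ms / 1000 * rate)

# A 512-sample capture chunk lasts 32 ms; the same 32 ms of Gemini
# output is 768 samples, i.e. 1536 bytes of 16-bit PCM.
assert chunk_duration_ms(512, IN_RATE) == 32.0
assert samples_for_duration(32.0, OUT_RATE) == 768
```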

The Echo Cancellation Problem

Gemini hears its own output through the user's speakers and tries to respond to itself. Browser-level echo cancellation (echoCancellation: true) handles most cases, but not all -- especially on laptops with poor speaker-mic isolation.

My solution: a speaking-state gate. When Gemini is outputting audio, we suppress inbound audio at the application level:

# Echo gate in the session handler
import time

class SessionState:
    def __init__(self):
        self.agent_speaking = False
        self.last_agent_audio_time = 0.0

    def should_forward_audio(self, rms: float) -> bool:
        # Suppress during agent speech + 1.5s cooldown after
        if self.agent_speaking:
            return False
        if time.time() - self.last_agent_audio_time < 1.5:
            return rms > 0.03  # Higher threshold during cooldown
        return rms > 0.01  # Normal threshold

This two-tier threshold was the key insight: background noise sits at RMS 0.01-0.02, so during the cooldown period after the agent stops speaking, we only forward audio that's clearly human speech (> 0.03).
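To make the gate unit-testable, the same logic can be written with an explicit clock instead of `time.time()` (the `EchoGate` name and method names here are illustrative, not from the production code):

```python
class EchoGate:
    """Two-tier RMS gate: strict during the post-speech cooldown, lenient otherwise."""

    def __init__(self, cooldown_s: float = 1.5):
        self.cooldown_s = cooldown_s
        self.agent_speaking = False
        self.last_agent_audio_time = float("-inf")

    def on_agent_audio(self, now: float) -> None:
        self.agent_speaking = True
        self.last_agent_audio_time = now

    def on_agent_done(self) -> None:
        self.agent_speaking = False

    def should_forward(self, rms: float, now: float) -> bool:
        if self.agent_speaking:
            return False                      # hard mute while the agent talks
        if now - self.last_agent_audio_time < self.cooldown_s:
            return rms > 0.03                 # strict: only clear human speech
        return rms > 0.01                     # lenient: normal conversation

gate = EchoGate()
gate.on_agent_audio(now=10.0)
assert not gate.should_forward(rms=0.5, now=10.5)   # muted mid-speech
gate.on_agent_done()
assert not gate.should_forward(rms=0.02, now=11.0)  # cooldown: echo-level noise blocked
assert gate.should_forward(rms=0.04, now=11.0)      # cooldown: real speech passes
assert gate.should_forward(rms=0.02, now=12.0)      # after cooldown, lenient again
```

Passing `now` in explicitly is the only change from the production version; everything else is the same two-tier decision.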


Failure Mode: The 1011 Disconnects

For weeks, Gemini Live API would randomly close connections with status code 1011 (Internal Server Error). No pattern, no warning. Sessions would die mid-sentence.

The fix was layered:

# Reconnection with session resumption
import asyncio

async def handle_disconnect(session, ws):
    for attempt in range(3):
        try:
            # Gemini supports session resumption via handle
            new_session = await GeminiLiveSession.resume(
                session.resumption_handle
            )
            # Re-send last audio chunk as context
            await new_session.send_audio(session.last_chunk)
            return new_session
        except Exception:
            await asyncio.sleep(0.5 * (attempt + 1))
    # After 3 failed attempts, tell the client so it can play its audio fallback
    await ws.send_json({"type": "system", "message": "reconnecting"})

Session resumption handles (persisted to Firestore) were a game-changer. Instead of starting a new conversation, Gemini picks up exactly where it left off. Users barely notice the blip.


Step 3: Playing Audio Back in the Browser

Gemini returns 24kHz PCM chunks. Playing them without glitches requires the Web Audio API with a buffer scheduler:

// Audio playback (simplified from useAudioOutput.ts)
class AudioPlayer {
  private context: AudioContext;
  private nextStartTime = 0;
  private gainNode: GainNode;

  constructor() {
    this.context = new AudioContext({
      sampleRate: 24000,
      latencyHint: /Mobi/.test(navigator.userAgent)
        ? "playback" : "interactive"
    });
    this.gainNode = this.context.createGain();
    this.gainNode.connect(this.context.destination);
  }

  playChunk(pcmBase64: string) {
    const bytes = Uint8Array.from(atob(pcmBase64), c => c.charCodeAt(0));
    const int16 = new Int16Array(bytes.buffer);
    const float32 = new Float32Array(int16.length);
    for (let i = 0; i < int16.length; i++) {
      float32[i] = int16[i] / 32768;
    }

    const buffer = this.context.createBuffer(1, float32.length, 24000);
    buffer.getChannelData(0).set(float32);

    const source = this.context.createBufferSource();
    source.buffer = buffer;
    source.connect(this.gainNode);

    const startTime = Math.max(
      this.context.currentTime, this.nextStartTime
    );
    source.start(startTime);
    this.nextStartTime = startTime + buffer.duration;
  }
}

The nextStartTime scheduler ensures seamless playback regardless of network jitter. The latencyHint switch between mobile ("playback") and desktop ("interactive") was a subtle but important optimization -- mobile browsers handle audio buffers differently.
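The scheduling rule itself fits in a few lines, so here it is isolated in Python for clarity (class and method names are illustrative): each chunk starts either now or exactly when the previous chunk ends, whichever is later.

```python
class ChunkScheduler:
    """Gapless playback scheduling: chunks never overlap, and back-to-back
    chunks queue seamlessly even when they arrive early or jittered."""

    def __init__(self, sample_rate: int = 24_000):
        self.sample_rate = sample_rate
        self.next_start_time = 0.0  # play head, in seconds

    def schedule(self, num_samples: int, current_time: float) -> float:
        """Return the start time for a chunk and advance the play head."""
        start = max(current_time, self.next_start_time)
        self.next_start_time = start + num_samples / self.sample_rate
        return start

sched = ChunkScheduler()
# Two 100 ms chunks arrive back-to-back: the second queues right after the first.
assert sched.schedule(2400, current_time=0.0) == 0.0
assert sched.schedule(2400, current_time=0.01) == 0.1   # no gap, no overlap
# A chunk arriving late after a network stall starts immediately: a brief
# silence is audible, but playback never overlaps or glitches.
assert sched.schedule(2400, current_time=0.5) == 0.5
```

This is the whole trick behind the `nextStartTime` field: the `max()` absorbs both early arrivals (queue them) and late arrivals (play them now).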


What I Learned Building This Solo

1. Build the unhappy path first. I spent week one on the happy path. Weeks two through four were entirely edge cases -- reconnection, echo suppression, barge-in handling. If I could redo it, I'd build error recovery before a single feature.

2. Voice is a different UX paradigm. Users don't read error messages mid-conversation. Every failure needs an audio fallback. We pre-compute "filler" audio chunks ("one moment please...") as 24kHz PCM, ready to stream instantly when Gemini is slow or reconnecting.

3. Speech-to-speech beats STT+TTS. We initially considered a Whisper -> Claude -> ElevenLabs pipeline. Gemini Live API's native audio mode is faster (sub-500ms round-trip), cheaper, and handles interruptions naturally. The trade-off: less control over the voice, but the latency gain is massive.

4. Cloud Run works for WebSockets, with caveats. We deploy on Google Cloud Run (me-west1 region). WebSocket connections survive container restarts thanks to session resumption handles saved in Firestore. The key setting: request timeout of 3600s (1 hour) for long interview sessions.
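The pre-computed filler audio from point 2 can be generated offline. A minimal sketch producing a 24kHz Int16 PCM chunk -- here a quiet tone as a placeholder, since in production it would be a pre-recorded "one moment please..." clip:

```python
import math
import struct

SAMPLE_RATE = 24_000  # match Gemini's output rate so filler chunks mix seamlessly

def make_filler_pcm(duration_s: float, freq_hz: float = 440.0,
                    amplitude: float = 0.05) -> bytes:
    """Generate a quiet sine tone as 16-bit little-endian mono PCM bytes.
    Stands in for pre-recorded filler speech in this sketch."""
    n = int(duration_s * SAMPLE_RATE)
    samples = (
        int(amplitude * 32767 * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE))
        for i in range(n)
    )
    return struct.pack(f"<{n}h", *samples)

chunk = make_filler_pcm(0.5)
assert len(chunk) == 24_000  # 0.5 s * 24 000 Hz * 2 bytes per sample
```

Because the bytes are already in Gemini's output format (24kHz, 16-bit mono PCM), the backend can push them down the same WebSocket path with zero extra processing when the model is slow.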


The Result

The system now runs validation interviews averaging 12 minutes of continuous voice conversation. Across 500+ sessions, hard failures dropped to under 1% after implementing the echo gate and session resumption. Each interview includes 3 AI agents (Alex for discovery, Sam for architecture, Maya for design) that use ~12 function-calling tools to research markets, analyze competitors, and generate reports -- all while maintaining a natural conversation.

Building this solo meant every failure landed directly in my Telegram inbox (via a monitoring bot). Which, honestly, is the fastest feedback loop possible.


What I'm Curious About

The biggest remaining challenge is audio quality on mobile browsers. iOS Safari handles AudioContext differently from Chrome, and some Android devices have aggressive echo cancellation that clips the AI's speech. We're currently using device-specific settings, but it feels like a hack.

Has anyone found a robust cross-browser audio playback strategy for real-time AI voice? Especially interested in experiences with AudioWorklet vs ScriptProcessorNode for this use case. Drop your thoughts in the comments.
