When I started building GoNoGo.team — a platform where AI agents interview founders by voice to validate startup ideas — I thought the hard part would be the AI reasoning. The multi-agent orchestration. The 40+ function-calling tools.
I was wrong.
The hard part was echo. Specifically: how do you stop an AI agent from hearing itself talk, freaking out, and interrupting its own sentence?
After 500+ voice sessions and too many late nights staring at RMS waveforms, here's what I actually learned.
The Setup: Speech-to-Speech, Not STT → LLM → TTS
GoNoGo runs on Gemini 2.5 Flash Live API — a true speech-to-speech pipeline. There's no intermediate transcription step, no text-to-speech synthesis layer bolted on afterward. Audio goes in, audio comes out. Direct.
This is important because it changes everything about how you handle audio on the client. You're not working with text buffers. You're working with raw PCM, 16kHz input from the browser mic, 24kHz output from the agent voice. Base64-encoded over WebSocket.
The browser capture side looks roughly like this:
// ScriptProcessorNode in browser — 512-sample chunks (~32ms each at 16kHz)
const VAD_THRESHOLD = 0.03; // speech/silence boundary
const scriptProcessor = audioContext.createScriptProcessor(512, 1, 1);
scriptProcessor.onaudioprocess = (event) => {
  const inputBuffer = event.inputBuffer.getChannelData(0);

  // Calculate RMS for VAD
  const rms = Math.sqrt(
    inputBuffer.reduce((sum, sample) => sum + sample * sample, 0) / inputBuffer.length
  );
  if (rms < VAD_THRESHOLD) return;

  // Convert Float32 PCM to Int16
  const int16Buffer = new Int16Array(inputBuffer.length);
  for (let i = 0; i < inputBuffer.length; i++) {
    int16Buffer[i] = Math.max(-32768, Math.min(32767, inputBuffer[i] * 32768));
  }

  // Base64 encode and send over WebSocket
  const base64Audio = btoa(String.fromCharCode(...new Uint8Array(int16Buffer.buffer)));
  ws.send(JSON.stringify({ type: 'audio_chunk', data: base64Audio }));
};
Simple enough. Until the AI starts talking.
The Echo Problem (And Why Browser AEC Isn't Enough)
Browsers have built-in acoustic echo cancellation. You enable it when you call getUserMedia:
const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    echoCancellation: true,
    noiseSuppression: true,
    autoGainControl: true
  }
});
This works great for video calls between humans; it was designed for that. But it has a fundamental assumption baked in: the "far end" audio comes through a playback path the browser recognizes as remote audio, like an <audio> element attached to a peer connection.
When you're playing 24kHz PCM chunks from a WebSocket, decoded manually and scheduled through AudioContext buffers? The browser's AEC has no idea that audio exists. It can't cancel what it can't see.
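For context, the playback path looks roughly like this. This is an illustrative sketch rather than our exact code (the helper names are mine), but it shows why AEC is blind here: decoded PCM is scheduled straight onto the AudioContext timeline, never through an <audio> element.

```typescript
// Int16 PCM (little-endian) → Float32 in [-1, 1]
function int16ToFloat32(bytes: Uint8Array): Float32Array {
  const int16 = new Int16Array(bytes.buffer, bytes.byteOffset, bytes.byteLength / 2);
  const float32 = new Float32Array(int16.length);
  for (let i = 0; i < int16.length; i++) {
    float32[i] = int16[i] / 32768;
  }
  return float32;
}

// Schedule each decoded chunk back-to-back on the AudioContext clock.
let nextStartTime = 0;
function schedulePcmChunk(ctx: AudioContext, samples: Float32Array, sampleRate = 24000) {
  const buffer = ctx.createBuffer(1, samples.length, sampleRate);
  buffer.getChannelData(0).set(samples);
  const source = ctx.createBufferSource();
  source.buffer = buffer;
  source.connect(ctx.destination); // straight to the speakers, invisible to AEC
  nextStartTime = Math.max(nextStartTime, ctx.currentTime);
  source.start(nextStartTime);
  nextStartTime += samples.length / sampleRate;
}
```

Each WebSocket message gets base64-decoded, run through int16ToFloat32, and handed to schedulePcmChunk; the browser just sees anonymous buffer sources playing.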
So your AI agent starts speaking. The microphone picks up the speaker output. The agent hears itself. In the best case, it gets confused and repeats something. In the worst case — and this happened constantly in early builds — you get a feedback loop where the agent interrupts itself mid-sentence, hears the interruption, tries to respond to it, hears that, and the whole session collapses.
I called these 1011 disconnects, because that was the WebSocket close code I kept seeing in logs.
The Two-Tier RMS Gate
The fix is a two-tier RMS (Root Mean Square) gate on the audio capture side. The idea is simple: measure the loudness of what the mic is picking up, and if it's probably just the speaker playing back, don't send it.
But "simple" hides a lot of edge cases.
Tier 1: Hard suppress during agent speech
While the agent is actively speaking, I track that state server-side and send it to the client. During this window, incoming audio is suppressed entirely — no chunks sent to Gemini.
let agentSpeaking = false;
let cooldownTimer: ReturnType<typeof setTimeout> | null = null;

const COOLDOWN_MS = 1500;
const COOLDOWN_THRESHOLD = 0.05; // Higher threshold during cooldown
const NORMAL_THRESHOLD = 0.03;   // Normal VAD threshold

// Called when agent audio stream starts/stops
function setAgentSpeakingState(speaking: boolean) {
  if (speaking) {
    agentSpeaking = true;
    if (cooldownTimer) clearTimeout(cooldownTimer);
  } else {
    agentSpeaking = false;
    // Start cooldown period
    cooldownTimer = setTimeout(() => {
      cooldownTimer = null;
    }, COOLDOWN_MS);
  }
}

function shouldSendAudioChunk(rms: number): boolean {
  if (agentSpeaking) return false; // Hard suppress
  if (cooldownTimer !== null) {
    // In cooldown: use higher threshold to swallow speaker decay
    return rms > COOLDOWN_THRESHOLD;
  }
  return rms > NORMAL_THRESHOLD;
}
Tier 2: The 1.5-second cooldown
This is the part that took me the longest to figure out. When the agent stops talking, there's still speaker resonance in the room. The RMS of captured audio doesn't drop to zero immediately — it decays. Background noise in a typical home office sits at 0.01–0.02 RMS, but for 1-2 seconds after playback stops you're seeing 0.025–0.04 RMS — right around or above the 0.03 speech threshold.
The cooldown period uses a higher threshold (0.05 vs the normal 0.03) for 1.5 seconds after agent speech ends. This catches the decay without cutting off a founder who immediately starts talking back.
Was this threshold tuned empirically? Absolutely. I spent days listening to session replays, measuring exactly how fast room resonance decays across different mic setups.
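To make the two tiers concrete, here's a deterministic, timestamp-based restatement of the gate. This is an illustrative sketch, not the production code (which runs on timers); the 0.03 speech boundary and 0.05 echo gate follow the values spelled out in the comments below.

```typescript
// Deterministic version of the two-tier gate, for illustration.
const COOLDOWN_MS = 1500;
const COOLDOWN_THRESHOLD = 0.05; // higher bar while speaker decay is still audible
const NORMAL_THRESHOLD = 0.03;   // normal speech/silence boundary

function shouldSend(rms: number, agentSpeaking: boolean, msSinceAgentStopped: number): boolean {
  if (agentSpeaking) return false;   // Tier 1: hard suppress
  if (msSinceAgentStopped < COOLDOWN_MS) {
    return rms > COOLDOWN_THRESHOLD; // Tier 2: cooldown threshold
  }
  return rms > NORMAL_THRESHOLD;
}

// Speaker decay at 0.04 RMS, 400ms after the agent stopped: suppressed.
shouldSend(0.04, false, 400);  // false
// A founder answering immediately at 0.12 RMS: passes even in cooldown.
shouldSend(0.12, false, 400);  // true
// Background noise at 0.02 after cooldown: still below the normal threshold.
shouldSend(0.02, false, 2000); // false
```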
Session Resumption: The Other Half of the Problem
Echo cancellation solved the quality problem. Session resumption solved the reliability problem.
Gemini Live sessions drop. Network hiccups, mobile handoffs, Chrome deciding to do something aggressive with memory — connections fail. Early on, a dropped connection meant starting the entire 30-minute interview over. Founders would ragequit. I would understand completely.
The fix: store session handles in Firestore and resume on reconnect.
# FastAPI backend — session management
from google.genai.live import AsyncSession
from firebase_admin import firestore


async def get_or_create_session(
    project_id: str,
    user_id: str,
) -> tuple[AsyncSession, bool]:
    db = firestore.client()
    session_ref = db.collection('sessions').document(f'{user_id}_{project_id}')
    session_doc = session_ref.get()

    if session_doc.exists:
        session_data = session_doc.to_dict()
        handle = session_data.get('resumption_handle')
        if handle:
            try:
                # Attempt resume — Gemini picks up exactly where it left off
                session = await resume_gemini_session(handle)
                return session, True  # resumed=True
            except Exception:
                pass  # Fall through to new session

    # Create new session
    session = await create_gemini_session(project_id)
    session_ref.set({
        'created_at': firestore.SERVER_TIMESTAMP,
        'project_id': project_id,
    })
    return session, False  # resumed=False


async def store_resumption_handle(user_id: str, project_id: str, handle: str):
    db = firestore.client()
    session_ref = db.collection('sessions').document(f'{user_id}_{project_id}')
    session_ref.update({'resumption_handle': handle})
When a session resumes, Gemini restores full context — every tool call result, every piece of market research, every persona in the synthetic focus group. The founder reconnects and the agent says "Sorry about that, where were we?" and genuinely knows where you were.
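On the client side, reconnection is plain exponential backoff before asking the server to resume. A sketch with hypothetical names: connectWebSocket stands in for whatever opens your socket, and the server is assumed to look up the stored resumption_handle when the same user/project reconnects.

```typescript
// Hypothetical helper that resolves once the socket is open.
declare function connectWebSocket(url: string): Promise<any>;

// 500ms, 1s, 2s, 4s, ... capped at 10s
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 10_000): number {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

async function keepConnected(url: string) {
  let attempt = 0;
  for (;;) {
    try {
      const ws = await connectWebSocket(url);
      attempt = 0; // a healthy connection resets the backoff
      // Wait until the socket drops, then loop around and reconnect.
      // The server sees the same user/project id, finds the stored
      // resumption_handle in Firestore, and resumes instead of restarting.
      await new Promise<void>((resolve) => ws.addEventListener('close', () => resolve()));
    } catch {
      // connection attempt failed; fall through to backoff
    }
    await new Promise((r) => setTimeout(r, backoffDelayMs(attempt++)));
  }
}
```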
The Filler Audio Problem
One more thing nobody talks about: what do you play while the AI is thinking?
Gemini 2.5 Flash is fast — 300-500ms end-to-end in the best case. But when the agent is executing a tool call — crawling a competitor site with Playwright, scraping Reddit, calculating unit economics — you can hit 3-8 second gaps.
Silence in a voice conversation feels broken. Users assume the connection dropped.
Solution: pre-computed filler audio. Short phrases like "one moment please" or "let me look that up" in 17 languages, stored as PCM chunks, played when tool execution exceeds ~800ms. Filler playback is triggered via a text signal rather than proactive_audio; a regression there caused double playback, so we disabled it entirely and rely on text triggers.
This sounds trivial. It removed about 40% of "the app is broken" support messages.
What I'd Do Differently
Start with the echo gate, not the AI logic. I spent weeks building beautiful multi-agent orchestration before I could demo it reliably. Wrong order.
Instrument RMS values from day one. Log them. Every session. You can't tune what you can't see.
Test on bad hardware. My dev setup has a good mic with physical distance from speakers. Most users have laptop mics 30cm from laptop speakers. Build for that.
Mobile is a different planet. iOS Safari handles AudioContext lifecycle in ways that will make you question your career choices. But that's an article for another day.
The Result
After solving these problems — the two-tier RMS gate, the 1.5s cooldown, the session resumption, the filler audio — GoNoGo runs 15-45 minute voice sessions with real founders, across 21 languages, with 3 AI agents handing off to each other mid-conversation. The 1011 disconnects essentially disappeared.
The voice infrastructure became invisible, which is exactly what it should be.
If you're building anything with browser mic + real-time AI audio: what's been your biggest challenge? I'm genuinely curious whether the echo problem is universal or whether I was doing something particularly wrong early on. Drop it in the comments.
Top comments (6)
solid deep dive into the audio pipeline challenges! the echo cancellation problem is brutal - we've tackled similar issues in some of the voice projects we've covered on daily.dev. your two-tier RMS gate approach works well, especially the cooldown period accounting for room resonance decay. the filler audio insight is spot on - silence feels broken in voice conversations. have you experimented with adaptive thresholds based on room acoustics? some setups we've seen dynamically adjust the RMS thresholds based on initial background noise measurement during session setup.
Great question about adaptive thresholds! We considered dynamic calibration during session setup, but went a different route. The problem: background noise is a snapshot — user moves rooms, opens a window, kid starts playing — and your baseline is stale.
What worked: a fixed two-tier approach. RMS 0.03 as the speech/silence boundary (we started at 0.01 — took us a while to realize background noise sits at 0.01-0.02 and was triggering false positives). Then a separate echo gate at RMS 0.05 that activates during agent speech with a 1.5s cooldown for room resonance decay. That cooldown value was hard-won — the echo gate is what catches residual artifacts that browser AEC misses and that would otherwise crash Gemini's Live API with 1011 errors.
If you look at our commit history, it's basically a graveyard of approaches: silence injection, manual VAD, audioStreamEnd timing, adaptive thresholds — we tried everything before settling on the simple static thresholds that just work. Sometimes boring is better
I guess one trade-off could be a manual switch engaged while the user is talking, like a "walkie-talkie". Pro is that there's guaranteed knowledge of when the user is talking, including long pauses. Con is that if the user forgets to hold down the "talking" button, nothing is captured
The echo cancellation problem is real and under-discussed. I built a real-time meeting transcription system and ran into a similar issue, except mine was speaker diarization falling apart when two people talked over each other.
Sub-500ms is impressive. What's your actual p95 latency in production? The median is always great, it's the tail that kills the user experience. In my case, the first response felt instant but occasionally the transcription would buffer for 2-3 seconds and the whole perceived speed collapsed.
The Gemini Live API is interesting for this. How does it compare to Whisper + streaming for the speech-to-text portion?
On p95 — honest answer: we don't formally track p95 yet. And the "300-500ms" headline is the best case — native audio round-trip with no tool calls, no thinking budget.
In reality, latency stacks up from multiple sources:
Auto VAD turn detection: 1-2.5s just to decide the user stopped talking. This is the hidden killer — the model won't start responding until it's confident you're done speaking
Function calling: add 500-1500ms per tool round-trip. Some turns chain 2-3 tools
Thinking budget: model "pauses" to reason — 1-3s of silence before audio starts
First turn cold start: ~50% of sessions on Vertex AI GA, model goes silent. We retry up to 3x
And here's the real tradeoff nobody talks about: VAD sensitivity vs natural speech. You can cut VAD short (silence_duration_ms=500) and get blazing fast responses — but then the user can't pause to think, can't cough, can't hesitate mid-sentence. The AI jumps in the moment you breathe. It feels aggressive, like talking to someone who interrupts you constantly. Or you set it longer (we use 2.5s for interviews where users need to think, 1.0s for casual) — but now every turn has a 1-2.5s dead silence before the AI responds, and that feels laggy.
There's no winning config. Fast VAD = responsive but robotic conversation where you must speak in one unbroken stream. Slow VAD = natural pauses allowed but noticeable delay. We ended up with different settings per mode, plus our own RMS-based fallback at 1.8s that fires before Auto VAD's 2.5s (because sometimes Auto VAD doesn't fire at all and the session just hangs), plus a transcript-based fallback at 3.0s for when background noise keeps RMS high. It's fallbacks all the way down.
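That cascade is easier to see as a decision function. A sketch using the numbers above (1.8s RMS fallback, 3.0s transcript fallback); the function shape and names are mine, not the production code:

```typescript
// End-of-turn detection fallbacks, as described above. All inputs in ms.
const RMS_FALLBACK_MS = 1800;        // fires before Auto VAD's 2.5s
const TRANSCRIPT_FALLBACK_MS = 3000; // for when background noise keeps RMS high

type TurnEnd = 'auto_vad' | 'rms_fallback' | 'transcript_fallback' | 'none';

function detectTurnEnd(
  autoVadFired: boolean,
  msSinceRmsBelowThreshold: number,   // how long the mic has read "quiet" by RMS
  msSinceLastTranscriptUpdate: number // how long since Gemini transcribed new words
): TurnEnd {
  if (autoVadFired) return 'auto_vad';
  if (msSinceRmsBelowThreshold >= RMS_FALLBACK_MS) return 'rms_fallback';
  if (msSinceLastTranscriptUpdate >= TRANSCRIPT_FALLBACK_MS) return 'transcript_fallback';
  return 'none';
}
```

Run on every audio tick: whichever signal trips first ends the turn, so a hung Auto VAD can't hang the session.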
So realistic end-to-end from "user stops talking" to "hears first audio": p50 ~1.5-2s, p95 probably 3-5s with tool calls. The sub-500ms is real but it's only the model's audio generation time — not what the user actually perceives.
On Gemini Live vs Whisper + streaming — this is the fundamental difference: it's speech-to-speech native audio. No STT→LLM→TTS chain. Audio in, audio out, model reasons on the signal directly. Two fewer latency hops, zero transcription loss.
The tradeoff is control. With Whisper + TTS you can inspect intermediate text, tune each stage, swap voices. With native audio, it's a black box — PCM in, PCM out. Our git history has ~60 commits just on the audio pipeline. Debugging is archaeology.
We also use context_window_compression (sliding window, 40K trigger tokens) for unlimited session duration — without it you hit a ~15 min GoAway disconnect.
Re: diarization — we sidestepped it entirely since it's always 1:1. Echo cancellation separates the two "speakers" at the audio level before anything hits the model
ran into this on a voice-based standup prototype - the AI interrupting itself was the first thing testers noticed, way before any latency complaints. took way too long to realize it was echo and not a model timing issue.