<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Konstantin</title>
    <description>The latest articles on DEV Community by Konstantin (@remi_etien).</description>
    <link>https://dev.to/remi_etien</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3860913%2F1e8c4bb1-6afd-45a5-abe2-caa35bde29de.JPG</url>
      <title>DEV Community: Konstantin</title>
      <link>https://dev.to/remi_etien</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/remi_etien"/>
    <language>en</language>
    <item>
      <title>How we calibrated a Synthetic Focus Group from 'this looks great!' to 93% accuracy</title>
      <dc:creator>Konstantin</dc:creator>
      <pubDate>Tue, 21 Apr 2026 13:00:00 +0000</pubDate>
      <link>https://dev.to/remi_etien/how-we-calibrated-a-synthetic-focus-group-from-this-looks-great-to-93-accuracy-1pln</link>
      <guid>https://dev.to/remi_etien/how-we-calibrated-a-synthetic-focus-group-from-this-looks-great-to-93-accuracy-1pln</guid>
      <description>&lt;p&gt;When we shipped the first version of GoNoGo's Synthetic Focus Group (SFG), every persona loved everything.&lt;/p&gt;

&lt;p&gt;The setup: a founder finishes a 30-minute voice discovery interview about their idea. From that conversation plus a stack of insights we'd scraped for the niche, we spin up five AI personas — a CTO, a budget-conscious shopper, a skeptic, an early adopter, a casual user — and ask the panel to react to the founder's two value-prop variants and a pricing test. &lt;strong&gt;All five voted the same way on every variant. All five "would maybe buy."&lt;/strong&gt; Slightly different wording. Same vibe.&lt;/p&gt;

&lt;p&gt;The problem is obvious in retrospect: we'd built a confirmation engine, not a focus group.&lt;/p&gt;

&lt;p&gt;This is the story of the next six months — what broke, what we tried, what stuck. By the end we hit &lt;strong&gt;93% predicted-vs-real accuracy&lt;/strong&gt; across 16 niches with a 95% CI of 91.4–94.6%. Here's how.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;📌 &lt;strong&gt;TL;DR&lt;/strong&gt; — If you're building anything with synthetic personas, three things matter more than the rest: &lt;strong&gt;(1)&lt;/strong&gt; generate persona grievances from real user data, weighted 3× over LLM-imagined ones, &lt;strong&gt;(2)&lt;/strong&gt; tune sampling temperature &lt;em&gt;per archetype&lt;/em&gt; (skeptics ≠ early adopters), &lt;strong&gt;(3)&lt;/strong&gt; shuffle variant labels per persona to kill position bias. We measured the third one alone shaving 14 percentage points off label bias.&lt;/p&gt;

&lt;p&gt;If you missed the first two posts in this series: &lt;a href="https://dev.to/remi_etien/we-were-building-marketing-for-a-startup-we-accidentally-built-an-a3-41ln"&gt;we built A³ by accident while doing marketing&lt;/a&gt; and then &lt;a href="https://dev.to/remi_etien/i-built-a-voice-ai-with-sub-500ms-latency-heres-the-echo-cancellation-problem-nobody-talks-about-14la"&gt;solved sub-500ms voice latency the hard way&lt;/a&gt;. This post is about what happens &lt;strong&gt;after&lt;/strong&gt; the voice interview ends and we have to actually validate the idea.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What SFG is actually for (and what it isn't)
&lt;/h2&gt;

&lt;p&gt;Before the engineering, the use case.&lt;/p&gt;

&lt;p&gt;A founder finishes the voice discovery and now has a stack of decisions to make: which value prop leads, what to charge, which marketing claims hold up, which feature to build first, which segment to target. Traditionally these get answered by (a) gut feel, (b) asking 5 friends, or (c) running a real survey that takes weeks and costs money to recruit the right respondents.&lt;/p&gt;

&lt;p&gt;The SFG sits in between. It's a &lt;strong&gt;panel of synthetic respondents&lt;/strong&gt;, each modelled on a specific archetype-with-real-grievances, that the founder can re-run as many times as they want against any decision they need to make — A/B variants, pricing, claims, positioning, segments.&lt;/p&gt;

&lt;p&gt;What it gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A vote with &lt;strong&gt;reasoning&lt;/strong&gt;, not a number — "the Skeptic rejected variant B because pricing felt anchored to a tier she doesn't need"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disagreement&lt;/strong&gt; between archetypes — surfacing the trade-off you were about to ignore&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reproducible runs&lt;/strong&gt; — same insights → same panel → same decision logic. You can re-test the same idea after changing one word in the headline&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;What SFG is NOT.&lt;/strong&gt; It is not a fortune teller. It does not predict whether your startup will succeed. It does not predict what &lt;em&gt;the market&lt;/em&gt; will do. It models the &lt;strong&gt;decision behavior of a specific archetype&lt;/strong&gt;, given the frustrations and goals we've sourced for that archetype from real public data. That's a narrower claim than "predict the future" — and a much more honest one.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When we say "93% accuracy" later in this post, that's what's being measured: &lt;strong&gt;how closely a synthesized archetype's modelled behavior matches the observed behavior of real users in that archetype, on data the model didn't see during synthesis.&lt;/strong&gt; Not pre-cognition. Behavioral fidelity.&lt;/p&gt;

&lt;p&gt;That distinction matters because it tells you what the SFG is good for (decision-stage trade-offs, claim stress-tests, segment fit, pricing) and what it's bad for (predicting macro-market outcomes, novel categories with no public user data, regulated industries where the data isn't there).&lt;/p&gt;




&lt;h2&gt;
  
  
  The three things that broke our personas
&lt;/h2&gt;

&lt;p&gt;After watching ~50 sessions where personas all said "great idea!" we noticed three failure modes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Personas had no real grievances.&lt;/strong&gt; They were generated from the LLM's vague prior of "what a CTO might say." So a CTO persona evaluating a B2B SaaS would just... vibe. No specific scar tissue, no real pain.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sampling temperature was uniform.&lt;/strong&gt; Skeptics rolled the same temperature (0.7) as early adopters. Skeptics weren't actually skeptical — they were just slightly less enthusiastic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Variant labels biased everything.&lt;/strong&gt; "Option A" reliably won over "Option B" — classic position bias. Personas were anchoring on label, not content.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We fixed each one. Here's how.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fix #1: Personas built from real grievances, not templates
&lt;/h2&gt;

&lt;p&gt;The base persona-generation algorithm now does this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Niche detection.&lt;/strong&gt; A small LLM classifier maps the project to one of 16 niches (B2B SaaS, marketplace, dev tools, ecommerce, hardware, fitness, content, freelancers, ...). Each niche has a different archetype pool.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Insight collection.&lt;/strong&gt; We pull real posts from Reddit, HackerNews, ProductHunt, G2, app store reviews — anywhere the niche's actual users complain. Typical project gets 100–300 raw insights.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-persona synthesis.&lt;/strong&gt; For each archetype slot (3–5 per project), we sample 3–8 frustrations and 3–8 goals &lt;strong&gt;directly from the real insights&lt;/strong&gt; for that persona's likely demographic.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The critical line in &lt;code&gt;persona_builder.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Real frustrations from source data weighted 3x over LLM-generated ones
&lt;/span&gt;&lt;span class="n"&gt;weighted_frustrations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;real_frustrations&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
    &lt;span class="n"&gt;llm_inferred_frustrations&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That 3× multiplier is the entire difference between a persona who says "I'd want better onboarding" (LLM generic) and one who says "I bounced from the last 4 tools because none of them imported my Notion docs without breaking nested toggles" (real Reddit thread).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💬 The 3× weight on real frustrations is the cheapest, highest-leverage change in the whole pipeline. Without it, you're just paraphrasing the model's prior beliefs back to the founder.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Each persona also carries up to 5 &lt;strong&gt;verbatim quotes&lt;/strong&gt; from the source data, plus a &lt;em&gt;richness score&lt;/em&gt; (0–1) so the orchestrator can flag thin personas before they pollute results. Average richness when &amp;gt;100 insights are available: 0.85+.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Takeaway:&lt;/strong&gt; persona realism is upstream of every other decision. If your input data is "what an LLM thinks a CTO sounds like," everything downstream is fan-fiction.&lt;/p&gt;




&lt;h2&gt;
  
  
  Fix #2: Temperature tuned per archetype
&lt;/h2&gt;

&lt;p&gt;Behavioral diversity isn't a prompt problem — it's a sampling problem.&lt;/p&gt;

&lt;p&gt;We tuned &lt;code&gt;temperature&lt;/code&gt; per archetype in &lt;code&gt;get_temperature()&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ARCHETYPE_TEMPERATURE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EARLY_ADOPTER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# impulsive, willing to leap
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CASUAL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;            &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MAINSTREAM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;        &lt;span class="mf"&gt;0.6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PRAGMATIST&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;        &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# analytical, predictable
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CTO&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;               &lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CFO&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;               &lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SKEPTIC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;           &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# rigid, negative-biased
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BUDGET_CONSCIOUS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;# ... 15 archetypes total
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This alone meaningfully shifted distributions. Skeptics started landing in the 4–6 appeal range by default. Early adopters jumped to 7–9. Pragmatists stayed in the 5–7 band where they belonged.&lt;/p&gt;

&lt;p&gt;We also embedded &lt;strong&gt;cognitive bias hints&lt;/strong&gt; directly into each archetype's system prompt. Pragmatists get explicit "status quo bias" framing. Skeptics get "negativity bias" framing. CFOs get loss-aversion phrasing.&lt;/p&gt;

&lt;p&gt;The personas didn't just &lt;em&gt;sound&lt;/em&gt; different — they actually disagreed with each other.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Takeaway:&lt;/strong&gt; if every persona is sampled at the same temperature, you're running the same character five times in different costumes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Fix #3: Variant shuffling
&lt;/h2&gt;

&lt;p&gt;Stupid, easy, huge:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Shuffle variants per-persona to neutralize position bias
&lt;/span&gt;&lt;span class="n"&gt;variant_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Option 1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Option 2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Option 3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;shuffle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;variant_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For an A/B/C test with 5 personas, each persona sees the variants in a different random order under neutral labels. Position effects average out across the panel.&lt;/p&gt;

&lt;p&gt;We measured this. Before shuffling: Option A won 64% of two-variant tests across our calibration set. After shuffling: 50.3% / 49.7%. The label was carrying a 14-point bias.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ Position bias in LLM panels is real and large. If you're not shuffling labels, your A/B "winners" are partially a measurement of which slot you put them in.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Takeaway:&lt;/strong&gt; before tuning anything sophisticated, audit for the dumb biases first. They cost you 14 points and a &lt;code&gt;random.shuffle()&lt;/code&gt; call to fix.&lt;/p&gt;




&lt;h2&gt;
  
  
  By the numbers
&lt;/h2&gt;

&lt;p&gt;A snapshot of where the system landed after the calibration pass:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Niches calibrated&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;16&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tagged insights in calibration set&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2,069&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Train/test split&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;70 / 30&lt;/strong&gt; per niche&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Overall accuracy&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;93.1%&lt;/strong&gt; (95% CI 91.4–94.6)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Decision match rate&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4 / 4&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Personas generated per project&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3–5&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Average persona richness (&amp;gt;100 insights)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.85+&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Position-bias reduction from shuffling&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;14 percentage points&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Real-grievance weighting over LLM&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3×&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Archetypes available&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;15&lt;/strong&gt; (across B2B, consumer, marketplace)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;p&gt;We needed to know if any of this was working. So we built a calibration suite.&lt;/p&gt;

&lt;p&gt;But first — what are we actually measuring?&lt;/p&gt;

&lt;p&gt;We are &lt;strong&gt;not&lt;/strong&gt; measuring "did SFG predict whether the product succeeded." We're measuring something narrower and more testable: &lt;strong&gt;given a known archetype and a known set of grievances, does the synthesized persona produce the same pattern of pain points, needs, sentiment, and decisions that the real users in that archetype produced — on data the model didn't see?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If yes, the persona is a faithful behavioral model of its archetype. That's what we calibrated against.&lt;/p&gt;

&lt;p&gt;The setup: 2,069 manually-tagged insights across 16 niches, each with known ground-truth pain points, needs, sentiment distribution, and the decision a real founder would have arrived at when looking at the full dataset.&lt;/p&gt;

&lt;p&gt;We split 70/30 — synthesize personas using 70% of insights per niche, then ask each persona to characterize the held-out 30% (without ever seeing it). Compare the persona's response to ground truth across &lt;strong&gt;5 weighted dimensions&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Weight&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pain point overlap&lt;/td&gt;
&lt;td&gt;Semantic Jaccard (threshold 0.53)&lt;/td&gt;
&lt;td&gt;0.30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pain point ranking&lt;/td&gt;
&lt;td&gt;Spearman's ρ&lt;/td&gt;
&lt;td&gt;0.15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Needs overlap&lt;/td&gt;
&lt;td&gt;Semantic Jaccard&lt;/td&gt;
&lt;td&gt;0.25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sentiment distribution&lt;/td&gt;
&lt;td&gt;1 − √JSD&lt;/td&gt;
&lt;td&gt;0.20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Language similarity&lt;/td&gt;
&lt;td&gt;Cosine of embeddings&lt;/td&gt;
&lt;td&gt;0.10&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Final score across the 16 niches: &lt;strong&gt;93.1% behavioral fidelity, 95% CI 91.4–94.6%.&lt;/strong&gt; Best-performing niches: content tools (93.2%), freelancers (92.7%), fitness (90.7%).&lt;/p&gt;

&lt;p&gt;Decision match rate (does the synthesized panel reach the same go/no-go verdict as the held-out real data on 4 axes — concern, need, verdict, recommendations): &lt;strong&gt;4/4 across the calibration set.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To restate what that number means: when we ask a synthesized archetype to characterize a problem space using only the 70% it was built from, its description of pain points, needs, sentiment, and recommended decisions matches the description that real users in that archetype produced (on the held-out 30%) &lt;strong&gt;at 93% similarity, on average, across 16 niches.&lt;/strong&gt; Not "predicts the future at 93%." Reproduces archetype behavior at 93%.&lt;/p&gt;

&lt;p&gt;The most valuable thing this gave us wasn't the headline number. It was the per-niche breakdown — we could see &lt;em&gt;which&lt;/em&gt; archetype pools were weak, &lt;em&gt;which&lt;/em&gt; niches needed more insight sources, &lt;em&gt;which&lt;/em&gt; prompts were drifting.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🎯 The headline accuracy number is for marketing. The per-niche breakdown is for engineering. Build both.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  How an A/B test actually runs
&lt;/h2&gt;

&lt;p&gt;When a founder gives us two landing page variants:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Generate panel&lt;/strong&gt; — 3–5 personas synthesized from the project's collected insights (already done at discovery time).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-persona evaluation in parallel&lt;/strong&gt; — each persona sees all variants in one prompt, with shuffled labels.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured response&lt;/strong&gt; — for each variant: appeal score (1–10), willingness (&lt;code&gt;would_buy&lt;/code&gt; / &lt;code&gt;might_buy&lt;/code&gt; / &lt;code&gt;would_not_buy&lt;/code&gt;), pros, cons, 2–6 sentence reasoning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Round 2 — panel discussion&lt;/strong&gt; — personas react to each other's reasoning. This is where the interesting stuff happens. The skeptic challenges the early adopter. Scores shift. Sometimes the panel realigns entirely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggregate&lt;/strong&gt; — winner by win count first, average appeal as tiebreak.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The output isn't just a winner. It's a transcript a founder can actually read — with reasoning that maps to specific frustrations from real users.&lt;/p&gt;

&lt;h2&gt;
  
  
  What else the SFG can do
&lt;/h2&gt;

&lt;p&gt;Once we had calibrated personas, A/B testing turned out to be the smallest use case. The same panel can run:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claim validation&lt;/strong&gt; — paste 1–10 marketing claims, each persona votes agree / disagree with reasoning. Surfaces which claims a real audience would call BS on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing tests&lt;/strong&gt; — test 3+ price points, get per-persona perceived value and conversion likelihood.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adaptive hypothesis generation&lt;/strong&gt; — auto-generates 5–6 testable hypotheses covering problem fit, segment fit, behavior change, switching costs, pricing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Early adopter lead extraction&lt;/strong&gt; — pulls 20 real handles from the source insights — actual people who described the exact problem you're solving. Not synthetic. Outreach list.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There's also a separate &lt;strong&gt;Reality Check&lt;/strong&gt; feature that flips the comparison around: it lets you run a real human survey, then dual-scores the SFG prediction against the real responses. That's how we keep the 93% number honest as the model evolves.&lt;/p&gt;




&lt;h2&gt;
  
  
  What still doesn't work
&lt;/h2&gt;

&lt;p&gt;A few things I'm still not satisfied with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single-model persona reasoning.&lt;/strong&gt; Persona inference currently runs on one frontier-class LLM. We cross-verify &lt;em&gt;factual claims&lt;/em&gt; across multiple providers in a separate feature, but the persona reasoning itself is single-backbone. That's a known shared-blind-spot risk we want to address — multi-model panels are on the roadmap.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No benchmarking against traditional focus groups.&lt;/strong&gt; Only against holdout real-user data. Comparing AI personas to a real moderated focus group with 8 humans is the obvious next benchmark, and it's expensive enough that I keep deferring it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Niches we don't have insight sources for&lt;/strong&gt; (regulated industries mostly) drop to ~75% accuracy. The whole approach falls apart when you can't pull real user grievances from somewhere public.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;🚧 We're not done. The 93% number is a calibration milestone, not a verdict. Anyone who tells you their AI focus group has solved the problem is selling.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;The full Synthetic Focus Group lives inside &lt;a href="https://gonogo.team/?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=sfg_article" rel="noopener noreferrer"&gt;gonogo.team&lt;/a&gt;. The free tier gives you 3 projects with the voice Discovery agent — enough to feel out whether the methodology makes sense for your idea before unlocking the full multi-agent pipeline (which is where SFG, A/B testing, pricing tests and the rest live, behind a one-time per-project credit — no subscription).&lt;/p&gt;

&lt;p&gt;If you build something with synthetic personas yourself — the three things that mattered most for us, ranked: &lt;strong&gt;(1)&lt;/strong&gt; real grievances over LLM templates (3× weight), &lt;strong&gt;(2)&lt;/strong&gt; per-archetype temperature, &lt;strong&gt;(3)&lt;/strong&gt; variant shuffling. Without all three, you'll just keep getting "this looks great!" forever.&lt;/p&gt;

&lt;p&gt;Comments and corrections welcome — especially if you've benchmarked AI personas against real focus groups, I'd love to compare notes.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>startup</category>
      <category>productdevelopment</category>
    </item>
    <item>
      <title>We Were Building Marketing for a Startup. We Accidentally Built an A3</title>
      <dc:creator>Konstantin</dc:creator>
      <pubDate>Thu, 16 Apr 2026 06:22:44 +0000</pubDate>
      <link>https://dev.to/remi_etien/we-were-building-marketing-for-a-startup-we-accidentally-built-an-a3-41ln</link>
      <guid>https://dev.to/remi_etien/we-were-building-marketing-for-a-startup-we-accidentally-built-an-a3-41ln</guid>
      <description>&lt;p&gt;We build &lt;a href="https://gonogo.team/" rel="noopener noreferrer"&gt;GoNoGo&lt;/a&gt; — a platform where founders validate their ideas through live voice interviews with a synthetic focus group, competitor analysis, market sizing, and a dozen other tools that Anna will tell you about.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;We'll get to Anna&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;The product was ready. Time for marketing. We did everything by the book: landing page, explainer video, banners, social posts.&lt;/p&gt;

&lt;p&gt;The result? Standard tools delivered standard results. We expected more.&lt;/p&gt;

&lt;p&gt;Banners get lost in the noise. Videos get skipped. Landing pages get scanned in half a second. The impression → click → signup funnel leaks at every step. Not because the product is bad. Because the format is dead.&lt;/p&gt;

&lt;p&gt;What if advertising stopped showing — and started selling?&lt;/p&gt;




&lt;h2&gt;
  
  
  Not a Video. Not a Banner. Not a Chatbot
&lt;/h2&gt;

&lt;p&gt;Meet Anna&lt;/p&gt;

&lt;p&gt;&lt;iframe height="640" src="https://codepen.io/Konstantin-Tikhaev/embed/EagrgGb?height=640&amp;amp;default-tab=result&amp;amp;embed-version=2"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;This is not a recording. Anna is live right now. Press Start, allow your microphone — and ask her anything about GoNoGo. She'll answer with her voice, show you slides with real data, compare competitors.&lt;/p&gt;

&lt;p&gt;Right here. In this article. No redirects.&lt;/p&gt;

&lt;p&gt;Try speaking to her in Japanese, Arabic, or Spanish — she'll switch on the fly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A³ — Autonomous Advertising Ambassador&lt;/strong&gt;&lt;br&gt;
Three letters. Three words. Not one wasted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Autonomous&lt;/strong&gt; — works on her own. No scripts, no decision trees. Understands context, improvises, adapts to the person she's talking to. In any language.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advertising&lt;/strong&gt; — lives where the audience is. Not on your website where you still need to drive traffic. In the feed, in an article, in a post.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ambassador&lt;/strong&gt; — represents your brand. Doesn't answer FAQs from a corner widget — she greets, explains, demonstrates, persuades. Like your best salesperson who knows the product by heart.&lt;/p&gt;

&lt;p&gt;The difference between A³ and a chatbot is the difference between a live salesperson and a "FAQ" sign by the door.&lt;/p&gt;




&lt;h2&gt;
  
  
  Case 1: A Pocket Marketer for Tech Products
&lt;/h2&gt;

&lt;p&gt;Anna is already live. Right now she's embedded in a post on X, in this Dev.to article, and soon on Medium.&lt;/p&gt;

&lt;p&gt;What she does:&lt;/p&gt;

&lt;p&gt;Explains the product by voice — differently for each person, depending on their questions&lt;br&gt;
Generates analytics and slides in real time&lt;br&gt;
Generates contextual CTAs during the conversation — links, sign-up prompts, demo redirects — based on what you just discussed, not a static button&lt;br&gt;
Compares with competitors if you ask&lt;br&gt;
Handles objections — not from a template, but in conversation&lt;br&gt;
Speaks the listener's language — switches on the fly, no configuration&lt;br&gt;
Not one video for a million views. A million unique conversations.&lt;/p&gt;




&lt;h2&gt;
  
  
  Case 2: A Welcomer in Physical Retail
&lt;/h2&gt;

&lt;p&gt;Same architecture, different context. A screen at the store entrance. A customer walks up:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Do you have the iPhone 16 Pro in black?"&lt;br&gt;
→ API call to inventory → "Three in stock. Want to see the specs?"&lt;br&gt;
"Compare it with the Samsung S25"&lt;br&gt;
→ comparison table generated on the fly&lt;br&gt;
"Order it for delivery tomorrow"&lt;br&gt;
→ order placed via API, no cashier needed&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Not a chatbot on a website. A voice interface on location, with live access to store data.&lt;/p&gt;

&lt;p&gt;A tourist walks into a store in Tel Aviv and asks in Japanese — A³ answers in Japanese. No switching, no settings, no language barrier.&lt;/p&gt;




&lt;h2&gt;
  
  
  First Reaction
&lt;/h2&gt;

&lt;p&gt;We recently demoed Anna to the owner of a retail chain. We adapted the demo to his inventory — stock checks, product comparisons, order placement.&lt;/p&gt;

&lt;p&gt;His first question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How many response variations did you pre-record?&lt;br&gt;
We explained — none. Everything is generated in real time.&lt;br&gt;
His second question:&lt;br&gt;
No, seriously — how many?&lt;br&gt;
We're now in partnership talks.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What Doesn't Work (Yet)
&lt;/h2&gt;

&lt;p&gt;We believe in honesty more than hype.&lt;/p&gt;

&lt;p&gt;Dev.to — fully functional widget. You just saw it.&lt;/p&gt;

&lt;p&gt;X (Twitter) — Player Card shows the widget in the feed, but the platform blocks microphone access inside the iframe. We solved this with a popup — the browser version is fully functional.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foo3rdw0743rir8j6tgxf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foo3rdw0743rir8j6tgxf.png" alt="Anna — A³ Autonomous Advertising Ambassador by GoNoGo" width="800" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Other platforms — some strip permissions in sandboxed iframes. We're working on a universal fallback.&lt;/p&gt;

&lt;p&gt;These are platform limitations, not technology limitations. A³ works anywhere there's a WebSocket and a microphone.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;We're building a constructor — a platform where any business can create their own A³. Connect your data, customize voice and visuals, embed anywhere. No code required.&lt;/p&gt;

&lt;p&gt;Anna is the first proof of concept.&lt;/p&gt;

&lt;p&gt;Talk to her: &lt;a href="//gonogo.team/talk"&gt;A³ for Team GoNoGO&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A³ is a GoNoGo technology. Provisional patent filed&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>marketing</category>
      <category>ai</category>
      <category>webdev</category>
      <category>startup</category>
    </item>
    <item>
      <title>I Built a Voice AI with Sub-500ms Latency. Here's the Echo Cancellation Problem Nobody Talks About</title>
      <dc:creator>Konstantin</dc:creator>
      <pubDate>Sun, 05 Apr 2026 08:50:00 +0000</pubDate>
      <link>https://dev.to/remi_etien/i-built-a-voice-ai-with-sub-500ms-latency-heres-the-echo-cancellation-problem-nobody-talks-about-14la</link>
      <guid>https://dev.to/remi_etien/i-built-a-voice-ai-with-sub-500ms-latency-heres-the-echo-cancellation-problem-nobody-talks-about-14la</guid>
      <description>&lt;p&gt;When I started building &lt;a href="https://gonogo.team" rel="noopener noreferrer"&gt;GoNoGo.team&lt;/a&gt; — a platform where AI agents interview founders by voice to validate startup ideas — I thought the hard part would be the AI reasoning. The multi-agent orchestration. The 40+ function-calling tools.&lt;/p&gt;

&lt;p&gt;I was wrong.&lt;/p&gt;

&lt;p&gt;The hard part was echo. Specifically: how do you stop an AI agent from hearing itself talk, freaking out, and interrupting its own sentence?&lt;/p&gt;

&lt;p&gt;After 500+ voice sessions and too many late nights staring at RMS waveforms, here's what I actually learned.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup: Speech-to-Speech, Not STT → LLM → TTS
&lt;/h2&gt;

&lt;p&gt;GoNoGo runs on Gemini 2.5 Flash Live API — a true speech-to-speech pipeline. There's no intermediate transcription step, no text-to-speech synthesis layer bolted on afterward. Audio goes in, audio comes out. Direct.&lt;/p&gt;

&lt;p&gt;This is important because it changes &lt;em&gt;everything&lt;/em&gt; about how you handle audio on the client. You're not working with text buffers. You're working with raw PCM, 16kHz input from the browser mic, 24kHz output from the agent voice. Base64-encoded over WebSocket.&lt;/p&gt;

&lt;p&gt;The browser capture side looks roughly like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ScriptProcessorNode in browser — 512-sample chunks (~32ms each)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;scriptProcessor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;audioContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createScriptProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nx"&gt;scriptProcessor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onaudioprocess&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;inputBuffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;inputBuffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getChannelData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Calculate RMS for VAD&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;rms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nx"&gt;inputBuffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reduce&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;sample&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;inputBuffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// VAD threshold: 0.05 RMS&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;rms&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;VAD_THRESHOLD&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="c1"&gt;// Convert Float32 PCM to Int16&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;int16Buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Int16Array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;inputBuffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;inputBuffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;int16Buffer&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;32768&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32767&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;inputBuffer&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;32768&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// Base64 encode and send over WebSocket&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;base64Audio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;btoa&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fromCharCode&lt;/span&gt;&lt;span class="p"&gt;(...&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Uint8Array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;int16Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;)));&lt;/span&gt;
  &lt;span class="nx"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;audio_chunk&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;base64Audio&lt;/span&gt; &lt;span class="p"&gt;}));&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple enough. Until the AI starts talking.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Echo Problem (And Why Browser AEC Isn't Enough)
&lt;/h2&gt;

&lt;p&gt;Browsers have built-in acoustic echo cancellation. You enable it when you call &lt;code&gt;getUserMedia&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;navigator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;mediaDevices&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getUserMedia&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;echoCancellation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;noiseSuppression&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;autoGainControl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works great for video calls between humans. It was designed for that. But it has a fundamental assumption baked in: the "far end" audio is coming through a &lt;code&gt;&amp;lt;audio&amp;gt;&lt;/code&gt; element or Web Audio API that the browser &lt;em&gt;knows about&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;When you're playing 24kHz PCM chunks from a WebSocket, decoded manually and scheduled through AudioContext buffers? The browser's AEC has no idea that audio exists. It can't cancel what it can't see.&lt;/p&gt;

&lt;p&gt;So your AI agent starts speaking. The microphone picks up the speaker output. The agent hears itself. In the best case, it gets confused and repeats something. In the worst case — and this happened &lt;em&gt;constantly&lt;/em&gt; in early builds — you get a feedback loop where the agent interrupts itself mid-sentence, hears the interruption, tries to respond to it, hears &lt;em&gt;that&lt;/em&gt;, and the whole session collapses.&lt;/p&gt;

&lt;p&gt;I called these 1011 disconnects, because that was the WebSocket close code I kept seeing in logs.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Two-Tier RMS Gate
&lt;/h2&gt;

&lt;p&gt;The fix is a two-tier RMS (Root Mean Square) gate on the audio capture side. The idea is simple: measure the loudness of what the mic is picking up, and if it's probably just the speaker playing back, don't send it.&lt;/p&gt;

&lt;p&gt;But "simple" hides a lot of edge cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 1: Hard suppress during agent speech&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While the agent is actively speaking, I track that state server-side and send it to the client. During this window, incoming audio is suppressed entirely — no chunks sent to Gemini.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;agentSpeaking&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;cooldownTimer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;ReturnType&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;setTimeout&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;COOLDOWN_MS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1500&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;COOLDOWN_THRESHOLD&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.03&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// Higher threshold during cooldown&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;NORMAL_THRESHOLD&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;   &lt;span class="c1"&gt;// Normal VAD threshold&lt;/span&gt;

&lt;span class="c1"&gt;// Called when agent audio stream starts/stops&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;setAgentSpeakingState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;speaking&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;speaking&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;agentSpeaking&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cooldownTimer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nf"&gt;clearTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cooldownTimer&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;agentSpeaking&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="c1"&gt;// Start cooldown period&lt;/span&gt;
    &lt;span class="nx"&gt;cooldownTimer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;cooldownTimer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="nx"&gt;COOLDOWN_MS&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;shouldSendAudioChunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;rms&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;agentSpeaking&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// Hard suppress&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cooldownTimer&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// In cooldown: use higher threshold&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;rms&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;COOLDOWN_THRESHOLD&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;rms&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;NORMAL_THRESHOLD&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Tier 2: The 1.5-second cooldown&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the part that took me longest to figure out. When the agent &lt;em&gt;stops&lt;/em&gt; talking, there's still speaker resonance in the room. The RMS of captured audio doesn't drop to zero immediately — it decays. The background noise in a typical home office sits at 0.01–0.02 RMS. But for 1-2 seconds after playback stops, you're seeing 0.025–0.04 RMS — above the normal VAD threshold.&lt;/p&gt;

&lt;p&gt;The cooldown period uses a higher threshold (0.03 vs 0.05) for 1.5 seconds after agent speech ends. This catches the decay without cutting off a founder who immediately starts talking back.&lt;/p&gt;

&lt;p&gt;Was this threshold tuned empirically? Absolutely. I spent days listening to session replays measuring exactly how fast room resonance decays in different mic setups.&lt;/p&gt;




&lt;h2&gt;
  
  
  Session Resumption: The Other Half of the Problem
&lt;/h2&gt;

&lt;p&gt;Echo cancellation solved the &lt;em&gt;quality&lt;/em&gt; problem. Session resumption solved the &lt;em&gt;reliability&lt;/em&gt; problem.&lt;/p&gt;

&lt;p&gt;Gemini Live sessions drop. Network hiccups, mobile handoffs, Chrome deciding to do something aggressive with memory — connections fail. Early on, a dropped connection meant starting the entire 30-minute interview over. Founders would ragequit. I would understand completely.&lt;/p&gt;

&lt;p&gt;The fix: store session handles in Firestore and resume on reconnect.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# FastAPI backend — session management
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai.live&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AsyncSession&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;firebase_admin&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;firestore&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_or_create_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;AsyncSession&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;firestore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;session_ref&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sessions&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;session_doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session_ref&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;session_doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;session_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session_doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_dict&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;handle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;resumption_handle&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;handle&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="c1"&gt;# Attempt resume — Gemini picks up exactly where it left off
&lt;/span&gt;                &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;resume_gemini_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;handle&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;  &lt;span class="c1"&gt;# resumed=True
&lt;/span&gt;            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;pass&lt;/span&gt;  &lt;span class="c1"&gt;# Fall through to new session
&lt;/span&gt;
    &lt;span class="c1"&gt;# Create new session
&lt;/span&gt;    &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;create_gemini_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;session_ref&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;firestore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SERVER_TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;project_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;project_id&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;  &lt;span class="c1"&gt;# resumed=False
&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;store_resumption_handle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handle&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;firestore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;session_ref&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sessions&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;session_ref&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;resumption_handle&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;handle&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a session resumes, Gemini restores full context — every tool call result, every piece of market research, every persona in the synthetic focus group. The founder reconnects and the agent says "Sorry about that, where were we?" and genuinely knows where you were.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Filler Audio Problem
&lt;/h2&gt;

&lt;p&gt;One more thing nobody talks about: what do you play while the AI is thinking?&lt;/p&gt;

&lt;p&gt;Gemini 2.5 Flash is fast. 300-500ms end-to-end is genuinely fast. But when the agent is executing a tool call — crawling a competitor site with Playwright, running Reddit scraping, calculating unit economics — you can have 3-8 second gaps.&lt;/p&gt;

&lt;p&gt;Silence in a voice conversation feels broken. Users assume the connection dropped.&lt;/p&gt;

&lt;p&gt;Solution: pre-computed filler audio. Short phrases like "one moment please" or "let me look that up" in 17 languages, stored as PCM chunks, played when tool execution exceeds ~800ms. The agent is triggered via text signal (not &lt;code&gt;proactive_audio&lt;/code&gt;, which had a regression that caused double-playback — disabled entirely, use text triggers instead).&lt;/p&gt;

&lt;p&gt;This sounds trivial. It removed about 40% of "the app is broken" support messages.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start with the echo gate, not the AI logic.&lt;/strong&gt; I spent weeks building beautiful multi-agent orchestration before I could demo it reliably. Wrong order.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Instrument RMS values from day one.&lt;/strong&gt; Log them. Every session. You can't tune what you can't see.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Test on bad hardware.&lt;/strong&gt; My dev setup has a good mic with physical distance from speakers. Most users have laptop mics 30cm from laptop speakers. Build for that.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Mobile is a different planet.&lt;/strong&gt; iOS Safari handles AudioContext lifecycle in ways that will make you question your career choices. But that's an article for another day.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;p&gt;After solving these problems — the two-tier RMS gate, the 1.5s cooldown, the session resumption, the filler audio — GoNoGo runs 15-45 minute voice sessions with real founders, across 21 languages, with 3 AI agents handing off to each other mid-conversation. The 1011 disconnects essentially disappeared.&lt;/p&gt;

&lt;p&gt;The voice infrastructure became invisible, which is exactly what it should be.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;If you're building anything with browser mic + real-time AI audio:&lt;/strong&gt; what's been your biggest challenge? I'm genuinely curious whether the echo problem is universal or whether I was doing something particularly wrong early on. Drop it in the comments.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>typescript</category>
      <category>python</category>
    </item>
    <item>
      <title>How I Built a Real-Time Voice AI Interview System with Gemini Live API and WebSockets (and What Almost Broke Me)</title>
      <dc:creator>Konstantin</dc:creator>
      <pubDate>Sat, 04 Apr 2026 15:45:52 +0000</pubDate>
      <link>https://dev.to/remi_etien/how-i-built-a-real-time-voice-ai-interview-system-with-gemini-live-api-and-websockets-and-what-28m2</link>
      <guid>https://dev.to/remi_etien/how-i-built-a-real-time-voice-ai-interview-system-with-gemini-live-api-and-websockets-and-what-28m2</guid>
      <description>&lt;p&gt;When I started building &lt;a href="https://gonogo.team" rel="noopener noreferrer"&gt;GoNoGo.team&lt;/a&gt; -- a platform that uses AI to validate startup ideas through voice interviews -- I thought the hardest part would be the business logic. Turns out, the hardest part was keeping a duplex audio stream alive across three layers of abstraction without everything falling apart.&lt;/p&gt;

&lt;p&gt;This is a technical post-mortem of the voice AI system I built solo. I'll cover the architecture, the ugly edge cases, and the specific patterns that finally made it stable enough to run 500+ validation interviews.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Problem: Bidirectional Audio at Low Latency
&lt;/h2&gt;

&lt;p&gt;The concept: a founder speaks, Gemini listens and responds with follow-up questions, in real time. No STT/TTS pipeline -- direct speech-to-speech using Gemini Live API (native audio). The pipeline looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Browser Mic -&amp;gt; ScriptProcessor (16kHz PCM) -&amp;gt; WebSocket (base64) -&amp;gt; Python FastAPI -&amp;gt; Gemini Live API
                                                                                      v
Browser Speaker &amp;lt;- AudioContext (24kHz PCM) &amp;lt;- WebSocket (base64) &amp;lt;- Audio Chunks &amp;lt;------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every arrow in that diagram is a potential failure point. And in production, every single one of them failed at least once.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Capturing Audio in the Browser
&lt;/h2&gt;

&lt;p&gt;The browser side uses a &lt;strong&gt;ScriptProcessorNode&lt;/strong&gt; (yes, it's deprecated -- but AudioWorklet adds latency we can't afford for real-time conversation). We capture 16kHz mono PCM in 512-sample chunks -- roughly 32ms per chunk.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Audio capture setup (simplified from useAudioInput.ts)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;navigator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;mediaDevices&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getUserMedia&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;echoCancellation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;noiseSuppression&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;autoGainControl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;sampleRate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;16000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;audioContext&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AudioContext&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;sampleRate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;16000&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;audioContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createMediaStreamSource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;audioContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createScriptProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Analyser for RMS-based voice activity detection&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;analyser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;audioContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createAnalyser&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="nx"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;analyser&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;analyser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nx"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onaudioprocess&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;pcmData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;inputBuffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getChannelData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;rms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nx"&gt;pcmData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reduce&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;pcmData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// VAD gate: only send if voice detected (RMS &amp;gt; 0.05)&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;rms&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;readyState&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;WebSocket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;OPEN&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Convert Float32 to Int16 PCM, then base64 encode&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;int16&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Int16Array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pcmData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;pcmData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;int16&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;32768&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32767&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;pcmData&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;32768&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nx"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;audio&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;btoa&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fromCharCode&lt;/span&gt;&lt;span class="p"&gt;(...&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Uint8Array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;int16&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
    &lt;span class="p"&gt;}));&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;strong&gt;32ms chunk interval&lt;/strong&gt; was a hard-won choice. It gives Gemini enough data per packet to process efficiently while keeping perceived latency under 300ms end-to-end. The VAD threshold of 0.05 RMS filters out background noise without clipping soft speech.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: The Python Backend (FastAPI + WebSockets)
&lt;/h2&gt;

&lt;p&gt;The backend is &lt;strong&gt;Python FastAPI&lt;/strong&gt;, deployed on &lt;strong&gt;Google Cloud Run&lt;/strong&gt;. Python was the right call because Gemini's client libraries are Python-first, and the entire analysis pipeline (market research, competitor scraping with Playwright, report generation) lives in the same codebase.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# WebSocket handler (simplified from server.py)
&lt;/span&gt;&lt;span class="nd"&gt;@app.websocket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/ws_live&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;websocket_live&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;WebSocket&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;accept&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GeminiLiveSession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash-exp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward_to_gemini&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="c1"&gt;# Browser audio -&amp;gt; Gemini
&lt;/span&gt;        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;iter_json&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;pcm_bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;b64decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
                &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_audio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pcm_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 16kHz PCM
&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward_to_browser&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="c1"&gt;# Gemini audio -&amp;gt; Browser
&lt;/span&gt;        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;events&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="c1"&gt;# Gemini returns 24kHz PCM
&lt;/span&gt;                &lt;span class="n"&gt;chunk_b64&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;b64encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;chunk_b64&lt;/span&gt;
                &lt;span class="p"&gt;})&lt;/span&gt;
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_call&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_tool_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;forward_to_gemini&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nf"&gt;forward_to_browser&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The asymmetric sample rates (16kHz in, 24kHz out) aren't a mistake -- Gemini natively outputs at 24kHz, and downsampling would lose audio quality. The browser's AudioContext handles the sample rate mismatch transparently.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Echo Cancellation Problem
&lt;/h3&gt;

&lt;p&gt;Gemini hears its own output through the user's speakers and tries to respond to itself. Browser-level echo cancellation (&lt;code&gt;echoCancellation: true&lt;/code&gt;) handles most cases, but not all -- especially on laptops with poor speaker-mic isolation.&lt;/p&gt;

&lt;p&gt;My solution: a &lt;strong&gt;speaking-state gate&lt;/strong&gt;. When Gemini is outputting audio, we suppress inbound audio at the application level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Echo gate in the session handler
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SessionState&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agent_speaking&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_agent_audio_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;should_forward_audio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rms&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Suppress during agent speech + 1.5s cooldown after
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agent_speaking&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_agent_audio_time&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;rms&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.03&lt;/span&gt;  &lt;span class="c1"&gt;# Higher threshold during cooldown
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;rms&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt;  &lt;span class="c1"&gt;# Normal threshold
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This two-tier threshold was the key insight: background noise sits at RMS 0.01-0.02, so during the cooldown period after the agent stops speaking, we only forward audio that's clearly human speech (&amp;gt; 0.03).&lt;/p&gt;




&lt;h3&gt;
  
  
  Failure Mode: The 1011 Disconnects
&lt;/h3&gt;

&lt;p&gt;For weeks, Gemini Live API would randomly close connections with status code 1011 (Internal Server Error). No pattern, no warning. Sessions would die mid-sentence.&lt;/p&gt;

&lt;p&gt;The fix was layered:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Reconnection with session resumption
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_disconnect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Gemini supports session resumption via handle
&lt;/span&gt;            &lt;span class="n"&gt;new_session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;GeminiLiveSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resume&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resumption_handle&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="c1"&gt;# Re-send last audio chunk as context
&lt;/span&gt;            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;new_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_audio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;new_session&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="c1"&gt;# After 3 fails, notify user with audio message
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reconnecting&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Session resumption handles (persisted to &lt;strong&gt;Firestore&lt;/strong&gt;) were a game-changer. Instead of starting a new conversation, Gemini picks up exactly where it left off. Users barely notice the blip.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 3: Playing Audio Back in the Browser
&lt;/h2&gt;

&lt;p&gt;Gemini returns 24kHz PCM chunks. Playing them without glitches requires the Web Audio API with a buffer scheduler:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Audio playback (simplified from useAudioOutput.ts)&lt;/span&gt;
&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AudioPlayer&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;AudioContext&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="nx"&gt;nextStartTime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="nx"&gt;gainNode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;GainNode&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="nf"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AudioContext&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;sampleRate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;24000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;latencyHint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sr"&gt;/Mobi/&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;navigator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;userAgent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;playback&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;interactive&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;gainNode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createGain&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;gainNode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;destination&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nf"&gt;playChunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pcmBase64&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Uint8Array&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;atob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pcmBase64&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;charCodeAt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;int16&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Int16Array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;float32&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Float32Array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;int16&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;int16&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;int16&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;32768&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createBuffer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;24000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getChannelData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createBufferSource&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="nx"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;gainNode&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;startTime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;currentTime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;nextStartTime&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;startTime&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;nextStartTime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;startTime&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;nextStartTime&lt;/code&gt; scheduler ensures seamless playback regardless of network jitter. The &lt;code&gt;latencyHint&lt;/code&gt; switch between mobile ("playback") and desktop ("interactive") was a subtle but important optimization -- mobile browsers handle audio buffers differently.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned Building This Solo
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Build the unhappy path first.&lt;/strong&gt; I spent week one on the happy path. Weeks two through four were entirely edge cases -- reconnection, echo suppression, barge-in handling. If I could redo it, I'd build error recovery before a single feature.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Voice is a different UX paradigm.&lt;/strong&gt; Users don't read error messages mid-conversation. Every failure needs an audio fallback. We pre-compute "filler" audio chunks ("one moment please...") as 24kHz PCM, ready to stream instantly when Gemini is slow or reconnecting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Speech-to-speech beats STT+TTS.&lt;/strong&gt; We initially considered a Whisper -&amp;gt; Claude -&amp;gt; ElevenLabs pipeline. Gemini Live API's native audio mode is faster (sub-500ms round-trip), cheaper, and handles interruptions naturally. The trade-off: less control over the voice, but the latency gain is massive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Cloud Run works for WebSockets, with caveats.&lt;/strong&gt; We deploy on &lt;strong&gt;Google Cloud Run&lt;/strong&gt; (me-west1 region). WebSocket connections survive container restarts thanks to session resumption handles saved in Firestore. The key setting: request timeout of 3600s (1 hour) for long interview sessions.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;p&gt;The system now runs validation interviews averaging 12 minutes of continuous voice conversation. Across 500+ sessions, hard failures dropped to under 1% after implementing the echo gate and session resumption. Each interview includes 3 AI agents (Alex for discovery, Sam for architecture, Maya for design) that use ~12 function-calling tools to research markets, analyze competitors, and generate reports -- all while maintaining a natural conversation.&lt;/p&gt;

&lt;p&gt;Building this solo meant every failure landed directly in my Telegram inbox (via a monitoring bot). Which, honestly, is the fastest feedback loop possible.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'm Curious About
&lt;/h2&gt;

&lt;p&gt;The biggest remaining challenge is &lt;strong&gt;audio quality on mobile browsers&lt;/strong&gt;. iOS Safari handles AudioContext differently from Chrome, and some Android devices have aggressive echo cancellation that clips the AI's speech. We're currently using device-specific settings, but it feels like a hack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Has anyone found a robust cross-browser audio playback strategy for real-time AI voice?&lt;/strong&gt; Especially interested in experiences with AudioWorklet vs ScriptProcessorNode for this use case. Drop your thoughts in the comments.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>startup</category>
      <category>python</category>
    </item>
  </channel>
</rss>
