"Four product pivots in four days. Every one caused by something Gemini did that I didn't expect. Here's the commit-by-commit story of building a real-time voice app with the Gemini Live API."
I wrote this post as my entry to this hackathon.
Past, Live is a voice app where students call historical figures. You type "Fall of Constantinople," tap Call, and Constantine XI picks up. He tells you his city is falling and asks what you'd do. You talk back. When you hang up, you get a call receipt with what actually happened.
That's the pitch. Here's what the git log actually looks like.
How I build things
I'm a systems architect. I design the architecture, write specs, make product decisions, and direct AI agents to implement. I don't write code line by line. I see the system as a whole, make the calls on what to build and how, and iterate on what comes back.
This works well when I understand what I'm building. With the Gemini Live API, I had to learn it by breaking it. Four times.
Day 1: A quiz app
6b6ee03 feat(past-live): scaffold app + War Room Dispatch UI
The original concept was a quiz. Historical characters quiz the student. You're cast as Constantine's last advisor. The city is under siege. "What do you do?" You answer, the character reacts, you learn history through decision-making.
The app scaffolded in about an hour. Astro 5, Svelte 5, dark terminal aesthetic. Three screens stubbed with mock data. Backend was a Hono TypeScript server with WebSocket relay to Gemini Live. The relay is simple: browser sends PCM 16kHz audio, relay forwards to Gemini Live, Gemini sends PCM 24kHz back.
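The relay's two directions can be sketched as small helpers (names are mine, not from the repo), which keeps the routing logic testable without a live connection:

```typescript
// A minimal sketch of the relay's routing, with transports stubbed out.
// Helper names are illustrative, not from the actual codebase.
type AudioSink = (base64Pcm: string) => void;

// Browser -> Gemini: wrap a 16 kHz mic chunk in the shape the SDK's
// sendRealtimeInput() expects.
function micChunkToRealtimeInput(base64Pcm: string) {
  return { audio: { data: base64Pcm, mimeType: 'audio/pcm;rate=16000' } };
}

// Gemini -> browser: pull 24 kHz audio out of a server message, if any.
function extractAudio(msg: any): string | undefined {
  return msg?.serverContent?.modelTurn?.parts?.[0]?.inlineData?.data;
}

// Forward model audio to the browser socket when a chunk is present.
function routeServerMessage(msg: any, toBrowser: AudioSink): void {
  const audio = extractAudio(msg);
  if (audio) toBrowser(audio);
}
```

In the real relay, the first shape is handed to `session.sendRealtimeInput()` and the second function runs inside the `onmessage` callback from `ai.live.connect()`.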
625427b feat(past-live): deploy to past-live.ngoquochuy.com + Cloud Run backend
Deployed. Frontend on Cloudflare Workers, backend on Cloud Run. First real session connected.
The architecture worked. The quiz didn't.
"shit. that's a new app"
c8af139 docs(past-live): "shit. that's a new app" — full pivot to Call the Past
I ran it with six personas, ages 13 to 42. The advisor framing — "You ARE Constantine's advisor, what do you do?" — caused performance anxiety across the board. Jun, 17, wouldn't roleplay at all: the shame of performing. Tomás, 16, ADHD, froze at "what do you do?" He didn't know the stakes. He couldn't answer a question about the very thing he was supposed to be learning.
David's feedback hit the hardest: "The people whom we are targeting are the people who do not know shit about history. That's where they're learning about it. But the language is really hard to understand. I had to ask the model three times before I could understand what's going on."
And then the reframe: "You're calling somebody back in time, asking them about everything. You're not just the one making decisions. You call the people and they're gonna tell you. You're gonna feel it too."
Same relay. Same WebSocket protocol. Same audio pipeline. Same tool declarations. Zero architecture changes. Just a prompt rewrite and a copy pass.
The quiz became a phone call.
Three Gemini models start collaborating
This is the part I'm happiest with. Three Gemini models, each doing what it's good at, handing off to each other per session.
Browser (Astro/Svelte)              Cloud Run (Hono relay)
┌─────────────────────┐             ┌──────────────────────┐
│ Mic → PCM 16kHz     │             │ @google/genai        │
│ Speaker ← PCM 24kHz │──WebSocket──│ ai.live.connect()    │
│ Transcript, images  │             │ Tool call handler    │
│ Text input          │◄─WebSocket──│ Flash + Image calls  │
└─────────────────────┘             └──────────┬───────────┘
                                               │
                                    Gemini Live (voice)
                                    Gemini Flash (JSON, summaries)
                                    Gemini Image (portraits, scenes)
                                    Firestore (student profiles)
gemini-2.5-flash-native-audio-preview-12-2025 handles the live conversation. This is native audio, not text-to-speech. The model generates speech directly, so the voice carries actual emotion and timing. Combined with enableAffectiveDialog (requires v1alpha API version), the character modulates tone throughout the session.
import { GoogleGenAI } from '@google/genai';

const ai = new GoogleGenAI({
  apiKey: process.env.GEMINI_API_KEY,
  httpOptions: { apiVersion: 'v1alpha' },
});
gemini-3-flash-preview handles structured output. Before the call: character profile, story material, OKLCH color palette, voice selection from 30 voices. After the call: transcript analysis, key facts, historical outcome, farewell message. Flash is fast and doesn't need to be emotional.
gemini-3.1-flash-image-preview generates character portraits and era-specific scene banners. When the character calls show_scene mid-conversation, Image generates a scene in the background. It appears behind the transcript while the voice keeps going.
Five API calls per session. Flash prepares everything, Live performs, Image paints the scenes, Flash summarizes at the end.
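For illustration, one of those Flash calls could look like this with @google/genai's JSON mode; the schema fields are my guesses at what a profile needs, not the app's actual schema (which a later commit says uses Zod):

```typescript
// Sketch: the pre-call character profile as a structured-output request.
// Field names here are illustrative, not from the repo.
const profileRequest = {
  model: 'gemini-3-flash-preview',
  contents: 'Build a character profile for: Fall of Constantinople, 1453',
  config: {
    responseMimeType: 'application/json',
    responseSchema: {
      type: 'OBJECT',
      properties: {
        voice: { type: 'STRING' },                             // one of the 30 prebuilt voices
        palette: { type: 'ARRAY', items: { type: 'STRING' } }, // OKLCH colors for the UI
        summary: { type: 'STRING' },
      },
      required: ['voice', 'palette', 'summary'],
    },
  },
};

// In the relay this would run as:
// const res = await ai.models.generateContent(profileRequest);
// const profile = JSON.parse(res.text ?? '{}');
```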
"shit."
bc0a835 feat(past-live): "shit." — pivot v3: calm funny storytellers, not panicking heroes
The phone call concept worked. But the characters were IN their crisis. Constantine was panicking about the walls. Gene Kranz was sweating fuel.
enableAffectiveDialog made the panic too convincing. A 5-minute call with someone stressed and urgent isn't educational. It's draining. The feature that makes the app work — emotional voice — broke the tone.
So the characters stopped panicking. They survived. They're sitting somewhere comfortable, looking back, and they find the whole thing kind of absurd.
"Constantine XI isn't panicking about the walls — he's like 'Yeah, the walls fell. Wild story. Let me tell you about it.'"
Humor style: not jokes. The gap between how insane the situation was and how casually you describe it. "They dragged 72 ships over a mountain. Over. A. Mountain."
The first crash and the googleSearch lesson
bd5f9ed fix(relay): kill googleSearch tool -- crashes Gemini Live session
I added googleSearch as a tool because obviously a historical character should be able to look things up. First session: crash. closeCode=1011, reason=Internal error occurred. Second session: crash. Third session: crash.
GitHub issue #843, 43+ reactions, open since May 2025. Tool calling combined with native audio is unstable. Removed googleSearch entirely.
The rule I landed on: minimal tools, one per turn, all NON_BLOCKING. I kept three: announce_choice (tappable decision cards), end_session (character wraps up), show_scene (mid-call image generation). Non-blocking means the character keeps talking while the tool executes. If end_session were blocking, the character would go silent mid-farewell. Phone call illusion gone.
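As a Live API config fragment, the surviving three tools might be declared like this; the parameter schemas are illustrative, and `behavior: 'NON_BLOCKING'` is the part that matters:

```typescript
// Sketch of the three tool declarations described above. Parameter
// shapes are my assumptions, not the actual repo's schemas.
const tools = [{
  functionDeclarations: [
    {
      name: 'announce_choice',
      description: 'Show the student tappable decision cards.',
      behavior: 'NON_BLOCKING', // cards render while the character keeps talking
      parameters: {
        type: 'OBJECT',
        properties: { choices: { type: 'ARRAY', items: { type: 'STRING' } } },
      },
    },
    {
      name: 'show_scene',
      description: 'Generate an era-specific scene image in the background.',
      behavior: 'NON_BLOCKING', // image generation runs alongside the voice
      parameters: {
        type: 'OBJECT',
        properties: { scene: { type: 'STRING' } },
      },
    },
    {
      name: 'end_session',
      description: 'Wrap up the call and say farewell.',
      behavior: 'NON_BLOCKING', // blocking here would mute the farewell
    },
  ],
}];
```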
18fc878 feat(relay): GoAway signal handling + transparent session resumption
Gemini Live sends GoAway when it needs to reset the connection. Without sessionResumption, that's a hard disconnect mid-conversation. With it, the relay reconnects transparently using the session handle.
One gotcha: community examples show { handle, transparent: true }. That transparent field doesn't exist. Passing it crashes the connection. The correct form is { handle?: string } only.
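A minimal sketch of the handle bookkeeping, with the actual reconnect plumbing omitted; the message field names follow the Live API's server messages:

```typescript
// On each sessionResumptionUpdate the relay stores the latest handle;
// on GoAway it reconnects with { sessionResumption: { handle } } only.
// No `transparent` field, despite what some community examples show.
let resumeHandle: string | undefined;

function onServerMessage(msg: any): 'reconnect' | undefined {
  if (msg.sessionResumptionUpdate?.resumable) {
    resumeHandle = msg.sessionResumptionUpdate.newHandle;
  }
  // Reconnect before the server actually drops the connection.
  if (msg.goAway) return 'reconnect';
}

function resumptionConfig() {
  // Correct shape: { handle?: string } and nothing else.
  return resumeHandle
    ? { sessionResumption: { handle: resumeHandle } }
    : { sessionResumption: {} };
}
```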
68f5f9d feat(past-live): Zod schemas, WS retry, chunk logging, truncation detection
Research day. GitHub issues, Google forums, other hackathon submissions. The mitigations:
- connectWithRetry() for transient WebSocket 1008 errors (#1236). Three attempts, exponential backoff. Auth errors skip retry.
- Truncation detection for #2117 (40+ developers, 8 months open). If the gap between the last audio output and turnComplete is under 500ms, I log possible_truncation. Can't fix it server-side, but can detect it.
- Bounded audio output queue with drop-oldest backpressure. Every production Gemini Live system needs this.
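A sketch of what connectWithRetry() might look like under those rules; the delay values and error classification are illustrative:

```typescript
// Retry wrapper for a flaky connect call (in the real relay, ai.live.connect()).
// Transient errors get exponential backoff; auth errors fail fast.
async function connectWithRetry<T>(
  connect: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 250,
): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await connect();
    } catch (err) {
      lastErr = err;
      const msg = err instanceof Error ? err.message : String(err);
      // Auth failures won't heal on retry; bail out immediately.
      if (/api key|unauthorized|403/i.test(msg)) throw err;
      // Exponential backoff for transient errors (e.g. WS close code 1008).
      if (i < attempts - 1) {
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastErr;
}
```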
Four script versions fail
This is where I wasted the most time. The model playing Bolívar had nothing to work with except characterName: "SIMÓN BOLÍVAR" and historicalSetting: "Simón Bolívar, 1822". It started talking about "dragging ships over mountains" because the show_scene tool had an example about Ottoman ships at Constantinople. It grabbed whatever it could find.
So I tried giving it a script. Four versions.
V1: Exact dialogue
Beat 1: "So imagine this. You spend twelve years -- twelve -- fighting
the biggest empire in the world."
Student interrupted. Model restarted the same beat word for word. Three times. Then apologized: "My apologies. I got caught up in the drama."
V2: Hints instead of lines
CONVEY: 12-year war, barefoot soldiers, mountain ranges, Spain's empire
Then ASK: "Can you imagine that?"
40-second monologue. The model read the hint list as a checklist and delivered everything in sequence. I thought "CONVEY" meant "weave these in over time." The model read it as "say all of these right now."
V3: Minimal hint + explicit stop rule
One hook, one sentence, one question. "Then STOP TALKING and wait."
Mechanical beat-jumping. Said its sentence, waited, got a response, gave a generic acknowledgment, moved to next beat. If the student said something interesting, the model ignored it.
V4: Just give destinations
"By the end of this act, the student should understand that Bolívar's coalition is fracturing." No dialogue hints. Just where to end up.
The model couldn't project personality. It knew where to go but had nothing specific to work with. Pleasant, vague, educational filler.
The breakthrough
052040a feat(past-live): bag-of-material prompt architecture (Cleopatra test)
The commit message says it: "Tonight's testing proved scripted acts don't work for voice."
I had written nine dream conversation transcripts. Every good moment came from a specific, weird historical fact delivered casually:
"I looked like a coin. Which honestly, for a queen, was more useful."
"A carpet would have been ridiculous."
"They dragged 72 ships over a mountain. Over. A. Mountain."
These aren't things you put in a sequence. They're material the character grabs when the conversation makes them relevant. Flash generates a bag of material. Live pulls from it.
- Hooks (myth/truth combos)
- Verified facts
- Anchors (universal human experience the student can relate to)
- Choices (2-3 decisions with pre-mapped consequences)
- Scene descriptions (for show_scene)
- A closing line
const storyScript = await generateStoryScript(characterName, historicalSetting);
// System prompt gets the full bag.
// Live pulls based on where the conversation goes.
Live is the performer. Flash is the researcher. Flash verifies the facts and packs them. Live finds the right one for this moment.
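Here is a rough TypeScript shape for the bag, reconstructed from the list above; the real app validates it with Zod, and the exact fields may differ. The sample entries are drawn from the Cleopatra sessions described in this post:

```typescript
// Illustrative shape of the "bag of material" Flash generates per session.
interface StoryScript {
  hooks: { myth: string; truth: string }[];           // myth/truth combos
  facts: string[];                                    // verified by Flash
  anchors: string[];                                  // universal human experiences
  choices: { label: string; consequence: string }[];  // pre-mapped outcomes
  scenes: string[];                                   // descriptions for show_scene
  closingLine: string;
}

// The whole bag goes into the system prompt; Live decides at runtime
// which item fits the moment, instead of following a beat sequence.
const cleopatraBag: StoryScript = {
  hooks: [{
    myth: 'Legendary beauty',
    truth: 'Her coins show an ordinary face; nine languages did the real work',
  }],
  facts: [
    'Spoke nine languages',
    'Was smuggled in a linen sack, not rolled in a carpet',
  ],
  anchors: ['Being underestimated because of how you look'],
  choices: [{ label: 'Flee or negotiate?', consequence: 'She chose to negotiate, face to face' }],
  scenes: ['Alexandria harbor at dusk, the lighthouse burning'],
  closingLine: 'A practical observation about the student, by name.',
};
```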
d38b208 fix(past-live): prevent hook repetition in re-anchor injection
This commit was the proof. Cleopatra asked the student's name ("And you are?"), used it throughout ("Simon the young"), pulled hooks based on the student's actual words (said "pretty" then coin myth-bust, said "Spanish" then nine languages, said "fun" then linen sack), and closed with a personal observation ("A practical young man, Simon"). No scripted acts. No mechanical beat-jumping. The character was pulling from her bag based on where the conversation went.
I told Gemini to be funny
The bag-of-material worked, but the character sounded like a British diplomat. "One must adapt to the prevailing currents." Technically correct. Zero personality.
17982a9 fix(past-live): humor directive -- be funny, not serious
First attempt: "Never tell jokes. The facts ARE the comedy. The gap between how insane your life was and how calmly you describe it is what makes it funny."
The model played it safe. Serious. Formal. It interpreted "never tell jokes" as "never be funny."
So I changed it to "be FUNNY, tease, joke, be playful."
And the model started reading my system prompt out loud. As dialogue. Word for word. The hooks, the facts, the closing thread. Not improvising from the material. Just reciting what I wrote, in character voice, as if those were its lines.
That's the title.
79c5b30 fix(past-live): allow jokes + invented personal details
A student tried to trade jokes back and forth. The model responded: "Your ability to find humor is quite charming." It complimented the humor instead of being funny.
The fix: "You CAN make jokes, tease, invent funny personal details. Historical events stay locked. Everything else is fair game."
5a4c2fb docs(past-live): update session record with V5-V7 test results
Best session yet. Cleopatra opened with "a distant relative? How wonderfully convenient after two millennia. Are we discussing inheritances?" Then later: "Textbooks! They do tend to drain the life right out of things." And: "A carpet simply wouldn't do; far too cumbersome."
The formula: "be FUNNY" + "reactions/humor FREE, facts LOCKED" + bag-of-material.
Audio pipeline and VAD tuning
4d23de5 feat(past-live): audio pipeline overhaul -- smaller chunks, echo gate, cursor playback, VAD tuning
Voice Activity Detection controls when the model thinks you've stopped talking. The defaults were wrong for a phone call. The character kept interrupting students mid-sentence.
automaticActivityDetection: {
  startOfSpeechSensitivity: 'START_SENSITIVITY_LOW',
  endOfSpeechSensitivity: 'END_SENSITIVITY_HIGH',
  prefixPaddingMs: 20,
  silenceDurationMs: 500,
}
START_SENSITIVITY_LOW: slow to decide you've started talking. Rejects background noise. END_SENSITIVITY_HIGH: quick to decide you've stopped. 500ms pause = your turn. This combination produces phone-call pacing. I spent a full session just tuning these four numbers.
Also: mic chunk size from 4096 to 1024 samples (256ms to 64ms), echo gate to prevent speaker bleed, cursor-based playback scheduling to eliminate micro-gaps.
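The millisecond figures fall out of simple sample arithmetic (mono PCM at 16 kHz):

```typescript
// Duration of one audio chunk: samples divided by sample rate, in ms.
function chunkDurationMs(samples: number, sampleRateHz: number): number {
  return (samples / sampleRateHz) * 1000;
}

const legacy = chunkDurationMs(4096, 16000);  // 256 ms, the original size
const current = chunkDurationMs(1024, 16000); // 64 ms, after this commit
const later = chunkDurationMs(512, 16000);    // 32 ms, a later reduction into the 20-40 ms range
```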
9480507 fix(past-live): re-anchor via sendClientContent -- stops audio cutoffs
Found a subtle distinction. sendRealtimeInput({ text }) triggers VAD. Gemini treats text input as "user activity" and fires interrupted, which clears the audio queue mid-sentence. Every audio cutoff in testing happened 1-2 seconds after a re-anchor injection. 100% correlation across 2 sessions.
Fix: switch re-anchor from sendRealtimeInput to sendClientContent with turnComplete: false. Injects context into history without triggering VAD.
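A sketch of the fixed call, assuming a session object from ai.live.connect(); the reminder prefix is my own convention, not the app's:

```typescript
// Re-anchor the character without tripping VAD. sendClientContent writes
// into conversation history without counting as user activity, so the
// model never fires `interrupted` and the audio queue survives.
function reAnchor(
  session: { sendClientContent: (args: any) => void },
  reminder: string,
) {
  session.sendClientContent({
    turns: [{ role: 'user', parts: [{ text: `[CONTEXT] ${reminder}` }] }],
    turnComplete: false, // the model's current turn keeps going
  });
}
```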
3b310ff fix(past-live): correct VAD config + force tool calling + audio chunks
The VAD settings were inverted in my code: START_HIGH instead of START_LOW. The character had been interrupting students because of a config bug, not because VAD is bad. I also reduced audio chunks from 1024 to 512 samples (64ms to 32ms), inside Google's recommended 20-40ms range.
Context compression and 10-minute sessions
Audio burns context fast: about 32 tokens per second in each direction. A 10-minute call is roughly 38,400 tokens from audio alone. Without compression, quality degraded around 5-6 minutes.
contextWindowCompression: { slidingWindow: {} } lets the session run longer by compressing older context. With explicit triggerTokens: 10000, I got consistent 10-minute sessions where the character still referenced things from the beginning of the call at the end.
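As a config fragment (field types can vary by SDK version; triggerTokens is a string in some releases), the setup described above looks roughly like this:

```typescript
// Audio costs ~32 tokens/s per direction, so a 10-minute call is
// 600 s * 32 tok/s * 2 directions = 38,400 tokens from audio alone.
const sessionConfig = {
  contextWindowCompression: {
    triggerTokens: 10000, // start compressing well before audio fills the window
    slidingWindow: {},    // compress the oldest turns first
  },
};
```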
I set a hard 10-minute cap because conversation quality past that wasn't worth the cost.
Testing, and the hardest feedback
63b7c3d feat(past-live): extract presets with hand-written storyScripts
Wrote full story scripts for all three presets. Constantinople gets Aubrey Plaza's delivery style, 4 hooks, 8 verified facts, 1 derivable choice, 2 scenes. Moon Landing gets Tom Hanks. Mongol Empire gets Dave Chappelle. Each has a celebrity personality anchor so the model has a voice register to hit.
Then I ran user testing. Six student personas, ages 13-17 plus a parent.
Five out of five students tried calling a custom topic. Every single one got "Retry generating session preview." The endpoint was broken for non-preset topics.
When sessions did connect, Bolívar was delivering Cleopatra's material. The story script wasn't wired correctly, so the wrong character's bag was loading. From the test report, Maya, 15:
"Bolivar started talking about Elizabeth Taylor and then the app crashed, they would laugh AT the app, not WITH it."
The expert panel: "This is the most sophisticated codebase the panel reviewed. With backend online and conversation stability, it would win. Right now, it's the smartest submission you can't actually use."
Both right.
System prompt architecture
59baa28 refactor(past-live): reorder system prompt per Google best practices
Google's guidance says prompt order matters for voice models: persona first, rules second, guardrails last. If you put guardrails first, the character sounds guarded.
One finding that held up: "unmistakably" outperforms "MUST" and "NEVER" for voice models. "You are unmistakably Cleopatra, irreverent, strategic, mildly amused by everything" produced more consistent character than "You MUST stay in character as Cleopatra at all times."
I don't fully understand why. But it worked across every test session.
The Google Cloud stack
The backend runs on Cloud Run, deployed with gcloud run deploy --source . Cloud Build handles the Docker image. Student profiles and session history live in Firestore (project past-live-490122, EU eur3). The Gemini API key is stored in Secret Manager and mounted into Cloud Run, not kept in plain environment variables.
Cloud Run was the right choice for a WebSocket relay because it supports long-lived connections. Firestore was the right choice for student profiles because it's zero-config in the same GCP project.
Cost per session
About $0.25 per 5-minute call.
Voice (Gemini Live, 5 min audio I/O): ~$0.04. Cheap.
Three images (portrait + two scenes): ~$0.20. 81% of cost.
Text (Flash preview + summary): ~$0.005.
Images are the expensive part, not the voice. On free tier, image generation takes 12-15 seconds because of GPU queue throttling. I ran a 5x5 benchmark: variance between prompt styles was under 1 second, variance between runs was 3+ seconds. The bottleneck is queue position, not what you're asking for. Paid tier drops it to 2-3 seconds.
What I'd build first if I started over
Test the conversation on day one. I built the relay, the WebSocket protocol, the audio queue, the tool declarations, and tested the character conversation late. The bag-of-material insight came from testing. If I'd run voice sessions against bare prompts on day one, I'd have found the right architecture days earlier.
Start with zero tools. Prove the conversation works bare, add tools one at a time. I lost a full day to googleSearch crashing sessions before I even knew the conversation itself was broken.
Fix audio chunk size on day one. 256ms vs the recommended 20-40ms affects everything — latency, VAD accuracy, conversation feel. And double-check your VAD config isn't inverted. Mine was.
When it works
Constantine XI picks up. Says under ten words. Waits.
You're not reading about Constantinople. You're on the phone with someone watching it fall, and because of native audio and affective dialog, you can hear it in his voice. He asks what you'd do. You answer. He tells you what he actually did. The post-call receipt puts your choice next to the historical record. Constantine XI stayed when he could have fled. He died defending the walls.
That moment lands differently when you just had a conversation about it.
The bag-of-material architecture is right. The three-model split is right. The crashes are fixable. The conversation quality improved with every iteration. That's the version I'm building toward.
- Submission: Gemini Live Agent Challenge
- GitHub: Past, Live repository
- Live: past-live.ngoquochuy.com