"Four product pivots in four days. Every one caused by something Gemini did that I didn't expect. Here's the commit-by-commit story of building a real-time voice app with the Gemini Live API."
I wrote this post as my entry to this hackathon.
Past, Live is a voice app where students call historical figures. You type "Fall of Constantinople," tap Call, and Constantine XI picks up. He tells you his city is falling and asks what you'd do. You talk back. When you hang up, you get a call receipt with what actually happened.
That's the pitch. Here's what the git log actually looks like.
How I build things
I'm a systems architect. I design the architecture, write specs, make product decisions, and direct AI agents to implement. I don't write code line by line. I see the system as a whole, make the calls on what to build and how, and iterate on what comes back.
This works well when I understand what I'm building. With the Gemini Live API, I had to learn it by breaking it. Four times.
Day 1: A quiz app
6b6ee03 feat(past-live): scaffold app + War Room Dispatch UI
The original concept was a quiz. Historical characters quiz the student. You're cast as Constantine's last advisor. The city is under siege. "What do you do?" You answer, the character reacts, you learn history through decision-making.
The app scaffolded in about an hour. Astro 5, Svelte 5, dark terminal aesthetic. Three screens stubbed with mock data. Backend was a Hono TypeScript server with WebSocket relay to Gemini Live. The relay is simple: browser sends PCM 16kHz audio, relay forwards to Gemini Live, Gemini sends PCM 24kHz back.
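The relay's two directions can be sketched as small helpers (names are mine, not from the repo), which keeps the routing logic testable without a live connection:

```typescript
// A minimal sketch of the relay's routing, with transports stubbed out.
// Helper names are illustrative, not from the actual codebase.
type AudioSink = (base64Pcm: string) => void;

// Browser -> Gemini: wrap a 16 kHz mic chunk in the shape the SDK's
// sendRealtimeInput() expects.
function micChunkToRealtimeInput(base64Pcm: string) {
  return { audio: { data: base64Pcm, mimeType: 'audio/pcm;rate=16000' } };
}

// Gemini -> browser: pull 24 kHz audio out of a server message, if any.
function extractAudio(msg: any): string | undefined {
  return msg?.serverContent?.modelTurn?.parts?.[0]?.inlineData?.data;
}

// Forward model audio to the browser socket when a chunk is present.
function routeServerMessage(msg: any, toBrowser: AudioSink): void {
  const audio = extractAudio(msg);
  if (audio) toBrowser(audio);
}
```

In the real relay, the first shape is handed to `session.sendRealtimeInput()` and the second function runs inside the `onmessage` callback from `ai.live.connect()`.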
625427b feat(past-live): deploy to past-live.ngoquochuy.com + Cloud Run backend
Deployed. Frontend on Cloudflare Workers, backend on Cloud Run. First real session connected.
The architecture worked. The quiz didn't.
"shit. that's a new app"
c8af139 docs(past-live): "shit. that's a new app" — full pivot to Call the Past
I ran it with six personas, ages 13 to 42. The advisor framing — "You ARE Constantine's advisor, what do you do?" — caused performance anxiety across the board. Jun, 17, wouldn't roleplay at all: the shame of performing. Tomás, 16, ADHD, froze at "what do you do?" He didn't know the stakes. He couldn't answer a question about the very thing he was supposed to be learning.
David's feedback hit the hardest: "The people whom we are targeting are the people who do not know shit about history. That's where they're learning about it. But the language is really hard to understand. I had to ask the model three times before I could understand what's going on."
And then the reframe: "You're calling somebody back in time, asking them about everything. You're not just the one making decisions. You call the people and they're gonna tell you. You're gonna feel it too."
Same relay. Same WebSocket protocol. Same audio pipeline. Same tool declarations. Zero architecture changes. Just a prompt rewrite and a copy pass.
The quiz became a phone call.
Three Gemini models start collaborating
This is the part I'm happiest with. Three Gemini models, each doing what it's good at, handing off to each other per session.
Browser (Astro/Svelte)              Cloud Run (Hono relay)
┌─────────────────────┐             ┌──────────────────────┐
│ Mic → PCM 16kHz     │             │ @google/genai        │
│ Speaker ← PCM 24kHz │──WebSocket──│ ai.live.connect()    │
│ Transcript, images  │             │ Tool call handler    │
│ Text input          │◄─WebSocket──│ Flash + Image calls  │
└─────────────────────┘             └──────────┬───────────┘
                                               │
                                    Gemini Live (voice)
                                    Gemini Flash (JSON, summaries)
                                    Gemini Image (portraits, scenes)
                                    Firestore (student profiles)
gemini-2.5-flash-native-audio-preview-12-2025 handles the live conversation. This is native audio, not text-to-speech. The model generates speech directly, so the voice carries actual emotion and timing. Combined with enableAffectiveDialog (requires v1alpha API version), the character modulates tone throughout the session.
import { GoogleGenAI } from '@google/genai';

const ai = new GoogleGenAI({
  apiKey: process.env.GEMINI_API_KEY,
  httpOptions: { apiVersion: 'v1alpha' },
});
gemini-3-flash-preview handles structured output. Before the call: character profile, story material, OKLCH color palette, voice selection from 30 voices. After the call: transcript analysis, key facts, historical outcome, farewell message. Flash is fast and doesn't need to be emotional.
gemini-3.1-flash-image-preview generates character portraits and era-specific scene banners. When the character calls show_scene mid-conversation, Image generates a scene in the background. It appears behind the transcript while the voice keeps going.
Five API calls per session. Flash prepares everything, Live performs, Image paints the scenes, Flash summarizes at the end.
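For illustration, one of those Flash calls could look like this with @google/genai's JSON mode; the schema fields are my guesses at what a profile needs, not the app's actual schema (which a later commit says uses Zod):

```typescript
// Sketch: the pre-call character profile as a structured-output request.
// Field names here are illustrative, not from the repo.
const profileRequest = {
  model: 'gemini-3-flash-preview',
  contents: 'Build a character profile for: Fall of Constantinople, 1453',
  config: {
    responseMimeType: 'application/json',
    responseSchema: {
      type: 'OBJECT',
      properties: {
        voice: { type: 'STRING' },                             // one of the 30 prebuilt voices
        palette: { type: 'ARRAY', items: { type: 'STRING' } }, // OKLCH colors for the UI
        summary: { type: 'STRING' },
      },
      required: ['voice', 'palette', 'summary'],
    },
  },
};

// In the relay this would run as:
// const res = await ai.models.generateContent(profileRequest);
// const profile = JSON.parse(res.text ?? '{}');
```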
"shit."
bc0a835 feat(past-live): "shit." — pivot v3: calm funny storytellers, not panicking heroes
The phone call concept worked. But the characters were IN their crisis. Constantine was panicking about the walls. Gene Kranz was sweating fuel.
enableAffectiveDialog made the panic too convincing. A 5-minute call with someone stressed and urgent isn't educational. It's draining. The feature that makes the app work — emotional voice — broke the tone.
So the characters stopped panicking. They survived. They're sitting somewhere comfortable, looking back, and they find the whole thing kind of absurd.
"Constantine XI isn't panicking about the walls — he's like 'Yeah, the walls fell. Wild story. Let me tell you about it.'"
Humor style: not jokes. The gap between how insane the situation was and how casually you describe it. "They dragged 72 ships over a mountain. Over. A. Mountain."
The first crash and the googleSearch lesson
bd5f9ed fix(relay): kill googleSearch tool -- crashes Gemini Live session
I added googleSearch as a tool because obviously a historical character should be able to look things up. First session: crash. closeCode=1011, reason=Internal error occurred. Second session: crash. Third session: crash.
GitHub issue #843, 43+ reactions, open since May 2025. Tool calling combined with native audio is unstable. Removed googleSearch entirely.
The rule I landed on: minimal tools, one per turn, all NON_BLOCKING. I kept three: announce_choice (tappable decision cards), end_session (character wraps up), show_scene (mid-call image generation). Non-blocking means the character keeps talking while the tool executes. If end_session were blocking, the character would go silent mid-farewell. Phone call illusion gone.
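As a Live API config fragment, the surviving three tools might be declared like this; the parameter schemas are illustrative, and `behavior: 'NON_BLOCKING'` is the part that matters:

```typescript
// Sketch of the three tool declarations described above. Parameter
// shapes are my assumptions, not the actual repo's schemas.
const tools = [{
  functionDeclarations: [
    {
      name: 'announce_choice',
      description: 'Show the student tappable decision cards.',
      behavior: 'NON_BLOCKING', // cards render while the character keeps talking
      parameters: {
        type: 'OBJECT',
        properties: { choices: { type: 'ARRAY', items: { type: 'STRING' } } },
      },
    },
    {
      name: 'show_scene',
      description: 'Generate an era-specific scene image in the background.',
      behavior: 'NON_BLOCKING', // image generation runs alongside the voice
      parameters: {
        type: 'OBJECT',
        properties: { scene: { type: 'STRING' } },
      },
    },
    {
      name: 'end_session',
      description: 'Wrap up the call and say farewell.',
      behavior: 'NON_BLOCKING', // blocking here would mute the farewell
    },
  ],
}];
```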
18fc878 feat(relay): GoAway signal handling + transparent session resumption
Gemini Live sends GoAway when it needs to reset the connection. Without sessionResumption, that's a hard disconnect mid-conversation. With it, the relay reconnects transparently using the session handle.
One gotcha: community examples show { handle, transparent: true }. That transparent field doesn't exist. Passing it crashes the connection. The correct form is { handle?: string } only.
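A minimal sketch of the handle bookkeeping, with the actual reconnect plumbing omitted; the message field names follow the Live API's server messages:

```typescript
// On each sessionResumptionUpdate the relay stores the latest handle;
// on GoAway it reconnects with { sessionResumption: { handle } } only.
// No `transparent` field, despite what some community examples show.
let resumeHandle: string | undefined;

function onServerMessage(msg: any): 'reconnect' | undefined {
  if (msg.sessionResumptionUpdate?.resumable) {
    resumeHandle = msg.sessionResumptionUpdate.newHandle;
  }
  // Reconnect before the server actually drops the connection.
  if (msg.goAway) return 'reconnect';
}

function resumptionConfig() {
  // Correct shape: { handle?: string } and nothing else.
  return resumeHandle
    ? { sessionResumption: { handle: resumeHandle } }
    : { sessionResumption: {} };
}
```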
68f5f9d feat(past-live): Zod schemas, WS retry, chunk logging, truncation detection
Research day. GitHub issues, Google forums, other hackathon submissions. The mitigations:
- connectWithRetry() for transient WebSocket 1008 errors (#1236). Three attempts, exponential backoff. Auth errors skip retry.
- Truncation detection for #2117 (40+ developers, 8 months open). If the gap between the last audio output and turnComplete is under 500ms, I log possible_truncation. Can't fix it server-side, but can detect it.
- Bounded audio output queue with drop-oldest backpressure. Every production Gemini Live system needs this.
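A sketch of what connectWithRetry() might look like under those rules; the delay values and error classification are illustrative:

```typescript
// Retry wrapper for a flaky connect call (in the real relay, ai.live.connect()).
// Transient errors get exponential backoff; auth errors fail fast.
async function connectWithRetry<T>(
  connect: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 250,
): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await connect();
    } catch (err) {
      lastErr = err;
      const msg = err instanceof Error ? err.message : String(err);
      // Auth failures won't heal on retry; bail out immediately.
      if (/api key|unauthorized|403/i.test(msg)) throw err;
      // Exponential backoff for transient errors (e.g. WS close code 1008).
      if (i < attempts - 1) {
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastErr;
}
```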
Four script versions fail
This is where I wasted the most time. The model playing Bolívar had nothing to work with except characterName: "SIMÓN BOLÍVAR" and historicalSetting: "Simón Bolívar, 1822". It started talking about "dragging ships over mountains" because the show_scene tool had an example about Ottoman ships at Constantinople. It grabbed whatever it could find.
So I tried giving it a script. Four versions.
V1: Exact dialogue
Beat 1: "So imagine this. You spend twelve years -- twelve -- fighting
the biggest empire in the world."
Student interrupted. Model restarted the same beat word for word. Three times. Then apologized: "My apologies. I got caught up in the drama."
V2: Hints instead of lines
CONVEY: 12-year war, barefoot soldiers, mountain ranges, Spain's empire
Then ASK: "Can you imagine that?"
40-second monologue. The model read the hint list as a checklist and delivered everything in sequence. I thought "CONVEY" meant "weave these in over time." The model read it as "say all of these right now."
V3: Minimal hint + explicit stop rule
One hook, one sentence, one question. "Then STOP TALKING and wait."
Mechanical beat-jumping. Said its sentence, waited, got a response, gave a generic acknowledgment, moved to next beat. If the student said something interesting, the model ignored it.
V4: Just give destinations
"By the end of this act, the student should understand that Bolívar's coalition is fracturing." No dialogue hints. Just where to end up.
The model couldn't project personality. It knew where to go but had nothing specific to work with. Pleasant, vague, educational filler.
The breakthrough
052040a feat(past-live): bag-of-material prompt architecture (Cleopatra test)
The commit message says it: "Tonight's testing proved scripted acts don't work for voice."
I had written nine dream conversation transcripts. Every good moment came from a specific, weird historical fact delivered casually:
"I looked like a coin. Which honestly, for a queen, was more useful."
"A carpet would have been ridiculous."
"They dragged 72 ships over a mountain. Over. A. Mountain."
These aren't things you put in a sequence. They're material the character grabs when the conversation makes them relevant. Flash generates a bag of material. Live pulls from it.
- Hooks (myth/truth combos)
- Verified facts
- Anchors (universal human experience the student can relate to)
- Choices (2-3 decisions with pre-mapped consequences)
- Scene descriptions (for show_scene)
- A closing line
const storyScript = await generateStoryScript(characterName, historicalSetting);
// System prompt gets the full bag.
// Live pulls based on where the conversation goes.
Live is the performer. Flash is the researcher. Flash verifies the facts and packs them. Live finds the right one for this moment.
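Here is a rough TypeScript shape for the bag, reconstructed from the list above; the real app validates it with Zod, and the exact fields may differ. The sample entries are drawn from the Cleopatra sessions described in this post:

```typescript
// Illustrative shape of the "bag of material" Flash generates per session.
interface StoryScript {
  hooks: { myth: string; truth: string }[];           // myth/truth combos
  facts: string[];                                    // verified by Flash
  anchors: string[];                                  // universal human experiences
  choices: { label: string; consequence: string }[];  // pre-mapped outcomes
  scenes: string[];                                   // descriptions for show_scene
  closingLine: string;
}

// The whole bag goes into the system prompt; Live decides at runtime
// which item fits the moment, instead of following a beat sequence.
const cleopatraBag: StoryScript = {
  hooks: [{
    myth: 'Legendary beauty',
    truth: 'Her coins show an ordinary face; nine languages did the real work',
  }],
  facts: [
    'Spoke nine languages',
    'Was smuggled in a linen sack, not rolled in a carpet',
  ],
  anchors: ['Being underestimated because of how you look'],
  choices: [{ label: 'Flee or negotiate?', consequence: 'She chose to negotiate, face to face' }],
  scenes: ['Alexandria harbor at dusk, the lighthouse burning'],
  closingLine: 'A practical observation about the student, by name.',
};
```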
d38b208 fix(past-live): prevent hook repetition in re-anchor injection
This commit was the proof. Cleopatra asked the student's name ("And you are?"), used it throughout ("Simon the young"), pulled hooks based on the student's actual words (said "pretty" then coin myth-bust, said "Spanish" then nine languages, said "fun" then linen sack), and closed with a personal observation ("A practical young man, Simon"). No scripted acts. No mechanical beat-jumping. The character was pulling from her bag based on where the conversation went.
I told Gemini to be funny
The bag-of-material worked, but the character sounded like a British diplomat. "One must adapt to the prevailing currents." Technically correct. Zero personality.
17982a9 fix(past-live): humor directive -- be funny, not serious
First attempt: "Never tell jokes. The facts ARE the comedy. The gap between how insane your life was and how calmly you describe it is what makes it funny."
The model played it safe. Serious. Formal. It interpreted "never tell jokes" as "never be funny."
So I changed it to "be FUNNY, tease, joke, be playful."
And the model started reading my system prompt out loud. As dialogue. Word for word. The hooks, the facts, the closing thread. Not improvising from the material. Just reciting what I wrote, in character voice, as if those were its lines.
That's the title.
79c5b30 fix(past-live): allow jokes + invented personal details
A student tried to trade jokes back and forth. The model responded: "Your ability to find humor is quite charming." It complimented the humor instead of being funny.
The fix: "You CAN make jokes, tease, invent funny personal details. Historical events stay locked. Everything else is fair game."
5a4c2fb docs(past-live): update session record with V5-V7 test results
Best session yet. Cleopatra opened with "a distant relative? How wonderfully convenient after two millennia. Are we discussing inheritances?" Then later: "Textbooks! They do tend to drain the life right out of things." And: "A carpet simply wouldn't do; far too cumbersome."
The formula: "be FUNNY" + "reactions/humor FREE, facts LOCKED" + bag-of-material.
Audio pipeline and VAD tuning
4d23de5 feat(past-live): audio pipeline overhaul -- smaller chunks, echo gate, cursor playback, VAD tuning
Voice Activity Detection controls when the model thinks you've stopped talking. The defaults were wrong for a phone call. The character kept interrupting students mid-sentence.
automaticActivityDetection: {
  startOfSpeechSensitivity: 'START_SENSITIVITY_LOW',
  endOfSpeechSensitivity: 'END_SENSITIVITY_HIGH',
  prefixPaddingMs: 20,
  silenceDurationMs: 500,
}
START_SENSITIVITY_LOW: slow to decide you've started talking. Rejects background noise. END_SENSITIVITY_HIGH: quick to decide you've stopped. 500ms pause = your turn. This combination produces phone-call pacing. I spent a full session just tuning these four numbers.
Also: mic chunk size from 4096 to 1024 samples (256ms to 64ms), echo gate to prevent speaker bleed, cursor-based playback scheduling to eliminate micro-gaps.
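The millisecond figures fall out of simple sample arithmetic (mono PCM at 16 kHz):

```typescript
// Duration of one audio chunk: samples divided by sample rate, in ms.
function chunkDurationMs(samples: number, sampleRateHz: number): number {
  return (samples / sampleRateHz) * 1000;
}

const legacy = chunkDurationMs(4096, 16000);  // 256 ms, the original size
const current = chunkDurationMs(1024, 16000); // 64 ms, after this commit
const later = chunkDurationMs(512, 16000);    // 32 ms, a later reduction into the 20-40 ms range
```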
9480507 fix(past-live): re-anchor via sendClientContent -- stops audio cutoffs
Found a subtle distinction. sendRealtimeInput({ text }) triggers VAD. Gemini treats text input as "user activity" and fires interrupted, which clears the audio queue mid-sentence. Every audio cutoff in testing happened 1-2 seconds after a re-anchor injection. 100% correlation across 2 sessions.
Fix: switch re-anchor from sendRealtimeInput to sendClientContent with turnComplete: false. Injects context into history without triggering VAD.
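A sketch of the fixed call, assuming a session object from ai.live.connect(); the reminder prefix is my own convention, not the app's:

```typescript
// Re-anchor the character without tripping VAD. sendClientContent writes
// into conversation history without counting as user activity, so the
// model never fires `interrupted` and the audio queue survives.
function reAnchor(
  session: { sendClientContent: (args: any) => void },
  reminder: string,
) {
  session.sendClientContent({
    turns: [{ role: 'user', parts: [{ text: `[CONTEXT] ${reminder}` }] }],
    turnComplete: false, // the model's current turn keeps going
  });
}
```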
3b310ff fix(past-live): correct VAD config + force tool calling + audio chunks
The VAD settings were inverted in my code: START_HIGH instead of START_LOW. The character had been interrupting students because of a config bug, not because VAD is bad. I also reduced audio chunks from 1024 to 512 samples (64ms to 32ms), inside Google's recommended 20-40ms range.
Context compression and 10-minute sessions
Audio burns context fast: about 32 tokens per second in each direction. A 10-minute call is roughly 38,400 tokens from audio alone. Without compression, quality degraded around 5-6 minutes.
contextWindowCompression: { slidingWindow: {} } lets the session run longer by compressing older context. With explicit triggerTokens: 10000, I got consistent 10-minute sessions where the character still referenced things from the beginning of the call at the end.
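As a config fragment (field types can vary by SDK version; triggerTokens is a string in some releases), the setup described above looks roughly like this:

```typescript
// Audio costs ~32 tokens/s per direction, so a 10-minute call is
// 600 s * 32 tok/s * 2 directions = 38,400 tokens from audio alone.
const sessionConfig = {
  contextWindowCompression: {
    triggerTokens: 10000, // start compressing well before audio fills the window
    slidingWindow: {},    // compress the oldest turns first
  },
};
```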
I set a hard 10-minute cap because conversation quality past that wasn't worth the cost.
Testing, and the hardest feedback
63b7c3d feat(past-live): extract presets with hand-written storyScripts
Wrote full story scripts for all three presets. Constantinople gets Aubrey Plaza's delivery style, 4 hooks, 8 verified facts, 1 derivable choice, 2 scenes. Moon Landing gets Tom Hanks. Mongol Empire gets Dave Chappelle. Each has a celebrity personality anchor so the model has a voice register to hit.
Then I ran user testing. Six student personas, ages 13-17 plus a parent.
Five out of five students tried calling a custom topic. Every single one got "Retry generating session preview." The endpoint was broken for non-preset topics.
When sessions did connect, Bolívar was delivering Cleopatra's material. The story script wasn't wired correctly, so the wrong character's bag was loading. From the test report, Maya, 15:
"Bolivar started talking about Elizabeth Taylor and then the app crashed, they would laugh AT the app, not WITH it."
The expert panel: "This is the most sophisticated codebase the panel reviewed. With backend online and conversation stability, it would win. Right now, it's the smartest submission you can't actually use."
Both right.
System prompt architecture
59baa28 refactor(past-live): reorder system prompt per Google best practices
Google's guidance says prompt order matters for voice models: persona first, rules second, guardrails last. If you put guardrails first, the character sounds guarded.
One finding that held up: "unmistakably" outperforms "MUST" and "NEVER" for voice models. "You are unmistakably Cleopatra, irreverent, strategic, mildly amused by everything" produced more consistent character than "You MUST stay in character as Cleopatra at all times."
I don't fully understand why. But it worked across every test session.
The Google Cloud stack
The backend runs on Cloud Run, deployed with gcloud run deploy --source . Cloud Build handles the Docker image. Student profiles and session history live in Firestore (project past-live-490122, EU eur3). The Gemini API key is stored in Secret Manager and mounted into Cloud Run, not kept in plain environment variables.
Cloud Run was the right choice for a WebSocket relay because it supports long-lived connections. Firestore was the right choice for student profiles because it's zero-config in the same GCP project.
Cost per session
About $0.25 per 5-minute call.
Voice (Gemini Live, 5 min audio I/O): ~$0.04. Cheap.
Three images (portrait + two scenes): ~$0.20. 81% of cost.
Text (Flash preview + summary): ~$0.005.
Images are the expensive part, not the voice. On free tier, image generation takes 12-15 seconds because of GPU queue throttling. I ran a 5x5 benchmark: variance between prompt styles was under 1 second, variance between runs was 3+ seconds. The bottleneck is queue position, not what you're asking for. Paid tier drops it to 2-3 seconds.
What I'd build first if I started over
Test the conversation on day one. I built the relay, the WebSocket protocol, the audio queue, the tool declarations, and tested the character conversation late. The bag-of-material insight came from testing. If I'd run voice sessions against bare prompts on day one, I'd have found the right architecture days earlier.
Start with zero tools. Prove the conversation works bare, add tools one at a time. I lost a full day to googleSearch crashing sessions before I even knew the conversation itself was broken.
Fix audio chunk size on day one. 256ms vs the recommended 20-40ms affects everything — latency, VAD accuracy, conversation feel. And double-check your VAD config isn't inverted. Mine was.
When it works
Constantine XI picks up. Says under ten words. Waits.
You're not reading about Constantinople. You're on the phone with someone watching it fall, and because of native audio and affective dialog, you can hear it in his voice. He asks what you'd do. You answer. He tells you what he actually did. The post-call receipt puts your choice next to the historical record. Constantine XI stayed when he could have fled. He died defending the walls.
That moment lands differently when you just had a conversation about it.
The bag-of-material architecture is right. The three-model split is right. The crashes are fixable. The conversation quality improved with every iteration. That's the version I'm building toward.
- Submission: Gemini Live Agent Challenge
- GitHub: Past, Live repository
- Live: past-live.ngoquochuy.com