somebodyatsomewhere

Posted on Mar 16

Building a Live AI Comedy Roast Show with Gemini

#gemini #googlecloud #ai #geminiliveagentchallenge

This post was created for the purposes of entering the Gemini Live Agent Challenge hackathon. #GeminiLiveAgentChallenge

I built GRL: Gemini Roast LIVE, an AI comedy roast show where AI comedians watch you through your camera and roast you in real time. You can talk back, and they respond. Here's how I built it with Gemini and what I learned along the way.

The idea

Most voice AI demos are chatbots. I wanted to build something people would actually want to use for fun: a live comedy show with an MC host and multiple comedians, each with their own voice and comedy style.

The roast format turned out to be a great test for the Gemini Live API. Comedy requires fast responses, consistent character voice, and the ability to riff off what the audience says. If the AI pauses too long or breaks character, the comedy falls apart.

Three modes, one app

The app has three ways to get roasted:

Live Roast Show is the main feature. An AI MC hosts the show and introduces 3 comedians (randomly picked from a pool of 10 personas). Each comedian gets their own Gemini Live API session with a unique voice. They see you through your camera, hear you through your mic, and roast what they observe. You can talk back. Flip to the rear camera to point it at a friend if you want them roasted instead.

Photo Roast takes a selfie and produces a duo comedy act. Two randomly selected personas perform an alternating script, rendered as multi-speaker audio with synchronized highlighting.

SNS Roast generates a comedy video from an uploaded photo using Veo 3.1, with extension chaining for clips longer than 8 seconds.

Every mode has a Roast/Boost toggle. Roast mode (dark theme) is 95% savage comedy. Boost mode (light theme) flips it to 90% genuine praise with humor mixed in.

The architecture

The backend runs on Python FastAPI with Google ADK. The key challenge was the Live API's one-voice-per-session constraint. I needed 10 different comedian voices plus an MC, but each session can only have one voice.

The solution was a session-switching architecture:

The MC runs on its own Live API BIDI session
When the MC calls a comedian's name, a ShowDirector class detects it via regex on the transcription text
ShowDirector creates a new independent Live API session for that comedian with their assigned voice
Camera frames and microphone audio get forwarded to the active session
When the comedian's turn ends (turn limit or time limit), the session is destroyed
The next comedian gets a fresh session

This turned the constraint into a feature. Each comedian genuinely sounds different because they have independent sessions.
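To make the switching flow concrete, here is a minimal sketch of the idea, not the actual GRL code. `FakeLiveSession` is a hypothetical stand-in for a real Live API session (it makes no network calls), the voice names are placeholders, and the `max_turns` default is an assumed value. The time limit is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FakeLiveSession:
    """Hypothetical stand-in for one Live API session with a single voice."""
    speaker: str
    voice: str
    received: List[bytes] = field(default_factory=list)
    closed: bool = False

    def send(self, chunk: bytes) -> None:
        if self.closed:
            raise RuntimeError("session already closed")
        self.received.append(chunk)

    def close(self) -> None:
        self.closed = True


class ShowDirector:
    """Keeps at most one comedian session alive and routes media to it."""

    def __init__(self, voices: Dict[str, str], max_turns: int = 3):
        self.voices = voices        # stage name -> voice id (placeholders)
        self.max_turns = max_turns  # assumed turn limit per comedian
        self.active: Optional[FakeLiveSession] = None
        self.turns = 0

    def bring_on(self, name: str) -> FakeLiveSession:
        """Destroy the previous session and open a fresh one for `name`."""
        self.end_set()
        self.active = FakeLiveSession(speaker=name, voice=self.voices[name])
        self.turns = 0
        return self.active

    def forward(self, chunk: bytes) -> None:
        """Send a camera frame or audio chunk to the active session, if any."""
        if self.active is not None:
            self.active.send(chunk)

    def turn_complete(self) -> None:
        """Count a finished turn and end the set once the limit is reached."""
        self.turns += 1
        if self.turns >= self.max_turns:
            self.end_set()

    def end_set(self) -> None:
        if self.active is not None:
            self.active.close()
            self.active = None
```

In this sketch, media only ever reaches the session that is currently on stage, and a comedian's session never outlives their set, which mirrors the create, forward, destroy cycle described above.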

Gemini API products used

I ended up using five different Google AI products and tools:

Product	What it does in the app
Gemini Live API	Real-time bidirectional voice for MC and each comedian (BIDI streaming)
Gemini 2.5 Flash Lite	Image analysis (selfie/SNS) and comedy script generation
Gemini Multi-Speaker TTS	Two-voice audio for Photo Roast duo acts
Veo 3.1	Comedy video generation for SNS Roast with extension chaining
Google ADK	Agent orchestration, Runner lifecycle, session management

The infrastructure runs on Google Cloud Run with Secret Manager for API keys and GitHub Actions for CI/CD via Workload Identity Federation.

The hard parts

FunctionTool didn't work on newer models

I originally planned to have the MC use ADK's FunctionTool to programmatically call comedians on stage. This worked on the 09-2025 native audio model. On the 12-2025 model, however, the model writes thinking text about calling the tool without ever emitting an actual function call.

I switched to natural language detection: the MC says the comedian's name naturally during the show, and a dynamic regex on the transcription text triggers the session switch. This ended up being more reliable and sounds more natural too.
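A dynamic regex of this kind could look roughly like the sketch below. It is an illustration under assumptions, not the project's actual code: the function names are hypothetical, and the pattern is simply rebuilt from whichever stage names were selected for the show.

```python
import re
from typing import Optional, Sequence


def build_name_pattern(names: Sequence[str]) -> re.Pattern:
    """Compile a case-insensitive, whole-word pattern for the given names."""
    # Longest alternatives first, so a longer name is preferred over a
    # shorter one that starts at the same position.
    alternatives = sorted((re.escape(n) for n in names), key=len, reverse=True)
    return re.compile(r"\b(" + "|".join(alternatives) + r")\b", re.IGNORECASE)


def detect_comedian(transcript: str, names: Sequence[str]) -> Optional[str]:
    """Return the first stage name mentioned in the transcript, or None."""
    match = build_name_pattern(names).search(transcript)
    if match is None:
        return None
    # Map the matched text back to the canonical spelling of the name.
    lowered = match.group(1).lower()
    return next(n for n in names if n.lower() == lowered)
```

The word boundaries keep "frosty" from triggering Frost, but a plain name match can still misfire when a stage name is also an ordinary word, so a real implementation would likely need extra context checks.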

Veo extension requires exact original URIs

Veo 3.1 can generate 8-second clips, and you can extend them. But the extension API requires the exact original file URI that Veo generated. The download URI (with the :download?alt=media suffix) doesn't work. Re-uploading through the Files API also fails because the Veo metadata gets lost.

I had to strip the download suffix and pass the clean original URI. I couldn't find this documented anywhere, so I worked it out through trial and error.
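The suffix stripping itself is a one-liner. The helper below is a sketch with a hypothetical function name and a made-up example URI. The only detail taken from the behaviour described above is the `:download?alt=media` suffix.

```python
DOWNLOAD_SUFFIX = ":download?alt=media"


def original_veo_uri(uri: str) -> str:
    """Return the URI without the download suffix, if it has one."""
    if uri.endswith(DOWNLOAD_SUFFIX):
        return uri[: -len(DOWNLOAD_SUFFIX)]
    return uri


# Made-up URI for illustration only; real Veo file URIs will differ.
download_uri = "https://example.invalid/v1/files/abc123:download?alt=media"
clean_uri = original_veo_uri(download_uri)
```

URIs that do not end in the suffix are returned unchanged, so the helper is safe to call on a URI that is already clean.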

Model version matters for comedy

I started with gemini-3.1-flash-lite-preview for text generation but found its safety filtering was too conservative for roast comedy. Everything came out mild. Switching to gemini-2.5-flash-lite produced noticeably funnier and more daring observational roasts, while still respecting the content guidelines I defined (no body-type, skin color, disability, or sexual jokes).

Comedy prompt engineering is its own thing

Generic instructions like "be funny" produce generic results. I had to specify concrete comedy techniques for each persona:

Razor (Precision Striker): "Start with something small you see, blow it up to absurd proportions, then stack 2-3 more jokes on the same target"
Frost (Deadpan Intellectual): "Matter-of-fact devastation, rhetorical questions, polite savagery"
Pops (Dad Joke Assassin): "Weaponized puns, fake proverbs, wholesome-to-savage swerves"

The roast intensity went through several rounds of calibration, from an initial 70/30 roast/praise ratio all the way up to 95/5.

Rate limits with concurrent sessions

Running an MC session plus comedian sessions means multiple simultaneous Gemini API calls. I implemented a dual-key fallback: if the primary API key hits a 429, the request automatically retries with a secondary key. Veo uses round-robin across all available clients.
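Both strategies fit in a few lines. This is a generic sketch rather than the app's real code: `RateLimited` is a hypothetical exception standing in for however the client library reports a 429, and `request` is any callable that takes an API key.

```python
import itertools
from typing import Callable, Iterator, Sequence, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical marker for an HTTP 429 response."""


def call_with_fallback(
    request: Callable[[str], T], primary_key: str, secondary_key: str
) -> T:
    """Try the primary key first and retry once with the secondary on a 429."""
    try:
        return request(primary_key)
    except RateLimited:
        # Only a rate limit triggers the fallback; other errors propagate.
        return request(secondary_key)


def round_robin(clients: Sequence[T]) -> Iterator[T]:
    """Cycle through the clients forever, one per call to next()."""
    return itertools.cycle(clients)
```

If the secondary key is also rate limited, the second `RateLimited` propagates to the caller, which seems like a reasonable place to surface a "try again later" message.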

10 personas, infinite combinations

The persona system (personas.py) is the single source of truth for all 10 comedian characters. Each persona defines:

Stage name and comedy style
Unique Gemini voice (from a pool of 10 Live API voices)
Full system prompts for both roast and boost modes
UI color and icon
Comedy technique instructions

Every show randomly selects from this pool, so no two shows are the same. The MC's prompt is dynamically built based on which 3 comedians were selected.
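A persona registry along these lines could be modelled as below. This is a hedged sketch, not the contents of personas.py: the field names, the prompt wording, and the helper functions are all assumptions, and the UI color and icon fields are omitted for brevity.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Persona:
    stage_name: str
    style: str
    voice: str         # placeholder id, not a real Live API voice name
    roast_prompt: str
    boost_prompt: str

    def system_prompt(self, mode: str) -> str:
        """Pick the system prompt for the current Roast/Boost toggle."""
        return self.roast_prompt if mode == "roast" else self.boost_prompt


def pick_lineup(
    pool: Sequence[Persona], size: int = 3, rng: Optional[random.Random] = None
) -> List[Persona]:
    """Choose `size` distinct personas at random for one show."""
    if rng is None:
        rng = random.Random()
    return rng.sample(list(pool), size)


def build_mc_prompt(lineup: Sequence[Persona], mode: str) -> str:
    """Build the MC's prompt from whichever comedians were selected."""
    intro = ", ".join(f"{p.stage_name} ({p.style})" for p in lineup)
    return (
        f"You are the MC of a live {mode} show. "
        f"Introduce these comedians one at a time, by name: {intro}."
    )
```

Passing a seeded `random.Random` makes a lineup reproducible, which can help when testing a specific combination of comedians.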

What's next

The feature I'm most excited about is a podcast roast mode: two comedian agents having a freeform conversation about the user (similar to NotebookLM's podcast format), where the user can jump in at any point. This would require multi-voice support within a single Live API session, but if that becomes available, it would open up a much more dynamic format.

Try it

Live app: Deployed on Google Cloud Run
Code: github.com/jusunkim328/geminiLAC

Built with Google ADK, Gemini Live API, Gemini 2.5 Flash Lite, Gemini Multi-Speaker TTS, Veo 3.1, Google Cloud Run, and Google Cloud Secret Manager for the Gemini Live Agent Challenge. #GeminiLiveAgentChallenge
