Vasilis Stefanopoulos

How I Built GM-Genie: A Cinematic AI Game Master with Gemini Live API

This project was created for the Gemini Live Agent Challenge hackathon


The Problem: Text-Only RPGs Break Immersion

Every tabletop RPG player knows the feeling — your game master describes a breathtaking ruin half-swallowed by ash dunes, but all you see is... a chat window. Text generates the world in your mind, but there's a ceiling on how cinematic that experience can get.

What if your AI game master could show you the world as it narrated, play a sound effect when the door creaks open, shift the ambient drone when combat starts — all without breaking the flow of conversation?

That's what I built with GM-Genie: a voice-first RPG narrator that weaves AI-generated scene art, dynamic sound effects, and real-time audio narration into a single seamless experience. You talk to your GM, your GM talks back — and the world materializes around you.


Architecture: From Multi-Agent to Zero-Tool

Here's the honest story: the architecture you see in the final version is not the one I originally designed. I planned a multi-agent pipeline with function calling for everything — dice, inventory, scene generation, SFX. That version had a 70% connection crash rate in voice mode.

The current architecture threw nearly all of that away.

What's Running Now

Voice Mode (PRIMARY):
  gemini-2.5-flash-native-audio-latest  ←  WebSocket bidi audio
  ├── ZERO tools in voice pipeline
  ├── DicePool injects pre-rolled results into system prompt
  ├── Server transcribes GM speech → SceneDetector analyzes it
  └── SceneDetector triggers: scene image / SFX / ambient (server-side)

Text Mode (FALLBACK):
  gemini-2.5-flash + Google ADK
  └── Tools: roll_dice, manage_inventory, check_stats, generate_scene, trigger_sfx

The voice session is a pure audio conversation. No function calling, no tool dispatch, no round-trips to decide what to do. The model speaks, the server listens to the transcript, and intelligence fires server-side.

Backend

FastAPI with two endpoints:

  • POST /api/chat — SSE streaming, text mode
  • WS /api/live — WebSocket, bidirectional audio, voice mode

Frontend

React 19 + Vite + Tailwind CSS v4. WorldSelect → Lobby → Game. Audio runs through Web Audio API with separate AudioWorklet processors for capture (16kHz) and playback (24kHz).


The Interesting Parts

1. Zero-Tool Voice Architecture

The native-audio Gemini model (gemini-2.5-flash-native-audio-latest) crashed approximately 70% of the time when function calling was active. The WebSocket would close mid-session with codes 1000, 1008, or 1011 — silently, without a clear error.

The fix: remove every tool from the voice pipeline. No function calling at all.

But then how does anything happen? The GM still needs dice results. Scenes still need to generate. Sound still needs to fire.

Dice: Pre-roll a full session's worth of results before the session starts and inject them into the system prompt. The model reads them and narrates accordingly.

import random


class DicePool:
    """Pre-rolled dice injected into system prompt — no tool calls needed."""

    def __init__(self, seed: int | None = None):
        rng = random.Random(seed)
        self.pool = {
            "d4":  [rng.randint(1, 4)  for _ in range(30)],
            "d6":  [rng.randint(1, 6)  for _ in range(40)],
            "d8":  [rng.randint(1, 8)  for _ in range(30)],
            "d10": [rng.randint(1, 10) for _ in range(30)],
            "d12": [rng.randint(1, 12) for _ in range(20)],
            "d20": [rng.randint(1, 20) for _ in range(40)],
            "d100":[rng.randint(1, 100) for _ in range(10)],
        }
        self._idx: dict[str, int] = {k: 0 for k in self.pool}

    def draw(self, dice_type: str) -> int:
        pool = self.pool[dice_type]
        idx = self._idx[dice_type] % len(pool)
        self._idx[dice_type] += 1
        return pool[idx]

    def prompt_block(self) -> str:
        """Returns the pre-rolled pool formatted for system prompt injection."""
        lines = ["[PRE-ROLLED DICE POOL — use in order, top to bottom]"]
        for k, vals in self.pool.items():
            lines.append(f"{k}: {', '.join(str(v) for v in vals)}")
        return "\n".join(lines)

The system prompt tells the GM to consume results in order from top to bottom. It's deterministic, fast, and requires zero API calls during the session.
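The pre-roll idea is easy to demonstrate in isolation. A minimal sketch (the `preroll` helper is illustrative, not the project's API): seeding the RNG makes an entire session's rolls reproducible, which is what keeps the prompt block and the server's `draw()` order in sync.

```python
import random

# Minimal sketch of the pre-roll idea (the helper name is an
# assumption, not the project's API): a seeded RNG makes a whole
# session's dice reproducible, with zero in-session API calls.
def preroll(seed: int, dice: str, count: int) -> list[int]:
    sides = int(dice[1:])          # "d20" -> 20
    rng = random.Random(seed)
    return [rng.randint(1, sides) for _ in range(count)]

# Same seed, same rolls: the prompt block and the server stay in sync.
assert preroll(7, "d20", 5) == preroll(7, "d20", 5)
```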

Scenes and sound: handled by SceneDetector.

2. SceneDetector: Server-Side Intelligence

The SceneDetector monitors every GM transcript line and fires the right media event based on what the GM just said — no tool calls, no model decision-making.

# Primary scene trigger: GM says "You see..." or "Before you stands..."
_VISUAL_CUE_PATTERNS = [
    r"\byou see\b",
    r"\bbefore (?:you|your eyes)\b",
    r"\byou notice\b",
    r"\btowering (?:above|over|before)\b",
    r"\bsprawl(?:s|ing)? (?:before|below|around)\b",
    r"\bloom(?:s|ing)? (?:over|above|ahead|before)\b",
    # ... 10 more patterns
]

# Secondary: location transitions
_LOCATION_PATTERNS = [
    r"\byou (?:enter|arrive|step into|walk into)\b",
    r"\bwelcome to\b",
    # ...
]

When a pattern fires, the server calls gemini-3-pro-image-preview to generate a scene image and sends it to the frontend via the WebSocket. SFX and ambient changes work the same way — keyword hits on the transcript trigger a Freesound API search, result gets cached to disk, URL goes to the frontend.

The GM's prompt is written so it reliably uses the trigger phrases. The phrases aren't random — they're the natural language a good narrator uses when describing what a player sees. The model's instincts and the detector's patterns are aligned by design.

3. Image Consistency Across Scenes

One early problem: scene images looked completely different from each other. The player's character would appear to change ethnicity, age, and clothing between images because each generation started from scratch.

The fix: capture a visual description of the player's character the first time the player speaks, and inject it into every scene prompt.

# In event_pipeline.py — first player transcript triggers character capture
if not session_state.get("character_visual") and is_player_turn:
    character_visual = await extract_character_visual(transcript_text)
    session_state["character_visual"] = character_visual

# Every scene prompt includes:
f"CHARACTER APPEARANCE (keep consistent): {session_state.get('character_visual', '')}"

The visual description is extracted once and reused. Scenes now show the same character across every image in the session.

4. Story Loom System

Each session needs a story that feels driven and purposeful, not procedurally generic. The Story Loom generates a campaign arc before the session begins using per-world d12 tables.

# Formula: [ACTOR] wants to [ACTION] [SUBJECT] to [INTENT], but [DEVELOPMENT]
# Example output for ember_waste:
# "A rogue Iron Tide harpooneer who deserted after a failed harvest
#  is tracking a wounded juvenile Ash Strider
#  to harvest enough raw ichor to buy passage off the Char,
#  but the beast is already claimed by the Red Teeth — and they know she's here."

def roll_campaign_arc(world_key: str) -> CampaignArc:
    tables = _LOOM_TABLES[world_key]
    return CampaignArc(
        actor=random.choice(tables["actor"]),
        action=random.choice(tables["action"]),
        subject=random.choice(tables["subject"]),
        intent=random.choice(tables["intent"]),
        development=random.choice(tables["development"]),
        phase="discovery",
    )

The arc unfolds through four phases: discovery → escalation → climax → resolution. Each session advances the phase and generates a beat — a shorter encounter seed tuned to the current arc phase — so sessions feel like chapters of a larger story.
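The phase bookkeeping reduces to a small helper; this sketch assumes a simple linear progression (only the phase names come from the implementation described above, the helper itself is an assumption).

```python
# Arc phases from the post; the advancement helper is an assumption
# about how per-session progression might be wired.
_PHASES = ["discovery", "escalation", "climax", "resolution"]

def next_phase(phase: str) -> str:
    """Advance one arc phase per session; resolution is terminal."""
    i = _PHASES.index(phase)
    return _PHASES[min(i + 1, len(_PHASES) - 1)]
```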

5. Dynamic Sound from Freesound API

Instead of bundling a static sound library, GM-Genie searches Freesound in real time based on what the SceneDetector infers from the transcript.

import re
from pathlib import Path


async def fetch_sfx(description: str, cache_dir: Path) -> Path | None:
    """Search Freesound, download best match, cache to disk."""
    cache_key = re.sub(r"[^\w]", "_", description.lower())[:40]
    cached = cache_dir / f"sfx_{cache_key}.mp3"
    if cached.exists():
        return cached
    # Hit Freesound API, pick highest-rated result, download
    ...
    return cached

First request hits the API. Every subsequent session reuses the cached file. The SFX library builds itself as the game is played.
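The caching behaviour rests entirely on a stable key. Isolating the slug logic (the same regex as in the snippet above) makes it clear why repeated descriptions hit the disk cache:

```python
import re

# Filesystem-safe slug of the SFX description: identical descriptions
# always map to the same cached file name.
def sfx_cache_key(description: str) -> str:
    return re.sub(r"[^\w]", "_", description.lower())[:40]
```

A "creaking door" requested in session one is served from disk in session two, with no Freesound round-trip.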

6. Audio Pipeline: Continuous Capture, No Noise Gate

The AudioWorklet capture processor sends raw 16kHz PCM to the server in 84-byte chunks (2.6ms each). Early versions had a client-side noise gate to reduce API load. That was a mistake.

Gemini Live API has its own voice activity detection. When I added a noise gate on the client, it sent fragmented audio: short bursts of sound separated by hard silences. The API's VAD couldn't recognize these as continuous speech. Player transcripts dropped to zero, and the model stopped responding to the player entirely.

Removing the noise gate fixed it immediately.

The server-side fix was batching. 84-byte chunks are too granular for the API, so the server buffers them to ~3200 bytes (~100ms) before forwarding to the LiveRequestQueue.

# server.py — mic batching
import asyncio

from google.genai import types  # Blob wrapper for raw PCM frames

MIC_BATCH_BYTES = 3200

async def _mic_sender(live_queue, mic_buffer):
    while True:
        chunk = await mic_buffer.get()
        batch = chunk
        while len(batch) < MIC_BATCH_BYTES:
            try:
                batch += mic_buffer.get_nowait()
            except asyncio.QueueEmpty:
                break
        live_queue.send_realtime(
            types.Blob(data=batch, mime_type="audio/pcm;rate=16000")
        )

7. Transcript Deduplication

Gemini Live API sends transcripts in two stages: partial transcripts as the model speaks, then a finished transcript containing the complete text for that turn. Early versions concatenated everything — partials plus the finished event — so every player utterance appeared twice in the conversation log.

The fix: display partials to the frontend for real-time feedback, but use only the finished transcript for the conversation log. When the finished event arrives, discard whatever partial buffer was accumulating.
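A minimal sketch of that split, with assumed handler names: partials feed the UI buffer, and the finished event is the only text that reaches the durable log.

```python
# Sketch of the partial/finished split (class and method names are
# assumptions): partials update a live buffer shown in the UI; the
# finished event is appended to the log and clears that buffer.
class TranscriptLog:
    def __init__(self):
        self.partial = ""            # shown in the UI as speech arrives
        self.log: list[str] = []     # durable conversation log

    def on_partial(self, text: str) -> None:
        self.partial += text

    def on_finished(self, full_text: str) -> None:
        self.log.append(full_text)   # authoritative text for the turn
        self.partial = ""            # discard the accumulating buffer
```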

8. Session Timing and Graceful Endings

Voice sessions run on a timer. The ending needs to feel narrative, not abrupt.

The implementation: at 30 seconds remaining, inject a warning into the GM's context. At 5 seconds remaining, inject a hard "STOP NOW" directive. After the timer hits zero, drain audio for 1.5 seconds to let the final sentence finish playing, then close the connection.

The timer pauses during reconnection and force-expires on session_ended events so the display stays accurate regardless of network interruptions.


The Seven Worlds

Players choose from seven hand-crafted settings, each with distinct themes, lore factions, and story tables:

The Char (ember_waste): Post-apocalyptic desert survival — Ash Striders, ichor economy, brutal factions
Neon Ghosts (chrome_cantrips): Cyberpunk noir with an arcane undercurrent
The Sundered Skies (leviathan_divide): Sky-city fantasy — civilization built on the backs of flying leviathans
The Drowning Sea (abyssal_wake): Pirate horror with deep-sea dread
The Verdant Maw (iron_vein_wilds): Nature-spirit fantasy — the wild is alive and it has opinions
The Crimson Siege (blood_hegemony): Gothic vampire warfare
The Starbound Frontier (neon_frontier): Space western with spirit bonds

Plus a custom text input if none of these fit.


The DM Framework in the Prompt

Architecture gets the model talking. The prompt makes it worth listening to. The system prompt bakes in several techniques drawn from live actual-play games:

E.A.S.E. (Environment → Atmosphere → Senses → Events): Every scene description hits all four layers in sequence. You don't just learn what the place looks like — you feel the temperature, smell the air, and something happens.

Rule of Three: Every location has exactly three interactable things. Never fewer (the player needs options), never more (too much to track). End on the most dramatically interesting one.

Cold Open / Sensory Hook: Sessions start in the middle of something. No "You are an adventurer who has arrived in a town." The first line drops you into a moment.

NPC One Distinct Feature + Want/Fear: Every named NPC gets one physical or behavioral detail that makes them memorable, plus a clear want and a clear fear. The GM always knows what an NPC will do, because it knows what they want and what they fear.

Proactive World: NPCs don't wait for the player to interact. The world has momentum. Things happen whether the player acts or not.


Challenges

1. Tool Calls Crashed the Native-Audio Model

Function calling in gemini-2.5-flash-native-audio-latest caused WebSocket disconnects ~70% of the time. The errors were 1000/1008/1011 — generic enough that it took a while to correlate them with tool invocation.

Solution: remove all tools from the voice pipeline. Zero function calling in voice mode. Pre-roll dice, move media triggering server-side, handle state transitions in the server without asking the model.

2. Noise Gate Broke Speech Detection

Client-side noise gating sent fragmented audio bursts. Gemini's VAD saw them as noise, not speech. Player turn transcripts flatlined.

Solution: continuous audio stream, no gate. The API handles silence detection.

3. 84-Byte AudioWorklet Chunks

AudioWorklet sends 2.6ms chunks. The Live API needs sustained audio to work correctly.

Solution: buffer in server to ~3200 bytes (100ms) before forwarding to LiveRequestQueue.

4. Transcript Duplication

Gemini sends partials + a final "finished" event containing the full text. Concatenating both produced doubled transcripts.

Solution: partials go to the frontend for real-time display; finished event goes to the conversation log; partial buffer discarded on finish.

5. Filler Sounds Burned Quota

Early versions played filler audio ("Hmm...", "Let me think...") via separate TTS connections. Each one consumed Live API RPM quota and occasionally triggered its own reconnection.

Solution: inject fillers directly into the live session's LiveRequestQueue. No separate connection, no quota hit, zero latency overhead.

6. Image Generation Quota Limits

gemini-2.5-flash-image has a tight daily quota on free tier (~50 requests/day). Aggressive scene generation during testing exhausted it within hours.

Solution: switch to gemini-3-pro-image-preview, implement cooldowns (10s for visual cues, 15s for location triggers), and cache everything to disk.


Tech Stack

Voice Engine: gemini-2.5-flash-native-audio-latest (WebSocket bidi)
Scene Generation: gemini-3-pro-image-preview
Text Fallback: gemini-2.5-flash + Google ADK
Sound Effects: Freesound API (dynamic search + disk cache)
Frontend: React 19 + Vite + Tailwind CSS v4
Backend: FastAPI + Uvicorn on Google Cloud Run
Audio Processing: Web Audio API + AudioWorklet

What I Learned

1. Zero-tool architectures beat tool-reliant architectures for real-time audio.
When you're in a bidi audio stream, every tool call is a potential connection interruption. Moving intelligence to the server — where it's synchronous, controllable, and doesn't touch the model connection — is almost always the right call.

2. Server-side intelligence scales better than client-side complexity.
The SceneDetector is 300 lines of Python with regex patterns and cooldown timers. It's simple to debug, easy to extend, and never hallucinates. Compare that to asking the model to decide when to show an image.

3. The GM's personality IS the product.
Users don't notice the architecture. They notice whether the GM makes them feel something. Prompt engineering — the E.A.S.E. structure, the cold opens, the proactive NPCs — mattered more than any infrastructure decision.

4. Pre-rolling dice beats function calling for latency.
Injecting a pool of results into the system prompt means zero round-trips during the session. The model reads the pool, narrates the results, and moves on. Latency drops to zero. The numbers are just as random.

5. Continuous audio streams beat noise-gated streams for speech detection.
This one cost me two days. Trust the API's VAD.


What's Next

  • Multi-session campaigns with persistent world state between sessions
  • NPC voice portraits — each named NPC gets a distinct voice
  • Community-created world templates
  • Mobile-optimized UI for actual tabletop use

Built with Google ADK, Gemini 2.5 Flash Native Audio, and way too much coffee.

#GeminiLiveAgentChallenge
