DEV Community

Sasha Philius

I Built an AI Storyboarding Assistant That Listens to My Voice — Here's How

I like telling stories, and when I sit down to build a world, I don't want to type prompts into a chat box and wait. I want to direct. I want to speak a scene into existence the way a filmmaker speaks to their crew on set — and have the crew just go.

That inspiration is what led me to build StoryForge for the Gemini Live Agent Challenge.

The Gap No One's Filling

I've used some impressive AI creative tools, but they all share the same interaction pattern: click, type, wait, copy-paste, repeat. It's the opposite of creative flow. When I'm deep in a story — when I can see the rain on the alley, hear the footsteps — I'm pacing and talking the scene out into my recording app. The last thing I want is to break that flow to sit down and type a prompt.

What if I could just... talk? And the AI would listen, respond, and build the storyboard while we're still in conversation?

What StoryForge Actually Does

You open an infinite spatial canvas. You hit the mic button and speak:

"Scene one. A rain-soaked Tokyo alley at midnight. Neon signs reflected in puddles. A woman in a red coat walks toward camera."

And the AI assistant responds like a seasoned pro on set: "Copy that. Scene set. What's the blocking? Who's in the frame?"

While it speaks, a scene node materializes on the canvas with the mood, setting, and a generated storyboard panel. You introduce a character — a node appears and connects to the scene with a purple dashed line. You ask for a second scene — it flows from the first with an amber animated edge. The canvas becomes a living narrative graph that grows as you direct.
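
The tool-call-to-node flow above can be sketched in a few lines. This is a minimal illustration, not StoryForge's actual schema — the field names (`tool`, `args`, the node's `type`/`data` shape) are assumptions modeled on React Flow-style node payloads.

```python
import uuid

def tool_call_to_node(tool: str, args: dict) -> dict:
    """Turn a structured agent tool call into a canvas-ready node payload."""
    # Hypothetical mapping from tool name to canvas node type.
    node_type = {"update_scene": "scene", "introduce_character": "character"}[tool]
    return {
        "id": f"{node_type}-{uuid.uuid4().hex[:8]}",  # unique node id for the canvas
        "type": node_type,
        "data": args,  # mood, setting, etc. flow straight into the node
    }

node = tool_call_to_node(
    "update_scene",
    {"setting": "rain-soaked Tokyo alley", "mood": "neon noir", "time": "midnight"},
)
```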

No text input. No chat interface. Just voice and vision.

How It Works Under the Hood

Two parallel paths keep the experience fluid:

The Voice Path connects the browser directly to the Gemini Multimodal Live API over WebSocket. Audio streams bidirectionally — you speak, the AI hears and responds in real time with both voice and structured actions. Barge-in works too: interrupt the AI mid-sentence and it stops talking and listens.

The Intelligence Path runs on Google Cloud Run, powered by two Google ADK agents:

  • A Director Agent with custom tools — update_scene, introduce_character, generate_storyboard_prompt, generate_image_prompt — that structure creative direction into canvas-ready data
  • A Search Agent that uses Google Search grounding so the AI can give factual answers about filmmaking techniques, genre conventions, and visual references without hallucinating

Why two agents? ADK currently doesn't allow built-in tools like google_search alongside custom function tools in the same agent. So the Director handles creative tools, and the Search Agent handles grounding. The API router decides which to call.
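
The router's job is small: decide whether an utterance needs the Director's creative tools or the Search Agent's grounding. A toy version might use a keyword heuristic like the one below — the real router could just as well use intent classification; the hints and agent names here are illustrative.

```python
# Phrases that suggest a factual question rather than creative direction.
SEARCH_HINTS = ("what is", "explain", "reference", "example of", "who directed")

def pick_agent(utterance: str) -> str:
    """Route an utterance to the agent best equipped to handle it."""
    text = utterance.lower()
    if any(hint in text for hint in SEARCH_HINTS):
        return "search_agent"    # Google Search grounding
    return "director_agent"     # update_scene / introduce_character / ...
```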

The frontend is React with React Flow — an infinite canvas where every story element is a draggable, connectable node. Scenes link to scenes (the story timeline). Characters link to the scenes they appear in. You can drag between any two nodes to create manual connections. It's not just a canvas — it's a spatial story graph.
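
The two edge kinds in the graph can be sketched as React Flow-style dicts. The styling values below are guesses at the description above (amber animated timeline edges, purple dashed character links), not StoryForge's actual code.

```python
def scene_edge(src: str, dst: str) -> dict:
    """Story-timeline edge between two scene nodes."""
    return {
        "id": f"{src}-{dst}",
        "source": src,
        "target": dst,
        "animated": True,                 # amber animated flow
        "style": {"stroke": "amber"},
    }

def character_edge(char: str, scene: str) -> dict:
    """Links a character node to a scene it appears in."""
    return {
        "id": f"{char}-{scene}",
        "source": char,
        "target": scene,
        "style": {"stroke": "purple", "strokeDasharray": "6 3"},  # dashed
    }
```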

The Persona Makes It Real

The AI doesn't talk like a chatbot. It talks like someone who's been on a hundred sets. Terse. Opinionated. Uses terminology like "coverage," "blocking," "establishing shot" without explaining them. The voice is set to "Sadachbia," which gives it a grounded, confident presence that fits the persona.

Responses are kept to under 8 seconds of spoken audio, because long-winded AI responses destroy the creative rhythm. I wanted to build a voice-first interface that felt like a creative brainstorming session.
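
A persona like this mostly lives in the system instruction. The exact wording StoryForge uses isn't shown in this post, but it would be something in the spirit of:

```python
# Hypothetical system instruction — illustrative wording, not the real prompt.
SYSTEM_INSTRUCTION = """\
You are a veteran assistant director on a working film set.
- Be terse and opinionated. Use set terminology (coverage, blocking,
  establishing shot) without explaining it.
- Keep every spoken reply under 8 seconds of audio.
- When the director describes a scene or character, call the matching
  tool instead of describing it back to them."""
```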

What I Learned Building This in 17 Days

1. Models deprecate fast. gemini-2.0-flash was deprecated mid-sprint. I had to swap to gemini-3.1-flash-lite-preview and test everything again.

2. Browser storage will crash your app mid-demo. I was saving story state to localStorage, which has a ~5MB quota. During a long session with generated images and trace events, it fills up and throws QuotaExceededError. I switched to sessionStorage with try/catch cleanup, and I clear it before every session.
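
The browser fix itself is TypeScript, but the pattern is language-agnostic: try to persist, and on a quota error clear stale entries and retry once. Here it is as a Python sketch, with `QuotaStore` as a toy stand-in for sessionStorage's quota.

```python
class QuotaExceededError(Exception):
    """Raised when a write would exceed the store's size limit."""

class QuotaStore:
    """Toy stand-in for a quota-limited browser storage area."""

    def __init__(self, limit: int) -> None:
        self.limit, self.data = limit, {}

    def set(self, key: str, value: str) -> None:
        used = sum(len(v) for k, v in self.data.items() if k != key)
        if used + len(value) > self.limit:
            raise QuotaExceededError
        self.data[key] = value

def save_state(store: QuotaStore, key: str, value: str) -> None:
    try:
        store.set(key, value)
    except QuotaExceededError:
        store.data.clear()   # drop stale session data
        store.set(key, value)  # retry; may still fail if value alone is too big
```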

3. Audio feedback loops — this was the key insight I hadn't considered at all. If the AI's voice plays through your speakers and your mic picks it up, Gemini's Voice Activity Detection thinks you're speaking. It triggers barge-in, and the AI interrupts itself in an infinite loop. The fix is simple — wear headphones — but I lost hours debugging this before realizing what was happening. If you're building anything with the Live API's voice features, this is one of the first things I would account for.

4. ADK's InMemorySessionService.get_session() returns None, not an exception, when a session doesn't exist. If you wrap it in try/except expecting an error, your fallback never fires and the agent crashes on every first request.
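
The fix is an explicit None check instead of exception handling. Sketched below with a fake service that mimics the None-return behavior, so the snippet runs without google-adk installed (the real service's method signatures differ and are async in recent versions).

```python
class FakeSessionService:
    """Stand-in mimicking InMemorySessionService's None-on-missing behavior."""

    def __init__(self) -> None:
        self._sessions = {}

    def get_session(self, session_id: str):
        return self._sessions.get(session_id)  # None, not an exception

    def create_session(self, session_id: str) -> dict:
        self._sessions[session_id] = {"id": session_id, "events": []}
        return self._sessions[session_id]

def get_or_create(service, session_id: str) -> dict:
    session = service.get_session(session_id)
    if session is None:  # a try/except here would never fire
        session = service.create_session(session_id)
    return session
```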

The Stack

  • Voice I/O: Gemini Multimodal Live API (WebSocket, PCM 16-bit, Sadachbia voice)
  • Agent Framework: Google ADK (LlmAgent, Runner, InMemorySessionService)
  • Grounding: Google Search (built-in ADK tool)
  • Image Generation: Gemini 3.1 Flash Image Preview (interleaved text + image)
  • Frontend: React, React Flow, TypeScript, Vite
  • Backend: Python, FastAPI, Uvicorn
  • Deployment: Google Cloud Run, Cloud Build
  • IaC: Bash deploy script

Try It


Built for the Gemini Live Agent Challenge. #GeminiLiveAgentChallenge
