Kaviya Kumar

Most AI Apps Return Text. DreamLoom Returns a Living Storybook - With Voice, Illustrations, and Music in Real Time.

How DreamLoom turns a spoken conversation into a living storybook — with native interleaved text+image, real-time interruption, sketch-to-scene camera input, and a cinematic trailer assembled entirely in the browser.

This content was created for my Gemini Live Agent Challenge submission.

Try it live: https://getdreamloom.com | GitHub repo


The Problem (in 3 sentences)

There are billions of people with stories they'll never write down. Not because they lack imagination — because writing takes craft, illustration takes training, and video takes editing software. Current AI tools don't fix this: you type a prompt, get a wall of text, paste it into an image generator, and stitch the result together manually. There is no conversation. No surprise. No creative partnership.

What I Built

DreamLoom is a voice-first AI story studio. You speak to Loom — an AI creative director with personality, opinions, and creative taste — and watch illustrated scenes materialize in real time. Narration and original illustrations arrive interleaved in a single Gemini API response. Music shifts with the mood. You can interrupt mid-sentence to redirect the story, hold up a pencil sketch for the AI to incorporate, and track characters and continuity through a live Story Bible. When the story is complete, DreamLoom packages everything into a cinematic trailer, a Storybook PDF, and a downloadable image archive — all assembled client-side, no server-side rendering.

Five Gemini models work in concert within a single creative session. No other app I'm aware of orchestrates this many Gemini capabilities in one unified experience.

Who It's Built For

DreamLoom is designed for people who think in voice, not prompts.

Families. A five-year-old says "tell me about a shy hedgehog who's scared their spines will hurt everyone" and, five minutes later, has a fully illustrated, narrated storybook they helped create. Kid-safe mode is on by default — not as a toggle buried in settings, but as a core design principle. Loom never generates violent, scary, or mature content. When a child asks for "blood everywhere," Loom redirects with a creative alternative: "How about a shadow that turns out to be friendly?" The guardrail feels like creative direction, not censorship. No typing, no reading required. A child who can speak can direct a story.

Teachers. A teacher says "create an educational adventure about the water cycle" and the class watches an illustrated story unfold in real time. Students can interrupt to add ideas. The Story Bible tracks continuity. The Director's Cut becomes a class artifact that can be exported as a PDF or shared to the public gallery. DreamLoom ships with guided templates — "Bedtime Adventure," "Learning Quest," "Sketch Catalyst" — so teachers can launch into a high-quality first scene without setup.

Non-writers with vivid imaginations. People who have stories but don't write. They can speak a narrative into existence, interrupt to change direction, hold up a napkin sketch for the AI to incorporate, and walk away with a multimedia storybook they couldn't have created alone. Creative ownership without creative skill.

The common thread: the target user has never typed a prompt. DreamLoom's entire interface is voice-first. There is no prompt box. No "generate" button. You talk, and the story responds.


The Full Demo: Voice Start → Living Storybook (6 Steps)

Here's exactly what happens when you use DreamLoom. Every claim below is reproducible at getdreamloom.com.

Step 1: You Speak — Loom Listens and Directs

You click "Begin Your Story" and say:

"Let's create a story about a young mapmaker named Mira who discovers her maps are portals to the places she draws."

Loom doesn't just start generating. It responds with voice — warm, theatrical, opinionated. It asks about visual style and narrator tone conversationally: "What kind of look are you going for — watercolor, comic book? And should I narrate warmly, or more like a mystery?" You answer. Loom registers Mira as a character with visual descriptions for illustration consistency, sets the style, and generates Scene 1.

What appears isn't text followed by an image. It's interleaved: a paragraph, then an illustration, then another paragraph, then another painting — narrative and art woven together in a single API response.

Evidence: Open the Debug Panel. It shows response_modalities: ["TEXT","IMAGE"] and the exact part order: 0:text, 1:image, 2:text, 3:image. This is native interleaved output — not two API calls stitched together.

Step 2: You Interrupt — Loom Pivots Instantly

Mid-sentence, while Loom is still talking, you say:

"Wait — make it nighttime. With glowing mushrooms."

Loom's audio cuts instantly. No lag. It pivots: "Oh, even better!" A new scene generates with atmospheric night lighting. The music shifts from wonder to mystery.

How it works: When the Live API fires a barge-in event, the frontend disconnects the GainNode to instantly silence all scheduled audio sources — keeping the AudioContext alive (no 50-100ms recreation penalty). The agent receives the interruption signal and adapts.

📸 [GIF: User interrupts → "Interrupted" flash → Loom pivots → new nighttime scene generates]

Step 3: You Show a Sketch — Loom Incorporates It

You hold up a pencil sketch of an owl wearing tiny glasses:

"I drew the owl librarian — let me show you."

The webcam captures it at 1 fps, sends the JPEG to Loom, who describes what it sees and incorporates the concept into the next scene — not a pixel copy, but a faithful interpretation rendered in the story's established art style.
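Roughly, the capture loop looks like this (a minimal sketch; the WebSocket and message shape are illustrative, not DreamLoom's exact protocol):

```typescript
// Sketch of the 1 fps sketch-to-scene camera pipeline.
// `ws` and the message fields are assumptions for illustration.
async function startSketchCapture(ws: WebSocket) {
  const stream = await navigator.mediaDevices.getUserMedia({ video: true });
  const video = document.createElement("video");
  video.srcObject = stream;
  await video.play();

  const canvas = document.createElement("canvas");
  const ctx = canvas.getContext("2d")!;

  // Grab one JPEG frame per second and forward it to the agent.
  setInterval(() => {
    canvas.width = video.videoWidth;
    canvas.height = video.videoHeight;
    ctx.drawImage(video, 0, 0);
    const jpegBase64 = canvas.toDataURL("image/jpeg", 0.8).split(",")[1];
    ws.send(JSON.stringify({ type: "camera_frame", mimeType: "image/jpeg", data: jpegBase64 }));
  }, 1000);
}
```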

Evidence: The physical sketch and the generated scene side by side. Same character concept, different medium. The AI maintains style consistency with earlier scenes while incorporating new visual input.

📸 [Side-by-side: Physical pencil sketch ↔ AI-generated scene with owl librarian in story's art style]

Step 4: You Test Memory — Loom Remembers

Three scenes in, Mira has gained a copper compass in Scene 1, befriended the owl librarian in Scene 2, and entered a glowing cave in Scene 3. You ask:

"Does Mira still have the compass? And is the owl with her?"

Loom answers without hesitation: "She does — and the owl's been perched on her satchel since they left the library." It doesn't just recall names. It tracks which items a character acquired, which characters are present in the current scene, and how the world state has evolved across the story.

The Story Bible is a live sidebar tracking every character (with visual descriptions for illustration consistency), world settings, mood, and plot continuity across all scenes. When Loom generates a new scene, it injects character descriptions and continuity notes into the prompt so illustrations stay consistent — Mira looks like Mira, the compass looks like the compass. This isn't retrieval from a vector database. It's structured state maintained in real-time by the agent's add_character and get_story_context tool calls.
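As a rough illustration, the state the agent maintains has a shape like this (field names are mine, not DreamLoom's actual schema):

```typescript
// Hypothetical shape of the Story Bible state built up through
// add_character / get_story_context tool calls.
interface StoryBible {
  title: string;
  artStyle: string;            // e.g. "watercolor"; injected into every scene prompt
  mood: string;
  characters: {
    name: string;
    visualDescription: string; // reused verbatim so illustrations stay consistent
    inventory: string[];       // the copper compass lives here, not in a vector DB
  }[];
  scenes: {
    summary: string;
    charactersPresent: string[];
  }[];
}
```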

Step 5: Kid-Safe Guardrails — Creative Redirection, Not Censorship

You test the boundaries:

"Make it super scary with blood everywhere."

Loom redirects gently: "Let's keep it spooky-mystical instead — how about a shadow that turns out to be friendly?"

This is the core design principle, not an afterthought. Kid-safe mode is on by default for every session. The system prompt instructs Loom to redirect inappropriate requests with imaginative alternatives, not flat refusals. No violent imagery. No mature themes. No horror. The youngest user — a five-year-old creating a bedtime story with a parent — is the design target. Every content guardrail is built to feel like creative direction: "I have an even better idea..." rather than "I can't do that."

The kid-safe toggle is visible in the status bar, and a parent or teacher can verify it's active at any time.

Step 6: The Director's Cut — From Conversation to Cinema

You say:

"That feels like a good ending. Can I see the Director's Cut?"

Loom generates a cover image, a logline, and trailer narration text. DreamLoom assembles everything into a cinematic experience:

  • Cover art — generated to match the story's visual style
  • Cinematic trailer — Ken Burns camera movement on each scene, AI-narrated voiceover via Gemini TTS, per-scene music with crossfades, letterbox framing
  • Storybook PDF — book-style layout with title page and per-scene pages featuring interleaved text and illustrations
  • Scene image ZIP — all generated artwork, downloadable

The trailer is assembled entirely client-side: Canvas renders at 1280x720 and 30fps, Web Audio API mixes narration and music, MediaRecorder captures as VP9 WebM at 3Mbps. No FFmpeg. No server-side video processing. 720 lines of useAnimatic.ts.
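The capture path boils down to a few browser APIs. A minimal sketch (the real useAnimatic.ts also handles the Ken Burns pan/zoom, crossfades, and letterboxing):

```typescript
// Sketch of the client-side trailer capture: canvas frames plus a mixed
// audio track go straight into MediaRecorder as VP9 WebM.
function recordTrailer(canvas: HTMLCanvasElement, audioTrack?: MediaStreamTrack): Promise<Blob> {
  const stream = canvas.captureStream(30);       // 30 fps from the 1280x720 canvas
  if (audioTrack) stream.addTrack(audioTrack);   // narration + music mixed via Web Audio elsewhere

  const recorder = new MediaRecorder(stream, {
    mimeType: "video/webm;codecs=vp9",
    videoBitsPerSecond: 3_000_000,               // ~3 Mbps
  });

  const chunks: Blob[] = [];
  recorder.ondataavailable = (e) => {
    if (e.data.size) chunks.push(e.data);
  };

  return new Promise((resolve) => {
    recorder.onstop = () => resolve(new Blob(chunks, { type: "video/webm" }));
    recorder.start();
    // ...draw Ken Burns frames on a setInterval, then call recorder.stop()
  });
}
```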

Beyond the Session: Gallery, Resume, and Guided Starts

DreamLoom isn't a one-shot demo. Stories persist.

  • Session resume — Close the tab and come back later. Your story is saved to Firestore and appears in "Your Stories" on the landing page. Pick up exactly where you left off — all scenes, characters, and Story Bible state intact.
  • Public gallery — Publish your finished story to a shared gallery. Other users can browse and read published storybooks. This turns DreamLoom from a tool into a community.
  • Guided templates — Not sure where to start? Choose from pre-built launch paths: "Bedtime Adventure" (family-friendly with a gentle arc), "Learning Quest" (classroom-ready educational story), or "Sketch Catalyst" (build a narrative around visual concepts). Each template sends an opening prompt to Loom so users get a high-quality first scene immediately.


Architecture: Two Models, One Creative Session

DreamLoom's core insight is a two-model architecture: one model for conversation, a different model for creation.

The Conversation Model (Gemini Live API)

Handles real-time bidirectional voice. User audio streams in at 16 kHz PCM via an AudioWorklet. Agent voice streams back at 24 kHz PCM. Barge-in detection is native to the Live API. The model runs through Google ADK run_live() with a LiveRequestQueue — a single persistent connection for the duration of the session.
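A stripped-down sketch of the mic path (worklet module name, WebSocket, and framing are assumptions):

```typescript
// Sketch of mic capture → 16 kHz Int16 PCM → binary WebSocket frames.
// Assumes a worklet module that posts raw Float32 frames to the main thread.
async function startMicStream(ws: WebSocket) {
  const ctx = new AudioContext({ sampleRate: 16000 });   // capture at 16 kHz
  await ctx.audioWorklet.addModule("mic-processor.js");  // assumed module name
  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  const node = new AudioWorkletNode(ctx, "mic-processor");
  ctx.createMediaStreamSource(mic).connect(node);

  node.port.onmessage = ({ data }: MessageEvent<Float32Array>) => {
    // Convert Float32 samples to 16-bit PCM before sending as a binary frame.
    const pcm = new Int16Array(data.length);
    for (let i = 0; i < data.length; i++) {
      pcm[i] = Math.max(-1, Math.min(1, data[i])) * 0x7fff;
    }
    ws.send(pcm.buffer);
  };
}
```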

The Scene Model (Gemini Interleaved Output)

When Loom decides it's time for a scene, the agent calls the create_scene tool, which dispatches to a second Gemini model (gemini-2.5-flash-image) with response_modalities=["TEXT","IMAGE"]. The response arrives as interleaved blocks — paragraphs and illustrations woven together — not sequential text-then-image.

The Bridge

A single Director Agent named Loom connects the two models through six callable tools:

create_scene()         → Interleaved text+image generation
generate_music()       → Lyria RealTime streaming (+ CC0 fallback)
create_directors_cut() → Cover + logline + trailer narration
set_story_metadata()   → Title, genre, art style, narrator voice
add_character()        → Character registry with visual descriptions
get_story_context()    → Story Bible for continuity

The agent decides when to call each tool based on conversational context — not hard-coded triggers. Scene generation runs as a background asyncio.create_task to avoid blocking the Live API connection (which would cause 1011 timeout errors).

Five Models in Concert

| Role | Model | What It Does |
| --- | --- | --- |
| Voice conversation | gemini-2.5-flash-native-audio | Real-time bidi voice via Live API |
| Scene generation | gemini-2.5-flash-image | Interleaved text+image in one response |
| Music composition | lyria-realtime-exp | 48 kHz stereo AI music, streamed |
| Trailer narration | gemini-2.5-flash-preview-tts | Styled voice narration for the Director's Cut |
| Audio transcription | gemini-2.5-flash | Transcribes buffered audio on reconnect |

Data Flow (One Turn)

User voice (16kHz PCM)
  → AudioWorklet → WebSocket (binary) → LiveRequestQueue
  → Gemini Live API (conversation model)
  → Agent decides: call create_scene()
  → SceneGenerator → Gemini Interleaved API
  → response: [text, image, text, image, ...]
  → images saved to GCS → notification queue
  → WebSocket (JSON) → React StoryCanvas renders
  → Agent voice: "There we go! What do you think?"
  → WebSocket (binary PCM) → Web Audio playback

Technical Decisions That Mattered

Why Two Models Instead of One?

The Gemini Live API excels at real-time voice conversation but doesn't support interleaved image output. The interleaved model (gemini-2.5-flash-image) generates beautiful text+image scenes but doesn't support bidirectional audio streaming. Neither can do the other's job.

The bridge — an ADK agent with tools — means Loom has a voice that responds in real time, AND the ability to generate illustrated scenes when the moment is right. The creative direction happens in voice; the creation happens in a separate, purpose-built model call.

Why Client-Side Video Assembly?

The Director's Cut trailer could have been assembled server-side with FFmpeg. I chose client-side for three reasons:

  1. Zero server cost — video encoding is CPU-intensive. With client-side assembly, the server never touches video.
  2. Zero latency — no upload/download round-trip. The video renders directly from images already cached in the browser.
  3. No dependency — FFmpeg is a deployment headache. Canvas + Web Audio + MediaRecorder are native browser APIs.

The tradeoff: browser throttling can affect quality in background tabs. I use setInterval instead of requestAnimationFrame to mitigate this, but it's not perfect.

Why AudioWorklet + Gain-Node Flush?

For barge-in to feel instant, audio must stop immediately when the user interrupts. The standard approach — close the AudioContext and create a new one — adds 50-100ms of silence. Instead, I disconnect the GainNode to instantly silence all scheduled audio sources, create a fresh one, and keep the AudioContext alive. The interruption is imperceptible.
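In sketch form (the actual playback code tracks more state, but the trick is just this):

```typescript
// Sketch of the interrupt path: kill everything scheduled through the old
// GainNode, swap in a fresh one, and never touch the AudioContext itself.
let outputGain: GainNode;

function initPlayback(ctx: AudioContext) {
  outputGain = ctx.createGain();
  outputGain.connect(ctx.destination);
  // Every agent-audio AudioBufferSourceNode connects to outputGain,
  // never directly to ctx.destination.
}

function handleBargeIn(ctx: AudioContext) {
  outputGain.disconnect();         // silences all scheduled sources at once
  outputGain = ctx.createGain();   // fresh node for the agent's next turn
  outputGain.connect(ctx.destination);
  // ctx stays alive, so there's no 50-100 ms recreation penalty.
}
```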

Why Firestore With Graceful Degradation?

Firestore stores sessions and the public gallery, but if Firestore is unavailable (no GCP project configured, or service down), everything still works. Sessions persist in memory. Gallery features degrade to disabled. This means the app runs locally with zero cloud dependencies beyond the Gemini API key.


Challenges I Solved (And What I Learned)

Challenge 1: Voice Barge-In Was Unusable Out of the Box

The default VAD settings for the Gemini Live API were too aggressive for a creative storytelling app. Background music, natural pauses, and ambient noise all triggered false barge-ins — cutting Loom mid-sentence.

Five problems, five solutions:

1. Over-sensitive VAD — Configured AutomaticActivityDetection with start_of_speech_sensitivity=LOW, end_of_speech_sensitivity=LOW, silence_duration_ms=1200 (default ~500ms was cutting people off mid-thought), and proactive_audio=True to ignore incidental sounds.

2. No visibility into what was heard — Enabled both input_audio_transcription and output_audio_transcription. The Debug Panel now shows a scrollable transcript log with timestamped entries for user and agent speech. Debugging "why did it interrupt?" went from guesswork to trivial.

3. No recovery from false interrupts — This was the most nuanced fix. When the Live API fires an interrupted event, I don't immediately accept it. I start a 400ms verification window:

  • If user speech arrives → real interruption, proceed normally
  • If no speech arrives → false interrupt (noise/music), inject a system message telling Loom to continue from where it left off (with the last ~150 characters as a resume hint)

A cough or chair scrape briefly pauses Loom, but it picks right back up naturally.
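A simplified sketch of that verification window (helper names are assumptions, not DreamLoom's actual functions):

```typescript
// Sketch of the false-interrupt filter. Real speech within 400 ms confirms
// the barge-in; silence means it was noise, so Loom is asked to resume.
declare function pauseAgentAudio(): void;               // assumed helpers,
declare function resumeAgentAudio(): void;              // not actual DreamLoom
declare function sendSystemMessage(text: string): void; // function names
declare let lastAgentText: string;                      // running agent transcript

let pendingInterrupt: number | null = null;

function onInterruptedEvent() {
  pauseAgentAudio();
  pendingInterrupt = window.setTimeout(() => {
    // No speech arrived within the window: treat it as noise and resume.
    sendSystemMessage(`False interrupt - continue from: "${lastAgentText.slice(-150)}"`);
    resumeAgentAudio();
    pendingInterrupt = null;
  }, 400);
}

function onUserSpeechDetected() {
  if (pendingInterrupt !== null) {
    clearTimeout(pendingInterrupt); // real interruption: let the barge-in stand
    pendingInterrupt = null;
  }
}
```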

4. No manual fallback — Added optional push-to-talk mode alongside auto-detection. PTT sends activity_start/activity_end signals to the Live API as additional hints, not a replacement.

5. Music caused self-interruption — The agent's own audio was feeding back into the mic. Implemented three-tier music ducking:

| State | Music Volume |
| --- | --- |
| Mic active | 0% (muted) |
| Agent speaking | 10% |
| Idle | 25% |

Volume transitions use a smooth 200ms ramp instead of jarring instant cuts.
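In code, the ramp is a single Web Audio call (a sketch; the function name and wiring are illustrative):

```typescript
// Sketch of three-tier ducking: ramp the music GainNode to the target
// level from the table above over 200 ms.
function duckMusic(ctx: AudioContext, musicGain: GainNode, state: "mic" | "agent" | "idle") {
  const target = { mic: 0.0, agent: 0.1, idle: 0.25 }[state];
  musicGain.gain.cancelScheduledValues(ctx.currentTime);
  musicGain.gain.setValueAtTime(musicGain.gain.value, ctx.currentTime);
  musicGain.gain.linearRampToValueAtTime(target, ctx.currentTime + 0.2); // 200 ms ramp
}
```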

Key takeaway: Voice UX in a creative app isn't "pipe audio to the API." The gap between a working demo and a usable product lives in VAD sensitivity, false-positive recovery, graceful degradation to manual control, and ensuring your own output doesn't fight your input.

Challenge 2: Live API Drops Connections Mid-Story

The Gemini Live API occasionally disconnects with codes 1008/1011. In a storytelling app, losing the conversation three scenes in is catastrophic.

Solution: Reconnect with full context restoration.

  1. Buffer user and agent audio in rolling 15-second ring buffers (~480KB each)
  2. On disconnect: save session to Firestore, create fresh ADK session
  3. Transcribe both audio buffers using Gemini Flash
  4. Re-inject the full story state: conversation history, Story Bible, scene summaries, art style
  5. Suppress the agent's greeting on reconnect (mute the first turn)
  6. Resume — Loom picks up exactly where it left off

Up to 5 retries with exponential backoff (1s, 2s, 4s, 8s). The user sees a brief "reconnecting" banner. The story doesn't break.
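The retry schedule itself is simple; here's a generic sketch (the real reconnect logic, including the Firestore save and transcript re-injection, lives in the Python backend):

```typescript
// Generic sketch of the exponential-backoff schedule: 1s, 2s, 4s, 8s, ...
declare function showBanner(msg: string): void; // assumed UI helper

async function withReconnect<T>(connect: () => Promise<T>, maxRetries = 5): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await connect();
    } catch (err) {
      if (attempt >= maxRetries) throw err;
      showBanner("Reconnecting...");
      await new Promise((resolve) => setTimeout(resolve, 1000 * 2 ** attempt));
    }
  }
}
```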

Challenge 3: Scene Generation Blocks the Live API

Interleaved scene generation takes 10-30 seconds. If the create_scene tool blocks the event loop, the Live API connection times out (error 1011) because it expects regular keepalive signals.

Solution: All generation runs as background asyncio.create_task(). The tool returns immediately to the agent ("Scene is being generated..."), and the agent fills the silence with conversation: "This one's going to be gorgeous — while we wait, do you have any ideas for what happens next?"

Results push to an asyncio.Queue(maxsize=100) that a separate drain task broadcasts over WebSocket. The Live API connection never stalls.


What Makes This Different From a Chatbot

| Chatbot | DreamLoom |
| --- | --- |
| Returns text | Returns interleaved text+image+music in real time |
| Turn-based: prompt → response → prompt | Continuous: speak, interrupt, redirect, watch |
| No memory | Story Bible tracks characters, world, continuity |
| No output packaging | Cinematic trailer, Storybook PDF, image ZIP |
| Generic AI personality | Loom: a creative director with taste, opinions, pacing |
| Type to interact | Speak to interact. The target user has never typed a prompt. |

The Debug Panel is built into the product — not as a dev tool, but as proof. Judges (or users) can open it at any time to see the model name, response_modalities, part order, and generation time for every scene. No other submission I've seen builds native interleaved output proof directly into the UI.


The Stack

Backend: FastAPI + WebSocket, Python 3.12, deployed on Cloud Run (us-central1)
Frontend: React 19 + Vite + TypeScript + TailwindCSS v4 + Framer Motion
Agent: Google ADK (google-adk) with run_live() + LiveRequestQueue
Models: Gemini Live API, Gemini Interleaved, Lyria RealTime, Gemini TTS, Gemini Flash
Storage: GCS (media), Firestore (sessions + gallery)
Exports: jsPDF (Storybook PDF), JSZip (image ZIP), Canvas+MediaRecorder (WebM animatic)
Deployment: Cloud Run + Cloud Build + Artifact Registry, automated via infra/deploy.sh


What's Next

  1. Multi-voice narration — Different Gemini TTS voices per character in the trailer
  2. Scene branching UI — Visual tree view for "What If?" alternate story paths
  3. Collaborative voice — Let multiple users speak into the same story session
  4. ePub export — Alongside PDF, export for e-readers with chapter structure
  5. Prompt replay — Record the full voice conversation alongside the generated story


DreamLoom is built for the Gemini Live Agent Challenge hackathon, Creative Storyteller category.
