Shrey Patel

How I Built SAGA: A Living Multimodal Story Engine with 5 Google AI Models

Landing Page

Subtitle: Built for the Gemini Live Agent Challenge 2026 #GeminiLiveAgentChallenge

1. The Problem: AI stories are boring

Most AI storytelling experiences still feel transactional. You type into a box, get a paragraph back, maybe copy it into a document, and the magic ends there. These systems do not see, hear, speak, or remember. They do not feel like living worlds.

That gap became the starting point for SAGA. I wanted something that felt less like prompting an API and more like stepping into a creative chamber where prose, visuals, narration, and world memory move together.

2. The Vision: What if a story could See, Hear, Speak, and Remember?

SAGA is built around a simple belief: stories should not be output, they should be environments.

So the product became a story universe engine:

  • See through inline illustrations and cinematic clips
  • Hear through narration and ambient score
  • Speak through Gemini Live as a co-author
  • Remember through persistent world state in Firestore and vector memory in Qdrant

That one framing decision drove the entire architecture.

3. Architecture: The 5-model stack

SAGA uses a layered stack of five Google AI models:

  • Gemini 2.0 Flash as the primary story engine, surfaced through the Gemini Live API for real-time voice co-authoring
  • Imagen 4 for scene illustrations
  • Veo 2 for short cinematic beats
  • Gemini TTS for narration
  • Lyria 2 for ambient score generation

The backend runs on FastAPI and Cloud Run. Firestore stores story sessions and return state. Cloud Storage stores media artifacts. Terraform provisions the infrastructure. Secret Manager handles secrets. Qdrant stores vector memory for continuity.

The key design choice was interleaving. Text, image, narration, and music do not appear in separate tabs. They arrive in one manuscript stream so the user experiences a single living artifact.
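To make the interleaving concrete, here is a minimal sketch of the kind of event envelope that lets every modality travel down one WebSocket stream. The schema, field names, and payloads are illustrative assumptions, not SAGA's actual wire format:

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical envelope: every modality shares one schema, so the client
# renders a single interleaved manuscript stream instead of separate tabs.
@dataclass
class ManuscriptEvent:
    kind: str      # "text" | "image" | "narration" | "music" | "clip"
    section: int   # which story section this event belongs to
    payload: dict  # modality-specific data (prose delta, media URL, ...)

def encode_event(event: ManuscriptEvent) -> str:
    """Serialize one event for the WebSocket wire."""
    return json.dumps(asdict(event))

events = [
    ManuscriptEvent("text", 1, {"delta": "The gates opened at dawn..."}),
    ManuscriptEvent("image", 1, {"url": "gs://saga-media/scene-1.png"}),
]
wire = [encode_event(e) for e in events]
```

Because slow media events carry the same `section` index as the prose they belong to, the client can splice them into the right place in the manuscript whenever they arrive.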

4. The Hard Parts

There were a few technical pieces that mattered more than expected:

PCM-to-WAV wrapping for live audio

Gemini Live returns raw audio chunks, so browser-safe playback required clean PCM handling and scheduling. Once chunk playback was scheduled in a persistent audio context instead of one context per chunk, the speaking voice stopped sounding broken.
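The server-side half of that fix is wrapping each raw PCM chunk in a WAV header before it reaches the browser. A minimal sketch using Python's standard-library `wave` module, assuming 16-bit mono PCM (match the sample rate to whatever the live API actually streams):

```python
import io
import wave

def pcm_to_wav(pcm_bytes: bytes, sample_rate: int = 24000,
               channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw little-endian 16-bit PCM in a WAV container so the
    browser's audio APIs can decode it directly. The 24 kHz default
    is an assumption; use the rate your live session negotiates."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_bytes)
    return buf.getvalue()

# A 100 ms chunk of silence as a smoke test (2400 frames at 24 kHz).
chunk = b"\x00\x00" * 2400
wav_bytes = pcm_to_wav(chunk)
```

On the client side, the matching fix was scheduling every decoded chunk against one long-lived audio context rather than creating a fresh context per chunk.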

Lyria REST workaround

The current Lyria path uses Vertex REST because the SDK path had a proto/runtime mismatch for this use case. That made the music layer slightly different from the other model integrations, but it kept the product stable and demoable.
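For readers curious what "Vertex REST instead of the SDK" looks like, here is a hedged sketch that only builds the request. The model name, payload fields, and project/region values are assumptions; verify them against the current Vertex AI documentation before relying on this:

```python
# Builds (but does not send) a Vertex AI REST predict request for a
# music model. In production you would POST this with an OAuth bearer
# token from Application Default Credentials.

def lyria_predict_request(project: str, region: str, prompt: str,
                          model: str = "lyria-002") -> tuple[str, dict]:
    """Return the (url, json_body) pair for a Vertex :predict call.
    The model id and instance schema here are illustrative guesses."""
    url = (f"https://{region}-aiplatform.googleapis.com/v1/"
           f"projects/{project}/locations/{region}/"
           f"publishers/google/models/{model}:predict")
    body = {"instances": [{"prompt": prompt}], "parameters": {}}
    return url, body

url, body = lyria_predict_request("my-project", "us-central1",
                                  "slow ambient strings, distant thunder")
```

Isolating the REST path behind one small function like this also makes it easy to swap back to the SDK once the proto mismatch is resolved.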

Background world extraction

The story could not wait for map extraction, narration, or video to finish. The manuscript needed to keep moving. So world extraction, narration, music, and cinematic clip generation were pushed into non-blocking background tasks, then streamed back into the same WebSocket session.
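The non-blocking pattern can be sketched with plain asyncio: slow jobs run as background tasks and push results onto the session's outbound queue while the manuscript keeps streaming. Function names here are illustrative stand-ins, not SAGA's actual internals:

```python
import asyncio

async def extract_world(section_text: str) -> dict:
    """Stand-in for a slow model call (world extraction, narration, ...)."""
    await asyncio.sleep(0.01)
    return {"locations": ["Hastinapur"], "source": section_text[:20]}

async def session(section_text: str) -> list:
    outbound: asyncio.Queue = asyncio.Queue()

    async def run_and_push(coro, kind: str):
        outbound.put_nowait({"kind": kind, "data": await coro})

    # Fire and forget: the story loop is never blocked on extraction.
    task = asyncio.create_task(run_and_push(extract_world(section_text), "world"))

    # The prose streams immediately, before the slow job finishes.
    outbound.put_nowait({"kind": "text", "data": section_text})

    await task  # the real server has a sender loop draining the queue instead
    return [outbound.get_nowait() for _ in range(outbound.qsize())]

events = asyncio.run(session("Three days have passed in Hastinapur..."))
```

The essential property is that both the fast path and the slow path write to the same queue, so the WebSocket sender stays a single, simple loop.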

5. The ADK Layer: Why SAGA is an agent

I wanted SAGA to be legible as an agent, not just a collection of API calls.

So I added an explicit Google ADK surface with tool definitions for:

  • generating the next story section
  • applying director commands
  • extracting world locations

That matters for the architecture story. Gemini Live does not just transcribe voice. It listens, understands intent, then says GENERATING: ... when it is ready to trigger the next action. That is an agent moment.

6. The Demo Moment: "Three days have passed in Hastinapur..."

The most emotionally important feature is the return experience.

If you close the browser and come back later, SAGA restores the story world and writes you a welcome-back message that references your characters and locations. That single interaction reframes the product. The system no longer feels stateless. It feels like the world kept breathing while you were away.
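The mechanics of that welcome-back message are simple once world state persists. A minimal sketch, with a plain dict standing in for the Firestore read and every field name an assumption:

```python
# In-memory stand-in for the Firestore document that stores return state.
WORLD_STATE = {
    "story-42": {
        "days_elapsed": 3,
        "world": "Hastinapur",
        "characters": ["Arjun", "Draupadi"],
    }
}

def welcome_back(story_id: str) -> str:
    """Compose a return greeting that references the saved world state."""
    state = WORLD_STATE[story_id]  # in production: a Firestore document read
    who = " and ".join(state["characters"])
    return (f"{state['days_elapsed']} days have passed in {state['world']}. "
            f"{who} await your return.")

message = welcome_back("story-42")
```

Because the greeting is generated from the same persisted state the story engine reads, it stays consistent with whatever actually happened in the session.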

That is the moment most people immediately understand the product.

7. What's Next

If I keep building SAGA, the next steps are clear:

  • multi-user shared worlds
  • mobile companion app
  • collaborative writer rooms
  • publishable world libraries
  • a marketplace for stories, universes, and generated artifacts

8. Try It Yourself

I created this content for the purposes of entering the Gemini Live Agent Challenge 2026. #GeminiLiveAgentChallenge
