<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shrey Patel</title>
    <description>The latest articles on DEV Community by Shrey Patel (@shreyp087).</description>
    <link>https://dev.to/shreyp087</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3824605%2Fcc3f1080-480b-4422-ae9b-81a609422d0a.png</url>
      <title>DEV Community: Shrey Patel</title>
      <link>https://dev.to/shreyp087</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shreyp087"/>
    <language>en</language>
    <item>
      <title>How I Built SAGA: A Living Multimodal Story Engine with 5 Google AI Models</title>
      <dc:creator>Shrey Patel</dc:creator>
      <pubDate>Mon, 16 Mar 2026 03:03:40 +0000</pubDate>
      <link>https://dev.to/shreyp087/how-i-built-saga-a-living-multimodal-story-engine-with-5-google-ai-models-102b</link>
      <guid>https://dev.to/shreyp087/how-i-built-saga-a-living-multimodal-story-engine-with-5-google-ai-models-102b</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxptlwsdzed928ki4uelt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxptlwsdzed928ki4uelt.png" alt="Landing Page" width="800" height="485"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Subtitle:&lt;/strong&gt; Built for the Gemini Live Agent Challenge 2026 #GeminiLiveAgentChallenge&lt;/p&gt;

&lt;h2&gt;1. The Problem: AI stories are boring&lt;/h2&gt;

&lt;p&gt;Most AI storytelling experiences still feel transactional. You type into a box, get a paragraph back, maybe copy it into a document, and the magic ends there. These systems do not see, hear, speak, or remember. They do not feel like a living world.&lt;/p&gt;

&lt;p&gt;That gap became the starting point for SAGA. I wanted something that felt less like prompting an API and more like stepping into a creative chamber where prose, visuals, narration, and world memory move together.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvbjseespzcwd8a7vcmgd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvbjseespzcwd8a7vcmgd.png" alt=" " width="800" height="516"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;2. The Vision: What if a story could See, Hear, Speak, and Remember?&lt;/h2&gt;

&lt;p&gt;SAGA is built around a simple belief: stories should not be output; they should be environments.&lt;/p&gt;

&lt;p&gt;So the product became a story universe engine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;See&lt;/strong&gt; through inline illustrations and cinematic clips&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hear&lt;/strong&gt; through narration and ambient score&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speak&lt;/strong&gt; through Gemini Live as a co-author&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remember&lt;/strong&gt; through persistent world state in Firestore and vector memory in Qdrant&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That one framing decision drove the entire architecture.&lt;/p&gt;

&lt;h2&gt;3. Architecture: The 5-model stack&lt;/h2&gt;

&lt;p&gt;SAGA uses a layered Google AI stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gemini 2.0 Flash&lt;/strong&gt; as the primary story engine&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini Live API&lt;/strong&gt; for real-time voice co-authoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Imagen 4&lt;/strong&gt; for scene illustrations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Veo 2&lt;/strong&gt; for short cinematic beats&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini TTS&lt;/strong&gt; for narration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lyria 2&lt;/strong&gt; for ambient score generation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The backend runs on FastAPI and Cloud Run. Firestore stores story sessions and return state. Cloud Storage stores media artifacts. Terraform provisions the infrastructure. Secret Manager handles secrets. Qdrant stores vector memory for continuity.&lt;/p&gt;

&lt;p&gt;The key design choice was interleaving. Text, image, narration, and music do not appear in separate tabs. They arrive in one manuscript stream so the user experiences a single living artifact.&lt;/p&gt;
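
&lt;p&gt;One way to picture that interleaving: every modality shares one envelope type, keyed by its position in the manuscript, so late-arriving media slots back into place instead of landing in a separate tab. The field names below are illustrative, not SAGA's actual wire format:&lt;/p&gt;

```python
from dataclasses import dataclass, asdict

@dataclass
class ManuscriptEvent:
    """One entry in the single interleaved stream the client renders.

    kind: e.g. 'prose', 'image', 'narration', 'music', or 'video'
    seq:  position in the manuscript, so a narration clip generated
          later still attaches to the paragraph it belongs to.
    """
    kind: str
    seq: int
    payload: str

def to_wire(event):
    """Serialize an event for the WebSocket stream."""
    return asdict(event)
```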

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2l0kg5cl3mrvixumj2y6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2l0kg5cl3mrvixumj2y6.png" alt=" " width="800" height="488"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6rxfd79iijjp6whqzoj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6rxfd79iijjp6whqzoj.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;4. The Hard Parts&lt;/h2&gt;

&lt;p&gt;There were a few technical pieces that mattered more than expected:&lt;/p&gt;

&lt;h3&gt;PCM-to-WAV wrapping for live audio&lt;/h3&gt;

&lt;p&gt;Gemini Live returns raw audio chunks, so browser-safe playback required clean PCM handling and scheduling. Once chunk playback was scheduled in a persistent audio context instead of one context per chunk, the speaking voice stopped sounding broken.&lt;/p&gt;
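
&lt;p&gt;The wrapping itself is small. A server-side sketch, assuming 16-bit mono PCM from the Live API (the 24 kHz rate and the function name are illustrative defaults, not SAGA's exact code):&lt;/p&gt;

```python
import io
import wave

def pcm_to_wav(pcm_bytes, sample_rate=24000, channels=1, sample_width=2):
    """Wrap raw 16-bit PCM in a WAV container so browsers can decode it."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)  # 2 bytes per sample, i.e. 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_bytes)
    return buf.getvalue()
```

&lt;p&gt;On the client, each wrapped chunk is then scheduled against one persistent audio clock, which is the fix described above.&lt;/p&gt;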

&lt;h3&gt;Lyria REST workaround&lt;/h3&gt;

&lt;p&gt;The current Lyria path uses Vertex REST because the SDK path had a proto/runtime mismatch for this use case. That made the music layer slightly different from the other model integrations, but it kept the product stable and demoable.&lt;/p&gt;
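
&lt;p&gt;For reference, the REST path looks roughly like this. The model ID and payload keys here are assumptions based on the standard Vertex AI &lt;code&gt;:predict&lt;/code&gt; surface, not a copy of SAGA's code; check the Vertex docs for the exact Lyria schema:&lt;/p&gt;

```python
def lyria_predict_request(project, region="us-central1",
                          model="lyria-002", prompt=""):
    """Build the Vertex ':predict' URL and body for a music-generation
    call made directly over REST instead of through the SDK."""
    url = (
        f"https://{region}-aiplatform.googleapis.com/v1/projects/{project}"
        f"/locations/{region}/publishers/google/models/{model}:predict"
    )
    body = {"instances": [{"prompt": prompt}], "parameters": {}}
    return url, body
```

&lt;p&gt;The POST carries a standard OAuth bearer token, and the returned audio is stored like any other media artifact.&lt;/p&gt;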

&lt;h3&gt;Background world extraction&lt;/h3&gt;

&lt;p&gt;The story could not wait for map extraction, narration, or video to finish. The manuscript needed to keep moving. So world extraction, narration, music, and cinematic clip generation were pushed into non-blocking background tasks, then streamed back into the same WebSocket session.&lt;/p&gt;
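
&lt;p&gt;The shape of that pipeline is ordinary asyncio. A minimal sketch, with illustrative names rather than SAGA's actual handlers:&lt;/p&gt;

```python
import asyncio

async def stream_story(send, prose_chunks, background_jobs):
    """Stream prose immediately; run slower jobs concurrently and push
    their results back through the same send() callable when done."""

    async def run(name, job):
        await send({"type": name, "payload": await job()})

    # Fire off narration, music, world extraction, video, etc. without
    # blocking the manuscript stream.
    tasks = [asyncio.create_task(run(name, job)) for name, job in background_jobs]

    async for chunk in prose_chunks:
        await send({"type": "prose", "payload": chunk})

    await asyncio.gather(*tasks)
```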

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8fmav3gcpi99uy1s0brs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8fmav3gcpi99uy1s0brs.png" alt=" " width="800" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;5. The ADK Layer: Why SAGA is an agent&lt;/h2&gt;

&lt;p&gt;I wanted SAGA to be legible as an agent, not just a collection of API calls.&lt;/p&gt;

&lt;p&gt;So I added an explicit Google ADK surface with tool definitions for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;generating the next story section&lt;/li&gt;
&lt;li&gt;applying director commands&lt;/li&gt;
&lt;li&gt;extracting world locations&lt;/li&gt;
&lt;/ul&gt;
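
&lt;p&gt;In ADK terms, tools like these are ordinary Python functions whose signatures and docstrings tell the model when to call them. A stubbed sketch of the shape; the names and return values are illustrative, and the real implementations live in SAGA's backend:&lt;/p&gt;

```python
def generate_next_section(session_id, direction=""):
    """Generate the next section of the story, optionally steered by a
    short creative direction from the user."""
    # Stub: delegates to the Gemini story engine in the real service.
    return {"session_id": session_id, "status": "queued", "direction": direction}

def apply_director_command(session_id, command):
    """Apply a director-style command (tone, pacing, point of view) to
    the active story session."""
    return {"session_id": session_id, "applied": command}

def extract_world_locations(session_id):
    """Scan the manuscript so far and extract named locations for the
    persistent world map."""
    return {"session_id": session_id, "locations": []}
```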

&lt;p&gt;That matters for the architecture story. Gemini Live does not just transcribe voice. It listens, understands intent, then says &lt;code&gt;GENERATING: ...&lt;/code&gt; when it is ready to trigger the next action. That is an agent moment.&lt;/p&gt;
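
&lt;p&gt;Catching that handoff is a small check on the live transcript. The &lt;code&gt;GENERATING:&lt;/code&gt; prefix convention comes from the article above; the function name is illustrative:&lt;/p&gt;

```python
def detect_generation_intent(utterance):
    """Return the directive after 'GENERATING:' when the live model has
    decided to trigger the next story action, else None."""
    marker = "GENERATING:"
    text = utterance.strip()
    if text.startswith(marker):
        return text[len(marker):].strip()
    return None
```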

&lt;h2&gt;6. The Demo Moment: "Three days have passed in Hastinapur..."&lt;/h2&gt;

&lt;p&gt;The most emotionally important feature is the return experience.&lt;/p&gt;

&lt;p&gt;If you close the browser and come back later, SAGA restores the story world and writes you a welcome-back message that references your characters and locations. That single interaction reframes the product. The system no longer feels stateless. It feels like the world kept breathing while you were away.&lt;/p&gt;

&lt;p&gt;That is the moment most people immediately understand the product.&lt;/p&gt;
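
&lt;p&gt;Under the hood the mechanic is two steps: rehydrate the session document, then hand the stored world facts to Gemini as context for a short re-entry passage. A minimal sketch with illustrative field names, not SAGA's actual Firestore schema:&lt;/p&gt;

```python
def welcome_back_prompt(world):
    """Turn a restored session document into a prompt for a short
    'the world kept moving while you were away' passage."""
    characters = ", ".join(world.get("characters", []))
    locations = ", ".join(world.get("locations", []))
    days_away = world.get("days_since_last_visit", 0)
    return (
        f"{days_away} day(s) have passed in {locations or 'the world'}. "
        f"Write a brief welcome-back passage that references "
        f"{characters or 'the cast'} and what changed in their absence."
    )
```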

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2sntosb5nls1r2mrca0l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2sntosb5nls1r2mrca0l.png" alt=" " width="800" height="548"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F989ufunhd3u6hugrajqz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F989ufunhd3u6hugrajqz.png" alt=" " width="800" height="544"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwfhvuy6q5tk0z3krplu1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwfhvuy6q5tk0z3krplu1.png" alt=" " width="800" height="523"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;7. What's Next&lt;/h2&gt;

&lt;p&gt;If I keep building SAGA, the next steps are clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;multi-user shared worlds&lt;/li&gt;
&lt;li&gt;mobile companion app&lt;/li&gt;
&lt;li&gt;collaborative writer rooms&lt;/li&gt;
&lt;li&gt;publishable world libraries&lt;/li&gt;
&lt;li&gt;a marketplace for stories, universes, and generated artifacts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4gz3ztapxjgxcp47v79.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4gz3ztapxjgxcp47v79.png" alt=" " width="800" height="237"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;8. Try It Yourself&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Demo Video: &lt;a href="https://youtu.be/mdONC55NxEU" rel="noopener noreferrer"&gt;https://youtu.be/mdONC55NxEU&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I created this content for the purposes of entering the Gemini Live Agent Challenge 2026. #GeminiLiveAgentChallenge&lt;/p&gt;

</description>
      <category>geminiliveagentchallenge</category>
      <category>ai</category>
      <category>gemini</category>
      <category>vibecoding</category>
    </item>
  </channel>
</rss>
