<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: D SAI CHARAN</title>
    <description>The latest articles on DEV Community by D SAI CHARAN (@d_saicharan_030505).</description>
    <link>https://dev.to/d_saicharan_030505</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3789225%2Fd57604f0-45e5-4694-93d5-da90a35bd93f.png</url>
      <title>DEV Community: D SAI CHARAN</title>
      <link>https://dev.to/d_saicharan_030505</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/d_saicharan_030505"/>
    <language>en</language>
    <item>
      <title>How I Built Vitale — A Multimodal Medical Storytelling Agent with Gemini 3.1 Pro</title>
      <dc:creator>D SAI CHARAN</dc:creator>
      <pubDate>Sun, 15 Mar 2026 08:14:01 +0000</pubDate>
      <link>https://dev.to/d_saicharan_030505/how-i-built-vitale-a-multimodal-medical-storytelling-agent-with-gemini-31-pro-4ii2</link>
      <guid>https://dev.to/d_saicharan_030505/how-i-built-vitale-a-multimodal-medical-storytelling-agent-with-gemini-31-pro-4ii2</guid>
      <description>&lt;h1&gt;
  
  
  How I Built Vitale — A Multimodal Medical Storytelling Agent with Gemini 3.1 Pro
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;I created this post for the purposes of entering the &lt;strong&gt;#GeminiLiveAgentChallenge&lt;/strong&gt; hackathon. If you're reading this, I hope it gives you a clear picture of how Gemini's interleaved output can power something genuinely useful.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Medical reports are written for doctors, not patients.&lt;/p&gt;

&lt;p&gt;Most people receive a PDF full of numbers, abbreviations, and clinical language — and have no real idea what it means for their life. They either panic, ignore it, or wait weeks for a follow-up appointment just to get a plain-English explanation.&lt;/p&gt;

&lt;p&gt;I wanted to fix that. Not with another chatbot that answers questions. With a &lt;em&gt;storyteller&lt;/em&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Vitale Does
&lt;/h2&gt;

&lt;p&gt;Vitale is a multimodal medical storytelling agent. You upload a medical report PDF, choose your audience theme (child-friendly or adult), and the agent generates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📖 &lt;strong&gt;Text narration&lt;/strong&gt; — warm, human storytelling of your health findings&lt;/li&gt;
&lt;li&gt;🎨 &lt;strong&gt;AI illustrations&lt;/strong&gt; — Imagen 3 visuals generated inline with the story&lt;/li&gt;
&lt;li&gt;🔊 &lt;strong&gt;Voiceover audio&lt;/strong&gt; — Google Cloud TTS synchronized per chunk&lt;/li&gt;
&lt;li&gt;🎬 &lt;strong&gt;A story film&lt;/strong&gt; — Ken Burns-style video assembled from all the above&lt;/li&gt;
&lt;li&gt;❓ &lt;strong&gt;Doctor questions card&lt;/strong&gt; — intelligent follow-up questions to ask your physician&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of this streams to the browser in real time. No waiting for a final result. You watch your story being built, chunk by chunk.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Mechanic: Gemini Interleaved Output
&lt;/h2&gt;

&lt;p&gt;This is the heart of Vitale, so let me spend time on it.&lt;/p&gt;

&lt;p&gt;Gemini 3.1 Pro generates text and image prompts &lt;strong&gt;interleaved in a single streaming response&lt;/strong&gt;. I designed a custom tag protocol that looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;[NARRATE]Your heart health story begins with good news...[/NARRATE]
[ILLUSTRATE]Watercolor illustration of a friendly cartoon heart with a steady pulse line...[/ILLUSTRATE]
[NARRATE]Your blood pressure of 118/76 tells a steady, calm story...[/NARRATE]
[ILLUSTRATE]Minimal flat design showing a blood pressure gauge in the green zone...[/ILLUSTRATE]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each chunk in this stream triggers a different downstream action:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tag&lt;/th&gt;
&lt;th&gt;What happens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;[NARRATE]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Text renders on screen with a typewriter effect and is sent to Cloud TTS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;[ILLUSTRATE]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Prompt is sent to Imagen 3 and the returned image renders inline&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The interleaved stream &lt;strong&gt;is&lt;/strong&gt; the storytelling engine. Not a post-processing step. The entire experience — narration, images, audio, film — flows from this one stream.&lt;/p&gt;
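
&lt;p&gt;To make the tag protocol concrete, here is a minimal Python sketch that splits a finished response into ordered chunks and looks up the downstream action for each tag. The names &lt;code&gt;parse_chunks&lt;/code&gt; and &lt;code&gt;ACTIONS&lt;/code&gt; are hypothetical and not taken from the Vitale codebase, and it parses the whole text at once rather than incrementally, purely for illustration.&lt;/p&gt;

```python
import re

# Matches [NARRATE]...[/NARRATE] or [ILLUSTRATE]...[/ILLUSTRATE]; the
# backreference \1 forces the closing tag to match the opening one.
TAG_PATTERN = re.compile(r"\[(NARRATE|ILLUSTRATE)\](.*?)\[/\1\]", re.DOTALL)

# Hypothetical labels for the downstream step each tag triggers.
ACTIONS = {
    "NARRATE": "render text and synthesize speech",
    "ILLUSTRATE": "generate image from prompt",
}


def parse_chunks(text):
    """Return (tag, body) tuples in the order they appear in the text."""
    return [(m.group(1), m.group(2).strip()) for m in TAG_PATTERN.finditer(text)]


if __name__ == "__main__":
    sample = (
        "[NARRATE]Your heart health story begins with good news...[/NARRATE]\n"
        "[ILLUSTRATE]Watercolor illustration of a friendly cartoon heart...[/ILLUSTRATE]"
    )
    for tag, body in parse_chunks(sample):
        print(tag, "=", ACTIONS[tag], "|", body)
```

&lt;p&gt;Because the chunks come back in document order, a consumer can dispatch them one by one and the narration and illustrations stay in the sequence Gemini wrote them.&lt;/p&gt;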




&lt;h2&gt;
  
  
  Two Creative Themes
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Theme&lt;/th&gt;
&lt;th&gt;Narrator Voice&lt;/th&gt;
&lt;th&gt;Illustration Style&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Child&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Warm storybook narrator&lt;/td&gt;
&lt;td&gt;Watercolor cartoon characters&lt;/td&gt;
&lt;td&gt;Kids, families&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Adult&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Calm, trusted friend&lt;/td&gt;
&lt;td&gt;Clean minimal flat design&lt;/td&gt;
&lt;td&gt;Adults, professionals&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The theme changes Gemini's system prompt, the Imagen 3 style descriptors, and the Cloud TTS voice — all at once.&lt;/p&gt;
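
&lt;p&gt;One way to keep those three settings in lockstep is a single lookup table keyed by theme. The Python sketch below is an assumption about how such a table might look. Every field value, including the voice identifiers, is a placeholder rather than Vitale's actual configuration.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThemeConfig:
    system_prompt: str  # persona given to Gemini
    image_style: str    # style descriptors appended to each image prompt
    tts_voice: str      # placeholder identifier, not a real Cloud TTS voice name


# All values are illustrative placeholders.
THEMES = {
    "child": ThemeConfig(
        system_prompt="You are a warm storybook narrator.",
        image_style="watercolor cartoon characters",
        tts_voice="child-theme-voice",
    ),
    "adult": ThemeConfig(
        system_prompt="You are a calm, trusted friend.",
        image_style="clean minimal flat design",
        tts_voice="adult-theme-voice",
    ),
}


def styled_image_prompt(theme, prompt):
    """Append the theme's style descriptors to an [ILLUSTRATE] prompt."""
    return f"{prompt}, {THEMES[theme].image_style}"


if __name__ == "__main__":
    print(styled_image_prompt("child", "a friendly heart"))
```

&lt;p&gt;Selecting a theme then means reading one entry, so the prompt, the illustration style, and the voice cannot drift apart.&lt;/p&gt;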




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PDF Upload (browser)
       ↓
FastAPI Backend (Google Cloud Run)
       ↓
Gemini 3.1 Pro → extract medical findings
       ↓
Gemini 3.1 Pro → interleaved stream
       ↓                    ↓
  [NARRATE] chunks     [ILLUSTRATE] prompts
       ↓                    ↓
  Cloud TTS            Imagen 3
       ↓                    ↓
  audio (base64)       image (base64)
       ↓                    ↓
         SSE stream to browser
                ↓
   Canvas + MediaRecorder → Story Film
                ↓
       Doctor Questions Card
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The backend is a &lt;strong&gt;FastAPI&lt;/strong&gt; app deployed on &lt;strong&gt;Google Cloud Run&lt;/strong&gt;. The frontend is vanilla HTML/CSS/JS — no framework. I wanted zero build complexity and fast iteration.&lt;/p&gt;




&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gemini 3.1 Pro&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Interleaved narrative generation — the core engine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Imagen 3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Inline AI illustration generation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Google Cloud TTS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per-chunk voiceover synthesis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Google Cloud Run&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Managed serverless deployment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Google Cloud Build&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Automated CI/CD via &lt;code&gt;cloudbuild.yaml&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Google Secret Manager&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Secure API key storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FastAPI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Backend orchestration + SSE streaming&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vanilla HTML/CSS/JS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Frontend — no framework overhead&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Deployment: Fully Automated with Cloud Build
&lt;/h2&gt;

&lt;p&gt;Deployment is a single command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud builds submit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;cloudbuild.yaml&lt;/code&gt; handles everything — Docker build, push to Container Registry, and Cloud Run deploy — including secrets injection via Secret Manager:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gcr.io/cloud-builders/docker&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;-t&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;gcr.io/$PROJECT_ID/vitale&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;.&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gcr.io/cloud-builders/docker&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;gcr.io/$PROJECT_ID/vitale&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gcr.io/cloud-builders/gcloud&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;run&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;deploy&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;vitale&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--image=gcr.io/$PROJECT_ID/vitale&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--platform=managed&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--region=us-central1&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--allow-unauthenticated&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--memory=2Gi&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--cpu=2&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--set-secrets=GEMINI_API_KEY=GEMINI_API_KEY:latest&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



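&lt;p&gt;The &lt;code&gt;GEMINI_API_KEY&lt;/code&gt; is never hardcoded. The &lt;code&gt;--set-secrets&lt;/code&gt; flag has Cloud Run read it from &lt;strong&gt;Google Secret Manager&lt;/strong&gt; and expose it to the container as an environment variable. Clean, secure, reproducible.&lt;/p&gt;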




&lt;h2&gt;
  
  
  The Hardest Part: Streaming Synchronization
&lt;/h2&gt;

&lt;p&gt;The trickiest engineering challenge was keeping narration text, images, and audio in sync as they streamed in.&lt;/p&gt;

&lt;p&gt;Each &lt;code&gt;[NARRATE]&lt;/code&gt; chunk needs its TTS audio generated and returned &lt;em&gt;before&lt;/em&gt; the next chunk starts playing, or the experience feels broken. I solved this by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Parsing the interleaved stream chunk-by-chunk on the backend&lt;/li&gt;
&lt;li&gt;Firing TTS requests immediately when a &lt;code&gt;[NARRATE]&lt;/code&gt; tag closes&lt;/li&gt;
&lt;li&gt;Sending both the text and the audio base64 together in a single SSE event&lt;/li&gt;
&lt;li&gt;Queuing events on the frontend and playing them sequentially, never out of order&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For images, Imagen 3 has higher latency than TTS, so I render a soft loading placeholder inline and swap in the image when it arrives. The story keeps flowing while the illustration catches up.&lt;/p&gt;
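
&lt;p&gt;The backend half of the numbered steps above might look roughly like this Python sketch. It buffers streamed text, emits a chunk as soon as its closing tag arrives, and packs the narration text and base64 audio into one SSE event. &lt;code&gt;fake_tts&lt;/code&gt; is a stand-in for the real Cloud TTS call, and the function names and event fields are assumptions, not Vitale's actual code.&lt;/p&gt;

```python
import base64
import json
import re

CLOSED_TAG = re.compile(r"\[(NARRATE|ILLUSTRATE)\](.*?)\[/\1\]", re.DOTALL)


def iter_closed_chunks(pieces):
    """Yield (tag, body) as soon as a complete tagged chunk is buffered.

    `pieces` is any iterable of text fragments, which may split a tag
    or its body at arbitrary positions.
    """
    buffer = ""
    for piece in pieces:
        buffer += piece
        while True:
            match = CLOSED_TAG.search(buffer)
            if match is None:
                break
            yield match.group(1), match.group(2).strip()
            buffer = buffer[match.end():]  # keep only the unparsed tail


def fake_tts(text):
    """Stand-in for a real TTS call; returns placeholder audio bytes."""
    return text.encode("utf-8")


def sse_event(event, payload):
    """Format one Server-Sent Event: an event line, a data line, a blank line."""
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


def narration_events(pieces):
    """Yield one SSE event per [NARRATE] chunk, text and audio together."""
    for tag, body in iter_closed_chunks(pieces):
        if tag == "NARRATE":
            audio_b64 = base64.b64encode(fake_tts(body)).decode("ascii")
            yield sse_event("narrate", {"text": body, "audio": audio_b64})


if __name__ == "__main__":
    stream = ["[NARR", "ATE]Good news.[/NARRATE][ILLUS", "TRATE]A heart[/ILLUSTRATE]"]
    for event in narration_events(stream):
        print(event, end="")
```

&lt;p&gt;In a FastAPI app, a generator like &lt;code&gt;narration_events&lt;/code&gt; could be handed to &lt;code&gt;StreamingResponse&lt;/code&gt; with &lt;code&gt;media_type="text/event-stream"&lt;/code&gt;. Shipping text and audio in the same event means the frontend never has to pair them up after the fact.&lt;/p&gt;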




&lt;h2&gt;
  
  
  Privacy First
&lt;/h2&gt;

&lt;p&gt;Vitale processes your medical report entirely in memory. No data is stored, logged, or persisted anywhere. The uploaded PDF is discarded immediately after the story is generated. This was a non-negotiable design decision for a health data product.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Gemini's interleaved output changes what's possible.&lt;/strong&gt; Most multimodal pipelines are sequential — generate text, then generate images, then stitch them together. Interleaved output collapses that into a single intelligent stream that reasons about both modalities simultaneously. The narrative and the visuals are designed together, not assembled separately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SSE streaming + FastAPI is underrated.&lt;/strong&gt; Server-Sent Events are simpler than WebSockets for one-directional streaming and work perfectly for this use case. FastAPI's &lt;code&gt;StreamingResponse&lt;/code&gt; made it trivial to implement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vanilla JS still ships fast.&lt;/strong&gt; No React, no Vite, no build step. Just a script tag and a canvas. For a hackathon with a tight timeline, this was the right call.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It / See the Code
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;🔗 &lt;strong&gt;Demo:&lt;/strong&gt; &lt;a href="https://vitale-981966808005.us-central1.run.app/" rel="noopener noreferrer"&gt;https://vitale-981966808005.us-central1.run.app/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;💻 &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/SAI-CHARAN-D/Vitale" rel="noopener noreferrer"&gt;https://github.com/SAI-CHARAN-D/Vitale&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🎥 &lt;strong&gt;Demo Video:&lt;/strong&gt; &lt;a href="https://youtu.be/saRVDDiCYXY" rel="noopener noreferrer"&gt;https://youtu.be/saRVDDiCYXY&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Disclaimer
&lt;/h2&gt;

&lt;p&gt;Vitale summarizes medical reports in story form only. It does not diagnose, interpret, or provide medical advice. Always consult your doctor.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with Gemini 3.1 Pro · Imagen 3 · Google Cloud Run&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Submitted to the Gemini Live Agent Challenge — Creative Storyteller Track&lt;/em&gt;&lt;br&gt;
&lt;em&gt;#GeminiLiveAgentChallenge&lt;/em&gt;&lt;/p&gt;

</description>
      <category>geminiliveagentchallenge</category>
      <category>ai</category>
      <category>googlecloud</category>
    </item>
  </channel>
</rss>
