D SAI CHARAN

How I Built Vitale — A Multimodal Medical Storytelling Agent with Gemini 3.1 Pro

*I created this post to enter the #GeminiLiveAgentChallenge hackathon. If you're reading this, I hope it gives you a clear picture of how Gemini's interleaved output can power something genuinely useful.*


The Problem

Medical reports are written for doctors, not patients.

Most people receive a PDF full of numbers, abbreviations, and clinical language — and have no real idea what it means for their life. They either panic, ignore it, or wait weeks for a follow-up appointment just to get a plain-English explanation.

I wanted to fix that. Not with another chatbot that answers questions. With a storyteller.


What Vitale Does

Vitale is a multimodal medical storytelling agent. You upload a medical report PDF, choose your audience theme (child-friendly or adult), and the agent generates:

  • 📖 Text narration — warm, human storytelling of your health findings
  • 🎨 AI illustrations — Imagen 3 visuals generated inline with the story
  • 🔊 Voiceover audio — Google Cloud TTS synchronized per chunk
  • 🎬 A story film — Ken Burns-style video assembled from all the above
  • ❓ Doctor questions card — intelligent follow-up questions to ask your physician

All of this streams to the browser in real time. No waiting for a final result. You watch your story being built, chunk by chunk.


The Core Mechanic: Gemini Interleaved Output

This is the heart of Vitale, so let me spend time on it.

Gemini 3.1 Pro generates text and image prompts interleaved in a single streaming response. I designed a custom tag protocol that looks like this:

[NARRATE]Your heart health story begins with good news...[/NARRATE]
[ILLUSTRATE]Watercolor illustration of a friendly cartoon heart with a steady pulse line...[/ILLUSTRATE]
[NARRATE]Your blood pressure of 118/76 tells a steady, calm story...[/NARRATE]
[ILLUSTRATE]Minimal flat design showing a blood pressure gauge in the green zone...[/ILLUSTRATE]

Each chunk in this stream triggers a different downstream action:

| Tag | What happens |
| --- | --- |
| [NARRATE] | Text renders on screen with typewriter effect + sent to Cloud TTS |
| [ILLUSTRATE] | Prompt sent to Imagen 3 → image rendered inline |

The interleaved stream is the storytelling engine. Not a post-processing step. The entire experience — narration, images, audio, film — flows from this one stream.
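To make the tag protocol concrete, here is a minimal sketch of a parser for it. The `[NARRATE]`/`[ILLUSTRATE]` tag names come from the protocol above; the function and regex are my own illustration, not Vitale's actual code, and a production version would run incrementally over the streaming buffer rather than over a complete string.

```python
import re

# Matches one tagged chunk; the backreference \1 ensures the closing tag
# matches the opening one. DOTALL lets narration span multiple lines.
TAG_RE = re.compile(r"\[(NARRATE|ILLUSTRATE)\](.*?)\[/\1\]", re.DOTALL)

def parse_interleaved(buffer: str):
    """Yield (tag, content) pairs in the order they appear in the stream."""
    for match in TAG_RE.finditer(buffer):
        yield match.group(1), match.group(2).strip()
```

On a complete response this simply yields chunks in document order, so each one can be dispatched to TTS or Imagen as it arrives.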


Two Creative Themes

| Theme | Narrator Voice | Illustration Style | Best For |
| --- | --- | --- | --- |
| Child | Warm storybook narrator | Watercolor cartoon characters | Kids, families |
| Adult | Calm, trusted friend | Clean minimal flat design | Adults, professionals |

The theme changes Gemini's system prompt, the Imagen 3 style descriptors, and the Cloud TTS voice — all at once.
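One way to switch all three knobs at once is a single theme table. The three fields match what the post says the theme controls (system prompt, Imagen style, TTS voice), but the exact prompt wording and voice names below are my assumptions for illustration:

```python
# Hypothetical theme configuration: prompt text and voice IDs are placeholders.
THEMES = {
    "child": {
        "system_prompt": "You are a warm storybook narrator explaining health findings to a child.",
        "imagen_style": "watercolor cartoon characters, soft colors",
        "tts_voice": "en-US-Neural2-F",  # assumed Cloud TTS voice choice
    },
    "adult": {
        "system_prompt": "You are a calm, trusted friend explaining health findings plainly.",
        "imagen_style": "clean minimal flat design",
        "tts_voice": "en-US-Neural2-D",  # assumed Cloud TTS voice choice
    },
}

def theme_config(theme: str) -> dict:
    """Look up the prompt, illustration style, and voice for one theme."""
    return THEMES[theme]
```

With this shape, changing the audience is one key lookup rather than three separate switches scattered through the pipeline.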


Architecture

PDF Upload (browser)
       ↓
FastAPI Backend (Google Cloud Run)
       ↓
Gemini 3.1 Pro → extract medical findings
       ↓
Gemini 3.1 Pro → interleaved stream
       ↓                    ↓
  [NARRATE] chunks     [ILLUSTRATE] prompts
       ↓                    ↓
  Cloud TTS            Imagen 3
       ↓                    ↓
  audio (base64)       image (base64)
       ↓                    ↓
         SSE stream to browser
                ↓
   Canvas + MediaRecorder → Story Film
                ↓
       Doctor Questions Card

The backend is a FastAPI app deployed on Google Cloud Run. The frontend is vanilla HTML/CSS/JS — no framework. I wanted zero build complexity and fast iteration.


Tech Stack

| Technology | Role |
| --- | --- |
| Gemini 3.1 Pro | Interleaved narrative generation — the core engine |
| Imagen 3 | Inline AI illustration generation |
| Google Cloud TTS | Per-chunk voiceover synthesis |
| Google Cloud Run | Managed serverless deployment |
| Google Cloud Build | Automated CI/CD via cloudbuild.yaml |
| Google Secret Manager | Secure API key storage |
| FastAPI | Backend orchestration + SSE streaming |
| Vanilla HTML/CSS/JS | Frontend — no framework overhead |

Deployment: Fully Automated with Cloud Build

Deployment is a single command:

gcloud builds submit

The cloudbuild.yaml handles everything — Docker build, push to Container Registry, and Cloud Run deploy — including secrets injection via Secret Manager:

steps:
  - name: gcr.io/cloud-builders/docker
    args: [build, -t, gcr.io/$PROJECT_ID/vitale, .]
  - name: gcr.io/cloud-builders/docker
    args: [push, gcr.io/$PROJECT_ID/vitale]
  - name: gcr.io/cloud-builders/gcloud
    args:
      - run
      - deploy
      - vitale
      - --image=gcr.io/$PROJECT_ID/vitale
      - --platform=managed
      - --region=us-central1
      - --allow-unauthenticated
      - --memory=2Gi
      - --cpu=2
      - --set-secrets=GEMINI_API_KEY=GEMINI_API_KEY:latest

The GEMINI_API_KEY is never hardcoded — it's pulled from Google Secret Manager at deploy time. Clean, secure, reproducible.


The Hardest Part: Streaming Synchronization

The trickiest engineering challenge was keeping narration text, images, and audio in sync as they streamed in.

Each [NARRATE] chunk needs its TTS audio generated and returned before the next chunk starts playing, or the experience feels broken. I solved this by:

  1. Parsing the interleaved stream chunk-by-chunk on the backend
  2. Firing TTS requests immediately when a [NARRATE] tag closes
  3. Sending both the text and the audio base64 together in a single SSE event
  4. Having the frontend queue events and play them strictly in order
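Steps 2 and 3 can be sketched like this. `synthesize_tts` is a placeholder for the real Google Cloud TTS call (`google.cloud.texttospeech` in production), and the event field names are my assumption:

```python
import base64
import json

def synthesize_tts(text: str) -> bytes:
    # Placeholder for the Cloud TTS request; returns raw audio bytes.
    return b"audio-bytes-for:" + text.encode("utf-8")

def narrate_event(text: str) -> str:
    # Fire TTS as soon as the [NARRATE] tag closes, then ship the text and
    # its audio together in ONE SSE event so they can never arrive apart.
    audio_b64 = base64.b64encode(synthesize_tts(text)).decode("ascii")
    payload = {"type": "narrate", "text": text, "audio_b64": audio_b64}
    return f"data: {json.dumps(payload)}\n\n"
```

Bundling text and audio into a single event is the key design choice: synchronization is guaranteed by construction, so the frontend only has to play events in arrival order.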

For images, Imagen 3 has higher latency than TTS, so I render a soft loading placeholder inline and swap it in when the image arrives. The story keeps flowing — the illustration catches up.


Privacy First

Vitale processes your medical report entirely in memory. No data is stored, logged, or persisted anywhere. The uploaded PDF is discarded immediately after the story is generated. This was a non-negotiable design decision for a health data product.


What I Learned

Gemini's interleaved output changes what's possible. Most multimodal pipelines are sequential — generate text, then generate images, then stitch them together. Interleaved output collapses that into a single intelligent stream that reasons about both modalities simultaneously. The narrative and the visuals are designed together, not assembled separately.

SSE streaming + FastAPI is underrated. Server-Sent Events are simpler than WebSockets for one-directional streaming and work perfectly for this use case. FastAPI's StreamingResponse made it trivial to implement.

Vanilla JS still ships fast. No React, no Vite, no build step. Just a script tag and a canvas. For a hackathon with a tight timeline, this was the right call.


Try It / See the Code


Disclaimer

Vitale summarizes medical reports in story form only. It does not diagnose, interpret, or provide medical advice. Always consult your doctor.


Built with Gemini 3.1 Pro · Imagen 3 · Google Cloud Run
Submitted to the Gemini Live Agent Challenge — Creative Storyteller Track
#GeminiLiveAgentChallenge
