D SAI CHARAN

How I Built Vitale — A Multimodal Medical Storytelling Agent with Gemini 3.1 Pro

*I created this post to enter the #GeminiLiveAgentChallenge hackathon. If you're reading this, I hope it gives you a clear picture of how Gemini's interleaved output can power something genuinely useful.*


The Problem

Medical reports are written for doctors, not patients.

Most people receive a PDF full of numbers, abbreviations, and clinical language — and have no real idea what it means for their life. They either panic, ignore it, or wait weeks for a follow-up appointment just to get a plain-English explanation.

I wanted to fix that. Not with another chatbot that answers questions. With a storyteller.


What Vitale Does

Vitale is a multimodal medical storytelling agent. You upload a medical report PDF, choose your audience theme (child-friendly or adult), and the agent generates:

  • 📖 Text narration — warm, human storytelling of your health findings
  • 🎨 AI illustrations — Imagen 3 visuals generated inline with the story
  • 🔊 Voiceover audio — Google Cloud TTS synchronized per chunk
  • 🎬 A story film — Ken Burns-style video assembled from all the above
  • ❓ Doctor questions card — intelligent follow-up questions to ask your physician

All of this streams to the browser in real time. No waiting for a final result. You watch your story being built, chunk by chunk.


The Core Mechanic: Gemini Interleaved Output

This is the heart of Vitale, so let me spend time on it.

Gemini 3.1 Pro generates text and image prompts interleaved in a single streaming response. I designed a custom tag protocol that looks like this:

[NARRATE]Your heart health story begins with good news...[/NARRATE]
[ILLUSTRATE]Watercolor illustration of a friendly cartoon heart with a steady pulse line...[/ILLUSTRATE]
[NARRATE]Your blood pressure of 118/76 tells a steady, calm story...[/NARRATE]
[ILLUSTRATE]Minimal flat design showing a blood pressure gauge in the green zone...[/ILLUSTRATE]

Each chunk in this stream triggers a different downstream action:

| Tag | What happens |
| --- | --- |
| [NARRATE] | Text renders on screen with typewriter effect + sent to Cloud TTS |
| [ILLUSTRATE] | Prompt sent to Imagen 3 → image rendered inline |

The interleaved stream is the storytelling engine. Not a post-processing step. The entire experience — narration, images, audio, film — flows from this one stream.
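To make the tag protocol concrete, here is a minimal sketch of a parser for it. The `[NARRATE]`/`[ILLUSTRATE]` tag names come from the protocol above; the function and regex are my own illustration, not Vitale's actual code, and a production version would run incrementally over the streaming buffer rather than over a complete string.

```python
import re

# Matches one tagged chunk; the backreference \1 ensures the closing tag
# matches the opening one. DOTALL lets narration span multiple lines.
TAG_RE = re.compile(r"\[(NARRATE|ILLUSTRATE)\](.*?)\[/\1\]", re.DOTALL)

def parse_interleaved(buffer: str):
    """Yield (tag, content) pairs in the order they appear in the stream."""
    for match in TAG_RE.finditer(buffer):
        yield match.group(1), match.group(2).strip()
```

On a complete response this simply yields chunks in document order, so each one can be dispatched to TTS or Imagen as it arrives.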


Two Creative Themes

| Theme | Narrator Voice | Illustration Style | Best For |
| --- | --- | --- | --- |
| Child | Warm storybook narrator | Watercolor cartoon characters | Kids, families |
| Adult | Calm, trusted friend | Clean minimal flat design | Adults, professionals |

The theme changes Gemini's system prompt, the Imagen 3 style descriptors, and the Cloud TTS voice — all at once.
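One way to switch all three knobs at once is a single theme table. The three fields match what the post says the theme controls (system prompt, Imagen style, TTS voice), but the exact prompt wording and voice names below are my assumptions for illustration:

```python
# Hypothetical theme configuration: prompt text and voice IDs are placeholders.
THEMES = {
    "child": {
        "system_prompt": "You are a warm storybook narrator explaining health findings to a child.",
        "imagen_style": "watercolor cartoon characters, soft colors",
        "tts_voice": "en-US-Neural2-F",  # assumed Cloud TTS voice choice
    },
    "adult": {
        "system_prompt": "You are a calm, trusted friend explaining health findings plainly.",
        "imagen_style": "clean minimal flat design",
        "tts_voice": "en-US-Neural2-D",  # assumed Cloud TTS voice choice
    },
}

def theme_config(theme: str) -> dict:
    """Look up the prompt, illustration style, and voice for one theme."""
    return THEMES[theme]
```

With this shape, changing the audience is one key lookup rather than three separate switches scattered through the pipeline.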


Architecture

PDF Upload (browser)
       ↓
FastAPI Backend (Google Cloud Run)
       ↓
Gemini 3.1 Pro → extract medical findings
       ↓
Gemini 3.1 Pro → interleaved stream
       ↓                    ↓
  [NARRATE] chunks     [ILLUSTRATE] prompts
       ↓                    ↓
  Cloud TTS            Imagen 3
       ↓                    ↓
  audio (base64)       image (base64)
       ↓                    ↓
         SSE stream to browser
                ↓
   Canvas + MediaRecorder → Story Film
                ↓
       Doctor Questions Card

The backend is a FastAPI app deployed on Google Cloud Run. The frontend is vanilla HTML/CSS/JS — no framework. I wanted zero build complexity and fast iteration.


Tech Stack

| Technology | Role |
| --- | --- |
| Gemini 3.1 Pro | Interleaved narrative generation — the core engine |
| Imagen 3 | Inline AI illustration generation |
| Google Cloud TTS | Per-chunk voiceover synthesis |
| Google Cloud Run | Managed serverless deployment |
| Google Cloud Build | Automated CI/CD via cloudbuild.yaml |
| Google Secret Manager | Secure API key storage |
| FastAPI | Backend orchestration + SSE streaming |
| Vanilla HTML/CSS/JS | Frontend — no framework overhead |

Deployment: Fully Automated with Cloud Build

Deployment is a single command:

gcloud builds submit

The cloudbuild.yaml handles everything — Docker build, push to Container Registry, and Cloud Run deploy — including secrets injection via Secret Manager:

steps:
  - name: gcr.io/cloud-builders/docker
    args: [build, -t, gcr.io/$PROJECT_ID/vitale, .]
  - name: gcr.io/cloud-builders/docker
    args: [push, gcr.io/$PROJECT_ID/vitale]
  - name: gcr.io/cloud-builders/gcloud
    args:
      - run
      - deploy
      - vitale
      - --image=gcr.io/$PROJECT_ID/vitale
      - --platform=managed
      - --region=us-central1
      - --allow-unauthenticated
      - --memory=2Gi
      - --cpu=2
      - --set-secrets=GEMINI_API_KEY=GEMINI_API_KEY:latest

The GEMINI_API_KEY is never hardcoded — it's pulled from Google Secret Manager at deploy time. Clean, secure, reproducible.


The Hardest Part: Streaming Synchronization

The trickiest engineering challenge was keeping narration text, images, and audio in sync as they streamed in.

Each [NARRATE] chunk needs its TTS audio generated and returned before the next chunk starts playing, or the experience feels broken. I solved this by:

  1. Parsing the interleaved stream chunk-by-chunk on the backend
  2. Firing TTS requests immediately when a [NARRATE] tag closes
  3. Sending both the text and the audio base64 together in a single SSE event
  4. Having the frontend queue events and play them strictly in order
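Steps 2 and 3 can be sketched like this. `synthesize_tts` is a placeholder for the real Google Cloud TTS call (`google.cloud.texttospeech` in production), and the event field names are my assumption:

```python
import base64
import json

def synthesize_tts(text: str) -> bytes:
    # Placeholder for the Cloud TTS request; returns raw audio bytes.
    return b"audio-bytes-for:" + text.encode("utf-8")

def narrate_event(text: str) -> str:
    # Fire TTS as soon as the [NARRATE] tag closes, then ship the text and
    # its audio together in ONE SSE event so they can never arrive apart.
    audio_b64 = base64.b64encode(synthesize_tts(text)).decode("ascii")
    payload = {"type": "narrate", "text": text, "audio_b64": audio_b64}
    return f"data: {json.dumps(payload)}\n\n"
```

Bundling text and audio into a single event is the key design choice: synchronization is guaranteed by construction, so the frontend only has to play events in arrival order.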

For images, Imagen 3 has higher latency than TTS, so I render a soft loading placeholder inline and swap it in when the image arrives. The story keeps flowing — the illustration catches up.


Privacy First

Vitale processes your medical report entirely in memory. No data is stored, logged, or persisted anywhere. The uploaded PDF is discarded immediately after the story is generated. This was a non-negotiable design decision for a health data product.


What I Learned

Gemini's interleaved output changes what's possible. Most multimodal pipelines are sequential — generate text, then generate images, then stitch them together. Interleaved output collapses that into a single intelligent stream that reasons about both modalities simultaneously. The narrative and the visuals are designed together, not assembled separately.

SSE streaming + FastAPI is underrated. Server-Sent Events are simpler than WebSockets for one-directional streaming and work perfectly for this use case. FastAPI's StreamingResponse made it trivial to implement.

Vanilla JS still ships fast. No React, no Vite, no build step. Just a script tag and a canvas. For a hackathon with a tight timeline, this was the right call.


Try It / See the Code


Disclaimer

Vitale summarizes medical reports in story form only. It does not diagnose, interpret, or provide medical advice. Always consult your doctor.


Built with Gemini 3.1 Pro · Imagen 3 · Google Cloud Run
Submitted to the Gemini Live Agent Challenge — Creative Storyteller Track
#GeminiLiveAgentChallenge
