How I Built Vitale — A Multimodal Medical Storytelling Agent with Gemini 3.1 Pro
*I created this post to enter the **#GeminiLiveAgentChallenge** hackathon. If you're reading this, I hope it gives you a clear picture of how Gemini's interleaved output can power something genuinely useful.*
The Problem
Medical reports are written for doctors, not patients.
Most people receive a PDF full of numbers, abbreviations, and clinical language — and have no real idea what it means for their life. They either panic, ignore it, or wait weeks for a follow-up appointment just to get a plain-English explanation.
I wanted to fix that. Not with another chatbot that answers questions. With a storyteller.
What Vitale Does
Vitale is a multimodal medical storytelling agent. You upload a medical report PDF, choose your audience theme (child-friendly or adult), and the agent generates:
- 📖 Text narration — warm, human storytelling of your health findings
- 🎨 AI illustrations — Imagen 3 visuals generated inline with the story
- 🔊 Voiceover audio — Google Cloud TTS synchronized per chunk
- 🎬 A story film — Ken Burns-style video assembled from all the above
- ❓ Doctor questions card — intelligent follow-up questions to ask your physician
All of this streams to the browser in real time. No waiting for a final result. You watch your story being built, chunk by chunk.
The Core Mechanic: Gemini Interleaved Output
This is the heart of Vitale, so let me spend time on it.
Gemini 3.1 Pro generates text and image prompts interleaved in a single streaming response. I designed a custom tag protocol that looks like this:
```
[NARRATE]Your heart health story begins with good news...[/NARRATE]
[ILLUSTRATE]Watercolor illustration of a friendly cartoon heart with a steady pulse line...[/ILLUSTRATE]
[NARRATE]Your blood pressure of 118/76 tells a steady, calm story...[/NARRATE]
[ILLUSTRATE]Minimal flat design showing a blood pressure gauge in the green zone...[/ILLUSTRATE]
```
Each chunk in this stream triggers a different downstream action:
| Tag | What happens |
|---|---|
| `[NARRATE]` | Text renders on screen with a typewriter effect and is sent to Cloud TTS |
| `[ILLUSTRATE]` | Prompt is sent to Imagen 3 → image rendered inline |
The interleaved stream is the storytelling engine. Not a post-processing step. The entire experience — narration, images, audio, film — flows from this one stream.
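To make the mechanic concrete, here is a minimal sketch of how such a tag stream can be parsed. The tag names match the protocol above, but the regex and function names are my own illustration, not Vitale's actual parser:

```python
import re

# Matches complete [NARRATE]...[/NARRATE] and [ILLUSTRATE]...[/ILLUSTRATE]
# chunks; the backreference \1 ensures open and close tags agree.
TAG_RE = re.compile(r"\[(NARRATE|ILLUSTRATE)\](.*?)\[/\1\]", re.DOTALL)

def parse_chunks(buffer: str):
    """Yield (tag, content) pairs for every complete tag in the buffer."""
    for match in TAG_RE.finditer(buffer):
        yield match.group(1), match.group(2).strip()

stream = (
    "[NARRATE]Your heart health story begins with good news...[/NARRATE]"
    "[ILLUSTRATE]Watercolor illustration of a friendly cartoon heart...[/ILLUSTRATE]"
)
for tag, content in parse_chunks(stream):
    print(tag, "→", content)
```

In a real streaming setting the buffer grows incrementally, so the parser would also need to track how far it has consumed and wait for unterminated tags to close.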
Two Creative Themes
| Theme | Narrator Voice | Illustration Style | Best For |
|---|---|---|---|
| Child | Warm storybook narrator | Watercolor cartoon characters | Kids, families |
| Adult | Calm, trusted friend | Clean minimal flat design | Adults, professionals |
The theme changes Gemini's system prompt, the Imagen 3 style descriptors, and the Cloud TTS voice — all at once.
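One way to wire that up is a single theme registry that the request handler consults once per story. The field values below are illustrative placeholders, not Vitale's actual prompts, style strings, or voice names:

```python
# Hypothetical theme registry — every value here is an assumption made
# for illustration, not the real Vitale configuration.
THEMES = {
    "child": {
        "system_prompt": "You are a warm storybook narrator for children...",
        "imagen_style": "watercolor cartoon characters, soft pastel palette",
        "tts_voice": "en-US-Neural2-F",
    },
    "adult": {
        "system_prompt": "You are a calm, trusted friend explaining health...",
        "imagen_style": "clean minimal flat design, muted colors",
        "tts_voice": "en-US-Neural2-D",
    },
}

def theme_settings(name: str) -> dict:
    """Return the prompt, style, and voice for a theme in one lookup."""
    return THEMES[name]
```

A single lookup keeps the three knobs (prompt, style, voice) in lockstep, so a theme can never be half-applied.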
Architecture
```
PDF Upload (browser)
        ↓
FastAPI Backend (Google Cloud Run)
        ↓
Gemini 3.1 Pro → extract medical findings
        ↓
Gemini 3.1 Pro → interleaved stream
      ↓                      ↓
[NARRATE] chunks     [ILLUSTRATE] prompts
      ↓                      ↓
  Cloud TTS              Imagen 3
      ↓                      ↓
audio (base64)        image (base64)
      ↓                      ↓
     SSE stream to browser
        ↓
Canvas + MediaRecorder → Story Film
        ↓
Doctor Questions Card
```
The backend is a FastAPI app deployed on Google Cloud Run. The frontend is vanilla HTML/CSS/JS — no framework. I wanted zero build complexity and fast iteration.
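The SSE leg of that diagram fits in a few lines. The sketch below shows only the wire framing; event names and payload shapes are my assumptions, not Vitale's actual format. In the real app the events come from the Gemini stream, and the async generator is handed to FastAPI's `StreamingResponse` with `media_type="text/event-stream"`:

```python
import asyncio
import json

# SSE framing sketch: each event is one "data: <json>\n\n" frame. In the
# FastAPI app this generator would be wrapped as
# StreamingResponse(story_events(), media_type="text/event-stream").
async def story_events():
    # Two hard-coded chunks stand in for the live Gemini interleaved stream.
    for event in [
        {"type": "narrate", "text": "Your heart health story begins..."},
        {"type": "illustrate", "prompt": "Watercolor cartoon heart..."},
    ]:
        yield f"data: {json.dumps(event)}\n\n"

async def main():
    async for frame in story_events():
        print(frame, end="")

asyncio.run(main())
```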
Tech Stack
| Technology | Role |
|---|---|
| Gemini 3.1 Pro | Interleaved narrative generation — the core engine |
| Imagen 3 | Inline AI illustration generation |
| Google Cloud TTS | Per-chunk voiceover synthesis |
| Google Cloud Run | Managed serverless deployment |
| Google Cloud Build | Automated CI/CD via `cloudbuild.yaml` |
| Google Secret Manager | Secure API key storage |
| FastAPI | Backend orchestration + SSE streaming |
| Vanilla HTML/CSS/JS | Frontend — no framework overhead |
Deployment: Fully Automated with Cloud Build
Deployment is a single command:
```
gcloud builds submit
```
The cloudbuild.yaml handles everything — Docker build, push to Container Registry, and Cloud Run deploy — including secrets injection via Secret Manager:
```yaml
steps:
  - name: gcr.io/cloud-builders/docker
    args: [build, -t, gcr.io/$PROJECT_ID/vitale, .]
  - name: gcr.io/cloud-builders/docker
    args: [push, gcr.io/$PROJECT_ID/vitale]
  - name: gcr.io/cloud-builders/gcloud
    args:
      - run
      - deploy
      - vitale
      - --image=gcr.io/$PROJECT_ID/vitale
      - --platform=managed
      - --region=us-central1
      - --allow-unauthenticated
      - --memory=2Gi
      - --cpu=2
      - --set-secrets=GEMINI_API_KEY=GEMINI_API_KEY:latest
```
The GEMINI_API_KEY is never hardcoded — it's pulled from Google Secret Manager at deploy time. Clean, secure, reproducible.
The Hardest Part: Streaming Synchronization
The trickiest engineering challenge was keeping narration text, images, and audio in sync as they streamed in.
Each [NARRATE] chunk needs its TTS audio generated and returned before the next chunk starts playing, or the experience feels broken. I solved this by:
- Parsing the interleaved stream chunk-by-chunk on the backend
- Firing TTS requests immediately when a `[NARRATE]` tag closes
- Sending both the text and the base64 audio together in a single SSE event
- Queuing events on the frontend and playing them sequentially, so nothing ever plays out of order
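The key trick is bundling text and audio in one event so they can never arrive out of order. A hedged sketch, where `synthesize_tts` is a hypothetical stand-in for the Cloud TTS call:

```python
import base64
import json

def synthesize_tts(text: str) -> bytes:
    # Hypothetical stand-in: the real Cloud TTS call returns encoded audio.
    return text.encode()

def narrate_event(text: str) -> str:
    """Bundle narration text and its audio into ONE SSE event, so the
    frontend can never receive them out of order."""
    audio_b64 = base64.b64encode(synthesize_tts(text)).decode("ascii")
    payload = {"type": "narrate", "text": text, "audio": audio_b64}
    return f"data: {json.dumps(payload)}\n\n"

event = narrate_event("Your blood pressure tells a steady, calm story.")
print(event, end="")
```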
For images, Imagen 3 has higher latency than TTS, so I render a soft loading placeholder inline and swap it in when the image arrives. The story keeps flowing — the illustration catches up.
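That placeholder-then-swap flow can be sketched with `asyncio`; here `generate_image` is a hypothetical stand-in for the Imagen 3 call:

```python
import asyncio

async def generate_image(prompt: str) -> str:
    # Hypothetical stand-in for Imagen 3; the sleep simulates its latency.
    await asyncio.sleep(0.01)
    return f"<image for: {prompt}>"

async def illustrate(prompt: str, events: asyncio.Queue) -> None:
    # Emit a placeholder immediately so the story keeps flowing...
    await events.put({"type": "placeholder", "prompt": prompt})
    # ...then emit a swap event for the same slot once the image arrives.
    image = await generate_image(prompt)
    await events.put({"type": "image", "prompt": prompt, "data": image})

async def main():
    queue: asyncio.Queue = asyncio.Queue()
    await illustrate("watercolor cartoon heart", queue)
    while not queue.empty():
        print(queue.get_nowait()["type"])

asyncio.run(main())
```

In a real pipeline the coroutine would be launched with `asyncio.create_task` so narration events keep flowing while the image renders.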
Privacy First
Vitale processes your medical report entirely in memory. No data is stored, logged, or persisted anywhere. The uploaded PDF is discarded immediately after the story is generated. This was a non-negotiable design decision for a health data product.
What I Learned
Gemini's interleaved output changes what's possible. Most multimodal pipelines are sequential — generate text, then generate images, then stitch them together. Interleaved output collapses that into a single intelligent stream that reasons about both modalities simultaneously. The narrative and the visuals are designed together, not assembled separately.
SSE streaming + FastAPI is underrated. Server-Sent Events are simpler than WebSockets for one-directional streaming and work perfectly for this use case. FastAPI's StreamingResponse made it trivial to implement.
Vanilla JS still ships fast. No React, no Vite, no build step. Just a script tag and a canvas. For a hackathon with a tight timeline, this was the right call.
Try It / See the Code
- 🔗 Demo: https://vitale-981966808005.us-central1.run.app/
- 💻 GitHub: https://github.com/SAI-CHARAN-D/Vitale
- 🎥 Demo Video: https://youtu.be/saRVDDiCYXY
Disclaimer
Vitale summarizes medical reports in story form only. It does not diagnose, interpret, or provide medical advice. Always consult your doctor.
Built with Gemini 3.1 Pro · Imagen 3 · Google Cloud Run
Submitted to the Gemini Live Agent Challenge — Creative Storyteller Track
#GeminiLiveAgentChallenge