How I Built CalmAid — A Live AI First Aid Agent with Gemini 2.5 Flash and Google Cloud Run
I wrote this post as my entry to the Gemini Live Agent Challenge hackathon. #GeminiLiveAgentChallenge
The Idea
In an emergency, people panic. They fumble with Google, get walls of text, and waste critical seconds. I wanted to build something that could just talk to you — calmly, instantly, while also seeing what you're dealing with.
That became CalmAid: speak the emergency, show the injury, hear step-by-step instructions streaming back in real time.
The Stack
- Gemini 2.5 Flash — multimodal vision + text generation with streaming
- Google GenAI SDK (`google-genai`) — the new SDK, not the deprecated one
- FastAPI — async Python backend
- Server-Sent Events (SSE) — real-time streaming to the browser
- Google Cloud Run — serverless hosting
- Google Secret Manager — secure API key storage
- Web Speech API + Speech Synthesis — browser-native voice in and out
- GSAP 3 — animations
How Streaming Works
The key insight that makes CalmAid feel live is that text renders and TTS speaks simultaneously while Gemini is still generating.
The backend streams via SSE:
```python
import asyncio
import json

from google import genai
from google.genai import types

client = genai.Client()  # picks up GEMINI_API_KEY from the environment

async def stream_gemini(parts):
    response = client.models.generate_content_stream(
        model="gemini-2.5-flash",
        contents=parts,
        config=types.GenerateContentConfig(
            system_instruction=SYSTEM_PROMPT,  # defined elsewhere in the app
            max_output_tokens=300,
        ),
    )
    for chunk in response:
        if chunk.text:
            yield f"data: {json.dumps({'chunk': chunk.text})}\n\n"
            await asyncio.sleep(0)  # yield control so each chunk flushes immediately
    yield f"data: {json.dumps({'done': True})}\n\n"
```
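Each frame the generator yields follows the Server-Sent Events wire format: a `data:` field terminated by a blank line. Here is a stdlib-only sketch of that framing and its inverse (helper names are mine, not from CalmAid):

```python
import json

def sse_event(payload: dict) -> str:
    """Serialize one payload as a Server-Sent Events frame."""
    return f"data: {json.dumps(payload)}\n\n"

def parse_sse(raw: str) -> list[dict]:
    """Inverse: recover payload dicts from a raw SSE stream."""
    events = []
    for frame in raw.split("\n\n"):
        if frame.startswith("data: "):
            events.append(json.loads(frame[len("data: "):]))
    return events

stream = sse_event({"chunk": "Apply firm pressure."}) + sse_event({"done": True})
print(parse_sse(stream))  # [{'chunk': 'Apply firm pressure.'}, {'done': True}]
```

The final `{'done': True}` sentinel gives the client an unambiguous end-of-stream signal.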
The frontend reads the stream and feeds sentences to a TTS queue the moment a sentence boundary (., !, ?) is detected:
```javascript
function enqueueSentences(newText) {
  ttsBuffer += newText;
  // split on whitespace that follows ., ! or ? — lookbehind keeps the punctuation
  const sentences = ttsBuffer.split(/(?<=[.!?])\s+/);
  ttsBuffer = sentences.pop() || "";  // hold the incomplete tail for the next chunk
  sentences.forEach(s => { if (s.trim()) ttsQueue.push(s.trim()); });
  if (!ttsActive) drainTTSQueue();
}
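The split-and-hold-the-tail logic is language-agnostic; a small Python sketch of the same technique (illustrative names, not project code) shows what the lookbehind regex emits:

```python
import re

def split_sentences(buffer: str, new_text: str):
    """Append new text, emit complete sentences, keep the unfinished tail."""
    buffer += new_text
    parts = re.split(r"(?<=[.!?])\s+", buffer)
    tail = parts.pop() if parts else ""
    return [s.strip() for s in parts if s.strip()], tail

done, tail = split_sentences("", "Stay calm. Press firmly on the wound")
print(done)  # ['Stay calm.']
print(tail)  # 'Press firmly on the wound'
```

Only complete sentences reach the TTS queue; the dangling fragment waits for the next streamed chunk.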
The result: the agent starts speaking before the full response arrives. That's what makes it feel genuinely live.
Vision Integration
When a user snaps a photo, it's sent as base64 and converted to a Pillow image on the backend:
```python
import base64
import io

from PIL import Image

if req.image_b64:
    img_bytes = base64.b64decode(req.image_b64)
    img = Image.open(io.BytesIO(img_bytes)).convert("RGB")
    buf = io.BytesIO()
    img.save(buf, format="JPEG")  # normalize any input format to JPEG
    img_part = types.Part.from_bytes(data=buf.getvalue(), mime_type="image/jpeg")
    parts.append(img_part)
```
Gemini then describes what it sees and tailors the first aid advice accordingly.
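One practical refinement worth noting (my addition, not part of the original pipeline): phone cameras produce multi-megabyte JPEGs, and Gemini doesn't need full resolution, so capping the image size before re-encoding cuts upload latency. A sketch with Pillow:

```python
import io

from PIL import Image

def prepare_image(img_bytes: bytes, max_side: int = 1024) -> bytes:
    """Decode, cap the longest side, re-encode as JPEG."""
    img = Image.open(io.BytesIO(img_bytes)).convert("RGB")
    img.thumbnail((max_side, max_side))  # preserves aspect ratio; only shrinks
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=85)
    return buf.getvalue()

# Example: a 3000x2000 source comes back with its long side capped at 1024
src = io.BytesIO()
Image.new("RGB", (3000, 2000), "red").save(src, format="JPEG")
out = Image.open(io.BytesIO(prepare_image(src.getvalue())))
print(out.size[0])  # 1024
```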
Deploying to Cloud Run
The whole deploy is one command, because `--source .` triggers Cloud Build automatically:

```bash
gcloud run deploy calmaid-agent \
  --source . \
  --region us-central1 \
  --allow-unauthenticated \
  --set-secrets="GEMINI_API_KEY=gemini-api-key:latest" \
  --memory 512Mi
```
The API key lives in Secret Manager and gets injected at runtime — never hardcoded, never in the repo.
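With `--set-secrets`, the key arrives as an ordinary environment variable, so the app can fail fast at startup if the mapping is missing. A small sketch (the guard function is mine, not CalmAid's actual code):

```python
import os
from typing import Mapping

def load_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Fail fast at startup if the secret wasn't mapped into the environment."""
    key = env.get("GEMINI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("GEMINI_API_KEY not set; check the Cloud Run secret mapping")
    return key

# The google-genai client is then built once at startup:
#   client = genai.Client(api_key=load_api_key())
print(load_api_key({"GEMINI_API_KEY": "test-key"}))  # test-key
```

Failing at startup turns a misconfigured secret into an immediate, obvious deploy error instead of a runtime 500.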
Challenges
SSE buffer management was trickier than expected. Chunks from the stream reader arrive mid-line, so you have to hold incomplete lines across read cycles:
```javascript
buffer += decoder.decode(value, { stream: true });
const lines = buffer.split("\n");
buffer = lines.pop(); // hold incomplete line
```
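The invariant is easier to see when sketched as a pure function: only complete lines leave the buffer, and the unfinished tail waits for the next read (Python for illustration; names are mine):

```python
def feed(buffer: str, chunk: str):
    """Accumulate a chunk; return (complete_lines, new_buffer)."""
    buffer += chunk
    lines = buffer.split("\n")
    tail = lines.pop()  # incomplete line, or "" if the chunk ended on \n
    return lines, tail

lines, buf = feed("", 'data: {"chu')
print(lines)  # [] — nothing complete yet
lines, buf = feed(buf, 'nk": "hi"}\n\n')
print(lines)  # ['data: {"chunk": "hi"}', '']
```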
Python 3.13 compatibility broke several pinned packages. Pillow 10.x and pydantic 2.7.x don't ship prebuilt wheels for 3.13 — bumping to Pillow 11.1.0 and pydantic 2.10.0 fixed it.
SDK migration — the `google-generativeai` package is fully deprecated, and its streaming was unreliable. Switching to `google-genai` resolved it completely.
What I Learned
- Streaming + TTS together is what makes an AI agent feel live rather than turn-based
- Browser-native Web Speech API and Speech Synthesis are underrated — zero dependencies, instant
- `python:3.11-alpine` cuts Docker image vulnerabilities dramatically vs `slim`
- Cloud Run + Secret Manager is the cleanest production pattern for API keys
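A minimal Dockerfile along those lines — a sketch that assumes the app lives in `main.py` with a FastAPI `app` object, which may not match the project's actual layout:

```dockerfile
FROM python:3.11-alpine
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# Cloud Run injects $PORT; fall back to 8080 for local runs
CMD ["sh", "-c", "uvicorn main:app --host 0.0.0.0 --port ${PORT:-8080}"]
```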
Try It
- Live app: submitted via the Gemini Live Agent Challenge portal
- GitHub: https://github.com/git791/Calm-Aid
Built for the Gemini Live Agent Challenge. #GeminiLiveAgentChallenge