Mohammed Ayaan Adil Ahmed

I Built a Live AI First Aid Agent with Gemini 2.5 Flash in 3 Days

How I Built CalmAid — A Live AI First Aid Agent with Gemini 2.5 Flash and Google Cloud Run

This post is my entry for the Gemini Live Agent Challenge hackathon. #GeminiLiveAgentChallenge


The Idea

In an emergency, people panic. They fumble with Google, get walls of text, and waste critical seconds. I wanted to build something that could just talk to you — calmly, instantly, while also seeing what you're dealing with.

That became CalmAid: speak the emergency, show the injury, hear step-by-step instructions streaming back in real time.


The Stack

  • Gemini 2.5 Flash — multimodal vision + text generation with streaming
  • Google GenAI SDK (google-genai) — the new SDK, not the deprecated one
  • FastAPI — async Python backend
  • Server-Sent Events (SSE) — real-time streaming to the browser
  • Google Cloud Run — serverless hosting
  • Google Secret Manager — secure API key storage
  • Web Speech API + Speech Synthesis — browser-native voice in and out
  • GSAP 3 — animations

How Streaming Works

The key insight that makes CalmAid feel live is that text renders and TTS speaks simultaneously while Gemini is still generating.

The backend streams via SSE:

import asyncio
import json

from google import genai
from google.genai import types

client = genai.Client()  # picks up GEMINI_API_KEY from the environment

async def stream_gemini(parts):
    response = client.models.generate_content_stream(
        model="gemini-2.5-flash",
        contents=parts,
        config=types.GenerateContentConfig(
            system_instruction=SYSTEM_PROMPT,
            max_output_tokens=300,
        ),
    )
    for chunk in response:
        if chunk.text:
            yield f"data: {json.dumps({'chunk': chunk.text})}\n\n"
            await asyncio.sleep(0)  # yield control so the event loop can flush the event
    yield f"data: {json.dumps({'done': True})}\n\n"
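Each yielded string is one Server-Sent Event: a data: prefix, a JSON payload, and a blank line terminating the event. As a minimal illustration of that framing (the helper name is mine, not from the CalmAid code):

```python
import json

def sse_event(payload: dict) -> str:
    """Frame a JSON payload as a single Server-Sent Event."""
    return f"data: {json.dumps(payload)}\n\n"

# The generator above emits events shaped like:
#   sse_event({"chunk": "Apply firm pressure"})
#   sse_event({"done": True})
```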

The frontend reads the stream and feeds sentences to a TTS queue the moment a sentence boundary (., !, ?) is detected:

function enqueueSentences(newText) {
  ttsBuffer += newText;
  const sentences = ttsBuffer.split(/(?<=[.!?])\s+/);
  ttsBuffer = sentences.pop() || "";
  sentences.forEach(s => { if (s.trim()) ttsQueue.push(s.trim()); });
  if (!ttsActive) drainTTSQueue();
}

The result: the agent starts speaking before the full response arrives. That's what makes it feel genuinely live.


Vision Integration

When a user snaps a photo, it's sent as base64 and converted to a Pillow image on the backend:

import base64
import io

from PIL import Image
from google.genai import types

if req.image_b64:
    img_bytes = base64.b64decode(req.image_b64)
    # Normalize to RGB JPEG so Gemini always receives a consistent mime type
    img = Image.open(io.BytesIO(img_bytes)).convert("RGB")
    buf = io.BytesIO()
    img.save(buf, format="JPEG")
    img_part = types.Part.from_bytes(data=buf.getvalue(), mime_type="image/jpeg")
    parts.append(img_part)

Gemini then describes what it sees and tailors the first aid advice accordingly.
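On the wire, the photo is nothing more than base64 text inside the JSON request body. A minimal sketch of the round trip (function names are illustrative, not from the CalmAid code):

```python
import base64

def encode_image(jpeg_bytes: bytes) -> str:
    # What the browser effectively does before POSTing the JSON body
    return base64.b64encode(jpeg_bytes).decode("ascii")

def decode_image(image_b64: str) -> bytes:
    # What the backend does before handing the bytes to Pillow
    return base64.b64decode(image_b64)
```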


Deploying to Cloud Run

The whole deploy is one command; the --source . flag triggers Cloud Build automatically:

gcloud run deploy calmaid-agent \
  --source . \
  --region us-central1 \
  --allow-unauthenticated \
  --set-secrets="GEMINI_API_KEY=gemini-api-key:latest" \
  --memory 512Mi

The API key lives in Secret Manager and gets injected at runtime — never hardcoded, never in the repo.
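Inside the application this reduces to reading an environment variable at startup. A small sketch of that check (the helper name is mine; GEMINI_API_KEY is the variable mapped by --set-secrets above):

```python
import os

def load_api_key(env_var: str = "GEMINI_API_KEY") -> str:
    """Read the key that Cloud Run injects from Secret Manager at runtime."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"{env_var} is not set; check the --set-secrets mapping")
    return key
```

Failing fast here turns a misconfigured secret mapping into an obvious startup error instead of a mysterious 500 on the first request.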


Challenges

SSE buffer management was trickier than expected. Chunks from the stream reader arrive mid-line, so you have to hold incomplete lines across read cycles:

// value is a Uint8Array from reader.read(); decoder is a TextDecoder
buffer += decoder.decode(value, { stream: true });
const lines = buffer.split("\n");
buffer = lines.pop(); // hold incomplete line across read cycles
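The same hold-the-tail idea, sketched as a standalone Python parser (the function name is mine) so the buffering behavior is easy to unit-test:

```python
import json

def parse_sse(buffer: str, chunk: str):
    """Accumulate a network chunk; return (leftover buffer, completed events)."""
    buffer += chunk
    lines = buffer.split("\n")
    buffer = lines.pop()  # hold the incomplete trailing line
    events = []
    for line in lines:
        if line.startswith("data: "):
            events.append(json.loads(line[len("data: "):]))
    return buffer, events
```

A chunk that ends mid-JSON stays in the buffer until the next read completes it, which is exactly the failure mode that bites you if you parse each chunk naively.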

Python 3.13 compatibility broke several pinned packages. Pillow 10.x and pydantic 2.7.x don't have prebuilt wheels for 3.13 — bumping to Pillow 11.1.0 and pydantic 2.10.0 fixed it.

SDK migration — the google-generativeai package is fully deprecated and streaming was unreliable. Switching to google-genai resolved it completely.


What I Learned

  • Streaming + TTS together is what makes AI feel live vs turn-based
  • Browser-native Web Speech API and Speech Synthesis are underrated — zero dependencies, instant
  • python:3.11-alpine cuts Docker image vulnerabilities dramatically vs slim
  • Cloud Run + Secret Manager is the cleanest production pattern for API keys
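As a rough illustration of the Alpine point above, a minimal Dockerfile along these lines (file layout and commands are assumptions, not the actual CalmAid Dockerfile):

```dockerfile
# Alpine keeps the image small and trims CVE surface vs python:3.11-slim
FROM python:3.11-alpine
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# Cloud Run supplies $PORT at runtime
CMD ["sh", "-c", "uvicorn main:app --host 0.0.0.0 --port ${PORT:-8080}"]
```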

Try It


Built for the Gemini Live Agent Challenge. #GeminiLiveAgentChallenge
