Built a Multimodal Emergency First Aid Assistant with Gemma 4 — Here's What the Model Unlocked


This is a submission for the Gemma 4 Challenge: Write About Gemma 4


A few weeks ago, I asked myself a simple question: what would it take to build an AI that could walk a frightened person through a medical emergency — without typing a single word?

They'd need to show the situation. Speak the emergency. Get guided through it, step by step, in plain language, with their hands free.

That question led me to build Med-first — and it led me straight to Gemma 4. Because Gemma 4 is the first open model I've encountered where the answer to that question is: yes, all of that is actually possible in one API call.

This post is about what Gemma 4 unlocked, how I built it, and — if you're a developer in Africa or anywhere compute access has historically been a barrier — why this release matters more than the benchmarks suggest.


What Is Med-first?

Med-first is a browser-based emergency first aid assistant. Open it on any phone, no install, no login. Then:

  • Type your emergency and get structured, step-by-step first aid guidance
  • Speak into the mic hands-free — the browser transcribes it, Gemma 4 responds, and the answer is read aloud automatically
  • Point your camera at the injury or scene, capture a frame, and Gemma 4 describes what it sees and tailors its guidance to the actual visual situation

The output is always a structured triage card: a severity assessment (Critical / Urgent / Stable), a numbered list of steps a non-medical person can follow, warning signs to watch for, and a line of calm reassurance.

For Critical cases, the very first thing it does — before any first aid steps — is tell you to call emergency services.


Why Gemma 4 Specifically?

This is the question the challenge judges care about most, so let me be direct.

The core experience of Med-first requires three things to happen in a single interaction:

  1. Understand a spoken description of an emergency (audio/transcript)
  2. Analyze a photo or camera frame of the scene (vision)
  3. Return structured, actionable guidance in plain language

Before Gemma 4, building this with a single open model wasn't possible. You'd stitch together three separate models — a speech recognition model, a vision model, a text model — with all the latency, error surface, and infrastructure complexity that entails.

Gemma 4 handles all three natively.

From the official model card:

"Extended Multimodalities: Processes Text, Image with variable aspect ratio and resolution support (all models), Video, and Audio (featured natively on the E2B and E4B models)."

That's the unlock. One model, one API call, three modalities.
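To make this concrete, here's a rough sketch of what that single call can look like against the OpenAI-compatible endpoint listed at the end of this post. The askGemma name and the GEMINI_API_KEY environment variable are placeholders of mine, not part of any SDK:

// One request, text + image together. The endpoint and model string are
// the ones used throughout this post; everything else is a sketch.
const GEMINI_URL =
  'https://generativelanguage.googleapis.com/v1beta/openai/chat/completions';

async function askGemma(text: string, imageBase64?: string): Promise<string> {
  const content: Array<Record<string, unknown>> = [{ type: 'text', text }];
  if (imageBase64) {
    content.unshift({
      type: 'image_url',
      image_url: { url: `data:image/jpeg;base64,${imageBase64}` },
    });
  }

  const res = await fetch(GEMINI_URL, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${process.env.GEMINI_API_KEY}`,
    },
    body: JSON.stringify({
      model: 'gemma-4-27b-a4b-it',
      messages: [{ role: 'user', content }],
    }),
  });
  if (!res.ok) throw new Error(`Gemini API error: ${res.status}`);

  const data = await res.json();
  return data.choices[0].message.content; // the triage JSON, as a string
}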

Which Model I Chose and Why

For Med-first, I'm using gemma-4-27b-a4b-it — the 26B Mixture of Experts variant — accessed via the Gemini API on Google AI Studio.

The choice was deliberate:

  • The MoE architecture activates only ~3.8B parameters per inference pass, which means fast response times — critical when someone is in a medical emergency and every second of waiting feels like ten
  • The 256K context window means the full conversation session stays in context from the first message to the last. If someone describes a situation, sends a photo, asks a follow-up, and then says "it's getting worse" — Gemma 4 has all of that history and its guidance evolves accordingly, rather than starting from zero each turn
  • The model's native function-calling and structured JSON output capabilities let me drive the entire UI from a single model response — the severity badge, the numbered steps, the call-emergency banner — all parsed from a JSON object Gemma 4 returns directly

An edge model (E2B or E4B) would make sense for a future offline/on-device version — and I've architected it so that path is open. But for a web app where response quality and context retention matter most, the 26B MoE is the right tool.


How I Built It

Stack

  • Next.js 14 (App Router) — frontend and server actions in one project
  • Tailwind CSS + shadcn/ui — dark, high-contrast medical UI
  • Web Speech API (browser-native, free) — voice input transcription
  • Web Speech Synthesis — reads the AI response aloud, hands-free (a sketch of this voice loop follows this list)
  • getUserMedia — live camera access for frame capture
  • Next.js Server Actions — the backend layer, no separate server needed
  • Gemini API (primary) → OpenRouter free tier (fallback)
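
Here's roughly what that hands-free voice loop looks like in the browser. It leans entirely on standard web APIs; triageEmergency is the server action covered in the next section, and error handling is trimmed for brevity:

// Voice in, voice out: transcribe with the Web Speech API, triage with
// Gemma 4, then read the guidance aloud with speech synthesis.
import { triageEmergency } from '@/actions/action'; // hypothetical path

const SpeechRecognitionImpl =
  (window as any).SpeechRecognition || (window as any).webkitSpeechRecognition;

export function listenAndRespond() {
  const recognition = new SpeechRecognitionImpl();
  recognition.lang = 'en-US';
  recognition.interimResults = false;

  recognition.onresult = async (event: any) => {
    const transcript: string = event.results[0][0].transcript;
    const triage = await triageEmergency(transcript);

    // Read the reassurance line and the steps aloud, hands-free.
    const spoken = [triage.reassurance, ...triage.steps].join('. ');
    window.speechSynthesis.speak(new SpeechSynthesisUtterance(spoken));
  };

  recognition.start();
}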

The Architecture

Browser (React frontend)
    │
    ├── Text Input
    ├── Voice Input (Web Speech API → transcript)
    └── Camera Capture (getUserMedia → frame → base64)
              │
              ▼
    actions/action.ts  ['use server']
    ← runs server-side, API keys never touch the browser →
              │
    ┌─────────┴──────────┐
    │   Gemini API       │  → fallback →  OpenRouter
    │   gemma-4-27b-a4b-it              gemma-4-31b-it:free
    └─────────┬──────────┘
              │
    Structured JSON triage response
              │
    TriageCard rendered in UI
    + TTS reads response aloud

The key architectural decision was Next.js Server Actions over traditional API routes. The frontend calls triageEmergency() like a plain async function — no fetch(), no HTTP status codes, no CORS. TypeScript types flow end-to-end. It made the code dramatically simpler to build and easier to reason about.
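
A stripped-down sketch of that pattern, reusing the askGemma helper from the earlier sketch (file paths and names here are illustrative, not the actual source):

// actions/action.ts
'use server';

import { askGemma } from './gemma'; // hypothetical module holding the earlier helper

export async function triageEmergency(message: string, imageBase64?: string) {
  // The real code also falls back to OpenRouter's gemma-4-31b-it:free
  // when the primary Gemini API call fails.
  const raw = await askGemma(message, imageBase64);
  return JSON.parse(raw); // the structured triage object
}

// app/page.tsx (client component)
'use client';
import { triageEmergency } from '@/actions/action';

async function handleSubmit(message: string, frame?: string) {
  const triage = await triageEmergency(message, frame); // plain async call
  // triage.severity, triage.steps, triage.call_emergency drive the UI
}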

The System Prompt

Getting the model to behave correctly under emergency conditions required a carefully designed system prompt. A few things that mattered:

Force JSON output. Gemma 4 supports native structured output, but I reinforce it in the prompt:

You MUST respond ONLY with valid JSON matching the exact schema below.
No markdown fences. No preamble. No explanation. Just JSON.

Plain language requirement. Emergency guidance is useless if a frightened person can't understand it:

Give instructions in numbered steps. Short sentences. Plain language.
No medical jargon. A frightened 14-year-old must understand you.

Always escalate Critical cases first:

For Critical cases, ALWAYS instruct the user to call emergency services
(911 / 999 / 112) as your FIRST step before any other instructions.

The response schema:

{
  "severity": "Critical | Urgent | Stable",
  "call_emergency": true,
  "what_i_see": "description of the image if provided",
  "steps": ["step 1", "step 2", "step 3"],
  "watch_for": ["warning sign 1", "warning sign 2"],
  "reassurance": "one calming sentence"
}
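On the TypeScript side, that schema maps onto a small type, and it's cheap to parse defensively in case a stray markdown fence slips through despite the prompt. A sketch with illustrative names:

type Severity = 'Critical' | 'Urgent' | 'Stable';

interface TriageResponse {
  severity: Severity;
  call_emergency: boolean;
  what_i_see: string;
  steps: string[];
  watch_for: string[];
  reassurance: string;
}

// Strip any accidental ``` fences before parsing, then trust the schema.
function parseTriage(raw: string): TriageResponse {
  const cleaned = raw.replace(/```(?:json)?/g, '').trim();
  return JSON.parse(cleaned) as TriageResponse;
}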

Multimodal Image Handling

When a user captures a frame from the camera or uploads a photo, it gets base64-encoded in the browser, then passed to the server action as a string. The server action attaches it to the user message in the OpenAI-compatible format the Gemini API accepts:

{
  role: 'user',
  content: [
    {
      type: 'image_url',
      image_url: { url: `data:image/jpeg;base64,${imageBase64}` }
    },
    {
      type: 'text',
      text: userMessage
    }
  ]
}

Gemma 4 then describes what it observes in the what_i_see field of its response before giving guidance — so someone can see the model is actually reading the image, not guessing.
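
The capture step itself is a handful of standard browser APIs. A sketch of that path, where the element handling and JPEG quality are my own choices rather than the exact app code:

// Start the rear camera on a phone and pipe it into a <video> element.
async function startCamera(video: HTMLVideoElement) {
  const stream = await navigator.mediaDevices.getUserMedia({
    video: { facingMode: 'environment' },
  });
  video.srcObject = stream;
  await video.play();
}

// Grab the current frame as base64 JPEG (without the data: prefix,
// since the server action re-adds it when building the message).
function captureFrame(video: HTMLVideoElement): string {
  const canvas = document.createElement('canvas');
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  canvas.getContext('2d')!.drawImage(video, 0, 0);
  return canvas.toDataURL('image/jpeg', 0.8).split(',')[1];
}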

One Bug Worth Mentioning

During development, I hit a 500 error from the Gemini API that cost me an hour. The issue: the model string gemma-4-it does not exist. The actual correct model strings are gemma-4-27b-a4b-it and gemma-4-31b-it. Claude Code had guessed a model name that sounded right but wasn't. Always verify model strings against the official docs before debugging your request format.


What the 256K Context Window Actually Means in This Context (Pun Intended)

Most emergency situations aren't a single message. They evolve:

"Someone fell, they hit their head."

[Gemma 4 guides them through head injury checks]

"They're conscious but confused."

[Guidance updates to reflect that detail]

"Now they're saying their neck hurts."

[Full session history still in context — guidance escalates appropriately]

With an 8K or 32K context model, you'd be managing conversation truncation, losing critical earlier context, or paying to re-summarize. With 256K, the model tracks the entire situation as it unfolds. For a use case where continuity is literally a safety concern, this matters.
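
In code, that continuity is just the full messages array going out on every call. A sketch, where completeWithHistory stands in for the same chat-completions request shown earlier, sent with the whole history:

type ChatMessage = {
  role: 'system' | 'user' | 'assistant';
  content: unknown;
};

// The same OpenAI-compatible request as before, but with every prior
// turn included. At 256K tokens there's no need to truncate a session.
declare function completeWithHistory(messages: ChatMessage[]): Promise<string>;

const history: ChatMessage[] = [
  { role: 'system', content: '...the triage system prompt above...' },
];

async function nextTurn(userMessage: string): Promise<string> {
  history.push({ role: 'user', content: userMessage });
  const reply = await completeWithHistory(history); // full session, every turn
  history.push({ role: 'assistant', content: reply });
  return reply;
}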


What This Means for Developers in Africa

I want to say something that most Gemma 4 guides won't.

For developers building in Nigeria and across the continent, the economics of cloud AI have always been a quiet barrier. Dollar-denominated API pricing. Latency from distant servers. Payment methods that require workarounds. And the harder-to-quantify problem of data sovereignty — sending sensitive user data to foreign cloud infrastructure is a compliance and trust problem many African startups navigate silently.

Gemma 4 changes the equation.

An open-weight model powerful enough to run locally — or accessed via a free API tier with no credit card required — removes several of those barriers at once. Med-first is deployed on Vercel's free tier, uses Google AI Studio's free tier for the primary API, and falls back to OpenRouter's free tier. The total infrastructure cost to run this application is zero.

More importantly: 136 million people globally lack access to emergency medical services. In many parts of Africa, the nearest hospital is hours away. A tool that can guide someone through a medical crisis until help arrives — available on any phone browser, in any of 140+ languages Gemma 4 supports natively — isn't a demo. It's something that could matter.

That's what open models at this capability level actually unlock.


Try It Yourself

The model strings to get started immediately on Google AI Studio (free, no credit card):

| Model | String | Best For |
| --- | --- | --- |
| 26B MoE | gemma-4-27b-a4b-it | Speed + reasoning, production |
| 31B Dense | gemma-4-31b-it | Maximum quality |

Via OpenRouter (also free):

google/gemma-4-31b-it:free

The OpenAI-compatible endpoint for Gemini API:

https://generativelanguage.googleapis.com/v1beta/openai/chat/completions

Wrapping Up

Gemma 4 isn't an incremental update. The combination of native multimodal input, a 256K context window, structured output, and an Apache 2.0 license puts it in a category that genuinely didn't exist for open models before this release.

Med-first exists because Gemma 4 made it possible to handle voice, vision, and text in a single model call. That's the unlock. Everything else — the UI, the triage card, the hands-free voice loop — is just building around what the model already knows how to do.

What will you build with it?


Med-first is built with Next.js, Tailwind CSS, and Gemma 4 via the Gemini API. Deployed on Vercel.

Medical disclaimer: Med-first provides AI-generated guidance for demonstration purposes. Always call emergency services for life-threatening emergencies.
