Healthcare AI that runs where there's no internet — Gemma 4 on a $150 phone

Gemma 4 Challenge: Build With Gemma 4 Submission

Track: Build With Gemma 4 · Variant used: gemma4:e2b (default) and gemma4:e4b (compared) · Submission: individual

Not a medical device. This is a research proof-of-concept and does not replace clinical evaluation or laboratory confirmation.

The scenario that motivated this

A community health worker is 80 km from the nearest health center. A child has a fever. The visit kit is a stack of single-T lateral-flow rapid diagnostic tests — COVID-19 antigen, HIV TR1, syphilis DPP, dengue NS1, hepatitis B/C, leishmaniasis rK39, and pregnancy hCG. In Brazil, these are public-health-system mandated tests, available in every primary care unit (UBS). Misreading them is consequential: a false-positive HIV result is a life-altering moment for the patient, and a false-negative dengue NS1 in the first week of infection misses the only window where the antigen test works at all.

There is no mobile signal. Cloud-based "AI triage" tools do not run here.

The proof of concept I am submitting: Gemma 4 2B, running entirely on the phone, reads a photo of the cassette and returns a structured JSON verdict. No network call. No telemetry. No remote logging. The health worker keeps the decision — the model is just a second pair of eyes.

This post is what I learned building it.

Repository

github.com/brenosalves/gemma-poct — MIT licensed. The system prompt, the analysis script, the synthetic benchmark generator, the Streamlit UI, and the Playwright recorder are all there. The demo video is in poc/demo_video/demo.webm.

What the model returns

A single Ollama call with the cassette photo and a system prompt returns:

{
  "left_half_observation": "vertical red/pink pigment band",
  "right_half_observation": "mostly white, no pigment",
  "visual_description": "The window has a red vertical band on its left half and is blank on its right half.",
  "c_line_present": true,
  "t_line_present": false,
  "image_quality": "good",
  "status": "valid",
  "result": "non_reactive",
  "confidence": "high",
  "notes": "The control line is present, but the test line is absent."
}

Three things to notice:

  1. The model describes before it concludes. The first three fields are pure observation. The remaining fields are derived. This ordering matters; I will come back to it.
  2. Two coupled gates: status and result. A test with no control line is invalid no matter what the T position looks like. The contract enforces it.
  3. No clinical recommendation in notes. The UI layer maps the verdict to actions (repeat the test, refer to a specialist, take a confirmatory test) using a per-disease protocol — the model does not.
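
To make the third point concrete: a minimal sketch of how the UI layer might map a verdict to an action. The dictionary and helper name below are illustrative; the real per-disease mapping lives in poc/protocols.json and is a simplified summary, not clinical guidance.

# Illustrative only: the real per-disease mapping lives in poc/protocols.json.
# Disease keys and the helper name here are hypothetical.
PROTOCOLS = {
    "hiv_tr1":    {"reactive": "take a confirmatory test", "non_reactive": "no further action"},
    "dengue_ns1": {"reactive": "refer to a specialist",    "non_reactive": "no further action"},
}

def action_for(disease: str, verdict: dict) -> str:
    """Map a structured verdict to a protocol action; the model never does this itself."""
    if verdict["status"] != "valid" or verdict["result"] == "indeterminate":
        return "repeat the test"
    return PROTOCOLS[disease].get(verdict["result"], "repeat the test")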

Why Gemma 4, and why gemma4:e2b specifically

The challenge's "Build With Gemma 4" track lets submissions pick any variant. I picked gemma4:e2b (~1.5 GB, int4-quantized), and the constraint that forced the choice is the audience, not the leaderboard.

The intended user is a community health worker in the field. The intended device is the phone they already own. In the Brazilian context that means entry-level Android devices with 4 GB RAM — Moto G14, Galaxy A05, Redmi 12C. Those phones can host e2b comfortably. e4b (~3 GB int4) needs 6 GB+ RAM and excludes that audience. The 31B-dense and 26B-MoE variants are eligible for the challenge but disqualified by the on-device constraint of the project.

So the variant choice rolls up to a single sentence: the model has to fit in the phone the health worker is already carrying, otherwise the project does not exist. I benchmarked e4b here for comparison because that comparison is informative — see below — but the default everywhere in the codebase is e2b.

Two other Gemma 4 properties that matter for this use case:

  • Native multimodality. No separate OCR pipeline, no image-to-text intermediate. One call, one JSON.
  • 128k context. It opens a door I have not walked through yet: loading the full clinical interpretation manual for each test as part of the system prompt. That is on the roadmap below.

The architecture is boring (and that is the point)

[phone camera]
     ↓
[optional crop]
     ↓
[Gemma 4 e2b on-device, via MediaPipe LLM Inference (mobile) / Ollama (POC)]
     ↓
[structured JSON]
     ↓
[UI: verdict + per-disease protocol action]

Everything happens on the device. There is no backend. There is no analytics. The capture-to-result loop never leaves the phone.
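
For the POC path, the capture-to-verdict step is essentially one function. Here is a sketch using the ollama Python client, assuming the system prompt is loaded from poc/prompts/system.md; the actual analysis script in the repo may differ in detail.

import json
from pathlib import Path

import ollama  # pip install ollama; assumes a local Ollama server with the model pulled

SYSTEM_PROMPT = Path("poc/prompts/system.md").read_text(encoding="utf-8")

def analyze(image_path: str, model: str = "gemma4:e2b") -> dict:
    """One multimodal call: cassette photo in, structured verdict out."""
    response = ollama.chat(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": "Read this rapid test cassette.", "images": [image_path]},
        ],
        format="json",               # constrain the reply to valid JSON
        options={"temperature": 0},  # deterministic reads
    )
    return json.loads(response["message"]["content"])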

No fine-tuning. Just prompting.

The whole behavior of the model is steered by a single system prompt (poc/prompts/system.md). Four decisions in that prompt matter more than the rest:

1. A validity gate before any result

The prompt enforces a logical order: image quality first, control line second, test line third. The consistency rules at the bottom of the contract make this impossible to break:

- c_line_present: false → status: "invalid" and result: "indeterminate"
                          (no exceptions, regardless of T).

A test with no control line is biologically meaningless — the reagent did not flow, the strip is dead. The model treats it that way, even if it thinks it sees a T.

2. Image quality as step zero

Before reading lines, the model must assess whether the window is in focus, whether glare is covering the lines, whether the cassette is even in frame. Blurred or framed-out images return image_quality: "poor" and that locks the verdict to invalid/indeterminate/low. Without this, the model hallucinates lines in fuzzy images.
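
The prompt carries these rules, but nothing stops the app layer from re-applying them after parsing as a cheap belt-and-suspenders check. A sketch of that idea, not necessarily how the repo does it:

def enforce_gates(verdict: dict) -> dict:
    """Re-apply the contract's consistency rules to a parsed verdict.

    Defensive post-processing: a missing control line or a poor-quality image
    can never surface as a usable result, even if the model drifts.
    """
    if not verdict.get("c_line_present", False):
        verdict["status"] = "invalid"
        verdict["result"] = "indeterminate"
    if verdict.get("image_quality") == "poor":
        verdict["status"] = "invalid"
        verdict["result"] = "indeterminate"
        verdict["confidence"] = "low"
    return verdict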

3. Spatial decomposition before naming

This is the decision that earned its keep. Read on.

4. Doubt is not evidence

The contract explicitly states: "To mark t_line_present: true you need concrete visual evidence (a red/pink horizontal pigment mark at the T position). Shadow, texture noise or glare do NOT count. When in doubt, mark false." This sounds obvious. It was not obvious to the model.

The bug we could not write our way out of (and the fix)

Here is the failure mode I spent the most time on. With an earlier version of the prompt, gemma4:e2b would look at a clean, well-lit negative test — a single red control line in the left half of the window, pristine white in the right half — and write:

"Both the control and test lines are clearly visible."

Word for word, in case after case. This is not a model that is uncertain; it is a model that is affirmatively hallucinating a line that is not there. The prompt already contained a paragraph telling the model that the letters "C" and "T" printed above the window are just plastic, not lines. The model would repeat the warning back to me and then hallucinate the line anyway.

Adding stronger anti-hallucination warnings did not help. Adding base-rate priors ("most field tests are negative") helped a little but not enough. Rewriting the warning in a more imperative tone moved nothing.

What worked was changing how the model is asked to look. Instead of "is there a C line? is there a T line?", the prompt now asks the model to:

  1. Find the white horizontal window.
  2. Mentally split it into a LEFT HALF and a RIGHT HALF. Do not yet name these halves "C" and "T".
  3. Describe what is in each half. Pigment band? White? Unclear?
  4. Only then map left → control line, right → test line.

In the JSON contract, this shows up as two extra fields before the conclusion:

"left_half_observation":  "vertical red/pink pigment band" |
                          "mostly white, no pigment" | "unclear"
"right_half_observation": <same>

The reasoning order matters. When the model has to commit to a description of the right half before seeing the words "T" or "test", the label-driven hallucination disappears. The synthetic benchmark went from 3/6 to 5/6 immediately. The narrative notes field still occasionally drifts ("both lines visible"), but the structured fields — which are what the app actually consumes — are now reliable in the horizontal-layout case.

The takeaway, for anyone using small multimodal models: the order of fields in your JSON contract is part of your prompt. The model commits to whatever it writes first.

The capture convention (and why it is enough)

Real cassettes ship in two physical layouts: horizontal (e.g. typical pregnancy hCG strips — window is wide, C and T labels sit above it, lines are vertical inside the window) and vertical (e.g. many COVID-19 antigen kits — window is tall and narrow, C and T labels are stacked on one side, lines are horizontal). My synthetic benchmark images are all horizontal. The model is excellent on horizontal. On vertical it gets confused: "left half" and "right half" of a vertical window mean nothing useful.

I tried generalizing the prompt with an orientation-detection step ("if C is above and T is below, use top/bottom halves"). It regressed performance on both layouts: the model occasionally misidentified the orientation, and the extra branching diluted the spatial decomposition that had been working.

The solution is on the capture side, not the inference side. The app instructs the user to rotate the cassette so the C label is on the left and T label is on the right before taking the photo. A vertical cassette becomes a horizontal one with a 90° turn. The model sees the layout it knows. The Streamlit app reminds the user with a sticky tip; on mobile this becomes a camera-overlay guide.

This is the kind of compromise I have come to appreciate working with small models: do not bend the model toward the messiness of the world if you can bend the capture step toward the model.

How it actually performs

Two benchmarks. The synthetic one drives prompt iteration. The real one keeps me honest.

Synthetic benchmark (6 cases, deterministic generator)

generate_synthetic.py produces six PIL-drawn cassettes with ground truth encoded in the filename: positive (clear), positive (faint T), negative (clear), negative (clean), invalid (no lines), invalid (only T present, no C). Each isolates one decision point.
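
The generator is a few dozen lines of PIL drawing. A condensed sketch of the idea (the real generate_synthetic.py covers all six cases and encodes the ground truth in the filename):

from PIL import Image, ImageDraw

def draw_cassette(c_line: bool, t_line: bool, faint: bool = False) -> Image.Image:
    """Draw a horizontal single-T cassette: white window, vertical pigment bands."""
    img = Image.new("RGB", (400, 160), "#e8e8e8")                     # cassette body
    d = ImageDraw.Draw(img)
    d.rectangle([80, 50, 320, 110], fill="white", outline="#999999")  # result window
    red = (255, 160, 170) if faint else (200, 30, 60)
    if c_line:
        d.rectangle([140, 55, 148, 105], fill=red)                    # C band, left half
    if t_line:
        d.rectangle([250, 55, 258, 105], fill=red)                    # T band, right half
    d.text((140, 30), "C", fill="black")                              # printed labels, not lines
    d.text((250, 30), "T", fill="black")
    return img

# ground truth encoded in the filename
draw_cassette(c_line=True, t_line=True, faint=True).save("positive_faint_01.png")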

| Prompt iteration | gemma4:e2b |
| --- | --- |
| Baseline (protocol only, anti-hallucination warning, base-rate prior) | 3/6 |
| + spatial decomposition (left half / right half before C/T) | 5/6 |

The remaining miss is invalid_only_t_01 — a synthetic case where the control line is absent but the T line is present. This is biologically rare (the reagent failed but the antigen capture worked anyway) and the model now classifies it as valid/non_reactive instead of invalid/indeterminate. The error is on the "fail safe" side for most real-world workflows, but it remains a known limitation.

Real benchmark (4 photos, public-domain or owned)

| Photo | Layout | gemma4:e2b | gemma4:e4b |
| --- | --- | --- | --- |
| negative_covid_01 | vertical → rotated | ✓ non_reactive | ✓ non_reactive |
| positive_covid_01 | vertical, faint Ts | ✗ non_reactive | ✗ invalid |
| positive_covid_02 (Wikimedia, CC BY-SA 4.0) | horizontal, faint T | ✗ non_reactive | ✓ reactive |
| positive_pregnancy_01 | horizontal, bold lines | ✓ reactive | ✓ reactive |
| Score | | 2/4 | 3/4 |

What this tells me:

  • e2b reads bold, high-contrast lines reliably (the pregnancy hCG cassette is the easy case for any model). The pattern of misses is faint T lines on real-world cassettes — the COVID-19 antigen tests in particular ship with subtle test lines that even people miss.
  • e4b catches one of those faint-T cases (positive_covid_02). This is the only case where e4b improves over e2b. If you have the RAM, you get one fewer false-negative. If you do not, you have the same blind spot as the average human reader, which is at least an honest blind spot.
  • The positive_covid_01 photo defeats both models. The lines are extremely subtle and the lighting is mediocre. This is a model-capability ceiling, not a prompt issue.

So: prompt engineering moves the synthetic score from 3/6 to 5/6 and reveals where capacity, not steering, is the bottleneck.

Demo

Four cases in order:

  1. positive_pregnancy_01 on e2b → Reactive ✓
  2. positive_covid_02 on e2b → Non-reactive ✗ (faint T missed)
  3. positive_covid_02 on e4b → Reactive ✓ (same photo, bigger model)
  4. negative_covid_01 on e2b → Non-reactive ✓

The Streamlit UI shows the JSON, the verdict, the per-disease action, and a debug expander with the spatial observations. Latency per analysis is 15-30 seconds on an Apple M-series laptop running e2b; on a real entry-level phone this will be slower (rough target: under a minute), which is fine because the workflow is "snap photo, wait while you write the patient's name down".
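
The UI is a thin shell around that single call. A trimmed sketch of what the Streamlit layer looks like conceptually; the repo's app adds the per-disease protocol action, the rotation tip, and the recorder hooks. The analyze helper here is the Ollama-call sketch from earlier in this post, a stand-in for the repo's analysis script.

import streamlit as st

from analyze import analyze  # hypothetical module wrapping the Ollama call sketched earlier

st.title("Rapid test reader (research POC, not a medical device)")
st.info("Rotate the cassette so the C label is on the LEFT and the T label is on the RIGHT.")

photo = st.file_uploader("Cassette photo", type=["jpg", "jpeg", "png"])
if photo is not None:
    st.image(photo, width=320)
    with open("upload.jpg", "wb") as f:   # persist so Ollama can read a file path
        f.write(photo.getbuffer())
    verdict = analyze("upload.jpg")
    st.subheader(f"{verdict['status']} / {verdict['result']} ({verdict['confidence']} confidence)")
    with st.expander("Spatial observations (debug)"):
        st.json(verdict)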

Limitations (the part to read before quoting any number)

  • Not a medical device. Research and educational POC. Any deployment would require clinical validation and regulatory clearance.
  • Single-T scope. Malaria Pf/Pv and dengue IgM/IgG are explicitly out — they need a multi-T schema and per-disease combination rules.
  • Image conditions matter. Severe glare, motion blur, or partial framing → invalid/low-confidence verdicts. Good, because the alternative is a confidently wrong call.
  • Vertical layouts depend on capture rotation. The model is optimized for horizontal-window layouts. The app guides the user, but a user who ignores the guide will hit accuracy degradation on vertical cassettes.
  • Brand variability. Different manufacturers use different dyes and window geometries. The four photos I evaluated are not a substitute for a per-manufacturer study.

Roadmap

  • Package as a Flutter app using MediaPipe LLM Inference — that is the path to actually getting it onto the target phones. The desktop POC and the mobile app share the same system prompt and JSON contract.
  • Multi-T support. Malaria Pf/Pv and dengue IgM/IgG need a t_lines: [{position, present}] array and per-disease combination rules (a schema sketch follows this list).
  • Use the 128k context. Load the clinical interpretation manual for each test as part of the prompt; the model could then explain why a faint T is biologically plausible for the disease in question.
  • Larger real benchmark. Public-domain photos of HIV, dengue NS1 and syphilis rapid tests are scarce; a structured collection effort (informed-consent, anonymized) is needed.
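
For the multi-T roadmap item above, the verdict contract would need to grow roughly like this. A speculative sketch, not something the POC implements:

{
  "c_line_present": true,
  "t_lines": [
    {"position": "T1", "present": true},
    {"position": "T2", "present": false}
  ],
  "status": "valid",
  "result": "<decided by per-disease combination rules>"
}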

Acknowledgements

  • Cover photo by John Cameron on Unsplash.
  • The positive_covid_02 photo used in the demo video and benchmark table is cropped from "Positive Covid-19 Rapid Antigen Test" on Wikimedia Commons, licensed under CC BY-SA 4.0. Cropped to focus on the cassette; no other modifications. The benchmark photo files themselves are not committed to the repository — only the demo video that uses them is.
  • The clinical action mappings in poc/protocols.json are a simplified summary of the Brazilian Ministry of Health protocols for each test. They are illustrative, not authoritative.

Closing

I started this project thinking the interesting question would be which model to use. The interesting question turned out to be how to structure the JSON so the model commits to observation before conclusion. The model is plenty capable for the easy case. Steering it is most of the work, and steering it is free — no fine-tuning, no GPU rental, no data labeling, no cloud infrastructure.

That is the part that I think generalizes beyond this project: for small on-device multimodal models, the prompt is the product. Iterate on the prompt the way you iterate on UI — measure, change one thing, measure again — and your whole system becomes a single file someone can read, audit, and improve.

— Breno Alves


Built for the Google Gemma 4 Challenge. Code: github.com/brenosalves/gemma-poct. Connect on LinkedIn. Feedback welcome in the comments.
