Building Medical AI for the Other 90%: A Field Report from a Solo Developer

#ai #deeplearning #machinelearning #learning

Notes on architecture, licensing landmines, and why I'm building offline-first medical AI for community health workers — not radiologists.

Why this exists
There is no shortage of medical AI startups, and almost all of them are building for the same user: a radiologist or pathologist working in a well-resourced hospital with cloud connectivity, an electronic health record, and the budget to license a SaaS dashboard. That market is real, but it's also crowded, slow to adopt, and over-served.

The user I care about is different. She is a Community Health Extension Worker — sometimes called a CHEW — working in a district clinic in sub-Saharan Africa, often the only clinical-grade contact a village has with the formal health system. She does not have an internet connection during the visit. She is the screening layer, the referral decision, and the patient education function all in one role. Her tools are her training, a smartphone, and whatever cheap peripherals she can carry in a shirt pocket: a digital stethoscope, sometimes a $400 smartphone-mounted fundus adapter, occasionally a portable ultrasound probe.

This is the deployment context that should shape medical AI architecture. It mostly doesn't.

What that constraint forces
Three architectural decisions follow directly from the deployment context, and they are non-negotiable:

Offline-first. Every inference path must run on a phone CPU or a district-clinic laptop with no internet. That means ONNX export with INT8 quantization, sub-50MB quantized weights for phone-tier models, and inference budgets measured in seconds, not milliseconds.
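
To make the size budget concrete, here is a minimal sketch of the arithmetic behind INT8 quantization, assuming symmetric per-tensor quantization with a single scale. It is pure Python for illustration only; an actual export would go through ONNX tooling, which is not shown, and the names `quantize_int8`, `fits_phone_budget`, and `PHONE_BUDGET_BYTES` are hypothetical.

```python
# Illustrative only: symmetric per-tensor INT8 quantization in pure Python.
# A real pipeline would use ONNX tooling; this just shows the arithmetic.

# The sub-50MB target from the text, interpreted here as 50 MiB (an assumption).
PHONE_BUDGET_BYTES = 50 * 1024 * 1024


def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate floats; the error per weight is at most scale / 2."""
    return [v * scale for v in q]


def fits_phone_budget(num_params, bytes_per_param=1):
    """INT8 stores one byte per weight (ignoring scales and graph metadata)."""
    return num_params * bytes_per_param <= PHONE_BUDGET_BYTES


weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
```

Under these assumptions a 40-million-parameter model fits the budget at one byte per weight but not at four bytes per weight (FP32), which is why quantization is a precondition for phone-tier deployment rather than an optimization.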

Foundation encoders held frozen, with small trainable heads. Training a 12-billion-parameter model from scratch is not happening on the budget that serves this user. But a frozen pretrained encoder plus a small task-specific trainable head is a well-known pattern that scales cleanly across modalities. I've now applied the same frozen-encoder-plus-head architecture across echocardiography (video), ECG signals, cervical cytology images, chest X-rays, CT volumes, and digital pathology slides. The encoder differs per modality. The training pattern does not.
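
The pattern can be sketched in a few lines, assuming a toy fixed projection as a stand-in for the pretrained encoder and a logistic-regression head trained with plain SGD. Everything here (`FROZEN_W`, `frozen_encoder`, `train_head`) is hypothetical and far smaller than any real modality encoder; the point is only that embeddings are computed once and gradients touch the head alone.

```python
import math

# Stand-in for a pretrained encoder: fixed weights that are never updated.
FROZEN_W = [[1.0, -1.0], [0.5, 0.5]]


def frozen_encoder(x):
    """Project a raw input to an embedding; no gradient ever reaches FROZEN_W."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in FROZEN_W]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(inputs, labels, epochs=200, lr=0.5):
    """Fit a logistic-regression head on cached frozen embeddings."""
    feats = [frozen_encoder(x) for x in inputs]  # computed once, then reused
    w, b = [0.0] * len(feats[0]), 0.0
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            p = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b)
            g = p - y  # gradient of the log loss with respect to the logit
            w = [wi - lr * g * fi for wi, fi in zip(w, f)]
            b -= lr * g
    return w, b


def predict(x, w, b):
    f = frozen_encoder(x)
    return sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b)


# Toy, linearly separable data purely for illustration.
inputs = [[2.0, 0.0], [1.5, 0.5], [0.0, 2.0], [0.5, 1.5]]
labels = [1, 1, 0, 0]
head_w, head_b = train_head(inputs, labels)
```

Because the encoder output is cached, each training epoch costs only a pass over small embedding vectors, which is what keeps the compute budget within reach of a single developer.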

License-clean from day one. This one bit me hard, and it's the most non-obvious lesson of the past few months.

The license trap
If you go to HuggingFace and look at the model cards for the most-cited medical AI foundation models — vision-language pathology FMs, chest X-ray DINOv2 derivatives, multimodal biomedical CLIP models — many of them carry license tags like Apache 2.0 or MIT. Those tags govern the model code: the inference scripts, the training pipeline, the architecture definition.

The weights, in several cases, are governed by something different. Not by a separate LICENSE file. By a paragraph buried in the model card README.

The phrase I've seen now in multiple model cards from major labs reads, almost verbatim: "Any deployed use case commercial or otherwise is out of scope." That language doesn't appear in the legal LICENSE file, which is what most engineers check. It appears in the README, which is what compliance teams read.

For a system intended for clinical deployment, those weights are radioactive. The legal license permits commercial use. The maintainers' written intent does not support it. Most institutional legal reviews respect maintainer intent.
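
One cheap safeguard is to scan the model card text itself before adopting a set of weights. The sketch below assumes the README has already been downloaded as a string; the pattern list is a hypothetical starting point, not a legal checklist, and a match means "send this to a human reviewer", nothing more.

```python
import re

# Hypothetical phrase list; extend it with whatever your compliance reviewer flags.
RESTRICTIVE_PATTERNS = [
    r"out of scope",
    r"research (use|purposes) only",
    r"non-?commercial",
    r"not (intended|approved) for clinical",
]


def find_restrictions(model_card_text):
    """Return (line_number, line) pairs whose text matches a restrictive pattern."""
    hits = []
    for number, line in enumerate(model_card_text.splitlines(), start=1):
        if any(re.search(p, line, flags=re.IGNORECASE) for p in RESTRICTIVE_PATTERNS):
            hits.append((number, line.strip()))
    return hits


# A made-up model card using the phrase quoted above.
card = """license: apache-2.0
# Intended use
Any deployed use case commercial or otherwise is out of scope.
"""
hits = find_restrictions(card)
```

Here the license tag on line 1 looks permissive, and the only hit is the usage sentence on line 3, which is exactly the mismatch described above.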

This is not a critique of those labs — they have legitimate reasons to gate deployment, including liability and regulatory considerations. It is a critique of the assumption that public, "permissive-licensed" medical AI is deployment-ready by default. Often, it isn't.

The deploy-clean foundation model stack (encoders that are truly usable in a clinical product without a license addendum) is meaningfully smaller than the published benchmarks suggest.

What I've built so far
The system I'm building treats each clinical modality as a plug-in module that bolts onto a shared orchestration layer. The orchestration layer handles model routing, retrieval-augmented citation from medical guidelines, and report generation. The modules contribute domain-specific encoders and clinical heads.
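
A plug-in layout like this can be sketched as a small registry, assuming each module exposes a single prediction callable. The class and method names (`ModalityModule`, `Orchestrator`, `register`, `route`) are hypothetical, the scores are placeholders rather than model output, and retrieval and report generation are left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ModalityModule:
    """One plug-in: a name plus a callable that wraps encoder and clinical head."""
    name: str
    predict: Callable[[object], Dict[str, float]]


class Orchestrator:
    def __init__(self) -> None:
        self._modules: Dict[str, ModalityModule] = {}

    def register(self, module: ModalityModule) -> None:
        self._modules[module.name] = module

    def available(self) -> List[str]:
        return sorted(self._modules)

    def route(self, modality: str, payload: object) -> Dict[str, float]:
        """Dispatch a payload to the module registered for its modality."""
        if modality not in self._modules:
            raise KeyError(f"no module registered for modality: {modality}")
        return self._modules[modality].predict(payload)


orchestrator = Orchestrator()
# Placeholder predictors returning made-up scores, not real model output.
orchestrator.register(ModalityModule("ecg", lambda signal: {"afib": 0.12}))
orchestrator.register(ModalityModule("echo", lambda clips: {"reduced_ef": 0.80}))
result = orchestrator.route("ecg", [0.0, 0.1, 0.0])
```

Keeping the routing table separate from the modules means a new modality can be added without touching the orchestration code, and a missing modality fails loudly instead of silently returning nothing.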

Current status (all results are internal validation on public benchmarks):

Echocardiography. Frozen video foundation model, attention pooling over clip embeddings, binary reduced-ejection-fraction classification. Internal AUC in the high 0.80s with calibrated recall.

ECG. Frozen wav2vec-class encoder, linear probe across 27 SNOMED arrhythmia classes. Macro AUC around 0.91. Atrial fibrillation AUC around 0.90.
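
For readers unfamiliar with the metric, macro AUC is the unweighted mean of per-class AUCs. The sketch below assumes a multi-label setup with one binary label and one score per class per recording, and computes each AUC by direct pairwise comparison; the exact evaluation protocol behind the numbers above is not specified, so treat this as a definition rather than a reproduction.

```python
def binary_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined without both classes present
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auc(label_matrix, score_matrix):
    """Unweighted mean of per-class AUCs, skipping classes where AUC is undefined."""
    num_classes = len(label_matrix[0])
    aucs = []
    for c in range(num_classes):
        auc = binary_auc(
            [row[c] for row in label_matrix],
            [row[c] for row in score_matrix],
        )
        if auc is not None:
            aucs.append(auc)
    return sum(aucs) / len(aucs)


# Toy two-class example with four recordings.
toy_labels = [[1, 0], [0, 1], [1, 0], [0, 1]]
toy_scores = [[0.9, 0.2], [0.1, 0.8], [0.4, 0.6], [0.5, 0.7]]
toy_macro = macro_auc(toy_labels, toy_scores)  # per-class AUCs are 0.75 and 1.0
```

Because every class counts equally, a rare arrhythmia with a poor AUC pulls the macro figure down as much as a common one, which is why the macro number and the atrial fibrillation number are worth reporting separately.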

Cervical cytology. Pap-smear cell type classification, lab-tier deployment, very high AUC on the public benchmark.

Chest X-ray (tuberculosis). Passed internal validation on a public benchmark; I'm being deliberately conservative about claiming clinical-grade performance here because the published benchmark is known to be heterogeneous.

CT (lung nodule malignancy). A ConvNeXt-Tiny per-slice encoder with attention pooling across slices, AUC around 0.79. This is the locked baseline I'm now trying to beat with a multi-scale CT foundation model architecture.

Pathology (lung cancer subtyping). A generic DINOv2 encoder applied to histopathology tiles, with a small Gated Attention MIL head, hitting AUC 0.85 on the public TCGA-LUNG split. This number is interesting precisely because the encoder has no pathology-specific pretraining; the headroom from a deploy-clean medical pathology FM is significant when one exists.
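
As a rough illustration of the pooling step, gated attention MIL scores each tile with a tanh branch multiplied elementwise by a sigmoid gate, applies a softmax over tiles, and returns the attention-weighted sum of tile embeddings. The sketch below uses toy hand-picked parameters `V`, `U`, and `w` and two-dimensional embeddings, all of which are assumptions; in a trained head these parameters are learned and a classifier sits on top of the pooled vector.

```python
import math


def matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def gated_attention_pool(tiles, V, U, w):
    """Pool tile embeddings into one slide embedding with gated attention weights."""
    scores = []
    for h in tiles:
        tanh_branch = [math.tanh(x) for x in matvec(V, h)]
        gate_branch = [1.0 / (1.0 + math.exp(-x)) for x in matvec(U, h)]
        gated = [t * g for t, g in zip(tanh_branch, gate_branch)]
        scores.append(sum(wi * gi for wi, gi in zip(w, gated)))
    # Softmax over tiles (shifted by the max for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    attention = [e / total for e in exps]
    dim = len(tiles[0])
    pooled = [sum(a * h[d] for a, h in zip(attention, tiles)) for d in range(dim)]
    return pooled, attention


# Toy parameters; in training V, U, and w are learned while the encoder stays frozen.
V = [[1.0, 0.0], [0.0, 1.0]]
U = [[0.5, 0.5], [0.5, -0.5]]
w = [1.0, 1.0]
tiles = [[0.1, 0.2], [2.0, 1.5], [0.0, -0.3]]
slide_embedding, attention = gated_attention_pool(tiles, V, U, w)
```

The attention weights double as a coarse explanation: tiles with high weight are the ones that drove the slide-level embedding, which is useful when a reviewer asks where the model was looking.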

Mammography. This is where the journey was most educational. The first five iterations on the legacy CBIS-DDSM benchmark plateaued around 0.65–0.70 AUC regardless of encoder choice. The performance ceiling was the dataset, not the architecture. Switching to the cleaner VinDr-Mammo benchmark with explicit multi-view per-breast aggregation lifted the AUC from that 0.65–0.70 plateau to 0.84 in a single iteration. The architectural pattern was right; the data was the bottleneck.
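
Per-breast aggregation can be sketched at the score level, assuming each image carries a patient ID, a laterality, and a view name such as CC or MLO. Whether the aggregation here is done on scores or on features, and with which reduction, is not stated above, so the `mean` and `max` options and the record fields below are assumptions.

```python
from collections import defaultdict


def aggregate_per_breast(view_scores, reduce="mean"):
    """Combine per-view scores into one score per (patient, laterality)."""
    grouped = defaultdict(list)
    for record in view_scores:
        grouped[(record["patient"], record["side"])].append(record["score"])
    if reduce == "max":
        return {key: max(vals) for key, vals in grouped.items()}
    return {key: sum(vals) / len(vals) for key, vals in grouped.items()}


# Made-up scores for one patient with two views per breast.
views = [
    {"patient": "p1", "side": "L", "view": "CC", "score": 0.25},
    {"patient": "p1", "side": "L", "view": "MLO", "score": 0.75},
    {"patient": "p1", "side": "R", "view": "CC", "score": 0.125},
    {"patient": "p1", "side": "R", "view": "MLO", "score": 0.375},
]
per_breast = aggregate_per_breast(views)
```

Grouping by breast rather than by image also matters for evaluation: if two views of the same breast land on opposite sides of a train/test split, the reported AUC is inflated.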

Report orchestration. Built. Includes retrieval-augmented citation from a vector store of clinical guideline literature, with a structured output handler that flags grounded vs ungrounded claims in the generated report.
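
The grounded-versus-ungrounded flag can be illustrated with a deliberately crude check: a report sentence counts as grounded if enough of its tokens appear in a single retrieved passage. This lexical-overlap rule, the `threshold` value, and the function names are all assumptions for the sake of a runnable example; a production handler would more plausibly use entailment or citation spans, and the handler described above may work quite differently.

```python
import re


def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def flag_claims(report_sentences, retrieved_passages, threshold=0.6):
    """Label each sentence grounded if enough of its tokens appear in one passage."""
    passage_tokens = [tokens(p) for p in retrieved_passages]
    flagged = []
    for sentence in report_sentences:
        words = tokens(sentence)
        best = 0.0
        for ptoks in passage_tokens:
            if words:
                best = max(best, len(words & ptoks) / len(words))
        flagged.append({"claim": sentence, "grounded": best >= threshold, "support": best})
    return flagged


# Invented placeholder text, not quotations from any real guideline.
passages = ["Refer patients with reduced ejection fraction for cardiology review."]
report = [
    "Refer for cardiology review.",
    "Start high dose medication immediately.",
]
flags = flag_claims(report, passages)
```

In this toy case the first sentence is fully covered by the passage and is marked grounded, while the second shares no tokens with it and is marked ungrounded, which is the signal a health worker would need before acting on the report.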

To be explicit, these pieces are not yet built: AlphaFold-based therapeutic reasoning, longitudinal patient state representation, cross-modal contrastive alignment for joint queries across imaging and lab data, and a clinically validated user interface for the community health worker. Each of these is in the architectural plan; none of them is shipping.

What I learned that I think generalizes
Three takeaways for anyone building in this space.

First, the deployment context is the architecture. If you start from the deployment user rather than the dataset, the encoder, or the benchmark, the architectural choices become much easier and much more constrained. Offline-first eliminates entire categories of design options. Phone-tier inference forces aggressive quantization. The frozen-encoder + trainable-head pattern emerges naturally from the compute budget, and it has the side benefit of being deeply reusable across modalities.

Second, license restrictions can live in the README, not just the LICENSE file. This is the single most important non-obvious thing I've learned in the past few months. Read the model card. Twice. Forward it to whoever does your compliance review.

Third, dataset choice and aggregation strategy often dominate encoder choice. Across five mammography iterations with three different encoders, the AUC ceiling was bounded by the dataset until I switched datasets. The encoder was a small lever compared to the data. This is the boring lesson everyone says they know and almost no one acts on.

If you work on this
I'm a solo developer. The next several months are about pushing the remaining modules through internal validation, building the AlphaFold therapeutic reasoning layer, and beginning conversations about deployment partnerships in target settings.

If you work on low-resource health systems, on medical imaging foundation models, on regulatory pathways for AI-as-medical-device, or on community health worker training programs, I'd genuinely like to talk.
