DEV Community: Eke Chukwudi

Building Medical AI for the Other 90%: A Field Report from a Solo Developer

Eke Chukwudi — Wed, 03 Jun 2026 22:05:50 +0000

Notes on architecture, licensing landmines, and why I'm building offline-first medical AI for community health workers — not radiologists.

Why this exists
There is no shortage of medical AI startups, and almost all of them are building for the same user: a radiologist or pathologist working in a well-resourced hospital with cloud connectivity, an electronic health record, and the budget to license a SaaS dashboard. That market is real, but it's also crowded, slow to adopt, and over-served.

The user I care about is different. She is a Community Health Extension Worker — sometimes called a CHEW — working in a district clinic in sub-Saharan Africa, often the only clinical-grade contact a village has with the formal health system. She does not have an internet connection during the visit. She is the screening layer, the referral decision, and the patient education function all in one role. Her tools are her training, a smartphone, and whatever cheap peripherals she can carry in a shirt pocket: a digital stethoscope, sometimes a $400 smartphone-mounted fundus adapter, occasionally a portable ultrasound probe.

This is the deployment context that should shape medical AI architecture. It mostly doesn't.

What that constraint forces
Three architectural decisions follow directly from the deployment context, and they are non-negotiable:

Offline-first. Every inference path must run on a phone CPU or a district-clinic laptop with no internet. That means ONNX export with INT8 quantization, sub-50MB quantized weights for phone-tier models, and inference budgets measured in seconds, not milliseconds.
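
The 50 MB figure translates into a parameter budget with simple arithmetic. A rough sketch, assuming the weights dominate file size and ignoring ONNX graph overhead and quantization metadata; the function name and the two example parameter counts are illustrative, not models from this project:

```python
def weight_size_mb(num_params: int, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in megabytes (10**6 bytes)."""
    return num_params * bits_per_weight / 8 / 1e6

BUDGET_MB = 50  # phone-tier budget quoted above

for name, params in [("28M-parameter encoder", 28_000_000),
                     ("86M-parameter encoder", 86_000_000)]:
    fp32 = weight_size_mb(params, 32)
    int8 = weight_size_mb(params, 8)
    verdict = "fits" if int8 < BUDGET_MB else "over budget"
    print(f"{name}: FP32 {fp32:.0f} MB, INT8 {int8:.0f} MB ({verdict})")
```

At one byte per weight, the budget caps a phone-tier model at roughly 50 million parameters, which is why INT8 is a requirement here and not an optimization.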

Foundation encoders held frozen, with small trainable heads. Training a 12-billion-parameter model from scratch is not happening on the budget that serves this user. But a frozen pretrained encoder plus a small task-specific trainable head is a well-known pattern that scales cleanly across modalities. I've now applied the same dual-stage architecture across echocardiography (video), ECG signals, cervical cytology images, chest X-rays, CT volumes, and digital pathology slides. The encoder differs per modality. The training pattern does not.
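
A minimal PyTorch sketch of the frozen-encoder pattern. The tiny `encoder` below is a stand-in for a real pretrained model, and the shapes and fake data are assumptions chosen only to make the block runnable:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for a pretrained encoder; a real one would be loaded from a checkpoint.
encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU())
head = nn.Linear(64, 1)  # small task-specific head producing one binary logit

# Freeze the encoder: no gradients are computed for it and it is never updated.
encoder.requires_grad_(False)
encoder.eval()  # also fixes dropout / batch-norm behaviour, if the encoder has any

# Only the head's parameters are handed to the optimizer.
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

x = torch.randn(16, 128)                  # fake batch of inputs
y = torch.randint(0, 2, (16, 1)).float()  # fake binary labels

encoder_before = [p.clone() for p in encoder.parameters()]
head_before = head.weight.clone()

with torch.no_grad():  # encoder output is a fixed feature; it could be cached
    features = encoder(x)

optimizer.zero_grad()
loss = loss_fn(head(features), y)
loss.backward()
optimizer.step()

encoder_unchanged = all(
    torch.equal(a, b) for a, b in zip(encoder_before, encoder.parameters())
)
head_changed = not torch.equal(head_before, head.weight)
print(encoder_unchanged, head_changed)
```

Because the encoder never changes, its features can be computed once per dataset and stored, so head training stays cheap even when the encoder is large.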

License-clean from day one. This one bit me hard, and it's the most non-obvious lesson of the past few months.

The license trap
If you go to Hugging Face and look at the model cards for the most-cited medical AI foundation models (vision-language pathology FMs, chest X-ray DINOv2 derivatives, multimodal biomedical CLIP models), many of them carry license tags like Apache 2.0 or MIT. Those tags govern the model code: the inference scripts, the training pipeline, the architecture definition.

The weights, in several cases, are governed by something different. Not by a separate LICENSE file. By a paragraph buried in the model card README.

The phrase I've now seen in multiple model cards from major labs reads, almost verbatim: "Any deployed use case commercial or otherwise is out of scope." That language doesn't appear in the legal LICENSE file, which is what most engineers check. It appears in the README, which is what compliance teams read.

For a system intended for clinical deployment, those weights are radioactive. The legal license permits commercial use. The maintainers' written intent does not support it. Most institutional legal reviews respect maintainer intent.
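
One way to make this check routine is to scan the model card text itself for restrictive wording before any weights enter the stack. A small sketch; the pattern list is hypothetical and deliberately short, and a hit means "send to a human reviewer", not a legal conclusion:

```python
import re

# Hypothetical phrase list; extend it with wording your own compliance review cares about.
RESTRICTION_PATTERNS = [
    r"out of scope",
    r"research (use|purposes) only",
    r"non-?commercial",
    r"not intended for clinical",
]

def find_restrictions(model_card_text: str) -> list[str]:
    """Return the lines of a model card that match any restriction pattern."""
    hits = []
    for line in model_card_text.splitlines():
        if any(re.search(p, line, flags=re.IGNORECASE) for p in RESTRICTION_PATTERNS):
            hits.append(line.strip())
    return hits

example_card = """\
license: apache-2.0
## Intended use
Any deployed use case commercial or otherwise is out of scope.
"""

for hit in find_restrictions(example_card):
    print("REVIEW:", hit)
```

The example card carries a permissive license tag and still gets flagged, which is the situation described above.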

This is not a critique of those labs — they have legitimate reasons to gate deployment, including liability and regulatory considerations. It is a critique of the assumption that public, "permissive-licensed" medical AI is deployment-ready by default. Often, it isn't.

The deploy-clean foundation model stack (encoders that are truly usable in a clinical product without a license addendum) is meaningfully smaller than the published benchmarks suggest.

What I've built so far
The system I'm building treats each clinical modality as a plug-in module that bolts onto a shared orchestration layer. The orchestration layer handles model routing, retrieval-augmented citation from medical guidelines, and report generation. The modules contribute domain-specific encoders and clinical heads.

Current status, internal validation tier on public benchmarks:

Echocardiography. Frozen video foundation model, attention pooling over clip embeddings, binary reduced-ejection-fraction classification. Internal AUC in the high 0.80s with calibrated recall.

ECG. Frozen wav2vec-class encoder, linear probe across 27 SNOMED arrhythmia classes. Macro AUC around 0.91. Atrial fibrillation AUC around 0.90.

Cervical cytology. Pap-smear cell type classification, lab-tier deployment, very high AUC on the public benchmark.

Chest X-ray (tuberculosis). Promoted internally on a public benchmark; I'm being deliberately conservative about claiming clinical-grade performance here because the published benchmark is known to be heterogeneous.

CT (lung nodule malignancy). A ConvNeXt-Tiny per-slice encoder with attention pooling across slices, AUC around 0.79. This is the locked baseline I'm now trying to beat with a multi-scale CT foundation model architecture.

Pathology (lung cancer subtyping). A generic DINOv2 encoder applied to histopathology tiles, with a small Gated Attention MIL head, hitting AUC 0.85 on the public TCGA-LUNG split. This number is interesting precisely because the encoder has no pathology-specific pretraining; the headroom from a deploy-clean medical pathology FM is significant when one exists.

Mammography. This is where the journey was most educational. The first five iterations on the legacy CBIS-DDSM benchmark plateaued around 0.65–0.70 AUC regardless of encoder choice. The performance ceiling was the dataset, not the architecture. Switching to the cleaner VinDr-Mammo benchmark with explicit multi-view per-breast aggregation lifted the number from 0.65 to 0.84 in a single iteration. The architectural pattern was right; the data was the bottleneck.

Report orchestration. Built. Includes retrieval-augmented citation from a vector store of clinical guideline literature, with a structured output handler that flags grounded vs ungrounded claims in the generated report.
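
Several of the modules above share one trainable piece: attention pooling over a bag of frozen-encoder embeddings (clips for echo, slices for CT, tiles for pathology). A sketch of a gated attention pooling head in the usual multiple-instance-learning style. The embedding width of 384, the hidden size, and the random "tiles" are assumptions for illustration, not the project's actual configuration:

```python
import torch
import torch.nn as nn

class GatedAttentionPool(nn.Module):
    """Pool a bag of embeddings (N, D) into one vector (D,) with learned weights."""

    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.V = nn.Linear(dim, hidden)  # tanh branch
        self.U = nn.Linear(dim, hidden)  # sigmoid gate branch
        self.w = nn.Linear(hidden, 1)    # one attention score per instance

    def forward(self, bag: torch.Tensor):
        gated = torch.tanh(self.V(bag)) * torch.sigmoid(self.U(bag))  # (N, hidden)
        weights = torch.softmax(self.w(gated), dim=0)                  # (N, 1), sums to 1
        pooled = (weights * bag).sum(dim=0)                            # (D,)
        return pooled, weights.squeeze(-1)

torch.manual_seed(0)
pool = GatedAttentionPool(dim=384)
classifier = nn.Linear(384, 2)

tiles = torch.randn(200, 384)  # stand-in for 200 tile embeddings from a frozen encoder
pooled, attn = pool(tiles)
logits = classifier(pooled)
print(pooled.shape, attn.shape, logits.shape)
```

The attention weights double as a rough indication of which instances drove the prediction, which can be useful when a reviewer wants to see the slices or tiles behind a score.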

Not yet built, and I want to be explicit about it: AlphaFold-based therapeutic reasoning, longitudinal patient state representation, cross-modal contrastive alignment for joint queries across imaging and lab data, and a clinically validated user interface for the community health worker. Each of these is in the architectural plan; none of them is shipping.

What I learned that I think generalizes
Three takeaways for anyone building in this space.

First, the deployment context is the architecture. If you start from the deployment user rather than the dataset, the encoder, or the benchmark, the architectural choices become much easier and much more constrained. Offline-first eliminates entire categories of design options. Phone-tier inference forces aggressive quantization. The frozen-encoder + trainable-head pattern emerges naturally from the compute budget, and it has the side benefit of being deeply reusable across modalities.

Second, usage restrictions can live in the README, not the LICENSE file. This is the single most important non-obvious thing I've learned in the past few months. Read the model card. Twice. Forward it to whoever does your compliance review.

Third, dataset choice and aggregation strategy often dominate encoder choice. Across five mammography iterations with three different encoders, the AUC ceiling was bounded by the dataset until I switched datasets. The encoder was a small lever compared to the data. This is the boring lesson everyone says they know and almost no one acts on.

If you work on this
I'm a solo developer. The next several months are about pushing the remaining modules through internal validation, building the AlphaFold therapeutic reasoning layer, and beginning conversations about deployment partnerships in target settings.

If you work on low-resource health systems, on medical imaging foundation models, on regulatory pathways for AI-as-medical-device, or on community health worker training programs, I'd genuinely like to talk.

Intercepting Gradients in PyTorch: Preprocess the Update Before Your Optimizer Sees It

Eke Chukwudi — Mon, 01 Jun 2026 12:52:00 +0000

Most people tune the optimizer. Almost nobody touches the thing the optimizer actually eats: the gradient. But the gap between loss.backward() and optimizer.step() is a real hook point, and you can do useful work there. This is a short, runnable guide to intercepting gradients and transforming them before the update lands.

The mental model

A training step is really four moves:

1. Forward pass: compute predictions and loss
2. Backward pass: loss.backward() fills parameter.grad for every parameter
3. Update: optimizer.step() reads those .grad tensors and adjusts the weights
4. Reset: optimizer.zero_grad()

The window we care about is between steps 2 and 3. After backward(), the gradients exist as plain tensors in p.grad. The optimizer has not touched them yet. That is where we intervene.
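
To see that window directly, a short sketch that inspects `.grad` before and after `backward()`. The one-layer model and random data are placeholders:

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 1)
x = torch.randn(8, 3)
y = torch.randn(8, 1)

# Before any backward pass there is nothing to intercept: every .grad is None.
grads_before = [p.grad for p in model.parameters()]

loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# After backward(), each .grad is an ordinary tensor shaped like its parameter.
for name, p in model.named_parameters():
    print(name, tuple(p.grad.shape), p.grad.norm().item())
```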

Step 1: a baseline step

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.randn(64, 10)
y = torch.randn(64, 1)

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()

Standard. Now let's get between the last two lines.

Step 2: transform the gradients in place

After backward(), iterate the parameters and modify each .grad before stepping:

def soft_threshold(g, lam=1e-4):
    # shrink every gradient component toward zero by lam
    # this is the core operation behind wavelet denoising
    return torch.sign(g) * torch.clamp(g.abs() - lam, min=0.0)

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()

for p in model.parameters():
    if p.grad is not None:
        p.grad = soft_threshold(p.grad)

optimizer.step()

That is the entire idea. Anything you can express as a function of a tensor, you can apply to the gradient: clipping, smoothing, denoising, sign-based updates, masking. The optimizer never knows the difference.
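
A few of those alternatives written as plain tensor functions, applied to a tiny hand-made gradient so the effect of each is visible. The helper names `clip_value`, `sign_only`, and `top_k_mask` are made up for this sketch:

```python
import torch

def soft_threshold(g, lam=1e-4):
    return torch.sign(g) * torch.clamp(g.abs() - lam, min=0.0)

def clip_value(g, limit):
    # clamp every component into [-limit, limit]
    return g.clamp(min=-limit, max=limit)

def sign_only(g):
    # keep the direction of each component, discard its magnitude
    return torch.sign(g)

def top_k_mask(g, k):
    # keep the k largest-magnitude components, zero the rest
    flat = g.flatten()
    keep = flat.abs().topk(k).indices
    mask = torch.zeros_like(flat)
    mask[keep] = 1.0
    return (flat * mask).view_as(g)

g = torch.tensor([-0.30, 0.05, 0.20, -0.01])
print(soft_threshold(g, lam=0.1))  # components with |g| <= lam become zero
print(clip_value(g, limit=0.1))
print(sign_only(g))
print(top_k_mask(g, k=2))
```

Any of these can be dropped into the loop above in place of `soft_threshold`, as long as the function returns a tensor with the same shape as its input.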

Step 3: make it reusable and optimizer-agnostic

Editing the training loop by hand gets messy. Wrap it instead, so the transform works with any first-order optimizer without changing your loop:

class GradTransform:
    def __init__(self, optimizer, transform):
        self.optimizer = optimizer
        self.transform = transform

    def zero_grad(self, *args, **kwargs):
        self.optimizer.zero_grad(*args, **kwargs)

    def step(self, *args, **kwargs):
        for group in self.optimizer.param_groups:
            for p in group["params"]:
                if p.grad is not None:
                    p.grad = self.transform(p.grad)
        self.optimizer.step(*args, **kwargs)

Usage is a drop-in:

base = torch.optim.Adam(model.parameters(), lr=1e-3)
optimizer = GradTransform(base, soft_threshold)

# training loop is unchanged
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()

Swap Adam for SGD and it still works, because the wrapper only touches .grad, which every first-order optimizer reads the same way.
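
A quick way to convince yourself of that is to check the arithmetic with plain SGD, where the update is easy to predict. The sketch below repeats the wrapper so it runs on its own; the learning rate and toy data are arbitrary:

```python
import torch
import torch.nn as nn

def soft_threshold(g, lam=1e-4):
    return torch.sign(g) * torch.clamp(g.abs() - lam, min=0.0)

class GradTransform:  # same wrapper as above, repeated so this block is self-contained
    def __init__(self, optimizer, transform):
        self.optimizer = optimizer
        self.transform = transform

    def zero_grad(self, *args, **kwargs):
        self.optimizer.zero_grad(*args, **kwargs)

    def step(self, *args, **kwargs):
        for group in self.optimizer.param_groups:
            for p in group["params"]:
                if p.grad is not None:
                    p.grad = self.transform(p.grad)
        self.optimizer.step(*args, **kwargs)

torch.manual_seed(0)
model = nn.Linear(10, 1)
lr = 0.1
optimizer = GradTransform(torch.optim.SGD(model.parameters(), lr=lr), soft_threshold)

x, y = torch.randn(64, 10), torch.randn(64, 1)
optimizer.zero_grad()
nn.functional.mse_loss(model(x), y).backward()

w_before = model.weight.detach().clone()
raw_grad = model.weight.grad.clone()
optimizer.step()

# Plain SGD (no momentum, no weight decay) moves by -lr * transformed gradient.
expected = w_before - lr * soft_threshold(raw_grad)
print(torch.allclose(model.weight, expected))
```

One caveat with this minimal wrapper: it is not an `Optimizer` subclass and only exposes `zero_grad` and `step`, so anything else that expects a real optimizer (a learning-rate scheduler, `state_dict()` for checkpoints) should be pointed at the wrapped `base` optimizer instead.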

A quicker alternative: tensor hooks

If you want the transform to fire automatically during the backward pass instead of after it, register a hook directly on a parameter:

for p in model.parameters():
    p.register_hook(lambda g: soft_threshold(g))

The hook runs as the gradient is computed. The wrapper approach is usually easier to reason about because everything happens in one obvious place, but hooks are handy when you want the change to be invisible to the rest of your code.
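
`register_hook` returns a handle, and keeping it matters: hooks stay attached until removed, and registering the same hook twice applies the transform twice. A sketch that compares the gradient with no hook, with the hook, and after removal; the large `lam` is chosen only to make the effect obvious:

```python
import torch
import torch.nn as nn

def soft_threshold(g, lam=1e-4):
    return torch.sign(g) * torch.clamp(g.abs() - lam, min=0.0)

torch.manual_seed(0)
model = nn.Linear(10, 1)
x, y = torch.randn(64, 10), torch.randn(64, 1)

def backward_pass():
    model.zero_grad()
    nn.functional.mse_loss(model(x), y).backward()
    return model.weight.grad.clone()

raw = backward_pass()  # no hooks registered yet

lam = 0.05  # deliberately large so the change is visible
handles = [p.register_hook(lambda g: soft_threshold(g, lam)) for p in model.parameters()]
hooked = backward_pass()  # the hook rewrote the gradient during backward

for h in handles:  # hooks persist until explicitly removed
    h.remove()
restored = backward_pass()

print(torch.allclose(hooked, soft_threshold(raw, lam)), torch.allclose(restored, raw))
```

Mixing hooks with the wrapper from Step 3 would also apply the transform twice, so it is worth picking one mechanism per model.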

One honest warning

Intercepting gradients is powerful, which means it is also a good way to quietly break training. A transform that helps on a noisy problem can actively hurt on a clean one, where the gradient signal was fine to begin with. So treat any gradient transform as a hypothesis, not a free win: run it against an untouched baseline on your actual task, with the same seeds, and keep it only if the numbers say so. The hook is easy. Earning the improvement is the hard part.
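
A bare-bones version of that comparison, assuming a toy regression task and final training loss as the metric. On a real problem you would compare a held-out metric over more seeds; the `run` helper and its defaults are illustrative:

```python
import torch
import torch.nn as nn

def soft_threshold(g, lam=1e-4):
    return torch.sign(g) * torch.clamp(g.abs() - lam, min=0.0)

def run(seed, transform=None, steps=100):
    """Train a toy regressor and return the last recorded training loss."""
    torch.manual_seed(seed)  # same initial weights and same data for a given seed
    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    x = torch.randn(256, 10)
    y = x.sum(dim=1, keepdim=True) + 0.1 * torch.randn(256, 1)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        if transform is not None:
            for p in model.parameters():
                if p.grad is not None:
                    p.grad = transform(p.grad)
        optimizer.step()
    return loss.item()

seeds = [0, 1, 2]
baseline = [run(s) for s in seeds]
treated = [run(s, soft_threshold) for s in seeds]
for s, b, t in zip(seeds, baseline, treated):
    print(f"seed {s}: baseline {b:.4f}  transformed {t:.4f}  delta {t - b:+.4f}")
```

Pairing runs by seed means each delta reflects the transform alone, not a lucky initialization.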

If you want to see this idea taken all the way, soft-thresholding the gradient is exactly the building block behind WaveGuard, a gradient denoiser I built that swaps the flat threshold for a Haar wavelet transform and gates it so it stays quiet when the gradient is already clean. I benchmarked it honestly, including a task where it actively made things worse.
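
For intuition only, here is what "thresholding in a Haar basis" can look like at its simplest: one decomposition level over the flattened gradient, with soft-thresholding applied to the detail coefficients. This is an illustrative sketch written for this post's readers, not WaveGuard's code, and it omits the gating entirely; the linked repository is the reference for the real thing.

```python
import math
import torch

def soft_threshold(g, lam):
    return torch.sign(g) * torch.clamp(g.abs() - lam, min=0.0)

def haar_denoise(g, lam):
    """One-level Haar transform, soft-threshold the detail coefficients, invert."""
    flat = g.flatten()
    n = flat.numel()
    if n % 2:  # pad odd lengths with a single zero
        flat = torch.cat([flat, flat.new_zeros(1)])
    a, b = flat[0::2], flat[1::2]
    approx = (a + b) / math.sqrt(2)  # local averages, kept as they are
    detail = (a - b) / math.sqrt(2)  # local differences, shrunk toward zero
    detail = soft_threshold(detail, lam)
    out = torch.empty_like(flat)
    out[0::2] = (approx + detail) / math.sqrt(2)
    out[1::2] = (approx - detail) / math.sqrt(2)
    return out[:n].view_as(g)

g = torch.tensor([1.0, 1.0, 0.5, 0.3])
print(haar_denoise(g, lam=0.0))   # no thresholding: the transform round-trips
print(haar_denoise(g, lam=10.0))  # all detail removed: each pair becomes its mean
```

Unlike the flat threshold, this shrinks differences between neighbouring components and leaves their shared level alone, so a large but smooth gradient passes through unchanged.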

Write-up here: https://medium.com/@chukwudieke61/adam-cant-hear-the-signal-through-the-noise-waveguard-can-ffb1d8963a38

Code here: https://github.com/Harry-Potter20/wavelet-grad