Pyae Sone

Posted on Jun 8

Same Weights, Same Prompt, Different Triage Level

#ai #rag #llm #programming

On quantized inference nondeterminism, rules-first clinical AI, and what five years of medicine taught me about building systems that fail safely.

TL;DR: The same quantized model, given the same prompt, returned a different triage output on different hardware. This post explains the mechanism, why it matters specifically for clinical AI, and what a rules-first architecture does about it.

I ran a 4-bit medical triage model on a laptop GPU and on a CPU. For one patient, the GPU said urgent and the CPU said emergency. Same model file, same prompt, same input.

I sat with that result for a while. Then I opened a note that just said: this is the most important thing in the project.

To explain why it matters and why it scared me in a way that a normal software bug would not, I need to start somewhere else.

Five years of medicine, and then a coup

I spent five years studying to be a doctor at the University of Medicine 1 in Yangon. I did clinical rotations. I delivered babies at Central Women's Hospital. I went on rural healthcare missions where triage was not an algorithm. It was a nurse with thirty seconds and a pulse oximeter making a call that would determine whether someone survived the next hour.

Then in 2021, Myanmar's military staged a coup. The healthcare system collapsed, and I left.

I retrained as a software engineer in Singapore and ended up at the intersection of the two things I knew: medicine and computation. When I started building Aegis-MD, an ED Triage Console that runs entirely on local hardware with no data leaving the machine, I had watched enough triage to know better than to try replacing it with AI.

I wanted to demonstrate something narrower: that you could build a medical AI system whose failure modes were visible, whose design decisions were auditable, and whose trust was earned through architecture rather than claimed through marketing. That requires being honest about what breaks.

The setup, and why a 4-bit model

Two things about Aegis-MD's design matter for what follows.

First, it is local by design. Triage data is highly sensitive, so nothing leaves the machine. The trade-off is that I run a small, heavily quantized model (MedGemma-1.5 4B at Q4_K_XL, about 3.4 GB) rather than calling a frontier API. Four-bit weights are the price of running offline on consumer hardware.

Second, I deliberately tested on two configurations. The intended deployment is local GPU inference using an RTX 5070 Ti Mobile with 12 GB VRAM. The public demo runs CPU-only on Cloud Run because GPU instances require a paid quota I lack. I ran the same evaluation set against both environments using the same model, same code, and same prompts.

The eval consists of 17 hand-written ED cases spanning all five ATS levels, from cardiac arrest down to a medical-certificate request. Seventeen cases is a smoke test, not a validation. I will not quote a percentage off a sample that small. The headline was unremarkable: 16/17 on GPU, 15/17 on CPU. What mattered was which cases diverged.

The architecture decision I kept second-guessing

The finding only makes sense against the architecture itself.

My first instinct was the obvious one. I put the LLM in charge, used RAG to ground it in clinical guidelines, and let MedGemma-1.5 reason its way to an ATS category. I built it that way, and then I tore it out.

While MedGemma-1.5 can produce surprisingly structured clinical thinking, you cannot audit its reasoning. When a system processing patient presentations gives you a 3 instead of a 2, you need to know why, not just that a transformer's attention heads found "diaphoresis" near "chest pain" and weighted that appropriately. A triage nurse who gets an urgency wrong can explain their rationale. A quantized LLM cannot.

What I built instead is best described as rules with LLM assist. The system has three tiers: full RAG feeding into the LLM, LLM-only when ChromaDB is unavailable, and deterministic keyword matching when the LLM is down entirely. The LLM can escalate urgency beyond what the rule layer suggests. It cannot decrease urgency below a definitive keyword match. When the ATS-1 discriminator fires on "cardiac arrest," no amount of LLM confidence in the other direction matters.
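The authority rule in that paragraph can be sketched in a few lines. This is a minimal illustration, assuming ATS levels are integers from 1 to 5 where a lower number is more urgent. The keyword table and the names `rule_floor` and `final_category` are hypothetical, not the Aegis-MD source.

```python
from typing import Optional

# Hypothetical discriminator table: keyword -> ATS level (1 = most urgent).
KEYWORD_FLOORS = {
    "cardiac arrest": 1,
    "laceration": 4,
}

def rule_floor(chief_complaint: str) -> Optional[int]:
    """Return the most urgent ATS level matched by any keyword, or None."""
    text = chief_complaint.lower()
    hits = [level for kw, level in KEYWORD_FLOORS.items() if kw in text]
    return min(hits) if hits else None

def final_category(llm_level: int, floor: Optional[int]) -> int:
    """The LLM may escalate (lower number) but never de-escalate past the floor."""
    if floor is None:
        return llm_level
    return min(llm_level, floor)

# The LLM says ATS-5, the rule layer matched "cardiac arrest": the floor wins.
arrest = final_category(5, rule_floor("Witnessed cardiac arrest"))
# The LLM says ATS-2, the rule layer matched ATS-4: the escalation stands.
escalated = final_category(2, rule_floor("hand laceration"))
```

Taking the minimum of the two levels is what makes the floor one-directional: the LLM can only ever move the result toward more urgency.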

This felt like the right architecture philosophically, until I watched it fail.

Evaluation case 9 was an 80-year-old with a head injury, anticoagulated on warfarin. The expected result was ATS-3. It returned ATS-4 consistently on both GPU and CPU, every run.

The root cause was immediately obvious, and embarrassing. The ATS-4 keyword list matched "laceration" in the chief complaint text. The rule layer inspects free text only and ignores structured fields, so the comorbidity flags (anticoagulants: true, age: 80, mechanism: fall) were invisible to the safety floor.

This is a class of failure. A keyword-based safety floor built on chief complaint text will systematically fail on presentations where the urgency comes from context rather than the presenting complaint itself. An elderly anticoagulated patient with a seemingly minor head injury is exactly the case where an experienced triage nurse's pattern recognition catches what a textbook algorithm misses. The fix is to wire structured risk modifiers into the rule tier. I have not shipped it yet. Shipping an incomplete fix to a safety-relevant component felt worse than leaving the known limitation visible and documented.
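One possible shape for that unshipped fix, as a sketch only. The `Presentation` fields mirror the flags named above; the function `apply_risk_modifiers` and the single rule inside it are hypothetical placeholders for illustration, not a clinical rule and not the planned implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Presentation:
    chief_complaint: str
    age: int
    anticoagulants: bool
    mechanism: str

def apply_risk_modifiers(text_floor: int, p: Presentation) -> int:
    """Tighten a text-derived floor using structured fields (lower = more urgent)."""
    floor = text_floor
    head_injury = "head" in p.chief_complaint.lower()
    # Hypothetical rule: an anticoagulated head injury is never
    # triaged as less urgent than ATS-3, whatever the keyword match said.
    if head_injury and p.anticoagulants:
        floor = min(floor, 3)
    return floor

# Case 9 as described: the keyword layer alone produced ATS-4.
case_9 = Presentation("head injury with laceration", 80, True, "fall")
corrected = apply_risk_modifiers(4, case_9)
```

The point of the design is that the structured fields can only tighten the floor, so adding a modifier can never make an existing keyword match less urgent.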

The thing that shouldn't happen

Case 8 was renal colic, a kidney stone. It is painful and urgent, but not immediately life-threatening, making it an ATS-3.

Case 8 : renal colic (expected ATS-3)
  GPU (RTX 5070 Ti Mobile):  ATS-3  ✓
  CPU (4 vCPU Docker):       ATS-2  ✗  (over-triage)

Same weights. Same prompt. Same patient data. Same Ollama binary. The only variable was the hardware the forward pass ran on, and the model returned a different clinical category.

It erred on the safe side this time, as ATS-2 is more urgent than ATS-3, meaning a real patient would have been seen sooner. But safety by luck is not a system property. Nothing in the mechanism guarantees the next hardware-induced flip rounds toward caution. Case 9, the anticoagulated head injury, failed identically on both machines in the unsafe direction. These are two different failure modes. One lives in my rule layer. The other moves with the silicon.

How hardware divergence works

I did the obvious checks. I verified that both machines had the same model digest, the same prompt bytes, and the same decoding settings (greedy, temperature 0). You do not sample a clinical decision. I found no code path keyed on device type.

Nothing in my code caused the divergence. It lived a layer below in how the same arithmetic gets executed on two different processors.

If you ask why two runs of an LLM disagree, the stock answer usually points to floating-point addition. Floating-point addition is not associative, GPUs run thousands of threads concurrently, and the order those threads finish in is unpredictable. So the rounding differs from run to run, and small numeric noise snowballs into different tokens.

That explanation is incomplete. A September 2025 paper from Horace He and Thinking Machines Lab, Defeating Nondeterminism in LLM Inference, explains why. Their argument is that the hot path of a normal forward pass does not lean on atomic adds or racing threads, so for a fixed shape and fixed schedule, individual kernels are run-to-run deterministic. The reason a temperature-0 endpoint still gives different answers is that production servers batch your request with others, the batch size fluctuates with traffic, and many kernels are not batch-invariant. Change the batch, change the reduction, change the result.

That diagnoses same-endpoint nondeterminism, but it does not describe my situation. I was not getting different answers from one server under varying load. I was getting different answers from two entirely different backends: llama.cpp's CUDA path on the GPU versus its CPU path.

Cross-backend divergence is a completely different axis. Two backends are two separate implementations of the same math. They tile and block their matrix multiplies differently, reduce in different orders, accumulate at different precisions, and fuse operations differently. Bitwise agreement across two such code paths was never guaranteed. They are different computations approximating the same ideal.

The honest framing is that same-hardware temperature-0 nondeterminism is a batch-invariance problem. Cross-hardware divergence is a floating-point and implementation problem living in the gap between two backends. Both mean the forward pass is not a stable function of your inputs alone.
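The reduction-order part of that gap can be reproduced in plain Python doubles. In this minimal sketch, `sequential_sum` and `blocked_sum` are hypothetical stand-ins for two backends that tile the same reduction differently. The input values are chosen to make the effect large; real divergences are far smaller.

```python
def sequential_sum(xs):
    """Add left to right, the way a simple scalar loop would."""
    total = 0.0
    for x in xs:
        total += x
    return total

def blocked_sum(xs, block):
    """Sum fixed-size blocks first, then add the partial sums."""
    partials = [sequential_sum(xs[i:i + block]) for i in range(0, len(xs), block)]
    return sequential_sum(partials)

xs = [1e16, 1.0, -1e16, 1.0]

left_to_right = sequential_sum(xs)   # 1.0
tiled = blocked_sum(xs, 2)           # 0.0
```

Both functions compute the same mathematical sum over the same four numbers. They disagree because adding 1.0 to 1e16 is rounded away, and which additions get rounded depends on the grouping.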

Why quantization pours fuel on it

Four-bit weights make this worse on two independent counts.

The reconstruction is backend-specific. A Q4_K weight is a 4-bit code plus per-block scales that expand back to a usable precision at runtime. That dequantization arithmetic is implemented per backend, so the reconstructed weights themselves differ slightly between CPU and GPU before a single multiply happens.

The logit landscape is flatter. Quantizing to 4 bits blurs the model. A blurrier model is less decisive, meaning more of its next-token decisions sit as near-ties between two candidates with almost-equal logits. Near-ties are exactly the decisions that a tiny numerical difference, down in the last bits of a logit, can tip. A full-precision model that is confident about the next token will pick the same one even if the logits wobble in the seventh decimal place. A 4-bit model hovering between two tokens will not.
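The first count can be illustrated with a deliberately simplified dequantization. This is not the real Q4_K layout. It assumes one scale and one minimum per block, and the two rounding formats are hypothetical stand-ins for two backends that keep intermediates at different precisions. It makes no claim about which precision any real backend uses.

```python
import struct

def round_to(fmt, x):
    """Round a Python float to IEEE half ('e') or single ('f') precision."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

def dequantize(codes, scale, minimum, fmt):
    """Simplified block dequantization: w = scale * code - minimum.

    Every intermediate is rounded to `fmt` to mimic a backend's working precision.
    """
    weights = []
    for q in codes:
        product = round_to(fmt, scale * q)
        weights.append(round_to(fmt, product - minimum))
    return weights

codes = [0, 3, 7, 12, 15]        # 4-bit codes, each in 0..15
scale, minimum = 0.0123, 0.05    # assumed per-block parameters

w_half = dequantize(codes, scale, minimum, "e")
w_single = dequantize(codes, scale, minimum, "f")
max_gap = max(abs(a - b) for a, b in zip(w_half, w_single))
```

The same codes and the same block parameters reconstruct to slightly different weights. The gap is small, but it exists before any matrix multiply runs.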

How a rounding error becomes a different diagnosis

Greedy decoding picks the argmax logit at each step. Argmax is stable until two logits cross. At a near-tie it is discontinuous, and an arbitrarily small change flips the winner.

token A logit: 8.4012  ← GPU picks A
token B logit: 8.4011
                        ← CPU, with slightly different arithmetic, picks B

A cross-backend difference of 0.0001 in a logit, roughly one part in a hundred thousand at this magnitude, is enough to swap which token wins. Because generation is autoregressive, that one different token becomes part of the context for everything after it. A different word early reshapes the whole completion, which my parser then maps to a different ATS category. A perturbation far below anything you would notice in a single number cascades into a categorically different clinical output.
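The flip itself takes very little code to show. This sketch reuses the two logits from the illustration above. The size of the perturbation is an assumption chosen for the example, not a measured cross-backend difference.

```python
def greedy_pick(logits):
    """Greedy decoding: return the token with the highest logit."""
    return max(logits, key=logits.get)

gpu_logits = {"A": 8.4012, "B": 8.4011}

# Assumed perturbation: the other backend's arithmetic nudges each logit by 0.0001.
cpu_logits = {"A": gpu_logits["A"] - 0.0001, "B": gpu_logits["B"] + 0.0001}

gpu_token = greedy_pick(gpu_logits)   # "A"
cpu_token = greedy_pick(cpu_logits)   # "B"
```

Argmax has no notion of "almost equal". Once the two logits cross, the output changes completely, no matter how small the nudge was.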

(I am describing the mechanism that fits the observation. The rigorous confirmation requires instrumenting the divergence point and measuring the logit margin where the two backends part ways, which is the experiment I will run next.)
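A sketch of what that instrumentation might look like, assuming each backend can dump its generated tokens and per-step logits. The helpers `first_divergence` and `top2_margin` are hypothetical names, and the token lists are made-up examples rather than captured output.

```python
def first_divergence(tokens_a, tokens_b):
    """Index of the first position where two token sequences differ, or None."""
    for i, (a, b) in enumerate(zip(tokens_a, tokens_b)):
        if a != b:
            return i
    if len(tokens_a) != len(tokens_b):
        # One sequence is a strict prefix of the other.
        return min(len(tokens_a), len(tokens_b))
    return None

def top2_margin(logits):
    """Gap between the best and second-best logit at one decoding step."""
    ranked = sorted(logits.values(), reverse=True)
    return ranked[0] - ranked[1]

# Made-up example sequences from two backends.
gpu_tokens = ["The", "patient", "is", "urgent"]
cpu_tokens = ["The", "patient", "requires", "emergency"]

split_at = first_divergence(gpu_tokens, cpu_tokens)
margin = top2_margin({"A": 8.4012, "B": 8.4011, "C": 1.0})
```

If the mechanism described above is right, the margin at the split index should be tiny compared with the margins at the steps before it.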

Why clinical AI breaks the comfort zone

For most LLM applications this is a curiosity. If a chatbot phrases an answer two equally good ways depending on the GPU, nobody is harmed.

Triage breaks that comfort because the output space is discrete, ordinal, and load-bearing. ATS-2 and ATS-3 carry different time-to-treatment targets of 10 minutes versus 30 minutes. The category is the product. When the category moves with the hardware, you lose reproducibility, which is a prerequisite for trustworthy clinical software.

You cannot meaningfully validate a system whose answers depend on the machine. A study showing it triages correctly on your workstation guarantees nothing about the CPU node it ships on. You cannot audit an adverse event if the same inputs yield different outputs elsewhere.

What this retroactively justified and what I will change next

What the finding retroactively justified: the rules-first architecture. Before this, rules-first was a defensible architectural taste. After watching a category flip with the silicon, it is the only honest design. You do not hand a final, discrete decision to a component that will not reproduce across hardware. The deterministic rule floor gives the same answer on every machine. The LLM adds nuance and a readable rationale, while the rules guarantee the floor does not move when you change processors.

What I will change: Greedy decoding at temperature 0 is necessary but not sufficient. It kills sampling noise but does nothing about the cross-backend gap. A fixed seed only pins one backend's random number generator. It has nothing to act on under greedy decoding and means nothing across two implementations.

The practical moves, in increasing order of utility:

Pin the whole stack. The model is not medgemma-q4. It is medgemma-q4 on a specific backend on a specific device. I plan to version the backend, quant format, llama.cpp version, and target hardware together.

Test for consistency across hardware. My eval runner supports repeat runs to measure run-to-run stability. I will extend this to diff outputs across hardware configurations and treat any category disagreement as a first-class failure.

Wire structured risk modifiers into the rule tier. This fixes case 9, and more importantly fixes presentations where urgency comes from context rather than the chief complaint text.
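The first two moves can be sketched together: a pinned stack description used as the key for eval results, and a diff that reports any case where two stacks disagree. `StackPin` and `category_disagreements` are hypothetical names, and the digest and version strings are placeholders rather than real values.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StackPin:
    """Everything that has to match for two runs to count as the same model."""
    model_digest: str
    quant_format: str
    backend: str
    backend_version: str
    device: str

def category_disagreements(results_by_stack):
    """Return {case_id: {stack: category}} for every case where stacks disagree."""
    case_ids = set()
    for results in results_by_stack.values():
        case_ids.update(results)
    disagreements = {}
    for case_id in sorted(case_ids):
        per_stack = {s: r.get(case_id) for s, r in results_by_stack.items()}
        if len(set(per_stack.values())) > 1:
            disagreements[case_id] = per_stack
    return disagreements

gpu = StackPin("<digest>", "Q4_K_XL", "llama.cpp CUDA", "<version>", "RTX 5070 Ti Mobile")
cpu = StackPin("<digest>", "Q4_K_XL", "llama.cpp CPU", "<version>", "4 vCPU Docker")

# Cases 8 and 9 as reported in this post.
flagged = category_disagreements({gpu: {8: 3, 9: 4}, cpu: {8: 2, 9: 4}})
```

Case 9 is wrong on both stacks but consistent, so it does not show up here. Case 8 does. Keeping the two checks separate keeps the two failure modes separate.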

Security as patient safety

My other professional thread is cybersecurity. I spent six months red teaming at the Land Transport Authority doing endpoint hardening and purple team exercises. When I built the security layer for Aegis-MD, I brought that adversarial mindset with me.

Prompt injection in a clinical context is not just a nuisance. If a bad actor overrides the system prompt of a triage agent, they can make it classify a cardiac arrest as ATS-5.

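The security gateway does two things most LLM applications skip. First, it applies NFKC normalization before pattern matching. Attackers use look-alike characters to construct instructions that bypass regex but read correctly to an LLM. NFKC folds compatibility forms such as fullwidth letters, ligatures, and styled mathematical alphabets into their plain equivalents, so the patterns see something closer to what the model sees. NFKC on its own does not map Cyrillic or Greek homoglyphs to Latin letters, so those need an explicit confusables mapping on top. Second, it uses scored heuristics rather than binary blocks, so borderline cases generate an audit trail without degrading usability for legitimate clinical inputs.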
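A minimal sketch of those two steps using Python's standard `unicodedata` module. The patterns and weights are hypothetical and far smaller than a real gateway's list. Cross-script homoglyphs are left untouched by NFKC and would need a separate confusables map, which is not shown here.

```python
import re
import unicodedata

# Hypothetical patterns and weights, for illustration only.
SCORED_PATTERNS = [
    (re.compile(r"ignore (all )?(previous|prior) instructions", re.I), 0.8),
    (re.compile(r"system prompt", re.I), 0.4),
]

def normalize(text: str) -> str:
    """Fold compatibility forms (fullwidth letters, ligatures) before matching."""
    return unicodedata.normalize("NFKC", text)

def injection_score(text: str) -> float:
    """Sum the weights of every matching pattern, capped at 1.0."""
    folded = normalize(text)
    score = sum((w for pat, w in SCORED_PATTERNS if pat.search(folded)), 0.0)
    return min(score, 1.0)

# "ignore" written in fullwidth letters, which a naive regex would miss.
disguised = "\uff49\uff47\uff4e\uff4f\uff52\uff45 previous instructions"
suspicious = injection_score(disguised)
benign = injection_score("chest pain radiating to left arm")
```

Returning a score instead of a boolean lets the caller log borderline inputs for audit and reserve blocking for high scores.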

Regex blocklists are bypassable by construction. This layer is a demonstration of defense-in-depth principles applied to clinical AI, not a bypass-proof control.

What I would still do differently

The RAG corpus is the wrong shape. The five documents I used (WHO pediatric guidelines, MOH hypertension, etc.) are mostly disease-specific rather than triage-specific. When the retriever surfaces a hypertension management chunk in a chest pain presentation, it provides background rather than urgency reasoning. Retrieval relevance is entirely unmeasured here.

Vision is unevaluated. The multimodal component classifies wound and rash images with zero quantitative accuracy results. A 4B quantized multimodal model doing visual risk stratification without validation against a labelled dataset is not clinically meaningful. The vision component demonstrates that multimodal fusion is architecturally possible, but it is not evidence that it works.

Seventeen cases is not an evaluation. The 95% confidence interval on 16/17 extends from the low 70s to near-certainty. The cases are textbook presentations with no held-out set and no independent rater. This proves the system handles clean cases. It says nothing about the failure rate on the messy, incomplete presentations that dominate real ED triage.
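The width of that interval is easy to check. This sketch uses the Wilson score interval, which is an assumption on my part since several interval methods exist for small samples. They differ in detail but agree on the overall picture.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

low, high = wilson_interval(16, 17)   # roughly 0.73 to 0.99
```

An interval that spans more than a quarter of the possible range is the numerical version of calling this a smoke test.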

What building this taught me

Medical school taught me that triage is not a diagnostic problem. It is a risk stratification problem under time pressure with incomplete information. The nurse's job is to figure out how wrong and how fast. That framing differs from most medical AI, which tends to optimize for diagnostic accuracy on clean datasets.

Building Aegis-MD confirmed that the architecture of a clinical AI system is a clinical decision itself. Whether the LLM or the rule layer has final authority is a question about where you are willing to accept brittleness. Rules fail predictably. LLMs fail surprisingly. In a safety context, predictable failures are preferable.

This lesson generalizes past medicine. Any time an LLM's discrete output drives a consequential decision, you inherit a reproducibility problem that the usual "LLMs are probabilistic anyway" shrug ignores. LLM inference is less reproducible than it looks, and if your output carries weight, you have to engineer for determinism.

Trust in clinical AI is earned through governance and validation, not claimed through framing. Once a kidney stone can change urgency depending on whether there is a GPU in the box, hardware environment stops being an ops question and becomes a clinical one.
