DEV Community

Pyae Sone

Same Weights, Same Prompt, Different Triage Level

I ran a 4-bit medical-triage model on a laptop GPU and on a CPU. For one patient, the GPU said urgent and the CPU said emergency. Same model file, same prompt, same input. Here's the mechanism and why "validated on hardware X" doesn't mean what you'd hope.


I've been building Aegis-MD, a local-first emergency-department triage console. You hand it a structured clinical picture (chief complaint, vitals, age, pain score, a few risk modifiers) and it returns an urgency category on the Australasian Triage Scale (ATS 1–5), where ATS-1 means resuscitate now and ATS-5 means this can wait two hours. The whole thing runs on-device: a quantized MedGemma 4B served through Ollama, a small RAG layer over open guidelines, and a deterministic rule-based floor underneath the model.

 

I never set out to write about floating-point arithmetic. But while running my evaluation set across two machines, I hit a result that stopped me, and the explanation turned out to be more interesting and more current than the textbook answer most people reach for.

 

The setup, and why a 4-bit model

 

Two things about Aegis-MD's design matter for this story.

 

First, it's local by design. Triage data is about as sensitive as data gets, so nothing leaves the machine. The trade-off is that I'm running a small, heavily quantized model (MedGemma 1.5 4B at Q4_K_XL, about 3.4 GB) rather than a frontier API. Four-bit weights are the price of running offline on consumer hardware.

 

Second, I tested on two configurations on purpose. The intended deployment is local GPU inference (an RTX 5070 Ti Mobile, 12 GB). But the public demo runs CPU-only on Cloud Run, because GPU instances need a paid quota I don't have. So I ran the same evaluation against both: the GPU build and the CPU build, same model, same code, same prompts.

 

The eval is 17 hand-written cases spanning all five ATS levels, from cardiac arrest down to a medical-certificate request. (Seventeen is a smoke test, not a validation; I won't quote a percentage off a sample that small.) The headline was unremarkable: 16/17 on GPU, 15/17 on CPU. It was which case diverged that bothered me.

 

The thing that shouldn't happen

 

Case 8 was renal colic, a kidney stone. Painful, urgent, but not immediately life-threatening: ATS-3.

 


Case 8 — renal colic (expected ATS-3)

  GPU (RTX 5070 Ti):  ATS-3   ✓

  CPU (4 vCPU):       ATS-2   ✗  (over-triage)


 

Same weights. Same prompt. Same patient JSON. Same Ollama. The only variable was the hardware the forward pass ran on, and the model returned a different clinical category.

 

It erred on the safe side this time: ATS-2 is more urgent than ATS-3, so a real patient would have been seen sooner, not later. But "safe this time" is luck, not a property. Nothing in the mechanism guarantees the next hardware-induced flip rounds toward caution. (For contrast: my one genuine failure, case 9, an anticoagulated head injury under-triaged to ATS-4, failed identically on both machines; that's a bug in my rule layer, a different post. This one is different. This one moves with the silicon.)

 

My first guess was a bug. It wasn't.

 

I did the obvious checks. Same model digest on both machines. Same prompt bytes. Same decoding settings: greedy, temperature 0, which is what you want for triage anyway, because you don't sample a clinical decision. No code path keyed on device. Nothing I could find that should produce a different answer.

 

That's because nothing in my code did. The divergence lives a layer below my code, in how the same arithmetic gets executed on two different processors.

 

What people usually blame

 

If you ask why two runs of an LLM disagree, the stock answer is some version of: floating-point addition isn't associative, GPUs run thousands of threads concurrently, the order they finish in is unpredictable, so the rounding differs and small numeric noise snowballs into different tokens.

 

(a + b) + c ≠ a + (b + c) is true: floats round at each step, so summation order changes the result in the last bits. And that's the story I'd absorbed too.
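
A quick check makes the non-associativity concrete. This is a minimal sketch in plain Python, whose floats are IEEE 754 doubles; the three values are chosen purely for illustration and have nothing to do with the model.

```python
# Floating-point addition is not associative: the grouping decides
# where the intermediate rounding happens.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(left == right)      # False
print(abs(left - right))  # nonzero, but only in the last bit or so
```

The two results agree to about sixteen significant digits, which is why this effect is invisible until something downstream turns a tiny difference into a discrete choice.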

 

The problem is that, as the primary explanation for LLM nondeterminism, it's largely wrong, and a recent piece of work spells out why.

 

What's actually going on

 

In September 2025, Horace He and Thinking Machines Lab published Defeating Nondeterminism in LLM Inference, and it reframed the whole thing. Their argument, in my words: the hot path of a normal forward pass doesn't lean on atomic adds or racing threads, so for a fixed shape and a fixed schedule, the individual kernels are run-to-run deterministic. The reason a temperature-0 endpoint still gives you different answers isn't thread races, it's that production servers batch your request with other people's, the batch size fluctuates with traffic, and many kernels (matmul, RMSNorm, attention) aren't batch-invariant: their numerical output changes with the batch they're computed in. Change the batch, change the reduction, change the result.

 

That's a brilliant diagnosis of same-endpoint nondeterminism. But notice it doesn't describe my situation. I wasn't getting different answers from one server under varying load. I was getting different answers from two entirely different backends — llama.cpp's CUDA path on the GPU versus its CPU path.

 

Cross-backend is a different axis entirely, and here the floating-point story is the right lens, just not at the level of thread races. Two backends are two separate implementations of the same math. They tile and block their matrix multiplies differently, reduce in different orders, accumulate at different precisions (a GPU kernel may accumulate in fp16 where the CPU uses fp32), and fuse operations differently. Bitwise agreement across two such code paths was never on the table. They're not the same computation that happens to round differently under load; they're different computations that approximate the same ideal.
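
To see how accumulation precision and blocking interact, here is a deliberately exaggerated sketch using NumPy scalar types. The `accumulate` helper and the input of 4096 ones are assumptions made for illustration; real kernels differ far more subtly, and this is not how llama.cpp implements either backend.

```python
import numpy as np

def accumulate(values, dtype, tile=None):
    """Sum `values` left to right in `dtype`. If `tile` is set, sum each
    tile first, then sum the per-tile partials (a blocked reduction)."""
    if tile is not None:
        partials = [accumulate(values[i:i + tile], dtype)
                    for i in range(0, len(values), tile)]
        return accumulate(partials, dtype)
    acc = dtype(0)
    for v in values:
        acc = dtype(acc + dtype(v))  # round to `dtype` after every add
    return float(acc)

ones = [1.0] * 4096

fp32_sequential = accumulate(ones, np.float32)
fp16_sequential = accumulate(ones, np.float16)
fp16_tiled = accumulate(ones, np.float16, tile=64)

print(fp32_sequential)  # 4096.0
print(fp16_sequential)  # 2048.0 (adding 1.0 to 2048 rounds back to 2048 in fp16)
print(fp16_tiled)       # 4096.0 (same precision, different reduction order)
```

Same inputs, same ideal answer, three code paths: the fp16 sequential sum stalls because fp16 cannot represent 2049, while the tiled fp16 sum never meets that rounding step. That is the sense in which two backends are different computations approximating the same ideal.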

 

So the honest framing is: same-hardware temp-0 nondeterminism is mostly a batch-invariance problem, not a floating-point one. Cross-hardware divergence, which is what I saw, is a floating-point-and-implementation problem, living in the gap between two backends. Both end at the same uncomfortable place: the forward pass is not a stable function of your inputs alone.

 

Why quantization pours fuel on it

 

Four-bit weights make this worse on two independent counts.

 

The reconstruction is backend-specific. A Q4_K weight isn't a number you read off; it's a 4-bit code plus per-block scales that get expanded back to a usable precision at runtime. That dequantization arithmetic is implemented per backend, so the reconstructed weights themselves differ slightly between CPU and GPU before a single multiply happens. You're not even feeding the two paths identical numbers.
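
A toy version shows how two dequantization paths can disagree before any matrix multiply. The block format below (sixteen 4-bit codes sharing one fp16 scale, with `w = scale * (code - 8)`) is a simplification assumed for illustration; it is not the real Q4_K layout, and `dequant_fp16` and `dequant_fp32` are hypothetical stand-ins for two backends, not llama.cpp functions.

```python
import numpy as np

# Toy block: all sixteen possible 4-bit codes and one shared half-precision scale.
codes = np.arange(16, dtype=np.int8)
scale = np.float16(0.1)

def dequant_fp16(codes, scale):
    """Hypothetical backend A: multiply in half precision, then widen."""
    return (scale * (codes - 8).astype(np.float16)).astype(np.float32)

def dequant_fp32(codes, scale):
    """Hypothetical backend B: widen the scale first, multiply in single precision."""
    return np.float32(scale) * (codes - 8).astype(np.float32)

w_a = dequant_fp16(codes, scale)
w_b = dequant_fp32(codes, scale)

print(np.flatnonzero(w_a != w_b))  # codes whose reconstructed weights differ
print(np.abs(w_a - w_b).max())     # small, but not zero
```

Both functions read the same stored bits, yet some reconstructed weights differ in their low bits because the half-precision product is rounded once more than the single-precision one.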

 

The logit landscape is flatter. Quantizing to 4 bits doesn't just add noise; it blurs the model. A blurrier model is less decisive: more of its next-token decisions sit as near-ties between two candidates with almost-equal logits. And near-ties are exactly the decisions a last-bit numerical difference can tip. A full-precision model that's confident about the next token will pick the same one even if the logits wobble in the seventh decimal place. A 4-bit model hovering between two tokens will not.

 

How a rounding error becomes a different diagnosis

 

Here's the part that makes this matter rather than just being trivia.

 

Greedy decoding picks the argmax logit at each step. Argmax is a step function: it's perfectly stable until two logits cross, and at a tie it's discontinuous, so an arbitrarily small change flips the winner. Picture one decision point:

 


token A logit:  8.4012   ← GPU picks A

token B logit:  8.4011


 

A cross-backend difference of just over 0.0001 in one logit, roughly one part in eighty thousand, is enough to swap which token wins. And generation is autoregressive: that one different token becomes part of the context for everything after it. A different word early reshapes the whole completion, which my parser then maps to a different ATS category. A perturbation far below anything you'd notice in a single number, injected at one near-tie, cascades into a categorically different clinical output.
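
The flip itself takes a few lines to reproduce. This is a minimal sketch with the two illustrative logits from the decision point above; the 0.0002 perturbation on the second backend is an assumed value, not a measurement from either machine.

```python
import numpy as np

tokens = ["A", "B"]
logits_gpu = np.array([8.4012, 8.4011])

# Assumed cross-backend perturbation: the second backend computes
# B's logit 0.0002 higher. Illustrative only.
logits_cpu = logits_gpu + np.array([0.0, 0.0002])

pick_gpu = tokens[int(np.argmax(logits_gpu))]
pick_cpu = tokens[int(np.argmax(logits_cpu))]

print(pick_gpu, pick_cpu)  # A B
```

Neither logit moved by anything you would flag in a numerical comparison, yet greedy decoding now continues from a different token on each backend.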

 

That's the chain: 4-bit weights and two backends produce slightly different logits → at a near-tie the argmax flips → autoregressive decoding amplifies one token into a different sentence → the sentence parses to ATS-2 instead of ATS-3.

 

(I'm describing the mechanism that fits the observation, not a logit-level autopsy of this exact case. The rigorous confirmation, instrumenting the divergence point and measuring the logit margin where the two backends part ways, is the experiment I'd run next, and it's the right way to turn "this is almost certainly what happened" into "here's the token where it happened.")
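
That experiment could be sketched roughly as follows, assuming you can capture the generated token ids and the per-step logits from each backend. The helper names and the sample data are hypothetical; nothing here comes from a real trace.

```python
from itertools import zip_longest

def first_divergence(tokens_a, tokens_b):
    """Index of the first position where two token sequences differ, or None."""
    for i, (a, b) in enumerate(zip_longest(tokens_a, tokens_b)):
        if a != b:
            return i
    return None

def top2_margin(logits):
    """Gap between the largest and second-largest logit at one step."""
    top, runner_up = sorted(logits, reverse=True)[:2]
    return top - runner_up

# Made-up token ids and logits standing in for traces from each backend.
gpu_tokens = [12, 7, 90, 3, 41]
cpu_tokens = [12, 7, 90, 8, 15]
gpu_step_logits = {3: [8.4012, 8.4011, 2.1]}

step = first_divergence(gpu_tokens, cpu_tokens)
margin = top2_margin(gpu_step_logits[step])
print(step, margin)
```

A small margin at the first divergent step would support the near-tie explanation; a large one would point to something else, such as a prompt or template difference.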

 

Why I care more than a chatbot would

 

For most LLM products this is a non-issue. If a chatbot phrases an answer two equally-good ways depending on the GPU, nobody is harmed.

 

Triage breaks that comfort for a specific reason: the output space is discrete, ordinal, and load-bearing. ATS-2 and ATS-3 aren't paraphrases; they carry different time-to-treatment targets (10 minutes versus 30). The category is the product. When the category moves with the hardware, you've lost something more fundamental than accuracy.

 

You've lost reproducibility, and reproducibility isn't a nicety in clinical software; it's a prerequisite for everything that makes such software trustworthy. You cannot meaningfully validate a system whose answers depend on the machine: a study that shows it triages correctly on your workstation says nothing certain about the CPU node it ships on. You cannot audit an adverse event: "why did it under-triage this patient?" has no stable answer if the same inputs yield different outputs elsewhere. And you cannot get anything past a regulator on "it usually agrees with itself." "Validated on hardware X" turns out to be a far narrower claim than people assume the moment a quantized model is involved.

 

What I changed

 

A few practical moves, in increasing order of how much they actually help.

 

Greedy, and confirm it. Temperature 0 is necessary (you never sample a triage decision), but it is not sufficient. It kills sampling noise; it does nothing about the cross-backend gap. (Fixed seeds don't rescue you either: a seed only orders one backend's RNG, and means nothing across two.)

 

Treat the hardware as part of the model artifact. I now pin the whole stack (backend, quant, llama.cpp/Ollama version, and the target hardware) and version them together. The model isn't medgemma-q4; it's medgemma-q4 on this backend on this device. If the demo runs on CPU and the real thing runs on GPU, those are two artifacts that need separate evaluation, not one with two deployment targets.
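
One way to make "the hardware is part of the artifact" mechanical is to fingerprint the whole stack. This is a sketch under assumptions: the `InferenceArtifact` class and its field names are hypothetical, and the digest and version strings are placeholders to be filled in from your own setup.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class InferenceArtifact:
    model_digest: str
    quant: str
    runtime: str   # e.g. the pinned llama.cpp/Ollama version
    backend: str
    device: str

    def fingerprint(self) -> str:
        """Short, stable hash over every field, so any change yields a new id."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:16]

# Placeholder values; only the backend and device differ between the two builds.
gpu_build = InferenceArtifact("sha256:<model digest>", "Q4_K_XL",
                              "ollama <version>", "cuda", "RTX 5070 Ti Mobile")
cpu_build = InferenceArtifact("sha256:<model digest>", "Q4_K_XL",
                              "ollama <version>", "cpu", "4 vCPU")

print(gpu_build.fingerprint() != cpu_build.fingerprint())  # True
```

Storing the fingerprint next to each evaluation run makes it hard to quote a GPU result as evidence for the CPU deployment by accident.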

 

Test for consistency across hardware, not just accuracy. My eval runner already supports repeat runs to measure run-to-run stability; the obvious extension is to diff outputs across configurations and treat any category disagreement as a first-class failure, alongside wrong answers.
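
A cross-configuration diff can be very small. The sketch below assumes each run produces a mapping from case number to ATS category; `category_disagreements` is a hypothetical helper, and only cases 8 and 9 from this post are filled in.

```python
def category_disagreements(results_a, results_b):
    """Return (case_id, category_a, category_b) for every shared case
    where the two configurations assigned different categories."""
    shared = sorted(results_a.keys() & results_b.keys())
    return [(case, results_a[case], results_b[case])
            for case in shared
            if results_a[case] != results_b[case]]

# ATS categories keyed by case number.
gpu_results = {8: 3, 9: 4}
cpu_results = {8: 2, 9: 4}

flips = category_disagreements(gpu_results, cpu_results)
print(flips)  # [(8, 3, 2)]
```

Case 9 is wrong on both machines and so never appears here, which is the point: accuracy checks and consistency checks catch different failures.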

 

And the big one: don't let the stochastic component make the safety-critical call. This is the design decision the whole episode retroactively justified. Aegis-MD isn't an LLM with some guardrails bolted on; it's a deterministic rule floor (keyword and vitals-threshold logic) that the LLM is allowed to escalate but not override. Those rules give the same answer on every machine, every time. The model adds nuance and a readable rationale; the rules guarantee the floor doesn't move when you change processors. Before this, "rules-first" was a defensible architectural taste. After watching a category flip with the silicon, it reads as the only honest way to build the thing: you do not hand the final, consequential, discrete decision to a component that won't reproduce across hardware.
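
The "escalate but not override" rule reduces to one comparison. This is a minimal sketch of the idea, not the actual Aegis-MD code; `final_category` and its parameter names are assumptions.

```python
def final_category(rule_floor: int, model_category: int) -> int:
    """Combine a deterministic rule floor with the model's suggestion.

    Lower ATS numbers are more urgent, so "escalate but not override"
    means taking whichever of the two categories is more urgent.
    """
    for value in (rule_floor, model_category):
        if value not in range(1, 6):
            raise ValueError(f"ATS category must be 1-5, got {value}")
    return min(rule_floor, model_category)

print(final_category(rule_floor=3, model_category=2))  # 2: the model escalates
print(final_category(rule_floor=2, model_category=4))  # 2: the floor holds
```

Under this rule a hardware-induced flip in the model's output can still raise the urgency, but it can never push the result below what the rules require.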

 

The takeaway

 

The lesson generalizes past medicine. Any time an LLM's discrete output drives a decision with consequences (a triage level, a loan tier, a content-moderation verdict, a routing choice), you've inherited a reproducibility problem that the "it's probabilistic anyway" shrug papers over. As the Thinking Machines work argues, a lot of what gets waved away as inherent randomness is fixable engineering. The cross-hardware version I ran into is its own axis, and quantization makes it sharper, but the conclusion is the same: LLM inference is less reproducible than it looks, and if your output carries weight, you have to engineer for determinism rather than assume it.

 

The model I shipped is the same model on both machines. The arithmetic is not. And once a kidney stone can change urgency depending on whether there's a GPU in the box, "what hardware is this running on" stops being an ops question and becomes a clinical one.

 


 

Further reading

 

Aegis-MD is a research prototype, not a medical device, and must not be used for real patient care.
