HeytalePazguato

Posted on May 26

Deterministic by design: code review without an LLM

#ai #codequality #devtools #showdev

Every code review tool launched in the last two years seems to lead with the same word: AI. Point a model at a diff, get back prose about what might be wrong. For a lot of code, that is genuinely useful.

I built a code review tool recently, and I deliberately left the LLM out. Not because I dislike LLMs (I use them daily), but because the code I was targeting has a property that makes a non-deterministic reviewer the wrong tool: it runs machines, and a wrong or inconsistent answer has a physical cost.

This is about that decision, why determinism mattered more than fluency for this case, and where I think an LLM still earns a place.

The case study: industrial control code

The tool, plc-st-review, reviews IEC 61131-3 Structured Text. That is the language in which a large share of the world's factories, water plants, and process lines are programmed. A bug here is not a 500 on a web page. It is a conveyor that runs too fast, a safety interlock whose timeout quietly changed, or a pump that never starts.

The famous extreme of this is Stuxnet. It quietly altered the PLC logic driving uranium enrichment centrifuges in Iran so they spun at damaging speeds, while replaying normal sensor readings back to the operators so nothing looked wrong. No explosion, just centrifuges tearing themselves apart over months. That was deliberate, state-built malware engineered to hide itself, so to be clear, no linter would have caught it. But you do not need a nation-state attacker to get the physical version of a wrong number. A timer preset typed as T#200ms instead of T#2s in an ordinary change does it too, and that is exactly the kind of thing a code review is supposed to catch and routinely misses.

You do not need to know Structured Text to follow the argument. The point is the constraint: this is code where "probably fine" is not an acceptable review result, and where the same input has to produce the same answer every single time.

Why determinism beats fluency here

A linter you gate a CI pipeline on makes a promise: the same code produces the same findings, today and in six months, on my machine and on the build server. That promise is what lets a team say "the build is red, the merge is blocked" and trust it.

An LLM reviewer cannot make that promise. The same diff can produce different output across runs. It can hallucinate a problem that is not there, or miss one that is. Temperature, model version, and context window all move the result. For an exploratory review, that is a fine trade. For a merge gate on safety-relevant code, it is disqualifying, because a gate that sometimes blocks and sometimes does not is not a gate.

Determinism bought me four things that matter more than natural language here:

Reproducibility. Every finding is a pure function of the parse tree. Run it a thousand times, get the same result a thousand times. CI can depend on it.

Auditability. When the tool flags something, it points to a named rule and the exact node that triggered it. In a regulated environment, someone will eventually ask, "Why did this fail?" "A rule named TIMER_VALUE_CHANGED fired because the PT went from T#2s to T#200ms" is an answer. "The model felt it looked risky" is not.

No data leaving the building. Industrial shops are, correctly, paranoid about shipping control code to a third-party API. A tool that parses locally and calls nothing external clears that bar without a procurement fight.

Cost and latency that round to zero. It parses and walks a tree. No tokens, no rate limits, no per-review bill. It runs on every push without anyone watching the meter.
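To make the first two points concrete, here is a minimal Python sketch of what a finding record could look like. The `Finding` fields, the `report_digest` helper, and the file names are assumptions made for illustration, not the tool's actual data model. The idea is only that a finding names its rule and location, and that a canonical serialization lets two runs be compared byte for byte.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True, order=True)
class Finding:
    """One review result tied to a named rule and a source location (hypothetical shape)."""
    file: str
    line: int
    column: int
    rule: str
    message: str


def report_digest(findings):
    """Hash a canonical serialization of the findings so two runs can be compared."""
    canonical = json.dumps(
        [asdict(f) for f in sorted(findings)],  # sort so discovery order is irrelevant
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# File names and positions below are invented for the example.
finding = Finding("conveyor.st", 42, 5, "TIMER_VALUE_CHANGED", "PT went from T#2s to T#200ms")
other = Finding("pump.st", 7, 1, "EDGE_TRIG_REUSED", "one R_TRIG fed from two clock expressions")

run_a = [finding, other]
run_b = [other, finding]  # same findings, different discovery order
assert report_digest(run_a) == report_digest(run_b)
```

A CI job could store the digest alongside the build and treat any difference between two runs on the same commit as a bug in the reviewer itself.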

How it actually works

There is no magic, which is the point. The pipeline is boring on purpose:

1. Parse each .st file into a syntax tree with a tree-sitter grammar. Real parsing, not regex on text.
2. Build a symbol table per revision: every program unit and its parameter signature, global variables, enums, timer instances, call sites, CASE statements.
3. Hand that structured model to each check. A check is a small, self-contained function that looks at the tree and the symbol table and returns findings.
4. For pull request review, do all of the above for both the before and after versions of a change, and diff the two models.
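The shape of that pipeline can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the tool's code: parsing is left out entirely, the `Model` class stands in for the real symbol table and is built by hand, and `review` and the instance name `tStartDelay` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Model:
    """Stand-in for one revision's symbol table (hypothetical, heavily reduced)."""
    timers: Dict[str, str] = field(default_factory=dict)  # instance name -> PT literal


# A finding here is (rule, symbol, message); a real one would also carry the tree node.
Finding = Tuple[str, str, str]
Check = Callable[[Optional[Model], Model], List[Finding]]


def timer_value_changed(before: Optional[Model], after: Model) -> List[Finding]:
    """Flag timers present in both revisions whose preset literal differs."""
    if before is None:  # single-revision run: nothing to compare against
        return []
    return [
        ("TIMER_VALUE_CHANGED", name, f"PT went from {before.timers[name]} to {pt}")
        for name, pt in sorted(after.timers.items())
        if name in before.timers and before.timers[name] != pt
    ]


def review(before: Optional[Model], after: Model, checks: List[Check]) -> List[Finding]:
    """Run every check against the two models and return findings in a stable order."""
    findings: List[Finding] = []
    for check in checks:
        findings.extend(check(before, after))
    return sorted(findings)


old = Model(timers={"tStartDelay": "T#2s"})
new = Model(timers={"tStartDelay": "T#200ms"})
print(review(old, new, [timer_value_changed]))
```

Each check takes only the two models and returns a list, so it has no hidden state to make its result vary between runs.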

That last step is where it earns its keep. A single-revision analyzer can tell you a timer exists. Comparing two revisions tells you the timer's preset went from two seconds to two hundred milliseconds in this specific commit, so the timer now expires ten times sooner. That is exactly the kind of small typo that passes a visual review and trips a machine in production.
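One detail worth spelling out: comparing presets as text would flag T#2s against T#2000ms even though nothing changed, so the comparison has to happen on normalized durations. The following Python sketch shows one way to do that. It is an assumption about how such a comparison could be written, not the tool's implementation, and it handles only the common d, h, m, s, and ms units.

```python
import re

# Milliseconds per unit for the duration units handled in this sketch.
_UNIT_MS = {"d": 86_400_000, "h": 3_600_000, "m": 60_000, "s": 1_000, "ms": 1}
# "ms" comes first in the alternation so it is matched as one unit, not as "m" then "s".
_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|d|h|m|s)")
_PREFIX = re.compile(r"^(?:t|time)#", re.IGNORECASE)


def time_literal_to_ms(literal: str) -> float:
    """Convert a literal such as T#2s or TIME#1m30s to milliseconds."""
    if not _PREFIX.match(literal):
        raise ValueError(f"not a TIME literal: {literal!r}")
    body = _PREFIX.sub("", literal).replace("_", "").lower()
    parts = _PART.findall(body)
    # Reject anything the pattern did not fully consume rather than guess.
    if not parts or "".join(number + unit for number, unit in parts) != body:
        raise ValueError(f"unsupported TIME literal: {literal!r}")
    return sum(float(number) * _UNIT_MS[unit] for number, unit in parts)


def preset_change(before: str, after: str):
    """Return (old_ms, new_ms), or None when both literals mean the same duration."""
    old_ms, new_ms = time_literal_to_ms(before), time_literal_to_ms(after)
    return None if old_ms == new_ms else (old_ms, new_ms)


assert preset_change("T#2s", "T#2000ms") is None  # same duration, different spelling
old_ms, new_ms = preset_change("T#2s", "T#200ms")
assert old_ms / new_ms == 10  # the timer now expires ten times sooner
```

Raising on an unrecognized literal, instead of returning a best guess, keeps the check on the deterministic side of the line.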

A few more examples of what falls out of having a real model instead of text matching:

A function block instance whose outputs you read but that nothing ever calls, so you are reading stale values.
A literal array index outside the declared bounds.
A constant whose name starts with SAFETY_ whose value changed, flagged at a higher severity because of the prefix.
A function that grew a required input while only some of its call sites were updated.

None of those needs a language model. They need a correct model of the code and a rule.
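As a rough illustration of how small such rules can be once the model exists, here are three of them sketched in Python. Everything in this block is hypothetical: the rule names, the severity labels, and the plain dictionaries standing in for the symbol table are assumptions for the example, not the tool's actual checks.

```python
from typing import Dict, List, Set, Tuple

Finding = Tuple[str, str, str]  # (severity, rule, message); names are made up here


def check_constants(before: Dict[str, str], after: Dict[str, str]) -> List[Finding]:
    """Flag constants whose value changed, escalating names with the SAFETY_ prefix."""
    findings = []
    for name in sorted(before.keys() & after.keys()):
        if before[name] != after[name]:
            severity = "error" if name.startswith("SAFETY_") else "warning"
            findings.append((severity, "CONSTANT_VALUE_CHANGED",
                             f"{name}: {before[name]} -> {after[name]}"))
    return findings


def check_call_sites(required: Dict[str, Set[str]],
                     calls: List[Tuple[str, int, Set[str]]]) -> List[Finding]:
    """Flag calls that do not pass every required input of the callee."""
    findings = []
    for callee, line, passed in calls:
        missing = sorted(required.get(callee, set()) - passed)
        if missing:
            findings.append(("error", "MISSING_REQUIRED_INPUT",
                             f"{callee} called at line {line} without {', '.join(missing)}"))
    return findings


def check_literal_index(bounds: Dict[str, Tuple[int, int]],
                        accesses: List[Tuple[str, int, int]]) -> List[Finding]:
    """Flag literal indexes outside the declared lower..upper range of the array."""
    findings = []
    for array, line, index in accesses:
        lower, upper = bounds[array]
        if not lower <= index <= upper:
            findings.append(("error", "ARRAY_INDEX_OUT_OF_BOUNDS",
                             f"{array}[{index}] at line {line} is outside {lower}..{upper}"))
    return findings


# Invented inputs: one call site was updated for the new input, one was not.
calls = [("StartPump", 10, {"Enable", "Speed"}), ("StartPump", 31, {"Enable"})]
print(check_call_sites({"StartPump": {"Enable", "Speed"}}, calls))
print(check_constants({"SAFETY_TIMEOUT": "T#5s"}, {"SAFETY_TIMEOUT": "T#1s"}))
print(check_literal_index({"aZones": (1, 8)}, [("aZones", 12, 9)]))
```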

Where the LLM does belong

This is the part I want to be honest about, because "no AI" as a dogma is just the inverse mistake.

There is one place an LLM clearly helps: explaining a finding to someone who is not a domain expert. A junior engineer reading EDGE_TRIG_REUSED may not know why feeding one R_TRIG instance from two different clock expressions is a problem. A model is great at turning a terse, correct finding into a paragraph of plain English.

So the design rule I settled on is: the LLM never originates a finding. It only paraphrases one that the deterministic engine has already produced and grounded in a specific node. Determinism remains the source of truth; the model is an optional translation layer on top. That keeps the gate trustworthy while still making the output approachable. It is on the roadmap as a strictly additive --explain flag, off by default, never in the path that decides pass or fail.
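Since the flag is still on the roadmap, the following is only a sketch of how that separation could be enforced in code, not a description of anything that exists. The function names are hypothetical, and `fake_explainer` is a placeholder where a model call would go. The point is structural: the gate is computed from the findings alone, and the explainer can only append text to a finding that already exists.

```python
from typing import Callable, List, Optional, Tuple

Finding = Tuple[str, str]  # (rule, message) as produced by the deterministic engine


def gate_passes(findings: List[Finding]) -> bool:
    """The merge gate: decided from the findings alone, with no model involved."""
    return not findings


def render(findings: List[Finding],
           explain: Optional[Callable[[Finding], str]] = None) -> List[str]:
    """Format findings; the optional explainer can only add text to existing ones."""
    lines = []
    for rule, message in findings:
        line = f"{rule}: {message}"
        if explain is not None:
            line += f"\n    {explain((rule, message))}"
        lines.append(line)
    return lines


def fake_explainer(finding: Finding) -> str:
    """Placeholder for an LLM call; a real one would paraphrase the finding."""
    return f"(plain-English explanation of {finding[0]} would go here)"


findings = [("EDGE_TRIG_REUSED", "one R_TRIG instance fed from two clock expressions")]

plain = render(findings)                     # default: explanations off
explained = render(findings, fake_explainer)

# The explainer changes the text, never the number of findings or the verdict.
assert len(plain) == len(explained)
assert gate_passes(findings) is False
```

Because `gate_passes` never receives the explainer, there is no code path by which a model's output could change pass or fail.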

That boundary (the model can explain but never decide) is the whole thesis. Let the deterministic core own correctness and the merge gate. Let the LLM own fluency, where being occasionally wrong costs nothing.

The takeaway, beyond PLCs

The reflex right now is to reach for a model first and ask what it should not touch later. I think it is worth inverting that for any code where a review result gates something real: decide what must be deterministic and auditable, build that part without the model, and add the LLM only where a wrong answer is cheap.

Not everything should be reviewed by an AI. Some things should be reviewed by a rule that gives the same answer every time, and can tell you exactly why.

The tool is open source (MIT) if you want to see the checks: https://github.com/HeytalePazguato/plc-st-review

I would be curious where other people draw this line. What in your stack do you keep deterministic on purpose, and where have you let a model in?

Top comments (1)

Harjot Singh • May 31

Good counter-current take. The industry is racing to throw an LLM at code review, but a huge fraction of what review actually catches is deterministic - lint, type errors, complexity thresholds, banned patterns, dependency rules, dead code, formatting. Those don't need a probabilistic model and shouldn't use one: a deterministic checker is fast, free, reproducible, and never hallucinates a finding or misses the same thing twice. Reserving the LLM for the genuinely fuzzy stuff (does this design make sense, is this the right abstraction) and letting deterministic tools own everything with a clear rule is the correct division of labor. An LLM that flags lint issues is just a slower, pricier, less reliable linter.

This maps exactly onto how I think about verification - use the cheapest deterministic check that can decide truth, and only escalate to a model when the question is actually subjective. It's core to Moonshift, the thing I work on - a multi-agent pipeline that takes a prompt to a deployed SaaS, where the verify layer prefers a hard deterministic gate over an LLM judgment wherever one exists, because deterministic is trustworthy and cheap. Multi-model routing keeps a build ~$3 flat, first run free no card. Really like this framing. Where do you draw the deterministic-vs-LLM line - everything rule-expressible goes deterministic and the LLM only gets the subjective design questions? That's the split I'd defend.