<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: danilo vaccalluzzo</title>
    <description>The latest articles on DEV Community by danilo vaccalluzzo (@danilo_vaccalluzzo_29784e).</description>
    <link>https://dev.to/danilo_vaccalluzzo_29784e</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3892964%2Fff663bbb-15a5-46d0-8cce-4dbaff0a0e7a.jpg</url>
      <title>DEV Community: danilo vaccalluzzo</title>
      <link>https://dev.to/danilo_vaccalluzzo_29784e</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/danilo_vaccalluzzo_29784e"/>
    <language>en</language>
    <item>
      <title>I shipped a DevSecOps tool in 2026 with zero LLM calls. On purpose. I think determinism still wins.</title>
      <dc:creator>danilo vaccalluzzo</dc:creator>
      <pubDate>Wed, 22 Apr 2026 18:24:48 +0000</pubDate>
      <link>https://dev.to/danilo_vaccalluzzo_29784e/i-shipped-a-devsecops-tool-in-2026-with-zero-llm-calls-on-purpose-i-think-determinism-still-wins-c2i</link>
      <guid>https://dev.to/danilo_vaccalluzzo_29784e/i-shipped-a-devsecops-tool-in-2026-with-zero-llm-calls-on-purpose-i-think-determinism-still-wins-c2i</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F000svmjl1j5iqlqgwd10.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F000svmjl1j5iqlqgwd10.png" alt=" " width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let me get one thing out of the way before the comments section catches fire.&lt;/p&gt;

&lt;p&gt;I am not a luddite. I use AI all day. This article was outlined in Cursor, the tool I am about to describe was built with heavy help from Claude Code, and my IDE autocomplete is so spicy these days that I sometimes forget I am the one supposed to be writing the code. That is normal in 2026. Pretending otherwise would be ridiculous.&lt;/p&gt;

&lt;p&gt;The point of this article is something else.&lt;/p&gt;

&lt;p&gt;The point is: &lt;strong&gt;the shipped product itself does not call a single LLM at runtime.&lt;/strong&gt; No OpenAI key. No Anthropic key. No RAG layer. No agentic loop. No "AI Mode" toggle. No vector database, no embedding step, no model card. Just plain old deterministic code that does the same thing every time.&lt;/p&gt;

&lt;p&gt;In the current climate I think that genuinely needs an explanation, so here it is.&lt;/p&gt;

&lt;h2&gt;What the tool actually does&lt;/h2&gt;

&lt;p&gt;I built &lt;a href="https://github.com/danilotrix86/ArchiteX" rel="noopener noreferrer"&gt;ArchiteX&lt;/a&gt;, a free open source GitHub Action that runs on every pull request that touches &lt;code&gt;*.tf&lt;/code&gt; files. It parses the Terraform on both sides of the PR, builds an architecture graph for each, computes the delta, runs a set of weighted risk rules, and posts a sticky comment on the PR with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a 0 to 10 risk score&lt;/li&gt;
&lt;li&gt;a short plain English summary of what changed&lt;/li&gt;
&lt;li&gt;a small Mermaid diagram of just the changed nodes plus one layer of context&lt;/li&gt;
&lt;li&gt;an optional CI gate to fail the build above a threshold&lt;/li&gt;
&lt;/ul&gt;
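&lt;p&gt;To make "weighted risk rules" and the 0 to 10 score concrete, here is a minimal sketch of how such a score could be computed. The rule IDs, weights, and function names are illustrative, not ArchiteX's actual internals:&lt;/p&gt;

```go
package main

import "fmt"

// Finding is an illustrative fired rule with a fixed, hand-locked weight.
type Finding struct {
	RuleID string
	Weight float64
}

// score sums the weights of every fired rule and clamps to the 0-10 scale.
func score(findings []Finding) float64 {
	total := 0.0
	for _, f := range findings {
		total += f.Weight
	}
	if total > 10 {
		total = 10
	}
	return total
}

// gate reports whether the CI build should fail, i.e. the score
// meets or exceeds the configured threshold.
func gate(s, threshold float64) bool {
	return s >= threshold
}

func main() {
	findings := []Finding{
		{RuleID: "public-s3-bucket", Weight: 6.5},
		{RuleID: "sg-open-to-world", Weight: 4.0},
	}
	s := score(findings)
	fmt.Println(s, gate(s, 7.0))
}
```

&lt;p&gt;The point of the fixed weights is that the same diff always produces the same number, which is what makes the CI gate arguable but never arbitrary.&lt;/p&gt;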

&lt;p&gt;You can see exactly what the comment looks like here, no install needed: &lt;a href="https://danilotrix86.github.io/ArchiteX/report.html" rel="noopener noreferrer"&gt;live sample report&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It supports AWS and Azure today. MIT licensed. Single Go binary. Free forever, no paid tier ever.&lt;/p&gt;

&lt;p&gt;This is, in 2026 vocabulary, the most boring possible tool. No agents. No reasoning loop. No "ask the diagram a question". I think that is a feature, not a bug, and I want to explain why.&lt;/p&gt;

&lt;h2&gt;Why I deliberately did not put an LLM in the hot path&lt;/h2&gt;

&lt;p&gt;In my opinion, the moment you put an LLM in the runtime of a tool that grades pull requests, you lose three things at the same time. And once you lose them, you cannot get them back without rewriting the trust contract you have with your users.&lt;/p&gt;

&lt;h3&gt;1. You lose determinism. And determinism is the entire product.&lt;/h3&gt;

&lt;p&gt;A reviewer trusts an automated PR comment for one reason and one reason only: because re-running it cannot quietly change the score. If I open the same PR twice and the first run says "9.0 / HIGH" and the second run says "6.5 / MEDIUM", I will never trust that tool again. Not for security. Not for anything.&lt;/p&gt;

&lt;p&gt;Every LLM I know is non-deterministic by default. You can pin temperature to zero, you can fix the seed, you can do all the rituals. The provider can still ship a new model checkpoint next Tuesday and your scores drift overnight, silently, with no audit trail. I think that is unacceptable for a tool whose entire job is to be trustworthy at the moment of code review.&lt;/p&gt;

&lt;p&gt;ArchiteX has a golden test suite that re-runs the full pipeline against checked-in fixtures and asserts the rendered Mermaid, the score JSON, and the egress JSON are &lt;strong&gt;byte identical&lt;/strong&gt; to a stored expected output. If anyone ever changes a map iteration order, a sort comparator, or a JSON marshaller, the build fails on the next push. That guarantee is impossible if there is a model call anywhere in the path.&lt;/p&gt;

&lt;h3&gt;2. You lose the trust model.&lt;/h3&gt;

&lt;p&gt;ArchiteX never runs &lt;code&gt;terraform plan&lt;/code&gt;. Never calls AWS or Azure. Never downloads provider plugins. Never touches state. The only network call in the entire tool is the GitHub REST API call at the very end to post the comment. The Terraform code never leaves the runner. There is no SaaS, no signup, no telemetry, no opt-out flag, because there is nothing to opt out of.&lt;/p&gt;

&lt;p&gt;The moment I add a model call, I have to send something to a third party. Even a sanitized summary. Even just the rule IDs. The trust conversation immediately changes from "this runs entirely on your CI runner" to "well, mostly". I think for a tool aimed at regulated tenants, financial services, healthcare, government, that is the difference between adoption and a polite no thank you.&lt;/p&gt;

&lt;h3&gt;3. You lose air-gap and fork-PR support.&lt;/h3&gt;

&lt;p&gt;This is a bonus consequence I did not appreciate until I was deep into the design. Because the tool needs zero credentials and zero API keys, it works on PRs from forks. PRs from forks are exactly where most supply-chain-style attacks land, and most CI tooling refuses to run on them precisely because it needs secrets. ArchiteX runs there fine, because it has nothing to lose.&lt;/p&gt;

&lt;p&gt;Same for air-gapped CI. Banks, defense contractors, hospitals. Add an LLM call and you cannot ship there at all.&lt;/p&gt;

&lt;h2&gt;What I gave up by saying no to AI in the runtime&lt;/h2&gt;

&lt;p&gt;I am not pretending this came for free. The tradeoffs are real and I think being honest about them is part of why this post is worth writing.&lt;/p&gt;

&lt;p&gt;The plain English summaries are template-based, which means they are correct, deterministic, and a bit dry. An LLM would write nicer prose. I know. I have prototyped exactly that. The prose was indeed nicer. The score also drifted between runs, and a colleague asked me, with a completely straight face, "did the AI just decide my PR was fine today?". That was the end of that prototype.&lt;/p&gt;

&lt;p&gt;There is no smart cross-resource correlation. If your PR adds an unauthenticated Lambda URL &lt;strong&gt;and&lt;/strong&gt; attaches &lt;code&gt;AdministratorAccess&lt;/code&gt; to a role in the same diff, ArchiteX scores both rules and sums them. It does not say "hey, those two together are materially worse than the sum of their parts". An LLM walking the graph could probably spot that. A deterministic rule engine cannot, unless I write the specific compound rule by hand. I am writing them by hand. It is slow. I think that is fine.&lt;/p&gt;
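&lt;p&gt;A hand-written compound rule is deterministic precisely because the combination and its extra weight are fixed in code. A sketch, with illustrative rule IDs that are not ArchiteX's real ones:&lt;/p&gt;

```go
package main

import "fmt"

// hasRule reports whether a rule ID is among the fired findings.
func hasRule(fired []string, id string) bool {
	for _, f := range fired {
		if f == id {
			return true
		}
	}
	return false
}

// compoundAdminLambda is a hand-written compound rule: an unauthenticated
// Lambda URL plus AdministratorAccess in the same diff is materially worse
// than the sum of its parts, so it contributes extra weight on top of both
// base rules. The extra weight is locked by hand, not decided at runtime.
func compoundAdminLambda(fired []string) float64 {
	if hasRule(fired, "lambda-url-unauthenticated") {
		if hasRule(fired, "iam-administrator-access") {
			return 3.0
		}
	}
	return 0
}

func main() {
	fired := []string{"lambda-url-unauthenticated", "iam-administrator-access"}
	fmt.Println(compoundAdminLambda(fired))
}
```

&lt;p&gt;Writing these one at a time is exactly the slow part mentioned above, but every one of them is reviewable, testable, and identical on every run.&lt;/p&gt;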

&lt;p&gt;Rule curation is manual. Adding a new resource type means writing the parser support, the abstract type mapping, the literal attribute extraction, the edge inference, the risk rules, and the tests. There is no "let me ask the model to suggest 5 dangerous patterns for this resource". That tradeoff is the price of getting reproducibility as a load-bearing property.&lt;/p&gt;

&lt;h2&gt;A few technical decisions that follow directly from "no LLM"&lt;/h2&gt;

&lt;p&gt;For the people who care about how this looks in code, the no-LLM constraint shaped a bunch of design choices that I think are interesting on their own merits.&lt;/p&gt;

&lt;p&gt;The HCL parser is &lt;code&gt;hashicorp/hcl/v2&lt;/code&gt; walked generically through &lt;code&gt;hclsyntax&lt;/code&gt;, not decoded with &lt;code&gt;gohcl&lt;/code&gt; (which would have needed a Go struct per resource type). Attribute values are evaluated with &lt;code&gt;expr.Value(nil)&lt;/code&gt; and any failure to resolve a literal is recorded as &lt;code&gt;nil&lt;/code&gt; rather than guessed at. Variable-driven attributes never trigger rules even when they should. The engine never invents values. &lt;strong&gt;Reproducibility wins.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The trust model is enforced structurally with a CI grep rule: &lt;code&gt;! grep -rE "net/http|architex/github" parser graph delta risk interpreter models&lt;/code&gt;. The build fails if any analysis package ever imports networking. The only place HTTP is allowed is the &lt;code&gt;github&lt;/code&gt; REST client, which is only ever called by &lt;code&gt;main.go&lt;/code&gt; in one specific subcommand. Code review can be fooled. A CI grep cannot.&lt;/p&gt;

&lt;p&gt;The Mermaid renderer has a deterministic byte-budget cap. GitHub's mermaid-js rendering gives up above 50,000 characters, which is the classic failure mode for big diagrams in PR comments. The renderer keeps nodes by status priority, then abstract type priority, then alphabetical ID, until the byte budget is hit; it drops the rest and emits a visible truncation marker so reviewers always know it happened. I found that limit empirically with a synthetic stress probe, not by asking a model.&lt;/p&gt;

&lt;p&gt;There is a subcommand called &lt;code&gt;architex baseline&lt;/code&gt; that snapshots the "shape" of your repo (kinds of resources, abstract types, edge pairs ever seen) into a small JSON file. Three baseline rules then surface novelties (a brand new abstract type, a brand new resource kind, a brand new edge pair) as low-weight signals. This is the closest the tool gets to "anomaly detection", and I deliberately built it as a deterministic snapshot diff and not as embeddings. Same fixture in, same novelty out, every time.&lt;/p&gt;
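&lt;p&gt;A deterministic snapshot diff of this kind is just set membership plus sorted output. A sketch with hypothetical field names, not the real baseline format:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sort"
)

// Baseline is an illustrative snapshot of the shapes ever seen in a repo.
type Baseline struct {
	ResourceKinds map[string]bool
	EdgePairs     map[string]bool
}

// novelties returns, in sorted order, anything in the current PR that the
// baseline has never seen. Same fixture in, same novelty out, every time.
func novelties(base Baseline, kinds, edges []string) []string {
	var out []string
	for _, k := range kinds {
		if !base.ResourceKinds[k] {
			out = append(out, "new kind: "+k)
		}
	}
	for _, e := range edges {
		if !base.EdgePairs[e] {
			out = append(out, "new edge: "+e)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	base := Baseline{
		ResourceKinds: map[string]bool{"aws_s3_bucket": true},
		EdgePairs:     map[string]bool{"lambda-to-s3": true},
	}
	kinds := []string{"aws_s3_bucket", "aws_lambda_function_url"}
	edges := []string{"lambda-to-s3", "internet-to-lambda"}
	fmt.Println(novelties(base, kinds, edges))
}
```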

&lt;h2&gt;When I think AI absolutely does belong in a tool like this&lt;/h2&gt;

&lt;p&gt;So that I do not come off as a hater, here are the places I would happily use an LLM, and possibly will, just not in the runtime hot path:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generating the &lt;strong&gt;first draft&lt;/strong&gt; of a new risk rule from a CVE writeup or an incident postmortem. Then a human edits it, tests it, locks the weight, and ships it. The human is the trust boundary, not the model.&lt;/li&gt;
&lt;li&gt;Translating the deterministic plain-English summary into other languages, &lt;strong&gt;offline&lt;/strong&gt;, at release time. The translation is part of the build artifact, not a runtime call.&lt;/li&gt;
&lt;li&gt;Helping users &lt;strong&gt;author suppressions&lt;/strong&gt;. "I have this finding, write me a suppression block for it". This runs on the user's machine, not in the analysis pipeline, so it cannot influence the score.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The rule I keep coming back to: &lt;strong&gt;the model can write the rules, but the rules have to run without the model&lt;/strong&gt;. I think that is the right line for security tooling specifically. Maybe not for chatbots. Maybe not for code generation. Definitely for anything that grades a pull request.&lt;/p&gt;

&lt;h2&gt;So if you are also building developer tools right now, I think this is worth asking&lt;/h2&gt;

&lt;p&gt;In 2026 it feels like every product launch needs an "AI" in the title to even get clicks. I get it. I literally led with the no-LLM angle to get you to read this article, so I am as guilty as anyone.&lt;/p&gt;

&lt;p&gt;But I think there is a real, durable category of tools where adding an LLM in the runtime is a strict downgrade. Not because the model is bad, but because the contract between the tool and its users requires reproducibility, locality, and low trust surface. PR review tooling is one. CI gates are another. Anything that fires on every commit, anything that produces a number people will argue over, anything that has to work in an air-gapped tenant.&lt;/p&gt;

&lt;p&gt;For those, I think the boring deterministic answer is still the right one. And I think that has to be defended on purpose right now, because the default is the other way.&lt;/p&gt;

&lt;p&gt;If you want to look at it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Repo: &lt;a href="https://github.com/danilotrix86/ArchiteX" rel="noopener noreferrer"&gt;github.com/danilotrix86/ArchiteX&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Live sample report (no install): &lt;a href="https://danilotrix86.github.io/ArchiteX/report.html" rel="noopener noreferrer"&gt;danilotrix86.github.io/ArchiteX/report.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;30-second quickstart at the top of the README&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I would genuinely love your honest feedback in the comments, especially:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you have built a similar tool &lt;strong&gt;with&lt;/strong&gt; an LLM in the runtime, what did you do about reproducibility?&lt;/li&gt;
&lt;li&gt;If you work in a regulated tenant, is the "no SaaS, no telemetry, runs entirely on your runner" property actually the deal-breaker I think it is, or am I overestimating it?&lt;/li&gt;
&lt;li&gt;If you are on the other side of this debate, where would you put the LLM in a PR-review tool, and why?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I will reply to every comment.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>terraform</category>
      <category>azure</category>
      <category>aws</category>
    </item>
  </channel>
</rss>
