The biggest upgrade to our AI coding system was not a better model.
It was deleting model calls.
Everyone in AI infrastructure asks: "Which model should we use?" That question is a distraction. After 22 development sprints and a deep dive into 35 research papers while building ForgeFlow — our fully local, TDD-based autonomous implementation system — we landed on a different question:
"How many of our system's decisions can we replace with deterministic rules?"
The answer to that question often matters more than whether you use a frontier model or a 45GB local one.
The Insight That Changed Everything
LLM-based generation is probabilistic unless tightly constrained. Even when you pin temperature and seed, model calls remain poor substitutes for explicit rules when a decision can be mechanically verified.
Consider what a typical agentic coding loop actually decides:
- Which task to implement next
- Whether the environment is healthy
- Whether the generated code has syntax errors
- Whether the import paths are correct
- Whether the generated file is in scope
- Whether the tests pass
- Whether to commit, retry, or declare deadlock
None of these require a language model. They're deterministic operations: dependency graph traversal, docker health checks, py_compile, ruff, allow-list comparison, pytest exit code, SHA-256 hashing.
The only judgment that genuinely requires a model is code generation itself — one step in the entire cycle.
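To make that concrete, here is a minimal sketch of what the deterministic gates look like as plain Python. Function names and signatures are illustrative, not ForgeFlow's actual API; each one turns a stdlib call or a tool exit code into a boolean, with no model anywhere in the path.

```python
import py_compile
import subprocess
from pathlib import Path

def syntax_ok(path: Path) -> bool:
    # Deterministic syntax gate: the stdlib compiler either accepts the file or raises.
    try:
        py_compile.compile(str(path), doraise=True)
        return True
    except py_compile.PyCompileError:
        return False

def style_ok(path: Path) -> bool:
    # Style gate: ruff exit code 0 means no violations.
    return subprocess.run(["ruff", "check", str(path)], capture_output=True).returncode == 0

def in_scope(path: Path, allowed: set[str]) -> bool:
    # Scope gate: a generated file must be on the task's allow-list.
    return str(path) in allowed

def tests_pass(test_path: Path) -> bool:
    # The oracle: pytest exit code 0 is the only "yes".
    return subprocess.run(["pytest", str(test_path), "-q"], capture_output=True).returncode == 0
```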
We call this ratio the DCR: Deterministic Coverage Ratio.
DCR = deterministic decision points / total decision points
DCR is not a benchmark score. It is a design accounting tool: count every branch in the agent loop, then classify whether it is resolved by code or by model judgment.
ForgeFlow's current DCR: ≈ 85% — 11 of 13 decision points per cycle are deterministic.
| Decision | Deterministic? | Mechanism |
|---|---|---|
| Next task selection | ✅ | dependency DAG traversal |
| Dependency satisfied? | ✅ | status field check |
| Syntax valid? | ✅ | py_compile |
| Style gate pass? | ✅ | ruff |
| Import paths correct? | ✅ | Phase 0.5 AST correction |
| File in scope? | ✅ | allow-list comparison |
| Environment healthy? | ✅ | docker / ollama / disk check |
| Test code generation | ❌ | local LLM |
| Implementation generation | ❌ | local LLM |
| Tests pass? | ✅ | pytest exit code |
| Commit / Retry / Deadlock? | ✅ | gate_decision (rule-based) |
| Failure type? | ✅ | stderr pattern matching |
| Deadlock detected? | ✅ | failure signature × 3 |
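If you want to track the ratio rather than eyeball it, the accounting can live next to the code. A toy version, with decision names mirroring the table above (the data structure itself is illustrative, not ForgeFlow's internals):

```python
# True = resolved by deterministic code, False = resolved by a model call.
DECISION_POINTS = {
    "next_task_selection": True,        # dependency DAG traversal
    "dependency_satisfied": True,       # status field check
    "syntax_valid": True,               # py_compile
    "style_gate": True,                 # ruff
    "import_paths_correct": True,       # AST correction pass
    "file_in_scope": True,              # allow-list comparison
    "environment_healthy": True,        # docker / ollama / disk check
    "test_code_generation": False,      # local LLM
    "implementation_generation": False, # local LLM
    "tests_pass": True,                 # pytest exit code
    "commit_retry_deadlock": True,      # rule-based gate_decision
    "failure_type": True,               # stderr pattern matching
    "deadlock_detected": True,          # repeated failure signature
}

dcr = sum(DECISION_POINTS.values()) / len(DECISION_POINTS)
print(f"DCR = {dcr:.0%}")  # 11 of 13 -> 85%
```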
Why DCR Is the Right Metric
Recent inference-time scaling research keeps pointing to the same condition: repeated sampling becomes powerful when there is an automatic verifier.
Without an oracle, more samples just create more candidates to rank. With an oracle, more samples create more chances to find a correct answer.
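In code, the difference between the two regimes is a single condition in the loop. A hedged sketch, where `generate_candidate` stands in for any model call and `verifier` for any automatic check such as a pytest run:

```python
def sample_until_verified(generate_candidate, verifier, max_samples: int = 32):
    # With an oracle, sampling is a search: draw candidates until one passes
    # or the budget runs out. Without one, this loop has no stopping rule.
    for attempt in range(1, max_samples + 1):
        candidate = generate_candidate()   # probabilistic step (the model)
        if verifier(candidate):            # deterministic step (the oracle)
            return candidate, attempt
    return None, max_samples               # no verified candidate within budget
```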
A few papers that converge on this pattern:
- Large Language Monkeys (Brown et al., ICLR 2025): Coverage scales with sample count. In coding tasks with automatic verification, repeated sampling with a cheaper model consistently outperforms a frontier model run once.
- "The Larger the Better?" (Hassid et al., 2024): Under a fixed compute budget, smaller code models sampled multiple times can match or outperform larger models — when unit tests are available to select the correct answer.
- "Do We Truly Need So Many Samples?" (2025): Multi-model repeated sampling improves sample efficiency. Smaller model combinations approach larger-model performance with far fewer samples, given a verifier.
The critical shared condition: automatic verification must exist.
Unit tests. pytest. An oracle that says "yes" or "no" without human judgment.
This is where most real-world projects diverge from benchmarks. HumanEval and MBPP ship with test suites. Your JWT authentication service doesn't.
The bottleneck isn't model capability. It's verifiability.
Verifiability Is Constructed, Not Given
Existing research operates in "Verifiability as Given" mode: benchmarks provide the oracle, models generate code, oracle decides.
ForgeFlow operates in "Verifiability as Constructed" mode: start with a natural-language requirement and systematically transform it into something testable before any model is ever called.
We call this the Verifiability Transformation Pipeline:
Stage 1: Specification Crystallization
Natural language → numerically precise spec. Zero ambiguous terms ("appropriately", "as needed", "etc." are banned). Every requirement is anchored by a concrete test assertion:
POST {title: 'Buy milk'} → 201, body contains {id, title}
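One way to enforce the zero-ambiguity rule is to lint the spec itself before it ever reaches a model. A toy sketch (the banned-term list and function name are illustrative):

```python
BANNED_TERMS = {"appropriately", "as needed", "etc."}

def crystallized(requirement: str) -> bool:
    # A requirement survives only if it contains none of the vague terms a test can't pin down.
    lowered = requirement.lower()
    return not any(term in lowered for term in BANNED_TERMS)

assert not crystallized("Handle invalid input appropriately")
assert crystallized("POST {title: 'Buy milk'} returns 201 and the body contains {id, title}")
```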
Stage 2: Atomic Decomposition
One task = one test file + one implementation file. An ordered dependency DAG ensures each task's dependencies are already verified before it runs. This eliminates "cross-task contamination" — the pattern where one task's test setup silently breaks another task's environment.
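Task selection under this decomposition is pure graph logic. A minimal sketch with a hypothetical task table (field names and task IDs are illustrative):

```python
tasks = {
    "t1_models": {"deps": [],            "status": "verified"},
    "t2_repo":   {"deps": ["t1_models"], "status": "pending"},
    "t3_api":    {"deps": ["t2_repo"],   "status": "pending"},
}

def next_task(tasks: dict) -> str | None:
    # A task is runnable only when every one of its dependencies has already been verified.
    for name, task in tasks.items():
        if task["status"] == "pending" and all(
            tasks[dep]["status"] == "verified" for dep in task["deps"]
        ):
            return name
    return None  # nothing runnable: the project is done, or something upstream is stuck

print(next_task(tasks))  # -> "t2_repo"
```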
Stage 3: Assertion Embedding
The test contract is written into the spec itself, at pytest-assertion granularity. TDD RED phase uses this to generate the test first. The oracle exists before any code does.
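Here is the Stage 1 requirement above, written at the granularity the RED phase would emit. The `/todos` route and the `client` fixture are assumptions for illustration, standing in for whatever HTTP test client the generated project provides (such as FastAPI's TestClient):

```python
def test_create_todo_returns_201_with_id_and_title(client):
    # Mirrors the spec line: POST {title: 'Buy milk'} -> 201, body contains {id, title}
    response = client.post("/todos", json={"title": "Buy milk"})
    assert response.status_code == 201
    body = response.json()
    assert "id" in body
    assert body["title"] == "Buy milk"
```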
The goal is not to claim that benchmark scaling laws automatically transfer to every real-world project. The goal is to reshape your project until enough of it has benchmark-like properties: explicit specs, executable tests, and a deterministic oracle. Once you have that, repeated sampling and local inference become much more interesting.
The Substitution Paradox
Here's the part that feels contradictory:
Replacing non-deterministic judgment with deterministic rules often requires the most creative kind of intelligence.
Designing the rule "run py_compile before the LLM sees the output" seems obvious in retrospect. But reaching that design requires prior reasoning: that syntax errors are a deterministic category, that they can be caught pre-LLM, and that relying on the LLM to catch its own syntax errors is inherently fragile.
That's inference. Creative, structural inference.
This is why ForgeFlow has a three-tier intelligence structure:
| Tier | Role | Actor | Frequency |
|---|---|---|---|
| C — Design | "Which decisions can become deterministic?" | Claude + Joseph | Once per system |
| B — Execute | "Generate code matching this spec" | Qwen3 45GB (local) | Every cycle |
| A — Verify | pytest, py_compile, ruff, gate_decision | forgeflow.py | Every cycle, ≈ free |
The cloud model designs the harness. The local model runs inside it. The deterministic verifier judges the result.
The strongest model's job is to make the weakest model capable.
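Put together, one cycle of the harness is mostly Tier A code wrapped around a single Tier B call. A compressed, illustrative sketch (every name here is hypothetical; the three-strike rule and SHA-256 signatures follow the table above):

```python
import hashlib

def run_cycle(task, generate_with_local_llm, gates_pass, run_tests, failure_history):
    code = generate_with_local_llm(task)        # Tier B: the only probabilistic step
    if not gates_pass(code):                    # Tier A: syntax, style, scope gates
        return "retry"
    result = run_tests(task)                    # Tier A: pytest exit code is the oracle
    if result.passed:
        return "commit"
    # Rule-based gate_decision: hash the failure output into a signature;
    # the same signature three times in a row means the loop is stuck.
    signature = hashlib.sha256(result.stderr.encode()).hexdigest()
    failure_history.append(signature)
    if failure_history[-3:] == [signature] * 3:
        return "deadlock"
    return "retry"
```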
The Economics
Cloud model, direct execution:
Cost per completed task = (1 / P(success)) × price_per_call
As the problem gets harder, tries increase. As the model gets stronger, price increases. Cost diverges in both directions.
ForgeFlow (DCR maximization):
Cost = design_cost (one-time) + N × marginal_local_inference_cost, where N is the total number of local inference attempts
The marginal cost is not zero — electricity, time, and thermal throttling all exist. But it is low enough that repeated attempts become operationally feasible in a way that repeated cloud calls are not. Design cost is high, but it is a constant.
This is the same computation the "Large Language Monkeys" paper runs — applied to a system where the oracle is constructed rather than assumed.
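To see the shape of the two curves, plug illustrative numbers into the formulas above. Every figure below is a placeholder for the sake of the arithmetic, not a measured cost:

```python
p_success = 0.2          # assumed per-attempt success probability on a hard task
cloud_price = 0.50       # assumed USD per cloud call
n_tasks = 500            # assumed project size

expected_attempts = 1 / p_success                         # 5 attempts per task on average
cloud_total = n_tasks * expected_attempts * cloud_price   # 500 * 5 * 0.50 = $1250

design_cost = 400.0      # assumed one-time harness design cost
local_marginal = 0.01    # assumed electricity/time cost per local attempt
local_total = design_cost + n_tasks * expected_attempts * local_marginal  # 400 + 25 = $425
```

The crossover depends entirely on the assumed numbers; the structural point is that the cloud curve scales with attempts while the local curve is dominated by a one-time constant.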
What This Means in Practice
For local AI users: The question isn't "is my model good enough?" It's "have I built a harness where my model only needs to do code generation, and everything else is handled deterministically?"
For AI system designers: DCR is a design maturity metric. DCR 30% means you're relying on the model for most decisions. DCR 85% means the model is a narrow specialist in a well-guarded context. Measure yours.
For teams debating cloud vs local: Once DCR is high enough, model selection becomes an economic and security variable — not a technical one. The same harness runs with a local model overnight or a cloud model during the day. Same gates. Same oracle. Different inference cost.
The Thesis in One Sentence
"The bottleneck of LLM-driven software engineering is not model capability, but the verifiability of specifications — and once verifiability is systematically constructed, a 45GB local model running overnight can match a frontier cloud model running once."
We think this is true. We're building the system to prove it.
What's your system's DCR?
About
I'm Joseph YEO, a solo builder from Seoul, Korea. ForgeFlow is my experiment in pushing local AI agents toward full autonomy — no cloud inference during execution, no hand-holding mid-cycle.
This post is about the design principle behind that experiment: maximizing the ratio of decisions that don't require a model at all.
Follow along:
- 𝕏: @josephyeo_dev
- GitHub: joseph-yeo
- Site: projectjoseph.dev
Built over ~26 sessions, May 2026. All models run locally via Ollama 0.23.0 on Apple Silicon. No cloud APIs were used during autonomous execution.
This post was drafted with Claude and edited by me. I use AI tools to write, just like I use them to code. That's kind of the whole point.
Top comments (1)
This framing around DCR is really useful. I’ve seen the same pattern in production AI systems: the most reliable agent loops are usually the ones where the model has the smallest possible surface area and every surrounding decision is handled by explicit checks, schemas, tests, health probes, or policy gates.
The “verifiability is constructed, not given” point especially resonates. A lot of teams try to improve agent reliability by swapping models, but the bigger unlock is often designing the workflow so success/failure can be mechanically judged. That turns retries, local models, and cheaper inference into realistic options instead of just hoping the model reasons correctly.
I’d be curious how you think about measuring DCR over time as the system evolves. Do you treat it as an architecture review metric, or something you’d actually track alongside test coverage/build health?