The biggest upgrade to our AI coding system was not a better model.
It was deleting model calls.
Everyone in AI infrastructure asks: "Which model should we use?" That question is a distraction. After 22 development sprints and a deep dive into 35 research papers while building ForgeFlow — our fully local, TDD-based autonomous implementation system — we landed on a different question:
"How many of our system's decisions can we replace with deterministic rules?"
The answer to that question often matters more than whether you use a frontier model or a 45GB local one.
The Insight That Changed Everything
LLM-based generation is probabilistic unless tightly constrained. Even when you pin temperature and seed, model calls remain poor substitutes for explicit rules when a decision can be mechanically verified.
Consider what a typical agentic coding loop actually decides:
- Which task to implement next
- Whether the environment is healthy
- Whether the generated code has syntax errors
- Whether the import paths are correct
- Whether the generated file is in scope
- Whether the tests pass
- Whether to commit, retry, or declare deadlock
None of these require a language model. They're deterministic operations: dependency graph traversal, docker health checks, py_compile, ruff, allow-list comparison, pytest exit code, SHA-256 hashing.
The only judgment that genuinely requires a model is code generation itself — one step in the entire cycle.
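To make that concrete, here is a minimal sketch of what the deterministic gates look like as plain Python. Function names and signatures are illustrative, not ForgeFlow's actual API; each one turns a stdlib call or a tool exit code into a boolean, with no model anywhere in the path.

```python
import py_compile
import subprocess
from pathlib import Path

def syntax_ok(path: Path) -> bool:
    # Deterministic syntax gate: the stdlib compiler either accepts the file or raises.
    try:
        py_compile.compile(str(path), doraise=True)
        return True
    except py_compile.PyCompileError:
        return False

def style_ok(path: Path) -> bool:
    # Style gate: ruff exit code 0 means no violations.
    return subprocess.run(["ruff", "check", str(path)], capture_output=True).returncode == 0

def in_scope(path: Path, allowed: set[str]) -> bool:
    # Scope gate: a generated file must be on the task's allow-list.
    return str(path) in allowed

def tests_pass(test_path: Path) -> bool:
    # The oracle: pytest exit code 0 is the only "yes".
    return subprocess.run(["pytest", str(test_path), "-q"], capture_output=True).returncode == 0
```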
We call this ratio the DCR: Deterministic Coverage Ratio.
DCR = deterministic decision points / total decision points
DCR is not a benchmark score. It is a design accounting tool: count every branch in the agent loop, then classify whether it is resolved by code or by model judgment.
ForgeFlow's current DCR: ≈ 85% — 11 of 13 decision points per cycle are deterministic.
| Decision | Deterministic? | Mechanism |
|---|---|---|
| Next task selection | ✅ | dependency DAG traversal |
| Dependency satisfied? | ✅ | status field check |
| Syntax valid? | ✅ | py_compile |
| Style gate pass? | ✅ | ruff |
| Import paths correct? | ✅ | Phase 0.5 AST correction |
| File in scope? | ✅ | allow-list comparison |
| Environment healthy? | ✅ | docker / ollama / disk check |
| Test code generation | ❌ | local LLM |
| Implementation generation | ❌ | local LLM |
| Tests pass? | ✅ | pytest exit code |
| Commit / Retry / Deadlock? | ✅ | gate_decision (rule-based) |
| Failure type? | ✅ | stderr pattern matching |
| Deadlock detected? | ✅ | failure signature × 3 |
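If you want to track the ratio rather than eyeball it, the accounting can live next to the code. A toy version, with decision names mirroring the table above (the data structure itself is illustrative, not ForgeFlow's internals):

```python
# True = resolved by deterministic code, False = resolved by a model call.
DECISION_POINTS = {
    "next_task_selection": True,        # dependency DAG traversal
    "dependency_satisfied": True,       # status field check
    "syntax_valid": True,               # py_compile
    "style_gate": True,                 # ruff
    "import_paths_correct": True,       # AST correction pass
    "file_in_scope": True,              # allow-list comparison
    "environment_healthy": True,        # docker / ollama / disk check
    "test_code_generation": False,      # local LLM
    "implementation_generation": False, # local LLM
    "tests_pass": True,                 # pytest exit code
    "commit_retry_deadlock": True,      # rule-based gate_decision
    "failure_type": True,               # stderr pattern matching
    "deadlock_detected": True,          # repeated failure signature
}

dcr = sum(DECISION_POINTS.values()) / len(DECISION_POINTS)
print(f"DCR = {dcr:.0%}")  # 11 of 13 -> 85%
```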
Why DCR Is the Right Metric
Recent inference-time scaling research keeps pointing to the same condition: repeated sampling becomes powerful when there is an automatic verifier.
Without an oracle, more samples just create more candidates to rank. With an oracle, more samples create more chances to find a correct answer.
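In code, the difference between the two regimes is a single condition in the loop. A hedged sketch, where `generate_candidate` stands in for any model call and `verifier` for any automatic check such as a pytest run:

```python
def sample_until_verified(generate_candidate, verifier, max_samples: int = 32):
    # With an oracle, sampling is a search: draw candidates until one passes
    # or the budget runs out. Without one, this loop has no stopping rule.
    for attempt in range(1, max_samples + 1):
        candidate = generate_candidate()   # probabilistic step (the model)
        if verifier(candidate):            # deterministic step (the oracle)
            return candidate, attempt
    return None, max_samples               # no verified candidate within budget
```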
A few papers that converge on this pattern:
- Large Language Monkeys (Brown et al., ICLR 2025): Coverage scales with sample count. In coding tasks with automatic verification, repeated sampling with a cheaper model consistently outperforms a frontier model run once.
- "The Larger the Better?" (Hassid et al., 2024): Under a fixed compute budget, smaller code models sampled multiple times can match or outperform larger models — when unit tests are available to select the correct answer.
- "Do We Truly Need So Many Samples?" (2025): Multi-model repeated sampling improves sample efficiency. Smaller model combinations approach larger-model performance with far fewer samples, given a verifier.
The critical shared condition: automatic verification must exist.
Unit tests. pytest. An oracle that says "yes" or "no" without human judgment.
This is where most real-world projects diverge from benchmarks. HumanEval and MBPP ship with test suites. Your JWT authentication service doesn't.
The bottleneck isn't model capability. It's verifiability.
Verifiability Is Constructed, Not Given
Existing research operates in "Verifiability as Given" mode: benchmarks provide the oracle, models generate code, oracle decides.
ForgeFlow operates in "Verifiability as Constructed" mode: start with a natural-language requirement and systematically transform it into something testable before any model is ever called.
We call this the Verifiability Transformation Pipeline:
Stage 1: Specification Crystallization
Natural language → numerically precise spec. Zero ambiguous terms ("appropriately", "as needed", "etc." are banned). Every requirement is anchored by a concrete test assertion:
POST {title: 'Buy milk'} → 201, body contains {id, title}
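One way to enforce the zero-ambiguity rule is to lint the spec itself before it ever reaches a model. A toy sketch (the banned-term list and function name are illustrative):

```python
BANNED_TERMS = {"appropriately", "as needed", "etc."}

def crystallized(requirement: str) -> bool:
    # A requirement survives only if it contains none of the vague terms a test can't pin down.
    lowered = requirement.lower()
    return not any(term in lowered for term in BANNED_TERMS)

assert not crystallized("Handle invalid input appropriately")
assert crystallized("POST {title: 'Buy milk'} returns 201 and the body contains {id, title}")
```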
Stage 2: Atomic Decomposition
One task = one test file + one implementation file. An ordered dependency DAG ensures each task's dependencies are already verified before it runs. This eliminates "cross-task contamination" — the pattern where one task's test setup silently breaks another task's environment.
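Task selection under this decomposition is pure graph logic. A minimal sketch with a hypothetical task table (field names and task IDs are illustrative):

```python
tasks = {
    "t1_models": {"deps": [],            "status": "verified"},
    "t2_repo":   {"deps": ["t1_models"], "status": "pending"},
    "t3_api":    {"deps": ["t2_repo"],   "status": "pending"},
}

def next_task(tasks: dict) -> str | None:
    # A task is runnable only when every one of its dependencies has already been verified.
    for name, task in tasks.items():
        if task["status"] == "pending" and all(
            tasks[dep]["status"] == "verified" for dep in task["deps"]
        ):
            return name
    return None  # nothing runnable: the project is done, or something upstream is stuck

print(next_task(tasks))  # -> "t2_repo"
```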
Stage 3: Assertion Embedding
The test contract is written into the spec itself, at pytest-assertion granularity. TDD RED phase uses this to generate the test first. The oracle exists before any code does.
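Here is the Stage 1 requirement above, written at the granularity the RED phase would emit. The `/todos` route and the `client` fixture are assumptions for illustration, standing in for whatever HTTP test client the generated project provides (such as FastAPI's TestClient):

```python
def test_create_todo_returns_201_with_id_and_title(client):
    # Mirrors the spec line: POST {title: 'Buy milk'} -> 201, body contains {id, title}
    response = client.post("/todos", json={"title": "Buy milk"})
    assert response.status_code == 201
    body = response.json()
    assert "id" in body
    assert body["title"] == "Buy milk"
```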
The goal is not to claim that benchmark scaling laws automatically transfer to every real-world project. The goal is to reshape your project until enough of it has benchmark-like properties: explicit specs, executable tests, and a deterministic oracle. Once you have that, repeated sampling and local inference become much more interesting.
The Substitution Paradox
Here's the part that feels contradictory:
Replacing non-deterministic judgment with deterministic rules often requires the most creative kind of intelligence.
Designing the rule "run py_compile before the LLM sees the output" seems obvious in retrospect. But reaching that design requires prior reasoning: that syntax errors are a deterministic category, that they can be caught pre-LLM, and that relying on the LLM to catch its own syntax errors is inherently fragile.
That's inference. Creative, structural inference.
This is why ForgeFlow has a three-tier intelligence structure:
| Tier | Role | Actor | Frequency |
|---|---|---|---|
| C — Design | "Which decisions can become deterministic?" | Claude + Joseph | Once per system |
| B — Execute | "Generate code matching this spec" | Qwen3 45GB (local) | Every cycle |
| A — Verify | pytest, py_compile, ruff, gate_decision | forgeflow.py | Every cycle, ≈ free |
The cloud model designs the harness. The local model runs inside it. The deterministic verifier judges the result.
The strongest model's job is to make the weakest model capable.
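Put together, one cycle of the harness is mostly Tier A code wrapped around a single Tier B call. A compressed, illustrative sketch (every name here is hypothetical; the three-strike rule and SHA-256 signatures follow the table above):

```python
import hashlib

def run_cycle(task, generate_with_local_llm, gates_pass, run_tests, failure_history):
    code = generate_with_local_llm(task)        # Tier B: the only probabilistic step
    if not gates_pass(code):                    # Tier A: syntax, style, scope gates
        return "retry"
    result = run_tests(task)                    # Tier A: pytest exit code is the oracle
    if result.passed:
        return "commit"
    # Rule-based gate_decision: hash the failure output into a signature;
    # the same signature three times in a row means the loop is stuck.
    signature = hashlib.sha256(result.stderr.encode()).hexdigest()
    failure_history.append(signature)
    if failure_history[-3:] == [signature] * 3:
        return "deadlock"
    return "retry"
```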
The Economics
Cloud model, direct execution:
Cost per completed task = (1 / P(success)) × price_per_call
As the problem gets harder, tries increase. As the model gets stronger, price increases. Cost diverges in both directions.
ForgeFlow (DCR maximization):
Cost = design_cost (one-time) + N × marginal_local_inference_cost, where N is the total number of local inference attempts
The marginal cost is not zero — electricity, time, and thermal throttling all exist. But it is low enough that repeated attempts become operationally feasible in a way that repeated cloud calls are not. Design cost is high, but it is a constant.
This is the same computation the "Large Language Monkeys" paper runs — applied to a system where the oracle is constructed rather than assumed.
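To see the shape of the two curves, plug illustrative numbers into the formulas above. Every figure below is a placeholder for the sake of the arithmetic, not a measured cost:

```python
p_success = 0.2          # assumed per-attempt success probability on a hard task
cloud_price = 0.50       # assumed USD per cloud call
n_tasks = 500            # assumed project size

expected_attempts = 1 / p_success                         # 5 attempts per task on average
cloud_total = n_tasks * expected_attempts * cloud_price   # 500 * 5 * 0.50 = $1250

design_cost = 400.0      # assumed one-time harness design cost
local_marginal = 0.01    # assumed electricity/time cost per local attempt
local_total = design_cost + n_tasks * expected_attempts * local_marginal  # 400 + 25 = $425
```

The crossover depends entirely on the assumed numbers; the structural point is that the cloud curve scales with attempts while the local curve is dominated by a one-time constant.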
What This Means in Practice
For local AI users: The question isn't "is my model good enough?" It's "have I built a harness where my model only needs to do code generation, and everything else is handled deterministically?"
For AI system designers: DCR is a design maturity metric. DCR 30% means you're relying on the model for most decisions. DCR 85% means the model is a narrow specialist in a well-guarded context. Measure yours.
For teams debating cloud vs local: Once DCR is high enough, model selection becomes an economic and security variable — not a technical one. The same harness runs with a local model overnight or a cloud model during the day. Same gates. Same oracle. Different inference cost.
The Thesis in One Sentence
"The bottleneck of LLM-driven software engineering is not model capability, but the verifiability of specifications — and once verifiability is systematically constructed, a 45GB local model running overnight can match a frontier cloud model running once."
We think this is true. We're building the system to prove it.
What's your system's DCR?
About
I'm Joseph YEO, a solo builder from Seoul, Korea. ForgeFlow is my experiment in pushing local AI agents toward full autonomy — no cloud inference during execution, no hand-holding mid-cycle.
This post is about the design principle behind that experiment: maximizing the ratio of decisions that don't require a model at all.
Follow along:
- 𝕏: @josephyeo_dev
- GitHub: joseph-yeo
- Site: projectjoseph.dev
Built over ~26 sessions, May 2026. All models run locally via Ollama 0.23.0 on Apple Silicon. No cloud APIs were used during autonomous execution.
This post was drafted with Claude and edited by me. I use AI tools to write, just like I use them to code. That's kind of the whole point.
Top comments (1)
This framing around DCR is really useful. I’ve seen the same pattern in production AI systems: the most reliable agent loops are usually the ones where the model has the smallest possible surface area and every surrounding decision is handled by explicit checks, schemas, tests, health probes, or policy gates.
The “verifiability is constructed, not given” point especially resonates. A lot of teams try to improve agent reliability by swapping models, but the bigger unlock is often designing the workflow so success/failure can be mechanically judged. That turns retries, local models, and cheaper inference into realistic options instead of just hoping the model reasons correctly.
I’d be curious how you think about measuring DCR over time as the system evolves. Do you treat it as an architecture review metric, or something you’d actually track alongside test coverage/build health?