DEV Community: Joseph Yeo

The Human in the Loop Doesn't Scale. I Kept Him Anyway.

Joseph Yeo — Sun, 28 Jun 2026 14:51:25 +0000

What it costs to be the last reviewer your own system has

Part of the ForgeFlow series — building a coding agent that runs its execution loop locally on an M5 Max, and writing down what actually breaks. Planning runs on a frontier model; code generation runs on a local model via Ollama, test-driven inside a Docker sandbox.

In the last post, I described how my agent's rulebook learned to forget — rules that age out get flagged, and a human decides whether to retire them. I ended on a line that's been nagging me ever since: the whole thing works because I'm still small enough to read everything, and I suspect that doesn't last.

This post is about that suspicion. It's about the design decision I keep reaching for — keep a human in the loop — and the uncomfortable thing underneath it: that human is me, I am exactly one person, and "put a human on it" is not a plan. It's a debt I haven't been billed for yet.

"Keep a human in the loop" is the answer I keep reaching for

When an automated decision feels risky, my reflex is usually the same: don't let the machine do it alone, put a person on the final call. It sounds responsible. It's the answer I reach for when I don't yet trust the automation boundary, and it's the one I'd give if you asked me how to make an agent safe.

And it's right, as far as it goes. A human on the final decision catches the failure modes a confidence score can't see. I argued for exactly this last time: the machine is good at noticing a rule hasn't earned its keep lately; it is not good at knowing whether that's because the rule is obsolete or because I just haven't exercised it. So the machine flags, and I decide.

The part I glossed over is what "I decide" actually costs.

The cost nobody puts on the diagram

When you draw a human-in-the-loop system, the human is a small box near the end with an arrow labeled approve / reject. The box looks free. It isn't. Every decision routed to that box spends three things that don't show up in the diagram.

It spends attention — each call needs enough context loaded into a human head to judge it, and that context isn't cached between decisions the way it is for a machine.

It spends latency — the system now moves at the speed of when I happen to look, not the speed of the runs. A flag raised at 2am waits for me. The agent doesn't.

It spends a budget that doesn't grow — the machine's side scales with hardware. My side barely scales at all. I get the same hours next year. If the flags-per-week curve goes up and the human-hours curve is flat, the lines cross. After they cross, "a human reviews it" quietly becomes "a human is supposed to review it," which is a different claim wearing the same label.

That last failure is the one I actually fear. Not the human-in-the-loop that says no. The human-in-the-loop that has too much queued to look properly, and starts rubber-stamping — approving on a glance because the backlog is the real pressure. A reviewer who can't keep up doesn't fail loudly. They fail by agreeing, and a rubber-stamped approval still moves the system forward — it just carries a decision nobody actually made, buried until something downstream breaks.
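
To put rough numbers on the point where the two curves cross, here is a minimal sketch. Every figure in it (starting flag volume, growth rate, minutes per flag, weekly review budget) is an invented placeholder rather than a ForgeFlow measurement, and the function name is hypothetical.

```python
def first_overloaded_week(flags_week0, weekly_growth, minutes_per_flag,
                          review_minutes_per_week, horizon_weeks=520):
    """Return the first week in which review demand exceeds a flat human budget.

    Returns None if demand never exceeds the budget within the horizon.
    """
    flags = float(flags_week0)
    for week in range(horizon_weeks):
        if flags * minutes_per_flag > review_minutes_per_week:
            return week
        flags *= 1.0 + weekly_growth  # the machine's side keeps growing
    return None


# Placeholder scenario: 10 flags a week growing 5% weekly, 6 minutes each,
# against a fixed 120 minutes of real review time per week.
crossing = first_overloaded_week(10, 0.05, 6, 120)
print(crossing)
```

Under these made-up inputs the budget holds for a while and then stops holding. From the crossing week onward the backlog only grows, which is the condition under which a review risks turning into a rubber stamp.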

What I tried: making the human's time the scarce resource it actually is

Once I stopped treating my own attention as free, the design question flipped. It stopped being "where should a human review?" and became "this human has a small, fixed number of real decisions in him per week — which ones are worth spending?"

That reframing changed the design more than another automation pass would have. A few things fell out of it.

Most decisions don't need me; they need a default. A flag I almost always approve isn't a decision, it's a ceremony. The honest move is to pick the safe default, let it happen automatically, and log it where I can audit a sample later — not to stand at the gate nodding. The clearest case: stale-rule flags for rules that only ever applied to throwaway scaffolding from old projects. There's no real call to make there — retiring them is the safe default, so the system does it and tells me, instead of asking. I moved a whole class of "review" into "do the safe thing and log it," and got the time back for the calls that were actually close.

The decisions worth my time are the irreversible and the ambiguous. Retiring a rule that can't easily be un-retired; anything where the machine's confidence is truly split rather than just low. Those I keep. They're rare, which is the point — keeping a human in the loop only scales if the loop is small.

Batching beats interrupting. Ten flags reviewed in one sitting, with shared context, cost a fraction of what the same ten cost across ten interruptions. So the system holds non-urgent decisions and presents them together, instead of paging me the moment each one appears.

None of this removes the human. It does the opposite — it admits the human is the bottleneck and budgets around it, instead of pretending the bottleneck is free and acting surprised when it backs up.
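
The three moves above (default the routine, keep the irreversible and the ambiguous, batch the rest) can be sketched as a small router. This is an illustration under assumptions, not ForgeFlow's code: the `Flag` fields, the `Route` names, and the `ReviewQueue` class are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Route(Enum):
    AUTO_DEFAULT = "apply the safe default and log it"
    HUMAN_REVIEW = "hold for the next review batch"


@dataclass(frozen=True)
class Flag:
    rule_id: str
    reversible: bool        # can the action be undone cheaply?
    ambiguous: bool         # is the machine's evidence genuinely split?
    has_safe_default: bool  # is there an action that is safe without a look?


def route(flag):
    """Send a flag to a human only when it is irreversible, ambiguous, or has no safe default."""
    if not flag.reversible or flag.ambiguous or not flag.has_safe_default:
        return Route.HUMAN_REVIEW
    return Route.AUTO_DEFAULT


@dataclass
class ReviewQueue:
    audit_log: list = field(default_factory=list)  # auto-applied, sampled later
    pending: list = field(default_factory=list)    # held until the next sitting

    def submit(self, flag):
        decision = route(flag)
        if decision is Route.AUTO_DEFAULT:
            self.audit_log.append(flag)
        else:
            self.pending.append(flag)  # no paging; it waits for the batch
        return decision

    def next_batch(self):
        """Hand over everything pending in one sitting and clear the queue."""
        batch, self.pending = self.pending, []
        return batch
```

The design choice worth noticing is that nothing in `submit` interrupts anyone. The only way a flag reaches a person is through `next_batch`, so the size of that batch is the human budget made visible.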

What I deliberately refused to automate, even knowing it doesn't scale

Here's the tension I haven't resolved.

Some decisions I keep for myself even though I know that choice is what limits how far the system can run without me. Retiring hard-won knowledge is one. The call to let the agent act on something it's never done before is another. Not because a model couldn't make those calls — increasingly it could — but because I'm not yet willing to not know when they happen.

That's an honest admission, not a principle. It might be that I'm holding onto these out of caution that's already obsolete. It might be that one of them is the next thing to hand off, and I'm just attached. The reason I can still tell the difference is, again, that I'm small enough to feel each of these decisions individually. The day there are too many to feel is the day this stops being judgment and starts being a story I tell myself about judgment.

So I'm not claiming I solved it. I'm claiming I stopped pretending the human was free, and that alone changed which decisions I let reach me.

What this didn't prove

This is one person's setup, and the bottleneck I'm describing is me, specifically — my hours, my attention, my unwillingness to look away from certain decisions. A team has a different shape of this problem: more reviewers, but also coordination cost, and the rubber-stamp failure mode gets easier to hide, not harder, when "someone reviewed it" can mean anyone.

I haven't shown that my particular triage — defaults for the routine, human for the irreversible and the ambiguous — is the right cut. It's the cut that fit a single-operator system small enough to audit by sampling. I also haven't escaped the core problem; I've only delayed it. Every decision I automate to save my attention is a decision I now have to trust without watching, which is the exact move the earlier posts in this series were nervous about. I traded one risk for another with my eyes open. That's not a solution. It's a position.

The takeaway

"Keep a human in the loop" is true and incomplete. It's true because the human catches what the metric can't. It's incomplete because it quietly assumes the human's time is free, and in my setup the human's time is the least scalable resource in the whole system. A loop with a person in it only works while the loop stays small enough that the person can actually be in it — not nominally, actually.

If I had to compress it: the goal isn't to keep a human in the loop. It's to spend the human on the decisions that deserve a human, and to be honest that everything else was a default you chose, not a review you did.

I'd like to hear how others handle this, because I don't think being small saves me for long. In your systems, what do you actually keep a person on — and how do you tell the difference between a review that's real and one that's become a rubber stamp? And when you handed a decision to automation, how did you decide it was safe to stop watching?

What I Learned by Deleting Rules My Agent Had Already Learned

Joseph Yeo — Fri, 26 Jun 2026 16:03:00 +0000

Knowledge that isn't pruned starts to mislead you

Part of the ForgeFlow series — building a coding agent that runs its execution loop locally on an M5 Max, and writing down what actually breaks. Planning runs through a separate planning step; code generation runs on a local model via Ollama, test-driven inside a Docker sandbox.

For months, my agent got better by accumulating rules.

Every time a project failed in a way I understood, I'd write the lesson down as a rule and feed it back in — don't do this, always do that — so the agent wouldn't repeat the mistake. To be clear about what "rules" means here: not fine-tuning, not weights. A plain, human-readable rulebook the system injects into the agent's context, built up from past failures. For a long time, adding to it worked. Autonomy went up. The same mistakes stopped coming back.

Then the agent started getting worse, and the rulebook was the reason.

Not because the rules were wrong. Because some of them had stopped being right — and I'd never built anything to notice. This post is about the part of the system I hadn't built: the part that lets old knowledge leave.

(This continues a thread from a recent run of posts about verifying your own system. The short version of that run: every layer you trust should be measured before you trust it. This is what happens when you forget that even earned trust has an expiry date.)

The half I built first: a place to write lessons down

The first half of this is the obvious, satisfying half. Your agent fails, you diagnose it, you encode the fix as a rule, and the failure doesn't recur. Do that across a few dozen projects and you've got a real rulebook — concrete, earned, specific. Mine grew to around 77 rules at one point, and I wrote a whole post about how good that felt ("77 Rules Later"). Each rule had a story behind it. Each one had prevented something real.

This is the part that feels like progress, because it is progress — at first. A rulebook induced from actual failures is far better than generic advice like "write clean code." It captures things like a specific framework-level behavior that only surfaces when two configuration paths interact — the kind of thing that turns a multi-hour debugging session into a one-line guardrail.

So you keep adding. Why wouldn't you? Every rule paid for itself once.

The hidden assumption in "why wouldn't you" is that a rule, once true, stays true. That's the assumption that broke.

The half I had missed: a place for lessons to expire

Somewhere past a few dozen rules, I noticed the curve bend the wrong way. More rules stopped producing better behavior. Past a point, they started producing slightly worse behavior — more hesitation, more rules half-applied, the occasional case where two rules pulled in different directions and the agent picked the wrong one to honor.

This surprised me, because my default instinct with the agent was to give it more context. More rules, more examples, more guardrails — surely more is safer. But a rulebook isn't free. Every rule you inject is something the model has to hold, weigh, and reason around on every step. Past some threshold, the cost of carrying a rule can exceed the failure it still prevents — especially when the thing it was written to prevent no longer happens.

And that's the category I'd completely missed: rules that had simply aged out. A rule written for one version of a library, after the library changed. A guardrail for a failure mode I'd since fixed at the source, so it could never recur — but the rule kept riding along, costing attention, preventing nothing. I had a careful process for adding knowledge and no process at all for removing it. The rulebook could only grow.

A rulebook that can only grow stops being a curated knowledge base. It quietly turns into an index where the dead entries dilute the living ones — and nothing in it tells you which is which.

Why a stale rule is worse than no rule

Here's the part worth sitting with, and it rhymes with something I keep running into in this project: the failures that cost me most were the quiet ones.

A missing rule is loud. The agent makes a mistake, you notice, you add the rule. The system has a built-in way to surface what it doesn't know — failure.

A stale rule is silent. It was right when you wrote it. It reads as authoritative because it earned that authority, once. It sits in the rulebook looking exactly like the rules that are still true, and nothing about it announces that the world it described has moved on. The agent dutifully follows it. You don't get a failure that points at it — you get slightly degraded behavior with no obvious cause, spread thin across everything the agent does.

That's why a stale rule can be worse than no rule. A missing rule often leaves a gap you can see once the failure appears. A stale rule fills the gap with confident, outdated instruction, and confident-and-outdated is one of the harder kinds of wrong to catch — because everything looks fine.

What I built instead: rules with a confidence score and a way to die

The fix wasn't cleverer rules. It was treating each rule as something other than a permanent truth.

The reframing that helped: a rule is not a fact. It's a hypothesis with an evidence record. It claims "doing X prevents failure Y," and that claim is either being supported by what actually happens in real runs, or it isn't. So instead of a flat list, each rule now carries a small amount of state: how confident I currently am in it, and what recent evidence supports or undercuts it.

The lifecycle, in plain terms:

A new rule starts as a candidate — written down, but not yet trusted. It hasn't earned its place.
If real runs keep showing evidence that the rule is still guarding against an active failure mode, confidence rises and the rule becomes trusted.
If a rule stops getting that supporting evidence — the failure it guards against simply isn't occurring anymore, across many runs — its confidence decays. Below a threshold, it's flagged stale for review.
A stale rule that, on inspection, appears to guard against something that can no longer happen gets retired. Out of the rulebook, into an archive, where it's recorded but no longer injected.

One honest caveat about that evidence: I can't prove a counterfactual. I don't know the failure would have happened without the rule; I'm inferring it from run traces and the shape of what the agent attempted, not from a controlled A/B where I remove the rule and watch the system break. The confidence score reflects that inference, not a proof. I keep that in mind whenever I read it.

I'm deliberately not giving the exact scoring here, partly because the specific numbers are tuned to my system and would be noise to you, and partly because the point isn't the formula. The point is the direction: confidence flows from evidence, evidence comes from real runs, and a rule with no recent supporting evidence is a liability until proven otherwise — not an asset by default. Adding a rule is cheap. Letting it persist without review is where the cost accumulates.
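
For readers who want the direction in concrete form, here is a minimal sketch of the lifecycle. The thresholds, the gain, and the decay factor are placeholders invented for illustration, not the tuned scoring withheld above, and the class and method names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    CANDIDATE = "candidate"
    TRUSTED = "trusted"
    STALE = "stale"
    RETIRED = "retired"


# Placeholder numbers, chosen only to make the example run.
GAIN = 0.2         # added when a run shows supporting evidence
DECAY = 0.9        # multiplied in when a run shows none
TRUST_AT = 0.6     # at or above this, the rule is trusted
STALE_BELOW = 0.3  # below this, the rule is flagged for review


@dataclass
class Rule:
    text: str
    confidence: float = 0.4
    state: State = State.CANDIDATE

    def record_run(self, supported):
        """Update confidence from one real run; never retires on its own."""
        if self.state is State.RETIRED:
            return
        if supported:
            self.confidence = min(1.0, self.confidence + GAIN)
        else:
            self.confidence *= DECAY
        if self.confidence >= TRUST_AT:
            self.state = State.TRUSTED
        elif self.confidence < STALE_BELOW:
            self.state = State.STALE
        # Between the two thresholds the current state is kept.

    def retire(self, approved_by_human):
        """Retirement is a separate, explicit call made after inspection."""
        if self.state is State.STALE and approved_by_human:
            self.state = State.RETIRED
```

Note that `record_run` can only move a rule as far as stale. Moving it to retired takes a separate call with a human approval, so a quiet stretch can flag a rule but cannot delete it.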

If this sounds like cache invalidation, or like deprecating a feature flag, or like paying down tech debt — yes. It's the same shape: the cost isn't in creating the thing, it's in the discipline of removing it when it stops earning its keep. I just hadn't been treating lessons as something that needed garbage collection.

The part I deliberately didn't automate

The obvious next step would be to close the loop completely: let the system retire its own rules the moment confidence drops. I didn't, and I think the reason matters.

A confidence score is itself a measurement, and measurements can be wrong. A rule might look unsupported simply because I haven't run the kind of project that triggers its failure mode lately — not because it's obsolete. If I let the system auto-delete on a low score, it could kill a good rule during a quiet stretch, and I'd only find out when the old failure came roaring back.

So the system doesn't delete. It proposes. A rule going stale raises a flag for me to look at, and retiring it is a decision I still make. The machine is good at noticing "this rule hasn't earned its keep lately." It is not good at knowing whether that's because the rule is obsolete or because I just haven't stress-tested it. That difference needs a human, at least for now.

What this didn't prove

I'll keep this honest, the way I've tried to with the rest of these.

This is one system's experience — a single-person setup with a rulebook small enough that I can still read all of it. I haven't shown that knowledge decay matters at every scale, or that my particular lifecycle is the right one. It's an approach that worked in this setup, not a general result. For a small or short-lived project, the whole apparatus is overkill; a rulebook you fully re-read every week doesn't need confidence scores, it needs you to read it.

And the hard part isn't solved. "Has this rule earned its keep lately?" is a real signal, but it's a proxy, and proxies drift too — the watcher needs watching, same as everything else in this series. I don't have a clean rule for when a quiet rule is obsolete versus merely untested, which is exactly why I kept a human in that decision. I'm sharing the shape because it changed how I think about my own knowledge base, not because it's finished.

The takeaway

In my own system, I spent enormous effort teaching it new things and almost none deciding when old things should stop being believed. A knowledge base that can only grow may stop getting wiser; past a point, it can get slower and quietly less correct, because the rules that have gone stale look exactly like the rules that are still true.

If I had to compress it: writing a lesson down is the easy half. Letting it expire when it no longer earns its place is the half that keeps the lessons honest.

I'd like to know how other people handle this. In your systems — prompt libraries, rule sets, runbooks, internal docs, lint configs — does anything ever retire, or does it all just accumulate? And if things do get removed, how do you decide a piece of hard-won knowledge has stopped being true? I worked out a rough answer for my case, but it leans on being small enough to still read everything, and I suspect that doesn't last.

We Stopped Trusting Models. Then We Stopped Trusting Our Own Numbers.

Joseph Yeo — Tue, 23 Jun 2026 16:31:06 +0000

Nondeterminism isn't a bug to ban — it's a force to place

Part of the ForgeFlow series — building a coding agent that runs its execution loop locally on an M5 Max, and writing down what actually breaks. Planning runs on Claude; code generation runs on a local model via Ollama, test-driven inside a Docker sandbox.

A while back I wrote that we'd stopped chasing better models — that for this project, swapping in a stronger model kept failing to fix problems that turned out to be about the system around the model, not the model itself. That post ended on a tidy note: the model wasn't the bottleneck, the system was.

This is the post where the same suspicion turns inward. Once you stop trusting the model to be the answer, the next thing you lean on is your own measurements: the test counts, the gate statistics, the tallies that tell you whether the system is working. And over a stretch of building, I learned those can't be trusted blindly either. The three previous posts in this run were each a version of that discovery:

A test suite that passed while measuring the wrong environment.
A gate that blocked 198 times while being wrong often enough that the count could no longer serve as evidence of quality.
An agent that counted twelve when the real number was thirteen.

Three different instruments. A recurring failure shape underneath them: the thing I use to verify can itself be wrong, and it tends to be wrong in a way that looks like success. This post is about what I took from that — and, more importantly, about the wrong conclusion I almost drew from it.

The wrong lesson: "ban the uncertainty"

When you get burned three times by measurements you trusted, there's an instinctive reaction: trust nothing that isn't certain. Push every source of uncertainty out of the system. Treat nondeterminism — anything probabilistic, anything that might come out differently twice — as a defect to eliminate. If the language model is the uncertain part, minimize the language model. Make everything deterministic and tell yourself you can finally sleep soundly.

I leaned that way for a while. It's wrong, or at least too blunt to be useful. Taken seriously, "ban all nondeterminism" throws out the single most valuable thing the model brings to this system — its ability to propose, to explore, to suggest a fix I wouldn't have enumerated. You can't get that from a deterministic rule. The uncertainty isn't only a liability; in the right seat, it's the entire point.

So the question stopped being "how do I remove the uncertainty" and became "where does the uncertainty belong?"

The better lesson: place it, don't ban it

Here's the reframing that held up — and it took an embarrassingly long time to see, partly because I'd quietly assumed nondeterminism was a debt the whole time without ever checking the assumption.

A system like this has two kinds of seats.

Seats that propose and explore. What should we try next? What might this failure be? What's a candidate fix? These seats want nondeterminism. A probabilistic model generating possibilities is a feature here, not a risk. If it's occasionally wrong, the cost is low — being wrong is fine when you're only suggesting, because something downstream still has to approve you.

Seats that judge and record. Did the tests actually pass? Is this allowed through? What gets written down as true? These seats can't tolerate an unaccountable input. They need to be deterministic, reproducible, and checkable — because this is exactly where "almost right" becomes indistinguishable from "right," which is the trap the last three posts kept falling into.

Each of the three failures was the same shape: something that shouldn't have been the final authority had crept into a judging seat. The polluted environment let an unpinned variable decide a verdict. The miscalibrated gate let an unchecked heuristic sit in judgment and reject valid work. The agent's self-report nearly let a probabilistic counter be the last word on a number. In every case, the fix wasn't to purge uncertainty from the whole system — it was to get the wrong thing out of the judging seat, and make sure that seat was held by something deterministic, pinned, and witnessable.

Nondeterminism, it turns out, isn't a quality of the system to be turned up or down. It's a force to be placed. You let it run at the front, where things are proposed and explored. You keep it out of the back, where things are judged and recorded. The model proposes; a deterministic check decides. That one division became the thread that connected the fixes in this run.
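
A toy version of that division, as a sketch: a random sampler stands in for the model in the proposing seat, and a fixed set of test cases holds the judging seat. The candidate pool, the test cases, and all function names are invented for illustration.

```python
import random


def propose(rng, pool, k):
    """Proposing seat: sample k candidate fixes. Choice and order vary by run."""
    return rng.sample(pool, k)


def judge(candidate, cases):
    """Judging seat: the same candidate and cases always give the same verdict."""
    return all(candidate(x) == expected for x, expected in cases)


def first_accepted(rng, pool, cases, k):
    """Return the first proposal that the deterministic check accepts, else None."""
    for candidate in propose(rng, pool, k):
        if judge(candidate, cases):
            return candidate
    return None


# Invented candidates for "return the absolute value"; only one is correct.
POOL = [lambda x: x, lambda x: -x, abs, lambda x: x * x]
CASES = [(-2, 2), (0, 0), (3, 3)]

winner = first_accepted(random.Random(), POOL, CASES, len(POOL))
print(winner is abs)
```

The proposal order changes from run to run, but the accepted result does not, because acceptance never depends on the sampler. The sketch also carries the caveat from the text: the verdict is only as good as `CASES`, so the check itself still has to be checked.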

Why this isn't just "trust deterministic things more"

It's tempting to read all this as "deterministic good, probabilistic bad." That's not it, and the distinction matters.

A probabilistic suggestion in a proposing seat is more valuable than a deterministic one, because it can reach things a rule can't. And a deterministic check is only as good as whether it's actually correct and checkable — the gate in the second post was deterministic and still wrong a lot of the time, and a deterministic judge that's consistently wrong can be worse than a coin flip, because it's a steady error you eventually stop questioning. So it isn't that determinism is virtuous and uncertainty is sinful. It's that they belong in different places, and the whole engineering problem is keeping them sorted:

Let the uncertain thing explore. Let deterministic checks judge — but only after the checks themselves have been checked.

The reason this took four posts and a fair amount of getting it wrong is that the failures don't announce which category they're in. A measurement that's quietly wrong looks exactly like one that's right, until you go and look. The sorting isn't automatic. You do it deliberately, instrument by instrument, and you usually find out you got it wrong only when a number you trusted turns out to have been the problem all along.

What this didn't prove

I want to close the series the way I've tried to write the whole thing — without inflating it.

This is a framing that helped one system: a local AI coding agent with a test-driven loop, where I happen to have the luxury of pinning judging seats to deterministic checks I can watch. I haven't proven it's the right decomposition for every AI system, and I'd be wary of anyone — including me — who turned "place nondeterminism, don't ban it" into a universal law. It's a working lens, not an established result. The evidence behind it is a handful of incidents, honestly reported, not a controlled study.

It also doesn't resolve the hard part: the boundary between "propose" and "judge" is not always obvious. Plenty of real decisions are a blend — a judgment that also has to weigh uncertain evidence, or a proposal that quietly commits you to something before anything else gets a vote. Where exactly to put the deterministic check, and how much to let the probabilistic part inform a judgment without becoming the judgment, is something I'm still working out. The previous post's open question — when a human-witnessed check can graduate to a trusted automated one — is part of the same unsolved area. I'm sharing the lens because it organized a mess for me, not because it's finished.

The takeaway, for the whole run

Four posts, one thread. We stopped trusting that a better model would save us. Then we learned not to trust our own measurements blindly either — not the passing tests, not the busy gate, not the agent's tidy count. What survived all that doubt, for me, was a single discipline that proved sturdy enough to stand on:

In systems like this, every layer that grants trust should be measured before it's trusted — and that measurement should be grounded, wherever possible, in something deterministic you can witness.

That's it. Not a model, not a framework, not a clever architecture. A place to stand. After doubting the model, the gate, the tally, and our own numbers, it's the one thing I found solid enough to build the next thing on.

This is the end of this short run, so I'll ask the broad version of the question. For those building systems with AI in the loop: where do you draw the line between the parts you let be uncertain and the parts you insist be deterministic — and how do you check that you drew it in the right place? I arrived at one answer through a series of mistakes. I'd rather learn the next one from other people's experience than from my own next mistake.

Thanks for following this run. The earlier ForgeFlow posts — on the local agent itself, on why we stopped chasing models, and on what breaks at scale — are linked from my profile.

My Agent Reported 12. The Real Number Was 13.

Joseph Yeo — Mon, 22 Jun 2026 11:47:21 +0000

Why you have to witness the measurement before you trust the instrument

I was building the part of the system that measures itself, and I'd reached the point where it felt safe to let the AI agent do the counting. The task was mundane: tally how often a certain check had fired, aggregate it, report a number. The agent ran and reported 12.

I almost took it. It was a believable number, produced confidently, and I was tired of doing this kind of bookkeeping by hand. But something made me run it down myself in the terminal. The real count was 13. One item that belonged in the tally had been silently dropped from the agent's raw count.

One off. Twelve versus thirteen. On its own, trivial. But it landed on a question I'd been circling for a while — who is allowed to be the final witness to a measurement? — and the answer changed how I let AI into this system. This is the third post in a short run about that larger theme: the tools I use to verify my work can themselves be wrong, usually in ways that look fine.

Why I wanted to hand off the counting

The honest reason was fatigue. A self-measuring system generates a lot of small, exact bookkeeping — count the occurrences, filter the relevant ones, divide, summarize. It's beneath an agent's apparent ability and above my patience, so handing it off felt obvious. The agent reads the data, the agent reports the number, I read the report. Clean.

And to be fair to the agent: most of what it did was right. It wasn't hallucinating wildly. It produced a number that was almost correct — off by one, in a way that would have been invisible if I hadn't gone and looked. That's the dangerous kind of wrong: not obviously broken, just plausible enough to accept.

The one that got away

Here's how I know the real number was 13 and not 12, because "the agent was wrong" needs a ground truth. I didn't ask the agent to summarize. I ran a deterministic pass in the terminal that emitted each matching record, one by one, from the same source data, under the same inclusion rule — and then counted those emitted records directly. Run it again, same 13. The missing entry met that same rule; it just arrived in an irregular shape, unlike the other twelve, so the agent's summarizing pass had skipped it. Thirteen by reproducible enumeration, twelve by agent summary. The difference wasn't a matter of my opinion — it was the same data, counted two ways, where only one of the two ways could be re-run to the same answer.
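
A minimal sketch of what a deterministic, re-runnable enumeration can look like. The record layout, the check name, and the sample data are invented for illustration; the post does not show the real data or the real inclusion rule.

```python
def in_scope(record):
    """Inclusion rule: look only at the fields the rule actually depends on."""
    return record.get("check") == "example_check" and record.get("event") == "fired"


def enumerate_matches(records):
    """Return every in-scope record with its source position, in source order."""
    return [(i, r) for i, r in enumerate(records) if in_scope(r)]


# Invented sample data. The third record is the irregular one: it lacks the
# optional "duration_ms" field, but it still satisfies the inclusion rule.
RECORDS = [
    {"check": "example_check", "event": "fired", "duration_ms": 14},
    {"check": "example_check", "event": "skipped", "duration_ms": 3},
    {"check": "example_check", "event": "fired"},
    {"check": "other_check", "event": "fired", "duration_ms": 9},
    {"check": "example_check", "event": "fired", "duration_ms": 21},
]

first_pass = enumerate_matches(RECORDS)
second_pass = enumerate_matches(RECORDS)

for position, record in first_pass:
    print(position, record)  # each counted record is visible, not just the total

print(len(first_pass), first_pass == second_pass)
```

Two properties matter here. The count is the length of a list a person can read line by line, and running the pass again gives the same list, so a disagreement with any other counter can be traced to a specific record.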

Now the detail that actually unsettled me. The final number I cared about — a summarized metric, produced after a downstream step reduced the raw count to a coarser, normalized figure — came out the same whether the raw count was 12 or 13. It was the kind of step (rounding, bucketing) where a difference of one can simply vanish. So if I'd checked only the headline metric, I'd have seen the agent's report agree with mine and concluded everything was fine. Only the raw count underneath disagreed.

That's the part worth sitting with — and it's where "the final number was the same, so what's the problem?" gets answered. The problem was not that this particular headline metric changed. It didn't. The problem was that the raw measurement layer was already wrong, and the agreement at the normalized layer would have hidden that fact. A different threshold, a different aggregation, or the next consumer downstream might not have absorbed the same error — and once the raw layer is wrong, every later use of it (an audit, a trend line, a regression check) inherits the error. The safety here wasn't something I'd designed. It was luck, and luck doesn't generalize.
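
The masking effect is easy to reproduce. In this sketch the bucket width and the threshold are assumptions chosen for illustration; the post says only that the downstream step was of the rounding or bucketing kind.

```python
def bucket(raw_count, width):
    """Round a raw count down to the nearest multiple of width."""
    return (raw_count // width) * width


def crosses_threshold(raw_count, threshold):
    """A different consumer of the same raw count: a simple alert threshold."""
    return raw_count >= threshold


agent_raw, witnessed_raw = 12, 13

# A bucketing consumer absorbs the error: both raw counts land in one bucket.
same_headline = bucket(agent_raw, 5) == bucket(witnessed_raw, 5)

# A threshold consumer does not: the two raw counts give opposite answers.
same_alert = crosses_threshold(agent_raw, 13) == crosses_threshold(witnessed_raw, 13)

print(same_headline, same_alert)
```

The raw counts are identical in both comparisons. Only the consumer changed, which is why agreement at the headline layer says nothing about the layer underneath.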

So the lesson wasn't "the agent made an arithmetic mistake." Agents make mistakes; that's expected. The lesson was about who I'd let stand as the last check. I'd been about to let the agent's self-report be the final word on a measurement, and the self-report was wrong in a way only a deterministic, re-runnable count would catch.

The instrument can't be the only witness to itself

Let me state the principle the way it finally settled, because it's the part I'd defend.

When you build a measuring system, there's a pull to let the same system that does the work also judge whether it came out right — especially when that system is a capable model that can read logs, count, and summarize. It's convenient. But it quietly inverts the roles. The model's judgment is supposed to be assistance. If it becomes the only thing standing between you and the recorded truth, then the model has become the source of truth — and a probabilistic, occasionally-off-by-one source of truth is sitting in the one seat that should be reproducible and auditable.

The fix isn't to stop using the agent. It's to insist that, at least while I'm still establishing trust, a human witnesses a deterministic run of the measurement: I sit at the terminal and watch the real number come out, with my own eyes, before I let the automated measurement stand on its own. The model can do the heavy lifting. It just can't, yet, be the sole witness to whether the heavy lifting was correct, because verifying a measurement is exactly where "almost right" is indistinguishable from "right" until something deterministic says otherwise.

It's a little uncomfortable to write down, because it means the convenient version of the workflow — agent counts, agent confirms, I read the summary — is the one I specifically can't have when the number matters.

What I changed

Two adjustments.

A human witness before the automation is trusted. Before I let any self-measurement run unattended, I make the deterministic version happen in front of me at least once — the actual count, from the actual data, watched as it's produced, and re-runnable to the same result. If the automated path and the watched path disagree, the automated path is wrong until proven otherwise. The agent's confidence carries no weight in that comparison.

Pin the measurement to units I can actually witness. Part of why the slip nearly slid through was a mismatch between what the agent tallied and what was legitimately in scope. So I narrowed the measurement to count only the units that were genuinely instrumented, meaning the ones a human can directly observe and enumerate, rather than a looser population the agent gets to interpret. When the population is well-defined, the hand-witnessed number and the machine number can actually be compared. When it's loose, they drift, and you can't even tell.
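Both adjustments can be sketched together, as one possible shape and not the code ForgeFlow runs. The `Unit` record, the `in_scope` inclusion rule, and `reconcile` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    name: str
    instrumented: bool  # assumption: each unit records whether it was instrumented


def in_scope(unit: Unit) -> bool:
    """The pinned inclusion rule: count only units that were instrumented."""
    return unit.instrumented


def deterministic_count(units: list[Unit]) -> int:
    """Re-runnable count: the same data and the same rule always give the same number."""
    return sum(1 for unit in units if in_scope(unit))


def reconcile(agent_reported: int, units: list[Unit]) -> int:
    """Return the deterministic count, refusing to proceed if the agent disagrees."""
    counted = deterministic_count(units)
    if counted != agent_reported:
        raise ValueError(
            f"agent reported {agent_reported}, deterministic count is {counted}"
        )
    return counted
```

The agent's confidence is deliberately not a parameter here: on a mismatch, the automated number loses by default.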

Both of these slow things down. That's the cost I'm deliberately paying for the measurements I'm willing to bet on.

What this didn't prove

I want to keep this in proportion.

This is one off-by-one, caught once, on one counting task. It doesn't prove AI agents can't count, or that they're untrustworthy in general, or that you should hand-verify every number an agent produces. For many low-stakes tallies, the agent's number may be good enough, and re-counting by hand would waste a perfectly good tool. I'm not arguing for paranoia.

The claim I'd stand behind is narrow and conditional: for measurements you intend to build further conclusions on, the final witness should be deterministic and, at least initially, human. The stakes set the bar. A throwaway internal count doesn't need this. A number you're going to let the system's self-assessment ride on does — because that's the number whose error compounds.

I'd also flag the obvious tension: this doesn't scale by hand forever. The whole point of automating measurement was to not sit and count things. So "a human witnesses it" is a phase, not a permanent state — it's how trust gets established before the automation runs on its own, not a vow to recount reality by hand every day. For now, I only start relaxing the human witness once the deterministic path and the automated path have matched across repeated runs, and once the inclusion rule is pinned tightly enough that both are counting the same population. Where exactly to graduate beyond that is something I'm still working out, and I don't have a clean rule for it yet.
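The interim rule can be written down as a check, with the caveat that the streak length below is an arbitrary placeholder and not a threshold I've settled on. `ready_to_relax` is a hypothetical name.

```python
def ready_to_relax(history: list[tuple[int, int]], required_streak: int = 5) -> bool:
    """True only if the most recent `required_streak` runs all matched.

    Each history entry is (automated_count, deterministic_count).
    The default of 5 is a placeholder, not a recommended value.
    """
    if required_streak < 1:
        raise ValueError("required_streak must be at least 1")
    if len(history) < required_streak:
        return False
    recent = history[-required_streak:]
    return all(automated == deterministic for automated, deterministic in recent)
```

This only covers the "matched across repeated runs" half. It says nothing about whether the inclusion rule is pinned tightly enough, which still needs a human read.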

The takeaway, stated honestly

You can let an AI write the code. You can let it run the analysis. But if you also let the same system that produced the answer be the final judge of whether that answer was correct, you've made the examinee the examiner, and the score gets hard to trust on its own. For the numbers that matter, something deterministic — and, while trust is still being earned, something human — has to be the last thing that looks.

The agent was right about almost everything — off by one. The whole question is whether you find out about the one, and that depended entirely on whether I was willing to look myself.

How do others draw this line? When you let an AI agent measure or aggregate something, do you verify it independently — and how do you decide which numbers earn the hand-check and which numbers you allow to pass without it? I leaned on doing it myself this time, but I know that doesn't scale, and I'd be glad to hear how people who've gone further handle the trust handoff.

Next in the series — the last of this run: we stopped chasing better models a while ago. This is about the moment we stopped trusting our own measurements too, and what we kept instead.

The Gate Fired 198 Times. I Called It "Working."

Joseph Yeo — Sun, 21 Jun 2026 17:39:52 +0000

Why a blocked count is not a success metric

I built a gate to block bad code. It blocked 198 pieces of code, and I took that number as evidence the gate was working well.

Then I opened the blocked cases and read them one by one, checking each against the acceptance criteria for the task it came from. A large share of them weren't bad code. The gate had been wrong often enough that I could no longer read the block count as evidence it was working — it had been firing constantly, exactly as it was designed to fire, and I'd mistaken "it fires a lot" for "it's doing its job." Those are not the same statement.

This is the second post in a short run about something I kept tripping over while building this agent: the things I use to verify my system can themselves be broken, and they tend to break in ways that look like success. The last post was about a test run that lied by passing. This one is about a gate that lied by blocking.

What the gate is, and why I trusted the count

The agent works in a test-driven loop, and one step in that loop is a gate: before certain code is allowed through, the gate checks that it meets a standard, and if it doesn't, it blocks the code and sends the work back to be redone. The gate exists to keep weak or malformed attempts from getting committed.

For a while, the gate's headline number was how many times it had blocked something: 198. I looked at that and felt good about it. The reasoning felt obvious: the gate is catching a lot of bad attempts, so the gate is valuable, so the system is healthier for having it. High block count, hard-working gate, fewer bad commits. Why look closer?

That reasoning has a hole in it — and that hole is what this post is about.

Two claims I'd quietly merged

When I went through the blocked cases — not the count, the actual cases — I found that a large share of what the gate had blocked was work that was fine. Not malformed, not weak. Legitimate attempts that happened to take a shape the gate didn't recognize, so it rejected them.

I want to be precise about what "fine" means here, because "the gate was wrong" needs a ground truth or it's just my opinion. By ground truth I don't mean "I liked the code." I mean each step had explicit acceptance criteria: the targeted tests it was meant to pass, the behavior it was meant to produce, and the stated constraints for that step. A block was a false positive when those criteria were already satisfied and the gate rejected the work anyway, for a property that wasn't part of the task's success condition.

For example: a solution that passed every test it was supposed to pass, but got rejected because its internal structure didn't match what the gate's check expected — the same task-level behavior, in a representation the task itself never required. That isn't a question of taste. The work met the criteria; the gate said no for a reason that sat outside them.

So the 198 was real. Every one of those blocks happened. What was false was the meaning I'd attached to it. I had collapsed two different claims into one:

The gate fired. (True. 198 times. Verifiable.)
The firing was justified. (Never checked, and, it turned out, often not.)

"It blocked something" and "it was right to block that something" are independent facts. A gate can be extremely active and extremely wrong at the same time, and a miscalibrated gate will tend to be both, because the same flaw that makes it reject good work also makes it reject a lot of it. The block count I'd treated as evidence of value is equally consistent with a gate that's simply trigger-happy. The block count alone can't separate a justified rejection from a false positive.
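To make the independence concrete, here is a small sketch with invented audit data: two gates with the same block count and very different justified rates. The `Block` record and its `criteria_met` field are hypothetical; they stand in for a manual review of each blocked case against its acceptance criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    task: str
    criteria_met: bool  # from case review: had the work already met its acceptance criteria?


def justified_rate(blocks: list[Block]) -> float:
    """Fraction of blocks where the rejected work really did fail its criteria."""
    if not blocks:
        raise ValueError("no blocks to audit")
    justified = sum(1 for block in blocks if not block.criteria_met)
    return justified / len(blocks)


# Invented samples: both gates fired exactly 10 times.
careful_gate = [Block(f"t{i}", False) for i in range(9)] + [Block("t9", True)]
trigger_happy_gate = [Block(f"t{i}", True) for i in range(7)] + [
    Block(f"t{i}", False) for i in range(7, 10)
]

print(len(careful_gate), justified_rate(careful_gate))
print(len(trigger_happy_gate), justified_rate(trigger_happy_gate))
```

The first number printed on each line is the one a dashboard shows. The second only exists after someone has read the cases.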

This seems like an easy mistake to make when building guards — linter rules, CI checks, validation layers, policy filters. At least it was in my case, and I suspect it's more common than we'd like to admit. The dashboard shows you activity. Activity feels like protection. But activity is not the same as correct activity, and the dashboard usually doesn't know the difference, so it shows you the comforting number and lets you supply the flattering interpretation.

"Then how did anything get through?"

That's the fair question, and it's the one that finally made me look. If the gate was wrong that often, how did the system make any progress at all?

The answer is uncomfortable: it made progress despite the gate, not because of it. A typical failure looked like this. The first attempt satisfied the tests but used a structure the gate distrusted, so it got blocked. The retry didn't improve the behavior — it just reshaped the same behavior into a form the gate would accept. From the dashboard, that looked like "the gate forced an improvement." From the case review, it was adaptation to the gate. (When even that didn't work, I'd step in and wave the work through, because I could see it was fine — another quiet sign the gate wasn't earning its place.)

That was the tell I'd missed. A gate doing real work makes the loop converge on better code. Mine was making the loop converge on gate-shaped code — which is not the same thing, and is sometimes worse. The retry didn't make the code more correct. It made it more acceptable to the gate.

How I now try to tell a real block from a false one

Catching this forced me to write down what would actually distinguish a justified block from a noisy one. The count clearly wasn't it. What I landed on is a three-part check — not elegant, but it's caught things since, so I'll offer it as a working heuristic for this setup rather than a rule.

1. Look at the distribution of reasons, not the total. For a healthy gate, I'd expect the block reasons to map to substantive defects. If they instead cluster on the same shallow surface feature, regardless of whether the work is good, the gate is probably pattern-matching on the wrong thing instead of judging quality. (This is more useful for a broad quality gate than for a narrow, single-purpose check.)

2. Watch what happens on retry. This turned out to be the most useful signal. In my loop, a justified block tended to make the work stick on the same underlying defect across retries, until that defect was actually addressed; a false positive produced shape-shifting attempts that changed the surface without improving the behavior. It's a tendency, not a law — a model can wander even when the defect is real — but the shape of the retry sequence carried information the single block event didn't.

3. Check final convergence. A justified block should eventually resolve: the work gets rewritten, the real problem gets fixed, and it passes on its own merits. If blocked work never converges — or only "passes" once you weaken the gate — then either the gate was wrong, or it was right and your loop can't act on it. Both are problems, and both are invisible if you only count how many times it fired.

None of these is a clean pass/fail on its own. Together they let me ask the question I'd skipped — was this block justified? — instead of reading the answer off the fact that a block happened.
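Checks 1 and 2 lend themselves to a rough sketch. Everything here is an assumption about shape: the reason strings, the idea that each retry can be summarized as a (surface fingerprint, failing tests) pair, and both function names are hypothetical.

```python
from collections import Counter


def top_reason_share(reasons: list[str]) -> tuple[str, float]:
    """Check 1: which block reason dominates, and what share of blocks it accounts for."""
    if not reasons:
        raise ValueError("no block reasons recorded")
    reason, count = Counter(reasons).most_common(1)[0]
    return reason, count / len(reasons)


def retry_signal(attempts: list[tuple[str, frozenset[str]]]) -> str:
    """Check 2: summarize a retry sequence of (surface_fingerprint, failing_tests) pairs.

    "shape-shifting": tests already pass on every attempt, only the surface changes.
    "stuck-on-defect": the same non-empty set of tests keeps failing across retries.
    "mixed": anything else; the heuristic declines to guess.
    """
    surfaces = {surface for surface, _ in attempts}
    failing_sets = {failing for _, failing in attempts}
    if failing_sets == {frozenset()} and len(surfaces) > 1:
        return "shape-shifting"
    if len(failing_sets) == 1 and frozenset() not in failing_sets:
        return "stuck-on-defect"
    return "mixed"
```

The "mixed" bucket matters: the retry pattern is a tendency, so the sketch leaves ambiguous sequences for a human to read instead of forcing a label.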

The deeper version of the mistake

There's a more general trap underneath this, and I want to name it plainly, because I fell into it without noticing.

A test that only checks "the gate blocks bad input" is testing the easy half. The hard half is: does the gate let through good input that simply looks unusual? If you only ever feed a guard the inputs it's supposed to reject, of course it rejects them, and of course the tests pass — but you've proven nothing about its false-positive behavior, which is exactly where mine was failing. The gate's own tests were green for the same reason the gate looked healthy: I'd only ever asked it the flattering question.

So now, when I test a gate, I deliberately include cases that are legitimate but oddly shaped — valid work in a form the gate might naively distrust — and check that it lets them through. In this case, the negative cases (reject the bad stuff) were the easier half. The risk I'd under-tested was the good stuff in unfamiliar clothing.
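As an illustration only, here is what testing both halves might look like against a toy gate. Both functions are hypothetical stand-ins, not ForgeFlow's gate: `meets_criteria` plays the role of the task's acceptance criteria, and `strict_gate` adds a structural requirement the task never asked for.

```python
def meets_criteria(source: str) -> bool:
    """Task-level acceptance check: does the submitted `add` actually add?"""
    namespace: dict = {}
    try:
        exec(source, namespace)  # toy example only; untrusted code belongs in a sandbox
        return namespace["add"](2, 3) == 5
    except Exception:
        return False


def strict_gate(source: str) -> bool:
    """Hypothetical gate: correct behavior AND a structure the task never required."""
    return meets_criteria(source) and source.lstrip().startswith("def ")


BAD = "def add(a, b):\n    return a - b\n"          # wrong behavior
GOOD_USUAL = "def add(a, b):\n    return a + b\n"   # right behavior, familiar shape
GOOD_ODD = "add = lambda a, b: a + b\n"             # right behavior, unfamiliar shape


def wrongly_blocked(gate) -> list[str]:
    """Return the names of legitimate cases that the gate rejects."""
    good_cases = {"usual": GOOD_USUAL, "odd": GOOD_ODD}
    return [name for name, source in good_cases.items() if not gate(source)]


print(strict_gate(BAD), wrongly_blocked(strict_gate))
```

A test suite that only asserts `strict_gate(BAD)` is false stays green. The false positive appears only once `GOOD_ODD` is in the test set.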

What this didn't prove

I don't want to inflate this into "all gates are bad" or "block counts are meaningless." Neither is true.

Gates can earn their keep. A well-calibrated one really does stop real problems, and the block count is a perfectly good operational signal — useful for noticing that something is happening, or spotting a sudden spike. The mistake wasn't tracking the count. It was treating the count as evidence of correctness when it's only evidence of activity. Those are different axes, and I'd conflated them.

I'd also flag that my three-part check is shaped by this particular system — a test-driven loop where blocked work gets automatically retried, so "watch the retries" is even available to me as a signal. If your setup doesn't produce that kind of trajectory, parts of this won't transfer. I'm offering it as something that worked in one place, not a general theorem about guards. I've been wrong about generality before in this project, so I'm holding it loosely.

The takeaway, stated honestly

A guard blocking something tells you it's active. It does not tell you it's right. Those are separate facts, and the gap between them is where a confident-looking gate can quietly turn into a noise machine that punishes good work and calls it protection.

If you run gates, linters, validators, policy filters — anything that stops things — it's worth auditing a sample of what it blocked, not just the total it blocked. The total can easily look like diligence. The sample is where you find out whether the diligence was real.

I'm curious how others handle this. If you operate a gate or a strict CI rule, do you ever sample its blocks to check they were justified — and if so, how do you decide "justified" without it turning subjective? I worked out a rough method for my case, but it leans on my system's particular shape, and I'd like to hear how it's done elsewhere.

Next in the series: I asked an AI agent to count something for me. It said 12. The real number was 13, and that off-by-one gap changed a rule I now follow.

When pytest Said "Passed," It Was Lying

Joseph Yeo — Sat, 20 Jun 2026 13:05:33 +0000

How a polluted virtual environment made my green tests meaningless

For a few days, I made decisions on top of a number that wasn't true.

The number was 186 passed. It came out of pytest, green, at the bottom of the terminal, the way it had dozens of times before. I trusted it the way you trust a number that has never been wrong before. Then I found out the run had been measured inside the wrong environment, and the green had very little to do with the code I thought I was checking.

To be fair to the tool: pytest wasn't wrong. It answered exactly the question I handed it — it just wasn't the question I meant to ask. This post is about that gap. Not a bug in a test, but a bug in how I measured the tests. It turned out to be one of the more uncomfortable lessons in the project so far, because it sat underneath everything else. If the floor is tilted, every measurement you take on top of it inherits the tilt, and you don't see it, because the floor looks like the floor.

I'm writing it down mostly because I suspect I'm not the only person who has trusted a green checkmark that didn't earn it.

The setup: a baseline I checked constantly

ForgeFlow is a coding agent that runs a test-driven loop. Plan, write a failing test, write code, run the tests, decide what happened, repeat. Because so much of the system's behavior is judged by "did the tests pass," I keep a baseline: a known set of test files that should report a known set of numbers. Before and after almost any change to the engine, I re-run the baseline and compare. If the counts move when they shouldn't, something is wrong.

One thing to be precise about, because it matters later: the agent executes the code it generates inside a Docker sandbox, but this baseline — the engine's own test suite — I was running from my host shell. Two different executions. The sandbox was fine. The host shell was the problem.

The baseline is the closest thing the project has to a source of truth about its own health. That's exactly why this hurt.

One afternoon I was moving between two things on the same machine — the agent's own codebase, and an unrelated project I'd been poking at earlier. I ran the baseline. Green. 186 passed. I noted it, moved on, and built the next decision on top of it.

What I didn't notice was a single line of state that had carried over from the earlier work.

What actually happened

I'd left a different project's virtual environment active.

That's the whole bug, mechanically. The shell still had another project's environment activated: VIRTUAL_ENV pointed at it, and its bin directory sat first on PATH. So when I ran pytest, the shell resolved the pytest executable through that environment, and Python resolved imports against that environment's installed packages.

Here's the question a careful reader asks immediately: if it was the wrong environment, why didn't it just fail? Why no ModuleNotFoundError, no loud red collection error?

Because nothing was missing. The polluted environment happened to have the same packages installed — only at different versions. So nothing errored out. The tests collected, ran, and passed; they just passed against a version matrix the baseline doesn't assume. And that's the genuinely unsettling part: if a dependency had been entirely absent, I'd have gotten a loud error and caught it in seconds. The danger was precisely that everything was present — present and subtly wrong. Wrong in the one way that doesn't announce itself.
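One way to make "present but at the wrong version" loud is to compare installed versions against the versions the baseline assumes. This is a sketch: `version_drift` is a hypothetical helper, the expected pins would have to come from a lockfile or similar, and the `lookup` parameter exists only so the check can be exercised without touching the real environment. `importlib.metadata.version` is the standard-library call for reading an installed distribution's version.

```python
from importlib import metadata
from typing import Callable


def version_drift(
    expected: dict[str, str],
    lookup: Callable[[str], str] = metadata.version,
) -> dict[str, tuple[str, str]]:
    """Return {package: (expected, installed)} for every package that differs.

    A missing package is reported with the installed version "<missing>".
    """
    drift = {}
    for package, wanted in expected.items():
        try:
            installed = lookup(package)
        except metadata.PackageNotFoundError:
            installed = "<missing>"
        if installed != wanted:
            drift[package] = (wanted, installed)
    return drift
```

An empty result means the active environment matches the pins. Anything else is a reason not to record the run as a baseline.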

The problem is that "green" had stopped meaning what I read it as. I read it as "the code is correct in the environment it's meant to run in." What it actually meant was "the code passed in whatever environment happened to be active." Those are different sentences. For a few days I couldn't tell them apart, because the terminal prints the same word for both.

Here's the part worth sitting with: nothing failed. A failing test is a gift — it's loud, it points at itself, you go fix it. This didn't fail. It passed, and the passing was the problem. The signal I rely on to catch mistakes was itself the mistake, wearing the costume of everything being fine.

"The tests pass" and "the measurement is honest" are different claims

When I finally caught it — by comparing against a run from a fresh shell using the project's intended environment, and noticing the counts didn't line up — I reduced the lesson to a sentence I've kept since:

Whether the tests pass and whether the measurement of the tests is trustworthy are two separate questions, and I had been treating them as one.

Almost all of my testing discipline had been aimed at the first question. I had careful tests. What I didn't have was anything checking the second — the integrity of the act of measuring. The environment the measurement runs in is an input to the result, and I'd been treating it as a constant when it was actually a variable I'd left lying around.

It's easy to invest heavily in test correctness while leaving test measurement integrity implicit — to treat it as the environment's job, or the tooling's job, rather than something to check directly. Plenty of teams do handle it, with lockfiles, containerized test runs, hermetic builds. I just wasn't one of them at this layer, for this particular command, on this particular day.

What I changed

Two things, deliberately small.

A guard before the measurement, not just inside it. The cheapest fix is a single check that runs before the baseline: confirm the environment is the one I think it is. In my case, simply asserting that no foreign virtual environment was active would have been enough. It's the testing equivalent of checking the floor before you measure the wall: almost embarrassingly simple, and it would have caught this in one second. (A stricter guard wouldn't stop at the VIRTUAL_ENV variable; it would also check sys.executable, the resolved pytest path, and the expected project root. The variable was the obvious giveaway here, but it's the weakest of the four.)
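A minimal sketch of the stricter guard, covering the four signals named above. The function names, the injected parameters, and the `.venv` directory inside the project root are assumptions for illustration, not the guard as it exists in the project.

```python
import os
import shutil
import sys
from pathlib import Path
from typing import Optional


def environment_problems(
    project_root: Path,
    venv_dir: Path,
    virtual_env: Optional[str],
    executable: str,
    pytest_path: Optional[str],
    cwd: Path,
) -> list[str]:
    """Return the reasons the measurement environment is not the intended one."""
    problems = []
    if virtual_env is not None and Path(virtual_env) != venv_dir:
        problems.append(f"foreign VIRTUAL_ENV active: {virtual_env}")
    if venv_dir not in Path(executable).parents:
        problems.append(f"interpreter outside project venv: {executable}")
    if pytest_path is None or venv_dir not in Path(pytest_path).parents:
        problems.append(f"pytest resolved outside project venv: {pytest_path}")
    if cwd != project_root and project_root not in cwd.parents:
        problems.append(f"running outside project root: {cwd}")
    return problems


def check_current_environment(project_root: Path) -> list[str]:
    """Gather the live values and run the same checks (assumes a `.venv` in the root)."""
    return environment_problems(
        project_root=project_root,
        venv_dir=project_root / ".venv",
        virtual_env=os.environ.get("VIRTUAL_ENV"),
        executable=sys.executable,
        pytest_path=shutil.which("pytest"),
        cwd=Path.cwd(),
    )
```

Run before the baseline, a non-empty list is a reason to stop before pytest prints anything. The path comparisons are purely lexical, so symlinked environments would need extra care.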

A separate set of checks for the measurement itself. Beyond the one-line guard, I added a small, dedicated set of tests whose only job is to protect the invariants of how I measure — not the features, the measurement. They're counted separately from the normal baseline on purpose, so they can't be quietly folded into the same number they're supposed to be watching. The exact count is secondary. What matters is that "is my measurement honest" became something the system checks for me, instead of something I assume.

Neither of these is clever. That's sort of the lesson. The failure wasn't subtle once I saw it; it was invisible only because I'd never thought to look there.

What this didn't prove

I want to be careful not to inflate this into a grand principle.

This is one incident, on one machine, caused by one careless bit of leftover state. It doesn't prove that everyone's test suites are secretly lying, and it doesn't prove you need an elaborate measurement-verification layer. For a small throwaway script, a clean shell and a moment of attention are the entire fix, and the machinery would be overkill.

To be clear about which part scales: environment hygiene matters at any size — it's the guards and dedicated checks that scale with how much you're betting on the number. I'm betting a lot on mine; the agent makes real decisions off these signals. So for me the machinery earned its place. The hygiene would have been worth it regardless of project size.

I'm also aware the "fix" mostly moves the trust down one level. Now I trust the guard. If the guard is wrong, I'm back where I started. There's no absolute bottom here — just a level low enough that I'm willing to stop and call it ground. I picked one. I could be wrong about whether it's low enough.

The takeaway, stated honestly

If I had to compress it: a green test run answers "did the code pass?" It does not answer "did I measure that in the environment I meant to?" The second question has its own failure mode, and because the failure mode is passing, your normal instincts — chase the red, fix what's loud — never fire.

The verification pyramid most of us picture has tests at the bottom. I'd now put one more layer underneath it: the environment the tests run in. When that layer shifts, every green light above it is reporting on an environment you're not actually in.

I'd like to know how other people handle this. Do you guard your test environment explicitly, or rely on convention and attention? Have you been bitten by a passing result that turned out to be measured wrong — and if so, how did you finally catch it? I caught mine by luck and a mismatched count. I'd rather not depend on luck next time, and I suspect some of you have better answers than I do.

Next in the series: a quality gate in the same system blocked code 198 times — and why I was wrong to call that "working."

We Spent Six Sessions Fixing One Task. The Problem Was Six Tasks.

Joseph Yeo — Tue, 02 Jun 2026 11:42:06 +0000

This is Part 9 of the ForgeFlow series. Part 8: 77 Rules Later ended on a question we couldn't answer at the time: can a rule-based agent system keep growing without becoming harder to reason about than the model it was built to constrain? This post is about one small episode that pushed us toward an answer — not a clean one, but a useful one. It's also a record of getting a diagnosis wrong for several sessions in a row, and what finally corrected it.

Quick terms for new readers:

FC = Failure Catalog entry (a documented failure pattern)

CL = Crystallized Lesson (a testable design rule derived from repeated failures)

critical_rules = a block of rules injected into the model's prompt for a given project

nightrun = an overnight batch that re-runs projects so we can measure behavior over many runs

ForgeFlow = a fully local, TDD-based autonomous coding system running on Apple Silicon

The failure we kept "fixing"

For several sessions, we had a recurring failure in one project that we treated as a single, local problem.

The symptom: in async tests, the model would construct an HTTP test client directly instead of using the shared fixture we'd set up. That shared fixture is what installs the dependency override pointing the app at the test database. By hand-rolling its own client, the model skipped the fixture entirely — so the override was never applied, and the test hit an unconfigured database and failed with a missing-table error.

(The underlying trigger, for those on the same stack: httpx 0.28 removed the deprecated app= shortcut from Client/AsyncClient. The supported pattern is now transport=ASGITransport(app=app). Our conftest fixture already used the current ASGITransport approach and was correct — the app= argument on ASGITransport itself is still valid. The model just wasn't using the fixture; it kept hand-rolling the client the old, removed way. Why it preferred the old pattern is something we can only guess at — likely the weight of older examples — and we didn't try to prove it.)

Each time it surfaced, we did the natural thing. We looked at the task where it appeared, and we wrote a rule for that task. Next run, it would seem quieter there — and then show up somewhere else. We'd note it, half-suspect it was the same thing, and move on to whatever else the run surfaced.

What we were doing, in effect, was treating a recurring pattern as a series of unrelated incidents. We never asked the obvious question: how often does this actually happen, and where?

The measurement we should have taken earlier

The thing that broke the loop wasn't a better model or a smarter rule. It was a measurement we hadn't bothered to take.

We wrote a small script to walk back through our run history and count occurrences of the failing pattern across every recorded run. It didn't reconstruct the count from summary tables; it grepped the archived stderr in every run directory for the exact failing construction. (We'd learned in an earlier session that our summary-level deduplication could merge distinct failures under one signature, so we didn't trust the rolled-up signatures for this.)
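A sketch of what such a counting script could look like. The directory layout (`<runs_root>/<session>/<project>/<task>/stderr.log`), the regular expression, and the choice to count each matching log file once are all assumptions made for the example; the actual archive structure and matched string aren't described here.

```python
import re
from pathlib import Path

# Assumed pattern: a client constructed with the removed app= shortcut.
PATTERN = re.compile(r"AsyncClient\(\s*app=")


def count_occurrences(runs_root: Path, pattern: re.Pattern = PATTERN) -> dict[str, int]:
    """Scan raw stderr files and summarize where the pattern appears.

    Assumes the layout runs_root/<session>/<project>/<task>/stderr.log and
    counts each log file containing the pattern as one occurrence.
    """
    occurrences = 0
    sessions, projects, tasks = set(), set(), set()
    for log in sorted(runs_root.glob("*/*/*/stderr.log")):
        session, project, task = log.relative_to(runs_root).parts[:3]
        if pattern.search(log.read_text(errors="replace")):
            occurrences += 1
            sessions.add(session)
            projects.add(project)
            tasks.add((project, task))  # the same task name in two projects counts twice
    return {
        "occurrences": occurrences,
        "tasks": len(tasks),
        "projects": len(projects),
        "sessions": len(sessions),
    }
```

Reading the raw text sidesteps the deduplication problem, at the cost of only finding failures that leave this exact string behind.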

In the archived runs we checked, the count came back:

16 occurrences
across 6 distinct tasks
in 3 different projects
spread over 7 separate run sessions

This count made our earlier framing untenable. We had been writing single-task rules for something that was happening across six tasks in three projects. Each of those per-task fixes had been addressing, at most, one of the six affected tasks. The measurement didn't just refine our picture; for this failure pattern, it showed our framing had been wrong in kind, not just in degree.

It's a little uncomfortable to write that down. But that's the part worth keeping: the wrong number wasn't in the model's output. It was in our own estimate of how widespread the problem was, and we'd carried that estimate for several sessions without checking it.

The prescription got smaller, not bigger

Here's the part that surprised us most.

Once we saw "6 tasks, 3 projects," the instinct might be to write six fixes — one hardened rule per affected task. Given how we'd been responding up to that point, that's roughly the path we were on.

Instead, the data pointed the other way. If one pattern was appearing across many tasks and projects, the better place to address it wasn't any individual task — it was the project-wide rule block. We added one line to the critical_rules for the affected projects: a single instruction telling the model not to construct the test client directly, and to use the fixture instead (taking the rule block from 20 lines to 21).

One rule addressed a pattern that, on our prior trajectory, could easily have become six separate task-level patches. This was a small, concrete instance of something we keep seeing on this project: when we measure a problem's actual scope more carefully, the fix tends to get smaller, not bigger. When you don't know the shape of a problem, you tend to over-prescribe locally and under-address globally. Measuring the shape let us do less.

We then ran a verification batch over the affected projects. The grep count for the pattern came back at zero across those runs. Two of the projects involved — a small gallery API and a small library API — met the per-project completion criteria on the verifying runs, so we marked them as "graduated" in a narrow, project-level sense.

Note on "graduated": Part 8 used this word for an entire stack. Here it means something much smaller — these individual projects met their completion criteria on the runs we executed. It is not a claim about the stack, and not a claim that these projects will never fail again. The underlying numbers were modest: pass rates in the roughly 38–85% range across individual runs. What changed wasn't a jump to near-perfect runs — it was that this particular structural failure stopped appearing. We're reporting a state we observed, not a guarantee.

What this didn't prove

A few limits, because the result is smaller than it might sound.

The zero is a zero on the runs we executed, for this specific pattern. It's evidence the rule is doing its job in the tested scope, not proof the pattern is gone for good. A different project, a different httpx version, or a different phrasing of the same task could surface it again.

The measurement approach itself has a known weakness we worked around rather than solved: it reads raw error text, which is reliable for exact-pattern counting but says nothing about why each occurrence happened. For this pattern — a single, well-understood cause — that was fine. For a fuzzier failure, raw-text counting would undercount or overcount, and we'd need something better.

And the broader idea ("measure before you prescribe") is a working heuristic from repeated experience on this project, not something we've established rigorously. We'd genuinely like to know where it breaks down for other people.

Beyond the coding loop

The same week, an unrelated problem came up — a structural issue in how some of our project directories were tracked in version control. An earlier note had flagged it as a likely large cleanup, the kind of thing you budget a careful session for.

Before touching anything, we applied the same lesson the measurement script had just taught us: measure first. A few read-only checks showed the actual scope was much smaller than the earlier flag assumed — most of what looked like a structural defect turned out to be a single missing ignore-rule and one stale index entry, and the fix was a handful of non-destructive steps rather than a restructuring. I'm including this not as a second result but as an observation: in both cases the failure mode was the same — acting on an estimate instead of a measurement — and in both cases, once we measured, the prescription shrank. Two instances isn't a trend, but it was enough to make "measure the scope before writing the fix" a step we now try to take on purpose rather than when we happen to remember.

Back to Part 8's question

Part 8 ended by asking whether a growing rule set could stay manageable. This episode is one data point toward a tentative answer: rule growth seems more controllable when the decision to add a rule is driven by measured scope rather than by where a failure last happened to appear. Six task-level rules would have grown the rule set faster and addressed the problem worse. One measured rule did more with less.

That's not a method, and it's certainly not a solution to the interaction-effect problem from Part 8. It's a habit we're trying to form: before crystallizing a new rule, spend the few minutes it takes to count how often and where the underlying failure actually occurs. Sometimes that collapses six patches into one rule. Other times — we assume, though we haven't hit this yet — it will reveal that what looked like one problem is really several, and we'll need more rules, not fewer. Either way, the rule follows the measurement.

Series Links

ForgeFlow runs on a MacBook Pro M5 Max 128GB. Planning uses Claude (cloud API). Execution is fully local — Qwen3-Coder-Next 45GB via Ollama, gemma4:26b for QA, Docker sandbox, no API calls during the coding loop. The methodology and failure data are shared in this series.

If you're running your own local agents or failure-catalog systems: have you caught yourself prescribing locally for a problem that turned out to be project-wide — and what finally made you measure it? Logs, test signatures, prompts, generated diffs? The comments are open.

77 Rules Later: What Graduating Our First Stack Actually Looked Like

Joseph Yeo — Mon, 25 May 2026 08:19:22 +0000

This is Part 8 of the ForgeFlow series. Part 7: The File Modification Boundary documented the constraint that changed how we structure tasks: every autonomous task target should be a new file. We ended Part 7 at 12 projects, roughly 52 failure patterns, and 71 design rules. Part 7 closed by naming the next test: project 13 would be the first real check of whether CL-071 is a design principle or just a project-12-specific workaround.

Quick terms for new readers:

FC = Failure Catalog entry (a documented failure pattern)

CL = Crystallized Lesson (a testable design rule derived from repeated failures)

DEADLOCK = the system gives up after repeated identical failures

ForgeFlow = a fully local, TDD-based autonomous coding system running on Apple Silicon

Part 7 ended with a hypothesis and a bet.

The hypothesis: CL-071 (every task targets a new file, never modifies an existing one) might reduce or remove the dominant failure mode we'd been observing. The bet: we'd set formal graduation criteria and run projects until we met them — or discovered why we couldn't.

We ran five more projects (with one intermediate rerun included in the data). On the seventeenth — a blog API with 14 tasks — all 33 tests passed without intervention or deadlock, completing in approximately 12 minutes.

This post is about the five projects between that hypothesis and this result, what the graduation criteria actually measured, and the failure that appeared after we thought we'd addressed all the known ones.

The Graduation Criteria

Before results, here's what we were measuring. We didn't want "it worked once" to count as graduation. We defined four conditions, all of which had to hold on a qualifying run:

Criterion	Threshold
First-run pass rate (tasks passing on the first TDD cycle, no retry)	≥ 85%
New FC yield per project	≤ 2
Repeat FC rate (previously solved patterns recurring)	≤ 5%
Teacher escalation (human operator interventions mid-task)	Decreasing trend

The logic: a graduated stack should pass most tasks on the first TDD cycle within the tested scope (criterion 1), stop producing novel failure patterns at a high rate (criterion 2), not regress on already-solved problems (criterion 3), and require less human involvement over time (criterion 4).

We chose 85% rather than 100% for the pass rate deliberately. Occasional retries are expected behavior in a TDD loop — in ForgeFlow's architecture, the system is designed to recover from them. What we track is whether it recovers autonomously.
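
The four conditions can be written as one check, sketched here under assumptions. The thresholds come from the table above; the RunMetrics fields, the reading of "decreasing trend" as "never rises and ends lower or at zero", and the sample numbers are all illustrative choices, not ForgeFlow's implementation.

```python
from dataclasses import dataclass


@dataclass
class RunMetrics:
    first_run_pass_rate: float    # fraction of tasks passing on the first TDD cycle
    new_fc_count: int             # new failure-catalog entries from this project
    repeat_fc_rate: float         # fraction of failures matching solved patterns
    escalations_by_project: list  # human interventions per project, oldest first


def is_decreasing(counts):
    """Assumed definition: never rises, and ends lower than it began or at zero."""
    if not counts:
        return False
    never_rises = all(b <= a for a, b in zip(counts, counts[1:]))
    return never_rises and (counts[-1] < counts[0] or counts[-1] == 0)


def check_graduation(m):
    """Return one boolean per criterion; graduation needs all four to be True."""
    return {
        "first_run_pass_rate": m.first_run_pass_rate >= 0.85,
        "new_fc_yield": m.new_fc_count <= 2,
        "repeat_fc_rate": m.repeat_fc_rate <= 0.05,
        "teacher_escalation": is_decreasing(m.escalations_by_project),
    }


# Hypothetical run: passes three criteria, fails on new FC yield.
result = check_graduation(RunMetrics(12 / 14, 3, 0.0, [2, 1, 0]))
print(result, "graduated:", all(result.values()))
```

Returning the per-criterion booleans instead of a single verdict keeps a near miss visible: this hypothetical run clears the pass-rate line but still does not graduate.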

The Five-Project Path

Here's the longitudinal data from Part 7's endpoint (project 12) through the graduation run. Note: this table tracks the autonomous pass rate — tasks that eventually passed without human intervention, including retries. The graduation criterion uses the stricter first-run pass rate (no retries), which we measured separately for the qualifying run.

#	Project	Tasks	Autonomous Pass Rate	New FCs	CL Count (at time)
13	comment-api	12	83%	0	~72
14	order-api	16	56%	2	~74
15	recipe-api	14	57%	1	~75
16	bookmark-api v2	12	83%	0	~76
16.5	catalog-api-v2	12	83%	1	~76
17	blog-api	14	100%	1	77
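
The difference between the two pass rates is easiest to see when both are computed from the same records. This is a sketch with hypothetical field names (cycles, passed, human_help); the 13-plus-1 split mirrors the project 17 figures reported in this post.

```python
def pass_rates(tasks):
    """Return (first_run_rate, autonomous_rate) for a list of task records.

    Each record is a dict with:
      cycles      number of TDD cycles the task needed
      passed      whether the task eventually passed
      human_help  whether a human intervened mid-task
    """
    total = len(tasks)
    unaided = [t for t in tasks if t["passed"] and not t["human_help"]]
    first_run = [t for t in unaided if t["cycles"] == 1]
    return len(first_run) / total, len(unaided) / total


# 14 tasks: 13 pass on the first cycle, 1 passes after a single retry.
tasks = [{"cycles": 1, "passed": True, "human_help": False} for _ in range(13)]
tasks.append({"cycles": 2, "passed": True, "human_help": False})

first_run, autonomous = pass_rates(tasks)
print(f"first-run {first_run:.0%}, autonomous {autonomous:.0%}")
```

A retried task counts toward the autonomous rate but not toward the first-run rate, which is why one run can be 100% in the table above and 93% against the graduation criterion.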

The trajectory wasn't smooth. Projects 14 and 15 dropped below 60%. Then it recovered. In this sequence, plateaus tended to expose a new failure category; the system dipped, the failure got crystallized into a rule, and the next project incorporated the fix.

What changed between project 15 (57%) and project 17 (100%) was not a model upgrade or an engine rewrite. It was three additional design rules, each derived from a specific failure we observed and diagnosed.

The Dip: What Went Wrong on Projects 14 and 15

Projects 14 (order-api) and 15 (recipe-api) both hovered around 56–57% autonomous pass rate. The failures clustered around a few patterns:

Route endpoint isolation. Tasks that bundled multiple endpoints into a single file — GET list and GET detail in the same route module — showed a notably higher failure rate than single-endpoint tasks. The outputs showed scope-related failures: given two endpoints to implement, the model would sometimes complete one and leave the other as a stub, or attempt both and introduce inconsistencies.

We already had CL-043 (one task, one endpoint) from Part 6. But we'd been applying it loosely — allowing two closely related endpoints to share a task. Projects 14 and 15 showed us that "closely related" was too vague for this local execution loop. The rule needed to be absolute: one endpoint, one file, one task.

Import specification gaps. Route tasks that didn't explicitly list every required import in their task description had a high failure rate. The model would guess import paths, often incorrectly. CL-072 crystallized this: every route task description must include a complete "Required imports" block. For example:

Required imports: from fastapi import APIRouter, Depends;
from sqlalchemy.ext.asyncio import AsyncSession;
from app.database import get_db;
from app.schemas.author import AuthorCreate, AuthorRead
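
A rule like this is mechanical enough to check before a task ever reaches the model. The sketch below assumes the block is introduced by the literal text "Required imports:" and that statements are separated by semicolons or newlines, as in the example above; the function name is hypothetical and this is not the validate_prd.py implementation.

```python
import re

MARKER = "Required imports:"


def required_imports(description):
    """Return the import statements listed after the marker, or [] if absent."""
    if MARKER not in description:
        return []
    block = description.split(MARKER, 1)[1]
    parts = [p.strip() for p in re.split(r"[;\n]", block)]
    return [p for p in parts if p.startswith(("from ", "import "))]


good = (
    "Create the author route.\n"
    "Required imports: from fastapi import APIRouter, Depends;\n"
    "from sqlalchemy.ext.asyncio import AsyncSession;\n"
    "from app.database import get_db;\n"
    "from app.schemas.author import AuthorCreate, AuthorRead"
)
bad = "Create the author route."

print(len(required_imports(good)), len(required_imports(bad)))
```

An empty result is the signal to reject the task description before execution, instead of letting the model guess import paths.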

Decimal type mismatches. In project 16.5 (catalog-api-v2), a product model with a Numeric(10,2) price column exposed a subtle testing issue. The model wrote assertions comparing float literals to SQLAlchemy Decimal values — and 999.99 != Decimal('999.99') in Python. CL-076 captured this: any Numeric column test must use Decimal comparisons.
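
The mismatch is reproducible with the standard library alone. This short example assumes the column value arrives as a Decimal, which is what SQLAlchemy's Numeric type returns by default.

```python
from decimal import Decimal

price = Decimal("999.99")  # stand-in for the value read back from the column

# A float literal is a binary approximation, so equality with Decimal fails.
print(999.99 == price)             # False

# Compare Decimal to Decimal, built from a string rather than a float.
print(Decimal("999.99") == price)  # True

# Decimal(999.99) carries the float's binary error into the Decimal.
print(Decimal(999.99) == price)    # False
```

The third line is the trap worth remembering: wrapping a float in Decimal() does not repair it, so test fixtures need string-constructed Decimals.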

In our diagnosis, these looked less like model-capability failures and more like specification-precision failures — cases where the PRD left enough ambiguity for a 45GB quantized model to make a reasonable-but-wrong choice.

The Failure We Didn't Expect: FC-074

Project 17 (blog-api) was designed as the graduation attempt. We applied all 76 existing rules. The PRD passed our automated validator (50 checks passed, 0 failures). We expected fewer known-pattern failures.

The first three attempts all failed on the very first task — creating the Author model. Same error each time: red_apply_empty — the engine's signal that the RED-phase output contained implementation code rather than a test.

Here's what happened, step by step:

Our setup script created a minimal model stub file — just the class name and primary key column. This was standard practice per CL-066 ("stubs should be PK-only").
Before the RED phase (test generation), the engine runs FC-060 cleanup: it deletes the target implementation file so the model writes it fresh.
FC-060 deleted the stub.
The model didn't need the file to exist at generation time — the surrounding task context still described enough of the intended model structure (via data_models in the PRD and conftest import references) that it produced implementation code during RED instead of a test.
The engine detected this as a scope violation and triggered red_apply_empty.
Three retries. Same result each time.

We called this FC-074: the interaction between two previously validated rules (CL-066: keep stubs minimal, and FC-060: clean target files before RED) producing a new failure when combined.

This is worth pausing on. FC-074 wasn't a gap in any single rule. It was an interaction effect — two rules that had each been validated independently across multiple projects, producing a failure only in a specific sequence of operations.

Rule	Behavior in isolation	Combined behavior
CL-066	Minimal stubs reduce over-complete-stub failures	Creates a target file before RED
FC-060	Deletes implementation target before RED to ensure clean state	Removes the stub CL-066 created
Combined	—	RED sees a missing target but enough context to generate implementation instead of a test

The Fix: Stop Creating Stubs

The first instinct was to adjust the prompt wording — tell the model more explicitly to write a test, not an implementation. We tried that. Same failure. Prompt changes alone didn't resolve it; file-state became the stronger hypothesis.

The second instinct was to refine the stub. But we diagnosed the stub's existence as the likely trigger: FC-060 deleted it, and the residual context information was enough to derail the RED phase.

The third attempt was the simplest: don't create the stub at all.

CL-077: Setup scripts must not create model stub files. Model files are created from scratch by the task that implements them. The conftest wraps model imports in try/except so that earlier tasks can run before the model file exists:

try:
    from app.models.author import Author
except ImportError:
    Author = None

This inverted an assumption we'd held across the previous 16 project iterations. We'd operated under the belief that providing a stub — even a minimal one — helped the model by giving it a starting point. FC-074 suggested that in our current engine architecture, the stub hurt by creating a state that the cleanup logic couldn't handle cleanly.

After applying CL-077, the same blog-api project ran all 14 tasks to completion. 33 tests passed, zero intervention, approximately 12 minutes total.

What the Graduation Run Measured

Here's how project 17 scored against the criteria:

Criterion	Threshold	Project 17 Result
First-run pass rate	≥ 85%	93% (13/14 first-shot, 1 retry)
New FC yield	≤ 2	1 (FC-074)
Repeat FC rate	≤ 5%	0%
Teacher escalation	Decreasing	Zero escalations

Project 17 met all four thresholds. The preceding project (16.5, catalog-api-v2) reached 83% — close but below the ≥85% line. So we are treating project 17 as the graduation point rather than claiming a two-project stable plateau.

To be precise about what this means and what it doesn't:

What it means: On the specific runs we executed — FastAPI + SQLAlchemy async + pytest projects with CRUD-level complexity and 1:N foreign key relationships, using Qwen3-Coder-Next 45GB Q4_K_M on Apple Silicon M5 Max 128GB with 77 design rules — the system completed the full project autonomously within the scope of new-file-creation tasks.

What it doesn't mean: We haven't tested more complex architectural patterns (many-to-many relationships, authentication flows, file uploads, WebSocket endpoints). We haven't tested with different model families or hardware tiers. The 100% figure is for one specific project run; it's a data point, not a guarantee.

77 rules is a lot of rules. Each one was derived from at least one observed problem. But the cumulative load of maintaining 77 interacting rules is substantial. We don't yet know if this scales — whether a 200-rule system would be manageable or would collapse under interaction effects. This matches a concern we are starting to track internally: beyond a certain threshold, adding more constraints may dilute model attention rather than improve output. In our design, we've set a ceiling of 20 CLs per prompt injection bundle to guard against this, but we haven't yet hit a project that tests that limit.

The Rule Accumulation Curve

One pattern we've been tracking across the series is how the rate of new rule discovery changes over time:

Projects  1–3:   CL-001 to CL-020   (~7 per project)
Projects  4–6:   CL-021 to CL-035   (~5 per project)
Projects  7–9:   CL-036 to CL-051   (~5 per project)
Projects 10–12:  CL-052 to CL-071   (~7 per project)
Projects 13–17:  CL-072 to CL-077   (~1 per project)

The yield dropped from roughly 7 new rules per project to roughly 1. We're cautious about reading too much into this — it could mean we're approaching the boundary of what our current project complexity can reveal, rather than the boundary of what rules exist. More complex projects might expose entirely new failure categories.

But within the FastAPI + SQLAlchemy + CRUD scope, the flattening is visible in this dataset. The most notable new failure in this stretch was an interaction effect between existing rules — FC-074 — rather than an entirely novel pattern.

The Interaction Effect Problem

FC-074 taught us something we hadn't articulated before: as the rule set grows, the opportunity for interaction effects between rules increases. Each rule is validated independently, but the system runs them all simultaneously.

This resembles a familiar problem in complex systems: the space of pairwise interactions grows faster than the number of components. We can't test all combinations manually.
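
For a sense of scale, the number of unordered rule pairs is n(n-1)/2. The rule counts below are the ones mentioned in this post: the 20-CL bundle ceiling, the current 77 rules, and the hypothetical 200-rule system.

```python
from math import comb

for n_rules in (20, 77, 200):
    print(n_rules, "rules ->", comb(n_rules, 2), "possible pairs")
```

That gives 190, 2926, and 19900 pairs. These are upper bounds, since most pairs never touch the same part of the loop, but they show why pair-by-pair manual review stops being practical.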

We don't have a systematic solution for this yet. What we have is a detection mechanism: when a failure occurs that doesn't match any existing FC pattern, we now check whether it could be an interaction between two rules that had both worked in isolation in prior runs. FC-074 was caught this way.

Whether this can be automated — detecting interaction effects without human diagnosis — is an open question. The engine could potentially track which CLs were active when a novel failure occurs and flag the pairwise candidates, but we haven't built that yet.
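
One possible shape for that flagging step, sketched under a strong assumption: each active rule carries metadata about which artifacts it reads or changes, and only pairs sharing an artifact are worth a human look. The metadata and artifact names are hypothetical; nothing like this exists in the engine yet.

```python
from itertools import combinations

# Hypothetical metadata: which artifacts each active rule reads or changes.
ACTIVE_RULES = {
    "CL-066": {"target_file"},       # setup writes a PK-only stub
    "FC-060": {"target_file"},       # cleanup deletes the target before RED
    "CL-072": {"task_description"},  # required-imports block
}


def interaction_candidates(rules):
    """Return sorted rule pairs that touch at least one common artifact."""
    return sorted(
        (a, b)
        for a, b in combinations(sorted(rules), 2)
        if rules[a] & rules[b]
    )


print(interaction_candidates(ACTIVE_RULES))
```

With this metadata the only flagged pair is CL-066 and FC-060, the pair behind FC-074. The filter narrows the search; deciding whether a flagged pair actually conflicts still takes a diagnosis.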

What Comes Next

Graduating from the FastAPI stack opens a question: what do we do with a graduated stack?

We see two directions, each answering a different question:

Direction A: Complexity escalation. Stay on FastAPI but increase project complexity — many-to-many relationships, authentication flows, nested resources, pagination. This tests whether the current 77 rules hold at higher complexity or whether new failure categories emerge.

Direction B: Stack transfer. Move to a different framework and measure how many of the 77 rules transfer. Our rules are categorized by stack tags — 29 are marked "universal," 32 are "fastapi"-specific. A new stack would test whether the universal rules actually are universal.

The question we're most interested in now isn't whether we can achieve another 100% run. It's whether a rule-based agent system can keep growing without becoming harder to reason about than the model it was designed to constrain.

Series Links

If you're building something similar — local AI agents, TDD automation, failure catalog systems — I'd be interested to hear whether you're seeing interaction effects between your own accumulated rules. The comments are open.

The File Modification Boundary We Found After 12 ForgeFlow Projects

Joseph Yeo — Fri, 22 May 2026 15:00:08 +0000

This is Part 7 of the ForgeFlow series. Part 6: The Bug Wasn't in the Model ended at 9 projects, 51 failure patterns, and 70 design rules. Up until that point, failure rates in our setup were declining and the working framework felt like it was converging. Project 12 exposed a structural gap we hadn't yet documented.

Quick terms for new readers:

FC = Failure Catalog entry (a documented failure pattern)

CL = Crystallized Lesson (a testable design rule derived from repeated failures)

Identical GREEN = the model returns an unchanged file during the implementation phase

DEADLOCK = the system gives up after repeated identical failures

Part 6 ended on a high note. Nine projects. A 100% pass rate on the last one. Forty-three crystallized lessons. A working framework in our setup: DCR × Information Quality × Task Complexity. The system felt like it was converging.

Then we tried self-referential foreign keys, and a failure mode we'd only seen sporadically became the dominant pattern.

This post is about project 12 — a department hierarchy API with JWT authentication and self-referential parent-child relationships. It documents the failure pattern that connected several scattered observations into a single engineering constraint. And it discusses why, in our case, the most practical response was to restructure the work rather than retry harder.

The Setup: Department API

Project 12 was designed to test two development vectors simultaneously: JWT authentication (new for ForgeFlow) and self-referential foreign keys (a department can be a child of another department). The tech stack was familiar — FastAPI, SQLAlchemy async, pytest — but the data model was more complex than our previous test projects.

The target execution plan: 13 tasks total. Of these, 4 were new-file creation tasks (schemas, tests), 5 were existing-file modification tasks (models, routes), and 4 were either setup steps or handled outside the autonomous loop.

We ran it five times, redesigning between runs. The pattern became hard to ignore.

The Scorecard

The table below shows the task categories from Project 12. The same outcome repeated across five redesign-and-rerun attempts.

Task Type	Count	Pass Rate	Avg Cycles
New file creation (schemas, tests)	4	100%	1.0
Existing file modification (models, routes)	5	0%	DEADLOCK

In our setup, tasks requiring the generation of an entirely new file succeeded on the first attempt. Tasks that required modifying an existing codebase file resulted in a processing deadlock. This held across five separate runs, two different backends (direct Ollama API and Aider), and multiple retry strategies.

To scope these findings: our dataset is constrained to a single model family (Qwen3-Coder-Next, 45GB Q4_K_M) running on a single hardware tier (Apple Silicon M5 Max 128GB). We don't claim these trends apply universally. But the pattern was consistent enough across five runs that we changed how we structure tasks going forward.

What "Identical GREEN" Looks Like

ForgeFlow's TDD loop works in two phases: RED (write a failing test) and GREEN (write code to pass it). The GREEN phase is where modifications happen.

When a task required modifying an existing file, the following loop repeated:

The model receives the existing file content + test requirements
The model outputs code that matches the existing file exactly (detected via SHA-256 hash comparison)
The engine retries with an explicit prompt: "Your output was identical to the current file"
The model outputs the same file again
DEADLOCK after 3 identical cycles

We call this an identical GREEN deadlock. The engine already had detection for it (FC-037, added months ago). But we'd only seen it sporadically before. In project 12, it became the primary failure mode.
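
The detection loop above can be sketched in a few lines. The SHA-256 comparison and the three-cycle limit come from the description; the function names and the generate callback are hypothetical stand-ins for the engine and the model call.

```python
import hashlib

MAX_IDENTICAL_CYCLES = 3  # from the post: DEADLOCK after 3 identical cycles


def sha256_of(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def run_green(existing, generate, max_identical=MAX_IDENTICAL_CYCLES):
    """Call generate(attempt) until the output differs from the existing file.

    Returns ("changed", new_content), or ("DEADLOCK", None) once every
    allowed attempt has reproduced the existing file byte for byte.
    """
    baseline = sha256_of(existing)
    for attempt in range(1, max_identical + 1):
        output = generate(attempt)
        if sha256_of(output) != baseline:
            return "changed", output
    return "DEADLOCK", None


existing_file = "def list_departments():\n    return []\n"


def stuck_model(attempt):
    # Stand-in for a model that echoes the current file on every attempt.
    return existing_file


print(run_green(existing_file, stuck_model))
```

A hash comparison only detects exact reproduction. An output that differs by whitespace would count as "changed" here, so a real check would likely need the test run as the second gate.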

Working Hypotheses

We're cautious about attributing "understanding" to the model — we're observing output patterns, not internal reasoning. Here's what we think might be happening:

The whole-file generation pattern (Ollama backend): When generating code via raw completion, the model streams the entire file from the first token. If the existing file is 95% correct and only needs a few lines added, the token history in the context window acts as a statistical attractor — the generation pattern defaults to reproducing the verified, working code rather than deviating to introduce new logic. The smaller the required change relative to the existing file, the stronger this pull appears to be.

The diff generation constraint (Aider backend): Diffs require precise line-matching tokens. When the target file is complex — multiple async routes, mixed dependencies, dense imports — generating accurate unified diff chunks appears to become erratic for our local quantized model. In our tests with this specific model and configuration, this manifested as timeouts (capped at 200 seconds per task) or a fallback to emitting an unchanged version of the source file.

Both pathways showed similar limitations on file modification tasks in our configuration. Whether this is specific to quantized local models or a broader pattern, we can't say.

Connecting Scattered Observations

Before project 12, our tracker had three separate failure patterns that each captured a piece of this:

FC-034 / CL-043: "One task, one endpoint" — adding endpoints to an existing route file often resulted in syntax errors or duplicates
FC-047 / CL-066: "Over-complete stubs" — when a stub had significant boilerplate, the model treated it as finished
FC-039 / CL-058: "POST endpoints need Aider" — some tasks specifically failed on the Ollama backend

Project 12 gave us the data to connect these into a single classification, FC-052:

In our local execution setup, existing file modification tasks demonstrate a high probability of identical GREEN DEADLOCK on both whole-file and diff-based backends. In our observations, identical-output failures appeared more often when the required change was small relative to the existing file.

From FC-052, we derived CL-071:

Every autonomous task target should be a new file. If a workflow step must modify an existing file, that modification should either be handled programmatically during setup or the architecture should be decoupled so that features reside in isolated modules.

This became our 71st crystallized lesson, and it changed how we now structure ForgeFlow projects.

One notable data point: across three complete projects (10, 11, and 12), our failure catalog expanded by only a single new entry. The rule accumulation curve is flattening, which may suggest we're mapping the boundary of our current configuration — or just the boundary of our current project complexity.

The Design Pattern That Emerged

CL-071 pushed us to rethink how we write PRDs.

Before (task-level modifications):

TASK-001: Create User model (stub)        → models/user.py
TASK-002: Add fields to User model        → models/user.py    [DEADLOCK]
TASK-003: Create Department model (stub)  → models/department.py
TASK-004: Add relationship                → models/department.py [DEADLOCK]

After (decoupled new-file generation):

SETUP SCRIPT: Generate complete models with all fields and relationships
TASK-001: Create User schemas    → schemas/user.py      [NEW FILE ✅]
TASK-002: Create Dept schemas    → schemas/department.py [NEW FILE ✅]
TASK-003: Create register route  → routes/auth.py        [NEW FILE ✅]

The pattern: infrastructure is established deterministically during setup, while the model handles clean-sheet file generation.

An important caveat: applying this pattern to project 12 was not a clean autonomous success. We manually implemented the CRUD endpoints (6 routes) to unblock the dependency chain, then tested whether the remaining new-file task would run cleanly under the revised structure. The integration test — creating a fresh test_integration.py — passed on its first autonomous cycle. The important result was narrower than "we solved it": once existing-file modification was removed from the autonomous task path, the remaining new-file task completed cleanly.

We should also note an open concern: forcing every task into a "new file only" pattern shifts complexity from generation-time editing to project-level file organization. At 13 tasks, this is manageable. At 50+, it could create significant file fragmentation and import overhead. We haven't tested at that scale yet.

Where We Are After 12 Projects

Metric	Value
Total projects	12 (11 completed, 1 scrapped)
Failure patterns cataloged (FC)	52
Design rules (CL)	71
Automated rule checks	53 functions in validate_prd.py
Sessions	81

The Honest Assessment

After 12 projects and 81 sessions:

What's working in our setup:

New file generation from detailed specs: reliable across the runs we tested
TDD enforcement (RED must fail, GREEN must pass): useful as a mechanical guardrail
Failure pattern → design rule pipeline: producing diminishing but real returns
Setup-based infrastructure + model-based creation: tested over 3 projects

What isn't working:

Existing file modification: consistently unreliable with our current model and configuration
Non-deterministic results on complex tasks: one task passed in 2 out of 3 runs, failed in 1. Same code, same model, different outcome.
Long dependency chains: a single DEADLOCK blocks everything downstream

Open questions:

Does CL-071 hold on 20+ task projects with complex dependency graphs?
Does the "new file only" constraint create unsustainable file fragmentation at scale?
Will newer local models (Qwen3-Coder v2, Llama 4) shift this boundary?
Is this specific to quantized local models, or do cloud API models show similar patterns on file modification tasks inside TDD loops?

A Request to Readers

If you're running local models — Ollama, llama.cpp, vLLM, or something else — within autonomous execution loops, we'd be interested in learning whether your telemetry shows similar variations between file creation and file modification tasks.

Specifically: how do your local configurations handle incremental diff generation inside structured loops versus generating complete, fresh modules from detailed specs? If you've logged similar boundaries or found alternative designs to work around modification deadlocks, please share your setup and observations in the comments.

We're also curious whether anyone has hard metrics on how cloud models (GPT, Claude) perform on targeted file modifications inside closed-loop TDD environments. Our dataset is one model family on one hardware tier — more data points from different setups would help everyone working in this space.

What's Next

Project 13 will be the first real test of whether CL-071 is a design principle or just a project-12-specific workaround. Every implementation task will target a new file. Setup will handle all infrastructure. The open question isn't whether it passes — it's whether the "new file only" constraint produces a project structure that's actually maintainable at 20+ tasks.

We're also adding automatic CL-071 validation to validate_prd.py — a check that flags any task whose implementation target already exists at execution time. For our workflow, rules that repeatedly affect outcomes should probably be machine-enforced.
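
A minimal sketch of such a check, assuming tasks are available as (task ID, relative target path) pairs. The function name and the sample tasks are hypothetical; this is not the check as it exists in validate_prd.py.

```python
import tempfile
from pathlib import Path


def existing_targets(tasks, project_root):
    """Return (task_id, target) pairs whose target file already exists."""
    root = Path(project_root)
    return [(tid, target) for tid, target in tasks if (root / target).exists()]


with tempfile.TemporaryDirectory() as root:
    # Setup created the model deterministically; the schema file does not exist yet.
    model = Path(root) / "models" / "user.py"
    model.parent.mkdir(parents=True)
    model.write_text("class User: ...\n")

    tasks = [
        ("TASK-001", "schemas/user.py"),
        ("TASK-002", "models/user.py"),  # would modify an existing file
    ]
    violations = existing_targets(tasks, root)

print(violations)
```

Because the check looks at the file system at execution time, it also catches targets created by an earlier task in the same run, not only files written by setup.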

The Series So Far

I Built a Local AI Coding Agent on M5 Max 128GB — 164 failures, 35 tests, proof of concept
We Didn't Migrate from n8n to Python Because n8n Failed — The orchestrator rewrite
The Determinism War — Why we stopped chasing better models
The Information Design Gap — Why the agent was coding blind
DCR Wasn't Enough — Adding information quality to the framework
The Bug Wasn't in the Model — Lessons from 9 projects
The File Modification Boundary — You are here. 12 projects, a boundary mapped.

About

I'm Joseph YEO, a solo builder from Seoul, Korea. ForgeFlow runs entirely on a MacBook Pro M5 Max 128GB — no cloud APIs during execution. The planning agent (Claude) designs the specs. The local model (Qwen3-Coder-Next, 45GB Q4_K_M) executes the TDD loop autonomously.

Built over 81 sessions, May 2026. All models run locally via Ollama 0.23.0 on macOS. No cloud APIs were used during autonomous execution.

This post was drafted with Claude and edited by me.

The Bug Wasn't in the Model: Lessons from 9 Local AI Coding Agent Projects

Joseph Yeo — Sun, 17 May 2026 14:11:48 +0000

This is Part 6 of the ForgeFlow series. Part 5: DCR Wasn't Enough introduced the two-axis model: DCR × Information Quality. We ended Part 5 at three projects and a 29% pass rate. Here's how we reached 100% on a controlled project — and why we needed a third axis.

Part 5 ended with a framework and a question.

The framework said: System Reliability ≈ DCR × Information Quality. The question was whether that would actually hold up as we kept running projects.

We ran six more. Same model. Same hardware. No cloud APIs during execution. By project nine, the autonomous pass rate hit 100% on that specific project — eight tasks, thirty-one tests, four minutes, zero manual intervention.

This post is about the path from 29% to 100%, and the third variable we didn't expect to find.

The Scoreboard

Here's the full longitudinal data. Nine projects, same 45GB local model, same hardware throughout:

#	Project	Pass Rate	CL Rules	Key Change
1	repo-jwt	0%	0	No design rules existed
2	todo-api	67%	~10	Context files added
3	bookmark-api	100%	~20	Full information pipeline
4	expense-tracker	70%	32	New failure patterns emerged
5	rating-api	73%	32	DB fixture issues
6	library-api	0% → scrapped	35	Fundamental architecture gap*
7	event-api	80%	35	Setup script pattern validated
8	habit-tracker	44%	39	Route tasks collapsed
9	contact-book	100%	43	All axes aligned

The 100% figure refers to Project 9 only, not the aggregate across all nine projects. We used it as a controlled checkpoint: after fixing the route-task failure pattern from Project 8, could the same local model complete a comparable route-heavy project without intervention?

*Project 6 was scrapped because it required an architectural paradigm change (multi-model foreign-key setup scripts) that our orchestrator couldn't state-track at the time. Rather than polluting the loop data with a mismatched setup, we halted execution to redesign our baseline infrastructure scripts. The lessons from that failure directly produced CL-036 through CL-039.

The trajectory isn't a clean line upward. Project 3 hit 100%, then projects 4 through 8 dropped back. Each drop exposed a new category of failure that our rules didn't cover yet.

This is the pattern that mattered most to us: in these nine bounded projects, every failure we investigated had a concrete system-level fix. We did not find a case where replacing the model was the only plausible remedy.

The Crystallization Loop

Each project failure produced what we call a "Crystallized Lesson" (CL) — a concrete, testable rule that prevents that specific failure from recurring. Not a vague principle. A rule precise enough that code could check it.

Examples:

CL-005: Infrastructure files (conftest.py, database.py) must never appear in a task's target files. Origin: Project 3, where the model kept overwriting shared fixtures.
CL-034: DateTime fields with SQLAlchemy's default= must be set in Python's __init__, not relied upon at DB insert time. Origin: Project 5, where unit tests failed because created_at was None before flush.
CL-043: When adding an endpoint to an existing route file, each task must contain exactly one endpoint. Origin: Project 8, where multi-endpoint tasks caused the model to time out trying to understand the existing code.

By project 9, we had 43 of these rules. They're not guidelines — they're checkable constraints on the PRD document that feeds the model. We call the document that holds them the PRD Design Checklist.

Here's what the accumulation looked like:

Projects 1-3:  CL-001 to CL-020  (~7 per project)
Projects 4-6:  CL-021 to CL-035  (~5 per project)
Projects 7-8:  CL-036 to CL-043  (~4 per project)
Project 9:     0 new CLs needed

The rate of new rules slowed, but the depth increased. Early rules were about file placement ("where does conftest.py go?"). Later rules were about engine-level behavior ("how does the correction system handle idempotency?").

What Project 8 Broke

Project 8 (habit-tracker-api) is worth examining because it's where the two-axis model from Part 5 stopped being sufficient.

The project had nine tasks. The first four — model and schema creation — passed autonomously in one cycle each. Then the route tasks (5 through 9) collapsed. Zero of five passed.

The failures fell into four categories:

A pytest configuration warning was being captured as a failure signature. The code was correct, but the orchestrator classified it as broken.
A string-replacement correction was applied twice. client.post( → await client.post( was also applied to lines that already had await, producing await await client.post( — a syntax error.
A schema class was never generated because no test existed for it. The model only builds what it's tested for. No test, no code.
Tasks that modified existing files timed out because the model needed too long to understand the accumulated code.

Notice: none of these are IQ problems in the Part 4/5 sense. The model had all the information it needed. The PRD was well-designed by the standards we had at the time. The failures came from the engine itself — the orchestrator's correction logic, its gate system, its timeout handling.

The Third Variable: Engine Quality

This forced us to extend the two-axis model:

System Reliability ≈ DCR × Information Quality × Engine Quality

I don't mean this as a measured mathematical product yet, but as a diagnostic model: if any axis collapses, the whole loop collapses.

Or to put it less formally: you can have a perfectly designed PRD and a well-informed model, and still fail because the orchestrator has bugs.

By engine quality, we mean whether the orchestrator preserves the intended semantics of the execution loop: phase isolation (RED writes only tests, GREEN writes only implementation), retry correctness (rollbacks don't destroy infrastructure state), deterministic correction safety (rewrites don't corrupt already-correct code), timeout policy, and commit boundaries.

In our case, the concrete fixes were:

The correction engine's idempotency. Our string-replacement system applied corrections blindly, so a line that already read await client.post( became await await client.post(. The fix was a line-level guard: if the replacement text already exists on a line, skip it. (We're aware this is a limitation of primitive string matching. A proper AST-based mutation engine using something like LibCST would eliminate this entire class of errors. That's on our roadmap but hasn't been necessary yet at our current project complexity.)
The RED phase scope. If the model outputs an implementation file during the test-writing phase and the orchestrator writes it to disk, the test passes immediately — and the TDD cycle breaks. The fix was restricting the RED phase's file-write scope to test files only.
Router registration resilience. If git reset --hard during a retry also reverts infrastructure changes made by the orchestrator's auto-registration system, the next cycle starts with a broken setup. The fix was committing router registration during the initial setup script, not inside the TDD cycle.
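The line-level guard from the first fix can be sketched in a few lines. This is a minimal illustration of the idea as described, not the code in forgeflow.py; the function name `apply_correction` is hypothetical.

```python
def apply_correction(source: str, old: str, new: str) -> str:
    """Replace `old` with `new` line by line, skipping lines already corrected."""
    corrected = []
    for line in source.splitlines(keepends=True):
        if new in line:
            # Guard: the replacement text is already on this line, so leave it alone.
            corrected.append(line)
        else:
            corrected.append(line.replace(old, new))
    return "".join(corrected)


test_code = (
    "r1 = client.post('/a')\n"
    "r2 = await client.post('/b')\n"
)
once = apply_correction(test_code, "client.post(", "await client.post(")
twice = apply_correction(once, "client.post(", "await client.post(")
```

Running the correction a second time leaves the text unchanged, which is the property the original rule lacked. The guard is deliberately coarse: a line that mixes one corrected call with one uncorrected call is skipped entirely, which is the kind of gap an AST-based approach would close.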

We fixed these with three targeted engine patches, each with its own test suite (16 tests total). After the fixes, project 9 ran eight tasks with zero failures.

The key insight for us: PRD quality and engine quality appeared to be independent variables. Improving one didn't fix the other. Project 8's 44% pass rate wasn't a PRD problem — it was an engine problem that looked like a PRD problem until we traced each failure to its root cause.

What 100% Actually Looked Like

Project 9 was a contact book API with search. Single model (no foreign keys), six CRUD endpoints plus a search-by-query-param feature. We chose it deliberately to test route-task decomposition — the exact pattern that failed in project 8.

The numbers:

Metric	Value
Tasks	8 (model, schemas, 6 endpoints)
Total cycles	8 (every task passed first try)
Total tokens	9,042 (Ollama-reported generated tokens; prompts excluded)
Total time	~4 minutes
Tests generated	31
Manual intervention	0
Cloud API cost	$0

Each task followed the same loop: Ollama writes a failing test → Ollama writes minimum implementation → deterministic corrections applied → pytest runs → commit if green.

The route tasks that had failed repeatedly in project 8 now passed in single cycles. The differences:

Each task added exactly one endpoint (CL-043)
All schema classes had test coverage (CL-042)
The asyncio configuration was pre-set in the setup script (CL-040)
Trailing-slash corrections were applied deterministically (new)
Router registration was committed during setup, not during the TDD cycle (new)

None of these changes required a better model. The model was the same 45GB Qwen3 that produced 0% on project 1.

Why 100% Doesn't Mean "Solved"

I want to be careful here.

100% on a contact book API doesn't mean ForgeFlow can build anything. A contact book API is an architectural sandbox. The project was deliberately chosen to isolate the route-task failure pattern. It had no foreign keys, no authentication, no file uploads. Each endpoint was independent. The success here suggests that our execution loop is stable under these constrained conditions, not that it can refactor a legacy microservice architecture.

The real test is whether the next project — something with two related models and foreign-key relationships — maintains a high pass rate. We don't know yet.

What we do know:

The crystallization loop works. Each failure produces a rule. Rules accumulate. The same failure hasn't recurred in subsequent comparable projects.
Engine fixes matter as much as PRD fixes. Three engine patches in one session unblocked a project that no amount of PRD improvement would have fixed.
The three-variable model explains our data better than two. Projects 4-8 had good PRDs but engine bugs. The two-axis model couldn't explain those drops. The three-variable model can.

The Failure Catalog

Across nine projects, we cataloged 19 distinct failure patterns. Every one was eventually addressed through a PRD design rule, an engine fix, a setup script change, or a timeout configuration change.

Category	Count	Resolution
PRD design gap	10	CL rules in checklist
Engine bug	5	forgeflow.py patches + tests
Infrastructure/setup	3	Setup script standardization
Timeout/performance	1	aider_timeout configuration

A few examples from the catalog:

FC-015: Non-idempotent correction rule. Symptom: deterministic correction produced await await client.post(...). Root cause: the string-replacement rule did not check whether the line was already corrected. Fix: line-level idempotency guard — if the replacement text already exists on a given line, skip the replacement.
FC-018: RED phase implementation leakage. Symptom: tests passed immediately during RED because the implementation file was also written to disk. Root cause: the orchestrator's file-write scope included both test and implementation files during the test-writing phase. Fix: restrict RED phase scope to test files only; reject or quarantine non-test files during RED.
FC-019a: Router registration lost across retry. Symptom: every retry started from a broken app state (404 on all endpoints) after git reset --hard. Root cause: the orchestrator's auto-registration system added router imports to main.py during the TDD cycle, but git reset reverted those changes. Fix: commit router registration during the initial setup script, before the TDD loop begins.

Under our postmortem classification criteria, none of the 19 cataloged failures were classified as pure model-capability failures — cases where the model lacked the syntax or logic ability to solve the task. Every failure traced back to something in the system around the model: missing information, incorrect scaffolding, or engine bugs.

This doesn't mean model capability doesn't matter. A stronger model would probably tolerate worse PRDs and buggier engines. But in our limited experience, fixing the system was always cheaper and more permanent than hoping for a smarter model.

What Comes Next

We're designing a diagnostic pipeline that applies the failure catalog automatically. The idea: when a task deadlocks, the engine checks the failure catalog for a matching pattern before giving up.

[DEADLOCK DETECTED]
        │
        ▼
[Pattern Match: Failure Catalog]
        ├──► Match Found ──► Apply Fix (Deterministic) ──► Retry
        └──► No Match   ──► Local LLM Diagnosis (Stage 2)
                                   └──► Fails ──► Human Escalation (Stage 3)

Stage 1 is pure pattern matching — deterministic, no LLM needed. Stage 2 would use the local model to diagnose novel failures. Stage 3 remains human review.
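Since this pipeline is still being designed, the following is only a sketch of the control flow in the diagram. The regex signatures, fix names, and the `llm_diagnose` callback are all assumptions made for illustration; only the failure IDs and their symptoms come from the catalog examples above.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class CatalogEntry:
    failure_id: str      # catalog ID, e.g. "FC-015"
    pattern: re.Pattern  # signature to search for in the failure output
    fix_name: str        # hypothetical name of a deterministic fix


# Hypothetical signatures for two cataloged failures.
CATALOG = [
    CatalogEntry("FC-015", re.compile(r"await await "), "apply_idempotency_guard"),
    CatalogEntry("FC-019a", re.compile(r"\b404\b"), "restore_router_registration"),
]


def match_catalog(failure_output: str) -> Optional[CatalogEntry]:
    """Stage 1: deterministic lookup, no model involved."""
    for entry in CATALOG:
        if entry.pattern.search(failure_output):
            return entry
    return None


def diagnose(failure_output: str, llm_diagnose: Callable[[str], Optional[str]]) -> str:
    """Walk the three stages in order and report which one handled the failure."""
    entry = match_catalog(failure_output)
    if entry is not None:
        return f"stage1:{entry.failure_id}:{entry.fix_name}"
    suggestion = llm_diagnose(failure_output)  # Stage 2: local model, may return None
    if suggestion is not None:
        return f"stage2:{suggestion}"
    return "stage3:human"  # Stage 3: escalate to human review
```

Passing the Stage 2 diagnoser in as a callback keeps Stage 1 testable without a model running.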

The goal isn't to eliminate human involvement entirely. It's to ensure that each human intervention produces a rule that prevents the same intervention next time. The system should get cheaper to operate with every project it runs.

The Thesis, Updated

Part 3: "The bottleneck is not model capability, but the verifiability of specifications."

Part 4: "Even after verifiability is constructed, the bottleneck shifts to information delivery."

Part 5: "An AI coding agent's reliability is a product of its deterministic coverage and its information quality."

Now, the working version after nine projects:

"In our experience, an AI coding agent's reliability is bounded by three independent variables: the determinism of its scaffolding, the quality of information it receives, and the correctness of its own engine. Improving any two without the third produced a system that failed in ways that looked like model limitations but weren't."

The practical diagnostic is now threefold: measure your deterministic coverage, inspect your information quality, and test the engine itself. Fix the axis that's actually broken. In our nine projects, that diagnostic kept pointing to the system, not the model.

Whether that pattern holds at higher complexity is something we're still finding out.

About

I'm Joseph YEO, building ForgeFlow from Seoul, Korea — a local AI coding agent that runs entirely on Apple Silicon, no cloud inference during execution.

What's your experience with orchestrator-level bugs masquerading as model limitations? Have you seen cases where the system around the model was the actual bottleneck? I'd love to compare notes.

Follow along:

Previous parts: Part 1: 164 Failures · Part 2: n8n to Python · Part 3: The Determinism War · Part 4: The Information Design Gap · Part 5: DCR Wasn't Enough

9 projects. 43 rules. 19 failure patterns. 48 development sessions. Same 45GB model throughout. All models run locally via Ollama 0.23.0 on Apple Silicon M5 Max 128GB. No cloud APIs were used during autonomous execution.

This post was drafted with Claude and edited by me.

DCR Wasn't Enough: Why AI Coding Agents Also Need Information Quality

Joseph Yeo — Thu, 14 May 2026 14:02:17 +0000

This is Part 5 of the ForgeFlow series. Part 3: The Determinism War introduced DCR. Part 4: The Information Design Gap showed how information delivery moved our pass rate from 0% to 67%.

We thought we had the answer.

In Part 4, we showed that fixing our information pipeline — zero code changes, just better PRD design — moved ForgeFlow's autonomous pass rate from 0% to 67%. We thought the lesson was clear: give the model enough context and it delivers.

Then we ran a third project. Same model. Same engine. Same careful PRD design. The pass rate dropped to 29%.

This post is about why that happened and what it taught us about measuring AI coding agents.

What Went Wrong on Project C

Project C was a bookmark API with many-to-many relationships — more complex than a simple CRUD, but not wildly different in structure. We applied everything we learned from Project B: explicit context files per task, detailed descriptions, proper test scenario format.

The PRD had sixteen tasks. In the first seven executed tasks, only two passed autonomously — 2 / 7, or 29%. The other five required manual intervention.

The failures weren't the same as Project A's. In Project A, the model hallucinated imports and invented fixtures because it couldn't see the project. In Project C, it could see the project — but it kept hitting runtime patterns our prompt pipeline had not exposed clearly enough:

Pydantic's HttpUrl adds a trailing slash that breaks equality checks
FastAPI's router prefix with "/" vs "" causes 307 redirects
SQLAlchemy async many-to-many relationships trigger MissingGreenlet errors
create_async_engine lives in sqlalchemy.ext.asyncio, not sqlalchemy

I would not classify these primarily as intelligence failures. They looked more like behavioral knowledge gaps — framework-specific quirks that context files alone didn't expose.

Why Context Files Weren't Enough

This forced us to confront a limitation in our Part 4 approach.

Context files are static. They're defined when you write the PRD — before any code exists. By TASK-007, the project has files that weren't there when the PRD was written. The model can't see them unless someone manually updates the list.

For example, TASK-007 needed to create a route that depended on models and schemas generated by TASK-003 and TASK-005. But those files didn't exist when the PRD was written. The context file list was correct at design time — and stale by execution time.
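One way to at least surface this staleness is to compare a task's declared context list against what upstream tasks have produced by the time it runs. The sketch below assumes a `depends_on` field and a `produced_by` record kept by the orchestrator; both are hypothetical, as are the file paths in the example.

```python
def stale_context_report(task: dict, produced_by: dict[str, list[str]]) -> list[str]:
    """Return files created by upstream tasks that the task's context list omits."""
    declared = set(task.get("context_files", []))
    missing = []
    for dependency in task.get("depends_on", []):
        for path in produced_by.get(dependency, []):
            if path not in declared:
                missing.append(path)
    return missing


task_007 = {
    "id": "TASK-007",
    "depends_on": ["TASK-003", "TASK-005"],
    "context_files": ["tests/conftest.py"],
}
produced = {
    "TASK-003": ["app/models/bookmark.py"],
    "TASK-005": ["app/schemas/bookmark.py"],
}
report = stale_context_report(task_007, produced)
```

A non-empty report does not fix the prompt by itself, but it turns a silent gap into something the orchestrator can log or act on before inference.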

And even when context files are current, they show the model what code exists. They don't teach the model how the runtime behaves. No amount of reading bookmark.py will tell you that bookmark.tags.append(tag) triggers a synchronous database call inside an async context.

We realized we needed two different kinds of information, not one:

Type	What it provides	Source	Example
Structural information	What files exist, what they export, how they relate	Context files, repo maps	"BookmarkCreate has a field `url: HttpUrl`"
Behavioral knowledge	How the runtime actually works, framework quirks, patterns to avoid	Accumulated experience, failure analysis	"HttpUrl adds a trailing slash — use `str(url).rstrip('/')` for comparison"

Project B's success came from fixing structural information. Project C's failures seemed to come largely from missing behavioral knowledge. Both feed into what the model needs, but they come from different places and accumulate differently.

Two Axes, Not One

This is what led us to think about AI coding agent performance along two axes instead of one.

After Part 3, we had DCR — the ratio of decisions handled deterministically. At 85%, the model's job was narrow: just write the code.

After Part 4, we had the Information Design Gap — the model needs enough context to do that narrow job.

Now, after three projects, we're working with a slightly more structured version:

Axis 1: DCR (Deterministic Coverage Ratio) — how much of the decision surface is handled without the model. This is the scaffolding.

Axis 2: Information Quality (IQ) — how well the model is equipped for the decisions it does handle. This is the fuel.

This is not a measured equation — just the mental model that best explains our runs so far:

System Reliability ≈ DCR × Information Quality

DCR narrows the blast radius. IQ determines how well the model performs inside it. In our data so far, you need both. Neither alone has been sufficient.

Three Dimensions of Information Quality

After analyzing our failures across three projects and reading through recent work on LLM-based code generation, we've found that "the model didn't get enough information" breaks down into at least three problems:

Dimension 1 — Availability: Does the information exist in the context window?

This was Project A's problem. The model received ~240 tokens of task-relevant content. The information existed on disk — the orchestrator just never loaded it.

Dillon & Varanasi (2026) observed something similar. They measured whether generated code follows team-level architectural decisions. When a decision was visible in the files the model received, compliance was near-perfect. When it existed only in documents the model never saw, compliance dropped to zero.

Dimension 2 — Selection: Is irrelevant information excluded?

More context isn't always better. Alonso et al. (2026) found that adding procedural TDD instructions increased regressions, while a targeted test map reduced them significantly. The practical lesson for us was simple: token budget is finite.

Hu et al. (2026) quantified this from the other direction. In their cross-file code generation benchmark, 62% of functions didn't need cross-file context at all. The skill is knowing which information to include.

Dimension 3 — Structure: Is the information formatted for the model to use?

This is the counterintuitive one. Information can be available and selected correctly but still fail because of how it's structured.

Hu et al.'s ablation showed this clearly. "Inlined" context (dependencies inserted at relevant code locations) versus "prepended" context (same information at the top of the prompt) — same information, different structure. Removing the inlining degraded performance to nearly the same level as removing the context entirely.

Chinthareddy (2026) found a similar pattern with code retrieval. On a set of architectural queries, a deterministic AST-derived knowledge graph scored 100% correctness while a vector-similarity approach on the same codebase scored 40% (on the Shopizer benchmark suite). The gap came from how relationships were structured, not what information was available.

How the Two Axes Interact

Here's why we think you need both:

Scenario	DCR	IQ	What happens
A	High	High	System has a chance to work. Deterministic decisions are correct, model has what it needs.
B	High	Low	Deterministic decisions are correct, but the model is flying blind in its narrow lane.
C	Low	High	Model generates good code, but the system mishandles it — wrong task order, broken gates, environment failures.
D	Low	Low	Failures become hard to diagnose. This may be what "the model isn't smart enough" often looks like.

Our Project A was Scenario B. DCR was 85% — the harness was solid. But Information Quality was ~15%. The model couldn't do its job because it couldn't see the project.

Project B was closer to Scenario A. Same DCR, but the delivered context rose to roughly 80%, at least on the availability dimension. The model had enough context to complete most of the tasks that fit the orchestration loop.

Project C showed us that IQ itself has layers. Structural availability was good (~80%), but behavioral knowledge was missing. The two-axis model held — DCR was fine, IQ was the bottleneck — but the nature of the IQ problem was different from Project A.
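To make the "product" intuition concrete, here is purely illustrative arithmetic using the rough figures quoted above for Projects A and B. It is not a prediction of pass rates; the function name is hypothetical and the inputs are the approximate estimates from this post.

```python
def reliability_estimate(dcr: float, information_quality: float) -> float:
    """Mental-model product: if either axis is near zero, so is the result."""
    return dcr * information_quality


# Approximate values from the text: DCR 85% for both, IQ ~15% vs ~80%.
project_a = reliability_estimate(0.85, 0.15)
project_b = reliability_estimate(0.85, 0.80)
print(f"Project A: {project_a:.2f}")
print(f"Project B: {project_b:.2f}")
```

The point of the multiplication is only that a strong harness cannot compensate for an empty prompt: holding DCR fixed, the estimate moves almost entirely with information quality.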

A Practical Diagnostic

If you're building or evaluating an AI coding agent, here's the check we now run on our own system:

Step 1: Measure your DCR. List every decision point in one execution cycle. For each one, ask: is this resolved by deterministic code, or does it depend on model output? Count the ratio. If it's below 50%, the scaffolding likely needs reinforcement before the model can succeed.

Step 2: Dump the prompt. Not the template — the actual string that reaches the model at inference time. Read it as if you're a developer seeing this codebase for the first time. Can you write the code from this prompt alone? If you can't, the model can't either.

Step 3: Diagnose by axis.

High DCR, low pass rate → Information Quality problem. Check: are context files loaded? Are descriptions specific enough? Are test assertions reaching the model?
Low DCR, inconsistent results → Structural problem. The model is making decisions that should be deterministic. Move those decisions into code.
Both seem fine, still failing → Might be a genuine model capability limit. Only after that does a model upgrade become the next reasonable hypothesis.
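The three steps can be sketched as a small helper. The decision names are placeholders (the real decision points are not listed here), `prompt_is_sufficient` stands for the human judgment made in Step 2, and the 50% threshold is the rule of thumb from Step 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    name: str
    deterministic: bool  # True if resolved by code, False if it depends on model output


def dcr(decisions: list[Decision]) -> float:
    """Deterministic Coverage Ratio: share of decisions handled without the model."""
    if not decisions:
        raise ValueError("no decision points listed")
    return sum(d.deterministic for d in decisions) / len(decisions)


def next_hypothesis(dcr_value: float, prompt_is_sufficient: bool,
                    low_dcr_threshold: float = 0.5) -> str:
    """Suggest which axis to inspect first when the pass rate is poor."""
    if dcr_value < low_dcr_threshold:
        return "structural: move model decisions into deterministic code"
    if not prompt_is_sufficient:
        return "information quality: fix what reaches the prompt"
    return "model capability: only now consider a model upgrade"


# Placeholder decision list with 11 of 13 points handled deterministically.
decisions = [Decision(f"decision-{i}", deterministic=(i < 11)) for i in range(13)]
coverage = dcr(decisions)
```

With 11 of 13 decisions deterministic, the ratio is about 0.85, so a poor pass rate points at the prompt before it points at the model.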

After Project A, we jumped straight to the last branch of Step 3. "Qwen3 can't handle JWT auth" was our first diagnosis, and it was premature. The bigger problem was that the information pipeline was effectively empty. Running Steps 1 and 2 first could have saved us weeks.

Related Work Pointing in a Similar Direction

I didn't set out to build a framework. I set out to figure out why Project A failed. But a consistent pattern kept showing up in recent work:

Alonso et al. (2026) — TDD procedure instructions hurt. Contextual test maps helped. Procedure without context was counterproductive.

Midolo et al. (2026) — Surveyed 50 developers. 14% independently reported "contextual information about other system components" as a missing factor.

Jalil et al. (2025) — Smaller models with TDD and code execution surpassed larger models without those supports.

Dillon & Varanasi (2026) — Decision compliance went from 46% to 95% by adding product context and structured specs. Cost per merge-ready task dropped 68%.

Hu et al. (2026) — Cross-file inlining improved exact match by a reported average of 29.73% on RepoExec across three backbone models. The result was model-independent.

Chinthareddy (2026) — Deterministic AST-derived code graphs achieved 100% correctness vs. 40% for vector-only retrieval on architectural queries (Shopizer suite). LLM-based graph extraction missed 31% of files entirely.

These studies don't prove our framework. But they point in a consistent direction, and our three internal runs are consistent with that direction.

What We're Not Claiming

I want to be precise about the boundaries here.

We're not claiming that model capability doesn't matter. It does — for the non-deterministic slice. A stronger model will generate better code from the same information.

We're not claiming these two metrics capture everything. Latency, cost, context window size, tool use ability — all matter. But in our limited experience, DCR and IQ have explained the largest share of variance in autonomous pass rates.

We're also not claiming this is proven. ForgeFlow is a sample size of one. We have three data points (0%, 67%, 29%) and they're consistent with the two-axis model, but three points don't make a proof.

If anyone has run similar experiments — different scaffolding levels, different context strategies, measured pass rates — I'd genuinely love to compare notes.

The Thesis, So Far

Part 3: "The bottleneck is not model capability, but the verifiability of specifications."

Part 4: "Even after verifiability is constructed, the bottleneck shifts to information delivery."

Now the version we're working with:

"An AI coding agent's reliability seems to be a product of its deterministic coverage and its information quality. Improving either without the other produces a system that is either structurally sound but informationally blind, or well-informed but structurally fragile."

Two axes. One product. Neither alone has been sufficient in our experience.

Measure your DCR. Dump your prompt. Fix the axis that's actually broken. Only after that does a model upgrade become the next reasonable hypothesis. That's the diagnostic that's worked for us. Whether it generalizes is something we're still finding out.

About

I'm Joseph YEO, building ForgeFlow from Seoul, Korea — a local AI coding agent that runs entirely on Apple Silicon, no cloud inference during execution.

This post synthesizes what I've learned from running three projects end-to-end and reading 40+ papers on LLM-based code generation. The two-axis model isn't a proven theory — it's the working diagnostic I use every time a cycle fails. I'm sharing it because it's been useful, and because I'm curious whether others are seeing the same patterns.

How are you handling the "stale context" problem as your agent modifies the codebase? Are you using repo maps, re-indexing on every task, or something else entirely? I'd love to hear what's working.

All models run locally via Ollama 0.23.0 on Apple Silicon M5 Max 128GB. No cloud APIs were used during autonomous execution.

This post was drafted with Claude and edited by me.

The Information Design Gap: Why Our AI Agent Was Coding Blind

Joseph Yeo — Wed, 13 May 2026 15:23:44 +0000

This is Part 4 of the ForgeFlow series. Part 3: The Determinism War introduced DCR (Deterministic Coverage Ratio) and why we stopped chasing better models.

In Part 3, I proposed a hypothesis:

"The bottleneck of LLM-driven software engineering is not model capability, but the verifiability of specifications."

Then I said: "We're building the system to test it."

We ran two projects. Same model. Same engine. Same orchestrator. The autonomous pass rate went from 0% to 67%.

In our case, the fix wasn't a better model. It was giving the model enough information to do its job.

Two Projects, One Model, One Engine

ForgeFlow is a TDD orchestrator that runs entirely locally. No cloud API calls during execution. The cycle is simple: generate test (RED) → generate implementation (GREEN) → run pytest → commit or retry.

We ran it against two internal projects:

	Project A: repo-jwt	Project B: todo-api
Domain	JWT authentication API	Todo CRUD API
Tasks	18	12
Model	Qwen3-Coder-Next Q4_K_M (45GB)	Same
Engine	forgeflow.py v2	Same
Autonomous passes	0 / 18 (0%)	8 / 12 (67%)
Manual intervention	18 tasks (100%)	4 tasks (33%)

Same model. Same engine. The pass rate changed from 0% to 67%.

What changed between Project A and Project B was not the model or the orchestrator. It was the information structure of the PRD — the spec document that tells the model what to build.

The Prompt Dump That Exposed the Problem

After Project A failed across all 18 tasks, we did something we should have done much earlier: we dumped the actual prompt the model received at inference time.

Not the prompt template. Not the system prompt spec. The literal string that arrived at the model's context window.

Here's what we expected to find: a rich prompt containing the task spec, relevant source files, test fixtures, data model definitions, and dependency context.

What we actually found was a prompt of about 720 tokens, and only ~240 of those were task-relevant project information. The rest was role text, formatting rules, and boilerplate.

No source code. No test fixtures. No existing implementation files. The model was being asked to generate code for a project it could barely see.

In hindsight, this was an embarrassing oversight. The information pipeline existed in the code — it just wasn't wired up. But we didn't notice until we read the raw prompt.

The Five Gaps

We listed out every piece of information the model should have received but didn't. Five gaps emerged:

Gap 1 — No context in RED phase. During test generation, the context parameter was hardcoded to an empty dictionary. The model wrote tests for modules it couldn't see, importing functions that didn't exist yet, guessing at fixture structures it had no way to know.

Gap 2 — No context file list in the PRD. The orchestrator had a function ready to read context files from a task-level field. But the PRD never defined that field. So the function returned an empty list every time.

Gap 3 — Module names without signatures. The prompt listed available modules by name: todo.py, database.py. But not their contents, not their function signatures, not their class fields. The model knew that modules existed, but not what they contained.

Gap 4 — Test assertions not forwarded. The PRD included test assertion fields with precise expected behavior. The prompt builder read a different field name. The assertions existed in the spec but never reached the model.

Gap 5 — No conftest.py. In a pytest project, conftest.py defines shared fixtures — test database sessions, HTTP clients, factory functions. The model never saw it. Every task that required a test client, the model invented its own from scratch, often incompatibly.
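Gaps 2 and 4 share a shape: the PRD and the prompt builder disagreed about field names, and nothing complained. A small audit like the one below would make that mismatch visible. The field names are hypothetical, since the post does not name the real ones.

```python
# Hypothetical set of task fields the prompt builder actually reads.
FIELDS_READ_BY_PROMPT_BUILDER = {"id", "description", "context_files", "test_scenarios"}


def audit_task_fields(task: dict) -> dict[str, list[str]]:
    """Compare the fields a PRD task defines with the fields the builder reads."""
    present = set(task)
    return {
        # Written in the PRD but silently ignored (the shape of Gap 4).
        "defined_but_never_read": sorted(present - FIELDS_READ_BY_PROMPT_BUILDER),
        # Expected by the builder but absent from the PRD (the shape of Gap 2).
        "read_but_not_defined": sorted(FIELDS_READ_BY_PROMPT_BUILDER - present),
    }


audit = audit_task_fields({
    "id": "TASK-001",
    "description": "Create the POST endpoint",
    "test_assertions": ["returns 201"],
})
```

Both lists being non-empty for the same task is a strong hint that information exists in the spec but never reaches the model.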

Quantifying the Gap

We measured how much relevant information actually reached the model by counting task-specific tokens:

Source	Task-relevant tokens	What it contains
What the model received	~240	Task ID, one-line description, module names
What a human developer would reference	~1,640	Above + conftest.py, database.py, models, schemas, existing routes

The model was operating on roughly 15% of the information a developer would use for the same task.
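The "roughly 15%" figure follows directly from the two approximate token counts in the table. The short calculation below also shows how little of the full prompt was task-relevant; the variable names are illustrative.

```python
received_tokens = 240      # task-relevant tokens the model actually got
reference_tokens = 1640    # what a human developer would reference
prompt_tokens = 720        # approximate total prompt size

coverage = received_tokens / reference_tokens
relevant_share = received_tokens / prompt_tokens
print(f"coverage of developer reference: {coverage:.1%}")
print(f"task-relevant share of prompt:   {relevant_share:.1%}")
```

So the model saw a bit under 15% of the reference material, and only about a third of its prompt carried project information at all.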

We started calling this the Information Design Gap: the difference between what a model could use and what the system actually delivers at inference time. Whether this framing is useful beyond our system is something we're still figuring out — but for us, it immediately clarified what to fix.

The Fix: No Code Changes

Here's the part that surprised us.

The orchestrator already had the machinery to deliver context files. A function to resolve which files a task needs — existed. A function to read those files from disk — existed. A function to format them into the prompt — existed. The prompt builder had a slot for context.

The pipeline existed. The PRD just wasn't feeding it.

For Project B (todo-api), we made three changes — all in the PRD, none in the engine:

1. Added a context file list to every task. Each task now listed exactly which existing files the model should see. Early tasks had empty lists. CRUD endpoint tasks included the test fixture file, the relevant model file, and the schema file.

Here's what a task spec looked like after the fix:

- id: TASK-007
  description: >
    Create POST /api/todos.
    Use the client fixture from conftest.py.
    Return 201 with TodoResponse schema.
  context_files:
    - tests/conftest.py
    - app/models/todo.py
    - app/schemas/todo.py

2. Made descriptions explicit. Instead of "Create the POST endpoint", we wrote "Create POST /api/todos. Use the client fixture from conftest.py. Return 201 with TodoResponse schema." One sentence, but it told the model where to look.

3. Unified test scenario format. Aligned PRD field names with what the prompt builder actually read, so test assertions reached the model.

Total lines of code changed in the engine: zero.
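For readers who want a picture of the machinery that was already in place, here is a minimal sketch of the three stages described above: resolve, read, and format. The function names and the prompt layout are assumptions for illustration, not the actual functions in forgeflow.py.

```python
from pathlib import Path


def resolve_context_files(task: dict) -> list[str]:
    """Return the files a task asks for; an absent field yields an empty list."""
    return list(task.get("context_files", []))


def read_context_files(project_root: Path, paths: list[str]) -> dict[str, str]:
    """Read each listed file, failing loudly instead of silently skipping."""
    contents = {}
    for relative_path in paths:
        file_path = project_root / relative_path
        if not file_path.is_file():
            raise FileNotFoundError(f"context file not found: {relative_path}")
        contents[relative_path] = file_path.read_text(encoding="utf-8")
    return contents


def format_context(contents: dict[str, str]) -> str:
    """Render the files as labeled sections for the prompt's context slot."""
    sections = [f"### {path}\n{text.rstrip()}" for path, text in contents.items()]
    return "\n\n".join(sections)
```

Note how the first function behaves when the PRD omits the field: it returns an empty list without error, so a pipeline like this can be fully wired and still deliver nothing.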

What 67% Actually Means

Eight of twelve tasks passed autonomously — the model generated code, tests passed, and ForgeFlow committed without human intervention for those tasks.

Based on manual inspection, I would not classify the four manual tasks as model-capability failures. They were structural mismatches with the TDD cycle:

Task	Why manual	Category
TASK-001	Infrastructure setup — test and implementation must be created together	RED phase incompatible
TASK-003	Fixture-only — conftest.py defines fixtures, nothing to "fail" in RED	No failing test possible
TASK-010	Validation already handled by Pydantic schema from an earlier task	RED unexpected pass
TASK-012	Integration test — no implementation file, test-only task	Engine assumes impl file exists

That said, I should be honest about what we can't fully measure here. "Model-capability failure" is hard to distinguish from "subtle information gap we didn't notice." Our classification is based on manual inspection, not a controlled experiment. What we can say with confidence is that the type of failure changed completely — from hallucinated imports and invented fixtures in Project A to structural mismatches in Project B.

The Lesson: Intelligence Gap vs. Information Gap

After Project A, our diagnosis was: "The model isn't smart enough. Qwen3 at Q4 quantization can't handle multi-file JWT authentication."

That diagnosis was wrong — or at least, premature.

The model appeared to have more usable capability than our system was exposing. In this run, the difference between 0% and 67% looked less like intelligence and more like context delivery.

This completely changed how we thought about local model limitations:

	Intelligence Gap	Information Gap
Symptom	Model generates plausible but wrong code	Model generates structurally incompatible code
Diagnosis	"Model too small / too quantized"	"Prompt missing critical context"
Fix	Upgrade model (expensive, diminishing returns)	Improve information design (free, compounding)
Testable?	Hard — model capability is a black box	Easy — dump the prompt, count what's missing

The Information Design Gap is testable. Dump the prompt. Read it as if you're a developer seeing this project for the first time. If you couldn't write the code from that prompt alone, the model can't either.

Similar Patterns in Recent Research

While writing this post, we surveyed recent research on TDD-based code generation and found similar patterns appearing independently. These don't prove our framework, but the convergence seemed worth noting.

Alonso et al. (2026) tested TDD prompting on SWE-bench Verified with a 30B local model. Adding procedural TDD instructions ("write tests first, then implement") increased regressions. Adding a graph-derived test map ("here are the specific tests at risk") reduced them significantly. Their conclusion: agents don't need to be told how to do TDD — they need to be told which tests to check.

We saw the same mechanism: telling the model what process to follow consumed context tokens that could carry actual project information.

Midolo et al. (2026) surveyed 50 developers about what makes code generation prompts succeed. Their top factors: algorithmic details (57%) and I/O format specification (44%). When asked what else was missing, 14% independently reported "contextual information about other components in the system" — which sounds a lot like the gap our per-task context file list was designed to close.

Jalil et al. (2025) showed that smaller models with TDD and a code interpreter could surpass larger models without those supports. The pattern held across model families: tests as structured context beat model scale.

Different benchmarks, different teams, different setups. They all point toward the same practical lesson: before blaming the model, it might be worth inspecting the information pipeline. Our data adds one more point in that direction.

Implications for DCR

In Part 3, I defined DCR as the ratio of deterministic decisions in an agent loop. A reader asked whether DCR should be tracked like test coverage — not just reviewed once at architecture time.

Running two projects gave us a partial answer: DCR alone wasn't enough.

ForgeFlow's DCR didn't change between Project A and Project B. It was 85% both times: the same 11 of 13 decisions were handled deterministically. Yet performance went from 0% to 67%.

What changed was the quality of information feeding the non-deterministic decisions. DCR tells you how narrow the model's role is. It doesn't tell you whether the model is equipped to play that role.

This is why we're now thinking about DCR in two layers:

Static DCR: how many decision points are designed to be deterministic. (Architecture metric.)
Observed DCR: how many decisions were actually resolved deterministically during real runs. (Runtime metric.)
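The two layers can be sketched as two ratios over different inputs. This is an illustration under my own assumptions, not ForgeFlow's implementation: `DecisionPoint`, `static_dcr`, and `observed_dcr` are hypothetical names, and the runtime log is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionPoint:
    name: str                      # hypothetical label for a decision in the loop
    deterministic_by_design: bool  # True if code, not the model, makes the call


def static_dcr(points: list[DecisionPoint]) -> float:
    """Share of decision points designed to be deterministic."""
    if not points:
        raise ValueError("no decision points defined")
    return sum(p.deterministic_by_design for p in points) / len(points)


def observed_dcr(resolved_deterministically: list[bool]) -> float:
    """Share of runtime decisions actually resolved without the model."""
    if not resolved_deterministically:
        raise ValueError("no runtime decisions logged")
    return sum(resolved_deterministically) / len(resolved_deterministically)


# 13 decision points, 11 deterministic, matching the figures in this post.
points = [DecisionPoint(f"decision-{i}", i < 11) for i in range(13)]
print(f"static DCR: {static_dcr(points):.0%}")

# Made-up runtime log for illustration: one entry per decision in a run.
run_log = [True, True, False, True, True, True, False, True]
print(f"observed DCR: {observed_dcr(run_log):.0%}")
```

The static number comes from the architecture and only changes when the design changes. The observed number needs the engine to log every decision, which is what makes it trackable like test coverage.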

And alongside both: Information Delivery Rate — how much of the available, relevant context actually reaches the model at inference time. Using task-relevant token delivery as a rough proxy, Project A was around 15%. Project B was much closer to the information a human developer would expect to see.
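One rough way to compute that proxy, assuming you already know which files are relevant to a task and which of them made it into the prompt. The function names and file contents below are hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
def count_tokens(text: str) -> int:
    # Whitespace split is a crude stand-in for the model's tokenizer.
    return len(text.split())


def information_delivery_rate(relevant: dict[str, str], delivered: set[str]) -> float:
    """Tokens of relevant context that reached the prompt, over all relevant tokens.

    relevant:  file name -> file text judged relevant to the task
    delivered: names of the files actually included in the prompt
    """
    total = sum(count_tokens(text) for text in relevant.values())
    if total == 0:
        return 1.0  # nothing was needed, so nothing is missing
    sent = sum(count_tokens(text) for name, text in relevant.items() if name in delivered)
    return sent / total


# Placeholder contents: 6, 4, and 10 tokens respectively.
relevant_files = {
    "tests/conftest.py": "tok " * 6,
    "app/schemas.py": "tok " * 4,
    "app/auth.py": "tok " * 10,
}

rate = information_delivery_rate(relevant_files, delivered={"app/schemas.py"})
print(f"information delivery rate: {rate:.0%}")
```

The hard part is not the arithmetic but deciding what counts as relevant, and that judgment is still manual in this sketch.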

We're still working out whether these are the right abstractions — but they've been useful for diagnosing our own failures so far.

What We're Building Next

The immediate roadmap based on these findings:

RED phase context delivery. The RED phase (test generation) was still sending an empty context when we ran these projects. We've since fixed this in the engine — the model now sees existing fixtures before writing new tests.

Automatic context inference. Right now, context files are manually specified per task in the PRD. The next step is deriving them from the dependency graph: if TASK-007 depends on TASK-005 and TASK-006, automatically include their implementation files as context. We're exploring tree-sitter-based approaches for this.
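The dependency-graph step could look roughly like this. It is a sketch using only the task graph, not tree-sitter, and the mapping names and file paths are hypothetical rather than taken from ForgeFlow's PRD format.

```python
def infer_context_files(
    task_id: str,
    depends_on: dict[str, list[str]],
    impl_files: dict[str, list[str]],
) -> list[str]:
    """Collect the implementation files of a task's direct dependencies.

    Order follows the dependency list, and duplicates are dropped.
    """
    files: list[str] = []
    for dep in depends_on.get(task_id, []):
        for path in impl_files.get(dep, []):
            if path not in files:
                files.append(path)
    return files


# Hypothetical task graph and file paths, for illustration only.
depends_on = {"TASK-007": ["TASK-005", "TASK-006"]}
impl_files = {
    "TASK-005": ["app/security.py"],
    "TASK-006": ["app/models.py", "app/security.py"],
}

print(infer_context_files("TASK-007", depends_on, impl_files))
```

This only follows direct dependencies. Whether to walk the graph transitively is an open choice, since every extra file spends context tokens.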

Structural mismatch detection. Four of twelve tasks didn't fit the RED-GREEN cycle. We want ForgeFlow to detect these patterns (infrastructure setup, fixture-only, test-only) during PRD validation and handle them with a separate path — not force them through TDD.
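A sketch of what that validation step might check, assuming each PRD task lists its test and implementation files. The `Task` fields, the `kind` value, and the file paths are hypothetical. The TASK-010 case, where RED passes unexpectedly, is not covered because it only shows up once the test is actually run.

```python
from dataclasses import dataclass, field
from enum import Enum


class ExecutionPath(Enum):
    TDD = "tdd"
    SETUP = "infrastructure-setup"
    FIXTURE_ONLY = "fixture-only"
    TEST_ONLY = "test-only"


@dataclass
class Task:
    task_id: str
    test_files: list[str] = field(default_factory=list)
    impl_files: list[str] = field(default_factory=list)
    kind: str = "feature"  # hypothetical PRD field; "setup" marks infrastructure work


def classify(task: Task) -> ExecutionPath:
    """Decide at PRD-validation time whether a task fits the RED-GREEN cycle."""
    if task.kind == "setup":
        return ExecutionPath.SETUP
    if task.impl_files and all(p.endswith("conftest.py") for p in task.impl_files):
        return ExecutionPath.FIXTURE_ONLY  # nothing can fail in RED
    if task.test_files and not task.impl_files:
        return ExecutionPath.TEST_ONLY  # no implementation file to generate
    return ExecutionPath.TDD


# Hypothetical task definitions, loosely modeled on the table of manual tasks.
tasks = [
    Task("TASK-001", kind="setup"),
    Task("TASK-003", impl_files=["tests/conftest.py"]),
    Task("TASK-012", test_files=["tests/test_integration.py"]),
    Task("TASK-007", test_files=["tests/test_auth.py"], impl_files=["app/auth.py"]),
]

for task in tasks:
    print(task.task_id, classify(task).value)
```

Anything that does not classify as `TDD` would be routed to a separate path instead of being forced through RED-GREEN.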

The Thesis, Updated

Part 3's thesis was about structure:

"The bottleneck is not model capability, but verifiability of specifications."

Two projects later, I'd extend it:

"In our runs, even after we built verifiability, the bottleneck seemed to shift to information delivery — whether the model receives enough context to use that verifiability."

DCR gave us the harness. Information design made that harness useful. Both seem to be required. Neither alone was sufficient in our experience.

Same model. Same engine. Zero code changes. 0% → 67%. In our case, the difference was information.

Several recent studies point in a similar direction, though from different setups. The practical suggestion I'd offer: if your AI coding agent is underperforming, it might be worth checking what it's receiving before swapping the model. That's what worked for us.

About

I'm Joseph YEO, a solo builder from Seoul, Korea. ForgeFlow is my experiment in pushing local AI agents toward more reliable autonomous execution — no cloud inference during execution, no hand-holding mid-cycle.

This post covers what happened when we actually ran the system from Part 3 against real projects and discovered the gap between having a verification harness and feeding the generator enough context. I'm sharing this because I wish someone had told me to dump the raw prompt before I spent weeks blaming the model.

If you've run into similar issues — or found different solutions — I'd love to hear about it in the comments.

Built over ~33 sessions, May 2026. All models run locally via Ollama 0.23.0 on Apple Silicon. No cloud APIs were used during autonomous execution.

This post was drafted with Claude and edited by me.