We Spent Six Sessions Fixing One Task. The Problem Was Six Tasks.

#agents #ai #devjournal #softwareengineering

This is Part 9 of the ForgeFlow series. Part 8: 77 Rules Later ended on a question we couldn't answer at the time: can a rule-based agent system keep growing without becoming harder to reason about than the model it was built to constrain? This post is about one small episode that pushed us toward an answer — not a clean one, but a useful one. It's also a record of getting a diagnosis wrong for several sessions in a row, and what finally corrected it.

Quick terms for new readers:

FC = Failure Catalog entry (a documented failure pattern)

CL = Crystallized Lesson (a testable design rule derived from repeated failures)

critical_rules = a block of rules injected into the model's prompt for a given project

nightrun = an overnight batch that re-runs projects so we can measure behavior over many runs

ForgeFlow = a fully local, TDD-based autonomous coding system running on Apple Silicon

The failure we kept "fixing"

For several sessions, we had a recurring failure in one project that we treated as a single, local problem.

The symptom: in async tests, the model would construct an HTTP test client directly instead of using the shared fixture we'd set up. That shared fixture is what installs the dependency override pointing the app at the test database. By hand-rolling its own client, the model skipped the fixture entirely — so the override was never applied, and the test hit an unconfigured database and failed with a missing-table error.

(The underlying trigger, for those on the same stack: httpx 0.28 removed the deprecated app= shortcut from Client/AsyncClient. The supported pattern is now transport=ASGITransport(app=app). Our conftest fixture already used the current ASGITransport approach and was correct — the app= argument on ASGITransport itself is still valid. The model just wasn't using the fixture; it kept hand-rolling the client the old, removed way. Why it preferred the old pattern is something we can only guess at — likely the weight of older examples — and we didn't try to prove it.)

Each time it surfaced, we did the natural thing. We looked at the task where it appeared, and we wrote a rule for that task. Next run, it would seem quieter there — and then show up somewhere else. We'd note it, half-suspect it was the same thing, and move on to whatever else the run surfaced.

What we were doing, in effect, was treating a recurring pattern as a series of unrelated incidents. We never asked the obvious question: how often does this actually happen, and where?

The measurement we should have taken earlier

The thing that broke the loop wasn't a better model or a smarter rule. It was a measurement we hadn't bothered to take.

We wrote a small script to walk back through our run history and count occurrences of the failing pattern across every recorded run — not by reconstructing it from summary tables, but by grepping the raw captured error text directly. (We'd learned in an earlier session that our summary-level deduplication could merge distinct failures under one signature, so for this we went to the raw text instead.) In practice that meant grepping the archived stderr for the exact failing construction across every run directory, rather than trusting the rolled-up signatures.

In the archived runs we checked, the count came back:

16 occurrences
across 6 distinct tasks
in 3 different projects
spread over 7 separate run sessions

This count made our earlier framing untenable. We had been writing single-task rules for something that was happening across six tasks in three projects. Each of those per-task fixes had been addressing, at most, one of the sixteen occurrences we eventually counted. The measurement didn't just refine our picture — for this failure pattern, it showed our framing had been wrong in kind, not just in degree.

It's a little uncomfortable to write that down. But that's the part worth keeping: the wrong number wasn't in the model's output. It was in our own estimate of how widespread the problem was, and we'd carried that estimate for several sessions without checking it.

The prescription got smaller, not bigger

Here's the part that surprised us most.

Once we saw "6 tasks, 3 projects," the instinct might be to write six fixes — one hardened rule per affected task. Given how we'd been responding up to that point, that's roughly the path we were on.

Instead, the data pointed the other way. If one pattern was appearing across many tasks and projects, the better place to address it wasn't any individual task — it was the project-wide rule block. We added one line to the critical_rules for the affected projects: a single instruction telling the model not to construct the test client directly, and to use the fixture instead (taking the rule block from 20 lines to 21).

One rule addressed a pattern that, on our prior trajectory, could easily have become six separate task-level patches. This was a small, concrete instance of something we keep seeing on this project: when we measure a problem's actual scope more carefully, the fix tends to get narrower, not wider. When you don't know the shape of a problem, you tend to over-prescribe locally and under-address globally. Measuring the shape let us do less.

We then ran a verification batch over the affected projects. The grep count for the pattern came back at zero across those runs. Two of the projects involved — a small gallery API and a small library API — met the per-project completion criteria on the verifying runs, so we marked them as "graduated" in a narrow, project-level sense.

Note on "graduated": Part 8 used this word for an entire stack. Here it means something much smaller — these individual projects met their completion criteria on the runs we executed. It is not a claim about the stack, and not a claim that these projects will never fail again. The underlying numbers were modest: pass rates in the roughly 38–85% range across individual runs. What changed wasn't a jump to near-perfect runs — it was that this particular structural failure stopped appearing. We're reporting a state we observed, not a guarantee.

What this didn't prove

A few limits, because the result is smaller than it might sound.

The zero is a zero on the runs we executed, for this specific pattern. It's evidence the rule is doing its job in the tested scope, not proof the pattern is gone for good. A different project, a different httpx version, or a different phrasing of the same task could surface it again.

The measurement approach itself has a known weakness we worked around rather than solved: it reads raw error text, which is reliable for exact-pattern counting but says nothing about why each occurrence happened. For this pattern — a single, well-understood cause — that was fine. For a fuzzier failure, raw-text counting would undercount or overcount, and we'd need something better.

And the broader idea ("measure before you prescribe") is a working heuristic from repeated experience on this project, not something we've established rigorously. We'd genuinely like to know where it breaks down for other people.

Beyond the coding loop

The same week, an unrelated problem came up — a structural issue in how some of our project directories were tracked in version control. An earlier note had flagged it as a likely large cleanup, the kind of thing you budget a careful session for.

Before touching anything, we applied the same lesson the measurement script had just taught us: measure first. A few read-only checks showed the actual scope was much smaller than the earlier flag assumed — most of what looked like a structural defect turned out to be a single missing ignore-rule and one stale index entry, and the fix was a handful of non-destructive steps rather than a restructuring. I'm including this not as a second result but as an observation: in both cases the failure mode was the same — acting on an estimate instead of a measurement — and in both cases, once we measured, the prescription shrank. Two instances isn't a trend, but it was enough to make "measure the scope before writing the fix" a step we now try to take on purpose rather than when we happen to remember.

Back to Part 8's question

Part 8 ended by asking whether a growing rule set could stay manageable. This episode is one data point toward a tentative answer: rule growth seems more controllable when the decision to add a rule is driven by measured scope rather than by where a failure last happened to appear. Six task-level rules would have grown the rule set faster and addressed the problem worse. One measured rule did more with less.

That's not a method, and it's certainly not a solution to the interaction-effect problem from Part 8. It's a habit we're trying to form: before crystallizing a new rule, spend the few minutes it takes to count how often and where the underlying failure actually occurs. Sometimes that collapses six patches into one rule. Other times — we assume, though we haven't hit this yet — it will reveal that what looked like one problem is really several, and we'll need more rules, not fewer. Either way, the rule follows the measurement.

Series Links

ForgeFlow runs on a MacBook Pro M5 Max 128GB. Planning uses Claude (cloud API). Execution is fully local — Qwen3-Coder-Next 45GB via Ollama, gemma4:26b for QA, Docker sandbox, no API calls during the coding loop. The methodology and failure data are shared in this series.

If you're running your own local agents or failure-catalog systems: have you caught yourself prescribing locally for a problem that turned out to be project-wide — and what finally made you measure it? Logs, test signatures, prompts, generated diffs? The comments are open.