Joseph Yeo
The Bug Wasn't in the Model: Lessons from 9 Local AI Coding Agent Projects

This is Part 6 of the ForgeFlow series. Part 5: DCR Wasn't Enough introduced the two-axis model: DCR × Information Quality. We ended Part 5 at three projects and a 29% pass rate. Here's how we reached 100% on a controlled project — and why we needed a third axis.


Part 5 ended with a framework and a question.

The framework said: System Reliability ≈ DCR × Information Quality. The question was whether that would actually hold up as we kept running projects.

We ran six more. Same model. Same hardware. No cloud APIs during execution. By project nine, the autonomous pass rate hit 100% on that specific project — eight tasks, thirty-one tests, four minutes, zero manual intervention.

This post is about the path from 29% to 100%, and the third variable we didn't expect to find.


The Scoreboard

Here's the full longitudinal data. Nine projects, same 45GB local model, same hardware throughout:

| # | Project | Pass Rate | CL Rules | Key Change |
|---|---------|-----------|----------|------------|
| 1 | repo-jwt | 0% | 0 | No design rules existed |
| 2 | todo-api | 67% | ~10 | Context files added |
| 3 | bookmark-api | 100% | ~20 | Full information pipeline |
| 4 | expense-tracker | 70% | 32 | New failure patterns emerged |
| 5 | rating-api | 73% | 32 | DB fixture issues |
| 6 | library-api | 0% → scrapped | 35 | Fundamental architecture gap* |
| 7 | event-api | 80% | 35 | Setup script pattern validated |
| 8 | habit-tracker | 44% | 39 | Route tasks collapsed |
| 9 | contact-book | 100% | 43 | All axes aligned |

The 100% figure refers to Project 9 only, not the aggregate across all nine projects. We used it as a controlled checkpoint: after fixing the route-task failure pattern from Project 8, could the same local model complete a comparable route-heavy project without intervention?

*Project 6 was scrapped because it required an architectural paradigm change (multi-model foreign-key setup scripts) that our orchestrator couldn't state-track at the time. Rather than polluting the loop data with a mismatched setup, we halted execution to redesign our baseline infrastructure scripts. The lessons from that failure directly produced CL-036 through CL-039.

The trajectory isn't a clean line upward. Project 3 hit 100%, then projects 4 through 8 dropped back. Each drop exposed a new category of failure that our rules didn't cover yet.

This is the pattern that mattered most to us: in these nine bounded projects, every failure we investigated had a concrete system-level fix. We did not find a case where replacing the model was the only plausible remedy.


The Crystallization Loop

Each project failure produced what we call a "Crystallized Lesson" (CL) — a concrete, testable rule that prevents that specific failure from recurring. Not a vague principle. A rule precise enough that code could check it.

Examples:

  • CL-005: Infrastructure files (conftest.py, database.py) must never appear in a task's target files. Origin: Project 3, where the model kept overwriting shared fixtures.

  • CL-034: DateTime fields with SQLAlchemy's default= must be set in Python's __init__, not relied upon at DB insert time (see the sketch after this list). Origin: Project 5, where unit tests failed because created_at was None before flush.

  • CL-043: When adding an endpoint to an existing route file, each task must contain exactly one endpoint. Origin: Project 8, where multi-endpoint tasks caused the model to time out trying to understand the existing code.
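
To make CL-034 concrete, here's a minimal sketch of the pattern it enforces. The model name and fields are invented for illustration, not ForgeFlow output; the point is that the timestamp is assigned in __init__ so it exists before any flush.

```python
# Minimal sketch of the CL-034 pattern (illustrative model, not ForgeFlow's code).
# Relying only on Column(default=...) leaves created_at as None until the row is
# flushed, which is exactly what broke the Project 5 unit tests.
from datetime import datetime, timezone

from sqlalchemy import Column, DateTime, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Expense(Base):
    __tablename__ = "expenses"

    id = Column(Integer, primary_key=True)
    title = Column(String, nullable=False)
    created_at = Column(DateTime, default=lambda: datetime.now(timezone.utc))

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # CL-034: set the value in Python so tests can read it before flush.
        if self.created_at is None:
            self.created_at = datetime.now(timezone.utc)
```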

By project 9, we had 43 of these rules. They're not guidelines — they're checkable constraints on the PRD document that feeds the model. We call the document that holds them the PRD Design Checklist.
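
"Checkable" is meant literally. A rule like CL-005 can be enforced by a few lines of code that run over the PRD before it ever reaches the model. The task shape and function below are a hypothetical sketch, not the actual checklist implementation:

```python
# Hypothetical sketch of a CL-005 check run against a PRD task definition.
INFRASTRUCTURE_FILES = {"conftest.py", "database.py"}


def check_cl_005(task: dict) -> list[str]:
    """CL-005: infrastructure files must never appear in a task's target files."""
    violations = []
    for path in task.get("target_files", []):
        filename = path.rsplit("/", 1)[-1]
        if filename in INFRASTRUCTURE_FILES:
            violations.append(f"CL-005 violation in {task['id']}: {path}")
    return violations


task = {"id": "T3", "target_files": ["app/routers/bookmarks.py", "tests/conftest.py"]}
print(check_cl_005(task))  # flags tests/conftest.py before the model ever sees the task
```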

Here's what the accumulation looked like:

```
Projects 1-3:  CL-001 to CL-020  (~7 per project)
Projects 4-6:  CL-021 to CL-035  (~5 per project)
Projects 7-8:  CL-036 to CL-043  (~4 per project)
Project 9:     0 new CLs needed
```

The rate of new rules slowed, but the depth increased. Early rules were about file placement ("where does conftest.py go?"). Later rules were about engine-level behavior ("how does the correction system handle idempotency?").


What Project 8 Broke

Project 8 (habit-tracker-api) is worth examining because it's where the two-axis model from Part 5 stopped being sufficient.

The project had nine tasks. The first four — model and schema creation — passed autonomously in one cycle each. Then the route tasks (5 through 9) collapsed. Zero of five passed.

The failures fell into four categories:

  1. A pytest configuration warning was being captured as a failure signature. The code was correct, but the orchestrator classified it as broken (see the sketch after this list).

  2. A string-replacement correction was applied twice: the rule that rewrites client.post( into await client.post( was also applied to lines that already had await, producing await await client.post( — a syntax error.

  3. A schema class was never generated because no test existed for it. The model only builds what it's tested for. No test, no code.

  4. Tasks that modified existing files timed out because the model needed too long to understand the accumulated code.
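
The first category deserves a concrete illustration. The snippet below is a hedged reconstruction of the classification gap, not ForgeFlow's actual gate code: a substring check treats a harmless PytestConfigWarning as a failure, while checking pytest's own summary line does not.

```python
# Hedged reconstruction of the category-1 misclassification (not ForgeFlow's gate code).
import re

output = (
    "PytestConfigWarning: Unknown config option: asyncio_mode\n"
    "===== 5 passed, 1 warning in 0.12s ====="
)


def naive_is_failure(text: str) -> bool:
    # Over-broad: any "warning"/"error" substring marks correct code as broken.
    return "warning" in text.lower() or "error" in text.lower()


def summary_is_failure(text: str) -> bool:
    # Trust pytest's own summary line instead of scanning for scary words.
    return bool(re.search(r"\b\d+ (failed|error)", text))


print(naive_is_failure(output), summary_is_failure(output))  # True False
```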

Notice: none of these are information-quality problems in the Part 4/5 sense. The model had all the information it needed. The PRD was well-designed by the standards we had at the time. The failures came from the engine itself — the orchestrator's correction logic, its gate system, its timeout handling.


The Third Variable: Engine Quality

This forced us to extend the two-axis model:

System Reliability ≈ DCR × Information Quality × Engine Quality

I don't mean this as a measured mathematical product yet, but as a diagnostic model: if any axis collapses, the whole loop collapses.

Or to put it less formally: you can have a perfectly designed PRD and a well-informed model, and still fail because the orchestrator has bugs.

By engine quality, we mean whether the orchestrator preserves the intended semantics of the execution loop: phase isolation (RED writes only tests, GREEN writes only implementation), retry correctness (rollbacks don't destroy infrastructure state), deterministic correction safety (rewrites don't corrupt already-correct code), timeout policy, and commit boundaries.

In our case, the concrete fixes were:

  • The correction engine's idempotency. Our string-replacement system applied corrections blindly, turning an already-corrected await client.post( into await await client.post(. The fix was a line-level guard: if the replacement text already exists on a line, skip it (sketched after this list). (We're aware this is a limitation of primitive string matching. A proper AST-based mutation engine using something like LibCST would eliminate this entire class of errors. That's on our roadmap but hasn't been necessary yet at our current project complexity.)

  • The RED phase scope. If the model outputs an implementation file during the test-writing phase and the orchestrator writes it to disk, the test passes immediately — and the TDD cycle breaks. The fix was restricting the RED phase's file-write scope to test files only.

  • Router registration resilience. If git reset --hard during a retry also reverts infrastructure changes made by the orchestrator's auto-registration system, the next cycle starts with a broken setup. The fix was committing router registration during the initial setup script, not inside the TDD cycle.
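
Here's a minimal sketch of the line-level guard from the first fix. The function name and rule shape are illustrative; ForgeFlow's correction engine is more involved, but the guard itself is this simple:

```python
# Illustrative line-level idempotency guard for string-replacement corrections.
def apply_correction(source: str, old: str, new: str) -> str:
    """Replace `old` with `new` on each line, skipping lines already corrected."""
    fixed = []
    for line in source.splitlines(keepends=True):
        # Guard: if the replacement text already exists on this line,
        # assume the rule ran before and leave the line untouched.
        fixed.append(line if new in line else line.replace(old, new))
    return "".join(fixed)


code = "r1 = await client.post(url)\nr2 = client.post(url)\n"
# Without the guard, the first line would become "await await client.post(url)".
print(apply_correction(code, "client.post(", "await client.post("))
```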

We fixed these with three targeted engine patches, each with its own test suite (16 tests total). After the fixes, project 9 ran eight tasks with zero failures.

The key insight for us: PRD quality and engine quality appeared to be independent variables. Improving one didn't fix the other. Project 8's 44% pass rate wasn't a PRD problem — it was an engine problem that looked like a PRD problem until we traced each failure to its root cause.


What 100% Actually Looked Like

Project 9 was a contact book API with search. Single model (no foreign keys), six CRUD endpoints plus a search-by-query-param feature. We chose it deliberately to test route-task decomposition — the exact pattern that failed in project 8.

The numbers:

| Metric | Value |
|--------|-------|
| Tasks | 8 (model, schemas, 6 endpoints) |
| Total cycles | 8 (every task passed first try) |
| Total tokens | 9,042 (Ollama-reported generated tokens; prompts excluded) |
| Total time | ~4 minutes |
| Tests generated | 31 |
| Manual intervention | 0 |
| Cloud API cost | $0 |

Each task followed the same loop: Ollama writes a failing test → Ollama writes minimum implementation → deterministic corrections applied → pytest runs → commit if green.
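
In pseudocode, one task cycle looks roughly like the sketch below. The helper names (generate_tests, apply_corrections, and so on) are hypothetical stand-ins for the orchestrator's internals, not its real API:

```python
# Hedged sketch of one ForgeFlow task cycle; helper names are hypothetical.
import subprocess


def run_task(task, model, repo) -> bool:
    # RED: the local model writes a failing test; only test files may hit disk.
    repo.write(model.generate_tests(task), allow_only="tests/")

    # GREEN: the model writes the minimum implementation for that test.
    repo.write(model.generate_implementation(task))

    # Deterministic corrections (trailing slashes, missing awaits, ...) run last.
    repo.apply_corrections(task.correction_rules)

    # Verify: the suite must be green before anything is committed.
    if subprocess.run(["pytest", "-q"], cwd=repo.path).returncode == 0:
        repo.commit(f"task {task.id}: green")
        return True

    repo.reset_hard()  # retry path; setup-time commits (e.g. router registration) survive
    return False
```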

The route tasks that had failed repeatedly in project 8 now passed in single cycles. The differences:

  • Each task added exactly one endpoint (CL-043)
  • All schema classes had test coverage (CL-042)
  • The asyncio configuration was pre-set in the setup script (CL-040)
  • Trailing-slash corrections were applied deterministically (new)
  • Router registration was committed during setup, not during the TDD cycle (new)

None of these changes required a better model. The model was the same 45GB Qwen3 that produced 0% on project 1.


Why 100% Doesn't Mean "Solved"

I want to be careful here.

100% on a contact book API doesn't mean ForgeFlow can build anything. A contact book API is an architectural sandbox. The project was deliberately chosen to isolate the route-task failure pattern. It had no foreign keys, no authentication, no file uploads. Each endpoint was independent. The success here suggests that our execution loop is stable under these constrained conditions, not that it can refactor a legacy microservice architecture.

The real test is whether the next project — something with two related models and foreign-key relationships — maintains a high pass rate. We don't know yet.

What we do know:

  • The crystallization loop works. Each failure produces a rule. Rules accumulate. The same failure hasn't recurred in subsequent comparable projects.
  • Engine fixes matter as much as PRD fixes. Three engine patches in one session unblocked a project that no amount of PRD improvement would have fixed.
  • The three-variable model explains our data better than two. Projects 4-8 had good PRDs but engine bugs. The two-axis model couldn't explain those drops. The three-variable model can.

The Failure Catalog

Across nine projects, we cataloged 19 distinct failure patterns. Every one was eventually addressed — either through a PRD design rule, an engine fix, or a setup script change.

| Category | Count | Resolution |
|----------|-------|------------|
| PRD design gap | 10 | CL rules in checklist |
| Engine bug | 5 | forgeflow.py patches + tests |
| Infrastructure/setup | 3 | Setup script standardization |
| Timeout/performance | 1 | aider_timeout configuration |

A few examples from the catalog:

  • FC-015: Non-idempotent correction rule. Symptom: deterministic correction produced await await client.post(...). Root cause: the string-replacement rule did not check whether the line was already corrected. Fix: line-level idempotency guard — if the replacement text already exists on a given line, skip the replacement.

  • FC-018: RED phase implementation leakage. Symptom: tests passed immediately during RED because the implementation file was also written to disk. Root cause: the orchestrator's file-write scope included both test and implementation files during the test-writing phase. Fix: restrict RED phase scope to test files only; reject or quarantine non-test files during RED (sketched below).

  • FC-019a: Router registration lost across retry. Symptom: every retry started from a broken app state (404 on all endpoints) after git reset --hard. Root cause: the orchestrator's auto-registration system added router imports to main.py during the TDD cycle, but git reset reverted those changes. Fix: commit router registration during the initial setup script, before the TDD loop begins.
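
For FC-018, the fix amounts to a file filter in front of the RED-phase write path. A hypothetical version, assuming generated files arrive as a path-to-content mapping:

```python
# Hypothetical RED-phase scope filter for FC-018; paths and layout are illustrative.
from pathlib import Path


def split_red_phase_files(files: dict[str, str]) -> tuple[dict[str, str], dict[str, str]]:
    """Return (writable test files, quarantined non-test files) for the RED phase."""
    writable, quarantined = {}, {}
    for path, content in files.items():
        name = Path(path).name
        if path.startswith("tests/") or name.startswith("test_"):
            writable[path] = content
        else:
            # An implementation file written during RED makes the new test pass
            # immediately and silently breaks the TDD cycle, so hold it back.
            quarantined[path] = content
    return writable, quarantined


generated = {
    "tests/test_contacts.py": "def test_create_contact(): ...",
    "app/routers/contacts.py": "def create_contact(): ...",  # leaked implementation
}
writable, held = split_red_phase_files(generated)
print(sorted(writable), sorted(held))
```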

Under our postmortem classification criteria, none of the 19 cataloged failures were classified as pure model-capability failures — cases where the model lacked the syntax or logic ability to solve the task. Every failure traced back to something in the system around the model: missing information, incorrect scaffolding, or engine bugs.

This doesn't mean model capability doesn't matter. A stronger model would probably tolerate worse PRDs and buggier engines. But in our limited experience, fixing the system was always cheaper and more permanent than hoping for a smarter model.


What Comes Next

We're designing a diagnostic pipeline that applies the failure catalog automatically. The idea: when a task deadlocks, the engine checks the failure catalog for a matching pattern before giving up.

```
[DEADLOCK DETECTED]
        │
        ▼
[Pattern Match: Failure Catalog]
        ├──► Match Found ──► Apply Fix (Deterministic) ──► Retry
        └──► No Match   ──► Local LLM Diagnosis (Stage 2)
                                   └──► Fails ──► Human Escalation (Stage 3)
```

Stage 1 is pure pattern matching — deterministic, no LLM needed. Stage 2 would use the local model to diagnose novel failures. Stage 3 remains human review.
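
A minimal version of Stage 1 could be as plain as a list of signatures mapped to deterministic fixes. The catalog entries and regex patterns below are invented for illustration, not the real catalog:

```python
# Illustrative Stage 1: deterministic pattern matching against the failure catalog.
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CatalogEntry:
    fc_id: str
    signature: re.Pattern      # matched against the captured failure output
    fix: Callable[[], None]    # deterministic remediation applied before retry


CATALOG = [
    CatalogEntry("FC-015", re.compile(r"await await \w+\."),
                 lambda: print("re-run corrections with idempotency guard")),
    CatalogEntry("FC-019a", re.compile(r"404 Not Found"),
                 lambda: print("recommit router registration from setup script")),
]


def diagnose(failure_output: str) -> Optional[str]:
    """Return the matching FC id after applying its fix, or None to escalate to Stage 2."""
    for entry in CATALOG:
        if entry.signature.search(failure_output):
            entry.fix()
            return entry.fc_id
    return None  # no match: local LLM diagnosis, then human escalation


print(diagnose("E   SyntaxError near: await await client.post(url)"))  # FC-015
```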

The goal isn't to eliminate human involvement entirely. It's to ensure that each human intervention produces a rule that prevents the same intervention next time. The system should get cheaper to operate with every project it runs.


The Thesis, Updated

Part 3: "The bottleneck is not model capability, but the verifiability of specifications."

Part 4: "Even after verifiability is constructed, the bottleneck shifts to information delivery."

Part 5: "An AI coding agent's reliability is a product of its deterministic coverage and its information quality."

Now, the working version after nine projects:

"In our experience, an AI coding agent's reliability is bounded by three independent variables: the determinism of its scaffolding, the quality of information it receives, and the correctness of its own engine. Improving any two without the third produced a system that failed in ways that looked like model limitations but weren't."

The practical diagnostic is now threefold: measure your deterministic coverage, inspect your information quality, and test the engine itself. Fix the axis that's actually broken. In our nine projects, that diagnostic kept pointing to the system, not the model.

Whether that pattern holds at higher complexity is something we're still finding out.


About

I'm Joseph YEO, building ForgeFlow from Seoul, Korea — a local AI coding agent that runs entirely on Apple Silicon, no cloud inference during execution.

What's your experience with orchestrator-level bugs masquerading as model limitations? Have you seen cases where the system around the model was the actual bottleneck? I'd love to compare notes.


Previous parts: Part 1: 164 Failures · Part 2: n8n to Python · Part 3: The Determinism War · Part 4: The Information Design Gap · Part 5: DCR Wasn't Enough

9 projects. 43 rules. 19 failure patterns. 48 development sessions. Same 45GB model throughout. All models run locally via Ollama 0.23.0 on Apple Silicon M5 Max 128GB. No cloud APIs were used during autonomous execution.

This post was drafted with Claude and edited by me.
