Joseph Yeo

Posted on May 22

The File Modification Boundary We Found After 12 ForgeFlow Projects

#ai #llm #agents #softwareengineering

This is Part 7 of the ForgeFlow series. Part 6, "The Bug Wasn't in the Model," ended at 9 projects, 51 failure patterns, and 70 design rules. Until then, failure rates in our setup were declining and the working framework felt like it was converging. Project 12 exposed a structural gap we hadn't yet documented.

Quick terms for new readers:

FC = Failure Catalog entry (a documented failure pattern)

CL = Crystallized Lesson (a testable design rule derived from repeated failures)

Identical GREEN = the model returns an unchanged file during the implementation phase

DEADLOCK = the system gives up after repeated identical failures

Part 6 ended on a high note. Nine projects. A 100% pass rate on the last one. Seventy crystallized lessons. A working framework in our setup: DCR × Information Quality × Task Complexity. The system felt like it was converging.

Then we tried self-referential foreign keys, and a failure mode we'd only seen sporadically became the dominant pattern.

This post is about Project 12, a department hierarchy API with JWT authentication and self-referential parent-child relationships. It documents the failure pattern that connected several scattered observations into a single engineering constraint, and it explains why, in our case, the most practical response was to restructure the work rather than retry harder.

The Setup: Department API

Project 12 was designed to exercise two things at once: JWT authentication (new for ForgeFlow) and self-referential foreign keys (a department can be a child of another department). The tech stack was familiar (FastAPI, SQLAlchemy async, pytest), but the data model was more complex than in our previous test projects.

The target execution plan: 13 tasks total. Of these, 4 were new-file creation tasks (schemas, tests), 5 were existing-file modification tasks (models, routes), and 4 were either setup steps or handled outside the autonomous loop.

We ran it five times, redesigning between runs. The pattern became hard to ignore.

The Scorecard

The table below shows the task categories from Project 12. The same outcome repeated across five redesign-and-rerun attempts.

Task Type	Count	Pass Rate	Avg Cycles
New file creation (schemas, tests)	4	100%	1.0
Existing file modification (models, routes)	5	0%	DEADLOCK

In our setup, every task that generated an entirely new file succeeded on the first attempt, and every task that modified an existing file ended in DEADLOCK. This held across five separate runs, two different backends (direct Ollama API and Aider), and multiple retry strategies.

To scope these findings: our dataset is constrained to a single model family (Qwen3-Coder-Next, 45GB Q4_K_M) running on a single hardware tier (Apple Silicon M5 Max 128GB). We don't claim these trends apply universally. But the pattern was consistent enough across five runs that we changed how we structure tasks going forward.

What "Identical GREEN" Looks Like

ForgeFlow's TDD loop works in two phases: RED (write a failing test) and GREEN (write code to pass it). The GREEN phase is where modifications happen.

When a task required modifying an existing file, the following loop repeated:

The model receives the existing file content + test requirements
The model outputs code that matches the existing file exactly (detected via SHA-256 hash comparison)
The engine retries with an explicit prompt: "Your output was identical to the current file"
The model outputs the same file again
DEADLOCK after 3 identical cycles

We call this an identical GREEN deadlock. The engine already had detection for it (FC-037, added months ago). But we'd only seen it sporadically before. In project 12, it became the primary failure mode.
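The detection step described above can be sketched in a few lines. This is a minimal illustration, not ForgeFlow's actual code: the class name `GreenGuard`, the status strings, and the constant name are hypothetical. Only the SHA-256 comparison and the three-cycle limit come from the description above.

```python
import hashlib
from dataclasses import dataclass

# From the post: DEADLOCK after 3 identical cycles. The constant name is hypothetical.
MAX_IDENTICAL_CYCLES = 3


def sha256_text(text: str) -> str:
    """Hash file content so two versions can be compared by digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class GreenGuard:
    """Counts consecutive GREEN outputs that are byte-identical to the current file."""

    identical_cycles: int = 0

    def check(self, current_file: str, model_output: str) -> str:
        """Return "ok", "retry", or "deadlock" (hypothetical status names)."""
        if sha256_text(current_file) != sha256_text(model_output):
            # The model changed something, so the streak of identical outputs ends.
            self.identical_cycles = 0
            return "ok"
        self.identical_cycles += 1
        if self.identical_cycles >= MAX_IDENTICAL_CYCLES:
            return "deadlock"
        return "retry"
```

Because the comparison is on exact content, an output that differs only in whitespace would count as a change under this sketch. Whether that is desirable depends on how the engine normalizes files before hashing, which the description above does not specify.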

Working Hypotheses

We're cautious about attributing "understanding" to the model — we're observing output patterns, not internal reasoning. Here's what we think might be happening:

The whole-file generation pattern (Ollama backend): When generating code via raw completion, the model streams the entire file from the first token. If the existing file is 95% correct and only needs a few lines added, the token history in the context window acts as a statistical attractor — the generation pattern defaults to reproducing the verified, working code rather than deviating to introduce new logic. The smaller the required change relative to the existing file, the stronger this pull appears to be.

The diff generation constraint (Aider backend): A diff must match the lines it targets exactly. When the target file is complex (multiple async routes, mixed dependencies, dense imports), generating accurate unified diff hunks appears to become erratic for our local quantized model. In our tests with this specific model and configuration, this manifested as timeouts (capped at 200 seconds per task) or a fallback to emitting an unchanged version of the source file.

Both pathways showed similar limitations on file modification tasks in our configuration. Whether this is specific to quantized local models or a broader pattern, we can't say.
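One way to make "the required change relative to the existing file" measurable is a line-level change ratio. The sketch below uses Python's standard `difflib`. The function name `change_ratio` and the definition of the ratio are illustrative assumptions, not a metric taken from ForgeFlow.

```python
import difflib


def change_ratio(existing: str, target: str) -> float:
    """Share of lines, across both versions, that are not in a common block.

    0.0 means the target is identical to the existing file;
    1.0 means nothing is shared (for example, a brand-new file).
    """
    old_lines = existing.splitlines()
    new_lines = target.splitlines()
    # autojunk=False disables difflib's popularity heuristic for long inputs.
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    # SequenceMatcher.ratio() is 2 * matched / (len(old) + len(new)).
    return 1.0 - matcher.ratio()
```

Under this definition, appending one line to a 20-line file gives a ratio of 1/41 (about 0.02), while a new-file task gives 1.0. Logging a number like this per task would be one way to check whether identical-output failures really cluster at the low end.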

Connecting Scattered Observations

Before project 12, our tracker had three separate failure patterns that each captured a piece of this:

FC-034 / CL-043: "One task, one endpoint" — adding endpoints to an existing route file often resulted in syntax errors or duplicates
FC-047 / CL-066: "Over-complete stubs" — when a stub had significant boilerplate, the model treated it as finished
FC-039 / CL-058: "POST endpoints need Aider" — some tasks specifically failed on the Ollama backend

Project 12 gave us the data to connect these into a single classification, FC-052:

In our local execution setup, existing file modification tasks demonstrate a high probability of identical GREEN DEADLOCK on both whole-file and diff-based backends. In our observations, identical-output failures appeared more often when the required change was small relative to the existing file.

From FC-052, we derived CL-071:

Every autonomous task target should be a new file. If a workflow step must modify an existing file, that modification should either be handled programmatically during setup or the architecture should be decoupled so that features reside in isolated modules.

This became our 71st crystallized lesson, and it changed how we now structure ForgeFlow projects.

One notable data point: across three complete projects (10, 11, and 12), our failure catalog expanded by only a single new entry. The rule accumulation curve is flattening, which may suggest we're mapping the boundary of our current configuration — or just the boundary of our current project complexity.

The Design Pattern That Emerged

CL-071 pushed us to rethink how we write PRDs.

Before (task-level modifications):

TASK-001: Create User model (stub)        → models/user.py
TASK-002: Add fields to User model        → models/user.py    [DEADLOCK]
TASK-003: Create Department model (stub)  → models/department.py
TASK-004: Add relationship                → models/department.py [DEADLOCK]

After (decoupled new-file generation):

SETUP SCRIPT: Generate complete models with all fields and relationships
TASK-001: Create User schemas    → schemas/user.py      [NEW FILE ✅]
TASK-002: Create Dept schemas    → schemas/department.py [NEW FILE ✅]
TASK-003: Create register route  → routes/auth.py        [NEW FILE ✅]

The pattern: infrastructure is established deterministically during setup, while the model handles clean-sheet file generation.
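The difference between the two plans can be checked mechanically: any task whose target was already claimed by an earlier task is, by construction, a modification task. A minimal sketch, assuming tasks are available as (task id, target path) pairs. The function name and the data shape are hypothetical.

```python
import posixpath
from typing import Iterable


def find_repeat_targets(
    tasks: Iterable[tuple[str, str]],
) -> list[tuple[str, str, str]]:
    """Return (task_id, target, first_task_id) for every task that reuses a target."""
    first_owner: dict[str, str] = {}
    flagged: list[tuple[str, str, str]] = []
    for task_id, target in tasks:
        # Normalize so "./models/user.py" and "models/user.py" compare equal.
        path = posixpath.normpath(target)
        if path in first_owner:
            flagged.append((task_id, path, first_owner[path]))
        else:
            first_owner[path] = task_id
    return flagged


before_plan = [
    ("TASK-001", "models/user.py"),
    ("TASK-002", "models/user.py"),
    ("TASK-003", "models/department.py"),
    ("TASK-004", "models/department.py"),
]
for task_id, path, owner in find_repeat_targets(before_plan):
    print(f"{task_id} modifies {path}, first written by {owner}")
```

Run against the "before" plan, this flags TASK-002 and TASK-004, the same two tasks marked DEADLOCK above. The "after" plan produces no flags because every target appears once.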

An important caveat: applying this pattern to project 12 was not a clean autonomous success. We manually implemented the CRUD endpoints (6 routes) to unblock the dependency chain, then tested whether the remaining new-file task would run cleanly under the revised structure. The integration test — creating a fresh test_integration.py — passed on its first autonomous cycle. The important result was narrower than "we solved it": once existing-file modification was removed from the autonomous task path, the remaining new-file task completed cleanly.

We should also note an open concern: forcing every task into a "new file only" pattern shifts complexity from generation-time editing to project-level file organization. At 13 tasks, this is manageable. At 50+, it could create significant file fragmentation and import overhead. We haven't tested at that scale yet.

Where We Are After 12 Projects

Metric	Value
Total projects	12 (11 completed, 1 scrapped)
Failure patterns cataloged (FC)	52
Design rules (CL)	71
Automated rule checks	53 functions in validate_prd.py
Sessions	81

The Honest Assessment

After 12 projects and 81 sessions:

What's working in our setup:

New file generation from detailed specs: reliable across the runs we tested
TDD enforcement (RED must fail, GREEN must pass): useful as a mechanical guardrail
Failure pattern → design rule pipeline: producing diminishing but real returns
Setup-based infrastructure + model-based creation: tested over 3 projects

What isn't working:

Existing file modification: consistently unreliable with our current model and configuration
Non-deterministic results on complex tasks: one task passed in 2 out of 3 runs, failed in 1. Same code, same model, different outcome.
Long dependency chains: a single DEADLOCK blocks everything downstream
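The cost of a single DEADLOCK in a long chain can be computed from the task dependency graph. A minimal sketch, assuming dependencies are stored as a mapping from each task to the tasks it depends on. The function name and the example graph are hypothetical and not taken from Project 12.

```python
from collections import deque


def blocked_by(deadlocked: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every task that directly or transitively depends on `deadlocked`."""
    # Invert the edges: for each task, list the tasks that wait on it.
    dependents: dict[str, list[str]] = {}
    for task, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(task)

    blocked: set[str] = set()
    queue = deque([deadlocked])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in blocked:
                blocked.add(child)
                queue.append(child)
    return blocked


# Hypothetical graph: TASK-003 needs TASK-002, which needs TASK-001.
example = {
    "TASK-002": ["TASK-001"],
    "TASK-003": ["TASK-002"],
    "TASK-004": ["TASK-001"],
    "TASK-005": [],
}
print(sorted(blocked_by("TASK-001", example)))
```

In this example a DEADLOCK on TASK-001 blocks three of the four remaining tasks, while a DEADLOCK on TASK-002 blocks only TASK-003. Ordering a plan so that risky tasks have few dependents is one possible use of such a count.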

Open questions:

Does CL-071 hold on 20+ task projects with complex dependency graphs?
Does the "new file only" constraint create unsustainable file fragmentation at scale?
Will newer local models (Qwen3-Coder v2, Llama 4) shift this boundary?
Is this specific to quantized local models, or do cloud API models show similar patterns on file modification tasks inside TDD loops?

A Request to Readers

If you're running local models (Ollama, llama.cpp, vLLM, or something else) within autonomous execution loops, we'd like to know whether your logs show a similar gap between file creation and file modification tasks.

Specifically: how do your local configurations handle incremental diff generation inside structured loops versus generating complete, fresh modules from detailed specs? If you've logged similar boundaries or found alternative designs to work around modification deadlocks, please share your setup and observations in the comments.

We're also curious whether anyone has hard metrics on how cloud models (GPT, Claude) perform on targeted file modifications inside closed-loop TDD environments. Our dataset is one model family on one hardware tier — more data points from different setups would help everyone working in this space.

What's Next

Project 13 will be the first real test of whether CL-071 is a design principle or just a project-12-specific workaround. Every implementation task will target a new file. Setup will handle all infrastructure. The open question isn't whether it passes — it's whether the "new file only" constraint produces a project structure that's actually maintainable at 20+ tasks.

We're also adding automatic CL-071 validation to validate_prd.py — a check that flags any task whose implementation target already exists at execution time. For our workflow, rules that repeatedly affect outcomes should probably be machine-enforced.
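A check of that kind could look roughly like the following. This is a sketch under stated assumptions, not the actual contents of validate_prd.py: the function name `check_cl071`, the (task id, target path) input shape, and the message format are all hypothetical.

```python
from pathlib import Path
from typing import Iterable


def check_cl071(tasks: Iterable[tuple[str, str]], project_root: str) -> list[str]:
    """Return one message per task whose implementation target already exists."""
    root = Path(project_root)
    violations: list[str] = []
    for task_id, target in tasks:
        # CL-071: an autonomous task should create its target, not modify it.
        if (root / target).exists():
            violations.append(f"{task_id}: target already exists: {target}")
    return violations
```

An empty list means every target is still a new file when the check runs. Since setup scripts create files too, the check only makes sense after setup has finished and before the first autonomous task starts.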

The Series So Far

I Built a Local AI Coding Agent on M5 Max 128GB — 164 failures, 35 tests, proof of concept
We Didn't Migrate from n8n to Python Because n8n Failed — The orchestrator rewrite
The Determinism War — Why we stopped chasing better models
The Information Design Gap — Why the agent was coding blind
DCR Wasn't Enough — Adding information quality to the framework
The Bug Wasn't in the Model — Lessons from 9 projects
The File Modification Boundary — You are here. 12 projects, a boundary mapped.

About

I'm Joseph YEO, a solo builder from Seoul, Korea. ForgeFlow runs entirely on a MacBook Pro M5 Max 128GB — no cloud APIs during execution. The planning agent (Claude) designs the specs. The local model (Qwen3-Coder-Next, 45GB Q4_K_M) executes the TDD loop autonomously.

Built over 81 sessions, May 2026. All models run locally via Ollama 0.23.0 on macOS. No cloud APIs were used during autonomous execution.

This post was drafted with Claude and edited by me.
