<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Joseph Yeo</title>
    <description>The latest articles on DEV Community by Joseph Yeo (@josephyeo).</description>
    <link>https://dev.to/josephyeo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3863060%2F14a6921b-eef9-4611-ba9b-c1a7b9835304.png</url>
      <title>DEV Community: Joseph Yeo</title>
      <link>https://dev.to/josephyeo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/josephyeo"/>
    <language>en</language>
    <item>
      <title>77 Rules Later: What Graduating Our First Stack Actually Looked Like</title>
      <dc:creator>Joseph Yeo</dc:creator>
      <pubDate>Mon, 25 May 2026 08:19:22 +0000</pubDate>
      <link>https://dev.to/josephyeo/77-rules-later-what-graduating-our-first-stack-actually-looked-like-2o3k</link>
      <guid>https://dev.to/josephyeo/77-rules-later-what-graduating-our-first-stack-actually-looked-like-2o3k</guid>
      <description>&lt;p&gt;&lt;em&gt;This is Part 8 of the ForgeFlow series. &lt;a href="https://dev.to/josephyeo/the-file-modification-boundary-we-found-after-12-forgeflow-projects-3m01"&gt;Part 7: The File Modification Boundary&lt;/a&gt; documented the constraint that changed how we structure tasks: every autonomous task target should be a new file. We ended Part 7 at 12 projects, roughly 52 failure patterns, and 71 design rules. Part 7 closed with an open question: "Project 13 will be the first real test of whether CL-071 holds under normal conditions."&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Quick terms for new readers:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FC&lt;/strong&gt; = Failure Catalog entry (a documented failure pattern)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CL&lt;/strong&gt; = Crystallized Lesson (a testable design rule derived from repeated failures)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DEADLOCK&lt;/strong&gt; = the system gives up after repeated identical failures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ForgeFlow&lt;/strong&gt; = a fully local, TDD-based autonomous coding system running on Apple Silicon&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;




&lt;p&gt;Part 7 ended with a hypothesis and a bet.&lt;/p&gt;

&lt;p&gt;The hypothesis: CL-071 (every task targets a new file, never modifies an existing one) might reduce or remove the dominant failure mode we'd been observing. The bet: we'd set formal graduation criteria and run projects until we met them — or discovered why we couldn't.&lt;/p&gt;

&lt;p&gt;We ran five more projects (with one intermediate rerun included in the data). On the seventeenth — a blog API with 14 tasks — all 33 tests passed without intervention or deadlock, completing in approximately 12 minutes.&lt;/p&gt;

&lt;p&gt;This post is about the five projects between that hypothesis and this result, what the graduation criteria actually measured, and the failure that appeared &lt;em&gt;after&lt;/em&gt; we thought we'd addressed all the known ones.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Graduation Criteria
&lt;/h2&gt;

&lt;p&gt;Before results, here's what we were measuring. We didn't want "it worked once" to count as graduation. We defined four conditions, all of which had to hold on a qualifying run:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criterion&lt;/th&gt;
&lt;th&gt;Threshold&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;First-run pass rate (tasks passing on the first TDD cycle, no retry)&lt;/td&gt;
&lt;td&gt;≥ 85%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;New FC yield per project&lt;/td&gt;
&lt;td&gt;≤ 2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Repeat FC rate (previously solved patterns recurring)&lt;/td&gt;
&lt;td&gt;≤ 5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Teacher escalation (human operator interventions mid-task)&lt;/td&gt;
&lt;td&gt;Decreasing trend&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The logic: a graduated stack should show repeatable autonomous recovery within the tested scope (criterion 1), stop producing novel failure patterns at a high rate (criterion 2), not regress on already-solved problems (criterion 3), and require less human involvement over time (criterion 4).&lt;/p&gt;

&lt;p&gt;We chose 85% rather than 100% for the pass rate deliberately. Occasional retries are expected behavior in a TDD loop — in ForgeFlow's architecture, the system is designed to recover from them. What we track is whether it recovers autonomously.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Five-Project Path
&lt;/h2&gt;

&lt;p&gt;Here's the longitudinal data from Part 7's endpoint (project 12) through the graduation run. Note: this table tracks the &lt;em&gt;autonomous pass rate&lt;/em&gt; — tasks that eventually passed without human intervention, including retries. The graduation criterion uses the stricter &lt;em&gt;first-run pass rate&lt;/em&gt; (no retries), which we measured separately for the qualifying run.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Project&lt;/th&gt;
&lt;th&gt;Tasks&lt;/th&gt;
&lt;th&gt;Autonomous Pass Rate&lt;/th&gt;
&lt;th&gt;New FCs&lt;/th&gt;
&lt;th&gt;CL Count (at time)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;comment-api&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;83%&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;~72&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;order-api&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;56%&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;~74&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;recipe-api&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;57%&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;~75&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;bookmark-api v2&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;83%&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;~76&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16.5&lt;/td&gt;
&lt;td&gt;catalog-api-v2&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;83%&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;~76&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;17&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;blog-api&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;14&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;77&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The trajectory wasn't smooth. Projects 14 and 15 dropped below 60%. Then it recovered. In this sequence, plateaus tended to expose a new failure category; the system dipped, the failure got crystallized into a rule, and the next project incorporated the fix.&lt;/p&gt;

&lt;p&gt;What changed between project 15 (57%) and project 17 (100%) was not a model upgrade or an engine rewrite. It was three additional design rules, each derived from a specific failure we observed and diagnosed.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Dip: What Went Wrong on Projects 14 and 15
&lt;/h2&gt;

&lt;p&gt;Projects 14 (order-api) and 15 (recipe-api) both hovered around 56–57% autonomous pass rate. The failures clustered around a few patterns:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Route endpoint isolation.&lt;/strong&gt; Tasks that bundled multiple endpoints into a single file — GET list and GET detail in the same route module — showed a notably higher failure rate than single-endpoint tasks. The outputs showed scope-related failures: given two endpoints to implement, the model would sometimes complete one and leave the other as a stub, or attempt both and introduce inconsistencies.&lt;/p&gt;

&lt;p&gt;We already had CL-043 (one task, one endpoint) from Part 6. But we'd been applying it loosely — allowing two closely related endpoints to share a task. Projects 14 and 15 showed us that "closely related" was too vague for this local execution loop. The rule needed to be absolute: one endpoint, one file, one task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Import specification gaps.&lt;/strong&gt; Route tasks that didn't explicitly list every required import in their task description had a high failure rate. The model would guess import paths, often incorrectly. CL-072 crystallized this: every route task description must include a complete "Required imports" block. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Required&lt;/span&gt; &lt;span class="n"&gt;imports&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;APIRouter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Depends&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sqlalchemy.ext.asyncio&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AsyncSession&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;app.database&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_db&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;app.schemas.author&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AuthorCreate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AuthorRead&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
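&lt;p&gt;One way to check that generated code actually used a "Required imports" block is to compare import statements structurally rather than as text. The sketch below is an assumption about how such a check could look, not ForgeFlow's validator: &lt;code&gt;imported_names&lt;/code&gt; and &lt;code&gt;missing_imports&lt;/code&gt; are hypothetical helper names, and the required block is assumed to be given as plain Python import lines.&lt;/p&gt;

```python
import ast


def imported_names(source: str) -> set:
    """Collect (module, name) pairs for every import statement in the source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom):
            for alias in node.names:
                names.add((node.module, alias.name))
        elif isinstance(node, ast.Import):
            for alias in node.names:
                names.add((alias.name, None))
    return names


def missing_imports(required_block: str, generated_source: str) -> set:
    """Return required imports that the generated code does not contain."""
    return imported_names(required_block) - imported_names(generated_source)


# Required block taken from the example above, written as plain import lines.
required = (
    "from fastapi import APIRouter, Depends\n"
    "from app.database import get_db\n"
)
# Hypothetical model output that guessed one import path incorrectly.
generated = (
    "from fastapi import APIRouter\n"
    "from app.db import get_db\n"
)
missing = missing_imports(required, generated)
print(sorted(missing))
```

&lt;p&gt;Parsing with &lt;code&gt;ast&lt;/code&gt; does not execute or resolve the imports, so the check works even when the imported modules are not installed.&lt;/p&gt;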



&lt;p&gt;&lt;strong&gt;Decimal type mismatches.&lt;/strong&gt; This one surfaced later, in project 16.5 (catalog-api-v2), where a product model with a &lt;code&gt;Numeric(10,2)&lt;/code&gt; price column exposed a subtle testing issue. The LLM wrote assertions comparing float literals to the Decimal values SQLAlchemy returns for that column, and &lt;code&gt;999.99 != Decimal('999.99')&lt;/code&gt; in Python because the float is only a binary approximation of 999.99. CL-076 captured this: any Numeric column test must use Decimal comparisons.&lt;/p&gt;
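&lt;p&gt;The mismatch can be reproduced with the standard library alone. In this minimal sketch, &lt;code&gt;price_from_db&lt;/code&gt; is a hypothetical stand-in for the value read back from a &lt;code&gt;Numeric(10,2)&lt;/code&gt; column; no database is involved.&lt;/p&gt;

```python
from decimal import Decimal

# Hypothetical stand-in for a value read back from a Numeric(10,2) column.
price_from_db = Decimal("999.99")

# The float literal is a binary approximation, so it does not equal the exact Decimal.
float_matches = (999.99 == price_from_db)

# Comparing Decimal to Decimal is exact.
decimal_matches = (Decimal("999.99") == price_from_db)

# If only a float is available, converting through str() recovers the short decimal form.
converted_matches = (Decimal(str(999.99)) == price_from_db)

print(float_matches, decimal_matches, converted_matches)
```

&lt;p&gt;Only the first comparison fails, which is the shape of assertion CL-076 rules out.&lt;/p&gt;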

&lt;p&gt;In our diagnosis, these looked less like model-capability failures and more like specification-precision failures — cases where the PRD left enough ambiguity for a 45GB quantized model to make a reasonable-but-wrong choice.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Failure We Didn't Expect: FC-074
&lt;/h2&gt;

&lt;p&gt;Project 17 (blog-api) was designed as the graduation attempt. We applied all 76 existing rules. The PRD passed our automated validator (50 checks passed, 0 failures). We expected fewer known-pattern failures.&lt;/p&gt;

&lt;p&gt;The first three attempts all failed on the very first task — creating the Author model. Same error each time: &lt;code&gt;red_apply_empty&lt;/code&gt; — the engine's signal that the RED-phase output contained implementation code rather than a test.&lt;/p&gt;

&lt;p&gt;Here's what happened, step by step:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Our setup script created a minimal model stub file — just the class name and primary key column. This was standard practice per CL-066 ("stubs should be PK-only").&lt;/li&gt;
&lt;li&gt;Before the RED phase (test generation), the engine runs FC-060 cleanup: it deletes the target implementation file so the model writes it fresh.&lt;/li&gt;
&lt;li&gt;FC-060 deleted the stub.&lt;/li&gt;
&lt;li&gt;The model didn't need the file to exist at generation time — the surrounding task context still described enough of the intended model structure (via data_models in the PRD and conftest import references) that it produced implementation code during RED instead of a test.&lt;/li&gt;
&lt;li&gt;The engine detected this as a scope violation and triggered &lt;code&gt;red_apply_empty&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Three retries. Same result each time.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We called this FC-074: the interaction between two previously validated rules (CL-066: keep stubs minimal, and FC-060: clean target files before RED) producing a new failure when combined.&lt;/p&gt;

&lt;p&gt;This is worth pausing on. FC-074 wasn't a gap in any single rule. It was an &lt;em&gt;interaction effect&lt;/em&gt; — two rules that had each been validated independently across multiple projects, producing a failure only in a specific sequence of operations.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rule&lt;/th&gt;
&lt;th&gt;Behavior in isolation&lt;/th&gt;
&lt;th&gt;Combined behavior&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CL-066&lt;/td&gt;
&lt;td&gt;Minimal stubs reduce over-complete-stub failures&lt;/td&gt;
&lt;td&gt;Creates a target file before RED&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FC-060&lt;/td&gt;
&lt;td&gt;Deletes implementation target before RED to ensure clean state&lt;/td&gt;
&lt;td&gt;Removes the stub CL-066 created&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Combined&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;RED sees a missing target but enough context to generate implementation instead of a test&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Fix: Stop Creating Stubs
&lt;/h2&gt;

&lt;p&gt;The first instinct was to adjust the prompt wording — tell the model more explicitly to write a test, not an implementation. We tried that. Same failure. Prompt changes alone didn't resolve it; file-state became the stronger hypothesis.&lt;/p&gt;

&lt;p&gt;The second instinct was to refine the stub. But we diagnosed the stub's existence as the likely trigger: FC-060 deleted it, and the residual context information was enough to derail the RED phase.&lt;/p&gt;

&lt;p&gt;The third attempt was the simplest: don't create the stub at all.&lt;/p&gt;

&lt;p&gt;CL-077: Setup scripts must not create model stub files. Model files are created from scratch by the task that implements them. The conftest wraps model imports in try/except so that earlier tasks can run before the model file exists:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;app.models.author&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Author&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;ImportError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;Author&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This inverted an assumption we'd held across the previous 16 project iterations. We'd operated under the belief that providing a stub — even a minimal one — helped the model by giving it a starting point. FC-074 suggested that in our current engine architecture, the stub &lt;em&gt;hurt&lt;/em&gt; by creating a state that the cleanup logic couldn't handle cleanly.&lt;/p&gt;

&lt;p&gt;After applying CL-077, the same blog-api project ran all 14 tasks to completion. 33 tests passed, zero intervention, approximately 12 minutes total.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Graduation Run Measured
&lt;/h2&gt;

&lt;p&gt;Here's how project 17 scored against the criteria:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criterion&lt;/th&gt;
&lt;th&gt;Threshold&lt;/th&gt;
&lt;th&gt;Project 17 Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;First-run pass rate&lt;/td&gt;
&lt;td&gt;≥ 85%&lt;/td&gt;
&lt;td&gt;93% (13/14 first-shot, 1 retry)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;New FC yield&lt;/td&gt;
&lt;td&gt;≤ 2&lt;/td&gt;
&lt;td&gt;1 (FC-074)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Repeat FC rate&lt;/td&gt;
&lt;td&gt;≤ 5%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Teacher escalation&lt;/td&gt;
&lt;td&gt;Decreasing&lt;/td&gt;
&lt;td&gt;Zero escalations&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Project 17 met all four thresholds. The preceding project (16.5, catalog-api-v2) reached an 83% autonomous pass rate; since the first-run rate cannot exceed the autonomous rate, it was close to, but below, the ≥85% line. So we are treating project 17 as the graduation point rather than claiming a two-project stable plateau.&lt;/p&gt;
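&lt;p&gt;The four criteria can be written as a small check. This is a sketch under stated assumptions, not the engine's code: &lt;code&gt;RunMetrics&lt;/code&gt; and &lt;code&gt;meets_graduation_criteria&lt;/code&gt; are hypothetical names, "decreasing trend" is interpreted here as a non-increasing series, and the escalation history before project 17 is made up for illustration (only the final zero comes from the table).&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunMetrics:
    tasks_total: int
    tasks_passed_first_cycle: int
    new_fcs: int
    repeat_fc_rate: float            # fraction between 0.0 and 1.0
    escalations_history: List[int]   # oldest project first, current run last


def first_run_pass_rate(m: RunMetrics) -> float:
    """Share of tasks that passed on the first TDD cycle, with no retry."""
    return m.tasks_passed_first_cycle / m.tasks_total


def meets_graduation_criteria(m: RunMetrics) -> dict:
    """Evaluate each criterion separately so a failed run shows which one missed."""
    hist = m.escalations_history
    return {
        "first_run_pass_rate": first_run_pass_rate(m) >= 0.85,
        "new_fc_yield": 2 >= m.new_fcs,
        "repeat_fc_rate": 0.05 >= m.repeat_fc_rate,
        # Assumption: "decreasing trend" means no project needed more
        # escalations than the one before it.
        "escalation_trend": all(a >= b for a, b in zip(hist, hist[1:])),
    }


# Project 17 figures from the table; the earlier escalation counts are hypothetical.
project_17 = RunMetrics(
    tasks_total=14,
    tasks_passed_first_cycle=13,
    new_fcs=1,
    repeat_fc_rate=0.0,
    escalations_history=[3, 2, 1, 0],
)
result = meets_graduation_criteria(project_17)
print(round(first_run_pass_rate(project_17) * 100), all(result.values()))
```

&lt;p&gt;Returning one boolean per criterion, rather than a single pass/fail, keeps a near miss like an 83% run easy to diagnose.&lt;/p&gt;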

&lt;p&gt;To be precise about what this means and what it doesn't:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it means:&lt;/strong&gt; On the specific runs we executed — FastAPI + SQLAlchemy async + pytest projects with CRUD-level complexity and 1:N foreign key relationships, using Qwen3-Coder-Next 45GB Q4_K_M on Apple Silicon M5 Max 128GB with 77 design rules — the system completed the full project autonomously within the scope of new-file-creation tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it doesn't mean:&lt;/strong&gt; We haven't tested more complex architectural patterns (many-to-many relationships, authentication flows, file uploads, WebSocket endpoints). We haven't tested with different model families or hardware tiers. The 100% figure is for one specific project run; it's a data point, not a guarantee.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;77 rules is a lot of rules.&lt;/strong&gt; Each one was derived from at least one observed problem. But the cumulative load of maintaining 77 interacting rules is substantial. We don't yet know if this scales — whether a 200-rule system would be manageable or would collapse under interaction effects. This matches a concern we are starting to track internally: beyond a certain threshold, adding more constraints may dilute model attention rather than improve output. In our design, we've set a ceiling of 20 CLs per prompt injection bundle to guard against this, but we haven't yet hit a project that tests that limit.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Rule Accumulation Curve
&lt;/h2&gt;

&lt;p&gt;One pattern we've been tracking across the series is how the rate of new rule discovery changes over time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Projects  1–3:   CL-001 to CL-020   (~7 per project)
Projects  4–6:   CL-021 to CL-035   (~5 per project)
Projects  7–9:   CL-036 to CL-051   (~5 per project)
Projects 10–12:  CL-052 to CL-071   (~7 per project)
Projects 13–17:  CL-072 to CL-077   (~1 per project)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The yield dropped from roughly 7 new rules per project to roughly 1. We're cautious about reading too much into this — it could mean we're approaching the boundary of what our current project complexity can reveal, rather than the boundary of what rules exist. More complex projects might expose entirely new failure categories.&lt;/p&gt;

&lt;p&gt;But within the FastAPI + SQLAlchemy + CRUD scope, the flattening is visible in this dataset. The most notable new failure in this stretch was an interaction effect between existing rules — FC-074 — rather than an entirely novel pattern.&lt;/p&gt;
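&lt;p&gt;The per-project yields follow directly from the CL ID ranges in the listing. A small sketch of that arithmetic, assuming each band's rule count is simply last ID minus first ID plus one, and counting projects 13 to 17 as five projects (the 16.5 rerun is not counted separately here):&lt;/p&gt;

```python
# (label, first CL id, last CL id, number of projects), copied from the listing.
bands = [
    ("Projects 1-3", 1, 20, 3),
    ("Projects 4-6", 21, 35, 3),
    ("Projects 7-9", 36, 51, 3),
    ("Projects 10-12", 52, 71, 3),
    ("Projects 13-17", 72, 77, 5),
]


def rules_per_project(first_id: int, last_id: int, projects: int) -> float:
    """Average number of new rules per project within one band."""
    return (last_id - first_id + 1) / projects


rates = {label: rules_per_project(a, b, n) for label, a, b, n in bands}
for label, rate in rates.items():
    print(f"{label}: {rate:.1f} rules per project")
```

&lt;p&gt;The first four bands land between 5 and 7 rules per project, and the last drops to just over 1.&lt;/p&gt;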




&lt;h2&gt;
  
  
  The Interaction Effect Problem
&lt;/h2&gt;

&lt;p&gt;FC-074 taught us something we hadn't articulated before: as the rule set grows, the opportunity for interaction effects between rules increases. Each rule is validated independently, but the system runs them all simultaneously.&lt;/p&gt;

&lt;p&gt;This resembles a familiar problem in complex systems: the number of pairwise interactions grows quadratically with the number of components. With n rules there are n(n-1)/2 possible pairs, so 77 rules already give 2,926. We can't test all combinations manually.&lt;/p&gt;

&lt;p&gt;We don't have a systematic solution for this yet. What we have is a detection mechanism: when a failure occurs that doesn't match any existing FC pattern, we now check whether it could be an interaction between two rules that had both worked in isolation in prior runs. FC-074 was caught this way.&lt;/p&gt;

&lt;p&gt;Whether this can be automated — detecting interaction effects without human diagnosis — is an open question. The engine could potentially track which CLs were active when a novel failure occurs and flag the pairwise candidates, but we haven't built that yet.&lt;/p&gt;
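&lt;p&gt;A rough sketch of what that flagging step might look like. Nothing here exists in the engine: &lt;code&gt;pairwise_candidates&lt;/code&gt; is a hypothetical helper and the list of active CLs is invented for illustration. It also shows why the per-prompt ceiling matters for the size of the search space.&lt;/p&gt;

```python
from itertools import combinations
from math import comb


def pairwise_candidates(active_cls):
    """List every unordered pair of rules that were active during a novel failure."""
    return list(combinations(sorted(active_cls), 2))


# Hypothetical set of rules active when an unmatched failure occurred.
active = ["CL-071", "CL-043", "CL-066", "CL-072"]
candidates = pairwise_candidates(active)
for pair in candidates:
    print(pair)

# Size of the pair space for the full rule set versus one 20-CL prompt bundle.
all_pairs = comb(77, 2)
bundle_pairs = comb(20, 2)
print(all_pairs, bundle_pairs)
```

&lt;p&gt;Restricting the search to rules that were active in the failing run shrinks the candidate list considerably, though a human still has to judge which pair, if any, actually interacted.&lt;/p&gt;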




&lt;h2&gt;
  
  
  What Comes Next
&lt;/h2&gt;

&lt;p&gt;Graduating from the FastAPI stack opens a question: what do we do with a graduated stack?&lt;/p&gt;

&lt;p&gt;We see two directions, each answering a different question:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Direction A: Complexity escalation.&lt;/strong&gt; Stay on FastAPI but increase project complexity — many-to-many relationships, authentication flows, nested resources, pagination. This tests whether the current 77 rules hold at higher complexity or whether new failure categories emerge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Direction B: Stack transfer.&lt;/strong&gt; Move to a different framework and measure how many of the 77 rules transfer. Our rules are categorized by stack tags — 29 are marked "universal," 32 are "fastapi"-specific. A new stack would test whether the universal rules actually are universal.&lt;/p&gt;

&lt;p&gt;The question we're most interested in now isn't whether we can achieve another 100% run. It's whether a rule-based agent system can keep growing without becoming harder to reason about than the model it was designed to constrain.&lt;/p&gt;




&lt;h2&gt;
  
  
  Series Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/josephyeo/i-built-a-local-ai-coding-agent-on-m5-max-128gb-it-failed-164-times-before-passing-35-tests-2fgj"&gt;Part 1: 164 Failures Before 35 Tests&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/josephyeo/we-didnt-migrate-from-n8n-to-python-because-n8n-failed-k9j"&gt;Part 2: We Didn't Migrate from n8n Because n8n Failed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/josephyeo/the-determinism-war-why-we-stopped-chasing-better-models-3c21"&gt;Part 3: The Determinism War&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/josephyeo/the-information-design-gap-why-our-ai-agent-was-coding-blind-4p8o"&gt;Part 4: The Information Design Gap&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/josephyeo/dcr-wasnt-enough-why-ai-coding-agents-also-need-information-quality-1da4"&gt;Part 5: DCR Wasn't Enough&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/josephyeo/the-bug-wasnt-in-the-model-lessons-from-9-local-ai-coding-agent-projects-18aa"&gt;Part 6: The Bug Wasn't in the Model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/josephyeo/the-file-modification-boundary-we-found-after-12-forgeflow-projects-3m01"&gt;Part 7: The File Modification Boundary&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;ForgeFlow runs on a MacBook Pro M5 Max 128GB. Planning uses Claude (cloud API). Execution is fully local — Qwen3-Coder-Next 45GB via Ollama, gemma4:26b for QA, Docker sandbox, no API calls during the coding loop. The methodology and failure data are shared in this series.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you're building something similar — local AI agents, TDD automation, failure catalog systems — I'd be interested to hear whether you're seeing interaction effects between your own accumulated rules. The comments are open.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>architecture</category>
      <category>devjournal</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>The File Modification Boundary We Found After 12 ForgeFlow Projects</title>
      <dc:creator>Joseph Yeo</dc:creator>
      <pubDate>Fri, 22 May 2026 15:00:08 +0000</pubDate>
      <link>https://dev.to/josephyeo/the-file-modification-boundary-we-found-after-12-forgeflow-projects-3m01</link>
      <guid>https://dev.to/josephyeo/the-file-modification-boundary-we-found-after-12-forgeflow-projects-3m01</guid>
      <description>&lt;p&gt;&lt;em&gt;This is Part 7 of the ForgeFlow series. &lt;a href="https://dev.to/josephyeo/the-bug-wasnt-in-the-model-lessons-from-9-local-ai-coding-agent-projects-18aa"&gt;Part 6: The Bug Wasn't in the Model&lt;/a&gt; ended at 9 projects, 51 failure patterns, and 70 design rules. Up until that point, failure rates in our setup were declining and the working framework felt like it was converging. Project 12 exposed a structural gap we hadn't yet documented.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Quick terms for new readers:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FC&lt;/strong&gt; = Failure Catalog entry (a documented failure pattern)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CL&lt;/strong&gt; = Crystallized Lesson (a testable design rule derived from repeated failures)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Identical GREEN&lt;/strong&gt; = the model returns an unchanged file during the implementation phase&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DEADLOCK&lt;/strong&gt; = the system gives up after repeated identical failures&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;




&lt;p&gt;Part 6 ended on a high note. Nine projects. A 100% pass rate on the last one. Forty-three crystallized lessons. A working framework in our setup: DCR × Information Quality × Task Complexity. The system felt like it was converging.&lt;/p&gt;

&lt;p&gt;Then we tried self-referential foreign keys, and a failure mode we'd only seen sporadically became the dominant pattern.&lt;/p&gt;

&lt;p&gt;This post is about project 12 — a department hierarchy API with JWT authentication and self-referential parent-child relationships. It documents the failure pattern that connected several scattered observations into a single engineering constraint. And it discusses why, in our case, the most practical response was to restructure the work rather than retry harder.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup: Department API
&lt;/h2&gt;

&lt;p&gt;Project 12 was designed to test two development vectors simultaneously: JWT authentication (new for ForgeFlow) and self-referential foreign keys (a department can be a child of another department). The tech stack was familiar — FastAPI, SQLAlchemy async, pytest — but the data model was more complex than our previous test projects.&lt;/p&gt;

&lt;p&gt;The target execution plan: 13 tasks total. Of these, 4 were new-file creation tasks (schemas, tests), 5 were existing-file modification tasks (models, routes), and 4 were either setup steps or handled outside the autonomous loop.&lt;/p&gt;

&lt;p&gt;We ran it five times, redesigning between each iteration. The pattern became hard to ignore.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Scorecard
&lt;/h2&gt;

&lt;p&gt;The table below shows the task categories from Project 12. The same outcome repeated across five redesign-and-rerun attempts.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task Type&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;Pass Rate&lt;/th&gt;
&lt;th&gt;Avg Cycles&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;New file creation (schemas, tests)&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Existing file modification (models, routes)&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;DEADLOCK&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In our setup, tasks requiring the generation of an entirely new file succeeded on the first attempt. Tasks that required modifying an existing codebase file resulted in a processing deadlock. This held across five separate runs, two different backends (direct Ollama API and Aider), and multiple retry strategies.&lt;/p&gt;

&lt;p&gt;To scope these findings: our dataset is constrained to a single model family (Qwen3-Coder-Next, 45GB Q4_K_M) running on a single hardware tier (Apple Silicon M5 Max 128GB). We don't claim these trends apply universally. But the pattern was consistent enough across five runs that we changed how we structure tasks going forward.&lt;/p&gt;




&lt;h2&gt;
  
  
  What "Identical GREEN" Looks Like
&lt;/h2&gt;

&lt;p&gt;ForgeFlow's TDD loop works in two phases: RED (write a failing test) and GREEN (write code to pass it). The GREEN phase is where modifications happen.&lt;/p&gt;

&lt;p&gt;When a task required modifying an existing file, the following loop repeated:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The model receives the existing file content + test requirements&lt;/li&gt;
&lt;li&gt;The model outputs code that matches the existing file exactly (detected via SHA-256 hash comparison)&lt;/li&gt;
&lt;li&gt;The engine retries with an explicit prompt: "Your output was identical to the current file"&lt;/li&gt;
&lt;li&gt;The model outputs the same file again&lt;/li&gt;
&lt;li&gt;DEADLOCK after 3 identical cycles&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We call this an &lt;em&gt;identical GREEN&lt;/em&gt; deadlock. The engine already had detection for it (FC-037, added months ago). But we'd only seen it sporadically before. In project 12, it became the primary failure mode.&lt;/p&gt;
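&lt;p&gt;The detection loop described above can be sketched in a few lines. This is an illustration under assumptions, not the engine's implementation: &lt;code&gt;run_green_phase&lt;/code&gt;, &lt;code&gt;echo_model&lt;/code&gt;, and &lt;code&gt;editing_model&lt;/code&gt; are hypothetical names, and the model call is replaced by a plain function so the example runs offline.&lt;/p&gt;

```python
import hashlib
from typing import Callable

MAX_IDENTICAL_CYCLES = 3  # mirrors the "3 identical cycles" limit in the steps above


def sha256_of(text: str) -> str:
    """Hex digest used to compare model output with the current file."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def run_green_phase(existing_file: str, generate: Callable[[str, int], str]) -> str:
    """Call the generator until it changes the file or the cycle limit is hit."""
    baseline = sha256_of(existing_file)
    for attempt in range(MAX_IDENTICAL_CYCLES):
        # The attempt number lets a real backend add the "output was identical"
        # notice to the retry prompt.
        output = generate(existing_file, attempt)
        if sha256_of(output) != baseline:
            return "CHANGED"
    return "DEADLOCK"


def echo_model(existing_file: str, attempt: int) -> str:
    """Stand-in for a model that reproduces the file unchanged."""
    return existing_file


def editing_model(existing_file: str, attempt: int) -> str:
    """Stand-in for a model that appends new code."""
    return existing_file + "parent_id = None\n"


stub = "class Department:\n    id = 1\n"
print(run_green_phase(stub, echo_model), run_green_phase(stub, editing_model))
```

&lt;p&gt;Hash comparison only proves the output is byte-identical; an output that differs by whitespace alone would count as changed in this sketch.&lt;/p&gt;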




&lt;h2&gt;
  
  
  Working Hypotheses
&lt;/h2&gt;

&lt;p&gt;We're cautious about attributing "understanding" to the model — we're observing output patterns, not internal reasoning. Here's what we think might be happening:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The whole-file generation pattern (Ollama backend):&lt;/strong&gt; When generating code via raw completion, the model streams the entire file from the first token. If the existing file is 95% correct and only needs a few lines added, the token history in the context window acts as a statistical attractor — the generation pattern defaults to reproducing the verified, working code rather than deviating to introduce new logic. The smaller the required change relative to the existing file, the stronger this pull appears to be.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The diff generation constraint (Aider backend):&lt;/strong&gt; Diffs require precise line-matching tokens. When the target file is complex — multiple async routes, mixed dependencies, dense imports — generating accurate unified diff chunks appears to become erratic for our local quantized model. In our tests with this specific model and configuration, this manifested as timeouts (capped at 200 seconds per task) or a fallback to emitting an unchanged version of the source file.&lt;/p&gt;

&lt;p&gt;Both pathways showed similar limitations on file modification tasks in our configuration. Whether this is specific to quantized local models or a broader pattern, we can't say.&lt;/p&gt;




&lt;h2&gt;
  
  
  Connecting Scattered Observations
&lt;/h2&gt;

&lt;p&gt;Before project 12, our tracker had three separate failure patterns that each captured a piece of this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FC-034 / CL-043&lt;/strong&gt;: "One task, one endpoint" — adding endpoints to an existing route file often resulted in syntax errors or duplicates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FC-047 / CL-066&lt;/strong&gt;: "Over-complete stubs" — when a stub had significant boilerplate, the model treated it as finished&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FC-039 / CL-058&lt;/strong&gt;: "POST endpoints need Aider" — some tasks specifically failed on the Ollama backend&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Project 12 gave us the data to connect these into a single classification, &lt;strong&gt;FC-052&lt;/strong&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In our local execution setup, existing file modification tasks demonstrate a high probability of identical GREEN DEADLOCK on both whole-file and diff-based backends. In our observations, identical-output failures appeared more often when the required change was small relative to the existing file.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;From FC-052, we derived &lt;strong&gt;CL-071&lt;/strong&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Every autonomous task target should be a new file. If a workflow step must modify an existing file, that modification should either be handled programmatically during setup or the architecture should be decoupled so that features reside in isolated modules.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This became our 71st crystallized lesson, and it changed how we structure ForgeFlow projects.&lt;/p&gt;

&lt;p&gt;One notable data point: across three complete projects (10, 11, and 12), our failure catalog expanded by only a single new entry. The rule accumulation curve is flattening, which may suggest we're mapping the boundary of our current configuration — or just the boundary of our current project complexity.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Design Pattern That Emerged
&lt;/h2&gt;

&lt;p&gt;CL-071 pushed us to rethink how we write PRDs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before (task-level modifications):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TASK-001: Create User model (stub)        → models/user.py
TASK-002: Add fields to User model        → models/user.py    [DEADLOCK]
TASK-003: Create Department model (stub)  → models/department.py
TASK-004: Add relationship                → models/department.py [DEADLOCK]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;After (decoupled new-file generation):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SETUP SCRIPT: Generate complete models with all fields and relationships
TASK-001: Create User schemas    → schemas/user.py      [NEW FILE ✅]
TASK-002: Create Dept schemas    → schemas/department.py [NEW FILE ✅]
TASK-003: Create register route  → routes/auth.py        [NEW FILE ✅]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pattern: infrastructure is established deterministically during setup, while the model handles clean-sheet file generation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An important caveat:&lt;/strong&gt; applying this pattern to project 12 was not a clean autonomous success. We manually implemented the CRUD endpoints (6 routes) to unblock the dependency chain, then tested whether the remaining new-file task would run cleanly under the revised structure. The integration test — creating a fresh &lt;code&gt;test_integration.py&lt;/code&gt; — passed on its first autonomous cycle. The important result was narrower than "we solved it": once existing-file modification was removed from the autonomous task path, the remaining new-file task completed cleanly.&lt;/p&gt;

&lt;p&gt;We should also note an open concern: forcing every task into a "new file only" pattern shifts complexity from generation-time editing to project-level file organization. At 13 tasks, this is manageable. At 50+, it could create significant file fragmentation and import overhead. We haven't tested at that scale yet.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where We Are After 12 Projects
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total projects&lt;/td&gt;
&lt;td&gt;12 (11 completed, 1 scrapped)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failure patterns cataloged (FC)&lt;/td&gt;
&lt;td&gt;52&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Design rules (CL)&lt;/td&gt;
&lt;td&gt;71&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Automated rule checks&lt;/td&gt;
&lt;td&gt;53 functions in validate_prd.py&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sessions&lt;/td&gt;
&lt;td&gt;81&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Honest Assessment
&lt;/h2&gt;

&lt;p&gt;After 12 projects and 81 sessions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's working in our setup:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New file generation from detailed specs: reliable across the runs we tested&lt;/li&gt;
&lt;li&gt;TDD enforcement (RED must fail, GREEN must pass): useful as a mechanical guardrail&lt;/li&gt;
&lt;li&gt;Failure pattern → design rule pipeline: producing diminishing but real returns&lt;/li&gt;
&lt;li&gt;Setup-based infrastructure + model-based creation: tested over 3 projects&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What isn't working:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Existing file modification: consistently unreliable with our current model and configuration&lt;/li&gt;
&lt;li&gt;Non-deterministic results on complex tasks: one task passed in 2 out of 3 runs, failed in 1. Same code, same model, different outcome.&lt;/li&gt;
&lt;li&gt;Long dependency chains: a single DEADLOCK blocks everything downstream&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Open questions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does CL-071 hold on 20+ task projects with complex dependency graphs?&lt;/li&gt;
&lt;li&gt;Does the "new file only" constraint create unsustainable file fragmentation at scale?&lt;/li&gt;
&lt;li&gt;Will newer local models (Qwen3-Coder v2, Llama 4) shift this boundary?&lt;/li&gt;
&lt;li&gt;Is this specific to quantized local models, or do cloud API models show similar patterns on file modification tasks inside TDD loops?&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  A Request to Readers
&lt;/h2&gt;

&lt;p&gt;If you're running local models — Ollama, llama.cpp, vLLM, or something else — within autonomous execution loops, we'd be interested in learning whether your telemetry shows similar variations between file creation and file modification tasks.&lt;/p&gt;

&lt;p&gt;Specifically: &lt;strong&gt;how do your local configurations handle incremental diff generation inside structured loops versus generating complete, fresh modules from detailed specs?&lt;/strong&gt; If you've logged similar boundaries or found alternative designs to work around modification deadlocks, please share your setup and observations in the comments.&lt;/p&gt;

&lt;p&gt;We're also curious whether anyone has hard metrics on how cloud models (GPT, Claude) perform on targeted file modifications inside closed-loop TDD environments. Our dataset is one model family on one hardware tier — more data points from different setups would help everyone working in this space.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Project 13 will be the first real test of whether CL-071 is a design principle or just a project-12-specific workaround. Every implementation task will target a new file. Setup will handle all infrastructure. The open question isn't whether it passes — it's whether the "new file only" constraint produces a project structure that's actually maintainable at 20+ tasks.&lt;/p&gt;

&lt;p&gt;We're also adding automatic CL-071 validation to &lt;code&gt;validate_prd.py&lt;/code&gt; — a check that flags any task whose implementation target already exists at execution time. For our workflow, rules that repeatedly affect outcomes should probably be machine-enforced.&lt;/p&gt;
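&lt;p&gt;A minimal sketch of what such a check could look like. This is not the actual &lt;code&gt;validate_prd.py&lt;/code&gt; code: the function name &lt;code&gt;check_new_file_targets&lt;/code&gt; and the &lt;code&gt;(task_id, target_path)&lt;/code&gt; input shape are hypothetical. It also flags a target claimed by two tasks in the same PRD, which is an added assumption meant to catch the "create stub, then modify it" pattern before anything exists on disk.&lt;/p&gt;

```python
from pathlib import Path


def check_new_file_targets(tasks, project_root):
    """Return CL-071 violations for a list of (task_id, target_path) pairs.

    A violation is reported when a target already exists on disk, or when
    an earlier task in the same PRD already claimed the same target.
    """
    root = Path(project_root)
    claimed = {}      # target path -> task id that first claimed it
    violations = []
    for task_id, target in tasks:
        if target in claimed:
            violations.append(
                f"{task_id}: {target} is already targeted by {claimed[target]}"
            )
            continue
        claimed[target] = task_id
        if (root / target).exists():
            violations.append(f"{task_id}: {target} already exists on disk")
    return violations
```

&lt;p&gt;An empty list means every task targets a new file. Run against the "Before" task list shown earlier, a check like this would report TASK-002 and TASK-004 because they reuse the targets of TASK-001 and TASK-003.&lt;/p&gt;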




&lt;h2&gt;
  
  
  The Series So Far
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://dev.to/josephyeo/i-built-a-local-ai-coding-agent-on-m5-max-128gb-it-failed-164-times-before-passing-35-tests-2fgj"&gt;I Built a Local AI Coding Agent on M5 Max 128GB&lt;/a&gt; — 164 failures, 35 tests, proof of concept&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/josephyeo/we-didnt-migrate-from-n8n-to-python-because-n8n-failed-k9j"&gt;We Didn't Migrate from n8n to Python Because n8n Failed&lt;/a&gt; — The orchestrator rewrite&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/josephyeo/the-determinism-war-why-we-stopped-chasing-better-models-3c21"&gt;The Determinism War&lt;/a&gt; — Why we stopped chasing better models&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/josephyeo/the-information-design-gap-why-our-ai-agent-was-coding-blind-4p8o"&gt;The Information Design Gap&lt;/a&gt; — Why the agent was coding blind&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/josephyeo/dcr-wasnt-enough-why-ai-coding-agents-also-need-information-quality-1da4"&gt;DCR Wasn't Enough&lt;/a&gt; — Adding information quality to the framework&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/josephyeo/the-bug-wasnt-in-the-model-lessons-from-9-local-ai-coding-agent-projects-18aa"&gt;The Bug Wasn't in the Model&lt;/a&gt; — Lessons from 9 projects&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The File Modification Boundary&lt;/strong&gt; — You are here. 12 projects, a boundary mapped.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  About
&lt;/h2&gt;

&lt;p&gt;I'm Joseph YEO, a solo builder from Seoul, Korea. ForgeFlow runs entirely on a MacBook Pro M5 Max 128GB — no cloud APIs during execution. The planning agent (Claude) designs the specs. The local model (Qwen3-Coder-Next, 45GB Q4_K_M) executes the TDD loop autonomously.&lt;/p&gt;

&lt;p&gt;Follow along:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;𝕏: &lt;a href="https://x.com/josephyeo_dev" rel="noopener noreferrer"&gt;@josephyeo_dev&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/joseph-yeo" rel="noopener noreferrer"&gt;joseph-yeo&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Site: &lt;a href="https://projectjoseph.dev" rel="noopener noreferrer"&gt;projectjoseph.dev&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Built over 81 sessions, May 2026. All models run locally via Ollama 0.23.0 on macOS. No cloud APIs were used during autonomous execution.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This post was drafted with Claude and edited by me.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>The Bug Wasn't in the Model: Lessons from 9 Local AI Coding Agent Projects</title>
      <dc:creator>Joseph Yeo</dc:creator>
      <pubDate>Sun, 17 May 2026 14:11:48 +0000</pubDate>
      <link>https://dev.to/josephyeo/the-bug-wasnt-in-the-model-lessons-from-9-local-ai-coding-agent-projects-18aa</link>
      <guid>https://dev.to/josephyeo/the-bug-wasnt-in-the-model-lessons-from-9-local-ai-coding-agent-projects-18aa</guid>
      <description>&lt;p&gt;&lt;em&gt;This is Part 6 of the ForgeFlow series. &lt;a href="https://dev.to/josephyeo/dcr-wasnt-enough-why-ai-coding-agents-also-need-information-quality"&gt;Part 5: DCR Wasn't Enough&lt;/a&gt; introduced the two-axis model: DCR × Information Quality. We ended Part 5 at three projects and a 29% pass rate. Here's how we reached 100% on a controlled project — and why we needed a third axis.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Part 5 ended with a framework and a question.&lt;/p&gt;

&lt;p&gt;The framework said: &lt;em&gt;System Reliability ≈ DCR × Information Quality.&lt;/em&gt; The question was whether that would actually hold up as we kept running projects.&lt;/p&gt;

&lt;p&gt;We ran six more. Same model. Same hardware. No cloud APIs during execution. By project nine, the autonomous pass rate hit &lt;strong&gt;100% on that specific project&lt;/strong&gt; — eight tasks, thirty-one tests, four minutes, zero manual intervention.&lt;/p&gt;

&lt;p&gt;This post is about the path from 29% to 100%, and the third variable we didn't expect to find.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Scoreboard
&lt;/h2&gt;

&lt;p&gt;Here's the full longitudinal data. Nine projects, same 45GB local model, same hardware throughout:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Project&lt;/th&gt;
&lt;th&gt;Pass Rate&lt;/th&gt;
&lt;th&gt;CL Rules&lt;/th&gt;
&lt;th&gt;Key Change&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;repo-jwt&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;No design rules existed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;todo-api&lt;/td&gt;
&lt;td&gt;67%&lt;/td&gt;
&lt;td&gt;~10&lt;/td&gt;
&lt;td&gt;Context files added&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;bookmark-api&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;~20&lt;/td&gt;
&lt;td&gt;Full information pipeline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;expense-tracker&lt;/td&gt;
&lt;td&gt;70%&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;New failure patterns emerged&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;rating-api&lt;/td&gt;
&lt;td&gt;73%&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;DB fixture issues&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;library-api&lt;/td&gt;
&lt;td&gt;0% → scrapped&lt;/td&gt;
&lt;td&gt;35&lt;/td&gt;
&lt;td&gt;Fundamental architecture gap*&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;event-api&lt;/td&gt;
&lt;td&gt;80%&lt;/td&gt;
&lt;td&gt;35&lt;/td&gt;
&lt;td&gt;Setup script pattern validated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;habit-tracker&lt;/td&gt;
&lt;td&gt;44%&lt;/td&gt;
&lt;td&gt;39&lt;/td&gt;
&lt;td&gt;Route tasks collapsed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;contact-book&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;43&lt;/td&gt;
&lt;td&gt;All axes aligned&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The 100% figure refers to Project 9 only, not the aggregate across all nine projects.&lt;/strong&gt; We used it as a controlled checkpoint: after fixing the route-task failure pattern from Project 8, could the same local model complete a comparable route-heavy project without intervention?&lt;/p&gt;

&lt;p&gt;&lt;em&gt;*Project 6 was scrapped because it required an architectural paradigm change (multi-model foreign-key setup scripts) that our orchestrator couldn't state-track at the time. Rather than polluting the loop data with a mismatched setup, we halted execution to redesign our baseline infrastructure scripts. The lessons from that failure directly produced CL-036 through CL-039.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The trajectory isn't a clean line upward. Project 3 hit 100%, then projects 4 through 8 dropped back. Each drop exposed a new category of failure that our rules didn't cover yet.&lt;/p&gt;

&lt;p&gt;This is the pattern that mattered most to us: &lt;strong&gt;in these nine bounded projects, every failure we investigated had a concrete system-level fix. We did not find a case where replacing the model was the only plausible remedy.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Crystallization Loop
&lt;/h2&gt;

&lt;p&gt;Each project failure produced what we call a "Crystallized Lesson" (CL) — a concrete, testable rule that prevents that specific failure from recurring. Not a vague principle. A rule precise enough that code could check it.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CL-005:&lt;/strong&gt; Infrastructure files (conftest.py, database.py) must never appear in a task's target files. &lt;em&gt;Origin: Project 3, where the model kept overwriting shared fixtures.&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CL-034:&lt;/strong&gt; DateTime fields with SQLAlchemy's &lt;code&gt;default=&lt;/code&gt; must be set in Python's &lt;code&gt;__init__&lt;/code&gt;, not relied upon at DB insert time. &lt;em&gt;Origin: Project 5, where unit tests failed because &lt;code&gt;created_at&lt;/code&gt; was None before flush.&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CL-043:&lt;/strong&gt; When adding an endpoint to an existing route file, each task must contain exactly one endpoint. &lt;em&gt;Origin: Project 8, where multi-endpoint tasks caused the model to time out trying to understand the existing code.&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By project 9, we had 43 of these rules. They're not guidelines — they're checkable constraints on the PRD document that feeds the model. We call the document that holds them the &lt;em&gt;PRD Design Checklist&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Here's what the accumulation looked like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Projects 1-3:  CL-001 to CL-020  (~7 per project)
Projects 4-6:  CL-021 to CL-035  (~5 per project)
Projects 7-8:  CL-036 to CL-043  (~4 per project)
Project 9:     0 new CLs needed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The rate of new rules slowed, but the &lt;em&gt;depth&lt;/em&gt; increased. Early rules were about file placement ("where does conftest.py go?"). Later rules were about engine-level behavior ("how does the correction system handle idempotency?").&lt;/p&gt;




&lt;h2&gt;
  
  
  What Project 8 Broke
&lt;/h2&gt;

&lt;p&gt;Project 8 (habit-tracker-api) is worth examining because it's where the two-axis model from Part 5 stopped being sufficient.&lt;/p&gt;

&lt;p&gt;The project had nine tasks. The first four — model and schema creation — passed autonomously in one cycle each. Then the route tasks (5 through 9) collapsed. Zero of five passed.&lt;/p&gt;

&lt;p&gt;The failures fell into four categories:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A pytest configuration warning was being captured as a failure signature.&lt;/strong&gt; The code was correct, but the orchestrator classified it as broken.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A string-replacement correction was applied twice.&lt;/strong&gt; &lt;code&gt;client.post(&lt;/code&gt; → &lt;code&gt;await client.post(&lt;/code&gt; was also applied to lines that already had &lt;code&gt;await&lt;/code&gt;, producing &lt;code&gt;await await client.post(&lt;/code&gt; — a syntax error.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A schema class was never generated&lt;/strong&gt; because no test existed for it. The model only builds what it's tested for. No test, no code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tasks that modified existing files timed out&lt;/strong&gt; because the model needed too long to understand the accumulated code.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Notice: none of these are information-quality (IQ) problems in the Part 4/5 sense. The model had all the information it needed. The PRD was well-designed by the standards we had at the time. The failures came from &lt;strong&gt;the engine itself&lt;/strong&gt;: the orchestrator's correction logic, its gate system, its timeout handling.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Third Variable: Engine Quality
&lt;/h2&gt;

&lt;p&gt;This forced us to extend the two-axis model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;I don't mean this as a measured mathematical product yet, but as a diagnostic model: if any axis collapses, the whole loop collapses.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or to put it less formally: you can have a perfectly designed PRD and a well-informed model, and still fail because the orchestrator has bugs.&lt;/p&gt;

&lt;p&gt;By engine quality, we mean whether the orchestrator preserves the intended semantics of the execution loop: phase isolation (RED writes only tests, GREEN writes only implementation), retry correctness (rollbacks don't destroy infrastructure state), deterministic correction safety (rewrites don't corrupt already-correct code), timeout policy, and commit boundaries.&lt;/p&gt;

&lt;p&gt;In our case, the concrete fixes were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The correction engine's idempotency.&lt;/strong&gt; Our string-replacement system applied corrections blindly, so lines that already read &lt;code&gt;await client.post(&lt;/code&gt; were rewritten to &lt;code&gt;await await client.post(&lt;/code&gt;. The fix was a line-level guard: if the replacement text already exists on a line, skip it. (We're aware this is a limitation of primitive string matching. An AST-based mutation engine using something like LibCST would likely eliminate this class of errors. That's on our roadmap but hasn't been necessary yet at our current project complexity.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The RED phase scope.&lt;/strong&gt; If the model outputs an implementation file during the test-writing phase and the orchestrator writes it to disk, the test passes immediately — and the TDD cycle breaks. The fix was restricting the RED phase's file-write scope to test files only.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Router registration resilience.&lt;/strong&gt; If &lt;code&gt;git reset --hard&lt;/code&gt; during a retry also reverts infrastructure changes made by the orchestrator's auto-registration system, the next cycle starts with a broken setup. The fix was committing router registration during the initial setup script, not inside the TDD cycle.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
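&lt;p&gt;The line-level guard from the first fix can be sketched in a few lines. This is an illustration under assumptions, not the code in the engine: &lt;code&gt;apply_correction&lt;/code&gt; and its parameters are hypothetical names.&lt;/p&gt;

```python
def apply_correction(source, old, new):
    """Replace old with new line by line, skipping lines already corrected."""
    corrected = []
    for line in source.splitlines(keepends=True):
        if new in line:
            # Guard: the replacement text is already present, leave the line alone.
            corrected.append(line)
        else:
            corrected.append(line.replace(old, new))
    return "".join(corrected)
```

&lt;p&gt;Applying the same rule twice now yields the same text as applying it once. The trade-off of a guard this coarse is that a line containing both a corrected and an uncorrected call is skipped entirely, which is one reason an AST-based approach is attractive.&lt;/p&gt;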

&lt;p&gt;We fixed these with three targeted engine patches, each with its own test suite (16 tests total). After the fixes, project 9 ran eight tasks with zero failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The key insight for us:&lt;/strong&gt; PRD quality and engine quality appeared to be independent variables. Improving one didn't fix the other. Project 8's 44% pass rate wasn't a PRD problem — it was an engine problem that looked like a PRD problem until we traced each failure to its root cause.&lt;/p&gt;




&lt;h2&gt;
  
  
  What 100% Actually Looked Like
&lt;/h2&gt;

&lt;p&gt;Project 9 was a contact book API with search. Single model (no foreign keys), six CRUD endpoints plus a search-by-query-param feature. We chose it deliberately to test route-task decomposition — the exact pattern that failed in project 8.&lt;/p&gt;

&lt;p&gt;The numbers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tasks&lt;/td&gt;
&lt;td&gt;8 (model, schemas, 6 endpoints)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total cycles&lt;/td&gt;
&lt;td&gt;8 (every task passed first try)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total tokens&lt;/td&gt;
&lt;td&gt;9,042 (Ollama-reported generated tokens; prompts excluded)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total time&lt;/td&gt;
&lt;td&gt;~4 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tests generated&lt;/td&gt;
&lt;td&gt;31&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Manual intervention&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud API cost&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each task followed the same loop: Ollama writes a failing test → Ollama writes minimum implementation → deterministic corrections applied → pytest runs → commit if green.&lt;/p&gt;
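&lt;p&gt;As a rough sketch, that loop has the following shape. Every callable here (&lt;code&gt;write_test&lt;/code&gt;, &lt;code&gt;write_impl&lt;/code&gt;, &lt;code&gt;apply_corrections&lt;/code&gt;, &lt;code&gt;run_tests&lt;/code&gt;, &lt;code&gt;commit&lt;/code&gt;) is a hypothetical stand-in injected by the caller, and &lt;code&gt;max_cycles=3&lt;/code&gt; is an arbitrary placeholder. Rollback between retries is left out for brevity.&lt;/p&gt;

```python
def run_task(task, write_test, write_impl, apply_corrections, run_tests, commit,
             max_cycles=3):
    """Run one task through RED then GREEN; return the cycle that passed, or None."""
    for cycle in range(1, max_cycles + 1):
        write_test(task)              # RED: the model writes a test
        if run_tests(task):
            continue                  # RED must fail; a passing test proves nothing
        write_impl(task)              # GREEN: the model writes the implementation
        apply_corrections(task)       # deterministic rewrites before running pytest
        if run_tests(task):
            commit(task)              # commit only when the suite is green
            return cycle
    return None
```

&lt;p&gt;A return value of 1 corresponds to "passed first try" in the table above; &lt;code&gt;None&lt;/code&gt; corresponds to a deadlock.&lt;/p&gt;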

&lt;p&gt;The route tasks that had failed repeatedly in project 8 now passed in single cycles. The differences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each task added exactly one endpoint (CL-043)&lt;/li&gt;
&lt;li&gt;All schema classes had test coverage (CL-042)&lt;/li&gt;
&lt;li&gt;The asyncio configuration was pre-set in the setup script (CL-040)&lt;/li&gt;
&lt;li&gt;Trailing-slash corrections were applied deterministically (new)&lt;/li&gt;
&lt;li&gt;Router registration was committed during setup, not during the TDD cycle (new)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these changes required a better model. The model was the same 45GB Qwen3 that produced 0% on project 1.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why 100% Doesn't Mean "Solved"
&lt;/h2&gt;

&lt;p&gt;I want to be careful here.&lt;/p&gt;

&lt;p&gt;100% on a contact book API doesn't mean ForgeFlow can build anything. A contact book API is an architectural sandbox. The project was deliberately chosen to isolate the route-task failure pattern. It had no foreign keys, no authentication, no file uploads. Each endpoint was independent. The success here suggests that our execution loop is stable under these constrained conditions, not that it can refactor a legacy microservice architecture.&lt;/p&gt;

&lt;p&gt;The real test is whether the next project — something with two related models and foreign-key relationships — maintains a high pass rate. We don't know yet.&lt;/p&gt;

&lt;p&gt;What we do know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The crystallization loop works.&lt;/strong&gt; Each failure produces a rule. Rules accumulate. The same failure hasn't recurred in subsequent comparable projects.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engine fixes matter as much as PRD fixes.&lt;/strong&gt; Three engine patches in one session unblocked a project that no amount of PRD improvement would have fixed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The three-variable model explains our data better than two.&lt;/strong&gt; Projects 4-8 had good PRDs but engine bugs. The two-axis model couldn't explain those drops. The three-variable model can.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Failure Catalog
&lt;/h2&gt;

&lt;p&gt;Across nine projects, we cataloged 19 distinct failure patterns. Every one was eventually addressed — either through a PRD design rule, an engine fix, or a setup script change.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;Resolution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PRD design gap&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;CL rules in checklist&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engine bug&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;forgeflow.py patches + tests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Infrastructure/setup&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Setup script standardization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Timeout/performance&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;aider_timeout configuration&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A few examples from the catalog:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;FC-015: Non-idempotent correction rule.&lt;/strong&gt; Symptom: deterministic correction produced &lt;code&gt;await await client.post(...)&lt;/code&gt;. Root cause: the string-replacement rule did not check whether the line was already corrected. Fix: line-level idempotency guard — if the replacement text already exists on a given line, skip the replacement.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;FC-018: RED phase implementation leakage.&lt;/strong&gt; Symptom: tests passed immediately during RED because the implementation file was also written to disk. Root cause: the orchestrator's file-write scope included both test and implementation files during the test-writing phase. Fix: restrict RED phase scope to test files only; reject or quarantine non-test files during RED.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;FC-019a: Router registration lost across retry.&lt;/strong&gt; Symptom: every retry started from a broken app state (404 on all endpoints) after &lt;code&gt;git reset --hard&lt;/code&gt;. Root cause: the orchestrator's auto-registration system added router imports to &lt;code&gt;main.py&lt;/code&gt; during the TDD cycle, but &lt;code&gt;git reset&lt;/code&gt; reverted those changes. Fix: commit router registration during the initial setup script, before the TDD loop begins.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
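&lt;p&gt;To make the FC-018 fix concrete, here is a small sketch of a RED-phase write filter. The function names are hypothetical, and the test-file rule is an assumption borrowed from pytest's default file naming (&lt;code&gt;test_*.py&lt;/code&gt; or &lt;code&gt;*_test.py&lt;/code&gt;); a real orchestrator might use an explicit allow-list instead.&lt;/p&gt;

```python
from pathlib import PurePosixPath


def is_test_file(path):
    """True when the file name follows pytest's default test-file naming."""
    name = PurePosixPath(path).name
    return name.endswith(".py") and (
        name.startswith("test_") or name.endswith("_test.py")
    )


def split_red_phase_writes(proposed_paths):
    """Split model-proposed writes into (allowed, quarantined) for the RED phase."""
    allowed, quarantined = [], []
    for path in proposed_paths:
        (allowed if is_test_file(path) else quarantined).append(path)
    return allowed, quarantined
```

&lt;p&gt;Under this rule an implementation file emitted during RED is quarantined rather than written, so the new test cannot pass by accident. Shared files such as &lt;code&gt;conftest.py&lt;/code&gt; are quarantined too, which lines up with CL-005.&lt;/p&gt;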

&lt;p&gt;Under our postmortem classification criteria, none of the 19 cataloged failures were classified as pure model-capability failures — cases where the model lacked the syntax or logic ability to solve the task. Every failure traced back to something in the system around the model: missing information, incorrect scaffolding, or engine bugs.&lt;/p&gt;

&lt;p&gt;This doesn't mean model capability doesn't matter. A stronger model would probably tolerate worse PRDs and buggier engines. But in our limited experience, fixing the system was always cheaper and more permanent than hoping for a smarter model.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Comes Next
&lt;/h2&gt;

&lt;p&gt;We're designing a diagnostic pipeline that applies the failure catalog automatically. The idea: when a task deadlocks, the engine checks the failure catalog for a matching pattern before giving up.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[DEADLOCK DETECTED]
        │
        ▼
[Pattern Match: Failure Catalog]
        ├──► Match Found ──► Apply Fix (Deterministic) ──► Retry
        └──► No Match   ──► Local LLM Diagnosis (Stage 2)
                                   └──► Fails ──► Human Escalation (Stage 3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Stage 1 is pure pattern matching — deterministic, no LLM needed. Stage 2 would use the local model to diagnose novel failures. Stage 3 remains human review.&lt;/p&gt;
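&lt;p&gt;Since the pipeline is still being designed, the following is only a sketch of how the three stages could fit together. The &lt;code&gt;CatalogEntry&lt;/code&gt; structure, the &lt;code&gt;triage&lt;/code&gt; function, the fix names, and the regexes are all assumptions made for illustration; real catalog entries would need more specific signatures than these.&lt;/p&gt;

```python
import re
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    fc_id: str     # failure catalog id
    pattern: str   # regex matched against the captured failure output
    fix: str       # name of a deterministic fix the engine knows how to apply


# Illustrative entries only; the regexes are guesses at what such logs contain.
CATALOG = [
    CatalogEntry("FC-015", r"await\s+await\s", "idempotent_correction_guard"),
    CatalogEntry("FC-019a", r"\b404\b", "commit_router_registration_in_setup"),
]


def triage(failure_log, catalog=CATALOG, llm_diagnose=None):
    """Return (stage, detail) for a deadlocked task."""
    for entry in catalog:                       # Stage 1: deterministic match
        if re.search(entry.pattern, failure_log):
            return 1, entry.fix
    if llm_diagnose is not None:                # Stage 2: local LLM diagnosis
        diagnosis = llm_diagnose(failure_log)
        if diagnosis:
            return 2, diagnosis
    return 3, "escalate to human review"        # Stage 3
```

&lt;p&gt;Keeping Stage 1 as a plain loop over regexes means a known failure never costs a model call, and adding a catalog entry is a data change rather than a code change.&lt;/p&gt;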

&lt;p&gt;The goal isn't to eliminate human involvement entirely. It's to ensure that &lt;strong&gt;each human intervention produces a rule that prevents the same intervention next time.&lt;/strong&gt; The system should get cheaper to operate with every project it runs.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Thesis, Updated
&lt;/h2&gt;

&lt;p&gt;Part 3: &lt;em&gt;"The bottleneck is not model capability, but the verifiability of specifications."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Part 4: &lt;em&gt;"Even after verifiability is constructed, the bottleneck shifts to information delivery."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Part 5: &lt;em&gt;"An AI coding agent's reliability is a product of its deterministic coverage and its information quality."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now, the working version after nine projects:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;"In our experience, an AI coding agent's reliability is bounded by three independent variables: the determinism of its scaffolding, the quality of information it receives, and the correctness of its own engine. Improving any two without the third produced a system that failed in ways that looked like model limitations but weren't."&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The practical diagnostic is now threefold: measure your deterministic coverage, inspect your information quality, and test the engine itself. Fix the axis that's actually broken. In our nine projects, that diagnostic kept pointing to the system, not the model.&lt;/p&gt;

&lt;p&gt;Whether that pattern holds at higher complexity is something we're still finding out.&lt;/p&gt;




&lt;h2&gt;
  
  
  About
&lt;/h2&gt;

&lt;p&gt;I'm Joseph YEO, building ForgeFlow from Seoul, Korea — a local AI coding agent that runs entirely on Apple Silicon, no cloud inference during execution.&lt;/p&gt;

&lt;p&gt;What's your experience with orchestrator-level bugs masquerading as model limitations? Have you seen cases where the system around the model was the actual bottleneck? I'd love to compare notes.&lt;/p&gt;

&lt;p&gt;Follow along:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;𝕏: &lt;a href="https://x.com/josephyeo_dev" rel="noopener noreferrer"&gt;@josephyeo_dev&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/joseph-yeo" rel="noopener noreferrer"&gt;joseph-yeo&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Site: &lt;a href="https://projectjoseph.dev/" rel="noopener noreferrer"&gt;projectjoseph.dev&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Previous parts: &lt;a href="https://dev.to/josephyeo/i-built-a-local-ai-coding-agent-on-m5-max-128gb-it-failed-164-times-before-passing-35-tests-2fgj"&gt;Part 1: 164 Failures&lt;/a&gt; · &lt;a href="https://dev.to/josephyeo/we-didnt-migrate-from-n8n-to-python-because-n8n-failed-k9j"&gt;Part 2: n8n to Python&lt;/a&gt; · &lt;a href="https://dev.to/josephyeo/the-determinism-war-why-we-stopped-chasing-better-models-3c21"&gt;Part 3: The Determinism War&lt;/a&gt; · &lt;a href="https://dev.to/josephyeo/the-information-design-gap-why-our-ai-agent-was-coding-blind-4p8o"&gt;Part 4: The Information Design Gap&lt;/a&gt; · Part 5: DCR Wasn't Enough&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;9 projects. 43 rules. 19 failure patterns. 48 development sessions. Same 45GB model throughout. All models run locally via Ollama 0.23.0 on Apple Silicon M5 Max 128GB. No cloud APIs were used during autonomous execution.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This post was drafted with Claude and edited by me.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>automation</category>
      <category>programming</category>
    </item>
    <item>
      <title>DCR Wasn't Enough: Why AI Coding Agents Also Need Information Quality</title>
      <dc:creator>Joseph Yeo</dc:creator>
      <pubDate>Thu, 14 May 2026 14:02:17 +0000</pubDate>
      <link>https://dev.to/josephyeo/dcr-wasnt-enough-why-ai-coding-agents-also-need-information-quality-1da4</link>
      <guid>https://dev.to/josephyeo/dcr-wasnt-enough-why-ai-coding-agents-also-need-information-quality-1da4</guid>
      <description>&lt;p&gt;&lt;em&gt;This is Part 5 of the ForgeFlow series. &lt;a href="https://dev.to/josephyeo/the-determinism-war-why-we-stopped-chasing-better-models-3c21"&gt;Part 3: The Determinism War&lt;/a&gt; introduced DCR. &lt;a href="https://dev.to/josephyeo/the-information-design-gap-why-our-ai-agent-was-coding-blind-4p8o"&gt;Part 4: The Information Design Gap&lt;/a&gt; showed how information delivery moved our pass rate from 0% to 67%.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;We thought we had the answer.&lt;/p&gt;

&lt;p&gt;In Part 4, we showed that fixing our information pipeline — zero code changes, just better PRD design — moved ForgeFlow's autonomous pass rate from 0% to 67%. We thought the lesson was clear: give the model enough context and it delivers.&lt;/p&gt;

&lt;p&gt;Then we ran a third project. Same model. Same engine. Same careful PRD design. The pass rate dropped to &lt;strong&gt;29%.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This post is about why that happened and what it taught us about measuring AI coding agents.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Went Wrong on Project C
&lt;/h2&gt;

&lt;p&gt;Project C was a bookmark API with many-to-many relationships — more complex than a simple CRUD, but not wildly different in structure. We applied everything we learned from Project B: explicit context files per task, detailed descriptions, proper test scenario format.&lt;/p&gt;

&lt;p&gt;The PRD had sixteen tasks. In the first seven executed tasks, only two passed autonomously — &lt;strong&gt;2 / 7, or 29%&lt;/strong&gt;. The other five required manual intervention.&lt;/p&gt;

&lt;p&gt;The failures weren't the same as Project A's. In Project A, the model hallucinated imports and invented fixtures because it couldn't see the project. In Project C, it could see the project — but it kept hitting &lt;strong&gt;runtime patterns our prompt pipeline had not exposed clearly enough:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pydantic's &lt;code&gt;HttpUrl&lt;/code&gt; adds a trailing slash that breaks equality checks&lt;/li&gt;
&lt;li&gt;Under a FastAPI router prefix, declaring a route path as &lt;code&gt;"/"&lt;/code&gt; instead of &lt;code&gt;""&lt;/code&gt; causes 307 redirects for requests that omit the trailing slash&lt;/li&gt;
&lt;li&gt;SQLAlchemy async many-to-many relationships trigger &lt;code&gt;MissingGreenlet&lt;/code&gt; errors&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;create_async_engine&lt;/code&gt; lives in &lt;code&gt;sqlalchemy.ext.asyncio&lt;/code&gt;, not &lt;code&gt;sqlalchemy&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I would not classify these primarily as intelligence failures. They looked more like &lt;strong&gt;behavioral knowledge gaps&lt;/strong&gt; — framework-specific quirks that context files alone didn't expose.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Context Files Weren't Enough
&lt;/h2&gt;

&lt;p&gt;This forced us to confront a limitation in our Part 4 approach.&lt;/p&gt;

&lt;p&gt;Context files are &lt;strong&gt;static&lt;/strong&gt;. They're defined when you write the PRD — before any code exists. By TASK-007, the project has files that weren't there when the PRD was written. The model can't see them unless someone manually updates the list.&lt;/p&gt;

&lt;p&gt;For example, TASK-007 needed to create a route that depended on models and schemas generated by TASK-003 and TASK-005. But those files didn't exist when the PRD was written. The context file list was correct at design time — and stale by execution time.&lt;/p&gt;

&lt;p&gt;And even when context files are current, they show the model &lt;em&gt;what code exists&lt;/em&gt;. They don't teach the model &lt;em&gt;how the runtime behaves&lt;/em&gt;. No amount of reading &lt;code&gt;bookmark.py&lt;/code&gt; will tell you that &lt;code&gt;bookmark.tags.append(tag)&lt;/code&gt; triggers a synchronous database call inside an async context.&lt;/p&gt;

&lt;p&gt;We realized we needed two different kinds of information, not one:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;What it provides&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Structural information&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;What files exist, what they export, how they relate&lt;/td&gt;
&lt;td&gt;Context files, repo maps&lt;/td&gt;
&lt;td&gt;"BookmarkCreate has a field &lt;code&gt;url: HttpUrl&lt;/code&gt;"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Behavioral knowledge&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How the runtime actually works, framework quirks, patterns to avoid&lt;/td&gt;
&lt;td&gt;Accumulated experience, failure analysis&lt;/td&gt;
&lt;td&gt;"HttpUrl adds a trailing slash — use &lt;code&gt;str(url).rstrip('/')&lt;/code&gt; for comparison"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Project B's success came from fixing structural information. Project C's failures seemed to come largely from missing behavioral knowledge. Both feed into what the model needs, but they come from different places and accumulate differently.&lt;/p&gt;
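&lt;p&gt;As a small illustration of what one piece of behavioral knowledge looks like once it is written down, here is a sketch of the trailing-slash comparison from the table. The helper name &lt;code&gt;same_url&lt;/code&gt; and the example URLs are assumptions made for this sketch, not ForgeFlow code, and it uses plain strings so it runs without Pydantic installed.&lt;/p&gt;

```python
def same_url(left, right):
    """Compare two URL-like values while ignoring trailing slashes.

    Accepts plain strings or any object whose str() is the URL, which is
    how the rstrip('/') workaround from the table is applied.
    """
    return str(left).rstrip("/") == str(right).rstrip("/")


stored = "https://example.com/"    # what a normalizing URL type may hand back
submitted = "https://example.com"  # what the client originally sent

print(stored == submitted)          # False: naive equality breaks
print(same_url(stored, submitted))  # True
```

&lt;p&gt;The point is less the helper itself than where the rule lives: it is the kind of one-line fact that has to reach the prompt, because reading the schema file alone does not reveal it.&lt;/p&gt;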




&lt;h2&gt;
  
  
  Two Axes, Not One
&lt;/h2&gt;

&lt;p&gt;This is what led us to think about AI coding agent performance along two axes instead of one.&lt;/p&gt;

&lt;p&gt;After Part 3, we had &lt;strong&gt;DCR&lt;/strong&gt; — the ratio of decisions handled deterministically. At 85%, the model's job was narrow: just write the code.&lt;/p&gt;

&lt;p&gt;After Part 4, we had the &lt;strong&gt;Information Design Gap&lt;/strong&gt; — the model needs enough context to do that narrow job.&lt;/p&gt;

&lt;p&gt;Now, after three projects, we're working with a slightly more structured version:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Axis 1: DCR (Deterministic Coverage Ratio)&lt;/strong&gt; — how much of the decision surface is handled without the model. This is the scaffolding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Axis 2: Information Quality (IQ)&lt;/strong&gt; — how well the model is equipped for the decisions it &lt;em&gt;does&lt;/em&gt; handle. This is the fuel.&lt;/p&gt;

&lt;p&gt;This is not a measured equation — just the mental model that best explains our runs so far:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;System Reliability ≈ DCR × Information Quality
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DCR narrows the blast radius. IQ determines how well the model performs inside it. In our data so far, you need both. Neither alone has been sufficient.&lt;/p&gt;
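&lt;p&gt;To make the mental model concrete, here is the product worked through with the rough figures quoted in this post (DCR of 85%, Information Quality of about 15% for Project A and about 80% for Project B). The function name is an assumption for this sketch, and the output is an illustration of the mental model, not a prediction of pass rates.&lt;/p&gt;

```python
def reliability_estimate(dcr, iq):
    """Rough mental-model product; both inputs are fractions between 0 and 1."""
    return dcr * iq


# Approximate figures quoted in this post, not measured constants.
project_a = reliability_estimate(0.85, 0.15)
project_b = reliability_estimate(0.85, 0.80)

print(f"Project A: {project_a:.2f}")  # Project A: 0.13
print(f"Project B: {project_b:.2f}")  # Project B: 0.68
```

&lt;p&gt;With DCR held constant, the whole difference between the two estimates comes from the second factor, which is the pattern the two-axis model is meant to capture.&lt;/p&gt;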




&lt;h2&gt;
  
  
  Three Dimensions of Information Quality
&lt;/h2&gt;

&lt;p&gt;After analyzing our failures across three projects and reading through recent work on LLM-based code generation, we've found that "the model didn't get enough information" breaks down into at least three problems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dimension 1 — Availability: Does the information exist in the context window?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This was Project A's problem. The model received ~240 tokens of task-relevant content. The information existed on disk — the orchestrator just never loaded it.&lt;/p&gt;

&lt;p&gt;Dillon &amp;amp; Varanasi (2026) observed something similar. They measured whether generated code follows team-level architectural decisions. When a decision was visible in the files the model received, compliance was near-perfect. When it existed only in documents the model never saw, compliance dropped to zero.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dimension 2 — Selection: Is irrelevant information excluded?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;More context isn't always better. Alonso et al. (2026) found that adding procedural TDD instructions increased regressions, while a targeted test map reduced them significantly. The practical lesson for us was simple: token budget is finite.&lt;/p&gt;

&lt;p&gt;Hu et al. (2026) quantified this from the other direction. In their cross-file code generation benchmark, 62% of functions didn't need cross-file context at all. The skill is knowing &lt;em&gt;which&lt;/em&gt; information to include.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dimension 3 — Structure: Is the information formatted for the model to use?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the counterintuitive one. Information can be &lt;em&gt;available&lt;/em&gt; and &lt;em&gt;selected correctly&lt;/em&gt; but still fail because of how it's structured.&lt;/p&gt;

&lt;p&gt;Hu et al.'s ablation showed this clearly. "Inlined" context (dependencies inserted at relevant code locations) versus "prepended" context (same information at the top of the prompt) — same information, different structure. Removing the inlining degraded performance to nearly the same level as removing the context entirely.&lt;/p&gt;

&lt;p&gt;Chinthareddy (2026) found a similar pattern with code retrieval. On a set of architectural queries, a deterministic AST-derived knowledge graph scored 100% correctness while a vector-similarity approach on the same codebase scored 40% (on the Shopizer benchmark suite). The gap came from how relationships were structured, not what information was available.&lt;/p&gt;




&lt;h2&gt;
  
  
  How the Two Axes Interact
&lt;/h2&gt;

&lt;p&gt;Here's why we think you need both:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;DCR&lt;/th&gt;
&lt;th&gt;IQ&lt;/th&gt;
&lt;th&gt;What happens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;A&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;System has a chance to work. Deterministic decisions are correct, model has what it needs.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Deterministic decisions are correct, but the model is flying blind in its narrow lane.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;C&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Model generates good code, but the system mishandles it — wrong task order, broken gates, environment failures.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;D&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Failures become hard to diagnose. This may be what "the model isn't smart enough" often looks like.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Our Project A was Scenario B. DCR was 85% — the harness was solid. But Information Quality was ~15%. The model couldn't do its job because it couldn't see the project.&lt;/p&gt;

&lt;p&gt;Project B was closer to Scenario A. DCR was unchanged, and on the availability dimension at least, delivered context rose to roughly 80%. The model had enough context to complete most of the tasks that fit the orchestration loop.&lt;/p&gt;

&lt;p&gt;Project C showed us that IQ itself has layers. Structural availability was good (~80%), but behavioral knowledge was missing. The two-axis model held — DCR was fine, IQ was the bottleneck — but the &lt;em&gt;nature&lt;/em&gt; of the IQ problem was different from Project A.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Practical Diagnostic
&lt;/h2&gt;

&lt;p&gt;If you're building or evaluating an AI coding agent, here's the check we now run on our own system:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Measure your DCR.&lt;/strong&gt; List every decision point in one execution cycle. For each one, ask: is this resolved by deterministic code, or does it depend on model output? Count the ratio. If it's below 50%, the scaffolding likely needs reinforcement before the model can succeed.&lt;/p&gt;
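&lt;p&gt;A minimal sketch of that count, assuming you keep the inventory of decision points as a simple list. The &lt;code&gt;DecisionPoint&lt;/code&gt; class and the four example entries are hypothetical and are not taken from ForgeFlow's actual inventory.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class DecisionPoint:
    name: str
    deterministic: bool  # True when plain code resolves it, False when model output does


def deterministic_coverage_ratio(points):
    """Return the share of decision points resolved without the model."""
    if not points:
        raise ValueError("list at least one decision point")
    resolved_by_code = sum(1 for point in points if point.deterministic)
    return resolved_by_code / len(points)


# Hypothetical decision points for one cycle; replace with your own inventory.
cycle = [
    DecisionPoint("pick next task", True),
    DecisionPoint("choose target file path", True),
    DecisionPoint("run pytest and parse result", True),
    DecisionPoint("write implementation body", False),
]

print(f"DCR: {deterministic_coverage_ratio(cycle):.0%}")  # DCR: 75%
```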

&lt;p&gt;&lt;strong&gt;Step 2: Dump the prompt.&lt;/strong&gt; Not the template — the actual string that reaches the model at inference time. Read it as if you're a developer seeing this codebase for the first time. Can you write the code from this prompt alone? If you can't, the model can't either.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Diagnose by axis.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High DCR, low pass rate → Information Quality problem. Check: are context files loaded? Are descriptions specific enough? Are test assertions reaching the model?&lt;/li&gt;
&lt;li&gt;Low DCR, inconsistent results → Structural problem. The model is making decisions that should be deterministic. Move those decisions into code.&lt;/li&gt;
&lt;li&gt;Both seem fine, still failing → Might be a genuine model capability limit. Only after that does a model upgrade become the next reasonable hypothesis.&lt;/li&gt;
&lt;/ul&gt;
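&lt;p&gt;The three branches above can be written down as a small decision function. This is a sketch under assumptions: the function name, the argument names, and the returned wording are invented here, and the 50% floor is simply the threshold mentioned in Step 1.&lt;/p&gt;

```python
def diagnose(dcr, tasks_passing, prompt_is_sufficient, dcr_floor=0.5):
    """Map the Step 3 branches to a suggested next action.

    dcr is a fraction between 0 and 1. prompt_is_sufficient is the outcome
    of the manual prompt read-through from Step 2.
    """
    if tasks_passing:
        return "no action: the cycle is passing"
    if dcr_floor > dcr:
        return "structural: move model-made decisions into deterministic code"
    if not prompt_is_sufficient:
        return "information quality: check context files, descriptions, assertions"
    return "possible model capability limit: a model upgrade is now a fair hypothesis"


# Project A as described in this post: solid harness, nearly empty prompt.
print(diagnose(dcr=0.85, tasks_passing=False, prompt_is_sufficient=False))
```

&lt;p&gt;The ordering matters: the model-capability branch is only reachable after both system-level checks have come back clean.&lt;/p&gt;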

&lt;p&gt;After Project A, we jumped straight to the last branch of Step 3. "Qwen3 can't handle JWT auth" was our first diagnosis, and it was premature. The bigger problem was that the information pipeline was effectively empty. Checking the prompt first could have saved us weeks.&lt;/p&gt;




&lt;h2&gt;
  
  
  Related Work Pointing in a Similar Direction
&lt;/h2&gt;

&lt;p&gt;I didn't set out to build a framework. I set out to figure out why Project A failed. But a consistent pattern kept showing up in recent work:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://arxiv.org/abs/2603.17973" rel="noopener noreferrer"&gt;Alonso et al. (2026)&lt;/a&gt;&lt;/strong&gt; — TDD procedure instructions hurt. Contextual test maps helped. Procedure without context was counterproductive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://arxiv.org/abs/2601.13118" rel="noopener noreferrer"&gt;Midolo et al. (2026)&lt;/a&gt;&lt;/strong&gt; — Surveyed 50 developers. 14% independently reported "contextual information about other system components" as a missing factor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://arxiv.org/abs/2511.12823" rel="noopener noreferrer"&gt;Jalil et al. (2025)&lt;/a&gt;&lt;/strong&gt; — Smaller models with TDD and code execution surpassed larger models without those supports.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://arxiv.org/abs/2605.08112" rel="noopener noreferrer"&gt;Dillon &amp;amp; Varanasi (2026)&lt;/a&gt;&lt;/strong&gt; — Decision compliance went from 46% to 95% by adding product context and structured specs. Cost per merge-ready task dropped 68%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://arxiv.org/abs/2601.00376" rel="noopener noreferrer"&gt;Hu et al. (2026)&lt;/a&gt;&lt;/strong&gt; — Cross-file inlining improved exact match by a reported average of 29.73% on RepoExec across three backbone models. The result was model-independent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://arxiv.org/abs/2601.08773" rel="noopener noreferrer"&gt;Chinthareddy (2026)&lt;/a&gt;&lt;/strong&gt; — Deterministic AST-derived code graphs achieved 100% correctness vs. 40% for vector-only retrieval on architectural queries (Shopizer suite). LLM-based graph extraction missed 31% of files entirely.&lt;/p&gt;

&lt;p&gt;These studies don't prove our framework. But they point in a consistent direction, and our three internal runs are consistent with that direction.&lt;/p&gt;




&lt;h2&gt;
  
  
  What We're Not Claiming
&lt;/h2&gt;

&lt;p&gt;I want to be precise about the boundaries here.&lt;/p&gt;

&lt;p&gt;We're not claiming that model capability doesn't matter. It does — for the non-deterministic slice. A stronger model will generate better code from the same information.&lt;/p&gt;

&lt;p&gt;We're not claiming these two metrics capture everything. Latency, cost, context window size, tool use ability — all matter. But in our limited experience, DCR and IQ have explained the largest share of variance in autonomous pass rates.&lt;/p&gt;

&lt;p&gt;We're also not claiming this is proven. ForgeFlow is a sample size of one. We have three data points (0%, 67%, 29%) and they're consistent with the two-axis model, but three points don't make a proof.&lt;/p&gt;

&lt;p&gt;If anyone has run similar experiments — different scaffolding levels, different context strategies, measured pass rates — I'd genuinely love to compare notes.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Thesis, So Far
&lt;/h2&gt;

&lt;p&gt;Part 3: &lt;em&gt;"The bottleneck is not model capability, but the verifiability of specifications."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Part 4: &lt;em&gt;"Even after verifiability is constructed, the bottleneck shifts to information delivery."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now the version we're working with:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;"An AI coding agent's reliability seems to be a product of its deterministic coverage and its information quality. Improving either without the other produces a system that is either structurally sound but informationally blind, or well-informed but structurally fragile."&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Two axes. One product. Neither alone has been sufficient in our experience.&lt;/p&gt;

&lt;p&gt;Measure your DCR. Dump your prompt. Fix the axis that's actually broken. Only after that does a model upgrade become the next reasonable hypothesis. That's the diagnostic that's worked for us. Whether it generalizes is something we're still finding out.&lt;/p&gt;




&lt;h2&gt;
  
  
  About
&lt;/h2&gt;

&lt;p&gt;I'm Joseph YEO, building ForgeFlow from Seoul, Korea — a local AI coding agent that runs entirely on Apple Silicon, no cloud inference during execution.&lt;/p&gt;

&lt;p&gt;This post synthesizes what I've learned from running three projects end-to-end and reading 40+ papers on LLM-based code generation. The two-axis model isn't a proven theory — it's the working diagnostic I use every time a cycle fails. I'm sharing it because it's been useful, and because I'm curious whether others are seeing the same patterns.&lt;/p&gt;

&lt;p&gt;How are you handling the "stale context" problem as your agent modifies the codebase? Are you using repo maps, re-indexing on every task, or something else entirely? I'd love to hear what's working.&lt;/p&gt;

&lt;p&gt;Follow along:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;𝕏: &lt;a href="https://x.com/josephyeo_dev" rel="noopener noreferrer"&gt;@josephyeo_dev&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/joseph-yeo" rel="noopener noreferrer"&gt;joseph-yeo&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Site: &lt;a href="https://projectjoseph.dev/" rel="noopener noreferrer"&gt;projectjoseph.dev&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;All models run locally via Ollama 0.23.0 on Apple Silicon M5 Max 128GB. No cloud APIs were used during autonomous execution.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This post was drafted with Claude and edited by me.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>The Information Design Gap: Why Our AI Agent Was Coding Blind</title>
      <dc:creator>Joseph Yeo</dc:creator>
      <pubDate>Wed, 13 May 2026 15:23:44 +0000</pubDate>
      <link>https://dev.to/josephyeo/the-information-design-gap-why-our-ai-agent-was-coding-blind-4p8o</link>
      <guid>https://dev.to/josephyeo/the-information-design-gap-why-our-ai-agent-was-coding-blind-4p8o</guid>
      <description>&lt;p&gt;&lt;em&gt;This is Part 4 of the ForgeFlow series. &lt;a href="https://dev.to/josephyeo/the-determinism-war-why-we-stopped-chasing-better-models-3c21"&gt;Part 3: The Determinism War&lt;/a&gt; introduced DCR (Deterministic Coverage Ratio) and why we stopped chasing better models.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;In Part 3, I proposed a hypothesis:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"The bottleneck of LLM-driven software engineering is not model capability, but the verifiability of specifications."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Then I said: "We're building the system to test it."&lt;/p&gt;

&lt;p&gt;We ran two projects. Same model. Same engine. Same orchestrator. The autonomous pass rate went from &lt;strong&gt;0% to 67%.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In our case, the fix wasn't a better model. It was giving the model enough information to do its job.&lt;/p&gt;




&lt;h2&gt;
  
  
  Two Projects, One Model, One Engine
&lt;/h2&gt;

&lt;p&gt;ForgeFlow is a TDD orchestrator that runs entirely locally. No cloud API calls during execution. The cycle is simple: generate test (RED) → generate implementation (GREEN) → run pytest → commit or retry.&lt;/p&gt;

&lt;p&gt;We ran it against two internal projects:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Project A: repo-jwt&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Project B: todo-api&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Domain&lt;/td&gt;
&lt;td&gt;JWT authentication API&lt;/td&gt;
&lt;td&gt;Todo CRUD API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tasks&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model&lt;/td&gt;
&lt;td&gt;Qwen3-Coder-Next Q4_K_M (45GB)&lt;/td&gt;
&lt;td&gt;Same&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engine&lt;/td&gt;
&lt;td&gt;forgeflow.py v2&lt;/td&gt;
&lt;td&gt;Same&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Autonomous passes&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0 / 18 (0%)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8 / 12 (67%)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Manual intervention&lt;/td&gt;
&lt;td&gt;18 tasks (100%)&lt;/td&gt;
&lt;td&gt;4 tasks (33%)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Same model. Same engine. The pass rate changed from 0% to 67%.&lt;/p&gt;

&lt;p&gt;What changed between Project A and Project B was not the model or the orchestrator. It was the &lt;strong&gt;information structure of the PRD&lt;/strong&gt; — the spec document that tells the model what to build.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Prompt Dump That Exposed the Problem
&lt;/h2&gt;

&lt;p&gt;After Project A failed across all 18 tasks, we did something we should have done much earlier: we dumped the actual prompt the model received at inference time.&lt;/p&gt;

&lt;p&gt;Not the prompt template. Not the system prompt spec. The literal string that arrived at the model's context window.&lt;/p&gt;

&lt;p&gt;Here's what we expected to find: a rich prompt containing the task spec, relevant source files, test fixtures, data model definitions, and dependency context.&lt;/p&gt;

&lt;p&gt;What we actually found was a prompt of &lt;strong&gt;~720 tokens&lt;/strong&gt;, of which only &lt;strong&gt;~240 were task-relevant project information.&lt;/strong&gt; The rest was role text, formatting rules, and boilerplate.&lt;/p&gt;

&lt;p&gt;No source code. No test fixtures. No existing implementation files. The model was being asked to generate code for a project it could barely see.&lt;/p&gt;

&lt;p&gt;In hindsight, this was an embarrassing oversight. The information pipeline existed in the code — it just wasn't wired up. But we didn't notice until we read the raw prompt.&lt;/p&gt;
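&lt;p&gt;For anyone who wants to run the same check, a prompt dump can be as small as the sketch below. The function names, the output directory, and the whitespace-based token estimate are all assumptions made for illustration. A real tokenizer will give different counts, and this is not the code ForgeFlow uses.&lt;/p&gt;

```python
import tempfile
from pathlib import Path


def dump_prompt(prompt, task_id, out_dir="prompt_dumps"):
    """Write the exact string sent to the model to disk and return its path."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    path = target / f"{task_id}.txt"
    path.write_text(prompt, encoding="utf-8")
    return path


def rough_token_count(text):
    """Whitespace word count; a crude stand-in for a real tokenizer."""
    return len(text.split())


# Usage: call dump_prompt right before inference, then read the file by hand.
with tempfile.TemporaryDirectory() as scratch:
    saved = dump_prompt("Task TASK-007: create POST /api/todos", "TASK-007", scratch)
    print(saved.name, rough_token_count(saved.read_text(encoding="utf-8")))
```

&lt;p&gt;The important detail is where the call sits: after every template substitution, on the final string, so the dump cannot look richer than what the model actually receives.&lt;/p&gt;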




&lt;h2&gt;
  
  
  The Five Gaps
&lt;/h2&gt;

&lt;p&gt;We listed out every piece of information the model &lt;em&gt;should&lt;/em&gt; have received but didn't. Five gaps emerged:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gap 1 — No context in RED phase.&lt;/strong&gt; During test generation, the context parameter was hardcoded to an empty dictionary. The model wrote tests for modules it couldn't see, importing functions that didn't exist yet, guessing at fixture structures it had no way to know.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gap 2 — No context file list in the PRD.&lt;/strong&gt; The orchestrator had a function ready to read context files from a task-level field. But the PRD never defined that field. So the function returned an empty list every time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gap 3 — Module names without signatures.&lt;/strong&gt; The prompt listed available modules by name: &lt;code&gt;todo.py&lt;/code&gt;, &lt;code&gt;database.py&lt;/code&gt;. But not their contents, not their function signatures, not their class fields. The model knew &lt;em&gt;that&lt;/em&gt; modules existed, but not &lt;em&gt;what&lt;/em&gt; they contained.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gap 4 — Test assertions not forwarded.&lt;/strong&gt; The PRD included test assertion fields with precise expected behavior. The prompt builder read a different field name. The assertions existed in the spec but never reached the model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gap 5 — No conftest.py.&lt;/strong&gt; In a pytest project, &lt;code&gt;conftest.py&lt;/code&gt; defines shared fixtures — test database sessions, HTTP clients, factory functions. The model never saw it. Every task that required a test client, the model invented its own from scratch, often incompatibly.&lt;/p&gt;
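&lt;p&gt;Gaps 2 and 4 are both mismatches between what the PRD defines and what the prompt builder reads, and that kind of mismatch can be caught mechanically. The sketch below assumes tasks are plain dictionaries. The field names in &lt;code&gt;BUILDER_FIELDS&lt;/code&gt; and the misnamed &lt;code&gt;test_assertions&lt;/code&gt; key are hypothetical, since the real PRD schema is not shown in this post.&lt;/p&gt;

```python
def missing_prompt_fields(task, fields_read_by_builder):
    """Return the fields the prompt builder reads that the task never defines."""
    return [name for name in fields_read_by_builder if name not in task]


# Hypothetical field names; the real PRD schema is not shown in this post.
BUILDER_FIELDS = ["description", "context_files", "test_scenarios"]

task = {
    "id": "TASK-007",
    "description": "Create POST /api/todos.",
    "test_assertions": ["returns 201"],  # spelled differently from what the builder reads
}

print(missing_prompt_fields(task, BUILDER_FIELDS))  # ['context_files', 'test_scenarios']
```

&lt;p&gt;The check tests for key presence only, so a task that deliberately declares an empty context list still passes.&lt;/p&gt;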




&lt;h2&gt;
  
  
  Quantifying the Gap
&lt;/h2&gt;

&lt;p&gt;We measured how much relevant information actually reached the model by counting task-specific tokens:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Task-relevant tokens&lt;/th&gt;
&lt;th&gt;What it contains&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;What the model received&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~240&lt;/td&gt;
&lt;td&gt;Task ID, one-line description, module names&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;What a human developer would reference&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~1,640&lt;/td&gt;
&lt;td&gt;Above + conftest.py, database.py, models, schemas, existing routes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The model was operating on roughly &lt;strong&gt;15% of the information&lt;/strong&gt; a developer would use for the same task.&lt;/p&gt;
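&lt;p&gt;The 15% figure is just the ratio of the two rows in the table, shown here as a quick check. The variable names are made up for this snippet.&lt;/p&gt;

```python
# Approximate token counts taken from the table above.
received_tokens = 240
developer_reference_tokens = 1640

coverage = received_tokens / developer_reference_tokens
print(f"{coverage:.1%}")  # 14.6%
```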

&lt;p&gt;We started calling this the &lt;strong&gt;Information Design Gap&lt;/strong&gt;: the difference between what a model &lt;em&gt;could&lt;/em&gt; use and what the system &lt;em&gt;actually delivers&lt;/em&gt; at inference time. Whether this framing is useful beyond our system is something we're still figuring out — but for us, it immediately clarified what to fix.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fix: No Code Changes
&lt;/h2&gt;

&lt;p&gt;Here's the part that surprised us.&lt;/p&gt;

&lt;p&gt;The orchestrator already had the machinery to deliver context files. A function to resolve which files a task needs — existed. A function to read those files from disk — existed. A function to format them into the prompt — existed. The prompt builder had a slot for context.&lt;/p&gt;

&lt;p&gt;The pipeline existed. The PRD just wasn't feeding it.&lt;/p&gt;
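&lt;p&gt;As a rough picture of what such a resolve, read, and format chain can look like, here is a minimal sketch. It assumes each task is a dictionary with an optional &lt;code&gt;context_files&lt;/code&gt; list of paths relative to the project root. The function name and the section format are invented for illustration and are not ForgeFlow's implementation.&lt;/p&gt;

```python
from pathlib import Path


def build_context_block(task, project_root):
    """Read each file listed under context_files and format it for the prompt.

    Files that do not exist are skipped and reported, so an outdated list
    is visible instead of silently producing an empty context.
    """
    sections, missing = [], []
    for relative in task.get("context_files", []):
        path = Path(project_root) / relative
        if path.is_file():
            sections.append(f"### {relative}\n{path.read_text(encoding='utf-8')}")
        else:
            missing.append(relative)
    return "\n\n".join(sections), missing
```

&lt;p&gt;Note what happens when the task has no &lt;code&gt;context_files&lt;/code&gt; key at all: the function returns an empty string and an empty list without raising an error. Machinery like this fails quietly when the spec does not feed it.&lt;/p&gt;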

&lt;p&gt;For Project B (todo-api), we made three changes — all in the PRD, none in the engine:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Added a context file list to every task.&lt;/strong&gt; Each task now listed exactly which existing files the model should see. Early tasks had empty lists. CRUD endpoint tasks included the test fixture file, the relevant model file, and the schema file.&lt;/p&gt;

&lt;p&gt;Here's what a task spec looked like after the fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TASK-007&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="s"&gt;Create POST /api/todos.&lt;/span&gt;
    &lt;span class="s"&gt;Use the client fixture from conftest.py.&lt;/span&gt;
    &lt;span class="s"&gt;Return 201 with TodoResponse schema.&lt;/span&gt;
  &lt;span class="na"&gt;context_files&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;tests/conftest.py&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;app/models/todo.py&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;app/schemas/todo.py&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Made descriptions explicit.&lt;/strong&gt; Instead of "Create the POST endpoint", we wrote "Create POST /api/todos. Use the &lt;code&gt;client&lt;/code&gt; fixture from conftest.py. Return 201 with TodoResponse schema." Three short sentences, but they told the model &lt;em&gt;where&lt;/em&gt; to look.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Unified test scenario format.&lt;/strong&gt; Aligned PRD field names with what the prompt builder actually read, so test assertions reached the model.&lt;/p&gt;

&lt;p&gt;Total lines of code changed in the engine: &lt;strong&gt;zero.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What 67% Actually Means
&lt;/h2&gt;

&lt;p&gt;Eight of twelve tasks passed autonomously: the model generated code, the tests passed, and ForgeFlow committed without human intervention.&lt;/p&gt;

&lt;p&gt;Based on manual inspection, I would not classify the four manual tasks as model-capability failures. They were structural mismatches with the TDD cycle:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Why manual&lt;/th&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;TASK-001&lt;/td&gt;
&lt;td&gt;Infrastructure setup — test and implementation must be created together&lt;/td&gt;
&lt;td&gt;RED phase incompatible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TASK-003&lt;/td&gt;
&lt;td&gt;Fixture-only — conftest.py defines fixtures, nothing to "fail" in RED&lt;/td&gt;
&lt;td&gt;No failing test possible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TASK-010&lt;/td&gt;
&lt;td&gt;Validation already handled by Pydantic schema from an earlier task&lt;/td&gt;
&lt;td&gt;RED unexpected pass&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TASK-012&lt;/td&gt;
&lt;td&gt;Integration test — no implementation file, test-only task&lt;/td&gt;
&lt;td&gt;Engine assumes impl file exists&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That said, I should be honest about what we can't fully measure here. "Model-capability failure" is hard to distinguish from "subtle information gap we didn't notice." Our classification is based on manual inspection, not a controlled experiment. What we can say with confidence is that the &lt;em&gt;type&lt;/em&gt; of failure changed completely — from hallucinated imports and invented fixtures in Project A to structural mismatches in Project B.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Lesson: Intelligence Gap vs. Information Gap
&lt;/h2&gt;

&lt;p&gt;After Project A, our diagnosis was: "The model isn't smart enough. Qwen3 at Q4 quantization can't handle multi-file JWT authentication."&lt;/p&gt;

&lt;p&gt;That diagnosis was wrong — or at least, premature.&lt;/p&gt;

&lt;p&gt;The model appeared to have more usable capability than our system was exposing. In this run, the difference between 0% and 67% looked less like intelligence and more like &lt;strong&gt;context delivery.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This completely changed how we thought about local model limitations:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Intelligence Gap&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Information Gap&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Symptom&lt;/td&gt;
&lt;td&gt;Model generates plausible but wrong code&lt;/td&gt;
&lt;td&gt;Model generates structurally incompatible code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Diagnosis&lt;/td&gt;
&lt;td&gt;"Model too small / too quantized"&lt;/td&gt;
&lt;td&gt;"Prompt missing critical context"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fix&lt;/td&gt;
&lt;td&gt;Upgrade model (expensive, diminishing returns)&lt;/td&gt;
&lt;td&gt;Improve information design (free, compounding)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Testable?&lt;/td&gt;
&lt;td&gt;Hard — model capability is a black box&lt;/td&gt;
&lt;td&gt;Easy — dump the prompt, count what's missing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The information gap is testable. Dump the prompt. Read it as if you're a developer seeing this project for the first time. If &lt;em&gt;you&lt;/em&gt; couldn't write the code from that prompt alone, the model can't either.&lt;/p&gt;
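&lt;p&gt;A rough way to automate part of that read-through is sketched below. It only checks whether names the task depends on appear anywhere in the dumped prompt. The function &lt;code&gt;missing_context&lt;/code&gt;, the name list, and the sample prompt are all hypothetical and are not part of ForgeFlow.&lt;/p&gt;

```python
# Hypothetical names the task depends on; replace with your project's own.
REQUIRED_NAMES = ["get_db", "UserCreate", "client"]


def missing_context(prompt_text, required_names):
    """Return the required names that never appear in the dumped prompt."""
    return [name for name in required_names if name not in prompt_text]


# A made-up prompt dump that mentions only one of the three names.
prompt_dump = "Implement POST /users using the UserCreate schema."
print(missing_context(prompt_dump, REQUIRED_NAMES))
```

&lt;p&gt;A substring check like this is crude: it can tell you a fixture name is absent, but not whether its definition was delivered in a usable form. It is a first filter, not a replacement for reading the prompt.&lt;/p&gt;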




&lt;h2&gt;
  
  
  Similar Patterns in Recent Research
&lt;/h2&gt;

&lt;p&gt;While writing this post, we surveyed recent research on TDD-based code generation and found similar patterns appearing independently. These don't prove our framework, but the convergence seemed worth noting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alonso et al. (2026)&lt;/strong&gt; tested TDD prompting on SWE-bench Verified with a 30B local model. Adding procedural TDD instructions ("write tests first, then implement") increased regressions. Adding a graph-derived test map ("here are the specific tests at risk") reduced them significantly. Their conclusion: agents don't need to be told &lt;em&gt;how&lt;/em&gt; to do TDD — they need to be told &lt;em&gt;which tests to check.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We saw the same mechanism: telling the model &lt;em&gt;what process to follow&lt;/em&gt; consumed context tokens that could carry &lt;em&gt;actual project information&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Midolo et al. (2026)&lt;/strong&gt; surveyed 50 developers about what makes code generation prompts succeed. Their top factors: algorithmic details (57%) and I/O format specification (44%). When asked what &lt;em&gt;else&lt;/em&gt; was missing, 14% independently reported "contextual information about other components in the system" — which sounds a lot like the gap our per-task context file list was designed to close.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Jalil et al. (2025)&lt;/strong&gt; showed that smaller models with TDD and a code interpreter could surpass larger models without those supports. The pattern held across model families: tests as structured context beat model scale.&lt;/p&gt;

&lt;p&gt;Different benchmarks, different teams, different setups. They all point toward the same practical lesson: &lt;strong&gt;before blaming the model, it might be worth inspecting the information pipeline.&lt;/strong&gt; Our data adds one more point in that direction.&lt;/p&gt;




&lt;h2&gt;
  
  
  Implications for DCR
&lt;/h2&gt;

&lt;p&gt;In Part 3, I defined DCR as the ratio of deterministic decisions in an agent loop. A reader asked whether DCR should be tracked like test coverage — not just reviewed once at architecture time.&lt;/p&gt;

&lt;p&gt;Running two projects gave us a partial answer: &lt;strong&gt;DCR alone wasn't enough.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;ForgeFlow's DCR didn't change between Project A and Project B. It was about 85% both times: the same 11 of 13 decisions were handled deterministically. Yet the autonomous pass rate went from 0% to 67%.&lt;/p&gt;

&lt;p&gt;What changed was the &lt;em&gt;quality of information feeding the non-deterministic decisions&lt;/em&gt;. DCR tells you how narrow the model's role is. It doesn't tell you whether the model is equipped to play that role.&lt;/p&gt;

&lt;p&gt;This is why we're now thinking about DCR in two layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Static DCR&lt;/strong&gt;: how many decision points are designed to be deterministic. (Architecture metric.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observed DCR&lt;/strong&gt;: how many decisions were actually resolved deterministically during real runs. (Runtime metric.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And alongside both: &lt;strong&gt;Information Delivery Rate&lt;/strong&gt; — how much of the available, relevant context actually reaches the model at inference time. Using task-relevant token delivery as a rough proxy, Project A was around 15%. Project B was much closer to the information a human developer would expect to see.&lt;/p&gt;
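&lt;p&gt;One possible way to compute that proxy is sketched below, assuming you already know which files are relevant to a task and have a token count for each. The function name, the file names, and the token counts are made up for illustration and are not measurements from either project.&lt;/p&gt;

```python
def information_delivery_rate(relevant_files, delivered_files, token_counts):
    """Share of task-relevant tokens that actually reached the prompt."""
    relevant = set(relevant_files)
    total = sum(token_counts[name] for name in relevant)
    if total == 0:
        raise ValueError("no relevant tokens to deliver")
    delivered = sum(
        token_counts[name] for name in relevant.intersection(delivered_files)
    )
    return delivered / total


# Made-up token counts for two hypothetical context files.
token_counts = {"app/models.py": 200, "tests/conftest.py": 600}
rate = information_delivery_rate(
    relevant_files=["app/models.py", "tests/conftest.py"],
    delivered_files=["app/models.py"],
    token_counts=token_counts,
)
print(f"Information Delivery Rate = {rate:.0%}")
```

&lt;p&gt;The hard part is deciding what counts as relevant. This sketch takes that list as an input rather than trying to infer it.&lt;/p&gt;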

&lt;p&gt;We're still working out whether these are the right abstractions — but they've been useful for diagnosing our own failures so far.&lt;/p&gt;




&lt;h2&gt;
  
  
  What We're Building Next
&lt;/h2&gt;

&lt;p&gt;The immediate roadmap based on these findings:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RED phase context delivery.&lt;/strong&gt; The RED phase (test generation) was still sending an empty context when we ran these projects. We've since fixed this in the engine — the model now sees existing fixtures before writing new tests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automatic context inference.&lt;/strong&gt; Right now, context files are manually specified per task in the PRD. The next step is deriving them from the dependency graph: if TASK-007 depends on TASK-005 and TASK-006, automatically include their implementation files as context. We're exploring tree-sitter-based approaches for this.&lt;/p&gt;
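&lt;p&gt;The simplest version of that inference can be sketched without tree-sitter, using only the dependencies already declared per task. The PRD structure, the field names &lt;code&gt;depends_on&lt;/code&gt; and &lt;code&gt;impl_file&lt;/code&gt;, and the file paths below are assumptions for illustration, not ForgeFlow's actual schema.&lt;/p&gt;

```python
# Hypothetical PRD fragment: each task lists its dependencies and its implementation file.
TASKS = {
    "TASK-005": {"depends_on": [], "impl_file": "app/models.py"},
    "TASK-006": {"depends_on": [], "impl_file": "app/schemas.py"},
    "TASK-007": {"depends_on": ["TASK-005", "TASK-006"], "impl_file": "app/routes.py"},
}


def infer_context_files(task_id, tasks, transitive=False):
    """Return implementation files of the task's dependencies, in declared order."""
    context = []
    visited = set()
    # Reverse so that pop() visits dependencies in the order they were declared.
    stack = list(reversed(tasks[task_id]["depends_on"]))
    while stack:
        dep = stack.pop()
        if dep in visited:
            continue
        visited.add(dep)
        context.append(tasks[dep]["impl_file"])
        if transitive:
            # Optionally follow dependencies of dependencies as well.
            stack.extend(reversed(tasks[dep]["depends_on"]))
    return context


print(infer_context_files("TASK-007", TASKS))
```

&lt;p&gt;This only covers files produced by declared dependencies. Shared fixtures or helpers that no task declares would still need another mechanism, which is where a parser-based approach could help.&lt;/p&gt;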

&lt;p&gt;&lt;strong&gt;Structural mismatch detection.&lt;/strong&gt; Four of twelve tasks didn't fit the RED-GREEN cycle. We want ForgeFlow to detect these patterns (infrastructure setup, fixture-only, test-only) during PRD validation and handle them with a separate path — not force them through TDD.&lt;/p&gt;
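&lt;p&gt;A minimal sketch of what such a check might look like during PRD validation follows. It assumes each task declares a list of target files and an optional &lt;code&gt;kind&lt;/code&gt; field. Both fields, the labels, and the naming heuristics are hypothetical.&lt;/p&gt;

```python
from pathlib import PurePosixPath


def classify_task(task):
    """Return a routing label: a non-TDD category, or "tdd" for the normal cycle."""
    names = {PurePosixPath(path).name for path in task["targets"]}
    if not names:
        raise ValueError("task declares no target files")
    if task.get("kind") == "infrastructure":
        return "infrastructure-setup"
    if names == {"conftest.py"}:
        # Fixtures only: there is nothing that can fail in RED.
        return "fixture-only"
    if all(name.startswith("test_") for name in names):
        # No implementation file: the engine's impl-file assumption does not hold.
        return "test-only"
    return "tdd"


print(classify_task({"targets": ["tests/conftest.py"]}))
print(classify_task({"targets": ["tests/test_integration.py"]}))
print(classify_task({"targets": ["tests/test_auth.py", "app/auth.py"]}))
```

&lt;p&gt;A static check like this cannot catch the TASK-010 case, where RED passes unexpectedly because an earlier task already covers the behavior. That one only shows up at run time.&lt;/p&gt;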




&lt;h2&gt;
  
  
  The Thesis, Updated
&lt;/h2&gt;

&lt;p&gt;Part 3's thesis was about structure:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"The bottleneck is not model capability, but verifiability of specifications."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Two projects later, I'd extend it:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;"In our runs, even after we built verifiability, the bottleneck seemed to shift to information delivery — whether the model receives enough context to use that verifiability."&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;DCR gave us the harness. Information design made that harness useful. Both seem to be required. Neither alone was sufficient in our experience.&lt;/p&gt;

&lt;p&gt;Same model. Same engine. Zero code changes. &lt;strong&gt;0% → 67%.&lt;/strong&gt; In our case, the difference was information.&lt;/p&gt;

&lt;p&gt;Several recent studies point in a similar direction, though from different setups. The practical suggestion I'd offer: &lt;strong&gt;if your AI coding agent is underperforming, it might be worth checking what it's receiving before swapping the model.&lt;/strong&gt; That's what worked for us.&lt;/p&gt;




&lt;h2&gt;
  
  
  About
&lt;/h2&gt;

&lt;p&gt;I'm Joseph YEO, a solo builder from Seoul, Korea. ForgeFlow is my experiment in pushing local AI agents toward more reliable autonomous execution — no cloud inference during execution, no hand-holding mid-cycle.&lt;/p&gt;

&lt;p&gt;This post covers what happened when we actually ran the system from Part 3 against real projects and discovered the gap between having a verification harness and feeding the generator enough context. I'm sharing this because I wish someone had told me to dump the raw prompt before I spent weeks blaming the model.&lt;/p&gt;

&lt;p&gt;If you've run into similar issues — or found different solutions — I'd love to hear about it in the comments.&lt;/p&gt;

&lt;p&gt;Follow along:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;𝕏: &lt;a href="https://x.com/josephyeo_dev" rel="noopener noreferrer"&gt;@josephyeo_dev&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/joseph-yeo" rel="noopener noreferrer"&gt;joseph-yeo&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Site: &lt;a href="https://projectjoseph.dev/" rel="noopener noreferrer"&gt;projectjoseph.dev&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Built over ~33 sessions, May 2026. All models run locally via Ollama 0.23.0 on Apple Silicon. No cloud APIs were used during autonomous execution.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This post was drafted with Claude and edited by me.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>The Determinism War: Why We Stopped Chasing Better Models</title>
      <dc:creator>Joseph Yeo</dc:creator>
      <pubDate>Tue, 12 May 2026 13:56:15 +0000</pubDate>
      <link>https://dev.to/josephyeo/the-determinism-war-why-we-stopped-chasing-better-models-3c21</link>
      <guid>https://dev.to/josephyeo/the-determinism-war-why-we-stopped-chasing-better-models-3c21</guid>
      <description>&lt;p&gt;The biggest upgrade to our AI coding system was not a better model.&lt;/p&gt;

&lt;p&gt;It was deleting model calls.&lt;/p&gt;

&lt;p&gt;Everyone in AI infrastructure asks: &lt;em&gt;"Which model should we use?"&lt;/em&gt; That question is a distraction. After 22 development sprints and a deep dive into 35 research papers while building &lt;strong&gt;ForgeFlow&lt;/strong&gt; — our fully local, TDD-based autonomous implementation system — we landed on a different question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"How many of our system's decisions can we replace with deterministic rules?"&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The answer to that question often matters more than whether you use a frontier model or a 45GB local one.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Insight That Changed Everything
&lt;/h2&gt;

&lt;p&gt;LLM-based generation is probabilistic unless tightly constrained. Even when you pin temperature and seed, model calls remain poor substitutes for explicit rules when a decision can be mechanically verified.&lt;/p&gt;

&lt;p&gt;Consider what a typical agentic coding loop actually decides:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Which task to implement next&lt;/li&gt;
&lt;li&gt;Whether the environment is healthy&lt;/li&gt;
&lt;li&gt;Whether the generated code has syntax errors&lt;/li&gt;
&lt;li&gt;Whether the import paths are correct&lt;/li&gt;
&lt;li&gt;Whether the generated file is in scope&lt;/li&gt;
&lt;li&gt;Whether the tests pass&lt;/li&gt;
&lt;li&gt;Whether to commit, retry, or declare deadlock&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;None of these require a language model.&lt;/strong&gt; They're deterministic operations: dependency graph traversal, docker health checks, &lt;code&gt;py_compile&lt;/code&gt;, &lt;code&gt;ruff&lt;/code&gt;, allow-list comparison, &lt;code&gt;pytest&lt;/code&gt; exit code, SHA-256 hashing.&lt;/p&gt;

&lt;p&gt;The only judgment that genuinely requires a model is code generation itself (the tests and the implementation), a small part of the entire cycle.&lt;/p&gt;

&lt;p&gt;We call this ratio the &lt;strong&gt;DCR: Deterministic Coverage Ratio.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DCR = deterministic decision points / total decision points
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DCR is not a benchmark score. It is a design accounting tool: count every branch in the agent loop, then classify whether it is resolved by code or by model judgment.&lt;/p&gt;

&lt;p&gt;ForgeFlow's current DCR: &lt;strong&gt;≈ 85%&lt;/strong&gt; — 11 of 13 decision points per cycle are deterministic.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Decision&lt;/th&gt;
&lt;th&gt;Deterministic?&lt;/th&gt;
&lt;th&gt;Mechanism&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Next task selection&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;dependency DAG traversal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dependency satisfied?&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;status field check&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Syntax valid?&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;py_compile&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Style gate pass?&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;ruff&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Import paths correct?&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Phase 0.5 AST correction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;File in scope?&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;allow-list comparison&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Environment healthy?&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;docker / ollama / disk check&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test code generation&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;local LLM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Implementation generation&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;local LLM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tests pass?&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;pytest exit code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Commit / Retry / Deadlock?&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;gate_decision (rule-based)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failure type?&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;stderr pattern matching&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deadlock detected?&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;failure signature × 3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
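&lt;p&gt;The accounting in the table can be reproduced in a few lines. The dictionary below copies the thirteen rows above. The function name &lt;code&gt;deterministic_coverage_ratio&lt;/code&gt; is just an illustrative choice, not something taken from ForgeFlow's code.&lt;/p&gt;

```python
# Decision points copied from the table above; True means resolved by code.
DECISION_POINTS = {
    "next task selection": True,
    "dependency satisfied": True,
    "syntax valid": True,
    "style gate pass": True,
    "import paths correct": True,
    "file in scope": True,
    "environment healthy": True,
    "test code generation": False,
    "implementation generation": False,
    "tests pass": True,
    "commit / retry / deadlock": True,
    "failure type": True,
    "deadlock detected": True,
}


def deterministic_coverage_ratio(points):
    """Return deterministic decision points divided by total decision points."""
    if not points:
        raise ValueError("at least one decision point is required")
    return sum(points.values()) / len(points)


dcr = deterministic_coverage_ratio(DECISION_POINTS)
print(f"DCR = {dcr:.1%}")  # prints "DCR = 84.6%"
```

&lt;p&gt;Keeping the list in code rather than in a document makes it easier to notice when a new branch in the agent loop quietly adds a model-dependent decision.&lt;/p&gt;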




&lt;h2&gt;
  
  
  Why DCR Is the Right Metric
&lt;/h2&gt;

&lt;p&gt;Recent inference-time scaling research keeps pointing to the same condition: repeated sampling becomes powerful when there is an automatic verifier.&lt;/p&gt;

&lt;p&gt;Without an oracle, more samples just create more candidates to rank. With an oracle, more samples create more chances to find a correct answer.&lt;/p&gt;

&lt;p&gt;A few papers that converge on this pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Large Language Monkeys (Brown et al., ICLR 2025):&lt;/strong&gt; Coverage (the share of problems solved by at least one sample) scales with sample count. In coding tasks with automatic verification, repeated sampling with a cheaper model can outperform a single attempt from a frontier model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"The Larger the Better?" (Hassid et al., 2024):&lt;/strong&gt; Under a fixed compute budget, smaller code models sampled multiple times can match or outperform larger models — when unit tests are available to select the correct answer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Do We Truly Need So Many Samples?" (2025):&lt;/strong&gt; Multi-model repeated sampling improves sample efficiency. Smaller model combinations approach larger-model performance with far fewer samples, given a verifier.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The critical shared condition: &lt;strong&gt;&lt;em&gt;automatic verification must exist.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Unit tests. pytest. An oracle that says "yes" or "no" without human judgment.&lt;/p&gt;

&lt;p&gt;This is where most real-world projects diverge from benchmarks. HumanEval and MBPP ship with test suites. Your JWT authentication service doesn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The bottleneck isn't model capability. It's verifiability.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Verifiability Is Constructed, Not Given
&lt;/h2&gt;

&lt;p&gt;Existing research operates in &lt;strong&gt;"Verifiability as Given"&lt;/strong&gt; mode: benchmarks provide the oracle, models generate code, oracle decides.&lt;/p&gt;

&lt;p&gt;ForgeFlow operates in &lt;strong&gt;"Verifiability as Constructed"&lt;/strong&gt; mode: start with a natural-language requirement and systematically transform it into something testable before any model is ever called.&lt;/p&gt;

&lt;p&gt;We call this the &lt;strong&gt;Verifiability Transformation Pipeline&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 1: Specification Crystallization&lt;/strong&gt;&lt;br&gt;
Natural language → numerically precise spec. Zero ambiguous terms ("appropriately", "as needed", "etc." are banned). Every requirement is anchored by a concrete test assertion:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST {title: 'Buy milk'} → 201, body contains {id, title}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Stage 2: Atomic Decomposition&lt;/strong&gt;&lt;br&gt;
One task = one test file + one implementation file. An ordered dependency DAG ensures each task's dependencies are already verified before it runs. This eliminates "cross-task contamination" — the pattern where one task's test setup silently breaks another task's environment.&lt;/p&gt;
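&lt;p&gt;The ordering step can be sketched with Python's standard-library &lt;code&gt;graphlib&lt;/code&gt;. The task ids and edges below are hypothetical, and this is not necessarily how ForgeFlow orders tasks internally.&lt;/p&gt;

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
TASK_DEPENDENCIES = {
    "TASK-001": set(),
    "TASK-002": {"TASK-001"},
    "TASK-003": {"TASK-001"},
    "TASK-004": {"TASK-002", "TASK-003"},
}


def execution_order(dependencies):
    """Return task ids so that every task comes after all of its dependencies."""
    # static_order() raises graphlib.CycleError if the graph contains a cycle.
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(TASK_DEPENDENCIES)
print(order)
```

&lt;p&gt;Because a cycle raises an error instead of producing an order, an invalid dependency graph is rejected before any model call is made.&lt;/p&gt;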

&lt;p&gt;&lt;strong&gt;Stage 3: Assertion Embedding&lt;/strong&gt;&lt;br&gt;
The test contract is written into the spec itself, at pytest-assertion granularity. TDD RED phase uses this to generate the test first. The oracle exists before any code does.&lt;/p&gt;

&lt;p&gt;The goal is not to claim that benchmark scaling laws automatically transfer to every real-world project. The goal is to reshape your project until enough of it has benchmark-like properties: explicit specs, executable tests, and a deterministic oracle. Once you have that, repeated sampling and local inference become much more interesting.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Substitution Paradox
&lt;/h2&gt;

&lt;p&gt;Here's the part that feels contradictory:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Replacing non-deterministic judgment with deterministic rules often requires the most creative kind of intelligence.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Designing the rule "run &lt;code&gt;py_compile&lt;/code&gt; before the LLM sees the output" seems obvious in retrospect. But reaching that design requires prior reasoning: that syntax errors are a deterministic category, that they can be caught pre-LLM, and that relying on the LLM to catch its own syntax errors is inherently fragile.&lt;/p&gt;
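&lt;p&gt;As a minimal sketch of such a gate, the snippet below writes the generated source to a temporary file and compiles it with the standard-library &lt;code&gt;py_compile&lt;/code&gt; module. The function name &lt;code&gt;syntax_gate&lt;/code&gt; and its return shape are assumptions for illustration.&lt;/p&gt;

```python
import py_compile
import tempfile
from pathlib import Path


def syntax_gate(source_code):
    """Return (ok, message) for generated Python source without calling a model."""
    with tempfile.TemporaryDirectory() as tmp:
        candidate = Path(tmp) / "candidate.py"
        compiled = Path(tmp) / "candidate.pyc"
        candidate.write_text(source_code, encoding="utf-8")
        try:
            # doraise=True turns syntax problems into PyCompileError.
            py_compile.compile(str(candidate), cfile=str(compiled), doraise=True)
        except py_compile.PyCompileError as exc:
            return False, exc.msg
    return True, "ok"


# Valid source passes; a missing colon is rejected before any retry is spent.
print(syntax_gate("def add(a, b):\n    return a + b\n"))
print(syntax_gate("def add(a, b)\n    return a + b\n")[0])
```

&lt;p&gt;The failure message comes straight from the compiler, so it can be fed back into the next attempt or matched against known failure patterns.&lt;/p&gt;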

&lt;p&gt;That's inference. Creative, structural inference.&lt;/p&gt;

&lt;p&gt;This is why ForgeFlow has a three-tier intelligence structure:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Actor&lt;/th&gt;
&lt;th&gt;Frequency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;C — Design&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"Which decisions can become deterministic?"&lt;/td&gt;
&lt;td&gt;Claude + Joseph&lt;/td&gt;
&lt;td&gt;Once per system&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;B — Execute&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"Generate code matching this spec"&lt;/td&gt;
&lt;td&gt;Qwen3 45GB (local)&lt;/td&gt;
&lt;td&gt;Every cycle&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;A — Verify&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;pytest, py_compile, ruff, gate_decision&lt;/td&gt;
&lt;td&gt;forgeflow.py&lt;/td&gt;
&lt;td&gt;Every cycle, ≈ free&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The cloud model designs the harness. The local model runs inside it. The deterministic verifier judges the result.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The strongest model's job is to make the weakest model capable.&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  The Economics
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Cloud model, direct execution:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cost = (1 / P(success)) × price_per_call
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As the problem gets harder, P(success) falls and the expected number of tries rises. As the model gets stronger, the price per call rises. Either way, cost grows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ForgeFlow (DCR maximization):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cost = design_cost (one-time) + N × marginal_local_inference_cost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The marginal cost is not zero — electricity, time, and thermal throttling all exist. But it is low enough that repeated attempts become operationally feasible in a way that repeated cloud calls are not. Design cost is high, but it is a constant.&lt;/p&gt;
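&lt;p&gt;The two cost formulas can be compared side by side in a short sketch. Every number below is a made-up unit chosen so the arithmetic is easy to follow. None of them are measured prices or success rates.&lt;/p&gt;

```python
def expected_cloud_cost(p_success, price_per_call):
    """Expected spend until the first success: (1 / P(success)) * price_per_call."""
    if p_success == 0:
        raise ValueError("p_success must be non-zero")
    return price_per_call / p_success


def local_cost(design_cost, attempts, marginal_cost):
    """One-time design cost plus N attempts at the marginal local inference cost."""
    return design_cost + attempts * marginal_cost


# Hypothetical inputs in arbitrary cost units.
cloud = expected_cloud_cost(p_success=0.25, price_per_call=2.0)
local = local_cost(design_cost=100.0, attempts=50, marginal_cost=0.5)
print(f"cloud, per task: {cloud}")
print(f"local, 50 attempts: {local}")
```

&lt;p&gt;The cloud figure is per task and recurs with every task, while the design cost in the local figure is paid once. Which side wins depends on how many tasks the same harness ends up serving.&lt;/p&gt;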

&lt;p&gt;This is the same computation the "Large Language Monkeys" paper runs — applied to a system where the oracle is constructed rather than assumed.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Means in Practice
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;For local AI users:&lt;/strong&gt; The question isn't "is my model good enough?" It's "have I built a harness where my model only needs to do code generation, and everything else is handled deterministically?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For AI system designers:&lt;/strong&gt; DCR is a design maturity metric. DCR 30% means you're relying on the model for most decisions. DCR 85% means the model is a narrow specialist in a well-guarded context. Measure yours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For teams debating cloud vs local:&lt;/strong&gt; Once DCR is high enough, model selection becomes an economic and security variable — not a technical one. The same harness runs with a local model overnight or a cloud model during the day. Same gates. Same oracle. Different inference cost.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Thesis in One Sentence
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"The bottleneck of LLM-driven software engineering is not model capability, but the verifiability of specifications — and once verifiability is systematically constructed, a 45GB local model running overnight can match a frontier cloud model running once."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We think this is true. We're building the system to prove it.&lt;/p&gt;

&lt;p&gt;What's your system's DCR?&lt;/p&gt;




&lt;h2&gt;
  
  
  About
&lt;/h2&gt;

&lt;p&gt;I'm Joseph YEO, a solo builder from Seoul, Korea. ForgeFlow is my experiment in pushing local AI agents toward full autonomy — no cloud inference during execution, no hand-holding mid-cycle.&lt;/p&gt;

&lt;p&gt;This post is about the design principle behind that experiment: maximizing the ratio of decisions that don't require a model at all.&lt;/p&gt;

&lt;p&gt;Follow along:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;𝕏: &lt;a href="https://x.com/josephyeo_dev" rel="noopener noreferrer"&gt;@josephyeo_dev&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/joseph-yeo" rel="noopener noreferrer"&gt;joseph-yeo&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Site: &lt;a href="https://projectjoseph.dev/" rel="noopener noreferrer"&gt;projectjoseph.dev&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Built over ~26 sessions, May 2026. All models run locally via Ollama 0.23.0 on Apple Silicon. No cloud APIs were used during autonomous execution.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This post was drafted with Claude and edited by me. I use AI tools to write, just like I use them to code. That's kind of the whole point.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>aiagents</category>
      <category>tdd</category>
      <category>localai</category>
    </item>
    <item>
      <title>We Didn't Migrate from n8n to Python Because n8n Failed</title>
      <dc:creator>Joseph Yeo</dc:creator>
      <pubDate>Mon, 11 May 2026 14:58:28 +0000</pubDate>
      <link>https://dev.to/josephyeo/we-didnt-migrate-from-n8n-to-python-because-n8n-failed-k9j</link>
      <guid>https://dev.to/josephyeo/we-didnt-migrate-from-n8n-to-python-because-n8n-failed-k9j</guid>
      <description>&lt;p&gt;We migrated because our AI orchestrator became something that needed tests, trust boundaries, and deterministic behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ForgeFlow&lt;/strong&gt; is a fully local, multi-agent TDD system — a planning agent (Claude) designs a spec document set, and a local 45GB LLM executes TDD cycles overnight on Apple Silicon. During our initial development iterations, n8n was the orchestrator. It felt like the right call. Visual workflow, no-code glue, easy HTTP nodes, built-in error routing. What's not to like?&lt;/p&gt;

&lt;p&gt;Turns out, quite a lot.&lt;/p&gt;




&lt;h2&gt;
  
  
  What n8n Got Right
&lt;/h2&gt;

&lt;p&gt;To be fair, n8n earned its place early on. When we were validating the core loop — &lt;em&gt;call LLM → apply files → run pytest → branch on result&lt;/em&gt; — the visual canvas was genuinely useful. You could see the entire decision tree at a glance. Non-engineers could follow it. Debugging a single node failure was fast.&lt;/p&gt;

&lt;p&gt;And for a prototype, that matters. We got to a working TDD loop in far fewer iterations than a pure Python approach would have required.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where It Started Breaking
&lt;/h2&gt;

&lt;p&gt;Friction mounted as the system's architectural requirements became more stringent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem 1: The &lt;code&gt;fetch()&lt;/code&gt; wall&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;n8n's Code node runs in a sandboxed Node.js environment. When we tried to read existing source files before passing them as context to the LLM — a critical step for our TDD approach — &lt;code&gt;fetch()&lt;/code&gt; to our local exec server failed silently. In our setup, &lt;code&gt;$helpers.httpRequest()&lt;/code&gt; didn't give us the control we needed either. Whether this was configuration-specific or sandbox-related, the practical result was the same: file reads became workflow plumbing instead of ordinary code. The workaround was to spin up a separate HTTP Request node just for file reads.&lt;/p&gt;

&lt;p&gt;This wasn't a bug. It was the architecture saying: &lt;em&gt;"Code nodes are for data transformation, not I/O."&lt;/em&gt; Which is correct! But we needed I/O.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem 2: The trust boundary was wrong&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In our system, the rule is absolute: &lt;strong&gt;only the orchestrator may modify files or run git commands.&lt;/strong&gt; The LLM proposes. The orchestrator decides and acts.&lt;/p&gt;

&lt;p&gt;In n8n, enforcing this required careful node discipline. Nothing in the framework &lt;em&gt;prevents&lt;/em&gt; you from firing off a git commit in any arbitrary Code node. The constraint was cultural, not structural. And cultural constraints erode.&lt;/p&gt;

&lt;p&gt;In Python, the constraint is structural: &lt;code&gt;apply_files()&lt;/code&gt;, &lt;code&gt;git_commit()&lt;/code&gt;, &lt;code&gt;gate_decision()&lt;/code&gt; are functions in a specific layer (L1/L2). Nothing in L3 (the LLM layer) can call them directly. The layer separation is enforced by code organization, not discipline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem 3: Deterministic logic deserves deterministic code&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The most critical parts of our system — syntax validation, failure signature hashing, gate decision logic — are 100% deterministic. Not a single LLM call. Pure logic.&lt;/p&gt;

&lt;p&gt;Expressing that logic in n8n's Code nodes felt like writing a chess engine in PowerPoint. It worked, technically. But every &lt;code&gt;if/else&lt;/code&gt; chain lived in a text field, untestable and unversionable in any meaningful way.&lt;/p&gt;

&lt;p&gt;We wanted to write this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compute_failure_signature&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;failure_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stderr_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;failure_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;target_file&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;stderr_text&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()[:&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And then test it directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_failure_signature_is_stable&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;sig1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_failure_signature&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;patch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auth.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;IndentationError line 42&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sig2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_failure_signature&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;patch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auth.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;IndentationError line 42&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;sig1&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;sig2&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_failure_signature_differs_by_file&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;sig1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_failure_signature&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;patch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auth.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;IndentationError line 42&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sig2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_failure_signature&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;patch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;models.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;IndentationError line 42&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;sig1&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;sig2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can approximate that with a Code node, but not as a first-class, importable, directly testable artifact. The difference matters when this logic is what determines whether a 45GB model gets another retry or hits deadlock.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem 4: Cross-model verification became essential&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By our 21st development iteration, we brought in an independent reasoning model (different vendor, no shared context) to write &lt;code&gt;test_forgeflow.py&lt;/code&gt; — 110 tests for our orchestrator's behavior, written solely from our spec documents.&lt;/p&gt;

&lt;p&gt;It found 7 real mismatches between our spec and our implementation.&lt;/p&gt;

&lt;p&gt;To run this process, we needed: a well-defined function contract, stable function signatures, and a test suite that could run in CI. None of that is natural in n8n. All of it is natural in Python.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Migration Decision
&lt;/h2&gt;

&lt;p&gt;The turning point was a single question: &lt;strong&gt;"Can we write a unit test for this logic?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If the answer was no — and with n8n, it often was — that was a design smell. Critical, deterministic logic that can't be tested independently is a liability.&lt;/p&gt;

&lt;p&gt;We rewrote the orchestrator as &lt;code&gt;forgeflow.py&lt;/code&gt; with a &lt;strong&gt;three-layer trust model&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;L1: TRUST (immutable)      — File I/O, Git, Docker, State, Cycle Control
L2: VERIFY (deterministic) — Syntax Gate, correction engine, gate_decision
L3: DOUBT (always suspect) — LLM backends, JSON extraction, prompt assembly
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The design invariant: &lt;strong&gt;L1 and L2 never call an LLM. All LLM interaction is isolated in L3.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This made the migration feel less like a rewrite and more like making implicit structure explicit.&lt;/p&gt;
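
&lt;p&gt;To make the invariant concrete, here is a minimal sketch of what an L2 decision function could look like. Only the name &lt;code&gt;gate_decision&lt;/code&gt; and the three outcomes come from this post; the &lt;code&gt;CycleResult&lt;/code&gt; fields and the threshold of 3 are assumptions for illustration, not the actual &lt;code&gt;forgeflow.py&lt;/code&gt; code.&lt;/p&gt;

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    COMMIT = "COMMIT"
    RETRY = "RETRY"
    DEADLOCK = "DEADLOCK"


@dataclass(frozen=True)
class CycleResult:
    """Facts observed by the deterministic layers (hypothetical field names)."""
    syntax_ok: bool            # outcome of the syntax gate
    pytest_exit_code: int      # 0 means the full suite passed
    same_signature_count: int  # how often this failure signature has been seen


def gate_decision(result, deadlock_threshold=3):
    """L2 logic: a pure function of observed facts that never calls an LLM."""
    if result.syntax_ok and result.pytest_exit_code == 0:
        return Decision.COMMIT
    if result.same_signature_count >= deadlock_threshold:
        return Decision.DEADLOCK
    return Decision.RETRY
```

&lt;p&gt;Because the function takes plain values and returns an enum, each of the three paths can be asserted in a unit test without a model, a repo, or a container.&lt;/p&gt;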




&lt;h2&gt;
  
  
  What We'd Tell Our Past Selves
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use n8n for:&lt;/strong&gt; Prototyping agent loops, visualizing complex branching logic for stakeholders, integrating heterogeneous external APIs with minimal glue code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't use n8n for:&lt;/strong&gt; Any logic you'll want to unit-test. Any constraint that needs to be structurally enforced rather than culturally agreed upon. Any system where the orchestrator's behavior is itself the thing being verified.&lt;/p&gt;

&lt;p&gt;The moment your AI orchestrator has enough opinions to deserve a test suite, it's time to write real code.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;p&gt;The important change wasn't "Python instead of n8n." It was that the orchestrator became verifiable.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;106 behavioral tests pass&lt;/strong&gt; against stable function contracts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Three execution paths&lt;/strong&gt; (COMMIT / RETRY / DEADLOCK) are validated independently via dry-run&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trust boundaries are structural&lt;/strong&gt;: the LLM layer has no reference to file mutation or git operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-model verification is repeatable&lt;/strong&gt;: any external model can re-run tests against the same spec&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;n8n helped us discover the loop. Python let us verify it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;ForgeFlow runs entirely on local hardware: Apple Silicon M5 Max 128GB, a 45GB model, no cloud inference during execution. If you're building AI agent systems and hitting the same orchestration walls, whether with n8n, LangGraph, or CrewAI, what was your inflection point? The trust boundary question is one most teams hit eventually.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  About
&lt;/h2&gt;

&lt;p&gt;I'm Joseph YEO, a solo builder from Seoul, Korea. ForgeFlow is my experiment in pushing local AI agents toward full autonomy — no cloud inference during execution, no hand-holding mid-cycle.&lt;/p&gt;

&lt;p&gt;This post is about one architectural decision that turned out to matter more than I expected: moving the orchestrator from n8n to Python, and why testability was the forcing function.&lt;/p&gt;

&lt;p&gt;Follow along:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;𝕏: &lt;a href="https://x.com/josephyeo_dev" rel="noopener noreferrer"&gt;@josephyeo_dev&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/joseph-yeo" rel="noopener noreferrer"&gt;joseph-yeo&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Site: &lt;a href="https://projectjoseph.dev/" rel="noopener noreferrer"&gt;projectjoseph.dev&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Built over ~22 sessions, May 2026. All models run locally via Ollama 0.23.0 on Apple Silicon. No cloud APIs were used during autonomous execution.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This post was drafted with Claude and edited by me. I use AI tools to write, just like I use them to code. That's kind of the whole point.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>aiagents</category>
      <category>localai</category>
      <category>tdd</category>
      <category>python</category>
    </item>
    <item>
      <title>I Built a Local AI Coding Agent on M5 Max 128GB — It Failed 164 Times Before Passing 35 Tests</title>
      <dc:creator>Joseph Yeo</dc:creator>
      <pubDate>Sat, 09 May 2026 11:31:07 +0000</pubDate>
      <link>https://dev.to/josephyeo/i-built-a-local-ai-coding-agent-on-m5-max-128gb-it-failed-164-times-before-passing-35-tests-2fgj</link>
      <guid>https://dev.to/josephyeo/i-built-a-local-ai-coding-agent-on-m5-max-128gb-it-failed-164-times-before-passing-35-tests-2fgj</guid>
      <description>&lt;p&gt;&lt;strong&gt;Fully local. No cloud APIs during execution. TDD-enforced. 35 tests passing.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;To be clear: I used Claude (a cloud API) for planning, docs, the initial architecture, and correction rule design. The experiment was strictly focused on whether a local LLM could survive the &lt;strong&gt;autonomous execution loop&lt;/strong&gt; without phoning home. The coding agent loop (Brain → Coder → Tester) ran 100% locally, with no API calls during execution.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Actual Setup
&lt;/h2&gt;

&lt;p&gt;Most AI coding agent posts you see rely on GPT-4o or Claude via API. The model lives in a data center, and you're paying per token. That's fine — but it means your code, your architecture decisions, and your project context are all leaving your machine.&lt;/p&gt;

&lt;p&gt;I wanted something different: a multi-agent system that runs &lt;em&gt;entirely on my MacBook Pro M5 Max 128GB&lt;/em&gt;. It autonomously writes code, runs tests in a Docker sandbox, and only commits when tests pass. No internet required once it's running.&lt;/p&gt;

&lt;p&gt;This is the story of ForgeFlow — what I built, what broke, and what the data showed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hardware Context
&lt;/h2&gt;

&lt;p&gt;The M5 Max 128GB is unusual hardware for this kind of work. Most local LLM setups top out at 32GB or 64GB unified memory, which forces you to choose between model quality and running multiple models simultaneously. At 128GB, that constraint disappears.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I ran simultaneously:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Size (Q4_K_M)&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3-Coder-Next&lt;/td&gt;
&lt;td&gt;~45GB&lt;/td&gt;
&lt;td&gt;Brain + Coder&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gemma4:26b&lt;/td&gt;
&lt;td&gt;~17GB&lt;/td&gt;
&lt;td&gt;QA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nomic-embed-text&lt;/td&gt;
&lt;td&gt;~0.3GB&lt;/td&gt;
&lt;td&gt;RAG embeddings&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Total: ~62GB loaded, ~66GB headroom for OS + KV cache. Both models stay warm in memory with &lt;code&gt;keep_alive: 24h&lt;/code&gt; — no reload latency between cycles.&lt;/p&gt;

&lt;p&gt;This isn't a flex. It's context: the architectural decisions I made (same model for Brain and Coder, both models always loaded) are only feasible at this memory tier. At 64GB, you'd need to make different tradeoffs.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture: What ForgeFlow Actually Does
&lt;/h2&gt;

&lt;p&gt;ForgeFlow is an n8n workflow that runs every 10 minutes, autonomously picks the next coding task, writes tests first, writes code second, and only commits if all tests pass.&lt;/p&gt;

&lt;p&gt;The full loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Schedule Trigger (10 min)
  → Load Context (working memory + results log + project rules)
  → Brain (Qwen3-Coder-Next): pick next task from PRD
  → Localization: RAG search for relevant existing code
  → Coder RED (same model): write a failing test
  → Verify RED: pytest must FAIL — if it passes, the test is wrong
  → Coder GREEN: write minimum code to pass the test
  → Phase 0 Gate: py_compile + ruff (deterministic, no LLM)
  → QA (gemma4:26b): run full test suite in Docker sandbox
  → Gate Decision: COMMIT / RETRY / DEADLOCK / ESCALATE
  → Commit &amp;amp; Update (on pass)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three design principles drove every decision:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. pytest exit code is the only truth.&lt;/strong&gt; I don't care if the LLM thinks the code is "clean." If the pytest exit code isn't 0, the code is garbage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The LLM proposes, n8n disposes.&lt;/strong&gt; No model has write access to the filesystem or git. n8n is the only actor that applies files, runs git commands, and updates state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Deterministic gates before LLM gates.&lt;/strong&gt; &lt;code&gt;py_compile&lt;/code&gt; and &lt;code&gt;ruff&lt;/code&gt; run in under 0.5 seconds. If they catch the error, there's no reason to spend 30 seconds calling gemma4.&lt;/p&gt;
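
&lt;p&gt;As a rough illustration of principle 3, a Phase 0 gate could look something like the sketch below. The function name &lt;code&gt;phase0_gate&lt;/code&gt; and its return shape are assumptions, and the real gate runs inside the n8n workflow rather than as a standalone Python function. The ruff step is skipped when ruff is not on the PATH.&lt;/p&gt;

```python
import py_compile
import shutil
import subprocess


def phase0_gate(path):
    """Return (ok, reason) from deterministic checks only; no LLM is involved."""
    try:
        # Byte-compiles the file; a syntax error raises PyCompileError.
        py_compile.compile(str(path), doraise=True)
    except py_compile.PyCompileError as exc:
        return False, "py_compile: " + exc.msg

    ruff = shutil.which("ruff")
    if ruff is not None:
        # "ruff check" exits nonzero when it reports lint violations.
        result = subprocess.run(
            [ruff, "check", str(path)], capture_output=True, text=True
        )
        if result.returncode != 0:
            return False, "ruff: " + result.stdout.strip()

    return True, "ok"
```

&lt;p&gt;Both checks are cheap and give the same answer for the same input, which is why they sit in front of the slower QA model.&lt;/p&gt;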




&lt;h2&gt;
  
  
  The Memory System
&lt;/h2&gt;

&lt;p&gt;One of the underrated problems in autonomous coding agents is state management across cycles. The agent can't remember what it did last cycle unless you explicitly store it.&lt;/p&gt;

&lt;p&gt;ForgeFlow keeps track of state across six memory layers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Storage&lt;/th&gt;
&lt;th&gt;Scope&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Git history&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.git&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Permanent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code summaries&lt;/td&gt;
&lt;td&gt;ChromaDB (RAG)&lt;/td&gt;
&lt;td&gt;Project lifetime&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;results.tsv&lt;/td&gt;
&lt;td&gt;TSV file&lt;/td&gt;
&lt;td&gt;Session&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AGENTS.md&lt;/td&gt;
&lt;td&gt;Markdown&lt;/td&gt;
&lt;td&gt;Cross-session&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Working memory&lt;/td&gt;
&lt;td&gt;JSON file&lt;/td&gt;
&lt;td&gt;Current loop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failure patterns&lt;/td&gt;
&lt;td&gt;AGENTS.md auto-update&lt;/td&gt;
&lt;td&gt;Generalized&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each layer operates at a different time scale. Git is permanent. Working memory resets every cycle. AGENTS.md accumulates lessons across sessions — when the same failure type occurs 3+ times, a rule gets written: &lt;em&gt;"always include &lt;code&gt;from app.database import get_db&lt;/code&gt; — the model consistently forgets this."&lt;/em&gt;&lt;/p&gt;
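
&lt;p&gt;The "3+ times" trigger is simple enough to sketch. The snippet below only finds the failure types that qualify for a rule; the function name and the example labels are hypothetical, and writing the actual rule text into AGENTS.md is left out.&lt;/p&gt;

```python
from collections import Counter


def rule_candidates(failure_types, threshold=3):
    """Return failure types seen at least `threshold` times, most frequent first."""
    counts = Counter(failure_types)
    return [name for name, n in counts.most_common() if n >= threshold]


# Hypothetical failure labels collected across sessions.
log = ["missing_get_db_import"] * 3 + ["wrong_status_code"] * 2
print(rule_candidates(log))
```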




&lt;h2&gt;
  
  
  TDD Enforcement: Red-Green-Refactor as a System Constraint
&lt;/h2&gt;

&lt;p&gt;The TDD loop isn't a suggestion — it's mechanically enforced by the workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;RED phase&lt;/strong&gt;: Coder writes a test. n8n runs pytest. If it &lt;em&gt;passes&lt;/em&gt;, the test is rejected — it's testing something that already works, which means it's the wrong test.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GREEN phase&lt;/strong&gt;: Coder writes minimum code to pass the test. n8n applies the files, runs the full test suite (not just the new test), checks for regressions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Commit&lt;/strong&gt;: Only happens if exit code is 0 across the entire test suite.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Enforcing this mechanically means the model can't shortcut. It can't write "good enough" code and hope the reviewer misses it. The test either passes or it doesn't.&lt;/p&gt;
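
&lt;p&gt;The two checks can be expressed in terms of pytest's exit codes. This is a sketch with hypothetical helper names, not the n8n implementation. Accepting only exit code 1 as a valid RED result is my stricter reading of "pytest must FAIL"; a looser variant would accept any nonzero code.&lt;/p&gt;

```python
import subprocess
import sys

ALL_PASSED = 0    # pytest: every collected test passed
TESTS_FAILED = 1  # pytest: tests ran and at least one failed


def run_pytest(args):
    """Run pytest in a subprocess and return its exit code."""
    return subprocess.run([sys.executable, "-m", "pytest", *args]).returncode


def red_phase_ok(exit_code):
    """RED is valid only if the new test actually ran and failed."""
    return exit_code == TESTS_FAILED


def green_phase_ok(exit_code):
    """GREEN is valid only if the whole suite passed."""
    return exit_code == ALL_PASSED
```

&lt;p&gt;pytest reserves codes 2 to 5 for interruptions, internal errors, usage errors, and runs that collected no tests, so treating them as neither RED nor GREEN keeps a broken test run from being mistaken for a failing test.&lt;/p&gt;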




&lt;h2&gt;
  
  
  Failure Handling: Bounded Repair
&lt;/h2&gt;

&lt;p&gt;Blind retries are a token-burn trap. Instead, ForgeFlow fingerprints every failure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
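&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;
&lt;span class="n"&gt;failure_signature&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;failure_type&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;first_50_chars_of_stderr&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()[:&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;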
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If I see the same SHA256 signature three times, the agent hits a &lt;strong&gt;DEADLOCK&lt;/strong&gt; and walks away. It's better to skip a task than to let a model hallucinate in a loop.&lt;/p&gt;
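
&lt;p&gt;Put together, the fingerprint and the three-strikes rule can be sketched as a small tracker. The class and method names are hypothetical, and in this version of ForgeFlow the logic lived in n8n rather than in a Python class.&lt;/p&gt;

```python
import hashlib
from collections import Counter


def failure_signature(failure_type, file_path, stderr):
    """First 12 hex chars of SHA-256 over type, path, and first 50 stderr chars."""
    payload = failure_type + file_path + stderr[:50]
    return hashlib.sha256(payload.encode()).hexdigest()[:12]


class FailureTracker:
    """Counts signatures and reports when one has repeated too often."""

    def __init__(self, limit=3):
        self.limit = limit
        self.seen = Counter()

    def record(self, failure_type, file_path, stderr):
        sig = failure_signature(failure_type, file_path, stderr)
        self.seen[sig] += 1
        return "DEADLOCK" if self.seen[sig] >= self.limit else "RETRY"
```

&lt;p&gt;Truncating stderr before hashing means two failures that differ only late in the traceback collapse into the same signature, which is what lets the counter detect a loop.&lt;/p&gt;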

&lt;p&gt;&lt;strong&gt;Failure classification:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;patch&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Code logic error, syntax error&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;environment&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Import error, missing module&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;localization&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Wrong file referenced&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;deadlock&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Same signature 3×&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Data: What Actually Happened
&lt;/h2&gt;

&lt;p&gt;I ran ForgeFlow on a Todo REST API (FastAPI + SQLAlchemy + pytest) — 12 tasks, classic CRUD.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Overall:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total attempts&lt;/td&gt;
&lt;td&gt;164&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PASS (committed)&lt;/td&gt;
&lt;td&gt;11 (6.7%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FAIL (discarded)&lt;/td&gt;
&lt;td&gt;116 (70.7%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DEADLOCK (skipped)&lt;/td&gt;
&lt;td&gt;37 (22.6%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Manual interventions&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Final test count&lt;/td&gt;
&lt;td&gt;35 passing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 6.7% raw PASS rate sounds bad. But that number is misleading — it includes the early cycles before deterministic corrections were added.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The real signal is in the pass rate as the system "learned" (via manual rules):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Corrections active&lt;/th&gt;
&lt;th&gt;PASS rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0–5 corrections&lt;/td&gt;
&lt;td&gt;5.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6 corrections&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7–10 corrections&lt;/td&gt;
&lt;td&gt;40.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11–13 corrections&lt;/td&gt;
&lt;td&gt;62.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each "correction" is a deterministic rule applied before the LLM output reaches the filesystem. Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;from app.db.session import&lt;/code&gt; → auto-rewrite to &lt;code&gt;from app.database import&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;@router.post(...)&lt;/code&gt; without &lt;code&gt;status_code=201&lt;/code&gt; → auto-insert&lt;/li&gt;
&lt;li&gt;File not in &lt;code&gt;target_files&lt;/code&gt; → reject with error message&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As corrections accumulated, PASS rate went from 5.6% to 62.5%. The corrections are essentially a hand-built knowledge base of the model's systematic errors. It turns out those errors are highly consistent and predictable.&lt;/p&gt;
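
&lt;p&gt;For a sense of how small these corrections are, here is a sketch of the three examples above as plain Python functions. The function names are hypothetical and the real rules ran as n8n steps. The decorator regex is deliberately naive and would not handle nested parentheses inside the decorator arguments.&lt;/p&gt;

```python
import re

# Known-wrong import prefixes mapped to the project's real module.
IMPORT_REWRITES = {"from app.db.session import": "from app.database import"}

POST_DECORATOR = re.compile(r"@router\.post\(([^)]*)\)")


def rewrite_imports(source):
    """Replace import prefixes the model consistently gets wrong."""
    for wrong, right in IMPORT_REWRITES.items():
        source = source.replace(wrong, right)
    return source


def ensure_post_status(source):
    """Add status_code=201 to @router.post(...) decorators that lack one."""
    def fix(match):
        args = match.group(1)
        if "status_code" in args:
            return match.group(0)
        sep = ", " if args.strip() else ""
        return "@router.post(" + args + sep + "status_code=201)"
    return POST_DECORATOR.sub(fix, source)


def target_violations(written_files, target_files):
    """Files the model tried to write that are outside the task's targets."""
    return sorted(set(written_files) - set(target_files))
```

&lt;p&gt;A caller would run the rewrites on every generated file and reject the whole attempt if &lt;code&gt;target_violations&lt;/code&gt; returns anything.&lt;/p&gt;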

&lt;p&gt;&lt;strong&gt;Failure type distribution:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;%&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;patch&lt;/td&gt;
&lt;td&gt;99&lt;/td&gt;
&lt;td&gt;64.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;environment&lt;/td&gt;
&lt;td&gt;54&lt;/td&gt;
&lt;td&gt;35.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At 35.3%, my environment failure rate is nearly triple the standard benchmark figure (~13%). That's the "quantization tax" you pay for running Q4 models locally. The deterministic corrections target exactly these failure types.&lt;/p&gt;
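
&lt;p&gt;The percentages in the tables above can be re-derived from the raw counts. The snippet below is only a sanity check on the arithmetic; the variable names are mine.&lt;/p&gt;

```python
def share(count, total):
    """Percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


outcomes = {"PASS": 11, "FAIL": 116, "DEADLOCK": 37}
attempts = sum(outcomes.values())

# FAIL and DEADLOCK attempts together make up the failure-type table.
failures = {"patch": 99, "environment": 54}
failed = sum(failures.values())

for name, count in outcomes.items():
    print(name, share(count, attempts))
for name, count in failures.items():
    print(name, share(count, failed))
```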

&lt;p&gt;&lt;strong&gt;Hardest tasks:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Attempts&lt;/th&gt;
&lt;th&gt;Primary failure&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;TASK-002 (DB model)&lt;/td&gt;
&lt;td&gt;41&lt;/td&gt;
&lt;td&gt;PytestDeprecationWarning + ImportError&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TASK-006 (GET list)&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;ImportError conftest&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TASK-012 (integration)&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;Regression (previous code overwritten)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;TASK-002 taking 41 attempts is the starkest number. Most failures were the same &lt;code&gt;PytestDeprecationWarning&lt;/code&gt; signature — the model couldn't fix a pytest configuration issue that required understanding the test infrastructure, not just the code under test. Eventually, a manual intervention resolved it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Broke (Honestly)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;3 manual interventions were required:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TASK-008 (PUT endpoint):&lt;/strong&gt; The Coder kept generating tests with wrong status codes. Added correction #13 (PUT 201→200 auto-fix) after diagnosing the pattern.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TASK-011 (filtering):&lt;/strong&gt; The Coder overwrote &lt;code&gt;routes/todo.py&lt;/code&gt; while working on filtering, destroying previously committed code. The target_files violation detection wasn't blocking writes — only logging them.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TASK-012 (integration test):&lt;/strong&gt; DEADLOCK 3 times. The model couldn't figure out that &lt;code&gt;test_integration.py&lt;/code&gt; needed to use the existing &lt;code&gt;client&lt;/code&gt; fixture from &lt;code&gt;conftest.py&lt;/code&gt; rather than creating its own TestClient.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All three were fixed in the session after they occurred by adding deterministic corrections. The system learned — just not automatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Is (and Isn't)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;This is:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A proof that a fully local multi-agent TDD loop is viable on consumer hardware&lt;/li&gt;
&lt;li&gt;Evidence that deterministic corrections significantly outperform raw LLM retry for systematic errors&lt;/li&gt;
&lt;li&gt;A framework for thinking about autonomous coding at the task level&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;This isn't:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A "set it and forget it" system. It's a force multiplier that still requires a human to untangle the logic when the model hits a wall.&lt;/li&gt;
&lt;li&gt;A system that works without oversight (3 interventions in 12 tasks is not zero)&lt;/li&gt;
&lt;li&gt;A result that generalizes beyond the hardware tier that makes it feasible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 62.5% PASS rate in the final correction set is meaningful. But the 3 required manual interventions mean the system isn't yet fully autonomous.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Second project:&lt;/strong&gt; A more complex backend (20+ tasks, non-trivial dependencies) to validate that the correction set generalizes and the dependency resolution logic holds under pressure. The goal is a two-project dataset for a proper write-up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 0.5 Gate:&lt;/strong&gt; I'm looking at implementing AST-based checks, inspired by the Khati et al. (2026) paper, to kill hallucinations before they even hit the Docker sandbox. The goal is to catch a reference to a nonexistent symbol such as &lt;code&gt;app.routes.todo.get_todo_by_id&lt;/code&gt; before it reaches pytest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automatic correction learning:&lt;/strong&gt; Right now, corrections are written manually after pattern identification. The next step is having n8n automatically identify recurring failure signatures and propose corrections for human approval.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hardware Note for Other M-Series Users
&lt;/h2&gt;

&lt;p&gt;If you're on M2/M3/M4 Pro (36–48GB), the same architecture works with tradeoffs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run one model at a time (swap between Brain/Coder and QA)&lt;/li&gt;
&lt;li&gt;Use smaller QA model (gemma4:9b instead of 26b)&lt;/li&gt;
&lt;li&gt;Expect higher latency per cycle (~15–20 min instead of 10)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The fundamental approach — deterministic orchestration + LLM proposal + test-as-truth — doesn't require 128GB. It just runs faster and with better models at that tier.&lt;/p&gt;




&lt;h2&gt;
  
  
  About
&lt;/h2&gt;

&lt;p&gt;I'm Joseph YEO, a solo builder from Seoul, Korea. I run multiple projects in parallel using AI agents — local AI automation (ForgeFlow), supply chain security (&lt;a href="https://devradarguard.dev" rel="noopener noreferrer"&gt;DevRadar Guard&lt;/a&gt;), and a few things currently under wraps.&lt;/p&gt;

&lt;p&gt;What I'm really interested in is how autonomous these agents can actually become before I have to step in as the human. ForgeFlow is one experiment. There will be more.&lt;/p&gt;

&lt;p&gt;Follow along:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;𝕏: &lt;a href="https://x.com/josephyeo_dev" rel="noopener noreferrer"&gt;@josephyeo_dev&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/joseph-yeo" rel="noopener noreferrer"&gt;joseph-yeo&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Site: &lt;a href="https://projectjoseph.dev" rel="noopener noreferrer"&gt;projectjoseph.dev&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Built over ~7 sessions, May 2026. All models run locally via Ollama 0.23.0 on macOS. No cloud APIs were used during autonomous execution.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This post was drafted with Claude and edited by me. I use AI tools to write, just like I use them to code. That's kind of the whole point.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>llm</category>
      <category>agents</category>
      <category>tdd</category>
      <category>ollama</category>
    </item>
    <item>
      <title>Case Study: How I Dogfood DevRadar Guard on a 954-Dependency Project</title>
      <dc:creator>Joseph Yeo</dc:creator>
      <pubDate>Mon, 06 Apr 2026 13:25:37 +0000</pubDate>
      <link>https://dev.to/josephyeo/case-study-how-i-dogfood-devradar-guard-on-a-954-dependency-project-d7e</link>
      <guid>https://dev.to/josephyeo/case-study-how-i-dogfood-devradar-guard-on-a-954-dependency-project-d7e</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a follow-up to my earlier post: &lt;a href="https://dev.to/devradarguard/axios-was-compromised-heres-what-it-means-for-your-repo-1hh0"&gt;Axios Was Compromised. Here's What It Means for Your Repo.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;GloriaPPT is a presentation tool I built and maintain. It's a fairly typical modern JavaScript app: a Next.js frontend, a Node.js backend, and deployment on Vercel. What makes it interesting for this case study is its dependency tree: &lt;strong&gt;954 npm packages&lt;/strong&gt; in the lockfile.&lt;/p&gt;

&lt;p&gt;Most of those packages are transitive. I haven't read the source code for most of them, and realistically, neither have most small teams. If one of them were compromised tomorrow, I probably wouldn't know right away.&lt;/p&gt;

&lt;p&gt;That's the problem I built DevRadar Guard to solve.&lt;/p&gt;

&lt;h2&gt;
  
  
  Before DevRadar Guard
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dependency monitoring:&lt;/strong&gt; Manual &lt;code&gt;npm audit&lt;/code&gt; when I remembered to run it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Supply chain alerts:&lt;/strong&gt; None — I found out about incidents from social feeds and security threads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;.npmrc&lt;/code&gt; hardening:&lt;/strong&gt; Default settings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;CLAUDE.md&lt;/code&gt; security section:&lt;/strong&gt; Didn't exist&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-install hooks:&lt;/strong&gt; None&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD security checks:&lt;/strong&gt; Basic Dependabot, no custom policy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Response time to incidents:&lt;/strong&gt; Hours to days, depending on when I saw the news&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  After DevRadar Guard
&lt;/h2&gt;

&lt;p&gt;I deployed DevRadar Guard's hosted monitoring on a small VPS that checks every 30 minutes. Here's what changed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Signal Collection
&lt;/h3&gt;

&lt;p&gt;The Signal Engine collects threat intelligence from GitHub Security Advisories every 30 minutes. In the first 24 hours, it ingested &lt;strong&gt;467 raw events&lt;/strong&gt; — advisories affecting npm packages — and normalized all of them into structured threat candidates with confidence scores.&lt;/p&gt;

&lt;p&gt;Each signal is scored across five dimensions: source quality, technical specificity, cross-reference validation, discussion velocity, and ecosystem relevance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Exposure Matching
&lt;/h3&gt;

&lt;p&gt;Out of 467 normalized signals, the Exposure Engine matched &lt;strong&gt;1 against GloriaPPT's actual dependency tree&lt;/strong&gt;: axios.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Package:&lt;/strong&gt; axios&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Installed version:&lt;/strong&gt; 1.14.1&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confidence score:&lt;/strong&gt; 65/100&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exposure score:&lt;/strong&gt; 50/100&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Final risk:&lt;/strong&gt; 57/100 (alert threshold: 50)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The confidence score reflects signal quality: a high-trust source, a named package and version, and enough technical detail to treat the advisory seriously.&lt;/p&gt;

&lt;p&gt;The exposure score reflects how directly the issue touched this repo: &lt;code&gt;axios&lt;/code&gt; was a direct dependency, and the affected version was present in the lockfile.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Alert
&lt;/h3&gt;

&lt;p&gt;At 00:24 KST (UTC+9) on April 6, a Slack alert landed in #devradar-alerts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🛡️ DevRadar Guard Alert

Package: axios
Version: 1.14.1
Risk Score: 57/100
Confidence: 65
Exposure: 50
Path: direct

Signal: Compromised axios versions 1.14.1 and 0.30.4 were
reported to deliver a remote access trojan...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I didn't find out about the axios compromise from social media. The alert was waiting for me when I checked Slack.&lt;/p&gt;

&lt;h3&gt;
  
  
  Guardrail Bundle
&lt;/h3&gt;

&lt;p&gt;DevRadar Guard generates a guardrail bundle — a set of files you can drop into a repo to harden installs, guide AI coding agents, and surface risky dependency changes during review:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;CLAUDE.md&lt;/code&gt; security section&lt;/td&gt;
&lt;td&gt;Security policy for AI coding agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;.npmrc&lt;/code&gt; hardening&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ignore-scripts=true&lt;/code&gt;, &lt;code&gt;audit=true&lt;/code&gt;, registry pinning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-install hook&lt;/td&gt;
&lt;td&gt;Warns before installing packages younger than 7 days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitHub Actions workflow&lt;/td&gt;
&lt;td&gt;PR check that flags risky dependency changes (alert-only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;devradar-policy.json&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Machine-readable policy for CI/CD integration&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;GloriaPPT now uses all 8 generated guardrail files. The pre-install hook would likely have flagged the malicious &lt;code&gt;plain-crypto-js&lt;/code&gt; dependency used in the attack, since it had been published less than 24 hours earlier.&lt;/p&gt;
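
&lt;p&gt;The age check behind a hook like this can be sketched in a few lines. This is an illustration, not the generated hook itself: &lt;code&gt;is_too_young&lt;/code&gt; is a hypothetical helper, the timestamps are made up, and fetching the real publish time (the npm registry exposes per-version timestamps in its package metadata) is left out.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

MIN_AGE = timedelta(days=7)  # the 7-day threshold from the table above


def is_too_young(published_iso, now, min_age=MIN_AGE):
    """Return True when a version was published more recently than min_age."""
    published = datetime.fromisoformat(published_iso.replace("Z", "+00:00"))
    return min_age > (now - published)


# Hypothetical timestamps, chosen only to exercise both branches.
now = datetime(2026, 3, 31, 12, 0, tzinfo=timezone.utc)
print(is_too_young("2026-03-30T18:00:00Z", now))  # True: 18 hours old
print(is_too_young("2026-01-01T00:00:00Z", now))  # False: months old
```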

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dependencies monitored&lt;/td&gt;
&lt;td&gt;954&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Raw threat signals collected (first 24h)&lt;/td&gt;
&lt;td&gt;467&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Normalization success rate (first 24h sample)&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Signals matched to GloriaPPT&lt;/td&gt;
&lt;td&gt;1 (axios)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;False positives in this case study&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time from advisory to alert&lt;/td&gt;
&lt;td&gt;&amp;lt; 30 minutes (cron cycle)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Guardrail files generated&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Manual intervention during detection&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What This Doesn't Prove
&lt;/h2&gt;

&lt;p&gt;I want to be honest about what this case study shows and what it doesn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It shows:&lt;/strong&gt; A real supply chain threat was detected, matched to a real project, and surfaced as an actionable alert — automatically, without manual intervention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It doesn't show:&lt;/strong&gt; DevRadar Guard catching a zero-day before anyone else. The axios advisory was already published when my pipeline picked it up. I'm not claiming to be faster than GitHub Advisory. I'm claiming to be faster than manual monitoring — finding out from social feeds, security threads, or a post after the fact.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It doesn't show:&lt;/strong&gt; Protection against all supply chain attacks. The Signal Engine currently monitors GitHub Advisories only. Reddit, npm registry anomaly detection, and other sources are planned but not yet active in Alpha.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It doesn't show:&lt;/strong&gt; Automatic blocking. DevRadar Guard Alpha is alert-only. No PR failures, no install blocks, no surprises. You get the information; you decide what to do.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;DevRadar Guard is still in Alpha, and I'm testing it with a small number of pilot teams. Right now the pilot includes hosted monitoring on a 30-minute cycle, matched alerts in Slack or Discord, a generated guardrail bundle for the repo, and a weekly threat briefing. All I ask in return is a few minutes of feedback each week.&lt;/p&gt;

&lt;p&gt;If your project has a &lt;code&gt;package-lock.json&lt;/code&gt; and you want earlier, repo-specific visibility into supply chain incidents, the starter kit and waitlist are below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/devradar-guard/devradar-guard/tree/main/examples/starter-kit" rel="noopener noreferrer"&gt;Starter Kit on GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://tally.so/r/GxDbbL" rel="noopener noreferrer"&gt;Join the waitlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;DevRadar Guard Alpha — alert-only, no automatic blocking. You stay in control.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>npm</category>
      <category>security</category>
      <category>supplychain</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Axios Was Compromised. Here's What It Means for Your Repo.</title>
      <dc:creator>Joseph Yeo</dc:creator>
      <pubDate>Mon, 06 Apr 2026 03:58:03 +0000</pubDate>
      <link>https://dev.to/josephyeo/axios-was-compromised-heres-what-it-means-for-your-repo-1hh0</link>
      <guid>https://dev.to/josephyeo/axios-was-compromised-heres-what-it-means-for-your-repo-1hh0</guid>
      <description>&lt;p&gt;On March 31, 2026, the &lt;code&gt;axios&lt;/code&gt; npm package — with over 100 million weekly downloads — was compromised and used to distribute malware.&lt;/p&gt;

&lt;p&gt;A threat actor took over the lead maintainer's npm account, published two backdoored versions (&lt;code&gt;1.14.1&lt;/code&gt; and &lt;code&gt;0.30.4&lt;/code&gt;), and added a hidden dependency that deployed a cross-platform remote access trojan. The payload targeted Windows, macOS, and Linux. The malicious versions were live for only about three hours before they were removed.&lt;/p&gt;

&lt;p&gt;In practice, three hours was enough.&lt;/p&gt;

&lt;p&gt;Microsoft attributed the attack to Sapphire Sleet, a North Korean state actor. Google's Threat Intelligence Group confirmed UNC1069 involvement. This was a coordinated, pre-staged operation — the malicious dependency was planted 18 hours before activation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters to You
&lt;/h2&gt;

&lt;p&gt;If your &lt;code&gt;package.json&lt;/code&gt; uses caret ranges like &lt;code&gt;^1.x&lt;/code&gt;, a routine &lt;code&gt;npm install&lt;/code&gt; could have pulled the compromised version automatically. No unusual action required. Just your normal CI/CD pipeline doing what it was designed to do.&lt;/p&gt;

&lt;p&gt;Most teams would not have caught this in time.&lt;/p&gt;

&lt;p&gt;Not because they're careless, but because the tooling gap is real:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
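&lt;code&gt;npm audit&lt;/code&gt; checks installed versions against published advisories. No advisory existed when the malicious versions went live.&lt;/li&gt;
&lt;li&gt;Dependabot also relies on published advisories, and these versions came from the real maintainer account, so nothing marked them as suspicious.&lt;/li&gt;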
&lt;li&gt;Lockfiles help, but only if installs honor them (for example, &lt;code&gt;npm ci&lt;/code&gt;) and nothing regenerates them automatically.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The teams that stayed safe had one thing in common: they treated dependency management as part of their security practice, not just routine package maintenance.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Happened in Our Setup
&lt;/h2&gt;

&lt;p&gt;I maintain a project called GloriaPPT — a typical Next.js app with 954 npm dependencies. When the axios advisory dropped, I wasn't refreshing Twitter. I got a Slack alert.&lt;/p&gt;

&lt;p&gt;I built DevRadar Guard to answer one practical question fast: does this incident actually touch one of my repos? In this case, the flow looked like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Signal Engine&lt;/strong&gt; picked up the GitHub Advisory within its 30-minute collection cycle.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exposure Engine&lt;/strong&gt; matched it against GloriaPPT's &lt;code&gt;package-lock.json&lt;/code&gt;, where &lt;code&gt;axios&lt;/code&gt; was a direct dependency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guardrail Engine&lt;/strong&gt; sent a Slack alert with the risk score, confidence level, and affected version.&lt;/li&gt;
&lt;/ul&gt;
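
&lt;p&gt;The exposure step in the middle of that flow can be approximated as a lookup in the lockfile. The sketch below is not the Exposure Engine's code: it assumes an npm v2/v3 &lt;code&gt;package-lock.json&lt;/code&gt; (the format with a top-level &lt;code&gt;packages&lt;/code&gt; map), &lt;code&gt;find_exposure&lt;/code&gt; is a hypothetical name, and the sample lockfile is invented for illustration.&lt;/p&gt;

```python
import json

# Versions named in the advisory.
COMPROMISED = {"axios": {"1.14.1", "0.30.4"}}


def find_exposure(lock, compromised=COMPROMISED):
    """Return (name, version, path) for each affected entry in a lockfile dict."""
    packages = lock.get("packages", {})
    root = packages.get("", {})  # the "" key describes the project itself
    direct = set(root.get("dependencies", {})) | set(root.get("devDependencies", {}))
    hits = []
    for key, meta in packages.items():
        if not key:
            continue
        # "node_modules/a/node_modules/b" -> "b"; scoped names stay intact.
        name = key.rpartition("node_modules/")[2]
        if meta.get("version") in compromised.get(name, set()):
            top_level = key == "node_modules/" + name
            path = "direct" if top_level and name in direct else "transitive"
            hits.append((name, meta["version"], path))
    return hits


# Invented sample lockfile, not GloriaPPT's real one.
sample_lock = json.loads("""
{
  "lockfileVersion": 3,
  "packages": {
    "": {"dependencies": {"axios": "^1.14.0"}},
    "node_modules/axios": {"version": "1.14.1"},
    "node_modules/left-pad": {"version": "1.3.0"}
  }
}
""")
print(find_exposure(sample_lock))  # [('axios', '1.14.1', 'direct')]
```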

&lt;p&gt;No manual checking. No scrolling through threads or advisories. The alert landed with the information I needed to decide what to do next.&lt;/p&gt;

&lt;h2&gt;
  
  
  Axios Was One Incident. The Pattern Keeps Repeating.
&lt;/h2&gt;

&lt;p&gt;Axios will be patched. Credentials will be rotated. Postmortems will be published.&lt;/p&gt;

&lt;p&gt;But the pattern repeats. Before axios, it was ua-parser-js. Before that, event-stream. The attack surface keeps growing with every install that pulls in packages your team didn't explicitly choose or review.&lt;/p&gt;

&lt;p&gt;The question isn't whether the next supply chain attack will happen. It's whether your repo will know about it before your CI/CD pipeline installs it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Can Do Today
&lt;/h2&gt;

&lt;p&gt;Even without new tooling, these steps can reduce your risk right away:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pin your dependencies.&lt;/strong&gt; Remove &lt;code&gt;^&lt;/code&gt; and &lt;code&gt;~&lt;/code&gt; from critical packages. Use exact versions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set &lt;code&gt;ignore-scripts=true&lt;/code&gt; in &lt;code&gt;.npmrc&lt;/code&gt;.&lt;/strong&gt; In this incident, that setting would have blocked the malicious install script.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review your lockfile after every install.&lt;/strong&gt; If a new transitive dependency appears that you didn't add, investigate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit your CI/CD pipeline permissions.&lt;/strong&gt; Does your build environment need network access during &lt;code&gt;npm install&lt;/code&gt;?&lt;/li&gt;
&lt;/ul&gt;
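
&lt;p&gt;For the first step, a quick way to see which packages are still on floating ranges is to scan &lt;code&gt;package.json&lt;/code&gt; for specs that start with &lt;code&gt;^&lt;/code&gt; or &lt;code&gt;~&lt;/code&gt;. This is a minimal sketch with a hypothetical &lt;code&gt;unpinned&lt;/code&gt; helper and an invented manifest; it ignores other range syntax such as &lt;code&gt;*&lt;/code&gt; or &lt;code&gt;1.x&lt;/code&gt;.&lt;/p&gt;

```python
import json

RANGE_PREFIXES = ("^", "~")


def unpinned(manifest):
    """List (section, name, spec) for dependencies declared with ^ or ~ ranges."""
    found = []
    for section in ("dependencies", "devDependencies"):
        for name, spec in manifest.get(section, {}).items():
            if spec.startswith(RANGE_PREFIXES):
                found.append((section, name, spec))
    return found


# Invented manifest for illustration only.
sample = json.loads("""
{
  "dependencies": {"axios": "^1.14.0", "next": "15.0.0"},
  "devDependencies": {"typescript": "~5.4.0"}
}
""")
for section, name, spec in unpinned(sample):
    print(f"{section}: {name} uses range {spec}")
```

&lt;p&gt;Anything the scan reports is a candidate for an exact version; exact specs such as &lt;code&gt;15.0.0&lt;/code&gt; pass through untouched.&lt;/p&gt;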

&lt;h2&gt;
  
  
  What We're Building
&lt;/h2&gt;

&lt;p&gt;I'm also building DevRadar Guard around this workflow: early signal collection, repo exposure checks, and guardrail generation. Part of it is open source, including starter config for &lt;code&gt;.npmrc&lt;/code&gt;, pre-install hooks, &lt;code&gt;CLAUDE.md&lt;/code&gt; (security policy for AI coding agents), and GitHub Actions.&lt;/p&gt;

&lt;p&gt;DevRadar Guard is still in Alpha and runs in alert-only mode. No automatic blocking, and no surprise PR failures. You stay in control.&lt;/p&gt;

&lt;p&gt;If your team depends on npm and this workflow sounds useful, take a look at the starter kit or join the waitlist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/devradar-guard/devradar-guard/tree/main/examples/starter-kit" rel="noopener noreferrer"&gt;Starter Kit on GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://tally.so/r/GxDbbL" rel="noopener noreferrer"&gt;Join the waitlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>security</category>
      <category>npm</category>
      <category>supplychain</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
