This is Part 5 of the ForgeFlow series. Part 3: The Determinism War introduced DCR. Part 4: The Information Design Gap showed how information delivery moved our pass rate from 0% to 67%.
We thought we had the answer.
In Part 4, we showed that fixing our information pipeline — zero code changes, just better PRD design — moved ForgeFlow's autonomous pass rate from 0% to 67%. We thought the lesson was clear: give the model enough context and it delivers.
Then we ran a third project. Same model. Same engine. Same careful PRD design. The pass rate dropped to 29%.
This post is about why that happened and what it taught us about measuring AI coding agents.
What Went Wrong on Project C
Project C was a bookmark API with many-to-many relationships — more complex than a simple CRUD, but not wildly different in structure. We applied everything we learned from Project B: explicit context files per task, detailed descriptions, proper test scenario format.
The PRD had sixteen tasks. In the first seven executed tasks, only two passed autonomously — 2 / 7, or 29%. The other five required manual intervention.
The failures weren't the same as Project A's. In Project A, the model hallucinated imports and invented fixtures because it couldn't see the project. In Project C, it could see the project — but it kept hitting runtime patterns our prompt pipeline had not exposed clearly enough:
- Pydantic's `HttpUrl` adds a trailing slash that breaks equality checks
- FastAPI's router prefix with `"/"` vs `""` causes 307 redirects
- SQLAlchemy async many-to-many relationships trigger `MissingGreenlet` errors
- `create_async_engine` lives in `sqlalchemy.ext.asyncio`, not `sqlalchemy`
I would not classify these primarily as intelligence failures. They looked more like behavioral knowledge gaps — framework-specific quirks that context files alone didn't expose.
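To make the first quirk concrete, here's a minimal sketch (assuming Pydantic v2; the `BookmarkCreate` schema is a simplified stand-in for Project C's actual model):

```python
# Minimal reproduction of the HttpUrl trailing-slash quirk (Pydantic v2 assumed).
from pydantic import BaseModel, HttpUrl

class BookmarkCreate(BaseModel):
    url: HttpUrl

bm = BookmarkCreate(url="https://example.com")
print(str(bm.url))                                        # "https://example.com/" -- slash appended
print(str(bm.url) == "https://example.com")               # False: naive equality check breaks
print(str(bm.url).rstrip("/") == "https://example.com")   # True: the workaround noted later in this post
```

Nothing in the task description or the context files hinted at this; it only shows up when you have been bitten by it before.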
Why Context Files Weren't Enough
This forced us to confront a limitation in our Part 4 approach.
Context files are static. They're defined when you write the PRD — before any code exists. By TASK-007, the project has files that weren't there when the PRD was written. The model can't see them unless someone manually updates the list.
For example, TASK-007 needed to create a route that depended on models and schemas generated by TASK-003 and TASK-005. But those files didn't exist when the PRD was written. The context file list was correct at design time — and stale by execution time.
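One mitigation we're experimenting with is resolving the context list at execution time instead of design time. A rough sketch follows; the glob patterns and the `resolve_context_files` helper are illustrative, not ForgeFlow's actual implementation:

```python
# Sketch: expand context patterns against the repo as it exists *now*, so files
# created by earlier tasks (e.g. TASK-003's models) are picked up automatically.
from pathlib import Path

def resolve_context_files(repo_root: Path, context_globs: list[str]) -> list[Path]:
    files: list[Path] = []
    for pattern in context_globs:
        files.extend(sorted(repo_root.glob(pattern)))
    return files

# The PRD stores patterns rather than a fixed file list.
context = resolve_context_files(Path("."), ["app/models/*.py", "app/schemas/*.py"])
```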
And even when context files are current, they show the model what code exists. They don't teach the model how the runtime behaves. No amount of reading `bookmark.py` will tell you that `bookmark.tags.append(tag)` triggers a synchronous database call inside an async context.
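For illustration, here is a hedged sketch of that exact trap and the pattern that avoids it (SQLAlchemy 2.x async assumed; the `Bookmark`/`Tag` models are simplified stand-ins, not Project C's code):

```python
# The trap: appending to a lazily-loaded relationship triggers a synchronous
# lazy load inside the async session -> MissingGreenlet. Eager-loading the
# relationship first avoids the hidden I/O call.
from sqlalchemy import Column, ForeignKey, Table, select
from sqlalchemy.ext.asyncio import AsyncSession  # the asyncio extension, not top-level sqlalchemy
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column, relationship, selectinload

class Base(DeclarativeBase):
    pass

bookmark_tags = Table(
    "bookmark_tags", Base.metadata,
    Column("bookmark_id", ForeignKey("bookmarks.id"), primary_key=True),
    Column("tag_id", ForeignKey("tags.id"), primary_key=True),
)

class Tag(Base):
    __tablename__ = "tags"
    id: Mapped[int] = mapped_column(primary_key=True)

class Bookmark(Base):
    __tablename__ = "bookmarks"
    id: Mapped[int] = mapped_column(primary_key=True)
    tags: Mapped[list[Tag]] = relationship(secondary=bookmark_tags)

async def add_tag(db: AsyncSession, bookmark_id: int, tag: Tag) -> None:
    result = await db.execute(
        select(Bookmark)
        .options(selectinload(Bookmark.tags))   # load "tags" up front
        .where(Bookmark.id == bookmark_id)
    )
    bookmark = result.scalar_one()
    bookmark.tags.append(tag)  # safe: the collection is already in memory
    await db.commit()
```

This is behavioral knowledge: the fix is one `options()` call, but knowing you need it comes from experience or failure analysis, not from reading the files.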
We realized we needed two different kinds of information, not one:
| Type | What it provides | Source | Example |
|---|---|---|---|
| Structural information | What files exist, what they export, how they relate | Context files, repo maps | "BookmarkCreate has a field url: HttpUrl" |
| Behavioral knowledge | How the runtime actually works, framework quirks, patterns to avoid | Accumulated experience, failure analysis | "HttpUrl adds a trailing slash — use str(url).rstrip('/') for comparison" |
Project B's success came from fixing structural information. Project C's failures seemed to come largely from missing behavioral knowledge. Both feed into what the model needs, but they come from different places and accumulate differently.
Two Axes, Not One
This is what led us to think about AI coding agent performance along two axes instead of one.
After Part 3, we had DCR — the ratio of decisions handled deterministically. At 85%, the model's job was narrow: just write the code.
After Part 4, we had the Information Design Gap — the model needs enough context to do that narrow job.
Now, after three projects, we're working with a slightly more structured version:
Axis 1: DCR (Deterministic Coverage Ratio) — how much of the decision surface is handled without the model. This is the scaffolding.
Axis 2: Information Quality (IQ) — how well the model is equipped for the decisions it does handle. This is the fuel.
This is not a measured equation — just the mental model that best explains our runs so far:
System Reliability ≈ DCR × Information Quality
DCR narrows the blast radius. IQ determines how well the model performs inside it. In our data so far, you need both. Neither alone has been sufficient.
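Purely as an illustration of the mental model — not a fitted equation — here's what plugging in our rough numbers looks like:

```python
# Back-of-envelope only; the values are the rough estimates quoted later in
# this post, and the product is a mental model, not a calibrated predictor.
def estimated_reliability(dcr: float, iq: float) -> float:
    return dcr * iq

print(estimated_reliability(0.85, 0.15))  # Project A: ~0.13 (observed pass rate: 0%)
print(estimated_reliability(0.85, 0.80))  # Project B: ~0.68 (observed: 67%)
# Project C: structural IQ was ~0.80 but behavioral IQ was low, which is why a
# single IQ number understates the problem -- see the Project C discussion below.
```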
Three Dimensions of Information Quality
After analyzing our failures across three projects and reading through recent work on LLM-based code generation, we've found that "the model didn't get enough information" breaks down into at least three problems:
Dimension 1 — Availability: Does the information exist in the context window?
This was Project A's problem. The model received ~240 tokens of task-relevant content. The information existed on disk — the orchestrator just never loaded it.
Dillon & Varanasi (2026) observed something similar. They measured whether generated code follows team-level architectural decisions. When a decision was visible in the files the model received, compliance was near-perfect. When it existed only in documents the model never saw, compliance dropped to zero.
Dimension 2 — Selection: Is irrelevant information excluded?
More context isn't always better. Alonso et al. (2026) found that adding procedural TDD instructions increased regressions, while a targeted test map reduced them significantly. The practical lesson for us was simple: token budget is finite.
Hu et al. (2026) quantified this from the other direction. In their cross-file code generation benchmark, 62% of functions didn't need cross-file context at all. The skill is knowing which information to include.
Dimension 3 — Structure: Is the information formatted for the model to use?
This is the counterintuitive one. Information can be available and selected correctly but still fail because of how it's structured.
Hu et al.'s ablation showed this clearly. "Inlined" context (dependencies inserted at relevant code locations) versus "prepended" context (same information at the top of the prompt) — same information, different structure. Removing the inlining degraded performance to nearly the same level as removing the context entirely.
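A toy illustration of the difference (the prompt shapes are hypothetical, not Hu et al.'s actual benchmark format):

```python
# Same dependency text, two deliveries. Hu et al. report that removing the
# inlining hurts almost as much as removing the context entirely.
dependency = "class BookmarkCreate(BaseModel):\n    url: HttpUrl"
task = "def create_bookmark(payload: BookmarkCreate) -> Bookmark: ..."

# Prepended: context dumped at the top of the prompt, far from where it matters.
prepended = f"# Repository context\n{dependency}\n\n# Implement:\n{task}"

# Inlined: the same context placed at the code location that depends on it.
inlined = f"# Implement:\n# payload schema (from schemas.py):\n{dependency}\n{task}"
```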
Chinthareddy (2026) found a similar pattern with code retrieval. On a set of architectural queries, a deterministic AST-derived knowledge graph scored 100% correctness while a vector-similarity approach on the same codebase scored 40% (on the Shopizer benchmark suite). The gap came from how relationships were structured, not what information was available.
How the Two Axes Interact
Here's why we think you need both:
| Scenario | DCR | IQ | What happens |
|---|---|---|---|
| A | High | High | System has a chance to work. Deterministic decisions are correct, model has what it needs. |
| B | High | Low | Deterministic decisions are correct, but the model is flying blind in its narrow lane. |
| C | Low | High | Model generates good code, but the system mishandles it — wrong task order, broken gates, environment failures. |
| D | Low | Low | Failures become hard to diagnose. This may be what "the model isn't smart enough" often looks like. |
Our Project A was Scenario B. DCR was 85% — the harness was solid. But Information Quality was ~15%. The model couldn't do its job because it couldn't see the project.
Project B was closer to Scenario A. Same DCR, but on the availability dimension our estimate of delivered information quality rose to roughly 80%. The model had enough context to complete most of the tasks that fit the orchestration loop.
Project C showed us that IQ itself has layers. Structural availability was good (~80%), but behavioral knowledge was missing. The two-axis model held — DCR was fine, IQ was the bottleneck — but the nature of the IQ problem was different from Project A.
A Practical Diagnostic
If you're building or evaluating an AI coding agent, here's the check we now run on our own system:
Step 1: Measure your DCR. List every decision point in one execution cycle. For each one, ask: is this resolved by deterministic code, or does it depend on model output? Count the ratio. If it's below 50%, the scaffolding likely needs reinforcement before the model can succeed.
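A minimal sketch of that bookkeeping (the decision list here is illustrative; the real inventory comes from walking your own execution cycle):

```python
# Each decision point is tagged with who resolves it. DCR is just the ratio.
decisions = {
    "which task runs next": "deterministic",          # orchestrator picks it
    "which files are loaded as context": "deterministic",
    "whether the gate passes": "deterministic",       # pytest decides, not the model
    "what code goes in the diff": "model",
    "how a failing assertion gets fixed": "model",
}

dcr = sum(v == "deterministic" for v in decisions.values()) / len(decisions)
print(f"DCR = {dcr:.0%}")  # below ~50% -> reinforce the scaffolding before blaming the model
```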
Step 2: Dump the prompt. Not the template — the actual string that reaches the model at inference time. Read it as if you're a developer seeing this codebase for the first time. Can you write the code from this prompt alone? If you can't, the model can't either.
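One way to build that habit into the pipeline is to write the exact payload to disk before sending it. A sketch assuming a local Ollama endpoint; the wrapper name and file paths are ours, not part of any library:

```python
# Logs the exact string sent to the model -- the prompt, not the template.
from pathlib import Path

import requests  # plain HTTP against Ollama's /api/generate; swap in your client of choice

def generate_and_log(prompt: str, model: str = "qwen3",
                     log_dir: Path = Path("prompt_dumps")) -> str:
    log_dir.mkdir(exist_ok=True)
    (log_dir / "last_prompt.txt").write_text(prompt)  # read this as if you were the developer
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["response"]
```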
Step 3: Diagnose by axis.
- High DCR, low pass rate → Information Quality problem. Check: are context files loaded? Are descriptions specific enough? Are test assertions reaching the model?
- Low DCR, inconsistent results → Structural problem. The model is making decisions that should be deterministic. Move those decisions into code.
- Both seem fine, still failing → Might be a genuine model capability limit. Only after ruling out the first two does a model upgrade become a reasonable hypothesis.
We jumped to Step 3 after Project A. "Qwen3 can't handle JWT auth" was our first diagnosis — and it was premature. The bigger problem was that the information pipeline was effectively empty. Running Steps 1 and 2 first could have saved us weeks.
Related Work Pointing in a Similar Direction
I didn't set out to build a framework. I set out to figure out why Project A failed. But a consistent pattern kept showing up in recent work:
Alonso et al. (2026) — TDD procedure instructions hurt. Contextual test maps helped. Procedure without context was counterproductive.
Midolo et al. (2026) — Surveyed 50 developers. 14% independently reported "contextual information about other system components" as a missing factor.
Jalil et al. (2025) — Smaller models with TDD and code execution surpassed larger models without those supports.
Dillon & Varanasi (2026) — Decision compliance went from 46% to 95% by adding product context and structured specs. Cost per merge-ready task dropped 68%.
Hu et al. (2026) — Cross-file inlining improved exact match by a reported average of 29.73% on RepoExec across three backbone models. The result was model-independent.
Chinthareddy (2026) — Deterministic AST-derived code graphs achieved 100% correctness vs. 40% for vector-only retrieval on architectural queries (Shopizer suite). LLM-based graph extraction missed 31% of files entirely.
These studies don't prove our framework. But they point in a consistent direction, and our three internal runs are consistent with that direction.
What We're Not Claiming
I want to be precise about the boundaries here.
We're not claiming that model capability doesn't matter. It does — for the non-deterministic slice. A stronger model will generate better code from the same information.
We're not claiming these two metrics capture everything. Latency, cost, context window size, tool use ability — all matter. But in our limited experience, DCR and IQ have explained the largest share of variance in autonomous pass rates.
We're also not claiming this is proven. ForgeFlow is a sample size of one. We have three data points (0%, 67%, 29%) and they're consistent with the two-axis model, but three points don't make a proof.
If anyone has run similar experiments — different scaffolding levels, different context strategies, measured pass rates — I'd genuinely love to compare notes.
The Thesis, So Far
Part 3: "The bottleneck is not model capability, but the verifiability of specifications."
Part 4: "Even after verifiability is constructed, the bottleneck shifts to information delivery."
Now the version we're working with:
"An AI coding agent's reliability seems to be a product of its deterministic coverage and its information quality. Improving either without the other produces a system that is either structurally sound but informationally blind, or well-informed but structurally fragile."
Two axes. One product. Neither alone has been sufficient in our experience.
Measure your DCR. Dump your prompt. Fix the axis that's actually broken. Only after that does a model upgrade become the next reasonable hypothesis. That's the diagnostic that's worked for us. Whether it generalizes is something we're still finding out.
About
I'm Joseph YEO, building ForgeFlow from Seoul, Korea — a local AI coding agent that runs entirely on Apple Silicon, no cloud inference during execution.
This post synthesizes what I've learned from running three projects end-to-end and reading 40+ papers on LLM-based code generation. The two-axis model isn't a proven theory — it's the working diagnostic I use every time a cycle fails. I'm sharing it because it's been useful, and because I'm curious whether others are seeing the same patterns.
How are you handling the "stale context" problem as your agent modifies the codebase? Are you using repo maps, re-indexing on every task, or something else entirely? I'd love to hear what's working.
Follow along:
- 𝕏: @josephyeo_dev
- GitHub: joseph-yeo
- Site: projectjoseph.dev
All models run locally via Ollama 0.23.0 on Apple Silicon M5 Max 128GB. No cloud APIs were used during autonomous execution.
This post was drafted with Claude and edited by me.