Joseph Yeo

The Information Design Gap: Why Our AI Agent Was Coding Blind

This is Part 4 of the ForgeFlow series. Part 3: The Determinism War introduced DCR (Deterministic Coverage Ratio) and why we stopped chasing better models.


In Part 3, I proposed a hypothesis:

"The bottleneck of LLM-driven software engineering is not model capability, but the verifiability of specifications."

Then I said: "We're building the system to test it."

We ran two projects. Same model. Same engine. Same orchestrator. The autonomous pass rate went from 0% to 67%.

In our case, the fix wasn't a better model. It was giving the model enough information to do its job.


Two Projects, One Model, One Engine

ForgeFlow is a TDD orchestrator that runs entirely locally. No cloud API calls during execution. The cycle is simple: generate test (RED) → generate implementation (GREEN) → run pytest → commit or retry.
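In pseudocode, that loop looks roughly like this. This is a sketch with illustrative names (generate_test, generate_impl, the task fields), not ForgeFlow's actual API:

```python
import subprocess
from pathlib import Path

def run_task(task, max_retries=3):
    """One RED -> GREEN -> verify cycle (simplified sketch)."""
    test_code = generate_test(task)                    # RED: model writes a failing test
    Path(task["test_path"]).write_text(test_code)

    for _ in range(max_retries):
        impl_code = generate_impl(task, test_code)     # GREEN: model writes the implementation
        Path(task["impl_path"]).write_text(impl_code)

        # Deterministic gate: pytest decides pass/fail, not the model
        if subprocess.run(["pytest", task["test_path"], "-q"]).returncode == 0:
            subprocess.run(["git", "add", "-A"])
            subprocess.run(["git", "commit", "-m", f"feat: {task['id']}"])  # commit the green state
            return True
    return False                                       # escalate to a human after retries
```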

We ran it against two internal projects:

|                     | Project A: repo-jwt            | Project B: todo-api |
| ------------------- | ------------------------------ | ------------------- |
| Domain              | JWT authentication API         | Todo CRUD API       |
| Tasks               | 18                             | 12                  |
| Model               | Qwen3-Coder-Next Q4_K_M (45GB) | Same                |
| Engine              | forgeflow.py v2                | Same                |
| Autonomous passes   | 0 / 18 (0%)                    | 8 / 12 (67%)        |
| Manual intervention | 18 tasks (100%)                | 4 tasks (33%)       |

Same model. Same engine. The pass rate changed from 0% to 67%.

What changed between Project A and Project B was not the model or the orchestrator. It was the information structure of the PRD — the spec document that tells the model what to build.


The Prompt Dump That Exposed the Problem

After Project A failed across all 18 tasks, we did something we should have done much earlier: we dumped the actual prompt the model received at inference time.

Not the prompt template. Not the system prompt spec. The literal string that arrived at the model's context window.

Here's what we expected to find: a rich prompt containing the task spec, relevant source files, test fixtures, data model definitions, and dependency context.

What we actually found was a prompt of roughly 720 tokens, and only about 240 of those were task-relevant project information. The rest was role text, formatting rules, and boilerplate.

No source code. No test fixtures. No existing implementation files. The model was being asked to generate code for a project it could barely see.

In hindsight, this was an embarrassing oversight. The information pipeline existed in the code — it just wasn't wired up. But we didn't notice until we read the raw prompt.
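If you want to run the same check on your own agent, the cheapest version is to write the final string to disk right before the inference call. A minimal sketch, where llm_generate stands in for whatever your local inference call actually is:

```python
from pathlib import Path

def call_model(prompt: str, task_id: str) -> str:
    # Persist the literal string the model receives: not the template, not the spec
    Path("debug").mkdir(exist_ok=True)
    Path(f"debug/prompt_{task_id}.txt").write_text(prompt)
    return llm_generate(prompt)   # placeholder for your local inference call
```

Reading those files as a human is the whole diagnostic.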


The Five Gaps

We listed out every piece of information the model should have received but didn't. Five gaps emerged:

Gap 1 — No context in RED phase. During test generation, the context parameter was hardcoded to an empty dictionary. The model wrote tests for modules it couldn't see, importing functions that didn't exist yet, guessing at fixture structures it had no way to know.
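In simplified form, the bug looked like this (illustrative names, not the actual ForgeFlow code):

```python
# RED phase, before the fix: no project context reaches the prompt builder
prompt = build_red_prompt(task, context={})   # hardcoded empty dict

# What it should have been: resolve and pass the task's context files
prompt = build_red_prompt(task, context=load_context_files(task["context_files"]))
```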

Gap 2 — No context file list in the PRD. The orchestrator had a function ready to read context files from a task-level field. But the PRD never defined that field. So the function returned an empty list every time.

Gap 3 — Module names without signatures. The prompt listed available modules by name: todo.py, database.py. But not their contents, not their function signatures, not their class fields. The model knew that modules existed, but not what they contained.

Gap 4 — Test assertions not forwarded. The PRD included test assertion fields with precise expected behavior. The prompt builder read a different field name. The assertions existed in the spec but never reached the model.

Gap 5 — No conftest.py. In a pytest project, conftest.py defines shared fixtures — test database sessions, HTTP clients, factory functions. The model never saw it. For every task that required a test client, the model invented its own from scratch, often incompatibly.
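For a sense of what the model was missing, here is the shape of a typical conftest.py for a FastAPI-style todo API. This is illustrative only; it assumes the app object lives in app.main, which may not match the real project:

```python
# tests/conftest.py -- shared fixtures every endpoint test is expected to reuse
import pytest
from fastapi.testclient import TestClient

from app.main import app   # assumption: the FastAPI app is exported here


@pytest.fixture
def client():
    """HTTP client against the app under test."""
    return TestClient(app)


@pytest.fixture
def todo_payload():
    """A valid request body, so tests don't hand-roll JSON."""
    return {"title": "write part 4", "done": False}
```

Without this file in context, every generated test re-invented its own client setup.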


Quantifying the Gap

We measured how much relevant information actually reached the model by counting task-specific tokens:

| Source                                 | Task-relevant tokens | What it contains |
| -------------------------------------- | -------------------- | ---------------- |
| What the model received                | ~240                 | Task ID, one-line description, module names |
| What a human developer would reference | ~1,640               | Above + conftest.py, database.py, models, schemas, existing routes |

The model was operating on roughly 15% of the information a developer would use for the same task.

We started calling this the Information Design Gap: the difference between what a model could use and what the system actually delivers at inference time. Whether this framing is useful beyond our system is something we're still figuring out — but for us, it immediately clarified what to fix.


The Fix: No Code Changes

Here's the part that surprised us.

The orchestrator already had the machinery to deliver context files. A function to resolve which files a task needs — existed. A function to read those files from disk — existed. A function to format them into the prompt — existed. The prompt builder had a slot for context.

The pipeline existed. The PRD just wasn't feeding it.
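Wired together, the chain was roughly this (simplified; the real function names differ):

```python
def build_context_block(task, prd):
    paths = resolve_context_files(task, prd)     # reads the task-level field...
    files = {p: read_file(p) for p in paths}     # ...which the PRD never defined, so paths was always []
    return format_for_prompt(files)              # the prompt builder already had this slot
```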

For Project B (todo-api), we made three changes — all in the PRD, none in the engine:

1. Added a context file list to every task. Each task now listed exactly which existing files the model should see. Early tasks had empty lists. CRUD endpoint tasks included the test fixture file, the relevant model file, and the schema file.

Here's what a task spec looked like after the fix:

```yaml
- id: TASK-007
  description: >
    Create POST /api/todos.
    Use the client fixture from conftest.py.
    Return 201 with TodoResponse schema.
  context_files:
    - tests/conftest.py
    - app/models/todo.py
    - app/schemas/todo.py
```

2. Made descriptions explicit. Instead of "Create the POST endpoint", we wrote "Create POST /api/todos. Use the client fixture from conftest.py. Return 201 with TodoResponse schema." A couple of extra sentences, but they told the model exactly where to look.

3. Unified test scenario format. Aligned PRD field names with what the prompt builder actually read, so test assertions reached the model.

Total lines of code changed in the engine: zero.


What 67% Actually Means

Eight of twelve tasks passed autonomously — the model generated code, tests passed, and ForgeFlow committed without human intervention for those tasks.

Based on manual inspection, I would not classify the four manual tasks as model-capability failures. They were structural mismatches with the TDD cycle:

| Task     | Why manual                                                             | Category                        |
| -------- | ---------------------------------------------------------------------- | ------------------------------- |
| TASK-001 | Infrastructure setup: test and implementation must be created together | RED phase incompatible          |
| TASK-003 | Fixture-only: conftest.py defines fixtures, nothing to "fail" in RED   | No failing test possible        |
| TASK-010 | Validation already handled by Pydantic schema from an earlier task     | RED unexpected pass             |
| TASK-012 | Integration test: no implementation file, test-only task               | Engine assumes impl file exists |

That said, I should be honest about what we can't fully measure here. "Model-capability failure" is hard to distinguish from "subtle information gap we didn't notice." Our classification is based on manual inspection, not a controlled experiment. What we can say with confidence is that the type of failure changed completely — from hallucinated imports and invented fixtures in Project A to structural mismatches in Project B.


The Lesson: Intelligence Gap vs. Information Gap

After Project A, our diagnosis was: "The model isn't smart enough. Qwen3 at Q4 quantization can't handle multi-file JWT authentication."

That diagnosis was wrong — or at least, premature.

The model appeared to have more usable capability than our system was exposing. In this run, the difference between 0% and 67% looked less like intelligence and more like context delivery.

This completely changed how we thought about local model limitations:

|           | Intelligence Gap                               | Information Gap                                 |
| --------- | ---------------------------------------------- | ----------------------------------------------- |
| Symptom   | Model generates plausible but wrong code       | Model generates structurally incompatible code  |
| Diagnosis | "Model too small / too quantized"              | "Prompt missing critical context"               |
| Fix       | Upgrade model (expensive, diminishing returns) | Improve information design (free, compounding)  |
| Testable? | Hard: model capability is a black box          | Easy: dump the prompt, count what's missing     |

The Information Design Gap is testable. Dump the prompt. Read it as if you're a developer seeing this project for the first time. If you couldn't write the code from that prompt alone, the model can't either.


Similar Patterns in Recent Research

While writing this post, we surveyed recent research on TDD-based code generation and found similar patterns appearing independently. These don't prove our framework, but the convergence seemed worth noting.

Alonso et al. (2026) tested TDD prompting on SWE-bench Verified with a 30B local model. Adding procedural TDD instructions ("write tests first, then implement") increased regressions. Adding a graph-derived test map ("here are the specific tests at risk") reduced them significantly. Their conclusion: agents don't need to be told how to do TDD — they need to be told which tests to check.

We saw the same mechanism: telling the model what process to follow consumed context tokens that could carry actual project information.

Midolo et al. (2026) surveyed 50 developers about what makes code generation prompts succeed. Their top factors: algorithmic details (57%) and I/O format specification (44%). When asked what else was missing, 14% independently reported "contextual information about other components in the system" — which sounds a lot like the gap our per-task context file list was designed to close.

Jalil et al. (2025) showed that smaller models with TDD and a code interpreter could surpass larger models without those supports. The pattern held across model families: tests as structured context beat model scale.

Different benchmarks, different teams, different setups. They all point toward the same practical lesson: before blaming the model, it might be worth inspecting the information pipeline. Our data adds one more point in that direction.


Implications for DCR

In Part 3, I defined DCR as the ratio of deterministic decisions in an agent loop. A reader asked whether DCR should be tracked like test coverage — not just reviewed once at architecture time.

Running two projects gave us a partial answer: DCR alone wasn't enough.

ForgeFlow's DCR didn't change between Project A and Project B. It was 85% both times — same 11 of 13 decisions handled deterministically. Yet performance went from 0% to 67%.

What changed was the quality of information feeding the non-deterministic decisions. DCR tells you how narrow the model's role is. It doesn't tell you whether the model is equipped to play that role.

This is why we're now thinking about DCR in two layers:

  • Static DCR: how many decision points are designed to be deterministic. (Architecture metric.)
  • Observed DCR: how many decisions were actually resolved deterministically during real runs. (Runtime metric.)

And alongside both: Information Delivery Rate — how much of the available, relevant context actually reaches the model at inference time. Using task-relevant token delivery as a rough proxy, Project A was around 15%. Project B was much closer to the information a human developer would expect to see.
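As a working definition (ours, not a standard), it is just a ratio of token counts:

```python
def information_delivery_rate(delivered_tokens: int, available_tokens: int) -> float:
    """Share of task-relevant context that actually reached the model at inference time."""
    return delivered_tokens / available_tokens

print(information_delivery_rate(240, 1640))   # Project A: ~0.15
```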

We're still working out whether these are the right abstractions — but they've been useful for diagnosing our own failures so far.


What We're Building Next

The immediate roadmap based on these findings:

RED phase context delivery. The RED phase (test generation) was still sending an empty context when we ran these projects. We've since fixed this in the engine — the model now sees existing fixtures before writing new tests.

Automatic context inference. Right now, context files are manually specified per task in the PRD. The next step is deriving them from the dependency graph: if TASK-007 depends on TASK-005 and TASK-006, automatically include their implementation files as context. We're exploring tree-sitter-based approaches for this.
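A minimal version of that idea, assuming the PRD carries depends_on and creates fields per task (the real implementation may look different):

```python
def infer_context_files(task: dict, tasks_by_id: dict) -> list[str]:
    """Collect context files from a task's declared dependencies."""
    files = set(task.get("context_files", []))     # keep any manual overrides
    for dep_id in task.get("depends_on", []):
        dep = tasks_by_id[dep_id]
        files.update(dep.get("creates", []))       # files the dependency produced
    return sorted(files)
```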

Structural mismatch detection. Four of twelve tasks didn't fit the RED-GREEN cycle. We want ForgeFlow to detect these patterns (infrastructure setup, fixture-only, test-only) during PRD validation and handle them with a separate path — not force them through TDD.
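One way this could look during PRD validation. A sketch only; the task "kind" field is hypothetical:

```python
RED_INCOMPATIBLE = {"infrastructure", "fixture-only", "test-only"}

def route_task(task: dict) -> str:
    """Send tasks that can't produce a meaningful failing test down a non-TDD path."""
    if task.get("kind") in RED_INCOMPATIBLE:
        return "direct-generation"   # generate and verify by other means, skip RED
    return "red-green"               # normal TDD cycle
```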


The Thesis, Updated

Part 3's thesis was about structure:

"The bottleneck is not model capability, but verifiability of specifications."

Two projects later, I'd extend it:

"In our runs, even after we built verifiability, the bottleneck seemed to shift to information delivery — whether the model receives enough context to use that verifiability."

DCR gave us the harness. Information design made that harness useful. Both seem to be required. Neither alone was sufficient in our experience.

Same model. Same engine. Zero code changes. 0% → 67%. In our case, the difference was information.

Several recent studies point in a similar direction, though from different setups. The practical suggestion I'd offer: if your AI coding agent is underperforming, it might be worth checking what it's receiving before swapping the model. That's what worked for us.


About

I'm Joseph YEO, a solo builder from Seoul, Korea. ForgeFlow is my experiment in pushing local AI agents toward more reliable autonomous execution — no cloud inference during execution, no hand-holding mid-cycle.

This post covers what happened when we actually ran the system from Part 3 against real projects and discovered the gap between having a verification harness and feeding the generator enough context. I'm sharing this because I wish someone had told me to dump the raw prompt before I spent weeks blaming the model.

If you've run into similar issues — or found different solutions — I'd love to hear about it in the comments.


Built over ~33 sessions, May 2026. All models run locally via Ollama 0.23.0 on Apple Silicon. No cloud APIs were used during autonomous execution.

This post was drafted with Claude and edited by me.
