Takayuki Kawazoe

Posted on May 26

"Why we told our AI plan generator to never split tests into a separate sub-task"

#ai #agents #ci #python

The run was marked failed. Two of the three sub-tasks merged cleanly. The third one, titled "Add tests for is_sent=True treated as read in test_inbox_service_unread_propagation.py", never finished. CI retried up to the cap, failed every time, then gave up. The whole plan was thrown out even though two of its three sub-tasks had already landed on green branches.

The fix turned out to be one paragraph in one prompt. Not a code change in the dispatcher. Not a new CI flag. Just a rule that says: if a sub-task introduces or modifies code, the unit tests for that code go in the same sub-task. The "tests as their own task" pattern is forbidden.

Here is what I observed, why the AI reached for the wrong decomposition, and the exact prompt rule that closed the gap.

What actually happened

Codens Purple has what I call a plan generator. That is the part of the system that takes one PRD or bug report and breaks it into sub-tasks. Each sub-task then gets dispatched on its own Git branch, runs in parallel with the others, and merges back to the base when its CI goes green. The piece of the plan generator that actually does the splitting is driven by what we internally call the analyze prompt, which is just the system prompt the model sees when it decides "how should this work be carved up."

On a project called opsguide-back, for one bug, the plan generator produced this triple:

1. Add tests for is_sent=True treated as read in
   test_inbox_service_unread_propagation.py
2. Fix _store_messages_batch in inbox_service.py to mark
   self-sent messages as read
3. Add sender_email exclusion to _build_activity_unread_count
   in resolver.py

If you read that as a human reviewer, it looks great. Three clean concerns, easy to review independently, no overlap in files touched. Textbook parallelization.

It died anyway. Sub-tasks 2 and 3 both finished and merged. Sub-task 1, the test-only one, kept failing CI. Its branch contained only changes to the test file. The behaviour it was asserting did not exist on that branch yet, because the implementation lived on sibling branches that this branch could not see. pytest collected the test, tried to import the helpers, and the asserted behaviour was simply not present. Retry, retry, retry, give up. Run failed.

The cruel part is that if the merge order had happened to put the test branch last, after both impl branches had landed, the test would have passed. But we cannot guarantee that order. Each sub-task races on its own.
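The visibility rule behind this failure can be sketched in a few lines. This is a toy model, not the real dispatcher: the SubTask class, the behaviour labels, and run_in_order are all hypothetical names, and I am assuming the test only depends on the _store_messages_batch fix. run_in_order models the lucky case where a branch happens to run against a base that already contains its siblings.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SubTask:
    """Hypothetical model of one planned sub-task (not the real dispatcher)."""

    title: str
    provides: frozenset[str] = frozenset()  # behaviours this branch implements
    requires: frozenset[str] = frozenset()  # behaviours this branch's tests assert


def ci_passes(task: SubTask, base: frozenset[str]) -> bool:
    """A branch sees the base plus its own changes, and nothing else."""
    return task.requires <= (base | task.provides)


def run_in_order(tasks: list[SubTask], base: frozenset[str]) -> dict[str, bool]:
    """Run CI one task at a time; each green task merges into the base."""
    results: dict[str, bool] = {}
    for task in tasks:
        results[task.title] = ci_passes(task, base)
        if results[task.title]:
            base = base | task.provides
    return results


tests = SubTask("Add tests", requires=frozenset({"self_sent_is_read"}))
fix = SubTask("Fix _store_messages_batch", provides=frozenset({"self_sent_is_read"}))
resolver = SubTask("Add sender_email exclusion", provides=frozenset({"sender_exclusion"}))

tests_first = run_in_order([tests, fix, resolver], frozenset())
tests_last = run_in_order([fix, resolver, tests], frozenset())
print(tests_first["Add tests"], tests_last["Add tests"])  # False True
```

The same test-only task passes or fails purely on ordering, which is exactly what a racing dispatcher cannot promise. A sub-task that both provides and requires a behaviour passes on any base.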

Why the AI did this

This was not a model failure. The model did exactly what every general-purpose decomposition heuristic would tell you to do. Split tests from implementation so they can move in parallel. That is correct advice for a human team, where the reviewer and the merge queue keep the order honest, and where a developer can rebase a test PR onto the impl PR before merging.

The thing the model did not know is that our dispatch system runs each sub-task on its own isolated branch. Each sub-task sees the base branch plus its own changes, and nothing else. Sibling sub-tasks' work is invisible to it until merge time. That is not a universal fact about software development. It is a property of how we, specifically, run parallel agents. Nothing in the model's training corpus tells it that this constraint applies, because most of the corpus is about human teams.

So the model reached for the most-cited decomposition pattern it knew, which happens to be wrong for our dispatcher. The mistake lived in the prompt. We had been asking the model to plan parallel work without telling it the actual rules of "parallel" in our system.

This is the general shape of a lot of AI agent failures I have hit. The agent is not bad at reasoning. It is reasoning correctly in the wrong universe, because the prompt forgot to describe the universe.

The fix

We added this block to the analyze prompt. It is the only change.

## CRITICAL: Tests live with their implementation

NEVER split tests for new behaviour into a separate sub-task. Every sub-task
that introduces or modifies code MUST also add the unit tests for that code
in the SAME sub-task. The pattern "Sub-task A: implement X / Sub-task B:
add tests for X" is FORBIDDEN.

Title heuristic: if you are about to write a sub-task title that starts
with "Add tests for ..." or "Write tests for ...", STOP and merge it
into the impl sub-task whose code it tests.

Two things are doing the work here. The first is the explicit "FORBIDDEN" framing. The second, which I think matters more in practice, is the title heuristic. The model writes the title before it writes the body. If we can get it to catch itself at the title stage, the bad plan never gets generated in the first place, so we do not have to rely on a later pass to repair it.
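The title heuristic is also cheap to check outside the model. The snippet below is a hypothetical safety net, not something our system is described as running: the function name and the regex are my own assumptions, and the two prefixes simply mirror the ones quoted in the prompt block.

```python
import re

# Hypothetical guard: matches titles that open with "Add tests for" or
# "Write tests for", the two prefixes named in the prompt rule.
TEST_ONLY_TITLE = re.compile(r"^\s*(add|write)\s+tests?\s+for\b", re.IGNORECASE)


def find_test_only_titles(titles: list[str]) -> list[str]:
    """Return the sub-task titles that look like test-only work."""
    return [title for title in titles if TEST_ONLY_TITLE.match(title)]


plan_titles = [
    "Add tests for is_sent=True treated as read in "
    "test_inbox_service_unread_propagation.py",
    "Fix _store_messages_batch in inbox_service.py to mark self-sent messages as read",
    "Add sender_email exclusion to _build_activity_unread_count in resolver.py",
]

flagged = find_test_only_titles(plan_titles)
print(len(flagged))  # 1
```

A check like this would also flag legitimate backfill titles, so it is better treated as a warning for a human to look at than as a hard rejection.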

We also rewrote the few-shot examples in the same prompt. Before, the example impl sub-task's ## Steps section only listed source-code file edits. After, every example impl sub-task lists the implementation file edit and the test file edit side by side. Roughly:

 ## Steps
 1. Edit src/inbox_service.py: in _store_messages_batch,
    set is_read=True when message.sender_email == account_owner_email.
+2. Edit tests/test_inbox_service_unread_propagation.py:
+   add unit test asserting is_sent=True self-messages count
+   as read.

That tiny diff is the part that changes behaviour. Models pattern-match very strongly on few-shot examples. If every example shows tests bundled with impl, the model produces the same shape.
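To make the bundled shape concrete, here is roughly what one such sub-task could leave on its branch. Everything in it is a simplified stand-in: the Message dataclass, the store_messages_batch signature, and the test name are assumptions for illustration, not the real opsguide-back code.

```python
from dataclasses import dataclass


@dataclass
class Message:
    """Hypothetical stand-in for the real message model."""

    sender_email: str
    is_read: bool = False


def store_messages_batch(
    messages: list[Message], account_owner_email: str
) -> list[Message]:
    """Simplified stand-in for _store_messages_batch: self-sent mail is read."""
    for message in messages:
        if message.sender_email == account_owner_email:
            message.is_read = True
    return messages


def test_self_sent_messages_count_as_read() -> None:
    """Lives in the same sub-task as the change it asserts."""
    owner = "owner@example.com"
    stored = store_messages_batch(
        [Message(sender_email=owner), Message(sender_email="other@example.com")],
        account_owner_email=owner,
    )
    assert [m.is_read for m in stored] == [True, False]
```

Because the change and its test land together, the branch is green on its own regardless of what any sibling branch is doing.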

Since the rule went in, the plan generator has stopped emitting "Add tests for ..." sub-tasks on new behaviour. The test-only failure mode is gone.

The exception

There is one shape of test-only sub-task that is still fine. If we are backfilling a regression test for code that is already on the base branch, the test-only sub-task is allowed. The reason is symmetrical to the original failure: when the implementation already exists on the base branch, a test-only branch has everything it needs to import and assert. pytest finds the function, the test runs, CI passes.

The prompt calls that out explicitly so the model does not over-apply the new rule and start refusing legitimate backfill work. The line in the prompt is roughly "the rule is about new behaviour introduced in this plan, not about all test-only sub-tasks ever."
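Stated as a predicate, the exception is small. This is a sketch under my own naming: is_allowed_test_only and the behaviour labels are hypothetical, and the real rule lives in prompt text rather than in code.

```python
def is_allowed_test_only(tested: set[str], introduced_by_plan: set[str]) -> bool:
    """A test-only sub-task is fine if it asserts nothing this plan introduces."""
    return tested.isdisjoint(introduced_by_plan)


# Behaviour labels are made up for illustration.
introduced = {"self_sent_is_read", "sender_exclusion"}

backfill_ok = is_allowed_test_only({"existing_unread_count"}, introduced)
new_behaviour_ok = is_allowed_test_only({"self_sent_is_read"}, introduced)
print(backfill_ok, new_behaviour_ok)  # True False
```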

Generalizing

The bigger lesson is that AI agents reach for human-team decompositions by default, and that is fine when your dispatch system also behaves like a human team. Many agent dispatch systems do not. Ours runs sub-tasks on isolated branches with no cross-visibility. Some teams run agents in long-lived shared worktrees. Some serialize. Each of these creates its own invisible constraint on what can and cannot be split.

The agent does not know which one you have. It cannot infer it from the codebase, because none of those constraints are encoded in the code. They live in the dispatcher.

So the work, when you start letting an agent plan parallel sub-tasks, is to spend prompt tokens drawing the line between what can be split and what cannot. For us that line was: tests for new code live with the new code. For someone else it might be: never split a migration from the code that depends on it. Or: never split a config change from the deployment that consumes it. The shape of the rule depends entirely on your dispatcher, not on the model.

The pattern I would suggest is to add a single "CRITICAL" section to the planning prompt that enumerates the constraints your dispatcher imposes. Use a title-stage heuristic so the model self-rejects bad plans before generating the body. Rewrite the few-shot examples to demonstrate the right shape, because that is what the model actually copies.
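One way to keep that section honest is to generate it from a list of constraints instead of hand-editing prose. The sketch below is an assumption about how that could look: DispatcherConstraint, render_critical_section, and the exact wording are hypothetical, and the second constraint is just the migration example from above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DispatcherConstraint:
    """Hypothetical record of one thing the dispatcher cannot split."""

    rule: str
    title_prefixes: tuple[str, ...] = ()  # titles that should trigger a self-check


def render_critical_section(constraints: list[DispatcherConstraint]) -> str:
    """Build a single CRITICAL block for the planning prompt."""
    lines = ["## CRITICAL: Dispatcher constraints", ""]
    for constraint in constraints:
        lines.append(f"- {constraint.rule}")
        if constraint.title_prefixes:
            quoted = " or ".join(f'"{p} ..."' for p in constraint.title_prefixes)
            lines.append(
                f"  Title heuristic: if a sub-task title starts with {quoted}, "
                "STOP and merge it into the sub-task it depends on."
            )
    return "\n".join(lines)


section = render_critical_section(
    [
        DispatcherConstraint(
            rule="NEVER split tests for new behaviour into a separate sub-task.",
            title_prefixes=("Add tests for", "Write tests for"),
        ),
        DispatcherConstraint(
            rule="NEVER split a migration from the code that depends on it.",
        ),
    ]
)
print(section)
```

Keeping the constraints as data also gives you one place to look when a new failure mode needs its own sentence.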

We rebuild Codens with Codens. Every prompt rule like this one came from watching a real run fail and adding the one sentence that would have prevented it. If you want to see how the parallel planner works end to end, the English landing page is at https://www.codens.ai/en/.
