Elena Romanova
AI-Assisted Development: When orchestration starts collapsing under its own weight

I thought adding more control would make my AI-assisted coding workflow more reliable. Instead, it made the whole system heavier, slower, and more fragile. This article is about that failure — and about the redesign that finally made the workflow stable enough to keep using.

This is the second article in the series. For more context about the experiment, please check AI-Assisted Development: Part I


After my first attempts to build an agentic workflow that could be used not only for nice demos, but for something closer to real enterprise development, it became clear that moving forward “as is” made no sense.

The first working version showed that orchestrating roles really can improve the result compared to one big generation. The code became more structured, intermediate artifacts started to appear, handoffs (passing results and context from one agent to another) became part of the flow, and there was at least some process discipline. But the limits also became visible: requirements were getting lost on the way, the review gate was still too soft, and confidence in the result was often higher than the result itself deserved.

That is why the next iteration was not just “one more version” for me. It was an attempt to make the workflow more controllable. At this stage, I was not trying to improve code generation itself as much as I was trying to strengthen control around it: preserve requirements, add more independent validation, and make the process itself easier to understand from one version to another.

Agentic flow planned for this iteration

What I was trying to fix after the first version

Drifting requirements

One of the most visible problems in the first version was that some requirements simply disappeared on the way. The original prompt, the user story, and the final implementation no longer fully matched each other. The system could generate fairly clean code, but that still did not mean it preserved the actual meaning of the task all the way through.

To reduce this drift, I added a requirement lock — a separate layer of explicitly fixed constraints that the Team Lead had to pass further down the chain. The idea was simple: models should not “fill in the blanks” for things that look unimportant to them, and they should not silently narrow the task into a shape that is easier for them to implement.

```
## Requirement Lock
Before delegating, extract and pass a requirement lock that captures:
- Source-of-truth inputs and configuration dependencies
- Required request/response contract constraints
- Default behaviors and mandatory validations
- Explicit exclusions and non-goals
- Unresolved questions that must not be guessed away
Read `documentation/project-overview.md` and scan `documentation/constitution.md` principle titles for constraint relevance. Ask clarification only when docs do not resolve ambiguity.
```
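To make the idea more concrete, here is a minimal sketch of a requirement lock as a structured artifact. The field names are illustrative, not the exact schema from my repo:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a requirement-lock artifact.
# Field names are illustrative, not the exact schema used in the project.
@dataclass
class RequirementLock:
    source_inputs: list[str] = field(default_factory=list)         # source-of-truth inputs, config dependencies
    contract_constraints: list[str] = field(default_factory=list)  # request/response contract constraints
    defaults_and_validations: list[str] = field(default_factory=list)
    exclusions: list[str] = field(default_factory=list)            # explicit non-goals
    open_questions: list[str] = field(default_factory=list)        # must be clarified, never guessed away

    def is_delegable(self) -> bool:
        """The Team Lead may delegate only when no unresolved questions remain."""
        return not self.open_questions

lock = RequirementLock(
    contract_constraints=["GET /issues returns a JSON array"],
    open_questions=["Which GitLab project ID is the default?"],
)
```

The point is not the exact fields, but that the lock travels down the chain as data, so agents cannot silently drop constraints or guess away open questions.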

Validation was still too soft

In the previous stage, responsibility for judging the quality of the result still fell mostly on the Team Lead, and that was not enough.

So I added a separate Reviewer agent whose job was to validate the implementation plan written by the Architect against the requirements and project documents, and then review the final implementation itself: check whether it followed the plan, whether it respected the general project rules, and whether the code actually worked instead of just looking believable.

At the same time, I decided to strengthen the final quality gate:

  • add Sonar as a required part of final validation
  • introduce mandatory smoke tests (a quick check that the application starts and the main scenario actually works) before passing the implementation further down the chain
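A smoke test here can be very small. The sketch below assumes the endpoint path and port from my GitLab-issues feature; both are illustrative:

```python
import json
import urllib.request

def looks_like_issues_payload(raw: str) -> bool:
    """Minimal shape check: the endpoint must return a JSON array of objects."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, list) and all(isinstance(item, dict) for item in data)

def smoke_check(base_url: str) -> bool:
    """Happy-path smoke test: the app is up and the main endpoint answers."""
    try:
        with urllib.request.urlopen(f"{base_url}/issues", timeout=5) as resp:
            return resp.status == 200 and looks_like_issues_payload(resp.read().decode())
    except OSError:  # connection refused, timeout, HTTP error
        return False

if __name__ == "__main__":
    print("SMOKE OK" if smoke_check("http://localhost:8080") else "SMOKE FAILED")
```

Green unit tests alone do not prove this; the check above is the cheapest possible proof that the application actually serves the main scenario.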

No transparency between versions

Once the workflow configuration started changing quickly, I myself started losing track of what was actually helping and what was not. So I added run logs for each iteration — a way to record not only what changed in the setup, but also what really happened during each run.
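The run log itself can be as simple as an append-only JSON Lines file. This is my illustration of the idea, not the exact format in the repo:

```python
import json
from datetime import datetime, timezone

# Minimal run-log writer, appended once per iteration.
# The schema is illustrative, not the exact format used in the repo.
def log_run(path: str, version: str, config_changes: list, outcome: dict) -> None:
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "version": version,
        "config_changes": config_changes,  # what changed in the setup
        "outcome": outcome,                # what actually happened during the run
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")  # JSON Lines: one run per line
```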

The code for the final iteration is available on GitHub


The workflow looked stricter — but became heavier

Scorecard for version v2.0.0

If you only looked at the final code, the next iteration really was a step forward.

The generated implementation matched the technical task much better. The task itself was quite simple: an API that returns a list of GitLab issues. This time the result was noticeably closer to the original intent. The code was more structured and covered with tests, the application started successfully, and the endpoint really returned a valid result.

Compared to the previous version, this was already a meaningful improvement.

But the main failure of this iteration was not in the code itself. It was in the process.

What went wrong

  • the full cycle took more than two hours
  • there were four return loops between Team Lead and Coder
  • the final review artifact ended up corrupted
  • some of the control mechanisms added more friction than actual stability

And at that point something became much more obvious: good, or at least acceptable, code does not automatically mean that the orchestration system itself is reliable.


Context rotting became the first inevitable cost of complex orchestration

Execution flow for v2.0.0

The main problem turned out to be context rotting — the degradation of quality caused by overloaded context, too many artifacts, and compaction inside a long session.

As the workflow became more complex, the number of artifacts that agents had to read, pass to each other, and generate again also grew. On top of that there was chat history. By the time I reached the second call to the Coder, I could already see signs of compaction, and it was clearly starting to affect agent performance.

Formally, there were now more control points. But that did not make the system more stable — it made it heavier.

Why formal control did not create stability

The problem was not only the size of the context.

Yes, Sonar turned out to be too expensive and too noisy as a quality gate for this kind of process. Yes, the wrong model was again selected for the Architect role. Yes, the implementation plan was still not specific enough and did not help the Coder as much as it should have. Yes, the Coder remained the weakest part of the system: it returned invalid results, skipped the required structured handoff (a structured batch summary with changed files, checks, and status), got lost in terminal commands, and in general consumed a disproportionate amount of time compared to the real value of the feature.

But these were no longer separate small failures. They were symptoms of a much bigger problem: I was trying to strengthen the workflow by adding more roles, more artifacts, and more review stages. As a result, the system became not more reliable, but heavier.

By the end of that run, one thing was clear to me: reasonably valid code alone does not mean the orchestration is successful. Generation time is a resource just like tokens are. If a simple feature requires more than two hours, several repeat loops, and still does not produce a stable final review, then the problem is no longer in one specific agent. It is in the process design itself.


The next step was not “more control”, but a lighter system

The next iteration was no longer just a series of small fixes. It was an attempt to redesign the workflow so that it would consume less context, create fewer chances for drift, and rely on more deterministic validation.

1. Lighter handoff artifacts

The main goal was to reduce the risk of context rotting.

To do that, I:

  • replaced heavy markdown handoff artifacts (artifacts passed between agents) with JSON wherever possible
  • kept markdown only where the artifact really had to be readable as a document by a human — for example, user story and implementation plan
  • shortened and simplified the agent prompts
  • cleaned up part of the context documentation
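For illustration, a lightweight JSON handoff can look like this. The keys are my assumptions, not the project's exact handoff schema:

```python
import json

# Illustrative JSON handoff from Coder back to Team Lead.
# Keys are assumptions, not the project's exact schema.
handoff = {
    "agent": "coder",
    "batch": 2,
    "status": "done",
    "changed_files": ["src/main/java/IssueController.java"],
    "checks": {"unit_tests": "passed", "build": "passed"},
    "notes": "Implements GET /issues happy path only",
}

# Compact serialization: the same state costs far fewer tokens than markdown prose.
payload = json.dumps(handoff, separators=(",", ":"))
```

A structure like this is also trivially machine-checkable: a missing `status` or empty `changed_files` can be caught before the next agent ever reads it.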

2. Circuit breaker and red card logic

Another important change was a circuit breaker — a mechanism that prevents the same fix loop from running forever.

Now the Team Lead had to re-check the Coder’s results, and the system got a red card mechanism: if an agent returned a false-positive or incomplete result several times in a row, the task would not just go through the same loop again — it would go back to the Architect for plan revision.

This was a simple change, but it was the first time the workflow started behaving more like a real engineering escalation process instead of an endless retry loop.
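The mechanism is easy to sketch. The snippet below assumes a simple per-agent failure counter; the threshold and routing names are illustrative:

```python
# Sketch of the red-card circuit breaker: a per-agent failure counter.
# Threshold and routing decision names are illustrative.
RED_CARD_THRESHOLD = 3

class CircuitBreaker:
    def __init__(self, threshold: int = RED_CARD_THRESHOLD):
        self.threshold = threshold
        self.failures: dict = {}

    def record(self, agent: str, ok: bool) -> str:
        """Decide where the task goes next after an agent's result is re-checked."""
        if ok:
            self.failures[agent] = 0
            return "continue"
        self.failures[agent] = self.failures.get(agent, 0) + 1
        if self.failures[agent] >= self.threshold:
            self.failures[agent] = 0
            return "escalate_to_architect"  # red card: revise the plan instead of retrying
        return "retry_same_loop"
```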

3. Local tooling instead of heavy external gates

Tooling changed a lot too. I moved away from a heavy external quality gate and switched to a local toolset:

  • Checkstyle
  • PMD/CPD
  • SpotBugs
  • JaCoCo check

I also created separate test instructions with clearer test types:

  • unit
  • component
  • integration

And to reduce terminal chaos, I added local scripts:

  • [verify-quick.sh](https://github.com/for-alisia/delivery-flow/blob/feature-1-v2.1.0/scripts/verify-quick.sh)
  • [quality-check.sh](https://github.com/for-alisia/delivery-flow/blob/feature-1-v2.1.0/scripts/quality-check.sh)

Their job was very practical: replace a stream of random terminal commands with more predictable and repeatable validation steps.
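As a rough sketch of the same idea in Python (the real scripts in the repo are shell, and I am assuming a Maven project with the standard plugin goals):

```python
import subprocess

# Deterministic local quality gate, assuming a Maven build.
# The real verify-quick.sh / quality-check.sh are shell scripts; the point is
# the same: a fixed, repeatable sequence instead of ad-hoc terminal commands.
QUICK_STEPS = [
    ["mvn", "-q", "test"],
    ["mvn", "-q", "checkstyle:check"],
]
FULL_STEPS = QUICK_STEPS + [
    ["mvn", "-q", "pmd:check", "pmd:cpd-check"],
    ["mvn", "-q", "spotbugs:check"],
    ["mvn", "-q", "jacoco:check"],
]

def run_gate(steps: list) -> tuple:
    """Run each step in order; stop at the first failure so output stays readable."""
    for cmd in steps:
        if subprocess.run(cmd).returncode != 0:
            return False, " ".join(cmd)
    return True, ""
```

Stopping at the first failure matters for agents too: a short, unambiguous failure message pollutes the context far less than pages of mixed tool output.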

4. A more useful plan for the Coder

Finally, I strengthened the requirements for the Architect.

Now I expected not just “a plan in general terms”, but something actually useful for the Coder:

  • task split into slices
  • required payload examples
  • expected class structure
  • logging expectations
  • test coverage expectations

The goal was to reduce guesswork before coding even started.
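As an illustration, one slice of such a plan might look like this for the GitLab-issues feature. All class names, payloads, and test names here are my assumptions:

```python
# Illustrative slice from an Architect plan for the GitLab-issues feature.
# All class names, payloads, and test names are assumptions.
slice_one = {
    "slice": "GET /issues happy path",
    "classes": ["IssueController", "IssueService", "GitLabClient"],
    "payload_example": {
        "response": [{"id": 1, "title": "Fix login bug", "state": "opened"}],
    },
    "logging": "INFO on request start/end, WARN on GitLab API errors",
    "tests": {"unit": ["IssueServiceTest"], "component": ["IssueControllerTest"]},
}
```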


What actually stabilized the process

This iteration did not make the workflow perfect. But it became the first one where the main problem from the previous run — overloaded and degrading context — stopped being the biggest risk.

In the previous run, the system looked stricter on paper: more artifacts, more steps, more checks. But in practice it produced the opposite effect. The context kept growing, handoff artifacts became too heavy, Reviewer Phase 2 broke, and the Coder spent too much time in loops and unstable validation.

After simplifying handoffs, reducing agent instructions, and moving toward more local and deterministic checks, the workflow finally stopped collapsing halfway through the run. Reviewer artifacts were no longer damaged. JSON handoffs really did stabilize state transfer. Context usage improved noticeably. In this run, the Coder stopped giving false-positive reports. And the total run time dropped by roughly half.

For a quick comparison, it looked like this:

Comparison between v2.0.0 and v2.1.0

The main conclusion here was simple: the problem was not only the quality of specific models, but how much operational noise the workflow itself was creating. Once the handoffs became lighter and the checks became more local, simpler, and cheaper, orchestration stopped degrading so early.

But process stability still did not mean system maturity

At the same time, this new iteration did not solve everything.

What was still a problem

The plan was still too long.

Even a good plan should not blow up downstream context. If the implementation plan itself becomes a heavy artifact, it starts hurting the very stability it was supposed to support. A harder limit was needed here — no more than 200 lines.
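A limit like this is trivially enforceable before the plan is handed down, which is exactly why it works. A minimal sketch:

```python
MAX_PLAN_LINES = 200  # the hard limit described above

def plan_within_budget(plan_text: str, limit: int = MAX_PLAN_LINES) -> bool:
    """Reject plans that would blow up downstream context before delegation."""
    return len(plan_text.splitlines()) <= limit
```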

There was still no mandatory early smoke check for the API.

The Coder still tended to treat the work as finished after tests. But green tests are not the same as a working API. Happy path and error path should be checked before Reviewer Phase 2, not later.

Handoff discipline was still not fully there.

Each batch still needed a proper structured handoff, and agents still should not be allowed to create side files outside the scope of the plan and checkpoint.

Red card logic existed, but was not fully wired into execution.

The counters and checkpoint state were already there, but the mechanism still had to become automatic and formal instead of situational.

Tooling still missed some code-style issues.

For example:

  • repeated string literals
  • inconsistent constructor styles
  • non-standardized Mockito matcher usage

So the process became more stable, but code-shape quality still needed separate attention.

Why I decided to keep this code

This was the point where something important changed for me.

This iteration was not perfect. But it was the first version that I decided not just to document and move on from, but to keep as the basis for future runs. Not as a final result, but as the first truly usable seed version for further generation.

And maybe the main lesson was not even about agents themselves. It was about the engineering practice around them.

Local tools — the ones we almost started forgetting in the era of heavy IDE mega-platforms and all-in-one systems — suddenly felt useful again. They were simple, local, understandable, and cheap to run. And that was exactly what gave the workflow the stability it had been missing.

After that, the focus shifted naturally.

Once the main context problem was significantly reduced, the next question became different: not whether the orchestration would collapse halfway through, but what kind of code it produces now — and how much that code resembles something I would actually want to maintain.
