Elena Romanova

From AI Hype to Controlled Enterprise AI-Assisted Development

Most AI coding demos answer the easiest question: can a model generate something that runs?

This experiment asks a harder one: can a multi-agent workflow produce code that stays aligned with requirements, respects project constraints, and remains reviewable enough for a long-lived enterprise codebase?

This article covers the first iteration of that experiment: a four-role orchestration setup, the artifacts around it, and the lessons that came from requirement drift, weak review gates, and over-optimistic test confidence.


The AI wave of the last few years did not pass my company by.

Presentations, workshops, ambitious plans to improve productivity by double-digit percentages before the end of the year through AI adoption — all of that quickly became part of everyday engineering life. We recently got access to GitHub Copilot, and for me that was the final signal: there is no going back. These tools are here, and I need to adapt to this new reality seriously.

I work in a large financial enterprise — thousands of engineers across the globe, strict regulatory and security constraints, projects that live for decades, and release processes that can take a week to prepare and involve a pile of tickets, approvals, and governance steps. In that world, Java 8 is still a perfectly normal tool, and getting a project on Java 17 already feels like a gift. I am a full-stack developer (Java + React) and a Delivery Lead in a 13-person team, so for me it matters a lot that our processes stay effective and our releases remain predictable.

When I started digging deeper into AI-assisted development, one problem became obvious very quickly: almost every tutorial and demo looks the same. Usually it is a React TODO app, a portfolio page, or some neat little proof of concept that a model or a swarm of agents generates from scratch. Then the video author checks whether it runs, and that is treated as success.

But almost nobody looks deeper:

  • how maintainable the code is
  • whether it actually fits the architecture
  • what the test coverage looks like
  • how it behaves in edge cases
  • whether it can be evolved further without a full rewrite

The “if it doesn’t work, delete it and generate again” approach works perfectly well for throwaway code: tiny demos, pet projects, one-page apps, or presentation-ready POCs. It is a good starting point if you want to get familiar with AI code generation. But it has very little to do with real enterprise development.

And this is where things got interesting: finding something genuinely useful for the day-to-day life of a working developer turned out to be much harder than I expected. The experiments my team started trying in practice were often chaotic and frequently brought more new problems than real value.


What problems we ran into

Unpredictable generation

Very often, debugging and fixing generated code — or simply figuring out why it did not work or behaved differently than expected — took more time than implementing the same functionality manually.

That is the moment when AI stops being an accelerator and starts becoming an expensive detour. Instead of saving time, the team gets a new layer of noise that still has to be untangled by hand.

Constant violations of project rules and engineering principles

We were slowly turning into code police: adding more and more restrictions to instructions, refining wording, documenting anti-patterns — and the model still kept either ignoring the rules or finding a new creative way to break them.

General phrases like “write production-ready code” or “follow clean architecture” sound nice, but in practice they guarantee almost nothing. Without concrete boundaries and checks, they remain far too vague.

No structured project context

Developers inside a team know a lot of things that are never explicitly written down. For example, a specific Elasticsearch query used by Flowable may not exist in the codebase as a normal .ftl file at all, but instead live inside a .zip archive in a specific directory or be accessible only through a designer UI. To a human, that is normal working knowledge. To the model, it is a blank spot.

So the model starts hallucinating, and the team compensates by piling on more context files, more instructions, more notes, and more workarounds.

The problem got worse because there was no unified team-level approach. Everyone kept their own local context files, prompt snippets, and notes based on what they personally thought was important. As a result, outputs became unpredictable and inconsistent from one developer to another.

Uncontrolled merge request style

The code started looking so unlike its supposed author that it was obvious it had been generated almost end-to-end, which meant the extra burden immediately shifted to the reviewer.

The reviewer no longer just had to assess logic and task fit. First, they had to convince themselves that the code was even safe to merge into the project, that it did not hide surprises, and that it would not introduce additional technical debt.

Very quickly it became clear that developers get especially annoyed reviewing AI-generated code when the author has not cleaned it up, adapted it, and made it feel like part of the existing codebase before opening the MR. As a result, tension inside the team grew, and trust in each other’s code dropped noticeably.

Unrealistic expectations of AI as a “magic box”

Some people — especially in management, though not only there — started treating AI as a kind of magic box: put a good prompt in, and a great result will automatically come out.

That mindset says a lot not only about inflated expectations, but also about how poorly many people still understand how models actually work, where their limits are, and why good results do not appear on their own without context, verification, and engineering control.

Sometimes this became almost absurd. At one point I was asked to take a 35-page SAD (Software Architecture Document) template in Word, run it through Opus, and generate documentation for a two-year-old project with thousands of classes.

Requests like that make it painfully obvious how little many people still understand about the real cost of context, verification, and engineering accountability.

No clear model for a large real-world project

And this was probably the biggest problem of all. Even in engineering environments, there often was no clear understanding of how to adapt all this AI enthusiasm to a large real project — one with architecture, standards, code reviews, security constraints, support burden, and a long history of decisions.

We had all seen success stories about React TODO apps. We had seen much fewer examples of how AI-assisted development could actually work in a serious team environment.


The experiment: trying to build an enterprise-friendly approach

I do not have a silver bullet.

I am not going to pretend I found some universal solution that removes all of these problems in a few hours. Quite the opposite: my goal is to show the path honestly — the hypotheses I am testing, the decisions I am making, and the mistakes I am making along the way — so others can reuse what works and avoid at least some of what does not.

At its core, I want to test whether it is possible to build an AI-assisted workflow that is reliable enough for a serious project.

What “good” looks like for me

  • Code that is as indistinguishable as possible from something written by me or a teammate
  • Reasonably stable output from one iteration to the next
  • Strong test coverage
  • Code that actually works, not just compiles
  • Null-safe implementations that handle edge cases properly
  • No need for complicated prompt engineering before every generation
  • Developer control preserved throughout the process: final responsibility for quality stays with a human, and there is always a human in the loop

This part matters a lot: I am not trying to build a magical system that “writes everything by itself.” I am trying to build a controlled process where AI helps accelerate development, while the developer remains the one who sets boundaries, validates the outcome, and owns quality.
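To make the null-safety bullet above concrete, here is a minimal, hypothetical Java sketch (the class and names are illustrative, not from the actual project) of the style I want generated code to follow: absent values are modeled with Optional instead of null, and blank input is handled explicitly rather than blowing up downstream.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical example: a lookup that never returns null and never
// throws on missing or blank input, which is the bar I expect
// generated code to meet for edge cases.
public class OrderLookup {

    private final Map<String, String> statusByOrderId;

    public OrderLookup(Map<String, String> statusByOrderId) {
        this.statusByOrderId = statusByOrderId;
    }

    // Returns Optional.empty() for null, blank, or unknown ids
    public Optional<String> findStatus(String orderId) {
        if (orderId == null || orderId.isBlank()) {
            return Optional.empty();
        }
        return Optional.ofNullable(statusByOrderId.get(orderId));
    }
}
```

The point is not the class itself but the contract: a reviewer can verify null-safety at a glance instead of hunting for NPE paths.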

Why I chose a pet project

I did not want to run these experiments directly on a work project.

First, the cost of mistakes there is too high. Second, I was interested in a harder challenge: starting from a blank slate.

In my experience, models struggle the most when they have to produce good code with no real project context at all. When there are no established conventions, no patterns to anchor to, and no domain-specific structure yet, the output becomes especially unpredictable. Quite often, it is the kind of code you want to throw away immediately, without even trying to refactor it.

So I started building my own project — Delivery Orchestrator, a Java application with an MCP server acting as its client layer.

This is not a toy application and not yet another TODO list. I deliberately chose a project that is expected to become complex, scalable, and architecture-sensitive over time. It will grow step by step — story by story, generation by generation. Readability, refactorability, package structure, class boundaries, and architecture matter from the very beginning.

But at the first stage, it serves another purpose too: it is my playground for answering one bigger question — whether AI-assisted development can be turned from a pretty demo into a controlled, reviewable, and engineering-sound process.

Where I started: the initial orchestration model

At that point, I already knew that a great prompt plus a shared instruction file would not be enough. Yes, that improves the output noticeably, and I still think it should be the first step in any project. But for a large backend codebase, it is nowhere near sufficient.

That is why I became interested in the idea of agent orchestration, which I first picked up from Burke Holland. What I liked was the core idea itself: instead of trying to squeeze everything out of one “smart” request, split the work into several roles with clear responsibility and explicit handoffs between them.

The initial idea

The first working model was fairly simple:

Team Lead — takes in the task, sets the boundaries, and assembles the final result
Product Manager — formulates the user story
Java Architect — proposes the implementation plan
Java Coder — writes the code, tests, and verification evidence

The point was not to “give agents freedom,” but the opposite — to narrow their area of responsibility. Each step had to produce a clear artifact that could be reviewed with human eyes, instead of just passing along vague context.
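The handoff idea can be sketched in Java as a pipeline where every stage consumes one explicit artifact and produces the next, so each intermediate result exists as a reviewable object rather than as implicit context. All names here are hypothetical, not taken from the repository.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of the four-role handoff. Each record is an
// artifact a human can review; each static method is one role with a
// deliberately narrow responsibility.
public class Orchestration {

    record Task(String description) {}
    record UserStory(Task task, List<String> acceptanceCriteria) {}
    record Plan(UserStory story, List<String> steps) {}
    record Delivery(Plan plan, String code, String testReport) {}

    // Product Manager: turns a raw task into a user story
    static UserStory productManager(Task task) {
        return new UserStory(task, List.of("criterion derived from: " + task.description()));
    }

    // Java Architect: turns the story into an implementation plan
    static Plan architect(UserStory story) {
        return new Plan(story, List.of("step covering: " + story.acceptanceCriteria().get(0)));
    }

    // Java Coder: produces code plus verification evidence
    static Delivery coder(Plan plan) {
        return new Delivery(plan, "/* generated code */", "tests: passed");
    }

    // Team Lead: assembles the chain and owns the final result
    static Delivery run(Task task) {
        Function<Task, Delivery> pipeline =
                ((Function<Task, UserStory>) Orchestration::productManager)
                        .andThen(Orchestration::architect)
                        .andThen(Orchestration::coder);
        return pipeline.apply(task);
    }
}
```

The useful property is that a requirement cannot vanish silently without leaving a trace in one of the intermediate records, which is exactly what a reviewer needs to diff.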

What I added right away

At this stage, I already had the basic ingredients that, to me, are required before any serious work even starts.

First, there was a shared layer of project instructions, copilot-instructions.md, defining the baseline architectural and process context.

Second, there was a dedicated code-guidance.md file with rules for code style, structure, and quality.

But the most important layer from the start was constitution.md: a set of technology-agnostic rules that should not be violated regardless of which part of the system an agent is working on. In practice, this was a layer of invariants: not “nice-to-have recommendations,” but constraints that should survive changes in models, prompts, and even the workflow itself. And interestingly, this layer has remained one of the most resilient so far: when these rules are explicit and live in the repository, agents tend to check them first.
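To give a feel for what such an invariant layer can contain, here is a hypothetical excerpt in the spirit of constitution.md (not the actual file from the repository):

```markdown
# constitution.md (hypothetical excerpt)

These rules are technology-agnostic and non-negotiable. Agents must
check them before producing any artifact.

1. Never drop a requirement silently; flag anything you cannot implement.
2. Every public behavior change ships with tests that exercise it.
3. No new dependency is added without an explicit human approval note.
4. All generated code must compile and pass the existing test suite.
```

The value of phrasing rules this way is that each one is checkable at a gate, rather than being a stylistic wish.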

On top of that, I already had:

  • intermediate artifacts between stages
  • expectations not only around code, but also tests, documentation, and signs of verification
  • dedicated folders for implementation reports and agent setup in .github/agents

So even at an early stage, the goal was not “generate an endpoint,” but create a minimally disciplined delivery flow.

The first meaningful run: what worked

One of the early setups already produced something useful.

What came out was not random code, but a fairly clean, more enterprise-style skeleton:

  • a layered Spring structure
  • basic validation
  • error mapping
  • API documentation
  • unit tests and @WebMvcTest
  • overall, a much more disciplined result than one-shot generation

That was an important moment: it became clear that orchestration really can improve the quality of the first output compared to the classic “here is a prompt, generate the whole thing” approach.

If you want to see that exact slice of the project, it is available in the feature-1-v1 branch. It already contains the early repository structure, flow-orchestrator, mcp-server skeleton, project artifacts, and the full layer of constraints this run was built on.

What this run exposed immediately

At the same time, this run also showed the limitations of the approach.

Requirements started shrinking along the way

This was probably the most important observation.

Even though the original idea and the intermediate artifacts were broader, the final implementation ended up narrower and simpler than expected. Some requirements simply disappeared during the handoffs.

So the system could already produce neat artifacts and decent code — but it still could not reliably preserve the meaning of the task all the way to the end.

There was no independent technical review

At that stage, there was no dedicated reviewer or audit layer between the plan and final acceptance. As a result, all drift detection and mismatch checking effectively fell back onto the Team Lead.

That exposed a weak spot very quickly: splitting work across roles does not automatically create quality control.

Quality gates were too soft

The tests were already there, but they created a stronger sense of reliability than the code had actually earned.

Yes, the code was checked better than in a one-shot setup. But for integration-heavy logic, it was still not enough:

  • there were not enough realistic integration tests
  • there was no full end-to-end check of the whole chain
  • there was no proper contract verification for external calls
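The contract-verification gap in particular can be made mechanical. As a deliberately simplified, hypothetical Java sketch (real contract testing would use a dedicated tool, but the principle is the same): lock the field set an external call is expected to return, and fail the gate when a response no longer covers it.

```java
import java.util.Set;

// Hypothetical contract gate: the expected field set is an assumption
// standing in for a real external API contract.
public class ContractCheck {

    // Fields the external call is assumed to guarantee
    static final Set<String> EXPECTED_FIELDS = Set.of("id", "status", "amount");

    // Passes only if the actual response covers every expected field;
    // extra fields are tolerated, missing ones fail the gate.
    public static boolean satisfiesContract(Set<String> actualFields) {
        return actualFields.containsAll(EXPECTED_FIELDS);
    }
}
```

A check this crude already catches the failure mode described above: generated code that compiles and passes unit tests while quietly assuming a response shape the external system never promised.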

The instructions existed, but they were not “locked in”

This became another key lesson that shaped the next setup.

Even if the rules already live in the repository — in copilot-instructions.md, code-guidance.md, or constitution.md — that does not mean they are truly enforced. Until a rule becomes part of an acceptance or review gate, it is still closer to a wish than to a constraint.

And this was exactly the point where I realized I needed to log not only the result, but the run itself: what the setup was, which instructions were used, where drift started, which rules actually worked, and which ones only existed on paper.

Here is what I would now consider mandatory even for an early agentic flow:

  • Requirement lock — a clear record of what must not be lost along the way
  • Intermediate artifacts between stages, not just final code
  • An audit step before final acceptance
  • Quality gates that validate more than just compilation
  • Explicit project context stored in the repository
  • An invariant layer, like constitution.md, that is not tied to a specific model or stack and defines non-negotiable rules
  • Logs for every meaningful run, so setups can actually be compared against each other
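The “requirement lock” from the list above is the one item that lends itself most directly to automation. A hypothetical Java sketch (not the repository’s actual implementation): record the requirement ids at the start of the run, then diff them against what each downstream artifact still references.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical requirement lock: captures the ids that must survive
// every handoff and reports which ones a later artifact has lost.
public class RequirementLock {

    private final Set<String> locked;

    public RequirementLock(Set<String> requirementIds) {
        this.locked = Set.copyOf(requirementIds);
    }

    // Returns the locked requirements missing from a downstream artifact;
    // an empty result means no drift at this stage.
    public Set<String> driftedFrom(Set<String> artifactRequirementIds) {
        Set<String> missing = new HashSet<>(locked);
        missing.removeAll(artifactRequirementIds);
        return missing;
    }
}
```

Run between every pair of stages, a check like this turns “requirements started shrinking along the way” from a post-hoc observation into a failed gate at the exact handoff where the drift began.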

The main takeaway from this stage

My first useful conclusion was very simple:

Good orchestration is not about having many roles and pretty artifacts. Good orchestration is about a system that does not lose requirements on the way and can hold quality on its own, instead of pushing all control back onto the human.

That was the moment I started treating each new setup as a separate experiment: tracking the rules, the result, the points of degradation, and looking not only at the code, but at how exactly the system got there.

What I decided to change in the second approach

After this setup, it became clear that the next step was not “add more instructions,” but strengthen the control loop itself.

So in the second approach, I moved toward:

  • tighter requirement preservation between stages
  • a dedicated review / audit step
  • more explicit quality gates
  • better observability of the run itself, so I could understand not only what came out, but also where exactly the system started losing quality

That became the next important turning point — no longer just role orchestration, but an attempt to make the workflow itself reviewable and reproducible.

The experiment is open in the delivery-flow repository. If you want to dig deeper, you will find not only code there, but also the orchestration layer itself: instructions, artifacts, logs, and the evolution of the workflow.

In the next article, I will show what I changed in the second setup — and what those changes actually improved in practice.

Top comments (1)

Collapse
 
ali_muwwakkil_a776a21aa9c profile image
Ali Muwwakkil

Most teams focus on whether AI can generate code that runs, but the real challenge is integrating AI-generated solutions into existing architectures. In my experience with enterprise teams, success hinges on building AI agents that not only produce code but also align with the strategic goals of the organization. The key is developing a framework for AI-assisted development that prioritizes seamless workflow integration and addresses the nuances of your current tech stack. - Ali Muwwakkil (ali-muwwakkil on LinkedIn)