Advancing AI-Assisted Engineering Practices

Hi everyone, my name is Dmitrii Glazkov, and I’m a Software Engineer working at Tabby.

My previous articles are here and here. I wrote them before AI became part of our everyday lives, but if you want to learn more about how the React engine works internally and about concurrent rendering, they are still relevant today.

But now we live in a very different world, and today I want to talk about how my own approach to working with AI has evolved. In that context, I want to walk through the main approaches to working with AI agents in software development, highlight the key aspects of each one, and share practical guidance you can apply in your own projects. The goal is not to review every tool on the market, but to help you understand which approaches fit which kinds of tasks – and how you can apply them in your own work.

Chat and Agents

These are both already familiar tools for most of us, and many of you probably use them in your daily work.

Chat

  • Works only with the context we explicitly provide
  • Still great for abstract questions and isolated debugging tasks
  • Especially useful in Deep Research mode, when you want to gather as much information as possible from the internet on a topic

Most of the time, it’s used for questions that are not tied to a specific project. Despite how simple it looks, it is still highly relevant, very useful in many cases, and saves a lot of time you would otherwise spend searching the internet manually.

Keep in mind that the model can return outdated information. To avoid this, use Deep Research or Web Search, or provide up-to-date documentation using tools like Context7 MCP.
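
For example, a typical MCP server entry for Context7 looks roughly like this (the package name and shape follow Context7’s own docs at the time of writing – check them for the current invocation, and note that each client keeps this config in a different place):

```json
{
  "mcpServers": {
    "context7": {
      "command": "npx",
      "args": ["-y", "@upstash/context7-mcp"]
    }
  }
}
```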

Agents

  • Have access to the repository, local tools, and can gather much more context
  • Depending on the tool, this may include rules, skills, hooks, subagents, MCP integrations, and similar mechanisms
  • Can work with CLI utilities

This is also a very familiar tool, and I think many of you already use it in your daily work. There are many different platforms with their own agents, but I’m not going to compare them here. Which one you use is mostly up to you, and it does not really change the ideas discussed further in this article.

Spec-driven Development

  • Works better on greenfield projects
  • The spec becomes the main artifact the agent works from
  • The long-term strategy for maintaining specs is often fragile

Spec-Driven Development (SDD) is an approach to AI-assisted software development where you start by writing a specification before writing code. In this model, the spec becomes the main input artifact for the agent.

In one of the better breakdowns of SDD, three levels are distinguished:

  1. Spec-first: the work starts with a well-defined specification, which the agent then follows during implementation.
  2. Spec-anchored: the specification is kept after delivery and continues to serve as a reference point for future updates and maintenance.
  3. Spec-as-source: over time, the specification turns into the primary artifact, while direct code changes by humans become minimal or disappear entirely.

One common SDD workflow (for example, in tools like Junie by JetBrains) includes three main stages:

  • Requirements (requirements.md): describe what the system is supposed to do
  • Implementation plan (plan.md): a bridge between high-level requirements and the future code
  • Task list (tasks.md): the plan is decomposed into specific small actions
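
To make the structure concrete, here is a hypothetical fragment of such a trio for one small feature (the feature, IDs, and file paths below are all made up for illustration):

```markdown
# requirements.md
Users can reset their password via an emailed one-time link.

# plan.md
Add a reset-request endpoint, a one-time token service, and a mail template.

# tasks.md
- [ ] T1: Add POST /auth/reset-request endpoint (src/api/auth.ts)
- [ ] T2: Generate and store one-time tokens with expiry (src/services/tokens.ts)
- [ ] T3: Send the reset email (src/services/mail.ts)
```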

I don’t see Spec-anchored or Spec-as-source as a good fit for large multi-contributor projects. Sooner or later, the spec stops being updated, and once that happens, it is no longer more trustworthy than the code.

What we do take from SDD is the discipline around task definition. Before asking the agent to implement something, it is useful to make the task explicit: define the goal, capture the constraints, describe the expected behavior, and break the work into smaller steps the agent can execute and track directly.

That part is genuinely useful in practice. Splitting the work into something like plan.md and tasks.md becomes especially important later, because it gives us a foundation for execution, validation, and iteration.

Improving Autonomy and Code Quality

At this point, the process looks roughly like this: we write plans, split the work into tasks, and hand those tasks off to an agent – from that point on, humans are mostly out of the loop.

That approach can work surprisingly well on simpler tasks. But on anything slightly more complex, the agent can either go completely in the wrong direction or produce code that still needs to be fixed by hand, often because we realize too late that it has drifted away from the original intent. Good planning obviously increases the chance of getting something close to ideal, but it is still not a guarantee of success.

So how do we improve the quality of the code the agent gives us?

Shift-Left for Agent Development

To do that, we can borrow classic shift-left thinking. The basic idea is simple: the earlier we validate the work and catch mistakes, the cheaper they are to fix. Originally, shift-left came out of testing and later became widespread in Agile practices. If we adapt it to agent-based development, the principle stays the same: we should not wait until the very end to discover that the agent misunderstood the task or drifted away from the original intent.

Instead, we should give it several self-check stages along the way, so it can validate its own progress, regulate its own path, and avoid producing low-quality code that we later have to fix by hand. And if an error is found at any of those stages, we move left again – back to development – fix the issue, and then try to pass all the stages again from the updated state.

Agent self-check

There are three main ways an agent can check whether it is still writing working code that satisfies the requirements, or whether it is already generating tomorrow’s tech debt backlog:

  • Linters
  • Type checking
  • Tests

The first two are obvious, and we already use them anyway. The third one, however, is something we often neglect.
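
(In a TypeScript codebase, the first two usually mean something like `tsc --noEmit` and `eslint .`, with a runner such as `vitest run` covering the third.)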

And honestly, there are understandable reasons for that. It is not always worth spending time writing tests for old code instead of building new functionality. Though sometimes, let’s be honest, we are just lazy.

But now tests become a key requirement for working effectively with an agent. Before I get to how exactly we use tests with agents, let me first try to persuade you that testing is worth the effort.

Why Tests Become the Key Dependency

In the diagram, I tried to show several parts of our work that reinforce one another. At the center is business value. We want to grow the product and the company, and, as a result, grow as engineers too.


Here is the short version of how these connections work:

Refactoring → Architecture
Any system accumulates legacy over time. Regular refactoring helps keep the architecture healthy and adaptable. It gives us a way to improve the structure of the code without changing the system’s external behavior. The smaller and more frequent the refactoring steps are, the better.

Architecture → AI Agent
If humans struggle to navigate a messy codebase, agents will struggle even more. Clean architecture and documented conventions make agent work far more reliable.

AI Agent → Time
Agents save time. They accelerate implementation and reduce the amount of routine manual work.

Time → Tests
Some of that saved time can be reinvested into tests. Writing good tests is an upfront cost, but without that investment, fast implementation quickly turns into fragile implementation.

Tests → AI Agent
Tests are what make work with AI agents reliable in practice. They give the agent an objective signal: either the change works or it does not.

Tests → Refactoring
Refactoring core functionality without tests is always risky. Tests are what make it safe.

Continuous Integration → Refactoring and Business Value
CI helps ship features faster and supports smaller, more frequent refactorings before they become stale or hard to merge.

Tests → Continuous Integration and Business Value
Tests bring confidence to releases, especially in a CI setup. They also serve as living documentation of expected behavior.

This is a closed-loop system. Every component in the loop is essential. Remove any one of them and the system degrades.

What would you add to this diagram?

Testing is a critical dependency across almost the whole system. Yes, writing tests takes extra time. But when you work with AI agents, that cost is often more than paid back by faster implementation. At the same time, tests bring enormous value to the project on their own – confidence in deployments, safe refactoring, and living documentation of expected behavior.

As projects grow, the bottleneck shifts from writing code to integrating changes quickly and safely into a system that no single contributor fully understands. This is true for humans, and it is no less true for AI agents. A human engineer can understand a lot about a project, but never the whole system at once. An AI agent operates under the same constraint, only more explicitly: it works with partial context and limited understanding of the codebase around the task.

That is why tests become more than just a quality practice. They are part of the infrastructure that makes ongoing development possible. They allow both humans and agents to make changes in a growing system without relying on complete knowledge of everything around them. For that to work, tests need to cover business requirements well enough to catch regressions, while also remaining fast, understandable, and reliable.

Another important point: tests are a way to evaluate the agent's work. Whenever a task is delegated to an agent and we want a reliable result, we need an objective signal of success. Tests provide that signal by showing whether the agent has actually accomplished what was asked.

Feedback loop

I hope I managed to convince you that tests matter. Now I want to put everything together and show what stages an agent can go through before it completes a task.

In this workflow, the agent follows a test-first approach. In my view, this is the most effective way to control an agent: tests let you verify very strictly that it did not drift away from the task and did not forget some important requirement. I am deliberately choosing test-first, not strict TDD in the classical Red-Green-Refactor sense. Later I’ll explain in more detail why.

But before writing tests, we need a preparation stage: Scaffolding. This means creating the skeleton of the future solution, which tests and implementation will then build on top of. What exactly you need to scaffold depends heavily on the task and on the language you are working in.

After that comes the test-writing stage. In my feedback loop, I ask the agent to write all tests for the task up front, and those tests should initially fail in the expected way. Here I intentionally step away from classic TDD, where you write one narrow test for a tiny piece of behavior and then write an equally narrow implementation just to satisfy that test. That approach would require constant review of the tests along the way. In my model, we review the full test set for the task once.
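
To make that concrete, here is a minimal sketch of what “all tests up front, failing in the expected way” can look like in a TypeScript project with Vitest. Everything in it – applyDiscount, the paths, the rules – is hypothetical; the point is that the scaffolding stage already left a stub, so the tests fail with a meaningful “not implemented” error instead of a random crash:

```typescript
// src/pricing.ts – the scaffold from the previous stage (a stub):
//   export function applyDiscount(price: number, rate: number): number {
//     throw new Error("not implemented");
//   }

import { describe, expect, it } from "vitest";
import { applyDiscount } from "../src/pricing";

describe("applyDiscount", () => {
  it("applies a percentage discount", () => {
    expect(applyDiscount(100, 0.1)).toBe(90);
  });

  it("rejects rates outside [0, 1]", () => {
    expect(() => applyDiscount(100, 1.5)).toThrow();
  });
});
```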

Then comes the most interesting stage: implementation.

At this point, the agent writes the code independently, without our participation. After the code is written, the agent can add any missing tests if needed, and then it enters the self-check stage:

  • type checking
  • linting
  • tests

If something fails, it loops, tries to fix the issue, and runs the checks again. And even if everything passes, its suffering is not over. It still has to call a reviewer subagent, which checks what was written, produces a list of comments, and sends the agent back into implementation to fix them.
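
If you drive this loop from a script rather than purely through prompts, its skeleton might look like the sketch below. It assumes a Node/TypeScript project; the commands, the attempt limit, and the runReviewer/fixIssues helpers are placeholders for whatever your stack and agent platform actually provide:

```typescript
import { spawnSync } from "node:child_process";

// The self-check commands, in the order the agent runs them.
const checks = ["npx tsc --noEmit", "npx eslint .", "npx vitest run"];

function runChecks(): boolean {
  return checks.every((cmd) => {
    const [bin, ...args] = cmd.split(" ");
    // A non-zero exit code means this stage failed.
    return spawnSync(bin, args, { stdio: "inherit" }).status === 0;
  });
}

// Placeholders: in the real loop these are calls to agents, not local stubs.
function runReviewer(): string[] {
  return []; // e.g. comments parsed from a reviewer subagent’s output
}
function fixIssues(comments: string[]): void {
  console.log("back to implementation with:", comments);
}

for (let attempt = 0; attempt < 5; attempt++) {
  if (!runChecks()) {
    fixIssues(["self-check failed"]); // shift left: fix, then re-run everything
    continue;
  }
  const comments = runReviewer();
  if (comments.length === 0) break; // review passed – the task is done
  fixIssues(comments); // review failed – back to implementation
}
```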

Human-in-the-Loop

Why Full Autonomy Still Breaks Down

The last thing worth discussing here is when we should inspect what our little protégé is doing.

More broadly, it helps to think of the agent as a junior teammate. You still need to frame the task properly, give it ways to validate its work, and review the result at the right moments. And if it runs into difficulties while coding, you are still there to help.

Even with a strong feedback loop, full autonomy still breaks down in practice. An agent can implement, run checks, and fix obvious issues, but only within the boundaries we originally defined. What it still cannot reliably do is decide when the task itself should be reinterpreted.

That is where human involvement remains necessary. If the agent wants to relax validation, skip tests, or change expected behavior, a human should step in.

The goal, then, is not full autonomy, but controlled autonomy: the agent can move on its own as long as it stays within the original intent, while human approval remains required whenever that intent may change.

Human-in-the-Loop in Practice

In practice, this starts with task definition, which is where the experience from SDD helps us most. From there, we take the Spec-first idea: create a plan and break the work into tasks. The better this plan is assembled and the more clearly it is broken down, the better the final result will usually be.

Instead of asking one agent to do the entire thing in one go, we hand each individual task to a separate agent. First, this helps us stay within the context window and keep the amount of context in a manageable range. This directly affects code quality: when the context becomes too large or noisy, the agent is more likely to miss constraints, misunderstand the task, and produce worse results. Second, whenever possible, we can parallelize the work by launching several agents at once on different tasks.

In the end, tasks are written in a format like this:

[ ] [TaskID] [P?] Description with file path

P? is an optional parameter that marks tasks that can be done in parallel with other tasks.
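
For example: [ ] [T012] [P] Add retry logic to the HTTP client in src/client/http.ts – the ID and path here are invented, but the shape is what the agents read and update.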

From there, the human orchestrates the work: we prepare the plan together with the agent, break it into tasks, assign those tasks, and let the agents execute them through the feedback loop described earlier. At the very end, the agent marks the task as Done directly in the task document and can optionally leave information there that might help the next agent working on the next task.

The stages where human attention matters most are planning, scaffolding, and tests. This foundation gives the agent the best chance of producing code that will require little or no changes later.

Beyond those stages, we keep several control points for situations where the agent attempts to change the original intent. If it decides that some tests are unnecessary or that the expected behavior should change, it cannot modify them on its own — that requires human approval. The exact breakpoints differ from project to project, but the principle remains simple: the human stays the gatekeeper whenever the task itself is being reinterpreted.

At the same time, most of the implementation work happens autonomously. Each agent executes its task using shift-left principles inside a feedback loop, validating its work step by step before moving forward. This allows the agent to catch many issues early and significantly increases the chance that the resulting code will work as intended.

This is a human-in-the-loop workflow.

Agent loops

  • Agents take on different roles, do their own part of the work, and communicate with one another
  • Consumes more resources, but allows the work to be parallelized
  • Review remains the bottleneck

The main limitation of a strict human-in-the-loop process is that the human eventually becomes the bottleneck. Agents can generate and iterate on code much faster than we can continuously review, unblock, and redirect them.

This is where the Ralph loop becomes useful.

Ralph is a way of running coding agents in repeated short iterations with bounded context, external memory, and built-in validation. In practice, you run the same prompt repeatedly: the agent picks the next task from the PRD (product requirements document), implements it, commits the result, and then starts the next iteration. Instead of supervising every step manually, you come back later to working code.
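
Stripped to its core, the loop is just “same prompt, fresh process, until the work runs out”. Here is a sketch, assuming Claude Code’s non-interactive -p flag and a tasks.md that the prompt tells the agent to consult and update – substitute your own CLI and file layout:

```typescript
import { spawnSync } from "node:child_process";
import { readFileSync } from "node:fs";

const prompt = readFileSync("prompts/ralph.md", "utf8");

// Keep going while tasks.md still contains unchecked items.
while (readFileSync("tasks.md", "utf8").includes("[ ]")) {
  // A fresh process each iteration keeps the context bounded;
  // tasks.md and the git history act as the external memory.
  spawnSync("claude", ["-p", prompt], { stdio: "inherit" });
}
```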

By a happy coincidence, we already have a prompt that describes exactly this kind of iteration – the same loop outlined earlier in the Feedback loop section. It has been tested and works effectively. So the remaining problem is not the loop itself, but the human responsibilities around it – the same ones that create the bottleneck in the earlier process.

To reduce that bottleneck, responsibilities are split across agents with different roles. One agent focuses on implementation, while others can handle review, validation, or resolving blockers. The orchestrator agent is responsible for launching these subagents, assigning them tasks, and managing the overall flow from planning to final review of the whole task. In practice, it can start subagents either through a bash script – for example, by launching a new Codex or Claude process through the CLI – or through prompts that ask the platform to spawn them. The bash-script approach is stricter and keeps the orchestrator closer to the intended flow, which is both an advantage and a limitation.
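
As a sketch of that script-driven flavor (written in Node here for consistency), the orchestrator might hand each parallel-safe task to its own worker process. The prompt file, task IDs, and CLI invocation are assumptions you would adapt to your own setup:

```typescript
import { spawn } from "node:child_process";
import { readFileSync } from "node:fs";

const workerPrompt = readFileSync("prompts/implement-task.md", "utf8");

// One worker per [P] task; tasks without [P] would run sequentially instead.
function launchWorker(taskId: string) {
  return spawn("claude", ["-p", `${workerPrompt}\n\nYour task: ${taskId}`], {
    stdio: "inherit",
  });
}

for (const worker of ["T012", "T013"].map(launchWorker)) {
  worker.on("exit", (code) => console.log(`worker finished with code ${code}`));
}
```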

The core idea is simple: move more coordination and validation into the loop itself, so the human no longer has to act as the manual coordinator at every stage.

And these are the stages I distinguish for each agent:

  • not_started
  • scaffold_created
  • tests_failed_as_expected
  • tests_need_fix
  • implementation_updated
  • lint_failed
  • lint_passed
  • typecheck_failed
  • typecheck_passed
  • tests_failed_due_to_tests
  • tests_failed_due_to_code
  • tests_passed
  • review_failed
  • review_passed
  • docs_sync_required
  • docs_synced
  • blocked
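
If the orchestrator tracks these in code, a typed union keeps the transitions honest. A sketch – the names simply mirror the list above, and TaskState is a hypothetical record the orchestrator could persist between runs:

```typescript
type AgentStage =
  | "not_started"
  | "scaffold_created"
  | "tests_failed_as_expected"
  | "tests_need_fix"
  | "implementation_updated"
  | "lint_failed"
  | "lint_passed"
  | "typecheck_failed"
  | "typecheck_passed"
  | "tests_failed_due_to_tests"
  | "tests_failed_due_to_code"
  | "tests_passed"
  | "review_failed"
  | "review_passed"
  | "docs_sync_required"
  | "docs_synced"
  | "blocked";

interface TaskState {
  id: string;        // e.g. "T012"
  stage: AgentStage; // where this task currently is in the loop
  notes: string[];   // hand-off information for the next agent
}
```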

As you can see, I am not going into detail on every line, every status, or every prompt. Those are already implementation details that depend on your specific project, and you can work them out for yourself – with the help of an agent, of course.

AI-first / harness engineering

  • We build infrastructure around agents so they can work freely, and most importantly, validate themselves
  • We reduce review as a bottleneck and place more trust in code written by agents

In an effort to reduce human involvement in software delivery even further, many companies are moving toward an AI-first approach. In this model, we review less of the code written by agents and rely more on the infrastructure around them to validate their work automatically.

Harness engineering is the broader idea behind this. It is about designing the environment, expressing intent clearly, and building feedback loops that allow agents to work more reliably and with less direct supervision.

This does not replace the agent loop so much as add another layer on top of it. That layer can include, for example:

  • browser-based E2E validation that checks both functional behavior and screenshot-based comparison (see the sketch after this list);
  • access to logs and traces in tools like Datadog, so agents can verify service behavior and even investigate bugs in real environments;
  • access to debug-level logs from local runs of the application.
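
As one concrete instance of the first bullet, here is the kind of Playwright check an agent could run after implementation – the route, heading, and snapshot name are hypothetical:

```typescript
import { test, expect } from "@playwright/test";

test("dashboard renders after the change", async ({ page }) => {
  await page.goto("/dashboard");
  // Functional check: the page actually shows its content.
  await expect(page.getByRole("heading", { name: "Dashboard" })).toBeVisible();
  // Visual check: fails if rendering drifts from the stored reference image.
  await expect(page).toHaveScreenshot("dashboard.png");
});
```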

These are only a few examples, but I think the idea is clear. You can come up with many project-specific cases that would make your own agents more independent, so that your role shifts mostly toward proper task definition.

Separately, I want to highlight the importance of improving both agent code review and the quality of your skills. That is what helps prevent technical debt from accumulating and reduces the amount of manual review over time. Until you have developed genuinely good skills, make sure to review the code manually. Identify the error patterns your agents most often make in your project, and move those patterns into rules or skills.

Common Questions

Can we really let agents work on their own and still expect good code?

In some cases, absolutely. Autonomous loops tend to work best for smaller, isolated, utility-style features with clear requirements and easy validation.

Greenfield tasks can also work well, but much depends on the quality of the planning stage. For more complex work, the main factor is often the harness around the agent. If it provides strong validation, feedback, and guardrails, the agent can still work quite independently. If it does not, you will likely still need to review the code manually and fix part of it yourself.

Does this actually make development faster?

Not always in a direct comparison. On tasks with clear context, a human can still be faster. The real advantage of an agent loop is different: autonomy and parallelism. Agents can keep tasks moving for hours on their own while you are busy with meetings, reviews, or other work, and you return later for final review and adjustments. It also allows you to run several tasks in parallel and focus more on planning and supervising the work than on implementing each task yourself.

Conclusion

I want to end by stressing that code quality still matters. Whatever happens with the current AI wave, we may eventually have to go back into the code ourselves. Bad code creates a separate risk. It leads to technical debt, and it also makes the codebase harder for agents themselves to understand. As a result, they are more likely to make further mistakes.

We still do not know what the Claude Code and Codex experiments with their AI-first approaches will lead to, or what conclusions they themselves will draw from that experience – and how much trust we should place in those conclusions. Some of these bets may pay off. But some may turn out to be short-term advantages that later leave us with code that is difficult to maintain, expensive to evolve, and costly enough that we end up paying more for rework or full rewrites than we gained from the initial speed.

Everything I described in this article is a set of tools that exist today. Each one has its place and its purpose. You should not take the Ralph loop and try to hammer nails with it. We need to be fluent with all of these tools so we understand which one is the most effective for a given task.

I described mostly concepts here and did not include many concrete examples. If you would like to see the actual structure I personally use in my project or ask me any questions, feel free to reach out to me.

Resources

Book: Refactoring: Improving the Design of Existing Code – Martin Fowler, with Kent Beck
