AI coding starts to feel a lot more useful when you stop expecting one agent to do everything.
The real leverage is not just the model. It is the system around it. Clear roles, clear context, clear rules, and clear validation. That is what makes the whole thing more predictable, easier to trust, and much easier to scale.
Once you start thinking this way, the workflow changes. You stop asking for magic and start designing responsibility.
The support layer matters more than people think
A lot of the quality does not come from the prompt itself. It comes from the files and structures around the workflow.
These support files are what keep the system from turning into chaos. They define how work should happen, what should be respected, and how the agent should behave when the task gets bigger or more ambiguous.
You can think about them like this:
- Agent files define the default behavior and the overall boundaries
- Skills define reusable capabilities for specific kinds of work
- Subagents split bigger flows into smaller focused responsibilities
None of this exists to make the setup look advanced. The point is much simpler: reduce randomness.
When the workflow has structure, the agent stops guessing so much. It has a clearer sense of what matters, what is allowed, and how deep it should go.
Agent files
This is the base layer.
An agent file usually acts like the main operating logic for the workflow. It tells the system how to behave by default, what standards matter, what should never be violated, and how to approach tasks inside that project.
This is where you usually define things like:
- architecture boundaries
- coding conventions
- validation expectations
- things the agent should avoid
- project-specific rules that should always stay true
Without this layer, every task starts from a weaker baseline. The agent has to rediscover too much every time.
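To make that concrete, here is a minimal sketch of the base layer in practice, assuming the standing rules live in a single AGENTS.md at the repo root (the file name and layout are just an example, not a requirement): every task prompt gets the project rules prepended before anything else.

```python
from pathlib import Path

# Hypothetical layout: the project's standing rules live in one AGENTS.md
# at the repo root. The file name here is an assumption for the example.
AGENT_FILE = Path("AGENTS.md")

def build_prompt(task: str) -> str:
    """Prepend the project's standing rules to every task prompt."""
    rules = AGENT_FILE.read_text() if AGENT_FILE.exists() else ""
    return f"{rules}\n\n## Task\n{task}"

print(build_prompt("Add input validation to the signup endpoint"))
```

However you wire it up, the effect is the same: the agent never starts a task from a blank baseline.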
Skills
Skills are where the workflow starts to feel modular.
A skill is basically a reusable capability for a recurring type of work. Instead of relying on one giant generic instruction set, you create focused ways of handling specific tasks.
Examples:
- planning a feature before coding
- reviewing a diff for regressions
- writing or improving tests
- debugging a failing flow
- checking whether an implementation actually matches the original request
Skills make the workflow sharper. They help the agent switch from “general intelligence mode” into “specific job mode”.
That usually leads to better output because the task is framed with more precision.
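A rough sketch of what that modularity can look like, with made-up skill names and wording. The point is only that each skill is a small, focused instruction block instead of one generic mega-prompt.

```python
# Illustrative skill registry: each "skill" is a focused instruction block
# keyed by the kind of work. None of this is a real framework.
SKILLS = {
    "plan": "Break the request into small steps. List impacted files and risks. Do not write code.",
    "review": "Inspect the diff for regressions, missing edge cases, and unnecessary complexity.",
    "test": "Write or improve tests for the change. Prefer small, behavior-focused cases.",
}

def frame_task(skill: str, task: str) -> str:
    """Wrap a task in the instructions of one specific skill."""
    return f"{SKILLS[skill]}\n\nTask: {task}"

print(frame_task("plan", "Support CSV export on the reports page"))
```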
Subagents
Subagents are useful when one task is too broad to stay reliable inside a single thread of execution.
Instead of one agent trying to understand everything, decide everything, build everything, and validate everything, you split the flow into smaller workers with tighter scopes.
That is where things get cleaner.
A subagent can focus on one thing only. That narrower responsibility usually means less drift, less noise, and fewer weird side effects.
A good subagent is not just “another agent”. It has a very clear purpose.
For example:
- one subagent explores the codebase
- one proposes the implementation path
- one writes the change
- one reviews for quality and risk
This is where multi-agent workflows start to feel genuinely useful instead of just flashy.
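As a sketch, that four-step flow could be wired together roughly like this. `run_agent` is a stand-in for whatever model client you actually use, not a real API; the prompts are illustrative.

```python
# Illustrative pipeline only: `run_agent` stands in for your model client/SDK.
def run_agent(role_prompt: str, payload: str) -> str:
    raise NotImplementedError("wire this to your model client")

def handle_request(request: str) -> str:
    # Each stage has one narrow job and hands a small artifact to the next.
    notes = run_agent("Explore the codebase relevant to this request. Summarize what you find.", request)
    plan = run_agent("Propose the smallest safe implementation path.", f"{request}\n\n{notes}")
    change = run_agent("Implement exactly this plan. Stay inside scope.", plan)
    review = run_agent("Review the change for bugs, regressions, and drift from the request.",
                       f"Request:\n{request}\n\nChange:\n{change}")
    return review
```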
Simple starting roles are enough
You do not need a huge system to begin. In fact, starting too big usually makes things worse.
A very solid setup can start with just a few simple roles.
Planner
The planner is there to understand the task before code starts moving.
Its job is to break down the problem, map the impacted areas, identify risks, and propose the safest path forward.
A good planner reduces wasted motion. It gives the rest of the workflow direction.
Typical responsibilities:
- understand the request
- inspect relevant files
- identify constraints
- propose a small implementation path
- call out risks and unknowns
The planner should create clarity, not code.
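One way to keep the planner honest is to ask it for a structured artifact instead of prose. A rough sketch, with illustrative field names and example values:

```python
from dataclasses import dataclass, field

# Sketch of what a planner could be asked to return. Field names are illustrative.
@dataclass
class Plan:
    summary: str                                      # the change, in one or two sentences
    impacted_files: list[str] = field(default_factory=list)
    steps: list[str] = field(default_factory=list)    # small, ordered implementation steps
    risks: list[str] = field(default_factory=list)    # unknowns the builder should watch for

plan = Plan(
    summary="Add rate limiting to the public search endpoint",
    impacted_files=["api/search.py", "config/limits.py"],
    steps=["Add a limit setting", "Wrap the handler", "Add a test for the 429 path"],
    risks=["Existing clients may rely on unthrottled bursts"],
)
```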
Builder
The builder is the execution layer.
This role takes the approved path and implements it with tight scope. The key here is discipline. The builder should not redesign the whole system just because it found a prettier solution halfway through.
Its value comes from focus.
Typical responsibilities:
- implement the planned change
- stay inside scope
- preserve existing behavior where needed
- keep the solution as small and safe as possible
A good builder is not creative in the wrong places. It is precise.
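Staying inside scope is also something you can check mechanically rather than on trust. A tiny sketch, assuming the plan lists which files it expects to touch:

```python
# Sketch: flag files the builder modified that the plan never mentioned.
# File names and the plan shape are illustrative.
def check_scope(planned_files: list[str], touched_files: list[str]) -> list[str]:
    """Return any files the builder modified that were not in the plan."""
    return [f for f in touched_files if f not in planned_files]

out_of_scope = check_scope(
    planned_files=["api/search.py", "config/limits.py"],
    touched_files=["api/search.py", "api/users.py"],  # users.py was never planned
)
print(out_of_scope)  # ['api/users.py'] -> send back to the builder or escalate
```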
Reviewer
The reviewer is the friction layer, in a good way.
Its job is not to build more. Its job is to challenge what was built. That includes looking for bugs, regressions, missing edge cases, weak validation, and unnecessary complexity.
Typical responsibilities:
- inspect logic quality
- look for regressions
- find missing tests
- spot overengineering
- challenge risky assumptions
This role matters a lot because agentic workflows can produce code fast, but speed without review just means mistakes arrive earlier.
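One detail that makes this role noticeably stronger: give the reviewer the original request and the plan, not just the diff. A minimal sketch of that prompt shape, with illustrative wording:

```python
# Sketch of a reviewer prompt that sees the original request and the plan,
# not just the diff. A diff-only reviewer can judge quality but not intent.
def reviewer_prompt(request: str, plan: str, diff: str) -> str:
    return (
        "You are the reviewer. Challenge the change below.\n"
        "Check it against the ORIGINAL REQUEST first, then the plan, then the code.\n"
        "Flag regressions, missing edge cases, missing tests, and overengineering.\n\n"
        f"## Original request\n{request}\n\n## Plan\n{plan}\n\n## Diff\n{diff}"
    )
```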
Optional roles later
Once the workflow grows, adding a few extra focused roles can help.
You might add:
- a tester to focus only on validation and edge cases
- a debugger to investigate failures
- a refactorer to improve structure without changing behavior
- a documenter to keep technical decisions clear
But early on, planner, builder, and reviewer are already enough to build a strong base.
The real idea
The goal is not more agents.
The goal is better separation of responsibility.
That is the shift that makes AI coding feel less random and more like an actual engineering system. Once each part of the workflow has a clear job, quality gets easier to manage, decisions get easier to review, and the whole process becomes much more scalable.
That is when it stops feeling like prompt experimentation and starts feeling like real workflow design.
Do you want to know whether your coding agents are getting better or worse after every change?
No vibes, just numbers. I wrote about how I'm using EDD (Eval-Driven Development) to actually measure it: Stop Vibing, Start Eval-ing.
Top comments (3)
The planner/builder/reviewer separation is the part that makes the biggest practical difference in my experience. When a single agent tries to both plan and implement, it tends to "discover" the plan as it codes — which means it drifts from the original intent without realizing it. Splitting those roles forces the planning step to actually produce a concrete artifact that the builder follows, and the reviewer can catch when the implementation diverged. One thing I'd add: the reviewer role becomes dramatically more useful when it has access to the original request, not just the code diff. Without that, it tends to rubber-stamp whatever the builder produced. Have you experimented with feeding the reviewer both the plan and the user's original intent as separate inputs?
That's it exactly.
A reviewer that only sees the diff can easily validate the code without validating the intent. At that point it is checking quality, but not necessarily correctness against the original ask.
Feeding the reviewer both the original request and the plan as separate inputs makes the role much more useful. It gives it two anchors.
This resonates. I build AI agent systems and have been exploring what I call "constraint texture" — the structure around an agent doesn't just prevent mistakes, it shapes what solutions become possible.
Your observation about agent files reducing guessing is deeper than it appears. A rule like "all state mutations go through this module" isn't just a guardrail — it's an affordance. The agent stops exploring dead ends not because it's told "don't" but because the remaining search space is rich enough to work in.
Building on @novaelvaris's point about reviewers needing the original request: each handoff in Planner → Builder → Reviewer is lossy compression. The plan compresses the request. The code compresses the plan. If the reviewer only sees code, they're validating a compression-of-a-compression against nothing. Feeding back the original request restores the constraint lost in translation.
One addition: the three-role split itself has a cost. Separating planning from building gains clarity but loses the serendipity of discovering better approaches during implementation. The real question isn't "should we separate?" but "where does this factorization lose something we care about?" — and how do we compensate for that loss.