Shashank Khandelwal

Posted on Mar 2

Building the Agent Harness: Why the Environment Matters More Than the Model

#ai #systemprompt #claudecode #codex

How system prompts define agent behavior, why the codebase around the agent is what makes it productive, and where the industry is heading.

Everything I'm about to share comes from roughly a month of building with AI coding agents — not just using them, but paying close attention to why they succeed and fail.

This article covers two connected ideas. First, the system prompt — the hidden variable that defines how a coding agent actually works and behaves. Understanding it changes how you think about tools like Claude Code, Cursor, or Codex. Second, the harness — the documentation, constraints, and feedback loops baked into a codebase — and how it determines whether any agent can be productive in your repo.

Here's the thing: you don't own a tool. You can't improve Claude Code internally. You don't own a model — you can't change how it's trained. But what you do own is the environment around your code. That's the harness. And that's the part that actually compounds over time.

The Anatomy: LLM, Agent, System Prompt

Before we go deeper, let's clarify the three-layer system that powers every AI coding tool you use. Understanding what each layer does is really important, because most conversations about these tools conflate all three.

Start from the bottom. The LLM is the brain. Think of it as a very smart person sitting in a room. They can answer any question, reason through complex problems, generate text — but they cannot leave the room. They can't open a file. They can't run a command. They can't check the internet. All they can do is think and talk. Incredibly intelligent, but completely passive.

On top of that sits the system prompt — the judgment layer. This is the briefing you give that smart person before they start working. You tell them: "You're a senior engineer. You always run tests after writing code. When you're unsure, ask instead of guessing. Don't use this tool for that purpose, use this other one instead." The system prompt shapes how the brain operates. It doesn't make the brain smarter — it makes it more effective by guiding its behavior, correcting its quirks, and setting boundaries.

And then comes the agent — the hands. The brain can think, the system prompt can guide the thinking, but the agent is the one that actually does things. It reads files, opens a terminal, runs commands, searches the web, invokes scripts. Without the agent, the brain is stuck in that room.

To sum it up simply: the agent gives the LLM hands, and the system prompt gives it judgment.

What the system prompt manages

The system prompt isn't a one-liner. It manages at least eight distinct dimensions: role and persona (who the agent is, its expertise, communication style), model calibration (overriding training quirks like verbosity or laziness), tool orchestration (which tools exist, when to use them, in what order), behavioral guardrails (what the agent must not do), output format (JSON, markdown, step-by-step), task decomposition (how to break down complex tasks), context management (how to prioritize information within the context window), and safety and defense (input validation, hallucination prevention, injection resistance).

Let me give you a real example of model calibration. Have you ever noticed that by default, when you give a task to Claude Code, it doesn't run things in parallel? It does things one after another, sequentially. The reason is: in the real world, the data these models are trained on, things happen sequentially. You do one thing, then the next. That's how most written text works. But people realized — wait, these tasks are independent, why aren't we running them in parallel? So that instruction got baked explicitly into the system prompts. That's calibration: recognizing a training artifact and correcting it through the prompt.

System Prompts: The Hidden Variable

If you've ever seen a leaked system prompt from ChatGPT, Claude, or any other coding agent, you know — they're enormous. Thousands of tokens of detailed behavioral rules. These aren't suggestions. They're the actual control layer that shapes every single response the agent generates.

The developer teams at companies like Anthropic, Cursor, and OpenAI treat system prompt engineering as a first-class discipline. It is not an afterthought. They put enormous effort into writing and tuning these prompts because the quality of the tool depends on it.

Here's the finding that really drove the point home for me. Testing showed that identical models using a Codex-style system prompt adopted a documentation-first approach — they would read docs, plan, then code. The exact same model with a Claude-style system prompt used iterative trial and error — just start coding and adjust. The difference was purely because of the system prompt. The model didn't change at all.

Same model, different system prompt, completely different workflow.

— Drew Breunig, "System Prompts Define the Agent as Much as the Model"

The takeaway is clear: the system prompt is not a nice-to-have. It's a fundamental piece of the puzzle that defines whether your agent behaves like a careful senior engineer or a reckless junior who just starts typing.

The Gap Nobody Talks About

You know where most of the attention goes. "Which model is best?" "Claude vs GPT vs Gemini?" "Should we use Cursor or Claude Code?" Those are the conversations happening everywhere. But what actually determines the results you get day to day? It's the environment the model operates in. And that's the part you build, maintain, and compound over time.

When your AI tool produces bad output, what's your first instinct? Most people blame the model. "This model is dumb." "I should switch to a different one." But what if the model is fine and the context is what's broken?

Some people think "context" means what's in the current session — maybe the conversation got too long, maybe some context got compacted or lost. And yes, that can happen. But the bigger context problem is usually at the codebase level: the agent simply didn't have the information it needed to make the right decision.

Here's a simple analogy. Think of the model as a really good driver. They know how to drive, they can handle any road. But put them in a brand new city with no GPS, no road signs, and no destination address — they're going to drive around, make some reasonable guesses, and eventually end up somewhere. Maybe even the right place. But probably not.

Now give that same driver a GPS with the exact destination, real-time traffic updates, and local road rules loaded in. Same driver. Completely different outcome. The GPS didn't make them a better driver — it gave them the context to use their skill effectively.

And here's the kicker: nobody blames the driver when the GPS has wrong maps. But that's exactly what we do with AI tools — we blame the model when the environment is what's broken. And the environment? That's the part we actually control.

Why Your Agent Stumbles on New Models

Have you ever noticed this pattern? A new model comes out, you switch to it, and suddenly your coding tool feels... off. The quality dips. The agent starts doing weird things, consuming more tokens, making mistakes it didn't make before. So you revert back to the old model. Then a few weeks later, you try the new model again, and suddenly it works great.

Did the model change? No. The model was trained months ago — it's the same binary. So what changed?

The harness caught up. The tooling around the model got updated. The system prompts were tuned. The tool orchestration was adjusted for the new model's quirks.

Cursor's engineering team published a blog post confirming this exact mechanism. They wrote: "Each model requires specific instructions and tweaks to Cursor's agent harness to improve output quality, prevent laziness, efficiently call tools, and more." They tune their system prompts for every single frontier model using internal benchmarks.

They even documented that removing reasoning traces from one model caused a 30% performance drop. Thirty percent — just from a change in the harness configuration. Not the model. The harness.

There's a design philosophy from Boris Cherny, the creator of Claude Code, that captures this well. He's talked about designing agentic systems for the model that's coming — not just the model you have today. If your environment is well-structured, a better model makes everything immediately better. But if your environment is brittle and prompt-hacked for one specific model, a new model breaks everything.

That's a powerful idea. Build so that when a more capable model arrives, your system takes advantage of it instead of fighting it.

Why Tool + Model Pairing Matters

This leads to something interesting. There's a reason Claude Code works particularly well with Anthropic's models. It's not just marketing — it's engineering proximity.

When a third-party tool maker like Cursor builds their agent, they don't have access to the model's internal quirks. They observe behavior from the outside, guess at calibration, and tune based on external testing. There's always a gap between what the model can do and what the tool knows it can do.

But when the tool and the model come from the same team? The people who built Claude Code sit in the same organization as the people who trained Claude's models. They know the quirks before users discover them. They can bake fixes into the system prompt preemptively.

It's similar to the Apple approach — the reason Apple products often feel smoother than spec-equivalent competitors isn't about raw hardware power. It's the tight integration between hardware and software. Same principle here. The tighter the feedback loop between model and harness, the better the results.

This doesn't mean Claude Code is the "best" tool for everything. Cursor does excellent work tuning for each model from the outside. Their engineering blog proves it. The mechanism is the same — Claude Code just has a structural advantage through insider access. The takeaway: when evaluating AI coding tools, don't just compare models on a benchmark. Compare how well the tool understands and adapts to the model it's running.

What Is a Harness?

So we've talked about system prompts — how they shape agent behavior, why they matter. Now let's shift to the second big idea: the harness. What you build in your own codebase.

An agent harness is the set of constraints, documentation, tools, and feedback loops that keep an agent productive and on track inside your codebase.

Imagine two contractors. Both equally skilled. You hire the first one and just say "build me a kitchen." No blueprints, no building codes, no site foreman. They're talented — they'll figure something out. But they're going to make reasonable-but-wrong assumptions about every ambiguous detail. The outlet placement, the cabinet dimensions, the plumbing layout — they'll guess at all of it.

The second contractor? Same talent. But you give them blueprints, building codes, inspection checklists, a site plan, and a foreman who flags when something drifts from the plan. Same skill level. Completely different outcome. That's the difference between dropping an agent into a repo with just a README versus a repo with a harness.

Five layers of an agent harness

Project memory is where the agent starts. A CLAUDE.md, a README — it tells the agent what this project is, how it's structured, what the rules are.

The decision record is crucial because it tells the agent why past decisions were made. Without it, the agent relitigates every settled question. "Why are we using this library?" "Why is this structured this way?" If the decision is recorded, the agent reads it and moves on. If it's not, the agent guesses — and it will guess differently every time.

The navigable map is linked documentation the agent can traverse. The entry point links to architecture docs, which link to feature references, which link to specs. The agent doesn't need you to say "go read the connector guide." It finds the links and follows them.

Workflow automation provides repeatable steps the agent can invoke instead of improvising. Instead of figuring out "how do I verify my work against the spec?" there's a defined step for that.

And the enforcement layer — pre-commit checks, linting, audits — catches drift automatically before anything reaches human review.

The Map Metaphor: Think of your documentation as a map, not a manual. The agent doesn't need to be told every step. It needs a map it can navigate based on what it's trying to do. The entry point connects to architecture docs, which connect to decisions and guidelines, which connect to specs — and the agent finds its way through this network.

Here's what's really important to understand: this isn't "AI documentation." This is good engineering practice that also makes agents effective. If your docs, decisions, and conventions are clear enough for a new human engineer to be productive in a week, an AI agent can be productive in minutes. The harness benefits both humans and agents. That's why this isn't AI-specific busywork — it's investing in your codebase's navigability.

Signals from the Industry

Two real-world cases show where things are heading — and what happens at each end of the spectrum.

Stripe — Minions

Stripe built fully autonomous coding agents they call Minions. These agents take a task from a Slack message, write the code, run CI, and submit a pull request — with no human involvement during the development process. Over a thousand PRs merged per week.

But here's the critical part: every single PR is reviewed by a human engineer before it gets merged. If the agent can't fix failing tests after a couple of attempts, the task goes back to a human. They've built the escape hatch right into the system. This is the responsible production model — agents do the work, humans validate the output.

Clawd — One Person, Many Agents

On the other end, Peter Steinberger built Clawd largely solo, running 4 to 10 AI agents simultaneously. Over 6,600 commits in a single month. The project went from about 9,000 to over 150,000 GitHub stars in just a few weeks.

The headline from his Pragmatic Engineer interview? "I ship code I don't read." Those are his own words. He was selective — he reviewed critical paths, things touching the database and security-sensitive code. But for routine code, he trusted the agents and moved on. The scale of unreviewed AI-generated code raised real questions about maintainability and hidden defects.

Neither approach is inherently wrong. Stripe has guardrails and human review. Steinberger made a deliberate choice about where to invest his attention. But both succeed because of one thing: the environment around the agents was deliberately designed.

"The bottleneck was never the agent's ability to write code, but rather the lack of structure, tools, and feedback mechanisms surrounding it. When Codex got stuck, they treated it as an environment design problem and asked what was missing for the agent to proceed reliably."

— OpenAI team, on building Codex

The people building one of the most well-known AI coding agents are telling you: the agent is capable enough. The question is whether your codebase is ready for it.

The Shift: Planning Is the New Implementation

What "good engineering" means is evolving. Not replacing old skills — but reweighting them.

Where we came from: the core job of a good software engineer was writing clean, efficient, maintainable code. Runtime complexity, design patterns, code quality — these were the primary measures of skill. And those skills still matter. They're not going away.

But the highest-leverage skill is increasingly shifting toward planning, specification, and environment design. If you define the problem precisely enough, capable agents can produce the implementation. The quality of the plan directly determines the quality of the output.

The quality concern — an honest take

I want to be honest about a real tension that every engineering team is going to face.

AI-generated code can work perfectly. It passes all the tests. It handles edge cases. It performs well. But when you look at the code — it's sometimes not maintainable. The naming is off, the abstractions feel awkward, the patterns don't match the rest of the codebase. It functions, but it doesn't fit.

Different teams handle this differently. Some use human-in-the-loop review — every agent PR gets reviewed before it merges. Humans stay accountable for what goes to production. Other teams take a different approach — they let agents merge code, and then run periodic cleanup agents. Weekly analysis passes that refactor, clean up, and ensure consistency across the codebase.

I'm not going to say one approach is right and the other is wrong. What I will say is this: the worst case is code going to production with nobody reviewing it at all. Whether the reviewer is a human or a well-designed automated system — someone needs to be accountable for what ships.

Iteration Is the Method

So how do you actually build a harness? The key thing to understand: no harness is built in one shot. You don't sit down, design a perfect system, and ship it. It's iterative. Every friction point an agent hits is an environment design problem waiting to be solved.

The loop goes like this: the agent hits friction. You diagnose it — what context was missing? You fix the environment. And the next time the agent encounters that same situation, it works better. That's the entire method.

Some real examples of this loop in action:

Agent didn't know the project's architecture. It was making guesses about where files go, what patterns to follow. Fix: created a project memory file with linked documentation acting as a navigable map.
Agent wrote code that drifted from the specification. The implementation was technically fine, but it didn't match what was planned. Fix: built a verification workflow that checks implementation against the spec before committing.
Code review agent enforced rules inconsistently. Sometimes too strict, sometimes too loose. Fix: iterated on the guidelines and enforcement configuration multiple times. And honestly? It's still being refined. That's the point.
Agent didn't know about past architectural decisions. It would propose solutions that contradicted decisions made weeks ago. Fix: started recording decisions as ADRs and linking them from the project's documentation map.

Expect imperfection. The first version of everything — every workflow, every doc, every enforcement rule — will be wrong or incomplete. That's fine. The value isn't in the first draft — it's in the feedback loop. Each iteration compounds. The harness gets better, the agent gets more productive, and the friction decreases.

What This Means for You

You're already using AI coding tools. This isn't about adoption. This is about leverage — getting more out of what you already use.

There's a mindset shift I want to leave you with. When an agent goes off-track, misses context, or produces something that doesn't fit — treat it as a harness gap, not an agent limitation.

Instead of "the AI got it wrong" — ask "what context was missing?" And then fix the environment so it doesn't happen again. Not just for you — for everyone on the team, and for the agent itself next time.

Here are some concrete things you can start doing right away:

Document your decisions. The next person to touch your code might not be a human — it might be an agent. If there's no written record of why something works the way it does, the agent will guess. And it will guess wrong. Even a two-line note explaining "we chose X over Y because Z" saves hours of agent confusion later.

Invest in your project documentation. A good entry-point doc that links to architecture docs, conventions, and decisions? That's worth more than any prompt trick you'll ever learn. It's the GPS for your codebase.

When you hit friction, improve the map. Don't just prompt-engineer your way past it. If the agent needed context it didn't have, add that context to the documentation. Fix the underlying issue so the friction doesn't repeat — for you or anyone else.

Planning matters more than ever. The more precise the specification, the better the agent's output. Invest time upfront in defining what you want — the "what" and the "why" — and let the agent handle the "how."

Tools change. Models improve. The harness is what you own.

Claude Code, Cursor, Codex — these are interchangeable instruments. They'll keep getting better on their own. But the documentation, the decisions, the guidelines, the enforcement, and the workflows baked into your codebase? Those compound over time, they work with any agent, and they make every future model more effective from day one.

DEV Community