Ask five developers what an "agent harness" is and you will get five different answers. Some mean the model. Some mean a CLAUDE.md file. Some mean orchestration infrastructure. Everyone is building something real. But without shared vocabulary, we cannot learn from each other, cannot reason across systems, cannot even agree on where a problem lives when something goes wrong.
That is where we are with AI agent configuration. The word harness is everywhere, and it means everything. Which is another way of saying it means nothing precise enough to be useful.
This is not a minor inconvenience. In a field this young, the words we settle on shape the mental models we build. And mental models shape what we think to build next. Naming things carefully is an act of collective infrastructure.
This post proposes a taxonomy: The Harness Stack. Five discrete levels, each with a clear scope and responsibility. It is not prescriptive. You do not need all five. It is a shared map, offered as a starting point for a conversation the field needs to have.
The harness defined
A harness is the deliberately shaped configuration around an AI coding agent: everything that sits between the raw model and the work it does.
It spans the tool you chose, the global preferences that travel with you, the project-level scaffolding inside a codebase, the cross-project conventions an organization shares, and the orchestration that coordinates multiple agents at once.
A harness is not the agent. It is not the code the agent edits. It is the context that decides how the agent behaves when it encounters a task.
The five levels
Level 0: Model Harness
The AI coding tool itself. Claude Code, Cursor, Copilot, Pi, whatever you are running.
This is the product layer: the capabilities, interfaces, and built-in behaviors the tool ships with. You do not configure Level 0. You choose it. And that choice matters more than it might seem, because everything above it is built on assumptions the tool makes about how agents should work, what context they can hold, what hooks they expose.
The discipline worth cultivating here is loose coupling. Your higher-level configuration should not be written for a specific tool. It should be written for a class of tools that Level 0 happens to satisfy today. We are not quite at the point where swapping tools is frictionless, but designing toward that portability now is an investment that compounds.
Level 1: Agent Harness
How the tool is configured globally, across all your work, not just one project.
This is where memory lives, along with persistent preferences, user-level settings, and the context that travels with you from codebase to codebase. In Claude Code, this is your global CLAUDE.md. In claude.ai, it is memory and system-level instructions. Level 1 answers a deceptively important question: how is this agent configured to behave before it encounters any specific project?
The distinction between Level 0 and Level 1 is easy to collapse and important to preserve. The tool is what it ships as. The agent is what you have made of it. That gap, between default behavior and deliberately shaped behavior, is where a surprising amount of leverage lives. An agent that understands your preferred coding style, your tolerance for verbosity, your conventions around naming and error handling, arrives at every project already partially oriented. That orientation is Level 1.
Level 2: Project Harness
The codebase-level scaffolding an agent operates within.
This is where most developers are actively building right now. It is also where the tooling is most mature. A project harness includes:
- Slash commands and MCP plugins
- Hook scripts (PreToolUse, PostToolUse, Stop), often matched to specific tools such as Bash
- Subdirectory CLAUDE.md files scoped to specific modules
- Characterization tests and static analysis configuration
- Skills, sensors, rules, flywheels, and other "code as markdown" artifacts
Think of Level 2 as terrain. It shapes what the agent encounters as it moves through your codebase: what guardrails exist, what patterns it is expected to follow, what tools are available and where. A well-designed project harness does not just constrain the agent. It makes the right path the easy path. This is the layer that has had my attention recently.
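To make the guardrail idea concrete, here is a minimal sketch of a pre-tool hook written in Python. It rests on assumptions modeled on Claude Code's hook contract: that the tool passes a JSON payload with `tool_name` and `tool_input.command` fields, and that exit code 2 means "block this call". Check both against your tool's documentation. The function names and blocked patterns are hypothetical and illustrative, not a recommended policy.

```python
import json
import re
import sys
from typing import Optional

# Illustrative patterns only; a real project would choose its own.
BLOCKED_PATTERNS = [
    r"\brm\s+-rf\s+/",            # recursive delete of an absolute path
    r"\bgit\s+push\s+.*--force",  # force-push
]


def check_command(payload: dict) -> Optional[str]:
    """Return a reason to block the tool call, or None to allow it."""
    # Assumed payload shape: {"tool_name": ..., "tool_input": {"command": ...}}
    if payload.get("tool_name") != "Bash":
        return None
    command = payload.get("tool_input", {}).get("command", "")
    for pattern in BLOCKED_PATTERNS:
        if re.search(pattern, command):
            return f"blocked by project harness: matches {pattern!r}"
    return None


def run(stdin_text: str) -> int:
    """Return the exit code the hook should use for this payload."""
    reason = check_command(json.loads(stdin_text))
    if reason:
        print(reason, file=sys.stderr)  # assumed to be surfaced to the agent
        return 2  # assumed convention: exit code 2 blocks the call
    return 0

# A hook entrypoint would end with: sys.exit(run(sys.stdin.read()))
```

Keeping the decision in a pure function like `check_command` means the guardrail itself can be unit-tested, which is one way to keep a hook from hardening into an unexamined rule.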
The open questions here are genuinely interesting. How granular should subdirectory context be before it becomes noise? When does a hook encode wisdom and when does it encode fear? How do you keep a project harness from calcifying, from becoming a set of rules that made sense six months ago and now just get in the way? These are craft questions, and we are only beginning to develop shared answers.
Level 3: Organization Harness
The cross-project consistency layer. And the most underbuilt level in the stack.
If Level 2 is the terrain of a single project, Level 3 is the survey that makes multiple terrains legible to the same agent. Its purpose, at any scale, is to make sure an agent moving from one project to another does not have to relearn the fundamentals. Shared conventions. Common tool configurations. Policies that apply everywhere so they do not have to be restated anywhere.
Level 3 does not require an enterprise. In a monorepo, it might be nothing more than a root-level CLAUDE.md and a shared lint config. For larger organizations it scales up to approved tool registries, compliance guardrails, and governance policies. But the intent is the same whether you are a solo developer across multiple repos or a platform team serving dozens of product teams.
Here is the honest state of things: almost nobody is doing Level 3 deliberately yet. Most teams have it accidentally. A convention that emerged organically. A root CLAUDE.md someone added and others quietly inherited. That is not nothing, but it is not design.
Purpose-built tooling for this layer does not really exist yet. But the primitives do, and they are ones developers already know. A version-controlled shared repo can hold your org-level CLAUDE.md, hook templates, and lint configs. Package managers can distribute them. For teams managing multiple separate repos today, git submodules are an underrated pragmatic option: pull the org configuration into each project as a submodule, update it centrally, and let projects inherit changes on their own schedule.
MCP servers are another workaround worth considering: an internal MCP server can expose org-wide tools, prompts, and resources to any agent that connects, without each project needing to vendor the configuration. It solves the distribution problem in a different way than submodules. It does not solve the harder problems: how an org-level harness gets authored, how conflicts with project-level configuration get resolved, or how drift gets detected. Those gaps remain wherever the bytes live.
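One way to make the conflict and drift questions tangible is to treat the org and project harness as layered settings and report every place they disagree. The sketch below is hypothetical throughout: the setting keys, the `locked` set, and the rule "the project wins unless the org locks the key" are assumptions for illustration, not the behavior of any existing tool.

```python
from dataclasses import dataclass, field


@dataclass
class Resolution:
    effective: dict = field(default_factory=dict)
    overrides: list = field(default_factory=list)   # project changed an org default
    violations: list = field(default_factory=list)  # project tried to change a locked key


def resolve(org: dict, project: dict, locked: set) -> Resolution:
    """Merge org-level and project-level settings, recording every disagreement."""
    result = Resolution(effective=dict(org))
    for key, value in project.items():
        if key in org and org[key] != value:
            if key in locked:
                result.violations.append(key)  # the org value stays in effect
                continue
            result.overrides.append(key)
        result.effective[key] = value
    return result


# Example with made-up settings.
org = {"line_length": 100, "secrets_scan": True, "test_command": "make test"}
project = {"line_length": 120, "secrets_scan": False, "formatter": "ruff"}
resolution = resolve(org, project, locked={"secrets_scan"})
```

In this example `resolution.overrides` holds `line_length` and `resolution.violations` holds `secrets_scan`. The merge is the easy part. Deciding which keys belong in `locked`, and who gets to decide, is the authoring problem that no distribution mechanism answers.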
The real gap is semantic, not technical. Which makes it exactly the kind of gap that shared vocabulary can close.
This is the most interesting empty layer in the stack. As agentic workflows mature and projects multiply, inconsistency compounds quietly. The team that invests in Level 3 early is building something that will pay dividends in ways that are hard to attribute but impossible to miss.
Level 4: Orchestrator Harness
Fleet-level coordination of agents. The level where the products and frameworks are arriving faster than the patterns.
Devin is a Level 4 system. So are CrewAI, AutoGen, LangGraph, and swarm frameworks. So is any infrastructure that treats individual agents as nodes in a larger graph: routing work between them, managing their lifecycles, composing their outputs into something coherent. This is not configuration in the traditional sense. It is choreography. The harness at this level does not shape how an agent thinks. It shapes how agents relate to each other.
LangGraph makes this concrete: you define a graph of agent nodes, edges that represent conditional routing between them, and state that flows through the graph as work progresses. The harness is the graph itself, the encoded decisions about which agent handles what, under what conditions, and what happens when something fails. Devin operates similarly in spirit, if not in implementation: a task enters the system, gets decomposed, gets distributed, gets reassembled. The orchestrator harness is what holds that process together.
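The shape described here can be sketched without any framework. The following is plain Python, not LangGraph's API: the node functions, the `route` function, and the state keys are hypothetical stand-ins meant only to show nodes, conditional edges, and state flowing through a graph.

```python
from typing import Callable, Dict

State = Dict[str, object]
Node = Callable[[State], State]


def plan(state: State) -> State:
    # Stand-in for a planning agent: split the task into subtasks.
    parts = [p.strip() for p in str(state["task"]).split(" and ")]
    return {**state, "subtasks": parts}


def work(state: State) -> State:
    # Stand-in for a worker agent: "complete" the next subtask.
    todo = list(state["subtasks"])
    done = list(state.get("done", [])) + [todo.pop(0)]
    return {**state, "subtasks": todo, "done": done}


def assemble(state: State) -> State:
    # Stand-in for a reviewing agent: compose the outputs.
    return {**state, "result": "; ".join(state["done"])}


NODES: Dict[str, Node] = {"plan": plan, "work": work, "assemble": assemble}


def route(current: str, state: State) -> str:
    """Conditional edges: pick the next node, or 'END' when finished."""
    if current == "plan":
        return "work"
    if current == "work":
        return "work" if state["subtasks"] else "assemble"
    return "END"


def run(task: str, max_steps: int = 20) -> State:
    state: State = {"task": task}
    current = "plan"
    for _ in range(max_steps):  # step budget so a routing bug cannot loop forever
        state = NODES[current](state)
        current = route(current, state)
        if current == "END":
            return state
    raise RuntimeError("step budget exceeded")
```

Calling `run("write tests and fix lint")` walks plan, work, work, assemble. Everything that counts as harness in this toy lives in `route` and `max_steps`: the encoded decisions about who handles what and what happens when the process does not converge.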
What makes Level 4 genuinely hard is not the tooling. LangGraph and its peers are increasingly capable. It is the design questions that do not have settled answers yet. When a fleet of agents is doing something you did not intend, how do you know? How do you trace causation across spawned instances? How do you encode organizational intent at a level that survives decomposition into subtasks? How do you reason about failure when the failing component is itself an agent with its own harness?
These are not small questions. Level 4 is where the absence of shared vocabulary is most costly, because the systems are complex enough that imprecise language leads directly to imprecise design. And imprecise design at this scale fails in ways that are hard to diagnose and expensive to untangle.
Products do not respect the taxonomy
The reason "harness" gets muddled is that real products do not sit cleanly in one level. They span two or three at once.
Claude Code is primarily a Level 0 tool, but it ships Level 2 primitives: skills, commands, the .claude/ directory shape. Cursor straddles Level 0 and Level 2. CrewAI and AutoGen blur Level 1 and Level 4 at the same time: they define how one agent runs and how many coordinate. LangChain sprawls across Level 1, Level 2, and sometimes Level 4. Devin reaches into all five.
This is why the word collapses. The products are not lying. They really do span layers. The fix is not to pretend they do not. The fix is to name which level a product touches when we talk about it.
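Naming the levels a product touches can be as lightweight as a lookup table. This sketch encodes only the firm placements claimed above and leaves out the "sometimes" cases; the `Level` enum and the `describe` helper are hypothetical names introduced for illustration.

```python
from enum import IntEnum


class Level(IntEnum):
    MODEL = 0
    AGENT = 1
    PROJECT = 2
    ORGANIZATION = 3
    ORCHESTRATOR = 4


# Placements as stated in this post; they are claims to argue with, not facts.
PRODUCT_LEVELS = {
    "Claude Code": {Level.MODEL, Level.PROJECT},
    "Cursor": {Level.MODEL, Level.PROJECT},
    "CrewAI": {Level.AGENT, Level.ORCHESTRATOR},
    "AutoGen": {Level.AGENT, Level.ORCHESTRATOR},
    "Devin": set(Level),
}


def describe(product: str) -> str:
    """Render the levels a product touches, lowest first."""
    levels = sorted(PRODUCT_LEVELS[product])
    labels = ", ".join(f"L{int(lv)} {lv.name.title()}" for lv in levels)
    return f"{product}: {labels}"
```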
A debugging ladder
The taxonomy earns its keep when something goes wrong.
When an agent behaves unexpectedly, the instinct is to poke at whatever is most visible, usually a prompt or a config file. But the question "which level is this a problem at?" is more useful:
- Is the tool itself underperforming for this task? (L0)
- Is global memory or agent configuration incomplete or contradictory? (L1)
- Is a hook misconfigured, or is a subdirectory CLAUDE.md missing critical context? (L2)
- Are there conflicting conventions across projects that this agent is inheriting inconsistently? (L3)
- Is the orchestration logic routing or spawning incorrectly? (L4)
Five questions. Five places to look. That is not a debugging methodology. It is what shared vocabulary makes possible.
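If it helps to keep the ladder next to the code, the five questions fit in a few lines. This is a hypothetical checklist helper, not a diagnostic tool: the `suspects` and `unanswered` names are made up, and the answers still have to come from a person looking at the system.

```python
LADDER = [
    ("L0", "Is the tool itself underperforming for this task?"),
    ("L1", "Is global memory or agent configuration incomplete or contradictory?"),
    ("L2", "Is a hook misconfigured, or is a subdirectory CLAUDE.md missing critical context?"),
    ("L3", "Are conflicting cross-project conventions being inherited inconsistently?"),
    ("L4", "Is the orchestration logic routing or spawning incorrectly?"),
]


def suspects(answers: dict) -> list:
    """Return the levels whose question was answered yes, in ladder order."""
    return [level for level, _ in LADDER if answers.get(level, False)]


def unanswered(answers: dict) -> list:
    """Return the questions nobody has checked yet."""
    return [question for level, question in LADDER if level not in answers]
```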
The attention map
The taxonomy also makes the field's attention map visible. Most of the work right now is happening at Level 0 (the tool wars), Level 2 (the explosion of project-level scaffolding), and Level 4 (the multi-agent frameworks). Level 1 is catching up. Level 3 is empty.
If you are looking for where the next interesting work lives, look at the empty layer.
Why naming this matters
We are, collectively, in a period of rapid accumulation. Patterns are emerging faster than they are being named. The result is that knowledge stays local: buried in individual CLAUDE.md files, undocumented hook scripts, tribal conventions that do not survive team changes.
Taxonomies feel like housekeeping until suddenly they are load-bearing. The goal of the Harness Stack is not to add ceremony to a field that is moving fast. It is to give the field something specific to argue about. "We need a better harness" is unanswerable today, because the next person is allowed to interpret it however they want. "We need a better Level 3" is an argument you can act on.
I hold this loosely. The edges are genuinely blurry. Level 1 and Level 2 blur when global memory starts referencing project-specific context. Level 3 and Level 4 blur when org policies begin governing agent spawning behavior. That is fine. A taxonomy does not need to be perfect to be useful. It needs to be shared.
The rule is: when you say "harness," say which level. The taxonomy is wrong somewhere. It is a first attempt. I would rather argue about whether Level 3 should be called the Organization layer or something else than keep watching engineers nod at each other and walk out of the room with five different mental models.
Does this map to how you are building, or does it break somewhere meaningful? I am curious where the levels hold and where they need to be argued with. If you are working in this space, I would rather have a conversation than be right.