Louai Boumediene
The AI Harness: why your AI coding agent is only as smart as the repo you put it in

Hi. I am an engineer at Activepieces, an open source automation platform (think Zapier, but you can self host it and read the source code). I write TypeScript for a living. I argue with bots about tabs vs spaces now. And for the last few months, our team has been really, really invested in one question:

Why does our AI coding agent suck in this codebase, but cook so hard in a blank Next.js project on my laptop?

Spoiler: it is not the model. It is never the model. It is me. It is the repo. It is the thing I will spend the next ~6,000 words trying to convince you to call an AI harness.

Here is the claim. I want to be careful with it, because the internet is tired of "we 10x'd productivity with AI" hype posts. I am tired of them too.

We have 12+ engineers at Activepieces. Our internal goal is one feature per engineer per day, ideally from a single prompt. We have not hit that goal yet. But we are moving fast toward it, and the speed of progress is what is actually interesting.

We are an open source team trying to figure out, in public, how to set up a real codebase so a frontier model can actually be useful in it.

Our codebase is real. Multi tenant. Multi edition. We have an enterprise tier, a cloud product, a frontend, a backend, an engine, a piece SDK, and integrations with hundreds of third party APIs. So this is not a toy.

This post is the answer we have so far. It is not the final answer. It is the answer that took us from "Claude Code keeps inventing entities that do not exist" to "Claude Code just shipped a credentials manager: a solid plan, two iterations, and the feature was production ready."

Let me show you how.

💡 Reading time check: This is a long post. Around 30 minutes. There is a lot of stuff. I would rather give you the full picture than send you off with a half formed mental model that breaks the moment you try to use it. Grab coffee. Or skip to the section called "Anatomy of an AI Harness" if you want the meat.



Part 1: The Dirty Secret Nobody Talks About

Let me start with the part of AI coding talk that drives me crazy.

Every two weeks, a new model drops. Twitter loses its mind. The benchmarks move by 3%. Someone makes a graph. Someone else makes a counter graph. Someone tweets "Claude is dead, GPT is back." Six hours later: "GPT is dead, Gemini is back." A YC founder says programming is solved. A grumpy backend engineer says programming is doomed. Repeat next Tuesday.

Meanwhile, in real engineering teams trying to actually use these tools, here is what happens:

Engineer: "Add a webhook log feature."
AI: *creates a new entity*
AI: *forgets to register it in getEntities()*
AI: *writes a query without filtering by projectId*
AI: *imports from src/app/ee/ inside the CE codebase*
AI: *uses PUT instead of POST*
AI: *makes up a function called safeFetch that does not exist*

Engineer: "No not do that PLEASE..., Remember to use our brand colors next time...., F**ck your 18th grandma..."

Was that the model's fault?

Kind of. But mostly, no. The model did not know your codebase has a multi tenant rule. It did not know that TypeORM does not auto discover entities in your project and you have to register them by hand. It did not know that src/app/ee/ is the enterprise edition and importing it from the community edition will break the build for self hosters. It did not know that you, very specifically, decided three years ago that every create and update endpoint uses POST because PATCH semantics caused a bug once.

It did not know any of that because you did not tell it. You handed a frontier brain a 200,000 file haunted house with no map, no rules, no glossary, no "here is what we mean when we say piece vs plugin", and then you got mad when it tripped over a rake.

💡 The reframe: The bottleneck for 90% of AI coding teams right now is not model capability. It is context engineering. The model is fine. The model is great. The model is being asked to do brain surgery in a dark room with no chart.

Here is the thing nobody wants to admit: your codebase has tribal knowledge. A lot of it. It lives in your senior engineers' heads. It lives in Slack threads from 2022. It lives in PR comments nobody can search. It lives in the muscle memory of "we do not do it like that here."

When a new human engineer joins your team, they pick up that tribal knowledge over weeks of pairing, code review, and getting things wrong. By month three they are productive.

When you onboard an AI agent, you are starting that process from scratch every single session. Unless you build a harness.


Part 2: What the Hell is an AI Harness?

Okay, let me define the term. I have been using it like you already agreed to use it with me, and we have not even shaken hands yet.

AI harness (noun): The set of files, folders, conventions, and infrastructure inside a codebase that turns the raw power of an AI agent into reliable, project specific output. It is to your AI what a saddle is to a horse. The horse was already strong. The saddle is what lets you actually ride it somewhere.

I am going to keep using it. It is a better term than "agent scaffolding" (sounds like a 2014 web framework), "context engineering" (technically right but vibes off), or "AI coding setup" (descriptive but boring). A harness is the right metaphor. You are not making power. You are channeling it.

A real AI harness has at least these parts. We will go through each one:

  1. Always on instructions that load every session. Architecture rules, conventions, gotchas.
  2. Safety reflexes. Tiny rules that catch the most expensive mistakes.
  3. On demand context. Feature docs, glossaries, schemas the agent fetches when it needs them.
  4. Codified workflows. Slash commands or skills that turn "do this five step thing" into one command.
  5. Scoped subagents. Specialized agents that stay in their lane.
  6. External integrations. MCPs to give the agent access to your tools (Linear, Postgres, browsers).
  7. Session hygiene. Practices for clearing context, parallel work, and knowing when to give up and start over.

If you have all of these, you have a harness. If you have none of them, you have a CLAUDE.md with three bullet points that say "use TypeScript", and you wonder why the AI is bad.
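
Concretely, here is roughly how all of that lands on disk in our repo. This is a simplified sketch of the layout, not a literal listing; the rest of the post walks through each piece:

```
.agents/
  features/        # ~60 line docs, one per module
  skills/          # codified workflows, invoked as /commands
  rules/           # 3-5 line safety reflexes
  GLOSSARY.md      # canonical domain terms
.claude/
  agents/          # scoped subagents: server, web, changelog
  settings.json
  skills -> ../.agents/skills   # symlinks back to the source of truth
  rules  -> ../.agents/rules
CLAUDE.md          # the root constitution, ~55 lines
packages/*/CLAUDE.md
```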


Part 3: Plan First, Execute Second (The Real Unlock)

Before I show you the harness, I have to convince you of something that sounds boring but is the single highest leverage practice in our entire workflow:

The harness exists to enable better planning. The planning is what ships features.

Let me explain. There is a fantasy version of AI coding that goes like this:

"Add a project level analytics dashboard." [Enter] 45 minutes later, PR is open, tests pass, Slack ping fires.

That fantasy is real, but it is the last 10%. It is what happens after you have done the work to make it possible. The work to make it possible looks like this:

  1. You write a clear brief. Not "add analytics." Something like: "Add a project level analytics dashboard showing flow run count, success rate, and average duration for the last 7, 30, 90 days. Project scoped. CE feature. Visible to users with READ_RUN permission." Five minutes of writing. Saves thirty minutes of agent guessing wrong.

  2. You enter Plan Mode. In Claude Code, that is Shift+Tab twice. The agent does not write a single line of code. Instead, it explores. It reads the relevant feature docs. It looks at existing patterns. It asks you questions. ("Should this be cached? Do you want it real time? Should the time range selector be a query param or state?") It proposes a plan: which files to create, which to change, what the API shape will be, what tests will look like.

  3. You review and refine the plan. This is where 80% of the value of the whole workflow comes from. You are not reviewing code yet. You are reviewing intent. If the plan is wrong, the code will be wrong, and you will spend two hours debugging what should have been a five minute fix to the plan. Fix the plan.

  4. You exit Plan Mode and let it execute. Now it writes the code. And because you spent fifteen minutes upfront getting the plan right, the code is usually right too.

That is the workflow. It is the opposite of "vibe coding." It is the opposite of "just yolo a prompt and hope." It is, honestly, the same workflow good engineers have always used: think before you type. The only difference is that now you are thinking with a partner that has read more code than any human alive.

💡 Pro tip from someone who learned this the hard way: If you have corrected Claude twice on the same thing, stop correcting it. /clear the session, rewrite your prompt with what you just learned, and start over. Correction loops add context bloat. By the time you have had four back and forths, the agent is in a fog of contradicting instructions and is going to keep getting it wrong. A clean session with a better prompt beats a polluted session with a perfect prompt every time.

This is why the harness matters. The harness is what makes Plan Mode work. Without a harness, Plan Mode is just the agent reading random files and guessing. With a harness, Plan Mode is the agent reading a 60 line feature doc you wrote on purpose to answer the questions it is about to have, and producing a plan that is actually grounded in how your codebase works.


Part 4: Anatomy of an AI Harness: The Layered Context System

Okay. Let us talk about the harness itself.

The mental model we use at Activepieces is agent context as a tiered cache. Different layers load at different times, with different sizes, so the model gets the right info at the right moment without burning context. Three principles:

  1. Tiny files always loaded. Big files on demand.
  2. Per feature docs hold tribal knowledge that you cannot get from reading the code (edition gating, side effects, registration gotchas).
  3. Workflows are codified as skills or slash commands, not prose. So the model has a fixed path for common tasks.

Here is the layer breakdown, with the actual sizes we run in production:

| Layer | What | When loaded | Purpose |
|---|---|---|---|
| CLAUDE.md (root) | ~55 lines | Every session | Non obvious architecture rules |
| packages/*/CLAUDE.md | ~30-55 lines each | When working in that package | Package specific patterns |
| .claude/rules/ | 3-5 lines each | Every session | Critical safety reflexes |
| .agents/features/* | ~60 lines each, 40+ files | When the agent explores a feature | Entity schemas, services, data flows |
| .agents/skills/* | 30-65 lines each, 9 skills | When invoked as a /command | Step by step workflows |
| .claude/agents/* | Subagents (server, web, changelog) | When delegated to | Scoped task execution |

Total always on context loaded per session: around 150 lines. In our experience, that is roughly the budget of instructions Claude will reliably follow before quality starts dropping. We use all of it. Nothing wasted.

Now let me walk you through each layer.


Layer 1: AGENTS.md / CLAUDE.md: The Constitution

This is the file that loads every session, no matter what you are doing. It is the thing the agent reads before it reads anything else. It is, basically, your codebase's constitution.

What goes in it:

  • Architecture rules (what is the stack? what is the topology?)
  • Coding rules with teeth (no any, no type casting, named parameter functions, file order)
  • Commands (how do you build, lint, test?)
  • Project specific gotchas (multi tenancy, edition gating, etc.)
  • PR rules (label policy, branch naming)

What does NOT go in it:

  • Generic "be helpful and write good code" stuff
  • Anything the agent already knows (e.g. "use TypeScript types" is wasted bytes)
  • Long examples (link to the file instead)
  • Per feature details (those go in feature docs)

Our root AGENTS.md is about 55 lines. That is it. It is brutally short, and that is on purpose. Every line has to earn its place because every line gets loaded into every session, and tokens are not free.

Some of the rules in our root file:

  • Multi tenant: every query filters by projectId or platformId. Forgetting this leaks data across tenants. This rule is in three different places. We are not subtle about it.
  • Editions: CE/EE/Cloud separated via hooksFactory. Never import src/app/ee/ from CE. This breaks self hosters. Bad day for everyone.
  • Entity registration: TypeORM does not auto discover. New entities must be added to getEntities(). The two step trap is the most common mistake we saw before we wrote it down.
  • HTTP: POST for all create and update. Never PUT or PATCH. This is a convention, not a law of physics, but having a convention means the agent does not have to guess.
  • Outbound HTTP: must go through safeHttp (SSRF protection). One of the few security critical conventions we have.
  • Frontend: reset forms via key prop, not form.reset(). Server errors go to root.serverError. These are hard won react hook form patterns.

You will notice these are not "best practices in general." They are "this is how we do it, here, in this repo." That is the point. Generic rules are useless because the model already knows them. Specific rules are gold because the model has no way to know them.
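
To make the multi tenant rule concrete, here is the kind of mistake it exists to catch. This is a minimal sketch with a made up entity, not our actual code:

```typescript
import { Repository } from 'typeorm'

// Hypothetical entity, for illustration only.
type WebhookLog = { id: string; projectId: string; createdAt: Date }

// What an unharnessed agent writes: returns every tenant's rows.
async function listLogsUnscoped(repo: Repository<WebhookLog>): Promise<WebhookLog[]> {
  return repo.find({ order: { createdAt: 'DESC' } })
}

// What the rule demands: every query scoped to the caller's project.
async function listLogs(repo: Repository<WebhookLog>, projectId: string): Promise<WebhookLog[]> {
  return repo.find({ where: { projectId }, order: { createdAt: 'DESC' } })
}
```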

💡 Pro tip: When writing your AGENTS.md, ask yourself: "Could a fresh frontier model figure this out by reading my code?" If yes, leave it out. If not, it needs tribal knowledge: write it down. Your AGENTS.md is the place where your tribe's knowledge stops being tribal, in a good way.

We also keep per package CLAUDE.md files:

  • /packages/server/AGENTS.md. Fastify + TypeORM + BullMQ stack, controller patterns, email template rules, N+1 prevention.
  • /packages/web/CLAUDE.md. React + Tailwind + Shadcn + react hook form, useEffect anti patterns, ICU i18n syntax.
  • /packages/shared/CLAUDE.md. Zod schema + z.infer model pattern, key enums to extend, version bump policy.
  • /packages/pieces/CLAUDE.md. Piece SDK quickstart, auth types, piece context API.
  • /packages/server/engine/CLAUDE.md. Engine error handling rules (ExecutionError subclasses).

These load only when the agent is working in that package, which keeps context lean. Working on a frontend feature? You do not need to load engine error handling rules.


Layer 2: .claude/rules/: The Safety Reflexes

This is one of my favorite parts of the harness because it is so cheap and so high leverage.

These are tiny files. Three to five lines each. They load every single session. They are not architecture documents. They are not tutorials. They are safety reflexes: short orders that catch the most expensive mistakes.

We have four:

| Rule | Why it exists |
|---|---|
| data-isolation.md | Multi tenant: every query filters by projectId or platformId. ArrayContains for connections. |
| edition-safety.md | CE/EE separation. Use hooksFactory.create<T>() to extend. |
| entity-registration.md | The two step trap (entity table + migration registration). |
| safe-http.md | SSRF prevention via safeHttp.axios. |

Each file is a few lines of "if you are doing X, you must also do Y." That is it. They are cheap because they are tiny. They are high leverage because they catch the most expensive bugs we have ever seen. Data leaks across tenants. Broken self hosted builds. Missing migrations. SSRF holes.

You can think of them as the agent's "muscle memory." If your AGENTS.md is the engineering handbook, your rules folder is the laminated safety card on the airplane seat back. Short. Visual. Always there.
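
For a sense of how small these are, here is roughly what one looks like. The wording is illustrative, not a copy of our actual data-isolation.md:

```markdown
# Data isolation

- Every query MUST filter by projectId or platformId. No exceptions.
- App connection lookups use ArrayContains on project ids.
- If you cannot tell where the tenant filter comes from, stop and ask.
```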

💡 Pro tip: A great test for what belongs in .claude/rules/: if you have ever had to revert a PR or write a postmortem because of this kind of mistake, it is a rule. Rules are crystallized scar tissue.


Layer 3: .agents/features/: The Module Encyclopedia

This is the part that took us the longest to figure out, and it is the part I am most proud of.

We have 40+ feature docs, one per module: flows, pieces, agents, ai-providers, alerts, analytics, api-keys, app-connections, audit-logs, authentication, custom-domains, file-storage, flow-runs, folders, knowledge-base, mcp, oauth-apps, platform, projects, scim, secret-managers, signing-keys, store-entry, tables, templates, triggers, user-invitations, users, webhooks, plus ee-* variants for enterprise only modules.

Each file is around 60 lines. Each file has a fixed shape:

  • Summary. One paragraph.
  • Key Files. Frontend, backend, shared (with paths).
  • Edition Availability. CE / EE / Cloud.
  • Domain Terms. Links to the glossary.
  • Entities. Column by column schemas.
  • Endpoints / Service Methods. Table form.

Why does this exist? Because the alternative, the bad alternative, is that every time the agent works on, say, the webhooks module, it has to read 30 files to figure out what is going on. That is expensive in tokens, expensive in latency, and the result is often wrong because the agent had to guess at relationships between files.

With feature docs, the agent reads one ~60 line doc and gets the entity schema, the edition gating, the side effect graph, the relevant endpoints, and the canonical domain terms. In one shot. Then it goes and writes code.
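
Here is the skeleton with placeholder content, so you can see the shape; the real docs fill each section with the module's specifics:

```markdown
# webhooks

## Summary
One paragraph: what the module does and how it relates to flows and triggers.

## Key Files
- Frontend: packages/web/src/...
- Backend: packages/server/src/...
- Shared: packages/shared/src/...

## Edition Availability
CE (used by EE and Cloud as well)

## Domain Terms
webhook, trigger, flow run -- each linked to GLOSSARY.md

## Entities
WebhookLog: id, projectId (tenant scope), flowId, payload, createdAt

## Endpoints / Service Methods
Table of routes and the service methods behind them.
```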

We also have a GLOSSARY.md. It is a canonical term table, organized by domain cluster (Automation Core, Data & Storage, Pieces & Integrations, Platform & Multi tenancy, etc.). It includes an "Aliases to avoid" column to fight synonym drift. The worst thing that happens when AI generates code is when it invents a new term for a concept that already exists. Now you have a WebhookEvent and a WebhookHook and a HookedWebhook, all referring to the same thing, scattered across the codebase.

💡 Pro tip: Do not try to write all your feature docs in one sitting. You will burn out and write garbage. Instead: every time you start a new feature in a module that does not have a doc, write the doc first. Use it as the brief for your Plan Mode. Now the doc has paid for itself before the feature even ships, and you have one more for next time.


Layer 4: .agents/skills/: Codified Workflows

If feature docs are what the codebase is, skills are how to do things in it.

A skill is a slash command that maps to a fixed procedural workflow. Instead of writing prose like "to add an entity, first create the schema, then register it in getEntities(), then add a migration..." you write a skill called /add-entity that is that procedure, and the agent follows it step by step.

We ship nine skills today:

  • /add-endpoint. Fastify Zod controller + securityAccess config + module registration.
  • /add-entity. TypeORM EntitySchema + getEntities() + repository pattern.
  • /add-feature. Full stack add: shared types, entity, migration, service, controller, frontend, tests.
  • /db-migration. Generate via CLI, patch MigrationInterface to Migration, set breaking/release/down(), register, handle PGlite vs CONCURRENTLY.
  • /piece-builder. Build a third party integration: research API, scaffold via CLI, implement actions and triggers, wire tsconfig, build and test.
  • /ubiquitous-language. Mandatory feature overlap detection before any new feature.
  • /agent-browser. Browser automation CLI for testing.
  • /playwright-e2e-testing. Playwright patterns reference.
  • /mintlify. Docs site authoring conventions.

The /piece-builder skill is a beautiful example of progressive disclosure. The top level SKILL.md is a checklist. But it ships with eight sub references: auth-patterns, action-patterns, trigger-patterns, props-patterns, common-patterns, ux-guidelines, output-quality, piece-types. Around 2,000 lines of patterns total. The agent only loads the sub reference it needs at the step it is on. So you get deep, deep expertise without the agent having to swallow 2,000 lines upfront.

The /ubiquitous-language skill is interesting for a different reason. It is a negative skill. Its job is to prevent the agent from writing code by first checking if the feature you asked for already exists, possibly under a different name. It checks .agents/features/, components, routes, shared types, plan flags. It maintains the glossary. It is the skill that prevents the agent from re inventing webhooks for the third time in six months because it did not know we already had webhooks.

💡 Pro tip: Skills are great because they encode "this is the thing we always forget." Every time you do a code review and write "you forgot to add a migration", that is a skill. Every time you write "you need to register this in getEntities()", that is a skill. The skill replaces the code review comment because the agent never makes the mistake again.
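
And if you are wondering what a skill file even looks like: nothing fancy. Here is an illustrative, condensed version of the kind of checklist our /add-entity covers:

```markdown
# /add-entity

1. Define the EntitySchema in the server package, typed from the shared model.
2. Register it in getEntities(). TypeORM will not discover it for you.
3. Generate the migration via the CLI, set breaking/release/down(), register it.
4. Expose a repository accessor and use it from the service layer.
5. Run lint and the integration tests before declaring victory.
```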


Layer 5: .claude/agents/: Specialized Subagents

These are not more powerful than the main agent. They are not smarter. They have access to fewer things, on purpose, and that is the whole point.

We have three:

| Agent | Purpose | Tools |
|---|---|---|
| server | Backend Fastify/TypeORM work | Read/Edit/Write/Glob/Grep/Bash/Agent |
| web | Frontend React work | Read/Edit/Write/Glob/Grep/Bash/Agent |
| changelog | Writes Mintlify <Update> entries with enterprise tone | Read/Edit/Write/Glob/Grep/Bash/WebSearch |

Each subagent has a model pin (sonnet), an explicit tool allowlist, and a one paragraph briefing of what it owns. The point is not "this agent has special powers." The point is scoping. A web agent stays out of the server tree. A changelog agent does not touch source code. A server agent does not suddenly decide to refactor your React components.
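
In Claude Code a subagent is just a markdown file with a frontmatter header. Ours look roughly like this; the wording is paraphrased, not the literal file:

```markdown
---
name: changelog
description: Writes Mintlify <Update> changelog entries in our enterprise tone.
tools: Read, Edit, Write, Glob, Grep, Bash, WebSearch
model: sonnet
---

You own the docs changelog and nothing else. Never modify application
source code. Keep entries short, factual, and in the house tone.
```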

This is way more useful than it sounds. When the main agent delegates a backend task to the server subagent, the subagent only loads server relevant context. It cannot accidentally wander into the frontend. It cannot confuse itself by reading docs for a different domain. It has a smaller, focused world to work in, and it works in that world better.

💡 Pro tip: Subagents are not for "give the agent more capability." They are for "give the agent fewer ways to mess up." Constraint is a feature.


Layer 6: Settings, Hooks, and the .agents Symlink Trick

Okay, this section is pure pro tip territory and I am excited about it.

Here is the problem. There are like nine AI coding tools now. Claude Code wants .claude/. Cursor wants .cursor/. Aider wants .aider/. Continue wants .continue/. Each new tool that drops wants its own folder full of its own conventions. And they all want roughly the same content: rules, skills, agent configs.

If you naively support all of them, you are maintaining nine copies of the same files. That is a maintenance nightmare. Every time you update a rule, you have to remember to update it in nine places. You will not. You will forget. Drift will set in. Eventually different agents will be running on different rules and you will wonder why they produce different code.

Our solution: one source of truth, symlinked everywhere.

We use .agents/ as the canonical folder. Skills, rules, feature docs, all of it lives in .agents/. Then we symlink:

```
.claude/skills -> ../.agents/skills
.cursor/rules -> ../.agents/rules
# ...etc
```

When a new tool drops next month, we add a symlink. Done. We supported it. The content stays in one place.
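
Setting it up is a couple of commands per tool, and git versions the symlinks themselves:

```bash
# from the repo root; repeat for each tool you adopt
mkdir -p .claude .cursor
ln -s ../.agents/skills .claude/skills
ln -s ../.agents/rules .cursor/rules
git add .claude/skills .cursor/rules
```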

This is the kind of thing that sounds like a tiny implementation detail and is actually a load bearing decision. The harness has to be tool agnostic or it dies. Models will change. Tools will change. The content of your harness, your conventions, your feature docs, your skills, is the part that should outlive any one CLI.

💡 Pro tip: Even if you only use Claude Code today, structure your harness as if you will need to support three other tools tomorrow. Keep your content in .agents/ (or whatever generic name you like) and symlink from .claude/. The day you decide to evaluate Cursor or a new agent, you will thank past you.

We also keep a few harness level config files:

  • .claude/settings.json + .claude/settings.local.json. Permissions, env vars, hooks.
  • .claude/worktrees/. Used for isolated worktree based parallel runs (more on this in a bit).

Part 5: Beyond the Repo: MCPs and the Outside World

Everything I have described so far lives inside your repo. But a real engineer's job is not just "edit files." It is "edit files, then check Linear for the ticket details, then query the database to see what production data looks like, then test the change in a browser, then write a changelog entry." Most of an engineer's day is not in the repo.

This is where MCPs come in. MCP (Model Context Protocol) is, roughly, a standard for connecting AI agents to external tools. If your harness is the saddle, MCPs are the reins. They let the agent steer beyond the codebase.

We use a few:

Linear MCP. Tickets live in Linear. When I tell Claude "implement LIN-1234," I want it to actually pull the ticket, read the description, see the comments, check the labels. Linear MCP makes this a one step operation instead of "let me copy paste the ticket description into the prompt."

Postgres MCP. Sometimes the agent needs to know what production data actually looks like. Not the schema, the shape of real data. Are these IDs UUIDs or strings? Is this column ever null in practice? What is the distribution of values in this enum? Postgres MCP gives the agent read only access to a dev or staging DB so it can answer these questions itself instead of asking me. Huge for migrations, where the difference between "this works in theory" and "this works on production data" is sometimes a 3am page.

Chrome MCP. This one is powerful and expensive. It lets the agent control a browser: navigate, click, screenshot, run E2E flows. We use it for testing UI changes end to end. The agent makes a frontend change, opens the browser, navigates to the page, verifies the thing works, screenshots the result.

💡 Pro tip on Chrome MCP, this one is real: Chrome MCP is super token expensive. Every page snapshot is hundreds to thousands of tokens. Every interaction adds more. Use it surgically, not casually. Do not let the agent "browse around to figure things out." Do not let it open the same page five times. Treat it like a precision tool. Tell the agent exactly which page to open, exactly what to verify, and /clear the session afterwards before the bloated context kills quality on the next task. We have had sessions where 80% of token usage was Chrome MCP screenshots, and the work itself was 20%. Be deliberate.

The general rule for MCPs: add them when the alternative is the agent being blind, not when they are cool. Every MCP you add is more tools, more potential confusion, more context cost. We use Linear, Postgres, Chrome. We have evaluated and not added several others, because the marginal value did not justify the marginal noise.
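
For reference, wiring an MCP server into Claude Code is a small config file, typically an .mcp.json at the repo root. The snippet below is only a sketch: the server command and connection string are placeholders, so check the current Claude Code docs and your MCP server's README for the exact invocation:

```json
{
  "mcpServers": {
    "postgres": {
      "command": "npx",
      "args": ["-y", "your-postgres-mcp-server", "postgresql://readonly_user@staging-host:5432/activepieces"]
    }
  }
}
```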


Part 6: A Day in the Life (One day..... XD)

Enough theory. Let me walk you through what shipping a feature actually looks like for me on a normal day.

8:50 AM. Coffee. I open Linear. I have a ticket: "Add project level analytics dashboard showing flow run counts." I read the ticket. I look at the comments. I think for two minutes about scope.

9:00 AM. Brief. I open my laptop and write a brief in plain text. Five sentences. What it does. Who sees it. CE or EE. Constraints.

Add a project level analytics dashboard showing flow run count, success rate, and average duration for the last 7, 30, 90 days. Project scoped. CE feature. Visible to users with READ_RUN permission. Should not break existing analytics page routing. Success looks like: API returns 200 with the right shape, the dashboard page renders, the role based access check works.

That brief took five minutes to write. It will save 30 minutes of agent wandering. This is the highest leverage time of my day.

9:05 AM. Plan Mode. I open Claude Code in the repo. I press Shift+Tab twice to enter Plan Mode. I paste the brief.

The agent reads .agents/features/flow-runs.md. It reads .agents/features/projects.md. It looks at the existing analytics page. It looks at how role based access checks are done in similar endpoints. It comes back with a plan: "Create a new endpoint at GET /v1/analytics/projects/:id/runs with these query params, returning this shape. Add a new service method. Create a frontend feature folder with API client + hooks. Add a new route. Use the existing withAccess(READ_RUN) middleware. Tests at test/integration/ce/analytics.test.ts."
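
To give you a sense of what "returning this shape" means in that plan, it implied a shared type along these lines. Names here are hypothetical, not the real Activepieces types:

```typescript
// Hypothetical response shape for the project analytics endpoint described above.
type ProjectRunAnalytics = {
  timeRangeDays: 7 | 30 | 90
  flowRunCount: number
  successRate: number       // 0..1 over the selected range
  averageDurationMs: number
}
```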

I read the plan. I notice it did not account for a quota check. I tell the agent: "Also enforce the analytics quota for paid plans." It updates the plan.

9:20 AM. Implement. I exit Plan Mode. I tell the agent: "Use /add-feature." This triggers our skill, which has the full cross cutting checklist: shared types, entity (if needed), migration (if needed), service, controller, frontend, tests, verify.

The agent goes. I switch tabs. I check Slack. I make more coffee.

9:55 AM. Check in. I come back. The agent has created 14 files, modified 6, run the linter, and is currently writing tests. I skim the diff. The endpoint shape looks right. The frontend hook looks right. There is one component that has a useEffect doing something weird. I leave a comment for myself to fix it. I let the agent finish.

10:15 AM. Test. Tests are written. I tell the agent: "Run them." It runs npm run test-api. Two failures. The agent reads the failures, finds the issues (one was a missing fixture, one was an off by one on the date range), fixes them, re runs. Green.

10:30 AM. Verify. I tell the agent: "Run lint." It runs npm run lint-dev. Clean.

I open the dev server and look at the dashboard in the browser. It works. The numbers look reasonable. The role check works (I log in as a non admin user, the page redirects).

10:45 AM. PR. I tell the agent: "Create a PR for this with the feature label." It handles the branch, the commit message, the PR description. I review the PR description, tweak two sentences, hit submit.

Total time: around 1 hour 55 minutes. For a real feature with backend, frontend, tests, role based access, and a quota check.

Was it from a single prompt? No. The brief was the prompt, then there was the Plan Mode refinement, then the implement step, then the test step, then the lint step, then the PR step. Probably six "prompts" total, most of them one or two words.

Is it close to "single prompt"? Closer than I would have believed two years ago.

Will it be a single prompt eventually? Maybe. We are moving in that direction. The harness is what makes it possible. The model getting smarter is what closes the gap.

💡 Pro tip: A great metric to track for yourself, instead of "did I do it in one prompt": how many corrections did I have to make? When I started using Claude Code in our repo, I corrected it 15 times per feature. Now it is 2-3. The harness is what changed. The model is roughly the same.


Part 7: Parallel Sessions and the Worktree Trick

Once you are cooking with one Claude session, you will start to notice something. The agent is often working while you are idle. Plan Mode is exploring? You are waiting. Tests are running? You are waiting. Migration is generating? You are waiting.

You can fix this.

We use git worktrees to run multiple Claude sessions in parallel, each on its own isolated branch:

```
# Tab 1
claude --worktree    # Feature A (backend heavy)

# Tab 2
claude --worktree    # Feature B (frontend heavy)

# Tab 3
claude --worktree    # Feature C (piece/integration)
```

Each tab gets its own worktree, which means each tab has its own checked out branch with its own working directory. No conflicts. No "wait, did I commit that?" No cross contamination.
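
Under the hood this is plain git worktrees, so if your tool of choice has no flag for it, you can get the same isolation yourself:

```bash
# one isolated checkout and branch per parallel session
git worktree add ../ap-feature-a -b feature-a
git worktree add ../ap-feature-b -b feature-b
git worktree list   # see what is checked out where
```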

While Tab 1 is exploring in Plan Mode, you switch to Tab 2 and review a diff. While Tab 2 is running tests, you switch to Tab 3 and write a brief. You are never blocked.

Practical limit: 3 parallel sessions. Beyond that, the context switching overhead in your own head is more than the gains. We have tried 4-5; it is worse, not better. Three is the sweet spot.

💡 Pro tip: Mix the types of work across parallel sessions. Do not run three backend features in parallel. They will all need your full backend brain at the same time. Run a backend feature, a frontend feature, and a piece integration in parallel. Each one uses different mental muscles, so the context switching tax is lower.


Part 8: Session Hygiene: The Boring Skills That Save You

This is the part of the post nobody wants to write but everybody needs to read.

AI coding sessions degrade. Always. Context gets bloated. Earlier instructions contradict later ones. The agent starts referencing things you said 40 minutes ago that no longer apply. The quality of output drops, slowly, in a way that is hard to notice from inside the session.

You have to manage this actively. Here is how we do it:

| Situation | Command | Why |
|---|---|---|
| Starting an unrelated feature | /clear | Wipe old context, start clean |
| Continuing the same feature next day | Reopen in the repo directory | Auto loads CLAUDE.md, no carryover needed |
| Claude keeps making the same mistake | /clear + rewrite the prompt | Correction loops waste context |
| Long session (45+ min) | /compact | Prevent quality degradation |
| Context feels bloated | /compact | Summarize and continue fresh |

The hardest one to internalize is the correction loop row: if you have corrected the agent twice on the same thing, stop correcting it. /clear, rewrite the prompt with what you just learned, and start over.

I cannot stress how counterintuitive this feels in the moment. You spent 20 minutes getting to this point. The agent is so close to right. Just one more correction and it will have it. Surely throwing away the session would be wasteful.

But every correction adds context. Context that contradicts earlier context. The agent is now operating in a fog of "the user told me to do X, then they told me not to do X, then they told me to do X but different." It will keep getting confused. The next correction will fail too.

Restart. Take what you learned. Write a better prompt. Three minutes of starting over beats thirty minutes of correction spiraling.

💡 Pro tip: The thing you are "throwing away" when you /clear is not really gone. Your repo still has all the work the agent did. The PR does not get unmade. You are throwing away a confused session, not your progress.


Part 9: The Weekly Rhythm

Here is the cadence we aim for. Not "hit", aim for. The number of weeks we hit it cleanly is climbing. We are not all the way there.

| Day | Morning | Afternoon |
|---|---|---|
| Mon | Plan Feature A (explore + plan) | Implement Feature A |
| Tue | Finish A, test, PR | Plan + start Feature B |
| Wed | Finish B | Plan + implement Feature C |
| Thu | Finish C, test, PR | Feature D (full cycle in 1 day) |
| Fri | Feature E (or polish D) | Code review, merge, retrospective |

Five features a week. Per engineer. Across 12+ engineers, that is a lot of features.

Notice the thing nobody talks about: planning happens the day before implementation. The brain that plans Feature B on Tuesday afternoon is fresh, while the brain that executes Feature A on Tuesday morning is in execution mode. The overnight gap lets you think about edge cases. The next morning, you come back with "oh, I should also handle this case", and you handle it before writing a single line of code.

This is the same trick monks have known for a thousand years: sleep on the hard problem. The harness lets you do it without losing context, because the next morning the agent reloads everything and you pick up exactly where you left off.

💡 Pro tip: Block out the first 30 minutes of every morning for planning, not implementing. Use Plan Mode aggressively. The afternoon is for execution. Treat your day as two completely different cognitive modes and use the harness differently in each.


Part 10: What We Are NOT Claiming

Let me be very explicit about this, because the internet is full of grifters and I want to be clear about what we are saying and what we are not saying.

We are NOT claiming:

  • That every feature ships from a single prompt. They do not. Most ship from 5 to 10 prompts, most of which are short.
  • That AI does the engineering and humans just review. Humans plan, humans review, humans make architectural calls. The agent executes.
  • That we have eliminated bugs. We have not. We have eliminated some classes of bugs (entity registration, data isolation) via rules. Others remain. Some are AI generated. Some are AI amplified (the agent confidently writes a bug faster than a human would have).
  • That this works for any codebase as is. Activepieces is a TypeScript monorepo with strong conventions. Your codebase might need different scaffolding.
  • That the 5 features per week target is hit every week by every engineer. It is not. Some weeks people ship 1 feature because the feature was hard. Some weeks people ship 7 because the features were easy. Five is the aim.

We ARE claiming:

  • That our pace has gone up a lot since investing in the harness.
  • That the rate of "agent gets it right on the first try" has gone from like 20% to like 70% on routine features.
  • That the time we spend correcting the agent has dropped 5x.
  • That junior engineers are productive faster, because the harness encodes the senior engineers' tribal knowledge.
  • That we are moving toward "one feature per engineer per day from a single prompt" and we expect to keep moving toward it as both the harness and the models improve.

The honest version of the productivity story is: the harness is a multiplier on whatever your team and your model can do. It does not generate productivity from nothing. It removes the friction that was preventing the productivity you already had from showing up in your output.


Part 11: Objections (The Comment Section, Pre Empted)

Let me address the objections I know are coming. Some of these I agree with, some I do not, and I will be honest about which is which.

"This is just slop generation."

Sometimes! When we do not follow our own rules. When we skip Plan Mode. When we accept the first diff without review. When we let the agent ship without lint and test. The harness is what prevents slop. Every rule, every skill, every feature doc is a piece of code review that happens before the code exists.

Code review still matters. We still do PR review. The harness does not replace human judgment. It raises the floor of what arrives at human judgment.

"What about juniors? Will they fail to learn?"

This is the objection I actually think is most interesting and least settled. Honestly, I do not fully know yet.

Here is what I see at Activepieces: juniors using the harness ship faster. They also seem to learn architectural patterns faster, because the rules and feature docs are teaching artifacts. The doc that explains how multi tenancy works in our codebase is a doc the agent reads, and it is also a doc the junior reads.

But: I do not think they are learning the fundamentals faster. The senior engineer who can debug a memory leak from first principles? That skill is not being taught by the harness. The harness teaches "how things work here", not "how computers work."

So the answer might be: juniors get productive in this codebase faster, but they need other paths to learn the deeper craft. Conferences, side projects, reading source code, debugging without AI. We are still figuring this out.

"Does this break in big codebases?"

We have around 1.6 million lines of code. It works in ours. It works better in ours than it does in a small codebase, actually, because the harness is more useful when there is more tribal knowledge to encode.

The thing that breaks in big codebases is not harnesses. It is unstructured AI use. If you do not have a harness, AI gets worse as the codebase grows because there is more for it to be wrong about. With a harness, AI scales up with the codebase because every new feature doc, every new rule, makes the next feature easier.

"This sounds like a lot of work."

It is. It is also work that pays compound interest. Every feature doc you write makes the next feature in that module easier. Every skill you encode means the next time you do that workflow, it is a slash command.

Start small. One AGENTS.md. One feature doc for the module you work in most. One skill for the thing you do most often. Iterate from there.

"What if the model changes? Will the harness become obsolete?"

This is the objection I take most seriously, and the answer is: maintain the harness with the same care you would maintain any other part of your codebase. New models change what the harness needs to contain. Older instructions become unnecessary. New ones become important.

The structure of the harness, layered context, rules, feature docs, skills, has been stable through several model generations for us. The contents update. That is fine. That is true of any documentation.


Part 12: The Minimum Viable Harness: Where to Start

If you have made it this far, you might feel overwhelmed. 40 feature docs, 9 skills, 4 rules, layered context, MCPs, worktrees. That is a lot. We did not build it in a day. We built it over many months, by adding the next most painful thing every week.

Here is how to start. This is the order we would do it again:

Day 1: Write one AGENTS.md (or CLAUDE.md) at the root of your repo.

Keep it under 100 lines (a skeleton you can copy follows this list). Include:

  • Stack: what is the tech stack in one sentence.
  • Architecture: the three or four most important architectural rules ("multi tenant: filter every query by tenant ID").
  • Commands: how to build, lint, test.
  • Conventions: the three or four conventions that are not obvious from reading the code.
  • Don'ts: the three or four things that will break the build or production.
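
If you want something to copy, here is a skeleton modeled loosely on ours. Every line is a placeholder; swap in your repo's specifics:

```markdown
# CLAUDE.md

## Stack
TypeScript monorepo. Fastify + TypeORM backend, React frontend, Postgres.

## Architecture
- Multi tenant: every query filters by the tenant id.
- Editions: never import enterprise only code from the community edition.

## Commands
- Lint: npm run lint-dev
- Test: npm run test-api

## Conventions
- POST for all create and update endpoints.
- New entities must be registered in getEntities().

## Don'ts
- No outbound HTTP outside safeHttp.
- No any, no type casting.
```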

Week 1: Add one rule to .claude/rules/.

Pick the bug your team has fixed most often. Write a 5 line rule that prevents it. Put it where the agent will load it every session.

Week 2: Write one feature doc.

Pick the module you work in most. Write a 60 line doc with the structure I described: summary, key files, edition (if applicable), entities, endpoints. Use it as a brief next time you ask the agent to add a feature in that module. Notice how much faster the planning goes.

Week 3: Encode one skill.

Pick the workflow you do most often (add an entity, add an endpoint, build an integration). Turn it into a slash command that lists every step. Use it next time. Notice how it forces you to do all the steps you usually forget.

Month 2: Symlink your .agents/ folder.

Set up .agents/ as the source of truth. Symlink .claude/ to it. Now you are tool agnostic.

Month 3: Add MCPs as needed.

Your ticket system. Your DB. Your browser, if UI testing is a real bottleneck. Add them one at a time. Evaluate each one.

Month 6: Look back.

Look at your team's pace. Look at how much you correct the agent now versus six months ago. Look at the size of your harness. The harness has compounded. You shipped features in the meantime. You are still iterating.

💡 Final pro tip: The harness is a living artifact. It is not a one time setup. Every week, ask: "what is the thing the agent got wrong this week that I had to correct?" Write it down. Either it goes in a rule, a feature doc, or a skill. Three months later, you have a harness that reflects three months of organizational learning.


Part 13: The Bigger Thesis: Why the Harness is the Moat

Let me close with something a little speculative, because if you read 6,000 words of mine you earned a take.

Every six months, a new frontier model drops. It is better at coding. It is better at reasoning. It is better at following instructions. Every team using AI gets a free upgrade.

But: which teams capture that upgrade?

The team with no harness gets a slightly better model that still cannot navigate their codebase. They notice a small improvement and move on.

The team with a strong harness gets a slightly better model that can use their entire harness more effectively. The same feature docs work better. The same skills run more reliably. Plan Mode produces better plans. The marginal model improvement compounds across every layer of the harness.

The harness is a leverage multiplier on model improvements. Teams with good harnesses get a bigger lift from each new model release than teams without. Over years, that compounds into a real difference in shipping velocity.

In other words: the harness is the moat. Models are commoditizing. Frontier capability is converging. What is not converging is how well structured your codebase is for AI to actually use it. That is a thing your team builds, that compounds, that does not get taken away when a new model drops.

If I were starting a company today, I would invest in the harness from day one. Not because the current models are great. Because the next models will be great, and I want to be ready to absorb every ounce of that capability the moment it arrives.

That is the bet we are making at Activepieces. We are not there yet. We are moving fast. We will keep writing about what we learn.

If you are building something similar, I would love to hear from you. Find me in the Activepieces community, poke around our open source repo, or just yell into the void on Twitter and I will probably find it.

The harness is what we are working on. The harness is what works. The harness is the bet.

Let us see where it takes us.


If you found this useful, the best thing you can do is steal something from it. Pick one section. Apply it to your codebase tomorrow. Tell me what worked and what did not.
