If coding agents aren't your primary battlefield, "harness engineering" probably feels like a distant concept. Scrolling through a timeline full of articles written for Claude Code and Codex users, you may have thought, "This isn't about me."
My own agent use wasn't centered on coding either, so none of the articles out there seemed to apply to my case. But looking back, I'd been doing the same thing — it just didn't have a name yet.
I've been running a business automation agent via Claude Desktop (through MCP servers) for several months now. It gathers information across multiple work tools like Slack, Confluence, and Google Calendar, switches judgment criteria based on context, and produces outputs accordingly. What the agent refers to goes beyond surface-level rules — accumulated knowledge such as understanding of organizational structure, past decision-making history, and writing style guidelines forms the foundation for its judgment.
I haven't written a single line of code. All I write is Markdown. And most of that Markdown is generated by the agent itself — I just approve or give revision instructions through chat. I almost never open the files directly to edit them.
This article isn't for people already practicing harness engineering. It's for those who've heard the term but thought, "That's a coding thing, right?" — I'm sharing the structure I've found. Each example includes a ready-to-use sample, so if you're running a business automation agent with MCP, you can try them as-is.
## What Is Harness Engineering?
Let me set the foundation.
In a February 2026 blog post, Mitchell Hashimoto, co-founder of HashiCorp, gave the name "Engineer the Harness" to a practice he'd cultivated in his AI agent workflow. The approach: when an agent makes a mistake, instead of fixing the prompt, build an environment where the same mistake can't happen again.
https://mitchellh.com/writing/my-ai-adoption-journey
Days later, OpenAI published a practice report titled "Harness engineering." A small engineering team spent five months building a product using only Codex agents with zero hand-written code, and the repository reached roughly one million lines. The back-to-back publication of Hashimoto's blog and this report cemented "harness engineering" as a term.
https://openai.com/index/harness-engineering/
In the coding agent context, this translates to implementations like banning specific patterns with ESLint, defining commands in AGENTS.md, and running automated reviews via pre-commit hooks.
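As a rough sketch (the commands and rules below are illustrative, not taken from Hashimoto's or OpenAI's actual files), an AGENTS.md fragment for a coding agent might look like this:

```markdown
<!-- Illustrative sketch; the commands and rules are hypothetical -->
# AGENTS.md

## Commands
- Build: `npm run build`
- Test: `npm test` (run before every commit)
- Lint: `npm run lint` (CI rejects anything that fails this)

## Rules
- Never edit files under `generated/` by hand; regenerate them
- All database access goes through the shared client module, never raw SQL in handlers
```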
From "asking" (prompts) to "building" (environment). That's the core.
Up to this point, the story seems confined to the world of coding agents. But in 2025, MCP became widespread and rapidly expanded the practical scope of non-coding agents. Once agents gained direct access to business tools like Slack, Confluence, Google Calendar, and Jira, the risk of "agents making mistakes on their own" spilled beyond coding. Harnesses are no longer just for coding agents.
## I Kept Rewriting Prompts
When you incorporate agents into business workflows, you run into experiences like these.
You write "don't make financial judgments" and it makes them anyway. You write "don't post directly to Slack — create a draft" and it tries to post. You write "commit and push at the end of the session" and it forgets.
Each time, I'd rewrite the prompt, on the assumption that if I wrote it more clearly, it would understand.
At some point, I realized the assumption itself was wrong. No matter how much you polish a prompt, the agent makes the same mistake in the next session. Instructions get buried in long contexts. When the session ends, memory disappears entirely. Requests are volatile.
Stop expecting the agent to remember. Change the environment instead. Looking back, this was the entry point to harness engineering.
## Harnesses for Non-Coding Agents
When I lined up what I'd been doing in my repository, the same structure as coding agent harnesses emerged.
| Coding Agent Environment | Non-Coding Agent Environment |
|---|---|
| ESLint / TypeScript strict type enforcement | Prohibited actions section under `agents/` |
| AGENTS.md command definitions | Context routing rules in instruction files |
| Pre-commit hooks | Mandatory actions at session end |
| CI gates (can't merge unless tests pass) | Forced knowledge accumulation rules under `knowledge/` |
The materials on each side are completely different. One uses linters and hooks, the other uses Markdown files. But the design intent is the same: building an environment outside the agent where the agent can behave correctly.
One prerequisite to note: most AI chat tools have a designated place for instruction files that are automatically loaded at session start. In Claude Desktop it's Project Knowledge; in ChatGPT it's Custom Instructions. What I call "instruction files" in this article are Markdown files placed in this mechanism. Unlike writing in the prompt each time, they're automatically placed in a position that's hard to bury even as conversations grow longer.
Here are three concrete examples, each with a ready-to-use sample.
### Structuring Prohibited Actions
Say you've delegated Slack posting to your agent. Even if you write "don't post directly — create a draft" in the prompt, it forgets across sessions.
The solution is to create a prohibited actions section in the instruction file and structure it so it's loaded every session. Move the instruction's location from prompt (volatile) to file (persistent).
```markdown
## Prohibited Actions

Follow these without exception.

- Do not auto-post to company Slack (draft only; user handles posting)
- Do not make definitive financial judgments (always ask user for confirmation)
- Do not treat replies to clients as final versions (always get user approval)
- Do not make judgments about personnel evaluations or compensation
- When including confidential information (salaries, contract amounts, etc.) in summaries, explicitly note this
```
Instead of telling someone verbally each time, place rules in a fixed location and reference them every time. It's that simple, but it changes the lifespan of rules from per-session to permanent.
### Forcing Actions at Session End
You want to leave a work log at the end of each session. Even if you write "create a work log and commit & push at the end" in the prompt, the agent gets absorbed in the conversation and simply wraps up without doing it.
The solution is to define trigger conditions and mandatory actions as a set in the instruction file.
```markdown
## Mandatory Actions at Session End

When the user indicates work completion with phrases like "done," "thanks," or "commit,"
execute the following. Skipping is prohibited.

1. Create a work log at `docs/work-logs/YYYY-MM-DD-{topic}.md`
   - Include: background, options considered, key decisions, deliverables, next steps
2. Append a summary of changes to `CHANGELOG.md`
3. Execute git commit & push
```
The difference from the prohibited actions example is that trigger conditions for "when to fire" are also defined. By explicitly stating end signals like "done," "thanks," and "commit," the agent can more easily judge "this is the moment." It's not perfect, but the firing rate goes up significantly compared to writing "execute at the appropriate timing" with vague triggers.
The key is the single line: "Skipping is prohibited." If you leave room for the agent to judge, it will decide on its own that "it's probably fine to skip this time" when conversations get long. Removing discretion stabilizes behavior.
There's a secondary benefit too. When rules are defined in the instruction file, a simple "leave a log" or "commit" is enough for the agent to instantly understand "that action." No need to explain from scratch each time. The instruction file becomes shared vocabulary between human and agent.
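To make the output concrete, a work log generated under these rules might look something like the following (the topic and contents are a hypothetical example, not an actual log from my repository).

```markdown
<!-- Hypothetical example of docs/work-logs/2026-03-05-slack-digest.md -->
# 2026-03-05 Slack digest rework

## Background
The weekly digest kept missing threads from one channel.

## Options considered
- Widen the channel list in the main instructions
- Add a routing rule in the project-specific instruction file

## Key decisions
Adopted the routing rule; the main file stays short.

## Deliverables
Updated project instruction file, new digest draft.

## Next steps
Watch one more weekly cycle, then delete the old channel list.
```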
### Forced Knowledge Accumulation
The third is an example of a "can't proceed without passing the check" structure.
In conversations with agents, information worth accumulating comes up frequently — things decided in meetings, conclusions from tool selection, facts discovered during troubleshooting. Even if you write "save important information" in the prompt, it predictably forgets.
The solution is to embed a "knowledge check" protocol in the instruction file.
```markdown
## Knowledge Accumulation (Mandatory Check)

Before each response, internally execute the following check. Skipping is prohibited.

Check: Does the user's immediately preceding statement, or your own response,
contain new information matching any of the following?

1. Factual information: team composition, tech stack, account info, environment configuration
2. Decisions: architecture selection, tool adoption, policy changes
3. Learnings: facts discovered during troubleshooting, gotchas, operational tips
4. Client-specific: contact names, contact info, project progress

→ If applicable: In addition to the normal response, append the following at the end.

💾 Knowledge capture proposal:
  File: knowledge/{project-name}/{filename}.md
  Content: (summary of content to add)
  Reason: (why this should be accumulated)

→ If not applicable: Append nothing.
```
The intended structure is "can't produce a response without passing the check." Of course, LLMs can skip instructions, so the enforcement isn't as strong as a mechanical gate. Still, by embedding the check into the system, the probability of capturing information rises significantly even when the human forgets to say "save that."
Since implementing this system, knowledge files have been steadily accumulating in the knowledge directory.
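For a concrete image, a captured file might look something like this (the path and contents are a made-up example following the check's categories).

```markdown
<!-- Hypothetical example of knowledge/project-a/runbook-page-id.md -->
# Runbook lookup gotcha

- Category: Learning (troubleshooting)
- Source: session on the staging deploy failure

The agent looks up the Confluence runbook by page title, so renaming the page
breaks the lookup silently. Reference the page by ID in the SRE instructions
instead of by title.
```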
## Acknowledging the Enforcement Gap
Let me address the strongest counterargument upfront. "Markdown prohibitions don't have the same enforcement power as a linter." That's correct.
Linters and type checkers mechanically detect rule violations. Depending on configuration, they can even block builds and merges entirely. Markdown prohibitions, on the other hand, carry the risk of the agent reading past them. If buried in a long instruction file, effectiveness drops.
However, the comparison here isn't against "mechanical enforcement" — it's against "writing it in the prompt each time." Why does writing in a file work better than a prompt? Two reasons.
First, the "reference mechanism is different." As noted earlier, instructions placed in Project Knowledge or Custom Instructions are passed to the agent in a separate channel from regular messages. They're placed in a position that's harder to bury even as conversations grow longer, structurally increasing the probability of being referenced.
Second, "accumulation becomes irreversible." Instructions written in a prompt don't exist in the next session. Write them in a file, and they persist unless deleted. The cycle of "write a good instruction → forget → write again" becomes "write a good instruction → append to file → automatically referenced from then on."
Lining up enforcement strength from weakest to strongest:
"Write in the prompt each time" → "Place in a persistent file and reference every time" → "Mechanically block with linters and hooks"
Non-coding agents currently sit at the middle position: definitely stronger than the left end, but not reaching the right. Still, moving to the middle does more for agent stability than staying at the left.
## Repository Structure as a Design Decision
So far I've written about individual rules, but where to put those rules is itself a design decision.
The repository structure that solidified through operation looks like this:
```
ai-agents/
├── agents/                      # Role-specific instruction files
│   ├── assistant.md             # Main instructions (prohibitions, mandatory actions)
│   ├── project-a/
│   │   ├── sre-support.md       # SRE-specific instructions
│   │   ├── qa-support.md        # QA-specific instructions
│   │   └── ...
│   └── project-b/
│       ├── accounting.md        # Accounting-specific instructions
│       └── ...
├── knowledge/                   # Accumulated knowledge
│   ├── project-a/
│   ├── project-b/
│   └── writing-style-guide.md
├── docs/work-logs/              # Per-session work logs
└── CHANGELOG.md
```
This structure shares two principles with coding agent harness design.
The first is "separation of concerns." OpenAI's report documents the experience of a monolithic AGENTS.md not working well. When everything in the context is "important," nothing is important. In my own repository too, I initially crammed everything into a single Markdown file. Separating files by role and having the agent reference only what's needed improved instruction effectiveness.
What enables this is context routing rules. Define routing in the main instruction file so the agent can reference the appropriate specialized instructions based on conversation content.
```markdown
## Context Routing Rules

Judge which context the user's statement belongs to and reference the appropriate specialized instructions.

- Project A context signals: AWS, infrastructure, SRE, QA, team member names → Reference files under `agents/project-a/`
- Project B context signals: billing, contracts, accounting, legal → Reference files under `agents/project-b/`
- Ambiguous: Ask which project this is about
```
This is the same structure as the AGENTS.md "design as a pointer" principle. The main file handles routing only, delegating details to specialized files. OpenAI's report describes keeping AGENTS.md to roughly 100 lines, functioning as a map. For non-coding agents, I've observed the same tendency — the longer the instruction file, the more effectiveness drops.
The second is "version control." By placing instruction files in a Git repository, change history is preserved. "When was this prohibition added?" "Which rule change made things stable?" — all traceable via diff. Slack messages and ad-hoc prompts don't preserve this history. Additionally, since it's a Git repository, you're not tied to a specific PC. Keep it on a remote, and you can launch the same harness from any device.
OpenAI's team makes the same point. Slack discussions, Google Docs content — if it's not in the repository, it's inaccessible to the agent and might as well not exist. This applies equally to non-coding agents.
## Getting Started
You don't need to structure everything from the start when beginning harness engineering for non-coding agents.
In my case too, the early days were spent rewriting prompts. The order in which structure solidified was:
- When the agent makes the same mistake twice, write it in a file instead of a prompt
- When the file gets bloated, split by role
- When information is lost between sessions, build an accumulation system
It's the same pattern Mitchell Hashimoto describes. "When the agent makes a mistake, build a system where that mistake can't happen again." For coding, you build it with linters and hooks. For non-coding, you build it with Markdown file structure. The material differs, but the thinking loop is the same.
Here's a minimal starter template. Place it in Claude Desktop's Project Knowledge or ChatGPT's Custom Instructions and it works as-is.
```markdown
# Assistant Instructions

## Your Role

An AI assistant that supports user workflows.
Use MCP tools like Slack, Google Calendar, and Confluence for information gathering and organization.

## Prohibited Actions

- Do not auto-post to company Slack (draft only)
- Do not make definitive financial judgments (always ask user for confirmation)
- When including confidential information in summaries, explicitly note this

## Mandatory Actions at Session End

When the user indicates work completion, execute the following. Skipping is prohibited.

1. Create a work log at `docs/work-logs/YYYY-MM-DD-{topic}.md`
2. If there are changes, execute git commit & push

## Knowledge Accumulation (Mandatory Check)

Before each response, internally execute the following check. Skipping is prohibited.

Check: Does the immediately preceding conversation contain new information matching any of the following?

1. Factual information (team composition, tech stack, environment configuration)
2. Decisions (architecture selection, tool adoption, policy changes)
3. Learnings (facts discovered during troubleshooting, gotchas)

→ If applicable: Append a knowledge capture proposal at the end
→ If not applicable: Append nothing
```
This template is roughly 30 lines. Start here, and add one line to the prohibited actions every time the agent makes a mistake. In a few months, you'll have a harness built specifically for you.
## The Question Harnesses Share
Harness engineering isn't a coding-specific technique. It's a design philosophy: giving agents a reliable execution environment.
Coding agents build that environment with types, linters, and hooks. Non-coding agents build it with structured Markdown and forced referencing. The materials differ, but the question is the same: "When this agent makes a mistake, where is the system that prevents it from happening a second time?"
Since shifting from "I just need to write better prompts" to "I need to build a structure where the same mistake can't happen," my agents have been running more stably.