Luhui Dev
AHE Deep Dive: How Coding Agent Harnesses Automatically Evolve

Introduction

When building a coding agent, the capability of your base model is only part of the equation. In real production scenarios, what matters just as much is the harness wrapped around that model — the prompt, tools, middleware, memory, execution environment, trace, and evaluation pipeline.

This is exactly what the AHE paper addresses: how to make a coding agent's harness continuously observable, modifiable, testable, rollback-able, and even self-iterating — just like software engineering.

The full paper title is "Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses", authored by researchers from Fudan University, Peking University, and Shanghai Qiji Zhifeng Co., Ltd. The academic teams bring methodological design, while the industry team contributes experience from Agent/LLM infrastructure and Nex AGI systems.

Even better, AHE is open source: china-qijizhifeng/agentic-harness-engineering.

This makes it more than just a paper concept — you can directly examine the seed coding agent, evolve agent, experiment configs, traces, manifests, and rollback structures. For anyone building coding agents, agent infrastructure, or broader agent products, this repository is worth dissecting.

This article explores three questions: why AHE works, how it evolves harnesses, and how to start your own small experiment with the repository.

Part 1: A Quick Intro to Harness Engineering

A harness is the external engineering shell that makes a model actually work. In a coding agent, it typically includes:

  • System prompt: defines the agent's basic working mode

  • Tools: file I/O, shell, search, test execution, code modification, etc.

  • Tool descriptions: what the model sees about tool usage and parameter schemas

  • Middleware: interception, validation, correction, and logging before/after tool calls

  • Memory: short-term, long-term, and experience accumulation

  • Context management: compression, pruning, and retrieval

  • Execution environment: sandbox, permissions, runtime isolation

  • Evaluation/observability: testing, trace, logs, rewards, failure reports, regression tracking

This structure determines how the model approaches tasks, invokes tools, handles failures, and judges completion.

For example, when a shell command hangs in production, the solution isn't to keep adding "don't use interactive commands" to the prompt. A more robust approach: add timeout to the shell tool, use middleware to detect high-risk commands, truncate long outputs at the response layer, and enforce state checks before task completion.
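A minimal sketch of what such a hardened shell tool could look like, assuming a plain `subprocess` backend (the function name, output limit, and message wording are illustrative, not AHE's actual implementation):

```python
import subprocess

# Illustrative limit; the real harness would tune this per benchmark.
MAX_OUTPUT_CHARS = 4000

def run_shell(command: str, timeout_s: int = 60) -> dict:
    """Run a command non-interactively, with a hard timeout and output truncation."""
    try:
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True,
            timeout=timeout_s,
            stdin=subprocess.DEVNULL,  # block reads that would wait for a user
        )
        # Truncate so a verbose command cannot pollute the model's context
        output = (result.stdout + result.stderr)[:MAX_OUTPUT_CHARS]
        return {"ok": result.returncode == 0,
                "returncode": result.returncode,
                "output": output}
    except subprocess.TimeoutExpired:
        # Return an explicit failure reason instead of hanging silently
        return {"ok": False, "returncode": None,
                "output": f"command timed out after {timeout_s}s; "
                          "avoid interactive or long-blocking commands"}
```

The point is that the timeout and truncation live in the tool, where the model cannot forget them, instead of in a prompt reminder it might ignore.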

This is the essence of Harness Engineering: putting agent capabilities into a maintainable runtime system.

I won't dive deeper into the Harness concept here. If you want to learn more, search for keywords like: Harness Engineering, Agent Harness, Agent Runtime, Tool-use Agent, Agent Observability, Agent Evaluation, Coding Agent Infrastructure.

Let's move to the main focus of this article.

Part 2: AHE's Core Positioning — Self-Iterating Coding Agent Harnesses

AHE stands for Agentic Harness Engineering.

The paper's subtitle contains the key phrase: Observability-Driven Automatic Evolution of Coding-Agent Harnesses.

This breaks down into three layers:

First, AHE targets coding agent harnesses. It doesn't train new models or modify base model parameters.

Second, it performs automatic evolution. The goal isn't a one-time manual prompt tweak, but continuous harness evolution across multiple runs.

Third, it relies on observability. Changes come from traces, logs, rewards, failure analysis, change manifests — not from vague "self-reflection" in a prompt.

So AHE's precise positioning is:

An automatic evolution framework for coding agent harnesses. Through observable runtime evidence, it continuously improves the agent's surrounding prompt, tools, middleware, memory, skills, and sub-agents.

This is the key difference from ordinary prompt optimization. AHE does modify prompts, but its action space is much larger — it includes tools, middleware, and memory as evolvable structures.

Part 3: AHE's Experimental Results

AHE's main experiments ran on Terminal-Bench 2. The paper reports that after 10 iterations, AHE improved the seed harness's pass@1 from 69.7% to 77.0%. This shows that on the target benchmark, AHE found effective harness modifications.

Results Chart

The ablation study is even more revealing. The paper reverted each component of full AHE back to its seed-harness version individually, with roughly these results:

Ablation Study

This result is highly informative.

If gains mainly came from better system prompts, prompt-only should improve. But in the experiment, prompt-only actually decreased, while memory, tools, and middleware showed more significant improvements.

This means AHE's key benefits come from structural harness modifications. It also suggests that in complex tasks, many agent failures require harder (more engineering-focused) mechanisms: tool behavior, runtime interception, state recording, long-term experience, regression testing.

The paper also conducted transfer experiments. When the evolved harness transferred to SWE-bench-verified, success rate gains were small, but token usage dropped more noticeably. This suggests AHE's evolved structures may be better at reducing ineffective exploration and context waste.

Cross-model transfer is also noteworthy. When AHE-generated harnesses were applied to multiple base models, the paper reports positive gains across the board. This indicates the learned components contain some transferable engineering structures.

My assessment: AHE's prediction of "which changes will fix problems" is significantly better than random, but its prediction of "which changes will cause regressions" is still relatively weak. It does prove that harnesses can be continuously evolved in a file-based, evidence-based, version-controlled manner.

Part 4: AHE's Key Workflow — Evaluate, Diagnose, Modify, Verify, Rollback

AHE's main loop:

graph TD
    A[Current Harness] --> B[Run Code Agent on benchmark]
    B --> C[Collect trace, log, reward]
    C --> D[Analyze failure patterns]
    D --> E[Evolve Agent modifies Harness files]
    E --> F[Write change_manifest]
    F --> G[Re-evaluate next round]
    G --> H[Verify if changes work, rollback if needed]
    H -.-> A
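To make the loop concrete, here is a toy, self-contained sketch of the evaluate-modify-verify-rollback cycle. Every name and number below is an illustrative stand-in for what evolve.py actually orchestrates, not code from the repository:

```python
import random

# Toy stand-in for the AHE main loop: the "harness" is reduced to a single
# quality score, propose_change() plays the Evolve Agent, and run_benchmark()
# plays the Code Agent evaluation.

def run_benchmark(harness: dict) -> float:
    return harness["quality"]

def propose_change(harness: dict) -> dict:
    # A change manifest: what was touched, the predicted effect, and the risks.
    return {"files": ["tools/shell.py"],
            "delta": random.uniform(-0.05, 0.05),
            "expected_to_fix": ["task-042"],
            "might_regress": ["task-007"]}

def evolve(harness: dict, iterations: int) -> dict:
    baseline = run_benchmark(harness)
    for _ in range(iterations):
        manifest = propose_change(harness)
        harness["quality"] += manifest["delta"]      # apply the modification
        new_score = run_benchmark(harness)
        if new_score < baseline:                     # verify against the manifest
            harness["quality"] -= manifest["delta"]  # rollback a bad change
        else:
            baseline = new_score                     # keep it and move on
    return harness

random.seed(0)
evolved = evolve({"quality": 0.697}, iterations=10)
```

The rollback branch is the key property: the score can never end meaningfully below where it started, mirroring AHE's requirement that every change be reversible.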

This closed loop has three main actors.

First is the Code Agent.

This is the actual agent completing coding tasks, and the object being optimized. In the AHE repository, the seed agent is quite simple — basically a bash-only coding agent.

Second is the Agent Debugger.

It reads the Code Agent's execution traces and compresses massive traces into readable failure reports. After a benchmark run, raw traces can be extremely long, making direct model reading too costly. Agent Debugger converts these traces into overviews and per-task analyses, providing evidence for subsequent modifications.

Third is the Evolve Agent.

It reads the previous round's results, failure analysis, and historical modification records, then modifies harness files in the workspace. Its modification targets include prompts, tools, middleware, memory, skills, sub-agent configs, etc.

AHE adds strong engineering constraints to this process:

Every modification must land in files. Every modification requires a manifest. The next round must verify predictions in the manifest. Poor results must be rollback-able. The entire process should leave an auditable evidence chain.

This forces the Evolve Agent to answer specific questions rather than vague self-reflection: which file was changed, why, which tasks are expected to be fixed, which tasks might be harmed, and whether the next round's results validate this judgment.

Part 5: What Evolvable Components Does AHE Break the Harness Into?

AHE's first step is breaking the harness into explicit components.

The paper emphasizes several evolvable object types:

System Prompt: Defines the Code Agent's basic behavior, like executing shell non-interactively, checking state before task completion, not exiting prematurely.

Tool Descriptions: What the model sees about tools. The tool itself might not change, but if the description changes, so does how the model calls it.

Tool Implementations: The actual tool implementation. For example, how the shell tool executes commands, handles timeouts, truncates output, returns error messages.

Middleware: Runtime interception layer. It can check before/after tool calls, like detecting dangerous commands, reminding about unverified tasks, blocking premature endings, recording risk states.

Skills: Reusable experience. Think of these as operation manuals for certain task patterns.

Sub-agents: Sub-agent configurations. Complex tasks can be split to different roles.

Long-term Memory: For accumulating experience across tasks and rounds.

This decomposition gives the Evolve Agent a richer action space. It can choose the right place to intervene based on failure evidence.

Example: the Code Agent keeps hanging in the shell. The least efficient approach is adding more prompt reminders. AHE's path is more engineering-focused: add a timeout to the shell tool; have middleware check for obviously interactive commands; make return messages explicitly state failure reasons; add behavioral constraints to the system prompt.
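A minimal sketch of such a middleware check, assuming a simple allow/deny interface (the command list and messages are illustrative, not the repo's actual middleware):

```python
# Hypothetical pre-execution middleware: deny commands that open an
# interactive session, with an explicit reason the model can act on.
INTERACTIVE_COMMANDS = {"vim", "nano", "less", "more", "top", "ssh"}

def shell_middleware(command: str) -> tuple[bool, str]:
    """Return (allowed, message) for a proposed shell command."""
    tokens = command.strip().split()
    first = tokens[0] if tokens else ""
    if first in INTERACTIVE_COMMANDS:
        return False, (f"blocked: '{first}' starts an interactive session and will hang; "
                       "use a non-interactive alternative (pipe input, add flags like -y)")
    return True, "ok"
```

Because the denial message names the problem and suggests an alternative, the model gets actionable feedback instead of a silent hang.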

These structural modifications are more stable and easier to reuse and rollback.

The key is understanding the positioning: prompts are behavioral suggestions; tools, middleware, and memory are execution mechanisms.

AHE's value lies in bringing these execution mechanisms into the evolution scope.

Part 6: Three Layers of Observability — How AHE Avoids Blind Search

Just having an agent randomly modify files and rerun benchmarks has limited value. AHE's core design is three layers of observability.

1. Component Observability

Component observability means the system knows what parts the harness has, where each part is, how to modify it, and how to register it.

In the AHE repository, prompts, tool descriptions, tool implementations, middleware, memory, etc., all appear as files. New tools need YAML descriptions and Python implementations, plus config registration; new middleware needs explicit integration; new skills or sub-agents also need config exposure.

2. Experience Observability

Experience observability means after an agent runs, the system records how it succeeded or failed.

AHE collects each task's trace, runtime log, reward, etc. Then Agent Debugger compresses these raw traces into analysis reports.

When a coding agent fails, simply knowing "it failed" isn't very useful. What you really need to locate is the failure level: command execution failure, dependency installation failure, test not run, file path error, output too long causing context pollution, agent prematurely judging task complete, losing previous state in long tasks.

Through traces and analysis, AHE turns failures into readable, summarizable, actionable evidence.
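A toy version of that bucketing over raw trace text might look like this (the patterns and labels are my own illustrations, not the Agent Debugger's actual taxonomy):

```python
import re

# Illustrative failure-pattern table: label + regex over raw trace text.
FAILURE_PATTERNS = [
    ("dependency_install_failed", re.compile(r"pip install.*error|No module named", re.I)),
    ("command_timeout",           re.compile(r"timed out|TimeoutExpired", re.I)),
    ("test_not_run",              re.compile(r"no tests ran|collected 0 items", re.I)),
    ("file_not_found",            re.compile(r"No such file or directory", re.I)),
]

def classify_failure(trace_text: str) -> str:
    """Bucket a raw trace into a coarse failure level; first match wins."""
    for label, pattern in FAILURE_PATTERNS:
        if pattern.search(trace_text):
            return label
    return "unclassified"
```

In AHE the classification is done by an LLM over compressed traces rather than regexes, but the output plays the same role: a failure level the Evolve Agent can act on.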

3. Decision Observability

After each modification, the Evolve Agent must write a change_manifest.json. This manifest records which files were changed, what failure pattern they address, why this component was chosen, which tasks are expected to be fixed, which might regress, and the modification's constraint strength.
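As a concrete illustration, such a manifest might look like the following; the field names here are my guesses at the shape, so check the repository for the real schema:

```json
{
  "iteration": 3,
  "files_changed": ["tools/shell.py", "middleware/safety.py"],
  "failure_pattern": "interactive commands hang until the task times out",
  "why_this_component": "prompt reminders did not prevent hangs; a hard timeout in the tool does",
  "expected_to_fix": ["task-012", "task-047"],
  "might_regress": ["task-103"],
  "constraint_strength": "hard"
}
```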

After the next evaluation round, the system checks this manifest to see if predictions came true.

This step turns every modification into a verifiable hypothesis. Even without using AHE's full automatic evolution pipeline, just introducing the change manifest habit into your own agent team will immediately improve engineering transparency.

Many agent projects struggle with long-term maintenance precisely because of this: lots of prompt changes, lots of tool adjustments, but nobody knows what each change actually solved, and nobody knows if it introduced new problems. AHE's manifest mechanism at least makes this process auditable.

Part 7: AHE's Engineering Organization from the Repository

The main entry point for the AHE repository is evolve.py. It orchestrates the entire evolution workflow, including initializing workspace, running evaluations, handling iteration directories, doing attribution, recovery, and rollback.

The seed agent being evolved is agents/code_agent_simple/, which includes:

  • code_agent.yaml describes how this agent loads prompts, which tools it uses, and what tracer to use.

  • systemprompt.md is the initial system prompt.

  • LongTermMEMORY.md and ShortTermMEMORY.md correspond to the long-term and short-term memory interfaces.

  • tool_descriptions/ holds tool descriptions; tools/ holds tool implementations.

The Evolve Agent is in agents/evolve_agent/. Key files worth examining:

  • evolve_agent.yaml defines which tools, middleware, and skills the Evolve Agent itself can use.

  • evolve_prompt.md is the evolution contract: it specifies that the Evolve Agent can only modify the workspace, must make evidence-based changes, must write summaries and manifests, and must follow registration rules.

Config files are in configs/ and configs/experiments/. configs/base.yaml is the base config, configs/experiments/exp-simple-code-gpt54.yaml is a config overlay close to the paper experiments.

Launch scripts are in scripts/, like scripts/evolve.sh for starting long experiments, scripts/build_templates.py for building task templates for E2B.

If you just want to understand the project, you don't need to read all files at once. I recommend this reading order:

README
  ↓
agents/code_agent_simple/code_agent.yaml
  ↓
agents/code_agent_simple/systemprompt.md
  ↓
agents/evolve_agent/evolve_prompt.md
  ↓
configs/base.yaml
  ↓
configs/experiments/exp-simple-code-gpt54.yaml
  ↓
evolve.py

This sequence helps you build concepts first, then see execution details.

Part 8: Getting Started with the Repository — Run a Small Experiment First

AHE is not a lightweight SDK. You can't expect to pip install and immediately embed it in production systems.

It's more like a research experiment framework. Running full paper-level experiments requires an LLM API, an E2B sandbox, a SERPER API key, benchmark data, concurrent scheduling, and a considerable token budget.

So a more realistic onboarding approach is to run a minimal closed loop first.

Set the goal as: get AHE's core pipeline running.

That is:

graph LR
    A[Task execution] --> B[Trace generation]
    B --> C[Analysis generation]
    C --> D[change_manifest written]
    D --> E[Next round re-evaluation]
    E --> F[change_evaluation<br>judges modification effect]

Once this pipeline works, you understand AHE's practical value.

1. Clone the Repository

Official repository:

git clone https://github.com/china-qijizhifeng/agentic-harness-engineering.git
cd agentic-harness-engineering

2. Install Dependencies

The project uses uv to manage Python dependencies.

uv sync

3. Configure Environment Variables

Copy the environment variable template:

cp .env.example .env

At minimum, pay attention to these variables:

LLM_API_KEY
LLM_BASE_URL
E2B_API_KEY
SERPER_API_KEY
GITHUB_TOKEN

The Agent Debugger's model endpoint can also be configured separately; refer to .env.example for specifics.

One important note: AHE's task execution depends on E2B sandbox. Much code execution happens in isolated remote environments. This helps with security and reproducibility, but also means you need an E2B account and credits.

4. Prepare Benchmark Task Templates

The official workflow requires building task templates first. Example command:

uv run python scripts/build_templates.py --dataset-dir /path/to/dataset -j 16

Replace /path/to/dataset with your actual task data path.

If you're just doing a small experiment, I don't recommend preparing full Terminal-Bench 2 at the start. Select a few tasks and get the pipeline working first — that's more important.

5. Start with a Small Config

For paper experiment config, refer to:

configs/experiments/exp-simple-code-gpt54.yaml

Running the full config is quite costly. Copy a small config, for example:

cp configs/experiments/exp-simple-code-gpt54.yaml configs/experiments/exp-mini.yaml

Then reduce the parameters:

max_iterations: 2
harbor:
  k: 2
  n_concurrent: 4

If the config supports specifying task subsets, use only 3 to 5 tasks. The point of a small experiment is validating the workflow, not chasing scores.

6. Launch the Evolution Experiment

You can use the script:

./scripts/evolve.sh configs/experiments/exp-mini.yaml

Or look inside the script to see how it calls evolve.py, then manually launch as needed.

Full experiments can run for a long time. Even small experiments require attention to API costs, E2B concurrency limits, and network stability.

7. Look at Experiment Artifacts, Not Just Scores

After running, don't just look at pass rate.

What's more worth examining are these artifacts:

runs/iteration_*/
analysis/overview.md
analysis/detail/*.md
change_manifest.json
change_evaluation.json
agent/nexau_in_memory_tracer.cleaned.json
verifier/reward.txt

After running, focus on observing and answering these questions:

  • What patterns were this round's failures attributed to?

  • Which files did Evolve Agent change?

  • Why did it choose to change these files?

  • Which tasks does the manifest predict will be fixed?

  • Did the next round verify this prediction?

  • Were there cases where fixing one task broke another?

If you can find answers to all these questions in the artifacts, it means AHE's core closed loop is working.
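If you want to automate the last few questions, a small checker over a loaded manifest and the next round's per-task rewards could look like this (the field names are my assumptions, not the repo's schema):

```python
# Hypothetical post-hoc check: did the tasks the manifest predicted to fix
# actually pass next round, and did any predicted-risk task regress?
def check_manifest(manifest: dict, next_round_rewards: dict) -> dict:
    """Compare a change manifest's predictions against next-round task rewards."""
    predicted = manifest.get("expected_to_fix", [])
    confirmed = [t for t in predicted if next_round_rewards.get(t, 0) > 0]
    regressed = [t for t in manifest.get("might_regress", [])
                 if next_round_rewards.get(t, 1) == 0]
    return {
        "confirmed": confirmed,
        "missed": [t for t in predicted if t not in confirmed],
        "regressed": regressed,
    }
```

Running this across iterations gives you a rough picture of how well the Evolve Agent's predictions hold up over time.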

Part 9: What AHE Hasn't Solved Yet

AHE is valuable, but its boundaries should be clear too.

First, it's still a research framework. Full runs aren't cheap, requiring benchmarks, sandboxes, LLM APIs, and fairly complex experiment configs.

Second, the effectiveness evidence in the paper needs more replication experiments. The improvement on Terminal-Bench 2 is clear, but for strong statistical conclusions, more seeds, more campaigns, and more confidence intervals are needed.

Third, its prediction of regression risk isn't strong enough. The system is better at explaining what a modification might fix, but not as good at judging what it might harm. This is a hard problem for automatic evolution systems.

Part 10: AHE's Inspiration for Agent Product Teams

AHE's biggest inspiration for product-focused agent teams is pulling agent improvement processes from "mystical prompt tuning" back into the engineering world.

A real agent product will eventually face these questions:

  • After a user reports an error, how do you reproduce it?

  • How do you aggregate failure causes?

  • Did a certain prompt modification actually help?

  • Did a tool change regress other scenarios?

  • Is there regression testing before release?

  • Can you rollback if production performance degrades?

  • How do you distill effective experience into memory or skills?

No single model can solve these problems for you.

They belong to the scope of harness engineering work.

If you're also building your own agent, this repository is worth thoroughly dissecting. Even without running it completely, you can learn a lot about harness organization, trace design, modification attribution, and regression verification engineering methods.


🙋 I'm Luhui Dev, a developer who has been breaking down Agent engineering and exploring how AI can be applied in education.
I focus on Agent Harness, LLM application engineering, AI for Math, and the productization of education SaaS.
