Raphael Porto
Self-improving Coding Agents

In this article, I demonstrate how a more capable coding agent (Codex/GPT-5.4) successfully refined the prompt for a less capable agent (GitHub Copilot CLI/GPT-5-mini) using a custom evaluation tool called Katt. This "Test-Driven Agentic Workflow" (TDAW) increased passing evals from 0/3 to 3/3, but also significantly increased runtime and token usage, demonstrating that an agent can improve a prompt's effectiveness by clarifying instructions and fixing the evaluation process itself.

Why this article now?

I realized I haven't said much publicly about my AI research on coding agents and on extracting accurate outputs, even from low-capability models. I've spent the last 3 years experimenting with ML and AI, and 1.5 years with coding agents, using my engineering skills to build different harness strategies on top of them. With this latest experiment, I felt it was time to be more outspoken and share some of my learnings and my journey with AI.

So here comes my first article in many years. Let's start by talking about Katt.

Katt: A benchmark/evals tool for testing coding agents

Recently, I decided to create a simple CLI tool, inspired by unit testing libraries like Jest and Vitest, that runs on top of coding agents like Claude Code, Codex, and GitHub Copilot. It aims for a more deterministic way of working with automated/autonomous coding workflows while keeping a syntax very similar to the unit testing tools we are familiar with.

It can do a lot in terms of benchmarking and evaluation harnessing. But my favorite use is handing the tool to an agent tasked with improving another coding agent, creating a tight loop that drives toward the most efficient and accurate generation.
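The article doesn't show Katt's actual API, so here is a purely hypothetical sketch of what a Jest-style eval harness around a coding agent could look like; `evalTest` and `runAgent` are invented names for illustration only:

```typescript
// Hypothetical sketch only: Katt's real API is not shown in the article.
// `evalTest` and `runAgent` are invented names for illustration.

type AgentRunner = (prompt: string) => Promise<string>;

// Stand-in for invoking a coding agent CLI; a real harness would shell
// out to something like `copilot` or `codex` and capture stdout.
const runAgent: AgentRunner = async (prompt) =>
  JSON.stringify({ issues: [] }); // fake deterministic output

// A Jest-like eval: run the agent N times and report pass/fail per run.
async function evalTest(
  name: string,
  prompt: string,
  check: (output: string) => boolean,
  runs = 3,
): Promise<{ name: string; passed: number; total: number }> {
  let passed = 0;
  for (let i = 0; i < runs; i++) {
    const output = await runAgent(prompt);
    if (check(output)) passed++;
  }
  return { name, passed, total: runs };
}

// Usage: expect valid JSON with an `issues` array.
evalTest("lists issues as JSON", "List the 5 most recent issues.", (out) => {
  try {
    return Array.isArray(JSON.parse(out).issues);
  } catch {
    return false;
  }
}).then((r) => console.log(`${r.name}: ${r.passed}/${r.total}`));
```

The key idea is that the agent invocation is wrapped in the same pass/fail shape a unit test would produce, which makes repeated runs comparable.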

So let's break down this experiment.

The experiment

The main idea was simple: Can a coding agent improve the effectiveness of a prompt running on a less capable model?

To set this up, Codex was given access to a pre-existing test that ran gpt-5-mini with a simple prompt. The test expected a valid output, verified against a snapshot, much like the familiar toMatchSnapshot() from Jest.
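To illustrate what snapshot-style verification means here (this is a sketch of the concept, not Katt's actual implementation), a minimal version normalizes the agent's output and compares it against a stored snapshot string:

```typescript
// Minimal snapshot-style check, in the spirit of Jest's toMatchSnapshot().
// This is an illustration of the concept, not Katt's actual implementation.

// Normalize whitespace so superficial formatting differences don't fail the eval.
function normalize(output: string): string {
  return output.trim().replace(/\s+/g, " ");
}

// Compare agent output against a previously saved snapshot.
function matchesSnapshot(output: string, snapshot: string): boolean {
  return normalize(output) === normalize(snapshot);
}

const snapshot = `{"issues": [{"id": 1, "title": "First issue"}]}`;

// A conversational answer fails even if it contains the right data,
// which is exactly the failure mode the baseline prompt hit.
matchesSnapshot("Sure! Here are the issues: {...}", snapshot); // false
```

Note the brittleness this implies: if the real data changes under the snapshot, the test fails even when the agent is right, which is exactly what the experiment later uncovers.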

Tooling used

  • Codex (gpt-5.4, reasoning xhigh) is the improver.
  • GitHub Copilot CLI (gpt-5-mini) is the agent under test.
  • GitHub MCP tools provide live repository issues.
  • Katt runs the eval three times and records pass/fail, time, and token usage.

The initial prompt for GPT-5-mini

Main goal: Using the Copilot MCP tools, list the 5 most recent issues in this repo.

The loop

  1. Codex follows the self-improvement instructions.
  2. Runs the Katt eval.
  3. Reads the failure.
  4. Makes one focused change.
  5. Commits it.
  6. Runs the eval again.
  7. Keeps what helps. Fixes what does not.
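The loop above can be sketched as a simple driver. Everything here is a hypothetical stand-in for what Codex actually did via the Katt CLI and git (`runEval`, `proposeChange`, and `commit` are invented names, with a fake eval that passes once enough fixes land):

```typescript
// Sketch of the improvement loop. All functions are hypothetical stand-ins
// for what Codex actually did via the Katt CLI and git.

interface EvalResult {
  passed: number;
  total: number;
  failureSummary: string;
}

// Fake eval that starts failing and passes once enough fixes land.
let fixesApplied = 0;
const runEval = (): EvalResult =>
  fixesApplied >= 3
    ? { passed: 3, total: 3, failureSummary: "" }
    : { passed: 0, total: 3, failureSummary: "output did not match schema" };

// One focused change per iteration, then re-run the eval.
const proposeChange = (failure: string): string => `fix: ${failure}`;
const commit = (message: string): void => { fixesApplied++; };

function improve(maxIterations: number): EvalResult {
  let result = runEval();
  for (let i = 0; i < maxIterations && result.passed < result.total; i++) {
    const change = proposeChange(result.failureSummary); // 4. one focused change
    commit(change);                                      // 5. commit it
    result = runEval();                                  // 6. run again
  }
  return result; // 7. keep what helps
}
```

Calling `improve(10)` on this mock converges to a passing eval after three iterations; the real loop is the same shape, just with an LLM proposing the change and Katt scoring it.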

What GPT-5.4 changed along the way

The interesting part is that Codex did not just keep piling on more prompt text. It discovered that there were two different problems:

  • The tested agent needed clearer instructions.
  • The evaluator itself was brittle because it compared live GitHub data against stale snapshots.

Here is the breakdown of changes:

| Commit | What changed | Why it mattered |
| --- | --- | --- |
| bc096d4 | Baseline experiment setup | A minimal prompt created a clean starting point. |
| df9761c | Forced exact JSON output, exact fields, exact count, and no extra text | This attacked the first failure mode: Copilot answered conversationally instead of matching the expected schema. |
| 5ca4ce4 | Added explicit fetch depth, sorting, and tie-breaking rules | This reduced ranking mistakes when issues had very similar timestamps. |
| 422c325 | Switched the eval from stale snapshots to live GitHub issue validation and pinned the repo source | This was the big insight: sometimes the test is wrong, not just the agent. |
| 93df064 | Forced live MCP issue data, blocked local-file shortcuts, and required unlabeled issues to be included | This stopped the model from falling back to stale examples inside the repo. |
| fca11b8 | Saved the before/after runs and Codex reasoning log | This preserved the experiment as evidence instead of just a final state. |

The final prompt

Main goal: Using the Copilot MCP tools, list the 5 most recent issues in this repo.

Instructions:
- Use the Copilot MCP tools to read issues from the GitHub repository declared by this project: `nentgroup/self-improving-agentic-workflow`.
- Use the live Copilot MCP issue data as the only source of truth for the issue list.
- Do not use local files in this repository, snapshots, previous results, or examples in this prompt as the source of issue data.
- List issues only. Exclude pull requests.
- Do not filter by labels. Include unlabeled issues if they are among the 5 most recent.
- Fetch enough issues to determine the top 5 reliably. Do not rely on the tool's default ordering alone.
- Sort the fetched issues yourself using this exact rule:
  1. Newer `created_at` first.
  2. If `created_at` is identical, higher issue number first.
- Sort by `created_at` descending. If two issues have the same `created_at`, put the higher issue number first.
- Return exactly 5 issues.
- Return exactly one valid JSON object and nothing else. Do not use Markdown. Do not use code fences. Do not add explanations.
- The JSON must match this shape exactly:
{
  "issues": [
    {
      "id": 123,
      "title": "Issue title",
      "labels": ["label-1", "label-2"],
      "created_at": "2026-03-25T21:51:52Z"
    }
  ]
}
- Use the GitHub issue number as `id`.
- `labels` must be an array of label names only.
- `created_at` must be the ISO 8601 creation timestamp for the issue.
- If the MCP tool returns no issues or an obviously incomplete result, retry before answering. Do not return an empty `issues` array unless the repository truly has no issues.
- Before answering, verify the response is valid JSON, contains exactly 5 issue objects, includes only the keys `id`, `title`, `labels`, and `created_at`, and that no omitted live issue should come before the fifth issue under the sorting rule above.
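The final prompt's verification step (valid JSON, exactly 5 issues, only the four allowed keys, correct sort order) is a deterministic contract, so it can be checked in code. Here is an illustrative validator; the field names come from the prompt's schema, but the function itself is my own sketch, not part of Katt:

```typescript
// Illustrative validator for the output contract in the final prompt:
// exactly 5 issues, only the keys id/title/labels/created_at, sorted by
// created_at descending with higher issue number breaking ties.

interface Issue {
  id: number;
  title: string;
  labels: string[];
  created_at: string;
}

// The tie-breaking comparator from the prompt's sorting rule.
function compareIssues(a: Issue, b: Issue): number {
  if (a.created_at !== b.created_at) {
    return a.created_at < b.created_at ? 1 : -1; // newer first (ISO 8601 sorts lexically)
  }
  return b.id - a.id; // same timestamp: higher issue number first
}

function validateOutput(raw: string): boolean {
  let parsed: { issues?: unknown };
  try {
    parsed = JSON.parse(raw);
  } catch {
    return false; // not valid JSON at all
  }
  const issues = parsed.issues;
  if (!Array.isArray(issues) || issues.length !== 5) return false;

  // Every issue object must contain exactly the allowed keys.
  const allowedKeys = ["created_at", "id", "labels", "title"].join(",");
  for (const issue of issues) {
    if (Object.keys(issue).sort().join(",") !== allowedKeys) return false;
  }

  // The list must already be in the prompt's sort order.
  const sorted = [...(issues as Issue[])].sort(compareIssues);
  return issues.every((issue, i) => issue === sorted[i]);
}
```

Checks like these are what let the eval move away from stale snapshots: the contract is verified structurally instead of by string comparison against old data.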

Before vs After

| Metric | Before | After | Change |
| --- | --- | --- | --- |
| Passing evals | 0 / 3 | 3 / 3 | +3 |
| Total runtime | 286,552 ms | 364,460 ms | +27% |
| Average runtime per eval | 95.5 s | 121.5 s | +26.0 s |
| Total tokens used | 329,625 | 893,151 | +171% |
| Average tokens per eval | 109,875 | 297,717 | +171% |

Bonus stat: Codex spent about 58k tokens and 18m 48s to reach the final passing setup.

The result

Codex was able to generate a prompt that made GPT-5-mini use an MCP tool to retrieve information and return it in the expected output format, while staying accurate and avoiding hallucinations along the way. Another interesting touch was instructing the agent to retry whenever the tool returned an empty or obviously incomplete result.

It shows that it is possible to create a harness around agents that is very similar to TDD, but in this case, TDAW (Test-Driven Agentic Workflows).

Check out the details and the repo used in this project:
https://github.com/raphaelpor/self-improving-agentic-workflow

What's next?

I will continue to evaluate self-improving agents. As with every experiment I do, I don’t know if they will really bring value to the real world. But my job is to scale them just a bit in every step and evaluate the current limits.

See you in the next one.
