Pramoda Sahu

Posted on Jun 19

Designing a Self-Prompting Agent Harness with Per-Task Prompt, Tool, and Strategy Synthesis

#agents #ai #llm #showdev

Most agent stacks have matured in roughly the same direction: we version the code, test the tools, constrain the runtime, and instrument the loop. But one part of the system still often lives as an unversioned artifact copied between docs, chats, and notebooks: the prompt.

That mismatch gets harder to ignore once you start treating the agent harness as the real product. If the harness is what determines reliability, cost, safety, and task success, it is strange that the prompt is often the least engineered part of the stack.

That question led me to build SynthAgent: a small framework that generates a task-specific prompt, tool plan, and runtime strategy at task time instead of relying on one fixed prompt and one fixed loop for every task.

This post is best read as an architecture exploration, not a benchmark report. I have not yet run the A/B test that would justify a strong performance claim over a fixed-prompt baseline. What I do have is a working harness, a clear design thesis, and a set of implementation lessons that were useful enough to write down.

I’ll walk through the architecture, show what the synthesized artifacts actually look like, explain the tradeoffs behind the design, and point out where the current version is still weak.

The Core Idea: Generate the Harness at Task Time

Here’s the one-line version of the project:

SynthAgent takes (task, success_criteria) and, instead of running a fixed prompt through a fixed loop, generates a custom Prompt P, Tool Plan T, and Strategy S for that specific task. It then runs them through a Plan-Execute-Verify loop, scores the result, reflects on the failure, and tries again with revised components.

That is the core thesis: prompts should not be static artifacts when the rest of the harness is dynamic.

This setup is inspired by recent work on agent harnesses and automated agent design. In particular, it resembles the plan-execute-verify style control loop discussed in Code as Agent Harness (arXiv:2605.18747) and the meta-level search perspective in ADAS / Meta Agent Search (Hu et al., ICLR 2025). The difference is scope: instead of trying to invent entirely new agents, SynthAgent tries to invent a per-task harness.

SynthAgent Architecture

At a high level, the system looks like this:

                            ┌──────────────────────────┐
   TASK (free-form)  ───►   │  META-PROMPT GENERATOR   │   ◄── the only
   + success criteria       │  (MPG)                   │       hand-written
                            └────────────┬─────────────┘       instruction
                                         │ synthesizes         (about how
                                         ▼                     to write
                            ┌──────────────────────────┐       prompts,
                            │  TASK PROMPT P           │       not the
                            │  + TOOL PLAN T           │       task itself)
                            │  + STRATEGY S            │
                            └────────────┬─────────────┘
                                         │ runs inside
                                         ▼
        ┌──────────────────────────────────────────────────────────────┐
        │                       THE PEV LOOP                           │
        │                                                              │
        │    ┌─────────┐    plan     ┌─────────┐   tool calls   ┌─────┐ │
        │    │ PLANNER │ ──────────► │EXECUTOR │ ─────────────► │ENV  │ │
        │    └────┬────┘             └────┬────┘                └──┬──┘ │
        │         ▲                       │ observations           │    │
        │         │ replan                ▼                        │    │
        │         │                 ┌─────────┐                    │    │
        │         └──────────────── │MEMORY / │ ◄──────────────────┘    │
        │                           │STATE    │                          │
        │                           └────┬────┘                          │
        │                                │ trajectory                    │
        │                                ▼                               │
        │                         ┌─────────────┐                        │
        │                         │  VERIFIER V │                        │
        │                         └──────┬──────┘                        │
        │                                │ score + critique              │
        │                                ▼                               │
        │                         ┌─────────────┐                        │
        │                         │ REFLECTOR R │ ──► revise P/T/S       │
        │                         └─────────────┘                        │
        └──────────────────────────────────────────────────────────────┘
                                         │
                            loop until V=Pass or budget exhausted

There are six components, each in its own file and each intentionally small. The point is not abstraction for its own sake. It is to make the harness legible to a human debugger so that when something fails, you can inspect the trajectory, the prompt, the plan, and the verifier output directly.

A Concrete Example of Synthesized `P/T/S`

Before going component by component, it helps to see the core artifact.

Here is a trimmed, representative example of the JSON object the Meta-Prompt Generator emits for a simple task like:

Task: Summarize the repository's Python modules and write the summary to SUMMARY.md.

Success criteria: Cover each top-level .py file once, do not invent files, and produce a concise markdown summary.

{
  "prompt_p": "You are the planner for a repository-analysis task. First inspect the directory, then read each relevant Python file, extract its purpose conservatively, and only summarize files you actually observed. Before finishing, verify that every top-level .py file has been covered exactly once.",
  "tool_plan_t": [
    "list_directory",
    "read_file",
    "write_file"
  ],
  "strategy_s": {
    "mode": "plan_execute",
    "notes": [
      "Enumerate files before summarizing",
      "Do not infer missing modules",
      "Validate file coverage before final write"
    ]
  }
}

That example is not meant as benchmark evidence. It is there to make the mechanism concrete: the harness is generating a task-specific instruction, an intended tool sequence, and a strategy hint for the runtime.

A representative failure-and-retry sequence for that kind of task looks like this:

Attempt 1 lists the directory and reads several files, but misses one module.
The Verifier fails the run because the output does not satisfy the “cover each top-level .py file once” criterion.
The Reflector revises prompt_p to explicitly require a checklist of discovered files before writing.
Attempt 2 re-runs with the stricter instruction, covers the missing file, and passes.

That does not prove per-task synthesis is better than a fixed prompt. It does show the shape of the feedback loop and the kind of failure it is designed to repair.

The Meta-Prompt Generator Is the Only Hand-Written Prompt

`mpg.py`

The Meta-Prompt Generator (MPG) is the only place in the system that contains a human-authored prompt.

That prompt is not a task instruction. It is a meta-prompt: an instruction for how to write task-specific prompts. Given a task description, success criteria, an available tool catalog, and any prior lessons retrieved from memory, the MPG emits a JSON object with three fields:

prompt_p — a detailed task-specific instruction for the Planner
tool_plan_t — which tools are relevant and the intended order of use
strategy_s — which loop pattern to prefer, such as ReAct, Plan-and-Execute, or Decompose-and-Solve
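Because those three fields are the contract between the MPG and everything downstream, it can help to validate them before the loop starts. The following is a minimal sketch, not the repository's code: `HarnessSpec`, `parse_harness_spec`, and `TOOL_CATALOG` are hypothetical names, and the catalog simply lists the tool names mentioned in this post.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, List

# Tool names from the post; the set itself is a stand-in for the real registry.
TOOL_CATALOG = {"tavily_search", "read_file", "write_file",
                "list_directory", "execute_python"}


@dataclass
class HarnessSpec:
    """Parsed form of the MPG output (field names as used in the post)."""
    prompt_p: str
    tool_plan_t: List[str]
    strategy_s: Dict[str, Any]


def parse_harness_spec(raw: str) -> HarnessSpec:
    """Parse MPG output and reject missing fields or tools outside the catalog."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("MPG output must be a JSON object")
    missing = {"prompt_p", "tool_plan_t", "strategy_s"} - set(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    unknown = [name for name in data["tool_plan_t"] if name not in TOOL_CATALOG]
    if unknown:
        raise ValueError(f"tools outside the catalog: {unknown}")
    return HarnessSpec(data["prompt_p"], list(data["tool_plan_t"]), data["strategy_s"])
```

Rejecting an unknown tool name at this point fails fast, before a synthesized plan can reach the Executor.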

This recursive structure is deliberate: the meta-prompt does not solve the task itself, it generates instructions for how to solve it. That is part of what attracted me to the approach described in Meta Prompting for AI Systems (Zhang et al.), which also discusses recursive meta-prompting.

I chose this route over prompt optimization frameworks like DSPy for one reason: I wanted the harness behavior to stay inspectable. If a run goes wrong, I want to read the synthesized prompt and the execution trace. I do not want the optimization process hidden behind a compiled graph or framework abstraction that makes the final behavior harder to audit.

That transparency shaped most of the rest of the system too.

The PEV Loop Turns Prompt Synthesis Into Runtime Behavior

`harness.py`

Once the MPG emits P, T, and S, those artifacts are fed into the inner Plan-Execute-Verify loop.

The loop works like this:

Planner decides the next action given the current prompt, tool plan, strategy, and trajectory.
Executor calls the selected tool.
Memory persists the step immediately.
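The cycle repeats until the Planner returns finish or the per-attempt step limit is hit. The Verifier then scores the trajectory, and a failed attempt triggers reflection and a retry until the attempt budget runs out.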

This is the operational center of the system. A synthesized prompt is only interesting if it changes downstream behavior in a controlled way. The PEV loop is what gives that prompt something to steer.
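The inner loop can be sketched roughly as follows. This is an illustration under assumptions, not the code in harness.py: `run_attempt` is a hypothetical name, the planner is any callable that already has P, T, and S bound into it, and the decision shape (`action`, `args`, `output`) is assumed.

```python
from typing import Any, Callable, Dict, List

Step = Dict[str, Any]


def run_attempt(
    planner: Callable[[List[Step]], Dict[str, Any]],
    tools: Dict[str, Callable[..., Any]],
    persist: Callable[[List[Step]], None],
    max_steps: int = 25,
) -> List[Step]:
    """Run one attempt: plan, execute, record, until finish or the step cap."""
    trajectory: List[Step] = []
    for step_no in range(1, max_steps + 1):
        decision = planner(trajectory)  # the Planner sees the trajectory so far
        action = decision.get("action")
        if action == "finish":
            trajectory.append({"step": step_no, "action": "finish",
                               "output": decision.get("output")})
            persist(trajectory)
            return trajectory
        tool = tools.get(action)
        args = decision.get("args", {})
        try:
            observation = tool(**args) if tool else "error: unknown tool"
        except Exception as exc:  # surface tool errors as observations
            observation = f"error: {exc}"
        trajectory.append({"step": step_no, "action": action,
                           "args": args, "observation": observation})
        persist(trajectory)  # Memory persists every step immediately
    # The step cap was reached without a finish action.
    trajectory.append({"step": max_steps + 1, "action": "aborted",
                       "reason": "step limit"})
    persist(trajectory)
    return trajectory
```

Passing the planner, tools, and persistence in as callables keeps the loop itself small and makes it easy to drive with a scripted planner in tests.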

Four implementation details mattered more than I expected.

Structured planner output matters more than planner creativity

Where provider support allows it, the Planner output is requested as a JSON object using a response format like:

{ "type": "json_object" }

That is less glamorous than tuning reasoning quality, but it matters more. In an agent harness, the Planner is not writing prose for a human reader. It is producing a decision interface that the Executor must parse reliably.

In practice, this is not uniformly standardized across all OpenAI-compatible providers. Tool calling, response_format, streaming, and provider-specific parameters still vary. For the subset of plain chat-completion behavior used in this project, though, structured JSON output was stable enough to be worth enforcing where supported and recovering heuristically where it was not.

The system therefore prioritizes structure over style. Reliability beats expressiveness here.
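One way to keep JSON mode optional per provider is to build the request body in a single place. A minimal sketch, assuming a hypothetical `build_chat_payload` helper and a caller that knows whether the target provider accepts `response_format`:

```python
from typing import Any, Dict, List


def build_chat_payload(model: str, messages: List[Dict[str, str]],
                       json_mode: bool = True) -> Dict[str, Any]:
    """Build a /v1/chat/completions request body, optionally asking for JSON."""
    payload: Dict[str, Any] = {"model": model, "messages": messages}
    if json_mode:
        # Send this only to providers known to accept it; others may reject it.
        payload["response_format"] = {"type": "json_object"}
    return payload
```

On OpenAI's API, JSON mode also expects the messages themselves to ask for JSON, so the planner prompt should say so explicitly. For providers without the field, the heuristic recovery path still applies.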

Tool argument normalization prevents embarrassingly common failures

The Executor is intentionally thin, but it does one very practical thing: it resolves argument-name aliases like:

filepath
file_path
filename

Models get these wrong all the time, even when the tool schema is explicit. Rather than pretending this will not happen, the harness normalizes common variants before dispatch. That small tolerance layer ended up being more useful than a more elaborate execution abstraction.
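A normalization layer like this can be very small. The sketch below assumes a canonical argument name of `path`, which is a guess for illustration; `ARG_ALIASES` and `normalize_args` are hypothetical names.

```python
from typing import Any, Dict

# Variants listed in the post; the canonical name "path" is an assumption.
ARG_ALIASES = {"filepath": "path", "file_path": "path", "filename": "path"}


def normalize_args(args: Dict[str, Any],
                   aliases: Dict[str, str] = ARG_ALIASES) -> Dict[str, Any]:
    """Rename known alias keys to their canonical name before dispatch."""
    normalized: Dict[str, Any] = {}
    for key, value in args.items():
        canonical = aliases.get(key, key)
        # If the model sent both an alias and the canonical name, keep the first.
        normalized.setdefault(canonical, value)
    return normalized
```

For example, `normalize_args({"file_path": "a.py", "content": "x"})` yields a dict keyed by `path` and `content`, so the tool function only ever sees one spelling.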

Hard step limits are non-negotiable

The default is:

max_steps_per_attempt = 25

This is enforced unconditionally in the current design.

The Planner can return action: "finish" when it believes the task is done, but if it gets stuck in a loop, the harness terminates the attempt. Self-prompting agents are not immune to looping. If anything, giving them the ability to rewrite their own instructions can make unmanaged loops more likely.

A hard cap is therefore part of the product, not just a debugging safeguard.

Persisting every step to disk changes failure recovery

After every step, the trajectory is written to disk at:

agent_memory/runs/<uuid>.json

That persistence layer matters for more than observability. If the process crashes mid-task, the run history still exists. That history becomes the source of truth for verification, reflection, debugging, and future lessons.

In other words, the trace is not just logging. It is part of the system state.
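If the trace is part of the system state, a half-written file is a real failure mode. One common way to guard against it, shown here as a sketch rather than as what memory.py does, is to write to a temporary file and rename it into place. `save_trajectory` and the JSON layout are assumptions.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def save_trajectory(run_id: str, trajectory: List[Dict[str, Any]],
                    root: str = "agent_memory/runs") -> Path:
    """Rewrite <root>/<run_id>.json atomically after each step."""
    runs_dir = Path(root)
    runs_dir.mkdir(parents=True, exist_ok=True)
    target = runs_dir / f"{run_id}.json"
    # Write to a temp file in the same directory, then rename over the target.
    fd, tmp_name = tempfile.mkstemp(dir=str(runs_dir), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump({"run_id": run_id, "trajectory": trajectory}, handle, indent=2)
    os.replace(tmp_name, target)  # atomic when source and target share a filesystem
    return target
```

A caller would typically generate the id once per run, for example with `str(uuid.uuid4())`, and call this after every step so the file on disk always holds a complete trajectory.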

How Much Does `strategy_s` Change Runtime Behavior Today?

One gap in many “dynamic strategy” writeups is that the strategy field sounds more powerful than it really is.

That is worth being explicit about here.

Today, strategy_s is partly operative and partly advisory.

It is operative in the sense that it changes the planner context and nudges the loop toward a different decomposition style.
It is not yet a fully separate runtime policy engine with deeply different execution branches for each strategy family.

So if the MPG emits something like “ReAct” versus “Plan-and-Execute,” the current implementation mostly changes how the Planner is instructed to proceed, not an entirely different harness implementation under the hood.

That still matters, but it is narrower than “the runtime swaps in a wholly different agent architecture.” If I extend the project, strategy branching is one of the first places I would make more explicit.

The Verifier Is the Weakest Part of the Current Design

`verifier.py`

Right now, the Verifier is LLM-as-judge.

It evaluates the final output against the provided success criteria and emits:

a score from 0.0 to 1.0
a critique
suggested fixes

The current pass threshold is:

0.85

This works well enough to support iterative refinement, but it is also the most obvious weakness in the architecture.

When the agent and the verifier come from the same model family, the system can end up hill-climbing on the verifier's blind spots. A prompt can get “better” according to the judge while the actual task result stays wrong in ways the judge fails to detect.

One mitigation in SynthAgent is that the Verifier gets the full trajectory, not just the final answer. That gives it a better chance of spotting failure modes like:

sloppy execution after a good initial plan
bad tool usage followed by confident synthesis
premature commitment to an answer without validation

But that is still a mitigation, not a solution.

The real fix is to support deterministic, pluggable verifiers wherever the task class allows it. For code, that might be pytest. For SQL, it might be execution against a test database. For structured output, it might be jsonschema.validate. In some environments, the environment itself can serve as the oracle.

That lesson also shows up in recent harness work. The AutoHarness authors report that a smaller model with a stronger synthesized harness can outperform a larger model in constrained game-like environments because the environment itself supplies a deterministic feedback signal. That is the part I find most important—not a specific leaderboard result, but the fact that the verifier is external, cheap, and hard to game.

Design your verifier first. Everything else is downstream.

That is the biggest architectural lesson in the repo.

Reflection Works Best at the Prompt Level, Not the Response Level

`reflector.py`

When verification fails, the Reflector (R) is invoked.

It receives:

the original task
the current P/T/S
the full trajectory
the verifier score
the verifier critique

Its job is to identify the likely failure mode and revise the harness accordingly.
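Put together with the Verifier, the outer attempt loop has roughly this shape. It is a sketch under assumptions: `solve` is a hypothetical name, the attempt budget of 3 is illustrative, the callables stand in for the real components, and treating a score equal to the threshold as a pass is a guess.

```python
from typing import Any, Callable, Dict, List, Tuple

PASS_THRESHOLD = 0.85  # pass threshold quoted in the post

Trajectory = List[Dict[str, Any]]
Spec = Dict[str, Any]


def solve(
    spec: Spec,
    run_attempt: Callable[[Spec], Trajectory],
    verify: Callable[[Trajectory], Tuple[float, str]],
    reflect: Callable[[Spec, Trajectory, float, str], Spec],
    max_attempts: int = 3,  # assumed budget, not taken from the repo
) -> Dict[str, Any]:
    """Attempt, verify, and on failure let the Reflector revise the spec."""
    history = []
    for attempt in range(1, max_attempts + 1):
        trajectory = run_attempt(spec)
        score, critique = verify(trajectory)
        history.append({"attempt": attempt, "score": score, "critique": critique})
        if score >= PASS_THRESHOLD:
            return {"passed": True, "spec": spec, "history": history}
        # The Reflector sees the trajectory and critique, not just the score.
        spec = reflect(spec, trajectory, score, critique)
    return {"passed": False, "spec": spec, "history": history}
```

Keeping the per-attempt history alongside the final spec makes it possible to see later which revision moved the score.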

That raises an important design question: what exactly should reflection operate on?

There are three obvious levels:

Response-level reflection — edit the final answer
Prompt-level reflection — revise the task prompt P
Harness-level reflection — re-run synthesis and regenerate P/T/S from scratch

These are not equivalent.

Response-level reflection is usually too shallow

Editing the answer after the fact is the weakest form of reflection. For example, if the agent wrote a flawed final summary, response-level reflection would just rewrite that summary.

It can improve phrasing or patch a local omission, but it does not fix the process that produced the failure. This is close to the Self-Refine pattern, and for a system like SynthAgent it is not where the real leverage is.

Prompt-level reflection is the current sweet spot

Right now, SynthAgent revises the prompt-level artifacts. In the repository-summary example above, that might mean changing the Planner instruction from “summarize the repo” to “enumerate files first, maintain a checklist, and do not finish until every discovered .py file has been covered.”

That is a useful middle ground because it lets the system improve behavior without paying the full cost of fresh synthesis every time.

Full re-synthesis is stronger, but more expensive and riskier

The strongest option would be to re-call the MPG and regenerate all of P/T/S from scratch. In the same example, that might swap not just the prompt wording but the tool sequence and decomposition strategy too.

That is the next experiment I would try, but not the current default.

The reason is practical: full re-synthesis is more expensive, and in early testing it often overfit to the most recent failure rather than learning a stable improvement. Prompt-level revisions turned out to be the better default tradeoff.

Memory Stores Both Raw Runs and Reusable Lessons

`memory.py`

The memory layer has two stores:

Runs — runs/<uuid>.json
Lessons — lessons.json

The runs store captures the full trajectory, prompt/tool/strategy history, verifier scores, and reflections for each attempt. It is a forensic record of what happened.

The lessons store is different. It contains distilled takeaways from completed runs and indexes them for retrieval using embeddings. When a new task arrives, the MPG gets the top-5 most similar lessons as additional context.

By default, the embedding model is NVIDIA's:

nv-embed-v1

But the implementation is flexible enough to work with other OpenAI-style embedding endpoints too.

This is one area where I am still not convinced the architecture is earning its complexity. For a small lesson corpus, cosine similarity over lesson text is doing a lot of work. It may be that a simpler baseline like “last 5 lessons” performs just as well. I would not claim embedding-based retrieval here is a decisive win without a proper A/B test.
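Both retrieval policies are short enough to compare side by side. The sketch below assumes lessons are stored oldest-first as `(text, embedding)` pairs, which is a simplification for illustration; the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Lesson = Tuple[str, Sequence[float]]  # (lesson text, embedding vector)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_lessons(query_vec: Sequence[float], lessons: List[Lesson],
                  k: int = 5) -> List[str]:
    """Embedding-based retrieval: the k lessons most similar to the query."""
    ranked = sorted(lessons, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def last_k_lessons(lessons: List[Lesson], k: int = 5) -> List[str]:
    """Recency baseline: the k most recently appended lessons."""
    return [text for text, _ in lessons[-k:]] if k > 0 else []
```

Because both functions return a plain list of lesson texts, swapping one for the other in an A/B test would not require touching the MPG.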

That uncertainty matters. Not every component that sounds architecturally elegant turns out to matter in practice.

The Tool Surface Defines the Agent's Legal Action Space

`tools.py`

SynthAgent ships with five tools:

tavily_search — web search via Tavily
read_file
write_file
list_directory
execute_python — sandboxed Python via subprocess with a 30-second timeout

These tools are deliberately boring.

That is by design. The aim was not to build a flashy tool ecosystem. It was to make the action space constrained, legible, and safe enough to reason about.

A few implementation details matter:

file tools enforce a project-root sandbox
execute_python runs in a subprocess
the subprocess has a hard 30s timeout
stdout and stderr are returned
sandboxed Python has no network access by default
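The first three guarantees can be illustrated with the standard library alone. This is a sketch, not the code in tools.py: `resolve_in_root` and the return shape of `execute_python` are assumptions, and nothing here restricts network access, which would need an OS-level mechanism beyond a plain subprocess.

```python
import subprocess
import sys
from pathlib import Path
from typing import Dict, Union


def resolve_in_root(root: str, user_path: str) -> Path:
    """Resolve user_path under root and refuse anything that escapes it."""
    base = Path(root).resolve()
    candidate = (base / user_path).resolve()
    if candidate != base and base not in candidate.parents:
        raise PermissionError(f"path escapes project root: {user_path}")
    return candidate


def execute_python(code: str, timeout_s: int = 30) -> Dict[str, Union[bool, str]]:
    """Run code in a child interpreter with a hard timeout; return stdout/stderr."""
    try:
        proc = subprocess.run([sys.executable, "-c", code],
                              capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return {"ok": False, "stdout": "", "stderr": f"timed out after {timeout_s}s"}
    return {"ok": proc.returncode == 0, "stdout": proc.stdout, "stderr": proc.stderr}
```

Resolving the path before checking it means `..` segments and absolute paths are both caught by the same comparison.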

This is where the harness perspective becomes concrete: the tool surface is the legal action space. If a tool is not in the registry, the agent cannot call it. That is not just a convenience; it is a safety boundary.

It also prevents a whole class of failures that appear in less constrained frameworks, where the model “discovers” capabilities the runtime should never have exposed in the first place.
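The "legal action space" idea reduces to a lookup that refuses anything unregistered. A minimal sketch with a hypothetical `ToolRegistry` class:

```python
from typing import Any, Callable, Dict, List


class ToolRegistry:
    """Maps tool names to callables; unregistered names cannot be dispatched."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def names(self) -> List[str]:
        return sorted(self._tools)

    def dispatch(self, name: str, **kwargs: Any) -> Any:
        # The registry is the only path to execution, so this check is the boundary.
        if name not in self._tools:
            raise KeyError(f"tool not in registry: {name!r}; allowed: {self.names()}")
        return self._tools[name](**kwargs)
```

Including the allowed names in the error gives the Planner something concrete to correct on its next step.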

Model Selection Uses Any OpenAI-Compatible Endpoint

`llm.py`

The LLM client began with NVIDIA NIM hardcoded, but that quickly felt too limiting. I wanted to test:

Groq for latency
Ollama for privacy-sensitive local runs
OpenRouter for model flexibility
standard OpenAI-compatible providers without rewriting client logic

So llm.py became a thin wrapper around any OpenAI-compatible /v1/chat/completions endpoint.

Here is the basic configuration shape:

# OpenAI
LLM_BASE_URL=https://api.openai.com/v1
LLM_MODEL=gpt-4o-mini

# Groq
LLM_BASE_URL=https://api.groq.com/openai/v1
LLM_MODEL=llama-3.3-70b-versatile

# OpenRouter
LLM_BASE_URL=https://openrouter.ai/api/v1
LLM_MODEL=anthropic/claude-3.5-sonnet

# Local Ollama
LLM_BASE_URL=http://localhost:11434/v1
LLM_MODEL=llama3.1
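Reading those two settings into a small config object might look like the sketch below. `LLMConfig`, `load_llm_config`, and the `LLM_API_KEY` variable are hypothetical; only `LLM_BASE_URL` and `LLM_MODEL` come from the post.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LLMConfig:
    base_url: str
    model: str
    api_key: Optional[str] = None

    @property
    def chat_url(self) -> str:
        """Chat-completions URL derived from the base URL."""
        return self.base_url.rstrip("/") + "/chat/completions"


def load_llm_config(env: Mapping[str, str] = os.environ) -> LLMConfig:
    """Build the config from environment-style settings, failing loudly if unset."""
    try:
        # LLM_API_KEY is an assumed variable name; local servers may not need one.
        return LLMConfig(env["LLM_BASE_URL"], env["LLM_MODEL"], env.get("LLM_API_KEY"))
    except KeyError as exc:
        raise RuntimeError(f"missing required setting: {exc.args[0]}") from None
```

Taking the environment as a parameter makes the loader easy to exercise with a plain dict.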

In practice, two things stood out about provider differences: one quirk that needed special handling, and one area where the differences turned out smaller than expected.

NVIDIA embedding requests need `input_type`

NVIDIA's embedding endpoint expects an input_type field. Other providers reject that same field. The client detects the NVIDIA case and only sends it there.
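The conditional can live in the payload builder. How the real client detects NVIDIA is not described here, so the hostname check below is an assumed heuristic, and `"query"` is used as an illustrative `input_type` value; both function names are hypothetical.

```python
from typing import Any, Dict, List
from urllib.parse import urlparse


def is_nvidia_endpoint(base_url: str) -> bool:
    """Assumed heuristic: treat hosts under nvidia.com as NVIDIA endpoints."""
    host = urlparse(base_url).hostname or ""
    return host == "nvidia.com" or host.endswith(".nvidia.com")


def build_embedding_payload(base_url: str, model: str, texts: List[str],
                            input_type: str = "query") -> Dict[str, Any]:
    """Build an embeddings request body, adding input_type only for NVIDIA."""
    payload: Dict[str, Any] = {"model": model, "input": texts}
    if is_nvidia_endpoint(base_url):
        payload["input_type"] = input_type  # other providers may reject this field
    return payload
```

Matching on the parsed hostname rather than a substring of the URL avoids false positives from lookalike domains.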

Chat completions are similar enough for this project's core flow

The more encouraging result is that, for the subset of non-streaming chat-completion calls used in this project, the interface was similar enough across providers that one small client wrapper worked with only minor conditionals.

That is a narrower claim than “all OpenAI-compatible APIs are standardized.” They are not. Tool calling, structured output modes, reasoning-token controls, and provider-specific parameters still vary. But for the core request/response flow used here, treating model choice as configuration rather than architecture worked well.

In my testing, the entire llm.py implementation stayed small and worked across the providers above with only lightweight provider-specific handling.

What's in the Repository

The repo is intentionally small and direct:

main.py — CLI entry point with onboarding wizard
config.py — environment loader with backward-compatibility aliases
llm.py — OpenAI-compatible client with retry and robust JSON extraction
tools.py — tool implementations and tool registry schema
mpg.py — Meta-Prompt Generator
verifier.py — LLM-as-judge verifier
reflector.py — failure diagnosis and P/T/S revision
harness.py — PEV loop and outer attempt loop
memory.py — runs and lessons persistence

Dependencies are minimal:

requests
Python 3.8+

There is no LangChain, no LlamaIndex, no DSPy, and no heavyweight framework hidden underneath. That constraint was part of the point. I wanted every line of behavior to be readable in an afternoon.

What I Learned Building SynthAgent

The architecture is the headline, but the implementation taught me a few things that mattered more than the conceptual framing.

JSON parsing is harder than model selection

The single biggest source of early failures was malformed JSON from the Planner step.

The failure patterns were painfully familiar:

markdown code fences around JSON
prose before the object
raw newlines inside strings
trailing commas
partial object emission

To make the system resilient, llm.generate_json() now uses a three-stage recovery pipeline:

direct parse
strip markdown
brace-matching with control-character sanitization

A simplified sketch of the idea looks like this:

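import json

def recover_json(text: str):
    # The three helpers called below (strip_markdown_fences,
    # extract_first_brace_matched_json, sanitize_control_chars)
    # are omitted from this sketch.

    # 1. Try direct parse
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass

    # 2. Remove markdown fences
    cleaned = strip_markdown_fences(text)
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        pass

    # 3. Extract brace-matched object and sanitize control chars;
    #    a failure at this stage propagates to the caller
    candidate = extract_first_brace_matched_json(cleaned)
    candidate = sanitize_control_chars(candidate)
    return json.loads(candidate)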

Those recovery paths are not elegant, but they are extremely practical. In a harness like this, robust JSON extraction matters more than almost any model-level tuning.
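For readers who want something runnable, here is one possible self-contained version of the same three stages. The helper bodies are illustrative guesses, not the repository's implementation, and the control-character step is swapped for `json.loads(..., strict=False)`, which lets raw control characters such as newlines appear inside strings. Trailing commas and truncated objects are not repaired by this sketch.

```python
import json
import re

FENCE = "`" * 3  # the Markdown code-fence marker, built indirectly


def strip_markdown_fences(text: str) -> str:
    """Remove a leading fence line (with optional language tag) and a trailing fence."""
    text = re.sub(r"^" + FENCE + r"[A-Za-z0-9_-]*\s*", "", text.strip())
    return re.sub(r"\s*" + FENCE + r"$", "", text)


def extract_first_brace_matched_json(text: str) -> str:
    """Return the first balanced {...} span, ignoring braces inside strings."""
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found")
    depth, in_string, escaped = 0, False, False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
    raise ValueError("unbalanced braces (possibly a truncated object)")


def recover_json(text: str):
    """Stage 1: direct parse. Stage 2: strip fences. Stage 3: brace-match, lenient parse."""
    cleaned = strip_markdown_fences(text)
    for candidate in (text, cleaned):
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            pass
    candidate = extract_first_brace_matched_json(cleaned)
    return json.loads(candidate, strict=False)  # tolerate raw control chars in strings
```

Tracking string state in the brace matcher matters because planner output often contains braces inside string values, for example a note that quotes a code snippet.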

Reflection needs the trajectory, not just the score

If you tell the Reflector only that the verifier score was 0.4, you are not giving it enough to improve anything meaningful.

If you show it the actual trace and make the failure concrete—for example, that it called tavily_search with the wrong query argument on step 3 and then committed to an unsupported answer on step 5—it has something operational to fix.

That is why the full trace gets passed through the system. Reflection without trajectory is mostly guesswork.

The smallest prompts are often the highest-leverage components

The most valuable hand-written parts of the system are also the smallest:

the MPG prompt in mpg.py is about 20 lines
the Verifier prompt is about 15 lines
the Reflector prompt is about 20 lines

Those prompts earn their place because they define behavior at the right level of abstraction. They are not trying to solve the task directly. They are specifying how the system should generate, evaluate, and revise task-solving behavior.

Everything else is composed around those few instructions.

Self-prompting without a real verifier is hallucination with extra steps

This was the most sobering lesson.

I watched the loop improve prompts across multiple attempts, tighten wording, raise the verifier score from 0.7 to 0.85, and still produce a confidently wrong answer. The system looked like it was learning, but it was really optimizing against an imperfect judge.

That does not make self-prompting useless. It just means it cannot rescue a broken oracle.

A better harness cannot compensate for a verifier that does not track reality.

If I had to summarize the whole project in one cautionary line, that would be it.

Where I'd Take the Architecture Next

The current version works, but a few next steps feel much more important than adding more tools or more prompt tricks.

Pluggable verifiers

This is the most obvious gap.

Right now, verification is LLM-as-judge only. The next version should expose a verifier interface where task-specific checks can be plugged in, such as:

pytest
sql_execute
jsonschema.validate
custom rubric evaluators

The current structure already supports this direction. verifier.py mainly needs to be refactored into a strategy-style interface.
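A strategy-style interface could be as small as one method. The sketch below is hypothetical throughout: `Verdict`, `Verifier`, and `FileCoverageVerifier` are illustrative names, and the deterministic example targets the repository-summary task from earlier using naive substring counting.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class Verdict:
    score: float  # 0.0 to 1.0, as in the current LLM-as-judge output
    critique: str
    suggested_fixes: List[str]

    def passed(self, threshold: float = 0.85) -> bool:
        return self.score >= threshold


class Verifier(Protocol):
    """Anything with this method can be plugged in, LLM-based or deterministic."""
    def verify(self, task: str, criteria: str, output: str) -> Verdict: ...


class FileCoverageVerifier:
    """Deterministic check: every expected file name appears exactly once."""

    def __init__(self, expected_files: List[str]) -> None:
        self.expected_files = expected_files

    def verify(self, task: str, criteria: str, output: str) -> Verdict:
        # Naive substring counts; a real check would parse the summary structure.
        problems = [f"{name} mentioned {output.count(name)} times"
                    for name in self.expected_files if output.count(name) != 1]
        score = 1.0 - len(problems) / max(len(self.expected_files), 1)
        return Verdict(score, "; ".join(problems) or "all files covered once",
                       [f"fix coverage: {p}" for p in problems])
```

Because the deterministic verifier returns the same `Verdict` shape as the judge, the Reflector would not need to know which kind produced the critique.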

Better lesson-store retrieval

Lessons are currently appended and retrieved by naive cosine similarity. That may be enough for now, but I suspect even a small reranker—or a baseline like BM25—could outperform raw embedding similarity on a small corpus.

This is another place where I would want to measure before claiming the design is sound.

Cost ceilings

Termination currently depends on either pass/fail outcome or attempt limit. That is useful, but not sufficient for unattended runs.

A token-cost ceiling per task would make the system much safer to leave running in the background, especially when testing across providers with different latency and pricing profiles.
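A ceiling like that could be a small counter checked after every model call. This is a sketch of the idea, not existing code: `TokenBudget` and `BudgetExceeded` are hypothetical, and it assumes responses carry an OpenAI-style `usage.total_tokens` field, which not every provider is guaranteed to return.

```python
from typing import Any, Dict


class BudgetExceeded(RuntimeError):
    """Raised when a task has spent more tokens than its ceiling allows."""


class TokenBudget:
    def __init__(self, max_total_tokens: int) -> None:
        self.max_total_tokens = max_total_tokens
        self.used = 0

    def charge(self, response: Dict[str, Any]) -> None:
        """Add this call's usage.total_tokens and enforce the ceiling."""
        usage = response.get("usage") or {}  # some providers omit or null this
        self.used += int(usage.get("total_tokens", 0))
        if self.used > self.max_total_tokens:
            raise BudgetExceeded(
                f"used {self.used} tokens, ceiling is {self.max_total_tokens}")
```

Raising an exception rather than returning a flag lets the outer loop treat a blown budget the same way it treats any other hard stop.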

A/B testing whether MPG actually helps

This is the experiment the project still owes itself.

I have not yet run a clean A/B test comparing:

a fixed, hand-written prompt with the same harness
the synthesized P/T/S approach

That comparison is overdue. It would either validate the central thesis or force me to narrow it. Either outcome would be useful.

How to Try SynthAgent

If you want to run it locally:

git clone https://github.com/pksw4u/synthagent
cd synthagent
pip install -r requirements.txt
python main.py

The onboarding flow prompts for an LLM endpoint. Most OpenAI-compatible endpoints should work for the core chat flow used here, with occasional provider-specific adjustments. The default points to NVIDIA NIM, but you can switch to OpenAI, Groq, OpenRouter, Ollama, or a local llama.cpp-style server by setting two environment variables:

LLM_BASE_URL
LLM_MODEL

That portability was one of the more satisfying parts of the project. It lets the harness stay stable while the model backend remains easy to swap.

Key Takeaways

SynthAgent treats prompt synthesis as an architectural primitive, not a static authoring step.
The system generates a task-specific Prompt P, Tool Plan T, and Strategy S, then runs them through a Plan-Execute-Verify loop with reflection.
The Meta-Prompt Generator is the only human-authored prompt in the system, which keeps harness logic transparent and inspectable.
Structured planner output and robust JSON recovery mattered more in practice than most model-selection debates.
The current LLM-as-judge verifier is the weakest link; deterministic, task-specific verifiers are the most important next step.
The tool registry defines the legal action space, which is both a capability boundary and a safety boundary.
The architecture is deliberately lightweight: Python 3.8+, requests, and a small set of readable modules.

Conclusion

SynthAgent started from a simple discomfort: if we already accept that the harness matters more than the model in many real-world agent systems, it makes little sense to keep the prompt frozen as a manually maintained artifact while everything else evolves around it.

Building the project made that intuition sharper, but it also exposed the limits of the idea. Prompt synthesis is useful. Reflection is useful. Per-task harness generation is promising. But none of those can substitute for a verifier that actually tracks correctness. If the oracle is weak, the loop just gets better at fooling itself.

That is why I think the most important question for agent design is shifting. It is no longer just “what model should I use?” or even “what tools should I expose?” Increasingly, it is: what parts of the harness should be generated, what parts should be fixed, and how do we verify the result without getting gamed?

If that question interests you, take a look at the repo, try it on a task class you care about, and compare it against your own fixed-prompt baseline. The central claim only matters if it pays for itself in practice.

Repo: github.com/pksw4u/synthagent