Sébastien Conejo

Posted on • Originally published at linkedin.com

I stopped prompting my agent. Now I design the loop that prompts it.

We're moving past the era where working with an agent meant typing a prompt, waiting, reading the diff, typing again.

You are the loop in that setup. Your attention is the constraint, not the model. The agent sits idle until you show up.

I wanted out of that cycle. Not completely out, I'll get to that. But I wanted to stop being the operator and become the supervisor instead.

I run a personal AI agent on a VM. It triages my emails, monitors Reddit, handles parts of my calendar. It is built on Hermes, an open-source agent framework.

For months I still had to prompt it every morning. Tell it what to do, review, correct, tell it again. The model was good. I was the slow part.

So I built a loop. The architecture borrows from Karpathy's autoresearch: a propose-execute-evaluate-keep/discard cycle where the agent searches for improvements inside a fixed evaluation harness. The optimization loop itself fits in four files.

Four files on disk

No framework. No SDK. A Python script and four files.

  • contract.md defines what the agent can do. The boundaries. I wrote it once.
  • target/ holds the files the agent can edit. The only thing that changes between runs.
  • eval scores the output. Returns a number. This file is immutable, which matters more than it sounds: because the agent can't edit its own judge, it can't hack the score. The eval stays honest even when the agent gets creative.
  • state/ logs every experiment, every score, every keep or discard. Append-only.

The script itself is a while loop:

while budget > 0:
    planner reads contract + state → proposes a change
    generator modifies target/
    eval scores the result
    if better → keep
    otherwise → revert
    log everything
    budget -= 1

The models do the thinking. The script is plumbing. It's also greedy hill-climbing, which means it can get stuck in local optima. For my use case that's fine. For yours it might not be.
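A minimal runnable sketch of that loop, under stated assumptions: `run_loop`, `propose`, and `score` are hypothetical names, the planner and generator are collapsed into one `propose` callable, and reverting is done by copying a snapshot of target/ back into place. The post does not say how the original script reverts, so treat the snapshot approach as one possible choice.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, List, Tuple


def run_loop(
    target: Path,
    propose: Callable[[Path], None],  # hypothetical: edits files under target/
    score: Callable[[Path], float],   # hypothetical: the fixed eval, returns a number
    budget: int,
) -> Tuple[float, List[Tuple[float, str]]]:
    """Greedy hill-climbing: keep a change only if the score strictly improves."""
    best = score(target)
    history: List[Tuple[float, str]] = []
    for _ in range(budget):
        # Snapshot target/ so a change that does not help can be rolled back.
        snapshot_root = Path(tempfile.mkdtemp())
        snapshot = snapshot_root / "snapshot"
        shutil.copytree(target, snapshot)

        propose(target)
        new = score(target)

        if new > best:
            best = new
            history.append((new, "keep"))
        else:
            shutil.rmtree(target)
            shutil.copytree(snapshot, target)
            history.append((new, "discard"))

        shutil.rmtree(snapshot_root)
    return best, history
```

Because only strict improvements are kept, the sketch can never move sideways or downhill, which is exactly why it can stall in a local optimum.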

One design decision that mattered

I learned this the hard way. My first version had one model classify my emails and rate its own confidence. Confidence was always "HIGH." The scores looked fine. The results were mediocre.

The problem: one model generating and evaluating produces correlated errors. It has the same blind spots on both sides. In practice it feels like two optimists agreeing with each other. The loop looks healthy while the quality drifts.

The fix for my setup was context separation. I split it into two calls where neither sees the other's reasoning. The generator never sees why the evaluator scored it low. The evaluator never sees the generator's chain of thought. They share the artifact and the score, nothing else. You can do this with the same model on two separate calls.

On top of that, I route them to different models through Manifest, our LLM router. It sends specific requests to different models and patches failures on the fly instead of letting them drop. One HTTP header picks which model handles each role. This reduces correlated blind spots, since different training data produces different failure modes. But it's a reinforcement, not the core mechanism. Two different models can still share biases from similar RLHF pipelines. The context wall does the heavy lifting.

Disagreements between the two turned out to be my best quality signal.
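The context wall can be sketched as a function that controls what crosses between the two calls. Everything named here is an assumption for illustration: `Generator` and `Evaluator` are hypothetical callables that return their private reasoning alongside their output, and `one_round` simply drops the reasoning on both sides. No real router or model API is shown.

```python
from typing import Callable, Optional, Tuple

# Hypothetical signatures; plug in any model client behind them.
# Each call returns (private_reasoning, output).
Generator = Callable[[str], Tuple[str, str]]
Evaluator = Callable[[str], Tuple[str, float]]


def generator_prompt(task: str, last_score: Optional[float]) -> str:
    # The generator sees the task and the last numeric score, never the critique.
    seen = "none yet" if last_score is None else f"{last_score:.2f}"
    return f"{task}\nPrevious score: {seen}"


def one_round(
    task: str,
    generate: Generator,
    evaluate: Evaluator,
    last_score: Optional[float] = None,
) -> Tuple[str, float]:
    # The generator's reasoning is discarded here and goes no further.
    _gen_reasoning, artifact = generate(generator_prompt(task, last_score))
    # The evaluator receives the artifact only, not the generator's chain of thought.
    _eval_reasoning, score = evaluate(artifact)
    # Only the artifact and the score cross the wall.
    return artifact, score
```

The same function works whether both callables hit one model or two different ones, which matches the point that separation comes first and model diversity is an extra.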

Writing the eval is the actual skill

This is the part I underestimated and the part that matters most.

Most people stall here. "What do I measure?"

Take email triage as an example. Every morning the agent classifies my inbox and marks noise as read. It reports what it did. I correct the mistakes: "that one was important," "that one is noise." Each correction becomes a test case.

The eval re-classifies past emails with the current rules and compares them against my corrections. Score = percentage match. The loop optimizes that number.
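A small sketch of an eval of that shape, assuming corrections are stored as (email text, corrected label) pairs and the current rules are exposed as a `classify` function. Both names, the label strings, and the toy data are made up for illustration.

```python
from typing import Callable, List, Tuple


def score_rules(
    classify: Callable[[str], str],      # hypothetical: applies the current rules
    corrections: List[Tuple[str, str]],  # (email text, human-corrected label)
) -> float:
    """Percentage of past emails whose current label matches the correction."""
    if not corrections:
        return 0.0
    hits = sum(1 for email, expected in corrections if classify(email) == expected)
    return 100.0 * hits / len(corrections)


# Toy rule and made-up corrections, for illustration only.
def toy_rule(email: str) -> str:
    return "noise" if "unsubscribe" in email.lower() else "important"


toy_corrections = [
    ("Weekly digest, click to unsubscribe", "noise"),
    ("Invoice overdue, please reply", "important"),
    ("Flash sale! Unsubscribe here", "noise"),
    ("Conference promo from a vendor", "noise"),  # the toy rule gets this one wrong
]

print(score_rules(toy_rule, toy_corrections))  # 75.0
```

Each new correction is one more tuple in the list, so the test set grows without the scoring code changing.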

Every correction expands the definition of "good." More corrections, bigger test set, harder eval, tighter rules. After a few days I correct less often, partly because the rules got better, partly because I trust it more. Both are happening and I try not to confuse the two.

Writing the eval took me longer than writing the loop. The loop is a while with an if/else. The eval forces you to define what "good" means in a way a script can check.

My other mistake

No state file. The agent forgot everything between runs. Every morning it started fresh, re-learned the same patterns, made the same calls. Adding a TSV log and a rules file on disk turned a forgetful script into something that accumulates instead of resetting.

The agent forgets. The file doesn't.
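One way to get that kind of append-only TSV log with the standard library. The column names and the helper names `log_experiment` and `load_history` are assumptions, not the post's actual schema.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, List

FIELDS = ["timestamp", "change", "score", "decision"]  # hypothetical columns


def log_experiment(path: Path, change: str, score: float, decision: str) -> None:
    new_file = not path.exists()
    # Mode "a" only ever appends, so rows from earlier runs are never rewritten.
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        if new_file:
            writer.writerow(FIELDS)
        timestamp = datetime.now(timezone.utc).isoformat()
        writer.writerow([timestamp, change, f"{score:.2f}", decision])


def load_history(path: Path) -> List[Dict[str, str]]:
    """Read every past experiment so the next run starts from what is known."""
    if not path.exists():
        return []
    with path.open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f, delimiter="\t"))
```

A planner prompt can then be built from `load_history(...)`, so the next run sees which changes were already tried and how they scored.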

When a loop is the wrong tool

Loops re-read context, retry, explore. They cost tokens on every run whether they ship anything or not. If your task doesn't repeat weekly, a good prompt is cheaper.

Before building one, check five things:

  • Does the task repeat?
  • Can you score the output with a number?
  • Can the agent run what it produces?
  • Does the loop have a hard stop?
  • Is the eval hard to game?

That last one is easy to miss. If your loop optimizes a score, the optimizer will find ways to inflate it that don't mean the output got better. My eval scores against past corrections, which means I can reach 100% on the history with rules that fail on new emails. The fix is the same as in ML: hold out recent data so the optimizer can't see all the answers.

Miss one of these and you have a manual prompt pretending to be automation.
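The holdout fix mentioned above can be sketched as a time-ordered split: the loop optimizes against the older corrections, and the newest slice is scored separately as a check against overfitting. The record shape, the function name `split_holdout`, and the 20% default are assumptions for illustration.

```python
from typing import List, Tuple

# Hypothetical record shape: (ISO date, email text, corrected label).
Correction = Tuple[str, str, str]


def split_holdout(
    corrections: List[Correction], holdout_fraction: float = 0.2
) -> Tuple[List[Correction], List[Correction]]:
    """Return (visible, held_out), where held_out is the most recent slice."""
    # ISO dates in one consistent format sort chronologically as plain strings.
    ordered = sorted(corrections, key=lambda c: c[0])
    if not ordered:
        return [], []
    n_holdout = max(1, int(len(ordered) * holdout_fraction))
    cut = len(ordered) - n_holdout
    return ordered[:cut], ordered[cut:]
```

If the score on the visible set keeps climbing while the score on the held-out set stalls or drops, the rules are memorizing history rather than getting better.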

What actually changed

I didn't remove myself from the loop. That would be a lie. I still correct a few emails in the morning, and those corrections are the ground truth that makes the eval work. Without me, the loop has nothing to optimize toward.

What changed is my role. I went from operator to supervisor. I used to sit in the loop at every turn, reviewing every output, prompting the next action. Now I define what good means, then step back. The loop runs on its own on Sunday, tries twenty variations of the rules, keeps whatever scores higher, discards the rest. Monday morning the triage is a little sharper than last week.

I still design the contract. I still write the eval. But I set the standard and let the loop chase it, instead of chasing every email myself.
