Sébastien Conejo

Posted on Jun 30 • Originally published at linkedin.com

I stopped prompting my agent. Now I design the loop that prompts it.

#agents #ai #llm #opensource

We're moving past the era where working with an agent meant typing a prompt, waiting, reading the diff, typing again.

You are the loop in that setup. Your attention is the constraint, not the model. The agent sits idle until you show up.

I wanted out of that cycle. Not completely out, I'll get to that. But I wanted to stop being the operator and become the supervisor instead.

I run a personal AI agent on a VM. It triages my emails, monitors Reddit, handles parts of my calendar. It is built on Hermes, an open-source agent framework.

For months I still had to prompt it every morning. Tell it what to do, review, correct, tell it again. The model was good. I was the slow part.

So I built a loop. The architecture borrows from Karpathy's autoresearch: a propose-execute-evaluate-keep/discard cycle where the agent searches for improvements inside a fixed evaluation harness. The optimization loop itself fits in four files.

Four files on disk

No framework. No SDK. A Python script and four files.

contract.md defines what the agent can do. The boundaries. I wrote it once.
target/ holds the files the agent can edit. The only thing that changes between runs.
eval scores the output. Returns a number. This file is immutable, which matters more than it sounds: because the agent can't edit its own judge, it can't hack the score. The eval stays honest even when the agent gets creative.
state/ logs every experiment, every score, every keep or discard. Append-only.

The script itself is a while loop:

while budget > 0:
    planner reads contract + state → proposes a change
    generator modifies target/
    eval scores the result
    if better → keep
    if worse → revert
    log everything
    budget -= 1

The models do the thinking. The script is plumbing. It's also greedy hill-climbing, which means it can get stuck in local optima. For my use case that's fine. For yours it might not be.

One design decision that mattered
I learned this the hard way. My first version had one model classify my emails and rate its own confidence. Confidence was always "HIGH." The scores looked fine. The results were mediocre.

The problem: one model generating and evaluating produces correlated errors. It has the same blind spots on both sides. In practice it feels like two optimists agreeing with each other. The loop looks healthy while the quality drifts.

The fix for my setup was context separation. I split it into two calls where neither sees the other's reasoning. The generator never sees why the evaluator scored it low. The evaluator never sees the generator's chain of thought. They share the artifact and the score, nothing else. You can do this with the same model on two separate calls.

On top of that, I route them to different models through Manifest, our LLM router that lets you send specific requests to different models and patches failures on the fly instead of letting them drop. One HTTP header picks which neural network handles each role. This reduces correlated blind spots, since different training data produces different failure modes. But it's a reinforcement, not the core mechanism. Two different models can still share biases from similar RLHF pipelines. The context wall does the heavy lifting.

Disagreements between the two turned out to be my best quality signal.

Writing the eval is the actual skill

This is the part I underestimated and the part that matters most.

Most people stall here. "What do I measure?"

Take email triage as an example. Every mornin the agent classifies my inbox and marks noise as read. It reports what it did. I correct the mistakes: "that one was important," "that one is noise." Each correction becomes a test case.

The eval re-classifies past emails with the current rules and compares them against my corrections. Score = percentage match. The loop optimizes that number.

Every correction expands the definition of "good." More corrections, bigger test set, harder eval, tighter rules. After a few days I correct less often, partly because the rules got better, partly because I trust it more. Both are happening and I try not to confuse the two.

Writing the eval took me longer than writing the loop. The loop is a while with an if/else. The eval forces you to define what "good" means in a way a script can check.

My other mistake

No state file. The agent forgot everything between runs. Every morning it started fresh, re-learned the same patterns, made the same calls. Adding a TSV log and a rules file on disk turned a forgetful script into something that accumulates instead of resetting.

The agent forgets. The file doesn't.

When a loop is the wrong tool
Loops re-read context, retry, explore. They cost tokens on every run whether they ship anything or not. If your task doesn't repeat weekly, a good prompt is cheaper.

Before building one, check five things:

Does the task repeat? Can you score the output with a number? Can the agent run what it produces? Does the loop have a hard stop? Is the eval hard to game?

That last one is easy to miss. If your loop optimizes a score, the optimizer will find ways to inflate it that don't mean the output got better. My eval scores against past corrections, which means I can reach 100% on the history with rules that fail on new emails. The fix is the same as in ML: hold out recent data so the optimizer can't see all the answers.

Miss one of these and you have a manual prompt pretending to be automation.

What actually changed

I didn't remove myself from the loop. That would be a lie. I still correct a few emails in the morning, and those corrections are the ground truth that makes the eval work. Without me, the loop has nothing to optimize toward.

What changed is my role. I went from operator to supervisor. I used to sit in the loop at every turn, reviewing every output, prompting the next action. Now I define what good means, then step back. The loop runs on its own on Sunday, tries twenty variations of the rules, keeps whatever scores higher, discards the rest. Monday morning the triage is a little sharper than last week.

I still design the contract. I still write the eval. But I set the standard and let the loop chase it, instead of chasing every email myself.

Top comments (3)

Max Quimby • Jul 1

The context-separation fix resonates — we hit the exact "two optimists agreeing" failure when the generator and judge were the same model sharing a prompt. Splitting the calls helped, but what helped more was making the judge a different model family entirely. Same-family models tend to share systematic blind spots, so even with separated context they'd nod along at the same mistakes; a cheaper cross-family judge caught things the expensive same-family one waved through.

On the greedy hill-climbing: have you seen the loop start gaming the eval over long runs? Even an immutable judge gets exploited if it's scoring a proxy — ours drifted toward the measurable thing (length, keyword hits) instead of what we actually cared about. Rotating a couple of eval variants so no single scoring quirk was worth exploiting helped. Curious whether you've needed anything beyond keep/discard to escape local optima — a periodic "take a worse step on purpose" round, or restarts from logged state?

Tae Kim • Jul 1

The generator/evaluator context isolation point is the one I'd underline — in LangGraph multi-agent pipelines I've run, the failure mode you describe (optimist-optimist convergence) shows up as inflated internal scores that look stable while actual output quality drifts. One extension worth considering: separate the judge's training signal from the artifact it's evaluating, not just the reasoning chain, because if they share the same document embedding space they'll still correlate on subtle lexical features even when neither "sees" the other's chain of thought. The append-only state log pattern here also maps cleanly to LangGraph's checkpointer design, where you can replay any loop iteration from the persisted state without re-running prior steps.

Dipankar Sarkar • Jul 1

The immutable eval is the part people skip, and then they wonder why the agent 'improved' into nonsense. The moment the agent can touch its own scorer, every optimization loop quietly becomes a reward-hacking loop. Keeping eval append-only and untouchable is the whole reason keep/discard means anything.

The failure mode I would watch next is Goodhart on a fixed eval. Propose-execute-evaluate will happily climb whatever the number rewards, including artifacts of the harness rather than the task. Rotating held-out cases the planner never sees in state/ is what keeps it honest past the first plateau. How are you handling eval staleness as the target/ files drift?