Gabriel Anhaia
Self-Critique Loops for Agents: Where the 3rd Iteration Stops Helping


You wired a self-critique loop into your agent last sprint. Iteration one cleans up the obvious mistakes. Iteration two catches the subtler ones. By iteration three, the agent is rewriting cosmetic phrasing and the eval score barely moves. By iteration four, the score has dropped two points because the model started defending an earlier wrong choice and edited around it instead of fixing it.

This is the shape teams converge on when they put reflection loops in front of real eval data. The first papers on self-refine and reflexion got the headline right: asking a model to critique its own output helps. The part most production teams miss is that the curve is concave and short, and after a small number of passes the loop costs you tokens, latency, and accuracy.

What the iteration curve really looks like

The pattern is not a single number. It depends on the task, the model, and whether the critic has access to ground truth or only to the actor's previous output. Across the published work (the Self-Refine paper, the Reflexion paper, and Anthropic's own writing on building effective agents) the shape is consistent enough to plan around.

Iteration zero is the baseline draft. Iteration one is the first critique-and-revise pass. The lift from zero to one is the largest gain you will see in the loop. In practice, on the evals teams have shared publicly, that lift tends to land in roughly a 5 to 15 point range on a domain-specific benchmark, with the exact number heavily dependent on task and model — treat it as anecdotal rather than a guarantee. Iteration two adds a smaller but real bump, usually a fraction of the iteration-one delta. Iteration three is where the curve flattens. Iteration four is where regressions start to appear in some runs (this is the author's observation across production loops, not a published statistic).

The mechanism for the regression is not exotic. Once a model has produced text and then defended it through one critique pass, that text is in the context. The model treats it as committed work. A second self-critique sees the prior critique and the prior revision and starts to over-edit, touching things that were already correct, removing nuance, piling cautious hedges onto sentences that were fine. A code task ends up with a working function rewritten into a more "defensive" version that introduces a subtle off-by-one. A writing task ends up with so many qualifiers stacked on the original claim that the claim is gone.

So the practical rule is: budget one or two critique iterations, treat three as a maximum, and never let the loop run unbounded just because the agent says "actually, on reflection, I could improve this further".

Actor and critic should not be the same call

The cleanest fix to the anchoring problem is structural. Do not use the same model call to produce text and to critique it. Use two distinct prompts (and ideally two distinct conversation contexts) — an actor that produces output, and a critic that judges it.

The split matters for two reasons. First, the critic does not see the actor's chain of reasoning, only the final output, so it cannot rationalize defects it would otherwise commit to. Second, the critic's prompt can be aggressive about looking for errors without contaminating the actor's instruction with negative framing.

A minimal version looks like this in Python with the Anthropic SDK:

import anthropic

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-5"  # check console for current id


def actor(task: str, prior_output: str | None,
          critique: str | None) -> str:
    # First pass: send the raw task. Revision passes: include the
    # prior answer and the critic's feedback.
    if prior_output is None:
        prompt = task
    else:
        prompt = (
            f"Task: {task}\n\n"
            f"Your previous answer:\n{prior_output}\n\n"
            f"Critic feedback:\n{critique}\n\n"
            "Revise the answer. Address every concrete "
            "issue. Do not change parts the critic did "
            "not flag."
        )
    msg = client.messages.create(
        model=MODEL,
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text


def critic(task: str, candidate: str) -> dict:
    # Bounded job: list concrete defects, or emit the literal
    # NO_DEFECTS token as a deterministic halt signal.
    prompt = (
        f"Task: {task}\n\n"
        f"Candidate answer:\n{candidate}\n\n"
        "List concrete defects only. For each, give the "
        "exact text to change and why. If there are no "
        "concrete defects, reply with the literal token "
        "NO_DEFECTS and nothing else."
    )
    msg = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    text = msg.content[0].text.strip()
    return {
        "verdict": "ok" if text == "NO_DEFECTS" else "fix",
        "feedback": text,
    }

Two details in the prompts pull a lot of weight. The actor is told not to change parts the critic did not flag, which stops the rewrite-everything drift. The critic is told to use a literal NO_DEFECTS token when satisfied, which gives you a deterministic halt signal rather than free-form approval language to parse.

You can run the critic with a smaller, cheaper model (Haiku, gpt-4o-mini, a fine-tuned Llama) without much loss as long as the critic's job is bounded enough. Pattern matching against a rubric is cheaper than generating a full answer.
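A sketch of that split, reusing the client from above. The Haiku-class model id is an assumption; check the console for the current one:

CRITIC_MODEL = "claude-haiku-4-5"  # assumed id; check console


def cheap_critic(task: str, candidate: str) -> dict:
    # Same bounded rubric as critic() above, sent to a smaller
    # model. The function name is illustrative.
    prompt = (
        f"Task: {task}\n\n"
        f"Candidate answer:\n{candidate}\n\n"
        "List concrete defects only. For each, give the "
        "exact text to change and why. If there are no "
        "concrete defects, reply with the literal token "
        "NO_DEFECTS and nothing else."
    )
    msg = client.messages.create(
        model=CRITIC_MODEL,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    text = msg.content[0].text.strip()
    return {
        "verdict": "ok" if text == "NO_DEFECTS" else "fix",
        "feedback": text,
    }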

The bounded loop with halting heuristics

The loop itself is short: three halt heuristics and one explicit iteration budget:

def critique_loop(task: str, max_iters: int = 3):
    output = actor(task, prior_output=None,
                   critique=None)
    history = [output]  # accepted drafts, for oscillation detection

    for i in range(max_iters):
        verdict = critic(task, output)
        if verdict["verdict"] == "ok":
            # Halt 1: critic emitted the NO_DEFECTS token.
            return {"output": output, "iters": i,
                    "halt": "critic_ok"}

        revised = actor(
            task,
            prior_output=output,
            critique=verdict["feedback"],
        )

        if revised.strip() == output.strip():
            # Halt 2: actor returned a byte-identical answer.
            return {"output": output, "iters": i + 1,
                    "halt": "no_progress"}

        if len(history) >= 2 and revised in history:
            # Halt 3: actor reverted to an earlier draft.
            return {"output": output, "iters": i + 1,
                    "halt": "loop_detected"}

        history.append(revised)
        output = revised

    # Halt 4: explicit iteration budget exhausted.
    return {"output": output, "iters": max_iters,
            "halt": "budget_exhausted"}

Three halt heuristics earn their place here.

critic_ok is the obvious one: the critic emitted the agreed token. no_progress catches the case where the actor's revised output is byte-identical to the prior one, which happens more often than you would expect once the model has run out of useful edits. loop_detected catches the rarer case where the actor oscillates between two outputs, agreeing with each round of critique by reverting to a prior version. Without that check, your loop will burn the full budget alternating between two near-identical answers.
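A quick usage sketch, with a made-up task string, to see which halt fires:

result = critique_loop(
    "Rewrite this changelog entry so a non-engineer can read it: "
    "'fix: debounce retry jitter on 503s'"
)
print(result["halt"], result["iters"])

Log the halt reason for every production run; the distribution tells you whether the budget is doing work or just burning tokens.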

max_iters=3 is the production default for almost every task you would wire a critique loop onto. Raise it only if your eval shows a real iteration-four gain on your specific task, and require that evidence on every review.

Ground the critic in something other than itself

Pure self-critique — model A asks model A "is this good" — has a known ceiling. The model will not catch defects that require knowledge it does not have, and it will not catch defects that match its own training-data biases.

The lift comes when the critic has access to something external the actor does not. Three options that work in production:

Tool-grounded critique. Run the actor's output through a deterministic check before it goes to the LLM critic. Code: run it. SQL: explain it. JSON output: validate it against a schema. Math: compute it. The critic prompt then includes both the candidate output and the tool results, and the model has something concrete to anchor its judgment against (a sketch of this variant follows the three options).

Retrieval-grounded critique. Pull the source documents the actor was supposed to be summarizing or citing, and pass them to the critic alongside the candidate. The critic's job becomes "find claims in the candidate that are not supported by the source", which is a much smaller and more tractable task than "is this good".

Comparison-judge critique. Generate two candidates from the actor (with different seeds, different temperatures, or different sub-prompts) and have the critic pick the better one or merge them. This bounds the critic's job to a comparison instead of an open-ended judgment, and in published comparisons (see the LLM-as-a-judge MT-Bench paper) pairwise judging tracks human preference more reliably than asking the same model to grade a single output in isolation.

Combine these where you can. A code-generation loop with the critic seeing both the failing test output and the docs for the API the actor used catches more defects than either alone.
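For the JSON case, here is a minimal sketch of the tool-grounded variant. The helper names tool_check and grounded_critic are illustrative, and it reuses the client and MODEL from earlier:

import json


def tool_check(candidate: str, required_keys: set[str]) -> str:
    # Deterministic pre-check: does the output parse as JSON,
    # and does it carry the keys the task demands? The result
    # becomes evidence in the critic prompt.
    try:
        data = json.loads(candidate)
    except json.JSONDecodeError as exc:
        return f"TOOL CHECK FAILED: invalid JSON ({exc})"
    if not isinstance(data, dict):
        return "TOOL CHECK FAILED: expected a JSON object"
    missing = required_keys - set(data)
    if missing:
        return f"TOOL CHECK FAILED: missing keys {sorted(missing)}"
    return "TOOL CHECK PASSED"


def grounded_critic(task: str, candidate: str,
                    tool_report: str) -> dict:
    # Same contract as critic() above, with the deterministic
    # check results included to anchor the judgment.
    prompt = (
        f"Task: {task}\n\n"
        f"Candidate answer:\n{candidate}\n\n"
        f"Deterministic check results:\n{tool_report}\n\n"
        "List concrete defects only, citing the check results "
        "where relevant. If there are no concrete defects, "
        "reply with the literal token NO_DEFECTS and nothing "
        "else."
    )
    msg = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    text = msg.content[0].text.strip()
    return {
        "verdict": "ok" if text == "NO_DEFECTS" else "fix",
        "feedback": text,
    }

The retrieval-grounded variant has the same shape: swap the tool report for the source documents and tell the critic to flag unsupported claims. The comparison-judge variant swaps in a second candidate and asks for a verdict token instead of a defect list.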

The cost math

A self-critique loop turns one model call into 1 + 2N calls for N iterations of critique-and-revise (one critic call per iteration, one actor call per revision). At three iterations that is seven calls instead of one.

Token costs scale with context length, which grows on every revision because the prior output and prior critique are part of the new prompt. By iteration three, the actor's prompt for revision-three may be five times the size of the original. If your task uses 2k input tokens to start with, by the third revision the actor is processing roughly 8k to 10k input tokens.

A back-of-envelope on a hypothetical workload (assume roughly 2k input and 500 output tokens per call on a Sonnet-class model at public list pricing): a critique loop with three iterations, on a task that originally cost about $0.01 per call, ends up costing somewhere between $0.05 and $0.08 per task. Public model prices change often, so re-run the math against current rates before you commit. Whether that is worth it depends on your eval delta. On a high-stakes task (code that ships, contract clauses, medical summarization) the math works fine. On a chat reply where the user does not notice a 3-point eval bump, it does not.
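To redo that arithmetic against current rates, a throwaway calculator. Every number in it is a placeholder assumption, including the per-million-token prices, which here assume a cheaper critic model:

def loop_cost(iters: int = 3,
              base_in: int = 2_000,
              actor_out: int = 500,
              critic_out: int = 200,
              actor_usd: tuple = (3.0, 15.0),
              critic_usd: tuple = (1.0, 5.0)) -> float:
    # Placeholder prices in dollars per million tokens
    # (input, output); re-check current rates before trusting this.
    # 1 + 2N calls: initial draft, then one critic call and one
    # actor revision per iteration, with the prompt growing.
    usd = base_in * actor_usd[0] + actor_out * actor_usd[1]
    ctx = base_in
    for _ in range(iters):
        ctx += actor_out + critic_out  # prior answer + critique
        usd += ctx * critic_usd[0] + critic_out * critic_usd[1]
        usd += ctx * actor_usd[0] + actor_out * actor_usd[1]
    return round(usd / 1_000_000, 4)


print(loop_cost(3))  # ~0.08 with these placeholder numbers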

Two ways to reduce that bill without giving up the loop:

  • Cheaper critic model. Use Haiku-class for critique, full-size model for the actor. The critic's job is comparison and pattern-matching against a rubric; it does not need the actor's reasoning depth.
  • Conditional loop. Run the critic once on every output, but only enter the revision branch if the critic returns concrete defects. Many outputs will pass the first critic check on iteration one with NO_DEFECTS, especially after you tune the critic rubric to your task. You pay one extra call for those, not seven; a sketch follows the list.
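A sketch of the conditional variant, reusing actor() and critic() from above. The no-progress and oscillation checks from critique_loop are dropped for brevity; keep them in production:

def conditional_critique(task: str, max_iters: int = 2) -> dict:
    # Screen every output with a single critic call; only pay
    # for revisions when concrete defects come back.
    output = actor(task, prior_output=None, critique=None)
    verdict = critic(task, output)
    if verdict["verdict"] == "ok":
        return {"output": output, "iters": 0, "halt": "critic_ok"}

    # Defects found: enter the bounded revision loop, reusing
    # the first critique so it is not paid for twice.
    for i in range(max_iters):
        output = actor(task, prior_output=output,
                       critique=verdict["feedback"])
        verdict = critic(task, output)
        if verdict["verdict"] == "ok":
            return {"output": output, "iters": i + 1,
                    "halt": "critic_ok"}
    return {"output": output, "iters": max_iters,
            "halt": "budget_exhausted"}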

What to do with this on Monday

If you already have a critique loop running, the changes that pay back fastest:

  • Cap iterations at three (or lower) and add a NO_DEFECTS halt token plus a no-progress check so the loop exits the moment it has nothing left to do.
  • Split the actor and critic into separate prompts and separate calls, so the critic does not see the actor's chain of reasoning.
  • Look at your last 100 production runs that hit max iterations and check whether iteration three actually beat iteration two on your eval. If it did not (and it usually does not), drop the cap to two.

If you do not have a loop yet and you are planning to add one, start with one iteration. Measure the lift on your eval. Add iteration two only if the data supports it. The default expectation should be that iterations three and beyond are a waste of compute, and you should require evidence to justify each one above two.

The shape of the curve is not a model defect. It is information theory: critique without external grounding is bounded by what the actor and critic together already know. Once you have squeezed the easy errors out in one or two passes, the next pass has nothing left to find. Halt there. Adding another iteration is hope, not signal.


If this was useful

AI Agents Pocket Guide: Patterns for Building Autonomous Systems with LLMs covers the loop patterns from the same angle as this post: bounded iteration, halting heuristics, actor/critic separation, and tool-grounded judgment. It stitches them together with the surrounding pieces (memory, sub-agent dispatch, recovery from partial failure) you need when a critique loop is one stage in a larger agent. Self-critique is one of the easier loops to wire up wrong; it is also one of the easier ones to make pay back once you give it a real budget and a real halt rule.

