TLDR
- Result Loops let an agent score its own output against a JSON rubric and retry until the score passes; public beta since 2026-05-06.
- Pattern 1 is a blog rubric I run on every draft: TLDR present, 4-6 H2s, no em dashes, ~14% retry rate.
- Pattern 2 is a code-PR rubric that gates on tests, lint, and types before a human ever sees the diff.
- Patterns 3 to 5 cover email tone, image-prompt structure, and bug-triage completeness with the same retry shape.
- Honest cost note: every retry is real tokens, so cap iterations and set the threshold low enough that you actually exit.
I have been running Anthropic's Result Loops in private beta for about three weeks. Last Tuesday it went public. Here is what I actually use it for and what it cost me to learn the difference between a good rubric and a rubric that loops forever.
What Result Loops actually do
Anthropic announced Result Loops at the SF dev conference on 2026-05-06, in the same release that brought public-beta Multi-Agent Orchestration with 20 specialists, public-beta Webhooks, and the testing harness they call Dreaming.
The mechanic is simple on paper. Your agent produces an output. A second pass scores that output against a rubric you wrote. If the score clears your threshold, the agent returns it. If not, the agent gets the rubric feedback and tries again. Loop until pass or until you hit max iterations.
That is genuinely it. There is no new model, no new SDK, no special prompt format. The "rubric" is a JSON object with a list of criteria, each with a weight and a checker (regex, function call, LLM judge, structural assertion). The "loop" is wrapped around any tool call, any agent task, any structured output you already have.
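To make the shape concrete, here is a minimal sketch of what that wrapper reduces to. The helper names (runAgent, scoreAgainstRubric) are mine, not the SDK's, and the real surface will differ; the control flow is the whole idea:

// Minimal sketch of the retry loop, not the actual SDK API.
// runAgent produces an output; scoreAgainstRubric stands in for whatever
// evaluates the JSON rubric and returns a weighted score plus per-criterion feedback.
interface RubricResult {
  score: number;        // weighted sum of passing criteria, 0..1
  feedback: string[];   // which criteria failed and why
}

async function resultLoop(
  task: string,
  rubric: { threshold: number; max_iterations: number },
  runAgent: (task: string, feedback: string[]) => Promise<string>,
  scoreAgainstRubric: (output: string) => Promise<RubricResult>,
) {
  let feedback: string[] = [];
  let output = "";
  // attempt 0 is the first run; max_iterations counts the retries after it
  for (let attempt = 0; attempt <= rubric.max_iterations; attempt++) {
    output = await runAgent(task, feedback);
    const result = await scoreAgainstRubric(output);
    if (result.score >= rubric.threshold) {
      return { output, passed: true, attempts: attempt + 1 };
    }
    feedback = result.feedback; // the retry gets told why it failed
  }
  return { output, passed: false, attempts: rubric.max_iterations + 1 }; // fail loudly, hand to a human
}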
What changes is who pays attention to the output first. Before Result Loops, that was always you, in review. Now the agent gets a chance to catch obvious failures before they reach your inbox. The catch: every retry is a full token bill. A rubric that loops six times is six times the cost. So the discipline is not "write more rubrics." The discipline is "write rubrics that exit."
I tried this on five different jobs in my studio. Some saved me real time. One of them I disabled within a day. Here is what worked.
Pattern 1: blog-post quality rubric
This is the one I keep on by default. Every blog draft my pipeline produces gets scored before it reaches the publish step. The rubric is short:
{
"name": "raxxo-blog-quality",
"threshold": 0.85,
"max_iterations": 3,
"criteria": [
{"id": "tldr_present", "weight": 0.2, "type": "regex",
"pattern": "
.+?
"},
{"id": "h2_count", "weight": 0.2, "type": "structural",
"assert": "h2_count >= 4 && h2_count <= 6"},
{"id": "word_count", "weight": 0.2, "type": "structural",
"assert": "words >= 1400 && words <= 1800"},
{"id": "no_em_dash", "weight": 0.2, "type": "regex_absent",
"pattern": "\\u2014"},
{"id": "voice_first_person", "weight": 0.2, "type": "llm_judge",
"prompt": "Does this read as one person speaking? yes/no"}
]
}
The first four are deterministic. They do not need a model to evaluate. The fifth is an LLM judge, which is the expensive one but also the one that catches drafts that technically pass every regex and still sound like a press release.
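For a sense of how cheap the deterministic ones are, here is roughly what they boil down to, assuming the draft arrives as markdown. The function and field names are illustrative, not the SDK's:

// Roughly what the four deterministic criteria reduce to, assuming a markdown draft.
// No model call anywhere in here, which is why these checks are effectively free.
function deterministicChecks(draft: string) {
  const words = draft.split(/\s+/).filter(Boolean).length;
  const h2Count = (draft.match(/^## /gm) ?? []).length;
  return {
    tldr_present: /TL;?DR/i.test(draft),          // "regex"
    h2_count: h2Count >= 4 && h2Count <= 6,       // "structural"
    word_count: words >= 1400 && words <= 1800,   // "structural"
    no_em_dash: !draft.includes("\u2014"),        // "regex_absent"
  };
}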
In practice my retry rate sits around 14%. Most drafts pass on the first try. The ones that retry usually fail on word count (model overshoots to 1900) or em dash (the model still loves them, even with hard rules). The voice judge fires maybe once a week.
If you want the longer version of how this fits into the broader blog pipeline, the post on the 9 Claude Code hooks that audit every file I write covers what runs locally before anything ever reaches a Result Loop.
Pattern 2: code-PR rubric
This one is the highest-value pattern in the studio and the one I was most nervous about. The rubric runs after a code-writing agent finishes a task and before the diff opens for review:
{
"name": "code-pr-gate",
"threshold": 1.0,
"max_iterations": 2,
"criteria": [
{"id": "tests_pass", "weight": 0.4, "type": "shell",
"cmd": "bun test", "expect_exit": 0},
{"id": "lint_clean", "weight": 0.2, "type": "shell",
"cmd": "bun lint", "expect_exit": 0},
{"id": "types_check", "weight": 0.3, "type": "shell",
"cmd": "bun typecheck", "expect_exit": 0},
{"id": "no_console_log", "weight": 0.1, "type": "regex_absent",
"pattern": "console\\.log\\("}
]
}
Threshold is 1.0 because I want all four to pass. Max iterations is 2 because at iteration 3 the model usually starts trying to mute the failures (deleting tests, suppressing lint warnings, casting types to any). Two retries is the sweet spot in my data.
The honest result: about 30% of agent diffs fail the first run. About half of those pass after one retry. The rest get returned as "could not pass rubric" and I look at them myself, which is exactly what should happen. I am not trying to get to 100% auto-pass. I am trying to remove the diffs that obviously do not compile from my review queue.
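For reference, a shell-type criterion is nothing more exotic than an exit-code check. A sketch using node:child_process, which is not necessarily what the SDK runs underneath:

import { spawnSync } from "node:child_process";

// Sketch of a shell-type criterion: run the command, pass if the exit code matches.
function shellCriterion(cmd: string, expectExit = 0): boolean {
  const [bin, ...args] = cmd.split(" ");
  const result = spawnSync(bin, args, { stdio: "pipe", encoding: "utf8" });
  return result.status === expectExit;
}

// Matches the rubric above:
// shellCriterion("bun test")       -> tests_pass
// shellCriterion("bun lint")       -> lint_clean
// shellCriterion("bun typecheck")  -> types_check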
Running LLM evals in production covers the broader question of when an automated check should block versus warn. Result Loops are basically a more expensive, more flexible version of the same idea.
Pattern 3: email-draft rubric
I draft a lot of customer emails. Tone matters more than length. This rubric is mostly LLM-judged:
{
"name": "email-tone",
"threshold": 0.8,
"max_iterations": 2,
"criteria": [
{"id": "length_ok", "weight": 0.2, "type": "structural",
"assert": "words >= 60 && words <= 180"},
{"id": "no_banned", "weight": 0.3, "type": "regex_absent",
"pattern": "(circle back|touch base|synergy|leverage)"},
{"id": "tone_match", "weight": 0.5, "type": "llm_judge",
"prompt": "Does this match the tone in voice-samples/customer.txt?"}
]
}
The banned phrases list is mine. Yours will be different. The tone judge is the part that earns its keep, because tone is the thing humans notice and tools usually miss.
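What the tone judge reduces to is one cheap model call per criterion. A sketch with the Anthropic TypeScript SDK; the model name and prompt framing are my placeholders, not whatever Result Loops uses internally:

import Anthropic from "@anthropic-ai/sdk";
import { readFileSync } from "node:fs";

// Sketch of an llm_judge criterion: ask a model a yes/no question about the draft.
// The model choice and prompt wrapper are assumptions, not the SDK's internals.
const client = new Anthropic();

async function toneMatches(draft: string): Promise<boolean> {
  const samples = readFileSync("voice-samples/customer.txt", "utf8");
  const response = await client.messages.create({
    model: "claude-sonnet-4-5", // placeholder: use whatever judge model you trust
    max_tokens: 5,
    messages: [{
      role: "user",
      content: `Reference tone:\n${samples}\n\nDraft:\n${draft}\n\nDoes the draft match the reference tone? Answer yes or no.`,
    }],
  });
  const first = response.content[0];
  return first.type === "text" && /yes/i.test(first.text);
}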
This pattern has the highest cost-to-value ratio because LLM judges are not cheap and emails are short. I keep it on for the customer-facing inbox and off for internal Slack drafts. If you want to test whether this is worth it for your pipeline, run it for a week and look at how many emails you actually edit before sending. If you edit fewer than 20%, the rubric is paying for itself. If you edit more than 50%, your rubric is wrong.
Pattern 4: image-prompt rubric
I generate a lot of image prompts for Magnific and similar tools. The agent that writes those prompts kept producing prompts that violated my brand rules (wrong aspect ratio for the destination, requesting text in the image, palette outside the brand). The rubric:
{
"name": "image-prompt-structure",
"threshold": 1.0,
"max_iterations": 2,
"criteria": [
{"id": "aspect_ratio_set", "weight": 0.3, "type": "regex",
"pattern": "--ar (16:9|9:16|1:1|4:5)"},
{"id": "no_text_in_image", "weight": 0.3, "type": "regex_absent",
"pattern": "(text saying|with the words|caption reads)"},
{"id": "palette_match", "weight": 0.4, "type": "llm_judge",
"prompt": "Does this prompt steer toward the RAXXO palette: dark gray bg, lime accent? yes/no"}
]
}
This is one of the higher-leverage rubrics in dollar terms because every bad prompt that makes it through costs an actual generation credit. Catching it pre-generation is genuinely cheaper than catching it post-generation. The retry rate here is around 25%, mostly because the model wants to add text to the image even though I told it not to.
Pattern 5: bug-triage rubric
The last pattern is for the agent that triages incoming bug reports. Each report gets enriched with severity, a reproducer, and an owner. The rubric makes sure all three exist:
{
"name": "bug-triage-complete",
"threshold": 1.0,
"max_iterations": 2,
"criteria": [
{"id": "severity_set", "weight": 0.3, "type": "structural",
"assert": "severity in ['low','med','high','critical']"},
{"id": "reproducer_present", "weight": 0.4, "type": "structural",
"assert": "reproducer.length >= 40"},
{"id": "owner_assigned", "weight": 0.3, "type": "structural",
"assert": "owner != null"}
]
}
No LLM judges here. All three checks are deterministic. Retry rate is around 18%, almost always because the reproducer is too short. The fix is for the agent to ask the reporter one clarifying question, not to invent a longer reproducer. Result Loops are not magic: if the input data is incomplete, the loop will just keep retrying until it hits the cap without ever getting better, which is why max_iterations matters more than threshold for triage rubrics.
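Spelled out, the three checks are trivial predicates over the enriched report. The type here is mine for illustration, not the triage agent's actual schema:

interface TriagedReport {
  severity: "low" | "med" | "high" | "critical" | null;
  reproducer: string;
  owner: string | null;
}

// The three structural checks as plain predicates; nothing here needs a model.
function triageComplete(report: TriagedReport) {
  return {
    severity_set: report.severity !== null,
    reproducer_present: report.reproducer.length >= 40,
    owner_assigned: report.owner !== null,
  };
}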
The cost question, honestly
Here is the thing nobody puts in the launch posts. Every retry is a full agent run. A loop that retries three times is a 4x token bill on that task. If you run a rubric on every output your studio produces, you will see a real bump on your invoice.
In my numbers, Result Loops added about 11% to my monthly Anthropic spend. The trade is fewer human review cycles, which is genuinely worth it for me because human review is the actual bottleneck. But it is not free, and the framing "the agent self-checks" hides the fact that "self-checking" is just "running again."
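If you want to sanity-check that before switching a rubric on, a retry rate translates into an expected cost multiplier. Treating each attempt as failing with the same independent probability is a rough model, not billing data, but it gives you the order of magnitude:

// Expected runs per task if every attempt fails independently with probability r,
// capped at max_iterations retries. Back-of-envelope only.
function expectedRuns(retryRate: number, maxIterations: number): number {
  let runs = 0;
  for (let i = 0; i <= maxIterations; i++) runs += retryRate ** i;
  return runs;
}

// expectedRuns(0.14, 3) ≈ 1.16 -> the blog rubric adds roughly 16% tokens on that task
// expectedRuns(0.30, 2) ≈ 1.39 -> the code-PR gate is pricier, which tracks with its 30% first-run failure rate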
Two rules I now follow:
1. Set the threshold low enough that you actually exit. A 0.95 threshold on a rubric with five criteria means you are functionally requiring perfection. Most of my rubric thresholds sit at 0.8 or 0.85. The one at 1.0 is for code, where I want every check to pass and I can afford the bill.
2. Cap max_iterations at 2 or 3. After three retries, the model usually starts gaming the rubric (deleting tests, padding word counts with filler, satisfying the regex without satisfying the spirit). Better to fail loudly and hand the task to a human than to loop into nonsense.
For the bigger picture on how Result Loops sit alongside Multi-Agent Orchestration and Dreaming, the Claude Managed Agents update covers all three at once.
Bottom line
Result Loops are not a new way of thinking. They are a JSON wrapper around the very old idea of "check your work before you submit it." What is new is that the wrapper is now part of the SDK, the retry happens automatically, and the rubric travels with the agent code instead of living in a separate test file.
If you have an agent that produces structured output today, you can write a rubric for it this afternoon. Pick the three things that matter most. Give them weights. Set a threshold around 0.8. Cap iterations at 2. Run it for a week and look at the retry rate. If it is over 30%, your rubric is too strict or your agent is too weak. If it is under 5%, your rubric is not actually catching anything.
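If a skeleton helps, this is the shape I would start from; the criteria are placeholders to swap for your own three checks:

{
"name": "my-first-rubric",
"threshold": 0.8,
"max_iterations": 2,
"criteria": [
{"id": "most_important_thing", "weight": 0.4, "type": "structural", "assert": "..."},
{"id": "second_thing", "weight": 0.3, "type": "regex", "pattern": "..."},
{"id": "third_thing", "weight": 0.3, "type": "llm_judge", "prompt": "..."}
]
}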
If you want a starter rubric library to copy from, the templates above are inside the Claude Blueprint, along with the rest of the studio's audit scaffolding. Pick one job, write one rubric, and ship one loop. The interesting work is not in the loop. It is in deciding what "passing" means.