pueding

Posted on Jun 10 • Originally published at learnaivisually.com

Agent-Harness Scaling Law: Feedback Quality Predicts Success, Not Raw Compute: Effective Feedback Compute (EFC)

#ai #llm #machinelearning #agents

What: A new agent-harness scaling-law paper introduces Effective Feedback Compute (EFC) — a single quantity that predicts whether an agent finishes a task from the quality of the feedback its harness returns each step, scored on four axes and normalized by how hard the task is.

Why: It reframes agent reliability as a feedback-quality problem, not a token-budget problem — plotted against EFC, harness-run success follows a clean law (R²≈0.94–0.99), while against raw compute the same runs barely fit (R²≈0.33–0.42).

vs prior: Prior reliability work leaned on raw-compute scaling (more tokens, more tool calls, bigger reasoning budgets), but the paper finds that axis is a weak predictor: in one controlled comparison, lifting only feedback quality moved success from 0.27 to 0.90 with cost and tool-call counts held fixed.

Think of it as

a student with a sharp tutor instead of just re-reading the textbook

                  SAME EXAM, SAME HOURS LOGGED
                             │
               ┌─────────────┴──────────────┐
               ▼                            ▼
       ┌───────────────┐          ┌───────────────┐
       │  RE-READ THE  │          │  SHARP TUTOR  │
       │    TEXTBOOK   │          │  per problem  │
       │ (raw compute) │          │ (feedback Q)  │
       └───────┬───────┘          └───────┬───────┘
               │                          │
      pages logged, but          points at the exact
      no correction lands        mistake — and it sticks
               │                          │
               ▼                          ▼
         ✗ grade ~0.27              ✓ grade ~0.90
         effort, no signal         signal absorbed

agent harness = the study setup that feeds you a correction each round
raw compute = hours logged and pages re-read
feedback quality = how useful the tutor's correction is each time
informativeness = the tutor points at the exact mistake, not "study harder"
validity = the correction is actually right, not misleading
non-redundancy = the tutor doesn't repeat a note you already wrote down
retention = you keep the correction in your notes for the next problem
EFC = total useful correction absorbed, divided by how hard the exam is

Quick glossary

EFC — Effective Feedback Compute — the paper's core metric. It measures how much useful feedback signal a harness feeds back into the agent loop, scored on four axes (informativeness, validity, non-redundancy, retention) and normalized by task demand. It is the x-axis of the proposed scaling law, replacing "tokens and tool calls spent."

Agent harness — The scaffolding around the model — the loop that runs tool calls, observes results, and feeds the next observation back to the model. The harness is what delivers feedback, so it is where EFC is won or lost. Covered in Agent Engineering → Production Harness Architecture.

Scaling law — An empirical curve that predicts an outcome (here, task success rate) from one quantity (here, EFC). A tight scaling law means the curve explains most of the variation; a loose one means the quantity is a poor predictor.

R² (fit quality) — The fraction of variation in success the curve explains, from 0 (the x-axis predicts nothing) to 1 (it predicts everything). EFC reaches R²≈0.94–0.99; the raw-compute baseline only 0.33–0.42. Higher R² = a better predictor.

The four feedback axes — Informativeness (does the message localize the error?), validity (is the correction actually right?), non-redundancy (is it new, or a repeat?), and retention (does the agent still have it later?). EFC is built from all four, so a harness can fail on any one of them.

Task demand — How much corrective signal a task actually needs to be solved. EFC divides feedback quality by task demand so harnesses can be compared fairly across easy and hard tasks — the same crisp feedback is worth more on a demanding task than a trivial one.
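To make the R² entry above concrete, here is a minimal sketch of the standard definition, one minus the ratio of residual to total sum of squares. The function name and the sample numbers are made up for illustration; none of them come from the paper.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: how far the predictions miss the observations.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: how far the observations spread around their mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values have no variance")
    return 1.0 - ss_res / ss_tot


# Illustrative success rates only, not data from the paper.
observed = [0.2, 0.4, 0.6, 0.8]
close_fit = [0.25, 0.35, 0.65, 0.75]   # tracks the observations closely
flat_fit = [0.5, 0.5, 0.5, 0.5]        # always predicts the mean

print(r_squared(observed, close_fit))
print(r_squared(observed, flat_fit))
```

The first call lands close to 1 because the predictions track the observations; the second lands at about 0 because a curve that always predicts the mean explains none of the variation.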

The news. On May 28, 2026, researchers posted an agent-harness scaling-law paper to arXiv introducing Effective Feedback Compute (EFC) — a metric that predicts agent success from the quality of feedback the harness returns, not the compute it spends. Plotted against EFC, harness-run success rates fit a clean scaling law (reported R²≈0.94–0.99 across datasets); plotted against raw compute, the same runs barely fit (R²≈0.33–0.42, rising to ~0.88 only with a hand-built multivariate baseline). In one controlled comparison, lifting feedback quality moved success from 0.27 to 0.90 with token cost and tool calls held fixed.

Picture two students prepping for the same exam. The first logs ten hours re-reading the textbook cover to cover — enormous effort, page after page. The second spends one hour with a sharp tutor who, after each practice problem, points at the exact line where the reasoning went wrong, confirms the fix is correct, never repeats a note already written down, and makes sure it lands in the margin for next time. On exam day the second student wins, and it is not close. The hours-logged number — the raw compute — told you almost nothing. The number that predicted the grade was how much useful correction actually got absorbed. That second number is what this paper names Effective Feedback Compute, and the claim is that agent harnesses behave the same way.

The mechanism is a re-definition of the x-axis. Instead of counting tokens or tool invocations, EFC measures the useful signal the harness feeds back each step — scored on four axes (informativeness, validity, non-redundancy, retention) — and then normalizes by task demand so a crisp correction counts for more on a hard task than an easy one. That normalized quantity becomes the horizontal axis of a scaling law that fits success rates across the paper's datasets. The practical reading for anyone building agents: the lever is not your reasoning budget but what your harness chooses to log and return after every tool call.
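One way to picture that re-defined x-axis is the sketch below. It is not the paper's formula: the class and function names are hypothetical, each axis is assumed to be a score between 0 and 1, and multiplying the four axes together is an assumption chosen so that a zero on any single axis wipes out the step's signal.

```python
from dataclasses import dataclass


@dataclass
class StepFeedback:
    # Hypothetical per-step scores in [0, 1]; the paper's rubric is not reproduced here.
    informativeness: float
    validity: float
    non_redundancy: float
    retention: float

    def useful_signal(self) -> float:
        # Assumption: the axes multiply, so failing one axis zeroes the step.
        return (self.informativeness * self.validity
                * self.non_redundancy * self.retention)


def effective_feedback_compute(steps, task_demand: float) -> float:
    """Total useful signal across all steps, normalized by task demand."""
    if task_demand <= 0:
        raise ValueError("task_demand must be positive")
    return sum(step.useful_signal() for step in steps) / task_demand


steps = [
    StepFeedback(0.9, 1.0, 1.0, 1.0),  # sharp, correct, new, retained
    StepFeedback(0.9, 0.0, 1.0, 1.0),  # equally detailed, but a false positive
]
print(effective_feedback_compute(steps, task_demand=3.0))
```

Under these assumptions the false-positive step contributes nothing, however detailed its message is, which matches the idea that a harness can fail on any one axis.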

This is why the raw-compute axis goes flat. A harness can burn an enormous budget returning low-quality feedback — a terse exit code 1 with no stack trace (low informativeness), a linter warning that is actually a false positive (low validity), the same "tests failed" string ten turns in a row (high redundancy), or an error the agent has already forgotten by the time it matters (low retention). All of that is real compute and real tool calls, and on the EFC axis it is worth almost nothing. The tutor who just says "study harder" for an hour spent the hour; the student learned nothing. Worse, in a long rollout the low-signal steps let compounding errors accumulate unchecked, so the spend actively buys you a longer path to the same failure.

Where the feedback gap actually comes from

Hold three variables fixed: one agent, one task, and one budget of 40 tool calls and ~120K tokens per run. The only difference between the two runs is the harness's feedback quality.

In Run A, every step returns a terse pass/fail string. Say each step carries about 0.1 units of useful, valid, non-redundant, retained signal, so over 40 steps the agent accumulates 40 × 0.1 = 4 units. The task demands roughly 30 units to solve, so EFC = 4 / 30 ≈ 0.13. That is low on the law's curve, near the 0.27 success rate the paper reports at the bottom of its range.

In Run B, the harness returns the failing assertion, the offending input, and a one-line diff each step. Call it 0.8 units per step: 40 × 0.8 = 32 units, so EFC = 32 / 30 ≈ 1.07, high on the curve and up near 0.90 success.

Same cost, same tool count, 8× the effective feedback. This is an illustrative decomposition calibrated to the paper's 0.27→0.90 and R² headline figures; the per-step unit values and the task-demand figure are stand-ins, not measured constants. The success jump is the headline; the per-call yield jump is the deeper story.
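Written out, the illustrative arithmetic looks like this. The symbols N (steps), s (useful signal per step), and D (task demand) are shorthand introduced here, not the paper's notation, and the values are the same stand-ins used above.

```latex
\[
\mathrm{EFC} = \frac{N \cdot s}{D}
\]
\[
\mathrm{EFC}_A = \frac{40 \times 0.1}{30} = \frac{4}{30} \approx 0.13,
\qquad
\mathrm{EFC}_B = \frac{40 \times 0.8}{30} = \frac{32}{30} \approx 1.07,
\qquad
\frac{\mathrm{EFC}_B}{\mathrm{EFC}_A} = \frac{32}{4} = 8
\]
```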

Scaling-law x-axis	What it counts	Fit to success (R²)
Raw compute	tokens + tool calls spent	~0.33–0.42 — poor (paper)
Multivariate compute baseline	several spend features combined	~0.88 — better, hand-built (paper)
Effective Feedback Compute (EFC)	4-axis feedback quality ÷ task demand	~0.94–0.99 — tight (paper)

A caveat worth stating plainly: this is a scaling-law fit on the paper's own datasets, and a tight fit is a strong correlation, not a guaranteed control knob. EFC is also harder to move than a token budget: "return better feedback" is a design problem, not a slider, and scoring the four axes reliably is itself non-trivial. The honest framing is that EFC gives you a yardstick and a direction: instrument the feedback your harness returns, A/B test candidate changes in shadow mode, and treat feedback quality as a first-class number alongside latency and cost. Whether the exact coefficients transfer to your stack is something you should measure, not assume.

Goes deeper in: AI Agents → Evals & Diagnostics → Error analysis first

Related explainers

PushBench — Quantitative Goal Persistence (QGP) — another harness-level number for long-horizon agent reliability
FutureSim — harness-level agent eval — why evaluating the harness, not the model alone, is the trend
Cursor Composer 2.5 — targeted textual feedback RL — the training-time analogue: a sharp, targeted correction beats a blunt end-of-rollout reward

FAQ

What is Effective Feedback Compute (EFC)?

EFC is a metric that predicts agent-harness success from the quality of the feedback the harness returns each step, rather than from the raw compute it spends. It scores feedback on four axes — informativeness, validity, non-redundancy, and retention — and normalizes by task demand so harnesses can be compared fairly across easy and hard tasks. Plotted against EFC, the paper reports success rates fitting a scaling law at R²≈0.94–0.99, far tighter than the ~0.33–0.42 fit against raw compute.

Why does feedback quality predict success better than raw compute?

A harness can spend an enormous budget returning low-quality feedback — terse pass/fail strings, false-positive warnings, repeated messages, or errors the agent has already forgotten. That is real compute that carries almost no useful signal, so the raw-compute axis goes nearly flat. EFC captures the signal that actually reaches the agent, which is why it fits success so much more tightly. In one controlled comparison, lifting only feedback quality moved success from 0.27 to 0.90 with token cost and tool-call counts held fixed.

How do I improve a harness's EFC in practice?

Treat the feedback your harness returns as a first-class design surface: make tool-call results localize the error (informativeness), verify the signal is correct before returning it (validity), suppress repeated or stale messages (non-redundancy), and persist corrections so they survive later in the rollout (retention). Because EFC is a measurable yardstick rather than a slider, the practical loop is to instrument the feedback you return, A/B candidate changes in shadow mode, and track feedback quality alongside latency and cost.
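As a rough sketch of what that design surface could look like in code, the hypothetical wrapper below sits between tool results and the agent's context and touches each axis once. Every name in it (FeedbackChannel, build, context_notes, the verified flag) is invented for illustration; it is not from the paper or from any particular agent framework.

```python
class FeedbackChannel:
    """Hypothetical layer that shapes tool results before the agent sees them."""

    def __init__(self):
        self._seen = set()      # messages already returned (non-redundancy)
        self.corrections = []   # persisted notes for later turns (retention)

    def build(self, tool_name, exit_code, stderr, verified=True):
        """Return the feedback string for one tool call, or None to suppress it."""
        if not verified:
            # Validity: drop signals that failed the harness's own check.
            return None
        lines = stderr.strip().splitlines()
        # Informativeness: surface the last error line, not just the exit code.
        detail = lines[-1] if lines else "no stderr captured"
        message = f"{tool_name} exited {exit_code}: {detail}"
        if message in self._seen:
            # Non-redundancy: an identical message carries no new signal.
            return None
        self._seen.add(message)
        self.corrections.append(message)
        return message

    def context_notes(self, limit=5):
        """Most recent corrections, for re-injection into later prompts."""
        return self.corrections[-limit:]


channel = FeedbackChannel()
stderr = "Traceback (most recent call last):\nAssertionError: expected 3, got 4\n"
print(channel.build("pytest", 1, stderr))
print(channel.build("pytest", 1, stderr))  # same failure again, so suppressed
print(channel.context_notes())
```

Suppressing a repeat outright is only one option; a real harness might instead return a short "same failure as the previous step" note so the agent still knows the call ran. Which choice scores better is the kind of thing to measure rather than assume.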

Originally posted on Learn AI Visually.
