
Evangelos Pappas

Originally published at Medium

A protocol for auditing AI agent harnesses

I have been building coding agents for the last several months and watching every component I added fail to move the resolve rate I cared about. A verifier first. Multi-candidate sampling next. A structured-output sub-agent after that. Each was justified by a specific observed failure mode, and each looked cheap at the margin. None of them helped. The Tsinghua paper *Natural-Language Agent Harnesses*, run on SWE-bench Verified with GPT-5.4 at high reasoning, explains the loss directly: a same-model verifier on top of a baseline coding agent regresses task success on OSWorld by 8.4 percentage points, and multi-candidate sampling regresses it by 5.6 points. Both lose for the same structural reason: the verifier and the proposer are the same model as the doer. They share its training distribution, its priors, its failure modes. When the doer is confidently wrong, the verifier endorses the wrong output with the same confidence. The check does not catch errors. It approves them.
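The mechanism is easy to reproduce in a toy model. Everything below is my own illustration, not anything from the paper: a verifier that merely mirrors the doer's judgement approves the doer's wrong answers, so gating on it leaves accuracy where it started, while a verifier carrying an independent error signal raises the accuracy of the approved set.

```python
import random

def gated_accuracy(trials, doer_acc, check_acc, mirror_prob, rng):
    """Accuracy of the outputs a verifier approves, in a toy model.

    The doer is right with probability doer_acc. With probability
    mirror_prob the verifier simply mirrors the doer's own judgement
    (same priors, same blind spots, so it approves everything the doer
    produced); otherwise it applies an independent check that reaches
    the right verdict with probability check_acc.
    """
    approved = correct_approved = 0
    for _ in range(trials):
        correct = rng.random() < doer_acc
        if rng.random() < mirror_prob:
            approve = True  # shared blind spot: endorses its own answer
        else:
            approve = correct if rng.random() < check_acc else not correct
        if approve:
            approved += 1
            correct_approved += correct
    return correct_approved / approved

rng = random.Random(0)
# mirror_prob=1.0: same-model verifier, approved-set accuracy stays near doer_acc
print(gated_accuracy(100_000, 0.60, 0.80, 1.0, rng))
# mirror_prob=0.0: independent verifier, approved-set accuracy rises well above it
print(gated_accuracy(100_000, 0.60, 0.80, 0.0, rng))
```

The point of the sketch is the correlation parameter, not the specific numbers: as `mirror_prob` goes to 1, the verifier stops contributing information and the gate becomes a rubber stamp.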

The pattern generalises. Three papers from late March 2026 explain the failure mode, the rule that follows from it, and the audit you can run on a harness. Read in the right order, they form a three-layer protocol that catches the verifier failure mode first, then predicts the rest of the ablation table without reference to the numbers.


tl;dr

  • Tsinghua's NLAH ablation, controlled at the module level: verifiers regress accuracy by 8.4pp on OSWorld; multi-candidate search by 5.6pp. Both lose for the same structural reason: they recycle the doer's blind spots.
  • The whole table follows from one rule. Harness modules that introduce a new signal win; modules that recycle the doer's signal lose. The rule predicts every row without reference to the numbers.
  • Fudan's AHE turns ablation into an edit-level audit. Each edit ships a manifest of predicted fixes and predicted regressions; the next iteration verifies it against task-level deltas; misses revert in git. Fix-precision is 33.7% (5x random). Regression-precision is 11.8% (2x random), and that asymmetry is the methodology's open problem.
  • Stanford's Meta-Harness is upstream of both. The proposer fed raw failure traces hits 50.0% search-set accuracy; fed LLM summaries it hits 34.9%, statistically the same as scores only. Trace compression destroys roughly 15pp of optimisation signal.
  • The protocol composes them by dependency: L0 trace utility first, then L1 module ablation, then L2 manifest verification on every subsequent edit. Get L0 wrong and L1/L2 collapse on degraded ground truth.
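As a sketch of the L2 step, here is what an edit-level manifest audit could look like in practice. The schema, function names, and revert policy below are my assumptions for illustration, not AHE's actual API: each harness edit predicts which tasks it fixes and which it risks regressing, and the next evaluation diffs task-level pass/fail results to score the manifest.

```python
from dataclasses import dataclass

@dataclass
class EditManifest:
    """Shipped alongside a harness edit (hypothetical schema)."""
    edit_id: str
    predicted_fixes: set        # task ids expected to flip fail -> pass
    predicted_regressions: set  # task ids at risk of flipping pass -> fail

def audit(manifest, before, after):
    """Score a manifest against observed task-level deltas.

    `before` / `after` map task id -> passed (bool) from the runs
    preceding and following the edit.
    """
    flipped_up = {t for t, ok in after.items() if ok and not before.get(t, False)}
    flipped_down = {t for t, ok in before.items() if ok and not after.get(t, False)}
    fix_prec = (len(manifest.predicted_fixes & flipped_up)
                / len(manifest.predicted_fixes)) if manifest.predicted_fixes else 0.0
    reg_prec = (len(manifest.predicted_regressions & flipped_down)
                / len(manifest.predicted_regressions)) if manifest.predicted_regressions else 0.0
    # Assumed revert policy: keep the edit only if it fixed something it
    # predicted AND caused no regression it failed to predict.
    keep = fix_prec > 0 and not (flipped_down - manifest.predicted_regressions)
    return fix_prec, reg_prec, keep
```

Averaged across edits, `fix_prec` and `reg_prec` correspond to the fix-precision and regression-precision figures quoted above, on my reading; the asymmetry between them shows up here as `predicted_regressions` rarely intersecting `flipped_down`.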

Full article here >
