Papers Mache

AI agent logs expose reproducibility gaps

Across dozens of repeated executions, the same autonomous agent can flip from success to failure on an identical task, and its measured success rate can swing by a noticeable margin between runs. The swing is not uniform; it widens dramatically on web‑navigation tasks, exposing a gap between headline scores and day‑to‑day reliability.

Historically, progress reports have leaned on single‑run leaderboards: a model that solves a benchmark once is declared “state‑of‑the‑art.” Few works have logged the entire interaction history of developers or systematically replayed the same task under identical conditions.

The SWE‑chat corpus of 6,000 real‑world coding sessions shows how fragile that assumption is. “Less than half (44.3%) of all agent‑produced code survives into user commits (Table 3)” [1]. Moreover, “Overall, users push back after 39% of turns, regardless of coding mode” [1], indicating frequent manual corrections and interruptions even when the agent is nominally competent.

A complementary study of computer‑use agents confirms the phenomenon on a different front. The authors observe that “yet even when the task and model are unchanged, an agent that succeeds once may fail on a repeated execution of the same task” [2]. By replaying each OSWorld task three times, they compute Pass^k scores together with McNemar and Wilcoxon significance tests, which reveal statistically significant regressions for certain models (e.g., Qwen) while others improve (OpenCUA, UI‑TARS‑1.5). Crucially, “We find that clarification leads to consistent improvements across models, with more tasks transitioning from not reliably solved to reliably solved than the reverse (Figure 3)” [2], pointing to ambiguous specifications as a key instability source.
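To make the replay metric concrete, here is a minimal sketch of the standard Pass^k estimator: the probability that k independent runs of a task all succeed, estimated from c observed successes over n recorded runs. The exact aggregation used in [2] may differ; the three-run example simply mirrors the OSWorld setup described above.

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that k independent runs of the same task
    all succeed, given c observed successes over n recorded runs.
    Standard Pass^k estimator: C(c, k) / C(n, k)."""
    if not 0 <= c <= n or not 0 < k <= n:
        raise ValueError("require 0 <= c <= n and 0 < k <= n")
    return comb(c, k) / comb(n, k)

# A task replayed n=3 times with 2 successes: the single-run score looks
# fine, but the task is not reliably solved.
print(pass_hat_k(3, 2, 1))  # ~0.667 -> average single-run success rate
print(pass_hat_k(3, 2, 3))  # 0.0    -> fails the "all three runs pass" bar
```

For the significance comparisons, off-the-shelf implementations such as `scipy.stats.wilcoxon` and statsmodels' `mcnemar` cover the same test families, though the paper's exact test configuration is not reproduced here.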

Both papers acknowledge constraints that temper the universality of their numbers. SWE‑chat captures only open‑source developers who opt into logging, and its “vibe coding” vs. “human‑only” split may not reflect enterprise workflows. The reliability study limits its variance assessment to three runs per task and to the OSWorld sandbox; stochasticity in larger, longer‑running deployments could manifest differently. Moreover, the reported gains from clarification assume a human‑in‑the‑loop that can disambiguate prompts on the fly.

For teams eyeing production‑grade agents, the takeaway is to treat stability as a first‑class metric, not an afterthought. Incorporate repeated‑run suites into CI pipelines, report Pass^k alongside accuracy (with McNemar or Wilcoxon tests when comparing model versions), and automate clarification dialogs where task intent is vague. Benchmarks that reward a single peak score risk overlooking the very variance that will surface once the agent is handed a real user’s keyboard. Monitoring these signals early can prevent costly rollbacks when an “improved” model suddenly drops from reliable to flaky under unchanged conditions.
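As a rough illustration of that recommendation, the sketch below replays each task several times inside a test suite and gates on mean Pass^k. Here `run_agent` is a hypothetical stand-in for whatever single-attempt entry point your harness exposes, and the task names and thresholds are placeholders.

```python
from math import comb
from statistics import mean

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Pass^k estimator: chance that k i.i.d. runs all succeed, from c successes out of n."""
    return comb(c, k) / comb(n, k)

def repeated_run_gate(tasks, run_agent, n_runs=3, k=3, min_mean_pass_k=0.8):
    """Replay every task n_runs times, estimate Pass^k per task, and fail the
    gate if the suite-level mean drops below min_mean_pass_k.
    `run_agent(task) -> bool` is a hypothetical harness entry point."""
    scores, flaky = [], []
    for task in tasks:
        successes = sum(bool(run_agent(task)) for _ in range(n_runs))
        scores.append(pass_hat_k(n_runs, successes, k))
        if 0 < successes < n_runs:  # sometimes passes, sometimes fails
            flaky.append(task)
    suite = mean(scores) if scores else 0.0
    return {"mean_pass^k": suite, "flaky_tasks": flaky, "ok": suite >= min_mean_pass_k}

# Toy usage with a stochastic stand-in for a real agent:
import random
report = repeated_run_gate(["open-browser", "rename-file"], lambda task: random.random() < 0.7)
print(report)  # e.g. {'mean_pass^k': 0.5, 'flaky_tasks': ['open-browser'], 'ok': False}
```

Tracking the `flaky_tasks` list over time is often more actionable than the aggregate score, since it names the exact tasks whose behavior changed between otherwise identical runs.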

References

  1. SWE-chat: Coding Agent Interactions From Real Users in the Wild
  2. On the Reliability of Computer Use Agents
