Juan Torchia

Posted on • Originally published at juanchi.dev
How They Broke the Top AI Agent Benchmarks — and What That Says About My Stack

In 2005, when I was running a cyber café at 14, I learned something no manual ever taught me: metrics lie when you measure what's easy to measure instead of what matters. The owner wanted a weekly report on "machines used per hour." The number looked perfect every Friday. And every two months the entire network would go down anyway, because nobody was measuring connection quality — only quantity. Reading about how researchers shattered the top AI agent benchmarks today, I keep thinking about those beautiful, completely useless reports.

The paper I couldn't ignore about AI agent benchmarks

379 points on Hacker News. That doesn't happen with just anything. The paper documents how researchers made agents that dominated reference benchmarks — SWE-bench, WebArena, and others the ecosystem treats as the gold standard — completely collapse under minimal modifications to the evaluation environment.

We're not talking about elaborate jailbreaks or sophisticated prompt injection. We're talking about things like:

  • Renaming variables in the test repository
  • Adding README files with slightly contradictory information
  • Changing the order of tests without touching the logic
  • Introducing new dependencies that don't affect the expected outcome

The benchmark breaks. The score tanks. The agent that was "solving" 45% of SWE-bench issues suddenly solves 12%.

And here's the part that hit me like a bucket of cold water: that's not a bug in the benchmark. That's the benchmark finally working correctly for the first time.

What the original benchmarks were measuring wasn't problem-solving capability. They were measuring memorization of the evaluation environment dressed up as reasoning.
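The perturbations in that list are simple enough to script yourself. Here's a minimal sketch of what such a check could look like; `Task`, the perturbation functions, and the overfit stand-in agent are all illustrative inventions of mine, not from the paper:

```typescript
// Hypothetical perturbation harness: run the same tasks through a
// meaning-preserving transformation and compare scores.

interface Task {
  readme: string;
  testOrder: string[];
  solve: (t: Task) => boolean; // stand-in for "the agent attempts the task"
}

type Perturbation = (t: Task) => Task;

// Reorder-style perturbation: shuffle test order without touching logic
const reverseTestOrder: Perturbation = (t) => ({
  ...t,
  testOrder: [...t.testOrder].reverse(),
});

// Contradiction-style perturbation: append a misleading README line
const poisonReadme: Perturbation = (t) => ({
  ...t,
  readme: t.readme + "\nNOTE: the config lives in src/legacy/ (it does not).",
});

function scoreUnder(tasks: Task[], perturb: Perturbation): number {
  const solved = tasks.filter((t) => t.solve(perturb(t))).length;
  return solved / tasks.length;
}

// An agent that overfit to the ordering of the evaluation environment
const brittleAgent = (t: Task) => t.testOrder[0] === "test_setup";

const tasks: Task[] = [
  { readme: "", testOrder: ["test_setup", "test_main"], solve: brittleAgent },
];

console.log(scoreUnder(tasks, (t) => t));         // perfect on the original environment
console.log(scoreUnder(tasks, reverseTestOrder)); // collapses after a harmless reorder
```

If the gap between those two numbers is large, you're looking at memorization, not capability.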

Where my own agents would have broken

A few weeks ago I wrote about Research-Driven Agents — the idea that an agent that reads before it codes produces more reliable results. I still stand by that. But reading this paper forced me into an uncomfortable exercise: what happens if I apply the same breaking techniques to my own setup?

My current architecture for research and code-generation agents looks roughly like this:

// Simplified structure of the Research-Driven Agent pipeline
interface ResearchAgentConfig {
  // The agent reads context first, then acts
  researchPhase: {
    maxTokensContext: number;        // how much context it can process
    sourceValidation: boolean;       // does it verify the sources it uses?
    contradictionDetection: boolean; // does it detect contradictory info?
  };
  actionPhase: {
    groundingRequired: boolean;      // does every action need justification from context?
    rollbackCapability: boolean;     // can it undo if it detects an error?
  };
}

// What I ACTUALLY had configured (brutally honest)
const myCurrentConfig: ResearchAgentConfig = {
  researchPhase: {
    maxTokensContext: 8000,
    sourceValidation: false,         // this is where they'd break me
    contradictionDetection: false,   // this too
  },
  actionPhase: {
    groundingRequired: true,         // this was fine
    rollbackCapability: false,       // this was a problem
  },
};

Those two false values in researchPhase are exactly the attack vector the paper describes. If you feed the agent contradictory context — a README that says one thing and tests that expect another — it has no mechanism to detect the contradiction. It picks a source arbitrarily (almost always the most recent one in context) and charges forward with full confidence.

In a benchmark, that shows up as a low score. In production, it shows up as a PR that looks reasonable but is built on a wrong assumption. And as I learned when reviewing those vibe-coded PRs — the problem isn't that the AI got it wrong. The problem is that I was approving them.

The three failure patterns I now actively look for

1. Overfitting to the evaluation environment

The most documented pattern in the paper. The agent learns the specific patterns of the benchmark — file names, repository structure, test format — and optimizes for those patterns instead of the underlying problem.

In my agents this shows up as scaffolding dependency. If the agent always works with repos structured the same way (which happens when you use the same templates over and over), it starts assuming that structure instead of inferring it.

// Common trap: the agent assumes structure instead of exploring it
// Common trap: the agent assumes structure instead of exploring it
async function analyzeRepository(path: string) {
  // BAD: assuming this file always exists (throws on any repo
  // that doesn't match the template):
  // const config = await readFile(`${path}/src/config/index.ts`);

  // GOOD: explore the actual structure before acting
  const structure = await exploreTree(path, { depth: 3 });
  const configFile = findLikelyConfig(structure);

  if (!configFile) {
    // handle the absence explicitly
    return { error: 'unrecognized_structure', structure };
  }

  return await readFile(configFile);
}

2. Process metrics vs. outcome metrics

This one hurt more because it's the cyber café mistake, twenty years later.

Agent benchmarks frequently measure whether the agent executed the right steps — called the right tool, generated the expected format, completed the sequence in order. They don't measure whether the result is correct in a robust sense.

My own dashboards had the same problem. I was measuring "task completion rate" (did the agent finish without errors?) instead of "output correctness rate" (is the result valid under minimal perturbation?).
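To make that distinction concrete, here's the two metrics side by side. The `RunResult` shape is my own invention for illustration, not a real API:

```typescript
// Illustrative split between "it finished" and "it's actually right".
interface RunResult {
  finishedWithoutError: boolean;  // what my old dashboard counted
  outputValidOriginal: boolean;   // output checked against the original input
  outputValidPerturbed: boolean;  // output checked after a minimal perturbation
}

// Task completion rate: did the agent finish without crashing?
function completionRate(runs: RunResult[]): number {
  return runs.filter((r) => r.finishedWithoutError).length / runs.length;
}

// Output correctness rate: is the result still valid under minimal perturbation?
function correctnessRate(runs: RunResult[]): number {
  return (
    runs.filter((r) => r.outputValidOriginal && r.outputValidPerturbed).length /
    runs.length
  );
}
```

Two agents can have identical completion rates and wildly different correctness rates. That gap is exactly what the benchmarks were hiding.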

This connects directly to something I touched on in the context of contributing to the Linux kernel with AI: the kernel has human reviewers who do exactly this — they try to break the code with edge cases before accepting it. AI agents still don't have that adversary built in.

3. Context poisoning with no detection

The most dangerous one in production. If the agent processes external sources — documentation, issues, previous PRs — and any of those sources has incorrect or outdated information, the agent incorporates it without flagging it.

// Basic contradiction detection system for context
interface ContextSource {
  content: string;
  timestamp: Date;
  confidence: 'high' | 'medium' | 'low';
  origin: 'official_docs' | 'issue' | 'pr' | 'readme' | 'test';
}

interface DetectedContradiction {
  source_a: ContextSource;
  source_b: ContextSource;
  description: string;
  recommendation: 'use_source_a' | 'use_source_b';
}

async function detectContradictions(
  sources: ContextSource[]
): Promise<DetectedContradiction[]> {
  const contradictions: DetectedContradiction[] = [];

  // Trust hierarchy: tests > official docs > README > PRs > issues
  // If a lower-hierarchy source contradicts a higher one, flag it
  const hierarchy = {
    'test': 4,
    'official_docs': 3,
    'readme': 2,
    'pr': 1,
    'issue': 0
  };

  for (let i = 0; i < sources.length; i++) {
    for (let j = i + 1; j < sources.length; j++) {
      // compareSemantics is a stand-in for a semantic comparison step
      // (e.g. an LLM call) returning { contradiction, confidence, description }
      const similarity = await compareSemantics(sources[i].content, sources[j].content);

      if (similarity.contradiction && similarity.confidence > 0.8) {
        contradictions.push({
          source_a: sources[i],
          source_b: sources[j],
          description: similarity.description,
          // the higher-hierarchy source wins, but we log the conflict
          recommendation: hierarchy[sources[i].origin] > hierarchy[sources[j].origin]
            ? 'use_source_a'
            : 'use_source_b'
        });
      }
    }
  }

  return contradictions;
}

This isn't rocket science, but it requires actively thinking of the agent as a system that can be poisoned — not just a system that can make mistakes.

The mistakes I see in typical agent stacks

Evaluating in the same environment where you tune your prompt. If you're refining the agent's prompt on the same examples you use to measure it, you're recreating exactly the overfitting the paper describes. The benchmark and the agent train together and nobody notices.

Measuring latency and cost but not robustness. The dashboard has p95 response time, cost per token, API error rate. It doesn't have "what happens if the input has an unexpected empty field?". That's not a monitoring problem — it's a problem of what decisions you make with the metrics you do have.

Assuming more context is always better. The paper documents cases where giving the agent more repository information worsened performance because it introduced contradictory noise. More context without filtering is context poisoning in slow motion.
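The alternative is an explicit filter before anything reaches the prompt. A hypothetical trust-and-budget selector, with made-up types, might look like this:

```typescript
// Illustrative pre-filter: prefer high-trust sources under a token budget
// instead of stuffing everything into the context window.
interface Snippet {
  text: string;
  trust: number;  // higher = more trusted (e.g. tests > docs > issues)
  tokens: number; // estimated token cost of including this snippet
}

function selectContext(snippets: Snippet[], budget: number): Snippet[] {
  const picked: Snippet[] = [];
  let used = 0;
  // Highest-trust first; skip what doesn't fit rather than truncating blindly
  for (const s of [...snippets].sort((a, b) => b.trust - a.trust)) {
    if (used + s.tokens > budget) continue;
    picked.push(s);
    used += s.tokens;
  }
  return picked;
}
```

The point isn't this particular heuristic; it's that inclusion becomes a decision you can inspect, instead of "everything that fit."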

This applies to broader infrastructure questions too — when France migrates to Linux or when we think about the future of Git with agents, the underlying problem is the same: what guarantees do we have that the system behaves under conditions we didn't anticipate?

FAQ: AI agent benchmarks and what they actually measure

What exactly are the most widely used AI agent benchmarks?
SWE-bench is the most cited — it measures whether an agent can resolve real GitHub issues in well-known Python repositories. WebArena measures web navigation and task completion. HumanEval measures code generation against unit tests. The common problem: they all measure performance in fixed, known environments, not robustness under variation.

Why can an agent scoring 45% on SWE-bench drop to 12% with minimal changes?
Because the agent learned patterns from the specific evaluation environment — repository structure, file names, test format — not the general problem of "fixing a bug." When you change those patterns without changing the problem, the agent loses the anchor it was using to navigate.

Does this invalidate benchmarks as a tool?
It doesn't invalidate them, it recontextualizes them. A benchmark is still useful for comparing models under identical controlled conditions. The mistake is interpreting it as a proxy for real production capability. They're thermometers calibrated for a specific range — not for every kind of fever.

How do I evaluate the robustness of my own agents without a research lab?
Three accessible techniques: input perturbation (change variable names, field order, format of expected responses), contradiction injection (add slightly incorrect information to the context and measure whether the agent detects it or incorporates it without question), and simple adversarial evaluation (have someone who didn't build the agent try to break it with reasonable but unusual inputs).

Are newer models immune to this problem?
No. The paper includes frontier models — GPT-4o, Claude 3.5, Gemini 1.5 — and all of them show degradation under perturbation. The difference is magnitude, not presence. Larger models degrade less sharply but they still degrade.

What's the first concrete change I should make in my stack?
Separate the prompt development environment from the evaluation environment. If you're tuning an agent on the same examples you measure it against, start there. Second: add at least one minimal perturbation test to your CI pipeline — an input that's slightly different from the happy path but equally valid. If the agent fails there, the problem runs deeper than the prompt.
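That CI test can be as small as comparing the agent's answer on an input against a meaning-preserving rewrite of it. A sketch, where `Agent` and the JSON-reordering perturbation are illustrative stand-ins:

```typescript
// Hypothetical CI check: the agent's answer must not change when the
// input is rewritten in a meaning-preserving way.
type Agent = (input: string) => string;

// Meaning-preserving perturbation: reorder the keys of a JSON input
function perturbInput(input: string): string {
  const parsed = JSON.parse(input);
  const reordered = Object.fromEntries(Object.entries(parsed).reverse());
  return JSON.stringify(reordered);
}

function survivesPerturbation(agent: Agent, input: string): boolean {
  return agent(input) === agent(perturbInput(input));
}

// A robust agent parses the input; a brittle one keys off the raw string
const robust: Agent = (s) => String(JSON.parse(s).task);
const brittle: Agent = (s) => (s.startsWith('{"task"') ? "ok" : "confused");

const input = '{"task":"fix-bug","repo":"demo"}';
console.log(survivesPerturbation(robust, input));  // passes the check
console.log(survivesPerturbation(brittle, input)); // fails: it memorized the format
```

One assertion like this in CI catches the format-anchoring failure mode before a benchmark score ever has the chance to flatter it.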

The uncomfortable conclusion

The problem isn't the broken benchmarks. The problem is that we were using them as an excuse not to think about robustness.

When an agent hits 67% on SWE-bench, that becomes the sales pitch, the adoption criterion, the reason to build on top of it. Nobody asks "67% under what conditions?". Nobody asks what happens when conditions change even a little.

I did exactly that. I chose tools and designed pipelines partially based on benchmark scores that I now know were fragile. That wasn't negligence — it was lack of information and, I'll be honest, a bit of epistemic laziness. It's more comfortable to trust the number than to design your own break tests.

What I changed in my stack after reading the paper: I added a contradiction detection step to the context processing in my research agents, separated development examples from evaluation examples, and started measuring "robustness under minimal perturbation" alongside the completion metrics I already had.

It's not a complete solution. It's an honest start.

Do you already have any robustness tests in your agents, or are you also only measuring what's easy to measure? Reach out — I'm genuinely curious whether anyone has found a systematic way to do this that doesn't require a full research team.
