- Book: LLM Observability Pocket Guide
- Also by me: Prompt Engineering Pocket Guide
- My project: Hermes IDE | GitHub (an IDE for developers who ship with Claude Code and other AI coding tools)
- Me: xgabriel.com | GitHub
On February 5, 2026, Anthropic announced Claude Opus 4.6 with a SWE-bench Verified score of 80.84% (per Anthropic's announcement). The number leads the press cycle. Now picture pointing the same model at a multi-thousand-line refactor inside the codebase you actually maintain, running it agentically, and watching it propose a patch that passes a couple of your internal tests while breaking callers it never opened.
Both things are true. The model is genuinely stronger than 4.5 on real coding work. And 80.84% on SWE-bench Verified does not mean the model solves 80% of the work in your repo. The benchmark measures something narrower than the headline implies, and the gap between "narrower" and "useless to you" depends on parts of the metric that nobody puts in the announcement.
What SWE-bench Verified actually is
The original SWE-bench paper (Jimenez et al., ICLR 2024) collected 2,294 real GitHub issues from twelve popular Python repositories: Django, sympy, scikit-learn, Flask, and friends. Each task pairs an issue description with the repository state before the fix and a hidden test suite that validates the fix. The model is given the issue, given the repository, and asked to produce a patch. The patch passes if the hidden tests pass.
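Concretely, a task record is roughly this shape. The field names are paraphrased rather than copied from the public dataset's schema, and the values are placeholders:

# Roughly the shape of one SWE-bench task. Field names are paraphrased
# and the values are illustrative, not copied from the dataset.
task = {
    "repo": "django/django",  # one of the twelve source projects
    "base_commit": "<sha of the repo state before the maintainer's fix>",
    "problem_statement": "<the GitHub issue text, verbatim>",
    "hidden_tests": ["<tests that must flip from failing to passing>"],
}
# The model gets the issue and the repo checked out at base_commit;
# the patch passes iff the hidden tests go green.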
SWE-bench Verified is a 500-task subset of the original, filtered by OpenAI's Preparedness team to remove tasks with broken or under-specified test cases. It is the benchmark every frontier-model release reports against. Anthropic's 80.84% on Opus 4.6, the mid-eighties scores on competing models per the public leaderboards, the higher number Anthropic later reported for Opus 4.7: all from the same 500 tasks.
Three things the score hides.
Hide #1: Repository selection bias
The 12 repositories in SWE-bench were chosen because they're popular Python projects with high-quality test suites and well-formed issues. That filter does heavy work. These repositories have:
- Mature, well-typed APIs that change slowly.
- Issue templates that force reproducers and expected behavior.
- Maintainers who write small, focused fixes; most Verified patches touch one file with a few lines changed.
- Years of training data covering their internals (Django and sympy date to 2005 and 2007).
Compare to the codebase you actually maintain. Hand-rolled framework, idiosyncratic conventions, no public training corpus, issues that say "broken on staging, see Slack." SWE-Bench Pro was built specifically because Verified's repo bias was inflating scores. Per its writeup, Pro spans more repos and more languages and demands coordinated changes across multiple files, and models that look saturated on Verified drop sharply on Pro.
The takeaway isn't that 80.84% is fake. It's that the score generalizes to repositories that look like Django. If your codebase doesn't look like Django, the score isn't telling you what you think it's telling you.
Hide #2: Test-driven evaluation rewards test-aware patches
SWE-bench grades by running the hidden test suite. A patch passes if the tests turn green; it fails otherwise. That sounds objective, and it is, on the dimension being measured. The dimension being measured is "produces a patch that passes the maintainer's tests," which is different from "produces a patch that's correct."
Why the distinction matters. Models that have seen these repositories during training (or seen close analogs of these issue patterns) can produce patches optimized for the test signal: minimal change, narrowly scoped, preserves the public surface the tests touch. They look great on the rubric and may quietly miss invariants the tests don't cover. The behavior is rational. The model is being scored by tests, so the model writes for the tests.
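A toy version of the gap, with hypothetical code that isn't from any benchmark task:

import re

# The only hidden test the patch is graded against:
def test_slug_strips_spaces(slugify):
    assert slugify("Release Notes 2026") == "release-notes-2026"

# Patch A: written for the test. Green, but wrong for anything the
# test doesn't exercise (punctuation, repeated spaces, leading dashes).
def slugify_narrow(title: str) -> str:
    return title.lower().replace(" ", "-")

# Patch B: also green, and it holds up on inputs nobody wrote a test for.
def slugify_robust(title: str) -> str:
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

# The grader can't tell them apart:
test_slug_strips_spaces(slugify_narrow)
test_slug_strips_spaces(slugify_robust)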
OpenAI acknowledged contamination concerns when they built Verified and have since stopped reporting Verified scores in favor of Pro. The SWE-Bench Pro writeup raises further contamination concerns about Verified: enough public exposure of the underlying issues and patches that scores can drift up without the underlying work getting easier. The score still climbs. Real capability accounts for some of that. Some is the model learning the test, not the work.
In your repo, the tests cover what you remembered to cover. A green suite is proof you didn't break the cases you anticipated, which is a much weaker statement than "the patch is correct." The model scoring 80% on Verified is being graded by maintainers who wrote tests against the patch they expected; nobody wrote your tests against a patch a 2026 model might draft instead.
Hide #3: Latency and cost are not in the metric
A SWE-bench result is binary per task: pass or fail. The benchmark ignores how long the model thought, how many tokens it burned, how many tool calls it made, and whether it retried after a failed test run. Anthropic's announcement reports the headline number under high-effort thinking conditions. Max-effort thinking is expensive. As a rough estimate against published Anthropic Opus 4.6 token pricing, a single SWE-bench-style task with extended thinking can cost meaningful fractions of a dollar in API spend and run for minutes of wall-clock time.
In a benchmark, that's invisible. In your day, that's the difference between Claude Code feeling instant and feeling like you're waiting on CI. The Vellum analysis of Opus 4.7 benchmarks flags this directly: peak-score configurations and production defaults aren't the same model. The score you read assumed a budget you may not have.
The corollary is that "Opus 4.6 hit 80.84%" tells you the upper end of capability, not the operating point. If you're running coding agents in your IDE with conservative thinking budgets, you're getting a different model than the one in the press release. Treating one benchmark number as the verdict on "is this thing good for my work" is the actual defect.
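If you want to know what your own operating point costs, the arithmetic fits in a short helper. The rates below are deliberately zeroed placeholders; fill in the current per-million-token prices from Anthropic's pricing page.

# Placeholder rates: substitute the current per-million-token prices
# from Anthropic's pricing page. These are NOT real numbers.
INPUT_USD_PER_MTOK = 0.0
OUTPUT_USD_PER_MTOK = 0.0

def task_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Rough per-task API cost. On Anthropic's billing, extended-thinking
    tokens count toward output, so a high thinking budget shows up here."""
    return (input_tokens * INPUT_USD_PER_MTOK
            + output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000

Multiply by the number of tasks and retries your agent actually makes, and the column the benchmark drops becomes a line item.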
A small eval harness for your own repo
The fix is boring and effective: build your own eval. Point it at one repository you actually maintain. Pick five to twenty closed pull requests that fixed real bugs or added small features. Reset the repo to the parent commit, hand the issue text to the model, and check whether the model's patch makes the test suite green.
The harness below is intentionally minimal. It doesn't try to compete with SWE-bench's infrastructure. It tries to give you one number that's about your code.
Start with the shape of the data. Each task is one closed PR captured at its parent commit, with the issue text and the test command you want to grade against. The Result mirrors it on the way out, plus the wall-clock and token cost the public benchmark drops on the floor.
# eval_harness.py: minimal repo-local SWE-bench-shaped eval.
# Design notes (no fabricated numbers):
# - Pick 5-20 closed PRs that fixed scoped bugs.
# - For each PR: capture parent_sha, issue_text, test_cmd.
# - Reset, prompt the model, apply the patch, run tests.
# - Record pass/fail, wall-clock seconds, output tokens.
import json
import subprocess
import time
from dataclasses import dataclass, asdict
from pathlib import Path

from anthropic import Anthropic

client = Anthropic()
MODEL = "claude-opus-4-6"


@dataclass
class Task:
    pr_number: int
    parent_sha: str
    issue_text: str
    test_cmd: str
    touched_files: list[str]


@dataclass
class Result:
    pr_number: int
    passed: bool
    seconds: float
    output_tokens: int
    notes: str
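Populating tasks.json is mostly git archaeology. Here is a sketch of one way to do it, assuming you have the merge (or squash) commit SHA of each PR you picked; the helper name is mine, not part of any SWE-bench tooling:

def task_from_merge_commit(repo: Path, pr_number: int, merge_sha: str,
                           issue_text: str, test_cmd: str) -> Task:
    # First parent of the merge commit is the repo state before the fix landed.
    parent = subprocess.run(
        ["git", "rev-parse", f"{merge_sha}^"],
        cwd=repo, check=True, capture_output=True, text=True,
    ).stdout.strip()
    # Files the real fix touched become the hint the prompt hands the model.
    files = subprocess.run(
        ["git", "diff", "--name-only", parent, merge_sha],
        cwd=repo, check=True, capture_output=True, text=True,
    ).stdout.splitlines()
    return Task(pr_number=pr_number, parent_sha=parent,
                issue_text=issue_text, test_cmd=test_cmd,
                touched_files=files)

Paste the issue text in by hand, dump the results through asdict into tasks.json, and setup is done.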
Next, the three things the harness actually does on each task: reset the repo to the parent commit, prompt the model with just the issue text and a short list of likely-relevant files, then apply the diff and run the tests. The prompt is deliberately thin — no agent loop, no tool use — so the score reflects the raw single-shot capability before you bolt scaffolding on top.
def reset_repo(repo: Path, sha: str) -> None:
    subprocess.run(["git", "reset", "--hard", sha],
                   cwd=repo, check=True)


def ask_model(task: Task, repo: Path) -> tuple[str, int]:
    listing = "\n".join(task.touched_files)
    prompt = f"""<instructions>
You fix bugs in this repository. Output only a unified diff.
</instructions>
<repo_files_likely_relevant>
{listing}
</repo_files_likely_relevant>
<issue>
{task.issue_text}
</issue>
<output_format>
A unified diff (git apply compatible). No prose.
</output_format>"""
    r = client.messages.create(
        model=MODEL,
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )
    return r.content[0].text, r.usage.output_tokens


def apply_and_test(repo: Path, diff: str,
                   test_cmd: str) -> bool:
    patch_file = repo / ".eval.patch"
    patch_file.write_text(diff)
    apply = subprocess.run(
        ["git", "apply", str(patch_file)],
        cwd=repo, capture_output=True,
    )
    if apply.returncode != 0:
        return False
    test = subprocess.run(test_cmd, cwd=repo,
                          shell=True, capture_output=True)
    return test.returncode == 0
Finally, the driver — iterate the task list, time each run, and write the results to disk. Keep tasks.json in version control so the next time you grade a new model you're grading it against the same workload.
def run(repo: Path, tasks: list[Task]) -> list[Result]:
    results = []
    for t in tasks:
        reset_repo(repo, t.parent_sha)
        start = time.monotonic()
        diff, tokens = ask_model(t, repo)
        passed = apply_and_test(repo, diff, t.test_cmd)
        results.append(Result(
            pr_number=t.pr_number,
            passed=passed,
            seconds=time.monotonic() - start,
            output_tokens=tokens,
            notes="" if passed else "diff failed or tests red",
        ))
    return results


if __name__ == "__main__":
    repo = Path("/path/to/your/repo")
    tasks = [Task(**t) for t in
             json.loads(Path("tasks.json").read_text())]
    out = [asdict(r) for r in run(repo, tasks)]
    Path("results.json").write_text(json.dumps(out, indent=2))
The harness is honest about its limits. It runs against a repo you maintain, with PRs you picked, with tests you trust. It records pass-rate, wall-clock seconds, and output tokens: three numbers the public benchmark doesn't surface. The numbers it produces are about your code and nothing else.
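Turning results.json back into headline-style numbers takes a few lines; the keys match the Result dataclass above:

import json
from pathlib import Path
from statistics import median

results = json.loads(Path("results.json").read_text())
print(f"pass rate:      {sum(r['passed'] for r in results) / len(results):.0%}")
print(f"median seconds: {median(r['seconds'] for r in results):.1f}")
print(f"output tokens:  {sum(r['output_tokens'] for r in results):,}")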
Two design notes worth calling out. First, the model only sees touched_files as a hint, not the full repo. Real agentic coding loops do navigation, and you should add tool-use to the harness if you want apples-to-apples with how Claude Code runs. Second, "tests pass" is the same imperfect signal SWE-bench uses; for higher-stakes evaluation, layer in a code-review pass that checks for patches that pass tests by deleting tests.
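A cheap version of that review pass, assuming your tests live under tests/ (adjust the prefix for your layout): flag any diff that removes lines from a test file before counting it as a pass.

def deletes_test_lines(diff: str, test_prefix: str = "tests/") -> bool:
    """Heuristic pre-filter: True if the diff removes lines from any test file.
    Not a substitute for review; it just catches the laziest green patches."""
    in_test_file = False
    for line in diff.splitlines():
        if line.startswith("--- a/"):
            in_test_file = line[len("--- a/"):].startswith(test_prefix)
        elif line.startswith("--- "):
            in_test_file = False
        elif in_test_file and line.startswith("-") and not line.startswith("---"):
            return True
    return False

Call it on the diff right after ask_model returns and record a note instead of a pass when it fires.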
What you do with the numbers depends on what you're measuring. If you're choosing between models for a coding agent, run the harness against three of them on the same task list. If you're tracking regressions across model versions, freeze the task list and run on every release. If you're justifying a purchase, the numbers from your own repo are the ones that should land in the deck, not the ones from a Princeton paper.
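For the model-comparison case, the laziest workable version reuses the harness as a module and swaps the global. The non-Opus model IDs below are placeholders, and because the harness as written only speaks the Anthropic API, this compares Claude variants unless you also swap the client inside ask_model.

import json
from dataclasses import asdict
from pathlib import Path

import eval_harness

repo = Path("/path/to/your/repo")
tasks = [eval_harness.Task(**t)
         for t in json.loads(Path("tasks.json").read_text())]

# Placeholder IDs: substitute whatever you're actually comparing.
for model_id in ("claude-opus-4-6", "candidate-model-b", "candidate-model-c"):
    eval_harness.MODEL = model_id  # ask_model reads this module global
    rows = [asdict(r) for r in eval_harness.run(repo, tasks)]
    Path(f"results.{model_id}.json").write_text(json.dumps(rows, indent=2))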
Tomorrow morning, pick one repo, three closed PRs from the last quarter, and run the harness against Opus 4.6 with the thinking budget you'd actually ship. The number you get back is the only one with your name on it.
If this was useful
The LLM Observability Pocket Guide covers exactly this: how to instrument coding agents and LLM features so the metrics you ship against are about your work, not someone else's benchmark. Spans for tool calls, eval rigs that don't lie, and the traces that catch regressions before users do. The Prompt Engineering Pocket Guide is the companion for the prompt side: how to keep the agent prompt that drives those evals from drifting under you.

