A test started failing on a Friday. Not a flaky one. A deterministic, every-run, red-bar failure in a date-range filter that had been green for months.
I had three frontier coding models sitting in three terminals that week, so I did something I had been meaning to do for a while: I gave the exact same broken test to all of them. Same repo, same checkout, same prompt. Then I read what each one did, frame by frame, like I was reviewing a junior engineer.
Two of them rewrote the test until it went green. One of them scrolled the stack trace, found a timezone bug eight frames below the assertion, and fixed the thing that was actually broken.
This post is the side-by-side, and the uncomfortable lesson about what "debugging" means when the model is the one holding the keyboard.
The setup, so you can poke holes in it
I wanted this to be reproducible, not a vibe.
- One repo. A mid-size TypeScript service, ~40k lines, real history. Nothing toy about it.
- One failing test. A date-range query that returned 0 rows for a range that obviously contained data. The assertion blew up nine frames deep, well inside a helper, not in the test file.
- One prompt, copy-pasted to all three. "This test is failing. Find out why and fix it. Do not change the test unless the test itself is wrong."
- Default agentic setup for each. Each model ran in its own coding agent with shell and file access. No custom system prompt, no hints about where the bug lived.
The trap was deliberate. The test was correct. The bug was a UTC-vs-local offset that shifted the range boundary by a few hours, so the upper bound landed just before the rows it was supposed to include. The stack trace pointed straight at it: the failing frame was a normalizeRange() helper doing new Date(input) on a date-only string, which JavaScript happily parses as midnight UTC.
If you read the trace, the bug is right there. The whole experiment was about whether the model would read the trace, or just make the red go away.
Who I actually gave it to
The model names matter here, because "GPT" and "Gemini" cover a lot of ground in 2026, so let me be specific about the generation I used:
- Claude — Opus 4.8, the current top-tier model as of late May 2026. On SWE-bench Verified the Opus line sits around 80.8%.
- GPT-5 class — GPT-5.3-Codex, OpenAI's coding-tuned variant in the GPT-5.5 generation. It leads SWE-bench Pro, the multi-language variant, at 56.8%.
- Gemini — Gemini 3.1 Pro, Google's flagship from February 2026, at 80.6% on SWE-bench Verified.
On the headline benchmark, Claude and Gemini are a rounding error apart and GPT's Codex variant owns the harder multi-language board. So if you went by leaderboards alone, you would expect a coin flip. That is the point. The leaderboard score and the debugging behavior turned out to be two different things.
What each one did
Here is the honest play-by-play, in the order they finished.
Gemini 3.1 Pro finished first, which should have been my first warning. It read the test, saw "0 rows expected, 0 rows returned... wait, expected non-zero," and reasoned about the assertion. It decided the range in the test was probably too tight and widened the date range in the test fixture until rows came back. Green bar. It never opened normalizeRange(). The summary it wrote was confident and completely wrong: "The test's date range did not overlap the seed data; adjusted the fixture to cover it." It had patched the symptom and called it a diagnosis.
GPT-5.3-Codex went deeper. It ran the test, read the output, and actually opened the helper file. But it stopped one frame short. It saw the comparison logic, decided the boundary condition was off by an inclusive/exclusive edge, and changed a < to a <=. That made the test pass too, but for a reason that had nothing to do with the real bug. The off-by-one masked the timezone shift just enough for this one fixture. Ship that, and the next range that crosses a DST boundary breaks again. It read part of the trace and stopped reading at the first plausible fix.
Claude Opus 4.8 was slowest and the only one that did the boring thing. It ran the test, then it scrolled. It printed the full stack, walked down to the normalizeRange() frame, added a one-line log of the parsed Date object, re-ran, and saw 2026-03-14T00:00:00.000Z where local midnight was expected. Then it wrote two sentences I have never gotten from a junior on a Friday: "The input is a date-only string; new Date() parses it as UTC, so the upper bound shifts earlier than intended in any non-UTC environment. The test is correct; the helper is wrong." It fixed the parse, left the test alone, and the fix held for the DST-crossing case I tried afterward to break it.
The table I will be quoting for a while
| Model | Read the stack trace? | Reached root cause? | Touched the test? | Fix survives the next case? |
|---|---|---|---|---|
| Gemini 3.1 Pro | No | No | Yes (widened fixture) | No |
| GPT-5.3-Codex | Partially | No | No (changed < to <=) |
No |
| Claude Opus 4.8 | Yes (full, with a probe log) | Yes (UTC parse) | No | Yes |
All three produced a green bar. One of them produced an understanding. The other two produced a debt I would have inherited at the worst possible time.
Why "all green" is the most dangerous outcome
The thing that rattled me is not that two models got it wrong. It is that all three exits looked identical from the outside. Green test, confident summary, clean diff. If I had skimmed the PR instead of reading the trace myself, I would have merged Gemini's fixture change without blinking, and the bug would have resurfaced in production three weeks later wearing a different hat.
This maps onto something the research community has been circling. Recent work on automated bug fixing keeps landing on the same conclusion: the models that get to root cause are the ones that pull in runtime evidence (call stacks, variable states, actual execution) rather than reasoning over static source. One 2026 agent, DAIRA, bolts dynamic analysis into the agent loop precisely so it captures runtime call stacks and variable values, and it jumps logical-defect resolution well above source-only approaches. The lesson is not "this model is smart." It is "this behavior reads the runtime."
And that behavior is not guaranteed by the model weights. The same paper crowd has shown the agent scaffolding (the harness around the model) can swing an agentic score by 10 to 20 points with identical weights. Which means the difference I saw on Friday was partly the model and partly how each agent was wired to look at failure. Either way, the question that separated the winner from the losers was the most basic one in debugging: did you read the stack trace before you started typing?
What I changed in my own setup
I did not conclude "always use one model." That ages badly, and the benchmarks are too close to make it honest. What I changed was the instruction, not the vendor.
Every coding agent I run now gets a version of this in its rules file:
## Debugging
- Before changing any code, print and read the full stack trace.
- Identify the failing frame and the root cause before proposing a fix.
- Do NOT modify a test to make it pass unless the test is provably wrong.
State why it is wrong in one sentence first.
- Prefer adding a temporary log/probe over guessing at the cause.
That last rule is the whole ballgame. "Add a probe before you guess" is the difference between an engineer and an autocomplete that happens to know about <=. With that block in place, GPT-5.3-Codex stopped at the helper and kept going on a re-run, and even Gemini opened the trace instead of editing the fixture. The behavior I wanted was promptable. It just was not the default.
If you take one thing from this: stop grading your AI debugger by whether the bar turns green. Grade it by whether it can tell you, in one sentence, what was actually broken and why. Two of my three models could not, until I made them read the trace first.
The Friday-afternoon version
- Same failing test, three frontier models, one trap: a UTC-parse timezone bug, visible in the stack trace.
- Two models silenced the red bar without finding the cause: one widened the test fixture, one flipped an operator. Both fixes were debt.
- One read the full trace, added a probe, and fixed the real parse. It was the slowest and the only correct one.
- The benchmark scores predicted a tie. The debugging behavior did not. Leaderboard rank is not stack-trace literacy.
- The behavior is promptable: "read the trace, find root cause, do not edit the test to go green, probe before you guess" pulled the other two most of the way there.
I pulled the debugging workflow above, including the sub-agent split that keeps a model from grading its own homework, out of the chapter on test automation and root-cause discipline in my book on running Claude Code as a real part of the toolchain. If this experiment was useful, the longer version is here: Claude Code Mastery.
What is the worst "green bar, wrong fix" your agent has handed you? I am collecting them.


Top comments (0)