A colleague messages you: "Your recent change is breaking the integration. The tool message storage logic fails when there are multiple tool calls in a row."
They paste the exception:
```
IntegrityError: duplicate key value violates unique constraint
DETAIL: Key (request_id, message_index)=(4f62ae0f, 9) already exists.
[parameters: {
    'request_id': '4f62ae0f',
    'message_index': 9,
    'role': 'user',
    'content': '...'
}]
```
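The error itself is easy to reproduce in miniature. Here's a minimal sketch using Python's stdlib `sqlite3` (the real system is Postgres; the table and column names are taken from the pasted exception, everything else is illustrative):

```python
import sqlite3

# Minimal reproduction of the constraint violation, using an in-memory
# SQLite table shaped like the one in the pasted exception.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE messages (
        request_id    TEXT,
        message_index INTEGER,
        role          TEXT,
        content       TEXT,
        UNIQUE (request_id, message_index)
    )
""")

# The first insert for (request_id, message_index) = ('4f62ae0f', 9) succeeds.
conn.execute("INSERT INTO messages VALUES ('4f62ae0f', 9, 'tool', '...')")

# A second insert with the same key fails -- and note the role on the
# FAILING row is 'user', exactly as in the real exception.
try:
    conn.execute("INSERT INTO messages VALUES ('4f62ae0f', 9, 'user', '...')")
except sqlite3.IntegrityError as e:
    print(e)  # UNIQUE constraint failed: messages.request_id, messages.message_index
```

The point of the sketch: the constraint fires on the key, not the role, so the role of the failing row is a free clue about which code path produced it.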
A human engineer reads this and immediately thinks: "Wait — the colleague says tool messages are the problem, but the failing row has role='user'. That's not a tool message. The tool-handling logic wouldn't even run for this row."
That contradiction is the entire investigation. Five seconds. Done.
I ran this exact bug report through Claude Code 3 times — same codebase, same prompt, same exception. Each time with a different navigation approach (graph-guided, no map, docs-guided). Same model, clean context each time.
All 3 runs missed role='user'.
Then I ran the same prompt through Codex CLI (GPT-5.4). It caught it immediately.
## What happened
Every Claude Code run followed the same pattern:
- Read the colleague's message → formed hypothesis: "tool messages are being duplicated"
- Searched the codebase → found the session ID reuse mechanism (correct finding)
- Built a plausible explanation: "same session ID → same request ID → duplicate message inserts"
- Reported "session reuse causes duplicate tool message inserts"
- Never noticed the failing row was role='user', not role='tool'
The structural finding was correct — session ID reuse IS part of the problem. But the mechanism was wrong. The code path for tool messages (the trailing-tool-block logic) wasn't even involved in this crash. The exception was for a plain user message.
Every run read the exception. Every run extracted the constraint name, the IDs, the SQL. None of them checked whether role='user' contradicted the colleague's theory about tool messages.
| Run | Navigation | Found session reuse? | Found root cause? | Noticed role='user'? | Misled by narrative? |
|---|---|---|---|---|---|
| Claude Code #1 | Graph-guided | ✅ | Partial | ❌ | ✅ |
| Claude Code #2 | No map | ✅ | Partial | ❌ | ✅ |
| Claude Code #3 | Docs-guided | ✅ | Partial | ❌ | ✅ |
| Human engineer | — | ✅ | ✅ | ✅ | ❌ |
| Codex CLI (GPT-5.4) | — | ✅ | ✅ | ✅ | ❌ |
3 out of 3 Claude Code runs missed it.
## Why Claude misses it
This is narrative anchoring — a form of confirmation bias.
When a human engineer debugs, they typically read the error first, then the explanation. The error is the ground truth. The explanation is a claim to verify.
When Claude debugs, the order is reversed. It reads the colleague's message (natural language, easy to process), forms a hypothesis, then reads the exception to confirm the hypothesis. It finds confirmation (session reuse, duplicate inserts) and stops.
The exception data is right there. role='user' is right there. But Claude isn't reading it to challenge its hypothesis — it's reading it to support it.
Claude reads more code than a human. It opens more files. It traces more call chains. But it misses a four-character string that a human catches in five seconds.
That's not a navigation problem. That's not a context window problem. That's a reasoning problem.
## Why Codex gets it right
Codex CLI found role='user' immediately. Why?
Codex reads mechanically. It processes every field in the exception parameters without narrative context. It doesn't form hypotheses from human stories — it reads data, line by line, field by field.
It saw role='user' and flagged: "This is a user message, not a tool message. The colleague's theory about tool message storage doesn't apply to this row."
No confirmation bias. No narrative anchoring. Just data.
This isn't about which model is "smarter." Claude found more structural context (session reuse, the upsert pattern, the full call chain). Codex found fewer things but found the right thing. Different failure modes for different architectures.
## The fix: evidence before narrative
After this benchmark, I added a rule to my debugging workflow:
When someone reports a bug with an exception: read the exception FIRST. Form your hypothesis from the evidence. THEN read what the reporter claims.
Specifically:
- Extract every field from the exception — constraint name, column values, IDs, the SQL operation. List them explicitly.
- State what the evidence says using only the exception data. No narrative yet.
- Then read the reporter's explanation. Compare it against the evidence.
- If any field contradicts the narrative — investigate the contradiction first. Don't dismiss it.
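The checklist above can be sketched as a small helper: extract the exception fields first, then diff them against what the reporter's story implies. This is my own illustration, not the actual rule text from the plugin; `check_narrative` and its field names are hypothetical.

```python
def check_narrative(params: dict, narrative_claims: dict) -> list[str]:
    """Return every exception field whose value contradicts the reporter's claims.

    params           -- fields extracted mechanically from the exception
    narrative_claims -- what the bug report implies those fields should be
    """
    contradictions = []
    for field, claimed in narrative_claims.items():
        actual = params.get(field)
        if actual is not None and actual != claimed:
            contradictions.append(
                f"{field}: narrative says {claimed!r}, evidence says {actual!r}"
            )
    return contradictions


# Fields copied from the pasted IntegrityError:
params = {"request_id": "4f62ae0f", "message_index": 9, "role": "user"}

# The colleague's story ("tool message storage fails") implies the failing
# row should be a tool message:
print(check_narrative(params, {"role": "tool"}))
# → ["role: narrative says 'tool', evidence says 'user'"]
```

The mechanical step is trivial; the discipline is doing it before reading the story, so the hypothesis starts from the evidence instead of being retrofitted to it.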
This is now a global rule in my Claude Code setup and part of the debug skill in my navigation plugin.
The rule forces Claude to behave more like Codex in the critical first step: read the data mechanically before engaging with the story.
## The bigger picture
We spend a lot of time talking about AI coding assistants and their intelligence — context windows, tool use, multi-step reasoning. But this benchmark exposed something different:
Claude is not a careful reader when given a human narrative alongside evidence.
It's a fast reader. It reads more files than any human would. It traces call chains across 50K-line codebases. But it doesn't read critically. It doesn't ask "does this evidence actually support what the human told me?"
A junior engineer who reads the exception carefully will outperform Claude Code tracing 20 files — because the junior checks their assumptions against the actual error data.
The bottleneck for AI debugging isn't intelligence. It isn't navigation. It isn't context.
It's the ability to notice when the evidence contradicts the story.
## What you can do
If you use Claude Code for debugging:
- **Don't trust the first answer.** Especially when Claude agrees with the reporter's theory. Ask: "Did you check every field in the exception? Does anything contradict what the reporter said?"
- **Add an evidence-first rule.** Force Claude to extract and list every field from the error before forming a hypothesis. Here's the rule I use.
- **Use Codex CLI for mechanical verification.** For critical bugs, run Codex on the exception + code. It reads without narrative bias. Costs ~$0.02 per scan.
- **Read the error yourself.** Seriously. Five seconds of reading the exception parameters will catch things that Claude Code running three times in parallel won't.
Claude is a powerful tool. But it has a blind spot: it trusts humans too much.
When a colleague says "your change broke this," Claude takes their word for it and goes looking for proof. A good engineer reads the evidence first and decides for themselves.
This finding came from benchmarking a codebase navigation plugin for Claude Code. The plugin helps with navigation — but this particular failure had nothing to do with navigation. Claude found the right files. It just didn't read them carefully enough.