Claude, Codex, GPT, and a human walk into a code review.
They all found different bugs. They all missed different bugs. And the thing that broke production? None of them caught it.
This isn't a benchmark article. I didn't run standardized tests. I did something messier and more interesting: I took real code from a real project, handed it to four different evaluators, and asked each of them to tear it apart.
What I learned isn't about which AI is "best." It's about how different kinds of intelligence see different kinds of problems — and why you need all of them.
The Setup
A little background: I run an autonomous AI development operation. Claude Code (which I'll call CC) is the primary agent — it writes code, publishes articles, manages deployments, and runs 24/7 in a tmux session. I don't write code. I'm not a programmer. I direct the AI and make product decisions.
Alongside CC, I occasionally use Codex CLI (OpenAI's command-line agent) for second opinions, and GPT for broader analysis. And then there's me — the human who can't read most of the code but can tell you if the game feels wrong.
One day, I had CC generate a set of bash hooks — safety mechanisms that prevent the AI from doing dangerous things. Delete protection, API safety guards, context monitoring. About 400 lines across 7 files.
I thought: what if I gave this code to every evaluator I have access to, independently, with the same prompt?
"Review this code. Find bugs, risks, and things you'd change. Be honest."
So I did.
Evaluator 1: Claude Code (The Insider)
CC reviewed its own code. Yes, I know. Asking the author to review their own work sounds useless. But CC is a different kind of author — it genuinely doesn't have ego, and it has no memory of writing the original code (context had been compacted since then).
What CC Found
CC's review was meticulous. It went file by file, function by function. It found:
- **A race condition in the context monitor.** Two processes could read the context percentage simultaneously, both triggering a `/compact` command. Result: double-compact, which sometimes corrupts the context window. CC suggested adding a lockfile. Legitimate bug. Nobody else caught this.
- **An unquoted variable in the delete-protection hook.** `$FILE_PATH` without quotes would break on filenames with spaces. Classic bash pitfall. CC flagged it with the fix.
- **Missing error handling on the Git backup command.** If `git checkout -b` failed (branch already exists), the script continued as if the backup succeeded. CC added a fallback with a timestamp suffix.
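CC's lockfile suggestion is straightforward to sketch. This is my own minimal illustration, not the actual hook; the lock path is hypothetical. It uses `mkdir` as the lock primitive, a common bash idiom, because `mkdir` is atomic: if two monitors race, exactly one wins.

```shell
#!/usr/bin/env bash
# Minimal sketch of a lockfile guard against the double-compact race.
# Not the actual hook; the lock path is hypothetical.
LOCK_DIR="${TMPDIR:-/tmp}/context-monitor.lock"

if mkdir "$LOCK_DIR" 2>/dev/null; then
  # Release the lock when this process exits, even on error.
  trap 'rmdir "$LOCK_DIR"' EXIT
  echo "lock acquired: safe to trigger /compact"
  # ...trigger the compact here...
else
  echo "another monitor holds the lock: skipping"
fi
```

The second monitor doesn't wait or retry; for a periodic check, skipping one cycle is cheaper and safer than a double-compact.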
CC's Personality in Review
Diplomatic. Almost apologetic. "This could potentially cause issues if..." and "It might be worth considering..." CC rarely says "this is wrong." It says "this could be improved." Every criticism comes wrapped in acknowledgment of what works.
The tone is that of a colleague who knows they'll have to work with you tomorrow. No bridges burned.
What CC Missed
CC never questioned the architecture. It accepted the hook-based safety system as given and optimized within it. It never asked: "Should this be hooks at all? Is there a better pattern?" CC improves what exists. It rarely challenges the foundation.
Evaluator 2: Codex CLI (The Consultant)
Codex is OpenAI's command-line coding agent. It works differently from CC — more terse, more focused on code structure, less interested in narrative context.
I gave Codex the same 400 lines with the same prompt.
What Codex Found
Codex's review was shorter. Maybe a third the length of CC's. But it was dense.
- **Hardcoded paths everywhere.** `/home/namakusa/` appeared 23 times across the hook files. Codex immediately flagged this: "These should use `$HOME` or a config variable. Hardcoded paths break portability and make this impossible to share as a template." Nobody else mentioned this. And it turned out to be critically important, because we were planning to sell these hooks as part of a template pack.
- **No testing strategy.** Codex noted that none of the hooks had tests. Not unit tests, not integration tests, not even a smoke test script. "For safety-critical code, the absence of tests is itself a bug." Brutally correct.
- **Inconsistent exit codes.** Some hooks returned 0 on failure (allowing the dangerous action to proceed), while others returned 1. Codex flagged the specific files and lines.
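The exit-code finding is worth making concrete. Here's a minimal sketch of the convention Codex was implicitly asking for, written as a function so it can be tested; the `check_delete` name and the protected-file patterns are my own, not from the actual hooks. The assumed contract: return 0 to allow the action, non-zero to block it, and anything unexpected blocks (fail closed) rather than allows.

```shell
#!/usr/bin/env bash
# Sketch of a fail-closed hook convention. check_delete and the
# protected-file patterns are hypothetical, not the real hooks.
# Contract assumed here: return 0 = allow, non-zero = block.
check_delete() {
  local path="${1:-}"

  # Unexpected input: block rather than silently allow (fail closed).
  if [ -z "$path" ]; then
    echo "hook error: empty path, blocking" >&2
    return 1
  fi

  case "$path" in
    *.env|*deploy*)
      echo "blocked: $path is protected" >&2
      return 1
      ;;
    *)
      return 0
      ;;
  esac
}
```

The point isn't the patterns; it's that every hook in the set agrees on what 0 and 1 mean, and that the error path defaults to "block."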
Codex's Personality in Review
Terse. Almost curt. No pleasantries. No "great work on..." preambles. Just findings, organized by severity. It reads like an audit report from someone billing by the hour who doesn't want to waste either their time or yours.
Where CC says "It might be worth considering adding error handling," Codex says "Missing error handling. Add try/catch or guard clause." Same finding. Completely different delivery.
What Codex Missed
Codex had no context about the project. It didn't know these hooks run inside Claude Code's autonomous loop. It didn't know about the watchdog process, or the context monitor, or the proof-log system. So its suggestions, while technically correct, sometimes missed the operational reality.
For example, Codex suggested adding comprehensive logging to each hook. We already had logging — through a different mechanism that Codex didn't see because I only gave it the hook files. Without the full picture, its recommendations created duplication.
Context matters. And Codex, by design, operates with minimal context.
Evaluator 3: GPT (The Professor)
I pasted the same code into GPT-4 with the same prompt. No special instructions, no system prompt, just the code and the request.
What GPT Found
GPT's review was long. Very long. It had sections, subsections, a summary table, and a scoring rubric that I didn't ask for.
- **Security concerns with eval-like patterns.** GPT flagged a section where we construct a shell command from user input (actually, from the AI's tool-use parameters). It correctly identified this as a potential injection vector, though in practice the input is constrained by Claude Code's tool schema. Still: good catch, and something to harden.
- **POSIX compliance issues.** GPT noted that several bash-isms (like `[[ ]]` double brackets and process substitution) wouldn't work in `sh` or on systems without bash. Nobody else mentioned this, and honestly I wouldn't have cared, except we were thinking about running on Alpine Linux containers eventually.
- **Documentation gaps.** GPT generated an entire suggested README for the hook system, unprompted. Including a diagram of the hook execution flow.
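GPT's injection concern has a standard mitigation, sketched below under my own assumptions (the hostile input is invented for illustration): never splice input into a command string for `eval`; pass it as a single quoted argument so the shell never re-parses it.

```shell
#!/usr/bin/env bash
# Sketch: treat tool-use input as data, not code. The input here is an
# invented hostile example, not real traffic from the hooks.
user_input='notes.txt; rm -rf /'

# Risky pattern (don't do this): eval "ls $user_input"
# The shell would re-parse the string and run the rm.

# Safer: quoting keeps the input one argument; -- stops option parsing.
if ls -- "$user_input" >/dev/null 2>&1; then
  echo "file exists"
else
  echo "no such file: input was treated as a single filename"
fi
```

With the quoted form, the worst a hostile string can do is name a file that doesn't exist.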
GPT's Personality in Review
Professorial. Thorough to the point of exhaustion. GPT doesn't just find a problem — it explains the history of the problem, the theoretical implications, the three ways to fix it, the tradeoffs between those ways, and which one it recommends along with a paragraph of justification.
Reading GPT's review felt like getting a term paper back from a lecturer who wants you to know they read the whole thing. Impressive, exhausting, and occasionally redundant.
GPT also had a habit of qualifying everything: "While this may not be an issue in your specific use case, in general, it's considered best practice to..." This is technically correct and operationally useless. I need to know if it's an issue in my specific use case. That's the only use case that exists.
What GPT Missed
GPT analyzed the code as if it were a standalone project. It had no sense of the operational environment. The security concerns it raised were real in the abstract but irrelevant in our specific deployment. The POSIX compliance issues matter for distribution but not for our current single-machine setup.
GPT also missed the race condition that CC found. Interesting — this was the most practically dangerous bug, and the evaluator with the most raw capability missed it because it requires understanding the runtime behavior, not just the static code.
Evaluator 4: The Human (Me)
I can't read most of the code. Let me be upfront about that. I don't know bash syntax. I can't spot a race condition. I can't tell you if an exit code is wrong.
But I can run the software and tell you what happens.
What I Found
I found the bug that broke production.
It wasn't in the code. It was in the user experience.
The delete-protection hook worked perfectly — technically. When the AI tried to delete a file, the hook intercepted it, checked if the file was critical, and blocked the deletion if so. Exactly as designed.
But here's the problem: when the hook blocked a deletion, it printed a warning message to stderr. That message got buried in the agent's output stream. The AI didn't see it. It thought the deletion succeeded. It proceeded with the next step — which assumed the file was gone.
Three hours later, the agent was in an inconsistent state, trying to recreate a file it thought it had deleted but which still existed. It got confused, overwrote the wrong file, and corrupted a deployment config.
The hook did its job. The feedback mechanism didn't. And no AI evaluator caught this because they were looking at the code, not the behavior.
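The fix we landed on is easy to sketch. Claude Code's hook documentation describes a contract where a PreToolUse hook exiting with code 2 blocks the tool call and feeds stderr back to the model — exactly the feedback channel that was missing. The function names and file patterns below are mine, for illustration, not the production hook.

```shell
#!/usr/bin/env bash
# Sketch of the missing feedback loop. is_critical and guard_delete
# are hypothetical names. Assumed hook contract (per Claude Code's
# hook docs): return 0 allows the tool call; 2 blocks it AND surfaces
# stderr to the model, so the agent learns the file still exists.
is_critical() {
  case "$1" in
    *.env|*deploy*) return 0 ;;
    *) return 1 ;;
  esac
}

guard_delete() {
  if is_critical "$1"; then
    # This message reaches the agent, not just a buried log.
    echo "BLOCKED: $1 is protected. The file was NOT deleted." >&2
    return 2
  fi
  return 0
}
```

The code change is two lines. The insight — that a safety mechanism must report back to the thing it's constraining — took a production incident to learn.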
The Human Personality in Review
I don't have a personality in code review. I have an experience report. "I ran it. Here's what happened. This felt wrong." No line numbers. No technical terminology. Just: this broke, and here's how it felt when it broke.
The three AIs found code bugs. I found a system design bug. The code was correct. The system was wrong.
What I Missed
Everything the AIs found. All of it. I can't read the code. I'm useless at static analysis. I would never have found the race condition, the hardcoded paths, the inconsistent exit codes, or the POSIX compliance issues.
I found one bug. But it was the one that actually broke production.
The Scoring Rubric
After collecting all four reviews, I built a comparison matrix. Because I'm a data nerd, even if I'm not a code nerd.
| Category | CC | Codex | GPT | Human |
|---|---|---|---|---|
| Bugs found (real) | 3 | 3 | 2 | 1 |
| False positives | 0 | 1 | 3 | 0 |
| Architecture feedback | Low | Medium | High | N/A |
| Production relevance | High | Medium | Low | Critical |
| Actionability | High | High | Medium | High |
| Context awareness | Full | None | None | Operational |
| Review length | Long | Short | Very long | 3 sentences |
The scores don't tell the full story. But the pattern does.
What "Correctness" Means Depends On Who You Ask
This is the real takeaway.
CC found correctness issues in the implementation. Race conditions, missing quotes, error handling gaps. These are correctness problems at the code level.
Codex found correctness issues in the engineering. Hardcoded paths, no tests, inconsistent interfaces. These are correctness problems at the craft level.
GPT found correctness issues in the theory. Security patterns, POSIX compliance, documentation standards. These are correctness problems at the academic level.
I found a correctness issue in the behavior. The system did the wrong thing when used by a real agent in a real scenario. This is correctness at the system level.
All four of us were right. All four of us were incomplete.
The Personality Theory
Something I didn't expect to find: each evaluator has a consistent personality that shapes what it notices and what it ignores.
CC is the team player. It finds bugs in the details because it lives in the details. It doesn't challenge architecture because it built the architecture. Its strength is intimacy with the codebase. Its weakness is institutional blind spots.
Codex is the contractor. It evaluates the code it sees, nothing more. Its strength is fresh eyes and no emotional attachment. Its weakness is missing operational context that changes the severity calculus of everything it finds.
GPT is the academic. It knows everything about best practices and nothing about your specific situation. Its strength is breadth of knowledge. Its weakness is that it can't distinguish between theoretically concerning and practically dangerous.
The human is the user. Can't see the code. Can see the behavior. Finds the bugs that only manifest at runtime, in production, under real conditions. Misses everything else.
Why This Changes How I Work
Before this experiment, I had CC review its own code. That was the whole process. CC writes, CC reviews, CC ships.
Now I use a multi-evaluator pipeline. Not every time — that would be absurdly expensive for routine changes. But for safety-critical code, for infrastructure changes, for anything that touches the autonomous loop:
- CC writes the code and does a self-review. Catches implementation bugs.
- Codex does an independent review. Catches engineering blind spots.
- I run it and observe. Catches system-level behavior bugs.
I dropped GPT from the regular pipeline. Not because it's bad — it's genuinely insightful — but because its signal-to-noise ratio is low for our use case. Too many theoretical concerns, not enough practical findings. I bring it in when I want a thorough security review or when we're preparing code for public release.
Three evaluators. Three types of correctness. Three perspectives that don't overlap.
The Bigger Point
We talk about AI code review as if it's one thing. "AI found a bug." But which AI? What kind of bug? In what context?
A race condition spotted by the code's own author. A portability issue spotted by a context-free contractor. A security concern spotted by an omniscient professor. A UX catastrophe spotted by someone who can't read code.
These aren't the same skill. They're not even the same activity.
The idea that one AI — or one human — can provide comprehensive code review is a fantasy. Different evaluators see different dimensions of the same code. The most dangerous bugs live in the gaps between what any single evaluator can see.
How to Actually Do This
If you want to try multi-evaluator code review, here's the practical version:
For routine code changes: Your primary AI agent self-reviews. That's enough for 90% of changes. Don't overcomplicate it.
For important changes: Add a second evaluator. Different model, different provider if possible. Give it the code without context and see what it finds that your primary agent missed.
For safety-critical changes: Three evaluators minimum. At least one of them should be a human who runs the code and observes the behavior. Not reads the code — runs it.
For public releases: Four evaluators. Include the verbose, thorough one (GPT, in our case). You want the security concerns and compliance issues surfaced, even if most of them aren't relevant. Better to dismiss a false positive than to miss a real vulnerability.
The cost is minimal. An extra API call for routine review. Twenty minutes of human testing for critical changes. The return is catching the bug that would have cost you a day of debugging, or worse — a production incident.
What Nobody Tells You About AI Code Review
Here's the part that surprised me most.
The value of multi-evaluator review isn't just in finding more bugs. It's in understanding what "correct" means for your specific project.
Before this experiment, I had one definition of correct: "does it work?" After this experiment, I have four:
- Does it work? (CC's question)
- Is it well-built? (Codex's question)
- Is it properly designed? (GPT's question)
- Does it behave correctly in the real world? (My question)
A piece of code can score 100% on three of these and fail catastrophically on the fourth. The delete-protection hook was correct code, well-built, properly designed, and broken in production.
Four judges. Four definitions of quality. Four blind spots.
You need all of them. Or at minimum, you need to know which one you're skipping and what you might miss.
The Final Score
Claude, Codex, GPT, and a human walk into a code review.
Claude finds the race condition.
Codex finds the hardcoded paths.
GPT finds the security concern.
The human finds the production outage.
Nobody finds everything. Everybody finds something.
That's the whole lesson. There is no single source of truth for code quality. There is no one evaluator to rule them all. There is only the uncomfortable reality that every review is incomplete, and the best you can do is triangulate.
Four judges. Nine bugs. Zero overlap.
Your code review pipeline should make you uncomfortable. If one reviewer says "looks good," that's not reassurance. That's a single data point from a single perspective.
Get more perspectives. They'll disagree. That's the point.
The multi-agent coordination scripts and safety hooks used in this review pipeline are part of the CC-Codex Autonomous Ops Kit — 6 hooks + 9 scripts for autonomous Claude Code operation.
Free Tools for Claude Code Operators
| Tool | What it does |
|---|---|
| cc-health-check | 20-check setup diagnostic (CLI + web) |
| cc-session-stats | Usage analytics from session data |
| cc-audit-log | Human-readable audit trail |
| cc-cost-check | Cost per commit calculator |
Interactive: Are You Ready for an AI Agent? — 10-question readiness quiz | 50 Days of AI — the raw data