DEV Community

Tsunamayo
How I Made Claude Code and GPT-5.4 Review Each Other's Code

The Problem: Same Model Writes and Reviews

When Claude Code writes code and Claude reviews it, you get the AI equivalent of grading your own homework. Blind spots survive.

I wanted GPT-5.4 to review Claude's code from a genuinely different perspective. So I built helix-codex: an MCP server that bridges Claude Code (Opus 4.6) to Codex CLI (GPT-5.4).

What Makes It Different

There are 6+ Codex MCP bridges on GitHub, and they all do the same thing: call codex exec and return the raw text. Claude has no idea what happened inside.

helix-codex parses the entire JSONL event stream and returns a structured report:

```
[Codex gpt-5.4] Completed

⏱ Execution time: 8.3s
🧵 Thread: 019d436e-4c39-...

📦 Tools used (3):
  ✅ read_file — src/auth.py
  ✅ edit_file — src/auth.py
  ✅ shell — python -m pytest tests/

📁 Files touched (1):
  • src/auth.py

━━━ Codex Response ━━━
Fixed the authentication logic.
```
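The core idea is simple to sketch: fold the JSONL event stream into a summary instead of passing raw text through. A minimal illustration in Python follows; the event field names (`type`, `tool`, `path`, `text`) are assumptions for the example, not the actual Codex CLI schema:

```python
import json

def parse_codex_events(jsonl_output: str) -> dict:
    """Fold a JSONL event stream into a structured summary.

    Event shapes here are illustrative; the real Codex CLI
    stream uses its own event types.
    """
    summary = {"tools": [], "files": set(), "response": ""}
    for line in jsonl_output.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON noise interleaved in the stream
        kind = event.get("type")
        if kind == "tool_call":
            summary["tools"].append(event.get("tool", "?"))
            if "path" in event:
                summary["files"].add(event["path"])
        elif kind == "agent_message":
            summary["response"] += event.get("text", "")
    summary["files"] = sorted(summary["files"])
    return summary
```

With a summary like this, Claude can make programmatic decisions (retry, escalate, report) instead of guessing from prose.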

The Self-Review Experiment

The most interesting test: I had GPT-5.4 review helix-codex's own source code. It found 3 critical issues:

  1. Return code logic bug: `returncode != 0` with partial output was treated as success
  2. Terminal injection vulnerability: no ANSI/OSC escape sanitization in output
  3. Path double-application: `cwd` passed to both the `-C` flag and subprocess `cwd=`

Claude (the model that wrote the code) had missed all three. Different model, different blind spots.
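Fixes for the first two issues can be sketched as follows. This is an illustration, not helix-codex's actual code, and the ANSI/OSC regex is a common approximation rather than a complete terminal-escape grammar:

```python
import re
import subprocess

# Strip ANSI CSI sequences (e.g. colors) and OSC sequences (e.g. title
# changes) so model output can't inject terminal control codes.
ANSI_RE = re.compile(r"\x1b(?:\[[0-9;?]*[a-zA-Z]|\][^\x07\x1b]*(?:\x07|\x1b\\))")

def sanitize(text: str) -> str:
    return ANSI_RE.sub("", text)

def run_codex(args: list[str]) -> str:
    proc = subprocess.run(args, capture_output=True, text=True)
    # A nonzero exit code is a failure even when stdout has partial output.
    if proc.returncode != 0:
        raise RuntimeError(
            f"codex exited {proc.returncode}: {sanitize(proc.stderr)}"
        )
    return sanitize(proc.stdout)
```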

Real Performance Numbers

| Tool | Time | What It Does |
| --- | --- | --- |
| explain | 5.4s | Full code explanation |
| review | 15.7s | CRITICAL/WARNING/INFO classified review |
| execute | 2.8s | Task delegation with structured trace |
| parallel_execute | n/a | Up to 6 simultaneous tasks |

Cross-Model Comparison

I ran Claude Agent and Codex in parallel on the same question: "Best thread-safe singleton pattern in Python?"

  • Claude: Metaclass + Lock, module variable, __new__
  • Codex: Module variable, lru_cache, Lock + classmethod

The lru_cache approach was unique to Codex; Claude hadn't considered it. Two models genuinely produce different solutions.
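For concreteness, here is a sketch of two of the suggested patterns: the lru_cache memoization Codex proposed and a classic double-checked-locking classmethod. The class names are mine, and note that lru_cache can construct the object twice under a rare first-call race (callers still converge on one cached instance afterward):

```python
import threading
from functools import lru_cache

class Config:
    """Stand-in for whatever object you want exactly one of."""

@lru_cache(maxsize=None)
def get_config() -> Config:
    # Memoizes the first result; fine when a rare duplicate
    # construction during a first-call race is acceptable.
    return Config()

class Singleton:
    _instance = None
    _lock = threading.Lock()

    @classmethod
    def instance(cls):
        # Double-checked locking: cheap fast path, lock only on first use.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = cls()
        return cls._instance
```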

Key Features

  • Full JSONL trace parsing: tools, files, timing, errors
  • Parallel execution: up to 6 tasks via asyncio.gather
  • Session management: threadId persistence
  • Adversarial review loop: GPT-5.4 challenges Claude's code
  • Sandbox security: 3-tier policy plus terminal-injection prevention
  • 56 tests: comprehensive coverage
  • Single file: ~820 lines, zero external deps beyond FastMCP
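The parallel-execution pattern above is essentially asyncio.gather with a semaphore capping concurrency at 6. A minimal sketch (function names and the placeholder task body are illustrative, not helix-codex's actual code):

```python
import asyncio

async def parallel_execute(prompts: list[str], limit: int = 6) -> list[str]:
    # Semaphore caps how many tasks run at once; gather preserves order.
    sem = asyncio.Semaphore(limit)

    async def run_one(prompt: str) -> str:
        async with sem:
            # Real code would await an async `codex exec` subprocess here.
            await asyncio.sleep(0)
            return f"done: {prompt}"

    return await asyncio.gather(*(run_one(p) for p in prompts))
```

Results come back in input order, so a caller can zip them straight back onto the file list it fanned out.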

Get Started (3 Minutes)

```
npm install -g @openai/codex && codex login
git clone https://github.com/tsunamayo7/helix-codex.git
cd helix-codex && uv sync
```

Add to ~/.claude/settings.json and you're done.
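A typical MCP server entry looks something like the sketch below; the command and args here are illustrative (check the repo README for the exact values for your install):

```json
{
  "mcpServers": {
    "helix-codex": {
      "command": "uv",
      "args": ["run", "--directory", "/path/to/helix-codex", "helix-codex"]
    }
  }
}
```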

What I Learned

  1. Different models have different blind spots. Cross-model review catches things self-review misses.
  2. Structured traces change everything. Raw text is useless for programmatic decisions.
  3. Parallel execution is underrated. Analyzing 6 files simultaneously saves real time.

GitHub: tsunamayo7/helix-codex (MIT license, 56 tests, Python 3.12+).

Star if useful! Feedback welcome.
