DEV Community

Tsunamayo
How I Made Claude Code and GPT-5.4 Review Each Other's Code

The Problem: Same Model Writes and Reviews

When Claude Code writes code and Claude reviews it, you get the AI equivalent of grading your own homework. Blind spots survive.

I wanted GPT-5.4 to review Claude's code from a genuinely different perspective. So I built helix-codex: an MCP server that bridges Claude Code (Opus 4.6) to Codex CLI (GPT-5.4).

What Makes It Different

There are 6+ Codex MCP bridges on GitHub, and they all do the same thing: call codex exec and return the raw text. Claude has no idea what happened inside.

helix-codex parses the entire JSONL event stream and returns a structured report:

```
[Codex gpt-5.4] Completed

⏱ Execution time: 8.3s
🧵 Thread: 019d436e-4c39-...

📦 Tools used (3):
  ✅ read_file — src/auth.py
  ✅ edit_file — src/auth.py
  ✅ shell — python -m pytest tests/

📁 Files touched (1):
  • src/auth.py

━━━ Codex Response ━━━
Fixed the authentication logic.
```
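The core idea is simple to sketch: fold the JSONL event stream into a summary instead of passing raw text through. A minimal illustration in Python follows; the event field names (`type`, `tool`, `path`, `text`) are assumptions for the example, not the actual Codex CLI schema:

```python
import json

def parse_codex_events(jsonl_output: str) -> dict:
    """Fold a JSONL event stream into a structured summary.

    Event shapes here are illustrative; the real Codex CLI
    stream uses its own event types.
    """
    summary = {"tools": [], "files": set(), "response": ""}
    for line in jsonl_output.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON noise interleaved in the stream
        kind = event.get("type")
        if kind == "tool_call":
            summary["tools"].append(event.get("tool", "?"))
            if "path" in event:
                summary["files"].add(event["path"])
        elif kind == "agent_message":
            summary["response"] += event.get("text", "")
    summary["files"] = sorted(summary["files"])
    return summary
```

With a summary like this, Claude can make programmatic decisions (retry, escalate, report) instead of guessing from prose.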

The Self-Review Experiment

The most interesting test: I had GPT-5.4 review helix-codex's own source code. It found 3 critical issues:

  1. Return code logic bug: `returncode != 0` with partial output was treated as success
  2. Terminal injection vulnerability: no ANSI/OSC escape sanitization in output
  3. Path double-application: `cwd` passed to both the `-C` flag and subprocess `cwd=`

Claude (the model that wrote the code) had missed all three. Different model, different blind spots.
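Fixes for the first two issues can be sketched as follows. This is an illustration, not helix-codex's actual code, and the ANSI/OSC regex is a common approximation rather than a complete terminal-escape grammar:

```python
import re
import subprocess

# Strip ANSI CSI sequences (e.g. colors) and OSC sequences (e.g. title
# changes) so model output can't inject terminal control codes.
ANSI_RE = re.compile(r"\x1b(?:\[[0-9;?]*[a-zA-Z]|\][^\x07\x1b]*(?:\x07|\x1b\\))")

def sanitize(text: str) -> str:
    return ANSI_RE.sub("", text)

def run_codex(args: list[str]) -> str:
    proc = subprocess.run(args, capture_output=True, text=True)
    # A nonzero exit code is a failure even when stdout has partial output.
    if proc.returncode != 0:
        raise RuntimeError(
            f"codex exited {proc.returncode}: {sanitize(proc.stderr)}"
        )
    return sanitize(proc.stdout)
```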

Real Performance Numbers

| Tool | Time | What It Does |
| --- | --- | --- |
| explain | 5.4s | Full code explanation |
| review | 15.7s | CRITICAL/WARNING/INFO classified review |
| execute | 2.8s | Task delegation with structured trace |
| parallel_execute | n/a | Up to 6 simultaneous tasks |

Cross-Model Comparison

I ran Claude Agent and Codex in parallel on the same question: "Best thread-safe singleton pattern in Python?"

  • Claude: Metaclass + Lock, module variable, __new__
  • Codex: Module variable, lru_cache, Lock + classmethod

The lru_cache approach was unique to Codex; Claude hadn't considered it. Two models genuinely produce different solutions.
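For concreteness, here is a sketch of two of the suggested patterns: the lru_cache memoization Codex proposed and a classic double-checked-locking classmethod. The class names are mine, and note that lru_cache can construct the object twice under a rare first-call race (callers still converge on one cached instance afterward):

```python
import threading
from functools import lru_cache

class Config:
    """Stand-in for whatever object you want exactly one of."""

@lru_cache(maxsize=None)
def get_config() -> Config:
    # Memoizes the first result; fine when a rare duplicate
    # construction during a first-call race is acceptable.
    return Config()

class Singleton:
    _instance = None
    _lock = threading.Lock()

    @classmethod
    def instance(cls):
        # Double-checked locking: cheap fast path, lock only on first use.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = cls()
        return cls._instance
```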

Key Features

  • Full JSONL trace parsing: tools, files, timing, errors
  • Parallel execution: up to 6 tasks via asyncio.gather
  • Session management: threadId persistence
  • Adversarial review loop: GPT-5.4 challenges Claude's code
  • Sandbox security: 3-tier policy plus terminal-injection prevention
  • 56 tests: comprehensive coverage
  • Single file: ~820 lines, zero external deps beyond FastMCP
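The parallel-execution pattern above is essentially asyncio.gather with a semaphore capping concurrency at 6. A minimal sketch (function names and the placeholder task body are illustrative, not helix-codex's actual code):

```python
import asyncio

async def parallel_execute(prompts: list[str], limit: int = 6) -> list[str]:
    # Semaphore caps how many tasks run at once; gather preserves order.
    sem = asyncio.Semaphore(limit)

    async def run_one(prompt: str) -> str:
        async with sem:
            # Real code would await an async `codex exec` subprocess here.
            await asyncio.sleep(0)
            return f"done: {prompt}"

    return await asyncio.gather(*(run_one(p) for p in prompts))
```

Results come back in input order, so a caller can zip them straight back onto the file list it fanned out.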

Get Started (3 Minutes)

```
npm install -g @openai/codex && codex login
git clone https://github.com/tsunamayo7/helix-codex.git
cd helix-codex && uv sync
```

Add to ~/.claude/settings.json and you're done.
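A typical MCP server entry looks something like the sketch below; the command and args here are illustrative (check the repo README for the exact values for your install):

```json
{
  "mcpServers": {
    "helix-codex": {
      "command": "uv",
      "args": ["run", "--directory", "/path/to/helix-codex", "helix-codex"]
    }
  }
}
```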

What I Learned

  1. Different models have different blind spots. Cross-model review catches things self-review misses.
  2. Structured traces change everything. Raw text is useless for programmatic decisions.
  3. Parallel execution is underrated. Analyzing 6 files simultaneously saves real time.

GitHub: tsunamayo7/helix-codex (MIT license, 56 tests, Python 3.12+).

Star if useful! Feedback welcome.
