## The Problem: Same Model Writes and Reviews
When Claude Code writes code and Claude reviews it, you get the AI equivalent of grading your own homework. Blind spots survive.
I wanted GPT-5.4 to review Claude's code from a genuinely different perspective. So I built helix-codex: an MCP server that bridges Claude Code (Opus 4.6) to Codex CLI (GPT-5.4).
## What Makes It Different
There are 6+ Codex MCP bridges on GitHub. They all do the same thing: call `codex exec` and return raw text. Claude has no idea what happened inside.
helix-codex parses the entire JSONL event stream and returns a structured report:
```text
[Codex gpt-5.4] Completed
⏱ Execution time: 8.3s
🧵 Thread: 019d436e-4c39-...
📦 Tools used (3):
  ✅ read_file → src/auth.py
  ✅ edit_file → src/auth.py
  ✅ shell → python -m pytest tests/
📝 Files touched (1):
  • src/auth.py
─── Codex Response ───
Fixed the authentication logic.
```
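Internally, the bridge walks each JSONL event and accumulates tool calls and file changes into that report. A minimal sketch of the idea — note that the event field names (`type`, `tool`, `arg`, `path`) are illustrative assumptions, not the actual Codex CLI schema:

```python
import json

def summarize_events(jsonl_text: str) -> dict:
    """Collapse a JSONL event stream into a structured summary.

    Field names here are assumptions for illustration; the real
    Codex CLI event schema differs.
    """
    tools: list[tuple[str, str]] = []
    files: set[str] = set()
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("type") == "tool_call":
            tools.append((event["tool"], event.get("arg", "")))
        elif event.get("type") == "file_change":
            files.add(event["path"])
    return {"tools": tools, "files": sorted(files)}

sample = "\n".join([
    '{"type": "tool_call", "tool": "read_file", "arg": "src/auth.py"}',
    '{"type": "file_change", "path": "src/auth.py"}',
])
print(summarize_events(sample))
```

The point is that once the trace is structured data rather than raw text, Claude can reason about *what Codex did*, not just what it said.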
## The Self-Review Experiment
The most interesting test: I had GPT-5.4 review helix-codex's own source code. It found 3 critical issues:
1. Return code logic bug: `returncode != 0` with partial output was treated as success
2. Terminal injection vulnerability: no ANSI/OSC escape sanitization in the output
3. Path double-application: `cwd` passed to both the `-C` flag and subprocess `cwd=`
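For the second issue, the fix is to strip CSI and OSC escape sequences before model output ever reaches the terminal. A sketch of such a sanitizer — this regex is my own illustration, not the exact one helix-codex ships:

```python
import re

# Matches two families of terminal escapes:
#   CSI: ESC [ params intermediates final-byte (colors, cursor moves)
#   OSC: ESC ] ... terminated by BEL or ESC \ (e.g. window-title changes)
ESCAPES = re.compile(
    r"\x1b\[[0-?]*[ -/]*[@-~]"              # CSI sequences
    r"|\x1b\][^\x07\x1b]*(?:\x07|\x1b\\)"   # OSC sequences
)

def sanitize(text: str) -> str:
    """Remove terminal escape sequences from untrusted model output."""
    return ESCAPES.sub("", text)

print(sanitize("\x1b[31mdanger\x1b[0m \x1b]0;evil title\x07ok"))
```

Without this, a model (or code it ran) could emit sequences that rewrite your terminal title or scroll back, which is exactly the injection class GPT-5.4 flagged.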
Claude (the model that wrote the code) had missed all three. Different model, different blind spots.
## Real Performance Numbers
| Tool | Time | What It Does |
|---|---|---|
| `explain` | 5.4s | Full code explanation |
| `review` | 15.7s | CRITICAL/WARNING/INFO classified review |
| `execute` | 2.8s | Task delegation with structured trace |
| `parallel_execute` | — | Up to 6 simultaneous tasks |
## Cross-Model Comparison
I ran Claude Agent and Codex in parallel on the same question: "Best thread-safe singleton pattern in Python?"
- Claude: metaclass + `Lock`, module-level variable, `__new__`
- Codex: module-level variable, `lru_cache`, `Lock` + classmethod
The `lru_cache` approach was unique to Codex; Claude hadn't considered it. Two models genuinely produce different solutions.
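For reference, the `lru_cache` singleton pattern works roughly like this (my own sketch, not Codex's exact answer). One caveat worth knowing: the cache itself is protected by an internal lock, but two threads racing the very first call can each run the constructor once before the cache settles; every later call returns the single cached instance.

```python
from functools import lru_cache

class Settings:
    """Stand-in for whatever object you want exactly one of."""
    def __init__(self) -> None:
        self.loaded = True

@lru_cache(maxsize=1)
def settings() -> Settings:
    # After the first call, lru_cache always returns the cached instance.
    return Settings()

assert settings() is settings()
```

It is the least code of the three Codex approaches, which is probably why it stood out.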
## Key Features
- Full JSONL trace parsing: tools, files, timing, errors
- Parallel execution: up to 6 tasks via `asyncio.gather`
- Session management: `threadId` persistence
- Adversarial review loop: GPT-5.4 challenges Claude's code
- Sandbox security: 3-tier policy plus terminal-injection prevention
- 56 tests: comprehensive coverage
- Single file: ~820 lines, zero external deps beyond FastMCP
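The parallel path boils down to a semaphore-capped `asyncio.gather`. A simplified sketch — the real tool spawns `codex exec` subprocesses where this sleeps, and the helper names below are mine, not the actual API:

```python
import asyncio

MAX_PARALLEL = 6  # mirrors the 6-task cap mentioned above

async def run_task(sem: asyncio.Semaphore, name: str) -> str:
    async with sem:
        # Stand-in for launching one `codex exec` subprocess per task.
        await asyncio.sleep(0.01)
        return f"{name}: done"

async def parallel_execute(names: list[str]) -> list[str]:
    sem = asyncio.Semaphore(MAX_PARALLEL)
    # gather preserves input order even though tasks finish out of order.
    return await asyncio.gather(*(run_task(sem, n) for n in names))

print(asyncio.run(parallel_execute([f"task{i}" for i in range(8)])))
```

The semaphore keeps at most six subprocesses alive at once while `gather` returns results in input order.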
## Get Started (3 Minutes)
```shell
npm install -g @openai/codex && codex login
git clone https://github.com/tsunamayo7/helix-codex.git
cd helix-codex && uv sync
```
Add the server to `~/.claude/settings.json` and you're done.
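An MCP server entry looks roughly like the following. Treat every name here as a placeholder: the exact top-level key, command, and args depend on your Claude Code version and where you cloned the repo.

```json
{
  "mcpServers": {
    "helix-codex": {
      "command": "uv",
      "args": ["run", "--directory", "/path/to/helix-codex", "server.py"]
    }
  }
}
```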
## What I Learned
- Different models have different blind spots. Cross-model review catches things self-review misses.
- Structured traces change everything. Raw text is useless for programmatic decisions.
- Parallel execution is underrated. Analyzing 6 files simultaneously saves real time.
GitHub: tsunamayo7/helix-codex (MIT license, 56 tests, Python 3.12+).
Star if useful! Feedback welcome.