Nishil Bhave

Posted on May 25 • Edited on May 31 • Originally published at maketocreate.com

Claude Code vs Codex CLI: 6 Months of Real Daily Use

#claudecode #codexcli #aicodingagents #terminalagents


Two terminal agents. One slot in your daily driver workflow. I've been running both Claude Code and OpenAI's Codex CLI as primary tools for the last six months — different repos, different stakes, different team setups. They look almost identical from the outside: a CLI, a permission prompt, a model that edits your files. Under the hood, they're not the same product at all.

JetBrains' April 2026 research shows Claude Code adoption at work jumped from roughly 3% (April–June 2025) to 18% by January 2026 — a 6x increase in nine months — and its customer satisfaction score hit 91%, the highest of any coding tool they tracked (JetBrains Research, 2026). Codex CLI grew from 82,000 monthly npm downloads at launch to 14.53 million by March 2026, a 177x increase (gradually.ai, 2026). Both are winning. They're winning for different reasons. This is the honest comparison.
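The two growth multiples quoted above can be rechecked from the raw figures. This is a minimal sketch; the helper name `growth_multiple` is mine, and the inputs are just the numbers cited in the paragraph.

```python
def growth_multiple(start: float, end: float) -> float:
    """Return how many times larger `end` is than `start`."""
    if start <= 0:
        raise ValueError("start must be positive")
    return end / start


# Claude Code adoption at work: roughly 3% -> 18% (JetBrains figures quoted above).
claude_adoption = growth_multiple(3, 18)

# Codex CLI monthly npm downloads: 82,000 -> 14.53 million (gradually.ai figures).
codex_downloads = growth_multiple(82_000, 14_530_000)

print(f"Claude Code adoption: {claude_adoption:.0f}x")
print(f"Codex CLI downloads: {codex_downloads:.0f}x")
```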


Key Takeaways

Claude Code wins on agentic quality and extensibility (Hooks, Skills, Subagents) but is closed-source and had four CVEs disclosed and patched across 2025–2026 (Check Point Research, 2026).

Codex CLI is Apache 2.0, Rust-native, and ships a stricter default sandbox — better for untrusted repos and pull request review work.

70% of developers run 2–4 AI tools at once (The Pragmatic Engineer, 2026). The right answer is usually both, with one as the daily driver.

Why Are Claude Code and Codex CLI Converging?

Both products believe the same thing: the IDE is a deeply customized editor, and an agent doesn't need to live inside it to be useful. 95% of engineers in the Pragmatic Engineer survey now use AI tools weekly, and 75% report AI handles at least half of their engineering work (The Pragmatic Engineer, 2026). When the agent is doing half the work, the question stops being "which editor extension" and starts being "which process runs my repo."

That's the philosophical convergence. A terminal agent reads your files, runs commands, watches output, and proposes changes. It's a long-running process that owns a working directory. Claude Code shipped this model in February 2025. Codex CLI shipped its first public version in April 2025 and then rewrote the whole thing in Rust by June 2025. The TypeScript prototype is gone, and the repo is now 95.6% Rust, with over 75,000 stars and 400 contributors (OpenAI Codex GitHub, 2026).

The convergence isn't surface-level. The daily ritual is genuinely the same: open a terminal in the repo, type a goal, watch a plan appear, approve or deny tool calls, accept the diff. If you blindfolded me and dropped me into either CLI mid-task, I'd need at least thirty seconds to figure out which one I was in. The differences only show up under load — when the agent gets confused, when something fails, when you need to do anything outside the happy path.


How Do GPT-5 and Claude Opus 4.7 Actually Behave on Real Codebases?

Among the models powering these two tools, Claude Opus 4.7 posts 87.6% on SWE-bench Verified, ahead of GPT-5.3-Codex at around 85% and the base GPT-5 at 74.9% (Vellum, 2026; LLM-Stats, 2026). That gap is real, but it can mislead: both models have likely seen a lot of public SWE-bench-like data in training, so the benchmark may partly measure how well a model has memorized the eval set rather than how it handles your code.

Here's what I see in practice. On a tangled refactor — say, lifting a service interface out of three coupled controllers in a legacy PHP/Laravel codebase — Claude Opus 4.7 produces a more cautious plan. It asks before touching shared types. It writes a checklist and follows it. It backs out cleanly when I tell it to. GPT-5.3-Codex is faster and bolder. It writes more code per turn, which is great when the code is right and painful when it isn't.

My finding: On a 20-file refactor I ran on the same Laravel repo, Claude Code needed 3 prompts and stopped to confirm 4 times. Codex CLI did it in 1 prompt but introduced two regressions that broke tests in unrelated files. The fix for the regressions took longer than the original task would have on Claude.

That's the consistent pattern. Claude is more conservative, more aligned with "ask first," and recovers from mistakes better. Codex is more aggressive, more willing to refactor adjacent code without asking, and faster on greenfield work. Pragmatic Engineer's 2026 survey reflects this preference split: 46% of engineers named Claude Code as the tool they love most, vs 19% for Cursor and 9% for GitHub Copilot (The Pragmatic Engineer, 2026).

Don't read that as "Codex is bad." Codex didn't exist when the 2025 survey ran, and it's already at 6% with momentum. Read it as "Claude Code has the strongest emotional pull right now, especially for engineers doing focused refactor and debug work."


What's the Real Difference in Sandboxing and Permissions?

Codex CLI ships with a stricter default. It combines three sandbox modes (read-only, workspace-write, and danger-full-access) with three approval modes (suggest, auto-edit, full-auto) (OpenAI Codex Sandboxing, 2026). By default it asks before every write and blocks network access from sandboxed commands. Claude Code has five permission modes (default, acceptEdits, plan, dontAsk, bypassPermissions) with file-level and command-level deny rules layered on top (Claude Code Permission Modes, 2026).

The naming is different. The actual capabilities are roughly equivalent. The difference that matters is the default. Codex's default refuses more aggressively. Claude's default trusts more aggressively. Neither is wrong; they reflect different assumptions about who's at the keyboard.
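To make the two permission surfaces easier to compare, here is a rough sketch that models them as plain enums. The mode names are the ones listed above; the class names and the idea of enumerating combinations are my own illustration, not either tool's actual API or config format.

```python
from enum import Enum
from itertools import product


class CodexSandbox(Enum):
    READ_ONLY = "read-only"
    WORKSPACE_WRITE = "workspace-write"
    DANGER_FULL_ACCESS = "danger-full-access"


class CodexApproval(Enum):
    SUGGEST = "suggest"
    AUTO_EDIT = "auto-edit"
    FULL_AUTO = "full-auto"


class ClaudePermissionMode(Enum):
    DEFAULT = "default"
    ACCEPT_EDITS = "acceptEdits"
    PLAN = "plan"
    DONT_ASK = "dontAsk"
    BYPASS_PERMISSIONS = "bypassPermissions"


# Codex splits the decision across two axes (where can it write, when does it ask),
# so its configuration space is the cross product of the two enums.
codex_combinations = list(product(CodexSandbox, CodexApproval))

# Claude Code folds both concerns into a single mode; deny rules are layered on separately.
claude_modes = list(ClaudePermissionMode)

print(f"Codex: {len(codex_combinations)} sandbox/approval combinations")
print(f"Claude Code: {len(claude_modes)} permission modes (plus deny rules)")
```

Seen this way, Codex exposes two independent dials while Claude Code exposes one dial plus a rule list, which is why the naming differs even though the reachable behaviors overlap.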

Then there's the security record. Four CVEs were disclosed against Claude Code across 2025–2026: CVE-2025-59536 (RCE via untrusted project config, CVSS 8.7), surfaced by Check Point Research (2026); CVE-2025-54794 (path bypass, CVSS 7.7) and CVE-2025-54795 (command injection, CVSS 8.7), both from Cymulate (2025); and CVE-2025-55284 (DNS exfiltration, CVSS 7.1) from Embrace The Red (2025). Anthropic patched all of them, and the underlying issue — that CLAUDE.md and .mcp.json files in a cloned repo could execute arbitrary shell on startup — is now mitigated. But the lesson is real: cloning a repo and immediately running Claude Code on it is not as safe as the UX makes it feel.

Why this matters: If you're reviewing untrusted pull requests or pulling random GitHub repos to investigate them, Codex's stricter default sandbox is the safer starting point. If you're working in a repo you own, on a machine you trust, with a workflow you've tuned, Claude Code's permission model is more ergonomic.


Which One Has the Stronger MCP and Extensibility Story?

Both support the Model Context Protocol. Claude Code shipped MCP first and shaped the spec. Codex CLI added MCP support in 2026 with stdio and Streamable HTTP transports, including OAuth, configured through ~/.codex/config.toml (OpenAI Codex MCP docs, 2026). The MCP ecosystem now has more than 10,000 public servers, and the protocol was donated to the Linux Foundation's Agentic AI Foundation in December 2025 (MCP Bundles, 2026).

So MCP support is no longer a Claude-only advantage. What is still Claude-only: Hooks, Skills, and Subagents.

Hooks intercept tool calls at nine documented lifecycle events (PreToolUse, PostToolUse, UserPromptSubmit, Stop, and others). They run as shell scripts, return exit codes, and let you build deterministic gates the model can't reason its way past.
Skills are reusable prompt + tool bundles installed via npm-style commands. Anthropic shipped them in October 2025 alongside Plugins, and they're how the broader ecosystem (skills.sh, etc.) packages workflows.
Subagents are model-launched workers with their own context windows. You spawn one for research, code review, or exploration, and the parent agent continues without polluting its context.

Codex doesn't have direct equivalents. You can build a lot of the same outcomes with shell wrappers and MCP servers, but you're rebuilding the framework. This is the part of the comparison that gets undersold in most reviews. The extensibility surface isn't a checkbox — it's a multiplier. Once you have a Skill that knows how to ship a feature in your repo, or a Hook that blocks rm -rf regardless of what the model thinks, the productivity gap widens fast.
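A deterministic gate like the rm -rf example might look roughly like the sketch below. It assumes the hook receives the pending tool call as JSON on stdin with `tool_name` and `tool_input.command` fields, and that exit code 2 means "block this call"; verify both against the current Claude Code hooks documentation before relying on them. The function names are mine, and the flag matching is illustrative rather than exhaustive.

```python
import json
import shlex
import sys

BLOCK = 2  # assumed exit code that tells the agent the tool call was refused
ALLOW = 0


def is_recursive_force_rm(command: str) -> bool:
    """Return True if the command invokes rm with both recursive and force flags."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes: fall back to naive whitespace splitting.
        tokens = command.split()
    for i, tok in enumerate(tokens):
        if tok != "rm" and not tok.endswith("/rm"):
            continue
        flags = set()
        # Deliberately conservative: every flag after `rm` counts, so this may over-block.
        for arg in tokens[i + 1:]:
            if arg == "--recursive":
                flags.add("r")
            elif arg == "--force":
                flags.add("f")
            elif arg.startswith("-") and not arg.startswith("--"):
                flags.update(arg[1:].lower())
        if {"r", "f"} <= flags:
            return True
    return False


def decide(payload: dict) -> int:
    """Map a tool-call payload (assumed shape) to an exit code."""
    if payload.get("tool_name") != "Bash":
        return ALLOW
    command = payload.get("tool_input", {}).get("command", "")
    return BLOCK if is_recursive_force_rm(command) else ALLOW


def main() -> int:
    code = decide(json.load(sys.stdin))
    if code == BLOCK:
        print("Blocked: recursive force delete is not allowed.", file=sys.stderr)
    return code

# When installed as a hook script, end the file with: sys.exit(main())
```

The point of the design is that the decision lives in ordinary code with an exit status, so no amount of model reasoning changes the outcome.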

The trade-off is that this surface is also the attack surface. Three of the four CVEs above exploited Hooks, MCP config files, or project-level instructions. Power and risk on the same axis.


How Does the Pricing Math Actually Compare?

Both tools start at the same price. Claude Code is included in Claude Pro at $20/month, with Max 5x at $100/month (5x the Pro rate limits) and Max 20x at $200/month (20x Pro limits) above that — or pay-per-token through the API (Anthropic Pricing, 2026). Codex CLI is bundled into ChatGPT Plus ($20/month), Pro ($200/month), Business, Enterprise, and Edu plans, plus pay-per-token through the OpenAI API.

Entry price is a wash — $20 either way. What differs is the ladder above it. Here's the math from my own usage:

Claude Pro or ChatGPT Plus ($20/mo): both real entry points, and both throttle hard on long agentic sessions. I burn through either one in roughly 2 hours of serious refactor work.
Claude Max 5x ($100/mo): comfortable for one developer doing 6–8 hours of agent-heavy work a day. I rarely hit limits — and Codex has no equivalent middle tier, so its next step up is $200.
Claude Max 20x or ChatGPT Pro ($200/mo): top tier for both. Max 20x rarely throttles me even on heavy solo days; ChatGPT Pro lifts Codex's ceiling the same way.
API for both: roughly comparable per-token, but Claude Sonnet 4.6 is significantly cheaper than Opus 4.7 for most coding tasks, and you can route between them in the same session.

The honest version: at $20 it's a genuine tie — both rate-limit you on heavy days. Claude's real edge is the $100 Max 5x tier, which has no ChatGPT counterpart and is the sweet spot for full-time agent work. At $200 the two are matched again. I run both — Max 20x for daily-driver work, ChatGPT Plus for the occasional Codex run on something Claude is being weird about.
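The ladder can be sketched as a small lookup. The prices are the ones quoted above; the `comfortable_hours` thresholds are my own anecdotal numbers from this section (about 2 hours on the $20 tiers, up to 8 on Max 5x), not published limits, and the names `Tier` and `cheapest_tier` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Tier:
    name: str
    monthly_usd: int
    # Hours of heavy agent work per day before throttling, per my own usage.
    # None means I have not found a practical ceiling as a solo developer.
    comfortable_hours: Optional[float]


# Both ladders are ordered cheapest first.
CLAUDE_LADDER = [
    Tier("Claude Pro", 20, 2),
    Tier("Claude Max 5x", 100, 8),
    Tier("Claude Max 20x", 200, None),
]
CODEX_LADDER = [
    Tier("ChatGPT Plus", 20, 2),
    Tier("ChatGPT Pro", 200, None),
]


def cheapest_tier(ladder: List[Tier], hours_per_day: float) -> Tier:
    """Return the cheapest tier whose comfortable range covers the given workload."""
    for tier in ladder:
        if tier.comfortable_hours is None or hours_per_day <= tier.comfortable_hours:
            return tier
    return ladder[-1]


for hours in (1, 6):
    c = cheapest_tier(CLAUDE_LADDER, hours)
    x = cheapest_tier(CODEX_LADDER, hours)
    print(f"{hours}h/day: {c.name} (${c.monthly_usd}) vs {x.name} (${x.monthly_usd})")
```

At 6 hours a day the lookup lands on $100 for Claude and $200 for Codex, which is the whole pricing difference in one line.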

Feature Matrix: Where Each Tool Genuinely Wins

Capability	Claude Code	Codex CLI
Underlying model	Claude Opus 4.7 / Sonnet 4.6	GPT-5 / GPT-5.3-Codex
SWE-bench Verified	87.6% (Opus 4.7)	~85% (5.3-Codex)
License	Closed-source (npm)	Apache 2.0 (Rust)
Permission modes	5 modes + deny rules	3 approval + 3 sandbox modes
MCP support	Yes (original)	Yes (stdio + Streamable HTTP)
Hooks	Yes (9 documented lifecycle events)	No direct equivalent
Skills	Yes (Oct 2025)	No
Subagents	Yes	No
Plugins	Yes	No
IDE extensions	VS Code, JetBrains	VS Code, JetBrains, Cursor, Windsurf
Desktop app	Yes (Mac/Windows), web, CLI	Yes (macOS Feb 2026, Windows Mar 2026)
Entry pricing	$20/mo (Claude Pro)	$20/mo (ChatGPT Plus)
Mid tier	$100/mo (Max 5x)	— (no equivalent)
Top pricing	$200/mo (Max 20x)	$200/mo (ChatGPT Pro)
Recent CVEs	4 disclosed and patched, 2025–2026	None publicly disclosed
CSAT	91%	Not publicly reported
GitHub stars	n/a (closed-source)	75K+
Open governance	No	Yes (Apache 2.0, 400 contributors)

The matrix tells you what; it doesn't tell you what to actually do. That's the next section.

Which One Should Be Your Daily Driver in 2026?

Use this framework. The decision isn't "which is better" — it's "which fits the work you do most."

Pick Claude Code as your daily driver if:

You spend most of your time in repos you trust (your own code, your team's code).
You want the strongest agentic quality and recovery behavior on hard tasks.
You'll actually use Hooks, Skills, or Subagents. That extensibility edge is the main reason to pick Claude over a roughly comparable Codex.
You do enough daily agent work to justify Max 5x ($100), though Pro ($20) is a fine place to start.
You value extensibility over open-source guarantees.

Pick Codex CLI as your daily driver if:

You frequently work on untrusted repos (PR review, OSS triage, security research).
You need open-source guarantees for legal or audit reasons.
You're already paying for ChatGPT Plus or Pro and want to avoid a second subscription.
You prefer GPT-5's faster, more aggressive coding style.
You want Codex's desktop app with built-in parallel agent management.

Run both if:

You're doing 6+ hours of agent-heavy work daily.
You want a second opinion on hard tasks (one agent's stuck plan often unblocks fast in the other).
You work across languages where the models diverge — I find Claude better on PHP/Ruby/Go, Codex slightly stronger on TypeScript/Python/Rust.

The Pragmatic Engineer survey backs this up: 70% of engineers run 2–4 AI tools simultaneously, and 15% run 5 or more (The Pragmatic Engineer, 2026). Treating this as a one-winner question is the wrong frame.
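If it helps, the checklists above can be compressed into a toy decision function. Everything here is a hypothetical simplification: the field names, the equal weighting, and the 6-hour cutoff are my shorthand for the criteria in this section, not a validated scoring model.

```python
from dataclasses import dataclass


@dataclass
class WorkProfile:
    mostly_trusted_repos: bool          # your own or your team's code
    needs_open_source: bool             # legal or audit requirement
    wants_hooks_skills_subagents: bool  # will actually use the extensibility layer
    agent_hours_per_day: float


def pick_daily_driver(p: WorkProfile) -> str:
    """Return 'Claude Code', 'Codex CLI', or 'both' for a given work profile."""
    # Heavy daily use is the "run both" case regardless of the other answers.
    if p.agent_hours_per_day >= 6:
        return "both"
    claude = int(p.mostly_trusted_repos) + int(p.wants_hooks_skills_subagents)
    codex = int(not p.mostly_trusted_repos) + int(p.needs_open_source)
    if claude == codex:
        return "both"  # no clear winner, so try each on real work
    return "Claude Code" if claude > codex else "Codex CLI"


print(pick_daily_driver(WorkProfile(True, False, True, 3)))
print(pick_daily_driver(WorkProfile(False, True, False, 2)))
```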

My current setup: Claude Code is the daily driver. Codex CLI is the second opinion. When Claude gets confused on a long task — usually around the 30-minute mark on something architecturally tangled — I'll fork the conversation, paste the state into a fresh Codex session, and see what it does. The disagreement is often more useful than either agent's answer alone.


Frequently Asked Questions

Is Claude Code's $100 Max tier worth it, or is the $20 Pro plan enough?

Both Claude Code and Codex start at $20/month (Claude Pro and ChatGPT Plus), and both throttle on heavy sessions at that tier. Claude Max 5x ($100/mo) gives roughly 5x Pro rate limits — enough for 6–8 hours of agent-heavy work daily without hitting walls (Anthropic Pricing, 2026). The crossover where Max pays for itself is around 3 hours/day of active agent use; below that, Pro is plenty.

Does Codex CLI support MCP servers?

Yes. Codex CLI added MCP support in 2026 with both stdio and Streamable HTTP (including OAuth) transports, configured in ~/.codex/config.toml (OpenAI Codex MCP, 2026). The MCP ecosystem has more than 10,000 public servers, and most work in both Claude Code and Codex CLI without modification.

Are Claude Code's CVEs a reason to avoid it?

Not really. All four CVEs, disclosed across 2025–2026, were patched within days of disclosure, and the underlying class of bug (trusting project-local config files on startup) has been mitigated (The Hacker News, 2026). The takeaway isn't "Claude is unsafe"; it's "don't run any agent on a freshly cloned untrusted repo without sandboxing."

Which model is actually better at coding, GPT-5 or Claude Opus 4.7?

On SWE-bench Verified, Claude Opus 4.7 leads at 87.6% vs GPT-5.3-Codex at ~85% and base GPT-5 at 74.9% (Vellum, 2026). In practice the gap is smaller and task-dependent. Claude is more cautious and recovers better from mistakes; GPT-5 is faster and more aggressive on greenfield work.

Can I use Claude Code on the cheaper Claude Pro plan?

Yes. Claude Code is included in Claude Pro at $20/month — the same entry price as ChatGPT Plus with Codex. Pro's rate limits are tight for heavy agentic work, which is why Max 5x ($100/mo) exists for full-time use. You can also pay per-token through the Anthropic API. There's no free tier for Claude Code itself (Anthropic Pricing, 2026).

Nishil Bhave is a developer and builder who writes about AI tooling, agentic workflows, and the practical realities of shipping with AI. He has been running Claude Code and Codex CLI as primary tools since their respective launches.

Conclusion

If you only have one slot, Claude Code is the daily driver I'd recommend for most engineers in mid-2026: 91% CSAT and the strongest extensibility story aren't accidents. If you also have $20/month for ChatGPT Plus, add Codex CLI as your second opinion. The cost of running both is a rounding error for any working engineer, and the dual-agent setup beats either one solo on hard tasks.

The terminal-agent paradigm is the new default. Pick the one that fits the work you do most, and don't agonize over the choice — both will be different products by Q4 2026, and the only mistake is staying on the sidelines.
