- Book: AI Agents Pocket Guide: Patterns for Building Autonomous Systems with LLMs
- Also by me: Prompt Engineering Pocket Guide: Techniques for Getting the Most from LLMs
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
In the first week of April 2026, three things happened that nobody planned together. Cursor shipped version 3, codename "Glass," with a rebuilt Agents Window for parallel cloud agents. OpenAI published codex-plugin-cc, an official plugin that runs inside Anthropic's Claude Code. And early adopters started running all three tools at once: Cursor in the editor, Claude Code in a terminal pane, and Codex CLI either standalone or piped through the OpenAI plugin. The result is a workflow nobody at any of those companies designed end to end. It's also, for some teams, the most productive setup of the year, and for others, expensive theatre.
Some of the convergence is real. Some of it is teams paying three vendors to feel busy. Below: where each tool earns its keep, plus a same-task comparison with the bills attached.
What changed in April 2026
Cursor 3 ("Glass") dropped on April 2. The Agents Window replaces the old Composer panel with a full workspace built around running multiple agents in parallel. You can spin up one agent to refactor a module, another to write tests, another to update docs, and watch all three move at once. It introduces local-to-cloud handoff so the agent runs on Cursor's infrastructure for the heavy parts. Per devtoolpicks' April 2026 review, the Pro tier held steady around $20/month base, with actual heavy-use spend landing at $40–50/month.
Claude Code kept its terminal-first design and remained the top performer on agentic coding benchmarks. Anthropic's CLI doesn't try to be an IDE — it lives in your terminal, runs as long as you want it to, and is happiest when given a clear plan and access to your shell.
Codex CLI is OpenAI's terminal coder. Per the thoughts.jock.pl 2026 harness writeup, it sat at 77.3% on SWE-bench Verified versus Claude Code at 80.8%, at roughly 3–4× lower token cost. Then OpenAI shipped the plugin that runs Codex from inside Claude Code as a sub-tool. You're now using Anthropic's harness to call OpenAI's model on cheap tasks, then handing back to Claude for the parts where it wins.
That's the stack. Three tools, three companies, no master plan.
Where they genuinely complement each other
There are three real reasons to run all three, and a lot of fake reasons.
Real reason 1: Cursor's parallel agents for shallow breadth-first work. The Agents Window is genuinely good at running 3–5 small tasks at once. "Update the import path in every file." "Add type hints to these 12 modules." "Generate a unit test stub for each public method in service/." You launch the agents in the morning, do something else, come back to a queue of PRs to review. Claude Code can do these too, but serially. Cursor wins on wall-clock time when the work is parallelizable and shallow.
Real reason 2: Claude Code for one deep task at a time. When you have a single hard problem — "the cache invalidation logic is wrong, find it and fix it" — Claude Code in a terminal beats the Agents Window. The terminal context is unbounded, the tools are uniform, and the agent doesn't get distracted by the IDE state. Long-horizon tasks live here.
Real reason 3: Codex (via the OpenAI plugin) for the cheap, narrow stuff. Generating a JSON schema. Drafting a regex. Writing a one-shot bash script. Codex is fast and cheap, and through the plugin you call it without leaving Claude Code's harness. You don't need Opus 4.7 to write a find . -name "*.tmp" -delete one-liner. Don't pay for it.
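As a concrete sketch of that handoff (the exact subcommand is an assumption: `codex exec` is Codex CLI's non-interactive mode in current builds, but check `codex --help` on yours):

```bash
# Hedged example: send a throwaway task to Codex instead of a frontier model.
# `codex exec` runs Codex CLI non-interactively; subcommand names have
# shifted between releases, so verify against your installed version.
codex exec "Write a one-line bash command that deletes every *.tmp file under the current directory"
```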
The honest pattern is: each tool wins at a specific shape of work, and the shape is different enough that switching between them is cheaper than forcing one tool to do all three.
Where the three-tool stack is theatre
The failure modes are easy to spot once you've been burned. Three of them keep showing up across teams.
Theatre 1: Running parallel Cursor agents on tasks that should be sequential. The Agents Window is dazzling. Five panels, five agents, five progress bars. It feels like you're 5× as productive. You are not. If task B depends on task A's output, parallelism is a lie — you're just paying for two agents to do half the work twice. Most "small tasks" in real codebases share state. Test the assumption before you fan out.
Theatre 2: Using all three for a task one tool would handle. "I'll have Cursor do the refactor, Claude Code review it, and Codex write the migration script." Each handoff is a chance for context to be lost, prompts to be rewritten, and bills to compound. If you can do the whole thing in Claude Code in one terminal, do that. Three-tool workflows earn their cost when the tasks are genuinely different shapes. They don't earn their cost when you're trying to look thorough.
Theatre 3: Running both Cursor's cloud agents and Claude Code's --continue sessions on the same machine. Now you have two long-running agent contexts, both remembering different things, both billing in the background. When they disagree, you mediate. That's not workflow, that's a meeting.
A decision rule
Watch teams do this for a while and the rule that holds up is depth-first. Pick the tool that matches the shape of the work, and only switch when the shape changes. Concretely:
- Single deep task, unclear scope, will take an hour or more — Claude Code.
- 3+ small parallel tasks, clear scope each — Cursor 3 Agents Window.
- One small task, low complexity, low cost target — Codex CLI (or via the Claude Code plugin).
- Anything visual / UI-shaped — Cursor's Design Mode in the IDE.
- Anything terminal-shaped (build, test, deploy) — Claude Code.
If a task spans shapes, start in Claude Code and call out to the others as plugins or subprocesses. The terminal is the lowest-friction integration point because every other tool has a CLI.
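Here's a minimal sketch of that routing as a shell wrapper. `claude -p` (headless print mode) and `codex exec` (non-interactive mode) are real entry points today, but flag names drift between releases, and the shape classification itself is left to you:

```bash
#!/usr/bin/env bash
# route.sh — pick the tool whose shape matches the task (a sketch, not a product).
shape="$1"; task="$2"
case "$shape" in
  deep)   claude -p "$task" ;;   # single long-horizon task, one context
  narrow) codex exec "$task" ;;  # cheap one-shot, low cost target
  wide)   echo "parallelizable: fan out in the Cursor 3 Agents Window" ;;
  *)      echo "usage: route.sh {deep|narrow|wide} \"task\"" >&2; exit 1 ;;
esac
```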
Workflow comparison: same task, three setups
The task: "Add OAuth login to a Flask app, write tests, update the docs."
Single-tool: Claude Code only
```
claude
> Add Google OAuth to this Flask app. Use Authlib.
> Write integration tests. Update README with setup steps.
```
One context, one bill. Claude Code reads the codebase, drafts the OAuth wiring, writes the tests, updates the README. Time: 22 minutes. Tokens (rough): 380k input, 95k output. On Sonnet 4.5 list pricing ($3/M input + $15/M output), the token math lands near $2.57; with cache misses and tool-call overhead in our test runs it came out at roughly $3, maybe $4 — order of magnitude, not a precise figure.
The drawback: everything is serial. The README update happens after the tests, which happen after the wiring, so wall-clock time tracks token spend.
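A quick sanity check on that token math, using the list prices quoted above (the one-liner is just arithmetic, not a billing API):

```bash
# 380k input at $3/M plus 95k output at $15/M, Sonnet 4.5 list pricing:
awk 'BEGIN { printf "$%.3f\n", 380000/1e6*3 + 95000/1e6*15 }'
# -> $2.565, i.e. the ~$2.57 figure before cache misses and tool-call overhead
```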
Three-tool stack: Cursor 3 + Claude Code + Codex
1. Cursor Agents Window:
   - Agent A: scaffold OAuth routes
   - Agent B: write integration tests
   - Agent C: update README
2. Claude Code (terminal): review the three Cursor PRs and fix the OAuth state-management bug Agent A introduced.
3. Codex CLI: generate the .env.example file with the new OAuth variables.
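Step 3 as an actual command might look like this; it's illustrative rather than a transcript from the test run, and the `app/` path and prompt wording are assumptions:

```bash
# Hypothetical step-3 invocation; adjust the path to your project layout.
codex exec "Read the new OAuth wiring in app/ and generate a .env.example listing each required variable with a placeholder value"
```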
Three contexts, three bills, real parallelism on the scaffolding step. In our test runs: 14 minutes wall-clock for scaffolding (the three agents ran together) + 11 minutes for the Claude Code review and fix + 30 seconds for Codex. Total wall-clock around 26 minutes, slightly worse than single-tool, because the integration step took longer.
Cost breakdown (rough, our own runs against Cursor Pro overage rates per devtoolpicks and standard Anthropic / OpenAI API list pricing as of April 2026):
- Cursor 3 cloud agents: ~$1.80 (three parallel runs)
- Claude Code review pass: ~$2.10 (smaller because most code was already drafted)
- Codex .env.example: ~$0.04
- Total: ~$3.94
The three-tool stack cost more, took slightly longer, and produced one extra bug (the OAuth state-management issue from Agent A) that needed Claude Code to find. On this task, single-tool wins.
When the three-tool stack actually wins
The same comparison flips when the task is genuinely parallelizable across files. Try: "Migrate 30 service files from requests to httpx." In our test runs, three Cursor agents handling ten files each in parallel finished in roughly 6 minutes wall-clock at about $2.10. Single-tool Claude Code, serial, took roughly 18 minutes at about $1.80. The three-tool stack costs slightly more but ships in a third of the wall-clock time. For a real engineer's morning, that's worth it.
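For reference, the same fan-out can be reproduced outside Cursor with Claude Code's headless mode. A rough sketch, where the paths, prompt, and worker count are illustrative and each invocation bills separately:

```bash
# Up to three headless Claude Code workers, one service file each.
printf '%s\n' services/*.py | xargs -P 3 -I{} \
  claude -p "In {}, replace every requests call with its httpx equivalent; keep behavior identical"
```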
The pattern: three-tool stack wins on width, single-tool wins on depth. Match the stack to the shape of the work, not to the marketing.
What to budget for
If you're piloting this stack, budget honestly:
- Cursor 3 Pro: $20/month base, $40–50/month with heavy Agents Window use.
- Claude Code: API-priced. A productive engineer lands around $80–200/month on Sonnet 4.5, more on Opus.
- Codex CLI: cheap. Easy to keep under $20/month even with daily use.
That's a $140–270/month per-engineer line. For a senior IC, it's nothing. For an org of 50, it's a real procurement conversation. What matters is the ratio of spend to time saved, not the absolute number. If the stack saves you four hours a week, the math works. If it's saving you "vibes," it doesn't.
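To make the ratio concrete with assumed numbers (the $75/hour loaded engineer cost is illustrative, not from the runs above):

```bash
# Hypothetical break-even check: value of 4 hours/week saved vs. stack cost.
# 4.33 = average weeks per month; $75/hour is an assumption, swap in your own.
awk 'BEGIN { printf "saved ~$%.0f vs spent ~$%.0f per month\n", 4*4.33*75, 270 }'
# -> saved ~$1299 vs spent ~$270 per month
```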
The honest convergence
None of these tools want to be the others' subprocess, but the OpenAI-plugin-inside-Claude-Code release is exactly that: Codex as a tool inside Anthropic's harness, with OpenAI shipping the integration because users were already wiring it up manually.
The best stack is whatever your fingers reach for at 4pm on a Friday. Each of these tools is impressive in isolation, and none of that matters if the workflow you actually ship is "open Cursor, get distracted by parallel agents, finish the work in Claude Code." Pick the tool that matches the shape of the next hour. The stack will sort itself out.
If this was useful
Most of the workflow patterns above are agent-orchestration problems wearing IDE clothes. The pocket guides linked at the top cover the underlying patterns: how agents coordinate, how prompts shape what each tier of the stack does well, and the wrapping patterns that keep your workflow tool-agnostic.

