I gave two big coding tasks to both Claude and Codex.
Claude finished in about one hour. Codex took about eight.
That sounds like a clean win for Claude until you look at what came back. The Claude output was fast, confident, and useless. Bad assumptions, broken code, missing integration points, half-followed rules, and a shape that would make the codebase harder to maintain even if I patched it into working order.
I threw it away.
The Codex output took a full workday. It read the repo. It followed the local instructions. It reused existing patterns. It added tests. It ran checks. And when I ran the result the first time, it worked.
That changed how I think about AI coding tools.
I am not writing this as a casual weekend test. I have shipped three products in the last four months with AI doing a serious amount of the implementation work. I use these tools hard enough to burn through weekly limits on a $200/month subscription. This is still personal opinion, not a benchmark, but it is opinion built from long hours in real repos where broken code costs me time.
Key Takeaways
- Claude optimizes for motion. Codex optimizes for grounded work.
- Fast output is not fast delivery if you spend the next day cleaning it up.
- Codex's friction is annoying, but most of that friction is it reading, verifying, and respecting project rules.
- Superpowers improves both tools, but each agent follows the workflow differently.
- Claude's subagent defaults are useful, but delegation does not save you when the main agent is willing to invent around your codebase.
- For work I actually ship, I now trust the slower agent more.
Claude feels like an engineer who wants to finish the ticket before lunch.
Codex feels like an engineer who annoys you by reading every linked file before touching the code, then quietly hands you something that passes tests.
The one-hour trap
The trap is simple: Claude makes progress visible very quickly.
It edits. It assumes. It finds a path through your rules. It writes what it thinks should exist. It may use subagents aggressively. It may produce a lot of code fast. That feels productive while it is happening.
Then you run it.
The first failure is usually small. A type mismatch. A missing import. A component API that does not exist. A hook that follows a generic React pattern instead of your local data-fetching pattern.
You ask for a fix. Claude patches the symptom. Now something else breaks.
You ask why the original thing failed. Instead of tracing the root cause, it often jumps into another patch. The code starts collecting band-aids. The more you accept, the harder it becomes to tell what is necessary and what is agent residue.
That is how a codebase becomes fragile. Not from one bad commit. From many plausible commits that never had to prove they matched the system.
The eight-hour tradeoff
Codex is slower in a way that can be frustrating.
Even for simple work, it tends to read first. It opens the files. It checks instructions. It searches for existing patterns. It tries to understand why the current code is shaped the way it is. It is much less eager to introduce a new abstraction just because it can.
To be fair, that eight-hour run was not an optimized Codex setup. I tested Codex almost as soon as I installed it. I only had Superpowers installed. I had not yet tuned my AGENTS.md, wired a stronger agent workflow, or built the muscle memory for telling Codex when to split work.
So a lot of the time was self-inflicted setup friction.
Codex also tested after almost every meaningful change. That ate most of the wall-clock time. It would edit, run the relevant check, read the failure, patch, run again, and keep going. That is slow compared with an agent that writes a big diff and says "done." It is also why the final result worked.
The other missing piece was default subagents. Claude tends to fan out work more aggressively. My Codex setup did not do that by default. Without explicit subagent instructions, more of the work stayed in the main thread, which made the run longer than it needed to be.
That patience costs time.
But the time is not empty. It is spent on the things that usually decide whether code survives first contact with the repo:
- Does this match the surrounding modules?
- Is there already a helper for this?
- Are there project rules that override the obvious solution?
- Is the failure the root cause, or just the first visible symptom?
- What is the smallest check that proves this change worked? (See the sketch after this list.)
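That last question is the one I can show concretely. A minimal sketch, assuming a Node repo with Vitest as the test runner; the path and file name are hypothetical:

```bash
# Smallest check that proves the change: run only the tests covering
# the file that was edited, not the entire suite.
# Runner, path, and file name are illustrative.
npx vitest run src/billing/invoice.test.ts
```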
I do not want a coding agent that wins the stopwatch and loses the branch.
I want the branch.
Search behavior is a bigger difference than people admit
Claude tends to work from model knowledge unless you push it to fetch current information.
For stable things, that is fine. For modern SDKs, AI tools, product docs, pricing, CLI behavior, framework releases, or anything that may have changed last month, it is dangerous.
I do not want an agent confidently writing against a six-month-old mental model of a library.
Codex is much more willing to say: this might have changed, search first. The installed CLI even exposes --search, and the current Codex docs have first-class pages for rules and subagents. The exact feature surface changes fast, which is the point: for fast-moving tools, searching is not optional polish. It is correctness.
This is one of the reasons Codex feels slow. It spends time making sure the premise is still true.
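In practice, that looks roughly like this from the terminal. The --search flag is the one mentioned above; the prompt and exact invocation are illustrative, so check your installed CLI's help output:

```bash
# Enable web search so the agent can verify current library behavior
# instead of trusting a stale mental model. The prompt is made up.
codex --search "Check whether our pinned auth SDK changed its session API before refactoring"
```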
Rules matter more than intelligence
My biggest complaint about Claude is not that it is dumb. It is not. It is powerful.
The problem is that it is too willing to route around constraints.
If a repo says "do not edit env files," I do not want the agent to decide that .env.example is close enough to documentation. If a repo says "do not add scripts unless asked," I do not want the agent to add a convenience script because it makes its own workflow nicer. If I ask "why is this not working," I do not want three patch attempts before the root cause is found.
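To make that concrete, these are the kinds of rules I mean. An illustrative sketch of a rules file, not an excerpt from a real repo:

```markdown
# AGENTS.md (illustrative sketch)

- Do not edit env files, including .env.example.
- Do not add scripts or dependencies unless explicitly asked.
- When a fix fails, trace the root cause before patching again.
- Prefer existing helpers and local patterns over new abstractions.
```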
Codex is slower partly because the rules stay heavy in its head.
It treats instructions as part of the job, not as decoration. That matters in real repos, where the fastest fix is often the fix that breaks a migration, a hook, a deploy job, or another engineer's in-progress work.
Where Superpowers fits
I also use Superpowers with both Claude and Codex.
Superpowers is a workflow plugin by Jesse Vincent. The pitch is simple: do not let the agent jump straight into coding. Make it brainstorm, plan, route work, use TDD when appropriate, review the result, and verify before claiming success. The official Claude marketplace page describes it as a skills framework for brainstorming, subagent development, debugging, TDD, and skill authoring. The GitHub repo also documents Codex installation, so this is not a Claude-only idea.
Claude does pretty well with Superpowers. It understands the workflow and can move fast once the plan exists. The problem is that Claude still feels like it is deciding when the workflow matters. Sometimes it invokes the right skill. Sometimes it treats the skill like advice. Sometimes it decides the task is obvious and starts moving.
Codex behaves differently. If the rules say "use Superpowers to route this," Codex is much more likely to actually do that. It will invoke the routing step, check which skill applies, and follow the workflow even when I personally think the task could have been done in two edits. That strictness was part of why my first Codex run took so long: Superpowers wanted routing and planning, and Codex treated that as a rule, not a suggestion.
That is both good and bad.
For big tasks, I want that discipline. Route first. Plan first. Use subagents when the work is independent. Review before declaring done. That is exactly where Superpowers shines.
For small tasks, it can be overkill. If I ask for a copy tweak or a one-line bug fix, I do not always need a methodology. This is the same tradeoff as Codex itself: the discipline that makes it reliable on large work can feel heavy on tiny work.
Still, if I have to choose which failure mode I prefer, I will take "too much process" over "fast wrong code" almost every time.
Subagents are not magic
Claude defaults harder toward subagents. Codex can use subagents too, but in my workflow I usually need to ask explicitly or encode it in AGENTS.md. That lines up with the Codex subagents docs, which describe them as explicit helpers for larger tasks. On that first run, I had not done the tuning yet, so Codex paid the cost of doing too much sequentially.
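When I do want Codex to fan out, the instruction has to live somewhere explicit. A hypothetical AGENTS.md entry, worded by me rather than taken from the docs:

```markdown
# AGENTS.md (hypothetical delegation rule)

- For tasks spanning three or more independent modules, delegate one
  subagent per module and review the merged result before finishing.
- Keep changes that cross module boundaries in the main thread.
```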
Claude's more aggressive default is a real advantage. Subagents can isolate context, split work, and make large tasks move faster.
But delegation only helps if the delegated work is bounded and grounded.
If the parent agent makes bad assumptions, you get multiple workers implementing bad assumptions in parallel. That is not leverage. That is multiplication.
Codex's default is more conservative. It does not fan work out as aggressively unless the task clearly benefits from it. Sometimes that costs time. Sometimes it prevents a mess.
The real speed metric
Here is the metric I care about now:
How long from prompt to shippable branch?
Not prompt to diff.
Not prompt to "I made the changes."
Not prompt to a confident summary.
Prompt to code I can ship without feeling like I need to audit every line for agent damage.
On that metric, the one-hour Claude run was slower than the eight-hour Codex run because the Claude output had no path to trust. Every line needed suspicion. The Codex output had already done the boring work: inspect, edit, test, verify.
That is why I keep using Codex even when it annoys me.
I do not need the fastest agent.
I need the one whose output I can ship with confidence.
Originally published at harryy.me