I went into the Claude Code vs Codex debate expecting the usual answer: compare model quality, pick the smartest one, move on.
That is not what I found.
The most useful Reddit threads were not really about intelligence. They were about what happens when you let an agent run for hours inside a real coding workflow: reading half a repo, retrying patches, dragging memory files forward, and quietly burning through context or quota before the actual task even starts.
One r/openclaw thread made the whole thing click for me. A user said their first Claude request consumed 53% of a Pro session. Two more requests pushed usage to 76%. Finishing the task took another 23%.
That is not a benchmark problem.
That is an operations problem.
## The moment “cheap” stops being cheap
Here’s the thread that changed how I think about this:
The user was not asking whether Claude was smart.
They were basically asking: how are people running autonomous coding loops like this without constantly watching the meter?
That is a much better question.
If you are using OpenClaw, Claude Code, Codex, or any custom agent harness, your costs are not driven by a neat prompt/response model anymore. They are driven by:
- how much context gets loaded before the first tool call
- how often the agent retries
- how much state gets resent between steps
- how aggressively the orchestrator summarizes and rehydrates memory
- whether your pricing model punishes long-running loops
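Those drivers compound with each other, which a toy cost model makes obvious. Everything below is a sketch with invented numbers, not a billing formula:

```python
# Toy model of per-run token burn (all numbers are invented for illustration).

def estimate_run_tokens(first_turn_context: int,
                        steps: int,
                        state_resent_per_step: int,
                        retry_rate: float,
                        tokens_per_step: int) -> int:
    """Rough token estimate for one autonomous loop."""
    burn = first_turn_context
    for _ in range(steps):
        # Each step resends carried state plus its own work,
        # inflated by the expected retry overhead.
        burn += int((state_resent_per_step + tokens_per_step) * (1 + retry_rate))
    return burn

# A disciplined loop vs a sloppy one, same task size.
lean = estimate_run_tokens(4_000, 20, 1_000, 0.1, 2_000)
bloated = estimate_run_tokens(40_000, 20, 8_000, 0.5, 2_000)
print(lean, bloated)  # the bloated loop burns several times more
```

The point is not the exact numbers; it is that first-turn context, resent state, and retry rate multiply each other, so fixing any one of them pays off across every step.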
That is why “best coding model” discourse gets weird fast. People think they are comparing brains. They are actually comparing entire operating environments.
## The real failure mode: context bloat before work starts
The best advice in that thread fit in a single line:

> Context context context! Use /new /compact often
That sounds obvious. It is also the whole game.
Across multiple OpenClaw discussions, the complaints are very consistent:
- workspace files loaded too early
- memory files and notes bloating the first turn
- AGENTS.md getting shoved into context by default
- orchestration layers resending too much state
- broad file inclusion causing the agent to inspect everything
- session caps turning normal agent behavior into a budgeting exercise
One commenter said OpenClaw can spend “a lot of context before you even type the first message.”
That line should make every agent engineer stop and check their setup.
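One way to act on that warning is to measure the first turn before anything is dispatched. This is a hypothetical sketch: the part names, the ~4-characters-per-token approximation, and the 8k budget are all assumptions, not anyone's real API:

```python
# Sketch of a first-turn budget check (names and limits are hypothetical).
# The idea: refuse to dispatch until the pre-task payload fits a hard cap.

def first_turn_size(parts: dict[str, str]) -> dict[str, int]:
    """Approximate token count per context part (~4 chars per token)."""
    return {name: len(text) // 4 for name, text in parts.items()}

def check_budget(parts: dict[str, str], budget_tokens: int = 8_000) -> list[str]:
    """Return the parts to drop (largest first) until the first turn fits."""
    sizes = first_turn_size(parts)
    total = sum(sizes.values())
    dropped = []
    for name in sorted(sizes, key=sizes.get, reverse=True):
        if total <= budget_tokens:
            break
        total -= sizes[name]
        dropped.append(name)
    return dropped

parts = {
    "task": "Fix the flaky retry logic in the API client.",
    "AGENTS.md": "x" * 60_000,       # auto-included project conventions
    "memory": "x" * 40_000,          # carried-over session notes
    "workspace_summary": "x" * 20_000,
}
print(check_budget(parts))  # the bloat gets flagged before the first call
```

Even this crude version catches the failure mode from the thread: the task itself is tiny, and everything else in the first turn is furniture.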
## Claude, Codex, and OpenClaw are not competing on one axis
Here is the practical version.
| Option | What actually matters in practice |
|---|---|
| Claude Code via Claude subscription or API | Strong coding quality on hard engineering work, but long loops can become session-sensitive fast when context is unmanaged |
| OpenAI Codex / GPT-5.4 Codex setups | Often feel more tolerant for iterative coding loops, but ceilings still exist and behavior depends heavily on the surrounding harness |
| OpenClaw-style orchestration | Great for autonomous workflows, but can amplify token burn through memory, workspace loading, retries, and tool chatter if you do not constrain it |
That last row is the important one.
OpenClaw is not expensive by itself. It is an amplifier.
If your stack is disciplined, it amplifies good routing and clean context boundaries.
If your stack is sloppy, it amplifies waste.
## The most honest Reddit take was basically a postmortem
Another thread put it better than any product page could:
- r/openclaw: https://reddit.com/r/openclaw/comments/1tcyqda/how_good_is_openclaw_at_orchestrating_codex_and/
One user wrote:

> It is good and bad at the same time. How i fixed the bad things i built a skills specifically for coding give the agent context about specific things i want.
That is not polished. It is useful.
Their fix was not “switch models.”
Their fix was to narrow the context boundary.
That matches what I keep seeing in practice. The teams getting the best results are not just choosing Claude or Codex. They are shaping what each model sees and when it sees it.
## The expensive mistake: asking one model to do everything
A lot of teams still want one model for the entire workflow.
Repo scanning. Planning. Refactors. Patch generation. Retry loops. Documentation. Triage. Tool use. Cleanup.
That sounds elegant. It is usually lazy architecture.
The stronger pattern is task-based routing.
For example:
- Qwen or GLM locally for cheap utility work
- Gemini Flash for fast lightweight passes
- GPT-5.4 Codex for coding loops that need speed and tolerance
- Claude Opus or Claude Sonnet for harder engineering judgment
That is not indecision. That is treating your agent stack like infrastructure instead of fandom.
## Why pricing gets ugly once agents are autonomous
This part is where the model debate stops being theoretical.
One Reddit user said they gave up on OpenClaw after about 3.5 months, 1,300 hours, nearly 5 billion tokens, and around $700 in spend.
Another post referenced $2,500 of Opus token spend for software-shop workflows involving vision, server management, and form filling.
Those are not “I asked too many questions” numbers.
Those are “my workflow became a utility bill” numbers.
And yes, there are contradictory reports too:
- some users say they rarely hit Codex limits
- some say they hit them in a few days
- some say premium plans feel effectively unlimited
- others say the ceilings show up exactly when the agent gets useful
That is not actually contradictory.
It just means agent economics are highly sensitive to runtime behavior.
## What developers should measure instead of arguing about IQ
If you are evaluating Claude Code vs Codex for real engineering work, I would measure these five things before I cared about leaderboard screenshots:
- First-turn context size
- Average retry count per task
- Tool call volume per successful patch
- State carried between turns
- Cost or quota burn per hour of autonomous runtime
That gives you a real picture of whether a setup survives long loops.
Benchmarks do not tell you that.
A two-hour unattended refactor does.
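Those five measurements fit naturally into a per-run record. A minimal sketch (field names and thresholds are my own, not from any library):

```python
# The five measurements above, as a per-run record (field names are my own).
from dataclasses import dataclass

@dataclass
class AgentRunMetrics:
    first_turn_context_tokens: int   # loaded before the first tool call
    retries_per_task: float          # average retry count per task
    tool_calls_per_patch: float      # tool call volume per successful patch
    carried_state_tokens: int        # state resent between turns
    burn_per_hour: float             # cost or quota per hour of runtime

    def red_flags(self) -> list[str]:
        """Crude thresholds; tune them against your own baseline runs."""
        flags = []
        if self.first_turn_context_tokens > 20_000:
            flags.append("first turn is bloated")
        if self.retries_per_task > 2:
            flags.append("retry loop is churning")
        if self.tool_calls_per_patch > 30:
            flags.append("tool chatter is high")
        return flags

run = AgentRunMetrics(45_000, 3.2, 41, 12_000, 6.5)
print(run.red_flags())
```

Collect one of these per unattended run and the Claude-vs-Codex question often answers itself before any model comparison starts.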
## What I would actually do on Monday
Start with context hygiene, not model shopping.
If your stack includes OpenClaw, Claude CLI, Codex, Ollama, or a custom orchestrator, check the boring stuff first.
```
/new
/compact
openclaw logs --follow
ollama list
```
If Ollama is in the stack, also check:
`http://localhost:11434/`
One commenter pointed out that Ollama may start with a 4096-token context even if the model supports 32k. That kind of mismatch is exactly how teams end up blaming Claude, Codex, or OpenClaw for what is really configuration drift.
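Ollama's `/api/generate` endpoint accepts an `options.num_ctx` override, which is the usual fix for that drift. A sketch that builds the request body and flags the mismatch offline (the model name and threshold here are placeholders):

```python
# Sketch: make the Ollama context window explicit instead of trusting the
# default. The /api/generate endpoint accepts an options.num_ctx override;
# the model name and mismatch threshold are placeholders.
import json

def generate_payload(model: str, prompt: str, num_ctx: int) -> str:
    """Build an Ollama /api/generate body that pins the context window."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "options": {"num_ctx": num_ctx},  # override the server default
    })

def ctx_mismatch(default_ctx: int, model_max_ctx: int) -> bool:
    """True when the runtime default wastes most of the model's window."""
    return default_ctx < model_max_ctx // 2

# A 4096 default against a 32k-capable model is exactly the drift described.
print(ctx_mismatch(4096, 32_768))  # True
body = generate_payload("qwen2.5-coder", "summarize this diff", 32_768)
```

POST the body to `http://localhost:11434/api/generate` to apply the override per request; a `PARAMETER num_ctx` line in a Modelfile makes it permanent.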
## A practical checklist for reducing burn
### 1) Trim what loads before the first task
Do not automatically stuff these into the initial turn:
- workspace summaries
- memory files
- project notes
- AGENTS.md
- stale task history
Load them intentionally.
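"Load them intentionally" can be as simple as a registry of loaders that only fire when a task asks for a part by name. A hypothetical sketch, not any particular framework's API:

```python
# Hypothetical sketch: context parts are registered as loaders and pulled
# only when a task explicitly asks for them, never by default.
from typing import Callable

CONTEXT_LOADERS: dict[str, Callable[[], str]] = {
    "agents_md": lambda: "# AGENTS.md contents...",
    "memory": lambda: "carried session notes...",
    "project_notes": lambda: "project notes...",
}

def build_first_turn(task_prompt: str, wants: list[str]) -> str:
    """Assemble the first turn from the task plus only the requested parts."""
    parts = [CONTEXT_LOADERS[name]() for name in wants if name in CONTEXT_LOADERS]
    return "\n\n".join([task_prompt, *parts])

# Nothing optional is loaded unless the task asks for it.
minimal = build_first_turn("Fix the failing test in tests/api.", [])
heavy = build_first_turn("Plan the refactor.", ["agents_md", "memory"])
print(len(minimal) < len(heavy))  # True
```

The design choice is the default: opt-in context stays flat as the project grows, while opt-out context grows with every file someone adds.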
### 2) Build narrower skills
If your coding skill only needs a few files and a small set of tools, make that explicit.
Bad:

```yaml
scope:
  files: "**/*"
  tools: [read, write, bash, grep, search, browser]
```
Better:

```yaml
scope:
  files:
    - src/api/**
    - tests/api/**
  tools: [read, write, grep]
```
### 3) Reset aggressively
If your agent has finished a subtask, compact or start fresh.
```
/new
/compact
```
These are not hacks. They are maintenance.
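In a harness you control, the same maintenance can be automated: compact between subtasks so context never accumulates across unrelated work. A toy sketch with a stand-in `Agent` class and invented token numbers:

```python
# Toy sketch of "reset aggressively": compact between subtasks so context
# never accumulates across unrelated work. The Agent class is a stand-in
# for whatever harness you actually run; the token numbers are invented.

class Agent:
    def __init__(self):
        self.context_tokens = 0

    def run(self, subtask: str) -> None:
        self.context_tokens += 5_000   # pretend each subtask adds context

    def compact(self) -> None:
        self.context_tokens = 1_000    # keep only a short summary

def run_plan(agent: Agent, subtasks: list[str], compact_every: int = 1) -> int:
    """Run subtasks, compacting periodically; return peak context size."""
    peak = 0
    for i, subtask in enumerate(subtasks, start=1):
        agent.run(subtask)
        peak = max(peak, agent.context_tokens)
        if i % compact_every == 0:
            agent.compact()            # the /compact equivalent
    return peak

print(run_plan(Agent(), ["a", "b", "c", "d"]))         # compacting: flat peak
print(run_plan(Agent(), ["a", "b", "c", "d"], 10**9))  # never: context climbs
```

The second run is what an unattended loop looks like without resets: peak context grows linearly with subtask count, and cost grows with it.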
### 4) Route by task type
Do not spend premium-model context on janitorial work.
A rough example:
```python
def route_task(task):
    # Cheap janitorial work should never burn premium-model context.
    if task.type in ["lint_fix", "file_scan", "summarize_logs"]:
        return "gemini-flash"
    # Hard engineering judgment goes to the strongest model.
    if task.type in ["architecture_decision", "complex_refactor", "ambiguous_bug"]:
        return "claude-opus"
    # Default: the tolerant coding-loop model.
    return "gpt-5.4-codex"
```
The exact models can change. The principle should not.
### 5) Watch orchestration overhead
Your model is not the only thing spending budget.
Your orchestrator may be:
- summarizing after every step
- rehydrating old memory too often
- retrying tool calls too aggressively
- passing giant intermediate outputs between agents
Log it.
Measure it.
Kill the waste.
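Logging that overhead can be as cheap as a decorator around every tool call. A sketch with hypothetical tool names and a crude ~4-characters-per-token estimate:

```python
# Sketch of logging orchestration overhead: wrap every tool call and count
# what the orchestrator itself pushes through the model (names hypothetical).
import functools

TOOL_LOG: list[tuple[str, int, int]] = []  # (tool, tokens_in, tokens_out)

def logged(tool_name: str):
    def wrap(fn):
        @functools.wraps(fn)
        def inner(payload: str) -> str:
            result = fn(payload)
            # ~4 chars per token is a crude but serviceable approximation.
            TOOL_LOG.append((tool_name, len(payload) // 4, len(result) // 4))
            return result
        return inner
    return wrap

@logged("summarize")
def summarize(payload: str) -> str:
    return payload[:100]  # stand-in for a model call

summarize("x" * 8_000)
summarize("y" * 8_000)
total_in = sum(t_in for _, t_in, _ in TOOL_LOG)
print(total_in)  # how much the orchestrator resent across tool calls
```

Once the log exists, the waste tends to be obvious: a handful of tools usually account for most of the resent tokens.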
## My actual take on Claude Code vs Codex
For hard engineering judgment, I still think Claude Opus earns its reputation. A lot of developers clearly trust it more when the task is ambiguous, architectural, or full of tradeoffs.
For coding loops that need tolerance, speed, and less emotional damage from quota watching, Codex-style setups can feel better.
But the bigger point is this:
The winner is usually not one model.
The winner is the stack that gives you:
- a strong model for hard reasoning
- a cheaper or flatter-cost path for repetitive loops
- ruthless control over context growth
If you only optimize for model quality, you get great demos and bad economics.
If you only optimize for price, you save money until the work gets real.
## The part most teams eventually realize
Once your agents run long enough, pricing stops feeling like pricing and starts feeling like system design.
That is exactly why flat-cost compute is getting more interesting for agent teams.
If your workflow depends on long autonomous runs, per-token billing turns every orchestration mistake into a cost spike. You end up engineering around the invoice instead of engineering around the task.
That is the appeal of Standard Compute:
- unlimited AI compute for a flat monthly price
- OpenAI-compatible API, so you can keep your existing SDKs and clients
- works with agent and automation stacks like n8n, Make, Zapier, OpenClaw, and custom workflows
- dynamic routing across GPT-5.4, Claude Opus 4.6, and Grok 4.20
That does not remove the need for context discipline.
It just means your agent can keep working without you treating every long loop like a billing incident.
And honestly, that is the real question behind Claude Code vs Codex.
Not which model is smarter.
Which setup lets the agent keep going without turning you into its accountant?