Lars Winstand

Posted on May 27 • Originally published at standardcompute.com

Codex vs Claude stopped being the wrong question once I split my coding agent into a worker and an advisor

#ai #llm #devops #productivity

I used to think the coding-agent model debate was a knife fight.

Pick one model. Commit. Hope you picked the right one.

That’s still how a lot of people frame codex vs claude. One session, one model, one price tier.

After digging through some OpenClaw discussions and ClawCodex docs, I think that framing is outdated.

The more useful question is:

Which model should do which part of the work?

Once I started thinking about coding agents that way, the architecture got a lot simpler:

let a cheap worker model do the repetitive repo work
call a stronger advisor model only at risky forks
keep routing in infrastructure, not in human memory

That sounds obvious in hindsight. It also sounds a lot cheaper.

ClawCodex claims this pattern can be about 6× cheaper than running Claude Opus for the full session. Whether your exact number is 6× or 3× or 10× depends on provider pricing and routing. The bigger point is the same: most coding-agent tokens are spent on boring work, not on high-stakes reasoning.

Most of your agent session is just expensive shovel work

A typical coding loop is not glamorous.

It’s usually:

read files
inspect code
patch code
run tests
read stack trace
patch again
rerun tests
repeat until the bug gives up or you do

That is not where I want to spend premium-model money.

That is worker-model territory.

ClawCodex makes this explicit with /advisor mode:

/advisor anthropic:claude-opus-4-7

The idea is simple:

a cheaper model handles the execution loop
a stronger model gets consulted only when judgment matters

That’s a much better split than paying for Claude Opus to watch your agent open files and rerun pytest for 20 minutes.

Why this split makes sense in practice

The worker should do:

file reads
code edits
test loops
command execution
grep/search
incremental fixes

The advisor should do:

architecture decisions
migration safety checks
root-cause review
final code review
“are we solving the right problem?” checks

That division maps to how senior engineers actually work.

You don’t ask your most expensive person on the team to sit around refreshing test output. You ask them to step in when the implementation is about to go sideways.

The pricing gap is big enough to matter

ClawCodex’s docs list example pricing like this:

Model	Best role right now
Claude Haiku 4.5	Low-cost worker for file reads, edits, and test loops; listed by ClawCodex at $1 input / $5 output per 1M tokens
DeepSeek V4 Pro	Cheap cross-provider worker candidate; listed at $1.74 input / $3.48 output per 1M tokens, with very low cache-hit input pricing
Claude Opus 4.7	High-cost advisor for architectural and risk review at decision points; listed by ClawCodex at $5 input / $25 output per 1M tokens

Even if routing is imperfect, that spread is wide enough that role-based model selection can save real money.

And if you’re running agents all day in CI, in internal tools, or in automations, these savings stop being theoretical very quickly.

The important shift: the worker does not need to be Claude

This is the part I think more teams should take seriously.

Once you separate worker from advisor, the worker doesn’t need to come from the same provider.

A session can look more like this:

worker: deepseek/deepseek-v4-pro
advisor: anthropic:claude-opus-4-7

That’s not just an Anthropic trick. That’s a routing pattern.

And once you have a routing pattern, you can swap providers based on cost, latency, or reliability instead of treating model choice like a religious commitment.

The advisor should be short and sharp

This part matters more than people think.

A bad advisor setup turns into two models talking too much.

That’s expensive. It also pollutes context.

The advisor should not restate the worker’s entire plan every turn. It should behave more like a staff engineer doing a quick review:

this migration path is unsafe
you’re fixing symptoms, not the root cause
isolate the patch instead of refactoring now
this test passing does not prove the bug is fixed

That’s the high-value use of a premium model.

If your advisor is basically a second intern writing long summaries, you’re burning tokens for very little gain.

What I’d actually build

If I were setting up coding agents for a team today, I’d use this pattern:

1. Worker model for the execution loop

Use a cheaper model for repo exploration, edits, commands, and tests.

2. Advisor model for risky decisions

Only call the expensive model when:

the agent wants to change architecture
the fix touches migrations or data integrity
the failure mode is weird
the worker keeps circling
you want a final review before merge

3. Gateway for routing and failover

Put model routing in infrastructure.

That means using something like OpenClaw or LiteLLM as the layer that decides:

which provider gets the request
which model is the worker
which model is the advisor
what happens on retry or fallback

That is much better than asking every developer to manually switch models mid-session.

Example: keep the client simple

One reason this pattern is practical is that you don’t need to rewrite your whole stack.

If your tooling already speaks the OpenAI API shape, a gateway can hide most of the complexity underneath.

That’s the part a lot of teams miss.

You can keep your existing SDK flow and change the economics at the routing layer.

For example, your app code can stay boring:

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: process.env.LLM_BASE_URL
});

const res = await client.chat.completions.create({
  model: "coding-agent-worker",
  messages: [
    { role: "system", content: "You are a coding agent." },
    { role: "user", content: "Fix the failing test in auth.spec.ts" }
  ]
});

console.log(res.choices[0].message.content);

Then your gateway can decide what coding-agent-worker actually means today.

Maybe it routes to a cheap model for the normal loop and escalates to a stronger advisor only when a policy says it should.

That’s exactly why OpenAI-compatible infrastructure matters.

This is also where Standard Compute fits really well

If you’re experimenting with worker/advisor patterns, per-token billing gets annoying fast.

You start optimizing every call before you’ve even learned what architecture works.

That’s one reason I think flat-rate infrastructure is underrated for agent teams.

With Standard Compute, you can keep the OpenAI-compatible client you already use, but stop treating every extra test loop like a billing event. That makes it much easier to build agents that run continuously in:

n8n
Make
Zapier
OpenClaw
custom internal agent frameworks

The practical benefit is obvious: you can route aggressively, retry when needed, and let agents do real work without someone watching token spend like a hawk.

For worker/advisor setups specifically, that matters a lot.

Because the whole point of this architecture is that the cheap worker should be free to do a lot of low-stakes iteration. If your team is still stressed about every token, they’ll underuse the loop and overthink every call.

Flat monthly pricing is a better fit for agents than human chat ever was.

Example routing config

Here’s the kind of setup I’d want behind the scenes:

models:
  coding-agent-worker:
    provider: standardcompute
    route: auto
    role: worker

  coding-agent-advisor:
    provider: standardcompute
    route: premium
    role: advisor

policies:
  escalate_on:
    - migration_detected
    - repeated_test_failure
    - large_diff
    - final_review

The exact syntax will vary depending on your gateway, but the architecture is the point.

Your application should ask for a role.

Your infrastructure should decide the model.

The catches are real

This pattern is good, but it is not magic.

Routing overhead exists

If the advisor gets called too often, latency stacks up.

A worker-plus-advisor system only works if the advisor is used sparingly.

If you escalate every few minutes, you didn’t build a clean architecture. You built a committee.

Cost dashboards can be misleading

Displayed pricing is often based on list prices, not your actual provider path.

If you route through gateways, aggregators, or cloud providers, the real bill can differ.

So measure actual spend, not just the pretty status output.

Cheap workers can still be wrong

A low-cost model that confidently makes tiny bad edits is not cheap if it wastes hours.

So you still need evals, guardrails, and sane escalation rules.

This is an architecture decision, not a free lunch.

The pattern feels more mature than “pick one best model”

What I like about the worker/advisor split is that it feels honest.

It admits two things:

not every token needs premium reasoning
some decisions absolutely do

That is a better mental model than arguing forever about codex vs claude as if one model should own the entire coding session.

I don’t think the future is one perfect coding model.

I think the future is:

cheap execution
expensive judgment
routing handled by infrastructure
OpenAI-compatible clients on top
predictable cost underneath

That last part matters more as agents move from demos to production.

Because once your coding agent is running all day, the real enemy is not just bad output.

It’s bad output plus unpredictable cost.

That’s why I think the better question now is not:

Codex or Claude?

It’s:

Which model is my worker, which model is my advisor, and is my infrastructure set up so I don’t have to care about token math every time the agent runs?

If you’re building coding agents with OpenClaw, LiteLLM, or your own OpenAI-compatible stack, that’s the question worth answering.

DEV Community