Lars Winstand

Posted on May 19 • Originally published at standardcompute.com

I kept seeing the same OpenClaw mistake: one expensive model for every job

#ai #devops #automation #openai

I kept running into the same OpenClaw setup mistake:

People pick one expensive model, wire it in as the default, and then let it handle everything.

Heartbeat checks.
Cron pings.
Inbox triage.
"Nothing changed" loops.
Low-stakes tagging.

That is not a clever agent architecture. That is just an expensive default.

While researching OpenClaw setups, I found a thread on r/openclaw where someone said the quiet part out loud:

“Stop using opus for everything. seriously. i was running it on heartbeat checks and cron pings which is just lighting money on fire. glm-5.1 handles all that stuff fine. i only use sonnet 4.6 now when the task actually needs reasoning and my token costs are like a third of what they were”

That is the right lesson.

Not just for OpenClaw.
For n8n, Make, Zapier, custom Python workers, and basically any agent setup that runs on a schedule.

If you are still using one premium model for every task, you do not have a model strategy. You have a billing strategy you forgot to review.

The actual takeaway: route by task, not by brand loyalty

A lot of developers still treat model selection like a global app setting.

Pick GPT-5.4.
Or Claude Opus 4.6.
Or Gemini 3.5 Flash.
Done.

That works for a demo.
It falls apart in production.

Real agent systems do different kinds of work:

cheap classification
extraction
tagging
summarization
retries
memory maintenance
occasional hard reasoning
occasional high-risk decisions

Those should not all hit the same model.

The boring jobs should be cheap.
The hard jobs should get the expensive model.
The dangerous jobs should get the model that passes your evals.

That is model routing.
And honestly, it is just basic engineering once your workflows run all day.

OpenClaw already nudges you toward this

One thing I like about OpenClaw is that the config shape already hints at the right mental model.

You can define a primary model and ordered fallbacks.
You can also split out image, PDF, and image generation models.

That is not accidental.
That is the product telling you different tasks deserve different models.

Example:

agents:
  defaults:
    model:
      primary: openai/gpt-5.4-mini
      fallbacks:
        - anthropic/claude-sonnet-4.6
        - google/gemini-3.5-flash
    imageModel: google/gemini-3.5-flash
    pdfModel: openai/gpt-5.4
    imageGenerationModel: openai/gpt-image-1

Once you look at OpenClaw this way, the Reddit advice stops sounding like a hack.
It starts sounding like the intended operating model.

Why people keep wasting frontier models on tiny jobs

Because "agent work" sounds smarter than it usually is.

A heartbeat check feels sophisticated because an agent is doing it.
A cron-triggered inbox review feels important because it uses AI.
But a lot of recurring automation work is just:

classify this
summarize that
compare two notes
tag a ticket
decide whether anything changed
move on

That is exactly where smaller, cheaper models win.

One commenter in the same thread said it perfectly:

“No reason to burn opus tokens on a cron check that runs every 10 minutes.”

Yep.

If a task runs every 10 minutes, you are not choosing a model once.
You are choosing it 144 times per day.
Then you multiply that by every queue, retry loop, mailbox, and background task you forgot was still running.
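To make that multiplication concrete, here is a minimal Python sketch. The function name and the example counts (3 tasks, 2 retries per run) are illustrative assumptions, not figures from the thread.

```python
MINUTES_PER_DAY = 24 * 60


def calls_per_day(interval_minutes: int, tasks: int = 1, retries_per_run: int = 0) -> int:
    """Worst-case model calls per day for jobs on a fixed schedule."""
    runs = MINUTES_PER_DAY // interval_minutes
    # Each run costs one initial call plus, at worst, every allowed retry.
    return runs * tasks * (1 + retries_per_run)


print(calls_per_day(10))                               # 144
print(calls_per_day(10, tasks=3, retries_per_run=2))   # 1296
```

Three background tasks with two retries each turn "one model choice" into almost 1,300 worst-case calls a day.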

The pricing spread is big enough that bad defaults compound fast

This is where the mistake stops being theoretical.

Here is the rough shape of the cost difference across common automation-friendly models:

Model	Input (per 1M tokens)	Output (per 1M tokens)	Best used for
GPT-5.4	$2.50	$15.00	Hard reasoning and high-value steps
GPT-5.4-mini	$0.75	$4.50	Solid default for routine transforms and summaries
GPT-5.4-nano	$0.20	$1.25	Strong candidate for heartbeat checks, classifiers, and cron work
Gemini 3.5 Flash	$1.50	$9.00	Usable for recurring admin tasks and batch workflows

The exact numbers will change over time.
The important part is the spread.

If your default is a frontier model, every low-value task inherits premium pricing.
And retries make it worse.
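Here is a rough way to see the spread in dollars. This is a minimal sketch: the prices are copied from the table above, while the call frequency and token counts (2,000 in, 200 out per call) are assumptions I picked for illustration, so plug in your own.

```python
# USD per 1M tokens as (input, output), copied from the table above.
# These will drift; treat them as placeholders.
PRICES = {
    "gpt-5.4": (2.50, 15.00),
    "gpt-5.4-mini": (0.75, 4.50),
    "gpt-5.4-nano": (0.20, 1.25),
}


def monthly_cost(model: str, calls_per_day: int, input_tokens: int,
                 output_tokens: int, days: int = 30) -> float:
    """Estimated monthly spend for one recurring task on one model."""
    in_price, out_price = PRICES[model]
    per_call = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    return per_call * calls_per_day * days


# Assumed workload: one job every 10 minutes (144 calls/day),
# 2,000 input tokens and 200 output tokens per call.
for model in PRICES:
    print(model, round(monthly_cost(model, 144, 2000, 200), 2))
```

With those assumed sizes, the same every-10-minutes job costs roughly 12 times more per month on GPT-5.4 than on GPT-5.4-nano, before any retries.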

OpenClaw memory makes this even more obvious

OpenClaw’s memory model is one of the more practical parts of the system.

Instead of pretending memory is some magical hidden state, it writes durable state to files like:

MEMORY.md
memory/YYYY-MM-DD.md
optional DREAMS.md

That means a lot of recurring "agentic" work is really just file maintenance plus lightweight judgment.

Things like:

checking whether today’s notes contain anything worth promoting
summarizing a session into MEMORY.md
tagging daily notes
triaging low-priority email
deciding whether something needs escalation
confirming a scheduled task completed normally

That does not automatically require Claude Opus 4.6 or GPT-5.4.

If the model is reading yesterday’s note, reading today’s note, and deciding whether to append one sentence, you probably want cheap and consistent, not frontier-tier reasoning.

Retries and fallbacks are where expensive defaults get really dumb

OpenClaw supports failover and fallback chains.
That is good.
You want that.

But fallback logic changes the economics.

If your default model is expensive, you do not just overpay once.
You overpay on:

the initial call
the retry
the fallback attempt
the loop you forgot to cap

That is why background jobs are dangerous.
They are easy to ignore, and they quietly multiply usage.

A cron task every 30 minutes does not feel expensive.
That same task is 48 runs a day, and over weeks, with retries, it definitely is.
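A quick sketch of how that adds up. It assumes the worst case, where every model in the chain fails and uses its full retry budget before the next fallback is tried. How OpenClaw actually sequences retries and fallbacks may differ, so treat the result as an upper bound under that assumption.

```python
MINUTES_PER_WEEK = 7 * 24 * 60


def worst_case_calls(interval_minutes: int, weeks: int,
                     chain_length: int, retries_per_model: int) -> tuple:
    """Return (scheduled_runs, worst_case_model_calls) over the period.

    chain_length counts the primary model plus its fallbacks.
    """
    runs = (weeks * MINUTES_PER_WEEK) // interval_minutes
    attempts_per_run = chain_length * (1 + retries_per_model)
    return runs, runs * attempts_per_run


# Every 30 minutes for 4 weeks, primary + 2 fallbacks, 2 retries per model.
print(worst_case_calls(30, 4, 3, 2))  # (1344, 12096)
```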

A sane routing policy for OpenClaw

This is the practical version.

Use smaller models for:

heartbeat checks
cron pings
simple classification
tagging
deduping
queue cleanup
low-risk summarization

Use mid-tier models for:

routine transforms
memory promotion drafts
support triage with some ambiguity
structured extraction with moderate complexity

Use premium models for:

hard reasoning
ambiguous multi-step tool use
sensitive customer-facing responses
compliance-sensitive decisions
destructive actions

Here is a simple default stack:

agents:
  defaults:
    model:
      primary: openai/gpt-5.4-nano
      fallbacks:
        - openai/gpt-5.4-mini
        - anthropic/claude-sonnet-4.6
        - openai/gpt-5.4

Then override specific tasks that actually need the heavier model.
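In a custom Python worker, that override idea can be as small as a lookup table. This is a hedged sketch, not OpenClaw's API: the task names, the `Tier` enum, and `pick_model` are hypothetical, and the model IDs are simply reused from the config above.

```python
from enum import Enum


class Tier(Enum):
    SMALL = "openai/gpt-5.4-nano"
    MID = "openai/gpt-5.4-mini"
    PREMIUM = "openai/gpt-5.4"


# Hypothetical task names, grouped the same way as the lists above.
TASK_TIERS = {
    "heartbeat_check": Tier.SMALL,
    "cron_ping": Tier.SMALL,
    "tagging": Tier.SMALL,
    "memory_promotion_draft": Tier.MID,
    "support_triage": Tier.MID,
    "compliance_decision": Tier.PREMIUM,
    "destructive_action": Tier.PREMIUM,
}


def pick_model(task_type: str) -> str:
    """Return the model ID for a task type; unknown tasks start cheap."""
    return TASK_TIERS.get(task_type, Tier.SMALL).value


print(pick_model("cron_ping"))           # openai/gpt-5.4-nano
print(pick_model("destructive_action"))  # openai/gpt-5.4
```

Defaulting unknown tasks to the small tier mirrors the cheap-first config. If unclassified tasks in your system can touch money or delete data, defaulting them to the premium tier is the safer choice.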

That setup is less pretty than "we use Claude for everything" or "we standardized on GPT-5.4."

It is also much more competent.

Don’t route by price alone

Cheap routing can absolutely backfire.

A task can look simple but still be failure-sensitive.

Examples:

approving refunds
sending customer-facing messages
deciding whether to escalate a compliance issue
triggering a destructive action in a tool chain

Those should not be assigned by vibes.

Two rules help a lot:

1. Route by consequence, not just complexity

A simple classifier can still be dangerous.
If the output controls money, customer trust, or irreversible actions, treat it as high-risk.

2. Route by evals, not marketing

A cheaper model that fails your real prompts is not cheaper.
It is just a slower way to ship bugs.

If Gemini 3.5 Flash, GLM-5.1, GPT-5.4-nano, or Claude Sonnet 4.6 passes your actual eval set for a task, great.
Use it.
If it fails, move up.

That is routing.
Not ideology.
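"If it fails, move up" can be written down directly. A minimal sketch, assuming you already have some way to score a model against your own eval set: `pass_rate` is a hypothetical callable you would supply, the 0.95 threshold is arbitrary, and the model names and scores in the example are placeholders, not measurements.

```python
from typing import Callable, Optional, Sequence


def cheapest_passing_model(
    models_cheapest_first: Sequence[str],
    pass_rate: Callable[[str], float],
    threshold: float = 0.95,
) -> Optional[str]:
    """Walk the ladder from cheapest to priciest; return the first model that passes."""
    for model in models_cheapest_first:
        if pass_rate(model) >= threshold:
            return model
    return None  # Nothing passed: fix the prompt or the evals before shipping.


# Placeholder scores for illustration only.
fake_scores = {"small-model": 0.91, "mid-model": 0.97, "big-model": 0.99}
ladder = ["small-model", "mid-model", "big-model"]
print(cheapest_passing_model(ladder, fake_scores.get))  # mid-model
```

Run this per task, not once globally. The heartbeat check and the refund approver will usually land on different rungs.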

Quick way to audit your current setup

If you already have OpenClaw running, here is a dead simple audit process.

1. List every recurring task

Make a table for:

task name
trigger frequency
current model
failure impact
average prompt size
average output size

2. Find the obviously overpriced jobs

Look for tasks that are:

frequent
repetitive
low-risk
easy to evaluate

Those are your first routing wins.
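Steps 1 and 2 fit in a few lines of Python if you would rather script the audit than keep a spreadsheet. This is a sketch under assumptions: the `RecurringTask` fields follow the table columns from step 1, the "low"/"medium"/"high" impact scale and the 24-runs-per-day cutoff are my own placeholders, and the sample tasks are made up. It only filters on frequency and risk; "repetitive" and "easy to evaluate" still need a human call.

```python
from dataclasses import dataclass


@dataclass
class RecurringTask:
    name: str
    runs_per_day: int
    current_model: str
    failure_impact: str  # "low", "medium", or "high" (assumed scale)
    avg_prompt_tokens: int
    avg_output_tokens: int


def routing_candidates(tasks, premium_models, min_runs_per_day=24):
    """Frequent, low-risk tasks still on a premium model, biggest token volume first."""
    hits = [
        t for t in tasks
        if t.current_model in premium_models
        and t.failure_impact == "low"
        and t.runs_per_day >= min_runs_per_day
    ]
    return sorted(
        hits,
        key=lambda t: t.runs_per_day * (t.avg_prompt_tokens + t.avg_output_tokens),
        reverse=True,
    )


tasks = [
    RecurringTask("heartbeat", 144, "premium-model", "low", 500, 50),
    RecurringTask("inbox_triage", 48, "premium-model", "low", 3000, 300),
    RecurringTask("refund_approval", 20, "premium-model", "high", 1500, 200),
    RecurringTask("tagging", 96, "small-model", "low", 400, 20),
]
print([t.name for t in routing_candidates(tasks, {"premium-model"})])
# ['inbox_triage', 'heartbeat']
```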

3. Create a cheap-first policy

Start with a small model for low-risk jobs.
Escalate only if evals say you need to.

4. Cap loops and retries

If a job can retry forever, your cost model is already broken.
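A capped retry wrapper is a few lines in a custom worker. This is a generic sketch, not an OpenClaw feature: `call_with_cap` and its defaults (3 attempts, exponential backoff starting at 1 second) are assumptions, and `call` stands in for whatever function makes your model request.

```python
import time


def call_with_cap(call, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run call() at most max_attempts times, backing off between failures."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            last_error = exc
            # Back off before the next try, but not after the final one.
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

The point is the hard ceiling: a failing job costs at most `max_attempts` calls per run, so the worst case is something you can actually calculate.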

5. Measure before and after

Even a rough comparison is enough:

# pseudo-checklist
# before
# - model: claude-opus-4.6
# - task frequency: every 10 minutes
# - retries: 2
# - monthly usage: painful

# after
# - default: gpt-5.4-nano
# - escalate only on low confidence / failed eval cases
# - monthly usage: much less painful

If you’re building agents at scale, per-token pricing becomes a workflow problem

This is the part people eventually learn the hard way.

Per-token billing is annoying enough in interactive chat apps.
In automations, it is worse.

Because the expensive calls are often not the flashy ones.
They are the boring background ones:

scheduled checks
agent loops
retries
nightly summaries
queue maintenance
memory updates

That is exactly why routing matters.
And it is also why a flat-cost API setup is appealing for teams running lots of recurring agent work.

If your agents are constantly working through n8n, Make, Zapier, OpenClaw, or custom queues, the real pain is not just token price.
It is the constant need to babysit usage.

That is the problem Standard Compute is aimed at.

It gives you an OpenAI-compatible API with flat monthly pricing, so you can keep the routing mindset without getting punished every time your automations actually run.

The useful combo is:

route small jobs to cheaper models
reserve bigger models for hard steps
stop treating every cron task like it deserves frontier pricing
stop watching token spend like a stress dashboard

More here if that sounds familiar: https://standardcompute.com

The real tell that someone is new to agents

They brag about which model they use.

People who have actually operated automations for weeks brag about which tasks they stopped wasting expensive models on.

That is why that OpenClaw thread stuck with me.
The useful lesson was not "GLM-5.1 is secretly amazing" or "Claude Sonnet 4.6 is enough."

It was the shift underneath:

Your agent is a workflow, not a shrine to your favorite model.

Once you see that, model routing stops looking like an optimization trick.
It starts looking like basic competence.

If a heartbeat check is hitting Claude Opus 4.6 every 10 minutes, that is not sophistication.
It is a leak.

And if your setup still uses one expensive model for everything, you probably do not need a better prompt first.

You need a routing policy.

Top comments (2)

Harpinder • May 23

yeah. i think the mistake starts one step earlier: waking the model for checks that should be deterministic. if the job is just "did anything change?", keep that cheap and only hand the real delta to the bigger model.

Harjot Singh • May 31

You've named the single most common and most expensive mistake in agent design: one big model for every job. It feels safe ("just use the best model and it'll always work") but it means you're paying frontier prices to do things like format a string, classify an intent, or decide which tool to call next - tasks a tiny model nails for a fraction of a cent. The bill is mostly made of cheap work being charged at premium rates.

The fix isn't complicated, it's just disciplined: a cheap default model handles the bulk, and you explicitly escalate to the expensive one only for the genuinely hard reasoning steps. The hard part is deciding the escalation boundary, not the routing mechanics. This is the core design choice in Moonshift (a multi-agent pipeline that ships a prompt to a deployed SaaS) - different agents/steps run on different-tier models, which is what holds a full build at ~$3 flat instead of the one-expensive-model bill you're describing. Spot-on post. How do you decide the escalation threshold - static rules per task type, or a cheap classifier that scores difficulty first? That boundary decision is where most of the real engineering lives.