Lars Winstand

Posted on Jun 5 • Originally published at standardcompute.com

I thought the cheap model would save my OpenClaw bill, then I watched $100 disappear in 2 days


I keep seeing the same failure mode in agent stacks:

Someone builds something cool with OpenClaw
They use Claude Opus, Claude Sonnet, or GPT-class models
The first bill lands
They panic and switch everything to the cheapest model they can find

That feels like cost optimization.

Most of the time, it is not.

It is just moving the cost from token price to retries, bad tool calls, recovery loops, and supervisor escalations.

While looking into budget OpenClaw setups, I found a thread on r/openclaw where one user said:

"I blew 100 usd in two days in openclaw using opus, sonnet, haiku. Moved to deepseek and its consuming pennies."

That sounds like a clean lesson: stop using expensive models.

I think the real lesson is different.

The biggest cost in an agent system is often not the posted price of the model. It is what happens after the model gets something slightly wrong.

Cheap per token can be expensive per successful task

A chatbot can survive a mediocre answer.

An agent usually cannot.

When OpenClaw is driving tools, a weak model does not just produce a bad sentence. It can:

trigger extra tool calls
retry the same step three times
lose track of state
ask for clarification when it should act
take the wrong action and force a cleanup pass
escalate to a stronger model after already wasting time and tokens

That is where "just use the cheapest model" falls apart.

If a low-cost model needs 3 attempts and a stronger model then has to rescue the run anyway, you did not save money. You added failure overhead.

For agent workloads, the metric that matters is not cost per call.

It is cost per completed task.
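Here is a minimal sketch of that metric. Every number in it is made up for illustration, and the names (RunProfile, expectedCostPerCompletedTask) are mine, not anything OpenClaw ships. It also assumes the rescue model always finishes the task, which is optimistic.

```typescript
// All figures below are hypothetical; plug in numbers from your own logs.
type RunProfile = {
  costPerAttempt: number; // USD for one attempt (tokens plus tool calls)
  successRate: number;    // fraction of attempts that complete the task, 0..1
  rescueCost: number;     // USD when a stronger model has to finish the job
  maxAttempts: number;    // attempts allowed before escalating
};

// Expected spend to get one task done, counting failed attempts and rescue.
function expectedCostPerCompletedTask(p: RunProfile): number {
  let expected = 0;
  let stillFailing = 1; // probability that every attempt so far has failed
  for (let i = 0; i < p.maxAttempts; i++) {
    expected += stillFailing * p.costPerAttempt; // attempt i is only paid for if reached
    stillFailing *= 1 - p.successRate;
  }
  // If all attempts failed, the rescue model finishes the task (assumed to succeed).
  return expected + stillFailing * p.rescueCost;
}

const cheap: RunProfile = { costPerAttempt: 0.01, successRate: 0.3, rescueCost: 0.25, maxAttempts: 3 };
const strong: RunProfile = { costPerAttempt: 0.08, successRate: 0.95, rescueCost: 0.25, maxAttempts: 1 };

const cheapCost = expectedCostPerCompletedTask(cheap);
const strongCost = expectedCostPerCompletedTask(strong);
console.log({ cheapCost, strongCost });
```

With these invented numbers the model that is 8x cheaper per attempt comes out more expensive per completed task. Raise the cheap model's success rate and the result flips, which is the point: the answer depends on failure rates, not on the price sheet.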

OpenClaw already gives you the right abstraction: routing

This is why I think single-model OpenClaw setups are usually a bad default.

Not because cheap models are useless.

Because OpenClaw is built around sessions, routing, failover, and multi-agent patterns. It is infrastructure for agent execution, not just a chat wrapper.

So the right question is not:

Which model is cheapest?

It is:

Which steps are safe enough to be cheap?

That is a much better optimization target.

What Reddit got right about small models

Another r/openclaw thread had a comment that was more useful than most benchmark charts:

"I use Gemma 4 E4B for simple tool tasks, but I would have serious doubts about trying to use any of the Gemma 4 models for the main agent. It will almost certainly fail in horrible and unpredictable ways."

That sounds harsh.

It is also exactly how weak agent control feels in production.

Not "slightly less reasoning quality."

More like:

weird tool sequencing
forgotten constraints
brittle recovery
random collapse after a long session

Another user described Gemma fallback behavior as:

"barely keep the lights on basic"

That is actually a useful mental model.

A lot of cheap or local models are fine as workers, fallbacks, parsers, or bounded tool executors.

They are often a bad choice for the main controller.

Small models are useful. They are just easy to miscast.

This is where people get fooled by capability checklists.

Gemma 3 and similar models support things like:

function calling
structured output
long context windows
single-GPU deployment

All of that matters.

But a model that supports function calling is not necessarily reliable as the main autonomous planner in OpenClaw.

There is a big difference between:

extracting fields from an email
classifying whether a message is urgent
formatting JSON for a tool call

and:

planning a 6-step tool sequence
recovering from an API timeout
deciding whether to retry, ask a question, or escalate
handling side effects safely

That gap is where a lot of "cheap model savings" go to die.

My opinionated take: the winner is routing, not DeepSeek or Gemma

If I had to compress this into one sentence:

Single-model OpenClaw setups are lazy architecture wearing a budget hat.

The answer is not:

always use Claude Opus
always use Claude Sonnet
always use GPT-5
always use DeepSeek Flash
always run Gemma locally

The answer is routing.

Use cheap models where mistakes are cheap.

Use strong models where mistakes cascade.

That is how you actually lower cost.

A practical role map for OpenClaw

Model option	Best role in OpenClaw
DeepSeek Flash	Cheap worker for classification, extraction, formatting, and bounded subtasks where retries are acceptable
Gemma 3 / Gemma 4 12B-class models	Local helper, fallback, simple tool work, low-risk subtasks
Claude Sonnet / Claude Opus / GPT-5-class models	Planner, supervisor, recovery model, and decision-maker for ambiguous or high-consequence turns

That table is the real optimization strategy.

Not model tribalism.

Role design.

What should go to a cheap model?

These are usually good candidates:

intent classification
entity extraction
schema-constrained JSON formatting
spam filtering
low-risk summarization
simple routing decisions
low-consequence tool calls with tight validation

Example: classify an inbound webhook before handing it to the main agent.

{
  "task": "classify_support_ticket",
  "input": {
    "subject": "billing issue",
    "body": "customer says invoice failed and wants retry"
  },
  "expected_output": {
    "category": "billing",
    "priority": "high",
    "requires_human": false
  }
}

That is exactly the kind of job where a cheap model can save money without creating chaos.
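The "tight validation" part is what keeps this safe. A rough sketch of a validator for the output shape above, assuming the model returns a raw JSON string. The category and priority lists beyond "billing" and "high" are my assumptions, and parseClassification is a hypothetical helper, not an OpenClaw API.

```typescript
// Allowed values are assumptions for illustration; use your own taxonomy.
const CATEGORIES = ["billing", "technical", "account", "other"] as const;
const PRIORITIES = ["low", "medium", "high"] as const;

type TicketClassification = {
  category: (typeof CATEGORIES)[number];
  priority: (typeof PRIORITIES)[number];
  requires_human: boolean;
};

// Type guard: true only if value is one of the allowed string literals.
function isOneOf<T extends string>(allowed: readonly T[], value: unknown): value is T {
  return typeof value === "string" && (allowed as readonly string[]).includes(value);
}

// Returns a typed result, or null if the model output is not usable as-is.
function parseClassification(raw: string): TicketClassification | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null; // not valid JSON
  }
  if (typeof value !== "object" || value === null) return null;
  const { category, priority, requires_human } = value as Record<string, unknown>;
  if (!isOneOf(CATEGORIES, category)) return null;
  if (!isOneOf(PRIORITIES, priority)) return null;
  if (typeof requires_human !== "boolean") return null;
  return { category, priority, requires_human };
}

const ok = parseClassification('{"category":"billing","priority":"high","requires_human":false}');
const bad = parseClassification('{"category":"billing","priority":"urgent!!"}');
console.log(ok, bad);
```

A null result is the cheap failure case: rerun the classifier or hand the ticket to a stronger model. Nothing downstream has acted on bad data yet.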

What should not go to a cheap model?

These are the places where weak reasoning gets expensive fast:

main agent planning across multiple tools
recovery after failed tool calls
long-horizon tasks with lots of state
anything that sends emails, updates records, or triggers transactions
ambiguous decisions with messy context
supervisor logic
retry policy decisions

If a mistake means "rerun the parser," cheap is fine.

If a mistake means "the agent spirals for 10 minutes and then Sonnet has to rescue it," the cheap model is not cheap anymore.

A concrete routing pattern

Here is a simple architecture I would actually use.

Step 1: cheap intake model

Use a cheap worker for:

classification
extraction
normalization
low-risk transforms

Step 2: strong planner for important turns

Escalate to Claude Sonnet, Claude Opus, or GPT-5-class models when:

the task touches multiple tools
the context is ambiguous
side effects are involved
retries have already started

Step 3: local fallback for continuity

Use a local model only to keep the system alive, not to maintain quality.

That means fallback should preserve uptime, not pretend to preserve capability.
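One way to encode that rule, as a sketch. The Provider shape, the tier names, and the choice to defer side-effecting work while degraded are all my assumptions. This is not how OpenClaw's own failover is configured.

```typescript
type Provider = {
  name: string;                      // illustrative label, not a real model ID
  tier: "strong" | "cheap" | "local";
  healthy: boolean;                  // result of whatever health check you run
};

type Decision =
  | { kind: "run"; provider: string; degraded: boolean }
  | { kind: "defer"; reason: string };

// Walk the chain in priority order and take the first healthy provider.
// If only the local fallback is left, keep the system alive for safe work
// but refuse to perform side effects with it.
function chooseProvider(chain: Provider[], hasSideEffects: boolean): Decision {
  const next = chain.find((p) => p.healthy);
  if (!next) return { kind: "defer", reason: "no healthy provider" };
  const degraded = next.tier === "local";
  if (degraded && hasSideEffects) {
    return { kind: "defer", reason: "local fallback is not trusted with side effects" };
  }
  return { kind: "run", provider: next.name, degraded };
}

const chain: Provider[] = [
  { name: "strong-planner", tier: "strong", healthy: false },
  { name: "local-fallback", tier: "local", healthy: true },
];

const readOnly = chooseProvider(chain, false);
const sendsEmail = chooseProvider(chain, true);
console.log(readOnly, sendsEmail);
```

The degraded flag is there so logs and dashboards can show when the system was merely up rather than fully capable.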

Step 4: log by failure type, not average token price

This part matters a lot.

If you only track token spend, you miss the real problem.

Track things like:

retries per task
tool-call failure rate
escalation rate
average steps per successful task
recovery rate after timeout or invalid tool output

A weak model often looks cheap in isolation and expensive in workflow metrics.
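A sketch of what that logging could roll up into, assuming you record one row per task run. The TaskRun fields and the summarize helper are hypothetical. Recovery rate after timeouts is left out because it needs per-step events rather than per-task rows.

```typescript
// One row per task run; field names are illustrative.
type TaskRun = {
  taskType: string;
  model: string;
  succeeded: boolean;
  steps: number;
  retries: number;
  toolCalls: number;
  failedToolCalls: number;
  escalated: boolean;
  costUsd: number;
};

type WorkflowMetrics = {
  retriesPerTask: number;
  toolCallFailureRate: number;
  escalationRate: number;
  avgStepsPerSuccess: number;
  costPerSuccessfulTask: number; // total spend, failures included, per success
};

function summarize(runs: TaskRun[]): WorkflowMetrics {
  const sum = (f: (r: TaskRun) => number) => runs.reduce((acc, r) => acc + f(r), 0);
  const ratio = (n: number, d: number) => (d === 0 ? 0 : n / d);
  const successes = runs.filter((r) => r.succeeded);
  const totalCost = sum((r) => r.costUsd);
  return {
    retriesPerTask: ratio(sum((r) => r.retries), runs.length),
    toolCallFailureRate: ratio(sum((r) => r.failedToolCalls), sum((r) => r.toolCalls)),
    escalationRate: ratio(runs.filter((r) => r.escalated).length, runs.length),
    avgStepsPerSuccess: ratio(successes.reduce((acc, r) => acc + r.steps, 0), successes.length),
    // Spend with zero successes is reported as Infinity, not as "free".
    costPerSuccessfulTask:
      successes.length === 0 ? (totalCost > 0 ? Infinity : 0) : totalCost / successes.length,
  };
}

const metrics = summarize([
  { taskType: "triage", model: "cheap-worker", succeeded: true, steps: 4, retries: 0,
    toolCalls: 3, failedToolCalls: 0, escalated: false, costUsd: 0.05 },
  { taskType: "triage", model: "cheap-worker", succeeded: false, steps: 9, retries: 3,
    toolCalls: 6, failedToolCalls: 3, escalated: true, costUsd: 0.02 },
]);
console.log(metrics);
```

Group the rows by model and task type before summarizing and you get the comparison that average token price hides.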

Example pseudo-routing logic

This is the kind of logic more teams should implement.

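// Coarse risk label assigned upstream (for example, by the intake step).
type TaskRisk = "low" | "medium" | "high";

type Task = {
  name: string;
  risk: TaskRisk;
  hasSideEffects: boolean;  // sends email, updates records, triggers transactions
  toolCount: number;        // number of tools the task is expected to touch
  previousFailures: number; // failed attempts so far in this run
};

// Model names are illustrative labels, not exact provider model IDs.
// Rules are checked top to bottom, so the first match wins.
function selectModel(task: Task): string {
  if (task.hasSideEffects) return "claude-sonnet";       // mistakes here cascade
  if (task.previousFailures > 0) return "claude-sonnet"; // retries have started: escalate
  if (task.toolCount > 2) return "gpt-5";                // multi-tool planning
  if (task.risk === "high") return "claude-opus";        // high risk, no other trigger
  return "deepseek-flash";                               // low or medium risk, bounded work
}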

Is this perfect? No.

Is it better than "everything goes to the cheapest model"? Absolutely.

OpenClaw setup is infrastructure, so treat it like infrastructure

Even the setup flow makes this obvious:

npm install -g openclaw@latest
openclaw onboard --install-daemon
openclaw status --deep

OpenClaw recommends modern Node versions and exposes operational concepts like routing and failover.

That should push you toward architecture decisions, not one-model-for-everything shortcuts.

When the system is agent infrastructure, your cost strategy also needs to be infrastructure-level.

If your workload is simple, yes, a cheap model may be enough

To be fair: sometimes the cheap model really is the right answer.

If your workload is tightly scoped and low-risk, then DeepSeek Flash, Qwen, Gemma, GLM, MiniMax, or a local Ollama model may be the most economical option.

Especially if you are doing things like:

webhook classification
document parsing
simple support triage
structured extraction
low-risk internal automations

For local stacks, API cost can drop to near zero.

Then the tradeoff becomes:

hardware cost
latency
setup complexity
reliability under longer agent runs

That is a real tradeoff.

But it is still a routing question.

Not proof that the cheapest model should run your whole agent harness.

The weird truth about agent costs

Stronger models are often cheaper per successful task, even when they are more expensive per call.

That sounds backwards until you watch a weak model wander around a tool graph for 8 minutes.

The user who spent $100 in two days with Opus, Sonnet, and Haiku found one kind of pain.

The users describing Gemma as fallback-grade found the other kind.

Put those together and the pattern is obvious:

The cheapest model becomes the most expensive part of your OpenClaw stack when you give it the wrong job.

What I would actually do

If I were optimizing an OpenClaw stack this week, I would do this:

Put a cheap model on intake, extraction, and simple schema-bound tasks
Route planning, recovery, and side-effecting actions to a stronger model
Keep a local model as continuity fallback only
Measure cost per successful task, not cost per call
Review failures by task type

That is the part most teams skip.

They optimize the sticker price and ignore failure cost.

That works for chatbots.

It usually fails for agents.

One more thing: pricing model matters too

Even if you route well, per-token billing still creates a weird incentive structure for agent systems.

You start watching every long run like a taxi meter.

That gets old fast when you are running automations all day in OpenClaw, n8n, Make, Zapier, or custom agent workflows.

This is why I think flat-rate API access is underrated for agents.

If your stack is constantly doing:

retries
structured extraction
tool orchestration
background automations
long-running workflows

then predictable monthly cost is often more useful than squeezing pennies out of each individual call.
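A back-of-the-envelope way to check that for your own workload. The prices and token counts below are placeholders I picked for the arithmetic, not quotes from any provider, and both function names are hypothetical.

```typescript
// Per-token spend for a month of agent tasks.
function monthlyPerTokenCost(
  tasksPerMonth: number,
  tokensPerTask: number, // average, retries and tool chatter included
  usdPerMillionTokens: number
): number {
  return (tasksPerMonth * tokensPerTask * usdPerMillionTokens) / 1_000_000;
}

// Smallest whole number of tasks per month at which a flat plan costs
// no more than per-token billing.
function breakEvenTasks(
  flatMonthlyUsd: number,
  tokensPerTask: number,
  usdPerMillionTokens: number
): number {
  const usdPerTask = (tokensPerTask * usdPerMillionTokens) / 1_000_000;
  return Math.ceil(flatMonthlyUsd / usdPerTask);
}

// Placeholder inputs: a $50 flat plan, 40k tokens per task, $3 per million tokens.
const threshold = breakEvenTasks(50, 40_000, 3);
const atThousandTasks = monthlyPerTokenCost(1_000, 40_000, 3);
console.log({ threshold, atThousandTasks });
```

The input that moves the result most is tokensPerTask, and retries inflate it. That is why a retry-heavy stack crosses the break-even point sooner than the per-call price suggests.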

That is also why Standard Compute is interesting here. It gives you OpenAI-compatible API access with flat monthly pricing instead of per-token billing, so you can run agent workloads without constantly babysitting usage.

For teams building automations, that pricing model can be just as important as model routing.

Final takeaway

Do not optimize your OpenClaw stack for the lowest model sticker price.

Optimize for the lowest cost of getting the task done once, correctly, without cleanup.

Most of the time, that means:

cheap models for low-risk bounded work
strong models for planning and recovery
routing based on failure cost
pricing that does not punish every long-running agent loop

That is the difference between looking efficient and actually being efficient.
