A 22-upvote r/openclaw thread about quitting OpenClaw after 3.5 months, 1,300 hours, nearly 5 billion tokens, and $700 is not just one person rage-posting.
It exposed two separate problems that developers keep mashing together:
- OpenClaw gets fragile as workflows become longer, more stateful, and more tool-heavy.
- Per-token pricing gets ugly fast when agent runtimes burn 8k-18k tokens before doing much useful work.
That distinction matters.
If you’re building agents with OpenClaw, n8n, Make, Zapier, MCP servers, Ollama, Claude Opus 4.6, GPT-5.4, or mixed-provider setups, you’ve probably felt this already.
The original Reddit post was blunt:
“I have spent 3.5 month, 1300 hours, almost 5 billion tokens and 700 usd on it... it works okay for light and shorter tasks, but one will eventually be running in circles repairing same thing over and over and over again as the tasks grow.”
That does not sound like a one-off bug.
It sounds like an agent system hitting both reliability limits and economic limits at the same time.
The thread is really about two different failure modes
When people say OpenClaw is “fragile,” they’re often describing two very different things.
1. Operational fragility
This is the classic long-running agent problem.
Short tasks work.
Long sessions start wobbling.
Once you add:
- long context
- MCP tools
- memory files
- AGENTS.md
- project notes
- retries
- repair loops
...the stories start sounding the same.
The agent gets lost.
It repeats itself.
It edits the wrong file.
It starts fixing the same thing over and over.
That’s a framework/runtime problem.
And it’s not just one thread. In nearby discussions, users described:
- keeping a second cloud instance around so they don’t break the main one
- paying for extra Hetzner backups because a working setup feels fragile
- avoiding config changes because recovering a known-good state is painful
That is not normal confidence in software.
That is “don’t touch it, it might collapse” energy.
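You can't fix that from inside a prompt, but you can at least detect it cheaply. Here's a minimal guardrail sketch, assuming you can observe the agent's tool calls as (tool, arguments) pairs; the class name, threshold, and wiring are illustrative, not part of any OpenClaw API:

```python
import hashlib
from collections import Counter

class RepairLoopGuard:
    """Flags when an agent keeps issuing the same tool call: a common symptom
    of the 'fixing the same thing over and over' failure mode."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self.seen = Counter()

    def record(self, tool_name: str, arguments: str) -> bool:
        """Record one tool call; return True once it has repeated too often."""
        key = hashlib.sha256(f"{tool_name}|{arguments}".encode()).hexdigest()
        self.seen[key] += 1
        return self.seen[key] >= self.max_repeats

# Wiring is up to your runtime; conceptually:
guard = RepairLoopGuard(max_repeats=3)
for tool_name, args in [("edit_file", "app.py"), ("edit_file", "app.py"),
                        ("edit_file", "app.py")]:
    if guard.record(tool_name, args):
        print("Same call repeated 3 times: stop the run and inspect the transcript.")
```

Stopping a run early is cheaper than letting it grind through ten more identical repairs.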
2. Economic fragility
This one is easier to miss, and honestly more important.
A bunch of users were not just complaining that OpenClaw fails.
They were complaining that it fails expensively.
In related discussions, users reported that even small tasks could start with:
- ~8k tokens for “light context”
- ~12k tokens for “normal context”
- nearly 18k tokens per input in some cases
All of that before the actual task even begins.
That means the real tax is often not the model itself.
It’s the orchestration overhead.
The hidden bill: agent wrappers can be the expensive part
A lot of developers still think pricing is mostly about model selection.
Should you use Claude Opus 4.6?
GPT-5.4?
Gemini 3 Flash Preview?
A local model through Ollama?
That matters, but for agents, the wrapper can dominate the bill.
A typical request can include:
- system instructions
- AGENTS.md
- workspace files
- memory files
- tool instructions
- project notes
- previous turns
- tool outputs
- retry context
So your “small task” is not actually small.
It might already be carrying a 12k-token backpack before the model writes one useful line.
That changes the economics completely.
A cheap model stops being cheap if you keep resending a giant prompt on every loop.
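To make that concrete, here's roughly what one "small" turn can look like once the wrapper has packed its bags. Everything below is illustrative scaffolding plus a crude ~4-characters-per-token heuristic, not OpenClaw's actual request format:

```python
def approx_tokens(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token for English-ish text.
    return max(1, len(text) // 4)

# The user asked for one small thing...
user_message = "Bump the dependency and rerun the failing test."

# ...but the wrapper ships all of this with it, on every single turn.
payload = [
    {"role": "system", "content": "<system instructions>" * 50},
    {"role": "system", "content": "<AGENTS.md contents>" * 400},
    {"role": "system", "content": "<memory + project notes>" * 300},
    {"role": "system", "content": "<tool schemas + prior tool outputs>" * 500},
    {"role": "user", "content": user_message},
]

overhead = sum(approx_tokens(m["content"]) for m in payload[:-1])
print(f"task itself: ~{approx_tokens(user_message)} tokens")
print(f"backpack:    ~{overhead} tokens sent before any useful output")
```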
Quick way to think about token burn
If your agent loop looks like this:
base context + memory + tool schema + previous outputs + retry instructions
Then your total input cost over the task is closer to:
effective_cost = full_prompt_tokens * number_of_turns * (1 + retries_per_turn)
Not:
effective_cost = user_message_tokens
That’s why these threads get heated so fast.
People are not arguing about token pricing in the abstract.
They’re discovering that long-running agents amplify prompt overhead into a real bill.
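Put rough numbers on it and the amplification is hard to ignore. A minimal sketch with an illustrative per-token rate, not any specific provider's pricing:

```python
def effective_cost(full_prompt_tokens: int, turns: int,
                   retries_per_turn: int, usd_per_million: float) -> float:
    """Total input cost when the full prompt is resent on every turn and retry."""
    total_tokens = full_prompt_tokens * turns * (1 + retries_per_turn)
    return total_tokens / 1_000_000 * usd_per_million

# Illustrative rate of $3 per million input tokens; swap in your provider's.
naive = effective_cost(full_prompt_tokens=200, turns=1,
                       retries_per_turn=0, usd_per_million=3.0)
real = effective_cost(full_prompt_tokens=12_000, turns=8,
                      retries_per_turn=1, usd_per_million=3.0)
print(f"what the user message suggests: ${naive:.4f}")
print(f"what the loop actually costs:   ${real:.2f}")
```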
“Just use a better model” is true and also incomplete
Some commenters pushed back on the original complaint with a simple answer:
Use a stronger model.
And yes, there’s truth there.
Claude Opus 4.6 is usually more reliable than weaker models on long coding/tool-use sessions.
GPT-5.4-class models generally hold the thread better than bargain routing on complex tasks.
If you run hard tasks on weak models, you will absolutely blame the framework for failures that are partly model failures.
But “just use a stronger model” does not solve the whole problem.
Because the Reddit comments revealed something more interesting:
Users are manually acting as the routing layer.
They are mixing providers, splitting tasks, assigning specialized agents, and constantly balancing cost vs reliability.
That means the system is not really solving orchestration for them.
It’s asking them to solve orchestration by hand.
What developers are actually doing in the wild
From the surrounding threads, the real-world patterns looked like this:
- use Claude Opus 4.6 or GPT-5.4 for hard tasks
- use Ollama or cheaper APIs for lighter steps
- split work into specialized agents
- trim context aggressively
- keep backups of working setups
- avoid changing config unless necessary
That is a valid survival strategy.
But it’s also a signal.
When users need to become part-time runtime engineers just to keep costs sane, the framework is not “simple.”
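In practice, "manual routing" tends to look like this kind of if/else, except it lives in someone's head and gets re-decided every day. A sketch using the model names from these threads as placeholders; the thresholds and identifier strings are assumptions, not anything OpenClaw ships:

```python
def pick_model(task: str, estimated_prompt_tokens: int) -> str:
    """Crude manual routing: hard or context-heavy steps go to a frontier
    model, lighter steps go to a cheaper or local one."""
    hard_markers = ("refactor", "debug", "migration", "multi-file")
    looks_hard = any(marker in task.lower() for marker in hard_markers)

    if looks_hard or estimated_prompt_tokens > 20_000:
        return "claude-opus-4.6"          # expensive, holds long sessions better
    if estimated_prompt_tokens > 6_000:
        return "gemini-3-flash-preview"   # mid-tier step
    return "ollama/local-coder"           # placeholder for a local Ollama model

print(pick_model("Summarize yesterday's build log", 1_800))
print(pick_model("Refactor the auth module across services", 15_000))
```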
The practical problem: people are paying for safety, not speed
This was the part that stuck with me.
The weirdest behavior in these threads was not the cost complaints.
It was the defensive infrastructure.
People were paying for:
- second instances
- backup snapshots
- safer rollback paths
- config isolation
Why?
Because when an agent setup fails, it can fail in ways that are:
- expensive
- hard to debug
- hard to reproduce
- hard to unwind
That’s what makes fragility feel worse in agent systems than in normal software.
The failure is not just annoying.
It consumes tokens, time, and trust at the same time.
What OpenClaw users seem to be choosing between
If you strip out the drama, the tradeoffs are pretty clear.
| Option | What developers seem to get |
|---|---|
| OpenClaw + frontier APIs | Strong capability with Claude Opus 4.6 or GPT-5.4-class models, but context and retries can make costs climb fast |
| OpenClaw + local/Ollama models | Lower marginal cost and more freedom to experiment, but weaker performance and more failures on harder tasks |
| Subscription-style compute plans | Predictable spend is much easier to manage for agents, but some plans still hide quotas, caps, or throttles |
That last row matters more than it sounds.
A lot of this debate is really about pricing model fit for agent workloads.
Per-token billing makes sense for occasional prompts.
It gets painful when agents run in loops, retry, call tools, and carry huge context windows all day.
If you’re running automations 24/7, the billing model starts shaping architecture decisions.
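A back-of-the-envelope comparison shows why. Every number below is an assumption for illustration, not a quote from any provider:

```python
# Assumed, illustrative numbers: adjust to your own workload and rates.
prompt_tokens_per_step = 12_000       # the "backpack" resent on each loop
steps_per_hour = 30
hours_per_day = 24
days_per_month = 30
usd_per_million_input = 3.0
flat_plan_usd_per_month = 200.0       # hypothetical subscription price

tokens = prompt_tokens_per_step * steps_per_hour * hours_per_day * days_per_month
metered = tokens / 1_000_000 * usd_per_million_input

print(f"metered input cost: ~${metered:,.0f}/month")
print(f"flat plan:           ${flat_plan_usd_per_month:,.0f}/month")
```

The absolute numbers matter less than the shape: metered cost scales with loop volume, flat cost doesn't.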
Practical checks before you blame OpenClaw
If you’re debugging an OpenClaw setup, here are the boring checks worth doing first.
1. Verify Ollama is actually reachable
If you’re using local models, confirm the endpoint is alive:
```bash
curl http://localhost:11434/
ollama list
```
If Ollama is down or the model is missing, OpenClaw can look broken when the real issue is just a dead local dependency.
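If you'd rather run that check from your own tooling than a terminal, here's a small sketch using Python's standard library against Ollama's model-listing endpoint (/api/tags); verify the endpoint and response shape against your Ollama version:

```python
import json
import urllib.request
from urllib.error import URLError

def ollama_models(base_url: str = "http://localhost:11434") -> list | None:
    """Return locally available model names, or None if Ollama is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=3) as resp:
            data = json.load(resp)
    except (URLError, OSError, ValueError):
        return None
    return [m["name"] for m in data.get("models", [])]

models = ollama_models()
if models is None:
    print("Ollama is not reachable; fix that before blaming the framework.")
else:
    print("Ollama is up. Models:", models or "none pulled yet")
```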
2. Inspect what’s being stuffed into context
Before upgrading models, inspect the prompt inputs.
Look for:
- AGENTS.md
- memory files
- workspace files
- project notes
- skills
- tool schemas
If a tiny task starts with 12k tokens of baggage, a better model may improve quality but won’t fix economics.
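A quick audit sketch: rank the usual suspects by approximate token weight before deciding anything about models. The paths and the ~4-characters-per-token heuristic are illustrative; point it at whatever your setup actually loads:

```python
from pathlib import Path

def approx_tokens(path: Path) -> int:
    # Crude heuristic: roughly 4 characters per token.
    return len(path.read_text(errors="ignore")) // 4

# Illustrative candidates; list whatever your agent actually pulls into context.
candidates = ["AGENTS.md", "memory.md", "notes/project.md", "skills"]

report = []
for name in candidates:
    p = Path(name)
    files = p.rglob("*") if p.is_dir() else [p]
    total = sum(approx_tokens(f) for f in files if f.is_file())
    report.append((total, name))

for tokens, name in sorted(report, reverse=True):
    print(f"{tokens:>8} ~tokens  {name}")
```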
3. Measure token usage per step, not per task
If you only inspect final task cost, you’ll miss where the burn is happening.
Track each loop:
```
step_01: input_tokens=11842 output_tokens=611
step_02: input_tokens=12790 output_tokens=944
step_03: input_tokens=13402 output_tokens=388 retry=1
step_04: input_tokens=14110 output_tokens=502 retry=2
```
That’s where the real story usually is.
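How you produce a log like that depends on your stack, but most chat-completion APIs report usage per call. A minimal sketch that appends one JSONL record per loop step; the usage field names your provider returns will vary, so treat the wiring comment as an assumption:

```python
import json
import time

def log_step(step: int, input_tokens: int, output_tokens: int,
             retries: int = 0, path: str = "agent_token_log.jsonl") -> None:
    """Append one loop iteration's token usage to a JSONL file."""
    record = {
        "ts": time.time(),
        "step": step,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "retries": retries,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Inside the agent loop, after each model call (field names vary by provider):
# log_step(i, resp.usage.prompt_tokens, resp.usage.completion_tokens, retry_count)
```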
4. Save MCP config deliberately
One commenter mentioned that MCP credentials can be lost unless the configuration is explicitly saved as a skill.
Tiny detail, huge impact.
If the agent forgets how to access a tool it already used, your next few loops are just expensive confusion.
5. Trim context before changing providers
A lot of teams jump straight from:
- Gemini 3 Flash Preview
- local Ollama models
- cheaper routing setups
...to Claude Opus 4.6 or GPT-5.4 because they want better reliability.
Sometimes that’s correct.
But if the root problem is prompt bloat, switching providers just means you’re paying more for the same oversized loop.
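"Trimming context" can be as unglamorous as capping how much history and how many oversized messages ride along on each turn. A sketch with arbitrary illustrative limits:

```python
def trim_context(messages: list[dict], max_turns: int = 6,
                 max_chars_per_message: int = 4_000) -> list[dict]:
    """Keep the first system message and only the most recent turns, and cap
    oversized messages, instead of resending the whole history every loop."""
    system = [m for m in messages if m["role"] == "system"][:1]
    recent = [m for m in messages if m["role"] != "system"][-max_turns:]

    trimmed = []
    for m in system + recent:
        content = m["content"]
        if len(content) > max_chars_per_message:
            content = content[:max_chars_per_message] + "\n[...truncated...]"
        trimmed.append({**m, "content": content})
    return trimmed

# Usage: pass the trimmed list to the model instead of the full transcript.
```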
A better debugging checklist for agent runtimes
If I were troubleshooting an OpenClaw workflow today, I’d use this order:
```bash
# 1. Check local dependencies
curl http://localhost:11434/
ollama list

# 2. Log prompt size per step
# 3. Log retries and tool-call failures
# 4. Remove unnecessary memory/project files
# 5. Reduce tool surface area
# 6. Re-test with a stronger model
# 7. Compare total cost over a realistic workload
```
That order matters.
Too many people jump from “this feels flaky” to “buy a better model.”
Sometimes the real bug is that the agent is hauling too much context and thrashing.
My take: the quitter was more right than wrong
I don’t think OpenClaw is useless.
Clearly people are shipping with it.
Some developers genuinely like it.
But the core complaint — too fragile for real work — lands because real work is where all the hidden costs pile up at once.
Real work means:
- longer sessions
- more tools
- more memory
- more retries
- more state
- more chances to drift
- more money spent on orchestration overhead
The defenders are right that model choice matters.
That part is real.
But once a community starts normalizing:
- second cloud instances
- backup anxiety
- 8k-18k token overhead
- manual provider mixing
- constant cost management
...I stop calling that simple user error.
That’s a design constraint.
The bigger lesson is not about OpenClaw
The reason this thread matters is that it exposed something bigger than one framework.
Agent runtimes make every mistake more expensive.
A bad prompt costs more.
A retry costs more.
A wrong tool call costs more.
A context-heavy loop costs way more.
And when you’re billed per token, all of that turns into a budgeting problem fast.
That’s why predictable compute is becoming the real conversation for agent builders.
If your agents run in n8n, Make, Zapier, OpenClaw, or custom workflows all day, the problem is not just “which model is smartest?”
It’s also:
- can I afford the retries?
- can I let this run unattended?
- can I stop watching token usage like a hawk?
- can I scale without turning every workflow into a pricing spreadsheet?
That’s exactly why products like Standard Compute are interesting right now.
It’s a drop-in OpenAI-compatible API, but the bigger value is the pricing model: flat monthly cost instead of per-token anxiety.
For normal chat apps, per-token billing is tolerable.
For long-running agents and automations, it becomes operational drag.
If your workflow burns tokens just to stay alive, predictable pricing is not a nice-to-have.
It changes what you’re willing to automate.
Final thought
The 49-comment OpenClaw meltdown hit a nerve because a lot of developers recognized the pattern.
Not just “my agent failed.”
More like:
- it failed after a long loop
- it burned money while failing
- I’m not even sure whether the bug is the model, the framework, the context, or my setup
That combination is brutal.
So yes, OpenClaw may be part of the problem.
But the deeper issue is that long-running agent systems turn fragility into a cost multiplier.
And once you’ve felt that, you stop caring about headline token prices.
You start caring about whether your stack lets you run agents without constantly thinking about the meter.