Lars Winstand

Posted on May 18 • Originally published at standardcompute.com

OpenClaw told me it failed its own trust test, and that’s the real story

#agents #ai #automation #devops

I found a small r/openclaw thread recently that explained agent reliability better than most polished benchmarks.

The post had 16 upvotes, 15 comments, and a perfect title:

“OpenClaw falling on it's sword.”

Funny headline. Real bug.

The screenshot showed OpenClaw narrating its own failure like a terminal-based tragedy:

“I have failed the Atomic Append test. My attempt to read, append, and rewrite was a total failure. The file remains unchanged. I am halting all operations.”

That’s funny for a second.

Then it stops being funny, because if you run agents in production, you know what this actually means:

a basic file mutation failed
the agent noticed
the agent stopped trusting the environment
the workflow died

That is the real story.

Not “prompting issue.”
Not “user error.”
Not “AI is random lol.”

This is what fragile agent stacks look like in real life.

The bug was small. That’s why it matters.

The task was not complicated.

OpenClaw was writing to a log file.

That matters because most agent systems do not fail on the big flashy benchmark tasks. They fail on boring stateful operations:

read file
append content
rewrite file
verify write
continue

If those steps are unreliable, the whole stack is unreliable.

Here’s the shape of the failure in plain English:

1. Agent reads file through a tool
2. Agent plans an append
3. Agent rewrites the file
4. Agent checks whether the change actually happened
5. Verification fails
6. Agent decides the environment is untrustworthy
7. Agent halts
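To make that loop concrete, here is a minimal Python sketch of a read, append, rewrite, and verify cycle. It is not OpenClaw's implementation; the function name `append_and_verify` and the temp-file-then-rename approach are my own assumptions about what a careful write path could look like.

```python
import os
import tempfile
from pathlib import Path


def append_and_verify(path: Path, line: str) -> bool:
    """Append `line` by reading, rewriting, and re-reading the file.

    Returns True only if the file on disk matches what we meant to write.
    Hypothetical helper for illustration, not part of any agent framework.
    """
    original = path.read_text() if path.exists() else ""
    # Make sure existing content ends with a newline before appending.
    if original and not original.endswith("\n"):
        original += "\n"
    expected = original + line + "\n"

    # Write to a temporary file in the same directory, then swap it into
    # place, so a crash mid-write cannot leave a half-written log behind.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(expected)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, path)
    except OSError:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        return False

    # The verification step: check that the change actually happened.
    return path.read_text() == expected
```

A `False` return here is the moment the agent in the screenshot hit. What the agent does next (retry, report, or halt) is a policy decision layered on top of this check.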

That is not an edge case. That is the core loop for coding agents, automation agents, and any workflow that mutates state.

The comments made it worse

The thread got more interesting once I read the replies.

One commenter described an even stranger failure mode:

“I've seen a similar spiralling happening when I had two agents on the same chat group and the other one started accusing the other of fabricating something... I had to surgigally remove all memories related to that or the self-doubt about fabrication would occasionally resurface and it would spiral again.”

That is not a normal software bug.

That is memory contamination plus agent self-doubt plus tool-state confusion.

If you work on agent systems, this should sound familiar. Once you mix:

long context windows
tool outputs
memory
retries
partial failures

...you stop debugging a single prompt and start debugging an unstable distributed system made of model behavior, tool semantics, and framework assumptions.

The real question: is OpenClaw broken, or are people using the wrong models?

The comments were blunt.

People reported issues with:

Gemma 4 26B
Qwen 3.5
Qwen 3.6
local setups through Ollama

And one commenter basically answered the whole thread with:

“Use Opus.”

Honestly, that’s closer to the truth than most onboarding guides.

A lot of people hear “OpenClaw supports local models” and mentally translate that into:

My Ollama setup should behave like Claude Opus 4.6.

It won’t.

Not for tool use.
Not for recovery.
Not for long-running stateful tasks.

A model can look smart in chat and still be bad at actually doing work.

That distinction matters.

Chat-smart is not tool-smart

This is the part developers keep rediscovering.

A model can:

explain a patch well
summarize architecture correctly
sound confident in a plan

...and still fail to call write_file correctly when the workflow depends on it.

That is the difference between a good demo model and a good agent model.
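One cheap way to see that difference is to validate the tool call a model emits before executing it. The sketch below assumes a JSON call shape with `tool` and `arguments` keys and a `write_file` tool that takes `path` and `content`; those field names are hypothetical and are not taken from OpenClaw's actual tool definitions.

```python
import json

# Hypothetical argument schema for a write_file tool. The field names are
# assumptions for illustration, not OpenClaw's real tool definition.
WRITE_FILE_ARGS = {"path": str, "content": str}


def validate_write_file_call(raw: str) -> list:
    """Return a list of problems with a model-emitted tool call (empty if OK)."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["tool call must be a JSON object"]

    problems = []
    if call.get("tool") != "write_file":
        problems.append(f"unexpected tool name: {call.get('tool')!r}")
    args = call.get("arguments")
    if not isinstance(args, dict):
        return problems + ["missing or malformed 'arguments' object"]
    for name, expected_type in WRITE_FILE_ARGS.items():
        if name not in args:
            problems.append(f"missing argument: {name}")
        elif not isinstance(args[name], expected_type):
            problems.append(f"argument {name} should be {expected_type.__name__}")
    return problems
```

A model that answers with a fluent paragraph instead of a structured call fails at the first check, which is exactly the chat-smart but not tool-smart case.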

If I had to summarize the subreddit experience, it looks like this:

| Option | What users keep reporting |
| --- | --- |
| Local models via Ollama | Lower direct cost, more setup friction, weaker tool-use reliability |
| Frontier models like Claude Opus 4.6 or GPT-5.4 | Better reliability, better recovery, much higher API cost exposure |
| OpenClaw version pinning | Often necessary because newer versions can regress core behavior |

That does not mean local models are useless.

It means OpenClaw is much more model-sensitive than beginners expect.

Version churn makes this harder than it should be

Even if you choose a stronger model, you still have framework churn.

Across recent OpenClaw discussions, I kept seeing the same pattern:

write failures on Ollama + Qwen 3.6
long first-token delays on M4 Pro MacBook + OpenClaw + OpenRouter + Gemini 3 Flash
reports that version 5.12 introduced immediate context-limit failures, even on fresh chats

That last one is especially bad.

Users were seeing errors like this:

Context limit exceeded. I've reset our conversation to start fresh - please try again. To prevent this, increase your compaction buffer by setting agents.defaults.compac...

If a fresh chat is already “over context,” that is not a prompt problem.

That is a regression.

And when the best community advice becomes some variation of:

# not literally universal advice, but this is the vibe
pin older version
apply patch
avoid latest release
hope nothing else breaks

...you do not have a mature stack.

You have a moving target.

The part nobody wants to say out loud: reliability bugs turn into billing bugs

This is where the OpenClaw story connects directly to every developer running agents in production.

A lot of the subreddit discussion is really about cost, even when people are nominally talking about reliability.

One widely shared example pointed to reporting that the OpenClaw creator burned through $1.3 million in OpenAI API tokens in a month, with numbers like:

603 billion tokens
7.6 million requests
100 coding agents
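Dividing those reported figures by each other gives a rough sense of scale. This is plain arithmetic on the numbers above, which I am treating as approximate; the result is a blended average across all token types, not a price from any rate card.

```python
# Figures as reported in the shared example; treat them as approximate.
total_usd = 1_300_000
total_tokens = 603_000_000_000
total_requests = 7_600_000
agents = 100

usd_per_million_tokens = total_usd / total_tokens * 1_000_000
tokens_per_request = total_tokens / total_requests
usd_per_request = total_usd / total_requests
requests_per_agent = total_requests / agents

print(f"{usd_per_million_tokens:.2f} USD per million tokens (blended)")
print(f"{tokens_per_request:,.0f} tokens per request")
print(f"{usd_per_request:.2f} USD per request")
print(f"{requests_per_agent:,.0f} requests per agent for the month")
```

That works out to roughly $2.16 per million tokens, about 79,000 tokens per request, about $0.17 per request, and 76,000 requests per agent. Seventeen cents per request sounds small until you multiply it by every retry.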

Whether or not you operate anywhere near that scale is almost beside the point.

The point is what developers immediately infer from numbers like that:

retries are expensive
loops are expensive
tool failures are expensive
always-on agents are expensive
bad days cost more than good days

Billing that grows with every retry and every loop is a terrible pricing model for automation.

If an agent retries file writes, replans after a tool failure, or keeps recovering from a framework regression, per-token billing turns reliability problems into cost spikes.

The worse the system behaves, the more you pay.
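A rough way to put numbers on that: if each attempt at a step fails independently with some probability and the agent gives up after a fixed number of attempts, the expected number of billed attempts is a short geometric sum. The sketch below is a simplified model with hypothetical inputs. It assumes every attempt costs the same number of tokens, which is optimistic, since real retries often carry a longer context.

```python
def expected_attempts(failure_rate: float, max_attempts: int) -> float:
    """Expected attempts when each attempt fails independently with
    probability `failure_rate` and the agent stops after `max_attempts`."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    # Attempt k+1 only happens if the previous k attempts all failed.
    return sum(failure_rate ** k for k in range(max_attempts))


def expected_step_cost(
    tokens_per_attempt: int,
    usd_per_million_tokens: float,
    failure_rate: float,
    max_attempts: int,
) -> float:
    """Expected per-token bill for one step, in USD, under the model above."""
    attempts = expected_attempts(failure_rate, max_attempts)
    return attempts * tokens_per_attempt * usd_per_million_tokens / 1_000_000
```

With a 50% failure rate and a cap of three attempts, `expected_attempts(0.5, 3)` is 1.75, so the same step costs 75% more on average than it would on a stack where the write simply works.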

That same pattern shows up outside OpenClaw too:

n8n workflows that loop on failed tool steps
Make scenarios that retry model calls
Zapier automations with branching LLM steps
custom agent runners that keep re-planning after partial tool errors

Every retry path becomes a billing path.

This is why “just use Opus” is both good advice and incomplete advice

Yes, stronger models help.

If you want to test whether your setup is fundamentally viable, start with a frontier model.

Something like Claude Opus 4.6 or GPT-5.4 gives you a much cleaner signal than starting with a shaky local model and wondering whether the framework is broken.

But that advice skips the second problem:

frontier-model reliability often comes with frontier-model pricing.

So the real choices developers end up with look like this:

Use local models via Ollama and spend time debugging tool behavior.
Use Claude Opus or GPT-5.4 and spend money.
Mix routers, versions, patches, and model backends until you forget which layer is actually failing.

That is not just an OpenClaw problem.

That is the current agent-stack problem.

What I’d actually do before touching OpenClaw again

If I were setting up OpenClaw tomorrow, I’d be a lot more disciplined about the bring-up process.

1. Prove the workflow with a strong model first

Do not start with the cheapest local model and then blame everything else.

Start with a known-good model and test the workflow end to end.

openclaw onboard

Then verify the agent can actually mutate files, not just talk about mutating files.

2. Treat file writes as a first-class acceptance test

I would explicitly test:

create file
append line
rewrite block
verify checksum or exact contents
recover from a failed write

Something as simple as this catches a lot:

echo "line1" > test.log

Then ask the agent to append line2, and verify the result with a second tool call.

Expected result:

line1
line2

If your agent cannot do that reliably, it is not ready for repo edits, memory files, or stateful automation.
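Here is one way to script that check so it runs the same way every time. It is a sketch under assumptions: `run_agent` is a hypothetical hook you would wire to your own agent, and the verification deliberately happens outside the agent, using both a checksum and an exact content comparison.

```python
import hashlib
from pathlib import Path
from typing import Callable


def sha256_of(path: Path) -> str:
    """Hex SHA-256 of the file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def check_append(run_agent: Callable[[Path, str], None], workdir: Path) -> str:
    """Seed test.log, let the agent append one line, then verify independently.

    `run_agent(path, line)` is a placeholder for however you invoke your
    agent. Returns "ok", "unchanged", or "wrong contents".
    """
    log = workdir / "test.log"
    log.write_bytes(b"line1\n")  # same starting state as the echo command
    before = sha256_of(log)

    run_agent(log, "line2")

    if sha256_of(log) == before:
        return "unchanged"  # the failure mode from the screenshot
    if log.read_bytes() != b"line1\nline2\n":
        return "wrong contents"  # e.g. overwrote instead of appending
    return "ok"
```

Distinguishing "unchanged" from "wrong contents" matters when you are debugging: the first points at a write that never landed, the second at a write that landed incorrectly.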

3. Pin versions aggressively

If the community keeps saying one version is stable and another version regressed, listen.

# example only
npm install openclaw@5.7

Do not assume latest means safest.

4. Measure latency and retries together

A setup that is “cheap” but retries constantly is not cheap.

A setup that takes 23 seconds to first token is not responsive enough for many interactive workflows.

Track both.
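A small recorder is enough to start. The sketch below is illustrative only: `time_to_first_token` assumes your client exposes the response as an iterable stream, and `RunStats` is a hypothetical helper of my own, not part of OpenClaw or any provider SDK.

```python
import time
from dataclasses import dataclass, field
from statistics import median


def time_to_first_token(stream, clock=time.perf_counter):
    """Seconds until `stream` yields its first item, or None if it is empty."""
    start = clock()
    for _ in stream:
        return clock() - start
    return None


@dataclass
class RunStats:
    """Keeps latency and attempt counts side by side, one entry per run."""
    first_token_seconds: list = field(default_factory=list)
    attempts: list = field(default_factory=list)

    def record(self, first_token_s: float, attempt_count: int) -> None:
        self.first_token_seconds.append(first_token_s)
        self.attempts.append(attempt_count)

    def summary(self) -> dict:
        if not self.attempts:
            raise ValueError("no runs recorded")
        total_attempts = sum(self.attempts)
        return {
            "runs": len(self.attempts),
            "median_first_token_s": median(self.first_token_seconds),
            # Share of all attempts that were retries rather than first tries.
            "retry_rate": (total_attempts - len(self.attempts)) / total_attempts,
        }
```

Looking at the two numbers together is the point: a backend with a fast median first token and a 40% retry rate can easily be slower and more expensive in practice than one that is slower per call but rarely retries.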

5. Stop treating framework problems and pricing problems as separate

This is the mistake I see constantly.

People debug agent quality and pricing as if they are separate concerns.

They are not.

If your architecture depends on retries, tool verification, recovery loops, and long-running sessions, your pricing model matters as much as your model choice.

My actual takeaway

OpenClaw did not just fail a log write.

It accidentally produced one of the most honest status reports in the agent ecosystem:

the stack is only as trustworthy as its weakest write path, weakest model choice, and most recent upgrade

That is why the thread mattered.

Benchmarks tell you whether a model solved a task once.

Reddit tells you what happens on Tuesday at 2:13 a.m. when your agent refuses to append one line to a file, declares reality compromised, and halts.

That is much closer to production truth.

Where Standard Compute fits

This is also why pricing model design matters so much for agents.

If the only stable path is using stronger models like GPT-5.4, Claude Opus 4.6, or Grok 4.20, then per-token billing becomes a tax on reliability.

Every retry costs more.
Every recovery step costs more.
Every always-on automation costs more.

That’s exactly the problem Standard Compute is built to remove.

It’s a drop-in OpenAI-compatible API with unlimited AI compute on a flat monthly plan, so you can run agents and automations without doing cost math on every loop.

If you’re running:

OpenClaw
n8n
Make
Zapier
custom agent workflows

...predictable pricing is not a nice-to-have. It changes what kinds of automations are practical to run continuously.

You can check it out here:

https://standardcompute.com

If I were building an always-on agent stack today, I’d care about three things first:

tool reliability
version stability
pricing that doesn’t punish retries

The OpenClaw thread was funny.

But the lesson was dead serious.

DEV Community