A thread on r/openclaw asked a simple question:
> Is GPT 5.5 in OpenClaw a bad model?
The thread had 32 comments, and the real answer was nowhere near simple.
After reading through it, my takeaway is this:
Most people were not actually complaining about raw model quality.
They were complaining about a mix of:
- agent behavior defaults
- system prompts and soul files
- OpenClaw reliability problems
- wrapper effects from Codex
That distinction matters if you build agents, automations, or long-running workflows.
Because if you blame the base model for what is really a prompt/runtime/product issue, you will optimize the wrong layer.
And yes, this is very relevant outside OpenClaw too. The same mistake happens in n8n, Make, Zapier, custom agent stacks, and basically every OpenAI-compatible workflow once people start swapping providers.
## The real complaint: GPT 5.5 felt passive
The strongest criticism in the thread was not "GPT 5.5 is dumb."
It was more like:
> it feels smart, but dead
One commenter described GPT 5.5 through Codex as feeling like:
> an incredibly smart person who has no desire to live, doesn’t want to do anything unless you tell it exactly what to do
That is not a benchmark complaint.
That is an agent behavior complaint.
For normal chat, that might be fine.
For agent workflows, it is a huge deal.
If your assistant is supposed to:
- propose next steps
- notice missing context
- keep momentum going
- act like a collaborative operator
then low initiative feels terrible.
And several users said switching back to Claude Opus 4.7 made OpenClaw feel "alive" again.
That tells you what people were really testing.
They were not running isolated evals.
They were testing whether the agent felt useful in real work.
## Why this matters more in agent products than in raw model evals
In an agent product, "vibe" is not cosmetic.
It changes throughput.
A passive model means:
- more explicit instructions
- more back-and-forth
- more supervision
- more friction in long-running tasks
That compounds fast.
If you are paying per token, it also means more cost.
If you are running automations 24/7, it means more babysitting.
This is exactly why teams eventually start caring less about abstract model rankings and more about whether their stack can run predictably at scale.
That is also why flat-cost OpenAI-compatible infrastructure is appealing for agent workloads. Once you are iterating on prompts, wrappers, retries, and routing, per-token pricing becomes a tax on experimentation.
## The important part: this probably is not just a GPT 5.5 issue
One of the smarter comments in the thread pushed back on the whole premise.
The point was basically:
"waiting to be told" vs "taking initiative" is partly a model trait, but system prompts can shift it a lot
I think that is correct.
When you use a model inside OpenClaw, you are not interacting with the base model directly. You are interacting with a stack.
Something like this:
1. The base model: GPT 5.5 or Claude Opus 4.7
2. The provider/wrapper layer: Codex in this case
3. The OpenClaw prompt architecture: system prompts, soul files, defaults
4. The runtime behavior: tools, sessions, auth, cron, upgrades, failures
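To make the layering concrete, here is a rough sketch of the stack as a typed config. Every name in it is illustrative; none of this is OpenClaw's actual schema.

```ts
// Illustrative only: a typed view of the four layers above.
interface AgentStack {
  baseModel: "gpt-5.5" | "claude-opus-4.7"; // layer 1: raw model
  wrapper: "codex" | "native";              // layer 2: provider/wrapper
  promptArchitecture: {                     // layer 3: prompts and defaults
    systemPrompt: string;
    soulFile?: string;
  };
  runtime: {                                // layer 4: tools, sessions, jobs
    tools: string[];
    cronEnabled: boolean;
    autoUpgrade: boolean;
  };
}

// A complaint like "GPT 5.5 has no spark" can originate in any layer,
// not just baseModel.
```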
If layer 3 tells the model to be conservative, avoid assumptions, and wait for explicit permission, users will say:
> this model has no spark
Even if the base model is fine.
That is the trap.
## The stack is the product
Developers love clean comparisons.
We want to think we are comparing:
- GPT 5.5 vs Claude Opus 4.7
But in practice we are comparing:
- GPT 5.5 + Codex + OpenClaw defaults + OpenClaw runtime behavior
- Claude Opus 4.7 + OpenClaw defaults + OpenClaw runtime behavior
That is a very different test.
Here is the thread in table form:
| Option | What users in the thread seemed to feel |
|---|---|
| GPT 5.5 via Codex in OpenClaw | Smart and capable, but often less proactive and more dependent on explicit instructions |
| Claude Opus 4.7 in OpenClaw | More intuitive, more collaborative, more "alive" for assistant-style work |
| OpenClaw itself | A major confounder because prompting, soul files, upgrades, auth, cron, and reliability all shape the experience |
That third row is the one people usually underestimate.
## Some users were happy with Codex
This is where the thread gets more useful.
Not everyone hated GPT 5.5 via Codex.
Some commenters said they were happy with Codex and blamed most breakage on OpenClaw itself.
That suggests GPT 5.5 is not uniformly bad.
It is a poor fit for one specific kind of interaction.
My read:
- if you want a careful coding/operator model, GPT 5.5 via Codex may be fine
- if you want a proactive collaborator, Claude Opus 4.7 felt better to people in this thread
That is a workflow fit issue, not just a model quality issue.
## The practical workaround power users mentioned
One of the more revealing comments described using Codex CLI directly to maintain OpenClaw.
Instead of expecting OpenClaw chat to feel like Claude, they used Codex in a persistent working directory with docs loaded as context.
Example:
```bash
cd ~/.openclaw
codex resume
```
And they referenced:
https://docs.openclaw.ai/
That workaround makes a lot of sense.
Many coding-oriented models feel mediocre in open-ended chat and much better when you give them:
- a real repo
- shell access
- persistent task context
- concrete files to edit
So if GPT 5.5 feels passive in chat but works well in a repo, that is not a contradiction.
That is a clue.
## Reliability bugs can poison model perception
This part changed my mind the most.
Looking through related OpenClaw discussions, I found enough reliability complaints that I do not think you can judge GPT 5.5 inside OpenClaw cleanly right now.
The kinds of issues people mentioned included:
- auto-upgrades changing behavior
- cron jobs silently failing
- API key loading breaking
- generic "Something went wrong" failures
- auth/session weirdness
If your agent stops executing after an upgrade, users do not say:
> the runtime abstraction layer introduced instability
They say:
> this model sucks
That is normal human behavior. It is also wrong.
This is why agent teams eventually care a lot about infrastructure consistency.
If your API surface is stable and OpenAI-compatible, you can swap routing and providers underneath without forcing every workflow to relearn failure modes.
That is one of the underrated benefits of Standard Compute’s approach: developers can keep their existing OpenAI SDKs and clients while routing across models like GPT-5.4, Claude Opus 4.6, and Grok 4.20 without rebuilding their entire automation stack.
For teams running agents in production, that matters more than model tribalism.
## What I would actually test before picking a side
If you are trying to decide whether GPT 5.5 is the problem, run controlled tests.
### Test 1: same task, different model, same wrapper
Keep the prompt and runtime identical.
Only swap the model.
```bash
# pseudo-example
agent run --model gpt-5.5 --prompt-file task.md
agent run --model claude-opus-4.7 --prompt-file task.md
```
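If your wrapper speaks the OpenAI API, the same comparison fits in a short script. Here is a minimal sketch, assuming an OpenAI-compatible endpoint that serves both models; the endpoint and environment variables are assumptions, and `task.md` is the same prompt file as above.

```ts
import OpenAI from "openai";
import { readFileSync } from "node:fs";

// Assumed: an OpenAI-compatible endpoint that serves both models.
const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: process.env.OPENAI_BASE_URL,
});

// Identical prompt, identical settings; only the model changes.
const task = readFileSync("task.md", "utf8");

for (const model of ["gpt-5.5", "claude-opus-4.7"]) {
  const response = await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: task }],
  });
  console.log(`--- ${model} ---`);
  console.log(response.choices[0].message.content);
}
```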
If behavior changes a lot here, the model may actually be the issue.
### Test 2: same model, different system prompt or soul file
Keep the model fixed.
Change the prompt architecture.
```yaml
# variant A
model: gpt-5.5
system_prompt: prompts/conservative.txt

# variant B
model: gpt-5.5
system_prompt: prompts/proactive.txt
```
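For illustration, the two prompt files might differ like this. These contents are hypothetical, not OpenClaw's real defaults:

```text
# prompts/conservative.txt (hypothetical)
Do not assume anything. Ask before taking any action.
Wait for explicit instructions before proposing changes.

# prompts/proactive.txt (hypothetical)
Be proactive. Suggest next steps without being asked.
Flag missing context and propose how to close the gap.
```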
If initiative changes dramatically, you are looking at a prompt-layer issue.
### Test 3: same model, outside OpenClaw
Try the same model with a direct provider workflow or CLI.
```bash
codex resume
```
Or call the model through your own OpenAI-compatible client.
```ts
import OpenAI from "openai";

// Point the client at any OpenAI-compatible endpoint via env vars.
const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: process.env.OPENAI_BASE_URL,
});

const response = await client.chat.completions.create({
  model: "gpt-5.5",
  messages: [
    { role: "system", content: "Be proactive. Suggest next steps without being asked." },
    { role: "user", content: "Review this repo migration plan and tell me what I missed." },
  ],
});

console.log(response.choices[0].message.content);
```
If GPT 5.5 suddenly performs much better outside OpenClaw, then OpenClaw chat was the confounder.
## My opinionated take
I do not think the thread proves GPT 5.5 is a bad model.
I think it shows something more useful:
> A model can be smart and still feel wrong inside a specific agent wrapper.
That is not the same thing as being bad.
For initiative-heavy collaboration, the users in that thread seemed to prefer Claude Opus 4.7.
For careful execution and coding-oriented tasks, GPT 5.5 via Codex still had defenders.
Both can be true.
The bigger lesson is this:
If you are evaluating AI agents, stop pretending you are only evaluating the base model.
You are evaluating:
- the model
- the system prompt
- the wrapper
- the tool runtime
- the reliability of the product around it
Miss that, and you will spend weeks swapping models when the real fix was prompt architecture or infrastructure.
## What this means for teams building automations
If you run agents in:
- n8n
- Make
- Zapier
- OpenClaw
- custom OpenAI-compatible stacks
then your goal should not be "pick one model forever."
Your goal should be:
- keep your API layer stable
- test model behavior under your actual workload
- route different tasks to different models
- avoid pricing models that punish iteration
That last point matters more than people admit.
When every retry, prompt tweak, and long-running agent loop costs more, teams stop experimenting. They narrow tasks to fit budget instead of optimizing for outcomes.
That is why unlimited, flat-rate compute is such a strong fit for agent-heavy workflows. You can test prompt variants, run automations continuously, and route across models without watching a token meter all day.
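As a sketch of what that looks like in code, here is task-based routing behind one stable OpenAI-compatible client. The routing table and model assignments are illustrative assumptions, not a recommendation:

```ts
import OpenAI from "openai";

// One stable client; swapping providers is an env change, not a rewrite.
const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: process.env.OPENAI_BASE_URL,
});

// Hypothetical routing table: initiative-heavy work vs careful execution.
const routes = {
  brainstorm: "claude-opus-4.7",
  refactor: "gpt-5.5",
} as const;

async function run(taskType: keyof typeof routes, prompt: string) {
  const response = await client.chat.completions.create({
    model: routes[taskType],
    messages: [{ role: "user", content: prompt }],
  });
  return response.choices[0].message.content;
}

console.log(await run("brainstorm", "Propose next steps for the repo migration."));
```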
## Final takeaway
My answer to the original Reddit question is:
> No, GPT 5.5 in OpenClaw probably is not a bad model. But it may be a bad default experience for users who want proactive collaboration.
That is a much more precise diagnosis.
And precision matters here.
Because once you separate:
- base model behavior
- prompt defaults
- wrapper effects
- runtime reliability
you can actually fix the problem.
If you are building agents seriously, that is the whole game.
And if you are tired of burning time and money on per-token experimentation while doing it, Standard Compute is worth a look. It is a drop-in OpenAI API replacement with flat monthly pricing, built for exactly this kind of automation-heavy workflow.
Same SDKs. Predictable cost. Much less token anxiety.
That is a better foundation for testing agent behavior than arguing about one Reddit thread forever.