DEV Community

Daniel Nwaneri
OpenClaw Burned $5,600 of API Credits in One Month. Here's the Spec Habit That Prevents It.

OpenClaw Challenge Submission 🦞

The formula was exact. The assumption was wrong.

I was building vectorize-mcp-worker — a semantic search system on Cloudflare Workers that costs $5/month where alternatives run $130-190. The architecture was clean. Workers AI for embeddings, Vectorize for the index, one Worker handling everything at the edge. I'd spec'd the core search flow, written the populate endpoint, deployed it. Everything worked.

Then I tried to add hybrid search. I gave the agent a clear instruction: extend the search endpoint to support keyword fallback when vector similarity scores drop below 0.7.

It did exactly that. It also rewrote the embedding pipeline to use a different model — bge-base-en-v1.5 instead of bge-small-en-v1.5 — because the larger model "would improve result quality." Technically correct. Completely wrong for my use case. The dimensions changed from 384 to 768. The existing Vectorize index became incompatible. I didn't catch it until I tried to query documents I'd already populated.

The formula was exact. The assumption — that I wanted better quality regardless of index compatibility — was wrong.

That's not an agent failure. That's a spec failure. And it's the most dangerous kind because it looks like progress until production breaks.


I build with Claude Code CLI. That incident is what pushed me toward spec-first development — not as a formality, but as a forcing function. Stop feeding half-formed ideas into agents and hoping they fill in the right gaps. They don't. They fill in a gap, confidently, and move on.

So I built spec-writer — a Claude Code skill that takes a vague feature request and forces it into a structured spec before any code gets written. Goals, non-goals, assumptions, success criteria, edge cases. The stuff that feels like overhead until an agent ships something technically correct and operationally broken.

When OpenClaw dropped, I didn't start building. I started speccing.


The concern everyone has but few articulate clearly

Francis from the DEV community put something plainly in his OpenClaw post this week:

"You have to be careful on how you prompt something to the AI agent."

He's right. His conclusion — wait before using OpenClaw — is understandable but backwards. The problem isn't OpenClaw's autonomy. The problem is going into a session without constraining the ambiguity space first. The agent fills every gap you leave. You just don't get to choose which gaps or how.

The answer isn't caution. It's a spec.


What spec-writer actually does

The skill lives in your Claude Code workflow as /spec-writer. You describe what you want to build — as loosely or as precisely as you want — and it returns a structured spec with explicit goals and non-goals, a list of assumptions it caught in your description, success criteria you can actually verify, and a task breakdown ready to hand to any coding agent.

That last part is the one people underestimate. "A task breakdown ready to hand to any coding agent" sounds like formatting. It isn't. It's the difference between an agent that does what you meant and an agent that does what you said.

Here's what happens when you skip it.


The gap between what you said and what you meant

I wanted to build a household manager on top of Tradclaw — the OpenClaw scaffold Claire Vo shipped last week for family and home operations. School reminders. Meal plans. NEPA generator schedules (that's load-shedding for anyone outside Nigeria). Bedtime stories in Pidgin English.

My first instinct was to clone the repo, read BOOTSTRAP.md, and start the interview process. That's what the scaffold tells you to do.

But I ran /spec-writer first.

The input was roughly: "Build a household AI manager for a Nigerian family using Tradclaw and OpenClaw. School calendar, NEPA schedules, local meal planning, Pidgin bedtime stories."

The spec-writer came back with 11 flagged assumptions. Not suggestions. Flags.

Things like:

  • "Assumed NEPA schedule is predictable and retrievable — is there an API or does the agent need to infer from household patterns?"
  • "Assumed 'Pidgin' means Nigerian Pidgin English specifically — clarify dialect and code-switching rules before bedtime story generation."
  • "Assumed meal planning requires local market integration — is pricing data available, or is this a manual input workflow?"
  • "Assumed school calendar maps to Nigerian term structure — which state? Federal school calendar differs from state school calendars."

None of these feel like big catches until an agent confidently answers all of them wrong and you're two sessions deep wondering why your "Nigerian family manager" thinks school starts in September like it's England.


What the spec looked like after

After answering the assumption flags, I had something I could actually hand to OpenClaw:

Project: Nigerian Family Household Manager
Scaffold: Tradclaw (https://github.com/ChatPRD/tradclaw)
Agent runtime: OpenClaw

Core objectives (priority order):
1. Daily family brief — school events, power schedule, weather, tasks
2. NEPA/generator schedule management — manual pattern input, 30-min reminders
3. School calendar tracking — Rivers State federal calendar, term dates, exam periods
4. Meal planning — local market context, no external pricing API
5. Bedtime story generation — Lagos Pidgin/English code-switching, age-appropriate

Non-goals:
- No automatic purchasing or financial actions
- No external data calls for power schedules
- No sharing children's data beyond local workspace

Constraints:
- SOUL.md must define explicit boundaries around children's data
- Agent must surface ambiguity instead of resolving silently
- Pidgin dialect: Lagos variety, natural code-switching is correct behavior

Success criteria:
- Daily brief runs without hallucinated school events
- NEPA reminder fires within 5 minutes of predicted outage window
- Bedtime story passes native speaker read test (natural Pidgin rhythm)
- No skill composition errors across calendar + messaging + file-write

Flagged assumptions (resolved):
- School calendar = Rivers State federal (not Lagos state, not UK)
- Generator schedule = household-defined patterns (no API)
- Pidgin = Lagos variety (not Warri, not Kano)
- Meal planning = manual ingredient input (no market price API)

That spec didn't take long to write. But the session that followed it was different from every session I'd run without one.

The agent stayed in scope. When it didn't have enough information, it asked instead of assumed. When I gave it an ambiguous instruction, it surfaced the ambiguity rather than resolving it silently.


Why this matters for OpenClaw specifically

OpenClaw is an agentic system. That's the whole point — you give it a workspace, skills, memory, and it runs. The autonomy is the feature.

But autonomy without constraints is just energy. It goes somewhere. The question is whether that somewhere is where you wanted to go.

Skills in OpenClaw get composed. A household agent might combine a calendar skill, a messaging skill, a file-writing skill, and a web search skill in a single session. Each skill is well-defined. The composition is where your spec either held or didn't.

One commenter on the OpenClaw challenge launch post put it better than I could: "A skill that chains cleanly into other skills — well-defined inputs, clear failure modes — tends to be far more useful in real agent workflows than a monolithic 'do everything' skill."

That's composability. Your spec is what makes composability possible. Without it, you're handing OpenClaw a list of groceries and hoping it knows which store you meant.
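That commenter's phrase "well-defined inputs, clear failure modes" is concrete enough to sketch. Here is a minimal illustration in Python of what a skill that chains cleanly looks like; the names (`SkillResult`, `calendar_skill`, `message_skill`) are invented for this example and are not OpenClaw's actual skill API:

```python
# Sketch of "chains cleanly": each skill takes typed input and returns
# an explicit result, so the composition point can check for failure
# instead of guessing. Names are illustrative, not OpenClaw's real API.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SkillResult:
    ok: bool
    value: Optional[str] = None
    error: Optional[str] = None

def calendar_skill(query: str) -> SkillResult:
    """Well-defined input, explicit failure mode."""
    if not query:
        return SkillResult(ok=False, error="empty query")
    return SkillResult(ok=True, value=f"events for {query}")

def message_skill(upstream: SkillResult) -> SkillResult:
    """Refuses to run on a failed upstream result instead of improvising."""
    if not upstream.ok:
        return SkillResult(ok=False, error=f"upstream failed: {upstream.error}")
    return SkillResult(ok=True, value=f"sent: {upstream.value}")
```

The design point is the composition seam: `message_skill` never has to guess what a broken `calendar_skill` output means, because failure is part of the interface rather than an exception buried three skills deep.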

The spec-writer habit gives you a document that sits at the top of every OpenClaw session. Not a system prompt. Not a README. A live constraint document the agent can reference when it hits an ambiguous decision point.

That document is the difference between an agent that asks "should I use the family's school calendar or the national one?" and one that picks silently and proceeds.
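One way to picture that constraint document in code: the spec's resolved assumptions become the only defaults the agent may use, and anything unresolved forces a question back to the user. This is a hypothetical sketch; the keys and values mirror the spec above, but the lookup API is invented, not part of OpenClaw:

```python
# Hypothetical "live constraint document": resolved assumptions from the
# spec are the only defaults available. Unresolved keys must surface as
# questions, never as silent picks.
RESOLVED_ASSUMPTIONS = {
    "school_calendar": "Rivers State federal",
    "pidgin_dialect": "Lagos variety",
    "power_schedule_source": "household-defined patterns",
}

def resolve_or_ask(key: str) -> str:
    """Return the spec's answer, or refuse to pick silently."""
    if key in RESOLVED_ASSUMPTIONS:
        return RESOLVED_ASSUMPTIONS[key]
    raise LookupError(f"'{key}' is not resolved in the spec -- ask, don't assume")
```

The `LookupError` is the whole idea in one line: an unresolved assumption is a hard stop, not a gap the agent quietly fills.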

There's now economic proof for what that gap costs.

Luo Fuli — who leads Xiaomi's MiMo large model team and previously worked at DeepSeek — posted a detailed breakdown in April after Anthropic cut off third-party tools from routing through Claude subscriptions. Her analysis circulated widely in Chinese tech circles. Almost no English coverage picked it up.

Her core finding: a single Claude Max subscriber paying $100/month generated over $5,600 in equivalent API costs in one billing cycle. A 56x subsidy ratio. The mechanism was context management failure — third-party harnesses firing rounds of low-value tool calls as separate API requests, each carrying 100,000-token context windows. Every 3 steps, the harness compressed tool responses to manage the context limit. Every compression rewrote the conversation prefix. Every rewrite killed the cache hit. The model reprocessed the full context from scratch, again and again. Luo's description: "That's not a gap. That's a crater."

The spec is the fix for the crater. When your spec defines what tools the agent is allowed to call, in what order, with what scope — you're not just preventing bad outputs. You're preventing the token burn pattern that makes the economics collapse. Constrained sessions hit cache. Unconstrained sessions don't.
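The cache arithmetic behind that claim is easy to sketch. Assuming illustrative per-million-token prices (real rates vary by model and provider; these numbers are placeholders, not a quote of anyone's pricing), the same session is several times cheaper when its prefix stays cache-stable:

```python
def session_cost(calls, ctx_tokens, cache_hit_rate,
                 price_in=3.00, price_cached=0.30):
    """USD input-token cost for a session. Prices are per million
    tokens and illustrative only, not any provider's actual rates."""
    millions = ctx_tokens / 1_000_000
    miss = (1 - cache_hit_rate) * calls * millions * price_in
    hit = cache_hit_rate * calls * millions * price_cached
    return miss + hit

# 1,000 tool-call rounds, each re-sending a 100k-token context:
burn = session_cost(1_000, 100_000, cache_hit_rate=0.0)  # every rewrite kills the cache
lean = session_cost(1_000, 100_000, cache_hit_rate=0.9)  # stable prefix, mostly cached
```

Under these toy prices the uncached session costs roughly five times the cached one, and the gap widens with every extra tool-call round. That multiplier, not any single request, is what digs the crater.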


Three assumptions OpenClaw will make if you don't stop it

These aren't hypothetical. They're documented failure patterns from the April 2026 release notes — fixed in recent builds, but revealing about how the agent thinks when it has room to decide.

1. Memory bleeds into the wrong session.

Nested agent runs were inheriting the caller's channel account in shared workspaces. A subagent spawned from one session was running under another session's identity and routing context. The fix (#67508) scoped cross-agent runs to the target agent's bound channel. But the underlying pattern — agent assumes shared context is the right context — is exactly what happens without an explicit spec. Your household agent doesn't know which family member's calendar to update unless you tell it. It picks one.

2. The dreaming loop re-ingests its own output.

OpenClaw's memory system was promoting its own dream narratives back into short-term recall, then re-ingesting them as new memories. The agent was training itself on its own summaries (#64682, #65320). This is context rot. What you spec as "remember meal preferences" becomes "remember OpenClaw's summary of meal preferences" over time. Without a constraint in your spec that defines what counts as a valid memory source, the agent will decide. It will decide wrong.

3. Heartbeat events poisoned later sessions.

Heartbeat, cron, and exec events were overwriting shared session routing metadata. A synthetic heartbeat target was corrupting delivery for subsequent user turns (#66073). For a household manager running on a daily brief schedule, this means your morning reminder context could bleed into the evening meal planning session. Same agent, different task, wrong metadata.

None of these failed because the code was bad. They failed because the agent assumed shared context was safe context. A spec that explicitly defines session boundaries, memory sources, and skill scope constraints would have caught all three before they ran.
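Failure 2 in particular maps to a one-line constraint. A hypothetical sketch of what "define what counts as a valid memory source" can mean in practice; the field names and source labels are invented for illustration, not OpenClaw's actual memory schema:

```python
# Hypothetical memory-admission gate: only spec-whitelisted sources may
# enter long-term memory, so the agent cannot re-ingest its own dream
# summaries. Field names are invented, not OpenClaw's real schema.
VALID_MEMORY_SOURCES = {"user_message", "tool_result", "file_read"}

def admit_memory(entry: dict) -> bool:
    """Admit a memory only if its source is whitelisted in the spec."""
    return entry.get("source") in VALID_MEMORY_SOURCES

candidates = [
    {"source": "user_message", "text": "Jollof on Sundays"},
    {"source": "dream_summary", "text": "family prefers rice"},  # rejected
]
long_term = [m for m in candidates if admit_memory(m)]  # keeps only the first
```

The gate does nothing clever; it just makes "valid memory source" a decision the spec author made once, instead of one the agent re-makes silently every session.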


The lesson that transfers to every OpenClaw build

Start with what the agent is not allowed to assume.

That sounds backwards. You're building capabilities — why start with limitations? Because the failure mode for agentic systems isn't "agent couldn't do the task." It's "agent did the wrong task fluently."

Run your idea through a spec-writer equivalent before the first session. It takes 20 minutes. The 11 assumption flags I got back saved me from sessions that would have felt productive and built the wrong thing.

OpenClaw is powerful because it composes skills and acts autonomously. Those are also exactly the conditions where bad assumptions compound. One wrong inference in a household manager that runs daily isn't a one-time bug. It's a recurring one.

The spec doesn't constrain the agent. It constrains the ambiguity space the agent operates in. That's a different thing entirely.


I built spec-writer because I got tired of fixing confident mistakes. The repo is public at github.com/dannwaneri/spec-writer. Fork it, adapt it to your workflow, run it before your next OpenClaw session.

Not because it'll stop every wrong assumption. But because the ones it catches before a session are cheaper than the ones you find after.

Top comments (3)

leob

Spec writer looks cool!

And what I also noticed:

"A 56x subsidy ratio ..."

Wow - so I googled that, and:

"The current landscape of AI token usage is characterized by heavy, venture-capital-funded subsidies, where AI providers (OpenAI, Anthropic, etc.) often charge users significantly less for subscriptions (e.g., $20/month) than the actual cost of the computing power consumed ..."

"The era of heavily subsidized AI is expected to end soon as companies approach IPOs and seek profitability, potentially increasing costs by 10x ..."

"Note: The data indicates a high likelihood of market correction and significantly higher token costs within 18–24 months of early 2026 ..."

reddit.com/r/ArtificialInteligence...

Companies jumping on the AI bandwagon and prematurely firing most of their employees could be in for a nice surprise ... ;-)

Daniel Nwaneri

The Reddit thread you found is the other half of the argument. The $5,600 crater exists because subsidies are still deep enough to absorb it, but that math only holds while VCs are funding the gap. Once Anthropic and OpenAI are pricing toward profitability, every agent workflow built on implicit subsidy assumptions breaks overnight.

The companies that fired their developers to run on subsidized AI aren't just exposed to a 10x cost increase. They're exposed to it with no one left who understands the systems well enough to optimize them. spec-writer is partly a hedge against that — constrained sessions that hit cache and avoid redundant tool calls are the same sessions that survive a pricing correction.
The employees weren't the cost center. The sloppy prompting was.

Benjamin Nguyen

great explanation!