Introduction
OpenClaw feels fast in the first week. You send a message, the agent responds, and the workflow makes sense. Then gradually, without any obvious change, responses take a little longer, and the API bill at the end of the month is higher than it was two weeks ago, with no single thing you can point to as the cause.
That is not a coincidence, and it is not bad luck. It is what happens when three separate problems compound on each other quietly, over time, without any of them being obvious on its own.
Context bloat, static content reprocessed on every call, and every request hitting the same model regardless of what it actually needs: these are not dramatic failures. They are the kind of inefficiencies that stay invisible until they are not, and by the time the invoice makes them obvious, they have been running for weeks.
In this post, we will break down what is driving each of them and why routing, not prompt tuning or model switching, is the fix that addresses all three at the layer where they actually live.
Why the Default Setup Works Against You Over Time
OpenClaw's default configuration is built to get you started. It is not designed to remain efficient as your usage grows, and the gap between the two becomes apparent faster than most people expect. Three things are responsible for most of it.
Context grows faster than you think
Before you type a single message, your agent has already loaded a significant amount into the context window: SOUL.md, AGENTS.md, bootstrap files, and the results of a memory search against everything you have accumulated. All of it lands in the prompt before your request even starts.
That base footprint is manageable in week one. By week three, the memory graph has grown, the search results are broader, and the conversation history from your previous sessions is traveling with every new request. The agent is not selectively pulling relevant data; it loads everything it has access to every time.
The result is a base token cost per request that is meaningfully higher than it was when you started, without any deliberate change on your part.
Static tokens are processed fresh every time
A large portion of what is loaded into every request is content that has not changed since last week: system instructions, bootstrap files, and agent configuration. Provider-side caching exists specifically to avoid paying full price for static content on repeat calls, but the default OpenClaw setup does not use it.
Every call processes that unchanged content from scratch. For a setup running a 30-minute heartbeat, that means a full API call with no caching, hitting the configured model, every half hour, regardless of whether anything meaningful is happening in the session. Most users never think of the heartbeat as a cost source, but over a full month, it adds up to a figure worth paying attention to.
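The arithmetic is easy to sketch. The numbers below are illustrative assumptions (an assumed static context size and input price), not OpenClaw defaults, but they show how quickly an uncached heartbeat compounds:

```python
# Back-of-envelope cost of an uncached 30-minute heartbeat.
# All numbers below are illustrative assumptions, not measured values.

CALLS_PER_DAY = 24 * 60 // 30       # one call every 30 minutes -> 48 calls/day
BASE_CONTEXT_TOKENS = 20_000        # assumed static context reprocessed per call
PRICE_PER_MTOK = 3.00               # assumed input price, USD per million tokens

tokens_per_day = CALLS_PER_DAY * BASE_CONTEXT_TOKENS
cost_per_month = tokens_per_day / 1_000_000 * PRICE_PER_MTOK * 30

print(f"{tokens_per_day:,} input tokens/day, "
      f"~${cost_per_month:.2f}/month on static content alone")
```

Even with these modest assumptions, the heartbeat alone burns nearly a million input tokens a day on content that never changed.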
Every request hits the same model
OpenClaw routes all requests to a single globally configured model. There is no built-in distinction among task types: a status check, a memory lookup, a formatting task, and a multi-step reasoning problem all map to the same endpoint at the same price.
In practice, the majority of what an agent handles day-to-day is simple work. Summaries, lookups, structured output, short responses. None of it requires a frontier model, but all of it gets one anyway. That is not a usage problem; it is a configuration gap, and it is the highest-leverage thing to fix.
The Structural Fix: Routing
The problem with every approach people try first (switching to a cheaper model, trimming prompts, reducing heartbeat frequency) is that each addresses one variable at a time. The bill declines slightly, then rises again. What is needed is a layer that sits between OpenClaw and the provider, evaluates each request before it is sent, and determines which model to route it to. That is what routing is, and that is why it is a structural fix rather than a configuration tweak.
That layer is Manifest, an open-source OpenClaw plugin built specifically to solve this. It sits between your agent and the provider, and the original OpenClaw configuration remains unchanged.
Manifest intercepts every request before it reaches the LLM. The routing decision takes under 2 ms with zero external calls; within that window, five distinct mechanisms run before the request is forwarded to the appropriate model, starting with the scoring algorithm that decides which tier a request belongs to.
How the scoring algorithm works
Before any request leaves your setup, Manifest runs a scoring pass across 23 dimensions. These dimensions fall into two groups:
- 13 keyword-based checks that scan the prompt for patterns like "prove", "write function", or "what is", and
- 10 structural checks that evaluate token count, nesting depth, code-to-prose ratio, tool count, and conversation depth, among others.
Each dimension carries a weight. The weighted sum maps to one of four tiers through threshold boundaries. Alongside the tier assignment, Manifest produces a confidence score between 0 and 1 that reflects how clearly the request fits that tier.
How Manifest scores a request across 23 dimensions and assigns it a tier in under 2 ms.
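A minimal sketch of what a weighted-sum pass like this looks like. The dimension names, weights, and threshold values below are invented for illustration; Manifest's real tables live in its open-source repo:

```python
# Sketch of weighted-sum tier scoring with a confidence value.
# Dimension names, weights, and thresholds are illustrative assumptions.

TIERS = ["simple", "standard", "complex", "reasoning"]
THRESHOLDS = [0.25, 0.5, 0.75]   # assumed boundaries between the four tiers

def score_request(signals: dict, weights: dict):
    # Weighted sum across whatever dimensions fired on this request.
    total = sum(weights[k] * signals.get(k, 0.0) for k in weights)
    # Map the sum to a tier by counting the boundaries it crosses.
    tier = TIERS[sum(total > t for t in THRESHOLDS)]
    # Confidence: distance from the nearest boundary, normalized to 0..1.
    nearest = min(abs(total - t) for t in THRESHOLDS)
    confidence = min(1.0, nearest / 0.25)
    return tier, round(confidence, 2)

weights = {"has_proof_keyword": 0.6, "token_count_norm": 0.3, "code_ratio": 0.1}
print(score_request({"has_proof_keyword": 1.0, "token_count_norm": 1.0,
                     "code_ratio": 1.0}, weights))
```

A request scoring near a boundary gets low confidence, which is exactly when the session-momentum and floor rules described next matter most.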
One edge case worth knowing: short follow-up messages like "yes" or "do it" do not get scored in isolation. Manifest tracks the last 5 tier assignments within a 30-minute window and uses that session momentum to keep follow-ups at the right tier, rather than dropping them to simple because they contain almost no content.
Certain signals also force a minimum tier regardless of score. Detected tools push the floor to standard. Context above 50,000 tokens forces complex. Formal logic keywords move the request directly to reasoning.
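Those floor rules compose in a simple way: the final tier is whichever is higher, the scored tier or the floor. The 50,000-token threshold and tier names come from the post; the function shape is an assumption:

```python
# Sketch of the hard-floor rules. Thresholds match the post;
# the helper signature is invented for illustration.

TIER_ORDER = ["simple", "standard", "complex", "reasoning"]

def apply_floors(scored_tier, has_tools, context_tokens, has_logic_keywords):
    floor = "simple"
    if has_tools:
        floor = "standard"            # detected tools raise the floor
    if context_tokens > 50_000:
        floor = "complex"             # large context forces complex
    if has_logic_keywords:
        floor = "reasoning"           # formal logic goes straight to reasoning
    # Never route below the floor; keep the scored tier if it is higher.
    return max(scored_tier, floor, key=TIER_ORDER.index)

print(apply_floors("simple", has_tools=True, context_tokens=60_000,
                   has_logic_keywords=False))
```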
The four tiers and what they route
The tier system is where the cost reduction actually happens. Manifest defines four tiers, each mapped to a different class of model:
- Simple: greetings, definitions, short factual questions. Routed to the cheapest model.
- Standard: general coding help, moderate questions. Good quality at low cost.
- Complex: multi-step tasks, large context, code generation. Best quality models.
- Reasoning: formal logic, proofs, math, multi-constraint problems. Reasoning-capable models only.
In a typical active session, most requests fall into the simple or standard tiers. Routing those away from frontier models, while sending only what genuinely needs them to complex or reasoning, is where the cost reduction of up to 70% reported by users comes from.
How Manifest maps each request type to the cheapest model that can handle it.
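A rough sanity check on that figure. The prices and request mix below are assumptions chosen for illustration, not measured Manifest numbers, but they show why routing most traffic to cheaper tiers lands in that neighborhood:

```python
# Illustrative savings estimate. Prices (USD per million input tokens)
# and the request mix are assumptions, not measured data.
PRICE_PER_MTOK = {"simple": 0.25, "standard": 1.00,
                  "complex": 5.00, "reasoning": 10.00}
MIX = {"simple": 0.45, "standard": 0.35, "complex": 0.15, "reasoning": 0.05}

# Blended per-token price with routing vs. everything on one frontier model.
routed = sum(MIX[t] * PRICE_PER_MTOK[t] for t in MIX)
all_frontier = PRICE_PER_MTOK["complex"]

print(f"routed: ${routed:.2f}/MTok vs single-model: ${all_frontier:.2f}/MTok "
      f"({1 - routed / all_frontier:.0%} cheaper)")
```

Under these assumptions the blended cost comes out roughly two-thirds cheaper; the exact figure depends entirely on your own mix and model prices.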
Every routed response returns three headers you can inspect: X-Manifest-Tier, X-Manifest-Model, and X-Manifest-Confidence. If a request was routed differently than you expected, those headers tell you exactly what the algorithm saw.
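Checking them is a one-liner against whatever HTTP client you use. The header names come from the post; the plain-dict response stand-in is illustrative:

```python
# Pull the three Manifest routing headers off a response.
# Header names match the post; the dict stands in for any HTTP client's headers.
def routing_info(headers: dict) -> dict:
    return {
        "tier": headers.get("X-Manifest-Tier"),
        "model": headers.get("X-Manifest-Model"),
        "confidence": float(headers.get("X-Manifest-Confidence", 0)),
    }

resp_headers = {"X-Manifest-Tier": "standard",
                "X-Manifest-Model": "mid-tier-model",
                "X-Manifest-Confidence": "0.82"}
print(routing_info(resp_headers))
```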
OAuth and provider auth
Manifest lets users authenticate with their own Anthropic or OpenAI credentials directly through OAuth. If OAuth is unavailable or a session is inactive, it falls back to an API key.
This keeps your model access under your own account, which matters for rate limits, spend visibility, and not routing your traffic through a third-party proxy. More providers are being added.
Fallbacks and what they protect
Each tier supports up to 5 fallback models. If the primary model for a tier is unavailable or rate-limited, Manifest automatically moves to the fallback chain. The request still resolves, just against the next available model in that tier's list. This is particularly relevant for the reasoning tier, where model availability can be less predictable during high-traffic periods, and losing a request entirely is more costly than a slight capability downgrade.
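The chain itself is straightforward to picture. The "up to 5 fallbacks" limit comes from the post; the error type and call signature below are invented for illustration:

```python
# Sketch of walking a tier's fallback chain (primary + up to 5 fallbacks).
# The exception type and send() signature are illustrative assumptions.
class ModelUnavailable(Exception):
    pass

def call_with_fallbacks(models, send):
    last_err = None
    for model in models[:6]:          # primary plus up to 5 fallbacks
        try:
            return send(model)
        except ModelUnavailable as e:
            last_err = e              # move to the next model in the tier's list
    raise last_err                    # every model in the tier was unavailable

def flaky_send(model):
    if model in ("reasoner-a", "reasoner-b"):
        raise ModelUnavailable(model)
    return f"ok:{model}"

print(call_with_fallbacks(["reasoner-a", "reasoner-b", "reasoner-c"], flaky_send))
```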
Spend limits without manual monitoring
Manifest lets you set rules per agent against two metrics: tokens and cost. Each rule has a period (hourly, daily, weekly, or monthly), a threshold, and an action. Notify sends an email alert when the threshold is crossed. Block returns HTTP 429 and stops requests until the period resets.
Rules that block are evaluated on every ingest, while rules that notify run on an hourly cron and fire once per rule per period to avoid repeated alerts for the same breach. For a setup with a 30-minute heartbeat running continuously, a daily cost block is the most direct way to prevent a runaway spend event from compounding overnight without any manual check.
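The two actions can be sketched as a single evaluation pass. The rule fields (metric, period, threshold, action) match the post; the data shapes are assumptions:

```python
# Sketch of spend-rule evaluation. Rule fields follow the post;
# the dict shapes and return value are illustrative assumptions.
def evaluate_rules(rules, usage):
    """Return (allowed, alerts). Block rules gate the request immediately;
    notify rules only collect alerts (fired separately on an hourly cron)."""
    allowed, alerts = True, []
    for rule in rules:
        spent = usage.get((rule["metric"], rule["period"]), 0)
        if spent >= rule["threshold"]:
            if rule["action"] == "block":
                allowed = False        # caller should return HTTP 429
            else:
                alerts.append(rule)    # queued for the notify cron
    return allowed, alerts

rules = [{"metric": "cost", "period": "daily", "threshold": 10.0, "action": "block"},
         {"metric": "tokens", "period": "hourly", "threshold": 50_000, "action": "notify"}]
usage = {("cost", "daily"): 12.5, ("tokens", "hourly"): 10_000}
print(evaluate_rules(rules, usage))
```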
The Rest Is Worth Knowing
Routing is the core of what Manifest does, but it ships with a few other things that are worth understanding before you use it in production.
Manifest provides a dashboard that gives a full view of each call: input tokens, output tokens, cache-read tokens, cost, latency, model, and routing tier. Cost is calculated against a live pricing table covering 600+ models, so nothing is estimated. The message log stores all requests and is filterable by agent, model, and time range.
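Per-call cost from a pricing table is simple arithmetic over the three token counts. The prices below are placeholders, not entries from Manifest's actual table:

```python
# Sketch of per-call cost against a pricing table.
# Prices are placeholder assumptions, not Manifest's live table.
PRICING = {  # USD per million tokens: (input, cache_read, output)
    "mid-tier-model": (1.00, 0.10, 4.00),
}

def call_cost(model, input_toks, cache_read_toks, output_toks):
    p_in, p_cache, p_out = PRICING[model]
    return (input_toks * p_in
            + cache_read_toks * p_cache
            + output_toks * p_out) / 1_000_000

print(f"${call_cost('mid-tier-model', 8_000, 12_000, 500):.4f}")
```

Note the cache-read rate: tokens served from cache are priced far below fresh input tokens, which is why the cache-read column in the dashboard is worth watching.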
In local mode, nothing leaves your machine. In cloud mode, only OpenTelemetry metadata is sent: model name, token counts, and latency. Message content never moves. The full codebase is open source and self-hostable at github.com/mnfst/manifest, and the routing logic is fully documented.
A quick note before we move on.
Everything in this post reflects how Manifest works at the time of writing, and the space is moving fast enough that some details may already look different by the time you read it. The OAuth providers, the supported models, and the scoring thresholds all shifted while this article was being written; the team was shipping changes throughout. For anything that has moved since, the docs are the right place to check.
With that said, back to the article. Here is how all of it fits together.
Putting It Together
The three problems do not take turns. They compound on the same request, every time.
Three problems converging into every single request, all at once.
A heartbeat call on a 30-minute cycle loads accumulated context, reprocesses unchanged system files, and hits a frontier model for a task that needed none of it. In week one, that is a small number. By week three, it is a pattern you cannot see until the invoice lands.
Routing is the layer that addresses all three at once, not because it solves context or caching directly, but because it changes the cost of every request before it leaves your setup, and once that layer is in place, the three problems no longer have room to compound.
Where to Start
The order matters here. Do not start by switching models or trimming prompts.
1. Install Manifest and let it run for a few days without changing anything else. The dashboard will show you where the cost is actually coming from.
2. Check the model distribution. If simple and standard requests are hitting your highest-tier model, routing is the first thing to configure.
3. Set a daily cost block rule to prevent a runaway session from compounding overnight.
4. Once routing is active, watch the cache-read token metric: it shows how much static content was served from cache versus processed fresh.
5. Add per-tier fallbacks to prevent availability gaps from interrupting the session.
The Manifest docs cover installation, routing configuration, and limit setup in full. If you want the broader context on what makes OpenClaw production-ready, this post is a good place to start.
