DEV Community: Albert Mavashev

Adding runtime budget control to a Spring Boot AI agent

Albert Mavashev — Wed, 06 May 2026 13:36:37 +0000

We recently wired budget checks into a Java/Spring Boot agent platform.

The problem was not that the platform had no limits. It had too many: subscription checks, credit checks, feature limits, and OpenAI calls all had their own logic.

The expensive path was simple: calling OpenAI. So we wrapped that path directly.

@Service
public class CyclesOpenAIService {

    private final SimpleOpenAI openAI;

    public CyclesOpenAIService(OpenAIClientProvider clientProvider) {
        this.openAI = clientProvider.getOpenAI();
    }

    @Cycles(estimate = "1", workspace = "#workspaceId", unit = "CREDITS")
    public Response runOpenAIRequestCycles(ResponseRequest request, String workspaceId) {
        return this.openAI.responses().create(request).join();
    }
}

That one annotation does the request-time budget check.

Before the OpenAI call runs, Cycles reserves one credit against the user's workspace. If allowed, the call proceeds. If it succeeds, the reservation is committed. If it fails before billable work completes, the reservation is released.

The basic flow is:

reserve -> execute -> commit or release

A few lessons from the integration:

The annotation path was the easy part.
The admin side was more work: creating wallets, funding budgets, looking up balances, and showing event history.
Stable idempotency keys matter. SDK-generated UUIDs are fine for one call, but app-level retries need the same logical request ID.
This first version protects OpenAI calls, not every tool call.
Tool-call gating is the next step for actions like email, refunds, CRM updates, deploys, and paid APIs.
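
The idempotency lesson is easier to see in code. Here is a minimal sketch, in Python for brevity and not the Cycles SDK API, of deriving one key per logical request so that a retry presents the same key as the first attempt. The workspace, run, and step identifiers are hypothetical; any fields that uniquely name one logical request would do.

```python
import hashlib
import uuid


def sdk_style_key() -> str:
    """A fresh random key per call: fine for one attempt, different on every retry."""
    return str(uuid.uuid4())


def logical_request_key(workspace_id: str, run_id: str, step: int) -> str:
    """Derive the same key every time the same logical step is attempted.

    All three inputs are hypothetical identifiers chosen for illustration.
    """
    raw = f"{workspace_id}:{run_id}:{step}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:32]


first_attempt = logical_request_key("ws-42", "run-7", 3)
retry_attempt = logical_request_key("ws-42", "run-7", 3)   # app-level retry
next_step = logical_request_key("ws-42", "run-7", 4)       # a different request
```

Here first_attempt and retry_attempt are identical, so a budget service that deduplicates on the key can treat the retry as the same reservation, while next_step gets its own.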

The main takeaway: budget control belongs in the runtime path, not only in dashboards or billing reports.

The question is not just:

How much did this agent spend?

It is:

Should this agent, for this workspace, be allowed to take the next expensive action right now?

Full field report:

https://runcycles.io/blog/how-scalerx-wired-cycles-into-a-java-agent-runtime

AI Agents Are Cross-Cutting. Your Controls Aren't.

Albert Mavashev — Sun, 12 Apr 2026 01:28:26 +0000

A pattern keeps showing up in production agent systems:

the agent is cross-cutting, but the controls usually are not.

One agent run can touch:

multiple model providers
multiple tools
multiple tenants
multiple workers

But the controls teams usually start with still live inside one slice:

provider spending caps
observability tools
framework loop limits
some Redis counter somebody wrote in an afternoon

Those controls are useful. They solve real problems.

They just do not answer the actual runtime question:

Can this agent, for this customer, on this worker, take the next action right now?

That is not a missing feature in one product.

It is a layer problem.

The shape of the problem

Take a normal SaaS setup.

You have an agent feature serving many customers. A single run might use:

OpenAI for reasoning
Anthropic for long context
a cheaper model for embeddings
a search API
a CRM client
an outbound mailer
a payment API
a vector store
a web fetcher

And it does not all happen in one process. It runs on stateless workers, with retries, fan-out, and concurrent jobs hitting shared budgets.

Now ask a simple question:

Where do we put the budget cap?

That is where things get messy.

Why the obvious controls fall short

1. Provider spending caps

Provider caps are useful safety nets.

They can stop a catastrophic monthly bill on that provider.

But they only see their own billing surface.

An OpenAI cap does not know:

what you spent on Anthropic
what the search API billed
which customer the run belonged to
whether this is one runaway workflow or normal traffic across fifty tenants

Provider caps govern the provider’s exposure to your account.

They do not govern your application’s full runtime surface.

2. Observability tools

Observability tools are also useful.

They help you answer:

what happened
where cost went
which step was slow
which tool was called
where the run failed

But they answer those questions after the action already happened.

That is the constraint.

A trace can tell you an agent burned through budget over the weekend. It cannot stop the bad call at iteration 50 when the damage was still small.

Alerts reduce reaction time.

They do not create a pre-execution decision point.

3. Framework limits

Frameworks often provide things like:

max iterations
max execution time
step limits
tool-call ceilings

Those are sensible defaults.

But they are still local.

They usually apply to one orchestrator instance. Not to:

multiple workers
multiple retries
multiple runs hitting the same budget
multi-tenant shared infrastructure

They also speak in loop counts, not in money or blast radius.

A loop limit does not know whether the next step costs $0.001 or $4.00.
It does not know whether the action is a harmless read or a customer-facing mutation.

4. DIY counters

This is the classic path.

A team writes a Redis-backed counter. Increment per call. Check before request. Done.

It works for a sprint.

Then reality shows up:

TOCTOU races under concurrency
drift across providers
retries that double-count or under-count
counters lost on worker crashes
awkward logic around estimate vs actual
tenant isolation problems
a growing pile of exceptions and patches

What started as “just a counter” turns into a distributed transactional system.

That is usually the moment people realize they are not building a helper anymore. They are building a control plane.

The structural mismatch

All of those tools share the same limitation.

They govern themselves, not the full agent.

That is the mismatch.

Control	What it governs
Provider cap	One provider’s billing surface
Observability	A read path into events and traces
Framework limit	One orchestrator instance
DIY counter	One local or semi-shared budget view

An agent, on the other hand, spans all of them.

So the control layer has to span the same surface the agent spans.

Anything inside only one dimension is a partial view.

And a partial view is not governance.

What the missing layer actually needs

Once you accept that the control layer has to live outside any one provider, tool, framework, or worker, the requirements become pretty clear.

External authority

The decision point has to live outside any one runtime slice.

Otherwise it will always be blind to the rest of the system.

Atomic distributed reservations

Two workers cannot both think the same remaining budget is available.

The concurrency problem has to be solved at the control layer itself, not patched afterward.

Hierarchical scope

The same primitive should work across:

tenant
workspace
app
workflow
agent
toolset

That is how you answer both:

how much can this customer spend?
how much can this single run spend?
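
As a rough illustration of hierarchical scope, the sketch below reserves against every configured level of a scope path, or against none of them. It is an in-memory stand-in under stated assumptions: the kind/name path format (tenant/acme/run/r1), the ScopedBudget class, and its methods are hypothetical and are not the Cycles API.

```python
import threading


class ScopedBudget:
    """In-memory stand-in for a budget store keyed by hierarchical scope path."""

    def __init__(self, limits: dict[str, int]) -> None:
        self._remaining = dict(limits)
        self._lock = threading.Lock()

    @staticmethod
    def ancestors(scope: str) -> list[str]:
        """'tenant/acme/run/r1' -> ['tenant/acme', 'tenant/acme/run/r1']."""
        parts = scope.split("/")
        return ["/".join(parts[:i]) for i in range(2, len(parts) + 1, 2)]

    def reserve(self, scope: str, amount: int) -> bool:
        """Reserve at every configured level, or at none of them."""
        with self._lock:
            levels = [s for s in self.ancestors(scope) if s in self._remaining]
            if any(self._remaining[s] < amount for s in levels):
                return False
            for s in levels:
                self._remaining[s] -= amount
            return True

    def remaining(self, scope: str) -> int:
        with self._lock:
            return self._remaining[scope]


budget = ScopedBudget({"tenant/acme": 100, "tenant/acme/run/r1": 30})
ok_small = budget.reserve("tenant/acme/run/r1", 20)  # fits the run and the tenant
ok_large = budget.reserve("tenant/acme/run/r1", 20)  # the run has only 10 left
```

The second reservation is refused because the run-level budget is too small, even though the tenant still has room, and the refusal leaves both balances untouched.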

Reserve, commit, release

You need to hold budget before the action runs, then settle with the actual amount after.

Otherwise you end up with one of two bad choices:

optimistic execution with bad enforcement
conservative blocking with permanently stranded budget

More than binary allow/deny

Real systems need graceful degradation.

Sometimes the right answer is not just ALLOW or DENY.

Sometimes it is:

allow, but cap tools
allow, but downgrade the model
allow, but reduce context
allow, but disable optional steps

That requires a real decision layer, not just a counter.
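
To make the idea concrete, here is a small sketch of a decision that can return something between yes and no. The thresholds, the cap names, and the decide function are assumptions made up for this example; a real policy would be configured per scope.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    ALLOW = "ALLOW"
    ALLOW_WITH_CAPS = "ALLOW_WITH_CAPS"
    DENY = "DENY"


@dataclass
class Verdict:
    decision: Decision
    caps: dict = field(default_factory=dict)


def decide(remaining: int, estimate: int, low_water: int) -> Verdict:
    """Hypothetical policy: deny if the estimate does not fit, cap when budget is tight."""
    if estimate > remaining:
        return Verdict(Decision.DENY)
    if remaining - estimate < low_water:
        # Illustrative caps only: smaller output, cheaper model, fewer tools.
        return Verdict(
            Decision.ALLOW_WITH_CAPS,
            caps={"max_tokens": 512, "model": "small", "denied_tools": ["web_fetch"]},
        )
    return Verdict(Decision.ALLOW)


plenty = decide(remaining=100, estimate=10, low_water=20)
tight = decide(remaining=25, estimate=10, low_water=20)
empty = decide(remaining=5, estimate=10, low_water=20)
```

The caller still has to honor the caps; the point is that the decision layer can express them at all.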

Provider- and framework-agnostic design

The control primitive cannot care whether the next action is:

an OpenAI call
an Anthropic call
a Stripe charge
an outbound email
a search query
a database write

If the agent spans all of them, governance has to as well.

This is a layer, not a feature

It is tempting to think the gap closes with one more feature release.

Maybe observability tools add enforcement.
Maybe providers add richer caps.
Maybe frameworks add distributed counters.

Possible? Sure.

But that would turn each of those tools into a different category of system.

This is why I think the missing piece is not another feature inside one runtime component.

It is an external, cross-cutting authority layer.

Keep the provider cap.
Keep the observability stack.
Keep the framework guardrails.
Keep the local circuit breakers.

But add the layer that answers the one question none of those can answer on their own:

May this agent, for this customer, on this worker, take the next action right now?

That is the control question that actually matters in production.

And it is why agent governance has to be cross-cutting.

Using RAII to Add Budget and Action Guardrails to Rust AI Agents

Albert Mavashev — Tue, 31 Mar 2026 18:35:47 +0000

Rust is a strong fit for agent runtimes, but until now it has largely lacked a first-class runtime and budget enforcement layer.

We built cycles to add pre-execution budget and action control to Rust agents with an API that leans into ownership and compile-time safety.

The key ideas:

commit(self) consumes the guard, so double-commit becomes a compile-time error
#[must_use] helps catch ignored reservations
Drop triggers best-effort release if a guard is never finalized
one lifecycle works across simple calls, streaming, and multi-agent workflows
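
For readers who do not write Rust, the guard lifecycle can be approximated at runtime. The sketch below is a Python analogy only, not the Rust crate's API: Python has no ownership or move checking, so a double commit is caught by an exception instead of by the compiler, and the context manager's exit stands in for Drop. The class and its ledger argument are illustrative.

```python
class ReservationGuard:
    """Runtime analogy of a consume-on-commit guard (names are illustrative)."""

    def __init__(self, ledger: list, amount: int) -> None:
        self._ledger = ledger
        self._amount = amount
        self._finalized = False
        ledger.append(("reserve", amount))

    def commit(self, actual: int) -> None:
        # Rust rejects a second commit at compile time; Python can only raise.
        if self._finalized:
            raise RuntimeError("reservation already finalized")
        self._finalized = True
        self._ledger.append(("commit", actual))

    def __enter__(self) -> "ReservationGuard":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        # Plays the role of Drop: release if nobody committed.
        if not self._finalized:
            self._finalized = True
            self._ledger.append(("release", self._amount))
        return False  # never swallow the caller's exception


events: list = []

with ReservationGuard(events, 10) as guard:
    guard.commit(7)                      # normal path: settle with actual usage

double_commit_rejected = False
try:
    guard.commit(7)                      # second commit on the same guard
except RuntimeError:
    double_commit_rejected = True

try:
    with ReservationGuard(events, 5):
        raise ValueError("model call failed")   # never committed
except ValueError:
    pass
```

After this runs, events holds a reserve and commit for the first guard and a reserve and release for the second, which is the same lifecycle the Rust guard enforces earlier and more strictly.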

The guide walks through three integration levels:

with_cycles() for simple LLM/tool calls
ReservationGuard for streaming and manual commit
low-level client methods for custom lifecycles

It also covers caps-aware execution:

max token caps
tool allow/deny lists
step limits
cooldowns

And a practical rollout path:

start in dry_run
inspect decisions and caps
move to partial enforcement
then hard limits

Full guide: https://runcycles.io/blog/how-to-add-budget-and-action-guardrails-to-rust-ai-agents-with-cycles

Docs: https://runcycles.io/quickstart/getting-started-with-the-rust-client

Repo: https://github.com/runcycles/cycles-client-rust

Five Lessons from Building a Production OpenClaw Plugin

Albert Mavashev — Mon, 30 Mar 2026 14:22:15 +0000

We built a non-trivial budget enforcement plugin for OpenClaw and ran into several behaviors that were not obvious from the public plugin surface: missing model metadata, no clean way to block model calls, install-time config validation traps, and a security-scanner false positive. The most surprising discovery: OpenClaw's before_model_resolve hook has no way to prevent a model call, so we had to redirect to a fake model name to force a provider-side rejection.

This post is a practical writeup of the five issues that mattered most, the workarounds we shipped, and the feature requests we filed.

None of this is a complaint about OpenClaw. The platform is well-designed and the hook lifecycle is the right abstraction. These are field notes from building a production plugin, shared so other developers don't have to rediscover the same things.

Lesson 1: The model name isn't in the model resolve event

The before_model_resolve hook is called before the LLM provider is invoked. You'd expect the event to include which model is being resolved. It doesn't.

// What we expected
interface BeforeModelResolveEvent {
  model: string;
  prompt: string;
}

// What OpenClaw actually passes
interface BeforeModelResolveEvent {
  prompt: string;  // that's it
}

We discovered this by logging Object.keys(event) — which returned ["prompt"]. No model, modelId, modelName, model_id, or any variant.

Why it matters: Our plugin needs the model name to look up per-model cost estimates, apply fallback chains (Opus → Sonnet → Haiku), and track per-model spend in the session summary. Without it, budget enforcement for models is blind.

Workaround: We added a defaultModelName config property and a multi-source auto-detection chain that checks api.config, api.pluginConfig, and several nested paths:

// The typed event only declares `prompt`, so probe it through a loose view.
const looseEvent = event as Record<string, unknown>;
const eventModel = looseEvent.model
  ?? looseEvent.modelId
  ?? looseEvent.modelName
  ?? (ctx.metadata as Record<string, unknown>)?.model
  ?? config.defaultModelName;

If none of those resolve, the plugin logs the available keys at info level so operators can configure defaultModelName:

before_model_resolve: cannot determine model name.
Event keys: [prompt]. Metadata keys: [].
Set defaultModelName in plugin config.

Feature request: openclaw/openclaw#55771 — include model and provider in the before_model_resolve event.

Lesson 2: You can't cleanly block a model call

OpenClaw's before_tool_call hook has clean blocking semantics:

// Tool hooks support this — works perfectly
return { block: true, blockReason: "Budget exhausted" };

The before_model_resolve hook has no equivalent. The return type only supports { modelOverride?, providerOverride? }. There is no block field and no shouldStop policy in the hook runner.

When our plugin throws BudgetExhaustedError, OpenClaw catches it (the default catchErrors: true behavior), logs "handler failed," and proceeds with the model call. The agent gets a response. Budget enforcement is bypassed.

Workaround: We redirect to a non-existent model. When budget is exhausted, the plugin returns:

return { modelOverride: "__cycles_budget_exhausted__" };

OpenClaw passes this to the LLM provider, which rejects it with a model-not-found error before any generation happens, so the agent produces no response. The user sees:

⚠ Agent failed before reply: Unknown model: openai/__cycles_budget_exhausted__

Not pretty, but the budget is enforced. The model call costs nothing because the provider never executes it.

Feature request: We've asked for block support in before_model_resolve, matching the before_tool_call pattern.

Lesson 3: Your plugin initializes multiple times

A smaller but confusing runtime behavior: OpenClaw calls the plugin's default export once per internal channel or worker — typically 4–5 times on startup. Each instance gets its own isolated state, which is correct for concurrency. But our startup banner printed 5 times and it looked broken.

Workaround: A module-level startupBannerShown flag shows the full config banner once; subsequent inits get a one-liner with a sequential instance counter: Cycles Budget Guard initialized (tenant=cyclist, dryRun=false, instance=3).

Lesson 4: process.env triggers a security warning

OpenClaw's plugin installer scans the bundled dist/index.js for dangerous code patterns. Our plugin read process.env.CYCLES_API_KEY as a config fallback, and the same bundle contained fetch() calls for webhook delivery and OTLP metrics.

The scanner flagged this combination:

WARNING: Plugin "openclaw-budget-guard" contains dangerous code patterns:
Environment variable access combined with network send — possible
credential harvesting

This is a false positive — we read the API key to authenticate with the Cycles server, not to exfiltrate it. But users see "dangerous code patterns" during openclaw plugins install and understandably hesitate.

Workaround: We removed all process.env access from the plugin. Both cyclesBaseUrl and cyclesApiKey are now required in the plugin config. For secrets management, we document OpenClaw's built-in env var interpolation:

{
  "cyclesBaseUrl": "${CYCLES_BASE_URL}",
  "cyclesApiKey": "${CYCLES_API_KEY}"
}

OpenClaw resolves ${...} before passing config to the plugin, so the env var access happens in OpenClaw's trusted code — not in the scanned plugin bundle.

Verification: grep -c process.env dist/index.js returns 0.

Lesson 5: The plugin contract has undocumented rules

Several behaviors of the OpenClaw plugin system are not documented but are critical to get right:

api.pluginConfig vs api.config: Your plugin config is on api.pluginConfig (from plugins.entries.<id>.config in openclaw.json). We initially read api.config — which is the full system config — and couldn't figure out why our settings were always undefined.

Manifest id derivation: The id field in openclaw.plugin.json must match what OpenClaw derives from the npm package name. For @runcycles/openclaw-budget-guard, OpenClaw strips the scope and gets openclaw-budget-guard. Our manifest originally said cycles-openclaw-budget-guard — a mismatch warning on every load.

Config validation timing: If your configSchema includes required fields, OpenClaw validates during openclaw plugins install — before the user has written any config. We had required: ["tenant"] which crashed the install. Fix: remove required from the schema and validate at runtime in your resolveConfig().

Install-time loading: OpenClaw loads and executes the plugin during install to inspect it. If your plugin throws on missing config, the install fails with a confusing error. Wrap your initialization in try/catch and log a friendly message:

try {
  config = resolveConfig(raw);
} catch (err) {
  api.logger.warn(`[openclaw-budget-guard] Skipping registration: ${err.message}`);
  return;
}

What OpenClaw gets right

This post focuses on rough edges, but the foundation is solid:

The 5-hook lifecycle is well-designed. before_model_resolve → before_prompt_build → before_tool_call → after_tool_call → agent_end covers the full agent execution lifecycle. You can build meaningful enforcement without modifying agent code.
before_tool_call blocking is clean. { block: true, blockReason } with shouldStop is exactly the right pattern. We just want the same for model calls.
Plugin isolation per channel is correct. Each channel gets its own plugin instance with its own state. No shared-state bugs across concurrent sessions.
api.logger integration works well. Plugin log output appears in OpenClaw's log stream with proper prefixes and levels.
The install/enable flow is simple. openclaw plugins install + openclaw plugins enable — two commands and you're running.

What we'd like to see

These are filed or planned feature requests:

block support in before_model_resolve — same pattern as before_tool_call
Model name in before_model_resolve event — event.model and event.provider (#55771)
after_model_call hook — with tokensInput, tokensOutput, latencyMs for actual cost tracking
Channel/worker ID on the api object — so plugins can differentiate instances in logs
Plugin contract documentation — api.pluginConfig vs api.config, manifest id rules, config validation timing, install-time behavior

Build your own

If you're building an OpenClaw plugin, start with our source as a reference: github.com/runcycles/cycles-openclaw-budget-guard. The patterns for config resolution, hook registration, state management, and error handling are all used in our released plugin.

Full integration guide: Integrating Cycles with OpenClaw

Why 200 OK Is the Most Dangerous Response in Agent Production

Albert Mavashev — Thu, 26 Mar 2026 15:57:38 +0000

The scary failures are not always the ones that crash.

Sometimes everything looks fine.

The API returns 200 OK.

The logs are clean.

The workflow completes.

No alert fires.

And the result is wrong.

That is a much worse failure mode than a timeout or a hard error, because nothing tells you to go look. The system says success. The output just quietly drifts away from reality.

This is starting to show up more in agent systems than in normal software.

A normal service usually fails loudly. Bad input throws an exception. A downstream service times out. A database call returns an error. Something breaks in a way people know how to detect.

Agents can fail differently.

They can keep going.

They can produce something that looks plausible, structured, and complete while being based on the wrong state, the wrong tool result, or the wrong interpretation of the task.

That is where 200 OK gets dangerous.

Three examples

1. The tool call "worked"

An agent is supposed to pull data from a system and summarize it.

The request goes through. The workflow finishes. The output looks polished.

But the underlying tool response was incomplete, malformed, or misunderstood, and the agent filled in the gaps with something that sounded reasonable.

No crash.

No red light.

Just bad output wrapped in a success path.

2. The coding agent fixed the wrong thing

A coding agent gets asked to make the test suite green.

It does.

CI passes. Everyone moves on.

Later someone realizes the agent did not actually fix the bug. It changed the tests to match the broken behavior.

Again: success on paper, failure in reality.

3. The workflow lost state in the middle

One agent gathers context. Another agent is supposed to use it.

Somewhere in the handoff, part of the state gets dropped. Not enough to crash. Just enough to make the next decision wrong.

The rest of the pipeline still runs. The final report still gets produced. It just happens to be built on partial data.

That is the pattern: wrong result, valid-looking execution.

Why monitoring does not solve this

The default reaction is usually: we need better observability.

Observability absolutely matters. Traces, logs, dashboards, metrics, all useful.

But they mostly tell you what happened after the system already acted.

That helps with debugging.

It does not help much when the system keeps doing the wrong thing while still looking healthy.

A dashboard is great at showing crashes, latency spikes, and budget overruns.

It is much worse at telling you:

this agent used the wrong tool output
this handoff lost key context
this branch decision was wrong but still valid enough to continue
this run should have stopped three steps ago

The core problem is not missing charts.

It is missing checkpoints.

What is actually missing

Most agent stacks have a gap between:

the agent deciding to do something
the action actually happening

In a lot of systems, that gap is basically empty.

The agent reasons, chooses, acts, and reports success in one uninterrupted flow.

If the reasoning is wrong, the action still happens.

That is why silent failures spread so easily. There is no mandatory pause where the system asks:

Should this step be allowed to proceed?

Not "did it return 200?"

Not "did the code throw?"

A different question:

Does this step still make sense under the current budget, policy, and expected shape of the run?

That checkpoint matters more than another dashboard.

A better pattern

The safer pattern looks more like this:

decide -> check -> act -> record

Not:

decide -> act -> maybe notice later

That checkpoint can be simple.

Before a model call, tool invocation, file write, or external side effect, force the run through a control point.

At that point, you can ask:

is this action expected here?
is the run still within budget?
is this tool allowed?
does the cost pattern still look normal?
is the agent starting to loop or fan out unexpectedly?
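
A minimal sketch of such a control point follows. It covers three of the questions above (tool allowed, within budget, possible loop) with deliberately simple rules; RunState, checkpoint, and act are hypothetical names, and the loop rule (the same tool N times in a row) is just one assumption about what "looping" could mean.

```python
from dataclasses import dataclass, field


@dataclass
class RunState:
    budget_remaining: int
    allowed_tools: set
    max_repeats: int = 3
    history: list = field(default_factory=list)


def checkpoint(state: RunState, tool: str, estimated_cost: int) -> tuple:
    """Return (allowed, reason) for the proposed action. Structural checks only."""
    if tool not in state.allowed_tools:
        return False, "tool not allowed"
    if estimated_cost > state.budget_remaining:
        return False, "over budget"
    recent = state.history[-state.max_repeats:]
    if len(recent) == state.max_repeats and all(t == tool for t in recent):
        return False, "possible loop"
    return True, "ok"


def act(state: RunState, tool: str, estimated_cost: int, action):
    """decide -> check -> act -> record, with the check as a hard gate."""
    allowed, reason = checkpoint(state, tool, estimated_cost)
    if not allowed:
        raise PermissionError(reason)
    result = action()                       # the side effect happens only here
    state.budget_remaining -= estimated_cost
    state.history.append(tool)              # record what actually ran
    return result


state = RunState(budget_remaining=10, allowed_tools={"search", "fetch"})
for _ in range(3):
    act(state, "search", 1, lambda: "hits")

fourth_search = checkpoint(state, "search", 1)   # same tool a fourth time in a row
other_tool = checkpoint(state, "fetch", 1)
```

None of these checks understands what the agent is trying to do. They only refuse actions whose shape is already wrong, before the action runs.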

This will not catch every semantic mistake.

But it will catch a lot of structural ones, which is already a big improvement over "everything returned 200 so I guess we're fine."

Why this matters in production

Silent failures are expensive because they compound.

A crash stops the workflow.

A silent failure keeps feeding bad state into later steps.

One wrong tool result becomes a wrong decision.

That wrong decision becomes a wrong action.

That wrong action becomes a wrong report, bad write, or misleading recommendation.

And by the time someone notices, the original step is buried.

That is why the cleanest run is not always the safest run.

In agent systems, a green dashboard can be lying to you.

What I would do first

If you are running agents in production, I would start here:

Identify one workflow where a wrong answer actually matters.
Add a mandatory checkpoint before each costly or risky action.
Record what the step was supposed to do and what actually happened.
Put a hard cap on how far one run is allowed to go.
Look for runs that are "successful" but economically or behaviorally weird.

That last one matters.

Wrong runs often have a shape.

They loop.
They fan out.
They use the wrong tool.
They cost too little for the work they claim to have done.
Or they cost too much for what should have been simple.
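
Those shapes can be checked mechanically after a run reports success. The sketch below is one way to do it, assuming you keep a per-workflow profile of expected cost, step count, and tools; the profile fields, the numbers, and the flag_run helper are all invented for illustration.

```python
def flag_run(cost: float, steps: int, tools: set, expected: dict) -> list:
    """Return structural warnings for a run that reported success."""
    flags = []
    if cost < expected["min_cost"]:
        flags.append("cost too low for claimed work")
    if cost > expected["max_cost"]:
        flags.append("cost too high for a simple task")
    if steps > expected["max_steps"]:
        flags.append("too many steps (loop or fan-out?)")
    unexpected = tools - expected["tools"]
    if unexpected:
        flags.append("unexpected tools: " + ", ".join(sorted(unexpected)))
    return flags


# Hypothetical profile for one workflow.
profile = {"min_cost": 0.05, "max_cost": 1.00, "max_steps": 20, "tools": {"search", "crm"}}

healthy = flag_run(0.40, 8, {"search"}, profile)
suspicious = flag_run(0.01, 45, {"search", "mailer"}, profile)
```

The healthy run produces no flags. The suspicious one is flagged three times: it cost too little, took too many steps, and touched a tool the workflow should not need.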

Those signals are often more useful than waiting for an exception that never comes.

Closing

The most dangerous response in agent production is not 500.

It is 200 OK attached to the wrong result.

That is the failure mode that slips through monitoring, avoids alerts, and reaches users looking completely normal.

Loud failures are annoying.

Silent ones are how you lose trust in the system.

Original post: AI Agent Silent Failures: Why 200 OK Is the Most Dangerous Response

Project: runcycles.io

Your AI Agent Budget Check Has a Race Condition

Albert Mavashev — Wed, 25 Mar 2026 15:48:17 +0000

When I first started putting budget limits around agent workflows, I thought the solution would be simple.

Track the spend.

Check what is left.

Stop the next call if the budget is gone.

That works in a demo. It even works in light testing.

Then you run the same workflow with concurrency, retries, or a restart in the middle, and the whole thing gets shaky.

The problem is not the math.

The problem is where the decision gets made.

The naive version

A lot of first implementations look roughly like this:

def call_model(prompt: str, estimated_cost: int) -> str:
    remaining = get_remaining_budget()

    if remaining < estimated_cost:
        raise RuntimeError("budget exceeded")

    result = llm_call(prompt)

    actual_cost = calculate_cost(result)
    record_spend(actual_cost)

    return result

At first glance, this seems fine.

Check the remaining budget
Make the call
Record the spend

For a single worker, single process, no retries, no failures, it mostly works.

Production is not that environment.

Where it breaks

1. Concurrency

Say you have $5 left.

Now 10 workers all check the budget at about the same time.

They all read the same value.

They all think they have room.

They all proceed.

You did not have one bug. You had ten correct reads and one broken design.

That is a classic time-of-check vs time-of-use problem.
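
Here is a small reproduction of that scenario. The barrier is artificial: it forces all ten workers to finish their check before any of them spends, which makes the bad interleaving happen every time instead of occasionally. The in-memory balance and the worker function are stand-ins for a shared counter and a model call.

```python
import threading

balance = {"remaining": 5}          # dollars left in the shared budget
spent = []                          # one entry per call that went ahead
spent_lock = threading.Lock()
all_checked = threading.Barrier(10)


def naive_worker(cost: int) -> None:
    ok = balance["remaining"] >= cost      # time of check
    all_checked.wait()                     # every worker checks before anyone spends
    if ok:
        with spent_lock:
            balance["remaining"] -= cost   # time of use
            spent.append(cost)


threads = [threading.Thread(target=naive_worker, args=(1,)) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(spent), balance["remaining"])    # prints "10 -5"
```

Note that the write itself is protected by a lock. The lock does not help, because the decision was already made on a stale read.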

2. Retries

Now add retry logic.

Maybe the model call times out.

Maybe the network flakes.

Maybe your framework retries automatically and your application retries too.

Did the first attempt spend money?

Did the second one?

Did both get recorded?

Did neither?

If your budget tracking is tied to local control flow, retries turn accounting into guesswork.

3. Restarts and crashes

If the process dies after the model call but before record_spend(), your state is wrong.

The money is gone.

Your counter says it is not.

If you keep the budget in memory, a restart makes it even worse. The counter resets. The spend does not.

The real issue

A budget check inside application code is not an authority.

It is a hint.

It is only as correct as the current process, the current thread, and the current execution path. Once multiple workers share the same budget, you need the decision to happen in one place, atomically.

That means:

check the budget
reserve the amount
make the call
reconcile the actual cost

Not:

read the budget
hope nothing else changes
make the call
update later

Those are not the same thing.

A better pattern: reserve, execute, commit

The simplest durable shape I found was this:

reservation = reserve_budget(
    scope="tenant/acme/workflow/summarizer",
    amount=estimated_cost,
    idempotency_key=run_step_id,
)

try:
    result = llm_call(prompt)
    actual_cost = calculate_cost(result)

    commit_budget(
        reservation_id=reservation.id,
        amount=actual_cost,
    )
except Exception:
    release_budget(reservation_id=reservation.id)
    raise

That changes the semantics in a useful way.

Before the model call happens, the budget is already spoken for.

A concurrent worker cannot grab the same dollars.

A retry can reuse the same idempotency key.

A failure can release what was reserved but not spent.

The important part is not the API shape.

The important part is that the reservation is atomic.
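
To show what "atomic" means here, this is a minimal in-memory sketch of the three operations. It is not how Cycles is implemented: a real authority would keep this state in a shared store, not in one process, and the BudgetLedger class and its method names are assumptions for the example. The check and the hold happen under one lock, and a repeated idempotency key returns the existing reservation instead of holding budget twice.

```python
import threading
import uuid


class BudgetLedger:
    """In-memory sketch of reserve / commit / release with idempotent reserve."""

    def __init__(self, total: int) -> None:
        self._lock = threading.Lock()
        self._available = total
        self._held = {}      # reservation_id -> reserved amount
        self._by_key = {}    # idempotency_key -> reservation_id

    def reserve(self, amount: int, idempotency_key: str) -> str:
        with self._lock:     # check and hold happen as one step
            if idempotency_key in self._by_key:
                return self._by_key[idempotency_key]   # retry: same reservation
            if amount > self._available:
                raise RuntimeError("budget exceeded")
            reservation_id = str(uuid.uuid4())
            self._available -= amount
            self._held[reservation_id] = amount
            self._by_key[idempotency_key] = reservation_id
            return reservation_id

    def commit(self, reservation_id: str, actual: int) -> None:
        with self._lock:
            held = self._held.pop(reservation_id)
            self._available += held - actual           # return the unused part

    def release(self, reservation_id: str) -> None:
        with self._lock:
            self._available += self._held.pop(reservation_id, 0)

    @property
    def available(self) -> int:
        with self._lock:
            return self._available


# Ten workers compete for a budget of 5; each wants 1.
ledger = BudgetLedger(total=5)
granted = []


def worker(i: int) -> None:
    try:
        granted.append(ledger.reserve(1, idempotency_key=f"step-{i}"))
    except RuntimeError:
        pass   # denied before any call is made


threads = [threading.Thread(target=worker, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# A retry with the same key does not hold budget twice.
retry_ledger = BudgetLedger(total=5)
first = retry_ledger.reserve(2, idempotency_key="run-1/step-3")
again = retry_ledger.reserve(2, idempotency_key="run-1/step-3")
retry_ledger.commit(first, actual=1)
```

Exactly five of the ten workers get a reservation, no matter how the threads interleave. In the retry case, both calls return the same reservation, and committing an actual cost of 1 returns the unused 1 to the pool.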

Why this belongs outside the agent

I also learned that once agents start sharing budgets across tenants, workflows, runs, and tools, this logic stops being “just a wrapper.”

Now you need:

atomic reservations
idempotency
retry safety
shared scope rules
audit history
behavior that is consistent across runtimes

At that point, budget control starts to look less like a helper function and more like infrastructure.

That was the point where I stopped treating it as app code and pulled it into its own service.

One practical rule

If your budget check looks like:

remaining = read_balance()
if remaining >= estimated_cost:
    do_the_thing()
    write_new_balance()

you do not have enforcement yet.

You have a race condition with good intentions.

Closing thought

Most agent failures are not exotic.

They come from very ordinary bugs:

stale reads, duplicate retries, counters that live in the wrong place, and side effects that happen before anyone realizes the run has gone off the rails.

A budget limit only matters if it can say no before the next step happens.

Everything else is reporting.

I ran into this while building Cycles, an open-source budget authority for autonomous agents. What looked like a simple spend check turned into a distributed systems problem: concurrency, retries, idempotency, and state that had to stay correct under failure.

That was the real lesson. Once multiple workers can spend from the same pool, the budget check has to be atomic or it is not real.

If you want to see how I implemented this in Cycles, see https://runcycles.io

The AI Agent Control Layer Nobody Talks About

Albert Mavashev — Mon, 23 Mar 2026 16:28:48 +0000

A lot of agent control discussion still sits at the wrong layer.
Observability tells you what happened. Guardrails help shape behavior.

Neither answers the production question that matters most when agents are looping, retrying, or fanning out:

Can this agent still act — given what it has already done?

That's the control point I've been building toward with Cycles.

Simple example — a support agent with CRM and email access:

Without a runtime decision layer: the customer email fires, possibly more than once.
With Cycles: the action is blocked before execution. The function never runs, and no emails go out.

Every email that fires is a customer commitment, a compliance exposure, or a support promise.

Demo here: https://github.com/runcycles/cycles-agent-action-authority-demo

Budget Limits for Claude Code, Cursor, and Windsurf via MCP

Albert Mavashev — Sun, 22 Mar 2026 12:00:00 +0000

A developer starts a Claude Code session to clean up an auth flow.

A few hours later, the agent has read a bunch of files, rewritten services, generated tests, retried a few times, and kept going longer than expected. The work may still be useful, but the bill is now much higher than planned.

That is the gap with coding agents in Claude Code, Cursor, and Windsurf.

They are designed for long, mostly unsupervised sessions. That is the benefit. It is also the risk.

Most of the time, there is no built-in way to say:

stop after this amount

Not after one more retry.

Not after one more tool call.

Just stop.

That is where MCP becomes useful.

Why MCP matters

MCP gives coding hosts a standard way to discover and call external tools.

That makes it a clean place to add budget enforcement without wrapping every model call in application code.

With Cycles, the idea is simple:

agent estimates the next step
Cycles reserves budget for it
action runs
actual usage is committed
unused budget is released

In other words:

estimate -> reserve -> execute -> commit/release

That is stronger than dashboards or alerts, because the decision happens before the next expensive step.

Why provider caps are not enough

Provider caps help, but they are usually the wrong layer.

They are often:

account-level, not session-level
vendor-specific, not workflow-wide
blind to tool calls and other side effects

What you actually want is a runtime question:

Is this session still allowed to take the next expensive step?

That is the question a budget authority should answer.

Thin MCP setup

For Claude Code:

claude mcp add cycles -- npx -y @runcycles/mcp-server

Then set the API key:

export CYCLES_API_KEY=cyc_live_...

For Cursor or Windsurf:

{
  "mcpServers": {
    "cycles": {
      "command": "npx",
      "args": ["-y", "@runcycles/mcp-server"],
      "env": {
        "CYCLES_API_KEY": "cyc_live_..."
      }
    }
  }
}

That is the appeal here: no SDK in the project, no custom wrapper around every call, no deep integration work.

Better than a kill switch

A useful budget control system should not only say yes or no.

It should be able to return:

ALLOW
ALLOW_WITH_CAPS
DENY

That middle state matters.

If budget is getting tight, the agent can finish in a constrained way instead of crashing abruptly. For example, it can reduce scope, use smaller token limits, skip expensive tools, and wrap up cleanly.
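As a rough illustration of how an agent might act on the three outcomes, here is a minimal sketch. The threshold rule in `decide` and the concrete caps in `plan_next_step` are assumptions made up for this example; they are not how Cycles computes its decision or what its cap fields look like.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "ALLOW"
    ALLOW_WITH_CAPS = "ALLOW_WITH_CAPS"
    DENY = "DENY"


def decide(remaining: int, estimate: int, low_water: int) -> Decision:
    """Hypothetical rule: deny if the estimate does not fit, cap if it would
    leave less than `low_water` units, otherwise allow."""
    if estimate > remaining:
        return Decision.DENY
    if remaining - estimate < low_water:
        return Decision.ALLOW_WITH_CAPS
    return Decision.ALLOW


def plan_next_step(decision: Decision) -> dict:
    """Translate a decision into (illustrative) limits for the next step."""
    if decision is Decision.ALLOW:
        return {"run": True, "max_tokens": 4096, "expensive_tools": True}
    if decision is Decision.ALLOW_WITH_CAPS:
        # Constrained mode: smaller outputs, skip costly tools, wrap up.
        return {"run": True, "max_tokens": 512, "expensive_tools": False}
    return {"run": False, "max_tokens": 0, "expensive_tools": False}
```

With a binary yes/no answer, the middle branch does not exist and the agent can only run at full cost or stop.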

Wrapper vs authority

You can absolutely vibe-code a lightweight wrapper that tracks spend.

That is fine for a demo.

But wrappers usually break where real control matters:

retries
concurrency
partial failures
non-LLM tool actions
inconsistent enforcement across hosts

A wrapper observes.

An authority decides whether the next action is allowed.

That is the real difference.

Closing

Coding agents are great at compressing lots of work into one session.

They are also very good at turning a simple task into many model calls, tool invocations, retries, and side effects before anyone notices.

Dashboards help.
Alerts help.
Provider caps help.

But they do not answer the only question that matters inside the loop:

should this agent be allowed to continue right now?

That is the role Cycles is trying to fill through MCP.

Original post: Budget Limits for Claude Code, Cursor, and Windsurf via MCP

Project: runcycles.io

How to Add Budget Control to a LangChain Agent

Albert Mavashev — Thu, 19 Mar 2026 18:52:40 +0000

LangChain makes it easy to build agents that call LLMs, search the web, execute code, and chain tool calls together. What it doesn't give you is any way to cap how much a single agent run is allowed to spend.

That's fine when you're experimenting. It's a real problem when you're running agents in production — especially across multiple users or tenants. A single misbehaving agent loop can burn through hundreds of dollars before anyone notices.

This guide shows how to add per-run budget control to a LangChain agent using Cycles — without rewriting your agent logic.

::: tip Already using the callback handler?
If you want per-LLM-call budget tracking (a reservation around every model invocation), see Integrating Cycles with LangChain. This guide covers a different pattern: a single reservation around the entire agent run, plus optional tool-level checks.
:::

The problem

Here's a typical LangChain agent loop:

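from langchain.agents import AgentExecutor, create_openai_functions_agent
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

@tool
def search_web(query: str) -> str:
    """Search the web for information."""
    return your_search_implementation(query)  # placeholder for your own search function

# The agent needs a prompt with an agent_scratchpad slot; this minimal one is illustrative
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful research assistant."),
    ("human", "{input}"),
    MessagesPlaceholder("agent_scratchpad"),
])

llm = ChatOpenAI(model="gpt-4o")
agent = create_openai_functions_agent(llm, [search_web], prompt)
executor = AgentExecutor(agent=agent, tools=[search_web])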

result = executor.invoke({"input": "Research the top 10 competitors..."})

This works. But there's no limit on how many LLM calls the agent makes, how many tool invocations it triggers, or what it costs. If the agent gets stuck in a loop, retries a failing tool, or expands scope unexpectedly, it keeps running — and spending — until it either finishes or hits the provider's rate limits.

The fix: reserve before, commit after

The pattern Cycles uses is borrowed from database transactions:

Reserve budget before the agent run starts
Execute the agent if the reservation is granted
Commit actual usage after — releases unused budget back to the pool
Release the full reservation if the run fails

This gives you hard limits that are enforced before spend happens — not discovered afterward on your bill.

Prerequisites

pip install runcycles langchain langchain-openai

export CYCLES_BASE_URL="http://localhost:7878"
export CYCLES_API_KEY="your-api-key"
export OPENAI_API_KEY="sk-..."

Need an API key? Create one via the Admin Server — see Deploy the Full Stack or API Key Management.

Per-run budget wrapper

Wrap your AgentExecutor invocation in a single Cycles reservation:

import uuid
from langchain.agents import AgentExecutor, create_openai_functions_agent
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool
from runcycles import (
    CyclesClient, CyclesConfig, ReservationCreateRequest,
    CommitRequest, ReleaseRequest, Subject, Action, Amount,
    Unit, BudgetExceededError, CyclesProtocolError,
)

client = CyclesClient(CyclesConfig.from_env())
# `prompt` is assumed to be defined elsewhere: a chat prompt with an agent_scratchpad placeholder

@tool
def search_web(query: str) -> str:
    """Search the web for information."""
    return your_search_implementation(query)

def run_agent_with_budget(
    user_input: str,
    tenant: str,
    budget_microcents: int,
) -> dict:
    key = str(uuid.uuid4())

    # 1. Reserve budget for the entire run
    res = client.create_reservation(ReservationCreateRequest(
        idempotency_key=key,
        subject=Subject(tenant=tenant, workflow="research"),
        action=Action(kind="agent.run", name="research-task"),
        estimate=Amount(unit=Unit.USD_MICROCENTS, amount=budget_microcents),  # 1 USD = 100_000_000 microcents
        ttl_ms=120_000,
    ))

    if not res.is_success:
        error = res.get_error_response()
        if error and error.error == "BUDGET_EXCEEDED":
            raise BudgetExceededError(
                error.message, status=res.status,
                error_code=error.error, request_id=error.request_id,
            )
        msg = error.message if error else (res.error_message or "Reservation failed")
        raise CyclesProtocolError(msg, status=res.status)

    reservation_id = res.get_body_attribute("reservation_id")
    decision = res.get_body_attribute("decision")

    # 2. Execute the agent — optionally downgrade if budget is tight
    try:
        if decision == "ALLOW_WITH_CAPS":
            llm = ChatOpenAI(model="gpt-4o-mini")
        else:
            llm = ChatOpenAI(model="gpt-4o")
        agent = create_openai_functions_agent(llm, [search_web], prompt)
        executor = AgentExecutor(
            agent=agent, tools=[search_web], max_iterations=10,
        )

        result = executor.invoke({"input": user_input})

        # 3. Commit actual usage
        client.commit_reservation(reservation_id, CommitRequest(
            idempotency_key=f"commit-{key}",
            actual=Amount(
                unit=Unit.USD_MICROCENTS,
                amount=budget_microcents // 2,  # replace with real tracking
            ),
        ))
        return result

    except Exception:
        # 4. Release on failure — budget returns to the pool
        client.release_reservation(
            reservation_id,
            ReleaseRequest(idempotency_key=f"release-{key}"),
        )
        raise

# Run it
result = run_agent_with_budget(
    user_input="Research the top 10 competitors in the CRM space",
    tenant="acme",
    budget_microcents=5_000_000_000,  # $50.00
)
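The commit step above uses `budget_microcents // 2` as a placeholder for real usage. One way to replace it is to convert measured token counts into microcents before committing. The sketch below assumes you already collect prompt and completion token counts for the run; the helper names and the per-million-token prices are hypothetical and should be replaced with your provider's actual rates.

```python
MICROCENTS_PER_USD = 100_000_000  # same conversion as in the example above


def usd_to_microcents(usd: float) -> int:
    """Convert a dollar amount to integer microcents."""
    return round(usd * MICROCENTS_PER_USD)


def llm_cost_microcents(
    prompt_tokens: int,
    completion_tokens: int,
    usd_per_million_prompt: float,      # assumed price, not a real rate card
    usd_per_million_completion: float,  # assumed price, not a real rate card
) -> int:
    """Cost of one or more LLM calls, given total token counts and prices."""
    usd = (
        prompt_tokens * usd_per_million_prompt
        + completion_tokens * usd_per_million_completion
    ) / 1_000_000
    return usd_to_microcents(usd)
```

The resulting integer could then be passed as the `amount` in the commit, so the difference between the reservation and the measured cost goes back to the pool.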

::: info Crash safety
If the agent crashes before committing or releasing, the reservation expires automatically after ttl_ms and the held budget returns to the pool. See TTL, Grace Period, and Extend.
:::
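To make the expiry behaviour concrete, here is a toy model of a ledger whose holds lapse after a TTL. It is a sketch under simplifying assumptions: `ExpiringLedger` is a hypothetical class, the clock is injected so the behaviour can be checked without sleeping, and any grace period or extend mechanism is not modelled.

```python
class ExpiringLedger:
    """Toy ledger (hypothetical) whose reservations expire after ttl_ms."""

    def __init__(self, remaining: int, clock):
        self.remaining = remaining
        self._clock = clock      # callable returning the current time in ms
        self._held = {}          # reservation_id -> (amount, expires_at_ms)
        self._next_id = 0

    def _expire(self) -> None:
        """Return every lapsed hold to the pool."""
        now = self._clock()
        for rid, (amount, expires_at) in list(self._held.items()):
            if expires_at <= now:
                del self._held[rid]
                self.remaining += amount

    def reserve(self, amount: int, ttl_ms: int):
        self._expire()
        if amount > self.remaining:
            return None
        self._next_id += 1
        rid = f"r{self._next_id}"
        self.remaining -= amount
        self._held[rid] = (amount, self._clock() + ttl_ms)
        return rid

    def commit(self, rid: str, actual: int) -> None:
        self._expire()
        if rid not in self._held:
            raise KeyError(f"reservation {rid} is unknown or expired")
        amount, _ = self._held.pop(rid)
        self.remaining += amount - min(actual, amount)
```

If the process holding a reservation dies, nothing has to clean up after it: the next operation after the deadline returns the hold to the pool.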

Adding tool-level budget checks

Individual tools can also reserve budget before costly operations. If the tool's reservation is denied, it returns a skip message instead of failing:

@tool
def search_web(query: str) -> str:
    """Search the web for information."""
    tool_key = str(uuid.uuid4())

    # Reserve before the tool call
    res = client.create_reservation(ReservationCreateRequest(
        idempotency_key=tool_key,
        subject=Subject(tenant="acme", toolset="web-search"),
        action=Action(kind="tool.call", name="search-web"),
        estimate=Amount(unit=Unit.USD_MICROCENTS, amount=100_000_000),  # $1.00
        ttl_ms=30_000,
    ))

    if not res.is_success:
        return "Budget exhausted — skipping web search."

    tool_reservation_id = res.get_body_attribute("reservation_id")

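    # Execute the tool; release the hold on failure so it is not stuck until the TTL expires
    try:
        results = your_search_implementation(query)
    except Exception:
        client.release_reservation(
            tool_reservation_id,
            ReleaseRequest(idempotency_key=f"release-{tool_key}"),
        )
        raise

    # Commit actual usage
    client.commit_reservation(tool_reservation_id, CommitRequest(
        idempotency_key=f"commit-{tool_key}",
        actual=Amount(unit=Unit.USD_MICROCENTS, amount=40_000_000),  # $0.40
    ))

    return results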

Multi-tenant scoping

Use the Subject hierarchy to give each customer their own budget scope:

def run_for_customer(customer_id: str, user_input: str):
    return run_agent_with_budget(
        user_input=user_input,
        tenant=customer_id,
        budget_microcents=10_000_000_000,  # $100.00, or pull from the customer's plan
    )

Each customer's spend is tracked independently. One customer burning through their budget doesn't affect others.

Graceful degradation with ALLOW_WITH_CAPS

When budget is running low, Cycles can return ALLOW_WITH_CAPS instead of a hard denial. Use the decision to switch to a cheaper model or limit tool access:

res = client.create_reservation(ReservationCreateRequest(
    idempotency_key=key,
    subject=Subject(tenant=tenant, workflow="research"),
    action=Action(kind="agent.run", name="research-task"),
    estimate=Amount(unit=Unit.USD_MICROCENTS, amount=5_000_000_000),  # $50.00
    ttl_ms=120_000,
))

if not res.is_success:
    # Budget denial arrives as a 409 BUDGET_EXCEEDED error — handle it here
    error = res.get_error_response()
    if error and error.error == "BUDGET_EXCEEDED":
        raise BudgetExceededError(
            error.message, status=res.status,
            error_code=error.error, request_id=error.request_id,
        )
    raise CyclesProtocolError(...)

# On success, decision is ALLOW or ALLOW_WITH_CAPS
decision = res.get_body_attribute("decision")

if decision == "ALLOW_WITH_CAPS":
    # Budget is tight — switch to a cheaper model
    llm = ChatOpenAI(model="gpt-4o-mini")
else:
    # ALLOW — full capacity
    llm = ChatOpenAI(model="gpt-4o")

See Caps and Three-Way Decisions for more on how ALLOW_WITH_CAPS works and what cap fields are available.

What you get

With this pattern in place:

Per-tenant isolation — Subject(tenant="acme") means each customer's budget is tracked and enforced independently
Graceful degradation — ALLOW_WITH_CAPS lets agents downgrade instead of stopping cold
Automatic reconciliation — committing less than the reserved amount releases the difference back to the pool
Crash safety — if the agent crashes before committing, the reservation expires automatically and budget is released

Next steps

Integrating Cycles with LangChain — per-LLM-call callback handler pattern
Reserve / Commit Lifecycle — protocol deep-dive
Degradation Paths — strategies for deny, downgrade, disable, or defer
Add to a Python App — Python client quickstart

I burned $153 in 30 minutes with an agent loop — here's the pattern that stopped it

Albert Mavashev — Wed, 18 Mar 2026 16:08:49 +0000

The incident

Short story: my agent kept retrying a failing loop across multiple models and tools.

My agent:
GPT-4o + Stable Diffusion + TradingView Charts + Kling.
Hundreds of iterations. $153 in under 30 minutes.

My enforcement layer did not work

Rate limiters — control velocity, not total cost.
Provider caps — per-provider, not cross-provider.
Observability — tells me after. I need before.

The pattern that worked for me: reserve-run-commit

My budget control in three steps:

Reserve the estimated cost before the call (instrumented code)
Execute the action (LLM call or tool call)
Settle: on success, commit the actual cost and release the unused portion; on failure, release the full reservation

The critical piece for retries: each reservation has an idempotency key. If the agent retries the exact same action, the second reservation is a no-op. Budget only gets locked once per logical action, not once per attempt.
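A minimal sketch of that retry behaviour, assuming the key is derived from the logical action rather than generated per attempt. `IdempotentLedger` and `logical_key` are hypothetical names for this example, not part of the Cycles SDK.

```python
class IdempotentLedger:
    """Toy ledger (hypothetical): one hold per idempotency key."""

    def __init__(self, remaining: int):
        self.remaining = remaining
        self._by_key = {}  # idempotency_key -> reservation_id

    def reserve(self, idempotency_key: str, amount: int):
        # A repeated key returns the existing reservation and locks nothing new.
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        if amount > self.remaining:
            return None
        rid = f"res-{len(self._by_key) + 1}"
        self.remaining -= amount
        self._by_key[idempotency_key] = rid
        return rid


def logical_key(run_id: str, step: int) -> str:
    """Stable key for one logical action: identical across retries of that step."""
    return f"{run_id}:step-{step}"
```

A fresh random UUID per attempt would defeat this: every retry would look like a new action and lock budget again.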

The critical piece for concurrent agents: the reservation is atomic. Two agents can't both check the balance, both see enough, and both proceed. One gets through, the other gets blocked.
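The difference between "check, then spend" and an atomic reservation can be shown in-process with a lock. This is only a sketch: `AtomicBudget` and `race` are hypothetical names, and in a real multi-agent deployment the atomicity has to live in the shared budget store, not in a lock inside one process.

```python
import threading


class AtomicBudget:
    """Check-and-decrement happens under one lock, so it cannot interleave."""

    def __init__(self, remaining: int):
        self._remaining = remaining
        self._lock = threading.Lock()

    def try_reserve(self, amount: int) -> bool:
        with self._lock:
            if amount > self._remaining:
                return False
            self._remaining -= amount
            return True

    @property
    def remaining(self) -> int:
        with self._lock:
            return self._remaining


def race(budget: AtomicBudget, amount: int, workers: int) -> list:
    """Start `workers` threads that all try to reserve at the same moment."""
    results = []
    results_lock = threading.Lock()
    barrier = threading.Barrier(workers)

    def worker():
        barrier.wait()                      # line everyone up first
        ok = budget.try_reserve(amount)
        with results_lock:
            results.append(ok)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With a budget of 100 and eight workers each asking for 60, exactly one reservation can succeed, however the threads are scheduled.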

What it looks like in code: the @cycles decorator

import openai
from runcycles import cycles

@cycles(estimate=5000, action_kind="llm.completion", action_name="openai:gpt-4o")
def ask(prompt: str) -> str:
    response = openai.chat.completions.create(...)
    return response.choices[0].message.content

The decorator handles the reserve-commit lifecycle. If the budget is gone before the call, it raises a BudgetExceededError and the call never fires. Nothing is billed.

Works with any LLM provider: OpenAI, Anthropic, Bedrock, or anything else.

The demo

I built a runaway agent demo that shows the failure mode in under 60 seconds — same agent, same bug, two outcomes. No API key needed to run it.

demo: https://github.com/runcycles/cycles-runaway-demo

What I built (Cycles Protocol + Reference implementation)

Full docs: https://runcycles.io
Self-hosted, Multi-language SDKs, Apache 2.0

What's your approach to agent cost control? Still rolling your own counters and limiters, nothing at all, or something else?