<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Magicrails</title>
    <description>The latest articles on DEV Community by Magicrails (@magicrails).</description>
    <link>https://dev.to/magicrails</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3905010%2Ff4051d2a-a0a7-4c6e-a3eb-48d0b5899912.png</url>
      <title>DEV Community: Magicrails</title>
      <link>https://dev.to/magicrails</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/magicrails"/>
    <language>en</language>
    <item>
      <title>I Let My AI Agent Run Overnight. It Cost $437.</title>
      <dc:creator>Magicrails</dc:creator>
      <pubDate>Wed, 29 Apr 2026 20:22:39 +0000</pubDate>
      <link>https://dev.to/magicrails/i-let-my-ai-agent-run-overnight-it-cost-437-dd7</link>
      <guid>https://dev.to/magicrails/i-let-my-ai-agent-run-overnight-it-cost-437-dd7</guid>
<description>
&lt;h2&gt;I Let My AI Agent Run Overnight. It Cost $437.&lt;/h2&gt;

&lt;p&gt;Last month I built a small agent. It was supposed to read through a directory of legal documents, summarize them, and file the summaries into a database. The whole thing was maybe 80 lines of LangChain plus a custom tool that looked at filesystem state.&lt;/p&gt;

&lt;p&gt;I tested it on five files. It worked. I queued up the full job — 1,200 documents — and went to bed.&lt;/p&gt;

&lt;p&gt;I woke up to a Slack DM from my CFO that I'd rather not paste in full. The short version: the agent had been calling the same tool — &lt;code&gt;list_files("/data")&lt;/code&gt; — over and over, getting the same 1,200 filenames back, and reasoning that maybe if it called it again it would notice something it had missed. It did not notice anything it had missed. It called it 14,000 times. Each call burned a few hundred tokens of context, plus the model's reasoning step around the tool call.&lt;/p&gt;

&lt;p&gt;By the time the rate limiter slowed it down and a token quota stopped it cold, it had spent $437. That averages out to roughly three cents a call, which only makes sense when you remember that an agent loop re-sends its entire accumulated history with every request, not just the few hundred tokens each call adds.&lt;/p&gt;

&lt;p&gt;I was lucky. The Anthropic dashboard showed the spike. The quota saved me from another zero. A friend at a different company had the same kind of bug and didn't notice for three days. His final bill was $5,200.&lt;/p&gt;

&lt;p&gt;This post is about why this happens, why nothing in the existing tooling catches it before it happens, and the very small library I ended up writing because I couldn't keep working without it.&lt;/p&gt;

&lt;h3&gt;The failure mode&lt;/h3&gt;

&lt;p&gt;The agent didn't crash. It didn't error. From its own perspective, it was doing exactly what it was told: "use the tools available to make progress on the task." It had a tool. It used the tool. The tool returned data. The data didn't unblock the task, so it tried again.&lt;/p&gt;

&lt;p&gt;There is no exception to catch when this happens. There is no log line that says "this is bad." The traces look normal — every individual call looks like every other normal call. The only signal is &lt;strong&gt;cost&lt;/strong&gt;, and cost is post-hoc.&lt;/p&gt;

&lt;p&gt;This is the central problem with autonomous agents in 2026: they don't fail loudly. They fail expensively.&lt;/p&gt;

&lt;h3&gt;Three patterns I've seen in the wild&lt;/h3&gt;

&lt;p&gt;After my own incident, I started asking around. Almost every agent developer I talked to had a story. The patterns clustered into three:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool-call loops.&lt;/strong&gt; The agent calls the same tool with the same arguments, gets the same answer, doesn't realize it's not making progress, tries again. Mine was this. The CrewAI examples I've seen in the wild often hit it on &lt;code&gt;web_search&lt;/code&gt; queries that return the same top-10 results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Budget runaway.&lt;/strong&gt; The agent isn't stuck per se — it's just doing work that costs more than the work was worth. A "summarize this PDF" call expands into thirty subagent calls because the planner LLM decided each section deserved its own dedicated agent with a 200k-token context window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reasoning loops.&lt;/strong&gt; The trickiest one. The agent's state — its memory, its plan, its scratchpad — stops changing across iterations. From the outside it's still calling tools, still emitting tokens. But internally, it's stuck on the same thought. I once watched an agent rewrite the same paragraph thirty times, each rewrite identical to the last, because its self-critique step kept producing the same critique.&lt;/p&gt;
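
&lt;p&gt;None of these needs exotic machinery to detect. Here's a toy version of the first and third checks (this is just the shape of the idea, not Magicrails internals; the names and thresholds are made up):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib
import json

class LoopWatch:
    """Toy detector for the first and third patterns above.
    Not Magicrails internals; names and thresholds are invented."""

    def __init__(self, max_repeats=3, stasis_steps=5):
        self.max_repeats = max_repeats
        self.stasis_steps = stasis_steps
        self.call_counts = {}        # (tool name, serialized args) maps to a count
        self.last_state_hash = None  # hash of the agent's plan/scratchpad
        self.stale_steps = 0

    def on_tool_call(self, tool_name, args):
        # Pattern 1: same tool, identical arguments, over and over.
        key = (tool_name, json.dumps(args, sort_keys=True))
        self.call_counts[key] = self.call_counts.get(key, 0) + 1
        if self.call_counts[key] &gt;= self.max_repeats:
            raise RuntimeError(f"{tool_name} repeated {self.call_counts[key]} times with identical args")

    def on_step(self, agent_state):
        # Pattern 3: still emitting tokens, but the state is frozen.
        h = hashlib.sha256(agent_state.encode()).hexdigest()
        self.stale_steps = self.stale_steps + 1 if h == self.last_state_hash else 0
        self.last_state_hash = h
        if self.stale_steps &gt;= self.stasis_steps:
            raise RuntimeError(f"agent state unchanged for {self.stale_steps} steps")
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;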

&lt;h3&gt;What existing tools catch&lt;/h3&gt;

&lt;p&gt;Almost nothing, in real time.&lt;/p&gt;

&lt;p&gt;The observability platforms — Langfuse, Arize Phoenix, AgentOps, LangSmith — are excellent at showing you what happened, after it happened. You open the trace, you see the loop, you say "ah." The dashboards are beautiful. They will tell you, in retrospect, exactly how you spent $437.&lt;/p&gt;

&lt;p&gt;Validation libraries — Guardrails AI, Pydantic-AI — check that the agent's output matches a schema. They don't watch the agent's behavior over time. A perfectly-formatted JSON output that arrives 14,000 times is still 14,000 wasted calls.&lt;/p&gt;

&lt;p&gt;Routing libraries — LiteLLM, OpenRouter — switch between models for cost reasons. Useful, but they don't stop a runaway. A loop on a cheap model is still a loop.&lt;/p&gt;

&lt;p&gt;Provider-side caps — the spending limits in the OpenAI dashboard, the Anthropic budget alerts — are account-wide and lagging. They cut you off after the limit is hit, sometimes hours later, and they don't distinguish a runaway from legitimate work.&lt;/p&gt;

&lt;p&gt;What's missing, I realized, is a brake. Something in-process that watches what the agent is doing and stops it before the bill arrives.&lt;/p&gt;

&lt;h3&gt;Magicrails&lt;/h3&gt;

&lt;p&gt;So I wrote one. It is genuinely small — a few hundred lines of pure Python, no dependencies — and it does three things:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;magicrails&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;guard&lt;/span&gt;

&lt;span class="nd"&gt;@guard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;budget_usd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_repeats&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stasis_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;my_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;budget_usd=10.0&lt;/code&gt; — track tokens against a built-in pricing table, halt at $10.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;max_repeats=3&lt;/code&gt; — if the same tool gets called with identical arguments three times in a window, halt.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;stasis_steps=5&lt;/code&gt; — if the hash of the agent's state hasn't changed in five iterations, halt.&lt;/p&gt;
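
&lt;p&gt;The budget check is the least subtle of the three. Conceptually it's multiplication against a price sheet; the prices in this toy are numbers I picked for the sketch, not the library's actual table:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}  # USD per million tokens; invented for the sketch

class BudgetWatch:
    def __init__(self, budget_usd):
        self.budget_usd = budget_usd
        self.spent_usd = 0.0

    def record(self, input_tokens, output_tokens):
        # Accumulate cost per call, then halt the moment the cap is crossed.
        self.spent_usd += (input_tokens * PRICE_PER_MTOK["input"] +
                           output_tokens * PRICE_PER_MTOK["output"]) / 1_000_000
        if self.spent_usd &gt;= self.budget_usd:
            raise RuntimeError(f"budget exceeded: ${self.spent_usd:.2f} of ${self.budget_usd:.2f}")
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;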

&lt;p&gt;When any of those trip, the default behavior is to raise a &lt;code&gt;TripError&lt;/code&gt; and exit cleanly. You can override that with a Slack webhook, a human-in-the-loop prompt, or any function you want.&lt;/p&gt;
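
&lt;p&gt;In the default configuration, that's ordinary exception handling (assuming here that &lt;code&gt;TripError&lt;/code&gt; is importable from the package top level):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from magicrails import guard, TripError  # assumes TripError is exported at the top level

@guard(budget_usd=10.0, max_repeats=3, stasis_steps=5)
def my_agent(task):
    ...  # your agent loop goes here

try:
    my_agent("summarize the documents in /data")
except TripError as exc:
    # The run halted before the bill ran away; exc says which check fired.
    print(f"agent halted: {exc}")  # or post to Slack, page a human, etc.
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;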

&lt;p&gt;It's framework-agnostic. There are adapters for the Anthropic and OpenAI Python SDKs that auto-instrument token counting; LangChain and CrewAI adapters are the next thing. Or you can use it without a framework at all — call &lt;code&gt;session.record_tokens(...)&lt;/code&gt; from wherever your inference happens.&lt;/p&gt;
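
&lt;p&gt;Sketching that raw path loosely against the Anthropic SDK (the &lt;code&gt;Session&lt;/code&gt; name and the keyword arguments here are illustrative; the README has the real signatures):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import anthropic

from magicrails import Session  # illustrative name; check the README for the real entry point

client = anthropic.Anthropic()
session = Session(budget_usd=10.0)  # assumed to take the same knobs as @guard

history = [{"role": "user", "content": "Summarize the documents in /data."}]
resp = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=history,
)

# Report usage after every inference call; a tripped check raises from here.
# The keyword names are illustrative too; the real arguments may differ.
session.record_tokens(input_tokens=resp.usage.input_tokens,
                      output_tokens=resp.usage.output_tokens)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;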

&lt;p&gt;It's not an observability platform. It's not a tracing tool. It's not a guardrails framework in the validation sense. It does one thing: stop an agent that is about to cost you money or time you cannot get back.&lt;/p&gt;

&lt;p&gt;The library is at &lt;a href="https://github.com/magicrails/magicrails" rel="noopener noreferrer"&gt;github.com/magicrails/magicrails&lt;/a&gt; and &lt;code&gt;pip install magicrails&lt;/code&gt;. It's MIT-licensed.&lt;/p&gt;

&lt;h3&gt;What I'd do differently&lt;/h3&gt;

&lt;p&gt;If I'd had this two months ago, the bill would have been $9.87 instead of $437. I would not have written this essay. The friend with the $5,200 bill would have found out at $20.&lt;/p&gt;

&lt;p&gt;Build agents. Run them overnight. But install a brake first. The agent isn't going to install one for you.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Source: &lt;a href="https://github.com/magicrails/magicrails" rel="noopener noreferrer"&gt;github.com/magicrails/magicrails&lt;/a&gt;. Discussion: [HN thread link]&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>devjournal</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
