Your AI agents aren't just unreliable — they're expensive, and you can't see it in the logs

#ai #llm #opensource #python

Most reliability tooling answers one question: "did it work?"

Almost nobody asks the more expensive one: "what did it cost me to work?"

Here's the failure mode that keeps showing up. An agent enters a loop — calls a tool, gets a result, calls again — and the exit condition never trips. Nothing crashes. No exception is thrown. The pipeline looks healthy in every dashboard you have. Then the invoice arrives, and you discover one workflow quietly burned through your token budget for a week.

These aren't reasoning failures. They're structural. And structural cost is visible in the graph topology before you ever run the system, if you know where to look.

The patterns that quietly burn tokens

Three structures account for most of the silent spend:

Unbounded loops. A directed cycle with no exit edge. Once execution enters it, there's no topological way out — it re-invokes the agents in the cycle indefinitely, and every lap costs tokens with no error to stop it.

Retry-prone fragile dependencies. An agent with exactly one upstream and no fallback. When that upstream fails or returns a malformed payload, every retry re-spends the upstream's tokens on top of the downstream's, so total cost grows with each attempt: three attempts cost roughly three times the tokens of one.

Long critical paths. A deep linear chain where every request pays the full token cost of every hop, and a failure deep in the chain re-runs the upstream work that led there.

None of these throw errors. All of them show up on the bill.
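
To make these three patterns concrete, here is a minimal sketch of how they could be checked on a plain adjacency-list graph. It is not swarm-test's implementation. The function names (`unbounded_loops`, `fragile_agents`, `critical_path_length`) and the example graphs are assumptions for illustration, and "fragile" is approximated as "exactly one upstream" because fallbacks are not modeled here.

```python
from typing import Dict, List, Set

Graph = Dict[str, List[str]]  # agent name -> downstream agent names


def strongly_connected_components(graph: Graph) -> List[Set[str]]:
    """Tarjan's algorithm (recursive, fine for small agent graphs)."""
    index: Dict[str, int] = {}
    low: Dict[str, int] = {}
    stack: List[str] = []
    on_stack: Set[str] = set()
    components: List[Set[str]] = []

    def visit(node: str) -> None:
        index[node] = low[node] = len(index)
        stack.append(node)
        on_stack.add(node)
        for nxt in graph.get(node, []):
            if nxt not in index:
                visit(nxt)
                low[node] = min(low[node], low[nxt])
            elif nxt in on_stack:
                low[node] = min(low[node], index[nxt])
        if low[node] == index[node]:  # node is the root of a component
            comp: Set[str] = set()
            while True:
                member = stack.pop()
                on_stack.discard(member)
                comp.add(member)
                if member == node:
                    break
            components.append(comp)

    for node in list(graph):
        if node not in index:
            visit(node)
    return components


def unbounded_loops(graph: Graph) -> List[Set[str]]:
    """Cycles with no edge leading out of the cycle."""
    loops = []
    for comp in strongly_connected_components(graph):
        is_cycle = len(comp) > 1 or any(n in graph.get(n, []) for n in comp)
        has_exit = any(t not in comp for n in comp for t in graph.get(n, []))
        if is_cycle and not has_exit:
            loops.append(comp)
    return loops


def fragile_agents(graph: Graph) -> List[str]:
    """Agents with exactly one upstream (assumed proxy for 'no fallback')."""
    upstream: Dict[str, Set[str]] = {}
    for src, targets in graph.items():
        for target in targets:
            upstream.setdefault(target, set()).add(src)
    return sorted(n for n, ups in upstream.items() if len(ups) == 1)


def critical_path_length(graph: Graph) -> int:
    """Agents on the longest path. Only valid for acyclic graphs."""
    memo: Dict[str, int] = {}

    def depth(node: str) -> int:
        if node not in memo:
            memo[node] = 1 + max((depth(t) for t in graph.get(node, [])), default=0)
        return memo[node]

    return max((depth(n) for n in graph), default=0)


# Hypothetical example: Planner -> Worker -> Reviewer -> Planner, with no way out.
looping = {
    "Intake": ["Planner"],
    "Planner": ["Worker"],
    "Worker": ["Reviewer"],
    "Reviewer": ["Planner"],
}
# Same graph with a terminating edge out of the cycle.
bounded = {**looping, "Reviewer": ["Planner", "Publisher"]}
# A four-hop linear chain.
chain = {"A": ["B"], "B": ["C"], "C": ["D"]}

print([sorted(loop) for loop in unbounded_loops(looping)])
print(fragile_agents(looping))
print(unbounded_loops(bounded))
print(critical_path_length(chain))
```

The loop check works on strongly connected components rather than individual cycles, so a cluster of overlapping cycles is reported once, and a single exit edge anywhere in the cluster is enough to clear it.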

Scoring cost risk statically

I've been building swarm-test — open-source static reliability testing for multi-agent systems (CrewAI, LangGraph, AutoGen, or a generic graph). It models your agents as a directed graph and analyzes the topology — no live LLM calls, so it runs in milliseconds at zero API cost.

The latest addition is a Cost Risk score: it ranks the structural patterns most likely to waste tokens and assigns a 0–100 score with a severity band (for example, SEVERE), alongside the existing reliability score.

A system with an unbounded loop, for example, surfaces like this:

Swarm Score: 0/100 — CRITICAL
Cost Risk: 75/100 — SEVERE (1 unbounded loop, 3 retry-prone paths)

And the finding itself is specific and actionable:

CRITICAL — unbounded loop (Planner, Reviewer, Worker) can re-invoke
3 agents indefinitely. Every cycle spends tokens with no error to stop it.
→ Add a max-iteration cap to bound worst-case token spend, or add a
terminating edge that routes execution outside the cycle.
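
The first remedy in that finding, a max-iteration cap, can be sketched in a few lines. This is a generic guard under stated assumptions, not code from swarm-test or any agent framework: `run_bounded`, `IterationBudgetExceeded`, and the `step` callable (which returns the new state plus a "done" flag) are hypothetical names.

```python
from typing import Callable, Tuple, TypeVar

State = TypeVar("State")


class IterationBudgetExceeded(RuntimeError):
    """Raised when a loop hits its cap without reaching an exit condition."""


def run_bounded(
    step: Callable[[State], Tuple[State, bool]],
    state: State,
    max_iterations: int = 5,
) -> Tuple[State, int]:
    """Run `step` until it reports done, or fail loudly at the cap."""
    for iteration in range(1, max_iterations + 1):
        state, done = step(state)
        if done:
            return state, iteration
    raise IterationBudgetExceeded(
        f"no exit after {max_iterations} iterations; stopping to bound token spend"
    )


# Hypothetical step: the "review" passes on the second lap.
def review_step(laps: int) -> Tuple[int, bool]:
    laps += 1
    return laps, laps >= 2


# Hypothetical step whose exit condition never trips.
def never_done(laps: int) -> Tuple[int, bool]:
    return laps + 1, False


print(run_bounded(review_step, 0))  # → (2, 2)

try:
    run_bounded(never_done, 0, max_iterations=3)
except IterationBudgetExceeded as err:
    print(err)
```

The point of raising instead of returning quietly is that the runaway case becomes an error you can see in logs and alerts, with worst-case spend bounded at `max_iterations` laps. Where your agent framework exposes its own iteration or recursion limit, that is usually the better place to set the cap.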

One honest boundary worth stating: this is a structural estimate from graph topology. It tells you where the architecture allows runaway cost; it does not measure real dollars from a real run. Run-level measurement needs execution data. But narrowing "where could this bleed money" down to the exact edges, before you deploy, tells you where to put the caps and fallbacks.

Try it

pip install swarm-test
swarm-test run my_crew.py --open

GitHub: https://github.com/surajkumar811/swarm-test

If you run agents in production: what's the worst "silent cost" failure you've hit — the one that didn't break anything, just quietly ran up the bill? That's the boundary I'm trying to map, and the real-world cases are more instructive than any synthetic example.
