<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: amabito</title>
    <description>The latest articles on DEV Community by amabito (@amabito).</description>
    <link>https://dev.to/amabito</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3786711%2Ffc780c0a-3824-4907-960c-13444c988bef.png</url>
      <title>DEV Community: amabito</title>
      <link>https://dev.to/amabito</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/amabito"/>
    <language>en</language>
    <item>
      <title>The $0.64 bug: how nested retries silently multiply your LLM costs</title>
      <dc:creator>amabito</dc:creator>
      <pubDate>Mon, 09 Mar 2026 09:08:12 +0000</pubDate>
      <link>https://dev.to/amabito/the-064-bug-how-nested-retries-silently-multiply-your-llm-costs-3g0p</link>
      <guid>https://dev.to/amabito/the-064-bug-how-nested-retries-silently-multiply-your-llm-costs-3g0p</guid>
      <description>&lt;p&gt;One user click. One document. My LangChain agent made 64 API calls to GPT-4o before it finally returned a result.&lt;/p&gt;

&lt;p&gt;At typical GPT-4o pricing, that turns a one-cent task into a sixty-four-cent task. With longer prompts, worse.&lt;/p&gt;

&lt;p&gt;The agent wasn't broken. The bug was in how the retries &lt;em&gt;multiply&lt;/em&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem: retries stack, and nobody tracks the total
&lt;/h2&gt;

&lt;p&gt;This pattern shows up in most LLM agent stacks I've looked at:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Your application code        retries 3 times on failure
  calls a LangChain chain    retries 3 times on failure
    which calls a tool        retries 3 times on failure
      which calls the LLM API
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each layer is reasonable on its own. 3 retries is a perfectly normal default.&lt;/p&gt;

&lt;p&gt;When the LLM returns a transient error, the innermost layer retries 3 times and gives up. The middle layer sees that failure and retries -- and each of its retries re-runs the whole inner sequence. The outer layer does the same to the middle one.&lt;/p&gt;

&lt;p&gt;Worst case: &lt;strong&gt;4 x 4 x 4 = 64 API calls from a single user action.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;(Each layer makes 1 initial attempt + 3 retries = 4 attempts. Three layers: &lt;code&gt;4^3 = 64&lt;/code&gt;.)&lt;/p&gt;

&lt;p&gt;Nobody in the stack tracks the &lt;em&gt;total&lt;/em&gt; retry count. Each layer only knows about its own attempts. I built &lt;a href="https://github.com/amabito/veronica-core" rel="noopener noreferrer"&gt;veronica-core&lt;/a&gt; to fix this -- a run-level budget that sits across all layers.&lt;/p&gt;

&lt;h2&gt;
  
  
  The math
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Retry layers&lt;/th&gt;
&lt;th&gt;Retries per layer&lt;/th&gt;
&lt;th&gt;Worst-case calls&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;256&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;216&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is exponential, not linear. Adding one more retry layer doesn't add 3 calls -- it multiplies the total by 4.&lt;/p&gt;
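&lt;p&gt;The table is a single formula. A quick sanity check, using the same convention as above (1 initial attempt + N retries per layer):&lt;/p&gt;

```python
def worst_case_calls(layers, retries_per_layer):
    # Each layer makes 1 initial attempt + N retries = N + 1 attempts.
    # Layers multiply: (N + 1) ** layers API calls in the worst case.
    return (retries_per_layer + 1) ** layers

print(worst_case_calls(2, 3))  # 16
print(worst_case_calls(3, 3))  # 64
print(worst_case_calls(4, 3))  # 256
print(worst_case_calls(3, 5))  # 216
```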

&lt;h2&gt;
  
  
  What this costs
&lt;/h2&gt;

&lt;p&gt;GPT-4o at $2.50/1M input + $10.00/1M output tokens. A typical 2K-token agent step with a 500-token response costs about $0.01.&lt;/p&gt;
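&lt;p&gt;That per-call figure is straightforward token math:&lt;/p&gt;

```python
INPUT_PRICE_PER_M = 2.50    # GPT-4o input, USD per 1M tokens
OUTPUT_PRICE_PER_M = 10.00  # GPT-4o output, USD per 1M tokens

def call_cost(input_tokens, output_tokens):
    # Cost of a single API call at the prices above.
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000

cost = call_cost(2_000, 500)   # a typical agent step
print(round(cost, 4))          # 0.01
print(round(cost * 64, 2))     # 0.64 -- the 3-layer worst case
```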

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Calls&lt;/th&gt;
&lt;th&gt;Cost per request&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;No retries needed&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;$0.01&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3-layer retry, worst case&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;$0.64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4-layer retry, worst case&lt;/td&gt;
&lt;td&gt;256&lt;/td&gt;
&lt;td&gt;$2.56&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1000 users/day, 3-layer worst case&lt;/td&gt;
&lt;td&gt;64,000&lt;/td&gt;
&lt;td&gt;$640/day&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Most of the time you won't hit worst case. But you will hit partial amplification regularly -- 8-12 calls where 1-2 would suffice. That's a steady 4-6x cost multiplier that shows up as "the API is expensive" rather than "our retry logic is broken."&lt;/p&gt;

&lt;h2&gt;
  
  
  Why &lt;code&gt;max_iterations&lt;/code&gt; doesn't fix this
&lt;/h2&gt;

&lt;p&gt;Most agent frameworks have some form of step or iteration limit. LangChain has &lt;code&gt;max_iterations&lt;/code&gt;, others have conversation turn caps or loop counters. These limit how many &lt;em&gt;steps&lt;/em&gt; your agent takes, not how many API calls happen underneath.&lt;/p&gt;

&lt;p&gt;If an agent has &lt;code&gt;max_iterations=10&lt;/code&gt; and each iteration retries 3 times internally, you can still get 40 API calls. The step counter doesn't see the retries.&lt;/p&gt;

&lt;p&gt;These are step limits, not cost limits. None of them track how much money the run has spent.&lt;/p&gt;
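&lt;p&gt;A toy illustration of the blind spot -- not LangChain itself, just the shape of the problem. The step counter caps iterations while an inner retry loop quietly multiplies calls underneath it:&lt;/p&gt;

```python
api_calls = 0

def flaky_llm_call():
    # Always fails, to show the worst case.
    global api_calls
    api_calls += 1
    raise RuntimeError("API timeout")

def step_with_retries(max_retries=3):
    # 1 initial attempt + max_retries retries per agent step.
    for attempt in range(max_retries + 1):
        try:
            return flaky_llm_call()
        except RuntimeError:
            pass  # give up silently and let the agent take its next step

max_iterations = 10          # the only limit the framework sees
for _ in range(max_iterations):
    step_with_retries()

print(api_calls)  # 40 -- the step counter saw 10, the API saw 40
```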

&lt;h2&gt;
  
  
  Before: no containment
&lt;/h2&gt;

&lt;p&gt;This example is intentionally simplified. In real LangChain stacks, the same multiplication usually comes from a mix of provider retries, &lt;code&gt;tenacity&lt;/code&gt; decorators, tool wrappers, and chain-level retries -- which makes it hard to spot by reading any single file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Calls the LLM API. Fails sometimes.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;API timeout&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;inner_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt;
            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chain_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;inner_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt;
            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;agent_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;chain_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt;
            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# One user click. Up to 64 API calls. No total limit.
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;agent_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize this document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every layer is doing the right thing locally. But nobody tracks the total. No budget, no circuit breaker. If the API goes down for 30 seconds, this burns through 64 calls before giving up.&lt;/p&gt;
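&lt;p&gt;You can watch the multiplication happen by instrumenting a stripped-down version of the layers above with a call counter and an always-failing stub:&lt;/p&gt;

```python
calls = 0

def failing_llm():
    global calls
    calls += 1
    raise RuntimeError("API timeout")

def with_retries(fn, max_retries=3):
    # 1 initial attempt + max_retries retries, like each layer above.
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RuntimeError:
            if attempt == max_retries:
                raise

def tool_layer():
    return with_retries(failing_llm)

def chain_layer():
    return with_retries(tool_layer)

try:
    with_retries(chain_layer)   # application layer
except RuntimeError:
    pass

print(calls)  # 64
```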

&lt;h2&gt;
  
  
  After: chain-level containment
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;veronica_core.containment&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ExecutionContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ExecutionConfig&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;veronica_core.shield.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Decision&lt;/span&gt;

&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ExecutionConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;max_cost_usd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;# Hard dollar ceiling for this run
&lt;/span&gt;    &lt;span class="n"&gt;max_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="c1"&gt;# Max successful operations
&lt;/span&gt;    &lt;span class="n"&gt;max_retries_total&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;# Total retries across ALL layers
&lt;/span&gt;    &lt;span class="n"&gt;timeout_ms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;# 30-second wall clock limit
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;summarize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;ExecutionContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Returns Decision.HALT if any limit is breached,
&lt;/span&gt;        &lt;span class="c1"&gt;# or the return value of call_llm() on success.
&lt;/span&gt;        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wrap_llm_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;Decision&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HALT&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Limit breached. The LLM call was not dispatched,
&lt;/span&gt;            &lt;span class="c1"&gt;# so this blocked attempt adds no API cost.
&lt;/span&gt;            &lt;span class="n"&gt;snapshot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_snapshot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;degraded&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;abort_reason&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cost_usd_accumulated&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ok&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What this does:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;max_cost_usd=0.10&lt;/code&gt; -- the entire agent run cannot spend more than 10 cents, regardless of how many layers retry.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;max_retries_total=5&lt;/code&gt; -- total retries across all layers combined. Not per layer. Chain-wide.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;max_steps=20&lt;/code&gt; -- total successful API calls. Prevents infinite tool loops.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;timeout_ms=30_000&lt;/code&gt; -- wall-clock hard stop after 30 seconds.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When any limit is hit, &lt;code&gt;wrap_llm_call()&lt;/code&gt; returns &lt;code&gt;Decision.HALT&lt;/code&gt; &lt;strong&gt;without dispatching the LLM call&lt;/strong&gt;. The blocked attempt itself adds no API cost.&lt;/p&gt;
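&lt;p&gt;If you want the core idea without the library, a run-level retry budget is conceptually just one counter shared by every layer. A minimal sketch (my simplification, not veronica-core's implementation):&lt;/p&gt;

```python
class RetryBudget:
    """One retry counter for the whole run, checked by every layer."""
    def __init__(self, max_retries_total=5):
        self.max_retries_total = max_retries_total
        self.retries_used = 0

    def allow_retry(self):
        if self.retries_used == self.max_retries_total:
            return False  # budget exhausted: stop retrying at every layer
        self.retries_used += 1
        return True

calls = 0

def failing_llm():
    global calls
    calls += 1
    raise RuntimeError("API timeout")

def layer(fn, budget):
    # Retries only while the *shared* budget allows it.
    while True:
        try:
            return fn()
        except RuntimeError:
            if not budget.allow_retry():
                raise

budget = RetryBudget(max_retries_total=5)

def tool():
    return layer(failing_llm, budget)

def chain():
    return layer(tool, budget)

try:
    layer(chain, budget)
except RuntimeError:
    pass

print(calls)  # 6 -- 1 initial attempt + 5 budgeted retries, chain-wide
```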

&lt;p&gt;On a stubbed call path (&lt;code&gt;benchmarks/bench_baseline_comparison.py&lt;/code&gt; in the repo), the full policy check averages around 11 microseconds. Typical LLM calls take 500-5000ms, so the containment overhead is negligible in practice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Before/after
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Worst-case calls (3-layer retry)&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;6 (1 + 5 retries)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost ceiling&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;$0.10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total retry tracking&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wall-clock timeout&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Behavior when API is down&lt;/td&gt;
&lt;td&gt;Burns 64 calls, then fails&lt;/td&gt;
&lt;td&gt;Burns 5 retries, then stops&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code changes to agent logic&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The agent logic doesn't change. &lt;code&gt;ExecutionContext&lt;/code&gt; wraps the calls from the outside. Your retries still work -- they just can't exceed the chain budget.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this does not do
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;veronica-core&lt;/code&gt; is a cost and execution control library. It is not:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;An output validator.&lt;/strong&gt; It doesn't check what the LLM says. Use Guardrails AI or NeMo Guardrails for that.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A content filter.&lt;/strong&gt; It doesn't block harmful outputs. That's a different problem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A prompt engineering tool.&lt;/strong&gt; It doesn't modify your prompts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A framework.&lt;/strong&gt; It wraps your existing LLM calls. It doesn't replace your agent framework or custom loop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A latency optimizer.&lt;/strong&gt; It doesn't make calls faster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A fix for bad prompts.&lt;/strong&gt; If your agent loops because the prompt is wrong, that's a prompt problem. This just caps the damage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It controls &lt;em&gt;how many times&lt;/em&gt; your agent calls the API and &lt;em&gt;how much money&lt;/em&gt; it spends. That's it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;veronica-core
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Python 3.10+. No required dependencies beyond the standard library.&lt;/p&gt;

&lt;p&gt;Optional extras:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;veronica-core[redis]   &lt;span class="c"&gt;# Distributed budget tracking (multi-process)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Source: &lt;a href="https://github.com/amabito/veronica-core" rel="noopener noreferrer"&gt;github.com/amabito/veronica-core&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There's also a &lt;code&gt;BudgetEnforcer&lt;/code&gt; for standalone budget tracking, a &lt;code&gt;CircuitBreaker&lt;/code&gt; for failure isolation, and ASGI/WSGI middleware if you want per-request containment in a web app. The retry amplification example above is probably the simplest place to start.&lt;/p&gt;

&lt;h2&gt;
  
  
  You can reproduce the benchmark
&lt;/h2&gt;

&lt;p&gt;The benchmark script is in the repo. It uses stub LLM implementations -- no API keys, no network calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/amabito/veronica-core
&lt;span class="nb"&gt;cd &lt;/span&gt;veronica-core
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
python benchmarks/bench_retry_amplification.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The article uses the 64-call example because it matches the common "3 retries per layer" mental model: each layer makes 1 initial attempt + 3 retries = 4 attempts, so &lt;code&gt;4^3 = 64&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The benchmark in the repo uses a simpler always-failing stub with a &lt;code&gt;3 x 3 x 3&lt;/code&gt; retry loop, which produces 27 baseline calls. Same bug, different retry convention. The benchmark shows those 27 calls reduced to 3 contained calls with &lt;code&gt;max_retries_total=5&lt;/code&gt;.&lt;/p&gt;




&lt;p&gt;Retry amplification is not a new idea. What's missing in most LLM stacks is a hard budget that applies to the entire run, not just one call at a time.&lt;/p&gt;

&lt;p&gt;If you want to see the failure mode without spending real money, run &lt;code&gt;python benchmarks/bench_retry_amplification.py&lt;/code&gt; in the repo. No API key, no network calls, and it makes the bug obvious in a few seconds.&lt;/p&gt;

</description>
      <category>python</category>
      <category>llm</category>
      <category>langchain</category>
      <category>ai</category>
    </item>
    <item>
      <title>The $12K Weekend: What Nobody Tells You About LLM Agents in Production</title>
      <dc:creator>amabito</dc:creator>
      <pubDate>Mon, 23 Feb 2026 13:40:10 +0000</pubDate>
      <link>https://dev.to/amabito/the-12k-weekend-what-nobody-tells-you-about-llm-agents-in-production-2li3</link>
      <guid>https://dev.to/amabito/the-12k-weekend-what-nobody-tells-you-about-llm-agents-in-production-2li3</guid>
      <description>&lt;p&gt;An autonomous agent ran over a weekend. By Monday it had made 47,000 API calls.&lt;/p&gt;

&lt;p&gt;No one set a budget ceiling. No one enforced a retry limit. The agent hit a transient API error, retried, hit another, retried again — and kept going for 60 hours because nothing told it to stop.&lt;/p&gt;

&lt;p&gt;I spent the first hour convinced it was a billing bug.&lt;/p&gt;

&lt;p&gt;This isn't a one-off. Simon Willison has documented the pattern. The r/MachineLearning thread from January had 800 upvotes. The numbers vary — $3K, $8K, $12K — but the shape is always the same: retry loop, no ceiling, nobody home.&lt;/p&gt;




&lt;h2&gt;
  
  
  We tried the obvious things first
&lt;/h2&gt;

&lt;p&gt;The instinct is better observability. Set up cost alerts. Wire up a dashboard. These are good things — I'm not arguing against them.&lt;/p&gt;

&lt;p&gt;But an alert fires &lt;em&gt;after&lt;/em&gt; the call happens. The call has already consumed tokens. The money is already spent. You're getting a notification about something that's over.&lt;/p&gt;

&lt;p&gt;Then we tried retry libraries: &lt;code&gt;tenacity&lt;/code&gt;, &lt;code&gt;backoff&lt;/code&gt;. These handle transient failures fine — if a call fails, wait and retry. The problem is they have no concept of a dollar ceiling. And if your process crashes mid-run and auto-recovers, the retry counter resets to zero. You're back to the beginning.&lt;/p&gt;

&lt;p&gt;We spent two weeks on circuit breakers, which felt clever for a while. Trip the breaker, stop the runaway, done. Except: the breaker lives in process memory. Process dies, auto-recovery kicks in, breaker is gone. We'd solved the problem for the happy path and nothing else.&lt;/p&gt;

&lt;p&gt;Provider spend limits have a different issue — they're per-account, not per call chain. They don't propagate across sub-agents. Agent A has a $1.00 limit and spawns Agent B, which independently racks up $8. The provider limit never triggers because $9 total is nowhere near your account ceiling. Agent A never knew.&lt;/p&gt;

&lt;p&gt;The gap isn't subtle once you see it: nothing enforces bounded execution &lt;em&gt;before the call happens&lt;/em&gt;, in a way that survives process restarts.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this is harder than it sounds
&lt;/h2&gt;

&lt;p&gt;LLM agents are probabilistic, cost-generating components inside systems expected to behave reliably. That's a hard contradiction and it doesn't resolve through better prompts or more careful orchestration — those operate at the wrong layer.&lt;/p&gt;

&lt;p&gt;The analogy I kept coming back to (and resisting, because it sounds grandiose) is operating systems. An OS doesn't know what your application is &lt;em&gt;doing&lt;/em&gt;. It just enforces the resource contract: this process gets X memory, Y CPU time, and when it's done, it's done. If the process tries to take more, the OS says no.&lt;/p&gt;

&lt;p&gt;LLM systems don't have that. Every agent is running on the honor system.&lt;/p&gt;

&lt;p&gt;What you actually need is something that enforces, at call time: this chain can spend at most $X, run for at most N steps, and if I crash and restart, those limits are still in effect. If I spawn a sub-agent, its costs count against my limit — not just its own.&lt;/p&gt;
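&lt;p&gt;In code, that contract looks something like this — a sketch of the idea, not the actual library API, with illustrative names throughout. The key property is that a child's spending debits every ancestor, so the parent's ceiling holds across sub-agents:&lt;/p&gt;

```python
class RunBudget:
    """Hierarchical spend ceiling: a child's spending also debits its parent."""
    def __init__(self, max_cost_usd, parent=None):
        self.max_cost_usd = max_cost_usd
        self.spent = 0.0
        self.parent = parent

    def try_spend(self, cost):
        # Check the whole ancestor chain first: a child can be under its
        # own ceiling while some parent is already exhausted.
        node = self
        while node is not None:
            if node.spent + cost > node.max_cost_usd:
                return False  # an ancestor would be over budget: refuse
            node = node.parent
        # All ceilings hold: record the spend at every level.
        node = self
        while node is not None:
            node.spent += cost
            node = node.parent
        return True

parent = RunBudget(max_cost_usd=1.00)
child = RunBudget(max_cost_usd=0.50, parent=parent)

print(child.try_spend(0.40))   # True  -- child at $0.40, parent at $0.40
print(parent.try_spend(0.55))  # True  -- parent at $0.95
print(child.try_spend(0.10))   # False -- child is fine, parent would pass $1.00
```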




&lt;h2&gt;
  
  
  What we built
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;veronica_core.containment&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ExecutionConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ExecutionContext&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;ExecutionContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ExecutionConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;max_cost_usd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_retries_total&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;timeout_ms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;decision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wrap_llm_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;my_agent_step&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Returns Decision.HALT if any limit exceeded
&lt;/span&gt;    &lt;span class="c1"&gt;# fn is never called when halted — no network request, no spend
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;wrap_llm_call&lt;/code&gt; checks the budget &lt;em&gt;before&lt;/em&gt; calling &lt;code&gt;fn&lt;/code&gt;. If the ceiling is hit, it returns &lt;code&gt;Decision.HALT&lt;/code&gt; and never makes the network request. Nothing gets spent.&lt;/p&gt;
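&lt;p&gt;The shape of that check is easy to sketch in isolation. This isn't the library's internals, just the check-before-call pattern it implements; the names and cost parameters here are illustrative:&lt;/p&gt;

```python
from enum import Enum

class Decision(Enum):
    PROCEED = "proceed"
    HALT = "halt"

def wrap_call(fn, spent_usd, max_cost_usd, estimated_cost_usd):
    # The ceiling is tested before fn runs. A refused call makes
    # no network request and spends nothing.
    if spent_usd + estimated_cost_usd > max_cost_usd:
        return Decision.HALT, None
    return Decision.PROCEED, fn()

# $0.95 already spent against a $1.00 ceiling: a $0.10 call is refused
decision, result = wrap_call(lambda: "response", 0.95, 1.00, 0.10)
```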

&lt;p&gt;The multi-agent case is where this gets genuinely useful:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;ExecutionContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ExecutionConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_cost_usd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.00&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;spawn_child&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_cost_usd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;child&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;decision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;child&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wrap_llm_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sub_agent_step&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# child spend counts against parent's $1.00
&lt;/span&gt;        &lt;span class="c1"&gt;# parent halts if cumulative cost exceeds $1.00
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Agent B has its own $0.50 sub-limit. But whatever B spends also comes off A's $1.00. A halts before the chain blows past $1.00 through a path nobody was watching.&lt;/p&gt;
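&lt;p&gt;The accounting behind that is a tree walk: a charge bubbles up through every ancestor, and any node over its ceiling halts. A toy model of the semantics (illustrative names, not the library's implementation):&lt;/p&gt;

```python
class BudgetNode:
    def __init__(self, max_cost_usd, parent=None):
        self.max_cost_usd = max_cost_usd
        self.parent = parent
        self.spent = 0.0

    def charge(self, cost):
        # Walk from this node up to the root; every ancestor sees the spend.
        node = self
        while node is not None:
            node.spent += cost
            if node.spent > node.max_cost_usd:
                return "HALT"
            node = node.parent
        return "PROCEED"

parent = BudgetNode(1.00)
child = BudgetNode(0.50, parent=parent)
child.charge(0.40)   # child at 0.40/0.50, parent at 0.40/1.00
child.charge(0.20)   # child would hit 0.60, over its 0.50 sub-limit: halts
```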

&lt;p&gt;&lt;strong&gt;The halt state problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We solved the circuit breaker issue (halt state disappearing on restart) with atomic disk writes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;veronica_core&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;VeronicaIntegration&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;veronica_core.state&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;VeronicaState&lt;/span&gt;

&lt;span class="n"&gt;veronica&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;VeronicaIntegration&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;veronica&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;VeronicaState&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SAFE_MODE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;operator halt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Writes atomically to disk (tmp → rename)
# Survives kill -9. Auto-recovery does NOT clear it.
# Requires explicit .state.transition(VeronicaState.IDLE, ...) to resume
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Write to tmp, rename to target — this survives &lt;code&gt;kill -9&lt;/code&gt; because the rename is atomic at the filesystem level. Auto-recovery doesn't clear SAFE_MODE. You put it in SAFE_MODE because something went wrong; you should have to explicitly decide to resume.&lt;/p&gt;
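&lt;p&gt;The pattern itself is plain Python, nothing library-specific. &lt;code&gt;os.replace&lt;/code&gt; is atomic on POSIX filesystems, so a reader sees either the old file or the new one, never a torn write:&lt;/p&gt;

```python
import json
import os
import tempfile

def atomic_write_json(path, payload):
    # Write to a temp file in the same directory (rename is only
    # atomic within one filesystem), fsync, then swap into place.
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(payload, f)
            f.flush()
            os.fsync(f.fileno())  # data durable before the rename
        os.replace(tmp, path)     # atomic: old contents or new, never half
    except BaseException:
        os.unlink(tmp)
        raise
```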

&lt;p&gt;&lt;strong&gt;When you'd rather degrade than stop&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Hard halts aren't always right. Sometimes you want the system to keep running at reduced capacity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 80% budget: downgrade to a cheaper model
# 85%: trim context
# 90%: rate limit between calls
# 100%: halt
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Thresholds and model mappings are configurable.&lt;/p&gt;
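&lt;p&gt;The ladder reduces to a ratio check on each call. The cutoffs and action names below are illustrative, not the library's defaults:&lt;/p&gt;

```python
def degrade_action(spent_usd, max_cost_usd):
    # Check the most aggressive threshold first; below 80% nothing changes.
    ratio = spent_usd / max_cost_usd
    if ratio >= 1.00:
        return "halt"
    if ratio >= 0.90:
        return "rate_limit"
    if ratio >= 0.85:
        return "trim_context"
    if ratio >= 0.80:
        return "downgrade_model"   # e.g. switch to a cheaper model
    return "proceed"
```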

&lt;p&gt;&lt;strong&gt;Across processes&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ExecutionConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_cost_usd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;10.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;redis_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redis://localhost:6379&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Workers share one budget ceiling via Redis INCRBYFLOAT
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
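&lt;p&gt;Redis's &lt;code&gt;INCRBYFLOAT&lt;/code&gt; adds to a key and returns the new total in one atomic step, which is what makes a shared ceiling race-free: each worker reserves its estimated cost, reads the post-increment total, and rolls back with a negative increment if it overshot. The same reserve-then-check semantics, with an in-memory stand-in for the Redis key:&lt;/p&gt;

```python
class SharedBudget:
    # In production self._spent would be a Redis key updated with
    # INCRBYFLOAT; a plain float stands in here to show the semantics.
    def __init__(self, max_cost_usd):
        self.max_cost_usd = max_cost_usd
        self._spent = 0.0

    def try_reserve(self, estimated_cost_usd):
        self._spent += estimated_cost_usd        # INCRBYFLOAT: add, read new total
        if self._spent > self.max_cost_usd:
            self._spent -= estimated_cost_usd    # negative INCRBYFLOAT: roll back
            return False
        return True

budget = SharedBudget(max_cost_usd=1.00)
budget.try_reserve(0.60)   # True: $0.60 of $1.00 reserved
budget.try_reserve(0.60)   # False: would overshoot, reservation rolled back
```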






&lt;h2&gt;
  
  
  Numbers
&lt;/h2&gt;

&lt;p&gt;This runs inside an autonomous trading system. 30 days continuous, 1,000+ ops/sec, 2.6M+ operations. During that time, 12 crashes — SIGTERM, SIGINT, one OOM kill. 100% recovery, no data loss.&lt;/p&gt;

&lt;p&gt;The destruction tests are reproducible. You don't have to take our word for it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/amabito/veronica-core
python scripts/proof_runner.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The suite covers SAFE_MODE persistence through &lt;code&gt;kill -9&lt;/code&gt;, budget ceiling enforcement, and child cost propagation. The tests pass or they don't.&lt;/p&gt;




&lt;h2&gt;
  
  
  Install
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;veronica-core
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Zero external dependencies for core. &lt;code&gt;opentelemetry-sdk&lt;/code&gt; optional for OTel export, &lt;code&gt;redis&lt;/code&gt; optional for cross-process budget. Works with LangChain, AutoGen, CrewAI, or whatever you're building. MIT.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/amabito/veronica-core" rel="noopener noreferrer"&gt;amabito/veronica-core&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;If you've hit something like this in production, I'm curious what the failure mode looked like and what you ended up doing about it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
