Mukunda Rao Katta

Posted on May 25

token-budget-py: Thread-Safe Token and USD Caps for Python Agents

#hermeschallenge #ai #python #agents

I was running a batch agent job last month. Twelve threads, each making their own LLM calls, each blissfully unaware of what the others were spending. I had set a rough daily limit in my head but had not enforced it in code. By the time the job finished I had burned through three times my budget in forty minutes. Every worker had done its own rough estimate of what it needed, none of them could see the others coming, and there was nothing in the code that would have stopped it even if one of them had gotten close to the cap.

The obvious first attempt is a shared counter. One global variable, all threads read it before calling the API. But a shared counter with concurrent threads has a race condition in it: two workers can both read "we have 5000 tokens left," both decide that looks fine, and both proceed to spend 4000 tokens each. Together they spend 8000 against a 5000-token remaining balance. The check happened before the spend, so the check was wrong by the time it mattered.
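To make that race concrete, here is a small sketch. It is not part of token-budget-py, and the names `naive_race` and `worker` are made up for illustration. A `threading.Barrier` forces both threads to finish their check before either one spends, which is exactly the unlucky interleaving described above. Note that the write itself is protected by a lock, and the result is still wrong, because the decision was made on a stale read.

```python
import threading


def naive_race(start_balance: int = 5000, cost: int = 4000):
    """Two workers check a shared balance, then spend.

    Returns (remaining, total_spent).
    """
    state = {"remaining": start_balance, "spent": 0}
    write_lock = threading.Lock()        # protects the write only, not the check
    both_checked = threading.Barrier(2)  # releases once both threads have read

    def worker() -> None:
        seen = state["remaining"]        # check: both threads read 5000
        both_checked.wait()              # force the unlucky interleaving
        if seen >= cost:                 # decision made on a stale read
            with write_lock:
                state["remaining"] -= cost
                state["spent"] += cost

    threads = [threading.Thread(target=worker) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["remaining"], state["spent"]


remaining, spent = naive_race()
print(remaining, spent)  # -3000 8000
```

The fix is not a better lock around the write. The check and the claim have to happen as one step.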

What you actually need is a reserve-then-commit model. You claim a slice of the budget before you start work. If the claim is rejected you stop there. If it succeeds you do your work, then reconcile actual usage at the end. Database row locks follow a similar claim-first pattern, and airline seat reservations work the same way. It is what token-budget-py gives you for LLM token budgets.

Shape of the fix

from token_budget_py import TokenBudget, BudgetExceeded

budget = TokenBudget(max_tokens=100_000)

def run_worker(prompt: str) -> str:
    # Phase 1: reserve capacity before the LLM call.
    # Raises BudgetExceeded immediately if the estimate does not fit under the cap.
    reservation = budget.reserve(estimated_tokens=2000)

    try:
        # call_llm stands in for your own provider client wrapper
        response = call_llm(prompt)
        actual = response.usage.total_tokens

        # Phase 2: commit actual usage
        # Raises BudgetExceeded if actual pushes past the cap
        budget.commit(reservation, actual)
        return response.text

    except BudgetExceeded:
        # Commit was rejected. Cancelling here is safe even if the library
        # has already released the slice, because cancel is idempotent.
        budget.cancel(reservation)
        raise

    except Exception:
        # LLM call failed for some other reason, return reserved capacity
        budget.cancel(reservation)
        raise

# USD cap variant: same API, different units
from token_budget_py import UsdBudget

usd_budget = UsdBudget(max_usd=5.00, cost_per_1k_tokens=0.003)

reservation = usd_budget.reserve(estimated_tokens=1500)
# 1500 tokens * $0.003 per 1k = $0.0045. Raises BudgetExceeded if that is more than what is left of the $5.00 cap.

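The worker pattern is intentionally verbose because the three code paths (reserve fails, commit fails, call fails) each need different handling. A failed reserve raises before anything is held, so there is nothing to cancel. A failed commit or a failed call must hand the reserved slice back. cancel is idempotent: calling it twice on the same reservation is safe.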

What it does NOT do

This library handles one problem: making a shared numeric cap safe across concurrent Python threads. It does not track costs per provider, per model, or per session separately. There is no database persistence: if your process restarts the budget counter resets. It has no HTTP API, no Prometheus metrics endpoint, and no alerting hooks when you are close to the cap. It does not know what an Anthropic call actually costs per token: you supply the cost_per_1k_tokens rate yourself in the USD variant. If you need a full cost observability platform, this library is not that. If you need a shared numeric gate that multiple threads respect, this is exactly that.

Inside the lib

The core data structure is two integers and a threading.Lock. The integers are reserved (total tokens claimed by active reservations) and committed (total tokens consumed by completed calls). The lock is held for the duration of every read-modify-write cycle.

When you call reserve(estimated_tokens), the lock is acquired. The library checks whether reserved + estimated_tokens > max_tokens. If it is, reserve raises BudgetExceeded immediately without incrementing anything. Otherwise, reserved is incremented by estimated_tokens and a Reservation object is returned. The reservation holds a copy of the slice size so that cancel and commit know exactly how much to release or replace later.

commit(reservation, actual) does a second check: committed + actual > max_tokens. This matters because your estimate and your actual usage can diverge. If you estimated 2000 but the call used 3100, and that overage pushes committed + actual past max_tokens, you get a BudgetExceeded at commit time. On success, the 2000 tokens the reservation held in reserved are released and the actual amount is added to committed. If there is no room, the reservation is automatically cancelled as part of the exception path.
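The reserve and commit steps described above, plus an idempotent cancel, fit in a few dozen lines. What follows is a minimal sketch, not the library's actual source: `SketchBudget`, `SketchReservation`, and `SketchBudgetExceeded` are hypothetical names. One deliberate difference: the text states the reserve check in terms of reserved alone, while this sketch also counts committed in that check. That is my assumption, made so that capacity already consumed is not handed out again.

```python
import threading
from dataclasses import dataclass


class SketchBudgetExceeded(Exception):
    """Raised when a reserve or commit would pass the cap."""


@dataclass
class SketchReservation:
    tokens: int             # size of the slice held in `reserved`
    released: bool = False  # lets release run at most once


class SketchBudget:
    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.reserved = 0
        self.committed = 0
        self._lock = threading.Lock()

    def reserve(self, estimated_tokens: int) -> SketchReservation:
        with self._lock:
            # Assumption: committed tokens count against new reservations too.
            if self.committed + self.reserved + estimated_tokens > self.max_tokens:
                raise SketchBudgetExceeded("reserve would exceed the cap")
            self.reserved += estimated_tokens
            return SketchReservation(estimated_tokens)

    def commit(self, reservation: SketchReservation, actual: int) -> None:
        with self._lock:
            # Release the held slice first. On failure this is the auto-cancel.
            self._release(reservation)
            if self.committed + actual > self.max_tokens:
                raise SketchBudgetExceeded("actual usage would exceed the cap")
            self.committed += actual

    def cancel(self, reservation: SketchReservation) -> None:
        with self._lock:
            self._release(reservation)

    def _release(self, reservation: SketchReservation) -> None:
        # Caller must hold the lock. Safe to call more than once.
        if not reservation.released:
            self.reserved -= reservation.tokens
            reservation.released = True


budget = SketchBudget(max_tokens=5000)
first = budget.reserve(2000)
budget.commit(first, 2400)          # estimate 2000, actual 2400: still fits
second = budget.reserve(2000)
try:
    budget.commit(second, 3100)     # 2400 + 3100 > 5000: rejected at commit
except SketchBudgetExceeded:
    budget.cancel(second)           # no-op, commit already released the slice
print(budget.reserved, budget.committed)  # 0 2400
```

Holding the lock for the whole check-and-update is the point of the design. The critical sections are a few integer operations, so contention stays low even with many threads.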

cancel(reservation) just decrements reserved by the slice that was held. Because the lock is used on every operation, calling cancel from multiple threads at once is safe. The UsdBudget variant wraps the same mechanism with a conversion layer: it converts token counts to USD using the rate you provide at construction time, then delegates to the same lock-protected counter. No external rate tables, no network calls, no provider SDK dependencies.
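The conversion layer is plain arithmetic. Here is a minimal sketch, assuming a flat per-1k rate like the one passed to the constructor earlier. `tokens_to_usd` and `usd_cap_to_tokens` are hypothetical helper names, not library functions, and using `Decimal` rather than floats is my choice to keep fractions of a cent exact.

```python
from decimal import Decimal


def tokens_to_usd(tokens: int, cost_per_1k_tokens: Decimal) -> Decimal:
    """Cost in USD of `tokens` at a flat per-1k rate."""
    return Decimal(tokens) * cost_per_1k_tokens / Decimal(1000)


def usd_cap_to_tokens(max_usd: Decimal, cost_per_1k_tokens: Decimal) -> int:
    """Largest whole number of tokens that fits under `max_usd`."""
    # int() truncates, so a partial token never counts as affordable.
    return int(max_usd * Decimal(1000) / cost_per_1k_tokens)


rate = Decimal("0.003")
print(tokens_to_usd(1500, rate))
print(usd_cap_to_tokens(Decimal("5.00"), rate))
```

At $0.003 per 1k tokens, a 1500-token reservation costs $0.0045, and a $5.00 cap works out to roughly 1.67 million tokens.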

The 18 tests cover basic reserve and commit, concurrent threads racing to exhaust the budget, cancel behavior, USD conversion, commit-time overage detection, and double-cancel safety.
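For a sense of what a "threads racing to exhaust the budget" test can look like, here is a sketch. It is not taken from the library's test suite. `TinyGate` and `race_for_budget` are hypothetical stand-ins that implement only a lock-protected reserve step, so the block runs without installing anything.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class TinyGate:
    """Stand-in with only the lock-protected reserve step."""

    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.reserved = 0
        self._lock = threading.Lock()

    def try_reserve(self, tokens: int) -> bool:
        with self._lock:  # check and increment happen as one step
            if self.reserved + tokens > self.max_tokens:
                return False
            self.reserved += tokens
            return True


def race_for_budget(workers: int = 12, attempts: int = 50, tokens: int = 300):
    """Run `workers` threads that each try to reserve `attempts` times."""
    gate = TinyGate(max_tokens=10_000)

    def worker(_worker_id: int) -> int:
        # Count how many of this worker's attempts were granted.
        return sum(gate.try_reserve(tokens) for _attempt in range(attempts))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        granted = sum(pool.map(worker, range(workers)))
    return gate, granted


gate, granted = race_for_budget()
print(granted, gate.reserved)  # 33 9900
```

Twelve workers make 600 attempts between them, asking for 180,000 tokens in total against a 10,000-token cap. However the threads interleave, exactly 33 reservations of 300 tokens can fit, so the assertion is deterministic.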

When useful

Batch agent jobs where multiple worker threads share one daily or hourly token quota and you need the quota to actually hold
A queue consumer where you want to drain the queue but stop before hitting a billing cap regardless of how fast workers move
Any multi-threaded code where "check, then act" is a race condition and you need "reserve, then act, then confirm"
Integration tests where you want to confirm that your agent stops at a specific token count and does not sneak past it

When not useful

Single-threaded scripts where a simple counter after each call is enough
Multi-process or distributed workers where you need the budget shared across process boundaries (use Redis or a database-backed counter for that)
Situations where you need per-provider or per-model sub-budgets rather than one shared pool
Places where the LLM provider already enforces a hard spend cap and you trust that cap completely without wanting an application-layer gate

Install

pip install token-budget-py

Python 3.9+. The library is a single module with zero third-party runtime dependencies.

Siblings

Library	Language	What it does
token-budget-pool	Rust	Original Rust implementation, same two-phase API
llm-budget-window	Rust	Time-windowed token and USD budget that resets on an interval
llm-cost-cap	Python	Pre-flight USD cost gate before the call goes out
tool-call-budgets	Python	Per-tool call-count caps rather than token caps
llm-circuit-breaker-py	Python	Circuit breaker that opens after repeated failures, not budget exhaustion
agent-deadline	Python	Cooperative per-task time deadline for long-running agent loops

What's next

The two features I want to add next are an async-friendly variant using asyncio.Lock so it works cleanly in async agent frameworks without blocking the event loop, and a callback hook for "budget is 80% consumed" so callers can start wrapping up or alert before the cap is hit hard. If you are using this in a production pipeline and have a feature request, open an issue on GitHub.
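As a rough idea of what those two features could look like together, here is a sketch. Nothing in it exists in token-budget-py today: `AsyncSketchBudget`, `on_threshold`, and `threshold` are hypothetical names. Firing the callback once, based on reserved capacity, is one of several reasonable choices and not a statement about the eventual design.

```python
import asyncio
from typing import Callable, Optional


class AsyncSketchBudget:
    """Hypothetical async gate with a one-shot threshold callback."""

    def __init__(
        self,
        max_tokens: int,
        threshold: float = 0.8,
        on_threshold: Optional[Callable[[int, int], None]] = None,
    ) -> None:
        self.max_tokens = max_tokens
        self.used = 0
        self._threshold = threshold
        self._on_threshold = on_threshold
        self._fired = False
        self._lock = asyncio.Lock()

    async def reserve(self, tokens: int) -> bool:
        async with self._lock:  # waiting here yields to the event loop
            if self.used + tokens > self.max_tokens:
                return False
            self.used += tokens
            crossed = self.used >= self._threshold * self.max_tokens
            if crossed and not self._fired and self._on_threshold:
                self._fired = True
                self._on_threshold(self.used, self.max_tokens)
            return True


async def demo() -> tuple:
    alerts = []
    # Built inside the running loop so the lock binds to it on Python 3.9.
    budget = AsyncSketchBudget(
        1000, on_threshold=lambda used, cap: alerts.append((used, cap))
    )
    results = await asyncio.gather(*(budget.reserve(300) for _ in range(5)))
    return list(results), alerts


results, alerts = asyncio.run(demo())
print(results, alerts)
```

Five tasks each ask for 300 tokens against a 1000-token cap, so three are granted and the callback fires once, when usage reaches 900.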

Part of the Hermes Agent Challenge sprint. Source at github.com/MukundaKatta/token-budget-py. PyPI: pip install token-budget-py.
