Stop Getting Surprise Bills: Pre-Flight Cost Checks for LLM Calls

#hermeschallenge #ai #python #agents

The $400 Morning

You left an agent running overnight. It was supposed to summarize 200 documents. Something went wrong around document 47. The agent entered a retry loop. By 7am it had made 1,400 API calls. You open Anthropic's usage dashboard and see $400 charged in eight hours.

The frustrating part is that you had a budget. You just never enforced it programmatically. You assumed the job would finish before the cost climbed too high.

That assumption is the bug. Budgets enforced only in your head are not budgets. They are hopes.

llm-cost-cap is a small Python library that puts the cost check before the call, not after. It estimates the USD cost of a request from a built-in price table (dated 2026-05-24) covering Anthropic, OpenAI, Gemini, and Bedrock. If the estimate exceeds your cap, it raises an exception before any token is sent.

The Shape of the Fix

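import anthropic
from llm_cost_cap import CostCap, BudgetExceededError

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

cap = CostCap(
    provider="anthropic",
    model="claude-sonnet-4-6",
    budget_usd=0.05,
)

# About 10,000 characters of text, roughly 2,500 tokens at 4 characters per token
prompt = "Summarize the following 10,000-character document..."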

try:
    cap.check(input_tokens=2500, output_tokens=500)
    # only reaches here if estimated cost is within budget
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}],
    )
except BudgetExceededError as e:
    print(f"Skipped: estimated ${e.estimated_usd:.4f} exceeds cap ${e.budget_usd:.4f}")

When the estimate is too high, BudgetExceededError carries both the estimated cost and your configured cap. You can log it, count it, or route to a cheaper model.

Here is what a rejection looks like:

cap = CostCap(
    provider="openai",
    model="gpt-5.4",
    budget_usd=0.01,
)

# 10,000 input tokens at $2.50/M plus 1,000 output tokens at $10.00/M
# comes to $0.0350, well over the 1-cent cap
try:
    cap.check(input_tokens=10_000, output_tokens=1_000)
except BudgetExceededError as e:
    # BudgetExceededError: estimated $0.0350 > budget $0.0100
    print(e)

The token counts can come from your tokenizer, from a rough estimate (len(text) // 4), or from a previous dry-run response. The library does not call the API to count tokens. You supply the numbers.
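
The rough estimate mentioned above can be sketched as a small helper. The name rough_token_count is hypothetical and not part of llm-cost-cap, and four characters per token is only a heuristic for English text. This version rounds up, which is slightly more conservative than len(text) // 4.

```python
def rough_token_count(text: str, chars_per_token: int = 4) -> int:
    """Approximate a token count from character length (heuristic only)."""
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    # Ceiling division, so a short non-empty string never counts as zero tokens.
    return -(-len(text) // chars_per_token)


prompt = "Summarize the following document in three bullet points."
input_tokens = rough_token_count(prompt)
```

When the estimate sits close to the cap, a real tokenizer for the target model is the safer source of numbers.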

What It Does NOT Do

This library does not track cumulative spend. It checks one request at a time against a per-call cap. If you want to enforce a session total or a monthly budget, you need to add that accounting layer yourself.
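
One possible shape for that accounting layer is a running total that you update with each estimate. SessionBudget and reserve are hypothetical names for illustration, not part of llm-cost-cap, and this sketch is single-threaded only.

```python
class SessionBudget:
    """Hypothetical running-total tracker; not part of llm-cost-cap."""

    def __init__(self, total_usd: float) -> None:
        self.total_usd = total_usd
        self.spent_usd = 0.0

    def reserve(self, estimated_usd: float) -> None:
        """Record an estimated cost, or raise if it would exceed the total."""
        projected = self.spent_usd + estimated_usd
        if projected > self.total_usd:
            raise RuntimeError(
                f"session budget exceeded: ${projected:.4f} > ${self.total_usd:.4f}"
            )
        self.spent_usd = projected


session = SessionBudget(total_usd=0.05)
session.reserve(0.015)  # first call fits
session.reserve(0.015)  # second call fits; a third 0.03 call would raise
```

A rejected reservation leaves spent_usd unchanged, so the caller can try a smaller request. Sharing one instance across threads would need a lock around reserve.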

It does not call any pricing API. The price table is baked into the package at release time. Prices change. If a provider adjusts rates, you will need a newer version of the library.

It does not auto-switch to a cheaper model. It just rejects over-budget calls. Routing logic belongs in your agent loop.

It does not handle streaming cost estimation differently from batch. For streaming you still pass estimated token counts up front.

Inside the Library

The price table is a plain Python dict keyed by (provider, model) tuples. Each entry has input_per_million and output_per_million in USD. The check() method does one multiplication and one comparison. No network calls. No dependencies.

PRICES = {
    ("anthropic", "claude-sonnet-4-6"): {
        "input_per_million": 3.00,
        "output_per_million": 15.00,
    },
    ("openai", "gpt-5.4"): {
        "input_per_million": 2.50,
        "output_per_million": 10.00,
    },
    # ... Gemini, Bedrock entries
}
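
The arithmetic behind a check can be sketched against that dict shape. The function name estimate_usd is an assumption made for this example; the library's internals may be organized differently.

```python
PRICES = {
    ("anthropic", "claude-sonnet-4-6"): {
        "input_per_million": 3.00,
        "output_per_million": 15.00,
    },
    ("openai", "gpt-5.4"): {
        "input_per_million": 2.50,
        "output_per_million": 10.00,
    },
}


def estimate_usd(provider: str, model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request from per-million-token rates."""
    rates = PRICES[(provider, model)]  # KeyError here for an unknown model
    return (
        input_tokens * rates["input_per_million"]
        + output_tokens * rates["output_per_million"]
    ) / 1_000_000


sonnet_cost = estimate_usd("anthropic", "claude-sonnet-4-6", 2_500, 500)
gpt_cost = estimate_usd("openai", "gpt-5.4", 10_000, 1_000)
print(f"{sonnet_cost:.4f} {gpt_cost:.4f}")  # prints "0.0150 0.0350"
```

With these rates the first request fits under a 5-cent cap and the second exceeds a 1-cent cap.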

The 27 tests cover exact boundary values (cost == cap passes, cost > cap raises), all four providers, unknown model names (raises UnknownModelError), and zero-token edge cases.

The design choice that took the most thought was the token count input format. Several drafts accepted a string prompt and called a tokenizer internally. That created a dependency and made the library opaque. Accepting explicit token counts keeps the library auditable. You see exactly what math is being done.

BudgetExceededError inherits from ValueError rather than a custom base class. That means a caller that catches ValueError will also catch it, without importing anything from the library. It felt more Pythonic than requiring a specific import for the base class.
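
The effect of that inheritance can be shown with a stand-in class. The exception and the check function below are re-declared locally for illustration and are not the library's code.

```python
class BudgetExceededError(ValueError):
    """Stand-in with the base class described above."""


def check(estimated_usd: float, budget_usd: float) -> None:
    # Equal to the cap passes; only strictly greater raises.
    if estimated_usd > budget_usd:
        raise BudgetExceededError(
            f"estimated ${estimated_usd:.4f} > budget ${budget_usd:.4f}"
        )


try:
    check(estimated_usd=0.035, budget_usd=0.01)
except ValueError as exc:  # the broad handler catches the budget error too
    caught = type(exc).__name__
```

The trade-off is that an existing broad except ValueError can swallow a budget rejection silently, so it is worth catching BudgetExceededError first where the distinction matters.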

When It Helps and When It Doesn't

It helps most in batch jobs and agent loops where a single request can be large and where the cost of one bad iteration matters. Document summarization, RAG with long retrieved chunks, and agent steps that build up context all benefit from a per-call gate.

It helps less in interactive chat, where requests are small and predictable. The check itself is cheap, but gating every one-sentence follow-up adds code without meaningful protection.

It does not help at all if you never call check(). The library cannot instrument the HTTP client. You have to add the check yourself. That is a feature, not a bug. It means you control where the gate sits in your code.

Pair it with llm-circuit-breaker or llm-retry for retry loops. The pattern is: check cost, make the call, then let the circuit breaker or retry policy handle provider errors. Do not retry without re-checking cost, because each retry is a new request that can be billed.
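
A minimal sketch of that pattern, without either sibling library, might look like the following. call_with_retries is a hypothetical helper, ConnectionError stands in for whatever error your provider client raises, and check_cost is assumed to be a closure around cap.check with the current token counts.

```python
import time
from typing import Callable


def call_with_retries(
    check_cost: Callable[[], None],
    send: Callable[[], str],
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
) -> str:
    """Run the cost gate before every attempt, retrying only provider errors."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        check_cost()  # a budget error raised here propagates; it is not retried
        try:
            return send()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts, surface the provider error
            time.sleep(base_delay_s * 2 ** (attempt - 1))  # simple backoff
    raise AssertionError("unreachable")
```

Because the gate runs inside the loop, a context that grew between attempts is priced again before it is resent.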

Install

pip install git+https://github.com/MukundaKatta/llm-cost-cap

Zero dependencies. Python 3.10+.

Basic usage:

from llm_cost_cap import CostCap, BudgetExceededError, UnknownModelError

cap = CostCap(provider="anthropic", model="claude-sonnet-4-6", budget_usd=0.10)
cap.check(input_tokens=1000, output_tokens=300)  # raises if over budget

Sibling Libraries

Library	What it does
`token-budget-pool`	Shared USD/token budget across concurrent workers
`llm-budget-window`	Time-windowed budget (per minute, per hour)
`llm-circuit-breaker`	Trips after repeated provider errors
`llm-retry`	Exponential backoff with jitter
`agent-rate-fence`	Per-key sliding-window rate limiting

These libraries are designed to compose. llm-cost-cap answers "is this one call affordable?" while token-budget-pool answers "do I have budget left across all my workers?"

What's Next

The most requested feature is a cumulative mode: track spend across calls in a session and raise when the total exceeds a cap. That requires a session object or a shared counter, which adds complexity. I want to think through thread safety before shipping it.

A second feature under consideration is a model recommendation mode: when a call is over budget, return the cheapest model from the price table that fits within the cap. That turns a hard rejection into a routing hint.
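
Until that ships, the same idea can be sketched in user code. cheapest_model_within is a hypothetical helper, and the two-entry table below only mirrors the rates shown earlier in this post rather than the library's full price table.

```python
from typing import Optional

PRICES = {
    ("anthropic", "claude-sonnet-4-6"): {
        "input_per_million": 3.00,
        "output_per_million": 15.00,
    },
    ("openai", "gpt-5.4"): {
        "input_per_million": 2.50,
        "output_per_million": 10.00,
    },
}


def cheapest_model_within(
    budget_usd: float, input_tokens: int, output_tokens: int
) -> Optional[tuple[str, str, float]]:
    """Return (provider, model, cost) for the cheapest fit, or None if none fits."""
    candidates = []
    for (provider, model), rates in PRICES.items():
        cost = (
            input_tokens * rates["input_per_million"]
            + output_tokens * rates["output_per_million"]
        ) / 1_000_000
        if cost <= budget_usd:
            candidates.append((cost, provider, model))
    if not candidates:
        return None
    cost, provider, model = min(candidates)  # lowest cost first
    return provider, model, cost


hint = cheapest_model_within(0.04, input_tokens=10_000, output_tokens=1_000)
```

Price is the only criterion here; a real router would also weigh whether the cheaper model is good enough for the task.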

The Hermes Agent Challenge pushed me to keep this library genuinely zero-dependency. Every draft that added a tokenizer or an HTTP client got rejected. The final form is 120 lines and a price table. It does one thing well.

If you have updated price data or a model that is missing from the table, open a PR against the prices module. That contribution has a clear, bounded scope and does not require understanding the rest of the codebase.