Mukunda Rao Katta

Posted on May 25

The $12 LLM Call Nobody Saw Coming: llm-cost-cap

#hermeschallenge #ai #python #agents

A user uploaded a 400-page PDF

A user uploaded a 400-page PDF to a legal document review tool. The agent chunked the document, but one of the chunking paths had a bug. Instead of slicing the PDF into manageable pieces, it shoved the entire extracted text into a single prompt.

The call went through. The model returned a response. Nobody got an error.

The per-call cost was $12.47.

Nobody noticed until the weekly billing report. By then, two other users had hit the same path. The total cost from three calls was just under $40. The root cause was a four-line bug in a chunking function. But the deeper problem was that nothing had a per-call cap. The system had no gate that said "this call costs too much, stop before sending it."

That is what llm-cost-cap does.

The shape of the fix

Install the library:

pip install llm-cost-cap

Create an estimator with a cap and check before you call:

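import anthropic
from llm_cost_cap import CostEstimator, CostCapExceeded

# The client reads ANTHROPIC_API_KEY from the environment by default
anthropic_client = anthropic.Anthropic()

estimator = CostEstimator(
    model="claude-sonnet-4-6",
    per_call_cap_usd=0.50,
)

prompt = load_user_document()  # your own loader; the result could be huge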

try:
    estimator.check(prompt)
except CostCapExceeded as e:
    print(f"Blocked: estimated ${e.estimated_usd:.4f}, cap is ${e.cap_usd:.2f}")
    # surface the error to the user or truncate the prompt
    raise

# safe to call the API now
response = anthropic_client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}],
)

The estimator approximates the token count from the prompt string, looks up the model's input price in its built-in price table, and raises CostCapExceeded before the request ever leaves your machine.
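The pre-flight arithmetic is small enough to sketch by hand. The following is not the library's source. It is a minimal illustration that assumes roughly four characters per token and a made-up price of $3.00 per million input tokens; the names `estimate_input_usd`, `check`, and `CapExceeded` are hypothetical stand-ins.

```python
import math

# Assumption: illustrative price only, not taken from the library's price table.
PRICE_PER_MILLION_INPUT_TOKENS_USD = 3.00


class CapExceeded(Exception):
    """Hypothetical stand-in for the library's CostCapExceeded."""

    def __init__(self, estimated_usd: float, cap_usd: float) -> None:
        super().__init__(f"estimated ${estimated_usd:.4f} exceeds cap ${cap_usd:.2f}")
        self.estimated_usd = estimated_usd
        self.cap_usd = cap_usd


def estimate_input_usd(prompt: str, chars_per_token: int = 4) -> float:
    """Approximate input cost: round the token count up, then apply the price."""
    tokens = math.ceil(len(prompt) / chars_per_token)
    return tokens * PRICE_PER_MILLION_INPUT_TOKENS_USD / 1_000_000


def check(prompt: str, cap_usd: float) -> float:
    """Raise before any network call if the estimate is above the cap."""
    estimated = estimate_input_usd(prompt)
    if estimated > cap_usd:
        raise CapExceeded(estimated, cap_usd)
    return estimated
```

With these assumed numbers, a 4,000,000-character prompt is estimated at 1,000,000 tokens and $3.00, so a $0.50 cap blocks it.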

You can set caps globally or per model:

from llm_cost_cap import CostEstimator

# one cap for all models
estimator = CostEstimator(per_call_cap_usd=1.00)

# or different caps per model
estimator = CostEstimator(
    per_call_cap_usd={
        "claude-sonnet-4-6": 0.50,
        "claude-opus-4-5": 2.00,
        "gpt-4o": 0.75,
        "default": 0.25,
    }
)

The default key catches any model not explicitly listed. If you add a new model to your codebase and forget to add a cap for it, the default cap applies instead of the call silently going through uncapped.
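The fallback behaviour can be pictured as a plain dictionary lookup. This is a sketch of the idea, not the library's internals; `resolve_cap` and `CapConfig` are hypothetical names.

```python
from typing import Mapping, Union

CapConfig = Union[float, Mapping[str, float]]


def resolve_cap(caps: CapConfig, model: str) -> float:
    """Return the cap for `model`, falling back to the "default" entry."""
    if isinstance(caps, (int, float)):
        return float(caps)  # a single number applies to every model
    if model in caps:
        return caps[model]
    return caps["default"]  # KeyError here means no fallback was configured


caps = {"claude-sonnet-4-6": 0.50, "gpt-4o": 0.75, "default": 0.25}
print(resolve_cap(caps, "gpt-4o"))          # 0.75
print(resolve_cap(caps, "some-new-model"))  # 0.25
print(resolve_cap(1.00, "gpt-4o"))          # 1.0
```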

What it does NOT do

This is a narrow library, so it is worth stating clearly what is outside its scope:

It does not make the LLM call. It is a pre-flight check, not a client wrapper. You integrate it at the callsite.
It does not track output cost. Output tokens are unknown until after the call, so the library only gates on input cost. If your concern is total session cost across many turns, pair this with token-budget-py, which debits a shared USD pool after each call.
It does not enforce per-user or per-session budgets. That is a higher-level concern. This library handles the single-call input gate. Budgets across calls need a stateful store, which is a different problem.
It does not automatically truncate the prompt. When CostCapExceeded is raised, the decision about what to do is yours. You might truncate, you might return an error to the user, you might route to a smaller model. The library just tells you the gate failed.
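Of the options above, truncation is the bluntest, and it is easy to sketch. The helper below is not part of llm-cost-cap. It assumes roughly four characters per token and a made-up input price, and `max_chars_for_cap` and `truncate_to_cap` are hypothetical names.

```python
# Assumptions for illustration: ~4 characters per token and a made-up input price.
CHARS_PER_TOKEN = 4
PRICE_PER_MILLION_INPUT_TOKENS_USD = 3.00


def max_chars_for_cap(cap_usd: float) -> int:
    """Largest prompt length (in characters) whose estimate stays at or under the cap."""
    max_tokens = int(cap_usd * 1_000_000 / PRICE_PER_MILLION_INPUT_TOKENS_USD)
    return max_tokens * CHARS_PER_TOKEN


def truncate_to_cap(prompt: str, cap_usd: float) -> tuple[str, bool]:
    """Return (prompt, was_truncated), cutting from the end if needed."""
    limit = max_chars_for_cap(cap_usd)
    if len(prompt) <= limit:
        return prompt, False
    return prompt[:limit], True
```

In an `except CostCapExceeded as e` branch you might call `truncate_to_cap(prompt, e.cap_usd)` and run the check again, substituting the price your estimator actually uses. For documents where the tail matters, returning an error to the user is usually the safer choice.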

Inside the lib: input-only pre-flight

The most common question when people see this library is: why only input cost?

The honest answer is that you cannot know output cost before the call. You set max_tokens as a ceiling, but the model usually generates far fewer tokens than the ceiling allows. Estimating cost from max_tokens would give you a worst-case number that almost always overstates the real cost.

Input cost is different. You have the full prompt before you send it. You can count the tokens. You can compute the cost with reasonable accuracy.

The most common source of unexpected per-call cost is a large input, not a large output. A 500,000-token context window stuffed with retrieved documents is the pattern that produces $5 and $10 single-call surprises. The 400-page PDF story above is that pattern exactly.

Gating on input cost catches the real problem without pretending to know the output.
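A quick back-of-the-envelope comparison shows why the input side dominates. The prices here are round numbers chosen for easy arithmetic, not real quotes from any provider, and `cost_usd` is a hypothetical helper.

```python
def cost_usd(tokens: int, price_per_million_usd: float) -> float:
    """Cost of `tokens` at a given price per million tokens."""
    return tokens * price_per_million_usd / 1_000_000


# Assumed round-number prices, chosen for easy arithmetic (not real quotes).
INPUT_PRICE = 10.00   # USD per million input tokens
OUTPUT_PRICE = 30.00  # USD per million output tokens

input_cost = cost_usd(500_000, INPUT_PRICE)        # known before the call
worst_case_output = cost_usd(1_024, OUTPUT_PRICE)  # upper bound from max_tokens

print(f"input: ${input_cost:.2f}")                     # input: $5.00
print(f"worst-case output: ${worst_case_output:.4f}")  # worst-case output: $0.0307
```

Under these assumptions the input costs more than a hundred times the output ceiling, even though the assumed output price per token is three times higher.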

Internally the estimator uses a character-based approximation for token count (chars divided by 4 is the default). If you have access to the real tokenizer for your model, you can pass a custom tokenizer function:

import tiktoken
from llm_cost_cap import CostEstimator

enc = tiktoken.encoding_for_model("gpt-4o")

estimator = CostEstimator(
    model="gpt-4o",
    per_call_cap_usd=0.50,
    tokenizer=lambda text: len(enc.encode(text)),
)

The built-in price table covers Anthropic, OpenAI, Gemini, and Bedrock models as of 2026-05-24. For models not in the table, the estimator raises UnknownModelError by default. If you prefer a permissive fallback, you can override this so that unknown models use the default cap instead:

estimator = CostEstimator(
    per_call_cap_usd=0.25,
    unknown_model_policy="use_default",
)

When this is useful

Any agent that ingests user-uploaded content. Files are the main vector for accidentally large prompts.
RAG pipelines where retrieval can return 50 chunks when 5 are sufficient. The cap surfaces misconfigured retrievers before they burn budget.
Multi-tenant SaaS products where one user's bad input should not crater your API bill.
Development environments where you want to catch runaway prompts before they hit production.
Shared API keys used across a team. A cap prevents one workflow from consuming the daily budget in one call.

When NOT to use it

If your prompts are always small and tightly controlled, this adds overhead for no benefit.
If you are intentionally running batch jobs at full context window, a cap would block your own design. In that case, skip the gate and rely on token-budget-py for session-level accounting.
If you want to block on total cost (input plus output), you need a post-call budget pool, not a pre-flight gate.

Install

pip install llm-cost-cap

GitHub: MukundaKatta/llm-cost-cap
Python 3.9 and above
Zero dependencies
27 tests covering cap enforcement, per-model config, default fallback, unknown model policy, and custom tokenizer integration

Siblings

Lib	Boundary	Repo
token-budget-py	Shared pool budget across calls, runs AFTER each call to track total spend	MukundaKatta/token-budget-py
claude-cost	Rust cost calculator for Anthropic models, the pricing reference this library mirrors for Anthropic entries	MukundaKatta/claude-cost
agentfit	Fit messages to context window before the cost check so oversized prompts get trimmed first	MukundaKatta/agentfit
llm-stop-conditions	MaxUsd condition stops the agent loop when cumulative cost exceeds a session total	MukundaKatta/llm-stop-conditions

The intended order of operations in a well-guarded agent call: agentfit trims the prompt to fit the window, llm-cost-cap checks the trimmed input cost against the per-call cap, the LLM call runs, token-budget-py debits the actual cost from the shared pool, and llm-stop-conditions checks whether the session total has crossed the session budget.
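The order of operations can be sketched with plain functions standing in for each boundary. None of the names below are the real APIs of agentfit, llm-cost-cap, token-budget-py, or llm-stop-conditions. They are hypothetical placeholders, and the token approximation and price are assumptions.

```python
from dataclasses import dataclass

CHARS_PER_TOKEN = 4            # assumption: rough token approximation
INPUT_PRICE_PER_MILLION = 3.0  # assumption: made-up price in USD


@dataclass
class Session:
    budget_usd: float
    spent_usd: float = 0.0


def fit_to_window(prompt: str, max_tokens: int) -> str:
    """Stand-in for the trimming step: cut the prompt to the context window."""
    return prompt[: max_tokens * CHARS_PER_TOKEN]


def estimate_usd(prompt: str) -> float:
    """Stand-in for the pre-flight step: estimate input cost."""
    return len(prompt) / CHARS_PER_TOKEN * INPUT_PRICE_PER_MILLION / 1_000_000


def guarded_call(prompt, session, per_call_cap_usd, call_llm):
    """Run one call through all five stages; returns (text, should_stop)."""
    prompt = fit_to_window(prompt, max_tokens=200_000)  # 1. trim to the window
    if estimate_usd(prompt) > per_call_cap_usd:         # 2. per-call input gate
        raise RuntimeError("per-call input cap exceeded")
    text, actual_usd = call_llm(prompt)                 # 3. the call itself
    session.spent_usd += actual_usd                     # 4. debit the shared pool
    should_stop = session.spent_usd > session.budget_usd  # 5. session stop check
    return text, should_stop
```

Here `call_llm` is any callable that returns the response text and the actual cost in USD. Only step 2 runs before money is spent, which is why it is the one that has to raise.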

Each library owns one boundary. You can use any subset of them.

What is next

A few things on the list:

Cost breakdown by message role. In chat APIs, the input is a list of messages. Breaking down cost per system, user, and assistant turn (for multi-turn contexts) would help identify which part of the prompt is the expensive one.
Price table update tooling. The built-in table is a point-in-time snapshot. A small CLI command to pull the latest prices from provider APIs and regenerate the table would keep it current without requiring a library release.
Cap suggestion mode. Given a sample of past prompts, the library could suggest a reasonable cap based on the 95th-percentile input cost from the sample. Better than picking a number by intuition.
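The cap suggestion idea is simple enough to prototype today. The sketch below is not a library feature. It is one possible shape, using a nearest-rank percentile, and `suggest_cap` and its parameters are hypothetical.

```python
import math
from typing import Callable, Sequence


def suggest_cap(
    prompts: Sequence[str],
    estimate_usd: Callable[[str], float],
    percentile: float = 95.0,
) -> float:
    """Nearest-rank percentile of estimated input cost across sample prompts."""
    if not prompts:
        raise ValueError("need at least one sample prompt")
    costs = sorted(estimate_usd(p) for p in prompts)
    # Nearest-rank: the smallest cost with at least `percentile` percent at or below it.
    rank = max(1, math.ceil(percentile * len(costs) / 100))
    return costs[rank - 1]
```

You would pass in whatever cost estimate you already trust for `estimate_usd`. A cap set this way blocks roughly the most expensive 5% of historical prompts, so it is worth inspecting those prompts before adopting the number.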

For now, the library does one thing: it stops a call before it leaves your machine when the input alone would cost more than you intended to spend. The gate is narrow and fast. It will not save you from a chatty model. But it will stop the next 400-page PDF from costing $12.

DEV Community