The user submitted a 200-page document. Your agent is about to send it to the model. The input alone will cost $0.60 per call. You have a per-request budget of $0.10. But you do not know the cost until after the call completes.
Or you do know — you have an estimate. Token counting is approximate but close enough. You can estimate the cost before you send the request and reject it if it exceeds the budget.
llm-cost-cap is a pre-flight USD cost gate: estimate the cost, check the budget, reject before hitting the API.
The Shape of the Fix
from llm_cost_cap import CostCap, CostCapExceeded
cap = CostCap(
max_usd=0.10,
model_rates={
"claude-sonnet-4-6": {"input": 3.00 / 1_000_000, "output": 15.00 / 1_000_000},
"claude-haiku-4-5": {"input": 0.25 / 1_000_000, "output": 1.25 / 1_000_000},
},
)
def call_llm_safe(messages: list[dict], model: str, max_tokens: int = 1024) -> dict:
try:
cap.check(messages=messages, model=model, max_tokens=max_tokens)
except CostCapExceeded as e:
raise ValueError(
f"Request too expensive: estimated ${e.estimated_usd:.4f}, "
f"budget ${e.max_usd:.2f}"
)
return anthropic_client.messages.create(
model=model,
messages=messages,
max_tokens=max_tokens,
)
Before the API call: estimate the input tokens, multiply by the model rate, add the max possible output cost. If the estimate exceeds the budget, raise before sending.
What It Does NOT Do
llm-cost-cap does not count tokens exactly. It uses a heuristic estimate: approximately 4 characters per token for English text. The actual token count may differ by 10-20% depending on content. The cap is a pre-flight gate, not an exact accounting layer.
It does not track cumulative spend. It checks each request independently against the per-request budget. For cumulative spend tracking across multiple calls, use token-budget-pool or llm-budget-window.
It does not prevent cost overruns from output. The cap estimates output cost as max_tokens * output_rate. If the model generates fewer tokens, the actual cost is lower. If your max_tokens is very high, the estimate will be pessimistic.
Inside the Library
The token estimator and cost calculator:
class CostCap:
def __init__(self, max_usd: float, model_rates: dict[str, dict]):
self._max_usd = max_usd
self._rates = model_rates
def estimate_tokens(self, messages: list[dict]) -> int:
"""Rough heuristic: ~4 chars per token for English text."""
total_chars = 0
for msg in messages:
content = msg.get("content", "")
if isinstance(content, str):
total_chars += len(content)
elif isinstance(content, list):
for block in content:
if isinstance(block, dict) and block.get("type") == "text":
total_chars += len(block.get("text", ""))
return max(1, total_chars // 4)
def estimate_cost(self, messages: list[dict], model: str, max_tokens: int) -> float:
if model not in self._rates:
raise ValueError(f"Unknown model: {model}. Add it to model_rates.")
rates = self._rates[model]
input_tokens = self.estimate_tokens(messages)
input_cost = input_tokens * rates["input"]
output_cost = max_tokens * rates["output"] # pessimistic: assumes full max_tokens
return input_cost + output_cost
def check(self, messages: list[dict], model: str, max_tokens: int) -> float:
estimated = self.estimate_cost(messages, model, max_tokens)
if estimated > self._max_usd:
raise CostCapExceeded(
estimated_usd=estimated,
max_usd=self._max_usd,
model=model,
input_tokens=self.estimate_tokens(messages),
max_output_tokens=max_tokens,
)
return estimated
The CostCapExceeded exception carries enough information to make an intelligent decision:
@dataclass
class CostCapExceeded(Exception):
estimated_usd: float
max_usd: float
model: str
input_tokens: int
max_output_tokens: int
def suggest_alternatives(self, cheaper_models: list[str], model_rates: dict) -> list[str]:
"""Return cheaper models that fit within the budget."""
return [
m for m in cheaper_models
if (self.input_tokens * model_rates[m]["input"] +
self.max_output_tokens * model_rates[m]["output"]) <= self.max_usd
]
The suggest_alternatives() method lets you automatically fall back to a cheaper model when the primary model is over budget.
When to Use It
Use it as a guard against accidental large requests. A misconfigured tool that returns a 500KB document as input context will be caught before it charges $5.00. User-uploaded files with no size limit are a common source of unexpected large costs.
Use it in multi-model systems. Set different budgets per model: expensive models have low budgets for complex tasks; cheap models have higher budgets for simple tasks. The cap gates which model is appropriate for a given input size.
Use it for cost-conscious multi-tenant systems. Each user has a per-request budget. Large requests from one user should not silently charge more than their budget allows.
Skip it for systems where all requests are bounded in size. If your agent only processes short inputs (user chat messages, structured API payloads), the overhead of a pre-flight cost check adds latency without preventing real problems.
Install
pip install git+https://github.com/MukundaKatta/llm-cost-cap
# Or from PyPI
pip install llm-cost-cap
from llm_cost_cap import CostCap, CostCapExceeded
from llm_fallback_chain import FallbackChain
RATES = {
"claude-opus-4-5": {"input": 15.00 / 1_000_000, "output": 75.00 / 1_000_000},
"claude-sonnet-4-6": {"input": 3.00 / 1_000_000, "output": 15.00 / 1_000_000},
"claude-haiku-4-5": {"input": 0.25 / 1_000_000, "output": 1.25 / 1_000_000},
}
def smart_route(messages: list[dict], preferred_model: str = "claude-opus-4-5") -> dict:
cap = CostCap(max_usd=0.05, model_rates=RATES)
model = preferred_model
for _ in range(len(RATES)):
try:
estimated = cap.check(messages=messages, model=model, max_tokens=1024)
print(f"Using {model}, estimated ${estimated:.4f}")
return anthropic_client.messages.create(
model=model,
messages=messages,
max_tokens=1024,
)
except CostCapExceeded as e:
alternatives = e.suggest_alternatives(
list(RATES.keys()), RATES
)
if not alternatives:
raise RuntimeError("Input too large for any model within budget")
model = alternatives[0]
Sibling Libraries
| Library | What it solves |
|---|---|
token-budget-pool |
Cumulative USD budget across multiple calls |
llm-budget-window |
Time-windowed daily/hourly spend cap |
agent-budget-coordinator |
Compose multiple budget checks into one |
prompt-token-counter |
More accurate token counting (tiktoken-based) |
llm-fallback-chain |
Fall back to cheaper model when primary is over budget |
The cost control stack: llm-cost-cap for pre-flight per-request gates, token-budget-pool for cumulative tracking, llm-budget-window for time-windowed caps, agent-budget-coordinator for composing all three.
What's Next
Exact token counting: integrate with prompt-token-counter (tiktoken-based) for exact counts when estimate_tokens() is called. The heuristic is fast; the exact count is more accurate for edge cases (code, non-English text, structured JSON).
Model auto-selection: cap.cheapest_model_for_budget(messages, max_tokens) that returns the most capable model that fits within the budget. This replaces the manual fallback loop in user code.
Actual cost tracking: record the estimated cost before the call and the actual cost after (from response.usage) and expose a cap.overrun_rate() metric showing how often estimates were over vs. under actual costs. Helps calibrate the heuristic.
Built as part of the agent-stack family: composable Python primitives for production LLM agents.
Top comments (0)