Pre-Flight Cost Gates for LLM Calls: Stop Expensive Requests Before They Hit the API

#hermeschallenge #ai #python #agents

The user submitted a 200-page document. Your agent is about to send it to the model. The input alone will cost $0.60 per call. You have a per-request budget of $0.10. But you do not know the cost until after the call completes.

Except you can know, at least roughly. Token counting before the call is approximate but close enough: estimate the cost up front and reject the request if it exceeds the budget.

llm-cost-cap is a pre-flight USD cost gate: estimate the cost, check the budget, reject before hitting the API.

The Shape of the Fix

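import anthropic
from llm_cost_cap import CostCap, CostCapExceeded

# Assumes ANTHROPIC_API_KEY is set in the environment.
anthropic_client = anthropic.Anthropic()

# Rates are USD per token. The figures are examples; check current pricing.
cap = CostCap(
    max_usd=0.10,
    model_rates={
        "claude-sonnet-4-6": {"input": 3.00 / 1_000_000, "output": 15.00 / 1_000_000},
        "claude-haiku-4-5": {"input": 0.25 / 1_000_000, "output": 1.25 / 1_000_000},
    },
)

def call_llm_safe(
    messages: list[dict], model: str, max_tokens: int = 1024
) -> anthropic.types.Message:
    try:
        # Raises before any network traffic if the estimate exceeds the budget.
        cap.check(messages=messages, model=model, max_tokens=max_tokens)
    except CostCapExceeded as e:
        raise ValueError(
            f"Request too expensive: estimated ${e.estimated_usd:.4f}, "
            f"budget ${e.max_usd:.2f}"
        ) from e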

    return anthropic_client.messages.create(
        model=model,
        messages=messages,
        max_tokens=max_tokens,
    )

Before the API call: estimate the input tokens, multiply by the model rate, add the max possible output cost. If the estimate exceeds the budget, raise before sending.
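As a rough worked example of that arithmetic, the sketch below prices the 200-page document from the introduction. The page size (about 4,000 characters per page) and the helper name `estimate_usd` are assumptions for illustration, not part of llm-cost-cap; the rates are the Sonnet figures used above.

```python
def estimate_usd(input_chars: int, max_tokens: int,
                 input_rate: float, output_rate: float) -> float:
    """Pre-flight estimate: ~4 chars per token in, full max_tokens out."""
    input_tokens = max(1, input_chars // 4)
    return input_tokens * input_rate + max_tokens * output_rate

# Assumed: 200 pages at ~4,000 characters each.
doc_chars = 200 * 4_000
cost = estimate_usd(doc_chars, max_tokens=1024,
                    input_rate=3.00 / 1_000_000, output_rate=15.00 / 1_000_000)
print(f"estimated ${cost:.4f}")
```

Under these assumptions the input alone is 200,000 tokens, or about $0.60, and the worst-case output adds a little over a cent, so the request is roughly six times over a $0.10 budget.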

What It Does NOT Do

llm-cost-cap does not count tokens exactly. It uses a heuristic estimate: approximately 4 characters per token for English text. The actual token count may differ by 10-20% depending on content. The cap is a pre-flight gate, not an exact accounting layer.
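One way to live with that error margin is to pad the estimate before comparing it to the budget. The sketch below is an assumption about how a caller might do this, not a feature of llm-cost-cap; `padded_tokens` and the 20% margin are hypothetical.

```python
import math

def padded_tokens(text: str, margin_pct: int = 20) -> int:
    """~4 chars per token, inflated by a safety margin and rounded up."""
    base = max(1, len(text) // 4)
    return math.ceil(base * (100 + margin_pct) / 100)

sample = "x" * 4_000  # 4,000 characters -> 1,000 heuristic tokens
print(padded_tokens(sample))
```

Padding makes the gate stricter: some requests that would have fit are rejected, in exchange for fewer estimates that come in under the real cost.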

It does not track cumulative spend. It checks each request independently against the per-request budget. For cumulative spend tracking across multiple calls, use token-budget-pool or llm-budget-window.

It does not predict the actual output cost. The cap prices output at its worst case, max_tokens * output_rate. If the model generates fewer tokens, the actual cost is lower than the estimate, so a very high max_tokens makes the estimate pessimistic and can reject requests that would have been cheap.
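Because output is priced at the full max_tokens, one option when a request is rejected is to ask how many output tokens the budget still allows. A minimal sketch, assuming the same per-token rates as above; `max_affordable_output_tokens` is a hypothetical helper, not part of the library.

```python
def max_affordable_output_tokens(input_tokens: int, budget_usd: float,
                                 input_rate: float, output_rate: float) -> int:
    """Approximate largest max_tokens whose worst-case cost fits the budget (0 if none)."""
    remaining = budget_usd - input_tokens * input_rate
    if remaining <= 0:
        return 0
    return int(remaining / output_rate)

# 10,000 input tokens at the Sonnet example rates against a $0.10 budget.
limit = max_affordable_output_tokens(
    10_000, 0.10, 3.00 / 1_000_000, 15.00 / 1_000_000
)
print(limit)
```

A caller could lower max_tokens to this limit and retry, at the risk of truncated responses, instead of rejecting the request outright.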

Inside the Library

The token estimator and cost calculator:

class CostCap:
    def __init__(self, max_usd: float, model_rates: dict[str, dict]):
        self._max_usd = max_usd
        self._rates = model_rates

    def estimate_tokens(self, messages: list[dict]) -> int:
        """Rough heuristic: ~4 chars per token for English text."""
        total_chars = 0
        for msg in messages:
            content = msg.get("content", "")
            if isinstance(content, str):
                total_chars += len(content)
            elif isinstance(content, list):
                for block in content:
                    if isinstance(block, dict) and block.get("type") == "text":
                        total_chars += len(block.get("text", ""))
        return max(1, total_chars // 4)

    def estimate_cost(self, messages: list[dict], model: str, max_tokens: int) -> float:
        if model not in self._rates:
            raise ValueError(f"Unknown model: {model}. Add it to model_rates.")

        rates = self._rates[model]
        input_tokens = self.estimate_tokens(messages)

        input_cost = input_tokens * rates["input"]
        output_cost = max_tokens * rates["output"]  # pessimistic: assumes full max_tokens

        return input_cost + output_cost

    def check(self, messages: list[dict], model: str, max_tokens: int) -> float:
        estimated = self.estimate_cost(messages, model, max_tokens)
        if estimated > self._max_usd:
            raise CostCapExceeded(
                estimated_usd=estimated,
                max_usd=self._max_usd,
                model=model,
                input_tokens=self.estimate_tokens(messages),
                max_output_tokens=max_tokens,
            )
        return estimated

The CostCapExceeded exception carries enough information to make an intelligent decision:

from dataclasses import dataclass

@dataclass
class CostCapExceeded(Exception):
    estimated_usd: float
    max_usd: float
    model: str
    input_tokens: int
    max_output_tokens: int

    def suggest_alternatives(self, cheaper_models: list[str], model_rates: dict) -> list[str]:
        """Return cheaper models that fit within the budget."""
        return [
            m for m in cheaper_models
            if (self.input_tokens * model_rates[m]["input"] +
                self.max_output_tokens * model_rates[m]["output"]) <= self.max_usd
        ]

The suggest_alternatives() method lets you automatically fall back to a cheaper model when the primary model is over budget.

When to Use It

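Use it as a guard against accidental large requests. A misconfigured tool that returns a 500KB document as input context produces roughly 125,000 tokens under the 4-characters-per-token heuristic. That is about $0.38 of input at $3.00 per million tokens and about $1.88 at $15.00 per million, and the cap rejects it before anything is charged. User-uploaded files with no size limit are a common source of unexpected large costs.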

Use it in multi-model systems. A CostCap holds a single max_usd, so give each model or tier its own instance with its own budget: a low budget on expensive models reserved for complex tasks, a higher budget on cheap models used for simple tasks. The check then decides which model a given input size can afford.

Use it for cost-conscious multi-tenant systems. Each user has a per-request budget. Large requests from one user should not silently charge more than their budget allows.
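A per-tenant gate can be sketched without any library support: look up the tenant's budget, then apply the same estimate. The plan names, budgets, and `within_tenant_budget` helper below are hypothetical.

```python
TENANT_BUDGETS_USD = {"free": 0.01, "pro": 0.10}  # hypothetical plans
INPUT_RATE = 3.00 / 1_000_000    # USD per input token (example rate)
OUTPUT_RATE = 15.00 / 1_000_000  # USD per output token (example rate)

def within_tenant_budget(tenant: str, prompt: str, max_tokens: int) -> bool:
    """True if the worst-case estimate fits the tenant's per-request budget."""
    input_tokens = max(1, len(prompt) // 4)
    estimated = input_tokens * INPUT_RATE + max_tokens * OUTPUT_RATE
    return estimated <= TENANT_BUDGETS_USD[tenant]

prompt = "a" * 8_000  # ~2,000 heuristic tokens
print(within_tenant_budget("free", prompt, max_tokens=512))
print(within_tenant_budget("pro", prompt, max_tokens=512))
```

With llm-cost-cap itself, the equivalent would presumably be one CostCap instance per tenant or plan, since the constructor takes a single max_usd.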

Skip it for systems where all requests are bounded in size. If your agent only processes short inputs (user chat messages, structured API payloads), the overhead of a pre-flight cost check adds latency without preventing real problems.

Install

pip install git+https://github.com/MukundaKatta/llm-cost-cap

# Or from PyPI
pip install llm-cost-cap

A fuller example that falls back to a cheaper model when the preferred one is over budget:

import anthropic
from llm_cost_cap import CostCap, CostCapExceeded

# Assumes ANTHROPIC_API_KEY is set in the environment.
anthropic_client = anthropic.Anthropic()

# Example rates in USD per token, listed from most to least expensive.
RATES = {
    "claude-opus-4-5": {"input": 15.00 / 1_000_000, "output": 75.00 / 1_000_000},
    "claude-sonnet-4-6": {"input": 3.00 / 1_000_000, "output": 15.00 / 1_000_000},
    "claude-haiku-4-5": {"input": 0.25 / 1_000_000, "output": 1.25 / 1_000_000},
}

def smart_route(
    messages: list[dict], preferred_model: str = "claude-opus-4-5"
) -> anthropic.types.Message:
    cap = CostCap(max_usd=0.05, model_rates=RATES)

    model = preferred_model
    for _ in range(len(RATES)):
        try:
            estimated = cap.check(messages=messages, model=model, max_tokens=1024)
        except CostCapExceeded as e:
            # Candidates keep RATES order, so the first fit is the priciest model within budget.
            alternatives = e.suggest_alternatives(list(RATES.keys()), RATES)
            if not alternatives:
                raise RuntimeError("Input too large for any model within budget") from e
            model = alternatives[0]
            continue

        print(f"Using {model}, estimated ${estimated:.4f}")
        return anthropic_client.messages.create(
            model=model,
            messages=messages,
            max_tokens=1024,
        )

    # Defensive: never fall through and return None silently.
    raise RuntimeError("No model accepted within budget")

Sibling Libraries

Library	What it solves
`token-budget-pool`	Cumulative USD budget across multiple calls
`llm-budget-window`	Time-windowed daily/hourly spend cap
`agent-budget-coordinator`	Compose multiple budget checks into one
`prompt-token-counter`	More accurate token counting (tiktoken-based)
`llm-fallback-chain`	Fall back to cheaper model when primary is over budget

The cost control stack: llm-cost-cap for pre-flight per-request gates, token-budget-pool for cumulative tracking, llm-budget-window for time-windowed caps, agent-budget-coordinator for composing all three.

What's Next

More accurate token counting: integrate with prompt-token-counter (tiktoken-based) so that estimate_tokens() can use a real tokenizer. The heuristic is fast; a tokenizer is more accurate for edge cases (code, non-English text, structured JSON). tiktoken implements OpenAI's tokenizers, so its counts for Claude models are still approximate rather than exact.
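One way such an integration could look is a pluggable counter: use a real tokenizer when one is supplied and fall back to the heuristic otherwise. This is a sketch of the idea only; `count_tokens` and its `counter` parameter are hypothetical and do not reflect the API of llm-cost-cap or prompt-token-counter.

```python
from typing import Callable, Optional

def count_tokens(text: str, counter: Optional[Callable[[str], int]] = None) -> int:
    """Use the supplied tokenizer if any, else ~4 chars per token."""
    if counter is not None:
        return counter(text)
    return max(1, len(text) // 4)

# Stand-in for a real tokenizer: counts whitespace-separated words.
def word_counter(text: str) -> int:
    return len(text.split())

payload = '{"id": 1, "tags": ["a", "b"]}'
print(count_tokens(payload))                # heuristic path
print(count_tokens(payload, word_counter))  # injected counter path
```

Keeping the counter injectable means the fast heuristic stays the default and the slower, more accurate path is opt-in.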

Model auto-selection: a cap.cheapest_model_for_budget(messages, max_tokens) method that returns the most capable model that fits within the budget, replacing the manual fallback loop in user code.
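Until that exists, the selection logic can be sketched as a standalone function. It assumes the caller lists models from most to least capable; `pick_model`, that ordering, and the example rates are assumptions, not the planned API.

```python
from typing import Optional

# Example rates in USD per token, copied from the routing example in this post.
RATES = {
    "claude-opus-4-5": {"input": 15.00 / 1_000_000, "output": 75.00 / 1_000_000},
    "claude-sonnet-4-6": {"input": 3.00 / 1_000_000, "output": 15.00 / 1_000_000},
    "claude-haiku-4-5": {"input": 0.25 / 1_000_000, "output": 1.25 / 1_000_000},
}

def pick_model(input_tokens: int, max_tokens: int, budget_usd: float,
               by_capability: list[str]) -> Optional[str]:
    """First model in capability order whose worst-case cost fits the budget."""
    for name in by_capability:
        r = RATES[name]
        if input_tokens * r["input"] + max_tokens * r["output"] <= budget_usd:
            return name
    return None  # nothing fits; the caller decides whether to reject or truncate

order = ["claude-opus-4-5", "claude-sonnet-4-6", "claude-haiku-4-5"]
print(pick_model(2_000, 1024, 0.05, order))
```

Returning None instead of raising leaves the rejection policy to the caller.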

Actual cost tracking: record the estimated cost before the call and the actual cost after (from response.usage) and expose a cap.overrun_rate() metric showing how often estimates were over vs. under actual costs. Helps calibrate the heuristic.
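The proposed metric could be computed from logged pairs of estimated and actual cost. A minimal sketch, assuming the caller records both numbers per call; `overrun_rate` here is a free function written for illustration, not the future cap.overrun_rate() method, and the logged values are invented.

```python
def overrun_rate(records: list[tuple[float, float]]) -> float:
    """Fraction of calls whose actual cost exceeded the pre-flight estimate."""
    if not records:
        return 0.0
    overruns = sum(1 for estimated, actual in records if actual > estimated)
    return overruns / len(records)

# (estimated_usd, actual_usd) pairs; values are made up for the example.
log = [(0.020, 0.012), (0.050, 0.055), (0.010, 0.010), (0.030, 0.021)]
print(overrun_rate(log))
```

A rate well above zero would suggest the character heuristic is undercounting input tokens for that workload.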

Built as part of the agent-stack family: composable Python primitives for production LLM agents.