How to stop your AI bill from surprising you

#ai #api #budget #governance

There is a particular kind of email you only get when something has gone badly in your AI stack. It's not from a customer. It's not from on-call. It's from your AI provider, on the first of the month, with a number on it that is several times what last month was.

It usually traces back to one of three things:

A retry loop nobody noticed. A model returned malformed JSON; the wrapper retried; the retry hit a different model, which also returned malformed JSON; the wrapper retried again. By Monday morning, a single buggy feature has fired ten thousand Opus-grade requests over the weekend.
A new feature that quietly defaults to the most expensive model. The launch went well. The users love it. It uses sport mode and never falls back to balanced because the prompt is "complex." Your average request cost just tripled and nobody pointed it out.
An eval pipeline somebody forgot was running. It was meant to be a one-time backfill but the cron is still firing. You can find it in the logs if you know to look.

These don't happen because anyone is reckless. They happen because the AI proxy layer is the place where a five-line config change can quietly cost more than a developer's monthly salary, and most stacks don't surface that until the invoice arrives.

The previous Prism release — v1.3 Observability — made the production traffic visible. You could see every request, attribute cost per feature, watch p95 latency, capture feedback. But seeing isn't stopping. As of today, v1.4 Policy + Governance ships the layer that stops this class of failure before the email arrives.

Three components. One page on the dashboard. Pro and Team only.

Routing rules — consistency on demand

A per-project policy that the hot path enforces in under five milliseconds. Four shapes:

Deny a model. "Never use Opus on this project." A request that would have routed to Opus returns HTTP 403 with a structured error envelope your code can handle:

  {
    "error": {
      "type": "policy_rule",
      "rule": "denied_model",
      "message": "Model 'claude-opus' is denied by project policy",
      "denied_value": "claude-opus",
      "policy_url": "/dashboard/policy"
    }
  }

Useful when you've decided that for a given workload, the quality gap between Sonnet and Opus isn't worth the 5x cost. Set the rule once; every call from that project respects it forever, even ones that haven't been written yet.

Deny a mode. "Never sport-mode in production." Catches the case where someone copy-pastes a development snippet that hard-codes X-Prism-Mode: sport and forgets to switch it.
Force a model per task type. "Always use Sonnet for code, never Haiku." This one doesn't 403 — it silently overrides the router's choice and continues. The override is captured on the usage log row and in the audit timeline. Use this when the router's defaults are close but you want a hard guarantee for a specific task family.
Cap input tokens. "Reject requests where the estimated input exceeds 8k tokens." Defends against an attacker (or an unintentional bug) feeding the model arbitrarily long context.

The rules apply before cache lookup. Once you've denied Opus, a previously-cached Opus response is also blocked — your deny intent beats the cache. If you'd rather have the cache continue serving its existing entries, just don't deny the model; the router will simply stop generating new Opus responses.
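To make the four rule shapes concrete, here is a minimal sketch of how a policy stage could evaluate them before any cache lookup. This is not Prism's implementation: the `Policy` and `Request` classes, the `evaluate` function, the `max_input_tokens` rule name, the model identifiers other than `claude-opus`, and the order in which rules are checked are all assumptions made for illustration. Only the envelope fields and the `denied_model` / `denied_mode` rule names come from the examples in this post.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Policy:
    # Hypothetical per-project policy shape; not Prism's real schema.
    denied_models: set = field(default_factory=set)
    denied_modes: set = field(default_factory=set)
    forced_models: dict = field(default_factory=dict)  # task type -> model
    max_input_tokens: Optional[int] = None


@dataclass
class Request:
    model: str             # the model the router proposes
    mode: str              # e.g. "sport" or "balanced"
    task_type: str         # e.g. "code"
    est_input_tokens: int  # estimated input size


def deny(rule: str, value, message: str) -> dict:
    # Same fields as the 403 envelope shown earlier in the post.
    return {
        "status": 403,
        "error": {
            "type": "policy_rule",
            "rule": rule,
            "message": message,
            "denied_value": value,
            "policy_url": "/dashboard/policy",
        },
    }


def evaluate(policy: Policy, req: Request) -> dict:
    """Return a 403 envelope, or the model to use (possibly overridden)."""
    if req.mode in policy.denied_modes:
        return deny("denied_mode", req.mode,
                    f"Mode '{req.mode}' is denied by project policy")

    limit = policy.max_input_tokens
    if limit is not None and req.est_input_tokens > limit:
        return deny("max_input_tokens", req.est_input_tokens,
                    f"Estimated input exceeds the project limit of {limit} tokens")

    # Force rules do not reject; they replace the router's choice.
    model = policy.forced_models.get(req.task_type, req.model)

    # Assumption: the deny list is checked against the final model.
    if model in policy.denied_models:
        return deny("denied_model", model,
                    f"Model '{model}' is denied by project policy")

    return {"status": 200, "model": model, "overridden": model != req.model}
```

Because a stage like this returns before the cache is consulted, a denied model never reaches the cache lookup, which matches the "deny intent beats the cache" behavior described above. The `overridden` flag is the kind of value that would be written to the usage log row and the audit timeline.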

Budget caps — predictability on demand

Per-project, monthly USD cap. Two thresholds:

Soft warn: default 80%, configurable. The first time the project crosses this threshold in a calendar month, the project owner (or a custom alert address) gets an email. Requests keep flowing. The email arrives once. It does not arrive again until next month.
Hard block: on by default, can be turned off. When the project's spend plus the next request's pre-bill estimate would meet or exceed the cap, the request returns HTTP 402 Payment Required:

  {
    "error": {
      "type": "budget_exceeded",
      "message": "Project would exceed monthly cap of $50.00 (current $49.87, this request est. $0.18)",
      "monthly_cap_usd": 50.00,
      "current_spend_usd": 49.87,
      "policy_url": "/dashboard/policy"
    }
  }
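On the calling side, both envelopes can be handled by one small classifier. The sketch below is an assumption about how client code might do that; `classify_policy_error` and `NON_RETRYABLE` are hypothetical names, and only the status codes and the `type` values come from the two examples in this post.

```python
import json

# Outcomes that a retry cannot fix: the policy or the budget has to change first.
NON_RETRYABLE = {"policy_denied", "budget_exceeded"}


def classify_policy_error(status: int, body: str) -> str:
    """Map a response to 'ok', 'policy_denied', 'budget_exceeded' or 'other_error'."""
    if status < 400:
        return "ok"
    try:
        error_type = json.loads(body)["error"]["type"]
    except (ValueError, KeyError, TypeError):
        # Not JSON, or not shaped like the envelopes shown in this post.
        return "other_error"
    if status == 403 and error_type == "policy_rule":
        return "policy_denied"
    if status == 402 and error_type == "budget_exceeded":
        return "budget_exceeded"
    return "other_error"
```

Treating both outcomes as non-retryable keeps a wrapper from turning a deliberate block into the retry loop described at the top of this post.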

Some design choices worth being explicit about, because they're the boring ones that matter when you're under pressure:

Cache hits cost $0 and are never blocked. Caching is exactly how you stay under budget. Blocking a cache hit because you're at 99% of cap would be perverse — the cache hit doesn't move the needle. (Cache hits don't show up in the counter at all.)
Mid-stream requests are never killed. A streaming request that was already in flight when the cap fires gets to finish. The block only applies to new requests after the threshold. The alternative would corrupt the customer's response stream; that's not a tradeoff we're willing to make for ~5% of cap overhead.
Pre-bill uses a 10% safety margin. The estimate is max_tokens × output price + tokens_in × input price, plus a 10% buffer. Most requests come back with fewer output tokens than max_tokens, so actual spend usually runs below the estimate. The margin guards against the opposite failure, where the estimate undershoots and a request that passed the check pushes actual spend past the cap. The price is that a request can occasionally be blocked while a little real headroom remains.
Failed-provider requests don't count. No cost row, no counter increment. A provider outage doesn't burn through your cap.
Redis down? Fail open. The budget counter lives in Redis for hot-path speed. If Redis is unreachable, we serve the request and log a warning. Budget is a financial safety net, not a security control — under-serving is worse than over-serving for a few minutes.
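The arithmetic behind these choices fits in a few lines. The sketch below is illustrative only: the function names are hypothetical, prices are assumed to be expressed per token, the 10% buffer is assumed to multiply the whole estimate, and an unreachable counter store is modeled as `current_spend=None`. It does not model in-flight streams, which are simply never re-checked.

```python
from typing import Optional

SAFETY_MARGIN = 0.10  # the 10% buffer described above


def prebill_estimate(tokens_in: int, max_tokens: int,
                     input_price: float, output_price: float) -> float:
    """Worst-case request cost in USD, padded by the safety margin."""
    base = max_tokens * output_price + tokens_in * input_price
    return base * (1 + SAFETY_MARGIN)


def should_block(current_spend: Optional[float], estimate: float, cap: float,
                 cache_hit: bool = False, hard_block: bool = True) -> bool:
    """Decide whether a new request should get the 402 response."""
    if cache_hit or not hard_block:
        return False  # cache hits cost $0; hard block can be switched off
    if current_spend is None:
        return False  # counter store unreachable: fail open
    return current_spend + estimate >= cap


def crosses_soft_threshold(spend_before: float, spend_after: float,
                           cap: float, threshold: float = 0.80) -> bool:
    """True only for the increment that first reaches the warn level."""
    limit = cap * threshold
    return spend_before < limit <= spend_after
```

With the numbers from the 402 example, `should_block(49.87, 0.18, 50.00)` is true because $49.87 plus $0.18 reaches the $50.00 cap, while the same call with `cache_hit=True` is false. Detecting the crossing itself, rather than "spend is above 80%", is one simple way to send the warning email once per month instead of on every request.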

A reconciliation job runs nightly at 02:00 UTC. It recomputes the authoritative spend from the usage_logs table and overwrites the Redis counter. Any drift from missed increments or Redis hiccups gets corrected before it can compound.
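A reconciliation pass of this kind can be sketched as "sum the log, overwrite the counter." In the example below, SQLite stands in for the real database and a plain dict stands in for Redis; the column names (`project_id`, `month`, `cost_usd`), the key format, and the `reconcile` function are assumptions for illustration. Only the `usage_logs` table name comes from this post.

```python
import sqlite3


def reconcile(conn: sqlite3.Connection, counters: dict, month: str) -> dict:
    """Overwrite each project's counter with the spend recomputed from usage_logs.

    Returns the drift per project (authoritative spend minus the old counter).
    """
    rows = conn.execute(
        "SELECT project_id, SUM(cost_usd) FROM usage_logs "
        "WHERE month = ? GROUP BY project_id",
        (month,),
    ).fetchall()

    drift = {}
    for project_id, spend in rows:
        key = f"budget:{project_id}:{month}"  # hypothetical key format
        drift[project_id] = spend - counters.get(key, 0.0)
        counters[key] = spend                 # the log is the source of truth
    return drift
```

Returning the drift makes it easy to log or alert when the hot-path counter and the log disagree by more than a few cents, which is a useful signal that increments are being missed.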

Audit log — defensibility on demand

Every rule change. Every enforcement firing. Every actor. Every before-and-after. Captured in an append-only table the moment it happens, surfaced on /dashboard/usage?tab=audit as a colored timeline.

Three categories show up:

Config changes: "Ravi set deny-list to claude-opus on 2026-05-18 14:22." Diff view in the expanded panel.
Policy firings: "Blocked: denied_mode=sport on a sport/code request." The rule plus the value that fired it.
Budget events: "Warned at 80% of $50 cap." / "Blocked at $50.04 of $50 cap."
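As a rough picture of what one timeline entry carries, here is a small in-memory sketch of an append-only log with the three categories. The class names, field names, and category strings are hypothetical and are not Prism's schema; the real table lives in a database rather than a Python list.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

CATEGORIES = {"config_change", "policy_firing", "budget_event"}


@dataclass(frozen=True)  # frozen: an event cannot be edited after creation
class AuditEvent:
    category: str
    actor: str
    summary: str
    before: Optional[dict]
    after: Optional[dict]
    at: datetime


class AuditLog:
    """Exposes append and read only; there is no update or delete."""

    def __init__(self) -> None:
        self._events = []

    def append(self, category: str, actor: str, summary: str,
               before: Optional[dict] = None,
               after: Optional[dict] = None) -> AuditEvent:
        if category not in CATEGORIES:
            raise ValueError(f"unknown audit category: {category}")
        event = AuditEvent(category, actor, summary, before, after,
                           datetime.now(timezone.utc))
        self._events.append(event)
        return event

    def events(self) -> tuple:
        return tuple(self._events)


def changed_keys(before: dict, after: dict) -> dict:
    """Keys whose value differs, mapped to (old, new), for a diff view."""
    keys = sorted(set(before) | set(after))
    return {k: (before.get(k), after.get(k))
            for k in keys if before.get(k) != after.get(k)}
```

Storing the full before and after state, rather than only a description of the change, is what lets a diff view be rendered later without replaying history.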

Dashboard history: 30 days on Pro, 365 days on Team. The underlying data is kept indefinitely; the window only limits how far back you can scan in the dashboard.

The audit log is what makes the other two components defensible during a compliance review. A customer asking "do you have controls on which AI models we can use?" gets a yes, with evidence. A customer asking "if a rule blocks a legitimate request, can we trace what happened?" gets a yes, with a timestamp and an actor. That's the difference between checking a SOC 2 box and actually being able to ship into a regulated environment.

What this changes

We don't think of v1.4 as "Prism added budget caps." We think of it as Prism moving from a tool that makes AI cheaper to a tool you can actually commit your platform to. The argument for using a proxy at all gets stronger the more controls live in the proxy and the fewer live in each individual application.

A team that's been burning a couple of hundred dollars a week on a low-priority experimental feature can put a $50/month cap on the project and move on. They no longer have to remember to check the bill. The proxy remembers for them.

A team that's been told by procurement that they need to demonstrate cost controls before going to the next stage of the contract can point at /dashboard/policy and the audit timeline and answer the question on the spot.

A developer who joined the team last week and is unfamiliar with the cost differences between Opus and Haiku can't accidentally route 50,000 batch jobs to Opus. The rule says no, and the audit log records which rule fired and why.

That's the shape of the value: not "we spend less," but "we know what we're going to spend." Budgets aren't about not spending. They're about predictability. Policy isn't about restricting. It's about consistency.

Live today on Pro and Team accounts. Free and Paid customers see an upsell card; everything works for them as before, with zero added latency on the hot path because the policy stage short-circuits the moment it sees a non-subscriber tier.

If you're on Pro or Team, take ten minutes this week to set a budget cap on every project. The day you don't get the surprise email, you'll be glad you did.