Per-customer budget caps on our caption pipeline: 3 weeks with virtual keys

#mlops #llm #machinelearning #infrastructure

TL;DR: We were burning around €4,200/month in vision-LLM costs across roughly 80 customers, with zero way to tell who was responsible. Bifrost virtual keys plus per-customer budgets gave us hard caps and clean attribution in a couple of days. Semantic caching saved another 34%, though it needed more babysitting than the README implies.

At Photoroom we ship diffusion-based product photography to about half a million users. The diffusion side is our own infra, GPU pool, custom UNet, the lot. The part nobody writes blog posts about is the captioning and safety-filtering step that runs before each generation, plus a prompt rewriter. Those calls go out to OpenAI's gpt-4o-mini and Anthropic's claude-haiku-4-5-20251001 depending on the route.

For months we treated those calls as overhead. Two API keys, one invoice per provider, no real attribution. Then in April our bill went from €1,800 to €4,200 in three weeks. To be precise: nothing had launched. A single enterprise customer was retrying caption generation in a loop because their pipeline interpreted our 429s as transient.

That was the trigger to put something in front of the providers.

What we tried first

The obvious first pass was a small Python proxy. Maybe 150 lines, sitting between our worker and OpenAI. It worked. For about a week. Then we needed per-customer rate limits, then budget caps, then a second provider, then someone in finance asked for usage reports by customer ID.

This is the point where most teams adopt a real gateway. We compared three.

Feature	Bifrost	LiteLLM	Portkey
Self-hostable	Yes	Yes	Yes (paid tier)
OpenAI-compatible endpoint	Yes	Yes	Yes
Per-customer virtual keys	Yes	Yes	Yes
Hierarchical budgets (key → team → customer)	First-class	Limited	Yes
Semantic caching built-in	Yes	Plugin	Yes
Prometheus metrics out of the box	Yes	Add-on	Hosted dashboard
Web UI for config	Functional	Minimal	The nicest

Portkey's dashboard is honestly the nicest of the three. LiteLLM has the longest tail of niche providers. We picked Bifrost because we wanted a self-hosted box where the budget hierarchy was first-class, and we wanted customer IDs to stay out of any third-party SaaS for GDPR reasons.

Setting up virtual keys

Step one was running Bifrost as a sidecar to our Python workers.

docker run -d \
  -p 8080:8080 \
  -v $(pwd)/config.json:/app/config.json \
  -e OPENAI_API_KEY=$OPENAI_API_KEY \
  -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
  maximhq/bifrost

Then a virtual key per customer, with a monthly cap and an allow-list of models:

virtual_keys:
  - id: vk_acme_corp
    customer_id: acme_corp
    allowed_models:
      - openai/gpt-4o-mini
      - anthropic/claude-haiku-4-5-20251001
    budget:
      monthly_usd: 200
      hard_cap: true
    rate_limit:
      requests_per_minute: 60

Our worker now sends Authorization: Bearer vk_acme_corp instead of the raw provider key. When acme_corp hits €200, the gateway returns a typed 429 and we surface "monthly quota reached" in the customer's UI. The customer that triggered the April spike is capped at €150 now, and we sleep at night.

The whole rollout took a long Tuesday. Most of that was rewriting our worker to thread the customer_id through every call, not the gateway itself.

Where semantic caching helped (and where it bit us)

Bifrost's semantic cache hashes the embedding of the prompt and returns a cached completion when cosine similarity exceeds a threshold. For caption generation on near-duplicate product photos, this is a big deal. We saw a 34% hit rate over the first 10 days.

The catch: our prompt includes brand and SKU for context, but the brand string was occasionally elided when the customer's metadata was incomplete. Two prompts differing only in brand: null vs brand: Nike hashed close enough to collide, and a Nike trainer got captioned with a placeholder description. Not great.

We fixed it by raising the similarity threshold from 0.92 to 0.97 and adding the SKU as a cache namespace prefix. Worth a proper ablation paper-style. We have not had time.

Trade-offs and Limitations

A few things to know before you adopt this.

The Bifrost UI is functional but newer than Portkey's. If your finance team wants a self-service dashboard they can poke at without engineer help, Portkey is ahead today. We plumbed Bifrost's Prometheus output into our existing Grafana instance instead, which took an afternoon.

LiteLLM has a longer tail of obscure providers. If you route to OpenRouter sub-models or self-hosted vLLM endpoints with non-standard URL shapes, check the docs first.

Hard budget caps are hard. A customer who hits the cap mid-batch will see a 429 even if they're €0.10 over. We added a soft-cap warning at 80% and an autoscale-up flow for enterprise contracts so the experience does not feel punitive.

The semantic cache will collide on prompts that differ in metadata you forgot to include in the embedding. Tune the threshold, namespace aggressively, and assume you will discover one edge case per week for the first month.

Top comments (1)

Harjot Singh • May 31

Per-customer budget caps via virtual keys is exactly the right architecture for any AI product with variable usage - it solves the nightmare where one heavy customer's usage silently torches your margin on a flat plan. Issuing a scoped virtual key per customer with its own cap means the blast radius of a runaway is contained to that customer, and you get per-customer cost attribution for free (which is gold for pricing decisions later). That's the difference between "our AI costs are a mystery" and "we know exactly who costs what."

The subtle win in the 3-weeks-in retrospective is usually operational: caps turn cost from a thing you anxiously monitor into a thing that's structurally bounded - you stop watching the dashboard in fear because the cap enforces itself. That enforced-ceiling-per-unit principle is the same discipline I build on in Moonshift (a multi-agent pipeline that ships a prompt to a deployed SaaS) - hard caps per build so cost is bounded by construction, ~$3 flat. Really practical writeup, virtual keys are underused for this. 3 weeks in, what surprised you most - the cost visibility per customer, or how much the caps changed your pricing/packaging thinking? The pricing knock-on is often the bigger story.