TL;DR: We were burning around €4,200/month in vision-LLM costs across roughly 80 customers, with zero way to tell who was responsible. Bifrost virtual keys plus per-customer budgets gave us hard caps and clean attribution in a couple of days. Semantic caching saved another 34%, though it needed more babysitting than the README implies.
At Photoroom we ship diffusion-based product photography to about half a million users. The diffusion side is our own infra, GPU pool, custom UNet, the lot. The part nobody writes blog posts about is the captioning and safety-filtering step that runs before each generation, plus a prompt rewriter. Those calls go out to OpenAI's gpt-4o-mini and Anthropic's claude-haiku-4-5-20251001 depending on the route.
For months we treated those calls as overhead. Two API keys, one invoice per provider, no real attribution. Then in April our bill went from €1,800 to €4,200 in three weeks. To be precise: nothing had launched. A single enterprise customer was retrying caption generation in a loop because their pipeline interpreted our 429s as transient.
That was the trigger to put something in front of the providers.
What we tried first
The obvious first pass was a small Python proxy. Maybe 150 lines, sitting between our worker and OpenAI. It worked. For about a week. Then we needed per-customer rate limits, then budget caps, then a second provider, then someone in finance asked for usage reports by customer ID.
This is the point where most teams adopt a real gateway. We compared three.
| Feature | Bifrost | LiteLLM | Portkey |
|---|---|---|---|
| Self-hostable | Yes | Yes | Yes (paid tier) |
| OpenAI-compatible endpoint | Yes | Yes | Yes |
| Per-customer virtual keys | Yes | Yes | Yes |
| Hierarchical budgets (key → team → customer) | First-class | Limited | Yes |
| Semantic caching built-in | Yes | Plugin | Yes |
| Prometheus metrics out of the box | Yes | Add-on | Hosted dashboard |
| Web UI for config | Functional | Minimal | The nicest |
Portkey's dashboard is honestly the nicest of the three. LiteLLM has the longest tail of niche providers. We picked Bifrost because we wanted a self-hosted box where the budget hierarchy was first-class, and we wanted customer IDs to stay out of any third-party SaaS for GDPR reasons.
Setting up virtual keys
Step one was running Bifrost as a sidecar to our Python workers.
docker run -d \
-p 8080:8080 \
-v $(pwd)/config.json:/app/config.json \
-e OPENAI_API_KEY=$OPENAI_API_KEY \
-e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
maximhq/bifrost
Then a virtual key per customer, with a monthly cap and an allow-list of models:
virtual_keys:
- id: vk_acme_corp
customer_id: acme_corp
allowed_models:
- openai/gpt-4o-mini
- anthropic/claude-haiku-4-5-20251001
budget:
monthly_usd: 200
hard_cap: true
rate_limit:
requests_per_minute: 60
Our worker now sends Authorization: Bearer vk_acme_corp instead of the raw provider key. When acme_corp hits €200, the gateway returns a typed 429 and we surface "monthly quota reached" in the customer's UI. The customer that triggered the April spike is capped at €150 now, and we sleep at night.
The whole rollout took a long Tuesday. Most of that was rewriting our worker to thread the customer_id through every call, not the gateway itself.
Where semantic caching helped (and where it bit us)
Bifrost's semantic cache hashes the embedding of the prompt and returns a cached completion when cosine similarity exceeds a threshold. For caption generation on near-duplicate product photos, this is a big deal. We saw a 34% hit rate over the first 10 days.
The catch: our prompt includes brand and SKU for context, but the brand string was occasionally elided when the customer's metadata was incomplete. Two prompts differing only in brand: null vs brand: Nike hashed close enough to collide, and a Nike trainer got captioned with a placeholder description. Not great.
We fixed it by raising the similarity threshold from 0.92 to 0.97 and adding the SKU as a cache namespace prefix. Worth a proper ablation paper-style. We have not had time.
Trade-offs and Limitations
A few things to know before you adopt this.
The Bifrost UI is functional but newer than Portkey's. If your finance team wants a self-service dashboard they can poke at without engineer help, Portkey is ahead today. We plumbed Bifrost's Prometheus output into our existing Grafana instance instead, which took an afternoon.
LiteLLM has a longer tail of obscure providers. If you route to OpenRouter sub-models or self-hosted vLLM endpoints with non-standard URL shapes, check the docs first.
Hard budget caps are hard. A customer who hits the cap mid-batch will see a 429 even if they're €0.10 over. We added a soft-cap warning at 80% and an autoscale-up flow for enterprise contracts so the experience does not feel punitive.
The semantic cache will collide on prompts that differ in metadata you forgot to include in the embedding. Tune the threshold, namespace aggressively, and assume you will discover one edge case per week for the first month.
Further Reading
- Bifrost virtual keys docs
- Budget and limits hierarchy
- Semantic caching reference
- Bifrost on GitHub
- LiteLLM's budget docs for comparison
Top comments (0)