TL;DR: Our 11-person CV team at Prophesee was burning through €3-4k weeks of VLM spend on dataset annotation with no idea which researcher caused which spike. We put Bifrost between the labelling scripts and the providers, mapped one virtual key per person with monthly caps, and the receipt-chasing stopped. Took an afternoon to wire up. Real savings came from enforcement, not from clever routing.
So, the thing is, when you let eleven people each script their own VLM annotation passes against three different providers, you get one giant invoice at the end of the month and no idea who is responsible for the €700 Tuesday spike. We hit that exact wall in late February. Two consecutive weeks at around €4k each, and our team lead asked, very politely, over an espresso, whether maybe we should think about it.
The honest answer was: yes, six months ago.
What we were actually doing
Our annotation pipeline labels event-camera frames reconstructed from recordings (indoor robotics, low-light driving, drone footage). For each scene we run a VLM pass to produce captions, bounding box suggestions, and a sanity-check description. The VLM is the teacher; a smaller distilled model is what we deploy on the edge.
The traffic profile looked like this:
| Source | Calls/week | Provider | Notes |
|---|---|---|---|
| Bulk caption jobs | ~80k | gpt-4o-mini | overnight cron, predictable |
| Bounding box suggestions | ~12k | claude-3.5-sonnet | interactive, daytime |
| Sanity-check descriptions | ~6k | gemini-2.0-flash | per-researcher scripts |
| Ad-hoc exploration | who knows | mixed | the actual problem |
That last row was biting us. Researchers were testing prompt variations, sweeping over thousands of frames out of curiosity, forgetting they kicked off the script before going home for the weekend.
What we wired up
We deployed Bifrost as a Docker container on the same internal box that already hosts our annotation queue. Ten minutes. Then we mapped one virtual key per researcher (docs) with a monthly cap, and two teams (vision research, robotics demos), each with their own pooled budget on top.
The config looked something like this:
teams:
- id: vision-research
monthly_budget_eur: 1800
- id: robotics-demos
monthly_budget_eur: 1200
virtual_keys:
- id: vk_marco
team: vision-research
monthly_budget_eur: 220
allowed_models: ["openai/gpt-4o-mini", "anthropic/claude-3.5-sonnet"]
- id: vk_sofia
team: vision-research
monthly_budget_eur: 220
allowed_models: ["openai/gpt-4o-mini"]
When a researcher's key hits 90% of their monthly cap, the gateway rejects further requests until the next cycle. No silent overspend. The team-level budget acts as a second ceiling: even if every individual cap stays under, the team budget kicks in.
We pointed every script at http://bifrost.internal:8080/v1/chat/completions and replaced the OpenAI/Anthropic SDK calls with the unified endpoint. The drop-in replacement meant we didn't touch the actual annotation code beyond the base URL.
What we got back
Three things, in honest order of value:
- Per-key spend visibility. Prometheus metrics out of the box. We pipe them into our existing Grafana, and now I can see at a glance that Sofia spent €43 on Monday running a prompt sweep. Not to shame anyone. To understand the pattern.
- Automatic failover. When OpenAI had its 47-minute hiccup on March 14, our overnight caption job didn't die. Bifrost rerouted to Anthropic. We found out from the metrics, not from a panicked Slack message at 7am.
- Semantic caching on the bulk caption jobs. About 18% cache hit rate on repeated frames from similar scenes. Modest. Not the savings argument I would lead with, but real.
Trade-offs and limitations
I should be honest about where this is and isn't worth it.
If you're a solo engineer with one API key and one provider, none of this applies. Put your OPENAI_API_KEY in .env and move on. The hierarchical budget story matters when you have multiple humans with their own scripts.
LiteLLM is the obvious comparison. We tried it first. It works, the routing logic is solid, and the Python-native feel suits a research team. What pushed us to Bifrost was the governance side: virtual keys with per-team and per-user budgets as a first-class concept, not bolted on. Portkey has similar governance features but felt heavier than we needed for an 11-person internal deployment.
The Bifrost web UI is functional but not pretty. The semantic cache config has some sharp edges (the embedding model choice matters more than the docs imply). And if your provider mix is exotic (we briefly wanted Mistral's hosted vision endpoint, which wasn't supported at the time we tested), you'll need to wait or contribute.
The real limitation though: a gateway cannot save you from a researcher who genuinely needs to spend €500 on a legitimate experiment. It can only make that spend visible and intentional. Which, honestly, is what we wanted.
Further reading
- Bifrost virtual keys and governance
- Budget and limits documentation
- Drop-in replacement guide
- Prometheus observability defaults
- [LiteLLM router docs (for comparison)]](https://docs.litellm.ai/docs/routing)
Top comments (0)