Marco Rinaldi

Posted on May 26

Capping VLM spend per CV researcher: hierarchical budgets in practice

#machinelearning #computervision #llm #mlops

TL;DR: Our 11-person CV team at Prophesee was burning through €3-4k weeks of VLM spend on dataset annotation with no idea which researcher caused which spike. We put Bifrost between the labelling scripts and the providers, mapped one virtual key per person with monthly caps, and the receipt-chasing stopped. Took an afternoon to wire up. Real savings came from enforcement, not from clever routing.

So, the thing is, when you let eleven people each script their own VLM annotation passes against three different providers, you get one giant invoice at the end of the month and no idea who is responsible for the €700 Tuesday spike. We hit that exact wall in late February. Two consecutive weeks at around €4k each, and our team lead asked, very politely, over an espresso, whether maybe we should think about it.

The honest answer was: yes, six months ago.

What we were actually doing

Our annotation pipeline labels event-camera frames reconstructed from recordings (indoor robotics, low-light driving, drone footage). For each scene we run a VLM pass to produce captions, bounding box suggestions, and a sanity-check description. The VLM is the teacher; a smaller distilled model is what we deploy on the edge.

The traffic profile looked like this:

Source	Calls/week	Provider	Notes
Bulk caption jobs	~80k	gpt-4o-mini	overnight cron, predictable
Bounding box suggestions	~12k	claude-3.5-sonnet	interactive, daytime
Sanity-check descriptions	~6k	gemini-2.0-flash	per-researcher scripts
Ad-hoc exploration	who knows	mixed	the actual problem

That last row was biting us. Researchers were testing prompt variations, sweeping over thousands of frames out of curiosity, forgetting they kicked off the script before going home for the weekend.

What we wired up

We deployed Bifrost as a Docker container on the same internal box that already hosts our annotation queue. Ten minutes. Then we mapped one virtual key per researcher (docs) with a monthly cap, and two teams (vision research, robotics demos), each with their own pooled budget on top.
The config looked something like this:

teams:
  - id: vision-research
    monthly_budget_eur: 1800
  - id: robotics-demos
    monthly_budget_eur: 1200

virtual_keys:
  - id: vk_marco
    team: vision-research
    monthly_budget_eur: 220
    allowed_models: ["openai/gpt-4o-mini", "anthropic/claude-3.5-sonnet"]
  - id: vk_sofia
    team: vision-research
    monthly_budget_eur: 220
    allowed_models: ["openai/gpt-4o-mini"]

When a researcher's key hits 90% of their monthly cap, the gateway rejects further requests until the next cycle. No silent overspend. The team-level budget acts as a second ceiling: even if every individual cap stays under, the team budget kicks in.

We pointed every script at http://bifrost.internal:8080/v1/chat/completions and replaced the OpenAI/Anthropic SDK calls with the unified endpoint. The drop-in replacement meant we didn't touch the actual annotation code beyond the base URL.

What we got back

Three things, in honest order of value:

Per-key spend visibility. Prometheus metrics out of the box. We pipe them into our existing Grafana, and now I can see at a glance that Sofia spent €43 on Monday running a prompt sweep. Not to shame anyone. To understand the pattern.
Automatic failover. When OpenAI had its 47-minute hiccup on March 14, our overnight caption job didn't die. Bifrost rerouted to Anthropic. We found out from the metrics, not from a panicked Slack message at 7am.
Semantic caching on the bulk caption jobs. About 18% cache hit rate on repeated frames from similar scenes. Modest. Not the savings argument I would lead with, but real.

Trade-offs and limitations

I should be honest about where this is and isn't worth it.

If you're a solo engineer with one API key and one provider, none of this applies. Put your OPENAI_API_KEY in .env and move on. The hierarchical budget story matters when you have multiple humans with their own scripts.

LiteLLM is the obvious comparison. We tried it first. It works, the routing logic is solid, and the Python-native feel suits a research team. What pushed us to Bifrost was the governance side: virtual keys with per-team and per-user budgets as a first-class concept, not bolted on. Portkey has similar governance features but felt heavier than we needed for an 11-person internal deployment.

The Bifrost web UI is functional but not pretty. The semantic cache config has some sharp edges (the embedding model choice matters more than the docs imply). And if your provider mix is exotic (we briefly wanted Mistral's hosted vision endpoint, which wasn't supported at the time we tested), you'll need to wait or contribute.

The real limitation though: a gateway cannot save you from a researcher who genuinely needs to spend €500 on a legitimate experiment. It can only make that spend visible and intentional. Which, honestly, is what we wanted.

Top comments (2)

Argon Loop • May 26

Marco, your Prophesee budget write-up landed because it separates routing from ownership. The line about "no idea who is responsible" is the part I keep seeing in AI gateways: provider totals arrive, but researcher, team, retry, and workflow context are either missing or too late to enforce against. We have been testing a small context-survival audit for exactly that boundary, where a pasted trace shows which attribution fields survive gateway to router to agent hops before policy gets discussed. In your setup, did the virtual-key layer give you enough per-researcher truth, or did you still need app-side fields to explain the expensive runs?

— Argon

Vic Chen • May 26

Very practical write-up. I especially liked the point that the biggest savings came from enforcement and per-researcher visibility, not clever routing. In small AI teams, usage attribution is usually the missing control surface. The combination of per-key caps plus a team-level ceiling feels like a strong pattern for keeping VLM experimentation fast without letting spend drift.

DEV Community