VLM-scored calibration sets for INT8 quantisation, routed through Bifrost

#computervision #pytorch #machinelearning #mlops

TL;DR: We pick the 512 hardest images for INT8 PTQ calibration by scoring a candidate pool with a small VLM. Bifrost sits between our calibration pipeline and four providers, gives us semantic caching, per-engineer virtual keys, and hard budget caps so a runaway loop can't burn EUR800 of API spend over the weekend.

So, the thing is, calibration set selection is one of those topics nobody writes about until the day your INT8 model is 4.2 points of mAP worse than the fp16 reference and you have a release branch already cut. Pick the wrong 512 images and your activation histograms get biased toward easy frames. Pick the right 512 and the gap closes to 0.6 points without any QAT work at all.

For about a year we picked them by hand. Then we tried random sampling, then stratified by class. The thing that finally worked, on an industrial defect detector we shipped to a customer last quarter, was scoring our candidate pool with a vision-language model and biasing toward the high-difficulty tail. That sounds neat in a paper. In production it meant roughly 80k API calls per release cycle, four providers in rotation, three engineers all running their own ablations, and one very pointed email from finance about the November bill.

That's where Bifrost came in.

What we actually wanted from a gateway

A short list, in order of how much pain each item was causing us:

One API surface so the calibration script doesn't care if we're hitting GPT-4o, Claude Sonnet, Gemini, or our self-hosted Qwen2-VL.
Per-experiment budget caps. If someone's loop goes infinite over a Friday night, I want it to die at EUR50, not EUR5,000.
Semantic caching, because we score the same images across many ablations and paying twice for the same prompt+image hash is wasteful.
Per-engineer virtual keys so I can see who spent what without parsing four billing consoles.
Real observability. Prometheus metrics over a custom dashboard built by the intern who's now in Berlin.

Bifrost gave us all five. Setup took less than an afternoon, mostly spent arguing about key naming conventions.

The scoring loop

The actual code is boring, which is good. Our calibration scorer is a single async function that streams candidates from S3, calls Bifrost, parses a difficulty score from the JSON response, and writes it back to a Parquet shard.

providers:
openai:
keys:
- value: env:OPENAI_KEY_A
weight: 0.5
- value: env:OPENAI_KEY_B
weight: 0.5
anthropic:
keys:
- value: env:ANTHROPIC_KEY
governance:
virtual_keys:
- id: vk_calib_marco
budget_eur: 200
- id: vk_calib_giulia
budget_eur: 200
semantic_cache:
enabled: true
similarity_threshold: 0.92

Each engineer gets their own virtual key. The budget is hard. Hit EUR200 and Bifrost returns 429 until the cap is lifted. This sounds harsh until you remember the November bill.

Semantic caching pays for itself faster than I expected. About 38% of our candidate frames recur across ablations because we keep iterating on the difficulty prompt rather than the image set itself. Cache hits cost us nothing and return in under 40ms.

How it compares to LiteLLM and Portkey

We did genuinely evaluate both before committing.

Capability	Bifrost	LiteLLM	Portkey
OpenAI-compatible API	Yes	Yes	Yes
Multimodal request routing	Yes	Yes	Yes
Semantic caching	Built-in	Plugin	Built-in
Hierarchical virtual keys / budgets	Built-in	Limited	Built-in
Self-hosted single Go binary	Yes	Python	Hosted-first
Native Prometheus metrics	Yes	Yes	Yes

Portkey's hosted dashboards are more polished. If you don't want to run anything yourself, that's the right call. LiteLLM's Python ecosystem is broader and their community is bigger. We picked Bifrost because we wanted a self-hosted Go binary that fit alongside our existing inference services, with a governance model that matched how the team already splits experiments by engineer.

Trade-offs and Limitations

A few things to be honest about.

The semantic cache is great for repeated calibration runs and useless if your prompt template changes per experiment. We pin the prompt template and version it like model weights. Without that discipline the cache hit rate dropped to about 6%.

VLM scoring is not deterministic. Two scoring passes on the same 80k images gave us a Spearman correlation of 0.94, which is fine for picking calibration sets but would not be fine for, say, regulatory model documentation.

Bifrost's MCP support is good but we don't use it for this workflow. The scoring loop is a plain HTTP client, no tool use. I mention it because the README puts MCP front and centre and you might assume you need it. You don't.

The built-in UI is React-based and feels a generation behind Grafana for ad-hoc exploration. We export Prometheus metrics to our existing Grafana stack and ignore the bundled dashboard. Personal preference, not a bug.