Auto-labelling 1.2M robotics frames with VLMs: a failover story

#computervision #mlops #llm

TL;DR: We needed to caption 1.2M reconstructed event-camera frames using vision-language models for auxiliary supervision. The first run died at 340K from Anthropic rate limits. Putting Bifrost in front of three VLM providers cut the rerun cost by 22% and finished in 9 hours.

So, the thing is, when you work at a neuromorphic vision startup, your training data looks strange. At Prophesee we accumulate event streams into time-binned windows that we render into pseudo-frames. For a self-supervised pretraining run on a new asynchronous backbone, we wanted natural-language captions on every window. Not because we're going language-first. The captions act as auxiliary targets for a contrastive head that sits alongside the actual event tensor.

1.2M frames. Three candidate VLMs: GPT-4o, Claude 3.7 Sonnet, Gemini 2.5 Pro. All three caption our weird greyscale reconstructions differently enough that we wanted a mix per frame.

I tried Anthropic first because the captions were qualitatively the best on our pilot set. Job died at 340,317 captions on a sustained TPM cap. That was a Friday evening before a long weekend in Bologna. I lost the weekend.

Choosing a gateway over more retry code

My first instinct was to write a smarter retry loop. Every CV engineer has this instinct when they discover REST APIs aren't deterministic. After about three hours of writing what was clearly going to become a half-baked rate-limit handler with provider-specific quirks, I stopped.

The actual problem was that I had multiple providers, all with their own SDKs and their own error formats. I needed something in the middle that knew about quotas, retries, and fallback chains, and that wasn't going to require me to learn yet another vendor lock-in.

I looked at LiteLLM, Portkey, and Bifrost. Ended up running Bifrost in Docker on the same node as the batch dispatcher.

The setup

Bifrost runs as a single Go binary or container. The config that mattered for us was the fallback chain. Here's the trimmed version we shipped:

providers:
  openai:
    keys: [${OPENAI_KEY_1}, ${OPENAI_KEY_2}]
    weight: 0.5
  anthropic:
    keys: [${ANTHROPIC_KEY_1}]
    weight: 0.3
  vertex:
    keys: [${VERTEX_KEY_1}]
    weight: 0.2

fallbacks:
  - model: openai/gpt-4o
    next: [anthropic/claude-3-7-sonnet, vertex/gemini-2.5-pro]
  - model: anthropic/claude-3-7-sonnet
    next: [openai/gpt-4o, vertex/gemini-2.5-pro]

Our batch dispatcher called http://bifrost:8080/v1/chat/completions with whatever model we picked for that frame. If a provider was over quota, Bifrost handled the failover and the dispatcher never saw the error. That part is documented under retries and fallbacks.

We also turned on semantic caching for the prompt template because we caption a lot of near-identical static scenes. Robotics demos have long boring stretches. Cache hit rate landed around 14% on the full run, which isn't huge but covered the cost of running the gateway itself.

How it compared

Concern	LiteLLM	Portkey	Bifrost
Multi-provider failover	Yes	Yes	Yes
Self-hosted in our VPC	Yes	Paid tier	Yes (Docker)
Semantic caching built-in	Plugin	Yes	Yes
Prometheus metrics native	Partial	Yes	Yes
Single binary deploy	No (Python)	N/A (SaaS)	Yes (Go)
800 req/s sustained	GIL issues	N/A	Held

LiteLLM was the most familiar option for our team because we already use it for eval scripts. Honestly for offline single-process work it's fine. The problem hit us when we tried to push sustained throughput through one Python process. Bifrost being Go meant we didn't fight the GIL. Portkey's hosted product is genuinely nice and the analytics UI is better than what Bifrost shipped, but we needed everything inside our VPC for frames covered by client confidentiality.

Results

The full 1.2M caption run finished in 9 hours and 14 minutes. Total cost was $4,180, down from a projected $5,360 if we'd run everything on GPT-4o. The 22% saving came from routing roughly a third of traffic to Gemini, which is cheaper per token for our prompt length.

Two providers had transient 429 spikes during the run. I didn't have to do anything about either. The gateway absorbed them. I noticed only because the per-provider request graph in the Bifrost dashboard had a visible dip on Anthropic around hour four.

Trade-offs and limitations

Not everything was clean.

Latency overhead. Bifrost adds a hop. For batch labelling it didn't matter. For an interactive vision app streaming a webcam, I'd benchmark carefully before putting any gateway in the path.

Caption drift across providers. Captions from Gemini and Claude are stylistically different even with the same prompt. We had to normalise downstream with a small T5 rewriter. The gateway doesn't solve this for you.

Config sprawl. Once you have weights, fallbacks, virtual keys, and cache rules in one YAML, it gets hard to reason about which path a given request actually took. Bifrost's logging helped but I had to dig.

MCP and tool use. We didn't need them. If you're building an agent product instead of a labelling pipeline, the MCP support might matter more than failover.

What I'd do differently

Run a 5K-frame pilot before launching the full job. We did 50K, which was enough to catch the rate-limit issue conceptually but not enough to see what 800 req/s sustained does to a Python process. Also: drink the espresso before debugging gateway configs at 1am, not after.