A VLM gate for generated images, with provider failover via Bifrost

#computervision #llm #machinelearning #infrastructure

TL;DR: At Photoroom we run a vision-language model as the last check before a generated product image reaches a customer. When one VLM provider degrades, that gate stalls and images back up. We put Bifrost in front of those calls for automatic failover and per-team budgets. Here is what it fixed, and what it didn't.

The gate everything waits behind

Some context first, because the architecture matters here. Photoroom generates product photos with diffusion models. Before an image is served, it passes through a VLM check: is the cutout clean, are there hallucinated artifacts near the object edge, is the contact shadow physically plausible. We started with GPT-4o-mini for this, then added Claude and Gemini for a second opinion on the borderline cases.

The nuance here is that this VLM step is synchronous on the serving path. The diffusion sampling can be perfect, 28 steps, 1.4 seconds on an L40S, and still the customer waits if the moderation call hangs.

And it did hang. In March we logged a 6-minute window where one provider's vision endpoint returned 529s on roughly 40% of requests. Our gate has a 12-second timeout, so those requests didn't fail fast. They sat. The queue depth on our caption-and-check workers tripled before our own retry logic gave up.

What we wanted from a gateway

I sketched the requirements on the whiteboard before evaluating anything, which is how I usually start. Three things.

One, a single multimodal interface so the same base64 image payload could hit OpenAI, Anthropic, or Vertex without three separate client wrappers in our Python service. Two, automatic failover, so a provider returning 529s gets skipped without us shipping new code. Three, per-team budget accounting, because the research org and the production org share keys and I could never tell which experiment burned the monthly spend.

We looked at LiteLLM, Portkey, and Bifrost. We run Bifrost now. To be precise about why, it's a Go binary with low added latency, and we self-host it next to our inference cluster in the same VPC, so there's no extra network hop to a SaaS control plane.

The config that replaced our retry code

The drop-in replacement is the part that sold the team. Our service already spoke the OpenAI chat-completions format for the vision call, so we changed the base URL and deleted about 80 lines of hand-rolled retry-and-rotate logic.

The fallback chain lives in config, not in our application:

# bifrost provider + fallback config (abridged)
providers:
  openai:
    keys:
      - value: env.OPENAI_KEY_PROD
        weight: 1.0
  anthropic:
    keys:
      - value: env.ANTHROPIC_KEY_PROD
        weight: 1.0

# in the request body our service sends:
# "model": "openai/gpt-4o-mini",
# "fallbacks": ["anthropic/claude-3-5-sonnet"]

The image goes in as a normal multimodal message. Bifrost exposes text, images, and streaming behind the one interface, so the VLM payload is the same shape regardless of which provider answers it. When OpenAI throws 529s, the request continues to Claude without our code knowing.

Did it actually help

Yes, with a caveat I'll get to. We re-ran a fault injection by pointing one provider key at a dead endpoint during a low-traffic window. Before Bifrost, that scenario drained our worker pool in under two minutes. With the fallback chain, gate p95 went from 0.9s to 2.3s during the fault and zero images stalled past the timeout. Higher latency, but the queue stayed flat.

The budget piece was the quiet win. With virtual keys we gave the research org and the production org separate scoped keys under the same providers, and now I can read spend per team in the dashboard instead of reconciling provider invoices by hand at month end.

Bifrost vs LiteLLM vs Portkey

I want to be fair here, because each tool is stronger somewhere.

Concern	Bifrost	LiteLLM	Portkey
Language / overhead	Go, low added latency	Python, heavier under load	SaaS-first
Provider breadth	23+	very broad, most mature	broad
Multimodal failover	config-level	supported	supported
Self-host in your VPC	yes	yes	self-host is secondary
Governance / virtual keys	built in	basic	polished, mature

LiteLLM has the widest provider list and the most battle-tested Python integration. If your stack is entirely Python and you want one library, it's a reasonable default. Portkey's hosted dashboard and guardrails are more finished than anything I'd self-build, and for a team that doesn't want to run infrastructure it's the lower-effort path. We picked Bifrost because we already self-host inference and the low-latency Go proxy in-VPC mattered more to us than a managed UI.

Trade-offs and Limitations

The gateway is one more hop on a latency-sensitive path. Measured overhead is small, but it's not zero, and for a synchronous gate I had to actually verify it rather than assume.

Failover changes which model judged the image. A Claude artifact-check and a GPT-4o-mini artifact-check don't agree 100% of the time, so during a failover our reject rate shifted by about 3 points. We now log which provider served each verdict, because otherwise a quality drift looks like a model regression when it's really a routing event.

Semantic caching exists in Bifrost and helps text, but for our VLM gate every image is unique, so cache hit rate is near zero there. Don't expect caching to save you on novel-image moderation.

And it's still infrastructure you own. A gateway in front of your providers can become a single point of failure if you run one instance. We run two behind a load balancer and test the failure path on a schedule.