Unifying image inputs across three vision providers behind Bifrost

#machinelearning #computervision #llm #infrastructure

TL;DR: We run an automated visual QA step that scores generated product shots with vision LLMs from OpenAI, Anthropic, and Google. Each provider wanted the image payload shaped differently, and one rate-limit spike could stall the whole batch. Putting Bifrost in front gave us one OpenAI-compatible image schema and automatic failover, with about 4ms of added latency per call.

At Photoroom I work on the diffusion side of product photography. The model that generates a clean studio shot is only half the job. The other half is deciding, automatically, whether the output is actually usable before it reaches a customer.

So we built a QA scorer. It sends each generated render to a vision model and asks for a structured verdict: background artifacts, clipped edges, color drift against the source. We send the same image to more than one provider because the failure modes differ, and a single model's blind spots leak through otherwise.
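
One way to model such a verdict, as a minimal sketch: the `RenderVerdict` class, its field names, the `parse_verdict` helper, and the "usable only if nothing is flagged" rule are all hypothetical, chosen only to mirror the three checks named above.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RenderVerdict:
    """Hypothetical verdict shape; field names are illustrative assumptions."""
    background_artifacts: bool
    clipped_edges: bool
    color_drift: bool

    @property
    def usable(self) -> bool:
        # Assumed policy: a render passes only if no check flagged a problem.
        return not (self.background_artifacts or self.clipped_edges or self.color_drift)


def parse_verdict(raw_json: str) -> RenderVerdict:
    """Parse a model's JSON reply, failing loudly on missing or non-boolean fields."""
    payload = json.loads(raw_json)
    if not isinstance(payload, dict):
        raise ValueError("verdict must be a JSON object")
    fields = ("background_artifacts", "clipped_edges", "color_drift")
    for name in fields:
        if not isinstance(payload.get(name), bool):
            raise ValueError(f"verdict field {name!r} missing or not a boolean")
    return RenderVerdict(**{name: payload[name] for name in fields})
```

Validating the reply strictly matters here because each provider can phrase or truncate its JSON differently, and a silently defaulted field would hide exactly the blind spot the second model is there to catch.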

That is where the mess started.

Three providers, three image schemas

"OpenAI-compatible vision" is not a settled standard. Each provider's native API wraps the image in a different message envelope.

OpenAI wants image_url content parts, where the url is either a base64 data: URI or a regular URL. Anthropic's native API wants a source block with type: base64, a media_type, and the base64-encoded data. Google Vertex wants inline_data with a mime_type and the base64 payload. Our scorer started life with three code paths, three sets of size limits, and three retry policies that drifted out of sync within a month.
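
To make the divergence concrete, here is a minimal sketch that wraps the same PNG bytes as each provider's native image part. The helper function names are hypothetical. The dictionary shapes follow the three formats described above as publicly documented, so check them against the current provider docs before relying on them.

```python
import base64


def _b64(image_bytes: bytes) -> str:
    # All three providers take base64 text, not raw bytes, inside JSON.
    return base64.b64encode(image_bytes).decode("ascii")


def openai_image_part(image_bytes: bytes, mime_type: str = "image/png") -> dict:
    # OpenAI-style: a content part carrying a data: URI (a plain URL also works).
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{_b64(image_bytes)}"},
    }


def anthropic_image_part(image_bytes: bytes, mime_type: str = "image/png") -> dict:
    # Anthropic-style: an image block with an explicit base64 source.
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": mime_type, "data": _b64(image_bytes)},
    }


def gemini_image_part(image_bytes: bytes, mime_type: str = "image/png") -> dict:
    # Gemini-style: an inline_data part with a mime_type and base64 data.
    return {"inline_data": {"mime_type": mime_type, "data": _b64(image_bytes)}}
```

The pixels are identical in all three cases. Only the wrapper differs, which is why the branching ends up in plumbing code rather than anywhere interesting.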

For a 12-image batch per product across two providers, that branching logic was the part that broke most often. Not the diffusion model. The plumbing.

What we changed

We dropped Bifrost in as the gateway and pointed the scorer at a single endpoint. It exposes one OpenAI-compatible API across 23+ providers, so the image part is always written the OpenAI way, and Bifrost translates to whatever the target provider expects. Multimodal support for text and images sits behind that common interface.

One request shape now. The only thing that changes per call is the model string.

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anthropic/claude-sonnet-4-6",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Score this render for edge clipping. Return JSON."},
        {"type": "image_url",
         "image_url": {"url": "data:image/png;base64,iVBORw0K..."}}
      ]
    }]
  }'

Swap anthropic/claude-sonnet-4-6 for openai/gpt-4o or vertex/gemini-2.5-pro and the body stays identical. The scorer no longer knows or cares how each provider encodes pixels.
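
The same idea on the scorer side, as a minimal sketch: `build_scoring_body` is a hypothetical helper, and the model strings are the ones used in this post. The resulting dictionaries could be serialized and POSTed to the /v1/chat/completions endpoint from the curl example. No network call is made here.

```python
import base64
import json


def build_scoring_body(model: str, image_bytes: bytes, prompt: str) -> dict:
    """Build one OpenAI-style chat body; only `model` varies per provider."""
    data_uri = "data:image/png;base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": data_uri}},
            ],
        }],
    }


MODELS = ["anthropic/claude-sonnet-4-6", "openai/gpt-4o", "vertex/gemini-2.5-pro"]
PROMPT = "Score this render for edge clipping. Return JSON."
render = b"\x89PNG\r\n\x1a\n"  # placeholder bytes standing in for a real render

bodies = [build_scoring_body(m, render, PROMPT) for m in MODELS]

# Every body shares the same messages; the model string is the only difference.
assert all(b["messages"] == bodies[0]["messages"] for b in bodies)
print(json.dumps(bodies[0], indent=2))
```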

Failover instead of a dead batch

The second problem was throughput. When one vision provider returned 429s during a busy stretch, our batch queue used to back up because the scorer kept hammering the same key.

Bifrost's automatic fallbacks let us declare an ordered list. If the primary returns an error or times out, the request moves to the next provider with the same payload. No code change in the scorer.

fallbacks:
  - openai/gpt-4o
  - anthropic/claude-sonnet-4-6
  - vertex/gemini-2.5-pro
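
Conceptually, an ordered fallback behaves like the loop below. This is an illustration of the pattern only, not Bifrost's implementation: `ProviderError`, `call_with_fallbacks`, and `fake_send` are hypothetical names, and the simulated 429 stands in for a real rate-limit response.

```python
from typing import Callable, Optional, Sequence


class ProviderError(Exception):
    """Hypothetical error type for a failed provider call (for example a 429)."""


def call_with_fallbacks(models: Sequence[str], send: Callable[[str], dict]) -> dict:
    """Try each model in order with the same payload; return the first success."""
    last_error: Optional[Exception] = None
    for model in models:
        try:
            return send(model)
        except (ProviderError, TimeoutError) as exc:
            last_error = exc  # remember the failure and move to the next provider
    raise RuntimeError("all providers failed") from last_error


def fake_send(model: str) -> dict:
    # Simulate the first provider being rate limited.
    if model == "openai/gpt-4o":
        raise ProviderError("429 rate limited")
    return {"model": model, "verdict": "ok"}


result = call_with_fallbacks(
    ["openai/gpt-4o", "anthropic/claude-sonnet-4-6", "vertex/gemini-2.5-pro"],
    fake_send,
)
print(result["model"])
```

Doing this in the gateway rather than in the scorer is what keeps the retry policy in one place instead of three.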

Across 30 days the failover fired on roughly 0.8% of calls. Small number. It was the difference between a stalled queue and one that drains.

We also get native Prometheus metrics out of the gateway, so per-provider latency and error rates land in the same Grafana board we already use for GPU utilization. Before, that data lived in three provider dashboards and a spreadsheet.

How it compares

We looked at LiteLLM and Portkey before committing. Here is the honest read for our specific multimodal use case.

Concern	Bifrost	LiteLLM	Portkey
Unified image schema	Yes, OpenAI-compatible translation	Yes, very wide provider list	Yes, managed gateway
Self-hosted, single binary	Yes, Go, `npx` or Docker	Yes, Python proxy	Self-host available, more involved
Added latency	~4ms in our test	Higher under Python load	Low, hosted edge
Provider breadth	23+	Largest list of the three	Broad
Guardrails / managed cloud	Enterprise tier	Lighter	Strongest managed feature set

LiteLLM has the widest provider coverage, and if you live in Python its proxy is genuinely fast to wire up. Portkey's managed guardrails and analytics are more polished than what the open-source Bifrost gives you out of the box. We picked Bifrost because it runs as one self-hosted Go binary next to our inference cluster, and the latency overhead stayed flat under concurrent image traffic.

Trade-offs and limitations

This is not free.

You are adding a network hop. We measured about 4ms median, which is noise next to a 2-3 second vision call, but it is not zero, and for pure text streaming you would feel it more.
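
To put that hop in proportion, a quick back-of-the-envelope calculation. The 4 ms and 2-3 second figures come from above. The 200 ms time-to-first-token for a text stream is an assumed number for illustration, not a measurement.

```python
def overhead_percent(gateway_ms: float, provider_ms: float) -> float:
    """Gateway overhead as a percentage of the provider's own latency."""
    return 100.0 * gateway_ms / provider_ms


GATEWAY_MS = 4.0  # median overhead quoted above

for label, provider_ms in [
    ("vision call, 2 s", 2000.0),
    ("vision call, 3 s", 3000.0),
    ("first token, 200 ms (assumed)", 200.0),
]:
    print(f"{label}: {overhead_percent(GATEWAY_MS, provider_ms):.2f}% overhead")
```

Under those assumptions the hop is about 0.13% to 0.2% of a vision call but around 2% of a fast first token, which is why latency-sensitive text streaming notices it more.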

It is also one more service to run and patch. If Bifrost goes down without a redundant deployment, every provider goes down with it, so you trade per-provider fragility for a single point you now own. We run two replicas behind a load balancer for that reason.

And the more advanced pieces, like adaptive load balancing and clustering, sit in the enterprise tier. The open-source core covered our failover and multimodal needs, but check the docs before assuming a specific feature is in the free build.

The translation layer is also only as good as its provider coverage. Support for a brand-new provider feature or quirk can lag the upstream API by a release.