TL;DR: We run an automated visual QA step that scores generated product shots with vision LLMs from OpenAI, Anthropic, and Google. Each provider wanted the image payload shaped differently, and one rate-limit spike could stall the whole batch. Putting Bifrost in front gave us one OpenAI-compatible image schema and automatic failover, with about 4ms of added latency per call.
At Photoroom I work on the diffusion side of product photography. The model that generates a clean studio shot is only half the job. The other half is deciding, automatically, whether the output is actually usable before it reaches a customer.
So we built a QA scorer. It sends each generated render to a vision model and asks for a structured verdict: background artifacts, clipped edges, color drift against the source. We send the same image to more than one provider because the failure modes differ, and a single model's blind spots leak through otherwise.
That is where the mess started.
Three providers, three image schemas
"OpenAI-compatible vision" is not a settled standard. The message envelope for an image differs from provider to provider.
OpenAI wants image_url content parts, and you can pass a data: base64 URI or a real URL. Anthropic's native API wants a source block with type: base64, a media_type, and the bare base64 string in data, with no data: prefix. Google Vertex wants inline_data with a mime_type and the base64 data. Our scorer started life with three code paths, three sets of size limits, and three retry policies that drifted out of sync within a month.
For a 12-image batch per product across two providers, that branching logic was the part that broke most often. Not the diffusion model. The plumbing.
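To make the divergence concrete, here is a minimal sketch of the three image parts described above, built from the same bytes. The function names are hypothetical and this is not our scorer's actual code. It only shows the shape each provider expects for a base64 image.

```python
import base64


def openai_image_part(image_bytes: bytes, mime: str = "image/png") -> dict:
    """OpenAI chat-completions style: an image_url part carrying a data: URI."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}}


def anthropic_image_part(image_bytes: bytes, mime: str = "image/png") -> dict:
    """Anthropic Messages style: a source block with bare base64 (no data: prefix)."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {"type": "image", "source": {"type": "base64", "media_type": mime, "data": b64}}


def gemini_image_part(image_bytes: bytes, mime: str = "image/png") -> dict:
    """Gemini / Vertex style: an inline_data part with mime_type and base64 data."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {"inline_data": {"mime_type": mime, "data": b64}}
```

Same pixels, three envelopes. Multiply that by per-provider size limits and retry rules and you get the branching we wanted to delete.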
What we changed
We dropped Bifrost in as the gateway and pointed the scorer at a single endpoint. It exposes one OpenAI-compatible API across 23+ providers, so the image part is always written the OpenAI way, and Bifrost translates to whatever the target provider expects. Multimodal support for text and images sits behind that common interface.
One request shape now. The only thing that changes per call is the model string.
```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anthropic/claude-sonnet-4-6",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Score this render for edge clipping. Return JSON."},
        {"type": "image_url",
         "image_url": {"url": "data:image/png;base64,iVBORw0K..."}}
      ]
    }]
  }'
```
Swap anthropic/claude-sonnet-4-6 for openai/gpt-4o or vertex/gemini-2.5-pro and the body stays identical. The scorer no longer knows or cares how each provider encodes pixels.
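On the scorer side, that collapses to a single request builder. The sketch below uses only the standard library and assumes the same local endpoint as the curl call. The names build_score_request and send are hypothetical, and error handling is left out.

```python
import json
import urllib.request

# Assumption: the gateway listens where the curl example points.
GATEWAY_URL = "http://localhost:8080/v1/chat/completions"


def build_score_request(model: str, image_b64: str, prompt: str) -> dict:
    """Build one OpenAI-style body; only the model string varies per provider."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    }


def send(body: dict, timeout: float = 30.0) -> dict:
    """POST the body to the gateway and return the decoded JSON response."""
    req = urllib.request.Request(
        GATEWAY_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

Calling build_score_request with each of the three model strings yields bodies whose messages are identical, which is the property the scorer relies on.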
Failover instead of a dead batch
The second problem was throughput. When one vision provider returned 429s during a busy stretch, our batch queue used to back up because the scorer kept hammering the same key.
Bifrost's automatic fallbacks let us declare an ordered list. If the primary returns an error or times out, the request moves to the next provider with the same payload. No code change in the scorer.
```yaml
fallbacks:
- openai/gpt-4o
- anthropic/claude-sonnet-4-6
- vertex/gemini-2.5-pro
```
Across 30 days the failover fired on roughly 0.8% of calls. Small number. It was the difference between a stalled queue and one that drains.
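If it helps to picture what an ordered fallback list means, here is a rough client-side sketch of the idea. It is an illustration only, not Bifrost's implementation: call_with_fallbacks and AllProvidersFailed are hypothetical names, and a real gateway would treat 429s, 5xx responses, and timeouts with more nuance than a blanket except.

```python
from typing import Callable, Sequence, Tuple


class AllProvidersFailed(Exception):
    """Raised when every model in the fallback chain has errored."""


def call_with_fallbacks(
    models: Sequence[str],
    call: Callable[[str], dict],
) -> Tuple[str, dict]:
    """Try each model in order with the same payload; return the first success."""
    errors = []
    for model in models:
        try:
            return model, call(model)
        except Exception as exc:  # simplification: any error moves to the next model
            errors.append(f"{model}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

The point of pushing this into the gateway is that the scorer never has to carry a loop like this, or keep it consistent across services.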
We also get native Prometheus metrics out of the gateway, so per-provider latency and error rates land in the same Grafana board we already use for GPU utilization. Before, that data lived in three provider dashboards and a spreadsheet.
How it compares
We looked at LiteLLM and Portkey before committing. Here is the honest read for our specific multimodal use case.
| Concern | Bifrost | LiteLLM | Portkey |
|---|---|---|---|
| Unified image schema | Yes, OpenAI-compatible translation | Yes, very wide provider list | Yes, managed gateway |
| Self-hosted, single binary | Yes, Go, npx or Docker | Yes, Python proxy | Self-host available, more involved |
| Added latency | ~4ms in our test | Higher under Python load | Low, hosted edge |
| Provider breadth | 23+ | Largest list of the three | Broad |
| Guardrails / managed cloud | Enterprise tier | Lighter | Strongest managed feature set |
LiteLLM has the widest provider coverage, and if you live in Python its proxy is genuinely fast to wire up. Portkey's managed guardrails and analytics are more polished than what the open-source Bifrost gives you out of the box. We picked Bifrost because it runs as one self-hosted Go binary next to our inference cluster, and the latency overhead stayed flat under concurrent image traffic.
Trade-offs and limitations
This is not free.
You are adding a network hop. We measured about 4ms median, which is noise next to a 2-3 second vision call, but it is not zero, and for pure text streaming you would feel it more.
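As a quick sanity check on "noise", the relative cost of the hop is just the overhead divided by the call time. The 200ms figure below is a hypothetical time-to-first-token for a text stream, not something we measured.

```python
def overhead_share(gateway_ms: float, call_ms: float) -> float:
    """Gateway overhead as a fraction of the underlying call duration."""
    return gateway_ms / call_ms


# 4ms against the 2-3 second vision calls from our pipeline.
print(f"{overhead_share(4, 2000):.2%}")  # 0.20%
print(f"{overhead_share(4, 3000):.2%}")  # 0.13%

# Hypothetical: 4ms against a 200ms time-to-first-token on a text stream.
print(f"{overhead_share(4, 200):.2%}")   # 2.00%
```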
It is also one more service to run and patch. If Bifrost goes down without a redundant deployment, every provider goes down with it, so you trade per-provider fragility for a single point you now own. We run two replicas behind a load balancer for that reason.
And the deep governance pieces, like adaptive load balancing and clustering, sit in the enterprise tier. The open-source core covered our failover and multimodal needs, but check the docs before assuming a specific feature is in the free build.
The translation layer is also only as good as its provider coverage. Support for a brand-new provider feature or quirk can lag the upstream API by a release.