TL;DR: Six weeks running an AI gateway between our edge cameras and three cloud VLM providers cut our pilot VLM spend by 58% and gave us actual failover during a 90-minute Anthropic blip last month. Bifrost handled it. Here's what worked, what didn't, and how it compared to LiteLLM and Portkey on the same workload.
So, the thing is, when our team at a partner factory near Bologna wired up a defect inspection pilot with cloud VLMs in the loop, the cost story turned ugly within the first ten days. We had 28 stations, each catching anomalies from local event-camera and frame fusion, then escalating ambiguous frames to GPT-4o-mini or Claude Sonnet for a second opinion. The VLM bill landed at €4,800 in week one. Production was running 11 hours a day. Nobody had budgeted for that.
The pilot also stalled twice. Once because OpenAI returned 429s for 22 minutes during what I assume was a regional capacity issue, and once because a key rotated wrong and half the fleet froze. Neither outage was the model's fault. Both were avoidable.
We picked Bifrost as a gateway and ran it for six weeks. This is a writeup of what we measured. Independent perspective. I have no commercial relationship with them.
The setup
Our stack: Jetson Orin Nano per station, edge model (a distilled student of CLIP for class triage), and an escalation rule. If the edge confidence falls below a threshold, the frame plus context gets sent to a cloud VLM through Bifrost. Bifrost runs on a small VM in our partner's DMZ, two replicas behind a TCP load balancer. We use the OpenAI-compatible endpoint so our existing inference client didn't change.
Three things mattered for us, in this order. First, fallback semantics. Second, per-station budgeting. Third, observability metrics that hit Prometheus without extra scaffolding. We tried LiteLLM and Portkey before settling on Bifrost. More on that below.
Configuration was a single YAML and two environment variables. This is roughly the relevant part:
providers:
openai:
keys:
- env: OPENAI_KEY_PRIMARY
weight: 0.7
- env: OPENAI_KEY_FAILOVER
weight: 0.3
anthropic:
keys:
- env: ANTHROPIC_KEY
bedrock:
keys:
- env: BEDROCK_KEY
region: eu-central-1
routes:
- models: ["openai/gpt-4o-mini"]
fallbacks:
- "anthropic/claude-sonnet-4-6"
- "bedrock/anthropic.claude-3-5-haiku"
That's the whole fallback chain. When OpenAI started rate limiting on April 18, traffic shifted to Anthropic within the configured timeout and we lost zero frames. Our oncall got a Prometheus alert on the fallback rate metric, which is exposed natively, and we caught the issue 90 seconds before any operator noticed.
What we measured
Six weeks of data from our pilot, three weeks before Bifrost and three weeks after. Same shift schedule, same product mix. Numbers below are real, scrubbed of station-specific identifiers.
| Metric | Before gateway | After Bifrost | Notes |
|---|---|---|---|
| Weekly VLM spend | €4,640 avg | €1,920 avg | Semantic cache hit at 41% |
| P95 escalation latency | 1.9s | 1.4s | Some hits served from cache |
| Outage minutes | 112 | 0 | One Anthropic blip auto-mitigated |
| Operator interventions | 7 | 1 | Most were cost-driven |
| Per-station cost visibility | none | per virtual key | New capability |
The 41% semantic cache hit rate surprised me. I expected maybe 15%. Factory floor frames have a lot of repeated context (same product variant, same lighting, similar defect prompts), and the cache exploits that pattern. Documented behaviour, see the semantic caching docs linked at the bottom.
Bifrost vs LiteLLM vs Portkey
Honest comparison, because we tested all three on the same workload.
| Feature | Bifrost | LiteLLM | Portkey |
|---|---|---|---|
| OpenAI-compatible API | yes | yes | yes |
| Self-host as binary | yes (Go binary or Docker) | yes (Python proxy) | hosted primary |
| Semantic caching | yes | via Redis plugin | yes |
| Virtual keys with budgets | yes | yes | yes |
| Prometheus metrics | native | requires setup | hosted dashboard |
| Throughput in our test | ~9k RPS sustained | ~2k RPS before tuning | hosted (couldn't test fairly) |
LiteLLM is more mature in some places. Its Python plugin ecosystem is bigger, and if your team already lives in Python middleware, that's a real advantage. Portkey has the slickest hosted dashboard out of the box, no question. We couldn't put a hosted service in the path for this pilot because the partner factory had restrictions on outbound traffic from the OT network.
Bifrost won for us on two specific points. The single Go binary deployed without a fight on the small VM we had. And the per-virtual-key budgeting let us bill each station's VLM cost back to the production line owner, which mattered for getting the pilot extended past phase one.
Trade-offs and limitations
The semantic cache occasionally returns a cached response when product variants change mid-shift. We had two false-negative defect reports in week three traced to a cache hit on a near-identical SKU. We tuned the similarity threshold and added a variant ID to the cache key. Lesson learned, but the takeaway is that semantic caching needs context-aware keys for industrial workloads. Not a Bifrost bug, but a real failure mode you have to design around.
Bifrost's MCP integration is interesting but we didn't use it for this pilot. Our cameras don't need tool use. If you are building agentic flows on top, that calculus changes.
Documentation gaps exist. The clustering setup for HA was less detailed than I would have liked when we first read it. Their team answered on Discord within an hour, which helped a lot.
You don't need a gateway if you are single-provider and single-region. The added complexity is worth it once you have real failover requirements or cost attribution problems. We had both. Your factory may not.
Further reading
- Bifrost retries and fallbacks: https://docs.getbifrost.ai/features/retries-and-fallbacks
- Bifrost semantic caching: https://docs.getbifrost.ai/features/semantic-caching
- Bifrost governance and virtual keys: https://docs.getbifrost.ai/features/governance/virtual-keys
- LiteLLM proxy: https://github.com/BerriAI/litellm
- Portkey gateway: https://github.com/Portkey-AI/gateway
Next pilot starts in two weeks. Same gateway, different factory, event-camera-only feeds this time. I'll write that one up once the numbers are in.
Top comments (0)