Six weeks of Bifrost in a factory QA pilot: real cost numbers

#machinelearning #computervision #mlops #infrastructure

TL;DR: Six weeks running an AI gateway between our edge cameras and three cloud VLM providers cut our pilot VLM spend by 58% and gave us actual failover during a 90-minute Anthropic blip last month. Bifrost handled it. Here's what worked, what didn't, and how it compared to LiteLLM and Portkey on the same workload.

So, the thing is, when our team at a partner factory near Bologna wired up a defect inspection pilot with cloud VLMs in the loop, the cost story turned ugly within the first ten days. We had 28 stations, each catching anomalies from local event-camera and frame fusion, then escalating ambiguous frames to GPT-4o-mini or Claude Sonnet for a second opinion. The VLM bill landed at €4,800 in week one. Production was running 11 hours a day. Nobody had budgeted for that.

The pilot also stalled twice. Once because OpenAI returned 429s for 22 minutes during what I assume was a regional capacity issue, and once because a key rotated wrong and half the fleet froze. Neither outage was the model's fault. Both were avoidable.

We picked Bifrost as a gateway and ran it for six weeks. This is a writeup of what we measured. Independent perspective. I have no commercial relationship with them.

The setup

Our stack: Jetson Orin Nano per station, edge model (a distilled student of CLIP for class triage), and an escalation rule. If the edge confidence falls below a threshold, the frame plus context gets sent to a cloud VLM through Bifrost. Bifrost runs on a small VM in our partner's DMZ, two replicas behind a TCP load balancer. We use the OpenAI-compatible endpoint so our existing inference client didn't change.

Three things mattered for us, in this order. First, fallback semantics. Second, per-station budgeting. Third, observability metrics that hit Prometheus without extra scaffolding. We tried LiteLLM and Portkey before settling on Bifrost. More on that below.

Configuration was a single YAML and two environment variables. This is roughly the relevant part:

providers:
  openai:
    keys:
      - env: OPENAI_KEY_PRIMARY
        weight: 0.7
      - env: OPENAI_KEY_FAILOVER
        weight: 0.3
  anthropic:
    keys:
      - env: ANTHROPIC_KEY
  bedrock:
    keys:
      - env: BEDROCK_KEY
        region: eu-central-1

routes:
  - models: ["openai/gpt-4o-mini"]
    fallbacks:
      - "anthropic/claude-sonnet-4-6"
      - "bedrock/anthropic.claude-3-5-haiku"

That's the whole fallback chain. When OpenAI started rate limiting on April 18, traffic shifted to Anthropic within the configured timeout and we lost zero frames. Our oncall got a Prometheus alert on the fallback rate metric, which is exposed natively, and we caught the issue 90 seconds before any operator noticed.

What we measured

Six weeks of data from our pilot, three weeks before Bifrost and three weeks after. Same shift schedule, same product mix. Numbers below are real, scrubbed of station-specific identifiers.

Metric	Before gateway	After Bifrost	Notes
Weekly VLM spend	€4,640 avg	€1,920 avg	Semantic cache hit at 41%
P95 escalation latency	1.9s	1.4s	Some hits served from cache
Outage minutes	112	0	One Anthropic blip auto-mitigated
Operator interventions	7	1	Most were cost-driven
Per-station cost visibility	none	per virtual key	New capability

The 41% semantic cache hit rate surprised me. I expected maybe 15%. Factory floor frames have a lot of repeated context (same product variant, same lighting, similar defect prompts), and the cache exploits that pattern. Documented behaviour, see the semantic caching docs linked at the bottom.

Bifrost vs LiteLLM vs Portkey

Honest comparison, because we tested all three on the same workload.

Feature	Bifrost	LiteLLM	Portkey
OpenAI-compatible API	yes	yes	yes
Self-host as binary	yes (Go binary or Docker)	yes (Python proxy)	hosted primary
Semantic caching	yes	via Redis plugin	yes
Virtual keys with budgets	yes	yes	yes
Prometheus metrics	native	requires setup	hosted dashboard
Throughput in our test	~9k RPS sustained	~2k RPS before tuning	hosted (couldn't test fairly)

LiteLLM is more mature in some places. Its Python plugin ecosystem is bigger, and if your team already lives in Python middleware, that's a real advantage. Portkey has the slickest hosted dashboard out of the box, no question. We couldn't put a hosted service in the path for this pilot because the partner factory had restrictions on outbound traffic from the OT network.

Bifrost won for us on two specific points. The single Go binary deployed without a fight on the small VM we had. And the per-virtual-key budgeting let us bill each station's VLM cost back to the production line owner, which mattered for getting the pilot extended past phase one.

Trade-offs and limitations

The semantic cache occasionally returns a cached response when product variants change mid-shift. We had two false-negative defect reports in week three traced to a cache hit on a near-identical SKU. We tuned the similarity threshold and added a variant ID to the cache key. Lesson learned, but the takeaway is that semantic caching needs context-aware keys for industrial workloads. Not a Bifrost bug, but a real failure mode you have to design around.

Bifrost's MCP integration is interesting but we didn't use it for this pilot. Our cameras don't need tool use. If you are building agentic flows on top, that calculus changes.

Documentation gaps exist. The clustering setup for HA was less detailed than I would have liked when we first read it. Their team answered on Discord within an hour, which helped a lot.

You don't need a gateway if you are single-provider and single-region. The added complexity is worth it once you have real failover requirements or cost attribution problems. We had both. Your factory may not.

DEV Community

Six weeks of Bifrost in a factory QA pilot: real cost numbers

The setup

What we measured

Bifrost vs LiteLLM vs Portkey

Trade-offs and limitations

Further reading

Top comments (0)