TL;DR: We route a mix of diffusion and LLM traffic across three providers from a single Go-based gateway called Bifrost. The 11-microsecond overhead is real, the failover works, and the part I care about most (weighted routing for cost vs latency trade-offs) finally stopped being a custom Python service nobody wanted to maintain.
I work on diffusion models for product photography. Most of what I write about is training, but the boring truth is that inference traffic management eats more of my week than I would like to admit.
We have three categories of model calls in production. Hosted diffusion endpoints for fallback when our own GPU pool is saturated. LLM calls for prompt rewriting and caption generation. And a small embedding service for similarity search on reference images. Three providers, three SDKs, three retry policies. It was becoming a mess.
What we had before
A Python FastAPI service in front of everything. It worked. It was also slow, and the team had stopped trusting the metrics because the gateway itself was adding 40-80ms of overhead depending on the day.
The nuance here is that for a diffusion call taking 3 seconds, 60ms of gateway overhead is noise (about 2%). For a small LLM rewrite that should take 200ms, it is nearly a third of your budget. We were optimizing the wrong axis.
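To make the arithmetic concrete, here is a small sketch that computes gateway overhead as a share of each call's latency. The 60ms figure is the rough midpoint of the 40-80ms range mentioned earlier, and the function name is just for illustration.

```python
def overhead_share(gateway_ms: float, call_ms: float) -> float:
    """Gateway overhead as a fraction of the call's own latency."""
    return gateway_ms / call_ms


# 60 ms of gateway overhead against the two call types discussed above.
for name, call_ms in [("diffusion", 3000.0), ("LLM rewrite", 200.0)]:
    print(f"{name}: {overhead_share(60.0, call_ms):.0%}")
# prints:
# diffusion: 2%
# LLM rewrite: 30%
```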
I spent a weekend evaluating replacements. Kong felt heavy. LiteLLM was the obvious choice for the LLM side but does not really speak the dialect of provider-specific diffusion APIs we need. Then a colleague pointed me at Bifrost.
Why a Go gateway actually matters here
To be precise: the language is not the point. The point is the runtime model. Bifrost runs as a single Go binary, uses goroutines for concurrency, and the published overhead is around 11 microseconds per request. I measured it on our own staging hardware and got numbers in the same ballpark, which is rare enough that I noticed.
For our embedding service this matters. For diffusion it does not. But having one gateway that does not become the bottleneck for the fast calls is what made the consolidation possible.
providers:
  openai:
    keys:
      - value: env.OPENAI_KEY_PRIMARY
        weight: 0.7
      - value: env.OPENAI_KEY_SECONDARY
        weight: 0.3
    network:
      retry:
        max_retries: 2
        backoff_initial_ms: 100
  anthropic:
    keys:
      - value: env.ANTHROPIC_KEY
  stability:
    keys:
      - value: env.STABILITY_KEY
mcp:
  - id: prompt-rewrite
    primary: openai/gpt-4o-mini
    fallbacks:
      - anthropic/claude-haiku-4-5
That config replaced about 400 lines of Python.
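For a sense of what those 400 lines were doing, here is a minimal sketch of the retry-then-fallback behavior the config describes. It is not Bifrost's implementation. The `send` callable and `ProviderError` are hypothetical, and the doubling of the delay between retries is my assumption, since the config only states the initial backoff.

```python
import time
from typing import Callable, Optional, Sequence


class ProviderError(Exception):
    """Hypothetical error raised by `send` when a provider call fails."""


def call_with_fallbacks(
    targets: Sequence[str],
    send: Callable[[str], str],
    max_retries: int = 2,
    backoff_initial_ms: int = 100,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each target in order, retrying each one before moving on."""
    last_error: Optional[Exception] = None
    for target in targets:
        delay_ms = backoff_initial_ms
        for attempt in range(max_retries + 1):
            try:
                return send(target)
            except ProviderError as exc:
                last_error = exc
                if attempt < max_retries:
                    sleep(delay_ms / 1000)
                    delay_ms *= 2  # assumed exponential backoff
    raise RuntimeError("all targets failed") from last_error
```

Called with `["openai/gpt-4o-mini", "anthropic/claude-haiku-4-5"]`, it makes up to three attempts on the primary (one call plus two retries) before trying the fallback.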
The weighted routing thing
This is the feature I did not know I wanted. We have two OpenAI accounts because of rate limits and billing isolation between research and production workloads. Previously we ran two separate clients with manual round-robin logic that always had off-by-one bugs.
Weighted routing in the gateway just handles it. 70/30 split, configured declaratively, and when one key hits a 429 the failover kicks in without us writing the retry code ourselves. Virtual keys on top of that let us issue per-team credentials that map to the underlying provider keys, so the research team and the production team see different rate limits and different cost dashboards.
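This is roughly the logic we no longer maintain ourselves, sketched under assumptions: keys are picked at random in proportion to their weights, and a rate-limited key is dropped from the pool for that request. `RateLimited` and `send` are hypothetical stand-ins, and this is not how Bifrost implements it internally.

```python
import random
from typing import Callable, Mapping, Optional


class RateLimited(Exception):
    """Hypothetical stand-in for an HTTP 429 from the provider."""


def pick_key(weights: Mapping[str, float], rng: random.Random) -> str:
    """Pick one key name at random, proportionally to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def send_with_key_failover(
    weights: Mapping[str, float],
    send: Callable[[str], str],
    rng: Optional[random.Random] = None,
) -> str:
    """Send with a weighted-random key; on a 429, retry with the other keys."""
    rng = rng or random.Random()
    remaining = dict(weights)
    while remaining:
        key = pick_key(remaining, rng)
        try:
            return send(key)
        except RateLimited:
            del remaining[key]  # exclude this key for the current request
    raise RateLimited("every key is rate limited")
```

Random selection avoids the shared counter that round-robin needs, which is where our off-by-one bugs used to live.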
Comparison with what we considered
| Capability | LiteLLM | Kong | Bifrost |
|---|---|---|---|
| Per-request overhead | ~50ms (Python) | ~5ms but heavy footprint | ~11µs |
| Failover across providers | Yes | Plugin required | Yes, built-in |
| Weighted key routing | Limited | Custom plugin | Native |
| Semantic caching | Via plugin | No | Native |
| Diffusion provider support | Weak | Generic HTTP only | Provider-aware |
| Operational footprint | Python service | Lua plugins, DB | Single Go binary |
LiteLLM remains excellent for LLM-only stacks. Kong is the right answer if you already run Kong. For us, the combination of low overhead and provider-aware routing was the deciding factor.
Semantic caching on prompt rewrites
About 40% of our prompt-rewrite calls are near-duplicates. Same product, slightly different angle, same desired caption style. We were paying for every one of them.
Bifrost has semantic caching built in, using embeddings to match similar requests within a configurable threshold. I was skeptical because cache invalidation on semantic similarity is famously a footgun. We set the threshold conservatively (cosine similarity above 0.94) and audit the cache hits weekly. Hit rate is around 22%, cost savings are real, and we have not had a quality complaint yet. The audit is the part nobody talks about, but you need it.
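The matching idea can be sketched in a few lines. This is a toy illustration and not Bifrost's code: the class and method names are made up, the embeddings are assumed to come from whatever model you already use, and the linear scan stands in for a real vector index.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Toy cache: return a stored response if a past prompt is similar enough."""

    def __init__(self, threshold: float = 0.94) -> None:
        self.threshold = threshold
        self.entries: List[Tuple[List[float], str]] = []

    def put(self, embedding: Sequence[float], response: str) -> None:
        self.entries.append((list(embedding), response))

    def get(self, embedding: Sequence[float]) -> Optional[str]:
        best_score, best_response = 0.0, None
        for stored, response in self.entries:
            score = cosine(embedding, stored)
            if score > best_score:
                best_score, best_response = score, response
        # Only a match strictly above the threshold counts as a hit.
        return best_response if best_score > self.threshold else None
```

Logging `best_score` alongside each hit is what makes a weekly audit practical, because you can sort the hits by how close they came to the threshold.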
Trade-offs and Limitations
It is a young project. The documentation has gaps, particularly around custom provider plugins. I had to read the source to understand how the streaming response handling works for SSE-heavy diffusion APIs.
Observability is functional but basic. We forward to our existing Prometheus setup and it works, but if you expect a polished UI for traffic analysis you will be disappointed. We built our own Grafana dashboards.
Semantic caching is only as good as your embedding model and threshold tuning. If your prompts have high lexical variation but identical intent, you will get false negatives. If your prompts are templated and only the parameters change, you will get false positives. Test on your own traffic before trusting it.
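One way to run that test, sketched with made-up numbers: take pairs of prompts from your own traffic, label whether the cached answer would have been acceptable, and count both kinds of error at each candidate threshold. The function name and sample values below are illustrative only.

```python
from typing import Sequence, Tuple


def audit_threshold(
    pairs: Sequence[Tuple[float, bool]], threshold: float
) -> Tuple[int, int]:
    """Count (false positives, false negatives) at a similarity threshold.

    Each pair is (cosine similarity, whether the two prompts really
    deserve the same cached response).
    """
    false_pos = sum(1 for sim, same in pairs if sim > threshold and not same)
    false_neg = sum(1 for sim, same in pairs if sim <= threshold and same)
    return false_pos, false_neg


# Hypothetical labeled pairs, not real measurements.
labeled = [(0.97, True), (0.95, False), (0.91, True), (0.80, False)]
for threshold in (0.90, 0.94, 0.96):
    print(threshold, audit_threshold(labeled, threshold))
# prints:
# 0.9 (1, 0)
# 0.94 (1, 1)
# 0.96 (0, 1)
```

Raising the threshold trades false positives for false negatives, so the right value depends on which error costs you more.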
And one honest note: an 11-microsecond gateway does not make a 3-second diffusion call faster. It just stops being the reason your fast calls are slow. Know which problem you are solving.
Further Reading
- Bifrost on GitHub: https://github.com/maximhq/bifrost
- LiteLLM proxy documentation: https://docs.litellm.ai/docs/simple_proxy
- Kong AI Gateway: https://konghq.com/products/kong-ai-gateway
- "Inference Without Interference" (Microsoft Research, 2024) on multiplexing inference workloads
- A useful primer on semantic caching trade-offs from Pinecone's engineering blog