TL;DR: We added OpenTelemetry tracing across the four LLM and VLM hops in our product-photo pipeline by routing them through Bifrost. Pipeline-level p95 went from 11.2s to 6.8s in two weeks, mostly because we could finally see which step was the bottleneck. The tracing was free once the gateway was in place; we weren't going to instrument four SDKs by hand.
Where the time went was a mystery
At Photoroom we run a four-stage pipeline for catalog photo cleanup. Background removal with an in-house model. Inpainting via SDXL with our internal LoRA. Upscaling through Real-ESRGAN. A caption step that hits an external VLM provider. Four hops, two of them outside our VPC.
A bug report sat in our queue for a week: "the pipeline is slow on Tuesdays." Designers were timing out at 12 seconds. Our internal Grafana showed the GPU jobs were fine. The external VLM call latency? Nobody had a number.
To be precise, we had per-service latency in Datadog, but the spans didn't stitch together across the external API hops. We could see step 1 took 320ms and step 4 took "8 to 10 seconds" but the "to" in there was the entire problem.
What we changed
I'd been evaluating Bifrost (github.com/maximhq/bifrost) for an unrelated project: semantic caching on the caption step. While reading the observability docs I noticed Prometheus metrics and OTel export were native, not plugins.
So I rewired both external calls (caption VLM and a prompt-rewrite call we make before SDXL) to go through Bifrost instead of directly to the provider. The Python change was about 9 lines. Swap the base URL.
The relevant slice of our config.json:
{
"providers": {
"openai": {
"keys": [{"value": "env.OPENAI_KEY", "models": ["gpt-4o-mini"]}],
"concurrency_and_buffer_size": {"concurrency": 32}
},
"anthropic": {
"keys": [{"value": "env.ANTHROPIC_KEY", "models": ["claude-haiku-4-5"]}]
}
},
"observability": {
"prometheus_labels": ["x-team", "x-pipeline-stage"],
"otel_endpoint": "http://otel-collector:4318"
}
}
The two custom headers (x-team, x-pipeline-stage) attach to every Prometheus sample and every OTel span. In Tempo I can filter by pipeline stage and see exactly which hop is slow.
What we found
The Tuesday slowness was a regional capacity blip with our caption provider. p99 for the caption call was 14 seconds with no error, only slow token output. Once Bifrost sat in front of it we configured a fallback to a second provider and the 12-second pipeline timeouts disappeared.
Full p95 numbers, two weeks before and two weeks after:
| Stage | p95 before | p95 after | Why |
|---|---|---|---|
| Background removal | 340ms | 340ms | Untouched, on our GPUs |
| Prompt rewrite | 1.1s | 480ms | Semantic cache hit ~62% |
| SDXL + LoRA | 4.2s | 4.2s | Untouched |
| Caption VLM | 6.8s | 1.4s | Failover plus cache |
| End-to-end | 11.2s | 6.8s |
The semantic cache hit rate on the prompt-rewrite step is high because designers re-run very similar product descriptions. That was latency we didn't know we were leaving on the table.
How it compares to what we considered
We looked at LiteLLM and Portkey first. We already had a half-deployed LiteLLM somewhere in the cluster.
| LiteLLM | Portkey | Bifrost | |
|---|---|---|---|
| Self-host | Yes | Yes (some features gated) | Yes |
| Prometheus native | Plugin | Yes | Yes |
| OTel spans | Plugin | Yes | Yes |
| Semantic cache | Yes | Yes | Yes |
| MCP support | No | Limited | Yes |
| Measured overhead | ~9ms | cloud RTT dependent | ~1.2ms |
The nuance here is that Portkey is excellent if you are happy on their cloud. We couldn't be, because some product images carry customer-identifiable data and the entire pipeline runs inside our VPC. LiteLLM is mature and its community is larger than Bifrost's, but the observability story required wiring up the Prometheus plugin myself. Bifrost ships that out of the box per the docs.
Trade-offs and limitations
Bifrost is younger than LiteLLM and its plugin ecosystem is correspondingly smaller. We wrote a 40-line Go plugin to attach Photoroom-internal request IDs to spans. It works but it's one more thing we own.
The MCP feature is genuinely useful for tool-using agents. We aren't using it yet. Our pipeline isn't agentic; if you only call chat completions you can ignore MCP entirely.
Latency overhead is low (~1.2ms in our measurements) but it isn't zero. For an image pipeline where each hop is hundreds of milliseconds it's invisible. For a high-QPS embedding service it might matter. Benchmark in your own environment before assuming.
One operational note: the OTel exporter sends spans synchronously by default. We had to bump the batch interval to 5 seconds; otherwise the gateway pod's CPU climbed under load.
Further reading
- Bifrost observability docs: https://docs.getbifrost.ai/features/observability/default
- Bifrost semantic caching: https://docs.getbifrost.ai/features/semantic-caching
- Bifrost fallbacks and retries: https://docs.getbifrost.ai/features/retries-and-fallbacks
- OpenTelemetry GenAI semantic conventions: https://opentelemetry.io/docs/specs/semconv/gen-ai/
- Real-ESRGAN paper, Wang et al. 2021: https://arxiv.org/abs/2107.10833
Top comments (0)