Tracing our 4-stage product photo pipeline through Bifrost

#machinelearning #mlops #llm

TL;DR: We added OpenTelemetry tracing across the four LLM and VLM hops in our product-photo pipeline by routing them through Bifrost. Pipeline-level p95 went from 11.2s to 6.8s in two weeks, mostly because we could finally see which step was the bottleneck. The tracing was free once the gateway was in place; we weren't going to instrument four SDKs by hand.

Where the time went was a mystery

At Photoroom we run a four-stage pipeline for catalog photo cleanup. Background removal with an in-house model. Inpainting via SDXL with our internal LoRA. Upscaling through Real-ESRGAN. A caption step that hits an external VLM provider. Four hops, two of them outside our VPC.

A bug report sat in our queue for a week: "the pipeline is slow on Tuesdays." Designers were timing out at 12 seconds. Our internal Grafana showed the GPU jobs were fine. The external VLM call latency? Nobody had a number.

To be precise, we had per-service latency in Datadog, but the spans didn't stitch together across the external API hops. We could see step 1 took 320ms and step 4 took "8 to 10 seconds" but the "to" in there was the entire problem.

What we changed

I'd been evaluating Bifrost (github.com/maximhq/bifrost) for an unrelated project: semantic caching on the caption step. While reading the observability docs I noticed Prometheus metrics and OTel export were native, not plugins.

So I rewired both external calls (caption VLM and a prompt-rewrite call we make before SDXL) to go through Bifrost instead of directly to the provider. The Python change was about 9 lines. Swap the base URL.

The relevant slice of our config.json:

{
  "providers": {
    "openai": {
      "keys": [{"value": "env.OPENAI_KEY", "models": ["gpt-4o-mini"]}],
      "concurrency_and_buffer_size": {"concurrency": 32}
    },
    "anthropic": {
      "keys": [{"value": "env.ANTHROPIC_KEY", "models": ["claude-haiku-4-5"]}]
    }
  },
  "observability": {
    "prometheus_labels": ["x-team", "x-pipeline-stage"],
    "otel_endpoint": "http://otel-collector:4318"
  }
}

The two custom headers (x-team, x-pipeline-stage) attach to every Prometheus sample and every OTel span. In Tempo I can filter by pipeline stage and see exactly which hop is slow.

What we found

The Tuesday slowness was a regional capacity blip with our caption provider. p99 for the caption call was 14 seconds with no error, only slow token output. Once Bifrost sat in front of it we configured a fallback to a second provider and the 12-second pipeline timeouts disappeared.

Full p95 numbers, two weeks before and two weeks after:

Stage	p95 before	p95 after	Why
Background removal	340ms	340ms	Untouched, on our GPUs
Prompt rewrite	1.1s	480ms	Semantic cache hit ~62%
SDXL + LoRA	4.2s	4.2s	Untouched
Caption VLM	6.8s	1.4s	Failover plus cache
End-to-end	11.2s	6.8s

The semantic cache hit rate on the prompt-rewrite step is high because designers re-run very similar product descriptions. That was latency we didn't know we were leaving on the table.

How it compares to what we considered

We looked at LiteLLM and Portkey first. We already had a half-deployed LiteLLM somewhere in the cluster.

	LiteLLM	Portkey	Bifrost
Self-host	Yes	Yes (some features gated)	Yes
Prometheus native	Plugin	Yes	Yes
OTel spans	Plugin	Yes	Yes
Semantic cache	Yes	Yes	Yes
MCP support	No	Limited	Yes
Measured overhead	~9ms	cloud RTT dependent	~1.2ms

The nuance here is that Portkey is excellent if you are happy on their cloud. We couldn't be, because some product images carry customer-identifiable data and the entire pipeline runs inside our VPC. LiteLLM is mature and its community is larger than Bifrost's, but the observability story required wiring up the Prometheus plugin myself. Bifrost ships that out of the box per the docs.

Trade-offs and limitations

Bifrost is younger than LiteLLM and its plugin ecosystem is correspondingly smaller. We wrote a 40-line Go plugin to attach Photoroom-internal request IDs to spans. It works but it's one more thing we own.

The MCP feature is genuinely useful for tool-using agents. We aren't using it yet. Our pipeline isn't agentic; if you only call chat completions you can ignore MCP entirely.

Latency overhead is low (~1.2ms in our measurements) but it isn't zero. For an image pipeline where each hop is hundreds of milliseconds it's invisible. For a high-QPS embedding service it might matter. Benchmark in your own environment before assuming.

One operational note: the OTel exporter sends spans synchronously by default. We had to bump the batch interval to 5 seconds; otherwise the gateway pod's CPU climbed under load.