Async inference for long-running diffusion jobs through Bifrost

#machinelearning #mlops #computervision #ai

TL;DR: Async inference through Bifrost lets clients submit long-running diffusion jobs and poll for the results using the x-bf-async header, so SDXL batches survive the 60-second proxy timeouts that were killing our product-photo pipeline.

A large product-variant batch in our pipeline at Photoroom takes 70 to 110 seconds to render across SDXL, and our AWS ALB closes any connection idle past 60 seconds by default. When we increased batch sizes to cut per-image GPU cost, the synchronous calls began returning 504s before the diffusion step finished. Clients retried on the 504, which double-queued the same render and roughly doubled GPU load during peak hours. We moved the generation traffic behind Bifrost, the open-source AI gateway from Maxim AI, and switched the slow jobs to async inference so the HTTP connection no longer has to stay open for the full render.

What async inference means at an AI gateway

Async inference at an AI gateway lets a client submit a generation job, receive a job ID, and poll for the result instead of holding one HTTP connection open for the whole compute. Bifrost exposes this with the x-bf-async: true request header and an x-bf-async-id returned on submission, so a 100-second diffusion call decouples from any proxy or load-balancer idle limit between the client and the gateway.

The nuance here is that the GPU work does not get faster. What changes is the connection model. A synchronous request ties the success of a 100-second render to a TCP connection staying healthy for 100 seconds across two network hops. Async breaks that coupling: the submit call returns in milliseconds, and the poll calls are short and idempotent.

Submitting and polling jobs with x-bf-async

The submit request looks like a normal call through the OpenAI-compatible endpoint, with one extra header. Because Bifrost is a drop-in replacement for that endpoint, our existing image client changed only at the header layer, not in the request body.

# Submit a long-running generation job
curl -X POST http://localhost:8080/v1/images/generations \
  -H "Content-Type: application/json" \
  -H "x-bf-async: true" \
  -d '{
    "model": "openai/gpt-image-1",
    "prompt": "studio product shot, white seamless background",
    "n": 8
  }'
# Response returns: x-bf-async-id: job_8f2c...

# Poll for the result with the returned job id
curl http://localhost:8080/v1/images/generations \
  -H "x-bf-async-id: job_8f2c..."

To be precise about what we measured: the submit call returns before the model starts decoding, so the client thread is free in well under a second. The poll interval we settled on is two seconds, which keeps the queue worker cheap without adding noticeable tail latency on completion. We retired the old retry-on-504 logic entirely, because there is no long-held connection left to fail.
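
The client side of this flow can be sketched as a small polling loop. This is a minimal Python sketch under stated assumptions, not Bifrost's own client: the helper name poll_until_done and its fetch callback are hypothetical, and the contract that fetch returns None while the job is still running is an assumption, since the poll response shape is not covered in this post. In practice fetch would wrap the GET request carrying the x-bf-async-id header.

```python
import time
from typing import Callable, Optional


def poll_until_done(
    fetch: Callable[[], Optional[dict]],
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Call fetch() until it returns a result, waiting interval_s between tries.

    Assumed contract (hypothetical): fetch() returns None while the job is
    still running and the result payload once the job has finished.
    """
    deadline = clock() + timeout_s
    while True:
        result = fetch()
        if result is not None:
            return result
        # Give up on the client side rather than polling forever.
        if clock() >= deadline:
            raise TimeoutError("job did not finish before the client-side deadline")
        sleep(interval_s)
```

The sleep and clock parameters exist only so the loop can be exercised without real waiting. The two-second default mirrors the interval described above, and the 300-second deadline is an illustrative value, not one we measured.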

Tagging and observing jobs in flight

Once jobs run detached, you need a way to attribute each one, otherwise a slow render is invisible until a customer complains. Bifrost forwards custom dimension headers prefixed x-bf-dim-* into logs, traces, and Prometheus, so we tag every submission with the team and the experiment that created it.

# Added to the submit request, before the -d body
  -H "x-bf-dim-team: catalog-enrichment" \
  -H "x-bf-dim-experiment: sdxl-batch-v3" \

Those tags land in the observability layer, which Bifrost writes asynchronously at under 0.1ms overhead per request. We now graph time-to-completion per experiment instead of one aggregate, which is how we found that one prompt template was three times slower than the rest of the batch. For cost attribution across teams, we pair the dimension tags with scoped virtual keys so each business unit carries its own budget against the same provider pool.
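
To keep tagging consistent across callers, the headers can be assembled in one place. The following is a minimal Python sketch: the function name bifrost_headers and its parameters are hypothetical, and only the x-bf-async header and the x-bf-dim- prefix come from the text above.

```python
from typing import Dict


def bifrost_headers(dimensions: Dict[str, str], use_async: bool = False) -> Dict[str, str]:
    """Build request headers with one x-bf-dim-<name> header per dimension.

    Hypothetical helper: it only formats header names and values, it does
    not validate them or send any request.
    """
    headers = {"Content-Type": "application/json"}
    if use_async:
        headers["x-bf-async"] = "true"
    for name, value in dimensions.items():
        headers[f"x-bf-dim-{name}"] = value
    return headers
```

Calling bifrost_headers({"team": "catalog-enrichment", "experiment": "sdxl-batch-v3"}, use_async=True) yields the same headers as the curl fragments shown earlier, ready to pass to whatever HTTP client the pipeline already uses.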

Routing also mattered here. The gateway unifies 20+ providers behind one endpoint, and the same async mechanism works whether the job lands on a self-hosted SDXL deployment or a hosted image model, so we can fail a batch over without rewriting the client.

Trade-offs and limitations

Async is the wrong default for fast paths. An interactive thumbnail that renders in 900ms gains nothing from submit-and-poll; you add a second round trip and a polling loop for a job that would have finished inside the original connection. We only route batches above roughly 30 seconds of expected render time through x-bf-async.
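
That routing rule is simple enough to express as a single check. This is a sketch in Python, assuming the caller already has an estimate of render time; the names ASYNC_THRESHOLD_S and async_headers_for are hypothetical, and the 30-second cutoff is the rough figure from the paragraph above rather than a tuned constant.

```python
from typing import Dict

# Rough cutoff from our own routing rule; treat it as a starting point.
ASYNC_THRESHOLD_S = 30.0


def async_headers_for(expected_render_s: float,
                      threshold_s: float = ASYNC_THRESHOLD_S) -> Dict[str, str]:
    """Return the extra headers for a job: async only above the threshold.

    Fast jobs get no extra header and stay on the synchronous path.
    """
    if expected_render_s > threshold_s:
        return {"x-bf-async": "true"}
    return {}
```

A 900ms thumbnail gets an empty dict and stays synchronous, while a 100-second batch gets the x-bf-async header and goes through submit-and-poll.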

The honest limitation on the Bifrost side is operational. Production deployments need Postgres backing the gateway, and you self-host the whole thing, which is real infrastructure to run and patch rather than a managed endpoint. The benchmark numbers are strong: Bifrost sustains 5,000 RPS on a single instance at 100% success with about 11µs of overhead on a t3.xlarge, but those figures describe a node you operate. The ecosystem is also younger than that of established proxies like LiteLLM, so some integration paths have fewer community examples to copy from. For our team the trade was clearly worth it, since the alternative was tuning load-balancer timeouts per route and still losing jobs at the tail.

Wrapping up

Async inference did not make our diffusion models faster; it made long renders survivable by removing the dependency on a single long-lived connection. The x-bf-async submit-and-poll model, plus dimension tags for attribution, turned a class of intermittent 504s into a measurable queue we can reason about. If you run image or video generation jobs that routinely cross your proxy timeout, this is the pattern I would try first.

If you want to see async inference and the rest of the gateway against your own workload, book a demo: https://getmaxim.ai/bifrost/book-a-demo