Cost accounting for diffusion image generation at $0.0008 per render

#mlops #llm #machinelearning #infrastructure

TL;DR: Per-image cost on our SDXL-based product photography pipeline at Photoroom dropped from $0.0031 to $0.0008 over six months. Most of the win came from boring infrastructure work, not model tricks. An AI gateway in front of our LLM caption-rewriting calls saved more than I expected.

I spent most of Q1 staring at a Grafana panel labelled cost_per_render_eur. Our diffusion pipeline generates background-replaced product images at volume. When marketing asks for a million renders, the per-image number matters.

To be precise: the cost I track is GPU-seconds on A100/H100 SXM nodes plus any external API calls plus storage IO. Not amortised salaries, not the office espresso machine. Just the marginal cost of one more render.
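As a rough illustration of that definition, here is a minimal sketch of the marginal-cost sum. The class name, field names, and the hourly GPU rate are assumptions made for the example, not values from our billing system.

```python
from dataclasses import dataclass


@dataclass
class RenderCost:
    """Marginal cost components for one render (all names are illustrative)."""

    gpu_seconds: float            # GPU time consumed by one render
    gpu_rate_per_hour: float      # hypothetical hourly price of the GPU node
    api_cost: float = 0.0         # external API calls, e.g. caption rewriting
    storage_io_cost: float = 0.0  # storage reads/writes and post-processing

    def marginal(self) -> float:
        """Cost of one more render, in the same currency as the inputs."""
        gpu_cost = self.gpu_seconds * self.gpu_rate_per_hour / 3600.0
        return gpu_cost + self.api_cost + self.storage_io_cost


# Placeholder inputs, chosen only to show the call shape.
example = RenderCost(gpu_seconds=0.5, gpu_rate_per_hour=3.6, api_cost=0.0002)
print(f"{example.marginal():.6f}")
```

Keeping the three components separate makes it easy to see which one moves when a stage is optimised.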

Where the money actually goes

Before I started measuring properly, I assumed the UNet denoising loop was 80%+ of the cost. It wasn't.

Stage	% of wall time	% of cost	Notes
Text encoder (CLIP + T5)	4%	11%	T5-XXL is expensive on H100
LLM caption rewriting	8%	22%	External API, GPT-4o-mini initially
UNet denoising (25 steps)	71%	48%	DPM++ 2M Karras
VAE decode	9%	7%	fp16, no tricks
Storage IO + image post	8%	12%	S3 multipart, sharpen, resize

The caption-rewriting step shocked me. We use an LLM to take a customer prompt like "white sneaker on beach" and expand it into a diffusion-friendly description with lighting, framing, camera details. That single API call was 22% of cost.

Killing the bill in three places

Step 1 — UNet quantisation to int8. Used torchao + a small calibration set of 512 product images. Quality drop measured by CLIP-similarity on a held-out set: 0.847 to 0.841. Negligible. Throughput went from 14 renders/sec to 23 renders/sec on an H100. That's a 39% cost drop on the dominant stage.
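The 39% figure follows directly from the throughput numbers, assuming the GPU's hourly price stays the same before and after quantisation. A small sketch of the arithmetic (the function name is illustrative):

```python
def relative_cost_drop(old_renders_per_sec: float, new_renders_per_sec: float) -> float:
    """Fractional drop in cost per render when throughput rises at a fixed GPU price.

    Cost per render is proportional to 1 / throughput, so the new cost is
    old_throughput / new_throughput of the old one.
    """
    return 1.0 - old_renders_per_sec / new_renders_per_sec


drop = relative_cost_drop(14, 23)
print(f"{drop:.1%}")  # 39.1%
```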

Step 2 — Caching the text-encoder outputs. For our product taxonomy, only about 4,000 unique caption stems exist (variations on "minimalist white background", "studio lighting from upper-left", etc.). T5-XXL embeddings for these are 14KB each. I cached them in Redis with a 30-day TTL. Hit rate after two weeks: 91%. Text-encoder cost dropped from 11% to 1.2%.
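The caching pattern can be sketched as below. This is an in-memory stand-in for Redis so the example runs on its own; the class name, the key normalisation, and the injectable clock are assumptions for illustration, and the real encoder call is replaced by whatever function you pass in.

```python
import time
from typing import Callable, Dict, Tuple

THIRTY_DAYS = 30 * 24 * 3600


class EmbeddingCache:
    """In-memory stand-in for a Redis cache keyed by caption stem."""

    def __init__(self, ttl_seconds: int = THIRTY_DAYS,
                 clock: Callable[[], float] = time.time) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, bytes]] = {}  # key -> (expiry, value)
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(stem: str) -> str:
        # Assumed normalisation: collapse whitespace and lowercase so that
        # trivially different stems share one entry.
        return " ".join(stem.lower().split())

    def get_or_encode(self, stem: str, encode: Callable[[str], bytes]) -> bytes:
        """Return the cached embedding, or run the encoder and store the result."""
        key = self._key(stem)
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        self.misses += 1
        value = encode(stem)
        self._store[key] = (now + self._ttl, value)
        return value

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

With a real Redis client the dictionary would be replaced by a get and a set with an expiry, but the hit/miss flow stays the same.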

Step 3 — The gateway problem. This is where it got interesting.

The LLM caption step was the messy one

The caption-rewriting calls were originally direct OpenAI API hits from our Python ranking service. When OpenAI had a partial outage in late January (the one that affected gpt-4o-mini specifically for ~40 minutes), we lost 280k renders. The cost of those failed renders, billed but not delivered, was around €890.

I put Bifrost in front. The choice was between LiteLLM, Portkey, and Bifrost. I'll be honest about the comparison.

LiteLLM has wider provider coverage in the Python ecosystem and a more mature semantic-cache integration with langchain-style apps. If your stack is pure Python and you live inside LangChain, it's a more natural fit.

Portkey's UI for prompt management is genuinely nicer than what Bifrost ships, and their guardrail catalog has more pre-built rules.

I picked Bifrost because (a) it's a Go binary with a single HTTP endpoint and our caption service is Go, (b) the automatic fallbacks between providers work without me writing routing logic, and (c) the semantic caching layer sits at the gateway so my Python preprocessing service and Go caption service share the cache.

Config that replaced about 140 lines of fallback logic in our caption service:

providers:
  openai:
    keys:
      - value: env.OPENAI_KEY_PRIMARY
        weight: 0.7
      - value: env.OPENAI_KEY_BACKUP
        weight: 0.3
  anthropic:
    keys:
      - value: env.ANTHROPIC_KEY
        weight: 1.0

fallbacks:
  - primary: openai/gpt-4o-mini
    secondary: anthropic/claude-haiku-4-5
    tertiary: openai/gpt-4o-mini

semantic_cache:
  enabled: true
  similarity_threshold: 0.94
  ttl_seconds: 604800
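
For context, the kind of hand-written logic this config stands in for looks roughly like the sketch below: try each provider in order and return the first success. It is neither our original caption-service code nor the gateway's implementation; the function and exception names are hypothetical, and the provider callables stand in for real API clients.

```python
from typing import Callable, List, Sequence, Tuple


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain has errored."""


def complete_with_fallback(
    prompt: str,
    providers: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Return (provider_name, completion) from the first provider that succeeds."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real service would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

Retries, timeouts, and per-key weighting are what push a loop like this well past a dozen lines in practice.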

The 0.94 similarity threshold matters. We tested 0.90, 0.92, 0.94, and 0.96 on 10,000 caption pairs and measured downstream image quality. Below 0.94, the cached caption sometimes mismatched the product category enough to confuse the UNet. At 0.96, hit rate dropped under 30% and the cost win disappeared.
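To make the threshold concrete, here is a minimal sketch of a semantic-cache lookup: find the most similar stored prompt by cosine similarity and accept it only at or above the threshold. This is an illustration of the general technique, not the gateway's internals; the function names and the linear scan are assumptions, and real embeddings would come from an embedding model rather than hand-written vectors.

```python
import math
from typing import Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def lookup(
    query: Sequence[float],
    entries: Sequence[Tuple[Sequence[float], str]],
    threshold: float = 0.94,
) -> Optional[str]:
    """Return the cached caption of the nearest entry, or None on a miss."""
    best_score, best_caption = -1.0, None
    for embedding, caption in entries:
        score = cosine_similarity(query, embedding)
        if score > best_score:
            best_score, best_caption = score, caption
    return best_caption if best_score >= threshold else None
```

Lowering the threshold turns more near-misses into hits, which is exactly where the category mismatches come from.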

Current numbers after one month with the gateway in place:

Caption API spend: down 61% (semantic cache hit rate of 47%)
Caption-step latency p95: 340ms to 110ms on cache hits
Failed render rate from upstream LLM issues: 0.31% to 0.04%
Caption share of total render cost: 22% to 8.2%

Trade-offs and Limitations

Quantisation to int8 cost me about three weekends of calibration tuning. For very high-end fashion shoots where we render at 2048x2048, the quality drop becomes visible in fine fabric weave. We keep an fp16 path for those.

The semantic cache occasionally returns a "close enough" caption that doesn't match a niche product category. For our long-tail (about 4% of requests), I disable the cache via a header per-call. The gateway supports this through request metadata.

Bifrost's clustering features are gated to enterprise, which is fine for our scale, but if I were running this across three regions I'd want to evaluate that cost honestly. Portkey's pricing for similar features came in lower for the team-collaboration tier.

I haven't migrated the image-generation outputs themselves through the gateway. The UNet runs on our own GPUs, not behind an LLM API, so the gateway adds no value there. Don't put infrastructure in places it doesn't earn its keep.