Semantic caching the VLM step in our product-photo pipeline

#llm #machinelearning #computervision #mlops

TL;DR: We put Bifrost in front of the VLM step that captions and rewrites prompts for our product-photo diffusion pipeline. Semantic caching cut that bill by ~62% in three weeks. The diffusion side, where the GPUs live, was never the cost we should have been worrying about.

The bill that surprised us

Our pipeline at Photoroom (paraphrased, not exact internal numbers) does three things per product image. A vision-language model reads the input and produces structured captions. A second LLM call rewrites the user's prompt into something the diffusion model behaves well with. Then SDXL with our internal LoRAs does the actual generation on our own A100s.

The diffusion step is what we obsess over. To be precise, it is what we benchmark and profile every sprint. So when we looked at the Q1 numbers, the surprise was that Claude and Gemini Vision together cost more than the GPU lease for the same workload. The VLM and prompt-rewrite layer was 58% of total inference spend.

The nuance here is that we had been calling the providers directly from a Python service with no caching. Same product image, same user request. The response paid for again.

Why we chose Bifrost over the alternatives

I looked at LiteLLM and Portkey first. Both are good. LiteLLM is the path of least resistance if you want a Python library inside an existing FastAPI service, and its provider coverage is excellent. Portkey has a polished hosted UX and very clean dashboarding.

We landed on Bifrost for three reasons specific to our setup. It runs as a Go binary, which means the gateway isn't competing for the same GIL-bound CPU as our inference service. Semantic caching is built in rather than an add-on. The OpenAI-compatible endpoint meant we didn't need to change any of our SDK calls, as documented here.

Honest comparison. LiteLLM has a larger Python ecosystem footprint and its routing config will feel more native if your stack is Python-first. Portkey's analytics UI is, frankly, prettier than what we get out of the box.

The setup

Bifrost runs as a sidecar next to the prompt-rewrite service. Both the captioning and rewrite calls now go through http://bifrost:8080/v1/chat/completions. Our config is small.

providers:
  anthropic:
    keys:
      - value: env.ANTHROPIC_KEY_PRIMARY
      - value: env.ANTHROPIC_KEY_BACKUP
  google_vertex:
    keys:
      - value: env.VERTEX_KEY

semantic_cache:
  enabled: true
  similarity_threshold: 0.94
  ttl_seconds: 86400

fallbacks:
  - primary: anthropic/claude-3-5-sonnet
    fallback:
      - google_vertex/gemini-1.5-pro

governance:
  virtual_keys:
    - id: vk_caption_team
      budget_usd_monthly: 800
    - id: vk_rewrite_team
      budget_usd_monthly: 400

Three things matter here. The cache threshold of 0.94 was tuned against a held-out set of 5,000 captioning calls. At 0.97 we missed too many obvious duplicates. At 0.90 we started returning captions that were close but wrong about colour, which is unforgivable for an e-commerce use case. The fallback list isn't theatre. We measured Anthropic 5xx rates of 0.4% over March, which on our volume is real customer-visible failures.

Numbers from week three

Metric	Before	After
Cache hit rate (caption)	0%	71%
Cache hit rate (rewrite)	0%	49%
p95 latency, caption step	1.8s	0.31s
Monthly VLM+LLM spend	baseline	-62%
Provider failover events handled	0 (we returned an error)	14

The rewrite step caches less well because user prompts vary more. Captioning is the big win, because product photos from the same merchant cluster heavily in embedding space. Roughly 70% of merchants in our top tier upload 80% of their catalogue images within a 90-day window.

What we actually changed in code

The migration was unromantic. Two lines.

client = openai.OpenAI(
    base_url="http://bifrost:8080/v1",
    api_key=os.environ["BIFROST_VIRTUAL_KEY"],
)

Everything else stayed. The VLM team didn't touch their code. The rewrite team flipped a config flag.

Trade-offs and Limitations

Semantic caching has a real failure mode. If your downstream model output is meant to vary across calls (creative generation, sampling-heavy use cases) you don't want this on. We disable it for the diffusion-prompt-suggestion endpoint that gives editors three variants. The cache would happily return the same triplet twice.

The Go binary is one more service to operate. For a small team this is non-trivial. LiteLLM-as-a-library has fewer moving parts if you don't need the cache.

Cost attribution through virtual keys is per-key, not per-end-customer-of-our-customers. If you need full multi-tenant chargeback down to the merchant level, you will write some glue.

The semantic cache uses an embedding model itself. Read the docs on what backs it before you assume your prompts stay inside your VPC.