<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Elise Moreau</title>
    <description>The latest articles on DEV Community by Elise Moreau (@elise_moreau).</description>
    <link>https://dev.to/elise_moreau</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3864909%2F72833c18-30db-4456-82ee-e7d2016cc38f.jpg</url>
      <title>DEV Community: Elise Moreau</title>
      <link>https://dev.to/elise_moreau</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/elise_moreau"/>
    <language>en</language>
    <item>
      <title>Best Cloudflare AI Gateway Alternative in 2026</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Tue, 02 Jun 2026 11:57:17 +0000</pubDate>
      <link>https://dev.to/elise_moreau/best-cloudflare-ai-gateway-alternative-in-2026-774</link>
      <guid>https://dev.to/elise_moreau/best-cloudflare-ai-gateway-alternative-in-2026-774</guid>
      <description>&lt;p&gt;Cloudflare AI Gateway has earned its place as an accessible on-ramp for teams that want to proxy and observe LLM traffic without much setup friction. Free-tier analytics, basic caching, and rate limiting make it a reasonable starting point for lightweight apps running inside Cloudflare's edge network. But when AI workloads mature into production systems, processing thousands of requests per second across multiple providers, enforcing governance policies, and demanding gateway overhead in the microseconds range, the architectural ceilings become hard to work around.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bifrost is the best Cloudflare AI Gateway alternative for teams that need production-grade performance, enterprise governance, and full deployment flexibility.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Gaps That Appear at Scale
&lt;/h2&gt;

&lt;p&gt;Cloudflare AI Gateway functions as a centralized proxy, but several constraints become significant as workloads grow:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Logging caps create operational blind spots.&lt;/strong&gt; The free tier stores a maximum of 100,000 log events per month. The Workers Paid plan raises that ceiling to one million. Once either limit is reached, incoming requests stop being logged entirely, which means production teams lose request-level visibility exactly when traffic is heaviest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure costs are not as flat as they appear.&lt;/strong&gt; The gateway itself carries no per-request charge, but it runs on Cloudflare Workers. At high volume, that means Workers billing kicks in: $0.30 per additional million requests and $0.02 per million CPU-milliseconds beyond the base allocation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The ecosystem coupling creates migration risk.&lt;/strong&gt; Cloudflare AI Gateway is built to run on Cloudflare's stack. Organizations not already invested in that ecosystem take on extra complexity and cost to adopt it, and any future migration requires rearchitecting the integration layer from scratch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enterprise governance coverage is limited.&lt;/strong&gt; Recent Cloudflare updates added Unified Billing and basic content moderation. What's still missing: granular per-team budget controls, virtual key-level spend limits, role-based access control, and hierarchical cost management. These are table-stakes requirements for enterprise AI deployments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No path to self-hosted deployment.&lt;/strong&gt; Cloudflare AI Gateway runs exclusively on Cloudflare-managed infrastructure. Teams with data residency requirements, air-gapped environments, or regulatory constraints have no option to deploy it inside their own VPC or private cloud.&lt;/p&gt;

&lt;p&gt;For prototype environments or low-volume projects already on the Cloudflare stack, these constraints may be workable. For teams running production AI at scale, they introduce real risk.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Bifrost Is the Right Alternative
&lt;/h2&gt;

&lt;p&gt;Bifrost is an open-source AI gateway written in Go, built from the ground up for production AI infrastructure. It exposes a unified, OpenAI-compatible API across 1,000+ models from 12+ providers, including OpenAI, Anthropic, AWS Bedrock, Google Vertex AI, Azure OpenAI, Mistral, Groq, Cohere, and Ollama.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gateway Overhead Measured in Microseconds
&lt;/h3&gt;

&lt;p&gt;Bifrost adds just &lt;strong&gt;11 microseconds of overhead per request at 5,000 RPS&lt;/strong&gt; on standard t3.xlarge instances, the lowest latency overhead measured across AI gateways in that class. Go's concurrency model eliminates the bottlenecks that affect Python-based gateways, and Bifrost avoids the infrastructure indirection of managed proxy layers entirely.&lt;/p&gt;

&lt;p&gt;For real-time agents, customer-facing chatbots, and high-frequency tool-calling pipelines, that performance difference accumulates meaningfully at scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  Governance at Every Level
&lt;/h3&gt;

&lt;p&gt;Bifrost ships with a governance model designed around how enterprise teams actually work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Budget controls&lt;/strong&gt; scoped to virtual keys, teams, and individual customers, not just the top-level account&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate limiting and access policies&lt;/strong&gt; configurable per API key, per team, or per application&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SSO&lt;/strong&gt; via Google and GitHub for enterprise authentication&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HashiCorp Vault integration&lt;/strong&gt; for secure key management and rotation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit logging&lt;/strong&gt; with no monthly caps, suitable for compliance and regulatory requirements&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Automatic Failover Without Manual Intervention
&lt;/h3&gt;

&lt;p&gt;Bifrost manages failure paths at the gateway level. When a provider is rate-limited, slow, or unavailable, automatic fallbacks route traffic to the next available provider or model with no downtime. Adaptive load balancing distributes requests across API keys and providers based on real-time availability and performance, and none of it requires manual intervention from the engineering team.&lt;/p&gt;

&lt;h3&gt;
  
  
  Semantic Caching That Reduces API Spend
&lt;/h3&gt;

&lt;p&gt;Bifrost's semantic caching layer stores responses and retrieves them for future prompts that are semantically similar, not just exact string matches. For applications where users regularly ask overlapping questions, support bots, knowledge assistants, internal search tools, this approach captures a broader range of cache-eligible requests than basic prompt caching and meaningfully reduces spend on redundant provider calls.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Native MCP Gateway for Agentic Applications
&lt;/h3&gt;

&lt;p&gt;Bifrost includes built-in support for the Model Context Protocol, giving AI models a standardized interface to interact with external tools: filesystems, web search, databases, and custom services. LLM routing and MCP tool access run through the same gateway, removing the need for separate infrastructure components. Centralized tool governance controls which tools each team and application can access.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-Time Guardrails
&lt;/h3&gt;

&lt;p&gt;Content filtering and safety guardrails run at the gateway layer, blocking unsafe outputs and enforcing compliance policies before responses reach end users. They operate in real time with no meaningful impact on request latency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Observability Without Extra Infrastructure
&lt;/h3&gt;

&lt;p&gt;Native Prometheus metrics, distributed tracing, and structured logging are built directly into Bifrost. No sidecars, no wrappers, no third-party integrations needed. Token counts, latency, error rates, and costs are all tracked at the per-request level, across models, teams, and environments. Paired with the Maxim AI observability platform, teams get a unified view across cost, latency, model behavior, and output quality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Self-Hosted, Open Source, No Lock-In
&lt;/h3&gt;

&lt;p&gt;Bifrost deploys via Docker, Kubernetes, or NPX and runs inside any infrastructure environment. A production-ready gateway is up in under 60 seconds:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is a drop-in replacement for OpenAI, Anthropic, and Google GenAI SDKs. One line change routes all traffic through Bifrost. Licensed under Apache 2.0 with no managed infrastructure dependency and no vendor billing surprises.&lt;/p&gt;

&lt;h2&gt;
  
  
  Feature Comparison: Bifrost vs. Cloudflare AI Gateway
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Cloudflare AI Gateway&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gateway Overhead&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Workers-dependent (variable)&lt;/td&gt;
&lt;td&gt;11µs at 5,000 RPS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Self-Hosted Deployment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No (Cloudflare-managed only)&lt;/td&gt;
&lt;td&gt;Yes (Docker, K8s, NPX)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Log Storage Limits&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;100K free / 1M paid per month&lt;/td&gt;
&lt;td&gt;Unlimited (Prometheus + structured logs)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Budget Controls&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Account-level only&lt;/td&gt;
&lt;td&gt;Per-key, per-team, per-customer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MCP Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Native MCP gateway&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Caching&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Basic prompt caching&lt;/td&gt;
&lt;td&gt;Semantic similarity-based caching&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Open Source&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (Apache 2.0)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Guardrails&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Basic content moderation&lt;/td&gt;
&lt;td&gt;Real-time safety and compliance guardrails&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Failover&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Retry + model fallback&lt;/td&gt;
&lt;td&gt;Automatic multi-provider failover with adaptive load balancing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Who Should Move to Bifrost
&lt;/h2&gt;

&lt;p&gt;Bifrost is built for teams that have hit the ceiling on what a managed, ecosystem-coupled gateway can offer:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;High-throughput engineering teams&lt;/strong&gt; that need microsecond-level gateway overhead without unpredictable Workers billing on top.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enterprise organizations&lt;/strong&gt; that require per-team budgets, RBAC, audit logging, SSO, and governance controls beyond what a basic proxy layer provides.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agentic application builders&lt;/strong&gt; that need unified LLM routing and MCP tool access through a single gateway rather than stitching together separate components.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Teams with data residency or compliance requirements&lt;/strong&gt; that must self-host within their own VPC or air-gapped environment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost-focused teams&lt;/strong&gt; looking to reduce LLM API spend through semantic caching and intelligent fallback routing across providers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Start Using Bifrost
&lt;/h2&gt;

&lt;p&gt;Bifrost is open source and free to deploy. It takes under 60 seconds to get a production-ready gateway running locally.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Try Bifrost →&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Book a Demo →&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GitHub →&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Documentation →&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>infrastructure</category>
      <category>llm</category>
      <category>performance</category>
    </item>
    <item>
      <title>Our LiDAR detector spent 40% of its time in voxelization, not convs</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Tue, 02 Jun 2026 05:40:34 +0000</pubDate>
      <link>https://dev.to/elise_moreau/our-lidar-detector-spent-40-of-its-time-in-voxelization-not-convs-2kbb</link>
      <guid>https://dev.to/elise_moreau/our-lidar-detector-spent-40-of-its-time-in-voxelization-not-convs-2kbb</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We profiled a LiDAR object detector expecting the 3D backbone to dominate. It didn't. Voxelization plus the scatter-to-pillars step ate roughly 40% of per-frame latency on an A100, and pulling them out of the Python hot path took our p50 from 31ms down to 19ms.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The assumption that cost us two weeks
&lt;/h2&gt;

&lt;p&gt;When I was at Valeo.ai we ran a PointPillars-style detector on nuScenes-scale point clouds, around 250k points per sweep. The mental model everyone carried was simple. The sparse conv backbone is heavy, so the backbone is where the milliseconds go. We spent a sprint trying to prune channels and fuse BatchNorm into the conv weights before anyone actually looked at a trace.&lt;/p&gt;

&lt;p&gt;When we finally ran &lt;code&gt;torch.profiler&lt;/code&gt; with CUDA activities enabled, the picture was not what we expected. The 2D CNN head and the pillar feature net were fast. The expensive part lived upstream, in the code nobody thought of as "the model."## What the trace actually showed&lt;/p&gt;

&lt;p&gt;To be precise, two things dominated. First, the voxelization that buckets raw points into a fixed grid of pillars. Second, the scatter operation that writes encoded pillar features back into a dense BEV canvas before the 2D backbone runs.&lt;/p&gt;

&lt;p&gt;Here is the kind of profiling output that made us stop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;torch.profiler&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ProfilerActivity&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;activities&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ProfilerActivity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CPU&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ProfilerActivity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CUDA&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;record_shapes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;prof&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;detections&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;point_cloud&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;synchronize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prof&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;key_averages&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;sort_by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda_time_total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row_limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="c1"&gt;# voxelize_cpu          12.4 ms  (self CPU)
# scatter_nd            3.1 ms   (CUDA)
# sparse_conv_backbone  9.8 ms   (CUDA)
# rpn_head              4.2 ms
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The voxelization ran on CPU in a Python loop. Every frame paid a host-to-device copy after the points were bucketed, and the CPU work serialized against the GPU instead of overlapping with it. The backbone was not the problem. The plumbing around it was.## Moving the work to where the data already lives&lt;/p&gt;

&lt;p&gt;The fix was not exotic. We replaced the CPU voxelizer with a GPU implementation from &lt;code&gt;spconv&lt;/code&gt; 2.x, kept the point cloud resident on the device from the sensor decode onward, and let the scatter run as a single fused kernel instead of an indexed assignment in Python.&lt;/p&gt;

&lt;p&gt;The nuance here is that the win came from removing a synchronization point, not from making any single kernel faster. Once voxelization happened on-device, the CPU could prepare frame N+1 while the GPU finished frame N. That overlap is invisible in a microbenchmark of one op and very visible in end-to-end p50.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Before (CPU voxelize)&lt;/th&gt;
&lt;th&gt;After (GPU voxelize)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Voxelization&lt;/td&gt;
&lt;td&gt;12.4 ms (CPU)&lt;/td&gt;
&lt;td&gt;1.9 ms (CUDA)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scatter to BEV&lt;/td&gt;
&lt;td&gt;3.1 ms&lt;/td&gt;
&lt;td&gt;1.4 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sparse conv backbone&lt;/td&gt;
&lt;td&gt;9.8 ms&lt;/td&gt;
&lt;td&gt;9.6 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RPN head&lt;/td&gt;
&lt;td&gt;4.2 ms&lt;/td&gt;
&lt;td&gt;4.1 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;p50 end-to-end&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;31 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;19 ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The backbone barely moved, which is the whole point. We had been optimizing the one part of the pipeline that was already efficient.&lt;/p&gt;

&lt;h2&gt;
  
  
  The labeling side had the same shape of bug
&lt;/h2&gt;

&lt;p&gt;While we were in there, we hit a related issue in our auto-labeling loop. We used a VLM to spot-check a sample of frames where the detector confidence was low, around 3% of 1.2M frames. The captioning calls were a different kind of bottleneck, one that came from rate limits and the occasional provider timeout rather than a kernel.&lt;/p&gt;

&lt;p&gt;We put those calls behind a gateway so a failed request to one provider would fail over to another without us babysitting it. Bifrost (&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;https://github.com/maximhq/bifrost&lt;/a&gt;) was the one we landed on, mostly because it spoke an OpenAI-compatible API and we didn't want to rewrite the client. There are other options in that space. The lesson was the same as the LiDAR one. The slow part was rarely the model itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;Moving voxelization to the GPU is not free. The &lt;code&gt;spconv&lt;/code&gt; GPU path holds more device memory for the hash tables that map points to voxels, so on a memory-constrained Jetson Orin we had to drop the max points-per-pillar from 32 to 20 to fit, which cost us about 0.4 mAP on the moderate split. On an A100 that tradeoff never came up.&lt;/p&gt;

&lt;p&gt;There is also a portability cost. A CPU voxelizer runs anywhere. The GPU version pins you to a CUDA toolkit version that matches your &lt;code&gt;spconv&lt;/code&gt; build, and we burned an afternoon on a mismatch between CUDA 11.8 and a wheel built for 12.1.&lt;/p&gt;

&lt;p&gt;And profiling itself can mislead. CUDA kernels launch asynchronously, so without an explicit &lt;code&gt;torch.cuda.synchronize()&lt;/code&gt; before reading timings, you measure launch overhead instead of real work. Half our early numbers were wrong for exactly this reason.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd tell my past self
&lt;/h2&gt;

&lt;p&gt;Profile before you optimize, and profile the whole pipeline, not the layer you find intellectually interesting. The detector network is the part you publish papers about. The preprocessing is the part that ships in production and quietly dominates latency.&lt;/p&gt;

&lt;p&gt;For LiDAR specifically, watch the boundary between CPU and GPU. Every host-to-device copy per frame is a stall, and stalls hide from per-op benchmarks. The model is usually the easy part.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/1812.05784" rel="noopener noreferrer"&gt;PointPillars: Fast Encoders for Object Detection from Point Clouds&lt;/a&gt; - the original paper, still the clearest explanation of the pillar encoding.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/traveller59/spconv" rel="noopener noreferrer"&gt;spconv: Spatially Sparse Convolution Library&lt;/a&gt; - the GPU voxelization and sparse conv kernels we switched to.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html" rel="noopener noreferrer"&gt;PyTorch Profiler with TensorBoard&lt;/a&gt; - how to read CUDA traces without fooling yourself on async timing.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.nuscenes.org/" rel="noopener noreferrer"&gt;nuScenes dataset&lt;/a&gt; - the benchmark these numbers came from.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost AI gateway&lt;/a&gt; - the failover layer we used for the VLM spot-check calls.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>machinelearning</category>
      <category>computervision</category>
      <category>mlops</category>
    </item>
    <item>
      <title>Centralising tool access for our prompt-assembly agent with Bifrost MCP gateway</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Mon, 01 Jun 2026 14:55:36 +0000</pubDate>
      <link>https://dev.to/elise_moreau/centralising-tool-access-for-our-prompt-assembly-agent-with-bifrost-mcp-gateway-5b4n</link>
      <guid>https://dev.to/elise_moreau/centralising-tool-access-for-our-prompt-assembly-agent-with-bifrost-mcp-gateway-5b4n</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: Before our SDXL stack renders a single product photo, a small LLM agent assembles the request from product metadata, a template lookup, and a brand-colour database. Wiring those tools separately for each provider kept drifting. Bifrost's MCP gateway let us register the tools once and keep them when we fail over from GPT-4o-mini to Claude. Below is what it cost, and where LiteLLM and Portkey would honestly have served us better.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The step nobody benchmarks in a diffusion pipeline
&lt;/h2&gt;

&lt;p&gt;At Photoroom I work on the diffusion stack that generates product photography. The denoising loop gets all the attention. The model is the easy part.&lt;/p&gt;

&lt;p&gt;Before any of that runs, a small agent assembles the generation request. It reads a product metadata JSON off our object store, looks up a background template by category in Postgres, and sometimes runs a web search to pin down a brand's palette. Three tool calls, maybe 40 tokens of reasoning, then it emits a structured prompt and a set of conditioning parameters for SDXL.&lt;/p&gt;

&lt;p&gt;To be precise, that agent is &lt;code&gt;gpt-4o-mini&lt;/code&gt; about 90% of the time. It's cheap and the latency budget is roughly 600ms. The nuance here is that when OpenAI rate-limits us during a traffic spike, we fail over to &lt;code&gt;claude-haiku&lt;/code&gt;, and the agent has to keep working with the exact same tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why per-provider tool wiring broke
&lt;/h2&gt;

&lt;p&gt;We had every tool defined twice. OpenAI function-calling schemas and Anthropic's tool blocks are close but not identical, and our 9 tools lived in both formats inside the agent service.&lt;/p&gt;

&lt;p&gt;That duplication drifted. Someone would add a &lt;code&gt;region&lt;/code&gt; argument to the template-lookup tool on the OpenAI path and forget the Anthropic one. The failover path then silently produced worse prompts, which we only caught because a reviewer noticed bland backgrounds in a sample set. Hard to test. Easy to miss.&lt;/p&gt;

&lt;p&gt;The deeper issue: tool execution was glued into the agent process. Filesystem reads, the Postgres query, the web search, all ran inline, so each provider integration re-implemented the same plumbing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the MCP gateway changed
&lt;/h2&gt;

&lt;p&gt;Bifrost exposes a &lt;a href="https://docs.getbifrost.ai/mcp/overview" rel="noopener noreferrer"&gt;Model Context Protocol gateway&lt;/a&gt; that lets models call external tools (filesystem, web search, databases) registered once at the gateway rather than per provider. We moved our tool definitions there. The agent now talks to one OpenAI-compatible endpoint, and the same tool set is presented whether the request lands on GPT-4o-mini or Claude.&lt;/p&gt;

&lt;p&gt;Config looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;openai&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;env.OPENAI_KEY&lt;/span&gt;
  &lt;span class="na"&gt;anthropic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;env.ANTHROPIC_KEY&lt;/span&gt;

&lt;span class="na"&gt;fallbacks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;openai/gpt-4o-mini&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;anthropic/claude-haiku&lt;/span&gt;

&lt;span class="na"&gt;mcp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;servers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;photoroom-tools&lt;/span&gt;
      &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;template_lookup&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;brand_palette&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;metadata_read&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Failover and the tool surface are now described in the same file. When &lt;a href="https://docs.getbifrost.ai/features/retries-and-fallbacks" rel="noopener noreferrer"&gt;automatic fallback&lt;/a&gt; kicks in, the tools come along. No second schema to keep in sync.&lt;/p&gt;

&lt;p&gt;One side benefit we didn't plan for: the &lt;a href="https://docs.getbifrost.ai/features/observability/default" rel="noopener noreferrer"&gt;Prometheus metrics&lt;/a&gt; Bifrost emits gave us per-tool call counts. We learned the web-search tool fired on only 4% of requests but accounted for a third of agent latency. We've since gated it behind a confidence check.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it compares
&lt;/h2&gt;

&lt;p&gt;I evaluated Bifrost against LiteLLM and Portkey, which we'd both used before. None of these is strictly best.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concern&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;Portkey&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MCP tool gateway&lt;/td&gt;
&lt;td&gt;Built in, registered once&lt;/td&gt;
&lt;td&gt;Not a first-class feature&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Provider breadth&lt;/td&gt;
&lt;td&gt;23+&lt;/td&gt;
&lt;td&gt;Largest provider list&lt;/td&gt;
&lt;td&gt;Broad&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Runtime&lt;/td&gt;
&lt;td&gt;Go, low overhead&lt;/td&gt;
&lt;td&gt;Python proxy&lt;/td&gt;
&lt;td&gt;Hosted-first&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Observability&lt;/td&gt;
&lt;td&gt;Native Prometheus&lt;/td&gt;
&lt;td&gt;Callbacks, more wiring&lt;/td&gt;
&lt;td&gt;Strong hosted dashboard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-host simplicity&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;npx&lt;/code&gt; or Docker&lt;/td&gt;
&lt;td&gt;pip + config&lt;/td&gt;
&lt;td&gt;Possible, cloud-leaning&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;LiteLLM is the honest pick if your differentiator is provider coverage or you live entirely in Python and want callbacks in-process. Portkey's hosted analytics are more polished than anything I'd build on top of raw Prometheus, and their guardrails UI is genuinely nicer for non-engineers. We chose Bifrost because the MCP gateway matched our specific shape, an agent with shared tools across a failover path, and because the Go proxy added under a millisecond at p50 in our load tests.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;The MCP gateway centralises tool definitions, but it also centralises a failure mode. If the gateway is down, both the model call and the tools go with it. We run two replicas behind a load balancer and treat the gateway as a tier-1 dependency now, which is more operational weight than the old inline tools carried.&lt;/p&gt;

&lt;p&gt;Debugging a tool call across the MCP boundary is harder than a local function. A bad Postgres query used to throw inside our process with a clean stack trace. Now I read gateway logs and correlate by request ID. Workable, not free.&lt;/p&gt;

&lt;p&gt;And the comparison above is not eternal. LiteLLM was adding MCP support when I last checked, so the gap here may close. Evaluate against current versions, not this table.&lt;/p&gt;

&lt;p&gt;Semantic caching, which Bifrost also offers, did nothing for us on this step. Each request is keyed on a unique product, so cache hit rate was near zero. We left it off. Worth saying plainly, since the feature is often pitched as a default win.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/mcp/overview" rel="noopener noreferrer"&gt;Bifrost MCP overview&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/retries-and-fallbacks" rel="noopener noreferrer"&gt;Retries and fallbacks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/observability/default" rel="noopener noreferrer"&gt;Observability and Prometheus metrics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost on GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://modelcontextprotocol.io" rel="noopener noreferrer"&gt;The original MCP specification&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>machinelearning</category>
      <category>computervision</category>
      <category>mlops</category>
    </item>
    <item>
      <title>torch.compile recompiled our SDXL UNet 38 times in production</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Fri, 29 May 2026 05:38:13 +0000</pubDate>
      <link>https://dev.to/elise_moreau/torchcompile-recompiled-our-sdxl-unet-38-times-in-production-o91</link>
      <guid>https://dev.to/elise_moreau/torchcompile-recompiled-our-sdxl-unet-38-times-in-production-o91</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: &lt;code&gt;torch.compile&lt;/code&gt; gave us a 2.3x speedup on our SDXL pipeline in benchmarks, then quietly recompiled 38 times across the first 100 production requests because every customer uploads a product photo at a different resolution. The fix wasn't turning compile off. It was understanding what counts as a guard, bucketing inputs to fixed shapes, and reading the recompilation logs PyTorch 2.3 gives you for free.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The benchmark that lied to me
&lt;/h2&gt;

&lt;p&gt;At Photoroom we run diffusion models for product photography. Someone uploads a sneaker on a kitchen table, and the model gives it a clean studio background. The UNet is the heavy part, so when PyTorch 2.3 promised free speedups through &lt;code&gt;torch.compile&lt;/code&gt;, I spent a week wiring it in.&lt;/p&gt;

&lt;p&gt;The benchmark looked great. Fixed 1024x1024 input, batch size 4, an A10G. 2.3x faster than eager mode after warmup. I shipped it to a 5% canary.&lt;/p&gt;

&lt;p&gt;p99 latency went &lt;em&gt;up&lt;/em&gt;. Not by a little. Some requests took 70 seconds longer than before the change.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a guard actually is
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;torch.compile&lt;/code&gt; traces a graph for a specific set of input properties. Tensor shapes, dtypes, certain scalar values, device. Dynamo wraps that graph in what it calls guards, which are cheap runtime checks that say "this compiled kernel is valid only if the next input matches". Miss a guard, and it recompiles.&lt;/p&gt;

&lt;p&gt;Compiling the SDXL UNet takes 40 to 90 seconds on an A10G. That cost is fine once, at startup. The nuance here is that it happens lazily, inside the first request that violates a guard. So the recompile lands in the middle of a customer waiting for their image.&lt;/p&gt;

&lt;p&gt;And product photos do not have a fixed shape. Phones shoot 3024x4032, someone crops to 800x600, a Shopify export is 1200x1200. Every new resolution is a new shape, a new guard miss, another recompile mid-request.&lt;/p&gt;

&lt;h2&gt;
  
  
  Watching the recompiles happen
&lt;/h2&gt;

&lt;p&gt;The thing that saved me was not a profiler. It was one environment variable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;TORCH_LOGS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"recompiles"&lt;/span&gt; python serve.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That prints a line every time Dynamo throws away a compiled graph, with the reason:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Recompiling function forward in unet.py:412
    triggered by the following guard failure(s):
    - tensor 'L['''sample''']' size mismatch at index 2.
      expected 128, actual 96
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Index 2 is the latent height. A 1024px image becomes a 128-wide latent, a 768px image becomes 96. Different shape, recompile. I counted 38 distinct recompiles before the cache stabilized, and it never fully stabilized because new resolutions kept arriving.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three ways to stop it
&lt;/h2&gt;

&lt;p&gt;I tested three approaches over a week against real traffic from our logs. Here's what held up.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Recompiles after warmup&lt;/th&gt;
&lt;th&gt;Speedup kept&lt;/th&gt;
&lt;th&gt;Main cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;torch.compile(model, dynamic=True)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;~1.6x&lt;/td&gt;
&lt;td&gt;More general kernel, slower per step&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Resolution bucketing&lt;/td&gt;
&lt;td&gt;3 (warmed at boot)&lt;/td&gt;
&lt;td&gt;~2.1x&lt;/td&gt;
&lt;td&gt;Padding pixels wasted through VAE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fixed canonical resolution&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2.3x&lt;/td&gt;
&lt;td&gt;Quality loss on extreme aspect ratios&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;torch.compile(model, dynamic=True)&lt;/code&gt; tells Dynamo to assume shapes vary from the start. No per-shape recompiles, but you pay with a more general kernel. We measured 1.6x instead of 2.3x. Honest, predictable, leaves speed on the table.&lt;/p&gt;

&lt;p&gt;Bucketing won. We resize and pad every input so the long edge lands on one of {768, 1024, 1280}, then compile each bucket once at boot before the readiness probe goes green.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;BUCKETS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;768&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1280&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;to_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;long_edge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BUCKETS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;long_edge&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;resize_and_pad&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Warm all three at startup, before serving traffic
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;BUCKETS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;dummy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;compiled_unet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dummy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three compiles total, all before the pod takes traffic. Cache hits forever after.&lt;/p&gt;

&lt;h2&gt;
  
  
  The part that isn't the UNet
&lt;/h2&gt;

&lt;p&gt;There's a small LLM step before the diffusion model even runs. It rewrites the user's text prompt into something the model handles better, turning "make it look nice" into a structured scene description. Those calls go out to an external provider, and when that provider has a bad minute the whole render stalls behind it.&lt;/p&gt;

&lt;p&gt;We route that step through an AI gateway so a provider hiccup fails over to a backup instead of blocking. We use &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; for this, though several gateways do automatic fallback. It's one box in the pipeline diagram, not the interesting one, but it kept the prompt-rewrite step from becoming a single point of failure once compile fixed the UNet side.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;Bucketing wastes compute on padding. An 800x600 image padded to 1024 pushes roughly 60% extra pixels through the VAE decode. For us that's worth a stable p99. For a catalog that's all square crops, it'd be pure overhead, and &lt;code&gt;dynamic=True&lt;/code&gt; would be the saner call.&lt;/p&gt;

&lt;p&gt;Warmup adds about 4 minutes to pod startup. Our Kubernetes &lt;code&gt;readinessProbe&lt;/code&gt; has &lt;code&gt;initialDelaySeconds&lt;/code&gt; set to wait it out, which slows autoscaling response during traffic spikes. You feel this when a flash sale hits.&lt;/p&gt;

&lt;p&gt;The Inductor compile cache is per-process by default. Every new pod recompiles from scratch. You can point &lt;code&gt;TORCHINDUCTOR_CACHE_DIR&lt;/code&gt; at a shared volume to persist it, but the cache is keyed loosely on environment, and a mismatched CUDA driver version across nodes gave us a silent fallback to eager once. Test that path before you trust it.&lt;/p&gt;

&lt;p&gt;And none of this helps if your bottleneck is the VAE or the scheduler loop. Profile first. I burned two days compiling a UNet that was already 40% of wall-clock.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://pytorch.org/docs/stable/torch.compiler_troubleshooting.html" rel="noopener noreferrer"&gt;torch.compile troubleshooting guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pytorch.org/docs/stable/torch.compiler_dynamic_shapes.html" rel="noopener noreferrer"&gt;Dynamic shapes in torch.compile&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pytorch.org/tutorials/recipes/torch_compile_caching_tutorial.html" rel="noopener noreferrer"&gt;Compile cache tutorial&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2307.01952" rel="noopener noreferrer"&gt;SDXL paper (Podell et al., 2023)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost AI gateway&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>pytorch</category>
      <category>machinelearning</category>
      <category>computervision</category>
      <category>mlops</category>
    </item>
    <item>
      <title>Tracing our 4-stage product photo pipeline through Bifrost</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Thu, 28 May 2026 14:55:01 +0000</pubDate>
      <link>https://dev.to/elise_moreau/tracing-our-4-stage-product-photo-pipeline-through-bifrost-3b09</link>
      <guid>https://dev.to/elise_moreau/tracing-our-4-stage-product-photo-pipeline-through-bifrost-3b09</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We added OpenTelemetry tracing across the four LLM and VLM hops in our product-photo pipeline by routing them through Bifrost. Pipeline-level p95 went from 11.2s to 6.8s in two weeks, mostly because we could finally see which step was the bottleneck. The tracing was free once the gateway was in place; we weren't going to instrument four SDKs by hand.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the time went was a mystery
&lt;/h2&gt;

&lt;p&gt;At Photoroom we run a four-stage pipeline for catalog photo cleanup. Background removal with an in-house model. Inpainting via SDXL with our internal LoRA. Upscaling through Real-ESRGAN. A caption step that hits an external VLM provider. Four hops, two of them outside our VPC.&lt;/p&gt;

&lt;p&gt;A bug report sat in our queue for a week: "the pipeline is slow on Tuesdays." Designers were timing out at 12 seconds. Our internal Grafana showed the GPU jobs were fine. The external VLM call latency? Nobody had a number.&lt;/p&gt;

&lt;p&gt;To be precise, we had per-service latency in Datadog, but the spans didn't stitch together across the external API hops. We could see step 1 took 320ms and step 4 took "8 to 10 seconds" but the "to" in there was the entire problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we changed
&lt;/h2&gt;

&lt;p&gt;I'd been evaluating Bifrost (github.com/maximhq/bifrost) for an unrelated project: semantic caching on the caption step. While reading the observability docs I noticed Prometheus metrics and OTel export were native, not plugins.&lt;/p&gt;

&lt;p&gt;So I rewired both external calls (caption VLM and a prompt-rewrite call we make before SDXL) to go through Bifrost instead of directly to the provider. The Python change was about 9 lines. Swap the base URL.&lt;/p&gt;

&lt;p&gt;The relevant slice of our &lt;code&gt;config.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"providers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"keys"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env.OPENAI_KEY"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"gpt-4o-mini"&lt;/span&gt;&lt;span class="p"&gt;]}],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"concurrency_and_buffer_size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"concurrency"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"anthropic"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"keys"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env.ANTHROPIC_KEY"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"claude-haiku-4-5"&lt;/span&gt;&lt;span class="p"&gt;]}]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"observability"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"prometheus_labels"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"x-team"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"x-pipeline-stage"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"otel_endpoint"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://otel-collector:4318"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The two custom headers (&lt;code&gt;x-team&lt;/code&gt;, &lt;code&gt;x-pipeline-stage&lt;/code&gt;) attach to every Prometheus sample and every OTel span. In Tempo I can filter by pipeline stage and see exactly which hop is slow.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we found
&lt;/h2&gt;

&lt;p&gt;The Tuesday slowness was a regional capacity blip with our caption provider. p99 for the caption call was 14 seconds with no error, only slow token output. Once Bifrost sat in front of it we configured a fallback to a second provider and the 12-second pipeline timeouts disappeared.&lt;/p&gt;

&lt;p&gt;Full p95 numbers, two weeks before and two weeks after:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;p95 before&lt;/th&gt;
&lt;th&gt;p95 after&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Background removal&lt;/td&gt;
&lt;td&gt;340ms&lt;/td&gt;
&lt;td&gt;340ms&lt;/td&gt;
&lt;td&gt;Untouched, on our GPUs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt rewrite&lt;/td&gt;
&lt;td&gt;1.1s&lt;/td&gt;
&lt;td&gt;480ms&lt;/td&gt;
&lt;td&gt;Semantic cache hit ~62%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SDXL + LoRA&lt;/td&gt;
&lt;td&gt;4.2s&lt;/td&gt;
&lt;td&gt;4.2s&lt;/td&gt;
&lt;td&gt;Untouched&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Caption VLM&lt;/td&gt;
&lt;td&gt;6.8s&lt;/td&gt;
&lt;td&gt;1.4s&lt;/td&gt;
&lt;td&gt;Failover plus cache&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;End-to-end&lt;/td&gt;
&lt;td&gt;11.2s&lt;/td&gt;
&lt;td&gt;6.8s&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The semantic cache hit rate on the prompt-rewrite step is high because designers re-run very similar product descriptions. That was latency we didn't know we were leaving on the table.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it compares to what we considered
&lt;/h2&gt;

&lt;p&gt;We looked at LiteLLM and Portkey first. We already had a half-deployed LiteLLM somewhere in the cluster.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;Portkey&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Self-host&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (some features gated)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prometheus native&lt;/td&gt;
&lt;td&gt;Plugin&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OTel spans&lt;/td&gt;
&lt;td&gt;Plugin&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic cache&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP support&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Measured overhead&lt;/td&gt;
&lt;td&gt;~9ms&lt;/td&gt;
&lt;td&gt;cloud RTT dependent&lt;/td&gt;
&lt;td&gt;~1.2ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The nuance here is that Portkey is excellent if you are happy on their cloud. We couldn't be, because some product images carry customer-identifiable data and the entire pipeline runs inside our VPC. LiteLLM is mature and its community is larger than Bifrost's, but the observability story required wiring up the Prometheus plugin myself. Bifrost ships that out of the box per the docs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;Bifrost is younger than LiteLLM and its plugin ecosystem is correspondingly smaller. We wrote a 40-line Go plugin to attach Photoroom-internal request IDs to spans. It works but it's one more thing we own.&lt;/p&gt;

&lt;p&gt;The MCP feature is genuinely useful for tool-using agents. We aren't using it yet. Our pipeline isn't agentic; if you only call chat completions you can ignore MCP entirely.&lt;/p&gt;

&lt;p&gt;Latency overhead is low (~1.2ms in our measurements) but it isn't zero. For an image pipeline where each hop is hundreds of milliseconds it's invisible. For a high-QPS embedding service it might matter. Benchmark in your own environment before assuming.&lt;/p&gt;

&lt;p&gt;One operational note: the OTel exporter sends spans synchronously by default. We had to bump the batch interval to 5 seconds; otherwise the gateway pod's CPU climbed under load.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Bifrost observability docs: &lt;a href="https://docs.getbifrost.ai/features/observability/default" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/observability/default&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Bifrost semantic caching: &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/semantic-caching&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Bifrost fallbacks and retries: &lt;a href="https://docs.getbifrost.ai/features/retries-and-fallbacks" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/retries-and-fallbacks&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;OpenTelemetry GenAI semantic conventions: &lt;a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/" rel="noopener noreferrer"&gt;https://opentelemetry.io/docs/specs/semconv/gen-ai/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Real-ESRGAN paper, Wang et al. 2021: &lt;a href="https://arxiv.org/abs/2107.10833" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2107.10833&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>mlops</category>
      <category>llm</category>
    </item>
    <item>
      <title>Semantic caching the VLM step in our product-photo pipeline</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Wed, 27 May 2026 14:52:40 +0000</pubDate>
      <link>https://dev.to/elise_moreau/semantic-caching-the-vlm-step-in-our-product-photo-pipeline-5ahj</link>
      <guid>https://dev.to/elise_moreau/semantic-caching-the-vlm-step-in-our-product-photo-pipeline-5ahj</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We put Bifrost in front of the VLM step that captions and rewrites prompts for our product-photo diffusion pipeline. Semantic caching cut that bill by ~62% in three weeks. The diffusion side, where the GPUs live, was never the cost we should have been worrying about.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The bill that surprised us
&lt;/h2&gt;

&lt;p&gt;Our pipeline at Photoroom (paraphrased, not exact internal numbers) does three things per product image. A vision-language model reads the input and produces structured captions. A second LLM call rewrites the user's prompt into something the diffusion model behaves well with. Then SDXL with our internal LoRAs does the actual generation on our own A100s.&lt;/p&gt;

&lt;p&gt;The diffusion step is what we obsess over. To be precise, it is what we benchmark and profile every sprint. So when we looked at the Q1 numbers, the surprise was that Claude and Gemini Vision together cost more than the GPU lease for the same workload. The VLM and prompt-rewrite layer was 58% of total inference spend.&lt;/p&gt;

&lt;p&gt;The nuance here is that we had been calling the providers directly from a Python service with no caching. Same product image, same user request. The response paid for again.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why we chose Bifrost over the alternatives
&lt;/h2&gt;

&lt;p&gt;I looked at LiteLLM and Portkey first. Both are good. LiteLLM is the path of least resistance if you want a Python library inside an existing FastAPI service, and its provider coverage is excellent. Portkey has a polished hosted UX and very clean dashboarding.&lt;/p&gt;

&lt;p&gt;We landed on Bifrost for three reasons specific to our setup. It runs as a Go binary, which means the gateway isn't competing for the same GIL-bound CPU as our inference service. Semantic caching is built in rather than an add-on. The OpenAI-compatible endpoint meant we didn't need to change any of our SDK calls, &lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;as documented here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Honest comparison. LiteLLM has a larger Python ecosystem footprint and its routing config will feel more native if your stack is Python-first. Portkey's analytics UI is, frankly, prettier than what we get out of the box.&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;Bifrost runs as a sidecar next to the prompt-rewrite service. Both the captioning and rewrite calls now go through &lt;code&gt;http://bifrost:8080/v1/chat/completions&lt;/code&gt;. Our config is small.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;anthropic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;env.ANTHROPIC_KEY_PRIMARY&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;env.ANTHROPIC_KEY_BACKUP&lt;/span&gt;
  &lt;span class="na"&gt;google_vertex&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;env.VERTEX_KEY&lt;/span&gt;

&lt;span class="na"&gt;semantic_cache&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;similarity_threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.94&lt;/span&gt;
  &lt;span class="na"&gt;ttl_seconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;86400&lt;/span&gt;

&lt;span class="na"&gt;fallbacks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;primary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anthropic/claude-3-5-sonnet&lt;/span&gt;
    &lt;span class="na"&gt;fallback&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;google_vertex/gemini-1.5-pro&lt;/span&gt;

&lt;span class="na"&gt;governance&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;virtual_keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vk_caption_team&lt;/span&gt;
      &lt;span class="na"&gt;budget_usd_monthly&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;800&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vk_rewrite_team&lt;/span&gt;
      &lt;span class="na"&gt;budget_usd_monthly&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;400&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three things matter here. The cache threshold of 0.94 was tuned against a held-out set of 5,000 captioning calls. At 0.97 we missed too many obvious duplicates. At 0.90 we started returning captions that were close but wrong about colour, which is unforgivable for an e-commerce use case. The fallback list isn't theatre. We measured Anthropic 5xx rates of 0.4% over March, which on our volume is real customer-visible failures.&lt;/p&gt;

&lt;h2&gt;
  
  
  Numbers from week three
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cache hit rate (caption)&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;71%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cache hit rate (rewrite)&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;49%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p95 latency, caption step&lt;/td&gt;
&lt;td&gt;1.8s&lt;/td&gt;
&lt;td&gt;0.31s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monthly VLM+LLM spend&lt;/td&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;td&gt;-62%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Provider failover events handled&lt;/td&gt;
&lt;td&gt;0 (we returned an error)&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The rewrite step caches less well because user prompts vary more. Captioning is the big win, because product photos from the same merchant cluster heavily in embedding space. Roughly 70% of merchants in our top tier upload 80% of their catalogue images within a 90-day window.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we actually changed in code
&lt;/h2&gt;

&lt;p&gt;The migration was unromantic. Two lines.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://bifrost:8080/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BIFROST_VIRTUAL_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything else stayed. The VLM team didn't touch their code. The rewrite team flipped a config flag.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;Semantic caching has a real failure mode. If your downstream model output is meant to vary across calls (creative generation, sampling-heavy use cases) you don't want this on. We disable it for the diffusion-prompt-suggestion endpoint that gives editors three variants. The cache would happily return the same triplet twice.&lt;/p&gt;

&lt;p&gt;The Go binary is one more service to operate. For a small team this is non-trivial. LiteLLM-as-a-library has fewer moving parts if you don't need the cache.&lt;/p&gt;

&lt;p&gt;Cost attribution through virtual keys is per-key, not per-end-customer-of-our-customers. If you need full multi-tenant chargeback down to the merchant level, you will write some glue.&lt;/p&gt;

&lt;p&gt;The semantic cache uses an embedding model itself. Read &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;the docs on what backs it&lt;/a&gt; before you assume your prompts stay inside your VPC.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;Bifrost semantic caching docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/retries-and-fallbacks" rel="noopener noreferrer"&gt;Bifrost fallbacks and retries&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;Bifrost virtual keys and budgets&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2412.11919" rel="noopener noreferrer"&gt;RetroLLM paper on retrieval-augmented caching&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.litellm.ai/docs/proxy" rel="noopener noreferrer"&gt;LiteLLM proxy docs&lt;/a&gt; for the honest comparison&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>mlops</category>
      <category>computervision</category>
      <category>llm</category>
    </item>
    <item>
      <title>The bf16 grad accumulator that killed our SDXL LoRA training</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Wed, 27 May 2026 05:37:20 +0000</pubDate>
      <link>https://dev.to/elise_moreau/the-bf16-grad-accumulator-that-killed-our-sdxl-lora-training-nc8</link>
      <guid>https://dev.to/elise_moreau/the-bf16-grad-accumulator-that-killed-our-sdxl-lora-training-nc8</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: Our SDXL LoRA fine-tune for a Photoroom product photography model trained for six days while silently corrupting its adapter weights. The cause was bf16 gradient accumulation interacting badly with a custom adapter init we'd ported from a paper. Eval scores stayed in the same range the whole time, which is why nobody noticed.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;We train SDXL LoRAs for product photography categories at Photoroom. Bottles, packaged food, soft goods. Each LoRA is 192MB. Training stack: PyTorch 2.3, bf16 mixed precision, gradient accumulation across 8 steps, A100 80GBs.&lt;/p&gt;

&lt;p&gt;The LoRA init follows a small modification of the OFT paper for better stability on small datasets. To be precise, we orthogonalize the down-projection before training begins, then let the up-projection drift freely. This had been working for nine months.&lt;/p&gt;

&lt;h2&gt;
  
  
  What broke
&lt;/h2&gt;

&lt;p&gt;Six days into a 7-day run, our automated CLIPScore check started showing variance that was technically inside our acceptance band but trending the wrong way. The nuance here is that our eval pipeline grades generations using a fan-out across three VLM providers (Claude vision, GPT-4o, Gemini 1.5) routed through an LLM gateway. We use &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; for that fan-out, which gives us provider-level failover when one of them rate-limits us mid-grade. Useful, and uneventful. The grade scores looked fine.&lt;/p&gt;

&lt;p&gt;The real signal was a per-step gradient norm log we'd turned off a quarter earlier when it was spamming the dashboard. When I turned it back on for a sanity check, the grad norms had been collapsing to ~1e-5 in the down-projection layer since step 12,000.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tracing it back
&lt;/h2&gt;

&lt;p&gt;I added a hook to dump the raw bf16 gradient tensors before they hit the accumulator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;grad_dump_hook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grad&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;finite&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isfinite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grad&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;absmax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;grad&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;finite&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nan&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;absmax&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;1e-4&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;finite&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] finite=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;finite&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; absmax=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;absmax&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;grad&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hook&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;lora&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;named_parameters&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;requires_grad&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lora_A&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register_hook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;grad_dump_hook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output across 200 steps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[blocks.7.attn1.lora_A] finite=True absmax=4.32e-06
[blocks.8.attn1.lora_A] finite=True absmax=2.11e-06
[blocks.9.attn1.lora_A] finite=True absmax=8.54e-07
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gradient magnitudes in bf16 land are bounded below by roughly 6e-8 before they round to zero. We were producing real gradients, but most of them were being silently quantized to zero during the accumulation step. The accumulator in our custom training loop accumulated in bf16, not fp32.&lt;/p&gt;

&lt;p&gt;This is documented behavior. Standard PyTorch grad accumulation in Accelerate uses fp32 accumulators by default. Our custom loop, forked from an internal repo two years ago, did not.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# before
&lt;/span&gt;&lt;span class="n"&gt;optim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;param_groups&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;grad_accumulator_dtype&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfloat16&lt;/span&gt;

&lt;span class="c1"&gt;# after
&lt;/span&gt;&lt;span class="n"&gt;optim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;param_groups&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;grad_accumulator_dtype&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Single line. Six days. We re-ran training with fp32 accumulation. Grad norms stabilized in the expected 1e-3 to 1e-2 range. Eval scores moved up by ~6% in our internal background-consistency metric.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why nothing else caught it
&lt;/h2&gt;

&lt;p&gt;A few obvious questions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why didn't loss curves show it?&lt;/strong&gt; They did, mildly. The loss was still going down, only slower. Within noise of a normal run.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why didn't the VLM eval catch it?&lt;/strong&gt; Because the generations were still "good." Product on a clean background, lighting roughly correct. The drift was in finer details (brand text legibility, soft-good fabric texture) that our three-VLM grading averages out. We're now adding a per-category CLIPScore-vs-reference check that runs without averaging.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why did we trust the init?&lt;/strong&gt; We had nine months of green runs. The OFT-style init only became a problem when we tightened the LR schedule three weeks ago, which made the gradient magnitudes smaller across the board.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Memory cost&lt;/th&gt;
&lt;th&gt;Bookkeeping&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;bf16 accumulator&lt;/td&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;td&gt;low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;fp32 accumulator (all params)&lt;/td&gt;
&lt;td&gt;+4% peak&lt;/td&gt;
&lt;td&gt;low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;fp32 only for LoRA params&lt;/td&gt;
&lt;td&gt;+0.6% peak&lt;/td&gt;
&lt;td&gt;painful&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The fp32 accumulator costs us ~4% more GPU memory per step. Not free. On the A100 80GBs it's invisible, but if you're tight on the H100 80GBs or sharing with another job, you'll feel it.&lt;/p&gt;

&lt;p&gt;You can accumulate in fp32 only for the LoRA params and keep the base model gradients in bf16, but the bookkeeping is annoying. We took the 4% hit.&lt;/p&gt;

&lt;p&gt;The deeper lesson: a custom training loop that worked for nine months is not a training loop you understand. It's a training loop that hasn't been stressed in the right place yet. I should have re-read it when we changed the LR schedule.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://pytorch.org/docs/stable/amp.html" rel="noopener noreferrer"&gt;Mixed-precision training recipes (PyTorch docs)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2306.07280" rel="noopener noreferrer"&gt;Orthogonal Fine-Tuning (OFT), NeurIPS 2023&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/docs/accelerate/main/en/concept_guides/gradient_synchronization" rel="noopener noreferrer"&gt;Hugging Face Accelerate gradient accumulation internals&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus" rel="noopener noreferrer"&gt;bfloat16 numerics on TPUs (Google Cloud)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost: multi-provider LLM gateway we use for VLM fan-out&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>pytorch</category>
      <category>mlops</category>
      <category>computervision</category>
    </item>
    <item>
      <title>Per-customer budget caps on our caption pipeline: 3 weeks with virtual keys</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Tue, 26 May 2026 14:53:00 +0000</pubDate>
      <link>https://dev.to/elise_moreau/per-customer-budget-caps-on-our-caption-pipeline-3-weeks-with-virtual-keys-hip</link>
      <guid>https://dev.to/elise_moreau/per-customer-budget-caps-on-our-caption-pipeline-3-weeks-with-virtual-keys-hip</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We were burning around €4,200/month in vision-LLM costs across roughly 80 customers, with zero way to tell who was responsible. Bifrost virtual keys plus per-customer budgets gave us hard caps and clean attribution in a couple of days. Semantic caching saved another 34%, though it needed more babysitting than the README implies.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At Photoroom we ship diffusion-based product photography to about half a million users. The diffusion side is our own infra, GPU pool, custom UNet, the lot. The part nobody writes blog posts about is the captioning and safety-filtering step that runs before each generation, plus a prompt rewriter. Those calls go out to OpenAI's gpt-4o-mini and Anthropic's claude-haiku-4-5-20251001 depending on the route.&lt;/p&gt;

&lt;p&gt;For months we treated those calls as overhead. Two API keys, one invoice per provider, no real attribution. Then in April our bill went from €1,800 to €4,200 in three weeks. To be precise: nothing had launched. A single enterprise customer was retrying caption generation in a loop because their pipeline interpreted our 429s as transient.&lt;/p&gt;

&lt;p&gt;That was the trigger to put something in front of the providers.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we tried first
&lt;/h2&gt;

&lt;p&gt;The obvious first pass was a small Python proxy. Maybe 150 lines, sitting between our worker and OpenAI. It worked. For about a week. Then we needed per-customer rate limits, then budget caps, then a second provider, then someone in finance asked for usage reports by customer ID.&lt;/p&gt;

&lt;p&gt;This is the point where most teams adopt a real gateway. We compared three.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;Portkey&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Self-hostable&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (paid tier)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI-compatible endpoint&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-customer virtual keys&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hierarchical budgets (key → team → customer)&lt;/td&gt;
&lt;td&gt;First-class&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic caching built-in&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Plugin&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prometheus metrics out of the box&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Add-on&lt;/td&gt;
&lt;td&gt;Hosted dashboard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Web UI for config&lt;/td&gt;
&lt;td&gt;Functional&lt;/td&gt;
&lt;td&gt;Minimal&lt;/td&gt;
&lt;td&gt;The nicest&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Portkey's dashboard is honestly the nicest of the three. LiteLLM has the longest tail of niche providers. We picked Bifrost because we wanted a self-hosted box where the budget hierarchy was first-class, and we wanted customer IDs to stay out of any third-party SaaS for GDPR reasons.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting up virtual keys
&lt;/h2&gt;

&lt;p&gt;Step one was running Bifrost as a sidecar to our Python workers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:8080 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;/config.json:/app/config.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$OPENAI_API_KEY&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$ANTHROPIC_API_KEY&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then a virtual key per customer, with a monthly cap and an allow-list of models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;virtual_keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vk_acme_corp&lt;/span&gt;
    &lt;span class="na"&gt;customer_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;acme_corp&lt;/span&gt;
    &lt;span class="na"&gt;allowed_models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;openai/gpt-4o-mini&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;anthropic/claude-haiku-4-5-20251001&lt;/span&gt;
    &lt;span class="na"&gt;budget&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;monthly_usd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;200&lt;/span&gt;
      &lt;span class="na"&gt;hard_cap&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;rate_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;requests_per_minute&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our worker now sends &lt;code&gt;Authorization: Bearer vk_acme_corp&lt;/code&gt; instead of the raw provider key. When acme_corp hits €200, the gateway returns a typed 429 and we surface "monthly quota reached" in the customer's UI. The customer that triggered the April spike is capped at €150 now, and we sleep at night.&lt;/p&gt;

&lt;p&gt;The whole rollout took a long Tuesday. Most of that was rewriting our worker to thread the customer_id through every call, not the gateway itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where semantic caching helped (and where it bit us)
&lt;/h2&gt;

&lt;p&gt;Bifrost's semantic cache hashes the embedding of the prompt and returns a cached completion when cosine similarity exceeds a threshold. For caption generation on near-duplicate product photos, this is a big deal. We saw a 34% hit rate over the first 10 days.&lt;/p&gt;

&lt;p&gt;The catch: our prompt includes brand and SKU for context, but the brand string was occasionally elided when the customer's metadata was incomplete. Two prompts differing only in &lt;code&gt;brand: null&lt;/code&gt; vs &lt;code&gt;brand: Nike&lt;/code&gt; hashed close enough to collide, and a Nike trainer got captioned with a placeholder description. Not great.&lt;/p&gt;

&lt;p&gt;We fixed it by raising the similarity threshold from 0.92 to 0.97 and adding the SKU as a cache namespace prefix. Worth a proper ablation paper-style. We have not had time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;A few things to know before you adopt this.&lt;/p&gt;

&lt;p&gt;The Bifrost UI is functional but newer than Portkey's. If your finance team wants a self-service dashboard they can poke at without engineer help, Portkey is ahead today. We plumbed Bifrost's Prometheus output into our existing Grafana instance instead, which took an afternoon.&lt;/p&gt;

&lt;p&gt;LiteLLM has a longer tail of obscure providers. If you route to OpenRouter sub-models or self-hosted vLLM endpoints with non-standard URL shapes, check the docs first.&lt;/p&gt;

&lt;p&gt;Hard budget caps are hard. A customer who hits the cap mid-batch will see a 429 even if they're €0.10 over. We added a soft-cap warning at 80% and an autoscale-up flow for enterprise contracts so the experience does not feel punitive.&lt;/p&gt;

&lt;p&gt;The semantic cache will collide on prompts that differ in metadata you forgot to include in the embedding. Tune the threshold, namespace aggressively, and assume you will discover one edge case per week for the first month.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;Bifrost virtual keys docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;Budget and limits hierarchy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;Semantic caching reference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost on GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;LiteLLM's &lt;a href="https://docs.litellm.ai/docs/proxy/users" rel="noopener noreferrer"&gt;budget docs&lt;/a&gt; for comparison&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>infrastructure</category>
      <category>mlops</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Cost accounting for diffusion image generation at $0.0008 per render</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Mon, 25 May 2026 14:53:09 +0000</pubDate>
      <link>https://dev.to/elise_moreau/cost-accounting-for-diffusion-image-generation-at-00008-per-render-8j</link>
      <guid>https://dev.to/elise_moreau/cost-accounting-for-diffusion-image-generation-at-00008-per-render-8j</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: Per-image cost on our SDXL-based product photography pipeline at Photoroom dropped from $0.0031 to $0.0008 over six months. Most of the win came from boring infrastructure work, not model tricks. An AI gateway in front of our text-conditioning calls saved more than I expected.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I spent most of Q1 staring at a Grafana panel labelled &lt;code&gt;cost_per_render_eur&lt;/code&gt;. Our diffusion pipeline generates background-replaced product images at volume. When marketing asks for a million renders, the per-image number matters.&lt;/p&gt;

&lt;p&gt;To be precise: the cost I track is GPU-seconds on A100/H100 SXM nodes plus any external API calls plus storage IO. Not amortised salaries, not the office espresso machine. Just the marginal cost of one more render.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the money actually goes
&lt;/h2&gt;

&lt;p&gt;Before I started measuring properly, I assumed the UNet denoising loop was 80%+ of the cost. It wasn't.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;% of wall time&lt;/th&gt;
&lt;th&gt;% of cost&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Text encoder (CLIP + T5)&lt;/td&gt;
&lt;td&gt;4%&lt;/td&gt;
&lt;td&gt;11%&lt;/td&gt;
&lt;td&gt;T5-XXL is expensive on H100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM caption rewriting&lt;/td&gt;
&lt;td&gt;8%&lt;/td&gt;
&lt;td&gt;22%&lt;/td&gt;
&lt;td&gt;External API, GPT-4o-mini initially&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UNet denoising (25 steps)&lt;/td&gt;
&lt;td&gt;71%&lt;/td&gt;
&lt;td&gt;48%&lt;/td&gt;
&lt;td&gt;DPM++ 2M Karras&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VAE decode&lt;/td&gt;
&lt;td&gt;9%&lt;/td&gt;
&lt;td&gt;7%&lt;/td&gt;
&lt;td&gt;fp16, no tricks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage IO + image post&lt;/td&gt;
&lt;td&gt;8%&lt;/td&gt;
&lt;td&gt;12%&lt;/td&gt;
&lt;td&gt;S3 multipart, sharpen, resize&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The caption-rewriting step shocked me. We use an LLM to take a customer prompt like "white sneaker on beach" and expand it into a diffusion-friendly description with lighting, framing, camera details. That single API call was 22% of cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Killing the bill in three places
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1 — UNet quantisation to int8.&lt;/strong&gt; Used torchao + a small calibration set of 512 product images. Quality drop measured by CLIP-similarity on a held-out set: 0.847 to 0.841. Negligible. Throughput went from 14 renders/sec to 23 renders/sec on an H100. That's a 39% cost drop on the dominant stage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2 — Caching the text-encoder outputs.&lt;/strong&gt; For our product taxonomy, only about 4,000 unique caption stems exist (variations on "minimalist white background", "studio lighting from upper-left", etc.). T5-XXL embeddings for these are 14KB each. I cached them in Redis with a 30-day TTL. Hit rate after two weeks: 91%. Text-encoder cost dropped from 11% to 1.2%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3 — The gateway problem.&lt;/strong&gt; This is where it got interesting.&lt;/p&gt;

&lt;h2&gt;
  
  
  The LLM caption step was the messy one
&lt;/h2&gt;

&lt;p&gt;The caption-rewriting calls were originally direct OpenAI API hits from our Python ranking service. When OpenAI had a partial outage in late January (the one that affected &lt;code&gt;gpt-4o-mini&lt;/code&gt; specifically for ~40 minutes), we lost 280k renders. The cost of those failed renders, billed but not delivered, was around €890.&lt;/p&gt;

&lt;p&gt;I put Bifrost in front. The choice was between LiteLLM, Portkey, and Bifrost. I'll be honest about the comparison.&lt;/p&gt;

&lt;p&gt;LiteLLM has wider provider coverage in the Python ecosystem and a more mature semantic-cache integration with langchain-style apps. If your stack is pure Python and you live inside LangChain, it's a more natural fit.&lt;/p&gt;

&lt;p&gt;Portkey's UI for prompt management is genuinely nicer than what Bifrost ships, and their guardrail catalog has more pre-built rules.&lt;/p&gt;

&lt;p&gt;I picked Bifrost because (a) it's a Go binary with a single HTTP endpoint and our caption service is Go, (b) the &lt;a href="https://docs.getbifrost.ai/features/retries-and-fallbacks" rel="noopener noreferrer"&gt;automatic fallbacks&lt;/a&gt; between providers work without me writing routing logic, and (c) the &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;semantic caching&lt;/a&gt; layer sits at the gateway so my Python preprocessing service and Go caption service share the cache.&lt;/p&gt;

&lt;p&gt;Config that replaced about 140 lines of fallback logic in our caption service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;openai&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;env.OPENAI_KEY_PRIMARY&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.7&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;env.OPENAI_KEY_BACKUP&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.3&lt;/span&gt;
  &lt;span class="na"&gt;anthropic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;env.ANTHROPIC_KEY&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt;

&lt;span class="na"&gt;fallbacks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;primary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai/gpt-4o-mini&lt;/span&gt;
    &lt;span class="na"&gt;secondary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anthropic/claude-haiku-4-5&lt;/span&gt;
    &lt;span class="na"&gt;tertiary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai/gpt-4o-mini&lt;/span&gt;

&lt;span class="na"&gt;semantic_cache&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;similarity_threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.94&lt;/span&gt;
  &lt;span class="na"&gt;ttl_seconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;604800&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The 0.94 similarity threshold matters. We tested 0.90, 0.92, 0.94, 0.96 on 10,000 caption pairs and measured downstream image quality. Below 0.94, the cached caption sometimes mismatched the product category enough to confuse the UNet. Above 0.96, hit rate dropped under 30% and the cost win disappeared.&lt;/p&gt;

&lt;p&gt;Current numbers after one month with the gateway in place:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Caption API spend: down 61% (semantic cache hit rate of 47%)&lt;/li&gt;
&lt;li&gt;Caption-step latency p95: 340ms to 110ms on cache hits&lt;/li&gt;
&lt;li&gt;Failed render rate from upstream LLM issues: 0.31% to 0.04%&lt;/li&gt;
&lt;li&gt;New cost share for captions: 22% to 8.2%&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;Quantisation to int8 cost me about three weekends of calibration tuning. For very high-end fashion shoots where we render at 2048x2048, the quality drop becomes visible in fine fabric weave. We keep an fp16 path for those.&lt;/p&gt;

&lt;p&gt;The semantic cache occasionally returns a "close enough" caption that doesn't match a niche product category. For our long-tail (about 4% of requests), I disable the cache via a header per-call. The gateway supports this through request metadata.&lt;/p&gt;

&lt;p&gt;Bifrost's clustering features are gated to enterprise, which fine for our scale, but if I were running this across three regions I'd want to evaluate that cost honestly. Portkey's pricing for similar features came in lower for the team-collaboration tier.&lt;/p&gt;

&lt;p&gt;I haven't migrated the image-generation outputs themselves through the gateway. The UNet runs on our own GPUs, not behind an LLM API, so the gateway adds no value there. Don't put infrastructure in places it doesn't earn its keep.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;Bifrost semantic caching docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/retries-and-fallbacks" rel="noopener noreferrer"&gt;Bifrost retries and fallbacks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/pytorch/ao" rel="noopener noreferrer"&gt;torchao quantisation recipes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2307.01952" rel="noopener noreferrer"&gt;SDXL paper, Podell et al.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2211.01095" rel="noopener noreferrer"&gt;DPM-Solver++ paper, Lu et al.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>mlops</category>
      <category>infrastructure</category>
      <category>llm</category>
    </item>
    <item>
      <title>Why your diffusion model is slow at batch size 1 (and what actually helps)</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Fri, 22 May 2026 05:37:23 +0000</pubDate>
      <link>https://dev.to/elise_moreau/why-your-diffusion-model-is-slow-at-batch-size-1-and-what-actually-helps-16e0</link>
      <guid>https://dev.to/elise_moreau/why-your-diffusion-model-is-slow-at-batch-size-1-and-what-actually-helps-16e0</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: Single-image diffusion inference is bottlenecked by kernel launch overhead and attention memory traffic, not raw FLOPs. torch.compile with mode="reduce-overhead", a fused attention backend, and CFG batching get you most of the way before you reach for distillation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I spend a lot of time looking at flame graphs from production diffusion pipelines. The pattern is almost always the same. The team profiles their model, sees 50 steps of a UNet or DiT, and assumes the path to lower latency is fewer steps. So they try LCM, then TCD, then some flavor of consistency distillation, and the quality drops in ways the product team notices.&lt;/p&gt;

&lt;p&gt;The nuance here is that at batch size 1, your GPU is mostly idle. You are not compute-bound. You are launch-bound and memory-bound. Distillation helps eventually, but only after you have fixed the boring things.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the profiler actually shows
&lt;/h2&gt;

&lt;p&gt;Run a vanilla SDXL or a 1B-parameter DiT at 1024x1024, batch 1, on an H100. Capture a trace with &lt;code&gt;torch.profiler&lt;/code&gt; and zoom into a single denoising step.&lt;/p&gt;

&lt;p&gt;You will see something like this, roughly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~30-40% of wall time inside attention kernels&lt;/li&gt;
&lt;li&gt;~20-25% inside conv and linear layers&lt;/li&gt;
&lt;li&gt;~15-20% in layernorm, GELU, residual adds&lt;/li&gt;
&lt;li&gt;The rest: kernel launch gaps, host-to-device syncs, Python overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last bucket is the embarrassing one. On an H100 a kernel launch costs ~5 microseconds. A UNet step fires hundreds of kernels. A 50-step sample fires tens of thousands. You are paying for the privilege of dispatching work, not for the work itself.&lt;/p&gt;

&lt;p&gt;To be precise: at batch 1, the same model at batch 8 often runs in less than 2x the wall time. That gap is your overhead bill.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step one: torch.compile, but the right mode
&lt;/h2&gt;

&lt;p&gt;The default &lt;code&gt;torch.compile(model)&lt;/code&gt; call uses &lt;code&gt;mode="default"&lt;/code&gt;, which optimizes for compile time and flexibility. For inference you want:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;

&lt;span class="n"&gt;unet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;unet&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reduce-overhead&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;fullgraph&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dynamic&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;reduce-overhead&lt;/code&gt; enables CUDA graphs, which replay a captured sequence of kernels in one launch. This is the single largest win for batch 1 diffusion on modern GPUs. In my measurements on PyTorch 2.3, this alone takes a 1024x1024 SDXL UNet step from ~42ms to ~28ms on H100. No quality change, no architecture change.&lt;/p&gt;

&lt;p&gt;The catch: &lt;code&gt;fullgraph=True&lt;/code&gt; will yell at you about any graph break. CFG implementations that branch on &lt;code&gt;guidance_scale&lt;/code&gt; need rewriting. Custom samplers that touch &lt;code&gt;.item()&lt;/code&gt; between steps will break CUDA graph capture. Plan for a day of fighting this.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step two: pick an attention backend on purpose
&lt;/h2&gt;

&lt;p&gt;PyTorch's &lt;code&gt;scaled_dot_product_attention&lt;/code&gt; dispatches to one of several backends. The defaults are not always right for high-resolution diffusion.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Backend&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FlashAttention-2&lt;/td&gt;
&lt;td&gt;Long sequences, H100/A100&lt;/td&gt;
&lt;td&gt;Default on most setups, good general choice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FlashAttention-3&lt;/td&gt;
&lt;td&gt;H100 only&lt;/td&gt;
&lt;td&gt;~1.5x faster than FA2 on Hopper, requires manual install&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;xFormers memory-efficient&lt;/td&gt;
&lt;td&gt;Older GPUs (V100, T4)&lt;/td&gt;
&lt;td&gt;Lower memory, slower than Flash on modern hardware&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Math (fallback)&lt;/td&gt;
&lt;td&gt;Debugging only&lt;/td&gt;
&lt;td&gt;Never ship this&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For DiT-style models at 2K resolution the sequence length per attention block hits 16K+ tokens. FA3 on H100 is a real difference there. I have seen 18% end-to-end latency drop on a 2B DiT just from switching FA2 to FA3 via &lt;code&gt;torch.nn.attention.sdpa_kernel&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step three: batch your CFG
&lt;/h2&gt;

&lt;p&gt;Classifier-free guidance runs the model twice per step, once conditional and once unconditional. Most reference implementations call the UNet twice sequentially. Do not do this.&lt;/p&gt;

&lt;p&gt;Concatenate the two prompts into one batch of 2, run one forward pass, split the output. On batch 1 this nearly halves your per-step latency because you were leaving the GPU idle anyway. The memory cost is negligible at typical inference resolutions.&lt;/p&gt;

&lt;p&gt;This is a 3-line change and somehow lives in maybe 60% of the codebases I review.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step four, only now: think about steps
&lt;/h2&gt;

&lt;p&gt;After the above, a 50-step SDXL sample on H100 is in the 1.2-1.5 second range. If your product needs sub-second, then yes, look at LCM, Hyper-SD, or DMD2. But evaluate quality on your own data, not on the curated examples in the paper. Distilled models lose the most quality on the long tail of prompts your users actually send, particularly text rendering and fine compositional structure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;CUDA graphs hate dynamic shapes. If your service accepts arbitrary aspect ratios you will recompile constantly. Either bucket aspect ratios into a small set of fixed shapes, or accept the warmup cost on cold paths.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;reduce-overhead&lt;/code&gt; mode increases memory usage because it pins workspace buffers. On a 24GB consumer card this can push you over the edge with larger models. Profile before deploying.&lt;/p&gt;

&lt;p&gt;FlashAttention-3 requires building from source against a specific CUDA version. If your deployment runs across mixed GPU generations, the version matrix becomes painful. Pick one backend per deployment target.&lt;/p&gt;

&lt;p&gt;And the obvious one: none of this fixes a slow VAE decode. If you are generating at 2K, the VAE can dominate. Tiled VAE decoding or a distilled decoder like TAESD is a separate fight.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://pytorch.org/blog/pytorch2-3/" rel="noopener noreferrer"&gt;PyTorch 2.3 release notes on torch.compile inference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2407.08608" rel="noopener noreferrer"&gt;FlashAttention-3 paper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2405.14867" rel="noopener noreferrer"&gt;DMD2: Improved Distribution Matching Distillation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;For routing across multiple model providers in production pipelines, &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; is one option alongside LiteLLM and direct SDK calls.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2404.13686" rel="noopener noreferrer"&gt;Hyper-SD: Trajectory Segmented Consistency Models&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>llm</category>
      <category>mlops</category>
      <category>pytorch</category>
    </item>
    <item>
      <title>Routing diffusion inference traffic across three providers</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Thu, 21 May 2026 14:52:31 +0000</pubDate>
      <link>https://dev.to/elise_moreau/routing-diffusion-inference-traffic-across-three-providers-51h3</link>
      <guid>https://dev.to/elise_moreau/routing-diffusion-inference-traffic-across-three-providers-51h3</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We route a mix of diffusion and LLM traffic across three providers from a single Go-based gateway called Bifrost. The 11 microsecond overhead is real, the failover works, and the part I care about most (weighted routing for cost vs latency tradeoffs) finally stopped being a custom Python service nobody wanted to maintain.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I work on diffusion models for product photography. Most of what I write about is training, but the boring truth is that inference traffic management eats more of my week than I would like to admit.&lt;/p&gt;

&lt;p&gt;We have three categories of model calls in production. Hosted diffusion endpoints for fallback when our own GPU pool is saturated. LLM calls for prompt rewriting and caption generation. And a small embedding service for similarity search on reference images. Three providers, three SDKs, three retry policies. It was becoming a mess.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we had before
&lt;/h2&gt;

&lt;p&gt;A Python FastAPI service in front of everything. It worked. It was also slow, and the team had stopped trusting the metrics because the gateway itself was adding 40-80ms of overhead depending on the day.&lt;/p&gt;

&lt;p&gt;The nuance here is that for a diffusion call taking 3 seconds, 60ms of gateway overhead is noise. For a small LLM rewrite that should take 200ms, it is a third of your budget. We were optimizing the wrong axis.&lt;/p&gt;

&lt;p&gt;I spent a weekend evaluating replacements. Kong felt heavy. LiteLLM was the obvious choice for the LLM side but does not really speak the dialect of provider-specific diffusion APIs we need. Then a colleague pointed me at Bifrost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why a Go gateway actually matters here
&lt;/h2&gt;

&lt;p&gt;To be precise: the language is not the point. The point is the runtime model. Bifrost runs as a single Go binary, uses goroutines for concurrency, and the published overhead is around 11 microseconds per request. I measured it on our own staging hardware and got numbers in the same ballpark, which is rare enough that I noticed.&lt;/p&gt;

&lt;p&gt;For our embedding service this matters. For diffusion it does not. But having one gateway that does not become the bottleneck for the fast calls is what made the consolidation possible.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;openai&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;env.OPENAI_KEY_PRIMARY&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.7&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;env.OPENAI_KEY_SECONDARY&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.3&lt;/span&gt;
    &lt;span class="na"&gt;network&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;retry&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;max_retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
        &lt;span class="na"&gt;backoff_initial_ms&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
  &lt;span class="na"&gt;anthropic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;env.ANTHROPIC_KEY&lt;/span&gt;
  &lt;span class="na"&gt;stability&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;env.STABILITY_KEY&lt;/span&gt;

&lt;span class="na"&gt;mcp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prompt-rewrite&lt;/span&gt;
    &lt;span class="na"&gt;primary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai/gpt-4o-mini&lt;/span&gt;
    &lt;span class="na"&gt;fallbacks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;anthropic/claude-haiku-4-5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That config replaced about 400 lines of Python.&lt;/p&gt;

&lt;h2&gt;
  
  
  The weighted routing thing
&lt;/h2&gt;

&lt;p&gt;This is the feature I did not know I wanted. We have two OpenAI accounts because of rate limits and billing isolation between research and production workloads. Previously we ran two separate clients with manual round-robin logic that always had off-by-one bugs.&lt;/p&gt;

&lt;p&gt;Weighted routing in the gateway just handles it. 70/30 split, configured declaratively, and when one key hits a 429 the failover kicks in without us writing the retry code ourselves. Virtual keys on top of that let us issue per-team credentials that map to the underlying provider keys, so the research team and the production team see different rate limits and different cost dashboards.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison with what we considered
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;Kong&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Per-request overhead&lt;/td&gt;
&lt;td&gt;~50ms (Python)&lt;/td&gt;
&lt;td&gt;~5ms but heavy footprint&lt;/td&gt;
&lt;td&gt;~11μs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failover across providers&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Plugin required&lt;/td&gt;
&lt;td&gt;Yes, built-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Weighted key routing&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Custom plugin&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic caching&lt;/td&gt;
&lt;td&gt;Via plugin&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Diffusion provider support&lt;/td&gt;
&lt;td&gt;Weak&lt;/td&gt;
&lt;td&gt;Generic HTTP only&lt;/td&gt;
&lt;td&gt;Provider-aware&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operational footprint&lt;/td&gt;
&lt;td&gt;Python service&lt;/td&gt;
&lt;td&gt;Lua plugins, DB&lt;/td&gt;
&lt;td&gt;Single Go binary&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;LiteLLM remains excellent for pure LLM-only stacks. Kong is the right answer if you already run Kong. For us, the combination of low overhead and provider-aware routing was the deciding factor.&lt;/p&gt;

&lt;h2&gt;
  
  
  Semantic caching on prompt rewrites
&lt;/h2&gt;

&lt;p&gt;About 40% of our prompt-rewrite calls are near-duplicates. Same product, slightly different angle, same desired caption style. We were paying for every one of them.&lt;/p&gt;

&lt;p&gt;Bifrost has semantic caching built in, using embeddings to match similar requests within a configurable threshold. I was skeptical because cache invalidation on semantic similarity is famously a footgun. We set the threshold conservatively (cosine similarity above 0.94) and audit the cache hits weekly. Hit rate is around 22%, cost savings are real, and we have not had a quality complaint yet. The audit is the part nobody talks about, but you need it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;It is a young project. The documentation has gaps, particularly around custom provider plugins. I had to read the source to understand how the streaming response handling works for SSE-heavy diffusion APIs.&lt;/p&gt;

&lt;p&gt;Observability is functional but basic. We forward to our existing Prometheus setup and it works, but if you expect a polished UI for traffic analysis you will be disappointed. We built our own Grafana dashboards.&lt;/p&gt;

&lt;p&gt;Semantic caching is only as good as your embedding model and threshold tuning. If your prompts have high lexical variation but identical intent, you will get false negatives. If your prompts are templated and only the parameters change, you will get false positives. Test on your own traffic before trusting it.&lt;/p&gt;

&lt;p&gt;And one honest note: an 11 microsecond gateway does not make a 3-second diffusion call faster. It just stops being the reason your fast calls are slow. Know which problem you are solving.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Bifrost on GitHub: &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;https://github.com/maximhq/bifrost&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;LiteLLM proxy documentation: &lt;a href="https://docs.litellm.ai/docs/simple_proxy" rel="noopener noreferrer"&gt;https://docs.litellm.ai/docs/simple_proxy&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Kong AI Gateway: &lt;a href="https://konghq.com/products/kong-ai-gateway" rel="noopener noreferrer"&gt;https://konghq.com/products/kong-ai-gateway&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;"Inference Without Interference" (Microsoft Research, 2024) on multiplexing inference workloads&lt;/li&gt;
&lt;li&gt;A useful primer on semantic caching trade-offs from Pinecone's engineering blog&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>llm</category>
      <category>mlops</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>Why your diffusion model is slow at batch size 1 (and what actually helps)</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Tue, 19 May 2026 05:37:02 +0000</pubDate>
      <link>https://dev.to/elise_moreau/why-your-diffusion-model-is-slow-at-batch-size-1-and-what-actually-helps-n15</link>
      <guid>https://dev.to/elise_moreau/why-your-diffusion-model-is-slow-at-batch-size-1-and-what-actually-helps-n15</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: Single-image diffusion inference is bottlenecked by kernel launch overhead and attention memory traffic, not raw FLOPs. torch.compile with mode="reduce-overhead", a fused attention backend, and CFG batching get you most of the way before you reach for distillation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I spend a lot of time looking at flame graphs from production diffusion pipelines. The pattern is almost always the same. The team profiles their model, sees 50 steps of a UNet or DiT, and assumes the path to lower latency is fewer steps. So they try LCM, then TCD, then some flavor of consistency distillation, and the quality drops in ways the product team notices.&lt;/p&gt;

&lt;p&gt;The nuance here is that at batch size 1, your GPU is mostly idle. You are not compute-bound. You are launch-bound and memory-bound. Distillation helps eventually, but only after you have fixed the boring things.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the profiler actually shows
&lt;/h2&gt;

&lt;p&gt;Run a vanilla SDXL or a 1B-parameter DiT at 1024x1024, batch 1, on an H100. Capture a trace with &lt;code&gt;torch.profiler&lt;/code&gt; and zoom into a single denoising step.&lt;/p&gt;

&lt;p&gt;You will see something like this, roughly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~30-40% of wall time inside attention kernels&lt;/li&gt;
&lt;li&gt;~20-25% inside conv and linear layers&lt;/li&gt;
&lt;li&gt;~15-20% in layernorm, GELU, residual adds&lt;/li&gt;
&lt;li&gt;The rest: kernel launch gaps, host-to-device syncs, Python overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last bucket is the embarrassing one. On an H100 a kernel launch costs ~5 microseconds. A UNet step fires hundreds of kernels. A 50-step sample fires tens of thousands. You are paying for the privilege of dispatching work, not for the work itself.&lt;/p&gt;

&lt;p&gt;To be precise: at batch 1, the same model at batch 8 often runs in less than 2x the wall time. That gap is your overhead bill.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step one: torch.compile, but the right mode
&lt;/h2&gt;

&lt;p&gt;The default &lt;code&gt;torch.compile(model)&lt;/code&gt; call uses &lt;code&gt;mode="default"&lt;/code&gt;, which optimizes for compile time and flexibility. For inference you want:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;

&lt;span class="n"&gt;unet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;unet&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reduce-overhead&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;fullgraph&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dynamic&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;reduce-overhead&lt;/code&gt; enables CUDA graphs, which replay a captured sequence of kernels in one launch. This is the single largest win for batch 1 diffusion on modern GPUs. In my measurements on PyTorch 2.3, this alone takes a 1024x1024 SDXL UNet step from ~42ms to ~28ms on H100. No quality change, no architecture change.&lt;/p&gt;

&lt;p&gt;The catch: &lt;code&gt;fullgraph=True&lt;/code&gt; will yell at you about any graph break. CFG implementations that branch on &lt;code&gt;guidance_scale&lt;/code&gt; need rewriting. Custom samplers that touch &lt;code&gt;.item()&lt;/code&gt; between steps will break CUDA graph capture. Plan for a day of fighting this.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step two: pick an attention backend on purpose
&lt;/h2&gt;

&lt;p&gt;PyTorch's &lt;code&gt;scaled_dot_product_attention&lt;/code&gt; dispatches to one of several backends. The defaults are not always right for high-resolution diffusion.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Backend&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FlashAttention-2&lt;/td&gt;
&lt;td&gt;Long sequences, H100/A100&lt;/td&gt;
&lt;td&gt;Default on most setups, good general choice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FlashAttention-3&lt;/td&gt;
&lt;td&gt;H100 only&lt;/td&gt;
&lt;td&gt;~1.5x faster than FA2 on Hopper, requires manual install&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;xFormers memory-efficient&lt;/td&gt;
&lt;td&gt;Older GPUs (V100, T4)&lt;/td&gt;
&lt;td&gt;Lower memory, slower than Flash on modern hardware&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Math (fallback)&lt;/td&gt;
&lt;td&gt;Debugging only&lt;/td&gt;
&lt;td&gt;Never ship this&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For DiT-style models at 2K resolution the sequence length per attention block hits 16K+ tokens. FA3 on H100 is a real difference there. I have seen 18% end-to-end latency drop on a 2B DiT just from switching FA2 to FA3 via &lt;code&gt;torch.nn.attention.sdpa_kernel&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step three: batch your CFG
&lt;/h2&gt;

&lt;p&gt;Classifier-free guidance runs the model twice per step, once conditional and once unconditional. Most reference implementations call the UNet twice sequentially. Do not do this.&lt;/p&gt;

&lt;p&gt;Concatenate the two prompts into one batch of 2, run one forward pass, split the output. On batch 1 this nearly halves your per-step latency because you were leaving the GPU idle anyway. The memory cost is negligible at typical inference resolutions.&lt;/p&gt;

&lt;p&gt;This is a 3-line change and somehow lives in maybe 60% of the codebases I review.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step four, only now: think about steps
&lt;/h2&gt;

&lt;p&gt;After the above, a 50-step SDXL sample on H100 is in the 1.2-1.5 second range. If your product needs sub-second, then yes, look at LCM, Hyper-SD, or DMD2. But evaluate quality on your own data, not on the curated examples in the paper. Distilled models lose the most quality on the long tail of prompts your users actually send, particularly text rendering and fine compositional structure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;CUDA graphs hate dynamic shapes. If your service accepts arbitrary aspect ratios you will recompile constantly. Either bucket aspect ratios into a small set of fixed shapes, or accept the warmup cost on cold paths.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;reduce-overhead&lt;/code&gt; mode increases memory usage because it pins workspace buffers. On a 24GB consumer card this can push you over the edge with larger models. Profile before deploying.&lt;/p&gt;

&lt;p&gt;FlashAttention-3 requires building from source against a specific CUDA version. If your deployment runs across mixed GPU generations, the version matrix becomes painful. Pick one backend per deployment target.&lt;/p&gt;

&lt;p&gt;And the obvious one: none of this fixes a slow VAE decode. If you are generating at 2K, the VAE can dominate. Tiled VAE decoding or a distilled decoder like TAESD is a separate fight.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://pytorch.org/blog/pytorch2-3/" rel="noopener noreferrer"&gt;PyTorch 2.3 release notes on torch.compile inference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2407.08608" rel="noopener noreferrer"&gt;FlashAttention-3 paper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2405.14867" rel="noopener noreferrer"&gt;DMD2: Improved Distribution Matching Distillation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;For routing across multiple model providers in production pipelines, &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; is one option alongside LiteLLM and direct SDK calls.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2404.13686" rel="noopener noreferrer"&gt;Hyper-SD: Trajectory Segmented Consistency Models&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>pytorch</category>
      <category>computervision</category>
      <category>mlops</category>
    </item>
  </channel>
</rss>
