<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Elise Moreau</title>
    <description>The latest articles on DEV Community by Elise Moreau (@elise_moreau).</description>
    <link>https://dev.to/elise_moreau</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3864909%2F72833c18-30db-4456-82ee-e7d2016cc38f.jpg</url>
      <title>DEV Community: Elise Moreau</title>
      <link>https://dev.to/elise_moreau</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/elise_moreau"/>
    <language>en</language>
    <item>
      <title>Why Your Diffusion Model Is Slow at Inference (And It's Not the UNet)</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Thu, 23 Apr 2026 14:48:58 +0000</pubDate>
      <link>https://dev.to/elise_moreau/why-your-diffusion-model-is-slow-at-inference-and-its-not-the-unet-443d</link>
      <guid>https://dev.to/elise_moreau/why-your-diffusion-model-is-slow-at-inference-and-its-not-the-unet-443d</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: Most inference bottlenecks in diffusion pipelines are not in the UNet denoising loop. They are in the VAE decoder, the text encoder on first call, and CPU-GPU synchronization between steps. Profile before you optimize. To be precise, a 30% speedup often comes from fixing the 5% of the code nobody looks at.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I spent three weeks last month trying to make a Stable Diffusion XL variant run faster on A10G. The model was trained in-house for product photography. Inference was around 4.2 seconds per image at 1024x1024, 30 steps. Target was under 2 seconds.&lt;br&gt;
My first instinct was wrong. I went straight to the UNet. Compiled it with &lt;code&gt;torch.compile&lt;/code&gt;, tried different attention implementations, looked at FlashAttention-3. I got it from 3.1s to 2.7s on the UNet alone. Nice. But total pipeline time barely moved.&lt;br&gt;
Then I actually profiled.&lt;/p&gt;
&lt;h2&gt;What the profile showed&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;torch.profiler&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ProfilerActivity&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;activities&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ProfilerActivity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CPU&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ProfilerActivity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CUDA&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;record_shapes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;prof&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_inference_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prof&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;key_averages&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;sort_by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda_time_total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row_limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;
&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The breakdown was not what I expected:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Component&lt;/th&gt;&lt;th&gt;Time (ms)&lt;/th&gt;&lt;th&gt;% of pipeline&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;UNet forward (30 steps)&lt;/td&gt;&lt;td&gt;2700&lt;/td&gt;&lt;td&gt;64%&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;VAE decoder&lt;/td&gt;&lt;td&gt;890&lt;/td&gt;&lt;td&gt;21%&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Text encoder (first call)&lt;/td&gt;&lt;td&gt;340&lt;/td&gt;&lt;td&gt;8%&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Scheduler + CPU ops&lt;/td&gt;&lt;td&gt;270&lt;/td&gt;&lt;td&gt;6%&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The VAE decoder, which runs once at the end, was taking almost a quarter of total latency. The text encoders, which I assumed were negligible, were non-trivial on the first call because of kernel compilation.&lt;br&gt;
The nuance here is that people optimize what they read about. Every blog post is about UNet attention. Almost nobody writes about the VAE.&lt;/p&gt;
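&lt;p&gt;If you just want per-stage numbers without wading through the full profiler table, a small wall-clock helper is enough, as long as you flush the CUDA queue before reading the clock. A minimal sketch (this &lt;code&gt;time_stage&lt;/code&gt; helper is mine, not a diffusers API):&lt;/p&gt;

```python
import time

def time_stage(fn, *args, warmup=2, iters=5, sync=None, **kwargs):
    """Average wall-clock time of one pipeline stage, in ms.

    CUDA launches are asynchronous, so pass sync=torch.cuda.synchronize
    when timing GPU work; otherwise the clock stops before the kernels do.
    """
    sync = sync or (lambda: None)
    for _ in range(warmup):      # absorb one-off compilation / autotune costs
        fn(*args, **kwargs)
    sync()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args, **kwargs)
    sync()
    return (time.perf_counter() - t0) * 1000 / iters

# e.g. time_stage(pipe.vae.decode, latents, sync=torch.cuda.synchronize)
```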
&lt;h2&gt;Fixing the VAE&lt;/h2&gt;

&lt;p&gt;SDXL's VAE decoder turns a 128x128x4 latent into a 1024x1024x3 image. The default implementation in diffusers runs in fp32 for numerical stability. The tiled decoder, which splits the latent into patches, is even slower but uses less memory.&lt;br&gt;
Three things helped:&lt;br&gt;
First, cast the VAE to bf16. The numerical argument for fp32 is weak on modern GPUs. I ran a small eval on 500 prompts and compared LPIPS and a CLIP-based aesthetic score between fp32 and bf16 output; the differences were within noise. For background, the SDXL technical report touches on this, but madebyollin's TAESD work is where the practical tricks live.&lt;br&gt;
Second, use &lt;code&gt;channels_last&lt;/code&gt; memory format for the VAE. This one is documented but rarely applied:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vae&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memory_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;channels_last&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vae&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vae&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reduce-overhead&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;fullgraph&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
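&lt;p&gt;The first fix, the bf16 cast, is a one-liner (&lt;code&gt;pipe.vae.to(dtype=torch.bfloat16)&lt;/code&gt;). Before trusting it, run a cheap parity check against fp32 on a few real latents. This &lt;code&gt;max_pixel_error&lt;/code&gt; helper is my own sketch, not a library function:&lt;/p&gt;

```python
import torch

def max_pixel_error(decode_fp32, decode_bf16, latent):
    """Worst per-pixel deviation between the fp32 and bf16 decode of
    the same latent. 'Within noise' for our eval meant staying well
    under the 1/255 quantization step of an 8-bit image."""
    with torch.no_grad():
        ref = decode_fp32(latent.float())
        out = decode_bf16(latent.to(torch.bfloat16)).float()
    return (ref - out).abs().max().item()

# With two copies of the VAE, e.g.:
#   err = max_pixel_error(
#       lambda z: vae_fp32.decode(z).sample,
#       lambda z: vae_bf16.decode(z).sample,
#       latents)
```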



&lt;p&gt;Third, if you do not need full 1024x1024 decoding quality, swap in TAESD (Tiny AutoEncoder). It is a distilled VAE that decodes roughly 8x faster. Quality suffers on fine detail but is acceptable for thumbnails and previews. We use the full VAE for final renders and TAESD for the interactive preview in the product UI.&lt;br&gt;
Combined, these changes dropped VAE time from 890ms to 210ms.&lt;/p&gt;
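&lt;p&gt;Loading TAESD itself is two lines with diffusers (&lt;code&gt;AutoencoderTiny&lt;/code&gt;; the SDXL checkpoint is &lt;code&gt;madebyollin/taesdxl&lt;/code&gt;). The &lt;code&gt;make_decoder&lt;/code&gt; router below is my own sketch of the full-vs-preview split, not part of any library:&lt;/p&gt;

```python
# Loading the distilled VAE (diffusers):
#   from diffusers import AutoencoderTiny
#   taesd = AutoencoderTiny.from_pretrained("madebyollin/taesdxl").to("cuda")

def make_decoder(decode_full, decode_tiny):
    """Route interactive previews to the fast distilled VAE and
    final renders to the full VAE, behind one callable."""
    def decode(latents, preview=False):
        return decode_tiny(latents) if preview else decode_full(latents)
    return decode

# decode = make_decoder(
#     lambda z: pipe.vae.decode(z).sample,
#     lambda z: taesd.decode(z).sample,
# )
```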
&lt;h2&gt;The text encoder trap&lt;/h2&gt;

&lt;p&gt;On the first pipeline call, the text encoders compile their kernels. If you are benchmarking with a single prompt, you pay this cost once and it looks small. In production, if you have cold starts on autoscaled GPUs, every new replica eats that 300-400ms on the first request.&lt;br&gt;
The solution is unglamorous: warm up the encoders at startup.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;warmup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;dummy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a photo of a product on a white background&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_grad&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dummy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;synchronize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run this during container startup, not on first user request.&lt;/p&gt;
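&lt;p&gt;What "during container startup" means concretely depends on your serving stack; the pattern is to gate readiness on the warmup so the load balancer never routes a request to a cold replica. A framework-agnostic sketch (the &lt;code&gt;WarmStart&lt;/code&gt; wrapper is hypothetical; &lt;code&gt;ready&lt;/code&gt; is whatever your health endpoint reports):&lt;/p&gt;

```python
class WarmStart:
    """Run warmup before reporting ready. Point the container's
    readiness probe at `ready` so traffic only arrives after the
    encoders have compiled their kernels."""

    def __init__(self, pipe, warmup_fn):
        self.pipe = pipe
        self._warmup_fn = warmup_fn
        self.ready = False

    def start(self):
        self._warmup_fn(self.pipe)   # e.g. the warmup() defined above
        self.ready = True

# server = WarmStart(pipe, warmup)
# server.start()   # at boot, before accepting traffic
```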

&lt;h2&gt;CPU sync between steps&lt;/h2&gt;

&lt;p&gt;This one took me a while to find. In the scheduler step, there are small tensor operations that implicitly synchronize GPU and CPU. On A10G with a well-tuned UNet, these become visible. You see it in the profiler as gaps between CUDA kernel launches.&lt;br&gt;
The fix is either a custom scheduler that keeps everything on the GPU, or capturing the full denoising loop with CUDA graphs (&lt;code&gt;torch.cuda.CUDAGraph&lt;/code&gt;). Graphs are fragile (they break if any input shape changes), but for a fixed-resolution product they are worth it. I got another 8% off pipeline time this way.&lt;br&gt;
If you route through a gateway that fronts multiple model backends (internal triton, replicate, fal), the gateway itself adds 20-80ms depending on implementation. Bifrost (&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;https://github.com/maximhq/bifrost&lt;/a&gt;), LiteLLM, and Portkey sit in this space. Measure your gateway overhead before you blame the model. We saw 35ms of unnecessary latency from a naive proxy before we switched.&lt;/p&gt;
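&lt;p&gt;The implicit syncs usually come from pulling scalars to the CPU mid-loop. A toy contrast (not the actual scheduler code; the principle is the thing):&lt;/p&gt;

```python
import torch

def step_syncing(latents, sigma):
    # .item() copies the scalar to the CPU and waits for every queued
    # kernel to finish first: a hidden sync point on each of 30 steps.
    return latents / sigma.item()

def step_on_device(latents, sigma):
    # Keep sigma as a 0-d tensor on the same device; the division is
    # queued asynchronously and the CPU races ahead to the next launch.
    return latents / sigma
```

&lt;p&gt;Grep your scheduler for &lt;code&gt;.item()&lt;/code&gt;, &lt;code&gt;.cpu()&lt;/code&gt;, and &lt;code&gt;float(tensor)&lt;/code&gt;; each one is a stall.&lt;/p&gt;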

&lt;h2&gt;Final numbers&lt;/h2&gt;

&lt;p&gt;After all the above:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Stage&lt;/th&gt;&lt;th&gt;Before (ms)&lt;/th&gt;&lt;th&gt;After (ms)&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Text encode&lt;/td&gt;&lt;td&gt;340&lt;/td&gt;&lt;td&gt;12 (warmed)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;UNet 30 steps&lt;/td&gt;&lt;td&gt;2700&lt;/td&gt;&lt;td&gt;2100&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;VAE decode&lt;/td&gt;&lt;td&gt;890&lt;/td&gt;&lt;td&gt;210&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Scheduler/sync&lt;/td&gt;&lt;td&gt;270&lt;/td&gt;&lt;td&gt;90&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;strong&gt;4200&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;strong&gt;2410&lt;/strong&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Still above target. To hit 2s we dropped to 24 steps with a DPM++ 2M Karras scheduler. Acceptable quality trade-off for our use case.&lt;/p&gt;
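&lt;p&gt;The scheduler swap is standard diffusers (&lt;code&gt;DPMSolverMultistepScheduler&lt;/code&gt; with &lt;code&gt;use_karras_sigmas=True&lt;/code&gt; is its DPM++ 2M Karras configuration), and the arithmetic behind the step cut is simple, since only the UNet cost scales with step count:&lt;/p&gt;

```python
# Scheduler swap (diffusers; from_config keeps the model's trained betas):
#   from diffusers import DPMSolverMultistepScheduler
#   pipe.scheduler = DPMSolverMultistepScheduler.from_config(
#       pipe.scheduler.config, use_karras_sigmas=True)
#   image = pipe(prompt, num_inference_steps=24).images[0]

# Why 24 steps clears the 2s target: only the UNet scales with steps.
unet_ms = 2100                 # UNet at 30 steps, from the table above
other_ms = 310                 # everything step-independent (2410 - 2100)
est_ms = unet_ms / 30 * 24 + other_ms
print(round(est_ms))  # 1990
```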

&lt;h2&gt;Trade-offs and limitations&lt;/h2&gt;

&lt;p&gt;Casting the VAE to bf16 is fine for photographic content. For pixel art or content with hard edges, fp32 can preserve small structures better. Test on your data.&lt;br&gt;
&lt;code&gt;torch.compile&lt;/code&gt; in reduce-overhead mode uses CUDA graphs internally. It is strict about input shapes. Dynamic batch sizes or resolutions will trigger recompilation, which costs seconds. Pin your shapes or expect volatility.&lt;br&gt;
TAESD is not a free lunch. Look at outputs manually before shipping. It is a lossy compression of the VAE, and the losses are not always perceptually small.&lt;br&gt;
CUDA graph capture can hide memory leaks. If you see OOM on long-running workers, disable graphs and re-profile before assuming the model is the problem.&lt;/p&gt;

&lt;h2&gt;Further reading&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;SDXL technical report: &lt;a href="https://arxiv.org/abs/2307.01952" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2307.01952&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;TAESD repository by madebyollin: &lt;a href="https://github.com/madebyollin/taesd" rel="noopener noreferrer"&gt;https://github.com/madebyollin/taesd&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;PyTorch 2 compile notes on memory formats: &lt;a href="https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html" rel="noopener noreferrer"&gt;https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;NVIDIA Nsight Systems for GPU profiling: &lt;a href="https://developer.nvidia.com/nsight-systems" rel="noopener noreferrer"&gt;https://developer.nvidia.com/nsight-systems&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Diffusers optimization guide: &lt;a href="https://huggingface.co/docs/diffusers/optimization/fp16" rel="noopener noreferrer"&gt;https://huggingface.co/docs/diffusers/optimization/fp16&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>pytorch</category>
      <category>computervision</category>
      <category>ai</category>
    </item>
    <item>
      <title>Diffusion Model Inference in Production: What the Benchmarks Leave Out</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Thu, 23 Apr 2026 14:31:27 +0000</pubDate>
      <link>https://dev.to/elise_moreau/diffusion-model-inference-in-production-what-the-benchmarks-leave-out-2669</link>
      <guid>https://dev.to/elise_moreau/diffusion-model-inference-in-production-what-the-benchmarks-leave-out-2669</guid>
      <description>

&lt;p&gt;The routing overhead caught us off guard. We were running caption generation through a larger model for every input when 70% of them only needed a fast small model. Adding a gateway with cost-aware routing (we landed on Bifrost for this, though LiteLLM and Portkey do the same thing: &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;https://github.com/maximhq/bifrost&lt;/a&gt;) cut LLM spend in our vision pipeline by 38% without touching the heavy-model cases.&lt;/p&gt;

</description>
      <category>pytorch</category>
      <category>machinelearning</category>
      <category>computervision</category>
      <category>mlops</category>
    </item>
  </channel>
</rss>
