gameindie
Improving Stable Diffusion pipeline inference speed by 30%

I've been generating a lot of nail art images for my image site lately. I ended up using OneDiff to get a 30% speedup, and along the way I found a few things that can improve Stable Diffusion inference speed, summarized below.

Config

Here are some key ways to optimize inference speed for Stable Diffusion pipelines:

1. Use half-precision (FP16) instead of full precision (FP32)

  • Load the model with torch_dtype=torch.float16
  • This can provide up to 60% speedup with minimal quality loss
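
For example, with Hugging Face Diffusers the pipeline can be loaded directly in half precision. A minimal sketch (the checkpoint name and prompt are just placeholders for whatever you use):

   import torch
   from diffusers import StableDiffusionPipeline

   # load the weights in fp16 and run on the GPU
   pipe = StableDiffusionPipeline.from_pretrained(
       "runwayml/stable-diffusion-v1-5",  # any SD 1.x checkpoint
       torch_dtype=torch.float16,
   ).to("cuda")

   image = pipe("nail art, pastel gradient").images[0]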

2. Enable TensorFloat-32 (TF32) on NVIDIA GPUs[1]:

   import torch
   torch.backends.cuda.matmul.allow_tf32 = True

3. Use a distilled model[1]:

  • Smaller distilled models like "nota-ai/bk-sdm-small" can be 1.5-1.6x faster
  • They maintain comparable quality to full models
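
A distilled checkpoint drops into the same pipeline API; here is a sketch (prompt and step count are illustrative):

   import torch
   from diffusers import StableDiffusionPipeline

   # bk-sdm-small is a distilled SD 1.5 variant with a smaller UNet
   pipe = StableDiffusionPipeline.from_pretrained(
       "nota-ai/bk-sdm-small",
       torch_dtype=torch.float16,
   ).to("cuda")

   image = pipe("nail art, pastel gradient", num_inference_steps=25).images[0]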

4. Enable memory-efficient attention implementations[1]:

  • Use xFormers or PyTorch 2.0's scaled dot product attention
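
Both options are one-liners in Diffusers, assuming `pipe` is an already-loaded pipeline:

   # option 1: xFormers (requires the xformers package to be installed)
   pipe.enable_xformers_memory_efficient_attention()

   # option 2: PyTorch 2.0's scaled dot product attention; Diffusers uses it
   # by default on PyTorch 2.x, but it can also be set explicitly
   from diffusers.models.attention_processor import AttnProcessor2_0
   pipe.unet.set_attn_processor(AttnProcessor2_0())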

5. Use CUDA graphs to reduce CPU overhead[3]:

  • Capture UNet, VAE and TextEncoder into CUDA graph format
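
Doing this by hand is fiddly (fixed shapes, warm-up on a side stream, replay through static buffers). The sketch below follows the generic capture pattern from the PyTorch CUDA graphs docs, applied to a UNet forward with assumed SD 1.5 shapes at 512x512; in practice, libraries like stable-fast or OneDiff handle the capture for you:

   import torch

   unet = pipe.unet  # assumes an fp16 pipeline already moved to "cuda"
   static_latents = torch.randn(2, 4, 64, 64, device="cuda", dtype=torch.float16)
   static_t = torch.tensor(999, device="cuda")
   static_text = torch.randn(2, 77, 768, device="cuda", dtype=torch.float16)

   with torch.no_grad():
       # warm up on a side stream before capture (required by CUDA graphs)
       s = torch.cuda.Stream()
       s.wait_stream(torch.cuda.current_stream())
       with torch.cuda.stream(s):
           for _ in range(3):
               unet(static_latents, static_t, encoder_hidden_states=static_text)
       torch.cuda.current_stream().wait_stream(s)

       # capture a single UNet forward into a graph
       graph = torch.cuda.CUDAGraph()
       with torch.cuda.graph(graph):
           static_out = unet(static_latents, static_t,
                             encoder_hidden_states=static_text).sample

   # per denoising step: copy the new latents/timestep into the static buffers
   # with .copy_() and call graph.replay(); the result lands in static_out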

6. Apply DeepSpeed-Inference optimizations[2][4]:

  • Can provide 1.7x speedup with minimal code changes
  • Fuses operations and uses optimized CUDA kernels
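
A sketch following the pattern from the article in [2] (exact arguments may vary with your DeepSpeed version):

   import torch
   import deepspeed

   # patch the pipeline's submodules with DeepSpeed's fused inference kernels
   deepspeed.init_inference(
       model=pipe,
       mp_size=1,
       dtype=torch.half,
       replace_with_kernel_inject=True,
   )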

7. Use torch.inference_mode() or torch.no_grad()[4]:

  • Disables gradient computation for slight speedup
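
Diffusers pipelines already disable gradients internally, so this mainly matters for custom sampling loops, but wrapping the call is harmless:

   import torch

   with torch.inference_mode():
       image = pipe("nail art, pastel gradient").images[0]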

8. Consider specialized libraries like stable-fast[3]:

  • Provides CUDNN fusion, low precision ops, fused attention, etc.
  • Claims significant speedups over other methods

9. Reduce the number of inference steps if quality allows

10. Use a larger batch size if memory permits
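
Both knobs are plain pipeline arguments, e.g. 25 steps instead of 50 and a batch of four prompts in a single call:

   prompts = ["nail art, pastel gradient"] * 4
   images = pipe(prompts, num_inference_steps=25).images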

By combining multiple optimizations, you can potentially reduce inference time from over 5 seconds to around 2-3 seconds for a single 512x512 image generation on high-end GPUs[1][2][4]. The exact speedup will depend on your specific hardware and model configuration.

Citations:
[1] https://huggingface.co/docs/diffusers/en/optimization/fp16
[2] https://www.philschmid.de/stable-diffusion-deepspeed-inference
[3] https://github.com/chengzeyi/stable-fast
[4] https://blog.cerebrium.ai/how-to-speed-up-stable-diffusion-to-a-2-second-inference-time-500x-improvement-d561c79a8952?gi=94a7e93c17f1
[5] https://www.felixsanz.dev/articles/ultimate-guide-to-optimizing-stable-diffusion-xl

Try Other Inference Runtimes

Several compile backends and runtimes can improve inference speed for Stable Diffusion pipelines. Here are some key options:

1. torch.compile:

  • Available in PyTorch 2.0+
  • Can provide significant speedups with minimal code changes
  • Example usage:

     model = torch.compile(model, mode="reduce-overhead")
    
  • Compilation takes some time initially but subsequent runs are faster[1]
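
For a Diffusers pipeline, the usual target is the UNet (the heaviest module), e.g.:

     pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)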

2. OneDiff:

  • Can provide 30% speedup with minimal code changes for diffusers
  • Easy to integrate with Hugging Face Diffusers[2]
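
A minimal sketch based on the onediff_diffusers_extensions examples in [2]; the `compile_pipe` helper comes from the `onediffx` package shipped with OneDiff:

     from onediffx import compile_pipe  # provided by onediff's diffusers extensions

     pipe = compile_pipe(pipe)
     # the first call triggers compilation; later calls reuse the compiled graph
     image = pipe("nail art, pastel gradient").images[0]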

3. DeepSpeed-Inference:

  • Can provide around 1.7x speedup with minimal code changes
  • Optimizes operations and uses custom CUDA kernels
  • Easy to integrate with Hugging Face Diffusers[3]

4. stable-fast:

  • Specialized optimization framework for Hugging Face Diffusers
  • Implements techniques like CUDNN convolution fusion, low precision ops, fused attention, etc.
  • Claims significant speedups over other methods
  • Provides fast compilation within seconds, much quicker than torch.compile or TensorRT[4]
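
A rough sketch based on the stable-fast README [5]; the import path and config flags can differ between versions, so check the project docs:

     from sfast.compilers.diffusion_pipeline_compiler import (
         compile, CompilationConfig,
     )

     config = CompilationConfig.Default()
     config.enable_xformers = True    # each of these switches is optional
     config.enable_triton = True
     config.enable_cuda_graph = True

     pipe = compile(pipe, config)     # returns a compiled pipeline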

5. TensorRT:

  • NVIDIA's deep learning inference optimizer and runtime
  • Can provide substantial speedups but requires more setup

6. ONNX Runtime:

  • Cross-platform inference acceleration
  • Supports various hardware accelerators
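
With Hugging Face Optimum, an ONNX Runtime pipeline can be exported and run in a couple of lines (a sketch; the checkpoint name is just an example):

     from optimum.onnxruntime import ORTStableDiffusionPipeline

     # export=True converts the PyTorch checkpoint to ONNX on the fly
     pipe = ORTStableDiffusionPipeline.from_pretrained(
         "runwayml/stable-diffusion-v1-5",
         export=True,
     )
     image = pipe("nail art, pastel gradient").images[0]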

When choosing a compile backend, consider factors like:

  • Ease of integration
  • Compilation time
  • Compatibility with your specific model and hardware
  • Performance gains for your particular use case

For Stable Diffusion specifically, stable-fast seems promising as it's optimized for Diffusers and claims fast compilation times[4]. However, torch.compile is also a solid choice for its ease of use and good performance gains[1]. DeepSpeed-Inference is another strong contender, especially if you're already using the Hugging Face ecosystem[2].

Remember that the effectiveness of these optimizations can vary depending on your specific hardware, model, and inference settings. It's often worth benchmarking multiple options to find the best fit for your particular use case.

Citations:
[1] https://www.felixsanz.dev/articles/ultimate-guide-to-optimizing-stable-diffusion-xl
[2] https://github.com/siliconflow/onediff/tree/main/onediff_diffusers_extensions/examples/sd3
[3] https://www.philschmid.de/stable-diffusion-deepspeed-inference
[4] https://www.youtube.com/watch?v=AKBelBkPHYk
[5] https://github.com/chengzeyi/stable-fast
[6] https://www.reddit.com/r/StableDiffusion/comments/18lvwja/stablefast_v1_2x_speedup_for_svd_stable_video/
