gameindie
Improving Stable Diffusion pipeline inference speed by 30%

I've been generating a lot of nail art images for my image site lately. I ended up using OneDiff to get a 30% speedup, and along the way I found a few things that can improve Stable Diffusion inference speed, summarized below.

Config

Here are some key ways to optimize inference speed for Stable Diffusion pipelines:

1. Use half-precision (FP16) instead of full precision (FP32)

  • Load the model with torch_dtype=torch.float16
  • This can provide up to 60% speedup with minimal quality loss
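
For example, with Hugging Face Diffusers the pipeline can be loaded directly in half precision. A minimal sketch (the checkpoint name and prompt are just placeholders for whatever you use):

   import torch
   from diffusers import StableDiffusionPipeline

   # load the weights in fp16 and run on the GPU
   pipe = StableDiffusionPipeline.from_pretrained(
       "runwayml/stable-diffusion-v1-5",  # any SD 1.x checkpoint
       torch_dtype=torch.float16,
   ).to("cuda")

   image = pipe("nail art, pastel gradient").images[0]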

2. Enable TensorFloat-32 (TF32) on NVIDIA GPUs[1]:

   import torch
   torch.backends.cuda.matmul.allow_tf32 = True

3. Use a distilled model[1]:

  • Smaller distilled models like "nota-ai/bk-sdm-small" can be 1.5-1.6x faster
  • They maintain comparable quality to full models
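
A distilled checkpoint drops into the same pipeline API; here is a sketch (prompt and step count are illustrative):

   import torch
   from diffusers import StableDiffusionPipeline

   # bk-sdm-small is a distilled SD 1.5 variant with a smaller UNet
   pipe = StableDiffusionPipeline.from_pretrained(
       "nota-ai/bk-sdm-small",
       torch_dtype=torch.float16,
   ).to("cuda")

   image = pipe("nail art, pastel gradient", num_inference_steps=25).images[0]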

4. Enable memory-efficient attention implementations[1]:

  • Use xFormers or PyTorch 2.0's scaled dot product attention
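
Both options are one-liners in Diffusers, assuming `pipe` is an already-loaded pipeline:

   # option 1: xFormers (requires the xformers package to be installed)
   pipe.enable_xformers_memory_efficient_attention()

   # option 2: PyTorch 2.0's scaled dot product attention; Diffusers uses it
   # by default on PyTorch 2.x, but it can also be set explicitly
   from diffusers.models.attention_processor import AttnProcessor2_0
   pipe.unet.set_attn_processor(AttnProcessor2_0())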

5. Use CUDA graphs to reduce CPU overhead[3]:

  • Capture UNet, VAE and TextEncoder into CUDA graph format
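
Doing this by hand is fiddly (fixed shapes, warm-up on a side stream, replay through static buffers). The sketch below follows the generic capture pattern from the PyTorch CUDA graphs docs, applied to a UNet forward with assumed SD 1.5 shapes at 512x512; in practice, libraries like stable-fast or OneDiff handle the capture for you:

   import torch

   unet = pipe.unet  # assumes an fp16 pipeline already moved to "cuda"
   static_latents = torch.randn(2, 4, 64, 64, device="cuda", dtype=torch.float16)
   static_t = torch.tensor(999, device="cuda")
   static_text = torch.randn(2, 77, 768, device="cuda", dtype=torch.float16)

   with torch.no_grad():
       # warm up on a side stream before capture (required by CUDA graphs)
       s = torch.cuda.Stream()
       s.wait_stream(torch.cuda.current_stream())
       with torch.cuda.stream(s):
           for _ in range(3):
               unet(static_latents, static_t, encoder_hidden_states=static_text)
       torch.cuda.current_stream().wait_stream(s)

       # capture a single UNet forward into a graph
       graph = torch.cuda.CUDAGraph()
       with torch.cuda.graph(graph):
           static_out = unet(static_latents, static_t,
                             encoder_hidden_states=static_text).sample

   # per denoising step: copy the new latents/timestep into the static buffers
   # with .copy_() and call graph.replay(); the result lands in static_out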

6. Apply DeepSpeed-Inference optimizations[2][4]:

  • Can provide 1.7x speedup with minimal code changes
  • Fuses operations and uses optimized CUDA kernels
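
A sketch following the pattern from the article in [2] (exact arguments may vary with your DeepSpeed version):

   import torch
   import deepspeed

   # patch the pipeline's submodules with DeepSpeed's fused inference kernels
   deepspeed.init_inference(
       model=pipe,
       mp_size=1,
       dtype=torch.half,
       replace_with_kernel_inject=True,
   )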

7. Use torch.inference_mode() or torch.no_grad()[4]:

  • Disables gradient computation for slight speedup
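
Diffusers pipelines already disable gradients internally, so this mainly matters for custom sampling loops, but wrapping the call is harmless:

   import torch

   with torch.inference_mode():
       image = pipe("nail art, pastel gradient").images[0]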

8. Consider specialized libraries like stable-fast[3]:

  • Provides CUDNN fusion, low precision ops, fused attention, etc.
  • Claims significant speedups over other methods

9. Reduce the number of inference steps if quality allows

10. Use a larger batch size if memory permits
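
Both knobs are plain pipeline arguments, e.g. 25 steps instead of 50 and a batch of four prompts in a single call:

   prompts = ["nail art, pastel gradient"] * 4
   images = pipe(prompts, num_inference_steps=25).images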

By combining multiple optimizations, you can potentially reduce inference time from over 5 seconds to around 2-3 seconds for a single 512x512 image generation on high-end GPUs[1][2][4]. The exact speedup will depend on your specific hardware and model configuration.

Citations:
[1] https://huggingface.co/docs/diffusers/en/optimization/fp16
[2] https://www.philschmid.de/stable-diffusion-deepspeed-inference
[3] https://github.com/chengzeyi/stable-fast
[4] https://blog.cerebrium.ai/how-to-speed-up-stable-diffusion-to-a-2-second-inference-time-500x-improvement-d561c79a8952?gi=94a7e93c17f1
[5] https://www.felixsanz.dev/articles/ultimate-guide-to-optimizing-stable-diffusion-xl

Try Other Inference Runtimes

Several compile backends and runtimes can improve inference speed for Stable Diffusion pipelines. Here are some key options:

1. torch.compile:

  • Available in PyTorch 2.0+
  • Can provide significant speedups with minimal code changes
  • Example usage:

     model = torch.compile(model, mode="reduce-overhead")
    
  • Compilation takes some time initially but subsequent runs are faster[1]
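
For a Diffusers pipeline, the usual target is the UNet (the heaviest module), e.g.:

     pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)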

2. OneDiff:

  • Can provide 30% speedup with minimal code changes for diffusers
  • Easy to integrate with Hugging Face Diffusers[2]
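
A minimal sketch based on the onediff_diffusers_extensions examples in [2]; the `compile_pipe` helper comes from the `onediffx` package shipped with OneDiff:

     from onediffx import compile_pipe  # provided by onediff's diffusers extensions

     pipe = compile_pipe(pipe)
     # the first call triggers compilation; later calls reuse the compiled graph
     image = pipe("nail art, pastel gradient").images[0]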

3. DeepSpeed-Inference:

  • Can provide around 1.7x speedup with minimal code changes
  • Optimizes operations and uses custom CUDA kernels
  • Easy to integrate with Hugging Face Diffusers[3]

4. stable-fast:

  • Specialized optimization framework for Hugging Face Diffusers
  • Implements techniques like CUDNN convolution fusion, low precision ops, fused attention, etc.
  • Claims significant speedups over other methods
  • Provides fast compilation within seconds, much quicker than torch.compile or TensorRT[4]
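
A rough sketch based on the stable-fast README [5]; the import path and config flags can differ between versions, so check the project docs:

     from sfast.compilers.diffusion_pipeline_compiler import (
         compile, CompilationConfig,
     )

     config = CompilationConfig.Default()
     config.enable_xformers = True    # each of these switches is optional
     config.enable_triton = True
     config.enable_cuda_graph = True

     pipe = compile(pipe, config)     # returns a compiled pipeline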

5. TensorRT:

  • NVIDIA's deep learning inference optimizer and runtime
  • Can provide substantial speedups but requires more setup

6. ONNX Runtime:

  • Cross-platform inference acceleration
  • Supports various hardware accelerators
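
With Hugging Face Optimum, an ONNX Runtime pipeline can be exported and run in a couple of lines (a sketch; the checkpoint name is just an example):

     from optimum.onnxruntime import ORTStableDiffusionPipeline

     # export=True converts the PyTorch checkpoint to ONNX on the fly
     pipe = ORTStableDiffusionPipeline.from_pretrained(
         "runwayml/stable-diffusion-v1-5",
         export=True,
     )
     image = pipe("nail art, pastel gradient").images[0]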

When choosing a compile backend, consider factors like:

  • Ease of integration
  • Compilation time
  • Compatibility with your specific model and hardware
  • Performance gains for your particular use case

For Stable Diffusion specifically, stable-fast seems promising as it's optimized for Diffusers and claims fast compilation times[4]. However, torch.compile is also a solid choice for its ease of use and good performance gains[1]. DeepSpeed-Inference is another strong contender, especially if you're already using the Hugging Face ecosystem[2].

Remember that the effectiveness of these optimizations can vary depending on your specific hardware, model, and inference settings. It's often worth benchmarking multiple options to find the best fit for your particular use case.

Citations:
[1] https://www.felixsanz.dev/articles/ultimate-guide-to-optimizing-stable-diffusion-xl
[2] https://github.com/siliconflow/onediff/tree/main/onediff_diffusers_extensions/examples/sd3
[3] https://www.philschmid.de/stable-diffusion-deepspeed-inference
[4] https://www.youtube.com/watch?v=AKBelBkPHYk
[5] https://github.com/chengzeyi/stable-fast
[6] https://www.reddit.com/r/StableDiffusion/comments/18lvwja/stablefast_v1_2x_speedup_for_svd_stable_video/
