Author: Muhammad Ali Afridi, Morphic
Date: November 2025
Originally published on the Morphic Blog. Reposted here with permission.
If you’re working on diffusion-based video models and want faster inference, this guide covers the optimizations we used to speed up Wan2.2 inference by 2.5×.
Introduction
Open-source video generation models like Wan2.1 and Wan2.2 are closing the gap with closed-source systems. However, inference speed remains a bottleneck for real-time deployment.
In this post, we share how we accelerated Wan2.2’s image-to-video (I2V) inference pipeline using several optimization techniques.
The result: 2.5× faster performance on 8× NVIDIA H100 GPUs.
This work is part of Morphic’s ongoing effort to optimize diffusion-based video generation pipelines. You can find the detailed benchmarks and results on Morphic’s official blog.
Experiment Setup
- Hardware: 8× NVIDIA H100 (80 GB)
- Resolution: 1280×720
- Frames: 81
- Steps: 40
- Framework: PyTorch with FSDP and custom parallelism
 
Clone the repository to get started: https://github.com/morphicfilms/wan2.2_optimizations.git
1. Baseline — Flash Attention 2
Default Wan2.2 with Flash Attention 2 took 250.7 seconds to generate one 81-frame 720p video on 8× H100 GPUs.
```bash
torchrun --nproc_per_node=8 generate.py \
  --task i2v-A14B --size 1280*720 \
  --ckpt_dir ./Wan2.2-I2V-A14B \
  --image examples/i2v_input.JPG \
  --dit_fsdp --t5_fsdp --ulysses_size 8 \
  --prompt "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard..."
```
This baseline serves as the reference for all further optimizations.
2. Flash Attention 3 — +1.28x Speedup
Hopper GPUs perform significantly better with Flash Attention 3.
Install it separately:
```bash
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention && pip install wheel
cd hopper && python setup.py install
```
Re-running inference yields 195.13 seconds, a 1.28× speedup, with no quality loss.
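To sanity-check that the FA3 build is the one actually being picked up, a quick smoke test helps. This is a minimal sketch: flash_attn_interface is the module installed by the hopper build, and the tensor shapes are arbitrary.
```python
import torch
from flash_attn_interface import flash_attn_func  # installed by the hopper build

# Arbitrary example shapes: (batch, seqlen, num_heads, head_dim). FA3 expects fp16/bf16.
q = torch.randn(1, 1024, 16, 128, dtype=torch.bfloat16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

out = flash_attn_func(q, k, v)
# Some FA3 versions return (out, softmax_lse); unwrap if needed.
out = out[0] if isinstance(out, tuple) else out
print(out.shape)  # torch.Size([1, 1024, 16, 128])
```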
3. TensorFloat32 Tensor Cores — +1.57x Speedup
Enable TF32 matmul and convolution acceleration:
```python
# Allow TF32 tensor cores for fp32 matmuls and cuDNN convolutions
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```
Or use the flag --tf32 True.
This reduces inference time to 159.55 seconds (1.57× faster).
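To confirm TF32 is actually kicking in on your hardware, a micro-benchmark like the one below makes the difference visible on large fp32 matmuls. This is illustrative only, not part of the pipeline.
```python
import time
import torch

# Large fp32 matmul: with TF32 off this runs in full fp32; with TF32 on,
# the tensor cores take over at reduced mantissa precision.
a = torch.randn(8192, 8192, device="cuda")
b = torch.randn(8192, 8192, device="cuda")

for tf32 in (False, True):
    torch.backends.cuda.matmul.allow_tf32 = tf32
    a @ b  # warm-up
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(10):
        a @ b
    torch.cuda.synchronize()
    print(f"tf32={tf32}: {(time.time() - t0) / 10 * 1e3:.2f} ms per matmul")
```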
4. Quantization (int8_weight_only)
Quantization allows both low-noise and high-noise models to fit on a single GPU, eliminating FSDP overhead.
```bash
pip install -U torchao
```
Then enable it by passing the flag --quantize True.
Result: 170.24 seconds (1.47× speedup).
TF32 has no effect here because matrix multiplies are now in int8.
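Under the hood, the --quantize flag applies torchao's weight-only int8 quantization to the transformer. Here is a minimal sketch of the same idea on a generic module; the small Sequential model is just a stand-in for the Wan2.2 DiT.
```python
import torch
from torchao.quantization import quantize_, int8_weight_only

# Stand-in for the DiT: a couple of large Linear layers in bf16.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
).to(torch.bfloat16).cuda()

# Swap each Linear's weight for int8 values plus per-channel scales, in place.
quantize_(model, int8_weight_only())

x = torch.randn(8, 4096, dtype=torch.bfloat16, device="cuda")
with torch.no_grad():
    y = model(x)
print(y.dtype, y.shape)
```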
5. Magcache — Smarter Diffusion Caching
We extended Magcache for multi-GPU use. The parameter set E012K2R20 (threshold 0.12, K = 2, retention ratio 0.2) gave the best balance of quality and performance.
To enable Magcache, pass the additional flags:
```bash
--use_magcache --magcache_K 2 \
--magcache_thresh 0.12 --retention_ratio 0.2
```
Performance: 157.1 seconds (1.6×), and 121.56 seconds (1.97×) when combined with TF32.
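For intuition, the sketch below shows a simplified version of the Magcache skip rule: reuse the cached residual while the accumulated magnitude error stays under the threshold and no more than K consecutive steps have been skipped, never skipping inside the initial retention window. This is illustrative only; should_skip and mag_ratio are hypothetical names, and the real magnitude ratios come from a calibration pass in the repo's implementation.
```python
def should_skip(step, num_steps, accumulated_err, consecutive_skips, mag_ratio,
                thresh=0.12, K=2, retention_ratio=0.2):
    """Return (skip, new_accumulated_err) for one diffusion step."""
    # Always run the model during the initial retention window.
    if step < int(retention_ratio * num_steps):
        return False, 0.0
    # Estimate the error of reusing the cached residual at this step.
    err = accumulated_err + abs(1.0 - mag_ratio)
    # Skip only while the error budget and the consecutive-skip budget hold.
    if err < thresh and consecutive_skips < K:
        return True, err
    # Otherwise run the model and reset the error accumulator.
    return False, 0.0
```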
6. Torch Compile — Autotuned Kernels
Enable torch.compile with "max-autotune-no-cudagraphs" mode by passing:
```bash
--compile True --compile_mode "max-autotune-no-cudagraphs"
```
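Conceptually, the flag boils down to wrapping the denoising model in torch.compile. A minimal sketch, with a placeholder module standing in for the Wan2.2 DiT:
```python
import torch

model = torch.nn.Linear(4096, 4096).cuda()  # placeholder for the DiT
model = torch.compile(model, mode="max-autotune-no-cudagraphs")

x = torch.randn(8, 4096, device="cuda")
y = model(x)  # first call triggers autotuning; later calls reuse the tuned kernels
```
The first call pays the full autotuning cost, so warm the compiled model up before timing anything.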
Benchmarks
| Optimization Combo | Time (s) | Speedup | 
|---|---|---|
| FA3 + Compile | 172.87 | 1.45× | 
| FA3 + TF32 + Compile | 142.73 | 1.76× | 
| FA3 + Quant + Compile | 142.40 | 1.76× | 
| FA3 + TF32 + Magcache + Compile | 109.81 | 2.28× | 
Pushing Magcache harder with E024K2R10 (threshold 0.24, retention ratio 0.1) achieves 98.87 seconds (2.53×), but introduces slight artifacts.
Final Results
| Configuration | Time (s) | Speedup | 
|---|---|---|
| Baseline (FA2) | 250.7 | 1.0× | 
| FA3 + TF32 + Magcache + Compile | 109.8 | 2.28× | 
| Aggressive (E024K2R10) | 98.9 | 2.53× | 
Conclusion
These optimizations collectively cut Wan2.2 I2V inference time by more than half with no visible quality loss at the recommended settings; only the most aggressive Magcache configuration trades slight artifacts for the extra speed.
Such improvements bring open-source diffusion models closer to real-time video generation on modern GPUs.
Special thanks to Modal for powering our multi-GPU inference setup.
References
- Wan2.1 Repository
- Wan2.2 Repository
- PyTorch CUDA Docs
- FSDP API
- TorchAO Quantization
- Magcache Paper
- Torch Compile Tutorial

Tags: #pytorch #deeplearning #gpu #videogeneration #opensource
Written by Muhammad Ali Afridi — ML Engineer at Morphic, building next-gen generative video systems.
    