Boosting Wan2.2 I2V Inference on 8 H100s — 2.5× Faster with Sequence Parallelism & Magcache

 Author: Muhammad Ali Afridi, Morphic

Date: November 2025

Originally published on the Morphic Blog. Reposted here with permission.


If you’re working on diffusion-based video models and want faster inference, this guide covers optimizations we used to boost Wan2.2 by 2.5×.

Introduction

Open-source video generation models like Wan2.1 and Wan2.2 are closing the gap with closed-source systems. However, inference speed remains a bottleneck for real-time deployment.

In this post, we share how we accelerated Wan2.2’s image-to-video (I2V) inference pipeline using several optimization techniques.

The result: 2.5× faster performance on 8× NVIDIA H100 GPUs.

This work is part of Morphic’s ongoing effort to optimize diffusion-based video generation pipelines. You can find the detailed benchmarks and results on Morphic’s official blog.


Experiment Setup

  • Hardware: 8× NVIDIA H100 (80 GB)
  • Resolution: 1280×720
  • Frames: 81
  • Steps: 40
  • Framework: PyTorch with FSDP and custom parallelism

Clone the repository to get started: https://github.com/morphicfilms/wan2.2_optimizations.git


1. Baseline — Flash Attention 2

Default Wan2.2 with Flash Attention 2 took 250.7 seconds to generate one 81-frame 720p video on 8× H100 GPUs.

torchrun --nproc_per_node=8 generate.py \
  --task i2v-A14B --size 1280*720 \
  --ckpt_dir ./Wan2.2-I2V-A14B \
  --image examples/i2v_input.JPG \
  --dit_fsdp --t5_fsdp --ulysses_size 8 \
  --prompt "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard..."

This baseline serves as the reference for all further optimizations.


2. Flash Attention 3 — 1.28× Speedup

Hopper GPUs perform significantly better with Flash Attention 3.

Install it separately:

git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention && pip install wheel
cd hopper && python setup.py install


Re-running inference yields 195.13 seconds, a 1.28× speedup, with no quality loss.
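Before re-running, it is worth confirming the Hopper build actually imports. A quick sanity check, assuming the hopper build exposes the flash_attn_interface module (the exact module name can vary across flash-attention versions):

import flash_attn

# Check whether the FA3 (Hopper) build installed correctly.
try:
    import flash_attn_interface  # module name used by the hopper build
    print("Flash Attention 3 available")
except ImportError:
    print("FA3 not found, falling back to Flash Attention 2", flash_attn.__version__)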


3. TensorFloat32 Tensor Cores — 1.57× Speedup

Enable TF32 matmul and convolution acceleration:

torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True


Or use the flag --tf32 True.

This reduces inference time to 159.55 seconds (1.57× faster).
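If you are wiring this into your own launcher rather than using the flag, recent PyTorch versions also expose a single matmul-precision knob with the same effect for matmuls; a minimal sketch:

import torch

# Equivalent to allow_tf32 for matmuls in recent PyTorch:
# "high" lets fp32 matmuls use TF32 tensor cores, "highest" keeps full fp32.
torch.set_float32_matmul_precision("high")

# cuDNN convolutions still need the explicit backend flag.
torch.backends.cudnn.allow_tf32 = True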


4. Quantization (int8_weight_only)

Quantization allows both low-noise and high-noise models to fit on a single GPU, eliminating FSDP overhead.

pip install -U torchao

Then enable it with the flag --quantize True.

Result: 170.24 seconds (1.47× speedup).

TF32 has no effect here because matrix multiplies are now in int8.
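Under the hood, torchao makes int8 weight-only quantization a one-liner per module. Here is a minimal sketch of how the --quantize path could apply it; the attribute names (low_noise_model, high_noise_model) are illustrative, not necessarily the repo's exact names:

import torch
from torchao.quantization import quantize_, int8_weight_only

def quantize_dit(model: torch.nn.Module) -> torch.nn.Module:
    # Swap nn.Linear weights to int8 in place; activations stay in bf16/fp16,
    # so matmuls run with int8 weights (which is why TF32 no longer applies).
    quantize_(model, int8_weight_only())
    return model

# Illustrative usage: quantize both experts so they fit on one GPU.
# pipeline.low_noise_model = quantize_dit(pipeline.low_noise_model)
# pipeline.high_noise_model = quantize_dit(pipeline.high_noise_model)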


5. Magcache — Smarter Diffusion Caching

We extended Magcache to work in our multi-GPU setup. The parameters E012K2R20 (threshold 0.12, K = 2, retention ratio 0.2) gave the best balance of quality and performance.

To enable Magcache, pass these additional flags:

--use_magcache --magcache_K 2 \
--magcache_thresh 0.12 --retention_ratio 0.2


Performance: 157.1 seconds (1.6×), and 121.56 seconds (1.97×) when combined with TF32.
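For intuition, here is a heavily simplified sketch of the caching decision; the function, argument, and state names are illustrative, and the actual implementation calibrates per-step magnitude ratios and handles both experts:

# Skip the DiT forward and reuse the cached residual while the accumulated
# magnitude-ratio error stays under the threshold, never skipping more than
# K consecutive steps or during the early "retention" portion of the schedule.
def should_skip(step, num_steps, mag_ratio, state,
                thresh=0.12, K=2, retention_ratio=0.2):
    # Always run the model for the first retention_ratio of the steps.
    if step < int(retention_ratio * num_steps):
        state["err"], state["skips"] = 0.0, 0
        return False
    # Accumulate the estimated error from the per-step magnitude ratio.
    state["err"] += abs(1.0 - mag_ratio)
    if state["err"] < thresh and state["skips"] < K:
        state["skips"] += 1
        return True   # reuse cached residual for this step
    state["err"], state["skips"] = 0.0, 0
    return False      # run the full DiT forward and refresh the cache

state = {"err": 0.0, "skips": 0}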


6. Torch Compile — Autotuned Kernels

Enable torch.compile with "max-autotune-no-cudagraphs" mode by passing:

--compile True --compile_mode "max-autotune-no-cudagraphs"

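A minimal sketch of what the flag enables, assuming the DiT is exposed as a plain nn.Module (the helper name is illustrative):

import torch

def compile_dit(dit_model: torch.nn.Module) -> torch.nn.Module:
    # "max-autotune-no-cudagraphs": Inductor autotunes matmul/conv kernels
    # but skips CUDA graphs, which tends to be safer with FSDP and
    # sequence-parallel communication inside the forward pass.
    return torch.compile(dit_model, mode="max-autotune-no-cudagraphs")

Note that the first denoising step after compilation pays the autotuning cost, so warm up once before benchmarking.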

Benchmarks

| Optimization Combo | Time (s) | Speedup |
| --- | --- | --- |
| FA3 + Compile | 172.87 | 1.45× |
| FA3 + TF32 + Compile | 142.73 | 1.76× |
| FA3 + Quant + Compile | 142.40 | 1.76× |
| FA3 + TF32 + Magcache + Compile | 109.81 | 2.28× |

Pushing Magcache to more aggressive parameters (E024K2R10: threshold 0.24, K = 2, retention 0.1) achieves 98.87 seconds (2.53×) but introduces slight artifacts.


Final Results

| Configuration | Time (s) | Speedup |
| --- | --- | --- |
| Baseline (FA2) | 250.7 | 1.0× |
| FA3 + TF32 + Magcache + Compile | 109.8 | 2.28× |
| Aggressive (E024K2R10) | 98.9 | 2.53× |

Conclusion

These optimizations collectively cut Wan2.2 I2V inference time by more than half, without any quality degradation.

Such improvements bring open-source diffusion models closer to real-time video generation on modern GPUs.

Special thanks to Modal for powering our multi-GPU inference setup.


References

  1. Wan2.1 Repository
  2. Wan2.2 Repository
  3. PyTorch CUDA Docs
  4. FSDP API
  5. TorchAO Quantization
  6. Magcache Paper
  7. Torch Compile Tutorial

Tags: #pytorch #deeplearning #gpu #videogeneration #opensource


Written by Muhammad Ali Afridi — ML Engineer at Morphic, building next-gen generative video systems.
