Boosting Wan2.2 I2V Inference on 8 H100s — 2.5× Faster with Sequence Parallelism & Magcache

 Author: Muhammad Ali Afridi, Morphic

Date: November 2025

Originally published on the Morphic Blog. Reposted here with permission.


If you’re working on diffusion-based video models and want faster inference, this guide covers optimizations we used to boost Wan2.2 by 2.5×.

Introduction

Open-source video generation models like Wan2.1 and Wan2.2 are closing the gap with closed-source systems. However, inference speed remains a bottleneck for real-time deployment.

In this post, we share how we accelerated Wan2.2’s image-to-video (I2V) inference pipeline using several optimization techniques.

The result: 2.5× faster performance on 8× NVIDIA H100 GPUs.

This work is part of Morphic’s ongoing effort to optimize diffusion-based video generation pipelines. You can find the detailed benchmarks and results on Morphic’s official blog.


Experiment Setup

  • Hardware: 8× NVIDIA H100 (80 GB)
  • Resolution: 1280×720
  • Frames: 81
  • Steps: 40
  • Framework: PyTorch with FSDP and custom parallelism

Clone the repository to get started: https://github.com/morphicfilms/wan2.2_optimizations.git


1. Baseline — Flash Attention 2

Default Wan2.2 with Flash Attention 2 took 250.7 seconds to generate one 81-frame 720p video on 8× H100 GPUs.

torchrun --nproc_per_node=8 generate.py \
  --task i2v-A14B --size 1280*720 \
  --ckpt_dir ./Wan2.2-I2V-A14B \
  --image examples/i2v_input.JPG \
  --dit_fsdp --t5_fsdp --ulysses_size 8 \
  --prompt "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard..."

This baseline serves as the reference for all further optimizations.


2. Flash Attention 3 — 1.28× Speedup

Hopper GPUs perform significantly better with Flash Attention 3.

Install it separately:

git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention && pip install wheel
cd hopper && python setup.py install


Re-running inference yields 195.13 seconds, a 1.28× speedup, with no quality loss.
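Before re-running, it is worth confirming the Hopper build actually imports. A quick sanity check, assuming the hopper build exposes the flash_attn_interface module (the exact module name can vary across flash-attention versions):

import flash_attn

# Check whether the FA3 (Hopper) build installed correctly.
try:
    import flash_attn_interface  # module name used by the hopper build
    print("Flash Attention 3 available")
except ImportError:
    print("FA3 not found, falling back to Flash Attention 2", flash_attn.__version__)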


3. TensorFloat32 Tensor Cores — 1.57× Speedup

Enable TF32 matmul and convolution acceleration:

torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True


Or use the flag --tf32 True.

This reduces inference time to 159.55 seconds (1.57× faster).
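If you are wiring this into your own launcher rather than using the flag, recent PyTorch versions also expose a single matmul-precision knob with the same effect for matmuls; a minimal sketch:

import torch

# Equivalent to allow_tf32 for matmuls in recent PyTorch:
# "high" lets fp32 matmuls use TF32 tensor cores, "highest" keeps full fp32.
torch.set_float32_matmul_precision("high")

# cuDNN convolutions still need the explicit backend flag.
torch.backends.cudnn.allow_tf32 = True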


4. Quantization (int8_weight_only)

Quantization allows both low-noise and high-noise models to fit on a single GPU, eliminating FSDP overhead.

pip install -U torchao

Then enable it with the flag --quantize True.

Result: 170.24 seconds (1.47× speedup).

TF32 has no effect here because matrix multiplies are now in int8.
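Under the hood, torchao makes int8 weight-only quantization a one-liner per module. Here is a minimal sketch of how the --quantize path could apply it; the attribute names (low_noise_model, high_noise_model) are illustrative, not necessarily the repo's exact names:

import torch
from torchao.quantization import quantize_, int8_weight_only

def quantize_dit(model: torch.nn.Module) -> torch.nn.Module:
    # Swap nn.Linear weights to int8 in place; activations stay in bf16/fp16,
    # so matmuls run with int8 weights (which is why TF32 no longer applies).
    quantize_(model, int8_weight_only())
    return model

# Illustrative usage: quantize both experts so they fit on one GPU.
# pipeline.low_noise_model = quantize_dit(pipeline.low_noise_model)
# pipeline.high_noise_model = quantize_dit(pipeline.high_noise_model)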


5. Magcache — Smarter Diffusion Caching

We extended Magcache to work in our multi-GPU setup. The parameters E012K2R20 (threshold 0.12, K = 2, retention ratio 0.2) gave the best balance of quality and performance.

To enable Magcache, pass these additional flags:

--use_magcache --magcache_K 2 \
--magcache_thresh 0.12 --retention_ratio 0.2


Performance: 157.1 seconds (1.6×), and 121.56 seconds (1.97×) when combined with TF32.
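For intuition, here is a heavily simplified sketch of the caching decision; the function, argument, and state names are illustrative, and the actual implementation calibrates per-step magnitude ratios and handles both experts:

# Skip the DiT forward and reuse the cached residual while the accumulated
# magnitude-ratio error stays under the threshold, never skipping more than
# K consecutive steps or during the early "retention" portion of the schedule.
def should_skip(step, num_steps, mag_ratio, state,
                thresh=0.12, K=2, retention_ratio=0.2):
    # Always run the model for the first retention_ratio of the steps.
    if step < int(retention_ratio * num_steps):
        state["err"], state["skips"] = 0.0, 0
        return False
    # Accumulate the estimated error from the per-step magnitude ratio.
    state["err"] += abs(1.0 - mag_ratio)
    if state["err"] < thresh and state["skips"] < K:
        state["skips"] += 1
        return True   # reuse cached residual for this step
    state["err"], state["skips"] = 0.0, 0
    return False      # run the full DiT forward and refresh the cache

state = {"err": 0.0, "skips": 0}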


6. Torch Compile — Autotuned Kernels

Enable torch.compile with "max-autotune-no-cudagraphs" mode by passing:

--compile True --compile_mode "max-autotune-no-cudagraphs"

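A minimal sketch of what the flag enables, assuming the DiT is exposed as a plain nn.Module (the helper name is illustrative):

import torch

def compile_dit(dit_model: torch.nn.Module) -> torch.nn.Module:
    # "max-autotune-no-cudagraphs": Inductor autotunes matmul/conv kernels
    # but skips CUDA graphs, which tends to be safer with FSDP and
    # sequence-parallel communication inside the forward pass.
    return torch.compile(dit_model, mode="max-autotune-no-cudagraphs")

Note that the first denoising step after compilation pays the autotuning cost, so warm up once before benchmarking.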

Benchmarks

| Optimization Combo | Time (s) | Speedup |
| --- | --- | --- |
| FA3 + Compile | 172.87 | 1.45× |
| FA3 + TF32 + Compile | 142.73 | 1.76× |
| FA3 + Quant + Compile | 142.40 | 1.76× |
| FA3 + TF32 + Magcache + Compile | 109.81 | 2.28× |

Pushing Magcache to more aggressive parameters (E024K2R10: threshold 0.24, K = 2, retention 0.1) achieves 98.87 seconds (2.53×) but introduces slight artifacts.


Final Results

| Configuration | Time (s) | Speedup |
| --- | --- | --- |
| Baseline (FA2) | 250.7 | 1.0× |
| FA3 + TF32 + Magcache + Compile | 109.8 | 2.28× |
| Aggressive (E024K2R10) | 98.9 | 2.53× |

Conclusion

These optimizations collectively cut Wan2.2 I2V inference time by more than half, without any quality degradation.

Such improvements bring open-source diffusion models closer to real-time video generation on modern GPUs.

Special thanks to Modal for powering our multi-GPU inference setup.


References

  1. Wan2.1 Repository
  2. Wan2.2 Repository
  3. PyTorch CUDA Docs
  4. FSDP API
  5. TorchAO Quantization
  6. Magcache Paper
  7. Torch Compile Tutorial

Tags: #pytorch #deeplearning #gpu #videogeneration #opensource


Written by Muhammad Ali Afridi — ML Engineer at Morphic, building next-gen generative video systems.
