<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Muhammad Ali Afridi</title>
    <description>The latest articles on DEV Community by Muhammad Ali Afridi (@muhammad_aliafridi_35ece).</description>
    <link>https://dev.to/muhammad_aliafridi_35ece</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3587547%2F5643bcb0-0999-4cf3-9bba-cc7d78df8cfa.png</url>
      <title>DEV Community: Muhammad Ali Afridi</title>
      <link>https://dev.to/muhammad_aliafridi_35ece</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/muhammad_aliafridi_35ece"/>
    <language>en</language>
    <item>
      <title>Boosting Wan2.2 I2V Inference on 8 H100s — 2.5× Faster with Sequence Parallelism &amp; Magcache</title>
      <dc:creator>Muhammad Ali Afridi</dc:creator>
      <pubDate>Mon, 03 Nov 2025 17:02:12 +0000</pubDate>
      <link>https://dev.to/muhammad_aliafridi_35ece/boosting-wan22-i2v-inference-on-8-h100s-25-faster-with-sequence-parallelism-magcache-4pfn</link>
      <guid>https://dev.to/muhammad_aliafridi_35ece/boosting-wan22-i2v-inference-on-8-h100s-25-faster-with-sequence-parallelism-magcache-4pfn</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdxtvbr28qi8g46zczpgb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdxtvbr28qi8g46zczpgb.png" alt=" " width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Author:&lt;/strong&gt; Muhammad Ali Afridi, Morphic&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Date:&lt;/strong&gt; November 2025&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published on the &lt;a href="https://www.morphic.com/blog/boosting-wan2-2-i2v-56-faster/" rel="noopener noreferrer"&gt;Morphic Blog&lt;/a&gt;. Reposted here with permission.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;If you’re working on diffusion-based video models and want faster inference, this guide walks through the optimizations we used to make Wan2.2 I2V inference 2.5× faster.&lt;/p&gt;

&lt;h2&gt;Introduction&lt;/h2&gt;

&lt;p&gt;Open-source video generation models like &lt;strong&gt;Wan2.1&lt;/strong&gt; and &lt;strong&gt;Wan2.2&lt;/strong&gt; are closing the gap with closed-source systems. However, inference speed remains a bottleneck for real-time deployment.&lt;/p&gt;

&lt;p&gt;In this post, we share how we accelerated &lt;strong&gt;Wan2.2’s image-to-video (I2V)&lt;/strong&gt; inference pipeline using several optimization techniques.&lt;/p&gt;

&lt;p&gt;The result: &lt;strong&gt;2.5× faster performance&lt;/strong&gt; on 8× NVIDIA H100 GPUs.&lt;/p&gt;

&lt;p&gt;This work is part of Morphic’s ongoing effort to optimize diffusion-based video generation pipelines. You can find the detailed benchmarks and results on &lt;a href="https://www.morphic.com/blog/boosting-wan2-2-i2v-56-faster/" rel="noopener noreferrer"&gt;Morphic’s official blog&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;Experiment Setup&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hardware:&lt;/strong&gt; 8× NVIDIA H100 (80 GB)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resolution:&lt;/strong&gt; 1280×720&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frames:&lt;/strong&gt; 81&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Steps:&lt;/strong&gt; 40&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Framework:&lt;/strong&gt; PyTorch with FSDP and custom parallelism&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Clone the repository to get started: &lt;code&gt;git clone https://github.com/morphicfilms/wan2.2_optimizations.git&lt;/code&gt;&lt;/p&gt;




&lt;h2&gt;1. Baseline — Flash Attention 2&lt;/h2&gt;

&lt;p&gt;Default Wan2.2 with &lt;strong&gt;Flash Attention 2&lt;/strong&gt; took &lt;strong&gt;250.7 seconds&lt;/strong&gt; to generate one 81-frame 720p video on 8× H100s.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;torchrun &lt;span class="nt"&gt;--nproc_per_node&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;8 generate.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--task&lt;/span&gt; i2v-A14B &lt;span class="nt"&gt;--size&lt;/span&gt; 1280&lt;span class="k"&gt;*&lt;/span&gt;720 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--ckpt_dir&lt;/span&gt; ./Wan2.2-I2V-A14B &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt; examples/i2v_input.JPG &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dit_fsdp&lt;/span&gt; &lt;span class="nt"&gt;--t5_fsdp&lt;/span&gt; &lt;span class="nt"&gt;--ulysses_size&lt;/span&gt; 8 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--prompt&lt;/span&gt; &lt;span class="s2"&gt;"Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard..."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This baseline serves as the reference for all further optimizations.&lt;/p&gt;
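
&lt;p&gt;The &lt;code&gt;--ulysses_size 8&lt;/code&gt; flag in the command above shards the attention sequence across the 8 GPUs. As a rough illustration of the idea (a minimal sketch, not the repo's actual implementation), Ulysses-style sequence parallelism uses an all-to-all to switch the sharded dimension from sequence to heads, so each rank runs full-sequence attention on a subset of heads:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch
import torch.distributed as dist
import torch.nn.functional as F

def ulysses_attention(q, k, v, group=None):
    """q, k, v: (batch, seq_len // world_size, num_heads, head_dim)."""
    p = dist.get_world_size(group)

    def seq_to_heads(x):
        # Re-shard: each rank trades its sequence slice of all heads
        # for the full sequence of num_heads // p heads.
        b, s, h, d = x.shape
        chunks = list(x.reshape(b, s, p, h // p, d)
                       .permute(2, 0, 1, 3, 4).contiguous().unbind(0))
        out = [torch.empty_like(c) for c in chunks]
        dist.all_to_all(out, chunks, group=group)
        return torch.cat(out, dim=1)  # (b, s * p, h // p, d)

    q, k, v = seq_to_heads(q), seq_to_heads(k), seq_to_heads(v)
    o = F.scaled_dot_product_attention(
        q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)
    ).transpose(1, 2)

    # Reverse re-shard: back to a sequence slice with all heads.
    b, s, h, d = o.shape
    chunks = list(o.reshape(b, p, s // p, h, d)
                   .permute(1, 0, 2, 3, 4).contiguous().unbind(0))
    out = [torch.empty_like(c) for c in chunks]
    dist.all_to_all(out, chunks, group=group)
    return torch.cat(out, dim=2)  # (b, s // p, h * p, d)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;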




&lt;h2&gt;2. Flash Attention 3 — 1.28× Speedup&lt;/h2&gt;

&lt;p&gt;Hopper GPUs perform significantly better with &lt;strong&gt;Flash Attention 3&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Install it separately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Dao-AILab/flash-attention.git
&lt;span class="nb"&gt;cd &lt;/span&gt;flash-attention &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; pip &lt;span class="nb"&gt;install &lt;/span&gt;wheel
&lt;span class="nb"&gt;cd &lt;/span&gt;hopper &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; python setup.py &lt;span class="nb"&gt;install&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
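
&lt;p&gt;Once installed, the Hopper build exposes its own Python module. A minimal sketch of preferring FA3 with an FA2 fallback (module names depend on your flash-attention build, so treat this as an assumption to verify):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;try:
    # FA3 from the hopper/ build
    from flash_attn_interface import flash_attn_func
    FA_VERSION = 3
except ImportError:
    # FA2 fallback
    from flash_attn import flash_attn_func
    FA_VERSION = 2

# q, k, v: (batch, seq_len, num_heads, head_dim) in fp16/bf16 on GPU
# out = flash_attn_func(q, k, v, causal=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;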



&lt;p&gt;Re-running inference yields &lt;strong&gt;195.13 seconds&lt;/strong&gt;, a &lt;strong&gt;1.28× speedup&lt;/strong&gt;, with no quality loss.&lt;/p&gt;




&lt;h2&gt;3. TensorFloat32 Tensor Cores — 1.57× Speedup&lt;/h2&gt;

&lt;p&gt;Enable TF32 matmul and convolution acceleration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or use the flag &lt;code&gt;--tf32 True&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This reduces inference time to &lt;strong&gt;159.55 seconds (1.57× faster)&lt;/strong&gt;.&lt;/p&gt;
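
&lt;p&gt;TF32 keeps fp32's dynamic range but rounds matmul inputs to a 10-bit mantissa, which is what unlocks the Tensor Core throughput. A quick way to sanity-check the numeric effect yourself:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")

torch.backends.cuda.matmul.allow_tf32 = False
ref = a @ b  # full fp32 matmul

torch.backends.cuda.matmul.allow_tf32 = True
fast = a @ b  # TF32 Tensor Core matmul

# Small but nonzero rounding difference; harmless for diffusion inference
print((ref - fast).abs().max())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;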




&lt;h2&gt;4. Quantization (int8_weight_only)&lt;/h2&gt;

&lt;p&gt;Int8 weight-only quantization shrinks the weights enough that both the low-noise and high-noise models fit on a single GPU, eliminating FSDP overhead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-U&lt;/span&gt; torchao
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then pass the flag &lt;code&gt;--quantize True&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Result: &lt;strong&gt;170.24 seconds (1.47× speedup)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;TF32 has no effect here because matrix multiplies are now in int8.&lt;/p&gt;
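
&lt;p&gt;The &lt;code&gt;--quantize&lt;/code&gt; flag presumably wires up torchao's weight-only quantization API. A minimal sketch of doing it by hand (assuming a loaded &lt;code&gt;nn.Module&lt;/code&gt; called &lt;code&gt;dit_model&lt;/code&gt;; the exact entry point varies by torchao version):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from torchao.quantization import quantize_, int8_weight_only

# Swap each linear layer's fp16/bf16 weights for int8 plus per-channel
# scales; activations stay in floating point.
quantize_(dit_model, int8_weight_only())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;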




&lt;h2&gt;5. Magcache — Smarter Diffusion Caching&lt;/h2&gt;

&lt;p&gt;We extended &lt;strong&gt;Magcache&lt;/strong&gt; to work in our multi-GPU setup. The setting &lt;code&gt;E012K2R20&lt;/code&gt; (error threshold 0.12, K = 2, retention ratio 0.2) gave the best balance of quality and performance.&lt;/p&gt;
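
&lt;p&gt;The intuition: Magcache tracks how much the model's residual output changes in magnitude between consecutive diffusion steps, and reuses the cached residual when the accumulated error stays small. A simplified sketch of the skip rule (illustrative, not the exact implementation; we read K as the maximum consecutive skips and the retention ratio as the fraction of early steps that always run):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def should_skip(step, total_steps, accumulated_error,
                consecutive_skips, mag_ratio,
                thresh=0.12, K=2, retention_ratio=0.2):
    # Early, high-signal steps always run
    if step &amp;lt; int(total_steps * retention_ratio):
        return False
    # Estimated error from reusing the cached residual this step
    err = accumulated_error + abs(1.0 - mag_ratio)
    return err &amp;lt;= thresh and consecutive_skips &amp;lt; K
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;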

&lt;p&gt;To enable Magcache, pass the additional flags:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nt"&gt;--use_magcache&lt;/span&gt; &lt;span class="nt"&gt;--magcache_K&lt;/span&gt; 2 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--magcache_thresh&lt;/span&gt; 0.12 &lt;span class="nt"&gt;--retention_ratio&lt;/span&gt; 0.2

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Performance: &lt;strong&gt;157.1 seconds (1.6×)&lt;/strong&gt;, and &lt;strong&gt;121.56 seconds (1.97×)&lt;/strong&gt; when combined with TF32.&lt;/p&gt;




&lt;h2&gt;6. Torch Compile — Autotuned Kernels&lt;/h2&gt;

&lt;p&gt;Enable &lt;code&gt;torch.compile&lt;/code&gt; with &lt;code&gt;"max-autotune-no-cudagraphs"&lt;/code&gt; mode by passing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nt"&gt;--compile&lt;/span&gt; True &lt;span class="nt"&gt;--compile_mode&lt;/span&gt; &lt;span class="s2"&gt;"max-autotune-no-cudagraphs"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
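
&lt;p&gt;Internally this presumably amounts to wrapping the DiT forward in &lt;code&gt;torch.compile&lt;/code&gt; (a sketch of the assumed wiring; the first call pays a one-off autotuning cost):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

# Like "max-autotune" but without CUDA graphs, which can be fragile
# with dynamic shapes and distributed collectives
model = torch.compile(model, mode="max-autotune-no-cudagraphs")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;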



&lt;h3&gt;Benchmarks&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Optimization Combo&lt;/th&gt;
&lt;th&gt;Time (s)&lt;/th&gt;
&lt;th&gt;Speedup&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FA3 + Compile&lt;/td&gt;
&lt;td&gt;172.87&lt;/td&gt;
&lt;td&gt;1.45×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FA3 + TF32 + Compile&lt;/td&gt;
&lt;td&gt;142.73&lt;/td&gt;
&lt;td&gt;1.76×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FA3 + Quant + Compile&lt;/td&gt;
&lt;td&gt;142.40&lt;/td&gt;
&lt;td&gt;1.76×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FA3 + TF32 + Magcache + Compile&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;109.81&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.28×&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Pushing Magcache to a more aggressive setting (&lt;code&gt;E024K2R10&lt;/code&gt;: threshold 0.24, retention ratio 0.1) achieves &lt;strong&gt;98.87 seconds (2.53×)&lt;/strong&gt; but introduces slight visual artifacts.&lt;/p&gt;




&lt;h2&gt;Final Results&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;Time (s)&lt;/th&gt;
&lt;th&gt;Speedup&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Baseline (FA2)&lt;/td&gt;
&lt;td&gt;250.7&lt;/td&gt;
&lt;td&gt;1.0×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FA3 + TF32 + Magcache + Compile&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;109.8&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.28×&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aggressive (E024K2R10)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;98.9&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.53×&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;These optimizations collectively cut &lt;strong&gt;Wan2.2 I2V inference time by more than half&lt;/strong&gt;, with no quality degradation at the recommended settings.&lt;/p&gt;

&lt;p&gt;Such improvements bring open-source diffusion models closer to &lt;strong&gt;real-time video generation&lt;/strong&gt; on modern GPUs.&lt;/p&gt;

&lt;p&gt;Special thanks to &lt;a href="https://modal.com/" rel="noopener noreferrer"&gt;Modal&lt;/a&gt; for powering our multi-GPU inference setup.&lt;/p&gt;




&lt;h2&gt;References&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://github.com/Wan-Video/Wan2.1" rel="noopener noreferrer"&gt;Wan2.1 Repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Wan-Video/Wan2.2" rel="noopener noreferrer"&gt;Wan2.2 Repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.pytorch.org/docs/stable/notes/cuda.html" rel="noopener noreferrer"&gt;PyTorch CUDA Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/" rel="noopener noreferrer"&gt;FSDP API&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/pytorch/ao/blob/main/torchao/quantization/README.md" rel="noopener noreferrer"&gt;TorchAO Quantization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2506.09045" rel="noopener noreferrer"&gt;Magcache Paper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html" rel="noopener noreferrer"&gt;Torch Compile Tutorial&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; &lt;code&gt;#pytorch&lt;/code&gt; &lt;code&gt;#deeplearning&lt;/code&gt; &lt;code&gt;#gpu&lt;/code&gt; &lt;code&gt;#videogeneration&lt;/code&gt; &lt;code&gt;#opensource&lt;/code&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Written by Muhammad Ali Afridi — ML Engineer at &lt;a href="https://www.morphic.com/" rel="noopener noreferrer"&gt;Morphic&lt;/a&gt;, building next-gen generative video systems.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>performance</category>
      <category>deeplearning</category>
      <category>ai</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
