<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Atlas Cloud</title>
    <description>The latest articles on DEV Community by Atlas Cloud (@atlas_cloud_ai).</description>
    <link>https://dev.to/atlas_cloud_ai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3815847%2Fc1a61742-36b7-41d5-881a-7879f5e3bf07.jpg</url>
      <title>DEV Community: Atlas Cloud</title>
      <link>https://dev.to/atlas_cloud_ai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/atlas_cloud_ai"/>
    <language>en</language>
    <item>
      <title>Writing High-Performance Kernels in TileLang, from GEMM to MLA</title>
      <dc:creator>Atlas Cloud</dc:creator>
      <pubDate>Tue, 26 May 2026 08:50:38 +0000</pubDate>
      <link>https://dev.to/atlas_cloud_ai/writing-high-performance-kernels-in-tilelang-from-gemm-to-mla-13p0</link>
      <guid>https://dev.to/atlas_cloud_ai/writing-high-performance-kernels-in-tilelang-from-gemm-to-mla-13p0</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh8lcjjjac553rk86mkpz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh8lcjjjac553rk86mkpz.png" alt=" " width="800" height="496"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you write GPU kernels, you live somewhere on a spectrum. At one end is Triton: quick to write, but the compiler makes most of the layout and shared-memory decisions for you. At the other end is CUTLASS / CuTe: total control, at the cost of a lot of template machinery. TileLang sits in the middle. You write Python, but you say explicitly what lives in shared memory, how the pipeline is staged, and how warps split the work — and a layout inference pass fills in the rest.&lt;/p&gt;

&lt;p&gt;In this post we'll cover the mental model, write a GEMM, and then build up to a real production kernel: DeepSeek's MLA decode, where the interesting decisions actually show up. The goal is not to be exhaustive. It's to show what you think about tiles, and where TileLang quietly does the hard parts for you. We'll finish with a more typical story from production — a kernel where the win wasn't speed at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  The mental model
&lt;/h2&gt;

&lt;p&gt;Here's the whole idea in three points.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A tile is a first-class object.&lt;/strong&gt; A shaped chunk of data (&lt;code&gt;block_M × block_K&lt;/code&gt;, say) is owned and operated on by a thread block, a warp, or a thread. You stop thinking purely at the thread-block level the way you do in Triton, and you stop hand-managing individual threads the way you do in CUDA.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You place buffers in the memory hierarchy yourself.&lt;/strong&gt; You declare what goes to shared memory (&lt;code&gt;T.alloc_shared&lt;/code&gt;), what goes to registers (&lt;code&gt;T.alloc_fragment&lt;/code&gt;), and what's thread-local. This is the biggest difference from Triton, which hides shared-memory allocation and staging inside the compiler.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The compiler infers the thread mapping.&lt;/strong&gt; Once you've said where a tile lives and what operation runs on it (a copy, a gemm, a reduce), a &lt;em&gt;layout inference&lt;/em&gt; pass parallelizes it across threads and works out the register and shared-memory layouts. You can override it when you need to, but most of the time you don't. This pass is the load-bearing feature — by the time we get to MLA you'll see why.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're coming from Triton, here's the rough mapping.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Triton&lt;/th&gt;
&lt;th&gt;TileLang&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Granularity&lt;/td&gt;
&lt;td&gt;thread block + implicit vectorization&lt;/td&gt;
&lt;td&gt;tile (block / warp / thread)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shared memory&lt;/td&gt;
&lt;td&gt;managed by the compiler&lt;/td&gt;
&lt;td&gt;explicit alloc_shared + copy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Layout&lt;/td&gt;
&lt;td&gt;the compiler decides&lt;/td&gt;
&lt;td&gt;inferred, but you can annotate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pipelining&lt;/td&gt;
&lt;td&gt;tl.range + compiler&lt;/td&gt;
&lt;td&gt;explicit T.Pipelined(num_stages=)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tensor Core&lt;/td&gt;
&lt;td&gt;tl.dot&lt;/td&gt;
&lt;td&gt;T.gemm with a selectable warp policy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Backends&lt;/td&gt;
&lt;td&gt;NVIDIA (mainly) / AMD&lt;/td&gt;
&lt;td&gt;NVIDIA / AMD / CPU / WebGPU / CuTeDSL, plus Ascend &amp;amp; MUSA forks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The short version: if you want fine control over blocking, pipeline depth, and warp partitioning without writing CUTLASS, TileLang is the sweet spot. For simple elementwise or light fusion, Triton is still quicker to reach for.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting set up
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;conda create &lt;span class="nt"&gt;-n&lt;/span&gt; tilelang &lt;span class="nv"&gt;python&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3.10 &lt;span class="nt"&gt;-y&lt;/span&gt;
conda activate tilelang
pip &lt;span class="nb"&gt;install &lt;/span&gt;tilelang                 &lt;span class="c"&gt;# prebuilt wheel, easiest path&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're going to touch the compiler passes, build from source instead (you'll need a local LLVM/CUDA toolchain):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone &lt;span class="nt"&gt;--recursive&lt;/span&gt; https://github.com/tile-ai/tilelang.git
&lt;span class="nb"&gt;cd &lt;/span&gt;tilelang &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements-dev.txt
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="nt"&gt;--no-build-isolation&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Let's write a GEMM
&lt;/h2&gt;

&lt;p&gt;We'll start with the kernel everyone starts with: &lt;code&gt;C = ReLU(A @ B)&lt;/code&gt;. It's small, but it touches every primitive that matters — explicit buffers, parallel copy, software pipelining, the Tensor Core call, and an L2 swizzle.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tilelang&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tilelang.language&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;

&lt;span class="nd"&gt;@tilelang.jit&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;matmul&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block_M&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block_N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block_K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float16&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;accum_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="nd"&gt;@T.prim_func&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;matmul_relu_kernel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# grid dims: (#blocks along N, #blocks along M); 128 threads per block
&lt;/span&gt;        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Kernel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ceildiv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block_N&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ceildiv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block_M&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                      &lt;span class="n"&gt;threads&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nf"&gt;as &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;by&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

            &lt;span class="c1"&gt;# Say where each tile lives, explicitly.
&lt;/span&gt;            &lt;span class="n"&gt;A_shared&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alloc_shared&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;block_M&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block_K&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;         &lt;span class="c1"&gt;# shared memory
&lt;/span&gt;            &lt;span class="n"&gt;B_shared&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alloc_shared&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;block_K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block_N&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;C_local&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alloc_fragment&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;block_M&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block_N&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;accum_dtype&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# register accumulator
&lt;/span&gt;
            &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use_swizzle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;panel_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;col&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# optional: better L2 reuse
&lt;/span&gt;            &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;clear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;C_local&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                           &lt;span class="c1"&gt;# zero the accumulator
&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ko&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Pipelined&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ceildiv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block_K&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;num_stages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;by&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;block_M&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ko&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;block_K&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;A_shared&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# global -&amp;gt; shared
&lt;/span&gt;                &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ko&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;block_K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bx&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;block_N&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;B_shared&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gemm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A_shared&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;B_shared&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;C_local&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;               &lt;span class="c1"&gt;# tile-level MMA
&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Parallel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block_M&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block_N&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;             &lt;span class="c1"&gt;# fused ReLU
&lt;/span&gt;                &lt;span class="n"&gt;C_local&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;C_local&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;C_local&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;by&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;block_M&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bx&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;block_N&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;        &lt;span class="c1"&gt;# write back
&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;matmul_relu_kernel&lt;/span&gt;


&lt;span class="n"&gt;M&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;
&lt;span class="n"&gt;kernel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;matmul&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block_M&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block_N&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block_K&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;empty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;kernel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;assert_close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;relu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;rtol&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1e-2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;atol&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1e-2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemm ok&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is what each piece is doing.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Three buffers, three levels.&lt;/strong&gt; &lt;code&gt;A_shared&lt;/code&gt; and &lt;code&gt;B_shared&lt;/code&gt; live in shared memory; &lt;code&gt;C_local&lt;/code&gt; lives in registers. Accumulator in registers, operands staged through shared memory — that's the standard GEMM recipe, except here &lt;em&gt;you&lt;/em&gt; write it down. That's the whole difference from Triton in one line.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;T.copy&lt;/code&gt; is sugar for a parallel copy.&lt;/strong&gt; It expands into a &lt;code&gt;T.Parallel&lt;/code&gt;-style move, and the compiler derives a vectorized, coalesced global→shared transfer from it. When the copy sits inside &lt;code&gt;T.Pipelined&lt;/code&gt;, it becomes &lt;code&gt;cp.async&lt;/code&gt; automatically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;T.Pipelined(extent, num_stages=N)&lt;/code&gt; is a software pipeline.&lt;/strong&gt; &lt;code&gt;num_stages=3&lt;/code&gt; means triple buffering — while you compute K-tile &lt;code&gt;ko&lt;/code&gt;, the loads for &lt;code&gt;ko+1&lt;/code&gt; and &lt;code&gt;ko+2&lt;/code&gt; are already in flight. In Triton, this is a compile flag; here it's just the loop, which is easier to reason about.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;T.gemm(A, B, C)&lt;/code&gt; is the tile-level matmul.&lt;/strong&gt; It lowers to CuTe/MMA on NVIDIA and the matching intrinsically on AMD. It also takes &lt;code&gt;transpose_A&lt;/code&gt; / &lt;code&gt;transpose_B&lt;/code&gt; and a &lt;code&gt;policy=T.GemmWarpPolicy.*&lt;/code&gt; that controls how warps split the output tile. Hold onto that policy argument — it's the whole story when we get to MLA.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;T.use_swizzle&lt;/code&gt;&lt;/strong&gt; reorders how thread blocks are scheduled so that blocks adjacent in L2 run close together in time. Usually a few percent of free bandwidth.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The figure below maps all of this onto the hardware. It's worth reading against the code, because the labeled spots are exactly where TileLang hands you control that Triton keeps for itself.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqs6lbdoewih47pd58qvb.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqs6lbdoewih47pd58qvb.jpg" alt="Figure: GEMM in TileLang — you place every buffer in the hierarchy yourself. A_shared / B_shared sit in shared memory, C_local accumulates in registers across warps W0–W3, and the K-loop pipeline (num_stages=3) overlaps cp.async prefetches with the current gemm compute." width="800" height="518"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  A few primitives you'll reach for
&lt;/h2&gt;

&lt;p&gt;You can write most kernels with a small vocabulary.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Allocate:&lt;/strong&gt; &lt;code&gt;T.alloc_shared&lt;/code&gt;, &lt;code&gt;T.alloc_fragment&lt;/code&gt; (registers), &lt;code&gt;T.alloc_local&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Move and init:&lt;/strong&gt; &lt;code&gt;T.copy(src, dst)&lt;/code&gt; between any two levels; &lt;code&gt;T.clear&lt;/code&gt;, &lt;code&gt;T.fill&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compute:&lt;/strong&gt; &lt;code&gt;T.gemm(...)&lt;/code&gt;; &lt;code&gt;T.Parallel(d0, d1, ...)&lt;/code&gt; for elementwise loops (this is the entry point for layout inference); &lt;code&gt;T.reduce_max&lt;/code&gt; / &lt;code&gt;T.reduce_sum&lt;/code&gt;; scalar math like &lt;code&gt;T.exp&lt;/code&gt;, &lt;code&gt;T.exp2&lt;/code&gt;, &lt;code&gt;T.max&lt;/code&gt;, &lt;code&gt;T.infinity&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schedule:&lt;/strong&gt; &lt;code&gt;T.Pipelined(extent, num_stages=)&lt;/code&gt;, &lt;code&gt;T.use_swizzle(...)&lt;/code&gt;, &lt;code&gt;T.annotate_layout(...)&lt;/code&gt; when you need a specific layout (bank-conflict avoidance, custom swizzle).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic shape:&lt;/strong&gt; &lt;code&gt;M = T.dynamic("m")&lt;/code&gt; so you don't recompile per shape (it's called &lt;code&gt;T.symbolic&lt;/code&gt; in some versions).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Checking your work
&lt;/h2&gt;

&lt;p&gt;Two things you'll want often. To see what the compiler actually emitted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kernel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_kernel_source&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;     &lt;span class="c1"&gt;# generated CUDA / HIP
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And to time it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;profiler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kernel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_profiler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tensor_supply_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tilelang&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TensorSupplyType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latency: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;profiler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;do_bench&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;T.print(buf)&lt;/code&gt; prints a tile from inside the kernel, and the repo's &lt;code&gt;examples/plot_layout&lt;/code&gt; draws the memory layout, which is handy when you're chasing a bank conflict or checking a swizzle.&lt;/p&gt;

&lt;h2&gt;
  
  
  Now a real one: MLA decode
&lt;/h2&gt;

&lt;p&gt;The GEMM shows the mechanics. This next one shows why they matter. We'll walk through DeepSeek's MLA (Multi-Head Latent Attention) decode kernel, because it's the cleanest example of TileLang earning its keep. The TileLang reference lands at roughly FlashMLA's H100 performance (benchmarked at batch 64/128 in fp16, comfortably ahead of Triton and FlashInfer) in about 80 lines of Python. The interesting question is how, because the hard part of MLA isn't the math — it's register pressure.&lt;/p&gt;

&lt;p&gt;Let's review the loop everyone knows. Every FlashAttention-family kernel has the same shape. Per query block, you stream over key/value blocks and keep a running max and denominator, so the full score matrix never lands in memory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# acc_s : [block_M, block_N]  scores for this KV block
# acc_o : [block_M, dim]      output accumulator
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_kv_blocks&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;acc_s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Q&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;
    &lt;span class="n"&gt;m_prev&lt;/span&gt;       &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scores_max&lt;/span&gt;
    &lt;span class="n"&gt;scores_max&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m_prev&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;rowmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;acc_s&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;scores_scale&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m_prev&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;scores_max&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;acc_o&lt;/span&gt; &lt;span class="o"&gt;*=&lt;/span&gt; &lt;span class="n"&gt;scores_scale&lt;/span&gt;                       &lt;span class="c1"&gt;# rescale prior output
&lt;/span&gt;    &lt;span class="n"&gt;acc_s&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;acc_s&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;scores_max&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;            &lt;span class="c1"&gt;# probabilities
&lt;/span&gt;    &lt;span class="n"&gt;acc_o&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;acc_s&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both &lt;code&gt;acc_s&lt;/code&gt; and &lt;code&gt;acc_o&lt;/code&gt; want to stay in registers. For MHA or GQA, that's fine. For MLA, it isn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here's where it gets hard.&lt;/strong&gt; MLA's head dimensions are big: query and key are 576 wide (a 512-wide "nope" part with no positional encoding, plus a 64-wide "rope" part), and value is 512. So &lt;code&gt;acc_o = [block_M, 512]&lt;/code&gt;, and it has to stay resident in registers across the whole KV loop.&lt;/p&gt;

&lt;p&gt;Now bring in the hardware. On Hopper, the fast path is &lt;code&gt;wgmma.mma_async&lt;/code&gt;, which ties 4 warps (128 threads) into one warpgroup and requires a minimum M of 64. So the smallest M one warpgroup can own is 64, which means one warpgroup would be holding a &lt;code&gt;64 × 512&lt;/code&gt; accumulator. That's too big for a single warpgroup's register file. It spills, and performance falls off a cliff.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq4vs2m5jz93ikowi92gl.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq4vs2m5jz93ikowi92gl.jpg" alt="Figure: MLA decode in TileLang — splitting acc_o across two warpgroups. WG0 and WG1 each compute Q·K^T (policy=FullCol), exchange their score halves through S_shared, and then each compute their column slab of P·V into acc_o_L / acc_o_R. The whole bookkeeping (acc_s shape, S_shared shape, Q·K split) is derived by layout inference from the FullCol policy you annotated." width="799" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix is to split the output across two warpgroups.&lt;/strong&gt; You can't shrink M below 64, so the only axis left is &lt;code&gt;dim&lt;/code&gt;. Use two warpgroups: &lt;code&gt;WG0&lt;/code&gt; owns &lt;code&gt;acc_o[:, :256]&lt;/code&gt;, &lt;code&gt;WG1&lt;/code&gt; owns &lt;code&gt;acc_o[:, 256:]&lt;/code&gt;. Now each one holds a &lt;code&gt;64 × 256&lt;/code&gt; accumulator, which fits. That creates a second problem, though: the &lt;code&gt;P @ V&lt;/code&gt; step (with &lt;code&gt;policy=FullCol&lt;/code&gt;, each warpgroup producing one column slab of the output) needs the &lt;em&gt;complete&lt;/em&gt; &lt;code&gt;acc_s&lt;/code&gt;, but in &lt;code&gt;Q @ K&lt;/code&gt; each warpgroup only naturally computed half of it. The resolution is a shared-memory swap. During &lt;code&gt;Q @ K&lt;/code&gt;, each warpgroup writes its half of &lt;code&gt;acc_s&lt;/code&gt; to shared memory and reads back the other warpgroup's half, so afterward both hold the full &lt;code&gt;acc_s&lt;/code&gt; and can each compute their slab of &lt;code&gt;acc_o&lt;/code&gt;. The diagram above is exactly that: split the scores, swap through &lt;code&gt;S_shared&lt;/code&gt;, split the output.&lt;/p&gt;

&lt;p&gt;In CuTe you'd hand-write the layouts, the swizzles, the Tensor Core alignment, and the producer/consumer sync to pull this off. The reason it collapses to ~80 lines here is layout inference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Let's break down what layout inference does.&lt;/strong&gt; You annotate intent on the &lt;code&gt;T.gemm&lt;/code&gt; calls, and it propagates the constraints through the program for you:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;policy=FullCol&lt;/code&gt; on &lt;code&gt;P @ V&lt;/code&gt; means each warpgroup needs the full &lt;code&gt;acc_s&lt;/code&gt;, so &lt;code&gt;acc_s = [block_M, block_N]&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;That propagates back to the staging buffer, so &lt;code&gt;S_shared&lt;/code&gt; in &lt;code&gt;T.copy(S_shared, acc_s)&lt;/code&gt; is also &lt;code&gt;[block_M, block_N]&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;And forward into &lt;code&gt;Q @ K&lt;/code&gt;: with &lt;code&gt;FullCol&lt;/code&gt;, each warpgroup's score slab is &lt;code&gt;[block_M, block_N/2]&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key insight is that you never write any of those shapes. You pick the warp policy and write the math; the shapes, the swizzled layouts, and the warp-specialized producer/consumer code all come out of inference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The kernel skeleton.&lt;/strong&gt; In MLA decode the query splits into a "nope" part (&lt;code&gt;Q&lt;/code&gt;, dim 512) and a "rope" part (&lt;code&gt;Q_pe&lt;/code&gt;, dim 64), and the compressed latent serves as both K and V. So the score is a sum of two GEMMs, and the output is one more. The inner loop looks like this (a representative skeleton, not line-exact — see &lt;code&gt;example_mla_decode.py&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# acc_s = Q_nope @ KV^T + Q_rope @ K_pe^T
&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gemm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Q_shared&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="n"&gt;KV_shared&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="n"&gt;acc_s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transpose_B&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GemmWarpPolicy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FullCol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clear_accum&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gemm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Q_pe_shared&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K_pe_shared&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;acc_s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transpose_B&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GemmWarpPolicy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FullCol&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# online softmax
&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores_max&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scores_max_prev&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores_max&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;infinity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;accum_dtype&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reduce_max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;acc_s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scores_max&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clear&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# ... exp, rescale acc_o by scores_scale, reduce_sum into logsum ...
&lt;/span&gt;
&lt;span class="c1"&gt;# acc_o += P @ V  (V is the same latent KV)
&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;acc_s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;acc_s_cast&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gemm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;acc_s_cast&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;KV_shared&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;acc_o&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GemmWarpPolicy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FullCol&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;S_shared&lt;/code&gt; exchange between the two warpgroups is the part inference inserts for you, once the &lt;code&gt;FullCol&lt;/code&gt; policies force &lt;code&gt;acc_s&lt;/code&gt; to be full per warpgroup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The nice part: the optimizations are one line each.&lt;/strong&gt; This is where TileLang pays off — the whole performance toolkit is one-liners, and the messy lowering is handled for you.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Threadblock swizzling&lt;/strong&gt; for L2 reuse: &lt;code&gt;T.use_swizzle(panel_size, order="row")&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared-memory swizzling&lt;/strong&gt; for bank conflicts: &lt;code&gt;T.annotate_layout({S_shared: T.layout.make_swizzled_layout(S_shared)})&lt;/code&gt; — XOR-style address remapping so concurrent accesses spread across banks instead of serializing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Warp specialization:&lt;/strong&gt; you write a plain script, and it's lowered into a producer warpgroup (TMA loads) plus consumer warpgroups, with all the &lt;code&gt;mbarrier&lt;/code&gt; sync generated. None of it shows up in your code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pipelining:&lt;/strong&gt; &lt;code&gt;T.Pipelined(range, num_stages)&lt;/code&gt; overlaps loads with compute — more stages, more overlap, but more shared memory, so it's a knob.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Split-KV&lt;/strong&gt; (FlashDecoding-style): when the batch is small and the SMs are idle, split the KV context across SMs and merge. It's a &lt;code&gt;num_split&lt;/code&gt; parameter plus a combine kernel.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the genuinely hard reasoning — register budget against the M≥64 floor, who owns what across warpgroups, the shared-memory swap — you express by choosing a policy and writing the math. Everything that would be hundreds of fragile lines in CuTe is inference and codegen. That's the pitch, and MLA is where it's most convincing.&lt;/p&gt;

&lt;h2&gt;
  
  
  One of our own: a drop-in RMSNorm at AtlasCloud
&lt;/h2&gt;

&lt;p&gt;The last example is one of our own production kernels at AtlasCloud, from the Wan video-generation VAE on H100/H200. It's a great illustration of the other thing TileLang is excellent at: covering a config a hand-tuned kernel can't reach, with a clean drop-in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The setup.&lt;/strong&gt; We already ship a hand-tuned fused RMSNorm + SiLU kernel. It's fast, and it's compiled for the hidden dims &lt;code&gt;D ∈ {96, 192, 384}&lt;/code&gt; that one model config uses. A newer config needs channel widths like &lt;code&gt;{160, 256, 320, 512, 640, 1024}&lt;/code&gt;, so on that config the hand-tuned fast path can't run. We wrote a TileLang drop-in to cover exactly that gap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The TileLang kernel.&lt;/strong&gt; A drop-in with the same interface (BTHWC in/out, same math, same &lt;code&gt;eps&lt;/code&gt;) that supports any &lt;code&gt;C&lt;/code&gt; that's a multiple of 32. Two passes, fully coalesced, FP32 accumulator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@T.prim_func&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;      &lt;span class="c1"&gt;# M = B*T*H*W rows
&lt;/span&gt;         &lt;span class="n"&gt;gamma&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;,),&lt;/span&gt;  &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
         &lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Kernel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ceildiv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BLOCK_M&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;threads&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;bm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;X_chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alloc_shared&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;BLOCK_M&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BLOCK_C&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;ss&lt;/span&gt;      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alloc_fragment&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;BLOCK_M&lt;/span&gt;&lt;span class="p"&gt;,),&lt;/span&gt; &lt;span class="n"&gt;accum_dtype&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# FP32 sum-of-squares
&lt;/span&gt;        &lt;span class="c1"&gt;# pass 1: loop over C in BLOCK_C chunks, accumulate sum of squares in FP32
&lt;/span&gt;        &lt;span class="c1"&gt;# rinv = rsqrt(ss / C + 1e-5)
&lt;/span&gt;        &lt;span class="c1"&gt;# pass 2: re-load X, y = silu(x * gamma * rinv), write back
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;BLOCK_C&lt;/code&gt; is 128/64/32 depending on &lt;code&gt;C&lt;/code&gt;, to respect the TMA &lt;code&gt;boxDim ≤ 256&lt;/code&gt; limit, and the FP32 accumulator keeps the sum of squares from overflowing in FP16. Dispatch keeps the hand-tuned path where it works and only falls back when it has to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_ATLAS_SUPPORTED_D&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;96&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;192&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;384&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rms_silu_dispatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gamma&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;_ATLAS_SUPPORTED_D&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;atlas_rms_norm_silu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gamma&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;        &lt;span class="c1"&gt;# keep the hand-tuned path
&lt;/span&gt;    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;tilelang_rms_silu_bthwc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gamma&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c1"&gt;# cover the gap
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What it gained us.&lt;/strong&gt; All upside, and it's a true drop-in — same interface, same math, same &lt;code&gt;eps&lt;/code&gt;, so it slots in behind the existing dispatch with no call-site changes.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What&lt;/th&gt;
&lt;th&gt;Gain&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Previously-unsupported config&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;0 → 1&lt;/strong&gt; — it runs now (the headline win)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Attention-block RMSNorm vs the eager PyTorch norm it replaced&lt;/td&gt;
&lt;td&gt;42 μs → &lt;strong&gt;20 μs (~2× faster)&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;End-to-end VAE at production resolution (720×1280, 21 frames)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~1.79× encode, ~1.78× decode&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The first row is the real point: TileLang let us serve a model config that previously had no fast path at all, without touching the hand-tuned kernel that already works for the other config. One drop-in, written in Python, and a whole model path went from "throws" to "ships."&lt;/p&gt;

&lt;h2&gt;
  
  
  Where TileLang shines
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Fine control over blocking, pipeline stages, and warp partitioning, without writing CUTLASS/CuTe.&lt;/li&gt;
&lt;li&gt;Structurally complex, layout-sensitive kernels: GEMM variants, the FlashAttention family, MLA, linear attention, dequant-fused quant GEMM, MoE routing.&lt;/li&gt;
&lt;li&gt;Covering an op or config your hand-tuned kernels don't reach (a hidden dim outside the instantiated set, an unusual layout) — and beating an eager fallback while you're there.&lt;/li&gt;
&lt;li&gt;One kernel body across backends (NVIDIA / AMD / vendor forks).&lt;/li&gt;
&lt;li&gt;The whole optimization toolkit is one call at a time — &lt;code&gt;T.use_swizzle&lt;/code&gt;, &lt;code&gt;T.annotate_layout&lt;/code&gt;, &lt;code&gt;T.Pipelined&lt;/code&gt;, warp specialization, split-KV — with the lowering handled for you.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;The cool part of TileLang is that the hard reasoning stays in your head, not in boilerplate. You decide how to split work across warps, where buffers live, and how deep the pipeline runs — and then layout inference and warp specialization turn that into the register layouts, the swizzles, and the producer/consumer dance that would otherwise be hundreds of lines of CuTe. You pick a policy and write the math. That's the whole pitch, and it's why an 80-line MLA kernel can sit next to a hand-tuned CUTLASS one.&lt;/p&gt;

</description>
      <category>deeplearning</category>
      <category>llm</category>
      <category>performance</category>
      <category>python</category>
    </item>
    <item>
      <title>Long video generation blog: How We Shipped SVI in Production</title>
      <dc:creator>Atlas Cloud</dc:creator>
      <pubDate>Thu, 07 May 2026 09:38:34 +0000</pubDate>
      <link>https://dev.to/atlas_cloud_ai/long-video-generation-blog-how-we-shipped-svi-in-production-5bln</link>
      <guid>https://dev.to/atlas_cloud_ai/long-video-generation-blog-how-we-shipped-svi-in-production-5bln</guid>
      <description>&lt;p&gt;In &lt;a href="https://www.atlascloud.ai/blog/guides/long-video-generation-blog-1" rel="noopener noreferrer"&gt;Part 1&lt;/a&gt;, we surveyed six approaches to long video generation — TTT, LoL, Self Forcing, Self Forcing++, Infinite Talk, and Helios — and landed on SVI as the only path that ships today without retraining a 14B model. This post is about what building with it actually looked like: how the clip-stitching loop works, why Error-Recycling matters, and the production numbers from our first deployment on TurboWan.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The choice: SVI (Stable Video Infinity)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;SVI's core philosophy is to turn infinite-length generation into stitching together a finite number of short clips with carefully designed memory transfer. That sounds modest until you realize it cleans up most of the engineering pain points at once: no base-model retraining (a small LoRA mounted on TurboWan), constant VRAM, composable with existing speed-distillation, and official LoRA weights are public.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F34g6h65rbmwb1jm0nkif.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F34g6h65rbmwb1jm0nkif.png" alt="image8.png" width="800" height="192"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;SVI's mental model. (a) Standard video generative models have a Train-Test Hypothesis Gap — they train on clean inputs but face noisy, error-accumulated inputs at inference. (b) Image restoration models are robust to errors but cannot generate new content. (c) SVI's Error-Recycling Fine-Tuning bridges both — using self-generated errors as supervisory signals so the model actively learns to identify and correct its own generation errors.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How clip stitching works&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Each clip is 81 frames (5s @ 16fps). Generation is just a loop: condition the next clip on a global identity anchor and a short-term motion bridge from the previous clip, then concatenate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clip 1. &lt;/strong&gt;Inputs: ref image + empty motion memory. Output: a 5s clip. Extract motion memory: the latent of the last 4 frames. &lt;strong&gt;Clip 2. &lt;/strong&gt;Inputs: ref image + motion memory from clip 1. Output: a 5s clip. Extract motion memory from its tail. &lt;strong&gt;... &lt;/strong&gt;Repeat for N clips, then concatenate clip 1 + clip 2 + … + clip N into the long video.&lt;/p&gt;

&lt;p&gt;The clean part is that no DiT attention modification is needed. Historical context is concatenated at the input level as latents, and a small LoRA teaches the model to actually use that prefix.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anchor latent. &lt;/strong&gt;User-provided reference image, encoded by the VAE → keeps subject / character appearance globally consistent. &lt;strong&gt;Motion latent. &lt;/strong&gt;Latent of the last 4 / 8 / 12 frames of the previous clip → tells the model how the last segment ended. &lt;strong&gt;Padding. &lt;/strong&gt;Aligns the input shape so the DiT sees one tidy concatenated sequence: anchor + motion + padding.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Error-Recycling Fine-Tuning&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The detail that makes SVI hold up over many clips is how its LoRA is trained. Standard inference always starts denoising from pure Gaussian noise — but in long-video stitching, errors from earlier clips contaminate the conditioning for later clips. If you only ever train on clean reference inputs, you have baked in the train-inference gap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Standard training: &lt;/strong&gt;every clip's reference inputs are clean ground truth → the model never sees the kind of noisy historical context it actually faces at inference, and discontinuities accumulate. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Error-Recycling: &lt;/strong&gt;during training, deliberately inject the model's own past errors into the reference inputs, so the LoRA explicitly learns to operate on noisy historical context. Visual discontinuities at clip boundaries drop sharply.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn5jdyi2d66vi3dqr1s74.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn5jdyi2d66vi3dqr1s74.png" alt="image9.png" width="666" height="797"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;SVI identifies two core error types. (a) Error-Free Flow Matching is the training-time trajectory. (b) Single-Clip Predictive Error — the per-clip drift between the denoising path and the ideal trajectory. (c) Cross-Clip Conditional Error — error-contaminated reference images cause cascading drift across clips. Error-Recycling explicitly injects both.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5tgsd2vcpti8fto9hpjx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5tgsd2vcpti8fto9hpjx.png" alt="image10.png" width="800" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;SVI training framework. (a) Inject DiT's self-generated errors into the latent space to break the error-free assumption. (b) Efficiently compute bidirectional errors via one-step forward / backward integration. (c) Store errors in a Replay Memory and dynamically resample for reuse, forming a closed-loop error supervision cycle.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;SVI separates two error types. &lt;em&gt;Single-clip Predictive Error&lt;/em&gt; is the per-clip drift between the denoising path and the ideal trajectory. &lt;em&gt;Cross-clip Conditional Error&lt;/em&gt; is the cascading drift caused when error-contaminated reference images flow into the next clip. Error-Recycling injects both, so the LoRA learns explicit error tolerance.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;LoRA variants&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;SVI ships three variants — &lt;em&gt;SVI-Shot&lt;/em&gt; for static-image → short-clip, &lt;em&gt;SVI-Dance&lt;/em&gt; for human motion (it can also take a pose-sequence input), and &lt;em&gt;SVI-Film&lt;/em&gt; for multi-shot / scene-transition long video. Hyperparameters: 81 frames per clip, num_motion_frames ∈ {4, 8, 12}, LoRA rank typically 16–64.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Stacking on TurboWan&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We mount SVI's LoRA on top of TurboWan (an speedup version of Wan optimized by Atlas), and we keep our specialized LoRA in the stack for style control. At inference, multiple LoRA weights are superimposed at once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Base. &lt;/strong&gt;TurboWan &lt;strong&gt;LoRA 1.&lt;/strong&gt; specialized LoRA — content / style control. &lt;strong&gt;LoRA 2. &lt;/strong&gt;SVI LoRA — long-video consistency. &lt;strong&gt;Combined. &lt;/strong&gt;TurboWan speed + SVI long-video continuity + Spicy style, all in one inference pass.&lt;/p&gt;

&lt;p&gt;The full inference flow is straightforward: encode the reference into an anchor latent, concatenate it with the previous clip's motion latent and padding, run TurboWan's denoise, decode, append, and update the motion latent from the tail of the freshly-generated clip. After N iterations, concatenate everything into one video.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. &lt;/strong&gt;Encode ref image → anchor latent. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. &lt;/strong&gt;y = concat(anchor latent, motion latent, padding). &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. &lt;/strong&gt;Run TurboWan's 5-step denoise conditioned on y and the text embedding. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. &lt;/strong&gt;VAE-decode the clip and append to the output list. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. &lt;/strong&gt;Set motion latent = tail (last num_motion_frames) of the just-generated clip. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. &lt;/strong&gt;Repeat for N clips, then concatenate all of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Some production numbers&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Standard test: a single reference image and 3 prompts, generating ~15s output (3 clips × 5s):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Metric&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Value&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Generated duration&lt;/td&gt;
&lt;td&gt;15s (3 clips)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-clip inference time&lt;/td&gt;
&lt;td&gt;~14s (TurboWan fp8, single GPU)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total inference time&lt;/td&gt;
&lt;td&gt;~42s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Subject consistency&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;A worked example: Cat Adventure&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To make the cross-clip behavior concrete, we ran a 15-second case with one reference and three shots. The style prompt fixed a Pixar look with warm lighting; the character was an orange tabby kitten with big curious eyes; the three shots took it from windowsill, to sidewalk, to meeting a golden retriever, each with its own camera direction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkqyiryc4d0nxuy84w4rg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkqyiryc4d0nxuy84w4rg.png" alt="image11.png" width="400" height="230"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Clip 1 (0–5s): the orange Pixar kitten on a windowsill, with the camera slowly pulling back from a close-up. Style and character stay stable across frames.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6laagm4ugy45w1y9ix4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6laagm4ugy45w1y9ix4.png" alt="image12.png" width="400" height="230"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Clip 2 (5–10s) at the transition boundary: the kitten's appearance matches Clip 1, then turns and shifts posture as it jumps down. The motion latent has carried the motion state across the boundary.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb8za6fd3czd5dilwbuie.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb8za6fd3czd5dilwbuie.png" alt="image13.png" width="400" height="230"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Clip 3 (10–15s): a golden retriever is introduced and the scene transitions toward an indoor / outdoor boundary. The kitten's Pixar style remains stable across all three clips.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Aggregate metrics for the run:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Metric&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Value&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total duration&lt;/td&gt;
&lt;td&gt;15s (3 clips × 5s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total frames&lt;/td&gt;
&lt;td&gt;240 frames (16fps)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total inference time&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;33s (TurboWan, single GPU)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time-to-video ratio&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.2 s/s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Subject consistency&lt;/td&gt;
&lt;td&gt;Pixar orange kitten stable throughout&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Clip boundary discontinuity&lt;/td&gt;
&lt;td&gt;No obvious jump cuts&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That is a 15-second long video in 33 seconds on a single GPU, with cross-clip subject consistency — well within the ≤ 60s wait we set as our target. On a 14-case internal test set, 9 cases came back with no obvious issues (64% pass rate).&lt;/p&gt;

&lt;p&gt;The honest closing observation is that in video generation, speed, length, and quality are three corners of an iron triangle. No single approach today leads on all three at once. The interesting work is in choosing which corner you can give up the least, given today's hardware and your training budget. SVI gives up a little length and a little boundary quality — and in exchange we ship long video with Wan2.2-class fidelity, on one GPU, today.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>svi</category>
      <category>videogeneration</category>
    </item>
    <item>
      <title>Long video generation blog: Six Approaches, One Decision</title>
      <dc:creator>Atlas Cloud</dc:creator>
      <pubDate>Thu, 07 May 2026 09:35:29 +0000</pubDate>
      <link>https://dev.to/atlas_cloud_ai/long-video-generation-blog-six-approaches-one-decision-8l4</link>
      <guid>https://dev.to/atlas_cloud_ai/long-video-generation-blog-six-approaches-one-decision-8l4</guid>
      <description>&lt;p&gt;A few months ago we set ourselves a deceptively simple goal: produce coherent, high-quality video longer than 15 seconds, on a single GPU, in well under a minute of wall-clock time. Today's video diffusion models like Wan2.2 are good at 3–5 second clips. Stretching that to 10s, 30s, or a minute is where things get interesting.&lt;/p&gt;

&lt;p&gt;This post documents the route we actually took. We surveyed six approaches that show up in recent papers and tech reports — TTT, LoL, Self Forcing, Self Forcing++, Infinite Talk, and Helios — measured the trade-offs, and ultimately landed on SVI (Stable Video Infinity), wired up next to TurboWan in our DiffSynth Engine. We will go over each of those routes, then how SVI works, then the production numbers.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why long video is hard&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Three things break when you push past about five seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The VRAM wall&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Wan2.2 uses Full Attention with O(n²) cost in the number of latent tokens. The math is unforgiving:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5s (81 frames): &lt;/strong&gt;~32.7k tokens, attention matrix ~10 GB. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;10s (165 frames): &lt;/strong&gt;~65.5k tokens, attention matrix ~40 GB — already spills off a single GPU. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;30s (~500 frames): &lt;/strong&gt;~200k tokens, infeasible.&lt;/p&gt;

&lt;p&gt;In practice, Self Forcing alone fills most of an H200's 129 GB at 165 frames just for the KV cache.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Temporal drift&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Even when memory is fine, three drift modes show up. The Helios paper named them: &lt;em&gt;position shift&lt;/em&gt; (subjects wandering across the frame), &lt;em&gt;color shift&lt;/em&gt; (gradual hue and brightness drift), and &lt;em&gt;restoration shift&lt;/em&gt; (the model overcorrecting and producing visible discontinuities).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Causal consistency&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Standard video diffusion uses bidirectional Full Attention — every frame attends to every other. That means no streaming output: you cannot show frame 1 until frame N is done.&lt;/p&gt;

&lt;p&gt;Our concrete target was modest: ≥15 second video, smooth visual continuity, stable subjects across the whole clip, total wait under 60 seconds, minimal training, and a strong preference for reusing weights we already have.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The survey&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We looked at six families. The names are mostly paper titles; the categories will matter later.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Route 1 · TTT (Test-Time Training)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Paper: One-Minute Video Generation with Test-Time Training (arXiv 2504.05298, Apr 2025).&lt;/p&gt;

&lt;p&gt;The idea is to fine-tune the model during inference so it remembers what it has already generated. A small TTT layer (a 2-layer MLP, plus a gate and a local attention) gets inserted after Attention in every Transformer Block, and the model is trained on a curriculum that pushes from short clips out to a full minute.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-block insertion: &lt;/strong&gt;after the standard attention, splice in a Gate, a TTT Layer, and a Local Attention, then a LayerNorm. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Curriculum: &lt;/strong&gt;train on progressively longer windows — 3s → 9s → 18s → 30s → 60s. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost: &lt;/strong&gt;256 H100s for ~50 hours.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhdlo0wipajt4uz1rf7ef.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhdlo0wipajt4uz1rf7ef.png" alt="image1.png" width="800" height="229"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;TTT — left: insertion point (Gate + TTT Layer + Local Attention + LayerNorm, attached after standard Attention via residual). Right: video segmented into 3-second clips, each handled by Local Attention internally, with the TTT Layer carrying global memory across segments.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It works — the paper reaches 1-minute generation. But the training cost is enormous, the experiments only cover CogVideoX 5B (transfer to Wan2.2 14B is unproven), and the inserted TTT layers conflict with the kernel optimizations we already rely on. Verdict: not selected.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Route 2 · LoL (Longer than Longer)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Paper: LoL: Longer than Longer, Scaling Video Generation to Hour (arXiv 2601.16914, Jan 2026).&lt;/p&gt;

&lt;p&gt;LoL targets a specific failure mode in autoregressive long video — &lt;em&gt;sink-collapse&lt;/em&gt;, where multi-head attention all converges onto the anchor frame and the video periodically reverts to its initial state. The fix is &lt;em&gt;Multi-Head RoPE Jitter&lt;/em&gt;: per-head random phase perturbations that break inter-head homogeneity. Training-free, plug-in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure mode: &lt;/strong&gt;sink-collapse — under autoregressive RoPE, distant frames' positional phases periodically realign with the anchor, attention concentrates, content snaps back to the anchor frame. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix: &lt;/strong&gt;give each attention head its own small random phase shift. Heads can no longer collapse to the same column. No retraining required, drops into existing models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Favud5wbhh8monvrd1zrx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Favud5wbhh8monvrd1zrx.png" alt="image2.png" width="800" height="222"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;L2 distance to anchor vs. frame index. Self-Forcing++ (red) and LongLive (blue), both with sink, repeatedly snap back at specific frame positions — those are sink-collapse events where the video reverts to the anchor. LoL's Phase Alignment (green) eliminates the snap-back.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzikkvkchlo1z6w8sbozk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzikkvkchlo1z6w8sbozk.png" alt="image3.png" width="800" height="307"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Per-head attention maps. Top row: normal frames — heads have visibly different patterns. Bottom rows: during sink-collapse — every head looks the same, all collapsed onto the anchor frame's column. RoPE Jitter restores per-head diversity.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;LoL hits 12-hour video on CogVideoX/HunyuanVideo with little quality loss. The catch is that all the demos are static-ish scenes; we don't know how it survives dance, sports, or anything with strong motion. Plus we'd need to modify Wan2.2's attention. Verdict: adaptation cost is too high for unproven gains on motion content. Not selected.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Route 3 · Self Forcing (Causal Wan2.2)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Paper: Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion (arXiv 2506.08009, NeurIPS 2025 Spotlight).&lt;/p&gt;

&lt;p&gt;Self Forcing replaces Wan2.2's bidirectional Full Attention with &lt;em&gt;causal&lt;/em&gt; attention: a frame only attends to frames before it. That single change unlocks streaming generation — once chunk 1 is done, decode and ship it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bidirectional: &lt;/strong&gt;every frame attends to every other → must finish all 40 denoise steps before any frame can be shown. &lt;strong&gt;Causal: &lt;/strong&gt;a frame only sees its past → the first chunk can stream the moment it is done.&lt;/p&gt;

&lt;p&gt;The training trick is what gives the paper its name. Instead of training on clean ground-truth context (Teacher Forcing) or with custom attention masks (Diffusion Forcing), Self Forcing rolls out the actual inference path with a rolling KV cache, so train and inference distributions match.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generation loop: &lt;/strong&gt;denoise the next small chunk of frames using DMD's compressed step schedule, conditioned on a rolling KV cache built from already-generated frames. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stream: &lt;/strong&gt;as soon as a chunk finishes, VAE-decode and emit it. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Carry-over: &lt;/strong&gt;push the new chunk's latents into the KV cache for the next chunk to attend to.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkfqyarjefvo919iefuds.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkfqyarjefvo919iefuds.png" alt="image4.png" width="800" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Three training paradigms compared: (a) Teacher Forcing trains on clean frames — at inference, noisy frames cause out-of-distribution drift; (b) Diffusion Forcing uses custom attention masks but still has train-inference mismatch; (c) Self Forcing replays the true inference process using a rolling KV cache, fully aligning training and inference.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We measured it on the FastVideo framework, single H200:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Length&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Frames&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Time&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;VRAM&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;5s&lt;/td&gt;
&lt;td&gt;81 frames&lt;/td&gt;
&lt;td&gt;70s&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10s&lt;/td&gt;
&lt;td&gt;165 frames&lt;/td&gt;
&lt;td&gt;168s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;129 GB (near capacity)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20s&lt;/td&gt;
&lt;td&gt;321 frames&lt;/td&gt;
&lt;td&gt;287s&lt;/td&gt;
&lt;td&gt;129 GB (KV cache capped at 42 frames)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is architecturally the cleanest answer, and we genuinely like it. But 10s already saturates an H200's VRAM, quality drops at 165 frames, the original model needs causal-attention fine-tuning, and true streaming also needs a Causal Conv3D in the VAE. &lt;/p&gt;

&lt;p&gt;Verdict: wait for the community to chip away at VRAM and quality. Not adopted for now.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Route 4 · Self Forcing++&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Paper: Self-Forcing++: Towards Minute-Scale High-Quality Video Generation (arXiv 2510.02283, Oct 2025).&lt;/p&gt;

&lt;p&gt;Builds on Self Forcing with three additions: &lt;em&gt;Backward Noise Initialization&lt;/em&gt; (each new chunk starts from noise back-integrated from already-generated frames, removing chunk-boundary discontinuities); &lt;em&gt;Extended DMD alignment&lt;/em&gt; (slice 5s windows from a long rollout and align them against a teacher's short-window output); and a &lt;em&gt;GRPO&lt;/em&gt; stage with optical-flow reward to push for more dynamic motion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1. &lt;/strong&gt;Self-rollout the student for far longer than 5 seconds, accumulating a long draft using a rolling KV cache. &lt;strong&gt;Step 2. &lt;/strong&gt;Slice random 5s windows out of that draft, run them through Extended DMD against the teacher's short-window distribution to align. &lt;strong&gt;Step 3. &lt;/strong&gt;Refine with GRPO using optical-flow magnitude as reward, nudging the model toward more dynamic motion. &lt;strong&gt;Trick. &lt;/strong&gt;Each new chunk starts from noise back-integrated from the previous chunk, not from fresh Gaussian — so chunk boundaries no longer pop.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxzt9goa52nn49svm9dyo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxzt9goa52nn49svm9dyo.png" alt="image5.png" width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Left to right: CausVid (fixed training duration, train-inference mismatch) → Self Forcing (fixed duration + DMD alignment) → Self-Forcing++ (extended duration + Backward Noise Initialization + Extended DMD alignment). Bottom rows show training-stage and inference-stage correspondence.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Result: minute-scale video (up to about 4m15s) on a 1.3B Wan2.1. Great paper. For production we hit two walls: content is mostly static (low motion), the base model is 1.3B (a long way below Wan2.2 14B), and there is no released code or weights to bootstrap from. Verdict: not selected for now.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Route 5 · Infinite Talk (A2V)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A different shape of problem entirely — &lt;em&gt;Audio-to-Video&lt;/em&gt;, where audio drives continuous talking-head generation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-chunk input bundle: &lt;/strong&gt;the new chunk's noisy latents, the audio features for that time window, the user-provided reference image, the last frame of the previous chunk, and a soft conditioning weight. &lt;strong&gt;Reference identity: &lt;/strong&gt;the reference image keeps long-term appearance stable. &lt;strong&gt;Adaptive constraint: &lt;/strong&gt;the soft weight tightens or relaxes the reference based on similarity drift. &lt;strong&gt;Motion bridge: &lt;/strong&gt;the previous chunk's last frame carries motion across boundaries.&lt;/p&gt;

&lt;p&gt;It is good for what it is — talking heads, indefinitely. But the architecture differs enough from Wan2.2 that it requires dedicated training, and it does not generalize to general scenes. Verdict: valuable in a narrow lane, not a general long-video solution.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Route 6 · Helios&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Paper: Helios: Real Real-Time Long Video Generation Model (PKU-YuanGroup, arXiv 2603.04379, Mar 2026).&lt;/p&gt;

&lt;p&gt;As of writing, Helios is the SOTA for long video — 14B params, 19.5 FPS real-time on a single H100. The trick is to compress historical frames into a three-level pyramid and inject them into the current frame's denoising, so the token budget stays constant no matter how long the video gets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-Term Memory. &lt;/strong&gt;Short-term history (last 3 frames) keeps full resolution; mid-term (last 20 frames) gets moderate compression; long-term (everything earlier) gets heavy compression. Total token budget is constant regardless of video length. &lt;strong&gt;Guidance Attention. &lt;/strong&gt;Inside each DiT block, clean historical KVs and noisy current QKVs are processed separately so historical noise cannot contaminate current denoising. &lt;strong&gt;Pyramid Sampling. &lt;/strong&gt;Sample at low resolution first to define structure, then refine to high resolution to add detail — about 2.3× fewer tokens overall.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxm464qp1zg2mehwzi3a5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxm464qp1zg2mehwzi3a5.png" alt="image6.png" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Helios architecture. Left: Unified History Injection — short / mid / long-term history compressed at different ratios, concatenated with the current frame before entering the DiT. Right: Pyramid Unified Predictor-Corrector — low token count first to define structure, then high token count to refine details, reducing computation by ~2.3×.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feezjjsxe8o7jugai7foi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feezjjsxe8o7jugai7foi.png" alt="image7.png" width="800" height="515"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The Helios paper systematically defines and visualizes three categories of drift in long-video generation: (a) position shift, (b) color shift, (c) restoration shift (noise), (d) restoration shift (blur). Guidance Attention is specifically designed to address all three.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Helios's measured throughput on H200 is striking — basically flat with length:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Length&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Time&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Throughput&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;240 frames (10s)&lt;/td&gt;
&lt;td&gt;24s&lt;/td&gt;
&lt;td&gt;~10 FPS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;480 frames (20s)&lt;/td&gt;
&lt;td&gt;42s&lt;/td&gt;
&lt;td&gt;~11.4 FPS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;960 frames (40s)&lt;/td&gt;
&lt;td&gt;82s&lt;/td&gt;
&lt;td&gt;~11.7 FPS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;H100 single GPU (Helios-Distilled)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;19.5 FPS&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The catch is that Multi-Term Memory Patchification needs full retraining of a 14B model. There are no released weights — only a tech report — so we cannot just bolt on a LoRA. Verdict: a medium-to-long-term direction; not deployable today.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Route Comparison Summary&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;All six routes side by side, with SVI added as the row we ultimately committed to:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Approach&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Max Duration&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Quality&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Training Required&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Engineering Difficulty&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Generality&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Rec.&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TTT&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1 minute&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Heavy training needed&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;★★☆&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LoL&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hour-scale&lt;/td&gt;
&lt;td&gt;Medium (static only)&lt;/td&gt;
&lt;td&gt;Training needed&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;★★☆&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Self Forcing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Theoretically unlimited&lt;/td&gt;
&lt;td&gt;Medium (drops &amp;gt; 10s)&lt;/td&gt;
&lt;td&gt;Existing model&lt;/td&gt;
&lt;td&gt;High (VRAM issues)&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;★★★&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Self Forcing++&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Minute-scale&lt;/td&gt;
&lt;td&gt;Low (mostly static)&lt;/td&gt;
&lt;td&gt;Training needed&lt;/td&gt;
&lt;td&gt;Very high (no code)&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;★☆☆&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Infinite Talk&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Unlimited&lt;/td&gt;
&lt;td&gt;High (talking head)&lt;/td&gt;
&lt;td&gt;Training needed&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Low (A2V only)&lt;/td&gt;
&lt;td&gt;★★☆&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Helios&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Theoretically unlimited&lt;/td&gt;
&lt;td&gt;High (industry SOTA)&lt;/td&gt;
&lt;td&gt;Full retraining&lt;/td&gt;
&lt;td&gt;Very high (no weights)&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;★★★☆&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SVI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Unlimited&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Medium-High&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Open-source LoRA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Medium&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;High&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;★★★★&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;A taxonomy that fell out of the survey&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you squint, every approach we surveyed falls into one of three buckets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Type A — extend the attention range itself &lt;/strong&gt;(Self Forcing, LoL, TTT). Have the model directly process longer sequences. Highest theoretical quality. VRAM grows quadratically, so engineering hits a wall around 10s today.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Type B — hierarchical history compression &lt;/strong&gt;(Helios). Compress past frames and inject them as conditioning. Bypasses VRAM. Costs a full retraining of a 14B model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Type C — stateful rolling generation &lt;/strong&gt;(SVI, Infinite Talk). Decompose long video into short clips with overlapping state. Constant VRAM, unlimited length, LoRA-only training. The trade is possible discontinuities at clip boundaries and unbounded long-term drift you can manage but not eliminate.&lt;/p&gt;

&lt;p&gt;For this quarter, Type C is what we ship. For next year, Type B is where we are watching the literature.&lt;/p&gt;




&lt;p&gt;In the next post, we go into what shipping actually looked like — six approaches to ≥15-second video generation, why we picked SVI, and what the production numbers look like. &lt;a href="https://www.atlascloud.ai/blog/guides/long-video-generation-blog-2" rel="noopener noreferrer"&gt;Read Part 2 →&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>mlengineering</category>
      <category>diffusion</category>
    </item>
  </channel>
</rss>
