Long Video Generation: Six Approaches, One Decision

A few months ago we set ourselves a deceptively simple goal: produce coherent, high-quality video longer than 15 seconds, on a single GPU, in well under a minute of wall-clock time. Today's video diffusion models like Wan2.2 are good at 3–5 second clips. Stretching that to 10s, 30s, or a minute is where things get interesting.

This post documents the route we actually took. We surveyed six approaches that show up in recent papers and tech reports — TTT, LoL, Self Forcing, Self Forcing++, Infinite Talk, and Helios — measured the trade-offs, and ultimately landed on SVI (Stable Video Infinity), wired up next to TurboWan in our DiffSynth Engine. This part covers the survey: each of those routes, where they break, and why we picked SVI; Part 2 covers how SVI works and the production numbers.

Why long video is hard

Three things break when you push past about five seconds.

The VRAM wall

Wan2.2 uses Full Attention with O(n²) cost in the number of latent tokens. The math is unforgiving:

5s (81 frames): ~32.7k tokens, attention matrix ~10 GB. 

10s (165 frames): ~65.5k tokens, attention matrix ~40 GB — already spills off a single GPU. 

30s (~500 frames): ~200k tokens, infeasible.

In practice, Self Forcing alone pushes an H200 (141 GB) to roughly 129 GB at 165 frames, with the rolling KV cache accounting for most of that growth.
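To sanity-check those token counts, here is a back-of-envelope sketch. The latent geometry (832×480 frames, 8× spatial and 4× temporal VAE compression, 2×2 patchification) is our assumption for reproducing the numbers above, not something quoted from the Wan2.2 report.

```python
# Back-of-envelope token math for full attention over video latents.
# Assumed latent geometry (our assumption, not from the post): 832x480 frames,
# 8x spatial VAE compression, 2x2 patchification -> 52 * 30 = 1560 tokens per
# latent frame, and 4x temporal compression -> (frames - 1) / 4 + 1 latent frames.

TOKENS_PER_LATENT_FRAME = (832 // 8 // 2) * (480 // 8 // 2)  # 1560

def latent_tokens(num_frames: int) -> int:
    latent_frames = (num_frames - 1) // 4 + 1
    return latent_frames * TOKENS_PER_LATENT_FRAME

for label, frames in [("5s", 81), ("10s", 165), ("30s", 497)]:
    n = latent_tokens(frames)
    # One full attention matrix holds n^2 scores, so memory grows quadratically.
    print(f"{label}: {frames} frames -> {n / 1e3:.1f}k tokens, "
          f"{n ** 2 / 1e9:.1f}B attention scores")
```

Going from 5s to 10s roughly quadruples the number of attention scores; going from 5s to 30s multiplies it by about 35. That quadratic term is what dominates everything else.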

Temporal drift

Even when memory is fine, three drift modes show up. The Helios paper named them: position shift (subjects wandering across the frame), color shift (gradual hue and brightness drift), and restoration shift (the model overcorrecting and producing visible discontinuities).

Causal consistency

Standard video diffusion uses bidirectional Full Attention — every frame attends to every other. That means no streaming output: you cannot show frame 1 until frame N is done.

Our concrete target was modest: ≥15 second video, smooth visual continuity, stable subjects across the whole clip, total wait under 60 seconds, minimal training, and a strong preference for reusing weights we already have.

The survey

We looked at six families. The names are mostly paper titles; the categories will matter later.

Route 1 · TTT (Test-Time Training)

Paper: One-Minute Video Generation with Test-Time Training (arXiv 2504.05298, Apr 2025).

The idea is to fine-tune the model during inference so it remembers what it has already generated. A small TTT layer (a 2-layer MLP, plus a gate and a local attention) gets inserted after Attention in every Transformer Block, and the model is trained on a curriculum that pushes from short clips out to a full minute.

Per-block insertion: after the standard attention, splice in a Gate, a TTT Layer, and a Local Attention, then a LayerNorm. 

Curriculum: train on progressively longer windows — 3s → 9s → 18s → 30s → 60s. 

Cost: 256 H100s for ~50 hours.

Figure (image1.png): TTT — left: insertion point (Gate + TTT Layer + Local Attention + LayerNorm, attached after standard Attention via residual). Right: video segmented into 3-second clips, each handled by Local Attention internally, with the TTT Layer carrying global memory across segments.
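For intuition, here is a minimal PyTorch sketch of that insertion. The module sizes, the gating form, and the exact ordering are guesses based on the description above rather than the paper's code, and the defining trick (updating the TTT MLP's weights with gradient steps during inference, segment by segment) is omitted entirely.

```python
import torch
import torch.nn as nn

class TTTBranch(nn.Module):
    """Illustrative sketch (not the paper's implementation): a gated branch
    spliced in after a Transformer block's standard attention."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # 2-layer MLP; in the actual method its weights are fine-tuned at
        # test time, segment by segment, to carry long-range memory.
        self.ttt = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.gate = nn.Parameter(torch.zeros(dim))  # starts closed, learned open

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim), the output of the standard attention.
        h = self.ttt(x)                  # global-memory path
        h, _ = self.local_attn(h, h, h)  # short-range detail within a 3s segment
        h = self.norm(h)
        return x + torch.tanh(self.gate) * h  # gated residual back onto the trunk
```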

It works — the paper reaches 1-minute generation. But the training cost is enormous, the experiments only cover CogVideoX 5B (transfer to Wan2.2 14B is unproven), and the inserted TTT layers conflict with the kernel optimizations we already rely on. Verdict: not selected.

Route 2 · LoL (Longer than Longer)

Paper: LoL: Longer than Longer, Scaling Video Generation to Hour (arXiv 2601.16914, Jan 2026).

LoL targets a specific failure mode in autoregressive long video — sink-collapse, where multi-head attention all converges onto the anchor frame and the video periodically reverts to its initial state. The fix is Multi-Head RoPE Jitter: per-head random phase perturbations that break inter-head homogeneity. Training-free, plug-in.

Failure mode: sink-collapse — under autoregressive RoPE, distant frames' positional phases periodically realign with the anchor, attention concentrates, content snaps back to the anchor frame. 

Fix: give each attention head its own small random phase shift. Heads can no longer collapse to the same column. No retraining required, drops into existing models.
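Mechanically the change is small: each head gets its own fixed random phase offset on the rotary angles. Below is a minimal sketch; the jitter scale and the exact place it is injected are our assumptions, not LoL's released code.

```python
import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE angles for one axis: returns (num_positions, dim // 2)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return torch.outer(positions.float(), inv_freq)

def jittered_rope_angles(positions: torch.Tensor, dim: int, num_heads: int,
                         jitter_scale: float = 0.02, seed: int = 0) -> torch.Tensor:
    """Per-head phase perturbation: each head gets its own fixed random offset,
    so all heads can no longer realign on the anchor frame's column at once."""
    angles = rope_angles(positions, dim)                        # (T, dim/2)
    g = torch.Generator().manual_seed(seed)                     # fixed per run, not per step
    jitter = jitter_scale * torch.randn(num_heads, 1, dim // 2, generator=g)
    return angles.unsqueeze(0) + jitter                         # (heads, T, dim/2)

# e.g. temporal angles for 42 latent frames, 16 heads, 64-dim rotary sub-space
angles = jittered_rope_angles(torch.arange(42), dim=64, num_heads=16)
```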

Figure (image2.png): L2 distance to anchor vs. frame index. Self-Forcing++ (red) and LongLive (blue), both of which exhibit the attention sink, repeatedly snap back at specific frame positions — those are sink-collapse events where the video reverts to the anchor. LoL's Phase Alignment (green) eliminates the snap-back.

Figure (image3.png): Per-head attention maps. Top row: normal frames — heads have visibly different patterns. Bottom rows: during sink-collapse — every head looks the same, all collapsed onto the anchor frame's column. RoPE Jitter restores per-head diversity.

LoL hits 12-hour video on CogVideoX/HunyuanVideo with little quality loss. The catch is that all the demos are static-ish scenes; we don't know how it survives dance, sports, or anything with strong motion. Plus we'd need to modify Wan2.2's attention. Verdict: adaptation cost is too high for unproven gains on motion content. Not selected.

Route 3 · Self Forcing (Causal Wan2.2)

Paper: Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion (arXiv 2506.08009, NeurIPS 2025 Spotlight).

Self Forcing replaces Wan2.2's bidirectional Full Attention with causal attention: a frame only attends to frames before it. That single change unlocks streaming generation — once chunk 1 is done, decode and ship it.

Bidirectional: every frame attends to every other → must finish all 40 denoise steps before any frame can be shown. Causal: a frame only sees its past → the first chunk can stream the moment it is done.

The training trick is what gives the paper its name. Instead of training on clean ground-truth context (Teacher Forcing) or with custom attention masks (Diffusion Forcing), Self Forcing rolls out the actual inference path with a rolling KV cache, so train and inference distributions match.

Generation loop: denoise the next small chunk of frames using DMD's compressed step schedule, conditioned on a rolling KV cache built from already-generated frames. 

Stream: as soon as a chunk finishes, VAE-decode and emit it. 

Carry-over: push the new chunk's latents into the KV cache for the next chunk to attend to.
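Put together, the generation loop looks roughly like this sketch. `sample_noise`, `denoise_chunk`, and `vae_decode` are hypothetical placeholders rather than FastVideo's actual API; what matters is the shape of the loop: denoise a chunk against the cached past, stream it, then push its latents into a rolling cache.

```python
def generate_stream(model, vae, prompt, num_chunks, max_cache_frames=42):
    """Sketch of causal chunked generation with a rolling KV cache.
    All helper names are placeholders, not a real API."""
    kv_cache = []  # latent frames already generated, oldest first
    for _ in range(num_chunks):
        noise = sample_noise()                                  # fresh Gaussian for the new chunk
        # Few-step (DMD-distilled) denoising; attention only sees the cached past.
        chunk_latents = denoise_chunk(model, noise, prompt, context=kv_cache)
        yield vae_decode(vae, chunk_latents)                    # stream the chunk immediately
        kv_cache.extend(chunk_latents)                          # carry over for the next chunk
        kv_cache = kv_cache[-max_cache_frames:]                 # rolling window bounds VRAM
```

The 42-frame default mirrors the cap we ran into in the 20s measurement below.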

Figure (image4.png): Three training paradigms compared: (a) Teacher Forcing trains on clean frames — at inference, noisy frames cause out-of-distribution drift; (b) Diffusion Forcing uses custom attention masks but still has a train–inference mismatch; (c) Self Forcing replays the true inference process using a rolling KV cache, fully aligning training and inference.

We measured it on the FastVideo framework, single H200:

| Length | Frames | Time | VRAM |
| --- | --- | --- | --- |
| 5s | 81 | 70 s | — |
| 10s | 165 | 168 s | 129 GB (near capacity) |
| 20s | 321 | 287 s | 129 GB (KV cache capped at 42 frames) |

This is architecturally the cleanest answer, and we genuinely like it. But 10s already saturates an H200's VRAM, quality drops at 165 frames, the original model needs causal-attention fine-tuning, and true streaming also needs a Causal Conv3D in the VAE. 

Verdict: wait for the community to chip away at VRAM and quality. Not adopted for now.

Route 4 · Self Forcing++

Paper: Self-Forcing++: Towards Minute-Scale High-Quality Video Generation (arXiv 2510.02283, Oct 2025).

Builds on Self Forcing with three additions: Backward Noise Initialization (each new chunk starts from noise back-integrated from already-generated frames, removing chunk-boundary discontinuities); Extended DMD alignment (slice 5s windows from a long rollout and align them against a teacher's short-window output); and a GRPO stage with optical-flow reward to push for more dynamic motion.

Step 1. Self-rollout the student for far longer than 5 seconds, accumulating a long draft using a rolling KV cache.

Step 2. Slice random 5s windows out of that draft and run them through Extended DMD against the teacher's short-window distribution to align.

Step 3. Refine with GRPO using optical-flow magnitude as the reward, nudging the model toward more dynamic motion.

The trick: each new chunk starts from noise back-integrated from the previous chunk, not from fresh Gaussian noise, so chunk boundaries no longer pop.
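That boundary trick deserves a sketch. One simple way to realize noise that is back-integrated from the previous chunk is to re-noise that chunk's clean latents up to the starting noise level with the forward process; whether Self-Forcing++ does exactly this or something more elaborate, the effect is the same: the new chunk starts from noise correlated with what came before. The variance-exploding form below is our simplification.

```python
import torch

def backward_noise_init(prev_latents: torch.Tensor, sigma_start: float) -> torch.Tensor:
    """Initialize a new chunk from the previous chunk's clean latents pushed
    back to the starting noise level, instead of from i.i.d. Gaussian noise.
    Our simplified (variance-exploding) take on backward noise initialization."""
    eps = torch.randn_like(prev_latents)
    # The low-frequency content of prev_latents survives the added noise, so the
    # first frames of the new chunk stay consistent with the last frames of the
    # old one and chunk boundaries stop popping.
    return prev_latents + sigma_start * eps
```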

Figure (image5.png): Left to right: CausVid (fixed training duration, train–inference mismatch) → Self Forcing (fixed duration + DMD alignment) → Self-Forcing++ (extended duration + Backward Noise Initialization + Extended DMD alignment). Bottom rows show training-stage and inference-stage correspondence.

Result: minute-scale video (up to about 4m15s) on a 1.3B Wan2.1. Great paper. For production we hit three walls: content is mostly static (low motion), the base model is 1.3B (a long way below Wan2.2 14B), and there is no released code or weights to bootstrap from. Verdict: not selected for now.

Route 5 · Infinite Talk (A2V)

A different shape of problem entirely — Audio-to-Video, where audio drives continuous talking-head generation.

Per-chunk input bundle: the new chunk's noisy latents, the audio features for that time window, the user-provided reference image, the last frame of the previous chunk, and a soft conditioning weight.

Reference identity: the reference image keeps long-term appearance stable.

Adaptive constraint: the soft weight tightens or relaxes the reference based on similarity drift.

Motion bridge: the previous chunk's last frame carries motion across boundaries.
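As a data-structure view of that bundle (field names are ours and purely illustrative, not Infinite Talk's actual API):

```python
from dataclasses import dataclass
import torch

@dataclass
class ChunkConditioning:
    """Illustrative per-chunk input bundle for streaming A2V generation."""
    noisy_latents: torch.Tensor    # latents for the new chunk, to be denoised
    audio_features: torch.Tensor   # audio embedding for this chunk's time window
    reference_image: torch.Tensor  # user-provided identity anchor (long-term appearance)
    prev_last_frame: torch.Tensor  # last frame of the previous chunk (motion bridge)
    reference_weight: float        # soft constraint, tightened when identity drifts
```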

It is good for what it is — talking heads, indefinitely. But the architecture differs enough from Wan2.2 that it requires dedicated training, and it does not generalize to general scenes. Verdict: valuable in a narrow lane, not a general long-video solution.

Route 6 · Helios

Paper: Helios: Real-Time Long Video Generation Model (PKU-YuanGroup, arXiv 2603.04379, Mar 2026).

As of writing, Helios is the SOTA for long video — 14B params, 19.5 FPS real-time on a single H100. The trick is to compress historical frames into a three-level pyramid and inject them into the current frame's denoising, so the token budget stays constant no matter how long the video gets.

Multi-Term Memory. Short-term history (the last 3 frames) keeps full resolution; mid-term (the last 20 frames) gets moderate compression; long-term (everything earlier) gets heavy compression. The total token budget is constant regardless of video length.

Guidance Attention. Inside each DiT block, clean historical KVs and noisy current QKVs are processed separately, so historical noise cannot contaminate current denoising.

Pyramid Sampling. Sample at low resolution first to define structure, then refine to high resolution to add detail — about 2.3× fewer tokens overall.
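A rough sketch of the bookkeeping that keeps the conditioning budget flat; the compression ratios and the long-term cap are illustrative placeholders, not numbers from the tech report.

```python
def memory_token_budget(history_len: int, tokens_per_frame: int = 1536,
                        short: int = 3, mid: int = 20,
                        mid_ratio: int = 4, long_ratio: int = 16,
                        long_cap: int = 64) -> int:
    """Sketch of multi-term memory: recent frames at full resolution, mid-term
    moderately compressed, long-term heavily compressed and capped, so the
    history token count stops growing with video length. All ratios here are
    illustrative, not Helios's actual numbers."""
    short_frames = min(history_len, short)
    mid_frames = min(max(history_len - short, 0), mid)
    long_frames = min(max(history_len - short - mid, 0), long_cap)
    return (short_frames * tokens_per_frame
            + mid_frames * (tokens_per_frame // mid_ratio)
            + long_frames * (tokens_per_frame // long_ratio))

for n in (10, 100, 1000):
    print(n, memory_token_budget(n))   # plateaus once long-term history hits the cap
```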

Figure (image6.png): Helios architecture. Left: Unified History Injection — short / mid / long-term history compressed at different ratios, concatenated with the current frame before entering the DiT. Right: Pyramid Unified Predictor-Corrector — low token count first to define structure, then high token count to refine details, reducing computation by ~2.3×.

Figure (image7.png): The Helios paper systematically defines and visualizes three categories of drift in long-video generation: (a) position shift, (b) color shift, (c) restoration shift (noise), (d) restoration shift (blur). Guidance Attention is specifically designed to address all three.

The throughput Helios reports is striking, staying basically flat with length:

| Length | Time | Throughput |
| --- | --- | --- |
| 240 frames (10s) | 24 s | ~10 FPS |
| 480 frames (20s) | 42 s | ~11.4 FPS |
| 960 frames (40s) | 82 s | ~11.7 FPS |
| H100 single GPU (Helios-Distilled) | — | 19.5 FPS |

The catch is that Multi-Term Memory Patchification needs full retraining of a 14B model. There are no released weights — only a tech report — so we cannot just bolt on a LoRA. Verdict: a medium-to-long-term direction; not deployable today.

Route Comparison Summary

All six routes side by side, with SVI added as the row we ultimately committed to:

| Approach | Max Duration | Quality | Training Required | Engineering Difficulty | Generality | Rec. |
| --- | --- | --- | --- | --- | --- | --- |
| TTT | 1 minute | High | Heavy training needed | High | Medium | ★★☆ |
| LoL | Hour-scale | Medium (static only) | Training-free (attention modification) | Medium | Medium | ★★☆ |
| Self Forcing | Theoretically unlimited | Medium (drops > 10s) | Existing model | High (VRAM issues) | High | ★★★ |
| Self Forcing++ | Minute-scale | Low (mostly static) | Training needed | Very high (no code) | High | ★☆☆ |
| Infinite Talk | Unlimited | High (talking head) | Training needed | High | Low (A2V only) | ★★☆ |
| Helios | Theoretically unlimited | High (industry SOTA) | Full retraining | Very high (no weights) | High | ★★★☆ |
| SVI | Unlimited | Medium–High | Open-source LoRA | Medium | High | ★★★★ |

A taxonomy that fell out of the survey

If you squint, every approach we surveyed falls into one of three buckets.

Type A — extend the attention range itself (Self Forcing, LoL, TTT). Have the model directly process longer sequences. Highest theoretical quality. VRAM grows quadratically, so engineering hits a wall around 10s today.

Type B — hierarchical history compression (Helios). Compress past frames and inject them as conditioning. Bypasses VRAM. Costs a full retraining of a 14B model.

Type C — stateful rolling generation (SVI, Infinite Talk). Decompose long video into short clips with overlapping state. Constant VRAM, unlimited length, LoRA-only training. The trade is possible discontinuities at clip boundaries and unbounded long-term drift you can manage but not eliminate.

For this quarter, Type C is what we ship. For next year, Type B is the direction we are watching in the literature.


In the next post, we go into what shipping actually looked like: how SVI works, how it sits next to TurboWan in our DiffSynth Engine, and what the production numbers look like. Read Part 2 →
