TL;DR
I run LTX-2.3 image-to-video (I2V) locally on a 96 GB GPU. At 1024×768 / 97 frames it peaked at 83.5 GiB — so close to the ceiling that it OOM'd whenever my image-generation server was co-resident, and 1280×768 OOM'd outright. I assumed I'd hit a hardware wall.
I hadn't. 54 of those gigabytes were an autograd graph. The pipeline returns a lazy decode iterator; the real VAE decode runs when you encode the output — and in my harness that happened outside the with torch.no_grad(): block, so every conv activation in the decoder was retained for a backward pass that never comes.
Moving one call inside the no_grad block:
| | before | after |
|---|---|---|
| I2V 1024×768/97f peak | 83.5 GiB | 29.5 GiB (−65%) |
| time | 151.6 s | 135.2 s (slightly faster) |
And the peak goes nearly flat across resolution — 2048×1536 (3.1 MP) tops out at 33.6 GiB. The "I need a bigger GPU" conclusion was a measurement artifact.
The lever I tried first — finer VAE decode tiling — barely moved the number. That dead end is part of the story.
The setup
- GPU: RTX PRO 6000 Blackwell Max-Q (96 GB)
- PyTorch: 2.x + CUDA 12.8 (Blackwell sm_120)
- Model: LTX-2.3 22B, two-stage (low-res denoise → 2× latent upscale → high-res refine → VAE decode), transformer loaded as fp8-cast
- Mode: cold-start (components built/freed per request, low idle VRAM)
The workflow I care about: generate a clean still, then animate it with I2V. Starting from a correct still sidesteps the seed-gacha and anatomy breakdowns you get from pure text-to-video. The only thing standing in the way was VRAM.
Dead end #1: VAE decode tiling
LTX-2's VAE decode supports tiling (TilingConfig: spatial tile px / temporal tile frames). The default is a coarse 768 px / 80 frames. The intuition: smaller tiles → smaller decode workspace → lower peak.
I made tiling configurable and swept it. The most aggressive setting (384 px / 32 frames):
tile 384px/32f (finest): process demanded 77.37 GiB → still OOM with the co-resident model
tile 768px/80f (default): 83.51 GiB
Halving the spatial tile and cutting the temporal tile from 80 to 32 frames bought only ~6 GiB. So the peak isn't the decode workspace. Tiling was the wrong lever.
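The mismatch is easier to see as arithmetic. The sketch below compares how much smaller one tile got with how much the peak actually dropped, using only the numbers from the sweep. `tile_volume` is a hypothetical helper for illustration; it ignores tile overlap and channel counts, so it is a rough proxy for decode workspace, not a VRAM model.

```python
# Tile sizes from the sweep: (spatial px, temporal frames).
DEFAULT_TILE = (768, 80)
FINEST_TILE = (384, 32)

# Measured peaks from the sweep, in GiB.
PEAK_DEFAULT_GIB = 83.51
PEAK_FINEST_GIB = 77.37


def tile_volume(spatial_px, frames):
    """Pixels x frames covered by one tile (assumption: no overlap, no channels)."""
    return spatial_px * spatial_px * frames


# How much smaller a single tile became.
volume_ratio = tile_volume(*FINEST_TILE) / tile_volume(*DEFAULT_TILE)

# How much the measured peak actually moved.
saved_gib = PEAK_DEFAULT_GIB - PEAK_FINEST_GIB
saved_fraction = saved_gib / PEAK_DEFAULT_GIB

print(f"tile volume ratio: {volume_ratio:.2f}")
print(f"peak saved: {saved_gib:.2f} GiB ({saved_fraction:.1%})")
```

A tile one tenth the size moved the peak by well under a tenth, which is hard to square with the decode workspace being the dominant term.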
Before retreating to "lower the resolution," I measured where the peak actually lives.
Localizing the peak
I dropped an env-gated profiler into the pipeline's __call__, printing torch.cuda.max_memory_allocated() at each phase boundary:
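import torch

def _vram(label):
    # Make sure all queued GPU work has finished before reading the counter.
    torch.cuda.synchronize()
    peak = torch.cuda.max_memory_allocated() / 1024**3
    print(f"[vram] {label}: peak_so_far={peak:.2f}GiB")

# called after each phase: stage_1 denoise → upsampler → stage_2 denoise → decode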
At 1024×768/97f:
[vram] after stage_1 denoise: peak_so_far=29.17GiB
[vram] after upsampler (2x latent): peak_so_far=29.17GiB
[vram] after stage_2 denoise: peak_so_far=29.51GiB
[vram] after decode call (lazy): peak_so_far=29.51GiB ← inside the pipeline: 29.5 GiB
The pipeline's internal peak is 29.51 GiB. But measured around the whole generate call it was 83.51 GiB. The extra 54 GiB appears after the pipeline returns a value.
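To catch a gap like this, the same measurement has to be taken outside the pipeline too, after every step of the harness. Below is a minimal sketch of an env-gated phase logger that also prints the jump since the previous phase. The names `make_phase_logger` and `VRAM_PROFILE` are hypothetical, and the peak reader is injected as a callable so the sketch runs without a GPU.

```python
import os


def make_phase_logger(read_peak_bytes, env_var="VRAM_PROFILE"):
    """Return log(label); it is a no-op unless env_var is set to "1".

    read_peak_bytes: zero-argument callable returning the peak in bytes.
    env_var: hypothetical gate name, chosen for this sketch only.
    """
    enabled = os.environ.get(env_var) == "1"
    history = []  # (label, peak_gib) pairs, in call order

    def log(label):
        if not enabled:
            return None
        peak_gib = read_peak_bytes() / 1024**3
        previous = history[-1][1] if history else 0.0
        history.append((label, peak_gib))
        # The delta makes the phase that raised the peak stand out.
        print(f"[vram] {label}: peak_so_far={peak_gib:.2f}GiB ({peak_gib - previous:+.2f})")
        return peak_gib

    log.history = history
    return log
```

With PyTorch, the reader would presumably be a small function that calls `torch.cuda.synchronize()` and returns `torch.cuda.max_memory_allocated()`. Calling `log` once after the pipeline returns and once after `encode_video` would show the 54 GiB jump on the second line.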
Root cause: a lazy iterator escaping no_grad
The return value is a lazy iterator:
def __call__(self, ...):
    ...
    decoded_video = self.video_decoder(latent, tiling_config, generator)  # builds an iterator
    return decoded_video, audio  # nothing decoded yet
The actual VAE decode runs when something consumes the iterator — i.e. inside encode_video. And my harness looked like this:
with torch.no_grad():
    video, audio = pipeline(...)  # returns the iterator (cheap)
encode_video(video=video, ...)  # decode runs HERE, outside no_grad
encode_video is outside the no_grad block. Because decode is lazy, it runs with grad enabled. If the decoder's parameters still have requires_grad=True (the default for module parameters), PyTorch dutifully keeps every intermediate activation in the VAE decoder around for a backward pass that never comes. That's the 54 GiB.
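The trap is ordinary Python generator semantics, so it can be reproduced without a GPU. The sketch below uses stand-ins only: `GRAD_ENABLED` and `fake_no_grad` imitate PyTorch's grad mode, and `lazy_decode` imitates a decoder that returns an iterator. None of these names come from LTX-2 or PyTorch.

```python
from contextlib import contextmanager

GRAD_ENABLED = True  # stand-in for PyTorch's global grad mode


@contextmanager
def fake_no_grad():
    """Stand-in for torch.no_grad(): turn the flag off, restore it on exit."""
    global GRAD_ENABLED
    previous = GRAD_ENABLED
    GRAD_ENABLED = False
    try:
        yield
    finally:
        GRAD_ENABLED = previous


def lazy_decode(num_tiles):
    """Stand-in lazy decoder: the body runs only when the generator is consumed."""
    for index in range(num_tiles):
        # Record the grad mode at the moment the "decode" work actually happens.
        yield (index, GRAD_ENABLED)


def run_leaky():
    with fake_no_grad():
        frames = lazy_decode(3)  # only creates the generator object
    return list(frames)  # consumed after the block has exited


def run_fixed():
    with fake_no_grad():
        frames = lazy_decode(3)
        return list(frames)  # consumed while the block is still active


print(run_leaky())  # every tile sees grad mode ON
print(run_fixed())  # every tile sees grad mode OFF
```

`run_leaky` mirrors the harness above: the call sits inside the block, but the work does not.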
The fix is to indent one call:
with torch.no_grad():
    video, audio = pipeline(...)
    encode_video(video=video, ...)  # decode now runs under no_grad
before: 83.51 GiB / 151.6 s
after: 29.51 GiB / 135.2 s ← graph bookkeeping gone, slightly faster too
Why no_grad and not inference_mode? With the streaming weight loader, the VAE decode chokes on inference-mode tensors ("Inference tensors cannot be saved for backward"). no_grad keeps the latents as normal tensors so decode survives. (Production servers that wrap the entire generate in inference_mode / no_grad never hit this; it was purely a harness scoping slip.)
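When the consumer cannot easily be moved inside the block (for example, the encode step lives in another function), one alternative is to wrap the iterator so that every step is pulled under the context manager. This is a sketch of that idea, not something the pipeline ships: `iterate_within` is a hypothetical helper, and `guard` is a stand-in context manager used only to show the behaviour.

```python
from contextlib import contextmanager


def iterate_within(iterable, context_factory):
    """Yield items from iterable, running each next() inside a fresh context.

    The context is exited before the item is handed to the consumer, so it
    never leaks into the caller's code between steps.
    """
    iterator = iter(iterable)
    while True:
        with context_factory():
            try:
                item = next(iterator)
            except StopIteration:
                return
        yield item


# --- demo with a stand-in context manager ---
state = {"guarded": False}


@contextmanager
def guard():
    state["guarded"] = True
    try:
        yield
    finally:
        state["guarded"] = False


def produce(count):
    for index in range(count):
        yield (index, state["guarded"])  # what the producer sees while working


seen_by_producer = []
seen_by_consumer = []
for _, guarded in iterate_within(produce(3), guard):
    seen_by_producer.append(guarded)
    seen_by_consumer.append(state["guarded"])  # what the consumer sees
```

With PyTorch, passing `torch.no_grad` as `context_factory` and handing the wrapped iterator to the encoder should have the same effect as the indentation fix, but that is an untested assumption here. Indenting the call remains the simpler change.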
The payoff: peak is ~flat across resolution
Post-fix sweep, single process, escalating resolution:
| resolution (97f) | peak VRAM | time |
|---|---|---|
| 1024×768 | 29.51 GiB | 135 s |
| 1280×768 (was a 93 GiB OOM) | 29.51 GiB | 165 s |
| 1536×1152 | 29.99 GiB | 206 s |
| 2048×1536 (3.1 MP) | 33.55 GiB | 348 s |
Nearly flat. The decode processes tiles sequentially, so higher resolution just means more tiles, not a bigger simultaneous workspace — and once the autograd graph is gone, that's what dominates. (Which is exactly why tiling alone did nothing earlier: the graph was swamping it.)
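The "more tiles, not bigger tiles" point can be sketched with a toy grid. This assumes a non-overlapping spatial grid at the default 768 px tile and ignores temporal tiling and blending, so the tile counts are illustrative rather than what LTX-2's TilingConfig actually produces. `tile_plan` is a hypothetical helper.

```python
import math


def tile_plan(width, height, tile_px):
    """Return (tile count, pixels in the largest single tile).

    Assumption: a plain non-overlapping grid, with edge tiles clipped
    to the frame.
    """
    cols = math.ceil(width / tile_px)
    rows = math.ceil(height / tile_px)
    largest_tile_px = min(width, tile_px) * min(height, tile_px)
    return rows * cols, largest_tile_px


for width, height in [(1024, 768), (1280, 768), (1536, 1152), (2048, 1536)]:
    tiles, largest = tile_plan(width, height, 768)
    print(f"{width}x{height}: {tiles} tiles, largest tile {largest} px")
```

The tile count grows with resolution while the largest tile stays the same size, which matches decode time rising while the peak barely moves.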
A bonus: a "VRAM leak" I'd blamed on consecutive generations in one process also vanished. It was the same retained graph, accumulating across prompts.
Takeaways
- Check that with torch.no_grad(): actually covers what you think. If the return value is a generator / iterator / lazy tensor, the real compute can happen outside the block when it's consumed. Scope illusion.
- Don't kill a VRAM peak by guessing. Print max_memory_allocated() at phase boundaries; the culprit shows up immediately. My "the decode workspace is heavy" intuition was simply wrong, and without profiling I'd have spent the afternoon lowering resolution.
- Suspect measurement artifacts before concluding "the hardware is too small." I almost gave up high-res I2V as impossible on 96 GB. It runs in under 34 GiB up to 2048×1536.
This came out of building the video features for a solo voice × video roleplay platform (kotonia.ai) — chasing what a single local GPU can do in a niche the big labs deprioritize.
I wrote up the why behind that bet — the model A/B that led me to make I2V the mainstay, and the GPU traffic-control that lets me experiment in production without stalling users — separately: Betting on the video niche the big labs walked away from.