My high-res image-to-video kept OOMing — turns out I was decoding outside no_grad

#python #ai #machinelearning #pytorch

TL;DR

I run LTX-2.3 image-to-video (I2V) locally on a 96 GB GPU. At 1024×768 / 97 frames it peaked at 83.5 GiB — so close to the ceiling that it OOM'd whenever my image-generation server was co-resident, and 1280×768 OOM'd outright. I assumed I'd hit a hardware wall.

I hadn't. 54 of those gigabytes were an autograd graph. The pipeline returns a lazy decode iterator; the real VAE decode runs when you encode the output — and in my harness that happened outside the with torch.no_grad(): block, so every conv activation in the decoder was retained for a backward pass that never comes.

Moving one call inside the no_grad block:

metric	before	after
I2V 1024×768/97f peak VRAM	83.5 GiB	29.5 GiB (−65%)
wall time	151.6 s	135.2 s (slightly faster)

The peak also stays nearly flat across resolution: 2048×1536 (3.1 MP) tops out at 33.6 GiB. The "I need a bigger GPU" conclusion was a measurement artifact.

The lever I tried first — finer VAE decode tiling — barely moved the number. That dead end is part of the story.

The setup

GPU: RTX PRO 6000 Blackwell Max-Q (96 GB)
PyTorch: 2.x + CUDA 12.8 (Blackwell sm_120)
Model: LTX-2.3 22B, two-stage (low-res denoise → 2× latent upscale → high-res refine → VAE decode), transformer loaded as fp8-cast
Mode: cold-start (components built/freed per request, low idle VRAM)

The workflow I care about: generate a clean still, then animate it with I2V. Starting from a correct still sidesteps the seed-gacha and anatomy breakdowns you get from pure text-to-video. The only thing standing in the way was VRAM.

Dead end #1: VAE decode tiling

LTX-2's VAE decode supports tiling (TilingConfig: spatial tile px / temporal tile frames). The default is a coarse 768 px / 80 frames. The intuition: smaller tiles → smaller decode workspace → lower peak.

I made tiling configurable and swept it. The most aggressive setting (384 px / 32 frames):

tile 384px/32f (finest): process demanded 77.37 GiB → still OOM with the co-resident model
tile 768px/80f (default): 83.51 GiB

Halving the spatial tile and cutting the temporal tile from 80 to 32 frames bought only ~6 GiB (83.51 → 77.37). So the decode workspace isn't what sets the peak. Tiling was the wrong lever.
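A quick back-of-the-envelope check makes the mismatch concrete. The sketch below assumes the decode workspace scales roughly with tile volume (spatial px squared × frames); that proxy is my simplification, not something I measured. The tile sizes and peaks are the ones from the sweep above.

```python
# Tile sizes and measured peaks from the sweep above.
default_tile = (768, 80)   # spatial px, temporal frames
finest_tile = (384, 32)
default_peak_gib = 83.51
finest_peak_gib = 77.37


def tile_volume(spatial_px, frames):
    """Pixels x frames in one tile (assumed proxy for decode workspace)."""
    return spatial_px * spatial_px * frames


volume_ratio = tile_volume(*finest_tile) / tile_volume(*default_tile)
peak_saved_gib = default_peak_gib - finest_peak_gib
peak_saved_frac = peak_saved_gib / default_peak_gib

print(f"tile volume ratio: {volume_ratio:.2f}")
print(f"peak saved: {peak_saved_gib:.2f} GiB ({peak_saved_frac:.1%})")
```

If the workspace were the bottleneck, a tile one tenth the size should have cut the peak by far more than about 7%.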

Before retreating to "lower the resolution," I measured where the peak actually lives.

Localizing the peak

I dropped an env-gated profiler into the pipeline's __call__, printing torch.cuda.max_memory_allocated() at each phase boundary:

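import torch

def _vram(label):
    # Wait for queued kernels so the reading reflects completed work.
    torch.cuda.synchronize()
    # High-water mark of allocated tensor memory so far, converted to GiB.
    peak = torch.cuda.max_memory_allocated() / 1024**3
    print(f"[vram] {label}: peak_so_far={peak:.2f}GiB")

# Called after: stage_1 denoise → upsampler → stage_2 denoise → decode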

At 1024×768/97f:

[vram] after stage_1 denoise:        peak_so_far=29.17GiB
[vram] after upsampler (2x latent):  peak_so_far=29.17GiB
[vram] after stage_2 denoise:        peak_so_far=29.51GiB
[vram] after decode call (lazy):     peak_so_far=29.51GiB   ← inside the pipeline: 29.5 GiB

The pipeline's internal peak is 29.51 GiB, but measured around the whole generate call it was 83.51 GiB. The extra 54 GiB appears only after the pipeline has returned.
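To read the log as per-phase increments rather than running peaks, a few lines of parsing are enough. This is a minimal sketch that assumes the exact `[vram] label: peak_so_far=N.NNGiB` line format printed above; `parse_peaks` and `peak_increments` are names I made up for illustration.

```python
import re

LOG = """\
[vram] after stage_1 denoise:        peak_so_far=29.17GiB
[vram] after upsampler (2x latent):  peak_so_far=29.17GiB
[vram] after stage_2 denoise:        peak_so_far=29.51GiB
[vram] after decode call (lazy):     peak_so_far=29.51GiB
"""
WHOLE_CALL_PEAK_GIB = 83.51  # measured around the entire generate call

LINE_RE = re.compile(r"\[vram\] (?P<label>.+?):\s+peak_so_far=(?P<gib>[\d.]+)GiB")


def parse_peaks(log):
    """Return [(label, running_peak_gib), ...] in log order."""
    return [(m["label"], float(m["gib"])) for m in LINE_RE.finditer(log)]


def peak_increments(peaks):
    """How much each phase raised the running peak."""
    out = []
    previous = 0.0
    for label, gib in peaks:
        out.append((label, round(gib - previous, 2)))
        previous = gib
    return out


peaks = parse_peaks(LOG)
increments = peak_increments(peaks)
# Whatever the whole-call peak exceeds the last in-pipeline reading by
# must have been allocated after the pipeline returned.
unexplained_gib = round(WHOLE_CALL_PEAK_GIB - peaks[-1][1], 2)

for label, delta in increments:
    print(f"{label}: +{delta:.2f} GiB")
print(f"outside the pipeline: +{unexplained_gib:.2f} GiB")
```

Stage 1 accounts for almost all of the in-pipeline peak, stage 2 adds 0.34 GiB, and everything else lands in the "outside the pipeline" line.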

Root cause: a lazy iterator escaping no_grad

The return value is a lazy iterator:

def __call__(self, ...):
    ...
    decoded_video = self.video_decoder(latent, tiling_config, generator)  # builds an iterator
    return decoded_video, audio   # nothing decoded yet

The actual VAE decode runs when something consumes the iterator — i.e. inside encode_video. And my harness looked like this:

with torch.no_grad():
    video, audio = pipeline(...)   # returns the iterator (cheap)

encode_video(video=video, ...)     # decode runs HERE — outside no_grad

encode_video sits outside the no_grad block. Because decode is lazy, it runs with grad enabled, and PyTorch dutifully keeps every intermediate activation in the VAE decoder around for a backward pass (presumably because the decoder's parameters still carry requires_grad=True, the nn.Parameter default). That's the 54 GiB.
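The scoping effect is plain Python generator behavior, so it can be reproduced without a GPU. In this sketch `fake_no_grad` and the `state` dict are hypothetical stand-ins for torch.no_grad() and PyTorch's global grad mode, and `lazy_decode` stands in for the pipeline's decode iterator; none of it is the real LTX code.

```python
from contextlib import contextmanager

state = {"grad_enabled": True}  # stand-in for PyTorch's global grad mode


@contextmanager
def fake_no_grad():
    """Stand-in for torch.no_grad(): flips the flag, restores it on exit."""
    previous = state["grad_enabled"]
    state["grad_enabled"] = False
    try:
        yield
    finally:
        state["grad_enabled"] = previous


def lazy_decode(num_tiles):
    """Generator: the body does not run until something iterates it."""
    for tile in range(num_tiles):
        # Record the grad mode in effect at the moment the tile is "decoded".
        yield tile, state["grad_enabled"]


# Harness shaped like my original: consumption happens after the block.
with fake_no_grad():
    frames = lazy_decode(3)          # cheap: only creates the generator
leaked = [grad for _, grad in frames]

# Harness shaped like the fix: consume inside the block.
with fake_no_grad():
    fixed = [grad for _, grad in lazy_decode(3)]

print(leaked)  # [True, True, True]
print(fixed)   # [False, False, False]
```

The generator object is created inside the block in both cases; only the place where it is consumed decides which mode the body sees.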

The fix is to indent one call:

with torch.no_grad():
    video, audio = pipeline(...)
    encode_video(video=video, ...)   # decode now runs under no_grad

before: 83.51 GiB / 151.6 s
after:  29.51 GiB / 135.2 s   ← graph bookkeeping gone, slightly faster too
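If the consumer can't easily be moved inside the block, another option is to wrap the iterator so each step re-enters the context on its own. This is a sketch of that idea, not what I shipped: `step_under` is a hypothetical helper, and `fake_no_grad` with its `state` dict is a stand-in for torch.no_grad so the example runs anywhere. With real PyTorch I'd expect to pass torch.no_grad as the factory, but I haven't tested that against the LTX pipeline.

```python
from contextlib import contextmanager

state = {"grad_enabled": True}  # stand-in for PyTorch's global grad mode


@contextmanager
def fake_no_grad():
    """Stand-in for torch.no_grad(): flips the flag, restores it on exit."""
    previous = state["grad_enabled"]
    state["grad_enabled"] = False
    try:
        yield
    finally:
        state["grad_enabled"] = previous


def step_under(ctx_factory, iterable):
    """Re-yield items, running each step of `iterable` inside ctx_factory()."""
    iterator = iter(iterable)
    while True:
        with ctx_factory():
            try:
                item = next(iterator)   # the lazy work happens here
            except StopIteration:
                return
        # The context has been exited before the consumer sees the item.
        yield item


def lazy_decode(num_tiles):
    """Stand-in for the decode iterator; reports the mode each tile ran in."""
    for tile in range(num_tiles):
        yield tile, state["grad_enabled"]


frames = step_under(fake_no_grad, lazy_decode(3))
seen = [grad for _, grad in frames]     # consumed outside any block
print(seen)                             # [False, False, False]
```

The context is entered and exited around each `next()` call rather than held open across yields, so the consumer's own code still runs in whatever mode it was already in.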

Why no_grad and not inference_mode? With the streaming weight loader, the VAE decode chokes on inference-mode tensors ("Inference tensors cannot be saved for backward"). no_grad keeps the latents as normal tensors so decode survives. (Production servers that wrap the entire generate in inference_mode/no_grad never hit this — it was purely a harness scoping slip.)

The payoff: peak is ~flat across resolution

Post-fix sweep, single process, escalating resolution:

resolution (97f)	peak VRAM	time
1024×768	29.51 GiB	135 s
1280×768 (was a 93 GiB OOM)	29.51 GiB	165 s
1536×1152	29.99 GiB	206 s
2048×1536 (3.1 MP)	33.55 GiB	348 s

Nearly flat. The decode processes tiles sequentially, so higher resolution just means more tiles, not a bigger simultaneous workspace. Once the autograd graph is gone, decode memory is bounded by that per-tile workspace. (Which is exactly why tiling alone bought only ~6 GiB earlier: the retained graph was swamping it.)
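To put a number on "nearly flat", here is the same sweep normalized to the 1024×768 row. It's only arithmetic on the table above; nothing here is newly measured.

```python
# (width, height, peak GiB, seconds) from the post-fix sweep above.
sweep = [
    (1024, 768, 29.51, 135),
    (1280, 768, 29.51, 165),
    (1536, 1152, 29.99, 206),
    (2048, 1536, 33.55, 348),
]

base_w, base_h, base_peak, base_time = sweep[0]
base_pixels = base_w * base_h

rows = []
for width, height, peak_gib, seconds in sweep:
    rows.append((
        f"{width}x{height}",
        width * height / base_pixels,   # pixel count relative to 1024x768
        peak_gib / base_peak,           # peak VRAM relative to 1024x768
        seconds / base_time,            # wall time relative to 1024x768
    ))

for label, px, peak, t in rows:
    print(f"{label:>10}: pixels x{px:.2f}  peak x{peak:.2f}  time x{t:.2f}")
```

Four times the pixels costs about 14% more peak VRAM and roughly 2.6× the wall time, so at these sizes the resolution bill is paid in seconds, not gigabytes.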

A bonus: a "VRAM leak" I'd blamed on consecutive generations in one process also vanished. It was the same retained graph, accumulating across prompts.

Takeaways

Check that with torch.no_grad(): actually covers what you think. If the return value is a generator / iterator / lazy tensor, the real compute can happen outside the block when it's consumed. Scope illusion.
Don't kill a VRAM peak by guessing. Print max_memory_allocated() at phase boundaries; the culprit shows up immediately. My "the decode workspace is heavy" intuition was simply wrong, and without profiling I'd have spent the afternoon lowering resolution.
Suspect measurement artifacts before concluding "the hardware is too small." I almost gave up high-res I2V as impossible on 96 GB. It runs in roughly 30 to 34 GiB all the way up to 2048×1536.

This came out of building the video features for a solo voice × video roleplay platform (kotonia.ai) — chasing what a single local GPU can do in a niche the big labs deprioritize.

I wrote up the why behind that bet — the model A/B that led me to make I2V the mainstay, and the GPU traffic-control that lets me experiment in production without stalling users — separately: Betting on the video niche the big labs walked away from.