How to fix OOM crashes when running large open-source LLMs locally

#llm #python #machinelearning #performance

The crash that ruined my Friday

Last week I tried to spin up a 13B parameter open-source LLM on my workstation. The model was advertised as fitting comfortably in 24GB of VRAM. My RTX 4090 has 24GB. Should be fine, right?

Wrong. The model loaded, I sent a single prompt, and a few seconds later: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.10 GiB.

If you've ever tried to run a large open-source language model locally, you've probably hit this exact error. The model "fits," then explodes the moment you actually use it. Let's dig into why this happens and how to fix it properly — not just throw --load-in-4bit at the problem and hope for the best.

Why "the model fits" doesn't mean what you think it means

When someone says a 13B parameter model needs ~26GB in FP16, that's just the weights: 13 billion parameters at 2 bytes each. The actual memory footprint at inference time has three components:

Model weights — the static parameters loaded into VRAM
KV cache — keys and values for every token in the context, scaling linearly with sequence length
Activation memory — intermediate tensors during the forward pass

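The KV cache is the silent killer. For a transformer with L layers, H key/value heads (the same as the attention head count unless the model uses grouped-query attention), head dimension D, batch size B, and sequence length S, the KV cache is roughly the following, where the leading 2 counts keys and values: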

kv_cache_bytes = 2 * L * B * S * H * D * bytes_per_element

For a 13B model with 40 layers, 40 heads, 128 head dim, FP16, and a 4K context, that's already ~3.4GB (about 3.1 GiB) for a single sequence. Push it to 32K context and you're at roughly 27GB of KV cache alone, on top of the weights. That's the root cause of most "fits then crashes" scenarios.
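The formula is easy to turn into a helper you can rerun for any model shape. This is a minimal sketch: kv_cache_bytes is a hypothetical function name, and the shape values are the ones from the example above rather than read from a real model config.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_element=2):
    """Approximate KV cache size: 2 (keys and values) * L * B * S * H * D * bytes."""
    return 2 * layers * batch * seq_len * kv_heads * head_dim * bytes_per_element


GIB = 1024 ** 3

# Shape from the example: 40 layers, 40 heads, head dim 128, FP16 (2 bytes per element)
for context in (4096, 32768):
    size = kv_cache_bytes(layers=40, kv_heads=40, head_dim=128, seq_len=context)
    print(f"{context:>6} tokens: {size / 1e9:.1f} GB ({size / GIB:.2f} GiB)")
```

For a real model, the layer count, KV head count, and head dimension usually come from its config file. Multiply by your batch size if you serve more than one sequence at a time.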

Step 1: Profile what's actually in memory

Before you start optimizing, see where the bytes are going. PyTorch's memory snapshot is the right tool here, but most people skip it.

import torch

# Start recording allocations before you load anything
torch.cuda.memory._record_memory_history(max_entries=100000)

# Load the model and run one short generation.
# load_my_model() and prompt_ids are placeholders for your own loading code
# and your tokenized prompt.
model = load_my_model()
out = model.generate(prompt_ids, max_new_tokens=128)

# Dump the snapshot, then open it at pytorch.org/memory_viz
torch.cuda.memory._dump_snapshot("oom_debug.pickle")

# Stop recording
torch.cuda.memory._record_memory_history(enabled=None)

Drop the pickle into the PyTorch memory visualizer and you'll see exactly which allocations are eating your VRAM. In my case it was obvious — the KV cache was almost as large as the weights themselves once the context filled up.
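If you want a quick number before opening the visualizer, you can also total up the snapshot yourself. The sketch below assumes the pickle holds a dict with a "segments" list whose entries carry total_size and allocated_size fields, which is my reading of the snapshot layout in the PyTorch docs. Treat those field names as an assumption and check them against your PyTorch version. summarize_snapshot and load_snapshot are hypothetical helper names, and the demo data is synthetic so the code runs without a GPU.

```python
import pickle


def load_snapshot(path):
    """Load a snapshot written by torch.cuda.memory._dump_snapshot (only open files you created)."""
    with open(path, "rb") as f:
        return pickle.load(f)


def summarize_snapshot(snapshot):
    """Return (reserved, allocated) bytes summed over all segments.

    ASSUMPTION: each segment dict has "total_size" and "allocated_size" keys.
    """
    segments = snapshot.get("segments", [])
    reserved = sum(seg["total_size"] for seg in segments)
    allocated = sum(seg["allocated_size"] for seg in segments)
    return reserved, allocated


MIB = 1024 ** 2

# Synthetic stand-in for load_snapshot("oom_debug.pickle"); the numbers are made up.
fake_snapshot = {
    "segments": [
        {"total_size": 2048 * MIB, "allocated_size": 1536 * MIB},
        {"total_size": 20 * MIB, "allocated_size": 4 * MIB},
    ]
}

reserved, allocated = summarize_snapshot(fake_snapshot)
print(f"reserved:            {reserved / MIB:.0f} MiB")
print(f"allocated:           {allocated / MIB:.0f} MiB")
print(f"reserved but unused: {(reserved - allocated) / MIB:.0f} MiB")
```

A large gap between reserved and allocated memory is a hint that fragmentation, not raw model size, is what pushes you over the edge.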

Step 2: Pick the right quantization for your bottleneck

If weights are dominating, quantize the weights. If the KV cache is dominating, quantize the cache. People conflate these, and it matters.

For weight-only quantization, bitsandbytes 4-bit (NF4) is a solid default for inference if you're using transformers:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NF4 generally beats FP4 on perplexity
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16, store in 4-bit
    bnb_4bit_use_double_quant=True,       # quantize the quantization constants too
)

model = AutoModelForCausalLM.from_pretrained(
    "your-open-source-model",
    quantization_config=bnb_config,
    device_map="auto",
)

That cuts a 13B model from ~26GB down to roughly 7GB. Now you actually have headroom for the KV cache.
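Here is the rough arithmetic behind those numbers, as a sketch. weight_bytes is a hypothetical helper, and the NF4 overheads are assumptions taken from the usual description of the scheme: about 0.5 extra bits per parameter for per-block quantization constants, or about 0.127 bits with double quantization. Real checkpoints also keep some layers unquantized, so expect the loaded size to land a little higher.

```python
def weight_bytes(n_params, bits_per_param):
    """Approximate size of the weights alone, ignoring runtime buffers."""
    return n_params * bits_per_param / 8


N_PARAMS = 13e9  # the 13B model from this post

# Bits stored per parameter. The NF4 overhead figures are ASSUMED, see the text above.
formats = {
    "fp16": 16,
    "int8": 8,
    "nf4": 4 + 0.5,
    "nf4 + double quant": 4 + 0.127,
}

for name, bits in formats.items():
    print(f"{name:<20} {weight_bytes(N_PARAMS, bits) / 1e9:5.1f} GB")
```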

For the KV cache itself, your options depend on your inference engine. llama.cpp supports --cache-type-k q4_0 --cache-type-v q4_0 to quantize the cache to 4-bit. Quantizing the V cache there has required flash attention (-fa) to be enabled, so check --help on your build. vLLM has FP8 KV cache via --kv-cache-dtype fp8. Both trade a small amount of quality for a smaller cache: FP8 halves it relative to FP16, and q4_0 shrinks it by roughly 3.5x. In my experience FP8 KV cache is essentially lossless for typical chat workloads; 4-bit is noticeably worse on tasks that require precise recall.
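To see what each cache format buys you, plug its per-element width into the KV cache formula. This is a sketch with assumed storage widths: 2 bytes for FP16, 1 byte for FP8, and 18 bytes per block of 32 values for q4_0 (16 bytes of packed 4-bit values plus a 2-byte scale). The model shape is the 40-layer, 40-head example from earlier, and kv_cache_bytes is a hypothetical helper.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_element, batch=1):
    """Approximate KV cache size: 2 (keys and values) * L * B * S * H * D * bytes."""
    return 2 * layers * batch * seq_len * kv_heads * head_dim * bytes_per_element


# ASSUMED storage cost per cached element for each format
BYTES_PER_ELEMENT = {"fp16": 2.0, "fp8": 1.0, "q4_0": 18 / 32}

CONTEXT = 32768
baseline = kv_cache_bytes(40, 40, 128, CONTEXT, BYTES_PER_ELEMENT["fp16"])

for name, width in BYTES_PER_ELEMENT.items():
    size = kv_cache_bytes(40, 40, 128, CONTEXT, width)
    print(f"{name:<5} {size / 1e9:5.1f} GB  (fp16 / {name} = {baseline / size:.2f})")
```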

Step 3: Fix memory fragmentation

Even with quantization, you might still hit OOM at high context lengths because of fragmentation. PyTorch's default allocator can leave VRAM looking like swiss cheese after enough allocate/free cycles.

The fix is one environment variable:

# Allow the allocator to grow blocks instead of leaving fragmented holes
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

This switches on the expandable-segments allocator. It has saved me from OOMs at long context lengths more than any other single tweak. The old max_split_size_mb workaround you'll see on Stack Overflow is largely obsolete now — see the PyTorch CUDA semantics docs for the full story.
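If you launch from a Python entry point instead of a shell, you can set the same variable in code. A minimal sketch: enable_expandable_segments is a hypothetical helper that merges the flag into whatever comma-separated options are already set. The safe move is to call it before importing torch, so the allocator reads the setting when it initializes.

```python
import os


def enable_expandable_segments(env=os.environ):
    """Add expandable_segments:True to PYTORCH_CUDA_ALLOC_CONF, keeping other options."""
    key = "PYTORCH_CUDA_ALLOC_CONF"
    options = [opt for opt in env.get(key, "").split(",") if opt]
    # Drop any earlier expandable_segments setting so this one wins
    options = [opt for opt in options if not opt.startswith("expandable_segments:")]
    options.append("expandable_segments:True")
    env[key] = ",".join(options)
    return env[key]


# Call this at the very top of your script, before `import torch`
print(enable_expandable_segments())
```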

Step 4: Offload what you can't fit

If you've quantized and you still don't fit, offload. The accelerate library handles this transparently with device_map="auto", but you can be explicit:

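import torch
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "your-open-source-model"
config = AutoConfig.from_pretrained(model_id)

# Build the model skeleton without allocating real weights
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)

# Reserve headroom on the GPU for KV cache and activations
device_map = infer_auto_device_map(
    empty_model,
    max_memory={0: "18GiB", "cpu": "64GiB"},  # leave ~6GB on GPU for runtime
    no_split_module_classes=["LlamaDecoderLayer"],  # use your model's decoder block class
    dtype=torch.float16,  # size layers at the dtype you load in, not the fp32 default
)

# Load the real weights using the computed placement
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map=device_map, torch_dtype=torch.float16
)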

CPU offload is slow — expect 5-10x slowdown for any layer that ends up on CPU — but it beats not running the model at all. Disk offload exists too and is even slower; use it only for one-shot evaluation runs.
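Before you commit to offloading, it helps to estimate how many layers will land on the CPU, since that decides how much of the slowdown you pay. The sketch below is back-of-envelope only: layers_on_gpu is a hypothetical helper, it assumes the FP16 weights are spread evenly across 40 decoder layers, and it ignores embeddings and the output head.

```python
def layers_on_gpu(n_layers, layer_bytes, gpu_budget_bytes, reserve_bytes):
    """How many equally sized decoder layers fit after reserving runtime headroom."""
    usable = max(gpu_budget_bytes - reserve_bytes, 0)
    return min(n_layers, int(usable // layer_bytes))


GIB = 1024 ** 3
n_layers = 40
layer_bytes = 26e9 / n_layers  # ASSUMPTION: ~26GB of FP16 weights split evenly

# 24 GiB card with 6 GiB held back for KV cache and activations
on_gpu = layers_on_gpu(n_layers, layer_bytes, 24 * GIB, 6 * GIB)
print(f"{on_gpu} layers on GPU, {n_layers - on_gpu} offloaded to CPU")
```

If a large share of the layers ends up on the CPU, quantizing harder is usually a better trade than offloading.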

Prevention: don't get burned again

A few habits that have saved me from repeating this debugging cycle:

Calculate KV cache up front. Before deploying a model, run the formula above for your target context length. If the cache exceeds ~30% of available VRAM, plan for cache quantization from day one.
Set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True in your shell profile. There's almost no downside.
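Use a serving engine for anything beyond prototyping. vLLM implements PagedAttention, which allocates the KV cache in fixed-size blocks instead of one contiguous buffer per sequence. llama.cpp allocates the cache up front for the context size you configure, so an undersized budget shows up at startup rather than mid-generation. Hand-rolling model.generate() in PyTorch is fine for one-off experiments but wasteful at scale.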
Monitor with nvidia-smi --query-gpu=memory.used --format=csv -l 1 during a stress test before declaring something "fits." Single-prompt smoke tests lie — you need to push context to your actual production length.
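The monitoring habit is easy to script so you get a single peak number out of a stress test. This is a sketch, assuming nvidia-smi is on your PATH. It uses the csv,noheader,nounits format so each output line is a bare MiB value, one per GPU. parse_memory_used_mib and peak_memory_used_mib are hypothetical helper names.

```python
import subprocess
import time

QUERY = ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]


def parse_memory_used_mib(csv_text):
    """Turn nvidia-smi output (one MiB value per line) into a list of ints."""
    values = []
    for line in csv_text.splitlines():
        line = line.strip()
        if line.isdigit():  # skips blank lines and any header line
            values.append(int(line))
    return values


def peak_memory_used_mib(duration_s=60.0, interval_s=1.0):
    """Poll nvidia-smi and return the highest memory.used seen on any GPU, in MiB."""
    peak = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
        peak = max([peak, *parse_memory_used_mib(result.stdout)])
        time.sleep(interval_s)
    return peak
```

Run peak_memory_used_mib() from a second terminal while the stress test pushes context to your production length, then compare the peak against your card's total.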

The frustrating truth is that running open-source LLMs locally isn't really a solved problem. It's a moving target where the model architectures, the quantization techniques, and the serving engines all evolve faster than the documentation. But understanding what's actually in your VRAM is the difference between fixing this in an hour and burning a whole afternoon on guesses.