Your GPU has 8 GB of VRAM. The model you want to run needs 14 GB. What now?
This is the most common wall people hit when running LLMs locally. Cloud APIs don't care about your hardware — local inference does. Understanding VRAM is the difference between smooth 40 tok/s responses and your system grinding to a halt.
I've spent months optimizing local AI setups and building tools around Ollama. Here's everything I've learned about making large models fit on consumer hardware.
## Why VRAM Matters More Than You Think
When you load a model into your GPU, every single parameter needs to live in VRAM during inference. A 7B parameter model in full FP16 precision needs roughly:
7 billion × 2 bytes = ~14 GB VRAM
That's already more than most consumer GPUs. An RTX 4060 has 8 GB. An RTX 4070 has 12 GB. Even an RTX 4090 tops out at 24 GB.
So how do people run 70B models on a single GPU? Quantization.
## Quantization Cheat Sheet
Quantization reduces the precision of model weights. Instead of 16 bits per parameter, you use 4 or 8 bits. Here's the practical breakdown:
| Quant Level | Bits/Param | 7B Model Size | 13B Model Size | 70B Model Size |
|---|---|---|---|---|
| FP16 | 16 | ~14 GB | ~26 GB | ~140 GB |
| Q8_0 | 8 | ~7.5 GB | ~14 GB | ~70 GB |
| Q6_K | 6 | ~5.5 GB | ~10.5 GB | ~54 GB |
| Q5_K_M | 5 | ~4.8 GB | ~9 GB | ~48 GB |
| Q4_K_M | 4 | ~4.1 GB | ~7.5 GB | ~40 GB |
| Q3_K_M | 3 | ~3.3 GB | ~6 GB | ~32 GB |
| Q2_K | 2 | ~2.7 GB | ~5 GB | ~26 GB |
The sweet spot for most people: Q4_K_M. You lose minimal quality compared to FP16 while cutting memory usage by 75%.
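The table values follow from a one-line estimate: parameters × effective bits per weight ÷ 8 gives gigabytes for a parameter count in billions. One caveat worth hedging: K-quants store some tensors at higher precision, so the ~4.85 bits-per-weight figure used below for Q4_K_M is an approximation, not an exact rate.

```shell
#!/bin/sh
# Estimate model weight size in GB: params (billions) * effective
# bits per weight / 8. The 4.85 bpw figure for Q4_K_M is an
# approximation -- K-quants mix precisions across tensors.
est_gb() {
  awk -v p="$1" -v bpw="$2" 'BEGIN { printf "%.1f\n", p * bpw / 8 }'
}

est_gb 7 16     # FP16 7B   -> 14.0
est_gb 7 4.85   # Q4_K_M 7B -> 4.2 (table says ~4.1)
est_gb 70 4.85  # Q4_K_M 70B -> 42.4
```

The K-quant estimates land slightly off the table because real GGUF files mix quantization levels per tensor; treat the output as a ballpark, not a promise.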
## The Hidden VRAM Tax
The model weights aren't the only thing eating your VRAM. You also need memory for:
- KV Cache: Stores attention states during generation. Scales with context length × number of layers. A 7B model with 8K context uses ~500 MB-1 GB extra.
- CUDA overhead: ~300-500 MB just for the CUDA runtime.
- OS/display: Your desktop compositor uses 200-500 MB of VRAM.
Real formula:
Total VRAM needed = Model weights + KV cache + CUDA overhead + OS reservation
Example for Llama 3.1 8B Q4_K_M, 8K context:
~4.1 GB + ~0.8 GB + ~0.4 GB + ~0.3 GB = ~5.6 GB
This is why a 4 GB quantized model doesn't actually run on a 4 GB GPU.
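Where does the KV-cache term come from? Its size is roughly 2 (K and V) × layers × context length × KV heads × head dimension × bytes per value. A sketch using Llama 3.1 8B's architecture (32 layers, 8 KV heads under grouped-query attention, head dimension 128) at FP16:

```shell
#!/bin/sh
# Rough FP16 KV-cache size: 2 (K+V) * layers * ctx * kv_heads
# * head_dim * 2 bytes. Architecture numbers in the example call
# are Llama 3.1 8B's (GQA: 8 KV heads, head dim 128).
kv_cache_gb() {  # args: layers ctx kv_heads head_dim
  awk -v l="$1" -v c="$2" -v h="$3" -v d="$4" \
    'BEGIN { printf "%.2f\n", 2 * l * c * h * d * 2 / 1024^3 }'
}

kv_cache_gb 32 8192 8 128   # Llama 3.1 8B @ 8K context -> 1.00 GiB
```

That ~1 GiB is consistent with the ~0.8 GB figure above; grouped-query attention (8 KV heads instead of 32) is why the cache is 4x smaller than a naive per-head estimate.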
## Practical Ollama Commands for VRAM Management
Ollama handles most of this automatically, but you can tune it:
```bash
# Check which models are loaded and their VRAM usage
ollama ps

# Set context size (lower = less VRAM) from inside the REPL
ollama run llama3.1
>>> /set parameter num_ctx 4096

# Force CPU-only inference (when GPU VRAM is full)
>>> /set parameter num_gpu 0

# Partial GPU offloading — put some layers on GPU, rest on CPU
>>> /set parameter num_gpu 20

# Set how long the model stays in VRAM (default: 5 min)
ollama run llama3.1 --keepalive 10m

# Unload a model from VRAM immediately
curl http://localhost:11434/api/generate -d '{"model": "llama3.1", "keep_alive": 0}'
```
## GPU Layer Splitting Strategy
When a model doesn't fit entirely in VRAM, you split it between GPU and CPU. The key insight: generation speed scales roughly with the fraction of layers on the GPU, so offload as many layers as fit while leaving headroom for the KV cache.
```bash
# The server log reports layer counts when a model loads,
# e.g. "offloaded 24/33 layers to GPU" (systemd install shown)
journalctl -u ollama | grep -i offload

# For a 32-layer model on an 8 GB GPU with Q4_K_M:
# start with 24 GPU layers, adjust based on actual usage
ollama run llama3.1:8b-q4_K_M
>>> /set parameter num_gpu 24
```
Monitor with `nvidia-smi` while generating:

```bash
# Watch VRAM usage in real time
watch -n 0.5 nvidia-smi

# Or just the memory numbers, refreshed every second
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1
```
Rule of thumb: If VRAM usage is at 95%+ during generation, reduce GPU layers by 2-3. You want ~500 MB headroom for the KV cache to grow during long conversations.
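That rule of thumb is easy to automate. A minimal sketch that flags low headroom; the check is a pure function of used/total megabytes so it can be tested without a GPU, and the 500 MB threshold is just the figure above:

```shell
#!/bin/sh
# Warn when free VRAM headroom drops below a threshold (MB).
# Pure function of used/total so it runs anywhere; default
# threshold is the ~500 MB figure from the rule of thumb.
check_headroom() {  # args: used_mb total_mb [min_free_mb]
  min="${3:-500}"
  free=$(( $2 - $1 ))
  if [ "$free" -lt "$min" ]; then
    echo "LOW: ${free} MB free, reduce GPU layers by 2-3"
  else
    echo "OK: ${free} MB free"
  fi
}

# Feed it live numbers (nounits strips " MiB" from the CSV):
# nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits |
#   { IFS=', ' read -r used total; check_headroom "$used" "$total"; }
```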
## Multi-Model Workflows Without OOM
Running multiple models simultaneously (say, a chat model + a code model) doubles your VRAM needs. Strategies:
1. Sequential loading with aggressive timeouts:

```bash
# Set in the environment of the Ollama server (ollama serve):
# unload models after 30 seconds of inactivity
OLLAMA_KEEP_ALIVE=30s
```
2. Mix model sizes intentionally. Instead of two 7B models, pair a 7B with a 1.5B:

- Primary chat: `llama3.1:8b-q4_K_M` (~4.1 GB)
- Quick classification: `qwen2.5:1.5b` (~1 GB)
- Total: ~5.1 GB — fits on an 8 GB card
3. CPU offload your secondary model entirely:

```bash
# Run the smaller model on CPU while the main model uses the GPU
ollama run qwen2.5:1.5b
>>> /set parameter num_gpu 0
```
## A/B Model Comparison Without Doubling VRAM
Here's a trick I built into Locally Uncensored — when doing A/B comparisons between models, you don't need both loaded simultaneously. The app sends the same prompt sequentially: load Model A, generate, unload, load Model B, generate, display side-by-side.
Sequential comparison is ~2x slower than parallel, but it means you can compare a 13B model against a 7B model on an 8 GB GPU. On a 24 GB card, you could compare two 70B quantized models that would otherwise need 48 GB together.
If you're doing this manually via the API:

```bash
# Send to model A
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Explain quantum tunneling",
  "keep_alive": 0
}'

# Model A unloads, then send to model B
curl http://localhost:11434/api/generate -d '{
  "model": "gemma2:9b",
  "prompt": "Explain quantum tunneling",
  "keep_alive": 0
}'
```
The `keep_alive: 0` is crucial — it tells Ollama to unload the model immediately after generation.
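To run the same comparison over any list of models, wrap those calls in a loop. A sketch assuming Ollama's default endpoint; the payload builder is split out so the JSON can be checked without a running server:

```shell
#!/bin/sh
# Sequential A/B comparison: each model loads, generates, and
# unloads ("keep_alive": 0) before the next starts, so peak VRAM
# is one model at a time. Assumes the default endpoint on :11434.
ab_payload() {  # args: model prompt
  printf '{"model": "%s", "prompt": "%s", "stream": false, "keep_alive": 0}' "$1" "$2"
}

ab_compare() {  # args: prompt model...
  prompt="$1"; shift
  for model in "$@"; do
    echo "=== $model ==="
    ab_payload "$model" "$prompt" |
      curl -s http://localhost:11434/api/generate -d @-
    echo
  done
}

# Usage (requires a running Ollama server):
# ab_compare "Explain quantum tunneling" llama3.1:8b gemma2:9b
```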
## The VRAM Ladder: What to Run on Each GPU Tier
Based on real-world testing:
4-6 GB VRAM (GTX 1660, RTX 3050):
- 7B models at Q3_K_M or Q4_K_M with 2-4K context
- Stick to single-model workflows
- Consider CPU inference for anything bigger
8 GB VRAM (RTX 4060, RTX 3070):
- 7-8B models at Q4_K_M-Q6_K with 8K context
- 13B models at Q3_K_M with reduced context
- Sweet spot for most home users
12 GB VRAM (RTX 4070, RTX 3060 12GB):
- 13B models at Q4_K_M-Q5_K_M with 8K context
- 7B models at Q8_0 (near-lossless)
- Can run some 30B models at Q3_K_M
16 GB VRAM (RTX 4070 Ti, RTX 5060 Ti):
- 30B models at Q4_K_M with 8K context
- 13B models at Q6_K with 16K context
- Multi-model setups start becoming viable
24 GB VRAM (RTX 4090, RTX 3090):
- 70B models at Q4_K_M with 4-8K context
- 30B models at Q6_K with full context
- Comfortable multi-model workflows
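The ladder reduces to a small lookup. A sketch mapping available VRAM (in GB) to a starting recommendation, using the tier thresholds above; pick the tier at or below your card and tune from there:

```shell
#!/bin/sh
# Map available VRAM (GB, integer) to a starting model/quant
# recommendation. Thresholds mirror the article's ladder.
recommend() {  # arg: vram_gb
  if   [ "$1" -ge 24 ]; then echo "70B @ Q4_K_M"
  elif [ "$1" -ge 16 ]; then echo "30B @ Q4_K_M"
  elif [ "$1" -ge 12 ]; then echo "13B @ Q4_K_M-Q5_K_M"
  elif [ "$1" -ge 8  ]; then echo "7-8B @ Q4_K_M-Q6_K"
  else                       echo "7B @ Q3_K_M-Q4_K_M, short context"
  fi
}

recommend 8    # -> 7-8B @ Q4_K_M-Q6_K
recommend 24   # -> 70B @ Q4_K_M
```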
## Monitoring Script
Save this as `vram-watch.sh`:

```bash
#!/bin/bash
# Monitor VRAM + Ollama loaded models
while true; do
  clear
  echo "=== GPU VRAM ==="
  nvidia-smi --query-gpu=name,memory.used,memory.total,utilization.gpu \
    --format=csv,noheader,nounits |
    awk -F', ' '{printf "%s: %s/%s MB (GPU: %s%%)\n", $1, $2, $3, $4}'
  echo ""
  echo "=== Loaded Models ==="
  ollama ps 2>/dev/null || echo "Ollama not running"
  echo ""
  echo "Press Ctrl+C to exit"
  sleep 2
done
```
## Key Takeaways
- Always account for the hidden tax — model size on disk ≠ VRAM needed at runtime.
- Q4_K_M is your default quantization — best quality/size ratio for consumer GPUs.
- Partial GPU offloading is underused — 20 layers on GPU + rest on CPU beats full CPU inference by 3-5x.
- Manage model lifetimes — `keep_alive` prevents models from squatting on your VRAM.
- Monitor actively — `nvidia-smi` and `ollama ps` are your best friends.
The local AI space is moving fast. Models are getting more efficient (Gemma 2 is wild for its size), quantization methods are improving, and tools like Ollama keep abstracting away the complexity. Understanding VRAM management means you'll always know how to squeeze maximum performance out of whatever hardware you have.
If you want a GUI that handles all of this automatically — model management, VRAM-aware loading, A/B comparisons — check out Locally Uncensored. It's MIT-licensed and built specifically for running local AI without the headaches.
What GPU are you running local models on? Drop your setup in the comments — I'm always curious what hardware people are working with.