Thurmon Demich

Posted on • Originally published at bestgpuforllm.com

Ollama VRAM Requirements: Complete Guide for 2026

This article was originally published on Best GPU for LLM. The full version with interactive tools, FAQ, and live pricing is on the original site.

Quick answer: Ollama automatically selects quantization based on your available VRAM. For 7B models, you need at least 8GB VRAM. For 13B models, 12-16GB. For 70B models, 48GB+ or accept heavy CPU offloading.

See the recommended pick on the original guide

How Ollama uses VRAM

When you run ollama run llama3, Ollama loads the model weights into GPU memory. If the model does not fit entirely, Ollama offloads remaining layers to system RAM, which dramatically slows inference.

Key facts about Ollama's VRAM usage in 2026:

  • Ollama uses GGUF quantized models by default (Q4_K_M for most)
  • The default ollama pull downloads a Q4_K_M variant unless you specify otherwise
  • KV cache for context uses additional VRAM beyond the model weights
  • Running ollama run with a model already loaded reuses the same VRAM allocation
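As a rough rule of thumb, weight size is parameter count times bits-per-weight divided by eight. The sketch below uses approximate bits-per-weight figures for common GGUF quants — illustrative values chosen for this example, not exact numbers from the GGUF spec:

```python
# Rough VRAM estimate for GGUF model weights: params x bits-per-weight / 8.
# Bits-per-weight values below are approximations for illustration only.
BITS_PER_WEIGHT = {
    "Q2_K": 2.6,
    "Q3_K_M": 3.9,
    "Q4_K_M": 4.8,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
    "FP16": 16.0,
}

def weight_vram_gb(params_billions: float, quant: str) -> float:
    """Approximate weight size in GiB for a parameter count and quant level."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billions * 1e9 * bits / 8 / (1024 ** 3)

print(round(weight_vram_gb(8, "Q4_K_M"), 1))   # Llama 3.1 8B at Q4_K_M
print(round(weight_vram_gb(70, "Q4_K_M"), 1))  # Llama 3.1 70B at Q4_K_M
```

The numbers land a little below the figures in the tables that follow, because Ollama's real footprint adds KV cache and runtime overhead on top of the raw weights.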

VRAM chart available at the original article

VRAM requirements by model

Small models (1B-3B parameters)

| Model | Default Quant | VRAM Used | Min GPU |
| --- | --- | --- | --- |
| Llama 3.2 1B | Q8_0 | ~1.5GB | Any 4GB GPU |
| Llama 3.2 3B | Q4_K_M | ~2.5GB | Any 4GB GPU |
| Phi-3.5 Mini (3.8B) | Q4_K_M | ~3GB | Any 4GB GPU |
| Gemma 2 2B | Q4_K_M | ~2GB | Any 4GB GPU |
| Qwen 2.5 3B | Q4_K_M | ~2.5GB | Any 4GB GPU |

These models run on virtually any modern GPU. Even a GTX 1650 with 4GB handles them fine.

Medium models (7B-9B parameters)

| Model | Default Quant | VRAM Used | Min GPU | Recommended GPU |
| --- | --- | --- | --- | --- |
| Llama 3.1 8B | Q4_K_M | ~5.5GB | 8GB | RTX 4060 Ti 16GB |
| Mistral 7B v0.3 | Q4_K_M | ~5GB | 8GB | RTX 3060 12GB |
| Gemma 2 9B | Q4_K_M | ~6GB | 8GB | RTX 3060 12GB |
| Qwen 2.5 7B | Q4_K_M | ~5GB | 8GB | RTX 3060 12GB |
| DeepSeek-R1 8B | Q4_K_M | ~5.5GB | 8GB | RTX 4060 Ti 16GB |
| Llama 3.1 8B | Q8_0 | ~9GB | 12GB | RTX 4060 Ti 16GB |
| Llama 3.1 8B | FP16 | ~16GB | 16GB | RTX 4060 Ti 16GB |

At Q4_K_M, all 7B-9B models fit on 8GB cards. However, 8GB leaves almost no room for context. A 12-16GB card gives much better real-world performance. For a model-specific deep dive, see how much VRAM does Llama 3 8B need?

Large models (13B-14B parameters)

| Model | Default Quant | VRAM Used | Min GPU | Recommended GPU |
| --- | --- | --- | --- | --- |
| Llama 2 13B | Q4_K_M | ~8.5GB | 12GB | RTX 4060 Ti 16GB |
| CodeLlama 13B | Q4_K_M | ~8.5GB | 12GB | RTX 4060 Ti 16GB |
| Phi-3 Medium 14B | Q4_K_M | ~9GB | 12GB | RTX 4060 Ti 16GB |
| Qwen 2.5 14B | Q4_K_M | ~9GB | 12GB | RTX 4060 Ti 16GB |
| Llama 2 13B | Q8_0 | ~14.5GB | 16GB | RTX 5070 Ti |

The 16GB sweet spot: an RTX 4060 Ti 16GB or RTX 5070 Ti handles any 13B-14B model at Q4-Q8 with room for context.

XL models (30B-34B parameters)

| Model | Default Quant | VRAM Used | Min GPU | Recommended GPU |
| --- | --- | --- | --- | --- |
| CodeLlama 34B | Q4_K_M | ~20GB | 24GB | RTX 4090 |
| Yi 34B | Q4_K_M | ~20GB | 24GB | RTX 4090 |
| Qwen 2.5 32B | Q4_K_M | ~19GB | 24GB | RTX 4090 |
| DeepSeek-R1 32B | Q4_K_M | ~19GB | 24GB | RTX 4090 |
| CodeLlama 34B | Q3_K_M | ~16GB | 24GB | RTX 4090 |

24GB is the minimum for 34B models. The RTX 4090 and used RTX 3090 are your options here.

XXL models (70B+ parameters)

| Model | Default Quant | VRAM Used | Min GPU | Recommended GPU |
| --- | --- | --- | --- | --- |
| Llama 3.1 70B | Q4_K_M | ~42GB | 48GB+ | 2x RTX 4090 |
| Qwen 2.5 72B | Q4_K_M | ~43GB | 48GB+ | 2x RTX 4090 |
| DeepSeek-R1 70B | Q4_K_M | ~42GB | 48GB+ | 2x RTX 4090 |
| Llama 3.1 70B | Q3_K_M | ~33GB | 48GB | RTX 5090 (tight) |
| Llama 3.1 70B | Q2_K | ~27GB | 32GB | RTX 5090 |

70B models do not fit on any single consumer GPU at good quantization levels. The RTX 5090 can squeeze in a Q2_K-Q3_K variant, but quality suffers. For serious 70B usage, plan for dual GPUs or cloud.
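The dual-GPU arithmetic can be sketched in a few lines. The per-card headroom figure for KV cache and CUDA overhead is a rough working assumption, not a measured value:

```python
import math

def gpus_needed(model_vram_gb: float, per_gpu_vram_gb: float,
                headroom_gb: float = 3.0) -> int:
    """How many identical GPUs are needed to hold the weights, reserving
    headroom_gb per card for KV cache and runtime overhead (assumed figure)."""
    usable = per_gpu_vram_gb - headroom_gb
    return math.ceil(model_vram_gb / usable)

# Llama 3.1 70B at Q4_K_M (~42GB of weights) on 24GB cards:
print(gpus_needed(42, 24))  # two RTX 4090s
```

Note this ignores the small extra cost of splitting a model across cards; it is a sizing estimate, not a performance model.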

GPU recommendation summary

| Your Target | Best GPU | Price | Why |
| --- | --- | --- | --- |
| 7B models | RTX 3060 12GB | ~$250 used | Cheap, 12GB is plenty |
| 7B-13B models | RTX 4060 Ti 16GB | ~$400 | 16GB handles everything up to 14B |
| 13B-34B models | RTX 4090 | ~$1,600 | 24GB for 34B at Q4 |
| 34B comfortable | RTX 5090 | ~$2,000 | 32GB with room for context |
| 70B models | 2x RTX 4090 or cloud | ~$3,200+ | No single consumer card suffices |

See the recommended pick on the original guide

Which GPU should you buy for Ollama?

If you run small models (1B-3B) for lightweight tasks, any 4GB+ GPU works. No need to upgrade.

If you run 7B-13B models for chat, coding, or writing, a 16GB card is the sweet spot. The RTX 4060 Ti 16GB ($400) handles every model in this range at Q4-Q8 with room for context. Upgrade to the RTX 4070 Ti Super ($700) if you want faster token generation.

If you run 34B models like CodeLlama 34B or DeepSeek-R1 32B, you need 24GB. The RTX 4090 ($1,600) is the go-to card. A used RTX 3090 ($850) works too if you accept slower inference.

If you want 70B models, no single consumer GPU is enough at good quantization. Plan for dual RTX 4090s or use cloud GPUs.

Common mistakes with Ollama VRAM

Not accounting for KV cache — your model fits in VRAM at load time, then runs out of memory mid-conversation because the KV cache grows with every token of context. Always leave 2-4GB of headroom beyond the model's base size.
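To see why the cache grows, here is a back-of-the-envelope estimate. The layer, head, and dimension numbers below match a Llama 3.1 8B-style config (32 layers, 8 KV heads via grouped-query attention, head dimension 128) with an fp16 cache — all stated assumptions for illustration:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: one K and one V vector per layer, per KV head,
    per token in context. Assumes an fp16 cache (2 bytes per element)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

gib = 1024 ** 3
# Llama 3.1 8B-style config:
print(kv_cache_bytes(32, 8, 128, 8192) / gib)    # at 8k context
print(kv_cache_bytes(32, 8, 128, 131072) / gib)  # at full 128k context
```

At 8k context the cache is about 1 GiB — manageable. At the model's full 128k context it balloons to 16 GiB, more than the weights themselves, which is why long chats can crash a card that loaded the model comfortably.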

Running multiple models simultaneously — Ollama keeps models loaded in VRAM by default. If you pull and run a second model without stopping the first, both compete for VRAM. Use ollama stop to unload unused models.

Choosing Q2_K to squeeze a larger model — Dropping to Q2_K quantization to fit a 70B model on 32GB sounds clever, but the quality loss is severe. You are better off running a 34B model at Q6_K than a 70B at Q2_K.
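A quick size check makes the point. Using rough bits-per-weight approximations (illustrative figures, not exact GGUF numbers), the 70B at Q2_K is not even larger than the 34B at Q6_K — so Q2_K buys you nothing but the quality loss:

```python
# Compare weight sizes (GiB) for the "34B at Q6_K vs 70B at Q2_K" trade-off.
# Bits-per-weight values (6.6 and 2.6) are rough approximations.
gib = 1024 ** 3
size_34b_q6k = 34e9 * 6.6 / 8 / gib
size_70b_q2k = 70e9 * 2.6 / 8 / gib
print(round(size_34b_q6k, 1))  # ~26 GiB
print(round(size_70b_q2k, 1))  # ~21 GiB
```

Both fit on a 32GB card, so the decision is about output quality, not capacity — and the 34B at Q6_K wins that comparison.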

Ignoring CPU offloading speed — Ollama silently offloads layers to RAM when VRAM runs out. The model "works" but runs 5-10x slower on offloaded layers. Check nvidia-smi to confirm the model is fully GPU-resident.
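To get a feel for how partial offloading hurts, here is a toy throughput model. The baseline tokens-per-second and the CPU-layer slowdown factor are illustrative assumptions, not benchmarks:

```python
def effective_tokens_per_sec(total_layers: int, gpu_layers: int,
                             gpu_tok_s: float = 50.0,
                             cpu_slowdown: float = 8.0) -> float:
    """Toy throughput model for partial CPU offload. Assumes per-layer
    time dominates and CPU-resident layers run cpu_slowdown times slower
    than GPU-resident ones; both figures are illustrative assumptions."""
    t_layer_gpu = 1.0 / (gpu_tok_s * total_layers)  # seconds per GPU layer
    cpu_layers = total_layers - gpu_layers
    t_token = gpu_layers * t_layer_gpu + cpu_layers * t_layer_gpu * cpu_slowdown
    return 1.0 / t_token

print(round(effective_tokens_per_sec(32, 32), 1))  # fully GPU-resident
print(round(effective_tokens_per_sec(32, 24), 1))  # 8 of 32 layers offloaded
```

Under these assumptions, offloading just a quarter of the layers cuts throughput from 50 to roughly 18 tokens per second — the model still "works", but the slowdown is disproportionate to the fraction offloaded.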

Tips for managing VRAM in Ollama

Check actual VRAM usage with nvidia-smi while a model is running. Ollama's reported size does not always include KV cache overhead.

Use /set parameter num_ctx 2048 in an interactive session to cap the context window if you are tight on VRAM. Many newer models ship with much larger default context windows, and a smaller window means a smaller KV cache.

Unload unused models with ollama stop <model>. Ollama keeps models loaded in VRAM by default for faster subsequent runs.

For a deeper dive on VRAM planning, see our VRAM requirements guide. For GPU-specific Ollama performance, check our best GPU for Ollama article. If you have outgrown Ollama and are moving to a multi-user serving stack, our best GPU for vLLM guide covers the additional VRAM headroom PagedAttention requires.

When in doubt, buy more VRAM than you think you need. Models are growing faster than GPU memory, and Ollama makes it too easy to try the next size up.

Continue on Best GPU for LLM for the complete guide with interactive calculators and current GPU prices.
