This article was originally published on Best GPU for LLM. The full version with interactive tools, FAQ, and live pricing is on the original site.
Quick answer: Ollama pulls Q4_K_M quantized GGUF models by default and offloads layers to system RAM when the model does not fit in VRAM. For 7B models, you need at least 8GB of VRAM. For 13B models, 12-16GB. For 70B models, 48GB+ or accept heavy CPU offloading.
See the recommended pick on the original guide
How Ollama uses VRAM
When you run `ollama run llama3`, Ollama loads the model weights into GPU memory. If the model does not fit entirely, Ollama offloads remaining layers to system RAM, which dramatically slows inference.
Key facts about Ollama's VRAM usage in 2026:
- Ollama uses GGUF quantized models by default (Q4_K_M for most)
- The default `ollama pull` downloads a Q4_K_M variant unless you specify otherwise
- KV cache for context uses additional VRAM beyond the model weights
- Running `ollama run` with a model already loaded reuses the same VRAM allocation
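A quick way to see this in practice is to load a model and watch GPU memory directly. This is a minimal sketch, assuming an NVIDIA card and a recent Ollama install:

```bash
# Pull and load a model, then check how much VRAM it actually occupies.
ollama pull llama3:8b                              # default tag resolves to a Q4_K_M GGUF
ollama run llama3:8b "Summarize what GGUF is."     # loads weights plus KV cache into VRAM
nvidia-smi --query-gpu=memory.used --format=csv    # VRAM in use while the model is resident
```

The memory reported here is typically a bit higher than the file size of the model, because the KV cache and runtime buffers sit on top of the weights.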
VRAM chart available at the original article
VRAM requirements by model
Small models (1B-3B parameters)
| Model | Default Quant | VRAM Used | Min GPU |
|---|---|---|---|
| Llama 3.2 1B | Q8_0 | ~1.5GB | Any 4GB GPU |
| Llama 3.2 3B | Q4_K_M | ~2.5GB | Any 4GB GPU |
| Phi-3.5 Mini (3.8B) | Q4_K_M | ~3GB | Any 4GB GPU |
| Gemma 2 2B | Q4_K_M | ~2GB | Any 4GB GPU |
| Qwen 2.5 3B | Q4_K_M | ~2.5GB | Any 4GB GPU |
These models run on virtually any modern GPU. Even a GTX 1650 with 4GB handles them fine.
Medium models (7B-9B parameters)
| Model | Default Quant | VRAM Used | Min GPU | Recommended GPU |
|---|---|---|---|---|
| Llama 3.1 8B | Q4_K_M | ~5.5GB | 8GB | RTX 4060 Ti 16GB |
| Mistral 7B v0.3 | Q4_K_M | ~5GB | 8GB | RTX 3060 12GB |
| Gemma 2 9B | Q4_K_M | ~6GB | 8GB | RTX 3060 12GB |
| Qwen 2.5 7B | Q4_K_M | ~5GB | 8GB | RTX 3060 12GB |
| DeepSeek-R1 8B | Q4_K_M | ~5.5GB | 8GB | RTX 4060 Ti 16GB |
| Llama 3.1 8B | Q8_0 | ~9GB | 12GB | RTX 4060 Ti 16GB |
| Llama 3.1 8B | FP16 | ~16GB | 16GB | RTX 4060 Ti 16GB |
At Q4_K_M, all 7B-9B models fit on 8GB cards. However, 8GB leaves almost no room for context. A 12-16GB card gives much better real-world performance. For a model-specific deep dive, see "How much VRAM does Llama 3 8B need?"
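If you do have 12-16GB, you can request a higher-precision variant explicitly instead of the default Q4_K_M. The tags below follow the Ollama library's usual naming scheme; confirm them against the model's library page before pulling:

```bash
# Tag names are assumptions based on the library's naming pattern -- verify before pulling.
ollama pull llama3.1:8b                   # default Q4_K_M, ~5.5GB: fine for an 8GB card
ollama pull llama3.1:8b-instruct-q8_0     # Q8_0, ~9GB: worth it on a 12-16GB card
```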
Large models (13B-14B parameters)
| Model | Default Quant | VRAM Used | Min GPU | Recommended GPU |
|---|---|---|---|---|
| Llama 2 13B | Q4_K_M | ~8.5GB | 12GB | RTX 4060 Ti 16GB |
| CodeLlama 13B | Q4_K_M | ~8.5GB | 12GB | RTX 4060 Ti 16GB |
| Phi-3 Medium 14B | Q4_K_M | ~9GB | 12GB | RTX 4060 Ti 16GB |
| Qwen 2.5 14B | Q4_K_M | ~9GB | 12GB | RTX 4060 Ti 16GB |
| Llama 2 13B | Q8_0 | ~14.5GB | 16GB | RTX 5070 Ti |
The 16GB sweet spot: an RTX 4060 Ti 16GB or RTX 5070 Ti handles any 13B-14B model at Q4-Q8 with room for context.
XL models (30B-34B parameters)
| Model | Default Quant | VRAM Used | Min GPU | Recommended GPU |
|---|---|---|---|---|
| CodeLlama 34B | Q4_K_M | ~20GB | 24GB | RTX 4090 |
| Yi 34B | Q4_K_M | ~20GB | 24GB | RTX 4090 |
| Qwen 2.5 32B | Q4_K_M | ~19GB | 24GB | RTX 4090 |
| DeepSeek-R1 32B | Q4_K_M | ~19GB | 24GB | RTX 4090 |
| CodeLlama 34B | Q3_K_M | ~16GB | 24GB | RTX 4090 |
24GB is the minimum for 34B models. The RTX 4090 and used RTX 3090 are your options here.
XXL models (70B+ parameters)
| Model | Default Quant | VRAM Used | Min GPU | Recommended GPU |
|---|---|---|---|---|
| Llama 3.1 70B | Q4_K_M | ~42GB | 48GB+ | 2x RTX 4090 |
| Qwen 2.5 72B | Q4_K_M | ~43GB | 48GB+ | 2x RTX 4090 |
| DeepSeek-R1 70B | Q4_K_M | ~42GB | 48GB+ | 2x RTX 4090 |
| Llama 3.1 70B | Q3_K_M | ~33GB | 48GB | RTX 5090 (tight) |
| Llama 3.1 70B | Q2_K | ~27GB | 32GB | RTX 5090 |
70B models do not fit on any single consumer GPU at good quantization levels. The RTX 5090 can squeeze in a Q2_K-Q3_K variant, but quality suffers. For serious 70B usage, plan for dual GPUs or cloud.
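If you still want to experiment with 70B on a single card, you can pull a lower-quant tag explicitly rather than the default. The tags below are assumptions based on the library's naming scheme, and the quality caveat above applies:

```bash
# Assumed library tags -- check the Ollama model page for the exact quants offered.
ollama pull llama3.1:70b                   # default Q4_K_M, ~42GB: dual-GPU or cloud territory
ollama pull llama3.1:70b-instruct-q2_K     # ~27GB: squeezes onto a 32GB RTX 5090, with a real quality hit
```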
GPU recommendation summary
| Your Target | Best GPU | Price | Why |
|---|---|---|---|
| 7B models | RTX 3060 12GB | ~$250 used | Cheap, 12GB is plenty |
| 7B-13B models | RTX 4060 Ti 16GB | ~$400 | 16GB handles everything up to 14B |
| 13B-34B models | RTX 4090 | ~$1,600 | 24GB for 34B at Q4 |
| 34B comfortable | RTX 5090 | ~$2,000 | 32GB with room for context |
| 70B models | 2x RTX 4090 or cloud | ~$3,200+ | No single consumer card suffices |
See the recommended pick on the original guide
Which GPU should you buy for Ollama?
If you run small models (1B-3B) for lightweight tasks, any 4GB+ GPU works. No need to upgrade.
If you run 7B-13B models for chat, coding, or writing, a 16GB card is the sweet spot. The RTX 4060 Ti 16GB ($400) handles every model in this range at Q4-Q8 with room for context. Upgrade to the RTX 4070 Ti Super ($700) if you want faster token generation.
If you run 34B models like CodeLlama 34B or DeepSeek-R1 32B, you need 24GB. The RTX 4090 ($1,600) is the go-to card. A used RTX 3090 ($850) works too if you accept slower inference.
If you want 70B models, no single consumer GPU is enough at good quantization. Plan for dual RTX 4090s or use cloud GPUs.
Common mistakes with Ollama VRAM
Not accounting for KV cache — Your model fits in VRAM, but crashes mid-conversation. The KV cache for context grows as you chat. Always leave 2-4GB of headroom beyond the model's base size.
Running multiple models simultaneously — Ollama keeps models loaded in VRAM by default. If you pull and run a second model without stopping the first, both compete for VRAM. Use `ollama stop` to unload unused models.
Choosing Q2_K to squeeze a larger model — Dropping to Q2_K quantization to fit a 70B model on 32GB sounds clever, but the quality loss is severe. You are better off running a 34B model at Q6_K than a 70B at Q2_K.
Ignoring CPU offloading speed — Ollama silently offloads layers to RAM when VRAM runs out. The model "works" but runs 5-10x slower on offloaded layers. Check `nvidia-smi` to confirm the model is fully GPU-resident.
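To catch silent offloading, check both tools while a model is loaded. The output formats below are approximate:

```bash
# Offloading check -- run while a model is loaded.
ollama ps      # PROCESSOR column: "100% GPU" means fully resident; a CPU/GPU split means layers were offloaded
nvidia-smi     # compare Memory-Usage against the model sizes in the tables above
```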
Tips for managing VRAM in Ollama
Check actual VRAM usage with `nvidia-smi` while a model is running. Ollama's reported size does not always include KV cache overhead.
Use `/set parameter num_ctx 2048` to reduce the context window if you are tight on VRAM. The default is 2048, but some models request more.
Unload unused models with `ollama stop <model>`. Ollama keeps models loaded in VRAM by default for faster subsequent runs.
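One way to make the context-window tip stick across sessions is a small Modelfile that pins `num_ctx`. A minimal sketch; the model name `llama3.1-lowctx` is illustrative:

```bash
# Pin a smaller context window to shrink the KV cache (names are illustrative).
cat > Modelfile <<'EOF'
FROM llama3.1:8b
PARAMETER num_ctx 2048
EOF
ollama create llama3.1-lowctx -f Modelfile   # registers the low-context variant
ollama run llama3.1-lowctx                   # smaller num_ctx, smaller KV cache
ollama stop llama3.1-lowctx                  # unloads it from VRAM when you are done
```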
For a deeper dive on VRAM planning, see our VRAM requirements guide. For GPU-specific Ollama performance, check our best GPU for Ollama article. If you have outgrown Ollama and are moving to a multi-user serving stack, our best GPU for vLLM guide covers the additional VRAM headroom PagedAttention requires.
When in doubt, buy more VRAM than you think you need. Models are growing faster than GPU memory, and Ollama makes it too easy to try the next size up.
Related guides on Best GPU for LLM
- How Much VRAM for Local LLMs in 2026? Full Q4-Q8 Guide
- How to Choose a GPU for Ollama in 2026 (Step Guide)
- Best Quantization for Local LLM in 2026 (Q4 to Q8)
Continue on Best GPU for LLM for the complete guide with interactive calculators and current GPU prices.