This article was originally published on Best GPU for LLM. The full version with interactive tools, FAQ, and live pricing is on the original site.
Quick answer: Ollama pulls Q4_K_M quantized GGUF models by default and offloads layers to system RAM when the model does not fit in VRAM. For 7B models, you need at least 8GB of VRAM. For 13B models, 12-16GB. For 70B models, 48GB+ or accept heavy CPU offloading.
See the recommended pick on the original guide
How Ollama uses VRAM
When you run `ollama run llama3`, Ollama loads the model weights into GPU memory. If the model does not fit entirely, Ollama offloads remaining layers to system RAM, which dramatically slows inference.
Key facts about Ollama's VRAM usage in 2026:
- Ollama uses GGUF quantized models by default (Q4_K_M for most)
- The default `ollama pull` downloads a Q4_K_M variant unless you specify otherwise
- KV cache for context uses additional VRAM beyond the model weights
- Running `ollama run` with a model already loaded reuses the same VRAM allocation
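A quick way to see this in practice is to load a model and watch GPU memory directly. This is a minimal sketch, assuming an NVIDIA card and a recent Ollama install:

```bash
# Pull and load a model, then check how much VRAM it actually occupies.
ollama pull llama3:8b                              # default tag resolves to a Q4_K_M GGUF
ollama run llama3:8b "Summarize what GGUF is."     # loads weights plus KV cache into VRAM
nvidia-smi --query-gpu=memory.used --format=csv    # VRAM in use while the model is resident
```

The memory reported here is typically a bit higher than the file size of the model, because the KV cache and runtime buffers sit on top of the weights.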
VRAM chart available at the original article
VRAM requirements by model
Small models (1B-3B parameters)
| Model | Default Quant | VRAM Used | Min GPU |
|---|---|---|---|
| Llama 3.2 1B | Q8_0 | ~1.5GB | Any 4GB GPU |
| Llama 3.2 3B | Q4_K_M | ~2.5GB | Any 4GB GPU |
| Phi-3.5 Mini (3.8B) | Q4_K_M | ~3GB | Any 4GB GPU |
| Gemma 2 2B | Q4_K_M | ~2GB | Any 4GB GPU |
| Qwen 2.5 3B | Q4_K_M | ~2.5GB | Any 4GB GPU |
These models run on virtually any modern GPU. Even a GTX 1650 with 4GB handles them fine.
Medium models (7B-9B parameters)
| Model | Default Quant | VRAM Used | Min GPU | Recommended GPU |
|---|---|---|---|---|
| Llama 3.1 8B | Q4_K_M | ~5.5GB | 8GB | RTX 4060 Ti 16GB |
| Mistral 7B v0.3 | Q4_K_M | ~5GB | 8GB | RTX 3060 12GB |
| Gemma 2 9B | Q4_K_M | ~6GB | 8GB | RTX 3060 12GB |
| Qwen 2.5 7B | Q4_K_M | ~5GB | 8GB | RTX 3060 12GB |
| DeepSeek-R1 8B | Q4_K_M | ~5.5GB | 8GB | RTX 4060 Ti 16GB |
| Llama 3.1 8B | Q8_0 | ~9GB | 12GB | RTX 4060 Ti 16GB |
| Llama 3.1 8B | FP16 | ~16GB | 16GB | RTX 4060 Ti 16GB |
At Q4_K_M, all 7B-9B models fit on 8GB cards. However, 8GB leaves almost no room for context. A 12-16GB card gives much better real-world performance. For a model-specific deep dive, see "How much VRAM does Llama 3 8B need?"
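If you do have 12-16GB, you can request a higher-precision variant explicitly instead of the default Q4_K_M. The tags below follow the Ollama library's usual naming scheme; confirm them against the model's library page before pulling:

```bash
# Tag names are assumptions based on the library's naming pattern -- verify before pulling.
ollama pull llama3.1:8b                   # default Q4_K_M, ~5.5GB: fine for an 8GB card
ollama pull llama3.1:8b-instruct-q8_0     # Q8_0, ~9GB: worth it on a 12-16GB card
```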
Large models (13B-14B parameters)
| Model | Default Quant | VRAM Used | Min GPU | Recommended GPU |
|---|---|---|---|---|
| Llama 2 13B | Q4_K_M | ~8.5GB | 12GB | RTX 4060 Ti 16GB |
| CodeLlama 13B | Q4_K_M | ~8.5GB | 12GB | RTX 4060 Ti 16GB |
| Phi-3 Medium 14B | Q4_K_M | ~9GB | 12GB | RTX 4060 Ti 16GB |
| Qwen 2.5 14B | Q4_K_M | ~9GB | 12GB | RTX 4060 Ti 16GB |
| Llama 2 13B | Q8_0 | ~14.5GB | 16GB | RTX 5070 Ti |
The 16GB sweet spot: an RTX 4060 Ti 16GB or RTX 5070 Ti handles any 13B-14B model at Q4-Q8 with room for context.
XL models (30B-34B parameters)
| Model | Default Quant | VRAM Used | Min GPU | Recommended GPU |
|---|---|---|---|---|
| CodeLlama 34B | Q4_K_M | ~20GB | 24GB | RTX 4090 |
| Yi 34B | Q4_K_M | ~20GB | 24GB | RTX 4090 |
| Qwen 2.5 32B | Q4_K_M | ~19GB | 24GB | RTX 4090 |
| DeepSeek-R1 32B | Q4_K_M | ~19GB | 24GB | RTX 4090 |
| CodeLlama 34B | Q3_K_M | ~16GB | 24GB | RTX 4090 |
24GB is the minimum for 34B models. The RTX 4090 and used RTX 3090 are your options here.
XXL models (70B+ parameters)
| Model | Default Quant | VRAM Used | Min GPU | Recommended GPU |
|---|---|---|---|---|
| Llama 3.1 70B | Q4_K_M | ~42GB | 48GB+ | 2x RTX 4090 |
| Qwen 2.5 72B | Q4_K_M | ~43GB | 48GB+ | 2x RTX 4090 |
| DeepSeek-R1 70B | Q4_K_M | ~42GB | 48GB+ | 2x RTX 4090 |
| Llama 3.1 70B | Q3_K_M | ~33GB | 48GB | RTX 5090 (tight) |
| Llama 3.1 70B | Q2_K | ~27GB | 32GB | RTX 5090 |
70B models do not fit on any single consumer GPU at good quantization levels. The RTX 5090 can squeeze in a Q2_K-Q3_K variant, but quality suffers. For serious 70B usage, plan for dual GPUs or cloud.
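If you still want to experiment with 70B on a single card, you can pull a lower-quant tag explicitly rather than the default. The tags below are assumptions based on the library's naming scheme, and the quality caveat above applies:

```bash
# Assumed library tags -- check the Ollama model page for the exact quants offered.
ollama pull llama3.1:70b                   # default Q4_K_M, ~42GB: dual-GPU or cloud territory
ollama pull llama3.1:70b-instruct-q2_K     # ~27GB: squeezes onto a 32GB RTX 5090, with a real quality hit
```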
GPU recommendation summary
| Your Target | Best GPU | Price | Why |
|---|---|---|---|
| 7B models | RTX 3060 12GB | ~$250 used | Cheap, 12GB is plenty |
| 7B-13B models | RTX 4060 Ti 16GB | ~$400 | 16GB handles everything up to 14B |
| 13B-34B models | RTX 4090 | ~$1,600 | 24GB for 34B at Q4 |
| 34B comfortable | RTX 5090 | ~$2,000 | 32GB with room for context |
| 70B models | 2x RTX 4090 or cloud | ~$3,200+ | No single consumer card suffices |
See the recommended pick on the original guide
Which GPU should you buy for Ollama?
If you run small models (1B-3B) for lightweight tasks, any 4GB+ GPU works. No need to upgrade.
If you run 7B-13B models for chat, coding, or writing, a 16GB card is the sweet spot. The RTX 4060 Ti 16GB ($400) handles every model in this range at Q4-Q8 with room for context. Upgrade to the RTX 4070 Ti Super ($700) if you want faster token generation.
If you run 34B models like CodeLlama 34B or DeepSeek-R1 32B, you need 24GB. The RTX 4090 ($1,600) is the go-to card. A used RTX 3090 ($850) works too if you accept slower inference.
If you want 70B models, no single consumer GPU is enough at good quantization. Plan for dual RTX 4090s or use cloud GPUs.
Common mistakes with Ollama VRAM
Not accounting for KV cache — Your model fits in VRAM, but crashes mid-conversation. The KV cache for context grows as you chat. Always leave 2-4GB of headroom beyond the model's base size.
Running multiple models simultaneously — Ollama keeps models loaded in VRAM by default. If you pull and run a second model without stopping the first, both compete for VRAM. Use `ollama stop` to unload unused models.
Choosing Q2_K to squeeze a larger model — Dropping to Q2_K quantization to fit a 70B model on 32GB sounds clever, but the quality loss is severe. You are better off running a 34B model at Q6_K than a 70B at Q2_K.
Ignoring CPU offloading speed — Ollama silently offloads layers to RAM when VRAM runs out. The model "works" but runs 5-10x slower on offloaded layers. Check `nvidia-smi` to confirm the model is fully GPU-resident.
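To catch silent offloading, check both tools while a model is loaded. The output formats below are approximate:

```bash
# Offloading check -- run while a model is loaded.
ollama ps      # PROCESSOR column: "100% GPU" means fully resident; a CPU/GPU split means layers were offloaded
nvidia-smi     # compare Memory-Usage against the model sizes in the tables above
```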
Tips for managing VRAM in Ollama
Check actual VRAM usage with `nvidia-smi` while a model is running. Ollama's reported size does not always include KV cache overhead.
Use `/set parameter num_ctx 2048` to reduce the context window if you are tight on VRAM. The default is 2048, but some models request more.
Unload unused models with `ollama stop <model>`. Ollama keeps models loaded in VRAM by default for faster subsequent runs.
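One way to make the context-window tip stick across sessions is a small Modelfile that pins `num_ctx`. A minimal sketch; the model name `llama3.1-lowctx` is illustrative:

```bash
# Pin a smaller context window to shrink the KV cache (names are illustrative).
cat > Modelfile <<'EOF'
FROM llama3.1:8b
PARAMETER num_ctx 2048
EOF
ollama create llama3.1-lowctx -f Modelfile   # registers the low-context variant
ollama run llama3.1-lowctx                   # smaller num_ctx, smaller KV cache
ollama stop llama3.1-lowctx                  # unloads it from VRAM when you are done
```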
For a deeper dive on VRAM planning, see our VRAM requirements guide. For GPU-specific Ollama performance, check our best GPU for Ollama article. If you have outgrown Ollama and are moving to a multi-user serving stack, our best GPU for vLLM guide covers the additional VRAM headroom PagedAttention requires.
When in doubt, buy more VRAM than you think you need. Models are growing faster than GPU memory, and Ollama makes it too easy to try the next size up.
Related guides on Best GPU for LLM
- How Much VRAM for Local LLMs in 2026? Full Q4-Q8 Guide
- How to Choose a GPU for Ollama in 2026 (Step Guide)
- Best Quantization for Local LLM in 2026 (Q4 to Q8)
Continue on Best GPU for LLM for the complete guide with interactive calculators and current GPU prices.