DEV Community

Cover image for How Much VRAM for a 70B LLM in 2026? (Q4-Q8 Table)
Thurmon Demich
Thurmon Demich

Posted on • Originally published at bestgpuforllm.com

How Much VRAM for a 70B LLM in 2026? (Q4-Q8 Table)

Cross-posted from Best GPU for LLM — visit the original for our VRAM calculator, GPU comparison table, and current Amazon pricing.

Quick answer: At Q4_K_M quantization, a 70B model needs approximately 40GB of VRAM — requiring two 24GB GPUs (like two RTX 4090s) or a single high-end workstation card. At FP16, you're looking at 140GB minimum.

See the recommended pick on the original guide

The exact numbers at every quantization level

70B models are large. The actual VRAM requirement depends heavily on how aggressively you quantize the weights. Here is the breakdown:

Quantization Size on Disk VRAM Required Notes
FP16 (full precision) ~140GB ~145GB+ Requires multi-GPU workstation
Q8_0 ~70GB ~75GB 4x 24GB GPUs minimum
Q6_K ~57GB ~60GB Still needs multi-GPU
Q5_K_M ~49GB ~52GB 2x 24GB + some CPU offload
Q4_K_M ~40GB ~42GB 2x 24GB GPUs (tight but works)
Q3_K_M ~31GB ~33GB RTX 5090 (32GB) fits — barely
Q2_K ~25GB ~27GB Single RTX 4090 — significant quality loss
IQ2_M ~22GB ~24GB Single RTX 4090 with headroom

The inflection point most people care about: Q4_K_M at ~40GB is the minimum that keeps quality acceptable. Below Q4, you start losing coherence on complex reasoning tasks.

VRAM chart available at the original article

What GPU setups actually work

FP16 (140GB+)

This requires an A100 80GB pair, H100 pair, or a workstation with multiple A6000 Ada (48GB each). Not practical for home use. Use cloud GPU if you need FP16 accuracy for production.

Q8_0 (75GB)

Four RTX 4090s in a multi-GPU setup, or two H100 PCIe 80GB cards. Overkill for most users. Cloud is cheaper for occasional inference at this quality level.

Q5_K_M (52GB)

Two RTX 4090s (48GB combined) get close but need some CPU offload for the remaining 4GB. Expect a small speed penalty from the offloaded layers.

Q4_K_M (42GB) — the practical target

Two RTX 4090s (48GB total) run this comfortably with 6GB to spare for the KV cache. This is the standard setup for 70B inference at home. Tokens run at roughly 8-12 tok/s combined, which is conversational.

Q3_K_M (33GB)

The RTX 5090 with 32GB VRAM fits this but runs hot on memory bandwidth. Expect 1-2GB of CPU offload depending on your system. Speed is reasonable at ~15-18 tok/s.

IQ2_M and below (22-25GB)

A single RTX 4090 (24GB) can technically run these ultra-compressed variants. At IQ2 quality, a 70B model performs comparably to a well-quantized 13B model — you lose the reason you wanted 70B in the first place.

Which GPU should YOU buy?

Running 70B on a budget: Get two RTX 4060 Ti 16GB cards ($400 each, $800 total = 32GB). You can run Q3_K_M quality, which gives you the flavor of a 70B model without the full price. Use llama.cpp with tensor split.

Running 70B properly: Two RTX 4090s ($1,600 each, ~$3,200 total = 48GB). This is the gold standard for home 70B inference — Q4_K_M fits with VRAM to spare. Most guides use this setup.

Single-card 70B (compromised): The RTX 5090 at 32GB lets you run Q3_K_M without multi-GPU complexity, at ~$2,000. Simpler setup, but lower quality than dual 4090s.

Need better than Q4? Rent an A100 80GB pair on RunPod for Q8_0 quality. At 70B scale, cloud often beats a home multi-GPU build on cost per inference.

See the recommended pick on the original guide

VRAM math explained

Why does a 70B model need ~40GB at Q4? The calculation:

  • 70 billion parameters × 0.5 bytes per parameter (Q4 ≈ 4 bits = 0.5 bytes) = 35GB base model size
  • + KV cache for context: At 4K context, add ~2-4GB. At 16K context, add ~8-12GB.
  • + Overhead (activations, runtime): ~1-2GB

So at Q4_K_M with 4K context: ~38-42GB. At 16K context: ~48-50GB. This is why two RTX 4090s (48GB) get tight at longer contexts — you may need to cap context length or drop one step in quantization.

Common mistakes to avoid

  • Assuming Q4 = half the VRAM of FP16. It's closer to 28% (40GB vs 140GB). The math surprises people.
  • Forgetting KV cache. Your VRAM budget is not just the model weights. Long conversations eat into the headroom fast. Always leave 4-8GB for the cache.
  • Buying a single 24GB GPU to run 70B. You will be stuck at IQ2 quality, which defeats the purpose of using a 70B model. Save up for a second 4090 or start with a well-quantized 34B.
  • Ignoring tensor parallel overhead. llama.cpp with -ts 1,1 (tensor split) adds some communication overhead between GPUs. Expect 5-10% lower throughput versus theoretical peak.
  • Skipping the Q4_K_M vs Q5_K_M comparison. For tasks involving multi-step reasoning, Q5 is noticeably better. If your two-GPU setup has 48GB, you have headroom — use it.

Final verdict

Setup Max Quantization Quality Tokens/s Cost
1x RTX 4090 (24GB) IQ2_M Poor ~12 ~$1,600
2x RTX 4060 Ti 16GB (32GB) Q3_K_M Acceptable ~8 ~$800
1x RTX 5090 (32GB) Q3_K_M Acceptable ~15 ~$2,000
2x RTX 4090 (48GB) Q5_K_M Good ~10 ~$3,200
2x RTX 5090 (64GB) Q8_0 Excellent ~20 ~$4,000

For most home users, the two RTX 4090 setup running Q4_K_M is the practical target. It costs roughly $3,200 in hardware and gives you a genuinely capable 70B model for open-ended reasoning, long-form writing, and research tasks.

If you want single-card simplicity, consider whether a well-quantized 34B model — which fits on one RTX 4090 — might meet your needs. For full VRAM planning across all model sizes, see our VRAM requirements guide. And if you're specifically running Llama 3.1 70B or Llama 3.3 70B, the best GPU for Llama 70B guide covers those models in detail. For multi-GPU build advice, see best multi-GPU setup for LLM inference.

See the recommended pick on the original guide

Related guides on Best GPU for LLM


Continue on Best GPU for LLM for the complete guide with interactive calculators and current GPU prices.

Top comments (0)