Thurmon Demich

Posted on Jun 27 • Originally published at bestgpuforllm.com

How Much VRAM for a 70B LLM in 2026? (Q4-Q8 Table)

#vram #70b #quantization #guide

Cross-posted from Best GPU for LLM — visit the original for our VRAM calculator, GPU comparison table, and current Amazon pricing.

Quick answer: At Q4_K_M quantization, a 70B model needs approximately 40GB of VRAM — requiring two 24GB GPUs (like two RTX 4090s) or a single high-end workstation card. At FP16, you're looking at 140GB minimum.

The exact numbers at every quantization level

70B models are large. The actual VRAM requirement depends heavily on how aggressively you quantize the weights. Here is the breakdown:

Quantization	Size on Disk	VRAM Required	Notes
FP16 (full precision)	~140GB	~145GB+	Requires multi-GPU workstation
Q8_0	~70GB	~75GB	4x 24GB GPUs minimum
Q6_K	~57GB	~60GB	Still needs multi-GPU
Q5_K_M	~49GB	~52GB	2x 24GB + some CPU offload
Q4_K_M	~40GB	~42GB	2x 24GB GPUs (tight but works)
Q3_K_M	~31GB	~33GB	RTX 5090 (32GB) fits — barely
Q2_K	~25GB	~27GB	Single RTX 4090 — significant quality loss
IQ2_M	~22GB	~24GB	Single RTX 4090 with headroom

The inflection point most people care about: Q4_K_M at ~40GB is the minimum that keeps quality acceptable. Below Q4, you start losing coherence on complex reasoning tasks.

VRAM chart available at the original article

What GPU setups actually work

FP16 (140GB+)

This requires an A100 80GB pair, H100 pair, or a workstation with multiple A6000 Ada (48GB each). Not practical for home use. Use cloud GPU if you need FP16 accuracy for production.

Q8_0 (75GB)

Four RTX 4090s in a multi-GPU setup, or two H100 PCIe 80GB cards. Overkill for most users. Cloud is cheaper for occasional inference at this quality level.

Q5_K_M (52GB)

Two RTX 4090s (48GB combined) get close but need some CPU offload for the remaining 4GB. Expect a small speed penalty from the offloaded layers.

Q4_K_M (42GB) — the practical target

Two RTX 4090s (48GB total) run this comfortably with 6GB to spare for the KV cache. This is the standard setup for 70B inference at home. Tokens run at roughly 8-12 tok/s combined, which is conversational.

Q3_K_M (33GB)

The RTX 5090 with 32GB VRAM fits this but runs hot on memory bandwidth. Expect 1-2GB of CPU offload depending on your system. Speed is reasonable at ~15-18 tok/s.

IQ2_M and below (22-25GB)

A single RTX 4090 (24GB) can technically run these ultra-compressed variants. At IQ2 quality, a 70B model performs comparably to a well-quantized 13B model — you lose the reason you wanted 70B in the first place.

Which GPU should YOU buy?

Running 70B on a budget: Get two RTX 4060 Ti 16GB cards ($400 each, $800 total = 32GB). You can run Q3_K_M quality, which gives you the flavor of a 70B model without the full price. Use llama.cpp with tensor split.

Running 70B properly: Two RTX 4090s ($1,600 each, ~$3,200 total = 48GB). This is the gold standard for home 70B inference — Q4_K_M fits with VRAM to spare. Most guides use this setup.

Single-card 70B (compromised): The RTX 5090 at 32GB lets you run Q3_K_M without multi-GPU complexity, at ~$2,000. Simpler setup, but lower quality than dual 4090s.

Need better than Q4? Rent an A100 80GB pair on RunPod for Q8_0 quality. At 70B scale, cloud often beats a home multi-GPU build on cost per inference.

VRAM math explained

Why does a 70B model need ~40GB at Q4? The calculation:

70 billion parameters × 0.5 bytes per parameter (Q4 ≈ 4 bits = 0.5 bytes) = 35GB base model size
+ KV cache for context: At 4K context, add ~2-4GB. At 16K context, add ~8-12GB.
+ Overhead (activations, runtime): ~1-2GB

So at Q4_K_M with 4K context: ~38-42GB. At 16K context: ~48-50GB. This is why two RTX 4090s (48GB) get tight at longer contexts — you may need to cap context length or drop one step in quantization.

Common mistakes to avoid

Assuming Q4 = half the VRAM of FP16. It's closer to 28% (40GB vs 140GB). The math surprises people.
Forgetting KV cache. Your VRAM budget is not just the model weights. Long conversations eat into the headroom fast. Always leave 4-8GB for the cache.
Buying a single 24GB GPU to run 70B. You will be stuck at IQ2 quality, which defeats the purpose of using a 70B model. Save up for a second 4090 or start with a well-quantized 34B.
Ignoring tensor parallel overhead. llama.cpp with -ts 1,1 (tensor split) adds some communication overhead between GPUs. Expect 5-10% lower throughput versus theoretical peak.
Skipping the Q4_K_M vs Q5_K_M comparison. For tasks involving multi-step reasoning, Q5 is noticeably better. If your two-GPU setup has 48GB, you have headroom — use it.

Final verdict

Setup	Max Quantization	Quality	Tokens/s	Cost
1x RTX 4090 (24GB)	IQ2_M	Poor	~12	~$1,600
2x RTX 4060 Ti 16GB (32GB)	Q3_K_M	Acceptable	~8	~$800
1x RTX 5090 (32GB)	Q3_K_M	Acceptable	~15	~$2,000
2x RTX 4090 (48GB)	Q5_K_M	Good	~10	~$3,200
2x RTX 5090 (64GB)	Q8_0	Excellent	~20	~$4,000

For most home users, the two RTX 4090 setup running Q4_K_M is the practical target. It costs roughly $3,200 in hardware and gives you a genuinely capable 70B model for open-ended reasoning, long-form writing, and research tasks.

If you want single-card simplicity, consider whether a well-quantized 34B model — which fits on one RTX 4090 — might meet your needs. For full VRAM planning across all model sizes, see our VRAM requirements guide. And if you're specifically running Llama 3.1 70B or Llama 3.3 70B, the best GPU for Llama 70B guide covers those models in detail. For multi-GPU build advice, see best multi-GPU setup for LLM inference.

Related guides on Best GPU for LLM

Continue on Best GPU for LLM for the complete guide with interactive calculators and current GPU prices.

DEV Community