Cross-posted from Best GPU for LLM — visit the original for our VRAM calculator, GPU comparison table, and current Amazon pricing.
Quick answer: At Q4_K_M quantization, a 70B model needs approximately 40GB of VRAM — requiring two 24GB GPUs (like two RTX 4090s) or a single high-end workstation card. At FP16, you're looking at 140GB minimum.
See the recommended pick on the original guide
The exact numbers at every quantization level
70B models are large. The actual VRAM requirement depends heavily on how aggressively you quantize the weights. Here is the breakdown:
| Quantization | Size on Disk | VRAM Required | Notes |
|---|---|---|---|
| FP16 (full precision) | ~140GB | ~145GB+ | Requires multi-GPU workstation |
| Q8_0 | ~70GB | ~75GB | 4x 24GB GPUs minimum |
| Q6_K | ~57GB | ~60GB | Still needs multi-GPU |
| Q5_K_M | ~49GB | ~52GB | 2x 24GB + some CPU offload |
| Q4_K_M | ~40GB | ~42GB | 2x 24GB GPUs (tight but works) |
| Q3_K_M | ~31GB | ~33GB | RTX 5090 (32GB) with 1-2GB CPU offload |
| Q2_K | ~25GB | ~27GB | Single RTX 4090 with ~3GB CPU offload; significant quality loss |
| IQ2_M | ~22GB | ~24GB | Single RTX 4090, no headroom |
The inflection point most people care about: Q4_K_M at ~40GB is the minimum that keeps quality acceptable. Below Q4, you start losing coherence on complex reasoning tasks.
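As a rough illustration, the table above can be turned into a lookup that returns the largest quantization level fitting a given VRAM total. This is only a sketch: the function name `best_quant_for` is hypothetical, the figures are copied from the "VRAM Required" column, and it ignores CPU offload and long-context KV cache growth.

```python
# VRAM figures (GB) copied from the "VRAM Required" column, in table order (largest first).
VRAM_REQUIRED_GB = [
    ("FP16", 145),
    ("Q8_0", 75),
    ("Q6_K", 60),
    ("Q5_K_M", 52),
    ("Q4_K_M", 42),
    ("Q3_K_M", 33),
    ("Q2_K", 27),
    ("IQ2_M", 24),
]


def best_quant_for(total_vram_gb):
    """Return the first (largest) level whose listed VRAM need fits, or None."""
    for name, need_gb in VRAM_REQUIRED_GB:
        if need_gb <= total_vram_gb:
            return name
    return None


for label, vram_gb in [("1x 24GB", 24), ("2x 24GB", 48), ("4x 24GB", 96)]:
    print(label, "->", best_quant_for(vram_gb))
```

Treat the result as an upper bound: a level that "fits" with zero margin leaves nothing for longer contexts.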
VRAM chart available at the original article
What GPU setups actually work
FP16 (140GB+)
This requires an A100 80GB pair, an H100 pair, or a workstation with multiple RTX 6000 Ada cards (48GB each). Not practical for home use. Use cloud GPUs if you need FP16 accuracy for production.
Q8_0 (75GB)
Four RTX 4090s in a multi-GPU setup, or two H100 PCIe 80GB cards. Overkill for most users. Cloud is cheaper for occasional inference at this quality level.
Q5_K_M (52GB)
Two RTX 4090s (48GB combined) get close but need some CPU offload for the remaining 4GB. Expect a small speed penalty from the offloaded layers.
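To get a feel for how much ends up on the CPU, here is a minimal sketch that estimates how many layers fit in VRAM. It assumes 80 equally sized transformer layers (as in Llama-family 70B models), the ~49GB Q5_K_M file size from the table, and a 3GB reserve for KV cache and runtime overhead; the function name is hypothetical and real layer sizes vary.

```python
import math


def gpu_layers_that_fit(weights_gb, n_layers, vram_gb, reserve_gb):
    """Estimate how many equally sized layers fit in VRAM after a reserve."""
    per_layer_gb = weights_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# Assumed inputs: 49GB of weights, 80 layers, 48GB of VRAM, 3GB held back.
layers = gpu_layers_that_fit(weights_gb=49, n_layers=80, vram_gb=48, reserve_gb=3)
print(f"{layers} of 80 layers on GPU, {80 - layers} on CPU")
```

In llama.cpp, the resulting number is what you would pass to `-ngl` (`--n-gpu-layers`); start from the estimate and adjust until the model loads without running out of memory.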
Q4_K_M (42GB) — the practical target
Two RTX 4090s (48GB total) run this with about 6GB to spare, which leaves room for the KV cache to grow at longer contexts. This is the standard setup for 70B inference at home. Expect roughly 8-12 tok/s, which is fast enough for conversation.
Q3_K_M (33GB)
The RTX 5090's 32GB of VRAM falls just short of this. Expect 1-2GB of CPU offload depending on your system. Speed is still reasonable at ~15-18 tok/s.
IQ2_M and Q2_K (24-27GB)
A single RTX 4090 (24GB) can technically run IQ2_M, and Q2_K with a few GB of CPU offload. At IQ2 quality, a 70B model performs comparably to a well-quantized 13B model, so you lose the reason you wanted 70B in the first place.
Which GPU should YOU buy?
Running 70B on a budget: Get two RTX 4060 Ti 16GB cards ($400 each, $800 total = 32GB). You can run Q3_K_M with 1-2GB of CPU offload, which gives you the flavor of a 70B model without the full price. Use llama.cpp with tensor split.
Running 70B properly: Two RTX 4090s ($1,600 each, ~$3,200 total = 48GB). This is the gold standard for home 70B inference — Q4_K_M fits with VRAM to spare. Most guides use this setup.
Single-card 70B (compromised): The RTX 5090 at 32GB lets you run Q3_K_M without multi-GPU complexity, at ~$2,000. Simpler setup, but lower quality than dual 4090s.
Need better than Q4? Rent an A100 80GB pair on RunPod for Q8_0 quality. At 70B scale, cloud often beats a home multi-GPU build on cost per inference.
VRAM math explained
Why does a 70B model need ~40GB at Q4? The calculation:
- 70 billion parameters × 0.5 bytes per parameter (Q4 ≈ 4 bits = 0.5 bytes) = 35GB base model size
- + KV cache for context: At 4K context, add ~2-4GB. At 16K context, add ~8-12GB.
- + Overhead (activations, runtime): ~1-2GB
So at Q4_K_M with 4K context: ~38-41GB. At 16K context: ~44-49GB. This is why two RTX 4090s (48GB) get tight at longer contexts: you may need to cap context length or drop one step in quantization.
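The calculation above can be sketched in a few lines. The function name and parameters are illustrative, and the KV cache and overhead values are simply the low and high ends of the ranges quoted in the list, not measured figures.

```python
def estimate_vram_gb(params_billion, bits_per_weight, kv_cache_gb, overhead_gb):
    """Weights + KV cache + runtime overhead, all in GB (1 GB = 1e9 bytes)."""
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + kv_cache_gb + overhead_gb


# Q4 at 4K context: low and high ends of the quoted ranges.
low = estimate_vram_gb(70, 4, kv_cache_gb=2, overhead_gb=1)
high = estimate_vram_gb(70, 4, kv_cache_gb=4, overhead_gb=2)
print(f"Q4, 4K context: {low:.0f}-{high:.0f} GB")

# FP16 weights alone, for comparison.
print(f"FP16 weights: {estimate_vram_gb(70, 16, 0, 0):.0f} GB")
```

Swapping in the 16K cache range (8-12GB) shows how quickly a longer context eats the headroom on a 48GB setup.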
Common mistakes to avoid
- Assuming Q4 = half the VRAM of FP16. It's closer to 29% (40GB vs 140GB), because 4-bit weights take roughly a quarter of the space of 16-bit ones. The math surprises people.
- Forgetting KV cache. Your VRAM budget is not just the model weights. Long conversations eat into the headroom fast. Always leave 4-8GB for the cache.
- Buying a single 24GB GPU to run 70B. You will be stuck at IQ2 quality, which defeats the purpose of using a 70B model. Save up for a second 4090 or start with a well-quantized 34B.
- Ignoring tensor parallel overhead. llama.cpp with `-ts 1,1` (tensor split) adds some communication overhead between GPUs. Expect 5-10% lower throughput versus theoretical peak.
- Skipping the Q4_K_M vs Q5_K_M comparison. For tasks involving multi-step reasoning, Q5 is noticeably better. A 48GB two-GPU setup can run Q5_K_M with about 4GB offloaded to CPU, so test both before settling on Q4.
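For the tensor split mentioned in the list above, here is a small sketch that assembles a llama.cpp command line. The binary name `llama-server` and the file `model.gguf` are placeholders (older builds ship differently named binaries), and `-ngl 99` is just a common way of asking for every layer on the GPUs; the helper itself is hypothetical.

```python
import shlex


def llama_cpp_args(model_path, tensor_split, n_gpu_layers=99, ctx_size=4096,
                   binary="llama-server"):
    """Build an argv list; -ts takes per-GPU proportions such as 1,1."""
    split = ",".join(str(part) for part in tensor_split)
    return [
        binary,
        "-m", model_path,            # path to the GGUF file (placeholder)
        "-ngl", str(n_gpu_layers),   # number of layers to place on GPU
        "-ts", split,                # how to divide the model across GPUs
        "-c", str(ctx_size),         # context length in tokens
    ]


# Two identical cards: split the model evenly.
print(shlex.join(llama_cpp_args("model.gguf", (1, 1))))
```

With mismatched cards, the proportions can be uneven (for example `(3, 2)`), and capping `ctx_size` is the simplest way to keep the KV cache inside the remaining VRAM.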
Final verdict
| Setup | Max Quantization | Quality | Tokens/s | Cost |
|---|---|---|---|---|
| 1x RTX 4090 (24GB) | IQ2_M | Poor | ~12 | ~$1,600 |
| 2x RTX 4060 Ti 16GB (32GB) | Q3_K_M | Acceptable | ~8 | ~$800 |
| 1x RTX 5090 (32GB) | Q3_K_M | Acceptable | ~15 | ~$2,000 |
| 2x RTX 4090 (48GB) | Q4_K_M | Good | ~10 | ~$3,200 |
| 2x RTX 5090 (64GB) | Q6_K | Excellent | ~20 | ~$4,000 |
For most home users, the two RTX 4090 setup running Q4_K_M is the practical target. It costs roughly $3,200 in hardware and gives you a genuinely capable 70B model for open-ended reasoning, long-form writing, and research tasks.
If you want single-card simplicity, consider whether a well-quantized 34B model — which fits on one RTX 4090 — might meet your needs. For full VRAM planning across all model sizes, see our VRAM requirements guide. And if you're specifically running Llama 3.1 70B or Llama 3.3 70B, the best GPU for Llama 70B guide covers those models in detail. For multi-GPU build advice, see best multi-GPU setup for LLM inference.
Related guides on Best GPU for LLM
- Best Quantization for Local LLM in 2026 (Q4 to Q8)
- Can the RTX 4060 Ti Run Llama 70B in 2026? (Honest)
- Can the RTX 5070 Run 34B Models in 2026? (Analyzed)