This article was originally published on Best GPU for LLM. The full version with interactive tools, FAQ, and live pricing is on the original site.
Quick answer: For production vLLM serving, the RTX 4090 ($1,600) offers the best throughput per dollar for models up to 13B. For 34B+ models or high-concurrency workloads, the RTX 5090 ($2,000) or multi-GPU setups are essential.
See the recommended pick on the original guide
Why vLLM is different from local inference
vLLM is not a chatbot runner. It is a high-throughput inference server designed for serving multiple concurrent requests. This changes what matters in a GPU:
- VRAM capacity determines the largest model you can serve and how many concurrent requests you can handle (KV cache scales with concurrency)
- Memory bandwidth directly impacts token generation speed across all requests
- Tensor parallelism lets vLLM split models across multiple GPUs with near-linear scaling
- PagedAttention makes vLLM 2-4x more memory efficient than naive serving, but you still need enough VRAM for the model plus KV cache
Unlike Ollama, which by default processes one request at a time, vLLM batches in-flight requests dynamically (continuous batching), so throughput scales with VRAM headroom. For a side-by-side breakdown of when vLLM makes sense versus Ollama or llama.cpp, see Ollama vs llama.cpp vs vLLM. If you prefer a GUI loader over a production server stack, see our text-generation-webui GPU guide.
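To make the serving model concrete, here is a minimal client-side sketch: several requests in flight against a local vLLM OpenAI-compatible endpoint. The model name, port, and prompts are illustrative assumptions; the server launch command in the comment is the standard vLLM entrypoint.

```python
# Minimal sketch: concurrent requests against a local vLLM server.
# Assumes a server was started separately, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B-Instruct
# Model name, port, and prompts are illustrative.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def ask(prompt: str) -> str:
    resp = client.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        prompt=prompt,
        max_tokens=64,
    )
    return resp.choices[0].text

prompts = [f"Question {i}: why does batching raise GPU utilization?" for i in range(8)]
with ThreadPoolExecutor(max_workers=8) as pool:
    # vLLM batches these server-side instead of queueing them one by one
    for answer in pool.map(ask, prompts):
        print(answer.strip()[:80])
```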
GPU comparison for vLLM throughput
Benchmarks serving Llama 3 8B at FP16 with 32 concurrent requests:
| GPU | VRAM | Throughput (tok/s total) | Latency (P50) | Price |
|---|---|---|---|---|
| RTX 5090 | 32GB | ~2,800 tok/s | ~45ms | ~$2,000 |
| RTX 4090 | 24GB | ~2,100 tok/s | ~55ms | ~$1,600 |
| RTX 5080 | 16GB | ~1,500 tok/s | ~70ms | ~$1,000 |
| RTX 4070 Ti Super | 16GB | ~1,200 tok/s | ~85ms | ~$700 |
| 2x RTX 4090 (TP=2) | 48GB | ~3,900 tok/s | ~50ms | ~$3,200 |
Total throughput is what matters for serving, not single-request speed. The RTX 4090 delivers excellent throughput per dollar and is the workhorse of budget vLLM deployments.
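If you want a rough sense of this kind of number on your own hardware, vLLM's offline batch API makes a quick estimate easy. This is a simplified sketch, not the serving benchmark behind the table (which measures 32 concurrent HTTP requests against a running server); the model name and prompt count are assumptions.

```python
# Rough throughput estimate with vLLM's offline batch API. A simplified
# proxy for the serving benchmark above, not a reproduction of it.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", dtype="float16")
params = SamplingParams(temperature=0.8, max_tokens=128)

prompts = ["Summarize the benefits of continuous batching."] * 32  # 32 "requests"
start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.0f} tok/s total")
```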
Model sizing for vLLM
vLLM typically serves models at FP16 or AWQ/GPTQ 4-bit for best throughput. Unlike llama.cpp GGUF, vLLM uses GPU-native quantization formats.
| Model | FP16 VRAM | AWQ 4-bit VRAM | Min GPU (FP16) | Min GPU (AWQ) |
|---|---|---|---|---|
| Mistral 7B | ~14GB | ~4.5GB | RTX 5080 16GB | RTX 4060 Ti 16GB |
| Llama 3 8B | ~16GB | ~5GB | RTX 5080 16GB | RTX 4060 Ti 16GB |
| Llama 2 13B | ~26GB | ~8GB | RTX 5090 32GB | RTX 4060 Ti 16GB |
| Qwen 2.5 32B | ~64GB | ~19GB | 2x RTX 5090 | RTX 4090 24GB |
| Llama 3 70B | ~140GB | ~40GB | Multi-GPU | 2x RTX 4090 |
Remember to add 4-8GB overhead for KV cache depending on concurrency and context length. Higher concurrency needs more VRAM headroom.
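That overhead can be estimated from the model architecture. Below is a back-of-the-envelope sketch using the standard formula (two tensors, K and V, per layer per token); the Llama 3 8B parameters (32 layers, 8 KV heads under GQA, head dim 128) come from its public config.

```python
# Back-of-the-envelope KV-cache sizing. Formula: 2 (K and V) x layers x
# kv_heads x head_dim x bytes per element, per token, per sequence.
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                ctx_len: int, concurrency: int, bytes_per_elem: int = 2) -> float:
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * ctx_len * concurrency / 1024**3

# Llama 3 8B (32 layers, 8 KV heads via GQA, head_dim 128), FP16 KV cache,
# 16 concurrent requests at 2k context:
print(f"{kv_cache_gb(32, 8, 128, 2048, 16):.1f} GB")  # ~4.0 GB
```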
PagedAttention and VRAM efficiency
PagedAttention is vLLM's key innovation. It manages GPU memory for KV cache like virtual memory pages, eliminating waste from pre-allocated fixed buffers. In practice this means:
- ~2-4x more concurrent requests than naive serving with the same VRAM
- Near-zero memory waste from fragmentation
- Dynamic allocation lets you serve bursty traffic without over-provisioning
This makes VRAM even more valuable in vLLM than in single-user tools. Every extra GB of VRAM translates to more concurrent users you can serve.
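In practice you control that headroom with two engine arguments. A minimal sketch using the offline API; both parameters are also available as server flags:

```python
# The two knobs that govern PagedAttention headroom. After loading weights,
# vLLM pre-allocates the remaining budget as paged KV-cache blocks.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    gpu_memory_utilization=0.90,  # fraction of VRAM vLLM may claim (weights + KV cache)
    max_num_seqs=64,              # cap on concurrently batched sequences
)
```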
Tensor parallelism: scaling across GPUs
vLLM supports tensor parallelism natively. Two RTX 4090s with TP=2 give you 48GB of combined VRAM and roughly 1.85x the throughput of a single card; scaling is not quite linear because consumer cards lack NVLink, so inter-GPU communication runs over PCIe.
For serious serving, dual RTX 4090s are often better than a single RTX 5090: more total VRAM (48GB vs 32GB) and roughly 40% more total throughput (~3,900 vs ~2,800 tok/s in the table above) at 1.6x the cost.
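Enabling it is a one-parameter change. A sketch matching the TP=2 row in the benchmark table above:

```python
# Tensor parallelism: shard the model across two GPUs. Matches the
# "2x RTX 4090 (TP=2)" configuration from the benchmark table.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=2,  # split each layer's weights across 2 GPUs
)
```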
Which GPU should you buy?
If you are prototyping or testing vLLM with 7B models, the RTX 4060 Ti 16GB at $400 is enough to validate your pipeline. If you are serving 7-13B models in production with moderate concurrency, the RTX 4090 at $1,600 is the best throughput-per-dollar choice. If you need high concurrency or 34B+ models, go with dual RTX 4090s — 48GB combined VRAM with tensor parallelism beats a single RTX 5090 for serving workloads where total throughput matters more than single-request latency.
Common mistakes to avoid
- Using GGUF quantization with vLLM. vLLM is built around GPU-native formats (AWQ, GPTQ), not llama.cpp's GGUF. Pointing it at the wrong format means missing out on its optimized quantized kernels and the full benefit of PagedAttention and continuous batching; see the sketch after this list for the correct loading pattern.
- Underestimating KV cache VRAM. A model that fits in 20GB of VRAM still needs 4-8GB for KV cache under concurrency. Budget VRAM for your peak concurrent users, not just the model weights.
- Buying a single expensive GPU instead of two cheaper ones. For serving, two RTX 4090s with tensor parallelism outperform a single RTX 5090 in total throughput and have more combined VRAM (48GB vs 32GB).
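For the first mistake above, the fix is to point vLLM at a GPU-native checkpoint. A sketch; the repository name is a hypothetical placeholder for an AWQ-quantized upload:

```python
# Correct pattern: load an AWQ checkpoint instead of a GGUF file. The repo
# name below is a hypothetical placeholder; substitute a real AWQ upload.
from vllm import LLM

llm = LLM(
    model="example-org/Llama-2-13B-AWQ",  # hypothetical AWQ repo name
    quantization="awq",
)
```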
Our recommendation
| Workload | Best GPU | Price |
|---|---|---|
| Dev/testing (7B models) | RTX 4060 Ti 16GB | ~$400 |
| Small-scale serving (7-13B) | RTX 4090 | ~$1,600 |
| Production serving (7-13B) | RTX 5090 | ~$2,000 |
| High-throughput or 34B+ | 2x RTX 4090 | ~$3,200 |
For most vLLM deployments, the RTX 4090 at $1,600 is the sweet spot. It serves 7-13B models at FP16 with excellent throughput and has enough VRAM for decent concurrency. Scale horizontally with tensor parallelism when you need more.
GPU tier list available at the original article
For more on how VRAM requirements scale with model size and quantization, see our VRAM requirements guide.
Related guides on Best GPU for LLM
- Best GPU for LLM Inference Server in 2026 (vLLM)
- Best GPU for 13B Parameter Models in 2026 (Ranked)
- Best GPU for 34B Models: Yi, CodeLlama & Qwen
Continue on Best GPU for LLM for the complete guide with interactive calculators and current GPU prices.