Introduction
Running a 7B-8B class model on a single RTX 3090, you might settle for ~25-30 tokens/s: enough for personal use, but far from what the hardware can deliver. For a production-grade API service, the goal is to maximize requests served per second, and that is the performance target of this guide.
Through a series of optimizations—leveraging vLLM's specialized architecture, model quantization, and deep parameter tuning—we can transform a single 3090 into a high-throughput API node capable of handling over 50 concurrent sequences.
This guide outlines the systematic approach I've used to move from a single-user setup to an efficient, concurrent API deployment.
The Core Technology: Why vLLM Excels
vLLM fundamentally changes LLM serving with two key innovations:
PagedAttention: Transforms KV cache management by splitting it into fixed-size pages, akin to an OS virtual memory manager. This eliminates fragmentation and increases memory utilization, enabling far larger batch sizes on limited VRAM compared to traditional frameworks.
Continuous Batching: Unlike static batching which waits for all requests in a batch, vLLM's continuous batching dynamically adds new requests as soon as others finish, keeping the GPU perpetually busy.
Initial benchmarks align with community findings—a single RTX 3090 can achieve approximately 30 tokens/s with very low latency, making it a solid foundation for concurrency optimization.
System Preparation and Base Deployment
Our deployment setup is running a Meta-Llama-3-8B-Instruct model on an Ubuntu 22.04 LTS system with an NVIDIA RTX 3090 (24GB VRAM). Linux is strongly preferred over Windows for memory management efficiency.
Model Selection and Quantization: The full FP16 version (~16GB) occupies most of the 3090's VRAM, leaving insufficient space for aggressive concurrent batching.
I deployed the GPTQ INT4 quantized version of Llama 3 8B. INT4 reduces FP16 VRAM consumption by 60-75% with negligible quality degradation, and I observed stable usage around 5.8GB of VRAM with this approach.
Basic vLLM Setup:
```bash
# Install vLLM
pip install vllm

# Launch the server (baseline)
# --model takes any GPTQ repo from the Hugging Face Hub; substitute the INT4 build
# of whichever model you are deploying (this guide's benchmarks use Llama 3 8B Instruct)
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Llama-2-7B-Chat-GPTQ \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 4096
```
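Once the server is up, it speaks the OpenAI API. A minimal smoke test, assuming the default port 8000 and the model name passed at launch, looks like this:

```bash
# Quick sanity check against the OpenAI-compatible endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "TheBloke/Llama-2-7B-Chat-GPTQ",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'
```

If this returns a completion, the baseline deployment is working and you can move on to tuning.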
Step 1: Reserve Enough Memory for the KV Cache
The gpu_memory_utilization parameter is your primary lever. vLLM pre-allocates a large chunk of VRAM for the KV cache up front. If set too conservatively, batch sizes collapse and throughput drops.
Start at 0.85 and push towards 0.95 while monitoring for OOM errors. With the model weights holding steady around 5.8GB, a 0.90-0.95 utilization cap leaves roughly 16-17GB of the 3090's 24GB for the KV cache and batching; a sketch of the sweep follows below.
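As a concrete example of one point in that sweep (the 0.92 here is an illustrative midpoint, not a recommendation, and the model is the same one used in the baseline launch):

```bash
# Push the KV-cache reservation higher (illustrative value, tune for your setup)
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Llama-2-7B-Chat-GPTQ \
  --gpu-memory-utilization 0.92 \
  --max-model-len 4096

# In another terminal: watch VRAM headroom while you load-test, to catch OOM early
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```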
Step 2: Increase Concurrency
The max_num_seqs parameter is a hard limit on how many requests the engine can process in a single batch. This is the key to moving from single-user to industrial throughput.
Our target is to support over 50 concurrent requests. The vLLM documentation recommends setting max_num_seqs to at least your target concurrency and using max_num_batched_tokens as the balancing value (try a high value like 16384 and adjust as memory allows). When pushing max_num_seqs beyond 32, run stability tests with --enforce-eager enabled (this disables CUDA graph capture, trading a little speed for more predictable memory behavior), and keep a close eye on the number of waiting requests and on KV-cache/GPU memory usage in the server metrics.
Increasing concurrency means you're filling more of the compute pipeline. But you can push it too far: if you see preemption events (requests being swapped out or recomputed because the KV cache is full), you've hit the true limit and need to dial it back. A launch sketch for this step follows below.
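A sketch of a Step 2 launch, assuming your vLLM version exposes these flags under these names (check --help for your build), plus a look at the server's Prometheus metrics to watch queueing and cache pressure:

```bash
# Raise the concurrency ceiling (values are starting points, not guarantees)
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Llama-2-7B-Chat-GPTQ \
  --gpu-memory-utilization 0.92 \
  --max-model-len 4096 \
  --max-num-seqs 64 \
  --max-num-batched-tokens 16384 \
  --enforce-eager

# The server exposes Prometheus metrics; grep for waiting requests and KV-cache usage
# (exact metric names vary slightly between vLLM releases)
curl -s http://localhost:8000/metrics | grep -E "num_requests_(running|waiting)|cache_usage"
```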
Step 3: Fine-Tuning for Maximum Throughput
Search Strategy: Start with single-parameter sweeps: fix all other parameters while varying one, such as gpu_memory_utilization or max_num_seqs. After finding individual sweet spots, follow the established best practice of sweeping max_num_batched_tokens first, then sweeping max_num_seqs with the batch token count fixed; a rough sweep script is sketched below.
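One way to run such a sweep, as a rough sketch: restart the server with one value at a time and hit it with the same load test each run. The run_load_test.sh script here is a hypothetical placeholder for whatever benchmark client you use (vLLM's repository includes a benchmark_serving.py script you can adapt).

```bash
# Single-parameter sweep over max_num_batched_tokens (placeholder load-test script)
for TOKENS in 8192 12288 16384 20480; do
  python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Llama-2-7B-Chat-GPTQ \
    --gpu-memory-utilization 0.92 \
    --max-num-seqs 64 \
    --max-num-batched-tokens "$TOKENS" &
  SERVER_PID=$!
  sleep 60                      # crude wait for model load; poll /health in real runs
  ./run_load_test.sh "$TOKENS"  # hypothetical client that records throughput per setting
  kill "$SERVER_PID"; wait "$SERVER_PID" 2>/dev/null
done
```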
Speculative Decoding: For further gains, consider lightweight speculative decoding, sketched below. A small "draft model" cheaply proposes the next several tokens, and the main target model verifies them all in a single forward pass, accepting only the tokens that match its own output. Quality is unchanged, so the acceleration is effectively lossless when the draft model's acceptance rate is high.
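A sketch of what enabling it can look like. The exact flags have changed across vLLM versions (older releases used --speculative-model and --num-speculative-tokens, newer ones a consolidated speculative config), so check --help for your build; the draft model below is a stand-in for whichever small model you pair with your target.

```bash
# Speculative decoding sketch: flag names depend on your vLLM version (see --help)
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Llama-2-7B-Chat-GPTQ \
  --speculative-model <small-draft-model> \
  --num-speculative-tokens 5 \
  --gpu-memory-utilization 0.92
```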
Theoretical Considerations: The Bottleneck
This optimization journey works within fundamental hardware limits. The GPU's streaming multiprocessors set the ceiling on raw compute, but for autoregressive decoding they are rarely the first wall you hit.
The real bottleneck is the memory bus. Every decode step has to stream the model weights plus the active KV cache out of VRAM, so the compute units spend much of each step waiting on memory rather than doing math.
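A back-of-envelope estimate makes that ceiling concrete. The figures below are assumed round numbers, not measurements: roughly 936 GB/s of memory bandwidth on a 3090 and roughly 4.5GB of resident INT4 weights.

```bash
# Rough single-stream ceiling: each decode step streams the full weight set once,
# so tokens/s <= bandwidth / bytes-read-per-step (ignoring KV-cache traffic and overhead).
echo "936 / 4.5" | bc -l   # ~208 tokens/s upper bound for a single sequence
# Batching amortizes that weight read across all concurrent sequences,
# which is why aggregate throughput keeps climbing until the bus saturates.
```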
Successful concurrency tuning pushes the GPU towards this memory bandwidth limit. Your goal is to find the saturation point where tokens flow as fast as the bus permits. This is the practical ceiling for a given 3090 setup.
Conclusion
Through strategic deployment—leveraging vLLM's PagedAttention and continuous batching, VRAM-frugal model quantization, and careful, systematic parameter tuning—a single RTX 3090 can deliver robust performance capable of serving a global user base.
Estimated Performance Target: With an INT4-quantized 7B-8B model and the tuning above, you can realistically target 50+ concurrent requests, with aggregate throughput well above the ~30 tokens/s single-stream baseline and only modest per-request latency overhead.
We can achieve more from our hardware. With methodical optimization, even a single consumer-grade graphics card can become the backbone of a high-performance, globally-accessible application.
If you're interested in turning this optimized node into a revenue stream, consider joining TopenRouter—a platform with massive token demand where individual GPU providers can monetize their compute capacity. Have questions about vLLM deployment or parameter tuning? Leave a comment below, or reach out directly.