I wanted to know exactly how the three most popular open-source LLM serving engines perform when real users hit your server at the same time. So I built this educational repo and ran identical tests on a single GPU.
- **Repo:** https://github.com/zkzkGamal/concurrent-llm-serving
- **Model:** Qwen/Qwen3.5-0.8B
- **Hardware:** Single GPU
- **Concurrency:** 16 simultaneous requests (only 4 for Ollama)
- **Task:** Diverse AI & programming questions (max_tokens=150)
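To give a feel for how a concurrency test like this can be driven, here's a minimal `asyncio` sketch. This is an illustration, not the repo's actual test scripts: the real tests hit HTTP endpoints, while here each request is simulated with a short sleep so the sequential-vs-concurrent gap is easy to see.

```python
import asyncio
import time

async def fake_request(i: int) -> str:
    # Stand-in for one HTTP call to the serving engine
    # (simulated with a short sleep instead of real network I/O).
    await asyncio.sleep(0.05)
    return f"response-{i}"

async def run_concurrent(n: int) -> float:
    """Fire all n requests at once and wait for them together."""
    start = time.perf_counter()
    await asyncio.gather(*(fake_request(i) for i in range(n)))
    return time.perf_counter() - start

async def run_sequential(n: int) -> float:
    """Send requests one at a time, like a non-concurrent client."""
    start = time.perf_counter()
    for i in range(n):
        await fake_request(i)
    return time.perf_counter() - start

if __name__ == "__main__":
    c = asyncio.run(run_concurrent(16))
    s = asyncio.run(run_sequential(16))
    print(f"16 concurrent: {c:.2f}s | 16 sequential: {s:.2f}s")
```

With real engines the gap also depends on how the *server* schedules the 16 in-flight requests, which is exactly what the benchmark below measures.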
## The Results (spoiler: one engine destroys the others)
| Engine | Requests | Total Time | Avg per Request | Concurrency Model |
|--------|----------|------------|-----------------|-------------------|
| SGLang | 16 | 2.47s | 0.68–2.46s | True parallel batching + RadixAttention |
| vLLM | 16 | 11.26s | ~10.25–11.26s | PagedAttention + continuous batching |
| Ollama | 4 | 134.72s | 26–134s | Sequential (time-sliced) |
SGLang was 4.6× faster than vLLM and completely smoked Ollama.
## Why the Huge Difference? The Core Algorithms

### 1. KV-Cache & Memory Management
Every LLM needs to store Key/Value vectors for previous tokens. Without smart caching, you waste VRAM and kill concurrency.
- **vLLM** → PagedAttention (treats the KV cache like OS virtual memory pages → no fragmentation)
- **SGLang** → RadixAttention (trie-based prefix tree → shares any common prefix across requests automatically)
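To make the prefix-sharing idea concrete, here's a toy token trie in the spirit of RadixAttention. This is my illustration, not SGLang's actual implementation (the real thing stores compressed radix-tree edges, manages GPU memory, and evicts under pressure), but it shows why a shared system prompt only pays its prefill cost once:

```python
class RadixNode:
    def __init__(self):
        self.children = {}   # token -> RadixNode
        self.has_kv = False  # KV entries already computed for this prefix?

class PrefixCache:
    """Toy prefix trie: any prompt prefix already seen by an earlier
    request is reused instead of being recomputed at prefill time."""

    def __init__(self):
        self.root = RadixNode()

    def insert(self, tokens):
        """Register a request's prompt; return how many leading tokens
        were already cached (i.e. whose prefill can be skipped)."""
        node, reused = self.root, 0
        still_matching = True
        for tok in tokens:
            if still_matching and tok in node.children and node.children[tok].has_kv:
                reused += 1
            else:
                still_matching = False
            node = node.children.setdefault(tok, RadixNode())
            node.has_kv = True
        return reused

if __name__ == "__main__":
    cache = PrefixCache()
    system = ["You", "are", "a", "helpful", "assistant", "."]
    cache.insert(system + ["What", "is", "RAG", "?"])      # cold: 0 reused
    print(cache.insert(system + ["Write", "a", "haiku"]))  # prints 6
```

The second request reuses all 6 system-prompt tokens automatically, with no manual cache keys. When 16 concurrent requests share a template, that saving multiplies.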
### 2. Continuous Batching

Instead of waiting for the current batch to drain, new requests join the next GPU forward pass immediately. Both vLLM and SGLang do this; Ollama does not.
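A tiny scheduling simulation (mine, not code from either engine) shows why this matters. Each request needs a different number of decode steps; static batching makes short requests wait for the longest one in their batch, while continuous batching refills freed slots on the very next step:

```python
def static_batching(lengths, batch_size):
    """Classic batching: each batch runs until its *longest* request
    finishes, so short requests sit idle waiting for long ones."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching(lengths, batch_size):
    """Continuous batching: the moment a request finishes, its slot is
    refilled from the queue before the next forward pass."""
    pending = list(lengths)  # decode steps still needed per queued request
    active, steps = [], 0
    while pending or active:
        while pending and len(active) < batch_size:
            active.append(pending.pop(0))
        steps += 1  # one GPU forward pass: every active request emits a token
        active = [n - 1 for n in active if n > 1]
    return steps

if __name__ == "__main__":
    lengths = [8, 2, 8, 2]  # decode steps needed by four requests
    print("static:", static_batching(lengths, 2))          # 16 forward passes
    print("continuous:", continuous_batching(lengths, 2))  # 10 forward passes
```

Even in this 4-request toy, continuous batching cuts total forward passes from 16 to 10; with 16 requests of mixed lengths the gap widens, which is a big part of the table above.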
### 3. Other Tricks

- **SGLang:** Chunked prefill + custom Triton kernels + zero warm-up
- **vLLM:** Broad model support + CUDA graph warm-up
- **Ollama:** GGUF quantization + llama.cpp (great for single-user, terrible for concurrency)
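Chunked prefill deserves a one-liner of intuition. Here's a simplified scheduler sketch of my own (real engines mix prefill chunks and decode tokens inside the same batch): a long prompt's prefill is split into chunks, and the decode steps of already-running requests are interleaved between chunks instead of stalling behind one monolithic prefill.

```python
def chunked_prefill_schedule(prompt_len, chunk_size, decode_reqs):
    """Split one long prompt's prefill into chunks, interleaving the
    decode steps of already-running requests between chunks."""
    schedule, remaining = [], prompt_len
    while remaining > 0:
        chunk = min(chunk_size, remaining)
        schedule.append(("prefill", chunk))
        remaining -= chunk
        for req in decode_reqs:  # other requests keep making progress
            schedule.append(("decode", req))
    return schedule

if __name__ == "__main__":
    plan = chunked_prefill_schedule(1000, 400, ["req-A", "req-B"])
    print(plan)  # prefill 400 / decode A / decode B / prefill 400 / ...
```

The payoff is latency: without chunking, a 1000-token prompt blocks every other request's next token until its whole prefill finishes.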
## When Should You Use Each Engine?

- **SGLang** → Maximum throughput, structured JSON/regex output, production serving
- **vLLM** → Stability, 50+ model architectures, when you need reliability over raw speed
- **Ollama** → Quick prototyping, local development, zero-config experience
## How to Reproduce the Tests Yourself
The repo includes everything:
- `install.sh` (one-click setup)
- `sglang_concurrent_test.py` / `vllm_concurrent_test.py` / `ollama_concurrent_test.py`
- Raw logs + result markdowns
- Video demos (`test_sglang.mkv`, `ollama_test.mkv`)
Just clone and run:
```bash
git clone https://github.com/zkzkGamal/concurrent-llm-serving
cd concurrent-llm-serving
bash install.sh
```
(Full startup commands and API compatibility notes are in the README.)
## Project Structure

```text
├── sglang_concurrent_test.py   # 16 concurrent requests
├── vllm_concurrent_test.py
├── ollama_concurrent_test.py
├── install.sh
├── *_results.md                # Formatted benchmark outputs
└── README.md                   # Full deep-dive guide
```
## Final Thoughts
Concurrent serving is no longer optional — it's table stakes for any serious LLM application. The difference between "works on my machine" and "handles 16 users at once" is huge, and the right engine choice can save you GPUs (and money).
If you're building anything with local LLMs — agents, RAG, chat apps, etc. — I highly recommend trying SGLang first.
⭐ Star the repo if you found it useful!
Feedback, PRs, and questions are all welcome.
What engine are you using right now? Have you hit concurrency limits yet? Drop a comment below 👇