zkaria gamal
Concurrent LLM Serving: Benchmarking vLLM vs SGLang vs Ollama

I wanted to know exactly how the three most popular open-source LLM serving engines perform when real users hit your server at the same time. So I built this educational repo and ran identical tests on a single GPU.

Repo: https://github.com/zkzkGamal/concurrent-llm-serving

Model: Qwen/Qwen3.5-0.8B

Hardware: Single GPU

Concurrency: 16 simultaneous requests (only 4 for Ollama)

Task: Diverse AI & programming questions (max_tokens=150)
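The concurrency pattern behind the test scripts can be sketched like this. This is a minimal stand-in, not the repo's actual code: `send_request` here just sleeps to simulate generation latency, whereas the real scripts call each engine's HTTP API. The point is that 16 overlapping requests should finish in roughly the time of one, if the server can actually serve them in parallel:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def send_request(prompt: str) -> str:
    """Stand-in for an HTTP call to the engine's completion endpoint
    (the repo's scripts make the real network call)."""
    time.sleep(0.1)  # simulate server-side generation latency
    return f"answer to: {prompt}"

prompts = [f"question {i}" for i in range(16)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(send_request, prompts))
elapsed = time.perf_counter() - start

# All 16 requests overlap, so wall time is close to one request, not 16x.
print(f"{len(results)} responses in {elapsed:.2f}s")
```

If the server handles requests sequentially (as Ollama effectively does under load), the client-side pattern is identical, but the wall time balloons to roughly the sum of all request times instead.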

The Results (spoiler: one engine destroys the others)

| Engine | Requests | Total Time | Avg per Request | Concurrency Model |
|--------|----------|------------|-----------------|-------------------|
| SGLang | 16 | 2.47s | 0.68–2.46s | True parallel batching + RadixAttention |
| vLLM | 16 | 11.26s | ~10.25–11.26s | PagedAttention + continuous batching |
| Ollama | 4 | 134.72s | 26–134s | Sequential (time-sliced) |

SGLang was 4.6× faster than vLLM and completely smoked Ollama.

Why the huge difference? The Core Algorithms

1. KV-Cache & Memory Management

Every LLM needs to store Key/Value vectors for previous tokens. Without smart caching, you waste VRAM and kill concurrency.

  • vLLM: PagedAttention (treats the KV cache like OS virtual memory pages → no fragmentation)

  • SGLang: RadixAttention (trie-based prefix tree → shares any common prefix across requests automatically)
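The core idea behind RadixAttention can be illustrated with a toy prefix trie (this is a simplification for intuition, not SGLang's actual implementation): when two requests share a prefix — say, the same system prompt — the second request reuses the cached KV entries instead of recomputing them.

```python
# Toy illustration of prefix sharing: cache token sequences in a trie so
# a request reuses whatever prefix an earlier request already "computed".

class RadixCache:
    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        """Return (reused, computed): tokens served from cache vs fresh."""
        node, reused = self.root, 0
        for i, tok in enumerate(tokens):
            if tok not in node:
                # Rest of the sequence is new: extend the trie.
                for t in tokens[i:]:
                    node = node.setdefault(t, {})
                return reused, len(tokens) - reused
            reused += 1
            node = node[tok]
        return reused, 0

cache = RadixCache()
system = ["You", "are", "a", "helpful", "assistant", "."]
r1 = cache.insert(system + ["What", "is", "vLLM", "?"])    # (0, 10): cold cache
r2 = cache.insert(system + ["What", "is", "SGLang", "?"])  # (8, 2): shared prefix reused
```

With 16 concurrent requests that all start from the same system prompt, this kind of automatic prefix reuse compounds quickly — which is part of why SGLang's numbers above look the way they do.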

2. Continuous Batching

Instead of waiting for a full batch, new requests join the GPU forward pass instantly. Both vLLM and SGLang do this. Ollama does not.
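A toy simulation (with made-up token counts, purely for intuition) shows why this matters. With static batching, short requests are stuck waiting for the longest request in their batch, and new arrivals wait for the whole batch to drain. With continuous batching, every decode step advances all active requests, and finished ones free their slot immediately:

```python
# Toy step-count simulation: static batches vs continuous batching.
# Each job is the number of tokens it needs to generate.

def static_batching(jobs, batch_size=4):
    """Batches run back-to-back; each batch takes as long as its longest job."""
    steps = 0
    for i in range(0, len(jobs), batch_size):
        steps += max(jobs[i:i + batch_size])
    return steps

def continuous_batching(jobs):
    """One step advances every active job; finished jobs drop out instantly."""
    active, steps = list(jobs), 0
    while active:
        steps += 1
        active = [j - 1 for j in active if j > 1]
    return steps

jobs = [10, 100] * 8  # 16 requests, a mix of short and long generations
print(static_batching(jobs), continuous_batching(jobs))  # → 400 100
```

The 4× gap in this toy model is a coincidence of the numbers chosen, but the direction matches the benchmark: engines that let requests join and leave the batch mid-flight keep the GPU saturated.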

3. Other Tricks

  • SGLang: Chunked prefill + custom Triton kernels + zero warm-up

  • vLLM: Broad model support + CUDA graph warm-up

  • Ollama: GGUF quantization + llama.cpp (great for single-user, terrible for concurrency)
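The quantization trade-off is easy to see with back-of-envelope VRAM math (approximate figures; this ignores KV cache, activations, and quantization metadata, and assumes a ~0.6B-parameter model for illustration):

```python
# Rough weight-memory estimate: parameters x bits-per-weight.
# Shows why 4-bit GGUF is attractive on memory-constrained machines.

def weight_memory_gb(params: float, bits: int) -> float:
    return params * bits / 8 / 1e9

params = 0.6e9  # assumed size of a small Qwen-class model
fp16 = weight_memory_gb(params, 16)
q4 = weight_memory_gb(params, 4)
print(f"fp16: {fp16:.2f} GB, 4-bit: {q4:.2f} GB")  # fp16: 1.20 GB, 4-bit: 0.30 GB
```

The memory savings are real, but they do nothing for concurrency: a quantized model served one request at a time is still one request at a time.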

When Should You Use Each Engine?

  • SGLang → Maximum throughput, structured JSON/regex output, production serving

  • vLLM → Stability, 50+ model architectures, when you need reliability over raw speed

  • Ollama → Quick prototyping, local development, zero-config experience

How to Reproduce the Tests Yourself

The repo includes everything:

  • install.sh (one-click setup)

  • sglang_concurrent_test.py / vllm_concurrent_test.py / ollama_concurrent_test.py

  • Raw logs + result markdowns

  • Video demos (test_sglang.mkv, ollama_test.mkv)

Just clone and run:


```
git clone https://github.com/zkzkGamal/concurrent-llm-serving
cd concurrent-llm-serving
bash install.sh
```

(Full startup commands and API compatibility notes are in the README.)

Project Structure


```
├── sglang_concurrent_test.py     # 16 concurrent requests
├── vllm_concurrent_test.py
├── ollama_concurrent_test.py
├── install.sh
├── *_results.md                  # Formatted benchmark outputs
└── README.md                     # Full deep-dive guide
```

Final Thoughts

Concurrent serving is no longer optional — it's table stakes for any serious LLM application. The difference between "works on my machine" and "handles 16 users at once" is huge, and the right engine choice can save you GPUs (and money).

If you're building anything with local LLMs — agents, RAG, chat apps, etc. — I highly recommend trying SGLang first.

⭐ Star the repo if you found it useful!

Feedback, PRs, and questions are all welcome.

What engine are you using right now? Have you hit concurrency limits yet? Drop a comment below 👇

*(Video: output demo of the SGLang run — see test_sglang.mkv in the repo.)*

#llm #vllm #sglang #ollama #aiserving #machinelearning #opensource #gpu #inference
