I built an interactive 11-chapter guide to how LLM inference actually works

Ashwin Giridharan — Wed, 24 Jun 2026 06:36:00 +0000

Production vLLM is 100,000+ lines of C++, CUDA, and Python. It powers a large share of the industry's LLM serving, but reading it cold is brutal.

So I built a study series around nano-vLLM, an open-source reimplementation of vLLM's core ideas in ~1,200 lines of pure Python. Every algorithm is visible. Every design decision is legible. It turned out to be the perfect lens for actually understanding how LLMs generate text.

The result is an 11-chapter interactive guide. No ML background required — every piece of jargon is explained from scratch with analogies, diagrams, annotated source code, interactive simulators, and quizzes.

What it covers:

What Is LLM Inference?: tokens, autoregressive generation, Q/K/V attention, HBM vs SRAM
Architecture: how 1,200 lines are organised; CPU control plane vs GPU data plane
KV Cache: why caching Keys and Values avoids recomputing them for every previous token at every step, so each new token costs O(N) attention work instead of O(N²)
PagedAttention: virtual memory for the KV cache; how fragmentation and over-reservation can waste 60–80% of KV cache memory
The Scheduler: continuous batching; keeping the GPU at 95% utilisation instead of 12%
Prefill vs Decode: same model, two completely different bottlenecks (compute-bound prefill vs memory-bound decode)
Prefix Caching: skip prefill for shared tokens, cutting time to first token (TTFT) from ~700ms to ~90ms
Sampling Strategies: greedy, temperature, top-k, top-p, and what each does to the distribution
Tensor Parallelism: splitting a model across GPUs; column/row parallel and all-reduce
The Optimization Stack: FlashAttention, kernel fusion, CUDA Graphs, torch.compile
Benchmarks: measuring honestly; why nano-vLLM matches vLLM on core throughput
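To give a flavour of the Sampling Strategies chapter, here is a minimal pure-Python sketch of greedy, temperature, top-k, and top-p sampling over a toy set of logits. The function names are illustrative assumptions of mine, not nano-vLLM's actual API, and a real engine would do this on GPU tensors rather than Python lists.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities. Temperature < 1 sharpens, > 1 flattens."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def _renormalise(probs, keep):
    """Zero every token outside `keep`, then rescale so the rest sum to 1."""
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

def top_k_filter(probs, k):
    """Keep only the k most likely tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalise(probs, set(order[:k]))

def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose total mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = set(), 0.0
    for i in order:
        keep.add(i)
        mass += probs[i]
        if mass >= p:
            break
    return _renormalise(probs, keep)

def greedy(probs):
    """Always pick the single most likely token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def sample(probs, rng=random):
    """Draw one token index at random, weighted by probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Toy example: four candidate tokens with made-up logits.
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(logits, temperature=0.8)
next_token = sample(top_p_filter(top_k_filter(probs, k=3), p=0.9))
```

Filtering before sampling is the key design point: top-k and top-p only decide which tokens remain eligible, and the final draw is still random among the survivors.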

Each chapter is fully self-contained and interactive. A few of the simulators I'm most happy with: a PagedAttention block allocator you can fill up and watch fragment, a live scheduler you step through token by token, and a sampling playground where you reshape the probability distribution with sliders and sample from it.
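For a rough idea of what the block allocator simulator models, here is a toy sketch of paged KV cache allocation. The `BlockAllocator` class and its methods are hypothetical names for illustration, not code from nano-vLLM or vLLM, and it tracks only block bookkeeping, not the cached tensors themselves.

```python
class BlockAllocator:
    """Toy paged allocator: sequences get fixed-size blocks on demand."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # ids of unused physical blocks
        self.tables = {}                     # seq_id -> list of physical block ids
        self.lengths = {}                    # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token, taking a new block only when needed."""
        table = self.tables.get(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:  # no block yet, or last one is full
            if not self.free:
                raise MemoryError("out of KV cache blocks")
            table = table + [self.free.pop()]
        self.tables[seq_id] = table
        self.lengths[seq_id] = length + 1

    def release(self, seq_id):
        """Return all of a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def wasted_slots(self):
        """Token slots reserved but unused (always less than one block per sequence)."""
        return sum(
            len(table) * self.block_size - self.lengths[seq_id]
            for seq_id, table in self.tables.items()
        )

# Two sequences sharing a pool of 4 blocks, 4 tokens per block.
alloc = BlockAllocator(num_blocks=4, block_size=4)
for _ in range(5):
    alloc.append_token("a")  # 5 tokens need 2 blocks
alloc.append_token("b")      # 1 token needs 1 block
```

Because blocks are handed out one at a time, the only waste is the unfilled tail of each sequence's last block, which is the contrast with reserving a contiguous maximum-length region up front.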

🔗 Read the full series: https://ashwing.github.io/vllm-guide/

It's free and open. If you've ever wanted to understand what actually happens between sending a prompt and getting tokens back — this is the path I wish I'd had.

Feedback very welcome. Happy to answer questions about any of the concepts in the comments.

DEV Community: Ashwin Giridharan
