<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ashwin Giridharan</title>
    <description>The latest articles on DEV Community by Ashwin Giridharan (@ashwin_giridharan_dc396df).</description>
    <link>https://dev.to/ashwin_giridharan_dc396df</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3999881%2F81ef3f5c-0c6b-40d5-9fc7-b049c2983fdf.png</url>
      <title>DEV Community: Ashwin Giridharan</title>
      <link>https://dev.to/ashwin_giridharan_dc396df</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ashwin_giridharan_dc396df"/>
    <language>en</language>
    <item>
      <title>I built an interactive 11-chapter guide to how LLM inference actually works</title>
      <dc:creator>Ashwin Giridharan</dc:creator>
      <pubDate>Wed, 24 Jun 2026 06:36:00 +0000</pubDate>
      <link>https://dev.to/ashwin_giridharan_dc396df/i-built-an-interactive-11-chapter-guide-to-how-llm-inference-actually-works-1pb9</link>
      <guid>https://dev.to/ashwin_giridharan_dc396df/i-built-an-interactive-11-chapter-guide-to-how-llm-inference-actually-works-1pb9</guid>
      <description>&lt;p&gt;Production vLLM is 100,000+ lines of C++, CUDA, and Python. It powers a large share of the industry's LLM serving, but reading it cold is brutal.&lt;/p&gt;

&lt;p&gt;So I built a study series around &lt;strong&gt;nano-vLLM&lt;/strong&gt;, an open-source reimplementation of vLLM's core ideas in ~1,200 lines of Python. Every algorithm is visible. Every design decision is legible. It turned out to be the perfect lens for actually understanding how LLMs generate text.&lt;/p&gt;

&lt;p&gt;The result is an 11-chapter interactive guide. No ML background required — every piece of jargon is explained from scratch with analogies, diagrams, annotated source code, interactive simulators, and quizzes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it covers:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;What Is LLM Inference?&lt;/strong&gt; — tokens, autoregressive generation, Q/K/V attention, HBM vs SRAM
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fp4z5l3o1fhff6twciv95.png" alt=" " width="800" height="556"&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architecture&lt;/strong&gt; — how 1,200 lines are organised; CPU control plane vs GPU data plane&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KV Cache&lt;/strong&gt; — why storing Keys and Values replaces recomputing them for all N previous tokens at every step with a cache lookup
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F3jnlx4wtux661juu0p4s.png" alt=" " width="800" height="603"&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PagedAttention&lt;/strong&gt; — virtual memory for the KV cache; how fragmentation can waste 60–80% of KV cache memory
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fmk8d2vfed4b7bosg509d.png" alt=" " width="800" height="585"&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Scheduler&lt;/strong&gt; — continuous batching; keeping the GPU at 95% utilisation instead of 12%
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fp6k01t5ox6vcvk8o5k0o.png" alt=" " width="800" height="653"&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prefill vs Decode&lt;/strong&gt; — same model, two completely different bottlenecks (compute-bound vs memory-bound)
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Foqsh3rqkl36z8yql6v0k.png" alt=" " width="800" height="719"&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prefix Caching&lt;/strong&gt; — skip prefill for shared tokens; ~700ms → ~90ms TTFT
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fj26dcnxqha0nga52xltx.png" alt=" " width="800" height="867"&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sampling Strategies&lt;/strong&gt; — greedy, temperature, top-k, top-p, and what each does to the distribution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tensor Parallelism&lt;/strong&gt; — splitting a model across GPUs; column/row parallel and all-reduce&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Optimization Stack&lt;/strong&gt; — FlashAttention, kernel fusion, CUDA Graphs, torch.compile&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benchmarks&lt;/strong&gt; — measuring honestly; why nano-vLLM matches vLLM on core throughput&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each chapter is fully self-contained and interactive. A few of the simulators I'm most happy with: a PagedAttention block allocator you can fill up and watch fragment, a live scheduler you step through token by token, and a sampling playground where you reshape the probability distribution with sliders and sample from it.&lt;/p&gt;

&lt;p&gt;🔗 &lt;strong&gt;Read the full series:&lt;/strong&gt; &lt;a href="https://ashwing.github.io/vllm-guide/" rel="noopener noreferrer"&gt;https://ashwing.github.io/vllm-guide/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's free and open. If you've ever wanted to understand what actually happens between sending a prompt and getting tokens back — this is the path I wish I'd had.&lt;/p&gt;

&lt;p&gt;Feedback very welcome. Happy to answer questions about any of the concepts in the comments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>python</category>
    </item>
  </channel>
</rss>
