<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ben</title>
    <description>The latest articles on DEV Community by Ben (@c2sea).</description>
    <link>https://dev.to/c2sea</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3745848%2F0113816b-20cb-49d2-ba6a-7fa2fe227e0e.png</url>
      <title>DEV Community: Ben</title>
      <link>https://dev.to/c2sea</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/c2sea"/>
    <language>en</language>
    <item>
      <title>vLLM — Session 2: The Engine Layer — Request Management</title>
      <dc:creator>Ben</dc:creator>
      <pubDate>Sun, 01 Feb 2026 22:00:10 +0000</pubDate>
      <link>https://dev.to/c2sea/vllm-session-2-the-engine-layer-request-management-4dg2</link>
      <guid>https://dev.to/c2sea/vllm-session-2-the-engine-layer-request-management-4dg2</guid>
      <description>&lt;p&gt;&lt;em&gt;This is part of my vLLM learning series. In this session, I cover Step 2 (The Engine Layer).&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: This content was generated by Claude, grounded on the actual&lt;br&gt;
&lt;a href="https://github.com/vllm-project/vllm" rel="noopener noreferrer"&gt;vLLM&lt;/a&gt; codebase. It is intended for personal&lt;br&gt;
learning only and may contain inaccuracies. Always verify against the&lt;br&gt;
original source code and official documentation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Topic&lt;/strong&gt;: vLLM&lt;br&gt;
&lt;strong&gt;Date&lt;/strong&gt;: 2026-02-01&lt;br&gt;
&lt;strong&gt;Sections covered&lt;/strong&gt;: Step 2 (The Engine Layer)&lt;br&gt;
&lt;strong&gt;Prerequisites&lt;/strong&gt;: Session 1 — LLM class, SamplingParams, generate() flow, RequestOutput&lt;/p&gt;


&lt;h2&gt;
  
  
  Review
&lt;/h2&gt;

&lt;p&gt;In Session 1, we learned that the &lt;code&gt;LLM&lt;/code&gt; class is a thin wrapper around &lt;code&gt;LLMEngine&lt;/code&gt;. When you call &lt;code&gt;llm.generate()&lt;/code&gt;, the flow is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;_validate_and_add_requests()&lt;/code&gt; — pairs prompts with &lt;code&gt;SamplingParams&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;_run_engine()&lt;/code&gt; — loops &lt;code&gt;self.llm_engine.step()&lt;/code&gt; until all requests finish&lt;/li&gt;
&lt;li&gt;Returns sorted &lt;code&gt;RequestOutput&lt;/code&gt; objects&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We saw that &lt;code&gt;LLM.__init__()&lt;/code&gt; calls &lt;code&gt;LLMEngine.from_engine_args()&lt;/code&gt; — but we treated the engine as a black box. Today we open that box.&lt;/p&gt;

&lt;p&gt;The key question: &lt;strong&gt;What happens inside &lt;code&gt;llm_engine.step()&lt;/code&gt;?&lt;/strong&gt; The answer involves three components: &lt;code&gt;InputProcessor&lt;/code&gt;, &lt;code&gt;EngineCoreClient&lt;/code&gt;, and &lt;code&gt;OutputProcessor&lt;/code&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  Today's Material
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. LLMEngine — The Orchestrator
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;LLMEngine&lt;/code&gt; sits between the user-facing &lt;code&gt;LLM&lt;/code&gt; class and the core scheduling/execution machinery. Its job is to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Preprocess&lt;/strong&gt; inputs (tokenize prompts, handle multimodal data)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relay&lt;/strong&gt; preprocessed requests to the engine core&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postprocess&lt;/strong&gt; raw outputs (detokenize, format for the user)
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# vllm/v1/engine/llm_engine.py
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LLMEngine&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vllm_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;executor_class&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log_stats&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;engine_core&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;EngineCoreClient&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;    &lt;span class="c1"&gt;# Talks to core
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;InputProcessor&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;   &lt;span class="c1"&gt;# Tokenize inputs
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OutputProcessor&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt; &lt;span class="c1"&gt;# Format outputs
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Preprocess and send request to engine core.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;RequestOutput&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;One iteration: get outputs from core, process, return.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="nd"&gt;@classmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;from_engine_args&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;engine_args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LLMEngine&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Factory: parse args -&amp;gt; VllmConfig -&amp;gt; create engine.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Think of &lt;code&gt;LLMEngine&lt;/code&gt; as a &lt;strong&gt;translator&lt;/strong&gt;: it speaks "user language" (strings, Python objects) on one side and "engine language" (token IDs, &lt;code&gt;msgspec&lt;/code&gt; structs) on the other.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;📝 Note:&lt;/strong&gt;&lt;br&gt;
vLLM has undergone a major architectural evolution. The &lt;code&gt;v1/&lt;/code&gt; directory contains the current architecture. Older code in the root &lt;code&gt;vllm/engine/&lt;/code&gt; directory is the legacy (v0) engine. When reading code, focus on &lt;code&gt;vllm/v1/&lt;/code&gt; — that's where active development happens.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  2. The Factory: from_engine_args()
&lt;/h3&gt;

&lt;p&gt;Before exploring the runtime flow, let's see how &lt;code&gt;LLMEngine&lt;/code&gt; gets created:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# vllm/v1/engine/llm_engine.py
&lt;/span&gt;&lt;span class="nd"&gt;@classmethod&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;from_engine_args&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;engine_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;usage_context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LLMEngine&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Factory: parse args -&amp;gt; VllmConfig -&amp;gt; create engine.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;vllm_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;engine_args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_engine_config&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# vllm_config is a VllmConfig that bundles:
&lt;/span&gt;    &lt;span class="c1"&gt;#   ModelConfig, CacheConfig, ParallelConfig,
&lt;/span&gt;    &lt;span class="c1"&gt;#   SchedulerConfig, DeviceConfig, LoadConfig, ...
&lt;/span&gt;
    &lt;span class="n"&gt;executor_class&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Executor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_class&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vllm_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Selects UniProcExecutor, MultiprocExecutor, or RayDistributedExecutor
&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vllm_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;vllm_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="n"&gt;executor_class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;executor_class&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="n"&gt;usage_context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;usage_context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a classic factory pattern. The user provides simple arguments (&lt;code&gt;model="meta-llama/..."&lt;/code&gt;, &lt;code&gt;tensor_parallel_size=2&lt;/code&gt;), and the factory:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Parses them into a structured &lt;code&gt;VllmConfig&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Selects the right executor class based on configuration&lt;/li&gt;
&lt;li&gt;Constructs the engine with all dependencies wired up&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Why this matters&lt;/strong&gt;: The factory is where all of vLLM's auto-configuration happens. It resolves the dtype (auto-selecting fp16 or bf16 based on GPU capability), determines how many KV cache blocks fit in GPU memory, and selects the appropriate attention backend.&lt;/p&gt;
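
&lt;p&gt;As a quick illustration, here is a minimal sketch of inspecting the resolved configuration. The &lt;code&gt;EngineArgs&lt;/code&gt; import path and attribute names such as &lt;code&gt;model_config&lt;/code&gt; and &lt;code&gt;cache_config&lt;/code&gt; follow the classes shown above, but may differ across vLLM versions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: peek at the auto-resolved config (attribute names are assumptions).
from vllm.engine.arg_utils import EngineArgs

engine_args = EngineArgs(model="facebook/opt-125m", dtype="auto")
vllm_config = engine_args.create_engine_config()

print(vllm_config.model_config.dtype)       # dtype resolved from "auto"
print(vllm_config.cache_config.block_size)  # KV cache block size
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;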

&lt;h3&gt;
  
  
  3. InputProcessor — From Strings to Tokens
&lt;/h3&gt;

&lt;p&gt;When &lt;code&gt;add_request()&lt;/code&gt; is called, the first thing that happens is input processing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# vllm/v1/engine/input_processor.py
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;InputProcessor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vllm_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mm_processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;  &lt;span class="c1"&gt;# Multimodal input processor
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;EngineCoreRequest&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Tokenize prompt, process multimodal inputs,
           create EngineCoreRequest.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The processor handles several input formats:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Users can provide prompts in multiple ways:
&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello world&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                           &lt;span class="c1"&gt;# Plain string
&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello world&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;               &lt;span class="c1"&gt;# Dict with string
&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_token_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;15496&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;995&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;      &lt;span class="c1"&gt;# Pre-tokenized
&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;                                         &lt;span class="c1"&gt;# Multimodal
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s in this image?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;multi_modal_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;image_data&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No matter what format you use, &lt;code&gt;InputProcessor.process()&lt;/code&gt; normalizes it into an &lt;code&gt;EngineCoreRequest&lt;/code&gt; — the standard wire format for the engine core.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The tokenization step&lt;/strong&gt; converts your string prompt into token IDs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"What is the capital of France?"
    → tokenizer.encode()
    → [1, 1724, 338, 278, 7483, 310, 3444, 29973]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Tip:&lt;/strong&gt;&lt;br&gt;
If you already have token IDs (e.g., from your own tokenizer or preprocessing pipeline), pass &lt;code&gt;{"prompt_token_ids": [...]}&lt;/code&gt; to skip redundant tokenization. This saves CPU time for high-throughput applications.&lt;/p&gt;
&lt;/blockquote&gt;
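
&lt;p&gt;A minimal sketch of that pattern, using a Hugging Face tokenizer (the model name is illustrative, and &lt;code&gt;llm&lt;/code&gt; and &lt;code&gt;params&lt;/code&gt; are assumed to exist from earlier examples):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: tokenize once yourself, then pass the IDs to skip re-tokenization.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
ids = tok.encode("What is the capital of France?")

outputs = llm.generate([{"prompt_token_ids": ids}], params)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;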

&lt;h3&gt;
  
  
  4. EngineCoreRequest — The Wire Format
&lt;/h3&gt;

&lt;p&gt;The output of &lt;code&gt;InputProcessor&lt;/code&gt; is an &lt;code&gt;EngineCoreRequest&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# vllm/v1/engine/__init__.py
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;EngineCoreRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msgspec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Struct&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;prompt_token_ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;mm_features&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;MultiModalFeatureSpec&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;sampling_params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SamplingParams&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;pooling_params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PoolingParams&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;eos_token_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;arrival_time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;lora_request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LoRARequest&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;cache_salt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;data_parallel_rank&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;prompt_embeds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;client_index&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;current_wave&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;trace_headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Mapping&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;resumable&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;
    &lt;span class="n"&gt;external_req_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why is this a separate type from &lt;code&gt;Request&lt;/code&gt; (which the scheduler uses internally)?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Separation of concerns&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;EngineCoreRequest&lt;/code&gt; is the &lt;strong&gt;transport&lt;/strong&gt; format — designed for serialization across process boundaries&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Request&lt;/code&gt; (in the scheduler) is the &lt;strong&gt;runtime&lt;/strong&gt; format — tracks mutable state like &lt;code&gt;num_computed_tokens&lt;/code&gt;, allocated blocks, output tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This separation is important because vLLM can run in &lt;strong&gt;multiprocess mode&lt;/strong&gt;: the FastAPI server and &lt;code&gt;InputProcessor&lt;/code&gt; run in one process, while the &lt;code&gt;EngineCore&lt;/code&gt; (scheduler + executor) runs in another. The &lt;code&gt;EngineCoreRequest&lt;/code&gt; gets serialized with &lt;code&gt;msgspec.msgpack.encode()&lt;/code&gt;, sent over a ZMQ socket, and deserialized on the other side.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Process 1 (Frontend)           Process 2 (Engine Core)
┌──────────────────┐           ┌──────────────────┐
│  InputProcessor  │           │    Scheduler     │
│       ↓          │           │       ↓          │
│ EngineCoreRequest│──ZMQ──→   │    Request       │
│                  │           │ (mutable state)  │
└──────────────────┘           └──────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;📝 Note:&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Why msgspec instead of pickle or JSON?&lt;/strong&gt; &lt;code&gt;msgspec.msgpack&lt;/code&gt; is 10-50x faster than pickle for structured data and produces smaller payloads than JSON. For a system processing thousands of requests per second, serialization overhead directly impacts throughput. This is not premature optimization — it's a measured bottleneck.&lt;/p&gt;
&lt;/blockquote&gt;
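
&lt;p&gt;The pattern is easy to reproduce standalone. Here is a toy sketch of the same encode/decode round trip (the struct below is illustrative, not the real &lt;code&gt;EngineCoreRequest&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Toy reproduction of the msgspec round trip; ToyRequest stands in for
# the real EngineCoreRequest.
import msgspec

class ToyRequest(msgspec.Struct):
    request_id: str
    prompt_token_ids: list[int]

data = msgspec.msgpack.encode(ToyRequest("req-0", [1, 2, 3]))  # compact bytes
req = msgspec.msgpack.decode(data, type=ToyRequest)            # typed decode
assert req.prompt_token_ids == [1, 2, 3]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;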

&lt;h3&gt;
  
  
  5. EngineCoreClient — Bridging Processes
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;EngineCoreClient&lt;/code&gt; abstracts the communication between the engine layer and the engine core:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Conceptual interface:
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;EngineCoreClient&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;EngineCoreRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Send request to the core.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;EngineCoreOutput&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Get completed/streaming outputs from the core.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The client has two modes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;When&lt;/th&gt;
&lt;th&gt;How it works&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;In-process&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;LLM&lt;/code&gt; class (offline)&lt;/td&gt;
&lt;td&gt;Direct function calls to &lt;code&gt;EngineCore&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multiprocess&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;API server&lt;/td&gt;
&lt;td&gt;ZMQ sockets between processes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In &lt;strong&gt;in-process mode&lt;/strong&gt; (what you get with the &lt;code&gt;LLM&lt;/code&gt; class), &lt;code&gt;EngineCoreClient&lt;/code&gt; directly calls methods on an &lt;code&gt;EngineCore&lt;/code&gt; object in the same process. No serialization overhead.&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;multiprocess mode&lt;/strong&gt; (the OpenAI-compatible server), &lt;code&gt;EngineCoreClient&lt;/code&gt; serializes requests with &lt;code&gt;msgspec.msgpack&lt;/code&gt;, sends them over ZMQ, and the &lt;code&gt;EngineCore&lt;/code&gt; process deserializes and processes them. This keeps the FastAPI event loop responsive while heavy inference runs in a separate process.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified view of multiprocess communication:
# Frontend process:
&lt;/span&gt;&lt;span class="n"&gt;encoded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;msgspec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;msgpack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;engine_core_request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;zmq_socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;encoded&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Engine core process:
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;zmq_socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;msgspec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;msgpack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;EngineCoreRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this matters&lt;/strong&gt;: This two-process architecture is critical for production deployments. Without it, long-running model forward passes on the GPU would block the HTTP server from accepting new requests.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. The step() Method — One Iteration
&lt;/h3&gt;

&lt;p&gt;Now we can understand what happens in each call to &lt;code&gt;llm_engine.step()&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# vllm/v1/engine/llm_engine.py (simplified)
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;RequestOutput&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. Get raw outputs from the engine core
&lt;/span&gt;    &lt;span class="n"&gt;engine_core_outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;engine_core&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_output&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# 2. Process outputs: detokenize, format, check completion
&lt;/span&gt;    &lt;span class="n"&gt;request_outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;process_outputs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;engine_core_outputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;request_outputs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each &lt;code&gt;step()&lt;/code&gt; returns a list of &lt;code&gt;RequestOutput&lt;/code&gt; objects — some may be streaming (partial), others may be finished. The &lt;code&gt;_run_engine()&lt;/code&gt; loop in &lt;code&gt;LLM&lt;/code&gt; collects the finished ones.&lt;/p&gt;

&lt;p&gt;But what triggers the core to actually run inference? In in-process mode, &lt;code&gt;get_output()&lt;/code&gt; internally calls &lt;code&gt;engine_core.step()&lt;/code&gt; which runs the scheduler + model execution. In multiprocess mode, the engine core runs its own loop continuously, and &lt;code&gt;get_output()&lt;/code&gt; just reads from a queue.&lt;/p&gt;
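
&lt;p&gt;Putting this together, here is a minimal sketch of the collection loop that &lt;code&gt;LLM._run_engine()&lt;/code&gt; performs (simplified; the real method also handles progress reporting, and &lt;code&gt;llm_engine&lt;/code&gt; is assumed to be a constructed &lt;code&gt;LLMEngine&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: drive the engine until every request finishes (simplified).
finished = []
while llm_engine.has_unfinished_requests():
    for output in llm_engine.step():
        if output.finished:
            finished.append(output)

finished.sort(key=lambda o: int(o.request_id))  # restore submission order
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;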

&lt;h3&gt;
  
  
  7. OutputProcessor — From Tokens to Text
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;OutputProcessor&lt;/code&gt; is the mirror of &lt;code&gt;InputProcessor&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# vllm/v1/engine/output_processor.py
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OutputProcessor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log_stats&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_states&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RequestState&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It receives &lt;code&gt;EngineCoreOutput&lt;/code&gt; (raw token IDs from the core) and produces &lt;code&gt;RequestOutput&lt;/code&gt; (user-facing results). The key operations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Accumulate tokens&lt;/strong&gt; — Maintains a running state per request&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detokenize&lt;/strong&gt; — Converts token IDs back to text using the tokenizer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handle streaming modes&lt;/strong&gt; — &lt;code&gt;CUMULATIVE&lt;/code&gt; returns the full text so far; &lt;code&gt;DELTA&lt;/code&gt; returns only new tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Track completion&lt;/strong&gt; — Checks &lt;code&gt;finish_reason&lt;/code&gt; to know when a request is done
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# EngineCoreOutput — what the core produces:
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;EngineCoreOutput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msgspec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Struct&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;new_token_ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;          &lt;span class="c1"&gt;# Newly generated tokens this step
&lt;/span&gt;    &lt;span class="n"&gt;new_logprobs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LogprobsLists&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;finish_reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;FinishReason&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  &lt;span class="c1"&gt;# STOP, LENGTH, ABORT, or None
&lt;/span&gt;    &lt;span class="n"&gt;stop_reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;num_cached_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The transformation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EngineCoreOutput                          RequestOutput
┌──────────────────────┐                 ┌─────────────────────────┐
│ request_id: "req-42" │                 │ request_id: "req-42"    │
│ new_token_ids: [464] │   detokenize    │ prompt: "Hello"         │
│ finish_reason: None  │ ───────────→    │ outputs: [              │
│                      │                 │   CompletionOutput(     │
└──────────────────────┘                 │     text: " world",     │
                                         │     token_ids: [464],   │
                                         │     finish_reason: None │
                                         │   )                     │
                                         │ ]                       │
                                         │ finished: False         │
                                         └─────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️ Warning:&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Detokenization is not trivially reversible.&lt;/strong&gt; Many tokenizers use byte-level BPE, where a single token might represent part of a multi-byte UTF-8 character. The &lt;code&gt;OutputProcessor&lt;/code&gt; handles these edge cases — if a token produces an incomplete character, it buffers bytes until a valid character is formed. This is why you sometimes see "garbled" output when accessing raw &lt;code&gt;token_ids&lt;/code&gt; without proper detokenization.&lt;/p&gt;
&lt;/blockquote&gt;
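
&lt;p&gt;A toy illustration of why buffering is needed (this is not vLLM's actual incremental detokenizer, just the underlying byte-level problem):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# A 4-byte emoji (U+1F600) arriving in two halves: the first half alone
# is not valid UTF-8, so the decoder must buffer until it completes.
buf = b""
for chunk in (b"\xf0\x9f", b"\x98\x80"):
    buf += chunk
    try:
        print(buf.decode("utf-8"))  # succeeds only once the character is whole
        buf = b""
    except UnicodeDecodeError:
        pass  # incomplete character: keep buffering
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;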

&lt;h3&gt;
  
  
  8. Putting It All Together — The Full Request Lifecycle
&lt;/h3&gt;

&lt;p&gt;Let's trace a request from start to finish:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User calls: llm.generate(["What is AI?"], SamplingParams(max_tokens=20))

1. LLM.generate()
   └→ _validate_and_add_requests()
      └→ llm_engine.add_request(request_id="0", prompt="What is AI?", params=...)
         └→ InputProcessor.process()
            - Tokenize: "What is AI?" → [1, 1724, 338, 319, 29902, 29973]
            - Create EngineCoreRequest(request_id="0",
                                       prompt_token_ids=[1, 1724, ...],
                                       sampling_params=...,
                                       arrival_time=time.time())
         └→ engine_core.add_request(engine_core_request)

2. LLM._run_engine()
   while has_unfinished_requests():
     └→ llm_engine.step()
        └→ engine_core.get_output()
           - Core runs: schedule → execute model → sample tokens
           - Returns EngineCoreOutput(request_id="0",
                                      new_token_ids=[23435],
                                      finish_reason=None)
        └→ OutputProcessor.process_outputs()
           - Detokenize [23435] → " Artificial"
           - Accumulate: text = " Artificial"
           - Return RequestOutput(finished=False, ...)

     ... more steps, generating tokens one at a time ...

     └→ llm_engine.step()  (final iteration)
        └→ engine_core.get_output()
           - Returns EngineCoreOutput(request_id="0",
                                      new_token_ids=[29889],
                                      finish_reason=FinishReason.LENGTH)
        └→ OutputProcessor.process_outputs()
           - Detokenize [29889] → "."
           - Accumulate: text = " Artificial intelligence is..."
           - finish_reason = "length" (hit max_tokens=20)
           - Return RequestOutput(finished=True, ...)

3. _run_engine() collects finished output, sorts by request_id, returns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;📝 Note:&lt;/strong&gt;&lt;br&gt;
In practice, the engine core doesn't generate just one token per step. With continuous batching, a single &lt;code&gt;step()&lt;/code&gt; processes tokens for &lt;strong&gt;all active requests simultaneously&lt;/strong&gt;. If there are 50 active requests, one GPU forward pass generates the next token for all 50. The &lt;code&gt;OutputProcessor&lt;/code&gt; then demultiplexes the results back to individual &lt;code&gt;RequestOutput&lt;/code&gt; objects.&lt;/p&gt;
&lt;/blockquote&gt;
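
&lt;p&gt;Conceptually, that demultiplexing step looks like this (a toy sketch with plain dicts, not vLLM's actual types):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Toy sketch: one step's outputs fan out to per-request state,
# keyed by request_id.
step_outputs = [
    {"request_id": "req-1", "new_token_ids": [464]},
    {"request_id": "req-2", "new_token_ids": [29889]},
]

states = {"req-1": [], "req-2": []}
for out in step_outputs:
    states[out["request_id"]].extend(out["new_token_ids"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;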




&lt;h2&gt;
  
  
  Exercises
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Exercise 1: Component Identification
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Difficulty&lt;/strong&gt;: Beginner&lt;br&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: Verify you can identify the role of each engine layer component&lt;/p&gt;

&lt;p&gt;For each of the following operations, name which component handles it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Converting the string &lt;code&gt;"Hello world"&lt;/code&gt; into token IDs &lt;code&gt;[15496, 995]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Deciding which requests get GPU time this iteration&lt;/li&gt;
&lt;li&gt;Converting &lt;code&gt;EngineArgs&lt;/code&gt; into a &lt;code&gt;VllmConfig&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Decoding token ID &lt;code&gt;[29889]&lt;/code&gt; back into the string &lt;code&gt;"."&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Sending an &lt;code&gt;EngineCoreRequest&lt;/code&gt; from the frontend process to the engine core process&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;Solution&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;InputProcessor&lt;/strong&gt; — it runs the tokenizer on the raw prompt string.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scheduler&lt;/strong&gt; (inside the engine core) — it decides which requests to include in each step's batch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LLMEngine.from_engine_args()&lt;/code&gt;&lt;/strong&gt; — the factory classmethod calls &lt;code&gt;engine_args.create_engine_config()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OutputProcessor&lt;/strong&gt; — it detokenizes raw token IDs back into text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EngineCoreClient&lt;/strong&gt; (multiprocess mode) — it serializes with &lt;code&gt;msgspec.msgpack&lt;/code&gt; and sends over ZMQ.&lt;/li&gt;
&lt;/ol&gt;


&lt;h3&gt;
  
  
  Exercise 2: Multiprocess vs. In-Process
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Difficulty&lt;/strong&gt;: Intermediate&lt;br&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: Understand when and why vLLM uses multiprocess communication&lt;/p&gt;

&lt;p&gt;Consider two scenarios:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario A&lt;/strong&gt;: Offline batch processing&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meta-llama/Llama-3.1-8B-Instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Scenario B&lt;/strong&gt;: Production API server&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vllm serve meta-llama/Llama-3.1-8B-Instruct &lt;span class="nt"&gt;--port&lt;/span&gt; 8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For each scenario:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Is the &lt;code&gt;EngineCoreClient&lt;/code&gt; in in-process or multiprocess mode?&lt;/li&gt;
&lt;li&gt;Does the &lt;code&gt;EngineCoreRequest&lt;/code&gt; actually get serialized with msgspec?&lt;/li&gt;
&lt;li&gt;What would happen if the API server ran the engine core in-process (same event loop)?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Solution&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario A (offline &lt;code&gt;LLM&lt;/code&gt; class)&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;In-process&lt;/strong&gt; — direct function calls to &lt;code&gt;EngineCore&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No&lt;/strong&gt; — the &lt;code&gt;EngineCoreRequest&lt;/code&gt; is created but passed directly without serialization.&lt;/li&gt;
&lt;li&gt;N/A — there's no HTTP server.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Scenario B (API server)&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Multiprocess&lt;/strong&gt; — ZMQ sockets between frontend and engine core processes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Yes&lt;/strong&gt; — &lt;code&gt;msgspec.msgpack.encode()&lt;/code&gt; serializes it, sends over ZMQ, and the core deserializes it.&lt;/li&gt;
&lt;li&gt;The HTTP server would block during GPU forward passes. A single inference step can take 10-100ms, during which the server couldn't accept new connections or respond to health checks. Under load, this would cause request timeouts and dropped connections.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Exercise 3: Tracing Data Transformations
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Difficulty&lt;/strong&gt;: Intermediate&lt;br&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: Follow the data as it changes form through the pipeline&lt;/p&gt;

&lt;p&gt;Starting with this call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_token_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]}],&lt;/span&gt;
    &lt;span class="nc"&gt;SamplingParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Does &lt;code&gt;InputProcessor&lt;/code&gt; tokenize this prompt? Why or why not?&lt;/li&gt;
&lt;li&gt;What fields of &lt;code&gt;EngineCoreRequest&lt;/code&gt; are set? What's &lt;code&gt;arrival_time&lt;/code&gt; used for?&lt;/li&gt;
&lt;li&gt;If the model generates tokens &lt;code&gt;[100, 200, 300]&lt;/code&gt;, what does the &lt;code&gt;EngineCoreOutput&lt;/code&gt; for the final step look like?&lt;/li&gt;
&lt;li&gt;What is &lt;code&gt;finish_reason&lt;/code&gt; and why?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Solution&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;No.&lt;/strong&gt; The input is &lt;code&gt;{"prompt_token_ids": [1, 2, 3, 4, 5]}&lt;/code&gt; — already tokenized. &lt;code&gt;InputProcessor&lt;/code&gt; detects the &lt;code&gt;prompt_token_ids&lt;/code&gt; key and skips tokenization, using the provided IDs directly.&lt;/li&gt;
&lt;li&gt;Key fields: &lt;code&gt;request_id&lt;/code&gt; (auto-assigned), &lt;code&gt;prompt_token_ids=[1, 2, 3, 4, 5]&lt;/code&gt;, &lt;code&gt;sampling_params&lt;/code&gt; (with &lt;code&gt;max_tokens=3, temperature=0&lt;/code&gt;), &lt;code&gt;arrival_time=time.time()&lt;/code&gt;. &lt;code&gt;arrival_time&lt;/code&gt; is used by the scheduler for FCFS ordering and for latency metrics.&lt;/li&gt;
&lt;li&gt;The final &lt;code&gt;EngineCoreOutput&lt;/code&gt; would be: &lt;code&gt;EngineCoreOutput(request_id="0", new_token_ids=[300], finish_reason=FinishReason.LENGTH, stop_reason=None)&lt;/code&gt;. Each step produces one new token, and the third token triggers the length limit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;FinishReason.LENGTH&lt;/code&gt;&lt;/strong&gt; — the model generated exactly &lt;code&gt;max_tokens=3&lt;/code&gt; tokens (&lt;code&gt;[100, 200, 300]&lt;/code&gt;) and was stopped. It didn't hit an EOS or stop token naturally.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Exercise 4: Design Challenge — Adding Request Priority
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Difficulty&lt;/strong&gt;: Advanced&lt;br&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: Think through how a feature propagates through the engine layer&lt;/p&gt;

&lt;p&gt;Suppose you want to add priority-based scheduling: high-priority requests should be processed before low-priority ones. Trace through the architecture:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Where does the user specify priority? (Hint: look at &lt;code&gt;LLM.generate()&lt;/code&gt; parameters)&lt;/li&gt;
&lt;li&gt;How does priority get from the user to the scheduler? List each class it passes through.&lt;/li&gt;
&lt;li&gt;Why is &lt;code&gt;priority&lt;/code&gt; a field on &lt;code&gt;EngineCoreRequest&lt;/code&gt; rather than just on &lt;code&gt;SamplingParams&lt;/code&gt;?&lt;/li&gt;
&lt;li&gt;What would happen if the &lt;code&gt;OutputProcessor&lt;/code&gt; also needed to know about priority? Would the current architecture support that?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Hint&lt;/strong&gt;: Priority is already partially implemented — look at the &lt;code&gt;EngineCoreRequest&lt;/code&gt; fields.&lt;/p&gt;

&lt;p&gt;Solution&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Via the &lt;code&gt;priority&lt;/code&gt; parameter in &lt;code&gt;LLM.generate(prompts, params, priority=[1, 2, ...])&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The path is: &lt;code&gt;LLM.generate()&lt;/code&gt; → &lt;code&gt;_validate_and_add_requests()&lt;/code&gt; → &lt;code&gt;LLMEngine.add_request()&lt;/code&gt; → &lt;code&gt;InputProcessor.process()&lt;/code&gt; (sets the &lt;code&gt;priority&lt;/code&gt; field on &lt;code&gt;EngineCoreRequest&lt;/code&gt;) → &lt;code&gt;EngineCoreClient.add_request()&lt;/code&gt; → &lt;code&gt;Scheduler&lt;/code&gt; (reads &lt;code&gt;priority&lt;/code&gt; from the request).&lt;/li&gt;
&lt;li&gt;Priority is a &lt;strong&gt;request-level&lt;/strong&gt; concept, not a &lt;strong&gt;generation-level&lt;/strong&gt; concept. &lt;code&gt;SamplingParams&lt;/code&gt; controls how tokens are sampled (temperature, top-p, etc.) — it's about the quality of the output. Priority controls when the request gets scheduled — it's about resource allocation. Mixing them would conflate two different concerns.&lt;/li&gt;
&lt;li&gt;Yes — the &lt;code&gt;OutputProcessor&lt;/code&gt; receives &lt;code&gt;EngineCoreOutput&lt;/code&gt; which includes the &lt;code&gt;request_id&lt;/code&gt;. It could look up priority from its internal state (it already maintains per-request &lt;code&gt;RequestState&lt;/code&gt;). But currently it doesn't need to — priority only matters for scheduling decisions.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Exercise 5: Streaming Output Modes
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Difficulty&lt;/strong&gt;: Advanced&lt;br&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: Understand the difference between CUMULATIVE and DELTA output modes&lt;/p&gt;

&lt;p&gt;Given a request that generates the text &lt;code&gt;"Hello world!"&lt;/code&gt; as three tokens:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Step 1: token &lt;code&gt;"Hello"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Step 2: token &lt;code&gt;" world"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Step 3: token &lt;code&gt;"!"&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Write out what &lt;code&gt;RequestOutput.outputs[0].text&lt;/code&gt; contains at each step for:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;code&gt;output_kind = RequestOutputKind.CUMULATIVE&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;output_kind = RequestOutputKind.DELTA&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When would you use each mode? Think about a streaming chat UI vs. a batch processing pipeline.&lt;/p&gt;

&lt;p&gt;Solution&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CUMULATIVE&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Step 1: &lt;code&gt;"Hello"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Step 2: &lt;code&gt;"Hello world"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Step 3: &lt;code&gt;"Hello world!"&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;DELTA&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Step 1: &lt;code&gt;"Hello"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Step 2: &lt;code&gt;" world"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Step 3: &lt;code&gt;"!"&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When to use each&lt;/strong&gt;: DELTA is ideal for streaming chat UIs — you append each delta directly to the display. CUMULATIVE is simpler for batch pipelines — you always have the full text so far, no need to track previous outputs. CUMULATIVE is the default because it's easier to use correctly.&lt;/p&gt;
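
&lt;p&gt;A minimal sketch of the consumer-side difference (here &lt;code&gt;stream&lt;/code&gt; is a hypothetical iterable of &lt;code&gt;RequestOutput&lt;/code&gt; objects and &lt;code&gt;delta_mode&lt;/code&gt; a hypothetical flag):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: how a consumer handles each mode; stream and delta_mode are
# hypothetical stand-ins, not vLLM API.
text = ""
for out in stream:
    piece = out.outputs[0].text
    if delta_mode:
        text += piece   # DELTA: append only the new fragment
    else:
        text = piece    # CUMULATIVE: replace with the full text so far
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;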




&lt;h2&gt;
  
  
  Quiz
&lt;/h2&gt;

&lt;p&gt;Answer these questions based on today's material. Try to answer each question before revealing the answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q1: What are the three main components inside &lt;code&gt;LLMEngine&lt;/code&gt;, and what does each one do?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Answer&lt;/p&gt;

&lt;p&gt;&lt;code&gt;InputProcessor&lt;/code&gt;, &lt;code&gt;EngineCoreClient&lt;/code&gt;, and &lt;code&gt;OutputProcessor&lt;/code&gt;. &lt;code&gt;InputProcessor&lt;/code&gt; tokenizes prompts and creates &lt;code&gt;EngineCoreRequest&lt;/code&gt; objects. &lt;code&gt;EngineCoreClient&lt;/code&gt; sends requests to and receives outputs from the engine core (either in-process or via ZMQ). &lt;code&gt;OutputProcessor&lt;/code&gt; detokenizes raw token IDs back into text and formats &lt;code&gt;RequestOutput&lt;/code&gt; objects for the user.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q2: Why does vLLM have both &lt;code&gt;EngineCoreRequest&lt;/code&gt; and &lt;code&gt;Request&lt;/code&gt; as separate types?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Answer&lt;/p&gt;

&lt;p&gt;They serve different purposes across a process boundary. &lt;code&gt;EngineCoreRequest&lt;/code&gt; is the transport/wire format — immutable, serializable with &lt;code&gt;msgspec&lt;/code&gt;, designed to cross process boundaries efficiently. &lt;code&gt;Request&lt;/code&gt; is the scheduler's internal runtime format — mutable, tracks state like &lt;code&gt;num_computed_tokens&lt;/code&gt;, allocated KV cache blocks, and output tokens. Mixing these concerns would either make serialization expensive or make runtime tracking awkward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q3: What serialization format does vLLM use for inter-process communication, and why was it chosen over alternatives like pickle or JSON?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Answer&lt;/p&gt;

&lt;p&gt;&lt;code&gt;msgspec.msgpack&lt;/code&gt; — a binary MessagePack format. It's 10-50x faster than pickle for structured data and produces compact binary payloads. JSON was rejected because it's text-based (larger payloads, slower parsing). Pickle was rejected because it's slow for structured data and has security concerns. At thousands of requests per second, serialization overhead is a real bottleneck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q4: In multiprocess mode, what happens if the engine core is busy running a forward pass when a new HTTP request arrives?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Answer&lt;/p&gt;

&lt;p&gt;The new request is accepted by the frontend process and queued. Because the frontend (FastAPI + &lt;code&gt;InputProcessor&lt;/code&gt;) runs in a separate process from the engine core, it can accept and preprocess new HTTP requests while the GPU is busy. The &lt;code&gt;EngineCoreRequest&lt;/code&gt; is sent over ZMQ and queued for the next scheduling iteration. This is exactly why the two-process architecture exists.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q5: What does &lt;code&gt;OutputProcessor&lt;/code&gt; do when it receives a token that represents an incomplete UTF-8 character?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It buffers the incomplete bytes until a valid character is formed. Many tokenizers use byte-level BPE, where tokens can split in the middle of multi-byte UTF-8 characters (e.g., emoji, CJK characters). The &lt;code&gt;OutputProcessor&lt;/code&gt; accumulates bytes and only emits text when complete characters are available. This prevents garbled output in streaming responses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q6: True or false: In in-process mode (using the &lt;code&gt;LLM&lt;/code&gt; class), &lt;code&gt;EngineCoreRequest&lt;/code&gt; is still created even though it doesn't need to be serialized.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;True. The &lt;code&gt;InputProcessor&lt;/code&gt; always creates an &lt;code&gt;EngineCoreRequest&lt;/code&gt; regardless of execution mode. In in-process mode, the request is passed directly to the engine core without serialization. The &lt;code&gt;EngineCoreRequest&lt;/code&gt; type serves as a clean interface contract between the engine layer and the core, even when no process boundary exists.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q7: What is the purpose of &lt;code&gt;arrival_time&lt;/code&gt; in &lt;code&gt;EngineCoreRequest&lt;/code&gt;?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It records when the request was submitted, enabling scheduling policies like FCFS (first-come-first-served). The scheduler can use &lt;code&gt;arrival_time&lt;/code&gt; to prioritize older requests over newer ones. It's also used for metrics: you can measure end-to-end latency by comparing &lt;code&gt;arrival_time&lt;/code&gt; with the completion time. Without it, the scheduler would have no notion of fairness or request ordering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q8: Why does &lt;code&gt;LLMEngine.from_engine_args()&lt;/code&gt; exist as a classmethod factory instead of putting all the logic in &lt;code&gt;__init__&lt;/code&gt;?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To separate argument parsing from construction. The factory method converts user-friendly &lt;code&gt;EngineArgs&lt;/code&gt; (flat key-value pairs) into a structured &lt;code&gt;VllmConfig&lt;/code&gt; (nested, validated configuration), selects the right executor class, and then calls &lt;code&gt;__init__&lt;/code&gt;. This keeps &lt;code&gt;__init__&lt;/code&gt; simple — it receives fully validated, structured objects. It also allows alternative construction paths (e.g., creating &lt;code&gt;LLMEngine&lt;/code&gt; directly with a &lt;code&gt;VllmConfig&lt;/code&gt; for testing).&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LLMEngine&lt;/code&gt;&lt;/strong&gt; is the orchestrator that connects the user-facing API to the engine core, with three sub-components: &lt;code&gt;InputProcessor&lt;/code&gt;, &lt;code&gt;EngineCoreClient&lt;/code&gt;, and &lt;code&gt;OutputProcessor&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;InputProcessor&lt;/code&gt;&lt;/strong&gt; normalizes various input formats (strings, token IDs, multimodal data) into &lt;code&gt;EngineCoreRequest&lt;/code&gt; — the standard wire format&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;EngineCoreRequest&lt;/code&gt;&lt;/strong&gt; uses &lt;code&gt;msgspec.Struct&lt;/code&gt; for fast serialization, enabling efficient multiprocess communication via ZMQ&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;EngineCoreClient&lt;/code&gt;&lt;/strong&gt; abstracts the communication mode: in-process for offline use, multiprocess (ZMQ) for production servers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;OutputProcessor&lt;/code&gt;&lt;/strong&gt; reverses the input pipeline: accumulates tokens, detokenizes, handles streaming modes (CUMULATIVE vs DELTA), and produces &lt;code&gt;RequestOutput&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;two-process architecture&lt;/strong&gt; (frontend + engine core) is critical for production: it keeps the HTTP server responsive while the GPU runs inference&lt;/li&gt;
&lt;li&gt;Next session: &lt;strong&gt;The Scheduler&lt;/strong&gt; — how vLLM decides which requests get GPU time, the token budget system, and chunked prefill&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Generated from my &lt;a href="https://github.com/ccwang/ai-study" rel="noopener noreferrer"&gt;ai-study&lt;/a&gt; learning project.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>vllm</category>
      <category>llm</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Session 1: vLLM Overview and the User API</title>
      <dc:creator>Ben</dc:creator>
      <pubDate>Sun, 01 Feb 2026 22:00:01 +0000</pubDate>
      <link>https://dev.to/c2sea/session-1-vllm-overview-and-the-user-api-2406</link>
      <guid>https://dev.to/c2sea/session-1-vllm-overview-and-the-user-api-2406</guid>
      <description>&lt;p&gt;&lt;em&gt;This is part of my vLLM learning series. In this session, I cover Step 1 (The User API).&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: This content was generated by Claude, grounded on the actual&lt;br&gt;
&lt;a href="https://github.com/vllm-project/vllm" rel="noopener noreferrer"&gt;vLLM&lt;/a&gt; codebase. It is intended for personal&lt;br&gt;
learning only and may contain inaccuracies. Always verify against the&lt;br&gt;
original source code and official documentation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Topic&lt;/strong&gt;: vLLM&lt;br&gt;
&lt;strong&gt;Date&lt;/strong&gt;: 2026-01-31&lt;br&gt;
&lt;strong&gt;Sections covered&lt;/strong&gt;: Step 1 (The User API)&lt;br&gt;
&lt;strong&gt;Prerequisites&lt;/strong&gt;: None&lt;/p&gt;


&lt;h2&gt;
  
  
  Today's Material
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. What is vLLM and Why Does It Matter?
&lt;/h3&gt;

&lt;p&gt;LLM inference is GPU-memory-bound. When a model generates text, it needs to store &lt;strong&gt;key-value (KV) caches&lt;/strong&gt; — intermediate computations from the attention mechanism — for every token in every active request. Naive implementations pre-allocate the maximum possible sequence length for each request, wasting 60-80% of GPU memory on empty space.&lt;/p&gt;

&lt;p&gt;vLLM solves this with &lt;strong&gt;PagedAttention&lt;/strong&gt;: instead of pre-allocating a giant contiguous buffer per request, it carves GPU memory into fixed-size &lt;strong&gt;blocks&lt;/strong&gt; (default 16 tokens each) and allocates them on demand — just like how an operating system manages virtual memory with pages.&lt;/p&gt;

&lt;p&gt;The result: near-optimal memory utilization and &lt;strong&gt;2-4x higher throughput&lt;/strong&gt; than HuggingFace Transformers on typical workloads.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;📝 Note:&lt;/strong&gt;&lt;br&gt;
Think of the difference like this: the naive approach is like reserving an entire row of seats in a theater for each person "just in case" they bring friends. PagedAttention is like assigning individual seats as people actually show up.&lt;/p&gt;
&lt;/blockquote&gt;
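
&lt;p&gt;The savings are easy to estimate. Here's a back-of-envelope sketch with made-up request lengths; only the 16-token block size comes from the default mentioned above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical numbers: compare naive pre-allocation vs. on-demand blocks.
import math

max_seq_len = 4096
block_size = 16
actual_lens = [120, 800, 45, 2000]  # tokens actually used per request

naive_slots = len(actual_lens) * max_seq_len  # reserve the max for everyone
paged_slots = sum(math.ceil(n / block_size) * block_size for n in actual_lens)

print(naive_slots)                    # 16384 token slots reserved
print(paged_slots)                    # 2976 slots: only what's needed, plus at most one block each
print(1 - paged_slots / naive_slots)  # ~0.82, i.e. ~82% of the naive reservation was waste
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;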
&lt;h3&gt;
  
  
  2. High-Level Architecture
&lt;/h3&gt;

&lt;p&gt;Before diving into code, here's the bird's-eye view of how vLLM is organized:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────┐
│                  User-Facing Layer                   │
│       LLM class  |  OpenAI API Server  |  gRPC       │
└──────────────────────────┬───────────────────────────┘
                           │
┌──────────────────────────▼───────────────────────────┐
│                     Engine Layer                     │
│ InputProcessor → EngineCoreClient → OutputProcessor  │
└──────────────────────────┬───────────────────────────┘
                           │
┌──────────────────────────▼───────────────────────────┐
│                     Engine Core                      │
│         Scheduler → Executor → Workers → GPU         │
│            └── KVCacheManager (BlockPool)            │
└──────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;User-Facing&lt;/strong&gt; — Multiple entry points (Python API, HTTP, gRPC) that all funnel into the engine&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engine Layer&lt;/strong&gt; — Tokenize inputs, relay to the core, format outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engine Core&lt;/strong&gt; — The scheduling loop, KV cache management, and GPU execution&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Today we focus on layer 1: the &lt;code&gt;LLM&lt;/code&gt; class and its associated types.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The LLM Class — Your Main Interface
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;LLM&lt;/code&gt; class in &lt;code&gt;vllm/entrypoints/llm.py&lt;/code&gt; is the primary interface for offline batch inference. Here's its constructor (simplified to the most important parameters):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# vllm/entrypoints/llm.py
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LLM&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tensor_parallel_size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ModelDType&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;quantization&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;QuantizationMethods&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;gpu_memory_utilization&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;engine_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;EngineArgs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;tensor_parallel_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tensor_parallel_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;quantization&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;quantization&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;gpu_memory_utilization&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;gpu_memory_utilization&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm_engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LLMEngine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_engine_args&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;engine_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;engine_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;usage_context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;UsageContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LLM_CLASS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request_counter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key things to notice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;LLM&lt;/code&gt; is &lt;strong&gt;thin&lt;/strong&gt; — it creates an &lt;code&gt;EngineArgs&lt;/code&gt; config, then hands everything off to &lt;code&gt;LLMEngine.from_engine_args()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;gpu_memory_utilization=0.9&lt;/code&gt; means vLLM may use up to 90% of total GPU memory for everything it allocates (model weights, activations, and the KV cache, which gets whatever remains after profiling), leaving 10% headroom for other processes&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tensor_parallel_size&lt;/code&gt; controls how many GPUs to shard the model across — set to 1 for single-GPU&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Tip:&lt;/strong&gt;&lt;br&gt;
If you get CUDA out-of-memory errors, lower &lt;code&gt;gpu_memory_utilization&lt;/code&gt; (e.g., to 0.8). If you want more throughput and have headroom, raise it (up to ~0.95).&lt;/p&gt;
&lt;/blockquote&gt;
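
&lt;p&gt;A minimal construction sketch (the model name and values here are illustrative, not recommendations):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.85,  # lowered from the 0.9 default for extra headroom
    tensor_parallel_size=1,       # single GPU
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;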

&lt;h3&gt;
  
  
  4. The generate() Method — Where Requests Enter
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# vllm/entrypoints/llm.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prompts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PromptType&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Sequence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;PromptType&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;sampling_params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SamplingParams&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Sequence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;SamplingParams&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;use_tqdm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;[...,&lt;/span&gt; &lt;span class="n"&gt;tqdm&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;lora_request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;LoRARequest&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;LoRARequest&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;RequestOutput&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;model_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model_config&lt;/span&gt;
    &lt;span class="n"&gt;runner_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model_config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;runner_type&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;runner_type&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LLM.generate() is only supported for generative models.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;sampling_params&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;sampling_params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_default_sampling_params&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_validate_and_add_requests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;prompts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sampling_params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;use_tqdm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;use_tqdm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;lora_request&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;...,&lt;/span&gt;
        &lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_run_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;use_tqdm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;use_tqdm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;engine_class&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;validate_outputs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RequestOutput&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The flow is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Validate&lt;/strong&gt; that this is a generative model (not an embedding model)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add requests&lt;/strong&gt; via &lt;code&gt;_validate_and_add_requests()&lt;/code&gt; — normalizes inputs, pairs each prompt with its &lt;code&gt;SamplingParams&lt;/code&gt;, and sends them to the engine&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run the engine&lt;/strong&gt; via &lt;code&gt;_run_engine()&lt;/code&gt; — loops until all requests are finished&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Return&lt;/strong&gt; sorted &lt;code&gt;RequestOutput&lt;/code&gt; objects&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You can pass a single &lt;code&gt;SamplingParams&lt;/code&gt; (applied to all prompts) or a list (one per prompt). This is useful when different prompts need different temperatures or stop conditions.&lt;/p&gt;
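
&lt;p&gt;For example, a sketch pairing each prompt with its own parameters (the two lists must be the same length; &lt;code&gt;llm&lt;/code&gt; is the instance from above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from vllm import SamplingParams

prompts = ["Summarize this article.", "Write a haiku about GPUs."]
params = [
    SamplingParams(temperature=0.0, max_tokens=128),  # deterministic summary
    SamplingParams(temperature=1.0, max_tokens=40),   # creative haiku
]
outputs = llm.generate(prompts, params)  # params[i] applies to prompts[i]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;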

&lt;h3&gt;
  
  
  5. _run_engine() — The Processing Loop
&lt;/h3&gt;

&lt;p&gt;This is where the actual inference happens:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# vllm/entrypoints/llm.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_run_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;use_tqdm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;[...,&lt;/span&gt; &lt;span class="n"&gt;tqdm&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;RequestOutput&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;PoolingRequestOutput&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;RequestOutput&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;PoolingRequestOutput&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;total_in_toks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;total_out_toks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm_engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;has_unfinished_requests&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;step_outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm_engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;step_outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;finished&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Sort the outputs by request ID.
&lt;/span&gt;    &lt;span class="c1"&gt;# This is necessary because some requests may be finished earlier than
&lt;/span&gt;    &lt;span class="c1"&gt;# its previous requests.
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: &lt;strong&gt;&lt;code&gt;_run_engine()&lt;/code&gt; is a simple loop&lt;/strong&gt;. It calls &lt;code&gt;self.llm_engine.step()&lt;/code&gt; repeatedly. Each &lt;code&gt;step()&lt;/code&gt; runs one iteration of the scheduling + inference pipeline — potentially processing hundreds of requests in a single forward pass. Finished requests come back as &lt;code&gt;RequestOutput&lt;/code&gt; objects.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️ Warning:&lt;/strong&gt;&lt;br&gt;
The outputs are sorted by &lt;code&gt;request_id&lt;/code&gt; at the end because requests don't finish in order. A short request (e.g., "Say hi") may finish in 5 iterations while a long request (e.g., "Write an essay") takes 500. The sorting ensures the output list matches the input prompt order.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Why this matters&lt;/strong&gt;: This loop is where continuous batching happens. Unlike static batching (process N prompts, wait for all to finish, return), vLLM processes requests at different stages simultaneously. Request A might be mid-generation while Request B is just starting its prefill.&lt;/p&gt;
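
&lt;p&gt;A toy simulation of that difference (this is scheduling arithmetic only, not vLLM's actual scheduler):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from collections import deque

# Each number is how many steps a request still needs; the batch holds 2 slots.
queue = deque([5, 500, 3, 4])
batch: list[int] = []
steps = 0

while queue or batch:
    # Continuous batching: refill any free slot every iteration,
    # instead of waiting for the whole batch to drain.
    while queue and len(batch) &lt; 2:
        batch.append(queue.popleft())
    steps += 1
    batch = [r - 1 for r in batch if r &gt; 1]  # r == 1 finishes this step

print(steps)  # the short requests slot in and finish while the 500-step one runs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;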

&lt;h3&gt;
  
  
  6. SamplingParams — Controlling Generation
&lt;/h3&gt;

&lt;p&gt;Every request carries a &lt;code&gt;SamplingParams&lt;/code&gt; that controls how tokens are selected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# vllm/sampling_params.py
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SamplingParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;PydanticMsgspecMixin&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;msgspec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Struct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;omit_defaults&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# --- Core sampling ---
&lt;/span&gt;    &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;                          &lt;span class="c1"&gt;# Number of output sequences
&lt;/span&gt;    &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;            &lt;span class="c1"&gt;# 0 = greedy, higher = more random
&lt;/span&gt;    &lt;span class="n"&gt;top_p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;                  &lt;span class="c1"&gt;# Nucleus sampling threshold
&lt;/span&gt;    &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;                      &lt;span class="c1"&gt;# Top-K filtering (0 = disabled)
&lt;/span&gt;    &lt;span class="n"&gt;min_p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;                  &lt;span class="c1"&gt;# Minimum probability threshold
&lt;/span&gt;    &lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;             &lt;span class="c1"&gt;# Reproducible sampling
&lt;/span&gt;
    &lt;span class="c1"&gt;# --- Penalties ---
&lt;/span&gt;    &lt;span class="n"&gt;presence_penalty&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;       &lt;span class="c1"&gt;# Penalize tokens that appeared
&lt;/span&gt;    &lt;span class="n"&gt;frequency_penalty&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;      &lt;span class="c1"&gt;# Penalize by frequency
&lt;/span&gt;    &lt;span class="n"&gt;repetition_penalty&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;     &lt;span class="c1"&gt;# Multiplicative penalty
&lt;/span&gt;
    &lt;span class="c1"&gt;# --- Generation limits ---
&lt;/span&gt;    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;         &lt;span class="c1"&gt;# Output length limit
&lt;/span&gt;    &lt;span class="n"&gt;min_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;                 &lt;span class="c1"&gt;# Minimum before allowing EOS
&lt;/span&gt;    &lt;span class="n"&gt;ignore_eos&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;            &lt;span class="c1"&gt;# Don't stop at EOS
&lt;/span&gt;
    &lt;span class="c1"&gt;# --- Stop conditions ---
&lt;/span&gt;    &lt;span class="n"&gt;stop&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;stop_token_ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="c1"&gt;# --- Output control ---
&lt;/span&gt;    &lt;span class="n"&gt;logprobs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;         &lt;span class="c1"&gt;# Return top-N log probabilities
&lt;/span&gt;    &lt;span class="n"&gt;prompt_logprobs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  &lt;span class="c1"&gt;# Prompt token log probs
&lt;/span&gt;    &lt;span class="n"&gt;detokenize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;             &lt;span class="c1"&gt;# Decode token IDs to text
&lt;/span&gt;
    &lt;span class="c1"&gt;# --- Advanced ---
&lt;/span&gt;    &lt;span class="n"&gt;structured_outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;StructuredOutputsParams&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  &lt;span class="c1"&gt;# JSON schema
&lt;/span&gt;    &lt;span class="n"&gt;logit_bias&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;output_kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;RequestOutputKind&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RequestOutputKind&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CUMULATIVE&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that &lt;code&gt;SamplingParams&lt;/code&gt; inherits from &lt;code&gt;msgspec.Struct&lt;/code&gt;, not a Python dataclass. This is a deliberate performance choice — &lt;code&gt;msgspec&lt;/code&gt; serialization is 10-50x faster than &lt;code&gt;pickle&lt;/code&gt;, which matters when requests cross process boundaries (more on this in a future session).&lt;/p&gt;
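
&lt;p&gt;A standalone sketch of the pattern (a toy struct, not vLLM's actual types): &lt;code&gt;msgspec.Struct&lt;/code&gt; instances encode straight to compact MessagePack bytes and back:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import msgspec

class Params(msgspec.Struct):
    temperature: float = 1.0
    top_p: float = 1.0
    max_tokens: int = 16

encoder = msgspec.msgpack.Encoder()
decoder = msgspec.msgpack.Decoder(Params)

payload = encoder.encode(Params(temperature=0.7, max_tokens=64))
print(len(payload))             # a few dozen bytes, ready to ship over a socket
print(decoder.decode(payload))  # Params(temperature=0.7, top_p=1.0, max_tokens=64)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;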

&lt;h4&gt;
  
  
  Validation logic
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;SamplingParams.__post_init__()&lt;/code&gt; enforces constraints:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# vllm/sampling_params.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__post_init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Normalize stop to a list
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stop&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stop&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stop&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Zero temperature → force greedy sampling
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;_SAMPLING_EPS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;top_p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min_p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_verify_greedy_sampling&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_verify_args&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_verify_args&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;n must be at least 1, got &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;presence_penalty&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;VLLMValidationError&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;top_p&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;VLLMValidationError&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_tokens&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_tokens&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;VLLMValidationError&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;📝 Note:&lt;/strong&gt;&lt;br&gt;
When &lt;code&gt;temperature=0&lt;/code&gt;, vLLM automatically sets &lt;code&gt;top_p=1.0&lt;/code&gt;, &lt;code&gt;top_k=0&lt;/code&gt;, and &lt;code&gt;min_p=0.0&lt;/code&gt;. This is because greedy decoding (always pick the highest-probability token) makes all other sampling parameters irrelevant. The code enforces this rather than letting the user set contradictory values.&lt;/p&gt;
&lt;/blockquote&gt;
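
&lt;p&gt;You can observe the coercion directly. A quick check, assuming the behavior shown in the snippet above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from vllm import SamplingParams

# Contradictory inputs: greedy temperature plus sampling filters.
params = SamplingParams(temperature=0.0, top_k=50, top_p=0.5)
print(params.top_k, params.top_p)  # expect: 0 1.0 (reset for greedy decoding)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;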

&lt;h3&gt;
  
  
  7. RequestOutput and CompletionOutput — What You Get Back
&lt;/h3&gt;

&lt;p&gt;After &lt;code&gt;generate()&lt;/code&gt; finishes, you get a list of &lt;code&gt;RequestOutput&lt;/code&gt; objects:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# vllm/outputs.py
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RequestOutput&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;prompt_token_ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;prompt_logprobs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PromptLogprobs&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;CompletionOutput&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;finished&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;RequestStateStats&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;num_cached_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each &lt;code&gt;RequestOutput&lt;/code&gt; contains one or more &lt;code&gt;CompletionOutput&lt;/code&gt; objects (one per &lt;code&gt;n&lt;/code&gt; in &lt;code&gt;SamplingParams&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# vllm/outputs.py
&lt;/span&gt;&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CompletionOutput&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;                         &lt;span class="c1"&gt;# Which of the n outputs
&lt;/span&gt;    &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;                          &lt;span class="c1"&gt;# Generated text
&lt;/span&gt;    &lt;span class="n"&gt;token_ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;GenericSequence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;    &lt;span class="c1"&gt;# Generated token IDs
&lt;/span&gt;    &lt;span class="n"&gt;cumulative_logprob&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;   &lt;span class="c1"&gt;# Sum of log probs
&lt;/span&gt;    &lt;span class="n"&gt;logprobs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SampleLogprobs&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;    &lt;span class="c1"&gt;# Per-token log probs
&lt;/span&gt;    &lt;span class="n"&gt;finish_reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;          &lt;span class="c1"&gt;# "stop", "length", or None
&lt;/span&gt;    &lt;span class="n"&gt;stop_reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;      &lt;span class="c1"&gt;# What triggered the stop
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;finished&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;finish_reason&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A typical usage pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;vllm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LLM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SamplingParams&lt;/span&gt;

&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meta-llama/Llama-3.1-8B-Instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the capital of France?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain gravity in one sentence.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="nc"&gt;SamplingParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;
    &lt;span class="n"&gt;generated&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
    &lt;span class="n"&gt;reason&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;finish_reason&lt;/span&gt;  &lt;span class="c1"&gt;# "stop" or "length"
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Prompt: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Output: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;generated&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Finished because: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;code&gt;n &amp;gt; 1&lt;/code&gt;, you get multiple completions per prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tell me a joke.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="nc"&gt;SamplingParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# outputs[0].outputs has 3 CompletionOutput objects
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;completion&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Completion &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;completion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;completion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  8. Beyond generate() — Other Task Types
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;LLM&lt;/code&gt; class supports more than text generation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Chat (applies chat template automatically)
&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is 2+2?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;sampling_params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;SamplingParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Embeddings
&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello world&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Goodbye world&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Classification (not all models support this)
&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;classify&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;This movie was great!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Terrible film.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Scoring (cross-encoder style)
&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doc1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doc2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doc3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each method validates that the loaded model supports the requested task via &lt;code&gt;runner_type&lt;/code&gt;. If you try &lt;code&gt;llm.generate()&lt;/code&gt; on an embedding model, you get a clear error.&lt;/p&gt;
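
&lt;p&gt;As a rough illustration, that guard looks something like the sketch below. The mapping and error message are made up; only the &lt;code&gt;runner_type&lt;/code&gt; check itself reflects the behavior described above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative sketch only -- not vLLM's actual implementation.
def validate_runner_type(runner_type: str, requested_task: str) -&gt; None:
    # Hypothetical mapping from runner type to the tasks it can serve.
    supported = {
        "generate": {"generate", "chat"},
        "pooling": {"embed", "classify", "score"},
    }
    if requested_task not in supported.get(runner_type, set()):
        raise ValueError(
            f"Model with runner_type={runner_type!r} does not support "
            f"{requested_task!r}; load a model with a matching runner."
        )

validate_runner_type("pooling", "generate")  # raises a clear error
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;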




&lt;h2&gt;
  
  
  Exercises
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Exercise 1: Basic Generation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Difficulty&lt;/strong&gt;: Beginner&lt;br&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: Understand the relationship between &lt;code&gt;SamplingParams&lt;/code&gt; and output&lt;/p&gt;

&lt;p&gt;Given this code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meta-llama/Llama-3.1-8B-Instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SamplingParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Count from 1 to 10.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;What happens when &lt;code&gt;temperature=0&lt;/code&gt; and &lt;code&gt;n=2&lt;/code&gt;? Will the two completions be different or identical? Why?&lt;/li&gt;
&lt;li&gt;What will &lt;code&gt;finish_reason&lt;/code&gt; be for each completion? (&lt;code&gt;"stop"&lt;/code&gt; or &lt;code&gt;"length"&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;How many &lt;code&gt;CompletionOutput&lt;/code&gt; objects will be in &lt;code&gt;outputs[0].outputs&lt;/code&gt;?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Hint&lt;/strong&gt;: Think about what greedy decoding means for multiple samples.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Identical.&lt;/strong&gt; Temperature=0 means greedy decoding — always pick the highest-probability token. With no randomness, every sample produces the exact same sequence. Running &lt;code&gt;n=2&lt;/code&gt; with greedy is wasteful.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;"length"&lt;/code&gt;&lt;/strong&gt; for both. &lt;code&gt;max_tokens=5&lt;/code&gt; will cut off "Count from 1 to 10" well before the model naturally stops — it would need at least ~20 tokens ("1, 2, 3, 4, 5, 6, 7, 8, 9, 10").&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2&lt;/strong&gt; — one per &lt;code&gt;n&lt;/code&gt;. &lt;code&gt;outputs[0].outputs[0]&lt;/code&gt; and &lt;code&gt;outputs[0].outputs[1]&lt;/code&gt;, though both will have the same text.&lt;/li&gt;
&lt;/ol&gt;
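
&lt;p&gt;You can verify all three answers empirically with a few assertions (this assumes the model above is available locally):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from vllm import LLM, SamplingParams

# Greedy decoding (temperature=0) follows the same argmax path for
# every sample, so both completions should match exactly.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0, max_tokens=5, n=2)
outputs = llm.generate(["Count from 1 to 10."], params)

completions = outputs[0].outputs
assert len(completions) == 2                       # one per n
assert completions[0].text == completions[1].text  # greedy, so identical
assert all(c.finish_reason == "length" for c in completions)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;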

&lt;h3&gt;
  
  
  Exercise 2: Trace the Call Path
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Difficulty&lt;/strong&gt;: Intermediate&lt;br&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: Map the execution flow from user call to engine loop&lt;/p&gt;

&lt;p&gt;Trace what happens when this code executes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meta-llama/Llama-3.1-8B-Instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;World&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="nc"&gt;SamplingParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For each step, name the method and describe what it does:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What does &lt;code&gt;generate()&lt;/code&gt; call first?&lt;/li&gt;
&lt;li&gt;How are the two prompts and the single &lt;code&gt;SamplingParams&lt;/code&gt; paired?&lt;/li&gt;
&lt;li&gt;What does &lt;code&gt;_run_engine()&lt;/code&gt; do on each iteration?&lt;/li&gt;
&lt;li&gt;Why are outputs sorted at the end?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;generate()&lt;/code&gt; first validates the runner type is &lt;code&gt;"generate"&lt;/code&gt;, then calls &lt;code&gt;_validate_and_add_requests()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The single &lt;code&gt;SamplingParams&lt;/code&gt; is replicated: &lt;code&gt;[params] * num_requests&lt;/code&gt; — so both "Hello" and "World" get the same &lt;code&gt;max_tokens=10&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Each iteration calls &lt;code&gt;self.llm_engine.step()&lt;/code&gt;, which runs one scheduling + inference cycle. Finished requests are collected into the &lt;code&gt;outputs&lt;/code&gt; list.&lt;/li&gt;
&lt;li&gt;Because requests finish out of order. "Hello" (shorter) might finish before "World" (or vice versa depending on generation). Sorting by &lt;code&gt;request_id&lt;/code&gt; ensures &lt;code&gt;outputs[0]&lt;/code&gt; corresponds to "Hello" and &lt;code&gt;outputs[1]&lt;/code&gt; to "World".&lt;/li&gt;
&lt;/ol&gt;
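
&lt;p&gt;Putting the four steps together, the path condenses to the paraphrased sketch below. The bodies are simplified pseudocode built from this session's descriptions, not the real implementation; the ID counter is a stand-in for however the engine assigns request IDs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;class LLMSketch:
    """Paraphrased call path -- simplified, not vLLM's real bodies."""

    def __init__(self, llm_engine):
        self.llm_engine = llm_engine
        self._next_id = 0  # stand-in for real request ID assignment

    def generate(self, prompts, sampling_params):
        self._validate_and_add_requests(prompts, sampling_params)  # 1-2
        outputs = self._run_engine()                               # 3
        return sorted(outputs, key=lambda o: int(o.request_id))    # 4

    def _validate_and_add_requests(self, prompts, params):
        if not isinstance(params, list):
            params = [params] * len(prompts)  # replicate a single config
        for prompt, p in zip(prompts, params):
            self.llm_engine.add_request(str(self._next_id), prompt, p)
            self._next_id += 1

    def _run_engine(self):
        outputs = []
        while self.llm_engine.has_unfinished_requests():
            for out in self.llm_engine.step():  # one schedule+infer cycle
                if out.finished:
                    outputs.append(out)
        return outputs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;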

&lt;h3&gt;
  
  
  Exercise 3: SamplingParams Edge Cases
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Difficulty&lt;/strong&gt;: Intermediate&lt;br&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: Understand validation and normalization&lt;/p&gt;

&lt;p&gt;What happens in each case? Does it succeed, raise an error, or get silently normalized?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;code&gt;SamplingParams(temperature=-0.5)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SamplingParams(temperature=0, top_k=50)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SamplingParams(top_p=0.0)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SamplingParams(stop="END", include_stop_str_in_output=False)&lt;/code&gt; — what does &lt;code&gt;output_text_buffer_length&lt;/code&gt; get set to?&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SamplingParams(max_tokens=0)&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Raises &lt;code&gt;VLLMValidationError&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;_verify_args()&lt;/code&gt; checks &lt;code&gt;self.temperature &amp;lt; 0.0&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Silently normalized.&lt;/strong&gt; When temperature=0, &lt;code&gt;__post_init__&lt;/code&gt; forces &lt;code&gt;top_k=0&lt;/code&gt; (along with &lt;code&gt;top_p=1.0&lt;/code&gt;, &lt;code&gt;min_p=0.0&lt;/code&gt;). Your &lt;code&gt;top_k=50&lt;/code&gt; is overwritten.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Raises &lt;code&gt;VLLMValidationError&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;_verify_args()&lt;/code&gt; checks &lt;code&gt;not 0.0 &amp;lt; self.top_p &amp;lt;= 1.0&lt;/code&gt;. Zero is not in the valid range.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;output_text_buffer_length&lt;/code&gt; is set to &lt;code&gt;len("END") - 1 = 2&lt;/code&gt;.&lt;/strong&gt; This buffer ensures the output processor doesn't emit text that might be part of the stop string before the full match is determined.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Raises &lt;code&gt;VLLMValidationError&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;_verify_args()&lt;/code&gt; checks &lt;code&gt;self.max_tokens &amp;lt; 1&lt;/code&gt; when not None.&lt;/li&gt;
&lt;/ol&gt;
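
&lt;p&gt;A quick way to check these behaviors yourself; the expected values in the comments come from the solution above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from vllm import SamplingParams

# Case 2: greedy normalization silently overrides sampling knobs.
p = SamplingParams(temperature=0, top_k=50)
print(p.top_k, p.top_p, p.min_p)  # expect 0, 1.0, 0.0 per the solution

# Cases 1, 3, 5: out-of-range values should raise at construction time.
for kwargs in ({"temperature": -0.5}, {"top_p": 0.0}, {"max_tokens": 0}):
    try:
        SamplingParams(**kwargs)
    except Exception as exc:  # VLLMValidationError per the solution
        print(f"{kwargs}: {type(exc).__name__}: {exc}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;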

&lt;h3&gt;
  
  
  Exercise 4: Design a Batch Inference Script
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Difficulty&lt;/strong&gt;: Advanced&lt;br&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: Apply what you've learned to a realistic scenario&lt;/p&gt;

&lt;p&gt;You have a file with 10,000 prompts (one per line). You need to generate completions with &lt;code&gt;temperature=0.8&lt;/code&gt; and &lt;code&gt;max_tokens=256&lt;/code&gt;, saving results to a JSON file. Design the script:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Should you call &lt;code&gt;generate()&lt;/code&gt; once with all 10,000 prompts, or in batches of 100? Why?&lt;/li&gt;
&lt;li&gt;How would you handle prompts that need different &lt;code&gt;max_tokens&lt;/code&gt;?&lt;/li&gt;
&lt;li&gt;If 3 out of 10,000 prompts fail, how would you know which ones? (Hint: look at &lt;code&gt;request_id&lt;/code&gt;)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Call &lt;code&gt;generate()&lt;/code&gt; once with all 10,000.&lt;/strong&gt; vLLM's continuous batching handles scheduling internally — it dynamically fits as many requests as GPU memory allows per step. Breaking into batches of 100 would serialize work unnecessarily and prevent vLLM from optimally utilizing the GPU.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pass a list of &lt;code&gt;SamplingParams&lt;/code&gt;&lt;/strong&gt;, one per prompt: &lt;code&gt;[SamplingParams(max_tokens=t) for t in per_prompt_max_tokens]&lt;/code&gt;. This lets each prompt have its own configuration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Match by index.&lt;/strong&gt; Since &lt;code&gt;generate()&lt;/code&gt; sorts outputs by &lt;code&gt;request_id&lt;/code&gt; (which maps to the input order), &lt;code&gt;outputs[i]&lt;/code&gt; corresponds to &lt;code&gt;prompts[i]&lt;/code&gt;. Check &lt;code&gt;outputs[i].outputs[0].finish_reason&lt;/code&gt; — if it's &lt;code&gt;None&lt;/code&gt; or shows an unexpected state, that prompt had issues. You could also check &lt;code&gt;len(outputs)&lt;/code&gt; vs &lt;code&gt;len(prompts)&lt;/code&gt; to see if any were dropped.&lt;/li&gt;
&lt;/ol&gt;
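
&lt;p&gt;A compact version of the script the solution describes (file names are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
from vllm import LLM, SamplingParams

# Placeholder input file: one prompt per line.
with open("prompts.txt") as f:
    prompts = [line.strip() for line in f if line.strip()]

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.8, max_tokens=256)

# One call with all prompts: continuous batching schedules them internally.
outputs = llm.generate(prompts, params)

# outputs[i] corresponds to prompts[i] (sorted by request_id).
results = [
    {
        "prompt": prompts[i],
        "completion": out.outputs[0].text,
        "finish_reason": out.outputs[0].finish_reason,
    }
    for i, out in enumerate(outputs)
]

with open("results.json", "w") as f:
    json.dump(results, f, indent=2, ensure_ascii=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;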




&lt;h2&gt;
  
  
  Quiz
&lt;/h2&gt;

&lt;p&gt;Answer these questions based on today's material. Try to answer each question before revealing the answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q1: What does &lt;code&gt;LLM.__init__()&lt;/code&gt; actually do with its parameters? Where does the heavy lifting happen?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It packs parameters into &lt;code&gt;EngineArgs&lt;/code&gt; and calls &lt;code&gt;LLMEngine.from_engine_args()&lt;/code&gt;. The &lt;code&gt;LLM&lt;/code&gt; class itself does minimal work — it's a convenience wrapper. The engine factory method parses the args into a &lt;code&gt;VllmConfig&lt;/code&gt;, selects the right executor, loads the model, allocates the KV cache, and initializes the scheduler.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q2: Why does &lt;code&gt;_run_engine()&lt;/code&gt; sort its outputs by &lt;code&gt;request_id&lt;/code&gt; before returning?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Because requests don't finish in order. Short prompts complete in fewer iterations than long ones. Since &lt;code&gt;_run_engine()&lt;/code&gt; collects outputs as they finish, a request with &lt;code&gt;request_id=5&lt;/code&gt; might finish before &lt;code&gt;request_id=3&lt;/code&gt;. Sorting by request ID restores the original prompt order so &lt;code&gt;outputs[i]&lt;/code&gt; corresponds to &lt;code&gt;prompts[i]&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q3: What is the default value of &lt;code&gt;max_tokens&lt;/code&gt; in &lt;code&gt;SamplingParams&lt;/code&gt;, and why might this surprise users?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The default is 16 tokens. This is much smaller than most users expect (GPT-4 defaults to ~4096). If your outputs seem cut short, you probably need to set &lt;code&gt;max_tokens&lt;/code&gt; explicitly. The low default is intentional — it prevents accidental resource exhaustion when experimenting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q4: What happens internally to &lt;code&gt;SamplingParams&lt;/code&gt; when you set &lt;code&gt;temperature=0&lt;/code&gt;?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;vLLM forces greedy sampling parameters. Specifically, it sets &lt;code&gt;top_p=1.0&lt;/code&gt;, &lt;code&gt;top_k=0&lt;/code&gt;, and &lt;code&gt;min_p=0.0&lt;/code&gt;, then calls &lt;code&gt;_verify_greedy_sampling()&lt;/code&gt;. This is because when temperature is zero (always pick the highest-probability token), top-p/top-k filtering is meaningless and could introduce unexpected behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q5: Why does &lt;code&gt;SamplingParams&lt;/code&gt; inherit from &lt;code&gt;msgspec.Struct&lt;/code&gt; instead of using a Python &lt;code&gt;@dataclass&lt;/code&gt;?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Serialization performance. &lt;code&gt;msgspec.Struct&lt;/code&gt; provides 10-50x faster serialization than pickle (used with dataclasses). This matters because when vLLM runs in multiprocess mode, &lt;code&gt;SamplingParams&lt;/code&gt; is serialized with &lt;code&gt;msgspec.msgpack&lt;/code&gt; and sent over ZMQ sockets from the frontend process to the engine core process. Faster serialization = lower per-request overhead.&lt;/p&gt;
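
&lt;p&gt;For a feel of the API, here is a generic &lt;code&gt;msgspec&lt;/code&gt; round trip. The struct is an invented example, not vLLM's actual class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import msgspec

# Invented example struct -- not vLLM's actual SamplingParams.
class Params(msgspec.Struct):
    temperature: float = 1.0
    max_tokens: int = 16

buf = msgspec.msgpack.encode(Params(temperature=0.8, max_tokens=256))
decoded = msgspec.msgpack.decode(buf, type=Params)
assert decoded.max_tokens == 256  # compact bytes, typed round trip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;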

&lt;p&gt;&lt;strong&gt;Q6: What is the difference between &lt;code&gt;finish_reason="stop"&lt;/code&gt; and &lt;code&gt;finish_reason="length"&lt;/code&gt; in &lt;code&gt;CompletionOutput&lt;/code&gt;?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;"stop"&lt;/code&gt; means the model naturally stopped — it hit an EOS token, a stop string, or a stop token ID. &lt;code&gt;"length"&lt;/code&gt; means it hit &lt;code&gt;max_tokens&lt;/code&gt; — the model wanted to keep generating but was cut off. If you see many &lt;code&gt;"length"&lt;/code&gt; finishes, consider increasing &lt;code&gt;max_tokens&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q7: True or false: Calling &lt;code&gt;llm.generate()&lt;/code&gt; with a single &lt;code&gt;SamplingParams&lt;/code&gt; and a list of 100 prompts will use the same sampling parameters for all 100 prompts.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;True. When you pass a single &lt;code&gt;SamplingParams&lt;/code&gt; (not a list), &lt;code&gt;_validate_and_add_requests()&lt;/code&gt; replicates it: &lt;code&gt;engine_params = [params] * num_requests&lt;/code&gt;. Each prompt gets the same sampling configuration. To use different parameters per prompt, pass a list of &lt;code&gt;SamplingParams&lt;/code&gt; with the same length as the prompts list.&lt;/p&gt;
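
&lt;p&gt;Both calling conventions, side by side:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
prompts = ["Hello", "World"]

# One SamplingParams: replicated internally across all prompts.
outputs = llm.generate(prompts, SamplingParams(max_tokens=10))

# Per-prompt control: pass a list the same length as prompts.
per_prompt = [SamplingParams(max_tokens=10), SamplingParams(max_tokens=50)]
outputs = llm.generate(prompts, per_prompt)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;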

&lt;p&gt;&lt;strong&gt;Q8: What does &lt;code&gt;gpu_memory_utilization=0.9&lt;/code&gt; mean, and what happens to the other 10%?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;vLLM uses 90% of GPU memory for the KV cache (and model weights). The remaining 10% is reserved for PyTorch's internal allocations — temporary activation tensors, CUDA context, cuBLAS workspace, etc. If you set it too high (e.g., 0.99), you risk CUDA OOM during forward passes. If you set it too low, you waste GPU capacity.&lt;/p&gt;
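
&lt;p&gt;The knob is a constructor argument:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from vllm import LLM

# Leave ~10% headroom for PyTorch's own allocations; raising this
# toward 1.0 trades OOM risk for a larger KV cache.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.9,  # fraction for weights + KV cache
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;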




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;vLLM&lt;/strong&gt; solves the KV cache memory waste problem via PagedAttention — block-based allocation instead of pre-allocation&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;LLM&lt;/strong&gt; class is a thin wrapper: it creates &lt;code&gt;EngineArgs&lt;/code&gt;, builds &lt;code&gt;LLMEngine&lt;/code&gt;, and provides &lt;code&gt;generate()&lt;/code&gt;, &lt;code&gt;chat()&lt;/code&gt;, &lt;code&gt;embed()&lt;/code&gt;, and other task methods&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;generate()&lt;/code&gt;&lt;/strong&gt; validates inputs, adds requests to the engine, then loops &lt;code&gt;step()&lt;/code&gt; until all requests finish&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;_run_engine()&lt;/code&gt;&lt;/strong&gt; is a simple while-loop over &lt;code&gt;llm_engine.step()&lt;/code&gt; — this is where continuous batching happens under the hood&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SamplingParams&lt;/code&gt;&lt;/strong&gt; controls per-request generation with thorough validation — zero temperature forces greedy mode, invalid ranges raise errors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;RequestOutput&lt;/code&gt;&lt;/strong&gt; wraps one or more &lt;code&gt;CompletionOutput&lt;/code&gt; objects, each containing the generated text, token IDs, and finish reason&lt;/li&gt;
&lt;li&gt;Next session: &lt;strong&gt;The Engine Layer&lt;/strong&gt; — what &lt;code&gt;LLMEngine&lt;/code&gt; does inside &lt;code&gt;step()&lt;/code&gt;, how &lt;code&gt;InputProcessor&lt;/code&gt; tokenizes prompts, and how &lt;code&gt;EngineCoreClient&lt;/code&gt; bridges to the core&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Generated from my &lt;a href="https://github.com/ccwang/ai-study" rel="noopener noreferrer"&gt;ai-study&lt;/a&gt; learning project.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>vllm</category>
      <category>llm</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
