<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ben</title>
    <description>The latest articles on DEV Community by Ben (@c2sea).</description>
    <link>https://dev.to/c2sea</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3745848%2F0113816b-20cb-49d2-ba6a-7fa2fe227e0e.png</url>
      <title>DEV Community: Ben</title>
      <link>https://dev.to/c2sea</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/c2sea"/>
    <language>en</language>
    <item>
      <title>vLLM — Session 2: The Engine Layer — Request Management</title>
      <dc:creator>Ben</dc:creator>
      <pubDate>Sun, 01 Feb 2026 22:00:10 +0000</pubDate>
      <link>https://dev.to/c2sea/vllm-session-2-the-engine-layer-request-management-4dg2</link>
      <guid>https://dev.to/c2sea/vllm-session-2-the-engine-layer-request-management-4dg2</guid>
      <description>&lt;p&gt;&lt;em&gt;This is part of my vLLM learning series. In this session, I cover Step 2 (The Engine Layer).&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: This content was generated by Claude, grounded on the actual&lt;br&gt;
&lt;a href="https://github.com/vllm-project/vllm" rel="noopener noreferrer"&gt;vLLM&lt;/a&gt; codebase. It is intended for personal&lt;br&gt;
learning only and may contain inaccuracies. Always verify against the&lt;br&gt;
original source code and official documentation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Topic&lt;/strong&gt;: vLLM&lt;br&gt;
&lt;strong&gt;Date&lt;/strong&gt;: 2026-02-01&lt;br&gt;
&lt;strong&gt;Sections covered&lt;/strong&gt;: Step 2 (The Engine Layer)&lt;br&gt;
&lt;strong&gt;Prerequisites&lt;/strong&gt;: Session 1 — LLM class, SamplingParams, generate() flow, RequestOutput&lt;/p&gt;


&lt;h2&gt;
  
  
  Review
&lt;/h2&gt;

&lt;p&gt;In Session 1, we learned that the &lt;code&gt;LLM&lt;/code&gt; class is a thin wrapper around &lt;code&gt;LLMEngine&lt;/code&gt;. When you call &lt;code&gt;llm.generate()&lt;/code&gt;, the flow is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;_validate_and_add_requests()&lt;/code&gt; — pairs prompts with &lt;code&gt;SamplingParams&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;_run_engine()&lt;/code&gt; — loops &lt;code&gt;self.llm_engine.step()&lt;/code&gt; until all requests finish&lt;/li&gt;
&lt;li&gt;Returns sorted &lt;code&gt;RequestOutput&lt;/code&gt; objects&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We saw that &lt;code&gt;LLM.__init__()&lt;/code&gt; calls &lt;code&gt;LLMEngine.from_engine_args()&lt;/code&gt; — but we treated the engine as a black box. Today we open that box.&lt;/p&gt;

&lt;p&gt;The key question: &lt;strong&gt;What happens inside &lt;code&gt;llm_engine.step()&lt;/code&gt;?&lt;/strong&gt; The answer involves three components: &lt;code&gt;InputProcessor&lt;/code&gt;, &lt;code&gt;EngineCoreClient&lt;/code&gt;, and &lt;code&gt;OutputProcessor&lt;/code&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  Today's Material
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. LLMEngine — The Orchestrator
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;LLMEngine&lt;/code&gt; sits between the user-facing &lt;code&gt;LLM&lt;/code&gt; class and the core scheduling/execution machinery. Its job is to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Preprocess&lt;/strong&gt; inputs (tokenize prompts, handle multimodal data)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relay&lt;/strong&gt; preprocessed requests to the engine core&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postprocess&lt;/strong&gt; raw outputs (detokenize, format for the user)
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# vllm/v1/engine/llm_engine.py
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LLMEngine&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vllm_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;executor_class&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log_stats&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;engine_core&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;EngineCoreClient&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;    &lt;span class="c1"&gt;# Talks to core
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;InputProcessor&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;   &lt;span class="c1"&gt;# Tokenize inputs
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OutputProcessor&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt; &lt;span class="c1"&gt;# Format outputs
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Preprocess and send request to engine core.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;RequestOutput&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;One iteration: get outputs from core, process, return.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="nd"&gt;@classmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;from_engine_args&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;engine_args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LLMEngine&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Factory: parse args -&amp;gt; VllmConfig -&amp;gt; create engine.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Think of &lt;code&gt;LLMEngine&lt;/code&gt; as a &lt;strong&gt;translator&lt;/strong&gt;: it speaks "user language" (strings, Python objects) on one side and "engine language" (token IDs, &lt;code&gt;msgspec&lt;/code&gt; structs) on the other.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;📝 Note:&lt;/strong&gt;&lt;br&gt;
vLLM has undergone a major architectural evolution. The &lt;code&gt;v1/&lt;/code&gt; directory contains the current architecture. Older code in the root &lt;code&gt;vllm/engine/&lt;/code&gt; directory is the legacy (v0) engine. When reading code, focus on &lt;code&gt;vllm/v1/&lt;/code&gt; — that's where active development happens.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  2. The Factory: from_engine_args()
&lt;/h3&gt;

&lt;p&gt;Before exploring the runtime flow, let's see how &lt;code&gt;LLMEngine&lt;/code&gt; gets created:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# vllm/v1/engine/llm_engine.py
&lt;/span&gt;&lt;span class="nd"&gt;@classmethod&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;from_engine_args&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;engine_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;usage_context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LLMEngine&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Factory: parse args -&amp;gt; VllmConfig -&amp;gt; create engine.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;vllm_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;engine_args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_engine_config&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# vllm_config is a VllmConfig that bundles:
&lt;/span&gt;    &lt;span class="c1"&gt;#   ModelConfig, CacheConfig, ParallelConfig,
&lt;/span&gt;    &lt;span class="c1"&gt;#   SchedulerConfig, DeviceConfig, LoadConfig, ...
&lt;/span&gt;
    &lt;span class="n"&gt;executor_class&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Executor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_class&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vllm_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Selects UniProcExecutor, MultiprocExecutor, or RayDistributedExecutor
&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vllm_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;vllm_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="n"&gt;executor_class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;executor_class&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="n"&gt;usage_context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;usage_context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a classic factory pattern. The user provides simple arguments (&lt;code&gt;model="meta-llama/..."&lt;/code&gt;, &lt;code&gt;tensor_parallel_size=2&lt;/code&gt;), and the factory:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Parses them into a structured &lt;code&gt;VllmConfig&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Selects the right executor class based on configuration&lt;/li&gt;
&lt;li&gt;Constructs the engine with all dependencies wired up&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Why this matters&lt;/strong&gt;: The factory is where all of vLLM's auto-configuration happens. It resolves the dtype (auto-selecting fp16 or bf16 based on GPU capability), determines how many KV cache blocks fit in GPU memory, and selects the appropriate attention backend.&lt;/p&gt;
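
&lt;p&gt;As a quick illustration, here is a minimal sketch of inspecting the resolved configuration. The &lt;code&gt;EngineArgs&lt;/code&gt; import path and attribute names such as &lt;code&gt;model_config&lt;/code&gt; and &lt;code&gt;cache_config&lt;/code&gt; follow the classes shown above, but may differ across vLLM versions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: peek at the auto-resolved config (attribute names are assumptions).
from vllm.engine.arg_utils import EngineArgs

engine_args = EngineArgs(model="facebook/opt-125m", dtype="auto")
vllm_config = engine_args.create_engine_config()

print(vllm_config.model_config.dtype)       # dtype resolved from "auto"
print(vllm_config.cache_config.block_size)  # KV cache block size
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;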

&lt;h3&gt;
  
  
  3. InputProcessor — From Strings to Tokens
&lt;/h3&gt;

&lt;p&gt;When &lt;code&gt;add_request()&lt;/code&gt; is called, the first thing that happens is input processing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# vllm/v1/engine/input_processor.py
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;InputProcessor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vllm_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mm_processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;  &lt;span class="c1"&gt;# Multimodal input processor
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;EngineCoreRequest&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Tokenize prompt, process multimodal inputs,
           create EngineCoreRequest.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The processor handles several input formats:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Users can provide prompts in multiple ways:
&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello world&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                           &lt;span class="c1"&gt;# Plain string
&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello world&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;               &lt;span class="c1"&gt;# Dict with string
&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_token_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;15496&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;995&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;      &lt;span class="c1"&gt;# Pre-tokenized
&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;                                         &lt;span class="c1"&gt;# Multimodal
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s in this image?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;multi_modal_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;image_data&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No matter what format you use, &lt;code&gt;InputProcessor.process()&lt;/code&gt; normalizes it into an &lt;code&gt;EngineCoreRequest&lt;/code&gt; — the standard wire format for the engine core.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The tokenization step&lt;/strong&gt; converts your string prompt into token IDs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"What is the capital of France?"
    → tokenizer.encode()
    → [1, 1724, 338, 278, 7483, 310, 3444, 29973]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Tip:&lt;/strong&gt;&lt;br&gt;
If you already have token IDs (e.g., from your own tokenizer or preprocessing pipeline), pass &lt;code&gt;{"prompt_token_ids": [...]}&lt;/code&gt; to skip redundant tokenization. This saves CPU time for high-throughput applications.&lt;/p&gt;
&lt;/blockquote&gt;
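
&lt;p&gt;A minimal sketch of that pattern, using a Hugging Face tokenizer (the model name is illustrative, and &lt;code&gt;llm&lt;/code&gt; and &lt;code&gt;params&lt;/code&gt; are assumed to exist from earlier examples):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: tokenize once yourself, then pass the IDs to skip re-tokenization.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
ids = tok.encode("What is the capital of France?")

outputs = llm.generate([{"prompt_token_ids": ids}], params)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;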

&lt;h3&gt;
  
  
  4. EngineCoreRequest — The Wire Format
&lt;/h3&gt;

&lt;p&gt;The output of &lt;code&gt;InputProcessor&lt;/code&gt; is an &lt;code&gt;EngineCoreRequest&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# vllm/v1/engine/__init__.py
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;EngineCoreRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msgspec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Struct&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;prompt_token_ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;mm_features&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;MultiModalFeatureSpec&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;sampling_params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SamplingParams&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;pooling_params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PoolingParams&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;eos_token_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;arrival_time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;lora_request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LoRARequest&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;cache_salt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;data_parallel_rank&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;prompt_embeds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;client_index&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;current_wave&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;trace_headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Mapping&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;resumable&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;
    &lt;span class="n"&gt;external_req_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why is this a separate type from &lt;code&gt;Request&lt;/code&gt; (which the scheduler uses internally)?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Separation of concerns&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;EngineCoreRequest&lt;/code&gt; is the &lt;strong&gt;transport&lt;/strong&gt; format — designed for serialization across process boundaries&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Request&lt;/code&gt; (in the scheduler) is the &lt;strong&gt;runtime&lt;/strong&gt; format — tracks mutable state like &lt;code&gt;num_computed_tokens&lt;/code&gt;, allocated blocks, output tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This separation is important because vLLM can run in &lt;strong&gt;multiprocess mode&lt;/strong&gt;: the FastAPI server and &lt;code&gt;InputProcessor&lt;/code&gt; run in one process, while the &lt;code&gt;EngineCore&lt;/code&gt; (scheduler + executor) runs in another. The &lt;code&gt;EngineCoreRequest&lt;/code&gt; gets serialized with &lt;code&gt;msgspec.msgpack.encode()&lt;/code&gt;, sent over a ZMQ socket, and deserialized on the other side.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Process 1 (Frontend)           Process 2 (Engine Core)
┌──────────────────┐           ┌──────────────────┐
│  InputProcessor  │           │    Scheduler     │
│       ↓          │           │       ↓          │
│ EngineCoreRequest│──ZMQ──→   │    Request       │
│                  │           │ (mutable state)  │
└──────────────────┘           └──────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;📝 Note:&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Why msgspec instead of pickle or JSON?&lt;/strong&gt; &lt;code&gt;msgspec.msgpack&lt;/code&gt; is 10-50x faster than pickle for structured data and produces smaller payloads than JSON. For a system processing thousands of requests per second, serialization overhead directly impacts throughput. This is not premature optimization — it's a measured bottleneck.&lt;/p&gt;
&lt;/blockquote&gt;
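
&lt;p&gt;The pattern is easy to reproduce standalone. Here is a toy sketch of the same encode/decode round trip (the struct below is illustrative, not the real &lt;code&gt;EngineCoreRequest&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Toy reproduction of the msgspec round trip; ToyRequest stands in for
# the real EngineCoreRequest.
import msgspec

class ToyRequest(msgspec.Struct):
    request_id: str
    prompt_token_ids: list[int]

data = msgspec.msgpack.encode(ToyRequest("req-0", [1, 2, 3]))  # compact bytes
req = msgspec.msgpack.decode(data, type=ToyRequest)            # typed decode
assert req.prompt_token_ids == [1, 2, 3]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;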

&lt;h3&gt;
  
  
  5. EngineCoreClient — Bridging Processes
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;EngineCoreClient&lt;/code&gt; abstracts the communication between the engine layer and the engine core:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Conceptual interface:
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;EngineCoreClient&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;EngineCoreRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Send request to the core.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;EngineCoreOutput&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Get completed/streaming outputs from the core.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The client has two modes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;When&lt;/th&gt;
&lt;th&gt;How it works&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;In-process&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;LLM&lt;/code&gt; class (offline)&lt;/td&gt;
&lt;td&gt;Direct function calls to &lt;code&gt;EngineCore&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multiprocess&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;API server&lt;/td&gt;
&lt;td&gt;ZMQ sockets between processes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In &lt;strong&gt;in-process mode&lt;/strong&gt; (what you get with the &lt;code&gt;LLM&lt;/code&gt; class), &lt;code&gt;EngineCoreClient&lt;/code&gt; directly calls methods on an &lt;code&gt;EngineCore&lt;/code&gt; object in the same process. No serialization overhead.&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;multiprocess mode&lt;/strong&gt; (the OpenAI-compatible server), &lt;code&gt;EngineCoreClient&lt;/code&gt; serializes requests with &lt;code&gt;msgspec.msgpack&lt;/code&gt;, sends them over ZMQ, and the &lt;code&gt;EngineCore&lt;/code&gt; process deserializes and processes them. This keeps the FastAPI event loop responsive while heavy inference runs in a separate process.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified view of multiprocess communication:
# Frontend process:
&lt;/span&gt;&lt;span class="n"&gt;encoded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;msgspec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;msgpack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;engine_core_request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;zmq_socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;encoded&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Engine core process:
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;zmq_socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;msgspec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;msgpack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;EngineCoreRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this matters&lt;/strong&gt;: This two-process architecture is critical for production deployments. Without it, long-running model forward passes on the GPU would block the HTTP server from accepting new requests.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. The step() Method — One Iteration
&lt;/h3&gt;

&lt;p&gt;Now we can understand what happens in each call to &lt;code&gt;llm_engine.step()&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# vllm/v1/engine/llm_engine.py (simplified)
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;RequestOutput&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. Get raw outputs from the engine core
&lt;/span&gt;    &lt;span class="n"&gt;engine_core_outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;engine_core&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_output&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# 2. Process outputs: detokenize, format, check completion
&lt;/span&gt;    &lt;span class="n"&gt;request_outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;process_outputs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;engine_core_outputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;request_outputs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each &lt;code&gt;step()&lt;/code&gt; returns a list of &lt;code&gt;RequestOutput&lt;/code&gt; objects — some may be streaming (partial), others may be finished. The &lt;code&gt;_run_engine()&lt;/code&gt; loop in &lt;code&gt;LLM&lt;/code&gt; collects the finished ones.&lt;/p&gt;

&lt;p&gt;But what triggers the core to actually run inference? In in-process mode, &lt;code&gt;get_output()&lt;/code&gt; internally calls &lt;code&gt;engine_core.step()&lt;/code&gt; which runs the scheduler + model execution. In multiprocess mode, the engine core runs its own loop continuously, and &lt;code&gt;get_output()&lt;/code&gt; just reads from a queue.&lt;/p&gt;
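
&lt;p&gt;Putting this together, here is a minimal sketch of the collection loop that &lt;code&gt;LLM._run_engine()&lt;/code&gt; performs (simplified; the real method also handles progress reporting, and &lt;code&gt;llm_engine&lt;/code&gt; is assumed to be a constructed &lt;code&gt;LLMEngine&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: drive the engine until every request finishes (simplified).
finished = []
while llm_engine.has_unfinished_requests():
    for output in llm_engine.step():
        if output.finished:
            finished.append(output)

finished.sort(key=lambda o: int(o.request_id))  # restore submission order
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;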

&lt;h3&gt;
  
  
  7. OutputProcessor — From Tokens to Text
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;OutputProcessor&lt;/code&gt; is the mirror of &lt;code&gt;InputProcessor&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# vllm/v1/engine/output_processor.py
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OutputProcessor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log_stats&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_states&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RequestState&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It receives &lt;code&gt;EngineCoreOutput&lt;/code&gt; (raw token IDs from the core) and produces &lt;code&gt;RequestOutput&lt;/code&gt; (user-facing results). The key operations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Accumulate tokens&lt;/strong&gt; — Maintains a running state per request&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detokenize&lt;/strong&gt; — Converts token IDs back to text using the tokenizer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handle streaming modes&lt;/strong&gt; — &lt;code&gt;CUMULATIVE&lt;/code&gt; returns the full text so far; &lt;code&gt;DELTA&lt;/code&gt; returns only new tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Track completion&lt;/strong&gt; — Checks &lt;code&gt;finish_reason&lt;/code&gt; to know when a request is done
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# EngineCoreOutput — what the core produces:
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;EngineCoreOutput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msgspec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Struct&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;new_token_ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;          &lt;span class="c1"&gt;# Newly generated tokens this step
&lt;/span&gt;    &lt;span class="n"&gt;new_logprobs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LogprobsLists&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;finish_reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;FinishReason&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  &lt;span class="c1"&gt;# STOP, LENGTH, ABORT, or None
&lt;/span&gt;    &lt;span class="n"&gt;stop_reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;num_cached_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The transformation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EngineCoreOutput                          RequestOutput
┌──────────────────────┐                 ┌─────────────────────────┐
│ request_id: "req-42" │                 │ request_id: "req-42"    │
│ new_token_ids: [464] │   detokenize    │ prompt: "Hello"         │
│ finish_reason: None  │ ───────────→    │ outputs: [              │
│                      │                 │   CompletionOutput(     │
└──────────────────────┘                 │     text: " world",     │
                                         │     token_ids: [464],   │
                                         │     finish_reason: None │
                                         │   )                     │
                                         │ ]                       │
                                         │ finished: False         │
                                         └─────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️ Warning:&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Detokenization is not trivially reversible.&lt;/strong&gt; Many tokenizers use byte-level BPE, where a single token might represent part of a multi-byte UTF-8 character. The &lt;code&gt;OutputProcessor&lt;/code&gt; handles these edge cases — if a token produces an incomplete character, it buffers bytes until a valid character is formed. This is why you sometimes see "garbled" output when accessing raw &lt;code&gt;token_ids&lt;/code&gt; without proper detokenization.&lt;/p&gt;
&lt;/blockquote&gt;
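
&lt;p&gt;A toy illustration of why buffering is needed (this is not vLLM's actual incremental detokenizer, just the underlying byte-level problem):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# A 4-byte emoji (U+1F600) arriving in two halves: the first half alone
# is not valid UTF-8, so the decoder must buffer until it completes.
buf = b""
for chunk in (b"\xf0\x9f", b"\x98\x80"):
    buf += chunk
    try:
        print(buf.decode("utf-8"))  # succeeds only once the character is whole
        buf = b""
    except UnicodeDecodeError:
        pass  # incomplete character: keep buffering
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;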

&lt;h3&gt;
  
  
  8. Putting It All Together — The Full Request Lifecycle
&lt;/h3&gt;

&lt;p&gt;Let's trace a request from start to finish:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User calls: llm.generate(["What is AI?"], SamplingParams(max_tokens=20))

1. LLM.generate()
   └→ _validate_and_add_requests()
      └→ llm_engine.add_request(request_id="0", prompt="What is AI?", params=...)
         └→ InputProcessor.process()
            - Tokenize: "What is AI?" → [1, 1724, 338, 319, 29902, 29973]
            - Create EngineCoreRequest(request_id="0",
                                       prompt_token_ids=[1, 1724, ...],
                                       sampling_params=...,
                                       arrival_time=time.time())
         └→ engine_core.add_request(engine_core_request)

2. LLM._run_engine()
   while has_unfinished_requests():
     └→ llm_engine.step()
        └→ engine_core.get_output()
           - Core runs: schedule → execute model → sample tokens
           - Returns EngineCoreOutput(request_id="0",
                                      new_token_ids=[23435],
                                      finish_reason=None)
        └→ OutputProcessor.process_outputs()
           - Detokenize [23435] → " Artificial"
           - Accumulate: text = " Artificial"
           - Return RequestOutput(finished=False, ...)

     ... more steps, generating tokens one at a time ...

     └→ llm_engine.step()  (final iteration)
        └→ engine_core.get_output()
           - Returns EngineCoreOutput(request_id="0",
                                      new_token_ids=[29889],
                                      finish_reason=FinishReason.LENGTH)
        └→ OutputProcessor.process_outputs()
           - Detokenize [29889] → "."
           - Accumulate: text = " Artificial intelligence is..."
           - finish_reason = "length" (hit max_tokens=20)
           - Return RequestOutput(finished=True, ...)

3. _run_engine() collects finished output, sorts by request_id, returns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;📝 Note:&lt;/strong&gt;&lt;br&gt;
In practice, the engine core doesn't generate just one token per step. With continuous batching, a single &lt;code&gt;step()&lt;/code&gt; processes tokens for &lt;strong&gt;all active requests simultaneously&lt;/strong&gt;. If there are 50 active requests, one GPU forward pass generates the next token for all 50. The &lt;code&gt;OutputProcessor&lt;/code&gt; then demultiplexes the results back to individual &lt;code&gt;RequestOutput&lt;/code&gt; objects.&lt;/p&gt;
&lt;/blockquote&gt;
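
&lt;p&gt;Conceptually, that demultiplexing step looks like this (a toy sketch with plain dicts, not vLLM's actual types):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Toy sketch: one step's outputs fan out to per-request state,
# keyed by request_id.
step_outputs = [
    {"request_id": "req-1", "new_token_ids": [464]},
    {"request_id": "req-2", "new_token_ids": [29889]},
]

states = {"req-1": [], "req-2": []}
for out in step_outputs:
    states[out["request_id"]].extend(out["new_token_ids"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;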




&lt;h2&gt;
  
  
  Exercises
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Exercise 1: Component Identification
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Difficulty&lt;/strong&gt;: Beginner&lt;br&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: Verify you can identify the role of each engine layer component&lt;/p&gt;

&lt;p&gt;For each of the following operations, name which component handles it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Converting the string &lt;code&gt;"Hello world"&lt;/code&gt; into token IDs &lt;code&gt;[15496, 995]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Deciding which requests get GPU time this iteration&lt;/li&gt;
&lt;li&gt;Converting &lt;code&gt;EngineArgs&lt;/code&gt; into a &lt;code&gt;VllmConfig&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Decoding token ID &lt;code&gt;[29889]&lt;/code&gt; back into the string &lt;code&gt;"."&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Sending an &lt;code&gt;EngineCoreRequest&lt;/code&gt; from the frontend process to the engine core process&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;Solution&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;InputProcessor&lt;/strong&gt; — it runs the tokenizer on the raw prompt string.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scheduler&lt;/strong&gt; (inside the engine core) — it decides which requests to include in each step's batch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LLMEngine.from_engine_args()&lt;/code&gt;&lt;/strong&gt; — the factory classmethod calls &lt;code&gt;engine_args.create_engine_config()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OutputProcessor&lt;/strong&gt; — it detokenizes raw token IDs back into text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EngineCoreClient&lt;/strong&gt; (multiprocess mode) — it serializes with &lt;code&gt;msgspec.msgpack&lt;/code&gt; and sends over ZMQ.&lt;/li&gt;
&lt;/ol&gt;


&lt;h3&gt;
  
  
  Exercise 2: Multiprocess vs. In-Process
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Difficulty&lt;/strong&gt;: Intermediate&lt;br&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: Understand when and why vLLM uses multiprocess communication&lt;/p&gt;

&lt;p&gt;Consider two scenarios:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario A&lt;/strong&gt;: Offline batch processing&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meta-llama/Llama-3.1-8B-Instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Scenario B&lt;/strong&gt;: Production API server&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vllm serve meta-llama/Llama-3.1-8B-Instruct &lt;span class="nt"&gt;--port&lt;/span&gt; 8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For each scenario:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Is the &lt;code&gt;EngineCoreClient&lt;/code&gt; in in-process or multiprocess mode?&lt;/li&gt;
&lt;li&gt;Does the &lt;code&gt;EngineCoreRequest&lt;/code&gt; actually get serialized with msgspec?&lt;/li&gt;
&lt;li&gt;What would happen if the API server ran the engine core in-process (same event loop)?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Solution&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario A (offline &lt;code&gt;LLM&lt;/code&gt; class)&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;In-process&lt;/strong&gt; — direct function calls to &lt;code&gt;EngineCore&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No&lt;/strong&gt; — the &lt;code&gt;EngineCoreRequest&lt;/code&gt; is created but passed directly without serialization.&lt;/li&gt;
&lt;li&gt;N/A — there's no HTTP server.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Scenario B (API server)&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Multiprocess&lt;/strong&gt; — ZMQ sockets between frontend and engine core processes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Yes&lt;/strong&gt; — &lt;code&gt;msgspec.msgpack.encode()&lt;/code&gt; serializes it, sends over ZMQ, and the core deserializes it.&lt;/li&gt;
&lt;li&gt;The HTTP server would block during GPU forward passes. A single inference step can take 10-100ms, during which the server couldn't accept new connections or respond to health checks. Under load, this would cause request timeouts and dropped connections.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Exercise 3: Tracing Data Transformations
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Difficulty&lt;/strong&gt;: Intermediate&lt;br&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: Follow the data as it changes form through the pipeline&lt;/p&gt;

&lt;p&gt;Starting with this call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_token_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]}],&lt;/span&gt;
    &lt;span class="nc"&gt;SamplingParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Does &lt;code&gt;InputProcessor&lt;/code&gt; tokenize this prompt? Why or why not?&lt;/li&gt;
&lt;li&gt;What fields of &lt;code&gt;EngineCoreRequest&lt;/code&gt; are set? What's &lt;code&gt;arrival_time&lt;/code&gt; used for?&lt;/li&gt;
&lt;li&gt;If the model generates tokens &lt;code&gt;[100, 200, 300]&lt;/code&gt;, what does the &lt;code&gt;EngineCoreOutput&lt;/code&gt; for the final step look like?&lt;/li&gt;
&lt;li&gt;What is &lt;code&gt;finish_reason&lt;/code&gt; and why?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Solution&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;No.&lt;/strong&gt; The input is &lt;code&gt;{"prompt_token_ids": [1, 2, 3, 4, 5]}&lt;/code&gt; — already tokenized. &lt;code&gt;InputProcessor&lt;/code&gt; detects the &lt;code&gt;prompt_token_ids&lt;/code&gt; key and skips tokenization, using the provided IDs directly.&lt;/li&gt;
&lt;li&gt;Key fields: &lt;code&gt;request_id&lt;/code&gt; (auto-assigned), &lt;code&gt;prompt_token_ids=[1, 2, 3, 4, 5]&lt;/code&gt;, &lt;code&gt;sampling_params&lt;/code&gt; (with &lt;code&gt;max_tokens=3, temperature=0&lt;/code&gt;), &lt;code&gt;arrival_time=time.time()&lt;/code&gt;. &lt;code&gt;arrival_time&lt;/code&gt; is used by the scheduler for FCFS ordering and for latency metrics.&lt;/li&gt;
&lt;li&gt;The final &lt;code&gt;EngineCoreOutput&lt;/code&gt; would be: &lt;code&gt;EngineCoreOutput(request_id="0", new_token_ids=[300], finish_reason=FinishReason.LENGTH, stop_reason=None)&lt;/code&gt;. Each step produces one new token, and the third token triggers the length limit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;FinishReason.LENGTH&lt;/code&gt;&lt;/strong&gt; — the model generated exactly &lt;code&gt;max_tokens=3&lt;/code&gt; tokens (&lt;code&gt;[100, 200, 300]&lt;/code&gt;) and was stopped. It didn't hit an EOS or stop token naturally.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Exercise 4: Design Challenge — Adding Request Priority
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Difficulty&lt;/strong&gt;: Advanced&lt;br&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: Think through how a feature propagates through the engine layer&lt;/p&gt;

&lt;p&gt;Suppose you want to add priority-based scheduling: high-priority requests should be processed before low-priority ones. Trace through the architecture:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Where does the user specify priority? (Hint: look at &lt;code&gt;LLM.generate()&lt;/code&gt; parameters)&lt;/li&gt;
&lt;li&gt;How does priority get from the user to the scheduler? List each class it passes through.&lt;/li&gt;
&lt;li&gt;Why is &lt;code&gt;priority&lt;/code&gt; a field on &lt;code&gt;EngineCoreRequest&lt;/code&gt; rather than just on &lt;code&gt;SamplingParams&lt;/code&gt;?&lt;/li&gt;
&lt;li&gt;What would happen if the &lt;code&gt;OutputProcessor&lt;/code&gt; also needed to know about priority? Would the current architecture support that?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Hint&lt;/strong&gt;: Priority is already partially implemented — look at the &lt;code&gt;EngineCoreRequest&lt;/code&gt; fields.&lt;/p&gt;

&lt;p&gt;Solution&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Via the &lt;code&gt;priority&lt;/code&gt; parameter in &lt;code&gt;LLM.generate(prompts, params, priority=[1, 2, ...])&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The path is: &lt;code&gt;LLM.generate()&lt;/code&gt; → &lt;code&gt;_validate_and_add_requests()&lt;/code&gt; → &lt;code&gt;LLMEngine.add_request()&lt;/code&gt; → &lt;code&gt;InputProcessor.process()&lt;/code&gt; (sets the &lt;code&gt;priority&lt;/code&gt; field on &lt;code&gt;EngineCoreRequest&lt;/code&gt;) → &lt;code&gt;EngineCoreClient.add_request()&lt;/code&gt; → &lt;code&gt;Scheduler&lt;/code&gt; (reads &lt;code&gt;priority&lt;/code&gt; from the request).&lt;/li&gt;
&lt;li&gt;Priority is a &lt;strong&gt;request-level&lt;/strong&gt; concept, not a &lt;strong&gt;generation-level&lt;/strong&gt; concept. &lt;code&gt;SamplingParams&lt;/code&gt; controls how tokens are sampled (temperature, top-p, etc.) — it's about the quality of the output. Priority controls when the request gets scheduled — it's about resource allocation. Mixing them would conflate two different concerns.&lt;/li&gt;
&lt;li&gt;Yes — the &lt;code&gt;OutputProcessor&lt;/code&gt; receives &lt;code&gt;EngineCoreOutput&lt;/code&gt; which includes the &lt;code&gt;request_id&lt;/code&gt;. It could look up priority from its internal state (it already maintains per-request &lt;code&gt;RequestState&lt;/code&gt;). But currently it doesn't need to — priority only matters for scheduling decisions.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Exercise 5: Streaming Output Modes
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Difficulty&lt;/strong&gt;: Advanced&lt;br&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: Understand the difference between CUMULATIVE and DELTA output modes&lt;/p&gt;

&lt;p&gt;Given a request that generates the text &lt;code&gt;"Hello world!"&lt;/code&gt; as three tokens:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Step 1: token &lt;code&gt;"Hello"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Step 2: token &lt;code&gt;" world"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Step 3: token &lt;code&gt;"!"&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Write out what &lt;code&gt;RequestOutput.outputs[0].text&lt;/code&gt; contains at each step for:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;code&gt;output_kind = RequestOutputKind.CUMULATIVE&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;output_kind = RequestOutputKind.DELTA&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When would you use each mode? Think about a streaming chat UI vs. a batch processing pipeline.&lt;/p&gt;

&lt;p&gt;Solution&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CUMULATIVE&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Step 1: &lt;code&gt;"Hello"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Step 2: &lt;code&gt;"Hello world"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Step 3: &lt;code&gt;"Hello world!"&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;DELTA&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Step 1: &lt;code&gt;"Hello"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Step 2: &lt;code&gt;" world"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Step 3: &lt;code&gt;"!"&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When to use each&lt;/strong&gt;: DELTA is ideal for streaming chat UIs — you append each delta directly to the display. CUMULATIVE is simpler for batch pipelines — you always have the full text so far, no need to track previous outputs. CUMULATIVE is the default because it's easier to use correctly.&lt;/p&gt;
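
&lt;p&gt;A minimal sketch of the consumer-side difference (here &lt;code&gt;stream&lt;/code&gt; is a hypothetical iterable of &lt;code&gt;RequestOutput&lt;/code&gt; objects and &lt;code&gt;delta_mode&lt;/code&gt; a hypothetical flag):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: how a consumer handles each mode; stream and delta_mode are
# hypothetical stand-ins, not vLLM API.
text = ""
for out in stream:
    piece = out.outputs[0].text
    if delta_mode:
        text += piece   # DELTA: append only the new fragment
    else:
        text = piece    # CUMULATIVE: replace with the full text so far
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;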




&lt;h2&gt;
  
  
  Quiz
&lt;/h2&gt;

&lt;p&gt;Answer these questions based on today's material. Try to answer each question before revealing the answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q1: What are the three main components inside &lt;code&gt;LLMEngine&lt;/code&gt;, and what does each one do?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Answer&lt;/p&gt;

&lt;p&gt;&lt;code&gt;InputProcessor&lt;/code&gt;, &lt;code&gt;EngineCoreClient&lt;/code&gt;, and &lt;code&gt;OutputProcessor&lt;/code&gt;. &lt;code&gt;InputProcessor&lt;/code&gt; tokenizes prompts and creates &lt;code&gt;EngineCoreRequest&lt;/code&gt; objects. &lt;code&gt;EngineCoreClient&lt;/code&gt; sends requests to and receives outputs from the engine core (either in-process or via ZMQ). &lt;code&gt;OutputProcessor&lt;/code&gt; detokenizes raw token IDs back into text and formats &lt;code&gt;RequestOutput&lt;/code&gt; objects for the user.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q2: Why does vLLM have both &lt;code&gt;EngineCoreRequest&lt;/code&gt; and &lt;code&gt;Request&lt;/code&gt; as separate types?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Answer&lt;/p&gt;

&lt;p&gt;They serve different purposes across a process boundary. &lt;code&gt;EngineCoreRequest&lt;/code&gt; is the transport/wire format — immutable, serializable with &lt;code&gt;msgspec&lt;/code&gt;, designed to cross process boundaries efficiently. &lt;code&gt;Request&lt;/code&gt; is the scheduler's internal runtime format — mutable, tracks state like &lt;code&gt;num_computed_tokens&lt;/code&gt;, allocated KV cache blocks, and output tokens. Mixing these concerns would either make serialization expensive or make runtime tracking awkward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q3: What serialization format does vLLM use for inter-process communication, and why was it chosen over alternatives like pickle or JSON?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Answer&lt;/p&gt;

&lt;p&gt;&lt;code&gt;msgspec.msgpack&lt;/code&gt; — a binary MessagePack format. It's 10-50x faster than pickle for structured data and produces compact binary payloads. JSON was rejected because it's text-based (larger payloads, slower parsing). Pickle was rejected because it's slow for structured data and has security concerns. At thousands of requests per second, serialization overhead is a real bottleneck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q4: In multiprocess mode, what happens if the engine core is busy running a forward pass when a new HTTP request arrives?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Answer&lt;/p&gt;

&lt;p&gt;The new request is accepted by the frontend process and queued. Because the frontend (FastAPI + &lt;code&gt;InputProcessor&lt;/code&gt;) runs in a separate process from the engine core, it can accept and preprocess new HTTP requests while the GPU is busy. The &lt;code&gt;EngineCoreRequest&lt;/code&gt; is sent over ZMQ and queued for the next scheduling iteration. This is exactly why the two-process architecture exists.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q5: What does &lt;code&gt;OutputProcessor&lt;/code&gt; do when it receives a token that represents an incomplete UTF-8 character?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It buffers the incomplete bytes until a valid character is formed. Many tokenizers use byte-level BPE, where tokens can split in the middle of multi-byte UTF-8 characters (e.g., emoji, CJK characters). The &lt;code&gt;OutputProcessor&lt;/code&gt; accumulates bytes and only emits text when complete characters are available. This prevents garbled output in streaming responses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q6: True or false: In in-process mode (using the &lt;code&gt;LLM&lt;/code&gt; class), &lt;code&gt;EngineCoreRequest&lt;/code&gt; is still created even though it doesn't need to be serialized.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;True. The &lt;code&gt;InputProcessor&lt;/code&gt; always creates an &lt;code&gt;EngineCoreRequest&lt;/code&gt; regardless of execution mode. In in-process mode, the request is passed directly to the engine core without serialization. The &lt;code&gt;EngineCoreRequest&lt;/code&gt; type serves as a clean interface contract between the engine layer and the core, even when no process boundary exists.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q7: What is the purpose of &lt;code&gt;arrival_time&lt;/code&gt; in &lt;code&gt;EngineCoreRequest&lt;/code&gt;?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It records when the request was submitted, enabling scheduling policies like FCFS (first-come-first-served). The scheduler can use &lt;code&gt;arrival_time&lt;/code&gt; to prioritize older requests over newer ones. It's also used for metrics: you can measure end-to-end latency by comparing &lt;code&gt;arrival_time&lt;/code&gt; with the completion time. Without it, the scheduler would have no notion of fairness or request ordering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q8: Why does &lt;code&gt;LLMEngine.from_engine_args()&lt;/code&gt; exist as a classmethod factory instead of putting all the logic in &lt;code&gt;__init__&lt;/code&gt;?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To separate argument parsing from construction. The factory method converts user-friendly &lt;code&gt;EngineArgs&lt;/code&gt; (flat key-value pairs) into a structured &lt;code&gt;VllmConfig&lt;/code&gt; (nested, validated configuration), selects the right executor class, and then calls &lt;code&gt;__init__&lt;/code&gt;. This keeps &lt;code&gt;__init__&lt;/code&gt; simple — it receives fully validated, structured objects. It also allows alternative construction paths (e.g., creating &lt;code&gt;LLMEngine&lt;/code&gt; directly with a &lt;code&gt;VllmConfig&lt;/code&gt; for testing).&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LLMEngine&lt;/code&gt;&lt;/strong&gt; is the orchestrator that connects the user-facing API to the engine core, with three sub-components: &lt;code&gt;InputProcessor&lt;/code&gt;, &lt;code&gt;EngineCoreClient&lt;/code&gt;, and &lt;code&gt;OutputProcessor&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;InputProcessor&lt;/code&gt;&lt;/strong&gt; normalizes various input formats (strings, token IDs, multimodal data) into &lt;code&gt;EngineCoreRequest&lt;/code&gt; — the standard wire format&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;EngineCoreRequest&lt;/code&gt;&lt;/strong&gt; uses &lt;code&gt;msgspec.Struct&lt;/code&gt; for fast serialization, enabling efficient multiprocess communication via ZMQ&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;EngineCoreClient&lt;/code&gt;&lt;/strong&gt; abstracts the communication mode: in-process for offline use, multiprocess (ZMQ) for production servers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;OutputProcessor&lt;/code&gt;&lt;/strong&gt; reverses the input pipeline: accumulates tokens, detokenizes, handles streaming modes (CUMULATIVE vs DELTA), and produces &lt;code&gt;RequestOutput&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;two-process architecture&lt;/strong&gt; (frontend + engine core) is critical for production: it keeps the HTTP server responsive while the GPU runs inference&lt;/li&gt;
&lt;li&gt;Next session: &lt;strong&gt;The Scheduler&lt;/strong&gt; — how vLLM decides which requests get GPU time, the token budget system, and chunked prefill&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Generated from my &lt;a href="https://github.com/ccwang/ai-study" rel="noopener noreferrer"&gt;ai-study&lt;/a&gt; learning project.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>vllm</category>
      <category>llm</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Session 1: vLLM Overview and the User API</title>
      <dc:creator>Ben</dc:creator>
      <pubDate>Sun, 01 Feb 2026 22:00:01 +0000</pubDate>
      <link>https://dev.to/c2sea/session-1-vllm-overview-and-the-user-api-2406</link>
      <guid>https://dev.to/c2sea/session-1-vllm-overview-and-the-user-api-2406</guid>
      <description>&lt;p&gt;&lt;em&gt;This is part of my vLLM learning series. In this session, I cover Step 1 (The User API).&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: This content was generated by Claude, grounded on the actual&lt;br&gt;
&lt;a href="https://github.com/vllm-project/vllm" rel="noopener noreferrer"&gt;vLLM&lt;/a&gt; codebase. It is intended for personal&lt;br&gt;
learning only and may contain inaccuracies. Always verify against the&lt;br&gt;
original source code and official documentation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Topic&lt;/strong&gt;: vLLM&lt;br&gt;
&lt;strong&gt;Date&lt;/strong&gt;: 2026-01-31&lt;br&gt;
&lt;strong&gt;Sections covered&lt;/strong&gt;: Step 1 (The User API)&lt;br&gt;
&lt;strong&gt;Prerequisites&lt;/strong&gt;: None&lt;/p&gt;


&lt;h2&gt;
  
  
  Today's Material
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. What is vLLM and Why Does It Matter?
&lt;/h3&gt;

&lt;p&gt;LLM inference is GPU-memory-bound. When a model generates text, it needs to store &lt;strong&gt;key-value (KV) caches&lt;/strong&gt; — intermediate computations from the attention mechanism — for every token in every active request. Naive implementations pre-allocate the maximum possible sequence length for each request, wasting 60-80% of GPU memory on empty space.&lt;/p&gt;

&lt;p&gt;vLLM solves this with &lt;strong&gt;PagedAttention&lt;/strong&gt;: instead of pre-allocating a giant contiguous buffer per request, it carves GPU memory into fixed-size &lt;strong&gt;blocks&lt;/strong&gt; (default 16 tokens each) and allocates them on demand — just like how an operating system manages virtual memory with pages.&lt;/p&gt;

&lt;p&gt;The result: near-optimal memory utilization and &lt;strong&gt;2-4x higher throughput&lt;/strong&gt; than HuggingFace Transformers on typical workloads.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;📝 Note:&lt;/strong&gt;&lt;br&gt;
Think of the difference like this: the naive approach is like reserving an entire row of seats in a theater for each person "just in case" they bring friends. PagedAttention is like assigning individual seats as people actually show up.&lt;/p&gt;
&lt;/blockquote&gt;
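
&lt;p&gt;The savings are easy to estimate. Here's a back-of-envelope sketch with made-up request lengths; only the 16-token block size comes from the default mentioned above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical numbers: compare naive pre-allocation vs. on-demand blocks.
import math

max_seq_len = 4096
block_size = 16
actual_lens = [120, 800, 45, 2000]  # tokens actually used per request

naive_slots = len(actual_lens) * max_seq_len  # reserve the max for everyone
paged_slots = sum(math.ceil(n / block_size) * block_size for n in actual_lens)

print(naive_slots)                    # 16384 token slots reserved
print(paged_slots)                    # 2976 slots: only what's needed, plus at most one block each
print(1 - paged_slots / naive_slots)  # ~0.82, i.e. ~82% of the naive reservation was waste
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;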
&lt;h3&gt;
  
  
  2. High-Level Architecture
&lt;/h3&gt;

&lt;p&gt;Before diving into code, here's the bird's-eye view of how vLLM is organized:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────┐
│                  User-Facing Layer                   │
│       LLM class  |  OpenAI API Server  |  gRPC       │
└──────────────────────────┬───────────────────────────┘
                           │
┌──────────────────────────▼───────────────────────────┐
│                     Engine Layer                     │
│ InputProcessor → EngineCoreClient → OutputProcessor  │
└──────────────────────────┬───────────────────────────┘
                           │
┌──────────────────────────▼───────────────────────────┐
│                     Engine Core                      │
│         Scheduler → Executor → Workers → GPU         │
│            └── KVCacheManager (BlockPool)            │
└──────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;User-Facing&lt;/strong&gt; — Multiple entry points (Python API, HTTP, gRPC) that all funnel into the engine&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engine Layer&lt;/strong&gt; — Tokenize inputs, relay to the core, format outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engine Core&lt;/strong&gt; — The scheduling loop, KV cache management, and GPU execution&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Today we focus on layer 1: the &lt;code&gt;LLM&lt;/code&gt; class and its associated types.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The LLM Class — Your Main Interface
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;LLM&lt;/code&gt; class in &lt;code&gt;vllm/entrypoints/llm.py&lt;/code&gt; is the primary interface for offline batch inference. Here's its constructor (simplified to the most important parameters):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# vllm/entrypoints/llm.py
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LLM&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tensor_parallel_size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ModelDType&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;quantization&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;QuantizationMethods&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;gpu_memory_utilization&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;engine_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;EngineArgs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;tensor_parallel_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tensor_parallel_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;quantization&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;quantization&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;gpu_memory_utilization&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;gpu_memory_utilization&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm_engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LLMEngine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_engine_args&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;engine_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;engine_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;usage_context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;UsageContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LLM_CLASS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request_counter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key things to notice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;LLM&lt;/code&gt; is &lt;strong&gt;thin&lt;/strong&gt; — it creates an &lt;code&gt;EngineArgs&lt;/code&gt; config, then hands everything off to &lt;code&gt;LLMEngine.from_engine_args()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;gpu_memory_utilization=0.9&lt;/code&gt; means vLLM may use up to 90% of total GPU memory for everything it allocates (model weights, activations, and the KV cache, which gets whatever remains after profiling), leaving 10% headroom for other processes&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tensor_parallel_size&lt;/code&gt; controls how many GPUs to shard the model across — set to 1 for single-GPU&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Tip:&lt;/strong&gt;&lt;br&gt;
If you get CUDA out-of-memory errors, lower &lt;code&gt;gpu_memory_utilization&lt;/code&gt; (e.g., to 0.8). If you want more throughput and have headroom, raise it (up to ~0.95).&lt;/p&gt;
&lt;/blockquote&gt;
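
&lt;p&gt;A minimal construction sketch (the model name and values here are illustrative, not recommendations):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.85,  # lowered from the 0.9 default for extra headroom
    tensor_parallel_size=1,       # single GPU
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;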

&lt;h3&gt;
  
  
  4. The generate() Method — Where Requests Enter
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# vllm/entrypoints/llm.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prompts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PromptType&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Sequence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;PromptType&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;sampling_params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SamplingParams&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Sequence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;SamplingParams&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;use_tqdm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;[...,&lt;/span&gt; &lt;span class="n"&gt;tqdm&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;lora_request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;LoRARequest&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;LoRARequest&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;RequestOutput&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;model_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model_config&lt;/span&gt;
    &lt;span class="n"&gt;runner_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model_config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;runner_type&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;runner_type&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LLM.generate() is only supported for generative models.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;sampling_params&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;sampling_params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_default_sampling_params&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_validate_and_add_requests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;prompts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sampling_params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;use_tqdm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;use_tqdm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;lora_request&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;...,&lt;/span&gt;
        &lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_run_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;use_tqdm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;use_tqdm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;engine_class&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;validate_outputs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RequestOutput&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The flow is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Validate&lt;/strong&gt; that this is a generative model (not an embedding model)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add requests&lt;/strong&gt; via &lt;code&gt;_validate_and_add_requests()&lt;/code&gt; — normalizes inputs, pairs each prompt with its &lt;code&gt;SamplingParams&lt;/code&gt;, and sends them to the engine&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run the engine&lt;/strong&gt; via &lt;code&gt;_run_engine()&lt;/code&gt; — loops until all requests are finished&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Return&lt;/strong&gt; sorted &lt;code&gt;RequestOutput&lt;/code&gt; objects&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You can pass a single &lt;code&gt;SamplingParams&lt;/code&gt; (applied to all prompts) or a list (one per prompt). This is useful when different prompts need different temperatures or stop conditions.&lt;/p&gt;
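
&lt;p&gt;For example, a sketch pairing each prompt with its own parameters (the two lists must be the same length; &lt;code&gt;llm&lt;/code&gt; is the instance from above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from vllm import SamplingParams

prompts = ["Summarize this article.", "Write a haiku about GPUs."]
params = [
    SamplingParams(temperature=0.0, max_tokens=128),  # deterministic summary
    SamplingParams(temperature=1.0, max_tokens=40),   # creative haiku
]
outputs = llm.generate(prompts, params)  # params[i] applies to prompts[i]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;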

&lt;h3&gt;
  
  
  5. _run_engine() — The Processing Loop
&lt;/h3&gt;

&lt;p&gt;This is where the actual inference happens:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# vllm/entrypoints/llm.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_run_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;use_tqdm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;[...,&lt;/span&gt; &lt;span class="n"&gt;tqdm&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;RequestOutput&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;PoolingRequestOutput&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;RequestOutput&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;PoolingRequestOutput&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;total_in_toks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;total_out_toks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm_engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;has_unfinished_requests&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;step_outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm_engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;step_outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;finished&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Sort the outputs by request ID.
&lt;/span&gt;    &lt;span class="c1"&gt;# This is necessary because some requests may be finished earlier than
&lt;/span&gt;    &lt;span class="c1"&gt;# its previous requests.
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: &lt;strong&gt;&lt;code&gt;_run_engine()&lt;/code&gt; is a simple loop&lt;/strong&gt;. It calls &lt;code&gt;self.llm_engine.step()&lt;/code&gt; repeatedly. Each &lt;code&gt;step()&lt;/code&gt; runs one iteration of the scheduling + inference pipeline — potentially processing hundreds of requests in a single forward pass. Finished requests come back as &lt;code&gt;RequestOutput&lt;/code&gt; objects.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️ Warning:&lt;/strong&gt;&lt;br&gt;
The outputs are sorted by &lt;code&gt;request_id&lt;/code&gt; at the end because requests don't finish in order. A short request (e.g., "Say hi") may finish in 5 iterations while a long request (e.g., "Write an essay") takes 500. The sorting ensures the output list matches the input prompt order.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Why this matters&lt;/strong&gt;: This loop is where continuous batching happens. Unlike static batching (process N prompts, wait for all to finish, return), vLLM processes requests at different stages simultaneously. Request A might be mid-generation while Request B is just starting its prefill.&lt;/p&gt;
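
&lt;p&gt;A toy simulation of that difference (this is scheduling arithmetic only, not vLLM's actual scheduler):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from collections import deque

# Each number is how many steps a request still needs; the batch holds 2 slots.
queue = deque([5, 500, 3, 4])
batch: list[int] = []
steps = 0

while queue or batch:
    # Continuous batching: refill any free slot every iteration,
    # instead of waiting for the whole batch to drain.
    while queue and len(batch) &lt; 2:
        batch.append(queue.popleft())
    steps += 1
    batch = [r - 1 for r in batch if r &gt; 1]  # r == 1 finishes this step

print(steps)  # the short requests slot in and finish while the 500-step one runs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;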

&lt;h3&gt;
  
  
  6. SamplingParams — Controlling Generation
&lt;/h3&gt;

&lt;p&gt;Every request carries a &lt;code&gt;SamplingParams&lt;/code&gt; that controls how tokens are selected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# vllm/sampling_params.py
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SamplingParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;PydanticMsgspecMixin&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;msgspec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Struct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;omit_defaults&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# --- Core sampling ---
&lt;/span&gt;    &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;                          &lt;span class="c1"&gt;# Number of output sequences
&lt;/span&gt;    &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;            &lt;span class="c1"&gt;# 0 = greedy, higher = more random
&lt;/span&gt;    &lt;span class="n"&gt;top_p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;                  &lt;span class="c1"&gt;# Nucleus sampling threshold
&lt;/span&gt;    &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;                      &lt;span class="c1"&gt;# Top-K filtering (0 = disabled)
&lt;/span&gt;    &lt;span class="n"&gt;min_p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;                  &lt;span class="c1"&gt;# Minimum probability threshold
&lt;/span&gt;    &lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;             &lt;span class="c1"&gt;# Reproducible sampling
&lt;/span&gt;
    &lt;span class="c1"&gt;# --- Penalties ---
&lt;/span&gt;    &lt;span class="n"&gt;presence_penalty&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;       &lt;span class="c1"&gt;# Penalize tokens that appeared
&lt;/span&gt;    &lt;span class="n"&gt;frequency_penalty&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;      &lt;span class="c1"&gt;# Penalize by frequency
&lt;/span&gt;    &lt;span class="n"&gt;repetition_penalty&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;     &lt;span class="c1"&gt;# Multiplicative penalty
&lt;/span&gt;
    &lt;span class="c1"&gt;# --- Generation limits ---
&lt;/span&gt;    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;         &lt;span class="c1"&gt;# Output length limit
&lt;/span&gt;    &lt;span class="n"&gt;min_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;                 &lt;span class="c1"&gt;# Minimum before allowing EOS
&lt;/span&gt;    &lt;span class="n"&gt;ignore_eos&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;            &lt;span class="c1"&gt;# Don't stop at EOS
&lt;/span&gt;
    &lt;span class="c1"&gt;# --- Stop conditions ---
&lt;/span&gt;    &lt;span class="n"&gt;stop&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;stop_token_ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="c1"&gt;# --- Output control ---
&lt;/span&gt;    &lt;span class="n"&gt;logprobs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;         &lt;span class="c1"&gt;# Return top-N log probabilities
&lt;/span&gt;    &lt;span class="n"&gt;prompt_logprobs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  &lt;span class="c1"&gt;# Prompt token log probs
&lt;/span&gt;    &lt;span class="n"&gt;detokenize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;             &lt;span class="c1"&gt;# Decode token IDs to text
&lt;/span&gt;
    &lt;span class="c1"&gt;# --- Advanced ---
&lt;/span&gt;    &lt;span class="n"&gt;structured_outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;StructuredOutputsParams&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  &lt;span class="c1"&gt;# JSON schema
&lt;/span&gt;    &lt;span class="n"&gt;logit_bias&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;output_kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;RequestOutputKind&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RequestOutputKind&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CUMULATIVE&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that &lt;code&gt;SamplingParams&lt;/code&gt; inherits from &lt;code&gt;msgspec.Struct&lt;/code&gt;, not a Python dataclass. This is a deliberate performance choice — &lt;code&gt;msgspec&lt;/code&gt; serialization is 10-50x faster than &lt;code&gt;pickle&lt;/code&gt;, which matters when requests cross process boundaries (more on this in a future session).&lt;/p&gt;
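
&lt;p&gt;A standalone sketch of the pattern (a toy struct, not vLLM's actual types): &lt;code&gt;msgspec.Struct&lt;/code&gt; instances encode straight to compact MessagePack bytes and back:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import msgspec

class Params(msgspec.Struct):
    temperature: float = 1.0
    top_p: float = 1.0
    max_tokens: int = 16

encoder = msgspec.msgpack.Encoder()
decoder = msgspec.msgpack.Decoder(Params)

payload = encoder.encode(Params(temperature=0.7, max_tokens=64))
print(len(payload))             # a few dozen bytes, ready to ship over a socket
print(decoder.decode(payload))  # Params(temperature=0.7, top_p=1.0, max_tokens=64)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;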

&lt;h4&gt;
  
  
  Validation logic
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;SamplingParams.__post_init__()&lt;/code&gt; enforces constraints:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# vllm/sampling_params.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__post_init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Normalize stop to a list
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stop&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stop&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stop&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Zero temperature → force greedy sampling
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;_SAMPLING_EPS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;top_p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min_p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_verify_greedy_sampling&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_verify_args&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_verify_args&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;n must be at least 1, got &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;presence_penalty&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;VLLMValidationError&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;top_p&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;VLLMValidationError&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_tokens&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_tokens&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;VLLMValidationError&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;📝 Note:&lt;/strong&gt;&lt;br&gt;
When &lt;code&gt;temperature=0&lt;/code&gt;, vLLM automatically sets &lt;code&gt;top_p=1.0&lt;/code&gt;, &lt;code&gt;top_k=0&lt;/code&gt;, and &lt;code&gt;min_p=0.0&lt;/code&gt;. This is because greedy decoding (always pick the highest-probability token) makes all other sampling parameters irrelevant. The code enforces this rather than letting the user set contradictory values.&lt;/p&gt;
&lt;/blockquote&gt;
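
&lt;p&gt;You can observe the coercion directly. A quick check, assuming the behavior shown in the snippet above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from vllm import SamplingParams

# Contradictory inputs: greedy temperature plus sampling filters.
params = SamplingParams(temperature=0.0, top_k=50, top_p=0.5)
print(params.top_k, params.top_p)  # expect: 0 1.0 (reset for greedy decoding)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;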

&lt;h3&gt;
  
  
  7. RequestOutput and CompletionOutput — What You Get Back
&lt;/h3&gt;

&lt;p&gt;After &lt;code&gt;generate()&lt;/code&gt; finishes, you get a list of &lt;code&gt;RequestOutput&lt;/code&gt; objects:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# vllm/outputs.py
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RequestOutput&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;prompt_token_ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;prompt_logprobs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PromptLogprobs&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;CompletionOutput&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;finished&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;RequestStateStats&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;num_cached_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each &lt;code&gt;RequestOutput&lt;/code&gt; contains one or more &lt;code&gt;CompletionOutput&lt;/code&gt; objects (one per &lt;code&gt;n&lt;/code&gt; in &lt;code&gt;SamplingParams&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# vllm/outputs.py
&lt;/span&gt;&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CompletionOutput&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;                         &lt;span class="c1"&gt;# Which of the n outputs
&lt;/span&gt;    &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;                          &lt;span class="c1"&gt;# Generated text
&lt;/span&gt;    &lt;span class="n"&gt;token_ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;GenericSequence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;    &lt;span class="c1"&gt;# Generated token IDs
&lt;/span&gt;    &lt;span class="n"&gt;cumulative_logprob&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;   &lt;span class="c1"&gt;# Sum of log probs
&lt;/span&gt;    &lt;span class="n"&gt;logprobs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SampleLogprobs&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;    &lt;span class="c1"&gt;# Per-token log probs
&lt;/span&gt;    &lt;span class="n"&gt;finish_reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;          &lt;span class="c1"&gt;# "stop", "length", or None
&lt;/span&gt;    &lt;span class="n"&gt;stop_reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;      &lt;span class="c1"&gt;# What triggered the stop
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;finished&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;finish_reason&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A typical usage pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;vllm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LLM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SamplingParams&lt;/span&gt;

&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meta-llama/Llama-3.1-8B-Instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the capital of France?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain gravity in one sentence.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="nc"&gt;SamplingParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;
    &lt;span class="n"&gt;generated&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
    &lt;span class="n"&gt;reason&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;finish_reason&lt;/span&gt;  &lt;span class="c1"&gt;# "stop" or "length"
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Prompt: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Output: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;generated&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Finished because: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;code&gt;n &amp;gt; 1&lt;/code&gt;, you get multiple completions per prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tell me a joke.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="nc"&gt;SamplingParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# outputs[0].outputs has 3 CompletionOutput objects
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;completion&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Completion &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;completion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;completion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  8. Beyond generate() — Other Task Types
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;LLM&lt;/code&gt; class supports more than text generation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Chat (applies chat template automatically)
&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is 2+2?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;sampling_params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;SamplingParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Embeddings
&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello world&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Goodbye world&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Classification (not all models support this)
&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;classify&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;This movie was great!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Terrible film.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Scoring (cross-encoder style)
&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doc1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doc2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doc3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each method validates that the loaded model supports the requested task via &lt;code&gt;runner_type&lt;/code&gt;. If you try &lt;code&gt;llm.generate()&lt;/code&gt; on an embedding model, you get a clear error.&lt;/p&gt;
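
&lt;p&gt;As a rough illustration, that guard looks something like the sketch below. The mapping and error message are made up; only the &lt;code&gt;runner_type&lt;/code&gt; check itself reflects the behavior described above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative sketch only -- not vLLM's actual implementation.
def validate_runner_type(runner_type: str, requested_task: str) -&gt; None:
    # Hypothetical mapping from runner type to the tasks it can serve.
    supported = {
        "generate": {"generate", "chat"},
        "pooling": {"embed", "classify", "score"},
    }
    if requested_task not in supported.get(runner_type, set()):
        raise ValueError(
            f"Model with runner_type={runner_type!r} does not support "
            f"{requested_task!r}; load a model with a matching runner."
        )

validate_runner_type("pooling", "generate")  # raises a clear error
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;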




&lt;h2&gt;
  
  
  Exercises
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Exercise 1: Basic Generation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Difficulty&lt;/strong&gt;: Beginner&lt;br&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: Understand the relationship between &lt;code&gt;SamplingParams&lt;/code&gt; and output&lt;/p&gt;

&lt;p&gt;Given this code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meta-llama/Llama-3.1-8B-Instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SamplingParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Count from 1 to 10.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;What happens when &lt;code&gt;temperature=0&lt;/code&gt; and &lt;code&gt;n=2&lt;/code&gt;? Will the two completions be different or identical? Why?&lt;/li&gt;
&lt;li&gt;What will &lt;code&gt;finish_reason&lt;/code&gt; be for each completion? (&lt;code&gt;"stop"&lt;/code&gt; or &lt;code&gt;"length"&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;How many &lt;code&gt;CompletionOutput&lt;/code&gt; objects will be in &lt;code&gt;outputs[0].outputs&lt;/code&gt;?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Hint&lt;/strong&gt;: Think about what greedy decoding means for multiple samples.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Identical.&lt;/strong&gt; Temperature=0 means greedy decoding — always pick the highest-probability token. With no randomness, every sample produces the exact same sequence. Running &lt;code&gt;n=2&lt;/code&gt; with greedy is wasteful.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;"length"&lt;/code&gt;&lt;/strong&gt; for both. &lt;code&gt;max_tokens=5&lt;/code&gt; will cut off "Count from 1 to 10" well before the model naturally stops — it would need at least ~20 tokens ("1, 2, 3, 4, 5, 6, 7, 8, 9, 10").&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2&lt;/strong&gt; — one per &lt;code&gt;n&lt;/code&gt;. &lt;code&gt;outputs[0].outputs[0]&lt;/code&gt; and &lt;code&gt;outputs[0].outputs[1]&lt;/code&gt;, though both will have the same text.&lt;/li&gt;
&lt;/ol&gt;
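
&lt;p&gt;You can verify all three answers empirically with a few assertions (this assumes the model above is available locally):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from vllm import LLM, SamplingParams

# Greedy decoding (temperature=0) follows the same argmax path for
# every sample, so both completions should match exactly.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0, max_tokens=5, n=2)
outputs = llm.generate(["Count from 1 to 10."], params)

completions = outputs[0].outputs
assert len(completions) == 2                       # one per n
assert completions[0].text == completions[1].text  # greedy, so identical
assert all(c.finish_reason == "length" for c in completions)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;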

&lt;h3&gt;
  
  
  Exercise 2: Trace the Call Path
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Difficulty&lt;/strong&gt;: Intermediate&lt;br&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: Map the execution flow from user call to engine loop&lt;/p&gt;

&lt;p&gt;Trace what happens when this code executes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meta-llama/Llama-3.1-8B-Instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;World&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="nc"&gt;SamplingParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For each step, name the method and describe what it does:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What does &lt;code&gt;generate()&lt;/code&gt; call first?&lt;/li&gt;
&lt;li&gt;How are the two prompts and the single &lt;code&gt;SamplingParams&lt;/code&gt; paired?&lt;/li&gt;
&lt;li&gt;What does &lt;code&gt;_run_engine()&lt;/code&gt; do on each iteration?&lt;/li&gt;
&lt;li&gt;Why are outputs sorted at the end?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;generate()&lt;/code&gt; first validates the runner type is &lt;code&gt;"generate"&lt;/code&gt;, then calls &lt;code&gt;_validate_and_add_requests()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The single &lt;code&gt;SamplingParams&lt;/code&gt; is replicated: &lt;code&gt;[params] * num_requests&lt;/code&gt; — so both "Hello" and "World" get the same &lt;code&gt;max_tokens=10&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Each iteration calls &lt;code&gt;self.llm_engine.step()&lt;/code&gt;, which runs one scheduling + inference cycle. Finished requests are collected into the &lt;code&gt;outputs&lt;/code&gt; list.&lt;/li&gt;
&lt;li&gt;Because requests finish out of order. "Hello" (shorter) might finish before "World" (or vice versa depending on generation). Sorting by &lt;code&gt;request_id&lt;/code&gt; ensures &lt;code&gt;outputs[0]&lt;/code&gt; corresponds to "Hello" and &lt;code&gt;outputs[1]&lt;/code&gt; to "World".&lt;/li&gt;
&lt;/ol&gt;
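
&lt;p&gt;Putting the four steps together, the path condenses to the paraphrased sketch below. The bodies are simplified pseudocode built from this session's descriptions, not the real implementation; the ID counter is a stand-in for however the engine assigns request IDs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;class LLMSketch:
    """Paraphrased call path -- simplified, not vLLM's real bodies."""

    def __init__(self, llm_engine):
        self.llm_engine = llm_engine
        self._next_id = 0  # stand-in for real request ID assignment

    def generate(self, prompts, sampling_params):
        self._validate_and_add_requests(prompts, sampling_params)  # 1-2
        outputs = self._run_engine()                               # 3
        return sorted(outputs, key=lambda o: int(o.request_id))    # 4

    def _validate_and_add_requests(self, prompts, params):
        if not isinstance(params, list):
            params = [params] * len(prompts)  # replicate a single config
        for prompt, p in zip(prompts, params):
            self.llm_engine.add_request(str(self._next_id), prompt, p)
            self._next_id += 1

    def _run_engine(self):
        outputs = []
        while self.llm_engine.has_unfinished_requests():
            for out in self.llm_engine.step():  # one schedule+infer cycle
                if out.finished:
                    outputs.append(out)
        return outputs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;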

&lt;h3&gt;
  
  
  Exercise 3: SamplingParams Edge Cases
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Difficulty&lt;/strong&gt;: Intermediate&lt;br&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: Understand validation and normalization&lt;/p&gt;

&lt;p&gt;What happens in each case? Does it succeed, raise an error, or get silently normalized?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;code&gt;SamplingParams(temperature=-0.5)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SamplingParams(temperature=0, top_k=50)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SamplingParams(top_p=0.0)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SamplingParams(stop="END", include_stop_str_in_output=False)&lt;/code&gt; — what does &lt;code&gt;output_text_buffer_length&lt;/code&gt; get set to?&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SamplingParams(max_tokens=0)&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Raises &lt;code&gt;VLLMValidationError&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;_verify_args()&lt;/code&gt; checks &lt;code&gt;self.temperature &amp;lt; 0.0&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Silently normalized.&lt;/strong&gt; When temperature=0, &lt;code&gt;__post_init__&lt;/code&gt; forces &lt;code&gt;top_k=0&lt;/code&gt; (along with &lt;code&gt;top_p=1.0&lt;/code&gt;, &lt;code&gt;min_p=0.0&lt;/code&gt;). Your &lt;code&gt;top_k=50&lt;/code&gt; is overwritten.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Raises &lt;code&gt;VLLMValidationError&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;_verify_args()&lt;/code&gt; checks &lt;code&gt;not 0.0 &amp;lt; self.top_p &amp;lt;= 1.0&lt;/code&gt;. Zero is not in the valid range.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;output_text_buffer_length&lt;/code&gt; is set to &lt;code&gt;len("END") - 1 = 2&lt;/code&gt;.&lt;/strong&gt; This buffer ensures the output processor doesn't emit text that might be part of the stop string before the full match is determined.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Raises &lt;code&gt;VLLMValidationError&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;_verify_args()&lt;/code&gt; checks &lt;code&gt;self.max_tokens &amp;lt; 1&lt;/code&gt; when not None.&lt;/li&gt;
&lt;/ol&gt;
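
&lt;p&gt;A quick way to check these behaviors yourself; the expected values in the comments come from the solution above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from vllm import SamplingParams

# Case 2: greedy normalization silently overrides sampling knobs.
p = SamplingParams(temperature=0, top_k=50)
print(p.top_k, p.top_p, p.min_p)  # expect 0, 1.0, 0.0 per the solution

# Cases 1, 3, 5: out-of-range values should raise at construction time.
for kwargs in ({"temperature": -0.5}, {"top_p": 0.0}, {"max_tokens": 0}):
    try:
        SamplingParams(**kwargs)
    except Exception as exc:  # VLLMValidationError per the solution
        print(f"{kwargs}: {type(exc).__name__}: {exc}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;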

&lt;h3&gt;
  
  
  Exercise 4: Design a Batch Inference Script
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Difficulty&lt;/strong&gt;: Advanced&lt;br&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: Apply what you've learned to a realistic scenario&lt;/p&gt;

&lt;p&gt;You have a file with 10,000 prompts (one per line). You need to generate completions with &lt;code&gt;temperature=0.8&lt;/code&gt; and &lt;code&gt;max_tokens=256&lt;/code&gt;, saving results to a JSON file. Design the script:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Should you call &lt;code&gt;generate()&lt;/code&gt; once with all 10,000 prompts, or in batches of 100? Why?&lt;/li&gt;
&lt;li&gt;How would you handle prompts that need different &lt;code&gt;max_tokens&lt;/code&gt;?&lt;/li&gt;
&lt;li&gt;If 3 out of 10,000 prompts fail, how would you know which ones? (Hint: look at &lt;code&gt;request_id&lt;/code&gt;)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Call &lt;code&gt;generate()&lt;/code&gt; once with all 10,000.&lt;/strong&gt; vLLM's continuous batching handles scheduling internally — it dynamically fits as many requests as GPU memory allows per step. Breaking into batches of 100 would serialize work unnecessarily and prevent vLLM from optimally utilizing the GPU.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pass a list of &lt;code&gt;SamplingParams&lt;/code&gt;&lt;/strong&gt;, one per prompt: &lt;code&gt;[SamplingParams(max_tokens=t) for t in per_prompt_max_tokens]&lt;/code&gt;. This lets each prompt have its own configuration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Match by index.&lt;/strong&gt; Since &lt;code&gt;generate()&lt;/code&gt; sorts outputs by &lt;code&gt;request_id&lt;/code&gt; (which maps to the input order), &lt;code&gt;outputs[i]&lt;/code&gt; corresponds to &lt;code&gt;prompts[i]&lt;/code&gt;. Check &lt;code&gt;outputs[i].outputs[0].finish_reason&lt;/code&gt; — if it's &lt;code&gt;None&lt;/code&gt; or shows an unexpected state, that prompt had issues. You could also check &lt;code&gt;len(outputs)&lt;/code&gt; vs &lt;code&gt;len(prompts)&lt;/code&gt; to see if any were dropped.&lt;/li&gt;
&lt;/ol&gt;
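
&lt;p&gt;A compact version of the script the solution describes (file names are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
from vllm import LLM, SamplingParams

# Placeholder input file: one prompt per line.
with open("prompts.txt") as f:
    prompts = [line.strip() for line in f if line.strip()]

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.8, max_tokens=256)

# One call with all prompts: continuous batching schedules them internally.
outputs = llm.generate(prompts, params)

# outputs[i] corresponds to prompts[i] (sorted by request_id).
results = [
    {
        "prompt": prompts[i],
        "completion": out.outputs[0].text,
        "finish_reason": out.outputs[0].finish_reason,
    }
    for i, out in enumerate(outputs)
]

with open("results.json", "w") as f:
    json.dump(results, f, indent=2, ensure_ascii=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;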




&lt;h2&gt;
  
  
  Quiz
&lt;/h2&gt;

&lt;p&gt;Answer these questions based on today's material. Try to answer each question before revealing the answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q1: What does &lt;code&gt;LLM.__init__()&lt;/code&gt; actually do with its parameters? Where does the heavy lifting happen?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It packs parameters into &lt;code&gt;EngineArgs&lt;/code&gt; and calls &lt;code&gt;LLMEngine.from_engine_args()&lt;/code&gt;. The &lt;code&gt;LLM&lt;/code&gt; class itself does minimal work — it's a convenience wrapper. The engine factory method parses the args into a &lt;code&gt;VllmConfig&lt;/code&gt;, selects the right executor, loads the model, allocates the KV cache, and initializes the scheduler.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q2: Why does &lt;code&gt;_run_engine()&lt;/code&gt; sort its outputs by &lt;code&gt;request_id&lt;/code&gt; before returning?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Because requests don't finish in order. Short prompts complete in fewer iterations than long ones. Since &lt;code&gt;_run_engine()&lt;/code&gt; collects outputs as they finish, a request with &lt;code&gt;request_id=5&lt;/code&gt; might finish before &lt;code&gt;request_id=3&lt;/code&gt;. Sorting by request ID restores the original prompt order so &lt;code&gt;outputs[i]&lt;/code&gt; corresponds to &lt;code&gt;prompts[i]&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q3: What is the default value of &lt;code&gt;max_tokens&lt;/code&gt; in &lt;code&gt;SamplingParams&lt;/code&gt;, and why might this surprise users?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The default is 16 tokens. This is much smaller than most users expect (GPT-4 defaults to ~4096). If your outputs seem cut short, you probably need to set &lt;code&gt;max_tokens&lt;/code&gt; explicitly. The low default is intentional — it prevents accidental resource exhaustion when experimenting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q4: What happens internally to &lt;code&gt;SamplingParams&lt;/code&gt; when you set &lt;code&gt;temperature=0&lt;/code&gt;?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;vLLM forces greedy sampling parameters. Specifically, it sets &lt;code&gt;top_p=1.0&lt;/code&gt;, &lt;code&gt;top_k=0&lt;/code&gt;, and &lt;code&gt;min_p=0.0&lt;/code&gt;, then calls &lt;code&gt;_verify_greedy_sampling()&lt;/code&gt;. This is because when temperature is zero (always pick the highest-probability token), top-p/top-k filtering is meaningless and could introduce unexpected behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q5: Why does &lt;code&gt;SamplingParams&lt;/code&gt; inherit from &lt;code&gt;msgspec.Struct&lt;/code&gt; instead of using a Python &lt;code&gt;@dataclass&lt;/code&gt;?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Serialization performance. &lt;code&gt;msgspec.Struct&lt;/code&gt; provides 10-50x faster serialization than pickle (used with dataclasses). This matters because when vLLM runs in multiprocess mode, &lt;code&gt;SamplingParams&lt;/code&gt; is serialized with &lt;code&gt;msgspec.msgpack&lt;/code&gt; and sent over ZMQ sockets from the frontend process to the engine core process. Faster serialization = lower per-request overhead.&lt;/p&gt;
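
&lt;p&gt;For a feel of the API, here is a generic &lt;code&gt;msgspec&lt;/code&gt; round trip. The struct is an invented example, not vLLM's actual class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import msgspec

# Invented example struct -- not vLLM's actual SamplingParams.
class Params(msgspec.Struct):
    temperature: float = 1.0
    max_tokens: int = 16

buf = msgspec.msgpack.encode(Params(temperature=0.8, max_tokens=256))
decoded = msgspec.msgpack.decode(buf, type=Params)
assert decoded.max_tokens == 256  # compact bytes, typed round trip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;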

&lt;p&gt;&lt;strong&gt;Q6: What is the difference between &lt;code&gt;finish_reason="stop"&lt;/code&gt; and &lt;code&gt;finish_reason="length"&lt;/code&gt; in &lt;code&gt;CompletionOutput&lt;/code&gt;?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;"stop"&lt;/code&gt; means the model naturally stopped — it hit an EOS token, a stop string, or a stop token ID. &lt;code&gt;"length"&lt;/code&gt; means it hit &lt;code&gt;max_tokens&lt;/code&gt; — the model wanted to keep generating but was cut off. If you see many &lt;code&gt;"length"&lt;/code&gt; finishes, consider increasing &lt;code&gt;max_tokens&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q7: True or false: Calling &lt;code&gt;llm.generate()&lt;/code&gt; with a single &lt;code&gt;SamplingParams&lt;/code&gt; and a list of 100 prompts will use the same sampling parameters for all 100 prompts.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;True. When you pass a single &lt;code&gt;SamplingParams&lt;/code&gt; (not a list), &lt;code&gt;_validate_and_add_requests()&lt;/code&gt; replicates it: &lt;code&gt;engine_params = [params] * num_requests&lt;/code&gt;. Each prompt gets the same sampling configuration. To use different parameters per prompt, pass a list of &lt;code&gt;SamplingParams&lt;/code&gt; with the same length as the prompts list.&lt;/p&gt;
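
&lt;p&gt;Both calling conventions, side by side:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
prompts = ["Hello", "World"]

# One SamplingParams: replicated internally across all prompts.
outputs = llm.generate(prompts, SamplingParams(max_tokens=10))

# Per-prompt control: pass a list the same length as prompts.
per_prompt = [SamplingParams(max_tokens=10), SamplingParams(max_tokens=50)]
outputs = llm.generate(prompts, per_prompt)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;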

&lt;p&gt;&lt;strong&gt;Q8: What does &lt;code&gt;gpu_memory_utilization=0.9&lt;/code&gt; mean, and what happens to the other 10%?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;vLLM uses 90% of GPU memory for the KV cache (and model weights). The remaining 10% is reserved for PyTorch's internal allocations — temporary activation tensors, CUDA context, cuBLAS workspace, etc. If you set it too high (e.g., 0.99), you risk CUDA OOM during forward passes. If you set it too low, you waste GPU capacity.&lt;/p&gt;
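
&lt;p&gt;The knob is a constructor argument:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from vllm import LLM

# Leave ~10% headroom for PyTorch's own allocations; raising this
# toward 1.0 trades OOM risk for a larger KV cache.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.9,  # fraction for weights + KV cache
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;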




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;vLLM&lt;/strong&gt; solves the KV cache memory waste problem via PagedAttention — block-based allocation instead of pre-allocation&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;LLM&lt;/strong&gt; class is a thin wrapper: it creates &lt;code&gt;EngineArgs&lt;/code&gt;, builds &lt;code&gt;LLMEngine&lt;/code&gt;, and provides &lt;code&gt;generate()&lt;/code&gt;, &lt;code&gt;chat()&lt;/code&gt;, &lt;code&gt;embed()&lt;/code&gt;, and other task methods&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;generate()&lt;/code&gt;&lt;/strong&gt; validates inputs, adds requests to the engine, then loops &lt;code&gt;step()&lt;/code&gt; until all requests finish&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;_run_engine()&lt;/code&gt;&lt;/strong&gt; is a simple while-loop over &lt;code&gt;llm_engine.step()&lt;/code&gt; — this is where continuous batching happens under the hood&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SamplingParams&lt;/code&gt;&lt;/strong&gt; controls per-request generation with thorough validation — zero temperature forces greedy mode, invalid ranges raise errors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;RequestOutput&lt;/code&gt;&lt;/strong&gt; wraps one or more &lt;code&gt;CompletionOutput&lt;/code&gt; objects, each containing the generated text, token IDs, and finish reason&lt;/li&gt;
&lt;li&gt;Next session: &lt;strong&gt;The Engine Layer&lt;/strong&gt; — what &lt;code&gt;LLMEngine&lt;/code&gt; does inside &lt;code&gt;step()&lt;/code&gt;, how &lt;code&gt;InputProcessor&lt;/code&gt; tokenizes prompts, and how &lt;code&gt;EngineCoreClient&lt;/code&gt; bridges to the core&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Generated from my &lt;a href="https://github.com/ccwang/ai-study" rel="noopener noreferrer"&gt;ai-study&lt;/a&gt; learning project.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>vllm</category>
      <category>llm</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
