DEV Community: Hyogeun Oh (오효근)

Code Review: Deep Dive into vLLM's Architecture and Implementation Analysis of OpenAI-Compatible Serving (2/2)

Hyogeun Oh (오효근) — Thu, 19 Jun 2025 16:41:34 +0000

Introduction

In the previous article, I explored why vLLM is gaining popularity and the process of setting up an OpenAI-compatible server when using vllm serve.
While the first article focused on the architectural foundations and server initialization process, in this article, I want to dive deeper into the runtime behavior and request processing pipeline.

The /v1/chat/completions endpoint has become the de facto standard for conversational AI applications, powering everything from customer service chatbots to sophisticated AI assistants.
Unlike the legacy /v1/completions endpoint, which operates on simple text completion, the chat completions endpoint provides structured message handling, role-based conversations, and built-in context management.

Through this deep dive, I'll walk you through:

Endpoint Comparison: Detailed comparison between /v1/completions and /v1/chat/completions
Request Processing: Step-by-step breakdown of how chat messages are preprocessed and transformed
Chat Template System: How vLLM applies model-specific chat templates to structure conversations
Internal Pipeline: Deep dive into the inference process, from message parsing to response generation
Performance Considerations: Understanding token efficiency and memory management in chat contexts

By examining vLLM's implementation of the OpenAI-compatible chat completions endpoint, I'll uncover the sophisticated engineering that enables high-performance conversational AI serving while maintaining full API compatibility.

Theoretical Background

`/v1/completions` vs. `/v1/chat/completions`

As seen in the previous article, the OpenAI-compatible server provides the two endpoints shown below.

$ vllm serve Qwen/Qwen3-0.6B --max-model-len 8192
...
INFO 06-09 23:16:17 [launcher.py:36] Route: /v1/chat/completions, Methods: POST
INFO 06-09 23:16:17 [launcher.py:36] Route: /v1/completions, Methods: POST
...

Let me walk you through the differences between these two endpoints.

Aspect	`/v1/completions` [1]	`/v1/chat/completions` [2]
Purpose	Text Completion	Conversational Chat
Input Format	Single string (`prompt`)	Array of messages (`messages`)
Message Structure	`{"prompt": "Hello, World!"}`	`{"messages": [{"role": "user", "content": "Hello, World!"}]}`
Role Support	None (plain text)	`system`, `user`, `assistant`, etc.
Context Management	Manual inclusion in prompt	Structured message history (resent by the client on each request)
Conversation Continuity	Requires manual implementation	Supported through the `messages` array
Response Format	`choices[].text`	`choices[].message.content`
Use Cases	- Code generation - Text completion - One-shot tasks	- Chatbots - Conversational assistants - Multi-turn dialogues
Token Efficiency	Full context resent as one string	Full history resent as messages; the chat template adds a few tokens per turn
Legacy Status	Legacy (not recommended)	Currently recommended approach

As officially documented by OpenAI, /v1/completions is legacy and not recommended.
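
To make the table's input and response format rows concrete, here is a minimal sketch that builds a request body for each endpoint and reads the generated text back out of either response shape. The helper names (`build_completion_payload`, `build_chat_payload`, `extract_text`) are illustrative only and are not part of vLLM or any OpenAI client.

```python
from typing import Any, Optional


def build_completion_payload(prompt: str) -> dict[str, Any]:
    """Body for /v1/completions: a single prompt string."""
    return {"prompt": prompt}


def build_chat_payload(user_text: str, system_text: Optional[str] = None) -> dict[str, Any]:
    """Body for /v1/chat/completions: a list of role-tagged messages."""
    messages = []
    if system_text is not None:
        messages.append({"role": "system", "content": system_text})
    messages.append({"role": "user", "content": user_text})
    return {"messages": messages}


def extract_text(response: dict[str, Any]) -> str:
    """Read the first choice from either response shape."""
    choice = response["choices"][0]
    if response.get("object") == "text_completion":
        return choice["text"]              # choices[].text
    return choice["message"]["content"]    # choices[].message.content


print(build_completion_payload("Hello, World!"))
print(build_chat_payload("Hello, World!"))
```

The two payloads printed here are the same bodies sent with curl in the tests that follow.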

Let's test them in practice and compare the output and logs provided by vLLM.

$ curl http://localhost:8000/v1/completions \
 -H "Content-Type: application/json" \
 -d '{"prompt": "Hello, World!"}' | jq

INFO 06-16 21:27:19 [logger.py:43] Received request cmpl-bc9fa340e282468eb41d47ea9db57bfd-0: prompt: 'Hello, World!', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=16, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: [9707, 11, 4337, 0], prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 06-16 21:27:19 [engine.py:317] Added request cmpl-bc9fa340e282468eb41d47ea9db57bfd-0.
INFO:     127.0.0.1:59189 - "POST /v1/completions HTTP/1.1" 200 OK

From the logs, we can see that /v1/completions feeds the sentence from the "prompt" directly to the LLM.

{
  "id": "cmpl-bc9fa340e282468eb41d47ea9db57bfd",
  "object": "text_completion",
  "created": 1750076839,
  "model": "Qwen/Qwen3-0.6B",
  "choices": [
    {
      "index": 0,
      "text": " My name is Alex. I am a software engineer with a passion for coding and",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 4,
    "total_tokens": 20,
    "completion_tokens": 16,
    "prompt_tokens_details": null
  },
  "kv_transfer_params": null
}

As a result, it responds with an extended sentence based on the input "prompt", rather than a chat-style response.

$ curl http://localhost:8000/v1/chat/completions \
 -H "Content-Type: application/json" \
 -d '{"messages": [{"role": "user", "content": "Hello, World!"}]}' | jq

INFO 06-16 21:29:16 [logger.py:43] Received request chatcmpl-dab79c6ebcb24ff58b4e032f6f83b888: prompt: '<|im_start|>user\nHello, World!<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=8180, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 06-16 21:29:16 [engine.py:317] Added request chatcmpl-dab79c6ebcb24ff58b4e032f6f83b888.
INFO:     127.0.0.1:59198 - "POST /v1/chat/completions HTTP/1.1" 200 OK

In contrast, as the server log above shows, /v1/chat/completions applies the model's chat template to the input messages and feeds the rendered prompt to the LLM.

{
  "id": "chatcmpl-dab79c6ebcb24ff58b4e032f6f83b888",
  "object": "chat.completion",
  "created": 1750076956,
  "model": "Qwen/Qwen3-0.6B",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": null,
        "content": "<think>\nOkay, the user said \"Hello, World!\" and I need to respond. First, I should acknowledge their message. Since it's a simple greeting, a straightforward response is best. I can say \"Hello, World!\" as well, but maybe add a friendly note to keep it engaging. Let me check if there's any context I'm missing, but the message is pretty basic. Just a greeting. Alright, I'll respond with a friendly message to reinforce the exchange.\n</think>\n\nHello, World! 😊 What's interesting about you?",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "total_tokens": 125,
    "completion_tokens": 113,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "kv_transfer_params": null
}

As a result, the response appears in chat format.
By default, vLLM uses the chat_template defined in the model's tokenizer_config.json; a different template can be supplied with the --chat-template option.

Qwen/Qwen3-0.6B/tokenizer_config.json


...
  "chat_template": "{%- if tools %}\n    {{- '<|im_start|>system\\n' }}\n    {%- if messages[0].role == 'system' %}\n        {{- messages[0].content + '\\n\\n' }}\n    {%- endif %}\n    {{- \"# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n    {%- for tool in tools %}\n        {{- \"\\n\" }}\n        {{- tool | tojson }}\n    {%- endfor %}\n    {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n    {%- if messages[0].role == 'system' %}\n        {{- '<|im_start|>system\\n' + messages[0].content + '<|im_end|>\\n' }}\n    {%- endif %}\n{%- endif %}\n{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}\n{%- for message in messages[::-1] %}\n    {%- set index = (messages|length - 1) - loop.index0 %}\n    {%- if ns.multi_step_tool and message.role == \"user\" and message.content is string and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}\n        {%- set ns.multi_step_tool = false %}\n        {%- set ns.last_query_index = index %}\n    {%- endif %}\n{%- endfor %}\n{%- for message in messages %}\n    {%- if message.content is string %}\n        {%- set content = message.content %}\n    {%- else %}\n        {%- set content = '' %}\n    {%- endif %}\n    {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) %}\n        {{- '<|im_start|>' + message.role + '\\n' + content + '<|im_end|>' + '\\n' }}\n    {%- elif message.role == \"assistant\" %}\n        {%- set reasoning_content = '' %}\n        {%- if message.reasoning_content is string %}\n            {%- set reasoning_content = message.reasoning_content %}\n  
      {%- else %}\n            {%- if '</think>' in content %}\n                {%- set reasoning_content = content.split('</think>')[0].rstrip('\\n').split('<think>')[-1].lstrip('\\n') %}\n                {%- set content = content.split('</think>')[-1].lstrip('\\n') %}\n            {%- endif %}\n        {%- endif %}\n        {%- if loop.index0 > ns.last_query_index %}\n            {%- if loop.last or (not loop.last and reasoning_content) %}\n                {{- '<|im_start|>' + message.role + '\\n<think>\\n' + reasoning_content.strip('\\n') + '\\n</think>\\n\\n' + content.lstrip('\\n') }}\n            {%- else %}\n                {{- '<|im_start|>' + message.role + '\\n' + content }}\n            {%- endif %}\n        {%- else %}\n            {{- '<|im_start|>' + message.role + '\\n' + content }}\n        {%- endif %}\n        {%- if message.tool_calls %}\n            {%- for tool_call in message.tool_calls %}\n                {%- if (loop.first and content) or (not loop.first) %}\n                    {{- '\\n' }}\n                {%- endif %}\n                {%- if tool_call.function %}\n                    {%- set tool_call = tool_call.function %}\n                {%- endif %}\n                {{- '<tool_call>\\n{\"name\": \"' }}\n                {{- tool_call.name }}\n                {{- '\", \"arguments\": ' }}\n                {%- if tool_call.arguments is string %}\n                    {{- tool_call.arguments }}\n                {%- else %}\n                    {{- tool_call.arguments | tojson }}\n                {%- endif %}\n                {{- '}\\n</tool_call>' }}\n            {%- endfor %}\n        {%- endif %}\n        {{- '<|im_end|>\\n' }}\n    {%- elif message.role == \"tool\" %}\n        {%- if loop.first or (messages[loop.index0 - 1].role != \"tool\") %}\n            {{- '<|im_start|>user' }}\n        {%- endif %}\n        {{- '\\n<tool_response>\\n' }}\n        {{- content }}\n        {{- '\\n</tool_response>' }}\n        {%- if loop.last or 
(messages[loop.index0 + 1].role != \"tool\") %}\n            {{- '<|im_end|>\\n' }}\n        {%- endif %}\n    {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n    {{- '<|im_start|>assistant\\n' }}\n    {%- if enable_thinking is defined and enable_thinking is false %}\n        {{- '<think>\\n\\n</think>\\n\\n' }}\n    {%- endif %}\n{%- endif %}",
...
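
The assistant branch of the template above separates reasoning from the final answer by splitting on `</think>`. The same string operations can be written in Python, which may also be handy on the client side when the response carries the `<think>` block inside `content`, as in the example response earlier. This is a rough sketch; `split_reasoning` is a hypothetical helper name, not a vLLM or transformers function.

```python
def split_reasoning(content: str) -> tuple[str, str]:
    """Mirror the template's '</think>' split: return (reasoning, answer)."""
    if "</think>" not in content:
        return "", content
    # Text before the closing tag, minus the opening tag and stray newlines.
    reasoning = content.split("</think>")[0].rstrip("\n").split("<think>")[-1].lstrip("\n")
    # Text after the closing tag is the visible answer.
    answer = content.split("</think>")[-1].lstrip("\n")
    return reasoning, answer


reasoning, answer = split_reasoning("<think>\nplan\n</think>\n\nHello, World!")
print(repr(reasoning), repr(answer))
```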

Chat template testing can be performed as follows:

>>> import transformers
>>> tokenizer=transformers.AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
>>> messages = [
...   { "role": "system", "content": "You are a helpful assistant." },
...   { "role": "user", "content": "What is the capital of France?" },
...   { "role": "assistant", "content": "The capital of France is Paris." },
...   { "role": "user", "content": "Tell me more about it." }
... ]
>>> print(tokenizer.apply_chat_template(messages, tokenize=False))
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is the capital of France?<|im_end|>
<|im_start|>assistant
The capital of France is Paris.<|im_end|>
<|im_start|>user
Tell me more about it.<|im_end|>
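
For plain text messages without tools, the rendered output above follows a simple ChatML pattern. The sketch below reproduces only that subset in pure Python to show what the template does. `render_chatml` is a hypothetical helper, and it ignores the tool-calling and `<think>` handling of the real Jinja template. Passing `add_generation_prompt=True` appends the assistant header that appears in the server log earlier.

```python
def render_chatml(messages: list[dict[str, str]], add_generation_prompt: bool = False) -> str:
    """Render text-only messages in the ChatML layout used by Qwen3.

    Simplified: no tools, no reasoning content, no multimodal parts.
    """
    parts = []
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml(
    [{"role": "user", "content": "Hello, World!"}],
    add_generation_prompt=True,
)
print(repr(prompt))
```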

Request/Response Schema of `/v1/chat/completions`

Now that I've covered the fundamental differences between the endpoints, let me examine the detailed structure of the /v1/chat/completions request and response schemas.
Understanding these schemas is crucial for effective API integration and troubleshooting, as they define the contract between client applications and vLLM's serving infrastructure.

My analysis here is based on vLLM's source code implementation, providing insights into both OpenAI-compatible fields and vLLM-specific extensions that enhance functionality beyond the standard API specification.

Request Schema

The ChatCompletionRequest class in vLLM implements the complete OpenAI Chat Completions API specification while adding several vLLM-specific extensions for advanced sampling and optimization features.

The schema is carefully organized to match the official OpenAI API documentation order, ensuring maximum compatibility with existing OpenAI client libraries and tools.

vllm/entrypoints/openai/protocol.py


...
class ChatCompletionRequest(OpenAIBaseModel):
    # Ordered by official OpenAI API documentation
    # https://platform.openai.com/docs/api-reference/chat/create
    messages: list[ChatCompletionMessageParam]
    model: Optional[str] = None
    frequency_penalty: Optional[float] = 0.0
    logit_bias: Optional[dict[str, float]] = None
    logprobs: Optional[bool] = False
    top_logprobs: Optional[int] = 0
    # TODO(#9845): remove max_tokens when field is removed from OpenAI API
    max_tokens: Optional[int] = Field(
        default=None,
        deprecated=
        'max_tokens is deprecated in favor of the max_completion_tokens field')
    max_completion_tokens: Optional[int] = None
    n: Optional[int] = 1
    presence_penalty: Optional[float] = 0.0
    response_format: Optional[AnyResponseFormat] = None
    seed: Optional[int] = Field(None, ge=_LONG_INFO.min, le=_LONG_INFO.max)
    stop: Optional[Union[str, list[str]]] = Field(default_factory=list)
    stream: Optional[bool] = False
    stream_options: Optional[StreamOptions] = None
    temperature: Optional[float] = None
    top_p: Optional[float] = None
    tools: Optional[list[ChatCompletionToolsParam]] = None
    tool_choice: Optional[Union[
        Literal["none"],
        Literal["auto"],
        Literal["required"],
        ChatCompletionNamedToolChoiceParam,
    ]] = "none"

    # NOTE this will be ignored by vLLM -- the model determines the behavior
    parallel_tool_calls: Optional[bool] = False
    user: Optional[str] = None

    # --8<-- [start:chat-completion-sampling-params]
    best_of: Optional[int] = None
    use_beam_search: bool = False
    top_k: Optional[int] = None
    min_p: Optional[float] = None
    repetition_penalty: Optional[float] = None
    length_penalty: float = 1.0
    stop_token_ids: Optional[list[int]] = Field(default_factory=list)
    include_stop_str_in_output: bool = False
    ignore_eos: bool = False
    min_tokens: int = 0
    skip_special_tokens: bool = True
    spaces_between_special_tokens: bool = True
    truncate_prompt_tokens: Optional[Annotated[int, Field(ge=1)]] = None
    prompt_logprobs: Optional[int] = None
    # --8<-- [end:chat-completion-sampling-params]
...

Field	Type	Required	Default	Description
`messages`	`list[ChatCompletionMessageParam]`	✅	-	Array of conversation messages
`model`	`Optional[str]`	❌	`None`	Model name to use (made optional in vllm-project/vllm#13568)
`frequency_penalty`	`Optional[float]`	❌	`0.0`	Frequency-based token penalty (-2.0 ~ 2.0)
`logit_bias`	`Optional[dict[str, float]]`	❌	`None`	Bias for specific tokens' logits
`logprobs`	`Optional[bool]`	❌	`False`	Whether to return log probabilities
`top_logprobs`	`Optional[int]`	❌	`0`	Number of top log probabilities to return (0-20)
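`max_tokens`	`Optional[int]`	❌	`None`	Maximum number of tokens to generate (deprecated in favor of `max_completion_tokens`)
`max_completion_tokens`	`Optional[int]`	❌	`None`	Maximum number of tokens to generate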
`n`	`Optional[int]`	❌	`1`	Number of completions to generate
`presence_penalty`	`Optional[float]`	❌	`0.0`	Presence-based token penalty (-2.0 ~ 2.0)
`response_format`	`Optional[AnyResponseFormat]`	❌	`None`	Response format specification (JSON mode)
`seed`	`Optional[int]`	❌	`None`	Seed for reproducible output
`stop`	`Optional[Union[str, list[str]]]`	❌	`[]`	Stop strings for generation
`stream`	`Optional[bool]`	❌	`False`	Whether to stream responses
`temperature`	`Optional[float]`	❌	`None`	Sampling temperature (0.0 ~ 2.0)
`top_p`	`Optional[float]`	❌	`None`	Nucleus sampling probability
`tools`	`Optional[list[ChatCompletionToolsParam]]`	❌	`None`	Function call tool definitions
`tool_choice`	`Optional[Union[Literal, NamedToolChoice]]`	❌	`"none"`	Tool selection strategy
`user`	`Optional[str]`	❌	`None`	User identifier
`best_of`	`Optional[int]`	❌	`None`	Number of generations to select best from
`use_beam_search`	`bool`	❌	`False`	Whether to use beam search
`top_k`	`Optional[int]`	❌	`None`	Consider only top k tokens
`min_p`	`Optional[float]`	❌	`None`	Minimum probability threshold
`repetition_penalty`	`Optional[float]`	❌	`None`	Repetition penalty
`min_tokens`	`int`	❌	`0`	Minimum number of tokens to generate
`skip_special_tokens`	`bool`	❌	`True`	Whether to skip special tokens in output
`spaces_between_special_tokens`	`bool`	❌	`True`	Whether to add spaces between special tokens
`truncate_prompt_tokens`	`Optional[int]`	❌	`None`	Truncate prompt to specified token count
`prompt_logprobs`	`Optional[int]`	❌	`None`	Number of prompt log probabilities to return
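
The chat request in the earlier server log shows `max_tokens=8180` even though the client sent no limit: with `--max-model-len 8192` and a 12-token prompt, 8192 - 12 = 8180 tokens remain. The sketch below captures that arithmetic. It assumes that `max_completion_tokens` takes precedence over the deprecated `max_tokens` and that a request exceeding the remaining context is rejected; `resolve_max_tokens` is a hypothetical helper, and vLLM's actual logic may differ in detail.

```python
from typing import Optional


def resolve_max_tokens(
    max_model_len: int,
    prompt_len: int,
    max_completion_tokens: Optional[int] = None,
    max_tokens: Optional[int] = None,  # deprecated alias
) -> int:
    """Pick the generation budget for one request (illustrative only)."""
    remaining = max_model_len - prompt_len
    if remaining <= 0:
        raise ValueError("prompt does not fit in the model context")
    # Assumption: prefer the new field, fall back to the deprecated one.
    requested = max_completion_tokens if max_completion_tokens is not None else max_tokens
    if requested is None:
        return remaining
    if requested > remaining:
        # Assumption: reject rather than silently shrink the budget.
        raise ValueError(f"requested {requested} tokens but only {remaining} fit")
    return requested


print(resolve_max_tokens(max_model_len=8192, prompt_len=12))  # 8180
```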

Message Object

The message object structure supports both simple text conversations and complex multimodal interactions. vLLM extends the standard OpenAI message format to support custom roles and enhanced tool integration.

vllm/entrypoints/chat_utils.py


...
class CustomChatCompletionMessageParam(TypedDict, total=False):
    """Enables custom roles in the Chat Completion API."""
    role: Required[str]
    """The role of the message's author."""

    content: Union[str, list[ChatCompletionContentPartParam]]
    """The contents of the message."""

    name: str
    """An optional name for the participant.

    Provides the model information to differentiate between participants of the
    same role.
    """

    tool_call_id: Optional[str]
    """Tool call that this message is responding to."""

    tool_calls: Optional[Iterable[ChatCompletionMessageToolCallParam]]
    """The tool calls generated by the model, such as function calls."""


ChatCompletionMessageParam = Union[OpenAIChatCompletionMessageParam,
                                   CustomChatCompletionMessageParam]
...

Field	Type	Required	Description
`role`	`Required[str]`	✅	Message role: `system`, `user`, `assistant`, `tool`
`content`	`Union[str, list[ChatCompletionContentPartParam]]`	❌	Message content (text or multimodal array); not marked `Required` in the TypedDict, although most roles need it in practice
`name`	`str`	❌	Message author name
`tool_call_id`	`Optional[str]`	❌	Tool call ID (required when role is `tool`)
`tool_calls`	`Optional[Iterable[ChatCompletionMessageToolCallParam]]`	❌	Assistant's tool calls
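
Here is a short sketch of the message shapes the table describes: plain text, multimodal content parts, and a tool result. The `check_message` helper is hypothetical and only encodes the rules stated above (`role` is required, `tool` messages carry a `tool_call_id`). The URL and the call ID are placeholders.

```python
from typing import Any

messages: list[dict[str, Any]] = [
    {"role": "system", "content": "You are a helpful assistant."},
    # Multimodal content is a list of typed parts instead of a string.
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
        ],
    },
    # A tool result refers back to the assistant's tool call by ID.
    {"role": "tool", "tool_call_id": "call_0", "content": '{"label": "cat"}'},
]


def check_message(message: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one message (empty if none)."""
    problems = []
    if not isinstance(message.get("role"), str):
        problems.append("role must be a string")
    content = message.get("content")
    if content is not None and not isinstance(content, (str, list)):
        problems.append("content must be a string or a list of parts")
    if message.get("role") == "tool" and not message.get("tool_call_id"):
        problems.append("tool messages need tool_call_id")
    return problems


for m in messages:
    print(m["role"], check_message(m))
```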

Response Schema

The response schema follows the OpenAI specification closely while adding vLLM-specific fields such as `prompt_logprobs` and `kv_transfer_params`.

vllm/entrypoints/openai/protocol.py


...
class ChatCompletionResponse(OpenAIBaseModel):
    id: str = Field(default_factory=lambda: f"chatcmpl-{random_uuid()}")
    object: Literal["chat.completion"] = "chat.completion"
    created: int = Field(default_factory=lambda: int(time.time()))
    model: str
    choices: list[ChatCompletionResponseChoice]
    usage: UsageInfo
    prompt_logprobs: Optional[list[Optional[dict[int, Logprob]]]] = None
    kv_transfer_params: Optional[dict[str, Any]] = Field(
        default=None, description="KVTransfer parameters.")
...

Field	Type	Description
`id`	`str`	Unique identifier for the completion request
`object`	`Literal["chat.completion"]`	Object type, always `chat.completion` for this class (streamed chunks use `chat.completion.chunk`)
`created`	`int`	Creation time represented as Unix timestamp
`model`	`str`	Model name used
`choices`	`list[ChatCompletionResponseChoice]`	Array of generated completion choices
`usage`	`UsageInfo`	Token usage information
`prompt_logprobs`	`Optional[list[Optional[dict[int, Logprob]]]]`	Prompt log probability information
`kv_transfer_params`	`Optional[dict[str, Any]]`	KVTransfer parameters

Choice Object

Each choice represents a single completion generated by the model. The choice object contains the actual generated content along with metadata about the generation process.

vllm/entrypoints/openai/protocol.py


...
class ChatCompletionResponseChoice(OpenAIBaseModel):
    index: int
    message: ChatMessage
    logprobs: Optional[ChatCompletionLogProbs] = None
    # per OpenAI spec this is the default
    finish_reason: Optional[str] = "stop"
    # not part of the OpenAI spec but included in vLLM for legacy reasons
    stop_reason: Optional[Union[int, str]] = None
...

vllm/entrypoints/openai/protocol.py


...
class ChatMessage(OpenAIBaseModel):
    role: str
    reasoning_content: Optional[str] = None
    content: Optional[str] = None
    tool_calls: list[ToolCall] = Field(default_factory=list)
...

Field	Type	Description
`index`	`int`	Index of the choice
`message`	`ChatMessage`	Message generated by the assistant
`logprobs`	`Optional[ChatCompletionLogProbs]`	Log probability information
`finish_reason`	`Optional[str]`	Completion termination reason: `stop`, `length`, `function_call`, `content_filter`, `tool_calls`
`stop_reason`	`Optional[Union[int, str]]`	vLLM-specific field kept for legacy reasons (outside the OpenAI spec): the stop string or stop token ID that ended generation, otherwise `None`

Usage Object

The usage object provides detailed token consumption metrics, essential for billing, monitoring, and optimization purposes.

vllm/entrypoints/openai/protocol.py


class UsageInfo(OpenAIBaseModel):
    prompt_tokens: int = 0
    total_tokens: int = 0
    completion_tokens: Optional[int] = 0
    prompt_tokens_details: Optional[PromptTokenUsageInfo] = None

Field	Type	Description
`prompt_tokens`	`int`	Number of tokens used in prompt
`total_tokens`	`int`	Total tokens (prompt + completion)
`completion_tokens`	`Optional[int]`	Number of tokens generated in completion
`prompt_tokens_details`	`Optional[PromptTokenUsageInfo]`	Detailed prompt token usage information
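
As a sanity check on these fields, the sketch below summarizes a response dict: it verifies that `total_tokens` equals `prompt_tokens + completion_tokens` and flags generations that were cut off by the token limit (`finish_reason == "length"`). `summarize` is a hypothetical helper, and the sample values are copied from the two responses shown earlier.

```python
from typing import Any


def summarize(response: dict[str, Any]) -> dict[str, Any]:
    """Collect usage numbers and truncation status from a response dict."""
    usage = response["usage"]
    completion = usage.get("completion_tokens") or 0  # the field is Optional
    return {
        "prompt_tokens": usage["prompt_tokens"],
        "completion_tokens": completion,
        "usage_consistent": usage["total_tokens"] == usage["prompt_tokens"] + completion,
        "truncated": any(c.get("finish_reason") == "length" for c in response["choices"]),
    }


chat_response = {
    "choices": [{"index": 0, "finish_reason": "stop"}],
    "usage": {"prompt_tokens": 12, "total_tokens": 125, "completion_tokens": 113},
}
completion_response = {
    "choices": [{"index": 0, "finish_reason": "length"}],
    "usage": {"prompt_tokens": 4, "total_tokens": 20, "completion_tokens": 16},
}
print(summarize(chat_response))
print(summarize(completion_response))
```

The /v1/completions example is reported as truncated because it stopped at its 16-token limit, while the chat example ended with `stop`.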

Router

vLLM's OpenAI-compatible server is built on FastAPI, providing a robust and high-performance web framework for serving LLM requests.
When a user sends a POST request to /v1/chat/completions, FastAPI's routing system directs the request to the following function, which serves as the main entry point for chat completion requests.

vllm/entrypoints/openai/api_server.py


...
@router.post("/v1/chat/completions",
             dependencies=[Depends(validate_json_request)],
             responses={
                 HTTPStatus.OK.value: {
                     "content": {
                         "text/event-stream": {}
                     }
                 },
                 HTTPStatus.BAD_REQUEST.value: {
                     "model": ErrorResponse
                 },
                 HTTPStatus.NOT_FOUND.value: {
                     "model": ErrorResponse
                 },
                 HTTPStatus.INTERNAL_SERVER_ERROR.value: {
                     "model": ErrorResponse
                 }
             })
@with_cancellation
@load_aware_call
async def create_chat_completion(request: ChatCompletionRequest,
                                 raw_request: Request):
    handler = chat(raw_request)
    if handler is None:
        return base(raw_request).create_error_response(
            message="The model does not support Chat Completions API")

    generator = await handler.create_chat_completion(request, raw_request)

    if isinstance(generator, ErrorResponse):
        return JSONResponse(content=generator.model_dump(),
                            status_code=generator.code)

    elif isinstance(generator, ChatCompletionResponse):
        return JSONResponse(content=generator.model_dump())

    return StreamingResponse(content=generator, media_type="text/event-stream")
...
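
The tail of the router is a three-way dispatch on the handler's return type. The sketch below mirrors that branch with stand-in classes so that it runs without FastAPI. `ErrorResult`, `FullResult`, and `dispatch` are hypothetical names, not vLLM types.

```python
from dataclasses import dataclass
from typing import AsyncGenerator, Union


@dataclass
class ErrorResult:  # stand-in for ErrorResponse
    message: str
    code: int


@dataclass
class FullResult:  # stand-in for ChatCompletionResponse
    text: str


def dispatch(result: Union[ErrorResult, FullResult, AsyncGenerator[str, None]]) -> tuple[str, int]:
    """Return (response kind, HTTP status) the way the router branches."""
    if isinstance(result, ErrorResult):
        return "json", result.code       # error body with its own status code
    if isinstance(result, FullResult):
        return "json", 200               # complete, non-streaming answer
    return "text/event-stream", 200      # anything else is streamed


async def fake_stream() -> AsyncGenerator[str, None]:
    yield "data: {}\n\n"


print(dispatch(ErrorResult("model not found", 404)))
print(dispatch(FullResult("Hello")))
print(dispatch(fake_stream()))
```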

Request Handler

I can see that the handler is defined through the chat() function.
This function retrieves the openai_serving_chat instance that was registered in the app.state during server initialization, as shown below.

vllm/entrypoints/openai/api_server.py


...
def chat(request: Request) -> Optional[OpenAIServingChat]:
    return request.app.state.openai_serving_chat
...

Starlette Request Object

The Request object is a class included in the Starlette framework, and it inherits the app property from its parent class HTTPConnection.
This design provides access to the application state and configuration throughout the request lifecycle.

starlette/requests.py


...
class Request(HTTPConnection):
...

The app property provides access to the FastAPI application instance, while scope contains ASGI (Asynchronous Server Gateway Interface) information about the current request.
This architecture follows the ASGI specification, enabling efficient handling of asynchronous web requests.

starlette/requests.py


...
class HTTPConnection(Mapping[str, Any]):
    """
    A base class for incoming HTTP connections, that is used to provide
    any functionality that is common to both `Request` and `WebSocket`.
    """
...
    @property
    def app(self) -> Any:
        return self.scope["app"]
...
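
To see why `request.app.state.openai_serving_chat` resolves, here is a dependency-free sketch of the lookup chain. `MiniState`, `MiniApp`, and `MiniConnection` are simplified stand-ins written for illustration, not Starlette's classes; only the `scope["app"]` lookup is taken from the source above.

```python
from typing import Any


class MiniState:
    """Attribute bag, similar in spirit to an application state object."""


class MiniApp:
    def __init__(self) -> None:
        self.state = MiniState()


class MiniConnection:
    def __init__(self, scope: dict[str, Any]) -> None:
        self.scope = scope

    @property
    def app(self) -> Any:
        # Same lookup as HTTPConnection.app: the application lives in the scope.
        return self.scope["app"]


app = MiniApp()
app.state.openai_serving_chat = "handler-instance"  # set once at startup

request = MiniConnection(scope={"type": "http", "app": app})
print(request.app.state.openai_serving_chat)
```

Every request carries a reference to the same application object in its scope, so anything stored on the state at startup is reachable from any handler.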

Application State Initialization

Looking at the initialization of state.openai_serving_chat, it occurs in the init_app_state() function as follows.
This initialization happens during server startup, ensuring that all necessary components are ready before handling incoming requests.

vllm/entrypoints/openai/api_server.py


...
async def init_app_state(
    engine_client: EngineClient,
    vllm_config: VllmConfig,
    state: State,
    args: Namespace,
) -> None:
...
    state.openai_serving_chat = OpenAIServingChat(
        engine_client,
        model_config,
        state.openai_serving_models,
        args.response_role,
        request_logger=request_logger,
        chat_template=resolved_chat_template,
        chat_template_content_format=args.chat_template_content_format,
        return_tokens_as_token_ids=args.return_tokens_as_token_ids,
        enable_auto_tools=args.enable_auto_tool_choice,
        tool_parser=args.tool_call_parser,
        reasoning_parser=args.reasoning_parser,
        enable_prompt_tokens_details=args.enable_prompt_tokens_details,
    ) if model_config.runner_type == "generate" else None
...
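
Because the handler is only created when `model_config.runner_type == "generate"`, `chat()` can return `None`, which is what the router's early error branch guards against. A minimal sketch of that interaction follows, using hypothetical stand-ins (`ChatHandler`, `make_chat_handler`, `handle`) rather than vLLM's classes. The non-generate runner type in the example is only an illustrative value.

```python
from typing import Optional


class ChatHandler:
    def run(self) -> str:
        return "chat completion"


def make_chat_handler(runner_type: str) -> Optional[ChatHandler]:
    """Create the handler only for generative models, else None."""
    return ChatHandler() if runner_type == "generate" else None


def handle(handler: Optional[ChatHandler]) -> str:
    if handler is None:
        # Mirrors the router's early return for unsupported models.
        return "The model does not support Chat Completions API"
    return handler.run()


print(handle(make_chat_handler("generate")))
print(handle(make_chat_handler("pooling")))
```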

Testing `app.state`

The app.state mechanism can be tested with the following example.
This demonstrates how FastAPI's application state works in practice and how components are shared across request handlers.

from random import random
from typing import Optional

import uvicorn
import uvloop
from fastapi import FastAPI, Request
from fastapi.datastructures import State
from loguru import logger
from pydantic import BaseModel

app = FastAPI()


class OpenAIServingChat:
    def __init__(self) -> None:
        logger.info("Init: OpenAIServingChat")

    def create_chat_completion(self, *args, **kwargs) -> float:
        logger.info("Run: OpenAIServingChat.create_chat_completion")
        return random()


async def init_app_state(state: State):
    state.openai_serving_chat = OpenAIServingChat()


def chat(request: Request) -> Optional[OpenAIServingChat]:
    return request.app.state.openai_serving_chat


class ChatCompletionRequest(BaseModel):
    id: int


@app.post("/v1/chat/completions")
async def create_chat_completion(request: ChatCompletionRequest, raw_request: Request):
    handler = chat(raw_request)
    logger.info(f"{raw_request=}")
    return {"id": request.id, "chat_completion": handler.create_chat_completion()}


async def main():
    await init_app_state(app.state)
    config = uvicorn.Config(app, host="0.0.0.0", port=8000)
    server = uvicorn.Server(config)
    await server.serve()


if __name__ == "__main__":
    uvloop.run(main())

$ curl -X 'POST' \
 'http://localhost:8000/v1/chat/completions' \
 -H 'accept: application/json' \
 -H 'Content-Type: application/json' \
 -d '{
  "id": 0
}' | jq
{
  "id": 0,
  "chat_completion": 0.7867811845314955
}

Examining the server logs reveals the initialization sequence: the OpenAIServingChat instance is initialized before FastAPI starts running.
When a request arrives, the handler is retrieved from request.app.state.openai_serving_chat and executed.

This pattern demonstrates FastAPI's application lifecycle management, where:

Initialization Phase: Critical components are set up during server startup
Request Phase: Pre-initialized components are accessed through the application state
Processing Phase: The actual request handling occurs with the retrieved handler

2025-06-16 23:38:46.972 | INFO     | __main__:__init__:16 - Init: OpenAIServingChat
INFO:     Started server process [52024]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
2025-06-16 23:38:49.021 | INFO     | __main__:create_chat_completion:38 - raw_request=<starlette.requests.Request object at 0x105a80a50>
2025-06-16 23:38:49.021 | INFO     | __main__:create_chat_completion:19 - Run: OpenAIServingChat.create_chat_completion
INFO:     127.0.0.1:61279 - "POST /v1/chat/completions HTTP/1.1" 200 OK

Chat Completion Processing Pipeline

As I observed in the router's create_chat_completion() function above, all preprocessing, LLM inference, and postprocessing for /v1/chat/completions requests are performed within the following method.

vllm/entrypoints/openai/serving_chat.py


...
class OpenAIServingChat(OpenAIServing):
...
    async def create_chat_completion(
        self,
        request: ChatCompletionRequest,
        raw_request: Optional[Request] = None,
    ) -> Union[AsyncGenerator[str, None], ChatCompletionResponse,
               ErrorResponse]:
        """
        Chat Completion API similar to OpenAI's API.

        See https://platform.openai.com/docs/api-reference/chat/create
        for the API specification. This API mimics the OpenAI
        Chat Completion API.
        """
...

How does the complete processing flow work?
Let's examine the step-by-step process:

Model Validation: The OpenAIServing._check_model() method validates that the request's "model" name is correctly configured.

vllm/entrypoints/openai/serving_chat.py


...
class OpenAIServingChat(OpenAIServing):
...
    async def create_chat_completion(
...
        error_check_ret = await self._check_model(request)
        if error_check_ret is not None:
            logger.error("Error with model %s", error_check_ret)
            return error_check_ret
...

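Engine Health Check: The AsyncLLM(EngineClient).errored property reports whether the engine has died. If it has, the method raises the engine's dead_error instead of continuing, which matters for streaming because a success status is returned before any text is generated.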

vllm/entrypoints/openai/serving_chat.py


...
class OpenAIServingChat(OpenAIServing):
...
    async def create_chat_completion(
...
        # If the engine is dead, raise the engine's DEAD_ERROR.
        # This is required for the streaming case, where we return a
        # success status before we actually start generating text :).
        if self.engine_client.errored:
            raise self.engine_client.dead_error
...

Component Preparation: Prepares the adapter-related requests (lora_request for LoRA adapters and prompt_adapter_request for prompt adapters), along with model_name, tokenizer, and tool_parser.

vllm/entrypoints/openai/serving_chat.py


...
class OpenAIServingChat(OpenAIServing):
...
    async def create_chat_completion(
...
        try:
            (
                lora_request,
                prompt_adapter_request,
            ) = self._maybe_get_adapters(request)

            model_name = self._get_model_name(request.model, lora_request)

            tokenizer = await self.engine_client.get_tokenizer(lora_request)

            tool_parser = self.tool_parser
...

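Mistral Tokenizer Handling: As of v0.9.0.1, there are Pydantic-related issues with MistralTokenizer (vllm-project/vllm#9951, pydantic/pydantic#9467, pydantic/pydantic#9541), so requests using this tokenizer get the special handling shown below: the tool_calls field may be re-serialized, tool call IDs are truncated, and the request parameters are validated.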

vllm/entrypoints/openai/serving_chat.py


...
class OpenAIServingChat(OpenAIServing):
...
    async def create_chat_completion(
...
        try:
...
            if isinstance(tokenizer, MistralTokenizer):
                # because of issues with pydantic we need to potentially
                # re-serialize the tool_calls field of the request
                # for more info: see comment in `maybe_serialize_tool_calls`
                maybe_serialize_tool_calls(request)
                truncate_tool_call_ids(request)
                validate_request_params(request)
...

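Tool Configuration: When the request's tool_choice is "auto", the server must have been started with --enable-auto-tool-choice and --tool-call-parser (unless the tokenizer is a MistralTokenizer); otherwise an error response is returned. Independently of tool_choice, tool_dicts is built by dumping each entry of request.tools, or set to None when the request has no tools.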

vllm/entrypoints/openai/serving_chat.py


...
class OpenAIServingChat(OpenAIServing):
...
    async def create_chat_completion(
...
        try:
...
            if (request.tool_choice == "auto" and
                    not (self.enable_auto_tools and tool_parser is not None)
                    and not isinstance(tokenizer, MistralTokenizer)):
                # for hf tokenizers, "auto" tools requires
                # --enable-auto-tool-choice and --tool-call-parser
                return self.create_error_response(
                    "\"auto\" tool choice requires "
                    "--enable-auto-tool-choice and --tool-call-parser to be set"
                )

            tool_dicts = None if request.tools is None else [
                tool.model_dump() for tool in request.tools
            ]
...
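
To make the condition easier to read, here is a minimal sketch of the same check as a standalone function. The function name and its boolean parameters are hypothetical stand-ins for the attributes used in the excerpt above; this is not vLLM's API.

```python
from typing import Optional


def check_auto_tool_choice(
    tool_choice: str,
    enable_auto_tools: bool,
    has_tool_parser: bool,
    is_mistral_tokenizer: bool,
) -> Optional[str]:
    """Return an error message if "auto" tool choice is not usable, else None.

    Hypothetical helper mirroring the condition in the excerpt above.
    """
    if (tool_choice == "auto"
            and not (enable_auto_tools and has_tool_parser)
            and not is_mistral_tokenizer):
        return ('"auto" tool choice requires '
                "--enable-auto-tool-choice and --tool-call-parser to be set")
    return None


# "auto" without the server flags is rejected...
print(check_auto_tool_choice("auto", False, False, False))
# ...but is accepted once both flags are set, or for a Mistral tokenizer.
print(check_auto_tool_choice("auto", True, True, False))
print(check_auto_tool_choice("auto", False, False, True))
```

Only the "auto" value triggers the check, so any other tool_choice value passes through this step untouched.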

Request Preprocessing: Uses the OpenAIServingChat(OpenAIServing)._preprocess_chat() method to preprocess the request.

vllm/entrypoints/openai/serving_chat.py


...
class OpenAIServingChat(OpenAIServing):
...
    async def create_chat_completion(
...
        try:
...
            (
                conversation,
                request_prompts,
                engine_prompts,
            ) = await self._preprocess_chat(
                request,
                tokenizer,
                request.messages,
                chat_template=request.chat_template or self.chat_template,
                chat_template_content_format=self.chat_template_content_format,
                add_generation_prompt=request.add_generation_prompt,
                continue_final_message=request.continue_final_message,
                tool_dicts=tool_dicts,
                documents=request.documents,
                chat_template_kwargs=request.chat_template_kwargs,
                tool_parser=tool_parser,
                truncate_prompt_tokens=request.truncate_prompt_tokens,
                add_special_tokens=request.add_special_tokens,
            )
...

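Request ID Generation: Builds the request_id by prefixing "chatcmpl-" to the value returned by the OpenAIServingChat(OpenAIServing)._base_request_id() method, which receives the raw request and the request's own request_id. The ID is then wrapped in a RequestResponseMetadata object and, when a raw request is available, stored on raw_request.state.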

vllm/entrypoints/openai/serving_chat.py


...
class OpenAIServingChat(OpenAIServing):
...
    async def create_chat_completion(
...
        request_id = "chatcmpl-" \
                     f"{self._base_request_id(raw_request, request.request_id)}"

        request_metadata = RequestResponseMetadata(request_id=request_id)
        if raw_request:
            raw_request.state.request_metadata = request_metadata
...
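
The following is a minimal sketch of how such an ID could be assembled. It assumes the base ID prefers an X-Request-Id header and otherwise falls back to the supplied default or a random UUID. That fallback order and both helper names are my assumptions for illustration, not something shown in the excerpt above.

```python
import uuid
from typing import Mapping, Optional


def base_request_id(headers: Optional[Mapping[str, str]],
                    default: Optional[str] = None) -> str:
    """Pick a base ID: header value if present, else default, else random.

    Hypothetical helper; the lookup order is an assumption.
    """
    default = default or uuid.uuid4().hex
    if headers is None:
        return default
    return headers.get("X-Request-Id", default)


def chat_request_id(headers: Optional[Mapping[str, str]],
                    default: Optional[str] = None) -> str:
    """Prefix the base ID the same way the excerpt above does."""
    return f"chatcmpl-{base_request_id(headers, default)}"


print(chat_request_id({"X-Request-Id": "abc"}))  # header wins
print(chat_request_id(None, "xyz"))              # no raw request: use default
print(chat_request_id({}))                       # random fallback
```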

Sampling Parameters Setup: For the preprocessed engine_prompts, prepares BeamSearchParams if using beam search, or SamplingParams otherwise.

vllm/entrypoints/openai/serving_chat.py


...
class OpenAIServingChat(OpenAIServing):
...
    async def create_chat_completion(
...
        # Schedule the request and get the result generator.
        generators: list[AsyncGenerator[RequestOutput, None]] = []
        try:
            for i, engine_prompt in enumerate(engine_prompts):
                sampling_params: Union[SamplingParams, BeamSearchParams]
                default_max_tokens = self.max_model_len - len(
                    engine_prompt["prompt_token_ids"])
                if request.use_beam_search:
                    sampling_params = request.to_beam_search_params(
                        default_max_tokens, self.default_sampling_params)
                else:
                    sampling_params = request.to_sampling_params(
                        default_max_tokens,
                        self.model_config.logits_processor_pattern,
                        self.default_sampling_params)
...
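
To make the token budget concrete, here is a small sketch of the default_max_tokens arithmetic from the excerpt above. The helper name is hypothetical; only the subtraction itself comes from the source.

```python
def remaining_token_budget(max_model_len: int,
                           prompt_token_ids: list[int]) -> int:
    """Tokens left in the context window once the prompt is accounted for."""
    return max_model_len - len(prompt_token_ids)


# With --max-model-len 8192 and a 1,000-token prompt:
budget = remaining_token_budget(8192, [0] * 1000)
print(budget)
```

In other words, the longer the rendered chat prompt, the smaller the default ceiling for generated tokens.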

Request Logging: Uses OpenAIServingChat(OpenAIServing)._log_inputs() to log the request, then prepares trace_headers from the raw request's headers. Note, however, that trace_headers do not work when vllm serve runs on the V1 engine.

vllm/entrypoints/openai/serving_chat.py


...
class OpenAIServingChat(OpenAIServing):
...
    async def create_chat_completion(
...
        try:
            for i, engine_prompt in enumerate(engine_prompts):
...
                self._log_inputs(request_id,
                                 request_prompts[i],
                                 params=sampling_params,
                                 lora_request=lora_request,
                                 prompt_adapter_request=prompt_adapter_request)

                trace_headers = (None if raw_request is None else await
                                 self._get_trace_headers(raw_request.headers))
...

Inference Execution: For each preprocessed engine prompt, calls the EngineClient.beam_search() method if beam search is used, or the AsyncLLM(EngineClient).generate() method otherwise. Both return an async generator, which is appended to generators.

vllm/entrypoints/openai/serving_chat.py


...
class OpenAIServingChat(OpenAIServing):
...
    async def create_chat_completion(
...
        try:
            for i, engine_prompt in enumerate(engine_prompts):
...
                if isinstance(sampling_params, BeamSearchParams):
                    generator = self.engine_client.beam_search(
                        prompt=engine_prompt,
                        request_id=request_id,
                        params=sampling_params,
                    )
                else:
                    generator = self.engine_client.generate(
                        engine_prompt,
                        sampling_params,
                        request_id,
                        lora_request=lora_request,
                        trace_headers=trace_headers,
                        prompt_adapter_request=prompt_adapter_request,
                        priority=request.priority,
                    )

                generators.append(generator)
...

Response Generation: For streaming requests, uses the OpenAIServingChat(OpenAIServing).chat_completion_stream_generator() method; for non-streaming requests, uses the OpenAIServingChat(OpenAIServing).chat_completion_full_generator() method to generate the response.

vllm/entrypoints/openai/serving_chat.py


...
class OpenAIServingChat(OpenAIServing):
...
    async def create_chat_completion(
...
        assert len(generators) == 1
        result_generator, = generators

        # Streaming response
        if request.stream:
            return self.chat_completion_stream_generator(
                request, result_generator, request_id, model_name,
                conversation, tokenizer, request_metadata)

        try:
            return await self.chat_completion_full_generator(
                request, result_generator, request_id, model_name,
                conversation, tokenizer, request_metadata)
        except ValueError as e:
            # TODO: Use a vllm-specific Validation Error
            return self.create_error_response(str(e))
...
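
The difference between the two branches can be sketched with plain asyncio. Everything here is a toy: the fake generator, the function names, and the chunk format are assumptions for illustration, and the real methods build full OpenAI-style response objects.

```python
import asyncio
from typing import AsyncGenerator


async def fake_result_generator() -> AsyncGenerator[str, None]:
    """Stand-in for the engine's result generator; yields text deltas."""
    for piece in ["Hel", "lo", "!"]:
        yield piece


async def stream_chunks(results) -> AsyncGenerator[str, None]:
    """Streaming branch: forward each delta as a server-sent event."""
    async for piece in results:
        yield f"data: {piece}\n\n"
    yield "data: [DONE]\n\n"


async def full_response(results) -> str:
    """Non-streaming branch: drain the generator, then answer once."""
    return "".join([piece async for piece in results])


async def handle(stream: bool):
    results = fake_result_generator()
    if stream:
        # Returned without awaiting: the caller iterates it lazily.
        return stream_chunks(results)
    return await full_response(results)


async def demo():
    full = await handle(stream=False)
    chunks = [chunk async for chunk in await handle(stream=True)]
    return full, chunks


full, chunks = asyncio.run(demo())
print(full)
print(chunks)
```

The streaming branch hands back a generator before any token exists, which is why the engine health check earlier in the method matters.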

Now that I've examined the overall chat completion processing pipeline, let me dive into the important core logic components.

Preprocessing

For this analysis, I'll assume that beam search is not being used and examine the code accordingly.

As I saw above, request preprocessing uses the OpenAIServingChat(OpenAIServing)._preprocess_chat() method.
Let me examine how this method works step by step.

Content Format and Conversation Setup: Prepares three values. resolved_content_format determines the content format for chat templates based on tools and model configuration. conversation holds the parsed conversation messages, including multimodal data handling. mm_data_future is a future object for asynchronous multimodal data processing. Finally, the user-specified chat_template_kwargs are merged into _chat_template_kwargs, the internal chat template configuration dictionary, so user-supplied keys override the defaults set in this method.

vllm/entrypoints/openai/serving_engine.py


...
class OpenAIServing:
...
    async def _preprocess_chat(
        self,
        request: ChatLikeRequest,
        tokenizer: AnyTokenizer,
        messages: list[ChatCompletionMessageParam],
        chat_template: Optional[str],
        chat_template_content_format: ChatTemplateContentFormatOption,
        add_generation_prompt: bool = True,
        continue_final_message: bool = False,
        tool_dicts: Optional[list[dict[str, Any]]] = None,
        documents: Optional[list[dict[str, str]]] = None,
        chat_template_kwargs: Optional[dict[str, Any]] = None,
        tool_parser: Optional[Callable[[AnyTokenizer], ToolParser]] = None,
        truncate_prompt_tokens: Optional[Annotated[int, Field(ge=1)]] = None,
        add_special_tokens: bool = False,
    ) -> tuple[list[ConversationMessage], Sequence[RequestPrompt],
               list[EngineTokensPrompt]]:
        model_config = self.model_config

        resolved_content_format = resolve_chat_template_content_format(
            chat_template,
            tool_dicts,
            chat_template_content_format,
            tokenizer,
            model_config=model_config,
        )
        conversation, mm_data_future = parse_chat_messages_futures(
            messages,
            model_config,
            tokenizer,
            content_format=resolved_content_format,
        )

        _chat_template_kwargs: dict[str, Any] = dict(
            chat_template=chat_template,
            add_generation_prompt=add_generation_prompt,
            continue_final_message=continue_final_message,
            tools=tool_dicts,
            documents=documents,
        )
        _chat_template_kwargs.update(chat_template_kwargs or {})
...
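
The merge order is worth seeing in isolation. This is a minimal sketch using plain dictionaries; the helper name is hypothetical, and enable_thinking is only an example of a template-specific key a user might pass.

```python
from typing import Any, Optional


def merge_chat_template_kwargs(
    defaults: dict[str, Any],
    user_kwargs: Optional[dict[str, Any]],
) -> dict[str, Any]:
    """Copy the defaults, then let user-supplied keys override them."""
    merged = dict(defaults)
    merged.update(user_kwargs or {})
    return merged


defaults = dict(
    chat_template=None,
    add_generation_prompt=True,
    continue_final_message=False,
    tools=None,
    documents=None,
)
merged = merge_chat_template_kwargs(
    defaults,
    {"enable_thinking": False, "add_generation_prompt": False},
)
print(merged)
```

Because update() runs last, a key in chat_template_kwargs replaces the value of the same name that was passed as a regular argument.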

Obtain the request_prompt based on the tokenizer type: for models using MistralTokenizer, the apply_mistral_chat_template() function is used, while for other models, the apply_hf_chat_template() function is used to generate the request_prompt.

vllm/entrypoints/openai/serving_engine.py


...
class OpenAIServing:
...
    async def _preprocess_chat(
...
        request_prompt: Union[str, list[int]]
        if isinstance(tokenizer, MistralTokenizer):
            request_prompt = apply_mistral_chat_template(
                tokenizer,
                messages=messages,
                **_chat_template_kwargs,
            )
        else:
            request_prompt = apply_hf_chat_template(
                tokenizer=tokenizer,
                conversation=conversation,
                model_config=model_config,
                **_chat_template_kwargs,
            )

        mm_data = await mm_data_future
...
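
To give a feel for what a chat template produces, here is a deliberately simplified sketch that renders a conversation in a ChatML-like layout. Real templates are Jinja templates shipped with the tokenizer and also handle tools, documents, and model-specific defaults; the function below is a hypothetical stand-in, not what apply_hf_chat_template() does internally.

```python
def render_chatml(conversation: list[dict[str, str]],
                  add_generation_prompt: bool = True) -> str:
    """Render messages as <|im_start|>role\\ncontent<|im_end|> blocks."""
    parts = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in conversation
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml([
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hi"},
])
print(prompt)
```

The resulting string is what gets tokenized in the next step, which is why the HF path yields a str while the Mistral path already yields token IDs.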

Process tool parsing if enabled: When a tool parser is configured and the tool choice is not "none", the system determines whether tool parsing should be performed. If tools are being used, the request is adjusted through the tool parser to handle function calling capabilities. This step ensures that the model can correctly interpret and respond to tool-related requests.

vllm/entrypoints/openai/serving_engine.py


...
class OpenAIServing:
...
    async def _preprocess_chat(
...
        # tool parsing is done only if a tool_parser has been set and if
        # tool_choice is not "none" (if tool_choice is "none" but a tool_parser
        # is set, we want to prevent parsing a tool_call hallucinated by the LLM
        should_parse_tools = tool_parser is not None and (hasattr(
            request, "tool_choice") and request.tool_choice != "none")

        if should_parse_tools:
            if not isinstance(request, ChatCompletionRequest):
                msg = "Tool usage is only supported for Chat Completions API"
                raise NotImplementedError(msg)

            request = tool_parser(tokenizer).adjust_request(  # type: ignore
                request=request)
...

Tokenize the request prompt: Convert the string-based prompt into token format for model processing. For string prompts, the system uses asynchronous tokenization with optional prompt truncation and special token handling through the OpenAIServing._tokenize_prompt_input_async() method, which performs tokenization in a thread pool to prevent blocking the main event loop. For MistralTokenizer, token IDs are already provided, so the system creates a TextTokensPrompt object containing both the decoded text and the token IDs.

vllm/entrypoints/openai/serving_engine.py


...
class OpenAIServing:
...
    async def _preprocess_chat(
...
        if isinstance(request_prompt, str):
            prompt_inputs = await self._tokenize_prompt_input_async(
                request,
                tokenizer,
                request_prompt,
                truncate_prompt_tokens=truncate_prompt_tokens,
                add_special_tokens=add_special_tokens,
            )
        else:
            # For MistralTokenizer
            assert is_list_of(request_prompt, int), (
                "Prompt has to be either a string or a list of token ids")
            prompt_inputs = TextTokensPrompt(
                prompt=tokenizer.decode(request_prompt),
                prompt_token_ids=request_prompt)
...
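
Here is a minimal sketch of the "tokenize in a thread pool" idea using only the standard library. The toy tokenizer and the function names are hypothetical, and the truncation keeps the last k tokens, which is how I understand truncate_prompt_tokens to behave.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Optional


def toy_tokenize(text: str) -> list[int]:
    """Stand-in for a real tokenizer: one ID per character."""
    return [ord(ch) for ch in text]


async def tokenize_async(executor: ThreadPoolExecutor, text: str,
                         truncate_prompt_tokens: Optional[int] = None) -> dict:
    loop = asyncio.get_running_loop()
    # Run the blocking tokenizer off the event loop thread.
    token_ids = await loop.run_in_executor(executor, toy_tokenize, text)
    if truncate_prompt_tokens is not None:
        token_ids = token_ids[-truncate_prompt_tokens:]
    return {"prompt": text, "prompt_token_ids": token_ids}


async def main() -> dict:
    with ThreadPoolExecutor(max_workers=1) as executor:
        return await tokenize_async(executor, "hi!", truncate_prompt_tokens=2)


prompt_inputs = asyncio.run(main())
print(prompt_inputs)
```

While the worker thread tokenizes, the event loop stays free to accept and stream other requests.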

Create the engine prompt: Construct the final EngineTokensPrompt object that will be passed to the inference engine. This includes the tokenized prompt, multimodal data (if present), multimodal processor kwargs, and cache salt for caching optimization. The function returns the processed conversation, request prompt, and engine prompt for the next stage of processing.

vllm/entrypoints/openai/serving_engine.py


...
class OpenAIServing:
...
    async def _preprocess_chat(
...
        engine_prompt = EngineTokensPrompt(
            prompt_token_ids=prompt_inputs["prompt_token_ids"])
        if mm_data is not None:
            engine_prompt["multi_modal_data"] = mm_data
        if request.mm_processor_kwargs is not None:
            engine_prompt["mm_processor_kwargs"] = request.mm_processor_kwargs

        if hasattr(request, "cache_salt") and request.cache_salt is not None:
            engine_prompt["cache_salt"] = request.cache_salt

        return conversation, [request_prompt], [engine_prompt]
...
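
The shape of the engine prompt can be sketched as a TypedDict whose optional keys are only set when the request provides them. ToyEnginePrompt and build_engine_prompt are hypothetical names for illustration; the key names follow the excerpt above.

```python
from typing import Any, Optional, TypedDict


class ToyEnginePrompt(TypedDict, total=False):
    prompt_token_ids: list[int]
    multi_modal_data: Any
    mm_processor_kwargs: dict[str, Any]
    cache_salt: str


def build_engine_prompt(
    token_ids: list[int],
    mm_data: Any = None,
    mm_processor_kwargs: Optional[dict[str, Any]] = None,
    cache_salt: Optional[str] = None,
) -> ToyEnginePrompt:
    """Start from the token IDs and attach optional fields when present."""
    prompt: ToyEnginePrompt = {"prompt_token_ids": list(token_ids)}
    if mm_data is not None:
        prompt["multi_modal_data"] = mm_data
    if mm_processor_kwargs is not None:
        prompt["mm_processor_kwargs"] = mm_processor_kwargs
    if cache_salt is not None:
        prompt["cache_salt"] = cache_salt
    return prompt


print(build_engine_prompt([1, 2, 3]))
print(build_engine_prompt([1, 2, 3], cache_salt="tenant-a"))
```

A text-only request therefore reaches the engine as little more than a list of token IDs.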

Inferencing

Inference is performed through the OpenAIServingChat(OpenAIServing).engine_client.generate() method.
In this document, I'm using AsyncLLM(EngineClient) as the engine_client, so let me examine the AsyncLLM(EngineClient).generate() method.

Engine Client

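Initialize output handler: On the first call to generate(), the AsyncLLM(EngineClient)._run_output_handler() method starts AsyncLLM(EngineClient).output_handler as a background asyncio task. Later calls return immediately because the task already exists.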

vllm/v1/engine/async_llm.py


...
class AsyncLLM(EngineClient):
...
    async def generate(
        self,
        prompt: PromptType,
        sampling_params: SamplingParams,
        request_id: str,
        lora_request: Optional[LoRARequest] = None,
        trace_headers: Optional[Mapping[str, str]] = None,
        prompt_adapter_request: Optional[PromptAdapterRequest] = None,
        priority: int = 0,
    ) -> AsyncGenerator[RequestOutput, None]:
        """
        Main function called by the API server to kick off a request
            * 1) Making an AsyncStream corresponding to the Request.
            * 2) Processing the Input.
            * 3) Adding the Request to the Detokenizer.
            * 4) Adding the Request to the EngineCore (separate process).

        A separate output_handler loop runs in a background AsyncIO task,
        pulling outputs from EngineCore and putting them into the
        per-request AsyncStream.

        The caller of generate() iterates the returned AsyncGenerator,
        returning the RequestOutput back to the caller.
        """
        try:

            # We start the output_handler on the first call to generate() so
            # we can call __init__ before the event loop, which enables us
            # to handle startup failure gracefully in the OpenAI server.
            self._run_output_handler()
...

The output_handler executes in the following order:
1. Pull EngineCoreOutputs from the EngineCore: Continuously polls the engine core for outputs using await engine_core.get_output_async() and processes them in chunks to avoid blocking the event loop.
2. Process EngineCoreOutputs: Each output chunk is processed through output_processor.process_outputs() which converts raw engine outputs into formatted request outputs and pushes them to appropriate async streams.
3. Handle request aborts: Requests that the output processor found to be finished because of stop strings are reported back to the engine core via await engine_core.abort_requests_async(), so the core stops generating for them.
4. Performance logging: Records scheduler statistics and iteration metrics for monitoring and debugging purposes.

vllm/v1/engine/async_llm.py


...
class AsyncLLM(EngineClient):
...
    def _run_output_handler(self):
        """Background loop: pulls from EngineCore and pushes to AsyncStreams."""

        if self.output_handler is not None:
            return

        # Ensure that the task doesn't have a circular ref back to the AsyncLLM
        # object, or else it won't be garbage collected and cleaned up properly.
        engine_core = self.engine_core
        output_processor = self.output_processor
        log_stats = self.log_stats
        stat_loggers = self.stat_loggers if log_stats else None

        async def output_handler():
            try:
                while True:
                    # 1) Pull EngineCoreOutputs from the EngineCore.
                    outputs = await engine_core.get_output_async()
                    num_outputs = len(outputs.outputs)

                    iteration_stats = IterationStats() if (
                        log_stats and num_outputs) else None

                    # Split outputs into chunks of at most
                    # VLLM_V1_OUTPUT_PROC_CHUNK_SIZE, so that we don't block the
                    # event loop for too long.
                    if num_outputs <= VLLM_V1_OUTPUT_PROC_CHUNK_SIZE:
                        slices = (outputs.outputs, )
                    else:
                        slices = np.array_split(
                            outputs.outputs,
                            cdiv(num_outputs, VLLM_V1_OUTPUT_PROC_CHUNK_SIZE))

                    for i, outputs_slice in enumerate(slices):
                        # 2) Process EngineCoreOutputs.
                        processed_outputs = output_processor.process_outputs(
                            outputs_slice, outputs.timestamp, iteration_stats)
                        # NOTE: RequestOutputs are pushed to their queues.
                        assert not processed_outputs.request_outputs

                        # Allow other asyncio tasks to run between chunks
                        if i + 1 < len(slices):
                            await asyncio.sleep(0)

                        # 3) Abort any reqs that finished due to stop strings.
                        await engine_core.abort_requests_async(
                            processed_outputs.reqs_to_abort)

                    # 4) Logging.
                    # TODO(rob): make into a coroutine and launch it in
                    # background thread once Prometheus overhead is non-trivial.
                    if stat_loggers:
                        assert outputs.scheduler_stats is not None
                        AsyncLLM._record_stats(
                            stat_loggers[outputs.engine_index],
                            scheduler_stats=outputs.scheduler_stats,
                            iteration_stats=iteration_stats,
                        )
            except Exception as e:
                logger.exception("AsyncLLM output_handler failed.")
                output_processor.propagate_error(e)

        self.output_handler = asyncio.create_task(output_handler())
...
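
The chunking trick can be sketched without NumPy. The pure-Python splitter below is my stand-in for np.array_split, the chunk size is an arbitrary small value rather than the real VLLM_V1_OUTPUT_PROC_CHUNK_SIZE, and doubling each item stands in for real output processing.

```python
import asyncio

CHUNK_SIZE = 3  # illustrative value only


def cdiv(a: int, b: int) -> int:
    """Ceiling division."""
    return -(a // -b)


def split_into_chunks(items: list, chunk_size: int) -> list[list]:
    """Split into cdiv(n, chunk_size) nearly equal slices."""
    n = len(items)
    if n <= chunk_size:
        return [items]
    num_chunks = cdiv(n, chunk_size)
    base, extra = divmod(n, num_chunks)
    slices, start = [], 0
    for i in range(num_chunks):
        end = start + base + (1 if i < extra else 0)
        slices.append(items[start:end])
        start = end
    return slices


async def process_outputs(items: list[int]) -> list[int]:
    processed = []
    slices = split_into_chunks(items, CHUNK_SIZE)
    for i, chunk in enumerate(slices):
        processed.extend(x * 2 for x in chunk)
        # Yield to other asyncio tasks between chunks.
        if i + 1 < len(slices):
            await asyncio.sleep(0)
    return processed


print(split_into_chunks(list(range(7)), CHUNK_SIZE))
print(asyncio.run(process_outputs(list(range(7)))))
```

The await asyncio.sleep(0) between chunks is what lets other request handlers make progress while a large batch of outputs is being processed.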

Send inference request: The inference request is sent to the core engine through the AsyncLLM(EngineClient).add_request() method.

vllm/v1/engine/async_llm.py


...
class AsyncLLM(EngineClient):
...
    async def generate(
...
        try:
...
            q = await self.add_request(
                request_id,
                prompt,
                sampling_params,
                lora_request=lora_request,
                trace_headers=trace_headers,
                prompt_adapter_request=prompt_adapter_request,
                priority=priority,
            )
...

AsyncLLM(EngineClient).add_request() operates as follows:
1. Process input and create request: Converts the input prompt and parameters into an internal request object using self.processor.process_inputs(), which handles tokenization, parameter validation, and request formatting.
2. Send request to core engine: The AsyncLLM(EngineClient)._add_request() method calls the AsyncMPClient(MPClient).add_request_async() method to send an EngineCoreRequestType.ADD request to the core engine, enabling asynchronous communication between the client and the engine process for efficient request queuing and processing.

vllm/v1/engine/async_llm.py


...
class AsyncLLM(EngineClient):
...
    async def add_request(
        self,
        request_id: str,
        prompt: PromptType,
        params: Union[SamplingParams, PoolingParams],
        arrival_time: Optional[float] = None,
        lora_request: Optional[LoRARequest] = None,
        tokenization_kwargs: Optional[dict[str, Any]] = None,
        trace_headers: Optional[Mapping[str, str]] = None,
        prompt_adapter_request: Optional[PromptAdapterRequest] = None,
        priority: int = 0,
    ) -> RequestOutputCollector:
        """Add new request to the AsyncLLM."""

        if self.errored:
            raise EngineDeadError()

        assert isinstance(params, SamplingParams), \
            "Pooling is not supported in V1"

        # Create a new output collector for the request.
        queue = RequestOutputCollector(output_kind=params.output_kind)

        # Convert Input --> Request.
        prompt_str, request = self.processor.process_inputs(
            request_id, prompt, params, arrival_time, lora_request,
            tokenization_kwargs, trace_headers, prompt_adapter_request,
            priority)

        if params.n == 1:
            await self._add_request(request, prompt_str, None, 0, queue)
            return queue

        # Fan out child requests (for n>1).
        parent_request = ParentRequest(request_id, params)
        for idx in range(params.n):
            request_id, params = parent_request.get_child_info(idx)
            child_request = request if idx == params.n - 1 else copy(request)
            child_request.request_id = request_id
            child_request.sampling_params = params
            await self._add_request(child_request, prompt_str, parent_request,
                                    idx, queue)
        return queue
...
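
The fan-out for n > 1 can be sketched with a toy request type. ToyRequest and fan_out are hypothetical, and the "{index}_{request_id}" child ID format is an assumption for illustration rather than something shown in the excerpt above.

```python
from copy import copy
from dataclasses import dataclass


@dataclass
class ToyRequest:
    request_id: str
    n: int  # number of completions requested


def fan_out(request: ToyRequest) -> list[ToyRequest]:
    """Turn one request with n > 1 into n single-sample child requests."""
    if request.n == 1:
        return [request]
    children = []
    for idx in range(request.n):
        child = copy(request)
        child.request_id = f"{idx}_{request.request_id}"  # assumed ID format
        child.n = 1
        children.append(child)
    return children


children = fan_out(ToyRequest(request_id="req", n=3))
print([child.request_id for child in children])
```

All children share one output collector in the real code, so the caller still sees a single stream of results for the parent request.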

Engine Core

Process request through busy loop: The request sent in this way is processed through EngineCoreProc via a busy loop as shown below and scheduled in the EngineCoreProc(EngineCore).scheduler.

vllm/v1/engine/core.py


...
class EngineCore:
    """Inner loop of vLLM's Engine."""
...
    def run_busy_loop(self):
        """Core busy loop of the EngineCore."""

        # Loop until process is sent a SIGINT or SIGTERM
        while True:
            # 1) Poll the input queue until there is work to do.
            self._process_input_queue()
            # 2) Step the engine core and return the outputs.
            self._process_engine_step()

    def _process_input_queue(self):
        """Exits when an engine step needs to be performed."""

        waited = False
        while not self.engines_running and not (self.scheduler.has_requests()):
            if logger.isEnabledFor(DEBUG) and self.input_queue.empty():
                logger.debug("EngineCore waiting for work.")
                waited = True
            req = self.input_queue.get()
            self._handle_client_request(*req)

        if waited:
            logger.debug("EngineCore loop active.")

        # Handle any more client requests.
        while not self.input_queue.empty():
            req = self.input_queue.get_nowait()
            self._handle_client_request(*req)

    def _process_engine_step(self):
        """Called only when there are unfinished local requests."""

        # Step the engine core.
        outputs = self.step_fn()
        # Put EngineCoreOutputs into the output queue.
        if outputs is not None:
            self.output_queue.put_nowait(outputs)

    def _handle_client_request(self, request_type: EngineCoreRequestType,
                               request: Any) -> None:
        """Dispatch request from client."""

        if request_type == EngineCoreRequestType.ADD:
            self.add_request(request)
        elif request_type == EngineCoreRequestType.ABORT:
            self.abort_requests(request)
        elif request_type == EngineCoreRequestType.UTILITY:
            call_id, method_name, args = request
            output = UtilityOutput(call_id)
            try:
                method = getattr(self, method_name)
                output.result = method(
                    *self._convert_msgspec_args(method, args))
            except BaseException as e:
                logger.exception("Invocation of %s method failed", method_name)
                output.failure_message = (f"Call to {method_name} method"
                                          f" failed: {str(e)}")
            self.output_queue.put_nowait(
                EngineCoreOutputs(utility_output=output))
        elif request_type == EngineCoreRequestType.EXECUTOR_FAILED:
            raise RuntimeError("Executor failed.")
        else:
            logger.error("Unrecognized input request type encountered: %s",
                         request_type)
...

vllm/v1/engine/core.py


...
class EngineCore:
    """Inner loop of vLLM's Engine."""
...
    def add_request(self, request: EngineCoreRequest):
        """Add request to the scheduler."""

        if request.mm_hashes is not None:
            # Here, if hash exists for a multimodal input, then it will be
            # fetched from the cache, else it will be added to the cache.
            # Note that the cache here is mirrored with the client cache, so
            # anything that has a hash must have a HIT cache entry here
            # as well.
            assert request.mm_inputs is not None
            request.mm_inputs = self.mm_input_cache_server.get_and_update_p1(
                request.mm_inputs, request.mm_hashes)

        req = Request.from_engine_core_request(request)
        if req.use_structured_output:
            # Start grammar compilation asynchronously
            self.structured_output_manager.grammar_init(req)

        if req.kv_transfer_params is not None and (
                not self.scheduler.get_kv_connector()):
            logger.warning("Got kv_transfer_params, but no KVConnector found. "
                           "Disabling KVTransfer for this request.")

        self.scheduler.add_request(req)
...
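
The two-phase loop above (drain the input queue, then step) can be sketched with the standard library. ToyEngineCore is a hypothetical miniature: string request types replace EngineCoreRequestType, a list replaces the scheduler, and each step simply finishes one request.

```python
import queue


class ToyEngineCore:
    def __init__(self) -> None:
        self.input_queue: queue.Queue = queue.Queue()
        self.output_queue: queue.Queue = queue.Queue()
        self.waiting: list[str] = []

    def _handle_client_request(self, request_type: str, request: str) -> None:
        if request_type == "ADD":
            self.waiting.append(request)
        elif request_type == "ABORT":
            self.waiting = [r for r in self.waiting if r != request]

    def _process_input_queue(self) -> None:
        # Block until there is at least one request to work on.
        while not self.waiting:
            self._handle_client_request(*self.input_queue.get())
        # Then drain whatever else has already arrived.
        while not self.input_queue.empty():
            self._handle_client_request(*self.input_queue.get_nowait())

    def _process_engine_step(self) -> None:
        finished = self.waiting.pop(0)
        self.output_queue.put_nowait(f"done:{finished}")

    def run(self, max_steps: int) -> None:
        # The real loop runs forever; a step limit keeps the sketch finite.
        for _ in range(max_steps):
            self._process_input_queue()
            self._process_engine_step()


core = ToyEngineCore()
for item in [("ADD", "a"), ("ADD", "b"), ("ABORT", "b"), ("ADD", "c")]:
    core.input_queue.put(item)
core.run(max_steps=2)
outputs = [core.output_queue.get_nowait() for _ in range(2)]
print(outputs)
```

Request "b" never produces output because its ABORT message is drained before the first step runs.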

The busy loop is created through the following process:

Scheduler

Add request to scheduler queue: The request is added to the Scheduler(SchedulerInterface).waiting queue through the Scheduler(SchedulerInterface).add_request() method.
Determine step function and execute scheduling: Based on the EngineCoreProc(EngineCore).model_executor.max_concurrent_batches value, EngineCoreProc(EngineCore).step_fn is set to one of the two methods shown below, step() or step_with_batch_queue(). The EngineCoreProc(EngineCore)._process_engine_step() method calls step_fn, which in turn calls the Scheduler(SchedulerInterface).schedule() method.
Scheduling logic: The scheduler determines which requests to process next based on factors like priority, available resources, sequence length, and batching constraints. It builds batches for efficient GPU utilization and manages the transition of requests between states (such as waiting, running, and preempted).

vllm/v1/engine/core.py


...
class EngineCore:
...
    def step(self) -> EngineCoreOutputs:
        """Schedule, execute, and make output."""

        # Check for any requests remaining in the scheduler - unfinished,
        # or finished and not yet removed from the batch.
        if not self.scheduler.has_requests():
            return EngineCoreOutputs(
                outputs=[],
                scheduler_stats=self.scheduler.make_stats(),
            )
        scheduler_output = self.scheduler.schedule()
        model_output = self.execute_model(scheduler_output)
        engine_core_outputs = self.scheduler.update_from_output(
            scheduler_output, model_output)  # type: ignore

        return engine_core_outputs

    def step_with_batch_queue(self) -> Optional[EngineCoreOutputs]:
        """Schedule and execute batches with the batch queue.
        Note that if nothing to output in this step, None is returned.

        The execution flow is as follows:
        1. Try to schedule a new batch if the batch queue is not full.
        If a new batch is scheduled, directly return an empty engine core
        output. In other words, fulfilling the batch queue has a higher priority
        than getting model outputs.
        2. If there is no new scheduled batch, meaning that the batch queue
        is full or no other requests can be scheduled, we block until the first
        batch in the job queue is finished.
        3. Update the scheduler from the output.
        """
        assert self.batch_queue is not None

        engine_core_outputs = None
        scheduler_output = None
        # Try to schedule a new batch if the batch queue is not full, but
        # the scheduler may return an empty batch if all requests are scheduled.
        # Note that this is not blocking.
        if not self.batch_queue.full():
            scheduler_output = self.scheduler.schedule()
            if scheduler_output.total_num_scheduled_tokens > 0:
                future = self.model_executor.execute_model(scheduler_output)
                self.batch_queue.put_nowait(
                    (future, scheduler_output))  # type: ignore

        scheduled_batch = (scheduler_output is not None
                           and scheduler_output.total_num_scheduled_tokens > 0)

        # If no more requests can be scheduled and the job queue is not empty,
        # block until the first batch in the job queue is finished.
        # TODO(comaniac): Ideally we should peek the first batch in the
        # job queue to check if it's finished before scheduling a new batch,
        # but peeking the first element in a queue is not thread-safe,
        # so we need more work.
        if not scheduled_batch and not self.batch_queue.empty():
            future, scheduler_output = self.batch_queue.get_nowait()
            # Blocking until the first result is available.
            model_output = future.result()
            self.batch_queue.task_done()
            engine_core_outputs = self.scheduler.update_from_output(
                scheduler_output, model_output)

        return engine_core_outputs
...
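
The three-step flow described in the step_with_batch_queue() docstring can be imitated with the standard library alone. The following is a toy model rather than vLLM's code: ToyScheduler, toy_step_with_batch_queue(), and run_toy_engine() are hypothetical names, sum() stands in for the model, and a ThreadPoolExecutor plays the role of an executor that returns futures.

```python
import queue
from concurrent.futures import ThreadPoolExecutor
from typing import Optional


class ToyScheduler:
    """Hypothetical stand-in: hands out one pending batch per schedule() call."""

    def __init__(self, batches: list[list[int]]) -> None:
        self.pending = list(batches)

    def schedule(self) -> list[int]:
        # An empty list plays the role of total_num_scheduled_tokens == 0.
        return self.pending.pop(0) if self.pending else []

    def update_from_output(self, batch: list[int], output: int) -> int:
        return output


def toy_step_with_batch_queue(scheduler, pool, batch_queue) -> Optional[int]:
    scheduled = False
    # 1. Filling the queue has priority over collecting results.
    if not batch_queue.full():
        batch = scheduler.schedule()
        if batch:
            future = pool.submit(sum, batch)  # stand-in for execute_model()
            batch_queue.put_nowait((future, batch))
            scheduled = True
    # 2. Nothing new was scheduled: block on the oldest in-flight batch.
    if not scheduled and not batch_queue.empty():
        future, batch = batch_queue.get_nowait()
        result = future.result()
        # 3. Feed the result back to the scheduler.
        return scheduler.update_from_output(batch, result)
    return None


def run_toy_engine(batches, queue_size=2, steps=7):
    scheduler = ToyScheduler(batches)
    batch_queue = queue.Queue(maxsize=queue_size)
    with ThreadPoolExecutor(max_workers=1) as pool:
        return [toy_step_with_batch_queue(scheduler, pool, batch_queue)
                for _ in range(steps)]


step_results = run_toy_engine([[1, 2], [3, 4], [5]])
```

With a queue of size 2, the first two steps only enqueue work and return None; a result appears only once the queue is full or nothing is left to schedule, which is the priority the docstring describes.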

Executor

Execute model with scheduler output: EngineCore.execute_model() passes the SchedulerOutput returned by Scheduler(SchedulerInterface).schedule() (which contains batched sequences, execution metadata, and resource allocation information) to model_executor.execute_model(). If execution fails, the engine state is dumped for debugging and the exception is re-raised.

vllm/v1/engine/core.py


...
class EngineCore:
...
    def execute_model(self, scheduler_output: SchedulerOutput):
        try:
            return self.model_executor.execute_model(scheduler_output)
        except BaseException as err:
            # NOTE: This method is exception-free
            dump_engine_exception(self.vllm_config, scheduler_output,
                                  self.scheduler.make_stats())
            # Re-raise exception
            raise err
...

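Send model inference request: Executor.execute_model() forwards the request through collective_rpc() and returns the first element of the result list. In the single-process case, UniProcExecutor(UniProcExecutorV0, Executor).collective_rpc() calls the method directly on the driver worker.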

vllm/v1/executor/abstract.py


...
class Executor(ExecutorBase):
...
    def execute_model(
        self,
        scheduler_output,
    ) -> Union[ModelRunnerOutput, Future[ModelRunnerOutput]]:
        output = self.collective_rpc("execute_model",
                                     args=(scheduler_output, ))
        return output[0]
...

vllm/executor/uniproc_executor.py


...
class UniProcExecutor(ExecutorBase):
...
    def collective_rpc(self,
                       method: Union[str, Callable],
                       timeout: Optional[float] = None,
                       args: Tuple = (),
                       kwargs: Optional[Dict] = None) -> List[Any]:
        if kwargs is None:
            kwargs = {}
        answer = run_method(self.driver_worker, method, args, kwargs)
        return [answer]
...
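
The name-based dispatch behind collective_rpc() can be sketched as follows. The body of run_method() is not shown in this article, so run_method_sketch() is an assumption about its behaviour (look up a method by name, or call a callable with the worker as its first argument); ToyWorker and ToyUniProcExecutor are hypothetical stand-ins.

```python
from typing import Any, Callable, Optional, Union


def run_method_sketch(obj: Any, method: Union[str, Callable],
                      args: tuple, kwargs: dict) -> Any:
    """Assumed behaviour: resolve a method by name, or call a callable on obj."""
    if isinstance(method, str):
        return getattr(obj, method)(*args, **kwargs)
    return method(obj, *args, **kwargs)


class ToyWorker:
    def execute_model(self, scheduler_output: str) -> str:
        return f"ran:{scheduler_output}"


class ToyUniProcExecutor:
    def __init__(self) -> None:
        self.driver_worker = ToyWorker()

    def collective_rpc(self, method: Union[str, Callable], args: tuple = (),
                       kwargs: Optional[dict] = None) -> list:
        # One in-process worker, so the "collective" result has one element.
        answer = run_method_sketch(self.driver_worker, method, args, kwargs or {})
        return [answer]

    def execute_model(self, scheduler_output: str) -> str:
        # Mirrors Executor.execute_model(): take the first worker's answer.
        return self.collective_rpc("execute_model", args=(scheduler_output,))[0]


executor = ToyUniProcExecutor()
```

Returning a list even for a single worker keeps the call site identical to the multi-worker case, where each worker would contribute one entry.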

Worker & Model Runner

Execute model inference: The Worker(WorkerBase) that receives the request executes the execute_model() method and performs actual model inference through the GPUModelRunner(LoRAModelRunnerMixin).execute_model() method.

vllm/v1/worker/gpu_worker.py


...
class Worker(WorkerBase):
...
    @torch.inference_mode()
    def execute_model(
        self,
        scheduler_output: "SchedulerOutput",
    ) -> Optional[ModelRunnerOutput]:
        intermediate_tensors = None
        if not get_pp_group().is_first_rank:
            intermediate_tensors = IntermediateTensors(
                get_pp_group().recv_tensor_dict(
                    all_gather_group=get_tp_group()))

        output = self.model_runner.execute_model(scheduler_output,
                                                 intermediate_tensors)
        parallel_config = self.vllm_config.parallel_config
        if parallel_config.distributed_executor_backend != "external_launcher" \
            and not get_pp_group().is_last_rank:
            assert isinstance(output, IntermediateTensors)
            get_pp_group().send_tensor_dict(output.tensors,
                                            all_gather_group=get_tp_group())
            return None
        assert isinstance(output, ModelRunnerOutput)
        return output if self.is_driver_worker else None
...
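
The rank-dependent branches in Worker.execute_model() are easier to follow in a stripped-down, in-process form. This sketch is an illustration only: toy_worker_step() and toy_pipeline() are hypothetical, integers stand in for tensors, and tensor parallelism, the external_launcher case, and the driver-worker check are all ignored.

```python
from typing import Optional


def toy_worker_step(rank: int, world_size: int, token: int,
                    received: Optional[int]) -> tuple[Optional[int], Optional[int]]:
    """Return (value_to_send_downstream, final_output) for one pipeline rank."""
    is_first = rank == 0
    is_last = rank == world_size - 1
    # Non-first ranks consume the activations sent by the previous rank.
    hidden = token if is_first else received
    hidden = hidden + 1  # stand-in for this stage's layers
    if not is_last:
        return hidden, None  # forward activations, no output yet
    return None, hidden      # only the last rank produces an output


def toy_pipeline(token: int, world_size: int) -> int:
    received, output = None, None
    for rank in range(world_size):
        received, output = toy_worker_step(rank, world_size, token, received)
    return output
```

Every rank except the last returns no output, which matches the Optional[ModelRunnerOutput] return type in the excerpt above.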

Engine Core

Update inference results: The scheduler state is updated with the model runner's output through the Scheduler(SchedulerInterface).update_from_output() method, which returns the EngineCoreOutputs for this step.
Add results to output queue: The results are added to the EngineCoreProc(EngineCore).output_queue.

Engine Client

Yield outputs until completion: AsyncLLM.generate() pulls outputs from the per-request queue (RequestOutputCollector) and yields them to the caller until the request is finished.

vllm/v1/engine/async_llm.py


...
class AsyncLLM(EngineClient):
...
    async def generate(
...
        try:
...
            # The output_handler task pushes items into the queue.
            # This task pulls from the queue and yields to caller.
            finished = False
            while not finished:
                # Note: drain queue without await if possible (avoids
                # task switching under load which helps performance).
                out = q.get_nowait() or await q.get()

                # Note: both OutputProcessor and EngineCore handle their
                # own request cleanup based on finished.
                finished = out.finished
                yield out
...
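
The `q.get_nowait() or await q.get()` idiom relies on a queue whose get_nowait() returns None when empty instead of raising. Below is a minimal sketch of that pattern; ToyCollector is a hypothetical FIFO stand-in (the real RequestOutputCollector is not shown in this article and may behave differently), and plain dicts stand in for RequestOutput.

```python
import asyncio
from typing import Optional


class ToyCollector:
    """Hypothetical collector: get_nowait() returns None when nothing is buffered."""

    def __init__(self) -> None:
        self._items: list[dict] = []
        self._ready = asyncio.Event()

    def put(self, item: dict) -> None:
        self._items.append(item)
        self._ready.set()

    def get_nowait(self) -> Optional[dict]:
        if not self._items:
            return None
        item = self._items.pop(0)
        if not self._items:
            self._ready.clear()
        return item

    async def get(self) -> dict:
        while not self._items:
            await self._ready.wait()
        return self.get_nowait()


async def toy_generate(q: ToyCollector):
    finished = False
    while not finished:
        # Fast path first; only suspend the task when nothing is buffered.
        out = q.get_nowait() or await q.get()
        finished = out["finished"]
        yield out


async def _demo() -> list[str]:
    q = ToyCollector()

    async def producer() -> None:
        for i, text in enumerate(["Hel", "lo", "!"]):
            await asyncio.sleep(0)
            q.put({"text": text, "finished": i == 2})

    task = asyncio.create_task(producer())
    texts = [out["text"] async for out in toy_generate(q)]
    await task
    return texts


collected = asyncio.run(_demo())
```

Note that the standard asyncio.Queue.get_nowait() raises QueueEmpty rather than returning None, so the one-line idiom only works with a collector designed this way.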

Postprocessing

The process of preparing the response that users receive is quite involved, so the code is omitted here and the flow is summarized step by step instead.

Buffered Response

Method Initialization
- The method accepts parameters including ChatCompletionRequest, AsyncIterator[RequestOutput], request metadata, etc.
- Records the current timestamp with created_time = int(time.time())
- Initializes final_res: Optional[RequestOutput] = None to store the final result
Result Generation Loop
- Iterates through result_generator using async for res in result_generator:
- Continuously updates final_res = res to get the final output
- Handles exceptions:
  - asyncio.CancelledError: Returns error response for client disconnection
  - ValueError: Returns error response with the exception message
Response Processing Initialization
- Asserts that final_res is not None
- Initializes empty choices: list[ChatCompletionResponseChoice] = []
- Gets the response role using self.get_chat_request_role(request)
Output Processing Loop: For each output in final_res.outputs:
- Log Probabilities Handling
  - Extracts token_ids and out_logprobs from output
  - If request.logprobs is requested, creates chat logprobs using self._create_chat_logprobs()
  - Sets auto_tools_called = False as initial state
- Reasoning Parser Processing
  - If self.reasoning_parser exists:
    - Creates reasoning parser instance: reasoning_parser = self.reasoning_parser(tokenizer)
    - Extracts reasoning content: reasoning_parser.extract_reasoning_content()
  - Otherwise, sets reasoning_content = None and content = output.text
Message Type Determination: The method determines message type based on tool configuration:
- Standard Chat Message
  - When auto tools are disabled and no named tool choice
  - Creates ChatMessage with role, reasoning_content, and content
- Named Tool Choice
  - When request.tool_choice is ChatCompletionNamedToolChoiceParam
  - Determines tool call class: MistralToolCall or ToolCall based on tokenizer type
  - Creates ChatMessage with tool_calls containing FunctionCall
- Required Tool Choice
  - When request.tool_choice == "required"
  - Parses tool calls using TypeAdapter(list[FunctionDefinition]).validate_json()
  - Creates message with multiple tool calls
- No Tool Choice
  - When tool choice is None or "none"
  - Creates standard ChatMessage
- Auto Tool Choice
  - When tools exist and tool_choice is "auto" or None
  - Creates tool parser: tool_parser = self.tool_parser(tokenizer)
  - Extracts tool calls: tool_parser.extract_tool_calls()
  - Sets auto_tools_called based on whether tools were called
  - Creates appropriate message based on tool call results
- Fallback Case
  - Handles undetermined cases with error logging
  - Creates standard ChatMessage as fallback
Choice Creation
- Creates ChatCompletionResponseChoice with:
  - index, message, logprobs
  - finish_reason: "tool_calls" if auto tools called, otherwise output's finish reason
  - stop_reason: from output
- Appends to choices list
Echo Processing
- If request.echo is True:
  - Extracts last message content from conversation
  - Concatenates with generated content for each choice
  - Updates choice.message.content
Usage Statistics Calculation
- Calculates token counts:
  - num_prompt_tokens: from prompt_token_ids and encoder_prompt_token_ids
  - num_generated_tokens: sum of all output token_ids
- Creates UsageInfo object with token statistics
- Adds prompt token details if enabled and cached tokens exist
Final Response Creation
- Sets request_metadata.final_usage_info = usage
- Creates ChatCompletionResponse with:
  - request_id, created_time, model_name
  - choices, usage, prompt_logprobs
  - kv_transfer_params
- Returns the complete response
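
The core of the buffered path (drain the generator, keep the last result, build choices and usage) can be condensed into a short sketch. This is a simplified illustration under stated assumptions: ToyOutput, ToyRequestOutput, and toy_full_response() are hypothetical, plain dicts replace the response models, the last yielded item is assumed to carry the complete output, and logprobs, reasoning, tool calls, and echo handling are left out.

```python
import asyncio
import time
from dataclasses import dataclass, field


@dataclass
class ToyOutput:  # hypothetical stand-in for one completion output
    index: int
    text: str
    token_ids: list[int]
    finish_reason: str = "stop"


@dataclass
class ToyRequestOutput:  # hypothetical stand-in for RequestOutput
    prompt_token_ids: list[int]
    outputs: list[ToyOutput] = field(default_factory=list)


async def toy_full_response(result_generator, role: str = "assistant") -> dict:
    created_time = int(time.time())
    final_res = None
    async for res in result_generator:
        final_res = res  # keep only the most recent result
    assert final_res is not None

    choices = [{"index": o.index,
                "message": {"role": role, "content": o.text},
                "finish_reason": o.finish_reason}
               for o in final_res.outputs]
    num_prompt_tokens = len(final_res.prompt_token_ids)
    num_generated_tokens = sum(len(o.token_ids) for o in final_res.outputs)
    usage = {"prompt_tokens": num_prompt_tokens,
             "completion_tokens": num_generated_tokens,
             "total_tokens": num_prompt_tokens + num_generated_tokens}
    return {"object": "chat.completion", "created": created_time,
            "choices": choices, "usage": usage}


async def _toy_results():
    yield ToyRequestOutput([1, 2, 3], [ToyOutput(0, "Hel", [7])])
    yield ToyRequestOutput([1, 2, 3], [ToyOutput(0, "Hello!", [7, 8, 9])])


buffered = asyncio.run(toy_full_response(_toy_results()))
```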

Streaming Responses

Method Initialization
- Method signature accepts ChatCompletionRequest, AsyncIterator[RequestOutput], and metadata
- Sets up initial values:
  - created_time = int(time.time()): Current timestamp
  - chunk_object_type = "chat.completion.chunk": Fixed chunk type for streaming
  - first_iteration = True: Flag for first iteration handling
Choice and Token Tracking Setup
- Determines number of choices: num_choices = 1 if request.n is None else request.n
- Initializes tracking arrays:
  - previous_num_tokens = [0] * num_choices: Token count per choice
  - finish_reason_sent = [False] * num_choices: Completion status per choice
  - num_prompt_tokens = 0 and num_cached_tokens = None: Token counters
Tool Choice Configuration
- Extracts tool choice function name:
  - If ChatCompletionNamedToolChoiceParam: gets specific function name
  - Otherwise: sets to None
- Determines auto tool choice: tool_choice_auto using self._should_stream_with_auto_tool_parsing(request)
State Management Arrays Setup: Based on tool choice configuration:
- For auto tools or reasoning parser:
  - Creates previous_texts, all_previous_token_ids arrays
  - Sets up added_content_delta_arr, reasoning_end_arr for reasoning parser
- For required tool choice: Creates previous_texts only
- For standard chat: Sets arrays to None
Parser Initialization
- Reasoning Parser Setup:
  - Creates reasoning_parser = self.reasoning_parser(tokenizer)
  - On error: yields streaming error response and returns
- Tool Parser Setup:
  - If auto tools enabled: creates tool_parsers array with self.tool_parser(tokenizer)
  - Otherwise: sets to [None] * num_choices
  - On error: yields streaming error response and returns
Streaming Options Configuration
- Extracts stream_options from request
- Sets flags:
  - include_usage: Whether to include usage statistics
  - include_continuous_usage: Whether to include continuous usage stats
Main Streaming Loop
- Result Processing Loop: Iterates through result_generator with async for res in result_generator:
  - Token Count Calculation
    - Updates num_prompt_tokens from res.prompt_token_ids
    - Adds encoder prompt tokens if present
  - First Iteration Processing: When first_iteration = True:
    - Sets num_cached_tokens = res.num_cached_tokens
    - Gets response role: role = self.get_chat_request_role(request)
    - Initial Response Sending:
      - Creates ChatCompletionResponseStreamChoice with role and empty content
      - Creates ChatCompletionStreamResponse chunk
      - Adds usage info if include_continuous_usage is True
      - Yields formatted response: f"data: {data}\n\n"
    - Echo Processing: If request.echo is True, sends echoed input content
    - Sets first_iteration = False
  - Output Processing Loop: For each output in res.outputs:
    - Basic Setup
      - Gets output index and tool parser
      - Skips if finish reason already sent
      - Creates logprobs if requested using self._create_chat_logprobs()
      - Gets delta_text = output.text
      - Skips empty chunks in chunked prefill case
    - Text and Token State Update
      - If auto tools or reasoning parser enabled:
        - Updates previous_text, current_text, previous_token_ids, current_token_ids
    - Delta Message Processing Based on Tool Choice
      - Named Tool Choice:
        - If reasoning parser active and not at reasoning end:
          - Uses reasoning_parser.extract_reasoning_content_streaming()
        - Otherwise:
          - Creates DeltaToolCall with function name and arguments
          - Uses random_tool_call_id() for tool call ID
      - Required Tool Choice:
        - Uses self.extract_tool_call_required_streaming() to extract tool calls
        - Updates previous text state
      - Auto Tool Choice + Reasoning Parser:
        - If reasoning not ended: processes reasoning content
        - After reasoning ends: processes tool calls using tool_parser.extract_tool_calls_streaming()
      - Auto Tool Choice Only:
        - Uses tool_parser.extract_tool_calls_streaming() directly
      - Reasoning Parser Only:
        - Uses reasoning_parser.extract_reasoning_content_streaming()
      - Standard Content:
        - Creates simple DeltaMessage(content=delta_text)
    - State Updates
      - Updates previous_texts and all_previous_token_ids arrays
      - Increments previous_num_tokens[i] with token count
      - Skips iteration if delta_message is None
    - Response Generation
      - Ongoing Generation:
        - Creates ChatCompletionResponseStreamChoice with delta message
      - Completion Handling:
        - Detects auto tools called: auto_tools_called = len(tool_parser.prev_tool_call_arr) > 0
        - Unstreamed Token Check:
          - Uses self._should_check_for_unstreamed_tool_arg_tokens()
          - Compares expected vs actual streamed arguments
          - Sends remaining arguments if needed
        - Creates final choice with appropriate finish_reason
        - Sets finish_reason_sent[i] = True
    - Chunk Creation and Yielding
      - Creates ChatCompletionStreamResponse chunk
      - Adds continuous usage stats if requested
      - Yields formatted chunk: f"data: {data}\n\n"
Final Usage Statistics
- If include_usage is True:
  - Calculates total completion tokens
  - Creates UsageInfo with final statistics
  - Adds prompt token details if enabled
  - Yields final usage chunk
Metadata and Error Handling
- Sets request_metadata.final_usage_info with aggregate usage
- Exception Handling: Catches all exceptions and yields error response
- Final Response: Yields "data: [DONE]\n\n" to signal completion
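
The framing of the stream (a role chunk first, then content deltas, an optional usage chunk, and the [DONE] sentinel) can be sketched in a few lines. This is a simplified illustration, not the actual method: toy_stream() and its (delta_text, num_tokens, finish_reason) input tuples are hypothetical, plain dicts replace the response models, and tool calls, reasoning parsing, logprobs, and multiple choices are omitted.

```python
import asyncio
import json


async def toy_stream(deltas, model: str = "toy-model", include_usage: bool = True):
    """Yield SSE-framed chunks: role, content deltas, optional usage, then [DONE]."""

    def frame(choices=None, **extra) -> str:
        chunk = {"object": "chat.completion.chunk", "model": model,
                 "choices": choices or [], **extra}
        return f"data: {json.dumps(chunk)}\n\n"

    first_iteration = True
    completion_tokens = 0
    async for delta_text, num_tokens, finish_reason in deltas:
        if first_iteration:
            # The first chunk announces the role with empty content.
            yield frame([{"index": 0,
                          "delta": {"role": "assistant", "content": ""},
                          "finish_reason": None}])
            first_iteration = False
        completion_tokens += num_tokens
        yield frame([{"index": 0, "delta": {"content": delta_text},
                      "finish_reason": finish_reason}])
    if include_usage:
        # The usage chunk carries no choices.
        yield frame(usage={"completion_tokens": completion_tokens})
    yield "data: [DONE]\n\n"


async def _toy_deltas():
    yield "Hel", 1, None
    yield "lo!", 2, "stop"


async def _collect() -> list[str]:
    return [chunk async for chunk in toy_stream(_toy_deltas())]


sse_chunks = asyncio.run(_collect())
```

A client reconstructs the message by concatenating the delta content fields in order and stops reading at the [DONE] line.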

Conclusion

This comprehensive analysis of vLLM's /v1/chat/completions endpoint reveals the sophisticated architecture powering OpenAI-compatible inference serving.
The journey from a simple HTTP request to a complete chat response involves multiple layers of abstraction, each meticulously optimized for performance, scalability, and reliability.

Below is a sequence diagram summarizing this article:

sequenceDiagram
    participant Client
    participant FastAPI
    participant OpenAIServingChat as OpenAIServingChat(OpenAIServing)
    participant AsyncLLM as AsyncLLM(EngineClient)
    participant AsyncMPClient as AsyncMPClient(MPClient)
    participant ZMQ as ZeroMQ
    participant EngineCoreProc as EngineCoreProc(EngineCore)
    participant Scheduler as Scheduler(SchedulerInterface)
    participant UniProcExecutor as UniProcExecutor(UniProcExecutorV0|Executor)
    participant Worker as Worker(WorkerBase)
    participant GPUModelRunner as GPUModelRunner(LoRAModelRunnerMixin)
    participant OutputProcessor

    EngineCoreProc-->>EngineCoreProc: run_busy_loop()
    Client->>FastAPI: POST /v1/chat/completions
    FastAPI->>OpenAIServingChat: create_chat_completion(ChatCompletionRequest)

    OpenAIServingChat->>OpenAIServingChat: _check_model, _preprocess_chat, etc.

    OpenAIServingChat->>AsyncLLM: generate()
    AsyncLLM->>AsyncMPClient: add_request(EngineCoreRequest)
    AsyncMPClient->>ZMQ: add_request_async(EngineCoreRequest)
    EngineCoreProc->>ZMQ: _handle_client_request(EngineCoreRequestType)
    ZMQ-->>EngineCoreProc: add_request(EngineCoreRequest)
    EngineCoreProc->>Scheduler: add_request(Request)

    rect rgb(255,128,128)
        note over EngineCoreProc: step_fn()
        EngineCoreProc->>Scheduler: schedule()
        Scheduler-->>EngineCoreProc: SchedulerOutput
        EngineCoreProc->>UniProcExecutor: execute_model(SchedulerOutput)
        UniProcExecutor->>Worker: collective_rpc("execute_model")
        Worker->>GPUModelRunner: execute_model(SchedulerOutput)
        GPUModelRunner-->>Worker: ModelRunnerOutput | IntermediateTensors
        Worker-->>UniProcExecutor: ModelRunnerOutput
        UniProcExecutor-->>EngineCoreProc: ModelRunnerOutput
        EngineCoreProc->>Scheduler: update_from_output(SchedulerOutput, ModelRunnerOutput)
        Scheduler-->>EngineCoreProc: EngineCoreOutputs
    end
    EngineCoreProc-->>EngineCoreProc: put_nowait(EngineCoreOutputs)
    EngineCoreProc->>ZMQ: process_output_socket()
    rect rgb(128,128,255)
        note over AsyncLLM: output_handler()
        AsyncLLM->>AsyncMPClient: get_output_async()
        AsyncMPClient->>ZMQ: process_outputs_socket()
        ZMQ-->>AsyncLLM: EngineCoreOutputs
        AsyncLLM->>OutputProcessor: process_outputs()
        OutputProcessor-->>AsyncLLM: OutputProcessorOutput
    end
    AsyncLLM-->>OpenAIServingChat: AsyncGenerator[RequestOutput, None]

    OpenAIServingChat-->>FastAPI: ChatCompletionResponse / AsyncGenerator
    FastAPI-->>Client: JSONResponse / StreamingResponse

The structure turned out to be much more complex than I expected, so this article is quite lengthy even with many parts omitted.
In future articles, I'll take a closer look at core components like EngineCoreProc(EngineCore), Scheduler(SchedulerInterface), and GPUModelRunner(LoRAModelRunnerMixin).

References

OpenAI Platform: Create completion

OpenAI Platform: Create chat completion

Code Review: Deep Dive into vLLM's Architecture and Implementation Analysis of OpenAI-Compatible Serving (1/2)

Hyogeun Oh (오효근) — Sun, 15 Jun 2025 14:52:40 +0000

Introduction

vLLM [1, 2] is a fast and easy-to-use library for LLM inference and serving.
Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry. [3]

The rapid advancement of Large Language Models (LLMs) has brought efficient model serving and inference optimization to the forefront of MLOps concerns.
In response to these challenges, vLLM has emerged as a leading solution, garnering significant attention with 49.2k stars on GitHub as of June 9, 2025.
As demonstrated in the star history graph below, vLLM has established itself as the most prominent LLM serving framework among various competing solutions.

A particularly noteworthy aspect is the standardized API interface provided by OpenAI's GPT series.
With countless developers already building applications based on this API specification, ensuring compatibility has become crucial for any LLM serving solution.
This article provides a comprehensive analysis of vLLM's core technological foundations and examines the internal implementation processes that enable OpenAI-compatible server deployment when executing the vllm serve command.

This article is based on vLLM version v0.9.0.1.
Implementation details and API specifications may vary in newer versions.

Theoretical Background

PagedAttention

vLLM Blog: vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention [4]
vLLM utilizes PagedAttention, our new attention algorithm that effectively manages attention keys and values.
vLLM equipped with PagedAttention redefines the new state of the art in LLM serving: it delivers up to 24x higher throughput than HuggingFace Transformers, without requiring any model architecture changes.

SOSP 2023 (ACM Symposium on Operating Systems Principles): Efficient Memory Management for Large Language Model Serving with PagedAttention [5]
PagedAttention divides the request's KV cache into blocks, each of which can contain the attention keys and values of a fixed number of tokens.
In PagedAttention, the blocks for the KV cache are not necessarily stored in contiguous space.
Therefore, we can manage the KV cache in a more flexible way as in OS's virtual memory: one can think of blocks as pages, tokens as bytes, and requests as processes.

Core Concepts and Motivation

PagedAttention represents one of the fundamental technological breakthroughs that distinguishes vLLM from other LLM serving frameworks.
This innovative approach addresses the critical memory management challenges inherent in large-scale language model deployment.

Traditional attention mechanisms in transformer models face significant memory inefficiencies when handling variable-length sequences.
The primary challenges include:

Memory Fragmentation: Pre-allocated contiguous memory blocks for maximum sequence length
Inefficient Utilization: Wasted memory when actual sequences are shorter than pre-allocated space
Static Batching: Limited flexibility in request batching and scheduling
Poor Scalability: KV cache memory grows linearly with sequence length and batch size, so per-request pre-allocation quickly caps the number of concurrent requests
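
To put rough numbers on the pre-allocation problem, here is a back-of-the-envelope calculation. The model dimensions are hypothetical (they do not describe a specific model from this article), and the formula assumes a standard KV cache that stores one key and one value vector per layer, per KV head, per token.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    # Factor 2: one key vector and one value vector per layer and head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def preallocation_waste(max_len: int, actual_len: int) -> float:
    """Fraction of a max-length reservation that is never used."""
    return 1 - actual_len / max_len


# Hypothetical model: 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes).
per_token = kv_bytes_per_token(32, 32, 128, 2)
reserved_bytes = per_token * 2048   # reservation for a 2048-token maximum
used_bytes = per_token * 512        # the request actually produced 512 tokens
wasted_fraction = preallocation_waste(2048, 512)
```

Under these assumptions each token costs 0.5 MiB of KV cache, so a 2048-token reservation holds 1 GiB while a 512-token request uses only a quarter of it.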

Virtual Memory Paradigm

PagedAttention revolutionizes memory management by implementing a virtual memory system inspired by operating system design principles.
The key innovation lies in managing the KV cache the way an operating system manages virtual memory:

Pages → Blocks: Fixed-size memory blocks containing attention keys and values
Bytes → Tokens: Individual tokens within each block
Processes → Requests: Individual inference requests with their own virtual address spaces

Technical Implementation

The conventional approach allocates contiguous memory blocks for each sequence, leading to substantial memory fragmentation and inefficient GPU utilization.
PagedAttention breaks this paradigm by:

Block-based Memory Management: Dividing each request's KV cache into fixed-size blocks
Dynamic Memory Allocation: Enabling efficient memory reuse across different requests
Reduced Memory Fragmentation: Minimizing wasted memory through intelligent block allocation
Copy-on-Write Semantics: Sharing identical prefixes across multiple requests
Non-contiguous Storage: Blocks can be stored anywhere in memory, linked through logical addressing
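
The block-table idea behind these points can be shown with a small allocator. This is a teaching sketch, not vLLM's block manager: ToyBlockAllocator is hypothetical, it tracks only block IDs rather than tensors, and it leaves out copy-on-write and prefix sharing.

```python
class ToyBlockAllocator:
    """Maps each request's logical blocks to physical block IDs on demand."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}
        self.num_tokens: dict[str, int] = {}

    def append_token(self, request_id: str) -> int:
        table = self.block_tables.setdefault(request_id, [])
        n = self.num_tokens.get(request_id, 0)
        if n == len(table) * self.block_size:
            # All current blocks are full: take any free physical block.
            # (pop on an empty list raises IndexError, i.e. out of memory.)
            table.append(self.free_blocks.pop(0))
        self.num_tokens[request_id] = n + 1
        return table[n // self.block_size]  # physical block holding this token

    def free(self, request_id: str) -> None:
        # Finished requests return their blocks for immediate reuse.
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)


alloc = ToyBlockAllocator(num_blocks=4, block_size=2)
alloc.append_token("a")  # request "a" gets physical block 0
alloc.append_token("b")  # request "b" gets physical block 1
alloc.append_token("a")  # still fits in block 0
alloc.append_token("a")  # block 0 is full, so "a" gets block 2
table_a = list(alloc.block_tables["a"])
alloc.free("a")
```

Because requests interleave, request "a" ends up with physical blocks 0 and 2, which are not adjacent; the block table is what keeps them logically contiguous.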

Performance Benefits

PagedAttention keeps KV cache waste close to zero, and the paper reports 2-4x higher throughput at the same level of latency compared to systems such as FasterTransformer and Orca, which translates directly into reduced infrastructure costs. [5]

The performance improvements stem from several key optimizations:

Memory Efficiency: Reduced memory footprint through dynamic allocation
Throughput Enhancement: Higher concurrent request handling capacity
Latency Reduction: Faster memory access patterns and reduced copying overhead
Scalability: Linear scaling with hardware resources rather than memory-bound limitations

Comparative Analysis

Aspect	Traditional Attention	PagedAttention
Memory Allocation	Contiguous blocks per sequence	Fixed-size blocks (non-contiguous)
Memory Utilization	Pre-allocated for max sequence length	Dynamic allocation as needed
Memory Fragmentation	High fragmentation when sequences end	Minimal fragmentation through block reuse
Prefix Sharing	Not supported	Efficient sharing of common prefixes
Batch Management	Static batching	Continuous batching with dynamic scheduling
Memory Efficiency	Baseline	Near-zero waste in KV cache memory [5]
Throughput	Limited by memory constraints	2-4x higher than FasterTransformer and Orca [5]; up to 24x higher than HuggingFace Transformers [4]
GPU Utilization	Suboptimal due to fragmentation	Optimized through intelligent block allocation
Scalability	Limited by contiguous memory requirements	High scalability with virtual memory approach

Note: A comprehensive mathematical analysis of PagedAttention's algorithmic foundations will be covered in an upcoming paper review post, where I'll dive deeper into the theoretical underpinnings and formal proofs of its efficiency guarantees.

OpenAI-Compatible Server

The OpenAI API has become the de facto standard for LLM interactions, establishing a unified interface that countless applications and services depend upon. [6]
vLLM's OpenAI-compatible server implementation represents a critical bridge between high-performance serving capabilities and industry-standard API compatibility.

vLLM implements the essential OpenAI API endpoints alongside several vLLM-specific ones:

Endpoint	HTTP Method	OpenAI Compatible	Description
`/v1/models`	`GET`	✅	List available models
`/v1/completions`	`POST`	✅	Text completions for single prompts
`/v1/chat/completions`	`POST`	✅	Chat completions with message history
`/v1/embeddings`	`POST`	✅	Generate text embeddings
`/health`	`GET`	❌	vLLM-specific health check
`/tokenize`	`POST`	❌	vLLM-specific tokenization
`/detokenize`	`POST`	❌	vLLM-specific detokenization
`/metrics`	`GET`	❌	Prometheus-compatible metrics

Hands-On

This section demonstrates how to deploy and interact with vLLM's OpenAI-compatible server in practice.
I'll walk through the installation process, server startup, and explore the automatically generated API documentation.

Installation

vLLM can be installed with different backend configurations depending on your hardware setup [7]:

# Default installation (the pre-built wheels target NVIDIA CUDA GPUs)
$ uv pip install vllm

# GPU installation with automatic PyTorch backend detection
$ uv pip install vllm --torch-backend=auto

Server Deployment

Launch the vLLM server with a lightweight model for demonstration purposes [8]:

$ vllm serve Qwen/Qwen3-0.6B --max-model-len 8192
...
INFO 06-09 23:16:17 [api_server.py:1336] Starting vLLM API server on http://0.0.0.0:8000
INFO 06-09 23:16:17 [launcher.py:28] Available routes are:
INFO 06-09 23:16:17 [launcher.py:36] Route: /openapi.json, Methods: GET, v0.9.0.1
INFO 06-09 23:16:17 [launcher.py:36] Route: /docs, Methods: GET, v0.9.0.1
INFO 06-09 23:16:17 [launcher.py:36] Route: /docs/oauth2-redirect, Methods: GET, v0.9.0.1
INFO 06-09 23:16:17 [launcher.py:36] Route: /redoc, Methods: GET, v0.9.0.1
INFO 06-09 23:16:17 [launcher.py:36] Route: /health, Methods: GET
INFO 06-09 23:16:17 [launcher.py:36] Route: /load, Methods: GET
INFO 06-09 23:16:17 [launcher.py:36] Route: /ping, Methods: POST
INFO 06-09 23:16:17 [launcher.py:36] Route: /ping, Methods: GET
INFO 06-09 23:16:17 [launcher.py:36] Route: /tokenize, Methods: POST
INFO 06-09 23:16:17 [launcher.py:36] Route: /detokenize, Methods: POST
INFO 06-09 23:16:17 [launcher.py:36] Route: /v1/models, Methods: GET
INFO 06-09 23:16:17 [launcher.py:36] Route: /version, Methods: GET
INFO 06-09 23:16:17 [launcher.py:36] Route: /v1/chat/completions, Methods: POST
INFO 06-09 23:16:17 [launcher.py:36] Route: /v1/completions, Methods: POST
INFO 06-09 23:16:17 [launcher.py:36] Route: /v1/embeddings, Methods: POST
INFO 06-09 23:16:17 [launcher.py:36] Route: /pooling, Methods: POST
INFO 06-09 23:16:17 [launcher.py:36] Route: /classify, Methods: POST
INFO 06-09 23:16:17 [launcher.py:36] Route: /score, Methods: POST
INFO 06-09 23:16:17 [launcher.py:36] Route: /v1/score, Methods: POST
INFO 06-09 23:16:17 [launcher.py:36] Route: /v1/audio/transcriptions, Methods: POST
INFO 06-09 23:16:17 [launcher.py:36] Route: /rerank, Methods: POST
INFO 06-09 23:16:17 [launcher.py:36] Route: /v1/rerank, Methods: POST
INFO 06-09 23:16:17 [launcher.py:36] Route: /v2/rerank, Methods: POST
INFO 06-09 23:16:17 [launcher.py:36] Route: /invocations, Methods: POST
INFO 06-09 23:16:17 [launcher.py:36] Route: /metrics, Methods: GET
INFO:     Started server process [16355]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

Interactive API Documentation

vLLM automatically generates interactive API documentation, accessible via Swagger UI at the /docs route (http://localhost:8000/docs with the settings above):

The Swagger interface provides:

Interactive Testing: Direct API endpoint testing from the browser
Schema Documentation: Complete request/response schema definitions
Parameter Validation: Real-time parameter validation and examples
Authentication Setup: Easy API key configuration for testing

Practical API Usage Examples

Once the server is running, you can interact with it using standard OpenAI-compatible clients:

>>> import openai
>>> client = openai.OpenAI(base_url="http://localhost:8000/v1")
>>> client.models.list()
SyncPage[Model](data=[Model(id='Qwen/Qwen3-0.6B', created=1749651810, object='model', owned_by='vllm', root='Qwen/Qwen3-0.6B', parent=None, max_model_len=8192, permission=[{'id': 'modelperm-8bc1b4000ad84fac81f2de0addc81ef6', 'object': 'model_permission', 'created': 1749651810, 'allow_create_engine': False, 'allow_sampling': True, 'allow_logprobs': True, 'allow_search_indices': False, 'allow_view': True, 'allow_fine_tuning': False, 'organization': '*', 'group': None, 'is_blocking': False}])], object='list')
>>> client.chat.completions.create(model="Qwen/Qwen3-0.6B", messages=[{"role": "user", "content": "Hello, vLLM!"}])
ChatCompletion(id='chatcmpl-d4ecd72df87c4b13a8b9d47ddcb75ccc', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='<think>\nOkay, the user just said "Hello, vLLM!" so I need to respond in a friendly and helpful way. Let me start by acknowledging their greeting. Maybe say something like "Hello! How can I assist you today?" to show I\'m here to help. I should keep the tone positive and open-ended so they can ask more questions. Let me check if there\'s anything else they might need, like setup instructions or support. I\'ll make sure to offer assistance in both technical and general ways. Alright, that should cover it.\n</think>\n\nHello! How can I assist you today? 😊', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=[], reasoning_content=None), stop_reason=None)], created=1749651812, model='Qwen/Qwen3-0.6B', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=126, prompt_tokens=14, total_tokens=140, completion_tokens_details=None, prompt_tokens_details=None), prompt_logprobs=None, kv_transfer_params=None)

Monitoring and Metrics

vLLM supports Prometheus-based metrics collection through the /metrics endpoint. [9, 10]
This enables real-time monitoring through Grafana dashboards. [11, 12]

$ curl http://localhost:8000/metrics
# HELP vllm:num_preemptions_total Cumulative number of preemption from the engine.
# TYPE vllm:num_preemptions_total counter
vllm:num_preemptions_total{model_name="Qwen/Qwen3-0.6B"} 0.0
# HELP vllm:prompt_tokens_total Number of prefill tokens processed.
# TYPE vllm:prompt_tokens_total counter
vllm:prompt_tokens_total{model_name="Qwen/Qwen3-0.6B"} 28.0
...
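
The /metrics output uses the Prometheus text format, so individual samples are easy to pull out for a quick check without a full monitoring stack. The parser below is a minimal sketch written for the lines shown above: parse_metrics() is a hypothetical helper, and it deliberately ignores timestamps and escaped characters inside label values.

```python
import re

# name, optional {labels}, then a single value
SAMPLE_LINE = re.compile(
    r'^(?P<name>[^\s{]+)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text: str) -> list[tuple[str, dict[str, str], float]]:
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # skip anything that is not "name{labels} value"
        labels = dict(LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels,
                        float(match.group("value"))))
    return samples


sample_text = """\
# HELP vllm:num_preemptions_total Cumulative number of preemption from the engine.
# TYPE vllm:num_preemptions_total counter
vllm:num_preemptions_total{model_name="Qwen/Qwen3-0.6B"} 0.0
# HELP vllm:prompt_tokens_total Number of prefill tokens processed.
# TYPE vllm:prompt_tokens_total counter
vllm:prompt_tokens_total{model_name="Qwen/Qwen3-0.6B"} 28.0
"""
parsed_samples = parse_metrics(sample_text)
```

For anything beyond ad hoc inspection, a Prometheus server or an official client library parser is the safer choice.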

Server Initialization

This section reviews the entire process from when a user starts an OpenAI-compatible server through vllm serve until the server reaches its ready state.

CLI

When the vllm serve command is executed in the terminal, it runs the main() function from vllm/entrypoints/cli/main.py through the vllm command defined in the project.scripts section of pyproject.toml.

pyproject.toml


...
[project.scripts]
vllm = "vllm.entrypoints.cli.main:main"
...

Subsequently, the main() function recognizes the serve command through the subparser and executes the dispatch_function, which is the ServeSubcommand.cmd() function.

vllm/entrypoints/cli/main.py


...
def main():
    cli_env_setup()

    parser = FlexibleArgumentParser(
        description="vLLM CLI",
        epilog=VLLM_SERVE_PARSER_EPILOG,
    )
    parser.add_argument('-v',
                        '--version',
                        action='version',
                        version=vllm.version.__version__)
    subparsers = parser.add_subparsers(required=False, dest="subparser")
    cmds = {}
    for cmd_module in CMD_MODULES:
        new_cmds = cmd_module.cmd_init()
        for cmd in new_cmds:
            cmd.subparser_init(subparsers).set_defaults(
                dispatch_function=cmd.cmd)
            cmds[cmd.name] = cmd
    args = parser.parse_args()
    if args.subparser in cmds:
        cmds[args.subparser].validate(args)

    if hasattr(args, "dispatch_function"):
        args.dispatch_function(args)
    else:
        parser.print_help()
...
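
The dispatch pattern in main() relies only on standard argparse features, so it can be reproduced in isolation. This is a minimal sketch under my own assumptions: serve(), build_parser(), and run() are hypothetical names, and the handler returns a string instead of starting a server.

```python
import argparse


def serve(args: argparse.Namespace) -> str:
    # Hypothetical handler standing in for ServeSubcommand.cmd().
    return f"serving {args.model_tag}"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="demo")
    subparsers = parser.add_subparsers(required=False, dest="subparser")
    serve_parser = subparsers.add_parser("serve")
    serve_parser.add_argument("model_tag", nargs="?", default=None)
    # set_defaults() stores the handler on the namespace this subparser produces.
    serve_parser.set_defaults(dispatch_function=serve)
    return parser


def run(argv: list[str]) -> str:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Only namespaces produced by a subcommand carry dispatch_function.
    if hasattr(args, "dispatch_function"):
        return args.dispatch_function(args)
    return parser.format_help()
```

With no subcommand, the namespace has no dispatch_function attribute, which is why main() falls back to printing help.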

ServeSubcommand.cmd() then passes the user-specified args to run_server(), and uvloop.run() drives that coroutine to start the OpenAI-compatible server.

vllm/entrypoints/cli/serve.py


...
class ServeSubcommand(CLISubcommand):
...
    @staticmethod
    def cmd(args: argparse.Namespace) -> None:
        # If model is specified in CLI (as positional arg), it takes precedence
        if hasattr(args, 'model_tag') and args.model_tag is not None:
            args.model = args.model_tag

        if args.headless:
            run_headless(args)
        else:
            uvloop.run(run_server(args))
...

Engine

Client

The engine_client is created inside an async with block so that it is shut down and cleaned up when the server exits or fails.

vllm/entrypoints/openai/api_server.py


...
async def run_server(args, **uvicorn_kwargs) -> None:
...
    async with build_async_engine_client(args) as engine_client:
        app = build_app(args)
...

Based on the user-configured args, build_async_engine_client_from_engine_args() decides whether to use the V0 or V1 engine (through envs.VLLM_USE_V1) and yields the created engine client.

vllm/entrypoints/openai/api_server.py


...
@asynccontextmanager
async def build_async_engine_client(
        args: Namespace) -> AsyncIterator[EngineClient]:

    # Context manager to handle engine_client lifecycle
    # Ensures everything is shutdown and cleaned up on error/exit
    engine_args = AsyncEngineArgs.from_cli_args(args)

    async with build_async_engine_client_from_engine_args(
            engine_args, args.disable_frontend_multiprocessing) as engine:
        yield engine
...

vllm/entrypoints/openai/api_server.py


...
@asynccontextmanager
async def build_async_engine_client_from_engine_args(
    engine_args: AsyncEngineArgs,
    disable_frontend_multiprocessing: bool = False,
) -> AsyncIterator[EngineClient]:
    """
    Create EngineClient, either:
        - in-process using the AsyncLLMEngine Directly
        - multiprocess using AsyncLLMEngine RPC

    Returns the Client or None if the creation failed.
    """

    # Create the EngineConfig (determines if we can use V1).
    usage_context = UsageContext.OPENAI_API_SERVER
    vllm_config = engine_args.create_engine_config(usage_context=usage_context)

    # V1 AsyncLLM.
    if envs.VLLM_USE_V1:
        if disable_frontend_multiprocessing:
            logger.warning(
                "V1 is enabled, but got --disable-frontend-multiprocessing. "
                "To disable frontend multiprocessing, set VLLM_USE_V1=0.")

        from vllm.v1.engine.async_llm import AsyncLLM
        async_llm: Optional[AsyncLLM] = None
        try:
            async_llm = AsyncLLM.from_vllm_config(
                vllm_config=vllm_config,
                usage_context=usage_context,
                disable_log_requests=engine_args.disable_log_requests,
                disable_log_stats=engine_args.disable_log_stats)

            # Don't keep the dummy data in memory
            await async_llm.reset_mm_cache()

            yield async_llm
        finally:
            if async_llm:
                async_llm.shutdown()
...
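
The try/finally inside an @asynccontextmanager function is what guarantees shutdown. Here is a minimal sketch of that lifecycle using only the standard library; FakeEngine and build_engine are hypothetical stand-ins, not vLLM classes.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import AsyncIterator, Optional


class FakeEngine:
    """Hypothetical stand-in for an engine client."""

    def __init__(self) -> None:
        self.running = True

    def shutdown(self) -> None:
        self.running = False


@asynccontextmanager
async def build_engine() -> AsyncIterator[FakeEngine]:
    engine: Optional[FakeEngine] = None
    try:
        engine = FakeEngine()
        yield engine  # the body of `async with` runs here
    finally:
        # Runs on normal exit and when the body raises.
        if engine:
            engine.shutdown()


async def normal_exit() -> tuple[bool, bool]:
    async with build_engine() as engine:
        inside = engine.running
    return inside, engine.running


async def failing_exit() -> bool:
    seen = []
    try:
        async with build_engine() as engine:
            seen.append(engine)
            raise RuntimeError("simulated server failure")
    except RuntimeError:
        pass
    return seen[0].running
```

Both asyncio.run(normal_exit()) and asyncio.run(failing_exit()) leave the engine shut down, which mirrors the "cleaned up on error/exit" comment in the code above.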

During this process, AsyncLLM creates an AsyncMPClient (or a DPAsyncMPClient when data_parallel_size is greater than 1) to manage and communicate with the core engine.

vllm/v1/engine/async_llm.py


...
class AsyncLLM(EngineClient):
...
    def __init__(
        self,
        vllm_config: VllmConfig,
        executor_class: type[Executor],
        log_stats: bool,
        usage_context: UsageContext = UsageContext.ENGINE_CONTEXT,
        mm_registry: MultiModalRegistry = MULTIMODAL_REGISTRY,
        use_cached_outputs: bool = False,
        log_requests: bool = True,
        start_engine_loop: bool = True,
        stat_loggers: Optional[list[StatLoggerFactory]] = None,
    ) -> None:
...
        # EngineCore (starts the engine in background process).
        core_client_class = AsyncMPClient if (
            vllm_config.parallel_config.data_parallel_size
            == 1) else DPAsyncMPClient

        self.engine_core = core_client_class(
            vllm_config=vllm_config,
            executor_class=executor_class,
            log_stats=self.log_stats,
        )
...

vllm/v1/engine/core_client.py


...
class AsyncMPClient(MPClient):
    """Asyncio-compatible client for multi-proc EngineCore."""

    def __init__(self, vllm_config: VllmConfig, executor_class: type[Executor],
                 log_stats: bool):
        super().__init__(
            asyncio_mode=True,
            vllm_config=vllm_config,
            executor_class=executor_class,
            log_stats=log_stats,
        )
...

vllm/v1/engine/core_client.py


...
class MPClient(EngineCoreClient):
...
    def __init__(
        self,
        asyncio_mode: bool,
        vllm_config: VllmConfig,
        executor_class: type[Executor],
        log_stats: bool,
    ):
...
            # Start local engines.
            if local_engine_count:
                # In server mode, start_index and local_start_index will
                # both be 0.
                self.resources.local_engine_manager = CoreEngineProcManager(
                    EngineCoreProc.run_engine_core,
                    vllm_config=vllm_config,
                    executor_class=executor_class,
                    log_stats=log_stats,
                    input_address=input_address,
                    on_head_node=True,
                    local_engine_count=local_engine_count,
                    start_index=start_index,
                    local_start_index=local_start_index)
...

The AsyncMPClient manages the core engine process through CoreEngineProcManager and communicates with it over a ZMQ IPC socket.
The actual core engine runs as a separate background process for improved isolation and performance.
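
The client/engine split can be pictured as a request queue and a response queue around a busy loop. The sketch below is a deliberately simplified analogy of my own: it uses a thread and queue.Queue where vLLM uses a separate process and ZMQ sockets, and ToyClient and engine_loop are hypothetical names.

```python
import queue
import threading


def engine_loop(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Busy loop: take a request, do the work, send the result back."""
    while True:
        request = inbox.get()
        if request is None:  # shutdown sentinel
            break
        request_id, prompt = request
        outbox.put((request_id, prompt.upper()))  # placeholder for real inference


class ToyClient:
    """Owns the background loop and talks to it only through the queues."""

    def __init__(self) -> None:
        self.inbox: queue.Queue = queue.Queue()
        self.outbox: queue.Queue = queue.Queue()
        self.worker = threading.Thread(
            target=engine_loop, args=(self.inbox, self.outbox), daemon=True)
        self.worker.start()

    def call(self, request_id: int, prompt: str) -> tuple[int, str]:
        self.inbox.put((request_id, prompt))
        return self.outbox.get(timeout=5)

    def shutdown(self) -> None:
        self.inbox.put(None)
        self.worker.join(timeout=5)
```

The point of the structure is that the client never calls engine code directly; it only sends messages and waits for replies.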

Core

The core engine is created as a background process as shown below.
I will examine its detailed operation in the next article.

vllm/v1/engine/core.py


...
class EngineCoreProc(EngineCore):
    """ZMQ-wrapper for running EngineCore in background process."""
...
    @staticmethod
    def run_engine_core(*args,
                        dp_rank: int = 0,
                        local_dp_rank: int = 0,
                        **kwargs):
        """Launch EngineCore busy loop in background process."""
...
        engine_core: Optional[EngineCoreProc] = None
        try:
            parallel_config: ParallelConfig = kwargs[
                "vllm_config"].parallel_config
            if parallel_config.data_parallel_size > 1 or dp_rank > 0:
                # Set data parallel rank for this engine process.
                parallel_config.data_parallel_rank = dp_rank
                parallel_config.data_parallel_rank_local = local_dp_rank
                engine_core = DPEngineCoreProc(*args, **kwargs)
            else:
                engine_core = EngineCoreProc(*args, **kwargs)
...

Executor

During engine creation, the executor class is chosen from the parallel configuration supplied by the user.

vllm/v1/engine/async_llm.py


...
class AsyncLLM(EngineClient):
...
    @classmethod
    def from_vllm_config(
        cls,
        vllm_config: VllmConfig,
        start_engine_loop: bool = True,
        usage_context: UsageContext = UsageContext.ENGINE_CONTEXT,
        stat_loggers: Optional[list[StatLoggerFactory]] = None,
        disable_log_requests: bool = False,
        disable_log_stats: bool = False,
    ) -> "AsyncLLM":
...
        # Create the LLMEngine.
        return cls(
            vllm_config=vllm_config,
            executor_class=Executor.get_class(vllm_config),
            start_engine_loop=start_engine_loop,
            stat_loggers=stat_loggers,
            log_requests=not disable_log_requests,
            log_stats=not disable_log_stats,
            usage_context=usage_context,
        )
...

At this point, Executor.get_class() determines which executor class to use.

vllm/v1/executor/abstract.py


...
class Executor(ExecutorBase):
    """
    Abstract class for v1 executors, mainly define some methods for v1.
    For methods shared by v0 and v1, define them in ExecutorBase"""

    @staticmethod
    def get_class(vllm_config: VllmConfig) -> type["Executor"]:
        executor_class: type[Executor]
        parallel_config = vllm_config.parallel_config
        distributed_executor_backend = (
            parallel_config.distributed_executor_backend)
        # distributed_executor_backend must be set in VllmConfig.__post_init__
        if isinstance(distributed_executor_backend, type):
            if not issubclass(distributed_executor_backend, ExecutorBase):
                raise TypeError(
                    "distributed_executor_backend must be a subclass of "
                    f"ExecutorBase. Got {distributed_executor_backend}.")
            executor_class = distributed_executor_backend
        elif distributed_executor_backend == "ray":
            from vllm.v1.executor.ray_distributed_executor import (  # noqa
                RayDistributedExecutor)
            executor_class = RayDistributedExecutor
        elif distributed_executor_backend == "mp":
            from vllm.v1.executor.multiproc_executor import MultiprocExecutor
            executor_class = MultiprocExecutor
        elif distributed_executor_backend == "uni":
            executor_class = UniProcExecutor
        elif distributed_executor_backend == "external_launcher":
            # TODO: make v1 scheduling deterministic
            # to support external launcher
            executor_class = ExecutorWithExternalLauncher
        else:
            raise ValueError("Unknown distributed executor backend: "
                             f"{distributed_executor_backend}")
        return executor_class
...

The value of distributed_executor_backend is itself resolved in ParallelConfig.__post_init__(), shown below.

When the user does not set it explicitly, the system selects a distributed backend based on the hardware configuration and execution environment.
Key decision factors include:

Single Node vs Multi-Node: Determines whether to use multiprocessing (mp) or Ray (ray) backend
GPU Availability: Checks CUDA device count against world size requirements
Ray Environment: Detects existing Ray initialization and placement groups
Platform-Specific: Neuron devices use the single-process (uni) backend, because one process controls multiple devices

When world_size is greater than 1, the default is mp unless one of the conditions above selects uni or ray. When world_size is 1, the uni backend is used.

vllm/config.py


...
@config
@dataclass
class ParallelConfig:
...
    def __post_init__(self) -> None:
...
        if self.distributed_executor_backend is None and self.world_size > 1:
            # We use multiprocessing by default if world_size fits on the
            # current node and we aren't in a ray placement group.

            from vllm.executor import ray_utils
            backend: DistributedExecutorBackend = "mp"
            ray_found = ray_utils.ray_is_available()
            if current_platform.is_neuron():
                # neuron uses single process to control multiple devices
                backend = "uni"
            elif (current_platform.is_cuda()
                  and cuda_device_count_stateless() < self.world_size):
                if not ray_found:
                    raise ValueError("Unable to load Ray which is "
                                     "required for multi-node inference, "
                                     "please install Ray with `pip install "
                                     "ray`.") from ray_utils.ray_import_err
                backend = "ray"
            elif ray_found:
                if self.placement_group:
                    backend = "ray"
                else:
                    from ray import is_initialized as ray_is_initialized
                    if ray_is_initialized():
                        from ray.util import get_current_placement_group
                        if get_current_placement_group():
                            backend = "ray"
            self.distributed_executor_backend = backend
            logger.info("Defaulting to use %s for distributed inference",
                        backend)

        if self.distributed_executor_backend is None and self.world_size == 1:
            self.distributed_executor_backend = "uni"
...
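
The branching above is easier to follow as a pure function. This is a minimal sketch of the same decision order, assuming no backend was set explicitly; choose_backend and its parameters are names I introduced, and in_placement_group collapses the two Ray placement-group checks into one flag.

```python
def choose_backend(
    world_size: int,
    *,
    is_neuron: bool = False,
    is_cuda: bool = True,
    cuda_device_count: int = 1,
    ray_available: bool = False,
    in_placement_group: bool = False,
) -> str:
    """Mirror the default-backend selection when none was set explicitly."""
    if world_size == 1:
        return "uni"

    backend = "mp"  # default when world_size fits on the current node
    if is_neuron:
        # One process controls multiple devices.
        backend = "uni"
    elif is_cuda and cuda_device_count < world_size:
        # Not enough local GPUs, so multi-node execution through Ray is required.
        if not ray_available:
            raise ValueError("Ray is required for multi-node inference")
        backend = "ray"
    elif ray_available and in_placement_group:
        backend = "ray"
    return backend
```

For example, two GPUs on one node with world_size=2 resolve to mp, while world_size=4 on the same node resolves to ray or raises if Ray is not installed.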

In this article, I will not cover tensor-parallel and pipeline-parallel inference [13, 14], so the analysis focuses on the UniProcExecutor.

The UniProcExecutor is designed for single-process execution scenarios where all model computation happens within a single process. This is the default choice for single-GPU deployments or when distributed execution is not required.

Key characteristics of UniProcExecutor:

Single Process: All computation occurs within one process
Direct Communication: No inter-process communication overhead
Simplified Architecture: Straightforward execution path without distributed coordination
Resource Efficiency: Minimal overhead for single-device scenarios

The _init_executor() method initializes the worker and sets up the execution environment. The rpc_rank parameter represents the rank of the worker within the executor, which UniProcExecutor fixes at 0 because there is only one worker.

vllm/executor/uniproc_executor.py


...
class UniProcExecutor(ExecutorBase):

    uses_ray: bool = False

    def _init_executor(self) -> None:
        """Initialize the worker and load the model.
        """
        self.driver_worker = WorkerWrapperBase(vllm_config=self.vllm_config,
                                               rpc_rank=0)
        distributed_init_method = get_distributed_init_method(
            get_ip(), get_open_port())
        local_rank = 0
        # set local rank as the device index if specified
        device_info = self.vllm_config.device_config.device.__str__().split(
            ":")
        if len(device_info) > 1:
            local_rank = int(device_info[1])
        rank = 0
        is_driver_worker = True
        kwargs = dict(
            vllm_config=self.vllm_config,
            local_rank=local_rank,
            rank=rank,
            distributed_init_method=distributed_init_method,
            is_driver_worker=is_driver_worker,
        )
        self.collective_rpc("init_worker", args=([kwargs], ))
        self.collective_rpc("init_device")
        self.collective_rpc("load_model")

    def collective_rpc(self,
                       method: Union[str, Callable],
                       timeout: Optional[float] = None,
                       args: Tuple = (),
                       kwargs: Optional[Dict] = None) -> List[Any]:
        if kwargs is None:
            kwargs = {}
        answer = run_method(self.driver_worker, method, args, kwargs)
        return [answer]
...
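
The collective_rpc() call above boils down to "look up a method on the worker and call it". The sketch below illustrates that idea with toy classes; the run_method shown here is my own simplified version that handles only a method name or a callable, and the real helper in vLLM may handle more cases.

```python
from functools import partial
from typing import Any, Callable, Optional, Union


def run_method(obj: Any, method: Union[str, Callable], args: tuple,
               kwargs: dict) -> Any:
    """Simplified dispatcher: call obj.<method>() or method(obj, ...)."""
    if isinstance(method, str):
        func = getattr(obj, method)  # look up a bound method by name
    else:
        func = partial(method, obj)  # treat a callable as func(obj, ...)
    return func(*args, **kwargs)


class ToyWorker:
    def __init__(self) -> None:
        self.loaded = False

    def load_model(self) -> str:
        self.loaded = True
        return "loaded"


class ToyExecutor:
    def __init__(self) -> None:
        self.driver_worker = ToyWorker()

    def collective_rpc(self, method: Union[str, Callable], args: tuple = (),
                       kwargs: Optional[dict] = None) -> list:
        answer = run_method(self.driver_worker, method, args, kwargs or {})
        # A single worker still returns a list, matching the multi-worker shape.
        return [answer]
```

Returning a one-element list keeps the call site identical whether one worker or many workers answer.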

Worker

UniProcExecutor.driver_worker is an instance of the WorkerWrapperBase class shown below, and its methods are invoked through the collective_rpc method of UniProcExecutor.

vllm/worker/worker_base.py


...
class WorkerWrapperBase:
    """
    This class represents one process in an executor/engine. It is responsible
    for lazily initializing the worker and handling the worker's lifecycle.
    We first instantiate the WorkerWrapper, which remembers the worker module
    and class name. Then, when we call `update_environment_variables`, and the
    real initialization happens in `init_worker`.
    """

    def __init__(
        self,
        vllm_config: VllmConfig,
        rpc_rank: int = 0,
    ) -> None:
        """
        Initialize the worker wrapper with the given vllm_config and rpc_rank.
        Note: rpc_rank is the rank of the worker in the executor. In most cases,
        it is also the rank of the worker in the distributed group. However,
        when multiple executors work together, they can be different.
        e.g. in the case of SPMD-style offline inference with TP=2,
        users can launch 2 engines/executors, each with only 1 worker.
        All workers have rpc_rank=0, but they have different ranks in the TP
        group.
        """
        self.rpc_rank = rpc_rank
        self.worker: Optional[WorkerBase] = None
        # do not store this `vllm_config`, `init_worker` will set the final
        # one. TODO: investigate if we can remove this field in
        # `WorkerWrapperBase`, `init_cached_hf_modules` should be
        # unnecessary now.
        if vllm_config.model_config is not None:
            # it can be None in tests
            trust_remote_code = vllm_config.model_config.trust_remote_code
            if trust_remote_code:
                # note: lazy import to avoid importing torch before initializing
                from vllm.utils import init_cached_hf_modules
                init_cached_hf_modules()
...

The WorkerWrapperBase.worker is initialized by the UniProcExecutor.collective_rpc("init_worker", args=([kwargs], )) call executed in the UniProcExecutor._init_executor() method.

vllm/worker/worker_base.py


...
class WorkerWrapperBase:
...
    def init_worker(self, all_kwargs: List[Dict[str, Any]]) -> None:
        """
        Here we inject some common logic before initializing the worker.
        Arguments are passed to the worker class constructor.
        """
        kwargs = all_kwargs[self.rpc_rank]
        self.vllm_config = kwargs.get("vllm_config", None)
...
        if isinstance(self.vllm_config.parallel_config.worker_cls, str):
            worker_class = resolve_obj_by_qualname(
                self.vllm_config.parallel_config.worker_cls)
...
        with set_current_vllm_config(self.vllm_config):
            # To make vLLM config available during worker initialization
            self.worker = worker_class(**kwargs)
            assert self.worker is not None
...
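
Resolving a class from a dotted string and instantiating it later is a common lazy-initialization pattern. This is a minimal sketch using importlib; resolve_by_qualname and LazyWrapper are hypothetical names of mine, and I am only assuming that resolve_obj_by_qualname behaves along these lines.

```python
import importlib
from typing import Any, Optional


def resolve_by_qualname(qualname: str) -> Any:
    """Import 'pkg.module.Name' and return the Name object."""
    module_name, _, obj_name = qualname.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, obj_name)


class LazyWrapper:
    """Remembers the class name first and builds the object on demand."""

    def __init__(self, worker_cls: str) -> None:
        self.worker_cls = worker_cls
        self.worker: Optional[Any] = None  # nothing is imported yet

    def init_worker(self, **kwargs: Any) -> None:
        worker_class = resolve_by_qualname(self.worker_cls)
        self.worker = worker_class(**kwargs)
```

Storing the class as a string lets the configuration be created before the module that defines the worker is imported.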

The worker to be used varies depending on the execution environment and configuration.
In CUDA environments, it is configured as follows.
For the V1 engine, vllm.v1.worker.gpu_worker.Worker is used.

vllm/platforms/cuda.py


...
class CudaPlatformBase(Platform):
    _enum = PlatformEnum.CUDA
    device_name: str = "cuda"
    device_type: str = "cuda"
    dispatch_key: str = "CUDA"
    ray_device_key: str = "GPU"
    device_control_env_var: str = "CUDA_VISIBLE_DEVICES"
...
    @classmethod
    def check_and_update_config(cls, vllm_config: "VllmConfig") -> None:
        parallel_config = vllm_config.parallel_config
        scheduler_config = vllm_config.scheduler_config
        compilation_config = vllm_config.compilation_config
        model_config = vllm_config.model_config

        if parallel_config.worker_cls == "auto":
            if scheduler_config.is_multi_step:
                if envs.VLLM_USE_V1:
                    raise NotImplementedError(
                        "Multi-step scheduling is not supported (and not "
                        "needed) on vLLM V1. Please launch without "
                        "--num-scheduler-steps.")
                else:
                    parallel_config.worker_cls = \
                        "vllm.worker.multi_step_worker.MultiStepWorker"
            elif vllm_config.speculative_config:
                if envs.VLLM_USE_V1:
                    parallel_config.worker_cls = \
                            "vllm.v1.worker.gpu_worker.Worker"
                else:
                    parallel_config.worker_cls = \
                        "vllm.spec_decode.spec_decode_worker.create_spec_worker"
                    parallel_config.sd_worker_cls = \
                        "vllm.worker.worker.Worker"
            else:
                if envs.VLLM_USE_V1:
                    parallel_config.worker_cls = \
                            "vllm.v1.worker.gpu_worker.Worker"
                else:
                    parallel_config.worker_cls = "vllm.worker.worker.Worker"
...

vllm/v1/worker/gpu_worker.py


class Worker(WorkerBase):

    def __init__(
        self,
        vllm_config: VllmConfig,
        local_rank: int,
        rank: int,
        distributed_init_method: str,
        is_driver_worker: bool = False,
    ):

        super().__init__(vllm_config=vllm_config,
                         local_rank=local_rank,
                         rank=rank,
                         distributed_init_method=distributed_init_method,
                         is_driver_worker=is_driver_worker)

After the worker is created, the device is initialized through UniProcExecutor.collective_rpc("init_device"), as shown below. (The following is based on the CUDA worker.)

vllm/v1/worker/gpu_worker.py


...
class Worker(WorkerBase):
...
    def init_device(self):
        if self.device_config.device.type == "cuda":
            # torch.distributed.all_reduce does not free the input tensor until
            # the synchronization point. This causes the memory usage to grow
            # as the number of all_reduce calls increases. This env var disables
            # this behavior.
            # Related issue:
            # https://discuss.pytorch.org/t/cuda-allocation-lifetime-for-inputs-to-distributed-all-reduce/191573
            os.environ["TORCH_NCCL_AVOID_RECORD_STREAMS"] = "1"

            # This env var set by Ray causes exceptions with graph building.
            os.environ.pop("NCCL_ASYNC_ERROR_HANDLING", None)
            self.device = torch.device(f"cuda:{self.local_rank}")
            torch.cuda.set_device(self.device)

            _check_if_gpu_supports_dtype(self.model_config.dtype)
            gc.collect()
            torch.cuda.empty_cache()
            self.init_gpu_memory = torch.cuda.mem_get_info()[0]
        else:
            raise RuntimeError(
                f"Not support device type: {self.device_config.device}")
        # Initialize the distributed environment.
        init_worker_distributed_environment(self.vllm_config, self.rank,
                                            self.distributed_init_method,
                                            self.local_rank)
        # Set random seed.
        set_random_seed(self.model_config.seed)

        # Construct the model runner
        self.model_runner: GPUModelRunner = GPUModelRunner(
            self.vllm_config, self.device)

        if self.rank == 0:
            # If usage stat is enabled, collect relevant info.
            report_usage_stats(self.vllm_config)
...

During the above process, Worker.model_runner is created.

Model Runner

Once the device is ready, UniProcExecutor.collective_rpc("load_model") calls Worker.load_model(), which delegates the actual loading to the model runner.

vllm/v1/worker/gpu_worker.py


...
class Worker(WorkerBase):
...
    def load_model(self) -> None:
        if self.vllm_config.model_config.enable_sleep_mode:
            allocator = CuMemAllocator.get_instance()
            assert allocator.get_current_usage() == 0, (
                "Sleep mode can only be "
                "used for one instance per process.")
            context = allocator.use_memory_pool(tag="weights")
        else:
            from contextlib import nullcontext
            context = nullcontext()
        with context:
            self.model_runner.load_model()
...

The GPUModelRunner downloads and loads the model through the get_model() function.

vllm/v1/worker/gpu_model_runner.py


...
class GPUModelRunner(LoRAModelRunnerMixin):
...
    def load_model(self) -> None:
        logger.info("Starting to load model %s...", self.model_config.model)
        with DeviceMemoryProfiler() as m:  # noqa: SIM117
            time_before_load = time.perf_counter()
            self.model = get_model(vllm_config=self.vllm_config)
            if self.lora_config:
                self.model = self.load_lora_model(self.model,
                                                  self.model_config,
                                                  self.scheduler_config,
                                                  self.lora_config,
                                                  self.device)
...

The get_model() function selects a loader based on the user-provided load config and returns the loaded model as a torch.nn.Module.

vllm/model_executor/model_loader/__init__.py


...
def get_model_loader(load_config: LoadConfig) -> BaseModelLoader:
    """Get a model loader based on the load format."""
    if isinstance(load_config.load_format, type):
        return load_config.load_format(load_config)

    if load_config.load_format == LoadFormat.DUMMY:
        return DummyModelLoader(load_config)

    if load_config.load_format == LoadFormat.TENSORIZER:
        return TensorizerLoader(load_config)

    if load_config.load_format == LoadFormat.SHARDED_STATE:
        return ShardedStateLoader(load_config)

    if load_config.load_format == LoadFormat.BITSANDBYTES:
        return BitsAndBytesModelLoader(load_config)

    if load_config.load_format == LoadFormat.GGUF:
        return GGUFModelLoader(load_config)

    if load_config.load_format == LoadFormat.RUNAI_STREAMER:
        return RunaiModelStreamerLoader(load_config)

    if load_config.load_format == LoadFormat.RUNAI_STREAMER_SHARDED:
        return ShardedStateLoader(load_config, runai_model_streamer=True)

    return DefaultModelLoader(load_config)


def get_model(*,
              vllm_config: VllmConfig,
              model_config: Optional[ModelConfig] = None) -> nn.Module:
    loader = get_model_loader(vllm_config.load_config)
    if model_config is None:
        model_config = vllm_config.model_config
    return loader.load_model(vllm_config=vllm_config,
                             model_config=model_config)
...
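
The selection logic amounts to "pick a loader by format, otherwise use the default". The sketch below restates that idea as a lookup table, which is a different structure from the if-chain above. All classes here are toy stand-ins of mine, the enum lists only three formats, and load_model() returns a string instead of a torch.nn.Module.

```python
from enum import Enum


class LoadFormat(str, Enum):
    AUTO = "auto"
    DUMMY = "dummy"
    GGUF = "gguf"


class BaseLoader:
    def __init__(self, load_format: LoadFormat) -> None:
        self.load_format = load_format

    def load_model(self) -> str:
        # Placeholder: a real loader would return the loaded model.
        return f"{type(self).__name__}({self.load_format.value})"


class DefaultLoader(BaseLoader):
    pass


class DummyLoader(BaseLoader):
    pass


class GGUFLoader(BaseLoader):
    pass


# Formats with a dedicated loader; anything else falls through to the default.
_LOADERS: dict[LoadFormat, type[BaseLoader]] = {
    LoadFormat.DUMMY: DummyLoader,
    LoadFormat.GGUF: GGUFLoader,
}


def get_loader(load_format: LoadFormat) -> BaseLoader:
    loader_cls = _LOADERS.get(load_format, DefaultLoader)
    return loader_cls(load_format)


def get_model(load_format: LoadFormat) -> str:
    return get_loader(load_format).load_model()
```

Any format without an entry in the table, such as auto, ends up with the default loader, just as the final return statement does in get_model_loader().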

Now the engine, worker, executor, and model runner are all ready.

FastAPI

A FastAPI instance (app) is created through the build_app() function, setting up the server's router, middleware, and exception handlers.

vllm/entrypoints/openai/api_server.py


...
async def run_server(args, **uvicorn_kwargs) -> None:
...
    async with build_async_engine_client(args) as engine_client:
        app = build_app(args)

        vllm_config = await engine_client.get_vllm_config()
        await init_app_state(engine_client, vllm_config, app.state, args)

        def _listen_addr(a: str) -> str:
            if is_valid_ipv6_address(a):
                return '[' + a + ']'
            return a or "0.0.0.0"

        is_ssl = args.ssl_keyfile and args.ssl_certfile
        logger.info("Starting vLLM API server on http%s://%s:%d",
                    "s" if is_ssl else "", _listen_addr(sock_addr[0]),
                    sock_addr[1])

        shutdown_task = await serve_http(
            app,
            sock=sock,
            enable_ssl_refresh=args.enable_ssl_refresh,
            host=args.host,
            port=args.port,
            log_level=args.uvicorn_log_level,
            # NOTE: When the 'disable_uvicorn_access_log' value is True,
            # no access log will be output.
            access_log=not args.disable_uvicorn_access_log,
            timeout_keep_alive=TIMEOUT_KEEP_ALIVE,
            ssl_keyfile=args.ssl_keyfile,
            ssl_certfile=args.ssl_certfile,
            ssl_ca_certs=args.ssl_ca_certs,
            ssl_cert_reqs=args.ssl_cert_reqs,
            **uvicorn_kwargs,
        )
...

vllm/entrypoints/openai/api_server.py


...
def build_app(args: Namespace) -> FastAPI:
...
    app.include_router(router)
...

Here, the router is set up as follows:

vllm/entrypoints/openai/api_server.py


router = APIRouter()
...
@router.get("/health", response_class=Response)
async def health(raw_request: Request) -> Response:
...
@router.get("/load")
async def get_server_load_metrics(request: Request):
...
@router.get("/ping", response_class=Response)
@router.post("/ping", response_class=Response)
async def ping(raw_request: Request) -> Response:
...
@router.post("/tokenize",
             dependencies=[Depends(validate_json_request)],
             responses={
                 HTTPStatus.BAD_REQUEST.value: {
                     "model": ErrorResponse
                 },
                 HTTPStatus.NOT_FOUND.value: {
                     "model": ErrorResponse
                 },
                 HTTPStatus.INTERNAL_SERVER_ERROR.value: {
                     "model": ErrorResponse
                 },
                 HTTPStatus.NOT_IMPLEMENTED.value: {
                     "model": ErrorResponse
                 },
             })
@with_cancellation
async def tokenize(request: TokenizeRequest, raw_request: Request):
...
@router.post("/detokenize",
             dependencies=[Depends(validate_json_request)],
             responses={
                 HTTPStatus.BAD_REQUEST.value: {
                     "model": ErrorResponse
                 },
                 HTTPStatus.NOT_FOUND.value: {
                     "model": ErrorResponse
                 },
                 HTTPStatus.INTERNAL_SERVER_ERROR.value: {
                     "model": ErrorResponse
                 },
             })
@with_cancellation
async def detokenize(request: DetokenizeRequest, raw_request: Request):
...
@router.get("/v1/models")
async def show_available_models(raw_request: Request):
...
@router.get("/version")
async def show_version():
...
@router.post("/v1/chat/completions",
             dependencies=[Depends(validate_json_request)],
             responses={
                 HTTPStatus.OK.value: {
                     "content": {
                         "text/event-stream": {}
                     }
                 },
                 HTTPStatus.BAD_REQUEST.value: {
                     "model": ErrorResponse
                 },
                 HTTPStatus.NOT_FOUND.value: {
                     "model": ErrorResponse
                 },
                 HTTPStatus.INTERNAL_SERVER_ERROR.value: {
                     "model": ErrorResponse
                 }
             })
@with_cancellation
@load_aware_call
async def create_chat_completion(request: ChatCompletionRequest,
                                 raw_request: Request):
...
@router.post("/v1/completions",
             dependencies=[Depends(validate_json_request)],
             responses={
                 HTTPStatus.OK.value: {
                     "content": {
                         "text/event-stream": {}
                     }
                 },
                 HTTPStatus.BAD_REQUEST.value: {
                     "model": ErrorResponse
                 },
                 HTTPStatus.NOT_FOUND.value: {
                     "model": ErrorResponse
                 },
                 HTTPStatus.INTERNAL_SERVER_ERROR.value: {
                     "model": ErrorResponse
                 },
             })
@with_cancellation
@load_aware_call
async def create_completion(request: CompletionRequest, raw_request: Request):
...
@router.post("/v1/embeddings",
             dependencies=[Depends(validate_json_request)],
             responses={
                 HTTPStatus.BAD_REQUEST.value: {
                     "model": ErrorResponse
                 },
                 HTTPStatus.INTERNAL_SERVER_ERROR.value: {
                     "model": ErrorResponse
                 },
             })
@with_cancellation
@load_aware_call
async def create_embedding(request: EmbeddingRequest, raw_request: Request):
...
@router.post("/pooling",
             dependencies=[Depends(validate_json_request)],
             responses={
                 HTTPStatus.BAD_REQUEST.value: {
                     "model": ErrorResponse
                 },
                 HTTPStatus.INTERNAL_SERVER_ERROR.value: {
                     "model": ErrorResponse
                 },
             })
@with_cancellation
@load_aware_call
async def create_pooling(request: PoolingRequest, raw_request: Request):
...
@router.post("/classify", dependencies=[Depends(validate_json_request)])
@with_cancellation
@load_aware_call
async def create_classify(request: ClassificationRequest,
                          raw_request: Request):
...
@router.post("/score",
             dependencies=[Depends(validate_json_request)],
             responses={
                 HTTPStatus.BAD_REQUEST.value: {
                     "model": ErrorResponse
                 },
                 HTTPStatus.INTERNAL_SERVER_ERROR.value: {
                     "model": ErrorResponse
                 },
             })
@with_cancellation
@load_aware_call
async def create_score(request: ScoreRequest, raw_request: Request):
...
@router.post("/v1/score",
             dependencies=[Depends(validate_json_request)],
             responses={
                 HTTPStatus.BAD_REQUEST.value: {
                     "model": ErrorResponse
                 },
                 HTTPStatus.INTERNAL_SERVER_ERROR.value: {
                     "model": ErrorResponse
                 },
             })
@with_cancellation
@load_aware_call
async def create_score_v1(request: ScoreRequest, raw_request: Request):
...
@router.post("/v1/audio/transcriptions",
             responses={
                 HTTPStatus.OK.value: {
                     "content": {
                         "text/event-stream": {}
                     }
                 },
                 HTTPStatus.BAD_REQUEST.value: {
                     "model": ErrorResponse
                 },
                 HTTPStatus.UNPROCESSABLE_ENTITY.value: {
                     "model": ErrorResponse
                 },
                 HTTPStatus.INTERNAL_SERVER_ERROR.value: {
                     "model": ErrorResponse
                 },
             })
@with_cancellation
@load_aware_call
async def create_transcriptions(raw_request: Request,
                                request: Annotated[TranscriptionRequest,
                                                   Form()]):
...
@router.post("/rerank",
             dependencies=[Depends(validate_json_request)],
             responses={
                 HTTPStatus.BAD_REQUEST.value: {
                     "model": ErrorResponse
                 },
                 HTTPStatus.INTERNAL_SERVER_ERROR.value: {
                     "model": ErrorResponse
                 },
             })
@with_cancellation
@load_aware_call
async def do_rerank(request: RerankRequest, raw_request: Request):
...
@router.post("/v1/rerank",
             dependencies=[Depends(validate_json_request)],
             responses={
                 HTTPStatus.BAD_REQUEST.value: {
                     "model": ErrorResponse
                 },
                 HTTPStatus.INTERNAL_SERVER_ERROR.value: {
                     "model": ErrorResponse
                 },
             })
@with_cancellation
async def do_rerank_v1(request: RerankRequest, raw_request: Request):
...
@router.post("/v2/rerank",
             dependencies=[Depends(validate_json_request)],
             responses={
                 HTTPStatus.BAD_REQUEST.value: {
                     "model": ErrorResponse
                 },
                 HTTPStatus.INTERNAL_SERVER_ERROR.value: {
                     "model": ErrorResponse
                 },
             })
@with_cancellation
async def do_rerank_v2(request: RerankRequest, raw_request: Request):
...
if envs.VLLM_SERVER_DEV_MODE:

    @router.get("/server_info")
    async def show_server_info(raw_request: Request):
...
    @router.post("/reset_prefix_cache")
    async def reset_prefix_cache(raw_request: Request):
...
    @router.post("/sleep")
    async def sleep(raw_request: Request):
...
    @router.post("/wake_up")
    async def wake_up(raw_request: Request):
...
    @router.get("/is_sleeping")
    async def is_sleeping(raw_request: Request):
...
@router.post("/invocations",
             dependencies=[Depends(validate_json_request)],
             responses={
                 HTTPStatus.BAD_REQUEST.value: {
                     "model": ErrorResponse
                 },
                 HTTPStatus.UNSUPPORTED_MEDIA_TYPE.value: {
                     "model": ErrorResponse
                 },
                 HTTPStatus.INTERNAL_SERVER_ERROR.value: {
                     "model": ErrorResponse
                 },
             })
async def invocations(raw_request: Request):
...
if envs.VLLM_TORCH_PROFILER_DIR:
    logger.warning(
        "Torch Profiler is enabled in the API server. This should ONLY be "
        "used for local development!")

    @router.post("/start_profile")
    async def start_profile(raw_request: Request):
...
    @router.post("/stop_profile")
    async def stop_profile(raw_request: Request):
...
if envs.VLLM_ALLOW_RUNTIME_LORA_UPDATING:
    logger.warning(
        "LoRA dynamic loading & unloading is enabled in the API server. "
        "This should ONLY be used for local development!")

    @router.post("/v1/load_lora_adapter",
                 dependencies=[Depends(validate_json_request)])
    async def load_lora_adapter(request: LoadLoRAAdapterRequest,
                                raw_request: Request):
...
    @router.post("/v1/unload_lora_adapter",
                 dependencies=[Depends(validate_json_request)])
    async def unload_lora_adapter(request: UnloadLoRAAdapterRequest,
                                  raw_request: Request):
...
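Several routes in the listing above are registered only when a flag such as VLLM_SERVER_DEV_MODE is set. The sketch below shows why that works: a route decorator runs only if the enclosing branch executes, so a gated route never exists when the flag is off. MiniRouter and build_router are hypothetical stand-ins written for this illustration, not FastAPI's or vLLM's classes.

```python
from typing import Callable, Dict, Tuple

Handler = Callable[[], str]


class MiniRouter:
    """Hypothetical stand-in for a web framework router."""

    def __init__(self) -> None:
        # Maps (HTTP method, path) to the registered handler.
        self.routes: Dict[Tuple[str, str], Handler] = {}

    def _register(self, method: str, path: str) -> Callable[[Handler], Handler]:
        def decorator(func: Handler) -> Handler:
            self.routes[(method, path)] = func
            return func

        return decorator

    def get(self, path: str) -> Callable[[Handler], Handler]:
        return self._register("GET", path)

    def post(self, path: str) -> Callable[[Handler], Handler]:
        return self._register("POST", path)


def build_router(dev_mode: bool) -> MiniRouter:
    router = MiniRouter()

    @router.post("/v1/completions")
    def create_completion() -> str:
        return "completion"

    # The decorators below run only if this branch executes, so these
    # routes are never registered when dev_mode is False.
    if dev_mode:

        @router.post("/sleep")
        def sleep() -> str:
            return "sleeping"

        @router.get("/is_sleeping")
        def is_sleeping() -> str:
            return "true"

    return router


prod_routes = build_router(dev_mode=False).routes
dev_routes = build_router(dev_mode=True).routes
print(sorted(prod_routes))
print(sorted(dev_routes))
```

Because the gated handlers are never registered, a request to one of those paths finds no matching route when the flag is off, rather than reaching a handler that refuses to run.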

Flows

flowchart TD
    User
    subgraph Command Line Interface
        vs["vllm serve ${MODEL_NAME}"]
    end
    subgraph vLLM Package
        pyproject["vllm = #quot;vllm.entrypoints.cli.main:main#quot;"]
        cli_main["vllm.entrypoints.cli.main.main()"]
        subgraph Parser
            arg["vllm.engine.arg_utils.EngineArgs"]
            aarg["vllm.engine.arg_utils.AsyncEngineArgs"]
        end
        subgraph Config
            cfg["vllm.config.VllmConfig"]
        end
        subgraph Engine Client
            ssc["vllm.entrypoints.cli.serve.ServeSubcommand.cmd()"]
            alc["vllm.v1.engine.async_llm.AsyncLLM"]
            amc["vllm.v1.engine.core_client.AsyncMPClient"]
            cepm["vllm.v1.utils.CoreEngineProcManager"]
        end
        subgraph Engine Core
            ecp["vllm.v1.engine.core.EngineCoreProc"]
        end
        subgraph Executor
            exe["vllm.v1.executor.abstract.Executor.get_class()"]
            uni["vllm.executor.uniproc_executor.UniProcExecutor"]
        end
        subgraph Worker
            ww["vllm.worker.worker_base.WorkerWrapperBase"]
            wo["vllm.v1.worker.gpu_worker.Worker"]
        end
        subgraph Model Runner
            mr["vllm.v1.worker.gpu_model_runner.GPUModelRunner"]
        end
    end

    User-->vs-->pyproject-->cli_main-->ssc-->alc-->amc-->cepm-.->ecp-->uni
    cfg-->arg
    cfg-->aarg-->ww
    cli_main-->arg-->aarg-->ssc
    alc-->exe-->uni-->ww-->wo-->mr

    classDef userClass fill:#e1f5fe
    classDef engineClass fill:#f3e5f5
    classDef workerClass fill:#e8f5e8

    class User userClass
    class alc,amc,cepm,ecp engineClass
    class uni,ww,wo,mr workerClass
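
The chain in the flowchart, from the engine client down to the model runner, can be pictured as nested delegation. The toy classes below are hypothetical and collapse everything into in-process calls. As I understand it, the dashed edge between the client and the engine core is a process boundary in vLLM, which this sketch does not model. The "inference" step is a placeholder that appends a made-up token.

```python
from typing import List


class ModelRunner:
    def __init__(self, trace: List[str]) -> None:
        self.trace = trace

    def execute_model(self, token_ids: List[int]) -> List[int]:
        self.trace.append("ModelRunner")
        # Placeholder for a forward pass: echo the input plus one fake token.
        return token_ids + [0]


class Worker:
    def __init__(self, trace: List[str]) -> None:
        self.trace = trace
        self.model_runner = ModelRunner(trace)

    def execute_model(self, token_ids: List[int]) -> List[int]:
        self.trace.append("Worker")
        return self.model_runner.execute_model(token_ids)


class Executor:
    def __init__(self, trace: List[str]) -> None:
        self.trace = trace
        self.worker = Worker(trace)

    def execute_model(self, token_ids: List[int]) -> List[int]:
        self.trace.append("Executor")
        return self.worker.execute_model(token_ids)


class EngineCore:
    def __init__(self, trace: List[str]) -> None:
        self.trace = trace
        self.executor = Executor(trace)

    def step(self, token_ids: List[int]) -> List[int]:
        self.trace.append("EngineCore")
        return self.executor.execute_model(token_ids)


class EngineClient:
    def __init__(self, trace: List[str]) -> None:
        self.trace = trace
        self.core = EngineCore(trace)

    def generate(self, token_ids: List[int]) -> List[int]:
        self.trace.append("EngineClient")
        return self.core.step(token_ids)


trace: List[str] = []
output = EngineClient(trace).generate([1, 2, 3])
print(" -> ".join(trace))
```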

Etc

Parser

The ServeSubcommand, which inherits from CLISubcommand, parses the configuration values required for the OpenAI-compatible server.

vllm/entrypoints/cli/main.py


...
def main():
...
    for cmd_module in CMD_MODULES:
        new_cmds = cmd_module.cmd_init()
        for cmd in new_cmds:
            cmd.subparser_init(subparsers).set_defaults(
                dispatch_function=cmd.cmd)
            cmds[cmd.name] = cmd
...

vllm/entrypoints/cli/serve.py


...
class ServeSubcommand(CLISubcommand):
    """The `serve` subcommand for the vLLM CLI. """
...
    def subparser_init(
            self,
            subparsers: argparse._SubParsersAction) -> FlexibleArgumentParser:
        serve_parser = subparsers.add_parser(
            "serve",
            help="Start the vLLM OpenAI Compatible API server.",
            description="Start the vLLM OpenAI Compatible API server.",
            usage="vllm serve [model_tag] [options]")
        ...
        serve_parser = make_arg_parser(serve_parser)
        show_filtered_argument_or_group_from_help(serve_parser)
        serve_parser.epilog = VLLM_SERVE_PARSER_EPILOG
        return serve_parser
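
The two listings above rely on a standard argparse pattern: each subcommand creates its own subparser and stores its handler with set_defaults(), so main() can dispatch without knowing which subcommand was chosen. Here is a minimal sketch of that pattern using only the standard library. CLISubcommandSketch, ServeSketch, the mini-cli program name, and the --port default are made up for illustration and are not vLLM's classes or options.

```python
import argparse
from typing import List


class CLISubcommandSketch:
    """Hypothetical base class: one instance per subcommand."""

    name: str

    def cmd(self, args: argparse.Namespace) -> str:
        raise NotImplementedError

    def subparser_init(self, subparsers) -> argparse.ArgumentParser:
        raise NotImplementedError


class ServeSketch(CLISubcommandSketch):
    name = "serve"

    def cmd(self, args: argparse.Namespace) -> str:
        return f"serving {args.model_tag} on port {args.port}"

    def subparser_init(self, subparsers) -> argparse.ArgumentParser:
        parser = subparsers.add_parser("serve", help="Start a server (sketch).")
        parser.add_argument("model_tag", type=str)
        parser.add_argument("--port", type=int, default=8000)
        return parser


def main(argv: List[str]) -> str:
    parser = argparse.ArgumentParser(prog="mini-cli")
    subparsers = parser.add_subparsers(required=True, dest="subparser")
    for cmd in [ServeSketch()]:
        # Attach the subcommand's handler as a default on its own subparser.
        cmd.subparser_init(subparsers).set_defaults(dispatch_function=cmd.cmd)
    args = parser.parse_args(argv)
    # Whichever subparser matched has supplied dispatch_function.
    return args.dispatch_function(args)


result = main(["serve", "Qwen/Qwen3-0.6B", "--port", "9000"])
print(result)
```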

Configuration values related to the engine, which plays a crucial role in inference, are added through the add_cli_args() method of AsyncEngineArgs.

vllm/entrypoints/openai/cli_args.py


...
def make_arg_parser(parser: FlexibleArgumentParser) -> FlexibleArgumentParser:
...
    parser = AsyncEngineArgs.add_cli_args(parser)
...
    return parser
...

vllm/engine/arg_utils.py


...
@dataclass
class AsyncEngineArgs(EngineArgs):
    """Arguments for asynchronous vLLM engine."""
    disable_log_requests: bool = False

    @staticmethod
    def add_cli_args(parser: FlexibleArgumentParser,
                     async_args_only: bool = False) -> FlexibleArgumentParser:
        # Initialize plugin to update the parser, for example, The plugin may
        # adding a new kind of quantization method to --quantization argument or
        # a new device to --device argument.
        load_general_plugins()
        if not async_args_only:
            parser = EngineArgs.add_cli_args(parser)
        ...
        from vllm.platforms import current_platform
        current_platform.pre_register_and_update(parser)
        return parser
...
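
The layering in AsyncEngineArgs.add_cli_args() can be sketched as follows: the subclass first lets the base class register the shared arguments, then adds its own. This is a simplified illustration with hypothetical class names, placeholder defaults, and a made-up from_namespace() helper. It omits the plugin loading and platform hooks shown in the listing above.

```python
import argparse
from dataclasses import dataclass, fields


@dataclass
class EngineArgsSketch:
    """Hypothetical stand-in for the shared engine arguments."""

    model: str = "placeholder-model"
    max_model_len: int = 8192

    @staticmethod
    def add_cli_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
        parser.add_argument("--model", type=str, default=EngineArgsSketch.model)
        parser.add_argument("--max-model-len", type=int,
                            default=EngineArgsSketch.max_model_len)
        return parser


@dataclass
class AsyncEngineArgsSketch(EngineArgsSketch):
    """Adds async-only options on top of the shared ones."""

    disable_log_requests: bool = False

    @staticmethod
    def add_cli_args(parser: argparse.ArgumentParser,
                     async_args_only: bool = False) -> argparse.ArgumentParser:
        if not async_args_only:
            # Register the shared arguments first, as the base class defines them.
            parser = EngineArgsSketch.add_cli_args(parser)
        parser.add_argument("--disable-log-requests", action="store_true")
        return parser

    @classmethod
    def from_namespace(cls, args: argparse.Namespace) -> "AsyncEngineArgsSketch":
        # Hypothetical helper: copy every dataclass field from the namespace.
        return cls(**{f.name: getattr(args, f.name) for f in fields(cls)})


parser = AsyncEngineArgsSketch.add_cli_args(argparse.ArgumentParser())
args = parser.parse_args(["--max-model-len", "4096", "--disable-log-requests"])
engine_args = AsyncEngineArgsSketch.from_namespace(args)
print(engine_args)
```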

Configuration values required for the engine are defined in the code below, with default values defined in vllm/config.py.

vllm/engine/arg_utils.py


...
@dataclass
class EngineArgs:
    """Arguments for vLLM engine."""
...
    @staticmethod
    def add_cli_args(parser: FlexibleArgumentParser) -> FlexibleArgumentParser:
        """Shared CLI arguments for vLLM engine."""

        # Model arguments
        model_kwargs = get_kwargs(ModelConfig)
        model_group = parser.add_argument_group(
            title="ModelConfig",
            description=ModelConfig.__doc__,
        )
        if 'serve' not in sys.argv[1:] and '--help' not in sys.argv[1:]:
            model_group.add_argument("--model", **model_kwargs["model"])
        model_group.add_argument("--task", **model_kwargs["task"])
        model_group.add_argument("--tokenizer", **model_kwargs["tokenizer"])
        model_group.add_argument("--tokenizer-mode",
                                 **model_kwargs["tokenizer_mode"])
...
        # Parallel arguments
        parallel_kwargs = get_kwargs(ParallelConfig)
        parallel_group = parser.add_argument_group(
            title="ParallelConfig",
            description=ParallelConfig.__doc__,
        )
...
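
The get_kwargs(ModelConfig) call above suggests that argparse keyword arguments are derived from the config dataclass, so types and defaults live in one place. The sketch below is a heavily simplified guess at that idea, not vLLM's implementation. get_kwargs_sketch and ModelConfigSketch are hypothetical, the defaults are illustrative, and only fields annotated with a plain class such as int or str are handled.

```python
import argparse
from dataclasses import dataclass, fields
from typing import Any, Dict


@dataclass
class ModelConfigSketch:
    """Hypothetical, cut-down model config."""

    tokenizer_mode: str = "auto"
    max_model_len: int = 8192
    seed: int = 0


def get_kwargs_sketch(config_cls: type) -> Dict[str, Dict[str, Any]]:
    """Map each dataclass field name to argparse keyword arguments."""
    kwargs: Dict[str, Dict[str, Any]] = {}
    for f in fields(config_cls):
        # The field annotation doubles as the argparse type converter.
        kwargs[f.name] = {"type": f.type, "default": f.default}
    return kwargs


model_kwargs = get_kwargs_sketch(ModelConfigSketch)

parser = argparse.ArgumentParser()
model_group = parser.add_argument_group(
    title="ModelConfig",
    description=ModelConfigSketch.__doc__,
)
model_group.add_argument("--tokenizer-mode", **model_kwargs["tokenizer_mode"])
model_group.add_argument("--max-model-len", **model_kwargs["max_model_len"])
model_group.add_argument("--seed", **model_kwargs["seed"])

args = parser.parse_args(["--max-model-len", "4096"])
print(args)
```

With this arrangement, changing a default in the dataclass changes the CLI default as well, which is presumably why the defaults are kept in the config module rather than in the parser code.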

Config

The table below summarizes the config classes defined in vllm/config.py.

Configuration Class	Purpose	Key Fields
`ModelConfig`	Model and tokenizer configuration	`model`, `tokenizer`, `dtype`, `max_model_len`, `quantization`, `task`, `trust_remote_code`, `revision`, `seed`
`CacheConfig`	KV cache memory management	`block_size`, `gpu_memory_utilization`, `swap_space`, `cache_dtype`, `enable_prefix_caching`, `cpu_offload_gb`
`ParallelConfig`	Distributed execution settings	`tensor_parallel_size`, `pipeline_parallel_size`, `data_parallel_size`, `distributed_executor_backend`, `disable_custom_all_reduce`
`SchedulerConfig`	Request scheduling and batching	`max_num_batched_tokens`, `max_num_seqs`, `max_model_len`, `enable_chunked_prefill`, `preemption_mode`, `policy`
`DeviceConfig`	Hardware device settings	`device`, `device_type`
`LoadConfig`	Model weight loading options	`load_format`, `download_dir`, `ignore_patterns`, `use_tqdm_on_load`
`LoRAConfig`	LoRA adapter configuration	`max_lora_rank`, `max_loras`, `max_cpu_loras`, `lora_extra_vocab_size`, `enable_lora_bias`
`SpeculativeConfig`	Speculative decoding settings	`num_speculative_tokens`, `model`, `method`, `acceptance_method`, `disable_logprobs`
`DecodingConfig`	Guided decoding configuration	`backend`, `disable_fallback`, `disable_any_whitespace`, `reasoning_backend`
`ObservabilityConfig`	Metrics and tracing settings	`collect_detailed_traces`, `otlp_traces_endpoint`
`TokenizerPoolConfig`	Deprecated tokenizer pooling	`pool_size`, `pool_type`, `extra_config` (⚠️ Deprecated)
`PromptAdapterConfig`	Prompt adapter settings	`max_prompt_adapters`, `max_prompt_adapter_token`
`MultiModalConfig`	Multimodal model configuration	Model-specific multimodal settings
`PoolerConfig`	Text embedding pooling	`pooling_type`, `normalize`, `softmax`
`CompilationConfig`	torch.compile optimization	`level`, `use_inductor`, `use_cudagraph`, `custom_ops`
`PassConfig`	Compilation pass configuration	`enable_fusion`, `enable_sequence_parallelism`, `enable_async_tp`
`KVTransferConfig`	Distributed KV cache transfer	KV cache distribution settings
`KVEventsConfig`	KV cache event publishing	`enable_kv_cache_events`, `publisher`
`VllmConfig`	Main configuration container	All above configs combined

vllm/config.py


...
@config
@dataclass
class ModelConfig:
    """Configuration for the model."""
...
@config
@dataclass
class CacheConfig:
    """Configuration for the KV cache."""
...
@config
@dataclass
class TokenizerPoolConfig:
    """This config is deprecated and will be removed in a future release.

    Passing these parameters will have no effect. Please remove them from your
    configurations.
    """
...
@config
@dataclass
class LoadConfig:
    """Configuration for loading the model weights."""
...
@config
@dataclass
class ParallelConfig:
    """Configuration for the distributed execution."""
...
@config
@dataclass
class SchedulerConfig:
    """Scheduler configuration."""
...
@config
@dataclass
class DeviceConfig:
    """Configuration for the device to use for vLLM execution."""
...
@config
@dataclass
class SpeculativeConfig:
    """Configuration for speculative decoding."""
...
@config
@dataclass
class LoRAConfig:
    """Configuration for LoRA."""
...
@config
@dataclass
class PromptAdapterConfig:
    """Configuration for PromptAdapters."""
...
@config
@dataclass
class MultiModalConfig:
    """Controls the behavior of multimodal models."""
...
@config
@dataclass
class PoolerConfig:
    """Controls the behavior of output pooling in pooling models."""
...
@config
@dataclass
class DecodingConfig:
    """Dataclass which contains the decoding strategy of the engine."""
...
@config
@dataclass
class ObservabilityConfig:
    """Configuration for observability - metrics and tracing."""
...
@config
@dataclass
class KVTransferConfig:
    """Configuration for distributed KV cache transfer."""
...
@config
@dataclass
class KVEventsConfig:
    """Configuration for KV event publishing."""
...
@config
@dataclass
class PassConfig:
    """Configuration for custom Inductor passes.

    This is separate from general `CompilationConfig` so that inductor passes
    don't all have access to full configuration - that would create a cycle as
    the `PassManager` is set as a property of config."""
...
@config
@dataclass
class CompilationConfig:
    """Configuration for compilation. It has three parts:

    - Top-level Compilation control:
        - [`level`][vllm.config.CompilationConfig.level]
        - [`debug_dump_path`][vllm.config.CompilationConfig.debug_dump_path]
        - [`cache_dir`][vllm.config.CompilationConfig.cache_dir]
        - [`backend`][vllm.config.CompilationConfig.backend]
        - [`custom_ops`][vllm.config.CompilationConfig.custom_ops]
        - [`splitting_ops`][vllm.config.CompilationConfig.splitting_ops]
    - CudaGraph capture:
        - [`use_cudagraph`][vllm.config.CompilationConfig.use_cudagraph]
        - [`cudagraph_capture_sizes`]
        [vllm.config.CompilationConfig.cudagraph_capture_sizes]
        - [`cudagraph_num_of_warmups`]
        [vllm.config.CompilationConfig.cudagraph_num_of_warmups]
        - [`cudagraph_copy_inputs`]
        [vllm.config.CompilationConfig.cudagraph_copy_inputs]
        - [`full_cuda_graph`][vllm.config.CompilationConfig.full_cuda_graph]
    - Inductor compilation:
        - [`use_inductor`][vllm.config.CompilationConfig.use_inductor]
        - [`compile_sizes`][vllm.config.CompilationConfig.compile_sizes]
        - [`inductor_compile_config`]
        [vllm.config.CompilationConfig.inductor_compile_config]
        - [`inductor_passes`][vllm.config.CompilationConfig.inductor_passes]
        - custom inductor passes

    Why we have different sizes for cudagraph and inductor:
    - cudagraph: a cudagraph captured for a specific size can only be used
        for the same size. We need to capture all the sizes we want to use.
    - inductor: a graph compiled by inductor for a general shape can be used
        for different sizes. Inductor can also compile for specific sizes,
        where it can have more information to optimize the graph with fully
        static shapes. However, we find the general shape compilation is
        sufficient for most cases. It might be beneficial to compile for
        certain small batchsizes, where inductor is good at optimizing.
    """
...
@config
@dataclass
class VllmConfig:
    """Dataclass which contains all vllm-related configuration. This
    simplifies passing around the distinct configurations in the codebase.
    """
...
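
As the VllmConfig docstring says, the container exists so that one object can be passed around instead of many separate configs. A minimal sketch of that shape is shown below. The classes are hypothetical, cover only three of the configs from the table, use illustrative default values, and add a made-up num_devices property to show a value derived from one sub-config.

```python
from dataclasses import dataclass, field


@dataclass
class ModelConfigSketch:
    model: str
    max_model_len: int = 8192


@dataclass
class CacheConfigSketch:
    block_size: int = 16
    gpu_memory_utilization: float = 0.9


@dataclass
class ParallelConfigSketch:
    tensor_parallel_size: int = 1
    pipeline_parallel_size: int = 1

    @property
    def num_devices(self) -> int:
        # Hypothetical helper: devices needed for tensor x pipeline parallelism.
        return self.tensor_parallel_size * self.pipeline_parallel_size


@dataclass
class VllmConfigSketch:
    """One container that groups the individual configs."""

    model_config: ModelConfigSketch
    cache_config: CacheConfigSketch = field(default_factory=CacheConfigSketch)
    parallel_config: ParallelConfigSketch = field(
        default_factory=ParallelConfigSketch)


def describe(config: VllmConfigSketch) -> str:
    # A component receives one object and reads only the parts it needs.
    return (f"{config.model_config.model}: "
            f"{config.parallel_config.num_devices} device(s), "
            f"block size {config.cache_config.block_size}")


config = VllmConfigSketch(
    model_config=ModelConfigSketch(model="Qwen/Qwen3-0.6B"),
    parallel_config=ParallelConfigSketch(tensor_parallel_size=2),
)
summary = describe(config)
print(summary)
```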

Conclusion

In this article, I explored what happens behind the scenes when executing the vllm serve command.
Starting from PagedAttention's innovative memory management approach, I examined the server initialization process and the roles of each component.

Key components and their roles:

Engine Client: Manages the engine lifecycle and coordinates communication with core engine processes for request handling.
Core Engine: Receives incoming requests, manages scheduling queues, handles tokenization, orchestrates model execution (including distributed scenarios), and processes outputs.
Executor: Determines the optimal execution strategy (single-process vs. distributed with tensor/pipeline parallelism) and creates multiple worker processes as needed.
Worker: Individual process assigned to a specific device (e.g., GPU) that handles device initialization and executes model inference tasks.
Model Runner: Loads the model weights, prepares input tensors for computation, and executes the core model inference logic.
Model: The actual torch.nn.Module instance containing the loaded language model weights and architecture.
FastAPI Server: Exposes OpenAI-compatible REST API endpoints for client interactions, built on the FastAPI framework.

vLLM's core strengths lie in memory efficiency through PagedAttention, scalability through its modular architecture, and close compatibility with the OpenAI API.

In the next article, I will focus on the actual inference process, exploring how user requests are processed, queued, batched, and executed through the /v1/chat/completions endpoint.

References

GitHub: vllm-project/vllm

Red Hat: What is vLLM?

vLLM Docs (v0.9.0.1): Welcome to vLLM

vLLM Blog: vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention

SOSP 2023 (ACM Symposium on Operating Systems Principles): Efficient Memory Management for Large Language Model Serving with PagedAttention

OpenAI Platform: API Reference Introduction

vLLM Docs (v0.9.0.1): Installation

vLLM Docs (v0.9.0.1): OpenAI-Compatible Server

vLLM Docs (v0.9.0.1): Production Metrics

vLLM Docs (v0.9.0.1): Metrics

vLLM Docs (v0.9.0.1): Prometheus and Grafana

GitHub: vllm-project/vllm/examples/online_serving/prometheus_grafana/grafana.json

vLLM Docs (v0.9.0.1): Distributed Inference and Serving

arXiv: Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
