DEV Community: Deleon Karen

Part 3: The Core of Memory Management: Allocation and Residency

Deleon Karen — Tue, 02 Jun 2026 15:21:41 +0000

If WDDM is an operating system, then Video Memory Management (VidMm) is its heart. With WDDM 2.0+, the logic of memory management underwent a fundamental shift from "OS-applied patching" to "driver/application-managed state."

Physical Perspective: Implementation of Memory Segments

The driver describes the GPU's physical memory layout to the OS through "memory segments." This is primarily accomplished through two calls to DxgkDdiQueryAdapterInfo.

Implementation Flow:

First Call (Get Count):
- The OS sends DXGKQAITYPE_QUERYSEGMENT (or _QUERYSEGMENT3).
- The segment descriptor pointer is NULL on this call, so the driver only populates DXGK_QUERYSEGMENTOUT3.NbSegment (for example, it returns 2 for one VRAM segment plus one aperture segment).
Second Call (Populate Descriptors):
- The OS allocates space for the DXGK_SEGMENTDESCRIPTOR3 array.
- The driver populates the specific parameters for each segment.
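
The two-call pattern above can be sketched in plain C. This is a minimal model, not WDK code: SegmentDesc, QuerySegmentOut and my_query_segments are hypothetical stand-ins for the real structures and DDI, and the sizes are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the WDK segment structures. */
typedef struct {
    uint64_t base_address;
    uint64_t size;
    int      is_aperture;
} SegmentDesc;

typedef struct {
    uint32_t     nb_segment;   /* filled on both calls */
    SegmentDesc *descriptors;  /* NULL on the first call */
} QuerySegmentOut;

enum { SEGMENT_COUNT = 2 };

/* Driver side: report the count first, fill descriptors on the second call. */
static int my_query_segments(QuerySegmentOut *out)
{
    assert(out != NULL);
    out->nb_segment = SEGMENT_COUNT;
    if (out->descriptors == NULL)
        return 0;              /* first call: count only */

    /* Segment 1: 256 MiB of local VRAM (illustrative values). */
    out->descriptors[0] = (SegmentDesc){ 0x0, 256ull << 20, 0 };
    /* Segment 2: 1 GiB aperture into system memory. */
    out->descriptors[1] = (SegmentDesc){ 0x10000000, 1024ull << 20, 1 };
    return 0;
}
```

The caller plays the role of the OS: it calls once with a NULL array, allocates nb_segment entries, then calls again.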

Key Parameter Analysis:

BaseAddress / Size:
- For Local VRAM: This is the physical starting address as seen internally by the GPU.
- For Aperture Segment: This is the window starting address the GPU uses to access system memory.
CpuTranslatedAddress:
- If the GPU's physical memory is mapped into the CPU's address space via a PCIe BAR, the driver reports the CPU-side base address of that mapping in CpuTranslatedAddress.
- WDDM 2.0+ Optimization: Even without a large BAR, dynamic mapping via CpuHostAperture is possible, with VidMm handling paging automatically.
Flags:
- Aperture: Marks the segment as a "window" into system memory.
- PopulatedFromSystemMemory: Marks whether the physical backing store of this segment is essentially system RAM.
- CpuVisible: Tells the OS whether the CPU can directly read/write this segment's memory.

Development Guidance: Be absolutely accurate in marking which segments are PopulatedFromSystemMemory. Misreporting system memory as local VRAM will cause significant discrepancies in the OS page file and the Total Graphics Memory calculation formula.
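
To see why the flags matter, here is a small sketch of how per-segment flags could feed a dedicated versus system-backed total. The flag names mirror the ones above, but the Segment type, the bit values and sum_segments are assumptions for illustration; this is not the formula the OS actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits mirroring the segment flags discussed above. */
enum {
    SEG_APERTURE                     = 1u << 0,
    SEG_POPULATED_FROM_SYSTEM_MEMORY = 1u << 1,
    SEG_CPU_VISIBLE                  = 1u << 2
};

typedef struct {
    uint64_t size;
    unsigned flags;
} Segment;

typedef struct {
    uint64_t dedicated_video;  /* true local VRAM */
    uint64_t system_backed;    /* apertures and system-memory-backed segments */
} MemoryTotals;

/* Sum segment sizes into the two buckets based on their flags. */
static MemoryTotals sum_segments(const Segment *segs, size_t count)
{
    const unsigned sys = SEG_APERTURE | SEG_POPULATED_FROM_SYSTEM_MEMORY;
    MemoryTotals t = { 0, 0 };
    assert(segs != NULL || count == 0);
    for (size_t i = 0; i < count; ++i) {
        if (segs[i].flags & sys)
            t.system_backed += segs[i].size;
        else
            t.dedicated_video += segs[i].size;
    }
    return t;
}
```

A segment that is really system RAM but lacks both flags would land in dedicated_video, which is exactly the misreporting the guidance warns about.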

Creating Allocations: DxgkDdiCreateAllocation

When an application requests resources, the OS calls this DDI.

Responsibility: The driver must calculate the size and alignment requirements for the resource and return DXGK_ALLOCATIONINFO.
WDDM 2.0 Change: Drivers no longer need to record a "Patch Location List," as resources are now accessed via virtual addresses.

Engineering Focus: GPU Virtual Addressing (GPUVA)

This is a core feature of WDDM 2.0.

Concept: Each process has its own GPU virtual address space, commonly 48 bits wide or larger depending on the hardware.
Benefit: UMD can use fixed addresses directly in the command stream, eliminating the need for the OS to modify instructions before submission.
Driver Responsibility: KMD needs to implement page table operations (via UpdatePageTable operations within DxgkDdiBuildPagingBuffer).
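
As a rough illustration of what a page table update involves, the sketch below models a tiny single-level table. Real GPUs use multi-level tables with hardware-specific entry formats; PageTable, update_page_table, translate and the 4 KiB page size are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12u                 /* 4 KiB pages (assumption) */
#define PAGE_SIZE  (1ull << PAGE_SHIFT)
#define PTE_VALID  1ull
#define NUM_PTES   16u                 /* tiny table for illustration */

typedef struct {
    uint64_t pte[NUM_PTES];            /* physical page address | valid bit */
} PageTable;

/* Map `num_pages` pages starting at GPU VA `va` to physical address `pa`. */
static int update_page_table(PageTable *pt, uint64_t va, uint64_t pa,
                             size_t num_pages)
{
    uint64_t first = va >> PAGE_SHIFT;
    assert(pt != NULL);
    if ((va | pa) & (PAGE_SIZE - 1))
        return -1;                     /* both must be page aligned */
    if (first >= NUM_PTES || num_pages > NUM_PTES - first)
        return -1;                     /* range falls outside the table */
    for (size_t i = 0; i < num_pages; ++i)
        pt->pte[first + i] = (pa + i * PAGE_SIZE) | PTE_VALID;
    return 0;
}

/* Translate a GPU VA; returns 0 on success and stores the result in *pa. */
static int translate(const PageTable *pt, uint64_t va, uint64_t *pa)
{
    uint64_t index = va >> PAGE_SHIFT;
    if (index >= NUM_PTES || !(pt->pte[index] & PTE_VALID))
        return -1;
    *pa = (pt->pte[index] & ~(PAGE_SIZE - 1)) | (va & (PAGE_SIZE - 1));
    return 0;
}
```

Because the UMD only ever sees the virtual address, the physical side can change (for example after paging) by rewriting entries, without touching any command buffer.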

Residency Mechanism: MakeResident and Eviction

Before WDDM 2.0, the OS automatically ensured all resources referenced by a command buffer were in video memory. Now, this responsibility is handed over to the driver (primarily UMD).

MakeResident (Triggered by UMD): The driver explicitly tells the OS, "The upcoming operations need these resources; please keep them in video memory."
Eviction (OS Policy): When video memory is low, the OS uses algorithms like LRU to evict allocations that are no longer required to be resident, moving them to system memory.
Development Guidance:
- Do Not Over-Reside: Keep the resident set within the process budget. MakeResident requests that push past it can fail or force heavy paging.
- Handle Eviction Notifications: In WDDM 3.2, if a resource requires special handling (e.g., decompression) before eviction, the NotifyEviction flag can be utilized.
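
The interplay between explicit residency and OS-driven eviction can be modeled in a few lines. This is a toy model under stated assumptions: ResidencyModel, make_resident, evict and the simple LRU policy are hypothetical and far simpler than what VidMm does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ALLOCS 8

typedef struct {
    uint64_t size;
    int      resident;     /* currently in video memory */
    int      pinned;       /* on the residency list (made resident) */
    uint64_t last_used;    /* logical timestamp for LRU */
} Allocation;

typedef struct {
    Allocation allocs[MAX_ALLOCS];
    size_t     count;
    uint64_t   budget;     /* bytes of video memory available */
    uint64_t   used;
    uint64_t   clock;
} ResidencyModel;

/* Evict the least recently used allocation that is resident but not pinned. */
static int evict_one(ResidencyModel *m)
{
    Allocation *victim = NULL;
    for (size_t i = 0; i < m->count; ++i) {
        Allocation *a = &m->allocs[i];
        if (a->resident && !a->pinned &&
            (victim == NULL || a->last_used < victim->last_used))
            victim = a;
    }
    if (victim == NULL)
        return -1;
    victim->resident = 0;
    m->used -= victim->size;
    return 0;
}

/* Pin an allocation, evicting unpinned ones until it fits in the budget. */
static int make_resident(ResidencyModel *m, size_t index)
{
    assert(index < m->count);
    Allocation *a = &m->allocs[index];
    a->last_used = ++m->clock;
    if (!a->resident) {
        while (m->used + a->size > m->budget)
            if (evict_one(m) != 0)
                return -1;     /* nothing left to evict: over budget */
        a->resident = 1;
        m->used += a->size;
    }
    a->pinned = 1;
    return 0;
}

/* Unpin: the allocation stays in memory until pressure forces it out. */
static void evict(ResidencyModel *m, size_t index)
{
    assert(index < m->count);
    m->allocs[index].pinned = 0;
}
```

Note that evict only removes the pin. The memory is reclaimed lazily, when a later make_resident needs the space.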

Advanced Perspective: WDDM vs. Linux (GEM/TTM)

If you are familiar with Linux kernel driver development, you will find WDDM's VidMm shares similarities with Linux's GEM/TTM, but there are core philosophical differences.

Feature	WDDM 2.0+ (VidMm)	Linux (GEM/TTM)
Basic Unit	Allocation	Buffer Object (BO)
Memory Abstraction	Segments: Memory, Aperture, System	Regions/Placements: VRAM, GTT, System
Residency Logic	Explicit Residency: UMD actively maintains device residency list	Validation Logic: Kernel ensures BO is in the correct Region upon command submission
Address Binding	GPUVA: UMD allocates fixed virtual address, OS handles page table updates	Relocations (Traditional GEM): OS modifies instruction stream; GPUVA (Modern): Similar to WDDM
Migration/Paging	OS-Driven: VidMm decides when to page, driver executes `BuildPagingBuffer`	Driver-Led: TTM provides framework, driver implements specific migration logic (Move)

Key Differences:

Boundary of Responsibility: WDDM 2.0+ delegates more decision-making power regarding "which resources need to be resident" to the UMD (User Mode Driver) to reduce kernel call overhead. In contrast, traditional Linux TTM validates the BO list during each execbuffer submission.
Page Table Management: GPU virtual address management in WDDM is highly standardized, with the OS deeply involved in the page table lifecycle; under Linux, different drivers (e.g., i915 vs. AMDGPU) have greater freedom in their page table implementations.

Advanced Topic: IOMMU and Hardware Isolation

For modern drivers (WDDM 2.4+), IOMMU is no longer transparent. It not only provides security (GPU isolation) but also solves addressing issues for systems with more than 1TB of physical memory on high-end servers (DMA Remapping).

Core Challenge: Shifting from using physical addresses (MDL) to logical addresses (ADL).
DDI That Must Be Implemented: DxgkDdiBeginExclusiveAccess (ensures hardware is quiet during IOMMU switches).

Deep Dive: Please refer to the dedicated document WDDM Advanced: IOMMU and DMA Remapping.

Developer Advice: Accuracy of Video Memory Statistics

The OS heavily relies on the video memory statistics reported by the driver.

Reporting Mechanism: Ensure DXGK_DRIVERCAPS in DxgkDdiQueryAdapterInfo correctly returns parameters like MaxSharedSystemMemory.
Common Issue: If the driver misreports the memory size, it will cause the OS to incorrectly configure page file settings, leading to memory pressure crashes that are difficult to debug.

Part 2: Driver Entry and Device Initialization

Deleon Karen — Tue, 02 Jun 2026 15:15:57 +0000

In the lifecycle of a WDDM driver, the initialization phase establishes the "contractual" relationship between the driver and the operating system. This chapter will provide an in-depth analysis of every core step from driver loading to device readiness, from the perspective of the underlying DDI.

1. The Soul of the Driver: Deep Dive into the `DriverEntry` Protocol

DriverEntry is the entry point for a KMD. Its core task is to register a set of callback function tables with Dxgkrnl via DxgkInitialize.

1.1 Core Structure: `DRIVER_INITIALIZATION_DATA`

This structure is very large, containing hundreds of callback interfaces. In practical engineering, you need to focus on the following points:

Version Control (Version): Must be set to DXGKDDI_INTERFACE_VERSION. This macro is defined differently across WDK versions, directly determining which DDIs the OS will attempt to call.
Mandatory Interfaces: If you omit core functions such as DxgkDdiAddDevice, DxgkDdiStartDevice, or DxgkDdiStopDevice, DxgkInitialize rejects the registration and returns an error such as STATUS_INVALID_PARAMETER.

1.2 Registration Example

// Example: Registering basic DDIs at the entry point.
// The My* callbacks are driver-defined functions declared elsewhere.
#include <ntddk.h>
#include <dispmprt.h>

NTSTATUS DriverEntry(PDRIVER_OBJECT DriverObject, PUNICODE_STRING RegistryPath) {
    DRIVER_INITIALIZATION_DATA initData = {0};
    initData.Version = DXGKDDI_INTERFACE_VERSION;

    // Basic Lifecycle
    initData.DxgkDdiAddDevice = MyAddDevice;
    initData.DxgkDdiStartDevice = MyStartDevice;
    initData.DxgkDdiStopDevice = MyStopDevice;
    initData.DxgkDdiRemoveDevice = MyRemoveDevice;
    initData.DxgkDdiUnload = MyUnload;

    // Core Management
    initData.DxgkDdiQueryAdapterInfo = MyQueryAdapterInfo;
    initData.DxgkDdiCreateDevice = MyCreateDevice;
    initData.DxgkDdiDestroyDevice = MyDestroyDevice;

    // Memory and Scheduling (Basic Set)
    initData.DxgkDdiCreateAllocation = MyCreateAllocation;
    initData.DxgkDdiDestroyAllocation = MyDestroyAllocation;
    initData.DxgkDdiBuildPagingBuffer = MyBuildPagingBuffer;
    initData.DxgkDdiSubmitCommand = MySubmitCommand;

    // A real driver must also register its interrupt, DPC, power, and
    // child-enumeration callbacks; they are omitted here for brevity.

    return DxgkInitialize(DriverObject, RegistryPath, &initData);
}

1.3 Special Case: Display-Only Driver (KMDOD)

For drivers without rendering capabilities that only support display (Kernel-Mode Display-Only Driver), the flow is slightly different:

Uses the KMDDOD_INITIALIZATION_DATA structure.
Registers via DxgkInitializeDisplayOnlyDriver.
Typically used for basic display adapters or remote desktop display drivers.

2. Establishing Context: `DxgkDdiAddDevice`

When the PnP Manager discovers hardware, the OS calls AddDevice once for each physical adapter.

2.1 Core Tasks

Miniport Device Context: You need to allocate a driver-private device context (usually a structure) here and return it to the OS. All subsequent device-related DDIs will carry this pointer.
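Obtaining Hardware Information: Call DxgkCbGetDeviceInformation to retrieve a DXGK_DEVICE_INFO structure. Because the DxgkCbXxx callbacks are only handed to the driver in DxgkDdiStartDevice, this call is normally made there rather than in AddDevice. The structure contains:
- Translated Resource List: Including physical memory base addresses (BARs) and interrupt vectors.
- Device Registry Path: The registry location associated with the adapter.
PCI configuration space is not part of this structure; it is read separately through DxgkCbReadDeviceSpace.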

2.2 Registering Hardware Information (Registry Practice)

According to official specifications, to correctly identify the hardware in "Control Panel -> Display", the driver must set the following registry values in AddDevice:

Information Type	Registry Key Name	Description
Chip Type	`HardwareInformation.ChipType`	Friendly name string for the chip
DAC Type	`HardwareInformation.DacType`	DAC type identifier
Memory Size	`HardwareInformation.MemorySize`	ULONG, unit in MB
Adapter Name	`HardwareInformation.AdapterString`	Full name of the adapter
BIOS Info	`HardwareInformation.BiosString`	BIOS version string

Implementation Hint: Use IoOpenDeviceRegistryKey to open the PLUGPLAY_REGKEY_DRIVER key, then use ZwSetValueKey to write the above values.

3. "Ignition" Execution: `DxgkDdiStartDevice`

StartDevice signals that the hardware should begin actual operation.

Dxgkrnl Interface Exchange: The OS passes in the DXGKRNL_INTERFACE. This interface contains all DxgkCbXxx callbacks (e.g., querying VidPN, notifying interrupts). The driver must save a copy of this structure, since every later callback into Dxgkrnl goes through it.
Hardware Resource Initialization: Map MMIO registers, initialize GPU core logic.
Preliminary Capability Report:
- NumberOfChildren: Reports the number of devices connected downstream of the display adapter (e.g., HDMI ports, internal panels).
- NumberOfVideoPresentSources: Reports the number of display controllers inside the GPU (determines how many independent screens can be driven simultaneously).

4. Deep Dive: The Information Storm of `DxgkDdiQueryAdapterInfo`

This is the busiest DDI during initialization. The OS queries various configurations using different Type values.

DXGKQAITYPE_DRIVERCAPS:
- Core Flags: Capability bits such as MemoryManagementCaps.GpuMmuSupported (whether GPU virtual addressing is supported) and SupportPerEngineTDR (whether per-engine timeout recovery is supported).
- Scheduling Policy: Determines whether global scheduling or hardware scheduling is used.
DXGKQAITYPE_QUERYSEGMENT:
- Defines Memory Topology: Tells the OS which segments are local video memory (VRAM) and which are system memory apertures.
- Segment Properties: The Aperture flag determines how VidMm handles that memory.
DXGKQAITYPE_UMDRIVERPRIVATE:
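- UMD Private Data: Returns a driver-defined block of private data that the OS hands to the UMD (for example, hardware revision details). The UMD DLL name itself (e.g., my_umd.dll) is not returned here; it comes from the UserModeDriverName registry value set by the INF, which the OS uses to load the user-mode component.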
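
A common shape for this DDI is a switch on the query type that validates the output buffer and rejects anything unknown. The sketch below models that shape only; Status, QueryType, DriverCaps, UmdPrivateData and my_query_adapter_info are hypothetical stand-ins, not WDK definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical status codes and query types; real values come from the WDK. */
typedef enum { ST_SUCCESS = 0, ST_NOT_SUPPORTED, ST_BUFFER_TOO_SMALL } Status;
typedef enum { QUERY_DRIVERCAPS, QUERY_UMD_PRIVATE, QUERY_NEWER_TYPE } QueryType;

typedef struct { int gpu_va_supported; int tdr_supported; } DriverCaps;
typedef struct { unsigned hw_revision; } UmdPrivateData;

/* Dispatch on the query type, checking the output buffer before writing. */
static Status my_query_adapter_info(QueryType type, void *out, size_t out_size)
{
    assert(out != NULL);
    switch (type) {
    case QUERY_DRIVERCAPS: {
        if (out_size < sizeof(DriverCaps))
            return ST_BUFFER_TOO_SMALL;
        DriverCaps caps = { 1, 1 };
        memcpy(out, &caps, sizeof caps);
        return ST_SUCCESS;
    }
    case QUERY_UMD_PRIVATE: {
        if (out_size < sizeof(UmdPrivateData))
            return ST_BUFFER_TOO_SMALL;
        UmdPrivateData data = { 7 };   /* illustrative revision number */
        memcpy(out, &data, sizeof data);
        return ST_SUCCESS;
    }
    default:
        /* Anything the hardware or driver does not implement. */
        return ST_NOT_SUPPORTED;
    }
}
```

Returning a distinct "not supported" status for unknown types keeps the driver forward compatible when a newer OS sends queries it has never seen.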

5. Practical Engineering: The State Machine of the Initialization Phase

Understanding the calling sequence of these DDIs is critical for troubleshooting startup failures. The following is the deep call flow organized according to the WDDM specification:

Key Steps Explained:

DxgkDdiStartDevice: Must accurately return NumberOfChildren (e.g., number of HDMI, DP ports) and NumberOfVideoPresentSources (number of CRTCs inside the GPU).
DxgkDdiQueryChildRelations: The driver needs to fill in ChildUid here. This ID will be used as VideoPresentTargetId in subsequent VidPN management.
HPD (Hot Plug Detection): For interfaces with interrupt or polling capabilities, the OS confirms if a monitor is currently connected via QueryChildStatus.
VidPN Negotiation: This is the most complex part of initialization, involving the VidPN Manager cooperating with the driver to establish the initial display path (Source -> Target).

6. Expert Recommendations

Lazy Initialization: Avoid time-consuming hardware detection in AddDevice as much as possible to keep the PnP process smooth.
Strict Version Checking: In QueryAdapterInfo, if a feature requested by the OS is completely unsupported by your hardware, directly return STATUS_NOT_SUPPORTED.
Error Cleanup: If StartDevice fails, be sure to manually clean up already allocated resources and MMIO mappings before returning. The OS will not automatically roll back memory allocated in AddDevice.
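
The error cleanup advice maps naturally onto the goto-based unwind idiom that is common in C drivers. The sketch below uses malloc and free as stand-ins for MMIO mapping and buffer allocation; DeviceContext, start_device and the fail_at test hook are all hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    void *mmio;        /* stands in for a mapped register window */
    void *ring;        /* stands in for a command ring buffer */
    int   started;
} DeviceContext;

/* `fail_at` injects a failure at step 1, 2 or 3 (0 = none); test hook only. */
static int start_device(DeviceContext *ctx, int fail_at)
{
    assert(ctx != NULL);
    ctx->mmio = (fail_at == 1) ? NULL : malloc(64);
    if (ctx->mmio == NULL)
        goto fail;

    ctx->ring = (fail_at == 2) ? NULL : malloc(256);
    if (ctx->ring == NULL)
        goto fail_unmap;

    if (fail_at == 3)              /* e.g. hardware did not leave reset */
        goto fail_free_ring;

    ctx->started = 1;
    return 0;

    /* Unwind in reverse order of acquisition. */
fail_free_ring:
    free(ctx->ring);
    ctx->ring = NULL;
fail_unmap:
    free(ctx->mmio);
    ctx->mmio = NULL;
fail:
    ctx->started = 0;
    return -1;
}
```

Each label releases exactly what was acquired before the failing step, so a failed start leaves the context in the same state it had on entry.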

Part 1: Modern WDDM Architecture Protocol

Deleon Karen — Tue, 02 Jun 2026 15:13:02 +0000

1. From XDDM to WDDM: An Inevitable Evolution

Before Windows Vista, display drivers used the XDDM (Windows 2000 Display Driver Model). Under the XDDM model, the driver ran almost entirely in kernel mode, and any minor crash would directly lead to a system "Blue Screen of Death" (BSoD).

The introduction of WDDM (Windows Display Driver Model) was not only to support the Aero desktop effects but also represented a major architectural refactoring:

Stability: Most of the logic was moved to the User Mode Driver (UMD), leaving only the most critical hardware interaction logic in the Kernel Mode Driver (KMD).
Performance: Introduced true GPU scheduling and video memory management.
Multitasking: Supports multiple applications using GPU resources concurrently without interfering with each other.

2. WDDM Architecture Core: The Boundaries of Power Between UMD and KMD

A modern WDDM driver consists of two core components, each with its own responsibilities:

User Mode Driver (UMD)

Nature: A DLL loaded by the Direct3D Runtime.
Core Responsibilities:
- Translates API calls (such as D3D12 Draw Calls) into command streams (Command Buffers) that the hardware can understand.
- Performs high-level state management and shader compilation.
- Development Guidance: The UMD is the most logically complex and frequently changed part of the driver, but it does not directly touch hardware registers.

Kernel Mode Display Miniport Driver (KMD)

Nature: A kernel-mode miniport driver (a .sys file) that is loaded by and communicates with Dxgkrnl.sys.
Core Responsibilities:
- Resource Management: Manages video memory segments and page tables.
- Hardware Scheduling: Submits commands generated by the UMD to the hardware queue.
- Display Management: Controls display output (VidPN) and handles interrupts.
- Development Guidance: The KMD must be extremely stable, as it runs in Ring 0.

3. Engineering Design Philosophy: "The OS Owns the Policy, the Driver Owns the Implementation"

This is the central philosophy of WDDM development.

The OS (Dxgkrnl) owns the policy: The operating system decides when to switch tasks, when to lower the frequency for power saving, and when to swap resources to system memory due to insufficient video memory (Eviction).
The Driver (KMD) owns the implementation: The driver tells the OS what capabilities (Caps) the hardware has and executes the specific instructions. For example, the OS tells the driver "please move this memory block to VRAM," and the driver is responsible for writing the specific registers to complete the transfer.

4. The WDDM 2.0 Watershed: The Shift in Video Memory Management Focus

In WDDM 1.x, video memory management used the "Patch Location List" model. Before a command buffer reached the hardware, the OS had to walk the patch location list supplied by the driver and write the current physical address of every referenced allocation into the command buffer. This became a bottleneck in the context of large-scale parallelism and GPU virtualization.

WDDM 2.0 (Windows 10+) introduced GPU Virtual Addressing (GPUVA):

Each process has an independent GPU address space: Just like CPU virtual memory, the UMD uses virtual addresses and no longer relies on dynamic patching by the OS.
Residency Model: The driver no longer needs to list all dependent resources each time a command is submitted, but uses MakeResident to inform the OS which resources must be in video memory. This greatly reduces kernel transition overhead and is the foundation for the high performance of Direct3D 12.

5. Developer Advice: How to Start Your WDDM Journey

Clarify the DDI version: In DriverEntry, make sure to correctly declare your DXGKDDI_INTERFACE_VERSION. For modern development, it is recommended to support at least WDDM 2.0.
Focus on Mandatory DDIs: Not all DDIs need to be implemented. For basic rendering, DxgkDdiCreateAllocation, DxgkDdiSubmitCommand, and DxgkDdiBuildPagingBuffer are your three core functions.
Understand Asynchronicity: The GPU is asynchronous. When you submit a command, it merely enters a queue. Never synchronously wait for the GPU to complete within the driver, otherwise, it will cause the entire system to stutter.
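
One way to picture the asynchronous model is a pair of fence counters: submission hands back an id immediately, and completion is recorded later from the interrupt path. This is a minimal sketch; FenceTracker and its functions are hypothetical names, and a real driver would add locking and report completion to the OS.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t last_submitted;   /* fence id of the newest queued command */
    uint64_t last_completed;   /* fence id most recently reported by the GPU */
} FenceTracker;

/* Queue a command and return its fence id without waiting for the GPU. */
static uint64_t submit_command(FenceTracker *t)
{
    return ++t->last_submitted;
}

/* Called from the (simulated) interrupt path when the GPU signals progress. */
static void on_gpu_interrupt(FenceTracker *t, uint64_t completed_fence)
{
    assert(completed_fence <= t->last_submitted);
    if (completed_fence > t->last_completed)
        t->last_completed = completed_fence;
}

/* Non-blocking query: has the command with this fence id finished? */
static int is_complete(const FenceTracker *t, uint64_t fence)
{
    return fence <= t->last_completed;
}
```

Nothing in this model ever spins or sleeps waiting for the hardware. Callers either poll is_complete or are notified from the interrupt path.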

Part 4: Driver Implementation — How Developers Adapt to GPU-P

Deleon Karen — Tue, 02 Jun 2026 15:06:11 +0000

After understanding the architecture and isolation mechanisms of GPU-P, the most pressing question for a driver developer is: How do I implement and enable these features in my driver? This chapter will analyze the GPU-P development and adaptation process in detail from three dimensions: capability declaration, configuration files, and key DDIs (Device Driver Interfaces).

1. Capability Declaration: Telling the System "I'm Ready"

To let Windows know that your driver supports GPU partitioning, you first need to set specific capability bits in the host driver.

DXGK_VIDMMCAPS

When the KMD responds to the DxgkDdiQueryAdapterInfo call, it must populate the DXGK_DRIVERCAPS structure. Within it, MemoryManagementCaps.ParavirtualizationSupported is the "master switch" for enabling GPU-P:

// In the host KMD implementation
pDriverCaps->MemoryManagementCaps.ParavirtualizationSupported = TRUE;

Note: The situation is slightly different for MCDM (Microsoft Compute Driver Model) drivers. The host driver should set this bit to TRUE, while the guest driver running inside the virtual machine should adjust it based on whether it is in a virtualized environment.

2. INF File Configuration: The "Mover" of Resources

Since there is no KMD inside the virtual machine, the DLL images and registry settings required by the guest-side UMD must be provided in advance by the host.

Driver Store Mapping

In the INF file, you need to declare which files should be copied to the virtual machine. Commonly used registry entries include:

CopyToVmOverwrite: Always overwrites the file with the same name in the virtual machine.
CopyToVmWhenNewer: Only overwrites when the host file version is newer.

[DDInstall]
; Copies the host's driver files to the virtual machine's HostDriverStore directory
HKR,"CopyToVmOverwrite",SoftGpuFiles,%REG_MULTI_SZ%,"umd_binary.dll","umd_binary.dll"

These files are automatically placed in the virtual machine's %windir%\system32\HostDriverStore directory.

3. Key DDI Implementation: The Host's "Art of Management"

To manage numerous virtual machine instances, the host KMD needs to implement several key new DDIs:

DxgkDdiSetVirtualMachineData

This is the core interface through which Dxgkrnl communicates VM context to the KMD. When a new virtual machine instance starts or its configuration changes, the OS calls this interface to synchronize key metadata with the driver.

Purpose: Passes virtual machine data, including the VM's unique identifier (a GUID), video memory quota, compute resource limits, etc.
Security Flag: Through the SecureVirtualMachine flag in DXGK_VIRTUALMACHINEDATAFLAGS, the OS tells the KMD whether the VM is in "secure mode" (e.g., Windows Sandbox).
Developer Action: The driver should initialize or update its internal virtual machine tracking structures based on this data. For example, in secure mode, the driver must disable non-standard Escape calls and ensure that IOMMU isolation is fully effective. Furthermore, the driver can use this interface to associate the VM's guest space mapping with the host-side physical resources.

DxgkDdiQueryAdapterInfo (Extended)

This interface is one of the most functionally complex in WDDM. In a GPU-P environment, it is extended to support finer-grained cross-boundary queries.

Key Flags:
- VirtualMachineData: If this bit is TRUE, it indicates that the current query request originates from a virtual machine.
- SecureVirtualMachine: Indicates whether the current execution is within a stricter security isolation environment (like a sandbox).
Context Identification: The new hKmdProcessHandle member allows the driver, when processing queries from a virtual machine, to accurately identify and use the corresponding host-side process context.
Common Query Types:
- DXGKQAITYPE_GPUPCAPS: Queries the driver's partitioning capabilities, such as whether it supports Live Migration.
- DXGKQAITYPE_GPUMMUCAPS: Queries the support status for GPU virtual addresses (GpuVA).
- DXGKQAITYPE_PAGETABLELEVELDESC: Defines the page table hierarchy structure, which is crucial for the GpuMMU model.

DxgkDdiCreateProcess

In a virtualized environment, the host creates a "mirror process" object for each VM's drawing operations. The driver needs to recognize new flags in DXGK_CREATEPROCESSFLAGS to distinguish different execution contexts:

VirtualMachineWorkerProcess: Corresponds to the host-side VM worker process (vmwp.exe). This process is responsible for managing the VM's virtual hardware (including vGPU emulation) but does not perform rendering itself. The driver can use this flag to skip the allocation of certain rendering resources.
VirtualMachineProcess: Corresponds to the actual process inside the Guest that initiates drawing requests. Whenever an application within the virtual machine tries to use the GPU, a call is triggered on the host side via this flag.
Process Association: Through the hKmdVmWorkerProcess handle, the driver can associate multiple guest-side processes with the same VM instance on the host, which is crucial for resource accounting, debugging, and performance monitoring (e.g., GPUView).
Debugging Support: The pProcessName member provides a human-readable name for the process, greatly facilitating troubleshooting for developers in multi-VM concurrent scenarios.

4. Special Handling for MCDM Drivers

For compute cards without display output capabilities (like data center GPUs), adaptation to GPU-P must follow MCDM (Microsoft Compute Driver Model):

Class Definition: The INF must specify Class=ComputeAccelerator.
Capability Reduction: Display-related interfaces like DxgkDdiPresent do not need to be implemented, but GPU Virtual Address (GpuVA) and IOMMU isolation must be supported.
Isolation Requirements: If an MCDM driver wants to run in a secure container, it must set IoMmuSecureModeSupported to TRUE.

5. Development Advice: The "Forbidden Zone" for Private Data

When writing a driver that supports GPU-P, developers must always keep the "cross-boundary" principle in mind:

Strictly Prohibited from Passing Pointers: The UMD is in the guest space, and the KMD is in the host space. Pointers and absolute memory addresses from one side are invalid on the other.
Handle Translation: Utilize the DriverKnownEscape mechanism to let the OS translate resource handles from the guest side into handles recognizable on the host side.
Message Size Limit: The maximum size for a single VMBus message is typically 128KB. Avoid passing excessively large private data in a single AllocateCb call.
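
To make the size limit concrete, the helpers below compute how many messages a payload would need and how large each piece would be. The 128 KB figure is taken from the text above and treated as an assumption; messages_needed, chunk_size and the per-message header are hypothetical, and whether a given call can actually be split this way depends on the driver's own private data design.

```c
#include <assert.h>
#include <stddef.h>

/* Limit taken from the text above; an assumption, not a header constant. */
#define MAX_MESSAGE_BYTES (128u * 1024u)

/* Number of messages needed to carry `total` bytes when each message also
   carries `header` bytes of overhead. Returns 0 if no payload would fit. */
static size_t messages_needed(size_t total, size_t header)
{
    if (header >= MAX_MESSAGE_BYTES)
        return 0;
    size_t payload = MAX_MESSAGE_BYTES - header;
    return (total + payload - 1) / payload;   /* round up */
}

/* Payload size of message `index` (0-based) for the same split. */
static size_t chunk_size(size_t total, size_t header, size_t index)
{
    assert(header < MAX_MESSAGE_BYTES);
    size_t payload = MAX_MESSAGE_BYTES - header;
    size_t offset = index * payload;
    assert(offset < total);
    return (total - offset < payload) ? total - offset : payload;
}
```

A simpler policy is to reject any private data blob for which messages_needed returns more than 1 and redesign the data to be smaller.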

Conclusion

Adapting to GPU-P is not just about flipping a switch; it is a meticulous reconstruction of the driver architecture. Through standardized capability declarations and rigorous DDI implementation, developers can give their hardware a new lease on life in the cloud and virtualized environments.

In the next chapter, we will tackle the toughest nut to crack in the field of GPU virtualization: Live Migration.

Part 3: The Foundation — IOMMU and Security Isolation

Deleon Karen — Tue, 02 Jun 2026 15:04:29 +0000

In the previous "Architecture" chapter, we learned how GPU-P achieves resource sharing through the clever division of labor between the Host and Guest. However, in cloud computing and virtualization environments, simply being "usable" and "shareable" is far from enough. Security is always the Sword of Damocles hanging over the head of virtualization.

If the GPU-P architecture is a building that allows multiple tenants to move in, then IOMMU is the security door customized for each tenant. Today, we will unveil the cornerstone of GPU-P's underlying security isolation.

The Necessity of Isolation: Guarding Against Deadly DMA Attacks

As we all know, GPUs are high-speed peripherals connected to the system via the PCIe bus. In pursuit of ultimate performance, GPUs heavily rely on DMA (Direct Memory Access) technology. With DMA, the GPU can read from and write to the system's main memory (physical memory) directly across the bus without CPU intervention.

In a standalone environment, this is not a problem. But in a virtualized environment (like GPU-P), it becomes a huge security risk:
Suppose a malicious user gains control of a GPU Virtual Function (VF) within a Virtual Machine (Guest). They could craft malicious hardware commands, instructing the GPU's DMA engine to read physical memory that does not belong to that virtual machine. If they were to read the Host's core data, encryption keys, or the memory of other tenant VMs, the consequences would be catastrophic.

To prevent this kind of "unauthorized access," relying solely on software-level interception is insufficient. We must establish a physical barrier at the hardware bus level.

The Role of IOMMU: A "Logical Maze" for Physical Memory

This is where the IOMMU (Input/Output Memory Management Unit) takes the stage.

You can think of the IOMMU as a specialized MMU for peripherals (I/O devices). The CPU relies on the MMU to map virtual addresses to physical addresses, while the IOMMU is responsible for DMA Remapping for peripherals.

In GPU-P, the IOMMU works as follows:

Domain Division: The Host's Dxgkrnl (DirectX Graphics Kernel) creates an independent IOMMU Domain for each logical adapter (the instance assigned to a virtual machine) on the system.
Logical Address Spoofing: The Host operating system no longer exposes real physical addresses to the GPU; instead, it provides logical addresses managed by the IOMMU.
Hardware-Level Interception: When a VM's GPU VF attempts to initiate a DMA read or write via the PCIe bus, the request is intercepted by the IOMMU. The IOMMU checks if this logical address is valid and converts it to the real physical address.
Blocking on Violation: If a malicious Guest tries to access a logical address not assigned to it, the IOMMU cannot complete the translation and will directly block this PCIe transaction at the hardware level, thus protecting the absolute security of the Host's and other VMs' memory.
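
The four steps above can be modeled as a per-domain lookup table. This is a conceptual sketch only: real IOMMUs use hardware page tables, and Mapping, IommuDomain and iommu_translate are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IOMMU_PAGE   4096ull
#define MAX_MAPPINGS 4

typedef struct {
    uint64_t logical;    /* page-aligned logical (device-visible) address */
    uint64_t physical;   /* page-aligned real physical address */
    uint64_t pages;      /* length of the mapping in pages */
} Mapping;

/* One domain per logical adapter; it only knows its own mappings. */
typedef struct {
    Mapping map[MAX_MAPPINGS];
    size_t  count;
} IommuDomain;

/* Returns 0 and writes *phys on success, -1 if the access is blocked. */
static int iommu_translate(const IommuDomain *d, uint64_t logical,
                           uint64_t *phys)
{
    assert(d != NULL && phys != NULL);
    for (size_t i = 0; i < d->count; ++i) {
        const Mapping *m = &d->map[i];
        if (logical >= m->logical &&
            logical - m->logical < m->pages * IOMMU_PAGE) {
            *phys = m->physical + (logical - m->logical);
            return 0;
        }
    }
    return -1;   /* no mapping in this domain: the transaction is rejected */
}
```

Two domains can use the same logical address and still reach different physical pages, and an address outside a domain's mappings never resolves at all.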

Silence Protocol: The Danger of Domain Switching

While powerful, the IOMMU has a fatal weakness when switching protection domains (Domain Switch): the attach and detach operations of a domain are not atomic at the hardware level.

Imagine this: Dxgkrnl is in the background altering the IOMMU mapping tables, while the GPU is furiously writing data to memory. Since the mapping table is in an intermediate state, a PCIe translation error is very likely to occur, directly crashing the entire system (Blue Screen/Bug Check).

To resolve this race condition, WDDM introduced the Silence Protocol. The Host graphics driver (KMD) must implement a pair of extremely critical DDIs (Device Driver Interfaces):

DxgkDdiBeginExclusiveAccess
DxgkDdiEndExclusiveAccess

Execution Flow:

Before an IOMMU domain switch occurs, Dxgkrnl pauses the scheduler, flushes all active workloads, ensuring no new tasks are sent to the hardware.
Dxgkrnl calls DxgkDdiBeginExclusiveAccess, notifying the KMD: "I'm about to touch the IOMMU, tell your hardware to be quiet!"
Upon receiving the instruction, the KMD must ensure the GPU hardware remains absolutely silent during this period: it must not read from or write to system memory, and the driver may also mask hardware interrupts.
Dxgkrnl safely completes the IOMMU domain switch.
Dxgkrnl calls DxgkDdiEndExclusiveAccess, lifting the silence, and the GPU resumes normal operation.
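
The execution flow above is essentially a small state machine. The sketch below models it with hypothetical types (Adapter, GpuState) and function names that echo the DDIs without being their real signatures.

```c
#include <assert.h>

typedef enum { GPU_RUNNING, GPU_QUIESCED } GpuState;

typedef struct {
    GpuState state;
    int      dma_in_flight;     /* outstanding DMA transactions */
    int      active_domain;     /* id of the attached IOMMU domain */
} Adapter;

/* Driver side: stop issuing DMA and drain whatever is in flight. */
static void begin_exclusive_access(Adapter *a)
{
    a->dma_in_flight = 0;       /* stands in for draining the engines */
    a->state = GPU_QUIESCED;
}

/* Driver side: the quiet window is over, normal operation resumes. */
static void end_exclusive_access(Adapter *a)
{
    a->state = GPU_RUNNING;
}

/* A DMA attempt is only legal while the GPU is running. */
static int try_dma(Adapter *a)
{
    if (a->state != GPU_RUNNING)
        return -1;
    a->dma_in_flight++;
    return 0;
}

/* OS side: the switch is only performed inside the quiet window. */
static int switch_domain(Adapter *a, int new_domain)
{
    begin_exclusive_access(a);
    assert(a->state == GPU_QUIESCED && a->dma_in_flight == 0);
    a->active_domain = new_domain;
    end_exclusive_access(a);
    return a->active_domain;
}
```

The assertion inside switch_domain captures the invariant the protocol exists to guarantee: no DMA is outstanding while the mapping changes.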

Secure VM: Stringent Admission Criteria

In scenarios with extremely high security requirements (such as Windows Defender Application Guard or advanced security sandboxes), the operating system may launch a Secure VM.

For GPU instances assigned to a Secure VM, WDDM imposes mandatory admission and operational conditions:

Mandatory IOMMU Isolation: If the driver does not support IoMmu isolation in its capability declaration (Caps), the Secure VM will directly refuse to create a GPU instance, resulting in startup failure. Here, there is no compromise on security.
Banning Illegal Escape Calls: In traditional WDDM, a User-Mode Driver (UMD) can send private data packets to the Kernel-Mode Driver (KMD) via Escape calls. Since this is entirely a "black box," a malicious Guest could trigger vulnerabilities like buffer overflows in the Host's KMD by crafting malformed Escape packets.
- In a Secure VM, conventional Escape calls are completely banned.
- Only "Known Escapes" with the DriverKnownEscape flag, strictly defined and audited by the system, are permitted. This drastically reduces the attack surface for kernel privilege escalation.
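
On the host side, a defensive KMD might mirror this policy with an allow-list check of its own. The sketch below is an assumption about how such a check could look, not a description of what the OS does: EscapeRequest, the table of escape types and escape_allowed are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t type;           /* hypothetical escape type id */
    int      known_escape;   /* models the DriverKnownEscape flag */
    size_t   size;           /* bytes of private data */
} EscapeRequest;

/* Hypothetical table of audited escape types and their exact payload sizes. */
static const struct { uint32_t type; size_t size; } k_known_escapes[] = {
    { 1, 16 },
    { 2, 64 },
};

/* Returns 1 if the escape may be processed, 0 if it must be rejected. */
static int escape_allowed(const EscapeRequest *req, int secure_vm)
{
    assert(req != NULL);
    if (!secure_vm)
        return 1;            /* ordinary path: the driver's usual checks apply */
    if (!req->known_escape)
        return 0;            /* private escapes are banned in a secure VM */
    size_t n = sizeof k_known_escapes / sizeof k_known_escapes[0];
    for (size_t i = 0; i < n; ++i)
        if (k_known_escapes[i].type == req->type &&
            k_known_escapes[i].size == req->size)
            return 1;
    return 0;
}
```

Matching on an exact payload size, rather than trusting a length supplied by the guest, is one simple guard against the malformed packets described above.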

Conclusion

Through the IOMMU's hardware-level DMA interception, the ingenious Silence Protocol, and the stringent communication restrictions for Secure VMs, Microsoft has built an impregnable defense line for GPU-P. It is precisely this foundational cornerstone that gives cloud service providers the confidence to partition the same expensive high-end GPU among multiple, unrelated tenants.

So far, we have understood the operational mechanism of GPU-P from both the macro-architectural and security foundation perspectives. For driver developers, how can existing code be modified to adapt to this complex virtualization mechanism?

In the next chapter, we will enter the practical realm and analyze: Driver Implementation — How Developers Adapt to GPU-P.


Part 2: Architecture — The "Duet" of Host and Virtual Machine

Deleon Karen — Tue, 02 Jun 2026 15:02:37 +0000

In the previous part, we covered the basic concepts of GPU-P. Today, we'll dive deep into its internal architecture. If a traditional graphics driver is a solo performance, then GPU-P is a precisely choreographed "duet": the Host and the Guest (virtual machine) have clearly defined roles, working together closely through VMBus.

Component Model: An Unbalanced "Bilocation"

Under the GPU-P architecture, the driver components are split between two worlds — the host and the virtual machine. Interestingly, the components in these two worlds are asymmetric.

Host: The All-Powerful "Brain"

The host possesses the complete graphics subsystem stack and serves as the ultimate resource manager:

Full-Featured KMD (Kernel Mode Driver): Interacts directly with the physical GPU hardware.
VidMm (Video Memory Manager): Controls the allocation and scheduling of all video memory.
VidSch (Scheduler): Decides which virtual machine's tasks can enter the GPU hardware queues.
UMD (User Mode Driver): Used for the host's own rendering tasks.

Guest (Virtual Machine): The Lean "Executor"

Inside the virtual machine, things are very "slim":

UMD Only: An adapted user mode driver runs inside the VM, directly facing applications (like games or AI frameworks).
No KMD: The manufacturer's kernel mode driver code does not run inside the VM.
No VidMm or VidSch: The virtual machine is not responsible for managing video memory or hardware scheduling.

This design drastically reduces the attack surface of the virtual machine, while also avoiding the overhead of running complex video memory management logic repeatedly in every VM.

Virtual Render Device (VRD): The Ingenious "Impostor"

Since there is no KMD in the virtual machine, how does the Windows operating system know to load a display driver? This is the work of the VRD (Virtual Render Device).

The VRD is a "shadow driver" on the Guest side. Its main responsibilities are:

Deceiving the OS: Mounting a virtual graphics device in Device Manager, making the VM's operating system think "I have hardware," thereby triggering the loading of Dxgkrnl.sys (the DirectX Graphics Kernel).
Guiding the Load: It acts as the trigger for loading the Guest-side UMD.

On the host side, each virtual machine corresponds to a VMWP.exe (VM Worker Process). Within this process runs a library called vrdumed.dll, which serves as the "back-end support" for the VRD, responsible for emulating this virtual device on the host side.

Communication Bridge: VMBus and Parameter Marshalling

For the UMD in the virtual machine to render an image, its commands must ultimately be passed to the host's hardware. How does this "dialogue" happen between them?

VMBus

VMBus, provided by Hyper-V, is the underlying communication mechanism for this architecture. It functions like a dedicated expressway, allowing data to travel rapidly between the virtual machine and the host.

Parameter Marshalling

When an application inside the VM calls a DirectX API, the following process occurs:

Interception: The Guest-side Dxgkrnl receives the UMD's call requests (Thunk calls).
Marshalling: Dxgkrnl packages the parameters and data packets of these calls into individual Messages.
Transmission: These messages are sent to the host via VMBus.
Execution: The host-side Dxgkrnl receives the messages, unpacks them, and passes them to the physical driver for execution.

Optimization Mechanism: To prevent VMBus congestion, the Guest-side Dxgkrnl retains some "local objects" (such as handle mappings for Allocations and Devices), and communication across the boundary only occurs when hardware execution is truly required.
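
The marshalling and unpacking steps can be pictured with a small sketch. It does not reproduce Dxgkrnl's real wire format, which is not described here: the header layout, the command ID, and the function names are all hypothetical, chosen only to show how a call is packed into a self-describing message on the guest side and unpacked on the host side.

```python
import struct

# Hypothetical header: command id (u32) + payload length (u32).
# This is an illustration, not the real VMBus message layout.
HEADER = struct.Struct("<II")
CMD_CREATE_ALLOCATION = 1  # made-up command id


def marshal(command_id, payload):
    """Guest side: pack one thunk call into a single message."""
    return HEADER.pack(command_id, len(payload)) + payload


def unmarshal(message):
    """Host side: unpack the message and recover the call parameters."""
    command_id, length = HEADER.unpack_from(message, 0)
    payload = message[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated message")
    return command_id, payload


# A guest call asking for a 4096-byte allocation, passed through a
# stand-in for the channel (here simply a bytes object).
request = marshal(CMD_CREATE_ALLOCATION, struct.pack("<Q", 4096))
command, body = unmarshal(request)
size, = struct.unpack("<Q", body)
```

The length field is what lets the receiving side reject a truncated message instead of misreading it, which matters when the sender sits in a different trust domain.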

Resource Ownership: Why Doesn't the Guest Have Video Memory Management?

A common question is: Why not let the virtual machine manage the video memory allocated to it?

The answer is: Physical video memory is a globally unified resource.
If each virtual machine believed it had its own independent video memory manager, they would "fight" over the same physical addresses. In GPU-P:

Host Coordination: The host's VidMm stands from a God's-eye view, coordinating the whole picture. Based on each VM's configuration (like MinPartitionVRAM / MaxPartitionVRAM), it partitions the physical video memory among different VMs.
Guest Request: When a virtual machine needs memory, it must initiate a "loan" request to the Host via VMBus.

This "centralized authority" architecture ensures that even when multiple VMs are running under high load, video memory conflicts between them cannot escalate into a host-wide GPU timeout (TDR, Timeout Detection and Recovery) or a blue screen.
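
As a rough illustration of why a single owner avoids conflicts, here is a toy model of a host-side manager that grants guest "loan" requests. The class, its methods, and the accounting policy are assumptions made for this sketch; the real VidMm policy and the exact semantics of MinPartitionVRAM / MaxPartitionVRAM are not modeled.

```python
class HostVramManager:
    """Toy model of a single host-side owner of physical VRAM."""

    def __init__(self, total_bytes):
        self.total = total_bytes
        self.limits = {}   # vm name -> (min_bytes, max_bytes)
        self.in_use = {}   # vm name -> bytes currently lent out

    def add_vm(self, name, min_bytes, max_bytes):
        # Minimum guarantees must never add up to more than physical VRAM.
        reserved = sum(lo for lo, _ in self.limits.values())
        if reserved + min_bytes > self.total:
            raise ValueError("minimum reservations exceed physical VRAM")
        self.limits[name] = (min_bytes, max_bytes)
        self.in_use[name] = 0

    def request(self, name, size):
        """Grant a guest request only if caps and other VMs' minimums hold."""
        _, hi = self.limits[name]
        if self.in_use[name] + size > hi:
            return False
        # What other VMs hold, or are guaranteed, stays reserved for them.
        others = sum(max(self.in_use[vm], self.limits[vm][0])
                     for vm in self.limits if vm != name)
        if others + self.in_use[name] + size > self.total:
            return False
        self.in_use[name] += size
        return True
```

Because every request passes through one accounting table, two guests can never be handed the same physical bytes, which is the property the centralized design is after.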

Conclusion

The architecture of GPU-P showcases Microsoft's art of balancing performance and security in virtualization: through VRD deception, VMBus transmission, and highly centralized control on the Host side, efficient GPU resource sharing is achieved.

However, in such a shared environment, how do we ensure that a virtual machine cannot illegally access the host's memory? That is the topic we will discuss in our next part: IOMMU and Security Isolation.

Part 1: Concepts — Ushering in the "Golden Age" of GPU Virtualization

Deleon Karen — Tue, 02 Jun 2026 14:59:07 +0000

Introduction: The GPU Dilemma Amid the Virtualization Wave

In today's rapidly evolving landscape of cloud computing and virtualization, the virtualization of compute, storage, and networking is already highly mature. However, the virtualization of GPU resources has long remained a major industry challenge. In the past, if we wanted to use GPU acceleration in a virtual machine (VM), there were generally two mainstream approaches, each with its own strengths and limitations:

Discrete Device Assignment (DDA):
- Principle: Assigns an entire physical GPU exclusively to a single virtual machine.
- Advantages: Near-native performance and good compatibility.
- Disadvantages: Resources cannot be shared. One GPU can only serve one VM, so even if the VM is only running simple UI rendering, it causes massive resource waste and extremely high costs.
API Forwarding:
- Principle: Intercepts OpenGL/DirectX API calls within the virtual machine and forwards them to the host for execution via network or shared memory channels.
- Advantages: Simple to implement and supports 1:N sharing.
- Disadvantages: High performance overhead, high latency, and often only supports specific API versions, making it unable to meet the demands of high-performance computing or 3D gaming.

With the rise of artificial intelligence, remote desktops (VDI), and cloud gaming, the market urgently needs a GPU virtualization solution that can deliver both high performance and high-density sharing. It is against this backdrop that Microsoft introduced GPU-P (GPU Partitioning).

What is GPU-P: SR-IOV-Based Hardware Partitioning Technology

GPU-P (short for GPU Partitioning) is a GPU hardware partitioning technology developed by Microsoft based on the industry standard SR-IOV (Single Root I/O Virtualization). WDDM documentation describes the closely related driver model as GPU Paravirtualization (GPU-PV), and the two terms are often used interchangeably.

Core Principles

Unlike traditional "full virtualization," GPU-P employs a paravirtualization design:

Hardware Level: Utilizes SR-IOV-capable GPU hardware to partition the physical device into multiple "Virtual Functions (VFs)." Each VF possesses its own independent hardware context, command queue, and memory space.
Software Level: Retains the full-featured driver (KMD) on the host side, while the virtual machine (guest) side runs a streamlined user-mode driver (UMD) specifically adapted for the virtualization environment.
Communication Mechanism: Uses Hyper-V's VMBus as a high-speed communication bridge to enable efficient collaboration between the Guest UMD and the Host KMD.

Advantages of GPU-P

Hardware-Level Isolation: Leveraging SR-IOV and IOMMU technologies, GPU tasks from different virtual machines do not interfere with one another, ensuring extremely high security.
Near-Native Performance: Critical rendering paths interact directly with the hardware VF, greatly reducing the overhead introduced by software emulation.
Flexible Resource Scheduling: Supports dynamically partitioning a single physical GPU into multiple instances, achieving optimal resource allocation.

Application Scenarios: From the Lab to the Cloud

GPU-P is not a laboratory "toy"; it is deeply integrated into all corners of the Windows ecosystem:

Windows Sandbox: When you start a sandbox to test suspicious software, its smooth UI is powered by GPU-P providing hardware acceleration, without requiring you to manually configure drivers.
WSL2 (Windows Subsystem for Linux): When developers run neural network training (such as PyTorch, TensorFlow) under the Linux subsystem, they can directly call upon the GPU compute power of the Windows host, also thanks to the D3D12/CUDA mapping support provided by GPU-P.
Azure NV-Series Virtual Machines: On the public cloud, GPU-P allows Azure to offer cost-effective GPU instances to different users, supporting AI inference, rendering, and scientific computing.
Cloud Gaming and VDI: Through high-density GPU partitioning, a single server can simultaneously support the 1080P/60FPS gaming experiences or 3D design desktops of dozens of users.

WDDM Evolution: Embarking on the Long March of GPU Virtualization

The maturity of GPU-P was not achieved overnight; it has continuously strengthened with the iteration of the Windows Display Driver Model (WDDM):

WDDM 2.4 (Windows 10 1803):
- Milestone Significance: Formally introduced the GPU-PV architecture.
- Core Functionality: Supported basic rendering capability partitioning and introduced the IOMMU isolation mechanism.
WDDM 2.5 - 2.9:
- Continuous optimization. Introduced mechanisms such as "driver-known Escape calls," enhancing the security of cross-process/cross-VM communication.
WDDM 3.2 (Windows 11 24H2):
- Live Migration: This is the "holy grail" of GPU virtualization. WDDM 3.2 introduced technologies like Dirty Bit Tracking, enabling virtual machines running GPU workloads to migrate between different physical hosts without downtime.
- LDA (Linked Display Adapter) Support: Supports more complex partitioning strategies in multi-GPU environments.

Conclusion

The emergence of GPU-P marks the transition of Windows GPU virtualization from "barely functional" to "truly practical" and "industrialized." It not only solves the pain point of resource sharing but also finds the perfect balance between security and performance.

In the following chapters, we will delve into its kernel logic and explore how the host and virtual machine execute their precise "pas de deux."

Part 13: Epilogue — The Architectural Evolution from i915 to the Xe Driver

Deleon Karen — Tue, 02 Jun 2026 14:54:16 +0000

In the previous twelve lectures, we delved from macro architecture down to code details, thoroughly dissecting i915, the massive and complex kernel graphics driver. We saw how it manages memory (GEM/TTM), schedules tasks (Execlists/GuC), lights up displays (KMS), and recovers from disasters (Reset).

However, there is no eternal perfection in software engineering. Over time, business requirements change and hardware architectures evolve, and an existing codebase inevitably accumulates technical debt. In the final lecture of this series, we will step outside the code framework of i915 to discuss its historical baggage and why Intel introduced an entirely new driver in the Linux 6.8 kernel: the Xe driver.

1. Retrospect: The Heavy Historical Baggage of i915

The i915 driver began in 2004. Over a span of 20 years, it has accompanied countless users' computers but has also gradually become an unwieldy behemoth (over 300,000 lines of code). Its "heaviness" is mainly reflected in the following aspects:

1.1 The Long Hardware Support Timeline

Opening i915_pci.c, you will see device IDs ranging from i830 (Gen2) all the way to the latest Meteor Lake and Discrete Graphics (DG2).

To maintain compatibility with those ancient integrated GPUs that had no hardware scheduler and relied on simple ring buffers for command submission, the code retains numerous legacy execution paths.
Despite the existence of the firmware-based GuC scheduler, the old Execlists and even older submission mechanisms must still be maintained, making the logic in the gt/ directory intricate and convoluted.

1.2 The Inertia of the Unified Memory Architecture (UMA) Mindset

At i915's inception, Intel GPUs were all integrated, sharing memory with the CPU. This deeply imprinted the mark of UMA on i915's low-level data structures.

When Intel decided to enter the discrete graphics card market (Arc series) and introduce local memory (LMEM), the i915 team had to painfully cram the veteran TTM (Translation Table Manager) into the originally GEM-based architecture.
This mid-career integration resulted in code littered with conditional logic (e.g., if (HAS_LMEM(i915))), causing a sharp increase in the maintenance difficulty of the memory management subsystem.

1.3 Complex Synchronization and Display Coupling

In i915, the display logic (Display/KMS) is highly coupled with the low-level memory management.
i915's early scheduling design did not fully align with the philosophy of the modern DRM Scheduler, often requiring many complex adaptations when interfacing with modern userspace stacks like Wayland.

2. Breaking the Cocoon: The Design Philosophy of the Xe Driver

Faced with the heavy baggage of i915, Intel decided that when supporting the new generation of GPUs based on the Xe architecture (Tiger Lake/Gen12 and beyond), instead of patching and mending i915, they would write an entirely new driver from scratch — Xe (drivers/gpu/drm/xe).

The core design philosophy of the Xe driver can be summarized as: "Travel light, and do everything for the modern GPU."

2.1 Abandoning History, Focusing on Modern Architectures

The Xe driver only supports GPUs from Tiger Lake and newer architectures (Gen12+).
This means it directly discards the baggage of supporting 15 years of old hardware. Without Ringbuffers or Execlists, the Xe driver, from its very first line of code, completely relies on the GuC (Graphics Microcontroller) for hardware scheduling. This dramatically simplifies the code complexity of the gt/ layer.

2.2 Natively Embracing TTM and the DRM Scheduler

Pure TTM: The Xe driver no longer has its own custom GEM memory allocator but natively uses the DRM core's TTM framework from the start to manage system memory and discrete memory.
Standard Scheduler: Xe fully integrates drm_sched (DRM Scheduler). The complex dependency resolution and request ordering logic previously found in i915 is now largely handed off to the kernel's DRM subsystem for unified handling. This not only results in less code for the Xe driver but also allows it to work better with other graphics stacks.

2.3 The User-Mode Queue (VM_BIND) Model

Modern graphics APIs (like Vulkan, Direct3D 12) all adopt an explicit memory management and command queue model.
i915 uses the traditional implicit binding model (buffers are bound into the GPU address space as a side effect of command submission). In contrast, the Xe driver introduces the modern VM_BIND API:

Userspace explicitly binds memory regions into the GPU virtual address space ahead of time and submits command packets to execution queues it has created, so the kernel no longer has to resolve and bind a buffer list on every submission.
This 1:1 user/kernel execution queue mapping greatly reduces kernel overhead and enhances the efficiency of CPU command submission for rendering.

2.4 Code Reuse: Clever Display Isolation

While rewriting the low-level memory and scheduling components, Intel did not want to rewrite the extremely large and complex display code (hundreds of thousands of lines handling HDMI/DP/Type-C protocols).
The Xe team made a very ingenious design choice: refactoring i915's Display code into a relatively independent module. Now, when the Xe driver lights up a screen, it actually calls into this stripped-out i915 Display module (you will see Xe source code directly including i915 display header files). This achieves the low-level advantages of a modern architecture while preserving the battle-tested stability of the display output.

3. Conclusion

Two decades of development have turned i915 into a living textbook of Linux graphics drivers, recording every footprint of GPU technology's journey from weakness to strength. Although the spotlight will gradually shift to Xe, i915 will continue to be maintained for many years, providing stable graphics support to countless older devices.

This series of articles ends here. It is hoped that through the analysis in these 13 lectures, everyone will no longer feel intimidated when facing hundreds of thousands of lines of low-level C code. Although the world of the kernel is profound, as long as you trace the threads of "initialization, memory, scheduling, display," you can always discern its exquisite framework.

Part 12: The Undying Body: GPU Hang Detection and Reset

Deleon Karen — Tue, 02 Jun 2026 14:48:40 +0000

In complex graphics rendering or compute tasks, it is common for the GPU to "hang" due to executing defective shader code, encountering an infinite loop, or experiencing an anomaly in the hardware state machine. For a mature kernel driver, the ability to quickly detect a hang, capture on-site "last words," and gracefully resume operation is a key measure of its fault tolerance.

In i915, this life-support mechanism is primarily composed of three modules: Hangcheck, Error Capture, and Reset.

1. Detecting a GPU Hang: The Hangcheck and Heartbeat Mechanism

Early i915 drivers relied on a timer to poll the execution progress (whether the Seqno advanced) of all execution engines. However, in the modern i915 architecture (especially the implementation centered around intel_engine_heartbeat.c), the driver uses a more proactive and precise "Heartbeat" mechanism.

1.1 Heartbeat Emission and Detection

When an engine is in an active state, the driver periodically sends a special request, called a "Heartbeat" (Systole), containing only no-ops and synchronization barriers.

Under normal circumstances, the GPU quickly executes this heartbeat request and triggers an interrupt.
The driver uses mod_delayed_work to set a timeout. If the GPU fails to complete the heartbeat request within this period, the driver does not immediately declare it hung; it escalates first.

1.2 The Ultimatum (Preemption Timeout)

If the heartbeat times out, the driver plays its "trump card": it forcibly elevates the priority of this heartbeat request to the highest level (I915_PRIORITY_BARRIER).
This tells the hardware scheduler: "Whatever time-consuming task you are running right now, preempt it immediately and run my heartbeat first!"

If, even after issuing the highest-priority preemption command and waiting for a period (typically several hundred milliseconds, depending on preempt_timeout_ms), the heartbeat still does not pulse, i915 gives up completely and calls reset_engine(), officially declaring the engine Hung.
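
The two deadlines described above can be condensed into a small decision function. This is a simplified sketch, not the logic of intel_engine_heartbeat.c: the real driver raises priority in steps and works on request objects, while here the state names, the function, and the millisecond values used in the example are assumptions for illustration.

```python
from enum import Enum


class EngineState(Enum):
    HEALTHY = "healthy"
    PREEMPT_PENDING = "preempt_pending"
    HUNG = "hung"


def check_heartbeat(elapsed_ms, completed,
                    heartbeat_interval_ms, preempt_timeout_ms):
    """Classify an engine by how long its heartbeat has been outstanding.

    elapsed_ms: time since the heartbeat request was submitted.
    completed:  whether the GPU has retired the heartbeat request.
    """
    if completed or elapsed_ms < heartbeat_interval_ms:
        return EngineState.HEALTHY
    # First deadline missed: the heartbeat is bumped to barrier priority
    # and the engine gets preempt_timeout_ms more to let it through.
    if elapsed_ms < heartbeat_interval_ms + preempt_timeout_ms:
        return EngineState.PREEMPT_PENDING
    # Even forced preemption did not get the heartbeat executed.
    return EngineState.HUNG


# Illustrative values only: a 2500 ms heartbeat and a 640 ms preemption window.
state = check_heartbeat(3200, False, 2500, 640)
```

The point of the intermediate state is that a long-running but preemptible workload is never mistaken for a hang; only an engine that ignores the preemption request is reset.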

2. Collecting "Last Words": Error State Capture

Before pulling the plug on the GPU, the most important task is to preserve the crash scene so that developers can perform a post-mortem analysis. This step occurs within the intel_gt_handle_error() function.

When the I915_ERROR_CAPTURE flag is passed, the driver calls i915_capture_error_state():

Snapshot Registers: Reads all critical hardware register states at that moment (such as the error identity register EIR, the instruction pointer, current context state, etc.).
Capture Ringbuffer: Copies all the commands currently in the command ring (Ringbuffer) being executed by the engine.
Record Batchbuffer: If a long series of rendering commands submitted from user space caused the hang, the driver also saves the content of the relevant Batchbuffer.

These "last words" are packaged into the i915_gpu_error structure and exposed to user space via Linux's sysfs or debugfs (usually located at /sys/kernel/debug/dri/0/i915_error_state). Tools like intel_error_decode can read this to reconstruct what instructions the GPU was executing at the moment of the hang.

3. Reset Therapy: From Microsurgery to Defibrillation

After collecting the error state, i915 attempts to pull the GPU back from the brink of death. Modern Intel GPUs support a multi-level reset strategy, following the principle of "minimizing the impact area."

3.1 Engine Reset

If only the Video Decode Engine (VCS) is stuck, while the Render Engine (RCS) is still happily running a game, we obviously don't want the entire screen to go black.

The driver first tries to call intel_engine_reset().
The hardware sends a reset signal only to the specific engine(s) that are stuck, cleans up the hung context, and preserves the running state of other engines.
This is an extremely "minimally invasive" recovery method; the user might only feel a slight stutter in one specific task.

3.2 GT Reset (Full Reset)

If the engine reset fails, or if the hang involves shared common resources (like the memory scheduler or command streamer), the driver has to fall back to the next option.

The driver executes intel_gt_reset(), enters the I915_RESET_BACKOFF state, and pauses submissions to all engines.
It sends a reset signal to the entire GT (Graphics Technology) core. This clears all hardware execution queues.
After a successful reset, the driver requeues and resubmits the innocent requests that did not cause the hang. For the culprit, it directly returns -EIO (Input/Output Error), telling the application "your task has been terminated."

3.3 Device-Level Reset (PCI Reset - Wedged)

If even the GT reset fails to wake the GPU (for instance, the hardware state machine has completely collapsed, or the bus is deadlocked), i915 desperately marks the device as Wedged.

It calls intel_gt_set_wedged(). In this state, the driver rejects all graphics execution requests from user space, and all new execbuffer calls immediately return -EIO.
If conditions permit, the driver may attempt the highest-level reset (Device Reset) at the PCI bus level. However, if it comes to this, the screen usually flickers, and a machine reboot might be necessary for full recovery.
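
The escalation order across the three levels can be sketched as follows. The callables stand in for the hardware operations (intel_engine_reset() and intel_gt_reset() in the text); the function name and its string return values are hypothetical and exist only to make the ordering explicit.

```python
def recover(try_engine_reset, try_gt_reset):
    """Escalate from the narrowest reset to the widest.

    Each argument is a callable returning True on success. Returns the
    name of the level at which the escalation stopped.
    """
    if try_engine_reset():
        return "engine"   # only the stuck engine was disturbed
    if try_gt_reset():
        return "gt"       # all engines reset, innocent requests replayed
    return "wedged"       # give up: new submissions fail with -EIO


# Example: the engine reset fails, the GT reset succeeds.
attempts = []
outcome = recover(lambda: attempts.append("engine") or False,
                  lambda: attempts.append("gt") or True)
```

Trying the cheapest option first is what keeps the blast radius small: a wider reset is attempted only after the narrower one has actually failed.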

Summary

A robust GPU driver should not assume that hardware will always run perfectly. i915's Heartbeat mechanism monitors the health of the engines like an electrocardiograph; i915_gpu_error records all data before a crash like a black box; and the multi-level Reset mechanism, from Engine to GT level, works like an emergency room physician, doing its utmost to revive the GPU to full health and protect the user experience to the greatest extent possible.

In the next lecture, we will reach the final installment of this series, reviewing the historical baggage carried by the i915 behemoth with its hundreds of thousands of lines of code, and see how Intel's latest Xe architecture driver travels light and faces the future.

Part 11: Extreme Power Control: RPM, RC6, and RPS

Deleon Karen — Tue, 02 Jun 2026 14:40:25 +0000

In previous chapters, we explored how i915 manages video memory, schedules tasks, and lights up the display. However, for modern GPUs, "how to run fast" is merely the baseline; "how to save power" is the real technical barrier.

Whether in a thin-and-light laptop or a power-constrained data center, the GPU is a notorious power hog. If left unchecked, it will not only drain the battery but also cause severe thermal throttling. To squeeze out every drop of energy efficiency, the i915 driver has built an extremely sophisticated power control system within the kernel.

Today, we will focus on the three core pillars of i915 power management: RPM (Runtime PM), RC6, and RPS.

1. Putting the Device to Sleep as a Whole: Runtime PM (RPM)

Imagine this scenario: your laptop screen is on, displaying a static, plain-text article. At this moment, apart from the display controller periodically reading the framebuffer, all of the GPU's compute engines are essentially idle. In a more extreme case, if you have an external eGPU (docking station) connected and no program is currently using it, should it still be running at full power?

Runtime Power Management (RPM) is designed to solve this problem. It is a standard power management framework provided by the Linux kernel, and its core implementation in the i915 driver resides in intel_runtime_pm.c.

1.1 Core Mechanism: Wakeref (Wake Reference)

i915's RPM management relies on a mechanism called Wakeref (Wake Reference Counting).

When the driver needs to access GPU hardware (e.g., writing instructions to registers, handling interrupts, or when userspace initiates a rendering request), it must first acquire a Wakeref: calling intel_runtime_pm_get(&i915->runtime_pm).
If this is the first Wakeref, the driver triggers a device wake-up (waking from PCI D3hot/D3cold sleep state to D0 full-power state).
Once the operation is complete, the driver releases the reference: calling intel_runtime_pm_put().
Once the Wakeref count drops to zero, the driver considers the GPU idle and allows the device to enter deep sleep.

This mechanism is extremely strict. Assertions like assert_rpm_wakelock_held() are often seen in the code, forcing developers to "hold a permit" before touching the hardware. Otherwise, reading or writing to silicon that has already been powered down will directly cause the entire system to hang.
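
The reference-counting rule can be modeled in a few lines. This is a toy, not the code in intel_runtime_pm.c: the class and method names are invented for the sketch, power transitions are only recorded in a log, and details such as autosuspend delays are left out.

```python
class RuntimePm:
    """Toy wake-reference counter; power transitions are only logged."""

    def __init__(self):
        self.wakeref_count = 0
        self.powered = False
        self.log = []

    def get(self):
        if self.wakeref_count == 0:
            self.powered = True
            self.log.append("resume")    # first reference wakes the device
        self.wakeref_count += 1

    def put(self):
        if self.wakeref_count == 0:
            raise RuntimeError("unbalanced put")
        self.wakeref_count -= 1
        if self.wakeref_count == 0:
            self.powered = False
            self.log.append("suspend")   # last reference gone: allow sleep

    def read_register(self, offset):
        # Same idea as assert_rpm_wakelock_held(): no wakeref, no access.
        if self.wakeref_count == 0:
            raise RuntimeError("register access without a wakeref")
        return 0  # placeholder value; there is no real hardware here


pm = RuntimePm()
pm.get()
pm.get()                 # nested user: no second wake-up
pm.read_register(0x1000)
pm.put()
pm.put()                 # count reaches zero, device may sleep
```

Only the transitions between zero and one trigger a power change, so nested users pay nothing beyond a counter update.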

2. Dynamic Sleep of the Render Engine: RC6 (Render C-States)

RPM controls the life and death of the entire GPU device, but its granularity is too coarse. If the screen is refreshing, the device cannot completely enter RPM sleep. At this point, we need finer-grained control — RC6.

2.1 What is RC6?

In the CPU world, there are C-States (e.g., C0 is running, C6 is deep sleep). Intel GPUs introduced a similar concept called Render C-States (RC). The most critical among them is RC6.

When RC6 is enabled, the hardware's PCU (Power Control Unit) constantly monitors the busyness of each engine (like the render engine RCS, video engine VCS). If it finds an engine has been idle for more than a few milliseconds, the hardware automatically cuts off that engine's clock and even power, while other parts (like the display output) remain operational.

2.2 The Driver's Role

Entering and exiting RC6 is primarily handled automatically by the hardware, but in gt/intel_rc6.c, the i915 driver is responsible for:

Initialization and Enabling: Configuring the hardware sleep thresholds and policies in intel_rc6_enable().
Status Monitoring: Reading hardware registers via intel_rc6_residency_ns() to count what proportion of the recent past the GPU spent in the RC6 (sleep) state. This data is often used to evaluate whether the driver's power-saving optimizations are effective.
Deeper Sleep: Besides RC6, the hardware also supports deeper RC6p and RC6pp states. Entering these states saves even more power but comes with greater wake-up latency.
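
For the status-monitoring point, the usual way to turn a cumulative residency counter into a percentage is to sample it twice and divide by the elapsed time. The helper below is a sketch of that arithmetic only; its name and parameters are assumptions, and it does not read any hardware or call intel_rc6_residency_ns().

```python
def rc6_residency_percent(residency_ns_before, residency_ns_after, elapsed_ns):
    """Share of the sampling window spent in RC6, from two cumulative readings."""
    if elapsed_ns <= 0:
        raise ValueError("elapsed_ns must be positive")
    delta = residency_ns_after - residency_ns_before
    # Clamp in case the samples were not taken exactly at the window edges.
    delta = max(0, min(delta, elapsed_ns))
    return 100.0 * delta / elapsed_ns


# Two samples taken one second apart: 0.9 s of that second was spent in RC6.
percent = rc6_residency_percent(1_000_000_000, 1_900_000_000, 1_000_000_000)
```

A value close to 100 on an idle desktop suggests the engines are sleeping as intended; a low value under the same conditions usually means something keeps waking them.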

3. Dynamic Frequency Scaling (GPU Turbo): RPS (Render P-States)

If RC6 is about making the GPU "sleep" smartly, then RPS (Render P-States) is about making the GPU "work" smartly. It is the equivalent of dynamic frequency scaling in the CPU world (cpufreq / Turbo Boost).

3.1 Three Anchors of P-States

In gt/intel_rps.c, you will often see three terms that define the boundaries of GPU frequency:

RP0: The maximum turbo frequency supported by the hardware (Max Turbo Frequency), offering the highest performance but consuming the most power.
RP1: The guaranteed base frequency (Guaranteed Frequency), the sustained maximum frequency under good thermal conditions.
RPn: The minimum frequency supported by the hardware (Minimum Frequency), offering the lowest performance but consuming the least power.

3.2 Load-Based Frequency Scaling

By default, the GPU sits at RPn when idle. When the workload increases, hardware counters trigger an interrupt (Up Threshold), notifying the driver that compute power is insufficient. The driver's intel_rps_set() then intervenes, stepping up the GPU frequency; the reverse happens when load decreases.

3.3 Sacrificing Power for Latency: Wait Boost

i915 features an extremely interesting mechanism called RPS Boost (wait boost).
In i915_gem_wait.c, when userspace (like X Server or a game) urgently needs a rendering result and makes a system call to wait for a dma_fence, the driver realizes: "The user is anxiously waiting for this frame."

At this moment, the driver calls intel_rps_boost(), ignoring the current load ramp-up curve, and directly forces the GPU frequency to the maximum RP0, completing the current task as quickly as possible to reduce visual latency for the user. Once the task is finished, the frequency rapidly drops back down. This is a classic strategy of "trading instantaneous high power consumption for ultimate experience."
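
Load-based stepping and the wait boost can be combined into one small policy function. The thresholds, the step size, and the function itself are assumptions for illustration; the real logic in gt/intel_rps.c is driven by hardware interrupts and is considerably more involved.

```python
def next_frequency(current_mhz, busy_percent, rpn_mhz, rp0_mhz,
                   step_mhz=50, up_threshold=85, down_threshold=60,
                   waiter_boost=False):
    """Pick the next GPU frequency from load, or jump to RP0 for a waiter."""
    if waiter_boost:
        return rp0_mhz                            # wait boost: skip the ramp
    if busy_percent > up_threshold:
        return min(current_mhz + step_mhz, rp0_mhz)
    if busy_percent < down_threshold:
        return max(current_mhz - step_mhz, rpn_mhz)
    return current_mhz                            # inside the hysteresis band


# Illustrative range: RPn = 300 MHz, RP0 = 1300 MHz.
ramped = next_frequency(300, 95, rpn_mhz=300, rp0_mhz=1300)
boosted = next_frequency(300, 10, rpn_mhz=300, rp0_mhz=1300, waiter_boost=True)
```

The gap between the two thresholds keeps the frequency from oscillating when the load hovers around a single cut-off, while the boost path bypasses the ramp entirely.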

Summary

i915's power control is a relay race from macro to micro:

RPM controls the big picture, decisively cutting power when the entire device is idle.
RC6 seizes every opportunity, letting the render engine sneak a nap in the millisecond gaps when the display is on but the image is static.
RPS manages the rhythm, dynamically adjusting frequency based on load during operation, and instantly delivering a "shot in the arm" when the user demands it.

It is the tight coordination of these three that enables Intel GPUs to achieve excellent battery life across a range of devices.

In the next lecture, we will explore the darkest yet most life-saving part of the GPU driver: how i915 pulls the GPU back from the brink when it truly "crashes" (Hang Detection and Reset).

Part 10: Deep Dive into Atomic Modesetting

Deleon Karen — Tue, 02 Jun 2026 14:38:02 +0000

In the previous sections, we discussed the execution of rendering commands and video memory management. But at the very end of the graphics stack, how exactly do the rendered pixels appear on the screen smoothly and seamlessly? This brings us to the most revolutionary architecture in the modern Linux display subsystem (DRM/KMS)—Atomic Modesetting.

In this lecture, we'll dive into the display directory of i915 to see how those complex state machines, often spanning thousands of lines, actually work.

1. Why Do We Need Atomic Commit?

In the early Legacy KMS era, display state updates were "fragmented." For example, if a userspace program (such as a Wayland Compositor or X Server) wanted to change the resolution and move the mouse cursor, it needed to call different IOCTLs separately: first set the CRTC (mode setting), then update the Plane (layer parameters), and finally move the Cursor.

The fatal flaws of this design were unpredictability and visible intermediate states:

Screen tearing/flickering: These three operations might span multiple vertical blanking (Vblank) intervals, leading to awkward moments where an old layer is displayed with a new cursor.
"Getting stuck halfway": If, after setting the CRTC, the driver discovers during the Plane setup that the hardware bandwidth is insufficient, the operation will return an error. However, the CRTC has already been modified and is difficult to roll back, ultimately resulting in a black screen.

To solve this problem, the kernel introduced Atomic KMS. Its core idea is to package all display states (configurations of CRTCs, Planes, and Connectors) into a single "Transaction." The driver either applies the entire state to the hardware perfectly in one go, or rejects it outright during the check phase, ensuring absolute consistency of the hardware state—this is the essence of "Atomicity."

2. Two-Phase Design: Check and Commit

In i915, the entire atomic update process is strictly divided into two phases: atomic_check and atomic_commit.

Phase One: State Validation (`intel_atomic_check`)

When userspace submits a packaged new state, the entry function is intel_atomic_check.
During this phase, the driver must never modify any actual hardware registers. Its sole task is to simulate in software:

Recompute PLL clocks (checking if the required pixel frequency can be generated).
Calculate display bandwidth (checking if memory bandwidth can support a 4K 144Hz display with multiple UI layers).
If all validations pass, return 0; if even a single hardware constraint cannot be met, return an error code directly (such as -EINVAL or -ENOSPC). The userspace application can then lower its requirements (e.g., reduce the refresh rate) and try again based on this feedback.

Phase Two: Hardware Commit (`intel_atomic_commit`)

After validation passes, the kernel asynchronously calls intel_atomic_commit (which ultimately lands in intel_atomic_commit_tail).
This is an extremely precise state machine that dismantles the old state and assembles the new state in a strict order:

Wait for the rendering of the previous frame to complete (depending on dma_fence as discussed earlier).
Disable Planes and CRTCs that are no longer in use.
Update global display clocks and DDI interfaces.
Reallocate DDB and Watermarks (detailed later).
Enable new CRTCs and Planes.
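
The all-or-nothing contract between the two phases can be shown with a toy display that has a single constraint. Everything here is hypothetical (the class, the bandwidth budget, the state layout); it only demonstrates that validation never touches the "hardware" state and that a rejected request leaves it unchanged.

```python
import copy

EINVAL = 22  # errno value returned (negated) on a failed check


class Display:
    """Toy display with one constraint: plane bandwidth must fit a budget."""

    def __init__(self, bandwidth_budget):
        self.bandwidth_budget = bandwidth_budget
        self.hw_state = {"planes": {}}   # stands in for hardware registers

    def atomic_check(self, new_state):
        """Validate only; never modifies hw_state. Returns 0 or -EINVAL."""
        needed = sum(new_state["planes"].values())
        return 0 if needed <= self.bandwidth_budget else -EINVAL

    def atomic_commit(self, new_state):
        """Apply the whole state, or nothing at all."""
        ret = self.atomic_check(new_state)
        if ret:
            return ret
        self.hw_state = copy.deepcopy(new_state)
        return 0


display = Display(bandwidth_budget=100)
good = {"planes": {"primary": 60, "cursor": 5}}
bad = {"planes": {"primary": 60, "overlay": 50}}
first = display.atomic_commit(good)    # accepted and applied
second = display.atomic_commit(bad)    # rejected, previous state kept
```

After the rejected request the display still shows the earlier configuration, so userspace can lower its requirements and retry without ever having seen a half-applied state.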

3. The Carrier of Hardware Constraints: `intel_crtc_state`

In the i915 source file intel_display_types.h, there is a massive structure struct intel_crtc_state. It holds the complete blueprint for mapping a display pipe from software abstraction to hardware registers.

Among its contents, Watermarks (WM) and DDB (Display Data Buffer) are the key hardware constraints that determine whether an Atomic commit can succeed.

3.1 The Lifeline of Supply and Demand: Watermarks (WM)

The GPU's display engine needs to continuously read pixels from video memory and send them to the monitor. If the memory read speed cannot keep up with the monitor's scanning speed, the screen will show visual artifacts (Underrun).
To prevent this, the display engine has internal FIFO buffers. Watermarks are the "warning levels" for these buffers:

The driver needs to precisely calculate these based on the current resolution, color format (e.g., NV12 or ARGB), and memory latency (System Memory is slow, LMEM is fast).
When the data in the FIFO drops below this watermark level, the hardware must immediately issue an urgent memory read request.
In the wm field of intel_crtc_state, i915 meticulously calculates WMs at various levels (such as raw, optimal, intermediate) to ensure that the pixel stream never runs dry.
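
To make the supply-and-demand relationship concrete, here is a deliberately simplified model: the watermark is the amount of data the display consumes while one memory request is in flight, rounded up to whole buffer blocks. The function name, its parameters, and the 512-byte block size used in the example are assumptions; the real i915 calculation considers more inputs (tiling, line time, multiple latency levels).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified watermark model (illustrative, not the i915 formula):
 *   pixels consumed during the latency = pixel_rate_khz * latency_us / 1000
 *   bytes consumed                     = pixels * bytes_per_pixel
 *   blocks                             = ceil(bytes / block_size)
 */
uint32_t toy_wm_blocks(uint32_t pixel_rate_khz, uint32_t bytes_per_pixel,
                       uint32_t latency_us, uint32_t block_size)
{
    uint64_t scaled_bytes;
    uint64_t denom;

    assert(block_size != 0);

    /* Keep the factor of 1000 in the denominator to avoid early rounding. */
    scaled_bytes = (uint64_t)pixel_rate_khz * bytes_per_pixel * latency_us;
    denom = (uint64_t)1000 * block_size;

    return (uint32_t)((scaled_bytes + denom - 1) / denom);
}
```

For a 594000 kHz pixel rate (a common 4K 60 Hz timing), 4 bytes per pixel, 10 us of latency, and 512-byte blocks, the model yields 47 blocks; doubling the latency to 20 us yields 93. That is why slower memory demands a higher watermark.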

3.2 Precise Partitioning of SRAM: DDB Allocation

The total size of the high-speed SRAM buffer (DDB) inside the display engine is fixed. If you have multiple displays (Pipes) and multiple layers (Planes) working simultaneously, these DDB blocks must be precisely partitioned.

Dynamic Reallocation: If you suddenly add a video overlay plane to a running 4K monitor, this triggers an Atomic commit. intel_atomic_check must recalculate the DDB (struct skl_ddb_entry).
Seamless Transition: If you forcibly snatch DDB from an active layer to give it to a new one, the active layer can underrun and show visible corruption. Therefore, in the commit path, you will see careful calls like intel_dbuf_mbus_pre_ddb_update() and intel_dbuf_mbus_post_ddb_update(), ensuring that the DDB re-partitioning transitions smoothly during the vertical blanking interval (Vblank).

If, during the atomic_check phase, the driver finds that even squeezing all available DDB cannot meet the newly requested resolution and layer combination, it will directly reject this update.
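
A minimal sketch of the partitioning idea, assuming a simple policy: every plane first receives the minimum it needs, and the remaining blocks are split in proportion to each plane's data rate. The `toy_` types, the policy, and the `-EINVAL` return value are illustrative assumptions, not the i915 algorithm.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct toy_plane_req {
    uint32_t data_rate;   /* relative data rate, used as a weight */
    uint16_t min_blocks;  /* blocks needed to satisfy the lowest watermark */
};

struct toy_ddb_entry {
    uint16_t start;       /* first block owned by the plane */
    uint16_t end;         /* one past the last block owned by the plane */
};

/* Returns 0 on success, -EINVAL if the minimums do not fit in the DDB. */
int toy_allocate_ddb(const struct toy_plane_req *req, size_t n,
                     uint16_t total_blocks, struct toy_ddb_entry *out)
{
    uint64_t total_rate = 0, total_min = 0;
    uint32_t spare, cursor = 0;
    size_t i;

    assert(req != NULL && out != NULL);

    for (i = 0; i < n; i++) {
        total_rate += req[i].data_rate;
        total_min += req[i].min_blocks;
    }

    /* Check phase: reject before anything is programmed. */
    if (total_min > total_blocks)
        return -EINVAL;

    spare = (uint32_t)(total_blocks - total_min);
    for (i = 0; i < n; i++) {
        uint32_t extra = 0;

        if (total_rate != 0)
            extra = (uint32_t)((uint64_t)spare * req[i].data_rate / total_rate);

        out[i].start = (uint16_t)cursor;
        cursor += req[i].min_blocks + extra;  /* rounding down never overshoots */
        out[i].end = (uint16_t)cursor;
    }
    return 0;
}
```

With 1024 blocks and two planes that each need at least 100 blocks and have data rates 3 and 1, the sketch assigns blocks [0, 718) and [718, 1024). If only 150 blocks were available, the function would reject the request, which corresponds to the rejection path described above.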

Summary

Atomic Modesetting has largely ended the flickering and tearing problems that once plagued Linux graphical interfaces. Through a strict two-phase design (a validation phase that does not touch hardware, and a meticulously ordered commit phase), i915 encapsulates complex hardware constraints, including clocks, bandwidth, Watermarks, and DDB allocation, into a single atomic transaction.

In the next lecture, we will move into Part 5 and explore how the driver controls the power consumption and sleep states of this "performance beast."

Part 9: Mapping the KMS Model onto Intel Hardware

Deleon Karen — Tue, 02 Jun 2026 14:35:32 +0000

In the previous two articles, we took a deep dive into the GPU's "heart" (GT) and "blood" (GEM). Today, we turn to the GPU's "eyes"—the display subsystem. In the Linux kernel, display drivers follow the KMS (Kernel Mode Setting) model.

For the i915 driver, the challenge lies in how to precisely map KMS's generic abstractions (CRTC, Plane, Connector, Encoder) onto Intel's complex display hardware pipeline (Pipes, Transcoders, DDIs).

1. Mapping Software Abstractions to Hardware Entities

To understand the display flow, we must first establish a mapping table from "software objects" to "hardware circuits":

| KMS Object (DRM) | i915 Implementation Struct | Corresponding Intel HW Block | Responsibility |
| --- | --- | --- | --- |
| Plane | `intel_plane` | Hardware Plane | Pixel layer, responsible for scaling, rotation, and format conversion. |
| CRTC | `intel_crtc` | Pipe | Scanline generator, responsible for blending layers and applying the color Look-Up Table (LUT). |
| (No direct equivalent) | (maintained in state) | Transcoder | Protocol converter, responsible for encoding Pipe signals into timing signals (e.g., HDMI/DP timing). |
| Encoder | `intel_encoder` | DDI / Port | Digital Display Interface, the physical output endpoint (e.g., Port A, Port B). |
| Connector | `intel_connector` | Physical Socket | Receives the panel's EDID, detects HPD (Hot Plug Detect). |

2. The Core Hardware Pipeline: Pipes and Transcoders

In Intel's documentation, the display pipeline is typically described as a series of connected modules.

2.1 Pipe

The core of intel_crtc is the Pipe (typically named Pipe A, B, C, D).

Task: It fetches the pixels of each Plane from memory, blends them according to Z-order, and applies Gamma correction and Color Space Conversion (CSC).
Mapping: One intel_crtc instance strictly corresponds to one physical Pipe.

2.2 Transcoder

This concept is easy to confuse with the Pipe. In modern Intel hardware, an abstraction layer called the Transcoder sits between the Pipe and the output port.

Role: It determines the output timing. For example, even if you are using Pipe A, if you are driving an eDP panel, it might be routed to TRANSCODER_EDP; for a normal DP output, it might use TRANSCODER_A.
Flexibility: This design allows the hardware to share display logic among different physical interfaces.
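
The routing decision described above can be sketched as a small selection function. The enum names and the `has_edp_transcoder` flag are hypothetical and exist only for this illustration; the flag models the assumption that some platforms provide a dedicated eDP transcoder while others do not.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical enums for illustration; not the i915 definitions. */
enum toy_pipe { TOY_PIPE_A, TOY_PIPE_B, TOY_PIPE_C, TOY_PIPE_D };

enum toy_transcoder {
    TOY_TRANSCODER_A,
    TOY_TRANSCODER_B,
    TOY_TRANSCODER_C,
    TOY_TRANSCODER_D,
    TOY_TRANSCODER_EDP
};

enum toy_output { TOY_OUTPUT_DP, TOY_OUTPUT_HDMI, TOY_OUTPUT_EDP };

/*
 * Pick the transcoder for a pipe/output pair.
 * eDP goes to the dedicated transcoder when one exists (assumption);
 * everything else uses the transcoder that matches the pipe.
 */
enum toy_transcoder toy_pick_transcoder(enum toy_pipe pipe,
                                        enum toy_output output,
                                        bool has_edp_transcoder)
{
    if (output == TOY_OUTPUT_EDP && has_edp_transcoder)
        return TOY_TRANSCODER_EDP;

    switch (pipe) {
    case TOY_PIPE_A: return TOY_TRANSCODER_A;
    case TOY_PIPE_B: return TOY_TRANSCODER_B;
    case TOY_PIPE_C: return TOY_TRANSCODER_C;
    case TOY_PIPE_D: return TOY_TRANSCODER_D;
    }

    assert(!"unknown pipe");
    return TOY_TRANSCODER_A;
}
```

The sketch shows why the Pipe and the Transcoder are tracked separately: the same Pipe A can end up on different transcoders depending on what it drives.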

3. Flexible Output Endpoints: DDI and Port

Before Haswell, Intel hardware had dedicated HDMI controllers, DP controllers, etc. Starting with the Haswell architecture, Intel introduced the DDI (Digital Display Interface).

Essence of DDI: It is a universal physical interface. Through programming, the same DDI port can run either HDMI or DisplayPort.
Role of the Encoder: In i915, intel_encoder represents a DDI port. Due to the versatility of DDI, a single intel_encoder can often support multiple protocols, although a port drives only one protocol at any given time.

4. A Pixel's Fantastic Journey: Data Flow Example

Imagine you are gaming on a laptop connected to an external 4K monitor:

Plane: The game frame, as an intel_plane, waits in video memory (LMEM/SMEM).
Pipe: The intel_crtc (Pipe A) blends the game frame layer with the mouse cursor layer (Cursor Plane).
Transcoder: TRANSCODER_A generates the blanking and synchronization signals for 4K@60Hz.
Encoder: The signal flows to the DDI port (Port B), where the pixels are packetized into DisplayPort Micro-packets.
Connector: Finally, through the physical Type-C/DP port, the pixels race to the monitor.
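
The journey above can be captured as a small route record. This is only a sketch for orientation when reading the driver: `toy_route` and `toy_describe_route` are hypothetical names, and the real driver keeps this information spread across its state structures rather than in a single struct.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical record of the blocks one pixel stream passes through. */
struct toy_route {
    char pipe;               /* 'A'..'D' */
    const char *transcoder;  /* e.g. "TRANSCODER_A" */
    char port;               /* DDI port letter, e.g. 'B' */
};

/*
 * Format the route as a readable chain.
 * Returns the snprintf result: the length the full string needs.
 */
int toy_describe_route(const struct toy_route *route, char *buf, size_t len)
{
    assert(route != NULL && buf != NULL && len > 0);

    return snprintf(buf, len,
                    "plane -> pipe %c -> %s -> DDI %c -> connector",
                    route->pipe, route->transcoder, route->port);
}
```

For the gaming example, a route of pipe 'A', "TRANSCODER_A", and port 'B' formats as `plane -> pipe A -> TRANSCODER_A -> DDI B -> connector`.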

5. Glossary: A Guide to Avoiding Pitfalls

When reading the i915 source code, you will encounter the following high-frequency terms:

FDI (Flexible Display Interface): The pathway connecting the CPU to the PCH display chip in older architectures (largely obsolete on modern GPUs but commonly found in legacy code).
PCH (Platform Controller Hub): The motherboard southbridge, which previously handled some display outputs; now most of that logic is integrated into the CPU.
Watermarks: This is an extremely complex topic. A Pipe needs time to read data from memory, and Watermarks determine when the hardware initiates memory requests to prevent the display buffer from "running dry" and causing screen flicker.

Summary

The i915 display driver essentially "pieces together" the generic KMS model onto Intel's specific hardware pipeline. intel_crtc controls the Pipe's blending logic, while intel_encoder manages the flexible DDI interface. The Transcoder in between serves as the link connecting the two.

In the next lecture, we will delve into the part that gives display driver engineers the biggest headache: the two-phase commit mechanism of Atomic Modesetting.
