<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Deleon Karen</title>
    <description>The latest articles on DEV Community by Deleon Karen (@deleon_karen_2216eb5888b3).</description>
    <link>https://dev.to/deleon_karen_2216eb5888b3</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1758826%2F0eba7e03-6ae1-475f-b6c0-a1917fcbc753.png</url>
      <title>DEV Community: Deleon Karen</title>
      <link>https://dev.to/deleon_karen_2216eb5888b3</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/deleon_karen_2216eb5888b3"/>
    <language>en</language>
    <item>
      <title>Part 3: The Core of Memory Management: Allocation and Residency</title>
      <dc:creator>Deleon Karen</dc:creator>
      <pubDate>Tue, 02 Jun 2026 15:21:41 +0000</pubDate>
      <link>https://dev.to/deleon_karen_2216eb5888b3/part-3-the-core-of-memory-management-allocation-and-residency-50bd</link>
      <guid>https://dev.to/deleon_karen_2216eb5888b3/part-3-the-core-of-memory-management-allocation-and-residency-50bd</guid>
      <description>&lt;p&gt;If WDDM is an operating system, then Video Memory Management (VidMm) is its heart. With WDDM 2.0+, the logic of memory management underwent a fundamental shift from "OS-applied patching" to "driver/application-managed state."&lt;/p&gt;

&lt;h2&gt;
  
  
  Physical Perspective: Implementation of Memory Segments
&lt;/h2&gt;

&lt;p&gt;The driver describes the GPU's physical memory layout to the OS through "memory segments." This is primarily accomplished through two calls to &lt;code&gt;DxgkDdiQueryAdapterInfo&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fynf8gfpj7fjr3a3to6rh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fynf8gfpj7fjr3a3to6rh.png" alt="Sequence diagram of the two‑phase memory segment query between Dxgkrnl (VidMm) and the miniport driver via DxgkDdiQueryAdapterInfo. Phase 1 retrieves the segment count; Phase 2 populates the DXGK_SEGMENTDESCRIPTOR3 array describing VRAM and Aperture segments." width="800" height="488"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation Flow:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;First Call (Get Count):&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;  The OS sends &lt;code&gt;DXGKQAITYPE_QUERYSEGMENT&lt;/code&gt; (or &lt;code&gt;_QUERYSEGMENT3&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;  The driver fills in only &lt;code&gt;DXGK_QUERYSEGMENTOUT3.NbSegment&lt;/code&gt; (e.g., with one VRAM segment and one aperture segment, it reports 2).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Second Call (Populate Descriptors):&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;  The OS allocates space for the &lt;code&gt;DXGK_SEGMENTDESCRIPTOR3&lt;/code&gt; array.&lt;/li&gt;
&lt;li&gt;  The driver populates the specific parameters for each segment.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
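
&lt;p&gt;The two calls above can be sketched with stand-in types. This is a minimal sketch, not WDK code: &lt;code&gt;MockQuerySegmentOut&lt;/code&gt;, &lt;code&gt;MockSegmentDescriptor&lt;/code&gt;, and &lt;code&gt;MockQuerySegment&lt;/code&gt; are hypothetical names that only mirror the shape of the real structures, and the segment sizes and base addresses are made-up values.&lt;/p&gt;

```cpp
// Hypothetical stand-ins for the WDK types. A real driver would use
// DXGK_QUERYSEGMENTOUT3 and DXGK_SEGMENTDESCRIPTOR3 from the WDK headers.
struct MockSegmentFlags {
    bool Aperture;
    bool CpuVisible;
    bool PopulatedFromSystemMemory;
};

struct MockSegmentDescriptor {
    unsigned long long BaseAddress;
    unsigned long long Size;
    MockSegmentFlags Flags;
};

struct MockQuerySegmentOut {
    unsigned NbSegment;                         // segment count
    MockSegmentDescriptor* pSegmentDescriptor;  // null on the first call
};

// Assumed topology for this sketch: one local VRAM segment, one aperture.
constexpr unsigned kSegmentCount = 2;

// Returns true on success.
// Phase 1: the descriptor array is null, so only the count is reported.
// Phase 2: the caller-provided array is filled in, one entry per segment.
bool MockQuerySegment(MockQuerySegmentOut* out) {
    if (out == nullptr) {
        return false;
    }
    if (out->pSegmentDescriptor == nullptr) {
        out->NbSegment = kSegmentCount;
        return true;
    }
    if (out->NbSegment != kSegmentCount) {
        return false;  // the array was not sized from the phase 1 result
    }
    MockSegmentDescriptor* seg = out->pSegmentDescriptor;

    // Segment 0: local VRAM, assumed CPU-visible through a BAR.
    seg[0].BaseAddress = 0;
    seg[0].Size = 256ull * 1024 * 1024;
    seg[0].Flags = MockSegmentFlags{false, true, false};

    // Segment 1: aperture window into system memory.
    seg[1].BaseAddress = 0x100000000ull;
    seg[1].Size = 512ull * 1024 * 1024;
    seg[1].Flags = MockSegmentFlags{true, false, false};
    return true;
}
```

&lt;p&gt;A caller would invoke &lt;code&gt;MockQuerySegment&lt;/code&gt; once with a null array to learn the count, allocate that many descriptors, and then call it again with the array attached.&lt;/p&gt;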

&lt;h3&gt;
  
  
  Key Parameter Analysis:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;BaseAddress / Size:&lt;/strong&gt; 

&lt;ul&gt;
&lt;li&gt;  For &lt;strong&gt;Local VRAM&lt;/strong&gt;: This is the physical starting address as seen internally by the GPU.&lt;/li&gt;
&lt;li&gt;  For &lt;strong&gt;Aperture Segment&lt;/strong&gt;: This is the window starting address the GPU uses to access system memory.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;  &lt;strong&gt;CpuVisibleAddress:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;  If the GPU's physical memory is mapped into the CPU's address space via a PCIe BAR, the driver needs to provide &lt;code&gt;CpuVisibleAddress&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;WDDM 2.0+ Optimization:&lt;/strong&gt; Even without a large BAR, dynamic mapping via &lt;code&gt;CpuHostAperture&lt;/code&gt; is possible, with VidMm handling paging automatically.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;  &lt;strong&gt;Flags:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;Aperture&lt;/code&gt;: Marks the segment as a "window" into system memory.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;PopulatedFromSystemMemory&lt;/code&gt;: Marks whether the physical backing store of this segment is essentially system RAM.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;CpuVisible&lt;/code&gt;: Tells the OS whether the CPU can directly read/write this segment's memory.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Development Guidance:&lt;/strong&gt; Be rigorous about which segments you mark as &lt;code&gt;PopulatedFromSystemMemory&lt;/code&gt;. Misreporting system memory as local VRAM skews the OS's page file accounting and its Total Graphics Memory calculation.&lt;/p&gt;
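
&lt;p&gt;To see why the flag matters, here is a deliberately simplified accounting model. It is an illustration only and not the formula VidMm actually uses: &lt;code&gt;MockSegment&lt;/code&gt;, &lt;code&gt;MockMemoryTotals&lt;/code&gt;, and &lt;code&gt;MockSumSegments&lt;/code&gt; are hypothetical names, and the rule "aperture or system-populated counts as system-backed" is an assumption made for the sketch.&lt;/p&gt;

```cpp
// Hypothetical, simplified segment record for this sketch.
struct MockSegment {
    unsigned long long Size;
    bool Aperture;
    bool PopulatedFromSystemMemory;
};

struct MockMemoryTotals {
    unsigned long long Dedicated;     // bytes treated as local VRAM
    unsigned long long SystemBacked;  // bytes ultimately backed by system RAM
};

// Assumed rule: a segment counts as system-backed when either flag is set.
// Everything else is counted as dedicated video memory.
MockMemoryTotals MockSumSegments(const MockSegment* segments, unsigned count) {
    MockMemoryTotals totals = {0, 0};
    for (unsigned i = 0; i != count; ++i) {
        const bool systemBacked =
            segments[i].Aperture || segments[i].PopulatedFromSystemMemory;
        if (systemBacked) {
            totals.SystemBacked += segments[i].Size;
        } else {
            totals.Dedicated += segments[i].Size;
        }
    }
    return totals;
}
```

&lt;p&gt;In this model, forgetting the flag on a 128 MB carve-out moves those 128 MB from the system-backed total into the dedicated total, which is the kind of discrepancy the guidance above warns about.&lt;/p&gt;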

&lt;h2&gt;
  
  
  Creating Allocations: DxgkDdiCreateAllocation
&lt;/h2&gt;

&lt;p&gt;When an application requests resources, the OS calls this DDI.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Responsibility:&lt;/strong&gt; The driver must calculate the size and alignment requirements for the resource and return &lt;code&gt;DXGK_ALLOCATIONINFO&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;WDDM 2.0 Change:&lt;/strong&gt; Drivers no longer need to record a "Patch Location List," as resources are now accessed via virtual addresses.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Engineering Focus: GPU Virtual Addressing (GPUVA)
&lt;/h2&gt;

&lt;p&gt;This is a core feature of WDDM 2.0.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh0iwzigc3fjatfwe6drd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh0iwzigc3fjatfwe6drd.png" alt="Overview of GPU Virtual Addressing (GPUVA) in WDDM 2.0. Each process has its own GPU virtual address space. The VidMm and driver page tables map virtual addresses to physical pages residing on hardware segments such as local VRAM or system memory accessed through an aperture over PCIe." width="800" height="806"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Concept:&lt;/strong&gt; Each process has its own GPU virtual address space, 48 bits wide or larger depending on the hardware.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Benefit:&lt;/strong&gt; UMD can use fixed addresses directly in the command stream, eliminating the need for the OS to modify instructions before submission.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Driver Responsibility:&lt;/strong&gt; The KMD must implement page table updates, which arrive as &lt;code&gt;UpdatePageTable&lt;/code&gt; operations within &lt;code&gt;DxgkDdiBuildPagingBuffer&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
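
&lt;p&gt;As a rough illustration of what a page table walk over such an address space involves, the sketch below splits a 48-bit GPU virtual address into per-level table indices. The layout is an assumption chosen for the example (4 KiB pages, four levels of 512 entries each); real hardware reports its own layout, and the function names are hypothetical.&lt;/p&gt;

```cpp
// Assumed layout for this sketch: 48-bit virtual address, 4 KiB pages,
// four page table levels with 512 entries (9 bits) per level.
constexpr unsigned kPageShift = 12;
constexpr unsigned kBitsPerLevel = 9;
constexpr unsigned long long kEntriesPerTable = 512;
constexpr unsigned long long kPageSize = 4096;

// Byte offset inside the 4 KiB page.
constexpr unsigned long long PageOffset(unsigned long long va) {
    return va % kPageSize;
}

// Index into the page table at the given level (0 = leaf, 3 = root).
constexpr unsigned long long TableIndex(unsigned long long va, unsigned level) {
    return (va >> (kPageShift + kBitsPerLevel * level)) % kEntriesPerTable;
}

// Inverse operation: rebuild an address from its indices and page offset.
constexpr unsigned long long MakeVa(unsigned long long i3, unsigned long long i2,
                                    unsigned long long i1, unsigned long long i0,
                                    unsigned long long offset) {
    return (((i3 * kEntriesPerTable + i2) * kEntriesPerTable + i1)
                * kEntriesPerTable + i0) * kPageSize + offset;
}

// 0x201000 is 2 MiB + 4 KiB: entry 1 at level 1, entry 1 at level 0.
static_assert(TableIndex(0x201000ull, 0) == 1, "leaf index");
static_assert(TableIndex(0x201000ull, 1) == 1, "level 1 index");
static_assert(TableIndex(0x201000ull, 2) == 0, "level 2 index");
```

&lt;p&gt;Because the UMD places these addresses directly in the command stream, only the page table entries change when an allocation moves; the commands themselves stay untouched.&lt;/p&gt;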

&lt;h2&gt;
  
  
  Residency Mechanism: MakeResident and Eviction
&lt;/h2&gt;

&lt;p&gt;Before WDDM 2.0, the OS automatically ensured all resources referenced by a command buffer were in video memory. Now, this responsibility is handed over to the driver (primarily UMD).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe549hqjhgpxjms2n5x4p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe549hqjhgpxjms2n5x4p.png" alt="State diagram of allocation residency and eviction in WDDM 2.0+. UMD explicitly requests residency via MakeResident (reference count +1). Under memory pressure, the OS may evict the allocation to system memory. Evicted allocations become non‑resident and must be made resident again before rendering." width="800" height="1233"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;MakeResident (Triggered by UMD):&lt;/strong&gt; The driver explicitly tells the OS, "The upcoming operations need these resources; please keep them in video memory."&lt;/li&gt;
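&lt;li&gt;  &lt;strong&gt;Eviction (OS Policy):&lt;/strong&gt; When video memory runs low, the OS uses policies such as LRU to evict resources that are no longer pinned by &lt;code&gt;MakeResident&lt;/code&gt;, moving them to system memory. An evicted resource is non-resident and must be made resident again before it is used for rendering.&lt;/li&gt;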
&lt;li&gt;  &lt;strong&gt;Development Guidance:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Do Not Over-Reside:&lt;/strong&gt; &lt;code&gt;MakeResident&lt;/code&gt; calls that push the process past its memory budget can fail.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Handle Eviction Notifications:&lt;/strong&gt; In WDDM 3.2, if a resource requires special handling (e.g., decompression) before eviction, the &lt;code&gt;NotifyEviction&lt;/code&gt; flag can be utilized.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
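
&lt;p&gt;The interplay between explicit residency and OS-driven eviction can be modeled in a few lines. This is a toy model under stated assumptions, not the VidMm algorithm: all &lt;code&gt;Mock&lt;/code&gt; names are hypothetical, the budget is a single byte limit, and the trim policy simply picks the least recently used allocation whose residency count is zero.&lt;/p&gt;

```cpp
// Hypothetical allocation record for this sketch.
struct MockAllocation {
    unsigned long long Size;
    unsigned ResidencyCount;     // MakeResident calls not yet balanced by Evict
    bool InVideoMemory;
    unsigned long long LastUse;  // larger value means more recently used
};

struct MockBudget {
    unsigned long long Limit;  // bytes this process may keep in video memory
    unsigned long long Used;   // bytes currently in video memory
};

// UMD-side request: take a residency reference, paging the allocation in
// if needed. Fails when the allocation does not fit in the remaining budget.
bool MockMakeResident(MockBudget* budget, MockAllocation* alloc,
                      unsigned long long now) {
    if (!alloc->InVideoMemory) {
        if (alloc->Size > budget->Limit - budget->Used) {
            return false;
        }
        budget->Used += alloc->Size;
        alloc->InVideoMemory = true;
    }
    alloc->ResidencyCount += 1;
    alloc->LastUse = now;
    return true;
}

// UMD-side release: drop one reference. The memory stays where it is until
// the OS-side trim below decides to reclaim it.
void MockEvict(MockAllocation* alloc) {
    if (alloc->ResidencyCount > 0) {
        alloc->ResidencyCount -= 1;
    }
}

// OS-side policy under memory pressure: page out the least recently used
// allocation that has no outstanding residency reference.
// Returns the victim, or nullptr when nothing can be trimmed.
MockAllocation* MockTrimLeastRecentlyUsed(MockBudget* budget,
                                          MockAllocation* allocs,
                                          unsigned count) {
    MockAllocation* victim = nullptr;
    for (unsigned i = 0; i != count; ++i) {
        MockAllocation* candidate = allocs + i;
        if (!candidate->InVideoMemory || candidate->ResidencyCount != 0) {
            continue;
        }
        if (victim == nullptr || victim->LastUse > candidate->LastUse) {
            victim = candidate;
        }
    }
    if (victim != nullptr) {
        victim->InVideoMemory = false;
        budget->Used -= victim->Size;
    }
    return victim;
}
```

&lt;p&gt;With a 100-byte budget and two 60-byte allocations, the second &lt;code&gt;MockMakeResident&lt;/code&gt; call fails until the first allocation is released with &lt;code&gt;MockEvict&lt;/code&gt; and then trimmed.&lt;/p&gt;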

&lt;h2&gt;
  
  
  Advanced Perspective: WDDM vs. Linux (GEM/TTM)
&lt;/h2&gt;

&lt;p&gt;If you are familiar with Linux kernel driver development, you will find WDDM's VidMm shares similarities with Linux's GEM/TTM, but there are core philosophical differences.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;WDDM 2.0+ (VidMm)&lt;/th&gt;
&lt;th&gt;Linux (GEM/TTM)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Basic Unit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Allocation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Buffer Object (BO)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory Abstraction&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Segments&lt;/strong&gt;: Memory, Aperture, System&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Regions/Placements&lt;/strong&gt;: VRAM, GTT, System&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Residency Logic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Explicit Residency&lt;/strong&gt;: UMD actively maintains device residency list&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Validation Logic&lt;/strong&gt;: Kernel ensures BO is in the correct Region upon command submission&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Address Binding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;GPUVA&lt;/strong&gt;: UMD allocates fixed virtual address, OS handles page table updates&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Relocations (Traditional GEM)&lt;/strong&gt;: OS modifies instruction stream; &lt;strong&gt;GPUVA (Modern)&lt;/strong&gt;: Similar to WDDM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Migration/Paging&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;OS-Driven&lt;/strong&gt;: VidMm decides when to page, driver executes &lt;code&gt;BuildPagingBuffer&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Driver-Led&lt;/strong&gt;: TTM provides framework, driver implements specific migration logic (Move)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Key Differences:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Boundary of Responsibility:&lt;/strong&gt; WDDM 2.0+ delegates more decision-making power regarding "which resources need to be resident" to the UMD (User Mode Driver) to reduce kernel call overhead. In contrast, traditional Linux TTM validates the BO list during each &lt;code&gt;execbuffer&lt;/code&gt; submission.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Page Table Management:&lt;/strong&gt; GPU virtual address management in WDDM is highly standardized, with the OS deeply involved in the page table lifecycle; under Linux, different drivers (e.g., i915 vs. AMDGPU) have greater freedom in their page table implementations.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Advanced Topic: IOMMU and Hardware Isolation
&lt;/h2&gt;

&lt;p&gt;For modern drivers (WDDM 2.4+), the IOMMU is no longer transparent to the driver. Besides providing security through GPU isolation, DMA remapping also solves addressing problems on high-end servers with more than 1 TB of physical memory.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Core Challenge&lt;/strong&gt;: Shifting from using physical addresses (MDL) to logical addresses (ADL).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;DDI That Must Be Implemented&lt;/strong&gt;: &lt;code&gt;DxgkDdiBeginExclusiveAccess&lt;/code&gt; (ensures hardware is quiet during IOMMU switches).&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Deep Dive&lt;/strong&gt;: Please refer to the dedicated document &lt;a href="//./wddm_guide_iommu.md"&gt;WDDM Advanced: IOMMU and DMA Remapping&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Developer Advice: Accuracy of Video Memory Statistics
&lt;/h2&gt;

&lt;p&gt;The OS heavily relies on the video memory statistics reported by the driver.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Reporting Mechanism:&lt;/strong&gt; Ensure &lt;code&gt;DXGK_DRIVERCAPS&lt;/code&gt; in &lt;code&gt;DxgkDdiQueryAdapterInfo&lt;/code&gt; correctly returns parameters like &lt;code&gt;MaxSharedSystemMemory&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Common Issue:&lt;/strong&gt; If the driver misreports the memory size, it will cause the OS to incorrectly configure page file settings, leading to memory pressure crashes that are difficult to debug.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>architecture</category>
      <category>computerscience</category>
      <category>microsoft</category>
      <category>systems</category>
    </item>
    <item>
      <title>Part 2: Driver Entry and Device Initialization</title>
      <dc:creator>Deleon Karen</dc:creator>
      <pubDate>Tue, 02 Jun 2026 15:15:57 +0000</pubDate>
      <link>https://dev.to/deleon_karen_2216eb5888b3/part-2-driver-entry-and-device-initialization-3fkj</link>
      <guid>https://dev.to/deleon_karen_2216eb5888b3/part-2-driver-entry-and-device-initialization-3fkj</guid>
      <description>&lt;p&gt;In the lifecycle of a WDDM driver, the initialization phase establishes the "contractual" relationship between the driver and the operating system. This chapter will provide an in-depth analysis of every core step from driver loading to device readiness, from the perspective of the underlying DDI.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Soul of the Driver: Deep Dive into the &lt;code&gt;DriverEntry&lt;/code&gt; Protocol
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;DriverEntry&lt;/code&gt; is the entry point for a KMD. Its core task is to register the driver's table of callback functions with &lt;code&gt;Dxgkrnl&lt;/code&gt; via &lt;code&gt;DxgkInitialize&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  1.1 Core Structure: &lt;code&gt;DRIVER_INITIALIZATION_DATA&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;This structure is very large, containing hundreds of callback interfaces. In practical engineering, you need to focus on the following points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Version Control (&lt;code&gt;Version&lt;/code&gt;):&lt;/strong&gt; Must be set to &lt;code&gt;DXGKDDI_INTERFACE_VERSION&lt;/code&gt;. This macro is defined differently across WDK versions, directly determining which DDIs the OS will attempt to call.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Mandatory Interfaces:&lt;/strong&gt; If you omit core functions like &lt;code&gt;DxgkDdiAddDevice&lt;/code&gt;, &lt;code&gt;DxgkDdiStartDevice&lt;/code&gt;, or &lt;code&gt;DxgkDdiExchangePnPInterface&lt;/code&gt;, &lt;code&gt;DxgkInitialize&lt;/code&gt; will directly return &lt;code&gt;STATUS_INVALID_PARAMETER&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwbcz3bxgsrqcir61p7ha.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwbcz3bxgsrqcir61p7ha.png" alt=" " width="798" height="230"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  1.2 Registration Example
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Example: Registering basic DDIs at the entry point.&lt;/span&gt;
&lt;span class="c1"&gt;// DriverEntry needs C linkage when the driver is built as C++.&lt;/span&gt;
&lt;span class="k"&gt;extern&lt;/span&gt; &lt;span class="s"&gt;"C"&lt;/span&gt; &lt;span class="n"&gt;NTSTATUS&lt;/span&gt; &lt;span class="nf"&gt;DriverEntry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PDRIVER_OBJECT&lt;/span&gt; &lt;span class="n"&gt;DriverObject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PUNICODE_STRING&lt;/span&gt; &lt;span class="n"&gt;RegistryPath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;DRIVER_INITIALIZATION_DATA&lt;/span&gt; &lt;span class="n"&gt;initData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="n"&gt;initData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Version&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DXGKDDI_INTERFACE_VERSION&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Basic Lifecycle&lt;/span&gt;
    &lt;span class="n"&gt;initData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DxgkDdiAddDevice&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MyAddDevice&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;initData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DxgkDdiStartDevice&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MyStartDevice&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;initData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DxgkDdiStopDevice&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MyStopDevice&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;initData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DxgkDdiRemoveDevice&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MyRemoveDevice&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;initData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DxgkDdiUnload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MyUnload&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Core Management&lt;/span&gt;
    &lt;span class="n"&gt;initData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DxgkDdiQueryAdapterInfo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MyQueryAdapterInfo&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;initData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DxgkDdiCreateDevice&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MyCreateDevice&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;initData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DxgkDdiDestroyDevice&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MyDestroyDevice&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Memory and Scheduling (Basic Set)&lt;/span&gt;
    &lt;span class="n"&gt;initData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DxgkDdiCreateAllocation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MyCreateAllocation&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;initData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DxgkDdiDestroyAllocation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MyDestroyAllocation&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;initData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DxgkDdiBuildPagingBuffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MyBuildPagingBuffer&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;initData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DxgkDdiSubmitCommand&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MySubmitCommand&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;DxgkInitialize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DriverObject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RegistryPath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;initData&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  1.3 Special Case: Display-Only Driver (KMDOD)
&lt;/h4&gt;

&lt;p&gt;For drivers without rendering capabilities that only support display (Kernel-Mode Display-Only Driver), the flow is slightly different:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Uses the &lt;code&gt;KMDDOD_INITIALIZATION_DATA&lt;/code&gt; structure.&lt;/li&gt;
&lt;li&gt;  Registers via &lt;code&gt;DxgkInitializeDisplayOnlyDriver&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;  Typically used for basic display adapters or remote desktop display drivers.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Establishing Context: &lt;code&gt;DxgkDdiAddDevice&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;When the PnP Manager discovers hardware, the OS calls &lt;code&gt;AddDevice&lt;/code&gt; once for each physical adapter.&lt;/p&gt;

&lt;h4&gt;
  
  
  2.1 Core Tasks
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Miniport Device Context:&lt;/strong&gt; You need to allocate a driver-private device context (usually a structure) here and return it to the OS. All subsequent device-related DDIs will carry this pointer.&lt;/li&gt;
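&lt;li&gt;  &lt;strong&gt;Obtaining Hardware Information:&lt;/strong&gt; Hardware details come from &lt;code&gt;DxgkCbGetDeviceInformation&lt;/code&gt;. Because this callback is delivered through the &lt;code&gt;DXGKRNL_INTERFACE&lt;/code&gt; that the driver first receives in &lt;code&gt;DxgkDdiStartDevice&lt;/code&gt;, the call is made there rather than in &lt;code&gt;AddDevice&lt;/code&gt;. It fills in a &lt;code&gt;DXGK_DEVICE_INFO&lt;/code&gt; structure containing: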

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;PCI Configuration Space Information.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Translated Resource List:&lt;/strong&gt; Including physical memory base addresses (BARs) and interrupt vectors.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  2.2 Registering Hardware Information (Registry Practice)
&lt;/h4&gt;

&lt;p&gt;According to official specifications, to correctly identify the hardware in "Control Panel -&amp;gt; Display", the driver must set the following registry values in &lt;code&gt;AddDevice&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Information Type&lt;/th&gt;
&lt;th&gt;Registry Key Name&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Chip Type&lt;/td&gt;
&lt;td&gt;&lt;code&gt;HardwareInformation.ChipType&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Friendly name string for the chip&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DAC Type&lt;/td&gt;
&lt;td&gt;&lt;code&gt;HardwareInformation.DacType&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;DAC type identifier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory Size&lt;/td&gt;
&lt;td&gt;&lt;code&gt;HardwareInformation.MemorySize&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ULONG, unit in MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Adapter Name&lt;/td&gt;
&lt;td&gt;&lt;code&gt;HardwareInformation.AdapterString&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Full name of the adapter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BIOS Info&lt;/td&gt;
&lt;td&gt;&lt;code&gt;HardwareInformation.BiosString&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;BIOS version string&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Implementation Hint:&lt;/strong&gt; Use &lt;code&gt;IoOpenDeviceRegistryKey&lt;/code&gt; to open the &lt;code&gt;PLUGPLAY_REGKEY_DRIVER&lt;/code&gt; key, then use &lt;code&gt;ZwSetValueKey&lt;/code&gt; to write the above values.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. "Ignition" Execution: &lt;code&gt;DxgkDdiStartDevice&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;StartDevice&lt;/code&gt; signals that the hardware should begin actual operation.&lt;/p&gt;

&lt;ul&gt;
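&lt;li&gt;  &lt;strong&gt;Dxgkrnl Interface Exchange:&lt;/strong&gt; The OS passes in the &lt;code&gt;DXGKRNL_INTERFACE&lt;/code&gt;. This interface contains all &lt;code&gt;DxgkCbXxx&lt;/code&gt; callbacks (e.g., querying VidPN, notifying interrupts). The driver must save its own copy of these function pointers and of the &lt;code&gt;DeviceHandle&lt;/code&gt; member, since every later call back into &lt;code&gt;Dxgkrnl&lt;/code&gt; depends on them.&lt;/li&gt;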
&lt;li&gt;  &lt;strong&gt;Hardware Resource Initialization:&lt;/strong&gt; Map MMIO registers, initialize GPU core logic.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Preliminary Capability Report:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;NumberOfChildren:&lt;/strong&gt; Reports the number of devices connected downstream of the display adapter (e.g., HDMI ports, internal panels).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;NumberOfVideoPresentSources:&lt;/strong&gt; Reports the number of display controllers inside the GPU (determines how many independent screens can be driven simultaneously).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Deep Dive: The Information Storm of &lt;code&gt;DxgkDdiQueryAdapterInfo&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;This is the busiest DDI during initialization. The OS queries various configurations using different &lt;code&gt;Type&lt;/code&gt; values.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;&lt;code&gt;DXGKQAITYPE_DRIVERCAPS&lt;/code&gt;:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Core Flags:&lt;/strong&gt; &lt;code&gt;SupportGpuVirtualAddress&lt;/code&gt; (whether virtual addresses are supported), &lt;code&gt;CanHandleTDR&lt;/code&gt; (whether timeout recovery is supported).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Scheduling Policy:&lt;/strong&gt; Determines whether global scheduling or hardware scheduling is used.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;  &lt;strong&gt;&lt;code&gt;DXGKQAITYPE_QUERYSEGMENT&lt;/code&gt;:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Defines Memory Topology:&lt;/strong&gt; Tells the OS which segments are local video memory (VRAM) and which are system memory apertures.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Segment Properties:&lt;/strong&gt; The &lt;code&gt;Aperture&lt;/code&gt; flag determines how VidMm handles that memory.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;  &lt;strong&gt;&lt;code&gt;DXGKQAITYPE_UMDRIVERPRIVATE&lt;/code&gt;:&lt;/strong&gt;

&lt;ul&gt;
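&lt;li&gt;  &lt;strong&gt;Private UMD Data:&lt;/strong&gt; Returns a driver-defined private data block that is handed to the UMD. The name of the UMD DLL itself (e.g., &lt;code&gt;my_umd.dll&lt;/code&gt;) is not returned here; it is registered through the driver's INF (the &lt;code&gt;UserModeDriverName&lt;/code&gt; value), and the OS uses that name to load the user-mode component.&lt;/li&gt;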
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
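
&lt;p&gt;Structurally, a handler for this DDI is a dispatch on the query type. The sketch below shows that shape with stand-in enums; &lt;code&gt;MockQueryType&lt;/code&gt;, &lt;code&gt;MockStatus&lt;/code&gt;, and the byte sizes are hypothetical, standing in for the &lt;code&gt;DXGKQAITYPE_*&lt;/code&gt; values, &lt;code&gt;NTSTATUS&lt;/code&gt; codes, and WDK structure sizes a real driver would use.&lt;/p&gt;

```cpp
// Hypothetical stand-ins for the DXGKQAITYPE_* values and NTSTATUS codes.
enum class MockQueryType { DriverCaps, QuerySegment, UmDriverPrivate, Unrecognized };
enum class MockStatus { Success, BufferTooSmall, NotSupported };

// Assumed output sizes in bytes; real sizes come from the WDK structures.
constexpr unsigned kDriverCapsSize = 64;
constexpr unsigned kQuerySegmentSize = 32;
constexpr unsigned kUmdPrivateSize = 16;

// Validates a query the way a handler typically would before filling the
// output buffer: unknown types are rejected, undersized buffers are reported.
MockStatus MockQueryAdapterInfo(MockQueryType type, unsigned outputSize) {
    unsigned required = 0;
    switch (type) {
    case MockQueryType::DriverCaps:
        required = kDriverCapsSize;
        break;
    case MockQueryType::QuerySegment:
        required = kQuerySegmentSize;
        break;
    case MockQueryType::UmDriverPrivate:
        required = kUmdPrivateSize;
        break;
    default:
        // Anything this driver does not implement is refused explicitly.
        return MockStatus::NotSupported;
    }
    if (required > outputSize) {
        return MockStatus::BufferTooSmall;
    }
    return MockStatus::Success;
}
```

&lt;p&gt;Keeping the default branch explicit means a query type added by a newer OS is refused cleanly instead of being answered with uninitialized data.&lt;/p&gt;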

&lt;h3&gt;
  
  
  5. Practical Engineering: The State Machine of the Initialization Phase
&lt;/h3&gt;

&lt;p&gt;Understanding the calling sequence of these DDIs is critical for troubleshooting startup failures. The following diagram shows the detailed call flow, organized according to the WDDM specification:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjjrmosul84vxu1sltotg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjjrmosul84vxu1sltotg.png" alt=" " width="800" height="1192"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Key Steps Explained:
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;DxgkDdiStartDevice:&lt;/strong&gt; Must accurately return &lt;code&gt;NumberOfChildren&lt;/code&gt; (e.g., number of HDMI, DP ports) and &lt;code&gt;NumberOfVideoPresentSources&lt;/code&gt; (number of CRTCs inside the GPU).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;DxgkDdiQueryChildRelations:&lt;/strong&gt; The driver needs to fill in &lt;code&gt;ChildUid&lt;/code&gt; here. This ID will be used as &lt;code&gt;VideoPresentTargetId&lt;/code&gt; in subsequent VidPN management.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;HPD (Hot Plug Detection):&lt;/strong&gt; For interfaces with interrupt or polling capabilities, the OS confirms whether a monitor is currently connected via &lt;code&gt;DxgkDdiQueryChildStatus&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;VidPN Negotiation:&lt;/strong&gt; This is the most complex part of initialization, involving the &lt;code&gt;VidPN Manager&lt;/code&gt; cooperating with the driver to establish the initial display path (Source -&amp;gt; Target).&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  6. Expert Recommendations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Lazy Initialization:&lt;/strong&gt; Avoid time-consuming hardware detection in &lt;code&gt;AddDevice&lt;/code&gt; as much as possible to keep the PnP process smooth.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Strict Version Checking:&lt;/strong&gt; In &lt;code&gt;QueryAdapterInfo&lt;/code&gt;, if a feature requested by the OS is completely unsupported by your hardware, directly return &lt;code&gt;STATUS_NOT_SUPPORTED&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Error Cleanup:&lt;/strong&gt; If &lt;code&gt;StartDevice&lt;/code&gt; fails, be sure to manually clean up already allocated resources and MMIO mappings before returning. The OS will not automatically roll back memory allocated in &lt;code&gt;AddDevice&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
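
&lt;p&gt;The error cleanup advice can be sketched as a start routine that unwinds in reverse order on any failure. This is a minimal model, not driver code: &lt;code&gt;MockHardware&lt;/code&gt; simulates which step fails, &lt;code&gt;MockDeviceContext&lt;/code&gt; records what has been acquired, and a plain &lt;code&gt;bool&lt;/code&gt; stands in for &lt;code&gt;NTSTATUS&lt;/code&gt;. All of these names are hypothetical.&lt;/p&gt;

```cpp
// Hypothetical switches that let the sketch simulate a failing step.
struct MockHardware {
    bool MapMmioSucceeds;
    bool InitEngineSucceeds;
    bool EnableInterruptsSucceeds;
};

// Tracks which start-time resources the driver currently holds.
struct MockDeviceContext {
    bool MmioMapped;
    bool EngineInitialized;
    bool InterruptsEnabled;
};

// Releases whatever was acquired, in reverse order of acquisition.
// Safe to call on a partially started device.
void MockReleaseStartResources(MockDeviceContext* ctx) {
    if (ctx->InterruptsEnabled) {
        ctx->InterruptsEnabled = false;
    }
    if (ctx->EngineInitialized) {
        ctx->EngineInitialized = false;
    }
    if (ctx->MmioMapped) {
        ctx->MmioMapped = false;
    }
}

// Runs the start steps in order and stops at the first failure.
// On failure, everything acquired so far is released before returning.
bool MockStartDevice(MockDeviceContext* ctx, MockHardware hw) {
    *ctx = MockDeviceContext{false, false, false};

    bool ok = hw.MapMmioSucceeds;
    ctx->MmioMapped = ok;
    if (ok) {
        ok = hw.InitEngineSucceeds;
        ctx->EngineInitialized = ok;
    }
    if (ok) {
        ok = hw.EnableInterruptsSucceeds;
        ctx->InterruptsEnabled = ok;
    }
    if (!ok) {
        MockReleaseStartResources(ctx);
    }
    return ok;
}
```

&lt;p&gt;Routing every failure through one release function keeps the unwind order in a single place, so adding a new start step only requires one matching release step.&lt;/p&gt;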

</description>
      <category>architecture</category>
      <category>microsoft</category>
      <category>systems</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Part 1: Modern WDDM Architecture Protocol</title>
      <dc:creator>Deleon Karen</dc:creator>
      <pubDate>Tue, 02 Jun 2026 15:13:02 +0000</pubDate>
      <link>https://dev.to/deleon_karen_2216eb5888b3/part-1-modern-wddm-architecture-protocol-4ob0</link>
      <guid>https://dev.to/deleon_karen_2216eb5888b3/part-1-modern-wddm-architecture-protocol-4ob0</guid>
      <description>&lt;h3&gt;
  
  
  1. From XDDM to WDDM: An Inevitable Evolution
&lt;/h3&gt;

&lt;p&gt;Before Windows Vista, display drivers used the &lt;strong&gt;XDDM (Windows 2000 Display Driver Model)&lt;/strong&gt;. Under the XDDM model, the driver ran almost entirely in kernel mode, and any minor crash would directly lead to a system "Blue Screen of Death" (BSoD).&lt;/p&gt;

&lt;p&gt;The introduction of &lt;strong&gt;WDDM (Windows Display Driver Model)&lt;/strong&gt; was not only to support the Aero desktop effects but also represented a major architectural refactoring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Stability:&lt;/strong&gt; Most of the logic was moved to the User Mode Driver (UMD), leaving only the most critical hardware interaction logic in the Kernel Mode Driver (KMD).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Performance:&lt;/strong&gt; Introduced true GPU scheduling and video memory management.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Multitasking:&lt;/strong&gt; Supports multiple applications using GPU resources concurrently without interfering with each other.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. WDDM Architecture Core: The Boundaries of Power Between UMD and KMD
&lt;/h3&gt;

&lt;p&gt;A modern WDDM driver consists of two core components, each with its own responsibilities:&lt;/p&gt;

&lt;h4&gt;
  
  
  User Mode Driver (UMD)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Nature:&lt;/strong&gt; A DLL loaded by the Direct3D Runtime.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Core Responsibilities:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;  Translates API calls (such as D3D12 Draw Calls) into command streams (Command Buffers) that the hardware can understand.&lt;/li&gt;
&lt;li&gt;  Performs high-level state management and shader compilation.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Development Guidance:&lt;/strong&gt; The UMD is the most logically complex and frequently changed part of the driver, but it does not directly touch hardware registers.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  Kernel Mode Display Miniport Driver (KMD)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Nature:&lt;/strong&gt; A kernel-mode miniport driver (a &lt;code&gt;.sys&lt;/code&gt; file) that is called by, and calls back into, &lt;code&gt;Dxgkrnl.sys&lt;/code&gt;, the DirectX graphics kernel.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Core Responsibilities:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Resource Management:&lt;/strong&gt; Manages video memory segments and page tables.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Hardware Scheduling:&lt;/strong&gt; Submits commands generated by the UMD to the hardware queue.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Display Management:&lt;/strong&gt; Controls display output (VidPN) and handles interrupts.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Development Guidance:&lt;/strong&gt; The KMD must be extremely stable, as it runs in Ring 0.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Engineering Design Philosophy: "The OS Owns the Policy, the Driver Owns the Implementation"
&lt;/h3&gt;

&lt;p&gt;This is the central design principle of WDDM development.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;The OS (Dxgkrnl) owns the policy:&lt;/strong&gt; The operating system decides when to switch tasks, when to lower the frequency for power saving, and when to swap resources to system memory due to insufficient video memory (Eviction).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;The Driver (KMD) owns the implementation:&lt;/strong&gt; The driver tells the OS what capabilities (Caps) the hardware has and executes the specific instructions. For example, the OS tells the driver "please move this memory block to VRAM," and the driver is responsible for writing the specific registers to complete the transfer.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. The WDDM 2.0 Watershed: The Shift in Video Memory Management Focus
&lt;/h3&gt;

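&lt;p&gt;In WDDM 1.x, video memory management used the "Patch Location List" model. The UMD submitted each command buffer together with an allocation list and a patch location list, and before every submission the OS had to resolve each referenced allocation to its current physical address so that those addresses could be patched into the buffer. This per-submission work became a bottleneck in the context of large-scale parallelism and GPU virtualization.&lt;/p&gt;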

&lt;p&gt;&lt;strong&gt;WDDM 2.0 (Windows 10+) introduced GPU Virtual Addressing (GPUVA):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Each process has an independent GPU address space:&lt;/strong&gt; Just like CPU virtual memory, the UMD uses virtual addresses and no longer relies on dynamic patching by the OS.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Residency Model:&lt;/strong&gt; The driver no longer needs to list all dependent resources each time a command is submitted, but uses &lt;code&gt;MakeResident&lt;/code&gt; to inform the OS which resources must be in video memory. This greatly reduces kernel transition overhead and is the foundation for the high performance of Direct3D 12.&lt;/li&gt;
&lt;/ul&gt;
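
&lt;p&gt;The difference between the two models can be sketched in a few lines of Python. This is a conceptual toy, not WDDM code: the names &lt;code&gt;patch_command_buffer&lt;/code&gt; and &lt;code&gt;GpuAddressSpace&lt;/code&gt;, the 4 KiB page size, and the example addresses are all assumptions made for illustration.&lt;/p&gt;

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only


def patch_command_buffer(commands, patch_list, physical_addresses):
    """WDDM 1.x style: rewrite every patch location before each submission."""
    patched = list(commands)
    for slot, allocation in patch_list:
        patched[slot] = physical_addresses[allocation]
    return patched


class GpuAddressSpace:
    """WDDM 2.0 style: a per-process page table maps stable virtual addresses."""

    def __init__(self):
        self.page_table = {}  # virtual page number to physical page number

    def map_page(self, virtual_address, physical_address):
        self.page_table[virtual_address // PAGE_SIZE] = physical_address // PAGE_SIZE

    def translate(self, virtual_address):
        page, offset = divmod(virtual_address, PAGE_SIZE)
        return self.page_table[page] * PAGE_SIZE + offset


# 1.x model: the buffer has to be patched again whenever the allocation moves.
commands = ["DRAW", None]
first = patch_command_buffer(commands, [(1, "texture")], {"texture": 0x10000})
moved = patch_command_buffer(commands, [(1, "texture")], {"texture": 0x80000})

# 2.0 model: the command buffer keeps one virtual address; only the mapping changes.
space = GpuAddressSpace()
space.map_page(0x4000, 0x10000)
before_move = space.translate(0x4010)
space.map_page(0x4000, 0x80000)
after_move = space.translate(0x4010)
```

&lt;p&gt;In the second half, moving the allocation touches only the page table. Nothing that the UMD wrote into its command buffer has to change, which is the property the residency model builds on.&lt;/p&gt;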

&lt;h3&gt;
  
  
  5. Developer Advice: How to Start Your WDDM Journey
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Clarify the DDI version:&lt;/strong&gt; In &lt;code&gt;DriverEntry&lt;/code&gt;, make sure to correctly declare your &lt;code&gt;DXGKDDI_INTERFACE_VERSION&lt;/code&gt;. For modern development, it is recommended to support at least WDDM 2.0.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Focus on Mandatory DDIs:&lt;/strong&gt; Not all DDIs need to be implemented. For basic rendering, &lt;code&gt;DxgkDdiCreateAllocation&lt;/code&gt;, &lt;code&gt;DxgkDdiSubmitCommand&lt;/code&gt;, and &lt;code&gt;DxgkDdiBuildPagingBuffer&lt;/code&gt; are your three core functions.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Understand Asynchronicity:&lt;/strong&gt; The GPU is asynchronous. When you submit a command, it merely enters a queue. Never wait synchronously for the GPU to complete inside the driver; doing so can make the entire system stutter.&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>architecture</category>
      <category>microsoft</category>
      <category>performance</category>
      <category>systems</category>
    </item>
    <item>
      <title>Part 4: Driver Implementation — How Developers Adapt to GPU-P</title>
      <dc:creator>Deleon Karen</dc:creator>
      <pubDate>Tue, 02 Jun 2026 15:06:11 +0000</pubDate>
      <link>https://dev.to/deleon_karen_2216eb5888b3/part-4-driver-implementation-how-developers-adapt-to-gpu-p-44nm</link>
      <guid>https://dev.to/deleon_karen_2216eb5888b3/part-4-driver-implementation-how-developers-adapt-to-gpu-p-44nm</guid>
      <description>&lt;p&gt;After understanding the architecture and isolation mechanisms of GPU-P, the most pressing question for a driver developer is: How do I implement and enable these features in my driver? This chapter will analyze the GPU-P development and adaptation process in detail from three dimensions: capability declaration, configuration files, and key DDIs (Device Driver Interfaces).&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Capability Declaration: Telling the System "I'm Ready"
&lt;/h2&gt;

&lt;p&gt;To let Windows know that your driver supports GPU partitioning, you first need to set specific capability bits in the host driver.&lt;/p&gt;

&lt;h3&gt;
  
  
  DXGK_VIDMMCAPS
&lt;/h3&gt;

&lt;p&gt;When the KMD responds to the &lt;code&gt;DxgkDdiQueryAdapterInfo&lt;/code&gt; call, it must populate the &lt;code&gt;DXGK_DRIVERCAPS&lt;/code&gt; structure. Its &lt;code&gt;MemoryManagementCaps&lt;/code&gt; member is a &lt;code&gt;DXGK_VIDMMCAPS&lt;/code&gt; structure, and &lt;code&gt;MemoryManagementCaps.ParavirtualizationSupported&lt;/code&gt; is the "master switch" for enabling GPU-P:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// In the host KMD implementation&lt;/span&gt;
&lt;span class="n"&gt;pDriverCaps&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;MemoryManagementCaps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ParavirtualizationSupported&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: The situation is slightly different for MCDM (Microsoft Compute Driver Model) drivers. The host driver should set this bit to &lt;code&gt;TRUE&lt;/code&gt;, while the guest driver running inside the virtual machine should adjust it based on whether it is in a virtualized environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. INF File Configuration: The "Mover" of Resources
&lt;/h2&gt;

&lt;p&gt;Since there is no KMD inside the virtual machine, the DLL images and registry settings required by the guest-side UMD must be provided in advance by the host.&lt;/p&gt;

&lt;h3&gt;
  
  
  Driver Store Mapping
&lt;/h3&gt;

&lt;p&gt;In the INF file, you need to declare which files should be copied to the virtual machine. Commonly used registry entries include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;CopyToVmOverwrite&lt;/strong&gt;: Always overwrites the file with the same name in the virtual machine.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;CopyToVmWhenNewer&lt;/strong&gt;: Only overwrites when the host file version is newer.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[DDInstall]
AddReg = CopyToVm_AddReg

[CopyToVm_AddReg]
; Illustrative section name. Copies the host's driver files to the virtual machine's HostDriverStore directory
HKR,"CopyToVmOverwrite",SoftGpuFiles,%REG_MULTI_SZ%,"umd_binary.dll","umd_binary.dll"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These files are automatically placed in the virtual machine's &lt;code&gt;%windir%\system32\HostDriverStore&lt;/code&gt; directory.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Key DDI Implementation: The Host's "Art of Management"
&lt;/h2&gt;

&lt;p&gt;To manage numerous virtual machine instances, the host KMD needs to implement several key new DDIs:&lt;/p&gt;

&lt;h3&gt;
  
  
  DxgkDdiSetVirtualMachineData
&lt;/h3&gt;

&lt;p&gt;This is the core interface through which Dxgkrnl communicates VM context to the KMD. When a new virtual machine instance starts or its configuration changes, the OS calls this interface to synchronize key metadata with the driver.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Purpose&lt;/strong&gt;: Passes virtual machine data, including the VM's unique identifier (LUID), video memory quota, compute resource limits, etc.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Security Flag&lt;/strong&gt;: Through the &lt;code&gt;SecureVirtualMachine&lt;/code&gt; flag in &lt;code&gt;DXGK_VIRTUALMACHINEDATAFLAGS&lt;/code&gt;, the OS tells the KMD whether the VM is in "secure mode" (e.g., Windows Sandbox).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Developer Action&lt;/strong&gt;: The driver should initialize or update its internal virtual machine tracking structures based on this data. For example, in secure mode, the driver must disable non-standard Escape calls and ensure that IOMMU isolation is fully effective. Furthermore, the driver can use this interface to associate the VM's guest space mapping with the host-side physical resources.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  DxgkDdiQueryAdapterInfo (Extended)
&lt;/h3&gt;

&lt;p&gt;This interface is one of the most functionally complex in WDDM. In a GPU-P environment, it is extended to support finer-grained cross-boundary queries.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Key Flags&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;VirtualMachineData&lt;/strong&gt;: If this bit is TRUE, it indicates that the current query request originates from a virtual machine.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;SecureVirtualMachine&lt;/strong&gt;: Indicates whether the current execution is within a stricter security isolation environment (like a sandbox).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;  &lt;strong&gt;Context Identification&lt;/strong&gt;: The new &lt;strong&gt;hKmdProcessHandle&lt;/strong&gt; member allows the driver, when processing queries from a virtual machine, to accurately identify and use the corresponding host-side process context.&lt;/li&gt;

&lt;li&gt;  &lt;strong&gt;Common Query Types&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;DXGKQAITYPE_GPUPCAPS&lt;/strong&gt;: Queries the driver's partitioning capabilities, such as whether it supports Live Migration.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;DXGKQAITYPE_GPUMMUCAPS&lt;/strong&gt;: Queries the support status for GPU virtual addresses (GpuVA).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;DXGKQAITYPE_PAGETABLELEVELDESC&lt;/strong&gt;: Defines the page table hierarchy structure, which is crucial for the GpuMMU model.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  DxgkDdiCreateProcess
&lt;/h3&gt;

&lt;p&gt;In a virtualized environment, the host creates a "mirror process" object for each VM's drawing operations. The driver needs to recognize new flags in &lt;code&gt;DXGK_CREATEPROCESSFLAGS&lt;/code&gt; to distinguish different execution contexts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;VirtualMachineWorkerProcess&lt;/strong&gt;: Corresponds to the host-side VM worker process (&lt;em&gt;vmwp.exe&lt;/em&gt;). This process is responsible for managing the VM's virtual hardware (including vGPU emulation) but does not perform rendering itself. The driver can use this flag to skip the allocation of certain rendering resources.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;VirtualMachineProcess&lt;/strong&gt;: Corresponds to the actual process inside the Guest that initiates drawing requests. Whenever an application within the virtual machine tries to use the GPU, a call is triggered on the host side via this flag.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Process Association&lt;/strong&gt;: Through the &lt;strong&gt;hKmdVmWorkerProcess&lt;/strong&gt; handle, the driver can associate multiple guest-side processes with the same VM instance on the host, which is crucial for resource accounting, debugging, and performance monitoring (e.g., GPUView).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Debugging Support&lt;/strong&gt;: The &lt;strong&gt;pProcessName&lt;/strong&gt; member provides a human-readable name for the process, greatly facilitating troubleshooting for developers in multi-VM concurrent scenarios.&lt;/li&gt;
&lt;/ul&gt;
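
&lt;p&gt;As a rough illustration of the bookkeeping described above, the following Python model groups guest processes under their VM worker process. It is a sketch under stated assumptions, not driver code: &lt;code&gt;ProcessTracker&lt;/code&gt;, &lt;code&gt;VmRecord&lt;/code&gt;, the integer handles, and the return strings are hypothetical, and the flag values are invented rather than taken from &lt;code&gt;DXGK_CREATEPROCESSFLAGS&lt;/code&gt;.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from enum import Flag


class CreateProcessFlags(Flag):
    # Values are invented for this model; they are not the real bit positions.
    VIRTUAL_MACHINE_WORKER_PROCESS = 1
    VIRTUAL_MACHINE_PROCESS = 2


@dataclass
class VmRecord:
    worker_handle: int
    guest_process_names: list = field(default_factory=list)


class ProcessTracker:
    """Groups guest processes under the VM worker process that owns them."""

    def __init__(self):
        self.vms = {}  # worker handle to VmRecord

    def create_process(self, handle, flags, name, worker_handle=None):
        if CreateProcessFlags.VIRTUAL_MACHINE_WORKER_PROCESS in flags:
            # vmwp.exe does not render, so only bookkeeping is created for it.
            self.vms[handle] = VmRecord(worker_handle=handle)
            return "vm-worker"
        if CreateProcessFlags.VIRTUAL_MACHINE_PROCESS in flags:
            # A guest process: attach it to the VM identified by its worker handle.
            self.vms[worker_handle].guest_process_names.append(name)
            return "vm-guest"
        return "host"


tracker = ProcessTracker()
kinds = [
    tracker.create_process(100, CreateProcessFlags.VIRTUAL_MACHINE_WORKER_PROCESS, "vmwp.exe"),
    tracker.create_process(200, CreateProcessFlags.VIRTUAL_MACHINE_PROCESS, "game.exe", worker_handle=100),
    tracker.create_process(300, CreateProcessFlags(0), "explorer.exe"),
]
```

&lt;p&gt;Keeping one record per worker process is what makes per-VM resource accounting straightforward: every guest process that arrives later is charged to the record its worker handle points at.&lt;/p&gt;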

&lt;h2&gt;
  
  
  4. Special Handling for MCDM Drivers
&lt;/h2&gt;

&lt;p&gt;For compute cards without display output capabilities (like data center GPUs), adaptation to GPU-P must follow &lt;strong&gt;MCDM (Microsoft Compute Driver Model)&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Class Definition&lt;/strong&gt;: The INF must specify &lt;code&gt;Class=ComputeAccelerator&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Capability Reduction&lt;/strong&gt;: Display-related interfaces like &lt;code&gt;DxgkDdiPresent&lt;/code&gt; do not need to be implemented, but GPU Virtual Address (GpuVA) and IOMMU isolation must be supported.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Isolation Requirements&lt;/strong&gt;: If an MCDM driver wants to run in a secure container, it must set &lt;code&gt;IoMmuSecureModeSupported&lt;/code&gt; to &lt;code&gt;TRUE&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  5. Development Advice: The "Forbidden Zone" for Private Data
&lt;/h2&gt;

&lt;p&gt;When writing a driver that supports GPU-P, developers must always keep the "cross-boundary" principle in mind:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Never Pass Pointers&lt;/strong&gt;: The UMD runs in the guest address space, and the KMD runs in the host address space. A pointer or absolute memory address that is valid on one side is meaningless on the other, so private data must not carry raw addresses across the boundary.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Handle Translation&lt;/strong&gt;: Utilize the &lt;code&gt;DriverKnownEscape&lt;/code&gt; mechanism to let the OS translate resource handles from the guest side into handles recognizable on the host side.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Message Size Limit&lt;/strong&gt;: The maximum size for a single VMBus message is typically 128KB. Avoid passing excessively large private data in a single &lt;code&gt;AllocateCb&lt;/code&gt; call.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Adapting to GPU-P is not just about flipping a switch; it is a meticulous reconstruction of the driver architecture. Through standardized capability declarations and rigorous DDI implementation, developers can give their hardware a new lease on life in the cloud and virtualized environments.&lt;/p&gt;

&lt;p&gt;In the next chapter, we will tackle the toughest nut to crack in the field of GPU virtualization: &lt;strong&gt;Live Migration&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>microsoft</category>
      <category>systems</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Part 3: The Foundation — IOMMU and Security Isolation</title>
      <dc:creator>Deleon Karen</dc:creator>
      <pubDate>Tue, 02 Jun 2026 15:04:29 +0000</pubDate>
      <link>https://dev.to/deleon_karen_2216eb5888b3/part-3-the-foundation-iommu-and-security-isolation-5ch5</link>
      <guid>https://dev.to/deleon_karen_2216eb5888b3/part-3-the-foundation-iommu-and-security-isolation-5ch5</guid>
      <description>&lt;p&gt;In the previous "Architecture" chapter, we learned how GPU-P achieves resource sharing through the clever division of labor between the Host and Guest. However, in cloud computing and virtualization environments, simply being "usable" and "shareable" is far from enough. Security is always the Sword of Damocles hanging over the head of virtualization.&lt;/p&gt;

&lt;p&gt;If the GPU-P architecture is a building that allows multiple tenants to move in, then &lt;strong&gt;IOMMU&lt;/strong&gt; is the security door customized for each tenant. Today, we will unveil the cornerstone of GPU-P's underlying security isolation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Necessity of Isolation: Guarding Against Deadly DMA Attacks
&lt;/h2&gt;

&lt;p&gt;As we all know, GPUs are high-speed peripherals connected to the system via the PCIe bus. In pursuit of ultimate performance, GPUs heavily rely on &lt;strong&gt;DMA (Direct Memory Access)&lt;/strong&gt; technology. With DMA, the GPU can read from and write to the system's main memory (physical memory) directly across the bus without CPU intervention.&lt;/p&gt;

&lt;p&gt;In a standalone environment, this is not a problem. But in a virtualized environment (like GPU-P), it becomes a huge security risk:&lt;br&gt;
Suppose a malicious user gains control of a GPU Virtual Function (VF) within a Virtual Machine (Guest). They could craft malicious hardware commands, instructing the GPU's DMA engine to read physical memory that does not belong to that virtual machine. If they were to read the Host's core data, encryption keys, or the memory of other tenant VMs, the consequences would be catastrophic.&lt;/p&gt;

&lt;p&gt;To prevent this kind of "unauthorized access," relying solely on software-level interception is insufficient. We must establish a physical barrier at the hardware bus level.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Role of IOMMU: A "Logical Maze" for Physical Memory
&lt;/h2&gt;

&lt;p&gt;This is where the &lt;strong&gt;IOMMU (Input/Output Memory Management Unit)&lt;/strong&gt; takes the stage.&lt;/p&gt;

&lt;p&gt;You can think of the IOMMU as a specialized MMU for peripherals (I/O devices). The CPU relies on the MMU to map virtual addresses to physical addresses, while the IOMMU is responsible for &lt;strong&gt;DMA Remapping&lt;/strong&gt; for peripherals.&lt;/p&gt;

&lt;p&gt;In GPU-P, the IOMMU works as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Domain Division&lt;/strong&gt;: The Host's &lt;code&gt;Dxgkrnl&lt;/code&gt; (DirectX Graphics Kernel) creates an independent IOMMU Domain for each logical adapter (the instance assigned to a virtual machine) on the system.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logical Address Spoofing&lt;/strong&gt;: The Host operating system no longer exposes real physical addresses to the GPU; instead, it provides &lt;strong&gt;logical addresses&lt;/strong&gt; managed by the IOMMU.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hardware-Level Interception&lt;/strong&gt;: When a VM's GPU VF attempts to initiate a DMA read or write via the PCIe bus, the request is intercepted by the IOMMU. The IOMMU checks if this logical address is valid and converts it to the real physical address.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blocking on Violation&lt;/strong&gt;: If a malicious Guest tries to access a logical address not assigned to it, the IOMMU cannot complete the translation and blocks the PCIe transaction at the hardware level, which protects the memory of the Host and of the other VMs.&lt;/li&gt;
&lt;/ol&gt;
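
&lt;p&gt;The four steps above can be modeled with a small Python toy. It only illustrates the idea of per-domain translation; the class names &lt;code&gt;IommuDomain&lt;/code&gt; and &lt;code&gt;DmaBlocked&lt;/code&gt;, the 4 KiB page size, and the addresses are assumptions, and a real IOMMU does this in hardware with multi-level tables.&lt;/p&gt;

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only


class DmaBlocked(Exception):
    """Raised when a device touches a logical address outside its domain."""


class IommuDomain:
    """One translation table per logical adapter (one per VM in this model)."""

    def __init__(self, name):
        self.name = name
        self.mappings = {}  # logical page number to physical page number

    def map(self, logical_address, physical_address):
        self.mappings[logical_address // PAGE_SIZE] = physical_address // PAGE_SIZE

    def translate(self, logical_address):
        page, offset = divmod(logical_address, PAGE_SIZE)
        if page not in self.mappings:
            # No translation exists, so the transaction is refused.
            raise DmaBlocked(f"{self.name}: no mapping for {logical_address:#x}")
        return self.mappings[page] * PAGE_SIZE + offset


vm_a = IommuDomain("vm-a")
vm_b = IommuDomain("vm-b")
vm_a.map(0x1000, 0x7000_0000)
vm_b.map(0x1000, 0x9000_0000)

# The same logical address resolves to a different physical page in each domain.
a_target = vm_a.translate(0x1008)
b_target = vm_b.translate(0x1008)

try:
    vm_a.translate(0x5000)  # never mapped for vm-a
    blocked = False
except DmaBlocked:
    blocked = True
```

&lt;p&gt;Because each domain only contains the pages assigned to its VM, there is no logical address a guest can present that resolves to another tenant's memory.&lt;/p&gt;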

&lt;h2&gt;
  
  
  Silence Protocol: The Danger of Domain Switching
&lt;/h2&gt;

&lt;p&gt;While powerful, the IOMMU has a fatal weakness when switching protection domains (Domain Switch): &lt;strong&gt;the attach and detach operations of a domain are not atomic at the hardware level&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Imagine this: &lt;code&gt;Dxgkrnl&lt;/code&gt; is in the background altering the IOMMU mapping tables, while the GPU is furiously writing data to memory. Since the mapping table is in an intermediate state, a PCIe translation error is very likely to occur, directly crashing the entire system (Blue Screen/Bug Check).&lt;/p&gt;

&lt;p&gt;To resolve this race condition, WDDM introduced the &lt;strong&gt;Silence Protocol&lt;/strong&gt;. The Host graphics driver (KMD) must implement a pair of extremely critical DDIs (Device Driver Interfaces):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;DxgkDdiBeginExclusiveAccess&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;DxgkDdiEndExclusiveAccess&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Execution Flow&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Before an IOMMU domain switch occurs, &lt;code&gt;Dxgkrnl&lt;/code&gt; pauses the scheduler and flushes all active workloads, ensuring that no new tasks are sent to the hardware.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Dxgkrnl&lt;/code&gt; calls &lt;code&gt;DxgkDdiBeginExclusiveAccess&lt;/code&gt;, notifying the KMD: "I'm about to touch the IOMMU, tell your hardware to be quiet!"&lt;/li&gt;
&lt;li&gt;Upon receiving the instruction, the KMD must ensure the GPU hardware remains &lt;strong&gt;absolutely silent&lt;/strong&gt; during this period: it must not read from or write to system memory, and hardware interrupts may be masked as well.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Dxgkrnl&lt;/code&gt; safely completes the IOMMU domain switch.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Dxgkrnl&lt;/code&gt; calls &lt;code&gt;DxgkDdiEndExclusiveAccess&lt;/code&gt;, lifting the silence, and the GPU resumes normal operation.&lt;/li&gt;
&lt;/ol&gt;
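
&lt;p&gt;The ordering of these steps matters more than any single call, so here is a minimal Python model of the sequence. Everything in it is a stand-in: &lt;code&gt;FakeGpu&lt;/code&gt;, &lt;code&gt;FakeKmd&lt;/code&gt;, and &lt;code&gt;switch_iommu_domain&lt;/code&gt; are hypothetical names, and the "hardware" is a single boolean flag.&lt;/p&gt;

```python
class FakeGpu:
    """Stand-in for the hardware: tracks whether DMA is currently quiesced."""

    def __init__(self):
        self.silent = False


class FakeKmd:
    """Models only the two DDIs of the silence protocol."""

    def __init__(self, gpu):
        self.gpu = gpu

    def begin_exclusive_access(self):
        # A real driver would stop its engines and DMA here; the model flips a flag.
        self.gpu.silent = True

    def end_exclusive_access(self):
        self.gpu.silent = False


def switch_iommu_domain(kmd, new_domain):
    """Run the steps in order and return a log of what happened."""
    log = ["scheduler paused, work flushed"]
    kmd.begin_exclusive_access()
    try:
        # The domain is only touched while the GPU is silent.
        log.append(f"attached {new_domain} (gpu silent: {kmd.gpu.silent})")
    finally:
        # Silence is lifted even if the switch itself fails.
        kmd.end_exclusive_access()
    log.append("scheduler resumed")
    return log


gpu = FakeGpu()
log = switch_iommu_domain(FakeKmd(gpu), "vm-a")
```

&lt;p&gt;The &lt;code&gt;try&lt;/code&gt;/&lt;code&gt;finally&lt;/code&gt; pairing is a design choice of this sketch: it makes the begin and end calls strictly balanced, which mirrors the requirement that the GPU never stays silenced after the switch is over.&lt;/p&gt;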

&lt;h2&gt;
  
  
  Secure VM: Stringent Admission Criteria
&lt;/h2&gt;

&lt;p&gt;In scenarios with extremely high security requirements (such as Windows Defender Application Guard or advanced security sandboxes), the operating system may launch a &lt;strong&gt;Secure VM&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For GPU instances assigned to a Secure VM, WDDM imposes mandatory admission and operational conditions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Mandatory IOMMU Isolation&lt;/strong&gt;: If the driver does not support IoMmu isolation in its capability declaration (Caps), the Secure VM will directly refuse to create a GPU instance, resulting in startup failure. Here, there is no compromise on security.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Banning Illegal Escape Calls&lt;/strong&gt;: In traditional WDDM, a User-Mode Driver (UMD) can send private data packets to the Kernel-Mode Driver (KMD) via &lt;code&gt;Escape&lt;/code&gt; calls. Since this is entirely a "black box," a malicious Guest could trigger vulnerabilities like buffer overflows in the Host's KMD by crafting malformed Escape packets.

&lt;ul&gt;
&lt;li&gt;  In a Secure VM, conventional Escape calls are &lt;strong&gt;completely banned&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;  Only "Known Escapes" with the &lt;code&gt;DriverKnownEscape&lt;/code&gt; flag, strictly defined and audited by the system, are permitted. This drastically reduces the attack surface for kernel privilege escalation.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
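
&lt;p&gt;Both admission rules reduce to simple predicates. The Python sketch below restates them as code; the function names and boolean parameters are hypothetical and only mirror the rules described in this section, not any actual Dxgkrnl logic.&lt;/p&gt;

```python
def admit_gpu_instance(secure_vm, iommu_isolation_supported):
    """A Secure VM gets a GPU instance only if the driver supports IOMMU isolation."""
    return iommu_isolation_supported or not secure_vm


def escape_allowed(secure_vm, driver_known_escape):
    """Inside a Secure VM, only escapes flagged as DriverKnownEscape pass."""
    return driver_known_escape or not secure_vm


# A Secure VM with a driver that lacks IOMMU isolation is refused outright.
refused = not admit_gpu_instance(secure_vm=True, iommu_isolation_supported=False)

# A conventional escape works in an ordinary VM but not in a Secure VM.
ordinary_ok = escape_allowed(secure_vm=False, driver_known_escape=False)
secure_private = escape_allowed(secure_vm=True, driver_known_escape=False)
secure_known = escape_allowed(secure_vm=True, driver_known_escape=True)
```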

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Through the IOMMU's hardware-level DMA interception, the Silence Protocol, and the stringent communication restrictions for Secure VMs, Microsoft has built a layered line of defense for GPU-P. It is this foundation that gives cloud service providers the confidence to partition the same expensive high-end GPU among multiple, unrelated tenants.&lt;/p&gt;

&lt;p&gt;So far, we have understood the operational mechanism of GPU-P from both the macro-architectural and security foundation perspectives. For driver developers, how can existing code be modified to adapt to this complex virtualization mechanism?&lt;/p&gt;

&lt;p&gt;In the next chapter, we will enter the practical realm and analyze: &lt;strong&gt;Driver Implementation — How Developers Adapt to GPU-P&lt;/strong&gt;.&lt;/p&gt;





</description>
      <category>architecture</category>
      <category>security</category>
      <category>systems</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Part 2: Architecture — The "Duet" of Host and Virtual Machine</title>
      <dc:creator>Deleon Karen</dc:creator>
      <pubDate>Tue, 02 Jun 2026 15:02:37 +0000</pubDate>
      <link>https://dev.to/deleon_karen_2216eb5888b3/part-2-architecture-the-duet-of-host-and-virtual-machine-ah9</link>
      <guid>https://dev.to/deleon_karen_2216eb5888b3/part-2-architecture-the-duet-of-host-and-virtual-machine-ah9</guid>
      <description>&lt;p&gt;In the previous part, we covered the basic concepts of GPU-P. Today, we'll dive deep into its internal architecture. If a traditional graphics driver is a solo performance, then GPU-P is a precisely choreographed "duet": the Host and the Guest (virtual machine) have clearly defined roles, working together closely through VMBus.&lt;/p&gt;

&lt;h2&gt;
  
  
  Component Model: An Unbalanced "Bilocation"
&lt;/h2&gt;

&lt;p&gt;Under the GPU-P architecture, the driver components are split between two worlds — the host and the virtual machine. Interestingly, the components in these two worlds are asymmetric.&lt;/p&gt;

&lt;h3&gt;
  
  
  Host: The All-Powerful "Brain"
&lt;/h3&gt;

&lt;p&gt;The host possesses the complete graphics subsystem stack and serves as the ultimate resource manager:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Full-Featured KMD (Kernel Mode Driver)&lt;/strong&gt;: Interacts directly with the physical GPU hardware.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;VidMm (Video Memory Manager)&lt;/strong&gt;: Controls the allocation and scheduling of all video memory.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;VidSch (Scheduler)&lt;/strong&gt;: Decides which virtual machine's tasks can enter the GPU hardware queues.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;UMD (User Mode Driver)&lt;/strong&gt;: Used for the host's own rendering tasks.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Guest (Virtual Machine): The Lean "Executor"
&lt;/h3&gt;

&lt;p&gt;Inside the virtual machine, things are very "slim":&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;UMD Only&lt;/strong&gt;: An adapted user mode driver runs inside the VM, directly facing applications (like games or AI frameworks).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;No KMD&lt;/strong&gt;: The manufacturer's kernel mode driver code does not run inside the VM.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;No VidMm or VidSch&lt;/strong&gt;: The virtual machine is not responsible for managing video memory or hardware scheduling.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This design drastically reduces the attack surface of the virtual machine, while also avoiding the overhead of running complex video memory management logic repeatedly in every VM.&lt;/p&gt;

&lt;h2&gt;
  
  
  Virtual Render Device (VRD): The Ingenious "Impostor"
&lt;/h2&gt;

&lt;p&gt;Since there is no KMD in the virtual machine, how does the Windows operating system know to load a display driver? This is the work of the &lt;strong&gt;VRD (Virtual Render Device)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The VRD is a "shadow driver" on the Guest side. Its main responsibilities are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Deceiving the OS&lt;/strong&gt;: Mounting a virtual graphics device in Device Manager, making the VM's operating system think "I have hardware," thereby triggering the loading of &lt;code&gt;Dxgkrnl.sys&lt;/code&gt; (the DirectX Graphics Kernel).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Guiding the Load&lt;/strong&gt;: It acts as the trigger for loading the Guest-side UMD.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;On the host side, each virtual machine corresponds to a &lt;strong&gt;VMWP.exe (VM Worker Process)&lt;/strong&gt;. Within this process runs a library called &lt;code&gt;vrdumed.dll&lt;/code&gt;, which serves as the "back-end support" for the VRD, responsible for emulating this virtual device on the host side.&lt;/p&gt;

&lt;h2&gt;
  
  
  Communication Bridge: VMBus and Parameter Marshalling
&lt;/h2&gt;

&lt;p&gt;For the UMD in the virtual machine to render an image, its commands must ultimately be passed to the host's hardware. How does this "dialogue" happen between them?&lt;/p&gt;

&lt;h3&gt;
  
  
  VMBus
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;VMBus&lt;/strong&gt;, provided by Hyper-V, is the underlying communication mechanism for this architecture. It functions like a dedicated expressway, allowing data to travel rapidly between the virtual machine and the host.&lt;/p&gt;

&lt;h3&gt;
  
  
  Parameter Marshalling
&lt;/h3&gt;

&lt;p&gt;When an application inside the VM calls a DirectX API, the following process occurs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Interception&lt;/strong&gt;: The Guest-side &lt;code&gt;Dxgkrnl&lt;/code&gt; receives the UMD's call requests (Thunk calls).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Marshalling&lt;/strong&gt;: &lt;code&gt;Dxgkrnl&lt;/code&gt; packages the parameters and data packets of these calls into individual Messages.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Transmission&lt;/strong&gt;: These messages are sent to the host via VMBus.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Execution&lt;/strong&gt;: The host-side &lt;code&gt;Dxgkrnl&lt;/code&gt; receives the messages, unpacks them, and passes them to the physical driver for execution.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Optimization Mechanism&lt;/strong&gt;: To prevent VMBus congestion, the Guest-side &lt;code&gt;Dxgkrnl&lt;/code&gt; retains some "local objects" (such as handle mappings for Allocations and Devices), and communication across the boundary only occurs when hardware execution is truly required.&lt;/p&gt;
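
&lt;p&gt;To make the marshalling step concrete, here is a minimal Python sketch of packing a call into a message and unpacking it on the other side. The wire layout (a command id, a reserved field, and a payload length) is invented for this example and is not the real VMBus message format; &lt;code&gt;marshal&lt;/code&gt; and &lt;code&gt;unmarshal&lt;/code&gt; are hypothetical helper names.&lt;/p&gt;

```python
import struct

# Hypothetical header: command id (2 bytes), reserved (2 bytes), payload length (4 bytes).
HEADER = struct.Struct("!HHI")


def marshal(command_id, payload):
    """Guest side: flatten a call into one self-describing message."""
    return HEADER.pack(command_id, 0, len(payload)) + payload


def unmarshal(message):
    """Host side: validate the length field, then recover the call."""
    command_id, _reserved, length = HEADER.unpack_from(message)
    payload = message[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated message")
    return command_id, payload


message = marshal(7, b"abc")
decoded = unmarshal(message)
```

&lt;p&gt;Note that the message carries only values and lengths, never addresses. The host checks the declared length against what actually arrived before it trusts the payload.&lt;/p&gt;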

&lt;h2&gt;
  
  
  Resource Ownership: Why Doesn't the Guest Have Video Memory Management?
&lt;/h2&gt;

&lt;p&gt;A common question is: Why not let the virtual machine manage the video memory allocated to it?&lt;/p&gt;

&lt;p&gt;The answer is: &lt;strong&gt;Physical video memory is a globally unified resource.&lt;/strong&gt;&lt;br&gt;
If each virtual machine believed it had its own independent video memory manager, they would "fight" over the same physical addresses. In GPU-P:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Host Coordination&lt;/strong&gt;: The host's VidMm has a global view of all video memory. Based on each VM's configuration (like &lt;code&gt;MinPartitionVRAM&lt;/code&gt; / &lt;code&gt;MaxPartitionVRAM&lt;/code&gt;), it partitions the physical video memory among different VMs.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Guest Request&lt;/strong&gt;: When a virtual machine needs memory, it must initiate a "loan" request to the Host via VMBus.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This "centralized authority" architecture ensures that even when multiple VMs are running under high load, video memory conflicts will not cause a host-wide failure such as a TDR (Timeout Detection and Recovery) event or a bug check.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The architecture of GPU-P showcases Microsoft's art of balancing performance and security in virtualization: through VRD deception, VMBus transmission, and highly centralized control on the Host side, efficient GPU resource sharing is achieved.&lt;/p&gt;

&lt;p&gt;However, in such a shared environment, how do we ensure that a virtual machine cannot illegally access the host's memory? That is the topic we will discuss in our next part: &lt;strong&gt;IOMMU and Security Isolation&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>infrastructure</category>
      <category>microsoft</category>
      <category>systems</category>
    </item>
    <item>
      <title>Part 1: Concepts — Ushering in the "Golden Age" of GPU Virtualization</title>
      <dc:creator>Deleon Karen</dc:creator>
      <pubDate>Tue, 02 Jun 2026 14:59:07 +0000</pubDate>
      <link>https://dev.to/deleon_karen_2216eb5888b3/part-1-concepts-ushering-in-the-golden-age-of-gpu-virtualization-1kp4</link>
      <guid>https://dev.to/deleon_karen_2216eb5888b3/part-1-concepts-ushering-in-the-golden-age-of-gpu-virtualization-1kp4</guid>
      <description>&lt;h2&gt;
  
  
  Introduction: The GPU Dilemma Amid the Virtualization Wave
&lt;/h2&gt;

&lt;p&gt;In today's rapidly evolving landscape of cloud computing and virtualization, the virtualization of compute, storage, and networking is already highly mature. However, the virtualization of GPU resources has long remained a major industry challenge. In the past, if we wanted to use GPU acceleration in a virtual machine (VM), there were generally two mainstream approaches, each with its own strengths and limitations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Discrete Device Assignment (DDA)&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Principle&lt;/strong&gt;: Assigns an entire physical GPU exclusively to a single virtual machine.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Advantages&lt;/strong&gt;: Near-native performance and good compatibility.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Disadvantages&lt;/strong&gt;: Resources cannot be shared. One GPU serves exactly one VM, so a VM that only renders a simple UI still occupies the entire GPU, wasting resources and driving up cost.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;API Forwarding&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Principle&lt;/strong&gt;: Intercepts OpenGL/DirectX API calls within the virtual machine and forwards them to the host for execution via network or shared memory channels.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Advantages&lt;/strong&gt;: Simple to implement and supports 1:N sharing.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Disadvantages&lt;/strong&gt;: High performance overhead and latency, and support is often limited to specific API versions, so it cannot meet the demands of high-performance computing or 3D gaming.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With the rise of artificial intelligence, remote desktops (VDI), and cloud gaming, the market urgently needs a GPU virtualization solution that can deliver both &lt;strong&gt;high performance&lt;/strong&gt; and &lt;strong&gt;high-density sharing&lt;/strong&gt;. It is against this backdrop that Microsoft introduced &lt;strong&gt;GPU-P (GPU Partitioning)&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is GPU-P: SR-IOV-Based Hardware Partitioning Technology
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GPU-P&lt;/strong&gt; (short for GPU Partitioning) is a GPU hardware partitioning technology developed by Microsoft based on the industry standard &lt;strong&gt;SR-IOV (Single Root I/O Virtualization)&lt;/strong&gt;. In WDDM documentation, it is also often referred to as &lt;strong&gt;GPU Paravirtualization (GPU-PV)&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Core Principles
&lt;/h3&gt;

&lt;p&gt;Unlike traditional "full virtualization," GPU-P employs a &lt;strong&gt;paravirtualization&lt;/strong&gt; design:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Hardware Level&lt;/strong&gt;: Utilizes SR-IOV-capable GPU hardware to partition the physical device into multiple "Virtual Functions (VFs)." Each VF possesses its own independent hardware context, command queue, and memory space.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Software Level&lt;/strong&gt;: Retains the full-featured driver (KMD) on the host side, while the virtual machine (guest) side runs a streamlined user-mode driver (UMD) specifically adapted for the virtualization environment.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Communication Mechanism&lt;/strong&gt;: Uses Hyper-V's &lt;strong&gt;VMBus&lt;/strong&gt; as a high-speed communication bridge to enable efficient collaboration between the Guest UMD and the Host KMD.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Advantages of GPU-P
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Hardware-Level Isolation&lt;/strong&gt;: Leveraging SR-IOV and IOMMU technologies, GPU tasks from different virtual machines do not interfere with one another, ensuring extremely high security.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Near-Native Performance&lt;/strong&gt;: Critical rendering paths interact directly with the hardware VF, greatly reducing the overhead introduced by software emulation.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Flexible Resource Scheduling&lt;/strong&gt;: Supports dynamically partitioning a single physical GPU into multiple instances, achieving optimal resource allocation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Application Scenarios: From the Lab to the Cloud
&lt;/h2&gt;

&lt;p&gt;GPU-P is not a laboratory "toy"; it is deeply integrated into all corners of the Windows ecosystem:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Windows Sandbox&lt;/strong&gt;: When you start a sandbox to test suspicious software, its smooth UI is powered by GPU-P providing hardware acceleration, without requiring you to manually configure drivers.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;WSL2 (Windows Subsystem for Linux)&lt;/strong&gt;: When developers run neural network training (such as PyTorch, TensorFlow) under the Linux subsystem, they can directly call upon the GPU compute power of the Windows host, also thanks to the D3D12/CUDA mapping support provided by GPU-P.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Azure NV-Series Virtual Machines&lt;/strong&gt;: On the public cloud, GPU-P allows Azure to offer cost-effective GPU instances to different users, supporting AI inference, rendering, and scientific computing.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Cloud Gaming and VDI&lt;/strong&gt;: Through high-density GPU partitioning, a single server can simultaneously serve dozens of users with 1080p/60 FPS gaming sessions or 3D design desktops.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  WDDM Evolution: Embarking on the Long March of GPU Virtualization
&lt;/h2&gt;

&lt;p&gt;GPU-P did not mature overnight; it was strengthened step by step across successive versions of the Windows Display Driver Model (WDDM):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;WDDM 2.4 (Windows 10 1803)&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Milestone Significance&lt;/strong&gt;: Formally introduced the GPU-PV architecture.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Core Functionality&lt;/strong&gt;: Supported basic rendering capability partitioning and introduced the IOMMU isolation mechanism.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;WDDM 2.5 - 2.9&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Continuous optimization. Introduced mechanisms such as "driver-known Escape calls," enhancing the security of cross-process/cross-VM communication.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;WDDM 3.2 (Windows 11 24H2)&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Live Migration&lt;/strong&gt;: This is the "holy grail" of GPU virtualization. WDDM 3.2 introduced technologies like Dirty Bit Tracking, enabling virtual machines running GPU workloads to migrate between different physical hosts without downtime.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;LDA (Linked Display Adapter) Support&lt;/strong&gt;: Supports more complex partitioning strategies in multi-GPU environments.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The emergence of GPU-P marks the transition of Windows GPU virtualization from "barely functional" to "truly practical" and "industrialized." It not only solves the pain point of resource sharing but also finds the perfect balance between security and performance.&lt;/p&gt;

&lt;p&gt;In the following chapters, we will delve into its kernel logic and explore how the host and virtual machine execute their precise "pas de deux."&lt;/p&gt;

</description>
      <category>cloudcomputing</category>
      <category>infrastructure</category>
      <category>performance</category>
      <category>systems</category>
    </item>
    <item>
      <title>Part 13: Epilogue — The Architectural Evolution from i915 to the Xe Driver</title>
      <dc:creator>Deleon Karen</dc:creator>
      <pubDate>Tue, 02 Jun 2026 14:54:16 +0000</pubDate>
      <link>https://dev.to/deleon_karen_2216eb5888b3/part-13-epilogue-the-architectural-evolution-from-i915-to-the-xe-driver-7j8</link>
      <guid>https://dev.to/deleon_karen_2216eb5888b3/part-13-epilogue-the-architectural-evolution-from-i915-to-the-xe-driver-7j8</guid>
      <description>&lt;p&gt;In the previous twelve lectures, we delved from macro architecture down to code details, thoroughly dissecting i915, the massive and complex kernel graphics driver. We saw how it manages memory (GEM/TTM), schedules tasks (Execlists/GuC), lights up displays (KMS), and recovers from disasters (Reset).&lt;/p&gt;

&lt;p&gt;However, there is no eternal perfection in software engineering. Over time, business requirements change and hardware architectures evolve, and an existing codebase inevitably accumulates technical debt. In the final lecture of this series, we will step outside the code framework of i915 to discuss its historical baggage and why Intel introduced an entirely new driver in the Linux 6.8 kernel: the &lt;strong&gt;Xe driver&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Retrospect: The Heavy Historical Baggage of i915
&lt;/h2&gt;

&lt;p&gt;The i915 driver began in 2004. Over a span of 20 years, it has accompanied countless users' computers but has also gradually become an unwieldy behemoth (over 300,000 lines of code). Its "heaviness" is mainly reflected in the following aspects:&lt;/p&gt;

&lt;h3&gt;
  
  
  1.1 The Long Hardware Support Timeline
&lt;/h3&gt;

&lt;p&gt;Opening &lt;code&gt;i915_pci.c&lt;/code&gt;, you will see device IDs ranging from i830 (Gen2) all the way to the latest Meteor Lake and Discrete Graphics (DG2).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;To maintain compatibility with those ancient integrated GPUs that had no hardware scheduler and relied on simple ring buffers for command submission, the code retains numerous &lt;code&gt;legacy&lt;/code&gt; execution paths.&lt;/li&gt;
&lt;li&gt;Despite the existence of the firmware-based GuC scheduler, the old &lt;code&gt;Execlists&lt;/code&gt; and even older submission mechanisms must still be maintained, making the logic in the &lt;code&gt;gt/&lt;/code&gt; directory intricate and convoluted.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  1.2 The Inertia of the Unified Memory Architecture (UMA) Mindset
&lt;/h3&gt;

&lt;p&gt;At i915's inception, Intel GPUs were all integrated, sharing memory with the CPU. This deeply imprinted the mark of UMA on i915's low-level data structures.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When Intel decided to enter the discrete graphics card market (Arc series) and introduce local memory (LMEM), the i915 team had to painfully cram the veteran &lt;strong&gt;TTM (Translation Table Manager)&lt;/strong&gt; into the originally GEM-based architecture.&lt;/li&gt;
&lt;li&gt;This mid-career integration resulted in code littered with conditional logic (e.g., &lt;code&gt;if (HAS_LMEM(i915))&lt;/code&gt;), causing a sharp increase in the maintenance difficulty of the memory management subsystem.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  1.3 Complex Synchronization and Display Coupling
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;In i915, the display logic (Display/KMS) is highly coupled with the low-level memory management.&lt;/li&gt;
&lt;li&gt;i915's early scheduling design did not fully align with the philosophy of the modern DRM Scheduler, often requiring many complex adaptations when interfacing with modern userspace stacks like Wayland.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  2. Breaking the Cocoon: The Design Philosophy of the Xe Driver
&lt;/h2&gt;

&lt;p&gt;Faced with the heavy baggage of i915, Intel decided that when supporting the new generation of GPUs based on the Xe architecture (Tiger Lake/Gen12 and beyond), instead of patching and mending i915, they would write an entirely new driver from scratch — &lt;strong&gt;Xe&lt;/strong&gt; (&lt;code&gt;drivers/gpu/drm/xe&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;The core design philosophy of the Xe driver can be summarized as: &lt;strong&gt;"Travel light, and do everything for the modern GPU."&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2.1 Abandoning History, Focusing on Modern Architectures
&lt;/h3&gt;

&lt;p&gt;The Xe driver &lt;strong&gt;only supports&lt;/strong&gt; GPUs from Tiger Lake and newer architectures (Gen12+).&lt;br&gt;
This means it directly discards the baggage of supporting 15 years of old hardware. Without Ringbuffers or Execlists, the Xe driver, from its very first line of code, &lt;strong&gt;completely relies on the GuC (Graphics Microcontroller)&lt;/strong&gt; for hardware scheduling. This dramatically simplifies the code complexity of the &lt;code&gt;gt/&lt;/code&gt; layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.2 Natively Embracing TTM and the DRM Scheduler
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pure TTM&lt;/strong&gt;: The Xe driver no longer has its own custom GEM memory allocator but natively uses the DRM core's TTM framework from the start to manage system memory and discrete memory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Standard Scheduler&lt;/strong&gt;: Xe fully integrates &lt;code&gt;drm_sched&lt;/code&gt; (DRM Scheduler). The complex dependency resolution and request ordering logic previously found in i915 is now largely handed off to the kernel's DRM subsystem for unified handling. This not only results in less code for the Xe driver but also allows it to work better with other graphics stacks.&lt;/li&gt;
&lt;/ul&gt;
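
&lt;p&gt;To make "dependency resolution and request ordering" concrete, here is a minimal Python sketch of the idea: a job becomes runnable only once every job it waits on has finished and signaled its fence. This is not the &lt;code&gt;drm_sched&lt;/code&gt; implementation; the function name &lt;code&gt;run_order&lt;/code&gt; and the dictionary input format are assumptions made for illustration.&lt;/p&gt;

```python
from collections import deque


def run_order(jobs):
    """Return an execution order; jobs maps a job name to the names it waits on."""
    pending = {name: set(deps) for name, deps in jobs.items()}
    ready = deque(sorted(name for name, deps in pending.items() if not deps))
    order = []
    while ready:
        done = ready.popleft()
        order.append(done)
        del pending[done]
        # Signal the finished job's fence: remove it from every wait set.
        for name in sorted(pending):
            deps = pending[name]
            if done in deps:
                deps.discard(done)
                if not deps:
                    ready.append(name)
    if pending:
        raise ValueError("dependency cycle: " + ", ".join(sorted(pending)))
    return order
```

&lt;p&gt;A driver that delegates this bookkeeping to a shared scheduler only has to describe each job's dependencies, which is where the reduction in driver-specific code comes from.&lt;/p&gt;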

&lt;h3&gt;
  
  
  2.3 The User-Mode Queue (VM_BIND) Model
&lt;/h3&gt;

&lt;p&gt;Modern graphics APIs (like Vulkan, Direct3D 12) all adopt an explicit memory management and command queue model.&lt;br&gt;
i915 uses the traditional execbuf model, in which the buffers a batch references are bound into the GPU address space implicitly when commands are submitted. In contrast, the Xe driver is built around the modern &lt;strong&gt;VM_BIND API&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Userspace explicitly binds memory regions into the GPU virtual address space ahead of time, so a submission only references addresses that are already mapped instead of handing the kernel a buffer list to bind on every call.&lt;/li&gt;
&lt;li&gt;Combined with execution queues that userspace creates and that map one-to-one onto kernel queue objects, this greatly reduces per-submission kernel overhead and makes CPU-side command submission more efficient.&lt;/li&gt;
&lt;/ul&gt;
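
&lt;p&gt;The explicit-binding idea can be illustrated with a toy address space in Python. The class &lt;code&gt;GpuVm&lt;/code&gt; and its methods are hypothetical names, not the Xe uAPI; the sketch only shows that binding happens as a separate, earlier step and that submission then carries nothing but an address.&lt;/p&gt;

```python
class GpuVm:
    """Toy GPU address space with explicit binds (names are illustrative)."""

    def __init__(self):
        self.bindings = {}  # GPU virtual address: buffer name

    def bind(self, gpu_va, buffer):
        # Explicit step performed by userspace before any submission.
        if gpu_va in self.bindings:
            raise ValueError("address already bound")
        self.bindings[gpu_va] = buffer

    def unbind(self, gpu_va):
        del self.bindings[gpu_va]

    def submit(self, batch_va):
        # Submission carries only an address; no buffer list is walked here.
        if batch_va not in self.bindings:
            raise LookupError("batch address is not bound")
        return self.bindings[batch_va]
```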

&lt;h3&gt;
  
  
  2.4 Code Reuse: Clever Display Isolation
&lt;/h3&gt;

&lt;p&gt;While rewriting the low-level memory and scheduling components, Intel did not want to rewrite the extremely large and complex display code (hundreds of thousands of lines handling HDMI/DP/Type-C protocols).&lt;br&gt;
The Xe team made a very ingenious design choice: &lt;strong&gt;refactoring i915's Display code into a relatively independent module&lt;/strong&gt;. Now, when the Xe driver lights up a screen, it actually calls into this stripped-out i915 Display module (you will see Xe source code directly including i915 display header files). This achieves the low-level advantages of a modern architecture while preserving the battle-tested stability of the display output.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Conclusion
&lt;/h2&gt;

&lt;p&gt;Two decades of development have turned i915 into a living textbook of Linux graphics drivers, recording every footprint of GPU technology's journey from weakness to strength. Although the spotlight will gradually shift to Xe, i915 will continue to be maintained for many years, providing stable graphics support to countless older devices.&lt;/p&gt;

&lt;p&gt;This series of articles ends here. I hope that after these 13 lectures, hundreds of thousands of lines of low-level C code feel less intimidating. The world of the kernel is deep, but as long as you follow the threads of "initialization, memory, scheduling, display," you can always discern its underlying framework.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>linux</category>
      <category>opensource</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Part 12: The Undying Body: GPU Hang Detection and Reset</title>
      <dc:creator>Deleon Karen</dc:creator>
      <pubDate>Tue, 02 Jun 2026 14:48:40 +0000</pubDate>
      <link>https://dev.to/deleon_karen_2216eb5888b3/part-12-the-undying-body-gpu-hang-detection-and-reset-305f</link>
      <guid>https://dev.to/deleon_karen_2216eb5888b3/part-12-the-undying-body-gpu-hang-detection-and-reset-305f</guid>
      <description>&lt;p&gt;In complex graphics rendering or compute tasks, it is common for the GPU to "hang" due to executing defective shader code, encountering an infinite loop, or experiencing an anomaly in the hardware state machine. For a mature kernel driver, the ability to quickly detect a hang, capture on-site "last words," and gracefully resume operation is a key measure of its fault tolerance.&lt;/p&gt;

&lt;p&gt;In i915, this life-support mechanism is primarily composed of three modules: &lt;strong&gt;Hangcheck&lt;/strong&gt;, &lt;strong&gt;Error Capture&lt;/strong&gt;, and &lt;strong&gt;Reset&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Detecting a GPU Hang: The Hangcheck and Heartbeat Mechanism
&lt;/h2&gt;

&lt;p&gt;Early i915 drivers relied on a timer to poll the execution progress (whether the Seqno advanced) of all execution engines. However, in the modern i915 architecture (especially the implementation centered around &lt;code&gt;intel_engine_heartbeat.c&lt;/code&gt;), the driver uses a more proactive and precise &lt;strong&gt;"Heartbeat" mechanism&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.1 Heartbeat Emission and Detection
&lt;/h3&gt;

&lt;p&gt;When an engine is in an active state, the driver periodically sends a special request, called a "Heartbeat" (Systole), containing only no-ops and synchronization barriers.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Under normal circumstances, the GPU quickly executes this heartbeat request and triggers an interrupt.&lt;/li&gt;
&lt;li&gt;The driver uses &lt;code&gt;mod_delayed_work&lt;/code&gt; to set a timeout. If the GPU fails to complete the heartbeat request within this period, the driver does not immediately declare it hung.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  1.2 The Ultimatum (Preemption Timeout)
&lt;/h3&gt;

&lt;p&gt;If the heartbeat times out, the driver plays its "trump card": it forcibly elevates the priority of this heartbeat request to the highest level (&lt;code&gt;I915_PRIORITY_BARRIER&lt;/code&gt;).&lt;br&gt;
This tells the hardware scheduler: "Whatever time-consuming task you are running right now, preempt it immediately and run my heartbeat first!"&lt;/p&gt;

&lt;p&gt;If, even after issuing the highest-priority preemption command and waiting for a period (typically several hundred milliseconds, depending on &lt;code&gt;preempt_timeout_ms&lt;/code&gt;), the heartbeat still does not pulse, i915 gives up completely and calls &lt;code&gt;reset_engine()&lt;/code&gt;, officially declaring the engine Hung.&lt;/p&gt;
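
&lt;p&gt;The escalation described in sections 1.1 and 1.2 can be summarized as a small state machine. The Python model below is a simplification under stated assumptions: it collapses the priority ladder into two stages, and the names &lt;code&gt;HeartbeatMonitor&lt;/code&gt; and &lt;code&gt;Verdict&lt;/code&gt; are hypothetical rather than taken from &lt;code&gt;intel_engine_heartbeat.c&lt;/code&gt;.&lt;/p&gt;

```python
from enum import Enum


class Verdict(Enum):
    HEALTHY = "healthy"
    ESCALATED = "escalated"
    HUNG = "hung"


class HeartbeatMonitor:
    """Simplified model of the heartbeat check; not the i915 code itself."""

    def __init__(self):
        self.at_barrier_priority = False

    def on_timeout(self, heartbeat_completed):
        # Called each time the delayed-work timer fires.
        if heartbeat_completed:
            self.at_barrier_priority = False  # next pulse starts at normal priority
            return Verdict.HEALTHY
        if not self.at_barrier_priority:
            # First miss: raise the pulse to barrier priority so it preempts
            # whatever is running, then give it one more timeout period.
            self.at_barrier_priority = True
            return Verdict.ESCALATED
        # Still no pulse even after preemption was demanded: declare a hang.
        return Verdict.HUNG
```

&lt;p&gt;The key design point is that a single missed heartbeat never causes a reset by itself; the engine always gets one chance to respond to a preemption request first.&lt;/p&gt;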

&lt;h2&gt;
  
  
  2. Collecting "Last Words": Error State Capture
&lt;/h2&gt;

&lt;p&gt;Before pulling the plug on the GPU, the most important task is to preserve the crash scene so that developers can perform a post-mortem analysis. This step occurs within the &lt;code&gt;intel_gt_handle_error()&lt;/code&gt; function.&lt;/p&gt;

&lt;p&gt;When the &lt;code&gt;I915_ERROR_CAPTURE&lt;/code&gt; flag is passed, the driver calls &lt;code&gt;i915_capture_error_state()&lt;/code&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Snapshot Registers&lt;/strong&gt;: Reads the critical hardware register state at that moment (such as the error identity register EIR, the active head pointer ACTHD, and the current context state).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capture Ringbuffer&lt;/strong&gt;: Copies all the commands currently in the command ring (Ringbuffer) being executed by the engine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Record Batchbuffer&lt;/strong&gt;: If a long series of rendering commands submitted from user space caused the hang, the driver also saves the content of the relevant Batchbuffer.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These "last words" are packaged into the &lt;code&gt;i915_gpu_error&lt;/code&gt; structure and exposed to user space via Linux's &lt;code&gt;sysfs&lt;/code&gt; or &lt;code&gt;debugfs&lt;/code&gt; (usually located at &lt;code&gt;/sys/kernel/debug/dri/0/i915_error_state&lt;/code&gt;). Tools like &lt;code&gt;intel_error_decode&lt;/code&gt; can read this to reconstruct what instructions the GPU was executing at the moment of the hang.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Reset Therapy: From Microsurgery to Defibrillation
&lt;/h2&gt;

&lt;p&gt;After collecting the error state, i915 attempts to pull the GPU back from the brink of death. Modern Intel GPUs support a multi-level reset strategy, following the principle of &lt;strong&gt;"minimizing the impact area."&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3.1 Engine Reset
&lt;/h3&gt;

&lt;p&gt;If only the Video Decode Engine (VCS) is stuck, while the Render Engine (RCS) is still happily running a game, we obviously don't want the entire screen to go black.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The driver first tries to call &lt;code&gt;intel_engine_reset()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The hardware sends a reset signal only to the specific engine(s) that are stuck, cleans up the hung context, and preserves the running state of other engines.&lt;/li&gt;
&lt;li&gt;This is an extremely "minimally invasive" recovery method; the user might only feel a slight stutter in one specific task.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3.2 GT Reset (Full Reset)
&lt;/h3&gt;

&lt;p&gt;If the engine reset fails, or if the hang involves shared common resources (like the memory scheduler or command streamer), the driver has to fall back to the next option.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The driver executes &lt;code&gt;intel_gt_reset()&lt;/code&gt;, enters the &lt;code&gt;I915_RESET_BACKOFF&lt;/code&gt; state, and pauses submissions to all engines.&lt;/li&gt;
&lt;li&gt;It sends a reset signal to the entire GT (Graphics Technology) core. This clears all hardware execution queues.&lt;/li&gt;
&lt;li&gt;After a successful reset, the driver requeues and resubmits the innocent requests that did not cause the hang. For the culprit, it directly returns &lt;code&gt;-EIO&lt;/code&gt; (Input/Output Error), telling the application "your task has been terminated."&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3.3 Device-Level Reset (PCI Reset - Wedged)
&lt;/h3&gt;

&lt;p&gt;If even the GT reset fails to wake the GPU (for instance, the hardware state machine has completely collapsed, or the bus is deadlocked), i915 desperately marks the device as &lt;strong&gt;Wedged&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It calls &lt;code&gt;intel_gt_set_wedged()&lt;/code&gt;. In this state, the driver rejects all graphics execution requests from user space, and all new &lt;code&gt;execbuffer&lt;/code&gt; calls immediately return &lt;code&gt;-EIO&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If conditions permit, the driver may attempt the highest-level reset (Device Reset) at the PCI bus level. However, if it comes to this, the screen usually flickers, and a machine reboot might be necessary for full recovery.&lt;/li&gt;
&lt;/ul&gt;
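
&lt;p&gt;The three levels form an escalation ladder, which the following Python sketch models. The function &lt;code&gt;recover&lt;/code&gt; and its callback parameters are hypothetical stand-ins for the hardware operations; only the ordering (engine, then GT, then wedged) comes from the text above.&lt;/p&gt;

```python
def recover(hung_engines, engine_reset, gt_reset):
    """Try the narrowest reset first; return the level that ended recovery.

    engine_reset(name) and gt_reset() are caller-supplied callables that
    return True on success. They stand in for the hardware operations.
    """
    if all(engine_reset(name) for name in hung_engines):
        return "engine"   # only the stuck engines were touched
    if gt_reset():
        return "gt"       # every engine was reset; innocent work is replayed
    return "wedged"       # give up: new submissions fail with -EIO
```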

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;A robust GPU driver should not assume that hardware will always run perfectly. i915's &lt;code&gt;Heartbeat&lt;/code&gt; mechanism monitors the health of the engines like an electrocardiograph; &lt;code&gt;i915_gpu_error&lt;/code&gt; records all data before a crash like a black box; and the multi-level &lt;code&gt;Reset&lt;/code&gt; mechanism, from Engine to GT level, works like an emergency room physician, doing its utmost to revive the GPU to full health and protect the user experience to the greatest extent possible.&lt;/p&gt;

&lt;p&gt;In the next lecture, we will reach the final installment of this series, reviewing the historical baggage carried by the i915 behemoth with its hundreds of thousands of lines of code, and see how Intel's latest Xe architecture driver travels light and faces the future.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>linux</category>
      <category>monitoring</category>
      <category>systems</category>
    </item>
    <item>
      <title>Part 11: Extreme Power Control: RPM, RC6, and RPS</title>
      <dc:creator>Deleon Karen</dc:creator>
      <pubDate>Tue, 02 Jun 2026 14:40:25 +0000</pubDate>
      <link>https://dev.to/deleon_karen_2216eb5888b3/part-11-extreme-power-control-rpm-rc6-and-rps-38d7</link>
      <guid>https://dev.to/deleon_karen_2216eb5888b3/part-11-extreme-power-control-rpm-rc6-and-rps-38d7</guid>
      <description>&lt;p&gt;In previous chapters, we explored how i915 manages video memory, schedules tasks, and lights up the display. However, for modern GPUs, &lt;strong&gt;"how to run fast" is merely the baseline; "how to save power" is the real technical barrier&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Whether in a thin-and-light laptop or a power-constrained data center, the GPU is a notorious power hog. If left unchecked, it will not only drain the battery but also cause severe thermal throttling. To squeeze out every drop of energy efficiency, the i915 driver has built an extremely sophisticated power control system within the kernel.&lt;/p&gt;

&lt;p&gt;Today, we will focus on the three core pillars of i915 power management: &lt;strong&gt;RPM (Runtime PM)&lt;/strong&gt;, &lt;strong&gt;RC6&lt;/strong&gt;, and &lt;strong&gt;RPS&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Putting the Device to Sleep as a Whole: Runtime PM (RPM)
&lt;/h2&gt;

&lt;p&gt;Imagine this scenario: your laptop screen is on, displaying a static, plain-text article. At this moment, apart from the display controller periodically reading the framebuffer, all of the GPU's compute engines are essentially idle. In a more extreme case, if an external GPU (eGPU) is attached through a dock and no program is currently using it, should it still be running at full power?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Runtime Power Management (RPM)&lt;/strong&gt; is designed to solve this problem. It is a standard power management framework provided by the Linux kernel, and its core implementation in the i915 driver resides in &lt;code&gt;intel_runtime_pm.c&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.1 Core Mechanism: Wakeref (Wake Reference)
&lt;/h3&gt;

&lt;p&gt;i915's RPM management relies on a mechanism called &lt;strong&gt;Wakeref (Wake Reference Counting)&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  When the driver needs to access GPU hardware (e.g., writing instructions to registers, handling interrupts, or when userspace initiates a rendering request), it must first acquire a Wakeref: calling &lt;code&gt;intel_runtime_pm_get(&amp;amp;i915-&amp;gt;runtime_pm)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;  If this is the first Wakeref, the driver triggers a device wake-up (waking from PCI D3hot/D3cold sleep state to D0 full-power state).&lt;/li&gt;
&lt;li&gt;  Once the operation is complete, the driver releases the reference: calling &lt;code&gt;intel_runtime_pm_put()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Once the Wakeref count drops to zero, the driver considers the GPU idle and allows the device to enter deep sleep.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This mechanism is extremely strict. Assertions like &lt;code&gt;assert_rpm_wakelock_held()&lt;/code&gt; are often seen in the code, forcing developers to "hold a permit" before touching the hardware. Otherwise, reading or writing to silicon that has already been powered down will directly cause the entire system to hang.&lt;/p&gt;
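
&lt;p&gt;The reference-counting discipline can be modeled in a few lines of Python. This is a toy sketch: the class &lt;code&gt;RuntimePm&lt;/code&gt; and its methods are hypothetical, and it powers the device down immediately at zero references, whereas a real driver may delay the transition.&lt;/p&gt;

```python
from contextlib import contextmanager


class RuntimePm:
    """Toy wake-reference counter; the real logic lives in intel_runtime_pm.c."""

    def __init__(self):
        self.wakerefs = 0
        self.powered = False

    def get(self):
        if self.wakerefs == 0:
            self.powered = True   # first reference: wake the device (D3 to D0)
        self.wakerefs += 1

    def put(self):
        assert self.wakerefs != 0, "unbalanced put"
        self.wakerefs -= 1
        if self.wakerefs == 0:
            self.powered = False  # last reference dropped: device may sleep

    @contextmanager
    def wakeref(self):
        # Pair get/put automatically, even if the body raises.
        self.get()
        try:
            yield
        finally:
            self.put()

    def mmio_read(self, reg):
        # Mirrors the spirit of assert_rpm_wakelock_held().
        assert self.wakerefs != 0, "hardware touched without a wakeref"
        return 0  # placeholder register value
```

&lt;p&gt;Wrapping every hardware access in &lt;code&gt;with pm.wakeref():&lt;/code&gt; makes it impossible to forget the matching release, which is the same goal the kernel assertions serve.&lt;/p&gt;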

&lt;h2&gt;
  
  
  2. Dynamic Sleep of the Render Engine: RC6 (Render C-States)
&lt;/h2&gt;

&lt;p&gt;RPM controls the life and death of the entire GPU device, but its granularity is too coarse. If the screen is refreshing, the device cannot completely enter RPM sleep. At this point, we need finer-grained control — &lt;strong&gt;RC6&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.1 What is RC6?
&lt;/h3&gt;

&lt;p&gt;In the CPU world, there are C-States (e.g., C0 is running, C6 is deep sleep). Intel GPUs introduced a similar concept called &lt;strong&gt;Render C-States (RC)&lt;/strong&gt;. The most critical among them is &lt;strong&gt;RC6&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When RC6 is enabled, the hardware's PCU (Power Control Unit) constantly monitors the busyness of each engine (like the render engine RCS, video engine VCS). If it finds an engine has been idle for more than a few milliseconds, &lt;strong&gt;the hardware automatically cuts off that engine's clock and even power&lt;/strong&gt;, while other parts (like the display output) remain operational.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.2 The Driver's Role
&lt;/h3&gt;

&lt;p&gt;Entering and exiting RC6 is primarily handled automatically by the hardware, but in &lt;code&gt;gt/intel_rc6.c&lt;/code&gt;, the i915 driver is responsible for:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Initialization and Enabling&lt;/strong&gt;: Configuring the hardware sleep thresholds and policies in &lt;code&gt;intel_rc6_enable()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Status Monitoring&lt;/strong&gt;: Reading hardware counters via &lt;code&gt;intel_rc6_residency_ns()&lt;/code&gt; to obtain the cumulative time the GPU has spent in the RC6 (sleep) state. Sampling this counter twice and dividing the difference by the elapsed time gives the fraction of that interval spent asleep, which is often used to evaluate whether the driver's power-saving optimizations are effective.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deeper Sleep&lt;/strong&gt;: Besides RC6, the hardware also supports deeper &lt;code&gt;RC6p&lt;/code&gt; and &lt;code&gt;RC6pp&lt;/code&gt; states. Entering these states saves even more power but comes with greater wake-up latency.&lt;/li&gt;
&lt;/ol&gt;
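
&lt;p&gt;RC6 effectiveness is usually quoted as a percentage over a time window. The Python sketch below shows the arithmetic, assuming the residency value is a cumulative nanosecond counter sampled at the start and end of the window; the function name and parameters are hypothetical.&lt;/p&gt;

```python
def rc6_residency_percent(residency_start_ns, residency_end_ns,
                          wall_start_ns, wall_end_ns):
    """Share of a sampling window spent in RC6, as a percentage from 0 to 100."""
    window_ns = wall_end_ns - wall_start_ns
    if window_ns == 0:
        raise ValueError("sampling window is empty")
    asleep_ns = residency_end_ns - residency_start_ns
    # Clamp to guard against small skew between the two clocks.
    return max(0.0, min(100.0, 100.0 * asleep_ns / window_ns))
```

&lt;p&gt;For example, if the counter advances by 750,000 ns during a 1,000,000 ns window, the engine was asleep for 75% of that window.&lt;/p&gt;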

&lt;h2&gt;
  
  
  3. Dynamic Frequency Scaling (GPU Turbo): RPS (Render P-States)
&lt;/h2&gt;

&lt;p&gt;If RC6 is about making the GPU "sleep" smartly, then &lt;strong&gt;RPS (Render P-States)&lt;/strong&gt; is about making the GPU "work" smartly. It is the equivalent of dynamic frequency scaling in the CPU world (cpufreq / Turbo Boost).&lt;/p&gt;

&lt;h3&gt;
  
  
  3.1 Three Anchors of P-States
&lt;/h3&gt;

&lt;p&gt;In &lt;code&gt;gt/intel_rps.c&lt;/code&gt;, you will often see three terms that define the boundaries of GPU frequency:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;RP0&lt;/strong&gt;: The maximum turbo frequency supported by the hardware (Max Turbo Frequency), offering the highest performance but consuming the most power.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;RP1&lt;/strong&gt;: The guaranteed base frequency (Guaranteed Frequency), the sustained maximum frequency under good thermal conditions.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;RPn&lt;/strong&gt;: The minimum frequency supported by the hardware (Minimum Frequency), offering the lowest performance but consuming the least power.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3.2 Load-Based Frequency Scaling
&lt;/h3&gt;

&lt;p&gt;By default, the GPU sits at RPn when idle. When the workload increases, hardware counters trigger an interrupt (Up Threshold), notifying the driver that compute power is insufficient. The driver's &lt;code&gt;intel_rps_set()&lt;/code&gt; then intervenes, stepping up the GPU frequency; the reverse happens when load decreases.&lt;/p&gt;
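
&lt;p&gt;A load-based governor of this kind can be sketched as follows. The thresholds (40% and 85%) and the 50 MHz step are made-up illustration values, and &lt;code&gt;next_frequency&lt;/code&gt; is a hypothetical function rather than the logic in &lt;code&gt;intel_rps.c&lt;/code&gt;; only the clamping to the range between RPn and RP0 follows directly from the text.&lt;/p&gt;

```python
from bisect import bisect_right


def next_frequency(current_mhz, busy_percent, rpn_mhz, rp0_mhz,
                   down_threshold=40, up_threshold=85, step_mhz=50):
    """Pick the next GPU frequency from the busyness of the last window."""
    # Zone 0: below the down threshold, zone 1: dead band, zone 2: at or above up.
    zone = bisect_right([down_threshold, up_threshold], busy_percent)
    delta_mhz = (-step_mhz, 0, step_mhz)[zone]
    # Clamp so the result never leaves the hardware range from RPn to RP0.
    return max(rpn_mhz, min(rp0_mhz, current_mhz + delta_mhz))
```

&lt;p&gt;The dead band between the two thresholds keeps the frequency from oscillating when the load hovers around a single cutoff.&lt;/p&gt;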

&lt;h3&gt;
  
  
  3.3 Sacrificing Power for Latency: Wait Boost
&lt;/h3&gt;

&lt;p&gt;i915 features a particularly interesting mechanism called &lt;strong&gt;RPS Boost&lt;/strong&gt; (the "wait boost" of this section's title).&lt;br&gt;
In &lt;code&gt;i915_gem_wait.c&lt;/code&gt;, when userspace (such as the X server or a game) urgently needs a rendering result and issues a system call that waits on a &lt;code&gt;dma_fence&lt;/code&gt;, the driver takes this as a signal that the user is waiting for the frame.&lt;/p&gt;

&lt;p&gt;At this moment, the driver calls &lt;code&gt;intel_rps_boost()&lt;/code&gt;, which bypasses the gradual load-based ramp-up and &lt;strong&gt;raises the GPU frequency straight to the boost frequency (RP0, the maximum, by default)&lt;/strong&gt; so that the task finishes as quickly as possible and the user sees less latency. Once the task is finished, the frequency drops back down. This is a classic case of trading a brief spike in power consumption for responsiveness.&lt;/p&gt;
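
&lt;p&gt;One way to think about the boost is as an override that is active while at least one waiter is blocked. The Python sketch below models only that idea; &lt;code&gt;BoostModel&lt;/code&gt;, its methods, and the frequencies are hypothetical, and the real driver returns to normal operation through its regular frequency logic rather than in a single step.&lt;/p&gt;

```python
class BoostModel:
    def __init__(self, load_freq_mhz, boost_freq_mhz):
        self.load_freq_mhz = load_freq_mhz    # what the load-based logic chose
        self.boost_freq_mhz = boost_freq_mhz  # assumed to equal RP0 here
        self.waiters = 0

    def effective_freq(self):
        # While anyone is blocked on a fence, the boost frequency wins.
        return self.boost_freq_mhz if self.waiters > 0 else self.load_freq_mhz

    def wait_begin(self):
        self.waiters += 1
        return self.effective_freq()

    def wait_end(self):
        self.waiters = max(0, self.waiters - 1)
        return self.effective_freq()


gpu = BoostModel(load_freq_mhz=400, boost_freq_mhz=1150)
print(gpu.effective_freq())  # no waiters: load-based frequency
print(gpu.wait_begin())      # a client blocks on a fence: boosted
print(gpu.wait_end())        # wait finished: back to the load-based value
```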

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;i915's power control is a relay race from macro to micro:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;RPM&lt;/strong&gt; controls the big picture, decisively cutting power when the entire device is idle.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RC6&lt;/strong&gt; seizes every opportunity, letting the render engine sneak a nap in the millisecond gaps when the display is on but the image is static.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RPS&lt;/strong&gt; manages the rhythm, dynamically adjusting frequency based on load during operation, and instantly delivering a "shot in the arm" when the user demands it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It is the tight coordination of these three that enables Intel GPUs to achieve excellent battery life across a range of devices.&lt;/p&gt;

&lt;p&gt;In the next lecture, we will explore the darkest yet most life-saving part of the GPU driver: how i915 pulls the GPU back from the brink when it truly "crashes" (Hang Detection and Reset).&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>linux</category>
      <category>performance</category>
      <category>systems</category>
    </item>
    <item>
      <title>Part 10: Deep Dive into Atomic Modesetting</title>
      <dc:creator>Deleon Karen</dc:creator>
      <pubDate>Tue, 02 Jun 2026 14:38:02 +0000</pubDate>
      <link>https://dev.to/deleon_karen_2216eb5888b3/part-10-deep-dive-into-atomic-modesetting-1lf2</link>
      <guid>https://dev.to/deleon_karen_2216eb5888b3/part-10-deep-dive-into-atomic-modesetting-1lf2</guid>
      <description>&lt;p&gt;In the previous sections, we discussed the execution of rendering commands and video memory management. But at the very end of the graphics stack, how exactly do the rendered pixels appear on the screen smoothly and seamlessly? This brings us to the most revolutionary architecture in the modern Linux display subsystem (DRM/KMS)—&lt;strong&gt;Atomic Modesetting&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In this lecture, we'll dive into the &lt;code&gt;display&lt;/code&gt; directory of i915 to see how those complex state machines, often spanning thousands of lines, actually work.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Why Do We Need Atomic Commit?
&lt;/h2&gt;

&lt;p&gt;In the early Legacy KMS era, display state updates were "fragmented." For example, if a userspace program (such as a Wayland Compositor or X Server) wanted to change the resolution and move the mouse cursor, it needed to call different IOCTLs separately: first set the CRTC (mode setting), then update the Plane (layer parameters), and finally move the Cursor.&lt;/p&gt;

&lt;p&gt;The fatal flaws of this design were &lt;strong&gt;unpredictability&lt;/strong&gt; and &lt;strong&gt;visible intermediate states&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Screen tearing/flickering&lt;/strong&gt;: These three operations might land in different vertical blanking (vblank) intervals, producing frames in which, for example, an old layer is displayed with a new cursor position.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Getting stuck halfway"&lt;/strong&gt;: If, after setting the CRTC, the driver discovers during the Plane setup that the hardware bandwidth is insufficient, the operation will return an error. However, the CRTC has already been modified and is difficult to roll back, ultimately resulting in a black screen.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To solve this problem, the kernel introduced &lt;strong&gt;Atomic KMS&lt;/strong&gt;. Its core idea is to package all display states (configurations of CRTCs, Planes, and Connectors) into a single "Transaction." The driver either applies the entire state to the hardware perfectly in one go, or rejects it outright during the check phase, ensuring absolute consistency of the hardware state—this is the essence of "Atomicity."&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Two-Phase Design: Check and Commit
&lt;/h2&gt;

&lt;p&gt;In i915, the entire atomic update process is strictly divided into two phases: &lt;code&gt;atomic_check&lt;/code&gt; and &lt;code&gt;atomic_commit&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase One: State Validation (&lt;code&gt;intel_atomic_check&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;When userspace submits a packaged new state, the entry function is &lt;code&gt;intel_atomic_check&lt;/code&gt;.&lt;br&gt;
During this phase, &lt;strong&gt;the driver must never modify any actual hardware registers&lt;/strong&gt;. Its sole task is to simulate in software:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Recompute PLL clocks (checking if the required pixel frequency can be generated).&lt;/li&gt;
&lt;li&gt;Calculate display bandwidth (checking if memory bandwidth can support a 4K 144Hz display with multiple UI layers).&lt;/li&gt;
&lt;li&gt;If all validations pass, return 0; if even a single hardware constraint cannot be met, return an error code directly (such as &lt;code&gt;-EINVAL&lt;/code&gt; or &lt;code&gt;-ENOSPC&lt;/code&gt;). The userspace application can then lower its requirements (e.g., reduce the refresh rate) and try again based on this feedback.&lt;/li&gt;
&lt;/ul&gt;
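
&lt;p&gt;The check-then-retry flow described above can be sketched as a pure function that inspects a proposed state and returns 0 or a negative error code without changing anything. This is a simplified Python model, not the i915 logic: the limits, the state layout, and the helper names &lt;code&gt;atomic_check&lt;/code&gt; and &lt;code&gt;atomic_update&lt;/code&gt; are assumptions chosen for the example.&lt;/p&gt;

```python
import copy
import errno

# Hypothetical limits for the toy model, not real hardware numbers.
MAX_PIXEL_CLOCK_KHZ = 600_000
MAX_BANDWIDTH_MBPS = 6_000


def atomic_check(new_state):
    """Validate a proposed state without touching 'hardware'. Returns 0 or -errno."""
    if new_state["pixel_clock_khz"] > MAX_PIXEL_CLOCK_KHZ:
        return -errno.EINVAL
    # kHz * bytes per pixel / 1000 gives MB/s fetched for each plane.
    needed_mbps = sum(
        new_state["pixel_clock_khz"] * plane["bytes_per_pixel"] // 1000
        for plane in new_state["planes"]
    )
    if needed_mbps > MAX_BANDWIDTH_MBPS:
        return -errno.ENOSPC
    return 0


def atomic_update(hw_state, new_state):
    """Apply new_state only if the check passes; otherwise leave hw_state alone."""
    ret = atomic_check(new_state)
    if ret != 0:
        return ret, hw_state
    return 0, copy.deepcopy(new_state)


hw = {"pixel_clock_khz": 148_500, "planes": [{"bytes_per_pixel": 4}]}
want = {"pixel_clock_khz": 594_000, "planes": [{"bytes_per_pixel": 4}] * 3}

ret, hw = atomic_update(hw, want)        # three planes: rejected, hw unchanged
if ret == -errno.ENOSPC:
    want["planes"] = want["planes"][:2]  # lower the request and retry
    ret, hw = atomic_update(hw, want)
print(ret, len(hw["planes"]))
```

&lt;p&gt;Because the rejected attempt never modified &lt;code&gt;hw&lt;/code&gt;, the caller can retry with a smaller request without any rollback.&lt;/p&gt;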

&lt;h3&gt;
  
  
  Phase Two: Hardware Commit (&lt;code&gt;intel_atomic_commit&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;After validation passes, the DRM core calls &lt;code&gt;intel_atomic_commit&lt;/code&gt;. For non-blocking commits the hardware programming is deferred to a worker, and in both cases the work ultimately lands in &lt;code&gt;intel_atomic_commit_tail&lt;/code&gt;.&lt;br&gt;
This is an extremely precise state machine that dismantles the old state and assembles the new state in a strict order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Wait for rendering into the new framebuffers to complete (using the &lt;code&gt;dma_fence&lt;/code&gt; mechanism discussed earlier).&lt;/li&gt;
&lt;li&gt;Disable Planes and CRTCs that are no longer in use.&lt;/li&gt;
&lt;li&gt;Update global display clocks and DDI interfaces.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reallocate DDB and Watermarks (detailed later).&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Enable new CRTCs and Planes.&lt;/li&gt;
&lt;/ol&gt;
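
&lt;p&gt;The ordering above can be expressed as a small planner that, given the set of pipes active before and after the commit, emits the steps in sequence. The Python sketch below is only a mnemonic for the order: the step names and the &lt;code&gt;plan_commit&lt;/code&gt; function are invented for the example and do not correspond to functions in the driver.&lt;/p&gt;

```python
def plan_commit(old_active, new_active):
    """Return the ordered list of steps for moving between two sets of pipes."""
    steps = ["wait_for_fences"]
    # Tear down what the new state no longer uses, before anything is enabled.
    steps += [f"disable:{pipe}" for pipe in sorted(old_active - new_active)]
    steps.append("update_clocks_and_ddi")
    steps.append("reallocate_ddb_and_watermarks")
    # Only now bring up what the new state adds.
    steps += [f"enable:{pipe}" for pipe in sorted(new_active - old_active)]
    return steps


for step in plan_commit({"A", "B"}, {"A", "C"}):
    print(step)
```

&lt;p&gt;Disabling before enabling matters because shared resources such as clocks and buffer space are freed by the first half and consumed by the second.&lt;/p&gt;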

&lt;h2&gt;
  
  
  3. The Carrier of Hardware Constraints: &lt;code&gt;intel_crtc_state&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;In the i915 source file &lt;code&gt;intel_display_types.h&lt;/code&gt;, there is a massive structure &lt;code&gt;struct intel_crtc_state&lt;/code&gt;. It holds the complete blueprint for mapping a display pipe from software abstraction to hardware registers.&lt;/p&gt;

&lt;p&gt;Among its contents, &lt;strong&gt;Watermarks (WM)&lt;/strong&gt; and &lt;strong&gt;DDB (Display Data Buffer)&lt;/strong&gt; are the key hardware constraints that determine whether an Atomic commit can succeed.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.1 The Lifeline of Supply and Demand: Watermarks (WM)
&lt;/h3&gt;

&lt;p&gt;The GPU's display engine needs to continuously read pixels from memory and send them to the monitor. If memory reads cannot keep up with the monitor's scanout rate, the pipeline underruns and the screen shows visual artifacts.&lt;br&gt;
To prevent this, the display engine has internal FIFO buffers. &lt;strong&gt;Watermarks&lt;/strong&gt; are the "warning levels" for these buffers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The driver needs to precisely calculate these based on the current resolution, color format (e.g., NV12 or ARGB), and memory latency (System Memory is slow, LMEM is fast).&lt;/li&gt;
&lt;li&gt;When the data in the FIFO drops below this watermark level, the hardware must immediately issue an urgent memory read request.&lt;/li&gt;
&lt;li&gt;In the &lt;code&gt;wm&lt;/code&gt; field of &lt;code&gt;intel_crtc_state&lt;/code&gt;, i915 meticulously calculates WMs at various levels (such as raw, optimal, intermediate) to ensure that the pixel stream never runs dry.&lt;/li&gt;
&lt;/ul&gt;
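
&lt;p&gt;To make the relationship between pixel rate, format, and latency concrete, here is a deliberately simplified calculation: the watermark must cover at least the data that is scanned out while one memory request is in flight. The Python sketch assumes a 512-byte buffer block and ignores tiling, scaling, and line-based methods, so treat the formula and the name &lt;code&gt;watermark_blocks&lt;/code&gt; as an illustration rather than the driver's actual computation.&lt;/p&gt;

```python
import math

DBUF_BLOCK_BYTES = 512  # assumed block size for this sketch


def watermark_blocks(pixel_rate_khz, bytes_per_pixel, latency_us):
    """Blocks that must already be buffered to ride out latency_us of memory delay."""
    # kHz * us = pixels / 1000, so divide by 1000 to get the pixels scanned
    # out while one memory request is in flight.
    bytes_in_flight = pixel_rate_khz * latency_us * bytes_per_pixel / 1000
    return math.ceil(bytes_in_flight / DBUF_BLOCK_BYTES)


# 4K at 60 Hz (594 MHz pixel clock), 4 bytes per pixel, two latency levels.
for latency_us in (2, 10):
    print(latency_us, watermark_blocks(594_000, 4, latency_us))
```

&lt;p&gt;Higher latency or a wider pixel format pushes the watermark up, which is why the same mode can pass validation with one memory configuration and fail with another.&lt;/p&gt;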

&lt;h3&gt;
  
  
  3.2 Precise Partitioning of SRAM: DDB Allocation
&lt;/h3&gt;

&lt;p&gt;The total size of the high-speed SRAM buffer (DDB) inside the display engine is fixed. If you have multiple displays (Pipes) and multiple layers (Planes) working simultaneously, these DDB blocks must be precisely partitioned.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic Reallocation&lt;/strong&gt;: If you suddenly add a video overlay plane to a running 4K monitor, this triggers an Atomic commit. &lt;code&gt;intel_atomic_check&lt;/code&gt; must recalculate the DDB (&lt;code&gt;struct skl_ddb_entry&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Seamless Transition&lt;/strong&gt;: Taking DDB space away from an active layer while it is still scanning out can cause underruns and visible corruption. Therefore, in the commit path you will see careful calls such as &lt;code&gt;intel_dbuf_mbus_pre_ddb_update()&lt;/code&gt; and &lt;code&gt;intel_dbuf_mbus_post_ddb_update()&lt;/code&gt;, which help the DDB re-partitioning transition smoothly during the vertical blanking interval (vblank).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If, during the &lt;code&gt;atomic_check&lt;/code&gt; phase, the driver finds that even the entire available DDB cannot satisfy the requested combination of resolution and layers, it rejects the update outright.&lt;/p&gt;
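
&lt;p&gt;The partition-or-reject behaviour can be sketched as follows. This Python model hands every plane its minimum number of blocks, splits the remainder in proportion to a relative data rate, and fails with &lt;code&gt;-ENOSPC&lt;/code&gt; when the minimums alone do not fit. The proportional policy, the function name &lt;code&gt;allocate_ddb&lt;/code&gt;, and the numbers are assumptions for illustration; the half-open (start, end) ranges loosely mirror the shape of &lt;code&gt;struct skl_ddb_entry&lt;/code&gt;.&lt;/p&gt;

```python
import errno


def allocate_ddb(total_blocks, planes):
    """Split total_blocks among planes in proportion to their data rate.

    planes maps a plane name to (min_blocks, relative_data_rate).
    Returns (0, {name: (start, end)}) or (-errno.ENOSPC, {}).
    """
    min_total = sum(min_blocks for min_blocks, _ in planes.values())
    if min_total > total_blocks:
        return -errno.ENOSPC, {}

    spare = total_blocks - min_total
    rate_total = sum(rate for _, rate in planes.values())
    allocation, start = {}, 0
    for name, (min_blocks, rate) in planes.items():
        extra = spare * rate // rate_total if rate_total > 0 else 0
        size = min_blocks + extra
        allocation[name] = (start, start + size)  # half-open [start, end)
        start += size
    return 0, allocation


print(allocate_ddb(1024, {"primary": (100, 3), "overlay": (50, 1)}))
print(allocate_ddb(100, {"primary": (80, 1), "overlay": (40, 1)}))
```

&lt;p&gt;Adding a plane changes every other plane's range, which is why a seemingly local change triggers a full recalculation during the check phase.&lt;/p&gt;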

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Atomic Modesetting has largely eliminated the intermediate-state flicker and half-applied configurations that once plagued Linux graphical interfaces. Through a strict two-phase design (a validation phase that does not touch hardware, and a meticulously ordered commit phase), i915 folds complex hardware constraints, including clocks, bandwidth, Watermarks, and DDB allocation, into a single atomic transaction.&lt;/p&gt;

&lt;p&gt;In the next lecture, we will move into Part 5 and explore how the driver controls the power consumption and sleep states of this "performance beast."&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>linux</category>
      <category>systems</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Part 9: Mapping the KMS Model onto Intel Hardware</title>
      <dc:creator>Deleon Karen</dc:creator>
      <pubDate>Tue, 02 Jun 2026 14:35:32 +0000</pubDate>
      <link>https://dev.to/deleon_karen_2216eb5888b3/part-9-mapping-the-kms-model-onto-intel-hardware-4ob4</link>
      <guid>https://dev.to/deleon_karen_2216eb5888b3/part-9-mapping-the-kms-model-onto-intel-hardware-4ob4</guid>
      <description>&lt;p&gt;In the previous two articles, we took a deep dive into the GPU's "heart" (GT) and "blood" (GEM). Today, we turn to the GPU's "eyes"—the display subsystem. In the Linux kernel, display drivers follow the &lt;strong&gt;KMS (Kernel Mode Setting)&lt;/strong&gt; model.&lt;/p&gt;

&lt;p&gt;For the i915 driver, the challenge lies in how to precisely map KMS's generic abstractions (CRTC, Plane, Connector, Encoder) onto Intel's complex display hardware pipeline (Pipes, Transcoders, DDIs).&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Mapping Software Abstractions to Hardware Entities
&lt;/h2&gt;

&lt;p&gt;To understand the display flow, we must first establish a mapping table from "software objects" to "hardware circuits":&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;KMS Object (DRM)&lt;/th&gt;
&lt;th&gt;i915 Implementation Struct&lt;/th&gt;
&lt;th&gt;Corresponding Intel HW Block&lt;/th&gt;
&lt;th&gt;Responsibility&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Plane&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;intel_plane&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Hardware Plane&lt;/td&gt;
&lt;td&gt;Pixel layer, responsible for scaling, rotation, and format conversion.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CRTC&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;intel_crtc&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Pipe&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Scanline generator, responsible for blending layers and applying the color Look-Up Table (LUT).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(No direct equivalent)&lt;/td&gt;
&lt;td&gt;(maintained in state)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Transcoder&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Protocol converter, responsible for encoding Pipe signals into timing signals (e.g., HDMI/DP timing).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Encoder&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;intel_encoder&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;DDI / Port&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Digital Display Interface, the physical output endpoint (e.g., Port A, Port B).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connector&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;intel_connector&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Physical Socket&lt;/td&gt;
&lt;td&gt;Reads the panel's EDID and detects HPD (Hot Plug Detect).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  2. The Core Hardware Pipeline: Pipes and Transcoders
&lt;/h2&gt;

&lt;p&gt;In Intel's documentation, the display pipeline is typically described as a series of connected modules.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.1 Pipe
&lt;/h3&gt;

&lt;p&gt;The core of &lt;code&gt;intel_crtc&lt;/code&gt; is the &lt;strong&gt;Pipe&lt;/strong&gt; (typically named Pipe A, B, C, D).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Task&lt;/strong&gt;: It fetches the pixels of each Plane from memory, blends them according to Z-order, and applies Gamma correction and Color Space Conversion (CSC).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mapping&lt;/strong&gt;: One &lt;code&gt;intel_crtc&lt;/code&gt; instance strictly corresponds to one physical Pipe.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2.2 Transcoder
&lt;/h3&gt;

&lt;p&gt;This concept is easy to confuse with the Pipe. In modern Intel hardware, a separate block called the &lt;strong&gt;Transcoder&lt;/strong&gt; sits between the Pipe and the output port.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Role&lt;/strong&gt;: It determines the output timing. For example, even if you are using Pipe A, if you are driving an eDP panel, it might be routed to &lt;code&gt;TRANSCODER_EDP&lt;/code&gt;; for a normal DP output, it might use &lt;code&gt;TRANSCODER_A&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexibility&lt;/strong&gt;: This design allows the hardware to share display logic among different physical interfaces.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  3. Flexible Output Endpoints: DDI and Port
&lt;/h2&gt;

&lt;p&gt;Before Haswell, Intel hardware had dedicated HDMI controllers, DP controllers, etc. Starting with the Haswell architecture, Intel introduced the &lt;strong&gt;DDI (Digital Display Interface)&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Essence of DDI&lt;/strong&gt;: It is a universal physical interface. Through programming, the same DDI port can run either HDMI or DisplayPort.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Role of the Encoder&lt;/strong&gt;: In i915, &lt;code&gt;intel_encoder&lt;/code&gt; represents a DDI port. Because a DDI is protocol-agnostic, a single &lt;code&gt;intel_encoder&lt;/code&gt; can often serve more than one protocol, running whichever one the attached display requires.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  4. A Pixel's Fantastic Journey: Data Flow Example
&lt;/h2&gt;

&lt;p&gt;Imagine you are gaming on a laptop connected to an external 4K monitor:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Plane&lt;/strong&gt;: The game frame sits in a framebuffer in memory (LMEM/SMEM), and an &lt;code&gt;intel_plane&lt;/code&gt; is configured to scan it out.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Pipe&lt;/strong&gt;: The &lt;code&gt;intel_crtc&lt;/code&gt; (Pipe A) blends the game frame layer with the mouse cursor layer (Cursor Plane).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Transcoder&lt;/strong&gt;: &lt;code&gt;TRANSCODER_A&lt;/code&gt; generates the blanking and synchronization signals for 4K@60Hz.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Encoder&lt;/strong&gt;: The signal flows to the DDI port (Port B), where the pixels are packetized into DisplayPort Micro-packets.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Connector&lt;/strong&gt;: Finally, through the physical Type-C/DP port, the pixels race to the monitor.&lt;/li&gt;
&lt;/ol&gt;
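
&lt;p&gt;The chain in this example can also be written down as a small data structure. The Python sketch below is purely illustrative: the &lt;code&gt;Route&lt;/code&gt; class, the helper names, and the string labels are hypothetical, and the transcoder rule is the simplified one from section 2.2 (eDP may use a dedicated transcoder, other outputs use the one matching the pipe).&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    planes: tuple
    pipe: str
    transcoder: str
    port: str
    connector: str


def pick_transcoder(pipe, output_type):
    # Toy rule from the text: eDP may use a dedicated transcoder,
    # everything else uses the transcoder that matches the pipe.
    return "TRANSCODER_EDP" if output_type == "eDP" else f"TRANSCODER_{pipe}"


def build_route(planes, pipe, port, output_type, connector):
    return Route(
        planes=tuple(planes),
        pipe=f"PIPE_{pipe}",
        transcoder=pick_transcoder(pipe, output_type),
        port=f"PORT_{port}",
        connector=connector,
    )


route = build_route(["primary", "cursor"], "A", "B", "DP", "DP-1")
print(" -> ".join(["+".join(route.planes), route.pipe, route.transcoder,
                   route.port, route.connector]))
```

&lt;p&gt;Swapping the output type to eDP changes only the transcoder field, which reflects how the transcoder decouples the Pipe from the physical port.&lt;/p&gt;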

&lt;h2&gt;
  
  
  5. Glossary: A Guide to Avoiding Pitfalls
&lt;/h2&gt;

&lt;p&gt;When reading the i915 source code, you will encounter the following high-frequency terms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FDI (Flexible Display Interface)&lt;/strong&gt;: The pathway connecting the CPU to the PCH display chip in older architectures (largely obsolete on modern GPUs but commonly found in legacy code).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PCH (Platform Controller Hub)&lt;/strong&gt;: The motherboard southbridge, which previously handled some display outputs; now most of that logic is integrated into the CPU.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watermarks&lt;/strong&gt;: This is an extremely complex topic. A Pipe needs time to read data from memory, and Watermarks determine when the hardware initiates memory requests to prevent the display buffer from "running dry" and causing screen flicker.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;The i915 display driver essentially maps the generic KMS model onto Intel's specific hardware pipeline. &lt;code&gt;intel_crtc&lt;/code&gt; controls the Pipe's blending logic, while &lt;code&gt;intel_encoder&lt;/code&gt; manages the flexible DDI interface. The Transcoder in between serves as the link connecting the two.&lt;/p&gt;

&lt;p&gt;In the next lecture, we will delve into the part that gives display driver engineers the biggest headache: the two-phase commit mechanism of &lt;strong&gt;Atomic Modesetting&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>linux</category>
      <category>systems</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
