Part 4: Command Stream and Rendering Pipeline (Command Submission)

#architecture #microsoft #performance #systems

Once video memory resources are in place, the next step is to drive the GPU by feeding it a command stream. Knowing the path a command takes from the UMD down to the hardware is key to debugging performance bottlenecks and screen tearing. Across successive WDDM versions, command submission has moved from a "kernel-orchestrated" model toward "hardware-direct scheduling."

1. Evolution of the Submission Model: From Patching to Doorbell

To understand the current mechanism, you must know the historical pain points it has solved:

WDDM 1.x (Patching Era): The UMD submitted Command Buffers. Because there was no per-process GPU virtual address space, the KMD translated each Command Buffer into a DMA Buffer in DxgkDdiRender and recorded a patch-location list; before the buffer ran, DxgkDdiPatch rewrote every referenced resource handle into the physical address the resource occupied at that moment. Pain Point: Extremely high CPU overhead on every submission.
WDDM 2.x (GPUVA Era): GPU Virtual Addressing was introduced. The UMD writes stable GPU virtual addresses directly into its instructions, so the KMD no longer needs to patch and only performs basic validation.
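WDDM 2.7 to 3.x (Hardware Scheduling & User-Mode Submission): Further delegation of control. Hardware-accelerated GPU scheduling arrived with WDDM 2.7, and user-mode work submission followed with WDDM 3.2. Under user-mode submission, the UMD writes commands directly into a Ring Buffer mapped to user mode and then "rings" the hardware "Doorbell" to notify the GPU, almost completely bypassing the KMD on the per-submission path.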
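
To make the difference between the patching and GPUVA models concrete, here is a minimal, self-contained sketch. It is a toy model, not driver code: Command, PatchLocation, PatchDmaBuffer and Translate are hypothetical names invented for illustration, and the 4 KiB page size is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical command word: an opcode plus one address operand.
struct Command {
    uint32_t opcode;
    uint64_t operand;  // patching model: resource handle; GPUVA model: GPU virtual address
};

// Hypothetical patch record: which command's operand must be rewritten.
struct PatchLocation {
    size_t   commandIndex;
    uint32_t allocationHandle;
};

// Patching model: the kernel side rewrites every referenced operand with the
// physical address that currently backs the resource.
// Returns the number of operands touched, a rough proxy for the CPU cost.
size_t PatchDmaBuffer(std::vector<Command>& dma,
                      const std::vector<PatchLocation>& patches,
                      const std::unordered_map<uint32_t, uint64_t>& physicalAddressOf) {
    size_t patched = 0;
    for (const PatchLocation& p : patches) {
        assert(p.commandIndex < dma.size());
        dma[p.commandIndex].operand = physicalAddressOf.at(p.allocationHandle);
        ++patched;
    }
    return patched;
}

// GPUVA model: the command stream is never rewritten. The address is resolved
// at access time through a page table (virtual page number -> physical page number).
uint64_t Translate(const std::unordered_map<uint64_t, uint64_t>& pageTable, uint64_t gpuVa) {
    constexpr uint64_t kPageSize = 4096;  // assumed page size for this sketch
    return pageTable.at(gpuVa / kPageSize) * kPageSize + gpuVa % kPageSize;
}
```

In the patching model the cost grows with the number of resource references in every submission. In the GPUVA model, moving a resource only changes a page table entry, and the commands the UMD already wrote stay valid.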

2. Classic DDI Flow: Packet-Based Scheduling

For environments that do not support hardware scheduling or User-Mode Submission, the classic submission flow is as follows:

UMD Generation (Command Buffer): The Command Buffer lives in the process's private address space. Its format is entirely vendor-defined, and the OS does not interpret its content.
KMD Translation (DxgkDdiRender): Validates the commands for safety and translates them into a DMA Buffer, the final instruction stream that the hardware reads directly.
OS Scheduling (VidSch): Ensures that all referenced allocations are resident and all dependencies are satisfied, then calls DxgkDdiSubmitCommand to hand the DMA Buffer over to the hardware.
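
The three stages above can be sketched as a small pipeline. This is a simplified model under stated assumptions, not the real DDI: every type and function name here is hypothetical, the "forbidden word" check merely stands in for vendor-specific validation, and a real scheduler would page missing memory in rather than just report it.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Stage 1 output: opaque, vendor-defined words plus the resources they reference.
struct CommandBuffer {
    std::vector<uint32_t> words;
    std::vector<uint32_t> allocations;
};

// Stage 2 output: the stream the hardware would read directly.
struct DmaBuffer {
    std::vector<uint32_t> words;
    std::vector<uint32_t> allocations;
};

enum class SubmitResult { Submitted, Rejected, NotResident };

// Hypothetical encoding of a command that user mode must never issue.
constexpr uint32_t kForbiddenWord = 0xDEAD0000u;

// Stage 2 (models the validate-and-translate step): reject unsafe input,
// otherwise emit a DMA buffer.
bool TranslateToDma(const CommandBuffer& in, DmaBuffer& out) {
    for (uint32_t w : in.words) {
        if (w == kForbiddenWord) return false;
    }
    out.words = in.words;
    out.allocations = in.allocations;
    return true;
}

// Stage 3 precondition (models the scheduler's residency check).
bool AllResident(const DmaBuffer& dma, const std::set<uint32_t>& resident) {
    for (uint32_t a : dma.allocations) {
        if (resident.count(a) == 0) return false;
    }
    return true;
}

// Full classic path: translate, check residency, then hand over to the hardware queue.
SubmitResult SubmitClassic(const CommandBuffer& cb,
                           const std::set<uint32_t>& resident,
                           std::vector<DmaBuffer>& hardwareQueue) {
    DmaBuffer dma;
    if (!TranslateToDma(cb, dma)) return SubmitResult::Rejected;
    if (!AllResident(dma, resident)) return SubmitResult::NotResident;
    hardwareQueue.push_back(dma);
    return SubmitResult::Submitted;
}
```

Note that every submission passes through all three stages, which is exactly the per-submission kernel cost the newer models try to remove.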

3. Modern Approach: User-Mode Work Submission

In modern games built on low-overhead APIs such as DX12 and Vulkan, submissions are so frequent that the cost of a kernel transition on each one becomes unacceptable. To address this, WDDM introduced a Doorbell-based submission mechanism.

Doorbell Mechanism: The hardware exposes a special MMIO (Memory-Mapped I/O) region that is mapped into the UMD's address space. Once the UMD has written its instructions into the ring buffer, it writes a value to this MMIO address (rings the bell) to tell the GPU that new work is available.
The KMD's New Role: The KMD steps back and no longer processes a DxgkDdiSubmitCommand call for every submission. It helps set up hardware queues via DxgkDdiCreateHwQueue, after which the UMD operates independently. The KMD only intervenes in exceptional cases, such as a TDR (Timeout Detection and Recovery) or a page fault.
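
The write-then-ring sequence can be illustrated with a small single-process simulation. This is only a sketch under assumptions: HwQueue, SubmitFromUserMode and GpuDrain are hypothetical names, a std::atomic stands in for the MMIO doorbell register, and the "GPU" is an ordinary function called on the same thread.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr uint32_t kRingSize = 8;  // entries; a power of two so index wraparound stays consistent

struct HwQueue {
    uint32_t ring[kRingSize] = {};
    uint32_t writeIndex = 0;             // producer side, monotonically increasing
    uint32_t readIndex = 0;              // advanced by the simulated GPU
    std::atomic<uint32_t> doorbell{0};   // stand-in for the MMIO doorbell register
};

// User-mode side: copy the commands into the ring, then ring the doorbell once.
// Returns false if the ring does not have room for the whole batch.
bool SubmitFromUserMode(HwQueue& q, const uint32_t* cmds, uint32_t count) {
    const uint32_t used = q.writeIndex - q.readIndex;
    if (count > kRingSize - used) return false;
    for (uint32_t i = 0; i < count; ++i) {
        q.ring[(q.writeIndex + i) % kRingSize] = cmds[i];
    }
    q.writeIndex += count;
    // The release store publishes the ring contents before the new index becomes visible.
    q.doorbell.store(q.writeIndex, std::memory_order_release);
    return true;
}

// Simulated GPU: consume everything published by the last doorbell write.
// Returns the number of commands copied into out.
uint32_t GpuDrain(HwQueue& q, uint32_t* out, uint32_t maxOut) {
    const uint32_t published = q.doorbell.load(std::memory_order_acquire);
    uint32_t n = 0;
    while (q.readIndex != published && n < maxOut) {
        out[n++] = q.ring[q.readIndex % kRingSize];
        ++q.readIndex;
    }
    return n;
}
```

The point of the design is that the submit path contains no system call: the only kernel involvement in this model would be the one-time creation and mapping of the queue.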

4. Core of Synchronization: Evolution of the GPU Fence

Commands are executed asynchronously, so the CPU must know how far the GPU has progressed. WDDM's Fence mechanism has also gone through three generations:

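Traditional Fence: A simple monotonically increasing integer (the 32-bit submission fence ID in the classic DDIs) that the GPU reports back as each submission completes.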
Monitored Fence: A 64-bit value that lets the CPU block on an Event until the fence reaches a specific value; when the GPU signals that value, it raises an interrupt to wake the CPU.
Native GPU Fence (WDDM 3.2+): This is key for modern multi-GPU and heterogeneous computing. Not only can the CPU wait on the GPU, but one GPU engine can also wait natively on another. Different engines (like a Copy engine and a Render engine) synchronize directly through the same Native Fence memory location: one engine writes the signaled value and the other waits on it. This removes the CPU interrupt round trip from the engine-to-engine path and greatly reduces latency.
- Development Guidance Point: The driver needs to report DXGKQAITYPE_NATIVE_FENCE_CAPS support in DxgkDdiQueryAdapterInfo.
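
The engine-to-engine case can be sketched as two queues that share one fence value. This is a conceptual model only: Fence, Op, Engine and RunUntilBlocked are hypothetical names, and real hardware evaluates the wait itself rather than being stepped by a loop on the CPU.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// A fence is just a monotonically increasing value in shared memory.
struct Fence {
    uint64_t currentValue = 0;
};

enum class OpKind { Work, Signal, Wait };

struct Op {
    OpKind   kind;
    Fence*   fence;  // unused for Work
    uint64_t value;  // unused for Work
};

struct Engine {
    std::deque<Op> queue;
    unsigned workDone = 0;
};

// Run an engine until its queue is empty or it reaches an unsatisfied wait.
// Returns true if at least one operation retired.
bool RunUntilBlocked(Engine& e) {
    bool progressed = false;
    while (!e.queue.empty()) {
        const Op op = e.queue.front();
        if (op.kind == OpKind::Wait) {
            assert(op.fence != nullptr);
            if (op.fence->currentValue < op.value) break;  // stall here; no CPU involved
        } else if (op.kind == OpKind::Signal) {
            assert(op.fence != nullptr);
            // Fence values only ever move forward.
            if (op.value > op.fence->currentValue) op.fence->currentValue = op.value;
        } else {
            ++e.workDone;
        }
        e.queue.pop_front();
        progressed = true;
    }
    return progressed;
}
```

For example, a Render engine whose queue begins with a wait for value 1 makes no progress until a Copy engine retires its signal for value 1; after that, the Render engine proceeds without any CPU-side wakeup in the model.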

5. Presentation and Display: DxgkDdiPresent and Flip Queue

After rendering is complete, how is the image displayed?

BitBlt Mode: Copies the contents of the back buffer to the front buffer. Highly flexible, but the extra copy per frame costs some performance.
Flip Mode (Recommended): Simply changes the physical address that the Display Controller scans out from, performing a page flip with no copy.
Hardware Flip Queue (WDDM 3.0+): Traditional Flips are submitted by the CPU, one at a time, in response to a VSync interrupt. The Hardware Flip Queue allows the driver to queue multiple flip instructions into the display hardware at once, and the hardware completes the flips independently based on their target timestamps. This is crucial for supporting high refresh rate displays and Variable Refresh Rate (VRR).
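
A timestamp-driven flip queue can be sketched as follows. This is a deliberately simplified model, assuming strict FIFO order and at most one flip per VSync; QueuedFlip, FlipQueue and OnVSync are hypothetical names, and the time unit is arbitrary.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// One queued flip: which buffer to scan out, and the earliest time it may appear.
struct QueuedFlip {
    uint64_t bufferAddress;
    uint64_t targetTime;
};

struct FlipQueue {
    std::deque<QueuedFlip> pending;   // filled in advance by the driver
    uint64_t scanoutAddress = 0;      // what the display controller currently scans out
};

// Simulated display hardware at a VSync: if the oldest queued flip is due,
// switch the scanout address to it. No CPU decision is made at this point.
// Returns true if a flip was performed.
bool OnVSync(FlipQueue& q, uint64_t now) {
    if (q.pending.empty() || q.pending.front().targetTime > now) return false;
    q.scanoutAddress = q.pending.front().bufferAddress;
    q.pending.pop_front();
    return true;
}
```

Because the frames and their target times are already in the queue, the CPU does not need to wake up at each VSync just to program the next flip.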

6. Privileged Backdoor: DxgkDdiBuildPagingBuffer

This is one of the most distinctive aspects of WDDM development. When the OS needs to move video memory (Eviction, Promotion) or update page tables, it does not go through the normal Render flow.

Responsibility: The OS provides abstract page table operations or copy requests, and the KMD translates them into privileged DMA instructions for the hardware.
Development Guidance Point: DxgkDdiBuildPagingBuffer is highly trusted; the instructions it emits directly affect the system's memory mapping, so a bug here can corrupt memory well beyond a single process. On systems supporting DMA Remapping, you need to handle DXGK_OPERATION_MAP_APERTURE_SEGMENT2, which describes pages with an ADL (Address Descriptor List) instead of the old MDL (Memory Descriptor List).
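
The translation step can be sketched as a dispatch from an abstract request to a privileged command. Everything in this block is an illustrative assumption: PagingOp, PagingRequest, the opcode values and the four-word command layout are invented for the example and do not reflect any real hardware or the actual DDI structures.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Abstract operations the OS might request (hypothetical subset).
enum class PagingOp { Transfer, Fill, MapApertureSegment };

struct PagingRequest {
    PagingOp op;
    uint64_t source;       // Transfer: source address; Fill: pattern; Map: first system page
    uint64_t destination;  // target address or aperture offset
    uint64_t sizeInBytes;
};

// Hypothetical privileged opcodes understood by the simulated hardware.
constexpr uint64_t kOpCopy = 0x10, kOpFill = 0x11, kOpWritePte = 0x12;
constexpr size_t kWordsPerCommand = 4;

// Translate one abstract request into one privileged command appended to dma.
// Returns false if the paging buffer has no room left; in that case nothing is written.
bool BuildPagingCommands(const PagingRequest& req,
                         std::vector<uint64_t>& dma,
                         size_t capacityWords) {
    assert(req.sizeInBytes > 0);
    if (dma.size() + kWordsPerCommand > capacityWords) return false;
    uint64_t opcode = 0;
    switch (req.op) {
        case PagingOp::Transfer:           opcode = kOpCopy;     break;
        case PagingOp::Fill:               opcode = kOpFill;     break;
        case PagingOp::MapApertureSegment: opcode = kOpWritePte; break;
    }
    dma.insert(dma.end(), {opcode, req.source, req.destination, req.sizeInBytes});
    return true;
}
```

Unlike the Render path, the input here comes from the OS rather than from an untrusted process, which is why the emitted commands are allowed to touch page tables and arbitrary memory.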