Part 1: Concepts — Ushering in the "Golden Age" of GPU Virtualization

#cloudcomputing #infrastructure #performance #systems

Introduction: The GPU Dilemma Amid the Virtualization Wave

In cloud computing, the virtualization of compute, storage, and networking is already mature. GPU virtualization, by contrast, has long remained a hard problem for the industry. Until recently, using GPU acceleration inside a virtual machine (VM) meant choosing between two mainstream approaches, each with its own strengths and limitations:

Discrete Device Assignment (DDA):
- Principle: Assigns an entire physical GPU exclusively to a single virtual machine.
- Advantages: Near-native performance and good compatibility.
- Disadvantages: Resources cannot be shared. One GPU serves exactly one VM, so a VM that only renders a simple UI still occupies the entire device, which wastes capacity and drives up the cost per VM.
API Forwarding:
- Principle: Intercepts OpenGL/DirectX API calls within the virtual machine and forwards them to the host for execution via network or shared memory channels.
- Advantages: Simple to implement and supports 1:N sharing.
- Disadvantages: High performance overhead and high latency, and support is often limited to specific API versions, which makes it unsuitable for high-performance computing or 3D gaming.
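
To make the overhead of API forwarding concrete, here is a minimal Python sketch. It is not any real forwarding stack: `HostRenderer`, `ForwardingStub`, and the `draw_triangle` call are hypothetical names invented for illustration, and JSON stands in for whatever wire format a real implementation would use. The point it shows is that every single API call is serialized and pays one guest-to-host round trip.

```python
import json


class HostRenderer:
    """Hypothetical stand-in for the host-side graphics stack."""

    def __init__(self):
        self.executed = []

    def handle(self, message: bytes) -> bytes:
        # Decode the forwarded call and "execute" it by recording it.
        call = json.loads(message)
        self.executed.append((call["name"], tuple(call["args"])))
        return json.dumps({"ok": True}).encode()


class ForwardingStub:
    """Hypothetical guest-side stub: each API call is serialized and forwarded."""

    def __init__(self, host: HostRenderer):
        self.host = host
        self.round_trips = 0
        self.bytes_sent = 0

    def call(self, name: str, *args) -> bool:
        message = json.dumps({"name": name, "args": list(args)}).encode()
        self.bytes_sent += len(message)
        self.round_trips += 1  # one guest-to-host crossing per API call
        reply = json.loads(self.host.handle(message))
        return reply["ok"]


host = HostRenderer()
gpu = ForwardingStub(host)
for i in range(3):
    gpu.call("draw_triangle", i, i + 1, i + 2)

print(gpu.round_trips, "round trips,", gpu.bytes_sent, "bytes marshalled")
```

A real frame can issue thousands of such calls, so the per-call marshalling and crossing cost in this sketch is what grows into the latency problem described above.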

With the rise of artificial intelligence, virtual desktop infrastructure (VDI), and cloud gaming, the market needs a GPU virtualization solution that delivers both high performance and high-density sharing. It is against this backdrop that Microsoft introduced GPU-P (GPU Partitioning).

What is GPU-P: SR-IOV-Based Hardware Partitioning Technology

GPU-P (short for GPU Partitioning) is a GPU hardware partitioning technology developed by Microsoft on top of the industry-standard SR-IOV (Single Root I/O Virtualization). The closely related software model is described in WDDM documentation as GPU paravirtualization (GPU-PV), and the two names often appear together.

Core Principles

Unlike traditional "full virtualization," GPU-P employs a paravirtualization design:

Hardware Level: Utilizes SR-IOV-capable GPU hardware to partition the physical device into multiple "Virtual Functions (VFs)." Each VF possesses its own independent hardware context, command queue, and memory space.
Software Level: Retains the full-featured driver (KMD) on the host side, while the virtual machine (guest) side runs a streamlined user-mode driver (UMD) specifically adapted for the virtualization environment.
Communication Mechanism: Uses Hyper-V's VMBus as a high-speed communication bridge to enable efficient collaboration between the Guest UMD and the Host KMD.
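
The split described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the WDDM or VMBus interface: `Channel`, `GuestUMD`, `HostKMD`, and `CommandBuffer` are hypothetical names, and an in-process queue stands in for the guest/host channel. What it illustrates is that the guest records commands locally and crosses the VM boundary once per submission, while the host keeps a separate queue per VM.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class CommandBuffer:
    vm_id: str
    commands: list = field(default_factory=list)


class Channel:
    """Hypothetical stand-in for a VMBus-like guest/host channel."""

    def __init__(self):
        self._queue = deque()

    def send(self, item):
        self._queue.append(item)

    def receive(self):
        return self._queue.popleft() if self._queue else None


class GuestUMD:
    """Hypothetical guest user-mode driver: records locally, submits in batches."""

    def __init__(self, vm_id: str, channel: Channel):
        self.vm_id = vm_id
        self.channel = channel
        self._pending = CommandBuffer(vm_id)

    def record(self, command: str):
        self._pending.commands.append(command)  # stays inside the guest

    def submit(self):
        self.channel.send(self._pending)  # a single crossing for the whole batch
        self._pending = CommandBuffer(self.vm_id)


class HostKMD:
    """Hypothetical host kernel-mode driver: routes batches to per-VM queues."""

    def __init__(self, channel: Channel):
        self.channel = channel
        self.vf_queues = {}

    def pump(self) -> int:
        processed = 0
        while (buf := self.channel.receive()) is not None:
            self.vf_queues.setdefault(buf.vm_id, []).extend(buf.commands)
            processed += 1
        return processed


channel = Channel()
umd = GuestUMD("vm-a", channel)
host_kmd = HostKMD(channel)

for cmd in ("clear", "draw", "present"):
    umd.record(cmd)
umd.submit()

submissions = host_kmd.pump()
print(submissions, host_kmd.vf_queues)
```

Three commands reach the host in one submission here, which is the structural difference from forwarding every API call individually.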

Advantages of GPU-P

Hardware-Level Isolation: Leveraging SR-IOV and IOMMU technologies, GPU tasks from different virtual machines are isolated from one another, so one VM's workload cannot interfere with another's.
Near-Native Performance: Critical rendering paths interact directly with the hardware VF, greatly reducing the overhead introduced by software emulation.
Flexible Resource Scheduling: Supports dynamically partitioning a single physical GPU into multiple instances, so that GPU capacity can be matched to what each VM actually needs.
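
As a rough illustration of partition bookkeeping, the sketch below models one physical GPU whose memory and compute time are handed out to VMs. It is an assumption-laden toy: `PhysicalGpu`, `Partition`, and the VRAM and percentage figures are hypothetical and do not correspond to any Hyper-V or driver API. It only shows the invariant a partitioning scheme has to maintain, namely that assigned resources never exceed what the device has.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    vm_id: str
    vram_mb: int
    compute_percent: int  # share of GPU time, in whole percent


class PhysicalGpu:
    """Hypothetical model of one GPU split into a bounded number of partitions."""

    def __init__(self, total_vram_mb: int, max_partitions: int):
        self.total_vram_mb = total_vram_mb
        self.max_partitions = max_partitions
        self.partitions = {}

    @property
    def free_vram_mb(self) -> int:
        return self.total_vram_mb - sum(p.vram_mb for p in self.partitions.values())

    @property
    def free_compute_percent(self) -> int:
        return 100 - sum(p.compute_percent for p in self.partitions.values())

    def assign(self, vm_id: str, vram_mb: int, compute_percent: int) -> Partition:
        if vm_id in self.partitions:
            raise ValueError(f"{vm_id} already has a partition")
        if len(self.partitions) >= self.max_partitions:
            raise ValueError("no free partition slots")
        if vram_mb > self.free_vram_mb:
            raise ValueError("not enough free VRAM")
        if compute_percent > self.free_compute_percent:
            raise ValueError("not enough free compute share")
        partition = Partition(vm_id, vram_mb, compute_percent)
        self.partitions[vm_id] = partition
        return partition

    def release(self, vm_id: str) -> None:
        self.partitions.pop(vm_id)


gpu = PhysicalGpu(total_vram_mb=16384, max_partitions=4)
gpu.assign("vm-a", 4096, 25)
gpu.assign("vm-b", 8192, 50)

rejected = None
try:
    gpu.assign("vm-c", 8192, 25)  # only 4096 MB is free at this point
except ValueError as err:
    rejected = str(err)

gpu.release("vm-a")           # shut down vm-a and reclaim its share
gpu.assign("vm-c", 8192, 25)  # the same request now fits
```

The second attempt for `vm-c` succeeds only because `vm-a` released its share first, which is the "dynamic" part of the scheduling claim in miniature.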

Application Scenarios: From the Lab to the Cloud

GPU-P is not a laboratory "toy"; it already ships in several parts of the Windows ecosystem:

Windows Sandbox: When you start a sandbox to test suspicious software, the smooth UI you see is hardware-accelerated through GPU-P, and you do not have to configure any drivers manually.
WSL2 (Windows Subsystem for Linux): When developers train neural networks (for example with PyTorch or TensorFlow) in the Linux subsystem, they can use the GPU compute power of the Windows host directly, thanks to the D3D12/CUDA mapping support that GPU-P provides.
Azure NV-Series Virtual Machines: On the public cloud, GPU-P allows Azure to offer cost-effective GPU instances to different users, supporting AI inference, rendering, and scientific computing.
Cloud Gaming and VDI: Through high-density GPU partitioning, a single server can serve dozens of users at once, each with a 1080p/60 FPS gaming session or a 3D design desktop.

WDDM Evolution: Embarking on the Long March of GPU Virtualization

GPU-P did not mature overnight; it has grown stronger with each iteration of the Windows Display Driver Model (WDDM):

WDDM 2.4 (Windows 10 1803):
- Milestone Significance: Formally introduced the GPU-PV architecture.
- Core Functionality: Supported basic rendering capability partitioning and introduced the IOMMU isolation mechanism.
WDDM 2.5 - 2.9:
- Continuous optimization. Introduced mechanisms such as "driver-known Escape calls," enhancing the security of cross-process/cross-VM communication.
WDDM 3.2 (Windows 11 24H2):
- Live Migration: Often called the "holy grail" of GPU virtualization. WDDM 3.2 introduced technologies such as Dirty Bit Tracking, which allow virtual machines running GPU workloads to move between physical hosts with minimal interruption to the running workload.
- LDA (Linked Display Adapter) Support: Supports more complex partitioning strategies in multi-GPU environments.
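
To show why dirty tracking matters for live migration, here is a small Python sketch of a generic pre-copy loop. It is not the WDDM 3.2 driver interface: `TrackedMemory`, `migrate`, `shrinking_workload`, and the page counts and thresholds are all hypothetical. The idea it demonstrates is that memory is copied while the workload keeps running, only pages written since the last pass are re-sent, and the VM is paused only for the small remainder.

```python
class TrackedMemory:
    """Hypothetical paged memory that records which pages were written."""

    def __init__(self, num_pages: int):
        self.pages = [0] * num_pages
        self.dirty = set(range(num_pages))  # every page must be copied at least once

    def write(self, page: int, value: int):
        self.pages[page] = value
        self.dirty.add(page)

    def take_dirty(self) -> set:
        # Return the current dirty set and start tracking a fresh one.
        dirty, self.dirty = self.dirty, set()
        return dirty


def migrate(source: TrackedMemory, workload, stop_threshold=2, max_rounds=10):
    """Pre-copy: iterate while the workload runs, then pause for the remainder."""
    target = [0] * len(source.pages)
    rounds = 0
    for _ in range(max_rounds):
        for page in source.take_dirty():
            target[page] = source.pages[page]
        rounds += 1
        workload(source, rounds)  # the guest keeps writing during the copy
        if len(source.dirty) <= stop_threshold:
            break
    # Workload is paused here; only the last few dirty pages remain to copy.
    final_dirty = source.take_dirty()
    for page in final_dirty:
        target[page] = source.pages[page]
    return target, rounds, len(final_dirty)


def shrinking_workload(memory: TrackedMemory, round_number: int):
    # Toy workload that touches fewer pages each round: 4, then 2, then 1...
    for page in range(8 // (2 ** round_number)):
        memory.write(page, round_number)


source = TrackedMemory(num_pages=16)
target, rounds, paused_pages = migrate(source, shrinking_workload)
print(rounds, "live rounds,", paused_pages, "pages copied while paused")
```

In this run 16 pages are copied live across two rounds and only 2 are copied during the pause, which is why the pause can be short even when total GPU memory is large.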

Conclusion

The emergence of GPU-P marks the transition of Windows GPU virtualization from "barely functional" to "truly practical" and "industrialized." It addresses the pain point of resource sharing while striking a workable balance between security and performance.

In the following chapters, we will delve into its kernel logic and explore how the host and virtual machine execute their precise "pas de deux."