Deleon Karen

Posted on Jun 2

Part 7: The Evolution of Command Submission: From Ringbuffer to GuC

#architecture #linux #systems #tutorial

In the previous lecture, we explored the basic architecture of the GT (Graphics Technology) and how i915_gem_context preserves GPU execution state much like an operating system process. But with state and rendering commands in place, through what physical pathway do user-mode draw requests actually get fed into the GPU engines for execution?

The Command Submission mechanism is one of the most dramatically evolved and frequently refactored modules in the i915 driver. If you look at the kernel source directory drivers/gpu/drm/i915/gt/, you'll find that three entirely different submission flows coexist within the driver. This evolution is, at its core, a technological revolution centered on "how to free the CPU from heavy scheduling work and allow the GPU to become autonomous."

1. The Ancient Era: Legacy Ringbuffer Submission

On early Intel GPUs (before Gen8/Broadwell), the command submission mechanism was very simple and direct, known as the Ringbuffer mode. This code now resides in gt/intel_ring_submission.c.

1.1 A Simple "Producer-Consumer" Model

Early hardware was quite "dumb." Each hardware engine (such as the Render engine, Blitter engine) had only one global ring memory buffer (Ringbuffer).

CPU (Producer): Writes the memory address of the Batch Buffer (batch processing commands) sent from user mode, along with necessary configuration instructions, sequentially into the Ringbuffer, and updates the TAIL register.
GPU (Consumer): The hardware continuously reads and executes instructions from the position indicated by the HEAD register until HEAD catches up with TAIL.
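
The HEAD/TAIL handshake described above can be sketched as a toy model. This is only an illustration in Python, not driver code: the class name `ToyRing`, the slot-based layout, and the `BATCH_START` command strings are all assumptions made for the example (real rings hold hardware opcodes and are indexed in bytes).

```python
class ToyRing:
    """Toy model of a legacy ring: the CPU advances TAIL, the GPU advances HEAD."""

    def __init__(self, size):
        self.size = size              # number of command slots (hypothetical unit)
        self.slots = [None] * size
        self.head = 0                 # next slot the "GPU" will read
        self.tail = 0                 # next slot the "CPU" will write

    def space(self):
        # One slot is always left empty so that head == tail can only mean "idle".
        return (self.head - self.tail - 1) % self.size

    def emit(self, command):
        """Producer side: write one command, then publish it by moving TAIL."""
        if self.space() == 0:
            raise BufferError("ring full: wait for the GPU to advance HEAD")
        self.slots[self.tail] = command
        self.tail = (self.tail + 1) % self.size

    def execute_pending(self):
        """Consumer side: run commands until HEAD catches up with TAIL."""
        executed = []
        while self.head != self.tail:
            executed.append(self.slots[self.head])
            self.head = (self.head + 1) % self.size
        return executed


ring = ToyRing(size=4)
ring.emit("BATCH_START 0x1000")   # made-up command strings
ring.emit("BATCH_START 0x2000")
done = ring.execute_pending()
print(done, ring.head, ring.tail)
```

Note that the producer can only block or fail when the ring is full; it has no way to reorder what is already queued, which is the root of the problems discussed next.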

1.2 Bottlenecks and Pain Points

The drawback of this model was obvious: a single global queue.
If you had multiple OpenGL applications (different Contexts) running simultaneously, the driver had to carefully insert lengthy instructions for "saving Context A state -> restoring Context B state" into the same Ringbuffer. This led to extremely high context switch latency and made high-priority preemption nearly impossible.
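
To make the cost concrete, here is a small Python sketch of what the driver has to emit when two contexts take turns on one shared ring. The function name and the `SAVE_STATE` / `RESTORE_STATE` / `BATCH_START` strings are placeholders invented for this example, not real hardware commands.

```python
def build_shared_ring(requests):
    """Flatten (context, batch) requests into one global command stream.

    Every time the context changes, save/restore commands have to be
    written into the same stream, ahead of the actual batch.
    """
    stream = []
    current = None
    for ctx, batch in requests:
        if ctx != current:
            if current is not None:
                stream.append(f"SAVE_STATE {current}")
            stream.append(f"RESTORE_STATE {ctx}")
            current = ctx
        stream.append(f"BATCH_START {batch}")
    return stream


# Two applications (contexts A and B) alternating draw calls.
requests = [("A", "a0"), ("B", "b0"), ("A", "a1"), ("B", "b1")]
stream = build_shared_ring(requests)
switch_cmds = [c for c in stream if not c.startswith("BATCH_START")]
print(len(stream), len(switch_cmds))
```

In this toy run, four batches need seven extra state commands, and everything is strictly first-in, first-out: a request appended late cannot overtake work that is already in the ring.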

2. The Mesozoic Era: Execlists (Software-Based Scheduling)

To support virtualization and more efficient multi-tasking concurrency, starting with the Gen8 (Broadwell) architecture, Intel introduced the LRC (Logical Ring Context) and Execlists (Execution Lists) mechanisms. The core code resides in gt/intel_execlists_submission.c.

2.1 Hardware Upgrade: One Ringbuffer Per Context

By this era, the hardware had finally grown smarter. The GPU allowed each independent Context to have its own private Ringbuffer, rather than everyone cramming into a single global queue.

2.2 The Driver Bears the "Scheduler" Burden

Although the hardware now supported multiple Contexts, it did not decide on its own which one to run next. Instead, it exposed a register port called the ELSP (Execlist Submit Port) and ran whatever the driver wrote there.
The i915 driver was forced to become a complex software scheduler:

The driver maintained a red-black tree on the CPU side, sorting all pending i915_request objects by priority.
The driver used the CPU to calculate who should run next, then wrote the descriptors of the 1–2 highest-priority Contexts into the ELSP.
Upon receiving the descriptors, the GPU automatically performed a hardware-level Context switch (much faster than before) and began executing the Ringbuffer corresponding to that Context.
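
The three steps above can be sketched as follows. This is a simplified Python model under stated assumptions: a binary heap from `heapq` stands in for the red-black tree, plain strings stand in for context descriptors, and the class and method names are made up for the example rather than taken from i915.

```python
import heapq
import itertools


class ToyExeclistsScheduler:
    NUM_PORTS = 2  # the ELSP accepts at most two contexts per submission

    def __init__(self):
        self._queue = []                  # heap used in place of the rbtree
        self._order = itertools.count()   # tie-breaker: FIFO within a priority
        self.elsp_writes = []             # log of what was "written" to the ELSP

    def queue_request(self, priority, context):
        # Negate the priority so the highest value is popped first.
        heapq.heappush(self._queue, (-priority, next(self._order), context))

    def dequeue(self):
        """CPU-side decision: pick the next contexts and submit them."""
        ports = []
        while self._queue and len(ports) < self.NUM_PORTS:
            _, _, context = heapq.heappop(self._queue)
            ports.append(context)
        if ports:
            self.elsp_writes.append(tuple(ports))
        return ports


sched = ToyExeclistsScheduler()
sched.queue_request(0, "ctx-browser")
sched.queue_request(10, "ctx-compositor")   # higher priority, queued later
sched.queue_request(0, "ctx-game")

first = sched.dequeue()
second = sched.dequeue()
print(first, second)
```

Every call to `dequeue()` here represents work done on the CPU, which is exactly the cost the next paragraph describes.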

Pain Point: Excessive CPU Overhead. Every time a task completed or a higher-priority task arrived, the GPU would send an interrupt to the CPU. The CPU had to respond immediately, consult the red-black tree again to pick the next Contexts, and write to the ELSP once more. At very high game frame rates, this CPU-side scheduling overhead (Driver Overhead) became intolerable.

3. The Modern Era: GuC Hardware Microcontroller Scheduling

To remove the CPU scheduling bottleneck, Intel moved scheduling onto the GuC (Graphics Microcontroller). The GuC itself has been present in the hardware since Gen9 (Skylake), but i915 only enables GuC submission by default on later Gen12-based platforms (DG1, Alder Lake-P and newer); Tiger Lake still defaults to Execlists. In the newer xe kernel driver, GuC is the only submission backend. The i915 code resides in gt/uc/intel_guc_submission.c.

3.1 What is GuC?

GuC is a small, low-power microcontroller embedded in the GPU itself. It runs proprietary, closed-source firmware provided by Intel, and it takes over the scheduling work that the i915 driver previously did on the CPU.

3.2 True Asynchrony and Autonomy

In the GuC era, the interaction between the i915 driver and the hardware becomes extremely elegant:

Workqueue: The driver and GuC share a block of memory as a communication Workqueue.
Doorbell: When a new draw command arrives from user mode, the i915 driver simply drops the request into the corresponding Context's Ringbuffer, leaves a note in the Workqueue, and then "rings" the GuC's Doorbell register (an extremely lightweight MMIO write operation).
Hands-Off Completely: After ringing the doorbell, the CPU can go off and do other things. The remaining tasks—context selection, priority preemption, and even load balancing between engines—are all computed and assigned in real-time by the GuC's internal firmware.
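
The division of labor described above can be illustrated with a toy model. This Python sketch only mirrors the article's description: `ToyGuc`, `submit`, and the tuple layout of a work item are hypothetical, and the real firmware interface is considerably more involved.

```python
from collections import deque


class ToyGuc:
    """Stand-in for the microcontroller and its firmware."""

    def __init__(self):
        self.workqueue = deque()   # models the memory shared with the driver
        self.doorbell_rings = 0
        self.executed = []

    def ring_doorbell(self):
        # Models the single lightweight register write done by the CPU.
        self.doorbell_rings += 1

    def firmware_run(self):
        """Firmware side: all ordering decisions happen here, not on the CPU."""
        # sorted() is stable, so equal priorities keep their arrival order.
        pending = sorted(self.workqueue, key=lambda item: -item[0])
        self.workqueue.clear()
        for _, context in pending:
            self.executed.append(context)


def submit(guc, priority, context):
    """Driver side: leave a note in the work queue, ring the doorbell, return."""
    guc.workqueue.append((priority, context))
    guc.ring_doorbell()


guc = ToyGuc()
submit(guc, 0, "ctx-browser")
submit(guc, 10, "ctx-compositor")
submit(guc, 0, "ctx-game")

guc.firmware_run()
print(guc.executed, guc.doorbell_rings)
```

The driver-side `submit` contains no sorting or selection logic at all; in this model the CPU cost per request is one queue append and one doorbell write, regardless of how many contexts are competing.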

3.3 A Leap in Performance

With GuC submission:

The number of interrupt requests handled by the CPU is significantly reduced (from tens of thousands per second to a few thousand per second or even lower).
Preemption latency (from initiating preemption to the GPU actually switching) drops to the microsecond level, which is critical for VR and smooth desktop compositing (Wayland/KMS).

Summary

The evolution of command submission reflects a classic truth of computer architecture: offload specialized work to specialized hardware.

Ringbuffer: Simple structure, with the CPU and GPU tightly coupled.
Execlists: Hardware supports multiple Contexts, but the CPU is forced to become a "head steward" with a heavy scheduling burden.
GuC: Introducing a dedicated microcontroller inside the GPU achieves thorough CPU offloading and ultra-low latency.

After understanding how tasks are submitted, you may have a question: Since GPU execution is completely asynchronous (especially in the GuC era, where the CPU rings the doorbell and walks away), how does the CPU know when a task has finished executing? If I want to send this frame to the display, how do I synchronize it?

That is the mystery we will unravel in the next lecture.
