Deleon Karen

Posted on Jun 2

Part 8: Synchronization Mechanisms: Requests and dma_fence

#architecture #linux #systems #tutorial

In previous chapters, we learned how GPU engines execute commands and how drivers feed commands to the hardware through different submission methods (Ringbuffer, Execlists, GuC). However, GPU execution is highly asynchronous. When you call the execbuffer IOCTL, the driver queues the work and returns to the caller, usually long before the GPU has executed it.

How do you know when a task has completed? How do you ensure that "Task B" starts only after "Task A" finishes? How do you get content rendered by an NVIDIA discrete GPU to display on an Intel integrated GPU (cross-driver synchronization)? This is the core subject of this part: the art of synchronization in i915.

1. The Asynchronous Nature of GPU Execution

In the Linux graphics stack, the CPU and GPU are decoupled parallel worlds:

CPU: Responsible for logic control, building command sequences, and managing memory allocation.
GPU: Responsible for heavy parallel computation.

To avoid having the CPU wait idly for the GPU to finish (which would severely degrade system responsiveness), all submission operations follow the "Fire and Forget" principle. The driver needs a mechanism to track these "in-flight" tasks. This mechanism is known in i915 as the i915_request.

2. i915_request: The Container for Asynchronous Tasks

The i915_request (often abbreviated as rq in code) is the most important tracking unit in i915. Every task sequence submitted to the hardware is wrapped in an rq.

2.1 Lifecycle

Create: Calls i915_request_create(). A seqno (sequence number) is allocated for the request at this point; it is unique within the request's timeline rather than across the whole driver.
Await: If the current task depends on other tasks (e.g., needing to wait for a previous rendering task to finish before beginning a display write), the driver calls i915_request_await_dma_fence().
Emit: The driver writes the actual operational instructions into the Ringbuffer and appends a special "Breadcrumb" instruction at the end.
Add: Calls i915_request_add(). The task is formally handed over to the scheduler and waits for hardware execution.
Signal: When the hardware executes the "Breadcrumb" instruction, it writes its seqno to a specific location in memory (HWSP) and triggers a hardware interrupt.
Retire: The driver processes the interrupt, discovers the task is complete, wakes up waiting processes, and releases related resources.
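The lifecycle above can be sketched as a small user-space model. This is a toy, not driver code: every `toy_` name is hypothetical, the "GPU" is just a function that writes to a variable standing in for the HWSP slot, and the Await and Emit steps are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model only: the toy_ names are hypothetical, not i915 structures. */
enum toy_state { TOY_CREATED, TOY_SUBMITTED, TOY_RETIRED };

struct toy_timeline {
    uint32_t last_seqno; /* last seqno handed out on this timeline */
    uint32_t hwsp;       /* stand-in for the seqno slot in the HWSP */
};

struct toy_request {
    uint32_t seqno;
    enum toy_state state;
};

/* Create: allocate the next seqno on this timeline. */
static void toy_request_create(struct toy_timeline *tl, struct toy_request *rq)
{
    rq->seqno = ++tl->last_seqno;
    rq->state = TOY_CREATED;
}

/* Add: hand the request over to the (imaginary) scheduler. */
static void toy_request_add(struct toy_request *rq)
{
    assert(rq->state == TOY_CREATED);
    rq->state = TOY_SUBMITTED;
}

/* Signal: what the breadcrumb does on the GPU side, i.e. publish the seqno. */
static void toy_gpu_breadcrumb(struct toy_timeline *tl,
                               const struct toy_request *rq)
{
    tl->hwsp = rq->seqno;
}

/* Retire: scan submitted requests and retire those whose seqno has been
 * reached. Returns the number of requests retired by this call. */
static int toy_retire(struct toy_timeline *tl, struct toy_request *rqs, int n)
{
    int retired = 0;

    for (int i = 0; i < n; i++) {
        if (rqs[i].state != TOY_SUBMITTED)
            continue;
        /* Signed difference so the test survives 32-bit wraparound. */
        if ((int32_t)(tl->hwsp - rqs[i].seqno) >= 0) {
            rqs[i].state = TOY_RETIRED;
            retired++;
        }
    }
    return retired;
}
```

Creating two requests, adding both, and then calling `toy_gpu_breadcrumb()` for the first one makes the next `toy_retire()` call retire exactly one request; the second stays in flight until its own breadcrumb lands.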

2.2 The Breadcrumbs Mechanism

How does the hardware tell the software how far it has progressed? i915 uses the "Breadcrumbs" technique:

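The driver appends a store instruction at the end of the Ringbuffer task, for example MI_STORE_DWORD_IMM. The exact instruction depends on the engine and hardware generation; a flush with a post-sync write can serve the same purpose.
This instruction causes the GPU to write the current request's seqno into the Hardware Status Page (HWSP).
The software simply needs to monitor the HWSP value. Conceptually, HWSP_value >= rq->seqno indicates the task has completed, although the comparison has to be done in a wraparound-safe way because the seqno is a 32-bit counter.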
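A plain unsigned comparison gives the wrong answer once the 32-bit seqno counter wraps around. The sketch below shows the usual fix, a signed difference, modeled on the kernel's i915_seqno_passed() helper. `request_completed()` and the `hwsp_slot` pointer are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if seqno `a` is at or after seqno `b`, tolerating 32-bit wraparound.
 * Valid as long as the two values are less than 2^31 apart. */
static bool seqno_passed(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) >= 0;
}

/* Hypothetical helper: has the request with `rq_seqno` completed, given a
 * pointer to the memory location the breadcrumb writes to? */
static bool request_completed(const volatile uint32_t *hwsp_slot,
                              uint32_t rq_seqno)
{
    return seqno_passed(*hwsp_slot, rq_seqno);
}
```

With the slot holding 2 just after a wrap, `request_completed(&slot, 0xfffffffe)` is true, whereas the naive test `2 >= 0xfffffffe` would report the request as still pending.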

3. The Bridge Across Boundaries: dma_fence

While the i915_request is powerful, it is private to the i915 driver. To achieve cross-driver synchronization (e.g., using an Intel GPU for rendering and then handing off to a DisplayLink USB graphics card for display), a common synchronization language in the Linux kernel is needed: dma_fence.

3.1 What is a dma_fence?

dma_fence is a generic synchronization object defined by the kernel. It has only two states, Pending and Signaled, and the transition is one-way: once signaled, a fence never returns to pending.
In i915_request.h, you can see that the first member of struct i915_request is struct dma_fence fence. This means every i915 request is inherently also a dma_fence.
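The embedding trick can be illustrated in plain C. The structures below are heavily simplified stand-ins with hypothetical `mini_` names and fields; only the layout idea (the fence as first member, recovered with a container_of-style macro) mirrors what the text describes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins; the real structures carry many more fields. */
struct mini_fence {
    uint64_t context;
    uint64_t seqno;
    bool signaled;
};

struct mini_request {
    struct mini_fence fence; /* first member, as described for i915_request */
    int engine_id;           /* hypothetical driver-private data */
};

/* User-space equivalent of the kernel's container_of() idea. */
#define mini_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Upcast: any code that understands fences can be handed a request. */
static struct mini_fence *to_fence(struct mini_request *rq)
{
    return &rq->fence;
}

/* Downcast: only valid if the fence really lives inside a mini_request. */
static struct mini_request *to_request(struct mini_fence *f)
{
    return mini_container_of(f, struct mini_request, fence);
}
```

Generic code only ever sees the `mini_fence` pointer, while the owning driver can get back to its private request. In a real driver the downcast is only safe after confirming that the fence was created by that driver.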

3.2 Cross-Process and Cross-Driver Synchronization

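Sync File (Android Synchronization Model): Passes synchronization signals between processes via a file descriptor (FD). Setting the I915_EXEC_FENCE_OUT flag on execbuffer returns such an FD for the submitted work, and I915_EXEC_FENCE_IN accepts one as a dependency.
DRM Syncobj (Vulkan/Modern Approach): A more flexible synchronization container. A syncobj is a handle wrapping a fence that can be replaced over time, and in its timeline form it tracks a series of fences at increasing points. It is commonly used to implement Vulkan semaphores, including Timeline Semaphores.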

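When a user-space application passes a sync_file FD to another process, the receiver can hand it back to the kernel as an input fence (for example with I915_EXEC_FENCE_IN) or wait on it from the CPU with poll(). In the first case the kernel resolves the underlying dma_fence dependency itself, so neither process has to block on the CPU.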
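For the CPU-side case, a sync_file FD can be waited on like any other pollable descriptor: poll() reports POLLIN once the fence has signaled. The helper below is a minimal sketch; `wait_fence_fd()` is a hypothetical name, and nothing in it is specific to i915.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Wait until `fd` reports POLLIN or the timeout expires.
 * Returns 1 if signaled, 0 on timeout, -1 on error.
 * Note: an interrupted wait is simply restarted with the full timeout.
 */
static int wait_fence_fd(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ret;

    do {
        ret = poll(&pfd, 1, timeout_ms);
    } while (ret < 0 && errno == EINTR);

    if (ret < 0)
        return -1;  /* poll() itself failed */
    if (ret == 0)
        return 0;   /* timed out, fence still pending */
    return (pfd.revents & POLLIN) ? 1 : -1;
}
```

Because the helper only relies on poll(), it can be exercised without a GPU by passing the read end of a pipe: it returns 0 while the pipe is empty and 1 once a byte has been written.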

4. Internal Dependency Management: sw_fence

In addition to the generic dma_fence, i915 internally uses a more lightweight i915_sw_fence. It is primarily used for internal driver pipeline orchestration.
For example, when executing execbuffer, the driver might first need to wait for memory page-in. This internal wait is implemented via a sw_fence, ensuring that hardware requests are never dispatched to the scheduler before all software prerequisites are met.
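The gating idea can be sketched as a counter that fires a callback when the last prerequisite completes. This is a toy inspired by the description above, not the i915_sw_fence API: all `gate_` names are hypothetical, and the extra "setup" reference is an assumption made so the callback cannot fire while dependencies are still being registered.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy gate: not the kernel API, all names here are hypothetical. */
struct gate {
    atomic_int pending;         /* outstanding prerequisites + 1 for setup */
    void (*notify)(void *data); /* runs once, when pending drops to zero */
    void *data;
};

static void gate_init(struct gate *g, void (*notify)(void *), void *data)
{
    atomic_init(&g->pending, 1); /* setup reference, dropped by gate_commit() */
    g->notify = notify;
    g->data = data;
}

/* Register one more prerequisite (e.g. "pages must be resident"). */
static void gate_add_dependency(struct gate *g)
{
    atomic_fetch_add(&g->pending, 1);
}

/* One prerequisite has finished; the last one to finish fires notify. */
static void gate_complete(struct gate *g)
{
    if (atomic_fetch_sub(&g->pending, 1) == 1)
        g->notify(g->data);
}

/* All dependencies are registered: drop the setup reference. */
static void gate_commit(struct gate *g)
{
    gate_complete(g);
}

/* Example callback: mark the request as handed to the scheduler. */
static void mark_submitted(void *data)
{
    *(bool *)data = true;
}
```

With two dependencies registered, the `submitted` flag stays false after `gate_commit()` and after the first `gate_complete()`, and flips to true only on the second, which matches the guarantee that nothing reaches the scheduler before every software prerequisite is met.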

Summary

Synchronization is where bugs most easily arise in a driver, and it is also a key point for performance optimization. i915 encapsulates asynchronous tasks through i915_request, communicates with hardware using the "Breadcrumbs" instruction, and achieves seamless integration with the Linux ecosystem via the standard dma_fence interface.
