Deleon Karen

Posted on Jun 2

Part 12: The Undying Body: GPU Hang Detection and Reset

#architecture #linux #monitoring #systems

In complex rendering or compute workloads, a GPU can "hang": it may be executing defective shader code, stuck in an infinite loop, or caught in an anomalous hardware state-machine condition. For a mature kernel driver, the ability to detect a hang quickly, capture the "last words" of the crash scene, and resume operation gracefully is a key measure of fault tolerance.

In i915, this life-support mechanism is primarily composed of three modules: Hangcheck, Error Capture, and Reset.

1. Detecting a GPU Hang: The Hangcheck and Heartbeat Mechanism

Early i915 drivers relied on a timer to poll the execution progress (whether the Seqno advanced) of all execution engines. However, in the modern i915 architecture (especially the implementation centered around intel_engine_heartbeat.c), the driver uses a more proactive and precise "Heartbeat" mechanism.
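The older polling approach can be pictured with a small model. This is a minimal Python sketch rather than driver code: the class name, the `stall_limit` threshold, and the idea of counting consecutive stalled samples are illustrative assumptions, not names or values taken from i915.

```python
from dataclasses import dataclass


@dataclass
class SeqnoHangcheck:
    """Toy model of timer-driven seqno polling (hypothetical, not i915 code)."""

    stall_limit: int = 3       # assumed: stalled samples tolerated before a hang is declared
    last_seqno: int = -1
    stalled_samples: int = 0

    def sample(self, current_seqno: int, busy: bool) -> bool:
        """Called on every timer tick; returns True when the engine looks hung."""
        if not busy or current_seqno != self.last_seqno:
            # Idle engine or visible progress: remember the seqno, clear the counter.
            self.last_seqno = current_seqno
            self.stalled_samples = 0
            return False
        # Busy engine whose seqno has not moved since the previous tick.
        self.stalled_samples += 1
        return self.stalled_samples >= self.stall_limit


check = SeqnoHangcheck()
progressing = [check.sample(seqno, busy=True) for seqno in (10, 11, 12)]
stuck = [check.sample(12, busy=True) for _ in range(3)]
```

The weakness of this model is visible in the code: it only notices a hang after several full polling intervals, and it cannot tell a slow but legitimate workload from a dead engine.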

1.1 Heartbeat Emission and Detection

When an engine is in an active state, the driver periodically sends a special request, called a "Heartbeat" (Systole), containing only no-ops and synchronization barriers.

Under normal circumstances, the GPU quickly executes this heartbeat request and triggers an interrupt.
The driver uses mod_delayed_work to set a timeout. If the GPU fails to complete the heartbeat request within this period, the driver does not immediately declare it hung.

1.2 The Ultimatum (Preemption Timeout)

If the heartbeat times out, the driver plays its "trump card": it raises the priority of this heartbeat request over successive missed intervals, ultimately to the highest level (I915_PRIORITY_BARRIER).
This tells the hardware scheduler: "Whatever time-consuming task you are running right now, preempt it immediately and run my heartbeat first!"

If, even after issuing the highest-priority preemption command and waiting for a period (typically several hundred milliseconds, depending on preempt_timeout_ms), the heartbeat still does not pulse, i915 gives up completely and calls reset_engine(), officially declaring the engine Hung.
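The escalation described above can be sketched as a small state machine. This is a simplified Python model under stated assumptions: the two priority constants, the `HeartbeatMonitor` class, and the action strings are hypothetical names for illustration, and only the ordering of the priorities matters here.

```python
from dataclasses import dataclass

# Hypothetical priority levels for this sketch; not the kernel's actual values.
PRIORITY_MIN = 0
PRIORITY_BARRIER = 100


@dataclass
class HeartbeatMonitor:
    """Toy model of the heartbeat timeout handler (not i915 code)."""

    priority: int = PRIORITY_MIN
    engine_reset: bool = False

    def on_timeout(self, heartbeat_completed: bool) -> str:
        """Run when the delayed-work timer fires; returns the action taken."""
        if heartbeat_completed:
            # Healthy engine: the next heartbeat starts again at low priority.
            self.priority = PRIORITY_MIN
            return "send-next-heartbeat"
        if self.priority < PRIORITY_BARRIER:
            # Missed pulse: escalate so the scheduler preempts the running work.
            self.priority = PRIORITY_BARRIER
            return "escalate-priority"
        # Already at the top priority and still no pulse: declare the hang.
        self.engine_reset = True
        return "reset-engine"


monitor = HeartbeatMonitor()
actions = [monitor.on_timeout(done) for done in (True, False, False)]
```

The point of the intermediate step is that a missed heartbeat alone is not proof of a hang: a long-running but healthy workload gets one chance to be preempted before the engine is reset.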

2. Collecting "Last Words": Error State Capture

Before pulling the plug on the GPU, the most important task is to preserve the crash scene so that developers can perform a post-mortem analysis. This step occurs within the intel_gt_handle_error() function.

When the I915_ERROR_CAPTURE flag is passed, the driver calls i915_capture_error_state():

Snapshot Registers: Reads all critical hardware register states at that moment (such as the error identity register EIR, the active head pointer ACTHD, current context state, etc.).
Capture Ringbuffer: Copies all the commands currently in the command ring (Ringbuffer) being executed by the engine.
Record Batchbuffer: If a long series of rendering commands submitted from user space caused the hang, the driver also saves the content of the relevant Batchbuffer.

These "last words" are packaged into a GPU error snapshot (struct i915_gpu_coredump in recent kernels, tracked through i915_gpu_error) and exposed to user space through debugfs (usually at /sys/kernel/debug/dri/0/i915_error_state) or sysfs (the error file under /sys/class/drm/card0/). Tools like intel_error_decode can read this to reconstruct what instructions the GPU was executing at the moment of the hang.
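For readers who want to save a dump before it is lost, here is a minimal Python sketch that reads the debugfs file named above. The helper name `read_error_state` is hypothetical, the card index in the path varies between machines, and reading debugfs normally requires root, so the function treats every failure as "nothing to read".

```python
from pathlib import Path
from typing import Optional

# Path taken from the text above; the card index (0) may differ on your system.
ERROR_STATE_PATH = Path("/sys/kernel/debug/dri/0/i915_error_state")


def read_error_state(path: Path = ERROR_STATE_PATH) -> Optional[str]:
    """Return the raw error-state text, or None if it cannot be read or is empty."""
    try:
        text = path.read_text(errors="replace")
    except OSError:
        # Missing file, debugfs not mounted, or insufficient privileges.
        return None
    return text if text.strip() else None


if __name__ == "__main__":
    dump = read_error_state()
    if dump is None:
        print("no readable error state")
    else:
        # Keep a copy for offline decoding, for example with intel_error_decode.
        Path("gpu-error-state.txt").write_text(dump)
        print(f"saved {len(dump)} characters")
```

The saved text can then be handed to a decoder offline, which is usually more convenient than analyzing it on the machine that just hung.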

3. Reset Therapy: From Microsurgery to Defibrillation

After collecting the error state, i915 attempts to pull the GPU back from the brink of death. Modern Intel GPUs support a multi-level reset strategy, following the principle of "minimizing the impact area."

3.1 Engine Reset

If only the Video Decode Engine (VCS) is stuck, while the Render Engine (RCS) is still happily running a game, we obviously don't want the entire screen to go black.

The driver first tries to call intel_engine_reset().
The driver asks the hardware to reset only the specific engine(s) that are stuck, cleans up the hung context, and preserves the running state of the other engines.
This is an extremely "minimally invasive" recovery method; the user might only feel a slight stutter in one specific task.

3.2 GT Reset (Full Reset)

If the engine reset fails, or if the hang involves resources shared across engines rather than a single engine, the driver has to fall back to the next option.

The driver executes intel_gt_reset(), enters the I915_RESET_BACKOFF state, and pauses submissions to all engines.
It sends a reset signal to the entire GT (Graphics Technology) core. This clears all hardware execution queues.
After a successful reset, the driver requeues and resubmits the innocent requests that did not cause the hang. For the culprit, it directly returns -EIO (Input/Output Error), telling the application "your task has been terminated."
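The replay step after a GT reset can be illustrated with a toy queue. This Python sketch is an assumption-laden model, not the driver's logic: the `Request` class, its `guilty` flag, and `replay_after_reset` are hypothetical names, and the real driver tracks far more state per request.

```python
import errno
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    """Toy stand-in for a submitted GPU request (not an i915 structure)."""

    name: str
    guilty: bool = False           # True for the request blamed for the hang
    error: Optional[int] = None    # negative errno reported to the submitter


def replay_after_reset(in_flight: List[Request]) -> List[Request]:
    """Return the requests to resubmit; mark the guilty ones with -EIO."""
    resubmit = []
    for rq in in_flight:
        if rq.guilty:
            rq.error = -errno.EIO    # the application sees its task terminated
        else:
            resubmit.append(rq)      # innocent work keeps its original order
    return resubmit


queue = [
    Request("video-decode"),
    Request("bad-shader", guilty=True),
    Request("compositor"),
]
replayed = [rq.name for rq in replay_after_reset(queue)]
```

Preserving submission order for the innocent requests matters because later requests may depend on results produced by earlier ones.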

3.3 Device-Level Reset (PCI Reset - Wedged)

If even the GT reset fails to wake the GPU (for instance, the hardware state machine has completely collapsed, or the bus is deadlocked), i915 falls back to its last resort and marks the device as Wedged.

It calls intel_gt_set_wedged(). In this state, the driver rejects all graphics execution requests from user space, and all new execbuffer calls immediately return -EIO.
If conditions permit, the driver may attempt the highest-level reset (Device Reset) at the PCI bus level. However, if it comes to this, the screen usually flickers, and a machine reboot might be necessary for full recovery.
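Taken together, the three levels form an escalation ladder. The following Python sketch models it under simplifying assumptions: the `Gpu` class and its boolean knobs are hypothetical, each reset is reduced to "works or does not work", and the optional PCI-level reset is left out.

```python
import errno


class Gpu:
    """Toy model of the reset ladder: engine reset, GT reset, then wedged."""

    def __init__(self, engine_reset_ok: bool, gt_reset_ok: bool) -> None:
        self.engine_reset_ok = engine_reset_ok   # assumed outcome of an engine reset
        self.gt_reset_ok = gt_reset_ok           # assumed outcome of a full GT reset
        self.wedged = False

    def recover(self) -> str:
        """Try the least invasive reset first and escalate only on failure."""
        if self.engine_reset_ok:
            return "engine-reset"
        if self.gt_reset_ok:
            return "gt-reset"
        # Nothing worked: stop accepting work instead of retrying forever.
        self.wedged = True
        return "wedged"

    def submit(self) -> int:
        """Return 0 on success, or -EIO once the device has been wedged."""
        return -errno.EIO if self.wedged else 0


healthy = Gpu(engine_reset_ok=True, gt_reset_ok=True)
dead = Gpu(engine_reset_ok=False, gt_reset_ok=False)
outcomes = (healthy.recover(), dead.recover())
```

Failing submissions fast with -EIO in the wedged state gives applications a clear signal to stop, rather than leaving them blocked on work that will never complete.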

Summary

A robust GPU driver should not assume that hardware will always run perfectly. i915's Heartbeat mechanism monitors the health of the engines like an electrocardiograph; i915_gpu_error records all data before a crash like a black box; and the multi-level Reset mechanism, from Engine to GT level, works like an emergency room physician, doing its utmost to revive the GPU to full health and protect the user experience to the greatest extent possible.

In the next part, we will reach the final installment of this series, reviewing the historical baggage carried by the i915 behemoth with over two million lines of code, and see how Intel's latest Xe architecture driver travels light and faces the future.

DEV Community