Part 4: Breaking Boundaries: TTM and Discrete GPU Memory Management

#linux #graphics

In the previous lecture, we introduced the GEM object model. For a long time, the i915 driver's memory management was designed around UMA (Unified Memory Architecture): for nearly two decades, every Intel GPU was an integrated part that shared the same physical memory (system memory) with the CPU.

However, the Intel Xe architecture brought discrete GPUs such as DG1 (Iris Xe MAX) and DG2 (Arc series), which carry their own dedicated video memory (VRAM). To manage a memory hierarchy that now spans the PCIe bus, i915 brought in the veteran workhorse of the Linux kernel graphics stack for discrete memory management: TTM (Translation Table Manager).

1. Why Is TTM Needed?

In the pure integrated graphics era, GEM's core job was managing the GTT (Graphics Translation Table), which maps scattered system memory pages into an address space the GPU can use. When physical memory ran low, the driver only needed to cooperate with the kernel's normal reclaim and swap mechanisms to page the data out, since the CPU and GPU shared the same physical storage.
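To make the idea of a translation table concrete, here is a minimal userspace sketch. It is not i915 code: the names toy_gtt, toy_gtt_bind and toy_gtt_translate, the 4 KiB page size, and the eight-entry table are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT  12u                      /* 4 KiB pages (assumed) */
#define TOY_PAGE_SIZE   (1u << TOY_PAGE_SHIFT)
#define TOY_GTT_ENTRIES 8u
#define TOY_GTT_INVALID UINT64_MAX

/* Toy translation table: index = GPU virtual page, value = physical page address. */
struct toy_gtt {
    uint64_t entry[TOY_GTT_ENTRIES];
};

static void toy_gtt_init(struct toy_gtt *gtt)
{
    for (size_t i = 0; i < TOY_GTT_ENTRIES; i++)
        gtt->entry[i] = TOY_GTT_INVALID;
}

/* Bind 'count' scattered physical pages to consecutive GPU pages starting at 'first'. */
static int toy_gtt_bind(struct toy_gtt *gtt, size_t first,
                        const uint64_t *phys_pages, size_t count)
{
    if (first > TOY_GTT_ENTRIES || count > TOY_GTT_ENTRIES - first)
        return -1;
    for (size_t i = 0; i < count; i++) {
        /* Every backing page must be page aligned. */
        assert((phys_pages[i] & (TOY_PAGE_SIZE - 1)) == 0);
        gtt->entry[first + i] = phys_pages[i];
    }
    return 0;
}

/* Translate a GPU virtual address to a physical address, or TOY_GTT_INVALID. */
static uint64_t toy_gtt_translate(const struct toy_gtt *gtt, uint64_t gpu_addr)
{
    uint64_t index = gpu_addr >> TOY_PAGE_SHIFT;

    if (index >= TOY_GTT_ENTRIES || gtt->entry[index] == TOY_GTT_INVALID)
        return TOY_GTT_INVALID;
    return gtt->entry[index] + (gpu_addr & (TOY_PAGE_SIZE - 1));
}
```

The GPU sees one contiguous range of addresses, while the pages behind it can sit anywhere in system memory. The real GTT is a hardware page table with more levels and flags than this flat array suggests.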

In the discrete GPU era, however, the situation underwent a fundamental change:

Asymmetric Storage: The GPU has fast but limited-capacity LMEM (Local Memory / VRAM), while the host has larger SMEM (System Memory) that the GPU can only reach across the PCIe bus, and therefore more slowly.
Location Determines Performance: A buffer object must be placed in the right location. For example, a frequently read and written framebuffer belongs in LMEM, whereas temporary command data prepared by the CPU can be placed in SMEM.
Data Relocation: When LMEM is full, less frequently used objects must be "moved out" to SMEM to free up space.

The core responsibility of TTM is precisely "Placements" and "Migration." It provides the driver with a common framework for defining different types of memory pools and for moving data between them safely, with the necessary synchronization.

2. Memory Region Abstraction: SMEM and LMEM

To interface with TTM, i915 describes each memory pool with an intel_memory_region structure. Each memory region acts like an independent "bank," with its own capacity, minimum page size, and allocation operations.

During driver initialization, the following common regions are probed and constructed:

// Defined in intel_memory_region.h
enum intel_region_id {
    INTEL_REGION_SMEM = 0,     // System Memory
    INTEL_REGION_LMEM_0,       // Local Memory / VRAM
    // ... for multi-die / multi-tile GPUs, there may be LMEM_1, LMEM_2
    INTEL_REGION_STOLEN_SMEM,
    INTEL_REGION_STOLEN_LMEM,
};
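To illustrate the "bank" picture, the sketch below models a region that tracks its capacity and rounds every request up to its minimum page size. It is a deliberately simplified stand-in: struct toy_region and its helpers are hypothetical, and the real intel_memory_region carries more state plus a table of operations.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a memory region "bank". */
struct toy_region {
    const char *name;
    uint64_t total;          /* capacity in bytes */
    uint64_t used;           /* bytes currently handed out */
    uint64_t min_page_size;  /* allocation granularity, must be a power of two */
};

/* Round 'size' up to the region's allocation granularity. */
static uint64_t toy_region_round_up(const struct toy_region *r, uint64_t size)
{
    assert(r->min_page_size != 0 &&
           (r->min_page_size & (r->min_page_size - 1)) == 0);
    return (size + r->min_page_size - 1) & ~(r->min_page_size - 1);
}

/* Reserve space; returns the bytes actually reserved, or 0 if it does not fit. */
static uint64_t toy_region_reserve(struct toy_region *r, uint64_t size)
{
    uint64_t rounded = toy_region_round_up(r, size);

    if (rounded == 0 || rounded > r->total - r->used)
        return 0;
    r->used += rounded;
    return rounded;
}

/* Give back a reservation previously returned by toy_region_reserve(). */
static void toy_region_release(struct toy_region *r, uint64_t reserved)
{
    assert(reserved <= r->used);
    r->used -= reserved;
}
```

With a 64 KiB minimum page size, a 4 KiB request still consumes 64 KiB of the region, which is why the granularity is part of the region description rather than a global constant.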

A GEM object managed by TTM is no longer tied to one fixed physical location for its whole lifetime; instead, it carries a list of acceptable placements, ordered by preference.
For example, when creating a buffer, a userspace application can specify: "I'd prefer this object to be placed in LMEM_0, but if LMEM is full, placing it in SMEM is also acceptable."
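The fallback behaviour can be sketched as a walk over a priority-ordered placement list. This is only a model under stated assumptions: toy_device, toy_place and the TOY_* region IDs are hypothetical names, each region is reduced to a free-byte counter, and eviction (covered in the next section) is ignored entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum toy_region_id { TOY_SMEM = 0, TOY_LMEM0, TOY_NUM_REGIONS };

/* Free bytes per region; a stand-in for real per-region allocators. */
struct toy_device {
    uint64_t free_bytes[TOY_NUM_REGIONS];
};

/*
 * Walk the placement list in priority order and reserve space in the first
 * region that can hold the object. Returns the chosen region ID, or -1 if
 * none of the listed regions has enough room.
 */
static int toy_place(struct toy_device *dev, uint64_t size,
                     const enum toy_region_id *placements, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        enum toy_region_id id = placements[i];

        assert((unsigned)id < TOY_NUM_REGIONS);
        if (dev->free_bytes[id] >= size) {
            dev->free_bytes[id] -= size;
            return (int)id;
        }
    }
    return -1;
}
```

An object created with the list { TOY_LMEM0, TOY_SMEM } lands in LMEM while there is room and silently falls back to SMEM afterwards; an object that lists only TOY_LMEM0 fails instead, which is the situation where a real driver would start evicting.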

3. Eviction and Migration

When a discrete GPU runs a large 3D game, several gigabytes of LMEM are quickly filled with texture and vertex data. When the driver tries to allocate new LMEM space but finds no free blocks, TTM triggers the eviction process.

3.1 Finding the Victim

TTM walks the LRU (Least Recently Used) list and consults the callbacks registered by i915: i915_ttm_eviction_valuable() and i915_ttm_evict_flags(). Through them, i915 decides whether an object may be moved right now (for example, that it is not pinned in place) and where it should go, typically SMEM. The object selected this way becomes the "victim."
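As a rough model of this selection, the sketch below walks a singly linked LRU list from the least recently used end and asks a driver-side veto function about each candidate. The names toy_bo, toy_eviction_valuable and toy_pick_victim are hypothetical, and the single pinned flag is an assumption standing in for the more involved checks a real driver performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical buffer object on a region's LRU list (least recently used first). */
struct toy_bo {
    const char *name;
    bool pinned;          /* must stay where it is, e.g. while being scanned out */
    struct toy_bo *next;  /* next, more recently used, object */
};

/* Driver-side veto, playing the role of an "eviction valuable" callback. */
static bool toy_eviction_valuable(const struct toy_bo *bo)
{
    assert(bo != NULL);
    return !bo->pinned;
}

/* Return the least recently used object the driver allows to be evicted. */
static struct toy_bo *toy_pick_victim(struct toy_bo *lru_head)
{
    for (struct toy_bo *bo = lru_head; bo != NULL; bo = bo->next) {
        if (toy_eviction_valuable(bo))
            return bo;
    }
    return NULL; /* nothing can be evicted; the allocation has to fail or wait */
}
```

Splitting the policy this way keeps the list walk generic while the driver retains the final say over which objects are safe to move.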

3.2 The Great Relocation Across PCIe

Moving data between LMEM and SMEM means the data must traverse the PCIe bus. In i915's i915_gem_ttm_move.c, the core function is i915_ttm_move().

This migration is not performed by the CPU with a plain memcpy, which would tie up a CPU core for the duration of the copy. Instead, it is handed off to the GPU's Blitter Engine (a hardware engine dedicated to copying data) and completes asynchronously.

The process is as follows:

The driver allocates enough physical pages in the target region (e.g., SMEM).
It constructs a hardware copy command (Blit instruction).
The driver submits this instruction to the GPU's Copy/Blit engine for execution.
A dma_fence (synchronization fence) is returned.
TTM and subsequent rendering tasks wait on this fence. Once the GPU signals that the copy is complete, the page tables are updated with the object's new physical location and rendering continues, all transparently to userspace.
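The sequence above can be modeled in a few lines of userspace C. This is a sketch under heavy assumptions, not the i915 implementation: toy_fence is a plain flag rather than a real dma_fence, toy_engine_run() stands in for the hardware executing the queued command at some later time, and all toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct toy_fence { bool signaled; };

/* A queued copy command, standing in for a blit instruction. */
struct toy_copy_job {
    void *dst;
    const void *src;
    size_t len;
    struct toy_fence *fence;
    bool queued;
};

struct toy_object { void *backing; };

/* Build and queue the copy, then return the fence without doing the copy. */
static struct toy_fence *toy_submit_copy(struct toy_copy_job *job,
                                         struct toy_fence *fence,
                                         void *dst, const void *src, size_t len)
{
    assert(dst != NULL && src != NULL && len > 0);
    fence->signaled = false;
    *job = (struct toy_copy_job){ .dst = dst, .src = src, .len = len,
                                  .fence = fence, .queued = true };
    return fence;
}

/* Stand-in for the copy engine: runs the job later and signals the fence. */
static void toy_engine_run(struct toy_copy_job *job)
{
    if (!job->queued)
        return;
    memcpy(job->dst, job->src, job->len);
    job->queued = false;
    job->fence->signaled = true;
}

/* Point the object at its new backing store, but only once the copy is done. */
static bool toy_finish_move(struct toy_object *obj, const struct toy_copy_job *job)
{
    if (!job->fence->signaled)
        return false;
    obj->backing = job->dst;
    return true;
}
```

The key property is that submission returns immediately: between toy_submit_copy() and toy_engine_run() the object still points at its old backing store, and nothing may use the destination until the fence has signaled.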

If the hardware copy engine is hung or otherwise unusable, the code retains i915_ttm_move_memcpy() as a last-resort fallback, in which the CPU copies the data itself.
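The fallback decision itself is simple and can be sketched as follows. The names toy_engine, toy_blit and toy_move are hypothetical, the wedged flag is an assumed stand-in for however the driver detects an unusable engine, and the "hardware" copy is simulated with memcpy so the example stays self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum toy_move_path { TOY_MOVE_BLIT, TOY_MOVE_MEMCPY };

/* Hypothetical copy engine; 'wedged' models a hung or unusable engine. */
struct toy_engine { bool wedged; };

/* Try the accelerated path; returns 0 on success, -1 if the engine is unusable. */
static int toy_blit(const struct toy_engine *engine,
                    void *dst, const void *src, size_t len)
{
    if (engine->wedged)
        return -1;
    memcpy(dst, src, len); /* simulated; real hardware would perform this copy */
    return 0;
}

/* Prefer the copy engine and fall back to a CPU copy only when it fails. */
static enum toy_move_path toy_move(const struct toy_engine *engine,
                                   void *dst, const void *src, size_t len)
{
    assert(dst != NULL && src != NULL);
    if (toy_blit(engine, dst, src, len) == 0)
        return TOY_MOVE_BLIT;
    memcpy(dst, src, len); /* last resort: the CPU moves the data itself */
    return TOY_MOVE_MEMCPY;
}
```

Either path leaves the destination with the same contents; the difference is only in who does the work and how long the CPU is occupied.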

Summary

With the introduction of discrete GPUs, i915's memory management broke out of the comfort zone of the original UMA architecture. By introducing TTM and abstracting memory pools as intel_memory_region, i915 gained the ability to place data flexibly in either VRAM or system memory. The eviction and migration mechanisms aim to keep the most critical, frequently accessed data on the fastest physical medium, underpinned by the asynchronous copy capabilities of the GPU Blitter engine.