Ripan Deuri
Understanding Cache Coherency

Modern high-performance devices communicate with the CPU through shared memory structures such as DMA Rings.

When one side updates memory, the other side must see the latest value.

On cache-coherent systems this happens automatically. On many ARM platforms it does not.

This post explains what breaks, why it breaks, and how the Linux DMA API solves it.

Why DMA Fails on Non-Coherent Systems

Consider the completion flow from the earlier ring design in How Hardware and Software Share a Queue: Understanding DMA Rings:

  1. Device DMA-writes a completion entry
  2. Device updates WR_IDX
  3. CPU reads WR_IDX and processes new entries

On a non-coherent system the driver may:

  • read an old WR_IDX
  • read a partially updated descriptor
  • never observe new completions

This happens because the CPU and the DMA engine do not observe memory through the same path.
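A minimal sketch of how this failure looks in driver code. The names (`ring`, `rd_idx`, `wr_idx`) are illustrative, not from a real driver; the point is that a plain load from the shared ring can keep returning a cached copy:

```c
/* Sketch: naive completion polling on a non-coherent system.
 * The device has DMA-written a new wr_idx into DDR, but the CPU
 * still holds an old cache line for that address. */
while (ring->rd_idx == ring->wr_idx)
	cpu_relax();	/* may spin forever on the stale cached value */
```

Nothing here is wrong as C; the bug is that the load never reaches DDR, which is exactly what the DMA API exists to fix.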

System Hardware View

                +----------------------+
                |        CPU           |
                |   Driver (load/store)|
                +----------+-----------+
                           |
                      +----v----+
                      |  Cache  |  (L1/L2)
                      +----+----+
                           |
                           |
                    +------v------+
                    |     DDR     |  (System RAM)
                    +------+------+
                           ^
                           | PCIe link
                    +------v------+
                    | PCIe Device |
                    |  DMA Engine |
                    +-------------+

Key observation:

  • CPU accesses DDR through cache
  • DMA accesses DDR directly
  • Cache and DDR can hold different data at the same time

This is the source of incoherency.

What Is Cache Coherency

Physical memory (DDR) is the shared storage.

The CPU does not read DDR on every load. It reads cached copies stored in cache lines.

Two operations are required to keep both sides consistent:

  • Flush – push updated cache lines to DDR
  • Invalidate – discard cached copies so the next read comes from DDR

Without these operations, both sides operate on different versions of the same memory.

DMA Memory in System DDR

The ring allocated in How Hardware and Software Share a Queue: Understanding DMA Rings resides in system DDR. It is normal RAM shared between CPU and device.

Coherency is achieved by changing how the CPU maps that memory.

The same physical DDR page can be:

  • mapped as cacheable
  • mapped as non-cacheable

This is controlled by page table attributes.

Memory Types From the CPU Perspective

Cacheable Memory

  • Default for kzalloc
  • Fast for CPU
  • Not automatically DMA-safe on non-coherent systems

Non-cacheable Memory

  • CPU always accesses DDR directly
  • No stale cache lines
  • Safe for shared control structures

On many ARM systems, coherent DMA memory is implemented using a non-cacheable CPU mapping.

Linux Kernel DMA APIs

The Linux kernel provides two usage patterns:

Coherent DMA

  • CPU and device always observe the same data
  • No explicit cache maintenance in the driver

Streaming DMA

  • Memory is cacheable
  • Driver must perform cache sync at specific points

dma_alloc_coherent():

  • allocates memory from system RAM (often via CMA or page allocator)
  • returns:

    • CPU virtual address
    • DMA address for the device

On non-coherent ARM systems it typically maps the region as non-cacheable for the CPU.

Result:

  • CPU accesses go directly to DDR
  • DMA accesses go to the same DDR
  • both sides see identical data without cache operations

This is why it is ideal for:

  • descriptor rings
  • doorbells
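A sketch of allocating such a ring with a coherent mapping. The structure and names (`struct my_ring`, `struct cq_entry`, `RING_ENTRIES`) are illustrative, not from a real driver:

```c
#include <linux/dma-mapping.h>

#define RING_ENTRIES 256

struct cq_entry {
	__le32 status;
	__le32 data;
};

struct my_ring {
	struct cq_entry *entries;	/* CPU virtual address */
	dma_addr_t dma;			/* bus address programmed into the device */
};

static int my_ring_alloc(struct device *dev, struct my_ring *ring)
{
	size_t size = RING_ENTRIES * sizeof(struct cq_entry);

	/* On a non-coherent ARM system this typically returns a
	 * non-cacheable CPU mapping, so the driver needs no explicit
	 * flush/invalidate when reading or writing the ring. */
	ring->entries = dma_alloc_coherent(dev, size, &ring->dma, GFP_KERNEL);
	if (!ring->entries)
		return -ENOMEM;
	return 0;
}
```

The matching teardown is `dma_free_coherent(dev, size, ring->entries, ring->dma)`.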

kzalloc() + DMA (Streaming DMA):

kzalloc returns cacheable normal memory.

For DMA usage the driver must:

  1. Map it for DMA: dma_map_single()
  2. Before the device reads the buffer: dma_sync_single_for_device()
  3. After the device writes the buffer and before the CPU reads it: dma_sync_single_for_cpu()
  4. When finished: dma_unmap_single()
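The four steps above can be sketched as follows; `dev`, `buf`, and `len` are assumed to exist, and error handling is abbreviated:

```c
dma_addr_t handle;

/* 1. Map the cacheable buffer for DMA */
handle = dma_map_single(dev, buf, len, DMA_BIDIRECTIONAL);
if (dma_mapping_error(dev, handle))
	return -ENOMEM;

/* 2. CPU fills the buffer, then hands ownership to the device
 *    (flushes dirty cache lines to DDR) */
dma_sync_single_for_device(dev, handle, len, DMA_BIDIRECTIONAL);

/* ... device DMA runs ... */

/* 3. Reclaim ownership before the CPU reads device-written data
 *    (invalidates stale cache lines) */
dma_sync_single_for_cpu(dev, handle, len, DMA_BIDIRECTIONAL);

/* 4. Tear down the mapping when done */
dma_unmap_single(dev, handle, len, DMA_BIDIRECTIONAL);
```

The direction passed to the sync calls must match the direction used at map time; DMA_BIDIRECTIONAL is used here only because the sketch covers both transfer directions.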

Ring Buffer

Ring allocated with dma_alloc_coherent:

  • Ring lives in DDR
  • CPU mapping is non-cacheable
  • Device DMA writes directly to DDR
  • Driver reads fresh data
  • No cache maintenance required

Ring allocated with kzalloc:

After the interrupt, and before reading completions, the driver must invalidate the cached lines with dma_sync_single_for_cpu().
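A sketch of that completion path for a ring in streaming (kzalloc) memory. `struct my_dev`, `ring_dma`, `RING_SIZE`, and `process_completions()` are hypothetical names for illustration:

```c
static irqreturn_t my_irq_handler(int irq, void *data)
{
	struct my_dev *md = data;

	/* Discard stale cache lines so the reads below hit DDR */
	dma_sync_single_for_cpu(md->dev, md->ring_dma, RING_SIZE,
				DMA_FROM_DEVICE);

	process_completions(md);	/* hypothetical helper */

	/* Hand the ring back to the device for further DMA writes */
	dma_sync_single_for_device(md->dev, md->ring_dma, RING_SIZE,
				   DMA_FROM_DEVICE);
	return IRQ_HANDLED;
}
```

Compare this with the coherent case above, where the handler could read the ring directly with no sync calls at all.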

Performance and Design Trade-offs

Coherent memory:

  • simpler
  • safe for shared control data
  • slower for large CPU accesses (no caching)

Streaming DMA:

  • fast for bulk data
  • requires correct sync points

Typical design:

  • rings → coherent memory
  • data buffers → streaming DMA

Conclusion

On non-coherent systems, the CPU cache and the DMA engine observe DDR through different paths. The Linux DMA API bridges this gap by either:

  • creating a coherent mapping, or
  • providing explicit cache synchronization primitives.
