Modern high-performance devices communicate with the CPU through shared memory structures such as DMA Rings.
When one side updates memory, the other side must see the latest value.
On cache-coherent systems this happens automatically. On many ARM platforms it does not.
This post explains what breaks, why it breaks, and how the Linux DMA API solves it.
Why DMA Fails on Non-Coherent Systems
Consider the completion flow from the earlier ring design in How Hardware and Software Share a Queue: Understanding DMA Rings:
- Device DMA-writes a completion entry
- Device updates WR_IDX
- CPU reads WR_IDX and processes new entries
On a non-coherent system the driver may:
- read an old WR_IDX
- read a partially updated descriptor
- never observe new completions
This happens because the CPU and the DMA engine do not observe memory through the same path.
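A minimal sketch of how this failure shows up in driver code, assuming a hypothetical ring structure with a shadow read index and a device-written wr_idx living in plainly cacheable (kzalloc-style) memory with no cache maintenance:

```c
/* Naive completion poll. If *ring->wr_idx is in cacheable memory and
 * no cache maintenance is performed, this load can keep hitting a
 * stale cache line and never observe the device's DMA write. */
while (ring->rd_idx == le32_to_cpu(*ring->wr_idx))
	cpu_relax();

/* Even if the loop eventually exits, the descriptor itself may
 * still be partially stale in cache. */
process_completion(&ring->desc[ring->rd_idx]);
```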
System Hardware View
+----------------------+
|         CPU          |
|  Driver (load/store) |
+----------+-----------+
           |
      +----v----+
      |  Cache  | (L1/L2)
      +----+----+
           |
    +------v------+
    |     DDR     | (System RAM)
    +------+------+
           ^
           | PCIe link
    +------v------+
    | PCIe Device |
    |  DMA Engine |
    +-------------+
Key observation:
- CPU accesses DDR through cache
- DMA accesses DDR directly
- Cache and DDR can hold different data at the same time
This is the source of incoherency.
What Is Cache Coherency
Physical memory (DDR) is the shared storage.
The CPU does not read DDR on every load. It reads cached copies stored in cache lines.
Two operations are required to keep both sides consistent:
- Flush – push updated cache lines to DDR
- Invalidate – discard cached copies so the next read comes from DDR
Without these operations, both sides operate on different versions of the same memory.
DMA Memory in System DDR
The ring allocated in How Hardware and Software Share a Queue: Understanding DMA Rings resides in system DDR. It is normal RAM shared between CPU and device.
Coherency is achieved by changing how the CPU maps that memory.
The same physical DDR page can be:
- mapped as cacheable
- mapped as non-cacheable
This is controlled by page table attributes.
Memory Types From the CPU Perspective
Cacheable Memory
- Default for kzalloc()
- Fast for the CPU
- Not automatically DMA-safe on non-coherent systems
Non-cacheable Memory
- CPU always accesses DDR directly
- No stale cache lines
- Safe for shared control structures
On many ARM systems, coherent DMA memory is implemented using a non-cacheable CPU mapping.
Linux Kernel DMA APIs
The Linux kernel provides two usage patterns:
Coherent DMA
- CPU and device always observe the same data
- No explicit cache maintenance in the driver
Streaming DMA
- Memory is cacheable
- Driver must perform cache sync at specific points
dma_alloc_coherent():
- allocates memory from system RAM (often via CMA or page allocator)
- returns:
  - a CPU virtual address
  - a DMA address for the device
On non-coherent ARM systems it typically:
- maps the region as non-cacheable for the CPU
Result:
- CPU accesses go directly to DDR
- DMA accesses go to the same DDR
- both sides see identical data without cache operations
This is why it is ideal for:
- descriptor rings
- doorbells
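The allocation step above can be sketched as follows. This is a hedged illustration, not a complete driver: my_desc, my_ring, and RING_ENTRIES are hypothetical names, but dma_alloc_coherent()/dma_free_coherent() are the real kernel API:

```c
#include <linux/dma-mapping.h>

#define RING_ENTRIES 256

/* Hypothetical descriptor layout for illustration. */
struct my_desc {
	__le64 addr;
	__le32 len;
	__le32 flags;
};

struct my_ring {
	struct my_desc *desc;	/* CPU virtual address */
	dma_addr_t	dma;	/* address programmed into the device */
};

static int my_ring_alloc(struct device *dev, struct my_ring *ring)
{
	size_t size = RING_ENTRIES * sizeof(struct my_desc);

	/* One call returns both views of the same DDR region. On a
	 * non-coherent ARM system the CPU mapping is non-cacheable,
	 * so no flush/invalidate is needed afterwards. */
	ring->desc = dma_alloc_coherent(dev, size, &ring->dma, GFP_KERNEL);
	if (!ring->desc)
		return -ENOMEM;
	return 0;
}

static void my_ring_free(struct device *dev, struct my_ring *ring)
{
	size_t size = RING_ENTRIES * sizeof(struct my_desc);

	dma_free_coherent(dev, size, ring->desc, ring->dma);
}
```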
kzalloc() + DMA (Streaming DMA):
kzalloc() returns cacheable normal memory.
For DMA usage the driver must:
- Map it for DMA: dma_map_single()
- Before the device reads the buffer: dma_sync_single_for_device()
- After the device writes the buffer and before the CPU reads it: dma_sync_single_for_cpu()
- When finished: dma_unmap_single()
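The lifecycle above can be sketched for a receive buffer. dev, buf, and len are placeholders; the mapping and sync calls are the real streaming-DMA API:

```c
/* Streaming DMA lifecycle for a kzalloc'd receive buffer. */
void *buf = kzalloc(len, GFP_KERNEL);
dma_addr_t dma = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);

if (dma_mapping_error(dev, dma))
	goto err;

/* ... program 'dma' into the device and start the transfer ... */

/* Device has written the buffer: invalidate stale cache lines
 * before the CPU looks at the data. */
dma_sync_single_for_cpu(dev, dma, len, DMA_FROM_DEVICE);
/* ... CPU processes buf ... */

/* Hand the buffer back to the device for the next transfer. */
dma_sync_single_for_device(dev, dma, len, DMA_FROM_DEVICE);

/* Done with DMA entirely. */
dma_unmap_single(dev, dma, len, DMA_FROM_DEVICE);
kfree(buf);
```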
Ring Buffer
Ring allocated with dma_alloc_coherent:
- Ring lives in DDR
- CPU mapping is non-cacheable
- Device DMA writes directly to DDR
- Driver reads fresh data
- No cache maintenance required
Ring allocated with kzalloc:
After the interrupt fires and before reading completions, the driver must invalidate the cached lines with dma_sync_single_for_cpu().
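A sketch of that completion path, assuming a hypothetical ring mapped with dma_map_single() and a device-maintained write index at the start of the region (names like my_irq, wr_idx, and process_completion are illustrative):

```c
static irqreturn_t my_irq(int irq, void *data)
{
	struct my_ring *ring = data;

	/* The device wrote new completion entries via DMA; invalidate
	 * stale cache lines so the CPU reads fresh data from DDR. */
	dma_sync_single_for_cpu(ring->dev, ring->dma, ring->size,
				DMA_FROM_DEVICE);

	while (ring->rd_idx != le32_to_cpu(*ring->wr_idx)) {
		process_completion(&ring->desc[ring->rd_idx]);
		ring->rd_idx = (ring->rd_idx + 1) % RING_ENTRIES;
	}

	return IRQ_HANDLED;
}
```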
Performance and Design Trade-offs
Coherent memory:
- simpler
- safe for shared control data
- slower for large CPU accesses (no caching)
Streaming DMA:
- fast for bulk data
- requires correct sync points
Typical design:
- rings → coherent memory
- data buffers → streaming DMA
Conclusion
On non-coherent systems, the CPU cache and the DMA engine observe DDR through different paths. The Linux DMA API bridges this gap by either:
- creating a coherent mapping, or
- providing explicit cache synchronization primitives.