DEV Community

Ripan Deuri


How Hardware and Software Share a Queue: Understanding DMA Rings

Modern high-performance systems rely on a shared memory queue for communication between hardware and software, where the device writes data using DMA and indicates new work by updating an index. This mechanism is widely used in network controllers, NVMe storage, GPUs, and asynchronous I/O frameworks because it eliminates lock contention, reduces register access, and allows both sides to operate independently at high throughput.

Understanding this structure requires looking beyond the idea of a circular buffer and focusing on ownership transfer, memory ordering, and cache visibility. These are the concepts that determine correctness and performance in real driver implementations.

This post explains how a lock-free queue is shared between hardware and software and breaks down the synchronization model that makes it work.

Why This Mechanism Exists

At high data rates, traditional communication methods between software and hardware become too expensive:

  • Frequent device register reads add latency, because each read stalls the CPU for a round trip to the device.
  • Locking shared structures limits parallelism.
  • Interrupt-per-event models do not scale.

Instead, modern devices and drivers communicate through shared memory queues.

The key idea is simple:

The device publishes completed work into memory using DMA, and software consumes it later.

This removes:

  • register polling from the fast path
  • lock contention
  • synchronous handshakes

and replaces them with ownership transfer over a circular buffer.

Shared Memory Layout and Ownership Model

The circular queue lives in system DDR memory and is accessible to both the CPU and the device.

+--------------------------------------------------------------------+
|                            HOST SYSTEM                             |
|  +------------------+                                              |
|  |       CPU        |                                              |
|  |    +--------+    |                                              |
|  |    | Driver |------------------------------------------------+  |
|  |    +--------+    |                                           |  |
|  +---------^--------+                                           |  |
|            |                                                    |  |
|  +---------v--------+                                           |  |
|  |      Cache       |                                           |  |
|  |     L1 / L2      |                                           |  |
|  +---------^--------+                                           |  |
|            | Cache lines                                        |  |
|  +---------v-------------------------------------------------+  |  |
|  |                 SYSTEM DDR (Non-Coherent)                 |  |  |
|  |   +------------------------+    +----------------+        |  |  |
|  |   | Desc 0 | Desc 1 | ...  |    |     WR_IDX     |        |  |  |
|  |   +------------------^-----+    +-----^----------+        |  |  |
|  |   RING DESCRIPTORS   |                |  WR_ADDR (SHADOW) |  |  |
|  |                      +--------+-------+                   |  |  |
|  +-------------------------------|---------------------------+  |  |
|                        DMA write |         MMIO write (RD_IDX)  |  |
|                (Metadata, WR_IDX)|         +--------------------+  |
|                                  |         |                       |
|                     +----------------------v----+                  |
|                     |       Root Complex        |                  |
|                     +------------^--------------+                  |
+----------------------------------|---------------------------------+
                                   |                              
                                   | PCIe Link                    
                                   |
+----------------------------------v---------------------------------+
|                             PCIe DEVICE                            |
|  MMIO RING REGS                                                    |
|  +--------------+          +----------------------------+          |
|  | BASE_ADDR    |          |         DMA ENGINE         |          |
|  +--------------+          +----------------------------+          |
|  | ...          |                                                  |
|  +--------------+          +----------------------------+          |
|  | WR_ADDR      |          |           MSI-X            |          |
|  +--------------+          +----------------------------+          |
|  | RD_IDX       |                                                  |
|  +--------------+                                                  |
+--------------------------------------------------------------------+

  • Device → advances WR_IDX
  • Driver → advances RD_IDX

At any moment:

      RD_IDX
        │
        +----+----+----+----+----+----+----+----+
        | S  | S  | S  | D  | D  | D  | D  | D  |
        +----+----+----+----+----+----+----+----+
                       |
                     WR_IDX

Driver owns   : [RD_IDX … WR_IDX)
Device owns   : [WR_IDX … RD_IDX)

[D]  → Device-owned slot (empty, can be filled by HW)
[S]  → Driver-owned slot (ready to be processed by SW)

The ring grows clockwise ➜
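
The ownership rule above can be written as a small predicate. This is a minimal sketch, assuming indices that are already wrapped to the ring size and the convention that RD_IDX == WR_IDX means the ring is empty. The function name `slot_is_driver_owned` is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true if `slot` lies in [rd_idx, wr_idx),
 * walking forward around a ring of `ring_size` slots.
 * Assumes rd_idx == wr_idx means "empty". */
static bool slot_is_driver_owned(uint32_t rd_idx, uint32_t wr_idx,
                                 uint32_t ring_size, uint32_t slot)
{
    assert(ring_size != 0);
    assert(rd_idx < ring_size && wr_idx < ring_size && slot < ring_size);

    /* Forward distance from RD_IDX to the slot and to WR_IDX. */
    uint32_t slot_dist = (slot + ring_size - rd_idx) % ring_size;
    uint32_t pending   = (wr_idx + ring_size - rd_idx) % ring_size;

    return slot_dist < pending;
}
```

With an 8-slot ring, RD_IDX = 0 and WR_IDX = 3, the predicate is true for slots 0 to 2 and false for slots 3 to 7, matching the diagram. Every slot that is not driver-owned belongs to the device.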

How Ownership Moves Around the Ring

  • Init - no valid entries
        RD_IDX
        │
        +----+----+----+----+----+----+----+----+
        | D  | D  | D  | D  | D  | D  | D  | D  |
        +----+----+----+----+----+----+----+----+
        |
        WR_IDX
  • Device fills new entries
                            WR_IDX
                            │
        +----+----+----+----+----+----+----+----+
        | S  | S  | S  | S  | D  | D  | D  | D  |
        +----+----+----+----+----+----+----+----+
        │
        RD_IDX
  • Driver processes new entries
                            WR_IDX
                            │
        +----+----+----+----+----+----+----+----+
        | D  | D  | S  | S  | D  | D  | D  | D  |
        +----+----+----+----+----+----+----+----+
                  │
                  RD_IDX
  • Wrap-around - indices wrap modulo ring size
              WR_IDX
             │
        +----+----+----+----+----+----+----+----+
        | S  | D  | S  | S  | S  | S  | S  | S  |
        +----+----+----+----+----+----+----+----+
                  │
                  RD_IDX
  • Full ring - device must stop

If WR_IDX catches RD_IDX:

                  WR_IDX
                  │
        +----+----+----+----+----+----+----+----+
        | S  | S  | S  | S  | S  | S  | S  | S  |
        +----+----+----+----+----+----+----+----+
                  │
                  RD_IDX

There are no device-owned slots. Device cannot write.

This is not an error - it is backpressure.

  • Empty ring - driver has nothing to do

If RD_IDX catches WR_IDX:

                  WR_IDX
                  │
        +----+----+----+----+----+----+----+----+
        | D  | D  | D  | D  | D  | D  | D  | D  |
        +----+----+----+----+----+----+----+----+
                  │
                  RD_IDX

No software-owned entries. Driver stops processing.
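
In the full and empty diagrams WR_IDX and RD_IDX point at the same slot, so comparing wrapped indices alone cannot tell the two states apart. One common way to resolve this is to keep free-running counters and mask them only when indexing; another is to leave one slot permanently unused. The sketch below shows the first approach and is not necessarily what any particular device does. The names are hypothetical, and the ring size is assumed to be a power of two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical index pair: both counters run freely and wrap at 2^32. */
struct ring_idx {
    uint32_t rd;    /* total entries consumed by the driver */
    uint32_t wr;    /* total entries produced by the device */
    uint32_t size;  /* number of slots, assumed to be a power of two */
};

/* Number of driver-owned (unprocessed) entries. Unsigned subtraction
 * stays correct even after the counters wrap around 2^32. */
static uint32_t ring_count(const struct ring_idx *r)
{
    return (uint32_t)(r->wr - r->rd);
}

static bool ring_is_empty(const struct ring_idx *r)
{
    return ring_count(r) == 0;
}

static bool ring_is_full(const struct ring_idx *r)
{
    return ring_count(r) == r->size;
}

/* Map a free-running counter to a slot number. */
static uint32_t ring_slot(const struct ring_idx *r, uint32_t counter)
{
    assert(r->size != 0 && (r->size & (r->size - 1)) == 0);
    return counter & (r->size - 1);
}
```

With this scheme a full ring and an empty ring map to the same slot number for both indices, but the counters differ by the ring size in one case and by zero in the other.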

Lifecycle of a Completion: From Device to Driver

This sequence describes how a real device reports finished work to software through the shared ring.

[1] Initialization

During setup, the driver:

  • Allocates the ring in system memory.
  • Programs the device with:
    • the ring base address
    • the ring size
    • the address where WR_IDX will be written (shadow in host memory).
  • Initializes RD_IDX to zero.

At this point:

  • The queue contains no valid entries.
  • The entire ring is owned by the device.
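
The setup steps can be sketched as follows. This is an illustration under stated assumptions, not a real driver: the register block is modelled as a plain struct, the `ring_size` register name and the 16-byte descriptor are hypothetical, `calloc` stands in for a DMA-capable allocation (a Linux driver would typically use something like `dma_alloc_coherent`), and host pointers are used directly as bus addresses.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical 16-byte completion descriptor. */
struct desc { uint32_t words[4]; };

/* Hypothetical model of the device's MMIO ring registers. */
struct ring_regs {
    volatile uint64_t base_addr;  /* address of the descriptor array      */
    volatile uint32_t ring_size;  /* number of descriptors                */
    volatile uint64_t wr_addr;    /* where the device DMA-writes WR_IDX   */
    volatile uint32_t rd_idx;     /* driver's read index                  */
};

struct ring {
    struct desc       *descs;      /* descriptor array in host memory */
    volatile uint32_t *wr_shadow;  /* WR_IDX shadow in host memory    */
    uint32_t           rd_idx;     /* driver's local copy of RD_IDX   */
    uint32_t           size;
    struct ring_regs  *regs;
};

/* Returns 0 on success, -1 on allocation failure. */
static int ring_init(struct ring *r, struct ring_regs *regs, uint32_t size)
{
    assert(size != 0);

    /* calloc zeroes the memory, so the ring starts with no valid entries
     * and the WR_IDX shadow starts at zero. */
    r->descs     = calloc(size, sizeof(*r->descs));
    r->wr_shadow = calloc(1, sizeof(*r->wr_shadow));
    if (!r->descs || !r->wr_shadow) {
        free(r->descs);
        free((void *)r->wr_shadow);
        return -1;
    }
    r->size   = size;
    r->rd_idx = 0;
    r->regs   = regs;

    /* Program the device with the ring location, size and shadow address. */
    regs->base_addr = (uint64_t)(uintptr_t)r->descs;
    regs->ring_size = size;
    regs->wr_addr   = (uint64_t)(uintptr_t)r->wr_shadow;
    regs->rd_idx    = 0;
    return 0;
}
```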

[2] Device finishes processing

The device already knows where the result data should be placed.
This typically comes from a separate provisioning mechanism (another queue or pre-registered buffers).

It DMA-writes the result into system memory.

[3] Device writes a completion entry

The device selects the slot at its current WR_IDX and DMA-writes a completion record.

This record may contain:

  • an identifier for the buffer or request
  • the length of valid data
  • status or error information
  • device-generated metadata

At this stage the entry exists in memory, but software does not yet know that it is valid.
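
As an illustration of such a record, the sketch below defines a hypothetical 16-byte layout. The field names, widths and status encoding are assumptions made for the example; a real device defines its own format.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical completion record; every field here is illustrative. */
struct completion_entry {
    uint32_t request_id;  /* identifies the buffer or request    */
    uint32_t length;      /* number of valid bytes in the buffer */
    uint16_t status;      /* 0 = success, non-zero = error code  */
    uint16_t flags;       /* device-generated metadata           */
    uint32_t timestamp;   /* device-generated metadata           */
};

/* The device and the driver must agree on the exact size and layout. */
static_assert(sizeof(struct completion_entry) == 16,
              "completion entry must be 16 bytes");

static bool completion_ok(const struct completion_entry *e)
{
    return e->status == 0;
}
```

Fixing the size at compile time catches accidental padding or field changes that would make the driver misread what the device wrote.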

[4] Device publishes WR_IDX

After the completion entry is fully written, the device updates WR_IDX in host memory.

The index update is the visibility point for software.

[5] Interrupt

The device may generate an interrupt to notify the CPU. See "How an Interrupt Reaches the CPU" for details on how the interrupt is delivered.

[6] Software consumption

When software runs (either due to an interrupt or polling), it:

  • reads WR_IDX to determine how far the device has progressed.
  • processes entries in the range [RD_IDX … WR_IDX). For each entry, it:

    • interprets the completion record
    • recycles the associated resources
    • advances RD_IDX

[7] Returning ownership to the device

After consuming entries, software writes the updated RD_IDX to the device via MMIO.

This tells the device:

These slots are free again.
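
Steps [6] and [7] can be sketched together as one polling function. This is a simplified model, assuming wrapped indices where RD_IDX == WR_IDX means empty, a WR_IDX shadow represented by a C11 atomic, and the RD_IDX register represented by an ordinary variable. `ring_poll`, `struct cq` and the handler are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 8u

struct completion { uint32_t request_id; uint32_t length; };

struct cq {
    struct completion entries[RING_SIZE];
    _Atomic uint32_t  wr_shadow;   /* WR_IDX, written by the device    */
    uint32_t          rd_idx;      /* driver's local RD_IDX            */
    volatile uint32_t rd_idx_reg;  /* stand-in for the MMIO RD_IDX reg */
};

typedef void (*completion_fn)(const struct completion *c, void *ctx);

/* Process all entries in [RD_IDX, WR_IDX); return how many were handled. */
static uint32_t ring_poll(struct cq *q, completion_fn handle, void *ctx)
{
    assert(handle != NULL);

    /* Acquire: entry reads below must not be satisfied before this load. */
    uint32_t wr = atomic_load_explicit(&q->wr_shadow, memory_order_acquire);
    uint32_t handled = 0;

    while (q->rd_idx != wr) {
        handle(&q->entries[q->rd_idx], ctx);
        q->rd_idx = (q->rd_idx + 1) % RING_SIZE;
        handled++;
    }

    /* One register write per batch returns the slots to the device. */
    if (handled)
        q->rd_idx_reg = q->rd_idx;

    return handled;
}

/* Example handler: accumulate the total number of completed bytes. */
static void add_length(const struct completion *c, void *ctx)
{
    *(uint64_t *)ctx += c->length;
}
```

Writing RD_IDX once after the loop, rather than once per entry, keeps MMIO traffic proportional to the number of batches instead of the number of completions.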

Cache Coherency and DMA Visibility

On cache-coherent systems, the CPU and the device observe the same memory contents automatically.

On non-coherent systems, DMA updates system memory but the CPU may still read stale data from its cache.

Before reading new completions, the driver must invalidate the cache lines that cover:

  • the completion entries
  • WR_IDX

Otherwise, software may see an old index or partially updated entries even though the device has already written the new data to memory.
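
Cache maintenance operates on whole cache lines, so the range to invalidate has to be rounded out to line boundaries. The sketch below only computes that range. The 64-byte line size and the names are assumptions, and the invalidate operation itself is platform-specific (a Linux driver would normally go through the DMA API, for example `dma_sync_single_for_cpu`, rather than touching the cache directly).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64u  /* assumed; use the real value for your CPU */

struct cache_range {
    uintptr_t start;  /* first byte of the first line (inclusive) */
    uintptr_t end;    /* one past the last line (exclusive)       */
};

/* Round [addr, addr + len) outward to cache-line boundaries. */
static struct cache_range cache_lines_covering(uintptr_t addr, size_t len)
{
    assert(len > 0);

    const uintptr_t mask = ~(uintptr_t)(CACHE_LINE_SIZE - 1);
    struct cache_range r;

    r.start = addr & mask;
    r.end   = (addr + len + CACHE_LINE_SIZE - 1) & mask;
    return r;
}
```

For example, a 16-byte entry at address 0x1010 is covered by the single line 0x1000 to 0x1040, while 32 bytes starting at 0x1030 straddle two lines.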

Memory Ordering

The queue works because both sides publish progress in a strictly defined order. Without this ordering, an index can become visible before the data it refers to.

Device side

The device must ensure:

completion entry write → WR_IDX update

This guarantees that when software observes the new WR_IDX, the corresponding completion entry is already fully written in memory.

Software side

Software must:

read WR_IDX → then read the completion entries

This prevents the CPU from speculatively reading ring contents before it knows how far the device has progressed.

These rules are enforced with memory barriers in the driver and with ordering guarantees in the device.
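
In portable C11 these two rules map onto a release store and an acquire load. The sketch below models the device as ordinary code purely for illustration; a real device enforces its side in hardware, and a Linux driver would typically use barriers such as `dma_rmb()` rather than C11 atomics. All names are hypothetical, and the backpressure check is omitted for brevity.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SLOTS 4u

struct mailbox {
    uint32_t         payload[SLOTS];  /* stands in for completion entries */
    _Atomic uint32_t wr_idx;          /* published progress (free-running) */
};

/* Producer (device model): write the entry first, then publish the index. */
static void publish(struct mailbox *m, uint32_t value)
{
    uint32_t wr = atomic_load_explicit(&m->wr_idx, memory_order_relaxed);

    m->payload[wr % SLOTS] = value;
    /* Release: the payload write above becomes visible before the index. */
    atomic_store_explicit(&m->wr_idx, wr + 1, memory_order_release);
}

/* Consumer (driver model): read the index first, then the entry. */
static bool consume(struct mailbox *m, uint32_t *rd_idx, uint32_t *out)
{
    assert(rd_idx && out);

    /* Acquire: the payload read below cannot move ahead of this load. */
    uint32_t wr = atomic_load_explicit(&m->wr_idx, memory_order_acquire);

    if (*rd_idx == wr)
        return false;  /* nothing published yet */

    *out = m->payload[*rd_idx % SLOTS];
    (*rd_idx)++;
    return true;
}
```

If the release and acquire were replaced with relaxed operations, the consumer could observe the new index and still read a stale payload, which is exactly the failure the ordering rules exclude.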

Timeline View

[Figure: timeline between device and driver]

Conclusion

A shared ring is a contract where hardware and software exchange ownership through ordered index updates. Completed work becomes visible when WR_IDX is updated, and buffer space is returned to the device when RD_IDX advances. This memory-based publication model removes locks and reduces MMIO traffic, enabling scalable, high-throughput operation.
