
Ripan Deuri

Memory Mapped IO (MMIO)

Introduction

Memory-Mapped I/O (MMIO) is an address-mapping technique where device registers are assigned fixed ranges in the system’s physical address map. When the CPU issues a load or store to these regions, the transaction is routed to a hardware block instead of DRAM. There are no special instructions involved; ordinary load/store operations form the entire I/O protocol.

VA to PA Conversion

The CPU always issues virtual addresses (VAs). Before transactions reach the interconnect, the Memory Management Unit (MMU) translates VAs into physical addresses (PAs). The MMU uses a Translation Lookaside Buffer (TLB) for cached translations and performs a page-table walk on TLB misses.

(Figure: VA to PA mapping)

MMIO Regions

MMIO regions occupy fixed physical address ranges defined by the SoC. Linux receives the physical MMIO layout from the Device Tree (DT), reserves the corresponding regions, and builds virtual mappings with device-type memory attributes.

```
+-----------------------------------------+
|   SoC Physical Address Map              |
+-----------------------------------------+
| 0x0000_0000 - 0x0FFF_FFFF : DDR         |
| 0x1000_0000 - 0x1000_0FFF : UART MMIO   |
| 0x1234_0000 - 0x1234_0FFF : Device MMIO |
| ...                                     |
+-----------------------------------------+
```
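A region like the UART window above could be described to Linux by a Device Tree node along these lines (a hypothetical fragment; the node name and compatible string are illustrative, but `reg` carries exactly the base and size from the map):

```
uart0: serial@10000000 {
    compatible = "ns16550a";
    /* base physical address, size — the 4 KiB UART MMIO window */
    reg = <0x10000000 0x1000>;
};
```

The kernel parses `reg`, reserves the physical range, and a driver later obtains a virtual mapping for it with device-type memory attributes.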

Load and Store Operations

A CPU MMIO access follows the same initial path as a normal load or store: instruction issue → VA → MMU translation → PA. After translation, the PA is emitted onto the SoC interconnect. The interconnect decodes the PA and routes the access to the MMIO peripheral instead of DDR.

(Figure: MMIO read)

Barriers and Ordering

Modern CPUs reorder memory operations aggressively. Loads may complete early, stores may be buffered, and multiple memory streams may execute out of order. This behaviour is essential for performance, but it breaks the assumptions needed when software interacts with hardware through MMIO registers.

Programming a Device Without a Write Barrier

Example: A device expects the CFG register to be written before the DOORBELL register.

LDR: Loads a 32-bit word (LDR Rn, =imm is a pseudo-instruction that loads a constant).
STR: Stores a 32-bit word.

```asm
; R0 = CFG value (0xAAAA5555)
LDR     R0, =0xAAAA5555
; R1 = CFG register (0x12340000)
LDR     R1, =0x12340000
; Write CFG
STR     R0, [R1]
; R2 = DOORBELL value (0x1)
MOV     R2, #1
; R3 = DOORBELL register (0x12340004)
LDR     R3, =0x12340004
; Kick the device
STR     R2, [R3]
```

Without a write barrier, the ARM memory model allows these two STRs to complete out of order (for Normal memory, or Device memory without the non-Reordering attribute), so the device may see the doorbell before the new CFG value.

Reorder Write Issue

The store buffer (or write buffer) is a queue inside the core that absorbs stores so the CPU can proceed without waiting for each write to reach memory. It is a common source of reordering: buffered stores can drain in an order the device does not expect, and later operations can complete before earlier stores become visible.

```
Program order   :   CFG → DOORBELL
Device sees     :   DOORBELL → CFG
```

Insert a store barrier between the two writes.

DMB ST : Data Memory Barrier (Store) — all stores before the barrier complete before any store after it.

```asm
; R0 = CFG value (0xAAAA5555)
LDR     R0, =0xAAAA5555
; R1 = CFG register (0x12340000)
LDR     R1, =0x12340000
; Write CFG
STR     R0, [R1]

; Write barrier: all stores before this complete
DMB     ST

; R2 = DOORBELL value (0x1)
MOV     R2, #1
; R3 = DOORBELL register (0x12340004)
LDR     R3, =0x12340004
; Kick the device
STR     R2, [R3]
```
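In C, the same pattern is usually written with a fence. This is a sketch assuming the two registers are already mapped (the pointers are hypothetical); on ARM, `atomic_thread_fence(memory_order_release)` compiles to a DMB, which orders the CFG store before the doorbell store:

```c
#include <stdatomic.h>
#include <stdint.h>

/* CFG-then-DOORBELL, with the fence playing the role of DMB ST. */
static void kick_device(volatile uint32_t *cfg, volatile uint32_t *doorbell)
{
    *cfg = 0xAAAA5555;                          /* write CFG            */
    atomic_thread_fence(memory_order_release);  /* order CFG before...  */
    *doorbell = 1;                              /* ...the doorbell kick */
}
```

In Linux driver code the equivalent would be `writel()` after `wmb()`, or simply `writel()` (which includes the needed barrier) versus the relaxed `writel_relaxed()`.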

Reading Device Status Without a Read Barrier

Example: Many devices expose a STATUS register and a DATA register (read data only when STATUS indicates ready).

```asm
; R1 = address of STATUS (0x12340010)
LDR     R1, =0x12340010
; Read status
LDR     R0, [R1]                ; Expect updated status
; R3 = address of DATA (0x12340014)
LDR     R3, =0x12340014
; Read the data
LDR     R2, [R3]
```

Reorder Read Issue

The load queue (or pending-load buffer) is the queue inside the core that turns differing memory response times into observable reordering. It holds all outstanding loads, receives their results as they return from the bus, and forwards whichever response arrives first to the pipeline; that forwarding step is what exposes the reordering to the CPU. Without this queue, the core would have to wait for each load to complete before issuing the next, so a later load could never complete before an earlier one, even when its data is available sooner.

```
Program order     :   STATUS → DATA
Actual sequence   :   DATA → STATUS
```

Insert a load barrier between the two reads.

DMB LD : Data Memory Barrier (Load) — all loads before the barrier complete before any later load or store.

```asm
; R1 = address of STATUS (0x12340010)
LDR     R1, =0x12340010
; Read status
LDR     R0, [R1]                ; Expect updated status

; Read barrier: STATUS must complete first
DMB     LD

; R3 = address of DATA (0x12340014)
LDR     R3, =0x12340014
; Read the data
LDR     R2, [R3]
```
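The C rendering of this read side again uses a fence, which compiles to a DMB on ARM. The register pointers and the choice of bit 0 as the "ready" bit are assumptions for the sketch:

```c
#include <stdatomic.h>
#include <stdint.h>

/* STATUS-then-DATA, with the fence playing the role of DMB LD. */
static uint32_t read_when_ready(volatile uint32_t *status,
                                volatile uint32_t *data)
{
    while ((*status & 1u) == 0)                 /* poll until ready       */
        ;
    atomic_thread_fence(memory_order_acquire);  /* STATUS load completes  */
    return *data;                               /* before DATA is read    */
}
```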

MMIO reads and writes are operations on hardware, and, as the examples above show, their order controls device behaviour. This is why device memory needs barriers.

Two CPU Cores, Shared DDR

Assume two cores share a structure in DDR:

```c
struct shared {
    int data;
    int flag;
} S;
```

Core0: write data, set flag

```c
S.data = 123;   // store #1
S.flag = 1;     // store #2
```

Core1: read flag, then read data

```c
if (S.flag == 1)        // load #1
    val = S.data;       // load #2
```

Behaviour without barrier:

  • Core0’s data write may sit in its store buffer
  • Core0’s flag write may drain early and be visible first
  • Core1 may observe flag == 1 before the new data is visible

With barriers:

spin_lock() → acquire barrier
spin_unlock() → release barrier

Core0:

```c
spin_lock();
S.data = 123;     // may still be in Core0's store buffer
S.flag = 1;       // may also be buffered
spin_unlock();    // release barrier flushes BOTH writes
```

The release barrier drains the store buffer, making both writes visible before the lock is released.

Core1:

```c
spin_lock();       // acquire barrier
tmp_flag = S.flag; // cannot be reordered before lock
tmp_data = S.data; // cannot be reordered before reading flag
spin_unlock();
```

The acquire barrier prevents either load from being reordered above the lock acquisition.
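The same pattern can be sketched without a lock, using C11 release/acquire atomics: the release store on `flag` plays the role of `spin_unlock()`, and the acquire load plays the role of `spin_lock()`. The function names are hypothetical; the ordering guarantee is the point:

```c
#include <pthread.h>
#include <stdatomic.h>

static int shared_data;          /* plain field, like S.data */
static atomic_int shared_flag;   /* publication flag, like S.flag */

static void *writer(void *unused)   /* "Core0" */
{
    (void)unused;
    shared_data = 123;                                       /* store #1 */
    atomic_store_explicit(&shared_flag, 1,
                          memory_order_release);             /* store #2 */
    return NULL;
}

static void *reader(void *out)      /* "Core1" */
{
    while (atomic_load_explicit(&shared_flag,
                                memory_order_acquire) != 1)
        ;                     /* once flag is seen, data must be visible */
    *(int *)out = shared_data;
    return NULL;
}

static int run_demo(void)
{
    int val = 0;
    pthread_t t0, t1;
    shared_data = 0;
    atomic_store(&shared_flag, 0);
    pthread_create(&t1, NULL, reader, &val);
    pthread_create(&t0, NULL, writer, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return val;   /* always 123: acquire pairs with release */
}
```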

Synchronising Device for DMA

For example:

  • CPU writes a buffer
  • CPU must issue a barrier
  • CPU writes a “start DMA” register

The barrier ensures data fields reach DDR before the device reads them.
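The three steps can be sketched as follows. Here `buf` stands in for the DMA buffer in DDR and `start_reg` for a hypothetical "start DMA" register; the fence again maps to a DMB on ARM:

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

static void start_dma(uint32_t *buf, size_t n, volatile uint32_t *start_reg)
{
    for (size_t i = 0; i < n; i++)              /* 1. CPU writes buffer   */
        buf[i] = (uint32_t)i;
    atomic_thread_fence(memory_order_release);  /* 2. barrier: buffer     */
                                                /*    visible before...   */
    *start_reg = 1;                             /* 3. ...the "start DMA"  */
}                                               /*    register write      */
```

(In real driver code the buffer must also be in DMA-coherent memory or be explicitly flushed from the cache; the barrier alone only fixes ordering, not cacheability.)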
