Introduction
Memory-Mapped I/O (MMIO) is an address-mapping technique where device registers are assigned fixed ranges in the system’s physical address map. When the CPU issues a load or store to these regions, the transaction is routed to a hardware block instead of DRAM. There are no special instructions involved; ordinary load/store operations form the entire I/O protocol.
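To make this concrete, here is a minimal bare-metal C sketch: plain volatile loads and stores to fixed addresses are the entire protocol, with no special instructions or intrinsics. The register addresses are illustrative and reuse the 0x1234_0000 device window that appears in the examples later in this article.

#include <stdint.h>

/* Illustrative register addresses in the example device's MMIO window. */
#define DEV_CFG     (*(volatile uint32_t *)0x12340000UL)
#define DEV_STATUS  (*(volatile uint32_t *)0x12340010UL)

void poke_device(void)
{
    DEV_CFG = 0xAAAA5555u;          /* an ordinary store, routed to the device */
    uint32_t status = DEV_STATUS;   /* an ordinary load, returning a register value */
    (void)status;
}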
VA to PA Conversion
The CPU always issues virtual addresses (VAs). Before transactions reach the interconnect, the Memory Management Unit (MMU) translates VAs into physical addresses (PAs). The MMU uses a Translation Lookaside Buffer (TLB) for cached translations and performs a page-table walk on TLB misses.
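One way to see the VA/PA split from software (a sketch, assuming a Linux system with /dev/mem enabled and root privileges) is to map a physical MMIO window into a process: the physical address appears only as the mmap() offset, while the program itself only ever dereferences the virtual address the kernel hands back.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    /* Map one 4 KiB page of the device's physical window (example address).
     * The returned pointer is a virtual address; the MMU translates every
     * access through it back to the 0x1234_0xxx physical range. */
    void *map = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0x12340000UL);
    if (map == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    volatile uint32_t *regs = map;
    printf("reg[0] = 0x%08x\n", (unsigned)regs[0]);

    munmap(map, 0x1000);
    close(fd);
    return 0;
}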
MMIO Regions
MMIO regions occupy fixed physical address ranges defined by the SoC. Linux receives the physical MMIO layout from the Device Tree (DT), reserves the corresponding regions, and builds virtual mappings with device-type memory attributes.
+-----------------------------------------+
| SoC Physical Address Map                |
+-----------------------------------------+
| 0x0000_0000 - 0x0FFF_FFFF : DDR         |
| 0x1000_0000 - 0x1000_0FFF : UART MMIO   |
| 0x1234_0000 - 0x1234_0FFF : Device MMIO |
| ...                                     |
+-----------------------------------------+
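In kernel code this is what ioremap() and the readl()/writel() accessors provide. Below is a minimal sketch, not a real driver: the base address and size are hard-coded from the example map above, and UART_DR_OFFSET is a hypothetical register offset. A real driver would take the base and size from the Device Tree via platform_get_resource() or devm_ioremap_resource().

#include <linux/module.h>
#include <linux/io.h>

#define UART_MMIO_BASE  0x10000000UL   /* physical base from the address map above */
#define UART_MMIO_SIZE  0x1000UL       /* 4 KiB region from the address map above */
#define UART_DR_OFFSET  0x0            /* hypothetical data-register offset */

static void __iomem *uart_base;

static int __init mmio_demo_init(void)
{
    /* Build a virtual mapping with device-type memory attributes. */
    uart_base = ioremap(UART_MMIO_BASE, UART_MMIO_SIZE);
    if (!uart_base)
        return -ENOMEM;

    /* readl()/writel() compile down to ordinary loads/stores to the
     * device mapping (plus the ordering the kernel guarantees). */
    writel('A', uart_base + UART_DR_OFFSET);
    pr_info("mmio_demo: DR readback 0x%08x\n", readl(uart_base + UART_DR_OFFSET));
    return 0;
}

static void __exit mmio_demo_exit(void)
{
    iounmap(uart_base);
}

module_init(mmio_demo_init);
module_exit(mmio_demo_exit);
MODULE_LICENSE("GPL");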
Load and Store Operations
A CPU MMIO access follows the same initial path as a normal load or store: instruction issue → VA → MMU translation → PA. After translation, the PA is emitted onto the SoC interconnect. The interconnect decodes the PA and routes the access to the MMIO peripheral instead of DDR.
Barriers and Ordering
Modern CPUs reorder memory operations aggressively. Loads may complete early, stores may be buffered, and multiple memory streams may execute out of order. This behaviour is essential for performance, but it breaks the assumptions needed when software interacts with hardware through MMIO registers.
Programming a Device Without a Write Barrier
Example: A device expects the CFG register to be written before the DOORBELL register.
LDR: Loads a word (32-bit).
STR: Stores a word (32-bit).
; R0 = CFG value (0xAAAA5555)
LDR R0, =0xAAAA5555
; R1 = CFG register (0x12340000)
LDR R1, =0x12340000
; Write CFG
STR R0, [R1]
; R2 = DOORBELL value (0x1)
MOV R2, #1
; R3 = DOORBELL register (0x12340004)
LDR R3, =0x12340004
; Kick the device
STR R2, [R3]
Without a write barrier, the CPU may reorder these two STRs, so the device can observe the doorbell before the configuration.
The store buffer (or write buffer) is a queue inside the core that absorbs stores quickly, allowing the CPU to proceed without waiting for the write to actually reach memory. It is a common source of reordering between stores and later memory operations.
Program order : CFG → DOORBELL
Device sees : DOORBELL → CFG
Insert a store barrier between the two writes.
DMB ST : Data Memory Barrier (Store)
; R0 = CFG value (0xAAAA5555)
LDR R0, =0xAAAA5555
; R1 = CFG register (0x12340000)
LDR R1, =0x12340000
; Write CFG
STR R0, [R1]
; Store barrier: stores before this are observed before stores after it
DMB ST
; R2 = DOORBELL value (0x1)
MOV R2, #1
; R3 = DOORBELL register (0x12340004)
LDR R3, =0x12340004
; Kick the device
STR R2, [R3]
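The same sequence in C, for reference: a sketch assuming a GCC-style toolchain and the same register addresses as the assembly above. The inline dmb st is the store variant of the barrier; the Linux writel() accessor provides equivalent ordering on ARM.

#include <stdint.h>

#define REG_CFG       (*(volatile uint32_t *)0x12340000UL)
#define REG_DOORBELL  (*(volatile uint32_t *)0x12340004UL)

static inline void dmb_st(void)
{
    __asm__ volatile("dmb st" ::: "memory");   /* store-store barrier */
}

void kick_device(void)
{
    REG_CFG = 0xAAAA5555u;   /* write CFG first */
    dmb_st();                /* CFG is made observable before the doorbell */
    REG_DOORBELL = 1u;       /* then ring the doorbell */
}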
Reading Device Status Without a Read Barrier
Example: Many devices expose a STATUS register and a DATA register (read data only when STATUS indicates ready).
; R1 = address of STATUS (0x12340010)
LDR R1, =0x12340010
; Read status
LDR R0, [R1] ; Expect updated status
; R3 = address of DATA (0x12340014)
LDR R3, =0x12340014
; Read the data
LDR R2, [R3]
The load queue (or pending-load buffer) is the structure inside the core that turns differing memory response times into observable reordering. It holds all outstanding loads, receives their results as they return from the bus, and forwards whichever response arrives first to the pipeline. That forwarding step is what exposes the reordering to the CPU. Without this queue, the core would have to wait for each load to complete before issuing the next one, eliminating any possibility of a later load completing earlier even when its data is available sooner.
Program order : STATUS → DATA
Actual sequence : DATA → STATUS
Insert a load barrier between the two reads.
DMB LD : Data Memory Barrier (Load)
; R1 = address of STATUS (0x12340010)
LDR R1, =0x12340010
; Read status
LDR R0, [R1] ; Expect updated status
; Read barrier: STATUS must complete first
DMB LD
; R3 = address of DATA (0x12340014)
LDR R3, =0x12340014
; Read the data
LDR R2, [R3]
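And the same read sequence in C: a sketch assuming a GCC-style toolchain and an ARMv8 target (the LD barrier variant is not available on ARMv7). The "ready" bit tested here is hypothetical; use the bit your device actually defines.

#include <stdint.h>

#define REG_STATUS  (*(volatile uint32_t *)0x12340010UL)
#define REG_DATA    (*(volatile uint32_t *)0x12340014UL)

static inline void dmb_ld(void)
{
    __asm__ volatile("dmb ld" ::: "memory");   /* load-load barrier */
}

uint32_t read_if_ready(void)
{
    uint32_t status = REG_STATUS;   /* read STATUS first */
    dmb_ld();                       /* STATUS load is ordered before the DATA load */
    if (status & 1u)                /* hypothetical "ready" bit */
        return REG_DATA;
    return 0;
}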
MMIO reads and writes are operations on hardware, and as the examples above show, their order controls device behaviour. This is why device memory needs barriers.
Two CPU Cores, Shared DDR
Assume two cores share a structure in DDR:
struct shared {
    int data;
    int flag;
} S;
Core0: write data, set flag
S.data = 123; // store #1
S.flag = 1; // store #2
Core1: read flag, then read data
if (S.flag == 1)     // load #1
    val = S.data;    // load #2
Behaviour without barrier:
- Core0’s data write may sit in its store buffer
- Core0’s flag write may drain early and be visible first
- Core1 may observe flag == 1 before the new data is visible
With barriers:
spin_lock() → acquire barrier
spin_unlock() → release barrier
Core0:
spin_lock();
S.data = 123; // may still be in Core0's store buffer
S.flag = 1; // may also be buffered
spin_unlock(); // release barrier flushes BOTH writes
The release barrier makes both buffered writes visible before the lock release itself becomes visible.
Core1:
spin_lock(); // acquire barrier
tmp_flag = S.flag; // cannot be reordered before lock
tmp_data = S.data; // cannot be reordered before reading flag
spin_unlock();
Acquire barrier forbids both loads from slipping above the lock.
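For reference, the same data/flag hand-off can also be expressed without a lock, using release/acquire ordering directly. Here is a sketch in C11 atomics; the Linux kernel equivalents would be smp_store_release() and smp_load_acquire().

#include <stdatomic.h>

struct shared {
    int        data;
    atomic_int flag;
} S;

/* Core0: publish data, then set flag with release ordering, so the
 * data store cannot become visible after the flag store. */
void core0(void)
{
    S.data = 123;
    atomic_store_explicit(&S.flag, 1, memory_order_release);
}

/* Core1: read flag with acquire ordering; if it observes 1, the data
 * store from Core0 is guaranteed to be visible as well. */
int core1(void)
{
    if (atomic_load_explicit(&S.flag, memory_order_acquire) == 1)
        return S.data;
    return -1;
}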
Synchronising with a Device for DMA
For example:
- CPU writes a buffer
- CPU must issue a barrier
- CPU writes a “start DMA” register
The barrier ensures data fields reach DDR before the device reads them.
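A kernel-style sketch of those three steps; the descriptor layout, register offset, and function names here are hypothetical, for illustration only.

#include <linux/io.h>
#include <linux/types.h>

#define REG_DMA_START  0x20   /* hypothetical "start DMA" register offset */

struct dma_desc {             /* hypothetical descriptor the device reads */
    u32 addr;
    u32 len;
};

void start_dma(struct dma_desc *desc, void __iomem *regs,
               u32 buf_addr, u32 buf_len)
{
    /* 1. CPU writes the buffer/descriptor in DDR. */
    desc->addr = buf_addr;
    desc->len  = buf_len;

    /* 2. Barrier: the descriptor writes must reach memory before the
     *    device can be told to fetch them. */
    wmb();

    /* 3. CPU writes the "start DMA" register; the device now observes an
     *    up-to-date descriptor. (On ARM, writel() itself also orders
     *    prior memory writes before the MMIO store.) */
    writel(1, regs + REG_DMA_START);
}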



