Introduction
Memory-Mapped I/O (MMIO) is an address-mapping technique where device registers are assigned fixed ranges in the system’s physical address map. When the CPU issues a load or store to these regions, the transaction is routed to a hardware block instead of DRAM. There are no special instructions involved; ordinary load/store operations form the entire I/O protocol.
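To make this concrete, here is a minimal bare-metal C sketch: plain volatile loads and stores to fixed addresses are the entire protocol, with no special instructions or intrinsics. The register addresses are illustrative and reuse the 0x1234_0000 device window that appears in the examples later in this article.

#include <stdint.h>

/* Illustrative register addresses in the example device's MMIO window. */
#define DEV_CFG     (*(volatile uint32_t *)0x12340000UL)
#define DEV_STATUS  (*(volatile uint32_t *)0x12340010UL)

void poke_device(void)
{
    DEV_CFG = 0xAAAA5555u;          /* an ordinary store, routed to the device */
    uint32_t status = DEV_STATUS;   /* an ordinary load, returning a register value */
    (void)status;
}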
VA to PA Conversion
The CPU always issues virtual addresses (VAs). Before transactions reach the interconnect, the Memory Management Unit (MMU) translates VAs into physical addresses (PAs). The MMU uses a Translation Lookaside Buffer (TLB) for cached translations and performs a page-table walk on TLB misses.
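One way to see the VA/PA split from software (a sketch, assuming a Linux system with /dev/mem enabled and root privileges) is to map a physical MMIO window into a process: the physical address appears only as the mmap() offset, while the program itself only ever dereferences the virtual address the kernel hands back.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    /* Map one 4 KiB page of the device's physical window (example address).
     * The returned pointer is a virtual address; the MMU translates every
     * access through it back to the 0x1234_0xxx physical range. */
    void *map = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0x12340000UL);
    if (map == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    volatile uint32_t *regs = map;
    printf("reg[0] = 0x%08x\n", (unsigned)regs[0]);

    munmap(map, 0x1000);
    close(fd);
    return 0;
}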
MMIO Regions
MMIO regions occupy fixed physical address ranges defined by the SoC. Linux receives the physical MMIO layout from the Device Tree (DT), reserves the corresponding regions, and builds virtual mappings with device-type memory attributes.
+-----------------------------------------+
| SoC Physical Address Map                |
+-----------------------------------------+
| 0x0000_0000 - 0x0FFF_FFFF : DDR         |
| 0x1000_0000 - 0x1000_0FFF : UART MMIO   |
| 0x1234_0000 - 0x1234_0FFF : Device MMIO |
| ...                                     |
+-----------------------------------------+
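In kernel code this is what ioremap() and the readl()/writel() accessors provide. Below is a minimal sketch, not a real driver: the base address and size are hard-coded from the example map above, and UART_DR_OFFSET is a hypothetical register offset. A real driver would take the base and size from the Device Tree via platform_get_resource() or devm_ioremap_resource().

#include <linux/module.h>
#include <linux/io.h>

#define UART_MMIO_BASE  0x10000000UL   /* physical base from the address map above */
#define UART_MMIO_SIZE  0x1000UL       /* 4 KiB region from the address map above */
#define UART_DR_OFFSET  0x0            /* hypothetical data-register offset */

static void __iomem *uart_base;

static int __init mmio_demo_init(void)
{
    /* Build a virtual mapping with device-type memory attributes. */
    uart_base = ioremap(UART_MMIO_BASE, UART_MMIO_SIZE);
    if (!uart_base)
        return -ENOMEM;

    /* readl()/writel() compile down to ordinary loads/stores to the
     * device mapping (plus the ordering the kernel guarantees). */
    writel('A', uart_base + UART_DR_OFFSET);
    pr_info("mmio_demo: DR readback 0x%08x\n", readl(uart_base + UART_DR_OFFSET));
    return 0;
}

static void __exit mmio_demo_exit(void)
{
    iounmap(uart_base);
}

module_init(mmio_demo_init);
module_exit(mmio_demo_exit);
MODULE_LICENSE("GPL");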
Load and Store Operations
A CPU MMIO access follows the same initial path as a normal load or store: instruction issue → VA → MMU translation → PA. After translation, the PA is emitted onto the SoC interconnect. The interconnect decodes the PA and routes the access to the MMIO peripheral instead of DDR.
Barriers and Ordering
Modern CPUs reorder memory operations aggressively. Loads may complete early, stores may be buffered, and multiple memory streams may execute out of order. This behaviour is essential for performance, but it breaks the assumptions needed when software interacts with hardware through MMIO registers.
Programming a Device Without a Write Barrier
Example: A device expects the CFG register to be written before the DOORBELL register.
LDR: Loads a word (32-bit).
STR: Stores a word (32-bit).
; R0 = CFG value (0xAAAA5555)
LDR R0, =0xAAAA5555
; R1 = CFG register (0x12340000)
LDR R1, =0x12340000
; Write CFG
STR R0, [R1]
; R2 = DOORBELL value (0x1)
MOV R2, #1
; R3 = DOORBELL register (0x12340004)
LDR R3, =0x12340004
; Kick the device
STR R2, [R3]
Without a write barrier, the CPU may reorder these two STRs, so the device can observe the doorbell before the configuration.
The store buffer (or write buffer) is a queue inside the core that absorbs stores quickly, allowing the CPU to proceed without waiting for the write to actually reach memory. It is a common source of reordering between stores and later memory operations.
Program order : CFG → DOORBELL
Device sees : DOORBELL → CFG
Insert a store barrier between the two writes.
DMB ST : Data Memory Barrier (Store)
; R0 = CFG value (0xAAAA5555)
LDR R0, =0xAAAA5555
; R1 = CFG register (0x12340000)
LDR R1, =0x12340000
; Write CFG
STR R0, [R1]
; Store barrier: stores before this are observed before stores after it
DMB ST
; R2 = DOORBELL value (0x1)
MOV R2, #1
; R3 = DOORBELL register (0x12340004)
LDR R3, =0x12340004
; Kick the device
STR R2, [R3]
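The same sequence in C, for reference: a sketch assuming a GCC-style toolchain and the same register addresses as the assembly above. The inline dmb st is the store variant of the barrier; the Linux writel() accessor provides equivalent ordering on ARM.

#include <stdint.h>

#define REG_CFG       (*(volatile uint32_t *)0x12340000UL)
#define REG_DOORBELL  (*(volatile uint32_t *)0x12340004UL)

static inline void dmb_st(void)
{
    __asm__ volatile("dmb st" ::: "memory");   /* store-store barrier */
}

void kick_device(void)
{
    REG_CFG = 0xAAAA5555u;   /* write CFG first */
    dmb_st();                /* CFG is made observable before the doorbell */
    REG_DOORBELL = 1u;       /* then ring the doorbell */
}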
Reading Device Status Without a Read Barrier
Example: Many devices expose a STATUS register and a DATA register (read data only when STATUS indicates ready).
; R1 = address of STATUS (0x12340010)
LDR R1, =0x12340010
; Read status
LDR R0, [R1] ; Expect updated status
; R3 = address of DATA (0x12340014)
LDR R3, =0x12340014
; Read the data
LDR R2, [R3]
The load queue (or pending-load buffer) is the structure inside the core that turns differing memory response times into observable reordering. It holds all outstanding loads, receives their results as they return from the bus, and forwards whichever response arrives first to the pipeline. That forwarding step is what exposes the reordering to the CPU. Without this queue, the core would have to wait for each load to complete before issuing the next one, eliminating any possibility of a later load completing earlier even when its data is available sooner.
Program order : STATUS → DATA
Actual sequence : DATA → STATUS
Insert a load barrier between the two reads.
DMB LD : Data Memory Barrier (Load)
; R1 = address of STATUS (0x12340010)
LDR R1, =0x12340010
; Read status
LDR R0, [R1] ; Expect updated status
; Read barrier: STATUS must complete first
DMB LD
; R3 = address of DATA (0x12340014)
LDR R3, =0x12340014
; Read the data
LDR R2, [R3]
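And the same read sequence in C: a sketch assuming a GCC-style toolchain and an ARMv8 target (the LD barrier variant is not available on ARMv7). The "ready" bit tested here is hypothetical; use the bit your device actually defines.

#include <stdint.h>

#define REG_STATUS  (*(volatile uint32_t *)0x12340010UL)
#define REG_DATA    (*(volatile uint32_t *)0x12340014UL)

static inline void dmb_ld(void)
{
    __asm__ volatile("dmb ld" ::: "memory");   /* load-load barrier */
}

uint32_t read_if_ready(void)
{
    uint32_t status = REG_STATUS;   /* read STATUS first */
    dmb_ld();                       /* STATUS load is ordered before the DATA load */
    if (status & 1u)                /* hypothetical "ready" bit */
        return REG_DATA;
    return 0;
}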
MMIO reads and writes are operations on hardware, and as the examples above show, their order controls device behaviour. This is why device memory needs barriers.
Two CPU Cores, Shared DDR
Assume two cores share a structure in DDR:
struct shared {
    int data;
    int flag;
} S;
Core0: write data, set flag
S.data = 123; // store #1
S.flag = 1; // store #2
Core1: read flag, then read data
if (S.flag == 1)     // load #1
    val = S.data;    // load #2
Behaviour without barrier:
- Core0’s data write may sit in its store buffer
- Core0’s flag write may drain early and be visible first
- Core1 may observe flag == 1 before the new data is visible
With barriers:
spin_lock() → acquire barrier
spin_unlock() → release barrier
Core0:
spin_lock();
S.data = 123; // may still be in Core0's store buffer
S.flag = 1; // may also be buffered
spin_unlock(); // release barrier flushes BOTH writes
The release barrier makes both buffered writes visible before the lock release itself becomes visible.
Core1:
spin_lock(); // acquire barrier
tmp_flag = S.flag; // cannot be reordered before lock
tmp_data = S.data; // cannot be reordered before reading flag
spin_unlock();
Acquire barrier forbids both loads from slipping above the lock.
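For reference, the same data/flag hand-off can also be expressed without a lock, using release/acquire ordering directly. Here is a sketch in C11 atomics; the Linux kernel equivalents would be smp_store_release() and smp_load_acquire().

#include <stdatomic.h>

struct shared {
    int        data;
    atomic_int flag;
} S;

/* Core0: publish data, then set flag with release ordering, so the
 * data store cannot become visible after the flag store. */
void core0(void)
{
    S.data = 123;
    atomic_store_explicit(&S.flag, 1, memory_order_release);
}

/* Core1: read flag with acquire ordering; if it observes 1, the data
 * store from Core0 is guaranteed to be visible as well. */
int core1(void)
{
    if (atomic_load_explicit(&S.flag, memory_order_acquire) == 1)
        return S.data;
    return -1;
}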
Synchronising with a Device for DMA
For example:
- CPU writes a buffer
- CPU must issue a barrier
- CPU writes a “start DMA” register
The barrier ensures data fields reach DDR before the device reads them.
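A kernel-style sketch of those three steps; the descriptor layout, register offset, and function names here are hypothetical, for illustration only.

#include <linux/io.h>
#include <linux/types.h>

#define REG_DMA_START  0x20   /* hypothetical "start DMA" register offset */

struct dma_desc {             /* hypothetical descriptor the device reads */
    u32 addr;
    u32 len;
};

void start_dma(struct dma_desc *desc, void __iomem *regs,
               u32 buf_addr, u32 buf_len)
{
    /* 1. CPU writes the buffer/descriptor in DDR. */
    desc->addr = buf_addr;
    desc->len  = buf_len;

    /* 2. Barrier: the descriptor writes must reach memory before the
     *    device can be told to fetch them. */
    wmb();

    /* 3. CPU writes the "start DMA" register; the device now observes an
     *    up-to-date descriptor. (On ARM, writel() itself also orders
     *    prior memory writes before the MMIO store.) */
    writel(1, regs + REG_DMA_START);
}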



