<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ripan Deuri</title>
    <description>The latest articles on DEV Community by Ripan Deuri (@ripan030).</description>
    <link>https://dev.to/ripan030</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3575291%2F8438c4dd-3058-457e-a2fb-e77ff8f27ad3.png</url>
      <title>DEV Community: Ripan Deuri</title>
      <link>https://dev.to/ripan030</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ripan030"/>
    <language>en</language>
    <item>
      <title>Understanding PCIe Data Link Layer</title>
      <dc:creator>Ripan Deuri</dc:creator>
      <pubDate>Wed, 08 Apr 2026 13:41:55 +0000</pubDate>
      <link>https://dev.to/ripan030/understanding-pcie-data-link-layer-9jp</link>
      <guid>https://dev.to/ripan030/understanding-pcie-data-link-layer-9jp</guid>
      <description>&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;PCI Express (PCIe) uses a layered architecture to separate concerns like transaction creation, reliability, and physical transmission. The &lt;strong&gt;Data Link Layer (DLL)&lt;/strong&gt; ensures reliable communication between directly connected devices.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Transaction Layer&lt;/strong&gt; generates packets (TLPs), and the &lt;strong&gt;Physical Layer&lt;/strong&gt; transmits bits. The &lt;strong&gt;Data Link Layer&lt;/strong&gt; sits between them, guaranteeing that TLPs are delivered correctly, in order, and without corruption over a single PCIe link.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. PCIe Layers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Transaction Layer (TL):&lt;/strong&gt; Creates Transaction Layer Packets (TLPs)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Link Layer (DLL):&lt;/strong&gt; Ensures reliable delivery of TLPs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Physical Layer (PHY):&lt;/strong&gt; Handles signaling and bit transmission&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  3. Data Link Layer (DLL)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3.1 What Data Link Layer Does
&lt;/h3&gt;

&lt;p&gt;The DLL's responsibilities are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reliable delivery: Sequence numbers + ACK/NAK + Replay buffer&lt;/li&gt;
&lt;li&gt;Error detection: LCRC (Link CRC) on every TLP&lt;/li&gt;
&lt;li&gt;Flow control: Credit based system to prevent buffer overflow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DLL operates on two types of packets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TLPs&lt;/strong&gt; (Transaction Layer Packets) - actual data from the Transaction Layer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DLLPs&lt;/strong&gt; (Data Link Layer Packets) - control packets used by the DLL itself (ACK, NAK, flow control updates)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3.2 Sequence Numbers
&lt;/h3&gt;

&lt;p&gt;Each outgoing TLP is assigned a &lt;strong&gt;sequence number&lt;/strong&gt; and &lt;strong&gt;LCRC&lt;/strong&gt; by the Data Link Layer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+--------------------------------------+
| Seq#   |    TLP Payload    | LCRC    |
| 12-bit |                   | 32-bit  |
+--------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;DLL Tx steps&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Assign next sequence number&lt;/li&gt;
&lt;li&gt;Prepend seq#&lt;/li&gt;
&lt;li&gt;Compute LCRC over [Seq# | TLP]&lt;/li&gt;
&lt;li&gt;Append LCRC&lt;/li&gt;
&lt;li&gt;Save copy in Replay buffer&lt;/li&gt;
&lt;li&gt;Send [Seq# | TLP | LCRC] to Physical Layer&lt;/li&gt;
&lt;/ul&gt;
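&lt;p&gt;The Tx steps above can be sketched in Python. This is an illustrative model only: &lt;code&gt;zlib.crc32&lt;/code&gt; stands in for the real PCIe LCRC polynomial, and the framing is simplified.&lt;/p&gt;

```python
import struct
import zlib

class DllTx:
    """Illustrative model of the DLL transmit path."""
    def __init__(self):
        self.next_seq = 0          # 12-bit sequence counter
        self.replay_buffer = {}    # seq -> framed copy awaiting ACK

    def send(self, tlp_bytes):
        seq = self.next_seq
        self.next_seq = (self.next_seq + 1) % 4096       # wrap at 12 bits
        framed = struct.pack(">H", seq) + tlp_bytes      # prepend Seq#
        framed += struct.pack(">I", zlib.crc32(framed))  # append LCRC stand-in
        self.replay_buffer[seq] = framed                 # save copy for replay
        return framed                                    # goes to the Physical Layer
```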

&lt;p&gt;&lt;strong&gt;DLL Rx steps&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Receive [Seq# | TLP | LCRC] from Physical Layer&lt;/li&gt;
&lt;li&gt;Recompute LCRC - does it match?

&lt;ul&gt;
&lt;li&gt;YES: continue&lt;/li&gt;
&lt;li&gt;NO: send NAK DLLP, discard TLP&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Check Seq# - is it the expected one?

&lt;ul&gt;
&lt;li&gt;YES: continue&lt;/li&gt;
&lt;li&gt;NO: send NAK DLLP, discard TLP&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Strip Seq# and LCRC&lt;/li&gt;

&lt;li&gt;Pass TLP up to Transaction Layer&lt;/li&gt;

&lt;li&gt;Send ACK DLLP back to the transmitter&lt;/li&gt;

&lt;/ul&gt;
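&lt;p&gt;The Rx steps mirror this in a small check function (again a sketch: &lt;code&gt;zlib.crc32&lt;/code&gt; stands in for the LCRC, and a real receiver also schedules ACK DLLPs rather than returning them inline).&lt;/p&gt;

```python
import struct
import zlib

def dll_rx(framed, expected_seq):
    """Illustrative DLL receive check.
    Returns ("ACK", tlp) on success or ("NAK", None) on any error."""
    body = framed[:-4]
    rx_crc = struct.unpack(">I", framed[-4:])[0]
    if zlib.crc32(body) != rx_crc:        # bad LCRC: corrupted in flight
        return ("NAK", None)
    seq = struct.unpack(">H", body[:2])[0]
    if seq != expected_seq:               # unexpected sequence number
        return ("NAK", None)
    return ("ACK", body[2:])              # strip Seq# and LCRC, pass TLP up
```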

&lt;h3&gt;
  
  
  3.3 ACK/NAK Protocol
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Replay Buffer&lt;/strong&gt;&lt;br&gt;
The transmitter keeps a &lt;strong&gt;Replay Buffer&lt;/strong&gt; - a copy of every TLP that has been sent but not yet acknowledged.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Replay Buffer:
+--------------------------------+
| Seq = 40 | Seq = 41 | Seq = 42 |
+--------------------------------+
      ^
      |
     Once ACK(40) is received, Seq = 40 is removed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;ACK&lt;/strong&gt;&lt;br&gt;
When the receiver successfully receives a TLP, it sends back an &lt;strong&gt;ACK DLLP&lt;/strong&gt; with the sequence number of the last successfully received TLP.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;ACK is cumulative. An ACK (Seq = 41) means all TLPs up to and including Seq = 41 have been received.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;NAK&lt;/strong&gt;&lt;br&gt;
When the receiver detects a corrupted TLP (bad LCRC or an out-of-order sequence number), it sends a &lt;strong&gt;NAK DLLP&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Upon receiving a NAK, the transmitter replays all unACK'd TLPs from the replay buffer, starting from the NAK'd sequence number.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A Complete ACK/NAK Example&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RC DLL                                  EP DLL
  |                                       |
  |-- TLP (Seq=41) ----------------------&amp;gt;| OK
  |-- TLP (Seq=42) ----------------------&amp;gt;| OK
  |-- TLP (Seq=43) ----------------------&amp;gt;| CRC Error!
  |                                       |
  |&amp;lt;-- ACK (Seq=42)-----------------------| ACKs Seq=41 and Seq=42 cumulatively
  |&amp;lt;-- NAK (Seq=43)-----------------------| NAK Seq=43, requests retransmit
  |                                       |
  | [Transmitter replays from Seq = 43]   |
  |-- TLP (Seq=43)-----------------------&amp;gt;| OK
  |-- TLP (Seq=44)-----------------------&amp;gt;| OK
  |                                       |
  |&amp;lt;-- ACK (Seq=44)-----------------------| ACKs Seq=43 and Seq=44 cumulatively
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the transmitter does not receive an ACK within a timeout period, it assumes the TLP was lost and replays all unACK'd TLPs. This handles the case where the ACK itself was lost.&lt;/p&gt;
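&lt;p&gt;The cumulative-ACK release and NAK-triggered replay can be sketched as a small buffer model (illustrative; a real implementation also runs the replay timer mentioned above):&lt;/p&gt;

```python
class ReplayBuffer:
    """Sketch of cumulative ACK release and NAK-triggered replay."""
    def __init__(self):
        self.pending = {}   # seq -> framed TLP, kept in transmit order

    def store(self, seq, framed):
        self.pending[seq] = framed

    def ack(self, acked_seq):
        # ACK is cumulative: release everything up to and including acked_seq
        for seq in [s for s in self.pending if acked_seq >= s]:
            del self.pending[seq]

    def nak(self, bad_seq):
        # Replay all unACK'd TLPs from bad_seq onward, oldest first
        return [f for s, f in self.pending.items() if s >= bad_seq]
```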

&lt;h3&gt;
  
  
  3.4 Flow Control
&lt;/h3&gt;

&lt;p&gt;PCIe uses a credit-based flow control system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The receiver advertises how many credits it has (how much buffer space is available)&lt;/li&gt;
&lt;li&gt;The transmitter can only send a TLP if it has enough credits to cover that TLP&lt;/li&gt;
&lt;li&gt;After the receiver processes a TLP and frees buffer space, it sends a &lt;strong&gt;Flow Control Update DLLP&lt;/strong&gt; to return credits to the transmitter&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Credit Units&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Credit Type&lt;/th&gt;
&lt;th&gt;Space&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Header Credit&lt;/td&gt;
&lt;td&gt;1 credit = space for 1 TLP header&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Credit&lt;/td&gt;
&lt;td&gt;1 credit = space for 4 DWORDs of TLP Data&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
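&lt;p&gt;With those units, the credits a single TLP consumes follow directly (a sketch; the function name is illustrative):&lt;/p&gt;

```python
def credits_needed(payload_bytes):
    """Credits one TLP consumes: 1 header credit, plus one data credit
    per 4 DWORDs (16 bytes) of payload, rounded up."""
    data_credits = -(-payload_bytes // 16)   # ceiling division
    return {"header": 1, "data": data_credits}
```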

&lt;p&gt;PCIe tracks credits separately for different types of traffic to prevent &lt;strong&gt;deadlock&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Credit Pool&lt;/th&gt;
&lt;th&gt;Used For&lt;/th&gt;
&lt;th&gt;Completion Expected?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Posted (P)&lt;/td&gt;
&lt;td&gt;Memory Write, Messages&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Non-Posted (NP)&lt;/td&gt;
&lt;td&gt;Memory Read, Config Read/Write, I/O Read/Write&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Completion (Cpl)&lt;/td&gt;
&lt;td&gt;Completions (responses to NP requests)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Flow Control Initialization&lt;/strong&gt;&lt;br&gt;
Before any TLPs can be sent, the two sides must initialize flow control. This happens right after the Physical Layer declares the link up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Both sides send InitFC1 DLLPs advertising their initial credits&lt;/li&gt;
&lt;li&gt;Both sides send InitFC2 DLLPs confirming they received the other side's credits&lt;/li&gt;
&lt;li&gt;Flow Control is initialized - TLPs can flow now&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  3.5 DLLPs - Data Link Layer Packets
&lt;/h3&gt;

&lt;p&gt;DLLPs are &lt;strong&gt;small control packets&lt;/strong&gt; used exclusively by the Data Link Layer. They are never seen by the Transaction Layer - they are created and consumed entirely within the DLL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DLLP Format&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+---------------------------------------------+
|Type (1 byte)|Payload (3 bytes)|CRC (2 bytes)|
+---------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DLLPs are much smaller than TLPs. They are sent in the gaps between TLPs.&lt;/p&gt;
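&lt;p&gt;Packing an ACK/NAK DLLP into that 6-byte format can be sketched as follows. The type encodings and the checksum are placeholders chosen for illustration; the spec defines the actual type values and a CRC-16.&lt;/p&gt;

```python
import struct

# Type encodings chosen for illustration; see the spec for actual values
ACK_TYPE = 0x00
NAK_TYPE = 0x10

def pack_dllp(dllp_type, seq):
    """Pack a 6-byte ACK/NAK DLLP: 1-byte type, 3-byte payload carrying
    the 12-bit Seq#, 2-byte CRC (a simple checksum here, not the real CRC-16)."""
    body = bytes([dllp_type]) + struct.pack(">BH", 0, seq % 4096)
    return body + struct.pack(">H", sum(body) % 65536)
```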

&lt;p&gt;&lt;strong&gt;DLLP Types&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;When Sent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ACK&lt;/td&gt;
&lt;td&gt;Ack received TLPs up to a given Seq#&lt;/td&gt;
&lt;td&gt;After successfully receiving a TLP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NAK&lt;/td&gt;
&lt;td&gt;Request retransmission from a given Seq#&lt;/td&gt;
&lt;td&gt;After detecting a bad TLP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;InitFC1&lt;/td&gt;
&lt;td&gt;Flow control initialization (phase 1)&lt;/td&gt;
&lt;td&gt;During link initialization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;InitFC2&lt;/td&gt;
&lt;td&gt;Flow control initialization (phase 2)&lt;/td&gt;
&lt;td&gt;During link initialization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UpdateFC&lt;/td&gt;
&lt;td&gt;Return flow control credits to transmitter&lt;/td&gt;
&lt;td&gt;After processing a TLP and freeing buffer space&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

</description>
      <category>architecture</category>
      <category>computerscience</category>
      <category>networking</category>
    </item>
    <item>
      <title>Understanding PCIe Link Training</title>
      <dc:creator>Ripan Deuri</dc:creator>
      <pubDate>Mon, 30 Mar 2026 18:52:17 +0000</pubDate>
      <link>https://dev.to/ripan030/understanding-pcie-link-training-165i</link>
      <guid>https://dev.to/ripan030/understanding-pcie-link-training-165i</guid>
      <description>&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;PCIe link training is the process by which a Root Complex (RC) and an Endpoint (EP) autonomously negotiate and establish a reliable high-speed serial link. No software is involved; everything is done by the Physical Layer state machine.&lt;/p&gt;

&lt;p&gt;The process must solve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Receiver detection&lt;/strong&gt;: Does anything exist on the other end?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bit lock&lt;/strong&gt;: Can the receiver lock its clock-data recovery (CDR) circuit to the incoming bit stream?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Symbol/block lock&lt;/strong&gt;: Can the receiver identify symbol or block boundaries?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Link configuration&lt;/strong&gt;: What width and lane ordering to use?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speed negotiation&lt;/strong&gt;: What is the highest mutually supported data rate?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article focuses on the &lt;strong&gt;physical layer (PHY)&lt;/strong&gt; and explains the &lt;strong&gt;LTSSM (Link Training and Status State Machine)&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. System Setup
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Topology&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;RC Lane&lt;/th&gt;
&lt;th&gt;EP Lane&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lane0&lt;/td&gt;
&lt;td&gt;Lane0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lane1&lt;/td&gt;
&lt;td&gt;Lane1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lane2&lt;/td&gt;
&lt;td&gt;open&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lane3&lt;/td&gt;
&lt;td&gt;open&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Expected Outcome&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Link width: &lt;strong&gt;x2&lt;/strong&gt; (limited by the EP)&lt;/li&gt;
&lt;li&gt;Final speed: &lt;strong&gt;Gen3 (8 GT/s)&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  3. Encoding Fundamentals
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3.2 8b/10b Encoding (Gen1 and Gen2)
&lt;/h3&gt;

&lt;p&gt;Every 8-bit byte is replaced by a 10-bit symbol. The two extra bits provide the redundancy needed to maintain DC balance and guarantee transition density for clock recovery.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Running disparity (RD)&lt;/strong&gt;: The encoder tracks whether the bit stream so far has carried more 1s or more 0s. RD+ means a running excess of 1s; RD- means a running excess of 0s. Each symbol is then chosen from its RD+ or RD- variant so the line stays DC-balanced.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Symbol classes&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data symbols&lt;/strong&gt; Dxx.y: xx = bits [4:0], y = bits [7:5]. Value = y×32 + xx.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control symbols&lt;/strong&gt; Kxx.y: special characters outside the normal data space.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3.3 Deriving the Symbol Names (Example)
&lt;/h3&gt;

&lt;p&gt;The TS1/TS2 identifier bytes are:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TS1 ID byte = 0x4A&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;0x4A = 74 decimal = 0100_1010 binary&lt;/li&gt;
&lt;li&gt;bits [4:0] = 0_1010 = 10, bits [7:5] = 010 = 2 → &lt;strong&gt;D10.2&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;TS2 ID byte = 0x45&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;0x45 = 69 decimal = 0100_0101 binary&lt;/li&gt;
&lt;li&gt;bits [4:0] = 0_0101 = 5, bits [7:5] = 010 = 2 → &lt;strong&gt;D5.2&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
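&lt;p&gt;The derivations above follow mechanically from the bit fields, as a small helper shows (function name is illustrative):&lt;/p&gt;

```python
def data_symbol_name(byte):
    """Dxx.y name for a data byte: xx = bits [4:0], y = bits [7:5]."""
    xx = byte % 32          # low five bits
    y = (byte // 32) % 8    # top three bits
    return "D%d.%d" % (xx, y)
```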

&lt;p&gt;&lt;strong&gt;K28.5 (COM) = 0xBC&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;0xBC = 1011_1100 binary&lt;/li&gt;
&lt;li&gt;Bits [4:0] = 1_1100 = 28, bits [7:5] = 101 = 5 → &lt;strong&gt;K28.5&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;K28.5 is the designated "comma" character used to establish symbol alignment because its 10-bit patterns (both RD+ and RD-) contain the comma sequence: a run of five identical bits preceded by two bits of the opposite polarity, which cannot occur inside any valid data symbol or across symbol boundaries.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3.4 8b/10b Encoded Bit Patterns
&lt;/h3&gt;

&lt;p&gt;8b/10b encoding uses lookup tables; below are the encoded values for the symbols used in training:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Byte&lt;/th&gt;
&lt;th&gt;Symbol&lt;/th&gt;
&lt;th&gt;10-bit (RD-)&lt;/th&gt;
&lt;th&gt;10-bit (RD+)&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0xBC&lt;/td&gt;
&lt;td&gt;K28.5&lt;/td&gt;
&lt;td&gt;0011_111010&lt;/td&gt;
&lt;td&gt;1100_000101&lt;/td&gt;
&lt;td&gt;COM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x00&lt;/td&gt;
&lt;td&gt;D0.0&lt;/td&gt;
&lt;td&gt;1001_110100&lt;/td&gt;
&lt;td&gt;0110_001011&lt;/td&gt;
&lt;td&gt;Logical Idle&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0xF8&lt;/td&gt;
&lt;td&gt;K23.7&lt;/td&gt;
&lt;td&gt;1110_101000&lt;/td&gt;
&lt;td&gt;0001_010111&lt;/td&gt;
&lt;td&gt;PAD symbol&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x4A&lt;/td&gt;
&lt;td&gt;D10.2&lt;/td&gt;
&lt;td&gt;0101_010101&lt;/td&gt;
&lt;td&gt;0101_010101&lt;/td&gt;
&lt;td&gt;TS1 ID ('J')&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x45&lt;/td&gt;
&lt;td&gt;D5.2&lt;/td&gt;
&lt;td&gt;1010_010101&lt;/td&gt;
&lt;td&gt;1010_010101&lt;/td&gt;
&lt;td&gt;TS2 ID ('E')&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0xFF&lt;/td&gt;
&lt;td&gt;D31.7&lt;/td&gt;
&lt;td&gt;1010_110001&lt;/td&gt;
&lt;td&gt;0101_001110&lt;/td&gt;
&lt;td&gt;N_FTS (255)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x07&lt;/td&gt;
&lt;td&gt;D7.0&lt;/td&gt;
&lt;td&gt;1110_001011&lt;/td&gt;
&lt;td&gt;0001_110100&lt;/td&gt;
&lt;td&gt;Rate ID&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x20&lt;/td&gt;
&lt;td&gt;D0.1&lt;/td&gt;
&lt;td&gt;1001_111001&lt;/td&gt;
&lt;td&gt;0110_001001&lt;/td&gt;
&lt;td&gt;Speed change bit5=1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0xC8&lt;/td&gt;
&lt;td&gt;D8.6&lt;/td&gt;
&lt;td&gt;1110_010110&lt;/td&gt;
&lt;td&gt;0001_100110&lt;/td&gt;
&lt;td&gt;N_FTS=200&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note on D10.2&lt;/strong&gt;: D10.2 is disparity-neutral, so both RD variants encode to the same alternating pattern &lt;code&gt;0101_010101&lt;/code&gt;. That clock-like bit stream is exactly what makes the TS1 identifier useful for receiver training.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  3.5 128b/130b Encoding (Gen3 and Above)
&lt;/h3&gt;

&lt;p&gt;At Gen3 (8 GT/s), 8b/10b is replaced by 128b/130b encoding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every 128-bit payload gets a 2-bit &lt;strong&gt;sync header&lt;/strong&gt; prepended.&lt;/li&gt;
&lt;li&gt;Sync header &lt;code&gt;10&lt;/code&gt; = data block; &lt;code&gt;01&lt;/code&gt; = ordered set (control) block.&lt;/li&gt;
&lt;li&gt;Overhead: 2/130 ≈ 1.5%, vs. 20% for 8b/10b. This is why Gen3 at 8 GT/s delivers ~4× the effective bandwidth of Gen1 at 2.5 GT/s, not just 3.2×.&lt;/li&gt;
&lt;li&gt;Scrambling uses a different LFSR polynomial than Gen1/Gen2.&lt;/li&gt;
&lt;/ul&gt;
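&lt;p&gt;The overhead arithmetic above can be checked directly:&lt;/p&gt;

```python
def effective_gbps(rate_gt, payload_bits, total_bits):
    """Per-lane data rate after encoding overhead, in Gb/s."""
    return rate_gt * payload_bits / total_bits

gen1 = effective_gbps(2.5, 8, 10)      # 8b/10b
gen3 = effective_gbps(8.0, 128, 130)   # 128b/130b
ratio = gen3 / gen1                    # ~3.94x, not the raw 3.2x
```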

&lt;p&gt;&lt;strong&gt;Gen3 TS1/TS2 format in 128b/130b&lt;/strong&gt;: An ordered set block (sync header = &lt;code&gt;01&lt;/code&gt;) carries a 128-bit payload. A TS1 or TS2 occupies one such block. The payload contains the same logical fields (Link, Lane, N_FTS, Rate, Training Control, identifier) but packed differently than the 16-symbol 8b/10b format. Specifically, a Gen3 TS1/TS2 block is:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;[01][TS_ID(1B)][Link(1B)][Lane(1B)][N_FTS(1B)][Rate(1B)][TrainingCtrl(1B)][ID×10(10B)]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Total: 2 bits sync + 128 bits payload = 130 bits per ordered set block.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Ordered Set Structure (TS1 / TS2) — Gen1/Gen2
&lt;/h2&gt;

&lt;p&gt;Each ordered set = &lt;strong&gt;16 symbols × 10 bits = 160 bits&lt;/strong&gt; at Gen1/Gen2.&lt;/p&gt;

&lt;h3&gt;
  
  
  TS1 Layout (pre-scramble byte values)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symbol index&lt;/th&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Byte value (typical)&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;COM&lt;/td&gt;
&lt;td&gt;0xBC&lt;/td&gt;
&lt;td&gt;K28.5, always sent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Link Number&lt;/td&gt;
&lt;td&gt;0xF8 (PAD) or 0x00&lt;/td&gt;
&lt;td&gt;PAD until link assigned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Lane Number&lt;/td&gt;
&lt;td&gt;0xF8 (PAD) or 0x00..n&lt;/td&gt;
&lt;td&gt;PAD until lane assigned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;N_FTS&lt;/td&gt;
&lt;td&gt;0xC8 (200)&lt;/td&gt;
&lt;td&gt;Receiver's FTS requirement&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Rate Identifier&lt;/td&gt;
&lt;td&gt;0x07&lt;/td&gt;
&lt;td&gt;Gen1+Gen2+Gen3 support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Training Control&lt;/td&gt;
&lt;td&gt;0x00 or 0x20&lt;/td&gt;
&lt;td&gt;0x20 = speed change request&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6–15&lt;/td&gt;
&lt;td&gt;TS1 Identifier&lt;/td&gt;
&lt;td&gt;0x4A × 10&lt;/td&gt;
&lt;td&gt;ASCII 'J', repeated 10 times&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
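&lt;p&gt;The TS1 layout above can be assembled as a 16-byte sequence (a sketch of the pre-scramble byte values; the function name and defaults are illustrative):&lt;/p&gt;

```python
COM, PAD = 0xBC, 0xF8

def build_ts1(link=PAD, lane=PAD, n_fts=0xC8, rate=0x07, ctrl=0x00):
    """The 16 pre-scramble byte values of a Gen1/Gen2 TS1 ordered set,
    following the layout table above."""
    return bytes([COM, link, lane, n_fts, rate, ctrl] + [0x4A] * 10)
```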

&lt;h3&gt;
  
  
  TS2 Layout
&lt;/h3&gt;

&lt;p&gt;Identical to TS1, except:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symbol index&lt;/th&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Byte value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;6–15&lt;/td&gt;
&lt;td&gt;TS2 Identifier&lt;/td&gt;
&lt;td&gt;0x45 × 10&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Key Training Control Bits (Symbol 5)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bit&lt;/th&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;Meaning when set&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Hot Reset&lt;/td&gt;
&lt;td&gt;Request hot reset&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Disable Link&lt;/td&gt;
&lt;td&gt;Request link disable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Loopback&lt;/td&gt;
&lt;td&gt;Request loopback mode&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Disable Scrambling&lt;/td&gt;
&lt;td&gt;Scrambling off (test/debug)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Compliance Receive&lt;/td&gt;
&lt;td&gt;Enter compliance mode&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Speed Change&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Request transition to new speed&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  5. LTSSM Overview
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Detect → Polling → Configuration → L0 (Gen1) → Recovery → L0 (Gen3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Training always begins at &lt;strong&gt;Gen1 (2.5 GT/s)&lt;/strong&gt;, regardless of device capability.&lt;/li&gt;
&lt;li&gt;Speed upgrade happens later via the Recovery state.&lt;/li&gt;
&lt;li&gt;The LTSSM runs independently and in parallel on the RC and the EP. They converge through the exchange of ordered sets.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  6. Detect State
&lt;/h2&gt;

&lt;p&gt;The transmitter output is a current-mode driver with a nominal output impedance of 50 Ω into a 50 Ω termination at the receiver. Total DC path = 100 Ω differential.&lt;/p&gt;

&lt;p&gt;The Detect state uses a slow voltage ramp on the TX differential pair:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Receiver present&lt;/strong&gt;: The 50 Ω termination loads the ramp → slow rise time detected.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No receiver (open circuit)&lt;/strong&gt;: No termination → fast rise to rail → detected as absent.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The spec requires the transmitter to charge the line to a voltage and measure the time to reach a threshold. If it stays below the threshold long enough (indicating a load), a receiver is declared present.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RC Detect Results (per lane)&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;RC Lane&lt;/th&gt;
&lt;th&gt;Detect result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lane0&lt;/td&gt;
&lt;td&gt;Receiver present (50 Ω load from EP Lane0)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lane1&lt;/td&gt;
&lt;td&gt;Receiver present (50 Ω load from EP Lane1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lane2&lt;/td&gt;
&lt;td&gt;No receiver (open circuit)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lane3&lt;/td&gt;
&lt;td&gt;No receiver (open circuit)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;RC exits Detect with &lt;strong&gt;2 active lanes: Lane0 and Lane1&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
Lane2 and Lane3 are deactivated for the remainder of training.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EP Detect Results (per lane)&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;EP Lane&lt;/th&gt;
&lt;th&gt;Detect result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lane0&lt;/td&gt;
&lt;td&gt;Receiver present&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lane1&lt;/td&gt;
&lt;td&gt;Receiver present&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;EP exits Detect with &lt;strong&gt;2 active lanes&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  7. Polling State
&lt;/h2&gt;

&lt;p&gt;Polling has two sub-states: &lt;strong&gt;Polling.Active&lt;/strong&gt; and &lt;strong&gt;Polling.Configuration&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Goal of Polling
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Bit lock&lt;/strong&gt;: The receiver CDR circuit locks its internal clock to the incoming bit transitions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Symbol lock&lt;/strong&gt;: The receiver identifies COM (K28.5) symbols and aligns its symbol boundaries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuration capability exchange&lt;/strong&gt;: Devices advertise their supported speeds in the Rate ID field.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  7.1 Polling.Active — TS1 Transmission
&lt;/h3&gt;

&lt;p&gt;Both RC and EP begin transmitting TS1 ordered sets simultaneously on all active lanes. At this stage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Link Number = PAD (0xF8) — no link number assigned yet&lt;/li&gt;
&lt;li&gt;Lane Number = PAD (0xF8) — no lane number assigned yet&lt;/li&gt;
&lt;li&gt;Training Control = 0x00&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;RC → EP on Lane0, TS1 ordered set (pre-scramble bytes):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Symbol:  [ 0][ 1][ 2][ 3][ 4][ 5][ 6][ 7][ 8][ 9][10][11][12][13][14][15]
Byte:    [BC][F8][F8][C8][07][00][4A][4A][4A][4A][4A][4A][4A][4A][4A][4A]
Field:   COM  LNK LAN FTS RAT CTL &amp;lt;-----------TS1 ID ('J') × 10---------&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Bit-level expansion of the first three symbols (RD- for each):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Symbol 0 — COM (K28.5, 0xBC), RD-:
  10-bit: 0011_1110 10
  Wired:  0 0 1 1 1 1 1 0 1 0   (LSB first on differential pair)

Symbol 1 — PAD (K23.7, 0xF8), RD- → RD+:
  Sending K28.5 in RD- leaves RD+, so next symbol is RD+ variant.
  K23.7 RD+: 0001_0101 11
  Wired:  0 0 0 1 0 1 0 1 1 1

Symbol 2 — PAD (K23.7, 0xF8), RD+ → RD-:
  K23.7 RD-: 1110_1010 00
  Wired:  1 1 1 0 1 0 1 0 0 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;PCIe bit ordering&lt;/strong&gt;: Bits are transmitted LSB-first within each 10-bit symbol. The wired sequence above shows bits as they appear on the differential pair over time, left to right.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Same TS1 on Lane1 (identical content, independent CDR):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Symbol:  [ 0][ 1][ 2][ 3][ 4][ 5][ 6][ 7][ 8][ 9][10][11][12][13][14][15]
Byte:    [BC][F8][F8][C8][07][00][4A][4A][4A][4A][4A][4A][4A][4A][4A][4A]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lane0 and Lane1 carry identical TS1 content during Polling. Each lane's CDR circuit locks independently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EP → RC (simultaneously, same TS1 format):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Lane0: [BC][F8][F8][C8][07][00][4A × 10]
Lane1: [BC][F8][F8][C8][07][00][4A × 10]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Polling.Active Exit Condition:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Transmit at least &lt;strong&gt;1024 TS1 ordered sets&lt;/strong&gt; on all active lanes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AND&lt;/strong&gt; receive &lt;strong&gt;8 consecutive identical TS1 or TS2&lt;/strong&gt; on any active lane.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both conditions must be met. The 1024-TS1 minimum ensures the receiver had enough transitions for bit and symbol lock before either side checks the received content.&lt;/p&gt;

&lt;p&gt;On receipt of 8 consecutive valid TS1, the device transitions to &lt;strong&gt;Polling.Configuration&lt;/strong&gt;.&lt;/p&gt;
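&lt;p&gt;The two exit conditions reduce to a simple predicate (a sketch; real hardware tracks these counters per lane inside the LTSSM):&lt;/p&gt;

```python
def polling_active_done(ts1_transmitted, consecutive_ts_received):
    """Both conditions must hold before leaving Polling.Active."""
    return ts1_transmitted >= 1024 and consecutive_ts_received >= 8
```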

&lt;h3&gt;
  
  
  7.2 Polling.Configuration
&lt;/h3&gt;

&lt;p&gt;In Polling.Configuration, each device sends &lt;strong&gt;TS2 ordered sets&lt;/strong&gt; (symbols 6–15 = 0x45). The Rate ID field is still 0x07. The device exits when it has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Received &lt;strong&gt;8 consecutive TS2&lt;/strong&gt; with matching Rate ID&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AND&lt;/strong&gt; transmitted at least &lt;strong&gt;16 TS2&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After Polling.Configuration, the devices have confirmed mutual speed support. Both transition to &lt;strong&gt;Configuration&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Configuration State
&lt;/h2&gt;

&lt;p&gt;Configuration negotiates link width and lane numbering. It proceeds through several sub-states.&lt;/p&gt;

&lt;h3&gt;
  
  
  8.1 Configuration.LinkWidth.Start
&lt;/h3&gt;

&lt;p&gt;The RC transmits TS1 on all active lanes with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Link Number = PAD (0xF8)&lt;/li&gt;
&lt;li&gt;Lane Number = PAD (0xF8)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This signals: "I am proposing lanes; tell me what you can accept."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RC → EP per lane:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Lane0: [BC][F8][F8][C8][07][00][4A × 10]
Lane1: [BC][F8][F8][C8][07][00][4A × 10]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both lanes carry PAD/PAD — the RC has not assigned numbers yet.&lt;/p&gt;

&lt;h3&gt;
  
  
  8.2 Configuration.LinkWidth.Accept
&lt;/h3&gt;

&lt;p&gt;The EP receives the RC's PAD/PAD TS1 on Lane0 and Lane1. It assigns link and lane numbers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;EP Lane&lt;/th&gt;
&lt;th&gt;Assigned Link Number&lt;/th&gt;
&lt;th&gt;Assigned Lane Number&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lane0&lt;/td&gt;
&lt;td&gt;0x00&lt;/td&gt;
&lt;td&gt;0x00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lane1&lt;/td&gt;
&lt;td&gt;0x00&lt;/td&gt;
&lt;td&gt;0x01&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The EP sends TS1 back to the RC with these assigned values:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EP → RC on Lane0 (pre-scramble):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Symbol:  [ 0][ 1][ 2][ 3][ 4][ 5][ 6–15]
Byte:    [BC][00][00][C8][07][00][4A × 10]
Field:   COM  LNK=0 LAN=0 FTS RAT CTL  TS1_ID
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;EP → RC on Lane1 (pre-scramble):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Symbol:  [ 0][ 1][ 2][ 3][ 4][ 5][ 6–15]
Byte:    [BC][00][01][C8][07][00][4A × 10]
Field:   COM  LNK=0 LAN=1 FTS RAT CTL  TS1_ID
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Bit-level for Symbol 2 (Lane Number) on Lane1 — D1.0 (0x01), assume RD-:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0x01 = D1.0
bits[4:0] = 00001, bits[7:5] = 000
D1.0 RD-: 0111_010100   (10 bits, wire order a b c d e i f g h j: 0 1 1 1 0 1 0 1 0 0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On Lane0, Symbol 2 = D0.0 (0x00):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;D0.0 RD-: 1001_110100   (wire: 1 0 0 1 1 1 0 1 0 0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
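&lt;p&gt;These 10-bit codes can be checked mechanically. A minimal sketch, with the two codes hard-coded from a standard 8b/10b table (starting running disparity RD-):&lt;/p&gt;

```python
# Sketch: RD- codes for the two data symbols above, taken from a standard
# 8b/10b table. Bits go on the wire in code order (a b c d e i f g h j).
RD_MINUS = {
    0x00: "1001110100",  # D0.0
    0x01: "0111010100",  # D1.0
}

def wire_bits(byte):
    return [int(b) for b in RD_MINUS[byte]]

assert wire_bits(0x00) == [1, 0, 0, 1, 1, 1, 0, 1, 0, 0]
assert len(wire_bits(0x01)) == 10
```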



&lt;h3&gt;
  
  
  8.3 RC Echoes Back
&lt;/h3&gt;

&lt;p&gt;The RC receives the EP's numbered TS1. It now echoes the same Link=0 / Lane=N assignments back in its own TS1 transmissions, confirming acceptance:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RC → EP on Lane0:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[BC][00][00][C8][07][00][4A × 10]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;RC → EP on Lane1:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[BC][00][01][C8][07][00][4A × 10]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Agreement is reached: &lt;strong&gt;Link 0, x2 width, Lane0 and Lane1&lt;/strong&gt;.&lt;/p&gt;
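&lt;p&gt;To make the byte layouts concrete, here is a hypothetical helper (names are mine, not from the spec) that builds the 16-symbol Gen1 TS1 ordered set used in this exchange:&lt;/p&gt;

```python
# Sketch: assemble a Gen1 TS1 ordered set from the fields shown above.
COM, N_FTS, RATE_ID, TS1_ID = 0xBC, 0xC8, 0x07, 0x4A

def build_ts1(link, lane, ctrl=0x00):
    # Symbols 0-5: COM, Link, Lane, N_FTS, Rate ID, Training Control;
    # symbols 6-15: TS1 identifier.
    return bytes([COM, link, lane, N_FTS, RATE_ID, ctrl] + [TS1_ID] * 10)

lane0 = build_ts1(link=0x00, lane=0x00)
lane1 = build_ts1(link=0x00, lane=0x01)
assert len(lane0) == 16
assert lane0[2] == 0x00 and lane1[2] == 0x01  # only the Lane symbol differs
```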

&lt;h3&gt;
  
  
  8.4 Configuration.LaneNum.Wait — Switch to TS2
&lt;/h3&gt;

&lt;p&gt;Both sides transition from TS1 → TS2 to confirm lane numbering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RC → EP on Lane0:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[BC][00][00][C8][07][00][45 × 10]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;RC → EP on Lane1:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[BC][00][01][C8][07][00][45 × 10]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;EP → RC (same, mirrored).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Symbol 15 of TS2 on Lane0, D5.2 (0x45), assume RD-:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0x45 = D5.2
bits[4:0] = 00101 = 5, bits[7:5] = 010 = 2
D5.2 RD-: 1010_010101   (wire: 1 0 1 0 0 1 0 1 0 1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  8.5 Configuration.LaneNum.Accept
&lt;/h3&gt;

&lt;p&gt;Exit condition: receive &lt;strong&gt;2 consecutive TS2&lt;/strong&gt; with matching Link and Lane fields.&lt;/p&gt;

&lt;h3&gt;
  
  
  8.6 Configuration.Complete
&lt;/h3&gt;

&lt;p&gt;Both sides exchange TS2 until:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;8 consecutive matching TS2&lt;/strong&gt; received&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AND&lt;/strong&gt; at least &lt;strong&gt;16 TS2&lt;/strong&gt; transmitted after receiving the first matching one&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A &lt;strong&gt;2 ms&lt;/strong&gt; timeout bounds this sub-state: if the conditions are not met in time, the LTSSM falls back toward Detect rather than hanging in Configuration.&lt;/p&gt;

&lt;h3&gt;
  
  
  8.7 Configuration.Idle
&lt;/h3&gt;

&lt;p&gt;Both sides stop sending TS2 and transmit &lt;strong&gt;Logical Idle&lt;/strong&gt; symbols (the link stays electrically active; Electrical Idle belongs to the low-power states, not this transition):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Logical Idle symbol — D0.0 (0x00), RD-:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;10-bit RD-: 1001_110100
Wire (LSB first): 1 0 0 1 1 1 0 1 0 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both sides send 8 consecutive Logical Idle symbols on each lane. On receipt of these, both sides transition to &lt;strong&gt;L0&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  9. L0 State — Gen1 Active Link
&lt;/h2&gt;

&lt;p&gt;The link is now operational:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Speed&lt;/td&gt;
&lt;td&gt;Gen1, 2.5 GT/s per lane&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Width&lt;/td&gt;
&lt;td&gt;x2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Encoding&lt;/td&gt;
&lt;td&gt;8b/10b&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Raw bit rate&lt;/td&gt;
&lt;td&gt;2.5 Gb/s per lane&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Effective rate&lt;/td&gt;
&lt;td&gt;2.5 × 0.8 = 2.0 Gb/s per lane (8b/10b overhead)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aggregate BW&lt;/td&gt;
&lt;td&gt;2 lanes × 2.0 Gb/s = 4.0 Gb/s = &lt;strong&gt;500 MB/s&lt;/strong&gt; (per direction)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;TLPs (Transaction Layer Packets) and DLLPs (Data Link Layer Packets) can now flow. Software is notified that a link is up.&lt;/p&gt;
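&lt;p&gt;A quick back-of-envelope check of these figures (the same formula covers 128b/130b later in the article):&lt;/p&gt;

```python
# Effective per-direction bandwidth = rate × encoding efficiency × lanes / 8.
def effective_GBps(rate_gtps, lanes, payload_bits, total_bits):
    return rate_gtps * payload_bits / total_bits * lanes / 8

gen1_x2 = effective_GBps(2.5, lanes=2, payload_bits=8, total_bits=10)
assert abs(gen1_x2 - 0.5) < 1e-9     # 500 MB/s per direction

gen3_x2 = effective_GBps(8.0, lanes=2, payload_bits=128, total_bits=130)
assert abs(gen3_x2 - 1.969) < 1e-3   # ~1.97 GB/s per direction
```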

&lt;p&gt;The RC or EP (typically the RC, or the driver stack) can now initiate a speed change by entering Recovery.&lt;/p&gt;

&lt;h2&gt;
  
  
  10. Recovery State — Speed Upgrade to Gen3
&lt;/h2&gt;

&lt;h3&gt;
  
  
  10.1 Recovery.RcvrLock — Requesting Speed Change
&lt;/h3&gt;

&lt;p&gt;The RC (or EP) initiates Recovery by sending TS1 with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Speed Change bit set&lt;/strong&gt; (Training Control bit 5 = 1 → byte = 0x20)&lt;/li&gt;
&lt;li&gt;Link = 0x00, Lane = 0x00 or 0x01 per lane&lt;/li&gt;
&lt;li&gt;Rate ID = 0x07 (all speeds)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;RC → EP on Lane0 (pre-scramble):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Symbol:  [ 0][ 1][ 2][ 3][ 4][ 5][ 6–15]
Byte:    [BC][00][00][C8][07][20][4A × 10]
                              ^^
                        Speed Change = 1 (bit 5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Symbol 5, Training Control = 0x20:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0x20 = 0010_0000 binary = D0.1
bits[4:0] = 00000 = 0, bits[7:5] = 001 = 1 → D0.1

D0.1 RD-: 1001_111001   (wire: 1 0 0 1 1 1 1 0 0 1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;RC → EP on Lane1:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[BC][00][01][C8][07][20][4A × 10]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The EP receives these and responds with the same: TS1 with Speed Change bit = 1, its own Link=0, Lane=N.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EP → RC on Lane0:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[BC][00][00][C8][07][20][4A × 10]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;EP → RC on Lane1:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[BC][00][01][C8][07][20][4A × 10]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exit condition for Recovery.RcvrLock: receive &lt;strong&gt;8 consecutive TS1 or TS2&lt;/strong&gt; (with or without speed change bit) on all active lanes.&lt;/p&gt;

&lt;h3&gt;
  
  
  10.2 Recovery.RcvrCfg — Confirming Speed with TS2
&lt;/h3&gt;

&lt;p&gt;Both sides switch to TS2 (still Gen1, still 8b/10b, Speed Change bit = 1):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RC → EP on Lane0:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[BC][00][00][C8][07][20][45 × 10]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;RC → EP on Lane1:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[BC][00][01][C8][07][20][45 × 10]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exit condition: receive &lt;strong&gt;8 consecutive TS2&lt;/strong&gt; with Speed Change bit set.&lt;/p&gt;

&lt;h3&gt;
  
  
  10.3 Recovery.Speed — PHY Retrain
&lt;/h3&gt;

&lt;p&gt;Both sides simultaneously:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Assert electrical idle (stop driving differential data).&lt;/li&gt;
&lt;li&gt;Reset the PHY PLL from 2.5 GT/s to 8 GT/s.&lt;/li&gt;
&lt;li&gt;Switch the serializer/deserializer (SerDes) to Gen3 signaling parameters (different equalization, different reference voltage levels).&lt;/li&gt;
&lt;li&gt;Switch framing from 8b/10b to &lt;strong&gt;128b/130b&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Reset the LFSR scrambler to the Gen3 seed.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There is a mandatory quiet time (Electrical Idle) during this phase. The timeout for this step is 24 ms.&lt;/p&gt;

&lt;h3&gt;
  
  
  10.4 Recovery.RcvrLock Again (at Gen3)
&lt;/h3&gt;

&lt;p&gt;After the PHY switches to 8 GT/s, both sides retransmit TS1 — now using &lt;strong&gt;128b/130b&lt;/strong&gt; framing.&lt;/p&gt;

&lt;p&gt;Each TS1 block at Gen3 is a &lt;strong&gt;130-bit&lt;/strong&gt; unit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[sync: 01][TS1 payload: 128 bits]
 ^^^^^^^^
 Ordered set block indicator
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Payload (128 bits = 16 bytes), logical byte layout:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Byte:  [TS1_OS_ID][LNK][LAN][N_FTS][RATE][CTRL][4A×10][rsvd×2]
       [0xF0    ][0x00][0x00][0xC8][0x07][0x20][4A..4A][00 00]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Note: The Gen3 TS1 ordered set identifier byte is &lt;code&gt;0x1E&lt;/code&gt; (different from Gen1/Gen2, where the K-code COM marks the start of an ordered set). 128b/130b has no K-codes, so a dedicated byte in the payload identifies the ordered set type.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;On the wire (Lane0, first 18 bits of a Gen3 TS1 block, before scrambling):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Sync header (2 bits): 0 1
Byte 0 = 0xF0 = 1111_0000, wire (LSB first): 0 0 0 0 1 1 1 1
Byte 1 = 0x00 = 0000_0000, wire (LSB first): 0 0 0 0 0 0 0 0
...continues for remaining 14 bytes (112 bits)
Total = 2 + 128 = 130 bits per block
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both lanes (Lane0, Lane1) transmit identical ordered sets simultaneously.&lt;/p&gt;
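&lt;p&gt;A sketch of serializing one such block (assuming the standard 8.0 GT/s TS1 identifier byte 0x1E; everything else follows the layout above):&lt;/p&gt;

```python
# Sketch: one 128b/130b block = 2-bit sync header + 16 payload bytes,
# each payload byte placed on the wire LSB first.
def block_bits(sync_bits, payload):
    assert len(sync_bits) == 2 and len(payload) == 16
    bits = list(sync_bits)
    for byte in payload:
        bits += [(byte >> i) & 1 for i in range(8)]  # LSB first
    return bits

payload = bytes([0x1E, 0x00, 0x00, 0xC8, 0x07, 0x20] + [0x4A] * 10)
bits = block_bits([1, 0], payload)  # ordered set sync header
assert len(bits) == 130
assert bits[2:10] == [0, 1, 1, 1, 1, 0, 0, 0]  # 0x1E, LSB first
```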

&lt;h3&gt;
  
  
  10.5 Recovery.RcvrCfg (at Gen3) — TS2
&lt;/h3&gt;

&lt;p&gt;Both sides switch to TS2 at Gen3. Same 130-bit block format, payload byte 0 = TS2 OS identifier (&lt;code&gt;0x2D&lt;/code&gt; in 128b/130b), symbols 6–15 equivalent = 0x45 bytes.&lt;/p&gt;

&lt;p&gt;Exit condition: &lt;strong&gt;8 consecutive TS2 at Gen3&lt;/strong&gt; on all active lanes.&lt;/p&gt;

&lt;h3&gt;
  
  
  10.6 Transition to L0 (Gen3)
&lt;/h3&gt;

&lt;p&gt;Both sides send Logical Idle in 128b/130b and transition to &lt;strong&gt;L0 at Gen3&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final Link State&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Width&lt;/td&gt;
&lt;td&gt;x2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speed&lt;/td&gt;
&lt;td&gt;Gen3, 8 GT/s per lane&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Encoding&lt;/td&gt;
&lt;td&gt;128b/130b&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Raw bit rate&lt;/td&gt;
&lt;td&gt;8 Gb/s per lane&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Encoding overhead&lt;/td&gt;
&lt;td&gt;2/130 ≈ 1.54%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Effective rate&lt;/td&gt;
&lt;td&gt;8 × (128/130) ≈ 7.877 Gb/s per lane&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aggregate BW&lt;/td&gt;
&lt;td&gt;2 lanes × 7.877 Gb/s ÷ 8 ≈ &lt;strong&gt;1.97 GB/s&lt;/strong&gt; (per direction)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  11. Complete LTSSM State Transition Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;State&lt;/th&gt;
&lt;th&gt;Duration / Exit Condition&lt;/th&gt;
&lt;th&gt;Lanes active&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Detect.Quiet&lt;/td&gt;
&lt;td&gt;Receiver detection logic runs&lt;/td&gt;
&lt;td&gt;All (RC: 4)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Detect.Active&lt;/td&gt;
&lt;td&gt;Receiver seen on Lane0, Lane1; Lane2/3 deactivated&lt;/td&gt;
&lt;td&gt;RC: 2, EP: 2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Polling.Active&lt;/td&gt;
&lt;td&gt;TX ≥1024 TS1 &lt;strong&gt;AND&lt;/strong&gt; RX 8 consecutive TS1/TS2&lt;/td&gt;
&lt;td&gt;2 per side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Polling.Configuration&lt;/td&gt;
&lt;td&gt;RX 8 consecutive TS2 &lt;strong&gt;AND&lt;/strong&gt; TX ≥16 TS2&lt;/td&gt;
&lt;td&gt;2 per side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Configuration.LinkWidth.Start&lt;/td&gt;
&lt;td&gt;Send PAD/PAD TS1&lt;/td&gt;
&lt;td&gt;2 per side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Configuration.LinkWidth.Accept&lt;/td&gt;
&lt;td&gt;RX TS1 with Link≠PAD, Lane≠PAD&lt;/td&gt;
&lt;td&gt;2 per side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Configuration.LaneNum.Wait&lt;/td&gt;
&lt;td&gt;Switch to TS2 with Link/Lane assigned&lt;/td&gt;
&lt;td&gt;2 per side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Configuration.LaneNum.Accept&lt;/td&gt;
&lt;td&gt;RX 2 consecutive matching TS2&lt;/td&gt;
&lt;td&gt;2 per side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Configuration.Complete&lt;/td&gt;
&lt;td&gt;RX 8 consecutive TS2 &lt;strong&gt;AND&lt;/strong&gt; TX ≥16 TS2 after first match (2 ms timeout)&lt;/td&gt;
&lt;td&gt;2 per side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Configuration.Idle&lt;/td&gt;
&lt;td&gt;RX 8 Logical Idle symbols&lt;/td&gt;
&lt;td&gt;2 per side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;L0 (Gen1)&lt;/td&gt;
&lt;td&gt;Active data transfer at 2.5 GT/s x2&lt;/td&gt;
&lt;td&gt;2 per side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recovery.RcvrLock&lt;/td&gt;
&lt;td&gt;RX 8 consecutive TS1/TS2 (speed change requested)&lt;/td&gt;
&lt;td&gt;2 per side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recovery.RcvrCfg&lt;/td&gt;
&lt;td&gt;RX 8 consecutive TS2 with speed change bit&lt;/td&gt;
&lt;td&gt;2 per side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recovery.Speed&lt;/td&gt;
&lt;td&gt;PHY retrains to Gen3; 24 ms timeout&lt;/td&gt;
&lt;td&gt;2 per side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recovery.RcvrLock (Gen3)&lt;/td&gt;
&lt;td&gt;RX 8 consecutive TS1/TS2 at 8 GT/s (128b/130b)&lt;/td&gt;
&lt;td&gt;2 per side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recovery.RcvrCfg (Gen3)&lt;/td&gt;
&lt;td&gt;RX 8 consecutive TS2 at 8 GT/s&lt;/td&gt;
&lt;td&gt;2 per side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;L0 (Gen3)&lt;/td&gt;
&lt;td&gt;Active data transfer at 8 GT/s x2 ≈ 1.97 GB/s&lt;/td&gt;
&lt;td&gt;2 per side&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
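&lt;p&gt;The happy path through the table can be condensed into a toy walk (illustrative only; real LTSSM transitions depend on received ordered sets and timeouts):&lt;/p&gt;

```python
# Sketch: linear happy-path walk Detect → L0 (Gen1) → Recovery → L0 (Gen3).
PATH = [
    "Detect.Quiet", "Detect.Active",
    "Polling.Active", "Polling.Configuration",
    "Configuration.LinkWidth.Start", "Configuration.LinkWidth.Accept",
    "Configuration.LaneNum.Wait", "Configuration.LaneNum.Accept",
    "Configuration.Complete", "Configuration.Idle",
    "L0 (Gen1)",
    "Recovery.RcvrLock", "Recovery.RcvrCfg", "Recovery.Speed",
    "Recovery.RcvrLock (Gen3)", "Recovery.RcvrCfg (Gen3)",
    "L0 (Gen3)",
]
NEXT = dict(zip(PATH, PATH[1:]))

state, hops = "Detect.Quiet", 0
while state != "L0 (Gen3)":
    state, hops = NEXT[state], hops + 1
assert hops == 16  # sixteen transitions from power-on to Gen3 L0
```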

</description>
      <category>architecture</category>
      <category>computerscience</category>
      <category>systems</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Inside PCIe PHY: End-to-End Transmit and Receive Path</title>
      <dc:creator>Ripan Deuri</dc:creator>
      <pubDate>Sat, 28 Mar 2026 12:23:45 +0000</pubDate>
      <link>https://dev.to/ripan030/inside-pcie-phy-end-to-end-transmit-and-receive-path-g18</link>
      <guid>https://dev.to/ripan030/inside-pcie-phy-end-to-end-transmit-and-receive-path-g18</guid>
      <description>&lt;p&gt;This article builds on the PCIe overview and physical layer fundamentals by presenting an end-to-end view of how data flows through the transmit and receive paths. The focus is on how a Transaction Layer Packet (TLP) is transformed into a high-speed serial bit stream and reconstructed at the receiver.&lt;/p&gt;






&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

============================================================================
TRANSMITTER (e.g., Root Complex sending a Memory Write TLP)
============================================================================

[Data Link Layer]
    Seq# + TLP Header + Data + LCRC (parallel bytes)
            |
            v
[Physical Layer]

    [Framing / Block Formation]
        Gen1/2:
            STP + TLP + END (control symbols embedded in stream)
        Gen3+:
            Data organized into 128-bit blocks
            (packet boundaries inferred, no explicit STP/END)

            |
            v

    [Scrambler]
        Bit stream XOR’d with LFSR sequence
        -&amp;gt; Randomized data for transition density and EMI reduction

            |
            v

    [Encoder / Block Encoding]
        Gen1/2 (8b/10b):
            8-bit -&amp;gt; 10-bit symbol (DC balance + control encoding)

        Gen3/4/5 (128b/130b):
            128-bit block + 2-bit sync header -&amp;gt; 130-bit block

            |
            v

    [Transmit PLL / Clock Generation]
        Reference clock (e.g., 100 MHz)
        -&amp;gt; Generates high-speed serial rate (e.g., 8.0 GT/s for Gen3)

            |
            v

    [Serializer]
        Parallel block -&amp;gt; serial bit stream
        Bit time ≈ 125 ps (Gen3)

            |
            v

    [Differential Driver]
        Drives Tx+ / Tx− pair
        Bit encoded as polarity (Tx+ &amp;gt; Tx− or Tx+ &amp;lt; Tx−)
        (~800 mVpp differential, implementation-dependent)

            |
============|===============================================================
            |
            |
    Tx+ ----|---- Tx−   (PCIe serial link, e.g., 8.0 GT/s)
            |
            |
============|===============================================================
            |
============================================================================
RECEIVER (e.g., Endpoint receiving the Memory Write TLP)
============================================================================

[Physical Layer]

    [Differential Receiver]
        Senses (Rx+ − Rx−)
        -&amp;gt; Recovers serial bit stream (noise rejection)

            |
            v

    [Clock Data Recovery (CDR)]
        Extracts clock from data transitions
        Phase Detector -&amp;gt; Loop Filter -&amp;gt; VCO
        -&amp;gt; Sampling aligned to center of Unit Interval (UI)

            |
            v

    [Deserializer]
        Serial stream -&amp;gt; parallel blocks
        (e.g., 130-bit blocks in Gen3+)

            |
            v

    [Decoder / Block Decoding]
        Gen1/2:
            10-bit -&amp;gt; 8-bit symbols

        Gen3/4/5:
            Remove 2-bit sync header
            -&amp;gt; Recover 128-bit scrambled data

            |
            v

    [De-scrambler]
        XOR with same LFSR sequence
        -&amp;gt; Restores original data

            |
            v

    [De-framing / Packet Reconstruction]
        Gen1/2:
            Detect STP / END symbols

        Gen3+:
            Packet boundaries inferred from protocol structure

            |
            v

[Data Link Layer]
    Seq# + TLP Header + Data + LCRC
        -&amp;gt; LCRC validation
        -&amp;gt; Sequence tracking
        -&amp;gt; ACK/NACK via DLLP

============================================================================

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
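&lt;p&gt;The scrambler/de-scrambler pair in the diagram is a symmetric XOR with the same LFSR stream. A toy model (16-bit register and arbitrary taps, chosen only for illustration; the real Gen3+ scrambler is a per-lane 23-bit LFSR):&lt;/p&gt;

```python
# Toy scramble/de-scramble round trip. Seed and taps are arbitrary.
def lfsr_stream(seed, n, taps=(0, 2, 3, 5), width=16):
    state, out = seed, []
    for _ in range(n):
        out.append(state & 1)
        fb = 0
        for t in taps:
            fb ^= (state >> t) & 1
        state = (state >> 1) | (fb << (width - 1))
    return out

def xor_bits(data, key):
    return [d ^ k for d, k in zip(data, key)]

data = [0] * 8 + [1] * 8                 # long runs: bad for clock recovery
key = lfsr_stream(seed=0xACE1, n=16)
scrambled = xor_bits(data, key)          # transition-rich on the wire
assert xor_bits(scrambled, key) == data  # receiver undoes it with same XOR
```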

</description>
      <category>architecture</category>
      <category>computerscience</category>
      <category>systems</category>
    </item>
    <item>
      <title>Understanding PCIe Physical Layer</title>
      <dc:creator>Ripan Deuri</dc:creator>
      <pubDate>Sat, 28 Mar 2026 12:22:33 +0000</pubDate>
      <link>https://dev.to/ripan030/understanding-pcie-physical-layer-1ka1</link>
      <guid>https://dev.to/ripan030/understanding-pcie-physical-layer-1ka1</guid>
      <description>&lt;p&gt;The PCIe Physical Layer is responsible for converting structured packet data into high-speed electrical signals that can traverse the link between devices. While higher layers define what to send, the Physical Layer determines how those bits are transmitted reliably over a noisy channel.&lt;/p&gt;

&lt;p&gt;This article focuses on the transmit (TX) path, breaking down each stage from TLP data to differential signaling on the wire, including scrambling, encoding, and serialization.&lt;/p&gt;




&lt;h2&gt;
  
  
  PCIe Physical Layer
&lt;/h2&gt;

&lt;p&gt;The Root Complex (RC) and Endpoint (EP) are connected by a PCIe Link — a set of high-speed differential signal pairs. Everything that happens between them (register reads, DMA transfers, interrupts) is ultimately carried as Transaction Layer Packets (TLPs) over this link.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-------------+                             +-------------+
|             |&amp;lt;====== PCIe Link ========&amp;gt;  |             |
|     RC      |                             |     EP      |
|             |                             |             |
+-------------+                             +-------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  PCIe Link
&lt;/h2&gt;

&lt;p&gt;A PCIe &lt;strong&gt;lane&lt;/strong&gt; is a full-duplex serial connection consisting of one transmit pair and one receive pair.&lt;/p&gt;

&lt;p&gt;A PCIe &lt;strong&gt;link&lt;/strong&gt; consists of one or more such lanes aggregated together (x1, x2, x4, x8, x16).&lt;/p&gt;

&lt;h3&gt;
  
  
  Physical wires in one PCIe lane
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RC (Transmitter)                     EP (Receiver)

TX+  -----------------------------&amp;gt;  RX+
TX-  -----------------------------&amp;gt;  RX-

RX+  &amp;lt;-----------------------------  TX+
RX-  &amp;lt;-----------------------------  TX-
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each lane is &lt;strong&gt;full-duplex&lt;/strong&gt;, meaning data flows simultaneously in both directions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Differential Signaling
&lt;/h2&gt;

&lt;p&gt;PCIe uses &lt;strong&gt;low-voltage differential signaling&lt;/strong&gt;, where the same signal is transmitted on two wires with opposite polarity.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TX+ carries the signal&lt;/li&gt;
&lt;li&gt;TX- carries the inverted signal&lt;/li&gt;
&lt;li&gt;The receiver subtracts the two → noise cancels out&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High noise immunity&lt;/li&gt;
&lt;li&gt;Better signal integrity at high speeds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Conceptually:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Transmitting ‘1’: TX+ &amp;gt; TX- → positive differential&lt;/li&gt;
&lt;li&gt;Transmitting ‘0’: TX+ &amp;lt; TX- → negative differential&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;(The exact voltage swing is small — typically a few hundred millivolts — enabling high-speed operation with low power.)&lt;/p&gt;
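&lt;p&gt;A toy numeric model of why the subtraction rejects noise (millivolt values invented for illustration): coupled noise hits both wires nearly equally, so it drops out of the difference.&lt;/p&gt;

```python
# Common-mode noise adds to both wires; the difference is unaffected.
def rx_decode(tx_plus_mv, tx_minus_mv, noise_mv=0.0):
    diff = (tx_plus_mv + noise_mv) - (tx_minus_mv + noise_mv)
    return 1 if diff > 0 else 0

assert rx_decode(200, -200, noise_mv=150) == 1   # '1' survives 150 mV noise
assert rx_decode(-200, 200, noise_mv=150) == 0   # '0' survives it too
```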

&lt;h2&gt;
  
  
  Lanes — Bundling for Bandwidth
&lt;/h2&gt;

&lt;p&gt;A single lane provides a fixed bandwidth. PCIe increases throughput by aggregating multiple lanes.&lt;/p&gt;

&lt;h3&gt;
  
  
  x1 Link (1 Lane)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ Lane 0 ]
TX pair  ---&amp;gt;  
RX pair  &amp;lt;---
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  x2 Link (2 Lanes)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ Lane 0 ]          [ Lane 1 ]
TX pair ---&amp;gt;        TX pair ---&amp;gt;
RX pair &amp;lt;---        RX pair &amp;lt;---
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Data is &lt;strong&gt;striped across lanes in a round-robin manner&lt;/strong&gt;, and the receiver reassembles it using alignment mechanisms.&lt;/p&gt;
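&lt;p&gt;A minimal sketch of round-robin byte striping and reassembly for a x2 link (lane alignment details omitted):&lt;/p&gt;

```python
# Stripe bytes across lanes round-robin, then interleave them back.
def stripe(data, lanes=2):
    return [data[i::lanes] for i in range(lanes)]

def reassemble(streams):
    out = []
    for group in zip(*streams):  # one byte from each lane per round
        out.extend(group)
    return bytes(out)

data = bytes(range(8))
lane0, lane1 = stripe(data)
assert lane0 == bytes([0, 2, 4, 6]) and lane1 == bytes([1, 3, 5, 7])
assert reassemble([lane0, lane1]) == data
```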

&lt;h2&gt;
  
  
  Transmit Path
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[TLP Stream]
     |
     v
[Framing / Control Symbols]
     |
     v
[Scrambler]
     |
     v
[Encoder]
     |
     v
[Serializer]
     |
     v
[Differential Driver]
     |
     v
TX+ / TX-
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 1: Framing and Control Symbols
&lt;/h2&gt;

&lt;p&gt;The Data Link Layer passes a sequence of bytes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Seq# | TLP Header | Data | LCRC]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;TLP boundaries are represented using &lt;strong&gt;control symbols&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;STP (Start of TLP)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;END (End of TLP)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In Gen1/Gen2, these are explicit symbols embedded in the encoded stream.&lt;br&gt;
In Gen3 and later, boundaries are encoded within &lt;strong&gt;control blocks&lt;/strong&gt; using sync headers and ordered sets.&lt;/p&gt;

&lt;p&gt;So the conceptual view becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[STP | Seq# | Header | Data | LCRC | END]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 2: Scrambler
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why scrambling is needed
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Transition density&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Clock recovery requires frequent signal transitions&lt;/li&gt;
&lt;li&gt;Long runs of 0s or 1s break timing recovery&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;EMI reduction&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Repetitive patterns create strong electromagnetic emissions&lt;/li&gt;
&lt;li&gt;Scrambling randomizes the spectrum&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How it works
&lt;/h3&gt;

&lt;p&gt;The scrambler XORs data with a pseudo-random sequence generated by an &lt;strong&gt;LFSR (Linear Feedback Shift Register)&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Transmitter: data ⊕ lfsr → scrambled  
Receiver:    scrambled ⊕ lfsr → data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PCIe Gen3+ uses a &lt;strong&gt;23-bit LFSR&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example (simplified LFSR)
&lt;/h3&gt;

&lt;p&gt;Initial register:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[S3 S2 S1 S0] = 1 0 0 1
Polynomial: x⁴ + x + 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;LFSR sequence generation&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cycle&lt;/th&gt;
&lt;th&gt;Register&lt;/th&gt;
&lt;th&gt;Feedback&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1001&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1100&lt;/td&gt;
&lt;td&gt;1 XOR 1 = 0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0110&lt;/td&gt;
&lt;td&gt;1 XOR 0 = 1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;1011&lt;/td&gt;
&lt;td&gt;0 XOR 0 = 0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;0101&lt;/td&gt;
&lt;td&gt;1 XOR 1 = 0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;1010&lt;/td&gt;
&lt;td&gt;0 XOR 1 = 1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;1101&lt;/td&gt;
&lt;td&gt;1 XOR 0 = 1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;1110&lt;/td&gt;
&lt;td&gt;1 XOR 1 = 0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Generated sequence (LSB output):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1 0 0 1 1 0 1 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
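&lt;p&gt;The table can be reproduced in a few lines of Python (the feedback tap S1 ⊕ S0 is chosen to match the sequence shown):&lt;/p&gt;

```python
# 4-bit LFSR: shift right each cycle, feedback S1 XOR S0 shifted into S3.
def lfsr4_step(state):
    fb = ((state >> 1) & 1) ^ (state & 1)  # S1 XOR S0
    return (state >> 1) | (fb << 3)

state, regs, out = 0b1001, [], []
for _ in range(8):
    regs.append(state)
    out.append(state & 1)  # LSB is the output bit
    state = lfsr4_step(state)

assert regs == [0b1001, 0b1100, 0b0110, 0b1011,
                0b0101, 0b1010, 0b1101, 0b1110]
assert out == [1, 0, 0, 1, 1, 0, 1, 0]
```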



&lt;h3&gt;
  
  
  Effect on data
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Original:   0 0 0 0 0 0 0 0
LFSR:       1 0 0 1 1 0 1 0
Scrambled:  1 0 0 1 1 0 1 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the signal has rich transitions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Encoder
&lt;/h2&gt;

&lt;p&gt;The encoder provides structure for clock recovery and alignment.&lt;/p&gt;

&lt;h3&gt;
  
  
  8b/10b Encoding (Gen1, Gen2)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Each 8-bit byte → 10-bit symbol&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Guarantees:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bounded run length&lt;/li&gt;
&lt;li&gt;DC balance&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  128b/130b Encoding (Gen3, Gen4)
&lt;/h3&gt;

&lt;p&gt;Instead of per-byte encoding:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Sync Header] + [128-bit scrambled data] = 130 bits
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Sync Header = &lt;code&gt;01&lt;/code&gt; (data) or &lt;code&gt;10&lt;/code&gt; (control)&lt;/li&gt;
&lt;li&gt;Guarantees a transition at block start&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This improves efficiency:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;8b/10b → 80% efficiency&lt;/li&gt;
&lt;li&gt;128b/130b → ~98.5% efficiency&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 4: Serializer
&lt;/h2&gt;

&lt;p&gt;The serializer converts parallel data into a serial bit stream.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[130-bit block]
      |
      v
1 → 0 → 1 → 1 → 0 → ... (bit stream)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For PCIe Gen3:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data rate = &lt;strong&gt;8 GT/s&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Unit Interval (UI) ≈ &lt;strong&gt;125 ps per bit&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Internally, serializers use high-speed clocking derived from a reference clock (commonly 100 MHz) via a PLL.&lt;/p&gt;

&lt;p&gt;For multi-lane links:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Lane 0 → Serializer 0
Lane 1 → Serializer 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Data is distributed across lanes and transmitted in parallel.&lt;/p&gt;
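&lt;p&gt;Conceptually, consecutive bytes are distributed round-robin across the lanes. A simplified model (the real Physical Layer stripes at block granularity in 128b/130b mode; this only illustrates the round-robin idea):&lt;/p&gt;

```python
# Round-robin byte striping across lanes (simplified model).
def stripe(data, num_lanes):
    lanes = [[] for _ in range(num_lanes)]
    for i, byte in enumerate(data):
        lanes[i % num_lanes].append(byte)
    return lanes

print(stripe(list(range(8)), 4))  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```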

&lt;h2&gt;
  
  
  Step 5: Differential Driver
&lt;/h2&gt;

&lt;p&gt;The serialized bits drive a differential output stage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Bit = 1 → TX+ &amp;gt; TX-
Bit = 0 → TX+ &amp;lt; TX-
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This produces a high-speed differential signal on the wire.&lt;/p&gt;

&lt;h2&gt;
  
  
  Receive Path (High-Level)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RX+ / RX-
     |
     v
[Analog Front-End + Equalization]
     |
     v
[Clock Data Recovery (CDR)]
     |
     v
[Deserializer]
     |
     v
[Decoder]
     |
     v
[De-scrambler]
     |
     v
[TLP Stream]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Equalization (CTLE, DFE)&lt;/strong&gt; → compensates for channel loss&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CDR&lt;/strong&gt; → extracts clock from data transitions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lane alignment&lt;/strong&gt; → reconstructs multi-lane streams&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>architecture</category>
      <category>computerscience</category>
      <category>systems</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>PCIe Overview</title>
      <dc:creator>Ripan Deuri</dc:creator>
      <pubDate>Sat, 28 Mar 2026 06:52:27 +0000</pubDate>
      <link>https://dev.to/ripan030/pcie-overview-2bm4</link>
      <guid>https://dev.to/ripan030/pcie-overview-2bm4</guid>
      <description>&lt;p&gt;PCIe is widely used across modern computing systems, powering devices such as SSDs, GPUs, and network interfaces.&lt;/p&gt;

&lt;p&gt;This article provides a structured overview of PCIe with an emphasis on practical understanding. It covers key concepts including topology, protocol layers, BARs, DMA, and interrupt mechanisms.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is PCIe
&lt;/h2&gt;

&lt;p&gt;PCIe (Peripheral Component Interconnect Express) is a high-speed serial bus standard used to connect peripheral devices to the main processor.&lt;/p&gt;

&lt;p&gt;The older PCI bus was parallel—it had 32 or 64 data lines all switching simultaneously. However, parallel buses have a fundamental limitation at high frequencies: signals on different wires arrive at slightly different times (called skew), making it difficult to scale to higher speeds.&lt;/p&gt;

&lt;p&gt;PCIe uses a high-speed serialized bitstream over differential pairs of wires (called lanes). Each lane consists of one differential pair for transmit and one for receive. Because there are only two wires per direction per lane, PCIe can operate at much higher frequencies without skew-related issues.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;PCI&lt;/th&gt;
&lt;th&gt;PCIe&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Signal Type&lt;/td&gt;
&lt;td&gt;Parallel (32/64 wires)&lt;/td&gt;
&lt;td&gt;Serial (differential pair per lane)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Topology&lt;/td&gt;
&lt;td&gt;Shared bus&lt;/td&gt;
&lt;td&gt;Point-to-point (switched fabric)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max Bandwidth&lt;/td&gt;
&lt;td&gt;~533 MB/s&lt;/td&gt;
&lt;td&gt;Up to ~64 GB/s per direction (Gen5 x16)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interrupt&lt;/td&gt;
&lt;td&gt;IRQ Lines&lt;/td&gt;
&lt;td&gt;MSI / MSI-X&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;PCIe uses a point-to-point topology. Each device connects through a dedicated link to the Root Complex or via switches. There is no shared electrical bus; instead, PCIe forms a hierarchical switched interconnect.&lt;/p&gt;

&lt;h2&gt;
  
  
  Basic PCIe Topology
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-----------------------------------------------+
|                   SoC                         |
|                                               |
|  +-------+            +-------------------+   |
|  |  CPU  |&amp;lt;----------&amp;gt;| Root Complex (RC) |   |
|  +-------+  AXI bus   |                   |   |
|                       |                   |   |
|  +-------+            |                   |   |
|  | DRAM  |&amp;lt;----------&amp;gt;|                   |   |
|  +-------+  AXI bus   +---------^---------+   |
|                                 |             |
+---------------------------------|-------------+
                                  |
                        =====================
                              PCIe Link
                        =====================
                                  |
                        +---------v---------+
                        |   Endpoint (EP)   |
                        +-------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key Components:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Root Complex (RC):&lt;/strong&gt;&lt;br&gt;
The Root Complex is the PCIe host controller inside the SoC. It acts as a bridge between the CPU’s memory bus (AXI/AHB) and the PCIe fabric. It:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Initiates configuration space reads/writes during enumeration&lt;/li&gt;
&lt;li&gt;Translates CPU memory accesses into PCIe TLPs (Transaction Layer Packets)&lt;/li&gt;
&lt;li&gt;Receives TLPs from endpoints and converts them into memory transactions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Endpoint (EP):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exposes registers and memory regions via BARs (Base Address Registers)&lt;/li&gt;
&lt;li&gt;Can initiate DMA transfers to/from host memory&lt;/li&gt;
&lt;li&gt;Signals the CPU using MSI/MSI-X interrupts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;PCIe Switch (optional):&lt;/strong&gt;&lt;br&gt;
A PCIe switch expands a single upstream port into multiple downstream ports, allowing multiple endpoints to connect. This creates a hierarchical topology rather than a simple bus.&lt;/p&gt;
&lt;h2&gt;
  
  
  PCIe Link Layers
&lt;/h2&gt;

&lt;p&gt;PCIe uses a layered architecture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+------------------------+
| Transaction Layer (TL) |
+------------------------+
| Data Link Layer (DLL)  |
+------------------------+
| Physical Layer (PL)    |
+------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Physical Layer (PL):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Transmits serialized data over differential pairs (TX+/TX- and RX+/RX-)&lt;/li&gt;
&lt;li&gt;Uses encoding schemes (8b/10b for Gen1/2, 128b/130b for Gen3+) to embed clock information&lt;/li&gt;
&lt;li&gt;Performs link training using LTSSM (Link Training and Status State Machine), negotiating lane width, speed, and equalization parameters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Data Link Layer (DLL):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adds sequence numbers and LCRC (Link CRC) to ensure data integrity&lt;/li&gt;
&lt;li&gt;Implements ACK/NAK-based retransmission for reliable delivery&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Transaction Layer (TL):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creates and processes Transaction Layer Packets (TLPs)&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Supports different types of transactions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Memory Read (Non-posted):&lt;/strong&gt; Request is sent, and a Completion TLP returns data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory Write (Posted):&lt;/strong&gt; Sent without requiring a completion&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuration Read/Write:&lt;/strong&gt; Used during enumeration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Message TLPs:&lt;/strong&gt; Used for MSI/MSI-X interrupts&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  PCIe Lanes and Link Speed
&lt;/h2&gt;

&lt;p&gt;A PCIe lane consists of one differential pair for transmit and one for receive, enabling full-duplex communication.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Link Width&lt;/th&gt;
&lt;th&gt;Notation&lt;/th&gt;
&lt;th&gt;Typical Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 lane&lt;/td&gt;
&lt;td&gt;x1&lt;/td&gt;
&lt;td&gt;Low-bandwidth peripherals&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4 lanes&lt;/td&gt;
&lt;td&gt;x4&lt;/td&gt;
&lt;td&gt;SSDs, NICs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8 lanes&lt;/td&gt;
&lt;td&gt;x8&lt;/td&gt;
&lt;td&gt;High-performance NICs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16 lanes&lt;/td&gt;
&lt;td&gt;x16&lt;/td&gt;
&lt;td&gt;Graphics cards (GPUs)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;PCIe generations define per-lane throughput:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Generation&lt;/th&gt;
&lt;th&gt;Raw Rate&lt;/th&gt;
&lt;th&gt;Effective Bandwidth per Lane&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gen1&lt;/td&gt;
&lt;td&gt;2.5 GT/s&lt;/td&gt;
&lt;td&gt;250 MB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gen2&lt;/td&gt;
&lt;td&gt;5 GT/s&lt;/td&gt;
&lt;td&gt;500 MB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gen3&lt;/td&gt;
&lt;td&gt;8 GT/s&lt;/td&gt;
&lt;td&gt;~985 MB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gen4&lt;/td&gt;
&lt;td&gt;16 GT/s&lt;/td&gt;
&lt;td&gt;~1969 MB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Gen1/Gen2 use 8b/10b encoding, while Gen3 and above use 128b/130b encoding, which improves efficiency.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
A PCIe Gen3 x2 link provides ~2 GB/s bandwidth in each direction.&lt;/p&gt;
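&lt;p&gt;The effective-bandwidth figures above can be reproduced from the raw rate and the encoding overhead. A small sketch (the helper name is made up):&lt;/p&gt;

```python
# Per-lane effective bandwidth in MB/s:
# raw rate (GT/s) x encoding efficiency, divided by 8 bits per byte.
def bandwidth_mb_s(rate_gt_s, payload_bits, total_bits, lanes=1):
    return rate_gt_s * 1e3 * payload_bits / total_bits / 8 * lanes

print(round(bandwidth_mb_s(2.5, 8, 10)))            # Gen1 x1: 250
print(round(bandwidth_mb_s(8, 128, 130)))           # Gen3 x1: 985
print(round(bandwidth_mb_s(8, 128, 130, lanes=2)))  # Gen3 x2: 1969
```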
&lt;h2&gt;
  
  
  PCIe Configuration Space
&lt;/h2&gt;

&lt;p&gt;PCIe devices expose a configuration space used by the host (Root Complex) to discover devices, read capabilities, and configure them.&lt;/p&gt;

&lt;p&gt;The first 256 bytes follow the legacy PCI configuration format for backward compatibility. PCIe extends this to a total of 4 KB.&lt;/p&gt;
&lt;h3&gt;
  
  
  Standard 256-byte PCI Configuration Space:
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Offset&lt;/th&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0x00&lt;/td&gt;
&lt;td&gt;Vendor ID&lt;/td&gt;
&lt;td&gt;2 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x02&lt;/td&gt;
&lt;td&gt;Device ID&lt;/td&gt;
&lt;td&gt;2 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x04&lt;/td&gt;
&lt;td&gt;Command&lt;/td&gt;
&lt;td&gt;2 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x06&lt;/td&gt;
&lt;td&gt;Status&lt;/td&gt;
&lt;td&gt;2 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x08&lt;/td&gt;
&lt;td&gt;Revision ID&lt;/td&gt;
&lt;td&gt;1 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x09&lt;/td&gt;
&lt;td&gt;Class Code (Prog IF)&lt;/td&gt;
&lt;td&gt;1 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x0A&lt;/td&gt;
&lt;td&gt;Class Code (Subclass)&lt;/td&gt;
&lt;td&gt;1 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x0B&lt;/td&gt;
&lt;td&gt;Class Code (Base)&lt;/td&gt;
&lt;td&gt;1 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x0C&lt;/td&gt;
&lt;td&gt;Cache Line Size&lt;/td&gt;
&lt;td&gt;1 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x0D&lt;/td&gt;
&lt;td&gt;Latency Timer&lt;/td&gt;
&lt;td&gt;1 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x0E&lt;/td&gt;
&lt;td&gt;Header Type&lt;/td&gt;
&lt;td&gt;1 B ← bit 7 = MFD, bits[6:0] = type&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x0F&lt;/td&gt;
&lt;td&gt;BIST&lt;/td&gt;
&lt;td&gt;1 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x10&lt;/td&gt;
&lt;td&gt;BAR0&lt;/td&gt;
&lt;td&gt;4 B ┐&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x14&lt;/td&gt;
&lt;td&gt;BAR1&lt;/td&gt;
&lt;td&gt;4 B │ Type 0 header&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x18&lt;/td&gt;
&lt;td&gt;BAR2&lt;/td&gt;
&lt;td&gt;4 B │ (endpoint)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x1C&lt;/td&gt;
&lt;td&gt;BAR3&lt;/td&gt;
&lt;td&gt;4 B │&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x20&lt;/td&gt;
&lt;td&gt;BAR4&lt;/td&gt;
&lt;td&gt;4 B │&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x24&lt;/td&gt;
&lt;td&gt;BAR5&lt;/td&gt;
&lt;td&gt;4 B ┘&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x28&lt;/td&gt;
&lt;td&gt;Cardbus CIS Pointer&lt;/td&gt;
&lt;td&gt;4 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x2C&lt;/td&gt;
&lt;td&gt;Subsystem Vendor ID&lt;/td&gt;
&lt;td&gt;2 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x2E&lt;/td&gt;
&lt;td&gt;Subsystem ID&lt;/td&gt;
&lt;td&gt;2 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x30&lt;/td&gt;
&lt;td&gt;Expansion ROM BAR&lt;/td&gt;
&lt;td&gt;4 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x34&lt;/td&gt;
&lt;td&gt;Capabilities Pointer&lt;/td&gt;
&lt;td&gt;1 B ← points into capabilities list&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x3C&lt;/td&gt;
&lt;td&gt;Interrupt Line&lt;/td&gt;
&lt;td&gt;1 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x3D&lt;/td&gt;
&lt;td&gt;Interrupt Pin&lt;/td&gt;
&lt;td&gt;1 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x3E&lt;/td&gt;
&lt;td&gt;Min_Gnt&lt;/td&gt;
&lt;td&gt;1 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x3F&lt;/td&gt;
&lt;td&gt;Max_Lat&lt;/td&gt;
&lt;td&gt;1 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;PCI-compatible header ends at 0x3F&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x40&lt;/td&gt;
&lt;td&gt;Capability structures&lt;/td&gt;
&lt;td&gt;variable (linked list)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0xFF&lt;/td&gt;
&lt;td&gt;End of PCI-compatible space&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x100&lt;/td&gt;
&lt;td&gt;PCIe Extended Capabilities&lt;/td&gt;
&lt;td&gt;variable (linked list, 4 KB total)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0xFFF&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  Header Types:
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;HeaderType&lt;/th&gt;
&lt;th&gt;Device Type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0x00&lt;/td&gt;
&lt;td&gt;Endpoint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x01&lt;/td&gt;
&lt;td&gt;PCI-to-PCI Bridge&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
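&lt;p&gt;The Header Type byte at offset 0x0E packs two fields: bit 7 is the multi-function flag (MFD) and bits [6:0] select the header layout. A sketch of the decode (&lt;code&gt;decode_header_type&lt;/code&gt; is a hypothetical helper, not a real API):&lt;/p&gt;

```python
# Header Type (offset 0x0E): bit 7 = multi-function device (MFD),
# bits [6:0] = header layout (0 = endpoint, 1 = PCI-to-PCI bridge).
def decode_header_type(value):
    multifunction = (value // 128) % 2 == 1
    layout = value % 128
    return multifunction, layout

print(decode_header_type(0x80))  # (True, 0): multi-function endpoint
print(decode_header_type(0x01))  # (False, 1): PCI-to-PCI bridge
```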

&lt;p&gt;Bridges include bus routing registers assigned during enumeration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0x10  BAR0
0x14  BAR1

0x18  Primary Bus Number
0x19  Secondary Bus Number
0x1A  Subordinate Bus Number
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  BAR - Base Address Register
&lt;/h2&gt;

&lt;p&gt;A BAR (Base Address Register) allows a PCIe endpoint to expose memory or register regions to the host.&lt;/p&gt;

&lt;p&gt;Each endpoint can have up to 6 BARs. Each BAR defines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Type:&lt;/strong&gt; Memory (MMIO) or legacy I/O&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Size:&lt;/strong&gt; Power-of-two region size&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prefetchable:&lt;/strong&gt; Indicates reads have no side effects and can be cached/prefetched&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Address Width:&lt;/strong&gt; 32-bit or 64-bit&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  BAR sizing mechanism:
&lt;/h3&gt;

&lt;p&gt;During enumeration, the OS writes &lt;code&gt;0xFFFFFFFF&lt;/code&gt; to a BAR and reads it back. The device hard-wires the size-dependent low address bits to zero, so the value read back is a mask that encodes the region size.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Returned value: &lt;code&gt;0xFFFF0000&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Size: 64 KB&lt;/li&gt;
&lt;/ul&gt;
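&lt;p&gt;The size decode can be sketched in a few lines of Python (simplified: assumes a 32-bit memory BAR and clears the low flag bits [3:0] before the two's-complement step):&lt;/p&gt;

```python
# Decode a 32-bit memory BAR size from the value read back after
# writing all-ones. Low bits [3:0] are type/prefetch flags, not address.
def bar_size(readback):
    mask = readback - readback % 16  # clear flag bits [3:0]
    return 2**32 - mask              # two's-complement of the mask = size

print(bar_size(0xFFFF0000))  # 65536 bytes = 64 KB
```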

&lt;p&gt;After sizing, the OS assigns a physical address. The driver maps it into virtual space using functions like &lt;code&gt;pci_iomap()&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  MSI / MSI-X
&lt;/h2&gt;

&lt;p&gt;Traditional PCI used dedicated interrupt lines (physical wires). PCIe replaces these with message-based interrupts.&lt;/p&gt;

&lt;h3&gt;
  
  
  How MSI works:
&lt;/h3&gt;

&lt;p&gt;During setup, the device is programmed with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A target address that decodes to the interrupt controller rather than ordinary RAM&lt;/li&gt;
&lt;li&gt;A data value&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When the device generates an interrupt:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It sends a &lt;strong&gt;Memory Write TLP&lt;/strong&gt; to that address with the data&lt;/li&gt;
&lt;li&gt;The CPU’s interrupt controller interprets this as an interrupt&lt;/li&gt;
&lt;/ul&gt;
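&lt;p&gt;The mechanism can be modeled as nothing more than a posted memory write. A toy sketch; the address and data values below are made-up examples, not real interrupt-controller addresses:&lt;/p&gt;

```python
# Toy model of MSI: the device posts the programmed data value to the
# programmed address; hardware at that address raises a CPU interrupt.
MSI_ADDRESS = 0xFEE00000  # example doorbell address (illustrative)
MSI_DATA = 0x0042         # example vector/data value (illustrative)

def device_raise_msi(post_write):
    post_write(MSI_ADDRESS, MSI_DATA)

seen = []
device_raise_msi(lambda addr, data: seen.append((addr, data)))
print(seen == [(MSI_ADDRESS, MSI_DATA)])  # True
```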

&lt;h2&gt;
  
  
  DMA
&lt;/h2&gt;

&lt;p&gt;DMA allows devices to directly access system memory without CPU involvement.&lt;/p&gt;

&lt;h3&gt;
  
  
  Types of DMA:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Inbound DMA (Device → Host):&lt;/strong&gt;&lt;br&gt;
Device sends Memory Write TLPs to host memory&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Outbound DMA (Device ← Host):&lt;/strong&gt;&lt;br&gt;
Device sends Memory Read TLPs and receives Completion data&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Addressing:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;CPU uses virtual addresses (via MMU)&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Devices use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Physical addresses, or&lt;/li&gt;
&lt;li&gt;IOMMU-translated IO virtual addresses (IOVA)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

</description>
      <category>architecture</category>
      <category>computerscience</category>
      <category>systems</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Understanding Linux Boot Memory Management</title>
      <dc:creator>Ripan Deuri</dc:creator>
      <pubDate>Wed, 04 Mar 2026 20:07:47 +0000</pubDate>
      <link>https://dev.to/ripan030/understanding-linux-boot-memory-management-1dcn</link>
      <guid>https://dev.to/ripan030/understanding-linux-boot-memory-management-1dcn</guid>
      <description>&lt;p&gt;When the Linux kernel begins executing on ARM64 hardware, the CPU starts in a minimal environment. The Memory Management Unit (MMU) is disabled and the processor executes instructions using physical addresses directly.&lt;/p&gt;

&lt;p&gt;Before Linux can use its normal virtual address space, the kernel must construct the page tables required for address translation. This work happens very early in the boot process, in &lt;code&gt;head.S&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;During this phase the kernel performs three important tasks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Construct minimal page tables&lt;/li&gt;
&lt;li&gt;Create both identity and kernel virtual mappings&lt;/li&gt;
&lt;li&gt;Enable the MMU and switch execution to the kernel's high virtual address space&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This article explains how that process works, using concrete memory layouts and examples.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Boot Environment Assumptions
&lt;/h2&gt;

&lt;p&gt;To make the discussion concrete, assume the following system configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RAM start          : 0x80000000
Kernel load addr   : 0x80800000
Kernel size        : 30 MB
Kernel end         : 0x82600000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Other relevant parameters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Page size          : 4 KB
L2 block size      : 2 MB
Virtual address size : 48 bits
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The bootloader loads the kernel image into RAM at &lt;code&gt;0x80800000&lt;/code&gt; and then jumps to the kernel entry point.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Physical Layout of the Kernel Image
&lt;/h2&gt;

&lt;p&gt;After the bootloader loads the kernel, RAM contains the kernel image and its data sections.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Physical RAM
=======================================================

0x80000000  ───────────────────────────────────────────
             Start of RAM

0x80800000  ───────────────────────────────────────────
             Kernel _text

             Kernel code
             Kernel rodata
             Kernel data
             Kernel BSS

0x82500000  ───────────────────────────────────────────
             init_pg_dir region

0x82600000  ───────────────────────────────────────────
             End of kernel image

=======================================================
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The kernel image contains multiple sections including code, read-only data, writable data, and the BSS section. The BSS section stores zero-initialized global variables. The early page tables are allocated within the BSS region.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Early Page Tables Are Placed in BSS&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;Early boot code cannot use dynamic memory allocation because the memory subsystem is not yet initialized. As a result, the kernel must reserve memory for early structures at build time.&lt;/p&gt;

&lt;p&gt;The ARM64 kernel defines the root page table region, &lt;code&gt;init_pg_dir&lt;/code&gt;, which the linker script places in the BSS section.&lt;/p&gt;

&lt;p&gt;Because the memory already exists inside the kernel image, early boot code can simply reference it directly.&lt;/p&gt;

&lt;p&gt;During boot the physical address of &lt;code&gt;init_pg_dir&lt;/code&gt; becomes the base location where the kernel builds its early page tables.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Page Table Structure
&lt;/h2&gt;

&lt;p&gt;With 4 KB pages and a 48-bit virtual address space, ARM64 uses four levels of page tables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Virtual Address Bits [47:0]:
+-----+-----+-----+-----+------------+
| L0  | L1  | L2  | L3  |  Offset    |
|47:39|38:30|29:21|20:12|   11:0     |
+-----+-----+-----+-----+------------+
  9b    9b    9b    9b      12b
 (512) (512) (512) (512)   (4KB)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VA[47:39] → L0 index
VA[38:30] → L1 index
VA[29:21] → L2 index
VA[20:12] → L3 index
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each level contains 2^9 = 512 entries.&lt;/p&gt;
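&lt;p&gt;The index extraction is simple integer arithmetic. A sketch (using division instead of shifts for clarity):&lt;/p&gt;

```python
# Split a 48-bit VA into four 9-bit table indices plus the page offset.
def pt_indices(va):
    return dict(
        l0=(va // 2**39) % 512,
        l1=(va // 2**30) % 512,
        l2=(va // 2**21) % 512,
        l3=(va // 2**12) % 512,
        offset=va % 4096,
    )

print(pt_indices(0x80800000))
# {'l0': 0, 'l1': 2, 'l2': 4, 'l3': 0, 'offset': 0}
```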

&lt;p&gt;Block mappings can be created at intermediate levels:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;L1 block size = 1 GB
L2 block size = 2 MB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Early boot typically uses &lt;strong&gt;L2 block mappings&lt;/strong&gt; because they are simple and cover memory efficiently.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Early Page Table Memory Layout
&lt;/h2&gt;

&lt;p&gt;The early page tables are placed sequentially in the BSS region.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Physical RAM
=============================================================

0x82500000  ── L0 table (TTBR0 - identity root)

0x82501000  ── L1 table (identity)

0x82502000  ── L2 table (identity)


0x82503000  ── L0 table (TTBR1 - kernel high VA root)

0x82504000  ── L1 table (kernel VA)

0x82505000  ── L2 table (kernel VA)

=============================================================
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each table contains 512 entries of 8 bytes each, so each table occupies exactly one page (4 KB). Each entry holds either the physical address of a block or the physical address of the next-level table.&lt;/p&gt;

&lt;p&gt;Total memory required:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;6 tables × 4 KB = 24 KB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  5. Page Table Creation Steps (Simplified)
&lt;/h2&gt;

&lt;p&gt;The early boot code constructs the minimal set of page tables required before the MMU is enabled.&lt;/p&gt;

&lt;h3&gt;
  
  
  5.1 Clearing Page Table Memory
&lt;/h3&gt;

&lt;p&gt;The first step clears the memory used by &lt;code&gt;init_pg_dir&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This ensures all entries start as invalid descriptors.&lt;/p&gt;

&lt;p&gt;This is implemented using a loop that stores zero values across the reserved region.&lt;/p&gt;

&lt;h3&gt;
  
  
  5.2 Creating the Identity Mapping
&lt;/h3&gt;

&lt;p&gt;The kernel builds an identity mapping for the region containing the kernel image.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VA 0x80800000 → PA 0x80800000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Page table hierarchy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;L0 entry → L1 table
L1 entry → L2 table
L2 entries → 2 MB blocks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since the kernel size is 30 MB, the L2 table needs exactly 15 block entries (30 MB / 2 MB).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;15 blocks × 2 MB = 30 MB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This mapping allows the CPU to continue executing the kernel immediately after the MMU is enabled.&lt;/p&gt;
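&lt;p&gt;The covering L2 entries can be derived directly from the kernel's physical extent (matching the L2[4]..L2[18] entries shown in the layout below):&lt;/p&gt;

```python
# Which 2 MB L2 entries cover the kernel image?
BLOCK = 2 * 1024 * 1024

def l2_entries(start, end):
    first = (start // BLOCK) % 512   # index of the first covering block
    count = (end - start) // BLOCK   # number of 2 MB blocks
    return list(range(first, first + count))

entries = l2_entries(0x80800000, 0x82600000)
print(entries[0], entries[-1], len(entries))  # 4 18 15
```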

&lt;h3&gt;
  
  
  5.3 Creating the Kernel Virtual Mapping
&lt;/h3&gt;

&lt;p&gt;Linux does not run the kernel at low addresses. Instead, the kernel executes in the upper portion of the virtual address space.&lt;/p&gt;

&lt;p&gt;Assuming 48-bit address space:&lt;br&gt;
Kernel VA starts from &lt;code&gt;0xffff_0000_0000_0000 (= PAGE_OFFSET)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Example kernel VA for PA &lt;code&gt;0x8080_0000&lt;/code&gt;: &lt;code&gt;0xFFFF000080800000&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The page tables create a mapping: &lt;code&gt;VA 0xFFFF000080800000 → PA 0x80800000&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This allows the same physical memory to appear at a high virtual address.&lt;/p&gt;
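&lt;p&gt;The address arithmetic is a fixed offset (a sketch; the real kernel wraps this in macros such as &lt;code&gt;__va()&lt;/code&gt;):&lt;/p&gt;

```python
# Kernel high mapping as a constant offset from physical addresses,
# using the 48-bit PAGE_OFFSET value from the article.
PAGE_OFFSET = 0xFFFF000000000000

def phys_to_virt(pa):
    return pa + PAGE_OFFSET

print(hex(phys_to_virt(0x80800000)))  # 0xffff000080800000
```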

&lt;p&gt;So the layout becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Physical RAM
===========================================================

0x80000000  ───────────────────────────────────────────────
             Start of RAM

0x80800000  ───────────────────────────────────────────────
             Kernel _text (bootloader loaded image)

             Kernel code
             Kernel rodata
             Kernel data
             Kernel BSS

0x82500000  ───────────────────────────────────────────────
             init_pg_dir region (inside BSS)

             Early Page Tables
             ──────────────────────────────────────────────

0x82500000  ── L0 table (TTBR0)  → Identity map root
                entry[0] → 0x82501000

0x82501000  ── L1 table (identity)
                entry[2] → 0x82502000

0x82502000  ── L2 table (identity)
                entry[4..18] → 2MB block

                Example:
                L2[4]  → PA 0x80800000
                L2[5]  → PA 0x80A00000
                ...
                L2[18] → PA 0x82400000

0x82503000  ── L0 table (TTBR1) → Kernel virtual root
                entry[511] → 0x82504000

0x82504000  ── L1 table (kernel VA)
                entry[...] → 0x82505000

0x82505000  ── L2 table (kernel VA)
             ──────────────────────────────────────────────
0x825FFFFF  ───────────────────────────────────────────────
             End of kernel image

0x82600000  ───────────────────────────────────────────────
             First free RAM after kernel
===========================================================
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  6. Why Dual Mapping Is Required
&lt;/h2&gt;

&lt;p&gt;At the moment the MMU is enabled, the CPU is already executing instructions from the kernel.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PC = 0x80800100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before enabling the MMU, this address is interpreted as a physical address.&lt;/p&gt;

&lt;p&gt;After enabling the MMU, the CPU interprets the program counter as a virtual address.&lt;/p&gt;

&lt;p&gt;If the page tables contain an identity mapping:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VA 0x80800100 → PA 0x80800100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;the instruction fetch continues successfully.&lt;/p&gt;

&lt;p&gt;Afterward, the kernel performs a branch to its intended virtual address:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0xFFFF000080800000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From that point onward, the kernel runs entirely in high virtual memory.&lt;/p&gt;

&lt;p&gt;If the identity mapping did not exist, enabling the MMU would immediately cause a translation fault.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PC = 0x80800100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After enabling the MMU the CPU attempts to translate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VA 0x80800100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If no mapping exists for that address, the CPU raises an instruction abort. So identity mapping is required during boot.&lt;/p&gt;

&lt;p&gt;Once the page tables are created, the kernel configures the translation system registers.&lt;/p&gt;

&lt;p&gt;The kernel installs the page table base addresses in the following registers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TTBR0_EL1  → identity mapping tables
TTBR1_EL1  → kernel virtual mapping tables
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, the MMU is enabled. At this moment the CPU switches from physical addressing to virtual addressing.&lt;/p&gt;

&lt;p&gt;After enabling the MMU, the kernel performs a branch to its virtual address:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0xFFFF000080800000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The page tables translate this address to the physical location of the kernel in RAM.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VA 0xFFFF000080800000 → PA 0x80800000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From this point forward, the kernel executes entirely in its high virtual address space.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Early in the boot process, the Linux kernel must construct its own memory translation environment before the MMU can be enabled. The code in &lt;code&gt;head.S&lt;/code&gt; performs this task by building minimal page tables inside statically allocated memory.&lt;/p&gt;

&lt;p&gt;Two mappings are created during this phase. An identity mapping ensures that execution continues safely when the MMU is first enabled, while a kernel virtual mapping allows the kernel to run in its intended high address space.&lt;/p&gt;

&lt;p&gt;After the MMU is enabled and execution switches to the high virtual address, the kernel continues building the full virtual memory system used during normal operation. These early page tables therefore serve as the foundation for the entire memory management subsystem.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>computerscience</category>
      <category>linux</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Understanding Cache Coherency</title>
      <dc:creator>Ripan Deuri</dc:creator>
      <pubDate>Wed, 25 Feb 2026 17:25:48 +0000</pubDate>
      <link>https://dev.to/ripan030/understanding-cache-coherency-4lj8</link>
      <guid>https://dev.to/ripan030/understanding-cache-coherency-4lj8</guid>
      <description>&lt;p&gt;Modern high-performance devices communicate with the CPU through shared memory structures such as &lt;a href="https://dev.to/ripan030/how-hardware-and-software-share-a-queue-understanding-dma-rings-pea"&gt;DMA Rings&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When one side updates memory, the other side must see the latest value.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;On cache-coherent systems this happens automatically. On many ARM platforms it does not.&lt;/p&gt;

&lt;p&gt;This post explains what breaks, why it breaks, and how the Linux DMA API solves it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why DMA Fails on Non-Coherent Systems
&lt;/h2&gt;

&lt;p&gt;Consider the completion flow from the earlier ring design in &lt;a href="https://dev.to/ripan030/how-hardware-and-software-share-a-queue-understanding-dma-rings-pea"&gt;How Hardware and Software Share a Queue: Understanding DMA Rings&lt;/a&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Device DMA-writes a completion entry&lt;/li&gt;
&lt;li&gt;Device updates &lt;code&gt;WR_IDX&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;CPU reads &lt;code&gt;WR_IDX&lt;/code&gt; and processes new entries&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;On a non-coherent system the driver may:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;read an old &lt;code&gt;WR_IDX&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;read a partially updated descriptor&lt;/li&gt;
&lt;li&gt;never observe new completions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This happens because the CPU and the DMA engine do not observe memory through the same path.&lt;/p&gt;

&lt;h2&gt;
  
  
  System Hardware View
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                +----------------------+
                |        CPU           |
                |   Driver (load/store)|
                +----------+-----------+
                           |
                      +----v----+
                      |  Cache  |  (L1/L2)
                      +----+----+
                           |
                           |
                    +------v------+
                    |     DDR     |  (System RAM)
                    +------+------+
                           ^
                           | PCIe link
                    +------v------+
                    | PCIe Device |
                    |  DMA Engine |
                    +-------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key observation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU accesses DDR through &lt;strong&gt;cache&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;DMA accesses DDR &lt;strong&gt;directly&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Cache and DDR can hold different data at the same time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the source of incoherency.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Cache Coherency
&lt;/h2&gt;

&lt;p&gt;Physical memory (DDR) is the shared storage.&lt;/p&gt;

&lt;p&gt;The CPU does not read DDR on every load. It reads cached copies stored in cache lines.&lt;/p&gt;

&lt;p&gt;Two operations are required to keep both sides consistent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Flush&lt;/strong&gt; – push updated cache lines to DDR&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Invalidate&lt;/strong&gt; – discard cached copies so the next read comes from DDR&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without these operations, both sides operate on different versions of the same memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  DMA Memory in System DDR
&lt;/h2&gt;

&lt;p&gt;The ring allocated in &lt;a href="https://dev.to/ripan030/how-hardware-and-software-share-a-queue-understanding-dma-rings-pea"&gt;How Hardware and Software Share a Queue: Understanding DMA Rings&lt;/a&gt; resides in &lt;strong&gt;system DDR&lt;/strong&gt;. It is normal RAM shared between CPU and device.&lt;/p&gt;

&lt;p&gt;Coherency is achieved by changing &lt;strong&gt;how the CPU maps that memory&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The same physical DDR page can be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;mapped as &lt;strong&gt;cacheable&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;mapped as &lt;strong&gt;non-cacheable&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is controlled by page table attributes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Memory Types From the CPU Perspective
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Cacheable Memory&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Default for &lt;code&gt;kzalloc&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Fast for CPU&lt;/li&gt;
&lt;li&gt;Not automatically DMA-safe on non-coherent systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Non-cacheable Memory&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU always accesses DDR directly&lt;/li&gt;
&lt;li&gt;No stale cache lines&lt;/li&gt;
&lt;li&gt;Safe for shared control structures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On many ARM systems, coherent DMA memory is implemented using a non-cacheable CPU mapping.&lt;/p&gt;

&lt;h2&gt;
  
  
  Linux Kernel DMA APIs
&lt;/h2&gt;

&lt;p&gt;The Linux kernel provides two usage patterns:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Coherent DMA&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU and device always observe the same data&lt;/li&gt;
&lt;li&gt;No explicit cache maintenance in the driver&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Streaming DMA&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory is cacheable&lt;/li&gt;
&lt;li&gt;Driver must perform cache sync at specific points&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;dma_alloc_coherent()&lt;/code&gt;&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;allocates memory from system RAM (often via CMA or page allocator)&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;returns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU virtual address&lt;/li&gt;
&lt;li&gt;DMA address for the device&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;On non-coherent ARM systems it typically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;maps the region as &lt;strong&gt;non-cacheable for the CPU&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU accesses go directly to DDR&lt;/li&gt;
&lt;li&gt;DMA accesses go to the same DDR&lt;/li&gt;
&lt;li&gt;both sides see identical data without cache operations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why it is ideal for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;descriptor rings&lt;/li&gt;
&lt;li&gt;doorbells&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;kzalloc()&lt;/code&gt; + DMA (Streaming DMA)&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kzalloc&lt;/code&gt; returns &lt;strong&gt;cacheable normal memory&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For DMA usage the driver must:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Map it for DMA: &lt;code&gt;dma_map_single()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Before the device reads the buffer: &lt;code&gt;dma_sync_single_for_device()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;After the device writes the buffer and before the CPU reads it: &lt;code&gt;dma_sync_single_for_cpu()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;When finished: &lt;code&gt;dma_unmap_single()&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Ring Buffer
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Ring allocated with &lt;code&gt;dma_alloc_coherent&lt;/code&gt;&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ring lives in DDR&lt;/li&gt;
&lt;li&gt;CPU mapping is non-cacheable&lt;/li&gt;
&lt;li&gt;Device DMA writes directly to DDR&lt;/li&gt;
&lt;li&gt;Driver reads fresh data&lt;/li&gt;
&lt;li&gt;No cache maintenance required&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Ring allocated with &lt;code&gt;kzalloc&lt;/code&gt;&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;After the interrupt fires and before reading completions, the driver must invalidate the cached lines with &lt;code&gt;dma_sync_single_for_cpu()&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance and Design Trade-offs
&lt;/h2&gt;

&lt;p&gt;Coherent memory:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;simpler&lt;/li&gt;
&lt;li&gt;safe for shared control data&lt;/li&gt;
&lt;li&gt;slower for large CPU accesses (no caching)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Streaming DMA:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fast for bulk data&lt;/li&gt;
&lt;li&gt;requires correct sync points&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Typical design:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;rings → coherent memory&lt;/li&gt;
&lt;li&gt;data buffers → streaming DMA&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;On non-coherent systems, the CPU cache and the DMA engine observe DDR through different paths. The Linux DMA API bridges this gap by either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;creating a coherent mapping, or&lt;/li&gt;
&lt;li&gt;providing explicit cache synchronization primitives.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>architecture</category>
      <category>computerscience</category>
      <category>linux</category>
      <category>performance</category>
    </item>
    <item>
      <title>Understanding DMA Rings</title>
      <dc:creator>Ripan Deuri</dc:creator>
      <pubDate>Sat, 21 Feb 2026 10:16:59 +0000</pubDate>
      <link>https://dev.to/ripan030/how-hardware-and-software-share-a-queue-understanding-dma-rings-pea</link>
      <guid>https://dev.to/ripan030/how-hardware-and-software-share-a-queue-understanding-dma-rings-pea</guid>
      <description>&lt;p&gt;Modern high-performance systems rely on a shared memory queue for communication between hardware and software, where the device writes data using DMA and indicates new work by updating an index. This mechanism is widely used in network controllers, NVMe storage, GPUs, and asynchronous I/O frameworks because it eliminates lock contention, reduces register access, and allows both sides to operate independently at high throughput.&lt;/p&gt;

&lt;p&gt;Understanding this structure requires looking beyond the idea of a circular buffer and focusing on ownership transfer, memory ordering, and cache visibility. These are the concepts that determine correctness and performance in real driver implementations.&lt;/p&gt;

&lt;p&gt;This post explains how a lock-free queue is shared between hardware and software and breaks down the synchronization model that makes it work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Mechanism Exists
&lt;/h2&gt;

&lt;p&gt;At high data rates, traditional communication methods between software and hardware become too expensive:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reading device registers frequently causes latency.&lt;/li&gt;
&lt;li&gt;Locking shared structures limits parallelism.&lt;/li&gt;
&lt;li&gt;Interrupt-per-event models do not scale.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead, modern devices and drivers communicate through &lt;strong&gt;shared memory queues&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The key idea is simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The device publishes completed work into memory using DMA, and software consumes it later.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This removes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;register polling from the fast path&lt;/li&gt;
&lt;li&gt;lock contention&lt;/li&gt;
&lt;li&gt;synchronous handshakes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and replaces them with &lt;strong&gt;ownership transfer over a circular buffer&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Shared Memory Layout and Ownership Model
&lt;/h2&gt;

&lt;p&gt;The circular queue lives in system DDR memory and is accessible to both the CPU and the device.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+--------------------------------------------------------------------+
|                            HOST SYSTEM                             |
|  +------------------+                                              |
|  |       CPU        |                                              |
|  |    +--------+    |                                              |
|  |    | Driver |------------------------------------------------+  |
|  |    +--------+    |                                           |  |
|  +---------^--------+                                           |  |
|            |                                                    |  |
|  +---------v--------+                                           |  |
|  |      Cache       |                                           |  |
|  |     L1 / L2      |                                           |  |
|  +---------^--------+                                           |  |
|            | Cache lines                                        |  |
|  +---------v-------------------------------------------------+  |  |
|  |                 SYSTEM DDR (Non-Coherent)                 |  |  |
|  |   +------------------------+    +----------------+        |  |  |
|  |   | Desc 0 | Desc 1 | ...  |    |     WR_IDX     |        |  |  |
|  |   +------------------^-----+    +-----^----------+        |  |  |
|  |   RING DESCRIPTORS   |                |  WR_ADDR (SHADOW) |  |  |
|  |                      +--------+-------+                   |  |  |
|  +-------------------------------|---------------------------+  |  |
|                        DMA write |         MMIO write (RD_IDX)  |  |
|                (Metadata, WR_IDX)|         +--------------------+  |
|                                  |         |                       |
|                     +----------------------v----+                  |
|                     |       Root Complex        |                  |
|                     +------------^--------------+                  |
+----------------------------------|---------------------------------+
                                   |                              
                                   | PCIe Link                    
                                   |
+----------------------------------v---------------------------------+
|                             PCIe DEVICE                            |
|  MMIO RING REGS                                                    |
|  +--------------+          +----------------------------+          |
|  | BASE_ADDR    |          |         DMA ENGINE         |          |
|  +--------------+          +----------------------------+          |
|  | ...          |                                                  |
|  +--------------+          +----------------------------+          |
|  | WR_ADDR      |          |           MSI-X            |          |
|  +--------------+          +----------------------------+          |
|  | RD_IDX       |                                                  |
|  +--------------+                                                  |
+--------------------------------------------------------------------+

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Device → advances WR_IDX&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Driver → advances RD_IDX&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At any moment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      RD_IDX
        │
        +----+----+----+----+----+----+----+----+
        | S  | S  | S  | D  | D  | D  | D  | D  |
        +----+----+----+----+----+----+----+----+
                       |
                     WR_IDX

Driver owns   : [RD_IDX … WR_IDX)
Device owns   : [WR_IDX … RD_IDX)

[D]  → Device-owned slot (empty, can be filled by HW)
[S]  → Driver-owned slot (ready to process by SW)

The ring grows clockwise ➜
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How Ownership Moves Around the Ring
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Init - no valid entries&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        RD_IDX
        │
        +----+----+----+----+----+----+----+----+
        | D  | D  | D  | D  | D  | D  | D  | D  |
        +----+----+----+----+----+----+----+----+
        |
        WR_IDX
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Device fills new entries&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                            WR_IDX
                            │
        +----+----+----+----+----+----+----+----+
        | S  | S  | S  | S  | D  | D  | D  | D  |
        +----+----+----+----+----+----+----+----+
        │
        RD_IDX
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Driver processes new entries&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                            WR_IDX
                            │
        +----+----+----+----+----+----+----+----+
        | D  | D  | S  | S  | D  | D  | D  | D  |
        +----+----+----+----+----+----+----+----+
                  │
                  RD_IDX
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Wrap-around - indices wrap modulo ring size&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;              WR_IDX
             │
        +----+----+----+----+----+----+----+----+
        | S  | D  | S  | S  | S  | S  | S  | S  |
        +----+----+----+----+----+----+----+----+
                  │
                  RD_IDX
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Full ring - device must stop&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If WR_IDX catches RD_IDX:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                  WR_IDX
                  │
        +----+----+----+----+----+----+----+----+
        | S  | S  | S  | S  | S  | S  | S  | S  |
        +----+----+----+----+----+----+----+----+
                  │
                  RD_IDX
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are no device-owned slots. Device cannot write.&lt;/p&gt;

&lt;p&gt;This is not an error - it is &lt;strong&gt;backpressure&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Empty ring - driver has nothing to do&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If RD_IDX catches WR_IDX:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                  WR_IDX
                  │
        +----+----+----+----+----+----+----+----+
        | D  | D  | D  | D  | D  | D  | D  | D  |
        +----+----+----+----+----+----+----+----+
                  │
                  RD_IDX
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No software-owned entries. Driver stops processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lifecycle of a Completion: From Device to Driver
&lt;/h2&gt;

&lt;p&gt;This sequence describes how a real device reports finished work to software through the shared ring.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;[1] Initialization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;During setup, the driver:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Allocates the ring in system memory.&lt;/li&gt;
&lt;li&gt;Programs the device with:

&lt;ul&gt;
&lt;li&gt;the ring base address&lt;/li&gt;
&lt;li&gt;the ring size&lt;/li&gt;
&lt;li&gt;the address where WR_IDX will be written (shadow in host memory).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Initializes RD_IDX to zero.&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;At this point:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The queue contains no valid entry.&lt;/li&gt;
&lt;li&gt;The entire ring is owned by the device.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;[2] Device finishes processing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The device already knows where the result data should be placed.&lt;br&gt;
This typically comes from a separate provisioning mechanism (another queue or pre-registered buffers).&lt;/p&gt;

&lt;p&gt;It DMA-writes the result into system memory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;[3] Device writes a completion entry&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The device selects the slot at its current WR_IDX and DMA-writes a completion record.&lt;/p&gt;

&lt;p&gt;This record may contain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an identifier for the buffer or request&lt;/li&gt;
&lt;li&gt;the length of valid data&lt;/li&gt;
&lt;li&gt;status or error information&lt;/li&gt;
&lt;li&gt;device-generated metadata&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this stage the entry exists in memory, but software does not yet know that it is valid.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;[4] Device publishes WR_IDX&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After the completion entry is fully written, the device updates WR_IDX in host memory.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The index update is the visibility point for software.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;[5] Interrupt&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The device may generate an interrupt to notify the CPU. Refer to &lt;a href="https://dev.to/ripan030/linux-kernel-interrupt-1e7n"&gt;How an Interrupt Reaches the CPU&lt;/a&gt; to understand how the interrupt is delivered.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;[6] Software consumption&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When software runs (either due to an interrupt or polling):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reads WR_IDX to determine how far the device has progressed.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;processes entries in the range: [RD_IDX … WR_IDX). For each entry:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;interpret the completion record&lt;/li&gt;
&lt;li&gt;recycle the associated resources&lt;/li&gt;
&lt;li&gt;advance RD_IDX&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;[7] Returning ownership to the device&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After consuming entries, software writes the updated RD_IDX to the device via &lt;a href="https://dev.to/ripan030/memory-mapped-io-mmio-5bn8"&gt;MMIO&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This tells the device:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;These slots are free again.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Cache Coherency and DMA Visibility
&lt;/h2&gt;

&lt;p&gt;On cache-coherent systems, CPU and device observe the same memory contents automatically.&lt;/p&gt;

&lt;p&gt;On non-coherent systems, DMA updates system memory but the CPU may still read stale data from its cache.&lt;/p&gt;

&lt;p&gt;Before reading new completions, the driver must invalidate the cache lines that cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the completion entries&lt;/li&gt;
&lt;li&gt;WR_IDX&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Otherwise, software may see an old index or partially updated entries even though the device has already written the new data to memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Memory Ordering
&lt;/h2&gt;

&lt;p&gt;The queue works because both sides publish progress in a strictly defined order. Without this ordering, an index can become visible before the data it refers to.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Device side&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The device must ensure:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;completion entry write → WR_IDX update&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This guarantees that when software observes the new WR_IDX, the corresponding completion entry is already fully written in memory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Software side&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Software must:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;read WR_IDX → then read the completion entries&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This prevents the CPU from speculatively reading ring contents before it knows how far the device has progressed.&lt;/p&gt;

&lt;p&gt;These rules are enforced with &lt;strong&gt;memory barriers&lt;/strong&gt; in the driver and with ordering guarantees in the device.&lt;/p&gt;

&lt;h2&gt;
  
  
  Timeline View
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuyqz1qdcyd9rouif2brq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuyqz1qdcyd9rouif2brq.png" alt="Timeline between device and driver" width="519" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;A shared ring is a contract where hardware and software exchange ownership through ordered index updates. Completed work becomes visible when WR_IDX is updated, and buffer space is returned to the device when RD_IDX advances. This memory-based publication model removes locks, reduces MMIO traffic, and enables scalable, high-throughput operation.&lt;/p&gt;

</description>
      <category>algorithms</category>
      <category>architecture</category>
      <category>computerscience</category>
      <category>performance</category>
    </item>
    <item>
      <title>From Reset to Control: Disabling Interrupts on ARM Bare Metal</title>
      <dc:creator>Ripan Deuri</dc:creator>
      <pubDate>Fri, 02 Jan 2026 06:59:03 +0000</pubDate>
      <link>https://dev.to/ripan030/bare-metal-arm-bootstrapping-disabling-interrupts-and-reading-cpsr-2ojd</link>
      <guid>https://dev.to/ripan030/bare-metal-arm-bootstrapping-disabling-interrupts-and-reading-cpsr-2ojd</guid>
      <description>&lt;p&gt;Bare-metal execution on ARMv7 begins at the reset vector, long before any C environment exists. When a Cortex-A9 leaves reset under QEMU’s &lt;code&gt;vexpress-a9&lt;/code&gt; model, the processor enters Supervisor mode with interrupts masked and the MMU disabled. The stack pointer is undefined, no memory sections are initialized, and no handlers are installed. Execution begins only with a defined program counter and CPSR value.&lt;/p&gt;

&lt;p&gt;This post examines that earliest stage of execution and shows how a minimal block of startup assembly takes control after reset: explicitly masking interrupts, verifying the processor’s mode, and halting in a known state. This establishes a predictable baseline before introducing stacks, memory initialization, and eventually a C runtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  From Reset Vector to &lt;code&gt;_start&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;At reset, ARMv7 defines the following initial conditions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mode:&lt;/strong&gt; Supervisor (&lt;code&gt;0b10011&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IRQ mask (CPSR.I):&lt;/strong&gt; 1 (disabled)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FIQ mask (CPSR.F):&lt;/strong&gt; 1 (disabled)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instruction set:&lt;/strong&gt; ARM state&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MMU:&lt;/strong&gt; Disabled&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Although interrupts are architecturally masked at reset, early startup code should not rely on this implicit state. Explicitly disabling interrupts ensures a deterministic environment before installing a vector table or exception handlers.&lt;/p&gt;

&lt;p&gt;The previous post &lt;a href="https://dev.to/ripan030/reset-on-armv7-42p7?preview=faa923959e61c5f6237c34a0aff32b99be91f7f7de66dc73eef40ecb1add9326590b15aee947a94e9e7c942f17734e55b219de281290336a1fc991a2"&gt;Bare Metal ARM Boot: Understanding the Reset Vector and First Instructions&lt;/a&gt; established a minimal vector table and reset vector:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vector table placed at address 0x0 in flash&lt;/li&gt;
&lt;li&gt;Reset vector branches from &lt;code&gt;_vectors&lt;/code&gt; to &lt;code&gt;_start&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;_start&lt;/code&gt; contained an infinite loop&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This post builds on that example by giving &lt;code&gt;_start&lt;/code&gt; its first real responsibility: enforcing interrupt masking before halting.&lt;/p&gt;

&lt;h2&gt;
  
  
  CPSR Overview
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Current Program Status Register (CPSR)&lt;/strong&gt; controls key aspects of execution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bits [31:28]&lt;/strong&gt; — Condition flags (N, Z, C, V)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bit 7&lt;/strong&gt; — IRQ disable (I)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bit 6&lt;/strong&gt; — FIQ disable (F)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bits [4:0]&lt;/strong&gt; — Mode bits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Common mode encodings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;0x10&lt;/code&gt; — User&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;0x11&lt;/code&gt; — FIQ&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;0x12&lt;/code&gt; — IRQ&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;0x13&lt;/code&gt; — Supervisor&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;0x17&lt;/code&gt; — Abort&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;0x1B&lt;/code&gt; — Undefined&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;0x1F&lt;/code&gt; — System&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Disabling Interrupts Explicitly
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;&lt;code&gt;cpsid&lt;/code&gt;&lt;/strong&gt; instruction (Change Processor State, Interrupt Disable) provides direct control over interrupt masking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;cpsid i&lt;/code&gt; — Disable IRQ&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cpsid f&lt;/code&gt; — Disable FIQ&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cpsid if&lt;/code&gt; — Disable both&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;cpsid&lt;/code&gt; is a privileged instruction, so it can execute only in modes such as Supervisor. Issuing &lt;code&gt;cpsid if&lt;/code&gt; at startup prevents accidental exception entry until valid handlers are in place.&lt;/p&gt;

&lt;h2&gt;
  
  
  Minimal Startup Assembly
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;startup.s&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.section .vectors, "ax"
.global _vectors
_vectors:
    b _start            @ Reset vector: branch to startup

.section .text
.global _start
_start:
    @ Disable IRQ and FIQ interrupts
    cpsid   if

    @ Infinite loop
halt:
    b   halt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Verifying Behavior with GDB
&lt;/h2&gt;

&lt;p&gt;A breakpoint confirms that execution flows from the reset vector into &lt;code&gt;_start&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="nb"&gt;break &lt;/span&gt;_start
Breakpoint 1 at 0x4: file startup.s, line 9.
&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="k"&gt;continue
&lt;/span&gt;Continuing.

Breakpoint 1, _start &lt;span class="o"&gt;()&lt;/span&gt; at startup.s:9
9       cpsid   &lt;span class="k"&gt;if&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Examining the CPSR before and after executing &lt;code&gt;cpsid if&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; info registers cpsr
cpsr           0x400001d3          1073742291
&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; stepi
10      b &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; info registers cpsr
cpsr           0x400001d3          1073742291
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CPSR value remains unchanged because interrupts were already masked at reset.&lt;/p&gt;

&lt;p&gt;Binary view:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; print/t &lt;span class="nv"&gt;$cpsr&lt;/span&gt;
&lt;span class="nv"&gt;$1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 1000000000000000000000111010011
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Interpretation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Bits[4:0] = &lt;code&gt;10011&lt;/code&gt; → Supervisor mode&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bit 6 = 1 → FIQ disabled&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bit 7 = 1 → IRQ disabled&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This confirms that startup code is executing in a privileged mode with both interrupt sources masked.&lt;/p&gt;
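&lt;p&gt;The bit-field interpretation above can be sketched as a small C helper (a hypothetical decoder for illustration, not part of the startup code):&lt;/p&gt;

```c
#include <stdint.h>

/* Decode the ARMv7 CPSR fields discussed above:
   bits[4:0] = processor mode, bit 6 = F (FIQ mask), bit 7 = I (IRQ mask). */
static uint32_t cpsr_mode(uint32_t cpsr)       { return cpsr & 0x1f; }
static int      cpsr_fiq_masked(uint32_t cpsr) { return (cpsr >> 6) & 1; }
static int      cpsr_irq_masked(uint32_t cpsr) { return (cpsr >> 7) & 1; }
```

&lt;p&gt;For the observed value &lt;code&gt;0x400001D3&lt;/code&gt;, the mode field is &lt;code&gt;0x13&lt;/code&gt; (Supervisor) and both mask bits read 1.&lt;/p&gt;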

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The processor now transitions cleanly from reset into startup assembly that establishes a controlled execution state. Interrupt masking is explicitly enforced, and the processor mode is verified. With this foundation in place, the subsequent steps (stack setup, &lt;code&gt;.data&lt;/code&gt; and &lt;code&gt;.bss&lt;/code&gt; initialization, and entry into &lt;code&gt;main()&lt;/code&gt;) can be introduced safely.&lt;/p&gt;

&lt;p&gt;The next post builds on this minimal bootstrap to construct a usable C runtime for bare-metal ARM systems.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>computerscience</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Linker Scripts Explained: Controlling Memory Layout on Bare Metal</title>
      <dc:creator>Ripan Deuri</dc:creator>
      <pubDate>Sat, 13 Dec 2025 08:23:12 +0000</pubDate>
      <link>https://dev.to/ripan030/linker-scripts-explained-controlling-memory-layout-on-bare-metal-3ocb</link>
      <guid>https://dev.to/ripan030/linker-scripts-explained-controlling-memory-layout-on-bare-metal-3ocb</guid>
      <description>&lt;ul&gt;
&lt;li&gt;
The Anatomy of a Minimal Linker Script

&lt;ul&gt;
&lt;li&gt;ENTRY Directive&lt;/li&gt;
&lt;li&gt;MEMORY Block&lt;/li&gt;
&lt;li&gt;SECTIONS Block&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;The Location Counter: &lt;code&gt;.&lt;/code&gt;
&lt;/li&gt;

&lt;li&gt;VMA vs LMA: Virtual and Load Memory Addresses&lt;/li&gt;

&lt;li&gt;Linker-Defined Symbols&lt;/li&gt;

&lt;li&gt;The Map File&lt;/li&gt;

&lt;li&gt;Complete Minimal Example&lt;/li&gt;

&lt;li&gt;

Verification: What the Linker Produced

&lt;ul&gt;
&lt;li&gt;Step 1: Examine Disassembly&lt;/li&gt;
&lt;li&gt;Step 2: Examine Section Headers&lt;/li&gt;
&lt;li&gt;Step 3: Examine the Map File&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Verification: Loading in QEMU and Inspecting with GDB&lt;/li&gt;

&lt;li&gt;Alignment Directives&lt;/li&gt;

&lt;li&gt;Conclusion&lt;/li&gt;

&lt;/ul&gt;




&lt;p&gt;Compilers generate relocatable object code—machine instructions and data whose addresses are not yet fixed. On hosted systems, a loader chooses the runtime addresses of each segment. Bare-metal systems have no loader: the ELF file’s section addresses become the addresses used directly by the CPU.&lt;/p&gt;

&lt;p&gt;A linker script provides this mapping. It determines:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;What memory regions are available?&lt;/strong&gt; For example, FLASH at 0x00000000 (64M), RAM at 0x60000000 (128M), etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Where should each section go?&lt;/strong&gt; &lt;code&gt;.text&lt;/code&gt; to flash, &lt;code&gt;.bss&lt;/code&gt; to RAM, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exactly where each section begins&lt;/strong&gt;, with alignment enforced according to architecture requirements.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Without an explicit script, the linker defaults to placing sections at low addresses such as &lt;strong&gt;0x00000000&lt;/strong&gt;, which rarely matches real hardware memory layouts. Bare-metal firmware requires deterministic placement, so linker scripts are a fundamental tool.&lt;/p&gt;

&lt;p&gt;In this post, the linker script targets &lt;strong&gt;QEMU’s vexpress-a9&lt;/strong&gt;, where RAM begins at &lt;strong&gt;0x60000000&lt;/strong&gt;. QEMU’s &lt;code&gt;-kernel&lt;/code&gt; argument loads ELF segments to their VMA addresses, so linking into RAM is sufficient for initial bring-up.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Anatomy of a Minimal Linker Script
&lt;/h2&gt;

&lt;p&gt;A linker script has two primary blocks: &lt;code&gt;MEMORY&lt;/code&gt; and &lt;code&gt;SECTIONS&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ENTRY(_start)

MEMORY
{
    RAM (rwx) : ORIGIN = 0x60000000, LENGTH = 128M
}

SECTIONS
{
    . = 0x60000000;
    .text : {
        *(.text)
    } &amp;gt; RAM
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each component has a specific purpose.&lt;/p&gt;


&lt;h3&gt;
  
  
  ENTRY Directive
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;ENTRY(_start)&lt;/code&gt; specifies the program's entry point symbol. When the ELF file is loaded, this symbol's address becomes the starting point that debuggers and loaders recognize. In bare metal, &lt;code&gt;_start&lt;/code&gt; should match the first instruction executed after the reset vector transfers control.&lt;/p&gt;

&lt;p&gt;Note that QEMU does not read an architectural reset vector for vexpress-a9 when using &lt;code&gt;-kernel&lt;/code&gt;. Instead, it sets the CPU’s PC to the ELF entry point address.&lt;/p&gt;


&lt;h3&gt;
  
  
  MEMORY Block
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;MEMORY&lt;/code&gt; block declares available address regions and their properties. Each region has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Name:&lt;/strong&gt; &lt;code&gt;RAM&lt;/code&gt;, &lt;code&gt;FLASH&lt;/code&gt; - labels used in &lt;code&gt;SECTIONS&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Attributes:&lt;/strong&gt; &lt;code&gt;r&lt;/code&gt; (read), &lt;code&gt;w&lt;/code&gt; (write), &lt;code&gt;x&lt;/code&gt; (execute)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ORIGIN:&lt;/strong&gt; the starting address of the region&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LENGTH:&lt;/strong&gt; the size in bytes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MEMORY
{
    FLASH (rx)  : ORIGIN = 0x00000000, LENGTH = 64M
    RAM (rwx)   : ORIGIN = 0x60000000, LENGTH = 128M
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Attributes are advisory. They let the linker validate that sections are placed in regions matching their needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;FLASH (rx)&lt;/code&gt;: read and execute only; marking it writable would be nonsensical&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;RAM (rwx)&lt;/code&gt;: fully flexible for code, initialized data, uninitialized data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The linker issues warnings if a section with write permission is assigned to a read-only region, helping catch configuration mistakes.&lt;/p&gt;


&lt;h3&gt;
  
  
  SECTIONS Block
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;SECTIONS&lt;/code&gt; block specifies the output memory layout. Each entry maps input sections (from object files) to output sections (in the final binary) and assigns a memory location.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SECTIONS
{
    .text : {
        *(.text)
    } &amp;gt; FLASH

    .rodata : {
        *(.rodata*)
    } &amp;gt; FLASH
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Breaking this down:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;.text : { ... }&lt;/code&gt; defines an output section named &lt;code&gt;.text&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;*(.text)&lt;/code&gt; means "include all &lt;code&gt;.text&lt;/code&gt; input sections from all object files"&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;&amp;gt; FLASH&lt;/code&gt; assigns this output section to the FLASH memory region&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;*&lt;/code&gt; is a wildcard matching all input files. More specific patterns are possible (e.g., &lt;code&gt;startup.o(.text)&lt;/code&gt; to include only startup's .text section), but wildcards are typical for bare-metal work.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Location Counter: &lt;code&gt;.&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;The linker maintains an implicit variable called the &lt;strong&gt;location counter&lt;/strong&gt;, written as a dot: &lt;code&gt;.&lt;/code&gt;. The location counter tracks the current position within memory and automatically increments as sections are laid out.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SECTIONS
{
    . = 0x60000000;         /* Set location counter to 0x60000000 */

    .text : {
        *(.text)            /* Place .text at current location */
    } &amp;gt; RAM
    /* After .text, . is automatically incremented by .text's size */

    .rodata : {
        *(.rodata*)         /* Place .rodata immediately after .text */
    } &amp;gt; RAM
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Explicit assignment of &lt;code&gt;.&lt;/code&gt; ensures predictable placement. Without &lt;code&gt;. = 0x60000000;&lt;/code&gt;, the linker might place &lt;code&gt;.text&lt;/code&gt; at 0x0 by default, ignoring the intended RAM region.&lt;/p&gt;

&lt;p&gt;The location counter can also be used to create symbols:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;_text_start = .;            /* Symbol marks current position */
.text : { *(.text) } &amp;gt; RAM
_text_end = .;              /* Symbol marks end position */
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These symbols have no storage; they are merely addresses that can be referenced from assembly or C.&lt;/p&gt;


&lt;h2&gt;
  
  
  VMA vs LMA: Virtual and Load Memory Addresses
&lt;/h2&gt;

&lt;p&gt;Every output section has two associated addresses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;VMA (Virtual Memory Address):&lt;/strong&gt; where the section resides during execution (runtime)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LMA (Load Memory Address):&lt;/strong&gt; where the section is stored initially (typically in non-volatile flash)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In simple cases—executable code stored and executed from the same location—VMA and LMA are identical. Both the &lt;code&gt;&amp;gt; RAM&lt;/code&gt; clause and implicit location counter assignment govern VMA.&lt;/p&gt;

&lt;p&gt;The classic case is &lt;code&gt;.data&lt;/code&gt;: its initial values are stored in flash (LMA), but the program reads and writes it at runtime in RAM (VMA). Startup code copies the contents from flash to RAM before &lt;code&gt;main()&lt;/code&gt; runs.&lt;/p&gt;

&lt;p&gt;Linker script syntax to specify both addresses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.data : {
    *(.data)
} &amp;gt; RAM AT &amp;gt; FLASH          /* VMA in RAM, LMA in FLASH */
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;&amp;gt; RAM&lt;/code&gt; specifies VMA. The &lt;code&gt;AT &amp;gt; FLASH&lt;/code&gt; specifies LMA. The linker stores the section in FLASH (LMA) but generates symbols and relocation information assuming runtime execution from RAM (VMA).&lt;/p&gt;
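&lt;p&gt;The startup-time copy this implies can be sketched in C. The symbol names (&lt;code&gt;_data_load&lt;/code&gt;, &lt;code&gt;_data_start&lt;/code&gt;, &lt;code&gt;_data_end&lt;/code&gt;) are assumptions, and the arrays below are stand-ins so the sketch is self-contained; in real firmware they would be &lt;code&gt;extern&lt;/code&gt; symbols defined by the linker script:&lt;/p&gt;

```c
#include <stddef.h>
#include <string.h>

/* Stand-ins for linker-defined symbols (assumed names):
   _data_load  = LMA of .data (the flash image of its initial values)
   _data_start = VMA of .data (its runtime location in RAM)
   _data_end   = end of .data in RAM */
static const char _data_load[8] = {1, 2, 3, 4, 5, 6, 7, 8};
static char       _data_start[8];
#define _data_end (_data_start + sizeof _data_start)

/* Copy .data from its load address to its runtime address,
   as startup code would before calling main(). */
static void copy_data(void)
{
    memcpy(_data_start, _data_load, (size_t)(_data_end - _data_start));
}
```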


&lt;h2&gt;
  
  
  Linker-Defined Symbols
&lt;/h2&gt;

&lt;p&gt;The linker can create symbols by assigning the location counter or other expressions to a name. These symbols exist only as addresses; they occupy no storage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SECTIONS
{
    . = 0x60000000;

    _text_start = .;        /* Symbol: address where .text begins */

    .text : {
        *(.text)
    } &amp;gt; RAM

    _text_end = .;          /* Symbol: address where .text ends */

    _stack_top = ORIGIN(RAM) + LENGTH(RAM);  /* Custom symbol */
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In assembly, these symbols are loaded using the &lt;code&gt;ldr rX, =symbol&lt;/code&gt; pseudo-instruction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ldr r0, =_text_start        @ Load symbol address into r0
ldr sp, =_stack_top         @ Load stack top address into sp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;ldr =&lt;/code&gt; pseudo-instruction generates a PC-relative LDR that references a nearby literal pool. The linker resolves the symbol address and places it in the pool; when the CPU executes the LDR, that address is loaded into the register.&lt;/p&gt;
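&lt;p&gt;The same symbols can be consumed from C. The conventional idiom declares them as arrays so that only their addresses are used; the local definition below is a stand-in so the sketch compiles on its own, where real firmware would use the &lt;code&gt;extern&lt;/code&gt; declaration shown in the comment:&lt;/p&gt;

```c
#include <stddef.h>

/* In firmware: extern char _text_start[], _text_end[];
   The linker script defines these symbols; they occupy no storage.
   The stand-in below keeps this sketch self-contained. */
static char _text_start[12];
#define _text_end (_text_start + sizeof _text_start)

/* Size of .text, computed from linker-defined boundary symbols. */
static size_t text_size(void)
{
    return (size_t)(_text_end - _text_start);
}
```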


&lt;h2&gt;
  
  
  The Map File
&lt;/h2&gt;

&lt;p&gt;The linker can generate a map file showing all sections, symbols, and their addresses. Generate it by adding &lt;code&gt;-Map=output.map&lt;/code&gt; to the linker command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;arm-none-eabi-ld &lt;span class="nt"&gt;-T&lt;/span&gt; linker.ld startup.o &lt;span class="nt"&gt;-o&lt;/span&gt; boot.elf &lt;span class="nt"&gt;-Map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;output.map
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h2&gt;
  
  
  Complete Minimal Example
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;startup.s&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.global _start

.section .vectors, "ax"
_vectors:
    b _start                @ branch to _start

.section .text
_start:
    ldr r0, =0xDEADBEEF     @ Load test value
    b .                     @ Infinite loop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that QEMU does not use the vectors section above as a real reset vector when booting with &lt;code&gt;-kernel&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;linker.ld&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ENTRY(_start)

MEMORY
{
    RAM (rwx) : ORIGIN = 0x60000000, LENGTH = 128M
}

SECTIONS
{
    . = 0x60000000;

    .vectors : {
        *(.vectors)
    } &amp;gt; RAM

    .text : {
        *(.text)
    } &amp;gt; RAM
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Build Commands&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Assemble the startup code&lt;/span&gt;
arm-none-eabi-as &lt;span class="nt"&gt;-mcpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;cortex-a9 &lt;span class="nt"&gt;-g&lt;/span&gt; startup.s &lt;span class="nt"&gt;-o&lt;/span&gt; startup.o

&lt;span class="c"&gt;# Link with the linker script&lt;/span&gt;
arm-none-eabi-ld &lt;span class="nt"&gt;-T&lt;/span&gt; linker.ld startup.o &lt;span class="nt"&gt;-o&lt;/span&gt; boot.elf

&lt;span class="c"&gt;# Generate a map file&lt;/span&gt;
arm-none-eabi-ld &lt;span class="nt"&gt;-T&lt;/span&gt; linker.ld startup.o &lt;span class="nt"&gt;-o&lt;/span&gt; boot.elf &lt;span class="nt"&gt;-Map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;output.map

&lt;span class="c"&gt;# Create raw binary (optional, for certain QEMU loading modes)&lt;/span&gt;
arm-none-eabi-objcopy &lt;span class="nt"&gt;-O&lt;/span&gt; binary boot.elf boot.bin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h2&gt;
  
  
  Verification: What the Linker Produced
&lt;/h2&gt;


&lt;h3&gt;
  
  
  Step 1: Examine Disassembly
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;arm-none-eabi-objdump &lt;span class="nt"&gt;-d&lt;/span&gt; boot.elf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Excerpt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Disassembly of section .vectors:

60000000 &amp;lt;_vectors&amp;gt;:
60000000:   eaffffff    b   60000004 &amp;lt;_start&amp;gt;

Disassembly of section .text:

60000004 &amp;lt;_start&amp;gt;:
60000004:   e51f0000    ldr r0, [pc, #-0]   @ 6000000c &amp;lt;_start+0x8&amp;gt;
60000008:   eafffffe    b   60000008 &amp;lt;_start+0x4&amp;gt;
6000000c:   deadbeef    .word   0xdeadbeef
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key observations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Addresses in the disassembly match the VMA reported by &lt;code&gt;objdump -h&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.vectors&lt;/code&gt; at 0x60000000 contains a branch to &lt;code&gt;_start&lt;/code&gt; at 0x60000004&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ldr r0, =0xDEADBEEF&lt;/code&gt; assembles to the PC-relative load &lt;code&gt;ldr r0, [pc, #-0]&lt;/code&gt; at 0x60000004, which reads the literal stored at 0x6000000c&lt;/li&gt;
&lt;li&gt;The infinite loop &lt;code&gt;b .&lt;/code&gt; branches to itself at 0x60000008&lt;/li&gt;
&lt;li&gt;Literal &lt;code&gt;deadbeef&lt;/code&gt; is stored at 0x6000000c&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  Step 2: Examine Section Headers
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;arm-none-eabi-objdump &lt;span class="nt"&gt;-h&lt;/span&gt; boot.elf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Excerpt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .vectors      00000004  60000000  60000000  00001000  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  1 .text         0000000c  60000004  60000004  00001004  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key observations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VMA shows where code resides at runtime: 0x60000000 for &lt;code&gt;.vectors&lt;/code&gt;, 0x60000004 for &lt;code&gt;.text&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;LMA matches VMA in this simple case (both in RAM)&lt;/li&gt;
&lt;li&gt;Section sizes: 4 bytes for the branch instruction in &lt;code&gt;.vectors&lt;/code&gt;, 12 bytes for the load and branch in &lt;code&gt;.text&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  Step 3: Examine the Map File
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;output.map | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-30&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Excerpt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Memory Configuration

Name             Origin             Length             Attributes
RAM              0x60000000         0x08000000         xrw
*default*        0x00000000         0xffffffff

Linker script and memory map

                0x60000000                        . = 0x60000000

.vectors        0x60000000        0x4
 *(.vectors)
 .vectors       0x60000000        0x4 startup.o

.text           0x60000004        0xc
 *(.text)
 .text          0x60000004        0xc startup.o
                0x60000004                _start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h2&gt;
  
  
  Verification: Loading in QEMU and Inspecting with GDB
&lt;/h2&gt;

&lt;p&gt;Launch QEMU with GDB Server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;qemu-system-arm &lt;span class="nt"&gt;-M&lt;/span&gt; vexpress-a9 &lt;span class="nt"&gt;-cpu&lt;/span&gt; cortex-a9 &lt;span class="nt"&gt;-m&lt;/span&gt; 128M &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-kernel&lt;/span&gt; boot.elf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-nographic&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-S&lt;/span&gt; &lt;span class="nt"&gt;-gdb&lt;/span&gt; tcp::1234
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Flags:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;-kernel boot.elf&lt;/code&gt;: load ELF file (QEMU parses program headers and loads sections to their VMA)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-S&lt;/code&gt;: start halted, waiting for debugger connection&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-gdb tcp::1234&lt;/code&gt;: open GDB server on port 1234&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Connect GDB and Inspect:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gdb-multiarch boot.elf
&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; target remote :1234
Remote debugging using :1234
_start &lt;span class="o"&gt;()&lt;/span&gt; at startup.s:9
9       ldr r0, &lt;span class="o"&gt;=&lt;/span&gt;0xDEADBEEF     @ Load &lt;span class="nb"&gt;test &lt;/span&gt;value
&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; info registers pc
pc             0x60000004          0x60000004 &amp;lt;_start&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The program counter starts at 0x60000004, the ELF entry point &lt;code&gt;_start&lt;/code&gt;: with &lt;code&gt;-kernel&lt;/code&gt;, QEMU sets the PC to the entry point rather than beginning at the reset-vector branch at 0x60000000.&lt;/p&gt;

&lt;p&gt;Disassemble in GDB:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; disassemble _start
Dump of assembler code &lt;span class="k"&gt;for function &lt;/span&gt;_start:
&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; 0x60000004 &amp;lt;+0&amp;gt;: ldr r0, &lt;span class="o"&gt;[&lt;/span&gt;pc, &lt;span class="c"&gt;#-0]   @ 0x6000000c &amp;lt;_start+8&amp;gt;&lt;/span&gt;
   0x60000008 &amp;lt;+4&amp;gt;: b   0x60000008 &amp;lt;_start+4&amp;gt;
   0x6000000c &amp;lt;+8&amp;gt;: cdple   14, 10, cr11, cr13, cr15, &lt;span class="o"&gt;{&lt;/span&gt;7&lt;span class="o"&gt;}&lt;/span&gt;
End of assembler dump.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instructions are at exact addresses from objdump, confirming the linker placed code at the intended locations.&lt;/p&gt;

&lt;p&gt;The literal 0xDEADBEEF appears as a coprocessor instruction because the disassembler interprets raw data words as instructions when listing code; this is expected.&lt;/p&gt;

&lt;p&gt;Step Through Instructions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; info registers r0
r0             0x0                 0
&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="nb"&gt;break &lt;/span&gt;_start
Breakpoint 1 at 0x60000004: file startup.s, line 9.
&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; stepi
10      b &lt;span class="nb"&gt;.&lt;/span&gt;                     @ Infinite loop
&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; info registers r0
r0             0xdeadbeef          &lt;span class="nt"&gt;-559038737&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the instruction at 0x60000004 executes, r0 contains the test value, confirming that the instruction ran and that the linker's address assignment was correct.&lt;/p&gt;

&lt;p&gt;Inspect Memory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; x/4i 0x60000000
   0x60000000 &amp;lt;_vectors&amp;gt;:   b   0x60000004 &amp;lt;_start&amp;gt;
   0x60000004 &amp;lt;_start&amp;gt;: ldr r0, &lt;span class="o"&gt;[&lt;/span&gt;pc, &lt;span class="c"&gt;#-0]   @ 0x6000000c &amp;lt;_start+8&amp;gt;&lt;/span&gt;
&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; 0x60000008 &amp;lt;_start+4&amp;gt;:   b   0x60000008 &amp;lt;_start+4&amp;gt;
   0x6000000c &amp;lt;_start+8&amp;gt;:   cdple   14, 10, cr11, cr13, cr15, &lt;span class="o"&gt;{&lt;/span&gt;7&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Memory contains the expected branch and ldr instructions at exact addresses, confirming the linker-assigned layout matches actual memory.&lt;/p&gt;


&lt;h2&gt;
  
  
  Alignment Directives
&lt;/h2&gt;

&lt;p&gt;In ARM state, ARMv7 instructions are 4 bytes long and must be 4-byte aligned (Thumb instructions align to 2 bytes). Explicit alignment ensures that sections begin at valid boundaries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SECTIONS
{
    . = 0x60000000;
    . = ALIGN(4);           /* Ensure 4-byte alignment */

    .vectors : {
        *(.vectors)
    } &amp;gt; RAM

    . = ALIGN(4);           /* Ensure next section is aligned */

    .text : {
        *(.text)
    } &amp;gt; RAM
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;ALIGN(n)&lt;/code&gt; function rounds the location counter up to the next multiple of n bytes. If already aligned, it is a no-op. If misaligned, it advances the counter and introduces padding.&lt;/p&gt;
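&lt;p&gt;The rounding &lt;code&gt;ALIGN(n)&lt;/code&gt; performs is the standard power-of-two align-up; a C equivalent of the computation (an illustration, not linker source):&lt;/p&gt;

```c
#include <stdint.h>

/* Equivalent of the linker's ALIGN(n): round addr up to the next
   multiple of n, where n is a power of two. No-op when already aligned. */
static uint32_t align_up(uint32_t addr, uint32_t n)
{
    return (addr + n - 1) & ~(n - 1);
}
```

&lt;p&gt;For example, &lt;code&gt;align_up(0x60000001, 4)&lt;/code&gt; yields &lt;code&gt;0x60000004&lt;/code&gt;, while an already-aligned address passes through unchanged.&lt;/p&gt;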


&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;A linker script describes the mapping from ELF sections to concrete memory addresses and serves as the bridge between compiler output and hardware layout. By defining memory regions, assigning sections, managing alignment, and generating linker-defined symbols, the script establishes the program’s static memory structure. Verifying these decisions with map files and objdump ensures the final image matches the intended layout.&lt;/p&gt;

&lt;p&gt;With this foundation established, the next step is to examine how execution begins—specifically, the reset mechanism and the first instruction fetched by the CPU. Understanding reset behavior complements the static layout described here and completes the initial stage of bare-metal bring-up.&lt;/p&gt;

</description>
      <category>arm</category>
      <category>embedded</category>
      <category>baremetal</category>
    </item>
    <item>
      <title>Bare Metal ARM Boot: Understanding the Reset Vector and First Instructions</title>
      <dc:creator>Ripan Deuri</dc:creator>
      <pubDate>Sat, 13 Dec 2025 04:35:08 +0000</pubDate>
      <link>https://dev.to/ripan030/reset-on-armv7-42p7</link>
      <guid>https://dev.to/ripan030/reset-on-armv7-42p7</guid>
      <description>&lt;p&gt;Understanding what happens before &lt;code&gt;main()&lt;/code&gt; is essential when working on bare-metal systems. This article examines the reset behavior of ARMv7 and shows how to take control of the first instructions executed after reset on the vexpress-a9 platform in QEMU. A minimal vector table and linker script demonstrate how the CPU fetches its initial instruction from address 0x0 and how a simple branch verifies that control flow behaves exactly as intended.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Myth of “PC Starts at &lt;code&gt;main&lt;/code&gt;”
&lt;/h2&gt;

&lt;p&gt;In a bare-metal system, &lt;code&gt;main&lt;/code&gt; is simply a C function invoked &lt;strong&gt;after&lt;/strong&gt; a sequence of hardware-defined and software-defined steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Hardware reset forces the CPU into a well-defined privileged mode.&lt;/li&gt;
&lt;li&gt;A reset vector determines the first instruction the CPU fetches.&lt;/li&gt;
&lt;li&gt;Low-level startup code configures the execution environment.&lt;/li&gt;
&lt;li&gt;Only after this preparation does the C runtime transfer control to &lt;code&gt;main&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This post focuses on step 2: &lt;strong&gt;how the CPU selects its first instruction after reset and how software controls that decision.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  ARMv7 Reset Behavior: What the CPU Actually Does
&lt;/h2&gt;

&lt;p&gt;When an ARMv7-A core such as Cortex-A9 exits reset, the architecture defines only a minimal and deterministic subset of the processor state. The SoC integration adds further rules, but the essential behaviors are consistent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CPU mode:&lt;/strong&gt; The core enters &lt;strong&gt;Supervisor (SVC) mode&lt;/strong&gt;, a privileged mode suitable for exception entry.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Program Counter (PC):&lt;/strong&gt; Loaded from the reset vector address, which is implementation-defined but commonly &lt;strong&gt;0x00000000&lt;/strong&gt; or a remapped alias of another memory region.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;General-purpose registers:&lt;/strong&gt; All registers except the PC (and certain status bits) are &lt;strong&gt;architecturally undefined&lt;/strong&gt;. Software must not assume values for r0–r12, SP, or LR.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interrupts:&lt;/strong&gt; External interrupts are &lt;strong&gt;disabled&lt;/strong&gt; at reset, preventing accidental entry into uninitialized exception handlers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MMU and caches:&lt;/strong&gt; &lt;strong&gt;Disabled&lt;/strong&gt;. Execution begins in a flat physical address space without virtual memory.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This initial state contains just enough information for the CPU to fetch and execute the first instruction from the reset vector. Everything else—stack initialization, memory sections, BSS clearing, C runtime setup—must be implemented in startup code.&lt;/p&gt;

&lt;p&gt;Historically, many ARM systems located the vector table at &lt;strong&gt;0x00000000&lt;/strong&gt;, simplifying early boot ROM design. Modern systems provide more flexibility:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Some support &lt;strong&gt;high-vector mode&lt;/strong&gt;, placing the vector table at &lt;strong&gt;0xFFFF0000&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Many SoCs implement &lt;strong&gt;memory remapping&lt;/strong&gt; so that ROM or flash appears temporarily at 0x0 during reset.&lt;/li&gt;
&lt;li&gt;The physical storage for the bootloader may exist elsewhere (e.g. 0x40000000) but is made visible at 0x0 through a small alias window.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On QEMU’s vexpress-a9 model, the NOR flash device is mapped at &lt;strong&gt;0x40000000&lt;/strong&gt; but &lt;strong&gt;aliased&lt;/strong&gt; at &lt;strong&gt;0x00000000&lt;/strong&gt;, ensuring the reset vector resides at address 0x0. This is a QEMU modeling choice that mirrors typical early boot behavior.&lt;/p&gt;

&lt;h2&gt;
  
  
  Vector Table
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;vector table&lt;/strong&gt; defines CPU entry points for exceptions such as reset, undefined instructions, software interrupts, data aborts, IRQ, and FIQ.&lt;/p&gt;

&lt;p&gt;Key properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The table resides at a fixed address: &lt;strong&gt;0x00000000&lt;/strong&gt; or &lt;strong&gt;0xFFFF0000&lt;/strong&gt;, depending on the high-vector setting and SoC configuration. (At reset the MMU is disabled, so these are physical addresses.)&lt;/li&gt;
&lt;li&gt;Each entry corresponds to an exception type.&lt;/li&gt;
&lt;li&gt;On ARMv7, entries are typically &lt;strong&gt;instructions&lt;/strong&gt;, most commonly unconditional branches that jump to the full handlers.&lt;/li&gt;
&lt;li&gt;The first entry is the &lt;strong&gt;reset vector&lt;/strong&gt;, which contains the first instruction fetched after reset.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A minimal vector table for experiments may contain only a reset vector, acknowledging that any other exception would lead to undefined behavior. Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.section .vectors, "ax"
.global _vectors
_vectors:
    b _start        @ Reset vector: branch to startup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The instruction at the vector table address branches to &lt;code&gt;_start&lt;/code&gt;, which performs the earliest software-controlled action in the system.&lt;/p&gt;
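
&lt;p&gt;For comparison, a complete ARMv7-A table covers all eight entries at fixed 4-byte offsets. The handler names below are illustrative placeholders, not symbols defined elsewhere in this article:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.section .vectors, "ax"
_vectors:
    b reset_handler   @ 0x00: Reset
    b undef_handler   @ 0x04: Undefined instruction
    b svc_handler     @ 0x08: Supervisor call (SVC/SWI)
    b pabt_handler    @ 0x0C: Prefetch abort
    b dabt_handler    @ 0x10: Data abort
    b .               @ 0x14: Reserved
    b irq_handler     @ 0x18: IRQ
    b fiq_handler     @ 0x1C: FIQ
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;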

&lt;h2&gt;
  
  
  How QEMU Wires Reset to Memory
&lt;/h2&gt;

&lt;p&gt;QEMU offers multiple ways to load and boot an ARM image, each influencing reset behavior and initial PC selection:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;-kernel boot.elf&lt;/code&gt; (ELF as kernel image):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;QEMU parses the ELF program headers.&lt;/li&gt;
&lt;li&gt;Loadable segments are placed at their specified VMAs.&lt;/li&gt;
&lt;li&gt;The CPU’s initial PC is set to the ELF entry point from the ELF header.&lt;/li&gt;
&lt;li&gt;This bypasses the classic reset-vector mechanism and does not reflect hardware reset routing.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;-drive if=pflash,format=raw,file=flash.bin&lt;/code&gt; (NOR flash image)&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;QEMU memory-maps the raw binary directly into the emulated NOR flash region.&lt;/li&gt;
&lt;li&gt;On vexpress-a9, NOR flash content is aliased at &lt;strong&gt;0x00000000&lt;/strong&gt;, so the CPU fetches the reset vector from the binary’s first instruction.&lt;/li&gt;
&lt;li&gt;This faithfully models how a real SoC remaps flash or boot ROM to 0x0 during reset.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;This article uses the &lt;code&gt;-pflash&lt;/code&gt; method so execution begins naturally at the reset vector located at address 0x0.&lt;/p&gt;

&lt;h2&gt;
  
  
  Minimal Vector Table Example
&lt;/h2&gt;

&lt;p&gt;A minimal example demonstrating control over the reset vector:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;startup.s&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.section .vectors, "ax"
.global _vectors
_vectors:
    b _start        @ Reset vector: branch to startup

.section .text
.global _start
_start:
    b .             @ Infinite loop to prove we reached here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;linker.ld&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ENTRY(_start)

MEMORY
{
    FLASH (rx) : ORIGIN = 0x00000000, LENGTH = 64M
    RAM   (rwx): ORIGIN = 0x60000000, LENGTH = 128M
}

SECTIONS
{
    .vectors : {
        *(.vectors)
    } &amp;gt; FLASH

    .text : {
        *(.text)
    } &amp;gt; FLASH
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;.vectors&lt;/code&gt; and &lt;code&gt;.text&lt;/code&gt; both live in FLASH.&lt;/li&gt;
&lt;li&gt;No &lt;code&gt;.data&lt;/code&gt;, &lt;code&gt;.bss&lt;/code&gt;, or RAM usage.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Build Steps
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Assemble the startup code&lt;/span&gt;
arm-none-eabi-as &lt;span class="nt"&gt;-mcpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;cortex-a9 &lt;span class="nt"&gt;-g&lt;/span&gt; startup.s &lt;span class="nt"&gt;-o&lt;/span&gt; startup.o

&lt;span class="c"&gt;# Link using the custom linker script&lt;/span&gt;
arm-none-eabi-ld &lt;span class="nt"&gt;-T&lt;/span&gt; linker.ld startup.o &lt;span class="nt"&gt;-o&lt;/span&gt; boot.elf

&lt;span class="c"&gt;# Convert ELF to raw binary for NOR flash&lt;/span&gt;
arm-none-eabi-objcopy &lt;span class="nt"&gt;-O&lt;/span&gt; binary boot.elf flash.bin

&lt;span class="c"&gt;# Create a 64 MB NOR flash image&lt;/span&gt;
&lt;span class="nb"&gt;truncate&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; 64M flash.bin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Artifacts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;boot.elf&lt;/strong&gt; – Contains symbol information useful for GDB debugging.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;flash.bin&lt;/strong&gt; – Raw memory image that QEMU maps into its NOR flash region.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Verification with QEMU and GDB
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start QEMU&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;qemu-system-arm &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-M&lt;/span&gt; vexpress-a9 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-cpu&lt;/span&gt; cortex-a9 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; 128M &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-nographic&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-drive&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;pflash,format&lt;span class="o"&gt;=&lt;/span&gt;raw,file&lt;span class="o"&gt;=&lt;/span&gt;flash.bin &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-S&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-gdb&lt;/span&gt; tcp::1234
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The flash image is mapped into the emulated NOR flash device.&lt;/li&gt;
&lt;li&gt;The alias at &lt;strong&gt;0x00000000&lt;/strong&gt; ensures &lt;code&gt;_vectors&lt;/code&gt; is fetched at reset.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-S&lt;/code&gt; halts the CPU until GDB connects.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Connect GDB&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gdb-multiarch boot.elf

&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; target remote :1234
Remote debugging using :1234
_vectors &lt;span class="o"&gt;()&lt;/span&gt; at startup.s:4
4       b _start        @ Reset vector: branch to startup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check PC:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; info registers pc
pc             0x0                 0x0 &amp;lt;_vectors&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This confirms that the CPU began executing at the reset vector address.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disassemble the vector table and startup code&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; disassemble _vectors
Dump of assembler code &lt;span class="k"&gt;for function &lt;/span&gt;_vectors:
&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; 0x00000000 &amp;lt;+0&amp;gt;: b   0x4 &amp;lt;_start&amp;gt;
End of assembler dump.
&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; disassemble _start
Dump of assembler code &lt;span class="k"&gt;for function &lt;/span&gt;_start:
   0x00000004 &amp;lt;+0&amp;gt;: b   0x4 &amp;lt;_start&amp;gt;
End of assembler dump.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first instruction at 0x0 is a branch to &lt;code&gt;_start&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inspect section placement&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;arm-none-eabi-objdump &lt;span class="nt"&gt;-h&lt;/span&gt; boot.elf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Excerpt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .vectors      00000004  00000000  00000000  00001000  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  1 .text         00000004  00000004  00000004  00001004  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;.vectors&lt;/code&gt; is placed exactly at 0x0; &lt;code&gt;.text&lt;/code&gt; begins at 0x4.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step Through the Reset Vector&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="nb"&gt;break &lt;/span&gt;_start
Breakpoint 1 at 0x4: file startup.s, line 9.
&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="k"&gt;continue
&lt;/span&gt;Continuing.

Breakpoint 1, _start &lt;span class="o"&gt;()&lt;/span&gt; at startup.s:9
9       b &lt;span class="nb"&gt;.&lt;/span&gt;             @ Infinite loop to prove we reached here
&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; info registers pc
pc             0x4                 0x4 &amp;lt;_start&amp;gt;
&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; stepi

Breakpoint 1, _start &lt;span class="o"&gt;()&lt;/span&gt; at startup.s:9
9       b &lt;span class="nb"&gt;.&lt;/span&gt;             @ Infinite loop to prove we reached here
&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; info registers pc
pc             0x4                 0x4 &amp;lt;_start&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;b .&lt;/code&gt; instruction keeps the PC at &lt;code&gt;_start&lt;/code&gt;, proving that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The reset vector branch executed correctly.&lt;/li&gt;
&lt;li&gt;The CPU reached &lt;code&gt;_start&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Execution remains in the infinite loop.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Demonstrating aliasing with the QEMU monitor&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Enter the monitor (&lt;code&gt;Ctrl-A&lt;/code&gt;, then &lt;code&gt;c&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;qemu&lt;span class="o"&gt;)&lt;/span&gt; xp /4xw 0x0
0000000000000000: 0xeaffffff 0xeafffffe 0x00000000 0x00000000
&lt;span class="o"&gt;(&lt;/span&gt;qemu&lt;span class="o"&gt;)&lt;/span&gt; xp /4xw 0x40000000
0000000040000000: 0xeaffffff 0xeafffffe 0x00000000 0x00000000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both regions reflect the same underlying flash content, confirming the aliasing behavior used during reset.&lt;/p&gt;
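
&lt;p&gt;The two words shown are simply the encodings of the two branches. ARM encodes &lt;code&gt;b target&lt;/code&gt; as condition 0xE, the branch opcode, and a signed 24-bit word offset relative to PC+8, which can be reproduced with shell arithmetic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ARM 'b' encoding: 0xEA000000 plus the 24-bit word offset from PC+8
# enc &lt;pc&gt; &lt;target&gt; prints the encoded instruction word
enc() { printf '0x%08x' $(( 0xEA000000 + (($2 - $1 - 8) / 4 + 16777216) % 16777216 )); }
enc 0x0 0x4   # b _start, word at 0x0: 0xeaffffff
echo
enc 0x4 0x4   # b .,      word at 0x4: 0xeafffffe
echo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;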

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The minimal example demonstrates complete control over the system’s first executed instruction by placing a reset vector at address 0x0 and directing it to custom startup code. QEMU’s aliasing of NOR flash provides a convenient environment for experimenting with early boot behavior that closely reflects real hardware.&lt;/p&gt;

&lt;p&gt;This foundation forms the basis for building full startup routines: stack setup, memory initialization, exception-vector expansion, and transition into higher-level runtime code. With reset behavior understood and verified, the next steps involve constructing a complete bare-metal initialization sequence.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>computerscience</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Dissecting ELF for Bare Metal Development: Sections, Segments, VMA, and LMA Explained</title>
      <dc:creator>Ripan Deuri</dc:creator>
      <pubDate>Fri, 12 Dec 2025 07:18:08 +0000</pubDate>
      <link>https://dev.to/ripan030/dissecting-elf-for-bare-metal-development-sections-segments-vma-and-lma-explained-4390</link>
      <guid>https://dev.to/ripan030/dissecting-elf-for-bare-metal-development-sections-segments-vma-and-lma-explained-4390</guid>
      <description>&lt;p&gt;The memory map in the previous post &lt;a href="https://dev.to/ripan030/bare-metal-basics-part-1-understanding-memory-maps-p4d?preview=db8f2c318c2d0c12045e42ee25bac0a1ae02c865d9403cafc02f17b502a24e7aca6483dad08870b2e70d92acb1d707e113d083018351c26022317ab2"&gt;Bare Metal Basics - Part 1: Understanding Memory Maps&lt;/a&gt; describes the hardware address space, but it does not explain how compiled code and data are placed into that space. The compiler does not emit instructions directly at fixed memory addresses, nor does it decide where variables reside at runtime. Instead, compilation produces relocatable artifacts that must later be assigned concrete addresses.&lt;/p&gt;

&lt;p&gt;The GNU toolchain uses ELF (Executable and Linkable Format) as its standard output at multiple stages. These ELF files contain executable machine code, but they also include metadata such as symbols, relocation records, and debugging information—structures that bare-metal hardware cannot interpret. Since bare-metal systems have no loader, the addresses assigned during linking become the actual runtime addresses used by the CPU. Understanding these distinctions is essential for building reliable bare-metal systems.&lt;/p&gt;

&lt;p&gt;This post dissects ELF files, explains sections and segments, and demonstrates how to inspect linker output to verify that the generated layout matches the intended memory map.&lt;/p&gt;




&lt;h2&gt;
  
  
  From Source Code to Executable
&lt;/h2&gt;

&lt;p&gt;Bare-metal development involves multiple stages, each with a specific responsibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 1: Source to Object Files&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Source files (C and assembly) pass through the compiler and assembler, producing &lt;strong&gt;relocatable&lt;/strong&gt; object files (.o) that contain machine code and data grouped into sections, but not yet bound to fixed addresses. References to symbols defined in other object files are left unresolved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 2: Linking Object Files&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The linker reads all object files and a linker script. It then:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Resolves symbol references to actual memory addresses&lt;/li&gt;
&lt;li&gt;Combines sections from multiple object files&lt;/li&gt;
&lt;li&gt;Assigns memory addresses to sections based on the linker script&lt;/li&gt;
&lt;li&gt;Produces an ELF executable with a defined entry point&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The linker script serves as the blueprint: it describes how sections such as &lt;code&gt;.text&lt;/code&gt;, &lt;code&gt;.rodata&lt;/code&gt;, &lt;code&gt;.data&lt;/code&gt;, and &lt;code&gt;.bss&lt;/code&gt; map into the hardware memory regions. The linker assigns addresses accordingly, subject to options such as dead-code elimination (&lt;code&gt;--gc-sections&lt;/code&gt;). The specific addresses used depend entirely on the memory layout defined in the script.&lt;/p&gt;
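
&lt;p&gt;A minimal sketch of such a blueprint (region names and addresses are illustrative, not tied to a particular board):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MEMORY
{
    FLASH (rx) : ORIGIN = 0x00000000, LENGTH = 1M
    RAM   (rwx): ORIGIN = 0x20000000, LENGTH = 128K
}

SECTIONS
{
    .text   : { *(.text*) }   &amp;gt; FLASH
    .rodata : { *(.rodata*) } &amp;gt; FLASH
    /* .data runs in RAM (VMA) but its initial values load from FLASH (LMA) */
    .data   : { *(.data*) }   &amp;gt; RAM AT &amp;gt; FLASH
    .bss    : { *(.bss*) }    &amp;gt; RAM
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;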

&lt;h2&gt;
  
  
  ELF Structure
&lt;/h2&gt;

&lt;p&gt;An ELF executable is a structured container composed of several logical parts.&lt;/p&gt;

&lt;h3&gt;
  
  
  ELF Header
&lt;/h3&gt;

&lt;p&gt;The ELF header identifies the file format and describes global properties such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Magic number (0x7f, 'E', 'L', 'F')&lt;/li&gt;
&lt;li&gt;Target architecture (ARM, x86, etc.)&lt;/li&gt;
&lt;li&gt;Entry point address where CPU execution begins&lt;/li&gt;
&lt;li&gt;Offsets to section headers and program headers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These values guide development tools and loaders but are not interpreted by bare-metal processors.&lt;/p&gt;
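
&lt;p&gt;The magic number is easy to observe on any ELF file. On a Linux host, for example (&lt;code&gt;/bin/ls&lt;/code&gt; is used here purely as a convenient ELF binary):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# First four bytes of e_ident: 0x7f followed by the ASCII letters E, L, F
head -c 4 /bin/ls | od -A n -t x1   # expected: 7f 45 4c 46
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;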

&lt;h3&gt;
  
  
  Section Headers
&lt;/h3&gt;

&lt;p&gt;The linker organizes code and data into named &lt;strong&gt;sections&lt;/strong&gt;. Each section is a logical container:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;.text&lt;/code&gt;&lt;/strong&gt;: Executable code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;.rodata&lt;/code&gt;&lt;/strong&gt;: Read-only data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;.data&lt;/code&gt;&lt;/strong&gt;: Initialized global/static variables&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;.bss&lt;/code&gt;&lt;/strong&gt;: Uninitialized global/static variables&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom sections&lt;/strong&gt;: Platform-specific sections such as interrupt vectors or application-defined segments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each section has several important address concepts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LMA (Load Memory Address)&lt;/strong&gt;: Where the section’s initial contents reside in non-volatile memory (typically flash).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VMA (Virtual Memory Address)&lt;/strong&gt;: The runtime address where the CPU accesses the section.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Size&lt;/strong&gt; and &lt;strong&gt;Alignment&lt;/strong&gt;: Constraints on how the linker places each section.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;VMAs represent the addresses used by executing code, and symbol values correspond to VMAs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Program Headers (Segments)
&lt;/h3&gt;

&lt;p&gt;Segments describe how an ELF file should be loaded by a tool that understands ELF—such as QEMU or a custom bootloader. Loaders interpret &lt;em&gt;segments&lt;/em&gt;, not sections. Each segment describes a contiguous region of memory to populate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Segment type (e.g., LOAD)&lt;/li&gt;
&lt;li&gt;File offset&lt;/li&gt;
&lt;li&gt;Virtual or physical destination address&lt;/li&gt;
&lt;li&gt;File size and memory size&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Bare-metal CPUs do not interpret segments, but they matter when using QEMU or bootloaders that load ELF directly. QEMU follows program headers and uses the physical address field when present.&lt;/p&gt;

&lt;h3&gt;
  
  
  Metadata
&lt;/h3&gt;

&lt;p&gt;Additional ELF components include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Symbol and string tables&lt;/li&gt;
&lt;li&gt;Debug information (DWARF)&lt;/li&gt;
&lt;li&gt;Relocation records&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are essential during development but have no meaning for the hardware.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sections and Their Runtime Meaning
&lt;/h2&gt;

&lt;h3&gt;
  
  
  .text: Executable Code
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;.text&lt;/code&gt; section contains all compiled machine instructions, including startup routines. Some platforms place interrupt vectors in a dedicated section with strict address requirements; linker scripts typically handle this separately.&lt;/p&gt;

&lt;p&gt;Properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Readable and executable&lt;/li&gt;
&lt;li&gt;Typically stored in flash&lt;/li&gt;
&lt;li&gt;May execute in place or be copied to RAM depending on design&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use objdump to see disassembled .text with addresses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;arm-none-eabi-objdump &lt;span class="nt"&gt;-d&lt;/span&gt; boot.elf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  .rodata: Read-Only Data
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;.rodata&lt;/code&gt; section contains immutable data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;String literals&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;const&lt;/code&gt; global variables&lt;/li&gt;
&lt;li&gt;Compiler-generated lookup tables&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This section typically resides in flash alongside &lt;code&gt;.text&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;objdump &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-j&lt;/span&gt; .rodata boot.elf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  .data: Initialized Global Variables
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;.data&lt;/code&gt; section contains global and static variables with explicit initializers. These variables must be writable at runtime. Their initial values are stored in flash (LMA), and the startup code copies them into RAM (VMA) before entering the C runtime.&lt;/p&gt;

&lt;p&gt;Use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;readelf &lt;span class="nt"&gt;-S&lt;/span&gt; boot.elf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to view VMA and LMA assignments.&lt;/p&gt;

&lt;h3&gt;
  
  
  .bss: Uninitialized Global Variables
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;.bss&lt;/code&gt; section contains uninitialized global and static variables. &lt;code&gt;.bss&lt;/code&gt; occupies no space in the binary image, because storing a zero-filled region in flash would be wasteful.&lt;/p&gt;

&lt;p&gt;The linker records:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Section name&lt;/li&gt;
&lt;li&gt;VMA&lt;/li&gt;
&lt;li&gt;Size&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At runtime, the startup code sets the entire &lt;code&gt;.bss&lt;/code&gt; region to zero. The memory is not allocated by software; it is reserved by the linker through the memory map.&lt;/p&gt;
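
&lt;p&gt;The two runtime steps (copying &lt;code&gt;.data&lt;/code&gt; from its LMA to its VMA, and zeroing &lt;code&gt;.bss&lt;/code&gt;) reduce to two short loops. A sketch in ARM assembly, assuming the common linker-provided boundary symbols &lt;code&gt;_sidata&lt;/code&gt;, &lt;code&gt;_sdata&lt;/code&gt;, &lt;code&gt;_edata&lt;/code&gt;, &lt;code&gt;_sbss&lt;/code&gt;, and &lt;code&gt;_ebss&lt;/code&gt; (naming conventions, not symbols defined in this post):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    ldr r0, =_sidata      @ LMA: .data initializers in flash
    ldr r1, =_sdata       @ VMA: start of .data in RAM
    ldr r2, =_edata       @ VMA: end of .data in RAM
1:  cmp r1, r2
    ldrlo r3, [r0], #4    @ copy one word from flash
    strlo r3, [r1], #4    @ ...into RAM
    blo 1b

    ldr r1, =_sbss        @ start of .bss
    ldr r2, =_ebss        @ end of .bss
    mov r3, #0
2:  cmp r1, r2
    strlo r3, [r1], #4    @ zero one word
    blo 2b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;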

&lt;h2&gt;
  
  
  Raw Binary Image
&lt;/h2&gt;

&lt;p&gt;A processor begins execution at a hardware-defined reset address (for example, 0x00000000 or a device-specific flash base). The CPU does not interpret ELF metadata. It requires only:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Executable instructions at the reset address&lt;/li&gt;
&lt;li&gt;Read-only data accessible at expected locations&lt;/li&gt;
&lt;li&gt;Writable memory for initialized and uninitialized variables&lt;/li&gt;
&lt;li&gt;Valid stack space&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;ELF files must therefore be transformed into binary images whose bytes correspond exactly to the intended load addresses.&lt;/p&gt;

&lt;h2&gt;
  
  
  Inspecting ELF Output
&lt;/h2&gt;

&lt;p&gt;Common toolchain utilities make ELF inspection straightforward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;readelf&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;arm-none-eabi-readelf &lt;span class="nt"&gt;-h&lt;/span&gt; boot.elf   &lt;span class="c"&gt;# ELF header&lt;/span&gt;
arm-none-eabi-readelf &lt;span class="nt"&gt;-S&lt;/span&gt; boot.elf   &lt;span class="c"&gt;# Section headers&lt;/span&gt;
arm-none-eabi-readelf &lt;span class="nt"&gt;-l&lt;/span&gt; boot.elf   &lt;span class="c"&gt;# Program headers&lt;/span&gt;
arm-none-eabi-readelf &lt;span class="nt"&gt;-s&lt;/span&gt; boot.elf   &lt;span class="c"&gt;# Symbol table&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;objdump&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;arm-none-eabi-objdump &lt;span class="nt"&gt;-h&lt;/span&gt; boot.elf   &lt;span class="c"&gt;# Section summary&lt;/span&gt;
arm-none-eabi-objdump &lt;span class="nt"&gt;-d&lt;/span&gt; boot.elf   &lt;span class="c"&gt;# Disassembly&lt;/span&gt;
arm-none-eabi-objdump &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-j&lt;/span&gt; .rodata boot.elf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;objcopy&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;arm-none-eabi-objcopy &lt;span class="nt"&gt;-O&lt;/span&gt; binary boot.elf boot.bin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This produces a raw binary by extracting only loadable content, arranged according to LMAs. The resulting file contains no address metadata, so it must be programmed into flash at the correct offset corresponding to those LMAs.&lt;/p&gt;

&lt;p&gt;Alternative formats such as Intel HEX and S-records convey the same conceptual information with explicit addressing.&lt;/p&gt;

&lt;h2&gt;
  
  
  How QEMU Loads ELF and Binary Image
&lt;/h2&gt;

&lt;p&gt;QEMU supports both ELF-based loading and raw binary loading.&lt;/p&gt;

&lt;h3&gt;
  
  
  ELF Loading (-kernel)
&lt;/h3&gt;

&lt;p&gt;When provided with an ELF file, QEMU:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reads the ELF header for architecture and entry point&lt;/li&gt;
&lt;li&gt;Parses program headers&lt;/li&gt;
&lt;li&gt;Loads each LOAD segment at its specified destination address&lt;/li&gt;
&lt;li&gt;Sets the CPU PC to the ELF entry point
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;qemu-system-arm &lt;span class="nt"&gt;-M&lt;/span&gt; vexpress-a9 &lt;span class="nt"&gt;-cpu&lt;/span&gt; cortex-a9 &lt;span class="nt"&gt;-m&lt;/span&gt; 128M &lt;span class="nt"&gt;-nographic&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-kernel&lt;/span&gt; boot.elf &lt;span class="nt"&gt;-S&lt;/span&gt; &lt;span class="nt"&gt;-gdb&lt;/span&gt; tcp::1234
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Binary Loading (-pflash)
&lt;/h3&gt;

&lt;p&gt;When loading a raw binary, QEMU:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maps the file directly into the flash device model&lt;/li&gt;
&lt;li&gt;Uses device-specific flash size limits&lt;/li&gt;
&lt;li&gt;Begins execution from the board’s reset vector&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This model mirrors real hardware and validates whether the firmware image matches the expected memory map.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;qemu-system-arm &lt;span class="nt"&gt;-M&lt;/span&gt; vexpress-a9 &lt;span class="nt"&gt;-cpu&lt;/span&gt; cortex-a9 &lt;span class="nt"&gt;-m&lt;/span&gt; 128M &lt;span class="nt"&gt;-nographic&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-drive&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;pflash,format&lt;span class="o"&gt;=&lt;/span&gt;raw,file&lt;span class="o"&gt;=&lt;/span&gt;boot.bin &lt;span class="nt"&gt;-S&lt;/span&gt; &lt;span class="nt"&gt;-gdb&lt;/span&gt; tcp::1234
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;ELF files exist to support the toolchain. They encode executable code along with metadata required for linking, relocation, and debugging. Bare-metal processors, however, require only instructions and data placed at precise addresses.&lt;/p&gt;

&lt;p&gt;Understanding the distinctions between sections and segments, between VMA and LMA, and between compilation and linking is foundational to bare-metal work. The linker script provides the concrete mapping between ELF structure and hardware memory.&lt;/p&gt;

&lt;p&gt;The next post focuses on linker scripts: how they express memory intent, how the linker interprets them, and how to validate the final layout against the hardware memory map.&lt;/p&gt;

</description>
      <category>embedded</category>
      <category>baremetal</category>
      <category>architecture</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
