<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ripan Deuri</title>
    <description>The latest articles on DEV Community by Ripan Deuri (@ripan030).</description>
    <link>https://dev.to/ripan030</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3575291%2F8438c4dd-3058-457e-a2fb-e77ff8f27ad3.png</url>
      <title>DEV Community: Ripan Deuri</title>
      <link>https://dev.to/ripan030</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ripan030"/>
    <language>en</language>
    <item>
      <title>Understanding PCIe Data Link Layer</title>
      <dc:creator>Ripan Deuri</dc:creator>
      <pubDate>Wed, 08 Apr 2026 13:41:55 +0000</pubDate>
      <link>https://dev.to/ripan030/understanding-pcie-data-link-layer-9jp</link>
      <guid>https://dev.to/ripan030/understanding-pcie-data-link-layer-9jp</guid>
      <description>&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;PCI Express (PCIe) uses a layered architecture to separate concerns like transaction creation, reliability, and physical transmission. The &lt;strong&gt;Data Link Layer (DLL)&lt;/strong&gt; ensures reliable communication between directly connected devices.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Transaction Layer&lt;/strong&gt; generates packets (TLPs), and the &lt;strong&gt;Physical Layer&lt;/strong&gt; transmits bits. The &lt;strong&gt;Data Link Layer&lt;/strong&gt; sits between them, guaranteeing that TLPs are delivered correctly, in order, and without corruption over a single PCIe link.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. PCIe Layers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Transaction Layer (TL):&lt;/strong&gt; Creates Transaction Layer Packets (TLPs)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Link Layer (DLL):&lt;/strong&gt; Ensures reliable delivery of TLPs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Physical Layer (PHY):&lt;/strong&gt; Handles signaling and bit transmission&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  3. Data Link Layer (DLL)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3.1 What Data Link Layer Does
&lt;/h3&gt;

&lt;p&gt;The DLL's responsibilities are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reliable delivery: Sequence numbers + ACK/NAK + Replay buffer&lt;/li&gt;
&lt;li&gt;Error detection: LCRC (Link CRC) on every TLP&lt;/li&gt;
&lt;li&gt;Flow control: Credit based system to prevent buffer overflow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DLL operates on two types of packets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TLPs&lt;/strong&gt; (Transaction Layer Packets) - actual data from the Transaction Layer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DLLPs&lt;/strong&gt; (Data Link Layer Packets) - control packets used by the DLL itself (ACK, NAK, flow control updates)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3.2 Sequence Numbers
&lt;/h3&gt;

&lt;p&gt;Each outgoing TLP is assigned a &lt;strong&gt;sequence number&lt;/strong&gt; and &lt;strong&gt;LCRC&lt;/strong&gt; by the Data Link Layer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+--------------------------------------+
| Seq#   |    TLP Payload    | LCRC    |
| 12-bit |                   | 32-bit  |
+--------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;DLL Tx steps&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Assign next sequence number&lt;/li&gt;
&lt;li&gt;Prepend seq#&lt;/li&gt;
&lt;li&gt;Compute LCRC over [Seq# | TLP]&lt;/li&gt;
&lt;li&gt;Append LCRC&lt;/li&gt;
&lt;li&gt;Save copy in Replay buffer&lt;/li&gt;
&lt;li&gt;Send [Seq# | TLP | LCRC] to Physical Layer&lt;/li&gt;
&lt;/ul&gt;
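&lt;p&gt;The Tx steps above can be sketched in Python. This is an illustrative model only: &lt;code&gt;zlib.crc32&lt;/code&gt; stands in for the real PCIe LCRC polynomial, and the framing is simplified.&lt;/p&gt;

```python
import struct
import zlib

class DllTx:
    """Illustrative model of the DLL transmit path."""
    def __init__(self):
        self.next_seq = 0          # 12-bit sequence counter
        self.replay_buffer = {}    # seq -> framed copy awaiting ACK

    def send(self, tlp_bytes):
        seq = self.next_seq
        self.next_seq = (self.next_seq + 1) % 4096       # wrap at 12 bits
        framed = struct.pack(">H", seq) + tlp_bytes      # prepend Seq#
        framed += struct.pack(">I", zlib.crc32(framed))  # append LCRC stand-in
        self.replay_buffer[seq] = framed                 # save copy for replay
        return framed                                    # goes to the Physical Layer
```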

&lt;p&gt;&lt;strong&gt;DLL Rx steps&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Receive [Seq# | TLP | LCRC] from Physical Layer&lt;/li&gt;
&lt;li&gt;Recompute LCRC - does it match?

&lt;ul&gt;
&lt;li&gt;YES: continue&lt;/li&gt;
&lt;li&gt;NO: send NAK DLLP, discard TLP&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Check Seq# - is it the expected one?

&lt;ul&gt;
&lt;li&gt;YES: continue&lt;/li&gt;
&lt;li&gt;NO: send NAK DLLP, discard TLP&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Strip Seq# and LCRC&lt;/li&gt;

&lt;li&gt;Pass TLP up to Transaction Layer&lt;/li&gt;

&lt;li&gt;Send ACK DLLP back to the transmitter&lt;/li&gt;

&lt;/ul&gt;
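&lt;p&gt;The Rx steps mirror this in a small check function (again a sketch: &lt;code&gt;zlib.crc32&lt;/code&gt; stands in for the LCRC, and a real receiver also schedules ACK DLLPs rather than returning them inline).&lt;/p&gt;

```python
import struct
import zlib

def dll_rx(framed, expected_seq):
    """Illustrative DLL receive check.
    Returns ("ACK", tlp) on success or ("NAK", None) on any error."""
    body = framed[:-4]
    rx_crc = struct.unpack(">I", framed[-4:])[0]
    if zlib.crc32(body) != rx_crc:        # bad LCRC: corrupted in flight
        return ("NAK", None)
    seq = struct.unpack(">H", body[:2])[0]
    if seq != expected_seq:               # unexpected sequence number
        return ("NAK", None)
    return ("ACK", body[2:])              # strip Seq# and LCRC, pass TLP up
```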

&lt;h3&gt;
  
  
  3.3 ACK/NAK Protocol
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Replay Buffer&lt;/strong&gt;&lt;br&gt;
The transmitter keeps a &lt;strong&gt;Replay Buffer&lt;/strong&gt; - a copy of every TLP that has been sent but not yet acknowledged.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Replay Buffer:
+--------------------------------+
| Seq = 40 | Seq = 41 | Seq = 42 |
+--------------------------------+
      ^
      |
     Once ACK(40) is received, Seq = 40 is removed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;ACK&lt;/strong&gt;&lt;br&gt;
When the receiver successfully receives a TLP, it sends back an &lt;strong&gt;ACK DLLP&lt;/strong&gt; with the sequence number of the last successfully received TLP.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;ACK is cumulative. An ACK (Seq = 41) means all TLPs up to and including Seq = 41 have been received.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;NAK&lt;/strong&gt;&lt;br&gt;
When the receiver detects a corrupted TLP (bad LCRC or an out-of-order sequence number), it sends a &lt;strong&gt;NAK DLLP&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Upon receiving a NAK, the transmitter replays all unACK'd TLPs from the replay buffer, starting from the NAK'd sequence number.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A Complete ACK/NAK Example&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RC DLL                                  EP DLL
  |                                       |
  |-- TLP (Seq=41) ----------------------&amp;gt;| OK
  |-- TLP (Seq=42) ----------------------&amp;gt;| OK
  |-- TLP (Seq=43) ----------------------&amp;gt;| CRC Error!
  |                                       |
  |&amp;lt;-- ACK (Seq=42)-----------------------| ACKs Seq=41 and Seq=42 cumulatively
  |&amp;lt;-- NAK (Seq=43)-----------------------| NAK Seq=43, requests retransmit
  |                                       |
  | [Transmitter replays from Seq = 43]   |
  |-- TLP (Seq=43)-----------------------&amp;gt;| OK
  |-- TLP (Seq=44)-----------------------&amp;gt;| OK
  |                                       |
  |&amp;lt;-- ACK (Seq=44)-----------------------| ACKs Seq=43 and Seq=44 cumulatively
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the transmitter does not receive an ACK within a timeout period, it assumes the TLP was lost and replays all unACK'd TLPs. This handles the case where the ACK itself was lost.&lt;/p&gt;
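&lt;p&gt;The cumulative-ACK release and NAK-triggered replay can be sketched as a small buffer model (illustrative; a real implementation also runs the replay timer mentioned above):&lt;/p&gt;

```python
class ReplayBuffer:
    """Sketch of cumulative ACK release and NAK-triggered replay."""
    def __init__(self):
        self.pending = {}   # seq -> framed TLP, kept in transmit order

    def store(self, seq, framed):
        self.pending[seq] = framed

    def ack(self, acked_seq):
        # ACK is cumulative: release everything up to and including acked_seq
        for seq in [s for s in self.pending if acked_seq >= s]:
            del self.pending[seq]

    def nak(self, bad_seq):
        # Replay all unACK'd TLPs from bad_seq onward, oldest first
        return [f for s, f in self.pending.items() if s >= bad_seq]
```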

&lt;h3&gt;
  
  
  3.4 Flow Control
&lt;/h3&gt;

&lt;p&gt;PCIe uses a credit-based flow control system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The receiver advertises how many credits it has (how much buffer space is available)&lt;/li&gt;
&lt;li&gt;The transmitter can only send a TLP if it has enough credits to cover that TLP&lt;/li&gt;
&lt;li&gt;After the receiver processes a TLP and frees buffer space, it sends a &lt;strong&gt;Flow Control Update DLLP&lt;/strong&gt; to return credits to the transmitter&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Credit Units&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Credit Type&lt;/th&gt;
&lt;th&gt;Space&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Header Credit&lt;/td&gt;
&lt;td&gt;1 credit = space for 1 TLP header&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Credit&lt;/td&gt;
&lt;td&gt;1 credit = space for 4 DWORDs of TLP Data&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
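&lt;p&gt;With those units, the credits a single TLP consumes follow directly (a sketch; the function name is illustrative):&lt;/p&gt;

```python
def credits_needed(payload_bytes):
    """Credits one TLP consumes: 1 header credit, plus one data credit
    per 4 DWORDs (16 bytes) of payload, rounded up."""
    data_credits = -(-payload_bytes // 16)   # ceiling division
    return {"header": 1, "data": data_credits}
```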

&lt;p&gt;PCIe tracks credits separately for different types of traffic to prevent &lt;strong&gt;deadlock&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Credit Pool&lt;/th&gt;
&lt;th&gt;Used For&lt;/th&gt;
&lt;th&gt;Completion Expected?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Posted (P)&lt;/td&gt;
&lt;td&gt;Memory Write, Messages&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Non-Posted (NP)&lt;/td&gt;
&lt;td&gt;Memory Read, Config Read/Write, I/O Read/Write&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Completion (Cpl)&lt;/td&gt;
&lt;td&gt;Completions (responses to NP requests)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Flow Control Initialization&lt;/strong&gt;&lt;br&gt;
Before any TLPs can be sent, the two sides must initialize flow control. This happens right after the Physical Layer declares the link up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Both sides send InitFC1 DLLPs advertising their initial credits&lt;/li&gt;
&lt;li&gt;Both sides send InitFC2 DLLPs confirming they received the other side's credits&lt;/li&gt;
&lt;li&gt;Flow Control is initialized - TLPs can flow now&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  3.5 DLLPs - Data Link Layer Packets
&lt;/h3&gt;

&lt;p&gt;DLLPs are &lt;strong&gt;small control packets&lt;/strong&gt; used exclusively by the Data Link Layer. They are never seen by the Transaction Layer - they are created and consumed entirely within the DLL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DLLP Format&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+---------------------------------------------+
|Type (1 byte)|Payload (3 bytes)|CRC (2 bytes)|
+---------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DLLPs are much smaller than TLPs. They are sent in the gaps between TLPs.&lt;/p&gt;
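&lt;p&gt;Packing an ACK/NAK DLLP into that 6-byte format can be sketched as follows. The type encodings and the checksum are placeholders chosen for illustration; the spec defines the actual type values and a CRC-16.&lt;/p&gt;

```python
import struct

# Type encodings chosen for illustration; see the spec for actual values
ACK_TYPE = 0x00
NAK_TYPE = 0x10

def pack_dllp(dllp_type, seq):
    """Pack a 6-byte ACK/NAK DLLP: 1-byte type, 3-byte payload carrying
    the 12-bit Seq#, 2-byte CRC (a simple checksum here, not the real CRC-16)."""
    body = bytes([dllp_type]) + struct.pack(">BH", 0, seq % 4096)
    return body + struct.pack(">H", sum(body) % 65536)
```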

&lt;p&gt;&lt;strong&gt;DLLP Types&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;When Sent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ACK&lt;/td&gt;
&lt;td&gt;Ack received TLPs up to a given Seq#&lt;/td&gt;
&lt;td&gt;After successfully receiving a TLP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NAK&lt;/td&gt;
&lt;td&gt;Request retransmission from a given Seq#&lt;/td&gt;
&lt;td&gt;After detecting a bad TLP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;InitFC1&lt;/td&gt;
&lt;td&gt;Flow control initialization (phase 1)&lt;/td&gt;
&lt;td&gt;During link initialization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;InitFC2&lt;/td&gt;
&lt;td&gt;Flow control initialization (phase 2)&lt;/td&gt;
&lt;td&gt;During link initialization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UpdateFC&lt;/td&gt;
&lt;td&gt;Return flow control credits to transmitter&lt;/td&gt;
&lt;td&gt;After processing a TLP and freeing buffer space&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

</description>
      <category>architecture</category>
      <category>computerscience</category>
      <category>networking</category>
    </item>
    <item>
      <title>Understanding PCIe Link Training</title>
      <dc:creator>Ripan Deuri</dc:creator>
      <pubDate>Mon, 30 Mar 2026 18:52:17 +0000</pubDate>
      <link>https://dev.to/ripan030/understanding-pcie-link-training-165i</link>
      <guid>https://dev.to/ripan030/understanding-pcie-link-training-165i</guid>
      <description>&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;PCIe link training is the process by which a Root Complex (RC) and an Endpoint (EP) autonomously negotiate and establish a reliable high-speed serial link. No software is involved; everything is done by the Physical Layer state machine.&lt;/p&gt;

&lt;p&gt;The process must solve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Receiver detection&lt;/strong&gt;: Does anything exist on the other end?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bit lock&lt;/strong&gt;: Can the receiver lock its clock-data recovery (CDR) circuit to the incoming bit stream?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Symbol/block lock&lt;/strong&gt;: Can the receiver identify symbol or block boundaries?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Link configuration&lt;/strong&gt;: What width and lane ordering to use?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speed negotiation&lt;/strong&gt;: What is the highest mutually supported data rate?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article focuses on the &lt;strong&gt;physical layer (PHY)&lt;/strong&gt; and explains the &lt;strong&gt;LTSSM (Link Training and Status State Machine)&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. System Setup
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Topology&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;RC Lane&lt;/th&gt;
&lt;th&gt;EP Lane&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lane0&lt;/td&gt;
&lt;td&gt;Lane0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lane1&lt;/td&gt;
&lt;td&gt;Lane1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lane2&lt;/td&gt;
&lt;td&gt;open&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lane3&lt;/td&gt;
&lt;td&gt;open&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Expected Outcome&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Link width: &lt;strong&gt;x2&lt;/strong&gt; (limited by the EP)&lt;/li&gt;
&lt;li&gt;Final speed: &lt;strong&gt;Gen3 (8 GT/s)&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  3. Encoding Fundamentals
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3.2 8b/10b Encoding (Gen1 and Gen2)
&lt;/h3&gt;

&lt;p&gt;Every 8-bit byte is replaced by a 10-bit symbol. The two extra bits provide the redundancy needed to maintain DC balance and guarantee transition density for clock recovery.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Running disparity (RD)&lt;/strong&gt;: The encoder tracks whether the bit stream so far has carried more 1s or more 0s. RD+ means a running excess of 1s; RD- means a running excess of 0s. Each symbol is then chosen from its RD+ or RD- variant so the line stays DC-balanced.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Symbol classes&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data symbols&lt;/strong&gt; Dxx.y: xx = bits [4:0], y = bits [7:5]. Value = y×32 + xx.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control symbols&lt;/strong&gt; Kxx.y: special characters outside the normal data space.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3.3 Deriving the Symbol Names (Example)
&lt;/h3&gt;

&lt;p&gt;The TS1/TS2 identifier bytes are:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TS1 ID byte = 0x4A&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;0x4A = 74 decimal = 0100_1010 binary&lt;/li&gt;
&lt;li&gt;bits [4:0] = 0_1010 = 10, bits [7:5] = 010 = 2 → &lt;strong&gt;D10.2&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;TS2 ID byte = 0x45&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;0x45 = 69 decimal = 0100_0101 binary&lt;/li&gt;
&lt;li&gt;bits [4:0] = 0_0101 = 5, bits [7:5] = 010 = 2 → &lt;strong&gt;D5.2&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
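&lt;p&gt;The derivations above follow mechanically from the bit fields, as a small helper shows (function name is illustrative):&lt;/p&gt;

```python
def data_symbol_name(byte):
    """Dxx.y name for a data byte: xx = bits [4:0], y = bits [7:5]."""
    xx = byte % 32          # low five bits
    y = (byte // 32) % 8    # top three bits
    return "D%d.%d" % (xx, y)
```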

&lt;p&gt;&lt;strong&gt;K28.5 (COM) = 0xBC&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;0xBC = 1011_1100 binary&lt;/li&gt;
&lt;li&gt;Bits [4:0] = 1_1100 = 28, bits [7:5] = 101 = 5 → &lt;strong&gt;K28.5&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;K28.5 is the designated "comma" character used to establish symbol alignment because its 10-bit patterns (both RD+ and RD-) contain the comma sequence: a run of five identical bits preceded by two bits of the opposite polarity, which cannot occur inside any valid data symbol or across symbol boundaries.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3.4 8b/10b Encoded Bit Patterns
&lt;/h3&gt;

&lt;p&gt;8b/10b encoding uses lookup tables; below are the encoded values for the symbols used in training:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Byte&lt;/th&gt;
&lt;th&gt;Symbol&lt;/th&gt;
&lt;th&gt;10-bit (RD-)&lt;/th&gt;
&lt;th&gt;10-bit (RD+)&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0xBC&lt;/td&gt;
&lt;td&gt;K28.5&lt;/td&gt;
&lt;td&gt;0011_111010&lt;/td&gt;
&lt;td&gt;1100_000101&lt;/td&gt;
&lt;td&gt;COM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x00&lt;/td&gt;
&lt;td&gt;D0.0&lt;/td&gt;
&lt;td&gt;1001_110100&lt;/td&gt;
&lt;td&gt;0110_001011&lt;/td&gt;
&lt;td&gt;Logical Idle&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0xF8&lt;/td&gt;
&lt;td&gt;K23.7&lt;/td&gt;
&lt;td&gt;1110_101000&lt;/td&gt;
&lt;td&gt;0001_010111&lt;/td&gt;
&lt;td&gt;PAD symbol&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x4A&lt;/td&gt;
&lt;td&gt;D10.2&lt;/td&gt;
&lt;td&gt;0101_010101&lt;/td&gt;
&lt;td&gt;0101_010101&lt;/td&gt;
&lt;td&gt;TS1 ID ('J')&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x45&lt;/td&gt;
&lt;td&gt;D5.2&lt;/td&gt;
&lt;td&gt;1010_010101&lt;/td&gt;
&lt;td&gt;1010_010101&lt;/td&gt;
&lt;td&gt;TS2 ID ('E')&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0xFF&lt;/td&gt;
&lt;td&gt;D31.7&lt;/td&gt;
&lt;td&gt;1010_110001&lt;/td&gt;
&lt;td&gt;0101_001110&lt;/td&gt;
&lt;td&gt;N_FTS (255)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x07&lt;/td&gt;
&lt;td&gt;D7.0&lt;/td&gt;
&lt;td&gt;1110_001011&lt;/td&gt;
&lt;td&gt;0001_110100&lt;/td&gt;
&lt;td&gt;Rate ID&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x20&lt;/td&gt;
&lt;td&gt;D0.1&lt;/td&gt;
&lt;td&gt;1001_111001&lt;/td&gt;
&lt;td&gt;0110_001001&lt;/td&gt;
&lt;td&gt;Speed change bit5=1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0xC8&lt;/td&gt;
&lt;td&gt;D8.6&lt;/td&gt;
&lt;td&gt;1110_010110&lt;/td&gt;
&lt;td&gt;0001_100110&lt;/td&gt;
&lt;td&gt;N_FTS=200&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note on D10.2&lt;/strong&gt;: D10.2 is disparity-neutral, so both RD variants encode to the same alternating pattern &lt;code&gt;0101_010101&lt;/code&gt;. That clock-like bit stream is exactly what makes the TS1 identifier useful for receiver training.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  3.5 128b/130b Encoding (Gen3 and Above)
&lt;/h3&gt;

&lt;p&gt;At Gen3 (8 GT/s), 8b/10b is replaced by 128b/130b encoding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every 128-bit payload gets a 2-bit &lt;strong&gt;sync header&lt;/strong&gt; prepended.&lt;/li&gt;
&lt;li&gt;Sync header &lt;code&gt;10&lt;/code&gt; = data block; &lt;code&gt;01&lt;/code&gt; = ordered set (control) block.&lt;/li&gt;
&lt;li&gt;Overhead: 2/130 ≈ 1.5%, vs. 20% for 8b/10b. This is why Gen3 at 8 GT/s delivers ~4× the effective bandwidth of Gen1 at 2.5 GT/s, not just 3.2×.&lt;/li&gt;
&lt;li&gt;Scrambling uses a different LFSR polynomial than Gen1/Gen2.&lt;/li&gt;
&lt;/ul&gt;
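&lt;p&gt;The overhead arithmetic above can be checked directly:&lt;/p&gt;

```python
def effective_gbps(rate_gt, payload_bits, total_bits):
    """Per-lane data rate after encoding overhead, in Gb/s."""
    return rate_gt * payload_bits / total_bits

gen1 = effective_gbps(2.5, 8, 10)      # 8b/10b
gen3 = effective_gbps(8.0, 128, 130)   # 128b/130b
ratio = gen3 / gen1                    # ~3.94x, not the raw 3.2x
```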

&lt;p&gt;&lt;strong&gt;Gen3 TS1/TS2 format in 128b/130b&lt;/strong&gt;: An ordered set block (sync header = &lt;code&gt;01&lt;/code&gt;) carries a 128-bit payload. A TS1 or TS2 occupies one such block. The payload contains the same logical fields (Link, Lane, N_FTS, Rate, Training Control, identifier) but packed differently than the 16-symbol 8b/10b format. Specifically, a Gen3 TS1/TS2 block is:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;[01][TS_ID(1B)][Link(1B)][Lane(1B)][N_FTS(1B)][Rate(1B)][TrainingCtrl(1B)][ID×10(10B)]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Total: 2 bits sync + 128 bits payload = 130 bits per ordered set block.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Ordered Set Structure (TS1 / TS2) — Gen1/Gen2
&lt;/h2&gt;

&lt;p&gt;Each ordered set = &lt;strong&gt;16 symbols × 10 bits = 160 bits&lt;/strong&gt; at Gen1/Gen2.&lt;/p&gt;

&lt;h3&gt;
  
  
  TS1 Layout (pre-scramble byte values)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symbol index&lt;/th&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Byte value (typical)&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;COM&lt;/td&gt;
&lt;td&gt;0xBC&lt;/td&gt;
&lt;td&gt;K28.5, always sent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Link Number&lt;/td&gt;
&lt;td&gt;0xF8 (PAD) or 0x00&lt;/td&gt;
&lt;td&gt;PAD until link assigned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Lane Number&lt;/td&gt;
&lt;td&gt;0xF8 (PAD) or 0x00..n&lt;/td&gt;
&lt;td&gt;PAD until lane assigned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;N_FTS&lt;/td&gt;
&lt;td&gt;0xC8 (200)&lt;/td&gt;
&lt;td&gt;Receiver's FTS requirement&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Rate Identifier&lt;/td&gt;
&lt;td&gt;0x07&lt;/td&gt;
&lt;td&gt;Gen1+Gen2+Gen3 support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Training Control&lt;/td&gt;
&lt;td&gt;0x00 or 0x20&lt;/td&gt;
&lt;td&gt;0x20 = speed change request&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6–15&lt;/td&gt;
&lt;td&gt;TS1 Identifier&lt;/td&gt;
&lt;td&gt;0x4A × 10&lt;/td&gt;
&lt;td&gt;ASCII 'J', repeated 10 times&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
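&lt;p&gt;The TS1 layout above can be assembled as a 16-byte sequence (a sketch of the pre-scramble byte values; the function name and defaults are illustrative):&lt;/p&gt;

```python
COM, PAD = 0xBC, 0xF8

def build_ts1(link=PAD, lane=PAD, n_fts=0xC8, rate=0x07, ctrl=0x00):
    """The 16 pre-scramble byte values of a Gen1/Gen2 TS1 ordered set,
    following the layout table above."""
    return bytes([COM, link, lane, n_fts, rate, ctrl] + [0x4A] * 10)
```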

&lt;h3&gt;
  
  
  TS2 Layout
&lt;/h3&gt;

&lt;p&gt;Identical to TS1, except:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symbol index&lt;/th&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Byte value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;6–15&lt;/td&gt;
&lt;td&gt;TS2 Identifier&lt;/td&gt;
&lt;td&gt;0x45 × 10&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Key Training Control Bits (Symbol 5)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bit&lt;/th&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;Meaning when set&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Hot Reset&lt;/td&gt;
&lt;td&gt;Request hot reset&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Disable Link&lt;/td&gt;
&lt;td&gt;Request link disable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Loopback&lt;/td&gt;
&lt;td&gt;Request loopback mode&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Disable Scrambling&lt;/td&gt;
&lt;td&gt;Scrambling off (test/debug)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Compliance Receive&lt;/td&gt;
&lt;td&gt;Enter compliance mode&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Speed Change&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Request transition to new speed&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  5. LTSSM Overview
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Detect → Polling → Configuration → L0 (Gen1) → Recovery → L0 (Gen3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Training always begins at &lt;strong&gt;Gen1 (2.5 GT/s)&lt;/strong&gt;, regardless of device capability.&lt;/li&gt;
&lt;li&gt;Speed upgrade happens later via the Recovery state.&lt;/li&gt;
&lt;li&gt;The LTSSM runs independently and in parallel on the RC and the EP. They converge through the exchange of ordered sets.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  6. Detect State
&lt;/h2&gt;

&lt;p&gt;The transmitter output is a current-mode driver with a nominal output impedance of 50 Ω into a 50 Ω termination at the receiver. Total DC path = 100 Ω differential.&lt;/p&gt;

&lt;p&gt;The Detect state uses a slow voltage ramp on the TX differential pair:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Receiver present&lt;/strong&gt;: The 50 Ω termination loads the ramp → slow rise time detected.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No receiver (open circuit)&lt;/strong&gt;: No termination → fast rise to rail → detected as absent.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The spec requires the transmitter to charge the line to a voltage and measure the time to reach a threshold. If it stays below the threshold long enough (indicating a load), a receiver is declared present.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RC Detect Results (per lane)&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;RC Lane&lt;/th&gt;
&lt;th&gt;Detect result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lane0&lt;/td&gt;
&lt;td&gt;Receiver present (50 Ω load from EP Lane0)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lane1&lt;/td&gt;
&lt;td&gt;Receiver present (50 Ω load from EP Lane1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lane2&lt;/td&gt;
&lt;td&gt;No receiver (open circuit)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lane3&lt;/td&gt;
&lt;td&gt;No receiver (open circuit)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;RC exits Detect with &lt;strong&gt;2 active lanes: Lane0 and Lane1&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
Lane2 and Lane3 are deactivated for the remainder of training.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EP Detect Results (per lane)&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;EP Lane&lt;/th&gt;
&lt;th&gt;Detect result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lane0&lt;/td&gt;
&lt;td&gt;Receiver present&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lane1&lt;/td&gt;
&lt;td&gt;Receiver present&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;EP exits Detect with &lt;strong&gt;2 active lanes&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  7. Polling State
&lt;/h2&gt;

&lt;p&gt;Polling has two sub-states: &lt;strong&gt;Polling.Active&lt;/strong&gt; and &lt;strong&gt;Polling.Configuration&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Goal of Polling
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Bit lock&lt;/strong&gt;: The receiver CDR circuit locks its internal clock to the incoming bit transitions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Symbol lock&lt;/strong&gt;: The receiver identifies COM (K28.5) symbols and aligns its symbol boundaries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuration capability exchange&lt;/strong&gt;: Devices advertise their supported speeds in the Rate ID field.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  7.1 Polling.Active — TS1 Transmission
&lt;/h3&gt;

&lt;p&gt;Both RC and EP begin transmitting TS1 ordered sets simultaneously on all active lanes. At this stage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Link Number = PAD (0xF8) — no link number assigned yet&lt;/li&gt;
&lt;li&gt;Lane Number = PAD (0xF8) — no lane number assigned yet&lt;/li&gt;
&lt;li&gt;Training Control = 0x00&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;RC → EP on Lane0, TS1 ordered set (pre-scramble bytes):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Symbol:  [ 0][ 1][ 2][ 3][ 4][ 5][ 6][ 7][ 8][ 9][10][11][12][13][14][15]
Byte:    [BC][F8][F8][C8][07][00][4A][4A][4A][4A][4A][4A][4A][4A][4A][4A]
Field:   COM  LNK LAN FTS RAT CTL &amp;lt;-----------TS1 ID ('J') × 10---------&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Bit-level expansion of the first three symbols (RD- for each):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Symbol 0 — COM (K28.5, 0xBC), RD-:
  10-bit: 0011_1110 10
  Wired:  0 0 1 1 1 1 1 0 1 0   (LSB first on differential pair)

Symbol 1 — PAD (K23.7, 0xF8), RD- → RD+:
  Sending K28.5 in RD- leaves RD+, so next symbol is RD+ variant.
  K23.7 RD+: 0001_0101 11
  Wired:  0 0 0 1 0 1 0 1 1 1

Symbol 2 — PAD (K23.7, 0xF8), RD+ → RD-:
  K23.7 RD-: 1110_1010 00
  Wired:  1 1 1 0 1 0 1 0 0 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;PCIe bit ordering&lt;/strong&gt;: Bits are transmitted LSB-first within each 10-bit symbol. The wired sequence above shows bits as they appear on the differential pair over time, left to right.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Same TS1 on Lane1 (identical content, independent CDR):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Symbol:  [ 0][ 1][ 2][ 3][ 4][ 5][ 6][ 7][ 8][ 9][10][11][12][13][14][15]
Byte:    [BC][F8][F8][C8][07][00][4A][4A][4A][4A][4A][4A][4A][4A][4A][4A]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lane0 and Lane1 carry identical TS1 content during Polling. Each lane's CDR circuit locks independently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EP → RC (simultaneously, same TS1 format):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Lane0: [BC][F8][F8][C8][07][00][4A × 10]
Lane1: [BC][F8][F8][C8][07][00][4A × 10]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Polling.Active Exit Condition:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Transmit at least &lt;strong&gt;1024 TS1 ordered sets&lt;/strong&gt; on all active lanes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AND&lt;/strong&gt; receive &lt;strong&gt;8 consecutive identical TS1 or TS2&lt;/strong&gt; on any active lane.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both conditions must be met. The 1024-TS1 minimum ensures the receiver had enough transitions for bit and symbol lock before either side checks the received content.&lt;/p&gt;

&lt;p&gt;On receipt of 8 consecutive valid TS1, the device transitions to &lt;strong&gt;Polling.Configuration&lt;/strong&gt;.&lt;/p&gt;
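&lt;p&gt;The two exit conditions reduce to a simple predicate (a sketch; real hardware tracks these counters per lane inside the LTSSM):&lt;/p&gt;

```python
def polling_active_done(ts1_transmitted, consecutive_ts_received):
    """Both conditions must hold before leaving Polling.Active."""
    return ts1_transmitted >= 1024 and consecutive_ts_received >= 8
```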

&lt;h3&gt;
  
  
  7.2 Polling.Configuration
&lt;/h3&gt;

&lt;p&gt;In Polling.Configuration, each device sends &lt;strong&gt;TS2 ordered sets&lt;/strong&gt; (symbols 6–15 = 0x45). The Rate ID field is still 0x07. The device exits when it has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Received &lt;strong&gt;8 consecutive TS2&lt;/strong&gt; with matching Rate ID&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AND&lt;/strong&gt; transmitted at least &lt;strong&gt;16 TS2&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After Polling.Configuration, the devices have confirmed mutual speed support. Both transition to &lt;strong&gt;Configuration&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Configuration State
&lt;/h2&gt;

&lt;p&gt;Configuration negotiates link width and lane numbering. It proceeds through several sub-states.&lt;/p&gt;

&lt;h3&gt;
  
  
  8.1 Configuration.LinkWidth.Start
&lt;/h3&gt;

&lt;p&gt;The RC transmits TS1 on all active lanes with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Link Number = PAD (0xF8)&lt;/li&gt;
&lt;li&gt;Lane Number = PAD (0xF8)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This signals: "I am proposing lanes; tell me what you can accept."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RC → EP per lane:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Lane0: [BC][F8][F8][C8][07][00][4A × 10]
Lane1: [BC][F8][F8][C8][07][00][4A × 10]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both lanes carry PAD/PAD — the RC has not assigned numbers yet.&lt;/p&gt;

&lt;h3&gt;
  
  
  8.2 Configuration.LinkWidth.Accept
&lt;/h3&gt;

&lt;p&gt;The EP receives the RC's PAD/PAD TS1 on Lane0 and Lane1. It assigns link and lane numbers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;EP Lane&lt;/th&gt;
&lt;th&gt;Assigned Link Number&lt;/th&gt;
&lt;th&gt;Assigned Lane Number&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lane0&lt;/td&gt;
&lt;td&gt;0x00&lt;/td&gt;
&lt;td&gt;0x00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lane1&lt;/td&gt;
&lt;td&gt;0x00&lt;/td&gt;
&lt;td&gt;0x01&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The EP sends TS1 back to the RC with these assigned values:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EP → RC on Lane0 (pre-scramble):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Symbol:  [ 0][ 1][ 2][ 3][ 4][ 5][ 6–15]
Byte:    [BC][00][00][C8][07][00][4A × 10]
Field:   COM  LNK=0 LAN=0 FTS RAT CTL  TS1_ID
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;EP → RC on Lane1 (pre-scramble):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Symbol:  [ 0][ 1][ 2][ 3][ 4][ 5][ 6–15]
Byte:    [BC][00][01][C8][07][00][4A × 10]
Field:   COM  LNK=0 LAN=1 FTS RAT CTL  TS1_ID
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Bit-level for Symbol 2 (Lane Number) on Lane1 — D1.0 (0x01), assume RD-:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0x01 = D1.0
bits[4:0] = 00001, bits[7:5] = 000
D1.0 RD-: 0111_010100   (10 bits, wire order a b c d e i f g h j: 0 1 1 1 0 1 0 1 0 0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On Lane0, Symbol 2 = D0.0 (0x00):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;D0.0 RD-: 1001_110100   (wire: 1 0 0 1 1 1 0 1 0 0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
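&lt;p&gt;These 10-bit codes can be checked mechanically. A minimal sketch, with the two codes hard-coded from a standard 8b/10b table (starting running disparity RD-):&lt;/p&gt;

```python
# Sketch: RD- codes for the two data symbols above, taken from a standard
# 8b/10b table. Bits go on the wire in code order (a b c d e i f g h j).
RD_MINUS = {
    0x00: "1001110100",  # D0.0
    0x01: "0111010100",  # D1.0
}

def wire_bits(byte):
    return [int(b) for b in RD_MINUS[byte]]

assert wire_bits(0x00) == [1, 0, 0, 1, 1, 1, 0, 1, 0, 0]
assert len(wire_bits(0x01)) == 10
```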



&lt;h3&gt;
  
  
  8.3 RC Echoes Back
&lt;/h3&gt;

&lt;p&gt;The RC receives the EP's numbered TS1. It now echoes the same Link=0 / Lane=N assignments back in its own TS1 transmissions, confirming acceptance:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RC → EP on Lane0:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[BC][00][00][C8][07][00][4A × 10]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;RC → EP on Lane1:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[BC][00][01][C8][07][00][4A × 10]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Agreement is reached: &lt;strong&gt;Link 0, x2 width, Lane0 and Lane1&lt;/strong&gt;.&lt;/p&gt;
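&lt;p&gt;To make the byte layouts concrete, here is a hypothetical helper (names are mine, not from the spec) that builds the 16-symbol Gen1 TS1 ordered set used in this exchange:&lt;/p&gt;

```python
# Sketch: assemble a Gen1 TS1 ordered set from the fields shown above.
COM, N_FTS, RATE_ID, TS1_ID = 0xBC, 0xC8, 0x07, 0x4A

def build_ts1(link, lane, ctrl=0x00):
    # Symbols 0-5: COM, Link, Lane, N_FTS, Rate ID, Training Control;
    # symbols 6-15: TS1 identifier.
    return bytes([COM, link, lane, N_FTS, RATE_ID, ctrl] + [TS1_ID] * 10)

lane0 = build_ts1(link=0x00, lane=0x00)
lane1 = build_ts1(link=0x00, lane=0x01)
assert len(lane0) == 16
assert lane0[2] == 0x00 and lane1[2] == 0x01  # only the Lane symbol differs
```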

&lt;h3&gt;
  
  
  8.4 Configuration.LaneNum.Wait — Switch to TS2
&lt;/h3&gt;

&lt;p&gt;Both sides transition from TS1 → TS2 to confirm lane numbering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RC → EP on Lane0:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[BC][00][00][C8][07][00][45 × 10]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;RC → EP on Lane1:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[BC][00][01][C8][07][00][45 × 10]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;EP → RC (same, mirrored).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Symbol 15 of TS2 on Lane0, D5.2 (0x45), assume RD-:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0x45 = D5.2
bits[4:0] = 00101 = 5, bits[7:5] = 010 = 2
D5.2 RD-: 1010_010101   (wire: 1 0 1 0 0 1 0 1 0 1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  8.5 Configuration.LaneNum.Accept
&lt;/h3&gt;

&lt;p&gt;Exit condition: receive &lt;strong&gt;2 consecutive TS2&lt;/strong&gt; with matching Link and Lane fields.&lt;/p&gt;

&lt;h3&gt;
  
  
  8.6 Configuration.Complete
&lt;/h3&gt;

&lt;p&gt;Both sides exchange TS2 until:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;8 consecutive matching TS2&lt;/strong&gt; received&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AND&lt;/strong&gt; at least &lt;strong&gt;16 TS2&lt;/strong&gt; transmitted after receiving the first matching one&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A &lt;strong&gt;2 ms&lt;/strong&gt; timeout bounds this sub-state: if the conditions are not met in time, the LTSSM falls back toward Detect rather than hanging in Configuration.&lt;/p&gt;

&lt;h3&gt;
  
  
  8.7 Configuration.Idle
&lt;/h3&gt;

&lt;p&gt;Both sides stop sending TS2 and transmit &lt;strong&gt;Logical Idle&lt;/strong&gt; symbols (the link stays electrically active; Electrical Idle belongs to the low-power states, not this transition):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Logical Idle symbol — D0.0 (0x00), RD-:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;10-bit RD-: 1001_110100
Wire (LSB first): 1 0 0 1 1 1 0 1 0 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both sides send 8 consecutive Logical Idle symbols on each lane. On receipt of these, both sides transition to &lt;strong&gt;L0&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  9. L0 State — Gen1 Active Link
&lt;/h2&gt;

&lt;p&gt;The link is now operational:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Speed&lt;/td&gt;
&lt;td&gt;Gen1, 2.5 GT/s per lane&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Width&lt;/td&gt;
&lt;td&gt;x2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Encoding&lt;/td&gt;
&lt;td&gt;8b/10b&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Raw bit rate&lt;/td&gt;
&lt;td&gt;2.5 Gb/s per lane&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Effective rate&lt;/td&gt;
&lt;td&gt;2.5 × 0.8 = 2.0 Gb/s per lane (8b/10b overhead)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aggregate BW&lt;/td&gt;
&lt;td&gt;2 lanes × 2.0 Gb/s = 4.0 Gb/s = &lt;strong&gt;500 MB/s&lt;/strong&gt; (per direction)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;TLPs (Transaction Layer Packets) and DLLPs (Data Link Layer Packets) can now flow. Software is notified that a link is up.&lt;/p&gt;
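&lt;p&gt;A quick back-of-envelope check of these figures (the same formula covers 128b/130b later in the article):&lt;/p&gt;

```python
# Effective per-direction bandwidth = rate × encoding efficiency × lanes / 8.
def effective_GBps(rate_gtps, lanes, payload_bits, total_bits):
    return rate_gtps * payload_bits / total_bits * lanes / 8

gen1_x2 = effective_GBps(2.5, lanes=2, payload_bits=8, total_bits=10)
assert abs(gen1_x2 - 0.5) < 1e-9     # 500 MB/s per direction

gen3_x2 = effective_GBps(8.0, lanes=2, payload_bits=128, total_bits=130)
assert abs(gen3_x2 - 1.969) < 1e-3   # ~1.97 GB/s per direction
```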

&lt;p&gt;The RC or EP (typically the RC, or the driver stack) can now initiate a speed change by entering Recovery.&lt;/p&gt;

&lt;h2&gt;
  
  
  10. Recovery State — Speed Upgrade to Gen3
&lt;/h2&gt;

&lt;h3&gt;
  
  
  10.1 Recovery.RcvrLock — Requesting Speed Change
&lt;/h3&gt;

&lt;p&gt;The RC (or EP) initiates Recovery by sending TS1 with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Speed Change bit set&lt;/strong&gt; (Training Control bit 5 = 1 → byte = 0x20)&lt;/li&gt;
&lt;li&gt;Link = 0x00, Lane = 0x00 or 0x01 per lane&lt;/li&gt;
&lt;li&gt;Rate ID = 0x07 (all speeds)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;RC → EP on Lane0 (pre-scramble):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Symbol:  [ 0][ 1][ 2][ 3][ 4][ 5][ 6–15]
Byte:    [BC][00][00][C8][07][20][4A × 10]
                              ^^
                        Speed Change = 1 (bit 5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Symbol 5, Training Control = 0x20:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0x20 = 0010_0000 binary = D0.1
bits[4:0] = 00000 = 0, bits[7:5] = 001 = 1 → D0.1

D0.1 RD-: 1001_111001   (wire: 1 0 0 1 1 1 1 0 0 1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;RC → EP on Lane1:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[BC][00][01][C8][07][20][4A × 10]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The EP receives these and responds with the same: TS1 with Speed Change bit = 1, its own Link=0, Lane=N.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EP → RC on Lane0:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[BC][00][00][C8][07][20][4A × 10]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;EP → RC on Lane1:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[BC][00][01][C8][07][20][4A × 10]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exit condition for Recovery.RcvrLock: receive &lt;strong&gt;8 consecutive TS1 or TS2&lt;/strong&gt; (with or without speed change bit) on all active lanes.&lt;/p&gt;

&lt;h3&gt;
  
  
  10.2 Recovery.RcvrCfg — Confirming Speed with TS2
&lt;/h3&gt;

&lt;p&gt;Both sides switch to TS2 (still Gen1, still 8b/10b, Speed Change bit = 1):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RC → EP on Lane0:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[BC][00][00][C8][07][20][45 × 10]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;RC → EP on Lane1:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[BC][00][01][C8][07][20][45 × 10]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exit condition: receive &lt;strong&gt;8 consecutive TS2&lt;/strong&gt; with Speed Change bit set.&lt;/p&gt;

&lt;h3&gt;
  
  
  10.3 Recovery.Speed — PHY Retrain
&lt;/h3&gt;

&lt;p&gt;Both sides simultaneously:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Assert electrical idle (stop driving differential data).&lt;/li&gt;
&lt;li&gt;Reset the PHY PLL from 2.5 GT/s to 8 GT/s.&lt;/li&gt;
&lt;li&gt;Switch the serializer/deserializer (SerDes) to Gen3 signaling parameters (different equalization, different reference voltage levels).&lt;/li&gt;
&lt;li&gt;Switch framing from 8b/10b to &lt;strong&gt;128b/130b&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Reset the LFSR scrambler to the Gen3 seed.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There is a mandatory quiet time (Electrical Idle) during this phase. The timeout for this step is 24 ms.&lt;/p&gt;

&lt;h3&gt;
  
  
  10.4 Recovery.RcvrLock Again (at Gen3)
&lt;/h3&gt;

&lt;p&gt;After the PHY switches to 8 GT/s, both sides retransmit TS1 — now using &lt;strong&gt;128b/130b&lt;/strong&gt; framing.&lt;/p&gt;

&lt;p&gt;Each TS1 block at Gen3 is a &lt;strong&gt;130-bit&lt;/strong&gt; unit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[sync: 01][TS1 payload: 128 bits]
 ^^^^^^^^
 Ordered set block indicator
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Payload (128 bits = 16 bytes), logical byte layout:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Byte:  [TS1_OS_ID][LNK][LAN][N_FTS][RATE][CTRL][4A×10][rsvd×2]
       [0xF0    ][0x00][0x00][0xC8][0x07][0x20][4A..4A][00 00]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Note: The Gen3 TS1 ordered set identifier byte is &lt;code&gt;0x1E&lt;/code&gt; (different from Gen1/Gen2, where the K-code COM marks the start of an ordered set). 128b/130b has no K-codes, so a dedicated byte in the payload identifies the ordered set type.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;On the wire (Lane0, first 18 bits of a Gen3 TS1 block, before scrambling):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Sync header (2 bits): 0 1
Byte 0 = 0xF0 = 1111_0000, wire (LSB first): 0 0 0 0 1 1 1 1
Byte 1 = 0x00 = 0000_0000, wire (LSB first): 0 0 0 0 0 0 0 0
...continues for remaining 14 bytes (112 bits)
Total = 2 + 128 = 130 bits per block
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both lanes (Lane0, Lane1) transmit identical ordered sets simultaneously.&lt;/p&gt;
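&lt;p&gt;A sketch of serializing one such block (assuming the standard 8.0 GT/s TS1 identifier byte 0x1E; everything else follows the layout above):&lt;/p&gt;

```python
# Sketch: one 128b/130b block = 2-bit sync header + 16 payload bytes,
# each payload byte placed on the wire LSB first.
def block_bits(sync_bits, payload):
    assert len(sync_bits) == 2 and len(payload) == 16
    bits = list(sync_bits)
    for byte in payload:
        bits += [(byte >> i) & 1 for i in range(8)]  # LSB first
    return bits

payload = bytes([0x1E, 0x00, 0x00, 0xC8, 0x07, 0x20] + [0x4A] * 10)
bits = block_bits([1, 0], payload)  # ordered set sync header
assert len(bits) == 130
assert bits[2:10] == [0, 1, 1, 1, 1, 0, 0, 0]  # 0x1E, LSB first
```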

&lt;h3&gt;
  
  
  10.5 Recovery.RcvrCfg (at Gen3) — TS2
&lt;/h3&gt;

&lt;p&gt;Both sides switch to TS2 at Gen3. Same 130-bit block format, payload byte 0 = TS2 OS identifier (&lt;code&gt;0x2D&lt;/code&gt; in 128b/130b), symbols 6–15 equivalent = 0x45 bytes.&lt;/p&gt;

&lt;p&gt;Exit condition: &lt;strong&gt;8 consecutive TS2 at Gen3&lt;/strong&gt; on all active lanes.&lt;/p&gt;

&lt;h3&gt;
  
  
  10.6 Transition to L0 (Gen3)
&lt;/h3&gt;

&lt;p&gt;Both sides send Logical Idle in 128b/130b and transition to &lt;strong&gt;L0 at Gen3&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final Link State&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Width&lt;/td&gt;
&lt;td&gt;x2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speed&lt;/td&gt;
&lt;td&gt;Gen3, 8 GT/s per lane&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Encoding&lt;/td&gt;
&lt;td&gt;128b/130b&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Raw bit rate&lt;/td&gt;
&lt;td&gt;8 Gb/s per lane&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Encoding overhead&lt;/td&gt;
&lt;td&gt;2/130 ≈ 1.54%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Effective rate&lt;/td&gt;
&lt;td&gt;8 × (128/130) ≈ 7.877 Gb/s per lane&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aggregate BW&lt;/td&gt;
&lt;td&gt;2 lanes × 7.877 Gb/s ÷ 8 ≈ &lt;strong&gt;1.97 GB/s&lt;/strong&gt; (per direction)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  11. Complete LTSSM State Transition Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;State&lt;/th&gt;
&lt;th&gt;Duration / Exit Condition&lt;/th&gt;
&lt;th&gt;Lanes active&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Detect.Quiet&lt;/td&gt;
&lt;td&gt;Receiver detection logic runs&lt;/td&gt;
&lt;td&gt;All (RC: 4)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Detect.Active&lt;/td&gt;
&lt;td&gt;Receiver seen on Lane0, Lane1; Lane2/3 deactivated&lt;/td&gt;
&lt;td&gt;RC: 2, EP: 2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Polling.Active&lt;/td&gt;
&lt;td&gt;TX ≥1024 TS1 &lt;strong&gt;AND&lt;/strong&gt; RX 8 consecutive TS1/TS2&lt;/td&gt;
&lt;td&gt;2 per side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Polling.Configuration&lt;/td&gt;
&lt;td&gt;RX 8 consecutive TS2 &lt;strong&gt;AND&lt;/strong&gt; TX ≥16 TS2&lt;/td&gt;
&lt;td&gt;2 per side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Configuration.LinkWidth.Start&lt;/td&gt;
&lt;td&gt;Send PAD/PAD TS1&lt;/td&gt;
&lt;td&gt;2 per side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Configuration.LinkWidth.Accept&lt;/td&gt;
&lt;td&gt;RX TS1 with Link≠PAD, Lane≠PAD&lt;/td&gt;
&lt;td&gt;2 per side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Configuration.LaneNum.Wait&lt;/td&gt;
&lt;td&gt;Switch to TS2 with Link/Lane assigned&lt;/td&gt;
&lt;td&gt;2 per side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Configuration.LaneNum.Accept&lt;/td&gt;
&lt;td&gt;RX 2 consecutive matching TS2&lt;/td&gt;
&lt;td&gt;2 per side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Configuration.Complete&lt;/td&gt;
&lt;td&gt;RX 8 consecutive TS2 &lt;strong&gt;AND&lt;/strong&gt; TX ≥16 TS2 after first match (2 ms timeout)&lt;/td&gt;
&lt;td&gt;2 per side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Configuration.Idle&lt;/td&gt;
&lt;td&gt;RX 8 Logical Idle symbols&lt;/td&gt;
&lt;td&gt;2 per side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;L0 (Gen1)&lt;/td&gt;
&lt;td&gt;Active data transfer at 2.5 GT/s x2&lt;/td&gt;
&lt;td&gt;2 per side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recovery.RcvrLock&lt;/td&gt;
&lt;td&gt;RX 8 consecutive TS1/TS2 (speed change requested)&lt;/td&gt;
&lt;td&gt;2 per side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recovery.RcvrCfg&lt;/td&gt;
&lt;td&gt;RX 8 consecutive TS2 with speed change bit&lt;/td&gt;
&lt;td&gt;2 per side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recovery.Speed&lt;/td&gt;
&lt;td&gt;PHY retrains to Gen3; 24 ms timeout&lt;/td&gt;
&lt;td&gt;2 per side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recovery.RcvrLock (Gen3)&lt;/td&gt;
&lt;td&gt;RX 8 consecutive TS1/TS2 at 8 GT/s (128b/130b)&lt;/td&gt;
&lt;td&gt;2 per side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recovery.RcvrCfg (Gen3)&lt;/td&gt;
&lt;td&gt;RX 8 consecutive TS2 at 8 GT/s&lt;/td&gt;
&lt;td&gt;2 per side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;L0 (Gen3)&lt;/td&gt;
&lt;td&gt;Active data transfer at 8 GT/s x2 ≈ 1.97 GB/s&lt;/td&gt;
&lt;td&gt;2 per side&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
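&lt;p&gt;The happy path through the table can be condensed into a toy walk (illustrative only; real LTSSM transitions depend on received ordered sets and timeouts):&lt;/p&gt;

```python
# Sketch: linear happy-path walk Detect → L0 (Gen1) → Recovery → L0 (Gen3).
PATH = [
    "Detect.Quiet", "Detect.Active",
    "Polling.Active", "Polling.Configuration",
    "Configuration.LinkWidth.Start", "Configuration.LinkWidth.Accept",
    "Configuration.LaneNum.Wait", "Configuration.LaneNum.Accept",
    "Configuration.Complete", "Configuration.Idle",
    "L0 (Gen1)",
    "Recovery.RcvrLock", "Recovery.RcvrCfg", "Recovery.Speed",
    "Recovery.RcvrLock (Gen3)", "Recovery.RcvrCfg (Gen3)",
    "L0 (Gen3)",
]
NEXT = dict(zip(PATH, PATH[1:]))

state, hops = "Detect.Quiet", 0
while state != "L0 (Gen3)":
    state, hops = NEXT[state], hops + 1
assert hops == 16  # sixteen transitions from power-on to Gen3 L0
```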

</description>
      <category>architecture</category>
      <category>computerscience</category>
      <category>systems</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Inside PCIe PHY: End-to-End Transmit and Receive Path</title>
      <dc:creator>Ripan Deuri</dc:creator>
      <pubDate>Sat, 28 Mar 2026 12:23:45 +0000</pubDate>
      <link>https://dev.to/ripan030/inside-pcie-phy-end-to-end-transmit-and-receive-path-g18</link>
      <guid>https://dev.to/ripan030/inside-pcie-phy-end-to-end-transmit-and-receive-path-g18</guid>
      <description>&lt;p&gt;This article builds on the PCIe overview and physical layer fundamentals by presenting an end-to-end view of how data flows through the transmit and receive paths. The focus is on how a Transaction Layer Packet (TLP) is transformed into a high-speed serial bit stream and reconstructed at the receiver.&lt;/p&gt;






&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

============================================================================
TRANSMITTER (e.g., Root Complex sending a Memory Write TLP)
============================================================================

[Data Link Layer]
    Seq# + TLP Header + Data + LCRC (parallel bytes)
            |
            v
[Physical Layer]

    [Framing / Block Formation]
        Gen1/2:
            STP + TLP + END (control symbols embedded in stream)
        Gen3+:
            Data organized into 128-bit blocks
            (packet boundaries inferred, no explicit STP/END)

            |
            v

    [Scrambler]
        Bit stream XOR’d with LFSR sequence
        -&amp;gt; Randomized data for transition density and EMI reduction

            |
            v

    [Encoder / Block Encoding]
        Gen1/2 (8b/10b):
            8-bit -&amp;gt; 10-bit symbol (DC balance + control encoding)

        Gen3/4/5 (128b/130b):
            128-bit block + 2-bit sync header -&amp;gt; 130-bit block

            |
            v

    [Transmit PLL / Clock Generation]
        Reference clock (e.g., 100 MHz)
        -&amp;gt; Generates high-speed serial rate (e.g., 8.0 GT/s for Gen3)

            |
            v

    [Serializer]
        Parallel block -&amp;gt; serial bit stream
        Bit time ≈ 125 ps (Gen3)

            |
            v

    [Differential Driver]
        Drives Tx+ / Tx− pair
        Bit encoded as polarity (Tx+ &amp;gt; Tx− or Tx+ &amp;lt; Tx−)
        (~800 mVpp differential, implementation-dependent)

            |
============|===============================================================
            |
            |
    Tx+ ----|---- Tx−   (PCIe serial link, e.g., 8.0 GT/s)
            |
            |
============|===============================================================
            |
============================================================================
RECEIVER (e.g., Endpoint receiving the Memory Write TLP)
============================================================================

[Physical Layer]

    [Differential Receiver]
        Senses (Rx+ − Rx−)
        -&amp;gt; Recovers serial bit stream (noise rejection)

            |
            v

    [Clock Data Recovery (CDR)]
        Extracts clock from data transitions
        Phase Detector -&amp;gt; Loop Filter -&amp;gt; VCO
        -&amp;gt; Sampling aligned to center of Unit Interval (UI)

            |
            v

    [Deserializer]
        Serial stream -&amp;gt; parallel blocks
        (e.g., 130-bit blocks in Gen3+)

            |
            v

    [Decoder / Block Decoding]
        Gen1/2:
            10-bit -&amp;gt; 8-bit symbols

        Gen3/4/5:
            Remove 2-bit sync header
            -&amp;gt; Recover 128-bit scrambled data

            |
            v

    [De-scrambler]
        XOR with same LFSR sequence
        -&amp;gt; Restores original data

            |
            v

    [De-framing / Packet Reconstruction]
        Gen1/2:
            Detect STP / END symbols

        Gen3+:
            Packet boundaries inferred from protocol structure

            |
            v

[Data Link Layer]
    Seq# + TLP Header + Data + LCRC
        -&amp;gt; LCRC validation
        -&amp;gt; Sequence tracking
        -&amp;gt; ACK/NACK via DLLP

============================================================================

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
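&lt;p&gt;The scrambler/de-scrambler pair in the diagram is a symmetric XOR with the same LFSR stream. A toy model (16-bit register and arbitrary taps, chosen only for illustration; the real Gen3+ scrambler is a per-lane 23-bit LFSR):&lt;/p&gt;

```python
# Toy scramble/de-scramble round trip. Seed and taps are arbitrary.
def lfsr_stream(seed, n, taps=(0, 2, 3, 5), width=16):
    state, out = seed, []
    for _ in range(n):
        out.append(state & 1)
        fb = 0
        for t in taps:
            fb ^= (state >> t) & 1
        state = (state >> 1) | (fb << (width - 1))
    return out

def xor_bits(data, key):
    return [d ^ k for d, k in zip(data, key)]

data = [0] * 8 + [1] * 8                 # long runs: bad for clock recovery
key = lfsr_stream(seed=0xACE1, n=16)
scrambled = xor_bits(data, key)          # transition-rich on the wire
assert xor_bits(scrambled, key) == data  # receiver undoes it with same XOR
```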

</description>
      <category>architecture</category>
      <category>computerscience</category>
      <category>systems</category>
    </item>
    <item>
      <title>Understanding PCIe Physical Layer</title>
      <dc:creator>Ripan Deuri</dc:creator>
      <pubDate>Sat, 28 Mar 2026 12:22:33 +0000</pubDate>
      <link>https://dev.to/ripan030/understanding-pcie-physical-layer-1ka1</link>
      <guid>https://dev.to/ripan030/understanding-pcie-physical-layer-1ka1</guid>
      <description>&lt;p&gt;The PCIe Physical Layer is responsible for converting structured packet data into high-speed electrical signals that can traverse the link between devices. While higher layers define what to send, the Physical Layer determines how those bits are transmitted reliably over a noisy channel.&lt;/p&gt;

&lt;p&gt;This article focuses on the transmit (TX) path, breaking down each stage from TLP data to differential signaling on the wire, including scrambling, encoding, and serialization.&lt;/p&gt;




&lt;h2&gt;
  
  
  PCIe Physical Layer
&lt;/h2&gt;

&lt;p&gt;The Root Complex (RC) and Endpoint (EP) are connected by a PCIe Link — a set of high-speed differential signal pairs. Everything that happens between them (register reads, DMA transfers, interrupts) is ultimately carried as Transaction Layer Packets (TLPs) over this link.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-------------+                             +-------------+
|             |&amp;lt;====== PCIe Link ========&amp;gt;  |             |
|     RC      |                             |     EP      |
|             |                             |             |
+-------------+                             +-------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  PCIe Link
&lt;/h2&gt;

&lt;p&gt;A PCIe &lt;strong&gt;lane&lt;/strong&gt; is a full-duplex serial connection consisting of one transmit pair and one receive pair.&lt;/p&gt;

&lt;p&gt;A PCIe &lt;strong&gt;link&lt;/strong&gt; consists of one or more such lanes aggregated together (x1, x2, x4, x8, x16).&lt;/p&gt;

&lt;h3&gt;
  
  
  Physical wires in one PCIe lane
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RC (Transmitter)                     EP (Receiver)

TX+  -----------------------------&amp;gt;  RX+
TX-  -----------------------------&amp;gt;  RX-

RX+  &amp;lt;-----------------------------  TX+
RX-  &amp;lt;-----------------------------  TX-
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each lane is &lt;strong&gt;full-duplex&lt;/strong&gt;, meaning data flows simultaneously in both directions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Differential Signaling
&lt;/h2&gt;

&lt;p&gt;PCIe uses &lt;strong&gt;low-voltage differential signaling&lt;/strong&gt;, where the same signal is transmitted on two wires with opposite polarity.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TX+ carries the signal&lt;/li&gt;
&lt;li&gt;TX- carries the inverted signal&lt;/li&gt;
&lt;li&gt;The receiver subtracts the two → noise cancels out&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High noise immunity&lt;/li&gt;
&lt;li&gt;Better signal integrity at high speeds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Conceptually:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Transmitting ‘1’: TX+ &amp;gt; TX- → positive differential&lt;/li&gt;
&lt;li&gt;Transmitting ‘0’: TX+ &amp;lt; TX- → negative differential&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;(The exact voltage swing is small — typically a few hundred millivolts — enabling high-speed operation with low power.)&lt;/p&gt;
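&lt;p&gt;A toy numeric model of why the subtraction rejects noise (millivolt values invented for illustration): coupled noise hits both wires nearly equally, so it drops out of the difference.&lt;/p&gt;

```python
# Common-mode noise adds to both wires; the difference is unaffected.
def rx_decode(tx_plus_mv, tx_minus_mv, noise_mv=0.0):
    diff = (tx_plus_mv + noise_mv) - (tx_minus_mv + noise_mv)
    return 1 if diff > 0 else 0

assert rx_decode(200, -200, noise_mv=150) == 1   # '1' survives 150 mV noise
assert rx_decode(-200, 200, noise_mv=150) == 0   # '0' survives it too
```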

&lt;h2&gt;
  
  
  Lanes — Bundling for Bandwidth
&lt;/h2&gt;

&lt;p&gt;A single lane provides a fixed bandwidth. PCIe increases throughput by aggregating multiple lanes.&lt;/p&gt;

&lt;h3&gt;
  
  
  x1 Link (1 Lane)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ Lane 0 ]
TX pair  ---&amp;gt;  
RX pair  &amp;lt;---
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  x2 Link (2 Lanes)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ Lane 0 ]          [ Lane 1 ]
TX pair ---&amp;gt;        TX pair ---&amp;gt;
RX pair &amp;lt;---        RX pair &amp;lt;---
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Data is &lt;strong&gt;striped across lanes in a round-robin manner&lt;/strong&gt;, and the receiver reassembles it using alignment mechanisms.&lt;/p&gt;
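&lt;p&gt;A minimal sketch of round-robin byte striping and reassembly for a x2 link (lane alignment details omitted):&lt;/p&gt;

```python
# Stripe bytes across lanes round-robin, then interleave them back.
def stripe(data, lanes=2):
    return [data[i::lanes] for i in range(lanes)]

def reassemble(streams):
    out = []
    for group in zip(*streams):  # one byte from each lane per round
        out.extend(group)
    return bytes(out)

data = bytes(range(8))
lane0, lane1 = stripe(data)
assert lane0 == bytes([0, 2, 4, 6]) and lane1 == bytes([1, 3, 5, 7])
assert reassemble([lane0, lane1]) == data
```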

&lt;h2&gt;
  
  
  Transmit Path
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[TLP Stream]
     |
     v
[Framing / Control Symbols]
     |
     v
[Scrambler]
     |
     v
[Encoder]
     |
     v
[Serializer]
     |
     v
[Differential Driver]
     |
     v
TX+ / TX-
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 1: Framing and Control Symbols
&lt;/h2&gt;

&lt;p&gt;The Data Link Layer passes a sequence of bytes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Seq# | TLP Header | Data | LCRC]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;TLP boundaries are represented using &lt;strong&gt;control symbols&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;STP (Start of TLP)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;END (End of TLP)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In Gen1/Gen2, these are explicit symbols embedded in the encoded stream.&lt;br&gt;
In Gen3 and later, boundaries are encoded within &lt;strong&gt;control blocks&lt;/strong&gt; using sync headers and ordered sets.&lt;/p&gt;

&lt;p&gt;So the conceptual view becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[STP | Seq# | Header | Data | LCRC | END]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 2: Scrambler
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why scrambling is needed
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Transition density&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Clock recovery requires frequent signal transitions&lt;/li&gt;
&lt;li&gt;Long runs of 0s or 1s break timing recovery&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;EMI reduction&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Repetitive patterns create strong electromagnetic emissions&lt;/li&gt;
&lt;li&gt;Scrambling randomizes the spectrum&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How it works
&lt;/h3&gt;

&lt;p&gt;The scrambler XORs data with a pseudo-random sequence generated by an &lt;strong&gt;LFSR (Linear Feedback Shift Register)&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Transmitter: data ⊕ lfsr → scrambled  
Receiver:    scrambled ⊕ lfsr → data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PCIe Gen3+ uses a &lt;strong&gt;23-bit LFSR&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example (simplified LFSR)
&lt;/h3&gt;

&lt;p&gt;Initial register:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[S3 S2 S1 S0] = 1 0 0 1
Polynomial: x⁴ + x + 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;LFSR sequence generation&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cycle&lt;/th&gt;
&lt;th&gt;Register&lt;/th&gt;
&lt;th&gt;Feedback&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1001&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1100&lt;/td&gt;
&lt;td&gt;1 XOR 1 = 0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0110&lt;/td&gt;
&lt;td&gt;1 XOR 0 = 1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;1011&lt;/td&gt;
&lt;td&gt;0 XOR 0 = 0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;0101&lt;/td&gt;
&lt;td&gt;1 XOR 1 = 0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;1010&lt;/td&gt;
&lt;td&gt;0 XOR 1 = 1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;1101&lt;/td&gt;
&lt;td&gt;1 XOR 0 = 1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;1110&lt;/td&gt;
&lt;td&gt;1 XOR 1 = 0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Generated sequence (LSB output):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1 0 0 1 1 0 1 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
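&lt;p&gt;The table can be reproduced in a few lines of Python (the feedback tap S1 ⊕ S0 is chosen to match the sequence shown):&lt;/p&gt;

```python
# 4-bit LFSR: shift right each cycle, feedback S1 XOR S0 shifted into S3.
def lfsr4_step(state):
    fb = ((state >> 1) & 1) ^ (state & 1)  # S1 XOR S0
    return (state >> 1) | (fb << 3)

state, regs, out = 0b1001, [], []
for _ in range(8):
    regs.append(state)
    out.append(state & 1)  # LSB is the output bit
    state = lfsr4_step(state)

assert regs == [0b1001, 0b1100, 0b0110, 0b1011,
                0b0101, 0b1010, 0b1101, 0b1110]
assert out == [1, 0, 0, 1, 1, 0, 1, 0]
```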



&lt;h3&gt;
  
  
  Effect on data
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Original:   0 0 0 0 0 0 0 0
LFSR:       1 0 0 1 1 0 1 0
Scrambled:  1 0 0 1 1 0 1 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the signal has rich transitions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Encoder
&lt;/h2&gt;

&lt;p&gt;The encoder provides structure for clock recovery and alignment.&lt;/p&gt;

&lt;h3&gt;
  
  
  8b/10b Encoding (Gen1, Gen2)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Each 8-bit byte → 10-bit symbol&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Guarantees:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bounded run length&lt;/li&gt;
&lt;li&gt;DC balance&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  128b/130b Encoding (Gen3, Gen4)
&lt;/h3&gt;

&lt;p&gt;Instead of per-byte encoding:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Sync Header] + [128-bit scrambled data] = 130 bits
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Sync Header = &lt;code&gt;01&lt;/code&gt; (data) or &lt;code&gt;10&lt;/code&gt; (control)&lt;/li&gt;
&lt;li&gt;Guarantees a transition at block start&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This improves efficiency:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;8b/10b → 80% efficiency&lt;/li&gt;
&lt;li&gt;128b/130b → ~98.5% efficiency&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 4: Serializer
&lt;/h2&gt;

&lt;p&gt;The serializer converts parallel data into a serial bit stream.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[130-bit block]
      |
      v
1 → 0 → 1 → 1 → 0 → ... (bit stream)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For PCIe Gen3:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data rate = &lt;strong&gt;8 GT/s&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Unit Interval (UI) ≈ &lt;strong&gt;125 ps per bit&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Internally, serializers use high-speed clocking derived from a reference clock (commonly 100 MHz) via a PLL.&lt;/p&gt;

&lt;p&gt;For multi-lane links:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Lane 0 → Serializer 0
Lane 1 → Serializer 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Data is distributed across lanes and transmitted in parallel.&lt;/p&gt;
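&lt;p&gt;Conceptually, consecutive bytes are distributed round-robin across the lanes. A simplified model (the real Physical Layer stripes at block granularity in 128b/130b mode; this only illustrates the round-robin idea):&lt;/p&gt;

```python
# Round-robin byte striping across lanes (simplified model).
def stripe(data, num_lanes):
    lanes = [[] for _ in range(num_lanes)]
    for i, byte in enumerate(data):
        lanes[i % num_lanes].append(byte)
    return lanes

print(stripe(list(range(8)), 4))  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```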

&lt;h2&gt;
  
  
  Step 5: Differential Driver
&lt;/h2&gt;

&lt;p&gt;The serialized bits drive a differential output stage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Bit = 1 → TX+ &amp;gt; TX-
Bit = 0 → TX+ &amp;lt; TX-
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This produces a high-speed differential signal on the wire.&lt;/p&gt;

&lt;h2&gt;
  
  
  Receive Path (High-Level)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RX+ / RX-
     |
     v
[Analog Front-End + Equalization]
     |
     v
[Clock Data Recovery (CDR)]
     |
     v
[Deserializer]
     |
     v
[Decoder]
     |
     v
[De-scrambler]
     |
     v
[TLP Stream]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Equalization (CTLE, DFE)&lt;/strong&gt; → compensates for channel loss&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CDR&lt;/strong&gt; → extracts clock from data transitions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lane alignment&lt;/strong&gt; → reconstructs multi-lane streams&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>architecture</category>
      <category>computerscience</category>
      <category>systems</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>PCIe Overview</title>
      <dc:creator>Ripan Deuri</dc:creator>
      <pubDate>Sat, 28 Mar 2026 06:52:27 +0000</pubDate>
      <link>https://dev.to/ripan030/pcie-overview-2bm4</link>
      <guid>https://dev.to/ripan030/pcie-overview-2bm4</guid>
      <description>&lt;p&gt;PCIe is widely used across modern computing systems, powering devices such as SSDs, GPUs, and network interfaces.&lt;/p&gt;

&lt;p&gt;This article provides a structured overview of PCIe with an emphasis on practical understanding. It covers key concepts including topology, protocol layers, BARs, DMA, and interrupt mechanisms.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is PCIe
&lt;/h2&gt;

&lt;p&gt;PCIe (Peripheral Component Interconnect Express) is a high-speed serial bus standard used to connect peripheral devices to the main processor.&lt;/p&gt;

&lt;p&gt;The older PCI bus was parallel—it had 32 or 64 data lines all switching simultaneously. However, parallel buses have a fundamental limitation at high frequencies: signals on different wires arrive at slightly different times (called skew), making it difficult to scale to higher speeds.&lt;/p&gt;

&lt;p&gt;PCIe uses a high-speed serialized bitstream over differential pairs of wires (called lanes). Each lane consists of one differential pair for transmit and one for receive. Because there are only two wires per direction per lane, PCIe can operate at much higher frequencies without skew-related issues.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;PCI&lt;/th&gt;
&lt;th&gt;PCIe&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Signal Type&lt;/td&gt;
&lt;td&gt;Parallel (32/64 wires)&lt;/td&gt;
&lt;td&gt;Serial (differential pair per lane)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Topology&lt;/td&gt;
&lt;td&gt;Shared bus&lt;/td&gt;
&lt;td&gt;Point-to-point (switched fabric)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max Bandwidth&lt;/td&gt;
&lt;td&gt;~533 MB/s&lt;/td&gt;
&lt;td&gt;Up to ~64 GB/s per direction (Gen5 x16)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interrupt&lt;/td&gt;
&lt;td&gt;IRQ Lines&lt;/td&gt;
&lt;td&gt;MSI / MSI-X&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;PCIe uses a point-to-point topology. Each device connects through a dedicated link to the Root Complex or via switches. There is no shared electrical bus; instead, PCIe forms a hierarchical switched interconnect.&lt;/p&gt;

&lt;h2&gt;
  
  
  Basic PCIe Topology
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-----------------------------------------------+
|                   SoC                         |
|                                               |
|  +-------+            +-------------------+   |
|  |  CPU  |&amp;lt;----------&amp;gt;| Root Complex (RC) |   |
|  +-------+  AXI bus   |                   |   |
|                       |                   |   |
|  +-------+            |                   |   |
|  | DRAM  |&amp;lt;----------&amp;gt;|                   |   |
|  +-------+  AXI bus   +---------^---------+   |
|                                 |             |
+---------------------------------|-------------+
                                  |
                        =====================
                              PCIe Link
                        =====================
                                  |
                        +---------v---------+
                        |   Endpoint (EP)   |
                        +-------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key Components:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Root Complex (RC):&lt;/strong&gt;&lt;br&gt;
The Root Complex is the PCIe host controller inside the SoC. It acts as a bridge between the CPU’s memory bus (AXI/AHB) and the PCIe fabric. It:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Initiates configuration space reads/writes during enumeration&lt;/li&gt;
&lt;li&gt;Translates CPU memory accesses into PCIe TLPs (Transaction Layer Packets)&lt;/li&gt;
&lt;li&gt;Receives TLPs from endpoints and converts them into memory transactions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Endpoint (EP):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exposes registers and memory regions via BARs (Base Address Registers)&lt;/li&gt;
&lt;li&gt;Can initiate DMA transfers to/from host memory&lt;/li&gt;
&lt;li&gt;Signals the CPU using MSI/MSI-X interrupts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;PCIe Switch (optional):&lt;/strong&gt;&lt;br&gt;
A PCIe switch expands a single upstream port into multiple downstream ports, allowing multiple endpoints to connect. This creates a hierarchical topology rather than a simple bus.&lt;/p&gt;
&lt;h2&gt;
  
  
  PCIe Link Layers
&lt;/h2&gt;

&lt;p&gt;PCIe uses a layered architecture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+------------------------+
| Transaction Layer (TL) |
+------------------------+
| Data Link Layer (DLL)  |
+------------------------+
| Physical Layer (PL)    |
+------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Physical Layer (PL):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Transmits serialized data over differential pairs (TX+/TX- and RX+/RX-)&lt;/li&gt;
&lt;li&gt;Uses encoding schemes (8b/10b for Gen1/2, 128b/130b for Gen3+) to embed clock information&lt;/li&gt;
&lt;li&gt;Performs link training using LTSSM (Link Training and Status State Machine), negotiating lane width, speed, and equalization parameters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Data Link Layer (DLL):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adds sequence numbers and LCRC (Link CRC) to ensure data integrity&lt;/li&gt;
&lt;li&gt;Implements ACK/NAK-based retransmission for reliable delivery&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Transaction Layer (TL):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creates and processes Transaction Layer Packets (TLPs)&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Supports different types of transactions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Memory Read (Non-posted):&lt;/strong&gt; Request is sent, and a Completion TLP returns data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory Write (Posted):&lt;/strong&gt; Sent without requiring a completion&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuration Read/Write:&lt;/strong&gt; Used during enumeration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Message TLPs:&lt;/strong&gt; Used for MSI/MSI-X interrupts&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  PCIe Lanes and Link Speed
&lt;/h2&gt;

&lt;p&gt;A PCIe lane consists of one differential pair for transmit and one for receive, enabling full-duplex communication.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Link Width&lt;/th&gt;
&lt;th&gt;Notation&lt;/th&gt;
&lt;th&gt;Typical Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 lane&lt;/td&gt;
&lt;td&gt;x1&lt;/td&gt;
&lt;td&gt;Low-bandwidth peripherals&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4 lanes&lt;/td&gt;
&lt;td&gt;x4&lt;/td&gt;
&lt;td&gt;SSDs, NICs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8 lanes&lt;/td&gt;
&lt;td&gt;x8&lt;/td&gt;
&lt;td&gt;High-performance NICs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16 lanes&lt;/td&gt;
&lt;td&gt;x16&lt;/td&gt;
&lt;td&gt;Graphics cards (GPUs)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;PCIe generations define per-lane throughput:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Generation&lt;/th&gt;
&lt;th&gt;Raw Rate&lt;/th&gt;
&lt;th&gt;Effective Bandwidth per Lane&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gen1&lt;/td&gt;
&lt;td&gt;2.5 GT/s&lt;/td&gt;
&lt;td&gt;250 MB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gen2&lt;/td&gt;
&lt;td&gt;5 GT/s&lt;/td&gt;
&lt;td&gt;500 MB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gen3&lt;/td&gt;
&lt;td&gt;8 GT/s&lt;/td&gt;
&lt;td&gt;~985 MB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gen4&lt;/td&gt;
&lt;td&gt;16 GT/s&lt;/td&gt;
&lt;td&gt;~1969 MB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Gen1/Gen2 use 8b/10b encoding, while Gen3 and above use 128b/130b encoding, which improves efficiency.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
A PCIe Gen3 x2 link provides ~2 GB/s bandwidth in each direction.&lt;/p&gt;
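&lt;p&gt;The effective-bandwidth figures above can be reproduced from the raw rate and the encoding overhead. A small sketch (the helper name is made up):&lt;/p&gt;

```python
# Per-lane effective bandwidth in MB/s:
# raw rate (GT/s) x encoding efficiency, divided by 8 bits per byte.
def bandwidth_mb_s(rate_gt_s, payload_bits, total_bits, lanes=1):
    return rate_gt_s * 1e3 * payload_bits / total_bits / 8 * lanes

print(round(bandwidth_mb_s(2.5, 8, 10)))            # Gen1 x1: 250
print(round(bandwidth_mb_s(8, 128, 130)))           # Gen3 x1: 985
print(round(bandwidth_mb_s(8, 128, 130, lanes=2)))  # Gen3 x2: 1969
```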
&lt;h2&gt;
  
  
  PCIe Configuration Space
&lt;/h2&gt;

&lt;p&gt;PCIe devices expose a configuration space used by the host (Root Complex) to discover devices, read capabilities, and configure them.&lt;/p&gt;

&lt;p&gt;The first 256 bytes follow the legacy PCI configuration format for backward compatibility. PCIe extends this to a total of 4 KB.&lt;/p&gt;
&lt;h3&gt;
  
  
  Standard 256-byte PCI Configuration Space:
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Offset&lt;/th&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0x00&lt;/td&gt;
&lt;td&gt;Vendor ID&lt;/td&gt;
&lt;td&gt;2 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x02&lt;/td&gt;
&lt;td&gt;Device ID&lt;/td&gt;
&lt;td&gt;2 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x04&lt;/td&gt;
&lt;td&gt;Command&lt;/td&gt;
&lt;td&gt;2 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x06&lt;/td&gt;
&lt;td&gt;Status&lt;/td&gt;
&lt;td&gt;2 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x08&lt;/td&gt;
&lt;td&gt;Revision ID&lt;/td&gt;
&lt;td&gt;1 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x09&lt;/td&gt;
&lt;td&gt;Class Code (Prog IF)&lt;/td&gt;
&lt;td&gt;1 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x0A&lt;/td&gt;
&lt;td&gt;Class Code (Subclass)&lt;/td&gt;
&lt;td&gt;1 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x0B&lt;/td&gt;
&lt;td&gt;Class Code (Base)&lt;/td&gt;
&lt;td&gt;1 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x0C&lt;/td&gt;
&lt;td&gt;Cache Line Size&lt;/td&gt;
&lt;td&gt;1 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x0D&lt;/td&gt;
&lt;td&gt;Latency Timer&lt;/td&gt;
&lt;td&gt;1 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x0E&lt;/td&gt;
&lt;td&gt;Header Type&lt;/td&gt;
&lt;td&gt;1 B ← bit 7 = MFD, bits[6:0] = type&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x0F&lt;/td&gt;
&lt;td&gt;BIST&lt;/td&gt;
&lt;td&gt;1 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x10&lt;/td&gt;
&lt;td&gt;BAR0&lt;/td&gt;
&lt;td&gt;4 B ┐&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x14&lt;/td&gt;
&lt;td&gt;BAR1&lt;/td&gt;
&lt;td&gt;4 B │ Type 0 header&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x18&lt;/td&gt;
&lt;td&gt;BAR2&lt;/td&gt;
&lt;td&gt;4 B │ (endpoint)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x1C&lt;/td&gt;
&lt;td&gt;BAR3&lt;/td&gt;
&lt;td&gt;4 B │&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x20&lt;/td&gt;
&lt;td&gt;BAR4&lt;/td&gt;
&lt;td&gt;4 B │&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x24&lt;/td&gt;
&lt;td&gt;BAR5&lt;/td&gt;
&lt;td&gt;4 B ┘&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x28&lt;/td&gt;
&lt;td&gt;Cardbus CIS Pointer&lt;/td&gt;
&lt;td&gt;4 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x2C&lt;/td&gt;
&lt;td&gt;Subsystem Vendor ID&lt;/td&gt;
&lt;td&gt;2 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x2E&lt;/td&gt;
&lt;td&gt;Subsystem ID&lt;/td&gt;
&lt;td&gt;2 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x30&lt;/td&gt;
&lt;td&gt;Expansion ROM BAR&lt;/td&gt;
&lt;td&gt;4 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x34&lt;/td&gt;
&lt;td&gt;Capabilities Pointer&lt;/td&gt;
&lt;td&gt;1 B ← points into capabilities list&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x3C&lt;/td&gt;
&lt;td&gt;Interrupt Line&lt;/td&gt;
&lt;td&gt;1 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x3D&lt;/td&gt;
&lt;td&gt;Interrupt Pin&lt;/td&gt;
&lt;td&gt;1 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x3E&lt;/td&gt;
&lt;td&gt;Min_Gnt&lt;/td&gt;
&lt;td&gt;1 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x3F&lt;/td&gt;
&lt;td&gt;Max_Lat&lt;/td&gt;
&lt;td&gt;1 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;PCI-compatible header ends at 0x3F&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x40&lt;/td&gt;
&lt;td&gt;Capability structures&lt;/td&gt;
&lt;td&gt;variable (linked list)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0xFF&lt;/td&gt;
&lt;td&gt;End of PCI-compatible space&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x100&lt;/td&gt;
&lt;td&gt;PCIe Extended Capabilities&lt;/td&gt;
&lt;td&gt;variable (linked list, 4 KB total)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0xFFF&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  Header Types:
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;HeaderType&lt;/th&gt;
&lt;th&gt;Device Type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0x00&lt;/td&gt;
&lt;td&gt;Endpoint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x01&lt;/td&gt;
&lt;td&gt;PCI-to-PCI Bridge&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
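&lt;p&gt;The Header Type byte at offset 0x0E packs two fields: bit 7 is the multi-function flag (MFD) and bits [6:0] select the header layout. A sketch of the decode (&lt;code&gt;decode_header_type&lt;/code&gt; is a hypothetical helper, not a real API):&lt;/p&gt;

```python
# Header Type (offset 0x0E): bit 7 = multi-function device (MFD),
# bits [6:0] = header layout (0 = endpoint, 1 = PCI-to-PCI bridge).
def decode_header_type(value):
    multifunction = (value // 128) % 2 == 1
    layout = value % 128
    return multifunction, layout

print(decode_header_type(0x80))  # (True, 0): multi-function endpoint
print(decode_header_type(0x01))  # (False, 1): PCI-to-PCI bridge
```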

&lt;p&gt;Bridges include bus routing registers assigned during enumeration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0x10  BAR0
0x14  BAR1

0x18  Primary Bus Number
0x19  Secondary Bus Number
0x1A  Subordinate Bus Number
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  BAR - Base Address Register
&lt;/h2&gt;

&lt;p&gt;A BAR (Base Address Register) allows a PCIe endpoint to expose memory or register regions to the host.&lt;/p&gt;

&lt;p&gt;Each endpoint can have up to 6 BARs. Each BAR defines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Type:&lt;/strong&gt; Memory (MMIO) or legacy I/O&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Size:&lt;/strong&gt; Power-of-two region size&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prefetchable:&lt;/strong&gt; Indicates reads have no side effects and can be cached/prefetched&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Address Width:&lt;/strong&gt; 32-bit or 64-bit&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  BAR sizing mechanism:
&lt;/h3&gt;

&lt;p&gt;During enumeration, the OS writes &lt;code&gt;0xFFFFFFFF&lt;/code&gt; to a BAR and reads it back. The device hard-wires the size-dependent low address bits to zero, so the value read back is a mask that encodes the region size.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Returned value: &lt;code&gt;0xFFFF0000&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Size: 64 KB&lt;/li&gt;
&lt;/ul&gt;
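&lt;p&gt;The size decode can be sketched in a few lines of Python (simplified: assumes a 32-bit memory BAR and clears the low flag bits [3:0] before the two's-complement step):&lt;/p&gt;

```python
# Decode a 32-bit memory BAR size from the value read back after
# writing all-ones. Low bits [3:0] are type/prefetch flags, not address.
def bar_size(readback):
    mask = readback - readback % 16  # clear flag bits [3:0]
    return 2**32 - mask              # two's-complement of the mask = size

print(bar_size(0xFFFF0000))  # 65536 bytes = 64 KB
```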

&lt;p&gt;After sizing, the OS assigns a physical address. The driver maps it into virtual space using functions like &lt;code&gt;pci_iomap()&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  MSI / MSI-X
&lt;/h2&gt;

&lt;p&gt;Traditional PCI used dedicated interrupt lines (physical wires). PCIe replaces these with message-based interrupts.&lt;/p&gt;

&lt;h3&gt;
  
  
  How MSI works:
&lt;/h3&gt;

&lt;p&gt;During setup, the device is programmed with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A target address that decodes to the interrupt controller rather than ordinary RAM&lt;/li&gt;
&lt;li&gt;A data value&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When the device generates an interrupt:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It sends a &lt;strong&gt;Memory Write TLP&lt;/strong&gt; to that address with the data&lt;/li&gt;
&lt;li&gt;The CPU’s interrupt controller interprets this as an interrupt&lt;/li&gt;
&lt;/ul&gt;
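&lt;p&gt;The mechanism can be modeled as nothing more than a posted memory write. A toy sketch; the address and data values below are made-up examples, not real interrupt-controller addresses:&lt;/p&gt;

```python
# Toy model of MSI: the device posts the programmed data value to the
# programmed address; hardware at that address raises a CPU interrupt.
MSI_ADDRESS = 0xFEE00000  # example doorbell address (illustrative)
MSI_DATA = 0x0042         # example vector/data value (illustrative)

def device_raise_msi(post_write):
    post_write(MSI_ADDRESS, MSI_DATA)

seen = []
device_raise_msi(lambda addr, data: seen.append((addr, data)))
print(seen == [(MSI_ADDRESS, MSI_DATA)])  # True
```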

&lt;h2&gt;
  
  
  DMA
&lt;/h2&gt;

&lt;p&gt;DMA allows devices to directly access system memory without CPU involvement.&lt;/p&gt;

&lt;h3&gt;
  
  
  Types of DMA:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Inbound DMA (Device → Host):&lt;/strong&gt;&lt;br&gt;
Device sends Memory Write TLPs to host memory&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Outbound DMA (Device ← Host):&lt;/strong&gt;&lt;br&gt;
Device sends Memory Read TLPs and receives Completion data&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Addressing:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;CPU uses virtual addresses (via MMU)&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Devices use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Physical addresses, or&lt;/li&gt;
&lt;li&gt;IOMMU-translated IO virtual addresses (IOVA)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

</description>
      <category>architecture</category>
      <category>computerscience</category>
      <category>systems</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Understanding Linux Boot Memory Management</title>
      <dc:creator>Ripan Deuri</dc:creator>
      <pubDate>Wed, 04 Mar 2026 20:07:47 +0000</pubDate>
      <link>https://dev.to/ripan030/understanding-linux-boot-memory-management-1dcn</link>
      <guid>https://dev.to/ripan030/understanding-linux-boot-memory-management-1dcn</guid>
      <description>&lt;p&gt;When the Linux kernel begins executing on ARM64 hardware, the CPU starts in a minimal environment. The Memory Management Unit (MMU) is disabled and the processor executes instructions using physical addresses directly.&lt;/p&gt;

&lt;p&gt;Before Linux can use its normal virtual address space, the kernel must construct the page tables required for address translation. This work happens very early in the boot process, in &lt;code&gt;head.S&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;During this phase the kernel performs three important tasks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Construct minimal page tables&lt;/li&gt;
&lt;li&gt;Create both identity and kernel virtual mappings&lt;/li&gt;
&lt;li&gt;Enable the MMU and switch execution to the kernel's high virtual address space&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This article explains how that process works, using concrete memory layouts and examples.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Boot Environment Assumptions
&lt;/h2&gt;

&lt;p&gt;To make the discussion concrete, assume the following system configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RAM start          : 0x80000000
Kernel load addr   : 0x80800000
Kernel size        : 30 MB
Kernel end         : 0x82600000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Other relevant parameters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Page size          : 4 KB
L2 block size      : 2 MB
Virtual address size : 48 bits
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The bootloader loads the kernel image into RAM at &lt;code&gt;0x80800000&lt;/code&gt; and then jumps to the kernel entry point.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Physical Layout of the Kernel Image
&lt;/h2&gt;

&lt;p&gt;After the bootloader loads the kernel, RAM contains the kernel image and its data sections.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Physical RAM
=======================================================

0x80000000  ───────────────────────────────────────────
             Start of RAM

0x80800000  ───────────────────────────────────────────
             Kernel _text

             Kernel code
             Kernel rodata
             Kernel data
             Kernel BSS

0x82500000  ───────────────────────────────────────────
             init_pg_dir region

0x82600000  ───────────────────────────────────────────
             End of kernel image

=======================================================
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The kernel image contains multiple sections including code, read-only data, writable data, and the BSS section. The BSS section stores zero-initialized global variables. The early page tables are allocated within the BSS region.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Early Page Tables Are Placed in BSS&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;Early boot code cannot use dynamic memory allocation because the memory subsystem is not yet initialized. As a result, the kernel must reserve memory for early structures at build time.&lt;/p&gt;

&lt;p&gt;The ARM64 kernel defines the root page table region, &lt;code&gt;init_pg_dir&lt;/code&gt;, which the linker script places in the BSS section.&lt;/p&gt;

&lt;p&gt;Because the memory already exists inside the kernel image, early boot code can simply reference it directly.&lt;/p&gt;

&lt;p&gt;During boot the physical address of &lt;code&gt;init_pg_dir&lt;/code&gt; becomes the base location where the kernel builds its early page tables.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Page Table Structure
&lt;/h2&gt;

&lt;p&gt;With 4 KB pages and a 48-bit virtual address space, ARM64 uses four levels of page tables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Virtual Address Bits [47:0]:
+-----+-----+-----+-----+------------+
| L0  | L1  | L2  | L3  |  Offset    |
|47:39|38:30|29:21|20:12|   11:0     |
+-----+-----+-----+-----+------------+
  9b    9b    9b    9b      12b
 (512) (512) (512) (512)   (4KB)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VA[47:39] → L0 index
VA[38:30] → L1 index
VA[29:21] → L2 index
VA[20:12] → L3 index
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each level contains 2^9 = 512 entries.&lt;/p&gt;
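&lt;p&gt;The index extraction is simple integer arithmetic. A sketch (using division instead of shifts for clarity):&lt;/p&gt;

```python
# Split a 48-bit VA into four 9-bit table indices plus the page offset.
def pt_indices(va):
    return dict(
        l0=(va // 2**39) % 512,
        l1=(va // 2**30) % 512,
        l2=(va // 2**21) % 512,
        l3=(va // 2**12) % 512,
        offset=va % 4096,
    )

print(pt_indices(0x80800000))
# {'l0': 0, 'l1': 2, 'l2': 4, 'l3': 0, 'offset': 0}
```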

&lt;p&gt;Block mappings can be created at intermediate levels:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;L1 block size = 1 GB
L2 block size = 2 MB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Early boot typically uses &lt;strong&gt;L2 block mappings&lt;/strong&gt; because they are simple and cover memory efficiently.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Early Page Table Memory Layout
&lt;/h2&gt;

&lt;p&gt;The early page tables are placed sequentially in the BSS region.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Physical RAM
=============================================================

0x82500000  ── L0 table (TTBR0 - identity root)

0x82501000  ── L1 table (identity)

0x82502000  ── L2 table (identity)


0x82503000  ── L0 table (TTBR1 - kernel high VA root)

0x82504000  ── L1 table (kernel VA)

0x82505000  ── L2 table (kernel VA)

=============================================================
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each table contains 512 entries of 8 bytes each, so each table occupies exactly one page (4 KB). Each entry holds either the physical address of a block or the physical address of the next-level table.&lt;/p&gt;

&lt;p&gt;Total memory required:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;6 tables × 4 KB = 24 KB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  5. Page Table Creation Steps (Simplified)
&lt;/h2&gt;

&lt;p&gt;The early boot code constructs the minimal set of page tables required before the MMU is enabled.&lt;/p&gt;

&lt;h3&gt;
  
  
  5.1 Clearing Page Table Memory
&lt;/h3&gt;

&lt;p&gt;The first step clears the memory used by &lt;code&gt;init_pg_dir&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This ensures all entries start as invalid descriptors.&lt;/p&gt;

&lt;p&gt;This is implemented using a loop that stores zero values across the reserved region.&lt;/p&gt;

&lt;h3&gt;
  
  
  5.2 Creating the Identity Mapping
&lt;/h3&gt;

&lt;p&gt;The kernel builds an identity mapping for the region containing the kernel image.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VA 0x80800000 → PA 0x80800000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Page table hierarchy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;L0 entry → L1 table
L1 entry → L2 table
L2 entries → 2 MB blocks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since the kernel size is 30 MB, the L2 table needs exactly 15 block entries (30 MB / 2 MB).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;15 blocks × 2 MB = 30 MB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This mapping allows the CPU to continue executing the kernel immediately after the MMU is enabled.&lt;/p&gt;
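&lt;p&gt;The covering L2 entries can be derived directly from the kernel's physical extent (matching the L2[4]..L2[18] entries shown in the layout below):&lt;/p&gt;

```python
# Which 2 MB L2 entries cover the kernel image?
BLOCK = 2 * 1024 * 1024

def l2_entries(start, end):
    first = (start // BLOCK) % 512   # index of the first covering block
    count = (end - start) // BLOCK   # number of 2 MB blocks
    return list(range(first, first + count))

entries = l2_entries(0x80800000, 0x82600000)
print(entries[0], entries[-1], len(entries))  # 4 18 15
```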

&lt;h3&gt;
  
  
  5.3 Creating the Kernel Virtual Mapping
&lt;/h3&gt;

&lt;p&gt;Linux does not run the kernel at low addresses. Instead, the kernel executes in the upper portion of the virtual address space.&lt;/p&gt;

&lt;p&gt;Assuming 48-bit address space:&lt;br&gt;
Kernel VA starts from &lt;code&gt;0xffff_0000_0000_0000 (= PAGE_OFFSET)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Example kernel VA for PA &lt;code&gt;0x8080_0000&lt;/code&gt;: &lt;code&gt;0xFFFF000080800000&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The page tables create a mapping: &lt;code&gt;VA 0xFFFF000080800000 → PA 0x80800000&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This allows the same physical memory to appear at a high virtual address.&lt;/p&gt;
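&lt;p&gt;The address arithmetic is a fixed offset (a sketch; the real kernel wraps this in macros such as &lt;code&gt;__va()&lt;/code&gt;):&lt;/p&gt;

```python
# Kernel high mapping as a constant offset from physical addresses,
# using the 48-bit PAGE_OFFSET value from the article.
PAGE_OFFSET = 0xFFFF000000000000

def phys_to_virt(pa):
    return pa + PAGE_OFFSET

print(hex(phys_to_virt(0x80800000)))  # 0xffff000080800000
```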

&lt;p&gt;So the layout becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Physical RAM
===========================================================

0x80000000  ───────────────────────────────────────────────
             Start of RAM

0x80800000  ───────────────────────────────────────────────
             Kernel _text (bootloader loaded image)

             Kernel code
             Kernel rodata
             Kernel data
             Kernel BSS

0x82500000  ───────────────────────────────────────────────
             init_pg_dir region (inside BSS)

             Early Page Tables
             ──────────────────────────────────────────────

0x82500000  ── L0 table (TTBR0)  → Identity map root
                entry[0] → 0x82501000

0x82501000  ── L1 table (identity)
                entry[2] → 0x82502000

0x82502000  ── L2 table (identity)
                entry[4..18] → 2MB block

                Example:
                L2[4]  → PA 0x80800000
                L2[5]  → PA 0x80A00000
                ...
                L2[18] → PA 0x82400000

0x82503000  ── L0 table (TTBR1) → Kernel virtual root
                entry[511] → 0x82504000

0x82504000  ── L1 table (kernel VA)
                entry[...] → 0x82505000

0x82505000  ── L2 table (kernel VA)
             ──────────────────────────────────────────────
0x825FFFFF  ───────────────────────────────────────────────
             End of kernel image

0x82600000  ───────────────────────────────────────────────
             First free RAM after kernel
===========================================================
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  6. Why Dual Mapping Is Required
&lt;/h2&gt;

&lt;p&gt;At the moment the MMU is enabled, the CPU is already executing instructions from the kernel.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PC = 0x80800100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before enabling the MMU, this address is interpreted as a physical address.&lt;/p&gt;

&lt;p&gt;After enabling the MMU, the CPU interprets the program counter as a virtual address.&lt;/p&gt;

&lt;p&gt;If the page tables contain an identity mapping:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VA 0x80800100 → PA 0x80800100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;the instruction fetch continues successfully.&lt;/p&gt;

&lt;p&gt;Afterward, the kernel performs a branch to its intended virtual address:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0xFFFF000080800000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From that point onward, the kernel runs entirely in high virtual memory.&lt;/p&gt;

&lt;p&gt;If the identity mapping did not exist, enabling the MMU would immediately cause a translation fault.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PC = 0x80800100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After enabling the MMU the CPU attempts to translate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VA 0x80800100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If no mapping exists for that address, the CPU raises an instruction abort. So identity mapping is required during boot.&lt;/p&gt;

&lt;p&gt;Once the page tables are created, the kernel configures the translation system registers.&lt;/p&gt;

&lt;p&gt;The kernel installs the page table base addresses in the following registers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TTBR0_EL1  → identity mapping tables
TTBR1_EL1  → kernel virtual mapping tables
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, the MMU is enabled. At this moment the CPU switches from physical addressing to virtual addressing.&lt;/p&gt;

&lt;p&gt;After enabling the MMU, the kernel performs a branch to its virtual address:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0xFFFF000080800000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The page tables translate this address to the physical location of the kernel in RAM.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VA 0xFFFF000080800000 → PA 0x80800000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From this point forward, the kernel executes entirely in its high virtual address space.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Early in the boot process, the Linux kernel must construct its own memory translation environment before the MMU can be enabled. The code in &lt;code&gt;head.S&lt;/code&gt; performs this task by building minimal page tables inside statically allocated memory.&lt;/p&gt;

&lt;p&gt;Two mappings are created during this phase. An identity mapping ensures that execution continues safely when the MMU is first enabled, while a kernel virtual mapping allows the kernel to run in its intended high address space.&lt;/p&gt;

&lt;p&gt;After the MMU is enabled and execution switches to the high virtual address, the kernel continues building the full virtual memory system used during normal operation. These early page tables therefore serve as the foundation for the entire memory management subsystem.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>computerscience</category>
      <category>linux</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Understanding Cache Coherency</title>
      <dc:creator>Ripan Deuri</dc:creator>
      <pubDate>Wed, 25 Feb 2026 17:25:48 +0000</pubDate>
      <link>https://dev.to/ripan030/understanding-cache-coherency-4lj8</link>
      <guid>https://dev.to/ripan030/understanding-cache-coherency-4lj8</guid>
      <description>&lt;p&gt;Modern high-performance devices communicate with the CPU through shared memory structures such as &lt;a href="https://dev.to/ripan030/how-hardware-and-software-share-a-queue-understanding-dma-rings-pea"&gt;DMA Rings&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When one side updates memory, the other side must see the latest value.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;On cache-coherent systems this happens automatically. On many ARM platforms it does not.&lt;/p&gt;

&lt;p&gt;This post explains what breaks, why it breaks, and how the Linux DMA API solves it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why DMA Fails on Non-Coherent Systems
&lt;/h2&gt;

&lt;p&gt;Consider the completion flow from the earlier ring design in &lt;a href="https://dev.to/ripan030/how-hardware-and-software-share-a-queue-understanding-dma-rings-pea"&gt;How Hardware and Software Share a Queue: Understanding DMA Rings&lt;/a&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Device DMA-writes a completion entry&lt;/li&gt;
&lt;li&gt;Device updates &lt;code&gt;WR_IDX&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;CPU reads &lt;code&gt;WR_IDX&lt;/code&gt; and processes new entries&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;On a non-coherent system the driver may:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;read an old &lt;code&gt;WR_IDX&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;read a partially updated descriptor&lt;/li&gt;
&lt;li&gt;never observe new completions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This happens because the CPU and the DMA engine do not observe memory through the same path.&lt;/p&gt;

&lt;h2&gt;
  
  
  System Hardware View
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                +----------------------+
                |        CPU           |
                |   Driver (load/store)|
                +----------+-----------+
                           |
                      +----v----+
                      |  Cache  |  (L1/L2)
                      +----+----+
                           |
                           |
                    +------v------+
                    |     DDR     |  (System RAM)
                    +------+------+
                           ^
                           | PCIe link
                    +------v------+
                    | PCIe Device |
                    |  DMA Engine |
                    +-------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key observation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU accesses DDR through &lt;strong&gt;cache&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;DMA accesses DDR &lt;strong&gt;directly&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Cache and DDR can hold different data at the same time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the source of incoherency.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Cache Coherency
&lt;/h2&gt;

&lt;p&gt;Physical memory (DDR) is the shared storage.&lt;/p&gt;

&lt;p&gt;The CPU does not read DDR on every load. It reads cached copies stored in cache lines.&lt;/p&gt;

&lt;p&gt;Two operations are required to keep both sides consistent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Flush&lt;/strong&gt; – push updated cache lines to DDR&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Invalidate&lt;/strong&gt; – discard cached copies so the next read comes from DDR&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without these operations, both sides operate on different versions of the same memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  DMA Memory in System DDR
&lt;/h2&gt;

&lt;p&gt;The ring allocated in &lt;a href="https://dev.to/ripan030/how-hardware-and-software-share-a-queue-understanding-dma-rings-pea"&gt;How Hardware and Software Share a Queue: Understanding DMA Rings&lt;/a&gt; resides in &lt;strong&gt;system DDR&lt;/strong&gt;. It is normal RAM shared between CPU and device.&lt;/p&gt;

&lt;p&gt;Coherency is achieved by changing &lt;strong&gt;how the CPU maps that memory&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The same physical DDR page can be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;mapped as &lt;strong&gt;cacheable&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;mapped as &lt;strong&gt;non-cacheable&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is controlled by page table attributes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Memory Types From the CPU Perspective
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Cacheable Memory&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Default for &lt;code&gt;kzalloc&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Fast for CPU&lt;/li&gt;
&lt;li&gt;Not automatically DMA-safe on non-coherent systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Non-cacheable Memory&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU always accesses DDR directly&lt;/li&gt;
&lt;li&gt;No stale cache lines&lt;/li&gt;
&lt;li&gt;Safe for shared control structures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On many ARM systems, coherent DMA memory is implemented using a non-cacheable CPU mapping.&lt;/p&gt;

&lt;h2&gt;
  
  
  Linux Kernel DMA APIs
&lt;/h2&gt;

&lt;p&gt;The Linux kernel provides two usage patterns:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Coherent DMA&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU and device always observe the same data&lt;/li&gt;
&lt;li&gt;No explicit cache maintenance in the driver&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Streaming DMA&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory is cacheable&lt;/li&gt;
&lt;li&gt;Driver must perform cache sync at specific points&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;dma_alloc_coherent()&lt;/code&gt;&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;allocates memory from system RAM (often via CMA or page allocator)&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;returns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU virtual address&lt;/li&gt;
&lt;li&gt;DMA address for the device&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;On non-coherent ARM systems it typically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;maps the region as &lt;strong&gt;non-cacheable for the CPU&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU accesses go directly to DDR&lt;/li&gt;
&lt;li&gt;DMA accesses go to the same DDR&lt;/li&gt;
&lt;li&gt;both sides see identical data without cache operations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why it is ideal for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;descriptor rings&lt;/li&gt;
&lt;li&gt;doorbells&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;kzalloc()&lt;/code&gt; + DMA (Streaming DMA)&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kzalloc&lt;/code&gt; returns &lt;strong&gt;cacheable normal memory&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For DMA usage the driver must:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Map it for DMA: &lt;code&gt;dma_map_single()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Before the device reads the buffer: &lt;code&gt;dma_sync_single_for_device()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;After the device writes the buffer and before the CPU reads it: &lt;code&gt;dma_sync_single_for_cpu()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;When finished: &lt;code&gt;dma_unmap_single()&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Ring Buffer
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Ring allocated with &lt;code&gt;dma_alloc_coherent&lt;/code&gt;&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ring lives in DDR&lt;/li&gt;
&lt;li&gt;CPU mapping is non-cacheable&lt;/li&gt;
&lt;li&gt;Device DMA writes directly to DDR&lt;/li&gt;
&lt;li&gt;Driver reads fresh data&lt;/li&gt;
&lt;li&gt;No cache maintenance required&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Ring allocated with &lt;code&gt;kzalloc&lt;/code&gt;&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;After the interrupt fires and before reading completions, the driver must invalidate the cached lines with &lt;code&gt;dma_sync_single_for_cpu()&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance and Design Trade-offs
&lt;/h2&gt;

&lt;p&gt;Coherent memory:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;simpler&lt;/li&gt;
&lt;li&gt;safe for shared control data&lt;/li&gt;
&lt;li&gt;slower for large CPU accesses (no caching)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Streaming DMA:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fast for bulk data&lt;/li&gt;
&lt;li&gt;requires correct sync points&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Typical design:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;rings → coherent memory&lt;/li&gt;
&lt;li&gt;data buffers → streaming DMA&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;On non-coherent systems, the CPU cache and the DMA engine observe DDR through different paths. The Linux DMA API bridges this gap by either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;creating a coherent mapping, or&lt;/li&gt;
&lt;li&gt;providing explicit cache synchronization primitives.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>architecture</category>
      <category>computerscience</category>
      <category>linux</category>
      <category>performance</category>
    </item>
    <item>
      <title>Understanding DMA Rings</title>
      <dc:creator>Ripan Deuri</dc:creator>
      <pubDate>Sat, 21 Feb 2026 10:16:59 +0000</pubDate>
      <link>https://dev.to/ripan030/how-hardware-and-software-share-a-queue-understanding-dma-rings-pea</link>
      <guid>https://dev.to/ripan030/how-hardware-and-software-share-a-queue-understanding-dma-rings-pea</guid>
      <description>&lt;p&gt;Modern high-performance systems rely on a shared memory queue for communication between hardware and software, where the device writes data using DMA and indicates new work by updating an index. This mechanism is widely used in network controllers, NVMe storage, GPUs, and asynchronous I/O frameworks because it eliminates lock contention, reduces register access, and allows both sides to operate independently at high throughput.&lt;/p&gt;

&lt;p&gt;Understanding this structure requires looking beyond the idea of a circular buffer and focusing on ownership transfer, memory ordering, and cache visibility. These are the concepts that determine correctness and performance in real driver implementations.&lt;/p&gt;

&lt;p&gt;This post explains how a lock-free queue is shared between hardware and software and breaks down the synchronization model that makes it work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Mechanism Exists
&lt;/h2&gt;

&lt;p&gt;At high data rates, traditional communication methods between software and hardware become too expensive:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reading device registers frequently causes latency.&lt;/li&gt;
&lt;li&gt;Locking shared structures limits parallelism.&lt;/li&gt;
&lt;li&gt;Interrupt-per-event models do not scale.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead, modern devices and drivers communicate through &lt;strong&gt;shared memory queues&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The key idea is simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The device publishes completed work into memory using DMA, and software consumes it later.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This removes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;register polling from the fast path&lt;/li&gt;
&lt;li&gt;lock contention&lt;/li&gt;
&lt;li&gt;synchronous handshakes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and replaces them with &lt;strong&gt;ownership transfer over a circular buffer&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Shared Memory Layout and Ownership Model
&lt;/h2&gt;

&lt;p&gt;The circular queue lives in system DDR memory and is accessible to both the CPU and the device.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+--------------------------------------------------------------------+
|                            HOST SYSTEM                             |
|  +------------------+                                              |
|  |       CPU        |                                              |
|  |    +--------+    |                                              |
|  |    | Driver |------------------------------------------------+  |
|  |    +--------+    |                                           |  |
|  +---------^--------+                                           |  |
|            |                                                    |  |
|  +---------v--------+                                           |  |
|  |      Cache       |                                           |  |
|  |     L1 / L2      |                                           |  |
|  +---------^--------+                                           |  |
|            | Cache lines                                        |  |
|  +---------v-------------------------------------------------+  |  |
|  |                 SYSTEM DDR (Non-Coherent)                 |  |  |
|  |   +------------------------+    +----------------+        |  |  |
|  |   | Desc 0 | Desc 1 | ...  |    |     WR_IDX     |        |  |  |
|  |   +------------------^-----+    +-----^----------+        |  |  |
|  |   RING DESCRIPTORS   |                |  WR_ADDR (SHADOW) |  |  |
|  |                      +--------+-------+                   |  |  |
|  +-------------------------------|---------------------------+  |  |
|                        DMA write |         MMIO write (RD_IDX)  |  |
|                (Metadata, WR_IDX)|         +--------------------+  |
|                                  |         |                       |
|                     +----------------------v----+                  |
|                     |       Root Complex        |                  |
|                     +------------^--------------+                  |
+----------------------------------|---------------------------------+
                                   |                              
                                   | PCIe Link                    
                                   |
+----------------------------------v---------------------------------+
|                             PCIe DEVICE                            |
|  MMIO RING REGS                                                    |
|  +--------------+          +----------------------------+          |
|  | BASE_ADDR    |          |         DMA ENGINE         |          |
|  +--------------+          +----------------------------+          |
|  | ...          |                                                  |
|  +--------------+          +----------------------------+          |
|  | WR_ADDR      |          |           MSI-X            |          |
|  +--------------+          +----------------------------+          |
|  | RD_IDX       |                                                  |
|  +--------------+                                                  |
+--------------------------------------------------------------------+

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Device → advances WR_IDX&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Driver → advances RD_IDX&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At any moment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      RD_IDX
        │
        +----+----+----+----+----+----+----+----+
        | S  | S  | S  | D  | D  | D  | D  | D  |
        +----+----+----+----+----+----+----+----+
                       |
                     WR_IDX

Driver owns   : [RD_IDX … WR_IDX)
Device owns   : [WR_IDX … RD_IDX)

[D]  → Device-owned slot (empty, can be filled by HW)
[S]  → Driver-owned slot (ready to process by SW)

The ring grows clockwise ➜
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How Ownership Moves Around the Ring
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Init - no valid entries&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        RD_IDX
        │
        +----+----+----+----+----+----+----+----+
        | D  | D  | D  | D  | D  | D  | D  | D  |
        +----+----+----+----+----+----+----+----+
        |
        WR_IDX
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Device fills new entries&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                            WR_IDX
                            │
        +----+----+----+----+----+----+----+----+
        | S  | S  | S  | S  | D  | D  | D  | D  |
        +----+----+----+----+----+----+----+----+
        │
        RD_IDX
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Driver processes new entries&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                            WR_IDX
                            │
        +----+----+----+----+----+----+----+----+
        | D  | D  | S  | S  | D  | D  | D  | D  |
        +----+----+----+----+----+----+----+----+
                  │
                  RD_IDX
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Wrap-around - indices wrap modulo ring size&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;              WR_IDX
             │
        +----+----+----+----+----+----+----+----+
        | S  | D  | S  | S  | S  | S  | S  | S  |
        +----+----+----+----+----+----+----+----+
                  │
                  RD_IDX
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Full ring - device must stop&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If WR_IDX catches RD_IDX:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                  WR_IDX
                  │
        +----+----+----+----+----+----+----+----+
        | S  | S  | S  | S  | S  | S  | S  | S  |
        +----+----+----+----+----+----+----+----+
                  │
                  RD_IDX
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are no device-owned slots. Device cannot write.&lt;/p&gt;

&lt;p&gt;This is not an error - it is &lt;strong&gt;backpressure&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Empty ring - driver has nothing to do&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If RD_IDX catches WR_IDX:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                  WR_IDX
                  │
        +----+----+----+----+----+----+----+----+
        | D  | D  | D  | D  | D  | D  | D  | D  |
        +----+----+----+----+----+----+----+----+
                  │
                  RD_IDX
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No software-owned entries. Driver stops processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lifecycle of a Completion: From Device to Driver
&lt;/h2&gt;

&lt;p&gt;This sequence describes how a real device reports finished work to software through the shared ring.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;[1] Initialization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;During setup, the driver:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Allocates the ring in system memory.&lt;/li&gt;
&lt;li&gt;Programs the device with:

&lt;ul&gt;
&lt;li&gt;the ring base address&lt;/li&gt;
&lt;li&gt;the ring size&lt;/li&gt;
&lt;li&gt;the address where WR_IDX will be written (shadow in host memory).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Initializes RD_IDX to zero.&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;At this point:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The queue contains no valid entry.&lt;/li&gt;
&lt;li&gt;The entire ring is owned by the device.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;[2] Device finishes processing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The device already knows where the result data should be placed.&lt;br&gt;
This typically comes from a separate provisioning mechanism (another queue or pre-registered buffers).&lt;/p&gt;

&lt;p&gt;It DMA-writes the result into system memory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;[3] Device writes a completion entry&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The device selects the slot at its current WR_IDX and DMA-writes a completion record.&lt;/p&gt;

&lt;p&gt;This record may contain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an identifier for the buffer or request&lt;/li&gt;
&lt;li&gt;the length of valid data&lt;/li&gt;
&lt;li&gt;status or error information&lt;/li&gt;
&lt;li&gt;device-generated metadata&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this stage the entry exists in memory, but software does not yet know that it is valid.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;[4] Device publishes WR_IDX&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After the completion entry is fully written, the device updates WR_IDX in host memory.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The index update is the visibility point for software.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;[5] Interrupt&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The device may generate an interrupt to notify the CPU. Refer to &lt;a href="https://dev.to/ripan030/linux-kernel-interrupt-1e7n"&gt;How an Interrupt Reaches the CPU&lt;/a&gt; to understand how the interrupt is delivered.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;[6] Software consumption&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When software runs (either due to an interrupt or polling):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reads WR_IDX to determine how far the device has progressed.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;processes entries in the range: [RD_IDX … WR_IDX). For each entry:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;interpret the completion record&lt;/li&gt;
&lt;li&gt;recycle the associated resources&lt;/li&gt;
&lt;li&gt;advance RD_IDX&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;[7] Returning ownership to the device&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After consuming entries, software writes the updated RD_IDX to the device via &lt;a href="https://dev.to/ripan030/memory-mapped-io-mmio-5bn8"&gt;MMIO&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This tells the device:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;These slots are free again.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Cache Coherency and DMA Visibility
&lt;/h2&gt;

&lt;p&gt;On cache-coherent systems, CPU and device observe the same memory contents automatically.&lt;/p&gt;

&lt;p&gt;On non-coherent systems, DMA updates system memory but the CPU may still read stale data from its cache.&lt;/p&gt;

&lt;p&gt;Before reading new completions, the driver must invalidate the cache lines that cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the completion entries&lt;/li&gt;
&lt;li&gt;WR_IDX&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Otherwise, software may see an old index or partially updated entries even though the device has already written the new data to memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Memory Ordering
&lt;/h2&gt;

&lt;p&gt;The queue works because both sides publish progress in a strictly defined order. Without this ordering, an index can become visible before the data it refers to.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Device side&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The device must ensure:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;completion entry write → WR_IDX update&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This guarantees that when software observes the new WR_IDX, the corresponding completion entry is already fully written in memory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Software side&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Software must:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;read WR_IDX → then read the completion entries&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This prevents the CPU from speculatively reading ring contents before it knows how far the device has progressed.&lt;/p&gt;

&lt;p&gt;These rules are enforced with &lt;strong&gt;memory barriers&lt;/strong&gt; in the driver and with ordering guarantees in the device.&lt;/p&gt;

&lt;h2&gt;
  
  
  Timeline View
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuyqz1qdcyd9rouif2brq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuyqz1qdcyd9rouif2brq.png" alt="Timeline between device and driver" width="519" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;A shared ring is a contract where hardware and software exchange ownership through ordered index updates. Completed work becomes visible when WR_IDX is updated, and buffer space is returned to the device when RD_IDX advances. This memory-based publication model removes locks, reduces MMIO traffic, and enables scalable, high-throughput operation.&lt;/p&gt;

</description>
      <category>algorithms</category>
      <category>architecture</category>
      <category>computerscience</category>
      <category>performance</category>
    </item>
    <item>
      <title>From Reset to Control: Disabling Interrupts on ARM Bare Metal</title>
      <dc:creator>Ripan Deuri</dc:creator>
      <pubDate>Fri, 02 Jan 2026 06:59:03 +0000</pubDate>
      <link>https://dev.to/ripan030/bare-metal-arm-bootstrapping-disabling-interrupts-and-reading-cpsr-2ojd</link>
      <guid>https://dev.to/ripan030/bare-metal-arm-bootstrapping-disabling-interrupts-and-reading-cpsr-2ojd</guid>
      <description>&lt;p&gt;Bare-metal execution on ARMv7 begins at the reset vector, long before any C environment exists. When a Cortex-A9 leaves reset under QEMU’s &lt;code&gt;vexpress-a9&lt;/code&gt; model, the processor enters Supervisor mode with interrupts masked and the MMU disabled. The stack pointer is undefined, no memory sections are initialized, and no handlers are installed. Execution begins only with a defined program counter and CPSR value.&lt;/p&gt;

&lt;p&gt;This post examines that earliest stage of execution and shows how a minimal block of startup assembly takes control after reset: explicitly masking interrupts, verifying the processor’s mode, and halting in a known state. This establishes a predictable baseline before introducing stacks, memory initialization, and eventually a C runtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  From Reset Vector to &lt;code&gt;_start&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;At reset, ARMv7 defines the following initial conditions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mode:&lt;/strong&gt; Supervisor (&lt;code&gt;0b10011&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IRQ mask (CPSR.I):&lt;/strong&gt; 1 (disabled)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FIQ mask (CPSR.F):&lt;/strong&gt; 1 (disabled)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instruction set:&lt;/strong&gt; ARM state&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MMU:&lt;/strong&gt; Disabled&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Although interrupts are architecturally masked at reset, early startup code should not rely on this implicit state. Explicitly disabling interrupts ensures a deterministic environment before installing a vector table or exception handlers.&lt;/p&gt;

&lt;p&gt;The previous post &lt;a href="https://dev.to/ripan030/reset-on-armv7-42p7?preview=faa923959e61c5f6237c34a0aff32b99be91f7f7de66dc73eef40ecb1add9326590b15aee947a94e9e7c942f17734e55b219de281290336a1fc991a2"&gt;Bare Metal ARM Boot: Understanding the Reset Vector and First Instructions&lt;/a&gt; established a minimal vector table and reset vector:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vector table placed at address 0x0 in flash&lt;/li&gt;
&lt;li&gt;Reset vector branches from &lt;code&gt;_vectors&lt;/code&gt; to &lt;code&gt;_start&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;_start&lt;/code&gt; contained an infinite loop&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This post builds on that example by giving &lt;code&gt;_start&lt;/code&gt; its first real responsibility: enforcing interrupt masking before halting.&lt;/p&gt;

&lt;h2&gt;
  
  
  CPSR Overview
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Current Program Status Register (CPSR)&lt;/strong&gt; controls key aspects of execution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bits [31:28]&lt;/strong&gt; — Condition flags (N, Z, C, V)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bit 7&lt;/strong&gt; — IRQ disable (I)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bit 6&lt;/strong&gt; — FIQ disable (F)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bits [4:0]&lt;/strong&gt; — Mode bits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Common mode encodings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;0x10&lt;/code&gt; — User&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;0x11&lt;/code&gt; — FIQ&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;0x12&lt;/code&gt; — IRQ&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;0x13&lt;/code&gt; — Supervisor&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;0x17&lt;/code&gt; — Abort&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;0x1B&lt;/code&gt; — Undefined&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;0x1F&lt;/code&gt; — System&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Disabling Interrupts Explicitly
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;&lt;code&gt;cpsid&lt;/code&gt;&lt;/strong&gt; instruction (Change Processor State, Interrupt Disable) provides direct control over interrupt masking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;cpsid i&lt;/code&gt; — Disable IRQ&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cpsid f&lt;/code&gt; — Disable FIQ&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cpsid if&lt;/code&gt; — Disable both&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;cpsid&lt;/code&gt; is a privileged instruction, so it can execute only in modes such as Supervisor. Issuing &lt;code&gt;cpsid if&lt;/code&gt; at startup prevents accidental exception entry until valid handlers are in place.&lt;/p&gt;

&lt;h2&gt;
  
  
  Minimal Startup Assembly
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;startup.s&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.section .vectors, "ax"
.global _vectors
_vectors:
    b _start            @ Reset vector: branch to startup

.section .text
.global _start
_start:
    @ Disable IRQ and FIQ interrupts
    cpsid   if

    @ Infinite loop
halt:
    b   halt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Verifying Behavior with GDB
&lt;/h2&gt;

&lt;p&gt;A breakpoint confirms that execution flows from the reset vector into &lt;code&gt;_start&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="nb"&gt;break &lt;/span&gt;_start
Breakpoint 1 at 0x4: file startup.s, line 9.
&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="k"&gt;continue
&lt;/span&gt;Continuing.

Breakpoint 1, _start &lt;span class="o"&gt;()&lt;/span&gt; at startup.s:9
9       cpsid   &lt;span class="k"&gt;if&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Examining the CPSR before and after executing &lt;code&gt;cpsid if&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; info registers cpsr
cpsr           0x400001d3          1073742291
&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; stepi
10      b &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; info registers cpsr
cpsr           0x400001d3          1073742291
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CPSR value remains unchanged because interrupts were already masked at reset.&lt;/p&gt;

&lt;p&gt;Binary view:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; print/t &lt;span class="nv"&gt;$cpsr&lt;/span&gt;
&lt;span class="nv"&gt;$1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 1000000000000000000000111010011
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Interpretation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Bits[4:0] = &lt;code&gt;10011&lt;/code&gt; → Supervisor mode&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bit 6 = 1 → FIQ disabled&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bit 7 = 1 → IRQ disabled&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This confirms that startup code is executing in a privileged mode with both interrupt sources masked.&lt;/p&gt;
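&lt;p&gt;The bit-field interpretation above can be sketched as a small C helper (a hypothetical decoder for illustration, not part of the startup code):&lt;/p&gt;

```c
#include <stdint.h>

/* Decode the ARMv7 CPSR fields discussed above:
   bits[4:0] = processor mode, bit 6 = F (FIQ mask), bit 7 = I (IRQ mask). */
static uint32_t cpsr_mode(uint32_t cpsr)       { return cpsr & 0x1f; }
static int      cpsr_fiq_masked(uint32_t cpsr) { return (cpsr >> 6) & 1; }
static int      cpsr_irq_masked(uint32_t cpsr) { return (cpsr >> 7) & 1; }
```

&lt;p&gt;For the observed value &lt;code&gt;0x400001D3&lt;/code&gt;, the mode field is &lt;code&gt;0x13&lt;/code&gt; (Supervisor) and both mask bits read 1.&lt;/p&gt;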

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The processor now transitions cleanly from reset into startup assembly that establishes a controlled execution state. Interrupt masking is explicitly enforced, and the processor mode is verified. With this foundation in place, the subsequent steps (stack setup, &lt;code&gt;.data&lt;/code&gt; and &lt;code&gt;.bss&lt;/code&gt; initialization, and entry into &lt;code&gt;main()&lt;/code&gt;) can be introduced safely.&lt;/p&gt;

&lt;p&gt;The next post builds on this minimal bootstrap to construct a usable C runtime for bare-metal ARM systems.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>computerscience</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Linker Scripts Explained: Controlling Memory Layout on Bare Metal</title>
      <dc:creator>Ripan Deuri</dc:creator>
      <pubDate>Sat, 13 Dec 2025 08:23:12 +0000</pubDate>
      <link>https://dev.to/ripan030/linker-scripts-explained-controlling-memory-layout-on-bare-metal-3ocb</link>
      <guid>https://dev.to/ripan030/linker-scripts-explained-controlling-memory-layout-on-bare-metal-3ocb</guid>
      <description>&lt;ul&gt;
&lt;li&gt;
The Anatomy of a Minimal Linker Script

&lt;ul&gt;
&lt;li&gt;ENTRY Directive&lt;/li&gt;
&lt;li&gt;MEMORY Block&lt;/li&gt;
&lt;li&gt;SECTIONS Block&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;The Location Counter: &lt;code&gt;.&lt;/code&gt;
&lt;/li&gt;

&lt;li&gt;VMA vs LMA: Virtual and Load Memory Addresses&lt;/li&gt;

&lt;li&gt;Linker-Defined Symbols&lt;/li&gt;

&lt;li&gt;The Map File&lt;/li&gt;

&lt;li&gt;Complete Minimal Example&lt;/li&gt;

&lt;li&gt;

Verification: What the Linker Produced

&lt;ul&gt;
&lt;li&gt;Step 1: Examine Disassembly&lt;/li&gt;
&lt;li&gt;Step 2: Examine Section Headers&lt;/li&gt;
&lt;li&gt;Step 3: Examine the Map File&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Verification: Loading in QEMU and Inspecting with GDB&lt;/li&gt;

&lt;li&gt;Alignment Directives&lt;/li&gt;

&lt;li&gt;Conclusion&lt;/li&gt;

&lt;/ul&gt;




&lt;p&gt;Compilers generate relocatable object code—machine instructions and data whose addresses are not yet fixed. On hosted systems, a loader chooses the runtime addresses of each segment. Bare-metal systems have no loader: the ELF file’s section addresses become the addresses used directly by the CPU.&lt;/p&gt;

&lt;p&gt;A linker script provides this mapping. It determines:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;What memory regions are available?&lt;/strong&gt; For example, FLASH at 0x00000000 (64M), RAM at 0x60000000 (128M), etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Where should each section go?&lt;/strong&gt; &lt;code&gt;.text&lt;/code&gt; to flash, &lt;code&gt;.bss&lt;/code&gt; to RAM, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exactly where each section begins&lt;/strong&gt;, with alignment enforced according to architecture requirements.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Without an explicit script, the linker defaults to placing sections at low addresses such as &lt;strong&gt;0x00000000&lt;/strong&gt;, which rarely matches real hardware memory layouts. Bare-metal firmware requires deterministic placement, so linker scripts are a fundamental tool.&lt;/p&gt;

&lt;p&gt;In this post, the linker script targets &lt;strong&gt;QEMU’s vexpress-a9&lt;/strong&gt;, where RAM begins at &lt;strong&gt;0x60000000&lt;/strong&gt;. QEMU’s &lt;code&gt;-kernel&lt;/code&gt; argument loads ELF segments to their VMA addresses, so linking into RAM is sufficient for initial bring-up.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Anatomy of a Minimal Linker Script
&lt;/h2&gt;

&lt;p&gt;A linker script has two primary blocks: &lt;code&gt;MEMORY&lt;/code&gt; and &lt;code&gt;SECTIONS&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ENTRY(_start)

MEMORY
{
    RAM (rwx) : ORIGIN = 0x60000000, LENGTH = 128M
}

SECTIONS
{
    . = 0x60000000;
    .text : {
        *(.text)
    } &amp;gt; RAM
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each component has a specific purpose.&lt;/p&gt;


&lt;h3&gt;
  
  
  ENTRY Directive
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;ENTRY(_start)&lt;/code&gt; specifies the program's entry point symbol. When the ELF file is loaded, this symbol's address becomes the starting point that debuggers and loaders recognize. In bare metal, &lt;code&gt;_start&lt;/code&gt; should match the first instruction executed after the reset vector transfers control.&lt;/p&gt;

&lt;p&gt;Note that QEMU does not read an architectural reset vector for vexpress-a9 when using &lt;code&gt;-kernel&lt;/code&gt;. Instead, it sets the CPU’s PC to the ELF entry point address.&lt;/p&gt;


&lt;h3&gt;
  
  
  MEMORY Block
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;MEMORY&lt;/code&gt; block declares available address regions and their properties. Each region has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Name:&lt;/strong&gt; &lt;code&gt;RAM&lt;/code&gt;, &lt;code&gt;FLASH&lt;/code&gt; - labels used in &lt;code&gt;SECTIONS&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Attributes:&lt;/strong&gt; &lt;code&gt;r&lt;/code&gt; (read), &lt;code&gt;w&lt;/code&gt; (write), &lt;code&gt;x&lt;/code&gt; (execute)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ORIGIN:&lt;/strong&gt; the starting address of the region&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LENGTH:&lt;/strong&gt; the size in bytes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MEMORY
{
    FLASH (rx)  : ORIGIN = 0x00000000, LENGTH = 64M
    RAM (rwx)   : ORIGIN = 0x60000000, LENGTH = 128M
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Attributes are advisory. They let the linker validate that sections are placed in regions matching their needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;FLASH (rx)&lt;/code&gt;: read and execute only; marking it writable would be nonsensical&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;RAM (rwx)&lt;/code&gt;: fully flexible for code, initialized data, uninitialized data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The linker issues warnings if a section with write permission is assigned to a read-only region, helping catch configuration mistakes.&lt;/p&gt;


&lt;h3&gt;
  
  
  SECTIONS Block
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;SECTIONS&lt;/code&gt; block specifies the output memory layout. Each entry maps input sections (from object files) to output sections (in the final binary) and assigns a memory location.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SECTIONS
{
    .text : {
        *(.text)
    } &amp;gt; FLASH

    .rodata : {
        *(.rodata*)
    } &amp;gt; FLASH
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Breaking this down:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;.text : { ... }&lt;/code&gt; defines an output section named &lt;code&gt;.text&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;*(.text)&lt;/code&gt; means "include all &lt;code&gt;.text&lt;/code&gt; input sections from all object files"&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;&amp;gt; FLASH&lt;/code&gt; assigns this output section to the FLASH memory region&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;*&lt;/code&gt; is a wildcard matching all input files. More specific patterns are possible (e.g., &lt;code&gt;startup.o(.text)&lt;/code&gt; to include only startup's .text section), but wildcards are typical for bare-metal work.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Location Counter: &lt;code&gt;.&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;The linker maintains an implicit variable called the &lt;strong&gt;location counter&lt;/strong&gt;, written as a dot: &lt;code&gt;.&lt;/code&gt;. The location counter tracks the current position within memory and automatically increments as sections are laid out.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SECTIONS
{
    . = 0x60000000;         /* Set location counter to 0x60000000 */

    .text : {
        *(.text)            /* Place .text at current location */
    } &amp;gt; RAM
    /* After .text, . is automatically incremented by .text's size */

    .rodata : {
        *(.rodata*)         /* Place .rodata immediately after .text */
    } &amp;gt; RAM
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Explicit assignment of &lt;code&gt;.&lt;/code&gt; ensures predictable placement. Without &lt;code&gt;. = 0x60000000;&lt;/code&gt;, the linker might place &lt;code&gt;.text&lt;/code&gt; at 0x0 by default, ignoring the intended RAM region.&lt;/p&gt;

&lt;p&gt;The location counter can also be used to create symbols:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;_text_start = .;            /* Symbol marks current position */
.text : { *(.text) } &amp;gt; RAM
_text_end = .;              /* Symbol marks end position */
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These symbols have no storage; they are merely addresses that can be referenced from assembly or C.&lt;/p&gt;


&lt;h2&gt;
  
  
  VMA vs LMA: Virtual and Load Memory Addresses
&lt;/h2&gt;

&lt;p&gt;Every output section has two associated addresses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;VMA (Virtual Memory Address):&lt;/strong&gt; where the section resides during execution (runtime)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LMA (Load Memory Address):&lt;/strong&gt; where the section is stored initially (typically in non-volatile flash)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In simple cases—executable code stored and executed from the same location—VMA and LMA are identical. Both the &lt;code&gt;&amp;gt; RAM&lt;/code&gt; clause and implicit location counter assignment govern VMA.&lt;/p&gt;

&lt;p&gt;The classic case is &lt;code&gt;.data&lt;/code&gt;: its initial values are stored in flash (LMA), but the program reads and writes it at runtime in RAM (VMA). Startup code copies the contents from flash to RAM before &lt;code&gt;main()&lt;/code&gt; runs.&lt;/p&gt;

&lt;p&gt;Linker script syntax to specify both addresses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.data : {
    *(.data)
} &amp;gt; RAM AT &amp;gt; FLASH          /* VMA in RAM, LMA in FLASH */
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;&amp;gt; RAM&lt;/code&gt; specifies VMA. The &lt;code&gt;AT &amp;gt; FLASH&lt;/code&gt; specifies LMA. The linker stores the section in FLASH (LMA) but generates symbols and relocation information assuming runtime execution from RAM (VMA).&lt;/p&gt;
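&lt;p&gt;The startup-time copy this implies can be sketched in C. The symbol names (&lt;code&gt;_data_load&lt;/code&gt;, &lt;code&gt;_data_start&lt;/code&gt;, &lt;code&gt;_data_end&lt;/code&gt;) are assumptions, and the arrays below are stand-ins so the sketch is self-contained; in real firmware they would be &lt;code&gt;extern&lt;/code&gt; symbols defined by the linker script:&lt;/p&gt;

```c
#include <stddef.h>
#include <string.h>

/* Stand-ins for linker-defined symbols (assumed names):
   _data_load  = LMA of .data (the flash image of its initial values)
   _data_start = VMA of .data (its runtime location in RAM)
   _data_end   = end of .data in RAM */
static const char _data_load[8] = {1, 2, 3, 4, 5, 6, 7, 8};
static char       _data_start[8];
#define _data_end (_data_start + sizeof _data_start)

/* Copy .data from its load address to its runtime address,
   as startup code would before calling main(). */
static void copy_data(void)
{
    memcpy(_data_start, _data_load, (size_t)(_data_end - _data_start));
}
```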


&lt;h2&gt;
  
  
  Linker-Defined Symbols
&lt;/h2&gt;

&lt;p&gt;The linker can create symbols by assigning the location counter or other expressions to a name. These symbols exist only as addresses; they occupy no storage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SECTIONS
{
    . = 0x60000000;

    _text_start = .;        /* Symbol: address where .text begins */

    .text : {
        *(.text)
    } &amp;gt; RAM

    _text_end = .;          /* Symbol: address where .text ends */

    _stack_top = ORIGIN(RAM) + LENGTH(RAM);  /* Custom symbol */
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In assembly, these symbols are loaded using the &lt;code&gt;ldr rX, =symbol&lt;/code&gt; pseudo-instruction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ldr r0, =_text_start        @ Load symbol address into r0
ldr sp, =_stack_top         @ Load stack top address into sp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;ldr =&lt;/code&gt; pseudo-instruction generates a PC-relative LDR that references a nearby literal pool. The linker resolves the symbol address and places it in the pool; when the CPU executes the LDR, that address is loaded into the register.&lt;/p&gt;
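&lt;p&gt;The same symbols can be consumed from C. The conventional idiom declares them as arrays so that only their addresses are used; the local definition below is a stand-in so the sketch compiles on its own, where real firmware would use the &lt;code&gt;extern&lt;/code&gt; declaration shown in the comment:&lt;/p&gt;

```c
#include <stddef.h>

/* In firmware: extern char _text_start[], _text_end[];
   The linker script defines these symbols; they occupy no storage.
   The stand-in below keeps this sketch self-contained. */
static char _text_start[12];
#define _text_end (_text_start + sizeof _text_start)

/* Size of .text, computed from linker-defined boundary symbols. */
static size_t text_size(void)
{
    return (size_t)(_text_end - _text_start);
}
```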


&lt;h2&gt;
  
  
  The Map File
&lt;/h2&gt;

&lt;p&gt;The linker can generate a map file showing all sections, symbols, and their addresses. Generate it by adding &lt;code&gt;-Map=output.map&lt;/code&gt; to the linker command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;arm-none-eabi-ld &lt;span class="nt"&gt;-T&lt;/span&gt; linker.ld startup.o &lt;span class="nt"&gt;-o&lt;/span&gt; boot.elf &lt;span class="nt"&gt;-Map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;output.map
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h2&gt;
  
  
  Complete Minimal Example
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;startup.s&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.global _start

.section .vectors, "ax"
_vectors:
    b _start                @ branch to _start

.section .text
_start:
    ldr r0, =0xDEADBEEF     @ Load test value
    b .                     @ Infinite loop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that QEMU does not use the vectors section above as a real reset vector when booting with &lt;code&gt;-kernel&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;linker.ld&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ENTRY(_start)

MEMORY
{
    RAM (rwx) : ORIGIN = 0x60000000, LENGTH = 128M
}

SECTIONS
{
    . = 0x60000000;

    .vectors : {
        *(.vectors)
    } &amp;gt; RAM

    .text : {
        *(.text)
    } &amp;gt; RAM
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Build Commands&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Assemble the startup code&lt;/span&gt;
arm-none-eabi-as &lt;span class="nt"&gt;-mcpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;cortex-a9 &lt;span class="nt"&gt;-g&lt;/span&gt; startup.s &lt;span class="nt"&gt;-o&lt;/span&gt; startup.o

&lt;span class="c"&gt;# Link with the linker script&lt;/span&gt;
arm-none-eabi-ld &lt;span class="nt"&gt;-T&lt;/span&gt; linker.ld startup.o &lt;span class="nt"&gt;-o&lt;/span&gt; boot.elf

&lt;span class="c"&gt;# Generate a map file&lt;/span&gt;
arm-none-eabi-ld &lt;span class="nt"&gt;-T&lt;/span&gt; linker.ld startup.o &lt;span class="nt"&gt;-o&lt;/span&gt; boot.elf &lt;span class="nt"&gt;-Map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;output.map

&lt;span class="c"&gt;# Create raw binary (optional, for certain QEMU loading modes)&lt;/span&gt;
arm-none-eabi-objcopy &lt;span class="nt"&gt;-O&lt;/span&gt; binary boot.elf boot.bin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h2&gt;
  
  
  Verification: What the Linker Produced
&lt;/h2&gt;


&lt;h3&gt;
  
  
  Step 1: Examine Disassembly
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;arm-none-eabi-objdump &lt;span class="nt"&gt;-d&lt;/span&gt; boot.elf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Excerpt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Disassembly of section .vectors:

60000000 &amp;lt;_vectors&amp;gt;:
60000000:   eaffffff    b   60000004 &amp;lt;_start&amp;gt;

Disassembly of section .text:

60000004 &amp;lt;_start&amp;gt;:
60000004:   e51f0000    ldr r0, [pc, #-0]   @ 6000000c &amp;lt;_start+0x8&amp;gt;
60000008:   eafffffe    b   60000008 &amp;lt;_start+0x4&amp;gt;
6000000c:   deadbeef    .word   0xdeadbeef
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key observations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Addresses in the disassembly match the VMA reported by &lt;code&gt;objdump -h&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.vectors&lt;/code&gt; at 0x60000000 contains a branch to &lt;code&gt;_start&lt;/code&gt; at 0x60000004&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ldr r0, =0xDEADBEEF&lt;/code&gt; assembles to the PC-relative load &lt;code&gt;ldr r0, [pc, #-0]&lt;/code&gt; at 0x60000004, which reads the literal stored at 0x6000000c&lt;/li&gt;
&lt;li&gt;The infinite loop &lt;code&gt;b .&lt;/code&gt; branches to itself at 0x60000008&lt;/li&gt;
&lt;li&gt;Literal &lt;code&gt;deadbeef&lt;/code&gt; is stored at 0x6000000c&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  Step 2: Examine Section Headers
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;arm-none-eabi-objdump &lt;span class="nt"&gt;-h&lt;/span&gt; boot.elf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Excerpt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .vectors      00000004  60000000  60000000  00001000  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  1 .text         0000000c  60000004  60000004  00001004  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key observations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VMA shows where code resides at runtime: 0x60000000 for &lt;code&gt;.vectors&lt;/code&gt;, 0x60000004 for &lt;code&gt;.text&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;LMA matches VMA in this simple case (both in RAM)&lt;/li&gt;
&lt;li&gt;Section sizes: 4 bytes for the branch instruction in &lt;code&gt;.vectors&lt;/code&gt;, 12 bytes for the load and branch in &lt;code&gt;.text&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  Step 3: Examine the Map File
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;output.map | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-30&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Excerpt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Memory Configuration

Name             Origin             Length             Attributes
RAM              0x60000000         0x08000000         xrw
*default*        0x00000000         0xffffffff

Linker script and memory map

                0x60000000                        . = 0x60000000

.vectors        0x60000000        0x4
 *(.vectors)
 .vectors       0x60000000        0x4 startup.o

.text           0x60000004        0xc
 *(.text)
 .text          0x60000004        0xc startup.o
                0x60000004                _start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h2&gt;
  
  
  Verification: Loading in QEMU and Inspecting with GDB
&lt;/h2&gt;

&lt;p&gt;Launch QEMU with GDB Server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;qemu-system-arm &lt;span class="nt"&gt;-M&lt;/span&gt; vexpress-a9 &lt;span class="nt"&gt;-cpu&lt;/span&gt; cortex-a9 &lt;span class="nt"&gt;-m&lt;/span&gt; 128M &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-kernel&lt;/span&gt; boot.elf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-nographic&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-S&lt;/span&gt; &lt;span class="nt"&gt;-gdb&lt;/span&gt; tcp::1234
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Flags:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;-kernel boot.elf&lt;/code&gt;: load ELF file (QEMU parses program headers and loads sections to their VMA)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-S&lt;/code&gt;: start halted, waiting for debugger connection&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-gdb tcp::1234&lt;/code&gt;: open GDB server on port 1234&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Connect GDB and Inspect:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gdb-multiarch boot.elf
&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; target remote :1234
Remote debugging using :1234
_start &lt;span class="o"&gt;()&lt;/span&gt; at startup.s:9
9       ldr r0, &lt;span class="o"&gt;=&lt;/span&gt;0xDEADBEEF     @ Load &lt;span class="nb"&gt;test &lt;/span&gt;value
&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; info registers pc
pc             0x60000004          0x60000004 &amp;lt;_start&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The program counter starts at 0x60000004, the ELF entry point &lt;code&gt;_start&lt;/code&gt;: with &lt;code&gt;-kernel&lt;/code&gt;, QEMU sets the PC to the entry point rather than beginning at the reset-vector branch at 0x60000000.&lt;/p&gt;

&lt;p&gt;Disassemble in GDB:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; disassemble _start
Dump of assembler code &lt;span class="k"&gt;for function &lt;/span&gt;_start:
&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; 0x60000004 &amp;lt;+0&amp;gt;: ldr r0, &lt;span class="o"&gt;[&lt;/span&gt;pc, &lt;span class="c"&gt;#-0]   @ 0x6000000c &amp;lt;_start+8&amp;gt;&lt;/span&gt;
   0x60000008 &amp;lt;+4&amp;gt;: b   0x60000008 &amp;lt;_start+4&amp;gt;
   0x6000000c &amp;lt;+8&amp;gt;: cdple   14, 10, cr11, cr13, cr15, &lt;span class="o"&gt;{&lt;/span&gt;7&lt;span class="o"&gt;}&lt;/span&gt;
End of assembler dump.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instructions are at exact addresses from objdump, confirming the linker placed code at the intended locations.&lt;/p&gt;

&lt;p&gt;The literal 0xDEADBEEF appears as a coprocessor instruction because the disassembler interprets raw data words as instructions when listing code; this is expected.&lt;/p&gt;

&lt;p&gt;Step Through Instructions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; info registers r0
r0             0x0                 0
&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="nb"&gt;break &lt;/span&gt;_start
Breakpoint 1 at 0x60000004: file startup.s, line 9.
&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; stepi
10      b &lt;span class="nb"&gt;.&lt;/span&gt;                     @ Infinite loop
&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; info registers r0
r0             0xdeadbeef          &lt;span class="nt"&gt;-559038737&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the instruction at 0x60000004 executes, r0 contains the test value, confirming that the instruction ran and that the linker's address assignment was correct.&lt;/p&gt;

&lt;p&gt;Inspect Memory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; x/4i 0x60000000
   0x60000000 &amp;lt;_vectors&amp;gt;:   b   0x60000004 &amp;lt;_start&amp;gt;
   0x60000004 &amp;lt;_start&amp;gt;: ldr r0, &lt;span class="o"&gt;[&lt;/span&gt;pc, &lt;span class="c"&gt;#-0]   @ 0x6000000c &amp;lt;_start+8&amp;gt;&lt;/span&gt;
&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; 0x60000008 &amp;lt;_start+4&amp;gt;:   b   0x60000008 &amp;lt;_start+4&amp;gt;
   0x6000000c &amp;lt;_start+8&amp;gt;:   cdple   14, 10, cr11, cr13, cr15, &lt;span class="o"&gt;{&lt;/span&gt;7&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Memory contains the expected branch and ldr instructions at exact addresses, confirming the linker-assigned layout matches actual memory.&lt;/p&gt;


&lt;h2&gt;
  
  
  Alignment Directives
&lt;/h2&gt;

&lt;p&gt;In ARM state, ARMv7 instructions are 4 bytes long and must be 4-byte aligned (Thumb instructions align to 2 bytes). Explicit alignment ensures that sections begin at valid boundaries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SECTIONS
{
    . = 0x60000000;
    . = ALIGN(4);           /* Ensure 4-byte alignment */

    .vectors : {
        *(.vectors)
    } &amp;gt; RAM

    . = ALIGN(4);           /* Ensure next section is aligned */

    .text : {
        *(.text)
    } &amp;gt; RAM
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;ALIGN(n)&lt;/code&gt; function rounds the location counter up to the next multiple of n bytes. If already aligned, it is a no-op. If misaligned, it advances the counter and introduces padding.&lt;/p&gt;
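&lt;p&gt;The rounding &lt;code&gt;ALIGN(n)&lt;/code&gt; performs is the standard power-of-two align-up; a C equivalent of the computation (an illustration, not linker source):&lt;/p&gt;

```c
#include <stdint.h>

/* Equivalent of the linker's ALIGN(n): round addr up to the next
   multiple of n, where n is a power of two. No-op when already aligned. */
static uint32_t align_up(uint32_t addr, uint32_t n)
{
    return (addr + n - 1) & ~(n - 1);
}
```

&lt;p&gt;For example, &lt;code&gt;align_up(0x60000001, 4)&lt;/code&gt; yields &lt;code&gt;0x60000004&lt;/code&gt;, while an already-aligned address passes through unchanged.&lt;/p&gt;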


&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;A linker script describes the mapping from ELF sections to concrete memory addresses and serves as the bridge between compiler output and hardware layout. By defining memory regions, assigning sections, managing alignment, and generating linker-defined symbols, the script establishes the program’s static memory structure. Verifying these decisions with map files and objdump ensures the final image matches the intended layout.&lt;/p&gt;

&lt;p&gt;With this foundation established, the next step is to examine how execution begins—specifically, the reset mechanism and the first instruction fetched by the CPU. Understanding reset behavior complements the static layout described here and completes the initial stage of bare-metal bring-up.&lt;/p&gt;

</description>
      <category>arm</category>
      <category>embedded</category>
      <category>baremetal</category>
    </item>
    <item>
      <title>Bare Metal ARM Boot: Understanding the Reset Vector and First Instructions</title>
      <dc:creator>Ripan Deuri</dc:creator>
      <pubDate>Sat, 13 Dec 2025 04:35:08 +0000</pubDate>
      <link>https://dev.to/ripan030/reset-on-armv7-42p7</link>
      <guid>https://dev.to/ripan030/reset-on-armv7-42p7</guid>
      <description>&lt;p&gt;Understanding what happens before &lt;code&gt;main()&lt;/code&gt; is essential when working on bare-metal systems. This article examines the reset behavior of ARMv7 and shows how to take control of the first instructions executed after reset on the vexpress-a9 platform in QEMU. A minimal vector table and linker script demonstrate how the CPU fetches its initial instruction from address 0x0 and how a simple branch verifies that control flow behaves exactly as intended.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Myth of “PC Starts at &lt;code&gt;main&lt;/code&gt;”
&lt;/h2&gt;

&lt;p&gt;In a bare-metal system, &lt;code&gt;main&lt;/code&gt; is simply a C function invoked &lt;strong&gt;after&lt;/strong&gt; a sequence of hardware-defined and software-defined steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Hardware reset forces the CPU into a well-defined privileged mode.&lt;/li&gt;
&lt;li&gt;A reset vector determines the first instruction the CPU fetches.&lt;/li&gt;
&lt;li&gt;Low-level startup code configures the execution environment.&lt;/li&gt;
&lt;li&gt;Only after this preparation does the C runtime transfer control to &lt;code&gt;main&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This post focuses on step 2: &lt;strong&gt;how the CPU selects its first instruction after reset and how software controls that decision.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  ARMv7 Reset Behavior: What the CPU Actually Does
&lt;/h2&gt;

&lt;p&gt;When an ARMv7-A core such as Cortex-A9 exits reset, the architecture defines only a minimal and deterministic subset of the processor state. The SoC integration adds further rules, but the essential behaviors are consistent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CPU mode:&lt;/strong&gt; The core enters &lt;strong&gt;Supervisor (SVC) mode&lt;/strong&gt;, a privileged mode suitable for exception entry.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Program Counter (PC):&lt;/strong&gt; Loaded from the reset vector address, which is implementation-defined but commonly &lt;strong&gt;0x00000000&lt;/strong&gt; or a remapped alias of another memory region.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;General-purpose registers:&lt;/strong&gt; All registers except the PC (and certain status bits) are &lt;strong&gt;architecturally undefined&lt;/strong&gt;. Software must not assume values for r0–r12, SP, or LR.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interrupts:&lt;/strong&gt; External interrupts are &lt;strong&gt;disabled&lt;/strong&gt; at reset, preventing accidental entry into uninitialized exception handlers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MMU and caches:&lt;/strong&gt; &lt;strong&gt;Disabled&lt;/strong&gt;. Execution begins in a flat physical address space without virtual memory.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This initial state contains just enough information for the CPU to fetch and execute the first instruction from the reset vector. Everything else—stack initialization, memory sections, BSS clearing, C runtime setup—must be implemented in startup code.&lt;/p&gt;

&lt;p&gt;Historically, many ARM systems located the vector table at &lt;strong&gt;0x00000000&lt;/strong&gt;, simplifying early boot ROM design. Modern systems provide more flexibility:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Some support &lt;strong&gt;high-vector mode&lt;/strong&gt;, placing the vector table at &lt;strong&gt;0xFFFF0000&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Many SoCs implement &lt;strong&gt;memory remapping&lt;/strong&gt; so that ROM or flash appears temporarily at 0x0 during reset.&lt;/li&gt;
&lt;li&gt;The physical storage for the bootloader may exist elsewhere (e.g. 0x40000000) but is made visible at 0x0 through a small alias window.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On QEMU’s vexpress-a9 model, the NOR flash device is mapped at &lt;strong&gt;0x40000000&lt;/strong&gt; but &lt;strong&gt;aliased&lt;/strong&gt; at &lt;strong&gt;0x00000000&lt;/strong&gt;, ensuring the reset vector resides at address 0x0. This is a QEMU modeling choice that mirrors typical early boot behavior.&lt;/p&gt;

&lt;h2&gt;
  
  
  Vector Table
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;vector table&lt;/strong&gt; defines CPU entry points for exceptions such as reset, undefined instructions, software interrupts, data aborts, IRQ, and FIQ.&lt;/p&gt;

&lt;p&gt;Key properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The table resides at a fixed address: &lt;strong&gt;0x00000000&lt;/strong&gt; or &lt;strong&gt;0xFFFF0000&lt;/strong&gt;, depending on the high-vector setting and SoC configuration. (At reset the MMU is disabled, so these are physical addresses.)&lt;/li&gt;
&lt;li&gt;Each entry corresponds to an exception type.&lt;/li&gt;
&lt;li&gt;On ARMv7, entries are typically &lt;strong&gt;instructions&lt;/strong&gt;, most commonly unconditional branches that jump to the full handlers.&lt;/li&gt;
&lt;li&gt;The first entry is the &lt;strong&gt;reset vector&lt;/strong&gt;, which contains the first instruction fetched after reset.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A minimal vector table for experiments may contain only a reset vector, acknowledging that any other exception would lead to undefined behavior. Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.section .vectors, "ax"
.global _vectors
_vectors:
    b _start        @ Reset vector: branch to startup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The instruction at the vector table address branches to &lt;code&gt;_start&lt;/code&gt;, which performs the earliest software-controlled action in the system.&lt;/p&gt;
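
&lt;p&gt;For comparison, a complete ARMv7-A table covers all eight entries at fixed 4-byte offsets. The handler names below are illustrative placeholders, not symbols defined elsewhere in this article:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.section .vectors, "ax"
_vectors:
    b reset_handler   @ 0x00: Reset
    b undef_handler   @ 0x04: Undefined instruction
    b svc_handler     @ 0x08: Supervisor call (SVC/SWI)
    b pabt_handler    @ 0x0C: Prefetch abort
    b dabt_handler    @ 0x10: Data abort
    b .               @ 0x14: Reserved
    b irq_handler     @ 0x18: IRQ
    b fiq_handler     @ 0x1C: FIQ
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;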

&lt;h2&gt;
  
  
  How QEMU Wires Reset to Memory
&lt;/h2&gt;

&lt;p&gt;QEMU offers multiple ways to load and boot an ARM image, each influencing reset behavior and initial PC selection:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;-kernel boot.elf&lt;/code&gt; (ELF as kernel image):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;QEMU parses the ELF program headers.&lt;/li&gt;
&lt;li&gt;Loadable segments are placed at their specified VMAs.&lt;/li&gt;
&lt;li&gt;The CPU’s initial PC is set to the ELF entry point from the ELF header.&lt;/li&gt;
&lt;li&gt;This bypasses the classic reset-vector mechanism and does not reflect hardware reset routing.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;-drive if=pflash,format=raw,file=flash.bin&lt;/code&gt; (NOR flash image)&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;QEMU memory-maps the raw binary directly into the emulated NOR flash region.&lt;/li&gt;
&lt;li&gt;On vexpress-a9, NOR flash content is aliased at &lt;strong&gt;0x00000000&lt;/strong&gt;, so the CPU fetches the reset vector from the binary’s first instruction.&lt;/li&gt;
&lt;li&gt;This faithfully models how a real SoC remaps flash or boot ROM to 0x0 during reset.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;This article uses the &lt;code&gt;-pflash&lt;/code&gt; method so execution begins naturally at the reset vector located at address 0x0.&lt;/p&gt;

&lt;h2&gt;
  
  
  Minimal Vector Table Example
&lt;/h2&gt;

&lt;p&gt;A minimal example demonstrating control over the reset vector:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;startup.s&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.section .vectors, "ax"
.global _vectors
_vectors:
    b _start        @ Reset vector: branch to startup

.section .text
.global _start
_start:
    b .             @ Infinite loop to prove we reached here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;linker.ld&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ENTRY(_start)

MEMORY
{
    FLASH (rx) : ORIGIN = 0x00000000, LENGTH = 64M
    RAM   (rwx): ORIGIN = 0x60000000, LENGTH = 128M
}

SECTIONS
{
    .vectors : {
        *(.vectors)
    } &amp;gt; FLASH

    .text : {
        *(.text)
    } &amp;gt; FLASH
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;.vectors&lt;/code&gt; and &lt;code&gt;.text&lt;/code&gt; both live in FLASH.&lt;/li&gt;
&lt;li&gt;No &lt;code&gt;.data&lt;/code&gt;, &lt;code&gt;.bss&lt;/code&gt;, or RAM usage.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Build Steps
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Assemble the startup code&lt;/span&gt;
arm-none-eabi-as &lt;span class="nt"&gt;-mcpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;cortex-a9 &lt;span class="nt"&gt;-g&lt;/span&gt; startup.s &lt;span class="nt"&gt;-o&lt;/span&gt; startup.o

&lt;span class="c"&gt;# Link using the custom linker script&lt;/span&gt;
arm-none-eabi-ld &lt;span class="nt"&gt;-T&lt;/span&gt; linker.ld startup.o &lt;span class="nt"&gt;-o&lt;/span&gt; boot.elf

&lt;span class="c"&gt;# Convert ELF to raw binary for NOR flash&lt;/span&gt;
arm-none-eabi-objcopy &lt;span class="nt"&gt;-O&lt;/span&gt; binary boot.elf flash.bin

&lt;span class="c"&gt;# Create a 64 MB NOR flash image&lt;/span&gt;
&lt;span class="nb"&gt;truncate&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; 64M flash.bin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Artifacts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;boot.elf&lt;/strong&gt; – Contains symbol information useful for GDB debugging.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;flash.bin&lt;/strong&gt; – Raw memory image that QEMU maps into its NOR flash region.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Verification with QEMU and GDB
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start QEMU&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;qemu-system-arm &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-M&lt;/span&gt; vexpress-a9 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-cpu&lt;/span&gt; cortex-a9 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; 128M &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-nographic&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-drive&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;pflash,format&lt;span class="o"&gt;=&lt;/span&gt;raw,file&lt;span class="o"&gt;=&lt;/span&gt;flash.bin &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-S&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-gdb&lt;/span&gt; tcp::1234
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The flash image is mapped into the emulated NOR flash device.&lt;/li&gt;
&lt;li&gt;The alias at &lt;strong&gt;0x00000000&lt;/strong&gt; ensures &lt;code&gt;_vectors&lt;/code&gt; is fetched at reset.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-S&lt;/code&gt; halts the CPU until GDB connects.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Connect GDB&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gdb-multiarch boot.elf

&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; target remote :1234
Remote debugging using :1234
_vectors &lt;span class="o"&gt;()&lt;/span&gt; at startup.s:4
4       b _start        @ Reset vector: branch to startup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check PC:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; info registers pc
pc             0x0                 0x0 &amp;lt;_vectors&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This confirms that the CPU began executing at the reset vector address.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disassemble the vector table and startup code&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; disassemble _vectors
Dump of assembler code &lt;span class="k"&gt;for function &lt;/span&gt;_vectors:
&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; 0x00000000 &amp;lt;+0&amp;gt;: b   0x4 &amp;lt;_start&amp;gt;
End of assembler dump.
&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; disassemble _start
Dump of assembler code &lt;span class="k"&gt;for function &lt;/span&gt;_start:
   0x00000004 &amp;lt;+0&amp;gt;: b   0x4 &amp;lt;_start&amp;gt;
End of assembler dump.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first instruction at 0x0 is a branch to &lt;code&gt;_start&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inspect section placement&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;arm-none-eabi-objdump &lt;span class="nt"&gt;-h&lt;/span&gt; boot.elf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Excerpt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .vectors      00000004  00000000  00000000  00001000  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  1 .text         00000004  00000004  00000004  00001004  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;.vectors&lt;/code&gt; is placed exactly at 0x0; &lt;code&gt;.text&lt;/code&gt; begins at 0x4.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step Through the Reset Vector&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="nb"&gt;break &lt;/span&gt;_start
Breakpoint 1 at 0x4: file startup.s, line 9.
&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="k"&gt;continue
&lt;/span&gt;Continuing.

Breakpoint 1, _start &lt;span class="o"&gt;()&lt;/span&gt; at startup.s:9
9       b &lt;span class="nb"&gt;.&lt;/span&gt;             @ Infinite loop to prove we reached here
&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; info registers pc
pc             0x4                 0x4 &amp;lt;_start&amp;gt;
&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; stepi

Breakpoint 1, _start &lt;span class="o"&gt;()&lt;/span&gt; at startup.s:9
9       b &lt;span class="nb"&gt;.&lt;/span&gt;             @ Infinite loop to prove we reached here
&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; info registers pc
pc             0x4                 0x4 &amp;lt;_start&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;b .&lt;/code&gt; instruction keeps the PC at &lt;code&gt;_start&lt;/code&gt;, proving that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The reset vector branch executed correctly.&lt;/li&gt;
&lt;li&gt;The CPU reached &lt;code&gt;_start&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Execution remains in the infinite loop.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Demonstrating aliasing with the QEMU monitor&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Enter the monitor (&lt;code&gt;Ctrl-A&lt;/code&gt;, then &lt;code&gt;c&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;qemu&lt;span class="o"&gt;)&lt;/span&gt; xp /4xw 0x0
0000000000000000: 0xeaffffff 0xeafffffe 0x00000000 0x00000000
&lt;span class="o"&gt;(&lt;/span&gt;qemu&lt;span class="o"&gt;)&lt;/span&gt; xp /4xw 0x40000000
0000000040000000: 0xeaffffff 0xeafffffe 0x00000000 0x00000000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both regions reflect the same underlying flash content, confirming the aliasing behavior used during reset.&lt;/p&gt;
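
&lt;p&gt;The two words shown are simply the encodings of the two branches. ARM encodes &lt;code&gt;b target&lt;/code&gt; as condition 0xE, the branch opcode, and a signed 24-bit word offset relative to PC+8, which can be reproduced with shell arithmetic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ARM 'b' encoding: 0xEA000000 plus the 24-bit word offset from PC+8
# enc &lt;pc&gt; &lt;target&gt; prints the encoded instruction word
enc() { printf '0x%08x' $(( 0xEA000000 + (($2 - $1 - 8) / 4 + 16777216) % 16777216 )); }
enc 0x0 0x4   # b _start, word at 0x0: 0xeaffffff
echo
enc 0x4 0x4   # b .,      word at 0x4: 0xeafffffe
echo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;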

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The minimal example demonstrates complete control over the system’s first executed instruction by placing a reset vector at address 0x0 and directing it to custom startup code. QEMU’s aliasing of NOR flash provides a convenient environment for experimenting with early boot behavior that closely reflects real hardware.&lt;/p&gt;

&lt;p&gt;This foundation forms the basis for building full startup routines: stack setup, memory initialization, exception-vector expansion, and transition into higher-level runtime code. With reset behavior understood and verified, the next steps involve constructing a complete bare-metal initialization sequence.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>computerscience</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Dissecting ELF for Bare Metal Development: Sections, Segments, VMA, and LMA Explained</title>
      <dc:creator>Ripan Deuri</dc:creator>
      <pubDate>Fri, 12 Dec 2025 07:18:08 +0000</pubDate>
      <link>https://dev.to/ripan030/dissecting-elf-for-bare-metal-development-sections-segments-vma-and-lma-explained-4390</link>
      <guid>https://dev.to/ripan030/dissecting-elf-for-bare-metal-development-sections-segments-vma-and-lma-explained-4390</guid>
      <description>&lt;p&gt;The memory map in the previous post &lt;a href="https://dev.to/ripan030/bare-metal-basics-part-1-understanding-memory-maps-p4d?preview=db8f2c318c2d0c12045e42ee25bac0a1ae02c865d9403cafc02f17b502a24e7aca6483dad08870b2e70d92acb1d707e113d083018351c26022317ab2"&gt;Bare Metal Basics - Part 1: Understanding Memory Maps&lt;/a&gt; describes the hardware address space, but it does not explain how compiled code and data are placed into that space. The compiler does not emit instructions directly at fixed memory addresses, nor does it decide where variables reside at runtime. Instead, compilation produces relocatable artifacts that must later be assigned concrete addresses.&lt;/p&gt;

&lt;p&gt;The GNU toolchain uses ELF (Executable and Linkable Format) as its standard output at multiple stages. These ELF files contain executable machine code, but they also include metadata such as symbols, relocation records, and debugging information—structures that bare-metal hardware cannot interpret. Since bare-metal systems have no loader, the addresses assigned during linking become the actual runtime addresses used by the CPU. Understanding these distinctions is essential for building reliable bare-metal systems.&lt;/p&gt;

&lt;p&gt;This post dissects ELF files, explains sections and segments, and demonstrates how to inspect linker output to verify that the generated layout matches the intended memory map.&lt;/p&gt;




&lt;h2&gt;
  
  
  From Source Code to Executable
&lt;/h2&gt;

&lt;p&gt;Bare-metal development involves multiple stages, each with a specific responsibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 1: Source to Object Files&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Source files (C and assembly) pass through the compiler and assembler, producing &lt;strong&gt;relocatable&lt;/strong&gt; object files (.o) that contain machine code and data grouped into sections, but not yet bound to fixed addresses. References to symbols defined in other object files are left unresolved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 2: Linking Object Files&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The linker reads all object files and a linker script. It then:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Resolves symbol references to actual memory addresses&lt;/li&gt;
&lt;li&gt;Combines sections from multiple object files&lt;/li&gt;
&lt;li&gt;Assigns memory addresses to sections based on the linker script&lt;/li&gt;
&lt;li&gt;Produces an ELF executable with a defined entry point&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The linker script serves as the blueprint: it describes how sections such as &lt;code&gt;.text&lt;/code&gt;, &lt;code&gt;.rodata&lt;/code&gt;, &lt;code&gt;.data&lt;/code&gt;, and &lt;code&gt;.bss&lt;/code&gt; map into the hardware memory regions. The linker assigns addresses accordingly, subject to options such as dead-code elimination (&lt;code&gt;--gc-sections&lt;/code&gt;). The specific addresses used depend entirely on the memory layout defined in the script.&lt;/p&gt;
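
&lt;p&gt;A minimal sketch of such a blueprint (region names and addresses are illustrative, not tied to a particular board):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MEMORY
{
    FLASH (rx) : ORIGIN = 0x00000000, LENGTH = 1M
    RAM   (rwx): ORIGIN = 0x20000000, LENGTH = 128K
}

SECTIONS
{
    .text   : { *(.text*) }   &amp;gt; FLASH
    .rodata : { *(.rodata*) } &amp;gt; FLASH
    /* .data runs in RAM (VMA) but its initial values load from FLASH (LMA) */
    .data   : { *(.data*) }   &amp;gt; RAM AT &amp;gt; FLASH
    .bss    : { *(.bss*) }    &amp;gt; RAM
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;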

&lt;h2&gt;
  
  
  ELF Structure
&lt;/h2&gt;

&lt;p&gt;An ELF executable is a structured container composed of several logical parts.&lt;/p&gt;

&lt;h3&gt;
  
  
  ELF Header
&lt;/h3&gt;

&lt;p&gt;The ELF header identifies the file format and describes global properties such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Magic number (0x7f, 'E', 'L', 'F')&lt;/li&gt;
&lt;li&gt;Target architecture (ARM, x86, etc.)&lt;/li&gt;
&lt;li&gt;Entry point address where CPU execution begins&lt;/li&gt;
&lt;li&gt;Offsets to section headers and program headers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These values guide development tools and loaders but are not interpreted by bare-metal processors.&lt;/p&gt;
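
&lt;p&gt;The magic number is easy to observe on any ELF file. On a Linux host, for example (&lt;code&gt;/bin/ls&lt;/code&gt; is used here purely as a convenient ELF binary):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# First four bytes of e_ident: 0x7f followed by the ASCII letters E, L, F
head -c 4 /bin/ls | od -A n -t x1   # expected: 7f 45 4c 46
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;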

&lt;h3&gt;
  
  
  Section Headers
&lt;/h3&gt;

&lt;p&gt;The linker organizes code and data into named &lt;strong&gt;sections&lt;/strong&gt;. Each section is a logical container:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;.text&lt;/code&gt;&lt;/strong&gt;: Executable code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;.rodata&lt;/code&gt;&lt;/strong&gt;: Read-only data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;.data&lt;/code&gt;&lt;/strong&gt;: Initialized global/static variables&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;.bss&lt;/code&gt;&lt;/strong&gt;: Uninitialized global/static variables&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom sections&lt;/strong&gt;: Platform-specific sections such as interrupt vectors or application-defined segments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each section has several important address concepts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LMA (Load Memory Address)&lt;/strong&gt;: Where the section’s initial contents reside in non-volatile memory (typically flash).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VMA (Virtual Memory Address)&lt;/strong&gt;: The runtime address where the CPU accesses the section.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Size&lt;/strong&gt; and &lt;strong&gt;Alignment&lt;/strong&gt;: Constraints on how the linker places each section.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;VMAs represent the addresses used by executing code, and symbol values correspond to VMAs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Program Headers (Segments)
&lt;/h3&gt;

&lt;p&gt;Segments describe how an ELF file should be loaded by a tool that understands ELF—such as QEMU or a custom bootloader. Loaders interpret &lt;em&gt;segments&lt;/em&gt;, not sections. Each segment describes a contiguous region of memory to populate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Segment type (e.g., LOAD)&lt;/li&gt;
&lt;li&gt;File offset&lt;/li&gt;
&lt;li&gt;Virtual or physical destination address&lt;/li&gt;
&lt;li&gt;File size and memory size&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Bare-metal CPUs do not interpret segments, but they matter when using QEMU or bootloaders that load ELF directly. QEMU follows program headers and uses the physical address field when present.&lt;/p&gt;

&lt;h3&gt;
  
  
  Metadata
&lt;/h3&gt;

&lt;p&gt;Additional ELF components include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Symbol and string tables&lt;/li&gt;
&lt;li&gt;Debug information (DWARF)&lt;/li&gt;
&lt;li&gt;Relocation records&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are essential during development but have no meaning for the hardware.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sections and Their Runtime Meaning
&lt;/h2&gt;

&lt;h3&gt;
  
  
  .text: Executable Code
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;.text&lt;/code&gt; section contains all compiled machine instructions, including startup routines. Some platforms place interrupt vectors in a dedicated section with strict address requirements; linker scripts typically handle this separately.&lt;/p&gt;

&lt;p&gt;Properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Readable and executable&lt;/li&gt;
&lt;li&gt;Typically stored in flash&lt;/li&gt;
&lt;li&gt;May execute in place or be copied to RAM depending on design&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use objdump to see disassembled .text with addresses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;arm-none-eabi-objdump &lt;span class="nt"&gt;-d&lt;/span&gt; boot.elf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  .rodata: Read-Only Data
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;.rodata&lt;/code&gt; section contains immutable data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;String literals&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;const&lt;/code&gt; global variables&lt;/li&gt;
&lt;li&gt;Compiler-generated lookup tables&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This section typically resides in flash alongside &lt;code&gt;.text&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;objdump &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-j&lt;/span&gt; .rodata boot.elf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  .data: Initialized Global Variables
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;.data&lt;/code&gt; section contains global and static variables with explicit initializers. These variables must be writable at runtime. Their initial values are stored in flash (LMA), and the startup code copies them into RAM (VMA) before entering the C runtime.&lt;/p&gt;

&lt;p&gt;Use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;readelf &lt;span class="nt"&gt;-S&lt;/span&gt; boot.elf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to view VMA and LMA assignments.&lt;/p&gt;

&lt;h3&gt;
  
  
  .bss: Uninitialized Global Variables
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;.bss&lt;/code&gt; section contains uninitialized global and static variables. &lt;code&gt;.bss&lt;/code&gt; occupies no space in the binary image, because storing a zero-filled region in flash would be wasteful.&lt;/p&gt;

&lt;p&gt;The linker records:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Section name&lt;/li&gt;
&lt;li&gt;VMA&lt;/li&gt;
&lt;li&gt;Size&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At runtime, the startup code sets the entire &lt;code&gt;.bss&lt;/code&gt; region to zero. The memory is not allocated by software; it is reserved by the linker through the memory map.&lt;/p&gt;
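
&lt;p&gt;The two runtime steps (copying &lt;code&gt;.data&lt;/code&gt; from its LMA to its VMA, and zeroing &lt;code&gt;.bss&lt;/code&gt;) reduce to two short loops. A sketch in ARM assembly, assuming the common linker-provided boundary symbols &lt;code&gt;_sidata&lt;/code&gt;, &lt;code&gt;_sdata&lt;/code&gt;, &lt;code&gt;_edata&lt;/code&gt;, &lt;code&gt;_sbss&lt;/code&gt;, and &lt;code&gt;_ebss&lt;/code&gt; (naming conventions, not symbols defined in this post):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    ldr r0, =_sidata      @ LMA: .data initializers in flash
    ldr r1, =_sdata       @ VMA: start of .data in RAM
    ldr r2, =_edata       @ VMA: end of .data in RAM
1:  cmp r1, r2
    ldrlo r3, [r0], #4    @ copy one word from flash
    strlo r3, [r1], #4    @ ...into RAM
    blo 1b

    ldr r1, =_sbss        @ start of .bss
    ldr r2, =_ebss        @ end of .bss
    mov r3, #0
2:  cmp r1, r2
    strlo r3, [r1], #4    @ zero one word
    blo 2b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;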

&lt;h2&gt;
  
  
  Raw Binary Image
&lt;/h2&gt;

&lt;p&gt;A processor begins execution at a hardware-defined reset address (for example, 0x00000000 or a device-specific flash base). The CPU does not interpret ELF metadata. It requires only:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Executable instructions at the reset address&lt;/li&gt;
&lt;li&gt;Read-only data accessible at expected locations&lt;/li&gt;
&lt;li&gt;Writable memory for initialized and uninitialized variables&lt;/li&gt;
&lt;li&gt;Valid stack space&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;ELF files must therefore be transformed into binary images whose bytes correspond exactly to the intended load addresses.&lt;/p&gt;

&lt;h2&gt;
  
  
  Inspecting ELF Output
&lt;/h2&gt;

&lt;p&gt;Common toolchain utilities make ELF inspection straightforward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;readelf&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;arm-none-eabi-readelf &lt;span class="nt"&gt;-h&lt;/span&gt; boot.elf   &lt;span class="c"&gt;# ELF header&lt;/span&gt;
arm-none-eabi-readelf &lt;span class="nt"&gt;-S&lt;/span&gt; boot.elf   &lt;span class="c"&gt;# Section headers&lt;/span&gt;
arm-none-eabi-readelf &lt;span class="nt"&gt;-l&lt;/span&gt; boot.elf   &lt;span class="c"&gt;# Program headers&lt;/span&gt;
arm-none-eabi-readelf &lt;span class="nt"&gt;-s&lt;/span&gt; boot.elf   &lt;span class="c"&gt;# Symbol table&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;objdump&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;arm-none-eabi-objdump &lt;span class="nt"&gt;-h&lt;/span&gt; boot.elf   &lt;span class="c"&gt;# Section summary&lt;/span&gt;
arm-none-eabi-objdump &lt;span class="nt"&gt;-d&lt;/span&gt; boot.elf   &lt;span class="c"&gt;# Disassembly&lt;/span&gt;
arm-none-eabi-objdump &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-j&lt;/span&gt; .rodata boot.elf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;objcopy&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;arm-none-eabi-objcopy &lt;span class="nt"&gt;-O&lt;/span&gt; binary boot.elf boot.bin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This produces a raw binary by extracting only loadable content, arranged according to LMAs. The resulting file contains no address metadata, so it must be programmed into flash at the correct offset corresponding to those LMAs.&lt;/p&gt;

&lt;p&gt;Alternative formats such as Intel HEX and S-records convey the same conceptual information with explicit addressing.&lt;/p&gt;

&lt;h2&gt;
  
  
  How QEMU Loads ELF and Binary Image
&lt;/h2&gt;

&lt;p&gt;QEMU supports both ELF-based loading and raw binary loading.&lt;/p&gt;

&lt;h3&gt;
  
  
  ELF Loading (-kernel)
&lt;/h3&gt;

&lt;p&gt;When provided with an ELF file, QEMU:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reads the ELF header for architecture and entry point&lt;/li&gt;
&lt;li&gt;Parses program headers&lt;/li&gt;
&lt;li&gt;Loads each LOAD segment at its specified destination address&lt;/li&gt;
&lt;li&gt;Sets the CPU PC to the ELF entry point
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;qemu-system-arm &lt;span class="nt"&gt;-M&lt;/span&gt; vexpress-a9 &lt;span class="nt"&gt;-cpu&lt;/span&gt; cortex-a9 &lt;span class="nt"&gt;-m&lt;/span&gt; 128M &lt;span class="nt"&gt;-nographic&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-kernel&lt;/span&gt; boot.elf &lt;span class="nt"&gt;-S&lt;/span&gt; &lt;span class="nt"&gt;-gdb&lt;/span&gt; tcp::1234
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Binary Loading (-pflash)
&lt;/h3&gt;

&lt;p&gt;When loading a raw binary, QEMU:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maps the file directly into the flash device model&lt;/li&gt;
&lt;li&gt;Uses device-specific flash size limits&lt;/li&gt;
&lt;li&gt;Begins execution from the board’s reset vector&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This model mirrors real hardware and validates whether the firmware image matches the expected memory map.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;qemu-system-arm &lt;span class="nt"&gt;-M&lt;/span&gt; vexpress-a9 &lt;span class="nt"&gt;-cpu&lt;/span&gt; cortex-a9 &lt;span class="nt"&gt;-m&lt;/span&gt; 128M &lt;span class="nt"&gt;-nographic&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-drive&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;pflash,format&lt;span class="o"&gt;=&lt;/span&gt;raw,file&lt;span class="o"&gt;=&lt;/span&gt;boot.bin &lt;span class="nt"&gt;-S&lt;/span&gt; &lt;span class="nt"&gt;-gdb&lt;/span&gt; tcp::1234
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;ELF files exist to support the toolchain. They encode executable code along with metadata required for linking, relocation, and debugging. Bare-metal processors, however, require only instructions and data placed at precise addresses.&lt;/p&gt;

&lt;p&gt;Understanding the distinctions between sections and segments, between VMA and LMA, and between compilation and linking is foundational to bare-metal work. The linker script provides the concrete mapping between ELF structure and hardware memory.&lt;/p&gt;

&lt;p&gt;The next post focuses on linker scripts: how they express memory intent, how the linker interprets them, and how to validate the final layout against the hardware memory map.&lt;/p&gt;

</description>
      <category>embedded</category>
      <category>baremetal</category>
      <category>architecture</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
