Entering Ring 0 gave me complete control over CPU execution, but I faced a major challenge: I had no drivers.
I couldn't read a single byte from a hard drive or load a file from disk. Standard operating systems rely on legacy BIOS calls or massive driver stacks; I had to write my own.
We are building a bare-metal, self-healing operating system running entirely inside the CPU's L3 cache. Here is the roadmap for this 12-part series: The V.E.L.O.C.I.T.Y.-OS 12-Part Roadmap
Driver 1: The PCI Configuration Space Scanner (src/pci.rs)
To find hardware devices attached to the motherboard, I wrote a PCI scanner.
The scanner recursively queries buses 0..255, slots 0..31, and functions 0..7 using CPU legacy I/O ports 0xCF8 (Address) and 0xCFC (Data). It checks the vendor and class registers to identify what hardware is present, capturing BAR0 addresses.
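To make the port protocol concrete, here is a minimal sketch of how the 32-bit value written to 0xCF8 is assembled, and how the class and subclass are pulled out of the dword read back from 0xCFC. The helper names (pci_config_address, class_and_subclass) are hypothetical and not taken from src/pci.rs; only the bit layout is the standard legacy configuration mechanism.

```rust
/// Legacy PCI configuration ports named in the text above.
pub const PCI_CONFIG_ADDRESS: u16 = 0xCF8;
pub const PCI_CONFIG_DATA: u16 = 0xCFC;

/// Build the value written to PCI_CONFIG_ADDRESS to select one register.
/// Hypothetical helper: the name is an assumption, the bit layout is standard.
pub const fn pci_config_address(bus: u8, slot: u8, func: u8, offset: u8) -> u32 {
    (1u32 << 31)                         // enable bit
        | ((bus as u32) << 16)           // bus 0..=255
        | (((slot & 0x1F) as u32) << 11) // slot (device) 0..=31
        | (((func & 0x07) as u32) << 8)  // function 0..=7
        | ((offset & 0xFC) as u32)       // register offset, rounded down to a dword
}

/// Split the dword at configuration offset 0x08 into (class, subclass).
/// Bits 31:24 hold the class code and bits 23:16 hold the subclass.
pub const fn class_and_subclass(reg_08: u32) -> (u8, u8) {
    ((reg_08 >> 24) as u8, (reg_08 >> 16) as u8)
}
```

A scanner would write the result of pci_config_address(bus, slot, func, 0x08) to port 0xCF8, read a dword from 0xCFC, and pass it to class_and_subclass. The port I/O itself needs inline assembly and is left out so the sketch stays runnable on a hosted target.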
Driver 2: The NVMe Storage Block Controller (src/nvme.rs)
Using the PCI scanner, the kernel locates the mass storage controller (Class 0x01, Subclass 0x08).
From BAR0, I retrieve the base pointer to the memory-mapped I/O (MMIO) registers. The driver maps and executes the NVMe startup sequence:
- Allocates Admin Submission (ASQ) and Completion (ACQ) queues.
- Reads the doorbell stride (CAP.DSTRD), which is a read-only field of the controller capabilities register.
- Maps I/O Submission (SQ) and Completion (CQ) queues.
- Rings the I/O submission queue doorbell at BAR0 + 0x1000 + 2 * (4 << CAP.DSTRD) to submit block reads (read_blocks) and writes (write_blocks).
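The doorbell formula in the last step generalizes to any queue ID. The sketch below shows the general form under the NVMe register layout; the function names are hypothetical and chosen only for illustration.

```rust
/// Doorbell registers start at this byte offset from BAR0.
pub const DOORBELL_BASE: usize = 0x1000;

/// Byte offset of the submission queue tail doorbell for queue `qid`.
/// Hypothetical helper name; the formula is (2 * qid) * (4 << DSTRD).
pub const fn sq_tail_doorbell(qid: u16, dstrd: u8) -> usize {
    DOORBELL_BASE + (2 * qid as usize) * (4usize << dstrd)
}

/// Byte offset of the completion queue head doorbell for queue `qid`.
/// Hypothetical helper name; the formula is (2 * qid + 1) * (4 << DSTRD).
pub const fn cq_head_doorbell(qid: u16, dstrd: u8) -> usize {
    DOORBELL_BASE + (2 * qid as usize + 1) * (4usize << dstrd)
}
```

With qid = 1 the submission doorbell works out to 0x1000 + 2 * (4 << DSTRD), matching the expression in the list above, and the completion doorbell sits one stride after it. Queue 0 is the admin queue pair.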
Here is the block-read and command-submission logic from src/nvme.rs. It hands physical buffer addresses to the controller, rings the queue doorbells, and polls the completion queue directly, with no OS caching layer in between:
// velocity-bootloader/src/nvme.rs: NVMe command submission and read
pub fn read_blocks(mut lba: u64, mut count: u16, buf: &mut [u8]) -> Result<(), &'static str> {
    let mut controller = NVME_CONTROLLER.lock();
    if !controller.initialized { return Err("NVMe controller not initialized"); }
    // Reject short buffers up front so the per-chunk slicing below cannot go out of bounds.
    // The block size is hard-coded to 512 bytes.
    if buf.len() < count as usize * 512 { return Err("Buffer too small for requested blocks"); }
    let mut offset = 0;
    while count > 0 {
        let chunk = count.min(8); // Read up to 8 blocks (4 KiB) per command
        let chunk_bytes = chunk as usize * 512;
        let chunk_buf = &mut buf[offset..offset + chunk_bytes];
        // Assumes identity-mapped memory, so the virtual address is also the physical address
        let phys_addr = chunk_buf.as_ptr() as u64;
        let page_offset = phys_addr & 0xFFF;
        let dptr1 = phys_addr; // PRP1: start of the transfer
        // PRP2 is only needed when the chunk crosses a 4 KiB page boundary.
        // A chunk is at most 4 KiB, so it never spans more than two pages.
        let dptr2 = if page_offset + chunk_bytes as u64 > 4096 {
            (phys_addr & !0xFFF) + 4096
        } else {
            0
        };
        let cmd = NvmeCmd {
            opcode: 0x02, // NVMe Read opcode
            flags: 0,
            cid: 0, // Overwritten by submit_io_cmd
            nsid: 1, // Namespace ID 1
            reserved0: 0, mptr: 0, dptr1, dptr2,
            cdw10: (lba & 0xFFFFFFFF) as u32, // Starting LBA, low 32 bits
            cdw11: (lba >> 32) as u32,        // Starting LBA, high 32 bits
            cdw12: (chunk - 1) as u32, // Number of blocks (zero-based)
            cdw13: 0, cdw14: 0, cdw15: 0,
        };
        controller.submit_io_cmd(cmd)?;
        lba += chunk as u64;
        count -= chunk;
        offset += chunk_bytes;
    }
    Ok(())
}
impl NvmeController {
    // Submit a command to the I/O Submission Queue, then poll the Completion Queue
    pub fn submit_io_cmd(&mut self, mut cmd: NvmeCmd) -> Result<NvmeCqe, &'static str> {
        cmd.cid = self.io_sq_tail;
        unsafe {
            // Volatile write: the controller reads this entry via DMA
            self.io_sq.add(self.io_sq_tail as usize).write_volatile(cmd);
        }
        self.io_sq_tail = (self.io_sq_tail + 1) % 64; // 64-entry queue
        // Ensure the queue entry is written before the doorbell is rung
        core::sync::atomic::fence(core::sync::atomic::Ordering::SeqCst);
        unsafe {
            // Ring SQ tail doorbell for the I/O queue (QID = 1, doorbells start at offset 0x1000).
            // bar0 is indexed in 32-bit registers, hence the division by 4.
            let db_sq_offset = (0x1000 + 2 * (4 << self.dstrd)) / 4;
            core::ptr::write_volatile(self.bar0.add(db_sq_offset as usize), self.io_sq_tail as u32);
            // Poll the completion queue until the phase bit matches the expected phase
            let mut timeout: u32 = 10_000_000;
            loop {
                let cqe_ptr = self.io_cq.add(self.io_cq_head as usize);
                // Flush the CPU cache line, then force a fresh read of the device-written entry
                core::arch::asm!("clflush [{}]", in(reg) cqe_ptr, options(nostack, preserves_flags));
                let cqe = cqe_ptr.read_volatile();
                let phase = cqe.status & 0x01;
                if phase == self.io_cq_phase {
                    self.io_cq_head = (self.io_cq_head + 1) % 64;
                    // The expected phase flips each time the head wraps around
                    if self.io_cq_head == 0 { self.io_cq_phase ^= 1; }
                    // Ring CQ head doorbell to release the consumed entry
                    let db_cq_offset = (0x1000 + 3 * (4 << self.dstrd)) / 4;
                    core::ptr::write_volatile(self.bar0.add(db_cq_offset as usize), self.io_cq_head as u32);
                    // Bit 0 is the phase tag; any other set bit is an error status
                    let status_val = cqe.status;
                    if (status_val >> 1) != 0 { return Err("I/O command failed status"); }
                    return Ok(cqe);
                }
                timeout -= 1;
                if timeout == 0 { return Err("I/O command completion timeout"); }
                core::hint::spin_loop();
            }
        }
    }
}
Driver 3: The Zero-Allocation FAT32 Parser (src/fat.rs)
With block reads working, I needed a filesystem parser to read directories and files.
I wrote a custom, #![no_std] FAT32 driver. Because alignment-safe access is critical on bare-metal hardware, the parser uses direct offset-based byte reads (rather than pointer-casting structs) to prevent alignment exception crashes.
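As an illustration of the offset-based approach, here is a minimal sketch that pulls a few FAT32 boot-sector fields out of a raw sector buffer. The Bpb struct and the helper names are hypothetical and are not the types used in src/fat.rs; the field offsets are the standard FAT32 boot-sector offsets. Note that bytes-per-sector lives at offset 11, an odd address, which is exactly the kind of field a pointer-cast struct would read misaligned.

```rust
/// Read a little-endian u16 at `off`, or None if the slice is too short.
pub fn le_u16(buf: &[u8], off: usize) -> Option<u16> {
    let bytes = buf.get(off..off.checked_add(2)?)?;
    Some(u16::from_le_bytes([bytes[0], bytes[1]]))
}

/// Read a little-endian u32 at `off`, or None if the slice is too short.
pub fn le_u32(buf: &[u8], off: usize) -> Option<u32> {
    let bytes = buf.get(off..off.checked_add(4)?)?;
    Some(u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]))
}

/// A few FAT32 boot-sector fields (hypothetical struct, for illustration only).
#[derive(Debug, PartialEq)]
pub struct Bpb {
    pub bytes_per_sector: u16,   // offset 11
    pub sectors_per_cluster: u8, // offset 13
    pub reserved_sectors: u16,   // offset 14
    pub num_fats: u8,            // offset 16
    pub fat_size_32: u32,        // offset 36
    pub root_cluster: u32,       // offset 44
}

/// Parse the fields byte by byte; no pointer casts, so no alignment requirement.
pub fn parse_bpb(sector: &[u8]) -> Option<Bpb> {
    Some(Bpb {
        bytes_per_sector: le_u16(sector, 11)?,
        sectors_per_cluster: *sector.get(13)?,
        reserved_sectors: le_u16(sector, 14)?,
        num_fats: *sector.get(16)?,
        fat_size_32: le_u32(sector, 36)?,
        root_cluster: le_u32(sector, 44)?,
    })
}
```

Everything here comes from core, so the same code compiles under #![no_std]. Returning Option instead of panicking keeps a truncated or corrupt sector from taking down the kernel.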
The parser crawls directory clusters, decodes standard 8.3 space-padded uppercase filenames (e.g. converting fibonacci.nda to FIBONACCNDA), and loads file data cluster-by-cluster.
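The name conversion can be sketched as follows. This is a simplified illustration, assuming plain ASCII names and simple truncation as in the fibonacci.nda example; to_short_name is a hypothetical helper name, and the sketch ignores long-filename entries and characters that are invalid in 8.3 names.

```rust
/// Convert "name.ext" into the 11-byte, space-padded, uppercase 8.3 form.
/// Hypothetical helper: the base is truncated to 8 bytes and the extension to 3.
pub fn to_short_name(path: &str) -> [u8; 11] {
    let mut out = [b' '; 11];
    // Split on the last dot; a name without a dot has an empty extension.
    let (base, ext) = match path.rfind('.') {
        Some(dot) => (&path[..dot], &path[dot + 1..]),
        None => (path, ""),
    };
    // zip stops at the shorter side, which truncates long names and
    // leaves the space padding in place for short ones.
    for (dst, src) in out[..8].iter_mut().zip(base.bytes()) {
        *dst = src.to_ascii_uppercase();
    }
    for (dst, src) in out[8..].iter_mut().zip(ext.bytes()) {
        *dst = src.to_ascii_uppercase();
    }
    out
}
```

A directory lookup can then compare this 11-byte array against the first 11 bytes of each 32-byte directory entry.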
From the shell's point of view, the whole stack (FAT32 parser on top of NVMe block reads on top of PCIe) sits behind a single call:
// Shell console call dynamically reading from NVMe disk
let file_bytes = fat::read_file("NEURAL_N.NDA")?;
Fixing the Deadlocks & Calling Conventions
During integration, I hit a critical boot-time freeze: the serial COM1 logger (serial.rs) deadlocked when mirroring print logs to the GUI log buffer.
I resolved this by rewriting add_log to bypass the high-level print! macros and write directly to SERIAL_COM1.lock() without acquiring recursive locks.
Furthermore, I fixed a JIT compilation stack crash: under #![no_std] UEFI compilation targets, the JIT assembler was emitting System V registers. I updated the compiler target mapping to align System V registers to Microsoft x64 (RCX/RDX/R8/R9) when target_os = "uefi" is set.
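The difference between the two conventions comes down to which registers carry the first integer arguments. Here is a minimal sketch of a target-dependent register mapping; the Reg and CallConv types are hypothetical and are not the JIT's actual types, and hosted Windows is included alongside UEFI only because it uses the same convention.

```rust
/// Integer argument registers used by the two conventions (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Reg { Rdi, Rsi, Rdx, Rcx, R8, R9 }

/// Calling conventions a JIT might emit (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CallConv { SystemV, MicrosoftX64 }

impl CallConv {
    /// Pick the convention for the current build target.
    pub const fn for_target() -> Self {
        if cfg!(any(target_os = "uefi", target_os = "windows")) {
            CallConv::MicrosoftX64
        } else {
            CallConv::SystemV
        }
    }

    /// Registers that carry the first integer or pointer arguments, in order.
    pub const fn int_arg_regs(self) -> &'static [Reg] {
        match self {
            CallConv::SystemV => &[Reg::Rdi, Reg::Rsi, Reg::Rdx, Reg::Rcx, Reg::R8, Reg::R9],
            CallConv::MicrosoftX64 => &[Reg::Rcx, Reg::Rdx, Reg::R8, Reg::R9],
        }
    }

    /// Stack bytes the caller must reserve for the callee before a call.
    /// Microsoft x64 requires 32 bytes of shadow space; System V requires none.
    pub const fn shadow_space_bytes(self) -> usize {
        match self {
            CallConv::SystemV => 0,
            CallConv::MicrosoftX64 => 32,
        }
    }
}
```

Whether the missing shadow space contributed to the stack crash described above is not something the source states; it is included here because a code generator switching conventions usually has to handle both the register order and the shadow space.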
Pascal's Verification: Cold Context on the NVMe Drive
I launched QEMU with a virtual 64MB NVMe drive containing my compiled .nda programs. The bare-metal shell successfully ran ls to list NVMe files and executed run fibonacci.nda dynamically from disk.
This filesystem integration was about more than just loading files—it allowed the JIT VM and the model to query and use the active codebase directly as context without CPU overhead.
By combining the FAT32 driver with the Merkle root sitemap caching, the entire written codebase sitting on the NVMe drive acts as a virtual "Cold Context". The active task in memory represents the "Hot Context", and the system hot-swaps relevant code blocks in and out on demand.
As noted when reviewing this demand-paging context model:
"The site-map + NDA hot-swap into buffers is essentially a demand-paging system for model context — you load what the current reasoning step needs, not the entire history. The NVMe drive as long-term context window is the right abstraction: infinite effective context, bounded active memory, deterministic access patterns via the triple graph."
By linking my FAT32 driver directly to the JIT VM, I could load, compile, and execute modules dynamically from NVMe sectors in microseconds.
But I was still operating in a text-only serial terminal. I needed a graphical interface.
In the next post, I'll document how I built the swappable double-buffered GUI engines and the Synaptic Canvas force-directed GUI compositor.
Discussion
What's your experience writing bare-metal driver software in Rust? What are the trickiest elements of PCI discovery and NVMe queue mapping without an underlying OS? Let's discuss in the comments below!
Special thanks to for helping me realign calling conventions and resolve serial lock deadlocks.
Disclaimer: AI was used throughout this project, so it is only fitting that it co-authors with me. Special thanks to the Foundry for its tireless hours toiling away and to Gemini for producing the cover image.
