V.E.L.O.C.I.T.Y.-OS: Writing Bare-Metal Drivers – PCI, NVMe & FAT32 (Part 9)

#showdev #coding #compilers #rust

Self-healing OS running in L3 cache

Entering Ring 0 gave me complete control over CPU execution, but I faced a major challenge: I had no drivers.

I couldn't read a single byte from a hard drive or load a file from disk. Standard operating systems rely on legacy BIOS calls or massive driver stacks; I had to write my own.

The V.E.L.O.C.I.T.Y.-OS 12-Part Roadmap

We are building a bare-metal, self-healing operating system running entirely inside the CPU's L3 cache. Here is the roadmap for this 12-part series:

Part 1: The Spark — Exposing the "Safe-Room" security leak and building the compiler gate.
Part 2: The NDA Language — Designing a content-addressed triplet representation to cure context bloat.
Part 3: Ditching the Web Stack — Building a native 30MB IDE with 1,500,000x IPC latency drops.
Part 4: The Closure JIT — Compiling AST blocks to nested closures and bypassing borrow checker limits.
Part 5: JIT Math Optimizations — Replacing division operations with precomputed 16-bit lookup tables.
Part 6: x86-64 Assembler & SCEV-Lite — Compiling scalar loops directly to native code in constant time.
Part 7: Classic Compiler Passes — Implementing inter-procedural Dead Code Elimination and loop unrolling.
Part 8: Reclaiming Ring 0 — Exiting UEFI boot services and transitioning the kernel to Ring 0.
Part 9: Bare-Metal Drivers — Writing a PCI scanner, NVMe block storage controller, and FAT32 parser. (You are here)
Part 10: Synaptic Canvas — Rendering a spatial, force-directed GUI based on model token activation vectors.
Part 11: Swarms & Hot-Patching — Building multi-agent scheduling and zero-downtime RCU driver updates.
Part 12: Self-Evolution — Handing system control over to a local LLM Terminal that self-optimizes via telemetry.

Driver 1: The PCI Configuration Space Scanner (`src/pci.rs`)

To find hardware devices attached to the motherboard, I wrote a PCI scanner.

The scanner walks every bus (0 to 255), slot (0 to 31), and function (0 to 7) through the legacy CPU I/O ports 0xCF8 (address) and 0xCFC (data). For each function that responds, it reads the vendor and class registers to identify the device and records its BAR0 address.
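To make the port protocol concrete, here is a minimal sketch of the pure bit-level parts of a scan: building the value written to 0xCF8, decoding the ID and class registers read back from 0xCFC, and turning BAR0 into a base address. The function and type names (`pci_config_address`, `decode_id`, `decode_mem_bar`, `PciId`) are hypothetical and are not taken from `src/pci.rs`; the actual port reads and writes are left out so the code can run on a host.

```rust
/// Fields decoded from configuration registers 0x00 and 0x08 of one PCI function.
#[derive(Debug, PartialEq, Eq)]
pub struct PciId {
    pub vendor: u16,
    pub device: u16,
    pub class: u8,
    pub subclass: u8,
    pub prog_if: u8,
}

/// Build the dword written to port 0xCF8 to select a configuration register.
/// Bit 31 is the enable bit; the register offset is dword-aligned.
pub fn pci_config_address(bus: u8, slot: u8, func: u8, offset: u8) -> u32 {
    0x8000_0000
        | ((bus as u32) << 16)
        | (((slot & 0x1F) as u32) << 11)
        | (((func & 0x07) as u32) << 8)
        | ((offset & 0xFC) as u32)
}

/// Decode register 0x00 (vendor/device) and register 0x08 (class code).
/// A vendor ID of 0xFFFF means no function is present at that address.
pub fn decode_id(reg00: u32, reg08: u32) -> Option<PciId> {
    let vendor = (reg00 & 0xFFFF) as u16;
    if vendor == 0xFFFF {
        return None;
    }
    Some(PciId {
        vendor,
        device: (reg00 >> 16) as u16,
        class: (reg08 >> 24) as u8,
        subclass: (reg08 >> 16) as u8,
        prog_if: (reg08 >> 8) as u8,
    })
}

/// Decode a memory BAR into a physical base address.
/// `bar_lo` is BAR0; `bar_hi` is BAR1 and is only used for 64-bit BARs.
pub fn decode_mem_bar(bar_lo: u32, bar_hi: u32) -> Option<u64> {
    if bar_lo & 0x1 != 0 {
        return None; // bit 0 set: this is an I/O space BAR, not MMIO
    }
    let base_lo = (bar_lo & 0xFFFF_FFF0) as u64; // low 4 bits are flags
    match (bar_lo >> 1) & 0x3 {
        0b00 => Some(base_lo),                           // 32-bit memory BAR
        0b10 => Some(((bar_hi as u64) << 32) | base_lo), // 64-bit memory BAR
        _ => None,                                       // reserved encodings
    }
}
```

For example, selecting BAR0 (offset 0x10) of bus 0, slot 3, function 0 produces the address dword 0x80001810.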

Driver 2: The NVMe Block Storage Controller (`src/nvme.rs`)

Using the PCI scanner, the kernel locates the mass storage controller (Class 0x01, Subclass 0x08).

From BAR0, I retrieve the base address of the memory-mapped I/O (MMIO) registers. The driver then runs the NVMe startup sequence:

Allocates the Admin Submission Queue (ASQ) and Admin Completion Queue (ACQ).
Reads the doorbell stride from the read-only CAP.DSTRD field.
Creates the I/O Submission Queue (SQ) and I/O Completion Queue (CQ).
Rings the I/O queue doorbells to submit block reads (read_blocks) and writes (write_blocks). For I/O queue 1, the submission tail doorbell sits at BAR0 + 0x1000 + 2 * (4 << CAP.DSTRD) and the completion head doorbell at BAR0 + 0x1000 + 3 * (4 << CAP.DSTRD).
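The doorbell arithmetic generalizes to any queue ID. The sketch below shows that calculation on its own, assuming the standard NVMe register layout in which doorbells start at offset 0x1000 and CAP.DSTRD occupies bits 35:32 of the 64-bit CAP register. The helper names are hypothetical and do not come from `src/nvme.rs`.

```rust
/// Extract CAP.DSTRD (bits 35:32 of the 64-bit CAP register at BAR0 + 0x00).
pub fn cap_dstrd(cap: u64) -> u32 {
    ((cap >> 32) & 0xF) as u32
}

/// Distance in bytes between consecutive doorbell registers.
pub fn doorbell_stride_bytes(dstrd: u32) -> usize {
    4usize << dstrd
}

/// Byte offset from BAR0 of the submission queue tail doorbell for `qid`.
pub fn sq_tail_doorbell(qid: u16, dstrd: u32) -> usize {
    0x1000 + (2 * qid as usize) * doorbell_stride_bytes(dstrd)
}

/// Byte offset from BAR0 of the completion queue head doorbell for `qid`.
pub fn cq_head_doorbell(qid: u16, dstrd: u32) -> usize {
    0x1000 + (2 * qid as usize + 1) * doorbell_stride_bytes(dstrd)
}
```

With DSTRD = 0, the admin queue (QID 0) uses offsets 0x1000 and 0x1004, and I/O queue 1 uses 0x1008 and 0x100C. With DSTRD = 2 the stride grows to 16 bytes, so I/O queue 1 moves to 0x1020 and 0x1030.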

Here is the block-read and command-submission logic from src/nvme.rs. It builds the PRP entries from the buffer's physical address, rings the submission doorbell, and polls the completion queue's phase bit, with no OS block cache in between:

// velocity-bootloader/src/nvme.rs — NVMe Command Submission & Read
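pub fn read_blocks(mut lba: u64, mut count: u16, buf: &mut [u8]) -> Result<(), &'static str> {
    let mut controller = NVME_CONTROLLER.lock();
    if !controller.initialized { return Err("NVMe controller not initialized"); }
    // Reject short buffers up front so the chunk slices below can never run past `buf`
    if buf.len() < count as usize * 512 { return Err("Buffer too small for requested blocks"); }

    let mut offset = 0;
    while count > 0 {
        let chunk = count.min(8); // Up to 8 blocks (4 KiB), so a chunk spans at most two pages
        let chunk_bytes = chunk as usize * 512; // Assumes 512-byte logical blocks

        let chunk_buf = &mut buf[offset..offset + chunk_bytes];
        // Assumes identity-mapped memory, so the virtual address is also the physical address
        let phys_addr = chunk_buf.as_ptr() as u64;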
        let page_offset = phys_addr & 0xFFF;

        let dptr1 = phys_addr;
        let dptr2 = if page_offset + chunk_bytes as u64 > 4096 {
            (phys_addr & !0xFFF) + 4096 // Transfer crosses a page boundary: PRP2 points at the next page
        } else {
            0
        };

        let cmd = NvmeCmd {
            opcode: 0x02, // NVMe Read Opcode
            flags: 0,
            cid: 0,
            nsid: 1,      // Namespace ID 1
            reserved0: 0, mptr: 0, dptr1, dptr2,
            cdw10: (lba & 0xFFFFFFFF) as u32,
            cdw11: (lba >> 32) as u32,
            cdw12: (chunk - 1) as u32, // Number of logical blocks, zero-based (NLB)
            cdw13: 0, cdw14: 0, cdw15: 0,
        };

        controller.submit_io_cmd(cmd)?;

        lba += chunk as u64;
        count -= chunk;
        offset += chunk_bytes;
    }
    Ok(())
}

impl NvmeController {
    // Submit a command to the I/O Submission Queue and poll Completion Queue
    pub fn submit_io_cmd(&mut self, mut cmd: NvmeCmd) -> Result<NvmeCqe, &'static str> {
        cmd.cid = self.io_sq_tail;
        unsafe {
            self.io_sq.add(self.io_sq_tail as usize).write(cmd);
        }

        self.io_sq_tail = (self.io_sq_tail + 1) % 64;

        unsafe {
            // Ring SQ doorbell for I/O Queue (QID = 1, doorbells start at offset 0x1000)
            let db_sq_offset = (0x1000 + 2 * (4 << self.dstrd)) / 4;
            core::ptr::write_volatile(self.bar0.add(db_sq_offset as usize), self.io_sq_tail as u32);

            // Poll completion queue phase bit
            let mut timeout = 10000000;
            loop {
                let cqe_ptr = self.io_cq.add(self.io_cq_head as usize);
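                // Flush the cache line so the read below observes the controller's DMA write
                core::arch::asm!("clflush [{}]", in(reg) cqe_ptr, options(nostack, preserves_flags));
                // Volatile read: the controller updates this entry behind the compiler's back
                let cqe = cqe_ptr.read_volatile();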
                let phase = cqe.status & 0x01;

                if phase == self.io_cq_phase {
                    self.io_cq_head = (self.io_cq_head + 1) % 64;
                    if self.io_cq_head == 0 { self.io_cq_phase ^= 1; }

                    // Ring CQ doorbell
                    let db_cq_offset = (0x1000 + 3 * (4 << self.dstrd)) / 4;
                    core::ptr::write_volatile(self.bar0.add(db_cq_offset as usize), self.io_cq_head as u32);

                    let status_val = cqe.status;
                    if (status_val >> 1) != 0 { return Err("I/O command failed status"); }
                    return Ok(cqe);
                }

                timeout -= 1;
                if timeout == 0 { return Err("I/O command completion timeout"); }
                core::hint::spin_loop();
            }
        }
    }
}
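Two details of the listing above are easy to get wrong: the PRP split when a buffer straddles a page boundary, and the phase-bit flip when the completion queue wraps. The following host-runnable sketch isolates both. `prp_pair` and `CqCursor` are hypothetical names, and the initial phase value of 1 assumes the completion queue memory was zeroed before the controller was enabled.

```rust
const PAGE_SIZE: u64 = 4096;

/// Compute (PRP1, PRP2) for a transfer that touches at most two pages.
/// Returns None for an empty transfer or one that would need a PRP list.
pub fn prp_pair(addr: u64, len: u64) -> Option<(u64, u64)> {
    if len == 0 {
        return None;
    }
    let first_page = addr & !(PAGE_SIZE - 1);
    let last_page = (addr + len - 1) & !(PAGE_SIZE - 1);
    match (last_page - first_page) / PAGE_SIZE {
        0 => Some((addr, 0)),         // fits in one page: PRP2 unused
        1 => Some((addr, last_page)), // crosses one boundary: PRP2 = second page
        _ => None,                    // three or more pages: needs a PRP list
    }
}

/// Tracks the consumer side of a completion queue.
pub struct CqCursor {
    head: u16,
    phase: u16,
    depth: u16,
}

impl CqCursor {
    /// Assumes zeroed queue memory, so the first pass of entries carries phase 1.
    pub fn new(depth: u16) -> Self {
        CqCursor { head: 0, phase: 1, depth }
    }

    /// Given the status word of the entry at `head`, return the consumed slot
    /// index if the entry is new, or None if it is stale.
    pub fn try_consume(&mut self, status_at_head: u16) -> Option<u16> {
        if (status_at_head & 1) != self.phase {
            return None; // phase bit does not match: controller has not written it yet
        }
        let slot = self.head;
        self.head = (self.head + 1) % self.depth;
        if self.head == 0 {
            self.phase ^= 1; // wrapped: the next pass uses the opposite phase
        }
        Some(slot)
    }
}
```

A 4 KiB read into a buffer at 0x1800 yields PRP1 = 0x1800 and PRP2 = 0x2000, while the same read into a page-aligned buffer at 0x1000 leaves PRP2 at 0.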

Driver 3: The Zero-Allocation FAT32 Parser (`src/fat.rs`)

With block reads working, I needed a filesystem parser to read directories and files.

I wrote a custom #![no_std] FAT32 driver. On-disk FAT structures are packed, so many fields sit at unaligned offsets. Instead of casting pointers to structs, the parser reads each field byte by byte at its fixed offset, which avoids unaligned references and the faults or undefined behavior they can cause on bare metal.
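As an illustration of the offset-based approach, here is a minimal sketch that parses a few fields of the FAT32 boot sector (BPB) without any pointer casts. The field offsets follow the published FAT32 layout; the names `Bpb`, `parse_bpb`, `le_u16`, and `le_u32` are hypothetical and are not taken from `src/fat.rs`.

```rust
/// Read a little-endian u16 at `off`, or None if the slice is too short.
fn le_u16(buf: &[u8], off: usize) -> Option<u16> {
    let b = buf.get(off..off + 2)?;
    Some(u16::from_le_bytes([b[0], b[1]]))
}

/// Read a little-endian u32 at `off`, or None if the slice is too short.
fn le_u32(buf: &[u8], off: usize) -> Option<u32> {
    let b = buf.get(off..off + 4)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

#[derive(Debug, PartialEq, Eq)]
pub struct Bpb {
    pub bytes_per_sector: u16,
    pub sectors_per_cluster: u8,
    pub reserved_sectors: u16,
    pub num_fats: u8,
    pub fat_size_sectors: u32,
    pub root_cluster: u32,
}

/// Parse the boot sector by fixed byte offsets; no struct casts, no alignment needs.
pub fn parse_bpb(sector: &[u8]) -> Option<Bpb> {
    // Every boot sector ends with the 0x55 0xAA signature at bytes 510..512.
    if sector.len() < 512 || sector[510] != 0x55 || sector[511] != 0xAA {
        return None;
    }
    Some(Bpb {
        bytes_per_sector: le_u16(sector, 11)?,
        sectors_per_cluster: sector[13],
        reserved_sectors: le_u16(sector, 14)?,
        num_fats: sector[16],
        fat_size_sectors: le_u32(sector, 36)?,
        root_cluster: le_u32(sector, 44)?,
    })
}

impl Bpb {
    /// First sector of the data region, relative to the start of the volume.
    pub fn first_data_sector(&self) -> u64 {
        self.reserved_sectors as u64 + self.num_fats as u64 * self.fat_size_sectors as u64
    }

    /// Map a cluster number to its first sector; clusters 0 and 1 are reserved.
    pub fn cluster_to_lba(&self, cluster: u32) -> Option<u64> {
        if cluster < 2 {
            return None;
        }
        Some(self.first_data_sector() + (cluster as u64 - 2) * self.sectors_per_cluster as u64)
    }
}
```

Because every read goes through `from_le_bytes` on individual bytes, the same code works regardless of where the sector buffer lands in memory.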

The parser crawls directory clusters, matches standard 8.3 filenames in their on-disk form (11 uppercase, space-padded bytes with no dot, so fibonacci.nda is truncated and stored as FIBONACCNDA), and loads file data cluster by cluster.
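The name conversion can be sketched as a small pure function. This version assumes plain truncation of the base name to 8 characters, which matches the fibonacci.nda example; it does not generate numeric tails such as FIBONA~1 and ignores long-filename entries. `to_short_name` is a hypothetical name.

```rust
/// Convert a user-facing name like "fibonacci.nda" into the 11-byte,
/// space-padded, uppercase form stored in a FAT directory entry.
/// Returns None for non-ASCII names, empty base names, or extensions over 3 bytes.
pub fn to_short_name(name: &str) -> Option<[u8; 11]> {
    // Split on the last dot; a name without a dot has an empty extension.
    let (base, ext) = match name.rsplit_once('.') {
        Some((b, e)) => (b, e),
        None => (name, ""),
    };
    if base.is_empty() || ext.len() > 3 || !name.is_ascii() {
        return None;
    }
    let mut out = [b' '; 11]; // start fully space-padded
    // zip stops at the shorter side, so base names longer than 8 bytes are truncated.
    for (dst, src) in out[..8].iter_mut().zip(base.bytes()) {
        *dst = src.to_ascii_uppercase();
    }
    for (dst, src) in out[8..].iter_mut().zip(ext.bytes()) {
        *dst = src.to_ascii_uppercase();
    }
    Some(out)
}
```

A directory lookup can then compare the first 11 bytes of each 32-byte directory entry against this array directly.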

Here is the layer stack showing how raw PCIe disk blocks are parsed and cached:

Fig 1: The bare-metal storage and caching hierarchy, from the PCIe bus through the NVMe controller and the FAT32 parser to the Cold Context cache.

// Shell console call dynamically reading from NVMe disk
let file_bytes = fat::read_file("NEURAL_N.NDA")?;

Fixing the Deadlocks & Calling Conventions

During integration, I hit a critical boot-time freeze: the serial COM1 logger (serial.rs) deadlocked when mirroring print logs to the GUI log buffer.

I resolved this by rewriting add_log to bypass the high-level print! macros and write directly through SERIAL_COM1.lock(), so the serial lock is never requested again while it is already held.
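The shape of that fix can be sketched on a host machine. This is not the kernel code: `std::sync::Mutex` stands in for the kernel's spin lock, a `Vec<u8>` stands in for the UART, and `GUI_LOG` and `serial_write_raw` are hypothetical names. The point it illustrates is that each lock is taken once, released before the next one is taken, and no print-style macro runs while a lock is held.

```rust
use std::sync::Mutex;

// Host-side stand-ins (assumptions): the real kernel would guard a COM1 port
// with a no_std spin lock rather than std::sync::Mutex.
static SERIAL_COM1: Mutex<Vec<u8>> = Mutex::new(Vec::new());
static GUI_LOG: Mutex<Vec<String>> = Mutex::new(Vec::new());

/// Low-level serial write: takes the serial lock exactly once and never
/// calls back into the logging path.
fn serial_write_raw(msg: &str) {
    let mut port = SERIAL_COM1.lock().unwrap();
    port.extend_from_slice(msg.as_bytes());
    port.push(b'\n');
} // serial lock released here

/// Mirror a message to the GUI log buffer and to the serial port.
/// The two locks are held one after the other, never nested.
pub fn add_log(msg: &str) {
    {
        let mut log = GUI_LOG.lock().unwrap();
        log.push(msg.to_string());
    } // GUI log lock released before touching the serial port
    serial_write_raw(msg);
}
```

If `add_log` instead called a print macro that itself locks the serial port and mirrors into `add_log`, a non-reentrant lock would be requested twice on the same call path, which is the kind of freeze described above.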

Furthermore, I fixed a JIT stack crash: on #![no_std] UEFI targets, the JIT assembler was emitting calls that passed arguments in System V registers (RDI/RSI/RDX/RCX), while UEFI uses the Microsoft x64 calling convention. I updated the compiler's target mapping to pass arguments in the Microsoft x64 registers (RCX/RDX/R8/R9) when target_os = "uefi" is set.
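A target-dependent register mapping of this kind might look like the sketch below. The `Abi` enum and its methods are hypothetical and only illustrate the two conventions; the register orders and the 32-byte shadow space are the documented System V AMD64 and Microsoft x64 rules.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Abi {
    SystemV,
    Win64,
}

impl Abi {
    /// Pick the convention for the target this code is compiled for.
    /// UEFI firmware follows the Microsoft x64 convention.
    pub fn for_target() -> Abi {
        if cfg!(any(target_os = "uefi", target_os = "windows")) {
            Abi::Win64
        } else {
            Abi::SystemV
        }
    }

    /// Integer argument registers, in argument order.
    pub fn int_arg_regs(self) -> &'static [&'static str] {
        match self {
            Abi::SystemV => &["rdi", "rsi", "rdx", "rcx", "r8", "r9"],
            Abi::Win64 => &["rcx", "rdx", "r8", "r9"],
        }
    }

    /// Bytes of stack the caller must reserve for the callee before each call.
    pub fn shadow_space(self) -> usize {
        match self {
            Abi::SystemV => 0,
            Abi::Win64 => 32,
        }
    }
}
```

Note that the first argument moves from RDI to RCX and that Microsoft x64 also requires the shadow space, so remapping registers alone is not enough if the emitted code calls into firmware or Rust functions.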

Pascal's Verification: Cold Context on the NVMe Drive

I launched QEMU with a virtual 64MB NVMe drive containing my compiled .nda programs. The bare-metal shell successfully ran ls to list NVMe files and executed run fibonacci.nda dynamically from disk.

This filesystem integration was about more than just loading files—it allowed the JIT VM and the model to query and use the active codebase directly as context without CPU overhead.

By combining the FAT32 driver with the Merkle root sitemap caching, the entire written codebase sitting on the NVMe drive acts as a virtual "Cold Context". The active task in memory represents the "Hot Context", and the system hot-swaps relevant code blocks in and out on demand.

Pascal CESCATO noted when reviewing this demand-paging context model:

"The site-map + NDA hot-swap into buffers is essentially a demand-paging system for model context — you load what the current reasoning step needs, not the entire history. The NVMe drive as long-term context window is the right abstraction: infinite effective context, bounded active memory, deterministic access patterns via the triple graph."

By linking my FAT32 driver directly to the JIT VM, I could load, compile, and execute modules dynamically from NVMe sectors in microseconds.

But I was still operating in a text-only serial terminal. I needed a graphical interface.

In the next post, I'll document how I built the swappable double-buffered GUI engines and the Synaptic Canvas force-directed GUI compositor.

Discussion

What's your experience writing bare-metal driver software in Rust? What are the trickiest elements of PCI discovery and NVMe queue mapping without an underlying OS? Let's discuss in the comments below!

Special thanks to Pascal CESCATO for helping me realign calling conventions and resolve serial lock deadlocks.

Disclaimer: AI was used throughout this project, so it is only fitting that it co-authors with me. Special thanks to the Foundry for its tireless hours of toil and to Gemini for producing the cover image.