Up until this point, I had built an incredible JIT compiler, but it was still running on top of Windows.
If I wanted true zero-allocation, microsecond execution, I had to control the hardware page tables, the instruction pipeline, and the CPU registers directly. I needed to write my own operating system.
We are building a bare-metal, self-healing operating system running entirely inside the CPU's L3 cache. Here is the roadmap for this 12-part series:
The V.E.L.O.C.I.T.Y.-OS 12-Part Roadmap
On Saturday morning, June 27th, the sprint to bare metal began.
Step 1: The UEFI Bootloader
I created a new sub-crate, velocity-bootloader, configured as a #![no_std] and #![no_main] application.
The bootloader runs as a UEFI application, using the uefi crate to query firmware interfaces, set up console logging, and allocate initial memory pages.
But at its core, V.E.L.O.C.I.T.Y.-OS is a Single-Address-Space Operating System (SASOS). I don't want to run inside the restricted UEFI firmware environment. I want to exit boot services and reclaim the processor.
Step 2: Transitioning to Ring 0
To safely exit UEFI, I implemented three core modules:
- The Heap Allocator (allocator.rs): Before calling exit_boot_services(), I pre-allocated a contiguous 16MB block of conventional RAM pages from UEFI. I initialized my own global heap allocator (linked_list_allocator::LockedHeap) on this block, so dynamic heap operations (vectors, maps) keep working after boot services terminate.
- The GDT and Task State Segment (gdt.rs): I configured flat 64-bit kernel code and data segments. I set up the Task State Segment (TSS) with an Interrupt Stack Table (IST) entry that gives double-fault exceptions a dedicated stack, so a kernel stack overflow cannot escalate into a triple fault and reset the CPU.
Here is the GDT and TSS setup in src/gdt.rs. It reserves the double-fault handler stack, loads the GDT, and reloads the segment selectors:
// velocity-bootloader/src/gdt.rs: GDT & TSS setup
// (x86_64 crate API with `add_entry`; newer releases rename it to `append`.)
use core::ptr::addr_of;
use x86_64::instructions::segmentation::{Segment, CS, DS, SS};
use x86_64::instructions::tables::load_tss;
use x86_64::structures::gdt::{Descriptor, GlobalDescriptorTable};
use x86_64::structures::tss::TaskStateSegment;
use x86_64::VirtAddr;

pub const DOUBLE_FAULT_IST_INDEX: u16 = 0;
const DOUBLE_FAULT_STACK_SIZE: usize = 4096 * 5;

static mut TSS: TaskStateSegment = TaskStateSegment::new();
static mut GDT: GlobalDescriptorTable = GlobalDescriptorTable::new();
static mut DOUBLE_FAULT_STACK: [u8; DOUBLE_FAULT_STACK_SIZE] = [0; DOUBLE_FAULT_STACK_SIZE];

/// Call once during kernel initialization, before interrupts are enabled.
pub fn init() {
    unsafe {
        // Separate stack for the double-fault handler so a kernel stack overflow
        // cannot escalate into a triple fault. addr_of! avoids creating
        // references to `static mut` items.
        let stack_start = VirtAddr::from_ptr(addr_of!(DOUBLE_FAULT_STACK));
        // Stacks grow downward, so the IST entry points at the top of the buffer.
        let stack_end = stack_start + DOUBLE_FAULT_STACK_SIZE as u64;
        TSS.interrupt_stack_table[DOUBLE_FAULT_IST_INDEX as usize] = stack_end;

        // Flat 64-bit kernel code/data segments plus the TSS descriptor.
        let mut gdt = GlobalDescriptorTable::new();
        let code_selector = gdt.add_entry(Descriptor::kernel_code_segment());
        let data_selector = gdt.add_entry(Descriptor::kernel_data_segment());
        let tss_selector = gdt.add_entry(Descriptor::tss_segment(&*addr_of!(TSS)));
        GDT = gdt;
        (*addr_of!(GDT)).load(); // load() requires a &'static table

        // Reload segment registers so they no longer point into the firmware's GDT.
        CS::set_reg(code_selector);
        DS::set_reg(data_selector);
        SS::set_reg(data_selector);
        load_tss(tss_selector);
    }
}
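The code above asks the x86_64 crate for `Descriptor::kernel_code_segment()` and `kernel_data_segment()`. To make those descriptors less opaque, here is a minimal host-side sketch that packs a legacy 8-byte segment descriptor by hand and derives the selectors for the entry order used in gdt.rs. The helper names (`encode_descriptor`, `selector`) are hypothetical. The access bytes 0x9A/0x92 and flag nibbles 0xA/0xC are the conventional flat kernel values. Depending on its version, the x86_64 crate may also set the "accessed" bit, so its raw values can differ in that one bit.

```rust
/// Pack a legacy 8-byte GDT segment descriptor.
/// `access` is the access byte (P, DPL, S, type); `flags` is the G/DB/L/AVL nibble.
fn encode_descriptor(base: u32, limit: u32, access: u8, flags: u8) -> u64 {
    let base = base as u64;
    let limit = (limit & 0xF_FFFF) as u64; // the limit field is 20 bits wide
    (limit & 0xFFFF)                         // bits 0..16:  limit[0..16]
        | ((base & 0xFF_FFFF) << 16)         // bits 16..40: base[0..24]
        | ((access as u64) << 40)            // bits 40..48: access byte
        | (((limit >> 16) & 0xF) << 48)      // bits 48..52: limit[16..20]
        | (((flags & 0xF) as u64) << 52)     // bits 52..56: flags nibble
        | (((base >> 24) & 0xFF) << 56)      // bits 56..64: base[24..32]
}

/// A segment selector is the GDT index shifted left by 3, OR'd with the RPL.
fn selector(index: u16, rpl: u16) -> u16 {
    (index << 3) | (rpl & 0b11)
}

// 0x9A = present | ring 0 | code/data segment | executable | readable.
// 0xA  = 4 KiB granularity | long-mode (L) bit.
const KERNEL_CODE_ACCESS: u8 = 0x9A;
const KERNEL_CODE_FLAGS: u8 = 0xA;
// 0x92 = present | ring 0 | code/data segment | writable.
// 0xC  = 4 KiB granularity | 32-bit default size (ignored for data in 64-bit mode).
const KERNEL_DATA_ACCESS: u8 = 0x92;
const KERNEL_DATA_FLAGS: u8 = 0xC;
```

Feeding the flat-segment values through it gives 0x00AF9A000000FFFF for kernel code and 0x00CF92000000FFFF for kernel data. With the null descriptor at index 0, the selectors come out as 0x08 (code), 0x10 (data), and 0x18 (TSS). The TSS descriptor is a 16-byte system descriptor, so it occupies two slots, indices 3 and 4.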
- Interrupt Descriptors (interrupts.rs): I initialized the IDT, remapping the 8259 PIC interrupts to offsets 0x20 and 0x28. I wrote custom interrupt service routines (ISRs) for IRQ 0 (Timer), IRQ 1 (PS/2 Keyboard), and IRQ 4 (COM1 Serial).
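In 64-bit mode the CPU reserves vectors 0-31 for exceptions. The legacy BIOS mapping of the master PIC puts IRQ 0-7 at vectors 0x08-0x0F, so a timer tick would arrive looking like a double fault (vector 8). Remapping to 0x20 and 0x28 moves the 16 legacy IRQs to 0x20-0x2F.

The sketch below is a host-side model of the standard ICW1-ICW4 initialization sequence. Instead of executing `out` instructions, it returns the (port, byte) writes, so the logic can be checked without hardware. The function names are hypothetical; crates such as pic8259 (`ChainedPics`) wrap the same sequence. It omits saving and restoring the interrupt masks and the short I/O delays some older chipsets need.

```rust
// I/O ports of the two cascaded 8259 PICs.
const PIC1_CMD: u16 = 0x20;
const PIC1_DATA: u16 = 0x21;
const PIC2_CMD: u16 = 0xA0;
const PIC2_DATA: u16 = 0xA1;

const ICW1_INIT_WITH_ICW4: u8 = 0x11; // begin initialization; ICW4 will follow
const ICW4_8086_MODE: u8 = 0x01;
const EOI: u8 = 0x20; // non-specific end-of-interrupt command

/// (port, byte) writes that remap the master PIC to `offset1` and the slave to `offset2`.
/// On hardware, each pair would be a single `out` instruction.
fn pic_remap_sequence(offset1: u8, offset2: u8) -> Vec<(u16, u8)> {
    vec![
        (PIC1_CMD, ICW1_INIT_WITH_ICW4),
        (PIC2_CMD, ICW1_INIT_WITH_ICW4),
        (PIC1_DATA, offset1), // ICW2: master vector base
        (PIC2_DATA, offset2), // ICW2: slave vector base
        (PIC1_DATA, 0x04),    // ICW3: a slave is attached to master IRQ 2 (bit mask)
        (PIC2_DATA, 0x02),    // ICW3: the slave's cascade identity is 2
        (PIC1_DATA, ICW4_8086_MODE),
        (PIC2_DATA, ICW4_8086_MODE),
    ]
}

/// IDT vector that a legacy IRQ line is delivered on after remapping.
fn irq_vector(irq: u8, offset1: u8, offset2: u8) -> u8 {
    if irq < 8 { offset1 + irq } else { offset2 + (irq - 8) }
}

/// Writes an ISR must issue before returning; slave IRQs (8-15) need an EOI on both chips.
fn eoi_sequence(irq: u8) -> Vec<(u16, u8)> {
    if irq >= 8 {
        vec![(PIC2_CMD, EOI), (PIC1_CMD, EOI)]
    } else {
        vec![(PIC1_CMD, EOI)]
    }
}
```

With offsets 0x20 and 0x28, the timer (IRQ 0), keyboard (IRQ 1), and COM1 (IRQ 4) handlers land at IDT vectors 0x20, 0x21, and 0x24. Each ISR has to send its EOI before returning. Otherwise the PIC leaves that line's in-service bit set and blocks it, along with lower-priority lines.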
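Here is the handoff point where control moves from UEFI boot services to our own bare-metal kernel:
// Exiting boot services and taking raw CPU control.
// exit_boot_services() consumes the boot-time SystemTable: once it returns, UEFI's
// allocator, console, and other boot services are gone; only runtime services remain.
// The exact signature varies between uefi crate releases; this is the older form
// that takes the image handle and a memory-map buffer and returns a Result.
let (runtime_table, memory_map) = system_table
    .exit_boot_services(image_handle, &mut map_buf)
    .expect("exit_boot_services failed");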
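Once exit_boot_services() returns, UEFI's page allocator is gone. From then on, the 16MB block reserved in allocator.rs is the only heap. The post uses linked_list_allocator::LockedHeap, which keeps a free list and reuses freed blocks. The minimal sketch below instead swaps in a simpler bump allocator. It is only meant to show the two steps involved: rounding the heap size up to whole 4 KiB UEFI pages, and carving aligned blocks out of a fixed region. `BumpHeap` and `pages_for` are hypothetical names, and the addresses are plain integers standing in for the pre-allocated pages.

```rust
const PAGE_SIZE: usize = 4096; // UEFI pages are 4 KiB
const HEAP_SIZE: usize = 16 * 1024 * 1024; // the 16 MiB block from allocator.rs

/// Whole pages needed to cover `bytes`, rounded up.
fn pages_for(bytes: usize) -> usize {
    (bytes + PAGE_SIZE - 1) / PAGE_SIZE
}

/// Bump allocator over a fixed region: hands out aligned addresses and never frees.
/// (A simplified stand-in for a linked-list allocator, which also reuses freed blocks.)
struct BumpHeap {
    start: usize, // base address of the reserved region
    size: usize,  // region length in bytes
    next: usize,  // offset of the first unused byte
}

impl BumpHeap {
    fn new(start: usize, size: usize) -> Self {
        BumpHeap { start, size, next: 0 }
    }

    /// Address of a `size`-byte block aligned to `align` (a power of two),
    /// or None once the region is exhausted.
    fn alloc(&mut self, size: usize, align: usize) -> Option<usize> {
        assert!(align.is_power_of_two());
        let current = self.start.checked_add(self.next)?;
        let aligned = current.checked_add(align - 1)? & !(align - 1);
        let end = aligned.checked_add(size)?;
        if end > self.start + self.size {
            return None;
        }
        self.next = end - self.start;
        Some(aligned)
    }
}
```

A 16 MiB heap works out to 4096 UEFI pages. In the kernel itself, the allocator object is registered with `#[global_allocator]`, so that `Vec` and the map types from the `alloc` crate draw from the reserved block. LockedHeap plays that role in allocator.rs.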
The Bare-Metal Performance Gain
Running directly on the CPU in Ring 0, without OS scheduling traps or firmware polling overhead, resulted in a massive speedup:
- Fibonacci execution: dropped from 53M cycles under UEFI to 25M cycles bare-metal (a 2.1x speedup).
- Neural Net Layer GEMV: dropped from 55M cycles to 11M cycles (a 5.0x speedup).
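The speedup figures are ratios of cycle counts: 53M / 25M ≈ 2.1 and 55M / 11M = 5.0. The post does not say how the cycles were counted. One common approach on x86_64 is the time-stamp counter, read through the `_rdtsc` intrinsic (in `std::arch`, or `core::arch` inside a `no_std` kernel). The helper names below are hypothetical, and this is a sketch rather than the benchmark harness actually used.

```rust
/// Speedup of the new measurement relative to the old one, both in cycles.
fn speedup(before_cycles: u64, after_cycles: u64) -> f64 {
    before_cycles as f64 / after_cycles as f64
}

/// Approximate cycle count of `f` using the time-stamp counter (x86_64 only).
/// RDTSC is not a serializing instruction, and on modern CPUs the TSC ticks at a
/// constant reference rate rather than the current core clock, so treat the
/// result as an estimate.
#[cfg(target_arch = "x86_64")]
fn measure_cycles<F: FnOnce()>(f: F) -> u64 {
    use std::arch::x86_64::_rdtsc;
    let start = unsafe { _rdtsc() };
    f();
    let end = unsafe { _rdtsc() };
    end.wrapping_sub(start)
}
```

For example, `speedup(53_000_000, 25_000_000)` gives 2.12 and `speedup(55_000_000, 11_000_000)` gives 5.0, which matches the two bullets.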
The entire kernel compiled down to less than 6MB, allowing the entire operating system to fit and run directly inside the CPU's L3 cache!
Pascal's Analysis: The Bootstrapping Legend
When I shared the QEMU boot logs, Pascal linked the design choices to classic computer science:
"Bare-metal NDA without dependencies means... the first NDA interpreter has to be written in something else — assembly or a minimal C stub — to pull itself up by its own bootstraps. That's the same path Forth took in the 70s, and it's still the cleanest approach for a self-hosting language at bare metal."
Pascal noted that by combining Merkle validation with a bare-metal kernel, the system was cryptographically secure by construction: if the boot code's Merkle root didn't validate, that code would never be executed.
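The post does not show how the Merkle check is wired into the boot path. Here is a hypothetical sketch of the idea:

1. Split the boot image into fixed-size chunks.
2. Hash each chunk into a leaf.
3. Hash pairs of nodes until one root remains.
4. Before handing over control, compare that root against a trusted value.

The loader software performs this comparison; the processor does not check anything by itself. `std`'s `DefaultHasher` stands in for a real hash function only so the sketch runs without dependencies. It is not a cryptographic hash, its 64-bit output is far too short to resist forgery, and its output is not guaranteed to stay the same across Rust releases. A real boot chain would use something like SHA-256, with the trusted root stored where an attacker cannot rewrite it. The function names are illustrative.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Stand-in hash. NOT cryptographic; the tag byte separates leaves (0) from
/// internal nodes (1) so a leaf can never be confused with an internal node.
fn hash_node(tag: u8, data: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    h.write_u8(tag);
    h.write(data);
    h.finish()
}

/// Merkle root over `image`, split into `chunk_size`-byte leaves.
fn merkle_root(image: &[u8], chunk_size: usize) -> u64 {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    let mut level: Vec<u64> = image.chunks(chunk_size).map(|c| hash_node(0, c)).collect();
    if level.is_empty() {
        level.push(hash_node(0, &[])); // an empty image still gets a defined root
    }
    // Combine pairs of nodes level by level until a single root remains.
    while level.len() > 1 {
        level = level
            .chunks(2)
            .map(|pair| {
                let left = pair[0];
                let right = *pair.get(1).unwrap_or(&left); // an odd node is paired with itself
                let mut buf = [0u8; 16];
                buf[..8].copy_from_slice(&left.to_le_bytes());
                buf[8..].copy_from_slice(&right.to_le_bytes());
                hash_node(1, &buf)
            })
            .collect();
    }
    level[0]
}

/// The loader would jump into the image only if this returns true.
fn verify_boot_image(image: &[u8], chunk_size: usize, trusted_root: u64) -> bool {
    merkle_root(image, chunk_size) == trusted_root
}
```

Flipping a single byte anywhere in the image changes its leaf hash and every node above it, so the recomputed root no longer matches the trusted one.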
But a bare-metal kernel is useless without disk storage. I needed to write drivers to read files from NVMe drives.
In the next post, I'll document how I wrote a PCI configuration scanner, an NVMe block storage driver, and a custom FAT32 filesystem from scratch.
Discussion
Have you written UEFI bootloaders or OS kernels in Rust? What are the biggest hurdles you faced when exiting UEFI boot services and transitioning control to your custom GDT and IDT? Let's discuss in the comments below!
Special thanks to Pascal for grounding my bare-metal sprint in the historical wisdom of Forth and Lisp machines.
Disclaimer: AI was used throughout this project. It is only fitting that it co-authored this post with me, so special thanks to the Foundry for its tireless hours toiling away and to Gemini for producing the cover image.
