DEV Community

RoTSL

Posted on • Originally published at rotsl.Medium on

Bypassing the OS to Run LLMs: What I Learned Building a Firmware-Centric Runtime

I spent the last few months asking a question that sounds slightly unhinged: what happens if you strip the operating system out of the LLM inference loop?

Not metaphorically. I mean literally. Remove the Linux page cache, the IOMMU, the CUDA Runtime wrappers, the framework dispatchers. Demote the host OS to an interrupt router. Let the GPU handle its own memory faults, its own scheduling, its own DMA transfers.

The result is NexusRT – an alpha-stage firmware-equivalent runtime that sits below PyTorch, TensorFlow, and JAX, talking directly to the CUDA Driver API and (on Apple Silicon) the Metal API. It is not a framework. It is a runtime that asks, “how much latency can you actually remove when the LLM pipeline owns the hardware?”

This is what I found.

The Stack Is Thicker Than You Think

Most of us do not think about the layers between our Python code and the GPU silicon. We write model.forward(), PyTorch dispatches to CUDA, and somewhere in a dark room in Santa Clara, a tensor core lights up.

But the stack is deep:

┌────────────────────────────────────┐
│ Application (PyTorch / TF / JAX)   │
├────────────────────────────────────┤
│ Framework dispatcher, autograd     │
├────────────────────────────────────┤
│ CUDA Runtime (cudart)              │
├────────────────────────────────────┤
│ CUDA Driver (libcuda)              │
├────────────────────────────────────┤
│ Host OS (page cache, IOMMU, IRQ)   │
├────────────────────────────────────┤
│ GPU firmware (closed, NVIDIA-only) │
└────────────────────────────────────┘

Each layer adds latency. Each layer makes assumptions – about memory layout, about scheduling fairness, about who gets to touch the DMA engine. The OS, in particular, insists on being the memory authority. It wants to page things in and out. It wants to validate every IOMMU mapping. It wants to schedule interrupts.

For most workloads, this is fine. The overhead is noise. But for LLM inference at scale – where you are chasing microseconds on token generation, where KV-cache residency is the entire game – that noise compounds.
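To put a number on why KV-cache residency matters, here is a back-of-the-envelope sizing sketch. It is not NexusRT code. The model shape (32 layers, 32 KV heads, head dimension 128, 16-bit elements) is an assumed 7B-class configuration chosen only for illustration, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes needed to keep keys and values resident for every layer."""
    # Factor of 2: one tensor for keys, one for values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Assumed, illustrative model shape (not taken from NexusRT).
per_token = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=1)
full_ctx = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)

print(f"per token:   {per_token / 2**10:.0f} KiB")  # 512 KiB
print(f"4096 tokens: {full_ctx / 2**30:.1f} GiB")   # 2.0 GiB
```

Under these assumptions a single 4096-token sequence pins 2 GiB of cache, so any layer that can evict or remap that memory behind the runtime's back shows up directly in token latency.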

NexusRT collapses the stack to this:

┌────────────────────────────────────┐
│ Application (C ABI / Python)       │
├────────────────────────────────────┤
│ NexusRT micro-kernel               │
│   firmware-equivalent boot         │
│   GPU-driven virtual memory        │
│   warp-specialized task graph      │
│   GDS / GRDMA / TMA / ILC          │
├────────────────────────────────────┤
│ CUDA Driver API / Metal API        │
├────────────────────────────────────┤
│ GPU firmware (vendor)              │
└────────────────────────────────────┘

The host OS is still there. It just stopped being in charge.

What “Firmware-Equivalent” Actually Means

I want to be clear: I am not modifying NVIDIA firmware. That is impossible without signing keys I do not have and would not use if I did.

NexusRT implements a firmware-equivalent micro-kernel in user-space. It uses only the lowest publicly available CUDA Driver APIs – cuMemAddressReserve, cuMemMap, cuStreamCreateWithPriority, cuTensorMapEncodeTiled on Hopper, doorbell-style sync via cuStreamWaitValue32. The same APIs NVIDIA exposes to anyone who reads the driver documentation.
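The doorbell idea is easier to see in a host-side toy. The sketch below is plain Python threading, not CUDA: a waiter blocks until a shared word reaches a target value, loosely mirroring the greater-or-equal wait mode of cuStreamWaitValue32. The `Doorbell` class and its method names are hypothetical, and the toy uses a plain comparison rather than whatever wraparound rules the driver applies.

```python
import threading


class Doorbell:
    """Host-side stand-in for a 32-bit word that a stream waits on."""

    def __init__(self):
        self._value = 0
        self._cond = threading.Condition()

    def ring(self, value):
        # Producer side: publish a new value and wake any waiters.
        with self._cond:
            self._value = value & 0xFFFFFFFF
            self._cond.notify_all()

    def wait_geq(self, target, timeout=None):
        # Consumer side: block until the word is >= target.
        with self._cond:
            return self._cond.wait_for(lambda: self._value >= target, timeout)


bell = Doorbell()
log = []


def consumer():
    bell.wait_geq(3)  # nothing runs until step 3 has been published
    log.append("stage ran")


worker = threading.Thread(target=consumer)
worker.start()
for step in (1, 2, 3):
    log.append(f"ring {step}")
    bell.ring(step)
worker.join()
print(log)
```

The point of the pattern is that the consumer never asks a scheduler for permission. It simply watches a memory word, and the producer releases it by writing to that word.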

The trick is how you compose them. Instead of letting the OS manage virtual memory, NexusRT reserves GPU virtual address space and maps it directly. Instead of framework-managed DMA, it uses prioritized async streams with custom GDS-style paths. Page faults are handled by GPU-resident threads reading from a fault buffer in HBM – the DREAM approach, adapted for LLM workloads.

It is not black magic. It is just using the driver API the way the driver API was designed to be used, before frameworks added their comfort blankets on top.
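The reserve-then-map flow with a fault buffer can be modeled in a few lines. This is a pure-Python simulation with no driver calls. In a real runtime the reservation and mapping steps would go through calls such as cuMemAddressReserve and cuMemMap, and the drain loop would run on the GPU. `ToyAddressSpace`, its methods, and the 2 MiB page size are all assumptions made for illustration.

```python
from collections import deque

PAGE = 2 * 1024 * 1024  # assumed 2 MiB granularity, for illustration only


class ToyAddressSpace:
    """Reserve a virtual range up front, back pages only when touched."""

    def __init__(self, reserved_bytes):
        self.n_pages = reserved_bytes // PAGE
        self.mapped = {}             # virtual page index -> physical page id
        self.fault_buffer = deque()  # faults queued for the resident handler
        self._next_phys = 0

    def touch(self, addr):
        page = addr // PAGE
        if not 0 <= page < self.n_pages:
            raise IndexError("address outside the reserved range")
        if page not in self.mapped:
            self.fault_buffer.append(page)  # record the fault, do not block
            return False
        return True

    def drain_faults(self):
        # Stand-in for the resident handler loop: map each faulting page once.
        handled = 0
        while self.fault_buffer:
            page = self.fault_buffer.popleft()
            if page not in self.mapped:
                self.mapped[page] = self._next_phys
                self._next_phys += 1
                handled += 1
        return handled


space = ToyAddressSpace(reserved_bytes=8 * PAGE)
first = space.touch(3 * PAGE + 100)   # unmapped: queues a fault
handled = space.drain_faults()        # handler backs the page
second = space.touch(3 * PAGE + 100)  # now resident
print(first, handled, second, len(space.mapped))  # False 1 True 1
```

The design choice being illustrated is the separation of concerns: address space is claimed once, physical backing arrives lazily, and the component that resolves faults is a consumer of a queue rather than the host kernel.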

The Hardware Reality

NexusRT is not a toy. It targets real silicon:

Target             Memory              TMA   ILC   Notes
NVIDIA A100 SXM    40/80 GB HBM2e      No    No    Async-copy + warp queues
NVIDIA H100 SXM    80 GB HBM3          Yes   Yes   TMA + Thread Block Clusters
NVIDIA T4 / P100   14–16 GB            No    No    Kaggle smoke validation
Apple M1 Pro       16–32 GB unified    n/a   n/a   Metal / MLX path

A100 and H100 are the research targets. T4 and P100 are what I can actually afford to test on – Kaggle gives me free GPU hours, and I use every one of them. The Apple path is there because I develop on an M1 Pro Mac, and unified memory on Apple Silicon is genuinely interesting for smaller models.
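One way to picture a capability table like this in code is a lookup keyed on the device name the driver reports. The sketch below is an assumption about how such a lookup could work, not a description of NexusRT internals. `Profile`, `PROFILES`, and `resolve_profile` are hypothetical names, and the rows simply restate the table above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Profile:
    memory: str
    tma: Optional[bool]  # None means "not applicable"
    ilc: Optional[bool]
    notes: str


# Rows restate the hardware table; keys are substrings of a device name.
PROFILES = {
    "A100": Profile("40/80 GB HBM2e", False, False, "Async-copy + warp queues"),
    "H100": Profile("80 GB HBM3", True, True, "TMA + Thread Block Clusters"),
    "T4": Profile("14–16 GB", False, False, "Kaggle smoke validation"),
    "P100": Profile("14–16 GB", False, False, "Kaggle smoke validation"),
    "M1 Pro": Profile("16–32 GB unified", None, None, "Metal / MLX path"),
}


def resolve_profile(device_name):
    """Return the first profile whose key appears in the reported device name."""
    lowered = device_name.lower()
    for key, profile in PROFILES.items():
        if key.lower() in lowered:
            return profile
    raise LookupError(f"no profile for {device_name!r}")


for name in ("NVIDIA H100 80GB HBM3", "Tesla T4", "Apple M1 Pro"):
    p = resolve_profile(name)
    print(f"{name}: tma={p.tma}, ilc={p.ilc}, notes={p.notes}")
```

Keeping the flags as data rather than scattered conditionals means a new target is one more row, and code paths such as TMA can be gated on a single field.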

The Kaggle notebooks are real. They run. They build from source, link against the CUDA driver, and exercise the runtime on actual T4 hardware. The logs are in the repo. The T4 results are measured, not projected from a spreadsheet.

A Minimal Example

Here is what using NexusRT looks like from Python:

import nexusrt as nrt

# Initialize the firmware-equivalent layer
dev = nrt.firmware.init(profile="auto")

# Allocate HBM-resident tensor with GPU-driven virtual memory
buf = nrt.memory.alloc(shape=(4096, 4096), dtype="bf16", ilc=True)

# Build a pipeline stage contract
stage = nrt.scheduler.stage(
    name="infer.transformer_block_0",
    inputs=[buf],
    outputs=[],
    token_budget=4096,
    sm_footprint_mb=64,
)

nrt.scheduler.submit(stage)
nrt.scheduler.wait_barrier()

The Python layer is intentionally thin. It is a control plane. The C++ core owns the runtime – the memory management, the scheduling, the kernel submission. This is not a Python library that calls into CUDA via PyTorch. This is a C++ runtime that exposes a C ABI, with Python bindings for convenience.
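For readers who have not seen a thin binding over a C ABI, the general pattern looks like the sketch below. It uses the standard ctypes module against the C library already loaded in the process, as a stand-in for a runtime's shared object. It says nothing about how NexusRT's own bindings are built, assumes a POSIX system (where `ctypes.CDLL(None)` works), and `c_strlen` is a hypothetical wrapper name.

```python
import ctypes

# Load the process's own C library as a stand-in for a runtime's shared object.
libc = ctypes.CDLL(None)

# Declare the C signature explicitly; ctypes otherwise assumes int everywhere.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(text):
    """Thin Python wrapper: encode, cross the C ABI, return a Python int."""
    return libc.strlen(text.encode("utf-8"))


print(c_strlen("nexusrt"))  # 7
```

All the real work happens on the C side of the call. The Python function only marshals arguments and results, which is what "control plane" means in practice.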

Why This Exists

I am not trying to replace PyTorch. PyTorch is excellent. It has saved thousands of researchers from writing CUDA kernels by hand.

But there is a class of experiments where the question is not “which API do I call?” but “what happens if I remove the API entirely?” What is the actual floor on inference latency when you control scheduling, memory movement, token-cache residency, and GPU work submission directly?

NexusRT is for that class of experiments. It is a research runtime. It is alpha. It will break. The benchmarks on Kaggle T4 are modest because T4 is modest. The A100/H100 numbers are projections based on validated hardware profiles, not measured yet – because I do not have an A100 in my living room.

That is the honest state of it.

Where It Goes From Here

The repo is open. The code is MIT-licensed. The architecture docs are in docs/architecture.md. The research lineage – DREAM, KOKARYOKU, TrainMover – is documented in docs/research.md.

I am currently focused on:

  1. Warp-specialized task graphs: getting the scheduler to reason about SM occupancy the way a firmware scheduler would, not a framework scheduler.
  2. KV-cache pruning: integrating the ICM layered context work into the token optimization path.
  3. TMA/ILC on Hopper: cuTensorMapEncodeTiled is available. Using it correctly is the hard part.
  4. More Kaggle validation, because free GPU time is the only GPU time I have right now.

If you have an A100 or H100 and want to run the smoke tests, I would genuinely love to see the results. If you think this is a terrible idea, I would also love to hear why — the whole point is to test assumptions.

Try It

git clone https://github.com/rotsl/nexusrt.git
cd nexusrt

python3 -m venv .venv
source .venv/bin/activate
python -m pip install -U pip cmake ninja pytest

# CUDA path
cmake -S packaging -B packaging/build -GNinja \
  -DCMAKE_BUILD_TYPE=Release \
  -DNEXUSRT_ENABLE_CUDA=ON \
  -DNEXUSRT_BUILD_TESTS=ON

cmake --build packaging/build -j
nexusrt-bench --stage detect

Or grab a pre-built artifact from the GitHub Actions workflow. The CUDA artifact is disabled by default (it installs the toolkit during the run), but you can enable it if you need it.

Final Thought

I keep coming back to one thing: the gap between what the hardware can do and what the software stack lets you do is enormous. Frameworks are necessary. They make ML accessible. But they also make assumptions — about latency tolerance, about memory semantics, about who owns the DMA engine — that are not universal.

NexusRT is my attempt to see what lives in that gap. It is not a product. It is an experiment. But experiments are how we find out where the floor actually is.

If you are curious about the floor too, take a look.

MIT Licensed. Research attributions to DREAM, KOKARYOKU, and TrainMover are documented in docs/research.md.
