<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Krish Singaria</title>
    <description>The latest articles on DEV Community by Krish Singaria (@krish_singaria).</description>
    <link>https://dev.to/krish_singaria</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3824929%2Fea11c110-d0f9-46df-b7ed-14db7d8664b6.jpg</url>
      <title>DEV Community: Krish Singaria</title>
      <link>https://dev.to/krish_singaria</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/krish_singaria"/>
    <language>en</language>
    <item>
      <title>How I bypassed PyTorch OOM errors with a Zero-Copy C++ Graph Engine</title>
      <dc:creator>Krish Singaria</dc:creator>
      <pubDate>Sun, 15 Mar 2026 06:07:22 +0000</pubDate>
      <link>https://dev.to/krish_singaria/how-i-bypassed-pytorch-oom-errors-with-a-zero-copy-c-graph-engine-2983</link>
      <guid>https://dev.to/krish_singaria/how-i-bypassed-pytorch-oom-errors-with-a-zero-copy-c-graph-engine-2983</guid>
      <description>&lt;p&gt;If you have ever tried to train a Graph Neural Network (GNN) on a massive dataset, you already know the pain of the "Memory Wall."&lt;/p&gt;

&lt;p&gt;Loading a dataset like Papers100M into PyTorch Geometric almost always ends the exact same way on a standard machine: an instant 24GB+ Out-Of-Memory (OOM) allocation crash. Standard libraries try to load the entire edge list and feature matrix into RAM before moving it to the GPU.&lt;/p&gt;

&lt;p&gt;I got tired of my laptop crashing, so I built GraphZero (v0.2.0): a custom C++ data engine that bypasses system RAM entirely and streams datasets natively from the SSD.&lt;/p&gt;

&lt;p&gt;Here is how I built a zero-copy pipeline that lets PyTorch train on 30GB of data while allocating 0 bytes of RAM.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgn78wur1cgy780x1doze.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgn78wur1cgy780x1doze.png" alt="graphzero" width="800" height="786"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F77sh2tt9anxfigjhtfph.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F77sh2tt9anxfigjhtfph.png" alt="PyG" width="800" height="238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🧠 The Architecture: mmap and Zero-Copy&lt;br&gt;
The core philosophy of GraphZero is simple: let the Operating System do the heavy lifting.&lt;/p&gt;

&lt;p&gt;Instead of parsing CSVs into Python lists or Pandas DataFrames, GraphZero compiles raw data into two heavily optimized binary formats:&lt;/p&gt;

&lt;p&gt;.gl files: store the graph topology (edge lists).&lt;/p&gt;

&lt;p&gt;.gd files: store the node features, using strict C++ template dispatching to enforce memory layouts (like FLOAT32 or INT64).&lt;/p&gt;
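
&lt;p&gt;To make that concrete, here is a rough sketch, in plain NumPy rather than GraphZero's actual compiler, of what a fixed-layout feature file like a .gd could look like: a small header holding the shape, followed by the raw row-major float32 matrix. The header fields and their sizes are assumptions for illustration only.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

# Hypothetical .gd-style layout: [num_rows int64][num_cols int64][float32 matrix].
# The real GraphZero header format may differ; this only illustrates the idea.
def write_features(path, features):
    features = np.ascontiguousarray(features, dtype=np.float32)  # enforce layout
    with open(path, "wb") as f:
        np.array(features.shape, dtype=np.int64).tofile(f)  # 16-byte header
        features.tofile(f)                                   # raw row-major payload

# Example: 1,000 nodes with 128-dimensional features
write_features("toy_features.gd", np.random.rand(1000, 128))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Because every value has a fixed size and offset, the byte position of any row can be computed directly, which is exactly what makes memory-mapping the file (next) possible without any parsing.&lt;/p&gt;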

&lt;p&gt;Once compiled, the engine uses POSIX mmap to memory-map the binary files. Using nanobind, we hand the raw C++ pointers directly to PyTorch as zero-copy NumPy arrays.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;graphzero&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;gz&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Mount the zero-copy engine
&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gz&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FeatureStore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;papers100M_features.gd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Instantly map SSD data to PyTorch (RAM used: 0 Bytes)
&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_numpy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_tensor&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Feature Tensor: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
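
&lt;p&gt;You can reproduce the same zero-copy effect with nothing but the standard toolbox: np.memmap is a rough stand-in for what the engine's mmap path does. This snippet assumes the toy header layout sketched earlier; GraphZero's real .gd format may differ.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np
import torch

# Read the (assumed) 16-byte header, then map the payload without copying it.
rows, cols = np.fromfile("toy_features.gd", dtype=np.int64, count=2)
X_map = np.memmap(
    "toy_features.gd",
    dtype=np.float32,
    mode="c",          # copy-on-write: demand-paged from disk, writable in RAM
    offset=16,         # skip the header
    shape=(int(rows), int(cols)),
)

X = torch.from_numpy(X_map)  # wraps the mapping; still no bulk copy
print(X.shape, X.dtype, X_map.flags["OWNDATA"])  # OWNDATA is False: it is a view
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;torch.from_numpy shares the underlying buffer, so the "tensor" is really a window onto the file. GraphZero's FeatureStore does the same thing one level lower, with its own mmap call and nanobind handing the raw pointer across the language boundary.&lt;/p&gt;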



&lt;p&gt;⚡ The Execution: OS Page Faults and OpenMP&lt;br&gt;
During a training loop (like GraphSAGE), PyTorch thinks it has a massive 50GB tensor sitting in RAM.&lt;/p&gt;

&lt;p&gt;When the neural network requests a batch of target nodes, it indexes the mapped tensor. Touching rows that are not yet resident triggers OS page faults, and the operating system fetches only the required 4KB pages from the NVMe drive.&lt;/p&gt;
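
&lt;p&gt;In practice a training step only ever materializes the rows it touches. Here is a minimal sketch of that mini-batch step, reusing the mapped tensor X from the snippets above; the batch size and device are arbitrary.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Pick a random batch of target nodes.
node_ids = torch.randint(0, X.shape[0], (1024,))

# Fancy indexing copies just these 1,024 rows into a fresh in-RAM tensor;
# the OS only faults in the pages that back those rows.
batch = X[node_ids]

# Only the small batch ever travels to the GPU.
if torch.cuda.is_available():
    batch = batch.to("cuda", non_blocking=True)
&lt;/code&gt;&lt;/pre&gt;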

&lt;p&gt;To keep the pipeline saturated, the C++ engine uses OpenMP multi-threading for neighbor sampling (batch_random_fanout). Because this happens in C++, we release the Python GIL, allowing disk I/O, CPU sampling, and GPU math to run perfectly in parallel.&lt;/p&gt;
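
&lt;p&gt;Because the sampling call drops the GIL, you can overlap it with GPU work from plain Python. Below is a sketch of that overlap using a background thread; sample_batch and train_step are stand-ins (I am not assuming the exact signature of batch_random_fanout), and time.sleep stands in for real work.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import time
from concurrent.futures import ThreadPoolExecutor

def sample_batch():
    # Stand-in for a call into the C++ sampler (e.g. batch_random_fanout).
    # A GIL-free extension call here runs in parallel with train_step below.
    time.sleep(0.05)  # pretend: disk I/O + OpenMP neighbor sampling
    return "batch"

def train_step(batch):
    time.sleep(0.05)  # pretend: forward/backward on the GPU

with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(sample_batch)        # prefetch the first batch
    for _ in range(100):
        batch = pending.result()               # wait for batch N if needed
        pending = pool.submit(sample_batch)    # immediately start batch N+1
        train_step(batch)                      # overlaps with the sampler thread
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Swap the stubs for the real engine calls and the structure stays the same: the sampler thread keeps the SSD and CPU busy while the main thread feeds the GPU.&lt;/p&gt;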

&lt;p&gt;🚀 Try it out&lt;br&gt;
Building GraphZero forced me to dive deep into low-level memory management, CI/CD matrix builds, and Python C-bindings.&lt;/p&gt;

&lt;p&gt;If you want to train GNNs without melting your RAM, check out the repository. It includes an end-to-end GraphSAGE training script with a synthetic dataset generator so you can test the zero-copy mounting locally.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/KrishSingaria/graphzero" rel="noopener noreferrer"&gt;github repo&lt;/a&gt;&lt;br&gt;
I would love any harsh technical feedback on the C++ architecture, or the API design!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>performance</category>
      <category>cpp</category>
      <category>python</category>
    </item>
  </channel>
</rss>
