<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Krish Singaria</title>
    <description>The latest articles on DEV Community by Krish Singaria (@krish_singaria).</description>
    <link>https://dev.to/krish_singaria</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3824929%2Fea11c110-d0f9-46df-b7ed-14db7d8664b6.jpg</url>
      <title>DEV Community: Krish Singaria</title>
      <link>https://dev.to/krish_singaria</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/krish_singaria"/>
    <language>en</language>
    <item>
      <title>How I bypassed PyTorch OOM errors with a Zero-Copy C++ Graph Engine</title>
      <dc:creator>Krish Singaria</dc:creator>
      <pubDate>Sun, 15 Mar 2026 06:07:22 +0000</pubDate>
      <link>https://dev.to/krish_singaria/how-i-bypassed-pytorch-oom-errors-with-a-zero-copy-c-graph-engine-2983</link>
      <guid>https://dev.to/krish_singaria/how-i-bypassed-pytorch-oom-errors-with-a-zero-copy-c-graph-engine-2983</guid>
      <description>&lt;p&gt;If you have ever tried to train a Graph Neural Network (GNN) on a massive dataset, you already know the pain of the "Memory Wall."&lt;/p&gt;

&lt;p&gt;Loading a dataset like Papers100M into PyTorch Geometric almost always ends the exact same way on a standard machine: an instant 24GB+ Out-Of-Memory (OOM) allocation crash. Standard libraries try to load the entire edge list and feature matrix into RAM before moving it to the GPU.&lt;/p&gt;

&lt;p&gt;I got tired of my laptop crashing, so I built GraphZero (v0.2.0): a custom C++ data engine that bypasses system RAM entirely and streams datasets natively from the SSD.&lt;/p&gt;

&lt;p&gt;Here is how I built a zero-copy pipeline that lets PyTorch train on 30GB of data while allocating 0 bytes of RAM.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgn78wur1cgy780x1doze.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgn78wur1cgy780x1doze.png" alt="graphzero" width="800" height="786"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F77sh2tt9anxfigjhtfph.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F77sh2tt9anxfigjhtfph.png" alt="PyG" width="800" height="238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🧠 The Architecture: mmap and Zero-Copy&lt;br&gt;
The core philosophy of GraphZero is simple: let the Operating System do the heavy lifting.&lt;/p&gt;

&lt;p&gt;Instead of parsing CSVs into Python lists or Pandas DataFrames, GraphZero compiles raw data into two heavily optimized binary formats:&lt;/p&gt;

&lt;p&gt;.gl files: store the graph topology (edge lists).&lt;/p&gt;

&lt;p&gt;.gd files: store the node features, using strict C++ template dispatching to enforce memory layouts (like FLOAT32 or INT64).&lt;/p&gt;
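
&lt;p&gt;To make that concrete, here is a rough sketch, in plain NumPy rather than GraphZero's actual compiler, of what a fixed-layout feature file like a .gd could look like: a small header holding the shape, followed by the raw row-major float32 matrix. The header fields and their sizes are assumptions for illustration only.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

# Hypothetical .gd-style layout: [num_rows int64][num_cols int64][float32 matrix].
# The real GraphZero header format may differ; this only illustrates the idea.
def write_features(path, features):
    features = np.ascontiguousarray(features, dtype=np.float32)  # enforce layout
    with open(path, "wb") as f:
        np.array(features.shape, dtype=np.int64).tofile(f)  # 16-byte header
        features.tofile(f)                                   # raw row-major payload

# Example: 1,000 nodes with 128-dimensional features
write_features("toy_features.gd", np.random.rand(1000, 128))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Because every value has a fixed size and offset, the byte position of any row can be computed directly, which is exactly what makes memory-mapping the file (next) possible without any parsing.&lt;/p&gt;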

&lt;p&gt;Once compiled, the engine uses POSIX mmap to memory-map the binary files. Using nanobind, we hand the raw C++ pointers directly to PyTorch as zero-copy NumPy arrays.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;graphzero&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;gz&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Mount the zero-copy engine
&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gz&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FeatureStore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;papers100M_features.gd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Instantly map SSD data to PyTorch (RAM used: 0 Bytes)
&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_numpy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_tensor&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Feature Tensor: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
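
&lt;p&gt;You can reproduce the same zero-copy effect with nothing but the standard toolbox: np.memmap is a rough stand-in for what the engine's mmap path does. This snippet assumes the toy header layout sketched earlier; GraphZero's real .gd format may differ.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np
import torch

# Read the (assumed) 16-byte header, then map the payload without copying it.
rows, cols = np.fromfile("toy_features.gd", dtype=np.int64, count=2)
X_map = np.memmap(
    "toy_features.gd",
    dtype=np.float32,
    mode="c",          # copy-on-write: demand-paged from disk, writable in RAM
    offset=16,         # skip the header
    shape=(int(rows), int(cols)),
)

X = torch.from_numpy(X_map)  # wraps the mapping; still no bulk copy
print(X.shape, X.dtype, X_map.flags["OWNDATA"])  # OWNDATA is False: it is a view
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;torch.from_numpy shares the underlying buffer, so the "tensor" is really a window onto the file. GraphZero's FeatureStore does the same thing one level lower, with its own mmap call and nanobind handing the raw pointer across the language boundary.&lt;/p&gt;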



&lt;p&gt;⚡ The Execution: OS Page Faults and OpenMP&lt;br&gt;
During a training loop (like GraphSAGE), PyTorch thinks it has a massive 50GB tensor sitting in RAM.&lt;/p&gt;

&lt;p&gt;When the neural network requests a batch of target nodes, it indexes the mapped tensor. Touching rows that are not yet resident triggers OS page faults, and the operating system fetches only the required 4KB pages from the NVMe drive.&lt;/p&gt;
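
&lt;p&gt;In practice a training step only ever materializes the rows it touches. Here is a minimal sketch of that mini-batch step, reusing the mapped tensor X from the snippets above; the batch size and device are arbitrary.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Pick a random batch of target nodes.
node_ids = torch.randint(0, X.shape[0], (1024,))

# Fancy indexing copies just these 1,024 rows into a fresh in-RAM tensor;
# the OS only faults in the pages that back those rows.
batch = X[node_ids]

# Only the small batch ever travels to the GPU.
if torch.cuda.is_available():
    batch = batch.to("cuda", non_blocking=True)
&lt;/code&gt;&lt;/pre&gt;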

&lt;p&gt;To keep the pipeline saturated, the C++ engine uses OpenMP multi-threading for neighbor sampling (batch_random_fanout). Because this happens in C++, we release the Python GIL, allowing disk I/O, CPU sampling, and GPU math to run perfectly in parallel.&lt;/p&gt;
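
&lt;p&gt;Because the sampling call drops the GIL, you can overlap it with GPU work from plain Python. Below is a sketch of that overlap using a background thread; sample_batch and train_step are stand-ins (I am not assuming the exact signature of batch_random_fanout), and time.sleep stands in for real work.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import time
from concurrent.futures import ThreadPoolExecutor

def sample_batch():
    # Stand-in for a call into the C++ sampler (e.g. batch_random_fanout).
    # A GIL-free extension call here runs in parallel with train_step below.
    time.sleep(0.05)  # pretend: disk I/O + OpenMP neighbor sampling
    return "batch"

def train_step(batch):
    time.sleep(0.05)  # pretend: forward/backward on the GPU

with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(sample_batch)        # prefetch the first batch
    for _ in range(100):
        batch = pending.result()               # wait for batch N if needed
        pending = pool.submit(sample_batch)    # immediately start batch N+1
        train_step(batch)                      # overlaps with the sampler thread
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Swap the stubs for the real engine calls and the structure stays the same: the sampler thread keeps the SSD and CPU busy while the main thread feeds the GPU.&lt;/p&gt;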

&lt;p&gt;🚀 Try it out&lt;br&gt;
Building GraphZero forced me to dive deep into low-level memory management, CI/CD matrix builds, and Python C-bindings.&lt;/p&gt;

&lt;p&gt;If you want to train GNNs without melting your RAM, check out the repository. It includes an end-to-end GraphSAGE training script with a synthetic dataset generator so you can test the zero-copy mounting locally.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/KrishSingaria/graphzero" rel="noopener noreferrer"&gt;github repo&lt;/a&gt;&lt;br&gt;
I would love any harsh technical feedback on the C++ architecture, or the API design!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>performance</category>
      <category>cpp</category>
      <category>python</category>
    </item>
  </channel>
</rss>
