<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: whomi928</title>
    <description>The latest articles on DEV Community by whomi928 (@whomi928).</description>
    <link>https://dev.to/whomi928</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4007145%2F4b8ce8e7-bb7b-4aca-9d70-c44542d4d695.png</url>
      <title>DEV Community: whomi928</title>
      <link>https://dev.to/whomi928</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/whomi928"/>
    <language>en</language>
    <item>
      <title>I Built a Neural Network Inference Engine From Scratch in C++ (No PyTorch, No ONNX, Just AVX2)</title>
      <dc:creator>whomi928</dc:creator>
      <pubDate>Mon, 29 Jun 2026 02:30:40 +0000</pubDate>
      <link>https://dev.to/whomi928/i-built-a-neural-network-inference-engine-from-scratch-in-c-no-pytorch-no-onnx-just-avx2-57gm</link>
      <guid>https://dev.to/whomi928/i-built-a-neural-network-inference-engine-from-scratch-in-c-no-pytorch-no-onnx-just-avx2-57gm</guid>
      <description>&lt;h2&gt;
  
  
  Why does inference need a framework at all?
&lt;/h2&gt;

&lt;p&gt;Every time I ran a tiny linear model through PyTorch, I felt like I was driving a go-kart with a jet engine strapped to it. The model was a few hundred KB. PyTorch's runtime was gigabytes. Somewhere between &lt;code&gt;model(x)&lt;/code&gt; and the actual floating-point math, an entire universe of abstraction — autograd graphs, dispatch layers, tensor metadata — was quietly eating my CPU cycles.&lt;/p&gt;

&lt;p&gt;So I asked a simple question: &lt;strong&gt;what does inference actually look like with nothing in the way?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That question turned into &lt;a href="https://github.com/whomi928/ML-model-loader" rel="noopener noreferrer"&gt;ML-model-loader&lt;/a&gt; — a bare-metal C++ inference engine that loads raw binary weights and runs forward passes directly on the CPU, using the same low-level techniques that power &lt;code&gt;ggml&lt;/code&gt; and &lt;code&gt;llama.cpp&lt;/code&gt;: cache-tiled GEMM, AVX2 SIMD intrinsics, and INT8 quantization.&lt;/p&gt;

&lt;p&gt;No PyTorch. No ONNX Runtime. No GPU. Just C++, some pointer arithmetic, and a CPU that's faster than people give it credit for.&lt;/p&gt;

&lt;h2&gt;
  
  
  The architecture
&lt;/h2&gt;

&lt;p&gt;The pipeline is intentionally minimal — two stages, one handoff:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ Python Training (Colab) ]
          |
          | exports
          v
[ multi_model_weights.bin ]   (FP32 binary weight dump)
[ quantized_weights.bin   ]   (INT8 quantized weights)
          |
          | loaded by
          v
[ ML_loader_3.cpp ]
  ├── Weight loader (binary deserialization)
  ├── GEMM kernel (cache-tiled, AVX2)
  ├── INT8 quantization runtime
  └── Chrono benchmarking
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Training stays in Python because there's no point reinventing backprop. But the moment the model is trained, it gets exported to a flat binary file — just layer dimensions followed by raw FP32 arrays — and from there, Python never touches the inference path again.&lt;/p&gt;
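
&lt;p&gt;To make "layer dimensions followed by raw FP32 arrays" concrete, here is a minimal Python sketch of one possible layout. The header (a layer count, then a rows/cols pair per layer, all 32-bit integers in native byte order) and the names &lt;code&gt;pack_weights&lt;/code&gt; and &lt;code&gt;unpack_weights&lt;/code&gt; are assumptions for illustration, not the repo's actual format.&lt;/p&gt;

```python
import struct


def pack_weights(layers):
    """Serialize [(rows, cols, flat_values), ...] as dimensions followed by raw FP32."""
    # Hypothetical header: number of layers as a 32-bit int.
    # "=" means native byte order with standard sizes (little-endian on x86-64).
    blob = struct.pack("=i", len(layers))
    for rows, cols, values in layers:
        if len(values) != rows * cols:
            raise ValueError("weight count does not match layer dimensions")
        blob += struct.pack("=ii", rows, cols)
        blob += struct.pack(f"={rows * cols}f", *values)
    return blob


def unpack_weights(blob):
    """Inverse of pack_weights: walk the buffer with a moving byte offset."""
    (count,) = struct.unpack_from("=i", blob, 0)
    offset = 4
    layers = []
    for _ in range(count):
        rows, cols = struct.unpack_from("=ii", blob, offset)
        offset += 8
        values = struct.unpack_from(f"={rows * cols}f", blob, offset)
        offset += 4 * rows * cols
        layers.append((rows, cols, list(values)))
    return layers
```

&lt;p&gt;A C++ loader for a layout like this reads the same fields in the same order straight into its weight buffers, which is why no parsing library is needed on the inference side.&lt;/p&gt;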

&lt;h2&gt;
  
  
  The actual bottleneck: it's not the math, it's the cache
&lt;/h2&gt;

&lt;p&gt;A naive triple-nested-loop matrix multiply does O(N³) work, and on any model bigger than a toy example its access pattern thrashes the L1 cache. In a row-major layout, the inner loop walks down a column of one operand, so consecutive accesses land on different cache lines, and those lines are evicted before they are reused. The CPU spends more time waiting on memory than doing arithmetic.&lt;/p&gt;

&lt;p&gt;The fix is &lt;strong&gt;cache tiling&lt;/strong&gt;: instead of multiplying full rows and columns, you break the matrices into small blocks, roughly 64×64 in this engine (16 KB per FP32 tile), sized to stay resident in L1 cache. The inner multiply loop then works almost entirely on hot data, so most cache misses during the GEMM go away. This one change is usually the single biggest performance lever in CPU-bound inference, bigger than any individual instruction-level trick.&lt;/p&gt;
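
&lt;p&gt;The loop structure is easier to see in code. This is a pure-Python sketch of a tiled multiply over flat row-major lists. The function name and the loop order are my own choices rather than the engine's kernel, and Python will not show the speedup: the point is only which elements each block touches.&lt;/p&gt;

```python
def matmul_tiled(a, b, m, k, n, tile=64):
    """Return C = A * B for flat row-major lists; A is m x k, B is k x n."""
    c = [0.0] * (m * n)
    for i0 in range(0, m, tile):
        i_end = min(i0 + tile, m)
        for p0 in range(0, k, tile):
            p_end = min(p0 + tile, k)
            for j0 in range(0, n, tile):
                j_end = min(j0 + tile, n)
                # Inside this block only one tile each of A, B and C is touched,
                # so a compiled version keeps all three hot in cache.
                for i in range(i0, i_end):
                    for p in range(p0, p_end):
                        a_ip = a[i * k + p]
                        for j in range(j0, j_end):
                            c[i * n + j] += a_ip * b[p * n + j]
    return c
```

&lt;p&gt;The innermost loop runs over &lt;code&gt;j&lt;/code&gt;, so both &lt;code&gt;b&lt;/code&gt; and &lt;code&gt;c&lt;/code&gt; are read contiguously, which is also the shape a SIMD inner loop wants.&lt;/p&gt;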

&lt;h2&gt;
  
  
  Then: feeding the cores 8 floats at a time
&lt;/h2&gt;

&lt;p&gt;Once memory stops being the bottleneck, the next lever is the ALU. Scalar code multiplies one float, adds one float, one instruction at a time. AVX2 lets you do better:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;__m256&lt;/span&gt; &lt;span class="n"&gt;acc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm256_setzero_ps&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;acc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm256_fmadd_ps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weight_vec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_vec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;acc&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// 8 floats, fused multiply-add, one instruction&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;_mm256_fmadd_ps&lt;/code&gt; performs a fused multiply-add across 8 floats simultaneously. On paper that's an 8× speedup on the compute-bound inner loop — in practice you don't get the full 8× because memory bandwidth and tiling overhead eat into it, but it's still a massive win over scalar code. Combined with cache tiling, this is what took the FP32 forward pass down to roughly &lt;strong&gt;8ms&lt;/strong&gt; for a 10→512→512→128→10 network — no GPU involved.&lt;/p&gt;
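
&lt;p&gt;If the lane-wise accumulation is hard to picture, this scalar Python model shows what the vectorized dot product computes: eight independent partial sums, reduced once at the end. It is an emulation for illustration only. The name &lt;code&gt;dot_8_lanes&lt;/code&gt; and the scalar tail for lengths that are not a multiple of 8 are assumptions, not code from the repo.&lt;/p&gt;

```python
LANES = 8  # one 256-bit AVX2 register holds eight FP32 values


def dot_8_lanes(weights, inputs):
    """Scalar model of a dot product accumulated in eight independent lanes."""
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    acc = [0.0] * LANES  # stands in for _mm256_setzero_ps()
    full = len(weights) - len(weights) % LANES
    for base in range(0, full, LANES):
        # One _mm256_fmadd_ps would update all eight lanes in a single instruction.
        for lane in range(LANES):
            acc[lane] += weights[base + lane] * inputs[base + lane]
    total = sum(acc)  # horizontal sum, done once after the loop
    for i in range(full, len(weights)):  # scalar tail for leftover elements
        total += weights[i] * inputs[i]
    return total
```

&lt;p&gt;Because the eight lanes sum in a different order than a plain left-to-right loop, the floating-point result can differ from scalar code in the last few bits.&lt;/p&gt;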

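&lt;p&gt;One detail that matters more than people expect: all weight buffers are allocated with &lt;code&gt;_mm_malloc&lt;/code&gt; for 32-byte alignment (memory obtained this way must be released with &lt;code&gt;_mm_free&lt;/code&gt;). Aligned loads such as &lt;code&gt;_mm256_load_ps&lt;/code&gt; fault on an address that isn't 32-byte aligned, and unaligned loads that straddle a cache line cost extra, so it's a one-line fix that's easy to forget.&lt;/p&gt;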

&lt;h2&gt;
  
  
  Squeezing further: INT8 quantization
&lt;/h2&gt;

&lt;p&gt;FP32 weights are 4 bytes per value. For large weight matrices, that's a lot of memory bandwidth spent just moving numbers around, and bandwidth, not compute, is often the real ceiling. Quantizing to INT8 cuts that by 4×.&lt;/p&gt;
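
&lt;p&gt;For the network benchmarked in this post, the arithmetic works out as follows. This small sketch counts weights only and assumes biases are negligible, which is my simplification.&lt;/p&gt;

```python
LAYER_SIZES = [10, 512, 512, 128, 10]  # the network benchmarked in this post


def weight_count(sizes):
    """Number of weights in the fully connected layers (biases not counted)."""
    return sum(fan_in * fan_out for fan_in, fan_out in zip(sizes, sizes[1:]))


weights = weight_count(LAYER_SIZES)
fp32_bytes = weights * 4  # 4 bytes per FP32 weight
int8_bytes = weights * 1  # 1 byte per INT8 weight

print(weights, fp32_bytes, int8_bytes)  # 334080 1336320 334080
```

&lt;p&gt;That is roughly 1.3 MB of weights in FP32 against about 0.33 MB in INT8, with the 512×512 layer accounting for most of it.&lt;/p&gt;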

&lt;p&gt;The scheme here is symmetric per-tensor quantization — about as simple as quantization gets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
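&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="c1"&gt;# W is the FP32 weight tensor as a NumPy array with at least one nonzero entry&lt;/span&gt;
&lt;span class="n"&gt;scale&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;127&lt;/span&gt;
&lt;span class="n"&gt;W_int8&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;W&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;int8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;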
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At inference time, the quantized weights run through &lt;code&gt;_mm256_madd_epi16&lt;/code&gt;, processing integer vectors instead of floats. That intrinsic multiplies signed 16-bit lanes and adds adjacent products into 32-bit sums, so the INT8 values have to be widened to 16 bits before the multiply. The FP32 result is recovered by dequantizing after accumulation. That took the same network down to roughly &lt;strong&gt;5ms&lt;/strong&gt;, a meaningful drop on top of an already-fast FP32 path, mostly from the reduced memory traffic rather than from integer math being inherently faster here.&lt;/p&gt;
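
&lt;p&gt;Here is the quantize, accumulate, dequantize round trip as a pure-Python sketch. It assumes the inputs are quantized with the same symmetric scheme as the weights so the accumulation can stay in integers. That choice, along with the names &lt;code&gt;quantize_symmetric&lt;/code&gt; and &lt;code&gt;int8_dot&lt;/code&gt;, is mine for illustration and may not match what the engine does.&lt;/p&gt;

```python
def quantize_symmetric(values):
    """Per-tensor symmetric quantization: one scale, integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0  # all-zero tensor: any scale works
    # Python's round() rounds exact halves to the nearest even integer.
    return [round(v / scale) for v in values], scale


def int8_dot(weights, inputs):
    """Integer multiply-accumulate followed by a single dequantization step."""
    w_q, w_scale = quantize_symmetric(weights)
    x_q, x_scale = quantize_symmetric(inputs)
    acc = sum(w * x for w, x in zip(w_q, x_q))  # exact integer accumulation
    return acc * w_scale * x_scale  # back to floating point once, at the end
```

&lt;p&gt;With one scale per tensor the dequantization is a single multiply per output, but one large outlier weight stretches the scale and costs precision for every other value in that tensor.&lt;/p&gt;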

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model Architecture&lt;/th&gt;
&lt;th&gt;Precision&lt;/th&gt;
&lt;th&gt;Inference Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;10→512→512→128→10 (Linear NN)&lt;/td&gt;
&lt;td&gt;FP32&lt;/td&gt;
&lt;td&gt;~8ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10→512→512→128→10 (Linear NN)&lt;/td&gt;
&lt;td&gt;INT8 (quantized)&lt;/td&gt;
&lt;td&gt;~5ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;(Benchmarked with &lt;code&gt;std::chrono&lt;/code&gt;, CPU only.)&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd still call unfinished
&lt;/h2&gt;

&lt;p&gt;This is deliberately a &lt;em&gt;learning&lt;/em&gt; engine, not a production one, and the roadmap reflects that honestly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Convolutional layers (2D GEMM tiling) — currently linear/fully-connected only&lt;/li&gt;
&lt;li&gt;Multi-threading across tiles via &lt;code&gt;std::thread&lt;/code&gt; or OpenMP — right now it's single-threaded, which leaves obvious performance on the table&lt;/li&gt;
&lt;li&gt;ONNX import, so models don't need a custom binary export step&lt;/li&gt;
&lt;li&gt;An ARM NEON port, since AVX2 ties this to x86-64 entirely&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try it yourself
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/whomi928/ML-model-loader
&lt;span class="nb"&gt;cd &lt;/span&gt;ML-model-loader
g++ &lt;span class="nt"&gt;-O3&lt;/span&gt; &lt;span class="nt"&gt;-mavx2&lt;/span&gt; &lt;span class="nt"&gt;-mfma&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; ML_loader_3 ML_loader_3.cpp
./ML_loader_3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll need a CPU with AVX2 (Intel Haswell/AMD Ryzen or newer) and &lt;code&gt;multi_model_weights.bin&lt;/code&gt; sitting next to the binary — there's an included Colab notebook that trains a small linear network and exports the weights file if you want to generate your own.&lt;/p&gt;

&lt;p&gt;If you've ever wanted to see what's &lt;em&gt;actually&lt;/em&gt; happening underneath a &lt;code&gt;model.forward()&lt;/code&gt; call — no autograd, no dispatch tables, just memory layout and instruction throughput — this is a fun rabbit hole to fall into. The repo's linked below, and the &lt;code&gt;ggml&lt;/code&gt;/&lt;code&gt;llama.cpp&lt;/code&gt; projects are worth a read if you want to see these same ideas taken much, much further.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/whomi928/ML-model-loader" rel="noopener noreferrer"&gt;github.com/whomi928/ML-model-loader&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;LinkedIn:&lt;/strong&gt; &lt;a href="http://www.linkedin.com/in/shaurya-aditya-0563a0377" rel="noopener noreferrer"&gt;www.linkedin.com/in/shaurya-aditya-0563a0377&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Shaurya Aditya — B.Tech ECE, IIT BHU&lt;/em&gt;&lt;/p&gt;

</description>
      <category>cpp</category>
      <category>machinelearning</category>
      <category>performance</category>
      <category>simd</category>
    </item>
  </channel>
</rss>
