I Built a Neural Network Inference Engine From Scratch in C++ (No PyTorch, No ONNX, Just AVX2)

#cpp #machinelearning #performance #simd

Why does inference need a framework at all?

Every time I ran a tiny linear model through PyTorch, I felt like I was driving a go-kart with a jet engine strapped to it. The model was a few hundred KB. PyTorch's runtime was gigabytes. Somewhere between model(x) and the actual floating-point math, an entire universe of abstraction — autograd graphs, dispatch layers, tensor metadata — was quietly eating my CPU cycles.

So I asked a simple question: what does inference actually look like with nothing in the way?

That question turned into ML-model-loader — a bare-metal C++ inference engine that loads raw binary weights and runs forward passes directly on the CPU, using the same low-level techniques that power ggml and llama.cpp: cache-tiled GEMM, AVX2 SIMD intrinsics, and INT8 quantization.

No PyTorch. No ONNX Runtime. No GPU. Just C++, some pointer arithmetic, and a CPU that's faster than people give it credit for.

The architecture

The pipeline is intentionally minimal — two stages, one handoff:

[ Python Training (Colab) ]
          |
          | exports
          v
[ multi_model_weights.bin ]   (FP32 binary weight dump)
[ quantized_weights.bin   ]   (INT8 quantized weights)
          |
          | loaded by
          v
[ ML_loader_3.cpp ]
  ├── Weight loader (binary deserialization)
  ├── GEMM kernel (cache-tiled, AVX2)
  ├── INT8 quantization runtime
  └── Chrono benchmarking

Training stays in Python because there's no point reinventing backprop. But the moment the model is trained, it gets exported to a flat binary file — just layer dimensions followed by raw FP32 arrays — and from there, Python never touches the inference path again.
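A loader for a flat format like this can be very small. The sketch below assumes a hypothetical layout: a 32-bit layer count, then for each layer two 32-bit integers (input and output width) followed by that many FP32 values in row-major order, all in native byte order. The real layout of multi_model_weights.bin may differ, and the names `Layer` and `load_weights` are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical in-memory form of one fully-connected layer.
struct Layer {
    std::int32_t in_dim = 0;
    std::int32_t out_dim = 0;
    std::vector<float> weights;  // out_dim * in_dim values, row-major
};

// Reads: [int32 layer_count], then per layer [int32 in][int32 out][float32 x in*out].
// This layout is an assumption for the sketch, not the repo's documented format.
std::vector<Layer> load_weights(const std::string& path) {
    std::ifstream file(path, std::ios::binary);
    if (!file) throw std::runtime_error("cannot open " + path);

    std::int32_t layer_count = 0;
    file.read(reinterpret_cast<char*>(&layer_count), sizeof(layer_count));
    if (!file || layer_count < 0) throw std::runtime_error("bad header");

    std::vector<Layer> layers(static_cast<std::size_t>(layer_count));
    for (Layer& layer : layers) {
        file.read(reinterpret_cast<char*>(&layer.in_dim), sizeof(layer.in_dim));
        file.read(reinterpret_cast<char*>(&layer.out_dim), sizeof(layer.out_dim));
        if (!file || layer.in_dim <= 0 || layer.out_dim <= 0)
            throw std::runtime_error("bad layer header");

        const std::size_t count = static_cast<std::size_t>(layer.in_dim) *
                                  static_cast<std::size_t>(layer.out_dim);
        layer.weights.resize(count);
        file.read(reinterpret_cast<char*>(layer.weights.data()),
                  static_cast<std::streamsize>(count * sizeof(float)));
        if (!file) throw std::runtime_error("truncated weight data");
    }
    return layers;
}
```

Reading each array with a single `read` call is what keeps a format like this cheap to load: there is no parsing step, only a copy from disk into the buffer the kernel will use.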

The actual bottleneck: it's not the math, it's the cache

A naive triple-nested-loop matrix multiply does O(N³) work, and on any model bigger than a toy example its access pattern thrashes the L1 cache. With row-major storage, walking down a column of the second matrix jumps a full row's width in memory on every step, so each access lands on a different cache line and evicts data you'll need again a few iterations later. The CPU spends more time waiting on memory than doing arithmetic.

The fix is cache tiling: instead of multiplying full rows and columns, you break the matrices into small blocks (roughly 64×64 in this engine, which is 16 KiB of FP32 data per tile) sized to stay resident in L1 cache. The inner multiply loop then runs almost entirely on hot data, and cache misses drop sharply because every value that gets loaded is reused many times before it is evicted. Tiling does not change the O(N³) operation count; it cuts the memory traffic per operation. This one change is usually the single biggest performance lever in CPU-bound inference, bigger than any individual instruction-level trick.
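Here is a minimal sketch of the idea, not the engine's actual kernel. It assumes row-major matrices and a `TILE` constant of 64 to match the block size mentioned above; `gemm_naive` and `gemm_tiled` are illustrative names. Both functions compute the same result, only the loop order and blocking differ.

```cpp
#include <algorithm>
#include <cstddef>

// Block edge length; 64 follows the article, but the best value depends on the CPU.
constexpr std::size_t TILE = 64;

// Reference version: C (M x N) += A (M x K) * B (K x N), all row-major.
// The innermost loop walks down a column of B, striding N floats per step.
void gemm_naive(const float* A, const float* B, float* C,
                std::size_t M, std::size_t K, std::size_t N) {
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < N; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < K; ++k)
                sum += A[i * K + k] * B[k * N + j];
            C[i * N + j] += sum;
        }
}

// Tiled version: the three outer loops pick a block, the three inner loops
// multiply within it, so the working set stays small and contiguous.
void gemm_tiled(const float* A, const float* B, float* C,
                std::size_t M, std::size_t K, std::size_t N) {
    for (std::size_t i0 = 0; i0 < M; i0 += TILE)
        for (std::size_t k0 = 0; k0 < K; k0 += TILE)
            for (std::size_t j0 = 0; j0 < N; j0 += TILE) {
                const std::size_t i_end = std::min(i0 + TILE, M);
                const std::size_t k_end = std::min(k0 + TILE, K);
                const std::size_t j_end = std::min(j0 + TILE, N);
                for (std::size_t i = i0; i < i_end; ++i)
                    for (std::size_t k = k0; k < k_end; ++k) {
                        const float a = A[i * K + k];  // reused across the whole j loop
                        for (std::size_t j = j0; j < j_end; ++j)
                            C[i * N + j] += a * B[k * N + j];  // unit-stride access
                    }
            }
}
```

The `std::min` calls handle matrix sizes that are not a multiple of the tile size, which matters for layers like 10→512 where one dimension is smaller than a single tile.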

Then: feeding the cores 8 floats at a time

Once memory stops being the bottleneck, the next lever is the ALU. Scalar code multiplies one float, adds one float, one instruction at a time. AVX2 lets you do better:

__m256 acc = _mm256_setzero_ps();
acc = _mm256_fmadd_ps(weight_vec, input_vec, acc); // 8 floats, fused multiply-add, one instruction

_mm256_fmadd_ps performs a fused multiply-add across 8 floats simultaneously. On paper that's an 8× speedup on the compute-bound inner loop — in practice you don't get the full 8× because memory bandwidth and tiling overhead eat into it, but it's still a massive win over scalar code. Combined with cache tiling, this is what took the FP32 forward pass down to roughly 8ms for a 10→512→512→128→10 network — no GPU involved.
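To show where that two-line snippet sits in a full loop, here is a sketch of a dot product built around `_mm256_fmadd_ps`. It is an illustration rather than the engine's kernel: the function names are hypothetical, and the `target` attribute and `__builtin_cpu_supports` check are GCC/Clang-specific additions so the example falls back to scalar code on a CPU without AVX2 or FMA.

```cpp
#include <cstddef>
#include <immintrin.h>

// Scalar reference: one multiply and one add per element.
float dot_scalar(const float* w, const float* x, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sum += w[i] * x[i];
    return sum;
}

// AVX2 + FMA version: 8 multiply-adds per instruction.
// The target attribute enables the intrinsics for this function only;
// it is redundant when compiling with -mavx2 -mfma.
__attribute__((target("avx2,fma")))
float dot_avx2(const float* w, const float* x, std::size_t n) {
    __m256 acc = _mm256_setzero_ps();
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        const __m256 weight_vec = _mm256_loadu_ps(w + i);  // works for any address
        const __m256 input_vec = _mm256_loadu_ps(x + i);
        acc = _mm256_fmadd_ps(weight_vec, input_vec, acc);  // acc += w * x in 8 lanes
    }
    // Horizontal sum of the 8 partial sums.
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float sum = 0.0f;
    for (float v : lanes) sum += v;
    // Scalar tail for lengths that are not a multiple of 8.
    for (; i < n; ++i) sum += w[i] * x[i];
    return sum;
}

// Uses the vector path only when the CPU reports AVX2 and FMA support.
float dot(const float* w, const float* x, std::size_t n) {
    if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma"))
        return dot_avx2(w, x, n);
    return dot_scalar(w, x, n);
}
```

Note that the vector version adds the products in a different order than the scalar loop, so on real weights the two results can differ in the last few bits. That is expected floating-point behaviour, not a bug.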

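One detail that matters more than people expect: all weight buffers are allocated with _mm_malloc for 32-byte alignment. Aligned loads such as _mm256_load_ps fault on a misaligned address, and unaligned loads that straddle a cache-line boundary cost extra cycles, so aligning the buffers is a one-line fix that's easy to forget.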
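A small sketch of what that allocation can look like. `_mm_malloc` and `_mm_free` are the real intrinsics-header functions; the `AlignedFloats` alias, the `MmFree` deleter, and the helper names are assumptions added here so the buffer frees itself.

```cpp
#include <cstddef>
#include <cstdint>
#include <immintrin.h>  // _mm_malloc, _mm_free
#include <memory>
#include <new>

// Memory from _mm_malloc must be released with _mm_free, not free() or delete.
struct MmFree {
    void operator()(float* p) const { _mm_free(p); }
};
using AlignedFloats = std::unique_ptr<float[], MmFree>;

// Allocates `count` floats on a 32-byte boundary, the width of one AVX2 register.
AlignedFloats alloc_aligned_floats(std::size_t count) {
    void* raw = _mm_malloc(count * sizeof(float), 32);
    if (!raw) throw std::bad_alloc();
    return AlignedFloats(static_cast<float*>(raw));
}

// True when the address is a multiple of 32 bytes.
bool is_aligned_32(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 32 == 0;
}
```

Wrapping the pointer in a `std::unique_ptr` with a custom deleter is one way to avoid the classic mistake of pairing `_mm_malloc` with the wrong deallocation function.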

Squeezing further: INT8 quantization

FP32 weights are 4 bytes per value. For large weight matrices, that's a lot of memory bandwidth spent just moving numbers around, and bandwidth, not compute, is often the real ceiling. Quantizing to INT8 cuts the weight traffic by 4×.
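To put numbers on that for the 10→512→512→128→10 network benchmarked in this post, the helper below counts weight-matrix entries for a chain of fully-connected layers. The function name is illustrative, and biases are left out to keep the arithmetic simple.

```cpp
#include <cstddef>
#include <vector>

// Sums in*out over consecutive layer widths, e.g. {10, 512, 512, 128, 10}.
// Bias vectors are deliberately not counted in this sketch.
std::size_t weight_count(const std::vector<std::size_t>& dims) {
    std::size_t total = 0;
    for (std::size_t i = 0; i + 1 < dims.size(); ++i)
        total += dims[i] * dims[i + 1];
    return total;
}
```

For `{10, 512, 512, 128, 10}` this gives 5,120 + 262,144 + 65,536 + 1,280 = 334,080 weights, which is 1,336,320 bytes in FP32 and 334,080 bytes in INT8.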

The scheme here is symmetric per-tensor quantization — about as simple as quantization gets:

scale = max(|W|) / 127
W_int8 = round(W / scale)
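In code, the two formulas above come out to something like the following. This is a sketch under a few assumptions: the struct and function names are hypothetical, the all-zero guard and the clamp are defensive additions that the formulas do not mention, and the engine's export script may do this step in Python instead.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct QuantizedTensor {
    std::vector<std::int8_t> values;
    float scale = 1.0f;  // one scale shared by the whole tensor
};

QuantizedTensor quantize_symmetric(const std::vector<float>& w) {
    // scale = max(|W|) / 127
    float max_abs = 0.0f;
    for (float v : w) max_abs = std::max(max_abs, std::fabs(v));

    QuantizedTensor q;
    // Guard against an all-zero tensor, where the scale would otherwise be 0.
    q.scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;

    q.values.reserve(w.size());
    for (float v : w) {
        // W_int8 = round(W / scale), clamped defensively to [-127, 127].
        long r = std::lround(v / q.scale);
        r = std::max(-127L, std::min(127L, r));
        q.values.push_back(static_cast<std::int8_t>(r));
    }
    return q;
}

// Recovers an approximate FP32 value from one quantized entry.
float dequantize(std::int8_t v, float scale) {
    return static_cast<float>(v) * scale;
}
```

Because the mapping is symmetric around zero, the value -128 is never produced, and the round-trip error for any weight is at most about half of `scale`.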

At inference time, the quantized weights run through _mm256_madd_epi16, which multiplies pairs of 16-bit integers and sums adjacent products into 32-bit lanes, so the INT8 values have to be widened to 16 bits before the multiply. The FP32 result is recovered by dequantizing after accumulation. That took the same network down to roughly 5ms, a meaningful drop on top of an already-fast FP32 path, mostly from the reduced memory traffic rather than from integer math being inherently faster here.
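One way the integer path described above can be written is sketched below. It assumes the activations are also quantized to INT8 with their own scale (the post only describes weight quantization), widens both operands with `_mm256_cvtepi8_epi16`, and uses the same GCC/Clang-specific `target` attribute and CPU check as a fallback mechanism. All function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <immintrin.h>

// Scalar reference: int8 x int8 products accumulated in int32.
std::int32_t dot_i8_scalar(const std::int8_t* w, const std::int8_t* x, std::size_t n) {
    std::int32_t sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += static_cast<std::int32_t>(w[i]) * static_cast<std::int32_t>(x[i]);
    return sum;
}

__attribute__((target("avx2")))
std::int32_t dot_i8_avx2(const std::int8_t* w, const std::int8_t* x, std::size_t n) {
    __m256i acc = _mm256_setzero_si256();
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        // Load 16 int8 values and sign-extend each one to int16.
        const __m256i w16 = _mm256_cvtepi8_epi16(
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(w + i)));
        const __m256i x16 = _mm256_cvtepi8_epi16(
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(x + i)));
        // Multiply int16 pairs and add neighbours: 16 products become 8 int32 sums.
        acc = _mm256_add_epi32(acc, _mm256_madd_epi16(w16, x16));
    }
    // Horizontal sum of the 8 int32 lanes, then a scalar tail.
    std::int32_t lanes[8];
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(lanes), acc);
    std::int32_t sum = 0;
    for (std::int32_t v : lanes) sum += v;
    for (; i < n; ++i)
        sum += static_cast<std::int32_t>(w[i]) * static_cast<std::int32_t>(x[i]);
    return sum;
}

// Dequantizes once, after integer accumulation: one float multiply per output.
float dot_quantized(const std::int8_t* w, float w_scale,
                    const std::int8_t* x, float x_scale, std::size_t n) {
    const std::int32_t acc = __builtin_cpu_supports("avx2") ? dot_i8_avx2(w, x, n)
                                                            : dot_i8_scalar(w, x, n);
    return static_cast<float>(acc) * w_scale * x_scale;
}
```

Unlike the FP32 kernel, integer accumulation is exact, so the vector and scalar paths return identical sums; all of the approximation error comes from the quantization step itself.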

Model Architecture	Precision	Inference Time
10→512→512→128→10 (Linear NN)	FP32	~8ms
10→512→512→128→10 (Linear NN)	INT8 (quantized)	~5ms

(Benchmarked with std::chrono, CPU only.)
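For anyone reproducing timings like these, a small harness along the following lines is one reasonable way to use std::chrono. It is a sketch, not the benchmark code from the repo: the `mean_ms` name, the warm-up call, and the averaging over several iterations are all assumptions.

```cpp
#include <chrono>
#include <cstddef>

// Runs `fn` `iterations` times and returns the mean wall-clock time per call in ms.
template <typename Fn>
double mean_ms(Fn&& fn, std::size_t iterations) {
    fn();  // warm-up call so cold caches and page faults do not skew the timing
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) fn();
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

`steady_clock` is used rather than `system_clock` because it never jumps backwards when the system time is adjusted. When timing a real forward pass, make sure the result is written somewhere the compiler cannot discard, or the optimizer may remove the work being measured.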

What I'd still call unfinished

This is deliberately a learning engine, not a production one, and the roadmap reflects that honestly:

Convolutional layers (2D GEMM tiling) — currently linear/fully-connected only
Multi-threading across tiles via std::thread or OpenMP — right now it's single-threaded, which leaves obvious performance on the table
ONNX import, so models don't need a custom binary export step
An ARM NEON port, since AVX2 ties this to x86-64 entirely

Try it yourself

git clone https://github.com/whomi928/ML-model-loader
cd ML-model-loader
g++ -O3 -mavx2 -mfma -o ML_loader_3 ML_loader_3.cpp
./ML_loader_3

You'll need a CPU with AVX2 (Intel Haswell/AMD Ryzen or newer) and multi_model_weights.bin sitting next to the binary — there's an included Colab notebook that trains a small linear network and exports the weights file if you want to generate your own.

If you've ever wanted to see what's actually happening underneath a model.forward() call — no autograd, no dispatch tables, just memory layout and instruction throughput — this is a fun rabbit hole to fall into. The repo's linked below, and the ggml/llama.cpp projects are worth a read if you want to see these same ideas taken much, much further.

Repo: github.com/whomi928/ML-model-loader
LinkedIn: www.linkedin.com/in/shaurya-aditya-0563a0377

Shaurya Aditya — B.Tech ECE, IIT BHU