INTRODUCTION
My background? I've spent the past 5 years doing game dev in C#. Last year I tried peeking under the hood of game engines, and that was when my obsession with low-level systems began.
I was craving a challenge, a heavy learning experience, so I decided to build a minimal ML runtime in C++.
I knew it was going to be hard, but I did not expect it to be this hard...
Today was the first day of this project. I tried implementing tensors and the two most basic tensor operations: add and matmul.
INITIAL STRUGGLES
What confused me the most was how ML runtimes interpret a tensor, which is quite different from what someone coming from the mathematical definition would expect.
Mathematically, tensors are just generalized higher-dimensional matrices, so it was natural for me to think I needed to somehow initialize a multidimensional nested array of floats. And that was when it hit me: how on Earth am I going to do that...?
After a bit of digging into PyTorch's monstrous source code, guess what... I understood nothing. Then I found this goldmine: https://github.com/Adam-Mazur/TinyTensor/blob/main/include/tensor.h. TinyTensor's Tensor class was not... tiny at all:
I'm not sure if massive classes like these are good from a design perspective, but anyway, the key to my confusion was right at the top:
Took me a solid hour to understand how this works but the moment it clicked, everything made sense...
From here I will try my best to explain how multidimensional arrays are actually stored in memory and also talk a bit about matmul.
NESTED ARRAY REPRESENTATION
Computers don’t know what a 4D tensor is. They only know how to store things in a straight line. In my C# days, I might have used a nested array like float[][][]. But in high-performance C++, that’s a nightmare: it scatters data all over the heap. Instead, every ML library uses a flat 1D array, which keeps memory contiguous and significantly improves cache hit rates, plus a shape vector describing the dimensions and the "length" of the tensor in each dimension.
If I have a tensor A with shape (R, C, D), where R(ow), C(olumn), D(epth) equal 2, 3, 2 respectively, it’s just 12 floats in a row. To find the element at [i, j, k], we don't "index" into a nested structure; we use a simple offset formula:

offset = i * (C * D) + j * D + k
This is called "Row-Major" ordering (you can also use "Column-Major" ordering; it's all up to you). The per-dimension step sizes here, C*D, D, and 1, are called "strides". I'm computing them manually for now, but runtimes usually store the stride for each dimension in a vector. Here is a small MS Paint illustration of this; hopefully my toddler-level drawing helps you visualize the formula:
The first term (i * C * D) jumps across whole C-by-D "planes", the second term (j * D) jumps across "column units", and the third term k moves within the innermost depth dimension. I took a 3D example so that you can visualize higher dimensions more easily.
MATMUL AND TENSOR PRODUCT
So now that you understand how data is stored, I'll try to explain how matmul works. And please don't confuse it with the tensor product, denoted A ⊗ B...
If you search for "Tensor Product", also called "Outer Product" for some reason, you get this definition:
This is carried out by multiplying every single element of A by every single element of B... And that is NOT what we want for ML. In ML runtimes, matmul is basically the classic 2D Matrix Product on steroids.
Mathematically, the tensor product ⊗ is a dimension-exploding machine. If you take the tensor product of a 2D matrix with another 2D matrix, you don’t get a matrix back; you get a 4D tensor, since it keeps every single combination of products.
But in ML, we actually want the opposite, a "Tensor Contraction" specifically.
Thus, matmul "contracts" a shared dimension: for each row-column pair, it multiplies elementwise and sums the products. This keeps the data manageable while retaining most of the information. Hence, matmul is just a contraction along ONE shared dimension.
I know it's hard to visualize but honestly I can't think of a way to help you here, you'll be able to feel matmul eventually as you keep playing with examples...
For now, I’m sticking to the naive nested loop and my matmul only supports 2D matrix multiplications.
FINAL WORDS
Looking at today's progress, I've cut down my original scope by like 60% because "smol finished project" > unfinished mess. I just want to get a simple working horizontal slice by the end of March. Will probably write my next blog when I have that ready.
It wasn't a very productive day, but friction at the start is normal when you're pushing the boundaries of your knowledge. Hopefully I'll speed up as time goes on.
Check out the project on my GitHub : https://github.com/Raju1173/ml-runtime-cpp
Shine a little star on it if you'd like to follow along!




Top comments (9)
This is one of those posts where the learning is the achievement, not the end result — and you’re clearly doing the right kind of learning.
What stood out to me isn’t “high school student builds mini PyTorch” (which is impressive on its own), but where the confusion happened. You didn’t get stuck on syntax or C++ mechanics — you got stuck at the representation boundary: how math abstractions collapse into memory layouts. That’s the exact seam where real systems understanding begins.
The moment you realized that a tensor is just a flat buffer plus shape metadata is exactly that seam.
A lot of people learn ML by stacking libraries. Very few pause to ask why nested arrays are poison for cache locality, or why matmul is contraction rather than explosion. That distinction between tensor product vs contraction is something many practitioners never fully internalize; they just call torch.matmul and move on.
Also: your instinct to cut scope is a senior-level move. Horizontal slices beat heroic half-built systems every time. A naive 2D matmul that you fully understand is infinitely more valuable than a half-generalized N-D one you don’t.
If you keep doing this — identifying where intuition breaks, drilling until the representation clicks, then rebuilding the abstraction cleanly — you’ll end up with something much rarer than “ML knowledge”: systems intuition. That transfers everywhere.
Keep going. This is exactly how real engine builders are made.
Thank you for taking the time to read it so carefully. The “representation boundary” point really resonated with me. I don't stop asking why until I can feel the concept in my chest.
I think that’s what hooked me. Once the abstraction collapses into something you can reason about in memory and loops, it feels like you actually own the idea instead of borrowing it from a library.
And that senior-level move you pointed out, that's just the result of my unfinished video games sitting on my hard drive where I overscoped to the point of burnout XD. Never again am I letting that happen.
I’m trying to keep following that instinct you mentioned - find where my intuition breaks, slow down, and rebuild from first principles instead of stacking tools. Reading this was very encouraging.
Thanks again for the perspective.
Writing the library myself is the single most important teaching factor that helped me understand each bit of Deep Learning.
All the best.
Thank you, that means a lot! Writing things from scratch forced me to confront every assumption I was hand-waving before. Your comment reassures me that I’m on the right track.
Yes, our skeptical mind tricks us. Am I doing the right thing?
Why reinvent the wheel?
Suppress it and carry on.
Sure! I’ll keep going and trust the process.
Really appreciate you taking the time to read my blog.
I checked your GitHub profile, quite impressive for your age, how long do you plan to work on this massive project?
Thanks! I'll stick to this project for the next 6-9 months, maybe even the whole year in fact...
Well done!