Let me be completely honest: I didn't invent anything here. This is Andrej Karpathy's brilliant micrograd ported to C++, nothing more, nothing less. But sometimes the best way to really understand something is to rebuild it in a different language, and that's exactly what I needed.
Why I'm Going Back to Basics
Here's my situation: I've been working as a software engineer in electronics automation, mostly dealing with high-level libraries. PyTorch here, TensorFlow there, plug and play. But I realized I was getting comfortable; maybe too comfortable. My C++ was getting rusty, and, more importantly, I was treating neural networks like magic black boxes.
So, I decided to take a step back. Way back. All the way to "let me implement automatic differentiation from scratch" back.
This is the first project in what I'm calling my "back to fundamentals" roadmap:
- Step 1: Port micrograd to C++ (refresh C++ skills, understand backprop deeply)
- Step 2: Get my hands dirty with statistics and probabilistic machine learning, then implement Physics-Informed Machine Learning (PIML) in Python
- Step 3: Well, let's see where I end up.
I'll try documenting this journey because, well, maybe I'm not the only one who needs to go back to basics.
What micrograd Taught Me (And What C++ Added)
If you haven't seen Karpathy's micrograd, stop reading this and go watch his lecture first. Seriously. It's a 150-line automatic differentiation engine that can train neural networks, and it's pure educational gold.
The genius of micrograd isn't in its complexity; it's in its simplicity. Every operation (add, multiply, tanh) knows how to compute its own gradient. Chain them together, and you get backpropagation for free.
Porting it to C++ forced me to think about things Python hides from you:
Memory Management: Python's garbage collector handles circular references in computational graphs automatically. In C++, I had to be explicit about `shared_ptr` and `weak_ptr` to avoid memory leaks.
```cpp
// This took me way longer to get right than I'd like to admit :")
template <typename T>
class Value {
    T data;
    double grad;
    std::vector<std::shared_ptr<Value<T>>> children;
    std::function<void(Value<T>&)> _backward;
    // ... lots of careful pointer management :->
};
```
Type Safety: Python's duck typing lets you be lazy. C++ templates forced me to think explicitly about what types my Value class should support. Turns out, being explicit helps you catch bugs earlier.
Performance Considerations: Not that this tiny implementation is performance-critical, but C++ made me think about move semantics, unnecessary copies, and memory layout in ways Python abstracts away.
The "Ahaaaaa" Moments (And Face-Palm Bugs)
Gradient Accumulation Bug: My first implementation had a subtle bug where gradients kept accumulating between training iterations. The loss would decrease for a few steps, then explode to infinity. Took me embarrassingly long to realize I wasn't zeroing gradients between iterations.
```cpp
// The bug that ate my Thursday night.
for (auto& param : network.parameters()) {
    param->grad = 0.0;  // This line was missing -_-
}
```
Operator Overloading Revelation: Implementing `operator+` and `operator*` for the Value class was genuinely satisfying. You write `c = a + b` and behind the scenes, you're building a computational graph that remembers how to backpropagate gradients. It's like syntactic sugar for automatic differentiation.
The Chain Rule in Action: Watching the backward pass work was the moment everything clicked. You call `loss.backward()` and it ripples backwards through the entire computational graph, each node knowing exactly how to pass gradients to its children. It's like watching dominoes fall in reverse.
What This Actually Accomplished
Let's be real about scope here. My C++ implementation handles:
- Basic operations (add, multiply, power, tanh)
- Multi-layer perceptron
- Mean squared error loss
- Simple gradient descent
That's it. No optimizers, no regularization, no fancy architectures. It can barely solve XOR, let alone generate poetry.
But that's exactly the point. I now understand, at a visceral level, what happens when PyTorch calls `loss.backward()`. I understand why you need to zero gradients. I understand what a computational graph actually is.
Testing Against the Original
The real validation was making sure my C++ version produced identical results to Karpathy's Python version. Same random seed, same data, same hyperparameters – the loss curves should be identical.
They were. And that felt better than any unit test.
```cpp
// Training loop that mirrors micrograd exactly
for (int epoch = 0; epoch < 100; ++epoch) {
    auto pred = network.forward(inputs);
    auto loss = mean_squared_error(pred, targets);

    zero_gradients(network.parameters());
    loss->backward();

    for (auto& p : network.parameters()) {
        p->data -= 0.05 * p->grad;  // Same learning rate as original
    }

    if (epoch % 10 == 0) {
        std::cout << "epoch " << epoch << " loss " << loss->data << std::endl;
    }
}
```
The Real Value of Going Backwards
This project didn't advance the state of the art. It didn't solve any novel problems. It didn't even implement anything new.
But it reminded me why I love this field. There's something deeply satisfying about understanding your tools at the lowest level. When I use PyTorch now, I'm not just calling magical functions; I understand what those functions are actually doing under the hood.
Plus, my C++ skills are definitely less rusty now.
For Anyone Considering Similar Projects
If you're thinking about implementing micrograd yourself (in C++ or any other language), do it. The code is on my GitHub if you want to see how I approached the port, but honestly, the value is in doing it yourself.
Fair warning: it's going to take longer than you expect. Not because the concepts are hard, but because getting all the pointer management and operator overloading right takes time. And debugging gradient computation errors is its own special kind of hell.
But when you finally see that loss curve decreasing, when you realize your handwritten automatic differentiation engine is actually teaching a neural network to learn... it's worth it.
Next up: teaching neural networks to respect the laws of physics. Should be fun.
This is part of my "back to fundamentals" series where I'm rebuilding my understanding from the ground up. Follow along as I work through PIML and eventually contribute to the community.
Full implementation: hadywalied/cGrad
Original inspiration: karpathy/micrograd