DEV Community: Nalin Angrish

How to build Flexible Neural Networks from scratch in C++

Nalin Angrish — Fri, 10 Apr 2026 22:04:52 +0000

Overview

This post describes how I built FlexNN, a fully connected neural network that can be customized to a use case with an arbitrary number of layers and a choice of activation functions.

At the time of writing this article, FlexNN only has support for Dense layers and ReLU and SoftMax activation functions. Right now, I have no motivation to continue adding support for other types of layers and activation functions, since this project is intended just as a proof of concept and for learning purposes. For any practical purpose I would shut up and pick TensorFlow without a second thought.

What are we going to do

To design a neural network, we first need a problem to solve and a dataset to go with it. For this experiment, I've picked the MNIST dataset for handwritten digit recognition.

The dataset contains a lot of 28x28 pixel images with handwritten digits from 0-9. In this, each pixel is an integer value from 0 to 255, with 0 representing black, and 255 representing white, and 50* shades of gray in between.

* more like 254 :)

Our model will look at a large subset of these samples and learn what each digit looks like. Then, when we give it a sample it has never seen, it will try to pick up on the same patterns and guess which digit it is.

Okay what the hell is a neural network?

You would have seen something similar to the image above wherever you read about neural networks. This is how a basic neural network can be represented.

In our dataset, we have a lot of 28x28 pixel images, or basically, a lot of 784 pixel images. What our model in this example image does is:

take an input vector with 784 elements
the 10 circles, also called neurons or nodes do some math with the input vector and give an output vector with 10 elements.
The output vector of this layer acts as an input vector for the second layer, and the second layer does some more math with it. This goes on and on for n-layers the neural network might have.
The output of the last layer, which in this case is the third layer, is the output we desire. Here, the $i^{th}$ element of the vector (counting from 0) is the probability that the input represents the digit $i$. For example, if the input image represents 7, then the output vector of the neural network should look something like: [0 0.04 0.02 0 0.04 0 0 0.9 0 0]

As you can clearly see, the element at index 7 has the highest probability, which suggests the input must be a 7.
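Picking the digit out of such a vector is just an argmax. Here is a minimal sketch with plain `std::vector` instead of Eigen; the name `predictedDigit` is made up for illustration and is not part of FlexNN.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Returns the index of the largest probability, i.e. the predicted digit.
int predictedDigit(const std::vector<double> &probs) {
    assert(!probs.empty());
    auto it = std::max_element(probs.begin(), probs.end());
    return static_cast<int>(std::distance(probs.begin(), it));
}
```

For the example vector above, `predictedDigit` returns 7.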

The math behind a neural network

At first sight it might seem as if a neural network does some magic math to accurately guess the user's input, but in fact the math is just a bunch of matrix multiplications with a sprinkle of calculus. Each layer of neurons has two matrices associated with it, called its Weights and Biases.

A neural network primarily performs 3 operations:

Forward propagation: Take the input, pass it through all layers sequentially, and obtain the output.
Backward propagation: Take the output and the expected output, pass gradients back through all layers starting from the end towards the beginning, and figure out what changes need to be made to the weights and biases.
Updating weights and biases: After backward propagation, since our model knows what changes need to be made, update the weights and biases accordingly.

Notation

For discussing the math, we'll start with writing out the notation we will be using. Math becomes very very hard if we don't have any proper notation.

$A_{m \times n}$ : A matrix with m rows and n columns.
$W^{(i)}_{m \times n}$ : Weights of the $i^{th}$ layer
$b^{(i)}_{m \times 1}$ : Biases of the $i^{th}$ layer
$X^{(i)}$ : Input to the $i^{th}$ layer. $X^{(0)}$ is the original input to the model.
$Z^{(i)}$ : Pre-activation output of the $i^{th}$ layer.
$A^{(i)}$ : Post-activation output of the $i^{th}$ layer. Also equals $X^{(i + 1)}$ .
$L$ : Total number of layers.
$\hat{Y}$ : Predicted output (i.e., $A^{(L-1)}$, the output of the last layer, since layers are numbered from 0), and $Y$ is the actual label.

Forward Propagation

For each layer i from 0 to L-1, two operations are performed:

Linear Combination:

$$Z^{(i)} = W^{(i)} X^{(i)} + b^{(i)}$$

Activation:

$$A^{(i)} = g(Z^{(i)})$$

where g is the activation function. I won't go in depth on what an activation function is, but if you want to, you can refer to this article. In this model, as I mentioned earlier, we support ReLU and Softmax.

ReLU: $g(x) = \max(0, x)$, used for hidden layers.
Softmax: $g(x_{i}) = \frac{e^{x_{i}}}{\sum_{j} e^{x_{j}}}$, used for the output layer; it converts raw scores into probabilities that sum to 1.
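To make the two steps concrete, here is a small Eigen-free sketch of one layer's forward pass for a single input vector. The helper names `linear` and `relu` and the `Vector`/`Matrix` aliases are assumptions for this example, not FlexNN's API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Vector = std::vector<double>;
using Matrix = std::vector<Vector>; // Matrix[row][col]

// Linear combination: z = W x + b for a single input vector x.
Vector linear(const Matrix &W, const Vector &b, const Vector &x) {
    assert(W.size() == b.size());
    Vector z(b); // start from the bias, then add W * x row by row
    for (std::size_t r = 0; r < W.size(); ++r) {
        assert(W[r].size() == x.size());
        for (std::size_t c = 0; c < x.size(); ++c) z[r] += W[r][c] * x[c];
    }
    return z;
}

// Activation: ReLU applied elementwise, a = max(0, z).
Vector relu(const Vector &z) {
    Vector a(z);
    for (double &v : a) v = std::max(0.0, v);
    return a;
}
```

With `W = {{1, -1}, {-1, 1}}`, `b = {0.5, 0.5}` and `x = {3, 1}`, the pre-activation is `{2.5, -1.5}` and ReLU turns it into `{2.5, 0}`.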

Backward Propagation

This is the part that actually makes the network learn, and also the part that made me want to throw my laptop out the window the first time I implemented it.

The idea is to compute how much each weight and bias contributed to the error, so we know which direction to nudge them. We do this using the chain rule from Calculus.

We start from the output layer and work backwards. First, we need a loss function to measure how wrong we are. For a classification problem like MNIST, we use Cross-Entropy Loss:

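$$\mathcal{L} = -\sum_{k} Y_{k} \log(\hat{Y}_{k})$$

Here the sum runs over the classes $k$, and the loss is written $\mathcal{L}$ so it doesn't clash with the layer count $L$.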

Now, for the output layer (layer $L-1$, since layers are numbered from 0) using Softmax + Cross-Entropy, the gradient conveniently simplifies to:

$$dZ^{(L-1)} = A^{(L-1)} - Y$$

Yeah, that's it. The math gods were kind here.
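If you don't trust the math gods, you can check the simplified gradient numerically. The sketch below compares it against a central finite difference of the loss for one sample; `softmaxCrossEntropy` and `numericalGradient` are hypothetical helpers written for this check only.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Softmax followed by cross-entropy for one sample with one-hot label y.
double softmaxCrossEntropy(const std::vector<double> &z, const std::vector<double> &y) {
    assert(z.size() == y.size() && !z.empty());
    double sum = 0.0;
    for (double v : z) sum += std::exp(v);
    double loss = 0.0;
    for (std::size_t k = 0; k < z.size(); ++k)
        loss -= y[k] * std::log(std::exp(z[k]) / sum);
    return loss;
}

// Central finite difference of the loss with respect to z[k].
double numericalGradient(std::vector<double> z, const std::vector<double> &y, std::size_t k) {
    assert(k < z.size());
    const double eps = 1e-6;
    z[k] += eps;
    const double plus = softmaxCrossEntropy(z, y);
    z[k] -= 2 * eps;
    const double minus = softmaxCrossEntropy(z, y);
    return (plus - minus) / (2 * eps);
}
```

For any scores `z` and one-hot label `y`, `numericalGradient(z, y, k)` should agree with `softmax(z)[k] - y[k]` up to a small numerical error.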

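For every layer $i$, going backwards from $L-1$ to $0$, we compute the following, where $m$ is the number of training samples (one per column of $X^{(i)}$) and the sum in $db^{(i)}$ runs over those samples:

$$dW^{(i)} = \frac{1}{m} \cdot dZ^{(i)} \cdot (X^{(i)})^{T}$$

$$db^{(i)} = \frac{1}{m} \cdot \sum dZ^{(i)}$$

$$dX^{(i)} = (W^{(i)})^{T} \cdot dZ^{(i)}$$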

And then propagate the gradient to the previous layer:

$$dZ^{(i-1)} = dX^{(i)} \circ g'(Z^{(i-1)})$$

where $g'$ is the derivative of the activation function, and $\circ$ is elementwise multiplication. For ReLU, $g'(x) = 1$ if $x > 0$, else $0$.
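The three per-layer gradients can be written out with plain loops to see exactly which indices get summed. This is an illustrative sketch on nested `std::vector`s (one sample per column, as in the formulas); `LayerGradients`, `denseBackward` and `reluDerivative` are names assumed for this example and not the FlexNN implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>; // Matrix[row][col], one sample per column

struct LayerGradients {
    Matrix dW;              // same shape as W
    std::vector<double> db; // one entry per neuron
    Matrix dX;              // same shape as X
};

// W: (neurons x inputs), X: (inputs x m), dZ: (neurons x m)
LayerGradients denseBackward(const Matrix &W, const Matrix &X, const Matrix &dZ) {
    assert(!W.empty() && !X.empty() && !dZ.empty());
    const std::size_t neurons = W.size(), inputs = X.size(), m = dZ[0].size();
    assert(dZ.size() == neurons && W[0].size() == inputs && X[0].size() == m);
    LayerGradients g{Matrix(neurons, std::vector<double>(inputs, 0.0)),
                     std::vector<double>(neurons, 0.0),
                     Matrix(inputs, std::vector<double>(m, 0.0))};
    for (std::size_t r = 0; r < neurons; ++r)
        for (std::size_t s = 0; s < m; ++s) {
            g.db[r] += dZ[r][s] / m;                  // db = (1/m) * sum over samples
            for (std::size_t c = 0; c < inputs; ++c) {
                g.dW[r][c] += dZ[r][s] * X[c][s] / m; // dW = (1/m) * dZ * X^T
                g.dX[c][s] += W[r][c] * dZ[r][s];     // dX = W^T * dZ
            }
        }
    return g;
}

// ReLU derivative used in the elementwise product: 1 if z > 0, else 0.
double reluDerivative(double z) { return z > 0.0 ? 1.0 : 0.0; }
```

Multiplying each entry of `dX` by `reluDerivative` of the matching entry of the previous layer's `Z` gives that layer's `dZ`.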

Updating the Weights and Biases

Once we have the gradients, the actual update is the easy part — Gradient Descent:

$$W^{(i)} \leftarrow W^{(i)} - \alpha \cdot dW^{(i)}$$

$$b^{(i)} \leftarrow b^{(i)} - \alpha \cdot db^{(i)}$$

where $\alpha$ is the learning rate, a hyperparameter that controls how big a step we take. Too high and the model diverges. Too low and you'll die of old age before it converges.
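The effect of the learning rate is easy to see on a toy problem. The sketch below runs the same update rule on $f(w) = w^2$ (gradient $2w$) instead of a real network; the function name `descend` is made up for this illustration.

```cpp
#include <cassert>
#include <cmath>

// Runs plain gradient descent on f(w) = w^2, whose gradient is 2w,
// and returns |w| after the given number of steps.
double descend(double w, double learningRate, int steps) {
    assert(steps >= 0);
    for (int i = 0; i < steps; ++i)
        w -= learningRate * 2.0 * w; // w <- w - alpha * dw
    return std::abs(w);
}
```

Starting from `w = 1`, a rate of 0.1 shrinks `w` towards the minimum at 0, a rate of 1.1 makes every step overshoot so `|w|` keeps growing, and a rate of 0.0001 has barely moved after 50 steps.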

Generalizing to n-layers

The beauty of this formulation is that it generalizes cleanly. Every layer has the same structure: weights, biases, an activation, and a backward pass. So instead of hardcoding a 2-layer network, we can just store layers in a list and loop over them — forward during inference, backward during training. That's literally the entire design philosophy of FlexNN.
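Stripped of all the math, the "list of layers" idea is only a few lines. In this sketch each layer is reduced to a callable that maps a vector to a vector; `LayerFn` and `forwardAll` are stand-ins invented for the example, while the real class appears later in the post.

```cpp
#include <cassert>
#include <functional>
#include <vector>

using Vector = std::vector<double>;
using LayerFn = std::function<Vector(const Vector &)>; // stand-in for a real layer's forward()

// Forward pass through any number of layers: each output becomes the next input.
Vector forwardAll(const std::vector<LayerFn> &layers, Vector x) {
    for (const LayerFn &layer : layers) {
        assert(static_cast<bool>(layer)); // calling an empty std::function would throw
        x = layer(x);
    }
    return x;
}
```

Adding a layer to the network is then just another `push_back` on the vector; the loop itself never changes.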

Time to Code

Alright, enough math. Let's build the thing.

Prerequisites

FlexNN uses Eigen3 for matrix operations and CMake to build. On Ubuntu:

sudo apt install cmake libeigen3-dev

That's it.

Helper Functions

CSV Reading

MNIST data comes as a CSV where the first column is the label and the remaining 784 are pixel values. We skip the header and parse everything into Eigen matrices:

void FlexNN::readCSV_XY(const std::string &filename, Eigen::MatrixXd &X, Eigen::VectorXd &Y) {
    std::ifstream file(filename);
    std::vector<std::vector<double>> data;
    std::string line;
    size_t cols = 0;

    if (std::getline(file, line)) { /* skip header */ }

    while (std::getline(file, line)) {
        std::stringstream ss(line);
        std::string cell;
        std::vector<double> row;
        while (std::getline(ss, cell, ','))
            row.push_back(std::stod(cell));
        if (cols == 0) cols = row.size();
        data.push_back(row);
    }

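    // Bail out early if the file was missing or empty; otherwise cols - 1 below would underflow.
    if (data.empty() || cols < 2)
        throw std::runtime_error("readCSV_XY: no usable data in " + filename); // needs <stdexcept>

    size_t nRows = data.size(), nCols = cols;
    X.resize(nRows, nCols - 1);
    Y.resize(nRows);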
    for (size_t i = 0; i < nRows; ++i) {
        Y(i) = data[i][0];
        for (size_t j = 1; j < nCols; ++j)
            X(i, j - 1) = data[i][j];
    }
}
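To see the row format in isolation, here is the inner parsing step pulled out into a tiny standalone helper. `parseCsvRow` is a hypothetical name used only for this sketch.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Splits one "label,p0,p1,..." line into numbers; the first entry is the label.
std::vector<double> parseCsvRow(const std::string &line) {
    std::vector<double> row;
    std::stringstream ss(line);
    std::string cell;
    while (std::getline(ss, cell, ','))
        row.push_back(std::stod(cell));
    assert(!row.empty());
    return row;
}
```

For a (shortened) line such as `"7,0,128,255"`, the label is `row[0] == 7` and the remaining entries are the pixel values.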

One-Hot Encoding

The network outputs 10 probabilities, but labels are just integers. We convert them into a one-hot matrix of shape (num_classes, num_samples):

Eigen::MatrixXd FlexNN::oneHotEncode(const Eigen::VectorXd &Y, int num_classes) {
    Eigen::MatrixXd Y_onehot = Eigen::MatrixXd::Zero(num_classes, Y.size());
    for (int i = 0; i < Y.size(); ++i) {
        int label = static_cast<int>(Y(i));
        if (label >= 0 && label < num_classes)
            Y_onehot(label, i) = 1.0;
    }
    return Y_onehot;
}
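The same encoding without Eigen, as a sketch you can run in isolation. The name `oneHot` and the use of integer labels are assumptions for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Builds a (numClasses x numSamples) matrix with a single 1.0 per column.
std::vector<std::vector<double>> oneHot(const std::vector<int> &labels, int numClasses) {
    assert(numClasses > 0);
    std::vector<std::vector<double>> out(numClasses, std::vector<double>(labels.size(), 0.0));
    for (std::size_t i = 0; i < labels.size(); ++i)
        if (labels[i] >= 0 && labels[i] < numClasses) // out-of-range labels leave a zero column
            out[labels[i]][i] = 1.0;
    return out;
}
```

With labels `{2, 0}` and 3 classes, column 0 has its 1.0 in row 2 and column 1 has its 1.0 in row 0.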

Single Layer

Each Layer stores its weights W, biases b, and activation type. The forward pass returns a std::pair<Z, A> — both the pre-activation and post-activation outputs. We need both because backprop needs Z to compute the ReLU derivative.

Forward Pass

std::pair<Eigen::MatrixXd, Eigen::MatrixXd> FlexNN::Layer::forward(const Eigen::MatrixXd &input) {
    Eigen::MatrixXd Z = (W * input).colwise() + b;
    Eigen::MatrixXd A;

    if (activationFunction == "relu") {
        A = Z.unaryExpr([](double x) { return std::max(0.0, x); });
    }
    else if (activationFunction == "softmax") {
        A = Z;
        for (int i = 0; i < Z.cols(); ++i) {
            Eigen::VectorXd col = Z.col(i);
            Eigen::VectorXd expCol = (col.array() - col.maxCoeff()).exp(); // numerically stable
            A.col(i) = expCol / expCol.sum();
        }
    }
    else {
        A = Z; // linear, no activation
    }
    return {Z, A};
}

The col.maxCoeff() subtraction before the exp is the numerically stable softmax trick — without it, large values in Z cause exp() to overflow. Small detail, big consequences.
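A quick way to convince yourself is to run both versions on large scores. The sketch below uses plain `std::vector`, and the names `naiveSoftmax` and `stableSoftmax` are made up for the comparison; it assumes standard IEEE floating point (no fast-math flags).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Direct translation of the formula: exp() overflows to infinity for large scores.
std::vector<double> naiveSoftmax(const std::vector<double> &z) {
    double sum = 0.0;
    for (double v : z) sum += std::exp(v);
    std::vector<double> out;
    for (double v : z) out.push_back(std::exp(v) / sum);
    return out;
}

// Same result mathematically, but every exponent is <= 0, so exp() stays in (0, 1].
std::vector<double> stableSoftmax(const std::vector<double> &z) {
    assert(!z.empty());
    const double maxScore = *std::max_element(z.begin(), z.end());
    double sum = 0.0;
    for (double v : z) sum += std::exp(v - maxScore);
    std::vector<double> out;
    for (double v : z) out.push_back(std::exp(v - maxScore) / sum);
    return out;
}
```

For scores like `{1000, 1001}`, the naive version divides infinity by infinity and returns NaN, while the stable version still returns two probabilities that sum to 1.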

Backward Pass

The layer's backward method takes the next layer's weights and dZ, and propagates the gradient back through this layer's activation:

Eigen::MatrixXd FlexNN::Layer::backward(const Eigen::MatrixXd &nextW,
                                         const Eigen::MatrixXd &nextdZ,
                                         const Eigen::MatrixXd &currZ) {
    if (activationFunction == "relu") {
        return (nextW.transpose() * nextdZ).array()
               * (currZ.array() > 0.0).cast<double>();
    }
    // linear fallback
    return nextW.transpose() * nextdZ;
}

Notice there's no softmax case here — and that's intentional. The softmax + cross-entropy gradient at the output layer simplifies to just A - Y, so it's handled directly in the network's backward pass, not here.

Neural Network

With layers working, NeuralNetwork just chains them.

Forward Pass

We store every intermediate Z and A in an outputs vector — we'll need all of them during backprop:

std::vector<Eigen::MatrixXd> FlexNN::NeuralNetwork::forward(const Eigen::MatrixXd &input) {
    std::vector<Eigen::MatrixXd> outputs;
    outputs.push_back(input); // index 0: original input
    for (size_t i = 0; i < layers.size(); ++i) {
        auto [Z, A] = layers[i].forward(outputs.back());
        outputs.push_back(Z); // Z at index 2i+1
        outputs.push_back(A); // A at index 2i+2
    }
    return outputs;
}

After this, outputs looks like: [X, Z₀, A₀, Z₁, A₁, ...]. The indexing matters — backprop walks this in reverse.
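Since the index arithmetic is easy to get wrong, it can help to spell it out once. These helper functions are hypothetical (FlexNN indexes the vector directly), but they encode the same layout.

```cpp
#include <cstddef>

// Positions inside outputs = [X, Z0, A0, Z1, A1, ...] for layer i (0-based).
std::size_t inputIndex(std::size_t i) { return 2 * i; }  // X for i = 0, else A of layer i-1
std::size_t zIndex(std::size_t i) { return 2 * i + 1; }  // pre-activation of layer i
std::size_t aIndex(std::size_t i) { return 2 * i + 2; }  // post-activation of layer i

// Total number of stored matrices for a network with numLayers layers.
std::size_t outputsSize(std::size_t numLayers) { return 2 * numLayers + 1; }
```

For a two-layer network the vector has 5 entries: the final prediction sits at index 4, and the input to the last layer sits at index 2, which is `outputs.size() - 3`.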

Backward Pass

The output layer gradient is A_last - Y_onehot. From there we walk backwards, calling each layer's backward() to get dZ for the previous layer, and accumulating dW and db along the way:

std::vector<Eigen::MatrixXd> FlexNN::NeuralNetwork::backward(
    const std::vector<Eigen::MatrixXd> &outputs, const Eigen::MatrixXd &target) {

    std::vector<Eigen::MatrixXd> gradients;
    std::vector<Eigen::MatrixXd> dZs;
    int m = outputs.back().cols();

    // Output layer: softmax + cross-entropy gradient = A - Y
    Eigen::MatrixXd dZ = outputs.back() - target;
    dZs.push_back(dZ);
    gradients.push_back(dZ.rowwise().mean());                               // db
    gradients.push_back(dZ * outputs[outputs.size() - 3].transpose() / m); // dW

    // Hidden layers
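    // Cast before subtracting so a single-layer network gives -1 instead of an unsigned wraparound.
    for (int i = static_cast<int>(layers.size()) - 2; i >= 0; --i) {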
        dZ = layers[i].backward(layers[i + 1].getWeights(), dZs.back(), outputs[2 * i + 1]);
        dZs.push_back(dZ);
        gradients.push_back(dZ.rowwise().mean());                  // db
        gradients.push_back(dZ * outputs[2 * i].transpose() / m); // dW
    }

    std::reverse(gradients.begin(), gradients.end()); // align with layer order
    return gradients;
}
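The final `std::reverse` is what turns the "last layer first, db before dW" push order into layer order with dW first. A tiny sketch with labels instead of matrices makes this visible; `gradientOrder` is a made-up helper for the illustration.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Mimics the push order of backward(): db then dW, from the last layer to the first.
std::vector<std::string> gradientOrder(int numLayers) {
    std::vector<std::string> order;
    for (int i = numLayers - 1; i >= 0; --i) {
        order.push_back("db" + std::to_string(i));
        order.push_back("dW" + std::to_string(i));
    }
    std::reverse(order.begin(), order.end()); // same final step as in backward()
    return order;
}
```

For two layers the pushes are `db1, dW1, db0, dW0`, and after reversing you get `dW0, db0, dW1, db1`.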

Updating Weights

Gradients come back interleaved as [dW₀, db₀, dW₁, db₁, ...]:

void FlexNN::NeuralNetwork::updateWeights(
    const std::vector<Eigen::MatrixXd> &gradients, double learningRate) {
    for (size_t i = 0; i < layers.size(); ++i) { // size_t avoids a signed/unsigned comparison
        Eigen::MatrixXd dW = gradients[2 * i];
        Eigen::VectorXd db = gradients[2 * i + 1];
        layers[i].updateWeights(dW, db, learningRate);
    }
}

Training

Putting it all together — one-hot encode the labels, then loop:

void FlexNN::NeuralNetwork::train(const Eigen::MatrixXd &input, const Eigen::MatrixXd &target, double learningRate, int epochs) {
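    // Labels are stored as doubles, so convert the largest one to int to get the class count.
    Eigen::MatrixXd Y_onehot = FlexNN::oneHotEncode(target, static_cast<int>(target.maxCoeff()) + 1);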
    for (int epoch = 0; epoch < epochs; ++epoch) {
        auto outputs = forward(input);
        auto gradients = backward(outputs, Y_onehot);
        updateWeights(gradients, learningRate);
        if ((epoch + 1) % 10 == 0)
            std::cout << "Epoch " << epoch + 1 << "/" << epochs
                      << ": Accuracy = " << this->accuracy(input, target) << std::endl;
    }
}

Evaluating Accuracy

double FlexNN::NeuralNetwork::accuracy(const Eigen::MatrixXd &X, const Eigen::MatrixXd &Y) {
    Eigen::MatrixXd predictions = this->predict(X);
    int correct = 0;
    for (int i = 0; i < predictions.cols(); ++i) {
        int predictedClass;
        predictions.col(i).maxCoeff(&predictedClass);
        if (predictedClass == static_cast<int>(Y(i))) correct++;
    }
    return static_cast<double>(correct) / predictions.cols();
}

And as a fun bonus — main.cpp has a little interactive loop where you enter an index and it prints the digit as ASCII art using #, *, ., and spaces based on pixel intensity.
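The rendering idea fits in a few lines. The sketch below is not the code from main.cpp: the function names and the intensity cut-offs are guesses chosen purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Maps a 0-255 intensity to one of four characters. The cut-offs are illustrative guesses.
char pixelToChar(int intensity) {
    assert(intensity >= 0 && intensity <= 255);
    if (intensity > 192) return '#';
    if (intensity > 128) return '*';
    if (intensity > 64) return '.';
    return ' ';
}

// Renders a row-major image of the given width as lines of text.
std::string renderAscii(const std::vector<int> &pixels, std::size_t width) {
    assert(width > 0 && pixels.size() % width == 0);
    std::string art;
    for (std::size_t i = 0; i < pixels.size(); ++i) {
        art += pixelToChar(pixels[i]);
        if ((i + 1) % width == 0) art += '\n';
    }
    return art;
}
```

For an MNIST sample you would pass its 784 pixel values with a width of 28.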

Analysis

With a 784 → 64 ReLU → 10 Softmax architecture, learning rate of 0.5, trained for 300 epochs on a 90/10 train-test split:

Training accuracy: ~95–96%
Test accuracy: ~93–94%
Training time: A few minutes on CPU.

Not bad for zero ML frameworks. PyTorch would hit 99% in 10 lines — but you'd also understand approximately nothing about why it works. That was the whole point.

How to Refine

Vectorization

Eigen already uses SIMD internally, but only if you compile with the right flags. Make sure your CMakeLists.txt has:

target_compile_options(FlexNN PRIVATE -O3 -march=native)

Free speedup, zero code changes.

Multithreading with OpenMP

The README mentions OpenMP support. Eigen can be told to use it internally, and you can parallelize the training loop across batches with #pragma omp parallel for. The infrastructure is already there — just needs to be wired in properly.
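As a taste of what the pragma looks like, here is a deliberately simple sketch that parallelizes a loss computation across samples rather than the full training loop. `meanLoss` is a hypothetical function, and the loop only runs in parallel when the compiler is given an OpenMP flag such as `-fopenmp`.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Mean cross-entropy, given for each sample the probability assigned to its correct class.
// With OpenMP enabled the loop is split across threads; otherwise it runs serially.
double meanLoss(const std::vector<double> &correctClassProb) {
    assert(!correctClassProb.empty());
    const long n = static_cast<long>(correctClassProb.size());
    double total = 0.0;
#ifdef _OPENMP
#pragma omp parallel for reduction(+ : total)
#endif
    for (long i = 0; i < n; ++i)
        total += -std::log(correctClassProb[i]);
    return total / n;
}
```

The `reduction(+ : total)` clause gives each thread its own partial sum and adds them up at the end, which avoids a data race on `total`. Eigen's own threading is configured separately, for example with `Eigen::setNbThreads`.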

Mini-batch Gradient Descent

Right now training runs on the full dataset every epoch (batch gradient descent). Mini-batches typically converge in fewer epochs and often generalize better. It shouldn't be a huge change given how splitXY is already written.
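One possible starting point is to shuffle the sample indices each epoch and cut them into chunks. This is only a sketch of the batching step under that assumption; `makeBatches` does not exist in FlexNN, and picking the matching columns out of the data matrices is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Shuffles sample indices 0..n-1 and cuts them into batches of at most batchSize.
std::vector<std::vector<std::size_t>> makeBatches(std::size_t n, std::size_t batchSize,
                                                  unsigned seed) {
    assert(batchSize > 0);
    std::vector<std::size_t> indices(n);
    std::iota(indices.begin(), indices.end(), std::size_t{0});
    std::mt19937 rng(seed);
    std::shuffle(indices.begin(), indices.end(), rng);

    std::vector<std::vector<std::size_t>> batches;
    for (std::size_t start = 0; start < n; start += batchSize) {
        const std::size_t end = std::min(start + batchSize, n);
        batches.emplace_back(indices.begin() + start, indices.begin() + end);
    }
    return batches;
}
```

Each epoch would then run forward, backward and the weight update once per batch instead of once for the whole dataset.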

What Can Be Improved

Plenty:

More optimizers — Adam, RMSProp, SGD with momentum. Plain gradient descent works but it's the slow lane.
More activations — Sigmoid, Tanh, Leaky ReLU. The architecture already supports plugging them in.
Model save/load — Right now you retrain from scratch every run. Serialize the weights to disk and this becomes actually usable.
Convolutional layers — For real image tasks, dense layers are not the move. But that's a whole different beast.
GPU support — Not happening from me anytime soon. Swap Eigen for cuBLAS if you're feeling adventurous.
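Of these, save/load is probably the quickest win. Here is a minimal sketch of a plain-text format for one weight matrix, using nested `std::vector`s in place of Eigen types; the format and the names `saveMatrix` and `loadMatrix` are assumptions of this example.

```cpp
#include <cassert>
#include <cstddef>
#include <iomanip>
#include <istream>
#include <limits>
#include <ostream>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Writes "rows cols" followed by the values, with enough digits to round-trip a double.
void saveMatrix(std::ostream &out, const Matrix &m) {
    const std::size_t rows = m.size(), cols = rows ? m[0].size() : 0;
    out << rows << ' ' << cols << '\n'
        << std::setprecision(std::numeric_limits<double>::max_digits10);
    for (const auto &row : m) {
        assert(row.size() == cols);
        for (double v : row) out << v << ' ';
        out << '\n';
    }
}

// Reads back a matrix written by saveMatrix().
Matrix loadMatrix(std::istream &in) {
    std::size_t rows = 0, cols = 0;
    in >> rows >> cols;
    Matrix m(rows, std::vector<double>(cols, 0.0));
    for (auto &row : m)
        for (double &v : row) in >> v;
    return m;
}
```

Passing a `std::ofstream` and later a `std::ifstream` turns this into writing and reading a file; a full model would repeat it for every layer's weights and biases.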

The code is on GitHub if you want to poke around. And if you find a bug — feel free to open an issue. I might fix it. Eventually. Probably after exams.

That's where FlexNN stands for now. The more I look at it though, the more I think there's something worth pursuing here — a lightweight, dependency-minimal inference engine in C++ is not far from something you could run on an embedded system. No Python runtime, no TFLite overhead. A proper lightweight alternative to TensorFlow Lite for edge applications. That's the direction I'm thinking of taking this next.

If that sounds interesting, do reach out!

More stuff coming soon. 👀

Luna - My Journey Building a Robotic Dog (and Learning the Hard Way)

Nalin Angrish — Wed, 02 Jul 2025 19:56:41 +0000

History

Those who know me might have heard about Luna, and for those who haven't, it's a robotic quadruped (inspired by SPOT by Boston Dynamics) that I have wanted to make since last summer. I started the work by the end of my summer break, but progress remained slow for specific reasons, and I couldn't do much work. The primary reason for this delay was that I had no idea about SolidWorks or any other CAD software I could use to design my bot, so I was starting from scratch. This was my first big-scale project where I was the entire Design and Coding team.

please hire me Boston Dynamics :)

In my third semester, our course titled "Machine Drawing" briefly introduced us to SolidWorks, and that's when I got the push to learn it more deeply, and that's what I did. Moreover, since I already wanted to design Luna, I decided to submit it as my project for the course. I kept the design so I could directly print the parts and assemble them easily after printing.

Design Overview - First Iteration

I completed the design by the end of the third semester, and it was a minimal, small, cute robot powered by 12 SG90 Servo Motors. The design used three motors per leg -

2 for hip movement, and
1 for the knee.

This way, I could get the full range of motion you should expect from a robotic leg.

With the help of one of my seniors, I got the parts 3D printed in the early days of my 4th semester. However, he did warn me that I wouldn't be able to operate the bot using the motors I had chosen.

Naturally, I ignored him. (Spoiler alert: it didn't work)

So now, since I've printed the parts and bought the electronic components, it is time to build. But wait, exams! By the time I had the parts on my desk, mid-semester exams were waiting to welcome my dumb ass. Then, other commitments started popping up, and I eventually paused the project for a while. The project then started again after the semester ended.

My Mistakes

Winging it

I encountered my first roadblock pretty early in the project. In hindsight, I now realize I had no proper planning or calculation for the project.

The motors couldn’t even hold a leg straight - not a bug, just the motors having a mid-life crisis at high frequency. They just couldn’t hold torque well, vibrated like crazy under load, and were too weak to handle the robot's weight.

I didn't calculate anything - I just went with the feeling that

"Yeah, these motors are all I've ever used, and they've always worked fine; these bad boys will work fine for this project."

Sorry Arnav bhaiya, I should've listened to you....

Lesson Learnt: Make component and design choices after careful consideration and calculation, and listen to your seniors' advice.

Not Simulating or Testing

Since I didn't test/simulate anything, I had no idea whether the things I was planning would work. Had I simulated the bot on software before purchasing the components or 3D printing anything, I would have known that it wouldn't work and would have updated the design and required parts on time, saving money and time.

Lesson Learnt: Simulate or otherwise test your design before ordering components. You're not Ramanujan - test your stuff.

Being Broke

As a student, it made sense that I couldn't spend a lot on parts, but the fact is that good things take time, effort, and money. I could give the time and effort (though I didn't give enough, apparently), but money?

My choice of using SG90s was partly because bigger motors cost more. Spending a little more and designing with better motors in consideration would have worked out better. Tough to control, I get it - but sometimes, cutting corners just means cutting your whole project short.

Lesson Learnt: Don't optimize for cost, but for quality. Spending a little extra right now will help in the long run.

What Now?

Learning from my mistakes, I made a few key decisions:

Calculate what kinds of motors are needed instead of winging it.
Plan requirements properly and redesign.
Simulate the robot on Gazebo/Pybullet before purchasing anything.

With these points now in mind, and after some calculation along with the help of our dear friend ChatGPT, I've decided to use Tower Pro's MG958 Servo Motor. These are 7x heavier than the SG90 servos but have 10x the torque (albeit at 10x the price). Hopefully, they will be suitable, but so as not to make the same mistake again, I'll test this by simulating the bot before placing an order for 12.

Now that I knew what components I would use (hopefully), I redesigned the bot from scratch. The most significant change was in the leg mechanism. In the previous design, I had two motors at the hip and one at the knee. However, in the current design, I changed it to have all three motors at the hip and use linkage mechanisms to move the hip and knee joints properly. This reduces the mass and moment of inertia of the legs, resulting in faster and smoother movements. The inverse kinematics for this might be tricky, but I'm confident I'll manage.

The main thing that might take me time is to simulate everything on Gazebo correctly. I've only used Gazebo with PX4-Autopilot as a part of my project on Fault Tolerant Control of Quadrotor with Single Motor Failure.

As I keep working on the project, I’ll also add ROS2 to handle the brain side of things. Once the bot is walking properly in simulation using Gazebo, I’ll share another post with all the updates (and probably a few disasters, too).

More fun stuff is coming soon — stay tuned. 👀