I Built a Neural Network from Scratch - Here's What I Discovered
I'm on a 42-week mission to deeply understand AI - from first principles to frontier research. No shortcuts. No model.fit(). Just raw Python and NumPy.
Week 1: Build a neural network from scratch that recognizes handwritten digits.
Here's what happened.
The Setup
- Dataset: MNIST (60k training, 10k test images of digits 0-9)
- Architecture: Input (784) → Hidden (128) → Output (10)
- Tools: Python, NumPy, Hugging Face datasets
- Framework: None. That's the point.
What Even Is an Input Layer?
My first question wasn't about gradients or loss functions. It was: what information can you extract from a single pixel?
I pulled the dataset from Hugging Face and went into the data itself before writing any code. The images are 28×28 pixels, and a pixel is just a number from 0 to 255 representing brightness. Nothing more. Flatten that grid and you get a vector of 784 numbers - that's your input layer. Not magic, just the data.
From the starter hint - Input (784) → Hidden (128) → Output (10) - the input layer was the easy part to understand. It's literally just the pixels stacked in an array, with no transformation applied.
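As a quick sanity check, flattening is a one-liner. This is a minimal sketch using a fake random image in place of the real Hugging Face download, so the pipeline stays self-contained:

```python
import numpy as np

# A fake 28x28 grayscale image standing in for one MNIST digit
# (integer brightness values 0-255, exactly like the real dataset).
image = np.random.randint(0, 256, size=(28, 28), dtype=np.uint8)

# The input layer is nothing more than the pixels stacked into a vector.
x = image.reshape(784).astype(np.float32) / 255.0  # scale to [0, 1]

print(x.shape)  # (784,)
```

Scaling to [0, 1] isn't strictly part of "just the pixels", but it keeps the numbers in a range the later layers handle well.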
The 784 → 128 Problem
This is where it got interesting. 784 isn't evenly divisible by 128 (784 / 128 = 6.125), so we're definitely not merging or compressing pixels directly. That leaves two approaches:
- Manually extract 128 features - trace of the matrix, determinant, sum of each row, column averages. Something hand-designed. But then what makes a feature useful? If every digit gives you the same determinant value, that feature teaches the network nothing. And if we're just computing fixed statistics, isn't this just another input layer with extra steps?
- Assign 128 different weight vectors, each of length 784, with one weight per pixel per neuron. The network figures out which pixels matter through training.
I leaned toward option 1 for a while. Option 2 felt like cheating - how can random weights learn anything meaningful? But that's exactly the point. The hidden layer isn't supposed to store hand-crafted features; it's a space the network builds by itself through training.
Why 128 specifically? Fewer than 16 hidden neurons and there aren't enough to capture the differences between digit shapes. More than a thousand and the network stops learning patterns and starts memorizing individual training images instead.
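Option 2 in NumPy is a single matrix multiply. A sketch with stand-in values; the 0.01 init scale is an assumption, and the ReLU matches the non-linearity discussed later in the post:

```python
import numpy as np

rng = np.random.default_rng(0)

# 128 weight vectors of length 784: one weight per pixel per hidden neuron.
W1 = rng.normal(0.0, 0.01, size=(784, 128))
b1 = np.zeros(128)

x = rng.random(784)                    # stand-in for one flattened image
hidden = np.maximum(0.0, x @ W1 + b1)  # ReLU: only "firing" neurons pass through

print(hidden.shape)  # (128,)
```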
Getting from 128 to a Prediction
After computing the 128 hidden values, I needed to turn them into an actual digit prediction. My first instinct was to sum all 128 values into a single number, normalize it, and compare against a learned threshold per label.
That gave me 22% accuracy - only about twice the 10% you'd get guessing randomly across 10 classes. Worse, I realized I had a deeper problem: if everything collapses into one number, how do I figure out which weights caused the error and in which direction to fix them?
So I switched to a completely different approach: multiply the hidden layer by 10 separate weight vectors, each of length 128, one per digit class. Instead of collapsing to a single value, I now had 10 raw output numbers and I'd let the network decide what each one means.
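That per-class setup is just ten dot products against the hidden layer. A minimal sketch with random stand-in values:

```python
import numpy as np

rng = np.random.default_rng(0)

hidden = rng.random(128)                    # the 128 hidden values
W2 = rng.normal(0.0, 0.01, size=(128, 10)) # 10 weight vectors, one per digit

scores = hidden @ W2                        # 10 raw output numbers
prediction = int(np.argmax(scores))         # the highest score is the guess
print(scores.shape, prediction)
```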
That got me to a prediction, but 10 raw numbers aren't probabilities. The highest one might be 47 and the second highest 46 - I had no way to tell if the network was confident or not. I needed to convert them into values that sum to 1.
That's where softmax comes in, and I worked it out before knowing it had a name:
- The simplest attempt: zᵢ / Σⱼ zⱼ. Divide each value by the total sum and you get a ratio. But the sum of my 10 outputs could easily be zero or negative, which makes the whole thing break immediately.
- To fix the sign problem, I squared everything: zᵢ² / Σⱼ zⱼ². All positive now, no more zero-sum issue. But squaring throws away sign information entirely: +5 and -5 both become 25, so there's no way to tell if the network was confidently right or confidently wrong. The direction of the signal disappears.
- Instead of squaring, raise a constant to the power of each value: aᶻⁱ / Σⱼ aᶻʲ. For negative inputs, a⁻ˣ gives a small positive number; for positive inputs, a large one. The direction is preserved and everything stays positive. I used a = 2; the standard is e. Both work, e just has nicer mathematical properties.
That's softmax. I derived it through trial and error before I knew what it was called.
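In code, the final form is a few lines. This version uses base e; subtracting the max before exponentiating is a standard numerical-stability trick, not something the derivation above needed:

```python
import numpy as np

def softmax(z):
    # Subtracting max(z) cancels out in the ratio but keeps
    # np.exp from overflowing on large scores.
    exp = np.exp(z - np.max(z))
    return exp / exp.sum()

scores = np.array([47.0, 46.0, 0.0, 1.0, -3.0, 2.0, 5.0, 1.0, 0.0, -1.0])
probs = softmax(scores)

print(probs.sum())      # 1.0 (up to float rounding)
print(probs[0], probs[1])
```

This also answers the 47-vs-46 confidence question: those two scores map to roughly 0.73 and 0.27, so the network's uncertainty between them becomes visible.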
Error and Learning
Before touching any weights, I had to figure out what direction to push them and by how much.
Error was the more intuitive part. For the correct label, I want the network's confidence to be 1, so the error is 1 - confidence. For every wrong label, I want confidence to be 0, so the error is 0 - confidence, which is just -confidence. In short: error = actual - predicted. A negative error means the weight contributed to a wrong answer and needs to be pulled back. A positive error means it helped and should be reinforced.
Cost - figuring out how much to actually change each weight - was harder.
My first attempt was to calculate the total impact of each neuron across all 10 outputs, then scale the weight update by total error across all classes. It looked clean on paper, but fell apart in practice. The errors from different output neurons often cancelled each other out when summed, and the weight magnitudes could also sum to near zero, so the gradient signal kept disappearing into the aggregation.
The fix was to stop thinking globally and handle one output at a time:
Cost(weight) = hidden_activation × output_error × learning_rate
weight_new = weight_old + cost
Each weight gets updated separately for each output neuron's error signal. The sign of the error gives direction, and the size of the hidden activation scales how much that weight gets blamed. Running this, accuracy jumped to 60%.
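The per-output update rule vectorizes into a single outer product. A sketch: the learning rate, the random stand-ins for the activations, and the fake label are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
lr = 0.01                                   # assumed learning rate

hidden = rng.random(128)                    # hidden activations for one image
W2 = rng.normal(0.0, 0.01, size=(128, 10))  # output-layer weights

predicted = rng.random(10)                  # stand-in for the softmax output
target = np.zeros(10)
target[3] = 1.0                             # pretend the true digit is 3
error = target - predicted                  # error = actual - predicted

# cost(weight) = hidden_activation * output_error * learning_rate,
# applied separately for every (hidden neuron, output neuron) pair.
W2 += lr * np.outer(hidden, error)
print(W2.shape)  # (128, 10)
```

`np.outer(hidden, error)` is exactly "each weight updated for each output neuron's error signal", done in one shot instead of a double loop.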
The catch was that only the output layer weights were being updated. The weights from input to hidden were still completely random, which meant 40% of the network was pure noise.
Backpropagation
To update Layer 1, I needed an "error" signal at the hidden layer - but there's no direct label there, only at the output. So I defined the fault of each hidden neuron based on how much it contributed to the output errors:
error_hidden(n) = n × Σ(w_output × output_error)
cost(input → n) = input_pixel × error_hidden(n) × learning_rate
I multiplied by n (the neuron's activation value) to make sure weights for neurons that weren't even firing didn't get updated. A neuron that output zero didn't contribute to anything, so it shouldn't be blamed. It turned out this reasoning was consistent with the ReLU derivative, but I got there through logic rather than calculus.
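Putting the two formulas together for one training example, as a sketch with random stand-ins (shapes match the 784 → 128 → 10 architecture, learning rate assumed):

```python
import numpy as np

rng = np.random.default_rng(2)
lr = 0.01                                    # assumed learning rate

x = rng.random(784)                          # one flattened input image
W1 = rng.normal(0.0, 0.01, size=(784, 128))
W2 = rng.normal(0.0, 0.01, size=(128, 10))

hidden = np.maximum(0.0, x @ W1)             # ReLU hidden activations
output_error = rng.random(10) - 0.5          # stand-in for actual - predicted

# error_hidden(n) = n * sum(w_output * output_error): blame each hidden
# neuron for the errors it fed into, scaled by its own activation,
# so neurons that didn't fire (activation 0) get no update.
error_hidden = hidden * (W2 @ output_error)

# cost(input -> n) = input_pixel * error_hidden(n) * learning_rate
W1 += lr * np.outer(x, error_hidden)
print(W1.shape)  # (784, 128)
```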
Once Layer 1 started updating as well, accuracy jumped to 95% on the test set. I drew a few digits in MS Paint and ran them through the model. It worked on those too.
What I Discovered Along the Way
Softmax isn't obvious. I went through three candidate functions before landing on the exponential form. Each failure had a reason, and understanding why each one broke told me more than the final answer did.
Gradient descent is intuitive before it has a name. I was already computing gradients when I asked "how much did this weight contribute to the error?" and scaled the update accordingly. The math came after the intuition.
Non-linearity is what makes depth matter. Without something like ReLU between layers, stacking matrix multiplications doesn't add any expressive power - W₃ · W₂ · W₁ · x collapses into a single matrix multiplication W · x. The non-linearity is what lets each layer actually learn something the previous one couldn't.
Aggregating gradients loses information. My first cost function summed errors across all outputs and the updates kept cancelling. Switching to per-output updates, where each weight is updated for each error signal separately, fixed it immediately.
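The "depth without non-linearity collapses" point above is easy to verify numerically. A tiny sketch with random 4×4 matrices:

```python
import numpy as np

rng = np.random.default_rng(3)
W1, W2, W3 = (rng.normal(size=(4, 4)) for _ in range(3))
x = rng.normal(size=4)

deep = W3 @ (W2 @ (W1 @ x))       # three "layers", no non-linearity
collapsed = (W3 @ W2 @ W1) @ x    # one equivalent matrix W

print(np.allclose(deep, collapsed))  # True
```

Matrix multiplication is associative, so no matter how many purely linear layers you stack, a single matrix always reproduces them exactly.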
What I'd Do Differently
- Batch processing: I updated weights after every single image. Mini-batches would converge faster and produce better generalization.
- Learning rate scheduling: Fixed learning rate from start to finish. Decaying it over time helps fine-tuning once the weights are in the right ballpark.
- Better initialization: Starting from uniform random weights isn't optimal. Xavier and He initialization are designed specifically to keep gradients stable through early training.
- Cross-entropy loss: My actual - predicted error works, but cross-entropy gives smoother gradients when combined with softmax and is what every real implementation uses.
Numbers
| Metric | Value |
|---|---|
| Architecture | 784 → 128 → 10 |
| Training data | 60,000 MNIST images |
| Test accuracy | 95% |
| Framework | None (NumPy only) |
This is Week 1 of 42. Next week: CNNs from scratch, then making a Snake AI play itself.


