How to calculate weights using gradient descent

Hello, I'm Ganesh. I'm building git-lrc, an AI code reviewer that runs on every commit. It is free, unlimited, and source-available on GitHub. Star git-lrc on GitHub to help more developers discover the project. Do give it a try and share your feedback to help improve the product.

In the previous article, I explained the requirements for finding the weights of the last layer. Now let's see how we actually assign and optimize those weights using gradient descent.

Selecting Random Weights From a Normal Distribution

First, we assign random numbers (drawn from a normal distribution) to the weights w3 and w4.
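As a minimal sketch of this step, the snippet below draws w3 and w4 with Python's standard library. The mean of 0, the standard deviation of 1, and the fixed seed are assumptions made for illustration; the article only says the values come from a normal distribution.

```python
import random

random.seed(42)  # fixed seed (an assumption) so the sketch is repeatable

# Assumed: a standard normal distribution (mean 0, standard deviation 1)
w3 = random.gauss(0.0, 1.0)
w4 = random.gauss(0.0, 1.0)
b3 = 0.0  # the bias starts at 0, as in the article

print(f"w3 = {w3:.3f}, w4 = {w4:.3f}, b3 = {b3}")
```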

Then we multiply each hidden neuron's output by its weight and add the two products together with the bias b3, which starts at 0:

predicted = (output of top neuron × w3) + (output of bottom neuron × w4) + b3
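The formula above can be sketched as a small function. The name `predict` and the numeric values are hypothetical, chosen only to show the arithmetic.

```python
def predict(top_output, bottom_output, w3, w4, b3):
    """Weighted sum of the two hidden-neuron outputs plus the bias."""
    return top_output * w3 + bottom_output * w4 + b3


# Hypothetical values for illustration only
top_output, bottom_output = 0.5, 2.0
w3, w4, b3 = 0.36, 0.63, 0.0

print(predict(top_output, bottom_output, w3, w4, b3))
```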

This gives us our initial prediction, which we can plot against the observed data.

With these initial random weights, the SSR (Sum of Squared Residuals) is calculated again to measure how far off our predictions are.
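For reference, here is one way the SSR could be computed. The function name `ssr` and the sample numbers are made up for illustration.

```python
def ssr(observed, predicted):
    """Sum of squared residuals between observed and predicted values."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))


# Hypothetical data: three observations and three predictions
observed = [0.0, 1.0, 0.0]
predicted = [0.5, 0.5, 0.5]

print(ssr(observed, predicted))  # 0.75
```

A smaller SSR means the predictions sit closer to the observed values.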

Gradient Descent Algorithm For Optimal Values

Now we need to find the derivative of SSR with respect to b3 so we can update b3.

Recall our loss function:

SSR = Σ (observed − predicted)²

And our predicted value is:

predicted = (output of top neuron × w3) + (output of bottom neuron × w4) + b3

To find this derivative, we use the same chain rule approach we used for backpropagation when optimizing only b3.
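Written out with the two formulas above, the chain rule looks like this. The index i over the data points is notation introduced here for clarity; it is not in the original formulas.

```latex
\frac{d\,\mathrm{SSR}}{d\,b_3}
  = \frac{d\,\mathrm{SSR}}{d\,\mathrm{predicted}} \times \frac{d\,\mathrm{predicted}}{d\,b_3}
  = \sum_i -2\,(\mathrm{observed}_i - \mathrm{predicted}_i) \times 1
```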

The key insight is that, when differentiating with respect to b3, the products of w3 and w4 with their respective neuron outputs are treated as constants. Since b3 is the only variable in the expression, the derivative of predicted with respect to b3 is simply 1, just as we saw in the previous articles.
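A rough sketch of one gradient descent step for b3 follows. The sample data, the starting weights, and the learning rate of 0.1 are all assumptions made for illustration; each sample is a hypothetical (top neuron output, bottom neuron output, observed value) triple.

```python
def predict(top, bottom, w3, w4, b3):
    return top * w3 + bottom * w4 + b3


def d_ssr_d_b3(samples, w3, w4, b3):
    """Chain rule: sum of -2 * (observed - predicted) * 1 over all samples."""
    return sum(-2 * (obs - predict(top, bot, w3, w4, b3))
               for top, bot, obs in samples)


# Hypothetical (top_output, bottom_output, observed) triples
samples = [(0.2, 1.0, 0.0), (0.9, 0.4, 1.0), (0.1, 1.5, 0.0)]
w3, w4, b3 = 0.36, 0.63, 0.0
learning_rate = 0.1  # assumed value

gradient = d_ssr_d_b3(samples, w3, w4, b3)
step_size = gradient * learning_rate
b3 = b3 - step_size  # move b3 against the gradient

print(f"gradient = {gradient:.4f}, new b3 = {b3:.4f}")
```

Repeating this step until the step size is close to zero is what moves b3 toward the value that minimizes the SSR.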

Conclusion

By assigning random weights from a normal distribution and then applying gradient descent with the chain rule, we can iteratively optimize each weight and bias in the network. The same process that worked for b3 alone now extends to w3 and w4: we just need to carefully apply the chain rule at each step to compute the correct gradients.
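To make the extension to w3 and w4 concrete, here is a minimal sketch that updates all three parameters together. It assumes the hidden neuron outputs stay fixed (the earlier layers are not being optimized here), and the data, starting values, learning rate, and step count are all hypothetical. The only change from the b3 case is the last chain rule factor: the derivative of predicted with respect to w3 is the top neuron's output, and with respect to w4 it is the bottom neuron's output.

```python
def predict(top, bottom, w3, w4, b3):
    return top * w3 + bottom * w4 + b3


def ssr(samples, w3, w4, b3):
    return sum((obs - predict(top, bot, w3, w4, b3)) ** 2
               for top, bot, obs in samples)


def gradients(samples, w3, w4, b3):
    """Derivatives of SSR with respect to w3, w4, and b3 via the chain rule."""
    d_w3 = d_w4 = d_b3 = 0.0
    for top, bot, obs in samples:
        d_ssr_d_pred = -2 * (obs - predict(top, bot, w3, w4, b3))
        d_w3 += d_ssr_d_pred * top   # d(predicted)/d(w3) = top output
        d_w4 += d_ssr_d_pred * bot   # d(predicted)/d(w4) = bottom output
        d_b3 += d_ssr_d_pred         # d(predicted)/d(b3) = 1
    return d_w3, d_w4, d_b3


# Hypothetical (top_output, bottom_output, observed) triples
samples = [(0.2, 1.0, 0.0), (0.9, 0.4, 1.0), (0.1, 1.5, 0.0)]
w3, w4, b3 = 0.36, 0.63, 0.0
learning_rate = 0.05  # assumed value

initial_ssr = ssr(samples, w3, w4, b3)
for _ in range(100):  # assumed number of steps
    d_w3, d_w4, d_b3 = gradients(samples, w3, w4, b3)
    w3 -= learning_rate * d_w3
    w4 -= learning_rate * d_w4
    b3 -= learning_rate * d_b3
final_ssr = ssr(samples, w3, w4, b3)

print(f"SSR before: {initial_ssr:.4f}, after: {final_ssr:.4f}")
```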