Fawole Joshua
How AI Learns: Gradient Descent Explained Through a Midnight Smoky Jollof Adventure

Many aspects of the modern world are now powered by artificial intelligence, and this has significantly accelerated human civilization.

From faster disease detection to automated decision-making, from breakthroughs in medical imaging to the quiet, rapid adoption of AI in law firms and the judicial system, artificial intelligence is actively reshaping fields such as agriculture, and its impact can be felt across nearly every sector.

Yet, despite this tremendous progress, many people do not actually understand where artificial intelligence gets its brilliance from. AI's ability to identify errors and iteratively improve is certainly amazing.

This article will gently hold you by the hand and explain the true superpower behind AI and machine learning.

The answer lies in a simple mathematical algorithm called Gradient Descent.

What is Gradient Descent?

Gradient descent is a general-purpose optimization algorithm capable of finding good solutions to a wide range of problems. In machine learning, it works by iteratively updating a model's parameters to minimize a loss (or cost) function.

In very simple terms, gradient descent helps AI figure out how wrong it is and how to become less wrong.

To truly understand what gradient descent is and why it works, we can look under the hood and reason like an AI model.

Midnight Smoky Jollof Adventure

Say you went home for Thanksgiving and your mom cooked a special, taste-bud-pleasing pot of Nigerian jollof. Thanksgiving was perfect, you reconnected with your siblings, and then everyone went to bed. But in the middle of the night, your brain and tongue kept craving more; the smoky jollof rice was so tantalising that you could smell it several feet away.

You resisted the feeling, but it got the better of you, so you stood up and started making your way to the kitchen. Here is the problem: the lights are off and you can't see a thing. You don't want to get caught, nor do you want to trip over something.

Imagine the house floor as a graph paper.

X-axis = left-right position

Y-axis = forward-backward position

Your location = coordinates (X, Y)

You are currently at point (1, 1)

The Loss Function

We need a way to measure how close we are to the jollof rice. Say the pot sits in the kitchen at coordinates (3, 4).

Normal distance formula:

Distance = √((x - 3)² + (y - 4)²)

Let's just use the squared distance instead (it has the same minimum and is easier to differentiate):

Loss(x, y) = (x - 3)² + (y - 4)²

This loss is very important: it will be our compass on the way to the jollof rice, showing how far off (wrong) we are.

The higher the loss, the more wrong we are (i.e., the farther we are from the kitchen). Therefore, our goal is to reduce the loss function until we reach the kitchen and the jollof rice.

At starting point (1, 1):

Loss = (1 - 3)² + (1 - 4)² = (-2)² + (-3)² = 4 + 9 = 13. This means we are very far from the kitchen.
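To make this concrete, here is a tiny Python sketch of the squared-distance loss. The kitchen coordinates (3, 4) come from our story, not from any real API:

```python
# Squared-distance loss: how far are we from the jollof at (3, 4)?
def loss(x, y):
    return (x - 3) ** 2 + (y - 4) ** 2

print(loss(1, 1))  # 13: we start far from the kitchen
print(loss(3, 4))  # 0: standing right at the pot
```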

Testing Directions

Then let's tweak the parameters a little:

From (1, 1) to (1.001, 1)

New loss: (1.001 - 3)² + (1 - 4)² = (-1.999)² + (-3)² = 3.996 + 9 = 12.996

The old loss was 13, now the new loss is 12.996 (decreased by 0.004, we are making progress!)

Then let's say we tweak the parameters even more. From (1, 1) to (1, 1.001):

New loss: (1 - 3)² + (1.001 - 4)² = (-2)² + (-2.999)² = 4 + 8.994 = 12.994 (getting closer)
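Both nudges can be checked numerically with a quick finite-difference sketch (this is just for intuition; real frameworks compute gradients analytically, as we do next):

```python
def loss(x, y):
    return (x - 3) ** 2 + (y - 4) ** 2

base = loss(1, 1)         # 13
step_x = loss(1.001, 1)   # nudge x only
step_y = loss(1, 1.001)   # nudge y only

print(base - step_x)  # ≈ 0.004: moving right reduces the loss
print(base - step_y)  # ≈ 0.006: moving forward reduces it even more
```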

The Mathematical Shortcut

Instead of testing each direction, we can take a mathematical shortcut (find the derivative):

For loss = (x - 3)² + (y - 4)²

How loss changes with x:

If we change x by ∆x, loss changes by approximately:

2 * (x - 3) * ∆x

Why? Because the derivative of (x - 3)² with respect to x is 2(x - 3).

So at x = 1:

2 * (1 - 3) = 2 * (-2) = -4

This means that for every tiny step right, loss decreases by 4 times that step size.

How loss changes with y:

2 * (y - 4) * ∆y

At y = 1:

2 * (1 - 4) = 2 * (-3) = -6

For every step forward, loss decreases by 6 times that tiny step size.

The Gradient Vector

We put these together into a gradient vector:

Gradient = [-4, -6]^T

To update our position, we need to choose a step size, called the learning rate (η = 0.1).

The learning rate must not be too small (we don't want to take forever to reach the kitchen), nor too large (we don't want to overshoot or knock something over).

Now our movement formula will be:

New position = old position - η * Gradient

x-new = 1 - 0.1 * (-4) = 1 + 0.4 = 1.4

y-new = 1 - 0.1 * (-6) = 1 + 0.6 = 1.6.

We just moved from (1, 1) to (1.4, 1.6).

Old loss at (1, 1) = 13

New loss at (1.4, 1.6) = (1.4 - 3)² + (1.6 - 4)² = (-1.6)² + (-2.4)² = 2.56 + 5.76 = 8.32

We just improved from a loss of 13 to only 8.32, this is great progress and we are certainly close to the kitchen now.
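The single update step above can be reproduced in a few lines of Python (η = 0.1 and the target (3, 4) are from our story):

```python
def grad(x, y):
    # Partial derivatives of (x - 3)^2 + (y - 4)^2
    return 2 * (x - 3), 2 * (y - 4)

eta = 0.1
x, y = 1.0, 1.0
gx, gy = grad(x, y)                      # (-4.0, -6.0)
x, y = x - eta * gx, y - eta * gy        # step AGAINST the gradient
print(round(x, 4), round(y, 4))          # (1.4, 1.6)
print(round((x - 3) ** 2 + (y - 4) ** 2, 4))  # 8.32
```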

Next Iterations

As our little journey continues we compute the next gradients:

Now at (1.4, 1.6):

For x: 2 * (1.4 - 3) = 2 * (-1.6) = -3.2

For y: 2 * (1.6 - 4) = 2 * (-2.4) = -4.8

Gradient = [-3.2, -4.8]^T

x-new = 1.4 - 0.1 * (-3.2) = 1.4 + 0.32 = 1.72

y-new = 1.6 - 0.1 * (-4.8) = 1.6 + 0.48 = 2.08

Loss at (1.72, 2.08): (-1.28)² + (-1.92)² = 1.6384 + 3.6864 = 5.3248

Loss dropped from 8.32 to 5.32. Congratulations, you are now at the kitchen door!

With a couple more iterations, you will reach the global optimum (your goal: the lowest loss, little to no error). This is certain here because the loss function is convex, so gradient descent is guaranteed to converge.
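Running the loop until the gradient vanishes shows this convergence directly. A minimal sketch with a fixed learning rate:

```python
def grad(x, y):
    return 2 * (x - 3), 2 * (y - 4)

eta = 0.1
x, y = 1.0, 1.0
for step in range(50):
    gx, gy = grad(x, y)
    x, y = x - eta * gx, y - eta * gy

print(round(x, 4), round(y, 4))  # ≈ (3.0, 4.0): we found the jollof
```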

In essence, gradient descent measures the local gradient of the error function with respect to the parameter vector θ and steps in the direction of the descending gradient. Once the gradient is zero, you have reached the minimum! (Or, more precisely, a critical point, which could be a minimum, maximum, or saddle point.)

In Real Machine Learning

Instead of 2 parameters (x, y), there are millions or billions (weights in a neural network).

Instead of "squared distance," they use losses like Cross-Entropy or Mean Squared Error.

Instead of one perfect pot, they navigate a complex, multi-dimensional "loss landscape" with hills, valleys, and plateaus.

But the core algorithm, the relentless optimization engine, remains Gradient Descent and its smarter variants (Adam, RMSProp).

This is exactly how gradient descent works and how artificial intelligence can learn patterns and improve its predictions.

Types of Gradient Descent

Batch Gradient Descent

This is the approach where all training examples are used to compute the gradient, after which a single update step is taken:

θ_new = θ_old - η * (1/m) * Σ(∇L(θ, x_i, y_i))

Where:

  • m = total number of training examples

  • η = learning rate

  • ∇L(θ, x_i, y_i) = gradient of the loss for example i
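A minimal sketch of batch gradient descent, here fitting the slope w of a one-parameter model y ≈ w·x with a mean-squared-error loss. The data, learning rate, and iteration count are made up for illustration:

```python
import numpy as np

# Toy data: the true slope is 2.
X = np.array([1.0, 2.0, 3.0, 4.0])
Y = 2.0 * X

w, eta = 0.0, 0.05
for epoch in range(20):
    # Average gradient of MSE over ALL m examples, then ONE update step.
    grad = (2.0 / len(X)) * np.sum((w * X - Y) * X)
    w -= eta * grad

print(round(w, 4))  # ≈ 2.0
```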

Stochastic Gradient Descent

Batch gradient descent uses the whole training set to compute the gradient at every step, which greatly slows down computation on large datasets. Stochastic gradient descent, on the other hand, picks a random instance from the training set at every step and computes the gradient from that single instance.

This makes the algorithm much faster but also noisier. The stochastic noise can help SGD escape some local minima, and it will end up very close to the global optimum; with a constant learning rate, however, it oscillates around the minimum rather than converging exactly.

For each random example i:

θ = θ - η * ∇L(θ, x_i, y_i)
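The same toy slope-fitting problem, now updated one random example at a time (again, the data and hyperparameters are invented for illustration):

```python
import random
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
Y = 2.0 * X

w, eta = 0.0, 0.01
rng = random.Random(0)
indices = list(range(len(X)))
for epoch in range(50):
    rng.shuffle(indices)  # visit examples in random order each epoch
    for i in indices:
        grad = 2.0 * (w * X[i] - Y[i]) * X[i]  # gradient from ONE example
        w -= eta * grad

print(round(w, 3))  # ≈ 2.0: a noisier path, same destination
```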

Mini-Batch Gradient Descent

Here, a small batch of examples (typically 16 or 32) is used to compute the gradient before each update. It is the sweet spot between SGD and batch gradient descent.

For each batch B of size b:

∇L_batch = (1/b) * Σ ∇L(θ, x_i, y_i) for i in B

θ = θ - η * ∇L_batch
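And the mini-batch variant on the same toy problem, with a made-up batch size of 2 so the tiny dataset splits into two batches per epoch:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
Y = 2.0 * X

w, eta, batch_size = 0.0, 0.02, 2
rng = np.random.default_rng(0)
for epoch in range(40):
    order = rng.permutation(len(X))  # reshuffle examples each epoch
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]
        # Average gradient over the mini-batch, then one update.
        grad = (2.0 / batch_size) * np.sum((w * X[batch] - Y[batch]) * X[batch])
        w -= eta * grad

print(round(w, 3))  # ≈ 2.0
```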

Conclusion

Understanding how gradient descent works is profound: it shows that artificial intelligence and its way of learning isn't about being perfect from the very beginning; it's about having a reliable method for steadily becoming less wrong.

This is how AI learns, and it may hint at how humans function as well: psychologists often recommend setting aside reflective or meditative time to consider what went wrong and how to fix it. In that sense, gradient descent is a small link between artificial intelligence and the human race.
