Rijul Rajesh


Exploring the Softmax Function (Part 2): Formula, Derivatives, and Why Argmax Fails in Backpropagation

In the previous article, we covered the Softmax function with an example.

Now, let’s generalize it.
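
Writing the raw output values as z_1, z_2, …, z_K, where K is the number of classes (this notation is an assumption chosen here for illustration), the Softmax probability for output i is:

$$
p_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}
$$

In our iris example, K = 3: Setosa, Versicolor, and Virginica.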

Here, i refers to an individual raw output value.

When i = 1, we are talking about the raw output value corresponding to Setosa.

We started looking into Softmax because of several disadvantages of Argmax.

One major disadvantage is related to derivatives.
When we try to take the derivative of the Argmax function, it is 0 everywhere (and undefined at the points where the output jumps), so gradient descent gets no useful signal from it.

On the other hand, Softmax has a derivative that can be used for backpropagation.

Let’s look at the predicted probability of Setosa, denoted as p_Setosa.
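
With z_Setosa, z_Versicolor, and z_Virginica as the three raw output values (again, names assumed here for illustration), p_Setosa can be written as:

$$
p_{Setosa} = \frac{e^{z_{Setosa}}}{e^{z_{Setosa}} + e^{z_{Versicolor}} + e^{z_{Virginica}}}
$$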

Derivatives of Softmax

Derivative of p_Setosa with respect to its own raw score:
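
In the notation above, this works out to:

$$
\frac{\partial p_{Setosa}}{\partial z_{Setosa}} = p_{Setosa} \, (1 - p_{Setosa})
$$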

Derivative of p_Setosa with respect to the raw score of Versicolor:
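
Which is:

$$
\frac{\partial p_{Setosa}}{\partial z_{Versicolor}} = -\, p_{Setosa} \, p_{Versicolor}
$$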

Derivative of p_Setosa with respect to the raw score of Virginica:
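
And similarly:

$$
\frac{\partial p_{Setosa}}{\partial z_{Virginica}} = -\, p_{Setosa} \, p_{Virginica}
$$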

From this, we can clearly see that the derivative of the Softmax function is not always zero, which means it can be effectively used with gradient descent.
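
To make this concrete, here is a minimal NumPy sketch that computes the three Softmax probabilities and the derivatives of p_Setosa written above. The raw output values are made-up example numbers, not values carried over from the previous article.

```python
import numpy as np

def softmax(z):
    """Turn raw output values into probabilities that sum to 1."""
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

# Made-up raw output values for Setosa, Versicolor, and Virginica
z = np.array([1.5, -0.4, 0.7])
p = softmax(z)  # p[0] = p_Setosa, p[1] = p_Versicolor, p[2] = p_Virginica

# Derivatives of p_Setosa with respect to each raw output value
d_setosa     = p[0] * (1 - p[0])  # dp_Setosa / dz_Setosa
d_versicolor = -p[0] * p[1]       # dp_Setosa / dz_Versicolor
d_virginica  = -p[0] * p[2]       # dp_Setosa / dz_Virginica

print("probabilities:", p)
print("derivatives of p_Setosa:", d_setosa, d_versicolor, d_virginica)
```

None of these derivatives come out to zero (unless one of the probabilities is exactly 0 or 1), which is exactly the kind of signal gradient descent needs.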

Now, in many neural networks, we use the sum of squared residuals to measure how well the model fits the data.

However, in the case of Softmax, where the output values lie between 0 and 1 and behave like probabilities, the sum of squared residuals is not the best choice; a loss that penalizes confidently wrong predictions more heavily gives gradient descent a stronger signal. This is where cross-entropy comes in.

We will discuss cross-entropy in the next article.

Looking for an easier way to install tools, libraries, or entire repositories?
Try Installerpedia: a community-driven, structured installation platform that lets you install almost anything with minimal hassle and clear, reliable guidance.

Just run:

```bash
ipm install repo-name
```

… and you’re done! 🚀


🔗 Explore Installerpedia here
