In the previous articles on multiple input and output nodes for predicting the Iris species, we plugged different values of petal width and sepal width into the network and obtained raw output values for Setosa, Versicolor, and Virginica.
Suppose we set:
- Petal Width = 0
- Sepal Width = 1
We get the following values:
- Setosa = 1.43
- Versicolor = -0.4
- Virginica = 0.23
From this, we can see that the raw output values are not always between 0 and 1:
- Setosa is greater than 1
- Versicolor is less than 0
Due to this broad range of values, interpreting raw outputs becomes difficult.
Why ArgMax?
To solve this, we pass the raw output values to either an ArgMax layer or a Softmax layer.
ArgMax works in a very simple way.
It sets the largest value to 1 and all other values to 0.
For our example:
- Setosa = 1.43
- Versicolor = -0.4
- Virginica = 0.23
Setosa has the highest value, so it will be set to 1, and the remaining species will be set to 0.
So the output becomes:
- Setosa = 1
- Versicolor = 0
- Virginica = 0
This makes the output very easy to interpret.
At a glance, we can see that Setosa has a value of 1, which implies that for the given petal and sepal width, the input most likely belongs to the Setosa species.
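As a quick check, here is a minimal sketch of reproducing this one-hot behavior with NumPy, assuming the raw output values from the example above:

import numpy as np

# Raw output values from the example above (Petal Width = 0, Sepal Width = 1)
raw_outputs = np.array([1.43, -0.4, 0.23])  # Setosa, Versicolor, Virginica
species = ["Setosa", "Versicolor", "Virginica"]

# ArgMax behavior: the largest value becomes 1, everything else becomes 0
argmax_outputs = np.where(raw_outputs == raw_outputs.max(), 1, 0)

for name, value in zip(species, argmax_outputs):
    print(f"{name} = {value}")
# Setosa = 1
# Versicolor = 0
# Virginica = 0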
Problem with ArgMax
However, ArgMax has a major problem.
We cannot use ArgMax to optimize weights and biases in a neural network.
This is because the output values of ArgMax are constants: 0 and 1.
Let’s visualize why this is an issue.
Visualizing the ArgMax Issue
Think of the ArgMax output for Setosa as a function of its raw value, with the other two raw values held fixed. The second-largest value is 0.23 (Virginica), so according to ArgMax:
- Any Setosa value greater than 0.23 becomes 1
- Any Setosa value less than or equal to 0.23 becomes 0
This creates a step function.
Let’s plot this.
import numpy as np
import matplotlib.pyplot as plt

# Sweep the raw Setosa output over a range that covers the observed values
x = np.linspace(-1, 2, 400)
# ArgMax output for Setosa: 1 if it exceeds the second-largest value (0.23), else 0
y = np.where(x > 0.23, 1, 0)

plt.plot(x, y)
plt.xlabel("Raw output value")
plt.ylabel("ArgMax output")
plt.title("ArgMax behavior for second-largest value (0.23)")
plt.show()
What this graph tells us
- The graph consists of two flat segments
- Both regions have a slope of 0
- Therefore, the derivative is 0 everywhere (and undefined exactly at the jump)
Since derivatives are used during backpropagation, a zero derivative means:
- No gradient
- No learning
- No movement toward optimal weights and biases
If the gradient becomes zero, the network cannot update its parameters.
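To make this concrete, here is a minimal sketch, reusing the raw outputs from above, showing that nudging the Setosa value does not change the ArgMax output, so its numerical gradient is zero:

import numpy as np

def argmax_one_hot(values):
    # Largest value becomes 1, all others become 0
    return np.where(values == values.max(), 1, 0)

raw = np.array([1.43, -0.4, 0.23])  # Setosa, Versicolor, Virginica
eps = 1e-3                          # small nudge to the Setosa value

nudged = raw.copy()
nudged[0] += eps

# Finite-difference "gradient" of the Setosa output with respect to its raw value
grad = (argmax_one_hot(nudged)[0] - argmax_one_hot(raw)[0]) / eps
print(grad)  # 0.0 -> no gradient, so backpropagation cannot update the weights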
Conclusion
Because ArgMax produces flat, piecewise-constant outputs whose gradient is zero (or undefined at the jump), it cannot be used during training.
This limitation leads us to the Softmax function, which produces smooth, differentiable outputs and is suitable for backpropagation.
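As a quick preview, here is a minimal sketch of applying the standard softmax formula to the same raw outputs; the full walkthrough is left for the next article:

import numpy as np

raw = np.array([1.43, -0.4, 0.23])         # Setosa, Versicolor, Virginica
softmax = np.exp(raw) / np.exp(raw).sum()  # smooth, differentiable, sums to 1
print(softmax.round(3))                    # ≈ [0.684 0.11 0.206], all between 0 and 1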
We’ll explore Softmax in the next article.
You can try the examples out in the Colab notebook.
Looking for an easier way to install tools, libraries, or entire repositories?
Try Installerpedia: a community-driven, structured installation platform that lets you install almost anything with minimal hassle and clear, reliable guidance.
Just run:
ipm install repo-name
… and you’re done! 🚀

