Imdadul Haque ohi

Why Softmax is Used Instead of Argmax in Neural Network Training


1. Information Loss with Argmax

Argmax only returns the index of the highest logit value and completely discards all confidence information:

argmax([2.1, 1.0, 0.5]) = 0
argmax([5.0, 0.1, 0.1]) = 0

Both return class 0, but we lose critical information about how confident the model is in its prediction.
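
A quick check (a minimal sketch assuming NumPy) shows both logit vectors collapse to the same index:

import numpy as np

# Two logit vectors with very different confidence levels
uncertain = np.array([2.1, 1.0, 0.5])
confident = np.array([5.0, 0.1, 0.1])

# Argmax keeps only the winning index; the confidence gap disappears
print(np.argmax(uncertain))  # 0
print(np.argmax(confident))  # 0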


2. Softmax Preserves the Full Distribution

Softmax converts logits into a probability distribution that preserves the relative confidence across all classes:

$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$

This allows the loss function to measure certainty or uncertainty, which is essential for gradient-based learning.
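
As a minimal sketch (assuming NumPy), the formula maps directly to code; subtracting the maximum logit is a standard numerical-stability trick that does not change the result:

import numpy as np

def softmax(z):
    # Exponentiate shifted logits, then normalize so the outputs sum to 1
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

p = softmax(np.array([2.1, 1.0, 0.5]))
print(p, p.sum())  # a full probability distribution that sums to 1.0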


3. Numerical Example: Uncertain vs Confident Predictions

Let's compare two scenarios with 3 classes and true label = class 0.

Scenario A: Uncertain Model

Logits: [2.1, 1.0, 0.5]

Argmax output:

argmax([2.1, 1.0, 0.5]) = 0

Softmax output:

$$\text{softmax}([2.1, 1.0, 0.5]) = \left[\frac{e^{2.1}}{e^{2.1}+e^{1.0}+e^{0.5}},\ \frac{e^{1.0}}{e^{2.1}+e^{1.0}+e^{0.5}},\ \frac{e^{0.5}}{e^{2.1}+e^{1.0}+e^{0.5}}\right]$$

$$= \left[\frac{8.17}{8.17+2.72+1.65},\ \frac{2.72}{12.54},\ \frac{1.65}{12.54}\right]$$

$$= [0.651, 0.217, 0.132]$$

The model is somewhat confident but not very certain (65% probability for class 0).


Scenario B: Confident Model

Logits: [5.0, 0.1, 0.1]

Argmax output:

argmax([5.0, 0.1, 0.1]) = 0

Softmax output:

$$\text{softmax}([5.0, 0.1, 0.1]) = \left[\frac{e^{5.0}}{e^{5.0}+e^{0.1}+e^{0.1}},\ \frac{e^{0.1}}{e^{5.0}+2e^{0.1}},\ \frac{e^{0.1}}{e^{5.0}+2e^{0.1}}\right]$$

$$= \left[\frac{148.41}{148.41+1.105+1.105},\ \frac{1.105}{150.62},\ \frac{1.105}{150.62}\right]$$

$$= [0.985, 0.007, 0.007]$$

The model is highly confident (98.5% probability for class 0).
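
Both hand calculations can be reproduced with a few lines of NumPy (a sketch; values rounded):

import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

print(softmax(np.array([2.1, 1.0, 0.5])))  # ≈ [0.65, 0.22, 0.13]
print(softmax(np.array([5.0, 0.1, 0.1])))  # ≈ [0.985, 0.007, 0.007]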


Key Observation

  • Argmax gives the same result (class 0) in both cases, losing all information about confidence.
  • Softmax gives drastically different distributions, capturing the model's uncertainty.

4. Cross-Entropy Loss with Softmax Outputs

Cross-entropy loss measures how well the predicted probabilities match the true label:

$$\mathcal{L} = -\sum_{i} y_i \log(p_i)$$

where $y_i$ is the one-hot encoded true label and $p_i$ is the softmax probability for class $i$.

For true label = class 0 (one-hot: [1, 0, 0]), the loss simplifies to:

$$\mathcal{L} = -\log(p_0)$$

Scenario A: Uncertain Model

$$\mathcal{L}_A = -\log(0.651) = 0.429$$

Scenario B: Confident Model

$$\mathcal{L}_B = -\log(0.985) = 0.015$$

Analysis

  • Higher loss (0.429) for the uncertain model → gradient pushes weights to increase confidence
  • Lower loss (0.015) for the confident model → gradient is small, model is already doing well
  • The loss quantifies how far the model is from perfect prediction

With argmax, we'd have no way to distinguish between these two scenarios and couldn't compute meaningful gradients.
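
Checking the two losses directly takes only a couple of lines (a sketch assuming NumPy; the probabilities are the softmax values for the true class from the scenarios above):

import numpy as np

# Cross-entropy with a one-hot target reduces to -log(p_true_class)
loss_a = -np.log(0.651)  # uncertain model
loss_b = -np.log(0.985)  # confident model
print(loss_a, loss_b)    # ≈ 0.429 and ≈ 0.015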


5. Why Argmax is Not Differentiable

The Problem with Argmax

Argmax is a discrete, discontinuous function:

f(z) = argmax(z)

Example: argmax([2.1, 1.0, 0.5]) = 0

If we gradually lower the first logit:

  • argmax([2.1, 1.0, 0.5]) = 0
  • argmax([2.09, 1.0, 0.5]) = 0
  • argmax([1.99, 1.0, 0.5]) = 0
  • ...
  • argmax([0.99, 1.0, 0.5]) = 1 ← Sudden jump!

The derivative

$$\frac{\partial \text{argmax}}{\partial z_i}$$

is:

  • Zero almost everywhere (small changes don't affect the output)
  • Undefined at the boundary (when two logits are equal)
$$\frac{\partial \text{argmax}(z)}{\partial z_i} = \begin{cases} 0 & \text{most of the time} \\ \text{undefined} & \text{at boundaries} \end{cases}$$

This means no gradient flows back through argmax → backpropagation cannot update weights!
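
A small sweep (a sketch assuming NumPy) makes the step behavior concrete: the output stays flat and then changes in a single jump, so no useful gradient exists anywhere:

import numpy as np

for z0 in [2.1, 1.5, 1.01, 1.0, 0.99]:
    z = np.array([z0, 1.0, 0.5])
    print(z0, np.argmax(z))
# Prints 0 for every value down to 1.0, then 1 at 0.99:
# flat regions give zero gradient, and the jump itself has no derivative.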


Why Softmax IS Differentiable

Softmax is a smooth, continuous function:

$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$

For any small change in $z_i$, the output changes smoothly. We can compute the derivative:

$$\frac{\partial \text{softmax}(z_i)}{\partial z_j} = \begin{cases} \text{softmax}(z_i)(1 - \text{softmax}(z_i)) & \text{if } i = j \\ -\text{softmax}(z_i) \cdot \text{softmax}(z_j) & \text{if } i \neq j \end{cases}$$

This provides well-defined gradients that backpropagation can use to update weights!
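
As a sketch (assuming NumPy), this case-wise formula is exactly the Jacobian built from the softmax outputs themselves:

import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

p = softmax(np.array([2.1, 1.0, 0.5]))

# d softmax_i / d z_j = p_i * (delta_ij - p_j): diagonal p_i(1 - p_i), off-diagonal -p_i * p_j
jacobian = np.diag(p) - np.outer(p, p)
print(jacobian)  # every entry is finite and well-defined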


Visual Intuition

Imagine plotting the output against a single logit value $z$:

Argmax: A step function (flat, then jumps)

Output
  1 |     ___________
    |    |
  0 |____|
    +----+----+----+
         z

→ Derivative is 0 or undefined

Softmax: A smooth S-curve

Output
  1 |       ___---
    |    ,-'
  0 |__-'
    +----+----+----+
         z

→ Derivative exists everywhere and is meaningful


6. Summary Comparison Table

| Aspect | Argmax | Softmax |
| --- | --- | --- |
| Output Type | Index (discrete) | Probability distribution (continuous) |
| Information Preserved | Only the winning class | Full confidence across all classes |
| Differentiable | No (zero or undefined gradient) | Yes (smooth, continuous gradients) |
| Use in Training | Cannot be used | Essential for backpropagation |
| Use in Inference | Often used to get the final prediction | Optional (can use probabilities directly) |
| Loss Computation | Impossible (no probability distribution) | Works with cross-entropy loss |
| Gradient Flow | None (stops backpropagation) | Provides meaningful gradients |
| Example Output | 0 (just an index) | [0.651, 0.217, 0.132] (probabilities) |

Conclusion

Softmax is used during training because:

  1. It preserves the full probability distribution, allowing the loss function to measure confidence
  2. It is differentiable, enabling gradient-based optimization via backpropagation
  3. It provides meaningful gradients that guide the learning process

Argmax is only used during inference (after training) when we need to make a discrete prediction, as it simply selects the class with the highest probability. During training, we need the continuous, differentiable nature of softmax to learn effectively.
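
In code, this split typically looks like the following PyTorch-style sketch (the model, data, and optimizer here are placeholders, not code from the article; note that nn.CrossEntropyLoss applies log-softmax to the logits internally):

import torch
import torch.nn as nn

model = nn.Linear(10, 3)                                  # placeholder classifier: 10 features, 3 classes
criterion = nn.CrossEntropyLoss()                         # softmax (in log form) lives inside the loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 10)                                    # dummy batch
y = torch.tensor([0, 2, 1, 0])                            # dummy labels

# Training: softmax keeps the loss differentiable, so gradients flow
logits = model(x)
loss = criterion(logits, y)
loss.backward()
optimizer.step()

# Inference: argmax turns the scores into a discrete class prediction
with torch.no_grad():
    preds = model(x).argmax(dim=1)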

$$\boxed{\text{Training: Softmax} \qquad \text{Inference: Argmax (optional)}}$$
