Imdadul Haque ohi

Why Softmax is Used Instead of Argmax in Neural Network Training


1. Information Loss with Argmax

Argmax only returns the index of the highest logit value and completely discards all confidence information:

argmax([2.1, 1.0, 0.5]) = 0
argmax([5.0, 0.1, 0.1]) = 0

Both return class 0, but we lose critical information about how confident the model is in its prediction.
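
A quick check (a minimal sketch assuming NumPy) shows both logit vectors collapse to the same index:

import numpy as np

# Two logit vectors with very different confidence levels
uncertain = np.array([2.1, 1.0, 0.5])
confident = np.array([5.0, 0.1, 0.1])

# Argmax keeps only the winning index; the confidence gap disappears
print(np.argmax(uncertain))  # 0
print(np.argmax(confident))  # 0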


2. Softmax Preserves the Full Distribution

Softmax converts logits into a probability distribution that preserves the relative confidence across all classes:

$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$

This allows the loss function to measure certainty or uncertainty, which is essential for gradient-based learning.
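
As a minimal sketch (assuming NumPy), the formula maps directly to code; subtracting the maximum logit is a standard numerical-stability trick that does not change the result:

import numpy as np

def softmax(z):
    # Exponentiate shifted logits, then normalize so the outputs sum to 1
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

p = softmax(np.array([2.1, 1.0, 0.5]))
print(p, p.sum())  # a full probability distribution that sums to 1.0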


3. Numerical Example: Uncertain vs Confident Predictions

Let's compare two scenarios with 3 classes and true label = class 0.

Scenario A: Uncertain Model

Logits: [2.1, 1.0, 0.5]

Argmax output:

argmax([2.1, 1.0, 0.5]) = 0

Softmax output:

$$\text{softmax}([2.1, 1.0, 0.5]) = \left[\frac{e^{2.1}}{e^{2.1}+e^{1.0}+e^{0.5}},\ \frac{e^{1.0}}{e^{2.1}+e^{1.0}+e^{0.5}},\ \frac{e^{0.5}}{e^{2.1}+e^{1.0}+e^{0.5}}\right]$$

$$= \left[\frac{8.17}{8.17+2.72+1.65},\ \frac{2.72}{12.54},\ \frac{1.65}{12.54}\right]$$

$$= [0.651, 0.217, 0.132]$$

The model is somewhat confident but not very certain (65% probability for class 0).


Scenario B: Confident Model

Logits: [5.0, 0.1, 0.1]

Argmax output:

argmax([5.0, 0.1, 0.1]) = 0

Softmax output:

$$\text{softmax}([5.0, 0.1, 0.1]) = \left[\frac{e^{5.0}}{e^{5.0}+e^{0.1}+e^{0.1}},\ \frac{e^{0.1}}{e^{5.0}+2e^{0.1}},\ \frac{e^{0.1}}{e^{5.0}+2e^{0.1}}\right]$$

$$= \left[\frac{148.41}{148.41+1.105+1.105},\ \frac{1.105}{150.62},\ \frac{1.105}{150.62}\right]$$

$$= [0.985, 0.007, 0.007]$$

The model is highly confident (98.5% probability for class 0).
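
Both hand calculations can be reproduced with a few lines of NumPy (a sketch; values rounded):

import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

print(softmax(np.array([2.1, 1.0, 0.5])))  # ≈ [0.65, 0.22, 0.13]
print(softmax(np.array([5.0, 0.1, 0.1])))  # ≈ [0.985, 0.007, 0.007]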


Key Observation

  • Argmax gives the same result (class 0) in both cases, losing all information about confidence.
  • Softmax gives drastically different distributions, capturing the model's uncertainty.

4. Cross-Entropy Loss with Softmax Outputs

Cross-entropy loss measures how well the predicted probabilities match the true label:

$$\mathcal{L} = -\sum_{i} y_i \log(p_i)$$

where $y_i$ is the one-hot encoded true label and $p_i$ is the softmax probability for class $i$.

For true label = class 0 (one-hot: [1, 0, 0]), the loss simplifies to:

$$\mathcal{L} = -\log(p_0)$$

Scenario A: Uncertain Model

$$\mathcal{L}_A = -\log(0.651) = 0.429$$

Scenario B: Confident Model

$$\mathcal{L}_B = -\log(0.985) = 0.015$$

Analysis

  • Higher loss (0.429) for the uncertain model → gradient pushes weights to increase confidence
  • Lower loss (0.015) for the confident model → gradient is small, model is already doing well
  • The loss quantifies how far the model is from perfect prediction

With argmax, we'd have no way to distinguish between these two scenarios and couldn't compute meaningful gradients.
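
Checking the two losses directly takes only a couple of lines (a sketch assuming NumPy; the probabilities are the softmax values for the true class from the scenarios above):

import numpy as np

# Cross-entropy with a one-hot target reduces to -log(p_true_class)
loss_a = -np.log(0.651)  # uncertain model
loss_b = -np.log(0.985)  # confident model
print(loss_a, loss_b)    # ≈ 0.429 and ≈ 0.015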


5. Why Argmax is Not Differentiable

The Problem with Argmax

Argmax is a discrete, discontinuous function:

f(z) = argmax(z)

Example: argmax([2.1, 1.0, 0.5]) = 0

If we gradually lower the first logit:

  • argmax([2.1, 1.0, 0.5]) = 0
  • argmax([2.09, 1.0, 0.5]) = 0
  • argmax([1.99, 1.0, 0.5]) = 0
  • ...
  • argmax([0.99, 1.0, 0.5]) = 1 ← Sudden jump!

The derivative

$$\frac{\partial \text{argmax}}{\partial z_i}$$

is:

  • Zero almost everywhere (small changes don't affect the output)
  • Undefined at the boundary (when two logits are equal)
$$\frac{\partial \text{argmax}(z)}{\partial z_i} = \begin{cases} 0 & \text{most of the time} \\ \text{undefined} & \text{at boundaries} \end{cases}$$

This means no gradient flows back through argmax → backpropagation cannot update weights!
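
A small sweep (a sketch assuming NumPy) makes the step behavior concrete: the output stays flat and then changes in a single jump, so no useful gradient exists anywhere:

import numpy as np

for z0 in [2.1, 1.5, 1.01, 1.0, 0.99]:
    z = np.array([z0, 1.0, 0.5])
    print(z0, np.argmax(z))
# Prints 0 for every value down to 1.0, then 1 at 0.99:
# flat regions give zero gradient, and the jump itself has no derivative.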


Why Softmax IS Differentiable

Softmax is a smooth, continuous function:

$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$

For any small change in $z_i$, the output changes smoothly. We can compute the derivative:

$$\frac{\partial \text{softmax}(z_i)}{\partial z_j} = \begin{cases} \text{softmax}(z_i)(1 - \text{softmax}(z_i)) & \text{if } i = j \\ -\text{softmax}(z_i) \cdot \text{softmax}(z_j) & \text{if } i \neq j \end{cases}$$

This provides well-defined gradients that backpropagation can use to update weights!
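
As a sketch (assuming NumPy), this case-wise formula is exactly the Jacobian built from the softmax outputs themselves:

import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

p = softmax(np.array([2.1, 1.0, 0.5]))

# d softmax_i / d z_j = p_i * (delta_ij - p_j): diagonal p_i(1 - p_i), off-diagonal -p_i * p_j
jacobian = np.diag(p) - np.outer(p, p)
print(jacobian)  # every entry is finite and well-defined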


Visual Intuition

Imagine plotting the output against a single logit value $z$:

Argmax: A step function (flat, then jumps)

Output
  1 |     ___________
    |    |
  0 |____|
    +----+----+----+
         z

→ Derivative is 0 or undefined

Softmax: A smooth S-curve

Output
  1 |       ___---
    |    ,-'
  0 |__-'
    +----+----+----+
         z

→ Derivative exists everywhere and is meaningful


6. Summary Comparison Table

| Aspect | Argmax | Softmax |
| --- | --- | --- |
| Output Type | Index (discrete) | Probability distribution (continuous) |
| Information Preserved | Only the winning class | Full confidence across all classes |
| Differentiable | No (zero or undefined gradient) | Yes (smooth, continuous gradients) |
| Use in Training | Cannot be used | Essential for backpropagation |
| Use in Inference | Often used to get the final prediction | Optional (can use probabilities directly) |
| Loss Computation | Impossible (no probability distribution) | Works with cross-entropy loss |
| Gradient Flow | None (stops backpropagation) | Provides meaningful gradients |
| Example Output | 0 (just an index) | [0.651, 0.217, 0.132] (probabilities) |

Conclusion

Softmax is used during training because:

  1. It preserves the full probability distribution, allowing the loss function to measure confidence
  2. It is differentiable, enabling gradient-based optimization via backpropagation
  3. It provides meaningful gradients that guide the learning process

Argmax is only used during inference (after training) when we need to make a discrete prediction, as it simply selects the class with the highest probability. During training, we need the continuous, differentiable nature of softmax to learn effectively.
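
In code, this split typically looks like the following PyTorch-style sketch (the model, data, and optimizer here are placeholders, not code from the article; note that nn.CrossEntropyLoss applies log-softmax to the logits internally):

import torch
import torch.nn as nn

model = nn.Linear(10, 3)                                  # placeholder classifier: 10 features, 3 classes
criterion = nn.CrossEntropyLoss()                         # softmax (in log form) lives inside the loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 10)                                    # dummy batch
y = torch.tensor([0, 2, 1, 0])                            # dummy labels

# Training: softmax keeps the loss differentiable, so gradients flow
logits = model(x)
loss = criterion(logits, y)
loss.backward()
optimizer.step()

# Inference: argmax turns the scores into a discrete class prediction
with torch.no_grad():
    preds = model(x).argmax(dim=1)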

$$\boxed{\text{Training: Softmax} \qquad \text{Inference: Argmax (optional)}}$$
