Why Softmax is Used Instead of Argmax in Neural Network Training
1. Information Loss with Argmax
Argmax only returns the index of the highest logit value and completely discards all confidence information:
argmax([2.1, 1.0, 0.5]) = 0
argmax([5.0, 0.1, 0.1]) = 0
Both return class 0, but we lose critical information about how confident the model is in its prediction.
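As a quick illustration (a minimal NumPy sketch using the two logit vectors above), argmax maps both to the same index and the margin between them is gone:

```python
import numpy as np

uncertain = np.array([2.1, 1.0, 0.5])
confident = np.array([5.0, 0.1, 0.1])

# argmax keeps only the index of the largest logit
print(np.argmax(uncertain))  # 0
print(np.argmax(confident))  # 0 -- indistinguishable from the line above
```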
2. Softmax Preserves the Full Distribution
Softmax converts logits into a probability distribution that preserves the relative confidence across all classes:

$$p_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$
This allows the loss function to measure certainty or uncertainty, which is essential for gradient-based learning.
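A minimal NumPy sketch of this formula (one of several equivalent ways to write it) looks like this; subtracting the maximum logit is a standard numerical-stability trick that does not change the output:

```python
import numpy as np

def softmax(z):
    """Convert a vector of logits into a probability distribution."""
    shifted = z - np.max(z)      # numerical stability; output is unchanged
    exps = np.exp(shifted)
    return exps / np.sum(exps)

print(softmax(np.array([2.1, 1.0, 0.5])))  # ≈ [0.651, 0.217, 0.132]
```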
3. Numerical Example: Uncertain vs Confident Predictions
Let's compare two scenarios with 3 classes and true label = class 0.
Scenario A: Uncertain Model
Logits: [2.1, 1.0, 0.5]
Argmax output:
argmax([2.1, 1.0, 0.5]) = 0
Softmax output:

softmax([2.1, 1.0, 0.5]) ≈ [0.651, 0.217, 0.132]
The model is somewhat confident but not very certain (65% probability for class 0).
Scenario B: Confident Model
Logits: [5.0, 0.1, 0.1]
Argmax output:
argmax([5.0, 0.1, 0.1]) = 0
Softmax output:

softmax([5.0, 0.1, 0.1]) ≈ [0.985, 0.007, 0.007]
The model is highly confident (98.5% probability for class 0).
Key Observation
- Argmax gives the same result (class 0) in both cases, losing all information about confidence.
- Softmax gives drastically different distributions, capturing the model's uncertainty.
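Both observations are easy to verify numerically; the sketch below uses scipy.special.softmax, but any equivalent implementation would do:

```python
import numpy as np
from scipy.special import softmax

uncertain = np.array([2.1, 1.0, 0.5])
confident = np.array([5.0, 0.1, 0.1])

# Argmax: identical answers, confidence discarded
print(np.argmax(uncertain), np.argmax(confident))  # 0 0

# Softmax: drastically different distributions
print(softmax(uncertain).round(3))  # [0.651 0.217 0.132]
print(softmax(confident).round(3))  # [0.985 0.007 0.007]
```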
4. Cross-Entropy Loss with Softmax Outputs
Cross-entropy loss measures how well the predicted probabilities match the true label:

$$L = -\sum_{i} y_i \log(p_i)$$

where $y_i$ is the true label (one-hot encoded) and $p_i$ is the softmax probability for class $i$.

For true label = class 0 (one-hot: [1, 0, 0]), the loss simplifies to:

$$L = -\log(p_0)$$
Scenario A: Uncertain Model

$$L = -\log(0.651) \approx 0.429$$

Scenario B: Confident Model

$$L = -\log(0.985) \approx 0.015$$
Analysis
- Higher loss (0.429) for the uncertain model → gradient pushes weights to increase confidence
- Lower loss (0.015) for the confident model → gradient is small, model is already doing well
- The loss quantifies how far the model is from perfect prediction
With argmax, we'd have no way to distinguish between these two scenarios and couldn't compute meaningful gradients.
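A short NumPy sketch (reusing the rounded softmax probabilities from the scenarios above) reproduces both loss values:

```python
import numpy as np

def cross_entropy(probs, true_class):
    """With a one-hot label, cross-entropy reduces to -log(p_true)."""
    return -np.log(probs[true_class])

p_uncertain = np.array([0.651, 0.217, 0.132])  # Scenario A softmax output
p_confident = np.array([0.985, 0.007, 0.007])  # Scenario B softmax output

print(cross_entropy(p_uncertain, 0))  # ≈ 0.429
print(cross_entropy(p_confident, 0))  # ≈ 0.015
```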
5. Why Argmax is Not Differentiable
The Problem with Argmax
Argmax is a discrete, non-continuous function:
f(z) = argmax(z)
Example: argmax([2.1, 1.0, 0.5]) = 0
If we slightly change the input:
argmax([2.1, 1.0, 0.5])  = 0
argmax([2.09, 1.0, 0.5]) = 0
argmax([1.99, 1.0, 0.5]) = 0
...
argmax([0.99, 1.0, 0.5]) = 1   ← Sudden jump!
The derivative $\frac{\partial}{\partial z_i}\,\text{argmax}(z)$ is:
- Zero almost everywhere (small changes don't affect the output)
- Undefined at the boundary (when two logits are equal)
This means no gradient flows back through argmax → backpropagation cannot update weights!
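A finite-difference check (a rough numerical sketch, not a proof) makes this concrete: nudging a logit leaves the argmax output unchanged, so the estimated derivative is zero, while the softmax output responds smoothly:

```python
import numpy as np
from scipy.special import softmax

z = np.array([2.1, 1.0, 0.5])
eps = 1e-4

# Finite-difference "derivative" of argmax w.r.t. z[0]: zero away from ties
d_argmax = (np.argmax(z + [eps, 0, 0]) - np.argmax(z)) / eps
print(d_argmax)  # 0.0

# Finite-difference derivative of softmax's first output w.r.t. z[0]: nonzero
d_softmax = (softmax(z + [eps, 0, 0])[0] - softmax(z)[0]) / eps
print(d_softmax)  # ≈ 0.227
```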
Why Softmax IS Differentiable
Softmax is a smooth, continuous function:

$$p_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$

For any small change in $z_i$, the output changes smoothly. We can compute the derivative:

$$\frac{\partial p_i}{\partial z_j} = p_i(\delta_{ij} - p_j)$$
This provides well-defined gradients that backpropagation can use to update weights!
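As a sketch, the full Jacobian $\frac{\partial p_i}{\partial z_j} = p_i(\delta_{ij} - p_j)$ can be built directly from the probabilities and spot-checked against a numerical gradient:

```python
import numpy as np
from scipy.special import softmax

z = np.array([2.1, 1.0, 0.5])
p = softmax(z)

# Analytic Jacobian: dp_i/dz_j = p_i * (delta_ij - p_j)
jacobian = np.diag(p) - np.outer(p, p)
print(jacobian.round(3))

# Spot-check one entry with a finite difference
eps = 1e-5
numeric = (softmax(z + [eps, 0, 0])[0] - p[0]) / eps
print(jacobian[0, 0], numeric)  # both ≈ 0.227
```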
Visual Intuition
Imagine plotting the output as a single logit value varies:
Argmax: A step function (flat, then jumps)
Output
 1 |         __________
   |         |
 0 |_________|
   +----+----+----+----
                  z
→ Derivative is 0 or undefined
Softmax: A smooth S-curve
Output
 1 |             ___---
   |         ,-'
 0 |___,--'
   +----+----+----+----
                  z
→ Derivative exists everywhere and is meaningful
6. Summary Comparison Table
| Aspect | Argmax | Softmax |
|---|---|---|
| Output Type | Index (discrete) | Probability distribution (continuous) |
| Information Preserved | Only the winning class | Full confidence across all classes |
| Differentiable | No (zero or undefined gradient) | Yes (smooth, continuous gradients) |
| Use in Training | Cannot be used | Essential for backpropagation |
| Use in Inference | Often used to get final prediction | Optional (can use probabilities directly) |
| Loss Computation | Impossible (no probability distribution) | Works with cross-entropy loss |
| Gradient Flow | None (stops backpropagation) | Provides meaningful gradients |
| Example Output | 0 (just an index) | [0.651, 0.217, 0.132] (probabilities) |
Conclusion
Softmax is used during training because:
- It preserves the full probability distribution, allowing the loss function to measure confidence
- It is differentiable, enabling gradient-based optimization via backpropagation
- It provides meaningful gradients that guide the learning process
Argmax is only used during inference (after training) when we need to make a discrete prediction, as it simply selects the class with the highest probability. During training, we need the continuous, differentiable nature of softmax to learn effectively.
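A minimal PyTorch sketch of that split (the layer sizes and data here are placeholders) shows softmax entering implicitly through nn.CrossEntropyLoss during training, with argmax appearing only at inference:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 3)              # toy classifier: 4 features -> 3 classes
loss_fn = nn.CrossEntropyLoss()      # applies log-softmax internally
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 4)                # placeholder batch of features
y = torch.randint(0, 3, (8,))        # placeholder class labels

# Training step: differentiable softmax + cross-entropy, gradients flow
logits = model(x)
loss = loss_fn(logits, y)
loss.backward()
optimizer.step()

# Inference: discrete prediction via argmax (no gradient needed)
with torch.no_grad():
    predictions = model(x).argmax(dim=1)
print(predictions)
```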