Neural networks don’t output decisions — they output probabilities.
This post explains how logits, softmax, and cross-entropy turn raw outputs into meaningful predictions in deep learning.
Cross-posted from Zeromath. Original article: https://zeromathai.com/en/output-layer-probabilistic-interpretation-en/
The Real Role of the Output Layer
A neural network doesn’t directly say:
“This is class A.”
Instead, it computes:
A probability distribution over all classes.
Step 1 — Logits (Raw Scores)
The final layer computes:
z = Wh + b
Its output z is the vector of logits:
- Not probabilities
- Not normalized
- Can be negative or large
Example:
[0.4, -1.7, 4.2]
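As a minimal sketch (the layer sizes, random weights, and seed here are made up for illustration), the final layer is just a matrix-vector product:

```python
import numpy as np

# Hypothetical final linear layer: 3 classes, 4 hidden features.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # weight matrix of the output layer
b = np.zeros(3)               # bias
h = rng.normal(size=4)        # hidden representation from earlier layers

z = W @ h + b                 # logits: raw, unnormalized scores
print(z)                      # can be negative or large; need not sum to 1
```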
Step 2 — Softmax (Make It Probabilistic)
softmax(z_i) = exp(z_i) / Σ_j exp(z_j)
Transforms:
[0.4, -1.7, 4.2]
→ [0.022, 0.003, 0.975]
Now outputs are:
- positive
- sum to 1
- interpretable
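The transformation above can be sketched in a few lines of NumPy (subtracting the max is a standard stability trick, not part of the definition — it leaves the result unchanged):

```python
import numpy as np

def softmax(z):
    # Shift by max(z) for numerical stability; softmax is shift-invariant.
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = softmax(np.array([0.4, -1.7, 4.2]))
print(p)  # all positive, sums to 1, dominated by the largest logit
```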
Step 3 — Argmax (Decision)
- Softmax → probabilities
- Argmax → final class
Important:
Softmax keeps uncertainty
Argmax removes it
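A one-line illustration of that loss of uncertainty, using the probabilities from the softmax example:

```python
import numpy as np

p = np.array([0.022, 0.003, 0.975])
pred = int(np.argmax(p))   # collapse the distribution to a single class index
print(pred)                # → 2; the residual 0.022 / 0.003 uncertainty is discarded
```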
Step 4 — Training Uses Cross-Entropy
Loss:
− log(p_true_class)
Why:
- differentiable
- punishes confident mistakes
- aligns with probability theory
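A sketch of this loss applied to the softmax output from earlier, showing how a confident mistake is punished far more than a confident correct answer:

```python
import numpy as np

def cross_entropy(p, true_class):
    # Negative log-probability assigned to the correct class.
    return -np.log(p[true_class])

p = np.array([0.022, 0.003, 0.975])
print(cross_entropy(p, 2))  # small loss (~0.025): confident and correct
print(cross_entropy(p, 1))  # large loss (~5.8): confident mistake
```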
Step 5 — Why Frameworks Use Logits Directly
In PyTorch / TensorFlow:
- CrossEntropyLoss expects logits
- NOT softmax output
Why?
Numerical stability:
Computing log(softmax(z)) as one fused operation (via the log-sum-exp trick) avoids overflow in exp(z) and the precision loss of taking the log of a tiny probability.
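A minimal sketch of the log-sum-exp trick behind this (pure NumPy, not the actual framework internals):

```python
import numpy as np

def log_softmax(z):
    # log(softmax(z)) computed stably: shift by max(z) so exp never overflows.
    z = z - np.max(z)
    return z - np.log(np.sum(np.exp(z)))

z = np.array([1000.0, 990.0, 980.0])  # naive exp(1000) would overflow to inf
print(log_softmax(z))                 # finite, well-behaved values
```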
Step 6 — Softmax vs Sigmoid (Real-World Bug Source)
- Binary → sigmoid
- Multi-class → softmax
- Multi-label → sigmoid per class
Common bug:
Using softmax for multi-label classification → softmax forces the class probabilities to sum to 1, so labels that should be independent wrongly compete with each other.
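A sketch of the correct multi-label setup, with one sigmoid per label (the logits and the 0.5 threshold are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([2.1, -0.3, 1.5])   # one logit per label
probs = sigmoid(z)               # each probability in (0, 1), independent
labels = probs > 0.5             # several labels can fire at once
print(labels)
```

With softmax instead, these three scores would be forced to sum to 1, and activating one label would suppress the others.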
Step 7 — Inference Tip
Do you always need softmax?
- For prediction only → argmax(logits) works
- For probabilities → apply softmax
This saves computation in production systems.
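A quick check of why skipping softmax is safe for prediction: softmax is monotonic, so the largest logit always maps to the largest probability.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([0.4, -1.7, 4.2])
# argmax is unchanged by the (monotonic) softmax transform:
assert np.argmax(z) == np.argmax(softmax(z))
print(int(np.argmax(z)))  # → 2
```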
Mental Model
Input → Features → Logits → Softmax → Probabilities → Argmax → Prediction
Debugging Checklist
- Overconfident wrong → calibration issue
- Always low confidence → weak features
- Loss not decreasing → output/loss mismatch
- Multi-label broken → wrong activation
Final Takeaway
- Output layer → scores
- Softmax → probabilities
- Argmax → decisions
- Cross-entropy → learning
Deep learning works because:
It models uncertainty, not just outputs.
Where do you usually get stuck — logits, softmax, or loss functions?