Neural networks don’t output decisions — they output probabilities.
This post explains how logits, softmax, and cross-entropy turn raw outputs into meaningful predictions in deep learning.
Cross-posted from Zeromath. Original article: https://zeromathai.com/en/output-layer-probabilistic-interpretation-en/
The Real Role of the Output Layer
A neural network doesn’t directly say:
“This is class A.”
Instead, it computes:
A probability distribution over all classes.
Step 1 — Logits (Raw Scores)
The final layer computes:
z = Wh + b
Its output z is the vector of logits:
- Not probabilities
- Not normalized
- Can be negative or large
Example:
[0.4, -1.7, 4.2]
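As a minimal sketch (the layer sizes, random weights, and seed here are made up for illustration), the final layer is just a matrix-vector product:

```python
import numpy as np

# Hypothetical final linear layer: 3 classes, 4 hidden features.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # weight matrix of the output layer
b = np.zeros(3)               # bias
h = rng.normal(size=4)        # hidden representation from earlier layers

z = W @ h + b                 # logits: raw, unnormalized scores
print(z)                      # can be negative or large; need not sum to 1
```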
Step 2 — Softmax (Make It Probabilistic)
softmax(z_i) = exp(z_i) / Σ_j exp(z_j)
Transforms:
[0.4, -1.7, 4.2]
→ [0.022, 0.003, 0.975]
Now outputs are:
- positive
- sum to 1
- interpretable
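The transformation above can be sketched in a few lines of NumPy (subtracting the max is a standard stability trick, not part of the definition — it leaves the result unchanged):

```python
import numpy as np

def softmax(z):
    # Shift by max(z) for numerical stability; softmax is shift-invariant.
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = softmax(np.array([0.4, -1.7, 4.2]))
print(p)  # all positive, sums to 1, dominated by the largest logit
```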
Step 3 — Argmax (Decision)
- Softmax → probabilities
- Argmax → final class
Important:
Softmax keeps uncertainty
Argmax removes it
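A one-line illustration of that loss of uncertainty, using the probabilities from the softmax example:

```python
import numpy as np

p = np.array([0.022, 0.003, 0.975])
pred = int(np.argmax(p))   # collapse the distribution to a single class index
print(pred)                # → 2; the residual 0.022 / 0.003 uncertainty is discarded
```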
Step 4 — Training Uses Cross-Entropy
Loss:
− log(p_true_class)
Why:
- differentiable
- punishes confident mistakes
- aligns with probability theory
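A sketch of this loss applied to the softmax output from earlier, showing how a confident mistake is punished far more than a confident correct answer:

```python
import numpy as np

def cross_entropy(p, true_class):
    # Negative log-probability assigned to the correct class.
    return -np.log(p[true_class])

p = np.array([0.022, 0.003, 0.975])
print(cross_entropy(p, 2))  # small loss (~0.025): confident and correct
print(cross_entropy(p, 1))  # large loss (~5.8): confident mistake
```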
Step 5 — Why Frameworks Use Logits Directly
In PyTorch / TensorFlow:
- CrossEntropyLoss expects logits
- NOT softmax output
Why?
Numerical stability:
Computing log(softmax(z)) as one fused operation (via the log-sum-exp trick) avoids overflow in exp(z) and the precision loss of taking the log of a tiny probability.
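A minimal sketch of the log-sum-exp trick behind this (pure NumPy, not the actual framework internals):

```python
import numpy as np

def log_softmax(z):
    # log(softmax(z)) computed stably: shift by max(z) so exp never overflows.
    z = z - np.max(z)
    return z - np.log(np.sum(np.exp(z)))

z = np.array([1000.0, 990.0, 980.0])  # naive exp(1000) would overflow to inf
print(log_softmax(z))                 # finite, well-behaved values
```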
Step 6 — Softmax vs Sigmoid (Real-World Bug Source)
- Binary → sigmoid
- Multi-class → softmax
- Multi-label → sigmoid per class
Common bug:
Using softmax for multi-label classification → softmax forces the class probabilities to sum to 1, so labels that should be independent wrongly compete with each other.
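A sketch of the correct multi-label setup, with one sigmoid per label (the logits and the 0.5 threshold are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([2.1, -0.3, 1.5])   # one logit per label
probs = sigmoid(z)               # each probability in (0, 1), independent
labels = probs > 0.5             # several labels can fire at once
print(labels)
```

With softmax instead, these three scores would be forced to sum to 1, and activating one label would suppress the others.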
Step 7 — Inference Tip
Do you always need softmax?
- For prediction only → argmax(logits) works
- For probabilities → apply softmax
This saves computation in production systems.
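A quick check of why skipping softmax is safe for prediction: softmax is monotonic, so the largest logit always maps to the largest probability.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([0.4, -1.7, 4.2])
# argmax is unchanged by the (monotonic) softmax transform:
assert np.argmax(z) == np.argmax(softmax(z))
print(int(np.argmax(z)))  # → 2
```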
Mental Model
Input → Features → Logits → Softmax → Probabilities → Argmax → Prediction
Debugging Checklist
- Overconfident wrong → calibration issue
- Always low confidence → weak features
- Loss not decreasing → output/loss mismatch
- Multi-label broken → wrong activation
Final Takeaway
- Output layer → scores
- Softmax → probabilities
- Argmax → decisions
- Cross-entropy → learning
Deep learning works because:
It models uncertainty, not just outputs.
Where do you usually get stuck — logits, softmax, or loss functions?