DEV Community

shangkyu shin

Posted on • Originally published at zeromathai.com

Output Layer Explained — Logits, Softmax, Cross-Entropy, and Why They Work Together

Neural networks don’t output decisions — they output probabilities.

This post explains how logits, softmax, and cross-entropy turn raw outputs into meaningful predictions in deep learning.

Cross-posted from Zeromath. Original article: https://zeromathai.com/en/output-layer-probabilistic-interpretation-en/


The Real Role of the Output Layer

A neural network doesn’t directly say:

“This is class A.”

Instead, it computes:

A probability distribution over all classes.


Step 1 — Logits (Raw Scores)

The final layer is a linear map:

z = Wh + b

Its output z is the logits vector:

  • Not probabilities
  • Not normalized
  • Can be negative or large

Example:

[0.4, -1.7, 4.2]
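A minimal numpy sketch of that final layer. The weights, bias, and hidden vector here are made-up illustrative values, not from a trained network:

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=4)        # hidden features from earlier layers
W = rng.normal(size=(3, 4))   # weights for 3 output classes
b = rng.normal(size=3)        # biases

z = W @ h + b                 # logits: unnormalized, can be negative or large
print(z)
```

Note that nothing constrains z to be positive or to sum to 1 — that is softmax's job.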


Step 2 — Softmax (Make It Probabilistic)

softmax(z_i) = exp(z_i) / Σ exp(z_j)

Transforms:

[0.4, -1.7, 4.2]

→ [0.022, 0.003, 0.975]

Now outputs are:

  • positive
  • sum to 1
  • interpretable
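The formula above translates directly into a few lines of numpy (subtracting the max is a standard stability trick that does not change the result):

```python
import numpy as np

def softmax(z):
    # Shift by max(z) for numerical stability; the ratio is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([0.4, -1.7, 4.2])
p = softmax(z)
print(p)   # close to [0.022, 0.003, 0.975], summing to 1
```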

Step 3 — Argmax (Decision)

  • Softmax → probabilities
  • Argmax → final class

Important:

Softmax keeps uncertainty

Argmax removes it
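The difference shows up with two toy distributions (values invented for illustration): one confident, one nearly uniform. Argmax collapses both to the same answer:

```python
import numpy as np

confident = np.array([0.01, 0.01, 0.98])
uncertain = np.array([0.30, 0.32, 0.38])

# Argmax picks the same class for both, discarding the confidence gap.
print(np.argmax(confident), np.argmax(uncertain))   # 2 2
```

If downstream logic needs to know *how sure* the model is, keep the softmax output; argmax alone cannot tell these two cases apart.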


Step 4 — Training Uses Cross-Entropy

Loss:

− log(p_true_class)

Why:

  • differentiable
  • punishes confident mistakes
  • aligns with probability theory
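A tiny sketch of that loss on the example distribution from above, showing why confident mistakes hurt:

```python
import numpy as np

def cross_entropy(p, true_class):
    # Negative log-probability assigned to the correct class.
    return -np.log(p[true_class])

p = np.array([0.022, 0.003, 0.975])

print(cross_entropy(p, true_class=2))   # small: confident and correct
print(cross_entropy(p, true_class=1))   # huge: confident and wrong
```

Because −log(p) blows up as p → 0, assigning near-zero probability to the true class is punished far more than mild uncertainty.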

Step 5 — Why Frameworks Use Logits Directly

In PyTorch / TensorFlow:

  • PyTorch's CrossEntropyLoss expects raw logits (it applies log-softmax internally)
  • TensorFlow's cross-entropy losses behave the same when from_logits=True
  • Feeding in softmax output instead is a common silent bug

Why?

Numerical stability:

log(softmax(z)) is fused into one safe operation via the log-sum-exp trick, avoiding overflow when logits are large
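A numpy sketch of the failure mode and the fix. The logit values are deliberately extreme to trigger overflow:

```python
import numpy as np

z = np.array([1000.0, 1001.0, 1002.0])   # large logits

# Naive route: exp(1000) overflows to inf, and inf/inf gives nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(z) / np.exp(z).sum()

# Log-sum-exp trick: shift by max(z) before exponentiating.
m = z.max()
log_p = z - m - np.log(np.exp(z - m).sum())

print(naive)    # nans
print(log_p)    # finite log-probabilities
```

This is why frameworks want the logits: they apply this trick internally, and a pre-softmaxed input makes it impossible.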


Step 6 — Softmax vs Sigmoid (Real-World Bug Source)

  • Binary → sigmoid
  • Multi-class → softmax
  • Multi-label → sigmoid per class

Common bug:

Using softmax for multi-label → classes compete for probability mass instead of firing independently
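The bug is easy to see with invented logits where two labels should both be active:

```python
import numpy as np

z = np.array([3.0, 2.5, -4.0])   # logits for 3 independent labels

# Multi-label: a sigmoid per class — each probability is independent.
sig = 1 / (1 + np.exp(-z))
print(sig)   # first two labels are both well above 0.5

# Wrong: softmax forces the scores to compete and sum to 1,
# so two labels that should both fire suppress each other.
p = np.exp(z - z.max()) / np.exp(z - z.max()).sum()
print(p)     # only one label clears 0.5
```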


Step 7 — Inference Tip

Do you always need softmax?

  • For prediction only → argmax(logits) works
  • For probabilities → apply softmax

This saves computation in production systems.
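The reason this works: softmax is monotonic, so it preserves the ranking of the logits. A quick check with the running example:

```python
import numpy as np

z = np.array([0.4, -1.7, 4.2])
p = np.exp(z - z.max()) / np.exp(z - z.max()).sum()

# Softmax never changes which entry is largest.
assert np.argmax(z) == np.argmax(p)
print(np.argmax(z))   # 2 — softmax was never needed for the decision
```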


Mental Model

Input → Features → Logits → Softmax → Probabilities → Argmax → Prediction


Debugging Checklist

  • Overconfident wrong → calibration issue
  • Always low confidence → weak features
  • Loss not decreasing → output/loss mismatch
  • Multi-label broken → wrong activation

Final Takeaway

  • Output layer → scores
  • Softmax → probabilities
  • Argmax → decisions
  • Cross-entropy → learning

Deep learning works because:

It models uncertainty, not just outputs.


Where do you usually get stuck — logits, softmax, or loss functions?
