shangkyu shin

Posted on • Originally published at zeromathai.com

Multilayer Perceptron (MLP) — How Neural Networks Learn Representations, Probabilities, and Gradients

The multilayer perceptron (MLP) is the simplest neural network worth learning deeply.

It looks basic, but it already contains the full logic of deep learning: feature transformation, nonlinear representation learning, probabilistic output for classification, backpropagation, and gradient-based optimization.

Cross-posted from Zeromath. Original article:
https://zeromathai.com/en/dl-mlp-representation-and-backprop-en/

Why MLP still matters

A lot of people treat MLP as “the model before the real models.”
That framing is misleading.

If you understand MLP well, you already understand most of what modern deep learning is doing under the hood:

  • repeated linear transformations
  • nonlinear activations
  • learned hidden representations
  • task-specific output layers
  • loss-driven training
  • backward gradient flow

CNNs, RNNs, and Transformers all build on those same ideas.

So MLP is not just a beginner topic.
It is the cleanest mental model for the whole field.

Start from the baseline: linear models

A linear model looks like this:

y = Wx + b

That is useful, simple, and often a good baseline.
But it has a hard limit: it can only model linear relationships in the original input space.

That becomes a problem fast.

Real data is rarely organized in a way that a single straight decision boundary can handle:

  • image classes are not linearly separable in raw pixel space
  • text meaning is not a linear function of token IDs
  • audio structure is not a simple linear mix

So the issue is not just “we need more parameters.”
The issue is “we need multiple stages of transformation.”

That is where MLP enters.
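A classic illustration of the linear limit is XOR. The sketch below (illustrative, not from the original article) fits the best possible linear model to XOR with least squares and shows it cannot do better than chance:

```python
import numpy as np

# XOR: a classic dataset no single linear boundary can separate.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Least-squares linear fit y ~ Xw + b (bias folded in as a column of ones).
Xb = np.hstack([X, np.ones((4, 1))])
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)

# The optimal linear model outputs 0.5 for every point, so after
# thresholding it gets exactly half the labels right.
pred = (Xb @ w > 0.5).astype(float)
print(Xb @ w)              # all outputs are 0.5
print((pred == y).mean())  # accuracy 0.5 — chance level
```

No amount of extra linear capacity fixes this; only a nonlinear stage of transformation does.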

The core computation

At each layer, an MLP does this:

h^(l) = σ(W^(l)h^(l-1) + b^(l))

In implementation terms, that means:

  1. matrix multiply
  2. add bias
  3. apply activation
  4. pass result to the next layer

Repeat that across layers.

That is the whole forward pipeline.
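Those four steps can be written as a minimal NumPy sketch (all names here are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def layer_forward(h_prev, W, b):
    """One MLP layer: matrix multiply, add bias, apply activation."""
    z = W @ h_prev + b   # steps 1 and 2: linear part
    return relu(z)       # step 3: nonlinearity

rng = np.random.default_rng(0)
h0 = rng.normal(size=3)        # input vector
W1 = rng.normal(size=(4, 3))   # 3 inputs -> 4 hidden units
b1 = np.zeros(4)

h1 = layer_forward(h0, W1, b1)  # step 4: h1 feeds the next layer
print(h1.shape)  # (4,)
```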

One useful beginner mental model is:

  • W decides how signals mix
  • b shifts the response
  • σ decides how much of that response survives

So a neuron is not mysterious.
It is just a learned filter.

Why activation functions matter so much

Here is the critical point beginners often miss:

If you stack multiple linear layers without activation functions, the whole network is still just one linear transformation.

So depth alone is not enough.

Activation functions are what make depth useful.

They introduce nonlinearity, which means the network can learn curved and compositional mappings instead of just one flat linear rule.
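The collapse of stacked linear layers is easy to verify numerically. A quick check (biases omitted for brevity; they merge the same way):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)

# Two stacked linear layers with no activation in between...
two_layers = W2 @ (W1 @ x)
# ...are exactly one linear layer with the merged matrix W2 @ W1.
one_layer = (W2 @ W1) @ x

print(np.allclose(two_layers, one_layer))  # True
```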

Typical choices:

  • ReLU for efficient hidden-layer training
  • Sigmoid for probability-like outputs in binary cases
  • Tanh when zero-centered activations are useful

Another important detail: activation functions do not only affect expressiveness.
They also affect gradient flow during training.

So activation choice influences both:

  • what the model can represent
  • how easy it is to optimize

What hidden layers are actually doing

A lot of explanations say hidden layers “extract features.”
That is true, but vague.

A better explanation is this:

Hidden layers map the input into a new representation space where the task becomes easier.

That means the model is not just learning a boundary.
It is learning a space in which the boundary is easier to draw.

This is why people say neural networks do representation learning.

Instead of relying on manual feature engineering, the network learns internal features on its own.

A useful intuition for image classification is:

pixels → edges → textures → shapes → object parts → object identity

Even if real learned features are messier than that, the basic idea holds:
later layers usually operate on more task-relevant abstractions than earlier ones.

That is also why MLP is fundamentally different from a plain linear model.
A linear model works in the feature space you give it.
An MLP learns a better feature space internally.

A tiny numeric example

Suppose your input is:

[1.0, 2.0]

and one neuron has:

weights = [0.5, -1.0]
bias = 0.2

Then the pre-activation value is:

z = 0.5×1.0 + (-1.0)×2.0 + 0.2 = -1.3

If the activation is ReLU, the output becomes:

max(0, -1.3) = 0

That is it.

A giant deep network is just this pattern repeated many times with many neurons and many layers.
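The arithmetic above can be checked in a few lines of plain Python:

```python
# Checking the numbers from the example above.
x = [1.0, 2.0]
w = [0.5, -1.0]
b = 0.2

z = w[0] * x[0] + w[1] * x[1] + b   # pre-activation
out = max(0.0, z)                    # ReLU

print(round(z, 1))  # -1.3
print(out)          # 0.0
```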

This matters because it keeps your mental model grounded.
When training feels abstract, remember that the network is still just learning lots of weighted transforms like this.

Output layer = task meaning

The output layer is not arbitrary.
Its design depends on the task.

For regression:

  • you usually want a real-valued output

For classification:

  • you usually want class probabilities

That is where softmax comes in.

Softmax converts logits into a probability distribution over classes.
The values sum to 1, so the output can be read as:

P(y | x)

That is a big conceptual shift.

The model is no longer just producing “scores.”
It is estimating how likely each class is, given the input.
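A minimal softmax sketch makes the "logits to probabilities" step concrete (the max-subtraction is the standard numerical-stability trick):

```python
import numpy as np

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    z = logits - np.max(logits)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs)        # largest logit -> largest probability
print(probs.sum())  # sums to 1, so it reads as P(y | x)
```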

This also explains why certain output layers pair naturally with certain loss functions.

  • softmax + cross-entropy for multiclass classification
  • sigmoid + binary cross-entropy for binary classification
  • linear output + MSE for many regression problems

This pairing is not arbitrary.
The output interpretation and the loss function should match.
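The softmax + cross-entropy pairing, for example, reduces to "penalize the negative log-probability of the correct class." A small sketch:

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def cross_entropy(logits, true_class):
    """Negative log-probability assigned to the correct class."""
    return -np.log(softmax(logits)[true_class])

logits = np.array([2.0, 1.0, 0.1])

# Loss is small when the model puts high probability on the true class...
confident_right = cross_entropy(logits, 0)
# ...and large when the true class received low probability.
confident_wrong = cross_entropy(logits, 2)
print(confident_right < confident_wrong)  # True
```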

Forward pass as a code mental model

If you are implementing an MLP from scratch, the forward pass is basically:

  • start with input tensor x
  • for each layer:
    • compute z = x @ W + b
    • compute x = activation(z)
  • compute final output
  • compute loss

That is all the network does during inference and the first half of training.

So if you want a practical mental model, think of an MLP as a chain of differentiable tensor transformations.
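That chain can be sketched end to end in a few lines (layer sizes here are arbitrary examples):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def mlp_forward(x, layers):
    """Run x through a list of (W, b) pairs with ReLU between layers.
    The final layer is left linear, producing logits."""
    h = x
    for W, b in layers[:-1]:
        h = relu(W @ h + b)
    W_out, b_out = layers[-1]
    return W_out @ h + b_out

rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(8, 4)), np.zeros(8)),  # 4 inputs -> 8 hidden
    (rng.normal(size=(3, 8)), np.zeros(3)),  # 8 hidden -> 3 logits
]
logits = mlp_forward(rng.normal(size=4), layers)
print(logits.shape)  # (3,)
```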

Backpropagation: what it is really doing

Once the model makes a prediction and computes a loss, training asks:

Which parameters should move, and in what direction?

Backpropagation answers that.

It computes gradients of the loss with respect to each parameter.
Those gradients tell you how sensitive the final error is to each weight and bias.

A good non-mystical explanation is:

Backprop is just dependency tracing through a differentiable computation graph.

The loss depends on the output.
The output depends on the last hidden layer.
That hidden layer depends on the previous one.
And so on.

The chain rule lets us follow those dependencies backward efficiently.

That is why backprop is powerful:
it solves the credit assignment problem in deep models.

Instead of guessing which parameter caused the mistake, the network computes how much each parameter contributed to the error.
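On the smallest possible graph, the chain rule is literally a product of local derivatives, and it can be sanity-checked against a finite difference:

```python
import numpy as np

# Tiny graph: z = w * x; p = sigmoid(z); loss = (p - y)^2
# Chain rule: dloss/dw = dloss/dp * dp/dz * dz/dw
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y, w = 1.5, 1.0, 0.3

z = w * x
p = sigmoid(z)
loss = (p - y) ** 2

# Backward pass: multiply local derivatives along the dependency chain.
dloss_dp = 2 * (p - y)
dp_dz = p * (1 - p)
dz_dw = x
grad = dloss_dp * dp_dz * dz_dw

# Sanity check with a finite difference.
eps = 1e-6
loss_plus = (sigmoid((w + eps) * x) - y) ** 2
numeric = (loss_plus - loss) / eps
print(abs(grad - numeric) < 1e-4)  # True
```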

The training loop in practical terms

The full training loop is:

  1. forward pass
  2. compute loss
  3. backward pass
  4. parameter update
  5. repeat

In code, that becomes something like:

  • zero gradients
  • run model
  • compute loss
  • call backward
  • step optimizer

That loop is simple, but conceptually rich.

Forward pass builds a representation and prediction.
Backward pass turns error into learning signal.
Optimizer turns gradient information into actual parameter movement.
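Stripped of any framework, the whole loop fits in a few lines. Here is a minimal sketch on a one-parameter toy model (in PyTorch the same five steps become zero_grad, forward, loss, backward, step):

```python
import numpy as np

# Toy data: y = 2x, so the ideal weight is 2.
X = np.array([1.0, 2.0, 3.0])
Y = np.array([2.0, 4.0, 6.0])

w, lr = 0.0, 0.05
losses = []

for step in range(50):
    pred = w * X                          # 1. forward pass
    loss = np.mean((pred - Y) ** 2)       # 2. compute loss (MSE)
    grad = np.mean(2 * (pred - Y) * X)    # 3. backward pass
    w -= lr * grad                        # 4. parameter update
    losses.append(loss)                   # 5. repeat

print(round(w, 3))             # converges to 2.0
print(losses[-1] < losses[0])  # True: error turned into learning
```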

Optimization is not separate from architecture

One subtle but important point:
training behavior is not determined by the optimizer alone.

Depth, width, activation choice, and initialization all affect optimization.

Examples:

  • deeper networks may become harder to train because gradients weaken
  • wider networks may fit more but can also overfit more easily
  • poor initialization can destabilize learning from the start
  • activation choice changes how gradients propagate
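The first and last bullets can be made concrete with one number: the sigmoid derivative never exceeds 0.25, so gradients passing through many sigmoid layers shrink geometrically (this is a simplified bound, ignoring the weights):

```python
import numpy as np

def sigmoid_deriv(z):
    p = 1.0 / (1.0 + np.exp(-z))
    return p * (1 - p)

# The sigmoid derivative peaks at z = 0.
zs = np.linspace(-10, 10, 10001)
peak = sigmoid_deriv(zs).max()
print(round(peak, 4))  # 0.25

# Even in the best case, a gradient crossing 20 sigmoid layers is
# scaled by at most 0.25 ** 20 — effectively zero.
print(peak ** 20 < 1e-11)  # True

# ReLU's derivative is exactly 1 on its active side, which is one
# reason it propagates gradients through depth more gracefully.
```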

So “model design” and “training design” are tightly coupled.

That is one reason MLP is such a useful educational model.
It makes you see that expressive power and learnability are related but not identical.

A model can be expressive in theory and still frustrating to train in practice.

Common beginner confusion

“More layers automatically means better learning”

Not necessarily.
More depth increases capacity, but it can also make optimization harder.

“The network learns features magically”

It learns them through repeated gradient-based adjustment.
Representation learning is powerful, but it is still the result of the loss + backprop + optimizer loop.

“The output is just the answer”

Not always.
In classification, the output is often better understood as a probability distribution over classes.

“Backprop is a separate trick”

It is not separate.
It is the natural backward computation for the same forward graph used in prediction.

Why MLP is still inside modern models

Even in architectures that no longer look like textbook MLPs, MLP blocks still appear everywhere.

Transformers are the easiest example.
They contain feed-forward layers that are basically structured MLP components applied inside a larger architecture.

So learning MLP is not just learning an old model.
It is learning a building block that keeps reappearing.

Final takeaway

MLP matters because it teaches the full deep learning pipeline in the cleanest possible form.

It shows you:

  • why linear models hit a wall
  • why hidden layers matter
  • why activation functions make depth useful
  • how outputs become probabilities
  • how loss creates a training signal
  • how backprop distributes responsibility
  • how optimization turns gradients into learning

That is not “intro material.”
That is the core abstraction behind deep learning.

Which part felt most useful to you:
the representation-learning view, the output/loss pairing, or the backprop mental model?
