This post is the first in a series where I break down Transformers and Vision Transformers from the ground up: no shortcuts, no magic, just fundamentals.
Why This Series?
Transformers are often introduced through attention diagrams and complex equations.
But most confusion around Transformers does not come from attention itself.
It comes from weak fundamentals.
Transformers are deep neural networks with attention layered on top.
If the neural network basics are unclear, everything else feels mystical.
This first post focuses only on the neural network foundation required for Transformers.
a. Transformers Are Still Neural Networks
Before talking about attention, embeddings, or tokens, it's important to state something clearly:
Transformers are deep feedforward neural networks + attention.
Every Transformer block contains:
- Linear layers
- Activation functions
- Forward and backward passes
- Gradient-based learning
If you understand standard neural networks, Transformers become far less intimidating.
b. What a Neural Network Layer Really Is
A neural network layer is simply a function:
y = f(Wx + b)
Where:
- x is the input vector
- W is the weight matrix
- b is the bias vector
- f(·) is an activation function
Figure 1: A neural network layer applies a linear transformation using weights and bias, followed by a non-linear activation.
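To make this concrete, here is a minimal sketch of a single layer in PyTorch. The layer sizes (8 inputs, 4 outputs) are arbitrary and chosen only for illustration:

```python
import torch
import torch.nn as nn

# One layer: y = f(Wx + b)
linear = nn.Linear(in_features=8, out_features=4)  # holds W and b
activation = nn.GELU()                             # f(·)

x = torch.randn(8)          # input vector x
y = activation(linear(x))   # y = f(Wx + b)
print(y.shape)              # torch.Size([4])
```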
This exact computation appears everywhere in Transformers:
- Query, Key, Value projections
- Output projections
- Feed Forward Networks (FFN)
Nothing special: just the same computation repeated at scale.
c. Weights and Biases: The Only Things That Learn
Neural networks don't learn rules or symbols. They learn numbers.
- Weights control how strongly inputs affect outputs
- Biases allow shifting behavior independently of inputs
In Transformers:
- Attention matrices are learned weights
- Projection layers are learned weights
- FFN layers are learned weights
If you know where the weights are, you know where the learning is happening.
d. Why Activation Functions Are Non-Negotiable
Without activation functions, depth is meaningless.
Stacking linear layers without activation collapses into a single linear transformation:
W3(W2(W1x)) = Wx
Activation functions introduce non-linearity, allowing the model to represent complex patterns.
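You can verify this collapse numerically. A small sketch (matrix shapes chosen arbitrarily) showing that two stacked linear maps equal a single one:

```python
import torch

x = torch.randn(5)
W1 = torch.randn(7, 5)
W2 = torch.randn(3, 7)

# Two "layers" with no activation in between...
two_layers = W2 @ (W1 @ x)

# ...are just one linear transformation with W = W2 @ W1
one_layer = (W2 @ W1) @ x

print(torch.allclose(two_layers, one_layer, atol=1e-5))  # True
```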
e. ReLU vs GELU (Why Transformers Use GELU)
ReLU is simple and widely used:
ReLU(x) = max(0, x)
But Transformers typically use GELU:
GELU(x) = x · Φ(x), where Φ is the cumulative distribution function of the standard normal distribution.
Intuition:
- ReLU applies a hard cutoff
- GELU softly decides which values to pass
This smooth behavior helps with:
- Very deep networks
- Stable gradient flow
That's why models like BERT and GPT use GELU inside their feedforward blocks.
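To see the difference concretely, here is a tiny sketch comparing the two on a few sample values, using PyTorch's built-in implementations:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

print(torch.relu(x))  # hard cutoff: every negative value becomes exactly 0
print(F.gelu(x))      # soft gate: small negatives pass through, slightly attenuated
```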
f. Forward Pass vs Backward Pass
Forward Pass
The forward pass answers: Given current parameters, what output does the model produce?
Inputs flow through:
- Linear layers
- Activation functions
- Attention modules
This is what architecture diagrams usually show.
Backward Pass
The backward pass answers: How should the parameters change to reduce error?
Using backpropagation:
- Gradients flow backward
- Weights and biases are updated
- Learning happens
Transformers are trained using the same backpropagation principles as any other neural network.
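Here is a minimal sketch of one forward/backward step using PyTorch autograd. The toy layer, loss, and learning rate are illustrative only:

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)
x = torch.randn(4)
target = torch.randn(2)

# Forward pass: compute the output with the current parameters
output = torch.relu(layer(x))
loss = ((output - target) ** 2).mean()

# Backward pass: compute gradients of the loss w.r.t. weights and bias
loss.backward()
print(layer.weight.grad.shape)  # torch.Size([2, 4])

# Gradient-based update (what an optimizer does internally)
with torch.no_grad():
    layer.weight -= 0.01 * layer.weight.grad
    layer.bias -= 0.01 * layer.bias.grad
```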
g. Feedforward Networks Inside Transformers
Every Transformer block contains a standard feedforward neural network applied independently to each token:
FFN(x) = GELU(xW1 + b1)W2 + b2
This is not an auxiliary component: it does most of the representational work.
Attention mixes information.
Feedforward networks transform it.
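A minimal sketch of this feedforward block. The hidden size is set to 4× the model dimension, a common Transformer convention; the exact sizes here are illustrative:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """FFN(x) = GELU(xW1 + b1)W2 + b2, applied independently to each token."""
    def __init__(self, d_model=512, d_hidden=2048):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)  # W1, b1: expand
        self.fc2 = nn.Linear(d_hidden, d_model)  # W2, b2: project back
        self.act = nn.GELU()

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))

tokens = torch.randn(10, 512)        # 10 tokens, each a 512-dim vector
print(FeedForward()(tokens).shape)   # torch.Size([10, 512])
```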
Key Takeaways
- Transformers are not fundamentally new neural networks
- Layers, weights, activations still define behavior
- GELU is preferred over ReLU for deep Transformer models
- Forward and backward passes work exactly as in standard networks
- Attention sits on top of familiar feedforward structures
Whatβs Next?
In the next post, we'll cover Embeddings: the step that turns discrete tokens into continuous vectors.
Without embeddings, attention has nothing to operate on.