This post is the first in a series where I break down Transformers and Vision Transformers from the ground up: no shortcuts, no magic, just fundamentals.
Why This Series?
Transformers are often introduced through attention diagrams and complex equations.
But most confusion around Transformers does not come from attention itself.
It comes from weak fundamentals.
Transformers are deep neural networks with attention layered on top.
If the neural network basics are unclear, everything else feels mystical.
This first post focuses only on the neural network foundation required for Transformers.
a. Transformers Are Still Neural Networks
Before talking about attention, embeddings, or tokens, it's important to state something clearly:
Transformers are deep feedforward neural networks + attention.
Every Transformer block contains:
- Linear layers
- Activation functions
- Forward and backward passes
- Gradient-based learning
If you understand standard neural networks, Transformers become far less intimidating.
b. What a Neural Network Layer Really Is
A neural network layer is simply a function:
y = f(Wx + b)
Where:
- x is the input vector
- W is the weight matrix
- b is the bias vector
- f(·) is an activation function
Figure 1: A neural network layer applies a linear transformation using weights and bias, followed by a non-linear activation.
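To make this concrete, here is a minimal sketch of a single layer in PyTorch. The layer sizes (8 inputs, 4 outputs) are arbitrary and chosen only for illustration:

```python
import torch
import torch.nn as nn

# One layer: y = f(Wx + b)
linear = nn.Linear(in_features=8, out_features=4)  # holds W and b
activation = nn.GELU()                             # f(·)

x = torch.randn(8)          # input vector x
y = activation(linear(x))   # y = f(Wx + b)
print(y.shape)              # torch.Size([4])
```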
This exact computation appears everywhere in Transformers:
- Query, Key, Value projections
- Output projections
- Feed Forward Networks (FFN)
Nothing special: just the same computation repeated at scale.
c. Weights and Biases: The Only Things That Learn
Neural networks don't learn rules or symbols. They learn numbers.
- Weights control how strongly inputs affect outputs
- Biases allow shifting behavior independently of inputs
In Transformers:
- Attention matrices are learned weights
- Projection layers are learned weights
- FFN layers are learned weights
If you know where the weights are, you know where the learning is happening.
d. Why Activation Functions Are Non-Negotiable
Without activation functions, depth is meaningless.
Stacking linear layers without activation collapses into a single linear transformation:
W3(W2(W1x)) = Wx
Activation functions introduce non-linearity, allowing the model to represent complex patterns.
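You can verify this collapse numerically. A small sketch (matrix shapes chosen arbitrarily) showing that two stacked linear maps equal a single one:

```python
import torch

x = torch.randn(5)
W1 = torch.randn(7, 5)
W2 = torch.randn(3, 7)

# Two "layers" with no activation in between...
two_layers = W2 @ (W1 @ x)

# ...are just one linear transformation with W = W2 @ W1
one_layer = (W2 @ W1) @ x

print(torch.allclose(two_layers, one_layer, atol=1e-5))  # True
```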
e. ReLU vs GELU (Why Transformers Use GELU)
ReLU is simple and widely used:
ReLU(x) = max(0, x)
But Transformers typically use GELU:
GELU(x) = x · Φ(x), where Φ is the cumulative distribution function of the standard normal distribution.
Intuition:
- ReLU applies a hard cutoff
- GELU softly decides which values to pass
This smooth behavior helps with:
- Very deep networks
- Stable gradient flow
That's why models like BERT and GPT use GELU inside their feedforward blocks.
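To see the difference concretely, here is a tiny sketch comparing the two on a few sample values, using PyTorch's built-in implementations:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

print(torch.relu(x))  # hard cutoff: every negative value becomes exactly 0
print(F.gelu(x))      # soft gate: small negatives pass through, slightly attenuated
```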
f. Forward Pass vs Backward Pass
Forward Pass
The forward pass answers: Given current parameters, what output does the model produce?
Inputs flow through:
- Linear layers
- Activation functions
- Attention modules
This is what architecture diagrams usually show.
Backward Pass
The backward pass answers: How should the parameters change to reduce error?
Using backpropagation:
- Gradients flow backward
- Weights and biases are updated
- Learning happens
Transformers are trained using the same backpropagation principles as any other neural network.
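Here is a minimal sketch of one forward/backward step using PyTorch autograd. The toy layer, loss, and learning rate are illustrative only:

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)
x = torch.randn(4)
target = torch.randn(2)

# Forward pass: compute the output with the current parameters
output = torch.relu(layer(x))
loss = ((output - target) ** 2).mean()

# Backward pass: compute gradients of the loss w.r.t. weights and bias
loss.backward()
print(layer.weight.grad.shape)  # torch.Size([2, 4])

# Gradient-based update (what an optimizer does internally)
with torch.no_grad():
    layer.weight -= 0.01 * layer.weight.grad
    layer.bias -= 0.01 * layer.bias.grad
```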
g. Feedforward Networks Inside Transformers
Every Transformer block contains a standard feedforward neural network applied independently to each token:
FFN(x) = GELU(xW1 + b1)W2 + b2
This is not an auxiliary component: it does most of the representational work.
Attention mixes information.
Feedforward networks transform it.
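A minimal sketch of this feedforward block. The hidden size is set to 4× the model dimension, a common Transformer convention; the exact sizes here are illustrative:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """FFN(x) = GELU(xW1 + b1)W2 + b2, applied independently to each token."""
    def __init__(self, d_model=512, d_hidden=2048):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)  # W1, b1: expand
        self.fc2 = nn.Linear(d_hidden, d_model)  # W2, b2: project back
        self.act = nn.GELU()

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))

tokens = torch.randn(10, 512)        # 10 tokens, each a 512-dim vector
print(FeedForward()(tokens).shape)   # torch.Size([10, 512])
```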
Key Takeaways
- Transformers are not fundamentally new neural networks
- Layers, weights, activations still define behavior
- GELU is preferred over ReLU for deep Transformer models
- Forward and backward passes work exactly as in standard networks
- Attention sits on top of familiar feedforward structures
Whatβs Next?
In the next post, we'll cover Embeddings: the step that turns discrete tokens into continuous vectors.
Without embeddings, attention has nothing to operate on.