Jyoti Prajapati

Blog #1: Neural Network Fundamentals You Must Understand Before Transformers

This post is the first in a series where I break down Transformers and Vision Transformers from the ground up: no shortcuts, no magic, just fundamentals.

Why This Series?

Transformers are often introduced through attention diagrams and complex equations.
But most confusion around Transformers does not come from attention itself.

It comes from weak fundamentals.

Transformers are deep neural networks with attention layered on top.
If the neural network basics are unclear, everything else feels mystical.

This first post focuses only on the neural network foundation required for Transformers.

a. Transformers Are Still Neural Networks

Before talking about attention, embeddings, or tokens, it’s important to state something clearly:

Transformers are deep feedforward neural networks + attention.

Every Transformer block contains:

  • Linear layers
  • Activation functions
  • Forward and backward passes
  • Gradient-based learning

If you understand standard neural networks, Transformers become far less intimidating.

b. What a Neural Network Layer Really Is

A neural network layer is simply a function:

𝑦 = 𝑓(π‘Šπ‘₯ + 𝑏)
Enter fullscreen mode Exit fullscreen mode

Where:

  • π‘₯ is the input vector
  • π‘Š is the weight matrix
  • 𝑏 is the bias
  • 𝑓(β‹…) is an activation function

Figure 1: A neural network layer applies a linear transformation using weights and bias, followed by a non-linear activation.

Figure 1: A neural network layer applies a linear transformation using weights and bias, followed by a non-linear activation.

This exact computation appears everywhere in Transformers:

  • Query, Key, Value projections
  • Output projections
  • Feed Forward Networks (FFN)

Nothing special; just repeated at scale.
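
Here is what that single computation looks like as a minimal NumPy sketch. The sizes (4 inputs, 3 outputs) and the choice of ReLU are my illustrative assumptions, not something the figure prescribes:

```python
import numpy as np

# A minimal sketch of y = f(Wx + b) with ReLU as the activation f.
rng = np.random.default_rng(0)

x = rng.normal(size=4)        # input vector (4 features)
W = rng.normal(size=(3, 4))   # weight matrix: 4 inputs -> 3 outputs
b = np.zeros(3)               # bias vector

y = np.maximum(0, W @ x + b)  # linear transformation, then non-linearity
print(y)                      # a 3-dimensional output vector
```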

c. Weights and Biases: The Only Things That Learn

Neural networks don’t learn rules or symbols. They learn numbers.

  • Weights control how strongly inputs affect outputs
  • Biases allow shifting behavior independently of inputs

In Transformers:

  • Attention matrices are learned weights
  • Projection layers are learned weights
  • FFN layers are learned weights

If you know where the weights are, you know where the learning is happening.
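
As a quick illustration, here is a tiny PyTorch sketch (the framework choice is mine, not the post's) that prints exactly which numbers a single linear layer can learn:

```python
import torch.nn as nn

# One linear layer; its weight matrix and bias vector are the only
# learnable parameters it owns.
layer = nn.Linear(in_features=4, out_features=3)

for name, param in layer.named_parameters():
    print(name, tuple(param.shape), param.requires_grad)
# weight (3, 4) True
# bias   (3,)   True
```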

d. Why Activation Functions Are Non-Negotiable

Without activation functions, depth is meaningless.

Stacking linear layers with no activations between them collapses into a single linear transformation:

π‘Š3(π‘Š2(π‘Š1π‘₯)) = π‘Šπ‘₯
Enter fullscreen mode Exit fullscreen mode

Activation functions introduce non-linearity, allowing the model to represent complex patterns.
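
A quick NumPy check makes the collapse concrete; the 4×4 shapes are an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2, W3 = (rng.normal(size=(4, 4)) for _ in range(3))
x = rng.normal(size=4)

deep = W3 @ (W2 @ (W1 @ x))      # three stacked linear layers, no activations
W = W3 @ W2 @ W1                 # ...collapse into a single matrix W
print(np.allclose(deep, W @ x))  # True
```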

e. ReLU vs GELU (Why Transformers Use GELU)

ReLU is simple and widely used:

ReLU(π‘₯) = max(0,π‘₯)

But Transformers typically use GELU:

GELU(π‘₯) = π‘₯β‹…Ξ¦(π‘₯)

Intuition:

  • ReLU applies a hard cutoff
  • GELU softly decides which values to pass

This smooth behavior helps with:

  • Very deep networks
  • Stable gradient flow

That’s why models like BERT and GPT use GELU inside their feedforward blocks.
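
To see the soft-versus-hard cutoff numerically, here is a small sketch using the exact GELU formula x · Φ(x), with Φ written via the error function:

```python
import math

def relu(x: float) -> float:
    return max(0.0, x)

def gelu(x: float) -> float:
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF,
    # expressed through the error function.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  ReLU={relu(x):+.4f}  GELU={gelu(x):+.4f}")
```

Notice that ReLU zeroes out every negative input, while GELU lets slightly negative values pass through with reduced weight.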

f. Forward Pass vs Backward Pass

Forward Pass
The forward pass answers: Given current parameters, what output does the model produce?

Inputs flow through:

  • Linear layers
  • Activation functions
  • Attention modules

This is what architecture diagrams usually show.

Backward Pass

The backward pass answers: How should the parameters change to reduce error?

Using backpropagation:

  • Gradients flow backward
  • Weights and biases are updated
  • Learning happens

Transformers are trained using the same backpropagation principles as any other neural network.
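
As a sketch, here is one full training step in PyTorch; the toy model, the MSE loss, and the SGD optimizer are my assumptions for illustration, not anything a Transformer prescribes:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.GELU(), nn.Linear(8, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

x = torch.randn(16, 4)          # a batch of 16 input vectors
target = torch.randn(16, 2)

output = model(x)               # forward pass: what does the model produce?
loss = loss_fn(output, target)  # how wrong is it?

optimizer.zero_grad()
loss.backward()                 # backward pass: gradients flow backward
optimizer.step()                # weights and biases are updated
```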

g. Feedforward Networks Inside Transformers

Every Transformer block contains a standard feedforward neural network applied independently to each token:

FFN(π‘₯) = GELU(π‘₯π‘Š1 + 𝑏1)π‘Š2+𝑏2
Enter fullscreen mode Exit fullscreen mode

This is not an auxiliary component; it does most of the representational work.

Attention mixes information.
Feedforward networks transform it.
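
Here is a minimal PyTorch sketch of that block; d_model = 512 and d_ff = 2048 follow the common 4× expansion convention rather than any specific model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeedForward(nn.Module):
    # FFN(x) = GELU(x W1 + b1) W2 + b2, applied to each token independently.
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)  # expand
        self.w2 = nn.Linear(d_ff, d_model)  # project back

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.gelu(self.w1(x)))

tokens = torch.randn(2, 10, 512)     # (batch, sequence length, d_model)
print(FeedForward()(tokens).shape)   # torch.Size([2, 10, 512])
```

Because the same two linear layers are applied to every token position, the FFN transforms each token's representation without mixing information across positions; that mixing is attention's job.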

Key Takeaways

  • Transformers are not fundamentally new neural networks
  • Layers, weights, activations still define behavior
  • GELU is preferred over ReLU for deep Transformer models
  • Forward and backward passes work exactly as in standard networks
  • Attention sits on top of familiar feedforward structures

What’s Next?

In the next post, we’ll cover Embeddings: the step that turns discrete tokens into continuous vectors.
Without embeddings, attention has nothing to operate on.
