Aagam
Neural Networks: The First Step Toward Understanding Transformers

Whenever we start learning about LLMs, the first word we come across is "Transformer," based on the paper Attention Is All You Need. But when we start reading about it, we get confused. Do you know why? Because while reading, we run into the popular term "neural architecture" and tell ourselves, "Yeah, we understand it," but to be honest, we don't.

My goal is to make all the basics required to understand Transformers clear. In today's post, we will discuss neural networks, divided into five sections:

  1. What is a neural network?
  2. Why do we need a neural network?
  3. How does it work?
  4. What advantages does it provide, and what are the use cases and limitations?
  5. What more can be explored in the future — research areas

1. What Is a Neural Network?

A neural network, in a very simplified way, is nothing but a way to make our model mimic how our brain functions or, you could say, how it processes information. It consists of internal units called neurons (nodes) that transform the input data with the help of weighted connections and nonlinear functions to learn patterns from the data.


2. Why Do We Need a Neural Network?

Now another question arises: why do we even need a neural network? Before neural networks, we were heavily dependent on feature engineering and handcrafted rules for predictions. But this approach struggled with complex, high-dimensional data like images and raw text, which created the need for models that could learn features on their own: neural networks.


3. How Does a Neural Network Work?

Now for the hardest but most exciting part: how does a neural network work, exactly? I think this is something a lot of people skip over, and it's what separates good engineers from the rest in today's era. To understand it, we will divide the process into six simple steps.

Step 1 — Input Layer

A neural network does not understand raw images, text, or audio, so everything must first be converted into a numerical vector. As the name suggests, this layer receives the input data, which can be anything: a video, an image, and so on. It takes multiple features of that input and packs them into one single input vector. For example, to predict house prices: Input = [size, rooms, location_score].
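As a sketch of that idea, the house-price features could be packed like this (the feature names and values are made up for illustration, not from a real dataset):

```python
# Hypothetical house-price features packed into one numeric input vector.
# The names and values here are illustrative, not from a real dataset.
def make_input_vector(size_sqft, rooms, location_score):
    """Convert raw features into the flat list of numbers the network sees."""
    return [float(size_sqft), float(rooms), float(location_score)]

x = make_input_vector(1200, 3, 0.5)
print(x)  # [1200.0, 3.0, 0.5]
```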

Step 2 — Linear Transformation

Once the input is in numerical vector form, we perform a linear combination of the inputs. Confused? Yeah, I was too: what on earth is a "linear combination of inputs," and why do we do it?

I referred to blogs and GPT but still couldn't understand it at first. Let's work through it together.

So, let's first understand what we have: a vector from the input (image, video, or whatever), which will of course have multiple features. We don't know which features deserve more importance, and neither does our model. So we assign weights. How are the weights calculated? That's a big topic on its own, but to simplify: we start with random weights, run the model on training data, measure the error, adjust the weights, and repeat the process thousands of times until the weights settle at useful values.

We then multiply these weights by the input features and sum them up. But why do we do that? Because the result tells the network how strongly it should weigh each feature for its prediction. However, this is still a purely linear relationship, and we don't really know the true relationship yet. That is why we perform the next step.
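The weighted sum described above fits in a few lines of plain Python. The weight and bias values below are arbitrary numbers chosen for illustration; in a real model, training would find them:

```python
def linear_combination(inputs, weights, bias=0.0):
    """z = w1*x1 + w2*x2 + ... + wn*xn + b"""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

# Illustrative weights: the network would learn these during training.
x = [1200.0, 3.0, 0.5]   # [size, rooms, location_score]
w = [0.5, 10.0, 2.0]
z = linear_combination(x, w, bias=1.0)
print(z)  # 632.0
```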

Step 3 — Activation Function

So what we currently have is a raw signal — or, for people like me, I'd say a "random value," because I don't really know what the value means after we obtain it from the linear transformation.

For this particular blog, I'm taking ReLU as the activation function, though there are others (sigmoid, tanh, GELU, and more). ReLU turns negative values into 0 and keeps positive values as they are. It basically means: if the result is zero or below, don't fire (or use) this neuron; otherwise, fire it. Whatever output we get from the activation function becomes the input for the next step. Why? Let's see and understand together.
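ReLU itself is one line:

```python
def relu(z):
    """ReLU: zero out negative signals, pass positive ones through unchanged."""
    return max(0.0, z)

print(relu(-3.7))  # 0.0 -> this neuron stays silent
print(relu(2.4))   # 2.4 -> this neuron fires with strength 2.4
```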

Step 4 — Forward Propagation (Information Flows Through the Network)

To understand forward propagation, we need to first understand the structure. Let's take a simple neural network architecture: it consists of multiple layers, each layer extracts meaningful information from the raw data, and each layer has multiple neurons. Each neuron has one job — to check if a particular pattern exists, and if it does, how strongly it exists.

To make it simple, let's take an example. Suppose I have three layers:

  • Layer 1 detects simple patterns: one neuron detects edges, another detects corners, and another detects shapes. It then sends the information to Layer 2.
  • Layer 2 detects complex shapes and patterns from the previous layer's output. It then forwards the information to Layer 3.
  • Layer 3 produces the final output. We apply a softmax function, which converts the raw output into probabilities — for example, the probability that something is a dog's ear, a cat's ear, or something else.

(Kindly find the diagram below.)
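The three layers above can be sketched as a tiny forward pass. All the weights below are placeholder numbers, and the layer sizes are arbitrary; the point is only to show how information flows layer to layer:

```python
import math

def relu(z):
    return max(0.0, z)

def softmax(scores):
    """Turn raw output scores into probabilities that sum to 1."""
    m = max(scores)                           # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dense_layer(inputs, weights, biases, activation):
    """One layer: weighted sum + bias + activation, once per neuron."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

x = [0.5, -1.2, 0.3]                                          # input vector
h1 = dense_layer(x, [[0.2, -0.5, 0.1], [0.7, 0.3, -0.2]], [0.0, 0.1], relu)
h2 = dense_layer(h1, [[0.6, -0.1], [0.4, 0.9]], [0.05, -0.05], relu)
scores = dense_layer(h2, [[1.0, -1.0], [-0.3, 0.8]], [0.0, 0.0], lambda z: z)
probs = softmax(scores)        # e.g. [P(dog's ear), P(cat's ear)]
print(probs)
```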

Step 5 — Loss Function

Now, once our model makes its prediction after forward propagation, we need to measure how far that prediction is from the truth. We use cross-entropy loss:

L = −∑ y log(ŷ)

Cross-entropy measures how wrong the prediction is. Here, y is the true label (1 for the correct class, 0 for the others) and ŷ is the predicted probability for each class. From the log function, we can see that the loss is small when the correct class gets a high probability and large when the correct class gets a low probability.

So our model's end goal is to minimize the loss. Once it gets the value of the loss, it moves to the next step: backward propagation.
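Cross-entropy is easy to compute by hand. Because y_true is one-hot, only the correct class's term survives the sum; the probability vectors below are made up to show the contrast:

```python
import math

def cross_entropy(y_true, y_pred):
    """L = -sum(y * log(y_hat)); with one-hot y, only the true class counts."""
    return -sum(y * math.log(p) for y, p in zip(y_true, y_pred))

y_true = [0.0, 1.0, 0.0]   # the correct class is index 1

confident_right = cross_entropy(y_true, [0.05, 0.90, 0.05])  # small loss
confident_wrong = cross_entropy(y_true, [0.90, 0.05, 0.05])  # large loss
print(confident_right, confident_wrong)
```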

Step 6 — Backward Propagation

We all kind of get the gist: in backpropagation, the network goes back, finds out how much each layer contributed to the error, and then adjusts the weights accordingly, right? But have you ever wondered how it knows who contributed to the error and by how much?

Well, here comes the main part: backward propagation uses the chain rule. Now, I know a lot of us get scared when we hear "chain rule," but let's simplify it together. In the chain rule, we basically ask: "If we change the weight for a particular neuron slightly, how much does it affect the loss function?" We get the derivative using the chain rule, and once that is done, we update the weight using the formula below:

new_weight = old_weight − learning_rate × gradient

The gradient here is the derivative of the loss with respect to that particular weight, found using the chain rule. And yes, the learning rate is generally shared across the whole model, but in practice we often change it during training, for example with learning-rate schedules or adaptive optimizers.
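The update rule is one line per weight. The gradient values below are made-up numbers standing in for what the chain rule would produce; this is just the mechanics of the formula above:

```python
def sgd_update(weights, gradients, learning_rate=0.1):
    """new_w = old_w - lr * grad, applied element-wise to every weight."""
    return [w - learning_rate * g for w, g in zip(weights, gradients)]

weights = [0.5, -0.3, 0.8]
gradients = [0.2, -0.1, 0.4]   # pretend these came from the chain rule
weights = sgd_update(weights, gradients)
print(weights)  # approximately [0.48, -0.29, 0.76]
```

In a real training loop this step runs after every forward pass and loss calculation, thousands of times, which is exactly the "adjust and repeat" process described back in the linear-transformation step.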


4. Advantages and Limitations of Neural Networks

Advantages

  1. Because neural networks adjust their weights by themselves and learn patterns, they can handle complex pattern recognition where traditional models used to fail.
  2. They can extract features automatically without the need for manual feature engineering.
  3. They became the foundation of modern AI.

Limitations

Although neural networks come with a lot of advantages, there are some limitations too:

  1. They need a huge amount of data; with a small dataset, there is a high chance the model will overfit.
  2. As we've seen in the "how it works" section, they come with a high computational cost.
  3. Lack of interpretability: we often don't understand why the model made a specific decision. The same problem persists in today's modern AI, and even the CEOs of big AI companies admit they don't fully know how these models work.

5. What More Can Be Explored — Research Areas

For me personally, one area that really deserves research is mechanistic interpretability of neural networks: understanding how they internally represent concepts and make decisions. To frame it as a question:

How can we automatically discover interpretable functional circuits inside large neural networks instead of relying on manual neuron-level analysis?
