Nitish Kumar Chaudhary

Posted on May 27 • Originally published at Medium on May 5

How Neural Networks Work

#ai #neuralnetworks #llm

The Brain Behind Every AI

When ChatGPT launched in November 2022, it was not the first AI language model — but it was the first one anyone could simply open and use. No code, no API key, just a browser. And that changed everything. A new model drops almost every week now, each better than the last. But every time one launches, it comes with a pitch — benchmark scores, reasoning tests, agent capabilities, and those mysterious billion parameter numbers that nobody really explains.

In this article we are going to change that. We will start from the very basics and go all the way down to what those billions actually mean — no shortcuts, no jargon left unexplained. So buckle up. It is going to be a good one.

What is a Neural Network?

At its core a neural network is just a mathematical function — it takes something in and gives something out. You give it a photo, it tells you if there is a cat. You give it a sentence, it completes it.

Mathematically it looks like this:

y = f(x; θ)

y → the output, such as "cat" or the next word

x → the input — a photo, a sentence, a sound clip
θ (theta) → the parameters — all the weights and biases the network has learned
f → the mapping the network learned — the logic connecting input to output

Think of it like a black box that learned from millions of examples. You put something in, it gives you an answer. The parameters inside are what make that answer good or bad.

Everything we discuss in this series lives inside that θ.

What are Neurons?

A neural network is not one big brain sitting somewhere doing all the thinking. It is a massive collection of tiny decision units — all working together.

Those tiny units are called neurons.

Each neuron does one simple job: it takes some numbers in, does a small calculation, and passes a result forward. That calculation looks like this:

y = (w × x) + b

x → the input coming in
w → the weight — how much importance this input gets
b → the bias — the neuron’s personal threshold
y → the output passed to the next neuron

Nothing fancy. Just a linear equation.
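To make that concrete, here is a minimal Python sketch of a single neuron with one input. The function name `neuron` and the numbers plugged in are illustrative choices, not values from any real model.

```python
def neuron(x: float, w: float, b: float) -> float:
    """One neuron with one input: multiply by the weight, then add the bias."""
    return w * x + b


# Illustrative numbers only: input 0.9, weight 0.8, bias -0.2.
y = neuron(x=0.9, w=0.8, b=-0.2)
print(round(y, 2))  # 0.52
```

A real neuron usually has many inputs, each with its own weight, but the idea is the same: weighted inputs added together, plus one bias.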

Weights and Bias

Let us take an example. You feed an image into a model and its job is to tell you — is this a cat or a dog?

The model breaks that image into pixels and analyses each one. But pixels alone mean nothing. To reach a conclusion the model needs to know what matters and what doesn’t. That is where weights and bias come in.

Weights

A weight is simply how much importance a neuron gives to its input.

Not every feature of a cat carries equal importance. Fur is highly distinctive. Ear shape matters. But color? A cat and a dog can both be black — so color carries very little weight.

So when fur is detected the input is multiplied by a high weight. When color is detected it is multiplied by a low weight. The network learns to focus on what actually matters.

Bias

Bias is a fixed number attached to every neuron that gets added after the weight multiplication.

Think of it as the neuron’s personal firing threshold — how strong does the signal need to be before this neuron reacts?

Here is a good example. Pointy ears are detected. Pointy ears are a strong cat signal — dogs rarely have them. But what if the ears were only slightly pointy? A small negative bias ensures the neuron does not fire on weak uncertain signals. Only a genuinely strong pointy ear detection pushes past that bias and fires.

This is what makes bias powerful — it filters out weak uncertain signals and only lets through the ones the neuron is truly confident about.
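Here is a small Python sketch of that filtering idea. The weight of 1.0, the bias of -0.5 and the two signal strengths are made-up numbers, and "firing" is assumed to mean a result above zero.

```python
def weighted_signal(signal: float, weight: float, bias: float) -> float:
    """Weight times input, plus the bias."""
    return weight * signal + bias


def fires(signal: float, weight: float = 1.0, bias: float = -0.5) -> bool:
    """Assumption for this sketch: the neuron 'fires' when the result is above zero."""
    return weighted_signal(signal, weight, bias) > 0


slightly_pointy = 0.3  # hypothetical weak detection
clearly_pointy = 0.9   # hypothetical strong detection

print(fires(slightly_pointy))  # False: 0.3 - 0.5 is below zero
print(fires(clearly_pointy))   # True: 0.9 - 0.5 is above zero
```

Making the bias more negative raises the bar, and making it less negative lowers it.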

And the best part? Neither the weights nor the biases are set manually. The model learns and adjusts them automatically through training.

What are Layers?

If a neuron is a single worker, a layer is the entire team.

A layer is simply a collection of neurons all working in parallel — taking the same input, each doing their own calculation, and passing their results forward together.

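Mathematically it looks like this:

output = σ(W × x + b)

x → the inputs coming into the layer
W → the weights of every neuron in the layer
b → the biases, one per neuron
σ (sigma) → the activation function applied to each neuron’s result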

Think of it as an assembly line. Each layer receives what the previous layer figured out, adds its own understanding on top, and passes it forward.
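A layer can be sketched in a few lines of plain Python. This is only an illustration: the function name `layer` and all the numbers are made up, and it shows just the weighted-sum part that every neuron in the layer performs on the same inputs.

```python
def layer(inputs, weights, biases):
    """Each row of `weights` belongs to one neuron; every neuron sees all inputs."""
    return [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + b
        for neuron_weights, b in zip(weights, biases)
    ]


# Two inputs feeding a layer of two neurons (illustrative numbers).
outputs = layer(
    inputs=[1.0, 2.0],
    weights=[[0.5, 0.5], [1.0, -1.0]],
    biases=[0.0, 0.5],
)
print(outputs)  # [1.5, -0.5]
```

The list that comes out has one number per neuron, and that list becomes the input of the next layer.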

Why Do Layers Use Non-Linear Math?

You already know that every neuron does simple linear math: multiply the input by a weight, add a bias, pass it forward.

So if every neuron is linear, why does the layer’s final output go through a non-linear function, sigma (σ)?

Let us go back to our example to find out.

You feed a picture into the model. Its job — cat or dog? The model breaks the image into pixels and starts analysing. Now imagine if layers only did linear math.

Total → 60

But what does 60 mean? Is it a cat? Maybe. Probably. Could be. Linear math gives you a number but never a clean confident decision. It lets every signal partially contribute — even weak and irrelevant ones like brightness and color that are shared between cats and dogs.

This is where non-linear activation functions come in.

Take ReLU for example. When brightness comes in and the weight and bias together produce a negative number — ReLU simply makes it zero. That signal is gone. The layer stops caring about brightness and focuses on what actually matters — fur, ears, whiskers.

That hard zero is something linear math can never produce. And that hard zero is everything.

Let us see how ReLU actually helps the model reach a confident answer.

Same image. Same features. But this time ReLU is the gatekeeper.

Brightness was negative — ReLU killed it completely. The model stops wasting attention on it.

Total → 70

But now this 70 goes into the final layer with a Sigmoid function which squishes it into a probability between 0 and 1.

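Sigmoid takes the score → outputs 0.94

(The scores in this illustration are on a made-up scale. Applied literally, Sigmoid of 70 would be indistinguishable from 1.0. An input of roughly 2.75 is what produces 0.94.)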

The model says — 94% confident this is a cat. Not maybe. Not probably. A clean confident answer.

That is the difference. Linear gives you a number that means nothing. Non-linear gives you a decision.

And here is the most important thing to remember — every neuron still does linear math. That never changes. The activation function is what sits on top and breaks that linearity. Like a gatekeeper. Without it everything collapses back into one straight line.
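The "collapses back into one straight line" claim can be checked directly. In this sketch, with made-up weights and one input per layer, two stacked linear layers give exactly the same answers as a single linear layer, while putting ReLU between them does not.

```python
def linear(x: float, w: float, b: float) -> float:
    return w * x + b


def relu(z: float) -> float:
    return max(0.0, z)


# Hypothetical weights and biases for two tiny layers.
W1, B1 = 2.0, -1.0
W2, B2 = 3.0, 0.5


def two_linear_layers(x: float) -> float:
    return linear(linear(x, W1, B1), W2, B2)


def one_linear_layer(x: float) -> float:
    # The same function written as one neuron: w = W2 * W1, b = W2 * B1 + B2.
    return linear(x, W2 * W1, W2 * B1 + B2)


def with_relu_between(x: float) -> float:
    return linear(relu(linear(x, W1, B1)), W2, B2)


xs = [-1.0, 0.0, 1.0, 2.0]
print([two_linear_layers(x) for x in xs])  # [-8.5, -2.5, 3.5, 9.5]
print([one_linear_layer(x) for x in xs])   # [-8.5, -2.5, 3.5, 9.5]
print([with_relu_between(x) for x in xs])  # [0.5, 0.5, 3.5, 9.5]
```

The first two lists match, so the extra layer added nothing. The third list bends at the point where ReLU starts passing the signal, which is something no single straight line can do.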

Non-linearity is what gives a neural network the ability to actually think.

The Three Main Activation Functions

There are many activation functions out there — but three of them show up almost everywhere. Let us take a look at the ones that actually matter in practice.

ReLU

If the value is negative, set it to 0 ( for example, -8 -> 0 ).
If the value is positive, leave it unchanged ( for example, 20 -> 20 ).

Fast, simple, and works well in practice. ReLU and its close variants are the usual default for hidden layers.

Sigmoid

Squishes everything between 0 and 1.

Used in the final layer of binary classification — “Is this a cat? 87% yes.”

Tanh

Squishes everything between -1 and 1.

Used in hidden layers when the network needs both positive and negative outputs. Works better than sigmoid in many cases.
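All three functions fit in a few lines of Python using only the standard `math` module. The sample inputs below are arbitrary, chosen just to show how each function treats a negative number, zero and a positive number.

```python
import math


def relu(z: float) -> float:
    """Negative values become 0, positive values pass through unchanged."""
    return max(0.0, z)


def sigmoid(z: float) -> float:
    """Squishes any number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z: float) -> float:
    """Squishes any number into the range -1 to 1."""
    return math.tanh(z)


for z in [-8.0, 0.0, 2.0]:
    print(f"z = {z:5.1f}  relu = {relu(z):.3f}  "
          f"sigmoid = {sigmoid(z):.3f}  tanh = {tanh(z):.3f}")
```

At zero, ReLU gives 0, Sigmoid gives exactly 0.5 and Tanh gives 0, which is a handy way to remember where each one is centered.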

How Does Information Flow Through a Neural Network?

Till now we have seen the math behind each neuron and why layers use non-linear functions. Now let us zoom out and watch the entire network think — step by step.

A neural network never sees a cat directly. It builds the idea of a cat from raw numbers — transforming them layer by layer until a final answer emerges.

Think of it as a pipeline. Raw data goes in one end. Meaning comes out the other.

Our Setup

A small network with 3 layers:

Input layer → 4 neurons (one per pixel)
Hidden layer → 3 neurons
Output layer → 1 neuron

Step 1 — Input Layer

A 4 pixel image of a cat comes in.

Now before we go further — we are taking a grayscale image. That means each pixel carries only one value — its brightness. 0 is pure black, 255 is pure white, everything in between is a shade of gray. We normalize these values between 0 and 1 to make the math cleaner.

If this were a color image, each pixel would carry three values: Red, Green and Blue. So a 4 pixel color image would have 12 input neurons instead of 4. But for simplicity we are keeping it grayscale here.

No calculation happens in the input layer. It simply breaks the image into pixel values and passes them forward:

Pixel 1 = 0.9 , Pixel 2 = 0.8 , Pixel 3 = 0.2 , Pixel 4 = 0.7

These four numbers now flow into every single neuron in the hidden layer simultaneously.

Step 2 — Hidden Layer

Each of the 3 hidden neurons receives all 4 pixel values. Every neuron has its own set of learned weights — one per input — and its own bias. Each neuron multiplies every input by its weight, adds its bias, then passes the result through ReLU.

Why ReLU here? Because we are in a hidden layer. We want the network to make sharp decisions — pass strong signals forward and completely kill weak irrelevant ones. ReLU does exactly that.

Neuron A — detects pointy shapes and ears

This neuron has learned that pixel 1 and pixel 4 — the edge and ear shaped pixels — matter the most. So it gives them high weights. Fur texture and brightness are less relevant to this specific detection so they get low weights.

Weights: W1 = 0.8 , W2 = 0.1 , W3 = 0.1 , W4 = 0.9 | Bias: -0.2

Why bias -0.2? This neuron is looking for genuinely pointy shapes. A slightly pointy edge should not be enough to fire it. The -0.2 ensures only strong confident pointy shape signals push through.

(0.9 × 0.8) + (0.8 × 0.1) + (0.2 × 0.1) + (0.7 × 0.9) + (-0.2) 
= 0.72 + 0.08 + 0.02 + 0.63 - 0.20 
= 1.25

ReLU → 1.25 is positive → Neuron A fires strongly → output 1.25 ✅

The network detected a strong pointy shape. Ears confirmed.

Neuron B — detects fur texture

This neuron cares most about pixel 2 — the dense texture signal. Pointy shapes and ears are somewhat related so pixel 1 and 4 get medium weights. Brightness is almost irrelevant to fur detection so pixel 3 gets a very low weight.

Weights: W1 = 0.2 , W2 = 0.9 , W3 = 0.05 , W4 = 0.2 | Bias: -0.15

Why bias -0.15? Fur needs to be genuinely present to matter. A slightly textured surface like carpet or fabric could give a weak fur signal. The -0.15 filters those weak signals out — only real dense fur pushes past it.

(0.9 × 0.2) + (0.8 × 0.9) + (0.2 × 0.05) + (0.7 × 0.2) + (-0.15) 
= 0.18 + 0.72 + 0.01 + 0.14 - 0.15 
= 0.90

ReLU → 0.90 is positive → Neuron B fires confidently → output 0.90 ✅

Strong fur signal detected. The network is building confidence.

Neuron C — detects brightness

This neuron focuses on pixel 3, the brightness pixel, so pixel 3 gets the highest weight and the other three stay low. But here is the thing: brightness is a weak signal for distinguishing cats from dogs. Both can be bright or dark. That is why this neuron gets a strongly negative bias, which keeps it quiet unless the brightness signal is very strong.

Weights: W1 = 0.1 , W2 = 0.1 , W3 = 0.8, W4 = 0.1 | Bias: -0.3

Why bias -0.3 — the most negative of all three? Because brightness is the least useful feature here. We want this neuron to be very hard to fire. Only an extremely strong brightness signal should matter. A -0.3 bias raises that bar significantly. Even if brightness is somewhat present the neuron should mostly stay silent.

(0.9 × 0.1) + (0.8 × 0.1) + (0.2 × 0.8) + (0.7 × 0.1) + (-0.3) 
= 0.09 + 0.08 + 0.16 + 0.07 - 0.30 
= 0.10

ReLU → 0.10 is positive but extremely weak → Neuron C barely fires → output 0.10

The network detected some brightness but almost completely ignored it. Exactly as it should — brightness tells us almost nothing about whether this is a cat or a dog.

Two strong signals. One weak ignored signal. The network is already leaning heavily towards cat.
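The hidden layer arithmetic above can be reproduced in a short Python sketch. The pixel values, weights and biases are the ones used in the worked calculations. The variable names and the dictionary layout are just one convenient way to write it down.

```python
def relu(z: float) -> float:
    return max(0.0, z)


pixels = [0.9, 0.8, 0.2, 0.7]

# One list of four weights per hidden neuron, taken from the walkthrough.
hidden_weights = {
    "A": [0.8, 0.1, 0.1, 0.9],
    "B": [0.2, 0.9, 0.05, 0.2],
    "C": [0.1, 0.1, 0.8, 0.1],
}
hidden_biases = {"A": -0.2, "B": -0.15, "C": -0.3}

hidden_outputs = {}
for name, weights in hidden_weights.items():
    weighted_sum = sum(w * x for w, x in zip(weights, pixels)) + hidden_biases[name]
    hidden_outputs[name] = relu(weighted_sum)
    print(f"Neuron {name}: {hidden_outputs[name]:.2f}")
```

Up to floating-point rounding, this prints 1.25, 0.90 and 0.10 for neurons A, B and C.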

Step 3 — Output Layer

The outputs from all 3 hidden neurons — 1.25, 0.90, 0.10 — now flow into the single output neuron.

This neuron needs to combine everything and give one final answer. It gives high weight to the strong confident signals from Neuron A and B. Neuron C was weak and unreliable so it gets a very low weight.

Weights: W1 = 0.9 , W2 = 0.85 , W3 = 0.05 | Bias: -0.1

Why bias -0.1 here? The output neuron should not fire too easily. We want genuine confidence before declaring cat. A small negative bias means the combined signal needs to be genuinely strong.

(1.25 × 0.9) + (0.90 × 0.85) + (0.10 × 0.05) + (-0.1) 
= 1.125 + 0.765 + 0.005 - 0.10 
= 1.795

Now Sigmoid takes over. Why Sigmoid here and not ReLU?

Because we are in the final layer. We do not want a raw number like 1.795 — we want a probability between 0 and 1 that a human can understand. Sigmoid does exactly that — it squishes any number into a clean probability.

Sigmoid(1.795) → 0.86
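Putting the pieces together, here is a sketch of the whole forward pass in plain Python. All numbers come from the walkthrough. The helper name `dense` is an illustrative choice for "a fully connected layer followed by an activation function".

```python
import math


def relu(z: float) -> float:
    return max(0.0, z)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: weighted sum plus bias per neuron, then activation."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


pixels = [0.9, 0.8, 0.2, 0.7]

HIDDEN_WEIGHTS = [
    [0.8, 0.1, 0.1, 0.9],   # Neuron A
    [0.2, 0.9, 0.05, 0.2],  # Neuron B
    [0.1, 0.1, 0.8, 0.1],   # Neuron C
]
HIDDEN_BIASES = [-0.2, -0.15, -0.3]
OUTPUT_WEIGHTS = [[0.9, 0.85, 0.05]]
OUTPUT_BIASES = [-0.1]

hidden = dense(pixels, HIDDEN_WEIGHTS, HIDDEN_BIASES, relu)
cat_probability = dense(hidden, OUTPUT_WEIGHTS, OUTPUT_BIASES, sigmoid)[0]

print(round(cat_probability, 2))  # 0.86
```

Changing any single weight or bias in these lists changes the final probability, which is exactly the lever that training pulls.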

Final Answer

The model says — 86% confident this is a cat.

Four raw pixel numbers went in. Layer by layer — weights decided what mattered, biases filtered the uncertain signals, ReLU silenced the noise, and Sigmoid turned it all into one clean confident answer.

That is how a neural network thinks.

Parameters

Now we know how information flows through a neural network — how raw pixels turn into a confident answer layer by layer. But there is one question left.

How does the network know what weights to use? How does it know which bias to set? How does it know what matters and what doesn’t?

The answer is simple — it learns.

And everything it learns gets stored in one place — the parameters.

A parameter is any value inside the model that is learned during training. Every weight and every bias across every single neuron — that is what parameters are.

At the start a neural network is completely useless. Every weight and every bias is set to a random number. The network knows nothing. But as training happens — passing inputs through, comparing outputs with correct answers, adjusting values to reduce errors — those random numbers slowly become meaningful. They become the intelligence of the model.

A neural network does not think. It adjusts numbers until the answers get better. Those numbers are the parameters.

How Does a Neural Network Actually Learn?

We said parameters adjust during training. But how exactly does that happen?

Think about this — when you were learning to ride a bike, you did not read a manual and get it right the first time. You fell. You corrected. You fell again. You corrected again. Slowly your body learned exactly how to balance. A neural network learns the exact same way. It makes mistakes and corrects itself. Over and over. Millions of times.

Here is how that process works step by step.

Step 1 — Make a Prediction

The network takes an input — say our cat image — and runs it through every layer. Weights multiply, biases filter, activation functions decide what passes. A final answer comes out.

Let us say it outputs 0.40 — meaning 40% confident this is a cat.

But the correct answer is 1.0 — yes, this is definitely a cat.

The network was wrong. Now what?

Step 2 — Measure the Mistake — Loss Function

The network needs to know how wrong it was. That is exactly what the loss function does.

The loss function takes two numbers — what the network predicted and what the correct answer actually was — and returns one number telling you how bad the mistake was.

A simple loss function looks like this:

Loss = (correct answer − predicted answer)²

Loss = (1.0 − 0.40)²
Loss = (0.60)²
Loss = 0.36

Loss = 0.36 — that is the size of the mistake.
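In Python, this simple squared-error loss is a one-liner. The function name `squared_error` is an illustrative choice.

```python
def squared_error(correct: float, predicted: float) -> float:
    """How wrong the prediction was: the squared gap to the correct answer."""
    return (correct - predicted) ** 2


loss = squared_error(correct=1.0, predicted=0.40)
print(round(loss, 2))  # 0.36
```

Squaring keeps the loss positive whether the prediction is too high or too low, and it punishes big misses much more than small ones.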

The higher the loss — the worse the prediction. The goal of training is simple — get this number as close to zero as possible.

Step 3 — Figure Out Who Was Responsible — Backpropagation

Now the network knows it made a mistake of 0.36. But there are 19 parameters in our tiny example — and billions in real models. Which ones caused this mistake? And by how much?

This is where backpropagation comes in.

Backpropagation works backwards through the network — starting from the output, going layer by layer all the way back to the first hidden layer — asking one question at every weight:

“If I increase this weight slightly — does the loss go up or down?”

The answer to that question is called the gradient — it tells you which direction to move each weight to reduce the mistake.

Mathematically the gradient looks like this:

∂L/∂w

∂L → how much the loss changes
∂w → when this weight changes by a tiny amount
Together → “if I nudge this weight slightly, how much does the mistake grow or shrink?”

If the gradient is positive — increasing this weight makes the loss worse. Move it down.

If the gradient is negative — increasing this weight makes the loss better. Move it up.

The gradient is not just telling you that you are wrong. It is telling you exactly which direction to go to be less wrong. And it does this for every single weight in the network simultaneously — that is what makes backpropagation powerful.

Think of it like standing on a hill in the dark. You cannot see the bottom but you can feel which direction is downhill under your feet. Gradient is that feeling. Backpropagation figures out which direction is downhill for every single weight simultaneously.

One forward pass to make a prediction. One backward pass to calculate every weight’s responsibility. That efficiency is what makes training possible at scale.
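The "nudge the weight and watch the loss" question can be tried out literally. The sketch below is not backpropagation itself: it estimates the gradient numerically by nudging one weight up and down, which is a slower stand-in that gives the same sign and size. The one-weight model, the input of 0.5 and the target of 1.0 are made up for the example.

```python
def loss_for_weight(w: float, x: float = 0.5, target: float = 1.0) -> float:
    """Squared error of a one-weight model whose prediction is w * x."""
    prediction = w * x
    return (target - prediction) ** 2


def numerical_gradient(loss_fn, w: float, nudge: float = 1e-6) -> float:
    """Estimate dL/dw by nudging the weight slightly in both directions."""
    return (loss_fn(w + nudge) - loss_fn(w - nudge)) / (2 * nudge)


grad_when_too_small = numerical_gradient(loss_for_weight, 0.8)
grad_when_too_large = numerical_gradient(loss_for_weight, 3.0)

print(round(grad_when_too_small, 3))  # -0.6: negative, so move the weight up
print(round(grad_when_too_large, 3))  # 0.5: positive, so move the weight down
```

Real frameworks get the same numbers with the chain rule in a single backward pass, instead of nudging every weight one at a time.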

Step 4 — Adjust Every Weight — Learning Rate

Now the network knows which direction to move each weight. But how big should each step be?

That is controlled by eta (η) — also called the learning rate.

The update formula looks like this:

w = w − η × gradient

Current weight = 0.8 
Learning rate (η) = 0.01 
Gradient = 2.0 

New weight = 0.8 − (0.01 × 2.0) 
New weight = 0.8 − 0.02 
New weight = 0.78

The weight moved slightly in the right direction. Next prediction will be slightly better.
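The same update written as a tiny Python function, using the numbers from the example above. The name `update_weight` is illustrative.

```python
def update_weight(weight: float, gradient: float, learning_rate: float) -> float:
    """One gradient descent step: move against the gradient, scaled by eta."""
    return weight - learning_rate * gradient


new_weight = update_weight(weight=0.8, gradient=2.0, learning_rate=0.01)
print(round(new_weight, 2))  # 0.78
```

A positive gradient pushes the weight down and a negative gradient pushes it up, because of the minus sign in the formula.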

Now imagine doing this for every single weight and bias in the network. After every single prediction. Millions of times. That is training.

Why Learning Rate Matters

If eta is too large — weights jump too far, overshoot the right value, and the network never settles.

If eta is too small — weights move so slowly that training takes forever.
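Both failure modes are easy to see on a toy problem. This sketch assumes a made-up loss, (w − 3)², whose best weight is 3, and runs 20 update steps with three different learning rates. The specific rates are illustrative and only make sense for this toy loss.

```python
def gradient(w: float) -> float:
    """Slope of the toy loss (w - 3)**2, whose best weight is 3."""
    return 2.0 * (w - 3.0)


def train(learning_rate: float, steps: int = 20, start: float = 0.0) -> float:
    """Run plain gradient descent and return the final weight."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w


too_large = train(1.1)    # overshoots further on every step
too_small = train(0.001)  # barely moves in 20 steps
moderate = train(0.1)     # settles close to 3

for name, w in [("too large", too_large), ("too small", too_small), ("moderate", moderate)]:
    print(f"{name}: w = {w:.3f}, distance from best = {abs(w - 3.0):.3f}")
```

With the large rate the weight ends up further from 3 than where it started, with the tiny rate it has hardly left the starting point, and with the moderate rate it lands within a few hundredths of the best value.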

Most modern models use an optimizer called Adam, which adapts the step size for every weight individually, based on that weight’s recent gradients. You still choose a base learning rate, but Adam handles the per-weight scaling automatically.

The Full Training Loop

1. Forward pass → input flows through → prediction made 
2. Loss function → how wrong was the prediction 
3. Backpropagation → error flows backward → every weight gets its gradient calculated 
4. Update → every weight nudged in right direction 
5. Repeat → millions of times

Every repetition of this loop is called one training step. A model like Gemma 4B goes through an enormous number of these steps, over a huge amount of training data, before it becomes useful.
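The five steps can be sketched end to end on the smallest possible model: one sigmoid neuron with one weight and one bias. Everything here is an assumption made for illustration, including the two-example dataset, the learning rate of 0.5 and the 200 passes. The gradients are written out by hand with the chain rule, which is what backpropagation automates in larger networks.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical toy dataset: (input, correct answer).
data = [(1.0, 1.0), (-1.0, 0.0)]

w, b = 0.0, 0.0       # start knowing nothing
learning_rate = 0.5   # eta


def average_loss(w: float, b: float) -> float:
    return sum((t - sigmoid(w * x + b)) ** 2 for x, t in data) / len(data)


initial_loss = average_loss(w, b)

for _ in range(200):                       # 5. repeat
    for x, target in data:
        prediction = sigmoid(w * x + b)    # 1. forward pass
        # 2. loss is (target - prediction)**2
        # 3. gradients via the chain rule (backpropagation by hand)
        dloss_dprediction = -2.0 * (target - prediction)
        dprediction_dz = prediction * (1.0 - prediction)
        grad_w = dloss_dprediction * dprediction_dz * x
        grad_b = dloss_dprediction * dprediction_dz
        # 4. update
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

final_loss = average_loss(w, b)
print(f"loss before: {initial_loss:.3f}, loss after: {final_loss:.4f}")
```

The loss starts at 0.25, which is what pure guessing gives here, and shrinks toward zero as the weight grows in the helpful direction.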

How Many Parameters Were in Our Cat Example?

Let us count every single weight and bias used in our example — from input to output.

Input layer → 4 neurons (no parameters - just passes data) 
Hidden layer → 3 neurons (each connected to all 4 inputs) 
Output layer → 1 neuron (connected to all 3 hidden neurons)

Every hidden neuron has 4 weights — one per input — plus 1 bias.

Neuron A:

W1=0.8, W2=0.1, W3=0.1, W4=0.9 → 4 weights
bias = -0.2 → 1 bias
Neuron A total → 5 parameters

Neuron B:

W1=0.2, W2=0.9, W3=0.05, W4=0.2 → 4 weights
bias = -0.15 → 1 bias
Neuron B total → 5 parameters

Neuron C:

W1=0.1, W2=0.1, W3=0.8, W4=0.1 → 4 weights
bias = -0.3 → 1 bias
Neuron C total → 5 parameters

Hidden layer total → 5 + 5 + 5 = 15 parameters

Output Layer Parameters

The output neuron has 3 weights — one per hidden neuron — plus 1 bias.

W1=0.9, W2=0.85, W3=0.05 → 3 weights
bias = -0.1 → 1 bias
Output layer total → 4 parameters

Total Parameters

Hidden layer (15) + Output layer (4) = 19 parameters

This tiny network used exactly 19 parameters to decide — 86% confident this is a cat.
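The same count can be automated. This sketch assumes every layer is fully connected, as in our example, so each neuron has one weight per incoming value plus one bias. The function name `count_parameters` is illustrative.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    `layer_sizes` lists the number of neurons per layer, input layer first.
    """
    total = 0
    for inputs_per_neuron, neurons in zip(layer_sizes, layer_sizes[1:]):
        weights = inputs_per_neuron * neurons
        biases = neurons
        total += weights + biases
    return total


print(count_parameters([4, 3, 1]))  # 19
```

Try a few bigger shapes, such as `[784, 128, 10]`, to see how quickly the count grows once layers get wider.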

19 parameters made one simple decision. Now imagine 4 billion. That is exactly what we look at next.

Why Do Models Need So Many Parameters?

Our tiny cat detector needed 19 parameters to make one simple decision — is this a cat or a dog, looking at just 4 pixels.

Now think about what a real AI model has to do.

It has to understand photos taken in different lighting, angles, and colors. It has to read sentences in multiple languages. It has to write code, answer questions, summarize documents, and understand context from a long conversation.

That is not one simple decision. That is billions of them.

More Parameters = More Patterns

Every parameter is one tiny learned decision. One weight saying — this feature matters more than that one. One bias saying — only fire when the signal is strong enough.

The more parameters a model has — the more fine grained patterns it can detect and the more complex relationships it can learn.

19 parameters → cat or dog from 4 pixels
Millions → objects, shapes, basic language
Billions → complex reasoning, nuance, context

Think of it like reading. A child who has read 10 books can answer simple questions. Someone who has read 10,000 books understands context, nuance, sarcasm and complexity. Parameters are that exposure — stored as numbers.

But Does More Parameters Always Mean More Intelligence?

Here is where most people get it wrong — bigger does not always mean smarter.

Overfitting — when a model memorizes instead of learning

A model with too many parameters for a simple task stops learning patterns and starts memorizing answers.

Imagine a student who memorizes every past exam paper word for word. Give them the exact same question — perfect score. Change one word — completely lost.

A model with too many parameters does the same. It memorizes training data instead of learning the actual pattern. Works perfectly on data it has seen. Fails on anything new.

Data has to match the parameters

More parameters need more training data to learn correctly. A 100B parameter model trained on limited data can perform worse than a 7B model trained on rich, high quality data.

Parameters without enough data are just random numbers that never get corrected properly.

Speed and cost grow with every parameter

More parameters means more calculations per input. More memory to store them. More GPUs to run them. More electricity. More money.

A 4B model runs on a decent laptop. A 105B model needs multiple high end GPUs just to load.
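A rough back-of-the-envelope calculation shows why. This sketch assumes each parameter is stored as a 16-bit number (2 bytes), which is a common choice but not the only one, and it counts only the memory for the weights themselves, not anything else the model needs while running.

```python
def weight_memory_gb(num_parameters: int, bytes_per_parameter: int = 2) -> float:
    """Memory needed just to hold the parameters, in gigabytes (10**9 bytes).

    Assumption: 2 bytes per parameter (16-bit storage).
    """
    return num_parameters * bytes_per_parameter / 1e9


print(weight_memory_gb(4_000_000_000))    # 8.0
print(weight_memory_gb(105_000_000_000))  # 210.0
```

Under that assumption a 4B model needs about 8 GB for its weights, while a 105B model needs about 210 GB, far more than a single consumer GPU holds. Storing parameters with fewer bits shrinks both numbers.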

So intelligence is not just about parameter count. It is about the right number of parameters, trained on the right data, for the right task.

So When Do Parameters Actually Matter?

More parameters make sense when the task is genuinely complex — multiple languages, deep reasoning, long context, coding. When you have enough high quality data to train them. And when you have the hardware to run them.

Fewer parameters make sense when the task is focused and specific. When speed and cost matter. When the model needs to run locally on a phone or laptop.

Real Example — Gemma 4B vs Sarvam 105B

Gemma 4B: 4 billion parameters. Built by Google to be small, efficient and fast. Runs on a laptop. Handles everyday tasks extremely well. The small parameter count is justified because the task is general but manageable.

Sarvam 105B: 105 billion parameters. Built specifically for Indian languages. India’s constitution recognizes 22 scheduled languages, written in many different scripts and with different grammar structures and cultural contexts. That complexity genuinely needs more parameters. The larger count is justified by the task.

The best model is never the biggest one. It is the one where parameters, data and task are perfectly balanced.

More parameters is a tool, not a guarantee of intelligence. A well trained small model can beat a poorly trained large one. The number on the label tells you the capacity. The training tells you the intelligence.

Conclusion

We started with a simple question — what do those billion numbers in AI model names actually mean?

And now you know.

A neural network is not magic. It is millions of tiny neurons each doing simple math — multiplying inputs by weights, adding a bias, passing a signal forward. Layers stack on top of each other, activation functions decide what passes and what gets silenced, and slowly raw pixels transform into a confident answer.

The parameters — every weight and every bias — are the memory of that entire process. Every mistake the model made during training, every correction, every adjustment — all of it is stored in those numbers. A 4B model has 4 billion of those learned decisions. A 105B model has 105 billion.

But as we saw — bigger is not always better. The best model is the one where parameters, data and task are balanced perfectly.

Next time you see Gemma 4B or Sarvam 105B or GPT-4 — you will not just see a number. You will see billions of tiny learned decisions, trained through millions of mistakes, all working together to give you an answer in milliseconds.

That is the brain behind every AI. And now it is no longer a black box.

If this helped you understand something you have been curious about for a while — drop a comment and let me know. And if you enjoyed reading, do give it a like — it genuinely means a lot. More interesting and in depth articles are on their way, so check back every once in a while. See you in the next one.

Originally published at https://nkc.hashnode.dev on May 5, 2026.