<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nilavukkarasan R</title>
    <description>The latest articles on DEV Community by Nilavukkarasan R (@rnilav).</description>
    <link>https://dev.to/rnilav</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3772087%2Fb8707010-ec72-4401-bf94-a0595c046a4d.jpg</url>
      <title>DEV Community: Nilavukkarasan R</title>
      <link>https://dev.to/rnilav</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rnilav"/>
    <language>en</language>
    <item>
      <title>Attention Mechanisms: Stop Compressing, Start Looking Back</title>
      <dc:creator>Nilavukkarasan R</dc:creator>
      <pubDate>Sun, 19 Apr 2026 05:32:31 +0000</pubDate>
      <link>https://dev.to/rnilav/attention-mechanisms-stop-compressing-start-looking-back-1bol</link>
      <guid>https://dev.to/rnilav/attention-mechanisms-stop-compressing-start-looking-back-1bol</guid>
      <description>&lt;p&gt;&lt;em&gt;"The art of being wise is the art of knowing what to overlook."&lt;/em&gt;&lt;br&gt;
— &lt;strong&gt;William James&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;The Bottleneck We Didn't Notice&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In my &lt;a href="https://dev.to/rnilav/understanding-recurrent-neural-networks-from-forgetting-to-remembering-5f7"&gt;last post&lt;/a&gt;, we gave networks memory. An LSTM reads a sentence word by word, maintaining a hidden state that carries context forward. It solved the forgetting problem that plagued vanilla RNNs.&lt;/p&gt;

&lt;p&gt;But there are three problems the LSTM still doesn't solve. And I didn't fully understand them until I thought about my own experience learning English.&lt;/p&gt;

&lt;p&gt;I studied in Tamil medium all the way through school. English was a subject, not a language I lived in. When I started my first job 20 years ago, I had to learn to actually &lt;em&gt;speak&lt;/em&gt; it and, more terrifyingly, &lt;em&gt;write&lt;/em&gt; it. Client emails. Professional communication. Things that would be read, judged, and replied to.&lt;/p&gt;

&lt;p&gt;My strategy was the only one I knew: compose the sentence in Tamil first, then translate it word by word into English.&lt;/p&gt;

&lt;p&gt;It worked for simple things. It broke down in three very specific ways. Those three breakdowns map exactly onto the three problems that attention was built to solve.&lt;/p&gt;


&lt;h2&gt;
  
  
  Problem 1: The Compressed Summary
&lt;/h2&gt;

&lt;p&gt;The first breakdown happened with long emails.&lt;/p&gt;

&lt;p&gt;I'd compose a full paragraph in Tamil mentally: three or four sentences, a complete thought. Then I'd try to hold that entire paragraph in my head while translating it into English. By the time I was writing the third sentence in English, the first one had blurred. I'd lose the subject I'd introduced. I'd forget the condition I'd set up. The English output would drift from the original Tamil thought.&lt;/p&gt;

&lt;p&gt;The problem wasn't that I forgot individual words. It was that I was trying to carry a &lt;em&gt;compressed summary&lt;/em&gt; of a long paragraph in my working memory, and that summary wasn't big enough to hold everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is exactly what an RNN encoder does.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It reads the entire input sequence and compresses it into a single fixed-size vector, the final hidden state. Then the decoder uses only that compressed summary to generate the output. For short sentences, fine. For long ones, that summary has to hold everything: the subject, the verb, the object, the tone, the nuance. Something always gets lost.&lt;/p&gt;
&lt;h3&gt;
  
  
  Bahdanau's Fix (2014)
&lt;/h3&gt;

&lt;p&gt;The fix came from Bahdanau, Cho, and Bengio. The idea is simple in principle: don't compress. Keep every hidden state the encoder produced, one per input word, and let the decoder look back at any of them when needed.&lt;/p&gt;

&lt;p&gt;Instead of one compressed summary, the decoder has access to the full sequence of encoder states. When generating each output word, it computes a weighted sum over all of them, attending more to the ones that are relevant right now and less to the ones that aren't.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Without attention:  decoder sees only h_final (compressed summary of everything)
With attention:     decoder sees h₁, h₂, ..., hₙ and decides what to focus on
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bahdanau's original formulation used a small neural network to compute how well each encoder state matched the decoder's current need: a learned compatibility function. It worked remarkably well. Translation quality on long sentences improved dramatically.&lt;/p&gt;

&lt;p&gt;Your brain does this too. When you're answering a question about something you read, you don't reconstruct a compressed summary; you mentally flip back to the relevant section. The original is still accessible. Attention gives the network the same ability.&lt;/p&gt;
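
&lt;p&gt;Here's that idea as a small NumPy sketch (the weight names and sizes are illustrative, not Bahdanau's actual implementation, and the weights are random where a real model would learn them):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 5                                # hidden size, input length

encoder_states = rng.normal(size=(n, d))   # h_1 ... h_n, one per input word
decoder_state = rng.normal(size=(d,))      # the decoder's current state s

# Weights of the small scoring network (random here, learned in practice)
W_s = rng.normal(size=(d, d))
W_h = rng.normal(size=(d, d))
v = rng.normal(size=(d,))

# score_i = v . tanh(W_s s + W_h h_i)  -- one score per encoder position
scores = np.tanh(decoder_state @ W_s + encoder_states @ W_h) @ v

weights = np.exp(scores) / np.exp(scores).sum()   # softmax: weights sum to 1
context = weights @ encoder_states                # weighted sum of ALL states
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
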




&lt;h2&gt;
  
  
  Problem 2: Word Order
&lt;/h2&gt;

&lt;p&gt;The second breakdown was more embarrassing. It happened in individual sentences, not long paragraphs.&lt;/p&gt;

&lt;p&gt;Tamil is a verb-final language. The verb comes at the end. When I wanted to write "Can you send the report by tomorrow?", the Tamil structure in my head was roughly: &lt;em&gt;"நாளைக்குள் அந்த report-ஐ அனுப்ப முடியுமா?"&lt;/em&gt; — "Tomorrow-by that report send can-you?" Subject implied. Object before verb.&lt;/p&gt;

&lt;p&gt;I'd start translating from the beginning of the Tamil sentence. "Tomorrow-by" → "By tomorrow". OK so far. "That report" → "the report". Fine. "Send" → "send". And then I'd realize I'd already written "By tomorrow the report send" and have no idea where to put "Can you."&lt;/p&gt;

&lt;p&gt;What appeared perfectly correct in Tamil didn't map cleanly to English word by word. The structures are different. A literal left-to-right translation produces nonsense.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is the word order problem — and it's where attention does its real work.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An RNN decoder, even with access to all encoder states, still generates output left to right, one word at a time. But attention lets the decoder look at &lt;em&gt;any&lt;/em&gt; encoder position in &lt;em&gt;any&lt;/em&gt; order. When generating "Can", it attends to the Tamil modal at position 5. When generating "send", it attends to the Tamil verb at position 4. When generating "tomorrow", it attends back to position 1.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tamil:    நாளைக்குள்  அந்த  report-ஐ  அனுப்ப  முடியுமா
              h₁        h₂      h₃       h₄       h₅
           (by tmrw)  (that) (report)  (send)  (can you?)

English output → attention focus:
"Can"      → h₅  (முடியுமா — the modal)
"you"      → h₅
"send"     → h₄  (அனுப்ப — the verb)
"the"      → h₂ + h₃
"report"   → h₃  (report-ஐ — the object)
"by"       → h₁  (நாளைக்குள் — the time marker)
"tomorrow" → h₁
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The attention weights form a matrix, one row per English output word, one column per Tamil input word. You can literally see the reordering: the decoder jumping from position 5 back to position 4, then to 3, then to 1. It's not following the Tamil order. It's following the English order, looking back at whatever Tamil position it needs.&lt;/p&gt;

&lt;p&gt;This is what the Q/K/V formulation captures cleanly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Query (Q)&lt;/strong&gt;: what the decoder is currently asking — "what do I need to generate this word?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key (K)&lt;/strong&gt;: what each encoder position offers — a description of what's available there&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Value (V)&lt;/strong&gt;: the actual content retrieved when you attend to that position
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tex"&gt;&lt;code&gt;Attention(Q, K, V) = softmax(Q·Kᵀ / √d) · V
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;√d&lt;/code&gt; scaling keeps dot products in a stable range as the dimension grows; without it, the softmax saturates and gradients vanish. Same instability problem we saw in deep networks, same fix.&lt;/p&gt;
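
&lt;p&gt;That formula translates almost line for line into NumPy. A minimal sketch, with the standard max-subtraction trick for a numerically stable softmax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                # Q·Kᵀ / √d
    scores = scores - scores.max(axis=-1, keepdims=True)     # stability only
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax per row
    return weights @ V, weights

rng = np.random.default_rng(1)
Q = rng.normal(size=(7, 16))   # 7 output positions asking questions
K = rng.normal(size=(5, 16))   # 5 input positions offering keys...
V = rng.normal(size=(5, 16))   # ...and values
out, w = scaled_dot_product_attention(Q, K, V)
# out: one context vector per query; each row of w sums to 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
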




&lt;h2&gt;
  
  
  Problem 3: Speed
&lt;/h2&gt;

&lt;p&gt;The third breakdown was the slowest to notice, because it wasn't about a single sentence. It was about &lt;em&gt;conversation&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Word-by-word translation is sequential by nature. I'd think in Tamil, translate, speak. Then listen to the reply in English, translate it back to Tamil to understand it, formulate a Tamil response, translate that to English, speak. Every exchange had this full round-trip happening in my head.&lt;/p&gt;

&lt;p&gt;For a simple two-line exchange, manageable. For a fast-moving technical discussion with multiple people, completely unworkable. By the time I'd finished translating the last thing someone said, the conversation had moved on two turns.&lt;/p&gt;

&lt;p&gt;The bottleneck wasn't comprehension. It was that the process was &lt;em&gt;sequential&lt;/em&gt;. Each step had to wait for the previous one to finish.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is the parallelism problem — and it's what self-attention solves.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An RNN processes a sequence one step at a time. Step 2 can't start until step 1 is done. For a sentence of length 100, that's 100 sequential operations. You can't parallelize across time steps because each hidden state depends on the previous one.&lt;/p&gt;

&lt;p&gt;Self-attention breaks this dependency entirely. Instead of processing word by word, it computes relationships between &lt;em&gt;all&lt;/em&gt; positions simultaneously in a single matrix operation. There's no sequential chain. The entire sequence is processed at once.&lt;/p&gt;

&lt;p&gt;When you start thinking directly in English, something similar happens. It's not a sequential process anymore. Grammar, meaning, and context are processed in parallel, automatically, without conscious effort. It's parallel processing.&lt;/p&gt;

&lt;p&gt;Self-attention is the architectural version of that shift.&lt;/p&gt;




&lt;h2&gt;
  
  
  Self-Attention: Every Word Sees Every Other Word
&lt;/h2&gt;

&lt;p&gt;So far, attention was between two sequences: Tamil input, English output. The decoder attends to the encoder. But the same mechanism applies within a single sequence, and this turns out to be even more powerful.&lt;/p&gt;

&lt;p&gt;Consider: "The report that the client who called yesterday requested is ready."&lt;/p&gt;

&lt;p&gt;What is "ready"? The report. Which report? The one the client requested. Which client? The one who called yesterday. These connections span many positions in the same sentence. An RNN would need to carry all of this through its hidden state, step by step, hoping nothing gets lost.&lt;/p&gt;

&lt;p&gt;Self-attention resolves them in one shot: every word attends to every other word in the same sequence, regardless of distance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"ready"     → attends back to "report" (subject of the predicate)
"requested" → attends to "client" (who did the requesting)
"who"       → attends to "client" (relative clause anchor)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No sequential processing. No hidden state bottleneck. One operation, all connections at once.&lt;/p&gt;
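
&lt;p&gt;Self-attention is the same computation with Q, K, and V all projected from one sequence. A sketch with random projection matrices standing in for learned ones:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def softmax_rows(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
n, d = 11, 8                       # 11 words, 8-dim embeddings
X = rng.normal(size=(n, d))        # the whole sentence at once

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv   # every word makes its own query/key/value

weights = softmax_rows(Q @ K.T / np.sqrt(d))   # (n, n): every word vs every word
out = weights @ V                  # all n positions updated in one matrix op, no loop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
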

&lt;p&gt;Your brain does this effortlessly when reading fluently. It's only when you're translating word by word, processing sequentially one token at a time, that you lose these long-range connections.&lt;/p&gt;




&lt;h2&gt;
  
  
  Multi-Head Attention: Noticing Multiple Things at Once
&lt;/h2&gt;

&lt;p&gt;There's one more piece. A single attention operation computes one set of weights. It can only "look for" one type of relationship at a time. But language has many simultaneous relationships.&lt;/p&gt;

&lt;p&gt;In "The cat sat on the mat because it was tired", the word "it" has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;syntactic&lt;/strong&gt; relationship with "sat" (subject of the clause)&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;coreference&lt;/strong&gt; relationship with "cat" (what "it" refers to)&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;semantic&lt;/strong&gt; relationship with "tired" (property being attributed)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A single attention head would have to pick one. Multi-head attention runs several attention operations in parallel, each with different learned projections:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;head_i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Attention&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Q&lt;/span&gt;&lt;span class="err"&gt;·&lt;/span&gt;&lt;span class="n"&gt;Wᵢ_Q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="err"&gt;·&lt;/span&gt;&lt;span class="n"&gt;Wᵢ_K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="err"&gt;·&lt;/span&gt;&lt;span class="n"&gt;Wᵢ_V&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nc"&gt;MultiHead&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Concat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;head_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...,&lt;/span&gt; &lt;span class="n"&gt;head_h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="err"&gt;·&lt;/span&gt; &lt;span class="n"&gt;W_O&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The heads run simultaneously, and each learns to notice a different kind of relationship. One head might track grammatical alignment. Another might track semantic similarity. Another might track coreference: which pronoun refers to which noun.&lt;/p&gt;

&lt;p&gt;The standard Transformer uses 8 heads. Each head operates on a smaller slice of the representation (dimension &lt;code&gt;d/8&lt;/code&gt; instead of &lt;code&gt;d&lt;/code&gt;), so the total computation is the same as a single large attention — but the network gets 8 different perspectives instead of one.&lt;/p&gt;
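
&lt;p&gt;The split-compute-concatenate pattern, sketched with random projections standing in for learned ones:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def softmax_rows(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(3)
n, d, h = 6, 64, 8                 # sequence length, model dim, heads
d_k = d // h                       # each head works in a d/8 slice
X = rng.normal(size=(n, d))

heads = []
for i in range(h):
    Wq, Wk, Wv = (rng.normal(size=(d, d_k)) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # project into this head's slice
    A = softmax_rows(Q @ K.T / np.sqrt(d_k))   # this head's own attention pattern
    heads.append(A @ V)                        # (n, d_k)

W_O = rng.normal(size=(d, d))
out = np.concatenate(heads, axis=-1) @ W_O     # Concat(head_1, ..., head_h) · W_O
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
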




&lt;h2&gt;
  
  
  What Clicked for Me
&lt;/h2&gt;

&lt;p&gt;The compressed summary problem is the bottleneck of trying to hold a whole paragraph in working memory before translating. The word order problem is the mismatch between SOV and SVO that makes literal translation fail. The sequential processing problem is the reason real-time conversation was impossible while I was still translating word by word.&lt;/p&gt;

&lt;p&gt;The shift from "translate word by word" to "think in English" is the shift from RNN to attention. It's not an optimization. It's a different way of processing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Interactive Playground
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;09-attention
streamlit run attention_playground.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/rnilav/perceptrons-to-transformers/tree/main/09-attention" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This playground is different from the previous ones. No training loops, no waiting. Five concept demos that follow the blog post narrative — every slider updates instantly because it's all just matrix math under the hood.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Attention solves the bottleneck. But the architecture we've built so far still has an RNN encoder underneath — it's still sequential at its core.&lt;/p&gt;

&lt;p&gt;Post 10 asks: what if we removed the RNN entirely? What if the whole architecture was just attention, stacked?&lt;/p&gt;

&lt;p&gt;That's the Transformer. Attention without recurrence. Parallel processing of the entire sequence at once. Positional encodings to restore order information. And a feed-forward network to add non-linearity between attention layers.&lt;/p&gt;

&lt;p&gt;It's the architecture behind every modern language model — GPT, BERT, T5, and everything that came after. And it's built entirely from pieces we already understand.&lt;/p&gt;




&lt;h2&gt;
  
  
  Deep Dive
&lt;/h2&gt;

&lt;p&gt;For the full mathematical treatment — dot-product attention, scaled attention, the Q/K/V framework, self-attention, multi-head attention, masking, gradient flow, and worked numerical examples — see &lt;a href="https://github.com/rnilav/perceptrons-to-transformers/tree/main/09-attention/ATTENTION_MATH_DEEP_DIVE.md" rel="noopener noreferrer"&gt;&lt;code&gt;ATTENTION_MATH_DEEP_DIVE.md&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Bahdanau, D., Cho, K., &amp;amp; Bengio, Y.&lt;/strong&gt; (2014). &lt;em&gt;Neural Machine Translation by Jointly Learning to Align and Translate&lt;/em&gt;. ICLR 2015.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Luong, M., Pham, H., &amp;amp; Manning, C. D.&lt;/strong&gt; (2015). &lt;em&gt;Effective Approaches to Attention-based Neural Machine Translation&lt;/em&gt;. EMNLP.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vaswani, A., et al.&lt;/strong&gt; (2017). &lt;em&gt;Attention Is All You Need&lt;/em&gt;. NeurIPS.
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>selfattention</category>
      <category>ai</category>
      <category>transformer</category>
      <category>multiheadattention</category>
    </item>
    <item>
      <title>Understanding Recurrent Neural Networks: From Forgetting to Remembering</title>
      <dc:creator>Nilavukkarasan R</dc:creator>
      <pubDate>Mon, 13 Apr 2026 14:43:48 +0000</pubDate>
      <link>https://dev.to/rnilav/understanding-recurrent-neural-networks-from-forgetting-to-remembering-5f7</link>
      <guid>https://dev.to/rnilav/understanding-recurrent-neural-networks-from-forgetting-to-remembering-5f7</guid>
      <description>&lt;p&gt;&lt;em&gt;"The present contains nothing more than the past, and what is found in the effect was already in the cause."&lt;/em&gt;&lt;br&gt;
— &lt;strong&gt;Henri Bergson&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;Everything We Built Assumed a Snapshot&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Look back at every network we've built so far.&lt;/p&gt;

&lt;p&gt;A perceptron takes a fixed input vector and draws a line. An MLP stacks layers to bend that line into curves. A CNN slides filters across an image to detect spatial patterns. Even with all that sophistication, the convolutions, the pooling, the skip connections, every single one treats the input as a &lt;strong&gt;static snapshot&lt;/strong&gt;. Feed it in, get a prediction out. The order of inputs doesn't matter. There's no before or after.&lt;/p&gt;

&lt;p&gt;That assumption works perfectly for images. A digit is a digit regardless.&lt;/p&gt;

&lt;p&gt;But language isn't a snapshot. Neither is audio, or time series, or any signal where &lt;em&gt;what came before&lt;/em&gt; changes the meaning of &lt;em&gt;what comes after&lt;/em&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"My teacher said I was slow, but &lt;strong&gt;he&lt;/strong&gt; didn't know I was just getting started."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What does "he" refer to? The teacher, obviously. But only because you held "my teacher" in mind while reading the rest. You carried context forward — unconsciously, effortlessly.&lt;/p&gt;

&lt;p&gt;Every architecture we've built so far would fail this. It has no mechanism for carrying anything forward.&lt;/p&gt;

&lt;p&gt;That's the gap RNNs were built to fill.&lt;/p&gt;


&lt;h2&gt;
  
  
  Learning to Read — Letter by Letter
&lt;/h2&gt;

&lt;p&gt;I remember learning to read. Not the fluent reading I do now, the early, effortful kind.&lt;/p&gt;

&lt;p&gt;Each letter had to be identified consciously. Then combined with the next to form a sound. Then sounds stitched into a word. Then words assembled into meaning. It was slow, sequential, and exhausting. And crucially, by the time I reached the end of a long sentence, I'd often forgotten how it started.&lt;/p&gt;

&lt;p&gt;That's a vanilla RNN.&lt;/p&gt;

&lt;p&gt;It processes sequences one step at a time, maintaining a &lt;strong&gt;hidden state&lt;/strong&gt;, a running summary of everything seen so far, and updating it at each step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# At each step t:
hidden(t) = tanh( W_h × hidden(t-1) + W_x × input(t) )
output(t) = W_o × hidden(t)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The hidden state is the memory. It blends the new input with what came before. The same weights are reused at every step: the network doesn't learn separate rules for position 1 vs position 50. One set of weights, applied repeatedly across time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;h(0) ──► h(1) ──► h(2) ──► h(3) ──► ...
  ▲         ▲         ▲         ▲
  │         │         │         │
x(0)      x(1)      x(2)      x(3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Elegant. And it works, for short sequences. Just like the early reader who handles a short word fine but loses the thread of a long sentence.&lt;/p&gt;
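
&lt;p&gt;The update rule above as a runnable sketch. Note the loop: each step needs the previous hidden state, so nothing parallelizes across time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def rnn_forward(xs, W_h, W_x, W_o):
    h = np.zeros(W_h.shape[0])          # h(0): empty memory
    outputs = []
    for x in xs:                        # strictly one step at a time
        h = np.tanh(W_h @ h + W_x @ x)  # blend old memory with new input
        outputs.append(W_o @ h)         # same weights reused at every step
    return np.array(outputs), h

rng = np.random.default_rng(4)
hidden, d_in, d_out, T = 16, 8, 3, 10
xs = rng.normal(size=(T, d_in))
outs, h_final = rnn_forward(xs,
                            rng.normal(size=(hidden, hidden)) * 0.1,
                            rng.normal(size=(hidden, d_in)) * 0.1,
                            rng.normal(size=(d_out, hidden)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
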




&lt;h2&gt;
  
  
  Training It: Backprop Through Time
&lt;/h2&gt;

&lt;p&gt;Training uses the same backpropagation from &lt;a href="https://dev.to/rnilav/3-backpropagation-errors-flow-backward-knowledge-flows-forward-5320"&gt;Post 3&lt;/a&gt; — unrolled across time steps. To compute how much each weight contributed to the final loss, you trace gradients backward through every step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x(0)→[RNN]→h(0)→[RNN]→h(1)→[RNN]→h(2)→[RNN]→h(3)→ Loss
                                                        │
                              gradients flow backward ◄─┘
                              through every time step
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same chain rule. Just applied across time instead of across layers. The depth is now &lt;em&gt;temporal&lt;/em&gt; rather than architectural.&lt;/p&gt;

&lt;p&gt;And here's where the familiar problem returns.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Long Sentence Problem
&lt;/h2&gt;

&lt;p&gt;Remember the vanishing gradient from &lt;a href="https://dev.to/rnilav/understanding-internal-covariate-shift-and-residual-connections-beyond-activation-functions-and-2c8"&gt;Post 7&lt;/a&gt;? Gradients shrink as they travel backward through many layers: multiply enough numbers smaller than 1 together and you get something close to zero.&lt;/p&gt;

&lt;p&gt;The same thing happens here, but across time steps instead of layers.&lt;/p&gt;

&lt;p&gt;At each step backward, the gradient is multiplied by another factor involving the weight matrix &lt;code&gt;W_h&lt;/code&gt; and the tanh derivative. For a sequence of 50 words, that's 50 multiplications. If those factors tend to be smaller than 1, the gradient reaching step 1 is effectively zero.&lt;/p&gt;
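
&lt;p&gt;You can reproduce the effect with nothing but repeated multiplication. Treating each backward step as scaling the gradient by roughly 0.9 (a stand-in for the combined effect of &lt;code&gt;W_h&lt;/code&gt; and the tanh derivative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;grad = 1.0
for step in range(50):        # 50 words = 50 backward multiplications
    grad = grad * 0.9         # each step shrinks the signal a little
print(round(grad, 5))         # about 0.005: almost nothing reaches step 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
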

&lt;p&gt;Like my early reading days: by the end of a long sentence, I'd forgotten how it started.&lt;/p&gt;

&lt;p&gt;In Post 7, skip connections fixed vanishing gradients by adding a direct additive path that bypassed the layers. We need the same idea, but for time.&lt;/p&gt;




&lt;h2&gt;
  
  
  LSTM: Learning to Read Fluently
&lt;/h2&gt;

&lt;p&gt;Think about what changes when reading becomes fluent.&lt;/p&gt;

&lt;p&gt;You stop processing letter by letter. You chunk into words, phrases, meaning. More importantly, you become &lt;em&gt;selective&lt;/em&gt;. You don't hold every word in memory with equal weight. You retain what matters: the subject, the tension, the unresolved question. You discard the filler. And you do this automatically, without thinking.&lt;/p&gt;

&lt;p&gt;That selectivity is exactly what the Long Short-Term Memory network (Hochreiter &amp;amp; Schmidhuber, 1997) introduced.&lt;/p&gt;

&lt;p&gt;An LSTM has two states instead of one: a &lt;strong&gt;hidden state&lt;/strong&gt; &lt;code&gt;h&lt;/code&gt; (what it's currently working with) and a &lt;strong&gt;cell state&lt;/strong&gt; &lt;code&gt;c&lt;/code&gt; (long-term memory). The cell state is the key innovation: it runs through the sequence with only small, controlled modifications. Like the skip connection in ResNets, it's an additive path that lets gradients flow backward without decaying at every step.&lt;/p&gt;

&lt;p&gt;Three &lt;strong&gt;gates&lt;/strong&gt; control what happens to memory at each step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Forget gate:  should I clear out old memory?
              f = sigmoid( W_f × [h(t-1), x(t)] )

Input gate:   is this new input worth remembering?
              i = sigmoid( W_i × [h(t-1), x(t)] )
              candidate = tanh( W_c × [h(t-1), x(t)] )

Output gate:  what should I act on right now?
              o = sigmoid( W_o × [h(t-1), x(t)] )

Update:
  cell:    c(t) = f × c(t-1)  +  i × candidate
  hidden:  h(t) = o × tanh( c(t) )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The sigmoid gates output values between 0 and 1 — soft switches. A forget gate near 1 means "keep everything." Near 0 means "wipe it." The network &lt;em&gt;learns&lt;/em&gt; when to remember and when to forget, based on what the task requires.&lt;/p&gt;

&lt;p&gt;The cell state update, &lt;code&gt;c(t) = f × c(t-1) + i × candidate&lt;/code&gt;, is additive. Old memory plus new information. That additive structure is what saves the gradient. Instead of multiplying through a squashing function at every step, gradients flow backward through the cell state with far less decay.&lt;/p&gt;

&lt;p&gt;Same intuition as the ResNet skip connection. Different problem, same fix.&lt;/p&gt;
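
&lt;p&gt;The gate equations above, as a single-step sketch (weights random here; a trained LSTM learns them):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W_f, W_i, W_c, W_o):
    hx = np.concatenate([h_prev, x])      # [h(t-1), x(t)]
    f = sigmoid(W_f @ hx)                 # forget gate: keep or wipe old memory
    i = sigmoid(W_i @ hx)                 # input gate: worth remembering?
    candidate = np.tanh(W_c @ hx)
    o = sigmoid(W_o @ hx)                 # output gate: what to act on now
    c = f * c_prev + i * candidate        # additive update: the gradient highway
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(5)
d_h, d_x = 8, 4
W = [rng.normal(size=(d_h, d_h + d_x)) for _ in range(4)]
h, c = lstm_step(rng.normal(size=d_x), np.zeros(d_h), np.zeros(d_h), *W)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
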




&lt;h2&gt;
  
  
  GRU: Fluency With Less Overhead
&lt;/h2&gt;

&lt;p&gt;Once reading becomes fluent, you don't consciously run through all three questions at every word. Most decisions are automatic: keep reading, update the picture, move on.&lt;/p&gt;

&lt;p&gt;The Gated Recurrent Unit is that streamlined version. It merges the cell state and hidden state into one, and uses two gates instead of three:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Reset gate:   how much past context to use for the new candidate
              r = sigmoid( W_r × [h(t-1), x(t)] )

Update gate:  how much to blend old state with new candidate
              z = sigmoid( W_z × [h(t-1), x(t)] )

Update:
  candidate: h̃ = tanh( W × [r × h(t-1), x(t)] )
  hidden:    h(t) = (1-z) × h(t-1)  +  z × h̃
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fewer parameters, similar performance. The update gate does double duty, controlling both forgetting and writing in one operation. In practice, LSTMs and GRUs perform comparably. GRUs train faster; LSTMs have slightly more expressive memory. Most practitioners try both.&lt;/p&gt;
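
&lt;p&gt;And the GRU step, with its two gates and single state (again with random stand-in weights):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, W_r, W_z, W):
    hx = np.concatenate([h_prev, x])
    r = sigmoid(W_r @ hx)                          # reset: how much past to use
    z = sigmoid(W_z @ hx)                          # update: blend old vs new
    candidate = np.tanh(W @ np.concatenate([r * h_prev, x]))
    return (1 - z) * h_prev + z * candidate        # one gate does forget + write

rng = np.random.default_rng(6)
d_h, d_x = 8, 4
h = gru_step(rng.normal(size=d_x), np.zeros(d_h),
             rng.normal(size=(d_h, d_h + d_x)),
             rng.normal(size=(d_h, d_h + d_x)),
             rng.normal(size=(d_h, d_h + d_x)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
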




&lt;h2&gt;
  
  
  Layer Normalization: The Normalization That Fits Sequences
&lt;/h2&gt;

&lt;p&gt;In Post 7, batch normalization stabilized deep networks by normalizing across the batch. But RNNs have a problem with batch norm. Sequences have variable lengths, and the hidden state carries information across steps. Normalizing across a batch of sequences at each time step is unstable.&lt;/p&gt;

&lt;p&gt;Layer normalization fixes this by normalizing across the &lt;em&gt;features&lt;/em&gt; of each individual sample, not across the batch. Same idea, different axis. Completely independent of batch size and sequence length.&lt;/p&gt;

&lt;p&gt;This is why layer norm became the standard for all sequence models and why every modern LLM uses it. When we get to Transformers in Post 10, it'll be everywhere.&lt;/p&gt;
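
&lt;p&gt;The whole idea fits in a few lines: normalize each sample over its own feature axis, so batch size and sequence length never enter the computation (the learned gain and bias parameters are omitted for brevity):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)   # per-sample, per-timestep stats
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(7)
h = rng.normal(loc=3.0, scale=10.0, size=(2, 5, 16))  # (batch, time, features)
out = layer_norm(h)
# every (batch, time) position now has mean ~0 and variance ~1 over its features
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
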




&lt;h2&gt;
  
  
  What Clicked for Me
&lt;/h2&gt;

&lt;p&gt;The reading analogy didn't just help me explain RNNs — it helped me understand what the hidden state actually &lt;em&gt;is&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;It's not a recording of the past. It's a compressed summary of the parts of history that seem relevant for predicting what comes next. Just like a fluent reader doesn't remember the exact words from three pages ago, but does remember that the detective is suspicious of the butler.&lt;/p&gt;




&lt;h2&gt;
  
  
  Interactive Playground
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;08-rnn
streamlit run rnn_playground.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/rnilav/perceptrons-to-transformers/tree/main/08-rnn" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Train both models, then pick a sentence length and watch the confidence bars update word by word; you'll see exactly the step where the vanilla RNN changes its mind and the LSTM doesn't.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;RNNs gave networks memory. But they process sequences step by step: slow, sequential, and still limited by how far gradients can travel, even with LSTM.&lt;/p&gt;

&lt;p&gt;There's a deeper problem too. The hidden state has to compress &lt;em&gt;everything&lt;/em&gt; seen so far into a fixed-size vector. For long sequences, that bottleneck loses information no matter how good the gating is.&lt;/p&gt;

&lt;p&gt;Post 9 introduces &lt;strong&gt;Attention Mechanisms&lt;/strong&gt;: a way for the network to directly look back at any part of the input sequence it needs, regardless of distance. No compression bottleneck. No sequential processing. No hoping the gradient survives 100 time steps.&lt;/p&gt;

&lt;p&gt;It's the idea that made RNNs obsolete — and made Transformers possible.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Hochreiter, S., &amp;amp; Schmidhuber, J.&lt;/strong&gt; (1997). &lt;em&gt;Long Short-Term Memory&lt;/em&gt;. Neural Computation, 9(8), 1735–1780.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cho, K., et al.&lt;/strong&gt; (2014). &lt;em&gt;Learning Phrase Representations using RNN Encoder-Decoder&lt;/em&gt;. EMNLP.&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>rnn</category>
      <category>lstm</category>
      <category>ai</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Understanding Internal Covariate Shift and Residual Connections: Beyond Activation Functions and Optimizers</title>
      <dc:creator>Nilavukkarasan R</dc:creator>
      <pubDate>Sat, 11 Apr 2026 14:46:14 +0000</pubDate>
      <link>https://dev.to/rnilav/understanding-internal-covariate-shift-and-residual-connections-beyond-activation-functions-and-2c8</link>
      <guid>https://dev.to/rnilav/understanding-internal-covariate-shift-and-residual-connections-beyond-activation-functions-and-2c8</guid>
      <description>&lt;p&gt;&lt;em&gt;"No man ever steps in the same river twice, for it's not the same river and he's not the same man"&lt;/em&gt; - &lt;strong&gt;Heraclitus&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;When Going Deeper Made Things Worse&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In my &lt;a href="https://dev.to/rnilav/from-generalists-to-specialists-the-cnn-shift-1h1d"&gt;last post&lt;/a&gt;, we built CNNs that could see. Filters learned edges. Pooling built spatial tolerance. Stack enough layers and the network recognizes digits, faces, objects.&lt;/p&gt;

&lt;p&gt;So the obvious next move: go deeper. More layers, more capacity, more power.&lt;/p&gt;

&lt;p&gt;But there is a catch.&lt;/p&gt;

&lt;p&gt;Researchers took a 20-layer network and added 36 more layers. The 56-layer network should have been better. More parameters, more room to learn. Instead, it was &lt;em&gt;worse&lt;/em&gt;. Not just on test data, but on &lt;em&gt;training&lt;/em&gt; data as well.&lt;/p&gt;

&lt;p&gt;That's not overfitting. Overfitting means you're too good on training data. This was the opposite: a bigger network that couldn't even fit the data it was trained on.&lt;/p&gt;

&lt;p&gt;Two things were broken. And fixing them required two elegant ideas.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Noisy Room Problem
&lt;/h2&gt;

&lt;p&gt;Imagine you're at a loud party, trying to follow a conversation. The room is packed, music is blasting, five other conversations are happening around you. Your brain doesn't give up, it does something remarkable. It filters out the noise, locks onto the voice you care about, and normalizes the signal so you can follow along.&lt;/p&gt;

&lt;p&gt;You do this automatically, without thinking. But a neural network? It has no such mechanism.&lt;/p&gt;

&lt;p&gt;Here's what actually happens inside a deep network during training. Each layer transforms its input and passes it to the next. A small shift in one layer's output gets amplified by the next layer, which gets amplified again, and again. After 20 layers, the signal has either exploded into enormous numbers that saturate neurons, or collapsed into near-zero values that carry no information.&lt;/p&gt;

&lt;p&gt;The network is trying to learn in a room that keeps getting louder.&lt;/p&gt;

&lt;p&gt;That's the &lt;strong&gt;internal covariate shift&lt;/strong&gt; problem. The distribution of each layer's input keeps changing as weights update. Every layer is chasing a moving target.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Batch normalization&lt;/strong&gt; is the fix.&lt;/p&gt;




&lt;h2&gt;
  
  
  Batch Normalization: Tuning Out the Noise
&lt;/h2&gt;

&lt;p&gt;Before each layer processes its input, normalize it. Force it to have zero mean and unit variance. Then let the network re-scale with two learned parameters: &lt;code&gt;γ&lt;/code&gt; (gamma) and &lt;code&gt;β&lt;/code&gt; (beta).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# For each mini-batch:
compute mean and variance of the inputs
normalize: x_norm = (x - mean) / sqrt(variance)

# Then re-scale with learned parameters:
output = γ * x_norm + β
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
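&lt;p&gt;The pseudocode maps almost line-for-line to NumPy. A minimal sketch of my own (not from the original batch norm paper; the small epsilon is the standard guard against dividing by zero):&lt;/p&gt;

```python
import numpy as np

# Batch norm forward pass, normalizing each feature across the mini-batch.
def batch_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)                    # statistics over the batch
    var = x.var(axis=0)
    x_norm = (x - mean) / np.sqrt(var + eps)
    return gamma * x_norm + beta             # learned re-scale and shift

np.random.seed(0)
x = np.random.randn(32, 4) * 5 + 3           # noisy activations: mean 3, std 5
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(4), out.std(axis=0).round(4))  # roughly 0 and 1
```

&lt;p&gt;Whatever the input distribution, the output comes back with mean roughly 0 and standard deviation roughly 1, until &lt;code&gt;γ&lt;/code&gt; and &lt;code&gt;β&lt;/code&gt; learn otherwise.&lt;/p&gt;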



&lt;p&gt;The network can undo the normalization if it needs to. &lt;code&gt;γ&lt;/code&gt; and &lt;code&gt;β&lt;/code&gt; are learned. But now every layer starts from a stable, predictable baseline. The moving target stops moving.&lt;/p&gt;

&lt;p&gt;Going back to the party analogy: batch norm is your brain's noise-cancellation. It doesn't remove the signal, it strips out the irrelevant variation so the important information comes through clearly.&lt;/p&gt;

&lt;p&gt;The effect on training is immediate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Without batch norm:
  Layer 5 output:  mean=2.3,  std=4.7
  Layer 10 output: mean=18.4, std=31.2   ← signal exploding
  Layer 20 output: mean=NaN              ← training collapsed

With batch norm:
  Layer 5 output:  mean≈0, std≈1
  Layer 10 output: mean≈0, std≈1
  Layer 20 output: mean≈0, std≈1         ← stable all the way down
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three things happen when you add batch norm:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Activations stay stable — no more explosions or collapses&lt;/li&gt;
&lt;li&gt;You can use much higher learning rates — the stable baseline means bigger steps are safe&lt;/li&gt;
&lt;li&gt;Weight initialization matters less — you no longer need to be as careful about starting values&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;One subtle thing worth knowing: batch norm uses statistics computed across the current mini-batch. At inference time, you might be predicting on a single example, with no batch to compute statistics from. So during training, batch norm accumulates running averages of the mean and variance. At inference, it uses those instead.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Vanishing Gradient: A Deeper Problem
&lt;/h2&gt;

&lt;p&gt;Batch norm stabilizes the forward pass. But there's a second problem, and it lives in the backward pass.&lt;/p&gt;

&lt;p&gt;Backpropagation multiplies derivatives together as it moves backward through the network. Each layer contributes a factor. If those factors are consistently less than 1, which they often are, the gradient shrinks with every layer it passes through.&lt;/p&gt;

&lt;p&gt;By the time it reaches layer 1 of a 50 layer network, the gradient might be effectively zero. The early layers stop learning entirely.&lt;/p&gt;

&lt;p&gt;This is why the 56-layer network performed worse than the 20-layer one. It wasn't a capacity problem. The early layers simply weren't getting any useful gradient signal. They were frozen.&lt;/p&gt;




&lt;h2&gt;
  
  
  Residual Connections: The Shortcut
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;"If I have seen further, it is by standing on the shoulders of giants."&lt;/em&gt;&lt;br&gt;
— &lt;strong&gt;Isaac Newton&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of learning a full transformation, a residual block learns the &lt;em&gt;difference&lt;/em&gt; from identity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Normal layer:
output = transform(x)

# Residual block:
output = transform(x) + x    ← just add the input back
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;+ x&lt;/code&gt; is the skip connection. The input bypasses the learned transformation and gets added back at the output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it changes the chain rule.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In a normal layer, backprop applies the chain rule like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;∂L/∂x = ∂L/∂output × F'(x)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The gradient gets multiplied by &lt;code&gt;F'(x)&lt;/code&gt; at every layer. If that's 0.1, after 50 layers you're multiplying fifty 0.1s together, and the gradient reaches layer 1 as essentially zero.&lt;/p&gt;

&lt;p&gt;With a residual block, &lt;code&gt;output = F(x) + x&lt;/code&gt;, so the chain rule becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;∂L/∂x = ∂L/∂output × (F'(x) + 1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;+ 1&lt;/code&gt; comes from differentiating the skip connection &lt;code&gt;x&lt;/code&gt; with respect to &lt;code&gt;x&lt;/code&gt;; the derivative of a straight passthrough is always 1. Now instead of multiplying fifty 0.1s, you're multiplying fifty 1.1s. The gradient stays alive all the way back to layer 1.&lt;/p&gt;
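&lt;p&gt;You can check that arithmetic directly. The 0.1-per-layer derivative is the same illustrative number as above, not a measured value:&lt;/p&gt;

```python
# Gradient magnitude after 50 layers: plain chain vs. residual chain.
plain_factor = 0.1          # per-layer derivative F'(x), illustrative
residual_factor = 0.1 + 1   # same derivative, plus the skip connection's 1

plain_gradient = plain_factor ** 50       # vanishes: about 1e-50
residual_gradient = residual_factor ** 50 # survives: about 117

print(plain_gradient, residual_gradient)
```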

&lt;p&gt;Before ResNets, the practical limit for trainable networks was around 20 layers. After ResNets, researchers trained a 1,202-layer network. Not because they needed 1,202 layers, but to prove they could.&lt;/p&gt;

&lt;p&gt;That distinction, &lt;strong&gt;capacity vs. trainability&lt;/strong&gt;, is one of the most important ideas in deep learning.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bigger Picture: How Everything Fits Together
&lt;/h2&gt;

&lt;p&gt;At this point in the series, it's worth stepping back. A lot of concepts have been introduced, and it can start to feel like an ever-growing list of tricks. It's not. Each one solved a specific, concrete failure mode:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;What Goes Wrong&lt;/th&gt;
&lt;th&gt;Solution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dying neurons&lt;/td&gt;
&lt;td&gt;Neurons output zero forever, stop learning&lt;/td&gt;
&lt;td&gt;ReLU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vanishing gradients&lt;/td&gt;
&lt;td&gt;Gradients too small to reach early layers&lt;/td&gt;
&lt;td&gt;ReLU + careful init&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Exploding gradients&lt;/td&gt;
&lt;td&gt;Gradients too large, training diverges&lt;/td&gt;
&lt;td&gt;Gradient clipping, Adam&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Slow convergence&lt;/td&gt;
&lt;td&gt;Hard to find a good learning rate&lt;/td&gt;
&lt;td&gt;Adam optimizer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Internal covariate shift&lt;/td&gt;
&lt;td&gt;Each layer's inputs keep shifting distribution&lt;/td&gt;
&lt;td&gt;Batch Norm&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Degradation problem&lt;/td&gt;
&lt;td&gt;Deeper networks perform &lt;em&gt;worse&lt;/em&gt; than shallow ones&lt;/td&gt;
&lt;td&gt;Skip connections (ResNet)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These aren't redundant; they're complementary. ReLU keeps neurons alive. Adam navigates the loss landscape efficiently. Batch norm stabilizes the signal between layers. Skip connections ensure gradients reach the beginning. Each one patches a gap the others can't cover.&lt;/p&gt;

&lt;p&gt;Together, they form the foundation that makes modern deep networks trainable. You'll see all of them again — in every architecture from here on out.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;We can now train deep networks. But depth alone doesn't solve every problem.&lt;/p&gt;

&lt;p&gt;Images have spatial structure, CNNs exploit that. But what about sequences? Text, audio, time series, data where &lt;em&gt;order&lt;/em&gt; matters and context can span hundreds of steps?&lt;/p&gt;

&lt;p&gt;Post 8 introduces &lt;strong&gt;Recurrent Neural Networks&lt;/strong&gt;: architectures with memory, where the output at each step depends on everything that came before. And you'll see immediately why the vanishing gradient problem, which we just solved for depth, comes back with a vengeance for long sequences.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Ioffe, S., &amp;amp; Szegedy, C.&lt;/strong&gt; (2015). &lt;em&gt;Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift&lt;/em&gt;. ICML.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;He, K., Zhang, X., Ren, S., &amp;amp; Sun, J.&lt;/strong&gt; (2016). &lt;em&gt;Deep Residual Learning for Image Recognition&lt;/em&gt;. CVPR.&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>deeplearning</category>
      <category>residualconnections</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>From Generalists to Specialists: The CNN Shift</title>
      <dc:creator>Nilavukkarasan R</dc:creator>
      <pubDate>Tue, 31 Mar 2026 13:28:17 +0000</pubDate>
      <link>https://dev.to/rnilav/from-generalists-to-specialists-the-cnn-shift-1h1d</link>
      <guid>https://dev.to/rnilav/from-generalists-to-specialists-the-cnn-shift-1h1d</guid>
      <description>&lt;p&gt;&lt;em&gt;"Vision is the art of seeing what is invisible to others."&lt;/em&gt; &lt;strong&gt;Jonathan Swift&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;When Regularization Wasn't Enough&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In my &lt;a href="https://dev.to/rnilav/regularization-fighting-overfitting-2pj"&gt;last post&lt;/a&gt;, I showed you how dropout and weight decay stop a network from memorizing training data. We trained on MNIST, closed the generalization gap, and got a network that actually works in the real world.&lt;/p&gt;

&lt;p&gt;It felt like we'd finally solved it.&lt;/p&gt;

&lt;p&gt;But then I tried it on a real photograph. Not 28×28 grayscale digits. A 224×224 color image.&lt;/p&gt;

&lt;p&gt;The math was brutal.&lt;/p&gt;

&lt;p&gt;224 × 224 × 3 = 150,528 inputs&lt;br&gt;
Connect those to 1,000 neurons: 150 million parameters&lt;br&gt;
Just for the first layer. Before learning anything useful.&lt;/p&gt;

&lt;p&gt;We needed a different idea entirely. And it came, as the best ideas often do, from biology.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Visual Cortex Moment
&lt;/h2&gt;

&lt;p&gt;In 1959, neuroscientists David Hubel and Torsten Wiesel did something remarkable. They inserted electrodes into a cat's visual cortex and projected shapes onto a screen. They were trying to find what made individual neurons fire.&lt;/p&gt;

&lt;p&gt;Most shapes did nothing. Then, almost by accident, they moved a glass slide and cast a thin line of light across the screen.&lt;/p&gt;

&lt;p&gt;One neuron went wild.&lt;/p&gt;

&lt;p&gt;Not all neurons. Not the whole cortex. &lt;em&gt;One specific neuron&lt;/em&gt;, responding to &lt;em&gt;one specific edge&lt;/em&gt;, at &lt;em&gt;one specific orientation&lt;/em&gt;, in &lt;em&gt;one specific region&lt;/em&gt; of the visual field.&lt;/p&gt;

&lt;p&gt;They kept experimenting. Different neurons responded to different orientations: horizontal edges, vertical edges, diagonal edges. Each neuron had a small &lt;strong&gt;receptive field&lt;/strong&gt;: a limited patch of the visual field it paid attention to. Neurons in later areas responded to more complex patterns: corners, curves, eventually whole shapes.&lt;/p&gt;

&lt;p&gt;The visual cortex isn't a fully connected blob. It's a hierarchy of local detectors, each building on the one before it.&lt;/p&gt;

&lt;p&gt;That insight, decades later, became the blueprint for CNNs.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Problem with Fully Connected Networks on Images
&lt;/h2&gt;

&lt;p&gt;Here's what a fully-connected network does to an image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FC Network sees an image as:
┌─────────────────────────────────────────────────────┐
│  pixel_1, pixel_2, pixel_3, ..., pixel_150528       │
│  (all spatial structure destroyed, every pixel      │
│   connected to every neuron with separate weights)  │
└─────────────────────────────────────────────────────┘

Every neuron: "I must learn about ALL 150,528 pixels equally."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is wasteful in two ways. First, a pixel in the top-left corner has almost nothing to do with a pixel in the bottom-right corner, but the network treats them as equally related. Second, if a cat's ear appears in the top-left of one image and the top-right of another, the network needs &lt;em&gt;separate neurons&lt;/em&gt; to detect it in each location.&lt;/p&gt;

&lt;p&gt;A CNN thinks differently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CNN sees an image as:
┌──────────────────────────────────────────────────────┐
│  ┌───┐  ┌───┐  ┌───┐                                │
│  │ F │  │ F │  │ F │  ← same filter, sliding across │
│  └───┘  └───┘  └───┘    "Is there an edge here?"    │
│     ↘      ↓      ↙                                  │
│      [feature map]                                   │
└──────────────────────────────────────────────────────┘

Every filter: "I detect ONE pattern, ANYWHERE in the image."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same filter, applied everywhere. One set of weights to detect vertical edges across the entire image. This is &lt;strong&gt;weight sharing&lt;/strong&gt;, and it's the core reason CNNs work.&lt;/p&gt;

&lt;p&gt;The parameter comparison is stark:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;FC Network&lt;/th&gt;
&lt;th&gt;CNN (typical)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Input&lt;/td&gt;
&lt;td&gt;224×224×3 = 150K&lt;/td&gt;
&lt;td&gt;224×224×3 = 150K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;First layer params&lt;/td&gt;
&lt;td&gt;150K × 1000 = &lt;strong&gt;150M&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;96 filters × 11×11×3 = &lt;strong&gt;35K&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Assumption&lt;/td&gt;
&lt;td&gt;all pixels equally related&lt;/td&gt;
&lt;td&gt;nearby pixels are related&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For the full mathematical breakdown of parameter counts across FC, LeNet, AlexNet, VGG, and ResNet, see &lt;a href="https://github.com/rnilav/perceptrons-to-transformers/blob/main/06-cnn/CNN_ARCHITECTURE_DEEP_DIVE.md" rel="noopener noreferrer"&gt;&lt;code&gt;CNN_ARCHITECTURE_DEEP_DIVE.md&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
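&lt;p&gt;The first-layer numbers in the table take three lines to verify (biases omitted for simplicity):&lt;/p&gt;

```python
# First-layer parameter counts from the table above (biases omitted).
inputs = 224 * 224 * 3             # 150,528 input values

fc_params = inputs * 1000          # every pixel wired to every neuron
cnn_params = 96 * (11 * 11 * 3)    # 96 shared filters, each 11x11x3

print(fc_params, cnn_params)       # roughly 150M vs. 35K
```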

&lt;h2&gt;
  
  
  Key Concepts, Grounded in Biology
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Local Receptive Field
&lt;/h3&gt;

&lt;p&gt;In the visual cortex, each neuron only responds to a small patch of the visual field. It's &lt;em&gt;local&lt;/em&gt;. It doesn't see the whole image—just its neighborhood.&lt;/p&gt;

&lt;p&gt;In a CNN, each filter application does the same thing. A 3×3 filter looks at a 3×3 patch of the image. That's its receptive field. It asks: "Does my pattern exist in this small region?"&lt;/p&gt;

&lt;p&gt;As you go deeper in the network, receptive fields grow. A neuron in layer 3 has seen the outputs of layer 2, which saw layer 1, which saw the raw pixels. So it effectively "sees" a larger region, just like neurons deeper in the visual cortex respond to larger, more complex patterns.&lt;/p&gt;

&lt;h3&gt;
  
  
  Learnable Filter (The Edge Detector)
&lt;/h3&gt;

&lt;p&gt;A filter is just a small grid of numbers—say 3×3. During training, backpropagation (the same algorithm from &lt;a href="https://dev.to/rnilav/3-backpropagation-errors-flow-backward-knowledge-flows-forward-5320"&gt;Post 3&lt;/a&gt;) adjusts these numbers until the filter detects something useful. One filter might learn to detect vertical edges. Another learns horizontal edges. Another learns a specific texture.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Learned vertical edge filter:    Learned horizontal edge filter:
[-1  0  1]                        [-1 -2 -1]
[-2  0  2]                        [ 0  0  0]
[-1  0  1]                        [ 1  2  1]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
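&lt;p&gt;You can watch the vertical-edge filter "fire" with a few lines of NumPy. The 3×3 patch below is made up for illustration: dark pixels on the left, bright on the right, i.e. a vertical edge:&lt;/p&gt;

```python
import numpy as np

# The learned vertical-edge filter from above.
vertical_filter = np.array([[-1, 0, 1],
                            [-2, 0, 2],
                            [-1, 0, 1]])

# Tiny image patch: dark on the left, bright on the right (a vertical edge).
patch = np.array([[0, 0, 9],
                  [0, 0, 9],
                  [0, 0, 9]])

response = (vertical_filter * patch).sum()  # one step of the convolution
print(response)  # strong positive response: the pattern is present
```

&lt;p&gt;On a flat patch (all zeros or all nines), the same sum comes out to exactly zero: the filter only responds when its pattern is present.&lt;/p&gt;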



&lt;p&gt;The network learns these automatically; no human designs them. That's the power.&lt;/p&gt;

&lt;h3&gt;
  
  
  Feature Map
&lt;/h3&gt;

&lt;p&gt;When you slide a filter across an image, you get a &lt;strong&gt;feature map&lt;/strong&gt;: a 2D grid showing &lt;em&gt;where&lt;/em&gt; that filter's pattern was detected and &lt;em&gt;how strongly&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Think of it like a heat map. Apply a vertical-edge filter to a photo of a face, and the feature map lights up along the sides of the nose, the edges of the eyes, the outline of the jaw. Dark where there are no vertical edges. Bright where there are.&lt;/p&gt;

&lt;p&gt;Stack 32 filters and you get 32 feature maps, 32 different "views" of the same image, each highlighting a different pattern.&lt;/p&gt;

&lt;h3&gt;
  
  
  Padding
&lt;/h3&gt;

&lt;p&gt;Here's a practical problem: if you slide a 3×3 filter across a 5×5 image, the filter can't be centered on the edge pixels. You lose a border of information, and the output shrinks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Padding&lt;/strong&gt; adds a ring of zeros around the image before applying the filter. This lets the filter visit every pixel, including the edges, and preserves the spatial dimensions.&lt;/p&gt;

&lt;p&gt;It's like giving your peripheral vision a bit of extra context at the boundary of your visual field.&lt;/p&gt;
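&lt;p&gt;The shrinkage is easy to quantify. With an n-wide input, a k-wide filter, p pixels of zero padding per side, and stride 1, the output width is n + 2p - k + 1. This is the standard convolution arithmetic, not something specific to this post:&lt;/p&gt;

```python
def conv_output_size(n, k, p):
    """Output width for an n-wide input, a k-wide filter, p pixels of
    zero padding on each side, and stride 1."""
    return n + 2 * p - k + 1

print(conv_output_size(5, 3, 0))  # 3: the 5x5 image shrinks without padding
print(conv_output_size(5, 3, 1))  # 5: one ring of zeros preserves the size
```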

&lt;h3&gt;
  
  
  Pooling (Spatial Summarization)
&lt;/h3&gt;

&lt;p&gt;After detecting features, we don't need to track &lt;em&gt;exactly&lt;/em&gt; where they appeared—just roughly where. This is &lt;strong&gt;pooling&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Max pooling takes a small window (say 2×2) and keeps only the strongest activation. It's like asking: "Did this feature appear &lt;em&gt;anywhere&lt;/em&gt; in this region?" The exact location doesn't matter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Feature map:          After 2×2 max pooling:
[1  3  2  4]          [6  4]
[5  6  1  2]    →     [8  7]
[3  8  4  7]
[1  2  6  3]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
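&lt;p&gt;The same pooling step in NumPy, reusing the feature map above. The reshape-and-swap trick is just one idiomatic way to carve the grid into 2×2 blocks:&lt;/p&gt;

```python
import numpy as np

feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 1, 2],
                        [3, 8, 4, 7],
                        [1, 2, 6, 3]])

# 2x2 max pooling: split into 2x2 blocks, keep the max of each block.
blocks = feature_map.reshape(2, 2, 2, 2).swapaxes(1, 2)
pooled = blocks.max(axis=(2, 3))
print(pooled)  # [[6 4] [8 7]], matching the example above
```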



&lt;p&gt;This does three things: reduces the spatial size (fewer parameters downstream), makes the network tolerant to small shifts in position (translation invariance), and forces the network to summarize rather than memorize exact locations.&lt;/p&gt;

&lt;p&gt;Your visual cortex doesn't care if a face is shifted 5 pixels left. You still recognize it. Pooling builds that tolerance in.&lt;/p&gt;

&lt;h3&gt;
  
  
  ReLU — Still Here
&lt;/h3&gt;

&lt;p&gt;The activation function hasn't changed. After each convolution, we still apply ReLU (from &lt;a href="https://dev.to/rnilav/understanding-ai-from-first-principles-multi-layer-perceptrons-and-the-hidden-layer-breakthrough-44pl"&gt;Post 2&lt;/a&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;convolution_result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Negative activations become zero. Positive ones pass through. Same reason as before: it introduces non-linearity and avoids the vanishing gradient problem we discussed in &lt;a href="https://dev.to/rnilav/3-backpropagation-errors-flow-backward-knowledge-flows-forward-5320"&gt;Post 3&lt;/a&gt;. The building blocks carry forward.&lt;/p&gt;

&lt;h2&gt;
  
  
  Putting It Together: The CNN Pipeline
&lt;/h2&gt;

&lt;p&gt;A CNN is just these ideas stacked in sequence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input Image
    ↓
[Conv → ReLU] × N     ← learn local patterns (like V1 cortex)
    ↓
[Pooling]             ← summarize, reduce size
    ↓
[Conv → ReLU] × N     ← learn combinations of patterns (like V2/V4)
    ↓
[Pooling]
    ↓
Flatten
    ↓
[Fully Connected]     ← classify based on learned features
    ↓
Softmax → Prediction
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
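&lt;p&gt;For a 28×28 MNIST digit, the shapes through a small pipeline like this work out as follows. The filter counts (8 and 16) are illustrative choices, and the convolutions are assumed to be padded so they preserve spatial size:&lt;/p&gt;

```python
# Dimension flow for a 28x28 grayscale input (illustrative filter counts).
steps = [
    ("input",        (28, 28, 1)),
    ("conv1 + relu", (28, 28, 8)),    # 8 padded filters: size preserved
    ("maxpool 2x2",  (14, 14, 8)),    # spatial size halved
    ("conv2 + relu", (14, 14, 16)),   # 16 padded filters
    ("maxpool 2x2",  (7, 7, 16)),
]
for name, shape in steps:
    print(name, shape)

flat = 7 * 7 * 16                     # flattened for the FC classifier
print("flatten:", flat)
```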



&lt;p&gt;The early layers learn edges and textures. Middle layers combine those into shapes. Deep layers combine shapes into objects. It's the same hierarchy Hubel and Wiesel found in the cat's brain, just learned from data instead of evolution.&lt;/p&gt;

&lt;p&gt;Training still uses backpropagation and Adam. The same gradient flow, the same weight updates. CNNs didn't replace what we built, they extended it with a smarter architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  CNN Architectures: A Brief Lineage
&lt;/h2&gt;

&lt;p&gt;The core ideas — local receptive fields, learnable filters, pooling — stayed constant. What changed over the years was how deep, how wide, and how trainable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LeNet (1998)&lt;/strong&gt; was the proof of concept. Two conv layers, two pooling layers, three fully-connected layers. Trained on handwritten digits. ~60K parameters. It worked, but the hardware and data of the time couldn't push it further.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AlexNet (2012)&lt;/strong&gt; was the moment everything changed. Five conv layers, three FC layers, ~60M parameters — trained on GPUs for the first time. It won ImageNet by a margin that shocked the field. The key additions: ReLU activations (faster training), dropout (regularization), and data augmentation. The deep learning era started here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VGG (2014)&lt;/strong&gt; asked: what if we just go deeper, but keep it simple? Only 3×3 filters, stacked in blocks. 16–19 layers, ~138M parameters. It showed that depth itself was the driver of accuracy, but the three large FC layers at the end were a parameter bottleneck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ResNet (2015)&lt;/strong&gt; solved the problem VGG exposed: beyond ~20 layers, adding more layers actually &lt;em&gt;hurts&lt;/em&gt; accuracy. Not from overfitting — from gradients vanishing before they reach early layers. ResNet's fix was elegant: skip connections that let gradients bypass layers entirely. Suddenly 50-layer, 152-layer networks were trainable. ResNet-50 achieves better accuracy than VGG-16 with 5× fewer parameters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LeNet (1998)    →  AlexNet (2012)   →  VGG (2014)      →  ResNet (2015)
~60K params        ~60M params         ~138M params        ~25M params
proof of concept   GPU + ReLU          depth matters       skip connections
                   revolution          but costly          solve depth
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  CNNs Are Built for Images — Not Text
&lt;/h2&gt;

&lt;p&gt;This is worth saying explicitly, because it's easy to assume CNNs are a general-purpose upgrade to fully-connected networks. They're not.&lt;/p&gt;

&lt;p&gt;CNNs work because of two assumptions baked into the architecture:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Nearby inputs are related&lt;/strong&gt; — a pixel's neighbors matter more than distant pixels&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The same pattern can appear anywhere&lt;/strong&gt; — weight sharing makes sense because an edge in the top-left is the same edge as one in the bottom-right&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Images satisfy both assumptions perfectly. So do audio spectrograms and video frames.&lt;/p&gt;

&lt;p&gt;Text doesn't. The word "not" next to "good" completely changes the meaning — but "not" next to "bad" means something different again. Context in language isn't local and positional the way it is in images. The same word means different things in different positions. Weight sharing across positions doesn't make the same kind of sense.&lt;/p&gt;

&lt;p&gt;That's why text needs different architectures — RNNs that process sequences step by step, and eventually Transformers that learn which words to pay attention to regardless of distance. CNNs are a specialized tool, and their specialization is spatial data.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Clicked for Me
&lt;/h2&gt;

&lt;p&gt;I kept thinking about Hubel and Wiesel's cat. One neuron, one edge orientation. Seemed pointless.&lt;/p&gt;

&lt;p&gt;Then it clicked: a neuron that responds to everything is useless. A neuron that responds to exactly one pattern, reliably, anywhere—that's signal. That's a building block.&lt;/p&gt;

&lt;p&gt;Fully-connected networks try to be generalists from pixel one. CNNs start with specialists and compose up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Interactive Playground
&lt;/h2&gt;

&lt;p&gt;I've built an interactive &lt;a href="https://github.com/rnilav/perceptrons-to-transformers/tree/main/06-cnn" rel="noopener noreferrer"&gt;playground&lt;/a&gt; where you can watch a CNN in action. It has two tabs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tab 1: FC Network vs CNN&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Both models are trained from scratch on the same 1,000-sample MNIST subset using pure NumPy and Adam — the same setup from Post 4. You can adjust the FC hidden layer size, the number of CNN filters, the number of epochs (up to 20), and the batch size, then hit &lt;strong&gt;Train both models&lt;/strong&gt; to run real training.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tab 2: CNN Layer Explorer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Pick any of the digits 0, 1, 6, or 8 and explore three views:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;What each filter detects&lt;/strong&gt; — shows the raw filter weights (3×3 grid) alongside the response heatmap on your chosen digit. Bright yellow means "this pattern is strongly present here." You can see how a vertical-edge filter lights up along strokes, while a blob filter responds to filled regions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Layer-by-layer pipeline&lt;/strong&gt; — traces your digit through the full network: Conv1+ReLU → MaxPool → Conv2+ReLU → MaxPool → Flatten → FC → Softmax. Each stage shows the actual feature map image with a caption explaining what happened and why. A dimension table below tracks the shape at every step.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MaxPool zoom-in&lt;/strong&gt; — takes a 4×4 patch from the conv output and shows the actual numerical values, then shows the 2×2 result after pooling. You can see exactly which values survived and why — the maximum in each 2×2 block wins.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What This Unlocked
&lt;/h2&gt;

&lt;p&gt;Before CNNs, computer vision meant handcrafting features. Researchers spent years designing SIFT descriptors, HOG features, edge detectors, all by hand. Then AlexNet (2012) showed that a CNN trained on enough data could learn better features automatically, and it wasn't close. The error rate dropped from 26% to 15% in one year.&lt;/p&gt;

&lt;p&gt;That was the moment the field changed.&lt;/p&gt;

&lt;p&gt;Every modern vision system—object detection, medical imaging, autonomous driving, face recognition—runs on some variant of this idea. Local receptive fields. Learnable filters. Hierarchical features. Weight sharing.&lt;/p&gt;

&lt;p&gt;All of it traced back to a cat, an electrode, and a sliding glass slide in 1959.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;We can now train CNNs that learn to see. But there's a catch: go deep enough, and training breaks down. Gradients vanish. Accuracy plateaus. Adding more layers actually &lt;em&gt;hurts&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Post 7 covers the two innovations that fixed this: &lt;strong&gt;Batch Normalization&lt;/strong&gt; (stabilize the activations between layers) and &lt;strong&gt;Residual Connections&lt;/strong&gt; (let gradients skip layers entirely). Together, they made 50-layer, 100-layer networks trainable—and unlocked the modern era of deep learning.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;LeCun et al.&lt;/strong&gt; (1998). &lt;em&gt;Gradient-Based Learning Applied to Document Recognition&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Krizhevsky et al.&lt;/strong&gt; (2012). &lt;em&gt;ImageNet Classification with Deep Convolutional Neural Networks&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=zfiSAzpy9NM&amp;amp;t=628s" rel="noopener noreferrer"&gt;Simple explanation of Convolutional Neural Network&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>convolutionalnetworks</category>
      <category>computervision</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Regularization: Fighting Overfitting</title>
      <dc:creator>Nilavukkarasan R</dc:creator>
      <pubDate>Fri, 13 Mar 2026 14:35:02 +0000</pubDate>
      <link>https://dev.to/rnilav/regularization-fighting-overfitting-2pj</link>
      <guid>https://dev.to/rnilav/regularization-fighting-overfitting-2pj</guid>
      <description>&lt;p&gt;&lt;em&gt;"Learning without thought is labor lost"&lt;/em&gt;  &lt;strong&gt;--Confucius&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;When Your Network Becomes a Memorizer&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In my &lt;a href="https://dev.to/rnilav/neural-network-optimizers-from-baby-steps-to-intelligent-learning-44po"&gt;last post&lt;/a&gt;, I showed you how to train a network on MNIST. Adam optimizer, mini-batches, 100 epochs. Training accuracy climbed over 99%.&lt;/p&gt;

&lt;p&gt;It felt like we'd solved it.&lt;/p&gt;

&lt;p&gt;But here's what I didn't show you: what happens if you keep training.&lt;/p&gt;

&lt;p&gt;Running the network for 200 epochs drives training accuracy over 99%. But test accuracy tells a different story: it peaks around 97% near epoch 50, then slowly drops as training continues.&lt;/p&gt;

&lt;p&gt;Why? Imagine studying for an exam by memorizing that "Question 5 is always B" instead of understanding why B is correct. You'd ace the practice test but fail when questions are reordered or rephrased. Neural networks do the same thing. They memorize training data so well they hit 99% accuracy, yet struggle with new examples because they never learned the underlying patterns.&lt;/p&gt;

&lt;p&gt;The network isn't learning anymore. It's memorizing. That's overfitting, and it's one of the core challenges in training neural networks.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Generalization Gap
&lt;/h2&gt;

&lt;p&gt;Here's what happens when you train a network without any safeguards:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Epoch 1:   Train Acc: 87.3%  Test Acc: 86.9%  Gap: 0.4%
Epoch 10:  Train Acc: 97.2%  Test Acc: 96.8%  Gap: 0.4%
Epoch 50:  Train Acc: 99.1%  Test Acc: 97.2%  Gap: 1.9%
Epoch 100: Train Acc: 99.7%  Test Acc: 96.8%  Gap: 2.9%
Epoch 200: Train Acc: 99.95% Test Acc: 96.1%  Gap: 3.85%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See that gap? It starts small while the network is learning generalizable patterns. But as training continues, the gap widens. The network is still improving on training data, but test accuracy stalls and then drops.&lt;/p&gt;

&lt;p&gt;This gap is the &lt;strong&gt;generalization gap&lt;/strong&gt;. It's the difference between what your network learned and what it actually understands.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Does This Happen?
&lt;/h2&gt;

&lt;p&gt;A network has three things working against it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Capacity:&lt;/strong&gt; Your network has 100,000 weights. Your training set has 60,000 examples. Mathematically, the network has enough capacity to memorize every single example.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Time:&lt;/strong&gt; Every epoch, the network sees the same training examples again. It gets more chances to memorize. After 200 epochs, it's seen each example 200 times. Memorization becomes easier than learning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. No penalty for complexity:&lt;/strong&gt; The network doesn't care if it uses simple patterns or complex ones. Both reduce training loss equally. So it drifts toward complexity, and complexity is what overfits.&lt;/p&gt;

&lt;p&gt;The solution? Force the network to generalize instead of memorize.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Regularization Toolkit: A Big Picture View
&lt;/h2&gt;

&lt;p&gt;Before we dive into specific techniques, let's zoom out and see the full landscape of solutions to overfitting. There are actually many ways to fix it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Reduce Model Capacity&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use smaller networks (fewer neurons, fewer layers)&lt;/li&gt;
&lt;li&gt;Prune weights after training&lt;/li&gt;
&lt;li&gt;Use simpler architectures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The idea: if your network is smaller, it simply can't memorize as much. But this is a blunt instrument: you might lose the ability to learn complex patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Increase Training Data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Collect more real data&lt;/li&gt;
&lt;li&gt;Use data augmentation (rotations, crops, noise for images; paraphrasing for text)&lt;/li&gt;
&lt;li&gt;Use synthetic data generation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The idea: with more diverse examples, memorization becomes harder. The network has to learn generalizable patterns to cover all the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Stop Training Early&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitor test accuracy during training&lt;/li&gt;
&lt;li&gt;Stop when test accuracy starts declining&lt;/li&gt;
&lt;li&gt;This is called "early stopping"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The idea: overfitting gets worse over time. Stop before it happens.&lt;/p&gt;
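&lt;p&gt;As a rough sketch (the &lt;code&gt;early_stop&lt;/code&gt; helper and the accuracy curve below are invented for illustration), patience-based early stopping might look like this:&lt;/p&gt;

```python
# Hypothetical patience-based early stopping. The accuracy curve is
# synthetic, standing in for test accuracy measured after each epoch.
def early_stop(accuracies, patience=3):
    """Return (best_epoch, best_acc), halting the scan once `patience`
    consecutive epochs fail to improve on the best accuracy so far."""
    best_acc, best_epoch, bad = 0.0, 0, 0
    for epoch, acc in enumerate(accuracies):
        if acc > best_acc:
            best_acc, best_epoch, bad = acc, epoch, 0
        else:
            bad += 1
            if bad == patience:
                break  # stop: no improvement for `patience` epochs
    return best_epoch, best_acc

# Synthetic curve: improves, peaks at epoch 3, then slowly declines.
curve = [0.90, 0.94, 0.96, 0.972, 0.970, 0.969, 0.968, 0.967]
epoch, acc = early_stop(curve)
print(epoch, acc)  # prints: 3 0.972
```

&lt;p&gt;In practice you would checkpoint the model weights at each new best epoch and restore that checkpoint when training stops.&lt;/p&gt;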

&lt;p&gt;&lt;strong&gt;4. Ensemble Methods&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Train multiple networks and average their predictions&lt;/li&gt;
&lt;li&gt;Use techniques like boosting or bagging&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The idea: multiple imperfect models often generalize better than one perfect model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Architectural Innovations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Skip connections (ResNets) allow training deeper networks that generalize better&lt;/li&gt;
&lt;li&gt;Attention mechanisms focus on relevant parts of the input&lt;/li&gt;
&lt;li&gt;Inductive biases (like convolutions for images) reduce the effective capacity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The idea: design the architecture to match the problem structure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Our Focus: Regularization Techniques
&lt;/h2&gt;

&lt;p&gt;In this post, we're going to deep-dive into two regularization techniques - &lt;strong&gt;dropout&lt;/strong&gt; and &lt;strong&gt;weight decay&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Why these two? Because they represent two different philosophies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dropout&lt;/strong&gt; prevents co-adaptation (neurons learning to work together in ways that only make sense for training data)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weight decay&lt;/strong&gt; encourages simplicity (smaller weights = simpler decision boundaries)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together, they form a powerful one-two punch against overfitting. And understanding them deeply will help you understand other regularization techniques.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dropout: An Ensemble of Smaller Networks
&lt;/h2&gt;

&lt;p&gt;Here's an idea: what if we randomly disabled neurons during training?&lt;/p&gt;

&lt;p&gt;Not permanently. Just during each forward pass.&lt;/p&gt;

&lt;p&gt;This sounds like sabotage. Why would we intentionally break our network?&lt;/p&gt;

&lt;p&gt;Because it forces the network to learn redundant representations.&lt;/p&gt;

&lt;p&gt;Think of it like this: imagine you're building a team to solve problems. If you always have the same 10 people, they'll specialize and depend on each other. Person A always handles data, Person B always handles logic. If Person A gets sick, the team fails.&lt;/p&gt;

&lt;p&gt;But if you randomly remove people from the team each day, they can't specialize. Everyone learns to do everything. The team becomes robust.&lt;/p&gt;

&lt;p&gt;That's dropout.&lt;/p&gt;

&lt;p&gt;When we randomly disable neurons during training, the network can’t rely on specific neurons to make a prediction. Instead, it must learn multiple pathways to the same answer. This redundancy prevents co-adaptation, i.e. neurons relying on each other in ways that only work for the training data.&lt;/p&gt;

&lt;p&gt;If a layer has n neurons, there are 2ⁿ possible subnetworks, depending on which neurons are active or dropped. During training, each mini-batch randomly samples one of these subnetworks.&lt;/p&gt;

&lt;p&gt;Imagine a hidden layer with 256 neurons and 50% dropout.&lt;/p&gt;

&lt;p&gt;Each mini-batch activates a different random subset of neurons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mini-batch 1 trains with neurons {1, 3, 5, 7, ..., 255}&lt;/li&gt;
&lt;li&gt;Mini-batch 2 trains with neurons {2, 4, 6, 8, ..., 256}&lt;/li&gt;
&lt;li&gt;Mini-batch 3 trains with neurons {1, 2, 4, 7, ..., 254}&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each subset forms a slightly different network. Over training, the model samples from an enormous space of subnetworks and learns weights that perform well across many of them.&lt;/p&gt;

&lt;p&gt;Modern implementations use inverted dropout.&lt;/p&gt;

&lt;p&gt;During training, we randomly drop neurons and scale the activations so that their expected value stays the same. At test time, we simply run the full network without any dropout.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Training: randomly disable neurons
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;training&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;mask&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;binomial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;dropout_rate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;X_dropped&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;mask&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;dropout_rate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Scale to maintain expected value
&lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;X_dropped&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;  &lt;span class="c1"&gt;# No dropout at test time
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;strong&gt;scaling factor&lt;/strong&gt; 1 / (1 - dropout_rate) is crucial.&lt;/p&gt;

&lt;p&gt;Without it, the magnitude of activations during training would be smaller than during inference, causing inconsistent predictions.&lt;/p&gt;

&lt;p&gt;By scaling during training, the expected activation remains the same whether dropout is active or not.  &lt;/p&gt;
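&lt;p&gt;This scaling claim is easy to verify numerically. The sketch below (with made-up activations) drops 50% of values, rescales, and compares means:&lt;/p&gt;

```python
import numpy as np

# Numerical check of inverted dropout: after dropping and rescaling,
# the mean activation stays close to the no-dropout mean.
rng = np.random.default_rng(0)
X = rng.random((10000, 256))               # fake activations in [0, 1)
dropout_rate = 0.5

mask = rng.binomial(1, 1 - dropout_rate, X.shape)
X_dropped = X * mask / (1 - dropout_rate)  # inverted-dropout scaling

print(X.mean(), X_dropped.mean())          # the two means are nearly equal
```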

&lt;p&gt;Dropout forces the network to &lt;strong&gt;learn robust representations&lt;/strong&gt;. No neuron can assume another neuron will always be present, so useful features must be distributed across the network.&lt;/p&gt;

&lt;p&gt;The result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Less memorization&lt;/li&gt;
&lt;li&gt;Better generalization&lt;/li&gt;
&lt;li&gt;A model that performs well on unseen data, not just the training set.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Weight Decay: Occam's Razor for Neural Networks
&lt;/h2&gt;

&lt;p&gt;Dropout prevents co-adaptation. But there's another approach: what if we penalize large weights?&lt;/p&gt;

&lt;p&gt;This idea is called weight decay, also known as L2 regularization.&lt;/p&gt;

&lt;p&gt;The idea is simple: add a penalty to the loss function proportional to the magnitude of weights.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Total Loss = Cross-Entropy Loss + λ * (sum of squared weights)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The λ (lambda) parameter controls how much we penalize large weights. Higher λ means stronger penalty.&lt;/p&gt;

&lt;p&gt;Why does this work? Large weights tend to make a network very sensitive to small changes in the input, producing sharp decision boundaries that can fit noise in the training data. Smaller weights produce smoother functions that change more gradually.&lt;/p&gt;

&lt;p&gt;Consider two networks that both achieve 95% training accuracy:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Network A:&lt;/strong&gt; Has weights like [0.1, 0.2, -0.15, 0.08, ...]. Small adjustments to inputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Network B:&lt;/strong&gt; Has weights like [5.2, -8.7, 12.3, -6.1, ...]. Large adjustments to inputs.&lt;/p&gt;

&lt;p&gt;Both fit the training data. But Network B's large weights create sharper, more extreme responses to inputs, which increases the risk of overfitting. Weight decay prefers Network A because its weights have smaller magnitude.&lt;/p&gt;
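&lt;p&gt;You can see the sensitivity difference with a toy calculation. The inputs and weights below are made up to mirror Network A and Network B:&lt;/p&gt;

```python
import numpy as np

# Illustrative only: how a small-weight vs large-weight linear unit
# reacts to the same tiny perturbation of the input.
x = np.array([1.0, 2.0, -1.0, 0.5])
noise = 0.01 * np.ones_like(x)                 # tiny input perturbation

w_small = np.array([0.1, 0.2, -0.15, 0.08])    # "Network A" style weights
w_large = np.array([5.2, -8.7, 12.3, -6.1])    # "Network B" style weights

shift_small = abs(w_small @ (x + noise) - w_small @ x)
shift_large = abs(w_large @ (x + noise) - w_large @ x)
print(shift_small, shift_large)  # the large-weight unit shifts far more
```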

&lt;p&gt;Without weight decay, the optimizer only cares about minimizing training loss.&lt;/p&gt;

&lt;p&gt;With weight decay, the optimizer faces a trade-off:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduce training loss&lt;/li&gt;
&lt;li&gt;Keep weights small&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;During &lt;strong&gt;backpropagation&lt;/strong&gt;, the regularization term adds an extra component to the gradient:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;gradient = original_gradient + λ * w&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This gently pulls weights toward zero during training. The result is not that the network learns less, but that it learns more restrained solutions that tend to generalize better.&lt;/p&gt;
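&lt;p&gt;In code, one SGD step with weight decay might look like this (the weights, gradients, and hyperparameters below are made-up numbers; &lt;code&gt;lam&lt;/code&gt; is the λ above):&lt;/p&gt;

```python
import numpy as np

# Sketch of one SGD update with L2 weight decay folded into the gradient.
w = np.array([0.5, -1.2, 3.0])             # current weights (invented)
grad_loss = np.array([0.1, -0.2, 0.05])    # gradient of the data loss alone
lam, lr = 0.01, 0.1                        # decay strength, learning rate

grad = grad_loss + lam * w                 # the extra λ·w pull toward zero
w = w - lr * grad
print(w)  # every weight ends slightly closer to zero than with plain SGD
```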

&lt;p&gt;Weight decay doesn't restrict learning.&lt;br&gt;
It simply nudges the model toward simpler explanations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Tuning Process:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Train without regularization. Measure the train-to-test gap.&lt;/li&gt;
&lt;li&gt;If gap &amp;lt; 1%, you're good. No regularization needed.&lt;/li&gt;
&lt;li&gt;If gap is 1-3%, add dropout=0.2. Retrain and measure.&lt;/li&gt;
&lt;li&gt;If gap is still &amp;gt; 2%, add weight_decay=0.0001. Retrain and measure.&lt;/li&gt;
&lt;li&gt;If gap is still &amp;gt; 2%, increase dropout to 0.3 or 0.4.&lt;/li&gt;
&lt;li&gt;If gap is &amp;gt; 3%, you might need more data or a smaller network.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key is experimentation. Every dataset is different. What works for MNIST might not work for ImageNet. Start conservative, measure, adjust.&lt;/p&gt;

&lt;h2&gt;
  
  
  Interactive Exploration
&lt;/h2&gt;

&lt;p&gt;This is where the &lt;a href="https://github.com/rnilav/perceptrons-to-transformers/tree/main/05-regularization" rel="noopener noreferrer"&gt;playground&lt;/a&gt; comes in. I've built a Streamlit app that lets you experiment with these techniques in real time. It has two parts to explore: overfitting and weight distribution.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Clicked for Me
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Regularization is a trade-off.&lt;/strong&gt; You're not trying to achieve 100% training accuracy. You're trying to maximize test accuracy. I used to think "higher training accuracy = better network." Now I know better.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dropout is elegant.&lt;/strong&gt; It's not a hack. It's a principled way to train an ensemble of networks simultaneously.&lt;/p&gt;

&lt;p&gt;Each breakthrough solved a problem the previous one created:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/rnilav/understanding-perceptrons-the-foundation-of-modern-ai-2g04"&gt;Perceptrons&lt;/a&gt; couldn't learn complex patterns → Multi-layer networks&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/rnilav/understanding-ai-from-first-principles-multi-layer-perceptrons-and-the-hidden-layer-breakthrough-44pl"&gt;Multi-layer networks&lt;/a&gt; couldn't learn efficiently → Backpropagation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/rnilav/3-backpropagation-errors-flow-backward-knowledge-flows-forward-5320"&gt;Backpropagation&lt;/a&gt; was slow on large datasets → Optimization (mini-batches, Adam)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/rnilav/neural-network-optimizers-from-baby-steps-to-intelligent-learning-44po"&gt;Optimization&lt;/a&gt; worked but overfitted → Regularization (dropout, weight decay)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We're building a complete system. Each piece is necessary.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;We can now train networks that actually work in the real world. They learn patterns, not memorize data. They generalize to new examples.&lt;/p&gt;

&lt;p&gt;For now, we're still limited to fully connected networks on small images. MNIST is 28×28. Real images are 1000×1000 or larger. And fully connected networks don't scale: a 1000×1000 image would require 1 million input neurons.&lt;/p&gt;

&lt;p&gt;We need a different architecture. One designed specifically for images.&lt;/p&gt;

&lt;p&gt;Enter convolutional networks.&lt;/p&gt;

&lt;p&gt;The jump from fully connected to convolutional is as big as the jump from perceptrons to multi-layer networks. It's a completely different way of thinking about neural networks.&lt;/p&gt;

&lt;p&gt;And it's the next breakthrough in our journey.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., &amp;amp; Salakhutdinov, R. R.&lt;/strong&gt; (2012). &lt;em&gt;Improving neural networks by preventing co-adaptation of feature detectors&lt;/em&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ng, A. Y.&lt;/strong&gt; (2004). &lt;em&gt;Feature selection, L1 vs. L2 regularization, and rotational invariance&lt;/em&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #MachineLearning #AI #DeepLearning #Regularization #Dropout #WeightDecay #Overfitting #MNIST #NeuralNetworks&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/rnilav/perceptrons-to-transformers" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>regularization</category>
      <category>weightdecay</category>
    </item>
    <item>
      <title>Neural Network Optimizers: From Baby Steps to Intelligent Learning</title>
      <dc:creator>Nilavukkarasan R</dc:creator>
      <pubDate>Wed, 04 Mar 2026 15:06:50 +0000</pubDate>
      <link>https://dev.to/rnilav/neural-network-optimizers-from-baby-steps-to-intelligent-learning-44po</link>
      <guid>https://dev.to/rnilav/neural-network-optimizers-from-baby-steps-to-intelligent-learning-44po</guid>
      <description>&lt;p&gt;&lt;em&gt;Adapt what is useful, reject what is useless, and add what is specifically your own.&lt;/em&gt; &lt;br&gt;
-- &lt;strong&gt;Bruce Lee&lt;/strong&gt;&lt;/p&gt;



&lt;p&gt;In my &lt;a href="https://dev.to/rnilav/understanding-ai-from-first-principles-backpropagation-the-algorithm-that-learns-3k8l"&gt;last post&lt;/a&gt;, I showed you how backpropagation could learn the weights for XOR automatically. No more hand-crafting. No more trial and error. Just set a learning rate, run the algorithm, and watch the loss curve drop.&lt;/p&gt;

&lt;p&gt;It felt like magic. Almost too easy.&lt;/p&gt;

&lt;p&gt;But here's what I glossed over: XOR has just 4 training examples. With 4 examples, you compute the gradient using all of them at once. Every weight update sees the complete picture.&lt;/p&gt;

&lt;p&gt;But XOR is a toy problem. Let me tell you about a real dataset.&lt;/p&gt;


&lt;h2&gt;
  
  
  MNIST: The "Hello World" of Deep Learning
&lt;/h2&gt;

&lt;p&gt;MNIST is a collection of 70,000 handwritten digit images—60,000 for training, 10,000 for testing. Each image is 28×28 grayscale pixels.&lt;/p&gt;

&lt;p&gt;The task: look at an image and predict which digit (0-9) it represents.&lt;/p&gt;

&lt;p&gt;Trivial for humans. Genuinely hard for 1990s computers. It became the standard benchmark for machine learning.&lt;/p&gt;

&lt;p&gt;Each image has 784 pixels (28×28). To classify them, we need 784 inputs, a hidden layer (say, 128 neurons), and 10 outputs. That's roughly 100,000 weights to learn.&lt;/p&gt;

&lt;p&gt;Here's the problem: computing the gradient using all 60,000 examples—like we did with XOR—requires 60,000 forward and backward passes per weight update. On my laptop, that's 30 seconds per epoch.&lt;/p&gt;

&lt;p&gt;Training for 100 epochs? 50 minutes total.&lt;/p&gt;

&lt;p&gt;Not terrible. But remember, this is just MNIST.&lt;/p&gt;

&lt;p&gt;Now consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPT-2&lt;/strong&gt; trained on &lt;strong&gt;40 billion tokens&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-3&lt;/strong&gt; trained on &lt;strong&gt;300 billion tokens&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-4&lt;/strong&gt;? &lt;strong&gt;Trillions of tokens&lt;/strong&gt;, hundreds of billions of weights&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If full-batch gradient descent takes 50 minutes for MNIST, modern LLMs would take thousands of years.&lt;/p&gt;

&lt;p&gt;This is the &lt;strong&gt;"Scale Wall."&lt;/strong&gt; Nobody trains this way anymore.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Leap from Toy to Real
&lt;/h2&gt;

&lt;p&gt;The jump from XOR to MNIST isn't just more data—it's a fundamental shift in thinking. At scale, learning can't be perfect. It has to be incremental, approximate, adaptive. Just like human learning.&lt;/p&gt;

&lt;p&gt;This is where optimizers enter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mini-Batches&lt;/strong&gt;: Learning from Subsets. The insight that changed everything - you don't need all 60,000 examples to know which direction to move your weights.&lt;/p&gt;

&lt;p&gt;Think about cooking. You don't taste every grain of rice to know if you need more salt. You taste a spoonful; a small sample tells you enough.&lt;/p&gt;

&lt;p&gt;That's mini-batch stochastic gradient descent. Instead of using all 60,000 examples at once, you compute each update from a small random subset.&lt;/p&gt;

&lt;p&gt;The algorithm is simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for each epoch:
    shuffle training data
    divide into mini-batches
    for each mini-batch:
        forward pass (compute predictions)
        compute loss (average over batch)
        backward pass (compute gradients)
        update weights
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One complete pass through all data is an &lt;strong&gt;epoch&lt;/strong&gt;. With 60,000 examples and batch size 64, that's 937 mini-batches per epoch.&lt;/p&gt;

&lt;p&gt;The magic? Each mini-batch gradient is noisy—not the exact gradient from all 60,000 examples. But it points roughly in the right direction. And it's fast.&lt;/p&gt;
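&lt;p&gt;The loop above is easy to make concrete. In this sketch the data is random and &lt;code&gt;forward_backward&lt;/code&gt; is a dummy stand-in for a real model, so only the batching logic is real:&lt;/p&gt;

```python
import numpy as np

# Skeleton of one epoch of mini-batch SGD. The dataset is fake (16 features
# instead of 784 to keep the demo light) and the gradient is a placeholder.
def forward_backward(X_batch, y_batch, weights):
    return np.zeros_like(weights)          # dummy gradient, correct shape

X = np.random.rand(60000, 16)
y = np.random.randint(0, 10, 60000)
weights = np.zeros(16)
batch_size, lr = 64, 0.01

n_batches = 0
idx = np.random.permutation(len(X))        # shuffle before the epoch
for start in range(0, len(X), batch_size):
    batch = idx[start:start + batch_size]
    grad = forward_backward(X[batch], y[batch], weights)
    weights = weights - lr * grad          # one update per mini-batch
    n_batches += 1
print(n_batches)  # prints: 938 (937 full batches plus one partial of 32)
```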

&lt;p&gt;On my laptop:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full-batch: 30 seconds per update&lt;/li&gt;
&lt;li&gt;Mini-batch (size 64): 0.03 seconds per update&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's &lt;strong&gt;1000× faster&lt;/strong&gt;. Training for 100 epochs drops from 50 minutes to 5 minutes.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Three Key Concepts
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Mini-batch:&lt;/strong&gt; A small subset of training data for one gradient update. Common sizes: 16, 32, 64, 128. Smaller batches are noisier but faster. Larger batches are more accurate but slower.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Epoch:&lt;/strong&gt; One complete pass through your training dataset. With 60,000 examples and batch size 64: one epoch = 937 updates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stochastic:&lt;/strong&gt; Means "random." We shuffle data before each epoch, so mini-batches differ every time. This randomness helps—it prevents the network from memorising example order.&lt;/p&gt;




&lt;h1&gt;
  
  
  Optimizers as Human Learners
&lt;/h1&gt;

&lt;p&gt;Just like human learners, no single optimizer is perfect. Each optimizer is a different learning strategy, a different way to navigate the loss landscape. &lt;/p&gt;

&lt;p&gt;I’ve put together a simple summary of three optimizers — &lt;strong&gt;SGD, Momentum,&lt;/strong&gt; and &lt;strong&gt;Adam&lt;/strong&gt; — and how they approach learning. Do take a look at &lt;a href="https://github.com/rnilav/perceptrons-to-transformers/blob/main/04-optimization/OPTIMIZERS_SIMPLIFIED.md" rel="noopener noreferrer"&gt;OPTIMIZERS_SIMPLIFIED&lt;/a&gt; if you’re interested.&lt;/p&gt;

&lt;p&gt;There are many more optimizers in the wild: &lt;strong&gt;AdaGrad&lt;/strong&gt; (adapts per-parameter but burns out on long training runs), &lt;strong&gt;RMSProp&lt;/strong&gt; (fixes AdaGrad's aggressive decay), &lt;strong&gt;AdaDelta&lt;/strong&gt; (removes the need for a learning rate), &lt;strong&gt;NAdam&lt;/strong&gt; (Nesterov-accelerated Adam), &lt;strong&gt;L-BFGS&lt;/strong&gt; (second-order method for smaller datasets), and newer variants like &lt;strong&gt;AdamW&lt;/strong&gt; (Adam with weight decay done right), &lt;strong&gt;RAdam&lt;/strong&gt; (rectified Adam with warmup), and &lt;strong&gt;Lookahead&lt;/strong&gt; (maintains fast and slow weights).&lt;/p&gt;

&lt;p&gt;I'll cover these in future posts when the context is right, showing you not just &lt;em&gt;what&lt;/em&gt; they do, but &lt;em&gt;when&lt;/em&gt; and &lt;em&gt;why&lt;/em&gt; to use them.&lt;/p&gt;

&lt;p&gt;For now: pick your learner, start training.&lt;/p&gt;

&lt;h3&gt;
  
  
  Time to Experiment
&lt;/h3&gt;

&lt;p&gt;Let's see this in action! I've built a &lt;a href="https://github.com/rnilav/perceptrons-to-transformers/tree/main/04-optimization" rel="noopener noreferrer"&gt;neural network playground&lt;/a&gt; on MNIST with three optimizers— Batch SGD, Momentum, and Adam. Experiment with each and watch how they differ in training time and convergence speed.&lt;/p&gt;

&lt;p&gt;Sample screenshot from the playground:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1cysxo55tdld429ju9nf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1cysxo55tdld429ju9nf.png" alt="Playground_Optimizer" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What Clicked for Me
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Scale changes everything.&lt;/strong&gt; Full-batch works for 4 examples, collapses at 60,000. Mini-batching is survival, not cleverness.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adaptive learning makes sense.&lt;/strong&gt; Not all weights should move equally. Adam adjusts per-parameter instead of treating everything the same.&lt;/p&gt;

&lt;p&gt;And the progression is elegant:&lt;br&gt;
&lt;a href="https://dev.to/rnilav/understanding-perceptrons-the-foundation-of-modern-ai-2g04"&gt;Perceptron&lt;/a&gt; → &lt;a href="https://dev.to/rnilav/understanding-ai-from-first-principles-multi-layer-perceptrons-and-the-hidden-layer-breakthrough-44pl"&gt;multi-layer networks&lt;/a&gt; → &lt;a href="https://dev.to/rnilav/3-backpropagation-errors-flow-backward-knowledge-flows-forward-5320"&gt;backpropagation&lt;/a&gt; → &lt;a href="https://dev.to/rnilav/neural-network-optimizers-from-baby-steps-to-intelligent-learning-44po"&gt;scalable optimizers&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Each breakthrough made the next one possible.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;We can now train on real data. Backprop computes gradients, Adam updates weights. 99% accuracy on MNIST in seconds.&lt;/p&gt;

&lt;p&gt;Everything works. Until it doesn't.&lt;/p&gt;

&lt;p&gt;Train longer: training accuracy climbs to 99.8%, but test accuracy stalls and drops. The model isn't learning—it's memorising.&lt;/p&gt;

&lt;p&gt;Next: why overfitting happens and how dropout and weight decay force networks to generalise instead of memorise.&lt;/p&gt;

&lt;p&gt;Training a network is easy. Making it work in the real world? That's the challenge.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rumelhart, D. E., Hinton, G. E., &amp;amp; Williams, R. J.&lt;/strong&gt; (1986). &lt;em&gt;Learning representations by back-propagating errors&lt;/em&gt;. Nature, 323(6088), 533-536.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tieleman, T., &amp;amp; Hinton, G.&lt;/strong&gt; (2012). &lt;em&gt;Lecture 6.5—RMSProp: Divide the gradient by a running average of its recent magnitude&lt;/em&gt;. COURSERA: Neural Networks for Machine Learning.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Kingma, D. P., &amp;amp; Ba, J.&lt;/strong&gt; (2014). &lt;em&gt;Adam: A Method for Stochastic Optimization&lt;/em&gt;. arXiv preprint arXiv:1412.6980.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #MachineLearning #AI #DeepLearning #Optimization #SGD #Adam #Momentum #MNIST #NeuralNetworks&lt;/p&gt;

</description>
      <category>ai</category>
      <category>neuralnetworks</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Backpropagation: Errors Flow Backward, Knowledge Flows Forward</title>
      <dc:creator>Nilavukkarasan R</dc:creator>
      <pubDate>Thu, 19 Feb 2026 14:37:13 +0000</pubDate>
      <link>https://dev.to/rnilav/3-backpropagation-errors-flow-backward-knowledge-flows-forward-5320</link>
      <guid>https://dev.to/rnilav/3-backpropagation-errors-flow-backward-knowledge-flows-forward-5320</guid>
      <description>&lt;p&gt;&lt;em&gt;"The backpropagation algorithm was a key historical step in demonstrating that deep neural networks could be trained effectively."&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;-- &lt;strong&gt;Geoffrey Hinton&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Weight of the Problem&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In my &lt;a href="https://dev.to/rnilav/understanding-ai-from-first-principles-multi-layer-perceptrons-and-the-hidden-layer-breakthrough-44pl"&gt;last post&lt;/a&gt;, I showed you how a multi-layer perceptron could solve XOR—something a single perceptron couldn't do. The network had 2 inputs, 2 hidden neurons, and 1 output. It worked beautifully.&lt;/p&gt;

&lt;p&gt;But here's the catch: I hand-crafted those weights.&lt;/p&gt;

&lt;p&gt;I sat there, adjusting numbers, testing combinations, until I found values that worked. For a tiny 2-2-1 network with 9 weights total, it took me hours of trial and error—and probably three cups of coffee I shouldn't have had.&lt;/p&gt;

&lt;p&gt;Now imagine GPT-4. It has 1.76 trillion parameters.&lt;/p&gt;

&lt;p&gt;If I spent one second per weight, it would take me 55,000 years to hand-craft GPT-4's weights. And that's assuming I got each one right on the first try (spoiler: I wouldn't).&lt;/p&gt;

&lt;p&gt;This is the problem that haunted neural networks in the 1980s. We knew multi-layer networks could solve complex problems. We just didn't know how to train them.&lt;/p&gt;

&lt;p&gt;Then, in the mid-1980s, came backpropagation. And here's the magic: it doesn't require you to know the right weights ahead of time. It learns them automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Breakthrough: Learning from Mistakes&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Here's the beautiful insight: &lt;strong&gt;networks can learn from their mistakes. And that changes everything.&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Think about learning to throw darts. Your first throw misses the bullseye by a foot. You don't randomly try a completely different throw. You adjust—a little less force, slightly different angle. You use the error (how far you missed) to guide your correction.&lt;/p&gt;

&lt;p&gt;That's exactly what backpropagation does.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The process is simple:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Forward pass&lt;/strong&gt;: Make a prediction with current weights&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Calculate error&lt;/strong&gt;: How wrong was the prediction?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backward pass&lt;/strong&gt;: Figure out which weights caused the error&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Update weights&lt;/strong&gt;: Adjust them to reduce the error&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repeat&lt;/strong&gt;: Do this thousands of times&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The magic is in step 3—figuring out which weights to blame. That's where the "backpropagation" name comes from: we propagate the error backward through the network.&lt;/p&gt;
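
&lt;p&gt;&lt;strong&gt;A sketch in code:&lt;/strong&gt; the five steps map onto a short plain-Python loop. This is my own minimal illustration, not the playground implementation from the repository; it trains a sigmoid 2-2-1 network on XOR with squared-error loss, and the learning rate, seed, and epoch count are arbitrary choices.&lt;/p&gt;

```python
import math, random

random.seed(123)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# XOR: inputs and targets
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

# 2-2-1 network, randomly initialised (6 weights + 3 biases)
w1 = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(2)]
b1 = [0.0, 0.0]
w2 = [random.uniform(-1, 1), random.uniform(-1, 1)]
b2 = 0.0
lr = 0.5

def forward(x):
    h = [sigmoid(x[0] * w1[j][0] + x[1] * w1[j][1] + b1[j]) for j in range(2)]
    return h, sigmoid(h[0] * w2[0] + h[1] * w2[1] + b2)

def total_loss():
    return sum(0.5 * (forward(x)[1] - y) ** 2 for x, y in data)

loss_before = total_loss()

for epoch in range(10000):
    for x, y in data:
        h, y_hat = forward(x)                  # 1. forward pass
        err = y_hat - y                        # 2. how wrong was the prediction?
        d2 = err * y_hat * (1 - y_hat)         # 3. backward pass: output delta,
        d1 = [d2 * w2[j] * h[j] * (1 - h[j])   #    then blame flows back to the
              for j in range(2)]               #    hidden layer via the chain rule
        for j in range(2):                     # 4. update every weight downhill
            w2[j] -= lr * d2 * h[j]
            w1[j][0] -= lr * d1[j] * x[0]
            w1[j][1] -= lr * d1[j] * x[1]
            b1[j] -= lr * d1[j]
        b2 -= lr * d2                          # 5. ...and repeat
loss_after = total_loss()
print(round(loss_before, 3), round(loss_after, 3))
```

&lt;p&gt;From this toy loop to the largest frameworks, the cycle is identical; automatic differentiation just mechanises step 3.&lt;/p&gt;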

&lt;h2&gt;
  
  
  &lt;strong&gt;The Chain Rule: Error Flows Backward&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Backpropagation is often described as "just the chain rule from calculus." And it is! But let me make it concrete.&lt;/p&gt;

&lt;p&gt;Imagine you're hiking and you want to go downhill (minimize loss). You're standing on a slope, and you need to know which direction is down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gradient descent&lt;/strong&gt; is the strategy: always step in the direction that goes downhill.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backpropagation&lt;/strong&gt; is how you figure out which direction that is for every weight in your network.&lt;/p&gt;

&lt;p&gt;Here's the flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Forward Pass (Making Predictions):
Input → Hidden Layer → Output → Loss
  x   →      h       →   ŷ    →  L

Backward Pass (Computing Gradients):
Loss → Output Error → Hidden Error → Weight Updates
  L  →      δ₂      →      δ₁      →   Δw
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The error at the output layer is easy to compute: &lt;code&gt;prediction - target&lt;/code&gt;. But how do we know how much each hidden neuron contributed to that error?&lt;/p&gt;

&lt;p&gt;That's where the chain rule comes in. The error flows backward through the network, multiplied by weights and activation derivatives at each step. Each weight gets a gradient that tells it: "If you increase by a tiny amount, the loss will change by this much."&lt;/p&gt;
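
&lt;p&gt;That last sentence is just the definition of a partial derivative, and you can check it numerically: nudge a weight by a tiny amount and re-measure the loss. A minimal sketch with a made-up one-weight model (the model, input, and target are purely illustrative):&lt;/p&gt;

```python
def loss(w):
    # Toy one-weight model: prediction = w * x, squared error against a target
    x, target = 2.0, 3.0
    return 0.5 * (w * x - target) ** 2

def grad(w):
    # Chain rule by hand: dL/dw = (prediction - target) * x
    x, target = 2.0, 3.0
    return (w * x - target) * x

# Numerical check: increase/decrease w by a tiny eps and watch the loss change
w, eps = 1.0, 1e-6
analytic = grad(w)
numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)
print(analytic, numeric)  # the two values should agree to several decimal places
```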

&lt;p&gt;&lt;strong&gt;For the mathematically curious:&lt;/strong&gt; I've written a detailed walkthrough with concrete numerical examples in &lt;a href="https://github.com/rnilav/perceptrons-to-transformers/blob/main/03-backpropagation/BACKPROPAGATION_CALCULUS.md" rel="noopener noreferrer"&gt;&lt;code&gt;BACKPROPAGATION_CALCULUS.md&lt;/code&gt;&lt;/a&gt;. It shows the full chain rule derivation with a 2-2-1 network solving XOR, complete with actual numbers flowing through each calculation.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Learning Rate: The Step Size&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Once we know which direction to adjust each weight, we need to decide how big a step to take. That's the &lt;strong&gt;learning rate&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Think of it like adjusting the volume on a stereo:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Too high&lt;/strong&gt; (learning rate = 1.0): You overshoot. The volume jumps from 2 to 10, then back to 1, then to 8. You never settle on the right level.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Too low&lt;/strong&gt; (learning rate = 0.01): You're turning the knob so slowly it takes forever to reach the right volume.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Just right&lt;/strong&gt; (learning rate = 0.3): You make steady progress toward the perfect volume.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the playground, you can experiment with different learning rates and watch what happens. Too high and the loss bounces around. Too low and training crawls. Just right and you see that beautiful downward curve.&lt;/p&gt;
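
&lt;p&gt;All three regimes are easy to reproduce with gradient descent on the simplest possible loss surface, the one-dimensional bowl f(w) = w², whose gradient is 2w. (The bowl and the specific rates are illustrative choices; real loss surfaces are far messier.)&lt;/p&gt;

```python
def descend(lr, steps=20):
    # Gradient descent on f(w) = w**2; the gradient at w is 2*w
    w = 5.0
    for _ in range(steps):
        w -= lr * 2 * w
    return w

print(descend(1.1))   # too high: every step overshoots the minimum, w blows up
print(descend(0.01))  # too low: after 20 steps we've barely moved from 5.0
print(descend(0.3))   # just right: essentially at the minimum, w = 0
```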

&lt;p&gt;&lt;strong&gt;For practical tips:&lt;/strong&gt; Check out &lt;a href="https://github.com/rnilav/perceptrons-to-transformers/blob/main/03-backpropagation/HYPERPARAMETER_INSIGHTS.md" rel="noopener noreferrer"&gt;&lt;code&gt;HYPERPARAMETER_INSIGHTS.md&lt;/code&gt;&lt;/a&gt; for a deep dive into learning rates, architecture choices, and why some random seeds get stuck in local minima.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What Clicked for Me&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;After implementing backpropagation and watching it train on XOR, here's what became clear:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The loss curve tells the story.&lt;/strong&gt; In &lt;a href="https://dev.to/rnilav/understanding-ai-from-first-principles-multi-layer-perceptrons-and-the-hidden-layer-breakthrough-44pl"&gt;Post 2&lt;/a&gt;, I hand-crafted weights and got 100% accuracy immediately. With backpropagation, I watched the loss start high (the network is guessing randomly) and gradually decrease as it learned. That curve going down? That's learning happening in real-time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Initialization matters.&lt;/strong&gt; I tried different random seeds and got wildly different results. Some converged to 100% accuracy in 2000 epochs. Others got stuck at 75% accuracy forever. The starting point matters—it's like starting a hike from different locations on a mountain. Some paths lead to the summit, others to local valleys.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's the same algorithm everywhere.&lt;/strong&gt; Whether it's XOR with 9 weights or GPT-4 with 1.76 trillion parameters, the algorithm is identical: forward pass, compute loss, backward pass, update weights. The scale changes, but the principle doesn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automatic beats manual.&lt;/strong&gt; Hand-crafting weights for XOR took me hours. Backpropagation learned them in seconds. For anything beyond toy problems, automatic learning isn't just better—it's the only option.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Watch It Learn: The Interactive Playground&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I've built an interactive playground where you can watch backpropagation in action. It has two tabs, each showing a different aspect of learning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub Repository:&lt;/strong&gt; &lt;a href="https://github.com/rnilav/perceptrons-to-transformers/tree/main/03-backpropagation" rel="noopener noreferrer"&gt;perceptrons-to-transformers - 03-backpropagation&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Tab 1: Training Visualization
&lt;/h3&gt;

&lt;p&gt;Watch the network learn XOR from scratch. You'll see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Loss curve&lt;/strong&gt; decreasing over epochs (learning in action!)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decision boundary&lt;/strong&gt; evolving from random to correct&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Final accuracy&lt;/strong&gt; reaching 100% (when it works!)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu21iid3vweedfv4xj6u8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu21iid3vweedfv4xj6u8.png" alt="Training loss over time" width="800" height="485"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Try this:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Train with learning rate 0.3 and seed 123 (recommended) - watch it converge smoothly&lt;/li&gt;
&lt;li&gt;Try seed 42 with 2-2-1 architecture - watch it get stuck at 75% accuracy (local minimum!)&lt;/li&gt;
&lt;li&gt;Switch to 2-4-1 architecture - notice how it's more robust to bad initialization&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Tab 2: Gradient Flow Visualization
&lt;/h3&gt;

&lt;p&gt;See the backward pass in action. This tab shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Forward pass&lt;/strong&gt; step-by-step (input → hidden → output)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backward pass&lt;/strong&gt; step-by-step (error flowing backward)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gradient magnitudes&lt;/strong&gt; at each layer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weight updates&lt;/strong&gt; before and after one training step&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F740p36pc1zv794ui6h7q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F740p36pc1zv794ui6h7q.png" alt="Gradient magnitudes" width="800" height="498"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is where the "backpropagation" name becomes concrete. You literally see the error propagating backward through the network, computing gradients for each weight.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Try this:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Select different XOR test cases and watch how gradients change&lt;/li&gt;
&lt;li&gt;Notice how gradients get smaller in earlier layers (vanishing gradient effect)&lt;/li&gt;
&lt;li&gt;Compare gradient magnitudes with different learning rates&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Running the Playground
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone the repository&lt;/span&gt;
git clone https://github.com/rnilav/perceptrons-to-transformers.git
&lt;span class="nb"&gt;cd &lt;/span&gt;perceptrons-to-transformers/03-backpropagation

&lt;span class="c"&gt;# Install dependencies (if needed)&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; ../requirements.txt

&lt;span class="c"&gt;# Run the playground&lt;/span&gt;
streamlit run backprop_playground.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then open your browser and explore both tabs. The playground is designed to make the abstract concrete—you can see learning happen, watch gradients flow, and understand why backpropagation works.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What This Unlocked&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When Rumelhart, Hinton, and Williams published their backpropagation paper in 1986, it changed everything.&lt;/p&gt;

&lt;p&gt;Before backpropagation, neural networks were theoretical curiosities. We knew multi-layer networks could solve complex problems, but we couldn't train them. It was like having a Ferrari with no key.&lt;/p&gt;

&lt;p&gt;After backpropagation, neural networks became practical. Suddenly, we could:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Train networks with multiple hidden layers&lt;/li&gt;
&lt;li&gt;Learn from large datasets automatically&lt;/li&gt;
&lt;li&gt;Solve problems that were previously impossible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The progression is beautiful:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1958&lt;/strong&gt;: Perceptron learns linear boundaries&lt;br&gt;
&lt;strong&gt;1969&lt;/strong&gt;: Minsky proves perceptrons can't solve XOR (AI winter begins)&lt;br&gt;
&lt;strong&gt;1986&lt;/strong&gt;: Backpropagation enables training multi-layer networks (AI winter thaws)&lt;br&gt;
&lt;strong&gt;2012&lt;/strong&gt;: Deep learning revolution (ImageNet breakthrough)&lt;br&gt;
&lt;strong&gt;2017&lt;/strong&gt;: Transformer architecture (foundation for GPT)&lt;br&gt;
&lt;strong&gt;2022&lt;/strong&gt;: ChatGPT launches, and the LLM explosion follows&lt;/p&gt;

&lt;p&gt;Every single one of these breakthroughs builds on backpropagation. GPT-4 is trained using backpropagation. DALL-E is trained using backpropagation. Every modern neural network you've ever used learned its weights through backpropagation.&lt;/p&gt;

&lt;p&gt;The algorithm that learns 9 weights for XOR is the same algorithm that learns 1.76 trillion parameters for GPT-4. The scale changed, but the principle didn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What's Next&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We can now train neural networks automatically. But training them well—at scale, reliably, without overfitting—is a different challenge. In the next post, we'll see how modern optimisation algorithms solve this puzzle.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rumelhart, D. E., Hinton, G. E., &amp;amp; Williams, R. J.&lt;/strong&gt; (1986). &lt;em&gt;Learning representations by back-propagating errors&lt;/em&gt;. Nature, 323(6088), 533-536.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Nielsen, M.&lt;/strong&gt; (2015). &lt;em&gt;Neural Networks and Deep Learning&lt;/em&gt;. Determination Press. Available at: &lt;a href="http://neuralnetworksanddeeplearning.com/" rel="noopener noreferrer"&gt;http://neuralnetworksanddeeplearning.com/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Goodfellow, I., Bengio, Y., &amp;amp; Courville, A.&lt;/strong&gt; (2016). &lt;em&gt;Deep Learning&lt;/em&gt;. MIT Press. Available at: &lt;a href="http://www.deeplearningbook.org/" rel="noopener noreferrer"&gt;http://www.deeplearningbook.org/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #MachineLearning #AI #DeepLearning #Backpropagation #NeuralNetworks #GradientDescent&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Series:&lt;/strong&gt; From Perceptrons to Transformers&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/rnilav/perceptrons-to-transformers" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>backpropagation</category>
      <category>machinelearning</category>
      <category>neuralnetworks</category>
    </item>
    <item>
      <title>Multi Layer Perceptron: From Lines to Curves - The Hidden Layer</title>
      <dc:creator>Nilavukkarasan R</dc:creator>
      <pubDate>Tue, 17 Feb 2026 15:14:10 +0000</pubDate>
      <link>https://dev.to/rnilav/understanding-ai-from-first-principles-multi-layer-perceptrons-and-the-hidden-layer-breakthrough-44pl</link>
      <guid>https://dev.to/rnilav/understanding-ai-from-first-principles-multi-layer-perceptrons-and-the-hidden-layer-breakthrough-44pl</guid>
      <description>&lt;p&gt;&lt;em&gt;"The perceptron has many limitations... the most serious is its inability to learn even the simplest nonlinear functions."&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;-- &lt;strong&gt;Marvin Minsky&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Problem That Stumped AI&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In my &lt;a href="https://dev.to/rnilav/understanding-perceptrons-the-foundation-of-modern-ai-2g04"&gt;last post&lt;/a&gt;, I mentioned that the perceptron could learn AND, OR, and NAND gates perfectly. But there was one simple logic gate it couldn't learn, no matter how much you trained it.&lt;/p&gt;

&lt;p&gt;That gate was XOR (exclusive-or).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;XOR Truth Table:
┌─────────┬─────────┬────────┐
│ Input 1 │ Input 2 │ Output │
├─────────┼─────────┼────────┤
│    0    │    0    │   0    │
│    0    │    1    │   1    │
│    1    │    0    │   1    │
│    1    │    1    │   0    │
└─────────┴─────────┴────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When Marvin Minsky and Seymour Papert published their book "Perceptrons" in 1969, they proved mathematically that single-layer perceptrons couldn't solve XOR. This result helped trigger the first "AI winter" - funding dried up, research stalled, and neural networks were largely abandoned for over a decade.&lt;/p&gt;

&lt;p&gt;But why? What makes XOR so special?&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Geometry of Impossibility&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Here's the thing: a perceptron draws a straight line to separate classes. That's it. One straight line.&lt;/p&gt;

&lt;p&gt;For XOR, you need the output to be 1 when inputs are different, and 0 when they're the same:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Visual representation:
    Input 2
      ↑
    1 │  [1]    [0]
      │
    0 │  [0]    [1]
      └──────────────→ Input 1
         0       1

[0] = Output 0 
[1] = Output 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Try drawing a single straight line that separates the [1] points from the [0] points. You can't. The pattern is diagonal - you'd need two lines, or a curve.&lt;/p&gt;

&lt;p&gt;This is what "not linearly separable" means.&lt;/p&gt;

&lt;p&gt;I spent hours staring at this diagram when I first learned about it. I tried every angle, every position for that line. Nothing worked. And that's exactly the point - it's mathematically impossible. The perceptron's limitation isn't a bug, it's a fundamental constraint of linear classifiers.&lt;/p&gt;

&lt;p&gt;For AND and OR gates, the pattern is simple - all the 1s are on one side, all the 0s on the other. But XOR? The classes are interleaved. You need a more sophisticated approach.&lt;/p&gt;
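
&lt;p&gt;You don't have to take this on faith: you can brute-force it. The sketch below tries thousands of candidate lines (an arbitrary grid of weights and bias between -2 and 2), and the best any single line manages on XOR is 3 out of 4 points:&lt;/p&gt;

```python
import itertools

xor = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

def accuracy(w1, w2, b):
    # Score one candidate line w1*x1 + w2*x2 + b = 0 against all four XOR points
    correct = 0
    for (x1, x2), target in xor:
        out = 1 if x1 * w1 + x2 * w2 + b > 0 else 0
        if out == target:
            correct += 1
    return correct / 4

# Grid search: every combination of weights and bias from -2.0 to 2.0 in 0.1 steps
vals = [v / 10 for v in range(-20, 21)]
best = max(accuracy(w1, w2, b) for w1, w2, b in itertools.product(vals, vals, vals))
print(best)  # 0.75 — three out of four is the best any straight line can do
```

&lt;p&gt;Widening the grid doesn't help; no straight line exists, which is exactly what "not linearly separable" means.&lt;/p&gt;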

&lt;h2&gt;
  
  
  &lt;strong&gt;The Breakthrough: Hidden Layers&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When I was a kid learning math, adding single-digit numbers was simple. 3 + 5 = 8. I just did it. One step, done.&lt;/p&gt;

&lt;p&gt;But then came the leap to multi-digit addition: 27 + 15.&lt;/p&gt;

&lt;p&gt;I kept getting it wrong. I'd add 2 + 1 = 3, then 7 + 5 = 12, and write 312. Completely wrong. My brain was treating it like two separate single-digit problems mashed together. I was missing something invisible.&lt;/p&gt;

&lt;p&gt;Then came the breakthrough: 7 + 5 doesn't just equal 12. It creates a 1 that carries over to the next column. That invisible 1 moving from ones to tens column—that was the missing piece. Once I understood the carry, it clicked.&lt;/p&gt;

&lt;p&gt;The carry was an intermediate step that transformed the problem.&lt;/p&gt;

&lt;p&gt;It sounds trivial now. But back then? It was a massive leap for me. I couldn't see why single-digit rules didn't just scale up. I needed something new—not more of the same.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's the hidden layer. But here's the catch:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If I just wrote down two addition problems and stacked them on top of each other, nothing changes. 2 + 1, then 7 + 5. That's still just two separate additions. Adding more steps doesn't help if each step is the same linear operation.&lt;/p&gt;

&lt;p&gt;But the carry isn't linear. When 7 + 5 = 12, something special happens: the 1 doesn't stay in that column. It transforms—it becomes a 1 in a different column, changing what comes next. That transformation—that non-linearity—is what makes the whole system work.&lt;/p&gt;

&lt;p&gt;Without the carry's transformation, stacking problems is useless. With it, multi-digit addition becomes possible.&lt;/p&gt;

&lt;p&gt;That's exactly what non-linear activation functions do in neural networks.&lt;/p&gt;

&lt;p&gt;A single-layer perceptron is like single-digit addition—inputs straight to output, no transformation. If you just stack more linear layers, you still have the same problem: one straight line, no matter how many you combine.&lt;/p&gt;

&lt;p&gt;But add non-linear activation functions (like sigmoid or ReLU)—the carry transforms the space. Now XOR becomes solvable.&lt;/p&gt;

&lt;p&gt;*&lt;em&gt;Simple shallow network with hand-crafted weights and biases *&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9148fmbr0dvvu9qp8kb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9148fmbr0dvvu9qp8kb.png" alt="2-2-1 Multi-Layer Network" width="800" height="596"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Solving XOR: The Aha Moment&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;With a 2-2-1 network (2 inputs, 2 hidden neurons, 1 output), we can finally solve XOR:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;How it works:
┌──────────────────────────────────────┐
│ Hidden Neuron 1: Learns OR pattern   │
│   (fires when x₁ OR x₂ is 1)         │
│                                      │
│ Hidden Neuron 2: Learns AND pattern  │
│   (fires when x₁ AND x₂ are 1)       │
│                                      │
│ Output: Combines them                │
│   (OR but NOT AND = XOR)             │
└──────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The hidden layer isn't just adding complexity - it's transforming the problem into something solvable.&lt;/p&gt;
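
&lt;p&gt;That scheme translates directly into code. Here's a minimal sketch with a hard step activation and hand-picked weights (my own illustrative values; a sigmoid network like the one in the playground learns a smoother version of the same idea):&lt;/p&gt;

```python
def step(z):
    return 1 if z > 0 else 0

def xor_mlp(x1, x2):
    # Hidden neuron 1: OR  (fires if at least one input is 1)
    h_or = step(x1 + x2 - 0.5)
    # Hidden neuron 2: AND (fires only if both inputs are 1)
    h_and = step(x1 + x2 - 1.5)
    # Output: OR but NOT AND = XOR
    return step(h_or - h_and - 0.5)

print([xor_mlp(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # → [0, 1, 1, 0]
```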

&lt;p&gt;&lt;em&gt;A comparative snapshot generated from the playground&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4wlf9ds9vfyxdfmr8y4d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4wlf9ds9vfyxdfmr8y4d.png" alt="single vs multi layer perceptron" width="800" height="598"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Try it yourself:&lt;/strong&gt; Run the interactive playground to see the curved decision boundary in action. Adjust the weight slider to see how the boundary changes from weak to strong. Compare it with a perceptron's straight line attempt. The visualisation makes it clear why hidden layers are the breakthrough.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub Repository:&lt;/strong&gt; &lt;a href="https://github.com/rnilav/perceptrons-to-transformers/tree/main/02-multi-layer-perceptron" rel="noopener noreferrer"&gt;perceptrons-to-transformers - 02-multi-layer-perceptron&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What you'll find:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;02-multi-layer-perceptron/mlp.py&lt;/code&gt; - Clean MLP implementation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;02-multi-layer-perceptron/mlp_playground.py&lt;/code&gt; - Interactive Streamlit app&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The playground lets you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;See the curved decision boundary that solves XOR&lt;/li&gt;
&lt;li&gt;Adjust weights and watch the boundary change in real-time&lt;/li&gt;
&lt;li&gt;View the network architecture with all weights labeled&lt;/li&gt;
&lt;li&gt;Compare perceptron's straight line vs MLP's curve&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What This Unlocked&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Solving XOR might seem trivial now. But it was the breakthrough that unlocked everything.&lt;/p&gt;

&lt;p&gt;The problem wasn't just XOR. It was the realisation: hidden layers don't just add complexity—they enable non-linear thinking. Once researchers understood this, the floodgates opened.&lt;/p&gt;

&lt;p&gt;In the 1980s, David Rumelhart, Geoffrey Hinton, and Ronald Williams showed you could actually train these multi-layer networks with backpropagation. Suddenly, problems that seemed impossible became solvable. The AI winter thawed.&lt;/p&gt;

&lt;p&gt;The progression is beautiful:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Perceptrons&lt;/strong&gt; learned to draw lines (linear boundaries)&lt;br&gt;
&lt;strong&gt;MLPs&lt;/strong&gt; learned to draw curves (non-linear boundaries)&lt;br&gt;
&lt;strong&gt;Deep networks&lt;/strong&gt; learned hierarchies (edges → shapes → objects → concepts)&lt;/p&gt;

&lt;p&gt;Today's neural networks—from image classifiers to GPT-4—all follow the same principle: stack layers with non-linear activations to transform data into increasingly meaningful representations.&lt;/p&gt;

&lt;p&gt;It all started with one insight: add that first hidden layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What's Next&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We can now build networks that solve XOR. But there's one crucial question: How do we learn the weights?&lt;/p&gt;

&lt;p&gt;The XOR network I showed you uses hand-crafted weights—I manually set values that worked. But for real problems with thousands of inputs and millions of weights, we can't do that by hand.&lt;/p&gt;

&lt;p&gt;We need an algorithm that automatically learns the right weights from examples.&lt;/p&gt;

&lt;p&gt;That algorithm is called backpropagation, and it's what makes neural networks practical. It's how networks learn from their mistakes and gradually improve.&lt;/p&gt;

&lt;p&gt;In the next post, we'll dive into backpropagation—the algorithm that ties everything together. It involves calculus, but I promise to make it intuitive.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Minsky, M., &amp;amp; Papert, S.&lt;/strong&gt; (1969). &lt;em&gt;Perceptrons: An Introduction to Computational Geometry&lt;/em&gt;. MIT Press.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Nielsen, M.&lt;/strong&gt; (2015). &lt;em&gt;Neural Networks and Deep Learning&lt;/em&gt;. Determination Press. Available at: &lt;a href="http://neuralnetworksanddeeplearning.com/" rel="noopener noreferrer"&gt;http://neuralnetworksanddeeplearning.com/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #MachineLearning #AI #DeepLearning #NeuralNetworks #MLP&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Series:&lt;/strong&gt; From Perceptrons to Transformers&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/rnilav/perceptrons-to-transformers/tree/main/02-multi-layer-perceptron" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>deeplearning</category>
      <category>mlp</category>
    </item>
    <item>
      <title>Perceptron: The Foundation of Modern AI</title>
      <dc:creator>Nilavukkarasan R</dc:creator>
      <pubDate>Sun, 15 Feb 2026 08:40:21 +0000</pubDate>
      <link>https://dev.to/rnilav/understanding-perceptrons-the-foundation-of-modern-ai-2g04</link>
      <guid>https://dev.to/rnilav/understanding-perceptrons-the-foundation-of-modern-ai-2g04</guid>
      <description>&lt;p&gt;&lt;em&gt;"We now have a new kind of programming paradigm. Instead of telling the computer what to do, we show it examples of what we want, and it figures out how to do it."&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;-- &lt;strong&gt;Michael Nielsen&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;My Journey Back to the Beginning&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;My first encounter with Artificial Intelligence was during my college days. I had memorised more than I understood, but none of what I studied appeared in the exam, so I wrote whatever I could, and I’m quite certain the professor didn’t understand my answers either.&lt;/p&gt;

&lt;p&gt;Fast forward 20 years of building software systems. In all that time, I barely touched AI/ML. Sure, I designed applications that integrated with black-box AI/ML systems for OCR, but that was it.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Then ChatGPT happened&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Like many of you, I started with the ChatGPT web interface, learning prompt engineering. Then I began experimenting—building RAG chatbots, exploring chunking strategies, testing different embedding models and retrieval techniques. I experimented with agents, explored MCPs and agentic patterns. I was learning these tools, building with them—but something bothered me.&lt;/p&gt;

&lt;p&gt;I didn't understand how any of it actually worked.&lt;/p&gt;

&lt;p&gt;So I decided to go back. Not to the latest paper or the newest framework, but to the very beginning. To the first artificial neuron.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why This Matters&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;You might wonder why bother learning about a decades-old concept when we have ChatGPT, Claude, and countless AI tools at our fingertips.&lt;/p&gt;

&lt;p&gt;Here's why: Every single neuron in GPT-4, in every transformer, in every neural network you've ever used, works on the same basic principles as that first artificial neuron. The perceptron isn't history. It's the foundation.&lt;/p&gt;

&lt;p&gt;Understanding it means understanding what's actually happening when you call an LLM API. It means knowing why things work, not just that they work.&lt;/p&gt;

&lt;p&gt;If you've felt this same curiosity and want to truly understand the foundations beneath the tools we use every day, join me. We'll learn from first principles, one concept at a time.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;From Biology to Silicon&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In 1943, Warren McCulloch and Walter Pitts created the first mathematical model of a neuron. But it was Frank Rosenblatt in 1958 who built the perceptron, the first artificial neuron that could actually learn.&lt;/p&gt;

&lt;p&gt;Rosenblatt's breakthrough came from mimicking nature. He studied how biological neurons work and translated that logic into mathematics. Here's how they compare:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Biological Neuron:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Dendrites  →  Cell Body  →  Threshold Check  →  Axon
(receive)     (process)     (fire if met)        (output)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Artificial Neuron (Perceptron):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Inputs     →  Weighted Sum  →  Threshold Check  →  Output
x₁,x₂,...     Σ(xᵢ × wᵢ)       (≥ threshold?)       0 or 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The key insight&lt;/strong&gt;: Learning happens by adjusting the weights.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How a Perceptron Works&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let's break it down to basics.&lt;/p&gt;

&lt;p&gt;A perceptron takes inputs, multiplies each by a weight, adds them up, and makes a decision.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;perceptron_forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Multiply each input by its weight
&lt;/span&gt;    &lt;span class="n"&gt;weighted_sum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Add bias (shifts the decision boundary)
&lt;/span&gt;    &lt;span class="n"&gt;weighted_sum&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt;

    &lt;span class="c1"&gt;# Activation: output 1 if positive, 0 otherwise
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;weighted_sum&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. That's the core of a perceptron.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's happening:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Each input has a weight (how important is this input?)&lt;/li&gt;
&lt;li&gt;We sum up: (input₁ × weight₁) + (input₂ × weight₂) + ... + bias&lt;/li&gt;
&lt;li&gt;If the sum is positive, output 1. Otherwise, output 0.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Example: AND gate&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let's say we want to implement the AND logic gate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input: [0, 0] → Output: 0&lt;/li&gt;
&lt;li&gt;Input: [0, 1] → Output: 0&lt;/li&gt;
&lt;li&gt;Input: [1, 0] → Output: 0&lt;/li&gt;
&lt;li&gt;Input: [1, 1] → Output: 1&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Traditional way (if/else):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;and_gate_traditional&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;input1&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;input2&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Perceptron way (learned weights):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With the right weights ([0.5, 0.5] and bias -0.7), the perceptron can solve this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[0, 0]: 0×0.5 + 0×0.5 - 0.7 = -0.7 → Output: 0 ✓&lt;/li&gt;
&lt;li&gt;[0, 1]: 0×0.5 + 1×0.5 - 0.7 = -0.2 → Output: 0 ✓&lt;/li&gt;
&lt;li&gt;[1, 0]: 1×0.5 + 0×0.5 - 0.7 = -0.2 → Output: 0 ✓&lt;/li&gt;
&lt;li&gt;[1, 1]: 1×0.5 + 1×0.5 - 0.7 = 0.3 → Output: 1 ✓&lt;/li&gt;
&lt;/ul&gt;
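&lt;p&gt;You can verify this arithmetic yourself by wrapping the &lt;code&gt;perceptron_forward&lt;/code&gt; function from earlier with the weights from the worked example above (a quick sanity check, not the repo's implementation):&lt;/p&gt;

```python
def perceptron_forward(inputs, weights, bias):
    # Weighted sum plus bias, then a hard threshold at zero
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if weighted_sum > 0 else 0

def and_gate(input1, input2):
    # Weights [0.5, 0.5] and bias -0.7, as in the worked example
    return perceptron_forward([input1, input2], [0.5, 0.5], -0.7)

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "->", and_gate(a, b))  # prints 0, 0, 0, 1
```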

&lt;p&gt;The difference? The traditional way is hardcoded. The perceptron learns these weights from examples. That's the new programming paradigm Nielsen talked about.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What Clicked for Me&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;After implementing and testing the perceptron, here's what became clear:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weights are just numbers.&lt;/strong&gt; There's no magic. A weight of 0.5 means "this input matters half as much as an input with weight 1.0."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The bias shifts the boundary.&lt;/strong&gt; Without bias, the decision boundary always goes through the origin. Bias lets it move anywhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learning is adjustment.&lt;/strong&gt; When the perceptron makes a mistake, we adjust the weights. That's learning. &lt;/p&gt;
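&lt;p&gt;That adjustment can be sketched with the classic perceptron learning rule: nudge each weight in proportion to the error. This is a minimal version for illustration (the learning rate of 0.1 and the zero initialization are arbitrary choices, not values from the repo):&lt;/p&gt;

```python
def train_perceptron(data, epochs=10, learning_rate=0.1):
    """Learn weights for 2-input examples via the perceptron update rule."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, target in data:
            weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
            prediction = 1 if weighted_sum > 0 else 0
            error = target - prediction  # -1, 0, or +1
            # Mistake? Nudge each weight toward the correct answer.
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias

# The AND truth table as (inputs, target) pairs
and_data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, bias = train_perceptron(and_data)
```

No weights are hardcoded here: the rule discovers a separating line from the four examples alone.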

&lt;p&gt;&lt;strong&gt;It's a linear classifier.&lt;/strong&gt; The perceptron draws a straight line (or hyperplane) to separate classes. This is both its power and its limitation.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Explore the Code&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I've implemented a complete perceptron from scratch with visualizations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here is a sample visualization screenshot from the playground:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsquog8vfx9oh2klke11g.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsquog8vfx9oh2klke11g.jpg" alt=" " width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub Repository:&lt;/strong&gt; &lt;a href="https://github.com/rnilav/perceptrons-to-transformers/tree/main/01-perceptron" rel="noopener noreferrer"&gt;perceptrons-to-transformers&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What you'll find:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;01-perceptron/perceptron.py&lt;/code&gt; - Full implementation with learning algorithm&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;01-perceptron/perceptron_playground.py&lt;/code&gt; - Streamlit app to play with it&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What's Next&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The perceptron can learn AND, OR, and NAND gates perfectly. But it has a fundamental limitation.&lt;/p&gt;

&lt;p&gt;No matter how you adjust the weights, there's one simple logic gate it cannot learn. This limitation exposed a critical weakness in single-layer networks.&lt;/p&gt;

&lt;p&gt;In the next post, we'll explore this limitation and see why it led to the invention of multilayer networks.&lt;/p&gt;

&lt;p&gt;Spoiler: The problem is called XOR, and solving it ultimately paved the way for modern deep learning.&lt;/p&gt;
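&lt;p&gt;You can make the limitation concrete with a brute-force check (my own sanity test, not code from the repo): sweep a grid of weights and biases and count how many XOR cases a single perceptron can get right.&lt;/p&gt;

```python
def perceptron_forward(inputs, weights, bias):
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if weighted_sum > 0 else 0

# The XOR truth table: output 1 when exactly one input is 1
xor_data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

# Sweep a coarse grid of weights and biases in [-1, 1]
grid = [i / 10 for i in range(-10, 11)]
best = 0
for w1 in grid:
    for w2 in grid:
        for b in grid:
            correct = sum(perceptron_forward(x, [w1, w2], b) == y
                          for x, y in xor_data)
            best = max(best, correct)

print(best)  # 3 — no single straight line separates XOR's classes
```

No setting on this grid (or any other: XOR is not linearly separable) classifies all four cases correctly.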




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Nielsen, M.&lt;/strong&gt; (2015). &lt;em&gt;Neural Networks and Deep Learning&lt;/em&gt;. Determination Press. Available at: &lt;a href="http://neuralnetworksanddeeplearning.com/" rel="noopener noreferrer"&gt;http://neuralnetworksanddeeplearning.com/&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #MachineLearning #AI #DeepLearning #Perceptron #NeuralNetworks&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Series:&lt;/strong&gt; From Perceptron to Transformers&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/rnilav/perceptrons-to-transformers/tree/main/01-perceptron" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;

</description>
      <category>perceptron</category>
      <category>neuralnetworks</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
