<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nilavukkarasan R</title>
    <description>The latest articles on DEV Community by Nilavukkarasan R (@rnilav).</description>
    <link>https://dev.to/rnilav</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3772087%2Fb8707010-ec72-4401-bf94-a0595c046a4d.jpg</url>
      <title>DEV Community: Nilavukkarasan R</title>
      <link>https://dev.to/rnilav</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rnilav"/>
    <language>en</language>
    <item>
      <title>Attention Mechanisms: Stop Compressing, Start Looking Back</title>
      <dc:creator>Nilavukkarasan R</dc:creator>
      <pubDate>Sun, 19 Apr 2026 05:32:31 +0000</pubDate>
      <link>https://dev.to/rnilav/attention-mechanisms-stop-compressing-start-looking-back-1bol</link>
      <guid>https://dev.to/rnilav/attention-mechanisms-stop-compressing-start-looking-back-1bol</guid>
      <description>&lt;p&gt;&lt;em&gt;"The art of being wise is the art of knowing what to overlook."&lt;/em&gt;&lt;br&gt;
— &lt;strong&gt;William James&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;The Bottleneck We Didn't Notice&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In my &lt;a href="https://dev.to/rnilav/understanding-recurrent-neural-networks-from-forgetting-to-remembering-5f7"&gt;last post&lt;/a&gt;, we gave networks memory. An LSTM reads a sentence word by word, maintaining a hidden state that carries context forward. It solved the forgetting problem that plagued vanilla RNNs.&lt;/p&gt;

&lt;p&gt;But there are three problems the LSTM still doesn't solve. And I didn't fully understand them until I thought about my own experience learning English.&lt;/p&gt;

&lt;p&gt;I studied in Tamil medium all the way through school. English was a subject, not a language I lived in. When I started my first job 20 years ago, I had to learn to actually &lt;em&gt;speak&lt;/em&gt; it and, more terrifyingly, &lt;em&gt;write&lt;/em&gt; it. Client emails. Professional communication. Things that would be read, judged, and replied to.&lt;/p&gt;

&lt;p&gt;My strategy was the only one I knew: compose the sentence in Tamil first, then translate it word by word into English.&lt;/p&gt;

&lt;p&gt;It worked for simple things. It broke down in three very specific ways. Those three breakdowns map exactly onto the three problems that attention was built to solve.&lt;/p&gt;


&lt;h2&gt;
  
  
  Problem 1: The Compressed Summary
&lt;/h2&gt;

&lt;p&gt;The first breakdown happened with long emails.&lt;/p&gt;

&lt;p&gt;I'd compose a full paragraph in Tamil mentally: three or four sentences, a complete thought. Then I'd try to hold that entire paragraph in my head while translating it into English. By the time I was writing the third sentence in English, the first one had blurred. I'd lose the subject I'd introduced. I'd forget the condition I'd set up. The English output would drift from the original Tamil thought.&lt;/p&gt;

&lt;p&gt;The problem wasn't that I forgot individual words. It was that I was trying to carry a &lt;em&gt;compressed summary&lt;/em&gt; of a long paragraph in my working memory, and that summary wasn't big enough to hold everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is exactly what an RNN encoder does.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It reads the entire input sequence and compresses it into a single fixed-size vector, the final hidden state. Then the decoder uses only that compressed summary to generate the output. For short sentences, fine. For long ones, that summary has to hold everything: the subject, the verb, the object, the tone, the nuance. Something always gets lost.&lt;/p&gt;
&lt;h3&gt;
  
  
  Bahdanau's Fix (2014)
&lt;/h3&gt;

&lt;p&gt;The fix came from Bahdanau, Cho, and Bengio. The idea is simple in principle: don't compress. Keep every hidden state the encoder produced, one per input word, and let the decoder look back at any of them when needed.&lt;/p&gt;

&lt;p&gt;Instead of one compressed summary, the decoder has access to the full sequence of encoder states. When generating each output word, it computes a weighted sum over all of them, attending more to the ones that are relevant right now and less to the ones that aren't.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Without attention:  decoder sees only h_final (compressed summary of everything)
With attention:     decoder sees h₁, h₂, ..., hₙ and decides what to focus on
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bahdanau's original formulation used a small neural network to compute how well each encoder state matched the decoder's current need: a learned compatibility function. It worked remarkably well. Translation quality on long sentences improved dramatically.&lt;/p&gt;

&lt;p&gt;Your brain does this too. When you're answering a question about something you read, you don't reconstruct a compressed summary; you mentally flip back to the relevant section. The original is still accessible. Attention gives the network the same ability.&lt;/p&gt;
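
&lt;p&gt;Here's that idea as a small NumPy sketch (the weight names and sizes are illustrative, not Bahdanau's actual implementation, and the weights are random where a real model would learn them):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 5                                # hidden size, input length

encoder_states = rng.normal(size=(n, d))   # h_1 ... h_n, one per input word
decoder_state = rng.normal(size=(d,))      # the decoder's current state s

# Weights of the small scoring network (random here, learned in practice)
W_s = rng.normal(size=(d, d))
W_h = rng.normal(size=(d, d))
v = rng.normal(size=(d,))

# score_i = v . tanh(W_s s + W_h h_i)  -- one score per encoder position
scores = np.tanh(decoder_state @ W_s + encoder_states @ W_h) @ v

weights = np.exp(scores) / np.exp(scores).sum()   # softmax: weights sum to 1
context = weights @ encoder_states                # weighted sum of ALL states
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
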




&lt;h2&gt;
  
  
  Problem 2: Word Order
&lt;/h2&gt;

&lt;p&gt;The second breakdown was more embarrassing. It happened in individual sentences, not long paragraphs.&lt;/p&gt;

&lt;p&gt;Tamil is a verb-final language. The verb comes at the end. When I wanted to write "Can you send the report by tomorrow?", the Tamil structure in my head was roughly: &lt;em&gt;"நாளைக்குள் அந்த report-ஐ அனுப்ப முடியுமா?"&lt;/em&gt; — "Tomorrow-by that report send can-you?" Subject implied. Object before verb.&lt;/p&gt;

&lt;p&gt;I'd start translating from the beginning of the Tamil sentence. "Tomorrow-by" → "By tomorrow". OK so far. "That report" → "the report". Fine. "Send" → "send". And then I'd realize I'd already written "By tomorrow the report send" and have no idea where to put "Can you."&lt;/p&gt;

&lt;p&gt;What appeared perfectly correct in Tamil didn't map cleanly to English word by word. The structures are different. A literal left-to-right translation produces nonsense.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is the word order problem — and it's where attention does its real work.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An RNN decoder, even with access to all encoder states, still generates output left to right, one word at a time. But attention lets the decoder look at &lt;em&gt;any&lt;/em&gt; encoder position in &lt;em&gt;any&lt;/em&gt; order. When generating "Can", it attends to the Tamil modal at position 5. When generating "send", it attends to the Tamil verb at position 4. When generating "tomorrow", it attends back to position 1.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tamil:    நாளைக்குள்  அந்த  report-ஐ  அனுப்ப  முடியுமா
              h₁        h₂      h₃       h₄       h₅
           (by tmrw)  (that) (report)  (send)  (can you?)

English output → attention focus:
"Can"      → h₅  (முடியுமா — the modal)
"you"      → h₅
"send"     → h₄  (அனுப்ப — the verb)
"the"      → h₂ + h₃
"report"   → h₃  (report-ஐ — the object)
"by"       → h₁  (நாளைக்குள் — the time marker)
"tomorrow" → h₁
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The attention weights form a matrix, one row per English output word, one column per Tamil input word. You can literally see the reordering: the decoder jumping from position 5 back to position 4, then to 3, then to 1. It's not following the Tamil order. It's following the English order, looking back at whatever Tamil position it needs.&lt;/p&gt;

&lt;p&gt;This is what the Q/K/V formulation captures cleanly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Query (Q)&lt;/strong&gt;: what the decoder is currently asking — "what do I need to generate this word?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key (K)&lt;/strong&gt;: what each encoder position offers — a description of what's available there&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Value (V)&lt;/strong&gt;: the actual content retrieved when you attend to that position
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tex"&gt;&lt;code&gt;Attention(Q, K, V) = softmax(Q·Kᵀ / √d) · V
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;√d&lt;/code&gt; scaling keeps dot products in a stable range as the dimension grows; without it, the softmax saturates and gradients vanish. Same instability problem we saw in deep networks, same fix.&lt;/p&gt;
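
&lt;p&gt;That formula translates almost line for line into NumPy. A minimal sketch, with the standard max-subtraction trick for a numerically stable softmax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                # Q·Kᵀ / √d
    scores = scores - scores.max(axis=-1, keepdims=True)     # stability only
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax per row
    return weights @ V, weights

rng = np.random.default_rng(1)
Q = rng.normal(size=(7, 16))   # 7 output positions asking questions
K = rng.normal(size=(5, 16))   # 5 input positions offering keys...
V = rng.normal(size=(5, 16))   # ...and values
out, w = scaled_dot_product_attention(Q, K, V)
# out: one context vector per query; each row of w sums to 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
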




&lt;h2&gt;
  
  
  Problem 3: Speed
&lt;/h2&gt;

&lt;p&gt;The third breakdown was the slowest to notice, because it wasn't about a single sentence. It was about &lt;em&gt;conversation&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Word-by-word translation is sequential by nature. I'd think in Tamil, translate, speak. Then listen to the reply in English, translate it back to Tamil to understand it, formulate a Tamil response, translate that to English, speak. Every exchange had this full round-trip happening in my head.&lt;/p&gt;

&lt;p&gt;For a simple two-line exchange, manageable. For a fast-moving technical discussion with multiple people, completely unworkable. By the time I'd finished translating the last thing someone said, the conversation had moved on two turns.&lt;/p&gt;

&lt;p&gt;The bottleneck wasn't comprehension. It was that the process was &lt;em&gt;sequential&lt;/em&gt;. Each step had to wait for the previous one to finish.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is the parallelism problem — and it's what self-attention solves.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An RNN processes a sequence one step at a time. Step 2 can't start until step 1 is done. For a sentence of length 100, that's 100 sequential operations. You can't parallelize across time steps because each hidden state depends on the previous one.&lt;/p&gt;

&lt;p&gt;Self-attention breaks this dependency entirely. Instead of processing word by word, it computes relationships between &lt;em&gt;all&lt;/em&gt; positions simultaneously in a single matrix operation. There's no sequential chain. The entire sequence is processed at once.&lt;/p&gt;

&lt;p&gt;When you start thinking directly in English, something similar happens. It's not a sequential process anymore. Grammar, meaning, and context are processed in parallel, automatically, without conscious effort. It's parallel processing.&lt;/p&gt;

&lt;p&gt;Self-attention is the architectural version of that shift.&lt;/p&gt;




&lt;h2&gt;
  
  
  Self-Attention: Every Word Sees Every Other Word
&lt;/h2&gt;

&lt;p&gt;So far, attention was between two sequences: Tamil input, English output. The decoder attends to the encoder. But the same mechanism applies within a single sequence, and this turns out to be even more powerful.&lt;/p&gt;

&lt;p&gt;Consider: "The report that the client who called yesterday requested is ready."&lt;/p&gt;

&lt;p&gt;What is "ready"? The report. Which report? The one the client requested. Which client? The one who called yesterday. These connections span many positions in the same sentence. An RNN would need to carry all of this through its hidden state, step by step, hoping nothing gets lost.&lt;/p&gt;

&lt;p&gt;Self-attention resolves them in one shot: every word attends to every other word in the same sequence, regardless of distance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"ready"     → attends back to "report" (subject of the predicate)
"requested" → attends to "client" (who did the requesting)
"who"       → attends to "client" (relative clause anchor)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No sequential processing. No hidden state bottleneck. One operation, all connections at once.&lt;/p&gt;
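
&lt;p&gt;Self-attention is the same computation with Q, K, and V all projected from one sequence. A sketch with random projection matrices standing in for learned ones:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def softmax_rows(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
n, d = 11, 8                       # 11 words, 8-dim embeddings
X = rng.normal(size=(n, d))        # the whole sentence at once

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv   # every word makes its own query/key/value

weights = softmax_rows(Q @ K.T / np.sqrt(d))   # (n, n): every word vs every word
out = weights @ V                  # all n positions updated in one matrix op, no loop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
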

&lt;p&gt;Your brain does this effortlessly when reading fluently. It's only when you're translating word by word, processing sequentially one token at a time, that you lose these long-range connections.&lt;/p&gt;




&lt;h2&gt;
  
  
  Multi-Head Attention: Noticing Multiple Things at Once
&lt;/h2&gt;

&lt;p&gt;There's one more piece. A single attention operation computes one set of weights. It can only "look for" one type of relationship at a time. But language has many simultaneous relationships.&lt;/p&gt;

&lt;p&gt;In "The cat sat on the mat because it was tired", the word "it" has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;syntactic&lt;/strong&gt; relationship with "sat" (subject of the clause)&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;coreference&lt;/strong&gt; relationship with "cat" (what "it" refers to)&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;semantic&lt;/strong&gt; relationship with "tired" (property being attributed)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A single attention head would have to pick one. Multi-head attention runs several attention operations in parallel, each with different learned projections:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;head_i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Attention&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Q&lt;/span&gt;&lt;span class="err"&gt;·&lt;/span&gt;&lt;span class="n"&gt;Wᵢ_Q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="err"&gt;·&lt;/span&gt;&lt;span class="n"&gt;Wᵢ_K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="err"&gt;·&lt;/span&gt;&lt;span class="n"&gt;Wᵢ_V&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nc"&gt;MultiHead&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Concat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;head_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...,&lt;/span&gt; &lt;span class="n"&gt;head_h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="err"&gt;·&lt;/span&gt; &lt;span class="n"&gt;W_O&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The heads run simultaneously, and each learns to notice a different kind of relationship. One head might track grammatical alignment. Another might track semantic similarity. Another might track coreference: which pronoun refers to which noun.&lt;/p&gt;

&lt;p&gt;The standard Transformer uses 8 heads. Each head operates on a smaller slice of the representation (dimension &lt;code&gt;d/8&lt;/code&gt; instead of &lt;code&gt;d&lt;/code&gt;), so the total computation is the same as a single large attention — but the network gets 8 different perspectives instead of one.&lt;/p&gt;
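
&lt;p&gt;The split-compute-concatenate pattern, sketched with random projections standing in for learned ones:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def softmax_rows(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(3)
n, d, h = 6, 64, 8                 # sequence length, model dim, heads
d_k = d // h                       # each head works in a d/8 slice
X = rng.normal(size=(n, d))

heads = []
for i in range(h):
    Wq, Wk, Wv = (rng.normal(size=(d, d_k)) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # project into this head's slice
    A = softmax_rows(Q @ K.T / np.sqrt(d_k))   # this head's own attention pattern
    heads.append(A @ V)                        # (n, d_k)

W_O = rng.normal(size=(d, d))
out = np.concatenate(heads, axis=-1) @ W_O     # Concat(head_1, ..., head_h) · W_O
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
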




&lt;h2&gt;
  
  
  What Clicked for Me
&lt;/h2&gt;

&lt;p&gt;The compressed summary problem is the bottleneck of trying to hold a whole paragraph in working memory before translating. The word order problem is the mismatch between SOV and SVO that makes literal translation fail. The sequential processing problem is the reason real-time conversation was impossible while I was still translating word by word.&lt;/p&gt;

&lt;p&gt;The shift from "translate word by word" to "think in English" is the shift from RNN to attention. It's not an optimization. It's a different way of processing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Interactive Playground
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;09-attention
streamlit run attention_playground.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/rnilav/perceptrons-to-transformers/tree/main/09-attention" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This playground is different from the previous ones. No training loops, no waiting. Five concept demos that follow the blog post narrative — every slider updates instantly because it's all just matrix math under the hood.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Attention solves the bottleneck. But the architecture we've built so far still has an RNN encoder underneath — it's still sequential at its core.&lt;/p&gt;

&lt;p&gt;Post 10 asks: what if we removed the RNN entirely? What if the whole architecture was just attention, stacked?&lt;/p&gt;

&lt;p&gt;That's the Transformer. Attention without recurrence. Parallel processing of the entire sequence at once. Positional encodings to restore order information. And a feed-forward network to add non-linearity between attention layers.&lt;/p&gt;

&lt;p&gt;It's the architecture behind every modern language model — GPT, BERT, T5, and everything that came after. And it's built entirely from pieces we already understand.&lt;/p&gt;




&lt;h2&gt;
  
  
  Deep Dive
&lt;/h2&gt;

&lt;p&gt;For the full mathematical treatment — dot-product attention, scaled attention, the Q/K/V framework, self-attention, multi-head attention, masking, gradient flow, and worked numerical examples — see &lt;a href="https://github.com/rnilav/perceptrons-to-transformers/tree/main/09-attention/ATTENTION_MATH_DEEP_DIVE.md" rel="noopener noreferrer"&gt;&lt;code&gt;ATTENTION_MATH_DEEP_DIVE.md&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Bahdanau, D., Cho, K., &amp;amp; Bengio, Y.&lt;/strong&gt; (2014). &lt;em&gt;Neural Machine Translation by Jointly Learning to Align and Translate&lt;/em&gt;. ICLR 2015.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Luong, M., Pham, H., &amp;amp; Manning, C. D.&lt;/strong&gt; (2015). &lt;em&gt;Effective Approaches to Attention-based Neural Machine Translation&lt;/em&gt;. EMNLP.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vaswani, A., et al.&lt;/strong&gt; (2017). &lt;em&gt;Attention Is All You Need&lt;/em&gt;. NeurIPS.
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>selfattention</category>
      <category>ai</category>
      <category>transformer</category>
      <category>multiheadattention</category>
    </item>
    <item>
      <title>Understanding Recurrent Neural Networks: From Forgetting to Remembering</title>
      <dc:creator>Nilavukkarasan R</dc:creator>
      <pubDate>Mon, 13 Apr 2026 14:43:48 +0000</pubDate>
      <link>https://dev.to/rnilav/understanding-recurrent-neural-networks-from-forgetting-to-remembering-5f7</link>
      <guid>https://dev.to/rnilav/understanding-recurrent-neural-networks-from-forgetting-to-remembering-5f7</guid>
      <description>&lt;p&gt;&lt;em&gt;"The present contains nothing more than the past, and what is found in the effect was already in the cause."&lt;/em&gt;&lt;br&gt;
— &lt;strong&gt;Henri Bergson&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;Everything We Built Assumed a Snapshot&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Look back at every network we've built so far.&lt;/p&gt;

&lt;p&gt;A perceptron takes a fixed input vector and draws a line. An MLP stacks layers to bend that line into curves. A CNN slides filters across an image to detect spatial patterns. Even with all that sophistication, the convolutions, the pooling, the skip connections, every single one treats the input as a &lt;strong&gt;static snapshot&lt;/strong&gt;. Feed it in, get a prediction out. The order of inputs doesn't matter. There's no before or after.&lt;/p&gt;

&lt;p&gt;That assumption works perfectly for images. A digit is a digit regardless.&lt;/p&gt;

&lt;p&gt;But language isn't a snapshot. Neither is audio, or time series, or any signal where &lt;em&gt;what came before&lt;/em&gt; changes the meaning of &lt;em&gt;what comes after&lt;/em&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"My teacher said I was slow, but &lt;strong&gt;he&lt;/strong&gt; didn't know I was just getting started."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What does "he" refer to? The teacher, obviously. But only because you held "my teacher" in mind while reading the rest. You carried context forward — unconsciously, effortlessly.&lt;/p&gt;

&lt;p&gt;Every architecture we've built so far would fail this. It has no mechanism for carrying anything forward.&lt;/p&gt;

&lt;p&gt;That's the gap RNNs were built to fill.&lt;/p&gt;


&lt;h2&gt;
  
  
  Learning to Read — Letter by Letter
&lt;/h2&gt;

&lt;p&gt;I remember learning to read. Not the fluent reading I do now, the early, effortful kind.&lt;/p&gt;

&lt;p&gt;Each letter had to be identified consciously. Then combined with the next to form a sound. Then sounds stitched into a word. Then words assembled into meaning. It was slow, sequential, and exhausting. And crucially, by the time I reached the end of a long sentence, I'd often forgotten how it started.&lt;/p&gt;

&lt;p&gt;That's a vanilla RNN.&lt;/p&gt;

&lt;p&gt;It processes sequences one step at a time, maintaining a &lt;strong&gt;hidden state&lt;/strong&gt;, a running summary of everything seen so far, and updating it at each step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# At each step t:
hidden(t) = tanh( W_h × hidden(t-1) + W_x × input(t) )
output(t) = W_o × hidden(t)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The hidden state is the memory. It blends the new input with what came before. The same weights are reused at every step: the network doesn't learn separate rules for position 1 vs position 50. One set of weights, applied repeatedly across time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;h(0) ──► h(1) ──► h(2) ──► h(3) ──► ...
  ▲         ▲         ▲         ▲
  │         │         │         │
x(0)      x(1)      x(2)      x(3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Elegant. And it works, for short sequences. Just like the early reader who handles a short word fine but loses the thread of a long sentence.&lt;/p&gt;
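
&lt;p&gt;The update rule above as a runnable sketch. Note the loop: each step needs the previous hidden state, so nothing parallelizes across time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def rnn_forward(xs, W_h, W_x, W_o):
    h = np.zeros(W_h.shape[0])          # h(0): empty memory
    outputs = []
    for x in xs:                        # strictly one step at a time
        h = np.tanh(W_h @ h + W_x @ x)  # blend old memory with new input
        outputs.append(W_o @ h)         # same weights reused at every step
    return np.array(outputs), h

rng = np.random.default_rng(4)
hidden, d_in, d_out, T = 16, 8, 3, 10
xs = rng.normal(size=(T, d_in))
outs, h_final = rnn_forward(xs,
                            rng.normal(size=(hidden, hidden)) * 0.1,
                            rng.normal(size=(hidden, d_in)) * 0.1,
                            rng.normal(size=(d_out, hidden)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
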




&lt;h2&gt;
  
  
  Training It: Backprop Through Time
&lt;/h2&gt;

&lt;p&gt;Training uses the same backpropagation from &lt;a href="https://dev.to/rnilav/3-backpropagation-errors-flow-backward-knowledge-flows-forward-5320"&gt;Post 3&lt;/a&gt; — unrolled across time steps. To compute how much each weight contributed to the final loss, you trace gradients backward through every step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x(0)→[RNN]→h(0)→[RNN]→h(1)→[RNN]→h(2)→[RNN]→h(3)→ Loss
                                                        │
                              gradients flow backward ◄─┘
                              through every time step
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same chain rule. Just applied across time instead of across layers. The depth is now &lt;em&gt;temporal&lt;/em&gt; rather than architectural.&lt;/p&gt;

&lt;p&gt;And here's where the familiar problem returns.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Long Sentence Problem
&lt;/h2&gt;

&lt;p&gt;Remember the vanishing gradient from &lt;a href="https://dev.to/rnilav/understanding-internal-covariate-shift-and-residual-connections-beyond-activation-functions-and-2c8"&gt;Post 7&lt;/a&gt;? Gradients shrink as they travel backward through many layers: multiply enough numbers smaller than 1 together and you get something close to zero.&lt;/p&gt;

&lt;p&gt;The same thing happens here, but across time steps instead of layers.&lt;/p&gt;

&lt;p&gt;At each step backward, the gradient is multiplied by another factor involving the weight matrix &lt;code&gt;W_h&lt;/code&gt; and the tanh derivative. For a sequence of 50 words, that's 50 multiplications. If those factors tend to be smaller than 1, the gradient reaching step 1 is effectively zero.&lt;/p&gt;
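
&lt;p&gt;You can reproduce the effect with nothing but repeated multiplication. Treating each backward step as scaling the gradient by roughly 0.9 (a stand-in for the combined effect of &lt;code&gt;W_h&lt;/code&gt; and the tanh derivative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;grad = 1.0
for step in range(50):        # 50 words = 50 backward multiplications
    grad = grad * 0.9         # each step shrinks the signal a little
print(round(grad, 5))         # about 0.005: almost nothing reaches step 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
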

&lt;p&gt;Like my early reading days: by the end of a long sentence, I'd forgotten how it started.&lt;/p&gt;

&lt;p&gt;In Post 7, skip connections fixed vanishing gradients by adding a direct additive path that bypassed the layers. We need the same idea, but for time.&lt;/p&gt;




&lt;h2&gt;
  
  
  LSTM: Learning to Read Fluently
&lt;/h2&gt;

&lt;p&gt;Think about what changes when reading becomes fluent.&lt;/p&gt;

&lt;p&gt;You stop processing letter by letter. You chunk into words, phrases, meaning. More importantly, you become &lt;em&gt;selective&lt;/em&gt;. You don't hold every word in memory with equal weight. You retain what matters: the subject, the tension, the unresolved question. You discard the filler. And you do this automatically, without thinking.&lt;/p&gt;

&lt;p&gt;That selectivity is exactly what the Long Short-Term Memory network (Hochreiter &amp;amp; Schmidhuber, 1997) introduced.&lt;/p&gt;

&lt;p&gt;An LSTM has two states instead of one: a &lt;strong&gt;hidden state&lt;/strong&gt; &lt;code&gt;h&lt;/code&gt; (what it's currently working with) and a &lt;strong&gt;cell state&lt;/strong&gt; &lt;code&gt;c&lt;/code&gt; (long-term memory). The cell state is the key innovation: it runs through the sequence with only small, controlled modifications. Like the skip connection in ResNets, it's an additive path that lets gradients flow backward without decaying at every step.&lt;/p&gt;

&lt;p&gt;Three &lt;strong&gt;gates&lt;/strong&gt; control what happens to memory at each step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Forget gate:  should I clear out old memory?
              f = sigmoid( W_f × [h(t-1), x(t)] )

Input gate:   is this new input worth remembering?
              i = sigmoid( W_i × [h(t-1), x(t)] )
              candidate = tanh( W_c × [h(t-1), x(t)] )

Output gate:  what should I act on right now?
              o = sigmoid( W_o × [h(t-1), x(t)] )

Update:
  cell:    c(t) = f × c(t-1)  +  i × candidate
  hidden:  h(t) = o × tanh( c(t) )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The sigmoid gates output values between 0 and 1 — soft switches. A forget gate near 1 means "keep everything." Near 0 means "wipe it." The network &lt;em&gt;learns&lt;/em&gt; when to remember and when to forget, based on what the task requires.&lt;/p&gt;

&lt;p&gt;The cell state update, &lt;code&gt;c(t) = f × c(t-1) + i × candidate&lt;/code&gt;, is additive. Old memory plus new information. That additive structure is what saves the gradient. Instead of multiplying through a squashing function at every step, gradients flow backward through the cell state with far less decay.&lt;/p&gt;

&lt;p&gt;Same intuition as the ResNet skip connection. Different problem, same fix.&lt;/p&gt;
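
&lt;p&gt;The gate equations above, as a single-step sketch (weights random here; a trained LSTM learns them):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W_f, W_i, W_c, W_o):
    hx = np.concatenate([h_prev, x])      # [h(t-1), x(t)]
    f = sigmoid(W_f @ hx)                 # forget gate: keep or wipe old memory
    i = sigmoid(W_i @ hx)                 # input gate: worth remembering?
    candidate = np.tanh(W_c @ hx)
    o = sigmoid(W_o @ hx)                 # output gate: what to act on now
    c = f * c_prev + i * candidate        # additive update: the gradient highway
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(5)
d_h, d_x = 8, 4
W = [rng.normal(size=(d_h, d_h + d_x)) for _ in range(4)]
h, c = lstm_step(rng.normal(size=d_x), np.zeros(d_h), np.zeros(d_h), *W)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
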




&lt;h2&gt;
  
  
  GRU: Fluency With Less Overhead
&lt;/h2&gt;

&lt;p&gt;Once reading becomes fluent, you don't consciously run through all three questions at every word. Most decisions are automatic: keep reading, update the picture, move on.&lt;/p&gt;

&lt;p&gt;The Gated Recurrent Unit is that streamlined version. It merges the cell state and hidden state into one, and uses two gates instead of three:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Reset gate:   how much past context to use for the new candidate
              r = sigmoid( W_r × [h(t-1), x(t)] )

Update gate:  how much to blend old state with new candidate
              z = sigmoid( W_z × [h(t-1), x(t)] )

Update:
  candidate: h̃ = tanh( W × [r × h(t-1), x(t)] )
  hidden:    h(t) = (1-z) × h(t-1)  +  z × h̃
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fewer parameters, similar performance. The update gate does double duty, controlling both forgetting and writing in one operation. In practice, LSTMs and GRUs perform comparably. GRUs train faster; LSTMs have slightly more expressive memory. Most practitioners try both.&lt;/p&gt;
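
&lt;p&gt;And the GRU step, with its two gates and single state (again with random stand-in weights):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, W_r, W_z, W):
    hx = np.concatenate([h_prev, x])
    r = sigmoid(W_r @ hx)                          # reset: how much past to use
    z = sigmoid(W_z @ hx)                          # update: blend old vs new
    candidate = np.tanh(W @ np.concatenate([r * h_prev, x]))
    return (1 - z) * h_prev + z * candidate        # one gate does forget + write

rng = np.random.default_rng(6)
d_h, d_x = 8, 4
h = gru_step(rng.normal(size=d_x), np.zeros(d_h),
             rng.normal(size=(d_h, d_h + d_x)),
             rng.normal(size=(d_h, d_h + d_x)),
             rng.normal(size=(d_h, d_h + d_x)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
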




&lt;h2&gt;
  
  
  Layer Normalization: The Normalization That Fits Sequences
&lt;/h2&gt;

&lt;p&gt;In Post 7, batch normalization stabilized deep networks by normalizing across the batch. But RNNs have a problem with batch norm. Sequences have variable lengths, and the hidden state carries information across steps. Normalizing across a batch of sequences at each time step is unstable.&lt;/p&gt;

&lt;p&gt;Layer normalization fixes this by normalizing across the &lt;em&gt;features&lt;/em&gt; of each individual sample, not across the batch. Same idea, different axis. Completely independent of batch size and sequence length.&lt;/p&gt;

&lt;p&gt;This is why layer norm became the standard for all sequence models and why every modern LLM uses it. When we get to Transformers in Post 10, it'll be everywhere.&lt;/p&gt;
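
&lt;p&gt;The whole idea fits in a few lines: normalize each sample over its own feature axis, so batch size and sequence length never enter the computation (the learned gain and bias parameters are omitted for brevity):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)   # per-sample, per-timestep stats
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(7)
h = rng.normal(loc=3.0, scale=10.0, size=(2, 5, 16))  # (batch, time, features)
out = layer_norm(h)
# every (batch, time) position now has mean ~0 and variance ~1 over its features
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
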




&lt;h2&gt;
  
  
  What Clicked for Me
&lt;/h2&gt;

&lt;p&gt;The reading analogy didn't just help me explain RNNs — it helped me understand what the hidden state actually &lt;em&gt;is&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;It's not a recording of the past. It's a compressed summary of the parts of history that seem relevant for predicting what comes next. Just like a fluent reader doesn't remember the exact words from three pages ago, but does remember that the detective is suspicious of the butler.&lt;/p&gt;




&lt;h2&gt;
  
  
  Interactive Playground
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;08-rnn
streamlit run rnn_playground.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/rnilav/perceptrons-to-transformers/tree/main/08-rnn" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Train both models, then pick a sentence length and watch the confidence bars update word by word; you'll see exactly the step where the vanilla RNN changes its mind and the LSTM doesn't.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;RNNs gave networks memory. But they process sequences step by step: slow, sequential, and still limited by how far gradients can travel, even with LSTM.&lt;/p&gt;

&lt;p&gt;There's a deeper problem too. The hidden state has to compress &lt;em&gt;everything&lt;/em&gt; seen so far into a fixed-size vector. For long sequences, that bottleneck loses information no matter how good the gating is.&lt;/p&gt;

&lt;p&gt;Post 9 introduces &lt;strong&gt;Attention Mechanisms&lt;/strong&gt;: a way for the network to directly look back at any part of the input sequence it needs, regardless of distance. No compression bottleneck. No sequential processing. No hoping the gradient survives 100 time steps.&lt;/p&gt;

&lt;p&gt;It's the idea that made RNNs obsolete — and made Transformers possible.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Hochreiter, S., &amp;amp; Schmidhuber, J.&lt;/strong&gt; (1997). &lt;em&gt;Long Short-Term Memory&lt;/em&gt;. Neural Computation, 9(8), 1735–1780.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cho, K., et al.&lt;/strong&gt; (2014). &lt;em&gt;Learning Phrase Representations using RNN Encoder-Decoder&lt;/em&gt;. EMNLP.&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>rnn</category>
      <category>lstm</category>
      <category>ai</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Understanding Internal Covariate Shift and Residual Connections: Beyond Activation Functions and Optimizers</title>
      <dc:creator>Nilavukkarasan R</dc:creator>
      <pubDate>Sat, 11 Apr 2026 14:46:14 +0000</pubDate>
      <link>https://dev.to/rnilav/understanding-internal-covariate-shift-and-residual-connections-beyond-activation-functions-and-2c8</link>
      <guid>https://dev.to/rnilav/understanding-internal-covariate-shift-and-residual-connections-beyond-activation-functions-and-2c8</guid>
      <description>&lt;p&gt;&lt;em&gt;"No man ever steps in the same river twice, for it's not the same river and he's not the same man"&lt;/em&gt; - &lt;strong&gt;Heraclitus&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;When Going Deeper Made Things Worse&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In my &lt;a href="https://dev.to/rnilav/from-generalists-to-specialists-the-cnn-shift-1h1d"&gt;last post&lt;/a&gt;, we built CNNs that could see. Filters learned edges. Pooling built spatial tolerance. Stack enough layers and the network recognizes digits, faces, objects.&lt;/p&gt;

&lt;p&gt;So the obvious next move: go deeper. More layers, more capacity, more power.&lt;/p&gt;

&lt;p&gt;But there is a catch.&lt;/p&gt;

&lt;p&gt;Researchers took a 20-layer network and added 36 more layers. The 56-layer network should have been better. More parameters, more room to learn. Instead, it was &lt;em&gt;worse&lt;/em&gt;. Not just on test data, but on &lt;em&gt;training&lt;/em&gt; data as well.&lt;/p&gt;

&lt;p&gt;That's not overfitting. Overfitting means you're too good on training data. This was the opposite: a bigger network that couldn't even fit the data it was trained on.&lt;/p&gt;

&lt;p&gt;Two things were broken. And fixing them required two elegant ideas.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Noisy Room Problem
&lt;/h2&gt;

&lt;p&gt;Imagine you're at a loud party, trying to follow a conversation. The room is packed, music is blasting, five other conversations are happening around you. Your brain doesn't give up, it does something remarkable. It filters out the noise, locks onto the voice you care about, and normalizes the signal so you can follow along.&lt;/p&gt;

&lt;p&gt;You do this automatically, without thinking. But a neural network? It has no such mechanism.&lt;/p&gt;

&lt;p&gt;Here's what actually happens inside a deep network during training. Each layer transforms its input and passes it to the next. A small shift in one layer's output gets amplified by the next layer, which gets amplified again, and again. After 20 layers, the signal has either exploded into enormous numbers that saturate neurons, or collapsed into near-zero values that carry no information.&lt;/p&gt;

&lt;p&gt;The network is trying to learn in a room that keeps getting louder.&lt;/p&gt;

&lt;p&gt;That's the &lt;strong&gt;internal covariate shift&lt;/strong&gt; problem. The distribution of each layer's input keeps changing as weights update. Every layer is chasing a moving target.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Batch normalization&lt;/strong&gt; is the fix.&lt;/p&gt;




&lt;h2&gt;
  
  
  Batch Normalization: Tuning Out the Noise
&lt;/h2&gt;

&lt;p&gt;Before each layer processes its input, normalize it. Force it to have zero mean and unit variance. Then let the network re-scale with two learned parameters: &lt;code&gt;γ&lt;/code&gt; (gamma) and &lt;code&gt;β&lt;/code&gt; (beta).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# For each mini-batch:
compute mean and variance of the inputs
normalize: x_norm = (x - mean) / sqrt(variance)

# Then re-scale with learned parameters:
output = γ * x_norm + β
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
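&lt;p&gt;The pseudocode maps almost line-for-line to NumPy. A minimal sketch of my own (not from the original batch norm paper; the small epsilon is the standard guard against dividing by zero):&lt;/p&gt;

```python
import numpy as np

# Batch norm forward pass, normalizing each feature across the mini-batch.
def batch_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)                    # statistics over the batch
    var = x.var(axis=0)
    x_norm = (x - mean) / np.sqrt(var + eps)
    return gamma * x_norm + beta             # learned re-scale and shift

np.random.seed(0)
x = np.random.randn(32, 4) * 5 + 3           # noisy activations: mean 3, std 5
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(4), out.std(axis=0).round(4))  # roughly 0 and 1
```

&lt;p&gt;Whatever the input distribution, the output comes back with mean roughly 0 and standard deviation roughly 1, until &lt;code&gt;γ&lt;/code&gt; and &lt;code&gt;β&lt;/code&gt; learn otherwise.&lt;/p&gt;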



&lt;p&gt;The network can undo the normalization if it needs to. &lt;code&gt;γ&lt;/code&gt; and &lt;code&gt;β&lt;/code&gt; are learned. But now every layer starts from a stable, predictable baseline. The moving target stops moving.&lt;/p&gt;

&lt;p&gt;Going back to the party analogy: batch norm is your brain's noise-cancellation. It doesn't remove the signal, it strips out the irrelevant variation so the important information comes through clearly.&lt;/p&gt;

&lt;p&gt;The effect on training is immediate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Without batch norm:
  Layer 5 output:  mean=2.3,  std=4.7
  Layer 10 output: mean=18.4, std=31.2   ← signal exploding
  Layer 20 output: mean=NaN              ← training collapsed

With batch norm:
  Layer 5 output:  mean≈0, std≈1
  Layer 10 output: mean≈0, std≈1
  Layer 20 output: mean≈0, std≈1         ← stable all the way down
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three things happen when you add batch norm:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Activations stay stable — no more explosions or collapses&lt;/li&gt;
&lt;li&gt;You can use much higher learning rates — the stable baseline means bigger steps are safe&lt;/li&gt;
&lt;li&gt;Weight initialization matters less — you no longer need to be as careful about starting values&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;One subtle thing worth knowing: batch norm uses statistics computed across the current mini-batch. At inference time, you might be predicting on a single example, with no batch to compute statistics from. So during training, batch norm accumulates running averages of the mean and variance. At inference, it uses those instead.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Vanishing Gradient: A Deeper Problem
&lt;/h2&gt;

&lt;p&gt;Batch norm stabilizes the forward pass. But there's a second problem, and it lives in the backward pass.&lt;/p&gt;

&lt;p&gt;Backpropagation multiplies derivatives together as it moves backward through the network. Each layer contributes a factor. If those factors are consistently less than 1, which they often are, the gradient shrinks with every layer it passes through.&lt;/p&gt;

&lt;p&gt;By the time it reaches layer 1 of a 50 layer network, the gradient might be effectively zero. The early layers stop learning entirely.&lt;/p&gt;

&lt;p&gt;This is why the 56-layer network performed worse than the 20-layer one. It wasn't a capacity problem. The early layers simply weren't getting any useful gradient signal. They were frozen.&lt;/p&gt;




&lt;h2&gt;
  
  
  Residual Connections: The Shortcut
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;"If I have seen further, it is by standing on the shoulders of giants."&lt;/em&gt;&lt;br&gt;
— &lt;strong&gt;Isaac Newton&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of learning a full transformation, a residual block learns the &lt;em&gt;difference&lt;/em&gt; from identity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Normal layer:
output = transform(x)

# Residual block:
output = transform(x) + x    ← just add the input back
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;+ x&lt;/code&gt; is the skip connection. The input bypasses the learned transformation and gets added back at the output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it changes the chain rule.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In a normal layer, backprop applies the chain rule like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;∂L/∂x = ∂L/∂output × F'(x)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The gradient gets multiplied by &lt;code&gt;F'(x)&lt;/code&gt; at every layer. If that's 0.1, after 50 layers you're multiplying fifty 0.1s together, and the gradient reaches layer 1 as essentially zero.&lt;/p&gt;

&lt;p&gt;With a residual block, &lt;code&gt;output = F(x) + x&lt;/code&gt;, so the chain rule becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;∂L/∂x = ∂L/∂output × (F'(x) + 1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;+ 1&lt;/code&gt; comes from differentiating the skip connection &lt;code&gt;x&lt;/code&gt; with respect to &lt;code&gt;x&lt;/code&gt;; the derivative of a straight passthrough is always 1. Now instead of multiplying fifty 0.1s, you're multiplying fifty 1.1s. The gradient stays alive all the way back to layer 1.&lt;/p&gt;
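&lt;p&gt;You can check that arithmetic directly. The 0.1-per-layer derivative is the same illustrative number as above, not a measured value:&lt;/p&gt;

```python
# Gradient magnitude after 50 layers: plain chain vs. residual chain.
plain_factor = 0.1          # per-layer derivative F'(x), illustrative
residual_factor = 0.1 + 1   # same derivative, plus the skip connection's 1

plain_gradient = plain_factor ** 50       # vanishes: about 1e-50
residual_gradient = residual_factor ** 50 # survives: about 117

print(plain_gradient, residual_gradient)
```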

&lt;p&gt;Before ResNets, the practical limit for trainable networks was around 20 layers. After ResNets, researchers trained a 1,202-layer network. Not because they needed 1,202 layers, but to prove they could.&lt;/p&gt;

&lt;p&gt;That distinction, &lt;strong&gt;capacity vs. trainability&lt;/strong&gt;, is one of the most important ideas in deep learning.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bigger Picture: How Everything Fits Together
&lt;/h2&gt;

&lt;p&gt;At this point in the series, it's worth stepping back. A lot of concepts have been introduced, and it can start to feel like an ever-growing list of tricks. It's not. Each one solved a specific, concrete failure mode:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;What Goes Wrong&lt;/th&gt;
&lt;th&gt;Solution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dying neurons&lt;/td&gt;
&lt;td&gt;Neurons output zero forever, stop learning&lt;/td&gt;
&lt;td&gt;ReLU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vanishing gradients&lt;/td&gt;
&lt;td&gt;Gradients too small to reach early layers&lt;/td&gt;
&lt;td&gt;ReLU + careful init&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Exploding gradients&lt;/td&gt;
&lt;td&gt;Gradients too large, training diverges&lt;/td&gt;
&lt;td&gt;Gradient clipping, Adam&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Slow convergence&lt;/td&gt;
&lt;td&gt;Hard to find a good learning rate&lt;/td&gt;
&lt;td&gt;Adam optimizer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Internal covariate shift&lt;/td&gt;
&lt;td&gt;Each layer's inputs keep shifting distribution&lt;/td&gt;
&lt;td&gt;Batch Norm&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Degradation problem&lt;/td&gt;
&lt;td&gt;Deeper networks perform &lt;em&gt;worse&lt;/em&gt; than shallow ones&lt;/td&gt;
&lt;td&gt;Skip connections (ResNet)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These aren't redundant; they're complementary. ReLU keeps neurons alive. Adam navigates the loss landscape efficiently. Batch norm stabilizes the signal between layers. Skip connections ensure gradients reach the beginning. Each one patches a gap the others can't cover.&lt;/p&gt;

&lt;p&gt;Together, they form the foundation that makes modern deep networks trainable. You'll see all of them again — in every architecture from here on out.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;We can now train deep networks. But depth alone doesn't solve every problem.&lt;/p&gt;

&lt;p&gt;Images have spatial structure, CNNs exploit that. But what about sequences? Text, audio, time series, data where &lt;em&gt;order&lt;/em&gt; matters and context can span hundreds of steps?&lt;/p&gt;

&lt;p&gt;Post 8 introduces &lt;strong&gt;Recurrent Neural Networks&lt;/strong&gt;: architectures with memory, where the output at each step depends on everything that came before. And you'll see immediately why the vanishing gradient problem, which we just solved for depth, comes back with a vengeance for long sequences.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Ioffe, S., &amp;amp; Szegedy, C.&lt;/strong&gt; (2015). &lt;em&gt;Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift&lt;/em&gt;. ICML.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;He, K., Zhang, X., Ren, S., &amp;amp; Sun, J.&lt;/strong&gt; (2016). &lt;em&gt;Deep Residual Learning for Image Recognition&lt;/em&gt;. CVPR.&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>deeplearning</category>
      <category>residualconnections</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>From Generalists to Specialists: The CNN Shift</title>
      <dc:creator>Nilavukkarasan R</dc:creator>
      <pubDate>Tue, 31 Mar 2026 13:28:17 +0000</pubDate>
      <link>https://dev.to/rnilav/from-generalists-to-specialists-the-cnn-shift-1h1d</link>
      <guid>https://dev.to/rnilav/from-generalists-to-specialists-the-cnn-shift-1h1d</guid>
      <description>&lt;p&gt;&lt;em&gt;"Vision is the art of seeing what is invisible to others."&lt;/em&gt; &lt;strong&gt;Jonathan Swift&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;When Regularization Wasn't Enough&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In my &lt;a href="https://dev.to/rnilav/regularization-fighting-overfitting-2pj"&gt;last post&lt;/a&gt;, I showed you how dropout and weight decay stop a network from memorizing training data. We trained on MNIST, closed the generalization gap, and got a network that actually works in the real world.&lt;/p&gt;

&lt;p&gt;It felt like we'd finally solved it.&lt;/p&gt;

&lt;p&gt;But then I tried it on a real photograph. Not 28×28 grayscale digits. A 224×224 color image.&lt;/p&gt;

&lt;p&gt;The math was brutal.&lt;/p&gt;

&lt;p&gt;224 × 224 × 3 = 150,528 inputs&lt;br&gt;
Connect those to 1,000 neurons: 150 million parameters&lt;br&gt;
Just for the first layer. Before learning anything useful.&lt;/p&gt;

&lt;p&gt;We needed a different idea entirely. And it came, as the best ideas often do, from biology.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Visual Cortex Moment
&lt;/h2&gt;

&lt;p&gt;In 1959, neuroscientists David Hubel and Torsten Wiesel did something remarkable. They inserted electrodes into a cat's visual cortex and projected shapes onto a screen. They were trying to find what made individual neurons fire.&lt;/p&gt;

&lt;p&gt;Most shapes did nothing. Then, almost by accident, they moved a glass slide and cast a thin line of light across the screen.&lt;/p&gt;

&lt;p&gt;One neuron went wild.&lt;/p&gt;

&lt;p&gt;Not all neurons. Not the whole cortex. &lt;em&gt;One specific neuron&lt;/em&gt;, responding to &lt;em&gt;one specific edge&lt;/em&gt;, at &lt;em&gt;one specific orientation&lt;/em&gt;, in &lt;em&gt;one specific region&lt;/em&gt; of the visual field.&lt;/p&gt;

&lt;p&gt;They kept experimenting. Different neurons responded to different orientations: horizontal edges, vertical edges, diagonal edges. Each neuron had a small &lt;strong&gt;receptive field&lt;/strong&gt;: a limited patch of the visual field it paid attention to. Neurons in later areas responded to more complex patterns: corners, curves, eventually whole shapes.&lt;/p&gt;

&lt;p&gt;The visual cortex isn't a fully connected blob. It's a hierarchy of local detectors, each building on the one before it.&lt;/p&gt;

&lt;p&gt;That insight, decades later, became the blueprint for CNNs.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Problem with Fully Connected Networks on Images
&lt;/h2&gt;

&lt;p&gt;Here's what a fully-connected network does to an image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FC Network sees an image as:
┌─────────────────────────────────────────────────────┐
│  pixel_1, pixel_2, pixel_3, ..., pixel_150528       │
│  (all spatial structure destroyed, every pixel      │
│   connected to every neuron with separate weights)  │
└─────────────────────────────────────────────────────┘

Every neuron: "I must learn about ALL 150,528 pixels equally."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is wasteful in two ways. First, a pixel in the top-left corner has almost nothing to do with a pixel in the bottom-right corner, but the network treats them as equally related. Second, if a cat's ear appears in the top-left of one image and the top-right of another, the network needs &lt;em&gt;separate neurons&lt;/em&gt; to detect it in each location.&lt;/p&gt;

&lt;p&gt;A CNN thinks differently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CNN sees an image as:
┌──────────────────────────────────────────────────────┐
│  ┌───┐  ┌───┐  ┌───┐                                │
│  │ F │  │ F │  │ F │  ← same filter, sliding across │
│  └───┘  └───┘  └───┘    "Is there an edge here?"    │
│     ↘      ↓      ↙                                  │
│      [feature map]                                   │
└──────────────────────────────────────────────────────┘

Every filter: "I detect ONE pattern, ANYWHERE in the image."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same filter, applied everywhere. One set of weights to detect vertical edges across the entire image. This is &lt;strong&gt;weight sharing&lt;/strong&gt;, and it's the core reason CNNs work.&lt;/p&gt;

&lt;p&gt;The parameter comparison is stark:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;FC Network&lt;/th&gt;
&lt;th&gt;CNN (typical)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Input&lt;/td&gt;
&lt;td&gt;224×224×3 = 150K&lt;/td&gt;
&lt;td&gt;224×224×3 = 150K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;First layer params&lt;/td&gt;
&lt;td&gt;150K × 1000 = &lt;strong&gt;150M&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;96 filters × 11×11×3 = &lt;strong&gt;35K&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Assumption&lt;/td&gt;
&lt;td&gt;all pixels equally related&lt;/td&gt;
&lt;td&gt;nearby pixels are related&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For the full mathematical breakdown of parameter counts across FC, LeNet, AlexNet, VGG, and ResNet, see &lt;a href="https://github.com/rnilav/perceptrons-to-transformers/blob/main/06-cnn/CNN_ARCHITECTURE_DEEP_DIVE.md" rel="noopener noreferrer"&gt;&lt;code&gt;CNN_ARCHITECTURE_DEEP_DIVE.md&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
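&lt;p&gt;The first-layer numbers in the table take three lines to verify (biases omitted for simplicity):&lt;/p&gt;

```python
# First-layer parameter counts from the table above (biases omitted).
inputs = 224 * 224 * 3             # 150,528 input values

fc_params = inputs * 1000          # every pixel wired to every neuron
cnn_params = 96 * (11 * 11 * 3)    # 96 shared filters, each 11x11x3

print(fc_params, cnn_params)       # roughly 150M vs. 35K
```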

&lt;h2&gt;
  
  
  Key Concepts, Grounded in Biology
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Local Receptive Field
&lt;/h3&gt;

&lt;p&gt;In the visual cortex, each neuron only responds to a small patch of the visual field. It's &lt;em&gt;local&lt;/em&gt;. It doesn't see the whole image—just its neighborhood.&lt;/p&gt;

&lt;p&gt;In a CNN, each filter application does the same thing. A 3×3 filter looks at a 3×3 patch of the image. That's its receptive field. It asks: "Does my pattern exist in this small region?"&lt;/p&gt;

&lt;p&gt;As you go deeper in the network, receptive fields grow. A neuron in layer 3 has seen the outputs of layer 2, which saw layer 1, which saw the raw pixels. So it effectively "sees" a larger region, just like neurons deeper in the visual cortex respond to larger, more complex patterns.&lt;/p&gt;

&lt;h3&gt;
  
  
  Learnable Filter (The Edge Detector)
&lt;/h3&gt;

&lt;p&gt;A filter is just a small grid of numbers—say 3×3. During training, backpropagation (the same algorithm from &lt;a href="https://dev.to/rnilav/3-backpropagation-errors-flow-backward-knowledge-flows-forward-5320"&gt;Post 3&lt;/a&gt;) adjusts these numbers until the filter detects something useful. One filter might learn to detect vertical edges. Another learns horizontal edges. Another learns a specific texture.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Learned vertical edge filter:    Learned horizontal edge filter:
[-1  0  1]                        [-1 -2 -1]
[-2  0  2]                        [ 0  0  0]
[-1  0  1]                        [ 1  2  1]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
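&lt;p&gt;You can watch the vertical-edge filter "fire" with a few lines of NumPy. The 3×3 patch below is made up for illustration: dark pixels on the left, bright on the right, i.e. a vertical edge:&lt;/p&gt;

```python
import numpy as np

# The learned vertical-edge filter from above.
vertical_filter = np.array([[-1, 0, 1],
                            [-2, 0, 2],
                            [-1, 0, 1]])

# Tiny image patch: dark on the left, bright on the right (a vertical edge).
patch = np.array([[0, 0, 9],
                  [0, 0, 9],
                  [0, 0, 9]])

response = (vertical_filter * patch).sum()  # one step of the convolution
print(response)  # strong positive response: the pattern is present
```

&lt;p&gt;On a flat patch (all zeros or all nines), the same sum comes out to exactly zero: the filter only responds when its pattern is present.&lt;/p&gt;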



&lt;p&gt;The network learns these automatically; no human designs them. That's the power.&lt;/p&gt;

&lt;h3&gt;
  
  
  Feature Map
&lt;/h3&gt;

&lt;p&gt;When you slide a filter across an image, you get a &lt;strong&gt;feature map&lt;/strong&gt;: a 2D grid showing &lt;em&gt;where&lt;/em&gt; that filter's pattern was detected and &lt;em&gt;how strongly&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Think of it like a heat map. Apply a vertical-edge filter to a photo of a face, and the feature map lights up along the sides of the nose, the edges of the eyes, the outline of the jaw. Dark where there are no vertical edges. Bright where there are.&lt;/p&gt;

&lt;p&gt;Stack 32 filters and you get 32 feature maps, 32 different "views" of the same image, each highlighting a different pattern.&lt;/p&gt;

&lt;h3&gt;
  
  
  Padding
&lt;/h3&gt;

&lt;p&gt;Here's a practical problem: if you slide a 3×3 filter across a 5×5 image, the filter can't be centered on the edge pixels. You lose a border of information, and the output shrinks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Padding&lt;/strong&gt; adds a ring of zeros around the image before applying the filter. This lets the filter visit every pixel, including the edges, and preserves the spatial dimensions.&lt;/p&gt;

&lt;p&gt;It's like giving your peripheral vision a bit of extra context at the boundary of your visual field.&lt;/p&gt;
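&lt;p&gt;The shrinkage is easy to quantify. With an n-wide input, a k-wide filter, p pixels of zero padding per side, and stride 1, the output width is n + 2p - k + 1. This is the standard convolution arithmetic, not something specific to this post:&lt;/p&gt;

```python
def conv_output_size(n, k, p):
    """Output width for an n-wide input, a k-wide filter, p pixels of
    zero padding on each side, and stride 1."""
    return n + 2 * p - k + 1

print(conv_output_size(5, 3, 0))  # 3: the 5x5 image shrinks without padding
print(conv_output_size(5, 3, 1))  # 5: one ring of zeros preserves the size
```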

&lt;h3&gt;
  
  
  Pooling (Spatial Summarization)
&lt;/h3&gt;

&lt;p&gt;After detecting features, we don't need to track &lt;em&gt;exactly&lt;/em&gt; where they appeared—just roughly where. This is &lt;strong&gt;pooling&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Max pooling takes a small window (say 2×2) and keeps only the strongest activation. It's like asking: "Did this feature appear &lt;em&gt;anywhere&lt;/em&gt; in this region?" The exact location doesn't matter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Feature map:          After 2×2 max pooling:
[1  3  2  4]          [6  4]
[5  6  1  2]    →     [8  7]
[3  8  4  7]
[1  2  6  3]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
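&lt;p&gt;The same pooling step in NumPy, reusing the feature map above. The reshape-and-swap trick is just one idiomatic way to carve the grid into 2×2 blocks:&lt;/p&gt;

```python
import numpy as np

feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 1, 2],
                        [3, 8, 4, 7],
                        [1, 2, 6, 3]])

# 2x2 max pooling: split into 2x2 blocks, keep the max of each block.
blocks = feature_map.reshape(2, 2, 2, 2).swapaxes(1, 2)
pooled = blocks.max(axis=(2, 3))
print(pooled)  # [[6 4] [8 7]], matching the example above
```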



&lt;p&gt;This does three things: reduces the spatial size (fewer parameters downstream), makes the network tolerant to small shifts in position (translation invariance), and forces the network to summarize rather than memorize exact locations.&lt;/p&gt;

&lt;p&gt;Your visual cortex doesn't care if a face is shifted 5 pixels left. You still recognize it. Pooling builds that tolerance in.&lt;/p&gt;

&lt;h3&gt;
  
  
  ReLU — Still Here
&lt;/h3&gt;

&lt;p&gt;The activation function hasn't changed. After each convolution, we still apply ReLU (from &lt;a href="https://dev.to/rnilav/understanding-ai-from-first-principles-multi-layer-perceptrons-and-the-hidden-layer-breakthrough-44pl"&gt;Post 2&lt;/a&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;convolution_result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Negative activations become zero. Positive ones pass through. Same reason as before: it introduces non-linearity and avoids the vanishing gradient problem we discussed in &lt;a href="https://dev.to/rnilav/3-backpropagation-errors-flow-backward-knowledge-flows-forward-5320"&gt;Post 3&lt;/a&gt;. The building blocks carry forward.&lt;/p&gt;

&lt;h2&gt;
  
  
  Putting It Together: The CNN Pipeline
&lt;/h2&gt;

&lt;p&gt;A CNN is just these ideas stacked in sequence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input Image
    ↓
[Conv → ReLU] × N     ← learn local patterns (like V1 cortex)
    ↓
[Pooling]             ← summarize, reduce size
    ↓
[Conv → ReLU] × N     ← learn combinations of patterns (like V2/V4)
    ↓
[Pooling]
    ↓
Flatten
    ↓
[Fully Connected]     ← classify based on learned features
    ↓
Softmax → Prediction
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
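&lt;p&gt;For a 28×28 MNIST digit, the shapes through a small pipeline like this work out as follows. The filter counts (8 and 16) are illustrative choices, and the convolutions are assumed to be padded so they preserve spatial size:&lt;/p&gt;

```python
# Dimension flow for a 28x28 grayscale input (illustrative filter counts).
steps = [
    ("input",        (28, 28, 1)),
    ("conv1 + relu", (28, 28, 8)),    # 8 padded filters: size preserved
    ("maxpool 2x2",  (14, 14, 8)),    # spatial size halved
    ("conv2 + relu", (14, 14, 16)),   # 16 padded filters
    ("maxpool 2x2",  (7, 7, 16)),
]
for name, shape in steps:
    print(name, shape)

flat = 7 * 7 * 16                     # flattened for the FC classifier
print("flatten:", flat)
```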



&lt;p&gt;The early layers learn edges and textures. Middle layers combine those into shapes. Deep layers combine shapes into objects. It's the same hierarchy Hubel and Wiesel found in the cat's brain, just learned from data instead of evolution.&lt;/p&gt;

&lt;p&gt;Training still uses backpropagation and Adam. The same gradient flow, the same weight updates. CNNs didn't replace what we built, they extended it with a smarter architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  CNN Architectures: A Brief Lineage
&lt;/h2&gt;

&lt;p&gt;The core ideas — local receptive fields, learnable filters, pooling — stayed constant. What changed over the years was how deep, how wide, and how trainable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LeNet (1998)&lt;/strong&gt; was the proof of concept. Two conv layers, two pooling layers, three fully-connected layers. Trained on handwritten digits. ~60K parameters. It worked, but the hardware and data of the time couldn't push it further.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AlexNet (2012)&lt;/strong&gt; was the moment everything changed. Five conv layers, three FC layers, ~60M parameters — trained on GPUs for the first time. It won ImageNet by a margin that shocked the field. The key additions: ReLU activations (faster training), dropout (regularization), and data augmentation. The deep learning era started here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VGG (2014)&lt;/strong&gt; asked: what if we just go deeper, but keep it simple? Only 3×3 filters, stacked in blocks. 16–19 layers, ~138M parameters. It showed that depth itself was the driver of accuracy, but the three large FC layers at the end were a parameter bottleneck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ResNet (2015)&lt;/strong&gt; solved the problem VGG exposed: beyond ~20 layers, adding more layers actually &lt;em&gt;hurts&lt;/em&gt; accuracy. Not from overfitting — from gradients vanishing before they reach early layers. ResNet's fix was elegant: skip connections that let gradients bypass layers entirely. Suddenly 50-layer, 152-layer networks were trainable. ResNet-50 achieves better accuracy than VGG-16 with 5× fewer parameters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LeNet (1998)    →  AlexNet (2012)   →  VGG (2014)      →  ResNet (2015)
~60K params        ~60M params         ~138M params        ~25M params
proof of concept   GPU + ReLU          depth matters       skip connections
                   revolution          but costly          solve depth
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  CNNs Are Built for Images — Not Text
&lt;/h2&gt;

&lt;p&gt;This is worth saying explicitly, because it's easy to assume CNNs are a general-purpose upgrade to fully-connected networks. They're not.&lt;/p&gt;

&lt;p&gt;CNNs work because of two assumptions baked into the architecture:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Nearby inputs are related&lt;/strong&gt; — a pixel's neighbors matter more than distant pixels&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The same pattern can appear anywhere&lt;/strong&gt; — weight sharing makes sense because an edge in the top-left is the same edge as one in the bottom-right&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Images satisfy both assumptions perfectly. So do audio spectrograms and video frames.&lt;/p&gt;

&lt;p&gt;Text doesn't. The word "not" next to "good" completely changes the meaning — but "not" next to "bad" means something different again. Context in language isn't local and positional the way it is in images. The same word means different things in different positions. Weight sharing across positions doesn't make the same kind of sense.&lt;/p&gt;

&lt;p&gt;That's why text needs different architectures — RNNs that process sequences step by step, and eventually Transformers that learn which words to pay attention to regardless of distance. CNNs are a specialized tool, and their specialization is spatial data.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Clicked for Me
&lt;/h2&gt;

&lt;p&gt;I kept thinking about Hubel and Wiesel's cat. One neuron, one edge orientation. Seemed pointless.&lt;/p&gt;

&lt;p&gt;Then it clicked: a neuron that responds to everything is useless. A neuron that responds to exactly one pattern, reliably, anywhere—that's signal. That's a building block.&lt;/p&gt;

&lt;p&gt;Fully-connected networks try to be generalists from pixel one. CNNs start with specialists and compose up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Interactive Playground
&lt;/h2&gt;

&lt;p&gt;I've built an interactive &lt;a href="https://github.com/rnilav/perceptrons-to-transformers/tree/main/06-cnn" rel="noopener noreferrer"&gt;playground&lt;/a&gt; where you can watch a CNN in action. It has two tabs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tab 1: FC Network vs CNN&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Both models are trained from scratch on the same 1,000-sample MNIST subset using pure NumPy and Adam — the same setup from Post 4. You can adjust the FC hidden layer size, the number of CNN filters, the number of epochs (up to 20), and the batch size, then hit &lt;strong&gt;Train both models&lt;/strong&gt; to run real training.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tab 2: CNN Layer Explorer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Pick any of the digits 0, 1, 6, or 8 and explore three views:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;What each filter detects&lt;/strong&gt; — shows the raw filter weights (3×3 grid) alongside the response heatmap on your chosen digit. Bright yellow means "this pattern is strongly present here." You can see how a vertical-edge filter lights up along strokes, while a blob filter responds to filled regions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Layer-by-layer pipeline&lt;/strong&gt; — traces your digit through the full network: Conv1+ReLU → MaxPool → Conv2+ReLU → MaxPool → Flatten → FC → Softmax. Each stage shows the actual feature map image with a caption explaining what happened and why. A dimension table below tracks the shape at every step.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MaxPool zoom-in&lt;/strong&gt; — takes a 4×4 patch from the conv output and shows the actual numerical values, then shows the 2×2 result after pooling. You can see exactly which values survived and why — the maximum in each 2×2 block wins.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What This Unlocked
&lt;/h2&gt;

&lt;p&gt;Before CNNs, computer vision meant handcrafting features. Researchers spent years designing SIFT descriptors, HOG features, edge detectors, all by hand. Then AlexNet (2012) showed that a CNN trained on enough data could learn better features automatically, and it wasn't close. The error rate dropped from 26% to 15% in one year.&lt;/p&gt;

&lt;p&gt;That was the moment the field changed.&lt;/p&gt;

&lt;p&gt;Every modern vision system—object detection, medical imaging, autonomous driving, face recognition—runs on some variant of this idea. Local receptive fields. Learnable filters. Hierarchical features. Weight sharing.&lt;/p&gt;

&lt;p&gt;All of it traced back to a cat, an electrode, and a sliding glass slide in 1959.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;We can now train CNNs that learn to see. But there's a catch: go deep enough, and training breaks down. Gradients vanish. Accuracy plateaus. Adding more layers actually &lt;em&gt;hurts&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Post 7 covers the two innovations that fixed this: &lt;strong&gt;Batch Normalization&lt;/strong&gt; (stabilize the activations between layers) and &lt;strong&gt;Residual Connections&lt;/strong&gt; (let gradients skip layers entirely). Together, they made 50-layer, 100-layer networks trainable—and unlocked the modern era of deep learning.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;LeCun et al.&lt;/strong&gt; (1998). &lt;em&gt;Gradient-Based Learning Applied to Document Recognition&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Krizhevsky et al.&lt;/strong&gt; (2012). &lt;em&gt;ImageNet Classification with Deep Convolutional Neural Networks&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=zfiSAzpy9NM&amp;amp;t=628s" rel="noopener noreferrer"&gt;Simple explanation of Convolutional Neural Network&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>convolutionalnetworks</category>
      <category>computervision</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Regularization: Fighting Overfitting</title>
      <dc:creator>Nilavukkarasan R</dc:creator>
      <pubDate>Fri, 13 Mar 2026 14:35:02 +0000</pubDate>
      <link>https://dev.to/rnilav/regularization-fighting-overfitting-2pj</link>
      <guid>https://dev.to/rnilav/regularization-fighting-overfitting-2pj</guid>
      <description>&lt;p&gt;&lt;em&gt;"Learning without thought is labor lost"&lt;/em&gt;  &lt;strong&gt;--Confucius&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;When Your Network Becomes a Memorizer&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In my &lt;a href="https://dev.to/rnilav/neural-network-optimizers-from-baby-steps-to-intelligent-learning-44po"&gt;last post&lt;/a&gt;, I showed you how to train a network on MNIST. Adam optimizer, mini-batches, 100 epochs. Training accuracy climbed over 99%.&lt;/p&gt;

&lt;p&gt;It felt like we'd solved it.&lt;/p&gt;

&lt;p&gt;But here's what I didn't show you: what happens if you keep training.&lt;/p&gt;

&lt;p&gt;Running the network for 200 epochs drives training accuracy over 99%. But test accuracy tells a different story: it peaks around 97% near epoch 50, then slowly drops as training continues.&lt;/p&gt;

&lt;p&gt;Why? Imagine studying for an exam by memorizing that "Question 5 is always B" instead of understanding why B is correct. You'd ace the practice test but fail when questions are reordered or rephrased. Neural networks do the same thing. They memorize training data so well they hit 99% accuracy, yet struggle with new examples because they never learned the underlying patterns.&lt;/p&gt;

&lt;p&gt;The network isn't learning anymore. It's memorizing. That's overfitting, and it's one of the core challenges in training neural networks.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Generalization Gap
&lt;/h2&gt;

&lt;p&gt;Here's what happens when you train a network without any safeguards:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Epoch 1:   Train Acc: 87.3%  Test Acc: 86.9%  Gap: 0.4%
Epoch 10:  Train Acc: 97.2%  Test Acc: 96.8%  Gap: 0.4%
Epoch 50:  Train Acc: 99.1%  Test Acc: 97.2%  Gap: 1.9%
Epoch 100: Train Acc: 99.7%  Test Acc: 96.8%  Gap: 2.9%
Epoch 200: Train Acc: 99.95% Test Acc: 96.1%  Gap: 3.85%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See that gap? It starts small while the network is learning generalizable patterns. But as training continues, the gap widens. The network is still improving on training data, but test accuracy stalls and then drops.&lt;/p&gt;

&lt;p&gt;This gap is the &lt;strong&gt;generalization gap&lt;/strong&gt;. It's the difference between what your network learned and what it actually understands.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Does This Happen?
&lt;/h2&gt;

&lt;p&gt;A network has three things working against it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Capacity:&lt;/strong&gt; Your network has 100,000 weights. Your training set has 60,000 examples. Mathematically, the network has enough capacity to memorize every single example.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Time:&lt;/strong&gt; Every epoch, the network sees the same training examples again. It gets more chances to memorize. After 200 epochs, it's seen each example 200 times. Memorization becomes easier than learning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. No penalty for complexity:&lt;/strong&gt; The network doesn't care if it uses simple patterns or complex ones. Both reduce training loss equally. So it drifts toward complexity, and complexity is what overfits.&lt;/p&gt;

&lt;p&gt;The solution? Force the network to generalize instead of memorize.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Regularization Toolkit: A Big Picture View
&lt;/h2&gt;

&lt;p&gt;Before we dive into specific techniques, let's zoom out and see the full landscape of solutions to overfitting. There are actually many ways to fix it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Reduce Model Capacity&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use smaller networks (fewer neurons, fewer layers)&lt;/li&gt;
&lt;li&gt;Prune weights after training&lt;/li&gt;
&lt;li&gt;Use simpler architectures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The idea: if your network is smaller, it simply can't memorize as much. But this is a blunt instrument: you might lose the ability to learn complex patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Increase Training Data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Collect more real data&lt;/li&gt;
&lt;li&gt;Use data augmentation (rotations, crops, noise for images; paraphrasing for text)&lt;/li&gt;
&lt;li&gt;Use synthetic data generation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The idea: with more diverse examples, memorization becomes harder. The network has to learn generalizable patterns to cover all the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Stop Training Early&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitor test accuracy during training&lt;/li&gt;
&lt;li&gt;Stop when test accuracy starts declining&lt;/li&gt;
&lt;li&gt;This is called "early stopping"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The idea: overfitting gets worse over time. Stop before it happens.&lt;/p&gt;
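&lt;p&gt;As a rough sketch (the &lt;code&gt;early_stop&lt;/code&gt; helper and the accuracy curve below are invented for illustration), patience-based early stopping might look like this:&lt;/p&gt;

```python
# Hypothetical patience-based early stopping. The accuracy curve is
# synthetic, standing in for test accuracy measured after each epoch.
def early_stop(accuracies, patience=3):
    """Return (best_epoch, best_acc), halting the scan once `patience`
    consecutive epochs fail to improve on the best accuracy so far."""
    best_acc, best_epoch, bad = 0.0, 0, 0
    for epoch, acc in enumerate(accuracies):
        if acc > best_acc:
            best_acc, best_epoch, bad = acc, epoch, 0
        else:
            bad += 1
            if bad == patience:
                break  # stop: no improvement for `patience` epochs
    return best_epoch, best_acc

# Synthetic curve: improves, peaks at epoch 3, then slowly declines.
curve = [0.90, 0.94, 0.96, 0.972, 0.970, 0.969, 0.968, 0.967]
epoch, acc = early_stop(curve)
print(epoch, acc)  # prints: 3 0.972
```

&lt;p&gt;In practice you would checkpoint the model weights at each new best epoch and restore that checkpoint when training stops.&lt;/p&gt;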

&lt;p&gt;&lt;strong&gt;4. Ensemble Methods&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Train multiple networks and average their predictions&lt;/li&gt;
&lt;li&gt;Use techniques like boosting or bagging&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The idea: multiple imperfect models often generalize better than one perfect model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Architectural Innovations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Skip connections (ResNets) allow training deeper networks that generalize better&lt;/li&gt;
&lt;li&gt;Attention mechanisms focus on relevant parts of the input&lt;/li&gt;
&lt;li&gt;Inductive biases (like convolutions for images) reduce the effective capacity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The idea: design the architecture to match the problem structure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Our Focus: Regularization Techniques
&lt;/h2&gt;

&lt;p&gt;In this post, we're going to deep-dive into two regularization techniques - &lt;strong&gt;dropout&lt;/strong&gt; and &lt;strong&gt;weight decay&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Why these two? Because they represent two different philosophies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dropout&lt;/strong&gt; prevents co-adaptation (neurons learning to work together in ways that only make sense for training data)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weight decay&lt;/strong&gt; encourages simplicity (smaller weights = simpler decision boundaries)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together, they form a powerful one-two punch against overfitting. And understanding them deeply will help you understand other regularization techniques.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dropout: An Ensemble of Smaller Networks
&lt;/h2&gt;

&lt;p&gt;Here's an idea: what if we randomly disabled neurons during training?&lt;/p&gt;

&lt;p&gt;Not permanently. Just during each forward pass.&lt;/p&gt;

&lt;p&gt;This sounds like sabotage. Why would we intentionally break our network?&lt;/p&gt;

&lt;p&gt;Because it forces the network to learn redundant representations.&lt;/p&gt;

&lt;p&gt;Think of it like this: imagine you're building a team to solve problems. If you always have the same 10 people, they'll specialize and depend on each other. Person A always handles data, Person B always handles logic. If Person A gets sick, the team fails.&lt;/p&gt;

&lt;p&gt;But if you randomly remove people from the team each day, they can't specialize. Everyone learns to do everything. The team becomes robust.&lt;/p&gt;

&lt;p&gt;That's dropout.&lt;/p&gt;

&lt;p&gt;When we randomly disable neurons during training, the network can’t rely on specific neurons to make a prediction. Instead, it must learn multiple pathways to the same answer. This redundancy prevents co-adaptation, i.e. neurons relying on each other in ways that only work for the training data.&lt;/p&gt;

&lt;p&gt;If a layer has n neurons, there are 2ⁿ possible subnetworks, depending on which neurons are active or dropped. During training, each mini-batch randomly samples one of these subnetworks.&lt;/p&gt;

&lt;p&gt;Imagine a hidden layer with 256 neurons and 50% dropout.&lt;/p&gt;

&lt;p&gt;Each mini-batch activates a different random subset of neurons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mini-batch 1 trains with neurons {1, 3, 5, 7, ..., 255}&lt;/li&gt;
&lt;li&gt;Mini-batch 2 trains with neurons {2, 4, 6, 8, ..., 256}&lt;/li&gt;
&lt;li&gt;Mini-batch 3 trains with neurons {1, 2, 4, 7, ..., 254}&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each subset forms a slightly different network. Over training, the model samples from an enormous space of subnetworks and learns weights that perform well across many of them.&lt;/p&gt;

&lt;p&gt;Modern implementations use inverted dropout.&lt;/p&gt;

&lt;p&gt;During training, we randomly drop neurons and scale the activations so that their expected value stays the same. At test time, we simply run the full network without any dropout.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Training: randomly disable neurons
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;training&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;mask&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;binomial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;dropout_rate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;X_dropped&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;mask&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;dropout_rate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Scale to maintain expected value
&lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;X_dropped&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;  &lt;span class="c1"&gt;# No dropout at test time
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;strong&gt;scaling factor&lt;/strong&gt; 1 / (1 - dropout_rate) is crucial.&lt;/p&gt;

&lt;p&gt;Without it, the magnitude of activations during training would be smaller than during inference, causing inconsistent predictions.&lt;/p&gt;

&lt;p&gt;By scaling during training, the expected activation remains the same whether dropout is active or not.  &lt;/p&gt;
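&lt;p&gt;This scaling claim is easy to verify numerically. The sketch below (with made-up activations) drops 50% of values, rescales, and compares means:&lt;/p&gt;

```python
import numpy as np

# Numerical check of inverted dropout: after dropping and rescaling,
# the mean activation stays close to the no-dropout mean.
rng = np.random.default_rng(0)
X = rng.random((10000, 256))               # fake activations in [0, 1)
dropout_rate = 0.5

mask = rng.binomial(1, 1 - dropout_rate, X.shape)
X_dropped = X * mask / (1 - dropout_rate)  # inverted-dropout scaling

print(X.mean(), X_dropped.mean())          # the two means are nearly equal
```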

&lt;p&gt;Dropout forces the network to &lt;strong&gt;learn robust representations&lt;/strong&gt;. No neuron can assume another neuron will always be present, so useful features must be distributed across the network.&lt;/p&gt;

&lt;p&gt;The result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Less memorization&lt;/li&gt;
&lt;li&gt;Better generalization&lt;/li&gt;
&lt;li&gt;A model that performs well on unseen data, not just the training set.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Weight Decay: Occam's Razor for Neural Networks
&lt;/h2&gt;

&lt;p&gt;Dropout prevents co-adaptation. But there's another approach: what if we penalize large weights?&lt;/p&gt;

&lt;p&gt;This idea is called weight decay, also known as L2 regularization.&lt;/p&gt;

&lt;p&gt;The idea is simple: add a penalty to the loss function proportional to the magnitude of weights.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Total Loss = Cross-Entropy Loss + λ * (sum of squared weights)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The λ (lambda) parameter controls how much we penalize large weights. Higher λ means stronger penalty.&lt;/p&gt;

&lt;p&gt;Why does this work? Large weights tend to make a network very sensitive to small changes in the input, producing sharp decision boundaries that can fit noise in the training data. Smaller weights produce smoother functions that change more gradually.&lt;/p&gt;

&lt;p&gt;Consider two networks that both achieve 95% training accuracy:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Network A:&lt;/strong&gt; Has weights like [0.1, 0.2, -0.15, 0.08, ...]. Small adjustments to inputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Network B:&lt;/strong&gt; Has weights like [5.2, -8.7, 12.3, -6.1, ...]. Large adjustments to inputs.&lt;/p&gt;

&lt;p&gt;Both fit the training data. But Network B's large weights create sharper, more extreme responses to inputs, which increases the risk of overfitting. Weight decay prefers Network A because its weights have smaller magnitude.&lt;/p&gt;
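&lt;p&gt;You can see the sensitivity difference with a toy calculation. The inputs and weights below are made up to mirror Network A and Network B:&lt;/p&gt;

```python
import numpy as np

# Illustrative only: how a small-weight vs large-weight linear unit
# reacts to the same tiny perturbation of the input.
x = np.array([1.0, 2.0, -1.0, 0.5])
noise = 0.01 * np.ones_like(x)                 # tiny input perturbation

w_small = np.array([0.1, 0.2, -0.15, 0.08])    # "Network A" style weights
w_large = np.array([5.2, -8.7, 12.3, -6.1])    # "Network B" style weights

shift_small = abs(w_small @ (x + noise) - w_small @ x)
shift_large = abs(w_large @ (x + noise) - w_large @ x)
print(shift_small, shift_large)  # the large-weight unit shifts far more
```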

&lt;p&gt;Without weight decay, the optimizer only cares about minimizing training loss.&lt;/p&gt;

&lt;p&gt;With weight decay, the optimizer faces a trade-off:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduce training loss&lt;/li&gt;
&lt;li&gt;Keep weights small&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;During &lt;strong&gt;backpropagation&lt;/strong&gt;, the regularization term adds an extra component to the gradient:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;gradient = original_gradient + λ * w&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This gently pulls weights toward zero during training. The result is not that the network learns less, but that it learns more restrained solutions that tend to generalize better.&lt;/p&gt;
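&lt;p&gt;In code, one SGD step with weight decay might look like this (the weights, gradients, and hyperparameters below are made-up numbers; &lt;code&gt;lam&lt;/code&gt; is the λ above):&lt;/p&gt;

```python
import numpy as np

# Sketch of one SGD update with L2 weight decay folded into the gradient.
w = np.array([0.5, -1.2, 3.0])             # current weights (invented)
grad_loss = np.array([0.1, -0.2, 0.05])    # gradient of the data loss alone
lam, lr = 0.01, 0.1                        # decay strength, learning rate

grad = grad_loss + lam * w                 # the extra λ·w pull toward zero
w = w - lr * grad
print(w)  # every weight ends slightly closer to zero than with plain SGD
```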

&lt;p&gt;Weight decay doesn't restrict learning.&lt;br&gt;
It simply nudges the model toward simpler explanations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Tuning Process:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Train without regularization. Measure the train-to-test gap.&lt;/li&gt;
&lt;li&gt;If gap &amp;lt; 1%, you're good. No regularization needed.&lt;/li&gt;
&lt;li&gt;If gap is 1-3%, add dropout=0.2. Retrain and measure.&lt;/li&gt;
&lt;li&gt;If gap is still &amp;gt; 2%, add weight_decay=0.0001. Retrain and measure.&lt;/li&gt;
&lt;li&gt;If gap is still &amp;gt; 2%, increase dropout to 0.3 or 0.4.&lt;/li&gt;
&lt;li&gt;If gap is &amp;gt; 3%, you might need more data or a smaller network.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key is experimentation. Every dataset is different. What works for MNIST might not work for ImageNet. Start conservative, measure, adjust.&lt;/p&gt;

&lt;h2&gt;
  
  
  Interactive Exploration
&lt;/h2&gt;

&lt;p&gt;This is where the &lt;a href="https://github.com/rnilav/perceptrons-to-transformers/tree/main/05-regularization" rel="noopener noreferrer"&gt;playground&lt;/a&gt; comes in. I've built a Streamlit app that lets you experiment with these techniques in real time. It has two parts to explore: overfitting and weight distribution.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Clicked for Me
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Regularization is a trade-off.&lt;/strong&gt; You're not trying to achieve 100% training accuracy. You're trying to maximize test accuracy. I used to think "higher training accuracy = better network." Now I know better.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dropout is elegant.&lt;/strong&gt; It's not a hack. It's a principled way to train an ensemble of networks simultaneously.&lt;/p&gt;

&lt;p&gt;Each breakthrough solved a problem the previous one created:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/rnilav/understanding-perceptrons-the-foundation-of-modern-ai-2g04"&gt;Perceptrons&lt;/a&gt; couldn't learn complex patterns → Multi-layer networks&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/rnilav/understanding-ai-from-first-principles-multi-layer-perceptrons-and-the-hidden-layer-breakthrough-44pl"&gt;Multi-layer networks&lt;/a&gt; couldn't learn efficiently → Backpropagation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/rnilav/3-backpropagation-errors-flow-backward-knowledge-flows-forward-5320"&gt;Backpropagation&lt;/a&gt; was slow on large datasets → Optimization (mini-batches, Adam)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/rnilav/neural-network-optimizers-from-baby-steps-to-intelligent-learning-44po"&gt;Optimization&lt;/a&gt; worked but overfitted → Regularization (dropout, weight decay)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We're building a complete system. Each piece is necessary.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;We can now train networks that actually work in the real world. They learn patterns, not memorize data. They generalize to new examples.&lt;/p&gt;

&lt;p&gt;For now, we're still limited to fully connected networks on small images. MNIST is 28×28. Real images are 1000×1000 or larger. And fully connected networks don't scale: a 1000×1000 image would require 1 million input neurons.&lt;/p&gt;

&lt;p&gt;We need a different architecture. One designed specifically for images.&lt;/p&gt;

&lt;p&gt;Enter convolutional networks.&lt;/p&gt;

&lt;p&gt;The jump from fully connected to convolutional is as big as the jump from perceptrons to multi-layer networks. It's a completely different way of thinking about neural networks.&lt;/p&gt;

&lt;p&gt;And it's the next breakthrough in our journey.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., &amp;amp; Salakhutdinov, R. R.&lt;/strong&gt; (2012). &lt;em&gt;Improving neural networks by preventing co-adaptation of feature detectors&lt;/em&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ng, A. Y.&lt;/strong&gt; (2004). &lt;em&gt;Feature selection, L1 vs. L2 regularization, and rotational invariance&lt;/em&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #MachineLearning #AI #DeepLearning #Regularization #Dropout #WeightDecay #Overfitting #MNIST #NeuralNetworks&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/rnilav/perceptrons-to-transformers" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>regularization</category>
      <category>weightdecay</category>
    </item>
    <item>
      <title>Neural Network Optimizers: From Baby Steps to Intelligent Learning</title>
      <dc:creator>Nilavukkarasan R</dc:creator>
      <pubDate>Wed, 04 Mar 2026 15:06:50 +0000</pubDate>
      <link>https://dev.to/rnilav/neural-network-optimizers-from-baby-steps-to-intelligent-learning-44po</link>
      <guid>https://dev.to/rnilav/neural-network-optimizers-from-baby-steps-to-intelligent-learning-44po</guid>
      <description>&lt;p&gt;&lt;em&gt;Adapt what is useful, reject what is useless, and add what is specifically your own.&lt;/em&gt; &lt;br&gt;
-- &lt;strong&gt;Bruce Lee&lt;/strong&gt;&lt;/p&gt;



&lt;p&gt;In my &lt;a href="https://dev.to/rnilav/understanding-ai-from-first-principles-backpropagation-the-algorithm-that-learns-3k8l"&gt;last post&lt;/a&gt;, I showed you how backpropagation could learn the weights for XOR automatically. No more hand-crafting. No more trial and error. Just set a learning rate, run the algorithm, and watch the loss curve drop.&lt;/p&gt;

&lt;p&gt;It felt like magic. Almost too easy.&lt;/p&gt;

&lt;p&gt;But here's what I glossed over: XOR has just 4 training examples. With 4 examples, you compute the gradient using all of them at once. Every weight update sees the complete picture.&lt;/p&gt;

&lt;p&gt;But XOR is a toy problem. Let me tell you about a real dataset.&lt;/p&gt;


&lt;h2&gt;
  
  
  MNIST: The "Hello World" of Deep Learning
&lt;/h2&gt;

&lt;p&gt;MNIST is a collection of 70,000 handwritten digit images—60,000 for training, 10,000 for testing. Each image is 28×28 grayscale pixels.&lt;/p&gt;

&lt;p&gt;The task: look at an image and predict which digit (0-9) it represents.&lt;/p&gt;

&lt;p&gt;Trivial for humans. Genuinely hard for 1990s computers. It became the standard benchmark for machine learning.&lt;/p&gt;

&lt;p&gt;Each image has 784 pixels (28×28). To classify them, we need 784 inputs, a hidden layer (say, 128 neurons), and 10 outputs. That's roughly 100,000 weights to learn.&lt;/p&gt;

&lt;p&gt;Here's the problem: computing the gradient using all 60,000 examples—like we did with XOR—requires 60,000 forward and backward passes per weight update. On my laptop, that's 30 seconds per epoch.&lt;/p&gt;

&lt;p&gt;Training for 100 epochs? 50 minutes total.&lt;/p&gt;

&lt;p&gt;Not terrible. But remember, this is just MNIST.&lt;/p&gt;

&lt;p&gt;Now consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPT-2&lt;/strong&gt; trained on &lt;strong&gt;40 billion tokens&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-3&lt;/strong&gt; trained on &lt;strong&gt;300 billion tokens&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-4&lt;/strong&gt;? &lt;strong&gt;Trillions of tokens&lt;/strong&gt;, hundreds of billions of weights&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If full-batch gradient descent takes 50 minutes for MNIST, modern LLMs would take thousands of years.&lt;/p&gt;

&lt;p&gt;This is the &lt;strong&gt;"Scale Wall."&lt;/strong&gt; Nobody trains this way anymore.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Leap from Toy to Real
&lt;/h2&gt;

&lt;p&gt;The jump from XOR to MNIST isn't just more data—it's a fundamental shift in thinking. At scale, learning can't be perfect. It has to be incremental, approximate, adaptive. Just like human learning.&lt;/p&gt;

&lt;p&gt;This is where optimizers enter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mini-Batches&lt;/strong&gt;: Learning from Subsets. The insight that changed everything - you don't need all 60,000 examples to know which direction to move your weights.&lt;/p&gt;

&lt;p&gt;Think about cooking. You don't taste every grain of rice to know if you need more salt. You taste a spoonful; a small sample tells you enough.&lt;/p&gt;

&lt;p&gt;That's mini-batch stochastic gradient descent. Instead of using all 60,000 examples at once, you compute each update from a small random subset.&lt;/p&gt;

&lt;p&gt;The algorithm is simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for each epoch:
    shuffle training data
    divide into mini-batches
    for each mini-batch:
        forward pass (compute predictions)
        compute loss (average over batch)
        backward pass (compute gradients)
        update weights
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One complete pass through all data is an &lt;strong&gt;epoch&lt;/strong&gt;. With 60,000 examples and batch size 64, that's 937 mini-batches per epoch.&lt;/p&gt;

&lt;p&gt;The magic? Each mini-batch gradient is noisy—not the exact gradient from all 60,000 examples. But it points roughly in the right direction. And it's fast.&lt;/p&gt;
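&lt;p&gt;The loop above is easy to make concrete. In this sketch the data is random and &lt;code&gt;forward_backward&lt;/code&gt; is a dummy stand-in for a real model, so only the batching logic is real:&lt;/p&gt;

```python
import numpy as np

# Skeleton of one epoch of mini-batch SGD. The dataset is fake (16 features
# instead of 784 to keep the demo light) and the gradient is a placeholder.
def forward_backward(X_batch, y_batch, weights):
    return np.zeros_like(weights)          # dummy gradient, correct shape

X = np.random.rand(60000, 16)
y = np.random.randint(0, 10, 60000)
weights = np.zeros(16)
batch_size, lr = 64, 0.01

n_batches = 0
idx = np.random.permutation(len(X))        # shuffle before the epoch
for start in range(0, len(X), batch_size):
    batch = idx[start:start + batch_size]
    grad = forward_backward(X[batch], y[batch], weights)
    weights = weights - lr * grad          # one update per mini-batch
    n_batches += 1
print(n_batches)  # prints: 938 (937 full batches plus one partial of 32)
```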

&lt;p&gt;On my laptop:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full-batch: 30 seconds per update&lt;/li&gt;
&lt;li&gt;Mini-batch (size 64): 0.03 seconds per update&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's &lt;strong&gt;1000× faster&lt;/strong&gt;. Training for 100 epochs drops from 50 minutes to 5 minutes.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Three Key Concepts
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Mini-batch:&lt;/strong&gt; A small subset of training data for one gradient update. Common sizes: 16, 32, 64, 128. Smaller batches are noisier but faster. Larger batches are more accurate but slower.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Epoch:&lt;/strong&gt; One complete pass through your training dataset. With 60,000 examples and batch size 64: one epoch = 937 updates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stochastic:&lt;/strong&gt; Means "random." We shuffle data before each epoch, so mini-batches differ every time. This randomness helps—it prevents the network from memorising example order.&lt;/p&gt;




&lt;h1&gt;
  
  
  Optimizers as Human Learners
&lt;/h1&gt;

&lt;p&gt;Just like human learners, no single optimizer is perfect. Each optimizer is a different learning strategy, a different way to navigate the loss landscape. &lt;/p&gt;

&lt;p&gt;I’ve put together a simple summary of three optimizers — &lt;strong&gt;SGD, Momentum,&lt;/strong&gt; and &lt;strong&gt;Adam&lt;/strong&gt; — and how they approach learning. Do take a look at &lt;a href="https://github.com/rnilav/perceptrons-to-transformers/blob/main/04-optimization/OPTIMIZERS_SIMPLIFIED.md" rel="noopener noreferrer"&gt;OPTIMIZERS_SIMPLIFIED&lt;/a&gt; if you’re interested.&lt;/p&gt;

&lt;p&gt;There are many more optimizers in the wild: &lt;strong&gt;AdaGrad&lt;/strong&gt; (adapts per-parameter but burns out on long training runs), &lt;strong&gt;RMSProp&lt;/strong&gt; (fixes AdaGrad's aggressive decay), &lt;strong&gt;AdaDelta&lt;/strong&gt; (removes the need for a learning rate), &lt;strong&gt;NAdam&lt;/strong&gt; (Nesterov-accelerated Adam), &lt;strong&gt;L-BFGS&lt;/strong&gt; (second-order method for smaller datasets), and newer variants like &lt;strong&gt;AdamW&lt;/strong&gt; (Adam with weight decay done right), &lt;strong&gt;RAdam&lt;/strong&gt; (rectified Adam with warmup), and &lt;strong&gt;Lookahead&lt;/strong&gt; (maintains fast and slow weights).&lt;/p&gt;

&lt;p&gt;I'll cover these in future posts when the context is right, showing you not just &lt;em&gt;what&lt;/em&gt; they do, but &lt;em&gt;when&lt;/em&gt; and &lt;em&gt;why&lt;/em&gt; to use them.&lt;/p&gt;

&lt;p&gt;For now: pick your learner, start training.&lt;/p&gt;

&lt;h3&gt;
  
  
  Time to Experiment
&lt;/h3&gt;

&lt;p&gt;Let's see this in action! I've built a &lt;a href="https://github.com/rnilav/perceptrons-to-transformers/tree/main/04-optimization" rel="noopener noreferrer"&gt;neural network playground&lt;/a&gt; on MNIST with three optimizers— Batch SGD, Momentum, and Adam. Experiment with each and watch how they differ in training time and convergence speed.&lt;/p&gt;

&lt;p&gt;Sample screenshot from the playground:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1cysxo55tdld429ju9nf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1cysxo55tdld429ju9nf.png" alt="Playground_Optimizer" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What Clicked for Me
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Scale changes everything.&lt;/strong&gt; Full-batch works for 4 examples, collapses at 60,000. Mini-batching is survival, not cleverness.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adaptive learning makes sense.&lt;/strong&gt; Not all weights should move equally. Adam adjusts per-parameter instead of treating everything the same.&lt;/p&gt;

&lt;p&gt;And the progression is elegant:&lt;br&gt;
&lt;a href="https://dev.to/rnilav/understanding-perceptrons-the-foundation-of-modern-ai-2g04"&gt;Perceptron&lt;/a&gt; → &lt;a href="https://dev.to/rnilav/understanding-ai-from-first-principles-multi-layer-perceptrons-and-the-hidden-layer-breakthrough-44pl"&gt;multi-layer networks&lt;/a&gt; → &lt;a href="https://dev.to/rnilav/3-backpropagation-errors-flow-backward-knowledge-flows-forward-5320"&gt;backpropagation&lt;/a&gt; → &lt;a href="https://dev.to/rnilav/neural-network-optimizers-from-baby-steps-to-intelligent-learning-44po"&gt;scalable optimizers&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Each breakthrough made the next one possible.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;We can now train on real data. Backprop computes gradients, Adam updates weights. 99% accuracy on MNIST in seconds.&lt;/p&gt;

&lt;p&gt;Everything works. Until it doesn't.&lt;/p&gt;

&lt;p&gt;Train longer: training accuracy climbs to 99.8%, but test accuracy stalls and drops. The model isn't learning—it's memorising.&lt;/p&gt;

&lt;p&gt;Next: why overfitting happens and how dropout and weight decay force networks to generalise instead of memorise.&lt;/p&gt;

&lt;p&gt;Training a network is easy. Making it work in the real world? That's the challenge.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rumelhart, D. E., Hinton, G. E., &amp;amp; Williams, R. J.&lt;/strong&gt; (1986). &lt;em&gt;Learning representations by back-propagating errors&lt;/em&gt;. Nature, 323(6088), 533-536.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tieleman, T., &amp;amp; Hinton, G.&lt;/strong&gt; (2012). &lt;em&gt;Lecture 6.5—RMSProp: Divide the gradient by a running average of its recent magnitude&lt;/em&gt;. COURSERA: Neural Networks for Machine Learning.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Kingma, D. P., &amp;amp; Ba, J.&lt;/strong&gt; (2014). &lt;em&gt;Adam: A Method for Stochastic Optimization&lt;/em&gt;. arXiv preprint arXiv:1412.6980.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #MachineLearning #AI #DeepLearning #Optimization #SGD #Adam #Momentum #MNIST #NeuralNetworks&lt;/p&gt;

</description>
      <category>ai</category>
      <category>neuralnetworks</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Backpropagation: Errors Flow Backward, Knowledge Flows Forward</title>
      <dc:creator>Nilavukkarasan R</dc:creator>
      <pubDate>Thu, 19 Feb 2026 14:37:13 +0000</pubDate>
      <link>https://dev.to/rnilav/3-backpropagation-errors-flow-backward-knowledge-flows-forward-5320</link>
      <guid>https://dev.to/rnilav/3-backpropagation-errors-flow-backward-knowledge-flows-forward-5320</guid>
      <description>&lt;p&gt;&lt;em&gt;"The backpropagation algorithm was a key historical step in demonstrating that deep neural networks could be trained effectively."&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;-- &lt;strong&gt;Geoffrey Hinton&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Weight of the Problem&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In my &lt;a href="https://dev.to/rnilav/understanding-ai-from-first-principles-multi-layer-perceptrons-and-the-hidden-layer-breakthrough-44pl"&gt;last post&lt;/a&gt;, I showed you how a multi-layer perceptron could solve XOR—something a single perceptron couldn't do. The network had 2 inputs, 2 hidden neurons, and 1 output. It worked beautifully.&lt;/p&gt;

&lt;p&gt;But here's the catch: I hand-crafted those weights.&lt;/p&gt;

&lt;p&gt;I sat there, adjusting numbers, testing combinations, until I found values that worked. For a tiny 2-2-1 network with 9 weights total, it took me hours of trial and error—and probably three cups of coffee I shouldn't have had.&lt;/p&gt;

&lt;p&gt;Now imagine GPT-4. It has 1.76 trillion parameters.&lt;/p&gt;

&lt;p&gt;If I spent one second per weight, it would take me 55,000 years to hand-craft GPT-4's weights. And that's assuming I got each one right on the first try (spoiler: I wouldn't).&lt;/p&gt;

&lt;p&gt;This is the problem that haunted neural networks in the 1980s. We knew multi-layer networks could solve complex problems. We just didn't know how to train them.&lt;/p&gt;

&lt;p&gt;Then, in the mid-1980s, came backpropagation. And here's the magic: it doesn't require you to know the right weights ahead of time. It learns them automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Breakthrough: Learning from Mistakes&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Here's the beautiful insight: &lt;strong&gt;networks can learn from their mistakes. And that changes everything.&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Think about learning to throw darts. Your first throw misses the bullseye by a foot. You don't randomly try a completely different throw. You adjust—a little less force, slightly different angle. You use the error (how far you missed) to guide your correction.&lt;/p&gt;

&lt;p&gt;That's exactly what backpropagation does.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The process is simple:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Forward pass&lt;/strong&gt;: Make a prediction with current weights&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Calculate error&lt;/strong&gt;: How wrong was the prediction?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backward pass&lt;/strong&gt;: Figure out which weights caused the error&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Update weights&lt;/strong&gt;: Adjust them to reduce the error&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repeat&lt;/strong&gt;: Do this thousands of times&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The magic is in step 3—figuring out which weights to blame. That's where the "backpropagation" name comes from: we propagate the error backward through the network.&lt;/p&gt;
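
&lt;p&gt;&lt;strong&gt;A sketch in code:&lt;/strong&gt; the five steps map onto a short plain-Python loop. This is my own minimal illustration, not the playground implementation from the repository; it trains a sigmoid 2-2-1 network on XOR with squared-error loss, and the learning rate, seed, and epoch count are arbitrary choices.&lt;/p&gt;

```python
import math, random

random.seed(123)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# XOR: inputs and targets
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

# 2-2-1 network, randomly initialised (6 weights + 3 biases)
w1 = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(2)]
b1 = [0.0, 0.0]
w2 = [random.uniform(-1, 1), random.uniform(-1, 1)]
b2 = 0.0
lr = 0.5

def forward(x):
    h = [sigmoid(x[0] * w1[j][0] + x[1] * w1[j][1] + b1[j]) for j in range(2)]
    return h, sigmoid(h[0] * w2[0] + h[1] * w2[1] + b2)

def total_loss():
    return sum(0.5 * (forward(x)[1] - y) ** 2 for x, y in data)

loss_before = total_loss()

for epoch in range(10000):
    for x, y in data:
        h, y_hat = forward(x)                  # 1. forward pass
        err = y_hat - y                        # 2. how wrong was the prediction?
        d2 = err * y_hat * (1 - y_hat)         # 3. backward pass: output delta,
        d1 = [d2 * w2[j] * h[j] * (1 - h[j])   #    then blame flows back to the
              for j in range(2)]               #    hidden layer via the chain rule
        for j in range(2):                     # 4. update every weight downhill
            w2[j] -= lr * d2 * h[j]
            w1[j][0] -= lr * d1[j] * x[0]
            w1[j][1] -= lr * d1[j] * x[1]
            b1[j] -= lr * d1[j]
        b2 -= lr * d2                          # 5. ...and repeat
loss_after = total_loss()
print(round(loss_before, 3), round(loss_after, 3))
```

&lt;p&gt;From this toy loop to the largest frameworks, the cycle is identical; automatic differentiation just mechanises step 3.&lt;/p&gt;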

&lt;h2&gt;
  
  
  &lt;strong&gt;The Chain Rule: Error Flows Backward&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Backpropagation is often described as "just the chain rule from calculus." And it is! But let me make it concrete.&lt;/p&gt;

&lt;p&gt;Imagine you're hiking and you want to go downhill (minimize loss). You're standing on a slope, and you need to know which direction is down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gradient descent&lt;/strong&gt; is the strategy: always step in the direction that goes downhill.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backpropagation&lt;/strong&gt; is how you figure out which direction that is for every weight in your network.&lt;/p&gt;

&lt;p&gt;Here's the flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Forward Pass (Making Predictions):
Input → Hidden Layer → Output → Loss
  x   →      h       →   ŷ    →  L

Backward Pass (Computing Gradients):
Loss → Output Error → Hidden Error → Weight Updates
  L  →      δ₂      →      δ₁      →   Δw
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The error at the output layer is easy to compute: &lt;code&gt;prediction - target&lt;/code&gt;. But how do we know how much each hidden neuron contributed to that error?&lt;/p&gt;

&lt;p&gt;That's where the chain rule comes in. The error flows backward through the network, multiplied by weights and activation derivatives at each step. Each weight gets a gradient that tells it: "If you increase by a tiny amount, the loss will change by this much."&lt;/p&gt;
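
&lt;p&gt;That last sentence is just the definition of a partial derivative, and you can check it numerically: nudge a weight by a tiny amount and re-measure the loss. A minimal sketch with a made-up one-weight model (the model, input, and target are purely illustrative):&lt;/p&gt;

```python
def loss(w):
    # Toy one-weight model: prediction = w * x, squared error against a target
    x, target = 2.0, 3.0
    return 0.5 * (w * x - target) ** 2

def grad(w):
    # Chain rule by hand: dL/dw = (prediction - target) * x
    x, target = 2.0, 3.0
    return (w * x - target) * x

# Numerical check: increase/decrease w by a tiny eps and watch the loss change
w, eps = 1.0, 1e-6
analytic = grad(w)
numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)
print(analytic, numeric)  # the two values should agree to several decimal places
```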

&lt;p&gt;&lt;strong&gt;For the mathematically curious:&lt;/strong&gt; I've written a detailed walkthrough with concrete numerical examples in &lt;a href="https://github.com/rnilav/perceptrons-to-transformers/blob/main/03-backpropagation/BACKPROPAGATION_CALCULUS.md" rel="noopener noreferrer"&gt;&lt;code&gt;BACKPROPAGATION_CALCULUS.md&lt;/code&gt;&lt;/a&gt;. It shows the full chain rule derivation with a 2-2-1 network solving XOR, complete with actual numbers flowing through each calculation.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Learning Rate: The Step Size&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Once we know which direction to adjust each weight, we need to decide how big a step to take. That's the &lt;strong&gt;learning rate&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Think of it like adjusting the volume on a stereo:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Too high&lt;/strong&gt; (learning rate = 1.0): You overshoot. The volume jumps from 2 to 10, then back to 1, then to 8. You never settle on the right level.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Too low&lt;/strong&gt; (learning rate = 0.01): You're turning the knob so slowly it takes forever to reach the right volume.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Just right&lt;/strong&gt; (learning rate = 0.3): You make steady progress toward the perfect volume.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the playground, you can experiment with different learning rates and watch what happens. Too high and the loss bounces around. Too low and training crawls. Just right and you see that beautiful downward curve.&lt;/p&gt;
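
&lt;p&gt;All three regimes are easy to reproduce with gradient descent on the simplest possible loss surface, the one-dimensional bowl f(w) = w², whose gradient is 2w. (The bowl and the specific rates are illustrative choices; real loss surfaces are far messier.)&lt;/p&gt;

```python
def descend(lr, steps=20):
    # Gradient descent on f(w) = w**2; the gradient at w is 2*w
    w = 5.0
    for _ in range(steps):
        w -= lr * 2 * w
    return w

print(descend(1.1))   # too high: every step overshoots the minimum, w blows up
print(descend(0.01))  # too low: after 20 steps we've barely moved from 5.0
print(descend(0.3))   # just right: essentially at the minimum, w = 0
```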

&lt;p&gt;&lt;strong&gt;For practical tips:&lt;/strong&gt; Check out &lt;a href="https://github.com/rnilav/perceptrons-to-transformers/blob/main/03-backpropagation/HYPERPARAMETER_INSIGHTS.md" rel="noopener noreferrer"&gt;&lt;code&gt;HYPERPARAMETER_INSIGHTS.md&lt;/code&gt;&lt;/a&gt; for a deep dive into learning rates, architecture choices, and why some random seeds get stuck in local minima.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What Clicked for Me&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;After implementing backpropagation and watching it train on XOR, here's what became clear:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The loss curve tells the story.&lt;/strong&gt; In &lt;a href="https://dev.to/rnilav/understanding-ai-from-first-principles-multi-layer-perceptrons-and-the-hidden-layer-breakthrough-44pl"&gt;Post 2&lt;/a&gt;, I hand-crafted weights and got 100% accuracy immediately. With backpropagation, I watched the loss start high (the network is guessing randomly) and gradually decrease as it learned. That curve going down? That's learning happening in real-time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Initialization matters.&lt;/strong&gt; I tried different random seeds and got wildly different results. Some converged to 100% accuracy in 2000 epochs. Others got stuck at 75% accuracy forever. The starting point matters—it's like starting a hike from different locations on a mountain. Some paths lead to the summit, others to local valleys.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's the same algorithm everywhere.&lt;/strong&gt; Whether it's XOR with 9 weights or GPT-4 with 1.76 trillion parameters, the algorithm is identical: forward pass, compute loss, backward pass, update weights. The scale changes, but the principle doesn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automatic beats manual.&lt;/strong&gt; Hand-crafting weights for XOR took me hours. Backpropagation learned them in seconds. For anything beyond toy problems, automatic learning isn't just better—it's the only option.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Watch It Learn: The Interactive Playground&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I've built an interactive playground where you can watch backpropagation in action. It has two tabs, each showing a different aspect of learning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub Repository:&lt;/strong&gt; &lt;a href="https://github.com/rnilav/perceptrons-to-transformers/tree/main/03-backpropagation" rel="noopener noreferrer"&gt;perceptrons-to-transformers - 03-backpropagation&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Tab 1: Training Visualization
&lt;/h3&gt;

&lt;p&gt;Watch the network learn XOR from scratch. You'll see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Loss curve&lt;/strong&gt; decreasing over epochs (learning in action!)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decision boundary&lt;/strong&gt; evolving from random to correct&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Final accuracy&lt;/strong&gt; reaching 100% (when it works!)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu21iid3vweedfv4xj6u8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu21iid3vweedfv4xj6u8.png" alt="Training loss over time" width="800" height="485"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Try this:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Train with learning rate 0.3 and seed 123 (recommended) - watch it converge smoothly&lt;/li&gt;
&lt;li&gt;Try seed 42 with 2-2-1 architecture - watch it get stuck at 75% accuracy (local minimum!)&lt;/li&gt;
&lt;li&gt;Switch to 2-4-1 architecture - notice how it's more robust to bad initialization&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Tab 2: Gradient Flow Visualization
&lt;/h3&gt;

&lt;p&gt;See the backward pass in action. This tab shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Forward pass&lt;/strong&gt; step-by-step (input → hidden → output)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backward pass&lt;/strong&gt; step-by-step (error flowing backward)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gradient magnitudes&lt;/strong&gt; at each layer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weight updates&lt;/strong&gt; before and after one training step&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F740p36pc1zv794ui6h7q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F740p36pc1zv794ui6h7q.png" alt="Gradient magnitudes" width="800" height="498"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is where the "backpropagation" name becomes concrete. You literally see the error propagating backward through the network, computing gradients for each weight.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Try this:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Select different XOR test cases and watch how gradients change&lt;/li&gt;
&lt;li&gt;Notice how gradients get smaller in earlier layers (vanishing gradient effect)&lt;/li&gt;
&lt;li&gt;Compare gradient magnitudes with different learning rates&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Running the Playground
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone the repository&lt;/span&gt;
git clone https://github.com/rnilav/perceptrons-to-transformers.git
&lt;span class="nb"&gt;cd &lt;/span&gt;perceptrons-to-transformers/03-backpropagation

&lt;span class="c"&gt;# Install dependencies (if needed)&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; ../requirements.txt

&lt;span class="c"&gt;# Run the playground&lt;/span&gt;
streamlit run backprop_playground.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then open your browser and explore both tabs. The playground is designed to make the abstract concrete—you can see learning happen, watch gradients flow, and understand why backpropagation works.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What This Unlocked&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When Rumelhart, Hinton, and Williams published their backpropagation paper in 1986, it changed everything.&lt;/p&gt;

&lt;p&gt;Before backpropagation, neural networks were theoretical curiosities. We knew multi-layer networks could solve complex problems, but we couldn't train them. It was like having a Ferrari with no key.&lt;/p&gt;

&lt;p&gt;After backpropagation, neural networks became practical. Suddenly, we could:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Train networks with multiple hidden layers&lt;/li&gt;
&lt;li&gt;Learn from large datasets automatically&lt;/li&gt;
&lt;li&gt;Solve problems that were previously impossible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The progression is beautiful:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1958&lt;/strong&gt;: Perceptron learns linear boundaries&lt;br&gt;
&lt;strong&gt;1969&lt;/strong&gt;: Minsky proves perceptrons can't solve XOR (AI winter begins)&lt;br&gt;
&lt;strong&gt;1986&lt;/strong&gt;: Backpropagation enables training multi-layer networks (AI winter thaws)&lt;br&gt;
&lt;strong&gt;2012&lt;/strong&gt;: Deep learning revolution (ImageNet breakthrough)&lt;br&gt;
&lt;strong&gt;2017&lt;/strong&gt;: Transformer architecture (foundation for GPT)&lt;br&gt;
&lt;strong&gt;2022&lt;/strong&gt;: ChatGPT launches, and the LLM explosion follows&lt;/p&gt;

&lt;p&gt;Every single one of these breakthroughs builds on backpropagation. GPT-4 is trained using backpropagation. DALL-E is trained using backpropagation. Every modern neural network you've ever used learned its weights through backpropagation.&lt;/p&gt;

&lt;p&gt;The algorithm that learns 9 weights for XOR is the same algorithm that learns 1.76 trillion parameters for GPT-4. The scale changed, but the principle didn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What's Next&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We can now train neural networks automatically. But training them well—at scale, reliably, without overfitting—is a different challenge. In the next post, we'll see how modern optimisation algorithms solve this puzzle.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rumelhart, D. E., Hinton, G. E., &amp;amp; Williams, R. J.&lt;/strong&gt; (1986). &lt;em&gt;Learning representations by back-propagating errors&lt;/em&gt;. Nature, 323(6088), 533-536.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Nielsen, M.&lt;/strong&gt; (2015). &lt;em&gt;Neural Networks and Deep Learning&lt;/em&gt;. Determination Press. Available at: &lt;a href="http://neuralnetworksanddeeplearning.com/" rel="noopener noreferrer"&gt;http://neuralnetworksanddeeplearning.com/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Goodfellow, I., Bengio, Y., &amp;amp; Courville, A.&lt;/strong&gt; (2016). &lt;em&gt;Deep Learning&lt;/em&gt;. MIT Press. Available at: &lt;a href="http://www.deeplearningbook.org/" rel="noopener noreferrer"&gt;http://www.deeplearningbook.org/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #MachineLearning #AI #DeepLearning #Backpropagation #NeuralNetworks #GradientDescent&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Series:&lt;/strong&gt; From Perceptrons to Transformers&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/rnilav/perceptrons-to-transformers" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>backpropagation</category>
      <category>machinelearning</category>
      <category>neuralnetworks</category>
    </item>
    <item>
      <title>Multi Layer Perceptron: From Lines to Curves - The Hidden Layer</title>
      <dc:creator>Nilavukkarasan R</dc:creator>
      <pubDate>Tue, 17 Feb 2026 15:14:10 +0000</pubDate>
      <link>https://dev.to/rnilav/understanding-ai-from-first-principles-multi-layer-perceptrons-and-the-hidden-layer-breakthrough-44pl</link>
      <guid>https://dev.to/rnilav/understanding-ai-from-first-principles-multi-layer-perceptrons-and-the-hidden-layer-breakthrough-44pl</guid>
      <description>&lt;p&gt;&lt;em&gt;"The perceptron has many limitations... the most serious is its inability to learn even the simplest nonlinear functions."&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;-- &lt;strong&gt;Marvin Minsky&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Problem That Stumped AI&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In my &lt;a href="https://dev.to/rnilav/understanding-perceptrons-the-foundation-of-modern-ai-2g04"&gt;last post&lt;/a&gt;, I mentioned that the perceptron could learn AND, OR, and NAND gates perfectly. But there was one simple logic gate it couldn't learn, no matter how much you trained it.&lt;/p&gt;

&lt;p&gt;That gate was XOR (exclusive-or).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;XOR Truth Table:
┌─────────┬─────────┬────────┐
│ Input 1 │ Input 2 │ Output │
├─────────┼─────────┼────────┤
│    0    │    0    │   0    │
│    0    │    1    │   1    │
│    1    │    0    │   1    │
│    1    │    1    │   0    │
└─────────┴─────────┴────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When Marvin Minsky and Seymour Papert published their book "Perceptrons" in 1969, they proved mathematically that single-layer perceptrons couldn't solve XOR. This result helped trigger the first "AI winter" - funding dried up, research stalled, and neural networks were largely abandoned for over a decade.&lt;/p&gt;

&lt;p&gt;But why? What makes XOR so special?&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Geometry of Impossibility&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Here's the thing: a perceptron draws a straight line to separate classes. That's it. One straight line.&lt;/p&gt;

&lt;p&gt;For XOR, you need the output to be 1 when inputs are different, and 0 when they're the same:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Visual representation:
    Input 2
      ↑
    1 │  [1]    [0]
      │
    0 │  [0]    [1]
      └──────────────→ Input 1
         0       1

[0] = Output 0 
[1] = Output 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Try drawing a single straight line that separates the [1] points from the [0] points. You can't. The pattern is diagonal - you'd need two lines, or a curve.&lt;/p&gt;

&lt;p&gt;This is what "not linearly separable" means.&lt;/p&gt;

&lt;p&gt;I spent hours staring at this diagram when I first learned about it. I tried every angle, every position for that line. Nothing worked. And that's exactly the point - it's mathematically impossible. The perceptron's limitation isn't a bug, it's a fundamental constraint of linear classifiers.&lt;/p&gt;

&lt;p&gt;For AND and OR gates, the pattern is simple - all the 1s are on one side, all the 0s on the other. But XOR? The classes are interleaved. You need a more sophisticated approach.&lt;/p&gt;
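
&lt;p&gt;You don't have to take this on faith: you can brute-force it. The sketch below tries thousands of candidate lines (an arbitrary grid of weights and bias between -2 and 2), and the best any single line manages on XOR is 3 out of 4 points:&lt;/p&gt;

```python
import itertools

xor = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

def accuracy(w1, w2, b):
    # Score one candidate line w1*x1 + w2*x2 + b = 0 against all four XOR points
    correct = 0
    for (x1, x2), target in xor:
        out = 1 if x1 * w1 + x2 * w2 + b > 0 else 0
        if out == target:
            correct += 1
    return correct / 4

# Grid search: every combination of weights and bias from -2.0 to 2.0 in 0.1 steps
vals = [v / 10 for v in range(-20, 21)]
best = max(accuracy(w1, w2, b) for w1, w2, b in itertools.product(vals, vals, vals))
print(best)  # 0.75 — three out of four is the best any straight line can do
```

&lt;p&gt;Widening the grid doesn't help; no straight line exists, which is exactly what "not linearly separable" means.&lt;/p&gt;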

&lt;h2&gt;
  
  
  &lt;strong&gt;The Breakthrough: Hidden Layers&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When I was a kid learning math, adding single-digit numbers was simple. 3 + 5 = 8. I just did it. One step, done.&lt;/p&gt;

&lt;p&gt;But then came the leap to multi-digit addition: 27 + 15.&lt;/p&gt;

&lt;p&gt;I kept getting it wrong. I'd add 2 + 1 = 3, then 7 + 5 = 12, and write 312. Completely wrong. My brain was treating it like two separate single-digit problems mashed together. I was missing something invisible.&lt;/p&gt;

&lt;p&gt;Then came the breakthrough: 7 + 5 doesn't just equal 12. It creates a 1 that carries over to the next column. That invisible 1 moving from ones to tens column—that was the missing piece. Once I understood the carry, it clicked.&lt;/p&gt;

&lt;p&gt;The carry was an intermediate step that transformed the problem.&lt;/p&gt;

&lt;p&gt;It sounds trivial now. But back then? It was a massive leap for me. I couldn't see why single-digit rules didn't just scale up. I needed something new—not more of the same.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's the hidden layer. But here's the catch:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If I just wrote down two addition problems and stacked them on top of each other, nothing changes. 2 + 1, then 7 + 5. That's still just two separate additions. Adding more steps doesn't help if each step is the same linear operation.&lt;/p&gt;

&lt;p&gt;But the carry isn't linear. When 7 + 5 = 12, something special happens: the 1 doesn't stay in that column. It transforms—it becomes a 1 in a different column, changing what comes next. That transformation—that non-linearity—is what makes the whole system work.&lt;/p&gt;

&lt;p&gt;Without the carry's transformation, stacking problems is useless. With it, multi-digit addition becomes possible.&lt;/p&gt;

&lt;p&gt;That's exactly what non-linear activation functions do in neural networks.&lt;/p&gt;

&lt;p&gt;A single-layer perceptron is like single-digit addition—inputs straight to output, no transformation. If you just stack more linear layers, you still have the same problem: one straight line, no matter how many you combine.&lt;/p&gt;

&lt;p&gt;But add non-linear activation functions (like sigmoid or ReLU)—the carry transforms the space. Now XOR becomes solvable.&lt;/p&gt;

&lt;p&gt;*&lt;em&gt;Simple shallow network with hand-crafted weights and biases *&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9148fmbr0dvvu9qp8kb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9148fmbr0dvvu9qp8kb.png" alt="2-2-1 Multi-Layer Network" width="800" height="596"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Solving XOR: The Aha Moment&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;With a 2-2-1 network (2 inputs, 2 hidden neurons, 1 output), we can finally solve XOR:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;How it works:
┌──────────────────────────────────────┐
│ Hidden Neuron 1: Learns OR pattern   │
│   (fires when x₁ OR x₂ is 1)         │
│                                      │
│ Hidden Neuron 2: Learns AND pattern  │
│   (fires when x₁ AND x₂ are 1)       │
│                                      │
│ Output: Combines them                │
│   (OR but NOT AND = XOR)             │
└──────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The hidden layer isn't just adding complexity - it's transforming the problem into something solvable.&lt;/p&gt;
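
&lt;p&gt;That scheme translates directly into code. Here's a minimal sketch with a hard step activation and hand-picked weights (my own illustrative values; a sigmoid network like the one in the playground learns a smoother version of the same idea):&lt;/p&gt;

```python
def step(z):
    return 1 if z > 0 else 0

def xor_mlp(x1, x2):
    # Hidden neuron 1: OR  (fires if at least one input is 1)
    h_or = step(x1 + x2 - 0.5)
    # Hidden neuron 2: AND (fires only if both inputs are 1)
    h_and = step(x1 + x2 - 1.5)
    # Output: OR but NOT AND = XOR
    return step(h_or - h_and - 0.5)

print([xor_mlp(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # → [0, 1, 1, 0]
```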

&lt;p&gt;&lt;em&gt;A comparative snapshot generated from the playground&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4wlf9ds9vfyxdfmr8y4d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4wlf9ds9vfyxdfmr8y4d.png" alt="single vs multi layer perceptron" width="800" height="598"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Try it yourself:&lt;/strong&gt; Run the interactive playground to see the curved decision boundary in action. Adjust the weight slider to see how the boundary changes from weak to strong. Compare it with a perceptron's straight line attempt. The visualisation makes it clear why hidden layers are the breakthrough.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub Repository:&lt;/strong&gt; &lt;a href="https://github.com/rnilav/perceptrons-to-transformers/tree/main/02-multi-layer-perceptron" rel="noopener noreferrer"&gt;perceptrons-to-transformers - 02-multi-layer-perceptron&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What you'll find:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;02-multi-layer-perceptron/mlp.py&lt;/code&gt; - Clean MLP implementation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;02-multi-layer-perceptron/mlp_playground.py&lt;/code&gt; - Interactive Streamlit app&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The playground lets you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;See the curved decision boundary that solves XOR&lt;/li&gt;
&lt;li&gt;Adjust weights and watch the boundary change in real-time&lt;/li&gt;
&lt;li&gt;View the network architecture with all weights labeled&lt;/li&gt;
&lt;li&gt;Compare perceptron's straight line vs MLP's curve&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What This Unlocked&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Solving XOR might seem trivial now. But it was the breakthrough that unlocked everything.&lt;/p&gt;

&lt;p&gt;The problem wasn't just XOR. It was the realisation: hidden layers don't just add complexity—they enable non-linear thinking. Once researchers understood this, the floodgates opened.&lt;/p&gt;

&lt;p&gt;In the 1980s, David Rumelhart, Geoffrey Hinton, and Ronald Williams showed you could actually train these multi-layer networks with backpropagation. Suddenly, problems that seemed impossible became solvable. The AI winter thawed.&lt;/p&gt;

&lt;p&gt;The progression is beautiful:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Perceptrons&lt;/strong&gt; learned to draw lines (linear boundaries)&lt;br&gt;
&lt;strong&gt;MLPs&lt;/strong&gt; learned to draw curves (non-linear boundaries)&lt;br&gt;
&lt;strong&gt;Deep networks&lt;/strong&gt; learned hierarchies (edges → shapes → objects → concepts)&lt;/p&gt;

&lt;p&gt;Today's neural networks—from image classifiers to GPT-4—all follow the same principle: stack layers with non-linear activations to transform data into increasingly meaningful representations.&lt;/p&gt;

&lt;p&gt;It all started with one insight: add that first hidden layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What's Next&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We can now build networks that solve XOR. But there's one crucial question: How do we learn the weights?&lt;/p&gt;

&lt;p&gt;The XOR network I showed you uses hand-crafted weights—I manually set values that worked. But for real problems with thousands of inputs and millions of weights, we can't do that by hand.&lt;/p&gt;

&lt;p&gt;We need an algorithm that automatically learns the right weights from examples.&lt;/p&gt;

&lt;p&gt;That algorithm is called backpropagation, and it's what makes neural networks practical. It's how networks learn from their mistakes and gradually improve.&lt;/p&gt;

&lt;p&gt;In the next post, we'll dive into backpropagation—the algorithm that ties everything together. It involves calculus, but I promise to make it intuitive.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Minsky, M., &amp;amp; Papert, S.&lt;/strong&gt; (1969). &lt;em&gt;Perceptrons: An Introduction to Computational Geometry&lt;/em&gt;. MIT Press.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Nielsen, M.&lt;/strong&gt; (2015). &lt;em&gt;Neural Networks and Deep Learning&lt;/em&gt;. Determination Press. Available at: &lt;a href="http://neuralnetworksanddeeplearning.com/" rel="noopener noreferrer"&gt;http://neuralnetworksanddeeplearning.com/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #MachineLearning #AI #DeepLearning #NeuralNetworks #MLP&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Series:&lt;/strong&gt; From Perceptrons to Transformers&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/rnilav/perceptrons-to-transformers/tree/main/02-multi-layer-perceptron" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>deeplearning</category>
      <category>mlp</category>
    </item>
    <item>
      <title>Perceptron: The Foundation of Modern AI</title>
      <dc:creator>Nilavukkarasan R</dc:creator>
      <pubDate>Sun, 15 Feb 2026 08:40:21 +0000</pubDate>
      <link>https://dev.to/rnilav/understanding-perceptrons-the-foundation-of-modern-ai-2g04</link>
      <guid>https://dev.to/rnilav/understanding-perceptrons-the-foundation-of-modern-ai-2g04</guid>
      <description>&lt;p&gt;&lt;em&gt;"We now have a new kind of programming paradigm. Instead of telling the computer what to do, we show it examples of what we want, and it figures out how to do it."&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;-- &lt;strong&gt;Michael Nielsen&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;My Journey Back to the Beginning&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;My first encounter with Artificial Intelligence was during my college days. I had memorised more than I understood, but none of what I studied appeared in the exam, so I wrote whatever I could, and I’m quite certain the professor didn’t understand my answers either.&lt;/p&gt;

&lt;p&gt;Fast forward 20 years of building software systems. In all that time, I barely touched AI/ML. Sure, I designed applications that integrated with black-box AI/ML systems for OCR, but that was it.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Then ChatGPT happened&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Like many of you, I started with the ChatGPT web interface, learning prompt engineering. Then I began experimenting—building RAG chatbots, exploring chunking strategies, testing different embedding models and retrieval techniques. I experimented with agents, explored MCPs and agentic patterns. I was learning these tools, building with them—but something bothered me.&lt;/p&gt;

&lt;p&gt;I didn't understand how any of it actually worked.&lt;/p&gt;

&lt;p&gt;So I decided to go back. Not to the latest paper or the newest framework, but to the very beginning. To the first artificial neuron.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why This Matters&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;You might wonder why bother learning about a decades-old concept when we have ChatGPT, Claude, and countless AI tools at our fingertips.&lt;/p&gt;

&lt;p&gt;Here's why: Every single neuron in GPT-4, in every transformer, in every neural network you've ever used, works on the same basic principles as that first artificial neuron. The perceptron isn't history. It's the foundation.&lt;/p&gt;

&lt;p&gt;Understanding it means understanding what's actually happening when you call an LLM API. It means knowing why things work, not just that they work.&lt;/p&gt;

&lt;p&gt;If you've felt this same curiosity and want to truly understand the foundations beneath the tools we use every day, join me. We'll learn from first principles, one concept at a time.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;From Biology to Silicon&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In 1943, Warren McCulloch and Walter Pitts created the first mathematical model of a neuron. But it was Frank Rosenblatt in 1958 who built the perceptron, the first artificial neuron that could actually learn.&lt;/p&gt;

&lt;p&gt;Rosenblatt's breakthrough came from mimicking nature. He studied how biological neurons work and translated that logic into mathematics. Here's how they compare:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Biological Neuron:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Dendrites  →  Cell Body  →  Threshold Check  →  Axon
(receive)     (process)     (fire if met)        (output)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Artificial Neuron (Perceptron):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Inputs     →  Weighted Sum  →  Threshold Check  →  Output
x₁,x₂,...     Σ(xᵢ × wᵢ)       (≥ threshold?)       0 or 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The key insight&lt;/strong&gt;: Learning happens by adjusting the weights.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How a Perceptron Works&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let's break it down to basics.&lt;/p&gt;

&lt;p&gt;A perceptron takes inputs, multiplies each by a weight, adds them up, and makes a decision.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;perceptron_forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Multiply each input by its weight
&lt;/span&gt;    &lt;span class="n"&gt;weighted_sum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Add bias (shifts the decision boundary)
&lt;/span&gt;    &lt;span class="n"&gt;weighted_sum&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt;

    &lt;span class="c1"&gt;# Activation: output 1 if positive, 0 otherwise
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;weighted_sum&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. That's the core of a perceptron.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's happening:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Each input has a weight (how important is this input?)&lt;/li&gt;
&lt;li&gt;We sum up: (input₁ × weight₁) + (input₂ × weight₂) + ... + bias&lt;/li&gt;
&lt;li&gt;If the sum is positive, output 1. Otherwise, output 0.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Example: AND gate&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let's say we want to implement the AND logic gate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input: [0, 0] → Output: 0&lt;/li&gt;
&lt;li&gt;Input: [0, 1] → Output: 0&lt;/li&gt;
&lt;li&gt;Input: [1, 0] → Output: 0&lt;/li&gt;
&lt;li&gt;Input: [1, 1] → Output: 1&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Traditional way (if/else):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;and_gate_traditional&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;input1&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;input2&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Perceptron way (learned weights):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With the right weights ([0.5, 0.5] and bias -0.7), the perceptron can solve this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[0, 0]: 0×0.5 + 0×0.5 - 0.7 = -0.7 → Output: 0 ✓&lt;/li&gt;
&lt;li&gt;[0, 1]: 0×0.5 + 1×0.5 - 0.7 = -0.2 → Output: 0 ✓&lt;/li&gt;
&lt;li&gt;[1, 0]: 1×0.5 + 0×0.5 - 0.7 = -0.2 → Output: 0 ✓&lt;/li&gt;
&lt;li&gt;[1, 1]: 1×0.5 + 1×0.5 - 0.7 = 0.3 → Output: 1 ✓&lt;/li&gt;
&lt;/ul&gt;
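&lt;p&gt;You can verify this arithmetic yourself by wrapping the &lt;code&gt;perceptron_forward&lt;/code&gt; function from earlier with the weights from the worked example above (a quick sanity check, not the repo's implementation):&lt;/p&gt;

```python
def perceptron_forward(inputs, weights, bias):
    # Weighted sum plus bias, then a hard threshold at zero
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if weighted_sum > 0 else 0

def and_gate(input1, input2):
    # Weights [0.5, 0.5] and bias -0.7, as in the worked example
    return perceptron_forward([input1, input2], [0.5, 0.5], -0.7)

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "->", and_gate(a, b))  # prints 0, 0, 0, 1
```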

&lt;p&gt;The difference? The traditional way is hardcoded. The perceptron learns these weights from examples. That's the new programming paradigm Nielsen talked about.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What Clicked for Me&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;After implementing and testing the perceptron, here's what became clear:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weights are just numbers.&lt;/strong&gt; There's no magic. A weight of 0.5 means "this input matters half as much as an input with weight 1.0."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The bias shifts the boundary.&lt;/strong&gt; Without bias, the decision boundary always goes through the origin. Bias lets it move anywhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learning is adjustment.&lt;/strong&gt; When the perceptron makes a mistake, we adjust the weights. That's learning. &lt;/p&gt;
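&lt;p&gt;That adjustment can be sketched with the classic perceptron learning rule: nudge each weight in proportion to the error. This is a minimal version for illustration (the learning rate of 0.1 and the zero initialization are arbitrary choices, not values from the repo):&lt;/p&gt;

```python
def train_perceptron(data, epochs=10, learning_rate=0.1):
    """Learn weights for 2-input examples via the perceptron update rule."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, target in data:
            weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
            prediction = 1 if weighted_sum > 0 else 0
            error = target - prediction  # -1, 0, or +1
            # Mistake? Nudge each weight toward the correct answer.
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias

# The AND truth table as (inputs, target) pairs
and_data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, bias = train_perceptron(and_data)
```

No weights are hardcoded here: the rule discovers a separating line from the four examples alone.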

&lt;p&gt;&lt;strong&gt;It's a linear classifier.&lt;/strong&gt; The perceptron draws a straight line (or hyperplane) to separate classes. This is both its power and its limitation.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Explore the Code&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I've implemented a complete perceptron from scratch with visualizations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here is a sample visualization screenshot from the playground:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsquog8vfx9oh2klke11g.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsquog8vfx9oh2klke11g.jpg" alt=" " width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub Repository:&lt;/strong&gt; &lt;a href="https://github.com/rnilav/perceptrons-to-transformers/tree/main/01-perceptron" rel="noopener noreferrer"&gt;perceptrons-to-transformers&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What you'll find:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;01-perceptron/perceptron.py&lt;/code&gt; - Full implementation with learning algorithm&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;01-perceptron/perceptron_playground.py&lt;/code&gt; - Streamlit app to play with it&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What's Next&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The perceptron can learn AND, OR, and NAND gates perfectly. But it has a fundamental limitation.&lt;/p&gt;

&lt;p&gt;No matter how you adjust the weights, there's one simple logic gate it cannot learn. This limitation exposed a critical weakness in single-layer networks.&lt;/p&gt;

&lt;p&gt;In the next post, we'll explore this limitation and see why it led to the invention of multilayer networks.&lt;/p&gt;

&lt;p&gt;Spoiler: The problem is called XOR, and solving it ultimately paved the way for modern deep learning.&lt;/p&gt;
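&lt;p&gt;You can make the limitation concrete with a brute-force check (my own sanity test, not code from the repo): sweep a grid of weights and biases and count how many XOR cases a single perceptron can get right.&lt;/p&gt;

```python
def perceptron_forward(inputs, weights, bias):
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if weighted_sum > 0 else 0

# The XOR truth table: output 1 when exactly one input is 1
xor_data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

# Sweep a coarse grid of weights and biases in [-1, 1]
grid = [i / 10 for i in range(-10, 11)]
best = 0
for w1 in grid:
    for w2 in grid:
        for b in grid:
            correct = sum(perceptron_forward(x, [w1, w2], b) == y
                          for x, y in xor_data)
            best = max(best, correct)

print(best)  # 3 — no single straight line separates XOR's classes
```

No setting on this grid (or any other: XOR is not linearly separable) classifies all four cases correctly.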




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Nielsen, M.&lt;/strong&gt; (2015). &lt;em&gt;Neural Networks and Deep Learning&lt;/em&gt;. Determination Press. Available at: &lt;a href="http://neuralnetworksanddeeplearning.com/" rel="noopener noreferrer"&gt;http://neuralnetworksanddeeplearning.com/&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #MachineLearning #AI #DeepLearning #Perceptron #NeuralNetworks&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Series:&lt;/strong&gt; From Perceptron to Transformers&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/rnilav/perceptrons-to-transformers/tree/main/01-perceptron" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;

</description>
      <category>perceptron</category>
      <category>neuralnetworks</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
