<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: NiShITa-code</title>
    <description>The latest articles on DEV Community by NiShITa-code (@nishitacode).</description>
    <link>https://dev.to/nishitacode</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F561396%2Fd34ba8fe-67a3-43f2-9868-16b1953afebc.png</url>
      <title>DEV Community: NiShITa-code</title>
      <link>https://dev.to/nishitacode</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nishitacode"/>
    <language>en</language>
    <item>
      <title>Neural Networks Still Confuse You? Start Here.</title>
      <dc:creator>NiShITa-code</dc:creator>
      <pubDate>Thu, 05 Mar 2026 09:04:45 +0000</pubDate>
      <link>https://dev.to/nishitacode/neural-networks-still-confuse-you-start-here-1h4p</link>
      <guid>https://dev.to/nishitacode/neural-networks-still-confuse-you-start-here-1h4p</guid>
      <description>&lt;h3&gt;
  
  
  The real building blocks of modern AI — explained from first principles, with genuine intuition.
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Series: From Neural Networks to Transformers — Article 1&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;There's a phrase that gets repeated endlessly in AI articles:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Neural networks learn patterns from data."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And it sounds reasonable. Until you sit with it for a moment and realize: &lt;strong&gt;what does that actually mean?&lt;/strong&gt; What is a "pattern"? What does it mean to "learn"? What is actually happening inside these models that lets them write poetry, diagnose tumors, and beat world champions at chess?&lt;/p&gt;

&lt;p&gt;Most explanations either hand-wave through the intuition or drown you in equations. This article does neither. By the end, you'll have a genuine mental model of how neural networks work — not just a vague sense that they're "loosely inspired by the brain."&lt;/p&gt;

&lt;p&gt;Let's build this from the ground up.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 1: Why We Stopped Writing Rules
&lt;/h2&gt;

&lt;p&gt;Before neural networks took over, AI researchers tried a different approach: &lt;strong&gt;write the rules explicitly&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;To build a spam filter, you'd write something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;IF email contains "$$$"       → spam  
IF email contains "free money" → spam  
IF email contains "invoice"   → probably fine  
IF email contains a link       → suspicious
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This worked — for a while. Then reality intervened.&lt;/p&gt;

&lt;p&gt;What happens when a legitimate email from your bank contains "free" and a link? What about a phishing email that carefully avoids every keyword you've flagged? Language is ambiguous. Context matters. Edge cases multiply faster than you can write rules.&lt;/p&gt;

&lt;p&gt;The deeper problem is this: &lt;strong&gt;humans are terrible at introspecting on how they recognize things.&lt;/strong&gt; How do you know a cat from a dog? You just... know. Try writing an explicit rule for that. It's nearly impossible.&lt;/p&gt;

&lt;p&gt;So researchers had a radical idea: instead of telling the computer &lt;em&gt;what rules to use&lt;/em&gt;, what if we gave it examples and let it &lt;em&gt;figure out the rules itself&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;That's the entire premise of machine learning.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 2: Learning as Finding a Function
&lt;/h2&gt;

&lt;p&gt;Here's the key abstraction that makes everything else make sense.&lt;/p&gt;

&lt;p&gt;Somewhere in the universe, there exists a function — call it &lt;code&gt;f&lt;/code&gt; — that maps inputs to correct outputs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input: a photo → Output: "cat" or "dog"&lt;/li&gt;
&lt;li&gt;Input: an email → Output: spam or not spam
&lt;/li&gt;
&lt;li&gt;Input: a sentence in English → Output: the same sentence in French&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We don't know what &lt;code&gt;f&lt;/code&gt; looks like. But we have thousands (or millions) of &lt;strong&gt;examples&lt;/strong&gt; of its inputs and outputs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(photo_1, "cat"), (photo_2, "dog"), (photo_3, "cat"), ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The job of machine learning is to find a function &lt;code&gt;f̂&lt;/code&gt; that closely matches &lt;code&gt;f&lt;/code&gt;&lt;/strong&gt; — not by deriving it mathematically, but by looking at enough examples.&lt;/p&gt;

&lt;p&gt;A neural network is just one particularly powerful and flexible way to &lt;em&gt;represent&lt;/em&gt; that function. The reason neural networks win is the universal approximation property: given enough capacity, they can in principle approximate &lt;em&gt;any&lt;/em&gt; continuous function to arbitrary accuracy. We'll come back to why that's true.&lt;/p&gt;
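&lt;p&gt;A minimal sketch of this idea with toy data (the "hidden" function and the noise level are illustrative choices, not from the article): we only see noisy input–output pairs, and we pick the candidate f̂ that fits them best.&lt;/p&gt;

```python
import numpy as np

# The "true" f is hidden: f(x) = 3x + 0.5. We only observe noisy examples.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 3.0 * x + 0.5 + rng.normal(0, 0.05, size=50)

# Candidate family: f_hat(x) = w*x + b, chosen by least squares.
w, b = np.polyfit(x, y, deg=1)
print(w, b)   # near 3.0 and 0.5: f_hat recovered f from examples alone
```

With a richer candidate family (a neural network) the same recipe applies; only the way f̂ is represented changes.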




&lt;h2&gt;
  
  
  Part 3: The Artificial Neuron — A Weighted Opinion
&lt;/h2&gt;

&lt;p&gt;The basic unit of a neural network is the &lt;strong&gt;neuron&lt;/strong&gt;. Despite the biological branding, it's actually much simpler than a real brain cell. Think of it as a very small, very opinionated calculator.&lt;/p&gt;

&lt;p&gt;Here's what a single neuron does:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Form a weighted opinion.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It receives several inputs — numbers representing features of your data. Each input gets multiplied by a &lt;strong&gt;weight&lt;/strong&gt;, which represents how much that input matters. Then everything gets summed together, plus a &lt;strong&gt;bias&lt;/strong&gt; term (think of bias as a baseline activation level).&lt;/p&gt;

&lt;p&gt;Mathematically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or in vector form:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;z = wᵀx + b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2: Make a decision.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The raw sum &lt;code&gt;z&lt;/code&gt; gets passed through an &lt;strong&gt;activation function&lt;/strong&gt; &lt;code&gt;σ&lt;/code&gt;, which squashes or transforms it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;a = σ(z)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output &lt;code&gt;a&lt;/code&gt; is what flows to the next part of the network.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Concrete Example
&lt;/h3&gt;

&lt;p&gt;Imagine a neuron trying to predict whether an email is spam. It receives three inputs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;x₁&lt;/code&gt; = number of suspicious words&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;x₂&lt;/code&gt; = number of links&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;x₃&lt;/code&gt; = length of email&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The neuron's weights might look like: &lt;code&gt;w₁ = 0.9, w₂ = 0.4, w₃ = -0.1&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This means the neuron has learned that &lt;strong&gt;suspicious words are the strongest signal&lt;/strong&gt;, links are mildly suspicious, and longer emails are &lt;em&gt;less&lt;/em&gt; likely to be spam (legitimate emails tend to be longer). The neuron didn't arrive at these weights through logic — it discovered them by looking at thousands of examples.&lt;/p&gt;

&lt;p&gt;That's the magic. The knowledge lives in the weights.&lt;/p&gt;
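&lt;p&gt;A minimal NumPy sketch of this neuron, using the weights from the example above (the bias value and the two sample emails are illustrative assumptions, not from the article):&lt;/p&gt;

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Weights from the example; bias chosen here for illustration.
w = np.array([0.9, 0.4, -0.1])   # suspicious words, links, length
b = -1.0

def neuron(x):
    z = np.dot(w, x) + b   # Step 1: form a weighted opinion
    return sigmoid(z)      # Step 2: squash it into a decision

spammy = np.array([5.0, 3.0, 2.0])   # many suspicious words, short email
benign = np.array([0.0, 1.0, 8.0])   # no suspicious words, long email
print(neuron(spammy))   # close to 1: likely spam
print(neuron(benign))   # well below 0.5: probably fine
```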




&lt;h2&gt;
  
  
  Part 4: Why Nonlinearity Changes Everything
&lt;/h2&gt;

&lt;p&gt;Here's a question that trips up a lot of people: if we're just doing math on numbers, why do we need this "activation function" at all? Why not just use the raw sum?&lt;/p&gt;

&lt;p&gt;The answer is one of the most important ideas in deep learning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Linear operations stacked on linear operations are still just... linear operations.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No matter how many layers you add, if every layer just does &lt;code&gt;output = Wx + b&lt;/code&gt;, the entire network could be collapsed into a single layer. You'd never be able to model anything more complex than a straight line (or plane, or hyperplane).&lt;/p&gt;
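&lt;p&gt;This collapse is easy to verify numerically. A quick sketch with two random linear layers and no activation between them:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 3))          # a batch of 4 inputs
W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 2)), rng.normal(size=2)

# Two stacked linear layers...
two_layers = (x @ W1 + b1) @ W2 + b2

# ...equal ONE linear layer with W = W1 @ W2 and b = b1 @ W2 + b2.
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = x @ W + b

print(np.allclose(two_layers, one_layer))   # True: depth bought nothing
```

Insert a nonlinearity between the layers and this collapse no longer happens.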

&lt;p&gt;But real-world relationships aren't linear. The relationship between pixels and "cat-ness" is wildly non-linear. The relationship between words and sentiment is deeply non-linear.&lt;/p&gt;

&lt;p&gt;Activation functions introduce &lt;strong&gt;nonlinearity&lt;/strong&gt; — kinks, curves, thresholds — that allow the network to carve up its input space in complex, useful ways.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Three Activations You Need to Know
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Sigmoid&lt;/strong&gt; — the historical classic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;σ(x) = 1 / (1 + e⁻ˣ)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Squashes any input into a range of (0, 1). Useful for probabilities. Fell out of favor for hidden layers because it causes &lt;strong&gt;vanishing gradients&lt;/strong&gt; — a problem we'll explain shortly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tanh&lt;/strong&gt; — the improved sigmoid:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tanh(x) = (eˣ - e⁻ˣ) / (eˣ + e⁻ˣ)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similar shape to sigmoid but outputs range from (-1, 1) and is zero-centered, which helps training. Still suffers from vanishing gradients at extremes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ReLU&lt;/strong&gt; — the modern workhorse:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ReLU(x) = max(0, x)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Brutally simple. If the input is positive, pass it through unchanged. If negative, output zero. This simplicity is its strength: it's fast to compute, doesn't saturate for large positive values, and makes gradients flow cleanly through deep networks. Most modern networks default to ReLU or a close variant.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sigmoid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tanh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tanh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;relu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;maximum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Try it
&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sigmoid:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;sigmoid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;   &lt;span class="c1"&gt;# [0.047, 0.269, 0.5, 0.731, 0.953]
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tanh:   &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;tanh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;     &lt;span class="c1"&gt;# [-0.995, -0.762, 0, 0.762, 0.995]
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ReLU:   &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;relu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;     &lt;span class="c1"&gt;# [0, 0, 0, 1, 3]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Part 5: Layers — Where "Deep" Comes From
&lt;/h2&gt;

&lt;p&gt;A single neuron can only form a single weighted opinion. That's not very powerful. But stack hundreds of neurons side by side into a &lt;strong&gt;layer&lt;/strong&gt;, and then stack multiple layers on top of each other, and something remarkable happens.&lt;/p&gt;

&lt;p&gt;Each layer learns to represent the world at a different &lt;strong&gt;level of abstraction&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is easiest to see in vision. A deep network trained on images learns, layer by layer:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;What It Detects&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Layer 1&lt;/td&gt;
&lt;td&gt;Raw edges, color gradients&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Layer 2&lt;/td&gt;
&lt;td&gt;Corners, curves, textures&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Layer 3&lt;/td&gt;
&lt;td&gt;Eyes, wheels, fur, windows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Layer 4&lt;/td&gt;
&lt;td&gt;Faces, cars, animals, buildings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Layer 5&lt;/td&gt;
&lt;td&gt;Full objects in context&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Nobody programmed these layers to work this way. The network discovered this hierarchical decomposition on its own, because it turned out to be the most efficient way to represent the training data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is what "deep learning" means:&lt;/strong&gt; the word "deep" just refers to many layers.&lt;/p&gt;

&lt;p&gt;For an entire layer, the math is the same as for a single neuron, but now with matrices instead of vectors:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Z = XW + b
A = σ(Z)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where &lt;code&gt;X&lt;/code&gt; is your batch of inputs, &lt;code&gt;W&lt;/code&gt; is the weight matrix for the whole layer, and &lt;code&gt;A&lt;/code&gt; is the layer's output — which becomes the input to the next layer.&lt;/p&gt;
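&lt;p&gt;The matrix form above, as a quick NumPy sketch (the batch size and layer widths are arbitrary choices for illustration):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))          # batch of 32 examples, 4 features each
W = rng.normal(size=(4, 16)) * 0.01   # one weight column per neuron in the layer
b = np.zeros(16)

Z = X @ W + b            # every neuron's weighted sum, for the whole batch at once
A = np.maximum(0, Z)     # ReLU applied elementwise
print(A.shape)           # (32, 16): this becomes the next layer's input
```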




&lt;h2&gt;
  
  
  Part 6: The Loss Function — Giving the Network a Report Card
&lt;/h2&gt;

&lt;p&gt;The network makes predictions. But how does it know if they're good?&lt;/p&gt;

&lt;p&gt;Enter the &lt;strong&gt;loss function&lt;/strong&gt; (also called the cost function). It takes the network's predictions and the true labels, and computes a single number representing how wrong the network is. The bigger the loss, the worse the predictions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For regression&lt;/strong&gt; (predicting a continuous value), the most common loss is &lt;strong&gt;Mean Squared Error&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;L = (1/n) Σ (yᵢ - ŷᵢ)²
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the network predicts house prices and gets within $5,000 on average, that's a low loss. If it's off by $200,000, that's a high loss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For classification&lt;/strong&gt; (predicting a category), we use &lt;strong&gt;Cross-Entropy Loss&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;L = -Σ yᵢ log(ŷᵢ)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This loss punishes the network &lt;em&gt;especially hard&lt;/em&gt; when it's confidently wrong — which is exactly the behavior you want. If the true label is "cat" and the network says "99% dog," the loss is enormous.&lt;/p&gt;
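&lt;p&gt;A small sketch of that punishment curve, assuming one-hot labels and a two-class ("cat" vs "dog") prediction:&lt;/p&gt;

```python
import numpy as np

def cross_entropy(y_true, y_pred):
    # y_true is a one-hot label, y_pred a predicted probability distribution
    return -np.sum(y_true * np.log(y_pred))

cat = np.array([1.0, 0.0])   # true label: "cat"

print(cross_entropy(cat, np.array([0.90, 0.10])))   # confident and right: about 0.105
print(cross_entropy(cat, np.array([0.50, 0.50])))   # unsure: about 0.693
print(cross_entropy(cat, np.array([0.01, 0.99])))   # "99% dog": about 4.6, enormous
```

The loss grows without bound as the probability assigned to the true class approaches zero, which is why confident mistakes dominate the learning signal.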

&lt;p&gt;The loss function is the mechanism by which the outside world's expectations get translated into a learning signal. Without it, the network has no idea what "better" means.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 7: Gradient Descent — The Learning Algorithm
&lt;/h2&gt;

&lt;p&gt;Now we have a loss. The question becomes: &lt;strong&gt;how do we reduce it?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's the key insight: the loss is a function of all the weights in the network. If we imagine a landscape where each point represents a set of weights and the elevation represents the loss, training is the process of &lt;strong&gt;finding the lowest valley in this landscape&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The algorithm for doing this is called &lt;strong&gt;gradient descent&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;gradient&lt;/strong&gt; of the loss (&lt;code&gt;∇L&lt;/code&gt;) is a vector that points in the direction of &lt;em&gt;steepest increase&lt;/em&gt;. So to decrease the loss, we move in the &lt;em&gt;opposite&lt;/em&gt; direction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;θ ← θ - η · ∇L
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;θ&lt;/code&gt; represents all the weights (parameters)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;η&lt;/code&gt; (eta) is the &lt;strong&gt;learning rate&lt;/strong&gt; — how big a step we take&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;∇L&lt;/code&gt; is the gradient — which direction is uphill&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Repeat this process thousands of times across thousands of examples, and the weights slowly converge toward values that make good predictions.&lt;/p&gt;
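&lt;p&gt;A minimal sketch of the update rule on a toy one-parameter loss (not a real network, just the mechanics of the θ ← θ − η·∇L loop):&lt;/p&gt;

```python
# Toy loss L(theta) = (theta - 3)**2, whose gradient is 2 * (theta - 3).
# The minimum sits at theta = 3.
theta = 0.0   # start somewhere arbitrary
eta = 0.1     # learning rate

for step in range(100):
    grad = 2 * (theta - 3)       # which way is uphill
    theta = theta - eta * grad   # so step the other way

print(theta)   # converges toward 3.0
```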

&lt;h3&gt;
  
  
  The Learning Rate Problem
&lt;/h3&gt;

&lt;p&gt;The learning rate is one of the most important hyperparameters to get right:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Too large:&lt;/strong&gt; The network takes massive steps and overshoots the minimum, bouncing around chaotically or even diverging.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Too small:&lt;/strong&gt; Training takes forever, and you might get trapped in a poor local minimum.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Getting the learning rate right is part art, part science — and a major focus of modern optimization research.&lt;/p&gt;
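&lt;p&gt;A toy illustration of both failure modes, using a one-parameter quadratic loss (an illustrative choice, not from the article):&lt;/p&gt;

```python
def descend(eta, steps=50):
    # Toy loss L(theta) = (theta - 3)**2, gradient 2 * (theta - 3), minimum at 3.
    theta = 0.0
    for _ in range(steps):
        theta = theta - eta * 2 * (theta - 3)
    return theta

print(descend(0.01))   # too small: still noticeably short of 3 after 50 steps
print(descend(0.1))    # about right: essentially at 3
print(descend(1.5))    # too large: each step overshoots further, and it diverges
```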




&lt;h2&gt;
  
  
  Part 8: Backpropagation — The Engine Under the Hood
&lt;/h2&gt;

&lt;p&gt;Computing gradients for a network with millions of parameters sounds impossibly hard. If you had to compute each gradient separately, it would be. But there's a clever algorithm that does it in one efficient pass: &lt;strong&gt;backpropagation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Backprop is just the &lt;strong&gt;chain rule of calculus&lt;/strong&gt;, applied systematically and repeatedly.&lt;/p&gt;

&lt;p&gt;The chain rule says: if &lt;code&gt;L = f(g(x))&lt;/code&gt;, then:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dL/dx = (dL/dg) · (dg/dx)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a neural network, the loss is a function of the output layer, which is a function of the previous layer, which is a function of the layer before that, and so on. Backprop unravels this chain of functions from right to left — from the loss back to the first layer — computing gradients at each step.&lt;/p&gt;

&lt;p&gt;Here's the conceptual picture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Input] → [Layer 1] → [Layer 2] → [Layer 3] → [Loss]

Forward pass:  data flows left to right  →→→→→
Backward pass: gradients flow right to left  ←←←←←
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each weight receives a gradient that tells it: "if you increase slightly, does the loss go up or down, and by how much?" This is how each neuron learns its responsibility for the network's errors.&lt;/p&gt;

&lt;p&gt;The reason this is powerful: &lt;strong&gt;backprop computes gradients for all parameters in a single backward pass&lt;/strong&gt;. No matter how many layers or neurons, the computation scales linearly. This is what makes training deep networks tractable.&lt;/p&gt;
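&lt;p&gt;A minimal sketch of backprop as the chain rule, on a deliberately tiny scalar "network" (the input, target, and weight values are illustrative), checked against a numerical gradient:&lt;/p&gt;

```python
# Tiny scalar network: h = relu(w1 * x), y = w2 * h, loss L = (y - t)**2.
x, t = 2.0, 1.0
w1, w2 = 0.5, -1.5

def forward(w1, w2):
    h = max(0.0, w1 * x)
    y = w2 * h
    return h, y, (y - t) ** 2

h, y, L = forward(w1, w2)

# Backward pass: one chain-rule step per arrow, right to left.
dL_dy = 2 * (y - t)     # d/dy of (y - t)^2
dL_dw2 = dL_dy * h      # through y = w2 * h
dL_dh = dL_dy * w2
dL_dw1 = dL_dh * (1.0 if w1 * x > 0 else 0.0) * x   # through ReLU, then h = w1 * x

# Sanity check against a finite-difference gradient:
eps = 1e-6
num_dw1 = (forward(w1 + eps, w2)[2] - forward(w1 - eps, w2)[2]) / (2 * eps)
print(dL_dw1, num_dw1)   # the two agree: backprop computed the true gradient
```

Notice that every quantity the backward pass needs (`h`, `y`) was already computed in the forward pass; backprop just reuses it, which is where the efficiency comes from.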




&lt;h2&gt;
  
  
  Part 9: Putting It All Together
&lt;/h2&gt;

&lt;p&gt;Here's a minimal neural network implemented from scratch — no PyTorch, no TensorFlow, just NumPy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;NeuralNetwork&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hidden_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_size&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Initialize weights with small random values
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;W1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hidden_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;b1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hidden_size&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;W2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hidden_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;b2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_size&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;relu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;maximum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;relu_grad&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 1 where z &amp;gt; 0, else 0
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sigmoid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Layer 1
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Z1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;W1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;b1&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;A1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;relu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Z1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Layer 2 (output)
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Z2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;A1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;W2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;b2&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;A2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sigmoid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Z2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;A2&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compute_loss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Binary cross-entropy
&lt;/span&gt;        &lt;span class="n"&gt;eps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1e-8&lt;/span&gt;  &lt;span class="c1"&gt;# prevent log(0)
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
                        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;backward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;learning_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c1"&gt;# Output layer gradient
&lt;/span&gt;        &lt;span class="n"&gt;dZ2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;A2&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y_true&lt;/span&gt;
        &lt;span class="n"&gt;dW2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;A1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dZ2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;
        &lt;span class="n"&gt;db2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dZ2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;

        &lt;span class="c1"&gt;# Hidden layer gradient (chain rule)
&lt;/span&gt;        &lt;span class="n"&gt;dA1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dZ2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;W2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;dZ1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dA1&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;relu_grad&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Z1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;dW1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dZ1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;
        &lt;span class="n"&gt;db1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dZ1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;

        &lt;span class="c1"&gt;# Update weights
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;W2&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="n"&gt;learning_rate&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;dW2&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;b2&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="n"&gt;learning_rate&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;db2&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;W1&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="n"&gt;learning_rate&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;dW1&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;b1&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="n"&gt;learning_rate&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;db1&lt;/span&gt;

&lt;span class="c1"&gt;# Usage
&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;NeuralNetwork&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hidden_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# 100 examples, 3 features each
&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# binary labels
&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;epoch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compute_loss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;backward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;learning_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;epoch&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Epoch &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;epoch&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: Loss = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This 60-line implementation contains the complete learning loop: forward pass, loss computation, backpropagation, and weight updates. Everything PyTorch and TensorFlow do during training is this same loop, scaled up to models with millions or billions of parameters.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 10: The Universal Approximation Theorem
&lt;/h2&gt;

&lt;p&gt;Here's a fact that's almost philosophically unsettling:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A neural network with a &lt;strong&gt;single hidden layer&lt;/strong&gt; and enough neurons can approximate &lt;em&gt;any&lt;/em&gt; continuous function to arbitrary precision.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the &lt;strong&gt;Universal Approximation Theorem&lt;/strong&gt;, and it's why neural networks are so powerful. They're not limited to modeling linear relationships, or polynomial ones, or any particular family of functions. They're a universal function approximator.&lt;/p&gt;

&lt;p&gt;But this theorem has a catch — or really, two catches:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;"Enough neurons" can mean &lt;em&gt;a lot&lt;/em&gt; of neurons. In the worst case, exponentially many.&lt;/li&gt;
&lt;li&gt;The theorem says nothing about whether you can &lt;em&gt;find&lt;/em&gt; the right weights. It just says the right weights exist somewhere.&lt;/li&gt;
&lt;/ol&gt;
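&lt;p&gt;You can watch the theorem in action with a tiny experiment: one hidden layer learning to fit sin(x). This is a minimal sketch in the same NumPy style as the network above; the sizes, learning rate, and epoch count are illustrative choices, not tuned values.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(-np.pi, np.pi, 100).reshape(-1, 1)
y = np.sin(X)  # the continuous function we want to approximate

hidden = 16
W1 = rng.normal(0, 1.0, (1, hidden)); b1 = np.zeros((1, hidden))
W2 = rng.normal(0, 0.1, (hidden, 1)); b2 = np.zeros((1, 1))

lr = 0.01
losses = []
for epoch in range(5000):
    A1 = np.tanh(X.dot(W1) + b1)        # single hidden layer
    y_pred = A1.dot(W2) + b2            # linear output for regression
    losses.append(np.mean((y_pred - y) ** 2))

    dY = 2 * (y_pred - y) / len(X)      # gradient of mean squared error
    dW2 = A1.T.dot(dY); db2 = dY.sum(axis=0, keepdims=True)
    dZ1 = dY.dot(W2.T) * (1 - A1 ** 2)  # tanh'(z) = 1 - tanh(z)^2
    dW1 = X.T.dot(dZ1); db1 = dZ1.sum(axis=0, keepdims=True)

    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(f"MSE went from {losses[0]:.4f} to {losses[-1]:.4f}")
```

&lt;p&gt;With just 16 hidden units the fit is already decent. More neurons buy more precision, which is exactly what the theorem promises: catch #2 still applies, though, since it's gradient descent, not the theorem, doing the finding here.&lt;/p&gt;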

&lt;p&gt;This is why depth matters. For many functions, deep networks can represent them far more &lt;strong&gt;efficiently&lt;/strong&gt; than shallow ones, sometimes with exponentially fewer parameters. A 10-layer network can learn representations that a 1-layer network would need to be astronomically large to match.&lt;/p&gt;

&lt;p&gt;Depth isn't just about having more capacity. It's about having &lt;strong&gt;the right kind of structure&lt;/strong&gt; to build complex representations hierarchically.&lt;/p&gt;
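&lt;p&gt;A quick way to build intuition for the trade-off is to count parameters. A fully connected layer from n_in to n_out units has n_in × n_out weights plus n_out biases. The widths below are made-up numbers for illustration, not a formal equivalence between the two networks:&lt;/p&gt;

```python
# Count weights + biases in a fully connected network,
# given the list of layer sizes from input to output.
def count_params(layer_sizes):
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

deep = [64] + [128] * 10 + [1]   # 10 hidden layers, 128 units each
shallow = [64, 100_000, 1]       # one very wide hidden layer

print(count_params(deep))        # about 157 thousand parameters
print(count_params(shallow))     # about 6.6 million parameters
```

&lt;p&gt;The deep stack uses roughly 40× fewer parameters than the single wide layer here, and for the hardest functions the theorem allows, that gap becomes exponential.&lt;/p&gt;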




&lt;h2&gt;
  
  
  The Limits of What We've Built
&lt;/h2&gt;

&lt;p&gt;We now have a powerful framework. But there's a fundamental assumption baked into everything we've discussed:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The network processes fixed-size inputs.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You define the input layer size once, at the beginning, and it never changes. Every example must be the same shape.&lt;/p&gt;

&lt;p&gt;This is fine for structured data — a table with 10 columns, or a 28×28 pixel image. But the real world is full of &lt;em&gt;sequential&lt;/em&gt; data where this assumption breaks down completely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A sentence can be 3 words or 300 words.&lt;/li&gt;
&lt;li&gt;A piece of music can be 30 seconds or 30 minutes.&lt;/li&gt;
&lt;li&gt;A conversation has no fixed length.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Worse, &lt;strong&gt;order matters&lt;/strong&gt; in sequences. "The dog bit the man" and "The man bit the dog" contain the same words but mean completely different things. A standard neural network has no way to represent this.&lt;/p&gt;
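&lt;p&gt;You can see the problem concretely. The standard trick for forcing text into a fixed-size input is a "bag of words" vector of word counts, and it destroys order completely. A minimal sketch (the tiny four-word vocabulary is just for illustration):&lt;/p&gt;

```python
from collections import Counter

# Turn a sentence into a fixed-size vector of word counts.
def bag_of_words(sentence, vocab):
    counts = Counter(sentence.lower().split())
    return [counts[w] for w in vocab]

vocab = ["the", "dog", "bit", "man"]
a = bag_of_words("The dog bit the man", vocab)
b = bag_of_words("The man bit the dog", vocab)

print(a)       # [2, 1, 1, 1]
print(a == b)  # True: identical inputs, opposite meanings
```

&lt;p&gt;Both sentences map to the exact same vector, so no amount of training can teach a network fed this representation to tell them apart.&lt;/p&gt;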

&lt;p&gt;To handle sequences, researchers needed to give neural networks something they fundamentally lacked: &lt;strong&gt;memory&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That led to Recurrent Neural Networks — and eventually to the architecture that powers every major language model today.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Comes Next
&lt;/h2&gt;

&lt;p&gt;In &lt;strong&gt;Article 2&lt;/strong&gt;, we dig into sequence modeling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why RNNs were such a breakthrough&lt;/li&gt;
&lt;li&gt;How recurrent connections give networks a form of memory&lt;/li&gt;
&lt;li&gt;Why that memory broke down for long sequences (the vanishing gradient problem, revisited)&lt;/li&gt;
&lt;li&gt;And why researchers eventually abandoned RNNs entirely in favor of something stranger and more powerful: &lt;strong&gt;attention&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;→ Next: "Why Your Smart Speaker Can't Understand You: The Memory Problem in AI"&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If this article helped something click, share it with someone who's been nodding along to AI explanations without quite understanding them. That's exactly who it's written for.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>beginners</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
