<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Gary Jackson</title>
    <description>The latest articles on DEV Community by Gary Jackson (@garyljackson).</description>
    <link>https://dev.to/garyljackson</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F675295%2Fa4e74fbc-1c81-453d-a922-c4e43a458d1f.jpg</url>
      <title>DEV Community: Gary Jackson</title>
      <link>https://dev.to/garyljackson</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/garyljackson"/>
    <language>en</language>
    <item>
      <title>Chapter 1: The Value Class - Recording the Forward Pass</title>
      <dc:creator>Gary Jackson</dc:creator>
      <pubDate>Tue, 21 Apr 2026 05:07:40 +0000</pubDate>
      <link>https://dev.to/garyljackson/chapter-1-the-value-class-recording-the-forward-pass-4dcd</link>
      <guid>https://dev.to/garyljackson/chapter-1-the-value-class-recording-the-forward-pass-4dcd</guid>
<description>&lt;h3&gt;What You'll Build&lt;/h3&gt;

&lt;p&gt;A class called &lt;code&gt;Value&lt;/code&gt; that wraps a &lt;code&gt;double&lt;/code&gt; and remembers how it was created. Think of it as a number that keeps a receipt of every operation it went through.&lt;/p&gt;

&lt;h3&gt;Why This Comes First&lt;/h3&gt;

&lt;p&gt;In the Big Picture, Step 1 (the forward pass) chains together thousands of small operations, and Step 3 (the backward pass) walks those operations in reverse.&lt;/p&gt;

&lt;p&gt;For that to work, every operation has to leave a record behind: what were the inputs, and how sensitive was the output to each input? The &lt;code&gt;Value&lt;/code&gt; class is that record-keeping wrapper. Every number in our neural network is going to be a &lt;code&gt;Value&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;The Core Idea&lt;/h3&gt;

&lt;p&gt;A &lt;code&gt;Value&lt;/code&gt; holds three things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The number itself (&lt;code&gt;Data&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;A reference to the values that produced it (&lt;code&gt;_inputs&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;local gradient&lt;/strong&gt; (&lt;code&gt;_localGrads&lt;/code&gt;) - how much the output of that specific operation would change if you wiggled each input&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You don't need to understand the calculus behind these local gradients right now. Each operation has a known, fixed rule for them (listed in the table below), and the backward pass in Chapter 2 uses them mechanically.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;Grad&lt;/code&gt; field is empty for now. It gets filled in during the backward pass (Chapter 2) with the answer to: "how much does the &lt;em&gt;final loss&lt;/em&gt; change if I wiggle &lt;em&gt;this specific value&lt;/em&gt;?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A naming distinction worth pinning down now.&lt;/strong&gt; There are two things on a &lt;code&gt;Value&lt;/code&gt; that both include the word "gradient", and they do different jobs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Local gradient&lt;/strong&gt; (&lt;code&gt;_localGrads&lt;/code&gt;) - stored per operation (&lt;code&gt;+&lt;/code&gt;, &lt;code&gt;*&lt;/code&gt;, &lt;code&gt;Exp&lt;/code&gt;, etc.), frozen at forward time. For each input to the op, it records: "if &lt;em&gt;only&lt;/em&gt; that input changed by a tiny amount, how much would &lt;em&gt;this op's output&lt;/em&gt; change?" It's a property of one operation in isolation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gradient&lt;/strong&gt; (&lt;code&gt;Grad&lt;/code&gt;) - filled in during the backward pass. Every &lt;code&gt;Value&lt;/code&gt; has its own &lt;code&gt;Grad&lt;/code&gt;, which records: "if &lt;em&gt;only&lt;/em&gt; this &lt;code&gt;Value&lt;/code&gt;'s &lt;code&gt;Data&lt;/code&gt; changed by a tiny amount, how much would the &lt;em&gt;final loss&lt;/em&gt; change?" It's a property of the whole path from this &lt;code&gt;Value&lt;/code&gt; to the loss.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The backward pass in Chapter 2 walks the graph in reverse, multiplying the two together via the chain rule to fill in every &lt;code&gt;Grad&lt;/code&gt;.&lt;/p&gt;
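
&lt;p&gt;Here's that multiplication in miniature - the numbers are invented purely for illustration, and Chapter 2 does this for real:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;c = a * b  with  a = 2, b = 3   →  c records localGrads = [3, 2]
suppose the backward pass has set  c.Grad = 10
a.Grad = 3 * 10 = 30   (local gradient × c's gradient)
b.Grad = 2 * 10 = 20
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;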

&lt;h3&gt;Code&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// --- Value.cs ---&lt;/span&gt;

&lt;span class="k"&gt;namespace&lt;/span&gt; &lt;span class="nn"&gt;MicroGPT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;[]?&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt;&lt;span class="p"&gt;[]?&lt;/span&gt; &lt;span class="n"&gt;localGrads&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;Data&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;Grad&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// filled in during the backward pass (Chapter 2)&lt;/span&gt;

    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;[]?&lt;/span&gt; &lt;span class="n"&gt;_inputs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt;&lt;span class="p"&gt;[]?&lt;/span&gt; &lt;span class="n"&gt;_localGrads&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;localGrads&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// --- Arithmetic operations ---&lt;/span&gt;
    &lt;span class="c1"&gt;// Each operation records three things: the result, the inputs, and the local gradients.&lt;/span&gt;
    &lt;span class="c1"&gt;// The local gradients are explained in the "Verifying Local Gradients" section below.&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="k"&gt;operator&lt;/span&gt; &lt;span class="p"&gt;+(&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="k"&gt;operator&lt;/span&gt; &lt;span class="p"&gt;*(&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;
        &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt; &lt;span class="p"&gt;*&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

    &lt;span class="c1"&gt;// NaN if Data is negative and n is fractional.&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="nf"&gt;Pow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Pow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="p"&gt;*&lt;/span&gt; &lt;span class="n"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Pow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)]);&lt;/span&gt;

    &lt;span class="c1"&gt;// -Infinity if Data == 0, NaN if Data &amp;lt; 0. If you see NaN propagating through&lt;/span&gt;
    &lt;span class="c1"&gt;// training, a softmax probability collapsed to 0 and this is usually the entry point.&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="nf"&gt;Log&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1.0&lt;/span&gt; &lt;span class="p"&gt;/&lt;/span&gt; &lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="nf"&gt;Exp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;)]);&lt;/span&gt;

    &lt;span class="c1"&gt;// ReLU: passes positive values through unchanged, blocks negatives entirely.&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="nf"&gt;Relu&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt; &lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

    &lt;span class="c1"&gt;// --- Convenience overloads ---&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="k"&gt;operator&lt;/span&gt; &lt;span class="p"&gt;+(&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="k"&gt;operator&lt;/span&gt; &lt;span class="p"&gt;*(&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="p"&gt;*&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="k"&gt;operator&lt;/span&gt; &lt;span class="p"&gt;-(&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="p"&gt;*&lt;/span&gt; &lt;span class="p"&gt;-&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="k"&gt;operator&lt;/span&gt; &lt;span class="p"&gt;-(&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(-&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="k"&gt;operator&lt;/span&gt; &lt;span class="p"&gt;/(&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="p"&gt;*&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Pow&lt;/span&gt;&lt;span class="p"&gt;(-&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="k"&gt;operator&lt;/span&gt; &lt;span class="p"&gt;/(&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="p"&gt;*&lt;/span&gt; &lt;span class="n"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Pow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;-&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="nf"&gt;ToString&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;$"Value(data=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
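
&lt;p&gt;A quick sanity check that the recording works - combine two values and the result remembers everything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;var a = new Value(2.0);
var b = new Value(3.0);
Value c = a * b;      // c.Data == 6; c recorded inputs [a, b] and localGrads [3.0, 2.0]
Console.WriteLine(c); // prints: Value(data=6)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;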



&lt;p&gt;For quick reference, here are the local gradients each operation records:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;Local gradient(s)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;a + b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;1&lt;/code&gt;, &lt;code&gt;1&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;a * b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;b&lt;/code&gt;, &lt;code&gt;a&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;a.Pow(n)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;n * aⁿ⁻¹&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;a.Log()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1 / a&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;a.Exp()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;eᵃ&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;a.Relu()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;1&lt;/code&gt; if &lt;code&gt;a &amp;gt; 0&lt;/code&gt;, else &lt;code&gt;0&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;Verifying Local Gradients - The Nudge Test&lt;/h3&gt;

&lt;p&gt;Have a look at the addition operator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="k"&gt;operator&lt;/span&gt; &lt;span class="p"&gt;+(&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The second argument, &lt;code&gt;[a, b]&lt;/code&gt;, records the two inputs. The third argument, &lt;code&gt;[1.0, 1.0]&lt;/code&gt;, records the local gradient for each input, &lt;em&gt;in the same order&lt;/em&gt;. So:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The local gradient for input &lt;code&gt;a&lt;/code&gt; is &lt;code&gt;1.0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The local gradient for input &lt;code&gt;b&lt;/code&gt; is &lt;code&gt;1.0&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But what does that actually mean, and why should you believe those are the right numbers?&lt;/p&gt;

&lt;p&gt;You can answer both questions without any calculus. The technique is simple: &lt;strong&gt;nudge one input by a tiny amount, run the operation again, and see how much the output changed.&lt;/strong&gt; The ratio of output-change to input-change &lt;em&gt;is&lt;/em&gt; the local gradient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Addition: why is the local gradient &lt;code&gt;1.0&lt;/code&gt; for both inputs?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let's say &lt;code&gt;a = 2&lt;/code&gt; and &lt;code&gt;b = 3&lt;/code&gt;. The output is &lt;code&gt;2 + 3 = 5&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now nudge &lt;code&gt;a&lt;/code&gt; up by a tiny amount - say, 0.001:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New output: &lt;code&gt;2.001 + 3 = 5.001&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The output changed by: &lt;code&gt;5.001 - 5.0 = 0.001&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;You nudged by 0.001 and the output moved by 0.001&lt;/li&gt;
&lt;li&gt;Ratio: &lt;code&gt;0.001 / 0.001 = 1.0&lt;/code&gt; - that's the local gradient for &lt;code&gt;a&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now nudge &lt;code&gt;b&lt;/code&gt; instead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New output: &lt;code&gt;2 + 3.001 = 5.001&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Same result: ratio = &lt;code&gt;1.0&lt;/code&gt; - that's the local gradient for &lt;code&gt;b&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Addition passes changes through at a 1:1 rate for both inputs, so the local gradients array is &lt;code&gt;[1.0, 1.0]&lt;/code&gt;.&lt;/p&gt;
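
&lt;p&gt;You can reproduce this at the console with plain &lt;code&gt;double&lt;/code&gt;s - no &lt;code&gt;Value&lt;/code&gt; objects needed. Expect a whisker of floating-point noise in the printed ratios:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;double a = 2, b = 3, nudge = 0.001;
Console.WriteLine(((a + nudge) + b - (a + b)) / nudge); // ≈ 1.0, the local gradient for a
Console.WriteLine((a + (b + nudge) - (a + b)) / nudge); // ≈ 1.0, the local gradient for b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;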

&lt;p&gt;&lt;strong&gt;Multiplication: why are the local gradients &lt;code&gt;[b.Data, a.Data]&lt;/code&gt;?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Have a look at the multiplication operator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="k"&gt;operator&lt;/span&gt; &lt;span class="p"&gt;*(&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt; &lt;span class="p"&gt;*&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The local gradients are &lt;code&gt;[b.Data, a.Data]&lt;/code&gt;, meaning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The local gradient for input &lt;code&gt;a&lt;/code&gt; is &lt;code&gt;b&lt;/code&gt;'s value&lt;/li&gt;
&lt;li&gt;The local gradient for input &lt;code&gt;b&lt;/code&gt; is &lt;code&gt;a&lt;/code&gt;'s value&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's verify with &lt;code&gt;a = 2&lt;/code&gt;, &lt;code&gt;b = 3&lt;/code&gt;. The output is &lt;code&gt;2 * 3 = 6&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Nudge &lt;code&gt;a&lt;/code&gt; by 0.001:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New output: &lt;code&gt;2.001 * 3 = 6.003&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The output changed by: &lt;code&gt;6.003 - 6.0 = 0.003&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Ratio: &lt;code&gt;0.003 / 0.001 = 3.0&lt;/code&gt; - that's the local gradient for &lt;code&gt;a&lt;/code&gt;, and it equals &lt;code&gt;b&lt;/code&gt;'s value&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Nudge &lt;code&gt;b&lt;/code&gt; by 0.001:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New output: &lt;code&gt;2 * 3.001 = 6.002&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The output changed by: &lt;code&gt;6.002 - 6.0 = 0.002&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Ratio: &lt;code&gt;0.002 / 0.001 = 2.0&lt;/code&gt; - that's the local gradient for &lt;code&gt;b&lt;/code&gt;, and it equals &lt;code&gt;a&lt;/code&gt;'s value&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes intuitive sense: the bigger &lt;code&gt;b&lt;/code&gt; is, the more a small change to &lt;code&gt;a&lt;/code&gt; gets amplified, and vice versa.&lt;/p&gt;
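
&lt;p&gt;The same console check works here, reusing &lt;code&gt;a&lt;/code&gt;, &lt;code&gt;b&lt;/code&gt;, and &lt;code&gt;nudge&lt;/code&gt; from the addition snippet above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;Console.WriteLine(((a + nudge) * b - a * b) / nudge); // ≈ 3.0 - b's value
Console.WriteLine((a * (b + nudge) - a * b) / nudge); // ≈ 2.0 - a's value
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;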

&lt;p&gt;&lt;strong&gt;Power: the first curved function.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Have a look at the power operator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="nf"&gt;Pow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Pow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="p"&gt;*&lt;/span&gt; &lt;span class="n"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Pow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)]);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For &lt;code&gt;a²&lt;/code&gt; (so &lt;code&gt;n = 2&lt;/code&gt;), the local gradient &lt;code&gt;n * Math.Pow(a, n - 1)&lt;/code&gt; simplifies to &lt;code&gt;2 * a&lt;/code&gt;. That's the first formula we've seen where the gradient depends on &lt;code&gt;a&lt;/code&gt; itself, and the reason is that &lt;code&gt;a²&lt;/code&gt; behaves differently from addition and multiplication. Let's see how.&lt;/p&gt;

&lt;p&gt;Line up some input/output pairs for &lt;code&gt;a²&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;a = 1  →  a² =  1
a = 2  →  a² =  4   (jumped by 3)
a = 3  →  a² =  9   (jumped by 5)
a = 4  →  a² = 16   (jumped by 7)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each step in &lt;code&gt;a&lt;/code&gt; produces a &lt;em&gt;bigger&lt;/em&gt; jump in &lt;code&gt;a²&lt;/code&gt; than the last. Compare that to &lt;code&gt;a + 5&lt;/code&gt;, which goes &lt;code&gt;6, 7, 8, 9&lt;/code&gt; as &lt;code&gt;a&lt;/code&gt; goes &lt;code&gt;1, 2, 3, 4&lt;/code&gt; - the jump is exactly &lt;code&gt;1&lt;/code&gt; every step. We'll call &lt;code&gt;a + 5&lt;/code&gt; a &lt;strong&gt;straight&lt;/strong&gt; function (same rate of change everywhere) and &lt;code&gt;a²&lt;/code&gt; a &lt;strong&gt;curved&lt;/strong&gt; function (rate of change grows as &lt;code&gt;a&lt;/code&gt; gets bigger).&lt;/p&gt;

&lt;p&gt;Multiplication is straight too, from &lt;code&gt;a&lt;/code&gt;'s perspective. &lt;code&gt;a * 3&lt;/code&gt; goes &lt;code&gt;3, 6, 9, 12&lt;/code&gt; as &lt;code&gt;a&lt;/code&gt; goes &lt;code&gt;1, 2, 3, 4&lt;/code&gt; - the jump is always exactly &lt;code&gt;3&lt;/code&gt;. That's why the local gradient for multiplication is a fixed number (&lt;code&gt;b.Data&lt;/code&gt;): no matter where you nudge &lt;code&gt;a&lt;/code&gt;, the rate of change is the same.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;a²&lt;/code&gt; is different. The rate of change at &lt;code&gt;a = 3&lt;/code&gt; isn't the same as the rate at &lt;code&gt;a = 4&lt;/code&gt;. The formula &lt;code&gt;2 * a&lt;/code&gt; tells us: at &lt;code&gt;a = 3&lt;/code&gt;, the rate is &lt;code&gt;6&lt;/code&gt;; at &lt;code&gt;a = 4&lt;/code&gt;, the rate is &lt;code&gt;8&lt;/code&gt;. There's no single number that describes &lt;code&gt;a²&lt;/code&gt;'s rate - you have to ask "rate at which value of &lt;code&gt;a&lt;/code&gt;?".&lt;/p&gt;

&lt;p&gt;Let's nudge-test at &lt;code&gt;a = 3&lt;/code&gt;, using a small nudge (&lt;code&gt;0.0001&lt;/code&gt;) to match the default in &lt;code&gt;GradientCheck.cs&lt;/code&gt; below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Original: &lt;code&gt;3² = 9&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Nudged: &lt;code&gt;3.0001² = 9.00060001&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Change in output: &lt;code&gt;0.00060001&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Ratio: &lt;code&gt;0.00060001 / 0.0001 = 6.0001&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The formula says the rate at &lt;code&gt;a = 3&lt;/code&gt; is exactly &lt;code&gt;6&lt;/code&gt;, and we measured &lt;code&gt;6.0001&lt;/code&gt;. The extra &lt;code&gt;0.0001&lt;/code&gt; isn't a bug - it's the curvature leaking in. When we nudged from &lt;code&gt;3&lt;/code&gt; to &lt;code&gt;3.0001&lt;/code&gt;, we technically measured something between "the rate at &lt;code&gt;3&lt;/code&gt;" and "the rate at &lt;code&gt;3.0001&lt;/code&gt;" (which is very slightly steeper), so we overshoot the true answer at &lt;code&gt;3&lt;/code&gt; by a tiny amount. Halve the nudge and the error halves with it.&lt;/p&gt;
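
&lt;p&gt;You can watch the overshoot shrink by shrinking the nudge:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Measured rate of a² at a = 3: the overshoot shrinks in step with the nudge
foreach (double h in new[] { 0.001, 0.0001, 0.00001 })
{
    Console.WriteLine(((3 + h) * (3 + h) - 9) / h); // ≈ 6.001, then 6.0001, then 6.00001
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;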

&lt;p&gt;&lt;strong&gt;This is a general fact about the nudge test:&lt;/strong&gt; it's exact for straight functions, and slightly off for curved ones by an amount proportional to the nudge size. Keep that in mind when we run the full check in a moment - you'll see &lt;code&gt;6.0001&lt;/code&gt; for Power and a similar drift for Exp (also curved), while Addition and Multiplication come out perfectly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You can verify any operation this way.&lt;/strong&gt; You don't need to trust the formulas. You don't need calculus. You just need subtraction and division.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Heads up:&lt;/strong&gt; &lt;code&gt;Log&lt;/code&gt; and &lt;code&gt;Pow&lt;/code&gt; can blow up on certain inputs. &lt;code&gt;Log(0)&lt;/code&gt; gives &lt;code&gt;-Infinity&lt;/code&gt;, &lt;code&gt;Log&lt;/code&gt; of any negative number gives &lt;code&gt;NaN&lt;/code&gt;, and &lt;code&gt;Pow&lt;/code&gt; gives &lt;code&gt;NaN&lt;/code&gt; when you raise a negative number to a non-whole power like &lt;code&gt;0.5&lt;/code&gt;. If &lt;code&gt;NaN&lt;/code&gt; ever starts spreading through training later in the course, one of these two operations is almost always where it started. Come back to this section and nudge-test the suspect values.&lt;/p&gt;
&lt;/blockquote&gt;
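
&lt;p&gt;All three failure modes are one console line away if you want to see them - the results are standard .NET behaviour:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;Console.WriteLine(Math.Log(0));       // negative infinity
Console.WriteLine(Math.Log(-1));      // NaN
Console.WriteLine(Math.Pow(-2, 0.5)); // NaN - negative base, fractional exponent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;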

&lt;h3&gt;Automating the Nudge Test&lt;/h3&gt;

&lt;p&gt;We've nudge-tested addition, multiplication, and power by hand. The helper class below automates the same measurement for every operation, running it against raw math functions (no &lt;code&gt;Value&lt;/code&gt; objects involved) so the formulas are confirmed independently of the C# implementation. Put it in &lt;code&gt;GradientCheck.cs&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// --- GradientCheck.cs ---&lt;/span&gt;

&lt;span class="k"&gt;namespace&lt;/span&gt; &lt;span class="nn"&gt;MicroGPT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;GradientCheck&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;/// &amp;lt;summary&amp;gt;&lt;/span&gt;
    &lt;span class="c1"&gt;/// Measures the gradient of a function at a specific input value by nudging&lt;/span&gt;
    &lt;span class="c1"&gt;/// and observing. Works for any function that takes a double and returns a double.&lt;/span&gt;
    &lt;span class="c1"&gt;/// &amp;lt;/summary&amp;gt;&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="nf"&gt;MeasureGradient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Func&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;double&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;nudge&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;0.0001&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;before&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;f&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;at&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;after&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;f&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;at&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="n"&gt;nudge&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;after&lt;/span&gt; &lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="n"&gt;before&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;/&lt;/span&gt; &lt;span class="n"&gt;nudge&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;/// &amp;lt;summary&amp;gt;&lt;/span&gt;
    &lt;span class="c1"&gt;/// Runs the nudge test for all Value operations and prints the results.&lt;/span&gt;
    &lt;span class="c1"&gt;/// &amp;lt;/summary&amp;gt;&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;RunAll&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// The "expected" column for each check is computed by applying the local&lt;/span&gt;
        &lt;span class="c1"&gt;// gradient formula from Value.cs directly. The "measured" column comes from&lt;/span&gt;
        &lt;span class="c1"&gt;// the nudge test. If the formula is right, the two columns should agree.&lt;/span&gt;
        &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;Row&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;measured&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;
            &lt;span class="n"&gt;Console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WriteLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="s"&gt;$"  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;-&lt;/span&gt;&lt;span class="m"&gt;16&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  measured &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;measured&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;8&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;F4&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt;   expected &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;8&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;F4&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="n"&gt;Console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WriteLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"=== Straight functions (measurement should be exact) ==="&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;Console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WriteLine&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

        &lt;span class="c1"&gt;// Addition: local gradient formula from Value.cs is [1.0, 1.0]&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="n"&gt;Console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WriteLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;$"--- Addition: a + b where a=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, b=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ---"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="nf"&gt;Row&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Gradient for a"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;MeasureGradient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="nf"&gt;Row&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Gradient for b"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;MeasureGradient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="n"&gt;Console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WriteLine&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// Multiplication: local gradient formula from Value.cs is [b.Data, a.Data]&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="n"&gt;Console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WriteLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;$"--- Multiplication: a * b where a=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, b=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ---"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="nf"&gt;Row&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Gradient for a"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;MeasureGradient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="p"&gt;*&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="nf"&gt;Row&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Gradient for b"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;MeasureGradient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="p"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="n"&gt;Console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WriteLine&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="n"&gt;Console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WriteLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"=== Curved functions (tiny drift proportional to nudge size) ==="&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;Console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WriteLine&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

        &lt;span class="c1"&gt;// Power: local gradient formula from Value.cs is n * Math.Pow(Data, n - 1)&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="n"&gt;Console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WriteLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;$"--- Power: a^n where a=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, n=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ---"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="nf"&gt;Row&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Gradient for a"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;MeasureGradient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Pow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="p"&gt;*&lt;/span&gt; &lt;span class="n"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Pow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
            &lt;span class="n"&gt;Console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WriteLine&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// Exp: local gradient formula from Value.cs is Math.Exp(Data)&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="n"&gt;Console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WriteLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;$"--- Exp: e^a where a=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ---"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="nf"&gt;Row&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Gradient for a"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;MeasureGradient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
            &lt;span class="n"&gt;Console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WriteLine&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// Log: local gradient formula from Value.cs is 1.0 / Data&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="n"&gt;Console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WriteLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;$"--- Log: ln(a) where a=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ---"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="nf"&gt;Row&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Gradient for a"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;MeasureGradient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt; &lt;span class="p"&gt;/&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wire it into the dispatcher by uncommenting the &lt;code&gt;gradcheck&lt;/code&gt; line in the &lt;code&gt;switch&lt;/code&gt; in &lt;code&gt;Program.cs&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s"&gt;"gradcheck"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;GradientCheck&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;RunAll&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dotnet run &lt;span class="nt"&gt;--&lt;/span&gt; gradcheck
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
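
&lt;p&gt;The output should look close to this - the exact spacing comes from the &lt;code&gt;Row&lt;/code&gt; format string, and the last digits may wobble with floating point:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=== Straight functions (measurement should be exact) ===

--- Addition: a + b where a=2, b=3 ---
  Gradient for a    measured   1.0000   expected   1.0000
  Gradient for b    measured   1.0000   expected   1.0000

--- Multiplication: a * b where a=2, b=3 ---
  Gradient for a    measured   3.0000   expected   3.0000
  Gradient for b    measured   2.0000   expected   2.0000

=== Curved functions (tiny drift proportional to nudge size) ===

--- Power: a^n where a=3, n=2 ---
  Gradient for a    measured   6.0001   expected   6.0000

--- Exp: e^a where a=1 ---
  Gradient for a    measured   2.7184   expected   2.7183

--- Log: ln(a) where a=4 ---
  Gradient for a    measured   0.2500   expected   0.2500
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;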



&lt;p&gt;Every gradient matches the formula from the table, including the curvature overshoot we predicted. &lt;code&gt;Power&lt;/code&gt; shows &lt;code&gt;6.0001&lt;/code&gt; (exactly the number we worked out by hand earlier), and &lt;code&gt;Exp&lt;/code&gt; shows a similar small drift because &lt;code&gt;e^a&lt;/code&gt; is also curved. &lt;code&gt;Addition&lt;/code&gt; and &lt;code&gt;Multiplication&lt;/code&gt; come out perfectly because they're straight from each input's perspective. &lt;code&gt;Log&lt;/code&gt; at &lt;code&gt;a = 4&lt;/code&gt; is curved but so gently that the error hides below the fourth decimal.&lt;/p&gt;

&lt;h3&gt;Exercise: Verify Value Operations&lt;/h3&gt;

&lt;p&gt;Create &lt;code&gt;Chapter1Exercise.cs&lt;/code&gt;. This verifies that &lt;code&gt;Value&lt;/code&gt; computes correct forward results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// --- Chapter1Exercise.cs ---&lt;/span&gt;

&lt;span class="k"&gt;namespace&lt;/span&gt; &lt;span class="nn"&gt;MicroGPT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Chapter1Exercise&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;Run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Verify forward pass - chained operations produce correct results&lt;/span&gt;
        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;3.0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="p"&gt;*&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Pow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="n"&gt;Console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WriteLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"--- Forward Pass ---"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;Console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WriteLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;$"c: expected 6,  got &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;Console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WriteLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;$"d: expected 8,  got &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;Console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WriteLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;$"e: expected 64, got &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wire it into the dispatcher by uncommenting its case in the &lt;code&gt;switch&lt;/code&gt; in &lt;code&gt;Program.cs&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s"&gt;"ch1"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;Chapter1Exercise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Run&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dotnet run &lt;span class="nt"&gt;--&lt;/span&gt; ch1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  A Design Choice Worth Noticing
&lt;/h3&gt;

&lt;p&gt;If you look at the &lt;code&gt;Value&lt;/code&gt; operators, the local gradient &lt;em&gt;values&lt;/em&gt; are computed immediately during the forward pass. When &lt;code&gt;a * b&lt;/code&gt; runs, the resulting &lt;code&gt;Value&lt;/code&gt; already contains &lt;code&gt;[b.Data, a.Data]&lt;/code&gt; as concrete numbers. The backward pass then just multiplies and accumulates - it never computes a local gradient itself.&lt;/p&gt;

&lt;p&gt;Production frameworks like PyTorch do this differently. They store the &lt;em&gt;inputs&lt;/em&gt; during the forward pass, then compute the local gradient values during the backward pass using those stored inputs. For &lt;code&gt;a * b&lt;/code&gt;, PyTorch saves references to &lt;code&gt;a&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt;, then during backward computes &lt;code&gt;b * upstream_gradient&lt;/code&gt; and &lt;code&gt;a * upstream_gradient&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The final numbers are identical - it's the same math, just performed at a different time.&lt;/p&gt;

&lt;p&gt;Our &lt;code&gt;Value&lt;/code&gt; class precomputes because it makes the code simpler to understand: you can see the local gradients right there in the operator. PyTorch defers the computation because at scale (tensors with millions of numbers), precomputing and storing all the local gradients would use a lot of memory. It's cheaper to store just the inputs and recompute when needed. But for a scalar &lt;code&gt;Value&lt;/code&gt;, storing two &lt;code&gt;double&lt;/code&gt;s per operation is trivial.&lt;/p&gt;
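&lt;p&gt;To make the contrast concrete, here is a minimal sketch of the two strategies for &lt;code&gt;a * b&lt;/code&gt;. The helper names and the &lt;code&gt;Value&lt;/code&gt; constructor shape below are assumptions for illustration only, not the course's actual operator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// --- Illustrative sketch only; hypothetical names, not course code ---

// Our strategy: compute the local gradients immediately, at forward time.
// For a * b they are d/da = b.Data and d/db = a.Data, frozen as plain numbers.
static Value MulPrecomputed(Value a, Value b) =&amp;gt;
    new Value(a.Data * b.Data, inputs: [a, b], localGrads: [b.Data, a.Data]);

// PyTorch-style strategy: store only the inputs during the forward pass.
// The backward pass later computes b.Data * upstream and a.Data * upstream
// on demand - same numbers, just produced at a different time.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;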

</description>
      <category>csharp</category>
      <category>machinelearning</category>
      <category>transformers</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Chapter 0: Project Setup</title>
      <dc:creator>Gary Jackson</dc:creator>
      <pubDate>Tue, 21 Apr 2026 05:06:47 +0000</pubDate>
      <link>https://dev.to/garyljackson/chapter-0-project-setup-55gf</link>
      <guid>https://dev.to/garyljackson/chapter-0-project-setup-55gf</guid>
      <description>&lt;h3&gt;
  
  
  What You'll Build
&lt;/h3&gt;

&lt;p&gt;A .NET console project that's ready to run, set up the way the rest of the course expects, with the training dataset downloaded.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;You'll need the .NET 10 SDK (or later) installed. Check with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dotnet &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you don't have it, download it from &lt;a href="https://dotnet.microsoft.com/download" rel="noopener noreferrer"&gt;https://dotnet.microsoft.com/download&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Create the Project
&lt;/h3&gt;

&lt;p&gt;Open a terminal and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create a new console application&lt;/span&gt;
dotnet new console &lt;span class="nt"&gt;-n&lt;/span&gt; MicroGPT &lt;span class="nt"&gt;-f&lt;/span&gt; net10.0

&lt;span class="c"&gt;# Move into the project directory&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;MicroGPT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a &lt;code&gt;MicroGPT.csproj&lt;/code&gt; file and a &lt;code&gt;Program.cs&lt;/code&gt; with a "Hello, World!" placeholder.&lt;/p&gt;

&lt;h3&gt;
  
  
  Download the Training Data
&lt;/h3&gt;

&lt;p&gt;The dataset is a text file with ~32,000 human names, one per line. Download it into the project directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Linux / macOS&lt;/span&gt;
curl &lt;span class="nt"&gt;-o&lt;/span&gt; input.txt https://raw.githubusercontent.com/karpathy/makemore/refs/heads/master/names.txt

&lt;span class="c"&gt;# Windows (PowerShell)&lt;/span&gt;
Invoke-WebRequest &lt;span class="nt"&gt;-Uri&lt;/span&gt; &lt;span class="s2"&gt;"https://raw.githubusercontent.com/karpathy/makemore/refs/heads/master/names.txt"&lt;/span&gt; &lt;span class="nt"&gt;-OutFile&lt;/span&gt; &lt;span class="s2"&gt;"input.txt"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  A Note for Visual Studio Users
&lt;/h3&gt;

&lt;p&gt;If you plan to run the project from Visual Studio rather than with &lt;code&gt;dotnet run&lt;/code&gt; at the command line, make the change below now, or the program won't find &lt;code&gt;input.txt&lt;/code&gt;. Visual Studio runs your application from the &lt;code&gt;bin/Debug/net10.0/&lt;/code&gt; folder, not the project root, so the dataset needs to sit alongside the compiled binary.&lt;/p&gt;

&lt;p&gt;The fix is to tell the build system to copy &lt;code&gt;input.txt&lt;/code&gt; into the output folder automatically. Open &lt;code&gt;MicroGPT.csproj&lt;/code&gt; and add this inside the &lt;code&gt;&amp;lt;Project&amp;gt;&lt;/code&gt; element:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="c"&gt;&amp;lt;!-- --- MicroGPT.csproj (add inside the &amp;lt;Project&amp;gt; element) --- --&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;ItemGroup&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;None&lt;/span&gt; &lt;span class="na"&gt;Update=&lt;/span&gt;&lt;span class="s"&gt;"input.txt"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;CopyToOutputDirectory&amp;gt;&lt;/span&gt;PreserveNewest&lt;span class="nt"&gt;&amp;lt;/CopyToOutputDirectory&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/None&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/ItemGroup&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This copies &lt;code&gt;input.txt&lt;/code&gt; alongside the compiled binary whenever it's newer than the existing copy. After this change, both &lt;code&gt;dotnet run&lt;/code&gt; and Visual Studio's Run/Debug button will find the file.&lt;/p&gt;

&lt;h3&gt;
  
  
  Create the Core Source Files
&lt;/h3&gt;

&lt;p&gt;Create the empty source files alongside &lt;code&gt;Program.cs&lt;/code&gt;. These are the permanent files that make up the final model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Linux / macOS&lt;/span&gt;
&lt;span class="nb"&gt;touch &lt;/span&gt;Value.cs GradientCheck.cs Tokenizer.cs BigramModel.cs Helpers.cs Model.cs AdamOptimiser.cs FullTraining.cs

&lt;span class="c"&gt;# Windows (PowerShell)&lt;/span&gt;
&lt;span class="s2"&gt;"Value.cs"&lt;/span&gt;, &lt;span class="s2"&gt;"GradientCheck.cs"&lt;/span&gt;, &lt;span class="s2"&gt;"Tokenizer.cs"&lt;/span&gt;, &lt;span class="s2"&gt;"BigramModel.cs"&lt;/span&gt;, &lt;span class="s2"&gt;"Helpers.cs"&lt;/span&gt;, &lt;span class="s2"&gt;"Model.cs"&lt;/span&gt;, &lt;span class="s2"&gt;"AdamOptimiser.cs"&lt;/span&gt;, &lt;span class="s2"&gt;"FullTraining.cs"&lt;/span&gt; | ForEach-Object &lt;span class="o"&gt;{&lt;/span&gt; New-Item &lt;span class="nt"&gt;-ItemType&lt;/span&gt; File &lt;span class="nt"&gt;-Name&lt;/span&gt; &lt;span class="nv"&gt;$_&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you work through the chapters, you'll also create exercise files (&lt;code&gt;Chapter1Exercise.cs&lt;/code&gt;, &lt;code&gt;Chapter2Exercise.cs&lt;/code&gt;, and so on). Each chapter will tell you when to create one. They're self-contained - each has a static class with a &lt;code&gt;Run()&lt;/code&gt; method - and you run them via a dispatcher in &lt;code&gt;Program.cs&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dotnet run &lt;span class="nt"&gt;--&lt;/span&gt; ch1     &lt;span class="c"&gt;# runs Chapter1Exercise.Run()&lt;/span&gt;
dotnet run &lt;span class="nt"&gt;--&lt;/span&gt; ch7     &lt;span class="c"&gt;# runs Chapter7Exercise.Run()&lt;/span&gt;
dotnet run &lt;span class="nt"&gt;--&lt;/span&gt; full    &lt;span class="c"&gt;# runs the final training + inference&lt;/span&gt;
dotnet run            &lt;span class="c"&gt;# same as "full" once that case is wired up&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll build the dispatcher skeleton now and uncomment one case in it at the end of every chapter, so &lt;code&gt;dotnet run -- chN&lt;/code&gt; just works from Chapter 1 onwards.&lt;/p&gt;

&lt;h3&gt;
  
  
  Create the Dispatcher and Verify the Setup
&lt;/h3&gt;

&lt;p&gt;Clear out the placeholder &lt;code&gt;Program.cs&lt;/code&gt; and replace it with the dispatcher skeleton below. Every chapter case is already wired up but commented out - you'll uncomment one case at the end of each chapter as you go. Until then, the default branch prints a sanity check that confirms the training dataset is in place:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// --- Program.cs ---&lt;/span&gt;

&lt;span class="k"&gt;namespace&lt;/span&gt; &lt;span class="nn"&gt;MicroGPT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Program&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;Main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;chapter&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Length&lt;/span&gt; &lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;ToLowerInvariant&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chapter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c1"&gt;// Uncomment each case as you complete the corresponding chapter.&lt;/span&gt;
            &lt;span class="c1"&gt;// case "gradcheck":&lt;/span&gt;
            &lt;span class="c1"&gt;//     GradientCheck.RunAll();&lt;/span&gt;
            &lt;span class="c1"&gt;//     break;&lt;/span&gt;
            &lt;span class="c1"&gt;// case "ch1":&lt;/span&gt;
            &lt;span class="c1"&gt;//     Chapter1Exercise.Run();&lt;/span&gt;
            &lt;span class="c1"&gt;//     break;&lt;/span&gt;
            &lt;span class="c1"&gt;// case "ch2":&lt;/span&gt;
            &lt;span class="c1"&gt;//     Chapter2Exercise.Run();&lt;/span&gt;
            &lt;span class="c1"&gt;//     break;&lt;/span&gt;
            &lt;span class="c1"&gt;// case "ch3":&lt;/span&gt;
            &lt;span class="c1"&gt;//     Chapter3Exercise.Run();&lt;/span&gt;
            &lt;span class="c1"&gt;//     break;&lt;/span&gt;
            &lt;span class="c1"&gt;// case "ch4":&lt;/span&gt;
            &lt;span class="c1"&gt;//     Chapter4Exercise.Run();&lt;/span&gt;
            &lt;span class="c1"&gt;//     break;&lt;/span&gt;
            &lt;span class="c1"&gt;// case "ch5":&lt;/span&gt;
            &lt;span class="c1"&gt;//     Chapter5Exercise.Run();&lt;/span&gt;
            &lt;span class="c1"&gt;//     break;&lt;/span&gt;
            &lt;span class="c1"&gt;// case "ch6":&lt;/span&gt;
            &lt;span class="c1"&gt;//     Chapter6Exercise.Run();&lt;/span&gt;
            &lt;span class="c1"&gt;//     break;&lt;/span&gt;
            &lt;span class="c1"&gt;// case "ch7":&lt;/span&gt;
            &lt;span class="c1"&gt;//     Chapter7Exercise.Run();&lt;/span&gt;
            &lt;span class="c1"&gt;//     break;&lt;/span&gt;
            &lt;span class="c1"&gt;// case "ch8":&lt;/span&gt;
            &lt;span class="c1"&gt;//     Chapter8Exercise.Run();&lt;/span&gt;
            &lt;span class="c1"&gt;//     break;&lt;/span&gt;
            &lt;span class="c1"&gt;// case "ch9":&lt;/span&gt;
            &lt;span class="c1"&gt;//     Chapter9Exercise.Run();&lt;/span&gt;
            &lt;span class="c1"&gt;//     break;&lt;/span&gt;
            &lt;span class="c1"&gt;// case "ch10":&lt;/span&gt;
            &lt;span class="c1"&gt;//     Chapter10Exercise.Run();&lt;/span&gt;
            &lt;span class="c1"&gt;//     break;&lt;/span&gt;
            &lt;span class="c1"&gt;// case "full":&lt;/span&gt;
            &lt;span class="c1"&gt;//     FullTraining.Run();&lt;/span&gt;
            &lt;span class="c1"&gt;//     break;&lt;/span&gt;

            &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;Console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WriteLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"MicroGPT project is ready."&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="n"&gt;Console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WriteLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;$"Dataset exists: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;File&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"input.txt"&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;File&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"input.txt"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;lineCount&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;File&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ReadAllLines&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"input.txt"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;Length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                    &lt;span class="n"&gt;Console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WriteLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;$"Dataset lines: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;lineCount&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dotnet run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MicroGPT project is ready.
Dataset exists: True
Dataset lines: 32033
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If that works, you're ready to start building.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Big Picture: How a Neural Network Learns
&lt;/h2&gt;

&lt;p&gt;Before we write any code, here's the 60-second version of what we're building and why. If you already know what "forward pass," "loss," and "gradient" mean, skip ahead to Chapter 1.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;neural network&lt;/strong&gt; is a math function with thousands of adjustable numbers called &lt;strong&gt;parameters&lt;/strong&gt;. At the start, these parameters are random - the function produces garbage. Training is the process of adjusting them until the function produces something useful.&lt;/p&gt;

&lt;p&gt;Training repeats four steps over and over:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1 - The forward pass.&lt;/strong&gt; Feed an input (like the letters of a name) through the math function. The function does a chain of operations - additions, multiplications, exponentials - and produces an output: a prediction of what character comes next.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2 - The loss.&lt;/strong&gt; Compare the prediction to the correct answer. The &lt;strong&gt;loss&lt;/strong&gt; is a single number that measures how wrong the prediction was. A loss of zero means the prediction was perfect. A high loss means the model is basically guessing randomly. Our goal is to make this number go down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3 - The backward pass.&lt;/strong&gt; This is the clever bit. For each of the thousands of parameters, we need to know: "if I nudged this number up a tiny bit, would the loss go up or down, and by how much?" That "how much" is called the &lt;strong&gt;gradient&lt;/strong&gt;.&lt;/p&gt;
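&lt;p&gt;You don't need calculus to picture this - you can literally do the nudge and measure. A toy, self-contained illustration (not course code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Toy illustration of the gradient idea: nudge an input, measure the output.
// This loss has its minimum at w = 4; we are standing at w = 2.
static double Loss(double w) =&amp;gt; (3.0 * w - 12.0) * (3.0 * w - 12.0);

double w = 2.0;
double eps = 0.0001; // the "tiny bit"
double gradient = (Loss(w + eps) - Loss(w)) / eps;
Console.WriteLine(gradient); // about -36: nudging w UP makes the loss go DOWN
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That nudge-and-measure trick is exactly what the &lt;code&gt;GradientCheck&lt;/code&gt; tool from Chapter 1 uses to verify the real implementation.&lt;/p&gt;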

&lt;p&gt;Computing gradients by hand would be impossibly tedious. Instead, we record every operation during the forward pass, then walk that record backward using a calculus shortcut called the &lt;strong&gt;chain rule&lt;/strong&gt;. This process is called &lt;strong&gt;backpropagation&lt;/strong&gt;, and it gives us the gradient for every parameter automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4 - The update.&lt;/strong&gt; Now that we know each parameter's gradient, we nudge every parameter a tiny step in the direction that makes the loss smaller. Repeat from Step 1 with the next piece of training data.&lt;/p&gt;
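&lt;p&gt;Put together, the loop has a simple shape. This is only a sketch - &lt;code&gt;model&lt;/code&gt;, &lt;code&gt;ComputeLoss&lt;/code&gt;, &lt;code&gt;Parameters()&lt;/code&gt;, &lt;code&gt;trainingData&lt;/code&gt;, and &lt;code&gt;learningRate&lt;/code&gt; are placeholders for things the course builds chapter by chapter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Sketch of the four-step loop (placeholder names, not final course code).
foreach (var example in trainingData)
{
    Value loss = ComputeLoss(model, example); // Steps 1-2: forward pass, then loss

    foreach (Value p in model.Parameters())   // clear gradients from the last step
        p.Grad = 0;
    loss.Backward();                          // Step 3: a gradient for every parameter

    foreach (Value p in model.Parameters())   // Step 4: nudge against the gradient
        p.Data -= learningRate * p.Grad;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;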

&lt;p&gt;That's the entire learning loop. Everything in this course - the &lt;code&gt;Value&lt;/code&gt; class, the &lt;code&gt;Backward&lt;/code&gt; method, the &lt;code&gt;Softmax&lt;/code&gt; function, the &lt;code&gt;Adam&lt;/code&gt; optimiser - is a piece of this loop. When you see these terms in the chapters that follow, you'll know where they fit.&lt;/p&gt;

</description>
      <category>csharp</category>
      <category>machinelearning</category>
      <category>transformers</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Building a GPT From Scratch in C# - Introduction</title>
      <dc:creator>Gary Jackson</dc:creator>
      <pubDate>Tue, 21 Apr 2026 04:52:19 +0000</pubDate>
      <link>https://dev.to/garyljackson/building-a-gpt-from-scratch-in-c-introduction-4776</link>
      <guid>https://dev.to/garyljackson/building-a-gpt-from-scratch-in-c-introduction-4776</guid>
      <description>&lt;h2&gt;
  
  
  Why This Course Exists
&lt;/h2&gt;

&lt;p&gt;I'm on a journey to deepen my understanding of AI, and I wanted to properly learn how the transformer architecture works - not just at a hand-wavy conceptual level, but well enough to build one from scratch.&lt;/p&gt;

&lt;p&gt;The problem was that most tutorials weren't clicking for me, and it came down to three things.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;They're written in Python.&lt;/strong&gt; I'm a C# developer. Python isn't hard to read, but working in an unfamiliar language adds cognitive load in exactly the wrong place. You end up spending mental energy on syntax and idioms instead of on the concepts you're trying to learn.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;They lean on libraries like NumPy, PyTorch, and Hugging Face.&lt;/strong&gt; Powerful tools, but when a single function call hides an entire matrix multiplication or an attention computation, you don't really understand what's happening underneath. You're learning the API, not the algorithm.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;They assume you're comfortable with calculus.&lt;/strong&gt; I'm not. When a tutorial casually drops derivative notation and expects you to follow along, that's another barrier that has nothing to do with understanding how a transformer works.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So I wanted a tutorial that removed all three of those barriers. One written in C#, with zero external dependencies, where every operation is visible in the code, and where concepts like gradients are explained through practical techniques you can run and verify - not through math notation you're expected to already know.&lt;/p&gt;

&lt;p&gt;I couldn't find one that fit, so I built this course to fill that gap. If you're a C# developer who wants to understand transformers at the implementation level, without needing a math degree to get there, this is for you.&lt;/p&gt;




&lt;h2&gt;
  
  
  Course Map
&lt;/h2&gt;

&lt;p&gt;The course builds a complete GPT-style language model from scratch in C#, with zero ML framework dependencies. By the end, you'll have a working character-level language model that learns patterns from text and generates new, plausible-sounding text.&lt;/p&gt;

&lt;p&gt;Every chapter produces &lt;strong&gt;runnable code&lt;/strong&gt; that builds on the previous one. The concepts layer like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Chapter 0:  Project Setup - creating the project and file structure
Chapter 1:  Value - a single number that tracks how it was computed
Chapter 2:  Backward - teaching Value to compute gradients automatically  
Chapter 3:  Tokenizer - turning text into numbers and back
Chapter 4:  Bigram Model - the simplest possible "language model" (no neural net)
Chapter 5:  Linear + Softmax - the two workhorses of neural networks
Chapter 6:  Embeddings + Loss - giving tokens a learned identity
Chapter 7:  Training Loop + Adam - making the model learn
Chapter 8:  RMSNorm + Residuals - stabilising deep networks
Chapter 9:  Attention - the mechanism that lets tokens "look at" each other
Chapter 10: Multi-Head Attention + MLP - the full Transformer block
Chapter 11: Full GPT - assembling everything into a model class
Chapter 12: Inference - generating new text from the trained model
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each chapter tells you what it depends on, what it adds, and what the code should do when you run it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Project File Structure
&lt;/h3&gt;

&lt;p&gt;By the end of the course, your project will contain these files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MicroGPT/
├── MicroGPT.csproj        Created by dotnet CLI
├── input.txt              Training dataset (32K names)
│
│   ── Core files (permanent) ──────────────────────────
│
├── Value.cs               The computation recorder (Ch 1-2)
├── GradientCheck.cs       Nudge-test verification tool (Ch 1)
├── Tokenizer.cs           Text-to-numbers conversion (Ch 3)
├── BigramModel.cs         Simplest possible language model (Ch 4)
├── Helpers.cs             Pure math functions (Ch 5, 6, 8)
├── Model.cs               The GPT model (Ch 11)
├── AdamOptimiser.cs       Reusable Adam optimiser (Ch 11)
├── Program.cs             Entry point - chapter dispatcher (Ch 0-12)
│
│   ── Chapter exercises (created as you go) ───────────
│
├── Chapter1Exercise.cs    Verify Value operations and gradient checking
├── Chapter2Exercise.cs    Verify backward pass computes correct gradients
├── Chapter3Exercise.cs    Verify tokenization encode/decode
├── Chapter4Exercise.cs    Run the bigram baseline model
├── Chapter5Exercise.cs    Verify Softmax produces valid probabilities
├── Chapter6Exercise.cs    Embeddings, forward pass, and loss
├── Chapter7Exercise.cs    Training loop with simplified model
├── Chapter8Exercise.cs    Verify RmsNorm
├── Chapter9Exercise.cs    Hand-crafted single-head attention demo
├── Chapter10Exercise.cs   Multi-head attention + MLP block demo
│
└── FullTraining.cs        The full training loop and inference (Ch 11-12)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The core files build up over the course and stay permanently. The exercise files are self-contained - each one has a &lt;code&gt;Run()&lt;/code&gt; method.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Program.cs&lt;/code&gt; is a small dispatcher that routes to whichever chapter you want to run: &lt;code&gt;dotnet run -- ch3&lt;/code&gt; runs Chapter 3, &lt;code&gt;dotnet run -- full&lt;/code&gt; runs the final training and inference from Chapters 11-12. No args defaults to &lt;code&gt;full&lt;/code&gt;. You'll build the dispatcher stub in Chapter 0 and uncomment each case as the course progresses.&lt;/p&gt;




&lt;h2&gt;
  
  
  Reference implementation
&lt;/h2&gt;

&lt;p&gt;The complete source code for this course lives at &lt;a href="https://github.com/Garyljackson/GPT-From-Scratch-CSharp" rel="noopener noreferrer"&gt;Garyljackson/GPT-From-Scratch-CSharp&lt;/a&gt; on GitHub. You can clone it to follow along, or check your work against it as you go.&lt;/p&gt;

</description>
      <category>csharp</category>
      <category>machinelearning</category>
      <category>transformers</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
