What You'll Build
A class called Value that wraps a double and remembers how it was created. Think of it as a number that keeps a receipt of every operation it went through.
Why This Comes First
In the Big Picture, Step 1 (the forward pass) chains together thousands of small operations, and Step 3 (the backward pass) walks those operations in reverse.
For that to work, every operation has to leave a record behind: what were the inputs, and how sensitive was the output to each input? The Value class is that record-keeping wrapper. Every number in our neural network is going to be a Value.
The Core Idea
A Value holds three things:
- The number itself (`Data`)
- A reference to the values that produced it (`_inputs`)
- The local gradients (`_localGrads`) - how much the output of that specific operation would change if you wiggled each input
You don't need to understand the calculus behind these local gradients right now. Each operation has a known, fixed rule for them (listed in the table below), and the backward pass in Chapter 2 uses them mechanically.
The `Grad` field is empty for now. It gets filled in during the backward pass (Chapter 2) with the answer to: "how much does the final loss change if I wiggle this specific value?"
A naming distinction worth pinning down now. There are two things on a Value that both include the word "gradient", and they do different jobs:

- Local gradient (`_localGrads`) - stored per operation (`+`, `*`, `Exp`, etc.), frozen at forward time. For each input to the op, it records: "if only that input changed by a tiny amount, how much would this op's output change?" It's a property of one operation in isolation.
- Gradient (`Grad`) - filled in during the backward pass. Every `Value` has its own `Grad`, which records: "if only this `Value`'s `Data` changed by a tiny amount, how much would the final loss change?" It's a property of the whole path from this `Value` to the loss.

The backward pass in Chapter 2 walks the graph in reverse, multiplying the two together via the chain rule to fill in every `Grad`.
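To make "multiplying the two together" concrete, here is a hand-worked preview on plain doubles - not the Chapter 2 implementation, just the arithmetic it will perform. For `y = Exp(a * b)`, the gradient of `a` is the local gradient of `*` with respect to `a` (which is `b`) times the local gradient of `Exp` (which is `e^(a*b)`):

```csharp
using System;

// Chain rule by hand, on plain doubles (a preview of Chapter 2, not its code).
double a = 2, b = 3;
double localExp = Math.Exp(a * b); // local gradient of the Exp op at its input a*b
double localMul = b;               // local gradient of * with respect to a
double chained = localExp * localMul; // dy/da via the chain rule

// A direct nudge measurement agrees with the chained product:
double nudge = 1e-6;
double measured = (Math.Exp((a + nudge) * b) - Math.Exp(a * b)) / nudge;
Console.WriteLine($"chained={chained:F2}, measured={measured:F2}");
```

The nudge measurement lands within a hair of the chained product, which is the whole trick: each op only knows its own local rate, and multiplying the rates along the path recovers the end-to-end rate.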
Code
```csharp
// --- Value.cs ---
namespace MicroGPT;

public class Value(double data, Value[]? inputs = null, double[]? localGrads = null)
{
    public double Data = data;
    public double Grad; // filled in during the backward pass (Chapter 2)

    private readonly Value[]? _inputs = inputs;
    private readonly double[]? _localGrads = localGrads;

    // --- Arithmetic operations ---
    // Each operation records three things: the result, the inputs, and the local gradients.
    // The local gradients are explained in the "Verifying Local Gradients" section below.
    public static Value operator +(Value a, Value b) => new(a.Data + b.Data, [a, b], [1.0, 1.0]);

    public static Value operator *(Value a, Value b) =>
        new(a.Data * b.Data, [a, b], [b.Data, a.Data]);

    // NaN if Data is negative and n is fractional.
    public Value Pow(double n) => new(Math.Pow(Data, n), [this], [n * Math.Pow(Data, n - 1)]);

    // -Infinity if Data == 0, NaN if Data < 0. If you see NaN propagating through
    // training, a softmax probability collapsed to 0 and this is usually the entry point.
    public Value Log() => new(Math.Log(Data), [this], [1.0 / Data]);

    public Value Exp() => new(Math.Exp(Data), [this], [Math.Exp(Data)]);

    // ReLU: passes positive values through unchanged, blocks negatives entirely.
    public Value Relu() => new(Math.Max(0, Data), [this], [Data > 0 ? 1.0 : 0.0]);

    // --- Convenience overloads ---
    public static Value operator +(Value a, double b) => a + new Value(b);
    public static Value operator *(Value a, double b) => a * new Value(b);
    public static Value operator -(Value a) => a * -1;
    public static Value operator -(Value a, double b) => a + (-b);
    public static Value operator /(Value a, Value b) => a * b.Pow(-1);
    public static Value operator /(Value a, double b) => a * Math.Pow(b, -1);

    public override string ToString() => $"Value(data={Data})";
}
```
For quick reference, here are the local gradients each operation records:
| Operation | Local gradient(s) |
|---|---|
| `a + b` | `1, 1` |
| `a * b` | `b, a` |
| `a.Pow(n)` | `n * aⁿ⁻¹` |
| `a.Log()` | `1 / a` |
| `a.Exp()` | `eᵃ` |
| `a.Relu()` | `1 if a > 0, else 0` |
Verifying Local Gradients - The Nudge Test
Have a look at the addition operator:
```csharp
public static Value operator +(Value a, Value b)
    => new(a.Data + b.Data, [a, b], [1.0, 1.0]);
```
The second argument, `[a, b]`, records the two inputs. The third argument, `[1.0, 1.0]`, records the local gradient for each input, in the same order. So:

- The local gradient for input `a` is `1.0`
- The local gradient for input `b` is `1.0`
But what does that actually mean, and why should you believe those are the right numbers?
You can answer both questions without any calculus. The technique is simple: nudge one input by a tiny amount, run the operation again, and see how much the output changed. The ratio of output-change to input-change is the local gradient.
Addition: why is the local gradient 1.0 for both inputs?
Let's say a = 2 and b = 3. The output is 2 + 3 = 5.
Now nudge a up by a tiny amount - say, 0.001:
- New output: `2.001 + 3 = 5.001`
- The output changed by: `5.001 - 5.0 = 0.001`
- You nudged by 0.001, the output moved by 0.001
- Ratio: `0.001 / 0.001 = 1.0` - that's the local gradient for `a`
Now nudge b instead:
- New output: `2 + 3.001 = 5.001`
- Same result: ratio = `1.0` - that's the local gradient for `b`
Addition passes changes through at a 1:1 rate for both inputs, so the local gradients array is [1.0, 1.0].
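The same addition check expressed as throwaway code, using plain doubles (no `Value` objects involved):

```csharp
using System;

double a = 2, b = 3, nudge = 0.001;
double baseline = a + b; // 5

// Nudge each input in turn and divide output-change by input-change.
double gradA = ((a + nudge) + b - baseline) / nudge; // ≈ 1.0
double gradB = (a + (b + nudge) - baseline) / nudge; // ≈ 1.0
Console.WriteLine($"gradA={gradA}, gradB={gradB}");
```

Both ratios come out at 1.0 (give or take floating-point noise in the last few digits), matching the `[1.0, 1.0]` the operator records.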
Multiplication: why are the local gradients `[b.Data, a.Data]`?
Have a look at the multiplication operator:
```csharp
public static Value operator *(Value a, Value b)
    => new(a.Data * b.Data, [a, b], [b.Data, a.Data]);
```
The local gradients are `[b.Data, a.Data]`, meaning:

- The local gradient for input `a` is `b`'s value
- The local gradient for input `b` is `a`'s value
Let's verify with a = 2, b = 3. The output is 2 * 3 = 6.
Nudge a by 0.001:
- New output: `2.001 * 3 = 6.003`
- The output changed by: `6.003 - 6.0 = 0.003`
- Ratio: `0.003 / 0.001 = 3.0` - that's the local gradient for `a`, and it equals `b`'s value
Nudge b by 0.001:
- New output: `2 * 3.001 = 6.002`
- The output changed by: `6.002 - 6.0 = 0.002`
- Ratio: `0.002 / 0.001 = 2.0` - that's the local gradient for `b`, and it equals `a`'s value
This makes intuitive sense: the bigger b is, the more a small change to a gets amplified, and vice versa.
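The multiplication check in the same throwaway style, again with plain doubles:

```csharp
using System;

double a = 2, b = 3, nudge = 0.001;
double baseline = a * b; // 6

// The nudge ratio for each input should equal the OTHER input's value.
double gradA = ((a + nudge) * b - baseline) / nudge; // ≈ 3.0, which is b
double gradB = (a * (b + nudge) - baseline) / nudge; // ≈ 2.0, which is a
Console.WriteLine($"gradA={gradA}, gradB={gradB}");
```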
Power: the first curved function
Have a look at the power operator:
```csharp
public Value Pow(double n)
    => new(Math.Pow(Data, n), [this], [n * Math.Pow(Data, n - 1)]);
```
For a² (so n = 2), the local gradient n * Math.Pow(a, n - 1) simplifies to 2 * a. That's the first formula we've seen where the gradient depends on a itself, and the reason is that a² behaves differently from addition and multiplication. Let's see how.
Line up some input/output pairs for a²:
```
a = 1 → a² = 1
a = 2 → a² = 4   (jumped by 3)
a = 3 → a² = 9   (jumped by 5)
a = 4 → a² = 16  (jumped by 7)
```
Each step in a produces a bigger jump in a² than the last. Compare that to a + 5, which goes 6, 7, 8, 9 - the jump is exactly 1 every single step. We'll call a + 5 a straight function (same rate of change everywhere) and a² a curved function (rate of change grows as a gets bigger).
Multiplication is straight too, from a's perspective. a * 3 goes 3, 6, 9, 12 as a goes 1, 2, 3, 4 - the jump is always exactly 3. That's why the local gradient for multiplication is a fixed number (b.Data): no matter where you nudge a, the rate of change is the same.
a² is different. The rate of change at a = 3 isn't the same as the rate at a = 4. The formula 2 * a tells us: at a = 3, the rate is 6; at a = 4, the rate is 8. There's no single number that describes a²'s rate - you have to ask "rate at which value of a?".
Let's nudge-test at a = 3, using a small nudge (0.0001) to match the default in GradientCheck.cs below:
- Original: `3² = 9`
- Nudged: `3.0001² = 9.00060001`
- Change in output: `0.00060001`
- Ratio: `0.00060001 / 0.0001 = 6.0001`
The formula says the rate at a = 3 is exactly 6, and we measured 6.0001. The extra 0.0001 isn't a bug - it's the curvature leaking in. When we nudged from 3 to 3.0001, we technically measured something between "the rate at 3" and "the rate at 3.0001" (which is very slightly steeper), so we overshoot the true answer at 3 by a tiny amount. Halve the nudge and the error halves with it.
This is a general fact about the nudge test: it's exact for straight functions, and slightly off for curved ones by an amount proportional to the nudge size. Keep that in mind when we run the full check in a moment - you'll see 6.0001 for Power and a similar drift for Exp (also curved), while Addition and Multiplication come out perfectly.
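You can watch that proportionality directly, again with plain doubles. For `a²` the algebra is tidy: `((a + h)² - a²) / h = 2a + h`, so the measured rate at `a = 3` is exactly `6 + nudge`, and the error *is* the nudge:

```csharp
using System;

double a = 3;
foreach (double nudge in new[] { 0.01, 0.001, 0.0001 })
{
    double measured = (Math.Pow(a + nudge, 2) - Math.Pow(a, 2)) / nudge;
    double error = measured - 2 * a; // exact rate at a = 3 is 6
    // For a squared, error tracks the nudge size one-for-one: shrink
    // the nudge by 10x and the error shrinks by 10x.
    Console.WriteLine($"nudge={nudge}: measured={measured}, error={error}");
}
```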
You can verify any operation this way. You don't need to trust the formulas. You don't need calculus. You just need subtraction and division.
Heads up: `Log` and `Pow` can blow up on certain inputs. `Log(0)` gives `-Infinity`, `Log` of any negative number gives `NaN`, and `Pow` gives `NaN` when you raise a negative number to a non-whole power like `0.5`. If `NaN` ever starts spreading through training later in the course, one of these two operations is almost always where it started. Come back to this section and nudge-test the suspect values.
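These failure cases are easy to reproduce with plain `Math` calls, no `Value` needed:

```csharp
using System;

// The inputs that break Log and Pow, checked explicitly.
Console.WriteLine(double.IsNegativeInfinity(Math.Log(0.0)));  // True
Console.WriteLine(double.IsNaN(Math.Log(-1.0)));              // True
Console.WriteLine(double.IsNaN(Math.Pow(-2.0, 0.5)));         // True
Console.WriteLine(Math.Pow(-2.0, 2.0));                       // 4 - whole powers of negatives are fine
```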
The Nudge Test in Code
Before trusting the local-gradient formulas from the table above, let's verify them by direct measurement. The helper class below runs the nudge test against raw math operations (no `Value` objects involved) to prove the formulas are correct independently of the C# implementation. Put it in `GradientCheck.cs`:
```csharp
// --- GradientCheck.cs ---
namespace MicroGPT;

public static class GradientCheck
{
    /// <summary>
    /// Measures the gradient of a function at a specific input value by nudging
    /// and observing. Works for any function that takes a double and returns a double.
    /// </summary>
    public static double MeasureGradient(Func<double, double> f, double at, double nudge = 0.0001)
    {
        double before = f(at);
        double after = f(at + nudge);
        return (after - before) / nudge;
    }

    /// <summary>
    /// Runs the nudge test for all Value operations and prints the results.
    /// </summary>
    public static void RunAll()
    {
        // The "expected" column for each check is computed by applying the local
        // gradient formula from Value.cs directly. The "measured" column comes from
        // the nudge test. If the formula is right, the two columns should agree.
        static void Row(string name, double measured, double expected) =>
            Console.WriteLine(
                $"  {name,-16} measured {measured,8:F4} expected {expected,8:F4}"
            );

        Console.WriteLine("=== Straight functions (measurement should be exact) ===");
        Console.WriteLine();

        // Addition: local gradient formula from Value.cs is [1.0, 1.0]
        {
            double a = 2, b = 3;
            Console.WriteLine($"--- Addition: a + b where a={a}, b={b} ---");
            Row("Gradient for a", MeasureGradient(x => x + b, a), 1.0);
            Row("Gradient for b", MeasureGradient(x => a + x, b), 1.0);
            Console.WriteLine();
        }

        // Multiplication: local gradient formula from Value.cs is [b.Data, a.Data]
        {
            double a = 2, b = 3;
            Console.WriteLine($"--- Multiplication: a * b where a={a}, b={b} ---");
            Row("Gradient for a", MeasureGradient(x => x * b, a), b);
            Row("Gradient for b", MeasureGradient(x => a * x, b), a);
            Console.WriteLine();
        }

        Console.WriteLine("=== Curved functions (tiny drift proportional to nudge size) ===");
        Console.WriteLine();

        // Power: local gradient formula from Value.cs is n * Math.Pow(Data, n - 1)
        {
            double a = 3, n = 2;
            Console.WriteLine($"--- Power: a^n where a={a}, n={n} ---");
            Row("Gradient for a", MeasureGradient(x => Math.Pow(x, n), a), n * Math.Pow(a, n - 1));
            Console.WriteLine();
        }

        // Exp: local gradient formula from Value.cs is Math.Exp(Data)
        {
            double a = 1;
            Console.WriteLine($"--- Exp: e^a where a={a} ---");
            Row("Gradient for a", MeasureGradient(x => Math.Exp(x), a), Math.Exp(a));
            Console.WriteLine();
        }

        // Log: local gradient formula from Value.cs is 1.0 / Data
        {
            double a = 4;
            Console.WriteLine($"--- Log: ln(a) where a={a} ---");
            Row("Gradient for a", MeasureGradient(x => Math.Log(x), a), 1.0 / a);
        }
    }
}
```
Wire it into the dispatcher by uncommenting the gradcheck line in the switch in Program.cs:
```csharp
case "gradcheck":
    GradientCheck.RunAll();
    break;
```
Then run it:
```bash
dotnet run -- gradcheck
```
Every gradient matches the formula from the table, including the curvature tilt we predicted. Power shows 6.0001 (exactly the number we worked out by hand earlier), and Exp shows a similar small drift because e^a is also curved. Addition and Multiplication come out perfectly because they're straight from each input's perspective. Log at a = 4 is curved but so gently that the error hides below the fourth decimal.
Exercise: Verify Value Operations
Create `Chapter1Exercise.cs`. This verifies that `Value` computes correct forward results:
```csharp
// --- Chapter1Exercise.cs ---
namespace MicroGPT;

public static class Chapter1Exercise
{
    public static void Run()
    {
        // Verify forward pass - chained operations produce correct results
        var a = new Value(2.0);
        var b = new Value(3.0);
        Value c = a * b;
        Value d = c + a;
        Value e = d.Pow(2);

        Console.WriteLine("--- Forward Pass ---");
        Console.WriteLine($"c: expected 6, got {c.Data}");
        Console.WriteLine($"d: expected 8, got {d.Data}");
        Console.WriteLine($"e: expected 64, got {e.Data}");
    }
}
```
Wire it into the dispatcher by uncommenting this line in the switch in Program.cs:
```csharp
case "ch1":
    Chapter1Exercise.Run();
    break;
```
Then run it:
```bash
dotnet run -- ch1
```
A Design Choice Worth Noticing
If you look at the Value operators, the local gradient values are computed immediately during the forward pass. When a * b runs, the resulting Value already contains [b.Data, a.Data] as concrete numbers. The backward pass then just multiplies and accumulates - it never computes a local gradient itself.
Production frameworks like PyTorch do this differently. They store the inputs during the forward pass, then compute the local gradient values during the backward pass using those stored inputs. For a * b, PyTorch saves references to a and b, then during backward computes b * upstream_gradient and a * upstream_gradient.
The final numbers are identical - it's the same math, just performed at a different time.
Our Value class precomputes because it makes the code simpler to understand: you can see the local gradients right there in the operator. PyTorch defers the computation because at scale (tensors with millions of numbers), precomputing and storing all the local gradients would use a lot of memory. It's cheaper to store just the inputs and recompute when needed. But for a scalar Value, storing two doubles per operation is trivial.
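To make the deferred style concrete, here is a hypothetical sketch in our own terms - this `LazyValue` class is illustrative only and is not part of the course's code. Instead of freezing `[b.Data, a.Data]` at forward time, it saves the inputs plus a closure that derives the local gradients when asked:

```csharp
using System;

// Hypothetical alternative: defer local-gradient computation, PyTorch-style.
// Forward stores the inputs and a recipe; the numbers are derived on demand.
public class LazyValue(double data, LazyValue[]? inputs = null,
    Func<double[]>? localGradsFn = null)
{
    public double Data = data;
    public LazyValue[]? Inputs = inputs; // saved for the backward pass

    private readonly Func<double[]>? _localGradsFn = localGradsFn;

    // A (hypothetical) backward pass would call this: the local gradients
    // are computed from the saved inputs only now, not at forward time.
    public double[] LocalGrads() => _localGradsFn?.Invoke() ?? Array.Empty<double>();

    public static LazyValue operator *(LazyValue a, LazyValue b) =>
        // The lambda captures a and b; b.Data and a.Data are read only
        // when LocalGrads() runs during backward.
        new(a.Data * b.Data, [a, b], () => [b.Data, a.Data]);
}
```

One consequence of deferral worth noticing: `LocalGrads()` reads the inputs at call time, so mutating an input between forward and backward would silently change the gradients - part of why real frameworks guard against in-place modification of saved tensors.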