Gary Jackson

Posted on • Originally published at garyjackson.dev

Chapter 2: Backward - Automatic Gradient Computation

What You'll Build

The Backward() method on Value. Starting from a final output (like the loss), it walks the computation graph in reverse and fills in the .Grad field on every Value, answering "if I nudged this number, how much would the final output change?"

Depends On

Chapter 1 (the Value class with forward operations).

Why This Matters

Training a neural network comes down to: "nudge each parameter a tiny bit in the direction that lowers the loss." But with thousands of parameters and a complex computation graph, you can't compute those nudges by hand. Backward() does it automatically using the chain rule.

The Chain Rule in Plain English

If a car goes twice as fast as a bicycle, and a bicycle goes four times as fast as a walking person, then the car goes 2 x 4 = 8 times as fast as the person. That's the chain rule. You multiply the rates of change along the path.

In the computation graph: if changing a causes c to change by 3x, and changing c causes loss to change by 2x, then changing a causes loss to change by 3 x 2 = 6x.
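That multiplication of rates can be checked numerically with plain doubles, no Value class required. The function bodies here are made up purely for illustration:

```csharp
// c changes 3x as fast as a; the loss changes 2x as fast as c.
double C(double a) => 3.0 * a;
double Loss(double c) => 2.0 * c;

double eps = 1e-6; // a tiny nudge
double a0 = 5.0;   // an arbitrary starting point

double before = Loss(C(a0));
double after  = Loss(C(a0 + eps));

// Rate of change of the final output with respect to a:
Console.WriteLine((after - before) / eps); // ≈ 6.0 = 3 x 2
```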

The Algorithm

The backward pass works in two stages.

Stage 1 - Topological sort. Walk the graph from the output and order all nodes so that a node always appears after all the nodes it depends on (its inputs come first).

Stage 2 - Propagate gradients. Start from the output (where gradient = 1.0, because "if I nudge the loss, the loss changes by the same amount" - the ratio is trivially 1). Visit each node in reverse topological order. For each node, multiply its gradient by each local gradient and add the result to the corresponding input's gradient.

Notice the += rather than = in Stage 2. This is important, and the trace below shows why.

Code

A note on reading the code. The backward pass can look confusing at first because everything is a Value. The node you're currently processing is a Value, its inputs are also Value objects, and they all have a .Grad field.

When you see v._inputs[j].Grad += ..., remember that v is the current node (the operation's output), and v._inputs[j] is one of the nodes that was fed into that operation. The code is reaching back from an output to its inputs and telling each one how much it contributed to the final loss.

Add these methods to the Value class:

// --- Value.cs (add inside the Value class) ---

// Convenience overload: allocates fresh buffers on each call.
// Good for one-off use in the early chapters. The 3-argument version below
// lets the training loop in Chapter 7 reuse buffers across thousands of steps.
public void Backward() => Backward([], [], new Stack<(Value, int)>());

public void Backward(
    List<Value> topo,
    HashSet<Value> visited,
    Stack<(Value node, int inputIndex)> stack
)
{
    // Stage 1: Iterative topological sort
    stack.Push((this, 0));
    while (stack.Count > 0)
    {
        (Value current, int inputIndex) = stack.Pop();
        Value[]? inputs = current._inputs;

        if (inputs != null && inputIndex < inputs.Length)
        {
            stack.Push((current, inputIndex + 1));
            Value input = inputs[inputIndex];
            if (visited.Add(input))
            {
                stack.Push((input, 0));
            }
        }
        else
        {
            topo.Add(current);
        }
    }

    // Stage 2: Propagate gradients in reverse topological order
    Grad = 1.0;
    for (int i = topo.Count - 1; i >= 0; i--)
    {
        Value v = topo[i];
        if (v.Grad == 0)
        {
            continue; // optimisation: nothing to propagate
        }

        if (v._inputs == null)
        {
            continue;
        }

        for (int j = 0; j < v._inputs.Length; j++)
        {
            v._inputs[j].Grad += v._localGrads![j] * v.Grad;
        }
    }
}

There are two overloads. The parameterless one is what we'll use in Chapters 2-6: it allocates its own buffers, you call L.Backward(), done. The three-argument version takes pre-allocated topo, visited, and stack buffers so the Chapter 7 training loop can reuse them across thousands of steps instead of reallocating on every call. Same algorithm either way.
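As a rough sketch of how that reuse might look (the ComputeLoss call and the loop shape are placeholders, not the real Chapter 7 code):

```csharp
// Allocate the buffers once, outside the training loop.
var topo = new List<Value>();
var visited = new HashSet<Value>();
var stack = new Stack<(Value, int)>();

for (int step = 0; step < 10_000; step++)
{
    Value loss = ComputeLoss(); // hypothetical forward pass

    // Clear, rather than reallocate, before each backward pass.
    // (The stack is already empty: Stage 1 loops until Count == 0.)
    topo.Clear();
    visited.Clear();

    loss.Backward(topo, visited, stack);

    // ...parameter updates would happen here...
}
```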

Why Iterative Instead of Recursive?

Karpathy's Python version uses a recursive topological sort. That works fine for shallow graphs, but deep computation graphs can blow the call stack. The C# version replaces recursion with an explicit Stack<T>, which is both faster and safe at arbitrary graph depth. The algorithm is identical; only the mechanics differ.
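For comparison, here's roughly what the recursive version of Stage 1 looks like in C# (the method name BuildTopo is my own; this is a transliteration of the Python approach, not code from this project):

```csharp
// Recursive depth-first topological sort. Each call frame does the job of
// one (node, inputIndex) pair on the explicit stack, so a graph that is
// thousands of operations deep can overflow the call stack.
static void BuildTopo(Value v, HashSet<Value> visited, List<Value> topo)
{
    if (!visited.Add(v))
    {
        return; // already reached via another path
    }

    if (v._inputs != null)
    {
        foreach (Value input in v._inputs)
        {
            BuildTopo(input, visited, topo);
        }
    }

    topo.Add(v); // appended only after every input has been appended
}
```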

Walking Through Both Stages

If the code feels opaque, here's a step-by-step trace using the same example from the exercise below:

a = Value(2.0)       ← leaf, no inputs
b = Value(3.0)       ← leaf, no inputs
c = a * b            ← inputs: [a, b], localGrads: [b.Data, a.Data] = [3.0, 2.0]
L = c + a            ← inputs: [c, a], localGrads: [1.0, 1.0]
L.Backward()

Stage 1 - Topological sort. The stack holds (node, inputIndex) pairs. Each pop says "I'm visiting node, and I'm up to input #inputIndex." When inputIndex has walked past every input, the node is done and gets added to the topo list.

Step  Pop        What happens                              Stack after               topo
────  ─────────  ────────────────────────────────────────  ───────────────────────   ────────────
      (start)    Push (L, 0)                               [(L,0)]                   []

1     (L, 0)     L has inputs, index 0 < 2:                [(L,1), (c,0)]            []
                   push (L,1) to come back later
                   c is new, push (c,0) to explore it

2     (c, 0)     c has inputs, index 0 < 2:                [(L,1), (c,1), (a,0)]     []
                   push (c,1) to come back later
                   a is new, push (a,0) to explore it

3     (a, 0)     a has no inputs (leaf):                   [(L,1), (c,1)]            [a]
                   add a to topo

4     (c, 1)     c has inputs, index 1 < 2:                [(L,1), (c,2), (b,0)]     [a]
                   push (c,2) to come back later
                   b is new, push (b,0) to explore it

5     (b, 0)     b has no inputs (leaf):                   [(L,1), (c,2)]            [a, b]
                   add b to topo

6     (c, 2)     c has inputs, but index 2 = 2:            [(L,1)]                   [a, b, c]
                   all inputs done, add c to topo

7     (L, 1)     L has inputs, index 1 < 2:                [(L,2)]                   [a, b, c]
                   push (L,2) to come back later
                   a already visited, skip

8     (L, 2)     L has inputs, but index 2 = 2:            []                        [a, b, c, L]
                   all inputs done, add L to topo

Step 7 is the interesting one: a is L's second input, but we've already visited it through c, so we skip it. Without the visited check, a would appear twice in topo. For a leaf that's merely redundant, but a shared non-leaf node processed twice in Stage 2 would push gradients into its inputs twice, double-counting them (and re-walking shared subgraphs can blow up exponentially in larger graphs).

Stage 2 - Propagate gradients. Walk topo in reverse (L → c → b → a), seeding L.Grad = 1.0:

The formula at each step is:  input.Grad += localGrad * node.Grad

Step  i   Node   .Grad   What happens
────  ──  ────   ─────   ──────────────────────────────────────────────────────
1     3   L      1.0     L's inputs [c, a] with localGrads [1.0, 1.0]:
                           c.Grad += 1.0 (localGrad) * 1.0 (L.Grad) = 1.0
                           a.Grad += 1.0 (localGrad) * 1.0 (L.Grad) = 1.0

2     2   c      1.0     c's inputs [a, b] with localGrads [3.0, 2.0]:
                           a.Grad += 3.0 (localGrad) * 1.0 (c.Grad) = 3.0    (a.Grad is now 4.0)
                           b.Grad += 2.0 (localGrad) * 1.0 (c.Grad) = 2.0

3     1   b      2.0     b has no inputs (leaf), skip

4     0   a      4.0     a has no inputs (leaf), skip

Result:  a.Grad = 4.0    (1.0 from the addition path + 3.0 from the multiplication path)
         b.Grad = 2.0

Notice a.Grad gets written to twice: once in step 1 (through L = c + a) and again in step 2 (through c = a * b). The two contributions are summed with +=, which is why a.Grad ends up at 4.0. Without the += accumulation, the second write would overwrite the first and we'd get 3.0 instead of 4.0.

Exercise: Verify Gradients

Create Chapter2Exercise.cs:

// --- Chapter2Exercise.cs ---

namespace MicroGPT;

public static class Chapter2Exercise
{
    public static void Run()
    {
        var a = new Value(2.0);
        var b = new Value(3.0);
        Value c = a * b; // c = 6.0
        Value loss = c + a; // L = 8.0

        loss.Backward();

        Console.WriteLine($"a.Grad: expected 4, got {a.Grad}");
        Console.WriteLine($"b.Grad: expected 2, got {b.Grad}");
    }
}

Uncomment the Chapter 2 case in the dispatcher in Program.cs:

case "ch2":
    Chapter2Exercise.Run();
    break;

Then run it:

dotnet run -- ch2

These are the same results we traced through in the tables above: a.Grad = 4.0 (1.0 from the addition path + 3.0 from the multiplication path) and b.Grad = 2.0.

What This Means

a.Grad = 4.0 tells you: "if you increase a by a tiny amount e, then L increases by approximately 4e." This is the information the optimiser will use to update parameters.
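You can sanity-check that interpretation numerically, without Backward() at all, by nudging a plain-double version of the same function:

```csharp
// L(a, b) = a*b + a, the same computation as the exercise.
double L(double a, double b) => a * b + a;

double eps = 1e-6;
double baseline = L(2.0, 3.0);       // 8.0
double nudged   = L(2.0 + eps, 3.0); // a nudged by eps

// (L(a + eps, b) - L(a, b)) / eps approximates dL/da.
Console.WriteLine((nudged - baseline) / eps); // ≈ 4.0, matching a.Grad
```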

Architecture Diagrams vs. the Computation Graph

Now that you've seen a.Grad accumulate to 4.0 from two different paths, there's a distinction worth pinning down. Diagrams of neural networks usually show a neat chain of boxes:

Input → [Layer 1] → [Layer 2] → Output

That's the architecture diagram. Think of it as a building's floor plan: it shows you the rooms and which rooms connect to which. Clean and linear.

The computation graph is what Value actually tracks. It's more like the building's electrical wiring diagram, showing every individual operation inside those boxes. And when you zoom in, you discover that values get reused. A single Value can feed into multiple operations.

Here's a simple example, the same one from the exercise:

a = 2.0
b = 3.0
c = a * b       // a is used here
L = c + a       // a is used here too

If you drew this as a tree, a would appear in two separate places, as if there were two copies. But there's only one a. The computation graph reflects this: a is a single node with two arrows coming out of it, one going to the multiplication and one going to the addition.

This means the computation graph isn't a tree, it's a web (technically a "directed acyclic graph"). That's exactly why a.Grad ended up at 4.0 in the exercise: Backward() accumulated contributions from both paths. The clearest payoff comes in Chapter 8, where residual connections (x = layer(x) + xResidual) reuse the same Value objects on both sides of the addition, and the += accumulation here is what makes their "gradient highway" actually work.
