What You'll Build
Two helper functions that show up in nearly every layer of a neural network:
- Linear takes an input vector and a weight matrix, multiplies each row of weights element-by-element with the input, and sums each row into a single output value:
input:   [1, 2, 3]
weights: [[0.1, 0.2, 0.3],   row 0: 0.1*1 + 0.2*2 + 0.3*3 = 1.4
          [0.4, 0.5, 0.6]]   row 1: 0.4*1 + 0.5*2 + 0.6*3 = 3.2
output:  [1.4, 3.2]
Two rows of weights means two output values. This is how neural networks change the size of data as it flows through layers.
- Softmax takes a list of raw numbers and turns them into probabilities that add up to 1. For example, [2.0, 1.0, 0.1] becomes roughly [0.66, 0.24, 0.10]. The largest input gets the highest probability.
They live in their own file because they're pure math utilities, independent of the model architecture.
Depends On
Chapters 1-2 (the Value class - our computation recorder).
Code
Linear needs a way to compute a dot product between two lists of Value objects. Add this method to Value.cs:
// --- Value.cs (add inside the Value class) ---
public static Value Dot(List<Value> a, List<Value> b)
{
    var result = new Value(0);
    for (int i = 0; i < a.Count; i++)
    {
        result += a[i] * b[i];
    }
    return result;
}
Debug vs Release matters from here on. Each += allocates a fresh Value, and a typical training step does tens of thousands of them. In Debug mode the JIT skips inlining and the GC churns, so the same run that takes ~30 seconds in Release can take 5+ minutes in Debug. Once the code is working, always run training with -c Release. The Performance Optimisation Notes section at the end of the course covers the rest of the speedups.
Now add Linear and Softmax to Helpers.cs:
// --- Helpers.cs ---
namespace MicroGPT;

public static class Helpers
{
    /// <summary>
    /// Matrix-vector multiply. Each row of weights is multiplied element-by-element
    /// with input and summed into a single value.
    /// </summary>
    public static List<Value> Linear(List<Value> input, List<List<Value>> weights) =>
        [.. weights.Select(row => Value.Dot(row, input))];

    /// <summary>
    /// Converts raw scores (logits) into a probability distribution.
    /// </summary>
    public static List<Value> Softmax(List<Value> logits)
    {
        double maxVal = logits.Max(v => v.Data);
        var exponentials = logits.Select(v => (v - maxVal).Exp()).ToList();
        var total = new Value(0);
        foreach (Value e in exponentials)
        {
            total += e;
        }
        return [.. exponentials.Select(e => e / total)];
    }
}
Each element of Linear's output is a weighted sum of input, where weights contains the learned parameters. If input has 16 elements and weights has 64 rows of 16 elements each, the output has 64 elements. This is how neural networks change the dimensionality of data.
Notice there's no bias term added after the dot product. Production models typically include one (output = weights * input + bias), but we omit it for simplicity. Some modern architectures like LLaMA also drop biases.
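If you're curious what the bias variant looks like, here is a minimal sketch on plain doubles, outside the Value graph. The `LinearWithBias` name and the bias values are illustrative, not part of the chapter's code:

```csharp
using System;
using System.Linq;

// Hypothetical Linear-with-bias on plain doubles:
// output[r] = dot(weights[r], input) + bias[r].
static double[] LinearWithBias(double[] input, double[][] weights, double[] bias) =>
    weights.Select((row, r) => row.Zip(input, (w, x) => w * x).Sum() + bias[r]).ToArray();

double[] input = { 1.0, 2.0, 3.0 };
double[][] weights = { new[] { 0.1, 0.2, 0.3 }, new[] { 0.4, 0.5, 0.6 } };
double[] bias = { 0.5, -0.5 };

// Same dot products as before (1.4 and 3.2), shifted by the per-row bias.
double[] output = LinearWithBias(input, weights, bias);
Console.WriteLine($"{output[0]:F1} {output[1]:F1}"); // 1.9 2.7
```

The bias lets each output move away from zero even when the input is all zeros, which is why most production layers keep it.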
The raw numbers that come out of a model before they're turned into probabilities are called logits. You'll see the term everywhere in ML. They can be any value (positive, negative, large, small), and on their own they don't mean much. They need to be converted into probabilities first, which is where Softmax comes in.
Softmax takes a list of logits and turns them into a probability distribution where all values are in [0, 1] and sum to 1. We subtract the max value before taking exp for numerical stability. Mathematically it doesn't change the result (the max cancels out in the division), but without it, exp of large numbers can overflow to infinity. The backward pass is unaffected too: shifting every logit by the same constant cancels in the ratio, so the gradients through Softmax come out identical to the unshifted version.
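You can see the overflow directly with plain doubles. This standalone sketch uses `Math.Exp` instead of the Value graph, but the arithmetic is the same:

```csharp
using System;
using System.Linq;

double[] logits = { 1000, 999, 998 };

// Naive: exp of large logits overflows to Infinity, so the later
// division would produce Infinity/Infinity = NaN.
double naiveSum = logits.Sum(Math.Exp);
Console.WriteLine(double.IsInfinity(naiveSum)); // True

// Stable: subtract the max first; the shift cancels in the division.
double max = logits.Max();
double[] exps = logits.Select(v => Math.Exp(v - max)).ToArray();
double total = exps.Sum();
double[] probs = exps.Select(e => e / total).ToArray();
Console.WriteLine($"{probs.Sum():F3}"); // 1.000
```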
Because both Linear and Softmax are built entirely from Value operations (add, multiply, exp, divide), gradients flow through them automatically during the backward pass. They aren't "frozen" math. They're part of the computation graph, just like any other chain of Value operations.
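Since the chapter's Value class isn't reproduced here, a plain-double finite-difference check can stand in for the backward pass to verify the shift-invariance claim. The helper names below are illustrative, not part of the chapter's code:

```csharp
using System;
using System.Linq;

// Plain-double softmax with max subtraction, matching the Helpers version.
static double[] Softmax(double[] z)
{
    double max = z.Max();
    double[] exps = z.Select(v => Math.Exp(v - max)).ToArray();
    double total = exps.Sum();
    return exps.Select(e => e / total).ToArray();
}

// Central finite difference: d(softmax(z)[0]) / d(z[1]).
static double GradOutput0WrtLogit1(double[] z)
{
    const double h = 1e-6;
    double[] up = (double[])z.Clone();
    double[] down = (double[])z.Clone();
    up[1] += h;
    down[1] -= h;
    return (Softmax(up)[0] - Softmax(down)[0]) / (2 * h);
}

double[] logits = { 2.0, 1.0, 0.1 };
double[] shifted = logits.Select(v => v + 50.0).ToArray();

// Both print the same value: shifting every logit by a constant
// leaves the gradients through Softmax unchanged.
Console.WriteLine($"{GradOutput0WrtLogit1(logits):F6}");
Console.WriteLine($"{GradOutput0WrtLogit1(shifted):F6}");
```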
Exercise
Create Chapter5Exercise.cs:
// --- Chapter5Exercise.cs ---
using static MicroGPT.Helpers;

namespace MicroGPT;

public static class Chapter5Exercise
{
    public static void Run()
    {
        // Test Linear: a 2x3 weight matrix times a length-3 input vector
        var input = new List<Value> { new(1.0), new(2.0), new(3.0) };
        var weights = new List<List<Value>>
        {
            new() { new(0.1), new(0.2), new(0.3) }, // row 0: 0.1*1 + 0.2*2 + 0.3*3 = 1.4
            new() { new(0.4), new(0.5), new(0.6) }, // row 1: 0.4*1 + 0.5*2 + 0.6*3 = 3.2
        };

        List<Value> output = Linear(input, weights);
        Console.WriteLine("--- Linear ---");
        Console.WriteLine("Expected: 1.4 3.2");
        Console.Write("Got: ");
        foreach (Value v in output)
        {
            Console.Write($"{v.Data:F1} ");
        }
        Console.WriteLine();

        // Test Softmax: converts raw logits into probabilities that sum to 1
        var logits = new List<Value> { new(2.0), new(1.0), new(0.1) };
        List<Value> probabilities = Softmax(logits);
        Console.WriteLine("--- Softmax ---");
        Console.WriteLine(
            "Expected: 0.659 0.242 0.099 (sum to 1.0, largest logit gets highest prob)"
        );
        Console.Write("Got: ");
        foreach (Value p in probabilities)
        {
            Console.Write($"{p.Data:F3} ");
        }
        Console.WriteLine();
    }
}
Uncomment the Chapter 5 case in the dispatcher in Program.cs:
case "ch5":
    Chapter5Exercise.Run();
    break;
Then run it:
dotnet run -- ch5
Common Misconception
"Why not just divide each logit by the sum of all logits?" Because logits can be negative, and a probability distribution needs all non-negative values. The exp function makes sure everything is positive before normalising.
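A quick counterexample on plain doubles (a standalone sketch, not part of the chapter's code):

```csharp
using System;
using System.Linq;

// Dividing by the raw sum breaks as soon as a logit is negative.
double[] logits = { 2.0, -1.0, -0.5 };
double rawSum = logits.Sum(); // 0.5
double[] naive = logits.Select(v => v / rawSum).ToArray();
Console.WriteLine(string.Join(" ", naive)); // 4 -2 -1: not probabilities

// exp makes every term positive, so the normalised values
// form a valid distribution: all in [0, 1], summing to 1.
double[] exps = logits.Select(Math.Exp).ToArray();
double total = exps.Sum();
double[] probs = exps.Select(e => e / total).ToArray();
Console.WriteLine(string.Join(" ", probs.Select(p => p.ToString("F3"))));
```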