DEV Community

Gary Jackson

Posted on • Originally published at garyjackson.dev

Chapter 11: The Full GPT - Assembling the Model

What You'll Build

Four files that together make the project complete:

  • Model.cs - the GptModel class that holds all parameters and implements the full forward pass (replacing the simplified Forward function from Chapters 6-7)
  • AdamOptimiser.cs - a reusable class wrapping the Adam state and update from Chapter 7
  • FullTraining.cs - the real training loop that uses GptModel across 10,000 steps
  • Program.cs - the finalised dispatcher with the full case wired up

Depends On

All previous chapters.

The GptModel Class

A few design notes before the code.

The Forward method takes a single token at a time, not the whole sequence at once. The KV cache (passed in as parameters) holds the context from previous positions. This is the same one-token-at-a-time approach from Chapter 9: we process tokens sequentially during both training and inference.

Each document or sample needs its own fresh KV cache. The model provides CreateKvCache() for that, and the caller passes it back into every Forward call for that sequence.

The parameter dictionary uses string keys like "wte", "wpe", "lm_head", and "layer0.attn_wq". That means a typo in a key would be a runtime error, not a compile error, but it's the most direct mapping to how PyTorch stores model weights. If you ever wanted to load real GPT-2 checkpoints, the keys would line up. To keep the C# code readable inside Forward, we add private property aliases (TokenEmbeddings, PositionEmbeddings, OutputProjection) over the most-used dict entries. The cryptic two-letter names live in the dict, and the descriptive C# names live in the code that uses them.

Forward itself is short because it delegates to two private methods (AttentionBlock and MlpBlock) that mirror the "communicate, compute" framing from Chapter 10. Each block contains a pre-norm, the transformation, and the residual add.

// --- Model.cs ---

using static MicroGPT.Helpers;

namespace MicroGPT;

public class GptModel
{
    // The state dict keys follow PyTorch / GPT-2 convention (wte = weight token embedding,
    // wpe = weight position embedding, etc.) so this code can map directly to PyTorch
    // checkpoints if you ever want to load real GPT-2 weights. The aliased properties
    // below give us readable C# names to use inside Forward without losing that bridge.
    private readonly Dictionary<string, List<List<Value>>> _stateDict;
    private readonly int _embeddingSize;
    private readonly int _headCount;
    private readonly int _layerCount;
    private readonly int _headDimension;

    private List<List<Value>> TokenEmbeddings => _stateDict["wte"];
    private List<List<Value>> PositionEmbeddings => _stateDict["wpe"];
    private List<List<Value>> OutputProjection => _stateDict["lm_head"];

    /// <summary>All trainable parameters, flattened into a single list for the optimiser.</summary>
    public List<Value> Parameters { get; }

    public GptModel(
        int vocabSize,
        int embeddingSize,
        int headCount,
        int layerCount,
        int maxSequenceLength,
        Random random
    )
    {
        _embeddingSize = embeddingSize;
        _headCount = headCount;
        _layerCount = layerCount;
        _headDimension = embeddingSize / headCount;

        _stateDict = new Dictionary<string, List<List<Value>>>
        {
            ["wte"] = CreateMatrix(random, vocabSize, embeddingSize),
            ["wpe"] = CreateMatrix(random, maxSequenceLength, embeddingSize),
            ["lm_head"] = CreateMatrix(random, vocabSize, embeddingSize),
        };

        for (int i = 0; i < layerCount; i++)
        {
            _stateDict[$"layer{i}.attn_wq"] = CreateMatrix(random, embeddingSize, embeddingSize);
            _stateDict[$"layer{i}.attn_wk"] = CreateMatrix(random, embeddingSize, embeddingSize);
            _stateDict[$"layer{i}.attn_wv"] = CreateMatrix(random, embeddingSize, embeddingSize);
            _stateDict[$"layer{i}.attn_wo"] = CreateMatrix(random, embeddingSize, embeddingSize);
            _stateDict[$"layer{i}.mlp_fc1"] = CreateMatrix(
                random,
                4 * embeddingSize,
                embeddingSize
            );
            _stateDict[$"layer{i}.mlp_fc2"] = CreateMatrix(
                random,
                embeddingSize,
                4 * embeddingSize
            );
        }

        // Dictionary<TKey,TValue> enumeration order is not guaranteed by the spec.
        // In .NET Core+ it preserves insertion order in practice, so Adam's momentum[]/squaredGradAvg[]
        // line up across runs - but if that implementation detail ever changes, switch
        // to a List<KeyValuePair<string, ...>> to make the order explicit.
        Parameters = [.. _stateDict.Values.SelectMany(mat => mat).SelectMany(row => row)];
    }

    public List<Value> Forward(
        int tokenId,
        int posId,
        List<List<Value>>[] keys,
        List<List<Value>>[] values
    )
    {
        List<Value> tokenEmbedding = TokenEmbeddings[tokenId];
        List<Value> positionEmbedding = PositionEmbeddings[posId];

        var x = new List<Value>();
        for (int i = 0; i < _embeddingSize; i++)
        {
            x.Add(tokenEmbedding[i] + positionEmbedding[i]);
        }

        // Initial RmsNorm: stabilises the embeddings before entering the first block.
        // This isn't standard in all transformer implementations, but gives the
        // residual stream a stable starting magnitude.
        x = RmsNorm(x);

        for (int layerIndex = 0; layerIndex < _layerCount; layerIndex++)
        {
            x = AttentionBlock(x, layerIndex, keys, values);
            x = MlpBlock(x, layerIndex);
        }

        // Note: production transformers typically apply a final RmsNorm here
        // before the output projection. We omit it for simplicity.
        return Linear(x, OutputProjection);
    }

    // Attention wrapped with pre-norm and a residual connection.
    // Mutates keys[layerIndex] and values[layerIndex] by appending the current position's K and V.
    private List<Value> AttentionBlock(
        List<Value> x,
        int layerIndex,
        List<List<Value>>[] keys,
        List<List<Value>>[] values
    )
    {
        var xResidual = new List<Value>(x);
        x = RmsNorm(x);

        List<Value> query = Linear(x, _stateDict[$"layer{layerIndex}.attn_wq"]);
        List<Value> key = Linear(x, _stateDict[$"layer{layerIndex}.attn_wk"]);
        List<Value> value = Linear(x, _stateDict[$"layer{layerIndex}.attn_wv"]);

        keys[layerIndex].Add(key);
        values[layerIndex].Add(value);

        var concatenatedHeads = new List<Value>();
        for (int h = 0; h < _headCount; h++)
        {
            int headStart = h * _headDimension;
            List<Value> queryForHead = query.GetRange(headStart, _headDimension);

            var attentionLogits = new List<Value>();
            int cachedCount = keys[layerIndex].Count;
            for (int t = 0; t < cachedCount; t++)
            {
                List<Value> keyForHead = keys[layerIndex][t].GetRange(headStart, _headDimension);
                var dot = new Value(0);
                for (int j = 0; j < _headDimension; j++)
                {
                    dot += queryForHead[j] * keyForHead[j];
                }

                attentionLogits.Add(dot / Math.Sqrt(_headDimension));
            }

            List<Value> attentionWeights = Softmax(attentionLogits);

            var headOutput = new List<Value>();
            for (int j = 0; j < _headDimension; j++)
            {
                headOutput.Add(new Value(0));
            }

            for (int t = 0; t < cachedCount; t++)
            {
                List<Value> valueForHead = values[layerIndex][t]
                    .GetRange(headStart, _headDimension);
                Value w = attentionWeights[t];
                for (int j = 0; j < _headDimension; j++)
                {
                    headOutput[j] += w * valueForHead[j];
                }
            }
            concatenatedHeads.AddRange(headOutput);
        }

        x = Linear(concatenatedHeads, _stateDict[$"layer{layerIndex}.attn_wo"]);
        for (int i = 0; i < _embeddingSize; i++)
        {
            x[i] += xResidual[i];
        }

        return x;
    }

    // Two-layer feed-forward with ReLU, wrapped with pre-norm and a residual connection.
    private List<Value> MlpBlock(List<Value> x, int layerIndex)
    {
        var xResidual = new List<Value>(x);
        x = RmsNorm(x);
        x = Linear(x, _stateDict[$"layer{layerIndex}.mlp_fc1"]);
        x = [.. x.Select(xi => xi.Relu())];
        x = Linear(x, _stateDict[$"layer{layerIndex}.mlp_fc2"]);
        for (int i = 0; i < _embeddingSize; i++)
        {
            x[i] += xResidual[i];
        }

        return x;
    }

    /// <summary>Creates a fresh KV cache for a new document/sample.</summary>
    public List<List<Value>>[] CreateKvCache()
    {
        var cache = new List<List<Value>>[_layerCount];
        for (int i = 0; i < _layerCount; i++)
        {
            cache[i] = [];
        }

        return cache;
    }
}

Extracting the Adam Optimiser

The Adam state and update from Chapter 7 were written inline so the mechanics were visible. Now that we've seen them, it's worth packaging them into a reusable class.

AdamOptimiser owns the per-parameter state (momentum, squaredGradAvg), the hyperparameters, and exposes two methods: ZeroGrad() and Step(int step). The maths inside Step is identical to the inline version in Chapter 7.

// --- AdamOptimiser.cs ---

namespace MicroGPT;

// Encapsulates the Adam update from Chapter 7 (momentum, squared-gradient
// average, bias correction) along with the linear learning-rate decay used
// across the course. See Chapter 7 for the underlying maths.
public class AdamOptimiser
{
    private const double MomentumSmoothing = 0.85;
    private const double SquaredGradSmoothing = 0.99;
    private const double Epsilon = 1e-8;

    private readonly IReadOnlyList<Value> _parameters;
    private readonly double[] _momentum;
    private readonly double[] _squaredGradAvg;
    private readonly double _baseLearningRate;
    private readonly int _totalSteps;

    public AdamOptimiser(IReadOnlyList<Value> parameters, double learningRate, int totalSteps)
    {
        _parameters = parameters;
        _momentum = new double[parameters.Count];
        _squaredGradAvg = new double[parameters.Count];
        _baseLearningRate = learningRate;
        _totalSteps = totalSteps;
    }

    // Reset every parameter's gradient to zero. Call before each Backward.
    public void ZeroGrad()
    {
        foreach (Value p in _parameters)
        {
            p.Grad = 0;
        }
    }

    // Apply one Adam update to every parameter using its current Grad.
    public void Step(int step)
    {
        double currentLearningRate = _baseLearningRate * (1 - (double)step / _totalSteps);
        for (int i = 0; i < _parameters.Count; i++)
        {
            Value p = _parameters[i];
            _momentum[i] = MomentumSmoothing * _momentum[i] + (1 - MomentumSmoothing) * p.Grad;
            _squaredGradAvg[i] =
                SquaredGradSmoothing * _squaredGradAvg[i]
                + (1 - SquaredGradSmoothing) * Math.Pow(p.Grad, 2);
            double correctedMomentum = _momentum[i] / (1 - Math.Pow(MomentumSmoothing, step + 1));
            double correctedSquaredGrad =
                _squaredGradAvg[i] / (1 - Math.Pow(SquaredGradSmoothing, step + 1));
            p.Data -=
                currentLearningRate
                * correctedMomentum
                / (Math.Sqrt(correctedSquaredGrad) + Epsilon);
        }
    }
}

The Training Loop

Now that GptModel and AdamOptimiser exist, we need a training loop that puts them through the same forward-backward-update cycle from Chapter 7, using the full model instead of three loose matrices. The training code (along with the inference loop we'll add in Chapter 12) goes into a new file, FullTraining.cs, with a single static Run() method.

The structure mirrors Chapter7Exercise.cs closely. The key differences: we create a GptModel instead of three loose matrices, we call model.Forward instead of a local function, we use model.CreateKvCache() to get a fresh cache per document, and the Adam step collapses to two optimiser calls.

There's also a new "milestone" print every 1000 steps that shows the current running-average loss alongside its value at the previous milestone. Per-step loss is noisy, and even the running average still wobbles, so over any 1000-step window it can drift either way. Printing the previous milestone's value alongside the current one lets you judge the trend yourself.

Performance note: run with dotnet run -c Release -- full. Debug builds are significantly slower, and 10,000 steps can take a very long time without compiler optimisations.

// --- FullTraining.cs ---

using System.Text;
using static MicroGPT.Helpers;

namespace MicroGPT;

public static class FullTraining
{
    public static void Run()
    {
        // ── Hyperparameters ──────────────────────────────────────
        int embeddingSize = 16;
        int layerCount = 1; // just one transformer block for speed - try layerCount=2 to see improvement
        int maxSequenceLength = 8;
        int numSteps = 10000;
        int headCount = 4;
        double learningRate = 1e-2;
        var random = new Random(42);

        // ── Dataset and Tokenizer ────────────────────────────────
        List<string> docs = Tokenizer.LoadDocs("input.txt", random);
        var tokenizer = new Tokenizer(docs);
        Console.WriteLine($"num docs: {docs.Count}");
        Console.WriteLine($"vocab size: {tokenizer.VocabSize}");

        // ── Model ────────────────────────────────────────────────
        var model = new GptModel(
            tokenizer.VocabSize,
            embeddingSize,
            headCount,
            layerCount,
            maxSequenceLength,
            random
        );
        Console.WriteLine($"num params: {model.Parameters.Count}");

        // ── Training Loop ────────────────────────────────────────
        var optimiser = new AdamOptimiser(model.Parameters, learningRate, numSteps);

        // Reusable buffers for Backward (see Chapter 2's convenience overload for the
        // simpler allocating version - here we hoist them out of the hot loop so 10,000
        // training steps don't allocate 10,000 fresh sets).
        var topo = new List<Value>();
        var visited = new HashSet<Value>();
        var backwardStack = new Stack<(Value, int)>();

        // Running average to smooth out the noisy per-step loss.
        double avgLoss = 0.0;
        // Milestone tracking so we can report the previous milestone's avg loss
        // alongside the current one every 1000 steps.
        double lastMilestoneLoss = 0.0;

        for (int step = 0; step < numSteps; step++)
        {
            string doc = docs[step % docs.Count];
            var tokens = new List<int> { tokenizer.Bos };
            tokens.AddRange(doc.Select(tokenizer.Encode));
            tokens.Add(tokenizer.Bos);
            // Any name longer than maxSequenceLength - 1 is silently truncated here.
            int tokenCount = Math.Min(maxSequenceLength, tokens.Count - 1);

            List<List<Value>>[] keys = model.CreateKvCache();
            List<List<Value>>[] values = model.CreateKvCache();

            var losses = new List<Value>();
            for (int posId = 0; posId < tokenCount; posId++)
            {
                List<Value> logits = model.Forward(tokens[posId], posId, keys, values);
                List<Value> probabilities = Softmax(logits);
                losses.Add(-probabilities[tokens[posId + 1]].Log());
            }

            var loss = new Value(0);
            foreach (Value l in losses)
            {
                loss += l;
            }

            loss *= 1.0 / tokenCount;

            // Track running average (exponential moving average with alpha = 0.01)
            avgLoss = step == 0 ? loss.Data : 0.99 * avgLoss + 0.01 * loss.Data;
            if (step == 0)
            {
                lastMilestoneLoss = avgLoss;
            }

            optimiser.ZeroGrad();

            topo.Clear();
            visited.Clear();
            backwardStack.Clear();
            loss.Backward(topo, visited, backwardStack);

            optimiser.Step(step);

            if (step == 0 || (step + 1) % 100 == 0)
            {
                Console.WriteLine(
                    $"step {step + 1, 5} / {numSteps, 5} | loss {loss.Data:F4} | avg {avgLoss:F4}"
                );
            }

            // Every 1000 steps, print a milestone showing overall progress.
            if ((step + 1) % 1000 == 0)
            {
                Console.WriteLine(
                    $"  [milestone] avg loss: {avgLoss:F4} (was {lastMilestoneLoss:F4})"
                );
                lastMilestoneLoss = avgLoss;
            }
        }

        // Chapter 12's inference loop lives here too - see the next chapter.
    }
}

A reminder about name length. As in Chapter 7, maxSequenceLength = 8 means any name longer than 7 characters is silently truncated during training. This is why the generated samples in Chapter 12 lean toward shorter names. The model simply hasn't seen the tails of longer ones during training. Raising maxSequenceLength to 16 removes the truncation at roughly 2x the training cost.
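To see the truncation concretely, here's a small sketch of the arithmetic the training loop performs for a long name. The name "bernadette" is just an illustrative example (not from the dataset), and the BOS bookkeeping mirrors the tokens list built in FullTraining.Run:

```csharp
int maxSequenceLength = 8;
string doc = "bernadette"; // hypothetical 10-character name

// The training loop builds tokens = [BOS] + characters + [BOS],
// then trains on at most maxSequenceLength next-token predictions.
int totalTokens = 1 + doc.Length + 1;                           // 12 tokens
int totalTargets = totalTokens - 1;                             // 11 possible predictions
int tokenCount = Math.Min(maxSequenceLength, totalTargets);     // 8 actually trained

// Positions 0..7 predict 'b','e','r','n','a','d','e','t'.
// The trailing "te" and the end-of-name BOS are never targets,
// so the model never learns how names like this one end.
Console.WriteLine($"{tokenCount} of {totalTargets} targets trained");
```

With maxSequenceLength raised to 16, tokenCount would be 11 and the whole name, including its terminating BOS, would contribute to the loss.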

Finishing the Dispatcher

With FullTraining.cs in place, we can finalise Program.cs. You've been uncommenting one case in the dispatcher at the end of every chapter since Chapter 1. Now uncomment the final case "full" line and replace the temporary sanity-check default from Chapter 0 with the usage message. After this edit your Program.cs should look like this:

// --- Program.cs ---
//
// Dispatcher: `dotnet run -- chN` runs a specific chapter exercise,
// `dotnet run -- full` runs the full training + inference,
// `dotnet run` (no args) defaults to the full run.

namespace MicroGPT;

public static class Program
{
    public static void Main(string[] args)
    {
        string chapter = args.Length > 0 ? args[0].ToLowerInvariant() : "full";

        switch (chapter)
        {
            case "gradcheck":
                GradientCheck.RunAll();
                break;
            case "ch1":
                Chapter1Exercise.Run();
                break;
            case "ch2":
                Chapter2Exercise.Run();
                break;
            case "ch3":
                Chapter3Exercise.Run();
                break;
            case "ch4":
                Chapter4Exercise.Run();
                break;
            case "ch5":
                Chapter5Exercise.Run();
                break;
            case "ch6":
                Chapter6Exercise.Run();
                break;
            case "ch7":
                Chapter7Exercise.Run();
                break;
            case "ch8":
                Chapter8Exercise.Run();
                break;
            case "ch9":
                Chapter9Exercise.Run();
                break;
            case "ch10":
                Chapter10Exercise.Run();
                break;
            case "full":
                FullTraining.Run();
                break;
            default:
                Console.WriteLine($"Unknown chapter: {chapter}");
                Console.WriteLine("Usage: dotnet run -- [gradcheck|ch1..ch10|full]");
                break;
        }
    }
}

Two things changed from the Chapter 0 skeleton: the "full" case is now wired to FullTraining.Run(), and the default no-args value is "full" instead of "" so that dotnet run with no arguments runs the full training and inference.

With everything wired up, you can invoke any part of the project uniformly:

dotnet run -- ch1     # Chapter 1 exercise
dotnet run -- ch10    # Chapter 10 exercise
dotnet run -- full    # full training + inference (Chapters 11-12)
dotnet run            # same as "full"

The Parameter Count

With embeddingSize=16, headCount=4, layerCount=1, maxSequenceLength=8, and vocabSize=27:

  • Token embeddings (wte): 27 x 16 = 432
  • Position embeddings (wpe): 8 x 16 = 128
  • Output projection (lm_head): 27 x 16 = 432
  • Per layer: Q(256) + K(256) + V(256) + O(256) + FC1(1024) + FC2(1024) = 3,072
  • Total: 432 + 128 + 432 + 3,072 = 4,064

For perspective: GPT-2's largest variant had 1.5 billion parameters. GPT-4 class models have hundreds of billions. The architecture is the same, just much wider and deeper.
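As a sanity check, the arithmetic above can be reproduced in a few lines. The variable names here are just for illustration; the matrix shapes come straight from the GptModel constructor:

```csharp
int vocabSize = 27, embeddingSize = 16, layerCount = 1, maxSequenceLength = 8;

int wte = vocabSize * embeddingSize;            // token embeddings: 432
int wpe = maxSequenceLength * embeddingSize;    // position embeddings: 128
int lmHead = vocabSize * embeddingSize;         // output projection: 432
int perLayer = 4 * embeddingSize * embeddingSize            // Q, K, V, O: 4 x 256
             + 2 * (4 * embeddingSize) * embeddingSize;     // fc1 + fc2: 2 x 1024

int total = wte + wpe + lmHead + layerCount * perLayer;
Console.WriteLine(total); // 4064
```

This should match the num params line printed by FullTraining.Run at startup.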

Optional: Try Training Now

Training is fully wired up. If you want to confirm everything works before adding inference, run it now:

dotnet run -c Release -- full

You'll see per-step loss lines and a [milestone] line every 1000 steps. The running average should drop from ~3.3 toward ~2.37 over 10,000 steps (5-15 minutes on a modern CPU in Release mode). Per-step loss is noisy, so watch the avg column for the trend.

Generation is the next chapter, so the run will end after training without producing any names yet. If you'd rather wait and run once with both training and inference in place, skip this and head straight into Chapter 12.
