What You'll Build
Embedding tables that give each token and each position a learned vector, a minimal forward pass that produces logits, and the loss function that measures how wrong the predictions are.
Depends On
Chapters 1-3, 5 (Value, Tokenizer, Helpers).
Embeddings: Giving Tokens an Identity
The model needs two pieces of information about each token: what the token is, and where it appears in the sequence. Each piece gets its own embedding. We'll start with the first one (token embeddings) and cover position embeddings in the next section.
So far, each token is just an integer: a is 0, b is 1, z is 25. A neural network can't do anything useful with a raw integer. It needs a richer representation, a list of numbers that captures something meaningful about each token. Maybe the first number captures "how often this letter starts a name" and the second captures "how vowel-like it is". We don't hand-pick these meanings. The network discovers them during training.
This list of numbers for each token is called an embedding. An embedding table is just a matrix where row i is the embedding for token i:
Letter Token ID Token Embedding (4 numbers in this example)
───── ──────── ───────────────────────────────────────────
a 0 [ 0.02, -0.05, 0.11, 0.03]
b 1 [-0.07, 0.04, 0.01, -0.09]
c 2 [ 0.06, 0.08, -0.03, 0.05]
... ... ...
. 26 [ 0.01, -0.02, 0.04, -0.01] ← BOS
At the start, every embedding is random (like the small numbers above). By the end of training, tokens that behave similarly in names will end up with similar embeddings.
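To make "row i is the embedding for token i" concrete, here is a toy lookup using the illustrative rows from the table as plain doubles (the real table holds Value objects, but the indexing is identical):

```csharp
using System;

// Toy embedding lookup: row i of the table is the embedding for token i.
// Plain doubles here; the real table holds Value objects.
double[][] tokenEmbeddings =
{
    new[] {  0.02, -0.05,  0.11,  0.03 }, // 'a' (token ID 0)
    new[] { -0.07,  0.04,  0.01, -0.09 }, // 'b' (token ID 1)
    new[] {  0.06,  0.08, -0.03,  0.05 }, // 'c' (token ID 2)
};

int tokenId = 1; // 'b'
double[] embedding = tokenEmbeddings[tokenId];
Console.WriteLine($"Embedding for token {tokenId}: [{string.Join(", ", embedding)}]");
```

There is no arithmetic in a lookup: it is just array indexing, which is why embedding tables are so cheap compared to the rest of the network.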
Every parameter in the network starts as a random number, but the range matters. If values are too large, numbers flowing through the network can explode in size, and gradients stop being useful. We want values centred around zero (both positive and negative) and mostly very small. A bell curve distribution with a small standard deviation (0.08 in the code) gives us exactly that: most values land between -0.16 and 0.16, clustering tightly near zero.
The 0.08 value is tuned for this model's dimensions; it's not a universal constant. If you scale the model up (for example by changing embeddingSize to 256 or 512), the right standard deviation shrinks with the layer width. A common rule of thumb is 1/sqrt(fan_in), the same scaling at the heart of Xavier and Kaiming initialisation, two standard schemes you'll meet elsewhere in ML. Keeping 0.08 at larger widths tends to produce exploding gradients with no obvious cause, so treat this as the first knob to revisit if you start experimenting with size.
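As a quick sketch of that rule of thumb (this shows the scaling trend only; the 0.08 used at width 16 is a hand-tuned choice, deliberately smaller than 1/sqrt(16) = 0.25):

```csharp
using System;

// The 1/sqrt(fan_in) rule of thumb from the text, evaluated at a few widths.
// Note how the suggested std shrinks as the layer gets wider.
int[] widths = { 16, 64, 256, 512 };
foreach (int fanIn in widths)
{
    double std = 1.0 / Math.Sqrt(fanIn);
    Console.WriteLine($"fan_in = {fanIn,3}  ->  std = {std:F4}");
}
```

At width 512 the rule suggests roughly 0.044, about half of 0.08, which is the kind of gap that turns into exploding activations over many layers.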
Add these helpers to Helpers.cs:
// --- Helpers.cs (add inside the Helpers class) ---
/// <summary>
/// Generates a random number from a bell curve (Gaussian/normal distribution)
/// centered on the mean, with most values falling within 'std' of it.
/// Uses the Box-Muller transform - turns two uniform random numbers into
/// one bell-curve random number.
/// </summary>
public static double RandomBellCurve(Random rng, double mean, double std)
{
double u1 = 1.0 - rng.NextDouble(); // NextDouble() is [0, 1); subtracting from 1 keeps u1 > 0 so Log(u1) is safe
double u2 = 1.0 - rng.NextDouble();
return mean + std * Math.Sqrt(-2.0 * Math.Log(u1)) * Math.Sin(2.0 * Math.PI * u2);
}
/// <summary>
/// Creates a matrix of Value objects initialized to small random numbers.
/// </summary>
public static List<List<Value>> CreateMatrix(Random rng, int rows, int cols, double std = 0.08)
{
var mat = new List<List<Value>>();
for (int i = 0; i < rows; i++)
{
var row = new List<Value>();
for (int j = 0; j < cols; j++)
{
row.Add(new Value(RandomBellCurve(rng, 0, std)));
}
mat.Add(row);
}
return mat;
}
Position Embeddings: Why Position Matters
That covers the first piece. The model now knows what each token is, but it doesn't know where that token appears in the sequence. Consider the letter a in these two names:
- anna - the first a is at position 0 (starting the name)
- anna - the last a is at position 3 (ending the name)
The same letter behaves very differently depending on where it sits. A model that only knows "the current token is a" can't tell these two situations apart. It needs position information too.
You might think: "just pass the position as a number, 0, 1, 2, 3." But that's the same problem we had with token IDs. A single integer doesn't give the network enough to work with. The network needs a rich representation, a list of numbers, so it can learn complex patterns about what "being at position 3" means. Maybe position 0 tends to start with certain consonants, or position 3 is where doubled letters often appear. Those patterns need room to be encoded.
So we create a second embedding table, positionEmbeddings, where row i is the learned embedding for position i:
Position Position Embedding
──────── ───────────────────────────────────────────
0 [ 0.01, 0.03, -0.02, 0.05]
1 [-0.04, 0.01, 0.07, -0.03]
2 [ 0.03, -0.06, 0.02, 0.08]
... ...
To combine the two, we simply add them element-by-element. For example, token a (ID 0) appearing at position 1:
Token embedding (a): [ 0.02, -0.05, 0.11, 0.03]
+ Position embedding (1): [-0.04, 0.01, 0.07, -0.03]
= Combined: [-0.02, -0.04, 0.18, 0.00]
The result is a single vector that encodes both "what token is this?" and "where does it appear?"
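The same addition in code, using the example vectors above as plain doubles (a minimal sketch; the real model adds Value objects so gradients can flow back into both tables):

```csharp
using System;
using System.Linq;

// Element-wise addition of the worked example: token 'a' appearing at position 1.
double[] tokenEmb = {  0.02, -0.05, 0.11,  0.03 }; // token embedding for 'a' (ID 0)
double[] posEmb   = { -0.04,  0.01, 0.07, -0.03 }; // position embedding for position 1

double[] combined = tokenEmb.Zip(posEmb, (t, p) => t + p).ToArray();
Console.WriteLine("[" + string.Join(", ", combined.Select(v => v.ToString("F2"))) + "]");
```

Addition works here because both vectors live in the same 16-dimensional space (4 in this toy example); the network learns embeddings whose sum is meaningful, rather than us designing the combination rule.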
From Embeddings to Predictions
After combining embeddings, we have a vector of 16 numbers (the internal representation). But we need 27 scores, one per possible next token, so we can pick the most likely one. This is where outputProjection comes in: it's a weight matrix that converts the 16-number vector into 27 scores using Linear, exactly as we built in Chapter 5.
Notice outputProjection has the same dimensions as tokenEmbeddings (27 x 16) but does the reverse job. tokenEmbeddings goes from a token ID to a 16-number vector, outputProjection goes from a 16-number vector back to 27 token scores.
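A minimal sketch of what that projection does, assuming Chapter 5's Linear computes one dot product per weight-matrix row. Toy sizes here (3-token vocabulary, 2-number embeddings) and plain doubles instead of Value objects:

```csharp
using System;
using System.Linq;

// Each row of outputProjection dots with the combined embedding
// to produce one score (logit) for one vocabulary token.
double[] x = { 0.5, -1.0 }; // combined embedding (length 2 in this toy)
double[][] outputProjection =
{
    new[] {  0.1, 0.2 }, // row 0 -> score for token 0
    new[] { -0.3, 0.4 }, // row 1 -> score for token 1
    new[] {  0.0, 0.5 }, // row 2 -> score for token 2
};

double[] logits = new double[outputProjection.Length];
for (int t = 0; t < outputProjection.Length; t++)
    for (int i = 0; i < x.Length; i++)
        logits[t] += outputProjection[t][i] * x[i];

Console.WriteLine(string.Join(", ", logits.Select(v => v.ToString("F2"))));
```

Scaling the toy up to the real shapes, 27 rows of 16 weights each turn a 16-number vector into 27 scores.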
Putting It Together
In the broader ML world (PyTorch, GPT-2, nanoGPT) you'll see these matrices called wte (weight token embedding), wpe (weight position embedding), and lm_head (language model head). When we put the model into a class in Chapter 11, the dictionary keys will use the GPT-2 names so the code maps directly to PyTorch references. The C# variables themselves will stay descriptive.
For now, our "model" is just: look up token embedding, look up position embedding, add them, project to vocabulary size. Chapter 11 replaces this with the full GptModel class, which adds layers between the embeddings and the projection so the model can consider relationships between tokens rather than looking at each token in isolation.
Cross-Entropy Loss - Naming the Loss Function
In the Big Picture, we said the loss is "a single number that measures how wrong the prediction was". Cross-entropy loss is the specific formula for computing that number. It works like this: look at the probability the model assigned to the correct next token, and take the negative log of it.
If the model assigns probability 1.0 to the correct token, -log(1.0) = 0 (no surprise, zero loss). If it assigns probability near 0, the negative log goes to +infinity (maximum surprise, huge loss). The formula is just -probabilities[correctToken].Log().
At initialisation with random weights, the model assigns roughly equal probability to all 27 tokens, so each token has probability 1/27, and the loss is -log(1/27) ≈ 3.296. That is exactly the "~3.3 starting loss" you'll see printed throughout Chapters 7-11. It isn't a rough heuristic, it's the arithmetic of a uniform distribution over a 27-token vocabulary. Anything below that during training means the model has learned something beyond guessing.
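You can verify both the surprise curve and the uniform baseline with a few lines of arithmetic:

```csharp
using System;

// Cross-entropy loss as a function of the probability assigned to the
// correct token: certain -> 0 loss, near-zero probability -> huge loss.
double[] probs = { 1.0, 0.5, 1.0 / 27, 0.01 };
foreach (double p in probs)
    Console.WriteLine($"p = {p:F4}  ->  loss = {-Math.Log(p):F4}");

// The uniform-guess baseline over a 27-token vocabulary.
double uniformBaseline = -Math.Log(1.0 / 27);
Console.WriteLine($"Uniform baseline: {uniformBaseline:F4}");
```

The baseline prints as 3.2958, the "~3.3 starting loss" the text refers to; -log(1/27) is the same number as log(27).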
Exercise: Run a Single Forward + Loss
Two numbers control the shape of everything that follows:
- embeddingSize = 16 - the size of each embedding vector. Every token and position is represented as a list of 16 numbers. Larger means the model can capture more nuance, but trains slower.
- maxSequenceLength = 8 - the maximum number of tokens the model can look at in one sequence. For our dataset of names, this covers most names (the first 7 characters plus BOS). Chapter 7 talks about what happens to longer names.
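Once those two numbers are fixed, the shape of every matrix follows, and so does the total parameter count of this minimal model:

```csharp
using System;

// Parameter count implied by the shapes in this chapter:
// tokenEmbeddings (27 x 16), positionEmbeddings (8 x 16), outputProjection (27 x 16).
int vocabSize = 27, embeddingSize = 16, maxSequenceLength = 8;

int tokenParams = vocabSize * embeddingSize;         // 27 * 16 = 432
int posParams   = maxSequenceLength * embeddingSize; //  8 * 16 = 128
int outParams   = vocabSize * embeddingSize;         // 27 * 16 = 432

Console.WriteLine($"Total learnable numbers: {tokenParams + posParams + outParams}"); // 992
```

Under a thousand parameters, which is why this model trains in seconds; the layers added in Chapter 11 are where the count grows.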
Create Chapter6Exercise.cs. This pulls everything together: embeddings, a forward pass, and a loss computation:
// --- Chapter6Exercise.cs ---
using static MicroGPT.Helpers;
namespace MicroGPT;
public static class Chapter6Exercise
{
public static void Run()
{
var random = new Random(42);
List<string> docs = Tokenizer.LoadDocs("input.txt", random);
var tokenizer = new Tokenizer(docs);
int embeddingSize = 16;
int maxSequenceLength = 8;
// Each row is a learned vector for one token (27 rows x 16 columns)
List<List<Value>> tokenEmbeddings = CreateMatrix(
random,
tokenizer.VocabSize,
embeddingSize
);
// Each row is a learned vector for one position (8 rows x 16 columns)
List<List<Value>> positionEmbeddings = CreateMatrix(
random,
maxSequenceLength,
embeddingSize
);
// Projects the embedding back to vocabulary size for prediction
List<List<Value>> outputProjection = CreateMatrix(
random,
tokenizer.VocabSize,
embeddingSize
);
// Run a single forward pass and compute the loss.
// We're pretending the correct next character after BOS is 'e' - the choice
// is arbitrary, just to demonstrate the loss formula.
List<Value> logits = Forward(
tokenizer.Bos,
0,
tokenEmbeddings,
positionEmbeddings,
outputProjection,
embeddingSize
);
List<Value> probabilities = Softmax(logits);
Value loss = -probabilities[tokenizer.Encode('e')].Log();
Console.WriteLine($"Loss: {loss.Data:F4}");
Console.WriteLine(
$"Predicted probability of 'e': {probabilities[tokenizer.Encode('e')].Data:F4}"
);
// With Random(42), these values are deterministic - you should see the same
// numbers every time. The loss should be around 3.3 (close to -log(1/27)),
// confirming the model is effectively guessing randomly at this point.
}
// Minimal forward pass: look up embeddings, add them, project to vocab size
private static List<Value> Forward(
int tokenId,
int posId,
List<List<Value>> tokenEmbeddings,
List<List<Value>> positionEmbeddings,
List<List<Value>> outputProjection,
int embeddingSize
)
{
List<Value> tokenEmbedding = tokenEmbeddings[tokenId];
List<Value> positionEmbedding = positionEmbeddings[posId];
var x = new List<Value>();
for (int i = 0; i < embeddingSize; i++)
{
x.Add(tokenEmbedding[i] + positionEmbedding[i]);
}
return Linear(x, outputProjection);
}
}
This returns 27 logits - one score per token in the vocabulary. Higher logit = the model thinks that token is more likely to come next. The loss is the negative log probability of the correct next token.
Uncomment the Chapter 6 case in the dispatcher in Program.cs:
case "ch6":
Chapter6Exercise.Run();
break;
Then run it:
dotnet run -- ch6