
Gary Jackson

Posted on • Originally published at garyjackson.dev

Chapter 4: The Bigram Model - Simplest Possible Language Model

What You'll Build

A character-level language model that predicts the next character based only on the current character. No neural network, no gradients, just counting. A "bigram" is a pair of consecutive tokens (like 'e' followed by 'm'), so a bigram model is one that learns which pairs occur most often.

Notice that this model uses plain double arrays, not Value objects. That's because there's nothing to train here: we're just counting occurrences in the data. Value only starts pulling its weight in Chapter 6, when we need to compute gradients and update parameters.

This file isn't used by the final GPT model. It's a baseline you can come back to later to see how much better the neural network does.

Depends On

Chapter 3 (tokenizer + dataset).

Why Start Here?

This model pins down the task clearly before introducing any complexity. The task is: given a sequence of tokens so far, predict what comes next. The bigram model does this in the simplest way imaginable. It counts how often each pair of consecutive characters appears, and uses those counts as probabilities.
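To make "counting pairs" concrete, here's a minimal standalone sketch of the idea, separate from the chapter's classes. The word "emma" and the Dictionary are purely illustrative; the real BigramModel uses a 2D array indexed by token ids instead.

```csharp
using System;
using System.Collections.Generic;

class BigramCountDemo
{
    static void Main()
    {
        // Count consecutive character pairs in one word.
        // "emma" yields the pairs (e,m), (m,m), (m,a).
        string word = "emma";
        var counts = new Dictionary<(char, char), int>();
        for (int i = 0; i < word.Length - 1; i++)
        {
            var pair = (word[i], word[i + 1]);
            counts[pair] = counts.GetValueOrDefault(pair) + 1;
        }
        foreach (var kv in counts)
            Console.WriteLine($"{kv.Key}: {kv.Value}");
    }
}
```

Run over thousands of names instead of one word, and these counts become the transition table the model samples from.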

Code

// --- BigramModel.cs ---

using System.Text;

namespace MicroGPT;

public class BigramModel
{
    private readonly double[,] _nextTokenProbs;
    private readonly Tokenizer _tokenizer;

    public BigramModel(List<string> docs, Tokenizer tokenizer)
    {
        _tokenizer = tokenizer;
        int vocabSize = tokenizer.VocabSize;

        // Count transitions: counts[i,j] = how often token j follows token i
        int[,] counts = new int[vocabSize, vocabSize];

        foreach (string doc in docs)
        {
            List<int> tokens = Tokenize(doc);
            for (int i = 0; i < tokens.Count - 1; i++)
            {
                counts[tokens[i], tokens[i + 1]]++;
            }
        }

        // Convert counts to probabilities
        _nextTokenProbs = new double[vocabSize, vocabSize];
        for (int i = 0; i < vocabSize; i++)
        {
            double rowSum = 0.0;
            for (int j = 0; j < vocabSize; j++)
            {
                rowSum += counts[i, j];
            }

            if (rowSum > 0)
            {
                for (int j = 0; j < vocabSize; j++)
                {
                    _nextTokenProbs[i, j] = counts[i, j] / rowSum;
                }
            }
        }
    }

    public string GenerateName(Random random, int maxLength = 20)
    {
        int token = _tokenizer.Bos;
        var name = new StringBuilder();
        for (int step = 0; step < maxLength; step++)
        {
            double r = random.NextDouble();
            double cumulative = 0;
            int next = _tokenizer.VocabSize - 1;
            for (int j = 0; j < _tokenizer.VocabSize; j++)
            {
                cumulative += _nextTokenProbs[token, j];
                if (r <= cumulative)
                {
                    next = j;
                    break;
                }
            }
            if (next == _tokenizer.Bos)
            {
                break;
            }

            name.Append(_tokenizer.Decode(next));
            token = next;
        }
        return name.ToString();
    }

    public void PrintSamples(int count, Random random)
    {
        Console.WriteLine("\n--- bigram samples ---");
        for (int s = 0; s < count; s++)
        {
            Console.WriteLine($"sample {s + 1, 2}: {GenerateName(random)}");
        }
    }

    /// <summary>
    /// Computes the average negative log probability across all documents.
    /// This is the loss baseline that our neural network should beat.
    /// </summary>
    public double ComputeLoss(List<string> docs)
    {
        double totalLoss = 0.0;
        int totalTokens = 0;
        foreach (string doc in docs)
        {
            List<int> tokens = Tokenize(doc);
            for (int i = 0; i < tokens.Count - 1; i++)
            {
                double p = _nextTokenProbs[tokens[i], tokens[i + 1]];
                // A pair never seen during training has p == 0, which would give
                // -log(0) = +infinity. We skip the loss contribution (but still count
                // the token in the denominator), which slightly flatters the baseline.
                if (p > 0)
                {
                    totalLoss += -Math.Log(p);
                }

                totalTokens++;
            }
        }
        return totalLoss / totalTokens;
    }

    private List<int> Tokenize(string doc)
    {
        var tokens = new List<int> { _tokenizer.Bos };
        tokens.AddRange(doc.Select(_tokenizer.Encode));
        tokens.Add(_tokenizer.Bos);
        return tokens;
    }
}

Why -log for the Loss?

ComputeLoss uses -Math.Log(p) where p is the probability the model assigned to the correct next token. Why that formula?

  • If the model assigns probability 1.0 to the correct token (perfect prediction), -log(1.0) = 0. Zero loss.
  • If it assigns 0.5 (coin flip), -log(0.5) = 0.69. Moderate loss.
  • If it assigns 0.01 (nearly wrong), -log(0.01) = 4.6. High loss.
  • If it assigns 0 (completely wrong), -log(0) is infinity.

So -log converts a probability into a "how surprised was the model" score, where lower is better. The average of this score across all tokens in the dataset is the loss. Chapter 6 covers this in more detail when the neural network needs to compute it.
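The bullet points above can be reproduced with a few lines of code. This snippet is just an illustration of the formula; note that -log(p) is written as log(1/p), which is mathematically identical and avoids printing a negative zero for p = 1.0.

```csharp
using System;
using System.Globalization;

class SurpriseDemo
{
    static void Main()
    {
        // Use invariant culture so decimals always print with '.'
        CultureInfo.CurrentCulture = CultureInfo.InvariantCulture;

        // -log(p) as a "surprise" score: certain -> 0, unlikely -> large.
        foreach (double p in new[] { 1.0, 0.5, 0.01 })
            Console.WriteLine($"p = {p}: -log(p) = {Math.Log(1.0 / p):F2}");
    }
}
```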

How Generation Works

The GenerateName method is worth understanding because the transformer model in Chapters 11-12 generates names the same way. The only thing that changes is where the probabilities come from.

The strategy is:

  1. Start with BOS. Just like training data wraps each name with BOS on both sides, generation starts from BOS and asks "what's the first character?"

  2. Look up the probability row. _nextTokenProbs[token, ...] is a row of 27 numbers that sum to 1.0. Each number says how likely that token is to come next. For example, after BOS the row might say a=0.08, b=0.04, ..., z=0.01, reflecting how often each letter starts a name in the training data.

  3. Pick a token using weighted random sampling. Roll a random number between 0 and 1. Think of it as a "dart throw" onto a number line. Each token's probability determines how wide its section of the line is. The loop walks the line by accumulating probabilities until the running total crosses the dart.

Here's a worked example. Say BOS is the current token, and the next-token probabilities for the first three tokens are [a=0.3, b=0.5, c=0.2]:

   |-------- a --------|--------------- b ---------------|------ c ------|
   0                   0.3                               0.8            1.0

If the random number is r = 0.6, the loop accumulates:

  • After a: cumulative = 0.3. Is 0.6 <= 0.3? No.
  • After b: cumulative = 0.8. Is 0.6 <= 0.8? Yes, pick b.

Token b gets picked 50% of the time because its slice covers half the line. Token a gets picked 30% of the time, c gets 20%. The loop always starts from the first token, but the random number decides how far along the line we go before stopping.

  4. Repeat or stop. If the picked token is BOS, the model is saying "this name is done" and we stop. Otherwise we append the character, make it the new current token, and go back to step 2.

The bigram model's probabilities come from counting pairs in the training data. The transformer's probabilities will come from a neural network's forward pass. But the generation loop (start at BOS, sample from probabilities, stop at BOS) stays exactly the same.

Exercise: Run the Bigram Model

Create Chapter4Exercise.cs:

// --- Chapter4Exercise.cs ---

namespace MicroGPT;

public static class Chapter4Exercise
{
    public static void Run()
    {
        var random = new Random(42);
        var docs = Tokenizer.LoadDocs("input.txt", random);
        var tokenizer = new Tokenizer(docs);

        Console.WriteLine($"num docs: {docs.Count}");
        Console.WriteLine($"vocab size: {tokenizer.VocabSize}");

        var bigram = new BigramModel(docs, tokenizer);
        bigram.PrintSamples(20, random);
        Console.WriteLine($"Bigram loss: {bigram.ComputeLoss(docs):F4}");
        // Expect something around 2.45 - our neural network should beat this.
    }
}

Uncomment the Chapter 4 case in the dispatcher in Program.cs:

case "ch4":
    Chapter4Exercise.Run();
    break;

Then run it:

dotnet run -- ch4

What You'll See

The output will be names, but they won't be very good. The model knows that certain characters tend to follow others (like 'a' often follows 'k'), but it has no concept of what happened two characters ago. No memory beyond the immediate predecessor. Run it and see for yourself. The results are deterministic with the seeded random, so you'll get the same output every time.

Key Takeaway

The bigram model pins down the task (next-token prediction) and the metric (negative log probability of the correct next token). Everything from Chapter 5 onwards is about doing this same task better, using a neural network that can consider more context.
