What You'll Build
A Tokenizer class that converts between characters and integer IDs, plus a special BOS (Beginning of Sequence) token.
Depends On
Nothing. This chapter is independent of Chapters 1-2. It sits here because the next chapter (Bigram) needs it.
Why Characters?
Production LLMs use subword tokenizers (like BPE) that merge frequent character sequences into single tokens, giving them a vocabulary of ~100K tokens. MicroGPT uses the simplest possible tokenizer: one token per unique character. For a dataset of names, that gives us 26 lowercase letters plus one special token, for a vocabulary of 27.
The integer values themselves are arbitrary. Token 0 being 'a' and token 25 being 'z' is just convention. They might as well be emoji. Each token is simply a distinct symbol.
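To make that concrete, here is a small standalone sketch (not part of MicroGPT's Tokenizer) showing that two different ID assignments, sorted and reversed, both round-trip a string perfectly. All names here are invented for the demo:

```csharp
// Sketch: any one-to-one char <-> int mapping round-trips the same way.
using System;
using System.Linq;

class IdDemo
{
    static void Main()
    {
        char[] sorted   = "abc".ToCharArray();        // 'a'=0, 'b'=1, 'c'=2
        char[] reversed = sorted.Reverse().ToArray(); // 'a'=2, 'b'=1, 'c'=0

        foreach (char[] vocab in new[] { sorted, reversed })
        {
            // Encode then decode "cab" with this vocabulary.
            var ids = "cab".Select(c => Array.IndexOf(vocab, c)).ToList();
            string back = string.Concat(ids.Select(i => vocab[i]));
            Console.WriteLine($"ids: [{string.Join(", ", ids)}] -> {back}");
        }
    }
}
```

Both vocabularies assign different integers to each character, yet both recover "cab" exactly. The model only ever sees distinct symbols; which integer names which symbol never matters.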
Code
// --- Tokenizer.cs ---
namespace MicroGPT;

public class Tokenizer
{
    private readonly List<char> _allChars;

    public int Bos { get; }       // Beginning of Sequence token ID
    public int VocabSize { get; } // total number of unique tokens

    public Tokenizer(List<string> docs)
    {
        _allChars = [.. string.Join("", docs).Distinct().OrderBy(c => c)];
        Bos = _allChars.Count;           // e.g., 26 if a-z are 0-25
        VocabSize = _allChars.Count + 1; // 27 total
    }

    public int Encode(char c) => _allChars.IndexOf(c);

    public char Decode(int i) => i == Bos ? '.' : _allChars[i]; // display BOS as '.'

    /// <summary>
    /// Loads documents from a text file, one per line, shuffled.
    /// </summary>
    public static List<string> LoadDocs(string path, Random random)
    {
        return
        [
            .. File.ReadAllLines(path)
                .Select(l => l.Trim())
                .Where(l => !string.IsNullOrEmpty(l))
                .OrderBy(_ => random.Next()),
        ];
    }
}
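One design note: Encode uses List.IndexOf, a linear scan over the vocabulary. At 27 tokens that cost is negligible, but with a ~100K-token vocabulary you would precompute a reverse map. Here is a hedged sketch of that variant; the class name FastTokenizer and the field _charToId are my own, not part of this chapter's code:

```csharp
// Sketch: O(1) encoding via a precomputed reverse map (hypothetical variant,
// not the chapter's Tokenizer).
using System;
using System.Collections.Generic;
using System.Linq;

class FastTokenizer
{
    private readonly List<char> _allChars;
    private readonly Dictionary<char, int> _charToId;

    public FastTokenizer(IEnumerable<string> docs)
    {
        _allChars = string.Join("", docs).Distinct().OrderBy(c => c).ToList();
        _charToId = _allChars
            .Select((c, i) => (c, i))
            .ToDictionary(p => p.c, p => p.i); // char -> ID lookup in O(1)
    }

    public int Encode(char c) => _charToId[c]; // dictionary lookup instead of IndexOf
    public char Decode(int i) => _allChars[i]; // list indexing is already O(1)
}

class Demo
{
    static void Main()
    {
        var t = new FastTokenizer(new[] { "emma" }); // vocab: a=0, e=1, m=2
        Console.WriteLine(t.Encode('m'));            // prints 2
        Console.WriteLine(t.Decode(0));              // prints a
    }
}
```

The behavior is identical for characters that exist in the vocabulary; the one observable difference is that IndexOf returns -1 for an unknown character while the dictionary throws, which is arguably the better failure mode.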
The BOS Token
BOS stands for "Beginning of Sequence". It's token ID 26 (one past the last letter) and it doesn't exist as a character in the training data. It's purely synthetic, acting as a delimiter that tells the model "a new document starts here" and "the current document ends here". During training, each document (e.g. the name "emma") gets wrapped with BOS on both sides:
[BOS, e, m, m, a, BOS]
The model learns that BOS initiates a new name, and that producing BOS means "I'm done". When we need to display it, Decode renders it as '.' (a dot). The choice of dot is arbitrary. It just needs to be something visually distinct from the letters. You'll see it in the model's output later when a generated name ends: emma.
Exercise: Verify Tokenization
Create Chapter3Exercise.cs:
// --- Chapter3Exercise.cs ---
namespace MicroGPT;

public static class Chapter3Exercise
{
    public static void Run()
    {
        var random = new Random(42);
        List<string> docs = Tokenizer.LoadDocs("input.txt", random);
        var tokenizer = new Tokenizer(docs);

        Console.WriteLine($"num docs: {docs.Count}");
        Console.WriteLine($"vocab size: {tokenizer.VocabSize}");

        // Encode a name
        string name = "emma";
        List<int> encoded = name.Select(tokenizer.Encode).ToList();
        Console.WriteLine($"Encoded: [{string.Join(", ", encoded)}]"); // [4, 12, 12, 0]

        // Decode it back
        string decoded = string.Join("", encoded.Select(tokenizer.Decode));
        Console.WriteLine($"Decoded: {decoded}"); // emma

        // A full document with BOS on both sides
        var tokens = new List<int> { tokenizer.Bos };
        tokens.AddRange(name.Select(tokenizer.Encode));
        tokens.Add(tokenizer.Bos);
        Console.WriteLine($"With BOS: [{string.Join(", ", tokens)}]"); // [26, 4, 12, 12, 0, 26]
    }
}
Uncomment the Chapter 3 case in the dispatcher in Program.cs:
case "ch3":
    Chapter3Exercise.Run();
    break;
Then run it:
dotnet run -- ch3