LLM Study Diary #2: Tokenization

Background

I did some research online and found a nice course that teaches how to build an LLM from scratch. The course is shared publicly online, and all the assignment resources are here: https://cs336.stanford.edu/. In this series, I will post my summary and notes starting from lesson 1.

Tokenization

Tokenization is the very first step in an LLM pipeline. There are many different tokenization algorithms, such as Character-based Tokenization, Byte-based Tokenization, Word-based Tokenization, and Byte Pair Encoding (BPE).

Character-based Tokenization
Pros: Simple to define by mapping characters to code points.
Cons: Uses the vocabulary inefficiently, since many code points are rare, and the compression ratio is poor compared to more advanced methods.
Byte-based Tokenization
Pros: Uses a very small, fixed vocabulary (the 256 possible byte values, indices 0-255), avoiding sparsity issues.
Cons: Leads to very long sequences because the compression ratio is effectively 1:1 (one byte per token), which makes model training computationally expensive due to the quadratic nature of attention.
Word-based Tokenization
Pros: Captures semantic units by splitting strings on whitespace or with a regex.
Cons: Results in an unbounded vocabulary size; it struggles with rare or unseen words, often necessitating an "UNK" (unknown) token which creates significant challenges for model training and evaluation.
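
To make these trade-offs concrete, here is a minimal sketch of the three baselines on a toy string. The token mappings are illustrative, not tied to any particular library:

```python
text = "hello world"

# Character-based: one token per Unicode code point.
char_tokens = [ord(c) for c in text]

# Byte-based: one token per UTF-8 byte; the vocabulary is always just 0-255.
byte_tokens = list(text.encode("utf-8"))

# Word-based: split on whitespace; any word not seen in training needs "UNK".
word_tokens = text.split()

print(char_tokens)   # [104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]
print(byte_tokens)   # same ids here since the text is ASCII; non-ASCII differs
print(word_tokens)   # ['hello', 'world']
```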

BPE

BPE strikes the best balance among these approaches. Here is how it works:

  1. Convert to Bytes: First, represent the input string as a sequence of bytes (integers). This ensures every character, even rare ones, can be represented.
  2. Count Frequencies: Scan the entire corpus to count the frequency of all adjacent pairs of bytes or existing tokens.
  3. Merge the Most Frequent: Identify the pair that appears most often and merge it into a single new token. Add this new token to your vocabulary.
  4. Repeat: Repeat the process of counting and merging for a set number of iterations or until a desired vocabulary size is reached. This process allows the model to adaptively represent common sequences as single tokens and rare ones as multiple smaller components.
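
Here is a minimal, untuned sketch of that training loop in Python. The function name and structure are my own, so treat it as an illustration rather than a reference implementation:

```python
from collections import Counter

def train_bpe(text: str, num_merges: int):
    # Step 1: represent the input as raw UTF-8 bytes (integers 0-255).
    tokens = list(text.encode("utf-8"))
    merges = {}        # (left, right) -> new token id, in the order learned
    next_id = 256      # byte values already occupy ids 0-255

    for _ in range(num_merges):
        # Step 2: count the frequency of every adjacent token pair.
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # Step 3: pick the most frequent pair and record it as a new token.
        best = max(pairs, key=pairs.get)
        merges[best] = next_id
        # Rewrite the sequence, replacing every occurrence of that pair.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(next_id)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        next_id += 1   # Step 4: repeat until num_merges merges are learned

    return tokens, merges

tokens, merges = train_bpe("the cat sat on the mat", num_merges=5)
print(len("the cat sat on the mat".encode("utf-8")), "bytes ->", len(tokens), "tokens")
print(merges)
```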

Key Takeaways:

Efficiency: BPE is effective because it learns the statistics of your specific dataset, rather than relying on predefined word boundaries.
Robustness: Unlike word-based tokenization, BPE handles unknown or rare words gracefully because it can always fall back to individual bytes or smaller sub-word units, avoiding the need for an "UNK" token (see the encoder sketch below).
Historical Context: Originally a data compression algorithm from 1994, it was adopted for NLP to improve neural machine translation and eventually became a standard backbone for models like GPT-2 and beyond.
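
To see the robustness point in code, a toy encoder can reuse the hypothetical `merges` dictionary learned in the training sketch above. A word never seen during training still encodes cleanly:

```python
def encode(text: str, merges: dict) -> list[int]:
    # Any string, including words never seen in training, starts as raw bytes,
    # so there is never an out-of-vocabulary failure.
    tokens = list(text.encode("utf-8"))
    # Apply the learned merges in training order (dicts preserve insertion order).
    for pair, new_id in merges.items():
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                merged.append(new_id)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

# An unseen word falls back to raw bytes; no "UNK" token is needed.
print(encode("zxqv", merges))   # -> [122, 120, 113, 118]
```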
