Tokens - the Language of AI

#ai #llm #tokenize

Hundreds of countries in the world, speaking thousands of languages - how can AI understand them all?

Large Language Models (LLM's) - the brains of modern Artificial Intelligence (AI) - speak in numbers. You can think of these numbers as a compressed form of human language - each token represents one or more letters in some human language (there are also special tokens - we'll get to that later). For example, the sentence "Hello World!", might be translated to the tokens [1989, 31337]. However, LLM's are no different than people - there are multiple tokenization methods, meaning different models have different tokens to represent the same text, as depicted in the image above.

Tokenization - the process of translating words into tokens - is the first step LLM's take to process their inputs. Before they can think and generate answers, they need to understand it in their own "language".

To understand how tokenization works, we first need to understand its goals, which include:

Structure: a single text can have multiple parts. For example, there can be multiple speakers in a single text, namely, the assistant and the user. However, LLM's only get as input a single series of numbers, therefore we need to make it easy for the LLM to understand their structure.
Uniform: computers encode text in weird ways. For example, the word café can be represented as c+a+f+é, but it can also be represented as c+a+f+e+´, and we want the LLM to treat those two as the same word.
Fast: users want answers, and they want them now. Tokenization can't take forever - the process needs to work as fast as possible.
Efficient: as we mentioned, tokens are a way to compress texts. Since LLM's only speak in tokens, they can't have millions of different ones - that would make them wildly inefficient. We need to make do with the least amount possible.

After understanding the above, we are perfectly poised to learn the full process of tokenization!

First, we encode the "added tokens". Those are signals to the AI, helping it add structure and meaning to different parts of the tokenized sequence. Each LLM has its own set of added tokens, though common ones indicate system instructions, user input, and start/end of attached pictures and documents. This first part finds the added tokens, replaces them (and only them) with their appropriate representations, and thus splits the text into different parts, to be tokenized separately.

Now that the text has an initial structure, we need to "normalize" it, so that different representations of the same word (as mentioned earlier - c+a+f+é vs c+a+f+e+´) will be encoded into the same tokens. For those who want to read further on the topic, this technique is called Unicode Normalization.

After normalizing the text, we are almost ready to start converting the text to tokens. However, tokenization is an expensive process, and we don't want to run it on a huge blob of text. Therefore, we first need to break it down into manageable chunks, and this is done by "pre-tokenizing" the input. This step makes tokenizing faster by splitting the large text into subsequences (e.g. using regex), and varies by tokenizer.

We are ready to do the actual tokenization step! There are several ways of doing this, and we'll focus on a method called Byte-Pair Encoding (BPE). The goal is to compress each subsequence into the least amount of tokens possible - the way is by each time merging a pair that, when combined, still has a single token representing them (if multiple merges are possible - the best one is selected according to its prevalence in the LLM's training data). In the example above, the best pair is "ma". After forming "ma", the next best pair is "ny". Following that, by scanning through the sequence again, we can see that "ma"+"ny"="many", so we combine them into a single token. Now, since "<space>many" isn't very prevalent in the training data, it didn't get its own token, and so the BPE is finished. We now have the final representations - 1289 for "<space>", and 81923 for "many".

Tokenization has another step - "post-processing". We won't elaborate on this step since it varies significantly between different models and use cases. Put simply, it either injects additional special tokens or adds metadata to the encoding object (the representation of the output of the tokenizer).

Now we finally have our tokens - and the LLM can get to work!

Top comments (1)

Dani • Mar 16

🚀🚀🚀