
Boussaden Taha


Tokens

Introduction

Although we interact with LLMs using natural language, these models never process raw text directly. Before a prompt reaches the model, it is converted into a sequence of tokens, the fundamental units that the model operates on.

Tokenization is one of the earliest stages of the inference pipeline and influences everything from context windows and API pricing to latency and memory usage.


What Is a Token?

A token is the smallest unit of text processed by a language model. It is not necessarily a word. Depending on the tokenizer, a token may represent:

  • an entire word
  • part of a word
  • punctuation
  • whitespace
  • numbers
  • symbols
  • emojis

Different models use different tokenizers, so the same text may be split differently depending on the model.
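To make this concrete, here is a minimal sketch with two made-up tokenizers applied to the same sentence. Both `word_level` and `char_level` are hypothetical helpers written for illustration; neither corresponds to the tokenizer of any real model.

```python
import re

text = "I have no enemies."

def word_level(text: str) -> list[str]:
    # Hypothetical tokenizer A: words and punctuation marks, whitespace dropped.
    return re.findall(r"\w+|[^\w\s]", text)

def char_level(text: str) -> list[str]:
    # Hypothetical tokenizer B: every character, including spaces, is a token.
    return list(text)

print(word_level(text))       # ['I', 'have', 'no', 'enemies', '.']
print(len(word_level(text)))  # 5
print(len(char_level(text)))  # 18
```

The same 18-character sentence costs 5 tokens under one scheme and 18 under the other, which is why token counts cannot be compared across models without knowing the tokenizer.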


Why Tokens?

The short answer is that language models operate on numbers, not text. Before the transformer can perform any computation, the input must be converted into a numerical representation.

The preprocessing pipeline looks like this:

Raw Text
    │
    ▼
Tokenizer
    │
    ▼
Tokens
    │
    ▼
Token IDs
    │
    ▼
Embedding Layer
    │
    ▼
Embedding Vectors
    │
    ▼
Transformer

The tokenizer splits the input into tokens, and each token is mapped to a unique integer called a token ID. The token IDs are then passed through the model's embedding layer, which converts them into dense vectors. These vectors are the actual input to the transformer.
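The first two steps of this pipeline (text to tokens, tokens to IDs) can be sketched in a few lines. The vocabulary below is invented for illustration, and the greedy longest-match split is a simplification; production tokenizers typically use learned schemes such as byte-pair encoding.

```python
# Hypothetical toy vocabulary; real vocabularies are far larger.
VOCAB = {"I": 0, " have": 1, " no": 2, " en": 3, "emies": 4, ".": 5}

def tokenize(text: str) -> list[str]:
    """Split text by greedy longest match against VOCAB (a simplification)."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate substring first, then shrink it.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in VOCAB:
                tokens.append(piece)
                pos = end
                break
        else:
            raise ValueError(f"no vocabulary entry covers position {pos}")
    return tokens

def encode(text: str) -> list[int]:
    """Map each token to its integer ID."""
    return [VOCAB[token] for token in tokenize(text)]

print(tokenize("I have no enemies."))  # ['I', ' have', ' no', ' en', 'emies', '.']
print(encode("I have no enemies."))    # [0, 1, 2, 3, 4, 5]
```

Because this toy vocabulary has no entry for " enemies", the word is split into two sub-word tokens, which is how a tokenizer handles text it has no single token for.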


A Real Example

Instead of using hypothetical examples, let's look at how OpenAI's tokenizer processes text.

Input:

I have no enemies.

OpenAI tokenizes it to:

["I", " have", " no", " enemies", "."]

with the following token IDs:

[40, 679, 860, 33974, 13]

These IDs were generated by the OpenAI Tokenizer for the "GPT-5.x & O1/3" models.

The transformer never sees the original sentence. It only receives the corresponding sequence of token IDs.
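A short sketch using only the tokens and IDs listed above shows two details worth noticing: the leading spaces belong to the tokens, and the token count differs from the word count. The `id_to_token` mapping is built from this single example for illustration and is not the tokenizer's real vocabulary.

```python
# Tokens and IDs copied from the example above.
tokens = ["I", " have", " no", " enemies", "."]
token_ids = [40, 679, 860, 33974, 13]

# Leading spaces are part of the tokens, so plain concatenation
# reproduces the input exactly.
text = "".join(tokens)

# Illustrative ID -> token lookup covering only these five entries.
id_to_token = dict(zip(token_ids, tokens))
decoded = "".join(id_to_token[i] for i in token_ids)

print(decoded)                                             # I have no enemies.
print(len(text.split()), "words,", len(tokens), "tokens")  # 4 words, 5 tokens
```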


Token IDs

After tokenization, every token is replaced with an integer.

Conceptually:

" have"        → 679
" no"          → 860
" enemies"     → 33974
...

The exact numbers differ between models because each tokenizer has its own vocabulary. These integers are not meaningful by themselves. They simply act as indices into the model's embedding table.
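The "index into a table" idea can be sketched as a plain list lookup. The table size, the vector dimension, and the random values below are all assumptions chosen for illustration; in a real model the table has one row per vocabulary entry and its values are learned during training.

```python
import random

VOCAB_SIZE = 6  # hypothetical vocabulary size
EMBED_DIM = 4   # hypothetical vector dimension

rng = random.Random(0)

# One row per token ID, filled with random placeholder values.
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]

def embed(token_ids: list[int]) -> list[list[float]]:
    # The ID is used only as a row index; its numeric value carries no meaning.
    return [embedding_table[token_id] for token_id in token_ids]

vectors = embed([0, 1, 2, 1])
print(len(vectors), "vectors of dimension", len(vectors[0]))  # 4 vectors of dimension 4
print(vectors[1] == vectors[3])                               # True
```

The same ID always selects the same row, so the second and fourth vectors are identical here.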


From Tokens to Predictions

Once the token IDs have been converted into embeddings, the transformer begins inference.

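At each generation step, the model predicts a probability distribution over its vocabulary for the next token.

A token is then chosen from that distribution, for example by taking the most likely token or by sampling. The chosen token is appended to the existing sequence, and the process repeats until a stopping condition is reached, such as an end-of-sequence token or a maximum length.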

Prompt
   │
   ▼
Tokenizer
   │
   ▼
Token IDs
   │
   ▼
Embeddings
   │
   ▼
Transformer
   │
   ▼
Predict Next Token
   │
   ▼
Append Token
   │
   └──────────────┐
                  ▼
              Repeat

This autoregressive loop is how every modern decoder-only LLM generates text.
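The loop itself is small enough to sketch. Everything model-specific below is a stand-in: `NEXT` is a hypothetical lookup table that depends only on the last token, whereas a real transformer conditions on the whole sequence, and `EOS` is an assumed end-of-sequence ID. Greedy selection is used here; sampling is a common alternative.

```python
# Hypothetical next-token distributions, keyed by the last token ID only.
NEXT = {
    0: {1: 0.9, 2: 0.1},
    1: {2: 0.7, 3: 0.3},
    2: {3: 0.6, 4: 0.4},
    3: {4: 1.0},
}
EOS = 4  # assumed end-of-sequence token ID

def predict_next(ids: list[int]) -> dict[int, float]:
    # Stand-in for the transformer: returns a distribution over next tokens.
    return NEXT[ids[-1]]

def generate(prompt_ids: list[int], max_new_tokens: int = 10) -> list[int]:
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = predict_next(ids)
        next_id = max(probs, key=probs.get)  # greedy: pick the most likely token
        ids.append(next_id)                  # append and feed back in
        if next_id == EOS:                   # stopping condition
            break
    return ids

print(generate([0]))                    # [0, 1, 2, 3, 4]
print(generate([0], max_new_tokens=2))  # [0, 1, 2]
```

The two stopping conditions shown, an end-of-sequence token and a cap on new tokens, are the usual ones. The rest of the loop is just "predict, pick, append".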


Try It Yourself

OpenAI Tokenizer: https://platform.openai.com/tokenizer
