Aivan Carlos Tuquero

Breaking Down Tokenization in LLMs: How AI Reads Your Words

When you interact with Large Language Models (LLMs) like GPT-5, LLaMA, or Claude, it feels like you’re sending sentences and paragraphs. But under the hood, the model doesn’t “see” text the way we do.

Instead, everything you type is broken down into tokens—tiny units that sit at the heart of how LLMs process, generate, and price their outputs.

In this post, we’ll dive deep into tokenization in machine learning, why it matters for developers, and how you can actually experiment with tokenizers using tools like tiktokenizer.


What Is Tokenization?

Tokenization is the process of splitting text into smaller units called tokens, which are then mapped to numerical IDs and embedded into vectors for processing.

For us, words are natural chunks of meaning.
For machines, tokens are the bridge between raw text and mathematical computation.

Example (GPT-style tokenizer):

Text: "Machine learning is amazing!"
Tokens: ["Machine", " learning", " is", " amazing", "!"]
Enter fullscreen mode Exit fullscreen mode

Each token maps to an integer ID:

```
[1234, 5678, 90, 4321, 999]
```

Those integers are what the model actually processes.
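
To make the pipeline concrete, here's a minimal sketch of the text → tokens → IDs mapping. The vocabulary below is invented to match the illustrative IDs above; real tokenizers learn vocabularies with tens of thousands of entries.

```python
# Toy text -> tokens -> IDs mapping. The vocabulary and IDs are
# made up for illustration; real vocabularies hold ~50k-100k entries.
vocab = {"Machine": 1234, " learning": 5678, " is": 90, " amazing": 4321, "!": 999}

tokens = ["Machine", " learning", " is", " amazing", "!"]
ids = [vocab[t] for t in tokens]
print(ids)  # [1234, 5678, 90, 4321, 999]
```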


Why Tokenization Matters in LLMs

Tokenization isn’t just a preprocessing detail—it has major real-world implications:

  1. Context Window Limitations – Models can only handle a certain number of tokens in memory (e.g., 4k, 32k, or even 1M tokens).
  • Your 10,000-word novel draft might not fit, because tokenization can expand the input.
  2. Pricing & API Costs – Most LLM providers charge per token, not per word (see the cost-estimation sketch after this list).
  • A “short” email could be 60 words but 90+ tokens depending on how it’s split.
  • Optimizing your prompts can save serious money.
  3. Language & Domain Efficiency – Tokenization efficiency differs by language and domain:
  • English words often tokenize cleanly.
  • Agglutinative languages (like Turkish or Finnish) can explode into many tokens.
  • Programming code tends to be token-heavy (function() {} often becomes multiple tokens).
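
Since billing is per token, it pays to count tokens before sending a prompt. Here's a small sketch using OpenAI's tiktoken library; the price constant is hypothetical, so check your provider's actual rates.

```python
# Requires: pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-era encoding

prompt = "Summarize the following email in two sentences: ..."
n_tokens = len(enc.encode(prompt))

PRICE_PER_1K_INPUT = 0.01  # hypothetical $/1K input tokens; check your provider
print(f"{n_tokens} tokens ≈ ${n_tokens / 1000 * PRICE_PER_1K_INPUT:.5f}")
```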

Tokenization Techniques in LLMs

Let’s explore the main strategies used in modern NLP:

1. Word-Based Tokenization (Old School)

Splits on spaces/punctuation.

  • Example: "Machine learning rocks"["Machine", "learning", "rocks"].
  • Problem: Vocabulary explosion (“running”, “runs”, “ran” are all separate).
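
A naive word-based tokenizer is essentially a one-line regex. This is a sketch of the idea, not any production tokenizer:

```python
import re

# Split into runs of word characters or single punctuation marks;
# whitespace is discarded entirely.
def word_tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize("Machine learning rocks"))  # ['Machine', 'learning', 'rocks']
print(word_tokenize("running runs ran"))        # three separate vocabulary entries
```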

2. Subword Tokenization (BPE, WordPiece, Unigram LM)

Breaks text into subwords, balancing vocabulary size and generalization.

  • Example: "unhappiness" → ["un", "happi", "ness"]
  • Used in BERT and early Transformer models.
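
You can see WordPiece in action through Hugging Face's transformers library; the exact split depends on the learned vocabulary, so treat the output as illustrative:

```python
# Requires: pip install transformers
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
# WordPiece marks subword continuations with "##".
print(tok.tokenize("unhappiness"))
```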

3. Byte-Pair Encoding (BPE, GPT-2/GPT-3)

Starts at character level, merges frequently co-occurring pairs.

  • Handles rare words better.
  • Example: "programming"["program", "ming"].

4. Byte-Level Tokenization (GPT-2 and beyond)

Works at the raw byte level (UTF-8).

  • Advantages: Handles emojis, accented characters, and rare text seamlessly.
  • Example: "🐱"["🐱"] instead of breaking into unknown tokens.

5. Character-Level Tokenization (Rare for LLMs)

Splits into every character.

  • Example: "AI"["A", "I"].
  • Downsides: Very long sequences → inefficient for LLMs.
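
The sequence-length blowup is easy to check for yourself:

```python
text = "I love programming in Python"
chars = list(text)  # every character becomes its own token
print(len(chars))   # 28 symbols for a five-word sentence
```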

Comparing Tokenizers on the Same Text

Let’s take the sentence:

```
I love programming in Python 🐍
```

  • Word-based: ["I", "love", "programming", "in", "Python", "🐍"]
  • BPE (GPT-2): ["I", " love", " program", "ming", " in", " Python", " 🐍"]
  • Character-level: ["I", " ", "l", "o", "v", "e", " ", "p", "r", ...]

Notice how BPE splits “programming” into “program” + “ming,” while emojis remain intact under byte-level tokenization.

Here's what the same sentence looks like through the popular cl100k_base encoding (the tokenizer used by GPT-4 and GPT-3.5 Turbo).
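
A minimal sketch to reproduce this with OpenAI's tiktoken library; the exact splits depend on the encoding, and the emoji may span several byte-level tokens:

```python
# Requires: pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("I love programming in Python 🐍")
pieces = [enc.decode_single_token_bytes(i) for i in ids]  # raw bytes per token
print(len(ids), pieces)
```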


Why Developers Should Care About Tokenization

As a developer building on top of LLMs, tokenization directly affects your work:

  1. API Usage & Billing – Every token counts toward input/output cost.
  2. Prompt Design – Understanding tokens helps you craft concise, efficient prompts.
  3. Language Support – Tokenization efficiency varies across languages; testing is essential.
  4. Optimization – Preprocessing your text (compressing JSON, trimming whitespace) can reduce token usage by 10–20%; a sketch follows this list.
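
As one example of that optimization, compact JSON serialization alone can shave tokens off every request. A sketch, assuming the cl100k_base encoding:

```python
import json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

data = {"user": {"name": "Ada", "role": "admin"}, "items": [1, 2, 3]}

pretty = json.dumps(data, indent=4)                # human-friendly
compact = json.dumps(data, separators=(",", ":"))  # token-friendly

print(len(enc.encode(pretty)), "tokens (pretty)")
print(len(enc.encode(compact)), "tokens (compact)")
```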

So, run your prompts through a tokenizer first. It’ll save you money, prevent cutoff issues, and give you a clearer picture of how your app interacts with LLMs.
