If you’ve been building with Large Language Models (LLMs), integrating APIs, or just messing around with prompt engineering, you’ve hit the word token a million times.
You know it’s the unit you get billed for. You know it’s the thing that fills up your "context window." But how does it actually work under the hood?
If you think LLMs read text word-by-word like humans, or character-by-character like traditional code compilers, think again. Let's pull back the curtain on tokenization and see what’s really going on when you hit "Send."
What Exactly is a Token?
To an AI, a token is the fundamental building block of language.
LLMs don't understand English, Python, or JavaScript directly. Instead, they run raw text through a processing step called tokenization, which chops strings into smaller pieces. A token can be a single character, a part of a word (sub-word), an entire word, or a punctuation mark, and it often carries the space that precedes it.
Here is a quick rule of thumb for English text:
- 1 token ≈ 4 characters
- 1 token ≈ 0.75 words
- 100 words ≈ 130–140 tokens
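To make the rule of thumb concrete, here is a minimal sketch that applies the two ratios above to a string. The function name `rule_of_thumb_estimates` is made up for this example, and the numbers are rough estimates for English text, not real tokenizer output.

```python
import math

def rule_of_thumb_estimates(text: str) -> dict:
    """Rough token estimates for English text using the two ratios above."""
    chars = len(text)
    words = len(text.split())
    return {
        "by_chars": math.ceil(chars / 4),     # 1 token is roughly 4 characters
        "by_words": math.ceil(words / 0.75),  # 1 token is roughly 0.75 words
    }

print(rule_of_thumb_estimates("Learning AI is fun!"))
# {'by_chars': 5, 'by_words': 6}
```

The two estimates disagree even on a short sentence, which is a good reminder that these ratios are for ballpark budgeting only.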
But things get weird when you look closer. Let's see how an AI tokenizer actually splits a sentence.
Tokenization in Action
Take a simple sentence like: "Learning AI is fun!"
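A typical LLM tokenizer (like OpenAI's cl100k_base used for GPT-4) doesn't simply see four distinct words. The exact boundaries depend on the tokenizer's learned vocabulary (a common word such as "Learning" may well be stored as a single token), but an illustrative split looks like this: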
| Fragment | Token Type | Reason |
|---|---|---|
| Learn | Sub-word | The root of the word |
| ing | Suffix | Common sub-word ending |
| AI | Space + Word | The space before a word is grouped with it |
| is | Space + Word | The leading space is part of the token |
| fun | Space + Word | The leading space is part of the token |
| ! | Punctuation | Standard punctuation gets its own token |
In this split, a 4-word sentence becomes 6 tokens.
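If you want to play with the idea without installing anything, the sketch below reproduces the split from the table with a toy tokenizer. Everything here is an assumption for illustration: `TOY_VOCAB` is a hand-written six-entry vocabulary, and greedy longest-match is a simplification, not the BPE merge procedure real tokenizers use.

```python
# Hypothetical vocabulary mirroring the table above. Real vocabularies are
# learned from data and hold tens of thousands of entries.
TOY_VOCAB = ["Learn", "ing", " AI", " is", " fun", "!"]

def greedy_tokenize(text: str, vocab: list[str]) -> list[str]:
    """Split text by repeatedly taking the longest vocabulary entry that matches."""
    by_length = sorted(vocab, key=len, reverse=True)
    tokens = []
    pos = 0
    while pos < len(text):
        match = next((p for p in by_length if text.startswith(p, pos)), None)
        if match is None:
            match = text[pos]  # nothing matched: fall back to one character
        tokens.append(match)
        pos += len(match)
    return tokens

pieces = greedy_tokenize("Learning AI is fun!", TOY_VOCAB)
print(pieces)       # ['Learn', 'ing', ' AI', ' is', ' fun', '!']
print(len(pieces))  # 6
```

Notice that the spaces never show up as tokens of their own; they ride along at the front of the word that follows them.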
The Developer's Gotcha: White Space and Code
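Because spaces are often baked into the tokens themselves, formatting affects your token count. In programming languages like Python, where indentation defines scope, every level of indentation adds whitespace that has to be tokenized too. How much it costs depends on the tokenizer: older vocabularies such as GPT-2's encode runs of spaces inefficiently, while newer ones such as cl100k_base include dedicated tokens for runs of spaces.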
# This code block uses more tokens than you might think because
# the indentation whitespace is tokenized too.
def hello_world():
    print("Hello, World!")
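To get a feel for how much of a snippet is indentation, here is a small sketch that counts leading spaces. The two "policies" in it are deliberately simplified assumptions (one token per space versus one token per run of spaces); real tokenizers land somewhere around or between these, and only a real tokenizer can give you the true count.

```python
SNIPPET = 'def hello_world():\n    print("Hello, World!")\n'

def indentation_stats(source: str) -> dict:
    """Count leading spaces per line and price them under two toy policies."""
    lines = source.splitlines()
    leading = [len(line) - len(line.lstrip(" ")) for line in lines]
    return {
        "indent_chars": sum(leading),
        # Toy policy A: every indentation space costs one token.
        "tokens_if_one_per_space": sum(leading),
        # Toy policy B: each run of indentation costs one token.
        "tokens_if_one_per_run": sum(1 for n in leading if n > 0),
    }

print(indentation_stats(SNIPPET))
# {'indent_chars': 4, 'tokens_if_one_per_space': 4, 'tokens_if_one_per_run': 1}
```

On a two-line function the gap is tiny, but on a deeply nested file with hundreds of lines the difference between the two policies adds up quickly.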
Why Don't We Just Use Whole Words?
It seems like an extra step, so why do AI researchers rely on sub-word tokenization instead of a massive dictionary of whole words?
1. The "Out of Vocabulary" (OOV) Problem
If an LLM only recognized whole words, what happens when a user makes a typo, mentions a brand new framework name, or uses internet slang (like rizz)? A whole-word model would have to map each of these to a generic "unknown" token and lose the information. By using sub-words (like breaking ungettable into un + get + table), the model can still represent, and often make sense of, words it has never seen before.
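Here is a minimal sketch of that difference. Both vocabularies are tiny and hypothetical, and the sub-word splitter uses a greedy longest-prefix match with a single-character fallback, which is a stand-in for what a trained tokenizer does, not a faithful copy of it.

```python
WORD_VOCAB = {"the", "model", "is", "get", "table"}     # hypothetical whole-word vocabulary
SUBWORD_VOCAB = {"un", "get", "table", "model", "is"}   # hypothetical sub-word pieces

def word_level(word: str) -> list[str]:
    """Whole-word lookup: anything unseen collapses to <unk>."""
    return [word] if word in WORD_VOCAB else ["<unk>"]

def subword_level(word: str) -> list[str]:
    """Greedy longest-prefix split, falling back to single characters."""
    pieces, pos = [], 0
    while pos < len(word):
        for end in range(len(word), pos, -1):
            if word[pos:end] in SUBWORD_VOCAB:
                break
        else:
            end = pos + 1  # no known piece starts here: emit one character
        pieces.append(word[pos:end])
        pos = end
    return pieces

print(word_level("ungettable"))     # ['<unk>']
print(subword_level("ungettable"))  # ['un', 'get', 'table']
print(subword_level("rizz"))        # ['r', 'i', 'z', 'z']
```

The slang word costs more tokens than a common word would, but nothing is thrown away, which is the whole point.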
2. Computational Efficiency
English has hundreds of thousands of words, and far more distinct strings once you count tenses, plurals, names, and jargon. Giving every one of them its own mathematical identity would make the model's vocabulary, and the embedding table that goes with it, massive and sluggish. With a fixed vocabulary of roughly 50,000 to 100,000 sub-word tokens, the model can assemble practically any word from smaller pieces, much like building from a bucket of Lego bricks.
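A quick back-of-the-envelope calculation shows why vocabulary size matters. The embedding table needs one vector per vocabulary entry, so its parameter count is vocabulary size times vector width. The width of 4096 and the one-million-entry whole-word vocabulary below are assumptions picked for illustration, not figures from any particular model.

```python
def embedding_parameters(vocab_size: int, embedding_dim: int) -> int:
    """Parameters in an embedding table: one vector of embedding_dim per token."""
    return vocab_size * embedding_dim

EMBEDDING_DIM = 4096  # assumed vector width

for label, vocab_size in [("sub-word", 100_000), ("whole-word", 1_000_000)]:
    params = embedding_parameters(vocab_size, EMBEDDING_DIM)
    print(f"{label}: {params:,} parameters")
# sub-word: 409,600,000 parameters
# whole-word: 4,096,000,000 parameters
```

Under these assumptions the whole-word table alone is ten times larger, and most of its rows would belong to rare words the model barely sees during training.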
3. Turning Text into Vectors
Computers only process numbers. Tokenization is the bridge. Once text is split into tokens, each unique token is mapped to a specific integer ID.
- Learn might be ID 4321
- ing might be ID 128
These IDs are then converted into high-dimensional vectors (embeddings) so the LLM can run matrix multiplications over them to predict the next token.
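The two lookups described above (token to ID, then ID to vector) can be sketched in a few lines. The IDs are the made-up ones from the list above, the vectors are random placeholders, and four dimensions is an assumption to keep the output readable; a trained model learns these values and uses far wider vectors.

```python
import random

TOKEN_TO_ID = {"Learn": 4321, "ing": 128}  # illustrative IDs from the text
EMBEDDING_DIM = 4                          # assumed, tiny on purpose

rng = random.Random(0)  # seeded so the placeholder vectors are reproducible
# One placeholder vector per token ID; a real model learns these in training.
EMBEDDINGS = {
    token_id: [rng.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)]
    for token_id in TOKEN_TO_ID.values()
}

def embed(tokens: list[str]) -> tuple[list[int], list[list[float]]]:
    """Map tokens to integer IDs, then IDs to their embedding vectors."""
    ids = [TOKEN_TO_ID[token] for token in tokens]
    vectors = [EMBEDDINGS[token_id] for token_id in ids]
    return ids, vectors

ids, vectors = embed(["Learn", "ing"])
print(ids)  # [4321, 128]
print(len(vectors), len(vectors[0]))  # 2 4
```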
The Context Window Budget
Every LLM has a context window (e.g., 8k, 32k, or even 1M+ tokens). Think of this as the model's short-term working memory. When you chat with a bot built on a stateless chat API, the entire history of your conversation is bundled up and sent back to the API with every single new prompt, and the model's reply has to fit in the same window. If your conversation history hits 4,000 tokens and the model's limit is 4,000, there is no room left for a reply: either the request is rejected, or the application has to "forget" the oldest messages at the top of the chat to make space.
For developers, managing this budget is critical. Techniques like retrieval-augmented generation (RAG, typically backed by a vector database), conversation summarization, and aggressive trimming of system prompts all aim to keep token costs low and keep your application from hitting the context limit.
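The simplest version of that budgeting is a sliding window: keep the newest messages and drop the oldest until the total fits. The sketch below assumes messages are plain strings and estimates tokens with the 4-characters rule of thumb; `trim_history` and `estimate_tokens` are names invented for this example, and in production you would swap in a real tokenizer for the estimate.

```python
import math

def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token, at least 1."""
    return max(1, math.ceil(len(text) / 4))

def trim_history(messages: list[str], budget: int) -> list[str]:
    """Keep the most recent messages whose estimated total fits the budget."""
    kept = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(message)
        if used + cost > budget:
            break  # this message and everything older gets dropped
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept

history = ["a" * 40, "b" * 40, "c" * 40]  # about 10 estimated tokens each
trimmed = trim_history(history, budget=25)
print(len(trimmed))  # 2 (the oldest message was dropped)
```

A real application would usually pin the system prompt so it never gets trimmed, and reserve part of the budget for the model's reply.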
Want to Test It Yourself?
If you are writing backend code or optimizing prompts, don't guess your token counts. You can experiment with official tokenizer tools to see exactly how your text is being sliced:
- OpenAI Tokenizer: An interactive web tool showing how text translates to token IDs.
- Tiktoken (Python): A fast BPE tokenizer library you can integrate into your Python backends to count tokens locally before hitting an API.
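import tiktoken
# cl100k_base is the encoding used by GPT-4
encoding = tiktoken.get_encoding("cl100k_base")
tokens = encoding.encode("Learning AI is fun!")
print(f"Token Count: {len(tokens)}")
# Decode each ID on its own to see exactly where the text was split
print([encoding.decode([token]) for token in tokens])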
Understanding tokens is the first step toward writing more cost-efficient prompts, building better AI apps, and understanding why models behave the way they do.
Over to you: Have you run into any weird bugs or massive cloud bills because of unexpected token usage? Let's talk about it in the comments below!