TLDR: A token is a subword chunk, typically 3 to 4 characters in English. The algorithm that splits your text into those chunks started as a file compression trick in 1994 and was adapted for neural machine translation in 2016. The tokens it produces are now the billing unit for every major AI API. Each provider uses a different tokenizer, so the same prompt can produce meaningfully different token counts across GPT, Claude, Gemini, and Llama. Understanding this is not optional if you are building production AI systems.
Where Tokens Actually Came From
Most developers learn about tokens through an API bill. That is the wrong starting point.
The algorithm behind modern LLM tokenization is called Byte Pair Encoding (BPE). Philip Gage introduced it in 1994 in a paper titled “A New Algorithm for Data Compression,” published in The C Users Journal. The idea was simple: scan a binary file for the most frequent pair of adjacent bytes, replace that pair with a single unused byte, repeat until the file is as small as possible.
That is it. A compression trick. Nothing to do with language models.
In 2016, Rico Sennrich, Barry Haddow, and Alexandra Birch published “Neural Machine Translation of Rare Words with Subword Units.” Their core insight was that the same iterative merge logic that compressed bytes could compress characters into subwords for machine translation. Instead of shrinking a file, you were shrinking a vocabulary, and you could represent rare words without an “unknown token” fallback.
That paper became the foundation of how every major LLM today handles text.
How BPE Tokenization Actually Works
Start with individual characters. Find the most frequent pair. Merge them into one token. Repeat until you hit your target vocabulary size.
The key insight: common sequences earn their own token slot through frequency alone. No linguistic rules, no hand-crafted vocabulary. Just counting and merging, over and over.
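The counting-and-merging loop can be sketched in a few lines of Python. This is a toy illustration, not how any production tokenizer is implemented: the function names (`most_frequent_pair`, `merge_pair`, `train_bpe`) are made up for this example, the corpus is split on whitespace, and ties between equally frequent pairs are broken by whichever pair was seen first.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    # Returns None when no word has two or more symbols left to merge
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules, starting from single characters."""
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words


merges, words = train_bpe("low low low lower lowest", num_merges=2)
print(merges)
print(words)
```

On this tiny corpus, two merges are enough for "low" to become a single symbol, purely because its characters co-occur more often than anything else.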
When GPT-2 shipped, OpenAI extended this to byte-level BPE: the base vocabulary is all 256 possible bytes, which means no text is ever truly “unknown.” Any Unicode input can be represented as a sequence of existing tokens. That was a meaningful leap.
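To see why a 256-byte base vocabulary leaves nothing "unknown," it helps to look at the bytes themselves. The sketch below shows only the starting point before any merges are applied, not what a trained tokenizer actually emits; `to_base_tokens` is a hypothetical helper name.

```python
def to_base_tokens(text: str) -> list[int]:
    """Map text to its UTF-8 bytes; every value falls in the range 0..255."""
    return list(text.encode("utf-8"))


for sample in ["hello", "merhaba", "你好", "🙂"]:
    ids = to_base_tokens(sample)
    print(f"{sample!r}: {len(sample)} chars -> {len(ids)} base tokens {ids}")
```

ASCII text starts at one byte per character, while Chinese characters and emoji start at three or four bytes each. Learned merges then collapse frequent byte sequences into single tokens, which is why scripts that are underrepresented in the training corpus tend to stay fragmented.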
The result is that common English words become single tokens. Rare words, code identifiers, and non-Latin scripts split into multiple tokens. The word “tokenization” is two tokens in OpenAI’s cl100k encoding. The Turkish word for “hello” (merhaba) is three.
Why Tokens Became the Unit of AI Economics
Language models process text as a flat sequence of integers, one per token. The transformer’s attention mechanism scales quadratically with sequence length, which means doubling the sequence roughly quadruples the attention compute. Pricing in tokens is not arbitrary: it tracks the amount of work the model has to do.
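Quadratic scaling is easier to feel with numbers. The sketch below counts only the entries of one full self-attention score matrix; it ignores heads, layers, constant factors, and any optimized attention variant a provider may use, so treat it as an illustration of the growth rate rather than a cost model. `attention_pairs` is a hypothetical helper.

```python
def attention_pairs(seq_len: int) -> int:
    """Number of query-key score entries in one full self-attention matrix."""
    return seq_len * seq_len


base = attention_pairs(1_000)
for n in (1_000, 2_000, 4_000, 8_000):
    ratio = attention_pairs(n) // base
    print(f"{n:>6} tokens -> {attention_pairs(n):>12,} scores ({ratio}x)")
```

Each doubling of the sequence length multiplies the score count by four.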
Context windows have grown from 512 tokens in GPT-1 to 10 million tokens in models like Llama 4 Scout today. That is roughly a 20,000x increase in working memory. But bigger windows are not free. A 1 million token context at Claude’s pricing costs more per call than most developers expect the first time they hit it.
There is also a performance ceiling. A 2023 paper by Liu et al. established what practitioners now call the “lost in the middle” problem: LLMs attend best to content at the start and end of a prompt. Information buried in the center of a long context is processed less reliably, even when it technically fits within the window. Context count is not the same as context quality.
Why Different Providers Count Differently
Every major provider maintains its own tokenizer, trained on its own corpus, with its own vocabulary size and merge rules. The same sentence produces different token counts depending on where you send it.
| Provider | Tokenizer | Vocab Size | Offline? |
|---|---|---|---|
| OpenAI (GPT-4o) | tiktoken (o200k_base) | 200,000 | Yes |
| OpenAI (GPT-4) | tiktoken (cl100k_base) | 100,256 | Yes |
| Meta (Llama 3) | tiktoken-compatible | 128,000 | Yes |
| Google (Gemini) | SentencePiece | ~256,000 | API only |
| Anthropic (Claude 3+) | Proprietary BPE | Undisclosed | API only |
| Mistral | SentencePiece | 32,000 | Yes |
The critical column is the last one. Anthropic does not ship a local tokenizer for Claude 3 and later models. The only way to get a ground-truth token count for Claude is through the count_tokens API endpoint, which is free but requires a network call. When that is not suitable, tiktoken works as an approximation with roughly 5 to 10 percent variance. It is a different tokenizer, so treat the result as an estimate rather than a billing figure.
The Rules That Actually Matter in Production
Rule 1: 1 token is roughly 4 English characters or 0.75 words.
This is the standard rule of thumb for English. Divide your word count by 0.75 to estimate tokens quickly: a 750-word prompt is about 1,000 tokens.
Rule 2: Non-English text multiplies your token budget.
Non-Latin scripts, especially Arabic, Chinese, and Thai, can produce 2 to 3x more tokens per word compared to English. If your product handles Arabic input and you priced it against English benchmarks, your cost model is wrong.
Rule 3: Code tokenizes differently than prose.
Code identifiers, whitespace, and symbols split unpredictably. A 200-line Python file and 200 lines of prose will not cost the same number of tokens, so measure code with the actual tokenizer instead of a prose rule of thumb.
Rule 4: Count the full message, not just the user prompt.
System prompts, tool schemas, function signatures, and conversation history all count against your context window and your bill. Developers who only count user input are usually undercounting by 20 to 40 percent.
Rule 5: Fit matters more than size.
A 1 million token window does not mean you should use 900,000 of it. Attention degrades at scale, latency grows, and cost grows linearly. The right context is the minimum that gives the model what it needs.
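Rules 4 and 5 can be combined into one small budgeting helper: estimate everything that goes into the request, then drop the oldest turns until it fits. This is a minimal sketch under stated assumptions. It uses the 4-characters-per-token heuristic from Rule 1, ignores per-message formatting overhead that real APIs add, and the function names and the flat list-of-strings shape for tool schemas and history are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    # Rough English heuristic from Rule 1: about 4 characters per token
    return max(1, len(text) // 4)


def request_tokens(system: str, tool_schemas: list[str],
                   history: list[str], user: str) -> int:
    """Estimate tokens for everything sent in one request, not just the user prompt."""
    parts = [system, *tool_schemas, *history, user]
    return sum(estimate_tokens(part) for part in parts)


def trim_history(system: str, tool_schemas: list[str],
                 history: list[str], user: str, budget: int) -> list[str]:
    """Drop the oldest turns until the estimated request fits within `budget` tokens."""
    kept = list(history)
    while kept and request_tokens(system, tool_schemas, kept, user) > budget:
        kept.pop(0)  # the oldest turn is removed first
    return kept
```

Dropping the oldest turn first is only one policy. Summarizing old turns or retrieving only the relevant ones are common alternatives, and for anything billing-sensitive the estimator should be swapped for the provider's real tokenizer.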
Counting Tokens in Code
For OpenAI-compatible models, tiktoken is the fastest option and works fully offline:
```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 and Claude approximation
text = "Your prompt goes here"
token_count = len(enc.encode(text))
print(f"{token_count} tokens")
```
For Claude’s exact count (requires a network call, but it is free):
```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.count_tokens(
    model="claude-opus-4-6",
    system="You are a helpful assistant.",
    messages=[{"role": "user", "content": "Your prompt goes here"}],
)
print(f"{response.input_tokens} tokens")
```
For a zero-dependency approximation that is easy to port to any programming language:
```python
def estimate_tokens(text: str) -> int:
    # 4 characters per token is the standard English rule of thumb
    return max(1, len(text) // 4)
```
What This Means for How You Build
Tokens are not a detail. They are the constraint that determines whether your RAG pipeline fits in a request, whether your agent loop runs for 5 turns or 50, and whether your API bill is $200 or $2,000 at scale.
BPE began in 1994 as a compression algorithm designed to do one thing: make redundant patterns smaller. That is still exactly what it does inside every LLM. Text in, integer sequence out, attention over that sequence, text back out.
Knowing the shape of that pipeline does not make you a researcher. It makes you a builder who does not get surprised by a $4,000 invoice.
Key references:
- Gage, P. (1994). “A New Algorithm for Data Compression.” The C Users Journal.
- Sennrich, R., Haddow, B., & Birch, A. (2016). “Neural Machine Translation of Rare Words with Subword Units.” ACL 2016.
- Liu, N., et al. (2023). “Lost in the Middle: How Language Models Use Long Contexts.” arXiv:2307.03172.
- OpenAI tiktoken: github.com/openai/tiktoken
- Anthropic Token Counting API: docs.anthropic.com/en/docs/build-with-claude/token-counting

