
Michael Ross


Your LLM can't read. Here's the weird trick it uses instead

Here's a fact that breaks people's mental model of large language models the first time they really sit with it:

A language model never sees your words. Not one. It sees numbers — and only numbers.

When you type Hello, world into ChatGPT, the model on the other end isn't reading English. By the time your text reaches the neural network, it's been chopped into chunks called tokens and each chunk has been swapped for an integer ID. The model is, underneath all the magic, a very expensive function that maps integers to integers. The "intelligence" is what happens in between.

Let's actually look at it.

See it for yourself (5 lines of Python)

# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the GPT-4 era tokenizer
ids = enc.encode("Hello, world")
print(ids)                       # -> [9906, 11, 1917]
print([enc.decode([i]) for i in ids])  # -> ['Hello', ',', ' world']

Three tokens. Hello is one. The comma is its own token. And world? It comes through as ' world' with the leading space baked in. That space is part of the token. This is not a rounding error; it's central to how the whole thing works.

So what is a token?

A token is a frequent chunk of text. Not always a word, not always a letter — whatever the tokenizer found useful while it was trained on a mountain of text. Common words become single tokens. Rare words get shattered into pieces:

for word in ["playing", "tokenization", "antidisestablishmentarianism"]:
    print(word, "->", [enc.decode([i]) for i in enc.encode(word)])

# playing                      -> ['playing']
# tokenization                 -> ['token', 'ization']
# antidisestablishmentarianism -> ['ant', 'idis', 'establish', 'ment', 'arian', 'ism']

playing is so common it earns a single ID. tokenization splits into two. The long one gets diced into six. This is Byte Pair Encoding, an intimidating name for a refreshingly simple idea: start with the smallest units (single characters, or raw bytes in tiktoken's case), then repeatedly glue together the most common neighboring pair until you've built a vocabulary of ~50k–100k chunks. Frequent stuff ends up whole; rare stuff stays in pieces. Every model family ships with its own frozen vocabulary, which is why a token count from one model doesn't necessarily transfer to another.
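If you want to watch that merge loop happen, here's a minimal toy sketch. It is not how tiktoken is actually implemented: it works on characters rather than bytes, skips the pre-splitting step real tokenizers use, and `train_bpe` is a name made up for this example.

```python
from collections import Counter

def train_bpe(text, num_merges):
    """Toy BPE: start from single characters, repeatedly merge the most frequent adjacent pair."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair in the current token sequence.
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so no merge is worth learning
        merges.append((left, right))
        # Rewrite the sequence, gluing each occurrence of the winning pair.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

text = "low lower lowest low low"
tokens, merges = train_bpe(text, num_merges=3)
print(merges)
print(tokens)  # -> ['low', ' low', 'e', 'r', ' low', 'e', 's', 't', ' low', ' low']
```

After only three merges the frequent chunk "low" is whole, the rare endings are still loose letters, and ' low' has picked up its leading space all on its own, simply because "space then low" is a common neighboring pair in this text.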

The gotcha that costs you money

Here's the part that bites people in production: you are billed in tokens, and your context window is measured in tokens — not characters, not words. And tokens are sneakier than they look.

print(len(enc.encode("123456789")))   # -> 3   (numbers split oddly)
print(len(enc.encode("   ")))          # -> 1   (whitespace is real)
print(len(enc.encode("hello")))        # -> 1
print(len(enc.encode(" hello")))       # -> 1, but a DIFFERENT id than "hello"

A few consequences that trip people up:

  • Numbers don't tokenize the way you'd guess. A long ID or a big number can eat more tokens than the English sentence around it. If you're stuffing logs, UUIDs, or JSON into a prompt, your token count balloons.
  • "hello" and " hello" are different tokens. Leading spaces matter. This is why few-shot prompt formatting is weirdly fiddly: the model genuinely sees "Q:" and " Q:" (with a leading space) as different starts.
  • Your prompt is longer than your intuition says. A "short" 280-character message is usually ~70–90 tokens, but throw in code, punctuation, or non-English text and that ratio gets worse fast.

The practical move: count tokens before you send, not after you get the bill. len(enc.encode(prompt)) is the cheapest cost estimate you'll ever write, and it's also how you stop blowing past a context window at the worst possible moment.
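Here's one way that pre-flight check could look, as a rough sketch. Everything in it is an assumption for illustration: `check_budget`, its parameters, and the price are made up, and the stand-in encoder just splits on whitespace so the snippet runs without downloading a vocabulary. In real use you'd pass `enc.encode` from the tiktoken examples earlier, plus your provider's actual context size and pricing.

```python
from typing import Callable, Sequence

def check_budget(prompt: str,
                 encode: Callable[[str], Sequence[int]],
                 context_window: int,
                 reserved_for_reply: int = 0,
                 price_per_1k_tokens: float = 0.0) -> dict:
    """Count prompt tokens, check them against the window, and estimate input cost."""
    n_tokens = len(encode(prompt))
    # Leave room for the model's answer; it shares the same window.
    available = context_window - reserved_for_reply
    return {
        "tokens": n_tokens,
        "fits": n_tokens <= available,
        "estimated_cost": n_tokens / 1000 * price_per_1k_tokens,
    }

def fake_encode(text: str) -> list:
    # Hypothetical stand-in encoder: one "token" per whitespace-separated word.
    return list(range(len(text.split())))

report = check_budget(
    "count tokens before you send",
    fake_encode,
    context_window=8,         # made-up tiny window for the demo
    reserved_for_reply=4,     # made-up reply budget
    price_per_1k_tokens=0.5,  # made-up price
)
print(report)
```

Taking the encoder as a parameter keeps the check independent of any one tokenizer, which matters because counts from one vocabulary don't carry over to another. Treat the result as an estimate: chat-style APIs typically add a few tokens of per-message overhead on top of the raw text.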

Why this matters beyond trivia

Almost every confusing LLM behavior has a tokenization fingerprint on it:

  • Models are weirdly bad at counting letters in a word ("how many r's in strawberry?") — because they never saw the letters, they saw a couple of tokens.
  • Non-English languages can cost 2–3× more tokens for the same meaning, because the vocabulary was trained heavy on English.
  • Prompt-injection and jailbreak tricks often lean on unusual token boundaries.
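One contributing factor behind the non-English cost is visible without any tokenizer at all. Assuming a byte-level tokenizer (tiktoken works on UTF-8 bytes), non-Latin text starts out as more raw units per character before a single merge is learned. The sketch below only shows that starting point, not final token counts, and `utf8_profile` is a name made up for this example.

```python
def utf8_profile(text):
    """Return (characters, UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))

samples = {
    "English": "hello",
    "Russian": "привет",
    "Japanese": "こんにちは",
    "Emoji": "👍",
}

for label, text in samples.items():
    chars, nbytes = utf8_profile(text)
    print(f"{label:9} chars={chars} bytes={nbytes}")

# English   chars=5 bytes=5
# Russian   chars=6 bytes=12
# Japanese  chars=5 bytes=15
# Emoji     chars=1 bytes=4
```

A vocabulary trained mostly on English spends its merges gluing English bytes back together, so the other scripts stay closer to these raw byte counts. It is a starting-point view rather than the whole story, but it makes the 2–3× figure feel a lot less mysterious.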

Once you can see the tokens, a lot of "why is the model doing that?" turns into "oh, of course it's doing that."


This is the first idea in a 10-part plain-English series I've been writing on how LLMs actually work under the hood — embeddings, attention, KV cache, quantization, RAG, the whole stack, no math degree required. If this scratched an itch, the full write-up with diagrams lives here: How Language Becomes Numbers.

Now I'm genuinely curious: what's the weirdest tokenization edge case you've hit in production? Emoji that exploded into six tokens, a regex that broke on token boundaries, a non-English prompt that quietly 3×'d your bill? Drop it in the comments — I collect these.
