You see words. A language model sees tokens — chunks of text, usually a few characters each. Everything starts here. Day 2 of my AIFromZero series.
Text gets shattered into tokens
"unbelievable" → ["un", "bel", "iev", "able"] (4 tokens, not 1 word, not 12 letters)
Before any "thinking", your text is chopped into tokens and each becomes a number the model processes.
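To make that concrete, here is a minimal sketch of a tokenizer. It assumes a tiny made-up vocabulary (`TOY_VOCAB` and its IDs are hypothetical) and uses greedy longest-match splitting, which is a simplification: real tokenizers usually learn their pieces with algorithms such as byte-pair encoding.

```python
# Toy vocabulary: hypothetical pieces and IDs, chosen only for illustration.
TOY_VOCAB = {"un": 0, "bel": 1, "iev": 2, "able": 3, "straw": 4, "berry": 5}


def tokenize(text, vocab):
    """Split text into vocabulary pieces using greedy longest-match."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate piece first, then shrink it.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            raise ValueError(f"no vocabulary piece matches at position {i}")
    return tokens


tokens = tokenize("unbelievable", TOY_VOCAB)
ids = [TOY_VOCAB[t] for t in tokens]
print(tokens)  # ['un', 'bel', 'iev', 'able']
print(ids)     # [0, 1, 2, 3]
```

The list of IDs, not the letters, is what the model actually receives.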
Why not words or letters?
- Letters: too fine — the model would relearn spelling everywhere.
- Whole words: too many — millions, plus every typo and name.
- Subword tokens: the sweet spot. Common words = 1 token; rare words split into reusable pieces. A fixed vocabulary (around 100k tokens in many current models) can cover any text, because anything unfamiliar falls back to smaller pieces.
The ~4-chars rule (and why it costs you)
In English, ~4 characters ≈ 1 token, or ~0.75 words per token. This is how everything is priced and limited:
- API bills are per token (prompt + reply).
- A "context window" (how much text the model can read at once) is measured in tokens: 1,000 tokens ≈ 750 words.
Verbose prompts and long chat history burn tokens. Concise prompting is a real cost lever.
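A rough way to see the cost effect is to estimate tokens from character counts. This is only a sketch of the ~4-characters rule of thumb: the function names and the `price_per_1k_tokens` parameter are assumptions for illustration, real prices vary by provider, and an exact count needs the model's own tokenizer.

```python
import math


def estimate_tokens(text, chars_per_token=4.0):
    """Rough token estimate from the ~4 characters per token rule of thumb."""
    return math.ceil(len(text) / chars_per_token)


def estimate_cost(prompt, reply, price_per_1k_tokens):
    """Estimated bill for one call: prompt tokens plus reply tokens.

    price_per_1k_tokens is a hypothetical flat rate, not a real price.
    """
    total_tokens = estimate_tokens(prompt) + estimate_tokens(reply)
    return total_tokens / 1000 * price_per_1k_tokens


verbose = ("Could you please, if it is not too much trouble, "
           "provide me with a summary of the following text?")
concise = "Summarize this text:"

# The verbose prompt needs several times more tokens for the same request.
print(estimate_tokens(verbose), estimate_tokens(concise))
```

Because chat history is resent with every request, the same estimate applies to each earlier turn kept in the conversation.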
The strawberry problem
The model never sees s-t-r-a-w-b-e-r-r-y. It sees tokens like straw + berry (the exact split depends on the tokenizer). The individual letters are buried inside tokens, so counting characters is genuinely hard for it. It's not dumb; it just doesn't read letters.
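Here is a small sketch of why that is. It assumes the straw + berry split and uses made-up token IDs (`TOY_VOCAB` is hypothetical), so treat it as an illustration rather than what any particular model does.

```python
# Hypothetical pieces and IDs for illustration only.
TOY_VOCAB = {"straw": 4, "berry": 5}
ID_TO_PIECE = {token_id: piece for piece, token_id in TOY_VOCAB.items()}

word = "strawberry"
pieces = ["straw", "berry"]            # assumed split
ids = [TOY_VOCAB[p] for p in pieces]

# With characters in hand, counting letters is trivial.
print(word.count("r"))                 # 3

# The model's input is just IDs; nothing in [4, 5] spells anything.
print(ids)                             # [4, 5]

# Recovering the count requires mapping each ID back to its spelling,
# a lookup the model has to learn implicitly rather than read directly.
print(sum(ID_TO_PIECE[i].count("r") for i in ids))  # 3
```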
Tokens are step 1 of everything
Tokenize → turn each token into a vector (embeddings, next) → run through the transformer → predict the next token. Every LLM starts exactly here.
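The shape of that pipeline can be sketched in a few lines. Everything here is a stand-in: the vocabulary is made up, the embeddings are random instead of learned, and `stand_in_transformer` simply averages vectors where a real model runs many attention layers. Only the order of the steps matches the real thing.

```python
import random

VOCAB = ["un", "bel", "iev", "able"]   # hypothetical toy vocabulary
DIM = 4                                # embedding size, chosen arbitrarily

rng = random.Random(0)
# One random vector per token ID; a trained model learns these values.
EMBEDDINGS = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in VOCAB]


def embed(ids):
    """Step 2: look up one vector per token ID."""
    return [EMBEDDINGS[i] for i in ids]


def stand_in_transformer(vectors):
    """Step 3 placeholder: average the vectors into one hidden state."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(DIM)]


def predict_next(ids):
    """Step 4: score every vocabulary entry and pick the highest."""
    hidden = stand_in_transformer(embed(ids))
    scores = [sum(h * e for h, e in zip(hidden, emb)) for emb in EMBEDDINGS]
    return max(range(len(VOCAB)), key=lambda i: scores[i])


ids = [0, 1, 2]                        # step 1 output: "un", "bel", "iev"
next_id = predict_next(ids)
print(next_id, VOCAB[next_id])
```

With random embeddings the predicted token is meaningless; training is what makes the scores useful.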
🔤 Type anything and watch it tokenize live: https://dev48v.infy.uk/ai/days/day2-tokens.html