LLMs don't see letters or even words — they see tokens: chunks of text mapped to integer IDs. Once you get tokenization, a dozen confusing things about LLMs suddenly make sense (cost, context limits, why "strawberry" trips them up).
🔤 Type and watch text become tokens: https://dev48v.infy.uk/ai/days/day12-tokenization.html
A token is usually a subword
Common words stay whole ("the"); rare or long ones split into pieces ("token" + "ization"). This subword approach, most commonly byte-pair encoding (BPE), lets a fixed vocabulary represent any word, even one it has never seen, by falling back to smaller pieces.
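To make the merging idea concrete, here is a minimal sketch of BPE training on a tiny made-up corpus. It is a toy, not any production tokenizer: it starts from single characters (real tokenizers typically start from bytes), and the function names and sample words are assumptions chosen for illustration.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Each word starts as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges

def tokenize(word, merges):
    """Split a word into subword tokens by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Hypothetical sample corpus: "the" is frequent, so it gets merged into one token.
sample_words = ["the"] * 10 + ["then"] * 2 + ["token"] * 3 + ["tokenization"]
merges = learn_bpe_merges(sample_words, num_merges=2)

print(tokenize("the", merges))    # frequent word: a single token
print(tokenize("zebra", merges))  # unseen word: falls back to smaller pieces
```

A word that never appeared in training still tokenizes, because any piece that matches no merge rule simply stays as individual characters.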
Text → tokens → IDs
Each token maps to an integer from the model's vocabulary. That integer sequence is literally all the model ever receives. Leading spaces and capitalization change the tokens — "Hello" and " hello" are different tokens.
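The mapping from text to integers can be sketched with a toy vocabulary. Everything here is hypothetical: the six-entry vocabulary and its IDs are invented, and the lookup uses simple greedy longest-match rather than real BPE merge rules. It only illustrates that a leading space or a capital letter produces a different ID.

```python
# Hypothetical toy vocabulary; real ones hold tens of thousands of entries.
TOY_VOCAB = ["Hello", "hello", " hello", " world", "!", " "]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(TOY_VOCAB)}
ID_TO_TOKEN = {i: tok for tok, i in TOKEN_TO_ID.items()}

def encode(text):
    """Turn text into token IDs using greedy longest-match against the toy vocabulary."""
    ids, pos = [], 0
    while pos < len(text):
        match = None
        # Try the longest remaining substring first, then shrink.
        for end in range(len(text), pos, -1):
            if text[pos:end] in TOKEN_TO_ID:
                match = text[pos:end]
                break
        if match is None:
            raise ValueError(f"no token in toy vocabulary for {text[pos:]!r}")
        ids.append(TOKEN_TO_ID[match])
        pos += len(match)
    return ids

def decode(ids):
    """Turn token IDs back into text by concatenating their token strings."""
    return "".join(ID_TO_TOKEN[i] for i in ids)

print(encode("Hello world!"))   # [0, 3, 4]
print(encode(" hello world!"))  # [2, 3, 4]
print(decode([0, 3, 4]))        # Hello world!
```

The two inputs differ only in a leading space and one capital letter, yet their first IDs are different. The model receives only the ID lists, never the characters.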
Why it matters
- Cost and context limits are measured in tokens (roughly 4 characters of English text each), not words.
- Letter-counting mistakes and odd emoji/code splits are tokenization artifacts.
- Different models use different tokenizers.
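The 4-characters-per-token rule of thumb is enough for a quick budget check. The sketch below is an estimate only: the helper names, the context limit, and the price are hypothetical placeholders, and an exact count requires running the specific model's own tokenizer.

```python
import math

CHARS_PER_TOKEN = 4  # rough average for English text; varies by language and tokenizer

def estimate_tokens(text, chars_per_token=CHARS_PER_TOKEN):
    """Estimate the token count from the character count, rounding up."""
    return math.ceil(len(text) / chars_per_token)

def fits_in_context(text, context_limit_tokens):
    """Check whether the estimated token count is within a given context limit."""
    return estimate_tokens(text) <= context_limit_tokens

def estimate_cost(text, price_per_million_tokens):
    """Estimate cost given a price per one million tokens (placeholder value)."""
    return estimate_tokens(text) * price_per_million_tokens / 1_000_000

prompt = "Explain tokenization in one paragraph."
print(estimate_tokens(prompt))
print(fits_in_context(prompt, context_limit_tokens=8000))    # hypothetical limit
print(estimate_cost(prompt, price_per_million_tokens=2.0))   # hypothetical price
```

Code, emoji, and non-English text often use more tokens per character than this estimate assumes, so leave headroom.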
🔨 Build it (tiktoken / 🤗 tokenizers: encode → IDs → count → decode) on the page: https://dev48v.infy.uk/ai/days/day12-tokenization.html
Part of AIFromZero. 🌐 https://dev48v.infy.uk