Tokenization: How LLMs Actually Read Your Text

#ai #llm #machinelearning #beginners

LLMs don't see letters or even words. They see tokens: chunks of text mapped to integer IDs. Once tokenization clicks, several confusing things about LLMs start to make sense: why cost is counted per token, why context limits are measured in tokens, and why spelling questions about "strawberry" trip them up.

🔤 Type and watch text become tokens: https://dev48v.infy.uk/ai/days/day12-tokenization.html

A token is usually a subword

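Common words stay whole ("the"); rare or long ones split into pieces ("token" + "ization"), though the exact split depends on the tokenizer. BPE (byte-pair encoding) is one widely used subword algorithm: it builds a fixed vocabulary by repeatedly merging the most frequent adjacent pair of symbols. Because the vocabulary keeps small fallback pieces (single characters or bytes), it can represent any word, even one it has never seen.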
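To make the merging idea concrete, here is a minimal sketch of BPE training in plain Python. It is a toy, not how any production tokenizer is implemented: the function names (`train_bpe`, `encode_word`) and the tiny corpus are made up for illustration, and it works on characters rather than bytes.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn an ordered list of merges from a whitespace-split toy corpus."""
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges

def encode_word(word, merges):
    """Split a word into characters, then apply the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = next(iter(merge_pair({symbols: 1}, pair)))
    return list(symbols)

# Hypothetical toy corpus: "the" is frequent, so it ends up as a single token.
merges = train_bpe("the the the the then", num_merges=2)
print(encode_word("the", merges))    # ['the']
print(encode_word("then", merges))   # ['the', 'n']
print(encode_word("zebra", merges))  # unseen word falls back to single characters
```

The frequent word collapses into one token after two merges, while a word the corpus never contained still encodes, just as more and smaller pieces.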

Text → IDs → numbers

Each token maps to an integer from the model's vocabulary. That integer sequence is literally all the model ever receives. Leading spaces and capitalization change the tokens: in most tokenizers, "Hello" and " hello" are different tokens with different IDs.
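The text-to-ID step can be sketched with a tiny lookup table. Everything here is an assumption for illustration: the five-entry `TOY_VOCAB`, its ID numbers, and the greedy longest-match rule (a simplification; a real BPE tokenizer applies its learned merges instead, and real vocabularies hold tens of thousands of entries).

```python
# Hypothetical toy vocabulary; the tokens and IDs are invented for this example.
TOY_VOCAB = {"Hello": 0, "hello": 1, " hello": 2, " world": 3, "!": 4}
ID_TO_TOKEN = {token_id: token for token, token_id in TOY_VOCAB.items()}

def encode(text):
    """Turn text into token IDs by greedily taking the longest matching token."""
    ids, pos = [], 0
    while pos < len(text):
        candidates = [t for t in TOY_VOCAB if text.startswith(t, pos)]
        if not candidates:
            raise ValueError(f"no token covers {text[pos]!r}")
        match = max(candidates, key=len)
        ids.append(TOY_VOCAB[match])
        pos += len(match)
    return ids

def decode(ids):
    """Map IDs back to their token strings and join them."""
    return "".join(ID_TO_TOKEN[i] for i in ids)

print(encode("Hello world!"))   # [0, 3, 4]
print(encode(" hello world!"))  # [2, 3, 4]
print(decode([2, 3, 4]))        # ' hello world!'
```

Note that the space lives inside the token (" world"), which is why the same word with and without a leading space ends up as two different IDs.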