DEV Community

Devanshu Biswas

Tokenization: How LLMs Actually Read Your Text

LLMs don't see letters or even words. They see tokens: chunks of text mapped to integer IDs. Once you understand tokenization, a dozen confusing things about LLMs suddenly make sense (cost, context limits, why counting the letters in "strawberry" trips them up).

🔤 Type and watch text become tokens: https://dev48v.infy.uk/ai/days/day12-tokenization.html

A token is usually a subword

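Common words usually stay whole ("the"); rare or long ones split into pieces (for example "token" + "ization", though the exact split depends on the tokenizer). This subword approach (most often BPE, byte-pair encoding) builds its vocabulary by repeatedly merging the most frequent adjacent pieces, so a fixed vocabulary can represent any word, even ones it has never seen, by falling back to smaller pieces.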
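To make the merging idea concrete, here is a minimal character-level BPE sketch in plain Python. It is a toy under stated assumptions, not how any production tokenizer is implemented: the function names (`learn_bpe_merges`, `merge_pair`, `segment`) and the tiny corpus are made up for illustration, and real tokenizers typically work on bytes, pre-split text, and learn tens of thousands of merges.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Every word starts as a tuple of single characters, weighted by frequency.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count each adjacent symbol pair across the whole corpus.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair (ties go to the pair seen first).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the corpus with that pair merged everywhere.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def segment(word, merges):
    """Split a word into subwords by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)                    # [('l', 'o'), ('lo', 'w')]
print(segment("low", merges))    # ['low']
print(segment("lowly", merges))  # ['low', 'l', 'y']
```

"lowly" never appears in the toy corpus, yet it still gets segmented: the frequent piece "low" stays whole and the rest falls back to single characters.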

Text → IDs → numbers

Each token maps to an integer from the model's vocabulary. That integer sequence is literally all the model ever receives. Leading spaces and capitalization change the tokens: "Hello" and " hello" are different tokens with different IDs.
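Here is a small sketch of that text → IDs → text round trip. Everything in it is an assumption for illustration: `TOY_VOCAB` is a hand-written vocabulary, the IDs are just list positions, and the encoder uses greedy longest-match, which is a simplification rather than the merge-based procedure a real BPE tokenizer follows.

```python
# Hypothetical toy vocabulary: a token's ID is its position in this list.
TOY_VOCAB = [
    "Hello", "hello", " hello", " world", "!",
    "H", "h", "e", "l", "o", " ", "w", "r", "d",
]
TOKEN_TO_ID = {token: i for i, token in enumerate(TOY_VOCAB)}


def encode(text):
    """Turn text into a list of integer IDs using greedy longest-match."""
    ids, i = [], 0
    while i < len(text):
        # Among all vocabulary entries that match at position i, take the longest.
        match = max(
            (t for t in TOKEN_TO_ID if text.startswith(t, i)),
            key=len,
            default=None,
        )
        if match is None:
            raise ValueError(f"no token covers {text[i]!r} in this toy vocabulary")
        ids.append(TOKEN_TO_ID[match])
        i += len(match)
    return ids


def decode(ids):
    """Turn a list of integer IDs back into text."""
    return "".join(TOY_VOCAB[i] for i in ids)


print(encode("Hello"))          # [0]
print(encode("hello"))          # [1]
print(encode(" hello"))         # [2]
print(encode("Hello world!"))   # [0, 3, 4]
print(decode([0, 3, 4]))        # Hello world!
```

The three spellings of "hello" come out as three unrelated integers, which is all the model gets to work with.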

Why it matters

  • Cost & context are measured in tokens (roughly 4 characters of English text each, on average), not words.
  • Letter-counting mistakes and odd emoji/code splits are tokenization artifacts.
  • Different models use different tokenizers.
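As a rough back-of-the-envelope sketch of the first point, the 4-characters-per-token rule of thumb can be turned into quick estimates. The helper names, the context limit, and the price below are hypothetical placeholders, not any provider's real numbers, and an actual tokenizer will give different counts (especially for code, emoji, or non-English text).

```python
import math

# Rule-of-thumb average for English text; real counts vary by tokenizer.
CHARS_PER_TOKEN = 4


def estimate_tokens(text):
    """Estimate the token count from the character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_in_context(text, context_limit, reserved_for_output=0):
    """Check whether the estimated prompt plus reserved output fits the limit."""
    return estimate_tokens(text) + reserved_for_output <= context_limit


def estimate_cost(text, price_per_million_tokens):
    """Estimate input cost, given a (hypothetical) price per million tokens."""
    return estimate_tokens(text) * price_per_million_tokens / 1_000_000


prompt = "a" * 400  # stand-in for a 400-character prompt
print(estimate_tokens(prompt))                               # 100
print(fits_in_context(prompt, context_limit=128))            # True
print(fits_in_context(prompt, 128, reserved_for_output=50))  # False
```

For anything that matters (billing, hard context limits), count with the model's own tokenizer instead of this estimate.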

🔨 Build it (tiktoken / 🤗 tokenizers: encode → IDs → count → decode) on the page: https://dev48v.infy.uk/ai/days/day12-tokenization.html

Part of AIFromZero. 🌐 https://dev48v.infy.uk
