What is Tokenisation? (How Machines Break Text into Pieces)
Before any machine learning model can understand text, it must first break the text into smaller units.
This process is called Tokenisation.
In simple words:
Tokenisation = Converting text into meaningful pieces (tokens) that can then be mapped to numbers
Machines cannot understand raw text; they only work with numbers, so every token is later converted into a numerical ID.
✅ 1. Sentence-Level Tokenisation
This method splits a paragraph into individual sentences.
Example:
Input:
I love machine learning. It is powerful. It is the future.
Output Tokens:
I love machine learning.
It is powerful.
It is the future.
✅ Used in:
Document summarization
News classification
Chatbots
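A minimal sketch of sentence-level tokenisation in Python, using a simple regex split (real libraries such as NLTK or spaCy handle abbreviations, decimals, and other edge cases far more robustly):

```python
import re

def split_sentences(paragraph):
    # Toy sentence splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]

text = "I love machine learning. It is powerful. It is the future."
print(split_sentences(text))
# ['I love machine learning.', 'It is powerful.', 'It is the future.']
```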
✅ 2. Word-Level Tokenisation
This splits a sentence into individual words.
Example:
Input:
I love machine learning
Output Tokens:
I
love
machine
learning
✅ This was used in:
Traditional NLP
RNNs
LSTMs
Early translation models
⚠️ But it has problems:
Huge vocabulary
Unknown words (OOV problem)
Spelling variations and rare word forms break the model
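A minimal sketch of word-level tokenisation with a tiny, hypothetical vocabulary, showing how an unseen word immediately falls back to an unknown token (the OOV problem):

```python
import re

def word_tokenize(sentence):
    # Toy word tokenizer: lowercase, then keep only alphabetic "words".
    return re.findall(r"[a-z]+", sentence.lower())

# Hypothetical vocabulary; real word-level vocabularies hold hundreds of thousands of entries.
vocab = {"i": 0, "love": 1, "machine": 2, "learning": 3, "<unk>": 4}

tokens = word_tokenize("I love machine learning models")
ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

print(tokens)  # ['i', 'love', 'machine', 'learning', 'models']
print(ids)     # [0, 1, 2, 3, 4] -> 'models' becomes <unk>: the OOV problem
```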
✅ 3. Subword-Level Tokenisation (Modern AI Uses This)
This splits words into smaller meaningful parts.
Example:
Word:
Unbelievable
Subword Tokens:
Un
believ
able
Or:
Playing → Play + ing
✅ Used in:
GPT
BERT
ChatGPT
Google Translate
✅ Solves:
Unknown words
Large vocabulary issues
Better generalization
Common subword algorithms:
BPE (Byte Pair Encoding)
WordPiece
Unigram LM
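A minimal sketch of how subword tokenisation can work, using a WordPiece-style greedy longest-match against a toy hand-made vocabulary (real vocabularies are learned from data by BPE, WordPiece, or Unigram LM and contain tens of thousands of pieces):

```python
def subword_tokenize(word, vocab):
    # Greedy longest-match: repeatedly take the longest vocabulary piece
    # that matches the start of the remaining characters.
    tokens, rest = [], word.lower()
    while rest:
        for end in range(len(rest), 0, -1):
            if rest[:end] in vocab:
                tokens.append(rest[:end])
                rest = rest[end:]
                break
        else:
            return ["<unk>"]  # nothing matched at all
    return tokens

# Toy subword vocabulary, illustrative only.
toy_vocab = {"un", "believ", "able", "play", "ing"}

print(subword_tokenize("unbelievable", toy_vocab))  # ['un', 'believ', 'able']
print(subword_tokenize("playing", toy_vocab))       # ['play', 'ing']
```

Because unseen words can still be assembled from known pieces, the model rarely has to fall back to an unknown token.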
🔁 Why Tokenisation Is Essential for Sequential Models
Remember:
Sequential Models work only on sequences.
Tokenisation is what creates the sequence.
➡️ Text → Tokens → Numbers → Model → Output
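A minimal sketch of that pipeline with a hypothetical vocabulary: the raw text is tokenised, the tokens are mapped to IDs, and that integer sequence is what an RNN, LSTM, or Transformer actually receives (via an embedding layer):

```python
import re

# Hypothetical vocabulary; real models learn theirs from large corpora.
vocab = {"<unk>": 0, "i": 1, "love": 2, "machine": 3, "learning": 4}

def encode(text):
    tokens = re.findall(r"[a-z]+", text.lower())               # Text -> Tokens
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens]   # Tokens -> Numbers
    return tokens, ids

tokens, ids = encode("I love machine learning")
print(tokens)  # ['i', 'love', 'machine', 'learning']
print(ids)     # [1, 2, 3, 4] -> the sequence the model is trained on
```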
Without tokenisation:
❌ No RNN
❌ No LSTM
❌ No Transformers
❌ No ChatGPT
✅ Final Tokenisation Summary Table
| Level | Input | Output | Used In |
|---|---|---|---|
| Sentence | Paragraph | Sentences | Summarization |
| Word | Sentence | Words | Classic NLP |
| Subword | Word | Sub-parts | Transformers |
“From sequential data to tokenisation, this is how machines slowly learn to read, remember, and predict — just like us.”