What is Tokenisation? (How Machines Break Text into Pieces)
Before any machine learning model can understand text, it must first break the text into smaller units.
This process is called Tokenisation.
In simple words:
Tokenisation = Converting text into meaningful pieces (tokens) that can then be mapped to numbers
Machines cannot understand raw text; they only work with numbers, so every token is later converted into a numerical ID.
✅ 1. Sentence-Level Tokenisation
This method splits a paragraph into individual sentences.
Example:
Input:
I love machine learning. It is powerful. It is the future.
Output Tokens:
I love machine learning.
It is powerful.
It is the future.
✅ Used in:
Document summarization
News classification
Chatbots
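A minimal sketch of sentence-level tokenisation in Python, using a simple regex split (real libraries such as NLTK or spaCy handle abbreviations, decimals, and other edge cases far more robustly):

```python
import re

def split_sentences(paragraph):
    # Toy sentence splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]

text = "I love machine learning. It is powerful. It is the future."
print(split_sentences(text))
# ['I love machine learning.', 'It is powerful.', 'It is the future.']
```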
✅ 2. Word-Level Tokenisation
This splits a sentence into individual words.
Example:
Input:
I love machine learning
Output Tokens:
I
love
machine
learning
✅ This was used in:
Traditional NLP
RNNs
LSTMs
Early translation models
⚠️ But it has problems:
Huge vocabulary
Unknown words (OOV problem)
Spelling variations and rare word forms break the model
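A minimal sketch of word-level tokenisation with a tiny, hypothetical vocabulary, showing how an unseen word immediately falls back to an unknown token (the OOV problem):

```python
import re

def word_tokenize(sentence):
    # Toy word tokenizer: lowercase, then keep only alphabetic "words".
    return re.findall(r"[a-z]+", sentence.lower())

# Hypothetical vocabulary; real word-level vocabularies hold hundreds of thousands of entries.
vocab = {"i": 0, "love": 1, "machine": 2, "learning": 3, "<unk>": 4}

tokens = word_tokenize("I love machine learning models")
ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

print(tokens)  # ['i', 'love', 'machine', 'learning', 'models']
print(ids)     # [0, 1, 2, 3, 4] -> 'models' becomes <unk>: the OOV problem
```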
✅ 3. Subword-Level Tokenisation (Modern AI Uses This)
This splits words into smaller meaningful parts.
Example:
Word:
Unbelievable
Subword Tokens:
Un
believ
able
Or:
Playing → Play + ing
✅ Used in:
GPT
BERT
ChatGPT
Google Translate
✅ Solves:
Unknown words
Large vocabulary issues
Better generalization
Common subword algorithms:
BPE (Byte Pair Encoding)
WordPiece
Unigram LM
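A minimal sketch of how subword tokenisation can work, using a WordPiece-style greedy longest-match against a toy hand-made vocabulary (real vocabularies are learned from data by BPE, WordPiece, or Unigram LM and contain tens of thousands of pieces):

```python
def subword_tokenize(word, vocab):
    # Greedy longest-match: repeatedly take the longest vocabulary piece
    # that matches the start of the remaining characters.
    tokens, rest = [], word.lower()
    while rest:
        for end in range(len(rest), 0, -1):
            if rest[:end] in vocab:
                tokens.append(rest[:end])
                rest = rest[end:]
                break
        else:
            return ["<unk>"]  # nothing matched at all
    return tokens

# Toy subword vocabulary, illustrative only.
toy_vocab = {"un", "believ", "able", "play", "ing"}

print(subword_tokenize("unbelievable", toy_vocab))  # ['un', 'believ', 'able']
print(subword_tokenize("playing", toy_vocab))       # ['play', 'ing']
```

Because unseen words can still be assembled from known pieces, the model rarely has to fall back to an unknown token.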
🔁 Why Tokenisation Is Essential for Sequential Models
Remember:
Sequential Models work only on sequences.
Tokenisation is what creates the sequence.
➡️ Text → Tokens → Numbers → Model → Output
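A minimal sketch of that pipeline with a hypothetical vocabulary: the raw text is tokenised, the tokens are mapped to IDs, and that integer sequence is what an RNN, LSTM, or Transformer actually receives (via an embedding layer):

```python
import re

# Hypothetical vocabulary; real models learn theirs from large corpora.
vocab = {"<unk>": 0, "i": 1, "love": 2, "machine": 3, "learning": 4}

def encode(text):
    tokens = re.findall(r"[a-z]+", text.lower())               # Text -> Tokens
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens]   # Tokens -> Numbers
    return tokens, ids

tokens, ids = encode("I love machine learning")
print(tokens)  # ['i', 'love', 'machine', 'learning']
print(ids)     # [1, 2, 3, 4] -> the sequence the model is trained on
```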
Without tokenisation:
❌ No RNN
❌ No LSTM
❌ No Transformers
❌ No ChatGPT
✅ Final Tokenisation Summary Table
| Level | Input | Output | Used In |
|---|---|---|---|
| Sentence | Paragraph | Sentences | Summarization |
| Word | Sentence | Words | Classic NLP |
| Subword | Word | Sub-parts | Transformers |
“From sequential data to tokenisation, this is how machines slowly learn to read, remember, and predict — just like us.”