Nagachinmay KN
Sentence, Word & Subword Tokenisation Explained

What is Tokenisation? (How Machines Break Text into Pieces)

Before any machine learning model can understand text, it must first break the text into smaller units.
This process is called Tokenisation.

In simple words:

Tokenisation = splitting text into smaller meaningful pieces (tokens), which can then be mapped to numbers

Machines cannot understand raw text; they only understand numbers. Tokenisation is the first step that turns text into something a model can work with.
✅ 1. Sentence-Level Tokenisation

This method splits a paragraph into individual sentences.

Example:

Input:

I love machine learning. It is powerful. It is the future.

Output Tokens:

I love machine learning.

It is powerful.

It is the future.

✅ Used in:

Document summarization

News classification

Chatbots
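
To make this concrete, here is a minimal sketch in plain Python that splits on sentence-ending punctuation. The naive rule is only for illustration; real projects typically use a library tokenizer such as NLTK's sent_tokenize, which also handles abbreviations like "Dr." correctly.

```python
import re

text = "I love machine learning. It is powerful. It is the future."

# Naive rule: split wherever ., ! or ? is followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

print(sentences)
# ['I love machine learning.', 'It is powerful.', 'It is the future.']
```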
✅ 2. Word-Level Tokenisation

This splits a sentence into individual words.

Example:

Input:

I love machine learning

Output Tokens:

I

love

machine

learning

✅ This was used in:

Traditional NLP

RNNs

LSTMs

Early translation models

⚠️ But it has problems:

Huge vocabulary

Unknown words (the out-of-vocabulary, or OOV, problem)

Spelling variations break models
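
A small sketch of word-level tokenisation and the OOV problem (plain Python; the tiny vocabulary below is a made-up example):

```python
# Word-level tokenisation: split on whitespace, then map each word to an ID.
sentence = "I love machine learning"
tokens = sentence.split()
print(tokens)  # ['I', 'love', 'machine', 'learning']

# Toy vocabulary, as if it were built from the training data.
vocab = {"I": 0, "love": 1, "machine": 2, "learning": 3, "<UNK>": 4}

# Any word never seen in training collapses to <UNK> -- the OOV problem.
new_sentence = "I love deeplearning"
ids = [vocab.get(word, vocab["<UNK>"]) for word in new_sentence.split()]
print(ids)  # [0, 1, 4] -> 'deeplearning' is unknown to the model
```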

✅ 3. Subword-Level Tokenisation (Modern AI Uses This)

This splits words into smaller meaningful parts.

Example:

Word:

Unbelievable

Subword Tokens:

Un

believ

able

Or:

Playing → Play + ing

✅ Used in:

GPT

BERT

ChatGPT

Google Translate

✅ Solves:

Unknown words

Large vocabulary issues

Poor generalization to rare and unseen words

Common subword algorithms:

BPE (Byte Pair Encoding)

WordPiece

Unigram LM
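
To see real subword splits, you can inspect a pretrained BPE tokenizer. Here is a sketch using the Hugging Face transformers library (assumes it is installed and can download the GPT-2 vocabulary; the exact pieces depend on that vocabulary):

```python
from transformers import AutoTokenizer

# GPT-2 uses Byte Pair Encoding (BPE).
tokenizer = AutoTokenizer.from_pretrained("gpt2")

print(tokenizer.tokenize("Unbelievable"))  # subword pieces, e.g. ['Un', 'bel', 'iev', 'able']
print(tokenizer.tokenize("Playing"))       # e.g. ['Play', 'ing']
```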

🔁 Why Tokenisation Is Essential for Sequential Models

Remember:

Sequential Models work only on sequences.

Tokenisation is what creates the sequence.

➡️ Text → Tokens → Numbers → Model → Output
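
A toy end-to-end sketch of the first three steps (the vocabulary is invented for illustration; the model itself is out of scope here):

```python
text = "I love machine learning"

tokens = text.split()                                    # Text   -> Tokens
vocab = {"I": 0, "love": 1, "machine": 2, "learning": 3}
ids = [vocab[token] for token in tokens]                 # Tokens -> Numbers

# This list of IDs ([0, 1, 2, 3]) is the sequence an RNN, LSTM or
# Transformer actually consumes, one token at a time.
print(ids)
```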

Without tokenisation:
❌ No RNN
❌ No LSTM
❌ No Transformers
❌ No ChatGPT

✅ Final Tokenisation Summary Table

| Level    | Input     | Output    | Used In       |
|----------|-----------|-----------|---------------|
| Sentence | Paragraph | Sentences | Summarization |
| Word     | Sentence  | Words     | Classic NLP   |
| Subword  | Word      | Sub-parts | Transformers  |

“From sequential data to tokenisation, this is how machines slowly learn to read, remember, and predict — just like us.”
