Thiago da Silva Teixeira

Tokenization

When you build an NLP pipeline—whether for sentiment analysis, chatbots, or translation—the very first step is always the same: tokenization. In plain words, tokenization dices raw text into smaller, consistent chunks that a model can count, index, and learn from.

1  What is a Token?

Think of tokens as the LEGO® bricks of language. They can be as big as a whole word or as tiny as a single character, depending on how you slice them.

Sentence: "IBM taught me tokenization."
Possible tokens: ["IBM", "taught", "me", "tokenization"]

Different models expect different brick sizes, so choosing the right tokenizer is strategic.
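
To make the brick sizes concrete, here is a rough sketch of the two extremes on the sentence above (the regex split is a simplification of what real word-level tokenizers do):

import re

sentence = "IBM taught me tokenization."

# Word-sized bricks: roughly one token per word
print(re.findall(r"\w+", sentence))   # ['IBM', 'taught', 'me', 'tokenization']

# Character-sized bricks: one token per character
print(list("tokenization"))           # ['t', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n']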

2  Why Tokenization Matters

  • Sentiment analysis: detect “good” vs “bad”.
  • Text generation: decide what piece comes next.
  • Search engines: match “running” with “run”.

Without tokenization, your neural net sees text as a long, unreadable string of bytes—hardly the recipe for comprehension.

3  The Three Classical Approaches

| Method | How It Works | When to Use | Watch‑outs |
| --- | --- | --- | --- |
| Word‑based | Splits on whitespace & punctuation | Quick prototypes, rule‑based systems | Huge vocabulary, OOV* explosion |
| Character‑based | Every character is a token | Morphologically rich languages, misspellings | Longer sequences, less semantic punch |
| Sub‑word | Keeps common words whole, chops rare ones into pieces | State‑of‑the‑art transformers (BERT, GPT‑x) | More complex training & merges |

*OOV = out‑of‑vocabulary words
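
To see the word‑based watch‑out in action, here is a toy sketch with a made‑up four‑word vocabulary: any word outside it has to fall back to an unknown marker, while a character‑level view never runs out of symbols, it just produces longer sequences.

# Hypothetical word-level vocabulary built from the earlier example sentence
word_vocab = {"IBM": 0, "taught": 1, "me": 2, "tokenization": 3}

print(word_vocab.get("taught", "<unk>"))      # 1      -- known word maps to its index
print(word_vocab.get("tokenizer", "<unk>"))   # <unk>  -- unseen word: the OOV problem

# A character-level vocabulary only needs the alphabet, so "tokenizer" stays representable,
# just as a longer run of nine single-character tokens: list("tokenizer")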

4  A Closer Look at Sub‑word Algorithms

  1. WordPiece (BERT). Greedy merges: start with characters, then repeatedly join the pairs that boost likelihood.
   from transformers import BertTokenizer
   tok = BertTokenizer.from_pretrained("bert-base-uncased")
   print(tok.tokenize("tokenization lovers"))  
   # ['token', '##ization', 'lovers']
  2. Unigram (XLNet, SentencePiece). Vocabulary pruning: begin with many candidate pieces, drop the least useful until a target size is reached.
  3. SentencePiece. Language‑agnostic: trains directly on raw text and treats spaces as tokens, so no pre‑tokenization is needed (see the sketch below).
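
For a taste of Unigram/SentencePiece output, XLNet's pretrained tokenizer in the transformers library can be used. This is only an illustrative sketch: it assumes the sentencepiece package is installed, and the exact pieces depend on the vocabulary the model was trained with.

from transformers import XLNetTokenizer

# XLNet ships a SentencePiece (Unigram) vocabulary
tok = XLNetTokenizer.from_pretrained("xlnet-base-cased")
print(tok.tokenize("tokenization lovers"))
# e.g. ['▁token', 'ization', '▁lovers'] -- '▁' marks a word boundary; exact splits may differ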

5  Tokenization + Indexing in PyTorch

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

sentences = ["Life is short", "Tokenization is powerful"]
tokenizer = get_tokenizer("basic_english")

def yield_tokens(data_iter):
    for text in data_iter:
        yield tokenizer(text)

# Build an integer index for every token, reserving slots for the special tokens
vocab = build_vocab_from_iterator(yield_tokens(sentences),
                                  specials=["<unk>", "<bos>", "<eos>", "<pad>"])
vocab.set_default_index(vocab["<unk>"])   # unseen words map to <unk>

tokens = tokenizer(sentences[0])          # ['life', 'is', 'short']
indices = vocab(tokens)                   # [5, 6, 7]  (example output)

# Add special tokens + padding
max_len = 6
padded = ["<bos>"] + tokens + ["<eos>"]
padded += ["<pad>"] * (max_len - len(padded))

Why it matters: Models operate on integers, not strings. torchtext takes you from raw text to the integer IDs a model consumes in just a handful of lines.
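
Continuing the snippet above, one more step turns the padded token list into a tensor (a minimal sketch; batching and device placement are left out):

import torch

input_ids = torch.tensor(vocab(padded))   # shape: (max_len,) -- actual indices depend on the built vocab
print(input_ids)                          # e.g. tensor([1, 5, 6, 7, 2, 3])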

6  Special Tokens Cheat‑Sheet

| Token | Purpose |
| --- | --- |
| `<bos>` | Beginning of sentence |
| `<eos>` | End of sentence |
| `<pad>` | Sequence padding |
| `<unk>` | Unknown / rare word |

Adding them makes batching cleaner and generation deterministic.
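
As a sketch of the batching point, variable‑length index sequences can be aligned with the <pad> index using pad_sequence (this reuses the torchtext objects from section 5; the second, longer sentence is made up here for illustration, and its unseen words simply map to <unk>):

import torch
from torch.nn.utils.rnn import pad_sequence

# Wrap each sentence with <bos>/<eos> and convert its tokens to indices
batch = [
    torch.tensor(vocab(["<bos>"] + tokenizer(s) + ["<eos>"]))
    for s in ["Life is short", "Tokenization is powerful and fun"]
]

# Align the sequences into one (batch, max_len) tensor, filling gaps with the <pad> index
padded_batch = pad_sequence(batch, batch_first=True, padding_value=vocab["<pad>"])
print(padded_batch.shape)   # torch.Size([2, 7])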

7  Key Takeaways

  • Tokenization is non‑negotiable—mis‑tokenize and your downstream model will stumble.
  • Choose by trade‑off: word‑level (semantic clarity) vs character‑level (tiny vocab) vs sub‑word (best of both, extra complexity).
  • Modern transformers ♥ sub‑word algorithms such as WordPiece, Unigram, and SentencePiece.
  • Indexing turns tokens into numbers; libraries like torchtext, spaCy, and transformers automate the grunt work.
  • Special tokens (<bos>, <eos>, etc.) keep sequence models from losing their place.

Reference

My study notes from the IBM Generative AI and LLMs: Architecture and Data Preparation course.
