Thiago da Silva Teixeira

Tokenization

When you build an NLP pipeline—whether for sentiment analysis, chatbots, or translation—the very first step is always the same: tokenization. In plain words, tokenization dices raw text into smaller, consistent chunks that a model can count, index, and learn from.

1  What is a Token?

Think of tokens as the LEGO® bricks of language. They can be as big as a whole word or as tiny as a single character, depending on how you slice them.

Sentence: "IBM taught me tokenization."
Possible tokens: ["IBM", "taught", "me", "tokenization"]

Different models expect different brick sizes, so choosing the right tokenizer is strategic.
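
To make the brick sizes concrete, here is a rough sketch of the two extremes on the sentence above (the regex split is a simplification of what real word-level tokenizers do):

import re

sentence = "IBM taught me tokenization."

# Word-sized bricks: roughly one token per word
print(re.findall(r"\w+", sentence))   # ['IBM', 'taught', 'me', 'tokenization']

# Character-sized bricks: one token per character
print(list("tokenization"))           # ['t', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n']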

2  Why Tokenization Matters

  • Sentiment analysis: detect “good” vs “bad”.
  • Text generation: decide what piece comes next.
  • Search engines: match “running” with “run”.

Without tokenization, your neural net sees text as a long, unreadable string of bytes—hardly the recipe for comprehension.

3  The Three Classical Approaches

| Method | How It Works | When to Use | Watch‑outs |
| --- | --- | --- | --- |
| Word‑based | Splits on whitespace & punctuation | Quick prototypes, rule‑based systems | Huge vocabulary, OOV* explosion |
| Character‑based | Every character is a token | Morphologically rich languages, misspellings | Longer sequences, less semantic punch |
| Sub‑word | Keeps common words whole, chops rare ones into pieces | State‑of‑the‑art transformers (BERT, GPT‑x) | More complex training & merges |

*OOV = out‑of‑vocabulary words
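
To see the word‑based watch‑out in action, here is a toy sketch with a made‑up four‑word vocabulary: any word outside it has to fall back to an unknown marker, while a character‑level view never runs out of symbols, it just produces longer sequences.

# Hypothetical word-level vocabulary built from the earlier example sentence
word_vocab = {"IBM": 0, "taught": 1, "me": 2, "tokenization": 3}

print(word_vocab.get("taught", "<unk>"))      # 1      -- known word maps to its index
print(word_vocab.get("tokenizer", "<unk>"))   # <unk>  -- unseen word: the OOV problem

# A character-level vocabulary only needs the alphabet, so "tokenizer" stays representable,
# just as a longer run of nine single-character tokens: list("tokenizer")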

4  A Closer Look at Sub‑word Algorithms

  1. WordPiece (BERT). Greedy merges: start with characters, then repeatedly join the pairs that boost likelihood.
   from transformers import BertTokenizer
   tok = BertTokenizer.from_pretrained("bert-base-uncased")
   print(tok.tokenize("tokenization lovers"))  
   # ['token', '##ization', 'lovers']
  2. Unigram (XLNet, SentencePiece). Vocabulary pruning: begin with many candidate pieces, drop the least useful until a target size is reached.
  3. SentencePiece. Language‑agnostic: trains directly on raw text and treats spaces as tokens, so no pre‑tokenization is needed (see the sketch below).
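
For a taste of Unigram/SentencePiece output, XLNet's pretrained tokenizer in the transformers library can be used. This is only an illustrative sketch: it assumes the sentencepiece package is installed, and the exact pieces depend on the vocabulary the model was trained with.

from transformers import XLNetTokenizer

# XLNet ships a SentencePiece (Unigram) vocabulary
tok = XLNetTokenizer.from_pretrained("xlnet-base-cased")
print(tok.tokenize("tokenization lovers"))
# e.g. ['▁token', 'ization', '▁lovers'] -- '▁' marks a word boundary; exact splits may differ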

5  Tokenization + Indexing in PyTorch

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

sentences = ["Life is short", "Tokenization is powerful"]
tokenizer = get_tokenizer("basic_english")

def yield_tokens(data_iter):
    for text in data_iter:
        yield tokenizer(text)

# Build an integer index for every token, reserving slots for the special tokens
vocab = build_vocab_from_iterator(yield_tokens(sentences),
                                  specials=["<unk>", "<bos>", "<eos>", "<pad>"])
vocab.set_default_index(vocab["<unk>"])   # unseen words map to <unk>

tokens = tokenizer(sentences[0])          # ['life', 'is', 'short']
indices = vocab(tokens)                   # [5, 6, 7]  (example output)

# Add special tokens + padding
max_len = 6
padded = ["<bos>"] + tokens + ["<eos>"]
padded += ["<pad>"] * (max_len - len(padded))

Why it matters: Models operate on integers, not strings. torchtext takes you from raw text to the integer IDs a model consumes in just a handful of lines.
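
Continuing the snippet above, one more step turns the padded token list into a tensor (a minimal sketch; batching and device placement are left out):

import torch

input_ids = torch.tensor(vocab(padded))   # shape: (max_len,) -- actual indices depend on the built vocab
print(input_ids)                          # e.g. tensor([1, 5, 6, 7, 2, 3])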

6  Special Tokens Cheat‑Sheet

| Token | Purpose |
| --- | --- |
| `<bos>` | Beginning of sentence |
| `<eos>` | End of sentence |
| `<pad>` | Sequence padding |
| `<unk>` | Unknown / rare word |

Adding them makes batching cleaner and generation deterministic.
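
As a sketch of the batching point, variable‑length index sequences can be aligned with the <pad> index using pad_sequence (this reuses the torchtext objects from section 5; the second, longer sentence is made up here for illustration, and its unseen words simply map to <unk>):

import torch
from torch.nn.utils.rnn import pad_sequence

# Wrap each sentence with <bos>/<eos> and convert its tokens to indices
batch = [
    torch.tensor(vocab(["<bos>"] + tokenizer(s) + ["<eos>"]))
    for s in ["Life is short", "Tokenization is powerful and fun"]
]

# Align the sequences into one (batch, max_len) tensor, filling gaps with the <pad> index
padded_batch = pad_sequence(batch, batch_first=True, padding_value=vocab["<pad>"])
print(padded_batch.shape)   # torch.Size([2, 7])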

7  Key Takeaways

  • Tokenization is non‑negotiable—mis‑tokenize and your downstream model will stumble.
  • Choose by trade‑off: word‑level (semantic clarity) vs character‑level (tiny vocab) vs sub‑word (best of both, extra complexity).
  • Modern transformers ♥ sub‑word algorithms such as WordPiece, Unigram, and SentencePiece.
  • Indexing turns tokens into numbers; libraries like torchtext, spaCy, and transformers automate the grunt work.
  • Special tokens (<bos>, <eos>, etc.) keep sequence models from losing their place.

Reference

My study notes from the IBM Generative AI and LLMs: Architecture and Data Preparation course.
