WonderLab

Open Source Project of the Day (#85): tiktoken - OpenAI's Blazing-Fast BPE Tokenizer

Introduction

"How many tokens does your prompt actually use?"

This is article #85 in the Open Source Project of the Day series. Today's project is tiktoken, OpenAI's official tokenizer.

Before calling the OpenAI API, almost every developer runs into the same questions: How many tokens will this text consume? Will it exceed the context limit? How do I estimate the cost? The answers all trace back to a single step: tokenization.

tiktoken isn't just a "token counter." It's the actual tokenizer used by the GPT model family during both training and inference. Understanding it means understanding what the model truly "sees" as its input.

What You'll Learn

  • The core mechanics of BPE (Byte Pair Encoding) and its 4 key properties
  • Which encodings tiktoken supports and how to pick the right one
  • How to count tokens precisely to avoid API truncation and surprises
  • How to add custom special tokens for chat formats
  • Why tiktoken's Rust + Python architecture is 3-6x faster than a comparable open-source tokenizer

Prerequisites

  • Basic Python experience
  • Familiarity with the OpenAI API (knowing tokens are the billing unit is enough)
  • Basic NLP/tokenization concepts (optional)

Project Background

What Is tiktoken?

tiktoken is an open-source BPE (Byte Pair Encoding) tokenizer library released by OpenAI. Its core job is to convert text strings into sequences of token IDs (integer arrays) for language models to consume, and to reverse that process, converting token sequences back to the original text.

This isn't an experimental side project. It's the tokenizer powering GPT-3.5, GPT-4, GPT-4o, and more. When you send text through the API, the model doesn't see your words; it sees the token sequence tiktoken produces.

Author / Team

  • Author: OpenAI
  • Language breakdown: Python 64.9% + Rust 35.1%
  • Adoption: Depended on by 184,000+ GitHub repositories
  • Role: Core infrastructure for LLM application development

Project Stats

  • ⭐ GitHub Stars: 18,400+
  • 🍴 Forks: 1,500+
  • 📦 Latest version: 0.13.0 (2026-05-15)
  • 📄 License: MIT

Core Features

What It Does

tiktoken does three things:

  1. Encode: Convert text into a list of token IDs
  2. Decode: Convert token IDs back into the original text (lossless)
  3. Count: Precisely report how many tokens a piece of text uses

These three operations are foundational in LLM application development:

  • Context management: Ensure prompts + history don't exceed the model's context window
  • Cost estimation: Predict API costs before sending requests (OpenAI charges per token)
  • Prompt engineering: Understand the actual "units" the model processes, and optimize around tokenization boundaries

Use Cases

  1. Token budget control before API calls

    • Check token count before sending to avoid truncation or max_tokens errors
  2. Smart document chunking for RAG

    • Split documents into chunks that stay under a token limit for retrieval-augmented generation
  3. Multi-turn conversation window management

    • Dynamically trim message history to keep the conversation within the model's context window
  4. Precise cost monitoring

    • Build token usage dashboards to track and optimize prompt efficiency
  5. Fine-tuning data preprocessing

    • Control sample length by token count when preparing training datasets
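
The document-chunking use case can be sketched as a sliding window over the token sequence. The function below is an assumption-laden sketch, not a tiktoken API: it accepts any encode/decode pair, and the demo uses a whitespace stand-in so it runs without downloading an encoding. With tiktoken you would pass enc.encode and enc.decode instead.

```python
from typing import Callable


def chunk_by_tokens(
    text: str,
    encode: Callable[[str], list],
    decode: Callable[[list], str],
    max_tokens: int = 200,
    overlap: int = 20,
) -> list[str]:
    """Split `text` into chunks of at most `max_tokens` tokens.

    Consecutive chunks share `overlap` tokens so that context is not
    lost at chunk boundaries.
    """
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    tokens = encode(text)
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(decode(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end
    return chunks


# Whitespace stand-in tokenizer (hypothetical; swap in enc.encode / enc.decode).
def split_words(s: str) -> list:
    return s.split(" ")


def join_words(toks: list) -> str:
    return " ".join(toks)


print(chunk_by_tokens("a b c d e f g h i j", split_words, join_words,
                      max_tokens=4, overlap=1))
# ['a b c d', 'd e f g', 'g h i j']
```

One caveat with a real byte-level tokenizer: cutting the token sequence at an arbitrary position can split a multi-byte character, so the decoded edges of a chunk may contain replacement characters.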

Quick Start

pip install tiktoken
import tiktoken

# Option 1: Get encoding by name (recommended for new projects)
enc = tiktoken.get_encoding("o200k_base")

# Option 2: Get encoding by model name (auto-matches the correct encoding)
enc = tiktoken.encoding_for_model("gpt-4o")

# Encode: text → list of token IDs
tokens = enc.encode("Hello, tiktoken!")
print(tokens)        # [13225, 11, 384, 4963, 0]
print(len(tokens))   # 5  ← this is the token count

# Decode: token IDs → text
text = enc.decode(tokens)
print(text)          # Hello, tiktoken!

# Lossless round-trip
assert enc.decode(enc.encode("Any text can be perfectly restored.")) == "Any text can be perfectly restored."

Key Properties

  1. High-performance Rust core

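    • Core tokenization logic is implemented in Rust; OpenAI's README reports a 3-6x speedup over a comparable open-source tokenizer (benchmark: 1 GB of text with the GPT-2 encoding, compared against Hugging Face's GPT2TokenizerFast, which is itself Rust-backed)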
  2. Lossless reversibility

    • decode(encode(text)) == text holds for any valid Unicode string, so no information is lost in the round-trip
  3. Universal coverage

    • Byte-level BPE handles any Unicode text, including content outside the training distribution
  4. High compression ratio

    • On average, each token corresponds to roughly 4 bytes of text, so token sequences are much shorter than the raw byte sequence, which reduces computation
  5. Subword awareness

    • Recognizes common morphological units (e.g., ing, tion, pre-), helping models generalize across word forms
  6. Multiple built-in encodings

    • Ships with o200k_base (GPT-4o), cl100k_base (GPT-4/GPT-3.5-turbo), and legacy encodings
  7. Special token extension

    • Supports adding custom special tokens like <|im_start|> to adapt the tokenizer for chat formats
  8. Educational module

    • Built-in _educational module visualizes the BPE merging process step by step

Comparison with Alternatives

| Dimension | tiktoken | HuggingFace Tokenizers | SentencePiece |
|---|---|---|---|
| Speed | ⚡ Fastest (Rust core) | Fast (Rust core) | Medium (C++) |
| OpenAI model alignment | ✅ Exact match | ❌ Approximate | ❌ N/A |
| Python API simplicity | ✅ Minimal | Medium | Medium |
| Model coverage | OpenAI series | Universal | Universal |
| Custom encodings | ✅ Supported | ✅ Supported | ✅ Supported |

Why choose tiktoken?

  • It is OpenAI's own tokenizer, so the token count for a piece of text matches what the API counts for that text (chat requests add a few tokens of per-message overhead on top)
  • Two-line API for token counting, with no boilerplate
  • MIT license, so there is little friction for commercial use

Deep Dive

BPE: The 4 Key Properties

BPE (Byte Pair Encoding) is tiktoken's core algorithm. Understanding its 4 properties tells you exactly what tiktoken can and cannot do.
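
Before looking at the properties, it may help to see the merge idea in miniature. The following is a toy, unoptimized BPE written for illustration only; it is not tiktoken's implementation (which pre-splits text with a regex and runs the merge loop in Rust), and the function names are made up for this sketch. Training repeatedly merges the most frequent adjacent pair; encoding replays the learned merges in order.

```python
from collections import Counter

Merge = tuple[bytes, bytes]


def merge_pair(parts: list[bytes], left: bytes, right: bytes) -> list[bytes]:
    """Replace every adjacent (left, right) pair with the joined piece."""
    out, i = [], 0
    while i < len(parts):
        if i + 1 < len(parts) and parts[i] == left and parts[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(parts[i])
            i += 1
    return out


def train_bpe(data: bytes, num_merges: int) -> list[Merge]:
    """Learn up to `num_merges` merge rules, starting from single bytes."""
    parts = [bytes([b]) for b in data]
    merges: list[Merge] = []
    for _ in range(num_merges):
        pairs = Counter(zip(parts, parts[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats any more, so further merges gain nothing
        merges.append((left, right))
        parts = merge_pair(parts, left, right)
    return merges


def apply_bpe(data: bytes, merges: list[Merge]) -> list[bytes]:
    """Tokenize `data` by replaying the learned merges in order."""
    parts = [bytes([b]) for b in data]
    for left, right in merges:
        parts = merge_pair(parts, left, right)
    return parts


corpus = b"aaabdaaabac"
merges = train_bpe(corpus, num_merges=10)
pieces = apply_bpe(corpus, merges)
print(merges[0])                    # (b'a', b'a')
print(b"".join(pieces) == corpus)   # True
```

Because every piece is just a concatenation of original bytes, joining the pieces always gives back the input, which is where the lossless property comes from.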

① Lossless Reversibility

Token sequences reconstruct the original text with 100% fidelity:

original = "GPT-4o uses the o200k_base encoding."
assert enc.decode(enc.encode(original)) == original  # Always true

② Open Vocabulary

tiktoken starts from the 256 possible byte values and merges them by frequency. Any Unicode text can therefore be tokenized via its UTF-8 bytes, including content the model was never trained on:

# New words, emoji, source code, edge cases: ordinary text always tokenizes
enc.encode("😀🤖 tiktoken-v99 unknown_word_xyz")
# Caveat: text containing a special-token string such as "<|endoftext|>"
# raises ValueError unless you pass allowed_special or disallowed_special=()
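
The reason nothing is ever "out of vocabulary" can be shown with the standard library alone. This sketch covers only the base layer before any merges happen; the helper names are hypothetical. Every string becomes UTF-8 bytes, and each byte is one of 256 base symbols.

```python
def to_byte_tokens(text: str) -> list[int]:
    """Map text to its UTF-8 bytes: the 256-symbol base vocabulary."""
    return list(text.encode("utf-8"))


def from_byte_tokens(ids: list[int]) -> str:
    """Invert to_byte_tokens."""
    return bytes(ids).decode("utf-8")


sample = "na\u00efve \U0001F600"   # "naive" with a diaeresis, plus an emoji
ids = to_byte_tokens(sample)
print(len(sample), len(ids))               # 7 11
print(all(0 <= i < 256 for i in ids))      # True
print(from_byte_tokens(ids) == sample)     # True
```

Merges only ever combine these base symbols into longer pieces, so rare characters simply cost more tokens instead of failing.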

③ High Compression Ratio

On average, each token covers roughly 4 bytes of text, which shortens sequences and lowers the cost of attention computation:

text = "The quick brown fox jumps over the lazy dog"
tokens = enc.encode(text)
print(f"Characters: {len(text)}, Tokens: {len(tokens)}")
# Characters: 43, Tokens: 9  → ~4.8 chars per token

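④ Subword Awareness

BPE merges frequent byte sequences, which often line up with common subwords such as "ing"; this helps models generalize across word forms:

# Illustrative splits; actual boundaries depend on the encoding's learned merges
# "encoding" → ["encod", "ing"]
# "tokenization" → ["token", "ization"]
# Shared pieces let the model relate unseen compounds to words it already knows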

Encoding Selection Guide

Using the wrong encoding means your token counts won't match what the API actually charges:

| Encoding | Models | Vocabulary size (approx.) |
|---|---|---|
| o200k_base | GPT-4o, GPT-4o-mini | ~200,000 |
| cl100k_base | GPT-4, GPT-3.5-turbo, text-embedding-3-* | ~100,000 |
| p50k_base | text-davinci-003 and older | ~50,000 |
| r50k_base | GPT-3 (davinci) | ~50,000 |

import tiktoken

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count the tokens in raw text using the encoding tiktoken maps to `model`."""
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

print(count_tokens("Hello, world!"))   # 4
print(count_tokens("你好,世界!"))     # non-English text: the count depends heavily on the encoding

Custom Special Tokens

Chat-format models use special tokens to delimit roles. You can extend an existing encoding to support them:

import tiktoken

cl100k_base = tiktoken.get_encoding("cl100k_base")

# Build a chat-aware encoding with custom special tokens
enc = tiktoken.Encoding(
    name="cl100k_im",
    pat_str=cl100k_base._pat_str,
    mergeable_ranks=cl100k_base._mergeable_ranks,
    special_tokens={
        **cl100k_base._special_tokens,
        "<|im_start|>": 100264,
        "<|im_end|>":   100265,
    }
)

text = "<|im_start|>user\nWhat is BPE?<|im_end|>"
tokens = enc.encode(text, allowed_special={"<|im_start|>", "<|im_end|>"})
print(f"Token count: {len(tokens)}")
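
Conceptually, special tokens are matched as whole strings and mapped straight to reserved IDs, while everything between them goes through the ordinary encoder. The sketch below illustrates that idea only; it is not how tiktoken is implemented internally, the function name is hypothetical, and a UTF-8 byte encoder stands in for real BPE so the example runs on its own.

```python
import re
from typing import Callable


def encode_with_special(
    text: str,
    encode_ordinary: Callable[[str], list[int]],
    special_tokens: dict[str, int],
) -> list[int]:
    """Encode `text`, mapping special-token strings to their reserved IDs."""
    if not special_tokens:
        return encode_ordinary(text)
    # Try longer specials first so one cannot shadow another sharing its prefix.
    alternatives = sorted(special_tokens, key=len, reverse=True)
    pattern = re.compile("(" + "|".join(re.escape(s) for s in alternatives) + ")")
    ids: list[int] = []
    for piece in pattern.split(text):
        if piece in special_tokens:
            ids.append(special_tokens[piece])   # reserved ID, no BPE involved
        elif piece:
            ids.extend(encode_ordinary(piece))  # ordinary text path
    return ids


def utf8_bytes(s: str) -> list[int]:
    """Stand-in for a real BPE encoder (assumption for this demo)."""
    return list(s.encode("utf-8"))


specials = {"<|im_start|>": 100264, "<|im_end|>": 100265}
print(encode_with_special("<|im_start|>hi<|im_end|>", utf8_bytes, specials))
# [100264, 104, 105, 100265]
```

This also shows why allowed_special exists: if user-supplied text were allowed to produce reserved IDs by default, it could forge role boundaries in a chat prompt.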

Real-World Pattern: Token Budget Management

The most common use case is trimming message history to fit within the context window before sending:

import tiktoken

def trim_messages_to_budget(
    messages: list[dict],
    model: str = "gpt-4o",
    max_tokens: int = 8000,
) -> list[dict]:
    """
    Trim conversation history so the total token count stays under budget.
    Preserves the system prompt; drops the oldest user/assistant turns first.
    """
    enc = tiktoken.encoding_for_model(model)

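    def count(msgs: list[dict]) -> int:
        # Approximation: ~4 tokens of per-message overhead (role marker, separators);
        # the exact figure varies by model. `or ""` guards against content=None
        # (e.g. tool-call messages).
        total = sum(4 + len(enc.encode(m.get("content") or "")) for m in msgs)
        return total + 2  # ~2 tokens priming the reply (also approximate)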

    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]

    while count(system + others) > max_tokens and others:
        others.pop(0)

    return system + others
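
The loop above recounts the whole history after every pop, which is fine for short chats but quadratic for long ones. A possible variant, sketched here under the same overhead assumptions (4 tokens per message, 2 for reply priming), walks from the newest message backward and encodes each message once. The function name and the count_tokens parameter are hypothetical; with tiktoken you could pass lambda s: len(enc.encode(s)).

```python
from typing import Callable


def trim_newest_first(
    messages: list[dict],
    count_tokens: Callable[[str], int],
    max_tokens: int = 8000,
    per_message_overhead: int = 4,   # assumed, varies by model
    reply_priming: int = 2,          # assumed, varies by model
) -> list[dict]:
    """Keep system messages plus the longest recent suffix that fits."""

    def cost(m: dict) -> int:
        return per_message_overhead + count_tokens(m.get("content") or "")

    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]

    remaining = max_tokens - reply_priming - sum(cost(m) for m in system)
    kept: list[dict] = []
    for m in reversed(others):       # newest first
        c = cost(m)
        if c > remaining:
            break                    # everything older is dropped too
        kept.append(m)
        remaining -= c
    kept.reverse()                   # restore chronological order
    return system + kept


# Demo with a word-count stand-in for the tokenizer.
history = [
    {"role": "system", "content": "be brief"},
    {"role": "user", "content": "one two three"},
    {"role": "assistant", "content": "four"},
    {"role": "user", "content": "five six"},
]
trimmed = trim_newest_first(history, lambda s: len(s.split()), max_tokens=20)
print([m["content"] for m in trimmed])   # ['be brief', 'four', 'five six']
```

Both versions keep the same messages; this one just avoids re-encoding text it has already counted.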

Architecture: Why Is It So Fast?

tiktoken achieves its 3-6x speedup through a Python + Rust hybrid architecture:

tiktoken/
├── tiktoken/
│   ├── __init__.py       ← Public Python API
│   ├── core.py           ← Encoding class
│   ├── model.py          ← Model name → encoding name mapping
│   ├── registry.py       ← Encoding registration and caching
│   └── _educational.py   ← Pure-Python BPE for learning purposes
│
└── src/ (Rust)
    └── lib.rs            ← High-performance BPE core (exposed via PyO3)

Why Rust makes the difference:

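  • GIL released during encoding: the Rust core runs the merge loop outside Python's Global Interpreter Lock, so batch calls such as encode_batch can tokenize on several threads in parallel
  • Low-overhead PyO3 bindings: Rust functions are callable from Python with minimal call overhead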
  • Vocabulary caching: The merge table is loaded once and held in memory across calls
  • Regex pre-tokenization: A fast regex splits text at word/punctuation boundaries before BPE, reducing the search space
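
The pre-tokenization step in the last bullet can be illustrated with a deliberately simplified pattern. The regex below is a stand-in written for this sketch using only the standard library; tiktoken's real patterns are more elaborate (they rely on Unicode property classes and special-case English contractions). The point is only that text is cut into chunks first, and merges then happen inside each chunk.

```python
import re

# Simplified stand-in pattern (assumption): a word with an optional leading
# space, a run of punctuation with an optional leading space, or whitespace.
SIMPLE_PATTERN = re.compile(r" ?\w+| ?[^\w\s]+|\s+")


def pre_tokenize(text: str) -> list[str]:
    """Split text into chunks; BPE merges never cross chunk boundaries."""
    return SIMPLE_PATTERN.findall(text)


chunks = pre_tokenize("Hello, world!")
print(chunks)                               # ['Hello', ',', ' world', '!']
print("".join(chunks) == "Hello, world!")   # True
```

Keeping the leading space attached to the following word is what lets a word in the middle of a sentence map to a single token, and bounding each chunk keeps the merge loop working on short inputs.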

Conclusion

tiktoken's value extends well beyond counting tokens. It's the translation layer between developers and GPT models: the component that determines what the model actually "sees." Mastering tiktoken means you can precisely control context windows, estimate costs before they hit your bill, and build LLM applications that behave predictably at the boundaries.

Its Python + Rust architecture is also a design pattern worth studying: hand the performance-critical inner loop to a systems language, keep the ergonomics and flexibility in a dynamic language. Simple idea, significant payoff.

