WonderLab

Posted on Jun 4

Open Source Project of the Day (#85): tiktoken - OpenAI's Blazing-Fast BPE Tokenizer

#opensource #nlp #bpe #openai

Introduction

"How many tokens does your prompt actually use?"

This is article #85 in the Open Source Project of the Day series. Today's project is tiktoken — OpenAI's official tokenizer.

Before calling the OpenAI API, almost every developer runs into the same questions: How many tokens will this text consume? Will it exceed the context limit? How do I estimate the cost? The answers all trace back to a single step: tokenization.

tiktoken isn't just a "token counter." It's the actual tokenizer used by the GPT model family during both training and inference. Understanding it means understanding what the model truly "sees" as its input.

What You'll Learn

The core mechanics of BPE (Byte Pair Encoding) and its 4 key properties
Which encodings tiktoken supports and how to pick the right one
How to count tokens precisely to avoid API truncation and surprises
How to add custom special tokens for chat formats
Why tiktoken's Rust + Python architecture runs 3-6x faster than a comparable open-source tokenizer

Prerequisites

Basic Python experience
Familiarity with the OpenAI API (knowing tokens are the billing unit is enough)
Basic NLP/tokenization concepts (optional)

Project Background

What Is tiktoken?

tiktoken is an open-source BPE (Byte Pair Encoding) tokenizer library released by OpenAI. Its core job is to convert text strings into sequences of token IDs (integer arrays) for language models to consume — and to reverse that process, converting token sequences back to the original text.

This isn't an experimental side project. It's the tokenizer powering GPT-3.5, GPT-4, GPT-4o, and more. When you send text through the API, the model doesn't see your words — it sees the token sequence tiktoken produces.

Author / Team

Author: OpenAI
Language breakdown: Python 64.9% + Rust 35.1%
Adoption: Depended on by 184,000+ GitHub repositories
Role: Core infrastructure for LLM application development

Project Stats

⭐ GitHub Stars: 18,400+
🍴 Forks: 1,500+
📦 Latest version: 0.13.0 (2026-05-15)
📄 License: MIT

Core Features

What It Does

tiktoken does three things:

Encode: Convert text into a list of token IDs
Decode: Convert token IDs back into the original text (lossless)
Count: Precisely report how many tokens a piece of text uses

These three operations are foundational in LLM application development:

Context management: Ensure prompts + history don't exceed the model's context window
Cost estimation: Predict API costs before sending requests (OpenAI charges per token)
Prompt engineering: Understand the actual "units" the model processes, and optimize around tokenization boundaries
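
Once a token count is available, context management and cost estimation both reduce to simple arithmetic. The sketch below is only an illustration: the function names are hypothetical, and the prices in the usage lines are placeholder numbers, not OpenAI's actual rates.

```python
def fits_context(prompt_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """Return True if the prompt plus the reserved reply budget fits in the window."""
    return prompt_tokens + max_output_tokens <= context_window


def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Estimate the cost of one request, in the currency unit of the prices passed in."""
    return (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000


# Placeholder numbers, for illustration only
print(fits_context(prompt_tokens=7_500, max_output_tokens=1_000, context_window=8_000))  # False
print(estimate_cost(1_200, 300, input_price_per_million=2.0, output_price_per_million=8.0))
```

In practice, `prompt_tokens` would come from `len(enc.encode(text))`, and the prices from the current pricing page for the model in use.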

Use Cases

Token budget control before API calls
- Check the token count before sending to avoid truncated output or context-length errors
Smart document chunking for RAG
- Split documents into chunks that stay under a token limit for retrieval-augmented generation
Multi-turn conversation window management
- Dynamically trim message history to keep the conversation within the model's context window
Precise cost monitoring
- Build token usage dashboards to track and optimize prompt efficiency
Fine-tuning data preprocessing
- Control sample length by token count when preparing training datasets
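
The chunking use case can be sketched as a sliding window over token IDs. This is a minimal illustration, not a recommended RAG pipeline: `chunk_token_ids` is a hypothetical helper, and it works on any list of integers, so it runs without tiktoken installed.

```python
def chunk_token_ids(ids: list[int], max_tokens: int, overlap: int = 0) -> list[list[int]]:
    """Split token IDs into windows of at most `max_tokens`, sharing `overlap` IDs."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    step = max_tokens - overlap
    chunks: list[list[int]] = []
    start = 0
    while start < len(ids):
        chunks.append(ids[start:start + max_tokens])
        if start + max_tokens >= len(ids):
            break  # the last window already reaches the end
        start += step
    return chunks


print(chunk_token_ids(list(range(10)), max_tokens=4, overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]

# Usage with tiktoken (assumes `pip install tiktoken`):
#   import tiktoken
#   enc = tiktoken.get_encoding("o200k_base")
#   chunks = [enc.decode(c) for c in chunk_token_ids(enc.encode(document), 512, 64)]
```

One caveat: a window boundary can fall in the middle of a multi-byte character, in which case decoding that chunk yields a replacement character at the edge. Splitting on sentence or paragraph boundaries first, then packing by token count, avoids this.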

Quick Start

pip install tiktoken

import tiktoken

# Option 1: Get encoding by name (recommended for new projects)
enc = tiktoken.get_encoding("o200k_base")

# Option 2: Get encoding by model name (auto-matches the correct encoding)
enc = tiktoken.encoding_for_model("gpt-4o")

# Encode: text → list of token IDs
tokens = enc.encode("Hello, tiktoken!")
print(tokens)        # [13225, 11, 384, 4963, 0]
print(len(tokens))   # 5  ← this is the token count

# Decode: token IDs → text
text = enc.decode(tokens)
print(text)          # "Hello, tiktoken!"

# Lossless round-trip
assert enc.decode(enc.encode("Any text can be perfectly restored.")) == "Any text can be perfectly restored."

Key Properties

High-performance Rust core
- Core tokenization logic is implemented in Rust; the project README reports a 3-6x speedup over a comparable open-source tokenizer (benchmark: 1 GB of text, compared with HuggingFace's GPT2TokenizerFast)
Lossless reversibility
- decode(encode(text)) == text always holds — no information is lost in the round-trip
Universal coverage
- Byte-level BPE handles any Unicode text, including content outside the training distribution
High compression ratio
- On average, each token corresponds to roughly 4 bytes of text, which shortens sequences and reduces computation
Subword awareness
- Recognizes common morphological units (e.g., ing, tion, pre-), helping models generalize across word forms
Multiple built-in encodings
- Ships with o200k_base (GPT-4o), cl100k_base (GPT-4/GPT-3.5-turbo), and legacy encodings
Special token extension
- Supports adding custom special tokens like <|im_start|> to adapt the tokenizer for chat formats
Educational module
- Built-in _educational module visualizes the BPE merging process step by step

Comparison with Alternatives

Dimension	tiktoken	HuggingFace Tokenizers	SentencePiece
Speed	⚡ Fastest (Rust core)	Fast (Rust core)	Medium (C++)
OpenAI model alignment	✅ Exact match	❌ Approximate	❌ N/A
Python API simplicity	✅ Minimal	Medium	Medium
Model coverage	OpenAI series	Universal	Universal
Custom encodings	✅ Supported	✅ Supported	✅ Supported

Why choose tiktoken?

Maintained by OpenAI as the reference implementation of its models' encodings, so raw-text token counts match what the API sees
Two-line API for token counting — no boilerplate
MIT license — zero friction for commercial use

Deep Dive

BPE: The 4 Key Properties

BPE (Byte Pair Encoding) is tiktoken's core algorithm. Understanding its 4 properties tells you exactly what tiktoken can and cannot do.

① Lossless Reversibility

Token sequences reconstruct the original text with 100% fidelity:

original = "GPT-4o uses the o200k_base encoding."
assert enc.decode(enc.encode(original)) == original  # Always true

② Open Vocabulary

tiktoken starts from individual bytes (256 possible values) and merges them using rules learned from pair frequencies during training. Because any text is ultimately a byte sequence, every Unicode string can be tokenized, including content the model was never trained on:

# New words, emoji, source code, edge cases: all tokenized without error
# (by default, encode() only rejects text containing special-token strings such as "<|endoftext|>")
enc.encode("😀🤖 tiktoken-v99 unknown_word_xyz")
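
To make the "start from bytes, merge by frequency" idea concrete, here is a toy byte-level BPE in pure Python. It is a teaching sketch under simplifying assumptions (no regex pre-tokenization, a tiny corpus, merges applied in learned order) and not tiktoken's implementation; `train_toy_bpe`, `apply_merge`, and `toy_encode` are hypothetical names.

```python
from collections import Counter


def apply_merge(seq: list[bytes], left: bytes, right: bytes) -> list[bytes]:
    """Replace every adjacent (left, right) pair in seq with the fused piece."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_toy_bpe(corpus: bytes, num_merges: int) -> list[tuple[bytes, bytes]]:
    """Learn merges by repeatedly fusing the most frequent adjacent pair."""
    seq = [bytes([b]) for b in corpus]  # start from single bytes
    merges: list[tuple[bytes, bytes]] = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (left, right), freq = pairs.most_common(1)[0]
        if freq < 2:
            break  # nothing repeats, so further merges would not compress
        merges.append((left, right))
        seq = apply_merge(seq, left, right)
    return merges


def toy_encode(text: str, merges: list[tuple[bytes, bytes]]) -> list[bytes]:
    """Encode text as byte pieces by replaying the learned merges in order."""
    seq = [bytes([b]) for b in text.encode("utf-8")]
    for left, right in merges:
        seq = apply_merge(seq, left, right)
    return seq


merges = train_toy_bpe(b"low lower lowest low low", num_merges=3)
pieces = toy_encode("lowly 😀", merges)
print(merges)
print(pieces)
# Unseen words and emoji still encode, and joining the pieces restores the text
assert b"".join(pieces).decode("utf-8") == "lowly 😀"
```

Because the fallback is always a single byte, no input is ever "out of vocabulary"; unfamiliar text simply costs more tokens.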

③ High Compression Ratio

Each token covers roughly 4 bytes, reducing sequence length and the cost of attention computation:

text = "The quick brown fox jumps over the lazy dog"
tokens = enc.encode(text)
print(f"Characters: {len(text)}, Tokens: {len(tokens)}")
# Characters: 43, Tokens: 9  → ~4.8 chars per token
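
The "4 bytes per token" figure is about bytes, while the snippet above divides by characters. For ASCII text the two are the same, but they diverge for accented or CJK text. A small sketch of the arithmetic, where `per_token_ratios` is a hypothetical helper and the token count is supplied by the caller (for example from `len(enc.encode(text))`):

```python
def per_token_ratios(text: str, token_count: int) -> tuple[float, float]:
    """Return (characters per token, UTF-8 bytes per token)."""
    if token_count <= 0:
        raise ValueError("token_count must be positive")
    n_chars = len(text)
    n_bytes = len(text.encode("utf-8"))
    return n_chars / token_count, n_bytes / token_count


# 9 is the token count reported for this sentence in the example above
chars, utf8_bytes = per_token_ratios("The quick brown fox jumps over the lazy dog", 9)
print(f"{chars:.2f} chars/token, {utf8_bytes:.2f} bytes/token")
# 4.78 chars/token, 4.78 bytes/token
```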

④ Subword Awareness

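BPE merges frequent byte sequences, and these often line up with morphological units such as stems and suffixes, which helps models generalize:

# Illustrative splits; the exact boundaries depend on the encoding
# "encoding" → ["encod", "ing"]
# "tokenization" → ["token", "ization"]
# Shared pieces let the model relate unseen words to familiar ones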
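
To see where a particular encoding actually places the boundaries, each token ID can be mapped back to the bytes it stands for. The helper below is a hedged sketch: `token_pieces` is a hypothetical name, and it takes the encode and single-token decode functions as parameters so the block runs without tiktoken installed.

```python
from typing import Callable


def token_pieces(
    text: str,
    encode: Callable[[str], list[int]],
    decode_one: Callable[[int], bytes],
) -> list[str]:
    """Return the text fragment behind each token ID, in order."""
    # errors="replace" because a single token may hold only part of a multi-byte character
    return [decode_one(tok).decode("utf-8", errors="replace") for tok in encode(text)]


# Usage with tiktoken (assumes `pip install tiktoken`):
#   import tiktoken
#   enc = tiktoken.get_encoding("o200k_base")
#   print(token_pieces("tokenization", enc.encode, enc.decode_single_token_bytes))
```

Running this against different encodings is a quick way to check whether a given word is one token or several, rather than relying on illustrative splits.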

Encoding Selection Guide

Using the wrong encoding means your token counts won't match what the API actually charges:

Encoding	Models	Vocabulary Size (approx.)
`o200k_base`	GPT-4o, GPT-4o-mini	~200,000
`cl100k_base`	GPT-4, GPT-3.5-turbo, text-embedding-3-*	~100,000
`p50k_base`	text-davinci-002, text-davinci-003, Codex models	~50,000
`r50k_base`	GPT-3 base models (e.g., davinci)	~50,000

import tiktoken

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count tokens in raw text for a given model (chat requests add per-message overhead)."""
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

print(count_tokens("Hello, world!"))   # 4
print(count_tokens("你好，世界！"))     # count varies by encoding; CJK text is split differently from English

Custom Special Tokens

Chat-format models use special tokens to delimit roles. You can extend an existing encoding to support them:

import tiktoken

cl100k_base = tiktoken.get_encoding("cl100k_base")

# Build a chat-aware encoding with custom special tokens
enc = tiktoken.Encoding(
    name="cl100k_im",
    pat_str=cl100k_base._pat_str,
    mergeable_ranks=cl100k_base._mergeable_ranks,
    special_tokens={
        **cl100k_base._special_tokens,
        "<|im_start|>": 100264,
        "<|im_end|>":   100265,
    }
)

text = "<|im_start|>user\nWhat is BPE?<|im_end|>"
tokens = enc.encode(text, allowed_special={"<|im_start|>", "<|im_end|>"})
print(f"Token count: {len(tokens)}")
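
A special token maps to a single ID because it is matched as a whole string before ordinary BPE runs on the text around it. The sketch below illustrates that splitting step conceptually; it is not tiktoken's implementation, and `split_on_special` is a hypothetical helper.

```python
import re


def split_on_special(text: str, special_tokens: set[str]) -> list[tuple[str, bool]]:
    """Split text into (piece, is_special) pairs, keeping special tokens whole."""
    if not special_tokens:
        return [(text, False)] if text else []

    # Longest first, so a token that is a prefix of another cannot shadow it
    ordered = sorted(special_tokens, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(tok) for tok in ordered))

    pieces: list[tuple[str, bool]] = []
    pos = 0
    for match in pattern.finditer(text):
        if match.start() > pos:
            pieces.append((text[pos:match.start()], False))  # ordinary text, goes to BPE
        pieces.append((match.group(), True))                 # special token, one ID
        pos = match.end()
    if pos < len(text):
        pieces.append((text[pos:], False))
    return pieces


print(split_on_special(
    "<|im_start|>user\nWhat is BPE?<|im_end|>",
    {"<|im_start|>", "<|im_end|>"},
))
```

This is also why `allowed_special` matters: by default, `encode` raises a `ValueError` when the input contains special-token text, which guards against untrusted user input injecting control tokens into a prompt.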

Real-World Pattern: Token Budget Management

The most common use case — trim message history to fit within the context window before sending:

import tiktoken

def trim_messages_to_budget(
    messages: list[dict],
    model: str = "gpt-4o",
    max_tokens: int = 8000,
) -> list[dict]:
    """
    Trim conversation history so the total token count stays under budget.
    Preserves the system prompt; drops the oldest user/assistant turns first.
    """
    enc = tiktoken.encoding_for_model(model)

    def count(msgs: list[dict]) -> int:
        # Approximation: ~4 tokens of overhead per message (role marker, separators).
        # "content" can be None (e.g., tool-call messages), so fall back to "".
        total = sum(4 + len(enc.encode(m.get("content") or "")) for m in msgs)
        return total + 2  # ~2 tokens priming the reply

    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]

    while count(system + others) > max_tokens and others:
        others.pop(0)

    return system + others
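
The loop above recounts the whole history after every dropped message. For long conversations, a variant that walks from the newest message backward and counts each message once may be preferable. This is a sketch under the same approximate overhead assumptions; `trim_newest_first` and its parameters are hypothetical, and the token counter is passed in so the block runs without tiktoken.

```python
from typing import Callable


def trim_newest_first(
    messages: list[dict],
    count_tokens: Callable[[str], int],
    budget: int,
    per_message_overhead: int = 4,  # assumed approximation, as in the function above
    reply_priming: int = 2,         # assumed approximation, as in the function above
) -> list[dict]:
    """Keep system messages plus as many of the most recent turns as fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]

    def cost(message: dict) -> int:
        return per_message_overhead + count_tokens(message.get("content") or "")

    used = reply_priming + sum(cost(m) for m in system)
    kept: list[dict] = []
    for message in reversed(others):  # newest first
        c = cost(message)
        if used + c > budget:
            break  # everything older is dropped as well
        kept.append(message)
        used += c
    kept.reverse()  # restore chronological order
    return system + kept


# Usage with tiktoken (assumes `pip install tiktoken`):
#   import tiktoken
#   enc = tiktoken.encoding_for_model("gpt-4o")
#   trimmed = trim_newest_first(messages, lambda s: len(enc.encode(s)), budget=8000)
```

Both versions drop the oldest turns first; the difference is only that this one does a single pass over the history.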

Architecture: Why Is It So Fast?

tiktoken achieves its 3-6x speedup through a Python + Rust hybrid architecture:

tiktoken/
├── tiktoken/
│   ├── __init__.py       ← Public Python API
│   ├── core.py           ← Encoding class
│   ├── model.py          ← Model name → encoding name mapping
│   ├── registry.py       ← Encoding registration and caching
│   └── _educational.py   ← Pure-Python BPE for learning purposes
│
└── src/ (Rust)
    └── lib.rs            ← High-performance BPE core (exposed via PyO3)

Why Rust makes the difference:

GIL released during encoding: The Rust core releases Python's Global Interpreter Lock while it runs the merge loop, so batch encoding can use multiple threads
Low-overhead PyO3 bindings: Rust functions are callable from Python with minimal overhead
Vocabulary caching: The merge table is loaded once and held in memory across calls
Regex pre-tokenization: A fast regex splits text at word/punctuation boundaries before BPE, reducing the search space
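
The pre-tokenization step can be illustrated with Python's standard `re` module. The pattern below is a deliberately simplified stand-in (ASCII letters only); the patterns used by real encodings are longer and rely on Unicode property classes. `PRETOKENIZE` and `pretokenize` are hypothetical names.

```python
import re

# Simplified stand-in pattern: words, numbers, and punctuation runs,
# each optionally preceded by one space, plus runs of whitespace.
PRETOKENIZE = re.compile(r" ?[A-Za-z]+| ?\d+| ?[^\sA-Za-z\d]+|\s+")


def pretokenize(text: str) -> list[str]:
    """Split text into chunks; BPE merges are then applied within each chunk only."""
    return PRETOKENIZE.findall(text)


chunks = pretokenize("Hello, world! 42")
print(chunks)  # ['Hello', ',', ' world', '!', ' 42']
assert "".join(chunks) == "Hello, world! 42"  # the split loses nothing
```

Because merges never cross chunk boundaries, each chunk is a small, independent problem. That keeps the merge search short and makes repeated chunks, such as common words, good candidates for caching.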

Links and Resources

Official Resources

🌟 GitHub: openai/tiktoken
📚 OpenAI Cookbook tutorial: How to count tokens with tiktoken
🔧 Online tokenizer playground: platform.openai.com/tokenizer
🐛 Issue Tracker: github.com/openai/tiktoken/issues

Related Resources

OpenAI API Pricing — understand how token counts translate to costs
HuggingFace Tokenizers docs — for comparison with the broader ecosystem

Conclusion

tiktoken's value extends well beyond counting tokens. It's the translation layer between developers and GPT models — the component that determines what the model actually "sees." Mastering tiktoken means you can precisely control context windows, estimate costs before they hit your bill, and build LLM applications that behave predictably at the boundaries.

Its Python + Rust architecture is also a design pattern worth studying: hand the performance-critical inner loop to a systems language, keep the ergonomics and flexibility in a dynamic language. Simple idea, significant payoff.
