Introduction
"How many tokens does your prompt actually use?"
This is article #85 in the Open Source Project of the Day series. Today's project is tiktoken, OpenAI's official tokenizer.
Before calling the OpenAI API, almost every developer runs into the same questions: How many tokens will this text consume? Will it exceed the context limit? How do I estimate the cost? The answers all trace back to a single step: tokenization.
tiktoken isn't just a "token counter." It's the actual tokenizer used by the GPT model family during both training and inference. Understanding it means understanding what the model truly "sees" as its input.
What You'll Learn
- The core mechanics of BPE (Byte Pair Encoding) and its 4 key properties
- Which encodings tiktoken supports and how to pick the right one
- How to count tokens precisely to avoid API truncation and surprises
- How to add custom special tokens for chat formats
- Why tiktoken's Rust + Python architecture is 3-6x faster than alternatives
Prerequisites
- Basic Python experience
- Familiarity with the OpenAI API (knowing tokens are the billing unit is enough)
- Basic NLP/tokenization concepts (optional)
Project Background
What Is tiktoken?
tiktoken is an open-source BPE (Byte Pair Encoding) tokenizer library released by OpenAI. Its core job is to convert text strings into sequences of token IDs (integer arrays) for language models to consume, and to reverse that process, converting token sequences back to the original text.
This isn't an experimental side project. It's the tokenizer powering GPT-3.5, GPT-4, GPT-4o, and more. When you send text through the API, the model doesn't see your words; it sees the token sequence tiktoken produces.
Author / Team
- Author: OpenAI
- Language breakdown: Python 64.9% + Rust 35.1%
- Adoption: Depended on by 184,000+ GitHub repositories
- Role: Core infrastructure for LLM application development
Project Stats
- GitHub Stars: 18,400+
- Forks: 1,500+
- Latest version: 0.13.0 (2026-05-15)
- License: MIT
Core Features
What It Does
tiktoken does three things:
- Encode: Convert text into a list of token IDs
- Decode: Convert token IDs back into the original text (lossless)
- Count: Precisely report how many tokens a piece of text uses
These three operations are foundational in LLM application development:
- Context management: Ensure prompts + history don't exceed the model's context window
- Cost estimation: Predict API costs before sending requests (OpenAI charges per token)
- Prompt engineering: Understand the actual "units" the model processes, and optimize around tokenization boundaries
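As a rough illustration of the context-management and cost-estimation points above, the arithmetic can be sketched as follows. The `Pricing` class, both helper names, and the prices are hypothetical placeholders, not real OpenAI rates; the token counts would come from `len(enc.encode(text))`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-million-token prices in USD (placeholders, not real rates)."""
    input_per_million: float
    output_per_million: float


def estimate_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Estimate the cost of one request from its token counts."""
    total = (input_tokens * pricing.input_per_million
             + output_tokens * pricing.output_per_million)
    return total / 1_000_000


def fits_context(prompt_tokens: int, reserved_for_reply: int, context_window: int) -> bool:
    """Check that the prompt plus the space reserved for the reply fits the window."""
    return prompt_tokens + reserved_for_reply <= context_window


example = Pricing(input_per_million=2.0, output_per_million=8.0)  # made-up numbers
print(estimate_cost(1_500, 500, example))   # 0.007
print(fits_context(7_000, 1_000, 8_192))    # True
```

Keeping the prices in a small data object makes it easy to swap in the current rates from the pricing page without touching the counting logic.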
Use Cases
- Token budget control before API calls
  - Check token count before sending to avoid truncation or `max_tokens` errors
- Smart document chunking for RAG
  - Split documents into chunks that stay under a token limit for retrieval-augmented generation
- Multi-turn conversation window management
  - Dynamically trim message history to keep the conversation within the model's context window
- Precise cost monitoring
  - Build token usage dashboards to track and optimize prompt efficiency
- Fine-tuning data preprocessing
  - Control sample length by token count when preparing training datasets
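The chunking use case can be sketched on plain lists of token IDs. This is a minimal sketch, not a tiktoken API: `chunk_token_ids` and its `overlap` parameter are hypothetical names, and the demo uses a stand-in list of integers where real code would pass the result of `enc.encode(document)`.

```python
from typing import Iterator, Sequence


def chunk_token_ids(
    token_ids: Sequence[int],
    max_tokens: int,
    overlap: int = 0,
) -> Iterator[list[int]]:
    """Yield windows of at most max_tokens IDs; consecutive windows share `overlap` IDs."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    for start in range(0, len(token_ids), step):
        yield list(token_ids[start:start + max_tokens])
        if start + max_tokens >= len(token_ids):
            break  # the last window already reached the end of the input


ids = list(range(10))  # stand-in for enc.encode(document)
print(list(chunk_token_ids(ids, max_tokens=4, overlap=1)))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

With tiktoken, each chunk could be turned back into text with `enc.decode(chunk)`. A chunk boundary may fall inside a multi-byte character, so the first or last character of a decoded chunk can come back as a replacement character.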
Quick Start
pip install tiktoken
import tiktoken
# Option 1: Get encoding by name (recommended for new projects)
enc = tiktoken.get_encoding("o200k_base")
# Option 2: Get encoding by model name (auto-matches the correct encoding)
enc = tiktoken.encoding_for_model("gpt-4o")
# Encode: text → list of token IDs
tokens = enc.encode("Hello, tiktoken!")
print(tokens) # [13225, 11, 384, 4963, 0]
print(len(tokens)) # 5 (this is the token count)
# Decode: token IDs → text
text = enc.decode(tokens)
print(text) # "Hello, tiktoken!"
# Lossless round-trip
assert enc.decode(enc.encode("Any text can be perfectly restored.")) == "Any text can be perfectly restored."
Key Properties
- High-performance Rust core
  - Core tokenization logic is implemented in Rust and runs 3-6x faster than a comparable open-source tokenizer (benchmark: 1 GB of text, vs `GPT2TokenizerFast`)
- Lossless reversibility
  - `decode(encode(text)) == text` always holds; no information is lost in the round-trip
- Universal coverage
  - Byte-level BPE handles any Unicode text, including content outside the training distribution
- High compression ratio
  - Each token corresponds to roughly 4 bytes of text on average, significantly reducing sequence length and computation
- Subword awareness
  - Recognizes common morphological units (e.g., `ing`, `tion`, `pre-`), helping models generalize across word forms
- Multiple built-in encodings
  - Ships with `o200k_base` (GPT-4o), `cl100k_base` (GPT-4/GPT-3.5-turbo), and legacy encodings
- Special token extension
  - Supports adding custom special tokens like `<|im_start|>` to adapt the tokenizer for chat formats
- Educational module
  - Built-in `_educational` module visualizes the BPE merging process step by step
Comparison with Alternatives
| Dimension | tiktoken | HuggingFace Tokenizers | SentencePiece |
|---|---|---|---|
| Speed | Fastest (Rust core) | Fast (Rust core) | Medium (C++) |
| OpenAI model alignment | Exact match | Approximate | N/A |
| Python API simplicity | Minimal | Medium | Medium |
| Model coverage | OpenAI series | Universal | Universal |
| Custom encodings | Supported | Supported | Supported |
Why choose tiktoken?
- The reference tokenizer for OpenAI models, so text token counts line up with what the API bills (chat requests add a few formatting tokens per message)
- Two-line API for token counting, with no boilerplate
- MIT license, so zero friction for commercial use
Deep Dive
BPE: The 4 Key Properties
BPE (Byte Pair Encoding) is tiktoken's core algorithm. Understanding its 4 properties tells you exactly what tiktoken can and cannot do.
① Lossless Reversibility
Token sequences reconstruct the original text with 100% fidelity:
original = "GPT-4o uses the o200k_base encoding."
assert enc.decode(enc.encode(original)) == original # Always true
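One caveat is worth illustrating: the round-trip guarantee above applies to the complete token sequence. Tokens are byte sequences, so a slice of a token list can end in the middle of a multi-byte UTF-8 character. The sketch below reproduces that effect with plain bytes and no tiktoken dependency; `split_and_decode` is a hypothetical helper, and the `errors="replace"` mode mirrors what `Encoding.decode` uses by default, as far as I know.

```python
def split_and_decode(text: str, cut: int) -> tuple[str, str]:
    """Decode the two halves of the UTF-8 bytes of `text` independently."""
    data = text.encode("utf-8")
    head = data[:cut].decode("utf-8", errors="replace")
    tail = data[cut:].decode("utf-8", errors="replace")
    return head, tail


text = "naïve café 🤖"
data = text.encode("utf-8")

# The whole byte sequence always round-trips.
print(data.decode("utf-8") == text)          # True

# Cutting two bytes before the end lands inside the 4-byte emoji.
head, tail = split_and_decode(text, len(data) - 2)
print(head + tail == text)                   # False
print("\ufffd" in head, "\ufffd" in tail)    # True True
```

When slicing token lists (for chunking or truncation), decoding each slice separately can therefore introduce U+FFFD replacement characters at the edges, even though the full sequence is lossless.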
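② Open Vocabulary
tiktoken starts from the 256 possible byte values and applies merges that were learned by frequency during training. Because every Unicode string is a sequence of UTF-8 bytes, any text can be tokenized, including content the model was never trained on:
# New words, emoji, source code, edge cases: all tokenized without error
enc.encode("🚀🤖 tiktoken-v99 unknown_word_xyz") # No unknown-token failures
# Caveat: by default encode() raises ValueError if the text contains a special-token string such as "<|endoftext|>"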
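The byte-level merge process described above can be sketched in pure Python. This is a toy illustration under stated assumptions, not tiktoken's implementation: the merge table `TOY_MERGES` and all function names are invented for the example, and real encodings load a learned table of many thousands of merges.

```python
def build_toy_ranks(merges: list[bytes]) -> dict[bytes, int]:
    """Map every single byte to an ID, then give each merged sequence the next ID."""
    ranks = {bytes([b]): b for b in range(256)}
    for offset, merged in enumerate(merges):
        ranks[merged] = 256 + offset
    return ranks


def bpe_encode(ranks: dict[bytes, int], data: bytes) -> list[int]:
    """Repeatedly merge the adjacent pair with the lowest rank until none applies."""
    parts = [bytes([b]) for b in data]
    while len(parts) > 1:
        best = None  # (rank, index) of the best mergeable pair found so far
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best is None or rank < best[0]):
                best = (rank, i)
        if best is None:
            break  # no adjacent pair is in the merge table
        i = best[1]
        parts[i:i + 2] = [parts[i] + parts[i + 1]]
    return [ranks[p] for p in parts]


def bpe_decode(ranks: dict[bytes, int], token_ids: list[int]) -> bytes:
    """Concatenate the byte sequence behind each token ID."""
    by_id = {token_id: seq for seq, token_id in ranks.items()}
    return b"".join(by_id[t] for t in token_ids)


# Hypothetical merge table, ordered from earliest (lowest rank) to latest.
TOY_MERGES = [b"in", b"ing", b"to", b"ke", b"toke", b"token"]
ranks = build_toy_ranks(TOY_MERGES)

print(bpe_encode(ranks, b"tokenizing"))  # [261, 105, 122, 257]

# Bytes with no merges (here, an emoji) fall back to single-byte tokens.
sample = "tokenizing 🤖".encode("utf-8")
print(bpe_decode(ranks, bpe_encode(ranks, sample)) == sample)  # True
```

Because every byte has its own ID, the encoder never needs an "unknown" token, which is what makes the vocabulary open.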
③ High Compression Ratio
Each token covers roughly 4 bytes on average, reducing sequence length and the cost of attention computation:
text = "The quick brown fox jumps over the lazy dog"
tokens = enc.encode(text)
print(f"Characters: {len(text)}, Tokens: {len(tokens)}")
# Characters: 43, Tokens: 9 (about 4.8 chars per token)
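To make the effect on attention cost concrete, here is a small back-of-the-envelope sketch. Both helper names are hypothetical, and the model is deliberately simplified: it assumes self-attention work grows with the square of the sequence length and compares a tokenized sequence against feeding raw bytes.

```python
def bytes_per_token(text: str, n_tokens: int) -> float:
    """Average number of UTF-8 bytes covered by one token."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return len(text.encode("utf-8")) / n_tokens


def relative_attention_cost(n_tokens: int, n_bytes: int) -> float:
    """Pairwise attention work for the token sequence relative to a byte-level sequence."""
    return (n_tokens / n_bytes) ** 2


text = "The quick brown fox jumps over the lazy dog"
n_tokens = 9  # token count reported in the example above

print(round(bytes_per_token(text, n_tokens), 2))                                # 4.78
print(round(relative_attention_cost(n_tokens, len(text.encode("utf-8"))), 3))   # 0.044
```

Under this simplified model, a roughly 4.8x shorter sequence needs only about 4% of the pairwise attention work of a byte-level input.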
④ Subword Awareness
BPE learns morphological patterns, helping models generalize:
# Illustrative splits; the exact boundaries depend on the encoding
# "encoding" → ["encod", "ing"]
# "tokenization" → ["token", "ization"]
# The model can infer the meaning of unseen compound words
Encoding Selection Guide
Using the wrong encoding means your token counts won't match what the API actually charges:
| Encoding | Models | Vocabulary Size |
|---|---|---|
| `o200k_base` | GPT-4o, GPT-4o-mini | 200,000 |
| `cl100k_base` | GPT-4, GPT-3.5-turbo, text-embedding-3-* | 100,000 |
| `p50k_base` | text-davinci-003 and older | 50,000 |
| `r50k_base` | GPT-3 (davinci) | 50,000 |
import tiktoken
def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count tokens for a given model, exactly matching the API's count."""
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

print(count_tokens("Hello, world!")) # 4
print(count_tokens("你好，世界！")) # 6
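`tiktoken.encoding_for_model` raises `KeyError` for model names it does not recognize, so a fallback is often useful. The sketch below is a hypothetical, dependency-free helper that mirrors the table above with prefix matching; it is not tiktoken's own registry, and falling back to `o200k_base` is an assumption that will only approximate counts for unknown models.

```python
# Prefixes are checked in order, so "gpt-4o" must come before "gpt-4".
# The mapping mirrors the table in this article; it is not tiktoken's registry.
ENCODING_BY_PREFIX: list[tuple[str, str]] = [
    ("gpt-4o", "o200k_base"),
    ("gpt-4", "cl100k_base"),
    ("gpt-3.5-turbo", "cl100k_base"),
    ("text-embedding-3-", "cl100k_base"),
    ("text-davinci-003", "p50k_base"),
    ("davinci", "r50k_base"),
]


def pick_encoding_name(model: str, default: str = "o200k_base") -> str:
    """Return the encoding name for a model, or `default` when nothing matches."""
    for prefix, encoding_name in ENCODING_BY_PREFIX:
        if model.startswith(prefix):
            return encoding_name
    return default  # assumption: counts for unknown models are only approximate


print(pick_encoding_name("gpt-4o-mini"))       # o200k_base
print(pick_encoding_name("gpt-4-0613"))        # cl100k_base
print(pick_encoding_name("some-future-model")) # o200k_base
```

The returned name can be passed to `tiktoken.get_encoding(...)`. An equivalent approach is to wrap `encoding_for_model` in `try`/`except KeyError` and fall back to a fixed encoding.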
Custom Special Tokens
Chat-format models use special tokens to delimit roles. You can extend an existing encoding to support them:
import tiktoken
cl100k_base = tiktoken.get_encoding("cl100k_base")
# Build a chat-aware encoding with custom special tokens
enc = tiktoken.Encoding(
    name="cl100k_im",
    pat_str=cl100k_base._pat_str,
    mergeable_ranks=cl100k_base._mergeable_ranks,
    special_tokens={
        **cl100k_base._special_tokens,
        "<|im_start|>": 100264,
        "<|im_end|>": 100265,
    }
)
text = "<|im_start|>user\nWhat is BPE?<|im_end|>"
tokens = enc.encode(text, allowed_special={"<|im_start|>", "<|im_end|>"})
print(f"Token count: {len(tokens)}")
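Conceptually, special tokens are matched as whole strings and mapped straight to their reserved IDs, while everything between them goes through ordinary encoding. The sketch below illustrates that idea only; `encode_with_special` is a hypothetical function, and the byte-level `toy_encode` is a stand-in for a real encoder such as `enc.encode_ordinary`.

```python
import re
from typing import Callable


def encode_with_special(
    text: str,
    special_tokens: dict[str, int],
    encode_ordinary: Callable[[str], list[int]],
) -> list[int]:
    """Map special-token strings to their IDs; encode everything else normally."""
    if not special_tokens:
        return encode_ordinary(text)
    # Longest markers first, so overlapping markers prefer the longest match.
    alternatives = sorted(special_tokens, key=len, reverse=True)
    pattern = re.compile("(" + "|".join(re.escape(t) for t in alternatives) + ")")
    ids: list[int] = []
    for piece in pattern.split(text):  # the capturing group keeps the markers
        if not piece:
            continue
        if piece in special_tokens:
            ids.append(special_tokens[piece])
        else:
            ids.extend(encode_ordinary(piece))
    return ids


def toy_encode(piece: str) -> list[int]:
    """Stand-in encoder: one token per UTF-8 byte."""
    return list(piece.encode("utf-8"))


specials = {"<|im_start|>": 100264, "<|im_end|>": 100265}
print(encode_with_special("<|im_start|>hi<|im_end|>", specials, toy_encode))
# [100264, 104, 105, 100265]
```

This also shows why the `allowed_special` argument matters: without an explicit opt-in, user-supplied text containing `<|im_end|>` could otherwise inject a control token into the prompt.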
Real-World Pattern: Token Budget Management
The most common use case: trim message history to fit within the context window before sending.
import tiktoken
def trim_messages_to_budget(
    messages: list[dict],
    model: str = "gpt-4o",
    max_tokens: int = 8000,
) -> list[dict]:
    """
    Trim conversation history so the total token count stays under budget.
    Preserves the system prompt; drops the oldest user/assistant turns first.
    """
    enc = tiktoken.encoding_for_model(model)

    def count(msgs: list[dict]) -> int:
        # Approximation: each message carries ~4 tokens of overhead (role marker, separators)
        total = sum(4 + len(enc.encode(m.get("content") or "")) for m in msgs)
        return total + 2  # ~2 tokens priming the reply

    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]
    while count(system + others) > max_tokens and others:
        others.pop(0)  # drop the oldest non-system message
    return system + others
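The loop above re-encodes the whole history on every iteration. For long conversations, a single pass from the newest message backwards gives the same result while encoding each message once. This is a sketch under the same overhead assumptions (4 tokens per message, 2 for reply priming); `trim_newest_first` and the injected `count_tokens` callable are hypothetical names, and the demo uses a word counter as a stand-in for `lambda s: len(enc.encode(s))`.

```python
from typing import Callable


def trim_newest_first(
    messages: list[dict],
    count_tokens: Callable[[str], int],
    max_tokens: int,
    per_message_overhead: int = 4,  # approximate, as in the function above
    reply_priming: int = 2,         # approximate, as in the function above
) -> list[dict]:
    """Keep system messages plus the longest run of most recent turns that fits."""
    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]

    def cost(message: dict) -> int:
        return per_message_overhead + count_tokens(message.get("content") or "")

    used = reply_priming + sum(cost(m) for m in system)
    kept: list[dict] = []
    for message in reversed(others):  # newest first
        message_cost = cost(message)
        if used + message_cost > max_tokens:
            break  # this turn and everything older is dropped
        used += message_cost
        kept.append(message)
    kept.reverse()  # restore chronological order
    return system + kept


history = [
    {"role": "system", "content": "be brief"},
    {"role": "user", "content": "one two three"},
    {"role": "assistant", "content": "four five"},
    {"role": "user", "content": "six"},
]
word_count = lambda s: len(s.split())  # stand-in for a real token counter
trimmed = trim_newest_first(history, word_count, max_tokens=20)
print([m["role"] for m in trimmed])  # ['system', 'assistant', 'user']
```

Injecting the counter keeps the trimming logic independent of any particular tokenizer, which also makes it easy to unit test.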
Architecture: Why Is It So Fast?
tiktoken achieves its 3-6x speedup through a Python + Rust hybrid architecture:
tiktoken/
├── tiktoken/
│   ├── __init__.py       ← Public Python API
│   ├── core.py           ← Encoding class
│   ├── model.py          ← Model name → encoding name mapping
│   ├── registry.py       ← Encoding registration and caching
│   └── _educational.py   ← Pure-Python BPE for learning purposes
│
└── src/ (Rust)
    └── lib.rs            ← High-performance BPE core (exposed via PyO3)
Why Rust makes the difference:
- GIL released during encoding: the Rust core runs the inner merge loop outside Python's Global Interpreter Lock, so batch encoding can use multiple threads
- Low-overhead PyO3 bindings: Rust functions are callable from Python with minimal call overhead
- Vocabulary caching: The merge table is loaded once and held in memory across calls
- Regex pre-tokenization: A fast regex splits text at word/punctuation boundaries before BPE, reducing the search space
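The regex pre-tokenization step in the last bullet can be sketched with the standard library. The pattern below is a simplified, ASCII-oriented approximation written for this example; tiktoken's real patterns are more elaborate and rely on Unicode property classes, so the name `SIMPLE_SPLIT` and its exact splits are assumptions.

```python
import re

# Simplified GPT-2-style splitter: contractions, words, numbers, punctuation,
# and whitespace. A leading space is kept attached to the piece that follows it.
SIMPLE_SPLIT = re.compile(
    r"'(?:s|t|re|ve|m|ll|d)"   # common English contractions
    r"| ?[A-Za-z]+"            # a word, optionally preceded by one space
    r"| ?\d+"                  # a number, optionally preceded by one space
    r"| ?[^\sA-Za-z\d]+"       # punctuation and other symbols
    r"|\s+(?!\S)"              # whitespace that is not followed by text
    r"|\s+"                    # any remaining whitespace
)


def pre_tokenize(text: str) -> list[str]:
    """Split text into pieces; BPE merges are then applied inside each piece only."""
    return SIMPLE_SPLIT.findall(text)


print(pre_tokenize("Hello, world!"))  # ['Hello', ',', ' world', '!']
print(pre_tokenize("don't stop"))     # ['don', "'t", ' stop']
```

Because merges never cross piece boundaries, each piece is short and can be encoded independently, which keeps the merge loop's search space small.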
Links and Resources
Official Resources
- π GitHub: openai/tiktoken
- π OpenAI Cookbook tutorial: How to count tokens with tiktoken
- π§ Online tokenizer playground: platform.openai.com/tokenizer
- π Issue Tracker: github.com/openai/tiktoken/issues
Related Resources
- OpenAI API Pricing β understand how token counts translate to costs
- HuggingFace Tokenizers docs β for comparison with the broader ecosystem
Conclusion
tiktoken's value extends well beyond counting tokens. It's the translation layer between developers and GPT models, the component that determines what the model actually "sees." Mastering tiktoken means you can precisely control context windows, estimate costs before they hit your bill, and build LLM applications that behave predictably at the boundaries.
Its Python + Rust architecture is also a design pattern worth studying: hand the performance-critical inner loop to a systems language, keep the ergonomics and flexibility in a dynamic language. Simple idea, significant payoff.