- Book: Prompt Engineering Pocket Guide: Techniques for Getting the Most from LLMs
- Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
You've seen the spreadsheet. A finance partner asks for a per-vendor cost forecast on the same workload, and the engineer building it picks one tokenizer (usually tiktoken, because it pip-installs in five seconds) and multiplies. The Claude row, the GPT row, and the Gemini row all use that same number. The forecast goes out. A month later the actuals roll in: the Claude line lands twelve percent off, the Gemini line fifteen percent the other way, and nobody can explain why.
The reason is that no two of those vendors count the same text the same way. cl100k_base, o200k_base, Claude's tokenizer, and Gemini's SentencePiece share a family resemblance (byte-pair-style merges, vocabularies in the hundred-thousand range), but the merge tables, vocabulary sizes, and handling of code, whitespace, and non-Latin scripts all diverge. Same string in. Different integers out.
Below: a small Python script that asks each vendor for its own count, three cases where the numbers disagree enough to matter, and the working rule that falls out of all of it. Never trust a count from a tokenizer you don't actually call in production.
What each vendor actually uses
OpenAI publishes its tokenizers as the tiktoken library. There are two encodings you'll meet:
- cl100k_base: about 100k vocabulary, used by GPT-3.5-Turbo and GPT-4.
- o200k_base: about 200k vocabulary, used by GPT-4o and the o-series. Larger vocab, fewer tokens per English word, much better at non-English scripts.
Anthropic does not publish Claude's tokenizer (as of this writing). The model files aren't open and there's no tiktoken-style library you can pip-install. The supported way to know how many tokens a Claude request will cost is the token counting endpoint, POST /v1/messages/count_tokens, which accepts the same shape as a Messages request and returns the input-token total. Per Anthropic's docs at the time of writing, it is free to call and rate-limited separately from message creation. Use it. Don't approximate.
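If you aren't calling through the Python SDK, the raw endpoint is a single POST. A minimal sketch with requests, assuming the standard Messages headers; the model alias here is just an example:

```python
import os
import requests

resp = requests.post(
    "https://api.anthropic.com/v1/messages/count_tokens",
    headers={
        "x-api-key": os.environ["ANTHROPIC_API_KEY"],
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    },
    json={
        "model": "claude-opus-4-1",
        "messages": [{"role": "user", "content": "How many tokens is this?"}],
    },
)
print(resp.json()["input_tokens"])
```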
Google Gemini uses SentencePiece with a Unigram language model and a vocabulary that Google's docs put at roughly 256k at the time of writing. The Unigram model considers every possible segmentation of the input and picks the most probable one, a different decision procedure from BPE merges. The google-genai SDK exposes client.models.count_tokens(...), which calls the model's actual tokenizer server-side. It is also free to call under Google's current pricing.
Three tokenizers, three vocabulary sizes, two of them only reachable over the network. The implication for any cost estimate is uncomfortable: if you've been measuring with one and quoting the others, you've been guessing.
The script
Here is the smallest thing that will tell you the truth. Take a string, ask each vendor's tokenizer, print the counts side by side. Set ANTHROPIC_API_KEY and GOOGLE_API_KEY before running (tiktoken runs locally and needs no key):
```python
import os

import anthropic
import tiktoken
from google import genai

# Encoding and model names current as of this writing; swap in whatever
# you actually call in production.
OPENAI_ENC = "o200k_base"
CLAUDE_MODEL = "claude-opus-4-1"
GEMINI_MODEL = "gemini-2.0-flash"


def count_openai(text: str) -> int:
    # Local: tiktoken ships the merge tables, no network call needed.
    enc = tiktoken.get_encoding(OPENAI_ENC)
    return len(enc.encode(text))


def count_claude(text: str) -> int:
    # Remote: the count_tokens endpoint takes the same shape as a Messages request.
    client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
    resp = client.messages.count_tokens(
        model=CLAUDE_MODEL,
        messages=[{"role": "user", "content": text}],
    )
    return resp.input_tokens


def count_gemini(text: str) -> int:
    # Remote: count_tokens runs the model's own tokenizer server-side.
    client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
    resp = client.models.count_tokens(model=GEMINI_MODEL, contents=text)
    return resp.total_tokens


def report(label: str, text: str) -> None:
    o = count_openai(text)
    c = count_claude(text)
    g = count_gemini(text)
    print(f"{label:<14} chars={len(text):>5} "
          f"openai={o:>5} claude={c:>5} "
          f"gemini={g:>5}")


if __name__ == "__main__":
    prose = (
        "The dashboard is green. The inbox is red. "
        "Somewhere in production, a tool call has "
        "been retrying for six hours."
    )
    code = (
        "def retry(fn, attempts=3):\n"
        "    for i in range(attempts):\n"
        "        try: return fn()\n"
        "        except Exception: pass\n"
    )
    japanese = "こんにちは、世界。今日は良い天気ですね。"

    report("english_prose", prose)
    report("python_code", code)
    report("japanese", japanese)
```
A few details worth calling out.
tiktoken.get_encoding("o200k_base") is what you want for any GPT-4o-class model. If you're still pricing GPT-4 or 3.5, swap to cl100k_base; they will not return the same number. For Claude, count_tokens is a separate endpoint with its own rate limit, so you can hammer it during development without burning through the message-creation quota. For Gemini, count_tokens runs the model's tokenizer server-side, so the count you get back is what the billing system will see.
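If you would rather not hard-code the encoding name, tiktoken will resolve it from the model name, which removes one way to quote last year's encoding by accident:

```python
import tiktoken

# Resolve the encoding from the model you will actually call.
print(tiktoken.encoding_for_model("gpt-4o").name)  # o200k_base
print(tiktoken.encoding_for_model("gpt-4").name)   # cl100k_base
```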
The usage field on a real response is also worth checking once you ship. Claude returns usage.input_tokens and usage.output_tokens on every Messages response. Gemini returns usage_metadata with prompt_token_count and candidates_token_count. OpenAI returns usage.prompt_tokens and usage.completion_tokens. When the count from your local tokenizer disagrees with the usage field, the usage field wins; those are the integers the invoice is built from.
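One way to wire that reconciliation, sketched as a pair of helpers that take the raw response object from whichever SDK you called. The names billed_input_tokens and reconcile are invented here, and the two-percent tolerance is an arbitrary starting point:

```python
def billed_input_tokens(vendor: str, response) -> int:
    """Pull the vendor's own input-token count off a real response."""
    if vendor == "openai":
        return response.usage.prompt_tokens                  # Chat Completions
    if vendor == "claude":
        return response.usage.input_tokens                   # Messages API
    if vendor == "gemini":
        return response.usage_metadata.prompt_token_count    # generate_content
    raise ValueError(f"unknown vendor: {vendor}")


def reconcile(vendor: str, estimated: int, response, tolerance: float = 0.02) -> None:
    """Compare the pre-call estimate with what the vendor says it billed."""
    billed = billed_input_tokens(vendor, response)
    drift = abs(billed - estimated) / max(billed, 1)
    if drift > tolerance:
        # In production this would be a metric and an alert, not a print.
        print(f"{vendor}: estimated {estimated}, billed {billed}, drift {drift:.1%}")
```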
Where the numbers actually drift
Three cases worth knowing by heart, with realistic order-of-magnitude figures rather than invented exact ones. Run the script above on your own corpus to get the numbers that bind your contract.
English prose is the closest case. All three tokenizers land in the same zip code on plain English: the tiktoken README puts o200k_base at roughly four bytes per token on average, and Claude's tokenizer and Gemini's SentencePiece sit in the same range. You'll see single-digit-percent differences across the same paragraph. That's the case people generalize from, which is exactly why the other two cases bite.
Code blocks are where vocabulary differences show. Take a Python or TypeScript snippet: punctuation-heavy, indentation-aware, full of getUserById-style identifiers. It exercises merges that were optimized for each vendor's training corpus. o200k_base added a lot of code-friendly merges over cl100k_base, and on a 50-line file the newer encoding can report ten to twenty percent fewer tokens than the older one. Claude and Gemini tend to land between the two OpenAI encodings on the same input, each a few percent off either anchor. If your product ships code (a code-review bot, an IDE assistant, a doc-generator over a repo), this is the column that wrecks your forecast.
Non-Latin scripts are where it goes from "interesting" to "budget-breaking." Japanese, Chinese, and Korean text on cl100k_base runs at roughly one token per character, about four times worse than English on the same encoding (Tony Baloney, Towards Data Science). o200k_base's 200k-token vocabulary closes that gap meaningfully. Gemini's 256k SentencePiece vocabulary closes it further. If you're serving a Japanese support copilot and pricing it off GPT-3.5 numbers, your real bill can land meaningfully higher than your forecast (see the cited per-character data). Switching the same workload to Gemini or a GPT-4o-class model is a tokenization-economics call as much as a quality one.
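You can see the script-level effect without any API key, just by comparing OpenAI's own two encodings on the same Japanese sentence. The exact counts depend on tiktoken's current merge tables, so run it rather than trusting memory:

```python
import tiktoken

japanese = "こんにちは、世界。今日は良い天気ですね。"

for name in ("cl100k_base", "o200k_base"):
    enc = tiktoken.get_encoding(name)
    n = len(enc.encode(japanese))
    # Tokens per character is the figure that decides the CJK bill.
    print(f"{name}: {n} tokens, {n / len(japanese):.2f} tokens/char")
```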
The rule
There are two ways to get a token count for a request: ask the tokenizer that ships with the model, or ask the API that uses it. Anything else is a guess that happens to be close on English prose.
For OpenAI, that's tiktoken.encoding_for_model(name). Pick the encoding by the model you'll actually call, not by the encoding you remember from last year. For Claude, it's client.messages.count_tokens(...) against the same model you'll send the real request to. For Gemini, it's client.models.count_tokens(...) against the same model name. When the request returns, read the usage field and reconcile. Build the reconciliation into a daily job if your cost matters; the moment your local count and the vendor's count drift apart, something in your prompt-construction layer changed.
The cleanest version of this is a tiny count_tokens(vendor, model, text) shim in your codebase that knows which path to take per vendor. Build it once. Call it from your forecasting code, log it in production, and alert on disagreement above a percent or two.
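A sketch of that shim, under the same assumptions as the script above; the SDK clients read their keys from the environment, and client caching and error handling are left out:

```python
import anthropic
import tiktoken
from google import genai


def count_tokens(vendor: str, model: str, text: str) -> int:
    """One entry point for forecasting code, production logging, and alerts."""
    if vendor == "openai":
        enc = tiktoken.encoding_for_model(model)   # local, no network call
        return len(enc.encode(text))
    if vendor == "claude":
        client = anthropic.Anthropic()             # reads ANTHROPIC_API_KEY
        resp = client.messages.count_tokens(
            model=model,
            messages=[{"role": "user", "content": text}],
        )
        return resp.input_tokens
    if vendor == "gemini":
        client = genai.Client()                    # reads GOOGLE_API_KEY / GEMINI_API_KEY
        resp = client.models.count_tokens(model=model, contents=text)
        return resp.total_tokens
    raise ValueError(f"unknown vendor: {vendor}")
```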
If this was useful
The token-cost surprise is one of those problems that only exists because every vendor invented their own merge table. It's exactly the kind of thing the Prompt Engineering Pocket Guide spends time on: how prompts are actually counted, where the cost lives, and which tokenization choices end up in the invoice. If you ship across more than one provider, that knowledge pays for itself the first time a finance partner asks you to defend a forecast.
If you'd rather see this kind of cross-vendor wiring done right inside an editor, Hermes IDE is the project I work on: an IDE for people who develop with Claude Code and other AI coding tools, where the per-vendor accounting is built in instead of bolted on.
