AI Dev Hub

How token counters actually work in 2026, and when to trust them

Most "free token counter" tools in your bookmarks are not running the model's tokenizer. They're running a character-ratio estimate and labeling the output "tokens". For OpenAI's GPT family the official tokenizer is open and easy to ship in a browser. For Claude, Gemini, and most others it isn't. Here's what that means for your context-window math.

Up-front disclosure on this one: the tool I link to below is one I built. I got tired of paste-counter-paste-counter loops where the same input produced different numbers, and tired of tools that claim to support every model but quietly use one tokenizer for all of them. Free, client-side, no signup. I'm linking to it because it's what I use, and because I'd rather show you how it works than pitch it.

If you've ever opened three "GPT token counter" tabs and gotten three different numbers, you're not crazy and the tools aren't all wrong. They're doing different things and labeling them the same way. Knowing which is which makes the difference between "this prompt fits" and "the API will reject it at the boundary".

What "tokenization" actually does

A tokenizer takes raw text and splits it into the integer IDs the model actually consumes. Every model family ships its own vocabulary, trained on its own corpus. Same input string yields different token counts because the vocabularies differ.

OpenAI's GPT-4 family uses an encoding called cl100k_base. The newer GPT-4o, GPT-5, o3, and o4 models use o200k_base, a larger vocabulary tuned for multilingual and code-heavy input. Anthropic's Claude family uses its own vocabulary, exposed only through a server-side counting endpoint rather than a runnable library. Google's Gemini family is similar: server-side counting, no public local tokenizer at the time of writing (April 2026).
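
You can see the vocabulary difference directly if you have tiktoken installed. A minimal sketch; the sample string is arbitrary and the two counts will differ:

import tiktoken

text = 'Donaudampfschifffahrt {"user_id": 42, "emoji": "🚀"}'

# Same string, two vocabularies, two different counts.
for name in ("cl100k_base", "o200k_base"):
    enc = tiktoken.get_encoding(name)
    print(name, len(enc.encode(text)))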

The rule of thumb people quote, "1 token is about 4 characters of English", is fine for napkin math and wrong by 10 to 20 percent on real input. German tokenizes worse than English because compound words don't fit the English-trained vocabulary. Code with many short identifiers tokenizes better than prose. Emoji are usually 2 to 4 tokens each. JSON with verbose keys tokenizes much worse than minified JSON. If you're sitting near the context window, the rule of thumb will lie to you.

Exact vs estimated, the real divide

Free token counters fall into two camps.

Exact counters ship the model's actual tokenizer in the browser and run it on your input. The numbers match what the API will charge, give or take a token or two. This is feasible only when the tokenizer is published as a runnable library. For OpenAI's GPT and o-series, those libraries are tiktoken (Python) and gpt-tokenizer (JavaScript). Both are MIT-licensed and small enough to ship client-side.

Estimating counters apply a character-ratio heuristic. They divide the character count by some constant (3.5 to 4.0 depending on the model family) and round up. The number is roughly right on plain English. It can be 10 to 20 percent off on code, JSON, German, mixed scripts, or anything with unusual whitespace. If a counter is fast on a 100,000-character paste regardless of which model you pick, it's almost certainly estimating.
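
For reference, a character-ratio estimator is about this much code. A minimal sketch, assuming the 3.5 and 4.0 divisors quoted above; real tools may tune them differently per family:

import math

# Chars-per-token divisors from above; rough figures, not published constants.
RATIOS = {"claude": 3.5, "default": 4.0}

def estimate_tokens(text: str, family: str = "default") -> int:
    return math.ceil(len(text) / RATIOS.get(family, RATIOS["default"]))

print(estimate_tokens("Hello, roughly how many tokens am I?"))  # ballpark only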

The honest move is to label which is which. Most counters don't.

What the tool I built actually does

Since I'm linking to one of these, I owe you the spec.

aidevhub.io/token-counter uses gpt-tokenizer to compute exact counts for OpenAI's GPT-4, GPT-5, o3, and o4 model names. For every other family (Claude 3.x, Claude 4.x, Gemini, Llama, DeepSeek, Mistral, Grok) it uses a character-ratio estimate calibrated per family. Claude is chars / 3.5. The others are chars / 4.0. The output labels each row as either exact or estimate so you can tell which you're looking at.
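
In Python terms, the routing amounts to something like the sketch below. It's illustrative rather than the tool's actual source, it uses tiktoken where the browser version uses gpt-tokenizer, and the model names are placeholders:

import math
import tiktoken

def count(text: str, model: str) -> tuple[int, str]:
    # Exact path: OpenAI families have a runnable local tokenizer.
    if model.startswith(("gpt-4o", "gpt-5", "o3", "o4")):
        return len(tiktoken.get_encoding("o200k_base").encode(text)), "exact"
    if model.startswith("gpt-4"):
        return len(tiktoken.get_encoding("cl100k_base").encode(text)), "exact"
    # Estimate path: per-family character ratio, labeled as an estimate.
    ratio = 3.5 if model.startswith("claude") else 4.0
    return math.ceil(len(text) / ratio), "estimate"

print(count("hello world", "gpt-4o"))             # (n, 'exact')
print(count("hello world", "claude-sonnet-4-5"))  # (n, 'estimate')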

This is honest about what's possible. I can't ship Anthropic's tokenizer client-side because it isn't published as a local library, and I can't ship Google's either. The choice was to falsely claim "supports every tokenizer" (the easy lie) or to label estimates as estimates (the harder honesty). I picked the second.

For most context-budget math at 30 to 70 percent of the window, the estimate is close enough. For boundary cases at 95+ percent of the window, you want the actual tokenizer. The next section is how to get certainty when you need it.

How to get certainty when the number matters

If the count matters (you're at the boundary, or you're billing customers per-token), don't trust any browser tool, including mine. Use the model's own counting endpoint or library.

For OpenAI:

import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
with open("prompt.txt") as f:
    print(len(enc.encode(f.read())))

That's the source of truth. gpt-tokenizer in the browser uses the same encodings (cl100k_base for GPT-4 era, o200k_base for GPT-4o and newer), so a browser-based exact counter and tiktoken should match within a token or two. If they don't, your tiktoken version is probably stale and the model has updated its vocabulary since you last upgraded.

For Claude, Anthropic publishes a server-side counting endpoint, accessible via the SDK as client.messages.count_tokens() (or client.beta.messages.count_tokens(), depending on SDK version). It costs nothing to call but needs network access and an API key, and it returns the exact count the API will charge for that exact messages array, including system prompt and tool definitions.
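
A minimal sketch with the anthropic Python SDK; the model name is a placeholder, swap in whatever you actually send to:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

count = client.messages.count_tokens(
    model="claude-sonnet-4-5",  # placeholder; use the model you actually target
    system="You are a terse assistant.",
    messages=[{"role": "user", "content": "How many tokens is this?"}],
)
print(count.input_tokens)  # the exact input count the API would bill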

For Gemini, the SDK exposes model.count_tokens() which similarly calls Google's server.
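
A similarly minimal sketch with the google-generativeai SDK; the model name is again a placeholder, and the newer google-genai package spells this slightly differently:

import google.generativeai as genai

genai.configure(api_key="YOUR_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")  # placeholder model name
print(model.count_tokens("How many tokens is this?").total_tokens)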

The post-call usage field on every modern API is also authoritative. After your call, the response includes input_tokens and output_tokens as the actual billed counts. If your local count and the API's usage consistently disagree, your local tokenizer is the one that's wrong.
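
Reading it is one attribute access. Anthropic shown here as a sketch; OpenAI exposes an equivalent usage object with its own field names:

import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model name
    max_tokens=256,
    messages=[{"role": "user", "content": "Summarize this in one line."}],
)
print(response.usage.input_tokens, response.usage.output_tokens)  # billed counts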

Where token counts and API math diverge

A counter on raw text isn't the full picture for an API call. Three things eat budget that a naive counter doesn't see:

  • System prompt and tool definitions count. Every modern API includes them in the input total. If you're counting only the user message, you're under-counting.
  • Message structure adds overhead. Each message in a chat-format request costs a few tokens for the role markers and separators, on top of the content. OpenAI documents this; Anthropic does too. It's small (3 to 6 tokens per message) but at scale it matters (see the sketch after this list).
  • Output tokens are a separate budget. The 200,000 number you see in Claude's docs is the input window. Output is configured separately. Claude 4 family has a third configurable budget for thinking tokens. Always check the model's docs for the specific split.
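
A sketch of budget math that counts the whole request rather than just the user turn. It only works where a local tokenizer exists (OpenAI encodings here), and the 4-token overhead is an assumed value inside the 3 to 6 range above, not a documented constant:

import tiktoken

PER_MESSAGE_OVERHEAD = 4  # assumption: a value in the 3-6 token range above

def rough_request_tokens(system: str, messages: list[dict], encoding: str = "o200k_base") -> int:
    enc = tiktoken.get_encoding(encoding)
    total = len(enc.encode(system))  # system prompt counts against the input window
    for m in messages:
        total += len(enc.encode(m["content"])) + PER_MESSAGE_OVERHEAD
    return total

print(rough_request_tokens("You are terse.", [{"role": "user", "content": "Hi"}]))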

A browser counter that gives you a single number against a single model is a useful sanity check, not a complete budget calculation.

The compact summary

Counter type | What it does | Accuracy | When to use
tiktoken (Python) | Runs OpenAI's official tokenizer locally | Exact for GPT and o-series | Boundary cases, prod budget math
gpt-tokenizer (JS) | Same vocabularies, browser-shippable | Exact for GPT and o-series | Browser tools, paste-and-count UIs
Anthropic count_tokens | Server-side API call | Exact for Claude, includes message overhead | When the count matters and you have a key
Gemini count_tokens | Server-side API call | Exact for Gemini, includes message overhead | When the count matters and you have a key
Character-ratio estimate | chars / 3.5 or chars / 4.0 | Within 10 to 20 percent on most input | Quick sanity check, no key needed

A few small habits that pay off

After watching too many "but my count said it'd fit" boundary failures, here are the three habits I've stuck with:

  1. Count against the actual target model, not "GPT-4 close enough". Different vocabularies give different numbers on identical input. If you're sending to Claude 4.6, count with Anthropic's tokenizer.
  2. Minify JSON before sending. Pretty-printed JSON spends tokens on whitespace the model doesn't need. Your editor reads the indented version; the model reads the minified one. It's a one-liner to script in your client (sketch after this list).
  3. Log token counts on every prod call and graph the average weekly. If your average prompt size starts creeping up because someone added a new few-shot example, you'll see it before it tips over the budget. Costs about 10 lines of code per service.
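
The minify step from habit 2, assuming the payload lives in a valid payload.json file:

import json

pretty = open("payload.json").read()  # what your editor sees
minified = json.dumps(json.loads(pretty), separators=(",", ":"))
print(len(pretty), "->", len(minified))  # same data, fewer characters and tokens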

FAQ

Q: Are there official tokenizers I can run locally for every model?
A: Only OpenAI publishes one as a runnable library (tiktoken in Python, gpt-tokenizer in JS). Anthropic and Google publish counting as server APIs only. If a third-party tool claims to do exact tokenization for Claude or Gemini in your browser, it's almost certainly estimating, no matter what the marketing says.

Q: Why does the count change when I add a system prompt?
A: Because the system prompt is part of the input, and so are tool definitions if you're using tool-use APIs. The input window covers the entire request payload, not just the user turn, which trips up anyone counting only their user message.

Q: How accurate is the post-call usage field?
A: It's the source of truth. That's what was billed. Counters before the call are estimates of what usage will say. They should match within 1 to 2 tokens if your local tokenizer matches the model's current version. Consistent drift means your local library is stale.

Q: Does whitespace really matter that much?
A: Yes, on text-heavy input. Repeated newlines and indentation are often single tokens each, but they add up. A pretty-printed 5,000-line JSON file can use noticeably more tokens than the same JSON minified, with no information loss. If you're trimming for budget, that's the first place to look.

Q: What about thinking tokens on Claude 4 and reasoning tokens on o-series?
A: Separate budget on both. Claude 4 family has a configurable thinking token budget independent of input and output. OpenAI's o-series has reasoning tokens that count against output. Check the specific model's docs because the rules vary by version.


Written with AI assistance and human review.
