Mukunda Rao Katta

Posted on May 25

Check Token Count Before You Hit the 413

#hermeschallenge #ai #python #agents

The 413 I should have seen coming

It was a document summarization pipeline. The task: summarize a batch of research papers, one API call per paper. Standard stuff.

The papers varied in length. Most were fine. A few were long. One was very long.

The pipeline hit a 413 context-length error halfway through the batch.

The fix was obvious in hindsight: check the token count before sending the request. Truncate if needed. Skip if the input is too large to be useful anyway.

But there was no token count check in the pipeline. There was no budget gate. The code just grabbed the text, stuffed it into the prompt, and fired the request. That worked fine for 95% of the inputs. The edge cases blew up at runtime.

A simple chars/4 estimate would have caught it. Four characters per token is a rough heuristic, but it is usually within about 15% for typical English prose. At the scale of "is this 200,000 tokens or 8,000 tokens," that accuracy is more than enough.
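To make that concrete, here is a quick back-of-the-envelope check. It is only a sketch: it assumes the roughly 15% error margin mentioned above and a hypothetical 8,000-token budget, and `estimate_bounds` is an illustrative name, not part of any library.

```python
import math
from typing import Tuple

BUDGET = 8_000  # hypothetical token budget for this illustration


def estimate_bounds(text_len: int, error: float = 0.15) -> Tuple[int, int]:
    """Return (low, high) token bounds around the chars/4 estimate."""
    estimate = text_len / 4
    low = math.floor(estimate * (1 - error))
    high = math.ceil(estimate * (1 + error))
    return low, high


# A very long paper: 800,000 characters is roughly 200,000 tokens.
low, high = estimate_bounds(800_000)
print("long paper over budget even at the low bound:", low > BUDGET)

# A short paper: 20,000 characters is roughly 5,000 tokens.
low, high = estimate_bounds(20_000)
print("short paper fits even at the high bound:", high <= BUDGET)
```

The decision only becomes uncertain for inputs whose estimate sits close to the budget, and those are the cases where a real tokenizer earns its keep.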

I did not turn that check into anything reusable. It seemed too simple to bother packaging, so I wrote it inline, ad hoc, and moved on.

Three months later, I needed the same check in a different project. And then another. At that point, it made sense to package it properly.

The shape of the fix

Install:

pip install prompt-token-counter

The most common usage is fits():

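import anthropic
from prompt_token_counter import fits

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
text = load_document(path)  # load_document and path stand in for your own loader

if not fits(text, max_tokens=8000):
    text = text[:32000]  # rough trim: 8,000 tokens x 4 chars per token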

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": text}]
)

fits() returns True if the estimated token count is within the budget, False otherwise. The estimate uses len(text) / 4, rounded up, by default.

If you want to count a full message list instead of a raw string:

from prompt_token_counter import count_messages

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_input},
]

total = count_messages(messages)
print(f"Estimated tokens: {total}")

The message counter adds per-role overhead to each message before summing. This approximates the role and delimiter tokens that chat templates wrap around each message, such as <|im_start|>role\n in ChatML-style prompts.
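As a rough illustration of that per-message overhead, the sketch below adds a fixed allowance to each message before summing. The 4-token overhead and the name `count_messages_sketch` are assumptions made for this example; the library's actual constant and internals may differ.

```python
import math
from typing import Callable, Dict, List


def chars_over_four(text: str) -> int:
    """Zero-dependency estimate: ceil(len(text) / 4)."""
    return math.ceil(len(text) / 4)


def count_messages_sketch(
    messages: List[Dict[str, str]],
    counter: Callable[[str], int] = chars_over_four,
    per_message_overhead: int = 4,  # assumed allowance for role/delimiter tokens
) -> int:
    """Sum the content estimate plus a fixed overhead for every message."""
    total = 0
    for message in messages:
        total += counter(message["content"]) + per_message_overhead
    return total


messages = [
    {"role": "system", "content": "You are a concise summarizer."},
    {"role": "user", "content": "Summarize the attached paper in three bullet points."},
]
print(f"Estimated tokens: {count_messages_sketch(messages)}")
```

The overhead matters most for conversations made of many short messages, where it can rival the content itself.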

If you want better accuracy and already have tiktoken installed:

import tiktoken
from prompt_token_counter import fits

enc = tiktoken.get_encoding("cl100k_base")
def tiktoken_counter(text: str) -> int:
    return len(enc.encode(text))

if not fits(text, max_tokens=8000, counter=tiktoken_counter):
    # handle oversized input
    ...

Pass any callable that takes a string and returns an int. The library does not care what is inside.
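To show what "any callable" means in practice, here is a minimal sketch of a gate with a pluggable counter. `fits_sketch` and `word_counter` are hypothetical names used only for illustration, and this is not the library's source; it simply mirrors the behavior described above.

```python
import math
from typing import Callable, Optional


def fits_sketch(
    text: str,
    max_tokens: int,
    counter: Optional[Callable[[str], int]] = None,
) -> bool:
    """Return True when the counted (or estimated) tokens are within max_tokens."""
    if counter is None:
        estimated = math.ceil(len(text) / 4)  # chars/4 default, rounded up
    else:
        estimated = counter(text)  # caller-supplied tokenizer wrapper
    return estimated <= max_tokens


def word_counter(text: str) -> int:
    """Stand-in counter for the demo: one 'token' per whitespace-separated word."""
    return len(text.split())


print(fits_sketch("a" * 32_000, max_tokens=8_000))
print(fits_sketch("one two three", max_tokens=2, counter=word_counter))
```

Because the counter is just a function, swapping tiktoken for another tokenizer later means changing one wrapper, not every call site.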

What it does NOT do

It does not install tiktoken or any other tokenizer. Zero runtime dependencies.
It does not make API calls. No network, no auth, no rate limits.
It does not truncate for you. It tells you whether the text fits. What you do about an oversized input is your call.
It does not claim to be exact. The chars/4 default is an estimate. If you need token-perfect counts for billing or strict context enforcement, use a real tokenizer.
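Since truncation is left to the caller, one possible approach is a binary search for the longest prefix that still fits. This is a sketch of something you would write yourself, not a feature of the library; `truncate_to_budget` is a hypothetical helper, and it assumes the counter never decreases as the prefix grows (true for chars/4, approximately true for real tokenizers).

```python
import math
from typing import Callable


def chars_over_four(text: str) -> int:
    """Zero-dependency estimate: ceil(len(text) / 4)."""
    return math.ceil(len(text) / 4)


def truncate_to_budget(
    text: str,
    max_tokens: int,
    counter: Callable[[str], int] = chars_over_four,
) -> str:
    """Return the longest prefix of text whose count is within max_tokens."""
    if counter(text) <= max_tokens:
        return text
    # Invariant: the prefix of length lo fits, the prefix of length hi does not.
    lo, hi = 0, len(text)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if counter(text[:mid]) <= max_tokens:
            lo = mid
        else:
            hi = mid
    return text[:lo]


trimmed = truncate_to_budget("a" * 100_000, max_tokens=8_000)
print(len(trimmed))
```

A prefix cut can land mid-sentence; snapping back to the last paragraph or sentence boundary is a reasonable refinement.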

Inside the lib: zero-dep default

The default estimator is len(text) / 4, rounded up.

That sounds almost too simple. But for English text going into a modern LLM, it holds up surprisingly well. Most English words are 4 to 7 characters. Subword tokenizers like BPE break common words into one or two tokens. The ratio works out close to 4 characters per token for typical prose.

The estimate drifts on:

Code (shorter tokens, more tokens per character)
Languages with non-ASCII characters (Python's len() counts code points, and neither code points nor UTF-8 bytes track token counts well)
Very short inputs where overhead tokens dominate

For those cases, swap in a real tokenizer. The library is designed for that.
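A small experiment makes the drift easier to reason about. The sketch below only prints what the heuristic sees (code points) next to the UTF-8 byte length; it does not run a tokenizer, so treat it as a prompt to measure your own data rather than as a benchmark. The sample strings are arbitrary.

```python
import math


def chars_over_four(text: str) -> int:
    """Zero-dependency estimate: ceil(len(text) / 4)."""
    return math.ceil(len(text) / 4)


samples = {
    "english": "The quick brown fox jumps over the lazy dog.",
    "chinese": "敏捷的棕色狐狸跳过了懒狗。",
    "code": "for i in range(10): print(i**2)",
}

for label, text in samples.items():
    code_points = len(text)                  # what the chars/4 heuristic divides
    utf8_bytes = len(text.encode("utf-8"))   # what goes over the wire
    print(label, code_points, utf8_bytes, chars_over_four(text))
```

For the Chinese sample the estimate is a quarter of the character count, which is likely an undercount with many tokenizers; that gap is the reason to plug in a real counter for such inputs.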

Inside the lib: BYO tokenizer design

The design decision was to make the tokenizer a parameter, not a dependency.

When you install a Python package, you get all of its dependencies. tiktoken is a good library. It is also 5MB+ on disk, it ships a compiled Rust extension (built from source when no prebuilt wheel matches your platform), and it takes noticeable time to cold-import. If you are running in a Lambda function or a container that cares about cold-start latency, you do not want that on every import of every module that touches token counting.

So the default is zero deps. If you are already using tiktoken or transformers or sentencepiece in your project, you pass a wrapper. You pay the import cost once, at module level, because you chose to, not because prompt-token-counter dragged it in.

# one-time setup at module level
import tiktoken
_enc = tiktoken.get_encoding("cl100k_base")
def my_counter(text: str) -> int:
    return len(_enc.encode(text))

# use it everywhere you need accurate counts
from prompt_token_counter import fits, count_messages
fits(some_text, 8000, counter=my_counter)
count_messages(msg_list, counter=my_counter)
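If some environments have tiktoken and others do not, one option is a small factory that uses the tokenizer when it is importable and falls back to chars/4 otherwise. This is a sketch under assumptions: `make_counter` is a hypothetical helper, and the `module_name` parameter exists mainly so the fallback path can be exercised.

```python
import importlib
import math
from typing import Callable


def chars_over_four(text: str) -> int:
    """Zero-dependency estimate: ceil(len(text) / 4)."""
    return math.ceil(len(text) / 4)


def make_counter(
    encoding_name: str = "cl100k_base",
    module_name: str = "tiktoken",
) -> Callable[[str], int]:
    """Return an exact counter if the tokenizer module imports, else chars/4."""
    try:
        tokenizer_module = importlib.import_module(module_name)
    except ImportError:
        return chars_over_four  # tokenizer not installed: use the estimate

    # Errors raised while loading the encoding itself are left to propagate.
    enc = tokenizer_module.get_encoding(encoding_name)

    def exact_counter(text: str) -> int:
        return len(enc.encode(text))

    return exact_counter


# Forcing the fallback path with a module name that does not exist.
fallback = make_counter(module_name="no_such_tokenizer_module")
print(fallback("abcdefgh"))
```

Call the factory once at module level and pass the result as counter=..., so the import cost is paid a single time.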

When this is useful

Before sending a long document. A quick fits() check before each API call costs almost nothing. It catches the 413 before it happens.

When building a context window manually. You are assembling messages from retrieved chunks, history, and a system prompt. You want to know the total before you send.
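For the manual context-window case, a greedy packer is often enough: keep adding chunks in ranked order until the next one would exceed the budget. The sketch below is one possible shape; `pack_chunks` and the `reserved` parameter (tokens set aside for the system prompt and history) are illustrative names, not library API.

```python
import math
from typing import Callable, List


def chars_over_four(text: str) -> int:
    """Zero-dependency estimate: ceil(len(text) / 4)."""
    return math.ceil(len(text) / 4)


def pack_chunks(
    chunks: List[str],
    budget: int,
    reserved: int = 0,
    counter: Callable[[str], int] = chars_over_four,
) -> List[str]:
    """Keep chunks in order while the running total stays within budget."""
    remaining = budget - reserved
    kept: List[str] = []
    for chunk in chunks:
        cost = counter(chunk)
        if cost > remaining:
            break  # stop at the first chunk that does not fit, preserving rank order
        kept.append(chunk)
        remaining -= cost
    return kept


retrieved = ["a" * 8, "b" * 8, "c" * 8]  # each estimates to 2 tokens
print(len(pack_chunks(retrieved, budget=5)))
```

Stopping at the first miss keeps the highest-ranked chunks contiguous; skipping and continuing would pack more text at the cost of that ordering.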

In a pipeline that processes variable-length inputs. Batch jobs, document processing, data labeling. The inputs are not all the same size. You want to skip or truncate the outliers before they hit the API.
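In a batch job, that decision can be expressed as a three-way triage: send as is, truncate, or skip. The sketch below assumes a hypothetical `skip_factor` threshold (inputs more than ten times over budget are skipped); both the threshold and the names are illustrative choices, not recommendations from the library.

```python
import math
from enum import Enum


class Action(Enum):
    SEND = "send"
    TRUNCATE = "truncate"
    SKIP = "skip"


def triage(text: str, max_tokens: int, skip_factor: float = 10.0) -> Action:
    """Classify an input using the chars/4 estimate before any API call."""
    estimated = math.ceil(len(text) / 4)
    if estimated <= max_tokens:
        return Action.SEND
    if estimated > max_tokens * skip_factor:
        return Action.SKIP  # so large that a trimmed version is unlikely to be useful
    return Action.TRUNCATE


for size in (4_000, 40_000, 400_000):
    print(size, triage("a" * size, max_tokens=8_000).value)
```

Logging the skipped inputs with their estimated size makes it easy to revisit them later with a chunked or map-reduce approach.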

When you do not need tiktoken. Most of the time, the chars/4 heuristic is close enough. You do not need an exact count to decide whether to truncate a 200,000-character document.

When NOT to use this

If you need exact token counts for billing audits, use the provider's tokenizer.

If you are building a tool that needs to count tokens across multiple models with different vocabularies, the chars/4 heuristic will not be consistent. Different tokenizers produce different counts for the same text. The BYO callable lets you handle that, but you have to wire it up per model.
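One way to do that per-model wiring is a small registry that maps a model name to its counter and falls back to chars/4 for anything unregistered. This is a sketch only: the registry, its function names, and the model name "example-model" are all hypothetical, and the word-splitting counter is a stand-in for a real tokenizer wrapper.

```python
import math
from typing import Callable, Dict

Counter = Callable[[str], int]


def chars_over_four(text: str) -> int:
    """Zero-dependency estimate: ceil(len(text) / 4)."""
    return math.ceil(len(text) / 4)


_COUNTERS: Dict[str, Counter] = {}


def register_counter(model: str, counter: Counter) -> None:
    """Associate a model name with the counter that matches its tokenizer."""
    _COUNTERS[model] = counter


def counter_for(model: str) -> Counter:
    """Look up the model's counter, defaulting to the chars/4 estimate."""
    return _COUNTERS.get(model, chars_over_four)


def word_counter(text: str) -> int:
    """Stand-in for a real tokenizer wrapper."""
    return len(text.split())


register_counter("example-model", word_counter)
print(counter_for("example-model")("three short words"))
print(counter_for("unregistered-model")("abcdefgh"))
```

The result of counter_for(model) can then be passed wherever a counter callable is accepted.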

If your inputs are mostly non-English text, test the heuristic before relying on it. Token counts for languages like Chinese or Arabic diverge significantly from the chars/4 assumption.

Install

pip install prompt-token-counter

No dependencies. Works with any Python 3.8+ environment.

Source: MukundaKatta/prompt-token-counter

Siblings

These libraries sit adjacent to this one in a larger agent-stack:

Lib	Boundary	Repo
agent-message-window	Uses token count to trim the conversation window when it exceeds the context limit	MukundaKatta/agent-message-window
agentfit	Full token-aware fitting strategy: picks messages to include or drop given a hard token budget	MukundaKatta/agentfit
llm-cost-cap	Preflight USD cost gate: estimates input and output cost before making the API call	MukundaKatta/llm-cost-cap
token-budget-py	Pool budget: tracks cumulative token spend across multiple calls and raises when the cap is hit	MukundaKatta/token-budget-py

prompt-token-counter is the lowest layer. It counts. The others build on top of that count to decide what to trim, what to cost-gate, or when to stop.

What's next

A few things I have considered but not shipped yet:

Per-model calibration. Different models use different tokenizers, and the chars/4 ratio shifts slightly between them. A lookup table of per-model correction factors would improve accuracy without adding a real tokenizer dependency.
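As a sketch of what that lookup table could look like: the ratios below are placeholders invented for illustration, not measurements, and the model names are hypothetical. Real values would have to be measured against each model's tokenizer on representative text.

```python
import math

# Placeholder chars-per-token ratios, for illustration only (not measured).
CHARS_PER_TOKEN = {
    "example-model-a": 4.0,
    "example-model-b": 3.5,
}
DEFAULT_CHARS_PER_TOKEN = 4.0


def calibrated_estimate(text: str, model: str) -> int:
    """Estimate tokens using a per-model ratio, falling back to chars/4."""
    ratio = CHARS_PER_TOKEN.get(model, DEFAULT_CHARS_PER_TOKEN)
    return math.ceil(len(text) / ratio)


print(calibrated_estimate("a" * 70, "example-model-b"))
print(calibrated_estimate("a" * 70, "some-other-model"))
```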

Streaming support. When streaming a response, you do not know the output token count upfront. A running estimate from partial output chunks would be useful for budget tracking mid-stream.
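A running estimate could be as simple as accumulating characters across chunks and converting once, which avoids the rounding error of estimating each small chunk separately. The class below is a hypothetical sketch of the idea, not shipped code.

```python
import math


class RunningEstimate:
    """Track an approximate token count as streamed text chunks arrive."""

    def __init__(self, chars_per_token: float = 4.0) -> None:
        self.chars_per_token = chars_per_token
        self.chars_seen = 0

    def add(self, chunk: str) -> int:
        """Record a chunk and return the updated token estimate."""
        self.chars_seen += len(chunk)
        return self.tokens

    @property
    def tokens(self) -> int:
        # Convert the running character total once, rounding up.
        return math.ceil(self.chars_seen / self.chars_per_token)


estimate = RunningEstimate()
for chunk in ["The sum", "mary of", " the paper", " is..."]:
    estimate.add(chunk)
print(estimate.tokens)
```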

Integration with cost calculators. prompt-token-counter knows how many tokens. llm-cost-cap knows the per-token price. Wiring them together with a typed result (not just a bool) would make the preflight gate more informative.
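A typed result might look something like the sketch below. The `FitResult` dataclass, the `check` function, and the price argument are all hypothetical; the price is supplied by the caller, and nothing here reflects the actual API of prompt-token-counter or llm-cost-cap.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FitResult:
    estimated_tokens: int
    max_tokens: int
    estimated_cost_usd: float

    @property
    def fits(self) -> bool:
        return self.estimated_tokens <= self.max_tokens

    @property
    def overflow(self) -> int:
        """Tokens over budget, or 0 when the input fits."""
        return max(0, self.estimated_tokens - self.max_tokens)


def check(text: str, max_tokens: int, usd_per_million_tokens: float) -> FitResult:
    """Estimate tokens with chars/4 and attach an input-cost estimate."""
    estimated = math.ceil(len(text) / 4)
    cost = estimated * usd_per_million_tokens / 1_000_000
    return FitResult(estimated, max_tokens, cost)


# The price below is an arbitrary example value, not a real rate.
result = check("a" * 40_000, max_tokens=8_000, usd_per_million_tokens=3.0)
print(result.fits, result.overflow, round(result.estimated_cost_usd, 4))
```

Returning the overflow alongside the boolean tells the caller how much to trim, not just that trimming is needed.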

For now, it does one thing: estimates whether your payload fits. That is all most pipelines need before they make the call.

DEV Community