Robert Imbeault

Why Token Counting in Multi-LLM Systems Is Harder Than You Think

When we set out to build our adaptive context window management component, we ran into a problem that sounds deceptively simple: how do you manage context windows when your system routes requests across multiple LLM providers?

The Core Problem
Each model has its own tokenizer, context window, and pricing rules. The same text is not "the same" across providers. OpenAI might count a prompt as 1,200 tokens; Claude might see it as 1,450. A chat session that fits comfortably in one model can silently exceed limits or cost significantly more in another.

This creates real problems when you switch providers mid-conversation. The new model has to ingest the full conversation history again — but since each model counts that context differently, you can hit:

  • Unexpected context-window overflow: the conversation that fit before now breaches the limit
  • Inconsistent truncation: different models truncate at different points, changing what context the model sees
  • Hard-to-predict routing failures: your router makes decisions based on one token count, but the model uses another
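The mismatch is easy to demonstrate with a toy example. The chars-per-token ratios below are made-up stand-ins (real counts come from each provider's actual tokenizer), but they show the shape of the failure: the same history fits one model's window and silently overflows another's.

```python
# Hypothetical per-provider ratios -- stand-ins for real tokenizers,
# used only to illustrate that providers count the same text differently.
CHARS_PER_TOKEN = {"provider_a": 4.0, "provider_b": 3.3}
CONTEXT_LIMIT = {"provider_a": 8192, "provider_b": 8192}

def estimate_tokens(text: str, provider: str) -> int:
    """Crude per-provider estimate; stands in for a real tokenizer."""
    return int(len(text) / CHARS_PER_TOKEN[provider])

history = "x" * 30_000  # ~7,500 tokens on provider_a, ~9,090 on provider_b

fits_a = estimate_tokens(history, "provider_a") <= CONTEXT_LIMIT["provider_a"]
fits_b = estimate_tokens(history, "provider_b") <= CONTEXT_LIMIT["provider_b"]
# Same text, same nominal 8K window: fits_a is True, fits_b is False.
```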

Why a Single 'Token Estimate' Doesn't Cut It
The tempting solution is to maintain a single token count with a safety margin. The problem: OpenAI, Claude, Gemini, Cohere, xAI, and others don't tokenize text the same way. A single estimate will be wrong in both directions — undercount and you risk failures; overcount and you truncate too aggressively, degrading conversation quality unnecessarily.

How We Solved It
The answer is making token counting provider-aware. Instead of a single universal estimate, the context management layer measures each prompt the way the specific target model will measure it. The router uses this measurement before the request is sent.
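A minimal sketch of what "provider-aware" means in code: a registry mapping each provider to its own counting function, consulted by the router before dispatch. The lambdas here are crude stand-ins; in a real system each entry would wrap the provider's actual tokenizer or token-counting endpoint, and the `reserve` parameter (headroom for the model's response) is an assumption of this sketch, not our exact implementation.

```python
from typing import Callable, Dict

# Hypothetical registry: each entry would wrap the real tokenizer for
# that provider. The lambdas below are illustrative stand-ins only.
TOKENIZERS: Dict[str, Callable[[str], int]] = {
    "openai": lambda t: len(t) // 4,     # stand-in, not the real BPE
    "anthropic": lambda t: len(t) // 3,  # stand-in
}

def count_for(provider: str, text: str) -> int:
    """Measure text the way the *target* model will measure it."""
    return TOKENIZERS[provider](text)

def can_route(provider: str, prompt: str, limit: int, reserve: int = 512) -> bool:
    """Router pre-flight check: does the prompt, plus headroom
    reserved for the response, fit the target model's window?"""
    return count_for(provider, prompt) + reserve <= limit
```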

In practice this means the system:

  • Knows when a conversation is approaching the edge of a model's context window
  • Trims or compresses history intelligently, instead of blindly chopping from the front
  • Avoids expensive overages from miscounted tokens
  • Keeps model-switching complexity invisible to the end user
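The trimming step above can be sketched as follows. This is a simplified version under stated assumptions: messages are dicts with `role` and `content` keys, the system prompt is always preserved, and whole turns are dropped oldest-first against the target provider's own counter (real compression strategies, like summarizing dropped turns, are out of scope here).

```python
def trim_history(messages, provider_count, limit):
    """Drop the oldest non-system turns until the conversation fits.

    messages: list of {"role": ..., "content": ...} dicts
    provider_count: token counter for the *target* model
    limit: the target model's context budget for the prompt
    """
    def total(msgs):
        return sum(provider_count(m["content"]) for m in msgs)

    # Always keep the system prompt; only conversation turns are trimmed.
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    while turns and total(system + turns) > limit:
        turns.pop(0)  # drop the oldest turn first
    return system + turns
```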

The user sees a smooth conversation. The system handles the messy reality that every model speaks a slightly different "token language."

What We're Building Toward
This is one component of a larger routing layer. The goal: switch LLM providers mid-product — based on cost, capability, or availability — without that complexity leaking to users. Provider-aware token counting turns out to be a foundational piece of that.

Hope this helps.