Robert Imbeault

Why Token Counting in Multi-LLM Systems Is Harder Than You Think

When we set out to build our adaptive context window management component, we ran into a problem that sounds deceptively simple: how do you manage context windows when your system routes requests across multiple LLM providers?

The Core Problem
Each model has its own tokenizer, context window, and pricing rules. The same text is not "the same" across providers. OpenAI might count a prompt as 1,200 tokens; Claude might see it as 1,450. A chat session that fits comfortably in one model can silently exceed limits or cost significantly more in another.

This creates real problems when you switch providers mid-conversation. The new model has to ingest the full conversation history again — but since each model counts that context differently, you can hit:

  • Unexpected context-window overflow: the conversation that fit before now breaches the limit
  • Inconsistent truncation: different models truncate at different points, changing what context the model sees
  • Hard-to-predict routing failures: your router makes decisions based on one token count, but the model uses another
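The mismatch is easy to demonstrate with a toy example. The chars-per-token ratios below are made-up stand-ins (real counts come from each provider's actual tokenizer), but they show the shape of the failure: the same history fits one model's window and silently overflows another's.

```python
# Hypothetical per-provider ratios -- stand-ins for real tokenizers,
# used only to illustrate that providers count the same text differently.
CHARS_PER_TOKEN = {"provider_a": 4.0, "provider_b": 3.3}
CONTEXT_LIMIT = {"provider_a": 8192, "provider_b": 8192}

def estimate_tokens(text: str, provider: str) -> int:
    """Crude per-provider estimate; stands in for a real tokenizer."""
    return int(len(text) / CHARS_PER_TOKEN[provider])

history = "x" * 30_000  # ~7,500 tokens on provider_a, ~9,090 on provider_b

fits_a = estimate_tokens(history, "provider_a") <= CONTEXT_LIMIT["provider_a"]
fits_b = estimate_tokens(history, "provider_b") <= CONTEXT_LIMIT["provider_b"]
# Same text, same nominal 8K window: fits_a is True, fits_b is False.
```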

Why a Single 'Token Estimate' Doesn't Cut It
The tempting solution is to maintain a single token count with a safety margin. The problem: OpenAI, Claude, Gemini, Cohere, xAI, and others don't tokenize text the same way. A single estimate will be wrong in both directions — undercount and you risk failures; overcount and you truncate too aggressively, degrading conversation quality unnecessarily.

How We Solved It
The answer is making token counting provider-aware. Instead of a single universal estimate, the context management layer measures each prompt the way the specific target model will measure it. The router uses this measurement before the request is sent.
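A minimal sketch of what "provider-aware" means in code: a registry mapping each provider to its own counting function, consulted by the router before dispatch. The lambdas here are crude stand-ins; in a real system each entry would wrap the provider's actual tokenizer or token-counting endpoint, and the `reserve` parameter (headroom for the model's response) is an assumption of this sketch, not our exact implementation.

```python
from typing import Callable, Dict

# Hypothetical registry: each entry would wrap the real tokenizer for
# that provider. The lambdas below are illustrative stand-ins only.
TOKENIZERS: Dict[str, Callable[[str], int]] = {
    "openai": lambda t: len(t) // 4,     # stand-in, not the real BPE
    "anthropic": lambda t: len(t) // 3,  # stand-in
}

def count_for(provider: str, text: str) -> int:
    """Measure text the way the *target* model will measure it."""
    return TOKENIZERS[provider](text)

def can_route(provider: str, prompt: str, limit: int, reserve: int = 512) -> bool:
    """Router pre-flight check: does the prompt, plus headroom
    reserved for the response, fit the target model's window?"""
    return count_for(provider, prompt) + reserve <= limit
```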

In practice this means the system:

  • Knows when a conversation is approaching the edge of a model's context window
  • Trims or compresses history intelligently, instead of blindly chopping from the front
  • Avoids expensive overages from miscounted tokens
  • Keeps model-switching complexity invisible to the end user
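The trimming step above can be sketched as follows. This is a simplified version under stated assumptions: messages are dicts with `role` and `content` keys, the system prompt is always preserved, and whole turns are dropped oldest-first against the target provider's own counter (real compression strategies, like summarizing dropped turns, are out of scope here).

```python
def trim_history(messages, provider_count, limit):
    """Drop the oldest non-system turns until the conversation fits.

    messages: list of {"role": ..., "content": ...} dicts
    provider_count: token counter for the *target* model
    limit: the target model's context budget for the prompt
    """
    def total(msgs):
        return sum(provider_count(m["content"]) for m in msgs)

    # Always keep the system prompt; only conversation turns are trimmed.
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    while turns and total(system + turns) > limit:
        turns.pop(0)  # drop the oldest turn first
    return system + turns
```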

The user sees a smooth conversation. The system handles the messy reality that every model speaks a slightly different "token language."

What We're Building Toward
This is one component of a larger routing layer. The goal: switch LLM providers mid-product — based on cost, capability, or availability — without that complexity leaking to users. Provider-aware token counting turns out to be a foundational piece of that.

Hope this helps.