DEV Community

Andrei P.

AI Isn’t Just Biased. It’s Fragmented — And You’re Paying for It.

When people talk about AI bias, they usually mean harmful outputs or unfair predictions.

But there’s a deeper layer most people ignore.

Before a model understands your sentence, it breaks it into tokens.

And that process quietly determines:

  • how much you pay
  • how much context you get
  • how well the model reasons

If you’re a user of a less common language, you may literally pay more — for worse performance.


Tokenization Isn’t Neutral

[Image: tokenized text in Romanian]

Large language models don’t read words — they read tokens. A tokenizer splits text into subword pieces based on frequency in the training corpus. Because common English patterns dominate web data, those patterns become compact tokens. Languages and dialects that appear less often get broken into more fragments.

That’s not just linguistic trivia:
it affects cost, performance, and user experience in measurable ways.
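To see the mechanism, here is a toy greedy longest-match tokenizer with a vocabulary skewed toward English, the way a web-trained vocabulary is. The vocabulary and sentences are invented for illustration; real tokenizers use trained BPE merges, but the fragmentation effect is the same:

```python
# Toy greedy longest-match subword tokenizer (illustrative only --
# real BPE tokenizers are trained on corpus statistics).
def tokenize(text, vocab):
    """Split `text` into the longest vocabulary pieces available."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest match first, fall back to a single character.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens

# A vocabulary skewed toward English, as web-trained vocabularies are.
vocab = {"the", " model", " reads", " tokens", "modelul", " cit"}

english = "the model reads tokens"
romanian = "modelul citește jetoane"  # same meaning, sparser coverage

print(len(tokenize(english, vocab)))   # → 4 compact tokens
print(len(tokenize(romanian, vocab)))  # → 14, mostly single-character fragments
```

The Romanian sentence shatters into character-level fragments wherever the vocabulary has no coverage, which is exactly what happens to low-resource languages under a web-frequency vocabulary.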


Same Meaning, Different Cost

Take two equivalent sentences in different languages. Because English appears far more frequently in training data, an English sentence often compresses into fewer tokens than its non-English equivalent. More tokens mean:

  • Higher API charges (you pay per token)
  • Faster context window exhaustion (fewer usable reasoning steps)
  • Greater truncation risk
  • Lower effective performance

This isn’t hypothetical: academic work has documented token disparities between languages that can reach orders of magnitude in some cases, meaning non-English users pay more for the same service and get less context for inference.


How We Know This: tokka-bench

Open-source tooling now exists that surfaces these inequalities systematically. One such project is tokka-bench, a benchmark that evaluates how different tokenizers perform across 100 natural languages and 20 programming languages using real multilingual text corpora.

tokka-bench doesn’t just count tokens; it measures:

  • Efficiency (bytes per token): how well a tokenizer compresses text
  • Coverage (unique tokens): how well a script or language is represented
  • Subword fertility: how many tokens are needed per semantic unit
  • Word splitting rates
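The first two metrics are simple ratios. A minimal sketch of how they can be computed, with an invented token count standing in for a real tokenizer's output:

```python
# Sketch of two tokka-bench-style metrics, computed from raw counts.
# The formulas follow the metric definitions above; the sample
# numbers are invented for illustration.

def bytes_per_token(text: str, n_tokens: int) -> float:
    """Efficiency: UTF-8 bytes compressed into each token (higher = better)."""
    return len(text.encode("utf-8")) / n_tokens

def subword_fertility(n_tokens: int, n_words: int) -> float:
    """Fertility: tokens needed per word (lower = better)."""
    return n_tokens / n_words

sentence = "modelul citește jetoane"  # 3 words, 24 UTF-8 bytes
n_tokens = 14                         # from a hypothetical tokenizer

print(round(bytes_per_token(sentence, n_tokens), 2))  # → 1.71
print(round(subword_fertility(n_tokens, 3), 2))       # → 4.67
```

An efficient tokenizer packs several bytes into each token and keeps fertility close to 1; values like these signal heavy fragmentation.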

[Image: token-level benchmark results]

The results reveal stark differences. In low-resource languages, tokenizers often need 2×–3× more tokens to encode the same amount of semantic content compared with English.

This has real implications:

  • A model might encode the same idea in English with half as many tokens as Persian, Hindi, or Amharic require.
  • Inference costs scale with tokens — so non-English content costs more to process.
  • Long documents in token-hungry languages fill the model’s context window faster, reducing the model’s ability to reason over long input.

The benchmark also finds systematic differences in coverage: some tokenizers (e.g., those trained for specific languages) achieve much lower subword fertility and better coverage in those languages, while others perform poorly outside dominant scripts.


Context Window Inequality

Every model has a finite context window (e.g., 8k, 32k, 128k tokens). If one language inflates token count:

  • Your document fills the window faster.
  • The model can’t “see” as much history in long conversations.
  • It loses access to earlier context sooner.
  • Summaries and reasoning chains break down earlier.

The API may be the same, but the usable intelligence you get differs by language once token efficiency varies.
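A quick sketch of the arithmetic, using illustrative fertility figures (not measured values):

```python
# How fertility eats a context window: with a fixed token budget,
# the number of words that fit scales inversely with tokens-per-word.
# The fertility figures here are illustrative, not measured.
CONTEXT_WINDOW = 8_000  # tokens

def words_that_fit(window: int, fertility: float) -> int:
    """Approximate word capacity of a context window at a given fertility."""
    return int(window / fertility)

print(words_that_fit(CONTEXT_WINDOW, 1.25))  # English-like fertility → 6400 words
print(words_that_fit(CONTEXT_WINDOW, 2.5))   # token-hungry language → 3200 words
```

Same advertised 8k window, half the usable document, before the model has reasoned over a single sentence.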


Compression Bias Becomes Economic Bias

Tokenizers optimize for frequency and compression, not fairness or equity. But because frequency reflects the unequal distribution of data on the web, optimization under unequal data produces unequal infrastructure.

Non-English users often see:

  • Higher inference cost per semantic unit
  • Faster context consumption
  • Lower effective reasoning capacity
  • Worse performance on tasks like summarization and long-form Q&A

This is economic bias — subtle, pervasive, and hard to fix with output filters alone.


The Real Fix

To build fairer AI systems, we must treat tokenization as structural infrastructure, not incidental preprocessing. This requires:

  • Token cost audits per language
  • Context efficiency benchmarking
  • Balanced tokenizer training corpora
  • Intentional vocabulary allocation
  • Public fragmentation metrics
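As a sketch of the first item, a per-language token cost audit can be as simple as comparing token counts for translation-equivalent texts; the languages and counts below are hypothetical:

```python
# Minimal per-language token cost audit: given token counts for
# translation-equivalent texts, report each language's cost premium
# relative to a baseline. All counts here are hypothetical.
def audit(token_counts: dict[str, int], baseline: str = "English") -> dict[str, float]:
    """Map each language to its token-count multiple of the baseline."""
    base = token_counts[baseline]
    return {lang: n / base for lang, n in token_counts.items()}

counts = {"English": 1_000, "Romanian": 1_700, "Hindi": 2_500, "Amharic": 3_100}
for lang, premium in audit(counts).items():
    print(f"{lang}: {premium:.1f}x")
```

Run against real tokenizer output on a parallel corpus, a table like this makes the hidden per-language surcharge visible at a glance.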

Because bias doesn’t start at the answer.

It starts at the first split of a word.

And projects like tokka-bench give us the tools we need to measure it.

Top comments (1)

ansh d

Spot on. Tokenization is the 'silent' budget killer for multilingual apps.

The problem I'm seeing is that once you solve for the token cost, you hit a 'Quality Ceiling.' Because the model sees less context per semantic unit in languages like Hindi or Arabic, the logic often breaks down before the token limit even hits.

I actually just published a piece on Evaluation Engineering that touches on how to move from 'vibe-checking' these fragmented outputs to actually verifying them with domain experts. It’s the only way we’ve found to stop paying for 'fragmented' intelligence:
Why Evaluation Engineering is the Final Frontier

Also, curious—have you seen any specific tokenizers that handle code-switching better?