Thiago V.

Posted on Mar 30 • Edited on Apr 2

The Irony of Language Models That Don't Speak Your Language

#ai #i18n #llm #agents

This is a personal project and article. The opinions expressed here are my own and do not reflect the opinions of AWS or Amazon. This project is not an AWS product and is not endorsed by or affiliated with AWS.

AI is plugged everywhere now and its a breakthrough advanced technology.

However, there is a key element which turns out to be an elephant in the room that is not in the major headliner topics: LLMs are fundamentally centric to high-resource languages, and most specifically, English.

The only publicly disclosed training data breakdown — GPT-3 (Brown et al., 2020) — showed over 92% English tokens. Newer models don't publish exact ratios, but the picture has evolved: Llama 3 remains heavily English, GPT-4o highlights improved multilingual performance as a key feature, and models like Qwen and Aya have invested significantly more in non-English data. The gap is narrowing for high-resource languages, but for the thousands of low-resource languages, the structural imbalance remains.

The remaining languages — spoken by billions of people — are either poorly represented through low-quality machine-translated English content, or absent entirely.

This means that when a Thai farmer asks about crop subsidies, when a Nigerian mother searches for vaccination schedules in Yoruba, or when a Brazilian citizen navigates tax forms in Portuguese, the AI they're interacting with is operating at a fraction of its true capability.

Not because the intelligence isn't there, but because the model was never properly taught to listen in their language. The industry celebrates "human-level performance" on benchmarks, but those benchmarks are overwhelmingly English. For most of the world, the AI revolution hasn't arrived yet — it's still stuck at customs, waiting for a translator.

The Ancient Myth

Around 4,000 years ago, Babylon was the most cosmopolitan city on Earth.

Situated at the crossroads of ancient trade routes in modern-day Iraq, it was a place where Akkadian, Sumerian, Aramaic, Elamite, and dozens of other languages collided daily. Merchants, scholars, and diplomats from across Mesopotamia converged there, and the city thrived precisely because it found ways to bridge those languages — through scribes, translators, and the world's first multilingual libraries.

The biblical story of the Tower of Babel, set in Babylon, tells it differently: God scattered humanity across the earth and confused their languages so they could no longer understand each other. It's a story about the fracturing of communication — the moment when a shared project became impossible because people could no longer speak the same language.

We're living in a strange echo of that story. We've built the most powerful reasoning machines in human history — LLMs that can write poetry, prove theorems, and generate working code. But these machines think in English. When the rest of the world tries to speak to them, the tower crumbles. Not because the intelligence isn't there, but because the language barrier corrupts the signal before it reaches the model's reasoning core.

The Illusion of Multilingual AI

Ask any frontier LLM a question in English, and you'll get a polished, accurate, well-reasoned response. Now ask the same question in Thai. Or Amharic. Or even Portuguese.

Suddenly, the magic fades.

The response might be shorter, vaguer, or riddled with English fragments leaking through. In some cases, it's outright gibberish. And here's the part nobody talks about: you're paying more for that worse response.

While the industry celebrates benchmark after benchmark showing LLMs reaching "human-level performance," there's a massive asterisk: in English. For the 6,950 other languages spoken on this planet, AI remains broken, expensive, and in some cases, unreliable.

The Numbers Don't Lie

Most leading LLMs allocate approximately 92% of their training tokens to English (Brown et al., "Language Models are Few-Shot Learners", NeurIPS 2020). Of the approximately 7,000 spoken languages globally, most models only cover about 50 high-resource ones (Frontiers Research Topic: Language Models for Low-Resource Languages). The remaining languages lack both the digital data and quality resources to benefit from recent AI advancements — creating barriers to education, healthcare, financial access, and employment for the communities that speak them.

But the problem goes deeper than just quality. It's about money.

The Hidden Language Tax

LLMs use tokenizers to break text into chunks before processing. These tokenizers were designed primarily for English. When you feed them Thai, Japanese, Arabic, or Korean text, the same semantic content gets split into 2-4x more tokens.

I built a proxy called LLM Proxy Babylon to measure this. Here's what I found with a real Thai prompt about sorting algorithms:

Metric	Direct Thai	Optimized (English)
Prompt tokens	~166	49
Token savings	—	70% fewer input tokens
Quality score	0.456	~0.949 (English-level)

That's 3.4x fewer input tokens and 2x better quality. At Amazon Nova Lite pricing on Bedrock ($0.06/1M input tokens), sending 1M Thai prompts of this size would cost ~$0.01 directly vs ~$0.003 through the optimizer — and the optimized path delivers dramatically better responses.

The savings scale dramatically with premium models. At Claude Opus 4 pricing on Bedrock ($15/1M input tokens), the same 1M Thai prompts would cost $2.49 directly vs $0.74 through the optimizer — a $1.75 saving per million requests on input tokens alone, with better quality on every response.

Every company running a multilingual chatbot is silently paying this tax. Their English-speaking users get fast, cheap, high-quality responses. Their Thai-speaking users get slower, more expensive, lower-quality responses — for the same product, same subscription price.

And it compounds. Chatbots send the full conversation history with every request. A 10-message conversation in Thai accumulates tokens 3x faster than the same conversation in English. By turn 10, you're sending massive context windows that cost a fortune and may even overflow the model's limits.

When Government Chatbots Can't Serve Their Own Citizens

Now imagine this problem at the scale of a government.

Countries across Southeast Asia, Africa, the Middle East, and South America are deploying AI-powered chatbots to help citizens access healthcare information, navigate tax systems, apply for social programs, and find emergency services. These are critical services that directly impact people's lives.

But here's the catch: the LLMs powering these chatbots were trained on English. When a farmer in rural Thailand asks about crop subsidies in Thai, the model's reasoning capability drops by nearly 50%. When a mother in Nigeria asks about childhood vaccination schedules in Yoruba, the model might not even understand the question properly.

The irony is painful: governments invest in AI to serve their citizens better, but the AI itself delivers unequal quality across languages. Not intentionally — but structurally, through training data imbalance.

The Safety Gap Nobody Talks About

It gets worse. Research shows that low-resource languages exhibit about three times the likelihood of encountering harmful content compared to high-resource languages — and in intentional attack scenarios, unsafe output rates can reach over 80% (Deng et al., "Multilingual Jailbreak Challenges in Large Language Models", 2023).

LLM safety guardrails — the filters that prevent models from generating harmful content — were primarily trained on English data.

This means a prompt injection attack that would be caught instantly in English can sail right through in Amharic or Lao. The model simply doesn't recognize the harmful intent in languages it barely understands.

For any organization deploying AI in production — especially in healthcare, finance, or government — this isn't just a quality issue. It's a liability.

A Different Approach: Don't Fix the Model, Route Around It

The conventional wisdom says: "Just train better multilingual models." And yes, that's happening. But it's slow, expensive, and may never fully close the gap for the thousands of low-resource languages that lack sufficient training data.

What if we could get English-level quality from any language, today, without retraining a single model?

That's the idea behind LLM Proxy Babylon — an open-source proxy I built that sits between your application and any LLM API.

It detects the input language, decides whether translating to English would improve results, and if so, translates the prompt before sending it to the model. Then it appends a simple instruction: "Please respond in Thai since the original question was asked in Thai."

LLM Proxy Babylon is named for the city, not the curse. It's an attempt to do what ancient Babylon did: sit at the crossroads of languages and make sure everyone gets understood.

The key insight: LLMs have no difficulty generating output in a specified language. The performance gap is in understanding non-English prompts and in producing non-English response, so input translation for reasoning quality, and optional output translation for low-resource languages where the LLM's generation is also lossy. So we translate the input (where the model is weak) and let the model handle the output (where it's strong).

Real Results

I tested this with Mistral 7B on a Thai prompt about bubble sort complexity. The results were dramatic:

Without the optimizer (direct Thai): The model produced garbled output mixing English fragments into Thai text ("วงจirkle", "sorteering technique"), with confused, repetitive reasoning. 1,749 tokens of mostly noise.

With the optimizer (translated to English first): The same model produced a clean, structured response correctly explaining O(n²) vs O(n log n) complexity, listing Merge Sort, Quick Sort, and Heap Sort with accurate Big-O analysis — all responded back in Thai. 1,446 tokens of useful content.

The model's reasoning capability was there all along. It just couldn't access it through Thai input.

I also benchmarked Amazon Nova Lite across multiple languages using the built-in evaluation harness:

Rank	Language	Quality Score	Delta from English
1	English (baseline)	0.949	—
2	Portuguese	0.763	-0.19
3	Korean	0.663	-0.29
4	Japanese	0.595	-0.35
5	Thai	0.456	-0.49

The pattern maps exactly to language resource availability. Portuguese (high-resource) takes the smallest hit. Thai (low-resource) loses nearly half the quality.

How It Works

The proxy exposes an OpenAI-compatible API, so it works as a drop-in replacement with any framework — LangChain, Strands Agents, or any OpenAI SDK client. Just change the base URL:

from strands import Agent
from strands.models.openai import OpenAIModel

model = OpenAIModel(
    client_args={"base_url": "http://localhost:3000/v1", "api_key": "not-needed"},
    model_id="us.amazon.nova-lite-v1:0",
)

agent = Agent(model=model)
response = agent("อธิบายแนวคิดของ recursion ในการเขียนโปรแกรม")

Under the hood, each request flows through a pipeline:

Detect the language (using franc for BCP-47 identification)
Parse mixed content (preserve code blocks, URLs, JSON — only translate natural language)
Classify the task type (reasoning, math, code-generation, culturally-specific)
Route — decide whether to translate, skip, or use hybrid mode
Translate the prompt to English (if beneficial)
Inject a language instruction ("Please respond in Thai...")
Forward to the LLM (supports AWS Bedrock and OpenAI)
Return the response to the client

The routing engine is smart about when NOT to translate. Culturally-specific questions ("What's good tonight in Paris?") skip translation because the model needs cultural context, not English reasoning. English prompts skip entirely. The system only translates when it expects a quality improvement.

Built on AWS

The proxy supports AWS Bedrock natively via the Converse API. Authentication is handled automatically through the AWS SDK — no API keys needed in requests. I tested with Amazon Nova Lite and Mistral 7B, both available on Bedrock.

For translation, it supports Amazon Translate ($15/1M characters, high quality for proper nouns and technical content) and LibreTranslate (self-hosted, free) out of the box, with a pluggable interface for DeepL or Google Translate. Just set TRANSLATOR_BACKEND=amazon-translate to switch — uses your existing AWS credentials.

The Conversation Cache: Solving the Multi-Turn Problem

Multi-turn conversations are where the token tax really hurts. Every request sends the full history, and that history is in the user's language — eating 2-4x more tokens per turn.

The proxy includes a conversation translation cache. Pass an X-Conversation-Id header and previously translated messages are pulled from cache instead of being re-translated. By turn 10, you get 9 cache hits and only 1 miss per request — 9 translation API calls saved, and the LLM always sees a lean English context window.

Beyond Quality: Safety as a Side Effect

By translating low-resource language prompts to English before sending them to the LLM, the optimizer routes every prompt through the model's strongest safety alignment. A harmful prompt in Thai or Amharic gets evaluated by English-trained guardrails operating at full strength, rather than the weaker low-resource language alignment.

This isn't a complete safety solution — but for the common case, it significantly narrows the 3x safety gap between high-resource and low-resource languages identified by Deng et al.

But What If Models Get Better at Multilingual?

They will — and the optimizer is designed for that.

The token cost problem is structural, not a training problem. BPE tokenizers will always split Thai, Arabic, and Korean into 2-4x more tokens than semantically equivalent English. Unless providers fundamentally redesign their tokenizers and retrain everything, the cost disparity persists regardless of how multilingual the model becomes.

Conversation history compounding doesn't go away either. Even a perfectly multilingual model still charges per token. A 10-turn Thai conversation still accumulates tokens 3x faster than English. The conversation translation cache solves this at the infrastructure level.

RAG retrieval is an embedding problem, not an LLM problem. Vector embeddings are English-centric. Translating queries to English before retrieval improves recall regardless of how good the LLM itself is at understanding Thai.

Fine-tuning ROI is permanent. Companies fine-tune on English domain data. A perfectly multilingual base model still won't have that domain-specific knowledge accessible through non-English prompts unless the fine-tuning data was also multilingual — which it almost never is.

Safety alignment will always lag for low-resource languages. Even as models improve, safety training data will remain English-heavy. Routing through English for safety filtering is a defense-in-depth strategy that stays relevant.

And the adaptive router handles the transition gracefully. As models get better at specific languages, the shadow evaluator detects that translation no longer helps, and the router automatically switches to skip. The proxy doesn't fight against model improvements — it adapts to them. For a language where the model reaches English parity, the proxy becomes a transparent pass-through with zero overhead.

Today the proxy is primarily about quality. As models improve, it becomes primarily about cost optimization, safety, and RAG. The architecture already supports that transition because routing decisions are data-driven, not hardcoded assumptions.

What's Next

This is an open-source project and there's a lot more to explore:

RAG improvement — translate queries to English before vector retrieval for better recall (current architecture supports it)
Fine-tuning ROI — ensure non-English users benefit from English-only fine-tuned models
Dialect detection — handle Egyptian Arabic vs Modern Standard Arabic, European vs Brazilian Portuguese

The Question We Should Be Asking

The next time someone says "all LLMs are the same," ask them: in which language?

AI won't truly be intelligent until it understands every language, every culture. Until then, tools like LLM Proxy Babylon can bridge the gap — giving every user English-level quality, regardless of what language they think in.

The code is open source: github.com/tverney/llm-proxy-babylon

273 property-based tests. Real benchmarks. Ready to deploy.

Originally published on AWS Builder Center

DEV Community