You ask an AI a question in English. It answers fluently. You ask in Spanish. It answers well. You ask in Tamil. It stumbles. You ask in Swahili. It gives a generic, awkward response. You are not surprised. You expect the AI to be better at English. But you should be surprised. Tamil has about 80 million speakers. Swahili has as many as 200 million by some estimates, most of them second-language speakers. These are not obscure languages. They are just underrepresented in the training data.
This is the linguistic justice crisis. The AI does not speak all languages equally. It speaks the languages of the wealthy, the powerful, and the digitized. The languages of the marginalized are left behind.
The Data Distribution
The training data is not a balanced sample of the world's languages.
The Breakdown (rough estimates; exact shares vary by model and are rarely published):
English: ~70-80% of training data.
Chinese: ~5-10%.
Spanish: ~3-5%.
Other Languages: The remainder.
The Missing:
Tamil: about 80 million speakers, but a tiny fraction of training data.
Swahili: up to 200 million speakers by some estimates, but a tiny fraction.
Yoruba: about 50 million speakers, but a tiny fraction.
Most of the world's roughly 7,000 languages: Almost entirely absent.
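One way to make the gap concrete is to divide a language's share of the training data by its share of the world's population. The Python sketch below does that. The speaker counts are the ones quoted above. The 8 billion world population is a round assumption, and the 0.1% data share is a hypothetical placeholder, not a measured figure, because real per-language shares are not given here. The name `representation_ratio` is also just illustrative.

```python
# Assumption: world population rounded to 8 billion.
WORLD_POPULATION = 8_000_000_000


def representation_ratio(data_share, speakers, world_population=WORLD_POPULATION):
    """Share of training data divided by share of world population.

    1.0 means the language appears in proportion to its speakers;
    values below 1.0 mean it is underrepresented.
    """
    if speakers <= 0 or world_population <= 0:
        raise ValueError("speakers and world_population must be positive")
    speaker_share = speakers / world_population
    return data_share / speaker_share


# Hypothetical placeholder: 0.1% of training data. NOT a measured value.
PLACEHOLDER_SHARE = 0.001

# Speaker counts as quoted in this article.
speakers_by_language = {
    "Tamil": 80_000_000,
    "Swahili": 200_000_000,
    "Yoruba": 50_000_000,
}

ratios = {
    name: representation_ratio(PLACEHOLDER_SHARE, count)
    for name, count in speakers_by_language.items()
}

for name, ratio in ratios.items():
    print(f"{name}: representation ratio {ratio:.2f}")
```

With the placeholder share, every ratio lands well below 1.0. If you can find a measured share for a specific model, swap it in for `PLACEHOLDER_SHARE` and the same arithmetic applies.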
A Contrarian Take: The Bias Is Not a Bug. It Is a Reflection of the Internet.
The AI is trained on the internet. The internet is mostly English. The bias is not a failure. It is a statistical reality.
The problem is not the AI. The problem is the uneven distribution of human knowledge online.
The Consequences
What happens when a language is underrepresented?
- Poor Translation:
The AI translates poorly.
It loses nuance.
It makes errors.
- Loss of Culture:
The AI does not understand cultural references.
It cannot capture idioms.
It erases local knowledge.
- Digital Divide:
Speakers of underrepresented languages are excluded.
They cannot use AI effectively.
They are left behind.
A Contrarian Take: The AI Is Not the Problem. It Is a Symptom.
The AI reflects the world. The world is unequal. The AI is unequal.
The solution is not to fix the AI. The solution is to fix the world.
The Economics of Language
Why are some languages represented and others not?
The Economics:
English is the language of commerce.
Chinese is the language of a large economy.
Spanish is the language of a large population.
The Marginalization:
Tamil is spoken by a large population, but many are poor.
Swahili is spoken by many, but the region is not wealthy.
Yoruba is spoken by many, but little of the language is digitized.
A Contrarian Take: The Economics Are the Cause of the Linguistic Crisis.
The AI companies are not malicious. They are rational. They train on the data that is available and economically valuable.
The languages that are not digitized are not profitable. The AI companies ignore them.
Case Study: The Swahili Speaker
A Swahili speaker tries to use an AI assistant.
The Experience:
The AI understands basic questions.
It gives generic, awkward answers.
It does not understand local idioms.
It does not recognize cultural references.
The Result:
The Swahili speaker stops using the AI.
They feel excluded.
They are left behind.
A Contrarian Take: The Problem Is Not the AI. It Is the Data.
The AI is not malicious. It is just ignorant. It knows little Swahili because it was trained on little Swahili.
The solution is to digitize Swahili. The solution is to create more Swahili content.
What You Can Do
You cannot fix the problem alone. But you can contribute.
- Digitize Underrepresented Languages:
If you speak a minority language, create content in it.
Write blogs, record videos, create forums.
- Support Open Datasets:
Contribute to open datasets for underrepresented languages.
Support organizations that digitize languages.
- Advocate for Inclusion:
Demand that AI companies include underrepresented languages.
Support policies that promote linguistic diversity.
- Learn a Less Common Language:
Learn Tamil, Swahili, or Yoruba.
The more speakers who are active online, the more demand.
The Last Word
The last word is not spoken. It is written.
You ask: "What is the future of language?"
The model says: "The future is uncertain."
You realize: The future depends on the choices we make today.
If you could add one language to the training data, what would it be? And why?