Why Swahili AI Fails at 4× the Rate of English, and How We're Fixing It

#nlp #machinelearning #africa #opensource

A 2025 benchmark study (arXiv:2509.04516) confirmed what East African AI developers had been observing: general-purpose language models produce 4× more errors in Swahili than in English, even for simple factual tasks.

The Root Cause: Data Starvation

Common Crawl, the primary LLM pre-training corpus, is ~50% English and ~0.1% Swahili. That's a 500× data disparity. Models don't fail at Swahili because Swahili is hard — they fail because they've never seen enough of it.
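
The 500× figure follows directly from the two corpus shares quoted above. A minimal sketch of the arithmetic, taking the approximate percentages at face value (the function name is illustrative, not part of any tool described here):

```python
from fractions import Fraction


def data_disparity(share_a_pct: str, share_b_pct: str) -> Fraction:
    """Return how many times larger share A is than share B.

    Shares are passed as decimal strings so the ratio is computed exactly,
    without floating-point rounding.
    """
    return Fraction(share_a_pct) / Fraction(share_b_pct)


# Approximate Common Crawl shares quoted in the text (percent of corpus).
ratio = data_disparity("50", "0.1")
print(f"English-to-Swahili data ratio: {ratio}x")
```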

The Compounding Problem

For civic and financial AI in Kenya, the problem compounds. On top of scarce Swahili text, Kenya-specific concepts have essentially zero representation in training data:

M-PESA paybill codes and USSD flow formats
KCSE exam structure and curriculum terminology
Kenya Revenue Authority eTIMS procedures
County government devolution vocabulary

Ask any major LLM to troubleshoot a failed M-PESA B2C payment in Swahili. You'll get plausible confabulation — a payment flow that doesn't match how Daraja actually works.

What We've Built: 110+ Domain-Specific Tools

The East Africa AI portfolio addresses this through three principles:

1. Domain knowledge embedded at the system level

Every tool embeds Kenyan institutional context as baseline knowledge — not retrieval. The model knows what a paybill code is, what KCSE stands for, what the Employment Act says about notice periods.
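
One simple way to picture "baseline knowledge, not retrieval" is a fixed block of definitions that is prepended to every request, rather than looked up per query. The sketch below assumes that approach; the post does not say how the tools actually implement it, and the glossary entries, `KENYA_CONTEXT`, and `build_system_prompt` are hypothetical names chosen for illustration.

```python
# Hypothetical glossary: a tiny illustrative sample, not the portfolio's
# actual system-level context.
KENYA_CONTEXT = {
    "paybill code": "A business number that M-PESA customers send payments to.",
    "KCSE": "Kenya Certificate of Secondary Education, the national exam "
            "taken at the end of secondary school.",
}


def build_system_prompt(base_instructions: str, context: dict[str, str]) -> str:
    """Prepend fixed domain definitions so every request starts with them."""
    lines = [base_instructions, "", "Kenyan institutional context:"]
    for term, definition in sorted(context.items()):
        lines.append(f"- {term}: {definition}")
    return "\n".join(lines)


prompt = build_system_prompt(
    "Answer in Kiswahili unless the user writes in English.",
    KENYA_CONTEXT,
)
print(prompt)
```

Because the context is static, it is identical for every call and does not depend on a search step succeeding, which is the trade-off this principle appears to favor.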

2. Swahili as first language, not afterthought

"Angalia salio la M-PESA" is not a translation of "Check M-PESA balance" — it's what a Kenyan would actually say. Writing natively instead of translating builds trust with users.

3. MCP infrastructure as force multiplier

By wrapping Kenya's APIs as MCP servers — mpesa-mcp, swahili-health-mcp, kenya-legal-rag — any AI agent inherits correct domain behavior. Tool descriptions are in Kiswahili so agents reason directly from the user's language.
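
To make the idea of Kiswahili tool descriptions concrete, here is a minimal sketch of what one tool definition might look like. It follows the general shape MCP uses when listing tools (a name, a description, and a JSON Schema for inputs), but the tool name `angalia_salio`, its `shortcode` parameter, and the helper function are assumptions for illustration; they are not taken from mpesa-mcp.

```python
import json

# Hypothetical tool definition. The description is written in Kiswahili so
# an agent can match it against a Kiswahili user request directly.
CHECK_BALANCE_TOOL = {
    "name": "angalia_salio",
    "description": "Angalia salio la M-PESA.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "shortcode": {
                "type": "string",
                "description": "Nambari ya paybill",
            },
        },
        "required": ["shortcode"],
    },
}


def missing_arguments(tool: dict, arguments: dict) -> list[str]:
    """Return the required input names that the caller did not supply."""
    required = tool["inputSchema"].get("required", [])
    return [name for name in required if name not in arguments]


# ensure_ascii=False keeps any non-ASCII Kiswahili text readable in the output.
print(json.dumps(CHECK_BALANCE_TOOL, ensure_ascii=False, indent=2))
print(missing_arguments(CHECK_BALANCE_TOOL, {}))
```

A real server would register a definition like this through an MCP SDK and attach a handler that calls the underlying API; both are left out here to avoid guessing at details the post does not give.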

The Data Fix: LINGUA Africa Grant

Long-term fix: better training data. The LINGUA Africa grant application (Microsoft AI for Good Lab × Gates Foundation × Masakhane) targets:

100,000+ labeled Swahili sentence pairs — financial, civic, educational
LoRA fine-tuning of Aya-101 (13B parameters, Apache 2.0)
DPO alignment for accuracy in high-stakes domains
Community evaluation with real M-PESA agents, students, civic journalists in Kenya
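
The first and third targets imply two kinds of training records: labeled sentence pairs and preference pairs for DPO. The sketch below shows one plausible shape for each, serialized as JSON Lines. The class names, field names, and example rows are assumptions for illustration (the `prompt`/`chosen`/`rejected` layout is a common convention for preference data, not something the grant application specifies).

```python
import json
from dataclasses import asdict, dataclass

# Domains named in the grant targets.
ALLOWED_DOMAINS = {"financial", "civic", "educational"}


@dataclass
class SentencePair:
    """One labeled English/Swahili pair (hypothetical schema)."""
    english: str
    swahili: str
    domain: str

    def __post_init__(self) -> None:
        if self.domain not in ALLOWED_DOMAINS:
            raise ValueError(f"unknown domain: {self.domain!r}")


@dataclass
class PreferencePair:
    """One DPO example: a prompt with a preferred and a dispreferred answer."""
    prompt: str
    chosen: str
    rejected: str


def to_jsonl(records) -> str:
    """Serialize dataclass records as JSON Lines, keeping non-ASCII text as-is."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)


pair = SentencePair("Check M-PESA balance", "Angalia salio la M-PESA", "financial")
pref = PreferencePair(
    prompt="KCSE inamaanisha nini?",
    chosen="Kenya Certificate of Secondary Education.",
    rejected="Kenya Certificate of Primary Education.",
)
print(to_jsonl([pair]))
print(to_jsonl([pref]))
```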

Four open datasets are live on Hugging Face now: