DEV Community

Gabriel Mahia

Why Swahili AI Fails at 4× the Rate of English, and How We're Fixing It

A 2025 benchmark study (arXiv:2509.04516) confirmed what East African AI developers had been observing: general-purpose language models produce 4× more errors in Swahili than in English, even for simple factual tasks.

The Root Cause: Data Starvation

Common Crawl, the primary LLM pre-training corpus, is ~50% English and ~0.1% Swahili. That's a 500× data disparity. Models don't fail at Swahili because Swahili is hard — they fail because they've never seen enough of it.

The Compounding Problem

For civic and financial AI in Kenya, the problem compounds. Kenya-specific concepts have essentially zero representation in pre-training data:

  • M-PESA paybill codes and USSD flow formats
  • KCSE exam structure and curriculum terminology
  • Kenya Revenue Authority eTIMS procedures
  • County government devolution vocabulary

Ask any major LLM to troubleshoot a failed M-PESA B2C payment in Swahili. You'll get plausible confabulation: a payment flow that doesn't match how Daraja, Safaricom's M-PESA API platform, actually works.

What We've Built: 110+ Domain-Specific Tools

The East Africa AI portfolio addresses this through three principles:

1. Domain knowledge embedded at the system level

Every tool embeds Kenyan institutional context as baseline knowledge — not retrieval. The model knows what a paybill code is, what KCSE stands for, what the Employment Act says about notice periods.
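
As a rough illustration of what "baseline knowledge, not retrieval" can look like, here is a minimal sketch that puts a small glossary directly into the system prompt. The glossary entries, the `KENYA_CONTEXT` name, and the `build_system_prompt` helper are all assumptions made for this example, not the portfolio's actual code.

```python
# Hypothetical glossary: the entries are illustrative, not the portfolio's real content.
KENYA_CONTEXT = {
    "paybill": "A business number that M-PESA customers pay into, usually with an account number.",
    "KCSE": "Kenya Certificate of Secondary Education, the national exam at the end of secondary school.",
    "eTIMS": "The Kenya Revenue Authority's electronic Tax Invoice Management System.",
}


def build_system_prompt(task: str, context: dict[str, str] = KENYA_CONTEXT) -> str:
    """Return a system prompt that states the domain facts up front."""
    # Every definition is always included; nothing is looked up at query time.
    lines = [f"- {term}: {definition}" for term, definition in sorted(context.items())]
    return (
        f"You are an assistant for {task} in Kenya.\n"
        "Treat the following as established facts:\n" + "\n".join(lines)
    )


prompt = build_system_prompt("mobile money support")
print(prompt)
```

The trade-off is prompt length: every definition is sent with every request, so this approach suits a compact, stable core vocabulary.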

2. Swahili as first language, not afterthought

"Angalia salio la M-PESA" is not a translation of "Check M-PESA balance" — it's what a Kenyan would actually say. Writing natively instead of translating builds trust with users.

3. MCP infrastructure as force multiplier

By wrapping Kenya's APIs as MCP servers — mpesa-mcp, swahili-health-mcp, kenya-legal-rag — any AI agent inherits correct domain behavior. Tool descriptions are in Kiswahili so agents reason directly from the user's language.
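
To make the idea concrete, here is a sketch of how a tool with a Swahili description might be declared. MCP servers describe each tool with a `name`, a `description`, and a JSON Schema `inputSchema`; the tool name, the parameter, and the `validate_arguments` helper below are hypothetical and are not taken from mpesa-mcp itself.

```python
import json

# Hypothetical tool descriptor for a server like mpesa-mcp.
# The name/description/inputSchema layout follows how MCP servers list tools;
# the specific tool and parameter are made up for illustration.
check_balance_tool = {
    "name": "angalia_salio",
    "description": "Angalia salio la M-PESA",
    "inputSchema": {
        "type": "object",
        "properties": {
            "nambari_ya_simu": {
                "type": "string",
                "description": "Nambari ya simu ya mteja, kwa mfano 2547XXXXXXXX",
            }
        },
        "required": ["nambari_ya_simu"],
    },
}


def validate_arguments(tool: dict, arguments: dict) -> list[str]:
    """Return the required parameters that are missing from a tool call."""
    required = tool["inputSchema"].get("required", [])
    return [name for name in required if name not in arguments]


# ensure_ascii=False keeps any non-ASCII characters readable in the output.
print(json.dumps(check_balance_tool, ensure_ascii=False, indent=2))
print(validate_arguments(check_balance_tool, {}))
```

Because the agent reads the description when choosing a tool, a Swahili description lets it match a Swahili request without translating first.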

The Data Fix: LINGUA Africa Grant

Long-term fix: better training data. The LINGUA Africa grant application (Microsoft AI for Good Lab × Gates Foundation × Masakhane) targets:

  • 100,000+ labeled Swahili sentence pairs — financial, civic, educational
  • LoRA fine-tuning of Aya-101 (13B parameters, Apache 2.0)
  • DPO alignment for accuracy in high-stakes domains
  • Community evaluation with real M-PESA agents, students, civic journalists in Kenya
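
The first and third targets imply two kinds of records: labeled sentence pairs for fine-tuning and preference pairs for DPO. The sketch below shows one plausible shape for each, serialized as JSON Lines. The field names are assumptions (the `prompt`/`chosen`/`rejected` layout is a common convention for preference data), and the bracketed strings are placeholders, not real dataset entries.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SentencePair:
    """One labeled sentence pair (hypothetical schema)."""
    domain: str  # "financial", "civic" or "educational"
    en: str
    sw: str


@dataclass
class PreferencePair:
    """One DPO preference record (hypothetical schema)."""
    prompt: str
    chosen: str    # the answer reviewers judged accurate
    rejected: str  # a fluent answer reviewers judged wrong


def to_jsonl(records) -> str:
    """Serialize dataclass records as JSON Lines, keeping non-ASCII text readable."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)


pairs = [SentencePair("financial", "Check M-PESA balance", "Angalia salio la M-PESA")]
prefs = [
    PreferencePair(
        prompt="<user question in Swahili>",
        chosen="<answer verified by a domain reviewer>",
        rejected="<fluent but factually wrong answer>",
    )
]

print(to_jsonl(pairs))
print(to_jsonl(prefs))
```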

Four open datasets are live on HuggingFace now, all licensed CC BY 4.0.

The tools work today. The grant would make them work better at scale.

Portfolio: gabrielmahia.github.io

HuggingFace: huggingface.co/gmahia
