A 2025 benchmark study (arXiv:2509.04516) confirmed what East African AI developers had been observing: general-purpose language models produce 4× more errors in Swahili than in English, even on simple factual tasks.
The Root Cause: Data Starvation
Common Crawl, the primary LLM pre-training corpus, is ~50% English and ~0.1% Swahili. That's a 500× data disparity (50 / 0.1). Models don't fail at Swahili because Swahili is hard; they fail because they've never seen enough of it.
The Compounding Problem
For civic and financial AI in Kenya, the problem compounds. On top of the language gap, Kenya-specific concepts have essentially zero representation in training data:
- M-PESA paybill codes and USSD flow formats
- KCSE exam structure and curriculum terminology
- Kenya Revenue Authority eTIMS procedures
- County government devolution vocabulary
Ask any major LLM to troubleshoot a failed M-PESA B2C payment in Swahili. You'll get plausible confabulation — a payment flow that doesn't match how Daraja actually works.
What We've Built: 110+ Domain-Specific Tools
The East Africa AI portfolio addresses this through three principles:
1. Domain knowledge embedded at the system level
Every tool embeds Kenyan institutional context as baseline knowledge — not retrieval. The model knows what a paybill code is, what KCSE stands for, what the Employment Act says about notice periods.
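As a rough illustration of "baseline knowledge, not retrieval", the sketch below builds a system prompt with domain facts written into it, so nothing has to be fetched at query time. The `DomainFact` class, the `KENYA_BASELINE` list, and `build_system_prompt` are hypothetical names invented for this example; the portfolio's actual tools may be structured quite differently.

```python
from dataclasses import dataclass


@dataclass
class DomainFact:
    """One piece of institutional context the model should treat as given."""
    term: str
    explanation: str


# Hypothetical baseline facts; a real tool would maintain a vetted, much larger set.
KENYA_BASELINE = [
    DomainFact("paybill", "An M-PESA business number that customers pay into, usually with an account number."),
    DomainFact("KCSE", "Kenya Certificate of Secondary Education, the national secondary school leaving exam."),
]


def build_system_prompt(role: str, facts: list[DomainFact]) -> str:
    """Embed domain facts directly in the system prompt instead of fetching them at query time."""
    lines = [f"You are {role}.", "Treat the following as established background knowledge:"]
    lines += [f"- {fact.term}: {fact.explanation}" for fact in facts]
    return "\n".join(lines)


prompt = build_system_prompt("a Kenyan financial services assistant", KENYA_BASELINE)
print(prompt)
```

The trade-off in this style is prompt length against latency and reliability: every fact costs tokens on every call, but the model never answers without the context.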
2. Swahili as first language, not afterthought
"Angalia salio la M-PESA" is not a translation of "Check M-PESA balance" — it's what a Kenyan would actually say. Writing natively instead of translating builds trust with users.
3. MCP infrastructure as force multiplier
By wrapping Kenya's APIs as MCP servers — mpesa-mcp, swahili-health-mcp, kenya-legal-rag — any AI agent inherits correct domain behavior. Tool descriptions are in Kiswahili so agents reason directly from the user's language.
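To make the idea of Kiswahili tool descriptions concrete, here is a minimal sketch of a tool definition in the shape MCP uses when a server lists its tools: a `name`, a `description`, and an `inputSchema` expressed as JSON Schema. The tool name `angalia_salio`, the `nambari_ya_simu` parameter, and the `missing_arguments` helper are assumptions made up for illustration; they are not taken from mpesa-mcp, which may expose different tools and fields.

```python
import json

# Hypothetical tool definition in the shape MCP uses for tool listings:
# a name, a human-readable description, and a JSON Schema for the inputs.
CHECK_BALANCE_TOOL = {
    "name": "angalia_salio",
    "description": "Angalia salio la akaunti ya M-PESA.",  # "Check the M-PESA account balance."
    "inputSchema": {
        "type": "object",
        "properties": {
            "nambari_ya_simu": {
                "type": "string",
                "description": "Nambari ya simu ya mteja.",  # "The customer's phone number."
            }
        },
        "required": ["nambari_ya_simu"],
    },
}


def missing_arguments(tool: dict, arguments: dict) -> list[str]:
    """Return required input names that the caller did not supply."""
    required = tool["inputSchema"].get("required", [])
    return [name for name in required if name not in arguments]


# ensure_ascii=False keeps any non-ASCII characters readable in the output.
print(json.dumps(CHECK_BALANCE_TOOL, ensure_ascii=False, indent=2))
print(missing_arguments(CHECK_BALANCE_TOOL, {}))
```

Because the description and parameter names are what an agent reads when choosing a tool, writing them in Kiswahili keeps the agent's tool selection in the same language as the user's request.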
The Data Fix: LINGUA Africa Grant
Long-term fix: better training data. The LINGUA Africa grant application (Microsoft AI for Good Lab × Gates Foundation × Masakhane) targets:
- 100,000+ labeled Swahili sentence pairs — financial, civic, educational
- LoRA fine-tuning of Aya-101 (13B parameters, Apache 2.0)
- DPO alignment for accuracy in high-stakes domains
- Community evaluation with real M-PESA agents, students, civic journalists in Kenya
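One reason LoRA suits a 13B-parameter model is that it trains only a small low-rank adapter beside each frozen weight matrix. The sketch below works through that arithmetic for a single layer. The layer dimensions and rank are illustrative assumptions, not values from Aya-101's configuration or from the grant application.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen dense weight the adapter sits beside."""
    return d_in * d_out


# Illustrative dimensions only; not taken from Aya-101's actual configuration.
d_in, d_out, rank = 4096, 4096, 8
adapter = lora_trainable_params(d_in, d_out, rank)
dense = full_params(d_in, d_out)
print(f"adapter={adapter:,} dense={dense:,} ratio={adapter / dense:.4%}")
```

Under these assumed dimensions the adapter is well under one percent of the size of the layer it adapts, which is what makes fine-tuning feasible on modest hardware.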
Four open datasets are live on HuggingFace now:
- gmahia/swahili-civic-nlp — civic and government terms
- gmahia/kenya-civic-data — county government data
- gmahia/kenya-agricultural-qa — crop and livestock Q&A
- gmahia/kenya-legal-nlp — Kenya legal NER annotations
All four are released under CC BY 4.0. The tools work today. The grant makes them work better at scale.
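For readers who want to try the datasets, a minimal loading sketch follows. It assumes the Hugging Face `datasets` package is installed and that each repository loads with its default configuration, which has not been verified here; `hub_url` and `load_all` are hypothetical helper names.

```python
DATASET_REPOS = [
    "gmahia/swahili-civic-nlp",
    "gmahia/kenya-civic-data",
    "gmahia/kenya-agricultural-qa",
    "gmahia/kenya-legal-nlp",
]


def hub_url(repo_id: str) -> str:
    """Build the dataset page URL on the Hugging Face Hub."""
    return f"https://huggingface.co/datasets/{repo_id}"


def load_all(repo_ids: list[str]) -> dict:
    """Download each dataset; needs network access and `pip install datasets`."""
    from datasets import load_dataset  # imported lazily so this file runs without it

    # Assumes each repo loads without an explicit config name.
    return {repo_id: load_dataset(repo_id) for repo_id in repo_ids}


for repo in DATASET_REPOS:
    print(hub_url(repo))
```

Calling `load_all(DATASET_REPOS)` would fetch all four; if a repository defines several configurations, `load_dataset` will ask for a config name, which can be passed as its second argument.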
Portfolio: gabrielmahia.github.io
HuggingFace: huggingface.co/gmahia