DEV Community

Henry Dioniz

🛡️ A Swahili SMS Scam Dataset and a Machine Learning Tool to Use It

In Tanzania 🇹🇿, scammers are getting smarter. They often pretend to be someone you know or trust: a relative, a friend, a landlord, or even a job recruiter. Their goal? To trick you into sending them money.

You’ve probably seen texts like:

  • “Ni tumie kwa namba hii Jina litakuja SALOME KALUNGA, hiyo ni namba yangu mpya ya Halotel”
  • “Utanitumia kwenye ii 0615810764 airtel jina MARIAM NDUGAI namba yangu inadeni usiitumie”
  • “MZEE LUKA KIMBANGU tiba asili biashala kazi masomo utajili kesi kuludisha mke&mume piga (0787-406-889)(0787-406-889)”
  • “666,KARIBU FREEMASON UTIMIZE NDOTO KATIKA BIASHARA, KILIMO,UFUGAJI,MACHI MBO,MICHEZO N.K KWAMHITAJI KUJIUNGA PG: 0786543210 AU 0786543210”

These messages are dangerous, deceptive, and sadly, very common.

As a Tanzanian tech enthusiast and developer, I wanted to do something about it.
So I created the BongoScam dataset, an open dataset of over 1,500 labeled Swahili SMS messages, and a basic machine learning model to help detect the scams among them.

📊 The Dataset: Swahili SMS Detection

I collected and labeled 1,508 real Swahili messages, split into two categories:

  • scam: Suspicious, misleading, or fraudulent messages.
  • trust: Legitimate or safe messages.

Example entries:

category | sms
scam     | "IYO PESA ITUME KWENYE NAMBA HII 0657538690 JINA ITALETA Magomba Maila"
trust    | "Nashukuru kwa kupokea simu yangu. Tutalifanyia kazi."
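
To get a feel for the data, here is a minimal sketch that reads rows with the `category` and `sms` columns shown above and counts each label. It assumes the Kaggle download is a CSV with exactly those two column names. The in-memory sample, built from the two example entries, stands in for the real file, whose name isn't given here.

```python
import csv
import io
from collections import Counter

# Stand-in for the downloaded CSV. The column names `category` and `sms`
# come from the example entries; the real file layout is an assumption.
SAMPLE_CSV = '''category,sms
scam,"IYO PESA ITUME KWENYE NAMBA HII 0657538690 JINA ITALETA Magomba Maila"
trust,"Nashukuru kwa kupokea simu yangu. Tutalifanyia kazi."
'''


def load_messages(fileobj):
    """Return a list of (category, sms) pairs from an open CSV file."""
    reader = csv.DictReader(fileobj)
    return [(row["category"].strip(), row["sms"].strip()) for row in reader]


messages = load_messages(io.StringIO(SAMPLE_CSV))
label_counts = Counter(category for category, _ in messages)
print(label_counts)
```

With the real file, you would pass something like `open(path, newline="", encoding="utf-8")` in place of the `StringIO` sample.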

➡️ Download the dataset on Kaggle:
📥 swahili-sms-detection

🧠 The Model: Simple but Effective

To demonstrate what’s possible, I built a lightweight machine learning model:

  • 🧹 CountVectorizer to convert text into numeric features
  • 🤖 Multinomial Naive Bayes as the classifier
  • 📈 98.7% accuracy on test data
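
To make those two components concrete, here is a dependency-free sketch of what they compute: word counts per message (the bag-of-words idea behind CountVectorizer) fed into a Multinomial Naive Bayes classifier with add-one smoothing. This is not the project's code, which uses the scikit-learn classes named above. The `TinyNaiveBayes` class, the `tokenize` helper, and the four training messages are all hypothetical and far too small to say anything about real accuracy.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split into word tokens, roughly like a bag-of-words vectorizer."""
    return re.findall(r"\w+", text.lower())


class TinyNaiveBayes:
    """Multinomial Naive Bayes over word counts with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter(labels)      # label -> number of messages
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        scores = {}
        for label, doc_count in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(doc_count / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Toy training set (hypothetical, for illustration only).
train_texts = [
    "tuma pesa kwenye namba hii jina litakuja",
    "itume pesa namba yangu mpya",
    "nashukuru kwa kupokea simu yangu",
    "tutaonana kesho asubuhi kazini",
]
train_labels = ["scam", "scam", "trust", "trust"]

model = TinyNaiveBayes().fit(train_texts, train_labels)
print(model.predict("tuma pesa kwa namba hii"))    # scam
print(model.predict("nashukuru tutaonana kesho"))  # trust
```

Working in log space avoids multiplying many tiny probabilities together, and the smoothing keeps a single unseen word from zeroing out a class.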

The model is wrapped in a Flask API and deployed as a simple website for public use.

You can test it live here:
👉 bongoscam.vercel.app

📦 Project Structure

You can explore or contribute via GitHub:

🔗 GitHub: BongoScamDetection

# Clone the repo
git clone https://github.com/Henryle-hd/BongoScamDetection
cd BongoScamDetection

# Install frontend dependencies
cd frontend
npm install

# Install backend dependencies
cd ../backend
pip install -r requirements.txt

# Run backend (keep this terminal open)
python main.py

# Run frontend (in a second terminal, from the repo root)
cd frontend
npm run dev

🔌 API Example

Endpoint: POST /api/predict
Request:

{
  "sms": "Iyo ela tuma humu kwenye vodacom 0655251448 Jina lije ALLY ISSA"
}

Response:

{
  "prediction": "scam",
  "sms": "Iyo ela tuma humu kwenye vodacom 0655251448 Jina lije ALLY ISSA"
}
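
As a rough illustration of calling this endpoint from code, the sketch below builds the POST request using only the Python standard library. The base URL is an assumption (a locally running backend on Flask's default port), and `build_predict_request` and `predict` are hypothetical helper names, not part of the project.

```python
import json
import urllib.request

# Assumption: the Flask backend is running locally on Flask's default port.
# Adjust this to wherever the API is actually served.
BASE_URL = "http://localhost:5000"


def build_predict_request(sms, base_url=BASE_URL):
    """Build a POST request for /api/predict with a JSON body {"sms": ...}."""
    body = json.dumps({"sms": sms}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(sms, base_url=BASE_URL, timeout=10):
    """Send the message to the API and return the "prediction" field."""
    request = build_predict_request(sms, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return payload["prediction"]


# Example call (requires the backend to be running):
# predict("Iyo ela tuma humu kwenye vodacom 0655251448 Jina lije ALLY ISSA")
```

Keeping request construction separate from the network call makes it easy to inspect the payload without a running server.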

🔄 Why This Matters

This project isn’t just about coding. It’s about digital safety.
Millions of people in East Africa rely on SMS for communication.
Without strong tools or education, they’re vulnerable.

By:

  • Open-sourcing the data
  • Making the model public
  • Supporting the Swahili language

...I hope to make this a starting point for more localized ML solutions: in Swahili, for Africa, by Africans.

✍️ Final Thoughts

The BongoScam dataset is a small step toward fighting digital fraud in Tanzania, but I believe it can grow with your input.
If you're a:

  • Developer 🧑‍💻
  • Linguist 🌍
  • Security researcher 🔍
  • Student 📚

…there’s something in this project for you.

👉 Test the tool at bongoscam.vercel.app
👉 Explore the dataset on Kaggle
👉 Contribute code via GitHub

💬 Got feedback or want to collaborate? Drop a comment or find me on LinkedIn or GitHub.

Let’s build AI that speaks Swahili and protects people, not just data.
