Arjun M

Posted on May 25

I Built a Multilingual Spam Detection Dataset with 149K+ Messages Across 23 Languages

#data #machinelearning #nlp #showdev

Spam detection datasets are surprisingly bad once you move outside English.

Most public datasets are:

tiny,
outdated,
English-only,
SMS-only,
or missing real-world spam patterns.

Meanwhile, actual spam today is multilingual, code-mixed, obfuscated, and platform-adaptive.

So I built SpamShield Datasets, a multilingual spam detection corpus designed for real-world NLP systems.

It currently contains 149,359 messages across 23 languages, with support for both binary spam detection and category-level classification.

Dataset: SpamShield Datasets

Why I Built This

I was experimenting with multilingual moderation systems and quickly realized something:

Most spam datasets completely fail at:

Hinglish/code-mixed text
Unicode obfuscation
multilingual phishing
scam-style promotions
adversarial spam formatting

Real spam does not look clean.

People intentionally distort words using:

leetspeak
invisible Unicode characters
mixed scripts
emoji stuffing
transliterated language
fake urgency patterns

And almost no open dataset covered this properly.
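To make those distortions concrete, here is a minimal sketch of how a filter might undo some of them before classification. It is an illustration only, not part of SpamShield: the `INVISIBLE` set, the `LEET` map, and the function names are assumptions, and the character lists are far from exhaustive.

```python
import unicodedata

# Invisible code points often used to split keywords (illustrative, not exhaustive).
INVISIBLE = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff", "\u00ad"}

# Hypothetical leetspeak map. It also rewrites genuine digits, so a real
# system would need to treat prices and phone numbers more carefully.
LEET = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                      "5": "s", "7": "t", "@": "a", "$": "s"})


def normalize(text):
    """Fold compatibility forms, drop invisible characters, undo leetspeak."""
    text = unicodedata.normalize("NFKC", text)  # e.g. fullwidth letters -> ASCII
    text = "".join(ch for ch in text if ch not in INVISIBLE)
    return text.translate(LEET).casefold()


def scripts_used(text):
    """Rough script detection based on the first word of each Unicode name."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split(" ")[0])
    return scripts


print(normalize("Fr\u200bee c4\u200dsh"))  # free cash
print(scripts_used("p\u0430ypal"))          # Latin letters plus one Cyrillic "a"
```

A message that mixes several scripts inside a single word is not always spam, but it is a cheap signal worth exposing to a model as a feature.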

So I started collecting, cleaning, normalizing, and structuring multilingual spam corpora into a single unified dataset.

That eventually became SpamShield Datasets.

Dataset Overview

The dataset currently contains:

Metric	Value
Total Messages	149,359
Ham Messages	72,439
Spam Messages	76,920
Languages	23
Formats	JSONL + Parquet
License	CC-BY-4.0

The schema is intentionally simple:

{
  "text": "Congratulations! You've won a free iPhone.",
  "label": 1,
  "category": "spam"
}

Where:

label = 0 → ham
label = 1 → spam
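Because the schema is this small, it is easy to check each record before training. The sketch below assumes exactly the three fields shown above; `REQUIRED_FIELDS` and `parse_record` are hypothetical names, not utilities shipped with the dataset.

```python
import json

# Field names and types taken from the schema example above.
REQUIRED_FIELDS = {"text": str, "label": int, "category": str}


def parse_record(line):
    """Parse one JSONL line and check it against the three-field schema."""
    record = json.loads(line)
    for field, expected_type in REQUIRED_FIELDS.items():
        value = record.get(field)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            raise ValueError(
                f"{field!r} must be {expected_type.__name__}, got {value!r}"
            )
    if record["label"] not in (0, 1):
        raise ValueError(f"label must be 0 (ham) or 1 (spam), got {record['label']!r}")
    return record


record = parse_record(
    '{"text": "Free recharge available now!", "label": 1, "category": "marketing"}'
)
print(record["label"], record["category"])  # 1 marketing
```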

Supported Languages

SpamShield currently includes:

Arabic
Bengali
Chinese
Dutch
English
French
German
Hinglish
Indonesian
Italian
Japanese
Javanese
Korean
Marathi
Norwegian
Portuguese
Punjabi
Russian
Spanish
Swedish
Turkish
Ukrainian
Urdu

I specifically wanted the dataset to include:

low-resource languages,
mixed-script content,
and code-mixed communication styles.

Because that is how people actually communicate online.

How the Dataset Is Structured

The dataset repository contains:

README.md
language-wise JSONL files
combined.parquet
filtering scripts
metadata and processing utilities

I provided two formats intentionally.

1. JSONL Files

Each language has its own JSONL file.

This is useful when:

training language-specific models,
debugging,
or performing dataset analysis.

Example:

{
  "text": "Free recharge available now!",
  "label": 1,
  "category": "marketing"
}
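Reading one of the per-language files needs nothing beyond the standard library. This is a minimal sketch; the file name `hinglish.jsonl` in the usage comment is an assumption, so check the repository for the actual names.

```python
import json


def iter_jsonl(path):
    """Yield one dict per non-empty line of a UTF-8 JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)


# Usage (hypothetical file name):
# spam_rows = [row for row in iter_jsonl("hinglish.jsonl") if row["label"] == 1]
```

Streaming line by line keeps memory use flat, which helps when you only want to inspect or debug a single language.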

2. Combined Parquet File

The repository also includes:

combined.parquet

This is the recommended format for large-scale training.

Why Parquet?

Compared with JSONL, it:

typically loads faster,
uses less storage thanks to compression,
supports columnar access, so you can read only the columns you need,
and works well with ML pipelines, especially when training multilingual transformers.

Synthetic Augmentation

One thing I want to mention honestly:

About 20% of the dataset is synthetically augmented.

I used techniques like:

paraphrasing,
translation,
back-translation,
Unicode variation,
and leetspeak mutation.
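As an illustration of the last two techniques, here is a rough sketch of leetspeak mutation and invisible-character injection. It is not the augmentation code used for SpamShield: the swap table, the `rate` parameter, and the function names are all assumptions.

```python
import random

# Hypothetical swap table; real spam uses many more variants.
LEET_SWAPS = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"}
ZERO_WIDTH_SPACE = "\u200b"


def leet_mutate(text, rate=0.3, seed=None):
    """Replace each swappable letter with its leetspeak form with probability `rate`."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        swap = LEET_SWAPS.get(ch.lower())
        if swap is not None and rng.random() < rate:
            out.append(swap)
        else:
            out.append(ch)
    return "".join(out)


def inject_zero_width(text, rate=0.1, seed=None):
    """Insert a zero-width space after letters with probability `rate`."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        out.append(ch)
        if ch.isalpha() and rng.random() < rate:
            out.append(ZERO_WIDTH_SPACE)
    return "".join(out)


print(leet_mutate("free cash", rate=1.0))  # fr33 c45h
```

Passing a fixed `seed` makes the augmented set reproducible, which matters if you want to compare models trained on the same mutations.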

Why?

Because modern spam constantly mutates.

If you only train on perfectly clean spam examples, your model tends to perform badly against real-world adversarial spam.

The goal was robustness, not just benchmark accuracy.

Spam Categories

Instead of only binary labels, I also included category-level labels like:

phishing
scam
crypto
marketing
giveaway
promo
adult
job_scam

This makes the dataset useful for:

moderation systems,
risk scoring,
scam-type classification,
and advanced filtering pipelines.
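For category-level classification, most training setups need the category strings mapped to integer ids. A minimal sketch follows; since the list above may not be exhaustive, it builds the mapping from whatever values appear in the data. `build_label_maps` is a hypothetical helper, not something shipped with the dataset.

```python
def build_label_maps(categories):
    """Return (label2id, id2label) built from an iterable of category strings."""
    ordered = sorted(set(categories))  # sorted so ids are stable across runs
    label2id = {name: index for index, name in enumerate(ordered)}
    id2label = {index: name for name, index in label2id.items()}
    return label2id, id2label


rows = [
    {"text": "Free recharge available now!", "label": 1, "category": "marketing"},
    {"text": "Verify your account today", "label": 1, "category": "phishing"},
]
label2id, id2label = build_label_maps(row["category"] for row in rows)
print(label2id)  # {'marketing': 0, 'phishing': 1}
```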

Loading the Dataset

Using the Parquet file is very straightforward.

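import pandas as pd

# Reading Parquet requires an engine such as pyarrow or fastparquet
df = pd.read_parquet("combined.parquet")

print(df.shape)  # (rows, columns)
print(df["label"].value_counts())  # 0 = ham, 1 = spam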

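Filtering by language, assuming combined.parquet includes a language column in addition to the three schema fields shown earlier:

# Keep only the rows tagged as English
english = df[df["language"] == "English"]
print(len(english))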

Challenges While Building It

The hardest parts were honestly:

normalization,
deduplication,
and balancing quality across languages.

Spam text is messy.

Different datasets had:

different schemas,
different encodings,
different label styles,
and inconsistent formatting.
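Label styles are a good example. One way to fold them into the 0/1 convention is a small alias table, sketched below. The aliases listed are assumptions about what source datasets might use, not a record of what the upstream datasets actually contained.

```python
# Hypothetical aliases; extend this as new source datasets are added.
LABEL_ALIASES = {
    "0": 0, "ham": 0, "legit": 0, "not_spam": 0,
    "1": 1, "spam": 1,
}


def to_binary_label(raw):
    """Map a source label (string or int) to 0 (ham) or 1 (spam)."""
    key = str(raw).strip().casefold()
    if key not in LABEL_ALIASES:
        # Fail loudly instead of guessing, so unknown styles get reviewed.
        raise ValueError(f"unknown label style: {raw!r}")
    return LABEL_ALIASES[key]


print(to_binary_label("SPAM"), to_binary_label(0))  # 1 0
```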

Some datasets had:

only spam,
broken Unicode,
or duplicated messages thousands of times.

A lot of time went into cleaning and standardizing everything.
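Deduplication is the easiest of these steps to sketch. The version below treats two messages as duplicates when they match after Unicode normalization, case folding, and whitespace collapsing. That matching rule is an assumption chosen for illustration; it is not necessarily the rule used to build the dataset.

```python
import re
import unicodedata


def dedup_key(text):
    """Build a comparison key: NFKC-normalized, case-folded, whitespace-collapsed."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(records):
    """Keep the first record for each distinct key, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        key = dedup_key(record["text"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


records = [
    {"text": "Free  Recharge!", "label": 1},
    {"text": "free recharge!", "label": 1},
    {"text": "See you at 5", "label": 0},
]
print(len(deduplicate(records)))  # 2
```

Exact-key matching like this only catches trivial copies. Near-duplicates that differ by a few characters would need fuzzy matching on top.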

Acknowledgments

SpamShield Datasets was built using multiple publicly available open-source spam and ham datasets from the NLP and cybersecurity community.

The original datasets were carefully:

filtered,
cleaned,
normalized,
deduplicated,
reformatted,
and curated into a unified multilingual structure.

Additional processing was done to improve consistency across languages, schemas, encodings, and labeling formats.

I would like to thank all researchers, dataset maintainers, and open-source contributors whose work made this project possible. Open datasets are one of the biggest reasons independent research and experimentation can still happen at scale.

This project mainly focuses on:

multilingual unification,
dataset curation,
schema standardization,
quality filtering,
and robustness-oriented augmentation for real-world spam detection systems.

If you found this project useful, consider giving it a star. It genuinely helps support future updates and improvements.

Reference Links

Dataset: SpamShield Datasets
Dataset Card / README: View Documentation
License: CC-BY-4.0
Recommended File: combined.parquet

Final Thoughts

Spam detection is becoming much harder.

Modern spam is:

multilingual,
adaptive,
adversarial,
and increasingly AI-generated.

I wanted to create something that was actually useful for real-world NLP systems instead of another tiny benchmark dataset.

SpamShield Datasets is still evolving, but I hope it helps researchers and developers build stronger multilingual moderation systems.

If you want to experiment with multilingual spam detection, adversarial filtering, or moderation pipelines, feel free to check it out.

Support

Building and maintaining multilingual datasets takes a significant amount of time for:

cleaning,
balancing,
validation,
augmentation,
and formatting.

If this dataset helped your project or research, consider starring or sharing it. That support genuinely motivates future development.

Thanks for reading.
