Arjun M

I Built a Multilingual Spam Detection Dataset with 149K+ Messages Across 23 Languages

Spam detection datasets are surprisingly bad once you move outside English.

Most public datasets are:

  • tiny,
  • outdated,
  • English-only,
  • SMS-only,
  • or missing real-world spam patterns.

Meanwhile, actual spam today is multilingual, code-mixed, obfuscated, and platform-adaptive.

So I built SpamShield Datasets — a multilingual spam detection corpus designed for real-world NLP systems.

It currently contains 149,359 messages across 23 languages, with support for both binary spam detection and category-level classification.



Why I Built This

I was experimenting with multilingual moderation systems and quickly realized something:

Most spam datasets completely fail at:

  • Hinglish/code-mixed text
  • Unicode obfuscation
  • multilingual phishing
  • scam-style promotions
  • adversarial spam formatting

Real spam does not look clean.

People intentionally distort words using:

  • leetspeak
  • invisible Unicode characters
  • mixed scripts
  • emoji stuffing
  • transliterated language
  • fake urgency patterns

And almost no open dataset covered this properly.

So I started collecting, cleaning, normalizing, and structuring multilingual spam corpora into a single unified dataset.

That eventually became SpamShield Datasets.
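Several of the distortions listed above can be surfaced with nothing more than Python's standard library. The sketch below is only an illustration, not part of the SpamShield tooling: the function names and the set of invisible characters are assumptions, and the script check is a rough heuristic based on Unicode character names.

```python
import unicodedata

# Zero-width and formatting code points often used to break up keywords.
# This set is an illustrative assumption, not an exhaustive list.
INVISIBLE = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def scripts_in(text: str) -> set:
    """Return a rough set of scripts: the first word of each letter's Unicode name."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def obfuscation_signals(text: str) -> dict:
    """Count simple indicators of deliberately distorted text."""
    scripts = scripts_in(text)
    return {
        "invisible_chars": sum(ch in INVISIBLE for ch in text),
        "scripts": sorted(scripts),
        "mixed_script": len(scripts) > 1,
    }


# The sample uses two Cyrillic "е" characters in place of Latin "e",
# plus a zero-width space inside "recharge".
print(obfuscation_signals("Fr\u0435\u0435 re\u200bcharge"))
```

Signals like these are best treated as model features rather than verdicts: ordinary Japanese text, for example, mixes several scripts without being spam.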


Dataset Overview

The dataset currently contains:

Metric            Value
Total Messages    149,359
Ham Messages      72,439
Spam Messages     76,920
Languages         23
Formats           JSONL + Parquet
License           CC-BY-4.0

The schema is intentionally simple:

{
  "text": "Congratulations! You've won a free iPhone.",
  "label": 1,
  "category": "spam"
}

Where:

  • label = 0 → ham
  • label = 1 → spam
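Because the schema is this small, a record can be checked in a few lines before it reaches a training pipeline. The following is a minimal sketch; `parse_record` is a hypothetical helper, not something shipped with the dataset, and the checks only cover the three fields shown above.

```python
import json


def parse_record(line: str) -> dict:
    """Parse one JSONL line and check it against the three-field schema."""
    record = json.loads(line)
    text = record.get("text")
    label = record.get("label")
    if not isinstance(text, str) or not text.strip():
        raise ValueError("text must be a non-empty string")
    if type(label) is not int or label not in (0, 1):
        raise ValueError("label must be 0 (ham) or 1 (spam)")
    if not isinstance(record.get("category"), str):
        raise ValueError("category must be a string")
    return record


line = '{"text": "Congratulations! You\'ve won a free iPhone.", "label": 1, "category": "spam"}'
record = parse_record(line)
print(record["label"], record["category"])
```

Raising on the first bad field keeps the sketch short; a real ingestion script might prefer to collect and log all failures instead.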

Supported Languages

SpamShield currently includes:

  • Arabic
  • Bengali
  • Chinese
  • Dutch
  • English
  • French
  • German
  • Hinglish
  • Indonesian
  • Italian
  • Japanese
  • Javanese
  • Korean
  • Marathi
  • Norwegian
  • Portuguese
  • Punjabi
  • Russian
  • Spanish
  • Swedish
  • Turkish
  • Ukrainian
  • Urdu

I specifically wanted the dataset to include:

  • low-resource languages,
  • mixed-script content,
  • and code-mixed communication styles.

Because that is how people actually communicate online.


How the Dataset Is Structured

The dataset repository contains:

  • README.md
  • language-wise JSONL files
  • combined.parquet
  • filtering scripts
  • metadata and processing utilities

I provided two formats intentionally.

1. JSONL Files

Each language has its own JSONL file.

This is useful when:

  • training language-specific models,
  • debugging,
  • or performing dataset analysis.

Example:

{
  "text": "Free recharge available now!",
  "label": 1,
  "category": "marketing"
}
Enter fullscreen mode Exit fullscreen mode
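A per-language JSONL file can be streamed line by line with the standard library alone. This is a small sketch under assumptions: `label_counts` is a hypothetical helper, the ham line in the sample is made up for illustration, and an in-memory buffer stands in for a real file.

```python
import io
import json
from collections import Counter


def label_counts(lines) -> Counter:
    """Count ham/spam labels from an iterable of JSONL lines, skipping blank lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if line:
            counts[json.loads(line)["label"]] += 1
    return counts


# In-memory stand-in for a per-language file. With a real file, pass
# open(path, encoding="utf-8") instead; the actual file names are in the repository.
sample = io.StringIO(
    '{"text": "Free recharge available now!", "label": 1, "category": "marketing"}\n'
    '{"text": "See you at 6 then", "label": 0, "category": "ham"}\n'
)
print(label_counts(sample))
```

Reading one line at a time keeps memory use flat, which is convenient for quick per-language checks.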

2. Combined Parquet File

The repository also includes:

combined.parquet

This is the recommended format for large-scale training.

Why Parquet?

Compared with JSONL:

  • it loads faster,
  • uses less storage,
  • supports columnar access, so you can read only the columns you need,
  • and fits well into ML data pipelines.

This matters most when training multilingual transformers on the full corpus.


Synthetic Augmentation

One thing I want to mention honestly:

About 20% of the dataset is synthetically augmented.

I used techniques like:

  • paraphrasing,
  • translation,
  • back-translation,
  • Unicode variation,
  • and leetspeak mutation.

Why?

Because modern spam constantly mutates itself.

If you train only on perfectly clean spam examples, the model tends to perform badly against real-world adversarial spam.

The goal was robustness — not just benchmark accuracy.
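To make two of the techniques above concrete, here is a rough sketch of leetspeak mutation and one form of Unicode variation (zero-width space injection). It is not the augmentation code used for the dataset: the substitution table, the function names, and the rates are all assumptions chosen for illustration.

```python
import random

# Illustrative substitution table (an assumption, not the dataset's actual mapping).
LEET = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5"}
ZERO_WIDTH_SPACE = "\u200b"


def leet_mutate(text: str, rate: float, rng: random.Random) -> str:
    """Replace each mappable letter with its leetspeak digit with probability `rate`."""
    return "".join(
        LEET[ch.lower()] if ch.lower() in LEET and rng.random() < rate else ch
        for ch in text
    )


def inject_invisible(text: str, rate: float, rng: random.Random) -> str:
    """Insert a zero-width space after each character with probability `rate`."""
    out = []
    for ch in text:
        out.append(ch)
        if rng.random() < rate:
            out.append(ZERO_WIDTH_SPACE)
    return "".join(out)


rng = random.Random(0)  # fixed seed so the augmentation is reproducible
original = "Free recharge available now!"
variant = inject_invisible(leet_mutate(original, 0.5, rng), 0.1, rng)
print(repr(variant))
```

Passing an explicit `random.Random` instance keeps the mutation reproducible, which helps when you need to regenerate the same augmented split later.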


Spam Categories

Instead of only binary labels, I also included category-level labels like:

  • phishing
  • scam
  • crypto
  • marketing
  • giveaway
  • promo
  • adult
  • job_scam

This makes the dataset useful for:

  • moderation systems,
  • risk scoring,
  • scam-type classification,
  • and advanced filtering pipelines.

Loading the Dataset

Using the Parquet file is very straightforward.

import pandas as pd

df = pd.read_parquet("combined.parquet")

print(df.shape)
print(df["label"].value_counts())

Filtering by language, using the language column of the combined file:

english = df[df["language"] == "English"]
print(len(english))

Challenges While Building It

The hardest parts were honestly:

  • normalization,
  • deduplication,
  • and balancing quality across languages.

Spam text is messy.

Different datasets had:

  • different schemas,
  • different encodings,
  • different label styles,
  • and inconsistent formatting.

Some datasets had:

  • only spam,
  • broken Unicode,
  • or the same message duplicated thousands of times.

A lot of time went into cleaning and standardizing everything.
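As a rough idea of what normalization and deduplication can look like, the sketch below builds a comparison key from each message and keeps the first record per key. It is an assumption-laden illustration rather than the pipeline actually used here: the helper names, the invisible-character set, and the sample records are all made up.

```python
import hashlib
import unicodedata

# Characters removed before comparison; an illustrative, non-exhaustive set.
INVISIBLE = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))


def normalize(text: str) -> str:
    """NFKC-normalize, drop invisible characters, case-fold, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).translate(INVISIBLE)
    return " ".join(text.casefold().split())


def deduplicate(records):
    """Keep the first record for each distinct normalized text."""
    seen = set()
    unique = []
    for record in records:
        key = hashlib.sha1(normalize(record["text"]).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


records = [
    {"text": "Free  recharge available NOW!", "label": 1},
    {"text": "free re\u200bcharge available now!", "label": 1},  # zero-width space
    {"text": "\uff26\uff52\uff45\uff45 recharge available now!", "label": 1},  # fullwidth "Free"
    {"text": "See you at 6 then", "label": 0},
]
print(len(deduplicate(records)))  # 2
```

The normalized form is used only as a deduplication key; the stored text is left untouched, so obfuscated variants remain available for training.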


Acknowledgments

SpamShield Datasets was built using multiple publicly available open-source spam and ham datasets from the NLP and cybersecurity community.

The original datasets were carefully:

  • filtered,
  • cleaned,
  • normalized,
  • deduplicated,
  • reformatted,
  • and curated into a unified multilingual structure.

Additional processing was done to improve consistency across languages, schemas, encodings, and labeling formats.

I would like to thank all researchers, dataset maintainers, and open-source contributors whose work made this project possible. Open datasets are one of the biggest reasons independent research and experimentation can still happen at scale.

This project mainly focuses on:

  • multilingual unification,
  • dataset curation,
  • schema standardization,
  • quality filtering,
  • and robustness-oriented augmentation for real-world spam detection systems.

If you found this project useful, consider giving it a star. It genuinely helps support future updates and improvements.


Final Thoughts

Spam detection is becoming much harder.

Modern spam is:

  • multilingual,
  • adaptive,
  • adversarial,
  • and increasingly AI-generated.

I wanted to create something that was actually useful for real-world NLP systems instead of another tiny benchmark dataset.

SpamShield Datasets is still evolving, but I hope it helps researchers and developers build stronger multilingual moderation systems.

If you want to experiment with multilingual spam detection, adversarial filtering, or moderation pipelines, feel free to check it out.


Support

Building and maintaining multilingual datasets takes a significant amount of time for:

  • cleaning,
  • balancing,
  • validation,
  • augmentation,
  • and formatting.

If this dataset helped your project or research, consider starring or sharing it. That support genuinely motivates future development.

Thanks for reading.
