I open-sourced a Malware Families Catalog built on EMBER 2018

#cybersecurity #machinelearning #opensource #showdev

What it is

I just released an open-source dataset that maps EMBER 2018 malware family labels to a unified, structured catalog. It's published identically on three platforms so you can pull it whichever way fits your workflow:

GitHub Pages (canonical): https://jordanricky1604-ship-it.github.io/malware-families-catalog/
HuggingFace: https://huggingface.co/datasets/Jordan123234/malware-families-catalog
Kaggle: https://www.kaggle.com/datasets/rickyjordan/malware-families-catalog

The license is Apache-2.0. The schema is identical across all three, so you can swap loaders without touching the rest of your pipeline.

Why I built it

The EMBER 2018 benchmark from Elastic is one of the most widely used static-PE malware classification datasets, but its family labels are sparse and noisy, and there is no canonical companion that catalogs which families appear, how often, and what is known about them. Most projects either drop the family labels entirely and do only benign/malicious classification, or hand-roll their own family lookup table.

I wanted a clean, honest catalog you can join against EMBER without having to do that work yourself.
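
As a rough illustration of what joining against EMBER could look like, here is a minimal sketch using only the standard library. The key names (`avclass` on the sample side, `family` on the catalog side), the `join_families` helper, and the toy rows are all assumptions made for this example; check the catalog's actual header and your own EMBER metadata before relying on them.

```python
def normalize(label):
    """Lowercase and strip a family label; missing or empty labels become None."""
    if label is None:
        return None
    label = label.strip().lower()
    return label or None


def join_families(samples, catalog, sample_key="avclass", catalog_key="family"):
    """Left-join sample records to catalog records on the family label.

    Both key names are hypothetical defaults, not the confirmed schema.
    Samples with no label, or with a label missing from the catalog,
    are kept and get catalog=None.
    """
    by_family = {normalize(row[catalog_key]): row for row in catalog}
    joined = []
    for sample in samples:
        key = normalize(sample.get(sample_key))
        joined.append({**sample, "catalog": by_family.get(key) if key else None})
    return joined


# Toy data, invented for this example.
samples = [
    {"sha256": "aaa", "avclass": "ExampleFam "},
    {"sha256": "bbb", "avclass": None},
    {"sha256": "ccc", "avclass": "unlistedfam"},
]
catalog = [{"family": "examplefam", "description": "placeholder text"}]

joined = join_families(samples, catalog)
matched = [s["sha256"] for s in joined if s["catalog"] is not None]
print(matched)
```

Keeping unmatched samples (rather than dropping them) makes it easy to see how much of your data the catalog actually covers.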

A few constraints I held myself to while building it:

No fabricated facts. If a family is obscure or unattributed, the record says so. I'd rather have a null than a confident-sounding hallucination.
No manual removal instructions. This is a research dataset, not a how-to. Records describe what a family is and link out to authoritative sources where appropriate.
One schema everywhere. Same columns, same types, same row counts on HF, Kaggle, and GitHub. The README on each platform points at the others.
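
The "one schema everywhere" constraint is easy to spot-check after downloading two copies. A minimal sketch, assuming both copies are available as CSV text; the `same_shape` helper and the toy columns are made up for illustration and are not part of the dataset's tooling:

```python
import csv
import io


def read_rows(csv_text):
    """Parse CSV text into (header, rows)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    return reader.fieldnames, rows


def same_shape(csv_a, csv_b):
    """True when two copies share column names, column order, and row count."""
    header_a, rows_a = read_rows(csv_a)
    header_b, rows_b = read_rows(csv_b)
    return header_a == header_b and len(rows_a) == len(rows_b)


# Toy stand-ins for two downloaded copies (columns are hypothetical).
copy_one = "family,description\nexamplefam,placeholder\n"
copy_two = "family,description\nexamplefam,placeholder\n"
copy_bad = "family\nexamplefam\n"

print(same_shape(copy_one, copy_two))
print(same_shape(copy_one, copy_bad))
```

This compares only headers and row counts. Checking column types as well would need a typed loader on both sides.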

How to load it

From HuggingFace:

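# Requires: pip install datasets
from datasets import load_dataset

ds = load_dataset("Jordan123234/malware-families-catalog")
print(ds)  # shows the available splits with their column names and row counts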

From Kaggle (via the Kaggle CLI):

kaggle datasets download -d rickyjordan/malware-families-catalog

From GitHub Pages (raw files):

curl -O https://jordanricky1604-ship-it.github.io/malware-families-catalog/data/families.csv
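
Once families.csv is on disk, the standard library is enough for a first look. The sketch below makes no assumption about column names: it simply reports whatever header the file has. The `summarize_csv` helper is hypothetical, and the demo writes a small stand-in file with invented columns so the snippet runs without a download.

```python
import csv
import os
import tempfile


def summarize_csv(path):
    """Return (column names, row count) for a CSV file."""
    with open(path, newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)
        count = sum(1 for _ in reader)
        return list(reader.fieldnames or []), count


# Stand-in file with invented columns; point `path` at the real
# families.csv after running the curl command above.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "families.csv")
    with open(path, "w", newline="", encoding="utf-8") as fh:
        fh.write("family,description\nexamplefam,placeholder\n")
    columns, n_rows = summarize_csv(path)

print(columns, n_rows)
```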

What's next

I'd love feedback, especially from people who've worked with EMBER and have opinions about which family attributes are most useful for downstream classifiers. Open an issue on the GitHub repo if anything looks off.

If you find it useful, a star on the repo helps surface it for the next person searching for the same thing.

— Built and maintained as an open community resource.