I got tired of Ctrl-F'ing PDFs for malware family names so I built a catalog

#cybersecurity #showdev #sideprojects #tooling

Quick backstory. I do MSP work and a chunk of my week is triage. Something pops on an endpoint, you get a family name back from whatever tool flagged it, and now you're trying to figure out if this thing is a banker, a loader, a wiper, ransomware, whatever. Half the time the top Google hit is a vendor blog from 2019 with a popup begging you to download a whitepaper. The other half is some forum thread where the actual useful comment got deleted.

So I made a thing. It's just a static site with one page per malware family. 2,899 of them, pulled from the EMBER 2018 list (Endgame's dataset, the one a lot of ML-for-malware papers train against). Each family gets its own URL like /families/emotet.html, /families/trickbot.html and so on. Nothing fancy. No JS framework. Just HTML you can land on from a search result and read in two seconds.

Live here if you want to poke at it: https://jordanricky1604-ship-it.github.io/malware-families-catalog/

Why bother. Honestly because I kept hitting the same wall. You're on a call, the SOC analyst on the other side says "we're seeing Qakbot", and you want a one-pager you can skim while they keep talking. Not a 40 page report. Not a paywall. Just "here's what this is, here's what it usually does, here's a couple of references." That's the whole pitch.

The other annoying thing was discoverability. If I dump a CSV on HuggingFace nobody searching for a specific family name is going to find it. The CSV is one URL. But if every family is its own page with the name in the title tag and the H1, then someone Googling "what is njRAT" can actually land on it. That was the bet anyway. Still waiting to see how Google feels about it but Bing already indexed ~250 URLs which I'll take.
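
For a rough idea of what "name in the title tag and the H1" means in practice, here's a minimal Python sketch. It is not the actual build script from the repo: `PAGE_TEMPLATE`, `render_family_page` and the `summary` field are names made up for illustration, and the title wording is just a guess at something search-friendly.

```python
from html import escape

# Hypothetical template: the family name appears in both <title> and <h1>,
# which is the part the discoverability bet depends on.
PAGE_TEMPLATE = """<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>{name} malware family</title>
</head>
<body>
<h1>{name}</h1>
<p>{summary}</p>
</body>
</html>
"""


def render_family_page(name: str, summary: str) -> str:
    """Return a standalone HTML page for one family.

    Both fields are HTML-escaped so names containing characters
    like < or & cannot break the markup.
    """
    return PAGE_TEMPLATE.format(name=escape(name), summary=escape(summary))


if __name__ == "__main__":
    # The summary here is placeholder text, not a real description.
    print(render_family_page("njRAT", "Placeholder summary goes here."))
```

Plain string formatting keeps this dependency-free; a real pipeline might well use a template engine instead.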

I also mirrored the dataset to HuggingFace (https://huggingface.co/datasets/Jordan123234/malware-families-catalog) and Kaggle (https://www.kaggle.com/datasets/rickyjordan/malware-families-catalog) because that's where ML folks actually go looking. The GitHub Pages site is the canonical copy though. The other two are mirrors that point back to it.

A few things I learned that might save someone else time:

Generating 2,899 HTML files from a template is fine. The slow part isn't the generation, it's getting Pages to actually finish building. I had to split things or it would time out.
Sitemaps matter way more than I expected. I had the pages live for like a week before I realised Search Console wasn't picking them up because my sitemap was only listing the index. Once I generated a proper sitemap with every family URL in it the crawl rate jumped immediately.
Don't name your files with weird characters. Some of the family names in EMBER have slashes, dots, parentheses. I lowercased and stripped everything down to [a-z0-9-] and kept a mapping file so the original names aren't lost. Worth it.
If you cross-link aggressively (A-Z index, related families, prev/next nav) crawlers will actually follow. If you just dump 2,899 orphan pages and pray, they sit there forever.
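
The filename cleanup and the sitemap fix are easy to sketch in Python. Treat this as one possible version under stated assumptions, not the pipeline from the repo: `slugify`, `build_slug_map`, `build_sitemap` and `BASE_URL` are hypothetical names, the base URL is a placeholder, and the numeric suffix for colliding slugs is an extra that the post doesn't mention.

```python
import re
from xml.sax.saxutils import escape

# Placeholder base URL, not the real site.
BASE_URL = "https://example.invalid/catalog"


def slugify(name: str) -> str:
    """Lowercase and reduce a family name to [a-z0-9-]."""
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    return slug or "unnamed"


def build_slug_map(names):
    """Map each original name to a unique slug.

    Assumption: two names that clean up to the same slug get a
    numeric suffix (-2, -3, ...) so no file overwrites another.
    """
    mapping = {}
    used = set()
    for name in names:
        base = slugify(name)
        slug = base
        n = 2
        while slug in used:
            slug = f"{base}-{n}"
            n += 1
        used.add(slug)
        mapping[name] = slug
    return mapping


def build_sitemap(slugs, base_url=BASE_URL):
    """Return sitemap XML listing the index plus every family page."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
        f"  <url><loc>{escape(base_url + '/')}</loc></url>",
    ]
    for slug in slugs:
        url = f"{base_url}/families/{slug}.html"
        lines.append(f"  <url><loc>{escape(url)}</loc></url>")
    lines.append("</urlset>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up names that exercise slashes, dots and parentheses.
    slug_map = build_slug_map(["Win32/Foo.Bar (gen)", "Foo.Bar", "foo/bar"])
    print(slug_map)
    print(build_sitemap(slug_map.values()))
```

The dict returned by `build_slug_map` doubles as the mapping file if you dump it to JSON. A single sitemap file can hold up to 50,000 URLs, so 2,899 pages fit in one.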

It's Apache 2.0 so do whatever you want with it. PRs welcome if you want to add a reference or fix a family description. There's a decent chance I got something wrong on one of the less common families. The build pipeline is in the repo too if you want to fork it for a different taxonomy (CVEs, threat actors, whatever).

That's it. Going back to my actual job now.
