<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: jordanricky1604-ship-it</title>
    <description>The latest articles on DEV Community by jordanricky1604-ship-it (@jordan1604).</description>
    <link>https://dev.to/jordan1604</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3959082%2Fbfe416fa-e837-490b-b822-c226343648ea.png</url>
      <title>DEV Community: jordanricky1604-ship-it</title>
      <link>https://dev.to/jordan1604</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jordan1604"/>
    <language>en</language>
    <item>
      <title>I got tired of Ctrl-F'ing PDFs for malware family names so I built a catalog</title>
      <dc:creator>jordanricky1604-ship-it</dc:creator>
      <pubDate>Tue, 02 Jun 2026 17:06:16 +0000</pubDate>
      <link>https://dev.to/jordan1604/i-got-tired-of-ctrl-fing-pdfs-for-malware-family-names-so-i-built-a-catalog-1hn7</link>
      <guid>https://dev.to/jordan1604/i-got-tired-of-ctrl-fing-pdfs-for-malware-family-names-so-i-built-a-catalog-1hn7</guid>
      <description>&lt;p&gt;Quick backstory. I do MSP work and a chunk of my week is triage. Something pops on an endpoint, you get a family name back from whatever tool flagged it, and now you're trying to figure out if this thing is a banker, a loader, a wiper, ransomware, whatever. Half the time the top Google hit is a vendor blog from 2019 with a popup begging you to download a whitepaper. The other half is some forum thread where the actual useful comment got deleted.&lt;/p&gt;

&lt;p&gt;So I made a thing. It's just a static site with one page per malware family: 2,899 of them, pulled from the EMBER 2018 list (originally Endgame's dataset, now under Elastic, the one a lot of ML-for-malware papers train against). Each family gets its own URL like /families/emotet.html, /families/trickbot.html and so on. Nothing fancy. No JS framework. Just HTML you can land on from a search result and read in two seconds.&lt;/p&gt;

&lt;p&gt;Live here if you want to poke at it: &lt;a href="https://jordanricky1604-ship-it.github.io/malware-families-catalog/" rel="noopener noreferrer"&gt;https://jordanricky1604-ship-it.github.io/malware-families-catalog/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Why bother? Honestly, because I kept hitting the same wall. You're on a call, the SOC analyst on the other side says "we're seeing Qakbot", and you want a one-pager you can skim while they keep talking. Not a 40-page report. Not a paywall. Just "here's what this is, here's what it usually does, here's a couple of references." That's the whole pitch.&lt;/p&gt;

&lt;p&gt;The other annoying thing was discoverability. If I dump a CSV on HuggingFace nobody searching for a specific family name is going to find it. The CSV is one URL. But if every family is its own page with the name in the title tag and the H1, then someone Googling "what is njRAT" can actually land on it. That was the bet anyway. Still waiting to see how Google feels about it but Bing already indexed ~250 URLs which I'll take.&lt;/p&gt;

&lt;p&gt;I also mirrored the dataset to HuggingFace (&lt;a href="https://huggingface.co/datasets/Jordan123234/malware-families-catalog" rel="noopener noreferrer"&gt;https://huggingface.co/datasets/Jordan123234/malware-families-catalog&lt;/a&gt;) and Kaggle (&lt;a href="https://www.kaggle.com/datasets/rickyjordan/malware-families-catalog" rel="noopener noreferrer"&gt;https://www.kaggle.com/datasets/rickyjordan/malware-families-catalog&lt;/a&gt;) because that's where ML folks actually go looking. The GitHub Pages site is the canonical copy, though; the other two are mirrors that point back to it.&lt;/p&gt;

&lt;p&gt;A few things I learned that might save someone else time:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Generating 2,899 HTML files from a template is fine. The slow part isn't the generation, it's getting Pages to actually finish building. I had to split things or it would time out.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Sitemaps matter way more than I expected. I had the pages live for like a week before I realised Search Console wasn't picking them up because my sitemap was only listing the index. Once I generated a proper sitemap with every family URL in it the crawl rate jumped immediately.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Don't name your files with weird characters. Some of the family names in EMBER have slashes, dots, parentheses. I lowercased and stripped everything down to [a-z0-9-] and kept a mapping file. Worth it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If you cross-link aggressively (A-Z index, related families, prev/next nav) crawlers will actually follow. If you just dump 2,899 orphan pages and pray, they sit there forever.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
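
&lt;p&gt;Points 2 and 3 could look roughly like the sketch below. This is not the repo's actual build code. The helper names (&lt;code&gt;slugify&lt;/code&gt;, &lt;code&gt;build_slug_map&lt;/code&gt;, &lt;code&gt;build_sitemap&lt;/code&gt;), the numeric suffix on slug collisions, and the sample family names are all assumptions made for illustration. Only the base URL and the /families/name.html pattern come from the post.&lt;/p&gt;

```python
import re
import xml.etree.ElementTree as ET

# Base URL taken from the post; everything else here is illustrative.
BASE_URL = "https://jordanricky1604-ship-it.github.io/malware-families-catalog"


def slugify(name):
    """Lowercase a family name and reduce it to the [a-z0-9-] alphabet."""
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    return slug or "unnamed"  # assumed fallback for names with no usable characters


def build_slug_map(names):
    """Map each slug to its original name, adding -2, -3, ... on collisions."""
    mapping = {}
    for name in names:
        base = slugify(name)
        slug, n = base, 2
        while slug in mapping:
            slug = f"{base}-{n}"
            n += 1
        mapping[slug] = name
    return mapping


def build_sitemap(slugs):
    """Return sitemap XML with one url/loc entry per family page."""
    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for slug in slugs:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = f"{BASE_URL}/families/{slug}.html"
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical names; the last two collapse to the same slug.
slug_map = build_slug_map(["Emotet", "Win32/Foo.B", "win32 foo (b)"])
sitemap_xml = build_sitemap(slug_map)
```

&lt;p&gt;The collision suffix is the part worth keeping if you fork this: two different raw names can collapse to the same slug once the punctuation is stripped, and without a suffix one page would silently overwrite the other.&lt;/p&gt;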

&lt;p&gt;It's Apache 2.0, so do whatever you want with it. PRs welcome if you want to add a reference or fix a family description; there's a decent chance I got something wrong on a less common one. The build pipeline is in the repo too if you want to fork it for a different taxonomy (CVEs, threat actors, whatever).&lt;/p&gt;

&lt;p&gt;That's it. Going back to my actual job now.&lt;/p&gt;

</description>
      <category>cybersecurity</category>
      <category>showdev</category>
      <category>sideprojects</category>
      <category>tooling</category>
    </item>
    <item>
      <title>I open-sourced a Malware Families Catalog built on EMBER 2018</title>
      <dc:creator>jordanricky1604-ship-it</dc:creator>
      <pubDate>Fri, 29 May 2026 23:12:14 +0000</pubDate>
      <link>https://dev.to/jordan1604/i-open-sourced-a-malware-families-catalog-built-on-ember-2018-40ck</link>
      <guid>https://dev.to/jordan1604/i-open-sourced-a-malware-families-catalog-built-on-ember-2018-40ck</guid>
      <description>&lt;h2&gt;What it is&lt;/h2&gt;

&lt;p&gt;I just released an open-source dataset that maps EMBER 2018 malware family labels to a unified, structured catalog. It's published identically on three platforms so you can pull it whichever way fits your workflow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Pages (canonical):&lt;/strong&gt; &lt;a href="https://jordanricky1604-ship-it.github.io/malware-families-catalog/" rel="noopener noreferrer"&gt;https://jordanricky1604-ship-it.github.io/malware-families-catalog/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HuggingFace:&lt;/strong&gt; &lt;a href="https://huggingface.co/datasets/Jordan123234/malware-families-catalog" rel="noopener noreferrer"&gt;https://huggingface.co/datasets/Jordan123234/malware-families-catalog&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kaggle:&lt;/strong&gt; &lt;a href="https://www.kaggle.com/datasets/rickyjordan/malware-families-catalog" rel="noopener noreferrer"&gt;https://www.kaggle.com/datasets/rickyjordan/malware-families-catalog&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;License is Apache-2.0. The schema is identical across all three so you can swap loaders without touching the rest of your pipeline.&lt;/p&gt;

&lt;h2&gt;Why I built it&lt;/h2&gt;

&lt;p&gt;The EMBER 2018 benchmark from Elastic is one of the most widely used static-PE malware classification datasets — but the family label column is sparse and noisy, and there's no canonical companion that catalogs which families appear, how often, and what's known about them. Most projects either drop the family labels entirely (and just do benign/malicious classification) or hand-roll their own family lookup table.&lt;/p&gt;

&lt;p&gt;I wanted a clean, honest catalog you can join against EMBER without having to do that work yourself.&lt;/p&gt;

&lt;p&gt;A few constraints I held myself to while building it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No fabricated facts.&lt;/strong&gt; If a family is obscure or unattributed, the record says so. I'd rather have a &lt;code&gt;null&lt;/code&gt; than a confident-sounding hallucination.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No manual removal instructions.&lt;/strong&gt; This is a research dataset, not a how-to. Records describe what a family is and link out to authoritative sources where appropriate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One schema everywhere.&lt;/strong&gt; Same columns, same types, same row counts on HF, Kaggle, and GitHub. The README on each platform points at the others.&lt;/li&gt;
&lt;/ul&gt;
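
&lt;p&gt;If you want to check the "one schema everywhere" claim on your own downloads, a sketch like this would do it with just the standard library. It compares column names and row counts only, not types. The function names and the two-column sample are hypothetical; the real column names aren't listed in this post.&lt;/p&gt;

```python
import csv
import io


def csv_shape(handle):
    """Return (column names, number of data rows) for an open CSV file."""
    reader = csv.reader(handle)
    header = next(reader)
    row_count = sum(1 for _ in reader)
    return header, row_count


def same_schema(handle_a, handle_b):
    """True when two CSV copies share the same columns and row count."""
    return csv_shape(handle_a) == csv_shape(handle_b)


# Hypothetical sample standing in for two downloaded copies of the catalog.
sample = "family,category\nemotet,\ntrickbot,\n"
mirrors_match = same_schema(io.StringIO(sample), io.StringIO(sample))
drifted = same_schema(io.StringIO(sample), io.StringIO("family\nemotet\n"))
```

&lt;p&gt;With real files you would pass &lt;code&gt;open(path, newline="")&lt;/code&gt; handles instead of the in-memory samples.&lt;/p&gt;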

&lt;h2&gt;How to load it&lt;/h2&gt;

&lt;p&gt;From HuggingFace:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dataset&lt;/span&gt;
&lt;span class="n"&gt;ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Jordan123234/malware-families-catalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From Kaggle (via the Kaggle CLI):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kaggle datasets download &lt;span class="nt"&gt;-d&lt;/span&gt; rickyjordan/malware-families-catalog
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From GitHub Pages (raw files):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-O&lt;/span&gt; https://jordanricky1604-ship-it.github.io/malware-families-catalog/data/families.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
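
&lt;p&gt;Once you have the CSV, joining it against your own family labels might look like the sketch below. The &lt;code&gt;family&lt;/code&gt; column name, the helper names, and the sample rows are assumptions for illustration, so check the real header before relying on any of it. Labels that are missing or not in the catalog come back as &lt;code&gt;None&lt;/code&gt; rather than being guessed.&lt;/p&gt;

```python
import csv
import io


def index_by_family(handle, key="family"):
    """Index catalog rows by lowercased family name; `key` is an assumed column."""
    return {row[key].lower(): row for row in csv.DictReader(handle)}


def join_labels(labels, catalog):
    """Pair each sample's family label with its catalog row, or None if absent."""
    return [
        (label, catalog.get(label.lower()) if label else None)
        for label in labels
    ]


# Hypothetical stand-in for the downloaded families.csv.
catalog_csv = io.StringIO("family,category\nemotet,\nnjrat,\n")
catalog = index_by_family(catalog_csv)

# Hypothetical per-sample labels: one known, one unlabeled, one not in the catalog.
joined = join_labels(["Emotet", None, "unknownfam"], catalog)
```

&lt;p&gt;For the real file, swap the in-memory sample for &lt;code&gt;open("families.csv", newline="")&lt;/code&gt; after running the curl command above.&lt;/p&gt;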



&lt;h2&gt;What's next&lt;/h2&gt;

&lt;p&gt;I'd love feedback — especially from people who've worked with EMBER and have opinions about which family attributes are most useful for downstream classifiers. Open an issue on the GitHub repo if anything looks off.&lt;/p&gt;

&lt;p&gt;If you find it useful, a star on the repo helps surface it for the next person searching for the same thing.&lt;/p&gt;

&lt;p&gt;— Built and maintained as an open community resource.&lt;/p&gt;

</description>
      <category>cybersecurity</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
