Andrei Toma

Posted on Apr 25 • Originally published at hookprobe.com

We Open-Sourced 627,000 Labeled Edge-IDS Verdicts on HuggingFace

#opensource #security #ids

Every week someone asks us on Discord the same question: where can I get a labelled intrusion-detection dataset that actually looks like the internet, not a lab? Today the answer stops being "you can't." We just published hookprobe/edge-ids-threats — 627,853 verdicts produced by our production edge IDS, labelled by the SENTINEL ensemble, enriched with country and ASN, and free for academic or commercial use under CC-BY-4.0.

Why another IDS dataset

The canonical datasets the field trains on — CICIDS2017, UNSW-NB15, Kitsune — are invaluable but synthetic. They were captured in controlled testbeds with injected attacks. Models trained on them tend to generalise poorly when deployed on the open internet, where attacker behaviour is noisier, ASN distribution is skewed by bulletproof hosting, and benign traffic is dominated by CDN edges and scanner services with legitimate intent.

HookProbe runs on a Raspberry Pi 5 connected directly to the public internet. The NAPSE AI-native flow classifier processes every packet, HYDRA correlates flows into verdicts, and the SENTINEL ensemble — an isolation forest plus a calibrated naive-Bayes classifier — labels each source IP as benign, suspicious, or malicious. Those labels are what we just released.

What's in the dataset

Two files, both Parquet, both under 3 MB compressed:

`data/*.parquet` — the labelled verdicts

One row per verdict, with the timestamp truncated to the hour. Columns:

timestamp_hour — UTC, hour-aligned
src_ip_hash — SHA-256 of a salted source IP, truncated to 16 hex characters
country — ISO-3166-1 alpha-2 from RDAP enrichment
asn and asn_name — the operator behind the source IP
anomaly_score — the ensemble's raw score between 0.0 and 1.0
verdict — the categorical label: malicious, suspicious, or benign
action_taken — what HookProbe actually did with the flow
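To make the column list concrete, here is a minimal sketch of one row as a Python dataclass with a few sanity checks. The field names come from the list above; the Python types, the lowercase-hex assumption for `src_ip_hash`, and the names `VerdictRow` and `schema_problems` are illustrative assumptions, not part of the published schema.

```python
from dataclasses import dataclass
from datetime import datetime

ALLOWED_VERDICTS = {"benign", "suspicious", "malicious"}
HEX_DIGITS = set("0123456789abcdef")


@dataclass(frozen=True)
class VerdictRow:
    # Field names follow the column list; the Python types are assumptions.
    timestamp_hour: datetime
    src_ip_hash: str
    country: str
    asn: int
    asn_name: str
    anomaly_score: float
    verdict: str
    action_taken: str


def schema_problems(row: VerdictRow) -> list[str]:
    """Return human-readable problems; an empty list means the row looks sane."""
    problems = []
    ts = row.timestamp_hour
    if (ts.minute, ts.second, ts.microsecond) != (0, 0, 0):
        problems.append("timestamp_hour is not hour-aligned")
    if len(row.src_ip_hash) != 16 or not set(row.src_ip_hash) <= HEX_DIGITS:
        problems.append("src_ip_hash is not 16 lowercase hex characters")
    if not 0.0 <= row.anomaly_score <= 1.0:
        problems.append("anomaly_score is outside [0.0, 1.0]")
    if row.verdict not in ALLOWED_VERDICTS:
        problems.append("verdict is not one of the three labels")
    return problems


# The hash below is a placeholder, not a real pseudonym from the dataset.
example = VerdictRow(datetime(2026, 4, 1, 0, 0), "0123456789abcdef", "GB", 0,
                     "UNMANAGED-LTD", 0.81, "malicious", "escalate")
print(schema_problems(example))
```

Running a check like this over a freshly downloaded file is a cheap way to catch a schema change between monthly releases.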

`aggregated/daily_country_asn.parquet` — the analyst-grade view

For anyone who doesn't need per-verdict granularity: daily threat counts broken down by country and ASN, plus per-class counts and the average anomaly score. This is what you'd build a landscape report from: no ML tooling required, just pandas or DuckDB.
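For readers who want to see how such a view relates to the per-verdict rows, the sketch below rebuilds a daily country/ASN rollup using only the standard library. It is an illustration, not the publisher's actual aggregation code: the function name `aggregate_daily` and the output keys `date` and `avg_anomaly_score` are assumptions, and the sample rows are invented.

```python
from collections import defaultdict
from datetime import datetime


def aggregate_daily(rows):
    """Group verdict rows by (day, country, asn) and count each class."""
    buckets = defaultdict(lambda: {"benign": 0, "suspicious": 0, "malicious": 0,
                                   "score_sum": 0.0, "n": 0})
    for r in rows:
        key = (r["timestamp_hour"].date(), r["country"], r["asn"])
        bucket = buckets[key]
        bucket[r["verdict"]] += 1
        bucket["score_sum"] += r["anomaly_score"]
        bucket["n"] += 1

    out = []
    for (day, country, asn), b in sorted(buckets.items()):
        out.append({
            "date": day, "country": country, "asn": asn,
            "benign": b["benign"], "suspicious": b["suspicious"],
            "malicious": b["malicious"],
            "avg_anomaly_score": b["score_sum"] / b["n"],
        })
    return out


# Invented sample rows; AS64500 is reserved for documentation.
sample = [
    {"timestamp_hour": datetime(2026, 4, 1, 0), "country": "GB", "asn": 64500,
     "anomaly_score": 0.8, "verdict": "malicious"},
    {"timestamp_hour": datetime(2026, 4, 1, 5), "country": "GB", "asn": 64500,
     "anomaly_score": 0.4, "verdict": "suspicious"},
    {"timestamp_hour": datetime(2026, 4, 2, 1), "country": "GB", "asn": 64500,
     "anomaly_score": 0.1, "verdict": "benign"},
]
for line in aggregate_daily(sample):
    print(line)
```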

Privacy model

Publishing anything tied to an IP address is a minefield: in most legal interpretations, an IP is personally identifiable even when it belongs to an attacker. We made three deliberate choices to keep the release on the right side of that line:

IPs are hashed with a project salt we never publish. The same attacker always maps to the same 16-character pseudonym — so longitudinal research ("how does this actor's behaviour evolve over a month?") still works — but there is no way to recover the original IP without access to the salt, which we hold.
Timestamps are truncated to the hour. Minute-precision timestamps can be correlated against third-party logs; hour-precision materially weakens that attack.
No payload data is released. We only ship verdicts and RDAP enrichment. Not a single byte of packet contents.
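The first two choices can be sketched in a few lines. This is a minimal illustration under stated assumptions: the post only says the hash is a SHA-256 of a salted IP truncated to 16 hex characters, so prepending the salt to the ASCII address is a guess at the construction, and the function names are hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def pseudonymize_ip(ip: str, salt: bytes) -> str:
    """Salted SHA-256 of the IP, truncated to 16 hex characters.

    Assumption: the salt is simply prepended to the ASCII form of the address.
    """
    return hashlib.sha256(salt + ip.encode("ascii")).hexdigest()[:16]


def truncate_to_hour(ts: datetime) -> datetime:
    """Drop minutes, seconds and microseconds."""
    return ts.replace(minute=0, second=0, microsecond=0)


# Demo values only: a throwaway salt and a documentation-range address.
demo_salt = b"not-the-real-salt"
print(pseudonymize_ip("203.0.113.7", demo_salt))
print(truncate_to_hour(datetime(2026, 4, 1, 13, 47, 9, tzinfo=timezone.utc)))
```

The same input and salt always give the same pseudonym, which is what keeps longitudinal analysis possible. The IPv4 space is small enough to enumerate, so the protection rests entirely on the salt staying secret.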

What we deliberately excluded

The dataset starts on 10 March 2026, not on the day our sensor came online. From 22 February through 9 March the SENTINEL ensemble was in calibration, with a false-positive rate around 98 percent. Publishing that window would teach any model trained on it to see threats everywhere. It is available via an --include-calibration flag in our open publisher, but it is opt-in and loud about it. Honesty about data quality is the only way an IDS dataset ages well.
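If you do opt in to the calibration window, you may still want to separate it from the clean data before training. The post does not say whether that output marks calibration rows, so the sketch below filters by date alone, using the 10 March 2026 start stated above. The names `FIRST_CLEAN_HOUR` and `drop_calibration` are hypothetical.

```python
from datetime import datetime, timezone

# First hour of the published (post-calibration) data, as a naive UTC value.
FIRST_CLEAN_HOUR = datetime(2026, 3, 10)


def _as_naive_utc(ts: datetime) -> datetime:
    """Normalise aware timestamps to naive UTC so comparisons never mix kinds."""
    if ts.tzinfo is not None:
        ts = ts.astimezone(timezone.utc).replace(tzinfo=None)
    return ts


def drop_calibration(rows):
    """Keep only rows on or after 10 March 2026 (UTC)."""
    return [r for r in rows
            if _as_naive_utc(r["timestamp_hour"]) >= FIRST_CLEAN_HOUR]


# Invented rows straddling the boundary.
rows = [
    {"timestamp_hour": datetime(2026, 3, 9, 23), "verdict": "malicious"},
    {"timestamp_hour": datetime(2026, 3, 10, 0), "verdict": "benign"},
]
print(drop_calibration(rows))
```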

Example usage

Loading the data is one line per view once the HuggingFace datasets library is installed:

from datasets import load_dataset

# Per-verdict, ML-grade labels
ds = load_dataset("hookprobe/edge-ids-threats", name="verdicts", split="all")
print(ds[0])
# {'timestamp_hour': datetime(2026, 4, 1, 0, 0), 'src_ip_hash': '69c62b26...',
#  'country': 'GB', 'asn': 0, 'asn_name': 'UNMANAGED-LTD',
#  'anomaly_score': 0.81, 'verdict': 'malicious', 'action_taken': 'escalate'}

# Pre-aggregated daily country/ASN counts
agg = load_dataset("hookprobe/edge-ids-threats", name="aggregated", split="daily")
top_asns = agg.to_pandas().groupby("asn_name")["malicious"].sum().nlargest(10)

Ideas for what to do with it

We built this release because we want to see other people's ideas, not ours. A shortlist to seed the imagination:

Train a model and compare. Take an isolation forest, train it on CICIDS2017, then on our data, and see how the decision boundary shifts. The delta is a "reality gap" that, as far as we know, has not been quantified against a real-world labelled dataset before.
Adversary behaviour over time. Because the hashes are stable across the dataset, you can track a single actor's anomaly-score trajectory. When do scanners pivot from port-sweep to credential-spray? How long after the first alert does the hash reappear?
Fine-tune an LLM for threat triage. The aggregated view gives you per-day, per-ASN context that an LLM can read directly. Build an assistant that answers "which ASN is attacking the most right now?"
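Compare to GreyNoise. Their free tier tags internet-background-noise from sensors around the world. Our dataset is one specific edge node. The overlap and disagreement are an honest test of both. Because source IPs are published only as salted hashes, the comparison has to work at the ASN or country level rather than per IP.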
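As a starting point for the adversary-over-time idea, the sketch below groups rows by `src_ip_hash`, orders each actor's observations, and measures the gap between consecutive sightings. It uses only the standard library; the function names and sample rows are invented for illustration, and the hashes are placeholders.

```python
from collections import defaultdict
from datetime import datetime


def trajectories(rows):
    """Map each src_ip_hash to its time-ordered (timestamp, anomaly_score) points."""
    by_actor = defaultdict(list)
    for r in rows:
        by_actor[r["src_ip_hash"]].append((r["timestamp_hour"], r["anomaly_score"]))
    return {actor: sorted(points) for actor, points in by_actor.items()}


def reappearance_gaps(points):
    """Time between consecutive sightings of one actor, as timedelta objects."""
    times = [t for t, _ in points]
    return [later - earlier for earlier, later in zip(times, times[1:])]


# Placeholder hashes and invented scores.
sample = [
    {"src_ip_hash": "a" * 16, "timestamp_hour": datetime(2026, 4, 1, 6), "anomaly_score": 0.9},
    {"src_ip_hash": "a" * 16, "timestamp_hour": datetime(2026, 4, 1, 1), "anomaly_score": 0.2},
    {"src_ip_hash": "b" * 16, "timestamp_hour": datetime(2026, 4, 2, 3), "anomaly_score": 0.5},
]
for actor, points in trajectories(sample).items():
    print(actor, [score for _, score in points], reappearance_gaps(points))
```

Since timestamps are hour-aligned, gaps come out in whole hours, and several verdicts for one hash within the same hour share a timestamp.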

Release cadence

We will append a new month of data at 02:00 UTC on the 1st of each calendar month. The dataset page on HuggingFace is the canonical source; the live threat map on hookprobe.com is the rolling view. Every quarter we publish a companion State of Edge IDS Telemetry report with the narrative: who attacked, from where, using what techniques.

Citation

If you use the dataset in published work, we appreciate a citation. We keep one up to date at the bottom of the dataset card on HuggingFace; the current form is:

@dataset{hookprobe_edge_ids_threats_2026,
  author  = {{HookProbe Security Research}},
  title   = {{HookProbe Edge IDS Threat Telemetry}},
  year    = {2026},
  url     = {https://huggingface.co/datasets/hookprobe/edge-ids-threats},
  note    = {627,853 verdicts; temporal coverage 2026-03 to 2026-04},
}

Get the dataset at huggingface.co/datasets/hookprobe/edge-ids-threats. If you find something interesting — or a bug in the labels — open a discussion on the HuggingFace page, or drop into our Discord. We maintain the publisher on GitHub; pull requests that improve the aggregation views are welcome.

HookProbe is an open-source AI-native IDS that runs on a Raspberry Pi.

GitHub: github.com/hookprobe/hookprobe
