# Three Detection Paradigms. One Dataset. One Result.
For the last 147 days I’ve been building aRGus, an open-source Network Detection & Response (NDR) pipeline focused on behavioral detection using machine learning and real packet telemetry.
Today we completed something I personally wanted to see for a long time:
A direct comparison between three radically different network security paradigms on the same dataset, same hardware, and same analysis conditions.
Not “which tool is better”.
But what each paradigm is actually capable of seeing.
## The Experiment
We used the CTU-13 Neris botnet scenario from 2011.
The malicious corpus contains:
- 646 malicious flows
- IRC beaconing
- HTTP anomalies
- SMB lateral movement
- classic botnet behavior patterns
Three systems analyzed the same capture:
| System | Paradigm |
|---|---|
| Suricata | Signature-based IDS |
| Zeek | Telemetry & anomaly observation |
| aRGus | Behavioral ML-based NDR |
Environment:
- Commodity hardware
- Same PCAP corpus
- Same execution constraints
- No tuning magic
- Default operational assumptions
## Results

### Detection Metrics
| System | F1 Score | Recall | Alerts |
|---|---|---|---|
| Suricata 6.0.10 (50K ET Open rules) | 0.000 | 0% | 0 |
| Zeek 8.1.2 (default scripts) | 0.042 | 2% | 14 |
| aRGus NDR | 0.998 | 100% | 646 |
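For readers who want to sanity-check the table, the scores follow the standard precision/recall/F1 definitions. A minimal sketch, using Zeek's numbers from this experiment (14 alerts, all correct, against 646 labeled malicious flows):

```python
def detection_metrics(tp: int, fp: int, fn: int) -> dict:
    """Standard precision/recall/F1 from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Zeek in this experiment: 14 true positives, 0 false positives,
# 646 - 14 = 632 malicious flows missed.
zeek = detection_metrics(tp=14, fp=0, fn=632)
print(round(zeek["f1"], 3))  # → 0.042
```

Plugging in the counts reproduces the 0.042 F1 in the table, which is why perfect precision can still coexist with a near-zero overall score.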
## What This Actually Means
The conclusion is not:
“aRGus destroys Suricata and Zeek.”
That would completely miss the point.
The result is a taxonomy of detection paradigms.
### 1. Signature Detection — Suricata
Suricata is exceptional at detecting threats it already knows.
But the experiment demonstrated something important:
If the behavior does not match an existing rule, the engine may remain completely silent.
In this scenario:
- 50,000+ ET Open rules
- 646 malicious flows
- zero alerts
This is not a failure of Suricata.
It is the expected limitation of signature dependency.
A signature engine requires prior knowledge.
No signature → no detection.
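The dependency is easy to illustrate. The toy sketch below is not Suricata's engine, and the patterns in it are hypothetical; it only shows the structural property that matching is membership in a known set, so unseen behavior yields zero alerts by construction:

```python
# Toy illustration of signature dependency (NOT Suricata's engine).
# Patterns below are hypothetical examples, not real rules.
SIGNATURES = {
    b"NICK botnet": "IRC bot nickname",
    b"/gate.php?id=": "known C2 check-in URI",
}

def match_signatures(payload: bytes) -> list[str]:
    """A payload fires only if some known pattern occurs in it."""
    return [name for pat, name in SIGNATURES.items() if pat in payload]

# A behavior the rule set has never seen produces zero alerts,
# no matter how anomalous it actually is.
print(match_signatures(b"PRIVMSG #chan :novel beacon token"))  # → []
```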
### 2. Observability Without Classification — Zeek
Zeek produced one of the most interesting outcomes.
Under default policy:
- 14 alerts
- 100% precision
- only 2% recall
At first glance, this looks weak.
But then we inspected `weird.log`.
Zeek observed:
- IRC beaconing
- malformed HTTP behavior
- SMB anomalies
- protocol irregularities
- suspicious communication patterns
In total:
- 182 anomaly observations from the malicious host
The key insight:
Zeek saw the anomalies.
It simply did not classify them as attacks.
That distinction matters enormously.
Observability is not classification.
Telemetry alone is not detection logic.
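Counting the 182 observations above takes only a small aggregation over `weird.log`. A minimal sketch, assuming Zeek's default TSV layout where a `#fields` header names the columns (including `id.orig_h` and `name`):

```python
from collections import Counter

def weird_counts_by_host(path: str) -> Counter:
    """Count weird.log anomaly names per originating host.

    Assumes Zeek's default tab-separated log format, where a
    '#fields' header line names the columns.
    """
    counts: Counter = Counter()
    fields: list[str] = []
    with open(path) as f:
        for line in f:
            if line.startswith("#fields"):
                # First token is the literal '#fields'; the rest
                # are column names.
                fields = line.rstrip("\n").split("\t")[1:]
                continue
            if line.startswith("#") or not line.strip():
                continue  # skip other header/footer lines
            row = dict(zip(fields, line.rstrip("\n").split("\t")))
            counts[(row.get("id.orig_h"), row.get("name"))] += 1
    return counts
```

Summing the counter over one host is exactly the "182 anomaly observations" style of tally: the telemetry is all there, but nothing in it says "attack" until something aggregates and classifies it.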
### 3. Behavioral Classification — aRGus
aRGus approaches the problem differently.
Instead of asking:
“Does this match a known signature?”
it asks:
“Does this flow statistically behave like malicious activity?”
The pipeline extracts behavioral features from packet telemetry and classifies flows using machine learning models trained on malicious and legitimate traffic patterns.
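The shape of that approach can be sketched in a few lines. This is an illustrative toy, not the actual aRGus pipeline: the feature names, the toy flows, and the choice of a random forest are all assumptions made for the example.

```python
# Illustrative sketch of behavioral flow classification.
# NOT the aRGus pipeline; features and data are invented for the demo.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

FEATURES = ["duration_s", "bytes_out", "bytes_in", "pkts", "iat_std"]

# Toy training flows: label 1 = malicious (beacon-like), 0 = benign.
X = np.array([
    [0.5,   120,    80,   4, 0.01],  # short, metronomic beacon
    [0.6,   130,    85,   4, 0.01],
    [30.0,  9000,  50000, 120, 2.5],  # bursty bulk transfer
    [45.0, 12000,  80000, 200, 3.1],
])
y = np.array([1, 1, 0, 0])

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# A new flow is judged by its statistics, not by any signature.
new_flow = np.array([[0.55, 125, 82, 4, 0.012]])
print(clf.predict(new_flow))  # → [1]
```

The point of the sketch is the question being asked: nothing in it looks up a pattern or an IOC; the verdict comes entirely from how the flow behaves relative to the training distribution.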
Result:
- F1: 0.998
- Recall: 100%
- 646 malicious flows detected
The important part is not the metric itself.
The important part is that the system classified malicious behavior regardless of:
- malware family age,
- IOC availability,
- rule existence,
- or threat naming.
## Why This Matters
Many organizations cannot afford enterprise-grade NDR platforms.
Hospitals.
Schools.
Municipalities.
Small public institutions.
Yet they face the same ransomware operators as large enterprises.
The original motivation behind aRGus was simple:
Can modern behavioral detection be built using:
- open-source tooling,
- commodity hardware,
- and reproducible research?
This experiment suggests the answer may be yes.
## Important Caveats
This is not a claim that:
- signatures are obsolete,
- Zeek is ineffective,
- or ML is magically superior.
All three paradigms solve different problems.
In real environments, they complement each other.
A mature detection stack likely needs:
- signature detection,
- telemetry observation,
- behavioral analysis,
- and correlation between all layers.
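The layering above can be sketched as a trivial correlation step. The voting scheme and thresholds here are invented for illustration; the point is only that agreement between independent layers is a stronger signal than any single one:

```python
# Toy per-flow correlation across the three layers (illustrative only;
# the voting scheme and 0.8 threshold are arbitrary assumptions).
def correlate(signature_hit: bool, telemetry_anomalies: int,
              behavior_score: float) -> str:
    """Escalate when independent layers agree; surface weak
    single-layer signals for review instead of dropping them."""
    votes = sum([
        signature_hit,
        telemetry_anomalies > 0,
        behavior_score >= 0.8,
    ])
    if votes >= 2:
        return "incident"
    if votes == 1:
        return "review"
    return "benign"

# No signature match, but telemetry anomalies + high behavior score:
print(correlate(False, 3, 0.95))  # prints "incident"
```

In the Neris scenario this is roughly what happened informally: Zeek's telemetry and the behavioral classifier agreed while signatures stayed silent, and it is the agreement that carries the evidence.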
This benchmark is only one experiment.
One dataset.
One botnet family.
One point in the design space.
But it revealed something fundamental:
Different detection architectures do not merely differ in performance.
They differ in philosophy.
## Research & Project
- arXiv paper: aRGus Research Paper (arXiv:2604.04952)
- GitHub repository: aRGus GitHub Repository
Feedback, criticism, and collaboration are welcome.