@alonso_isidoro

Three Detection Paradigms. One Dataset. One Result.

For the last 147 days I’ve been building aRGus, an open-source Network Detection & Response (NDR) pipeline focused on behavioral detection using machine learning and real packet telemetry.

Today we completed something I personally wanted to see for a long time:

A direct comparison between three radically different network security paradigms on the same dataset, same hardware, and same analysis conditions.

Not “which tool is better”.

But what each paradigm is actually capable of seeing.


The Experiment

We used the CTU-13 Neris botnet scenario from 2011.

The malicious corpus contains:

  • 646 malicious flows
  • IRC beaconing
  • HTTP anomalies
  • SMB lateral movement
  • classic botnet behavior patterns

Three systems analyzed the same capture:

| System   | Paradigm                        |
|----------|---------------------------------|
| Suricata | Signature-based IDS             |
| Zeek     | Telemetry & anomaly observation |
| aRGus    | Behavioral ML-based NDR         |

Environment:

  • Commodity hardware
  • Same PCAP corpus
  • Same execution constraints
  • No tuning magic
  • Default operational assumptions

Results

Detection Metrics

| System                               | F1 Score | Recall | Alerts |
|--------------------------------------|----------|--------|--------|
| Suricata 6.0.10 (50K ET Open rules)  | 0.000    | 0%     | 0      |
| Zeek 8.1.2 (default scripts)         | 0.042    | 2%     | 14     |
| aRGus NDR                            | 0.998    | 100%   | 646    |
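
A quick note on the math: F1 is the harmonic mean of precision and recall, which is why Zeek's perfect precision still collapses to a tiny score once 2% recall enters the denominator. A minimal sanity check in Python:

```python
# Sanity-checking the table: F1 is the harmonic mean of precision and recall.
def f1(precision: float, recall: float) -> float:
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Zeek: 14 alerts, all correct (100% precision), but only 14 of 646 flows caught.
print(round(f1(14 / 14, 14 / 646), 3))  # 0.042 -- matches the table above
```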

What This Actually Means

The conclusion is not:

“aRGus destroys Suricata and Zeek.”

That would completely miss the point.

The result is a taxonomy of detection paradigms.


1. Signature Detection — Suricata

Suricata is exceptional at detecting threats it already knows.

But the experiment demonstrated something important:

If the behavior does not match an existing rule, the engine may remain completely silent.

In this scenario:

  • 50,000+ ET Open rules
  • 646 malicious flows
  • zero alerts

This is not a failure of Suricata.

It is the expected limitation of signature dependency.

A signature engine requires prior knowledge.

No signature → no detection.
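
To make that concrete, here is a toy model of what signature matching reduces to. This is illustrative logic only, not Suricata's actual engine, and the byte pattern shown is hypothetical:

```python
# Toy model of signature matching -- illustrative only, not Suricata internals.
# The pattern below is hypothetical; real rules live in the ET Open ruleset.
SIGNATURES: list[bytes] = [
    b"known-c2-marker",  # hypothetical byte pattern from a published rule
]

def inspect(payload: bytes) -> bool:
    """Alert only when a known pattern appears in the traffic."""
    return any(sig in payload for sig in SIGNATURES)

# Botnet traffic whose patterns never made it into the loaded ruleset:
print(inspect(b"NICK bot1337\r\nJOIN #chan\r\n"))  # False -> zero alerts
```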


2. Observability Without Classification — Zeek

Zeek produced one of the most interesting outcomes.

Under default policy:

  • 14 alerts
  • 100% precision
  • only 2% recall

At first glance, this looks weak.

But then we inspected weird.log.

Zeek observed:

  • IRC beaconing
  • malformed HTTP behavior
  • SMB anomalies
  • protocol irregularities
  • suspicious communication patterns

In total:

  • 182 anomaly observations from the malicious host

The key insight:

Zeek saw the anomalies.

It simply did not classify them as attacks.

That distinction matters enormously.

Observability is not classification.

Telemetry alone is not detection logic.
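
Bridging that gap is left to the analyst or to downstream tooling. As a rough sketch of the first step, the snippet below aggregates weird.log entries per source host; it assumes Zeek's default tab-separated log format and the standard id.orig_h column:

```python
# Rough sketch: aggregate weird.log telemetry into per-host anomaly counts.
# Assumes Zeek's default tab-separated logs with a '#fields' header line.
from collections import Counter

def weird_counts_by_host(path: str) -> Counter:
    counts: Counter = Counter()
    fields: list[str] = []
    with open(path) as log:
        for line in log:
            if line.startswith("#fields"):
                fields = line.rstrip("\n").split("\t")[1:]
            elif not line.startswith("#") and line.strip():
                row = dict(zip(fields, line.rstrip("\n").split("\t")))
                counts[row.get("id.orig_h", "-")] += 1
    return counts

# weird_counts_by_host("weird.log").most_common(5)
# Counting anomalies is still not classifying attacks -- that step is on you.
```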


3. Behavioral Classification — aRGus

aRGus approaches the problem differently.

Instead of asking:

“Does this match a known signature?”

it asks:

“Does this flow statistically behave like malicious activity?”

The pipeline extracts behavioral features from packet telemetry and classifies flows using machine learning models trained on malicious and legitimate traffic patterns.
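
The sketch below shows the general shape of that approach using scikit-learn. The feature names and model choice are illustrative assumptions, not the actual aRGus pipeline:

```python
# Illustrative behavioral classifier -- NOT the actual aRGus pipeline.
# Feature names below are assumed flow statistics, chosen for illustration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

FEATURES = [
    "duration", "bytes_sent", "bytes_received",
    "pkts_sent", "pkts_received",
    "iat_mean", "iat_std",  # regular inter-arrival times hint at beaconing
]

def train(X: np.ndarray, y: np.ndarray) -> RandomForestClassifier:
    """X: one row of flow features per flow; y: 1 = malicious, 0 = benign."""
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X, y)
    return model

def is_malicious(model: RandomForestClassifier, flow: np.ndarray) -> bool:
    # No signature lookup, no IOC match: the verdict rests on behavior alone.
    return bool(model.predict(flow.reshape(1, -1))[0])
```

The key property: a flow from a never-before-seen family can still score as malicious if its behavior resembles the malicious training distribution.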

Result:

  • F1: 0.998
  • Recall: 100%
  • 646 malicious flows detected

The important part is not the metric itself.

The important part is that the system classified malicious behavior regardless of:

  • malware family age,
  • IOC availability,
  • rule existence,
  • or threat naming.

Why This Matters

Many organizations cannot afford enterprise-grade NDR platforms.

Hospitals.
Schools.
Municipalities.
Small public institutions.

Yet they face the same ransomware operators as large enterprises.

The original motivation behind aRGus was simple:

Can modern behavioral detection be built using:

  • open-source tooling,
  • commodity hardware,
  • and reproducible research?

This experiment suggests the answer may be yes.


Important Caveats

This is not a claim that:

  • signatures are obsolete,
  • Zeek is ineffective,
  • or ML is magically superior.

All three paradigms solve different problems.

In real environments, they complement each other.

A mature detection stack likely needs:

  • signature detection,
  • telemetry observation,
  • behavioral analysis,
  • and correlation between all layers (a toy sketch follows this list).
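
As a toy illustration of that correlation layer (entirely hypothetical, not part of any of the three tools), imagine fusing per-host signals:

```python
# Hypothetical correlation layer -- not part of any of the three tools.
from dataclasses import dataclass

@dataclass
class HostSignals:
    signature_alerts: int = 0     # e.g. from a Suricata-style engine
    anomaly_events: int = 0       # e.g. from Zeek-style telemetry
    ml_malicious_flows: int = 0   # e.g. from a behavioral classifier

    def layers_firing(self) -> int:
        # Any single layer can miss; agreement across layers raises confidence.
        return sum(n > 0 for n in (self.signature_alerts,
                                   self.anomaly_events,
                                   self.ml_malicious_flows))

# The Neris host in this experiment: HostSignals(0, 182, 646).layers_firing() == 2
```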

This benchmark is only one experiment.
One dataset.
One botnet family.
One point in the design space.

But it revealed something fundamental:

Different detection architectures do not merely differ in performance.

They differ in philosophy.


Research & Project

Feedback, criticism, and collaboration are welcome.
