# Three Detection Paradigms. One Dataset. One Result.
For the last 147 days I’ve been building aRGus, an open-source Network Detection & Response (NDR) pipeline focused on behavioral detection using machine learning and real packet telemetry.
Today we completed something I personally wanted to see for a long time:
A direct comparison between three radically different network security paradigms on the same dataset, same hardware, and same analysis conditions.
Not “which tool is better”.
But what each paradigm is actually capable of seeing.
## The Experiment
We used the CTU-13 Neris botnet scenario from 2011.
The malicious corpus contains:
- 646 malicious flows
- IRC beaconing
- HTTP anomalies
- SMB lateral movement
- classic botnet behavior patterns
Three systems analyzed the same capture:
| System | Paradigm |
|---|---|
| Suricata | Signature-based IDS |
| Zeek | Telemetry & anomaly observation |
| aRGus | Behavioral ML-based NDR |
Environment:
- Commodity hardware
- Same PCAP corpus
- Same execution constraints
- No tuning magic
- Default operational assumptions
## Results

### Detection Metrics
| System | F1 Score | Recall | Alerts |
|---|---|---|---|
| Suricata 6.0.10 (50K ET Open rules) | 0.000 | 0% | 0 |
| Zeek 8.1.2 (default scripts) | 0.042 | 2% | 14 |
| aRGus NDR | 0.998 | 100% | 646 |
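For readers who want to sanity-check the table, the scores follow the standard precision/recall/F1 definitions. A minimal sketch, using Zeek's numbers from this experiment (14 alerts, all correct, against 646 labeled malicious flows):

```python
def detection_metrics(tp: int, fp: int, fn: int) -> dict:
    """Standard precision/recall/F1 from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Zeek in this experiment: 14 true positives, 0 false positives,
# 646 - 14 = 632 malicious flows missed.
zeek = detection_metrics(tp=14, fp=0, fn=632)
print(round(zeek["f1"], 3))  # → 0.042
```

Plugging in the counts reproduces the 0.042 F1 in the table, which is why perfect precision can still coexist with a near-zero overall score.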
## What This Actually Means
The conclusion is not:
“aRGus destroys Suricata and Zeek.”
That would completely miss the point.
The result is a taxonomy of detection paradigms.
### 1. Signature Detection — Suricata
Suricata is exceptional at detecting threats it already knows.
But the experiment demonstrated something important:
If the behavior does not match an existing rule, the engine may remain completely silent.
In this scenario:
- 50,000+ ET Open rules
- 646 malicious flows
- zero alerts
This is not a failure of Suricata.
It is the expected limitation of signature dependency.
A signature engine requires prior knowledge.
No signature → no detection.
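The dependency is easy to illustrate. The toy sketch below is not Suricata's engine, and the patterns in it are hypothetical; it only shows the structural property that matching is membership in a known set, so unseen behavior yields zero alerts by construction:

```python
# Toy illustration of signature dependency (NOT Suricata's engine).
# Patterns below are hypothetical examples, not real rules.
SIGNATURES = {
    b"NICK botnet": "IRC bot nickname",
    b"/gate.php?id=": "known C2 check-in URI",
}

def match_signatures(payload: bytes) -> list[str]:
    """A payload fires only if some known pattern occurs in it."""
    return [name for pat, name in SIGNATURES.items() if pat in payload]

# A behavior the rule set has never seen produces zero alerts,
# no matter how anomalous it actually is.
print(match_signatures(b"PRIVMSG #chan :novel beacon token"))  # → []
```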
### 2. Observability Without Classification — Zeek
Zeek produced one of the most interesting outcomes.
Under default policy:
- 14 alerts
- 100% precision
- only 2% recall
At first glance, this looks weak.
But then we inspected `weird.log`.
Zeek observed:
- IRC beaconing
- malformed HTTP behavior
- SMB anomalies
- protocol irregularities
- suspicious communication patterns
In total:
- 182 anomaly observations from the malicious host
The key insight:
Zeek saw the anomalies.
It simply did not classify them as attacks.
That distinction matters enormously.
Observability is not classification.
Telemetry alone is not detection logic.
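Counting the 182 observations above takes only a small aggregation over `weird.log`. A minimal sketch, assuming Zeek's default TSV layout where a `#fields` header names the columns (including `id.orig_h` and `name`):

```python
from collections import Counter

def weird_counts_by_host(path: str) -> Counter:
    """Count weird.log anomaly names per originating host.

    Assumes Zeek's default tab-separated log format, where a
    '#fields' header line names the columns.
    """
    counts: Counter = Counter()
    fields: list[str] = []
    with open(path) as f:
        for line in f:
            if line.startswith("#fields"):
                # First token is the literal '#fields'; the rest
                # are column names.
                fields = line.rstrip("\n").split("\t")[1:]
                continue
            if line.startswith("#") or not line.strip():
                continue  # skip other header/footer lines
            row = dict(zip(fields, line.rstrip("\n").split("\t")))
            counts[(row.get("id.orig_h"), row.get("name"))] += 1
    return counts
```

Summing the counter over one host is exactly the "182 anomaly observations" style of tally: the telemetry is all there, but nothing in it says "attack" until something aggregates and classifies it.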
### 3. Behavioral Classification — aRGus
aRGus approaches the problem differently.
Instead of asking:
“Does this match a known signature?”
it asks:
“Does this flow statistically behave like malicious activity?”
The pipeline extracts behavioral features from packet telemetry and classifies flows using machine learning models trained on malicious and legitimate traffic patterns.
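The shape of that approach can be sketched in a few lines. This is an illustrative toy, not the actual aRGus pipeline: the feature names, the toy flows, and the choice of a random forest are all assumptions made for the example.

```python
# Illustrative sketch of behavioral flow classification.
# NOT the aRGus pipeline; features and data are invented for the demo.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

FEATURES = ["duration_s", "bytes_out", "bytes_in", "pkts", "iat_std"]

# Toy training flows: label 1 = malicious (beacon-like), 0 = benign.
X = np.array([
    [0.5,   120,    80,   4, 0.01],  # short, metronomic beacon
    [0.6,   130,    85,   4, 0.01],
    [30.0,  9000,  50000, 120, 2.5],  # bursty bulk transfer
    [45.0, 12000,  80000, 200, 3.1],
])
y = np.array([1, 1, 0, 0])

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# A new flow is judged by its statistics, not by any signature.
new_flow = np.array([[0.55, 125, 82, 4, 0.012]])
print(clf.predict(new_flow))  # → [1]
```

The point of the sketch is the question being asked: nothing in it looks up a pattern or an IOC; the verdict comes entirely from how the flow behaves relative to the training distribution.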
Result:
- F1: 0.998
- Recall: 100%
- 646 malicious flows detected
The important part is not the metric itself.
The important part is that the system classified malicious behavior regardless of:
- malware family age,
- IOC availability,
- rule existence,
- or threat naming.
## Why This Matters
Many organizations cannot afford enterprise-grade NDR platforms.
Hospitals.
Schools.
Municipalities.
Small public institutions.
Yet they face the same ransomware operators as large enterprises.
The original motivation behind aRGus was simple:
Can modern behavioral detection be built using:
- open-source tooling,
- commodity hardware,
- and reproducible research?
This experiment suggests the answer may be yes.
## Important Caveats
This is not a claim that:
- signatures are obsolete,
- Zeek is ineffective,
- or ML is magically superior.
All three paradigms solve different problems.
In real environments, they complement each other.
A mature detection stack likely needs:
- signature detection,
- telemetry observation,
- behavioral analysis,
- and correlation between all layers.
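The layering above can be sketched as a trivial correlation step. The voting scheme and thresholds here are invented for illustration; the point is only that agreement between independent layers is a stronger signal than any single one:

```python
# Toy per-flow correlation across the three layers (illustrative only;
# the voting scheme and 0.8 threshold are arbitrary assumptions).
def correlate(signature_hit: bool, telemetry_anomalies: int,
              behavior_score: float) -> str:
    """Escalate when independent layers agree; surface weak
    single-layer signals for review instead of dropping them."""
    votes = sum([
        signature_hit,
        telemetry_anomalies > 0,
        behavior_score >= 0.8,
    ])
    if votes >= 2:
        return "incident"
    if votes == 1:
        return "review"
    return "benign"

# No signature match, but telemetry anomalies + high behavior score:
print(correlate(False, 3, 0.95))  # prints "incident"
```

In the Neris scenario this is roughly what happened informally: Zeek's telemetry and the behavioral classifier agreed while signatures stayed silent, and it is the agreement that carries the evidence.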
This benchmark is only one experiment.
One dataset.
One botnet family.
One point in the design space.
But it revealed something fundamental:
Different detection architectures do not merely differ in performance.
They differ in philosophy.
## Research & Project
- arXiv paper: aRGus Research Paper (arXiv:2604.04952)
- GitHub repository: aRGus GitHub Repository
Feedback, criticism, and collaboration are welcome.