Honeypots generate tons of noisy logs. The challenge: how do you quickly tell which IPs deserve your attention and which are just background noise?
In this post, I'll walk through how I built an IOC triage pipeline that ingests Suricata/Zeek telemetry, scores suspicious IPs, applies unsupervised ML, and outputs actionable blocklists.
## The Problem
If you've ever run a honeypot like T-Pot, you know the drill:
- Gigabytes of Suricata/Zeek alerts
- Thousands of unique source IPs
- Endless false positives
Manually sorting through all this isn't scalable.
I wanted a pipeline that could automatically:
- Aggregate activity per IP
- Score each IP on suspicious behavior
- Use ML to flag anomalies
- Output human-readable casefiles + blocklists
## The IOC Triage Pipeline
I built a Python tool, `ioc_triage.py`, that takes NDJSON logs and produces structured outputs.
### Key Features
- Ingest Suricata/Zeek/T-Pot logs
- Aggregate features like flows/min, unique ports, entropy, burstiness
- Rule-based scoring (customizable via config.yaml)
- Unsupervised ML (IsolationForest + LOF + OCSVM, optional PyOD HBOS+COPOD)
- Fusion of rules + ML → combined tier (observe, investigate, block_candidate)
Outputs:
- Enriched per-IP CSVs
- JSON casefiles
- Blocklists (per-IP and prefix)
## How It Works
### 1. Ingest
Reads Suricata NDJSON logs:
```bash
python scripts/ioc_triage.py \
  --input data/samples/raw.ndjson \
  --hours 72 -vv
```
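Under the hood, ingest is just streaming NDJSON line by line and keeping events newer than the cutoff. Here's a minimal sketch of that idea (field names follow Suricata's EVE format; the helper is illustrative, not the actual ioc_triage.py code):

```python
import json
from datetime import datetime, timedelta, timezone

def read_recent_events(path, hours=72):
    """Yield Suricata EVE/NDJSON events newer than the cutoff."""
    cutoff = datetime.now(timezone.utc) - timedelta(hours=hours)
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed lines instead of aborting the run
            ts = event.get("timestamp")
            # Suricata timestamps are ISO 8601; older Pythons may need the
            # "+0000" offset rewritten as "+00:00" before parsing
            if ts and datetime.fromisoformat(ts.replace("+0000", "+00:00")) >= cutoff:
                yield event
```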
### 2. Aggregate
Per source IP, it computes:
- Flows/minute
- Unique src/dst ports
- Burstiness (variance of activity)
- Port entropy
- Signature counts
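Two of the less obvious features, burstiness and port entropy, boil down to simple statistics over each IP's activity. A rough sketch (the pipeline's exact definitions may differ):

```python
import math
from collections import Counter

def burstiness(per_minute_counts):
    """Variance of per-minute flow counts: higher = more bursty activity."""
    n = len(per_minute_counts)
    if n == 0:
        return 0.0
    mean = sum(per_minute_counts) / n
    return sum((c - mean) ** 2 for c in per_minute_counts) / n

def port_entropy(dst_ports):
    """Shannon entropy of destination ports: scanners spread across many ports."""
    counts = Counter(dst_ports)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A scanner hitting many distinct ports has high entropy (~2.58 bits here)
print(port_entropy([22, 23, 80, 443, 3389, 8080]))
```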
### 3. Score

Configurable rule weights in `scripts/config.yaml`:
```yaml
score:
  weights:
    flows_per_min: 2.0
    unique_dst_ports: 1.6
    unique_src_ports: 1.3
    alert_count: 0.8
    max_severity: 0.6
  thresholds:
    block: 7.0
    investigate: 3.5
```
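The rule score itself is essentially a weighted sum of those per-IP features checked against the tier thresholds. A hedged sketch of that logic (key names mirror the config above; the real scoring in ioc_triage.py may normalize or cap features differently):

```python
def rule_score(features, weights):
    """Weighted sum of numeric per-IP features."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def rule_tier(score, thresholds):
    """Map a score onto observe / investigate / block_candidate."""
    if score >= thresholds["block"]:
        return "block_candidate"
    if score >= thresholds["investigate"]:
        return "investigate"
    return "observe"

weights = {"flows_per_min": 2.0, "unique_dst_ports": 1.6, "unique_src_ports": 1.3,
           "alert_count": 0.8, "max_severity": 0.6}
thresholds = {"block": 7.0, "investigate": 3.5}

features = {"flows_per_min": 3.2, "unique_dst_ports": 1.5, "unique_src_ports": 0.4,
            "alert_count": 2.0, "max_severity": 3.0}
score = rule_score(features, weights)
print(score, rule_tier(score, thresholds))  # block_candidate (well above the 7.0 threshold)
```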
### 4. Machine Learning
Uses unsupervised anomaly detection:
- IsolationForest
- LocalOutlierFactor
- OneClassSVM
- Optionally, PyOD's HBOS and COPOD
Scores are normalized and combined into `ml_score` + `ml_confidence`.
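In practice, combining several detectors means min-max normalizing each model's raw anomaly score and averaging. A minimal scikit-learn sketch (the feature matrix `X` is one row per IP; the exact normalization and confidence formula in the pipeline may differ):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

def minmax(scores):
    """Rescale raw scores into [0, 1], higher = more anomalous."""
    s = np.asarray(scores, dtype=float)
    rng = s.max() - s.min()
    return (s - s.min()) / rng if rng else np.zeros_like(s)

def ml_scores(X, contamination=0.05):
    iso = IsolationForest(contamination=contamination, random_state=0).fit(X)
    lof = LocalOutlierFactor(contamination=contamination).fit(X)
    svm = OneClassSVM(nu=contamination).fit(X)
    # score_samples / negative_outlier_factor_ are "higher = more normal", so negate
    per_model = np.vstack([
        minmax(-iso.score_samples(X)),
        minmax(-lof.negative_outlier_factor_),
        minmax(-svm.score_samples(X)),
    ])
    ml_score = per_model.mean(axis=0)            # ensemble anomaly score per IP
    ml_confidence = 1.0 - per_model.std(axis=0)  # models agreeing => higher confidence
    return ml_score, ml_confidence
```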
### 5. Fusion

Rules + ML → `tier_combined`, the final decision: observe, investigate, or block_candidate.
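Fusion can be as simple as taking the more severe of the two tiers, with a confidence gate so a shaky ML verdict can't escalate an IP on its own. A hedged sketch (tier names from above; the real fusion in ioc_triage.py may weight rules and ML differently):

```python
TIER_RANK = {"observe": 0, "investigate": 1, "block_candidate": 2}

def fuse_tiers(rule_tier, ml_tier, ml_confidence, min_conf=0.6):
    """Take the more severe tier; low-confidence ML may only escalate one step."""
    ml_rank = TIER_RANK[ml_tier]
    if ml_confidence < min_conf:
        ml_rank = min(ml_rank, TIER_RANK[rule_tier] + 1)
    combined = max(TIER_RANK[rule_tier], ml_rank)
    return next(t for t, r in TIER_RANK.items() if r == combined)

print(fuse_tiers("investigate", "block_candidate", 0.94))  # block_candidate
```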
## Setup
Clone the repo:
```bash
git clone https://github.com/YOUR-USERNAME/ioc-triage-pipeline.git
cd ioc-triage-pipeline
```
Install requirements:
```bash
pip install -r requirements.txt
```
Optional ML extras:
```bash
pip install pyod scikit-learn
```
Or run via Docker:
```bash
docker build -t ioc-triage .
docker run -it --rm -v $(pwd):/app ioc-triage \
  python scripts/ioc_triage.py --input data/samples/raw.ndjson --hours 72 -vv
```
## Example Output
| ip | score | ml_score | tier | ml_tier | tier_combined | reason |
|---|---|---|---|---|---|---|
| 61.184.87.135 | 9.455 | 0.944 | block_candidate | block | block_candidate | flows/min high, burstiness high, multiple ports |
Outputs:
- `data/outputs/enriched.csv` → per-IP features
- `cases/.json` → casefiles
- `outputs/blocklist_combined.tsv` → fused blocklist
- `outputs/blocklist_combined_prefix.tsv` → aggregated /24 + /48 prefixes
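The prefix blocklist collapses individual block candidates into their covering /24 (IPv4) or /48 (IPv6) networks, which keeps the list short when an entire subnet is scanning you. A standard-library sketch (the output format here is illustrative, not the exact TSV layout):

```python
import ipaddress
from collections import Counter

def aggregate_prefixes(ips, v4_prefix=24, v6_prefix=48):
    """Count block candidates per covering /24 (IPv4) or /48 (IPv6) prefix."""
    prefixes = Counter()
    for ip in ips:
        addr = ipaddress.ip_address(ip)
        plen = v4_prefix if addr.version == 4 else v6_prefix
        prefixes[ipaddress.ip_network(f"{ip}/{plen}", strict=False)] += 1
    return prefixes

for net, hits in aggregate_prefixes(["61.184.87.135", "61.184.87.20", "2001:db8::1"]).items():
    print(f"{net}\t{hits}")
```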
## Why This Matters
This project turns raw honeypot noise into actionable intelligence:
- Analysts can focus on high-confidence threats
- Blocklists update automatically
- You can tune thresholds & ML contamination rates
It's also great for students (like me!) to showcase ML + cybersecurity skills in a practical, portfolio-ready way.
## What's Next?
- Try deep learning models (autoencoders, transformers)
- Add active enrichment (WHOIS, VirusTotal, AbuseIPDB)
- Build dashboards for live triage
## GitHub Repository
If you're into honeypots, ML, or threat intelligence, give it a star on GitHub and let me know what features you'd like to see next!