Honeypots generate tons of noisy logs. The challenge: how do you quickly tell which IPs deserve your attention and which are just background noise?
In this post, I'll walk through how I built an IOC triage pipeline that ingests Suricata/Zeek telemetry, scores suspicious IPs, applies unsupervised ML, and outputs actionable blocklists.
## The Problem
If you've ever run a honeypot like T-Pot, you know the drill:
- Gigabytes of Suricata/Zeek alerts
- Thousands of unique source IPs
- Endless false positives
Manually sorting through all this isn't scalable.
I wanted a pipeline that could automatically:
- Aggregate activity per IP
- Score each IP on suspicious behavior
- Use ML to flag anomalies
- Output human-readable casefiles + blocklists
## The IOC Triage Pipeline
I built a Python tool, `ioc_triage.py`, that takes NDJSON logs and produces structured outputs.
### Key Features
- Ingest Suricata/Zeek/T-Pot logs
- Aggregate features like flows/min, unique ports, entropy, burstiness
- Rule-based scoring (customizable via config.yaml)
- Unsupervised ML (IsolationForest + LOF + OCSVM, optional PyOD HBOS+COPOD)
- Fusion of rules + ML → combined tier (observe, investigate, block_candidate)
Outputs:
- Enriched per-IP CSVs
- JSON casefiles
- Blocklists (per-IP and prefix)
## How It Works
### 1. Ingest
Reads Suricata NDJSON logs:
```bash
python scripts/ioc_triage.py \
  --input data/samples/raw.ndjson \
  --hours 72 -vv
```
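Under the hood, ingest is just streaming NDJSON line by line and keeping events newer than the cutoff. Here's a minimal sketch of that idea (field names follow Suricata's EVE format; the helper is illustrative, not the actual ioc_triage.py code):

```python
import json
from datetime import datetime, timedelta, timezone

def read_recent_events(path, hours=72):
    """Yield Suricata EVE/NDJSON events newer than the cutoff."""
    cutoff = datetime.now(timezone.utc) - timedelta(hours=hours)
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed lines instead of aborting the run
            ts = event.get("timestamp")
            # Suricata timestamps are ISO 8601; older Pythons may need the
            # "+0000" offset rewritten as "+00:00" before parsing
            if ts and datetime.fromisoformat(ts.replace("+0000", "+00:00")) >= cutoff:
                yield event
```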
### 2. Aggregate
Per source IP, it computes:
- Flows/minute
- Unique src/dst ports
- Burstiness (variance of activity)
- Port entropy
- Signature counts
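Two of the less obvious features, burstiness and port entropy, boil down to simple statistics over each IP's activity. A rough sketch (the pipeline's exact definitions may differ):

```python
import math
from collections import Counter

def burstiness(per_minute_counts):
    """Variance of per-minute flow counts: higher = more bursty activity."""
    n = len(per_minute_counts)
    if n == 0:
        return 0.0
    mean = sum(per_minute_counts) / n
    return sum((c - mean) ** 2 for c in per_minute_counts) / n

def port_entropy(dst_ports):
    """Shannon entropy of destination ports: scanners spread across many ports."""
    counts = Counter(dst_ports)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A scanner hitting many distinct ports has high entropy (~2.58 bits here)
print(port_entropy([22, 23, 80, 443, 3389, 8080]))
```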
### 3. Score

Configurable rule weights in `scripts/config.yaml`:
```yaml
score:
  weights:
    flows_per_min: 2.0
    unique_dst_ports: 1.6
    unique_src_ports: 1.3
    alert_count: 0.8
    max_severity: 0.6
  thresholds:
    block: 7.0
    investigate: 3.5
```
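The rule score itself is essentially a weighted sum of those per-IP features checked against the tier thresholds. A hedged sketch of that logic (key names mirror the config above; the real scoring in ioc_triage.py may normalize or cap features differently):

```python
def rule_score(features, weights):
    """Weighted sum of numeric per-IP features."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def rule_tier(score, thresholds):
    """Map a score onto observe / investigate / block_candidate."""
    if score >= thresholds["block"]:
        return "block_candidate"
    if score >= thresholds["investigate"]:
        return "investigate"
    return "observe"

weights = {"flows_per_min": 2.0, "unique_dst_ports": 1.6, "unique_src_ports": 1.3,
           "alert_count": 0.8, "max_severity": 0.6}
thresholds = {"block": 7.0, "investigate": 3.5}

features = {"flows_per_min": 3.2, "unique_dst_ports": 1.5, "unique_src_ports": 0.4,
            "alert_count": 2.0, "max_severity": 3.0}
score = rule_score(features, weights)
print(score, rule_tier(score, thresholds))  # block_candidate (well above the 7.0 threshold)
```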
### 4. Machine Learning
Uses unsupervised anomaly detection:
- IsolationForest
- LocalOutlierFactor
- OneClassSVM
- Optionally, PyOD's HBOS and COPOD
Scores are normalized and combined into `ml_score` + `ml_confidence`.
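In practice, combining several detectors means min-max normalizing each model's raw anomaly score and averaging. A minimal scikit-learn sketch (the feature matrix `X` is one row per IP; the exact normalization and confidence formula in the pipeline may differ):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

def minmax(scores):
    """Rescale raw scores into [0, 1], higher = more anomalous."""
    s = np.asarray(scores, dtype=float)
    rng = s.max() - s.min()
    return (s - s.min()) / rng if rng else np.zeros_like(s)

def ml_scores(X, contamination=0.05):
    iso = IsolationForest(contamination=contamination, random_state=0).fit(X)
    lof = LocalOutlierFactor(contamination=contamination).fit(X)
    svm = OneClassSVM(nu=contamination).fit(X)
    # score_samples / negative_outlier_factor_ are "higher = more normal", so negate
    per_model = np.vstack([
        minmax(-iso.score_samples(X)),
        minmax(-lof.negative_outlier_factor_),
        minmax(-svm.score_samples(X)),
    ])
    ml_score = per_model.mean(axis=0)            # ensemble anomaly score per IP
    ml_confidence = 1.0 - per_model.std(axis=0)  # models agreeing => higher confidence
    return ml_score, ml_confidence
```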
### 5. Fusion

Rules + ML → `tier_combined`, the final decision: observe, investigate, or block_candidate.
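Fusion can be as simple as taking the more severe of the two tiers, with a confidence gate so a shaky ML verdict can't escalate an IP on its own. A hedged sketch (tier names from above; the real fusion in ioc_triage.py may weight rules and ML differently):

```python
TIER_RANK = {"observe": 0, "investigate": 1, "block_candidate": 2}

def fuse_tiers(rule_tier, ml_tier, ml_confidence, min_conf=0.6):
    """Take the more severe tier; low-confidence ML may only escalate one step."""
    ml_rank = TIER_RANK[ml_tier]
    if ml_confidence < min_conf:
        ml_rank = min(ml_rank, TIER_RANK[rule_tier] + 1)
    combined = max(TIER_RANK[rule_tier], ml_rank)
    return next(t for t, r in TIER_RANK.items() if r == combined)

print(fuse_tiers("investigate", "block_candidate", 0.94))  # block_candidate
```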
## Setup
Clone the repo:
```bash
git clone https://github.com/YOUR-USERNAME/ioc-triage-pipeline.git
cd ioc-triage-pipeline
```
Install requirements:
```bash
pip install -r requirements.txt
```
Optional ML extras:
```bash
pip install pyod scikit-learn
```
Or run via Docker:
```bash
docker build -t ioc-triage .
docker run -it --rm -v $(pwd):/app ioc-triage \
  python scripts/ioc_triage.py --input data/samples/raw.ndjson --hours 72 -vv
```
## Example Output
| ip | score | ml_score | tier | ml_tier | tier_combined | reason |
|---|---|---|---|---|---|---|
| 61.184.87.135 | 9.455 | 0.944 | block_candidate | block | block_candidate | flows/min high, burstiness high, multiple ports |
Outputs:
- `data/outputs/enriched.csv` → per-IP features
- `cases/.json` → casefiles
- `outputs/blocklist_combined.tsv` → fused blocklist
- `outputs/blocklist_combined_prefix.tsv` → aggregated /24 + /48 prefixes
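The prefix blocklist collapses individual block candidates into their covering /24 (IPv4) or /48 (IPv6) networks, which keeps the list short when an entire subnet is scanning you. A standard-library sketch (the output format here is illustrative, not the exact TSV layout):

```python
import ipaddress
from collections import Counter

def aggregate_prefixes(ips, v4_prefix=24, v6_prefix=48):
    """Count block candidates per covering /24 (IPv4) or /48 (IPv6) prefix."""
    prefixes = Counter()
    for ip in ips:
        addr = ipaddress.ip_address(ip)
        plen = v4_prefix if addr.version == 4 else v6_prefix
        prefixes[ipaddress.ip_network(f"{ip}/{plen}", strict=False)] += 1
    return prefixes

for net, hits in aggregate_prefixes(["61.184.87.135", "61.184.87.20", "2001:db8::1"]).items():
    print(f"{net}\t{hits}")
```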
## Why This Matters
This project turns raw honeypot noise into actionable intelligence:
- Analysts can focus on high-confidence threats
- Blocklists update automatically
- You can tune thresholds & ML contamination rates
It's also great for students (like me!) to showcase ML + cybersecurity skills in a practical, portfolio-ready way.
## What's Next?
- Try deep learning models (autoencoders, transformers)
- Add active enrichment (WHOIS, VirusTotal, AbuseIPDB)
- Build dashboards for live triage
## GitHub Repository
If you're into honeypots, ML, or threat intelligence, give it a star on GitHub and let me know what features you'd like to see next!