shipping an offline log triage cli, and the parser bugs that still haunt me

#cli #python #security #showdev

a few weeks back i had a 226MB JSON log dump from a lab exercise and absolutely no desire to stand up a full SIEM just to find brute force attempts and lateral movement traces. i tried grep, gave up, tried jq, gave up harder, then ended up writing a python script that snowballed into ThreatLens.

it's a CLI that does offline triage on security logs. 12 detection modules, sigma rule compat, MITRE ATT&CK mapping, no daemon, no docker, no infra. you point it at a folder, it gives you a report.

threatlens scan logs/ --sigma-rules sigma/rules/ --min-severity high -o report.html

that's it. one command, html report on the other side.

what i actually had to solve

the first thing that bit me was EVTX parsing. windows event log binary format is annoyingly underdocumented in places, and the python-evtx library is solid but slow if you use it naively. i was getting around 8k events/sec on a 22MB file, which was unusable.

i ended up streaming records instead of loading them, plus deferring the XML-to-dict conversion until a record actually matched a candidate detector. that pulled the throughput up. on synthetic benchmarks (single-core python 3.11):

9k events / 2.3 MB: 0.13s, 69.3k events/sec
90k events / 22.6 MB: 1.27s, 71.0k events/sec
900k events / 226 MB: 14.24s, 63.3k events/sec

it scales pretty flat, which i didn't expect. i thought GC churn or memory pressure would tank the big runs. it didn't.
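
a minimal sketch of the stream-then-defer idea, not ThreatLens's actual code. it assumes each record is already available as a raw XML string (it does not call python-evtx), and the names CANDIDATE_EVENT_IDS, to_dict and matching_events are made up for illustration. the point is only that a cheap regex check runs on every record, while the full XML parse runs on the few that could match a detector.

```python
import re
import xml.etree.ElementTree as ET
from typing import Iterable, Iterator

# hypothetical: event IDs that at least one enabled detector cares about
CANDIDATE_EVENT_IDS = {"4624", "4625", "4104"}

# cheap pre-filter: read the EventID out of the raw XML without building a tree
_EVENT_ID_RE = re.compile(r"<EventID[^>]*>(\d+)</EventID>")


def to_dict(raw_xml: str) -> dict:
    """The expensive step: full XML parse, flattened into a plain dict."""
    root = ET.fromstring(raw_xml)
    event = {}
    for elem in root.iter():
        tag = elem.tag.rsplit("}", 1)[-1]  # drop any XML namespace prefix
        if tag == "Data" and "Name" in elem.attrib:
            event[elem.attrib["Name"]] = elem.text or ""
        elif elem.text and elem.text.strip():
            event[tag] = elem.text.strip()
    return event


def matching_events(raw_records: Iterable[str]) -> Iterator[dict]:
    """Stream records one at a time; convert only those a detector could match."""
    for raw in raw_records:
        m = _EVENT_ID_RE.search(raw)
        if m is None or m.group(1) not in CANDIDATE_EVENT_IDS:
            continue  # skipped records never pay for the XML parse
        yield to_dict(raw)
```

because matching_events is a generator, nothing is held in memory beyond the record currently being looked at, which is one plausible reason throughput would stay flat as the input grows.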

the detector design

there are 12 built-in detection modules and they all implement the same interface. you can write custom ones in python or YAML.

YAML rule example:

name: suspicious_powershell_encoded
event_id: 4104
match:
  - field: script_block
    op: regex
    value: '(?i)([A-Za-z0-9+/]{50,}={0,2})'
  - field: user
    op: not_equals
    value: SYSTEM
severity: high
mitre: T1059.001

twelve operators total: equals, contains, regex, thresholds, time windows, and a few others. sigma rules also work. i implemented selection/filter/condition parsing and most field modifiers. not all of them. the sigma cidr modifier is half-broken in my impl and i know it. it's on the issue list.
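
a rough sketch of how a rule like the one above could be evaluated against a single event. this is an illustration under assumptions, not the real engine: it covers only four operators, it assumes all match entries are ANDed together, it assumes a missing field never matches, and OPERATORS and rule_matches are hypothetical names. the rule is written as the dict a YAML loader would hand back, so the sketch needs no YAML library.

```python
import re
from typing import Callable

# operator name -> predicate(actual_value, rule_value); four operators only
OPERATORS: dict[str, Callable[[str, str], bool]] = {
    "equals": lambda actual, expected: actual == expected,
    "not_equals": lambda actual, expected: actual != expected,
    "contains": lambda actual, expected: expected in actual,
    "regex": lambda actual, pattern: re.search(pattern, actual) is not None,
}

# the YAML rule from above, as a plain dict
RULE = {
    "name": "suspicious_powershell_encoded",
    "event_id": 4104,
    "match": [
        {"field": "script_block", "op": "regex",
         "value": "(?i)([A-Za-z0-9+/]{50,}={0,2})"},
        {"field": "user", "op": "not_equals", "value": "SYSTEM"},
    ],
    "severity": "high",
    "mitre": "T1059.001",
}


def rule_matches(rule: dict, event: dict) -> bool:
    """True when the event ID matches and every condition in `match` holds."""
    if event.get("event_id") != rule["event_id"]:
        return False
    for cond in rule["match"]:
        actual = event.get(cond["field"])
        if actual is None:
            return False  # assumption: a missing field never matches
        if not OPERATORS[cond["op"]](str(actual), str(cond["value"])):
            return False
    return True
```

keeping operators in a lookup table means adding a new one is a single entry rather than another branch in the matcher.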

multi-stage chain correlation

this is the part i'm actually proud of. instead of just firing alerts per-event, ThreatLens groups events that look like they're part of the same kill chain. brute force, then an interactive logon, then mimikatz-style SAM access. it links them across time windows.

on a 26-event focused simulation it found 1 CRITICAL, 8 HIGH, 2 MEDIUM, 1 LOW. on a 52-event mixed dataset (benign noise plus embedded attack) it hit zero false positives and 100% detection on the embedded TTPs.

i don't trust those numbers as a generalization. the corpus is small and i wrote both datasets myself. but it's enough to say the correlation logic isn't just hallucinating, which is the bar i actually cared about.
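
to make the chaining idea concrete, here is a deliberately simplified sketch. it is not the ThreatLens correlator: the Alert shape, the stage names in CHAIN, the 30-minute WINDOW and grouping by host are all assumptions picked for the example. it walks each host's alerts in time order and reports a chain only when the stages show up in sequence inside the window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Alert:
    host: str
    stage: str
    time: datetime


# hypothetical stage names, in the order an attack would produce them
CHAIN = ("brute_force", "interactive_logon", "credential_access")
WINDOW = timedelta(minutes=30)  # hypothetical correlation window


def find_chains(alerts: list[Alert]) -> list[list[Alert]]:
    """Return one list of alerts per completed CHAIN, grouped by host."""
    by_host: dict[str, list[Alert]] = {}
    for alert in sorted(alerts, key=lambda a: a.time):
        by_host.setdefault(alert.host, []).append(alert)

    chains = []
    for host_alerts in by_host.values():
        partial: list[Alert] = []
        for alert in host_alerts:
            # drop a half-built chain once its first stage is older than WINDOW
            if partial and alert.time - partial[0].time > WINDOW:
                partial = []
            if alert.stage == CHAIN[len(partial)]:
                partial.append(alert)
                if len(partial) == len(CHAIN):
                    chains.append(partial)
                    partial = []
    return chains
```

this greedy version measures the window from the first stage and ignores repeated stages, so it will miss some interleavings that a real correlator would want to catch.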

configuration and CI use

there's a ~/.threatlens.yaml file for defaults so you don't have to repeat flags. CLI overrides config. you can also ship an allowlist.yaml that suppresses known-good alerts, which matters more than it sounds, because as soon as you point a tool like this at real logs you get drowned in legitimate-but-suspicious-looking activity (admin tooling, scheduled tasks, backup agents).

i ended up adding the allowlist mid-project because i was getting 200 alerts on my own dev box and almost none were real. now they live in YAML and i version-control them per environment.
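
a small sketch of the two behaviours described here: CLI flags overriding config-file values, and allowlist entries suppressing alerts. the key names, the allowlist entry shape (a dict of field/value pairs that must all match) and the function names are assumptions for illustration, not the real ~/.threatlens.yaml or allowlist.yaml schema.

```python
def resolve_settings(cli_args: dict, file_config: dict, defaults: dict) -> dict:
    """CLI beats config file beats built-in defaults.

    A CLI value of None means the flag was not given, so it does not override.
    """
    given = {key: value for key, value in cli_args.items() if value is not None}
    return {**defaults, **file_config, **given}


def is_allowlisted(alert: dict, allowlist: list[dict]) -> bool:
    """An entry suppresses an alert when every field in the entry matches it."""
    return any(
        entry and all(alert.get(key) == value for key, value in entry.items())
        for entry in allowlist
    )


def apply_allowlist(alerts: list[dict], allowlist: list[dict]) -> list[dict]:
    """Keep only the alerts that no allowlist entry suppresses."""
    return [alert for alert in alerts if not is_allowlisted(alert, allowlist)]
```

the `entry and ...` guard is there so an accidentally empty entry does not silently suppress every alert.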

there's also a --fail-on flag that returns exit code 2 if alerts above a threshold fire. dumb little thing, but it means you can wire ThreatLens into a CI step on a log corpus and have it actually break the build if a regression sneaks in.
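
the exit-code logic behind a flag like this can be sketched in a few lines. this is an assumption-laden illustration: the severity names and their ordering come from the levels mentioned in this post, "threshold" is read here as "at or above the given level" (the real flag may differ), and exit_code is a hypothetical helper name.

```python
from typing import Iterable, Optional

# lowest to highest; assumed ordering of the severity levels
SEVERITY_ORDER = ["low", "medium", "high", "critical"]


def exit_code(alert_severities: Iterable[str], fail_on: Optional[str]) -> int:
    """Return 2 if any alert is at or above `fail_on`, else 0."""
    if fail_on is None:
        return 0  # flag not given: never fail the run
    threshold = SEVERITY_ORDER.index(fail_on.lower())
    for severity in alert_severities:
        if SEVERITY_ORDER.index(severity.lower()) >= threshold:
            return 2
    return 0
```

a CLI entry point would finish with something like `sys.exit(exit_code(severities, args.fail_on))`, and a CI step fails on any non-zero exit status.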

what's broken

27 open issues. some of them are real.
sigma cidr modifier as mentioned
the streamlit dashboard exporter occasionally double-counts events when the input is NDJSON with trailing whitespace lines. i thought i fixed it. i didn't.
the follow subcommand (real-time tailing) leaks file handles if you ctrl-C during a log rotation event. found this one in the wild. embarrassing.
EVTX parsing on logs that have been touched by wevtutil epl sometimes desyncs. i think this is upstream but i haven't proved it.

outputs i ship

it can dump to JSON, CSV, HTML, interactive timelines (one html file with a vis-timeline embed), ATT&CK Navigator layers, or STIX 2.1 bundles, and push to Elasticsearch, Wazuh, or Splunk HEC. the navigator output is the one i use most. you scan, you load the layer, and the heatmap of touched techniques is instantly readable.
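
for a sense of what a navigator layer contains, here is a minimal sketch that turns a list of technique IDs into layer JSON, scoring each technique by how many alerts touched it. it is not the ThreatLens exporter. the field names follow the Navigator layer format as best understood here, the "4.5" layer version is an assumption to check against the Navigator release you run, and navigator_layer is a hypothetical function name.

```python
import json
from collections import Counter


def navigator_layer(technique_ids: list[str], name: str = "ThreatLens scan") -> dict:
    """Build a layer dict with one entry per technique, scored by alert count."""
    counts = Counter(technique_ids)
    return {
        "name": name,
        "versions": {"layer": "4.5"},  # assumption: adjust to your Navigator
        "domain": "enterprise-attack",
        "techniques": [
            {"techniqueID": technique, "score": hits}
            for technique, hits in sorted(counts.items())
        ],
    }


if __name__ == "__main__":
    layer = navigator_layer(["T1110", "T1059.001", "T1110"])
    print(json.dumps(layer, indent=2))
```

the score is what drives the heatmap colouring once the layer is loaded.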

what i'd do differently

honestly, i'd write the YAML rule loader first. i wrote the python plugin system first because it was more fun, then bolted YAML on later, and there are seams where the two abstractions don't quite agree. if i rewrote it i'd start at the rule format and make python plugins compile down to the same internal representation.

also i'd write tests earlier. test coverage is maybe 40%. the correlation logic has decent coverage because i kept breaking it. the output formatters basically have none.

repo

https://github.com/TiltedLunar123/ThreatLens

MIT licensed. PRs welcome, issues even more welcome. if you triage logs and have an EVTX file that breaks the parser, i would actually love to see it.