my sigma scanner can't count, so i wrote that down instead of faking it

#cybersecurity #python #detection #siem

i've got a small python tool called SIEMForge. you point it at a log file and a folder of sigma rules and it tells you which rules fire on which events, no SIEM involved. it's at v3.1, 10 bundled rules, and the only runtime dependency is pyyaml.

most of what it does is one-event-at-a-time matching. read an event, run every rule against it, print the ones that hit. that model is dead simple, and it's most of the reason the tool is useful for actually writing rules. you tweak a rule, rerun, see the result in under a second. that loop is the whole point.

then i hit the one rule that doesn't fit that model at all.

the rule that's supposed to wait

one of the bundled rules is an ssh brute-force rule. the intent is normal: fire after ten failed logins from the same source ip in a short window. that's a count over time. the sigma rule even carries a custom.threshold_count field that says exactly that.

my scanner has no memory between events. it looks at event 1, forgets it, looks at event 2, forgets it. so when it reaches the first failed ssh login that matches the rule's pattern, it fires. immediately. on failure number one.

which is wrong. an analyst seeing that alert would assume ten failures happened. one did. a single failed ssh login is a tuesday, not an incident.

i sat with this for a while, because there were a few honest ways to go and only one of them was any good.

option one: actually build the counting

i could give the scanner state. keep a dict keyed on (rule, source ip) holding a list of timestamps, and only alert once the count inside the window crosses the threshold. that's real correlation.

it's also a real feature with real edge cases. where's the window boundary? what about clock skew between the log's timestamps and anything else? when do you evict old entries so the dict doesn't grow forever? what's the memory footprint when someone points this at a 90MB jsonl with a million distinct source ips? given the tool reads logs that might come off a compromised box, that last one is not a hypothetical.
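
for a sense of scale, here's a minimal sketch of what the stateful version might look like. it's not code from SIEMForge: ThresholdCounter, observe, and the 60-second window are names and numbers made up for illustration, and it assumes events arrive in timestamp order.

```python
from collections import defaultdict, deque


class ThresholdCounter:
    """Hypothetical sketch: sliding-window count per (rule id, source ip)."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window = window_seconds
        # (rule_id, src_ip) -> timestamps still inside the window
        self.seen = defaultdict(deque)

    def observe(self, rule_id, src_ip, ts):
        """Record one matching event; return True once the window holds enough."""
        times = self.seen[(rule_id, src_ip)]
        times.append(ts)
        # Evict timestamps that fell out of the window.
        # Assumes events arrive in time order; out-of-order logs would break this.
        while ts - times[0] > self.window:
            times.popleft()
        return len(times) >= self.threshold


counter = ThresholdCounter(threshold=10, window_seconds=60)
# twelve failed logins from one address, one second apart
fired = [
    counter.observe("ssh_brute_force", "203.0.113.7", float(t))
    for t in range(12)
]
```

here fired stays False through the ninth failure and turns True on the tenth. the sketch only evicts timestamps, never keys, so a million distinct source ips still means a million deques sitting in memory.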

i started sketching it and realized i was about to build a worse version of the exact thing a SIEM already does well. correlation across time is the part you genuinely want a real engine for. the reason this tool exists is the fast local authoring loop, and bolting a half-working correlator into the hot path would make the simple thing slower to serve the one rule that needs it.

option two: lie a little

i could have made the scanner honor the threshold in some half-baked way and just not mention it. nobody would catch it on the sample data. it would demo fine.

that's the option that bugs me most in hindsight, because detection tooling that quietly does the wrong thing is the worst kind of broken. a web app that breaks throws a 500. a detection rule that breaks just stops alerting, and a missed alert looks identical to a quiet day. there's no error to grep for.

what i actually did

i wrote it down. there's a section in the readme now called "Stateless Matching and Thresholds" that says, in plain words: the scanner evaluates one event at a time, holds no state, and fires a rule with a threshold_count on the first matching event instead of after the count. the threshold fields stay on the rules on purpose, because when you deploy them to wazuh or splunk or elastic, those engines honor them. locally, treat it as a known gap.

here's roughly what the per-event loop looks like, state-free by design:

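def scan_events(events, rules):
    """Run every rule against every event; nothing carries over between events."""
    alerts = []
    for i, event in enumerate(events):
        for rule in rules:
            if rule.matches(event):          # pure function of one event
                # Alert is a plain record type; its definition is omitted here
                alerts.append(Alert(rule, event, index=i))
    return alerts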

rule.matches(event) is a pure function of a single event. no accumulator, no lookback at the events before it. a threshold_count of 10 sits on the rule object, but matches never reads it, so it physically can't wait for the tenth anything.
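
a toy version makes the failure easy to see. the Rule class below is a hypothetical stand-in, not SIEMForge's real rule object: it matches on a substring instead of a real sigma detection block, and the sample log lines are invented.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """Hypothetical stand-in for a loaded sigma rule."""
    title: str
    contains: str              # substring check standing in for real detection logic
    threshold_count: int = 1   # carried on the rule, never consulted below

    def matches(self, event: dict) -> bool:
        # Pure function of one event: no counter, no lookback.
        return self.contains in event.get("message", "")


rule = Rule("ssh brute force", "Failed password", threshold_count=10)
events = [
    {"message": "Accepted password for alice from 198.51.100.4"},
    {"message": "Failed password for root from 203.0.113.7"},
]
# indexes of the events the rule fires on
hits = [i for i, event in enumerate(events) if rule.matches(event)]
```

hits comes back as [1]: one failed login, one alert, with threshold_count=10 sitting on the rule untouched.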

writing the limitation into the docs felt like admitting the tool is incomplete. it is. but a documented gap is something a user can plan around. an undocumented one is a trap you set for whoever trusts your output.

the part i'd still change

the honest version of this isn't "docs forever." if i build correlation later, i'd do it as a second pass over the alerts the matcher already produced, not by stuffing state into the per-event matcher. keep matches pure. then a separate stage groups matched alerts by rule and source, sorts them by time, and only emits when a window actually has enough. the simple path stays simple, and the stateful path becomes opt-in and testable on its own instead of tangled into everything.
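
to make that design concrete, here's a rough sketch of what the second pass could look like. nothing here exists in the tool yet: correlate, the (rule_id, src_ip, ts) tuple shape, and the threshold and window values are all assumptions for illustration.

```python
from collections import defaultdict


def correlate(alerts, threshold, window_seconds):
    """Hypothetical second pass over already-matched alerts.

    alerts: iterable of (rule_id, src_ip, ts) tuples, in any order.
    Returns one (rule_id, src_ip, first_ts, last_ts) tuple per group
    whose alerts reach `threshold` inside some `window_seconds` span.
    """
    groups = defaultdict(list)
    for rule_id, src_ip, ts in alerts:
        groups[(rule_id, src_ip)].append(ts)

    incidents = []
    for (rule_id, src_ip), times in groups.items():
        times.sort()  # sorting here means input order doesn't matter
        start = 0
        for end in range(len(times)):
            # shrink the window from the left until it spans <= window_seconds
            while times[end] - times[start] > window_seconds:
                start += 1
            if end - start + 1 >= threshold:
                incidents.append((rule_id, src_ip, times[start], times[end]))
                break  # this sketch emits at most one incident per group
    return incidents


# ten failures in ten seconds from one address, three scattered from another
alerts = [("ssh_brute_force", "203.0.113.7", t) for t in range(10)]
alerts += [("ssh_brute_force", "198.51.100.4", t) for t in (0, 500, 1000)]
incidents = correlate(alerts, threshold=10, window_seconds=60)
```

only the first address produces an incident. the matcher never had to learn what a window is, and correlate can be tested with plain tuples, no log file needed.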

there's a smaller thing the stateless model exposes too. with no state, the scanner can't dedupe. if the same noisy process-creation event shows up forty times in the log, you get forty alerts. for authoring rules that's fine, you want to see every hit. for anything resembling triage it's just noise. same fix, same hypothetical second pass.
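
a possible shape for that dedupe pass, again hypothetical: dedupe and the rule_id / image / command_line field names are assumptions, not what SIEMForge's alerts actually carry.

```python
def dedupe(alerts, key_fields=("rule_id", "image", "command_line")):
    """Hypothetical pass: collapse alerts that share the same key fields.

    alerts: iterable of dicts. Returns (first_alert, count) pairs
    in first-seen order.
    """
    counts = {}  # key tuple -> [first alert seen, how many times]
    for alert in alerts:
        key = tuple(alert.get(field) for field in key_fields)
        if key in counts:
            counts[key][1] += 1
        else:
            counts[key] = [alert, 1]
    # dicts keep insertion order, so output follows first appearance
    return [(alert, n) for alert, n in counts.values()]


noisy = {"rule_id": "proc_create", "image": "cmd.exe", "command_line": "whoami"}
other = {"rule_id": "proc_create", "image": "powershell.exe", "command_line": "-enc"}
collapsed = dedupe([noisy] * 40 + [other])
```

forty-one alerts come out as two rows, one with a count of 40 and one with a count of 1. the raw list is still there for authoring, where every hit matters.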

why i bothered writing this up

i'm studying for security+ and aiming at soc analyst work, and the thing that's stuck with me hardest building this isn't a clever piece of code. it's that the distance between "a rule matched an event" and "this is a real alert worth a human's time" is mostly state and context, and that's exactly where a SIEM earns its keep. my little scanner makes that line really visible, mostly because it sits right on the wrong side of it.

repo's here if you want to poke at it or tell me the second-pass design is wrong: https://github.com/TiltedLunar123/SIEMForge

if you've built detection tooling: where do you draw the line between "this authors rules" and "this is a SIEM"? i keep moving mine.