DEV Community: Chris Ray

The Threshold Your Detection Can Never Reach

Chris Ray — Wed, 10 Jun 2026 14:01:00 +0000

Your EC2 enumeration detection buckets events into 5-minute windows, counts the distinct instances each actor touched, and alerts when that count passes 10. Reasonable on paper. A burst of instance enumeration is recon, and recon is worth catching early.

It has never fired.

Not because nobody is enumerating your environment. Because the math forbids it from ever firing, at any attacker pace you'll actually see, and it shows up green on your dashboard the entire time.

The detection that looks tuned

index=aws sourcetype=aws:cloudtrail eventName=DescribeInstances
| bin _time span=5m
| stats dc(instance_id) AS distinct_instances by sourceIPAddress, _time
| where distinct_instances > 10

A window and a threshold are what a rate detection is supposed to have. It maps to discovery. It looks deliberate. It looks tuned. That's why it sails through review – it has the shape of a careful rule.

The shape is the problem. The numbers inside it can't happen.

Why it can never fire

Three things stack up, and any one of them is enough.

The fixed-bucket boundary. bin doesn't draw windows around your events. It chops the timeline on a fixed origin – :00, :05, :10 – and drops events into whichever slot they land in. An attacker enumerating 12 instances over 12 minutes spreads across at least three buckets. Maybe five instances, then four, then three. Never more than 10 in one. The enumeration absolutely happened. The bucketing sliced it into pieces too small to cross the threshold.

The pace. More than ten DISTINCT instances inside a single 300-second window assumes a tempo that deliberate recon doesn't use and API rate limits don't encourage. Low-and-slow enumeration – the kind you most want to catch – is the kind least likely to pile more than 10 unique instances into five minutes.

The straddle. Even a genuine burst, 12 instances in six minutes, gets split the moment it crosses a boundary. Six before :05, six after. Two buckets of six. Nothing fires. Whether the rule fires depends on where your clock happened to start, not on anything the attacker did.
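To see the slicing concretely, here is a minimal Python sketch of epoch-aligned bucketing. It models the behavior described above and is not Splunk's implementation; the burst timing and the instance IDs are made up for illustration.

```python
from collections import defaultdict

SPAN = 300       # bucket width in seconds, mirroring span=5m
THRESHOLD = 10   # the rule alerts only when the distinct count exceeds this


def fixed_bucket_counts(events, span=SPAN):
    """Group (epoch_seconds, instance_id) pairs into epoch-aligned buckets
    and return {bucket_start: distinct_instance_count}."""
    buckets = defaultdict(set)
    for ts, instance_id in events:
        buckets[ts - ts % span].add(instance_id)
    return {start: len(ids) for start, ids in sorted(buckets.items())}


# Hypothetical burst: 12 distinct instances, one every 30 seconds,
# starting three minutes before the bucket boundary at t=300.
burst = [(120 + 30 * i, f"i-{i:04d}") for i in range(12)]

counts = fixed_bucket_counts(burst)
fired = any(n > THRESHOLD for n in counts.values())
print(counts, "fired:", fired)
```

Twelve distinct instances in under six minutes land as six in one bucket and six in the next, so the rule stays silent.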

This is a class of bug, not a rule

The specific rule doesn't matter. The category does: any rate detection where the threshold can't be reached inside the window at a realistic pace.

It shows up as a span set tighter than the threshold can fill. As a threshold sized for sliding-window intuition but implemented as a fixed bucket. As a window shorter than the natural cadence of the behavior it's hunting. Different surface, same dead detection underneath.

And it's invisible, because nothing about it errors. The SPL is valid. The schedule runs. The job completes clean. The cell on your ATT&CK heatmap is green. I've written before that the heatmap counts rules instead of coverage – this is one of those rules. It's behind a green cell, and it cannot fire.

How to spot them

Do the arithmetic. Can the threshold be reached inside one window at a pace a real attacker would use? Multiply it out. If the answer is no, the detection is dead, and nothing short of changing the window or the threshold saves it.
Know fixed-bucket from sliding-window. bin and timechart are fixed-origin. A burst that straddles a boundary gets split across windows. If you wrote a fixed bucket but reasoned about it like a sliding window, your intuition and your SPL disagree.
Backtest for EVER, not for HOW OFTEN. Run it across 90 days of real activity. A noisy detection tells you it's miscalibrated. A detection that fired ZERO times across all that data is telling you something worse – suspect dead-by-construction, not quiet-because-safe.
Compare the window to the behavior's cadence. Match your span against API rate limits, automation intervals, and how fast the technique actually runs in the wild. Not against your gut sense of "fast."
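The arithmetic in the first check can be written down as a few lines of Python. This is a rough sketch that assumes evenly spaced calls, each touching one new target, and a half-open window; the function names are hypothetical and the paces are examples, not measurements.

```python
import math


def max_distinct_per_window(seconds_between_calls, span_seconds):
    """Upper bound on distinct targets one actor can land in a single
    half-open window when calls are evenly spaced and each hits a new target."""
    return math.ceil(span_seconds / seconds_between_calls)


def reachable(threshold, seconds_between_calls, span_seconds):
    """True if a 'count > threshold' rule can fire at this pace."""
    return max_distinct_per_window(seconds_between_calls, span_seconds) > threshold


# One call every 72 seconds (10 instances over 12 minutes) against a 5-minute bucket.
print(max_distinct_per_window(72, 300), reachable(10, 72, 300))
# The same pace against a 15-minute window.
print(max_distinct_per_window(72, 900), reachable(10, 72, 900))
```

If reachable comes back false for every pace you consider realistic, the rule is dead by construction regardless of how the backtest looks.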

The fix

Widen the window, lower the threshold, or both – but only after the arithmetic, not by feel. Then stop using a fixed bucket for a sliding-window question:

index=aws sourcetype=aws:cloudtrail eventName=DescribeInstances
| streamstats time_window=15m dc(instance_id) AS distinct_instances by sourceIPAddress
| where distinct_instances > 10

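streamstats time_window counts over a window that moves with the events instead of snapping to a fixed clock. More than ten distinct instances in any rolling 15-minute span trips it, no matter where the bucket boundaries would have fallen. The straddle problem disappears because there are no fixed boundaries to straddle. Because streamstats evaluates the window at every event, expect one matching row per event once the count is exceeded, and deduplicate by actor before alerting.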
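A trailing window is easy to model in Python if you want to check the reasoning outside Splunk. This sketches the sliding-window idea under simple assumptions (time-sorted events, hypothetical source IPs and instance IDs) and does not model streamstats internals.

```python
from collections import deque


def sliding_distinct_alerts(events, window=900, threshold=10):
    """events: time-sorted (epoch_seconds, actor, instance_id) tuples.
    Returns the actors whose distinct-instance count exceeded the threshold
    inside any trailing window of `window` seconds."""
    recent = {}      # actor -> deque of (ts, instance_id) still inside the window
    alerted = set()
    for ts, actor, instance_id in events:
        q = recent.setdefault(actor, deque())
        q.append((ts, instance_id))
        # Drop entries that have aged out of the trailing window.
        while q and q[0][0] <= ts - window:
            q.popleft()
        if len({iid for _, iid in q}) > threshold:
            alerted.add(actor)
    return alerted


# Hypothetical actors: a burst of 12 instances in under six minutes,
# and a slow walker touching one instance every two minutes.
burst = [(120 + 30 * i, "203.0.113.7", f"i-{i:04d}") for i in range(12)]
slow = [(120 * i, "198.51.100.9", f"i-{i:04d}") for i in range(12)]

alerted = sliding_distinct_alerts(sorted(burst + slow))
print(alerted)
```

The burst trips the rule no matter where a clock boundary falls. The slow walker still does not, which is the arithmetic telling you what window and threshold that pace would need.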

Re-backtest after the change, and confirm it fires on a known-good positive control before you trust it.

The actual lesson

A detection that cannot fire is worse than no detection. No detection at least leaves a hole you can see and plan around. A dead one fills the hole with a green cell, claims a slot on the coverage report, and tells you you're watching a door that's been welded shut since the day you deployed it.

Do the arithmetic before you trust the threshold. The window has to be big enough to hold the thing you're trying to catch.

The SIEM Isn't Dying. Its Job Is Splitting in Two.

Chris Ray — Mon, 08 Jun 2026 13:58:00 +0000

Every few months someone declares the SIEM dead, and an AI layer that queries all your systems in place – no central log lake – is the latest murder weapon. The pitch is good. Stop paying to ship petabytes you never read. Let an agent connect to CloudTrail, Okta, and your sensors directly, and run detection at the edge, in place, in each system's native form.

The pitch is also half right, which is worse than wrong.

The SIEM isn't dying. The single thing we built it to do is splitting into two jobs that don't belong in the same box anymore – and the half of the pitch that's correct is hiding the half that decides whether any of this works.

The assumption nobody examines

We don't centralize logs because centralizing is good. We centralize because a query engine needs its data in one schema, in one place, to run. That's it. Detection was a query, and a query needed unified data, so we built an enormous machine whose entire purpose is to drag everything into one searchable pile.

That's an artifact of a constraint, not a law of nature. The constraint was real for twenty years. It's worth asking what's left standing when it loosens – when you can put an agent at each source that speaks the source's own API and correlates across them on demand, instead of pre-joining everything into one index first.

Some of the SIEM's reasons for existing survive that question. Some don't. The whole argument is about which is which.

The case for querying in place

Steelman it properly, because the economics are genuinely compelling.

SIEM licensing is volume-based, and most of what you ingest is never queried. You are paying to move and store an entire haystack in order to look at three needles. Query-in-place bills you only for what you actually touch.

The schema tax goes away too. Central SIEMs normalize everything into their own worldview on the way in, and they lose fidelity doing it. I've written two separate posts about Splunk quietly mangling field structure – braces from JSON arrays, multivalue fields collapsing – before a human ever sees the data. The store distorts the event as a condition of accepting it. Query the source natively and the event stays intact.

Then there's dark data. Full packet capture, the DNS firehose, verbose cloud audit – sources too large to centralize economically. They become huntable on demand instead of being dropped at the collector. This is where the idea is strongest for anyone who lives at the network layer: NSM and Zeek data is enormous, almost nobody ingests it whole, and an edge query layer is how you'd finally hunt over all of it.

Detection also moves closer to the source – no forwarder, index, and scheduled-search lag stacked in front of it. And the agent hunts the way a human actually does, pivoting an identity from Okta to AWS to Google Workspace in its native trail, instead of being boxed into whatever someone thought to pre-join into an index six months ago.

All of that is real. None of it kills the SIEM.

Why the "SIEM dies" version is wrong

Centralization solves problems that querying in place doesn't touch.

Retention and forensics. Source systems rotate and expire their own logs – CloudTrail defaults to about 90 days, Okta caps its retention, sensors overwrite. Incident response and compliance need immutable, timestamped evidence that lives for months or years. You still need a store. An edge query layer has nothing to say about where the data lives in six months.

Correlation at scale. The deterministic join – the same entity across five sources over a 90-day window – is exactly what a unified store does well and what N live API calls stuffed into a context window do badly. The thing the agent demo makes look easy on three events gets slow, bounded, and lossy at real volume.

Auditability. "Why did this fire" has to be answerable for the analyst at 2am and the regulator six months later. A query is something you can read. An agent's reasoning chain is not, not in a way that survives an audit.

These reasons don't evaporate because a demo was slick. They're load-bearing, and the edge-query pitch is quiet about all three.

The part that actually decides it

Here's the one that matters most, and it's personal to anyone who tests their detections.

A scheduled query returns the same answer every time it runs. You can backtest it against thirty days of history. You can prove it fires on a malicious sample and stays quiet on a benign one. You can version it, diff it, and put it through CI. An LLM inventing detection logic on the fly does none of that reliably.
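What "a thing you can point a test at" looks like can be shown in a few lines. The sketch below is a toy, assuming a rule expressed as a pure Python function and two hand-written CloudTrail-style records; the names and sample records are hypothetical, and a real pipeline would run the equivalent checks against the actual query.

```python
MFA_REMOVAL_EVENTS = {"DeactivateMFADevice", "DeleteVirtualMFADevice"}


def detect_mfa_removed(event):
    """Deterministic rule: True when the record is an MFA removal call."""
    return event.get("eventName") in MFA_REMOVAL_EVENTS


# Hand-written controls (illustrative records, not real logs).
POSITIVE_CONTROL = {"eventName": "DeactivateMFADevice", "userName": "jdoe"}
NEGATIVE_CONTROL = {"eventName": "ListUsers", "userName": "jdoe"}


def run_controls(rule):
    """The three properties a CI gate can assert about a detection."""
    return {
        "fires_on_positive": rule(POSITIVE_CONTROL) is True,
        "quiet_on_negative": rule(NEGATIVE_CONTROL) is False,
        "repeatable": rule(POSITIVE_CONTROL) == rule(POSITIVE_CONTROL),
    }


results = run_controls(detect_mfa_removed)
print(results)
```

Every one of those checks depends on the rule being a fixed artifact that returns the same answer for the same input.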

I've argued before that an untested detection isn't a detection – it's a query that runs on a schedule and a hope. That argument assumes a reproducible artifact: a thing you can point a test at and get the same result twice. An agentic detection layer doesn't produce one. There's no Sigma rule, no CI gate, no backtest, no way to prove it can fire before you trust it to tell you the bad thing didn't happen.

You can't unit-test a vibe.

That's the line between the two jobs. Hunting tolerates non-determinism – a good hunter is supposed to follow a hunch and surprise you. Detection cannot tolerate it, because the entire value of a detection is that it does the same correct thing every time, including the times nobody's watching. Until agentic detection hands you a testable, repeatable artifact, it's a hunting accelerant, not a detection program. Those are different jobs, and conflating them is how you end up trusting silence you never validated.

The blast radius nobody prices in

One more. To query everything in place, the AI layer needs broad read credentials to every source system you own. That's a single identity with read access to your entire estate.

Compromise it and the attacker inherits your whole field of view at once. It's the same shape as the exclusion-list problem I've written about – the lookup that gives your detections coverage also becomes part of your blast radius if it's compromised – except now the blast radius is everything, in one credential. Add the rate limits and analytic-hostile APIs on the source systems, and "query everything live during an incident" starts throttling at exactly the moment you need it most.

What actually changes

So put the hype down and say what's really happening. The SIEM's two jobs come apart.

Hunting and detection federate out to an AI edge layer that queries sources in their own language. And retention, forensics, compliance, and the heavy deterministic correlation stay centralized – except, freed from the pressure to also be your live query surface, the central store can get cheaper and dumber. Object storage you query occasionally, not a search license you feed continuously.

The reason you centralize shifts from querying to remembering. That's the actual headline. It's less exciting than "the SIEM is dead," which is precisely why nobody trying to sell you the agent is saying it.

The actual lesson

The SIEM doesn't die. It stops being the place you hunt and becomes the place you remember – and those were always two jobs wearing one license.

Detection moves to the edge the moment an agent can query a source in its own language. But it stays a hunting tool, not a detection program, until the day it can hand you an artifact you can test. Watch for that day. It decides more than any pricing slide, and it's the one thing nobody demoing this wants to talk about.

When AWS Fires Your MFA Detection For You

Chris Ray — Thu, 04 Jun 2026 01:55:17 +0000

An attacker stripping MFA off an account is exactly the kind of thing you want to catch. Remove the second factor, and a stolen password is the whole game. So you write the rule: alert when someone calls DeactivateMFADevice or DeleteVirtualMFADevice.

It fires. Good. Then it fires again. Then it fires every single time someone leaves the company.

Because AWS removes MFA for you when you delete a user, and CloudTrail records the cleanup as the same event an attacker would generate.

The detection that looks correct

MFA removal maps cleanly to a real technique – credential access, Modify Authentication Process, T1556. Alerting on it is a defensible call that passes review without a second look. The rule is about as simple as detection gets:

index=aws sourcetype=aws:cloudtrail eventName IN (DeactivateMFADevice, DeleteVirtualMFADevice)

Specific. Low volume on paper. Maps to the matrix. Nothing about it looks wrong, which is exactly the problem.

What AWS actually does

You cannot delete an IAM user that still has an MFA device attached. AWS rejects the DeleteUser call until the device is gone. So deletion has a precondition: remove the MFA first.

It doesn't matter whether the offboarding runs through the console, the CLI, or a Lambda someone wrote two years ago. The sequence CloudTrail records is the same every time:

DeactivateMFADevice    (actor: offboarding-role, target: jdoe)
DeleteVirtualMFADevice (actor: offboarding-role, target: jdoe)
DeleteUser             (actor: offboarding-role, target: jdoe)

Same actor. Same target. Seconds apart. Every legitimate offboarding produces the precise event your "attacker stripped MFA" rule was built to catch. The detection can't tell the two apart because there is nothing to tell apart. It's the same API call. No threshold, no severity tweak, no field you've overlooked changes that.

Why the docs won't save you

AWS documents its API calls one at a time. DeactivateMFADevice does what it says. DeleteUser does what it says. What the docs don't put front and center is that the second one requires the first as a precondition – that deleting a user fans out into a cascade of events, and the cascade members are indistinguishable from standalone malicious activity.

Most AWS detection content has the same blind spot. It treats CloudTrail events atomically, one rule per eventName. But the truth here doesn't live in any single event. It lives in the relationship between three of them, and that relationship is exactly what nobody writes down.

You don't learn this from reading. You learn it from triaging your own false positives until the pattern is obvious.

Detect the action, not the side effect

The benign case has a signature: the MFA removal is always paired with a DeleteUser for the same person. So correlate, and let the benign pattern suppress itself.

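index=aws sourcetype=aws:cloudtrail eventName IN (DeactivateMFADevice, DeleteVirtualMFADevice)
| eval target_user=coalesce('requestParameters.userName', replace('requestParameters.serialNumber', "^.*:mfa/", ""))
| join type=left target_user
    [ search index=aws sourcetype=aws:cloudtrail eventName=DeleteUser earliest=-60m latest=+60m
      | eval target_user='requestParameters.userName'
      | eval deleted=1
      | fields target_user deleted ]
| where isnull(deleted)

The window is two-sided on purpose. CloudTrail timestamps and delivery aren't strictly sequential – don't assume the DeleteUser always lands after the MFA events. Look both directions. Two caveats. DeleteVirtualMFADevice carries only a serial number, an ARN ending in mfa/ plus the device name, so the eval strips the prefix and assumes the device was named after the user; where that doesn't hold, those events will survive the suppression. And earliest and latest in the subsearch are relative to when the search runs, not to each MFA event, so schedule the outer search over a range that sits inside that window.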

What survives the suppression is MFA removal that is NOT part of a deletion. Someone stripped a second factor off a user who is still active. That's the thing you actually wanted to know about, and now it's the only thing left in the queue.
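The suppression logic is easier to reason about when the window is applied per event. Here is a small Python sketch of that idea, assuming records already reduced to a time, an event name, and a resolved target user; the field names and sample users are hypothetical.

```python
MFA_REMOVAL_EVENTS = {"DeactivateMFADevice", "DeleteVirtualMFADevice"}


def unsuppressed_mfa_removals(events, window=3600):
    """Return MFA removals with no DeleteUser for the same user within
    `window` seconds on either side of the removal."""
    deletions = {}
    for e in events:
        if e["eventName"] == "DeleteUser":
            deletions.setdefault(e["target_user"], []).append(e["time"])

    survivors = []
    for e in events:
        if e["eventName"] not in MFA_REMOVAL_EVENTS:
            continue
        delete_times = deletions.get(e["target_user"], [])
        # Two-sided check: the DeleteUser may be logged before or after.
        if not any(abs(t - e["time"]) <= window for t in delete_times):
            survivors.append(e)
    return survivors


events = [
    # Offboarding cascade in the expected order.
    {"time": 1000, "eventName": "DeactivateMFADevice", "target_user": "jdoe"},
    {"time": 1002, "eventName": "DeleteVirtualMFADevice", "target_user": "jdoe"},
    {"time": 1005, "eventName": "DeleteUser", "target_user": "jdoe"},
    # Offboarding where the DeleteUser record carries the earlier timestamp.
    {"time": 2000, "eventName": "DeleteUser", "target_user": "bkim"},
    {"time": 2003, "eventName": "DeactivateMFADevice", "target_user": "bkim"},
    # MFA stripped from a user who is never deleted.
    {"time": 5000, "eventName": "DeactivateMFADevice", "target_user": "asmith"},
]

survivors = unsuppressed_mfa_removals(events)
print([e["target_user"] for e in survivors])
```

Only the removal against the still-active user is left, and the out-of-order offboarding is suppressed just like the in-order one.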

The lesson generalizes

Cloud control planes emit cascades. One human action fans out into a handful of API calls, and the side-effect calls in that fan-out look identical to the same calls fired in isolation by an attacker. MFA removal on offboarding is one instance. Snapshot sharing during a backup job is another. Key rotation, role teardown, bucket policy changes during a decommission – every one of them has a benign cascade that mimics an attack.

Detect the originating action where one exists. Where it doesn't, correlate the cascade members so the benign sequence cancels itself out. And catalog the cascades for your high-value events the same way you'd catalog the noisy controller roles in your environment – the cascade map is a reusable artifact, not a one-time tuning pass.

The actual lesson

The detection wasn't catching attackers. It was catching HR. AWS generated the smoking gun on every offboarding, and the rule dutifully reported routine cleanup as an intrusion.

Detect the action a human took, not the side effect the platform emitted to carry it out. The platform's side effects look exactly like the attack, and they always will.

Your ATT&CK Heatmap Is Counting Rules, Not Coverage

Chris Ray — Wed, 03 Jun 2026 12:48:44 +0000

Every detection vendor ships a MITRE ATT&CK heatmap, and every one of them is mostly green. Broad coverage, techniques lit up across the board, a reassuring wall of color in the sales deck and the board slide. It's the universal flex. We cover the matrix.

Then you parse the actual rules – the real YAML in the public repo, not the marketing layer on top of it – and the green collapses into a handful of tactics.

Everyone covers execution and persistence. Almost nobody covers discovery, lateral movement, or collection. The heatmap wasn't measuring coverage. It was counting rules, and counting them in a way designed to look complete.

What a green cell actually means

A green cell in a Navigator layer means one thing: at least one rule somewhere references that technique tag.

That's it. Not "we detect this reliably." Not "this fires on real attacks and stays quiet otherwise." Not "this survives an attacker who knows the rule exists." One rule that names the technique in its metadata turns the cell green, and forty rules turn it the same shade. Unless the layer is scored by rule count – and most published heatmaps aren't – one and forty are indistinguishable.

I've written before that an untested detection isn't a detection, it's a query that runs on a schedule. The heatmap is the same lie one level up. A green matrix isn't coverage. It's a wall of queries that run, rendered in a color that means "present," dressed up as a color that means "protected."

The vendor knows the difference. The buyer staring at the green doesn't.

You can measure the real shape yourself

Here's the part the heatmap marketing depends on you not doing: the rules are public, and you can count them.

SigmaHQ, Elastic's detection-rules, Splunk ESCU, Panther, Sublime – all on GitHub, all carrying ATT&CK technique IDs and tactic tags in the rule metadata (Sigma, for example, uses attack.tXXXX tags). The method is boring on purpose. Walk the rule directories. Pull the ATT&CK tags out of each rule. Aggregate by technique, roll the technique counts up to tactic, and emit a Navigator layer scored by count instead of binary presence.

That's exactly what my spl-coverage-map tool does against my own rules. The same parser points at anyone's repo. There's no vendor cooperation required and no proprietary access involved – the coverage data has been sitting in the open the whole time. Nobody runs the count because the heatmap already told them what they wanted to hear.
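For a feel of how little code the count takes, here is a minimal Python sketch. It is not the spl-coverage-map tool; it assumes Sigma-style tags (attack.tXXXX and attack.<tactic>) found by regex in raw rule text, works on in-memory strings instead of walking a repo, and emits only the techniqueID and score fields of a Navigator-style layer. A real layer file needs more metadata than this.

```python
import re
from collections import Counter

# Assumed tag shapes: attack.t1059 / attack.t1059.001 and attack.execution.
TECHNIQUE_TAG = re.compile(r"attack\.(t\d{4}(?:\.\d{3})?)", re.IGNORECASE)
TACTIC_TAG = re.compile(r"attack\.([a-z][a-z_-]+)\b")


def count_tags(rule_texts):
    """Count rules per technique and per tactic, once per rule each."""
    techniques, tactics = Counter(), Counter()
    for text in rule_texts:
        for tid in {t.upper() for t in TECHNIQUE_TAG.findall(text)}:
            techniques[tid] += 1
        for tactic in {t.replace("-", "_") for t in TACTIC_TAG.findall(text.lower())}:
            tactics[tactic] += 1
    return techniques, tactics


def to_layer(techniques, name="rule-count layer"):
    """Build a count-scored, Navigator-style layer (minimal fields only)."""
    return {
        "name": name,
        "domain": "enterprise-attack",
        "techniques": [
            {"techniqueID": tid, "score": n} for tid, n in sorted(techniques.items())
        ],
    }


# Three made-up rules standing in for files read from a repo.
rules = [
    "tags:\n  - attack.execution\n  - attack.t1059.001\n",
    "tags:\n  - attack.execution\n  - attack.t1059.001\n  - attack.persistence\n  - attack.t1547.001\n",
    "tags:\n  - attack.persistence\n  - attack.t1547.001\n  - attack.t1059.001\n",
]

techniques, tactics = count_tags(rules)
layer = to_layer(techniques)
print(dict(tactics))
print(layer["techniques"])
```

Scoring by count is the whole point: one rule and forty rules stop rendering as the same shade.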

One honesty constraint up front: this measures what's written, not what works. A high rule count for a tactic is still just a count – it doesn't mean those rules fire correctly. But the distribution across tactics is real signal even when any single number is soft. The shape is the finding.

The shape is lopsided, every time

Run it and the same picture shows up no matter whose corpus you point it at.

Coverage piles up on Execution, Persistence, Privilege Escalation, and Defense Evasion. These are the tactics where a single log event maps cleanly to a single rule. A process creation event. A registry run key. A new service install. One event, one rule, one green cell. Easy to write, easy to count, easy to demo.

It thins out fast on Discovery, Lateral Movement, Collection, Exfiltration, and Impact. These are the tactics that need correlation across multiple events, a baseline to deviate from, or behavioral context that a single point-event rule can't express.

My own numbers are no exception. Running this on my rules: roughly 37 techniques across 12 tactics, with discovery and lateral movement as confirmed gaps. That's not a brag and it's not a confession – it's the same lopsided shape everyone else has, and I'm not exempt from it. The point isn't that my coverage is good or bad. It's that the distribution is identical to the corpus at large, because the cause is structural.

The gaps aren't laziness. The easy tactics are easy precisely because their data model is one-event-one-rule. The hard tactics are gaps because they require session reconstruction, time-series baselining, and network context that point-event rules simply cannot encode. You don't fill those cells by trying harder. The rule format itself can't hold the detection.

The gaps are exactly where attackers live

Now line the shape up against where an intrusion actually spends its time.

Attacker dwell concentrates in the middle of the kill chain – discovery and lateral movement, the post-compromise stretch after initial access and before the objective. That's the hallway. It's where a breach turns from one box into your environment.

It's also the part of the matrix the corpus covers worst. The industry is heavily instrumented at the front door and at the smash-and-grab, and nearly blind in the hallway where the actual intrusion plays out. The green is densest exactly where attackers spend the least time, and thinnest where they spend the most.

This is the part I want to land, because it's where most "ATT&CK coverage is overrated" takes stop short. The discovery and lateral-movement gap isn't really a rule-writing gap. It's a telemetry gap.

Lateral movement and internal discovery are loud on the wire. East-west flows, SMB and RPC patterns, internal port scans, anomalous service-to-service connections – they're visible at the network layer in ways they are almost never visible in endpoint logs. The corpus is thin on these tactics partly because the corpus is written against endpoint and cloud-audit telemetry, and those tactics don't show up cleanly there. The blind spot isn't in the rules. It's in the data the rules were written against.

So the reframe is this: a coverage gap is a telemetry blind spot wearing a costume. You do not close your lateral-movement gap by writing more Sysmon rules. You close it by instrumenting the layer that lateral movement actually traverses, and then writing detections against telemetry that can see it.

What to do with the count

Run it on your own stack before you trust anyone's heatmap, mine included. The parser is public. Point it at your rules and score by count and tactic, not by binary technique presence – the distribution is what tells you anything.

Then read your thinnest tactics as a telemetry question first and a rule-writing question second. A bare lateral-movement column almost never means you forgot to write the rules. It means you have no east-west visibility to write them against. The fix lives in your network taps and your flow data, not your rule backlog.

And treat every published vendor heatmap as a claim to verify, not a result to trust. When a green matrix lands in front of you, the only question worth asking is which telemetry the green is built on. If it's all endpoint and cloud audit, that wall of color is blind in the hallway no matter how green it looks.

The actual lesson

The detection industry isn't hiding its blind spots. They're sitting in plain sight in every public rule repo, the same lopsided shape every time – dense at the front of the kill chain, thin through the middle where attackers actually live. The heatmap stays green because it counts rules and nobody asks it to count harder, and because "present" photographs the same as "protected" when you pick the right color.

Run the count yourself. Read the gaps as telemetry, not effort. What the heatmap won't tell you is the only thing worth knowing: not how many techniques you've named, but whether you can see the part of the attack that matters.

The Splunk Token That Silently Swallows Curly Braces

Chris Ray — Tue, 02 Jun 2026 14:41:47 +0000

Your Okta detection fires. A privilege grant, the kind you want eyes on fast. The Slack alert message lands and tells you the actor "" just handed out admin. Two empty quotes where the name should be.

The search is correct. The field is populated. Run the SPL by hand and you'll see the display name sitting right there in the results. But the alert went out with a blank, and Splunk never said a word about it – not in the job log, not in the alert action, nowhere.

This is the part worth your attention: every test you'd normally run passes. The detection is, by every check that happens inside Splunk's search pipeline, working. It's only broken in the one place you don't habitually look – the rendered message a human actually reads.

A correct search is not a working detection.

The shape of it

The detection pulls from a JSON source – Okta, in this case, but it could be CloudTrail, Google Workspace, M365, GitHub audit, anything that ships nested JSON. The field you want in the alert lives inside a JSON array. Okta puts the affected user under target, an array, so spath extracts it as a field literally named target{}.displayName.

The alert action references it the obvious way:

actor=$result.actor.alternateId$ target=$result.target{}.displayName$

The actor token resolves. The target token comes back empty. Same message, same syntax, one renders and one doesn't. The only difference is the curly braces.

Two faults, stacked

It's tempting to call this one bug. It's two, and you have to fix both.

Fault A: the token engine rejects the braces. Splunk's alert-action token substituter – the same engine behind dashboard input tokens – only accepts letters, digits, underscores, and dots inside a $...$ reference. Curly braces aren't on the list, and they're reserved for Splunk's own template constructs. When the engine scans result.target{}.displayName, it either stops at the { and goes looking for a field plainly named target (there isn't one – spath only ever created target{}.*), or it fails to match a valid token at all. Both roads end at an empty string.

Fault B: the field is multivalue. The {} in spath's output isn't decoration. It means "this came from a JSON array, it holds multiple values." A scalar token slot has no rule for rendering a multivalue field. Splunk doesn't pick the first value, or join them, or guess – it substitutes nothing.

Fault B is the one people miss. Rename the field to strip the braces and you've satisfied Fault A, but the token is still empty, because the field is still multivalue. Here's the full truth table:

What you reference	Grammar OK?	Scalar?	Renders
$result.target{}.displayName$	No	No	empty
$result.target_array$ (rename only)	Yes	No	empty
$result.target_first$ (rename + mvindex)	Yes	Yes	the name

And the reason none of this ever pages you: token substitution has no "field not found" path that reaches the user. It's silent on purpose. Dashboard tokens are routinely optional, meant to disappear quietly when unset. The alert engine inherits that behavior and applies it to your detection, where a missing field is anything but optional.
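The truth table can be reproduced with a toy model. The Python below is not Splunk's token engine; it encodes only the two rules described above (a name grammar of letters, digits, underscores, and dots, plus a scalar-only slot) and the silent empty-string fallback, with a made-up result row.

```python
import re

TOKEN = re.compile(r"\$result\.([^$]*)\$")
# Assumption taken from the text: only these characters are valid in a token name.
VALID_NAME = re.compile(r"[A-Za-z0-9_.]+")


def render(template, result):
    """Substitute $result.<field>$ tokens, silently emitting '' on any failure."""
    def substitute(match):
        name = match.group(1)
        if not VALID_NAME.fullmatch(name):
            return ""                      # Fault A: braces break the grammar
        value = result.get(name)
        if value is None or isinstance(value, list):
            return ""                      # missing field, or Fault B: multivalue
        return str(value)
    return TOKEN.sub(substitute, template)


# Hypothetical result row; lists stand in for multivalue fields.
row = {
    "actor.alternateId": "admin@example.com",
    "target{}.displayName": ["Jane Doe"],
    "target_array": ["Jane Doe"],
    "target_first": "Jane Doe",
}

print(repr(render("$result.target{}.displayName$", row)))
print(repr(render("$result.target_array$", row)))
print(repr(render("$result.target_first$", row)))
```

Nothing in the model raises or logs, which is the point: both failure paths look identical to an unset optional token.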

The fix is one line doing three jobs

| eval target_displayName=mvindex('target{}.displayName', 0)

Drop it in before the alert action sees the data. Each piece is load-bearing:

The single quotes tell SPL to read target{}.displayName as a literal field name. SPL itself has no problem with braces in field names – only the alert template engine does. Double quotes would make it a string literal, backticks would make it a macro. Single quotes are the only form that works here, and getting this wrong is the most common way people copy the pattern and stay broken.
mvindex(..., 0) collapses the multivalue field to its first element. That kills Fault B.
The rename to target_displayName gives the token a name built from characters the grammar accepts. That kills Fault A.

Now $result.target_displayName$ resolves, and the analyst sees a name.

Prove it to yourself in any environment

Don't trust a search that looks right. Trust what the alert renders. You can isolate Fault A in one query by putting a known-good token next to the broken one:

index=security sourcetype="OktaIM2:log" earliest=-1h
| head 1
| eval token_test="actor=$result.actor.alternateId$ target=$result.target{}.displayName$"
| table actor.alternateId target{}.displayName token_test

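The actor.alternateId half of token_test renders. The target{}.displayName half comes back empty, sitting right next to a table column that proves the underlying value exists. That gap – populated in the table, blank in the token – is the whole bug in one screen. One caveat: $result.*$ tokens are expanded by the alert action, not by eval, so if an ad hoc run leaves the token text unexpanded, save the search as a throwaway alert and put the same two tokens in its message.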

Then go wider. Grep your saved searches for {} inside any $...$ reference. Every match is a detection that's been shipping blanks. And bake the normalize-to-scalar step into your detection template, so multivalue JSON fields never reach an alert action with their braces intact.
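The grep is simple enough to script. This Python sketch scans the text of a savedsearches.conf-style file for any $...$ reference containing {} and reports the stanza and line. The sample stanzas and the message key name are hypothetical; point the function at your own exported config.

```python
import re

# A $...$ reference with no whitespace inside it that contains a literal {}.
BRACED_TOKEN = re.compile(r"\$[^$\s]*\{\}[^$\s]*\$")


def find_braced_tokens(conf_text):
    """Return (stanza, line_number, token) for every braced token reference."""
    hits = []
    stanza = None
    for lineno, line in enumerate(conf_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            stanza = stripped[1:-1]
        for match in BRACED_TOKEN.finditer(line):
            hits.append((stanza, lineno, match.group(0)))
    return hits


# Made-up config; the message key name is an assumption for illustration.
sample_conf = "\n".join([
    "[Okta Admin Grant]",
    'search = index=security sourcetype="OktaIM2:log" | spath',
    "action.slack.param.message = actor=$result.actor.alternateId$ target=$result.target{}.displayName$",
    "",
    "[Okta Admin Grant Fixed]",
    "action.slack.param.message = target=$result.target_displayName$",
])

hits = find_braced_tokens(sample_conf)
for hit in hits:
    print(hit)
```

Each hit is a candidate for the mvindex-and-rename fix before the field reaches the alert action.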

The actual lesson

The search was never wrong. That's what makes this one mean. There's no syntax error to catch, no failed job to investigate, no red in CI. The field is in your results the entire time. The detection passes every test that runs inside the pipeline, and the only symptom is a name that isn't there – visible solely in Slack, and only if the analyst happens to notice the quotes are empty.

Test the render, not the search. The search tells you the detection works. Only the alert tells you it's useful.