DEV Community: Bala Paranj

One Security Question, Five Reasoning Engines, Zero Source Code

Bala Paranj — Sat, 18 Jul 2026 11:50:21 +0000

We ran an experiment. We took a security question, "can an unauthenticated principal reach a sensitive resource?", and wrote specifications for five different reasoning engines: Z3 (SMT solver), Soufflé (Datalog), Clingo (Answer Set Programming), SWI-Prolog, and PRISM (probabilistic model checking).

Then we gave each spec to an AI agent with no access to our source code. Just the spec YAML, the exported facts from a configuration snapshot, and the export schema. No Go source, no internal documentation, and no prior context about the tool.

The agent produced correct security reasoning in all five paradigms. That result is about architecture.

The spec as the contract

Each reasoning spec is a YAML file with a fixed structure:

trial:
  engine: souffle
  question: >
    How many assets are anonymously reachable via
    self-registration or privilege-escalation paths?
  input: input.facts
  export_schema: sir-predicates.md
  reasoning_steps:
    - Load the JSONL facts into Soufflé relations
    - Apply the reachability rules from reachability.dl
    - Count tuples in the anonymous_reachable relation
  expected_output:
    anonymous_reachable_count: 12
  validation:
    method: exact_match
    ignore:
      - output_ordering
      - whitespace

The spec is simultaneously three things: a specification (what the engine should compute), an implementation guide (how to compute it), and a test definition (the correct answer). The golden output in expected_output is committed alongside the spec and diffed against the agent's output.
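The validation step can be sketched as a small comparison function. This is a minimal illustration, not Stave's trial framework: the names `normalize` and `matches_golden` and the line-oriented output format are assumptions. Only the `exact_match` method and the two `ignore` entries come from the spec above.

```python
def normalize(text, ignore):
    """Canonicalize output according to the spec's ignore list (hypothetical helper)."""
    lines = text.splitlines()
    if "whitespace" in ignore:
        # Collapse runs of whitespace, trim each line, drop blank lines.
        lines = [" ".join(ln.split()) for ln in lines]
        lines = [ln for ln in lines if ln]
    if "output_ordering" in ignore:
        lines = sorted(lines)
    return lines


def matches_golden(agent_output, golden, ignore=()):
    """exact_match comparison, applied after the ignore directives."""
    return normalize(agent_output, ignore) == normalize(golden, ignore)


golden = "asset_a\nasset_b\n"
agent = "  asset_b\nasset_a  \n\n"
print(matches_golden(agent, golden, ignore=("output_ordering", "whitespace")))
print(matches_golden(agent, golden))
```

With both directives active, the reordered and re-indented output counts as a match. Without them, the same output fails, which is the false failure the ignore list exists to prevent.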

This means contributing a new security analysis doesn't require reading our Go codebase. A contributor writes a spec, commits the golden output, and the trial framework validates it. The spec is the interface. The source code is an implementation detail behind it.

What the trials found

Five trials. Three passed on first run. Two failed. The failures were more informative than the passes.

Z3 (SMT): pass. The agent translated the JSONL facts into SMT-LIB assertions, ran (check-sat), and produced the correct verdict and witness. The question is a satisfiability problem: can an unauthenticated principal read an S3 bucket with Public Access Block disabled? Z3 either finds a satisfying assignment (the specific principal and bucket) or proves none exists. The agent found the right one.

Soufflé (Datalog): pass. The agent loaded facts into relations, applied reachability rules, and counted 12 anonymously reachable assets. Byte-identical to the golden output. Datalog's strength here is transitive closure — role assumption chains of arbitrary depth are computed by fixed-point evaluation, not by bounded loops.
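To make the fixed-point idea concrete, here is a minimal sketch in plain Python rather than Soufflé. The principal and role names are hypothetical, and the loop is the naive version of the evaluation (Datalog engines typically use the more efficient semi-naive variant). It keeps deriving new pairs until a pass adds nothing, so chain depth never has to be bounded in advance.

```python
def transitive_closure(edges):
    """Naive fixed-point iteration: derive new pairs until nothing changes."""
    reach = set(edges)
    while True:
        # Join reach with itself: (a, b) and (b, d) together give (a, d).
        derived = {(a, d) for (a, b) in reach for (c, d) in reach if b == c}
        if derived <= reach:
            return reach  # fixed point: no new facts were derived
        reach |= derived


# Hypothetical role-assumption chain of depth three.
can_assume = {
    ("anonymous", "role_a"),
    ("role_a", "role_b"),
    ("role_b", "bucket"),
}
closure = transitive_closure(can_assume)
print(("anonymous", "bucket") in closure)
```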

Clingo (ASP): fail, then pass. The agent's output didn't match the golden because the spec used mfa_enforced but the exported facts used has_mfa_enforced. This was a spec bug: a naming-convention slip where the spec author forgot the has_ prefix that Stave's fact catalog uses for all boolean predicates.

The fix was two lines: correct the predicate name in the spec and add an explicit note about the naming convention. When we re-ran with a blind agent (fresh context, no knowledge of the fix), the agent picked up the convention note and produced correct output.

This is the kind of defect a trial framework should catch. The spec was wrong. The engine was right. The framework distinguished between the two.

Prolog: fail, then pass. The agent's output was correct, with 12 proof chains, but the golden said 6. The spec author had transcribed the golden by hand and missed the DynamoDB action variants that create a cartesian product with the Cognito identity pool paths. The trial framework caught this by comparing the agent's output against the engine's actual output (12) rather than trusting the golden.

Category A failure (Clingo): spec bug — agent couldn't derive engine output from spec. Category C failure (Prolog): golden bug — agent's output matched engine, but golden was wrong. The distinction is foundational. In Category A, you fix the spec. In Category C, you fix the golden. Conflating the two means you can't trust your test suite.
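The triage rule reduces to a three-way comparison. The sketch below is an illustration under assumptions: the function name `classify` and the idea of comparing single values are ours, and it models only the two categories named above.

```python
def classify(agent_output, engine_output, golden):
    """Triage a trial result into pass, Category A, or Category C."""
    if agent_output == engine_output == golden:
        return "pass"
    if agent_output != engine_output:
        # The agent could not derive the engine's output from the spec.
        return "A"  # spec bug: fix the spec
    # Agent and engine agree, so the committed golden must be wrong.
    return "C"  # golden bug: fix the golden


# The Prolog trial: agent and engine both produced 12, golden said 6.
print(classify(agent_output=12, engine_output=12, golden=6))
```

The order of the checks matters. Comparing against the engine first is what keeps a wrong golden from being blamed on the spec.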

PRISM (probabilistic): pass. The agent computed per-attack-shape exploitation probabilities from the PRISM model. Step probabilities were exact. Aggregate probability was within ±0.005 of the golden. PRISM is the most unusual engine in the set. It doesn't prove safety, it quantifies risk. "Given this configuration, the probability of successful exploitation via the Cognito self-registration path is 0.342." That's a different kind of output than SAT/UNSAT, and the spec format handles it properly.
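A probabilistic trial needs a tolerance-aware comparison instead of a byte diff. Here is a minimal sketch, assuming a result shaped as a dict with `steps` and `aggregate` keys. That shape, the function name, and the step values are hypothetical; the ±0.005 tolerance and the 0.342 figure come from the text above.

```python
import math


def validate_probabilities(result, golden, tol=0.005):
    """Step probabilities must match exactly; the aggregate may differ by tol."""
    steps_ok = result["steps"] == golden["steps"]
    aggregate_ok = math.isclose(
        result["aggregate"], golden["aggregate"], rel_tol=0.0, abs_tol=tol
    )
    return steps_ok and aggregate_ok


golden = {"steps": [0.9, 0.5], "aggregate": 0.342}      # step values are made up
close = {"steps": [0.9, 0.5], "aggregate": 0.340}       # off by 0.002: accepted
too_far = {"steps": [0.9, 0.5], "aggregate": 0.350}     # off by 0.008: rejected
print(validate_probabilities(close, golden), validate_probabilities(too_far, golden))
```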

The blind re-run

The original trials had a methodological weakness: the spec author and the trial runner were in the same session. The agent might have been influenced by context from the authoring process.

So we re-ran the two trials that had spec changes (Clingo and Prolog) with a fresh sub-agent. No prior context and no knowledge of the original failures or fixes. Just the corrected spec YAML, the input facts, and the export schema.

Both passed.

The Prolog blind agent derived 12 proof chains via the cartesian product, matching the post-fix golden count—all without seeing the explanation comment in the spec.

It independently discovered the same combinatorial structure the original agent had found.

The Clingo blind agent correctly used has_mfa_enforced with the has_ prefix. The naming-convention note in the spec was sufficient for a context-free agent to recover the convention.

The blind run also surfaced two non-issues: output ordering and indentation width. Both were already in the spec's ignore: list — confirming that the ignore directives are doing their job of preventing false failures on cosmetically different but semantically correct output.

Five paradigms, one contract

Each engine answers a different kind of question about the same configuration snapshot:

Engine	Paradigm	Question shape	Output
Z3	SMT	"Is it possible for X?"	SAT/UNSAT + witness
Soufflé	Datalog	"What can reach what?"	All reachable tuples
Clingo	ASP	"Which assets violate which rules?"	Violation atoms
Prolog	Logic	"Prove the chain from A to B"	Proof trees
PRISM	Probabilistic	"How likely is exploitation?"	Risk probabilities

The input to all five is the same: Stave's JSONL fact export from a single configuration snapshot. The export contract — has_action(arn, action), trusts_service(arn, service), has_tag(arn, key, value) — is the stable interface. Each engine consumes the same facts in its native format.
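As an illustration of "same facts, native format", the sketch below turns one JSONL record into a Datalog-style fact line. The record layout (`predicate` and `args` keys) is an assumption made for this example. The actual export schema lives in sir-predicates.md and may differ. Only the predicate names come from the contract above.

```python
import json


def to_fact(jsonl_line):
    """Render one JSONL record as a Datalog-style fact (assumed record shape)."""
    record = json.loads(jsonl_line)
    # json.dumps gives each argument consistent double-quoted escaping.
    args = ", ".join(json.dumps(arg) for arg in record["args"])
    return f'{record["predicate"]}({args}).'


line = '{"predicate": "has_tag", "args": ["arn:aws:s3:::demo", "env", "prod"]}'
print(to_fact(line))
```

Each engine would need its own small adapter of this kind. The point of the contract is that the adapters differ while the facts do not.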

This is the architectural boundary the trials validated: Stave's export contracts are complete for five reasoning paradigms. No engine reported a missing input field. Every engine produced correct output from the exported facts alone. The contract layer is tested, not asserted.

What this means for contributors

The reasoning-specs directory is open. The steps for adding a new security analysis:

Pick a security question and a reasoning engine.
Write a spec YAML: the question, the reasoning steps, the expected output.
Commit the golden output.
Run the trial. If it passes, the analysis is validated. If it fails, the framework tells you whether the bug is in the spec, the golden, or the engine.

No Go source code required. The spec references the export schema (which predicates are available, what they mean) and the input facts (a JSONL file from any Stave snapshot). A contributor who knows Z3 but not Go can write an SMT-based analysis. A contributor who knows Prolog but not cloud security can implement a proof-chain analysis from a well-written spec.

Several engines don't have trial packages yet: PySAT, TLA+, game theory, STIX, JSON-LD/GraphML, OSCAL/OCSF, and compliance evidence packs are all open for contribution. The spec format is documented. The trial framework is committed. The export facts are stable.

The prerequisite is a working example with a committed, fixture-tied golden output. The spec wraps the example. The trial validates the spec. The blind re-run validates the trial. Each layer adds confidence without requiring access to the layer below it.

Stave is an open-source cloud security reasoning engine. The reasoning-specs trials, input fixtures, and export schemas are at stave.

Rhino Found 21. Z3 Found 50.

Bala Paranj — Fri, 17 Jul 2026 11:11:47 +0000

In 2018 Spencer Gietzen at Rhino Security Labs published the definitive reference on AWS IAM privilege escalation: 21 named methods, each a specific action or action combination that turns limited permissions into administrative ones. The research is cited by every AWS security tool. PMapper checks for it. Pacu exploits it. Prowler enumerates it. Stave has 44+ per-technique controls, one per method or close enough.

The 21 methods are correctly identified. The math says there are at least 50.

This article runs Rhino's research through a Z3 SAT solver against a synthetic principal with a permission set Rhino's research was written to characterize. The solver finds all 21 of Rhino's methods and 27 additional methods Rhino didn't enumerate. The additional methods are the same structural shapes — "can the principal modify its own permissions?", "can the principal launch compute with a privileged role?" — applied to AWS services that exist today but weren't on Rhino's enumeration in 2018.

The killer follow-up: write a Deny policy that blocks every action on Rhino's list. Re-run the solver. 24 methods still reachable.

The collapse

Rhino enumerated 21 methods. They group into 5 structural patterns:

#	Pattern	Rhino methods
1	Policy Self-Mutation	1, 2, 7-13 (9 methods)
2	Credential Creation / Theft	4, 5, 6, 14
3	Compute + PassRole	3, 15-21 (8 methods)
4	Indirect Compute Invocation	16
5	Role Trust Modification	14

Each pattern is one structural shape: "principal can modify own permissions," "principal can hijack another principal's credentials," "principal can launch compute with a role," "principal can trigger compute indirectly," "principal can change a role's trust." The 21 methods are 21 instances of these five shapes.

A checklist scanner writes a separate predicate for each instance: 21 checks. A Z3 prover writes a separate query for each shape: 5 queries. The collapse ratio is 21 → 5.

Z3 prover

For each pattern, the prover holds a registry of known methods that fit the shape. Pattern 3's registry, for example:

ec2:RunInstances + iam:PassRole                                    [Rhino  3]
lambda:CreateFunction + lambda:InvokeFunction + iam:PassRole       [Rhino 15]
lambda:CreateFunction + lambda:CreateEventSourceMapping            [Rhino 16]
lambda:UpdateFunctionCode                                          [Rhino 17]
glue:CreateDevEndpoint + iam:PassRole                              [Rhino 18]
glue:UpdateDevEndpoint                                             [Rhino 19]
cloudformation:CreateStack + iam:PassRole                          [Rhino 20]
datapipeline:CreatePipeline + datapipeline:PutPipelineDefinition   [Rhino 21]
autoscaling:CreateLaunchConfiguration + CreateAutoScalingGroup     [NEW]
ecs:RunTask + iam:PassRole                                         [NEW]
ecs:CreateService + iam:PassRole                                   [NEW]
codebuild:CreateProject + codebuild:StartBuild + iam:PassRole      [NEW]
sagemaker:CreateNotebookInstance + iam:PassRole                    [NEW]
sagemaker:CreateTrainingJob + iam:PassRole                         [NEW]
batch:SubmitJob + iam:PassRole                                     [NEW]
states:CreateStateMachine + iam:PassRole                           [NEW]
apprunner:CreateService + iam:PassRole                             [NEW]

Eight Rhino methods. Nine additions. Same shape: any service that lets you launch compute with a specified IAM role is in the same structural class. Adding a new service to the registry is one struct entry. No new predicate, control YAML or check.

The query: "is there a method in this registry whose actions are all effectively allowed for the principal (in some Allow statement, not in any Deny)?" The solver returns the index of any reachable method. SAT iff the registry has at least one reachable entry.
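The semantics of that query can be sketched without a solver. The code below is plain Python standing in for the Z3 encoding, not the prover's actual implementation. The function name and the flat allow/deny sets are assumptions, and the three registry entries are taken from Pattern 3's list.

```python
def first_reachable(registry, allowed, denied):
    """Return the index of a method whose actions are all effectively allowed.

    An action is effectively allowed when some Allow grants it and no Deny
    covers it. Returns None when no registry entry is reachable (UNSAT).
    """
    for index, actions in enumerate(registry):
        if all(a in allowed and a not in denied for a in actions):
            return index
    return None


registry = [
    ("ec2:RunInstances", "iam:PassRole"),
    ("ecs:RunTask", "iam:PassRole"),
    ("batch:SubmitJob", "iam:PassRole"),
]
allowed = {"iam:PassRole", "ec2:RunInstances", "ecs:RunTask"}
denied = {"ec2:RunInstances"}
print(first_reachable(registry, allowed, denied))
```

Entry 0 is blocked by the explicit deny and entry 1 is reachable, so the returned index points at the ECS method and acts as the witness.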

The full results on a vulnerable principal

Run the prover against a principal that grants every relevant action:

====================================================================
== rhino-vulnerable (all 21 Rhino methods enabled)
====================================================================

--- Pattern 1: Policy Self-Mutation ---
  registry size: 13 methods (9 Rhino-numbered + 4 additional)
  reachable:     13 / 13 methods
  verdict:       SAT — at least one method reachable; full list:
    [Rhino 01] iam:CreatePolicyVersion → activate admin version on a policy the principal uses
    [Rhino 02] iam:SetDefaultPolicyVersion → activate dormant admin version
    [Rhino 07] iam:AttachUserPolicy → attach AdministratorAccess to self
    [Rhino 08] iam:AttachGroupPolicy → attach admin to own group
    [Rhino 09] iam:AttachRolePolicy → attach admin to assumable role
    [Rhino 10] iam:PutUserPolicy → create admin inline on self
    [Rhino 11] iam:PutGroupPolicy → create admin inline on own group
    [Rhino 12] iam:PutRolePolicy → create admin inline on own role
    [Rhino 13] iam:AddUserToGroup → join admin group
    [NEW    ] iam:CreatePolicy + iam:AttachUserPolicy → create admin policy and attach
    [NEW    ] iam:DeleteUserPolicy + iam:PutUserPolicy → drop restrictive inline, replace
    [NEW    ] iam:DetachUserPolicy + iam:AttachUserPolicy → swap boundary for admin
    [NEW    ] iam:DeleteRolePermissionsBoundary → remove permissions boundary on assumable role

--- Pattern 2: Credential Creation / Theft ---
  registry size: 7 methods (4 Rhino-numbered + 3 additional)
  reachable:     7 / 7 methods
  verdict:       SAT
    [Rhino 04] iam:CreateAccessKey
    [Rhino 05] iam:CreateLoginProfile
    [Rhino 06] iam:UpdateLoginProfile
    [Rhino 14] iam:UpdateAssumeRolePolicy
    [NEW    ] iam:CreateVirtualMFADevice + iam:EnableMFADevice
    [NEW    ] iam:DeactivateMFADevice
    [NEW    ] sts:GetFederationToken

--- Pattern 3: Compute + PassRole ---
  registry size: 17 methods (8 Rhino-numbered + 9 additional)
  reachable:     17 / 17 methods
  verdict:       SAT
    (all 17 methods listed)

--- Pattern 4: Indirect Compute Invocation ---
  registry size: 10 methods (1 Rhino-numbered + 9 additional)
  reachable:     10 / 10 methods
  verdict:       SAT
    (all 10 methods listed)

--- Pattern 5: Role Trust Modification ---
  registry size: 3 methods (1 Rhino-numbered + 2 additional)
  reachable:     3 / 3 methods
  verdict:       SAT
    (all 3 methods listed)

--- Cross-pattern summary ---
  registry total: 50 methods across 5 patterns
  reachable:      50 methods
  Rhino's 21 hit: 21 / 21
  beyond Rhino:   27 methods (cross-pattern, with overlaps)

50 reachable methods. 5 queries. Every one of Rhino's 21 confirmed.
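The summary's arithmetic is worth spelling out, because 21 + 27 is not 50. The per-pattern registry sizes above reconcile once you account for Rhino 14 and Rhino 16 each appearing in two patterns (see the pattern table). A short check, using only numbers quoted in this article:

```python
# (Rhino-numbered entries, additional entries) per pattern, from the output above.
patterns = {
    "Policy Self-Mutation": (9, 4),
    "Credential Creation / Theft": (4, 3),
    "Compute + PassRole": (8, 9),
    "Indirect Compute Invocation": (1, 9),
    "Role Trust Modification": (1, 2),
}

rhino_entries = sum(rhino for rhino, _ in patterns.values())
additional = sum(new for _, new in patterns.values())
registry_total = rhino_entries + additional

# Rhino 14 is listed under patterns 2 and 5, Rhino 16 under patterns 3 and 4.
double_listed = 2
distinct_rhino = rhino_entries - double_listed

print(rhino_entries, additional, registry_total, distinct_rhino)
```

So the registry holds 50 entries: 23 Rhino-numbered entries (21 distinct methods, two of them listed twice) plus 27 additions.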

The Script Said "Confirmed." The Attack Failed.

The collapse ratio above understates the value. "Rhino found 21, Z3 found 50" sounds like Z3's contribution is more methods. That framing misses the bigger win: Z3 also eliminates false positives the script-style checkers report.

A real-world penetration test, written up by Security Shenanigans (October 2020), illustrates the problem. The attacker had iam:PassRole and ec2:RunInstances. Pattern 3 / Rhino Method 3 looks for this exact pair. aws_escalate.py ran and reported the method as "Confirmed."

The attacker launched an EC2 instance with an admin role attached. The instance started. The reverse-shell user-data fired. Nothing connected back. The default security group in this account had no egress rules, so the instance had no outbound path. The escalation was confirmed but not exploitable.

The attacker eventually pivoted: enumerated security groups (ec2:DescribeSecurityGroups), found a hadoop-cluster SG with full egress, enumerated subnets (ec2:DescribeSubnets) for one in the same VPC, and re-launched. The escalation worked. But it required four constraints the script never checked.

This is shipped in the example as fixtures/real-world-pattern3/. Five assets:

arn:aws:iam:::user/rhino-attacker          (PassRole, RunInstances, ListRoles, no GetRole)
arn:aws:iam:::role/danger-role             (admin, trusts ec2.amazonaws.com)
arn:aws:ec2:::security-group/sg-f73b339e   (default, no egress)
arn:aws:ec2:::security-group/sg-42csce3f   (hadoop-cluster, all egress)
arn:aws:ec2:::subnet/subnet-a213as8c       (vpc-a1b2c3d4, matches both SGs)

Z3 encodes the full conjunction:

exploitable = passrole
            ∧ run_instances
            ∧ can_discover_role          ; ListRoles OR GetRole
            ∧ role_is_admin
            ∧ role_trusts_ec2
            ∧ exists_egress_sg           ; DescribeSGs ∧ SG.has_egress
            ∧ exists_valid_subnet        ; DescribeSubnets ∧ subnet.vpc=sg.vpc

Three queries on the same fixture:

--- Default SG only (no egress) ---
  verdict: UNSAT
  failed:  security group sg-f73b339e has no egress rules — reverse-shell
           user-data cannot connect back

--- Hadoop SG (full egress) ---
  verdict: SAT — full compound satisfied
  witness: PassRole→arn:aws:iam:::role/danger-role + RunInstances + ListRoles
           + sg=sg-42csce3f + subnet in vpc=vpc-a1b2c3d4

--- Without iam:ListRoles (no role discovery) ---
  verdict: UNSAT
  failed:  role discovery (no iam:ListRoles, no iam:GetRole)
  note:    PassRole on Resource:* is useless without knowing which
           role to pass — script checkers miss this

Three different verdicts on three different scenarios that all share the PassRole + RunInstances pair the script keys on. The script reports "Confirmed" three times. Z3 reports SAT once.
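The compound can be mirrored in plain Python to show why the three scenarios diverge. This is a sketch of the conjunction's logic, not the Z3 encoding itself. The `Scenario` type and its field names are hypothetical, and it assumes the attacker holds the two ec2:Describe* actions, as in the write-up.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Scenario:
    actions: frozenset      # actions the attacker holds
    role_is_admin: bool
    role_trusts_ec2: bool
    sg_has_egress: bool
    sg_vpc: str
    subnet_vpc: str


def exploitable(s: Scenario) -> bool:
    """Every conjunct must hold; PassRole + RunInstances alone is not enough."""
    can_discover_role = bool({"iam:ListRoles", "iam:GetRole"} & s.actions)
    exists_egress_sg = "ec2:DescribeSecurityGroups" in s.actions and s.sg_has_egress
    exists_valid_subnet = "ec2:DescribeSubnets" in s.actions and s.subnet_vpc == s.sg_vpc
    return (
        "iam:PassRole" in s.actions
        and "ec2:RunInstances" in s.actions
        and can_discover_role
        and s.role_is_admin
        and s.role_trusts_ec2
        and exists_egress_sg
        and exists_valid_subnet
    )


BASE = frozenset({"iam:PassRole", "ec2:RunInstances", "iam:ListRoles",
                  "ec2:DescribeSecurityGroups", "ec2:DescribeSubnets"})

default_sg = Scenario(BASE, True, True, sg_has_egress=False,
                      sg_vpc="vpc-a1b2c3d4", subnet_vpc="vpc-a1b2c3d4")
hadoop_sg = replace(default_sg, sg_has_egress=True)
no_discovery = replace(hadoop_sg, actions=BASE - {"iam:ListRoles"})

for name, scenario in [("default-sg", default_sg), ("hadoop-sg", hadoop_sg),
                       ("no-list-roles", no_discovery)]:
    print(name, "SAT" if exploitable(scenario) else "UNSAT")
```

A script that keys only on the first two conjuncts would return the same answer for all three scenarios.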

The granular permission bypass

Notice the third clause: iam:ListRoles OR iam:GetRole. The real-world fixture allows ListRoles but denies GetRole. A defender who reads "block GetRole" as the fix is wrong. ListRoles returns the role's trust policy in the response payload. The attacker doesn't need GetRole to read a trust policy; ListRoles already discloses it. Z3's disjunction encodes this correctly. A script-style checker that lists "GetRole" as the discovery permission misses the bypass entirely.

Why this matters for the headline number

Check	aws_escalate.py	Z3 compound
PassRole + RunInstances	Confirmed	necessary, not sufficient
Discoverable admin role	not checked	ListRoles OR GetRole
Role trusts ec2.amazonaws.com	not checked	trust policy clause
Security group with egress	not checked	sg.egress_rules ≠ ∅
Subnet in SG's VPC	not checked	subnet.vpc_id = sg.vpc_id
Result	"Confirmed"	SAT or UNSAT

The script checks 1 condition. Z3 checks 5. In a real environment where default SGs have no egress, the script's "Confirmed" is wrong on every account whose attacker hasn't yet enumerated SGs and subnets. Z3 is right relative to the modeled constraints, and when the model is incomplete (missing AMI availability, for example) it returns UNSAT instead of false-positive SAT.

So the headline is wrong. "Rhino found 21, Z3 found 50" sounds like a counting argument. The sharper claim is two-sided:

Recall: Z3 finds the methods Rhino enumerated and finds methods Rhino didn't, in the same query.
Precision: Z3 reports UNSAT when the compound that matters fails — even if the action-list checker reports "Confirmed."

The Bybit/Safe{WALLET} extension to the iam-overpermission-wildcard example shows the dual case from the other side: the heuristic boolean stays silent on a prefix wildcard while Z3 finds the production write. Together the two extensions sketch what a sound + complete checker looks like at the boundary between "policy is broad" and "the resulting attack chain is reachable" — different questions, different provers, both necessary.

The deny-list refutation

The natural defensive move after reading Rhino's research is to write a Deny policy listing all 21 actions:

{
  "Effect": "Deny",
  "Action": [
    "iam:CreatePolicyVersion", "iam:SetDefaultPolicyVersion",
    "iam:AttachUserPolicy", "iam:AttachGroupPolicy", "iam:AttachRolePolicy",
    "iam:PutUserPolicy", "iam:PutGroupPolicy", "iam:PutRolePolicy",
    "iam:AddUserToGroup",
    "iam:CreateAccessKey", "iam:CreateLoginProfile",
    "iam:UpdateLoginProfile", "iam:UpdateAssumeRolePolicy",
    "ec2:RunInstances",
    "lambda:CreateFunction", "lambda:UpdateFunctionCode",
    "lambda:CreateEventSourceMapping",
    "glue:CreateDevEndpoint", "glue:UpdateDevEndpoint",
    "cloudformation:CreateStack",
    "datapipeline:CreatePipeline"
  ],
  "Resource": "*"
}

Re-run the prover:

====================================================================
== partial-deny (deny covers Rhino's 21 actions)
====================================================================

--- Pattern 1: Policy Self-Mutation ---
  reachable: 1 / 13 methods
    [NEW    ] iam:DeleteRolePermissionsBoundary

--- Pattern 2: Credential Creation / Theft ---
  reachable: 3 / 7 methods
    [NEW    ] iam:CreateVirtualMFADevice + iam:EnableMFADevice
    [NEW    ] iam:DeactivateMFADevice
    [NEW    ] sts:GetFederationToken

--- Pattern 3: Compute + PassRole ---
  reachable: 9 / 17 methods
    [NEW    ] autoscaling:CreateLaunchConfiguration + autoscaling:CreateAutoScalingGroup
    [NEW    ] ecs:RunTask
    [NEW    ] ecs:CreateService
    [NEW    ] codebuild:CreateProject + codebuild:StartBuild
    [NEW    ] sagemaker:CreateNotebookInstance
    [NEW    ] sagemaker:CreateTrainingJob
    [NEW    ] batch:SubmitJob
    [NEW    ] states:CreateStateMachine
    [NEW    ] apprunner:CreateService

--- Pattern 4: Indirect Compute Invocation ---
  reachable: 10 / 10 methods
    [Rhino 16] dynamodb:PutItem (the partial deny didn't include this)
    [NEW    ] sqs:SendMessage
    [NEW    ] sns:Publish
    [NEW    ] s3:PutObject
    [NEW    ] events:PutRule + events:PutTargets
    [NEW    ] iot:CreateTopicRule
    [NEW    ] ses:CreateReceiptRule
    [NEW    ] cognito-idp:UpdateUserPool
    [NEW    ] cloudwatch:PutMetricAlarm
    [NEW    ] kinesis:PutRecord

--- Pattern 5: Role Trust Modification ---
  reachable: 1 / 3 methods
    [NEW    ] iam:DeleteRolePolicy

--- Cross-pattern summary ---
  reachable:      24 methods
  Rhino's 21 hit: 1 / 21
  beyond Rhino:   23 methods

24 reachable methods after the defender did everything Rhino's research told them to. The deny-list approach catches 20 of 21 Rhino-numbered methods (Rhino 16's dynamodb:PutItem slipped through because the defender thought of it as a data-write action, not a compute-trigger). Of the 27 additional methods Z3 finds, 23 are still reachable: the autoscaling chain, the ECS chain, the SQS trigger, the CloudWatch alarm and the rest weren't on Rhino's list, so they aren't on the defender's deny list, so they aren't blocked.

That's the proof.
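The Pattern 3 line of that output can be reproduced by hand. The sketch below intersects the Pattern 3 registry quoted earlier with the 21-action deny list. It is plain Python, not the Z3 prover, and it assumes the principal is otherwise allowed every action, as in the rhino-vulnerable fixture.

```python
PATTERN_3 = [
    ("ec2:RunInstances", "iam:PassRole"),
    ("lambda:CreateFunction", "lambda:InvokeFunction", "iam:PassRole"),
    ("lambda:CreateFunction", "lambda:CreateEventSourceMapping"),
    ("lambda:UpdateFunctionCode",),
    ("glue:CreateDevEndpoint", "iam:PassRole"),
    ("glue:UpdateDevEndpoint",),
    ("cloudformation:CreateStack", "iam:PassRole"),
    ("datapipeline:CreatePipeline", "datapipeline:PutPipelineDefinition"),
    ("autoscaling:CreateLaunchConfiguration", "autoscaling:CreateAutoScalingGroup"),
    ("ecs:RunTask", "iam:PassRole"),
    ("ecs:CreateService", "iam:PassRole"),
    ("codebuild:CreateProject", "codebuild:StartBuild", "iam:PassRole"),
    ("sagemaker:CreateNotebookInstance", "iam:PassRole"),
    ("sagemaker:CreateTrainingJob", "iam:PassRole"),
    ("batch:SubmitJob", "iam:PassRole"),
    ("states:CreateStateMachine", "iam:PassRole"),
    ("apprunner:CreateService", "iam:PassRole"),
]

DENIED = {
    "iam:CreatePolicyVersion", "iam:SetDefaultPolicyVersion",
    "iam:AttachUserPolicy", "iam:AttachGroupPolicy", "iam:AttachRolePolicy",
    "iam:PutUserPolicy", "iam:PutGroupPolicy", "iam:PutRolePolicy",
    "iam:AddUserToGroup", "iam:CreateAccessKey", "iam:CreateLoginProfile",
    "iam:UpdateLoginProfile", "iam:UpdateAssumeRolePolicy", "ec2:RunInstances",
    "lambda:CreateFunction", "lambda:UpdateFunctionCode",
    "lambda:CreateEventSourceMapping", "glue:CreateDevEndpoint",
    "glue:UpdateDevEndpoint", "cloudformation:CreateStack",
    "datapipeline:CreatePipeline",
}

# A method survives the deny list when none of its actions is denied.
reachable = [method for method in PATTERN_3 if not DENIED.intersection(method)]
print(f"reachable: {len(reachable)} / {len(PATTERN_3)} methods")
```

Note that iam:PassRole is not on the deny list, so every compute service that was not named individually stays open.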

Why structural beats enumeration

AWS adds roughly one new compute service per quarter. Each new compute service is a potential new entry in Pattern 3's registry. Each new event source is a potential new entry in Pattern 4. The defender's deny list grows monotonically. Their research budget does not.

A structural query asks "is there any method in this shape that the principal can execute?" The query itself doesn't change as new services launch. The registry changes — one new line per new service. The Z3 evaluation runs in milliseconds. The cost of extending coverage is one struct entry plus a recompile.

A heuristic enumeration asks "does the principal have permission to call this specific action?" The heuristic itself is correct for what it checks. The coverage is fragile. Each new method requires a new predicate, control YAML and test. The defender's catalogue grows at a rate determined by research bandwidth. AWS's API surface grows at a rate determined by AWS's product roadmap. These rates are not comparable.

This is Lambert's argument applied to defense: enumeration loses to a growing surface; verification proves coverage within a model. The model is the five patterns. New methods that fit any of those five shapes get coverage automatically. Methods that don't fit any shape — a brand new attack class — require a sixth pattern, which is also one query.

The remediated case

Least privilege closes everything:

====================================================================
== remediated (least-privilege)
====================================================================

--- Pattern 1: Policy Self-Mutation ---           UNSAT
--- Pattern 2: Credential Creation / Theft ---     UNSAT
--- Pattern 3: Compute + PassRole ---              UNSAT
--- Pattern 4: Indirect Compute Invocation ---     UNSAT
--- Pattern 5: Role Trust Modification ---         UNSAT

--- Cross-pattern summary ---
  reachable:      0 methods
  Rhino's 21 hit: 0 / 21
  beyond Rhino:   0 methods

The remediated principal has:

No self-mutation actions (no iam:Attach*, iam:Put*, iam:Create* against IAM resources).
No credential-modification actions on other principals.
iam:PassRole scoped to one specific role ARN.
No iam:UpdateAssumeRolePolicy.
The roles it can pass to are not admin-equivalent.

All five patterns return UNSAT. There is no escalation path within the modeled space.

This is what the least-privilege remediation checklist looks like in mathematical form: scope the resource ARNs, drop the self-mutation actions, refuse the broad iam:PassRole grant, and the SAT solver returns UNSAT on every pattern.

What pattern-matching tools see

Run a per-technique scanner against the rhino-vulnerable principal. The output is 21 separate findings: "principal has CreatePolicyVersion," "principal has AttachUserPolicy," "principal has RunInstances + PassRole," etc. Each is correct. None of them composes the 21 findings into "and here are the 27 other paths in the same shape that aren't on your radar."

Run the per-technique scanner against partial-deny. The output is fewer findings. The deny closes most of them. The output reads as a successful remediation. But the 24 paths Z3 still finds aren't in the scanner's list, so they don't appear in the report. The defender ships the change, runs the scan, sees the green dashboard, and concludes the work is done.

The math says otherwise.

The architectural rule

Don't build deny lists of known privesc actions. Build allow lists of known-needed actions, scoped by resource ARN, and rely on the implicit deny for everything else.

// What the defender wrote (deny-list — fragile):
{
  "Effect": "Deny",
  "Action": [...21 actions Rhino enumerated...],
  "Resource": "*"
}

// What math suggests instead (allow-list — sound):
{
  "Effect": "Allow",
  "Action": ["s3:GetObject", "s3:ListBucket"],
  "Resource": ["arn:aws:s3:::data-team-bucket",
               "arn:aws:s3:::data-team-bucket/*"]
},
{
  "Effect": "Allow",
  "Action": "iam:PassRole",
  "Resource": ["arn:aws:iam::*:role/data-team-job-role"]
}

The first form is a list of known bad. It grows. It trails AWS's product roadmap. It's never complete. The second form is a list of known good. It grows only when the team's actual needs grow. Everything else is implicitly denied. Patterns 1, 2, 4, 5 don't fire because the actions aren't in the allow list. Pattern 3 doesn't fire because iam:PassRole is scoped to a non-admin role.
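The implicit deny is what does the work here, and it can be sketched in a few lines. This is a deliberately simplified evaluator, not AWS's full policy evaluation logic: it ignores conditions, permissions boundaries and SCPs, matches case-sensitively, and uses `fnmatchcase` for wildcards. The account ID and the sample resource names are hypothetical.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    return value if isinstance(value, list) else [value]


def is_allowed(statements, action, resource):
    """Explicit Deny wins, then any matching Allow, otherwise implicit deny."""
    effects = set()
    for stmt in statements:
        action_hit = any(fnmatchcase(action, p) for p in _as_list(stmt["Action"]))
        resource_hit = any(fnmatchcase(resource, p) for p in _as_list(stmt["Resource"]))
        if action_hit and resource_hit:
            effects.add(stmt["Effect"])
    return "Allow" in effects and "Deny" not in effects


ALLOW_LIST = [
    {"Effect": "Allow",
     "Action": ["s3:GetObject", "s3:ListBucket"],
     "Resource": ["arn:aws:s3:::data-team-bucket",
                  "arn:aws:s3:::data-team-bucket/*"]},
    {"Effect": "Allow",
     "Action": "iam:PassRole",
     "Resource": ["arn:aws:iam::*:role/data-team-job-role"]},
]

checks = [
    ("s3:GetObject", "arn:aws:s3:::data-team-bucket/report.csv"),
    ("iam:PassRole", "arn:aws:iam::123456789012:role/data-team-job-role"),
    ("iam:PassRole", "arn:aws:iam::123456789012:role/danger-role"),
    ("ecs:RunTask", "*"),
]
for action, resource in checks:
    print(action, resource, is_allowed(ALLOW_LIST, action, resource))
```

The last two checks fail without any Deny statement. Nothing matched an Allow, so a service that launches next quarter is refused by default.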

Z3 returns UNSAT on all five patterns. That's not "we caught the methods we knew about." It's "no method in the modeled space exists." Different proposition.

Checklist

IAM identity policies use Allow-with-resource-ARN-scoping rather than Deny-with-action-blocklist
iam:PassRole resource is a list of specific role ARNs, not Resource: "*" with a service condition
No principal has self-mutation actions (iam:Attach*, iam:Put* on IAM resources, iam:DeleteRolePermissionsBoundary, iam:UpdateAssumeRolePolicy)
No principal has credential-modification actions on other principals (iam:CreateAccessKey, iam:UpdateLoginProfile, MFA-related actions on principals other than self)
When AWS announces a new compute service, the org's posture review treats it as a Pattern 3 candidate by default — the registry gets a new entry, the prover runs against existing principals, every match is a finding
CI runs a pattern-based prover (this article's template) against post-deploy observation snapshots, not just per-technique heuristic checks

The meta-lesson

Rhino did the foundational research that made every subsequent IAM privesc tool possible. The ecosystem rallied around the 21 methods. Z3 confirms all 21 of them. It just demonstrates that 21 is a lower bound, not the total. The total is "every method that fits these five shapes," and that number is whatever AWS launches next.

Defenders who think in lists will always be one quarter behind AWS's release calendar. Defenders who think in shapes are bounded by their model and the model is small enough to keep current. Five patterns. New service launches? Add it to the right registry. Re-run the prover. Done.

Rhino found 21. Z3 found 50. 24 of those still work after blocking everything Rhino listed. That's three sentences and the case for thinking structurally.

The example shipped is at stave/examples/iam-21-privesc-5-patterns/ — a CEL evaluation via pkg/stave.Apply (scoped to one of Stave's 44+ per-technique IAM escalation controls; fires on the vulnerable and partial-deny fixtures, silent on remediated) and a Z3 SAT prover that runs the five pattern queries across all three fixtures and prints the cross-pattern summary quoted in this article. The Z3 binary lives in a sibling Go module so its libz3 link stays out of Stave's main vendored tree. Stave detects this pattern and 31 other H1-grounded scenarios from local AWS configuration snapshots, with no cloud credentials.

AI Solved the Easy Part of Programming. The Hard Part is Now Harder to See.

Bala Paranj — Thu, 16 Jul 2026 11:45:34 +0000

In 1986, Fred Brooks published "No Silver Bullet — Essence and Accident in Software Engineering." His argument: software complexity has two kinds. Accidental complexity is the difficulty imposed by the tools — syntax rules, boilerplate code, build configurations, deployment scripts. Essential complexity is the difficulty imposed by the problem. The business rules, the data relationships, the edge cases where the real world doesn't fit neatly into a data model.

Brooks predicted that even if we eliminate ALL accidental complexity, software will still be hard. Essential complexity is irreducible. It comes from the domain, not the tools. No tool can eliminate it because it's not a property of the tool. It's a property of the problem.

AI is the most effective tool ever built for eliminating accidental complexity. It's proving Brooks right in the way he predicted.

What AI eliminates

AI generates boilerplate instantly. HTTP handlers, database CRUD operations, serialization code, test scaffolding, configuration files, Dockerfile templates, CI pipeline definitions. Every piece of accidental complexity that developers spend hours writing — AI produces in seconds.

A developer who spends 40% of their time on boilerplate and 60% on essential logic now spends 5% on boilerplate and has 95% of their time for essential logic. The time savings and the productivity improvement are real. The 10x feeling is real for the accidental part.

AI eliminates accidental complexity. The problem happens after the accidental complexity is gone.

What remains after the accidental is removed

When the boilerplate is generated in seconds, the developer reaches the essential complexity faster: the business logic, the domain rules, the edge cases. The relationships between entities that don't map cleanly to database tables. The race conditions that only appear under specific ordering. The authorization rules that depend on organizational hierarchy, delegation chains, and temporal state.

This is the complexity Brooks said was irreducible. AI can't eliminate it because it's not a syntax problem. It's a problem about the correspondence between the software and the real world. The AI can generate a function that handles a business rule. But it can't know whether the business rule is correct, because correct is defined by the domain.

The old workflow had a natural pacing mechanism: the time spent writing boilerplate was time the developer unconsciously spent thinking about the domain. The struggle with syntax gave the developer time to notice: "wait, this business rule doesn't make sense." The boilerplate was annoying, but it was also friction that slowed the developer down enough to think.

AI removed the friction. The developer goes from prompt to working code in minutes. The essential complexity is still there. But the developer didn't notice it because they never slowed down enough to see it.

The illusion of completion

The most dangerous consequence: AI-generated code looks complete. The boilerplate and the error handling are there. The tests pass. The developer reviews it and thinks: "this is done."

But done means the accidental complexity is resolved. The essential complexity (the domain logic, the edge cases, the invariants) was never examined. The code handles the common cases correctly (because the AI was trained on common patterns) and fails on the specific cases that make this particular domain unique (because those weren't in the training data).

The failure surfaces weeks or months later, when a customer hits an edge case the AI never considered. The developer debugging the issue finds code they didn't write, implementing a business rule they didn't design, handling a case they didn't think about. They can't fix it because they don't understand the essential complexity. The AI hid it under a layer of perfect-looking accidental resolution.

The reviewer's dilemma

This creates an asymmetry that compounds the problem. It is harder to review 100 lines of AI-generated code than to write 50 lines yourself.

When a developer writes code, they build a mental model of the domain as they go. They encounter edge cases during implementation, make decisions about how to handle them, and carry the reasoning forward. The code and the understanding are built together.

When a developer reviews AI-generated code, the code arrives pre-formed with no reasoning trail. The reviewer must reverse-engineer the domain model from the implementation — figure out which edge cases were handled, which were missed, and whether the ones that were handled were handled correctly. This is harder than building the model yourself, because the reviewer is reconstructing intent from an artifact that has no intent.

The result: teams that measure productivity by lines of code shipped or pull requests merged see massive throughput gains. Teams that measure productivity by defects avoided or domain coverage verified see the same essential-complexity bottleneck, now arriving faster and in larger batches.

Throughput is not progress.

Where the essential complexity lives

Brooks identified four properties that make essential complexity irreducible. AI interacts with each one differently.

Complexity. The sheer number of states the system can be in. A system with ten interacting components has more possible states than a developer can enumerate. Brooks called this the fundamental property — the one that makes all the others worse.

AI multiplies this property directly. A developer writing code by hand produces hundreds of lines per day. Each line is a decision point they consciously made. An AI-assisted developer ships thousands of lines per day. Each line is a decision point someone needs to understand but nobody consciously made. The number of states in the system grows with the code, but the developer's ability to map those states in their head remains constant. More code, same brain. The state space grows. The developer's model of it doesn't.

Conformity. The software must conform to the real world — to regulations, to existing systems, to human expectations. These conformity requirements can't be derived from code patterns. They must be understood from the domain.

Conformity is the hardest of Brooks's four properties, and the cloud makes it visible. The cloud is a conformity engine. You must conform to every service that imposes a model: AWS's IAM evaluation logic, GCP's networking model, Azure's RBAC hierarchy. AI is fluent in the syntax of these APIs but silent on the security implications of how they interact with your internal compliance rules.

AI that generates IAM policies based on training patterns doesn't understand IAM evaluation semantics. It pattern-matches on policies that appeared in its training data. The difference matters when two policies interact in ways the training data never combined, when a new action namespace silently changes the meaning of an existing wildcard, or when the specific compliance requirement falls between two regulatory interpretations.

Changeability. The requirements change. Because the real world changes. A pricing model that was correct last quarter is wrong this quarter. An IAM policy that was secure last week grants OAuth agent authorization this week because the provider added new actions to the namespace. The AI-generated code implements last quarter's model perfectly. The developer who needs to update it for this quarter's model doesn't understand the old model (they didn't write it) and can't safely modify it (they don't know which assumptions are embedded in the implementation).

Invisibility. Software is invisible. There's no spatial representation that shows the full structure. The developer must hold the structure in their head. As Peter Naur argued in "Programming as Theory Building," the developer must build a theory of the program: a mental model of how the solution maps to the problem, what each piece does and why, how the parts relate to each other and to the domain.

Naur's insight compounds Brooks's. If programming is building a theory in the developer's head, then AI-generated code is theory-less code. The artifact exists but the theory behind it was never constructed. No one holds the mental model of why the code is shaped the way it is, which domain constraints it satisfies, or which assumptions would break if the requirements changed. The code is doubly invisible: the developer doesn't have the theory, and the code's structure follows patterns from training data rather than patterns from the domain. The structure in the code doesn't match the structure of the problem. It matches the structure of similar code the AI has seen.

This makes AI-generated code uniquely fragile under change. Human-written code degrades when the developer who understood it leaves the team — the theory walks out the door. AI-generated code starts without a theory. There is no door for it to walk out of. The understanding was never there.

What this means for AI-assisted development

Brooks's insight predicts what DORA's 2025 data measured: organizations saw roughly a 10% improvement in actual code shipped to production. Higher throughput alongside higher instability — more code shipped, more rollbacks.

The accidental complexity was eliminated (massive time savings). The essential complexity remained (same number of domain bugs). The net improvement was modest. Because the time was always going to be spent in essential complexity, and AI doesn't reduce it.
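The arithmetic behind "modest" can be sketched with Amdahl's law, which is a framing added here and not one the DORA data uses. The 40% boilerplate share is the illustrative figure from earlier in this piece; the speedup factors are hypothetical.

```go
package main

import "fmt"

// overallSpeedup applies Amdahl's law: only the accidental share of the work
// gets faster, while the essential share takes exactly as long as before.
func overallSpeedup(accidentalShare, accidentalSpeedup float64) float64 {
	essentialShare := 1 - accidentalShare
	return 1 / (essentialShare + accidentalShare/accidentalSpeedup)
}

func main() {
	// 0.4 is the illustrative boilerplate share from the text; the speedup
	// factors below are hypothetical.
	for _, factor := range []float64{2, 8, 1000} {
		fmt.Printf("boilerplate %gx faster -> overall %.2fx\n",
			factor, overallSpeedup(0.4, factor))
	}
}
```

Under these assumptions, however fast the boilerplate gets, the overall gain stays below 1/0.6, roughly 1.67x. The essential 60% sets the ceiling.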

The teams that will succeed are the ones that use the time saved on accidental complexity to invest more time in essential complexity — in domain modeling, property definition, invariant specification, understanding the business rules deeply enough to verify that the AI's output is correct.

The teams that will fail are the ones that use the time savings to generate more code, features, endpoints, and services without investing in understanding the domain. They'll hit Brooks's wall later, with a larger codebase, less understanding, and more surface area for essential-complexity failures.

The specification as essential complexity management

Specifications (type signatures, API contracts, database schemas, property definitions) are the tools for managing essential complexity. They make the domain rules explicit instead of leaving them implicit in the code.

A type signature that says func ProcessOrder(order Order) (Receipt, error) captures essential complexity: an order produces a receipt or an error. Not a maybe-receipt, a silent failure or a partial result. The type system enforces the shape of that contract at compile time regardless of whether the implementation was written by a human or an AI.

A property that says "no order can have a negative total after discounts are applied" captures a domain invariant. AI might generate discount logic that produces negative totals for specific discount combinations. The property catches it — mechanically, deterministically, on every change. Because the essential complexity was made explicit as a verifiable statement.
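A minimal sketch of that property as a mechanical check, assuming a hypothetical Order and Receipt model with amounts in cents. Only the ProcessOrder signature comes from the text; the fields, the naive variant, and the subset enumeration are illustrative.

```go
package main

import (
	"errors"
	"fmt"
)

// Order and Receipt are hypothetical types; amounts are in cents.
type Order struct {
	SubtotalCents  int64
	DiscountsCents []int64
}

type Receipt struct {
	TotalCents int64
}

var ErrNegativeTotal = errors.New("order total is negative after discounts")

// ProcessOrder keeps the signature from the text and refuses to emit a
// receipt that would violate the invariant.
func ProcessOrder(order Order) (Receipt, error) {
	total := order.SubtotalCents
	for _, d := range order.DiscountsCents {
		total -= d
	}
	if total < 0 {
		return Receipt{}, ErrNegativeTotal
	}
	return Receipt{TotalCents: total}, nil
}

// naiveProcess stands in for generated discount logic that omits the guard.
func naiveProcess(order Order) (Receipt, error) {
	total := order.SubtotalCents
	for _, d := range order.DiscountsCents {
		total -= d
	}
	return Receipt{TotalCents: total}, nil
}

// invariantHolds tries every subset of the discounts against every subtotal
// and reports whether any successful receipt has a negative total.
func invariantHolds(process func(Order) (Receipt, error), subtotals, discounts []int64) bool {
	for _, sub := range subtotals {
		for mask := 0; mask < 1<<len(discounts); mask++ {
			var applied []int64
			for i, d := range discounts {
				if mask&(1<<i) != 0 {
					applied = append(applied, d)
				}
			}
			r, err := process(Order{SubtotalCents: sub, DiscountsCents: applied})
			if err == nil && r.TotalCents < 0 {
				return false
			}
		}
	}
	return true
}

func main() {
	subtotals := []int64{500, 2000}
	discounts := []int64{300, 400, 1000}
	fmt.Println("guarded implementation holds:", invariantHolds(ProcessOrder, subtotals, discounts))
	fmt.Println("naive implementation holds:", invariantHolds(naiveProcess, subtotals, discounts))
}
```

The check does not care who wrote the implementation. It only asks whether the stated invariant survives every discount combination it is given.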

AI eliminates accidental complexity. Specifications manage essential complexity. Both are needed. AI without specifications produces code that looks correct but embeds unexamined domain assumptions. Specifications without AI require humans to write all the boilerplate. Together: the AI handles the syntax, the specifications handle the semantics.

Brooks said there's no silver bullet. He was right. AI eliminates the accidental complexity that was never the hard part. The hard part remains — irreducible, essential, and now hidden under a layer of perfectly formatted code that looks like it's finished.

In cloud security, the essential complexity is the domain invariant: "no privilege escalation path exists through any combination of roles, policies, and trust relationships." The accidental complexity is writing the checks: the YAML, the CEL predicates, the test fixtures. Stave handles the accidental complexity (3,000+ controls, mechanically evaluated). The essential complexity, knowing which properties to verify, remains a human judgment that no AI can replace. Brooks, 40 years later, is still right.

Auto-Fix is not the Problem. The Signal is the Problem.

Bala Paranj — Wed, 15 Jul 2026 11:59:33 +0000

Harry Wetherald (Maze) makes a case against auto-fixing vulnerabilities: most findings are false positives and every code or cloud change carries risk, so auto-fixing everything means introducing risk for zero security benefit. His proposed path: first improve understanding (AI investigates findings in context), then generate fix suggestions for humans, then eventually automate the fixes once they're reliable.

The problem identification is correct. The solution addresses the wrong layer.

What he gets right

The false positive rate is real. Estimates range from 80% to 99%. Most findings don't matter when investigated in the context of the specific environment.
Auto-fixing false positives is worse than ignoring them. Every change to cloud or code carries risk, such as downtime, new attack surface, or regression. If 90% of findings are false positives, auto-fixing all of them means 90% of changes are risk for zero benefit.
AI-generated fix suggestions without context are not helpful. If the finding lacks organizational context, the fix suggestion will also lack the context. You get a wall of auto-fixes to review instead of a wall of findings to review. The shape of the problem changes; the size doesn't.

These observations are accurate.

Where the solution misses

The proposed progression: better AI investigation → fix suggestions → human review → eventually full automation.

This path accepts the scanner's output (a finding, a signal) as the starting point and tries to improve what happens after. It never questions the output itself.

The false positive problem is two problems

Wetherald's "80-99% false positives" combines two different things:

The first is genuine false positives. The scanner is wrong about the configuration. The bucket does have versioning. The role doesn't have that permission. The scanner missed something. The fix for this is straightforward: fix the scanner. Better rules, better parsing, better coverage. This is an engineering problem with engineering solutions.

The second is true findings that aren't exploitable. The configuration is as the scanner says. The bucket really does lack versioning. But the context makes exploitation impossible: the bucket is empty, internal-only, behind a VPC endpoint policy, and protected by an SCP. The finding is true. The risk is zero.

These require different solutions. The first needs better detection. The second needs something else entirely.

Context is not an AI problem

The context that makes a true finding non-exploitable is the configuration graph. The relationships between resources, the organizational policies, the permission boundaries, the network topology. This context is static. It's deterministic. It's already in the cloud provider's API responses.

Wetherald proposes AI that "investigates findings in the context of your environment." But the context he describes is not ambiguous. The SCP either blocks the action or it doesn't. The VPC endpoint policy either restricts the path or it doesn't. The permission boundary either limits the role or it doesn't.

These are not judgment calls requiring AI investigation. They are predicates evaluable against a configuration snapshot. The answer is true or false. Same snapshot, same predicate, same answer, every time.
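A minimal sketch of what "predicate against a snapshot" can look like. The Snapshot fields and the Allowed rule are hypothetical and far simpler than real IAM evaluation logic; the point is only that each check is a pure function of the snapshot.

```go
package main

import "fmt"

// Snapshot is a hypothetical, heavily simplified configuration snapshot.
type Snapshot struct {
	SCPDeniedActions       map[string]bool // actions an SCP denies
	BoundaryAllowedActions map[string]bool // actions the permission boundary allows
}

// Predicate is a pure function of the snapshot: no clock, no network, no model.
type Predicate func(s Snapshot) bool

// SCPBlocks reports whether an SCP in the snapshot denies the action.
func SCPBlocks(action string) Predicate {
	return func(s Snapshot) bool { return s.SCPDeniedActions[action] }
}

// BoundaryAllows reports whether the permission boundary allows the action.
func BoundaryAllows(action string) Predicate {
	return func(s Snapshot) bool { return s.BoundaryAllowedActions[action] }
}

// Allowed combines the two checks. This is an illustrative rule, not the
// cloud provider's actual evaluation semantics.
func Allowed(action string) Predicate {
	return func(s Snapshot) bool {
		return !SCPBlocks(action)(s) && BoundaryAllows(action)(s)
	}
}

func main() {
	snap := Snapshot{
		SCPDeniedActions:       map[string]bool{"s3:DeleteBucket": true},
		BoundaryAllowedActions: map[string]bool{"s3:DeleteBucket": true, "s3:GetObject": true},
	}
	for _, action := range []string{"s3:DeleteBucket", "s3:GetObject"} {
		fmt.Printf("%s allowed: %v\n", action, Allowed(action)(snap))
	}
}
```

Running the same predicate against the same snapshot a second time cannot produce a different answer, because nothing outside the snapshot is consulted.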

The reason scanners produce findings without this context is that scanners evaluate individual resources, not the relationships between them. A scanner checks: "does this bucket have versioning?" That's a property of one resource. Whether that bucket's lack of versioning matters depends on what writes to it, who can access it, what policies protect it, and whether it participates in an attack path. Those are relationships — edges in a graph, not properties of nodes.

The false positive is an architecture failure. The tool evaluates nodes. The risk lives in edges.

The output type is the root cause

A scanner produces a finding — "this bucket lacks versioning." That's a signal. It requires human interpretation before a machine can act on it. Is this finding real? Does it matter? Should we fix it? Those questions exist because the output is a signal.

Consider a different output: a verdict. "This configuration is NON_COMPLIANT with respect to control X, evaluated against predicate Y, from snapshot Z. Exit code 3." That's a decision. A pipeline reads the exit code. No interpretation, no triage, no "is this real?" question.
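As a rough illustration, a verdict can be modeled as a small value with an exit code attached. The struct layout is hypothetical; exit code 3 for NON_COMPLIANT follows the example above, and exit code 0 for COMPLIANT is an assumption.

```go
package main

import "fmt"

// Verdict is a hypothetical shape for a per-control decision.
type Verdict struct {
	Control   string
	Predicate string
	Snapshot  string
	Compliant bool
}

// Status renders the categorical result.
func (v Verdict) Status() string {
	if v.Compliant {
		return "COMPLIANT"
	}
	return "NON_COMPLIANT"
}

// ExitCode maps the verdict to a process exit code. 3 follows the example in
// the text; 0 for a compliant result is an assumption.
func (v Verdict) ExitCode() int {
	if v.Compliant {
		return 0
	}
	return 3
}

func main() {
	v := Verdict{Control: "X", Predicate: "Y", Snapshot: "Z", Compliant: false}
	fmt.Printf("%s control=%s predicate=%s snapshot=%s exit=%d\n",
		v.Status(), v.Control, v.Predicate, v.Snapshot, v.ExitCode())
	// A real CLI would finish with os.Exit(v.ExitCode()) so a pipeline can
	// gate on it; printing keeps this sketch free of side effects.
}
```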

The false positive problem Wetherald describes is a property of signals, not a property of security evaluation. Signals require investigation because they're incomplete. They describe one resource without the graph. Verdicts don't require investigation because they evaluate the graph and produce a deterministic result.

Better AI investigating signals is still AI interpreting signals. The output is still probabilistic — "I'm 95% confident this matters." That's still a signal. A faster, smarter signal is still a signal.

The missing step: verification

Wetherald's path is: find → investigate → suggest fix → human executes fix. The fix happens. Then what?

How does the defender know the fix worked? How does the auto-fixer know the fix resolved the vulnerability rather than introducing a different one?

The fix-verification loop is absent from his progression. Without verification, "auto-fix" means "auto-change and hope." With verification:

Finding identified (the verdict says NON_COMPLIANT)
Fix applied (by human or automation)
Re-evaluate (capture new snapshot, re-run evaluation)
Verdict changes (NON_COMPLIANT → COMPLIANT, or it doesn't)
If it doesn't change: the fix was wrong. Try again.

Verification makes auto-fix safe. Not better AI investigation or fix suggestions. Verification. A deterministic verdict engine that can confirm the forbidden state is gone after the fix is applied.
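The loop above can be sketched in a few lines. Everything here is hypothetical: the snapshot is a map of settings, the control is "versioning must be enabled," and the fix mutates a copy of the snapshot where a real system would capture a fresh one.

```go
package main

import "fmt"

// Snapshot is a hypothetical point-in-time capture: setting name -> enabled.
type Snapshot map[string]bool

type Verdict string

const (
	Compliant    Verdict = "COMPLIANT"
	NonCompliant Verdict = "NON_COMPLIANT"
)

// evaluate applies one illustrative control: versioning must be enabled.
func evaluate(s Snapshot) Verdict {
	if s["versioning"] {
		return Compliant
	}
	return NonCompliant
}

// Fix changes configuration. Here it mutates a copy of the snapshot; in
// practice the change happens in the cloud and a new snapshot is captured.
type Fix func(Snapshot)

// applyAndVerify evaluates before the fix, applies it to a copy, and
// re-evaluates. The caller compares the two verdicts.
func applyAndVerify(before Snapshot, fix Fix) (Verdict, Verdict) {
	after := Snapshot{}
	for k, v := range before {
		after[k] = v
	}
	fix(after)
	return evaluate(before), evaluate(after)
}

func main() {
	before := Snapshot{"versioning": false, "logging": false}
	wrongFix := func(s Snapshot) { s["logging"] = true }
	rightFix := func(s Snapshot) { s["versioning"] = true }

	b, a := applyAndVerify(before, wrongFix)
	fmt.Printf("wrong fix: %s -> %s\n", b, a)
	b, a = applyAndVerify(before, rightFix)
	fmt.Printf("right fix: %s -> %s\n", b, a)
}
```

The wrong fix is a perfectly valid change that leaves the verdict untouched, which is exactly the case "auto-change and hope" cannot detect.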

Self-healing is the wrong endpoint

Wetherald's vision: AI agents find, investigate, and fix without human involvement. Self-healing environments.

This assumes the bottleneck is human labor. Too many findings for humans to review, so automate the humans away.

The actual bottleneck is signal quality. If 90% of findings are false positives, the problem isn't that humans are slow. It's that the tool is producing 9 units of noise for every 1 unit of signal. Faster humans (or AI replacing humans) processing the same noise doesn't solve the problem. It processes noise faster.

The alternative is changing the output so the question "is this real?" doesn't arise:

A deterministic verdict doesn't need investigation. The predicate either matches the configuration or it doesn't.
An exploitability classification (exploitable / one change away / reachable only) doesn't need triage. The graph evaluation already identified which findings have every precondition present for an attack.
A fix-verification loop doesn't need hope. The verdict engine confirms the forbidden state is gone.

None of these require AI. They require a different tool architecture: one that evaluates the configuration graph deterministically, produces verdicts instead of signals, and can verify that a fix resolved the finding.
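One way to picture the exploitability classification is as a count of missing preconditions. The three class names come from the list above; the thresholds (zero missing, one missing, more than one missing) and the precondition names are assumptions made for illustration.

```go
package main

import "fmt"

type Class string

const (
	Exploitable   Class = "exploitable"
	OneChangeAway Class = "one change away"
	ReachableOnly Class = "reachable only"
)

// classify counts how many attack preconditions are absent in the snapshot.
// The mapping from count to class is an assumed reading of the three labels.
func classify(preconditions map[string]bool) Class {
	missing := 0
	for _, present := range preconditions {
		if !present {
			missing++
		}
	}
	switch {
	case missing == 0:
		return Exploitable
	case missing == 1:
		return OneChangeAway
	default:
		return ReachableOnly
	}
}

func main() {
	// Hypothetical precondition names for a single finding.
	examples := []map[string]bool{
		{"public_access": true, "no_scp": true, "write_permission": true},
		{"public_access": true, "no_scp": true, "write_permission": false},
		{"public_access": true, "no_scp": false, "write_permission": false},
	}
	for _, pre := range examples {
		fmt.Println(classify(pre))
	}
}
```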

The layer error

Wetherald is building a better consumer of signals. That's useful if your tools produce signals, you need something to interpret them, and AI interpretation is better than human interpretation at scale.

But building a better signal consumer doesn't fix the fact that the signal is the wrong output type. It's optimizing the layer above the problem rather than fixing the layer where the problem lives.

The layer where the problem lives is the tool's primary function. A scanner that evaluates nodes will always produce findings that lack graph context. No amount of post-hoc AI investigation changes what the scanner evaluated. The investigation is reconstructing the graph that the scanner should have evaluated in the first place — expensively, probabilistically, differently each time.

A tool that evaluates the graph from the start doesn't produce the false positive to investigate. The "80-99% false positive rate" is a consequence of evaluating nodes. Evaluate edges and the rate changes.

The right path

Wetherald's path: find (signal) → investigate (AI) → suggest fix → human review → auto-fix → self-healing.

The alternative: evaluate (verdict) → classify (exploitable / one-away / reachable) → fix → verify (re-evaluate, verdict changes) → automate verified fixes.

The difference:

No investigation step. The verdict is the investigation. The predicate evaluated the graph and produced a deterministic result.
No triage step. The exploitability classification is the triage. Exploitable findings first. One-away findings next (protect the precondition before it flips). Reachable findings last.
Verification built in. Every fix is re-evaluated. The verdict either changes (fix worked) or it doesn't (fix was wrong). No "auto-fix and hope."
Auto-fix becomes safe when the fix is verifiable. Not when AI gets better at investigation. The deterministic verdict tells the auto-fixer what needs to change, and the re-evaluation confirms the change worked.

This path doesn't require AI to improve. It requires the tool to produce a different output. The AI era's contribution to security isn't better investigation of signals. It's recognizing that investigation is a workaround for the wrong output type.

Stave produces verdicts, not signals. Deterministic evaluation of configuration snapshots against a control catalog. Exploitability classification on every finding. Fix verification via snapshot diff. The false positives Wetherald describes don't arise, because Stave evaluates the graph.

stave apply --observations ./your-snapshot/

Four Eras of Cloud Security. Same Verb.

Bala Paranj — Tue, 14 Jul 2026 09:36:35 +0000

Scott Piper published a twenty-year retrospective on cloud security research in March 2026. It's the most useful structural history of the field I've seen — four eras, each with defining milestones, each with the tools and research that shaped cloud security. If you work in cloud security, read it first.

What follows is a question about what the history reveals when you examine one detail it doesn't discuss.

The four eras

Piper divides two decades into four eras:

2006–2016, Foundational. Cloud providers built the security primitives — IAM (2011), CloudTrail (2013), Organizations and SCPs (2016). Before these existed, there was no mechanism for least privilege, no audit trail, and no organizational boundary. Security research in this era was part-time work from people with broader careers.

2016–2021, CSPM. Cloud security became a full-time job. CIS Benchmarks standardized what to check. Open-source tools proliferated — Prowler, CloudMapper, Pacu, Cloud Custodian, ScoutSuite. Cloud security during this time largely meant deploying a CSPM.

2021–2025, CNAPP. Point solutions gave way to platforms. Vendors integrated CSPM with container scanning, vulnerability management, and workload protection into a single product category. Research teams at vendors began finding cross-tenant vulnerabilities in the cloud providers themselves.

2025–present, AI. AI accelerates both attack and defense. Exploits that required deep language expertise are generated in minutes. A CTF challenge was solved by an AI within minutes of release. The industry is speed-running the cloud eras.

This is a well-evidenced narrative. Every era is defined by a change in what tools could do and who was building them.

The verb that didn't change

Look at what each era's defining tools do. The direct action each tool performs on its direct object.

In the CSPM era, the defining tools match API responses against rule databases. Prowler, ScoutSuite and Cloud Custodian all match. The CIS Benchmark is a rule database. The verb is match, and the output is a finding.

In the CNAPP era, the defining tools aggregate findings from multiple scanners into a single platform. The CNAPP collects what the CSPM found, what the vulnerability scanner found, what the container scanner found, and deduplicates. The verb is aggregate, and the output is a consolidated finding.

In the AI era, the defining tools score events against learned baselines, or exploit vulnerabilities faster. The verb is score or exploit, and the output is a detection or a proof-of-concept.

Three eras of tooling. Four verbs. But the output is always the same category: signals. A finding for human interpretation. An alert for human triage. A score for human thresholding. A detection for human review.

No era introduced a tool whose output is a decision. A deterministic, machine-verifiable, per-asset verdict that a pipeline can act on without human interpretation.

Why signals aren't decisions

A finding says: "this S3 bucket lacks versioning." A human must decide whether that matters, whether it's a true positive, and what to do about it.

A verdict says: "this configuration is COMPLIANT or NON_COMPLIANT with respect to this control, and here is the evaluation trace." A machine reads the exit code and gates the pipeline. No interpretation, triage or threshold is required from a human.

A CSPM finding can be accurate. The difference is in what happens next. A signal enters a queue. A verdict enters a pipeline. Signals require human interpretation as a necessary step between detection and action. Verdicts are the action.

Every era improved signals. CSPMs made signals more comprehensive. CNAPPs made signals more consolidated. AI makes signals faster. But improving a signal doesn't change it. A faster finding is still a finding. A better-correlated alert is still an alert. A higher-accuracy score is still a score.

Where the history is silent

Piper's history is comprehensive within its scope. The omission is structural. Deterministic verification tools don't appear in the history because they don't appear in the field. The category doesn't exist.

The cloud security industry has produced hundreds of tools across four eras. Billions of dollars of venture capital. Thousands of dedicated security engineers. Every tool produces signals.

The question is: is this because signals are sufficient, or because the architecture of every tool category makes decisions structurally impossible?

The architectural constraint

A CSPM queries the live cloud API, matches the response against a rule, and produces a finding. The finding is a signal because the CSPM evaluates one resource at a time. It can say "this bucket lacks versioning" but cannot say "this bucket is the destination for your CloudTrail trail, and this role has permission to delete it, and no SCP blocks cross-account writes, and therefore a complete attack path exists." The compound judgment requires evaluating relationships between resources, not properties of individual resources.

A CNAPP inherits this limitation. It aggregates findings from CSPMs and scanners, but aggregation doesn't create compound reasoning. It creates compound noise. More findings from more scanners, consolidated into one dashboard, still evaluated one resource at a time.

An AI-powered tool can reason across resources in principle, but introduces a different problem: non-determinism. The same input can produce different outputs across runs. For a security tool whose findings may be presented to auditors, regulators, or courts, the finding must be reproducible. A probabilistic verdict is a contradiction in terms.

The architectural gap is specific: no tool category evaluates the configuration graph (relationships between resources, not just properties of resources) deterministically (same input, same output, every time) from a coherent snapshot (point-in-time capture, not a stream of events).

The missing verb

The history supplies four verbs: match, aggregate, score, exploit. The missing verb is evaluate: produce a deterministic categorical verdict per asset by applying an enumerated catalog of controls to a snapshot of cloud state.

The distinction is operational:

A tool that matches takes an API response and a rule, and outputs a finding. The finding enters a triage queue.

A tool that evaluates takes a configuration snapshot and a control catalog, and outputs a verdict. The verdict enters a pipeline as an exit code.

The input is different (snapshot vs API response). The substrate is different (enumerated catalog vs rule database). The output is different (verdict vs finding). The downstream consumer is different (pipeline vs human).

This is a different primary function that produces a different output type. In the same way that CNAPP was not a better CSPM but a platform that aggregated them, the missing tool is not a better version of any existing category. It occupies a structural gap in what the toolchain produces.

The compound gap

The sharpest illustration of the gap is compound risk. Consider a specific scenario documented by Unit 42 in June 2026: an attacker who can delete an S3 bucket that serves as a CloudTrail destination can recreate the bucket under their own account, silently rerouting security telemetry to attacker-controlled storage. No stolen credentials, no network exploitation. Pure configuration-graph exploitation.

Detecting this from individual resource properties is impossible. The bucket's own properties are fine. The IAM role's own properties are fine. The CloudTrail trail's own properties are fine. The vulnerability exists only in the relationship between them: the role can delete the bucket, the trail depends on the bucket, and no organizational policy prevents cross-account writes.

A CSPM checking each resource individually sees three compliant resources. A CNAPP aggregating those three findings sees three compliant resources. Neither can see the graph. The compound chain (identity → permission → destination → trail → missing data perimeter) is invisible to any tool that evaluates nodes without evaluating edges.

This is not a coverage gap (add more rules). It's an architectural gap. The tool's primary function cannot express the finding. The missing verb — evaluate a snapshot against a catalog — operates on the graph, not on individual resources. The compound chain is a natural output of graph evaluation, and a structural impossibility for resource-level matching.
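A toy version of that graph evaluation, to make the node-versus-edge distinction concrete. The edge kinds, resource names, and the single DataPerimeterSCP flag are all hypothetical simplifications; this is not how Stave or any other tool models the snapshot.

```go
package main

import "fmt"

// Edge is a relationship between two resources in the snapshot.
type Edge struct {
	From, Kind, To string
}

// Snapshot holds relationships plus one organization-level fact.
type Snapshot struct {
	Edges            []Edge
	DataPerimeterSCP bool // true if an SCP prevents cross-account writes
}

// trailHijackChains returns every chain where a role can delete a bucket
// that a trail delivers to, and no data perimeter SCP exists. No single
// resource's properties appear in the check; only relationships do.
func trailHijackChains(s Snapshot) []string {
	if s.DataPerimeterSCP {
		return nil
	}
	var chains []string
	for _, del := range s.Edges {
		if del.Kind != "can_delete" {
			continue
		}
		for _, dst := range s.Edges {
			if dst.Kind == "delivers_to" && dst.To == del.To {
				chains = append(chains,
					fmt.Sprintf("%s -> %s <- %s", del.From, del.To, dst.From))
			}
		}
	}
	return chains
}

func main() {
	snap := Snapshot{Edges: []Edge{
		{"role/ci-deployer", "can_delete", "bucket/audit-logs"},
		{"trail/org-trail", "delivers_to", "bucket/audit-logs"},
		{"role/ci-deployer", "can_delete", "bucket/scratch"},
	}}
	fmt.Println("without SCP:", trailHijackChains(snap))

	snap.DataPerimeterSCP = true
	fmt.Println("chains with SCP:", len(trailHijackChains(snap)))
}
```

The delete permission on the scratch bucket produces nothing, because no trail depends on it. The same permission on the audit bucket produces a chain, and adding the SCP removes it.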

What this means for the AI era

Piper frames the AI era around speed: faster exploits, faster patching, faster detection. AI solved a CTF in minutes. AI generated an exploit in 10 minutes. The implicit thesis: the bottleneck is speed, and AI removes it.

But if the output type hasn't changed, if AI produces faster signals rather than different outputs, then the bottleneck isn't speed. It's the gap between signals and decisions. A faster finding still enters a triage queue. A faster alert still requires human interpretation. A faster score still needs a threshold.

AI applied to the existing toolchain makes each verb faster. AI-powered matching. AI-powered aggregation. AI-powered scoring. The verbs stay the same. The output stays the same. The signals arrive faster, in greater volume, with higher confidence, and still require a human to close the loop.

The alternative is not AI replacing the human. It's changing the output type so the loop doesn't need closing. A deterministic verdict, mechanically evaluated, same answer every time, with an auditable evaluation trace. That's a tool whose output a pipeline can act on directly, at any speed, because the human isn't in the loop between detection and action.

That's not the AI era's contribution. That's the contribution of a different verb.

The fifth column

Piper's history is a four-column table. Each column is an era. Each era has tools, verbs, and outputs. Adding a fifth column doesn't extend the table. It changes what the table measures:

	Foundational	CSPM	CNAPP	AI	?
Defining tools	IAM, CloudTrail, SCPs	Prowler, ScoutSuite, Cloud Custodian	Wiz, Orca, Prisma Cloud	AI-powered scanners, agentic exploits	—
Verb	(primitives, not tools)	Match	Aggregate	Score	Evaluate
Output	—	Signal	Signal	Signal	Verdict
Input	—	Live API	Live API + scanners	Events + logs	Snapshot
What advances	What's possible	What's checked	What's consolidated	How fast	What's decided

The first four columns improve along the same axis: better, more, faster signals. The fifth column is orthogonal. It doesn't make signals better. It produces a different output type.

The history doesn't mention this column because nobody has built it. The field went from matching to aggregating to scoring without ever stopping to ask: what if the output wasn't a signal?

When you remove humans from the loop where they are slow and the weakest link in the system, they move to a different loop: one where expert judgement is encoded in controls that can be evaluated at machine speed, without slowing down the agentic-era development loop. Slow and deliberate where human expertise is required. Fast and accurate where agents are required. Two different loops that serve different purposes.

That question is answered by Stave. The verb is evaluate. The input is a configuration snapshot. The output is a deterministic per-asset verdict with an auditable evaluation trace. The compound chain in the Unit 42 example (an identity with delete permission on a CloudTrail destination bucket, and no SCP data perimeter) is a finding Stave produces and no signal-producing tool structurally can. Twenty years of cloud security built the signal infrastructure. The decision layer was never built.

stave apply --observations ./your-snapshot/

Every Cloud Security Tool Works. None of Them Are Sufficient. Here's the Precise Diagnosis.

Bala Paranj — Mon, 13 Jul 2026 11:30:19 +0000

"Our CSPM secures the cloud." "Our SIEM detects threats." "Our CNAPP protects cloud-native applications." "Our AI flags risks."

Each claim is wrong by a specific, testable standard. There's a discipline from TRIZ that forces you to state what a tool DOES to its DIRECT OBJECT. No indirect effects. No consumer perceptions. No metaphors. Just: what does the tool DO to what it TOUCHES?

Applied to six cloud security tool categories, the discipline reveals a structural pattern: every tool produces signals. None produces decisions. Operators must currently infer decisions from accumulated signals. Breaches live in that inference gap.

The discipline: state the direct action

A function is the intended direct physical action of the tool on the object. State it as Tool—Verb—Object. Reject anything that describes an indirect effect, a consumer perception, or a non-physical concept.

The classic example: a ship propeller. "Propeller drives ship" is wrong. The propeller doesn't touch the ship. "Propeller pushes water" is correct. The propeller directly acts on the water; the water moves the ship. If you optimize for "driving the ship" you optimize the wrong thing. If you optimize for "pushing water" you optimize the right thing.

Cloud security descriptions fail the same test:

Common description (wrong)	Why wrong	Correct formulation
"CSPM secures the cloud posture"	"Secures" is consumer-perceived	CSPM rule engine matches API responses against rule database
"SIEM detects security incidents"	"Detects" + "incident" are indirect	SIEM correlation engine matches log events against detection signatures
"CNAPP protects cloud-native apps"	"Protects" is consumer-perceived	CNAPP aggregation engine collects and deduplicates findings from multiple scanners
"SOAR automates incident response"	"Automates" is meta	SOAR playbook executor runs scripted action sequences against detected events
"Checkov scans IaC for misconfigurations"	"Scans" + "misconfiguration" are technical-system labels	IaC scanner tests parsed HCL/JSON against rule patterns
"ML detects anomalies"	"Anomaly" is non-physical	ML model scores events against a learned baseline distribution

Six categories. Six correct formulations. Now the structural pattern becomes visible.

Six tools, six verbs

Tool	Primary function (correct)	Verb
CSPM	Rule engine matches API responses against rule database	Matches
CNAPP	Aggregation engine collects and deduplicates findings from scanners	Aggregates
SIEM	Correlation engine matches log events against signatures	Matches
SOAR	Playbook executor runs scripted sequences against detections	Runs
IaC scanner	Parser + matcher tests parsed IaC against rule patterns	Tests
ML detection	Trained model scores events against learned baseline	Scores

Six verbs: matches, aggregates, matches, runs, tests, scores.

Every verb is signal-producing. Each tool takes input and produces a signal — a finding, an alert, a score, a playbook result. None of the verbs is decision-producing. None produces a verdict that a machine can act on without human interpretation.

The cloud security industry has built six categories of signal-producing tools and zero categories of decision-producing tools.

The absent-function inventory

For each tool, what's missing — functions that should exist but don't:

CSPM: matches, but only after deployment

✓ Adequate:   Matches API responses against rules (within rule coverage)
✗ Absent:     Pre-deployment matching (CSPM runs AFTER state exists in the cloud)
✗ Absent:     Machine-verifiable deterministic verdict (findings require human interpretation)
✗ Absent:     Auditable rule basis (rule libraries are often vendor-proprietary)
✗ Harmful:    Alert fatigue (match volume exceeds operator capacity)
✗ Harmful:    Severity-score inflation (vendors exaggerate to justify visibility)

CNAPP: aggregates, but amplifies the noise

✓ Adequate:   Collects findings from multiple scanners
✗ Absent:     Pre-deployment prevention (inherits CSPM's post-hoc limitation)
✗ Absent:     Deterministic verdict (aggregated findings are MORE noisy, not less)
✗ Harmful:    Amplifies alert volume (aggregation increases noise without increasing signal)
✗ Harmful:    Hides scanner-specific quality (which scanner produced which finding?)

SIEM: matches events, but only after they happen

✓ Adequate:   Matches log events against signatures (within signature coverage)
✗ Absent:     Prevention (SIEM matches AFTER events occur; cannot prevent)
✗ Absent:     Machine-verifiable verdict per asset (produces alerts, not asset-level verdicts)
✗ Harmful:    Alert fatigue at production scale
✗ Harmful:    Cost growth (log-volume billing incentivizes reducing collection)

SOAR: runs playbooks, but amplifies upstream errors

✓ Adequate:   Runs playbooks (executor logic is mature)
✗ Absent:     Input quality control (cannot determine if the detection is correct)
✗ Absent:     Prevention (runs AFTER a detection; unsafe state already exists)
✗ Harmful:    Amplifies false positives (false detection → automated false response)
✗ Harmful:    False sense of automation (operators stop reviewing; detection failures become silent)

IaC scanner: tests source, but not deployed state

✓ Adequate:   Tests parsed IaC against rules (within rule coverage)
✗ Absent:     Post-deployment evaluation (tests source; not deployed state)
✗ Absent:     Identity and environment context (source file lacks the live account state
              it deploys into — a Terraform module is safe in isolation but unsafe when
              composed with an existing FullAccess role it attaches to)
✗ Absent:     Detection of console/CLI changes (manual changes bypass IaC entirely)
✗ Absent:     Drift detection (source vs live state divergence is invisible)
✗ Harmful:    False security from "clean IaC" (clean source ≠ clean deployed state)

ML detection: scores events, but can't explain or enumerate

✓ Adequate:   Scores events against baseline (inference is mechanical and fast)
✗ Absent:     Auditable verdict basis (high score doesn't explain WHY)
✗ Absent:     Determinism (same event scores differently after retraining)
✗ Absent:     Catalog-style enumeration (model cannot be inspected for "what counts as unsafe")
✗ Harmful:    Probabilistic verdicts for binary decisions (score thresholds are arbitrary)
✗ Harmful:    Model drift produces silent failure (stops catching things without anyone noticing)

The cross-tool pattern

Line up the absent functions across all six tools:

Absent function	CSPM	CNAPP	SIEM	SOAR	IaC	ML
Pre-deployment evaluation	✗	✗	✗	✗	✓*	✗
Deterministic verdict	✗	✗	✗	✗	✗	✗
Auditable rule basis	✗	✗	~	~	✓	✗
Catalog-style enumeration	~	~	✗	✗	✓	✗
Deployed-state evaluation	✓	✓	✓	~	✗	✓
Snapshot-based reasoning	✗	✗	✗	✗	✗	✗

*IaC scanners are pre-deployment but source-only. They test the declared intent, not the actual state that will exist after apply. A source-only check cannot account for identity and environment context. The Terraform plan says "attach this policy to this role." Whether the resultant state is safe depends on what permissions that role already has, what other policies are attached, and what SCPs constrain the account — none of which exist in the source file. The snapshot (planned state composed with live state) is the only valid object for a pre-deployment security decision.
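The footnote's point can be sketched in a few lines of Go. Everything named here (Role, PlannedAttach, compose, canDeleteAuditLogs, and the choice of forbidden state) is hypothetical and is not Stave's data model. The sketch only shows why the same planned change can be safe when read from source alone and unsafe once composed with live state.

```go
package main

import "fmt"

// Role is a hypothetical, simplified view of a live IAM role:
// a name and the set of actions it already allows.
type Role struct {
	Name    string
	Actions map[string]bool
}

// PlannedAttach is a hypothetical planned change: attach extra actions to a role.
type PlannedAttach struct {
	RoleName string
	Actions  []string
}

// compose returns the action set as it would exist after apply:
// live actions plus planned actions. The live role is not mutated.
func compose(live Role, plan PlannedAttach) map[string]bool {
	out := make(map[string]bool, len(live.Actions)+len(plan.Actions))
	for a := range live.Actions {
		out[a] = true
	}
	for _, a := range plan.Actions {
		out[a] = true
	}
	return out
}

// canDeleteAuditLogs is an illustrative forbidden state: one role can
// both stop logging and delete log objects.
func canDeleteAuditLogs(actions map[string]bool) bool {
	return actions["cloudtrail:StopLogging"] && actions["s3:DeleteObject"]
}

func main() {
	plan := PlannedAttach{RoleName: "deployer", Actions: []string{"s3:DeleteObject"}}

	// Source-only view: the plan evaluated against an empty role.
	planOnly := compose(Role{Name: "deployer"}, plan)

	// Snapshot view: the plan composed with what the role already allows.
	live := Role{Name: "deployer", Actions: map[string]bool{"cloudtrail:StopLogging": true}}
	composed := compose(live, plan)

	fmt.Println("source only unsafe:", canDeleteAuditLogs(planOnly))
	fmt.Println("composed unsafe:", canDeleteAuditLogs(composed))
}
```

The predicate is identical in both calls. Only the object it is applied to changes, which is the argument for evaluating the composed snapshot.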

The pattern: no tool provides all six functions. Most tools provide one or two. The combination of all six tools still leaves gaps. Because the gaps are structural, not coverage-based. Adding a seventh tool that also produces signals doesn't fill the decision gap.

The most striking row: deterministic verdict is absent in all six categories. No tool in the cloud security market produces a deterministic, machine-verifiable, per-asset verdict that a pipeline can act on without human interpretation.

The second most striking row: snapshot-based reasoning is absent in all six. Every tool operates on streams, events, findings, or source code — not on coherent point-in-time snapshots of actual cloud state.
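The two all-absent rows can be found mechanically. A minimal sketch that transcribes the table above into Go; the type and function names are illustrative, and the IaC pre-deployment cell is entered as present even though the table footnotes it as source-only.

```go
package main

import "fmt"

// Cell values transcribed from the cross-tool table.
const (
	absent  = "✗"
	partial = "~"
	present = "✓"
)

type row struct {
	function string
	cells    []string // order: CSPM, CNAPP, SIEM, SOAR, IaC, ML
}

var table = []row{
	{"Pre-deployment evaluation", []string{absent, absent, absent, absent, present, absent}},
	{"Deterministic verdict", []string{absent, absent, absent, absent, absent, absent}},
	{"Auditable rule basis", []string{absent, absent, partial, partial, present, absent}},
	{"Catalog-style enumeration", []string{partial, partial, absent, absent, present, absent}},
	{"Deployed-state evaluation", []string{present, present, present, partial, absent, present}},
	{"Snapshot-based reasoning", []string{absent, absent, absent, absent, absent, absent}},
}

// absentEverywhere returns the functions that no tool provides, even partially.
func absentEverywhere(rows []row) []string {
	var out []string
	for _, r := range rows {
		all := true
		for _, c := range r.cells {
			if c != absent {
				all = false
				break
			}
		}
		if all {
			out = append(out, r.function)
		}
	}
	return out
}

func main() {
	for _, f := range absentEverywhere(table) {
		fmt.Println(f)
	}
	// Prints:
	// Deterministic verdict
	// Snapshot-based reasoning
}
```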

The harmful-function pattern

Line up the harmful functions across all six tools:

Harmful function	Tools producing it
Alert/finding volume exceeds operator capacity	CSPM, CNAPP, SIEM
Severity-score inflation	CSPM, CNAPP, ML
Vendor lock-in by rule/signature format	CSPM, CNAPP, SIEM
Amplification of upstream errors	SOAR, CNAPP
False security from clean source	IaC scanners
Model drift / silent failure	ML
Cost growth with scale	SIEM

The harmful pattern compounds: CSPM produces inflated-severity findings → CNAPP aggregates them (amplifying volume) → SIEM correlates them (adding more alerts) → SOAR runs playbooks against the accumulated noise (amplifying false positives). Each tool's harmful function feeds the next tool's harmful function. The aggregate is worse than any individual tool's harmful contribution.

The complete picture: existing tools vs the missing function

Combine the absent-useful inventory and the harmful inventory into one view. The rows are the properties the problem requires. The columns are every tool category — including the missing one:

Required property	CSPM	CNAPP	SIEM	SOAR	IaC Scanner	ML Detection	Configuration Verifier
Pre-deployment evaluation	✗ Post-deploy	✗ Post-deploy	✗ Post-event	✗ Post-detection	⚠️ Source only	✗ Post-event	✅ Planned + deployed snapshots
Deterministic verdict	✗ Findings	✗ Aggregated findings	✗ Alerts	✗ Playbook results	✗ Findings	✗ Scores	✅ COMPLIANT / NON_COMPLIANT per asset
Auditable rule basis	✗ Proprietary	✗ Proprietary	⚠️ Signatures visible	⚠️ Playbooks visible	✅ Open rules	✗ Model opaque	✅ Open YAML catalog, every control readable
Catalog-style enumeration	⚠️ Rules exist but vendor-locked	⚠️ Inherited	✗ Signatures ≠ invariants	✗ No catalog	✅ Rules enumerate	✗ Model can't enumerate	✅ 2,949 controls, each traceable to documented failure
Deployed-state evaluation	✅ Queries live APIs	✅ Inherits CSPM	✅ Reads live events	⚠️ Acts on detections	✗ Source only	✅ Reads live events	✅ Snapshots of actual cloud state
Snapshot-based reasoning	✗ Stream/API polling	✗ Stream/API polling	✗ Event stream	✗ Event-driven	✗ Source files	✗ Event stream	✅ Coherent point-in-time snapshots
Compound cross-asset reasoning	✗ Per-resource rules	⚠️ Heuristic correlation	⚠️ Event correlation	✗ Per-playbook	✗ Per-resource rules	⚠️ Behavioral patterns	✅ Compound chains across configuration graph
Machine-consumable output	⚠️ API but interpretation needed	⚠️ API but noisy	⚠️ Alerts need triage	⚠️ Actions need approval	✅ Exit codes	✗ Scores need thresholding	✅ Exit code 0/3, pipeline acts directly

Primary function verb	Matches	Aggregates	Matches	Runs	Tests	Scores	Evaluates
Output type	Signal	Signal	Signal	Signal	Signal	Signal	Verdict

Read the table column by column. Every existing category has gaps (✗) in the rows that matter most. Read the last column: the Configuration Verifier fills every gap. Because the function analysis of existing tools diagnosed what was missing, and the missing function was designed to fill the diagnosis.

Read the bottom two rows. Six tools with six different verbs — matches, aggregates, matches, runs, tests, scores — all producing signals. One tool with a different verb — evaluates — producing verdicts. The verb difference is the category difference.

The most revealing row: deterministic verdict. Six ✗ marks. Every existing tool produces something that requires human interpretation before a machine can act on it. The Configuration Verifier produces a verdict a pipeline reads as an exit code. No interpretation, triage or threshold. Binary.

The diagnosis: signals vs decisions

The structural diagnosis from the function analysis:

What existing tools produce:    SIGNALS
    - findings for human interpretation
    - alerts for human triage
    - scores for human thresholding
    - playbook results for human verification

What the problem requires:      DECISIONS
    - deterministic per-asset verdicts
    - machine-readable pass/fail
    - auditable reasoning per verdict
    - pre-deployment evaluation against enumerated forbidden states

The gap between signals and decisions is the gap between the existing toolchain and the problem it claims to solve. No amount of better signals closes the gap. Because signals and decisions are different categories. A faster alert is still an alert. A smarter score is still a score. A better finding is still a finding. None is a verdict.

The gap requires a tool with a different primary function:

Existing PF verbs:    matches, aggregates, matches, runs, tests, scores
                      (all signal-producing)

Missing PF verb:      EVALUATES — produces a deterministic categorical verdict per asset
                      by applying an enumerated catalog of forbidden states
                      to a snapshot of actual cloud state
                      (decision-producing)

The missing verb is qualitatively different from the existing verbs. It's not a better version of matching or scoring. It's a different action that produces a different output (verdict vs signal) from a different input (snapshot vs stream) using a different substrate (enumerated catalog vs learned baseline).

The design brief

The function analysis produces the brief. The brief:

A tool whose primary function fills the absent-useful inventory must:

1. Take snapshots as its object — Coherent point-in-time captures of actual cloud state.

2. Apply a catalog of forbidden states — An enumerated, human-authored, auditable specification of what unsafe means.

3. Produce a verdict — A deterministic, categorical, per-asset decision: compliant or not.

4. Operate pre-deployment — Evaluate proposed state before it reaches the cloud.

5. Be deterministic — Same inputs, same verdict, every time.

6. Be auditable — the catalog is external, readable, and independently verifiable. An auditor can read the full enforcement scope without accessing vendor internals.

Six properties. Each is the complement of a specific absent-useful function in the cross-tool inventory. Together they describe a tool with the primary function: the verdict engine evaluates a snapshot of cloud state against a catalog of forbidden states, producing a deterministic per-asset verdict.
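To make the brief concrete, here is a minimal sketch of that primary function in Go. It is not Stave's implementation: Asset, Control, Verdict, Evaluate, ExitCode, and the example control ID are all hypothetical, and the mapping of the 0/3 exit codes from the comparison table (0 for all compliant, 3 for any violation) is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// Asset is a hypothetical snapshot entry: an ID plus flat configuration properties.
type Asset struct {
	ID    string
	Props map[string]string
}

// Control is a hypothetical catalog entry: an ID and a predicate that
// reports whether an asset is in the forbidden state.
type Control struct {
	ID        string
	Forbidden func(Asset) bool
}

// Verdict is the per-asset decision plus the controls that produced it.
type Verdict struct {
	AssetID   string
	Compliant bool
	Violated  []string // sorted control IDs, usable as an audit trace
}

// Evaluate applies every control to every asset. Results are sorted by
// asset ID so the same snapshot and catalog always yield the same output.
func Evaluate(snapshot []Asset, catalog []Control) []Verdict {
	verdicts := make([]Verdict, 0, len(snapshot))
	for _, a := range snapshot {
		v := Verdict{AssetID: a.ID, Compliant: true}
		for _, c := range catalog {
			if c.Forbidden(a) {
				v.Compliant = false
				v.Violated = append(v.Violated, c.ID)
			}
		}
		sort.Strings(v.Violated)
		verdicts = append(verdicts, v)
	}
	sort.Slice(verdicts, func(i, j int) bool { return verdicts[i].AssetID < verdicts[j].AssetID })
	return verdicts
}

// ExitCode assumes 0 means every asset is compliant and 3 means at least one is not.
func ExitCode(vs []Verdict) int {
	for _, v := range vs {
		if !v.Compliant {
			return 3
		}
	}
	return 0
}

func main() {
	catalog := []Control{{
		ID:        "EXAMPLE.PUBLIC_READ",
		Forbidden: func(a Asset) bool { return a.Props["public_read"] == "true" },
	}}
	snapshot := []Asset{
		{ID: "bucket-b", Props: map[string]string{"public_read": "true"}},
		{ID: "bucket-a", Props: map[string]string{"public_read": "false"}},
	}
	vs := Evaluate(snapshot, catalog)
	for _, v := range vs {
		fmt.Println(v.AssetID, v.Compliant, v.Violated)
	}
	fmt.Println("exit code:", ExitCode(vs))
}
```

The catalog is plain data plus predicates, so it can be read and reviewed separately from the evaluator. The evaluator itself holds no opinion about what unsafe means.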

That primary function doesn't exist in any of the six existing categories. It's a different verb (evaluates), operating on a different object (snapshot), using a different substrate (catalog), producing a different output (verdict). The function analysis tells us the gap is structural. The tool that fills it is a new category, not a better version of an existing one.

The diagnostic for your own toolstack

Apply the same discipline to your tools:

1. For each tool, state the primary function in Tool—Verb—Object form. Reject non-physical vocabulary. What does the tool do to what it touches?

2. Classify each function. Adequate useful (works), inadequate useful (undershoots), absent useful (should exist, doesn't), harmful (works against you).

3. Look at the absent-useful column across all tools. The pattern of what's missing across the toolstack is the structural gap.

4. Look at the harmful column across all tools. Do the harmful functions compound? Does one tool's harmful output feed the next tool's harmful input?

5. State the missing primary function. What verb, operating on what object, producing what output, would fill the absent-useful inventory?

The answer to step 5 is the design brief for what your toolstack needs next. Not "a better version of tool X". A tool with a different primary function that fills the gaps the existing tools collectively leave.

Every cloud security tool works. Each tool's primary function is adequate within its coverage. The diagnosis is that they all produce signals where the problem demands decisions. The gap is structural. Filling it requires a different verb.

The missing primary function — the verdict engine evaluates a snapshot against a catalog of forbidden states, producing a deterministic per-asset verdict — is implemented in Stave, an open-source cloud configuration verifier. The verb the industry doesn't have: evaluates. The output the industry doesn't produce: verdicts. The input the industry doesn't use: snapshots. Try the missing function: bash examples/demo-ai-security/run.sh

DORA Metrics Measure Delivery Health. What Measures Security Posture Health?

Bala Paranj — Sun, 12 Jul 2026 12:17:36 +0000

Delivery teams have DORA. Four metrics (deployment frequency, lead time for changes, mean time to restore, and change failure rate) predict whether a team is shipping well. Thoughtworks recently added a fifth: rework rate, measuring how much of the pipeline is consumed by fixing work previously considered complete.

These metrics changed how delivery organizations operate. Because they're leading indicators. They tell you the trajectory before the outcome arrives. A team with increasing lead times is heading for trouble. A team with rising rework rate is accumulating debt. You see it in the metrics before you see it in the incidents.

Security teams have no equivalent.

What security teams measure today

Finding counts. "We found 247 misconfigurations this quarter." More scanning produces more findings. A team that scans more frequently or adds a new tool sees the number go up, which looks worse even if posture is improving. Finding counts measure scanning effort, not security health.

Compliance percentages. "We're 94% compliant with CIS Benchmarks." This measures the last audit, not the current trajectory. A team at 94% today might be at 87% next week if three Terraform changes introduced misconfigurations. The percentage is a snapshot, not a trend. It rewards breadth of coverage over depth. 94% across 200 checks sounds better than 100% across 50 checks, even if the 50 are the ones that matter.

Incident counts. "We had two security incidents this quarter." This is a trailing indicator. It measures failures that already happened. A team with zero incidents might have excellent posture or might have excellent luck. You can't tell. By the time the count goes up, the damage is done.

None of these answer the question delivery teams answer with DORA: are we getting better, and how fast?

The mapping

The five DORA metrics adapt directly to security posture. The definitions are concrete and measurable from evaluation data that already exists.

Evaluation frequency (maps to deployment frequency): how often do you verify your posture? A team that evaluates weekly has a 7-day window where misconfigurations can accumulate undetected. A team that evaluates on every commit has minutes. The metric isn't "how often do you scan". It's how often you produce a timestamped, deterministic evaluation result that you can compare to the previous one. Frequency determines your maximum detection latency.

Time to remediate (maps to lead time for changes): the duration from when a finding first appears to when it disappears from the evaluation results. A finding that appears on Monday and disappears by Wednesday has a 2-day remediation time. A finding that appears in January and is still present in June has a 5-month remediation time. The metric tells you how long risk persists after detection. Unlike finding counts, it penalizes findings that sit unresolved rather than rewarding findings that get detected.

Mean time to remediate (maps to MTTR): the average remediation time across all findings in a period. This is the posture equivalent of "how fast do we recover?" A team with a 3-day MTTR closes findings before they compound. A team with a 90-day MTTR is accumulating open findings faster than it resolves them. The trend matters more than the absolute number — MTTR going up means the team is falling behind.

New findings per evaluation (maps to change failure rate): of all the changes that happened between evaluations, how many introduced new security findings? If every Terraform apply produces three new findings, the development process is introducing risk at a steady rate. If the number is trending down, the team's practices (code review, pre-deployment checks, policy-as-code) are catching problems earlier. This is the metric that connects delivery practices to security outcomes.

Recurrence rate (maps to rework rate): findings that were remediated and then reappeared. A public bucket gets fixed. A later change re-exposes it. The finding returns. That's security rework — unplanned effort to re-fix a problem previously considered complete.

This is the most important of the five. Thoughtworks added rework rate to DORA specifically because it catches the "considered complete but wasn't" pattern. In delivery, that's a feature that ships, breaks, and needs patching. In security, it's a finding that's remediated, silently reintroduced, and caught only when the next evaluation runs or, worse, when an attacker finds it first.

A rising recurrence rate is the early warning signal that remediation isn't sticking. Either the fixes are superficial (the bucket was made private but the Terraform code still declares it public, so the next apply reverts it), or the process has no feedback loop (the developer who introduced the finding wasn't told about it, so they repeat the pattern).

Computing the metrics

The data is already in the evaluation results. Every finding has a first-seen timestamp and a last-seen timestamp. Every evaluation run has a capture time. Two consecutive evaluation results contain everything needed:

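package main

import (
    "fmt"
    "time"

    "github.com/sufield/stave/pkg/stave"
)

func main() {
    previous := stave.LoadResult("findings-2026-05-28.json")
    current := stave.LoadResult("findings-2026-06-01.json")

    diff := stave.Diff(previous, current)

    // The five metrics, computed from the diff:

    // 1. Evaluation frequency: days between the two captures
    daysBetween := current.Run.CapturedAt.Sub(previous.Run.CapturedAt).Hours() / 24

    // 2-3. Remediation time (per finding and mean)
    var totalRemediation time.Duration
    for _, f := range diff.Removed {
        remediationTime := f.LastSeen.Sub(f.FirstUnsafe)
        totalRemediation += remediationTime
    }
    var meanRemediation time.Duration
    if n := len(diff.Removed); n > 0 {
        meanRemediation = totalRemediation / time.Duration(n)
    }

    // 4. New findings per evaluation (change failure rate)
    // The result is NaN when the snapshot has no assets.
    newFindings := len(diff.Added)
    totalAssets := current.Summary.TotalAssets
    changeFailureRate := float64(newFindings) / float64(totalAssets)

    // 5. Recurrence rate (NaN when there are no current findings)
    recurrenceRate := float64(len(diff.Recurred)) / float64(len(current.Findings))

    // Go rejects unused variables, so report the results.
    fmt.Println(daysBetween, meanRemediation, changeFailureRate, recurrenceRate)
}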

No dashboard or CLI command. A Go program the customer writes, runs at whatever frequency they choose, and sends wherever they want — Slack, a retrospective, a CI gate, a spreadsheet, or nowhere at all.

The library exposes the data: LoadResult, Diff, typed structs with timestamps. The customer decides what the numbers mean. A recurrence rate of 5% might be acceptable for one team and alarming for another. That's their judgment, not the tool's.

Why a library, not a dashboard

Thoughtworks' guidance on DORA metrics: "We recommend using these metrics for team reflection and learning rather than just building complex dashboards. Simple mechanisms, such as check-ins during retrospectives, are often more effective than overly detailed tracking tools."

A CLI command with flags and formatters is a dashboard in disguise. It embeds opinions about what to show, how to format it, and what thresholds matter. The customer gets options but not ownership.

A library function that returns typed data has no opinion. The customer writes a 20-line program that computes the metric they care about, at the frequency they choose, with the thresholds they set. If they want a Slack message when recurrence rate exceeds 10%, they write that. If they want a weekly email with remediation time trends, they write that. If they want a CI gate that fails the build when new findings exceed zero, they write that.
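As an illustration of how small such a program can be, here is a sketch of the CI gate and the recurrence threshold in Go. The Metrics and Thresholds structs and the Gate function are hypothetical names, not part of any library. The numbers are assumed to have been computed already.

```go
package main

import "fmt"

// Metrics is a hypothetical struct holding numbers the team already computed.
type Metrics struct {
	NewFindings    int
	RecurrenceRate float64 // fraction between 0.0 and 1.0
}

// Thresholds are the team's own choices, not defaults from any tool.
type Thresholds struct {
	MaxNewFindings    int
	MaxRecurrenceRate float64
}

// Gate returns the reasons the build should fail. An empty result means pass.
func Gate(m Metrics, t Thresholds) []string {
	var reasons []string
	if m.NewFindings > t.MaxNewFindings {
		reasons = append(reasons,
			fmt.Sprintf("new findings %d exceed limit %d", m.NewFindings, t.MaxNewFindings))
	}
	if m.RecurrenceRate > t.MaxRecurrenceRate {
		reasons = append(reasons,
			fmt.Sprintf("recurrence rate %.0f%% exceeds limit %.0f%%",
				m.RecurrenceRate*100, t.MaxRecurrenceRate*100))
	}
	return reasons
}

func main() {
	m := Metrics{NewFindings: 2, RecurrenceRate: 0.05}
	t := Thresholds{MaxNewFindings: 0, MaxRecurrenceRate: 0.10}

	reasons := Gate(m, t)
	if len(reasons) == 0 {
		fmt.Println("gate: pass")
		return
	}
	// A real CI step would exit non-zero here; the sketch only prints.
	for _, r := range reasons {
		fmt.Println("gate: fail:", r)
	}
}
```

Swapping the print for a Slack webhook call or an email is the customer's decision. The thresholds live in their code, not in the tool.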

The library provides facts. The customer provides interpretation. Same architecture as the evaluator itself and for the same reason: the tool that embeds its own interpretation is the tool the customer outgrows first.

What each metric tells a security team

Used together, the five metrics tell a story that no individual metric captures:

"We're scanning more but not remediating faster." Evaluation frequency is up, remediation time is flat or rising. The team is detecting more but closing at the same rate. The bottleneck is downstream of detection.

"Remediation isn't sticking." Recurrence rate is rising. Findings get fixed and come back. The fixes are superficial, or the development process reintroduces the same patterns.

"New risk is outpacing remediation." New findings per evaluation exceeds remediated findings per evaluation. The backlog is growing. The team is falling behind.

"We're getting faster at fixing but slower at preventing." MTTR is improving, but change failure rate is rising. The team is getting better at response and worse at prevention. The investment should shift from faster remediation to fewer introductions.

"Posture is improving." All five trending in the right direction — evaluating more frequently, remediating faster, introducing fewer new findings, and seeing fewer recurrences. This is the signal that practices are working, not just tools.

No single metric tells the story. A team with zero new findings but a 30% recurrence rate is on a treadmill. A team with high new findings but a 2-day MTTR and zero recurrences is catching and fixing fast. The five together are the leading indicator of posture health. The thing security teams have been missing.
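The stories above are simple comparisons between two reporting periods. A rough sketch in Go, where the Period struct, its field names, and the exact comparison rules are illustrative assumptions and new findings per evaluation stands in for change failure rate:

```go
package main

import "fmt"

// Period holds the posture metrics for one reporting period.
type Period struct {
	EvalsPerWeek      float64
	MTTRDays          float64
	NewPerEval        float64
	RemediatedPerEval float64
	RecurrenceRate    float64
}

// Diagnose compares two periods and returns the matching stories.
func Diagnose(prev, cur Period) []string {
	var out []string
	if cur.EvalsPerWeek > prev.EvalsPerWeek && cur.MTTRDays >= prev.MTTRDays {
		out = append(out, "scanning more but not remediating faster")
	}
	if cur.RecurrenceRate > prev.RecurrenceRate {
		out = append(out, "remediation isn't sticking")
	}
	if cur.NewPerEval > cur.RemediatedPerEval {
		out = append(out, "new risk is outpacing remediation")
	}
	if cur.MTTRDays < prev.MTTRDays && cur.NewPerEval > prev.NewPerEval {
		out = append(out, "faster at fixing but slower at preventing")
	}
	// Only report improvement when no warning fired and the trends agree.
	if len(out) == 0 &&
		cur.EvalsPerWeek >= prev.EvalsPerWeek &&
		cur.MTTRDays < prev.MTTRDays &&
		cur.NewPerEval < prev.NewPerEval &&
		cur.RecurrenceRate <= prev.RecurrenceRate {
		out = append(out, "posture is improving")
	}
	return out
}

func main() {
	prev := Period{EvalsPerWeek: 1, MTTRDays: 12, NewPerEval: 3, RemediatedPerEval: 3, RecurrenceRate: 0.10}
	cur := Period{EvalsPerWeek: 7, MTTRDays: 8, NewPerEval: 5, RemediatedPerEval: 4, RecurrenceRate: 0.10}
	for _, s := range Diagnose(prev, cur) {
		fmt.Println(s)
	}
}
```

With these example numbers the team evaluates more often and fixes faster, yet two warnings still fire because new findings rose and exceed remediations. That is the kind of combined reading a single metric would hide.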

The AI-assisted development connection

Thoughtworks flags rework rate specifically in the context of AI-assisted development: "If lead times don't decrease and deployment frequency doesn't increase, faster code generation doesn't translate into better outcomes. Conversely, degradation in stability metrics particularly rework rate provides an early warning sign of blind spots, technical debt and the risks of unchecked AI-assisted development."

The same logic applies to AI-generated infrastructure. A Terraform module generated by an AI agent ships faster. If it introduces a security finding that gets remediated and then reappears when the module is reused elsewhere, that's AI-assisted security rework. The rework rate catches it. The finding count doesn't. It just goes up by one each time.

Measuring generated infrastructure with posture DORA metrics answers the question that finding counts can't: is the AI making our cloud safer, or is it making our cloud bigger and equally vulnerable?

Action Items

1. Produce two evaluation results. Run your security evaluation tool today and again in a week. Save both outputs as JSON. You now have the raw data for all five metrics.

2. Diff them. Count: how many findings are new (added), how many were remediated (removed), how many came back (recurred), how many persisted (unchanged). Four numbers. That's the posture health summary for the week.

3. Compute one metric. Pick the one that matters most to your team right now. If you're drowning in findings, compute MTTR. It tells you whether you're closing faster or slower. If you recently adopted AI-assisted infrastructure, compute recurrence rate. It tells you whether fixes are sticking. If you're trying to justify investment in prevention, compute change failure rate. It tells you how many changes introduce new risk.

4. Bring it to a retrospective. One number. One trend. "Our remediation time went from 12 days to 8 days this month" is a more useful conversation starter than "we have 247 findings." The number becomes a tool for reflection, not a dashboard to stare at.
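Step 2 needs nothing more than set comparison. A minimal sketch in Go, assuming each finding has a stable string ID and that you keep a list of findings resolved in earlier runs. Classify, Summary, and the example IDs are hypothetical, and counting a recurrence separately from a new finding is a choice made here for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Summary holds the four weekly lists of finding IDs.
type Summary struct {
	Added, Removed, Recurred, Unchanged []string
}

func toSet(ids []string) map[string]bool {
	s := make(map[string]bool, len(ids))
	for _, id := range ids {
		s[id] = true
	}
	return s
}

// Classify compares two runs. everResolved lists findings remediated in some
// earlier run, which is what lets a returning finding count as a recurrence.
func Classify(everResolved, previous, current []string) Summary {
	resolved, prev, cur := toSet(everResolved), toSet(previous), toSet(current)
	var s Summary
	for id := range cur {
		switch {
		case prev[id]:
			s.Unchanged = append(s.Unchanged, id)
		case resolved[id]:
			s.Recurred = append(s.Recurred, id)
		default:
			s.Added = append(s.Added, id)
		}
	}
	for id := range prev {
		if !cur[id] {
			s.Removed = append(s.Removed, id)
		}
	}
	// Map iteration order is random, so sort each list for stable output.
	for _, l := range [][]string{s.Added, s.Removed, s.Recurred, s.Unchanged} {
		sort.Strings(l)
	}
	return s
}

func main() {
	everResolved := []string{"public-bucket"}
	previous := []string{"open-sg", "stale-key"}
	current := []string{"open-sg", "public-bucket", "wildcard-role"}

	s := Classify(everResolved, previous, current)
	fmt.Printf("added=%d removed=%d recurred=%d unchanged=%d\n",
		len(s.Added), len(s.Removed), len(s.Recurred), len(s.Unchanged))
	// Prints: added=1 removed=1 recurred=1 unchanged=1
}
```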

The delivery side of the industry figured this out years ago. The security side is still counting findings and calling it measurement. The same five metrics work. The data already exists. The question is whether you compute it.

The library API for loading results and computing diffs is in Stave at pkg/stave/. Tutorial 12 in the stave-guide covers building custom metrics programs.

Context Engineering Optimizes the Input. Nobody's Checking the Output.

Bala Paranj — Sat, 11 Jul 2026 11:15:41 +0000

The term context engineering has moved from optimization tactic to architectural discipline. Thoughtworks put it in their Technology Radar Vol. 34 (April 2026) under Adopt. The argument: as agents tackle complex tasks, dumping raw data into large context windows leads to degraded reasoning. The fix: treat the context window as a design surface, engineer what goes in.

The techniques work. Progressive context disclosure starts with a lightweight index and pulls in only what's needed. Context graphs model institutional reasoning as structured, queryable data. Dynamic retrieval selects tools and loads only necessary servers. Stateful compression summarizes intermediate outputs to manage working memory in long-running workflows.

All of these optimize the input. None of them verify the output.

The asymmetry

Every engineering discipline has two pipelines: one that prepares the input and one that verifies the output.

Manufacturing has incoming material inspection (input quality) AND outgoing quality control (output verification). A factory that perfects its raw material sourcing but has no inspection on the finished product still ships defective goods — just goods made from better materials.

Compilers have parsing and type-checking (input validation) AND the test suite and formal verification (output verification). A compiler that accepts well-formed source doesn't guarantee correct machine code. The type system catches structural errors. The test suite catches behavioral errors. Neither replaces the other.

AI-assisted development has context engineering (input optimization) AND... nothing. The output goes to code review, where a human reads it and forms an opinion. There is no mechanical verification of the output against declared properties. The entire industry investment is on the input side.

Input pipeline (where the industry invests):
  RAG → context graphs → progressive disclosure →
  dynamic retrieval → stateful compression →
  optimized context window → LLM

Output pipeline (where the industry doesn't):
  LLM → generated code → ???

The ??? is a human reviewer. Sometimes. If they have time. If they understand the code. If they know the properties that should hold. If they notice the subtle violation buried in plausible-looking output that was generated from excellent context.

Better input doesn't guarantee correct output

Four structural reasons.

The LLM can ignore the context. Liu et al. (2023) documented this as the "lost in the middle" problem: LLM performance degrades when relevant information sits in the middle of a long context window. Longpre et al. (2021) showed that when provided context conflicts with training data, the model frequently defaults to its training. Context engineering optimizes what's provided. It cannot guarantee what's used. The model makes a probabilistic selection from everything it sees — training weights and context combined. No amount of context engineering gives you a guarantee that a specific piece of context will influence a specific output.

Compression is lossy. Stateful compression (summarizing intermediate outputs to manage context length) is performed by another LLM: a non-deterministic system deciding what information to discard from its own working memory. The compression may drop the detail that matters for the next step. You can't know which detail that is in advance, because you can't predict what the downstream task will need. The compression optimizes token count. It doesn't optimize for preserving the information the next step requires, because that requires knowing the future.

Context graphs model knowledge, not contracts. A context graph that encodes "policy X has exception Y with precedent Z" tells the LLM about the rules. It doesn't check whether the LLM followed them. The graph is a reference document — structured, queryable, better than unstructured prose. But a reference document is not a verification contract. The LLM can read the graph, generate output that contradicts it, and nobody knows unless a human catches it during review.

Progressive disclosure assumes the agent knows what it needs. The agent starts with a lightweight index and pulls in context it determines is relevant. That determination is made by the same non-deterministic system that will use the context to generate output. If the agent's judgment about what's relevant is wrong, and it doesn't pull in the security policy, the architectural constraint, or the edge-case documentation, the missing context produces a missing property in the output. The system optimized for signal-to-noise ratio. The optimization was performed by the system that has the noise problem.

Output verification pipeline

A verification pipeline checks the output against declared properties — mechanically, deterministically, on every change. The properties are contracts.

Context (what the input pipeline provides):
  "Here's the API documentation for the payment service."
  "Here's the coding convention for error handling."
  "Here's the schema for the database tables."

Properties (what the output pipeline checks):
  "No API endpoint returns PII without authentication."
  "No database query uses string concatenation for parameters."
  "No function calls an external service without a timeout."

The context tells the LLM what to build. The properties verify what was built. The context is probabilistic: the LLM may or may not use it. The properties are deterministic: the output either satisfies them or it doesn't.
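As a hedged illustration of how mechanical such a check can be, the sketch below tests the third property ("no function calls an external service without a timeout") with Python's standard `ast` module. It assumes, purely for illustration, that external calls are made through `requests.<method>(...)`; the function name `calls_without_timeout`, the `HTTP_METHODS` set, and the sample source are hypothetical.

```python
import ast

# Assumption for this sketch: external calls look like requests.<method>(...).
HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "request"}

def calls_without_timeout(source: str) -> list[int]:
    """Return line numbers of requests.<method>(...) calls with no timeout keyword."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        is_requests_call = (
            isinstance(func, ast.Attribute)
            and isinstance(func.value, ast.Name)
            and func.value.id == "requests"
            and func.attr in HTTP_METHODS
        )
        if not is_requests_call:
            continue
        if not any(kw.arg == "timeout" for kw in node.keywords):
            violations.append(node.lineno)
    return sorted(violations)

# Hypothetical generated code to verify.
generated = '''
import requests

def charge(payload):
    return requests.post("https://payments.example/charge", json=payload)

def status(order_id):
    return requests.get(f"https://payments.example/orders/{order_id}", timeout=5)
'''

print(calls_without_timeout(generated))  # [5]
```

The check does not care what context the model was given. It reads the output and returns the same answer every time.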

Three levels of output verification, from most accessible to most rigorous:

Property-based testing. One afternoon to adopt. Instead of testing five hand-written examples, test one property across 10,000 randomly generated inputs. "For all valid inputs, the output amount is never negative." QuickCheck, Hypothesis, and similar tools generate the inputs. The property is the contract. The test checks every generated case against it.

Contract testing. The API specification (OpenAPI, JSON Schema, protobuf) is the contract. Every generated endpoint is checked against the spec — fields present, types correct, status codes valid. The spec exists in most codebases already. The enforcement is a CI check that fails the build when the generated code deviates from the declared contract.

Formal verification. The property is proved for all possible inputs, not just tested against a sample. "No IAM principal in this account can reach administrator access through any permission path." The proof is mathematical. The verification is exhaustive. Tools like Z3, TLA+, and Alloy provide the solver. The property is the specification. The proof is the evidence.

Each level catches things the previous level misses. Property-based testing catches edge cases example-based tests miss. Contract testing catches interface violations property tests don't cover. Formal verification catches mathematical properties no test can exhaust. All three verify the output. None of them depend on context quality.
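The first level can be sketched with nothing but the standard library. This is a hand-rolled loop, not QuickCheck or Hypothesis (which add smarter input generation and shrinking of failing cases), and `apply_discount` is a hypothetical function under test.

```python
import random

def apply_discount(amount_cents: int, percent: int) -> int:
    """Hypothetical function under test: discount an amount held in integer cents."""
    return amount_cents - (amount_cents * percent) // 100

def check_never_negative(trials: int = 10_000, seed: int = 0) -> None:
    """Property: for all valid inputs, the output amount is never negative."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    for _ in range(trials):
        amount = rng.randint(0, 10**9)
        percent = rng.randint(0, 100)
        result = apply_discount(amount, percent)
        assert result >= 0, f"violated for amount={amount}, percent={percent}"

check_never_negative()
```

The assertion is the contract. The loop only supplies inputs, so swapping in a real property-testing library changes how cases are generated, not what is being claimed.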

The industry's blind spot

Better context produces better output on average. The blind spot is treating it as sufficient: as the whole architectural solution to AI reliability rather than one half of it.

The Thoughtworks framing: "Treating AI context as a static text box is a fast track to hallucinations." True. The complete statement: "Treating AI output as unverified is a fast track to confident-looking failures, regardless of how well you engineered the context."

Context engineering makes the failures rarer. Output verification catches the failures that remain. The failures that survive good context are the subtle ones: the context was excellent, the output looks correct, the review approves, and the violation hides in a composition or an edge case or a negative property that wasn't in any context graph.

These are the failures that cause production incidents, because the output was never checked against a contract.

The asymmetry is solvable

The input pipeline took years to mature, from raw prompting to RAG to context graphs to progressive disclosure. Each step was an engineering improvement that made the input better. The output pipeline is at step zero for most teams: no contracts, no mechanical verification, just human review.

The tools exist. Property-based testing is decades old. Contract testing is standard in API development. Formal verification is used at AWS, Microsoft, and Intel for critical systems. The gap is the recognition that input optimization and output verification are independent concerns, both necessary, neither sufficient.

A team that engineers its context pipeline and adds one property check per sprint (one mechanical verification of one declared property on every change) is building both pipelines. The context pipeline reduces the frequency of failures. The verification pipeline catches the failures that survive. The combination provides reliability.

Context engineering is ingredient sourcing. Output verification is quality inspection. The restaurant that does both ships fewer defective plates than the one that perfects its sourcing and hopes the chef never makes a mistake.

References

Liu, N.F., et al. (2023). "Lost in the Middle: How Language Models Use Long Contexts." Transactions of the ACL.
Longpre, S., et al. (2021). "Entity-Based Knowledge Conflicts in Question Answering." EMNLP 2021.
Thoughtworks (2026). "Technology Radar, Vol. 34." Context Engineering under Adopt.

AWS Just Added OAuth to the MCP Server. It Silently Changed the Meaning of Your Existing IAM Policies.

Bala Paranj — Fri, 10 Jul 2026 11:30:11 +0000

On July 9, 2026, AWS introduced OAuth support for the AWS MCP Server. The announcement describes a new action namespace (signin:), new condition keys, and a new resource type. It shipped with a managed policy (AWSMCPSignInOAuthAccessPolicy) ready to attach.

The mechanism is straightforward: an agent authenticates to the AWS Sign-In token endpoint using SigV4 credentials (the same credentials an EC2 instance profile or Lambda execution role already has) and receives a short-lived OAuth access token in exchange. The agent then uses that token to access the AWS MCP Server.

The OAuth flow is well-designed. But it changes the IAM surface in ways that deserve immediate attention, because every IAM policy with Action: "*" or Action: "signin:*" now implicitly grants OAuth agent authorization.

The New Actions

Two actions control the OAuth flow:

signin:AuthorizeOAuth2Access — interactive authorization code flow (browser-based)
signin:CreateOAuth2Token — token issuance (authorization code exchange, refresh, or client credentials)

Two more support token lifecycle management:

signin:IntrospectOAuth2Token — check token status
signin:RevokeOAuth2Token — revoke a token

All four target a new resource type:

arn:aws:signin:*:*:service-principal/aws-mcp.amazonaws.com

The Condition Keys That Matter

AWS introduced two OAuth-specific condition keys:

signin:OAuthRedirectUri — where the authorization response is delivered
signin:OAuthGrantType — which OAuth flows are permitted

The blog post includes a restrictive policy example:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "signin:AuthorizeOAuth2Access",
      "signin:CreateOAuth2Token"
    ],
    "Resource": "arn:aws:signin:*:*:service-principal/aws-mcp.amazonaws.com",
    "Condition": {
      "StringLike": {
        "signin:OAuthRedirectUri": "http://localhost:*"
      },
      "StringEquals": {
        "signin:OAuthGrantType": [
          "authorization_code",
          "refresh_token"
        ]
      }
    }
  }]
}

This is the policy AWS recommends for local development. Notice what it restricts: redirect URI to localhost, grant types to interactive flows only.

What the Managed Policy Does

AWSMCPSignInOAuthAccessPolicy grants the two primary actions. AWS provides it so administrators can attach it without writing custom policy. The question is whether the identities it gets attached to are the right ones.

An EC2 instance role with AdministratorAccess that also has this policy attached can now issue OAuth tokens that carry its full permission scope. The role was already powerful. The OAuth token makes that power delegatable to an agent process that may not have the same operational controls the role was designed for.

The Implicit Grant Problem

Here is the IAM surface change that most organizations will miss.

Before July 9, a policy with Action: "*" granted every AWS action across every service. After July 9, that same policy also grants signin:AuthorizeOAuth2Access and signin:CreateOAuth2Token. The policy did not change. The set of actions it covers changed.

The same applies to Action: "signin:*". Any policy that used the signin: namespace for console-related actions now also grants OAuth agent authorization.

These are not misconfigured policies. They are policies that were correct last week and have a different meaning this week. The identity that has them may not know it can now authorize agents to act on its behalf.
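A minimal sketch of how to spot these policies, assuming the policy document is already loaded as a Python dict. It only looks at `Allow` statements with an `Action` element and ignores `NotAction`, explicit denies, conditions, permission boundaries, and SCPs. It uses `fnmatch` as a stand-in for IAM wildcard matching, which is close enough for action names but not an exact reimplementation. The function name `implicit_oauth_grants` is hypothetical.

```python
from fnmatch import fnmatchcase

OAUTH_ACTIONS = ("signin:AuthorizeOAuth2Access", "signin:CreateOAuth2Token")

def _as_list(value):
    """IAM allows a single string or a list for Action; normalize to a list."""
    return [value] if isinstance(value, str) else list(value)

def implicit_oauth_grants(policy: dict) -> set[str]:
    """Return OAuth actions allowed through a wildcard rather than by exact name."""
    granted = set()
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]
    for stmt in statements:
        if stmt.get("Effect") != "Allow" or "Action" not in stmt:
            continue
        for pattern in _as_list(stmt["Action"]):
            if "*" not in pattern and "?" not in pattern:
                continue  # an explicit grant, not an implicit one
            for action in OAUTH_ACTIONS:
                # IAM action names match case-insensitively.
                if fnmatchcase(action.lower(), pattern.lower()):
                    granted.add(action)
    return granted

admin = {"Version": "2012-10-17",
         "Statement": [{"Effect": "Allow", "Action": "*", "Resource": "*"}]}
print(sorted(implicit_oauth_grants(admin)))
# ['signin:AuthorizeOAuth2Access', 'signin:CreateOAuth2Token']
```

Running this over every policy attached to an identity gives a first-pass list of who acquired OAuth capability without anyone editing a policy.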

What Defenders Should Check

Which identities have OAuth capability? Audit for signin:AuthorizeOAuth2Access and signin:CreateOAuth2Token, including grants via signin:* and *. The identities that have these permissions may not be the ones you want authorizing agents.

Are redirect URIs constrained? Without a signin:OAuthRedirectUri condition, OAuth tokens can be delivered to arbitrary endpoints. This is a token theft vector regardless of which grant type is used.

Is the resource scoped? A policy with Resource: "*" permits authorization against any service principal in the signin: namespace — current and future. Scoping to arn:aws:signin:*:*:service-principal/aws-mcp.amazonaws.com limits authorization to the MCP Server specifically.

Can anyone revoke tokens? If OAuth tokens are being issued but no identity in the account has signin:RevokeOAuth2Token, compromised tokens cannot be revoked. The same applies to signin:IntrospectOAuth2Token — without it, defenders cannot inspect token status during incident response.

What's the blast radius of delegation? The real risk is not which grant type is used. client_credentials is the intended grant for machine-to-machine flows — an EC2 instance converting SigV4 to OAuth is the documented use case. The risk is when the identity that can delegate has permissions far beyond what the agent task requires. An admin role issuing OAuth tokens creates a path where an agent operates with full admin scope.
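Two of these checks (redirect URI constraints and resource scoping) can be sketched as a review of a single policy statement. This is a simplified illustration: it matches explicit action names only, does not expand wildcards, and does not evaluate whether the condition value itself is restrictive. The function name `review_oauth_statement` is hypothetical.

```python
def _as_list(value):
    """Normalize a string-or-list policy element to a list."""
    return [value] if isinstance(value, str) else list(value)

def review_oauth_statement(stmt: dict) -> list[str]:
    """Return concerns for one Allow statement that names OAuth actions explicitly."""
    concerns = []
    actions = {a.lower() for a in _as_list(stmt.get("Action", []))}
    # Collect condition keys across all operators (StringEquals, StringLike, ...).
    condition_keys = {
        key.lower()
        for operator_block in stmt.get("Condition", {}).values()
        for key in operator_block
    }
    if ("signin:authorizeoauth2access" in actions
            and "signin:oauthredirecturi" not in condition_keys):
        concerns.append("redirect URI is unconstrained")
    if "*" in _as_list(stmt.get("Resource", [])):
        concerns.append("resource is not scoped to the MCP service principal")
    return concerns

loose = {
    "Effect": "Allow",
    "Action": ["signin:AuthorizeOAuth2Access", "signin:CreateOAuth2Token"],
    "Resource": "*",
}
print(review_oauth_statement(loose))
# ['redirect URI is unconstrained', 'resource is not scoped to the MCP service principal']
```

The restrictive local-development statement shown earlier returns an empty list under the same check.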

The CloudTrail Angle

OAuth authorization events (AuthorizeOAuth2Access, CreateOAuth2Token, and token revocations) are recorded in CloudTrail. This is good for detection. It also means that if your CloudTrail destination bucket is compromised (the bucket hijacking vector), the OAuth audit trail goes with it.

Organizations that already monitor CloudTrail for IAM credential use should extend their detection to cover the signin: event source. Token issuance patterns — which identities are issuing tokens, how frequently, from which source IPs — are the new signal to baseline.
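A baseline can start as a simple count. The sketch below assumes CloudTrail records are already parsed into dicts and uses the standard record fields `eventSource`, `eventName`, `userIdentity.arn`, and `sourceIPAddress`. Two things are assumptions, not confirmed by the announcement: that the event source value is `signin.amazonaws.com`, and that the event name matches the action name `CreateOAuth2Token`. Verify both against real events before relying on this.

```python
from collections import Counter

# Assumed values; confirm against actual CloudTrail records.
ASSUMED_EVENT_SOURCE = "signin.amazonaws.com"
ASSUMED_EVENT_NAME = "CreateOAuth2Token"

def token_issuance_baseline(events: list[dict]) -> Counter:
    """Count token-issuance events per (identity ARN, source IP)."""
    counts = Counter()
    for event in events:
        if event.get("eventSource") != ASSUMED_EVENT_SOURCE:
            continue
        if event.get("eventName") != ASSUMED_EVENT_NAME:
            continue
        arn = event.get("userIdentity", {}).get("arn", "unknown")
        ip = event.get("sourceIPAddress", "unknown")
        counts[(arn, ip)] += 1
    return counts

# Hypothetical sample records.
sample = [
    {"eventSource": "signin.amazonaws.com", "eventName": "CreateOAuth2Token",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:role/agent"},
     "sourceIPAddress": "10.0.0.5"},
    {"eventSource": "signin.amazonaws.com", "eventName": "CreateOAuth2Token",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:role/agent"},
     "sourceIPAddress": "10.0.0.5"},
    {"eventSource": "s3.amazonaws.com", "eventName": "GetObject",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:role/agent"},
     "sourceIPAddress": "10.0.0.5"},
]
print(token_issuance_baseline(sample))
```

An identity or source IP that appears in this count for the first time is the kind of deviation worth alerting on.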

What This Changes About IAM Reviews

Every IAM review process that inventories dangerous actions needs to add the signin: namespace. The standard lists (iam:*, s3:*, kms:*, sts:AssumeRole) now have a peer: signin:CreateOAuth2Token is a credential-issuance action with the same architectural significance as sts:AssumeRole.

The difference is that sts:AssumeRole has been in every security team's mental model for years. signin:CreateOAuth2Token was introduced this week. The policies that grant it already exist — they just didn't grant it until now.

Automated Detection with Stave

Stave is an open-source CLI that evaluates cloud configuration snapshots against a catalog of security controls. It ships with eight controls purpose-built for the OAuth MCP Server surface, covering each of the checks described above:

Wildcard and implicit grants — Detects identities that acquired OAuth capability through Action: "*" or Action: "signin:*" rather than explicit grants. These are the policies that changed meaning on July 9.

Redirect URI constraints — Flags signin:AuthorizeOAuth2Access grants that lack a signin:OAuthRedirectUri condition, the token-delivery vector that matters regardless of grant type.

Resource scoping — Identifies OAuth grants with Resource: "*" instead of the specific MCP service principal ARN.

Delegation blast radius — Surfaces identities that combine signin:CreateOAuth2Token with admin or elevated permissions — the over-privileged delegation path.

Token lifecycle gaps — Flags accounts where tokens are being issued but no identity has signin:RevokeOAuth2Token or signin:IntrospectOAuth2Token, leaving defenders without revocation or inspection capability during incidents.

Managed policy tracking — Inventories AWSMCPSignInOAuthAccessPolicy attachments so teams can review which identities received the AWS-provided OAuth grant.

OAuth inventory — Surfaces every identity with any signin: OAuth action, giving defenders a complete picture of which identities participate in OAuth delegation.

Stave evaluates these controls against point-in-time snapshots of IAM configuration, so they can run in CI, on a schedule, or as a one-off audit. The controls are declarative YAML. Teams can inspect the detection logic, adjust thresholds, or extend the catalog for their own policies.

You Cannot Compose Safety From Individual Findings: The Formal Result the Cloud Security Industry Ignores

Bala Paranj — Thu, 09 Jul 2026 11:17:12 +0000

Every cloud security scanner on the market, including Prowler, ScoutSuite, Checkov, and AWS Config Rules, operates the same way. It evaluates individual resources against individual rules. S3 bucket: is public access blocked? IAM role: is the trust policy scoped? Security group: are ports restricted?

Each finding is correct. Each resource passes or fails its check. And the security team reports: 847 findings, 12 critical, here's your posture score.

The report is formally meaningless as a system-level safety claim.

This is a theoretical impossibility. You cannot derive system-level safety from component-level findings alone. The result has been proven, reproven, and formalized across four decades of computer science. The cloud security industry has collectively ignored it.

This article traces the formal lineage, shows why it matters practically, and describes what a verification framework that addresses system-level safety would require.

The result

The core claim: individual component safety proofs are insufficient for system-level safety. You need explicit specifications of the interfaces between components — what each component assumes about its environment, and what it guarantees to its environment. Without those interface specifications, individual safety proofs tell you nothing about the composed system.

This has been formally established across multiple independent lines of work.

The lineage

Owicki and Gries (1976) got there first. Working on proofs of parallel programs, they showed that a process with a valid individual correctness proof could behave incorrectly in composition because another process modified shared state that the first process's proof assumed was stable. They called this the non-interference problem.

The practical translation: you can prove that process A is correct in isolation, prove that process B is correct in isolation, compose them, and get an incorrect system. The individual proofs are valid. The composition is broken. The proofs failed because they didn't specify what each process assumed about its environment.

Jones (1983) provided the constructive solution: rely-guarantee reasoning. Each component specifies two things:

A rely condition: what the component assumes its environment will or will not do
A guarantee condition: what the component promises to its environment

Composition is valid if and only if each component's guarantee satisfies every other component's rely condition. If component A relies on shared state S being stable, and component B's guarantee includes modifying S, the composition is invalid — regardless of whether A and B individually pass their proofs.
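The compatibility condition can be sketched in a few lines. This is a deliberately reduced model, not Jones's full formalism: a rely condition is just the set of shared variables a component assumes stay stable, and a guarantee is the set of shared variables it may write. The `Component` class and `interference` function are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Component:
    name: str
    # Rely: shared variables this component assumes nobody else writes.
    relies_stable: frozenset = field(default_factory=frozenset)
    # Guarantee: the only shared variables this component may write.
    may_write: frozenset = field(default_factory=frozenset)

def interference(components):
    """Return (writer, reader, variable) triples where a guarantee breaks a rely."""
    conflicts = []
    for writer in components:
        for reader in components:
            if writer is reader:
                continue
            for var in sorted(writer.may_write & reader.relies_stable):
                conflicts.append((writer.name, reader.name, var))
    return conflicts

a = Component("A", relies_stable=frozenset({"S"}))
b = Component("B", may_write=frozenset({"S"}))
print(interference([a, b]))  # [('B', 'A', 'S')]
```

Neither component is inspected for internal correctness here. The check looks only at the declared interface, which is exactly the information an individual proof does not carry.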

Jones didn't just prove that the problem exists. He proved what the solution requires: explicit interface contracts between components. Without them, composition is unverified.

Lamport (1977) formally defined the two categories of correctness properties:

Safety properties: "nothing bad ever happens" — the system never enters a forbidden state
Liveness properties: "something good eventually happens" — every request eventually gets a response, every failure eventually triggers recovery

Lamport proved that safety properties are closed under intersection. If two components each satisfy a safety property, the composed system satisfies the intersection of those properties. But here's the critical subtlety: that intersection may be weaker than the global safety property you need. The system-level property may require coordination between components that neither component's individual proof addresses. The intersection is formally valid and semantically vacuous — it proves the system is "safe" only because the assumptions rule out all real executions.

Abadi and Lamport (1993), in "Composing Specifications," formalized this precisely. They showed that composing specifications requires a compatibility condition between the assumptions and guarantees of each component. Without proving compatibility, the composed specification may be satisfiable but only in degenerate cases that don't correspond to any real system behavior.

The degenerate case has a name: vacuous correctness. A composition is vacuously safe when the assumptions of the components are mutually exclusive — component A assumes the environment provides X, component B assumes the environment provides ¬X. Both components are individually correct. The composition is correct only because no real execution satisfies both assumptions simultaneously. In cloud terms: the system is secure because the network configuration prevents anyone from reaching the S3 bucket. But the bucket is also useless, because the network configuration prevents anyone from reaching it. The scanner reports zero findings. The system satisfies no business requirement. The safety proof is real and the safety is meaningless. Vacuous correctness is the result when you verify components without checking that their assumptions are mutually satisfiable.
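A vacuity check asks whether any environment satisfies all the assumptions at once. The sketch below does this by brute force over boolean variables, which only works for toy cases; at realistic scale this is a job for a SAT or SMT solver. The function name and the lambda assumptions are illustrative.

```python
from itertools import product

def satisfying_environments(variables, assumptions):
    """Enumerate boolean environments that satisfy every component's assumption."""
    witnesses = []
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(assume(env) for assume in assumptions):
            witnesses.append(env)
    return witnesses

# Component A assumes the environment provides X; component B assumes it does not.
assume_a = lambda env: env["X"]
assume_b = lambda env: not env["X"]

print(satisfying_environments(["X"], [assume_a, assume_b]))  # []
print(satisfying_environments(["X"], [assume_a]))            # [{'X': True}]
```

An empty result means any safety claim about the composition holds vacuously: there is no execution for it to be about.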

The result, consolidated: you cannot verify system safety by verifying components individually. You must also verify that the interfaces between components (the rely-guarantee pairs, the assumption-guarantee contracts) are mutually compatible. Any verification framework that evaluates only individual components is formally incapable of making system-level safety claims.

The cloud security application

Consider two AWS resources, each individually configured according to security best practices.

Resource A: an S3 bucket. Its bucket policy grants s3:GetObject to IAM role role-X. A scanner evaluates this bucket and passes it: access is restricted to a specific role. No public access or wildcard principals. Clear finding.

Resource B: an IAM role. role-X has a trust policy allowing sts:AssumeRole from any EC2 instance in the account via an instance profile. A scanner evaluates this role and passes it: the trust policy is scoped to a specific service principal (ec2.amazonaws.com) within the account. No cross-account trust or wildcard. Clear finding.

The composition: any EC2 instance in the account, including a compromised developer workstation, an unpatched test instance, or an attacker-controlled instance launched through a separate privilege escalation, can assume role-X and read every object in the bucket.

The system is insecure. Both components passed their scans. The vulnerability exists only at the edge — the relationship between the bucket policy and the role's trust boundary.

In Jones's framework, the problem is precise: the S3 bucket's rely condition (implicit, unstated) is that role-X is assumable only by authorized, trusted principals performing authorized operations. The IAM role's guarantee (any EC2 instance can assume it) violates that rely condition. The violation is invisible to any tool that evaluates components individually because the rely condition is not a property of either component. It is a property of the interface between them.

Why node-level scanning is structurally insufficient

Better rules cannot fix this gap. It is a structural limitation of the verification framework itself.

A node-level scanner evaluates a function: f(resource) → {pass, fail}. The input is a single resource configuration. The output is a finding. The scanner's formal power is limited to properties expressible as predicates over individual resource states.

System-level safety properties are not predicates over individual resource states. They are predicates over paths through the resource graph — sequences of edges that, taken together, create an access path, a data flow, or a trust chain that violates a system invariant.

"No unauthorized principal can read objects in bucket B" is not a property of bucket B's configuration. It is a property of every path through the resource graph that terminates at bucket B. It depends on every IAM role that the bucket policy references, every trust policy on those roles, every policy attachment that grants permissions to principals that can assume those roles, and every resource policy on every intermediate resource in the chain.

A scanner that evaluates bucket B's configuration cannot verify this property. It doesn't have the information. The property is not in bucket B — it is in the graph.
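The difference between the two kinds of predicate can be shown on the bucket-and-role example. In this sketch the node-level rule, the instance names, and the `authorized` set are all illustrative assumptions; the point is only that both node checks pass while a path-level query finds the violation.

```python
from collections import deque

def node_check(config: dict) -> str:
    """Node-level predicate f(resource): fail only on a wildcard principal."""
    return "fail" if "*" in config["principals"] else "pass"

bucket_b = {"principals": ["role-X"]}            # bucket policy grants s3:GetObject
role_x = {"principals": ["ec2.amazonaws.com"]}   # trust policy on the role

# Directed edges: "source can act as, or read from, target".
edges = {
    "ec2:app-server": ["role-X"],
    "ec2:unpatched-test-instance": ["role-X"],
    "role-X": ["bucket-B"],
}

def can_reach(start: str, target: str) -> bool:
    """Path-level predicate: breadth-first search over the resource graph."""
    seen, queue = {start}, deque([start])
    while queue:
        current = queue.popleft()
        if current == target:
            return True
        for nxt in edges.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

authorized = {"ec2:app-server"}  # hypothetical system-level intent
unauthorized_readers = sorted(
    node for node in edges
    if node.startswith("ec2:") and node not in authorized
    and can_reach(node, "bucket-B")
)

print(node_check(bucket_b), node_check(role_x))  # pass pass
print(unauthorized_readers)                      # ['ec2:unpatched-test-instance']
```

`node_check` never sees `edges`, so no amount of rule tuning inside it can produce the second result.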

This is what Owicki and Gries proved in 1976 for parallel processes, what Jones formalized in 1983 for compositional verification, and what Abadi and Lamport generalized in 1993 for specification composition. The cloud security industry is rediscovering, slowly, expensively, breach by breach, a result that has been formally established for half a century.

What Capital One proved empirically

The 2019 Capital One breach was not just a code bug. It was a compound misconfiguration: four individually reasonable configurations composed into an exploitable access path:

A WAF with an SSRF vulnerability (a code bug, but the only code bug in the chain)
An EC2 instance with access to the AWS metadata service (standard default at the time)
An IAM role attached to that instance with overly broad S3 permissions (individually assessable, but assessed in isolation)
S3 buckets containing sensitive data accessible to that role (individually compliant — access was restricted to a specific role)

A node-level scanner evaluating each of these four resources individually would pass three of them. The WAF SSRF might be caught by a code-level scanner (Chris Betz's framework). But the compound misconfiguration (the path from SSRF through the metadata service through the IAM role to the S3 bucket) was invisible to any tool evaluating individual resources.

In Jones's terms: the S3 bucket relied on its granting role being assumable only through authorized code paths. The IAM role guaranteed it was assumable by any code running on the EC2 instance. The metadata service guaranteed it would dispense temporary credentials to any process on the instance. Each component's guarantee violated the next component's rely condition. No individual scan could detect this because the violation existed only at the interfaces.

Capital One's breach was an empirical proof of Owicki and Gries (1976). The industry treated it as an operational failure. It was a theoretical inevitability.

What compositional verification requires

If node-level scanning is structurally insufficient, what is sufficient?

Jones's rely-guarantee framework provides the answer, translated to cloud security:

Step 1: Define system-level safety properties. These are the invariants the system must satisfy — not properties of individual resources, but properties of the system as a whole. Examples: "No principal outside the production account can assume any role with access to customer data." "No network path exists from any public-facing resource to any database resource that does not traverse a WAF and an authentication layer." "No IAM policy grants s3:* to any role assumable by a service principal."

Step 2: Model the resource graph. Resources are nodes. Relationships (policy references, trust chains, network routes, data flows) are edges. The graph is the artifact under verification, not the individual resources.

Step 3: Encode rely-guarantee contracts for each resource. For each resource, specify: what does this resource assume about the resources it depends on (rely condition), and what does this resource promise to the resources that depend on it (guarantee condition)? For an S3 bucket that grants access to a role, the rely condition is that the role's trust policy restricts assumption to authorized principals. For the role, the guarantee condition is the actual set of principals that can assume it.

Step 4: Verify compositional compatibility. Check that every resource's guarantee satisfies every dependent resource's rely condition. Any violation is a compound misconfiguration — an edge-level vulnerability that no node-level scanner can detect.

Step 5: Verify system properties against the composed specification. Using the verified composition, check whether the system-level safety properties (Step 1) hold. This is where formal reasoning engines such as SAT solvers, SMT solvers (Z3), and Datalog engines (Soufflé) become necessary. The state space is too large for exhaustive enumeration but structured enough for formal analysis.

Formally, this step is a refinement check: verifying that the low-level configuration (the implementation — the resource graph as deployed) refines the high-level safety property (the specification — the system-level invariants defined in Step 1). Refinement means every behavior permitted by the implementation is also permitted by the specification. If the implementation allows an access path that the specification forbids, the refinement check fails, and the system is provably insecure with respect to that property. Formal verification goes beyond testing: a test checks whether a specific behavior occurs; a refinement check proves whether any forbidden behavior is possible.
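On a toy graph, the refinement check reduces to a set difference: everything the implementation permits, minus everything the specification permits. The sketch below enumerates access pairs explicitly, which is exactly what does not scale and why real systems need a solver or Datalog engine. All names and the example graph are hypothetical.

```python
def reachable_pairs(edges: dict, principals: set, resources: set) -> set:
    """All (principal, resource) pairs the deployed graph permits."""
    pairs = set()
    for principal in principals:
        stack, seen = [principal], {principal}
        while stack:
            node = stack.pop()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        pairs |= {(principal, r) for r in seen & resources}
    return pairs

def refinement_violations(implementation: set, specification: set) -> set:
    """Behaviors the implementation permits but the specification forbids."""
    return implementation - specification

# Implementation: the resource graph as deployed.
edges = {"internet": ["web"], "web": ["role-app"], "role-app": ["db", "logs"]}
# Specification: the only access pairs the system-level invariant allows.
allowed = {("web", "db"), ("web", "logs")}

deployed = reachable_pairs(edges, {"internet", "web"}, {"db", "logs"})
print(sorted(refinement_violations(deployed, allowed)))
# [('internet', 'db'), ('internet', 'logs')]
```

An empty difference means the deployed graph refines the specification for this property. A non-empty one is a concrete counterexample, not a risk score.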

This is compositional verification when applied to cloud security. It requires evaluating edges, not nodes. It requires a specification of what secure means at the system level, not just a catalog of individual resource checks. It requires reasoning engines capable of operating over the graph, not just over individual vertices.

What the industry builds instead

The industry builds node-level scanners and adds more rules.

Every time a new breach reveals a compound misconfiguration, the response is to add a rule to the scanner. After Capital One, tools added checks for IMDSv2, for overly broad IAM policies, for specific known-dangerous policy patterns. Each rule catches the specific compound path that the breach revealed. It does not catch the class of compound misconfiguration.

This is the equivalent of responding to Owicki and Gries by adding more test cases instead of adopting rely-guarantee reasoning. It patches the last breach. It does not verify the system against the next one.

The result is a growing catalog of rules (Prowler has over 300, AWS Config has hundreds more), each of which evaluates an individual resource against an individual predicate. The catalog grows. The false positive rate grows. The coverage of system-level safety properties does not grow, because the verification framework is structurally incapable of evaluating system-level properties.

The specification that is missing

The deepest problem is the absence of a specification.

Every cloud deployment is a system. Every system has safety properties it must satisfy. But almost no organization has written those properties down as formal, mechanically-checkable specifications of the system-level invariants that must hold.

Without a specification, verification is impossible. You can scan. You can discover. You can report findings. But you cannot verify, because verification requires a specification to verify against.

This is the missing specification. A formal description of what secure means for this specific system, expressed as properties over the resource graph, verifiable by formal reasoning engines.

The Owicki-Gries-Jones-Lamport lineage doesn't just prove that compositional verification is necessary. It proves what the inputs to compositional verification must be: component-level rely-guarantee contracts, a compatibility verification between those contracts, and system-level safety properties to verify against the composed specification.

The industry has none of these. It has node-level rules, individual findings, and posture scores. It has taken a formally insufficient verification framework and built an entire market around it.

Where this leads

The formal result is clear: system-level security requires system-level verification. System-level verification requires a specification of system-level properties, a model of the resource graph, and formal reasoning over the composition.

The theoretical foundations are 50 years old. The practical tools exist — SMT solvers, Datalog engines, model checkers. The artifact required for verification (the cloud configuration snapshot) is exportable from every major cloud provider as JSON.

What the industry lacks is the practice of defining system-level security properties and verifying them compositionally. Until that practice exists, every cloud security scanner on the market is formally incapable of answering the question it claims to answer: is this system secure?

Individual findings are not system security. The math proved this in 1976. It's time the industry caught up.

The Minimum Cut is the Remediation Plan

Bala Paranj — Wed, 08 Jul 2026 12:39:39 +0000

Two Questions, One Graph

Cloud security scanners answer one question: does a path exist from the internet to the crown jewels? Run a reachability algorithm (BFS, DFS, or a Datalog transitive closure) over the graph of permissions, network routes, and trust relationships. If a path exists, emit a finding. Scanners do this. It's graph traversal, and it's solved.

The harder question that almost nobody answers is: what is the smallest set of changes that eliminates all paths?

Finding paths is cheap. There might be twelve paths from the internet to a production database, through different combinations of public subnets, overpermissioned IAM roles, VPC peering connections, and misconfigured security groups. A scanner that finds all twelve produces twelve findings. The engineer reads twelve findings and has to figure out, manually, which three changes would eliminate all twelve paths simultaneously. That reasoning (which edges to cut, where cutting one edge collapses multiple paths, where the graph has structural bottlenecks) is not done by a scanner.

This is the minimum cut problem. Ford and Fulkerson solved it in 1956.

Max-Flow Min-Cut, Stripped Down

The Max-Flow Min-Cut Theorem says: in a flow network with a source and a sink, the maximum flow from source to sink equals the total capacity of the minimum cut, the lowest-capacity set of edges whose removal disconnects the source from the sink.

In a network engineering context, source is the sender, sink is the receiver, edges are links with bandwidth capacity, and the theorem tells you the bottleneck throughput. In cloud security, the mapping is different and the theorem tells you something more useful.

Source. The untrusted origin. The public internet, a compromised employee workstation, a stolen credential used from an external account — wherever the attacker starts.

Sink. The crown jewels. An S3 bucket with PII, a production RDS database, a Secrets Manager secret, a KMS key that encrypts everything else. In a real environment there are hundreds or thousands of sensitive resources. The standard technique is the super-sink: a virtual node connected to every crown jewel by an edge with infinite capacity. Computing the min-cut from source to super-sink produces the minimum set of changes that disconnects the attacker from the entire high-value asset class simultaneously, not just from one resource. This is a single computation, not one per target. Cutting the path to the super-sink solves the remediation for every sensitive resource at once. The same technique works on the source side: a super-source connected to every entry point (public IPs, external trust relationships, compromised credential origins) computes the min-cut across all attack origins simultaneously.

Edges. The connections between resources that an attacker can traverse. A security group rule permitting ingress. An IAM policy granting s3:GetObject. A VPC peering connection with an overly broad route. A trust policy allowing role assumption from an external account. An event source mapping that triggers a Lambda function. Each edge is a step in the attack path.

The graph is directed. Edge direction matters. An IAM policy granting s3:GetObject is a unidirectional edge: it lets the principal read from the bucket but says nothing about traffic flowing the other way. A security group ingress rule on port 443 is also unidirectional: it permits inbound connections but the response traffic flows on the established connection, not through a separate edge. Max-Flow Min-Cut works correctly on directed graphs, but the implementation requires careful direction modeling because some AWS constructs look bidirectional. VPC peering is the primary example: a peering connection permits traffic in both directions, but the route table entries that use it are directional (subnet A routes to subnet B's CIDR through the peering). Model VPC peering as two directed edges (one per direction), each governed by its own route table entry and security group. The peering connection is the physical link; the routes are the directed edges in the graph. Getting the direction wrong inflates the path count. The algorithm finds paths that traffic can't traverse.
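As a rough sketch of the direction modeling described above, the peering connection itself contributes no edge; each route table entry that points at it contributes one directed edge. The tuple layout of `routes` and the `peering_edges` helper are assumptions made for illustration, and the sketch ignores the security group check that a fuller model would also apply per direction.

```python
def peering_edges(peering_id, routes):
    """Return one directed edge per route table entry that uses the peering.

    routes: list of (source_subnet, destination_subnet, target) tuples, where
    target is the ID of the peering connection the route points at.
    """
    edges = []
    for src, dst, target in routes:
        if target == peering_id:
            edges.append((src, dst))
    return edges

# Assumed example data: subnet A routes to subnet B through pcx-123,
# but subnet B has no return route through the same peering.
routes = [
    ("subnet-a", "subnet-b", "pcx-123"),
    ("subnet-b", "subnet-c", "pcx-999"),  # a different peering connection
]
edges = peering_edges("pcx-123", routes)
```

Here the graph gets the edge from subnet A to subnet B and nothing in the reverse direction, so the algorithm cannot invent a path that the route tables do not support.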

Capacity. In network engineering, capacity is measured in megabits per second. In cloud security, it's almost always binary. The permission exists or it doesn't. The security group rule allows the traffic or it doesn't. The trust policy permits assumption or it doesn't. A path is open (capacity 1) or blocked (capacity 0).

The theorem says: the maximum flow of attack paths from the internet to the database equals the total capacity of the minimum cut. Since capacity is binary, this simplifies to: the number of edge-disjoint attack paths equals the minimum number of edges you need to remove to disconnect the attacker from the data.

That minimum set of edges is the remediation plan.

Why Min-Cut Beats Path Enumeration

A scanner that finds twelve paths gives you twelve findings. You read them and notice that paths 1, 4, 7, and 11 all traverse the same overpermissioned IAM role. Paths 2, 5, and 9 all pass through the same security group rule. Paths 3, 6, 8, 10, and 12 all use the same VPC peering connection. Three changes — scope the role, tighten the security group, remove the peering route — eliminate all twelve paths. But the scanner didn't tell you that. It told you twelve things are broken. You did the graph theory in your head.

The min-cut algorithm does this automatically. It takes the full graph (every resource, every permission, every network route), sets the source (internet) and sink (database), and computes the minimum set of edges whose removal disconnects them. The output isn't twelve paths. It's three edges. Those three edges are the remediation plan: the smallest set of changes that eliminates all reachability between the attacker and the target.
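Here is a minimal sketch of that computation using Edmonds-Karp (breadth-first augmenting paths). The graph, the resource names, and the `min_cut` function are illustrative assumptions, not code from any particular scanner. The example is a scaled-down version of the scenario above: six attack paths that funnel through three choke points.

```python
from collections import deque

def min_cut(capacity, source, sink):
    """Edmonds-Karp max-flow; returns (flow_value, cut_edges).

    capacity: dict mapping directed edges (u, v) to a non-negative capacity.
    """
    residual = {source: {}, sink: {}}
    for (u, v), cap in capacity.items():
        residual.setdefault(u, {})
        residual.setdefault(v, {})
        residual[u][v] = residual[u].get(v, 0) + cap
        residual[v].setdefault(u, 0)  # reverse edge, used to undo flow

    def reachable():
        """BFS over edges with spare residual capacity; returns a parent map."""
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    parent = reachable()
    while sink in parent:
        # Walk back from the sink to recover the augmenting path.
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][w] for u, w in path)
        for u, w in path:
            residual[u][w] -= bottleneck
            residual[w][u] += bottleneck
        flow += bottleneck
        parent = reachable()

    # Nodes still reachable from the source form the source side of the cut.
    source_side = set(parent)
    cut = {(u, v) for (u, v), cap in capacity.items()
           if cap > 0 and u in source_side and v not in source_side}
    return flow, cut

# Assumed example: three groups of two paths each, every edge has capacity 1.
edges = [
    ("internet", "web_a"), ("internet", "web_b"),
    ("web_a", "app_role"), ("web_b", "app_role"), ("app_role", "db"),
    ("internet", "vpn_a"), ("internet", "vpn_b"),
    ("vpn_a", "jump_host"), ("vpn_b", "jump_host"), ("jump_host", "db"),
    ("internet", "peer_a"), ("internet", "peer_b"),
    ("peer_a", "peering_route"), ("peer_b", "peering_route"),
    ("peering_route", "db"),
]
flow, cut = min_cut({edge: 1 for edge in edges}, "internet", "db")
```

On this graph `flow` is 3 and `cut` contains the three edges into `db`: from `app_role`, `jump_host`, and `peering_route`. Six paths go in, three edges come out.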

This is also why path enumeration scales badly. A complex environment might have hundreds of paths between the internet and a sensitive data store. Enumerating them all is computationally expensive and produces an overwhelming list. The min-cut is small — often three to five edges — regardless of how many paths exist. The remediation plan stays manageable even when the attack surface is enormous.

Menger's Theorem: The Binary Case

Since cloud permissions are almost always binary (the edge exists or it doesn't, with no partial capacity), the operative theorem is often Menger's rather than Ford-Fulkerson. Menger's Theorem states: the maximum number of edge-disjoint paths between two nodes equals the minimum number of edges whose removal disconnects them.

Edge-disjoint means the paths share no edges. If three edge-disjoint paths exist between the internet and the database, the minimum cut must include at least three edges — one from each disjoint path. Cutting two isn't enough because the third path survives intact.

This gives you a lower bound on the remediation effort without computing the full min-cut. Count the edge-disjoint paths and you know the minimum number of changes required. If there are five disjoint paths, five is the floor. No clever combination of three changes can eliminate all five.

For cloud security, Menger's Theorem answers: "how resilient is this attack surface?" A target reachable by one path is fragile — one fix eliminates access. A target reachable by five disjoint paths is robust (from the attacker's perspective). The defender needs at least five changes. The disjoint path count is a severity metric: higher disjoint count means more work to remediate and more ways for the attacker to maintain access.
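One cheap way to get such a lower bound is to peel off edge-disjoint paths greedily: find a path, delete its edges, repeat. The sketch below assumes a hypothetical five-edge graph and a hypothetical `greedy_disjoint_paths` helper. Note the hedge: a greedy pass is not guaranteed to find the maximum number of disjoint paths (that takes a full max-flow), but every path it does find is disjoint from the others, so its count is always a valid floor on the number of changes required.

```python
from collections import deque

def greedy_disjoint_paths(edges, source, sink):
    """Greedily collect edge-disjoint paths from source to sink.

    The number of paths returned is a lower bound on the size of the min-cut.
    """
    remaining = set(edges)
    paths = []
    while True:
        # Breadth-first search over the edges not yet used by an earlier path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for a, b in sorted(remaining):
                if a == u and b not in parent:
                    parent[b] = u
                    queue.append(b)
        if sink not in parent:
            return paths
        # Recover the path and remove its edges so later paths cannot reuse them.
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        path.reverse()
        remaining -= set(path)
        paths.append(path)

# Assumed example graph.
edges = [
    ("internet", "a"), ("internet", "b"),
    ("a", "db"), ("b", "db"), ("a", "b"),
]
paths = greedy_disjoint_paths(edges, "internet", "db")
```

Two disjoint paths are found here, so no single change can disconnect `internet` from `db`.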

Nodes Versus Edges

Ford-Fulkerson computes an edge cut: which connections should be severed? In cloud security, you're often looking for a vertex cut: which resource is the bottleneck?

If a single EC2 instance is the only path between the public subnet and the private data subnet, removing that instance disconnects the graph. That's a vertex cut. If a single IAM role is the only identity that bridges two otherwise-separated permission domains, revoking that role disconnects them. Also a vertex cut.

The two formulations are mathematically interchangeable. Any vertex cut problem converts to an edge cut problem by splitting each node into two nodes connected by a single edge. The capacity of that edge represents the cost of removing the resource. After splitting, Ford-Fulkerson computes the answer in edge-cut form, and you translate back to the vertex cut.
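The node-splitting transformation is mechanical enough to sketch. In this illustration (the `split_nodes` helper, the `("name", "in")` / `("name", "out")` naming, and the example resources are assumptions), every intermediate resource becomes an in-node and an out-node joined by one edge whose capacity is the cost of removing that resource. The original edges get infinite capacity so that only resources, not connections, can end up in the cut.

```python
import math

def split_nodes(edges, node_cost, source, sink):
    """Convert a vertex-cut problem into an edge-cut problem.

    Each node n other than source and sink becomes (n, "in") -> (n, "out")
    with capacity node_cost[n] (default 1). Original edges become infinite.
    """
    def inp(n):
        return n if n in (source, sink) else (n, "in")

    def out(n):
        return n if n in (source, sink) else (n, "out")

    capacity = {}
    nodes = {n for edge in edges for n in edge}
    for n in nodes - {source, sink}:
        capacity[(inp(n), out(n))] = node_cost.get(n, 1)
    for u, v in edges:
        # Only the internal in -> out edges are allowed to be cut.
        capacity[(out(u), inp(v))] = math.inf
    return capacity

# Assumed example: a single bastion host bridges the internet and the database.
edges = [("internet", "bastion"), ("bastion", "db")]
capacity = split_nodes(edges, {"bastion": 1}, "internet", "db")
```

Running any edge min-cut over `capacity` can now only select the edge from `("bastion", "in")` to `("bastion", "out")`, which translates back to "remove or isolate the bastion".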

In practice, this is the difference between "which security group rule should I remove?" (edge cut) and "which IAM role is the choke point?" (vertex cut). Both are useful. The edge cut tells you the specific remediation. The vertex cut tells you the structural bottleneck: the resource whose compromise would open the most paths.

This is why per-resource scanners miss the compound risk. They evaluate each resource independently (each node in isolation) and can detect that a security group rule is too permissive or an IAM role has wildcard permissions. What they can't detect is that this specific security group rule is the choke point through which all twelve attack paths flow, or that this specific IAM role is the vertex whose removal disconnects the internet from the database. That's graph-level analysis, not node-level analysis. The scanner sees trees. The min-cut sees the forest.

The Weighted Version

Binary capacity (exists or doesn't exist) is the clear case. Real environments are messier. Not all edges are equally easy to exploit or equally easy to remediate.

Assigning weights to edges captures this. Two possible weight schemes, each answering a different question:

Exploitability weights (the attacker's perspective). An unpatched critical CVE on a public-facing instance is a high-capacity edge: easy for the attacker to traverse. An MFA-protected, IP-restricted, session-limited login is a low-capacity edge: hard to traverse. The max-flow value measures how much total attack capacity the graph admits from source to sink. The min-cut identifies where the defender can impose the most friction on the attacker with the fewest changes. This answers: "where do I get the most security gain per unit of effort?"

Remediation cost weights (the defender's perspective). Removing a VPC peering connection might break a production workflow — high remediation cost. Scoping an overpermissioned IAM role might be free — low cost. The min-cut with remediation cost weights finds the cheapest set of changes that disconnects the graph. This answers: "what is the least-disruptive remediation plan that eliminates all attack paths?"

The two weight schemes can be combined. The min-cut with combined weights finds the remediation plan that maximizes security improvement per unit of operational disruption — the economically optimal defense.
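To make the remediation-cost version concrete, here is a deliberately naive sketch that tries every subset of edges and keeps the cheapest one that disconnects the graph. This is brute force, not a flow algorithm, and it only works on toy graphs; a real implementation would run max-flow with the costs as capacities. The cost figures and resource names are invented for illustration.

```python
from itertools import combinations

def connected(edges, source, sink):
    """Depth-first search: is sink reachable from source over directed edges?"""
    stack, seen = [source], {source}
    while stack:
        u = stack.pop()
        if u == sink:
            return True
        for a, b in edges:
            if a == u and b not in seen:
                seen.add(b)
                stack.append(b)
    return False

def cheapest_cut(cost, source, sink):
    """Try every subset of edges; return the cheapest one that disconnects."""
    edges = list(cost)
    best, best_cost = None, float("inf")
    for size in range(len(edges) + 1):
        for removed in combinations(edges, size):
            total = sum(cost[e] for e in removed)
            survivors = set(edges) - set(removed)
            if total < best_cost and not connected(survivors, source, sink):
                best, best_cost = set(removed), total
    return best, best_cost

# Assumed remediation costs: higher means more operational disruption.
cost = {
    ("internet", "web"): 5,          # public endpoint the business needs
    ("web", "app_role"): 1,          # scoping the role is nearly free
    ("app_role", "db"): 4,
    ("internet", "partner_vpc"): 2,
    ("partner_vpc", "db"): 8,        # removing peering breaks a workflow
}
plan, plan_cost = cheapest_cut(cost, "internet", "db")
```

With these numbers the cheapest plan removes `("web", "app_role")` and `("internet", "partner_vpc")` for a total cost of 3, leaving the expensive peering edge and the public endpoint in place.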

The Temporal Dimension

Here's where the min-cut connects to temporal consistency, the subject of the previous article in this series.

A min-cut computation takes a graph as input and produces a set of edges as output. The graph is constructed from a configuration snapshot — the IAM policies, security group rules, VPC routes, and trust relationships captured at a point in time. If the snapshot is temporally inconsistent — different parts captured at different times with configuration changes in between — the graph describes a system state that may never have existed.

Run Ford-Fulkerson on an inconsistent graph and the results are unreliable in three ways:

The ghost path. The scanner sees an IAM permission captured at 14:00 and a security group rule captured at 14:07. Ford-Fulkerson finds a path through both. The min-cut suggests removing the security group rule. But the IAM permission was revoked at 14:04. It doesn't exist at the same time as the security group rule. The path never existed. The remediation targets a choke point for an attack surface that was never real.

The phantom cut. The min-cut identifies three edges to remove. One of them is a VPC route that was already removed at 14:05, between when the source-side and sink-side resources were captured. The remediation plan includes a change that's already been made. The engineer "fixes" something that's already fixed and misses the remaining exposure.

The missing path. A new IAM permission was added at 14:03, after the identity group was captured but before the network group was captured. The identity snapshot doesn't include it. The graph is missing an edge. Ford-Fulkerson computes a min-cut that doesn't account for the new path. The remediation plan has a hole. It addresses the paths the scanner saw but misses the path that appeared mid-collection.

All three are consequences of computing over an inconsistent cut (in the Chandy-Lamport sense) of the distributed system. The configuration snapshot is a cut through the execution history of the cloud infrastructure. If the cut is inconsistent, the graph it produces is fictional, and the min-cut of a fictional graph is a fictional remediation plan.

The fix is the same consistency check described in the Lamport article: verify that no relevant change event falls between the capture times of the snapshot's service groups. If the cut is consistent, the graph is real, and the min-cut is the optimal remediation plan. If the cut is inconsistent, the min-cut computation should be qualified: "this remediation plan is based on a temporally inconsistent graph — re-collect for a consistent analysis."

What This Implies for Tooling

A scanner that enumerates paths is doing BFS/DFS. Useful, but produces a finding list that doesn't aggregate into a remediation plan.

A scanner that computes the min-cut is doing Ford-Fulkerson (or Edmonds-Karp, or push-relabel). It produces a remediation plan directly: the minimum set of changes that disconnects the attacker from the target. The plan is optimal by construction — no smaller set of changes achieves the same disconnection.

A scanner that computes the min-cut and verifies temporal consistency is doing Ford-Fulkerson over a provably consistent graph. The remediation plan is both optimal and trustworthy — it addresses attack paths that exist simultaneously, not paths assembled from observations at different moments.

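The computational cost is modest. Edmonds-Karp, the breadth-first variant of Ford-Fulkerson, runs in O(VE²) in the worst case on a graph with V vertices and E edges. That bound is pessimistic for this workload: with unit capacities, any augmenting-path method finishes in O(E·k) time, where k is the size of the min-cut, because each augmentation is one graph traversal and adds one unit of flow. A cloud environment with 10,000 resources and 50,000 edges (permissions, routes, trust relationships) is well within reach. The min-cut computation adds seconds, not hours, to the evaluation.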

The implementation maps naturally to the fact-export architecture. The configuration snapshot produces a graph of resources (nodes) and permissions/routes/trust (edges). The graph is exported to an engine that computes reachability (Soufflé/Datalog for path enumeration) and min-cut (a flow algorithm). The engine returns both the paths (the findings) and the min-cut (the remediation plan). The temporal consistency layer verifies the graph before the computation begins and qualifies the results if verification fails.

One Question, Not Twelve

The scanner that finds twelve paths produces twelve findings. The engineer reads twelve findings, reasons about shared edges, identifies three choke points, and implements three changes. The reasoning the engineer did — manually, in their head, possibly incorrectly — is a min-cut computation.

The scanner that computes the min-cut produces three changes. The engineer reads three changes and implements them. The reasoning happened in the algorithm, where it's provably correct, instead of in the engineer's head, where it's approximated.

Twelve findings and the hope that the engineer will figure it out. Or three changes and the proof that they're sufficient. The math to get from the first to the second has been available since 1956. Ford and Fulkerson solved it. Menger characterized the lower bound. Chandy and Lamport told us how to verify the graph. The algorithms exist and they're efficient. What's missing is applying them.

This is the second in a series on formal methods applied to cloud security. The first, "Your Scanner Doesn't Know What Time It Is," covers temporal consistency of configuration snapshots. The tools and examples reference Stave, an open-source cloud security tool (Apache 2.0, early-stage) that exports structured facts for external reasoning engines. The graph theory is general — it applies to any scanner that evaluates reachability across a configuration graph, regardless of which cloud provider or tool you use.

Your Scanner Doesn't Know What Time It Is

Bala Paranj — Tue, 07 Jul 2026 12:01:00 +0000

The Problem Nobody Talks About

Every cloud security scanner that evaluates configuration snapshots has the same hidden assumption: the snapshot is current and internally consistent. Neither is guaranteed. Nobody checks.

A scanner pulls the IAM role configuration, evaluates it, pulls the S3 bucket configuration, evaluates it, and then produces a compound finding: "this IAM role can access 47 S3 buckets." The finding treats the IAM snapshot and the S3 snapshot as though they describe the same moment in time. They don't. The IAM snapshot was captured at 14:00:03. The S3 snapshot was captured at 14:07:41. In that seven-minute window, someone might have changed the IAM policy, added a new bucket, or deleted the role entirely.

The compound finding is asserting something about a system state that may never have existed at any single point in time. It's splicing two observations from different moments into a single claim and presenting it as fact. The scanner has no mechanism to detect this, warn about it, or qualify the finding's confidence. The user reads "this role can reach 47 buckets" and trusts it, not knowing the role's policy was updated four minutes into the collection window and the actual number is 3.

This isn't a theoretical edge case. It's the normal operating condition. Cloud infrastructure changes continuously. Collection runs take minutes to hours. The larger the environment, the wider the window, and the more likely a change occurred inside it.

Leslie Lamport solved the formal version of this problem in 1978.

Lamport and Time

Lamport's "Time, Clocks, and the Ordering of Events in a Distributed System" is about a deceptively simple question: in a system where processes communicate by messages, what does it mean to say one event happened before another?

His answer: physical clock time doesn't determine ordering. Causality does. Event A happened before event B if A could have influenced B — either because A occurred earlier in the same process, or because A sent a message that B received. If neither condition holds, the events are concurrent: they happened independently, with no causal link.

This creates a partial ordering over events. Some events are definitively before or after each other. Others are not comparable. They happened in parallel, and no amount of clock synchronization changes that.
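The happened-before relation is small enough to sketch directly from that definition. The event encoding below, `(process, sequence_number)` pairs plus a list of `(send_event, receive_event)` messages, is an assumption made for illustration, as are the process names in the example.

```python
def happened_before(a, b, messages):
    """Lamport's happened-before: True if event a could have influenced event b.

    Events are (process, sequence_number) pairs. messages is a list of
    (send_event, receive_event) pairs.
    """
    if a == b:
        return False
    earliest = {a[0]: a[1]}  # earliest influenced sequence number per process
    frontier = [a]
    while frontier:
        proc, seq = frontier.pop()
        for send, recv in messages:
            # A send at or after the current event carries its influence along.
            if send[0] == proc and send[1] >= seq:
                r_proc, r_seq = recv
                if r_proc not in earliest or r_seq < earliest[r_proc]:
                    earliest[r_proc] = r_seq
                    frontier.append(recv)
    return b[0] in earliest and earliest[b[0]] <= b[1]

def concurrent(a, b, messages):
    """Two distinct events are concurrent when neither happened before the other."""
    return (a != b and not happened_before(a, b, messages)
            and not happened_before(b, a, messages))

# Assumed example: an admin's first action reaches IAM, and IAM later
# answers the scanner. The admin's second action is never observed.
messages = [
    (("admin", 1), ("iam", 1)),
    (("iam", 2), ("scanner", 1)),
]
causal = happened_before(("admin", 1), ("scanner", 1), messages)
parallel = concurrent(("admin", 2), ("scanner", 1), messages)
```

The first admin action causally precedes the scanner's observation, so `causal` is true. The second admin action and the observation have no causal link in either direction, so `parallel` is true, whatever the wall clocks say.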

The insight that matters for configuration snapshots: two observations are consistent only if no event that would change the observed state occurred between them. This is a statement about causal ordering, not wall-clock proximity.

Two snapshots captured four hours apart with no intervening changes are perfectly consistent. Two snapshots captured thirty seconds apart with a policy change between them are not. The clock gap is irrelevant. The causal relationship between the observations and the change events determines consistency.

Three Variables in Every Snapshot

A configuration snapshot captures the state of a distributed system at a point in time. Three temporal variables govern whether findings derived from that snapshot are trustworthy:

Capture time. When the scanner observed the configuration. This is a physical clock reading from the machine running the collection. It tells you when the observation happened but not whether the observation is still current.

Evaluation time. When the scanner produces findings from the snapshot. This might be seconds after capture (continuous pipeline) or weeks later (M&A diligence on archived snapshots). The gap between capture time and evaluation time is the freshness of the snapshot — how likely it is that the configuration has changed since observation.

Change time. When the infrastructure was modified. This is the variable the scanner cannot directly observe. Snapshots capture state, not transitions. The scanner sees the configuration as it was at the moment of capture but not when it was last changed or what it was before. Change time lives in the audit log (CloudTrail, in AWS), not in the configuration snapshot.

Most scanners ignore all three. They treat the snapshot as though it's always current (no freshness check) and always consistent (no cross-observation validation). The result is findings that present stale or inconsistent data with the same confidence as findings on fresh, consistent data.

Freshness: The First-Order Problem

The simplest temporal problem is freshness: is the snapshot still current?

A finding of FAIL on a three-week-old snapshot might already be remediated. A finding of PASS might no longer hold. Both are presented to the user with the same confidence as findings on a five-minute-old snapshot. The scanner reports certainty it doesn't have.

The fix is a third verdict: UNCERTAIN. When the snapshot's age exceeds a configurable threshold, findings are marked UNCERTAIN — not PASS, not FAIL, but "we ran the check and can't vouch for the answer because the data is old."

This is straightforward engineering. Compare evaluation_time - capture_time against a threshold. If the age exceeds the threshold, qualify the finding. The evaluation still runs (the underlying PASS or FAIL is preserved as metadata), but the verdict carries an epistemic qualifier: this assertion was true at capture time, but the data may no longer reflect the current state.

Two implementation details matter:

First, the evaluation time must be overridable. A breach reconstruction team evaluating a January snapshot in June needs the findings to be PASS or FAIL relative to January, not UNCERTAIN relative to June. The --now flag sets the evaluation time to the moment of interest, making freshness relative to the scenario rather than the wall clock. Same snapshot, same controls, deterministic freshness verdicts — reproducibility preserved.

Second, different services change at different rates. IAM policies might be stable for weeks. Security groups might change daily. EC2 instances come and go hourly. A single global freshness threshold is a blunt instrument. Per-service-group thresholds align the uncertainty qualifier with the change velocity of the data.
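A minimal sketch of the freshness check with both details folded in might look like the following. The threshold values, group names, and the `verdict` function are hypothetical; the `now` parameter plays the role the `--now` flag plays on the command line.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical per-service-group thresholds; real values depend on change velocity.
THRESHOLDS = {
    "identity": timedelta(days=7),
    "network": timedelta(days=1),
    "compute": timedelta(hours=1),
}

def verdict(raw_result, group, captured_at, now=None):
    """Qualify a raw PASS/FAIL with UNCERTAIN when the snapshot is too old.

    now: evaluation time. It is overridable so an archived snapshot can be
    judged relative to a past moment instead of the wall clock.
    """
    now = now or datetime.now(timezone.utc)
    age = now - captured_at
    if age > THRESHOLDS[group]:
        # Keep the underlying result as metadata; only the verdict is qualified.
        return {"verdict": "UNCERTAIN", "underlying": raw_result, "age": age}
    return {"verdict": raw_result, "underlying": raw_result, "age": age}

captured = datetime(2026, 1, 10, 14, 0, tzinfo=timezone.utc)

# Evaluated half an hour after capture: the FAIL stands.
fresh = verdict("FAIL", "compute", captured, now=captured + timedelta(minutes=30))

# Evaluated months later against the wall clock: the same FAIL is UNCERTAIN.
stale = verdict("FAIL", "compute", captured, now=captured + timedelta(days=150))

# Same three-day age, different groups, different verdicts.
identity = verdict("PASS", "identity", captured, now=captured + timedelta(days=3))
network = verdict("PASS", "network", captured, now=captured + timedelta(days=3))
```

Passing `now` explicitly is what keeps the result deterministic: the same snapshot and the same evaluation time always produce the same verdict.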

Freshness is a probabilistic hedge. It says: "the data is old enough that changes are plausible, so we can't be confident." It doesn't say whether a change actually happened. For that, you need the causal ordering.

Consistency: The Lamport Problem

Freshness handles the gap between capture and evaluation. Consistency handles the gap within the capture itself.

A scanner that collects by service group (identity first, then storage, then compute) produces snapshots with different capture times. The identity group might be captured at T1 and the storage group at T2, minutes or hours later. A compound finding that joins facts from both groups, such as "this IAM role can access these S3 buckets," is valid only if the configuration was stable between T1 and T2.

This is Lamport's problem applied to snapshot collection. Each service group capture is an event. Each configuration change is an event. The compound finding is valid only when the captures are causally after all relevant changes.

Four scenarios:

Timeline 1: Consistent — no changes in window
  T1 (IAM captured)────────────T2 (S3 captured)
  No IAM or S3 changes between T1 and T2.
  → Compound finding is valid.

Timeline 2: Consistent — changes before window
  change──T1 (IAM captured)────T2 (S3 captured)
  The change happened before both captures.
  Both observations reflect post-change state.
  → Compound finding is valid.

Timeline 3: Inconsistent — change inside window
  T1 (IAM captured)──change──T2 (S3 captured)
  IAM was captured before the change.
  S3 was captured after the change.
  The compound finding joins pre-change IAM with post-change S3.
  → Compound finding may describe a state that never existed.

Timeline 4: Unknown — no change data
  T1 (IAM captured)────────T2 (S3 captured)
  No audit trail available. Cannot determine if
  changes occurred in the window.
  → Compound finding consistency is unknown.

Timeline 3 is the dangerous case. The finding is not wrong in the way a false positive is wrong. It's wrong in a deeper way: it's asserting something about a composite system state that was never actual. The IAM permissions it describes existed at T1. The S3 bucket inventory it describes existed at T2. But the combination (these permissions on these buckets) may never have been simultaneously true.

Distributed systems theory has a formal term for what the scanner is attempting: a consistent cut. A cut is a set of events, one per process, that represents a snapshot of the system's state. A cut is consistent if it respects causality — no observation in the cut reflects a state that was changed by an event the cut doesn't include. Timelines 1 and 2 are consistent cuts: the captures include the effects of all prior changes. Timeline 3 is an inconsistent cut: the IAM capture reflects the pre-change state while the S3 capture reflects the post-change state, and the change event falls across the cut boundary. Chandy and Lamport's 1985 snapshot algorithm solves this for systems where the observer can inject markers into the communication channels. A configuration scanner can't inject markers into AWS. It can only observe API responses after the fact. But the audit trail serves the same role: it records the events between observations, which is enough to determine after collection whether the cut was consistent.

Timeline 4 is the common case. Without change data, the scanner can't distinguish Timelines 1, 2, and 3. The freshness threshold handles this probabilistically: if the window is small enough, a mid-window change is unlikely, so the finding is probably consistent. But probably is not provably.

The Lamport-Informed Consistency Check

If the snapshot includes change data — the audit trail showing when each resource was last modified — the scanner can determine consistency formally rather than probabilistically.

The algorithm:

For each compound finding, identify the service groups whose facts contribute to the finding (e.g., identity and storage for a role-to-bucket blast radius).
For each contributing service group, record its capture time (T1 for identity, T2 for storage).
For each resource referenced in the finding, look up the most recent change event from the audit trail. Critically, "relevant change events" include not only direct changes to the resources named in the finding but also changes to the relationships and policies that govern them. A change to a Service Control Policy affects every IAM role in every account under that OU. A change to an S3 bucket policy affects every principal that accesses that bucket. A change to a VPC endpoint policy affects every service call routed through that endpoint. The set of relevant changes is the transitive closure of everything that could alter the finding's conclusion — not just the resources at the endpoints of the graph, but the edges and governing policies between them.
For each pair of contributing groups, check whether any relevant change event (direct or relational) falls between their capture times.
If no relevant change falls in any window: the finding is causally consistent. The captures are after all relevant changes, regardless of wall-clock gap.
If a relevant change falls between T1 and T2: the finding is causally inconsistent. One group was captured before the change and one after. The finding may describe a state that never existed.
If change data is unavailable for a resource: the finding's consistency is unknown, and the freshness threshold applies as a probabilistic fallback.
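The steps above can be sketched as a single function. This is an illustration under stated assumptions, not Stave's implementation: the `check_consistency` name and its inputs are hypothetical, and the caller is assumed to have already gathered the relevant change events (direct and governing-policy changes) from the audit trail. Checking every pair of groups reduces to checking the span between the earliest and latest capture, because any change that falls inside one pair's window also falls inside that span.

```python
from datetime import datetime

def check_consistency(capture_times, change_events):
    """Classify a compound finding as consistent, inconsistent, or unknown.

    capture_times: dict mapping each contributing service group to its
        capture time.
    change_events: timestamps of the changes relevant to the finding, or
        None when no audit trail is available.
    """
    if change_events is None:
        return "unknown"  # fall back to the freshness threshold
    window_start = min(capture_times.values())
    window_end = max(capture_times.values())
    for changed_at in change_events:
        # A change strictly inside the window splits the captures into
        # pre-change and post-change observations.
        if window_start < changed_at < window_end:
            return "inconsistent"
    return "consistent"

# Capture times taken from the IAM / S3 example earlier in the article.
captures = {
    "identity": datetime(2026, 7, 7, 14, 0, 3),
    "storage": datetime(2026, 7, 7, 14, 7, 41),
}

before_window = check_consistency(captures, [datetime(2026, 7, 7, 13, 50)])
inside_window = check_consistency(captures, [datetime(2026, 7, 7, 14, 4)])
no_audit_trail = check_consistency(captures, None)
```

The three calls correspond to Timelines 2, 3, and 4: a change before both captures is consistent, a change between them is inconsistent, and a missing audit trail is unknown.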

This is the vector clock applied to snapshot consistency. Each service group's capture is a vector entry. Each configuration change is an event. The compound finding is consistent when the capture vector is causally after the change vector for every resource in the finding.

The formal property: a compound finding F over resources R1...Rn is consistent if and only if for every pair of resources Ri, Rj in F, there exists no change event E affecting Ri, Rj, or any governing policy G in the dependency chain between them (SCPs, RCPs, VPC endpoint policies, bucket policies), such that capture_time(group(Ri)) < time(E) < capture_time(group(Rj)).

This is decidable, computable from the audit trail, and produces a provable consistency verdict, not a probabilistic one.

Why This Matters

Three concrete scenarios where the distinction between probabilistic freshness and formal consistency changes what you'd do:

M&A diligence. You're evaluating a target company's infrastructure from snapshots they provided. The snapshots were collected over a three-day window. Your compound findings (privilege escalation paths, data exfiltration chains) join data from snapshots captured days apart. Without the consistency check, you're presenting findings about a system state that may never have existed simultaneously. With it, you know which findings are causally consistent (trustworthy) and which are not (need re-collection or qualification in the report).

Breach reconstruction. The IR team evaluates pre-breach snapshots to determine what the attacker could have reached. The snapshots were collected over several hours before the breach was detected. A compound finding that joins IAM data captured three hours before the breach with S3 data captured one hour before the breach is valid only if no IAM changes occurred in that two-hour window. CloudTrail can answer this definitively. If the attacker modified IAM policies during the window, the compound finding is joining pre-compromise IAM with post-compromise S3 — a misleading reconstruction.

Continuous governance with stale pipelines. The collection pipeline for one service group fails silently. Identity data is fresh (captured tonight). Storage data is stale (last captured three days ago because the S3 collection job has been failing). The compound finding joins today's IAM permissions with three-day-old bucket inventory. Freshness catches this at the group level (the storage group is stale). Consistency catches the compound implication: any cross-group finding involving storage is uncertain, even if the identity findings alone are reliable.

Two Layers of Temporal Reasoning

The full temporal model has two layers, and they're complementary:

Layer 1: Freshness. Compares capture time against evaluation time. Answers: "is this data old enough that changes are plausible?" Produces UNCERTAIN when the age exceeds a threshold. This is the pragmatic layer. It requires no external data (just the timestamps on the snapshot) and catches the gross staleness case: the pipeline is broken, the snapshot is three weeks old, don't trust the findings.

Layer 2: Consistency (Lamport-informed). Compares capture times of different service groups against the change events from the audit trail. Answers: "did any relevant change occur between the captures that contribute to this compound finding?" Produces a provable consistency verdict. This is the formal layer. It requires the audit trail alongside the configuration snapshot.

Most scanners have neither. Having Layer 1 alone is a significant improvement. It catches stale data and qualifies findings accordingly. Having both layers gives you a temporal reasoning framework that can prove whether a compound finding describes a state that actually existed. That's the gap between a scanner that reports findings and a scanner that reasons about the epistemic status of its own assertions.

Lamport's contribution was a precise characterization of what it means to know the order of events when you can't share a clock. Configuration snapshots are observations made by a process (the scanner) about a distributed system (the cloud) where events (configuration changes) happen concurrently and independently. The formal framework exists. The question is whether the scanner uses it.

What This Implies

Layer 1 is buildable today with no external dependencies. Add captured_at metadata to snapshots, compare against evaluation time, emit UNCERTAIN when stale. Every scanner should do this. Almost none do.

Layer 2 is buildable when the audit trail is available. In AWS, CloudTrail records configuration changes as management events, each with a timestamp. The data exists. Joining it with the snapshot's capture times is engineering, not research.

Together, the two layers give you a scanner that can answer three questions no other scanner answers:

Is this finding based on current data? (Freshness)
Is this compound finding based on consistent data? (Lamport consistency)
If neither, what specifically is uncertain and why? (Qualified verdicts with named uncertainty)

The formal-methods grounding makes the UNCERTAIN verdict meaningful. "The data is old" is an observation. "This compound finding joins pre-change IAM state with post-change S3 state, and therefore describes a system configuration that may never have existed" is a proof. The difference is whether the user knows why they shouldn't trust the finding, not just that they shouldn't.

The temporal reasoning framework described here is part of the design of Stave, an open-source cloud security tool (Apache 2.0, early-stage). Layer 1 (freshness / UNCERTAIN verdict) is designed. Layer 2 (Lamport-informed consistency) is specified but not yet implemented, because it requires audit trail integration alongside configuration snapshots. If you're building a scanner that evaluates configuration snapshots of distributed systems, Lamport's 1978 paper is the starting point. The problem it solves isn't academic. It's the temporal foundation that every compound finding depends on and almost no scanner verifies.