Bala Paranj

Proof, not prediction: where formal verification beats AI in cloud security

An AI scanner says 'this configuration looks unsafe' with 87% confidence. A formal verifier says 'this configuration IS unsafe, here is the exact principal, action, and resource that proves it.' One is a prediction. The other is evidence. The difference matters for insurance, for audits, and for the 80% of cloud security questions that have exact answers.

A CISO walks into a renewal meeting with the cyber-insurance underwriter. The underwriter asks one question:

Can your most sensitive S3 bucket be accessed by an unauthorized principal?

There are two ways to answer. An AI-powered cloud security tool gives you something like "this bucket appears to have appropriate controls based on similar configurations in our training data, confidence: 92%." A formal verifier gives you a yes/no with a witness: "no — UNSAT against the following 17 constraints over the bucket policy, identity federation, IAM, and account-level Public Access Block."

One is a prediction. The other is a proof. The underwriter knows which one survives a subrogation suit.


What a proof is

When a solver like Z3 says UNSAT, it has shown that no assignment of the free variables in the model satisfies the constraints. There is no principal, no action, no resource, no role assumption, no trust path, no policy condition — within the model — that violates the property. The answer is mathematically complete relative to the model.

When the same solver says SAT, it returns a witness: the concrete assignment that satisfies the constraints. The witness names the exact principal, action, and resource that constitute the violation. The output is constructive. You didn't just learn the property is false; you got the counterexample that proves it.
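
A minimal sketch of that shape in SMT-LIB, with invented sorts and predicates rather than Stave's actual encoding. The asserted facts stand in for what an exporter would extract from a bucket policy:

; Illustrative encoding only -- the sorts and predicates are made up.
(declare-sort Principal 0)
(declare-fun Anonymous (Principal) Bool)
(declare-fun CanRead (Principal) Bool)

; Facts: the policy grants read to every principal (Principal: "*"),
; and at least one unauthenticated principal exists.
(assert (forall ((x Principal)) (CanRead x)))
(declare-const anon Principal)
(assert (Anonymous anon))

; Violation query: does some anonymous principal have read access?
(declare-const w Principal)
(assert (and (Anonymous w) (CanRead w)))

(check-sat)   ; sat: the "no anonymous read" property is violated
(get-model)   ; the concrete witness assignment for w

Flip the policy fact so anonymous principals cannot read, and the same query returns unsat: the proof that no witness exists.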

There is no confidence score. There is no probability. There is no temperature setting. There is no false-positive rate. There is no training data. There is no "this might be similar to a known bad pattern." There is a function from facts to verdicts, and the verdict is correct relative to the inputs it was given.

The qualification "relative to the model" is critical. If your model doesn't include SCP evaluation, the solver won't catch SCP issues. If your model doesn't include Cognito identity-pool trust, the solver won't catch identity-federation chains. But the model's limitations are enumerable: you can list them, and operators know exactly what the verifier covers and what it doesn't.

An AI tool's limitations are statistical and unknowable. You can't enumerate what the training data didn't cover. A configuration that doesn't match any training pattern gets missed silently. A configuration that superficially resembles a bad pattern gets flagged incorrectly. The miss is invisible until the breach.


Proof vs prediction

| Dimension | Formal verification (Z3 / cvc5 / Yices) | AI-based cloud-security tools |
| --- | --- | --- |
| Output shape | Proof or counterexample | Prediction with confidence score |
| Correctness guarantee | Mathematically complete relative to the model | Statistically approximate relative to the training data |
| False positives | Zero relative to the model | Non-zero, tunable via threshold |
| False negatives | Zero relative to the model | Non-zero, depends on training coverage |
| Explainability | Exact witness: "principal P, action A, resource R" | This configuration resembles known-bad patterns |
| Novel attack paths | Discovers paths nobody has seen before, if the model captures them | Recognises only patterns similar to training data |
| Reproducibility | Deterministic: same input always produces the same answer | Varies across model versions, temperatures, prompts |

The deterministic property makes the verifier auditable. Run it today, get UNSAT. Run it tomorrow against the same fact base, get UNSAT. Run it in three years when the auditor asks for evidence, get UNSAT with the same fact base. The proof is the artifact. It does not decay.
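
In shell terms (file name illustrative), the audit property is just this:

$ sha256sum facts.smt2    # pin the exact fact base the verdict refers to
$ z3 facts.smt2
unsat
$ z3 facts.smt2           # same input, same verdict, today or in three years
unsat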

An AI prediction at 92% confidence today is an 87% confidence prediction tomorrow because the model was updated, or a 79% confidence prediction next week because the prompt template changed. The prediction is not an artifact. It is an output of a service. The service can be re-priced, re-trained, or shut down. Your audit evidence cannot.


The cost structure that nobody mentions

The accuracy story is the visible one. The cost story is the one that bends adoption.

Z3 runs on a single CPU core. The binary is ~10 MB. On Stave's full SMT-LIB export (5,000 facts plus 90 closed-world axioms), a run completes in milliseconds:

$ time z3 facts.smt2
sat

real    0m0.131s
user    0m0.123s
sys     0m0.009s

130 milliseconds. No GPU. No API call. No usage tier.

An AI-based tool pays for every query. Inference cost, GPU hours, API fees, token pricing. The cost scales linearly with usage: more assets, more policies, more frequent runs, higher bill. Enterprise customers with thousands of resources across multiple AWS accounts pay proportionally.

The cost ratio shows up most starkly in continuous-assessment scenarios. A cyber-insurance underwriter might require posture verification on every configuration change. For a company making hundreds of changes daily across multiple accounts, the AI-based tool runs hundreds of times per day per account. The formal verifier runs the same number of times — at a marginal cost of zero. The CPU was already paid for.
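
A sketch of that loop, assuming a feed of change events (the feed is hypothetical; the stave flags are the same ones used in the demo below):

# Re-verify on every configuration change. The only cost is milliseconds of CPU.
while read -r event; do
  ./stave export-sir --format smt2 \
    --controls ./controls --observations ./observations \
    --now "$(date -u +%Y-%m-%dT%H:%M:%SZ)" > /tmp/facts.smt2
  echo "(check-sat)" >> /tmp/facts.smt2
  z3 /tmp/facts.smt2           # no API call, no per-query fee
done < change_events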

| Dimension | AI-based tools | Stave + Z3/cvc5 |
| --- | --- | --- |
| Per-query cost | API inference fee | Zero |
| Infrastructure | GPU clusters | Single CPU core |
| Scaling cost | Linear with usage | Flat |
| Offline capability | Requires API connectivity | Runs fully offline (STAVE_NO_NETWORK=1) |
| Predictability | Variable pricing, usage-based | Fixed: it's a binary |

The offline guarantee compounds with the cost guarantee. Air-gapped environments — defence, classified workloads, certain financial verticals — cannot ship configuration data to a third-party AI API. They must analyse locally. A formal verifier was designed for this constraint. AI-powered SaaS tools were designed for the opposite constraint. The trade-off is structural, not a UX choice.
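
A sketch of what that looks like, using the STAVE_NO_NETWORK=1 switch from the table above. The unshare -rn wrapper is one way to enforce, rather than trust, the no-network property; the paths are illustrative:

STAVE_NO_NETWORK=1 unshare -rn ./stave export-sir --format smt2 \
  --controls ./controls --observations ./observations \
  --now 2026-01-15T00:00:00Z > facts.smt2
echo "(check-sat)" >> facts.smt2
unshare -rn z3 facts.smt2      # the solver never needed the network anyway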


Why this matters for insurance and evidence

Cyber-insurance underwriting is the canary for this entire shift. Underwriters today read CSPM-tool screenshots and trust the vendor's narrative. The narrative is statistical: "our customers have 92% lower breach rates." The screenshot is an opinion: "this bucket appears compliant."

Underwriters are moving toward evidence. They want artifacts they can hand to legal, archive in their underwriting file, and produce in a subrogation suit if the breach happens anyway. A formal proof — (check-sat) returning unsat against a specific fact base on a specific date — is an artifact. It can be archived. It can be re-run. It can be independently verified by Z3, cvc5, or Yices to confirm the answer.

The proof and the fact base are the audit evidence. The fact base is obs.v0.1 JSON, schema-validated, with provenance for every property. The proof is whatever the solver returns. Together they reconstruct the verdict exactly. The auditor does not need to trust the security vendor's model — they can run the solver themselves against the committed fact base and verify the verdict independently.
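
One way to freeze that evidence for an underwriting file (the bundling commands are illustrative; the fact base and the verdict are the parts that matter):

z3 facts.smt2 > verdict.txt                    # "unsat"
sha256sum facts.smt2 verdict.txt > evidence.sha256
tar czf evidence-2026-01-15.tar.gz facts.smt2 verdict.txt evidence.sha256
# Any party can extract, re-run z3 or cvc5, and compare verdicts independently.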

A Z3 proof is evidence. An AI prediction is an opinion.

This is also Stave's strongest positioning. When an insurer asks "prove your S3 bucket can't be accessed by unauthenticated principals," the formal answer is the artifact the underwriter is starting to demand.


Where AI genuinely belongs

Roughly 80–90% of cloud-security questions have exact answers:

  • Is MFA enabled on the root account?
  • Does this bucket policy allow Principal: *?
  • Can an unauthenticated Cognito identity assume this role?
  • Is CloudTrail logging multi-region?
  • Are these three controls all violated on the same asset?

None of these requires a language model. None of them benefits from a confidence score. They are predicates over structured facts. The deterministic verifier answers them in milliseconds with mathematical completeness relative to the model. Paying AI pricing for "is MFA enabled?" is a category error.
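
A sketch of how small these questions are once the facts are structured (the variable names are invented, not Stave's schema):

; Three of the checks above as Boolean facts from a snapshot.
(declare-const root_mfa_enabled Bool)
(declare-const bucket_allows_principal_star Bool)
(declare-const cloudtrail_multi_region Bool)

(assert root_mfa_enabled)
(assert (not bucket_allows_principal_star))
(assert cloudtrail_multi_region)

; Negated control: is at least one of the three violated?
(assert (or (not root_mfa_enabled)
            bucket_allows_principal_star
            (not cloudtrail_multi_region)))
(check-sat)   ; unsat: all three controls hold, and that is a proof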

The remaining 10–20% genuinely needs AI:

  • Natural-language policy interpretation: parsing a vendor's English SOC 2 attestation and asking "does this commit them to encrypting backups?" The text is ambiguous; the intent must be inferred.
  • Anomaly detection in behavioural data: "is this CloudTrail access pattern unusual for this principal?" The baseline is statistical, the answer is probabilistic.
  • Intent inference from incomplete documentation: "does the team that owns this bucket consider its contents public-by-design, or did the public ACL get added by mistake?" The signal is in commit messages, ticket history, and Slack threads; the answer is judgement.

AI tools are valuable on those. They are also expensive on those. The cost is justified because the alternative is human review at twice the price.


The foundation-layer architecture

The right shape is composition. Deterministic verification at the bottom; AI judgement at the top:

+-------------------------------------------------------+
|  AI tools (10-20% of work)                            |
|    Policy interpretation, anomaly detection,          |
|    intent inference from English / behavioural data   |
+-------------------------------------------------------+
|  Stave + Z3/cvc5 (80-90% of work)                     |
|    Compound chain detection, formal proofs,           |
|    deterministic verdicts on structured facts         |
+-------------------------------------------------------+
|  Snapshot facts (obs.v0.1)                            |
|    Schema-validated, provenance-tracked, offline      |
+-------------------------------------------------------+

The AI bill drops 80–90% because trivial deterministic checks no longer route through an inference endpoint. The remaining AI workload has a much better signal-to-noise ratio because it sees only the hard cases — the ones where human judgement is the alternative. The AI tool gets better, not worse, when the deterministic layer is in place underneath it.
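
A sketch of the routing that composition implies (classify_question and send_to_ai_review are hypothetical stand-ins for the glue):

# Predicates over structured facts go to the solver; natural-language and
# behavioural questions go to the inference layer.
for q in questions/*; do
  if classify_question "$q" | grep -q deterministic; then
    z3 "facts/$(basename "$q").smt2"     # zero marginal cost
  else
    send_to_ai_review "$q"               # inference cost, but only the hard cases
  fi
done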

This is also the partnership angle. An AI-powered cloud-security vendor benefits from Stave handling the deterministic layer: their costs drop because the volume drops, their accuracy improves because the inputs are cleaner, and their pricing model survives the next round of GPU price increases because they're no longer paying inference cost for questions like "is encryption enabled?"

Stave does not replace AI tools. It makes them economically viable by removing the work they shouldn't be doing.


A 130-millisecond demo

The repository ships a working pipeline. Clone, build, and run:

git clone https://github.com/sufield/stave
cd stave && make build

# Pick a real fixture — HackerOne #1021906, the Shopify "tag says internal,
# policy says everyone" case. The fixture is a reconstructed snapshot from
# the public disclosure.
FIXTURE=testdata/e2e/e2e-h1-shopify-1021906

# Export the snapshot as SMT-LIB v2 — facts only, no query.
./stave export-sir --format smt2 \
  --controls $FIXTURE/controls \
  --observations $FIXTURE/observations \
  --now 2026-01-15T00:00:00Z > /tmp/facts.smt2

# Append the satisfiability query.
echo "(check-sat)" >> /tmp/facts.smt2

# Two independent solvers.
z3 /tmp/facts.smt2
cvc5 --lang smt2 /tmp/facts.smt2

z3 returns sat on 302 lines of SMT-LIB in 132 milliseconds. cvc5 runs over the same input as an independent cross-check; on this particular fact base it returns unknown, which is itself an honest verdict — the second solver could not establish the answer with its default tactic. Two engines returning the same sat/unsat is the strongest possible cross-validation; one returning unknown is the signal to either widen the timeout, switch tactics, or refine the model. The point is that the protocol — facts plus query plus solver — is reproducible. The verdict is auditable regardless of which solver produced it.

No GPU, no API key, no per-query cost. The binary that produced the export does not call any cloud API. The solver runs locally. The audit evidence is the file at /tmp/facts.smt2 plus the verdict — both reproducible byte-for-byte at any point in the future against the same obs.v0.1 snapshot.


What this is not

It is not "AI is bad for security." AI is excellent for the 10–20% of problems that genuinely require judgement, anomaly detection, or natural-language interpretation. The framing is use the right tool for each class of problem.

It is not "formal verification solves everything." The verifier is correct relative to the model. If your model doesn't include the Cognito identity-pool trust pattern, the solver will not flag a Cognito identity-pool attack. But model coverage is enumerable; you can list what is and isn't covered. AI coverage is statistical; you cannot.

It is not "Z3 replaces your CSPM." CSPM and formal verification are complementary. CSPM tells you what's in the cloud right now. Formal verification tells you whether what's there satisfies your stated invariants. The CSPM dashboard is the inventory; the formal verifier is the auditor with the proof.


The structural shift

Cloud security has spent a decade paying language-model prices for boolean questions. The market matured around AI because no deterministic alternative existed that handled compound risk across services. That gap is closed. Stave's open-source policy library plus the SMT-LIB export plus a $0 solver replaces the deterministic-check layer at zero marginal cost.

The AI vendors who survive will be the ones who specialise in genuinely judgemental questions and welcome a free, deterministic foundation beneath them. The CSPM vendors who survive will be the ones who emit obs.v0.1-compatible inventories and let any open-source verifier consume them. The customers who survive are the ones who stop paying API fees for Principal: * checks.

When the insurance underwriter asks for proof, the customer hands them the SMT-LIB file and the solver verdict. The verifier ran on a laptop. The cost was zero. The evidence is independently reproducible.

That is what formal verification looks like in cloud security in 2026. The technique has been validated at scale by Microsoft Research (SecGuru, 2015) on Azure's datacentre network. Applying it to cloud service configurations is the only step that was missing. It is no longer missing.


Stave is an open-source intent verification engine for cloud infrastructure with 2,650+ controls, compound risk detection, and nine independent reasoning engines.
