Alex Garden
The First Zero-Knowledge Proof of AI Safety Judgment

The Agent Alignment Protocol gives agents transparency — structured traces of what was considered, what was chosen, and why. The Agent Integrity Protocol gives agents integrity — continuous runtime verdicts on whether an agent's autonomous decisions align with its declared values and boundaries. Today we ship the third piece: proof.

Cryptographic evidence that every integrity verdict was honestly computed. Independently verifiable. No trust required.

This matters because integrity monitoring creates a new trust dependency. When we tell you an agent's behavior is consistent with its Alignment Card, you have to trust that we computed that verdict honestly — that we actually ran the analysis, that we didn't tamper with results, that the history hasn't been rewritten. The proof layer eliminates that trust dependency. Every verdict now ships with mathematical evidence you can verify yourself, in your own browser, without calling our API or trusting our infrastructure.

Four Layers of Cryptographic Evidence

The proof layer isn't a single mechanism. It's four independent cryptographic structures, each addressing a different trust question. They compose but don't depend on each other — you get value from any subset.

Ed25519 Signatures: Who Signed This Verdict?

Every integrity checkpoint is signed with an Ed25519 digital signature. The signed payload is a deterministic JSON string containing the checkpoint ID, agent ID, verdict, thinking block hash, input commitment, chain hash, and timestamp. Keys are sorted alphabetically before serialization, producing identical output regardless of property insertion order.

The signing key is a 32-byte Ed25519 secret held by the analysis service. The corresponding public key is published at a well-known endpoint (api.mnemom.ai/v1/keys). Anyone can fetch the public key and verify the signature using the @noble/ed25519 library — the same library we use internally. A valid signature proves the verdict was produced by a key holder and hasn't been modified since signing.
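As a sketch of the signing flow: the example below uses Node's built-in Ed25519 support rather than @noble/ed25519 so it is self-contained, and the payload field names are assumptions based on the fields listed above, not the exact wire format.

```typescript
import { generateKeyPairSync, sign, verify } from "node:crypto";

// Sort keys alphabetically before serializing, so the signed bytes are
// identical regardless of property insertion order.
function canonicalize(payload: Record<string, string>): Buffer {
  const sorted = Object.fromEntries(
    Object.entries(payload).sort(([a], [b]) => (a < b ? -1 : 1))
  );
  return Buffer.from(JSON.stringify(sorted), "utf8");
}

// Hypothetical checkpoint payload; the exact field names are illustrative.
const payload = {
  checkpoint_id: "chk_0042",
  agent_id: "agent_7",
  verdict: "clear",
  thinking_block_hash: "9f2c...",
  input_commitment: "41ab...",
  chain_hash: "c0de...",
  timestamp: "2026-02-01T12:00:00Z",
};

// In production the secret key never leaves the analysis service; the
// public key is what gets published at the well-known endpoint.
const { privateKey, publicKey } = generateKeyPairSync("ed25519");

const message = canonicalize(payload);
const signature = sign(null, message, privateKey); // Ed25519 takes no digest argument
const valid = verify(null, message, publicKey, signature);
console.log(valid); // true
```

A verifier repeats the canonicalization from the certificate's fields, so any post-signing edit to any field changes the message bytes and the check fails.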

Hash Chains: Nothing Was Reordered

Each checkpoint includes a SHA-256 hash that chains it to the previous checkpoint in the same agent session. The chain preimage concatenates the previous chain hash (or the literal string "genesis" for the first checkpoint), the checkpoint ID, verdict, thinking block hash, input commitment, and timestamp — all joined by pipe delimiters.

This creates a tamper-evident sequence. Changing any field in any checkpoint breaks the chain at that point and every subsequent link. Inserting a checkpoint between two existing ones is impossible without recomputing every downstream hash, which would invalidate their signatures. Reordering checkpoints breaks the linkage. The chain gives you temporal integrity — proof that the sequence of verdicts is the sequence that actually occurred.
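A minimal sketch of the chaining and recomputation, assuming the pipe-delimited preimage described above (the checkpoint shape and field names are illustrative):

```typescript
import { createHash } from "node:crypto";

const sha256hex = (s: string): string =>
  createHash("sha256").update(s, "utf8").digest("hex");

// Illustrative checkpoint shape; real certificates carry more fields.
interface Checkpoint {
  id: string;
  verdict: string;
  thinkingBlockHash: string;
  inputCommitment: string;
  timestamp: string;
  chainHash: string;
}

// Preimage: previous chain hash (or the literal "genesis"), then the
// checkpoint fields, joined by pipe delimiters.
function chainHash(prev: string | null, c: Omit<Checkpoint, "chainHash">): string {
  return sha256hex(
    [prev ?? "genesis", c.id, c.verdict, c.thinkingBlockHash, c.inputCommitment, c.timestamp].join("|")
  );
}

// Recompute every link; returns the index of the first broken link, or -1
// if the chain is intact.
function verifyChain(chain: Checkpoint[]): number {
  let prev: string | null = null;
  for (let i = 0; i < chain.length; i++) {
    if (chainHash(prev, chain[i]) !== chain[i].chainHash) return i;
    prev = chain[i].chainHash;
  }
  return -1;
}
```

Editing one checkpoint's verdict makes verifyChain report that index; hiding the edit would require recomputing every downstream hash, which would in turn invalidate the downstream signatures.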

Merkle Trees: Nothing Was Deleted

Per-agent append-only binary Merkle trees accumulate every checkpoint into a single root hash. Each checkpoint becomes a leaf — its leaf hash is computed from the checkpoint ID, verdict, thinking block hash, chain hash, and timestamp. Internal nodes are SHA-256 hashes of their children's concatenated hex strings. When the number of nodes at any level is odd, the last node is duplicated before pairing.

The Merkle tree provides completeness proofs. Given a checkpoint and an O(log N) inclusion proof — a list of sibling hashes along the path from the leaf to the root — anyone can verify that the checkpoint exists in the tree by walking the path and checking that the computed root matches the published root. If a checkpoint were deleted from the log, the root would change and all existing inclusion proofs would fail. This is the same structure that Certificate Transparency logs use to ensure no TLS certificate is secretly issued.
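The construction and proof walk can be sketched as follows. This is a simplified model of the structure described above: the real leaf hash covers the checkpoint fields, whereas here leaves are plain hashed strings.

```typescript
import { createHash } from "node:crypto";

const h = (s: string): string =>
  createHash("sha256").update(s, "utf8").digest("hex");

// Build the tree level by level; when a level has an odd number of nodes,
// duplicate the last node before pairing. Returns every level so sibling
// paths can be extracted.
function buildTree(leaves: string[]): string[][] {
  const levels = [leaves];
  while (levels[levels.length - 1].length > 1) {
    const cur = levels[levels.length - 1].slice();
    if (cur.length % 2 === 1) cur.push(cur[cur.length - 1]);
    const next: string[] = [];
    for (let i = 0; i < cur.length; i += 2) next.push(h(cur[i] + cur[i + 1]));
    levels.push(next);
  }
  return levels;
}

// O(log N) inclusion proof: one sibling hash (and its side) per level.
function inclusionProof(levels: string[][], index: number) {
  const proof: { hash: string; left: boolean }[] = [];
  let i = index;
  for (let lvl = 0; lvl < levels.length - 1; lvl++) {
    const nodes = levels[lvl].slice();
    if (nodes.length % 2 === 1) nodes.push(nodes[nodes.length - 1]);
    const sib = i % 2 === 0 ? i + 1 : i - 1;
    proof.push({ hash: nodes[sib], left: sib < i });
    i = Math.floor(i / 2);
  }
  return proof;
}

// Walk leaf-to-root: prepend left siblings, append right siblings.
function verifyInclusion(
  leaf: string,
  proof: { hash: string; left: boolean }[],
  root: string
): boolean {
  let acc = leaf;
  for (const p of proof) acc = p.left ? h(p.hash + acc) : h(acc + p.hash);
  return acc === root;
}
```

A verifier only needs the leaf, the sibling path, and the published root — never the other checkpoints.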

ZK Verdict Proofs: The Rules Were Applied Honestly

This is the most novel layer. A RISC Zero STARK proves that the verdict derivation rules — the logic that maps analysis observations to a verdict of clear, review_needed, or boundary_violation — were applied correctly to the committed inputs. The proof is generated inside a zkVM guest program that takes the analysis JSON, input commitment hashes, and rule parameters as private inputs, executes the deterministic verdict derivation logic, and produces a receipt that anyone can verify.

Boundary violation verdicts are always proven. All other verdicts are proven via 10% stochastic sampling. The proving decision is made after analysis completes, so the analysis pipeline cannot know in advance which checkpoints will be audited — creating a probabilistic deterrent against selective manipulation.
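The selection policy reduces to a few lines. The 10% rate and the always-prove rule for boundary violations come from the text; the function shape itself is a hypothetical sketch.

```typescript
type Verdict = "clear" | "review_needed" | "boundary_violation";

// Hypothetical sampling policy: boundary violations are always proven;
// all other verdicts are proven with probability 0.10. The random draw
// happens only after analysis completes, so upstream code cannot predict
// which checkpoints will be audited.
function shouldProve(verdict: Verdict, rand: () => number = Math.random): boolean {
  if (verdict === "boundary_violation") return true;
  return rand() < 0.1;
}
```

Injecting the random source makes the policy testable and auditable without changing production behavior.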

The Novel Contribution

The AI verifiability space has focused overwhelmingly on proving model inference — demonstrating that a neural network with specific weights produced a specific output for a specific input. This is a hard problem. Modern large language models have hundreds of billions of parameters. Proving a single forward pass through a transformer is computationally prohibitive. Estimates for full LLM inference proofs range from $1,000 to $10,000+ per proof, with proving times measured in hours.

We prove something different: the auditor's judgment.

The integrity checkpoint pipeline has two stages. First, an analysis model reads the agent's reasoning trace and the Alignment Card and produces structured observations — what concerns it found, what categories they fall into, how severe they are. Second, a deterministic verdict derivation function maps those observations to a final verdict using explicit rules: thresholds, category weights, severity escalation logic.

The first stage is LLM inference. It's expensive to prove, and proving it wouldn't tell you much — a valid proof of inference just means the model produced that output, not that the output is correct. The second stage is a small, deterministic function. It takes structured JSON as input and produces a verdict as output. It runs in milliseconds. And proving it in a zkVM costs approximately $0.005 per proof.
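To make the second stage concrete, here is a hypothetical sketch of a verdict derivation function. The real thresholds, category weights, and escalation rules live in the zkVM guest program; every number below is invented for illustration.

```typescript
type Verdict = "clear" | "review_needed" | "boundary_violation";

// Structured observation emitted by the analysis model (stage one).
interface Observation {
  category: string; // e.g. "data_exfiltration", "scope_creep" (illustrative)
  severity: number; // 0..1
}

// Illustrative category weights; the real values are rule parameters
// committed to in the proof.
const CATEGORY_WEIGHTS: Record<string, number> = {
  data_exfiltration: 1.0,
  scope_creep: 0.4,
  tone: 0.1,
};

function deriveVerdict(observations: Observation[]): Verdict {
  // Severity escalation: any single observation past a hard threshold
  // forces a boundary violation regardless of the aggregate score.
  if (observations.some((o) => o.severity >= 0.9)) return "boundary_violation";

  // Otherwise the weighted severity sum decides against fixed thresholds.
  const score = observations.reduce(
    (sum, o) => sum + o.severity * (CATEGORY_WEIGHTS[o.category] ?? 0.5),
    0
  );
  if (score >= 1.0) return "boundary_violation";
  if (score >= 0.3) return "review_needed";
  return "clear";
}
```

Because the function is pure and deterministic, proving it in a zkVM amounts to proving a short, branch-light computation over committed JSON — which is exactly why it is cheap.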

This is five to six orders of magnitude cheaper than proving the full inference. It occupies a new point in the design space: between doing nothing (the current industry default) and proving everything (economically infeasible). We prove the part that matters most for trust — that the rules mapping observations to verdicts were applied faithfully — at a cost that makes it practical to prove every high-stakes verdict and stochastically sample the rest.

See It

The showcase at mnemom.ai/showcase includes interactive visualizations of all four proof layers.

The certificate explorer displays a complete integrity certificate with collapsible sections for subject, claims, input commitments, and proofs. Each proof section shows the raw cryptographic data — signature bytes, chain hashes, Merkle siblings, STARK receipt metadata — alongside human-readable explanations of what each field means and what it proves.

The hash chain tamper simulation renders a chain of checkpoints as linked blocks. Click any block to simulate tampering — modifying its verdict field. The visualization shows the cascading failure in real time: the tampered block's chain hash no longer matches its recomputed value, and every subsequent block's prev_chain_hash linkage breaks. The red propagation makes the tamper-evidence property visceral rather than abstract.

The Merkle tree visualization displays the binary tree structure with highlighted inclusion proof paths. Select a leaf checkpoint and watch the verification path illuminate from leaf to root, showing each sibling hash being combined at every level. The root hash updates live as checkpoints are added, demonstrating the append-only accumulation property.

The 3D trust topology renders agents as nodes in a three-dimensional space, with edges representing coherence scores from AAP's Value Coherence Handshake and node colors indicating integrity status from AIP. The proof layer adds verification badges — visible indicators of which verdicts carry cryptographic evidence and which proof types are present.

Verify It Yourself

Every verification operation runs in your browser. No server calls. No trust required.

Ed25519 signature verification uses @noble/ed25519 to check the signature against the signed payload and the published public key. The library is a pure JavaScript implementation of the Ed25519 signature scheme — no WebCrypto API dependency, no platform-specific behavior, deterministic results across all environments.

Chain hash recomputation uses the Web Crypto API's SHA-256 implementation to recompute the chain hash from the checkpoint fields and compare it to the certificate's claimed chain hash. The preimage format is documented and stable: prevHash|checkpointId|verdict|thinkingBlockHash|inputCommitment|timestamp.

Merkle proof verification walks the inclusion proof from leaf to root, combining the running hash with each sibling at each level. If the sibling sits to the left, it's prepended; if to the right, it's appended. The final computed root must match the published root. This is a pure computation — no network access, no oracle, no trust assumption beyond SHA-256 collision resistance.

STARK verification uses a WebAssembly build of the RISC Zero verifier. The WASM module takes the proof receipt and the expected image ID (a commitment to the guest program's code) and returns a boolean. This is the same verification logic that runs on-chain in Ethereum STARK verification contracts, compiled to WASM for browser execution.

The Certificate Format

Integrity certificates follow a structured JSON format modeled on C2PA content credentials and W3C Verifiable Credentials. The design goals are machine readability, self-describing semantics, and independent verifiability.

A certificate contains five top-level sections:

  • @context — The schema identifier (https://mnemom.ai/aip/v1), following JSON-LD conventions for self-describing documents.
  • subject — What this certificate is about: the checkpoint ID, agent ID, session ID, and Alignment Card ID.
  • claims — The integrity verdict, concern categories, confidence score, reasoning summary, analysis model identifier, and analysis duration.
  • input_commitments — SHA-256 hashes of every analysis input: the thinking block, the Alignment Card, the declared values, the context window, and a combined commitment over all of them. These commitments bind the verdict to specific inputs without revealing those inputs.
  • proofs — The cryptographic evidence: Ed25519 signature (algorithm, key ID, signature value, signed payload), hash chain (current hash, previous hash, position), Merkle inclusion proof (leaf hash, leaf index, root, tree size, sibling path), and verdict derivation proof (method, image ID, receipt, journal, verification timestamp).
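A skeletal certificate following those sections might look like this. The field names track the description above, but every value is a placeholder invented for illustration, not the exact wire format:

```json
{
  "@context": "https://mnemom.ai/aip/v1",
  "subject": {
    "checkpoint_id": "chk_0042",
    "agent_id": "agent_7",
    "session_id": "sess_19",
    "alignment_card_id": "card_3"
  },
  "claims": {
    "verdict": "clear",
    "concern_categories": [],
    "confidence": 0.97,
    "reasoning_summary": "No boundary-relevant actions observed.",
    "analysis_model": "…",
    "analysis_duration_ms": 1840
  },
  "input_commitments": {
    "thinking_block": "sha256:…",
    "alignment_card": "sha256:…",
    "declared_values": "sha256:…",
    "context_window": "sha256:…",
    "combined": "sha256:…"
  },
  "proofs": {
    "signature": { "algorithm": "Ed25519", "key_id": "…", "value": "…", "signed_payload": "…" },
    "hash_chain": { "chain_hash": "…", "prev_chain_hash": "…", "position": 42 },
    "merkle_inclusion": { "leaf_hash": "…", "leaf_index": 41, "root": "…", "tree_size": 42, "siblings": ["…"] },
    "verdict_derivation": { "method": "risc0-stark", "image_id": "…", "receipt": "…", "journal": "…", "verified_at": "…" }
  },
  "verification": {
    "public_key_url": "https://api.mnemom.ai/v1/keys",
    "certificate_url": "…",
    "verify_url": "…"
  }
}
```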

The certificate also includes a verification section with URLs for the public key endpoint, the certificate retrieval endpoint, and the server-side verification endpoint. These are convenience links — all verification can be performed client-side without them.

The full specification is published at docs.mnemom.ai.

What We Don't Claim

Honesty about limitations is itself a form of integrity.

The proof layer proves honest computation of the verdict derivation rules. It does not prove that the analysis model's observations are correct. The LLM that reads an agent's reasoning trace and identifies concerns can make mistakes — it can miss a subtle boundary violation, misclassify a concern's severity, or hallucinate a problem that doesn't exist. The cryptographic proof covers the deterministic verdict derivation logic, not the probabilistic language model that produces the inputs to that logic.

Think of it like an auditor's report. We can prove that the auditor followed their stated methodology. We cannot prove that the methodology itself catches every issue. The signature proves who signed the verdict. The chain proves the verdict sequence wasn't reordered. The Merkle tree proves no verdicts were deleted. The ZK proof proves the verdict rules were applied faithfully to whatever the analysis model found. None of these prove the analysis model found the right things.

This is an honest trade-off. Proving full LLM inference would close this gap but costs $1,000+ per proof and takes hours. We chose to prove the part that is economically practical and most relevant to the trust question operators actually ask: "Were the rules applied honestly to the evidence that was found?"

We also don't claim immutability of the Merkle root itself. The root is published by our infrastructure. A sufficiently compromised operator could publish a different root. The Merkle tree proves internal consistency — that a given set of checkpoints produces a given root — but the root's authenticity currently depends on the publication channel. Future work on root anchoring (e.g., periodic on-chain commitments) would close this gap.

Competitive Landscape

The verifiable AI space is young and fragmented. Different projects prove different things at different costs:

| Approach | Who | What It Proves | Approximate Cost |
| --- | --- | --- | --- |
| TEE attestation | EQTY Lab | Code ran in a trusted execution environment | Hardware-dependent |
| Content provenance | Macaw, C2PA | Media was created/edited by specific software | Per-asset |
| Model inference proofs | Modulus Labs, Giza | A specific neural network produced a specific output | $1,000+/proof |
| ZK coprocessor | Lagrange | On-chain computation was performed correctly | Per-query |
| Verdict derivation | Mnemom | Auditor judgment rules were applied faithfully | ~$0.005/proof |

These approaches are complementary, not competitive. TEE attestation proves where code ran. Content provenance proves what created a media artifact. Inference proofs prove which model produced an output. We prove how the auditor reached its verdict. A complete trust stack might use several of these together.

Our contribution is identifying that verdict derivation — the deterministic rules that map structured observations to integrity verdicts — is a distinct proof target that is orders of magnitude cheaper than inference proofs while addressing the specific trust question that matters for agent integrity monitoring.

Enterprise Value

The EU AI Act's transparency obligations under Article 50 take effect in August 2026, requiring disclosure of AI interaction and AI-generated content; high-risk systems carry separate record-keeping and human-oversight obligations. Cryptographic integrity certificates provide a compliance-ready audit trail that goes beyond what the regulation requires — not just logging that monitoring occurred, but proving it was performed honestly.

For enterprises evaluating integrity monitoring vendors, the proof layer eliminates vendor lock-in as a trust concern. You don't have to trust Mnemom's infrastructure to trust the verdicts. The certificates are self-contained. The verification logic is open source. The cryptographic primitives are standard (Ed25519, SHA-256, RISC Zero STARKs). Any third party can independently verify any certificate without calling our API.

The standard certificate format also enables multi-vendor trust architectures. If an organization uses multiple integrity monitoring providers — or builds their own alongside a commercial solution — certificates from different providers can be compared, aggregated, and cross-verified using the same tooling. The @context field and structured proof sections make certificates machine-readable across implementations.

Get Started

The proof layer ships today in the Smoltbot gateway and the AIP TypeScript SDK.

  • Documentation: docs.mnemom.ai — full specification for the certificate format, proof construction, and verification procedures.
  • API reference: GET /v1/checkpoints/:id/certificate, POST /v1/verify, GET /v1/agents/:id/merkle-root, GET /v1/checkpoints/:id/inclusion-proof — all public, no authentication required.
  • Live showcase: mnemom.ai/showcase — interactive certificate explorer, chain tamper simulation, Merkle tree visualization, and 3D trust topology.
  • Whitepaper: The full technical specification covering proof construction, security model, threat analysis, and formal verification properties is available at docs.mnemom.ai/whitepaper.
  • Source code: github.com/mnemom — Apache licensed. The signing, chain, Merkle, proving, and certificate modules are in the api/src/analyze/ directory. The zkVM guest program and WASM verifier are in zkvm/.

Every integrity verdict now comes with cryptographic evidence. Verify everything yourself. Trust nothing.

Mnemom builds alignment and integrity infrastructure for autonomous agents. AAP and AIP are open source and available on npm and PyPI.

GitHub: github.com/mnemom · Live demo: mnemom.ai/showcase

Top comments (2)

Vihanga Nimsara

Useful

Alex Garden

Thanks @vihanga_nimsara_2004 — put it to use! I built this to support our own multi-agent worker pods; I kept running into the limitations of rules-based enforcement. Alignment is the only way!