Alex Garden

Building Cryptographic Trust Infrastructure for AI Agents

Last month, MIT published its AI Agent Index — a comprehensive study of 30 major AI agents across 240 safety and transparency criteria. The results were stark: 133 fields had no public information. Twenty-five agents had no safety evaluation results. Only one had cryptographic signing.

As someone building AI agent infrastructure, this confirmed what we suspected: the gap isn't in building agents (the tooling is excellent), it's in verifying them.

The Monitoring vs Verification Problem

The market's response has been monitoring solutions: behavioral baselines, drift detection, comprehensive logging. These are necessary but insufficient.

Here's why: monitoring tells you what happened, but it doesn't tell you whether the monitor itself is honest. Consider this architecture:

Agent → Oversight System → Report ("clear" or "violation")

How do you know the oversight system applied its rules correctly? How do you know it didn't report "clear" when evidence showed a boundary violation? How do you know checkpoints weren't deleted after the fact?

You can't. Unless the oversight system can prove its own honesty mathematically.

That's verification: "we can prove we checked, and you can verify the proof yourself."

The Technical Challenge

Building this requires solving several problems:

  1. Identity: What is the agent supposed to do?
  2. Integrity: Is it doing what it's supposed to do?
  3. Proof: Can we prove the integrity check was honest?
  4. Reputation: How trustworthy is this agent over time?
  5. Risk: Should we approve this specific action?
  6. Containment: How do we enforce decisions?

We built Mnemom as a six-layer stack addressing each.

Layer 1: Machine-Readable Identity

Every agent gets an Alignment Card — a machine-readable behavioral contract:

{
  "permitted": ["read_tickets", "draft_responses", "escalate_to_human"],
  "forbidden": ["access_payment_data", "issue_refunds"],
  "escalation_triggers": ["billing_request_over_500"],
  "values": ["accuracy", "empathy", "privacy"]
}

This becomes the reference document for all verification. No card, no verification.
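As a sketch of how such a card might be enforced, here is a hypothetical gate in TypeScript. The field names follow the JSON above; the function name and decision labels are illustrative assumptions, not Mnemom's actual API:

```typescript
// Hypothetical enforcement gate. Assumes the card's permitted/forbidden
// lists are authoritative; "unverifiable" encodes "no card, no verification".
interface AlignmentCard {
  permitted: string[];
  forbidden: string[];
  escalation_triggers: string[];
}

type Decision = "allow" | "deny" | "escalate" | "unverifiable";

function gate(
  card: AlignmentCard | null,
  action: string,
  triggers: string[] = []
): Decision {
  if (!card) return "unverifiable";                 // no card, no verification
  if (card.forbidden.includes(action)) return "deny"; // forbidden wins outright
  if (triggers.some((t) => card.escalation_triggers.includes(t)))
    return "escalate";                              // route to a human
  // default-deny: anything not explicitly permitted is refused
  return card.permitted.includes(action) ? "allow" : "deny";
}
```

Note the default-deny stance: an action absent from both lists is refused rather than allowed, which is the conservative reading of a behavioral contract.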

Layer 2: Real-Time Thinking Analysis

We intercept the agent's streaming response at the Cloudflare Workers level using ReadableStream.tee(). The client receives the response at wire speed, with zero added latency; the analysis runs in the background.

Crucially, we don't analyze outputs. We analyze thinking — the agent's reasoning trace between request and response. A separate model evaluates this against the Alignment Card:

{
  "verdict": "boundary_violation",
  "concerns": [
    {
      "category": "value_misalignment",
      "severity": "high",
      "description": "Agent considered payments API access, explicitly forbidden"
    }
  ],
  "confidence": 0.95
}

Three verdicts: clear, review_needed, or boundary_violation.
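The interception above can be sketched in a few lines. This is an illustrative minimal version, not Mnemom's actual Worker code: tee() duplicates the model's streaming body, one branch is returned to the client untouched, and the other is drained in the background for analysis.

```typescript
// Split an upstream model response: the client branch is forwarded
// unmodified, while the audit branch is drained asynchronously.
async function splitForAudit(
  upstream: ReadableStream<Uint8Array>
): Promise<{ client: ReadableStream<Uint8Array>; audit: Promise<string> }> {
  const [clientBranch, auditBranch] = upstream.tee();
  // Fire-and-forget: draining the audit branch adds no client latency.
  return { client: clientBranch, audit: drain(auditBranch) };
}

// Collect a byte stream into a string (the reasoning trace to analyze).
async function drain(stream: ReadableStream<Uint8Array>): Promise<string> {
  const reader = stream.getReader();
  const decoder = new TextDecoder();
  let text = "";
  for (;;) {
    const { done, value } = await reader.read();
    if (done) return text + decoder.decode(); // flush any trailing bytes
    text += decoder.decode(value, { stream: true });
  }
}
```

In the real system the drained trace would be handed to the separate evaluation model described above; here the split itself is the point.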

Layer 3: Four-Layer Cryptographic Attestation

This is where we diverge from monitoring solutions. Every integrity checkpoint passes through four independent cryptographic layers:

Ed25519 Signatures: Every checkpoint is cryptographically signed. Public keys are published. Anyone can verify, nobody can forge.

SHA-256 Hash Chains: Each checkpoint includes the hash of the previous checkpoint. Tamper with any checkpoint and the chain breaks.

Merkle Trees: An append-only binary tree accumulates all checkpoints. Proving a checkpoint exists requires O(log N) hashes — about 10 for 1,000 checkpoints.
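An inclusion proof is just the sibling hashes along the path from a leaf to the root — the O(log N) figure above. A minimal verifier, assuming plain concatenation of hex digests (production trees add leaf/interior-node domain separation to block second-preimage attacks):

```typescript
import { createHash } from "node:crypto";

const sha256 = (s: string) => createHash("sha256").update(s).digest("hex");

interface ProofStep {
  sibling: string; // hex digest of the sibling node at this level
  left: boolean;   // true if the sibling sits to the left of our path
}

// Recompute the root from the leaf and its sibling path; a checkpoint is
// "included" iff the recomputed root matches the published one.
function verifyInclusion(leaf: string, proof: ProofStep[], root: string): boolean {
  let acc = sha256(leaf);
  for (const step of proof) {
    acc = step.left ? sha256(step.sibling + acc) : sha256(acc + step.sibling);
  }
  return acc === root;
}
```

For a four-leaf tree the proof is two hashes; for 1,000 checkpoints, about ten, matching the O(log N) claim.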

Zero-Knowledge Proofs: A STARK proof re-derives the verdict from raw concerns using deterministic rules, ignoring the stated verdict entirely.

The key insight: we don't prove LLM inference was correct (computationally intractable). We prove the auditor's judgment was honestly derived from the LLM's output. This runs in ~10,000 RISC-V cycles with sub-second proof generation.
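The first two layers — Ed25519 signatures over SHA-256 hash-chained checkpoints — can be sketched with Node's built-in crypto. This is a simplified illustration (real checkpoints carry more metadata than a verdict), but it shows why tampering breaks the chain:

```typescript
import { createHash, generateKeyPairSync, sign, verify } from "node:crypto";

interface Checkpoint {
  verdict: string;
  prevHash: string; // hash of the previous checkpoint (genesis: all zeros)
  hash: string;     // SHA-256 over prevHash + verdict
  sig: Buffer;      // Ed25519 signature over the hash
}

const { publicKey, privateKey } = generateKeyPairSync("ed25519");

function appendCheckpoint(chain: Checkpoint[], verdict: string): Checkpoint {
  const prevHash = chain.length ? chain[chain.length - 1].hash : "0".repeat(64);
  const hash = createHash("sha256").update(prevHash + verdict).digest("hex");
  const sig = sign(null, Buffer.from(hash), privateKey); // Ed25519 takes no digest name
  const cp = { verdict, prevHash, hash, sig };
  chain.push(cp);
  return cp;
}

// Walk the chain: every link must hash correctly and carry a valid signature.
function chainValid(chain: Checkpoint[]): boolean {
  let prev = "0".repeat(64);
  for (const cp of chain) {
    const expected = createHash("sha256").update(prev + cp.verdict).digest("hex");
    if (cp.prevHash !== prev || cp.hash !== expected) return false;
    if (!verify(null, Buffer.from(cp.hash), publicKey, cp.sig)) return false;
    prev = cp.hash;
  }
  return true;
}
```

Rewriting any checkpoint's verdict invalidates its hash, and every later checkpoint's prevHash, at once.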

Layer 4: The Trust Score

Checkpoints accumulate into a composite reputation score — five weighted components mapping to a bond-rating scale (AAA to CCC):

  • Integrity ratio (40%)
  • Compliance with exponential decay (20%)
  • Drift stability (20%)
  • Trace completeness (10%)
  • Coherence compatibility (10%)

Every score is backed by the full attestation stack. Anyone can request Merkle inclusion proofs, verify signatures and hash chains, and for violations, verify STARK proofs.
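The weighted composite above can be sketched as follows. The weights come from the post; the rating cutoffs are assumptions, since Mnemom doesn't publish its exact bands. Weights are in integer basis points, anticipating the fixed-point determinism requirement noted in Layer 5:

```typescript
// Component weights in basis points (sum to 10000). Values from the post;
// the cutoff bands below are illustrative assumptions.
const WEIGHTS_BP: Record<string, number> = {
  integrity: 4000,  // integrity ratio, 40%
  compliance: 2000, // compliance with exponential decay, 20%
  drift: 2000,      // drift stability, 20%
  trace: 1000,      // trace completeness, 10%
  coherence: 1000,  // coherence compatibility, 10%
};

// Each component is scored 0..10000; the composite is also 0..10000.
// Integer-only arithmetic keeps the result bit-for-bit deterministic.
function trustScore(components: Record<string, number>): number {
  let score = 0;
  for (const [k, w] of Object.entries(WEIGHTS_BP)) {
    score += Math.floor((w * (components[k] ?? 0)) / 10000);
  }
  return score;
}

// Hypothetical bond-rating bands.
function rating(score: number): string {
  if (score >= 9000) return "AAA";
  if (score >= 8000) return "AA";
  if (score >= 7000) return "A";
  if (score >= 6000) return "BBB";
  if (score >= 5000) return "BB";
  if (score >= 4000) return "B";
  return "CCC";
}
```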

Layer 5: Context-Aware Risk Assessment

A trust score tells you general trustworthiness. Risk assessment tells you whether to approve this specific action for this agent in this context.

The same agent gets different risk scores for different actions. Financial transactions weight compliance heavily. Task delegation weights coherence. Tool invocation weights integrity and drift.
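One way to picture this is per-action weight profiles. The post names which component each action type weights heavily; the specific numbers here are illustrative assumptions:

```typescript
// Hypothetical per-action weight profiles (basis points, each summing
// to 10000). Which component dominates follows the post; the exact
// splits are invented for illustration.
const ACTION_PROFILES: Record<string, Record<string, number>> = {
  financial_transaction: { compliance: 6000, integrity: 2500, drift: 1500 },
  task_delegation: { coherence: 6000, integrity: 2500, compliance: 1500 },
  tool_invocation: { integrity: 4000, drift: 4000, compliance: 2000 },
};

// components scored 0..10000; returns a context-weighted score 0..10000,
// higher meaning safer to approve. Integer arithmetic only.
function actionRiskScore(
  action: string,
  components: Record<string, number>
): number {
  const profile = ACTION_PROFILES[action];
  if (!profile) throw new Error(`unknown action type: ${action}`);
  let score = 0;
  for (const [k, w] of Object.entries(profile)) {
    score += Math.floor((w * (components[k] ?? 0)) / 10000);
  }
  return score;
}
```

An agent with strong compliance but weak coherence scores well for financial transactions and poorly for task delegation — same agent, different answers.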

For teams of agents, we model three types of systemic risk:

  • Aggregate quality (tail-risk weighting)
  • Coherence quality (value alignment variance)
  • Structural risk (contagion modeling)

All computations use fixed-point arithmetic for determinism — essential for zero-knowledge provability.

Layer 6: Real-Time Containment

Detection without enforcement is monitoring. We provide pause, kill, and resume controls with three enforcement modes:

  • monitor: Log only
  • nudge: Inject corrective guidance
  • enforce: Hard 403 block

Auto-containment triggers after N consecutive violations. When an agent is contained, structural risk analysis identifies exposed peers for prophylactic protection.
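The graduated response can be sketched as a small state machine. The mode names and the "N consecutive violations" trigger come from the post; the threshold value, class shape, and return codes are illustrative assumptions:

```typescript
type Mode = "monitor" | "nudge" | "enforce";
type Verdict = "clear" | "review_needed" | "boundary_violation";

class ContainmentController {
  private consecutiveViolations = 0;
  contained = false;

  constructor(private mode: Mode, private threshold: number) {}

  onVerdict(verdict: Verdict): { status: number; action: string } {
    if (verdict !== "boundary_violation") {
      this.consecutiveViolations = 0; // only *consecutive* violations count
      return { status: 200, action: "pass" };
    }
    this.consecutiveViolations++;
    if (this.consecutiveViolations >= this.threshold) this.contained = true;
    // enforce mode (or a contained agent) hard-blocks with a 403
    if (this.contained || this.mode === "enforce")
      return { status: 403, action: "block" };
    if (this.mode === "nudge")
      return { status: 200, action: "inject_guidance" };
    return { status: 200, action: "log_only" }; // monitor
  }
}
```

Note that even in the softer modes, crossing the threshold flips the agent into hard blocking until an operator resumes it.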

One Environment Variable

Despite the complexity above, using this is simple:

npm install -g smoltbot
smoltbot init

The CLI configures your agent to use the Mnemom gateway:

export OPENAI_BASE_URL=https://gateway.mnemom.ai/openai/v1

Your code doesn't change. Your users' experience doesn't change. But your agent now has:

  • Cryptographic integrity attestation
  • Public trust score and reputation page
  • Embeddable trust badges
  • Zero-knowledge proven risk assessments
  • Real-time containment controls

Why This Matters Now

Three converging factors:

  1. The gap is documented: MIT's study, WEF governance frameworks, and EU AI Act Article 50 all identify the same missing piece — verifiable trust infrastructure.

  2. The market fragmented: Solutions exist for pieces (malware scanning, prompt injection blocking, behavioral baselines, identity credentials, on-chain reputation), but no unified stack.

  3. The proof is practical: ZK proofs of safety judgments aren't theoretical anymore. SP1 generates production-ready STARK proofs in under a second.

What's Live

Everything described is deployed:

  • Multi-provider gateway (Anthropic, OpenAI, Gemini)
  • Full attestation pipeline with cryptographic proofs
  • Trust scoring and public directory
  • Team risk assessment with contagion modeling
  • Real-time containment and graduated response
  • Enterprise features (RBAC, SSO, compliance exports)

Try the interactive showcase at mnemom.ai/showcase or point your agent at the gateway directly.

The credit check for AI agents is live.


Originally published on mnemom.ai
