<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alex Garden</title>
    <description>The latest articles on DEV Community by Alex Garden (@alexgardenmnemom).</description>
    <link>https://dev.to/alexgardenmnemom</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3781957%2F27252793-33cc-4e88-bca0-57ca748f4fa2.png</url>
      <title>DEV Community: Alex Garden</title>
      <link>https://dev.to/alexgardenmnemom</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alexgardenmnemom"/>
    <language>en</language>
    <item>
      <title>Building a Live Adversarial Arena for AI Safety Testing</title>
      <dc:creator>Alex Garden</dc:creator>
      <pubDate>Tue, 10 Mar 2026 20:28:45 +0000</pubDate>
      <link>https://dev.to/alexgardenmnemom/building-a-live-adversarial-arena-for-ai-safety-testing-14db</link>
      <guid>https://dev.to/alexgardenmnemom/building-a-live-adversarial-arena-for-ai-safety-testing-14db</guid>
      <description>&lt;p&gt;Everyone talks about red teaming AI agents. Few do it continuously. None do it with cryptographic proof.&lt;/p&gt;

&lt;p&gt;We built something different: a live adversarial arena where 15 agents attack our governance infrastructure 24/7. Every detection is provable. The current detection rate — 91.8% — updates in real-time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Static Testing Problem
&lt;/h2&gt;

&lt;p&gt;The standard AI safety playbook looks familiar: hire a red team, run a battery of tests, produce a report, file it with compliance. Repeat annually, maybe quarterly if you're diligent.&lt;/p&gt;

&lt;p&gt;This is how we tested bridges in the 1800s. Load it once, write down whether it held, move on.&lt;/p&gt;

&lt;p&gt;The problem is obvious to anyone shipping AI agents in production: they don't sit still. They drift. They encounter novel inputs. They operate under adversarial conditions that no static benchmark can anticipate.&lt;/p&gt;

&lt;p&gt;A model that passes every eval on Tuesday can hallucinate policy violations on Wednesday because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The prompt distribution shifted&lt;/li&gt;
&lt;li&gt;An attacker found a new angle&lt;/li&gt;
&lt;li&gt;The reasoning patterns drifted over time&lt;/li&gt;
&lt;li&gt;The context changed in ways the benchmark didn't anticipate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Static red team reports are stale on delivery. "We tested it once" isn't governance — it's compliance theater.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Proving Ground Architecture
&lt;/h2&gt;

&lt;p&gt;We built something different: a continuous adversarial stress test of our AIP (AI Integrity Protocol) governance pipeline.&lt;/p&gt;

&lt;p&gt;Fifteen adversarial agents run around the clock, probing for weaknesses across six technique categories:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Reasoning Deception (87.7% detection rate)
&lt;/h3&gt;

&lt;p&gt;Generating plausible but misleading reasoning chains that pass surface analysis. The attacker injects prompts that cause an authorized agent to take authorized actions for unauthorized purposes.&lt;/p&gt;

&lt;p&gt;The action looks legitimate. The intent isn't. This is the hardest class to detect because nothing in the action itself is wrong — you have to understand the reasoning behind it.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Drift Evasion (93.4% detection rate)
&lt;/h3&gt;

&lt;p&gt;Subtle reasoning drift across conversational turns. No single message violates policy. But over five, ten, fifteen turns, the agent's reasoning has drifted from its behavioral contract.&lt;/p&gt;

&lt;p&gt;This is the boiling frog of agent safety — invisible to systems that only evaluate individual outputs.&lt;/p&gt;
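&lt;p&gt;A minimal sketch of how such turn-by-turn accumulation can be caught, assuming a hypothetical per-turn deviation score in the 0..1 range (the real AIP analyzers are not public). The point: no single turn crosses the threshold, but the exponentially weighted trend does.&lt;/p&gt;

```python
# Illustrative sketch, not the actual AIP drift analyzer.
# deviations[i] is a hypothetical 0..1 score of how far turn i
# strays from the agent's behavioral contract.

def detect_drift(deviations, alpha=0.3, threshold=0.5):
    """Exponentially weighted drift score; flags slow multi-turn drift."""
    score = 0.0
    for d in deviations:
        # blend the newest deviation into the running trend
        score = alpha * d + (1 - alpha) * score
        if score > threshold:
            return True
    return False

# No single early turn is alarming on its own, but the trend is:
per_turn = [0.2, 0.3, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65]
```

A per-output filter that only looks at the latest turn would never fire on this sequence; the running score is what makes the boiling frog visible.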

&lt;h3&gt;
  
  
  3. Structural Evasion (81.6% detection rate)
&lt;/h3&gt;

&lt;p&gt;Exploiting trace format, minimizing data, or splitting behavior across sessions to dodge analysis entirely. If the agent doesn't reason visibly, there's nothing to analyze.&lt;/p&gt;

&lt;p&gt;This tests whether governance can treat the absence of evidence as a signal in its own right: an agent that avoids reasoning visibly should itself be flagged.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Metadata Exploitation (71.9% detection rate)
&lt;/h3&gt;

&lt;p&gt;Manipulating trace context, timestamps, or session data to mislead analysis. This is adversarial AI aimed at the auditor, not the agent.&lt;/p&gt;

&lt;p&gt;If your governance relies on an LLM to evaluate another LLM's reasoning, the attacker's next move is to compromise the evaluator.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Value Manipulation (83.9% detection rate)
&lt;/h3&gt;

&lt;p&gt;Exploiting the gap between declared values and actual behavior. The agent says what you want to hear while doing something else.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Policy Circumvention (100% detection rate)
&lt;/h3&gt;

&lt;p&gt;Attempts to operate outside policy boundaries without triggering enforcement. Boundary probing, undeclared tools, escalation suppression.&lt;/p&gt;

&lt;p&gt;Every one caught so far.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cryptographic Attestation
&lt;/h2&gt;

&lt;p&gt;Here's what makes this different from traditional red teaming: every detection is cryptographically attested.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ed25519 signatures&lt;/strong&gt; on detection events&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SHA-256 hash chains&lt;/strong&gt; linking evidence to conclusions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SP1 zero-knowledge proofs&lt;/strong&gt; that detection was honestly derived&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not "we said we caught it." Proof that the catching was honest and the evidence wasn't tampered with.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Mutation Mechanic
&lt;/h2&gt;

&lt;p&gt;When the 24-hour rolling average detection rate crosses 90%, the red team agents gain the ability to mutate. They analyze which strategies were caught, how they were caught, and evolve new approaches with awareness of the defense.&lt;/p&gt;

&lt;p&gt;This creates adversarial evolutionary pressure. The defense must keep improving or get outpaced.&lt;/p&gt;

&lt;p&gt;The "Days Since Evasion" counter makes this arms race visible. When something gets through, the counter resets to zero — and the evasion technique is published as a detection recipe.&lt;/p&gt;

&lt;p&gt;Most companies hide their breaches. We announce ours.&lt;/p&gt;
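&lt;p&gt;The mutation trigger can be sketched as a rolling window. The 24-hour window and 90% threshold come from the description above; the class and method names are illustrative, not Mnemom's actual implementation.&lt;/p&gt;

```python
import time
from collections import deque

WINDOW_SECONDS = 24 * 3600  # 24-hour rolling window

class RollingDetectionRate:
    """Illustrative sketch of the red-team mutation unlock condition."""

    def __init__(self, threshold=0.90):
        self.events = deque()  # (timestamp, detected: bool)
        self.threshold = threshold

    def record(self, detected, now=None):
        now = time.time() if now is None else now
        self.events.append((now, detected))
        # evict events that have aged out of the 24-hour window
        while self.events and now - self.events[0][0] > WINDOW_SECONDS:
            self.events.popleft()

    def rate(self):
        if not self.events:
            return 0.0
        return sum(1 for _, d in self.events if d) / len(self.events)

    def mutation_unlocked(self):
        # red-team agents may mutate once detection crosses the threshold
        return self.rate() >= self.threshold
```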

&lt;h2&gt;
  
  
  Live Numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Current Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total attempts&lt;/td&gt;
&lt;td&gt;3,200+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Detection rate&lt;/td&gt;
&lt;td&gt;91.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Average analysis latency&lt;/td&gt;
&lt;td&gt;13.8s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Days since evasion&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mutations&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That 91.8% is live. It changes every time an adversary runs. When we launched the arena, detection started at roughly 40%. The defense evolution chart shows the climb: 40% to 70% in twelve hours, stabilizing above 85% by day two, crossing 90% on day three.&lt;/p&gt;

&lt;p&gt;We could have waited until detection hit 99%. We could have cherry-picked the attack classes. We could have published the number that looks best on a slide deck.&lt;/p&gt;

&lt;p&gt;We published the live number instead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;There are three main approaches to AI agent safety today:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enforcement-first&lt;/strong&gt; approaches make dangerous actions impossible by construction. Valuable, but enforcement without detection is blind to novel attacks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observability platforms&lt;/strong&gt; log what happened. Good for forensics, but logs are mutable and require trusting the vendor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Static benchmarks&lt;/strong&gt; measure point-in-time performance. An agent that passes today can drift tomorrow.&lt;/p&gt;

&lt;p&gt;The arena doesn't replace these approaches. It complements them with continuous, adversarial, cryptographically provable governance testing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation Details
&lt;/h2&gt;

&lt;p&gt;For developers interested in the technical implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Simplified detection pipeline&lt;/span&gt;
&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;DetectionPipeline&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;analyzers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;Box&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;dyn&lt;/span&gt; &lt;span class="n"&gt;IntegrityAnalyzer&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;attestation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AttestationService&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="n"&gt;DetectionPipeline&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;analyze_trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;AgentTrace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;DetectionResult&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;analysis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.run_parallel_analysis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="nf"&gt;.aggregate_confidence&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.threshold&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;attestation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.attestation&lt;/span&gt;&lt;span class="nf"&gt;.sign_detection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;DetectionResult&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Detected&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attestation&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;DetectionResult&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Clean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: continuous testing reveals gaps that static benchmarks miss. The adversarial pressure ensures those gaps get smaller over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The arena is live now. Detection recipes publish automatically when evasions are caught and patched. We're exploring opening it to external adversaries — if you think you can beat our detection, we want to know.&lt;/p&gt;

&lt;p&gt;Every mutation is a public research contribution to adversarial AI safety. The industry needs more of this kind of open research.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.mnemom.ai/blog/mnemom-research/red-team-arena-live-adversarial-governance" rel="noopener noreferrer"&gt;mnemom.ai&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>redteam</category>
      <category>testing</category>
    </item>
    <item>
      <title>Building Trust Infrastructure for the Agentic Economy: A Response to Stripe's Five Levels</title>
      <dc:creator>Alex Garden</dc:creator>
      <pubDate>Wed, 04 Mar 2026 05:06:12 +0000</pubDate>
      <link>https://dev.to/alexgardenmnemom/building-trust-infrastructure-for-the-agentic-economy-a-response-to-stripes-five-levels-371o</link>
      <guid>https://dev.to/alexgardenmnemom/building-trust-infrastructure-for-the-agentic-economy-a-response-to-stripes-five-levels-371o</guid>
      <description>&lt;p&gt;Stripe just published their annual letter describing five levels of agentic commerce — from simple form-filling agents to autonomous agents that anticipate your needs. But there's a critical piece missing from this vision: trust infrastructure.&lt;/p&gt;

&lt;p&gt;As someone building trust systems for autonomous agents, I see a fundamental gap between Stripe's payment infrastructure (which answers "can this transaction be processed?") and what the agentic economy actually needs (which answers "should this agent be trusted to initiate it?").&lt;/p&gt;

&lt;h2&gt;
  
  
  The Trust Escalation Ladder
&lt;/h2&gt;

&lt;p&gt;Stripe's five levels aren't just about increasing autonomy — they're about exponential increases in trust requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Level 1&lt;/strong&gt;: Agent fills out a web form. Trust requirement: essentially zero.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Level 2&lt;/strong&gt;: Agent searches and reasons about purchases. You're trusting its judgment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Level 3&lt;/strong&gt;: Agent remembers your preferences and history. You're trusting it with persistent knowledge.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Level 4&lt;/strong&gt;: Agent gets $400 and instructions to "get back-to-school shopping done." You're trusting financial delegation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Level 5&lt;/strong&gt;: Agent acts on its own anticipation. You're trusting it with your intent before you've articulated it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each level demands more trust. But what infrastructure exists to justify that trust?&lt;/p&gt;

&lt;h2&gt;
  
  
  Why KYC Doesn't Work for Agents
&lt;/h2&gt;

&lt;p&gt;Know Your Customer (KYC) works because humans have persistent, verifiable identities. Agents don't. An agent is code running in a container — potentially ephemeral, potentially modified between sessions, potentially operating under instructions that conflict with what its owner intends.&lt;/p&gt;

&lt;p&gt;What we need is &lt;strong&gt;KYA — Know Your Agent&lt;/strong&gt;. Not who built the agent, but what has this agent actually done? How has it behaved over time? Can you prove it?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Self-Certification Problem
&lt;/h2&gt;

&lt;p&gt;Every agent safety system today operates on the same broken model: the vendor monitors the agent, generates logs, and reports on what happened. This requires trusting the vendor who generated the logs.&lt;/p&gt;

&lt;p&gt;Consider financial auditing. No public company self-certifies its financial statements. An independent auditor verifies them. The integrity of the system depends on the independence of the verification.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trust Score: Credit Ratings for Agents
&lt;/h2&gt;

&lt;p&gt;We built Mnemom around a different principle: cryptographically verified behavioral reputation.&lt;/p&gt;

&lt;p&gt;Trust Score is a 0-1000 rating computed over five weighted components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Integrity ratio&lt;/li&gt;
&lt;li&gt;Compliance history (with exponential decay)&lt;/li&gt;
&lt;li&gt;Drift stability&lt;/li&gt;
&lt;li&gt;Trace completeness
&lt;/li&gt;
&lt;li&gt;Coherence compatibility&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But unlike credit scores, Trust Score has no central bureau. The underlying evidence is cryptographically signed and hash-chained. Anyone can verify it independently — no API call, no internet connection, no trust in the scoring vendor required.&lt;/p&gt;
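&lt;p&gt;To make the five components concrete, here is an illustrative aggregation. The component names and the 0-1000 scale come from the description above; the weights and the decay half-life are assumptions for the sketch, not Mnemom's published formula.&lt;/p&gt;

```python
# Illustrative Trust Score aggregation. Each component is a 0..1 value;
# the weights below are assumptions, not the actual published weighting.

WEIGHTS = {
    "integrity_ratio": 0.30,
    "compliance_history": 0.25,
    "drift_stability": 0.20,
    "trace_completeness": 0.15,
    "coherence_compatibility": 0.10,
}

def decayed_compliance(violation_ages_days, half_life_days=30.0):
    """Exponential decay: a 30-day-old violation counts half as much
    as a fresh one (half-life is an assumed parameter)."""
    penalty = sum(0.5 ** (age / half_life_days) for age in violation_ages_days)
    return max(0.0, 1.0 - 0.1 * penalty)

def trust_score(components):
    """Weighted sum of the five components, scaled to 0-1000."""
    raw = sum(WEIGHTS[k] * components[k] for k in WEIGHTS)
    return round(raw * 1000)
```

The decay term is what keeps the score recoverable: old violations fade rather than permanently capping an agent's reputation.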

&lt;h2&gt;
  
  
  The Technical Implementation
&lt;/h2&gt;

&lt;p&gt;Here's the core insight that makes this practical: proving that a full LLM inference was correct is computationally intractable (billions of RISC-V cycles). But proving that an auditor honestly applied its rules to the LLM's output? That's roughly 10,000 RISC-V cycles.&lt;/p&gt;

&lt;p&gt;We don't prove the model was correct. We prove the auditor's judgment was honest.&lt;/p&gt;

&lt;p&gt;Every integrity checkpoint receives four-layer cryptographic attestation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Ed25519 digital signatures&lt;/strong&gt;: This gateway produced this checkpoint&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SHA-256 hash chains&lt;/strong&gt;: No checkpoint was deleted or reordered&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Merkle tree accumulator&lt;/strong&gt;: Any checkpoint is verifiable without downloading full history&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SP1 STARK zero-knowledge proofs&lt;/strong&gt;: The verdict was honestly derived from evidence
&lt;/li&gt;
&lt;/ol&gt;
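&lt;p&gt;Layer 3 is what allows verifying a single checkpoint without the full history: a logarithmic-size Merkle proof suffices. A Python sketch of the verification step follows; the leaf encoding and domain prefix are illustrative, not the actual AIP format.&lt;/p&gt;

```python
import hashlib

# Illustrative Merkle proof check: one checkpoint is verified against
# the published root using only its sibling hashes, not the full log.

def h(data):
    return hashlib.sha256(data).digest()

def leaf_hash(checkpoint_bytes):
    # domain-separated leaf encoding (prefix is an assumption)
    return h(b"leaf:" + checkpoint_bytes)

def verify_merkle_proof(checkpoint_bytes, proof, root):
    """proof: list of (sibling_hash, sibling_is_left) pairs, leaf to root."""
    node = leaf_hash(checkpoint_bytes)
    for sibling, is_left in proof:
        node = h(sibling + node) if is_left else h(node + sibling)
    return node == root
```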

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simple integration example
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mnemom&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Gateway&lt;/span&gt;

&lt;span class="c1"&gt;# Zero code changes to existing agent
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Gateway&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-3-5-sonnet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./governance.yaml&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Agent gets cryptographic identity automatically
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-3-5-sonnet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Help me plan dinner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Behavioral monitoring happens transparently
# Trust Score updates in real-time
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What Each Level Actually Needs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Level 1 — Identity&lt;/strong&gt;: Every agent needs cryptographic identity before its first action. Our gateway assigns one with a single environment variable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 2 — Behavioral monitoring&lt;/strong&gt;: Agents reason and search. You need to know what they're thinking, not just what they output. Our integrity protocol intercepts internal reasoning before actions execute.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 3 — Persistent reputation&lt;/strong&gt;: The agent remembers your preferences. You need to remember its behavior. Trust Score provides portable reputation that follows agents across sessions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 4 — Governance&lt;/strong&gt;: Full delegation requires full governance. Not just monitoring — policy enforcement, lifecycle management, trust recovery.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 5 — Cryptographic proof&lt;/strong&gt;: Anticipatory action demands highest assurance. Trust Scores published to Base L2 via ERC-8004 smart contracts — immutable, publicly queryable, independently verifiable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The SSL Moment
&lt;/h2&gt;

&lt;p&gt;E-commerce didn't scale because browsers could display web pages. It scaled because browsers could display a padlock. That padlock didn't protect the transaction — it told users the connection was verified.&lt;/p&gt;

&lt;p&gt;The agentic economy is pre-padlock. Agents can browse, reason, and buy — but there's no padlock. No verification that the agent connecting to an API is trustworthy.&lt;/p&gt;

&lt;p&gt;Trust Score is the padlock for the agentic economy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;The infrastructure is open source (Apache 2.0) and shipping today:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; @mnemom/gateway  &lt;span class="c"&gt;# Node.js&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;mnemom           &lt;span class="c"&gt;# Python&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have live agents with public Trust Score pages, cryptographic certificates, and on-chain reputation records. The future of commerce is agentic — but it needs to be trustworthy first.&lt;/p&gt;

&lt;p&gt;The question isn't whether agents will handle most internet transactions. The question is whether we'll have the trust infrastructure to support them when they do.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.mnemom.ai/blog/mnemom-research/dear-patrick-and-john-you-built-the-rails-who-builds-the-trust" rel="noopener noreferrer"&gt;mnemom.ai&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>blockchain</category>
      <category>fintech</category>
    </item>
    <item>
      <title>Building Proactive AI Agent Governance: Policy Engines in the Request Pipeline</title>
      <dc:creator>Alex Garden</dc:creator>
      <pubDate>Mon, 02 Mar 2026 18:06:19 +0000</pubDate>
      <link>https://dev.to/alexgardenmnemom/building-proactive-ai-agent-governance-policy-engines-in-the-request-pipeline-16if</link>
      <guid>https://dev.to/alexgardenmnemom/building-proactive-ai-agent-governance-policy-engines-in-the-request-pipeline-16if</guid>
      <description>&lt;p&gt;It’s becoming increasingly clear to me that the world needs a governance system for complex, highly autonomous AI systems such as self-driving vehicles. But looking at current governance systems, all of them do one thing in common: they react after something has already happened, and they record everything that has occurred in a log file, with the vague hope that perhaps someone will read the log file and perhaps identify a pattern. This post-reactive approach to what can be called a “regulatory bank” is akin to having a bank that records every transaction but doesn’t have any preventative controls in place to stop a fraud transaction from occurring in the first place, with the knowledge that you’ll only find out something has gone wrong after it has already happened.&lt;/p&gt;

&lt;p&gt;This post is a preview of a new system we are building at Mnemom. The idea is to shift governance earlier in the request pipeline, so policy is evaluated before an agent acts.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Monitoring vs Governance Problem
&lt;/h2&gt;

&lt;p&gt;Most AI governance solutions are like security cameras. They show who did what, alert on unusual activity, and surface dashboards for executives deciding next steps. By then it’s usually too late: the damage is already done.&lt;/p&gt;

&lt;p&gt;This creates several problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reactive response&lt;/strong&gt;: Damage happens before detection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human bottleneck&lt;/strong&gt;: Every decision requires human review&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuration confusion&lt;/strong&gt;: Tool additions look like policy violations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trust erosion&lt;/strong&gt;: Once a score drops, it stays dropped&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What we really needed was a firewall: a set of rules that prevents the harmful action from happening in the first place.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Semantic Gap Challenge
&lt;/h2&gt;

&lt;p&gt;The core technical challenge is the semantic gap. Agent alignment cards declare capabilities in a human-friendly format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"bounded_actions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"inference"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"read"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"write"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"web_fetch"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But actual tools have specific, implementation-dependent names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mcp__browser__navigate
execute_python_code
search_web
read_file
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is a disconnect between the semantic intent of an action (web_fetch) and the way that action is actually performed (mcp__browser__navigate). Reading a trace, an observer has to guess which tool implemented the action declared in the card, and that guesswork leads to all sorts of wrong conclusions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Policy DSL: Bridging Intent and Implementation
&lt;/h2&gt;

&lt;p&gt;Our solution is a policy Domain Specific Language that explicitly maps semantic capabilities to tool patterns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;capability_mappings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;web_browsing&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mcp__browser__*"&lt;/span&gt;
&lt;span class="na"&gt;card_actions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;web_fetch"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;web_search"&lt;/span&gt;

&lt;span class="na"&gt;file_reading&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mcp__filesystem__read*"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mcp__filesystem__list*"&lt;/span&gt;
&lt;span class="na"&gt;card_actions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read_file"&lt;/span&gt;

&lt;span class="na"&gt;forbidden&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mcp__filesystem__delete*"&lt;/span&gt;
&lt;span class="na"&gt;reason&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;File&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;deletion&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;not&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;permitted"&lt;/span&gt;
&lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;critical"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mcp__shell__*"&lt;/span&gt;
&lt;span class="na"&gt;reason&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Shell&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;execution&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;not&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;permitted"&lt;/span&gt;
&lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high"&lt;/span&gt;

&lt;span class="na"&gt;defaults&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;unmapped_tool_action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;warn"&lt;/span&gt;
&lt;span class="na"&gt;grace_period_hours&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;24&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is more than configuration; it is a governance contract. The alignment card declares our intent, the policy maps that intent onto the platform, and the engine enforces it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three-Checkpoint Architecture
&lt;/h2&gt;

&lt;p&gt;The same policy engine evaluates the same rules at three different points:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. CI/CD (Prevention)
&lt;/h3&gt;

&lt;p&gt;Before deployment, check that the agent's tool manifest is consistent with the alignment card and policy it will be deployed under:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @mnemom/smoltbot policy evaluate &lt;span class="nt"&gt;--card&lt;/span&gt; card.json &lt;span class="nt"&gt;--tools&lt;/span&gt; manifest.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A policy mismatch fails the build, so misconfigured agents are caught before they are ever deployed.&lt;/p&gt;
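
&lt;p&gt;For example, a CI step wiring this in might look like the following sketch (the workflow scaffolding is illustrative; only the evaluate command comes from our tooling):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Hypothetical GitHub Actions step
- name: Evaluate agent policy
  run: npx @mnemom/smoltbot policy evaluate --card card.json --tools manifest.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;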

&lt;h3&gt;
  
  
  2. Gateway (Enforcement)
&lt;/h3&gt;

&lt;p&gt;The gateway hooks into the request pipeline, extracts the tool names from each incoming API request, and validates them against the policy in real time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Supports Anthropic, OpenAI, Gemini formats&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;verdict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;policyEngine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
&lt;span class="na"&gt;toolNames&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;extractToolNames&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="na"&gt;agentId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;x-agent-id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;enforce&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="c1"&gt;// or 'warn', 'off'&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;action&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;reject&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;reason&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;403&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;X-Policy-Verdict&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;rejected&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Policy evaluation runs concurrently with quota allocation, so it adds no extra latency to the hot path.&lt;/p&gt;
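
&lt;p&gt;Conceptually, the two checks can simply be raced side by side; a minimal sketch (the quota service name is an assumption, not the real API):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Evaluate policy and resolve quota concurrently;
// total latency is bounded by the slower of the two
const [verdict, quota] = await Promise.all([
  policyEngine.evaluate({ toolNames, agentId, mode: 'enforce' }),
  quotaService.resolve(agentId)  // hypothetical quota call
]);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;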

&lt;h3&gt;
  
  
  3. Observer (Verification)
&lt;/h3&gt;

&lt;p&gt;After execution, traces are compared against the same policy rules. We call these three checkpoints together the Auditing Process, and we distinguish between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Card gaps&lt;/strong&gt;: Tools the agent uses that aren’t declared in its alignment card (a configuration error)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Behavior gaps&lt;/strong&gt;: Tools the agent uses that the policy explicitly forbids (a policy violation)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The distinction matters for what happens next.&lt;/p&gt;
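
&lt;p&gt;As a hypothetical illustration, an observer finding could carry that classification explicitly (field names are ours, not the actual schema):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Card gap: tool is permitted by policy but missing from the card
{ "tool": "mcp__filesystem__read_file", "gap": "card", "next_step": "fix_card" }

// Behavior gap: tool is forbidden by policy
{ "tool": "mcp__shell__exec", "gap": "behavior", "next_step": "violation" }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;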

&lt;h2&gt;
  
  
  Trust Recovery: Learning from Configuration Errors
&lt;/h2&gt;

&lt;p&gt;Before this system, a common scenario would break trust scores permanently:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Developer adds new MCP tool to agent&lt;/li&gt;
&lt;li&gt;Agent starts using the tool&lt;/li&gt;
&lt;li&gt;Observer flags &lt;code&gt;UNBOUNDED_ACTION&lt;/code&gt; violations&lt;/li&gt;
&lt;li&gt;Mnemom Trust Rating drops&lt;/li&gt;
&lt;li&gt;Developer fixes alignment card&lt;/li&gt;
&lt;li&gt;Trust score stays low (damage was permanent)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The violations were real, but they stemmed from a configuration gap rather than from agent misbehavior. Our new system addresses this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// When violations are reclassified from behavior gaps to card gaps&lt;/span&gt;
&lt;span class="nx"&gt;POST&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;v1&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;agents&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="sr"&gt;/reclassif&lt;/span&gt;&lt;span class="err"&gt;y
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reclassification triggers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trust score recalculation (violations excluded)&lt;/li&gt;
&lt;li&gt;Downstream agent score updates (transitive trust recovery)&lt;/li&gt;
&lt;li&gt;Proof reissuance (if affected violations were in proven sessions)&lt;/li&gt;
&lt;li&gt;OTel event emission with before/after scores&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Grace periods help here too. When the gateway encounters a tool that is not in the inventory, it grants a configurable grace period (24 hours by default) before reporting a violation.&lt;/p&gt;
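
&lt;p&gt;The grace window is straightforward to implement with one Redis key per agent/tool pair (a sketch; the key naming and client API are assumptions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Record the first sighting of an unmapped tool exactly once (NX),
// then compare elapsed time against the configured grace period
const key = `grace:${agentId}:${toolName}`;
await redis.set(key, Date.now(), 'NX');
const firstSeen = Number(await redis.get(key));
const inGrace = Date.now() - firstSeen &amp;lt; graceHours * 3600 * 1000;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;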

&lt;h2&gt;
  
  
  Intelligence Layer: Predicting Multi-Agent Failures
&lt;/h2&gt;

&lt;p&gt;Detection tells you what happened. Prediction tells you what will happen.&lt;/p&gt;

&lt;p&gt;We built an intelligence layer on top of N-way coherence analysis that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Extracts fault lines&lt;/strong&gt; from agent value conflicts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forecasts risks&lt;/strong&gt; for specific team compositions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recommends policies&lt;/strong&gt; to mitigate predicted failures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Applies transaction-scoped guardrails&lt;/strong&gt; without permanent changes&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Example API flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Find where agents conflict&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;faultLines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/v1/teams/fault-lines&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;agents&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;agent-a&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;agent-b&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;agent-c&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Predict failure modes for a task&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;forecast&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/v1/teams/forecast&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
&lt;span class="na"&gt;agents&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;agent-a&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;agent-b&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;data_processing_pipeline&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Get policy recommendation with cryptographic proof&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;recommendation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/v1/teams/recommend-policy&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
&lt;span class="na"&gt;team&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;data-team&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;high_value_transaction&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The recommendation is attested with a STARK proof that spans the entire derivation chain: fault-line extraction, risk forecast, policy generation, and expected enforcement outcomes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation Notes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Performance Considerations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Policy evaluation is parallelized with quota resolution&lt;/li&gt;
&lt;li&gt;Tool name extraction supports streaming request bodies&lt;/li&gt;
&lt;li&gt;Grace period lookups use Redis with TTL expiry&lt;/li&gt;
&lt;li&gt;Reclassification uses BFS traversal with a 50-agent cap&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Security Design
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Policy DSL uses safe YAML subset (no code execution)&lt;/li&gt;
&lt;li&gt;Tool patterns use glob matching with escape sequence validation&lt;/li&gt;
&lt;li&gt;STARK proofs use SP1 guest programs with deterministic execution&lt;/li&gt;
&lt;li&gt;On-chain anchoring uses ERC-8004 aligned contracts&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Integration Points
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Express middleware&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/v1/agents&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;policyEnforcement&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;enforce&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}));&lt;/span&gt;

&lt;span class="c1"&gt;// Next.js API route&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;verdict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;evaluatePolicy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;action&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;reject&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;403&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;reason&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="c1"&gt;// Continue with agent request...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Fastify plugin&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;fastify&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@mnemom/policy-plugin&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;warn&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="na"&gt;gracePeriodHours&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;48&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What This Enables
&lt;/h2&gt;

&lt;p&gt;Moving governance into the request pipeline fundamentally changes what's possible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Proactive enforcement&lt;/strong&gt; instead of reactive monitoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuration error recovery&lt;/strong&gt; instead of permanent trust damage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Predictive risk management&lt;/strong&gt; for multi-agent teams&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cryptographically verified&lt;/strong&gt; policy recommendations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-latency policy evaluation&lt;/strong&gt; in production&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The engine that validated your agents in CI/CD is the same engine enforcing at the production gateway and the same engine auditing your observability traces afterward. One engine, three checkpoints: no drift, no gaps, no surprises.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;Governance shouldn't strangle the autonomy that makes agents useful in the first place. If you're planning to run large teams of AI agents and want guardrails that don't kill that autonomy in the process:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Generate starter policy&lt;/span&gt;
smoltbot policy init

&lt;span class="c"&gt;# Validate existing agents&lt;/span&gt;
smoltbot policy evaluate &lt;span class="nt"&gt;--card&lt;/span&gt; ./alignment-card.json &lt;span class="nt"&gt;--tools&lt;/span&gt; ./tool-manifest.json

&lt;span class="c"&gt;# Test policy against live agents&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://api.mnemom.ai/v1/policies/evaluate &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"agentId": "your-agent", "tools": ["tool-name"]}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The full technical specification and integration guides are in our docs.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.mnemom.ai/blog/mnemom-research/governance-in-the-code-path" rel="noopener noreferrer"&gt;mnemom.ai&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>governance</category>
      <category>devops</category>
      <category>security</category>
    </item>
    <item>
      <title>Building Trust Systems for AI Agent Teams: Beyond Individual Credit Scores</title>
      <dc:creator>Alex Garden</dc:creator>
      <pubDate>Wed, 25 Feb 2026 21:19:51 +0000</pubDate>
      <link>https://dev.to/alexgardenmnemom/building-trust-systems-for-ai-agent-teams-beyond-individual-credit-scores-53b6</link>
      <guid>https://dev.to/alexgardenmnemom/building-trust-systems-for-ai-agent-teams-beyond-individual-credit-scores-53b6</guid>
      <description>&lt;p&gt;Last week, we shipped Trust Ratings for individual AI agents — essentially credit scores for autonomous systems. The response was immediate: "What about teams?"&lt;/p&gt;

&lt;p&gt;This isn't just feature creep. Nobody deploys one agent in production. The interesting deployments are three, five, twelve agents coordinating on complex tasks. And here's the thing: the risk profile of a team is not the sum of its parts.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Team Risk Problem
&lt;/h2&gt;

&lt;p&gt;We already had team risk assessment at Mnemom — you could pass a list of agent IDs to our API and get a three-pillar analysis with Shapley attribution and circuit breakers. But every assessment started cold.&lt;/p&gt;

&lt;p&gt;No persistent identity. No accumulated history. No way to answer "is this team getting better or worse?" If you ran the same five agents together every day for six months, the system treated each assessment as if those agents had never met.&lt;/p&gt;

&lt;p&gt;Individual agents get persistent identity, trend lines, public reputation pages, and CI enforcement. Teams got none of it. Until today.&lt;/p&gt;

&lt;h2&gt;
  
  
  First-Class Team Identity
&lt;/h2&gt;

&lt;p&gt;Teams are now entities with their own lifecycle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;POST&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;v1&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;teams&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;org_id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;org-abc123&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;name&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Incident Response Alpha&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;agent_ids&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;smolt-a4c12709&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;smolt-b8f23e11&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;smolt-c1d45a03&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;metadata&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;environment&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;production&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;domain&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;infrastructure&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Minimum two agents, maximum fifty. Agents can belong to multiple teams. When you add or remove an agent, the system records who made the change and triggers a score recomputation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Mathematics of Team Trust
&lt;/h2&gt;

&lt;p&gt;Here's where it gets interesting. A team score isn't the average of individual scores. If it were, you wouldn't need one.&lt;/p&gt;

&lt;p&gt;The thing that makes a team a team is coordination. Five AAA agents with terrible coherence should score worse than five A agents with excellent coordination.&lt;/p&gt;

&lt;h3&gt;
  
  
  Five Weighted Components
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Team Coherence History (35%)&lt;/strong&gt; — The dominant signal. How consistently well-aligned is this team over time? This measures the one thing that only exists at the team level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Aggregate Member Quality (25%)&lt;/strong&gt; — Tail-risk weighted aggregate of individual Trust Ratings. One weak member drags the team down more than one strong member lifts it up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operational Track Record (20%)&lt;/strong&gt; — Historical hit rate across all team assessments. How often has this team been assessed as low-risk?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structural Stability (10%)&lt;/strong&gt; — Roster churn penalty. A team that swaps agents every week cannot build reliable track record.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Assessment Density (10%)&lt;/strong&gt; — Actively monitored teams with 200 data points get more credit than ones assessed twice six months ago.&lt;/p&gt;

&lt;p&gt;Same 0-1000 range as individual scores. Same AAA-through-CCC grades. Teams need 10 assessments before a score publishes.&lt;/p&gt;
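
&lt;p&gt;As a back-of-envelope sketch, the blend is a weighted sum over normalized components (weights from above; the normalization of each component is the part we're eliding):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Hypothetical sketch of the team score blend on the 0-1000 scale
const WEIGHTS = {
  coherence: 0.35,  // team coherence history
  members:   0.25,  // tail-risk weighted member quality
  record:    0.20,  // operational track record
  stability: 0.10,  // roster churn penalty
  density:   0.10   // assessment density
};

// components: each value already normalized to [0, 1]
function teamScore(components) {
  let blended = 0;
  for (const [name, weight] of Object.entries(WEIGHTS)) {
    blended += weight * components[name];
  }
  return Math.round(blended * 1000);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;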

&lt;h2&gt;
  
  
  Cryptographic Proof Chains
&lt;/h2&gt;

&lt;p&gt;Every team assessment is cryptographically attested — Ed25519 signatures, hash chains, STARK zero-knowledge proofs. The team score computation itself runs in the zkVM.&lt;/p&gt;

&lt;p&gt;This creates a proof chain: individual checkpoints → individual Trust Ratings → team assessments → Team Trust Rating. Each link is independently verifiable.&lt;/p&gt;

&lt;p&gt;You don't have to trust that a team is rated A — you can verify every step yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Public Infrastructure
&lt;/h2&gt;

&lt;p&gt;Everything that exists for individual agents now exists for teams:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reputation pages&lt;/strong&gt; with score breakdowns and trend charts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Team directory&lt;/strong&gt; — searchable catalog of public scores&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Badges&lt;/strong&gt; via SVG API — &lt;code&gt;[ Team Trust | 812 ]&lt;/code&gt; in your README&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Actions&lt;/strong&gt; — CI gating on team scores
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mnemom/reputation-check@v1&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;team-id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;team-7f2a9c01&lt;/span&gt;
    &lt;span class="na"&gt;min-score&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;700&lt;/span&gt;
    &lt;span class="na"&gt;min-grade&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;A&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Team Alignment Cards
&lt;/h2&gt;

&lt;p&gt;Teams get their own behavioral contracts. You can auto-derive from member cards:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;POST&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;v1&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;teams&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;team_id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="sr"&gt;/card/&lt;/span&gt;&lt;span class="nx"&gt;derive&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Values are unioned by frequency. Forbidden actions from any member apply to the team. Highest audit retention policy wins. Every change is versioned.&lt;/p&gt;

&lt;p&gt;The team card is what coherence quality measures against — the declared behavioral contract for the group.&lt;/p&gt;
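
&lt;p&gt;Those merge rules are simple enough to sketch (the member card shape here is an assumption, not our actual schema):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Hypothetical derivation of a team card from member cards
function deriveTeamCard(members) {
  const valueCounts = new Map();
  for (const m of members) {
    for (const v of m.values) {
      valueCounts.set(v, (valueCounts.get(v) || 0) + 1);
    }
  }
  return {
    // values ordered by how many members declare them
    values: [...valueCounts.keys()].sort(function (a, b) {
      return valueCounts.get(b) - valueCounts.get(a);
    }),
    // a forbidden action from any member binds the whole team
    forbidden: [...new Set(members.flatMap(function (m) { return m.forbidden; }))],
    // highest (strictest) audit retention wins
    auditRetentionDays: Math.max(...members.map(function (m) { return m.auditRetentionDays; }))
  };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;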

&lt;h2&gt;
  
  
  Integration Points
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Containment:&lt;/strong&gt; When a team member is paused via the containment engine, the team score reflects it immediately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Predictive guardrails:&lt;/strong&gt; Historical assessment data improves predictions. "This team historically struggles with speed-safety tradeoffs" beats cold-start analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CI gating:&lt;/strong&gt; Same GitHub Action that enforces individual scores now enforces team scores.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Enables
&lt;/h2&gt;

&lt;p&gt;Individual Trust Ratings answered "can I trust this agent?" Team Trust Ratings answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"How has this team performed?" — persistent, trended, attested scores&lt;/li&gt;
&lt;li&gt;"Is this team improving?" — weekly snapshots, not guesswork&lt;/li&gt;
&lt;li&gt;"Which team should I deploy?" — side-by-side comparison, not gut feel&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If individual ratings are FICO for agents, team scores are Moody's for agent portfolios. Same rigor, applied to the unit that actually matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation Notes
&lt;/h2&gt;

&lt;p&gt;Team reputation integrates with existing Mnemom infrastructure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uses the same cryptographic attestation as individual scores&lt;/li&gt;
&lt;li&gt;Plugs into containment and guardrail systems&lt;/li&gt;
&lt;li&gt;Supports the same CI/CD workflows&lt;/li&gt;
&lt;li&gt;Maintains the same public directory structure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The scoring algorithm runs deterministically in the zkVM, ensuring reproducible results across different environments.&lt;/p&gt;




&lt;p&gt;Team Trust Ratings ship today on Team and Enterprise plans. The infrastructure for multi-agent trust is here.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.mnemom.ai/blog/mnemom-research/team-trust-ratings" rel="noopener noreferrer"&gt;mnemom.ai&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>devops</category>
      <category>trust</category>
    </item>
    <item>
      <title>Building Cryptographic Trust Infrastructure for AI Agents</title>
      <dc:creator>Alex Garden</dc:creator>
      <pubDate>Mon, 23 Feb 2026 22:13:55 +0000</pubDate>
      <link>https://dev.to/alexgardenmnemom/building-cryptographic-trust-infrastructure-for-ai-agents-4epk</link>
      <guid>https://dev.to/alexgardenmnemom/building-cryptographic-trust-infrastructure-for-ai-agents-4epk</guid>
      <description>&lt;p&gt;Last month, MIT published their AI Agent Index — a comprehensive study of 30 major AI agents across 240 safety and transparency criteria. The results were stark: 133 fields had no public information. Twenty-five agents had no safety evaluation results. Only one had cryptographic signing.&lt;/p&gt;

&lt;p&gt;For those of us building AI agent infrastructure, this confirmed what we suspected: the gap isn't in building agents (the tooling is excellent), it's in verifying them.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Monitoring vs Verification Problem
&lt;/h2&gt;

&lt;p&gt;The market response has been monitoring solutions. Behavioral baselines, drift detection, comprehensive logging. These are necessary but insufficient.&lt;/p&gt;

&lt;p&gt;Here's why: monitoring tells you what happened, but it doesn't tell you whether the monitor itself is honest. Consider this architecture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent → Oversight System → Report ("clear" or "violation")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How do you know the oversight system applied its rules correctly? How do you know it didn't report "clear" when evidence showed a boundary violation? How do you know checkpoints weren't deleted after the fact?&lt;/p&gt;

&lt;p&gt;You can't. Unless the oversight system can prove its own honesty mathematically.&lt;/p&gt;

&lt;p&gt;That's verification: "we can prove we checked, and you can verify the proof yourself."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Technical Challenge
&lt;/h2&gt;

&lt;p&gt;Building this requires solving several problems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Identity&lt;/strong&gt;: What is the agent supposed to do?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrity&lt;/strong&gt;: Is it doing what it's supposed to do?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proof&lt;/strong&gt;: Can we prove the integrity check was honest?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reputation&lt;/strong&gt;: How trustworthy is this agent over time?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Risk&lt;/strong&gt;: Should we approve this specific action?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Containment&lt;/strong&gt;: How do we enforce decisions?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We built Mnemom as a six-layer stack addressing each.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 1: Machine-Readable Identity
&lt;/h2&gt;

&lt;p&gt;Every agent gets an Alignment Card — a machine-readable behavioral contract:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"permitted"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"read_tickets"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"draft_responses"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"escalate_to_human"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"forbidden"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"access_payment_data"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"issue_refunds"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"escalation_triggers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"billing_request_over_500"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"values"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"accuracy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"empathy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"privacy"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This becomes the reference document for all verification. No card, no verification.&lt;/p&gt;
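&lt;p&gt;To make "reference document" concrete, here is a sketch of a gateway-side pre-check of a proposed action against the card. &lt;code&gt;checkAction&lt;/code&gt; is a hypothetical helper, not part of any published SDK:&lt;/p&gt;

```javascript
// Illustrative pre-check of a proposed action against an Alignment Card.
// checkAction is a hypothetical helper, not part of any published SDK.
function checkAction(card, action, context = {}) {
  if (card.forbidden.includes(action)) {
    return { allowed: false, reason: "forbidden_by_card" };
  }
  if (!card.permitted.includes(action)) {
    return { allowed: false, reason: "not_in_permitted_list" };
  }
  // Escalation triggers route the action to a human instead of blocking it.
  const triggered = (card.escalation_triggers || []).filter(
    (t) => context[t] === true
  );
  if (triggered.length > 0) {
    return { allowed: true, escalate: true, triggers: triggered };
  }
  return { allowed: true, escalate: false };
}
```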

&lt;h2&gt;
  
  
  Layer 2: Real-Time Thinking Analysis
&lt;/h2&gt;

&lt;p&gt;We intercept the agent's streaming response at the Cloudflare Workers level using &lt;code&gt;tee()&lt;/code&gt;. The client receives the response at wire speed — zero latency added. The analysis runs in the background.&lt;/p&gt;
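&lt;p&gt;The interception pattern, as a minimal sketch: &lt;code&gt;interceptResponse&lt;/code&gt; and &lt;code&gt;analyzeThinking&lt;/code&gt; are illustrative stand-ins, not our actual Worker code.&lt;/p&gt;

```javascript
// Minimal sketch of zero-latency stream interception with tee().
// The client reads one branch at wire speed; analysis drains the other
// in the background. Requires Node 18+ (global Response/ReadableStream).
async function interceptResponse(upstreamResponse, analyzeThinking) {
  const [clientBranch, analysisBranch] = upstreamResponse.body.tee();
  // Fire-and-forget: consume the analysis branch concurrently.
  const analysisDone = (async () => {
    const reader = analysisBranch.getReader();
    const chunks = [];
    for (;;) {
      const { done, value } = await reader.read();
      if (done) break;
      chunks.push(value);
    }
    return analyzeThinking(Buffer.concat(chunks).toString("utf8"));
  })();
  // Return the untouched branch to the client immediately.
  return { response: new Response(clientBranch), analysisDone };
}
```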

&lt;p&gt;Crucially, we don't analyze outputs. We analyze &lt;em&gt;thinking&lt;/em&gt; — the agent's reasoning trace between request and response. A separate model evaluates this against the Alignment Card:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"verdict"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"boundary_violation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"concerns"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"value_misalignment"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"high"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Agent considered payments API access, explicitly forbidden"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three verdicts: &lt;code&gt;clear&lt;/code&gt;, &lt;code&gt;review_needed&lt;/code&gt;, or &lt;code&gt;boundary_violation&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 3: Four-Layer Cryptographic Attestation
&lt;/h2&gt;

&lt;p&gt;This is where we diverge from monitoring solutions. Every integrity checkpoint passes through four independent cryptographic layers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ed25519 Signatures&lt;/strong&gt;: Every checkpoint is cryptographically signed. Public keys are published. Anyone can verify, nobody can forge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SHA-256 Hash Chains&lt;/strong&gt;: Each checkpoint includes the hash of the previous checkpoint. Tamper with any checkpoint and the chain breaks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Merkle Trees&lt;/strong&gt;: An append-only binary tree accumulates all checkpoints. Proving a checkpoint exists requires O(log N) hashes — about 10 for 1,000 checkpoints.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zero-Knowledge Proofs&lt;/strong&gt;: A STARK proof re-derives the verdict from raw concerns using deterministic rules, ignoring the stated verdict entirely.&lt;/p&gt;

&lt;p&gt;The key insight: we don't prove LLM inference was correct (computationally intractable). We prove the auditor's judgment was honestly derived from the LLM's output. This runs in ~10,000 RISC-V cycles with sub-second proof generation.&lt;/p&gt;
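&lt;p&gt;The re-derivation is cheap to prove precisely because it is a small pure function over structured concerns. A sketch of that shape (the thresholds and rules here are illustrative, not our published parameters):&lt;/p&gt;

```javascript
// Illustrative verdict derivation: maps raw concerns to a verdict with
// explicit, deterministic rules. The real rule parameters differ; the
// point is that this function is small, pure, and cheap to prove.
function deriveVerdict(concerns) {
  if (concerns.some((c) => c.severity === "high")) {
    return "boundary_violation";
  }
  if (concerns.length > 0) {
    return "review_needed";
  }
  return "clear";
}
```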

&lt;h2&gt;
  
  
  Layer 4: The Trust Score
&lt;/h2&gt;

&lt;p&gt;Checkpoints accumulate into a composite reputation score — five weighted components mapping to a bond-rating scale (AAA to CCC):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Integrity ratio (40%)&lt;/li&gt;
&lt;li&gt;Compliance with exponential decay (20%) &lt;/li&gt;
&lt;li&gt;Drift stability (20%)&lt;/li&gt;
&lt;li&gt;Trace completeness (10%)&lt;/li&gt;
&lt;li&gt;Coherence compatibility (10%)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every score is backed by the full attestation stack. Anyone can request Merkle inclusion proofs, verify signatures and hash chains, and for violations, verify STARK proofs.&lt;/p&gt;
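&lt;p&gt;As a sketch, the composite is a plain weighted sum over those five components. The weights match the list above; the letter-grade thresholds below are illustrative placeholders, not our published scale:&lt;/p&gt;

```javascript
// Weighted composite trust score. Weights match the component list above;
// the letter-grade thresholds are illustrative placeholders.
const WEIGHTS = {
  integrity: 0.4,
  compliance: 0.2,
  drift: 0.2,
  completeness: 0.1,
  coherence: 0.1,
};

function trustScore(components) {
  let score = 0;
  for (const [name, weight] of Object.entries(WEIGHTS)) {
    score += weight * components[name];
  }
  return score; // in [0, 1]
}

function rating(score) {
  if (score >= 0.95) return "AAA";
  if (score >= 0.85) return "AA";
  if (score >= 0.7) return "A";
  if (score >= 0.5) return "BBB";
  return "CCC";
}
```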

&lt;h2&gt;
  
  
  Layer 5: Context-Aware Risk Assessment
&lt;/h2&gt;

&lt;p&gt;A trust score tells you general trustworthiness. Risk assessment tells you whether to approve &lt;em&gt;this specific action&lt;/em&gt; for &lt;em&gt;this agent&lt;/em&gt; in &lt;em&gt;this context&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The same agent gets different risk scores for different actions. Financial transactions weight compliance heavily. Task delegation weights coherence. Tool invocation weights integrity and drift.&lt;/p&gt;

&lt;p&gt;For teams of agents, we model three types of systemic risk:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Aggregate quality (tail-risk weighting)&lt;/li&gt;
&lt;li&gt;Coherence quality (value alignment variance)&lt;/li&gt;
&lt;li&gt;Structural risk (contagion modeling)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All computations use fixed-point arithmetic for determinism — essential for zero-knowledge provability.&lt;/p&gt;
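&lt;p&gt;Fixed-point here means integer arithmetic with an implicit scale, so every environment computes bit-identical results. A minimal sketch (the 10^6 scale is an assumption for illustration):&lt;/p&gt;

```javascript
// Minimal fixed-point helpers: values are BigInt integers scaled by 1e6.
// Integer-only arithmetic is bit-identical everywhere, which is what a
// zkVM needs to re-execute a computation deterministically.
const SCALE = 1000000n;

const toFixed = (x) => BigInt(Math.round(x * 1e6));
const fromFixed = (f) => Number(f) / 1e6;

// (a/S) * (b/S) = (a*b)/S^2, so divide once by S to stay at scale S.
const mulFixed = (a, b) => (a * b) / SCALE;
```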

&lt;h2&gt;
  
  
  Layer 6: Real-Time Containment
&lt;/h2&gt;

&lt;p&gt;Detection without enforcement is monitoring. We provide pause, kill, and resume controls with three enforcement modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;monitor&lt;/code&gt;: Log only&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;nudge&lt;/code&gt;: Inject corrective guidance&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;enforce&lt;/code&gt;: Hard 403 block&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Auto-containment triggers after N consecutive violations. When an agent is contained, structural risk analysis identifies exposed peers for prophylactic protection.&lt;/p&gt;
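&lt;p&gt;The auto-containment trigger is essentially a consecutive-violation counter. A sketch (the threshold of 3 is illustrative; N is configurable in practice):&lt;/p&gt;

```javascript
// Sketch of the auto-containment trigger: N consecutive boundary
// violations flip the agent to contained. The threshold is illustrative.
class ContainmentTrigger {
  constructor(threshold = 3) {
    this.threshold = threshold;
    this.consecutive = 0;
    this.contained = false;
  }
  record(verdict) {
    if (verdict === "boundary_violation") {
      this.consecutive += 1;
      if (this.consecutive >= this.threshold) this.contained = true;
    } else {
      this.consecutive = 0; // any clean verdict resets the streak
    }
    return this.contained;
  }
}
```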

&lt;h2&gt;
  
  
  One Environment Variable
&lt;/h2&gt;

&lt;p&gt;Despite the complexity above, using this is simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; smoltbot
smoltbot init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CLI configures your agent to use the Mnemom gateway:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://gateway.mnemom.ai/openai/v1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your code doesn't change. Your users' experience doesn't change. But your agent now has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cryptographic integrity attestation&lt;/li&gt;
&lt;li&gt;Public trust score and reputation page&lt;/li&gt;
&lt;li&gt;Embeddable trust badges&lt;/li&gt;
&lt;li&gt;Zero-knowledge proven risk assessments&lt;/li&gt;
&lt;li&gt;Real-time containment controls&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why This Matters Now
&lt;/h2&gt;

&lt;p&gt;Three converging factors:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The gap is documented&lt;/strong&gt;: MIT's study, WEF governance frameworks, and EU AI Act Article 50 all identify the same missing piece — verifiable trust infrastructure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The market fragmented&lt;/strong&gt;: Solutions exist for pieces (malware scanning, prompt injection blocking, behavioral baselines, identity credentials, on-chain reputation), but no unified stack.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The proof is practical&lt;/strong&gt;: ZK proofs of safety judgments aren't theoretical anymore. SP1 generates production-ready STARK proofs in under a second.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What's Live
&lt;/h2&gt;

&lt;p&gt;Everything described is deployed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-provider gateway (Anthropic, OpenAI, Gemini)&lt;/li&gt;
&lt;li&gt;Full attestation pipeline with cryptographic proofs&lt;/li&gt;
&lt;li&gt;Trust scoring and public directory&lt;/li&gt;
&lt;li&gt;Team risk assessment with contagion modeling&lt;/li&gt;
&lt;li&gt;Real-time containment and graduated response&lt;/li&gt;
&lt;li&gt;Enterprise features (RBAC, SSO, compliance exports)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Try the interactive showcase at &lt;a href="https://mnemom.ai/showcase" rel="noopener noreferrer"&gt;mnemom.ai/showcase&lt;/a&gt; or point your agent at the gateway directly.&lt;/p&gt;

&lt;p&gt;The credit check for AI agents is live.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.mnemom.ai/blog/mnemom-research/verification-layer-ai-agents" rel="noopener noreferrer"&gt;mnemom.ai&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>cryptography</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>The First Zero-Knowledge Proof of AI Safety Judgment</title>
      <dc:creator>Alex Garden</dc:creator>
      <pubDate>Fri, 20 Feb 2026 04:21:47 +0000</pubDate>
      <link>https://dev.to/alexgardenmnemom/the-first-zero-knowledge-proof-of-ai-safety-judgment-1hfi</link>
      <guid>https://dev.to/alexgardenmnemom/the-first-zero-knowledge-proof-of-ai-safety-judgment-1hfi</guid>
      <description>&lt;p&gt;The Agent Alignment Protocol gives agents transparency — structured traces of what was considered, what was chosen, and why. The Agent Integrity Protocol gives agents integrity — continuous runtime verdicts on whether an agent's autonomous decisions align with its declared values and boundaries. Today we ship the third piece: &lt;strong&gt;proof&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Cryptographic evidence that every integrity verdict was honestly computed. Independently verifiable. No trust required.&lt;/p&gt;

&lt;p&gt;This matters because integrity monitoring creates a new trust dependency. When we tell you an agent's behavior is consistent with its Alignment Card, you have to trust that we computed that verdict honestly — that we actually ran the analysis, that we didn't tamper with results, that the history hasn't been rewritten. The proof layer eliminates that trust dependency. Every verdict now ships with mathematical evidence you can verify yourself, in your own browser, without calling our API or trusting our infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Four Layers of Cryptographic Evidence
&lt;/h2&gt;

&lt;p&gt;The proof layer isn't a single mechanism. It's four independent cryptographic structures, each addressing a different trust question. They compose but don't depend on each other — you get value from any subset.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ed25519 Signatures: Who Signed This Verdict?
&lt;/h3&gt;

&lt;p&gt;Every integrity checkpoint is signed with an Ed25519 digital signature. The signed payload is a deterministic JSON string containing the checkpoint ID, agent ID, verdict, thinking block hash, input commitment, chain hash, and timestamp. Keys are sorted alphabetically before serialization, producing identical output regardless of property insertion order.&lt;/p&gt;

&lt;p&gt;The signing key is a 32-byte Ed25519 secret held by the analysis service. The corresponding public key is published at a well-known endpoint (&lt;code&gt;api.mnemom.ai/v1/keys&lt;/code&gt;). Anyone can fetch the public key and verify the signature using the &lt;code&gt;@noble/ed25519&lt;/code&gt; library — the same library we use internally. A valid signature proves the verdict was produced by a key holder and hasn't been modified since signing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hash Chains: Nothing Was Reordered
&lt;/h3&gt;

&lt;p&gt;Each checkpoint includes a SHA-256 hash that chains it to the previous checkpoint in the same agent session. The chain preimage concatenates the previous chain hash (or the literal string &lt;code&gt;genesis&lt;/code&gt; for the first checkpoint), the checkpoint ID, verdict, thinking block hash, input commitment, and timestamp — all joined by pipe delimiters.&lt;/p&gt;

&lt;p&gt;This creates a tamper-evident sequence. Changing any field in any checkpoint breaks the chain at that point and every subsequent link. Inserting a checkpoint between two existing ones is impossible without recomputing every downstream hash, which would invalidate their signatures. Reordering checkpoints breaks the linkage. The chain gives you temporal integrity — proof that the sequence of verdicts is the sequence that actually occurred.&lt;/p&gt;

&lt;h3&gt;
  
  
  Merkle Trees: Nothing Was Deleted
&lt;/h3&gt;

&lt;p&gt;Per-agent append-only binary Merkle trees accumulate every checkpoint into a single root hash. Each checkpoint becomes a leaf — its leaf hash is computed from the checkpoint ID, verdict, thinking block hash, chain hash, and timestamp. Internal nodes are SHA-256 hashes of their children's concatenated hex strings. When the number of nodes at any level is odd, the last node is duplicated before pairing.&lt;/p&gt;

&lt;p&gt;The Merkle tree provides completeness proofs. Given a checkpoint and an O(log N) inclusion proof — a list of sibling hashes along the path from the leaf to the root — anyone can verify that the checkpoint exists in the tree by walking the path and checking that the computed root matches the published root. If a checkpoint were deleted from the log, the root would change and all existing inclusion proofs would fail. This is the same structure that Certificate Transparency logs use to ensure no TLS certificate is secretly issued.&lt;/p&gt;

&lt;h3&gt;
  
  
  ZK Verdict Proofs: The Rules Were Applied Honestly
&lt;/h3&gt;

&lt;p&gt;This is the most novel layer. A RISC Zero STARK proves that the verdict derivation rules — the logic that maps analysis observations to a verdict of &lt;code&gt;clear&lt;/code&gt;, &lt;code&gt;review_needed&lt;/code&gt;, or &lt;code&gt;boundary_violation&lt;/code&gt; — were applied correctly to the committed inputs. The proof is generated inside a zkVM guest program that takes the analysis JSON, input commitment hashes, and rule parameters as private inputs, executes the deterministic verdict derivation logic, and produces a receipt that anyone can verify.&lt;/p&gt;

&lt;p&gt;Boundary violation verdicts are always proven. All other verdicts are proven via 10% stochastic sampling. The proving decision is made after analysis completes, so the analysis pipeline cannot know in advance which checkpoints will be audited — creating a probabilistic deterrent against selective manipulation.&lt;/p&gt;
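&lt;p&gt;The proving decision itself is tiny, which is the point: it runs only after the verdict exists. A sketch (the RNG source is an implementation detail):&lt;/p&gt;

```javascript
// Sketch of the post-analysis proving decision described above.
function shouldProve(verdict, rng = Math.random) {
  if (verdict === "boundary_violation") return true; // always proven
  return rng() < 0.1; // 10% stochastic sample of everything else
}
```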

&lt;h2&gt;
  
  
  The Novel Contribution
&lt;/h2&gt;

&lt;p&gt;The AI verifiability space has focused overwhelmingly on proving model inference — demonstrating that a neural network with specific weights produced a specific output for a specific input. This is a hard problem. Modern large language models have hundreds of billions of parameters. Proving a single forward pass through a transformer is computationally prohibitive. Estimates for full LLM inference proofs range from $1,000 to $10,000+ per proof, with proving times measured in hours.&lt;/p&gt;

&lt;p&gt;We prove something different: the auditor's judgment.&lt;/p&gt;

&lt;p&gt;The integrity checkpoint pipeline has two stages. First, an analysis model reads the agent's reasoning trace and the Alignment Card and produces structured observations — what concerns it found, what categories they fall into, how severe they are. Second, a deterministic verdict derivation function maps those observations to a final verdict using explicit rules: thresholds, category weights, severity escalation logic.&lt;/p&gt;

&lt;p&gt;The first stage is LLM inference. It's expensive to prove, and proving it wouldn't tell you much — a valid proof of inference just means the model produced that output, not that the output is correct. The second stage is a small, deterministic function. It takes structured JSON as input and produces a verdict as output. It runs in milliseconds. And proving it in a zkVM costs approximately $0.005 per proof.&lt;/p&gt;

&lt;p&gt;This is five to six orders of magnitude cheaper than proving the full inference. It occupies a new point in the design space: between doing nothing (the current industry default) and proving everything (economically infeasible). We prove the part that matters most for trust — that the rules mapping observations to verdicts were applied faithfully — at a cost that makes it practical to prove every high-stakes verdict and stochastically sample the rest.&lt;/p&gt;

&lt;h2&gt;
  
  
  See It
&lt;/h2&gt;

&lt;p&gt;The showcase at &lt;a href="https://www.mnemom.ai/showcase" rel="noopener noreferrer"&gt;mnemom.ai/showcase&lt;/a&gt; includes interactive visualizations of all four proof layers.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;certificate explorer&lt;/strong&gt; displays a complete integrity certificate with collapsible sections for subject, claims, input commitments, and proofs. Each proof section shows the raw cryptographic data — signature bytes, chain hashes, Merkle siblings, STARK receipt metadata — alongside human-readable explanations of what each field means and what it proves.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;hash chain tamper simulation&lt;/strong&gt; renders a chain of checkpoints as linked blocks. Click any block to simulate tampering — modifying its verdict field. The visualization shows the cascading failure in real time: the tampered block's chain hash no longer matches its recomputed value, and every subsequent block's &lt;code&gt;prev_chain_hash&lt;/code&gt; linkage breaks. The red propagation makes the tamper-evidence property visceral rather than abstract.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Merkle tree visualization&lt;/strong&gt; displays the binary tree structure with highlighted inclusion proof paths. Select a leaf checkpoint and watch the verification path illuminate from leaf to root, showing each sibling hash being combined at every level. The root hash updates live as checkpoints are added, demonstrating the append-only accumulation property.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;3D trust topology&lt;/strong&gt; renders agents as nodes in a three-dimensional space, with edges representing coherence scores from AAP's Value Coherence Handshake and node colors indicating integrity status from AIP. The proof layer adds verification badges — visible indicators of which verdicts carry cryptographic evidence and which proof types are present.&lt;/p&gt;

&lt;h2&gt;
  
  
  Verify It Yourself
&lt;/h2&gt;

&lt;p&gt;Every verification operation runs in your browser. No server calls. No trust required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ed25519 signature verification&lt;/strong&gt; uses &lt;code&gt;@noble/ed25519&lt;/code&gt; to check the signature against the signed payload and the published public key. The library is a pure JavaScript implementation of the Ed25519 signature scheme — no WebCrypto API dependency, no platform-specific behavior, deterministic results across all environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chain hash recomputation&lt;/strong&gt; uses the Web Crypto API's SHA-256 implementation to recompute the chain hash from the checkpoint fields and compare it to the certificate's claimed chain hash. The preimage format is documented and stable: &lt;code&gt;prevHash|checkpointId|verdict|thinkingBlockHash|inputCommitment|timestamp&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Merkle proof verification&lt;/strong&gt; walks the inclusion proof from leaf to root, combining the running hash with each sibling at each level. If the sibling sits to the left, it's prepended; if to the right, it's appended. The final computed root must match the published root. This is a pure computation — no network access, no oracle, no trust assumption beyond SHA-256 collision resistance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;STARK verification&lt;/strong&gt; uses a WebAssembly build of the RISC Zero verifier. The WASM module takes the proof receipt and the expected image ID (a commitment to the guest program's code) and returns a boolean. This is the same verification logic that runs on-chain in Ethereum STARK verification contracts, compiled to WASM for browser execution.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Certificate Format
&lt;/h2&gt;

&lt;p&gt;Integrity certificates follow a structured JSON format modeled on &lt;a href="https://c2pa.org/" rel="noopener noreferrer"&gt;C2PA content credentials&lt;/a&gt; and &lt;a href="https://www.w3.org/TR/vc-data-model/" rel="noopener noreferrer"&gt;W3C Verifiable Credentials&lt;/a&gt;. The design goals are machine readability, self-describing semantics, and independent verifiability.&lt;/p&gt;

&lt;p&gt;A certificate contains five top-level sections:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;@context&lt;/code&gt; — The schema identifier (&lt;code&gt;https://mnemom.ai/aip/v1&lt;/code&gt;), following JSON-LD conventions for self-describing documents.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;subject&lt;/code&gt; — What this certificate is about: the checkpoint ID, agent ID, session ID, and Alignment Card ID.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;claims&lt;/code&gt; — The integrity verdict, concern categories, confidence score, reasoning summary, analysis model identifier, and analysis duration.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;input_commitments&lt;/code&gt; — SHA-256 hashes of every analysis input: the thinking block, the Alignment Card, the declared values, the context window, and a combined commitment over all of them. These commitments bind the verdict to specific inputs without revealing those inputs.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;proofs&lt;/code&gt; — The cryptographic evidence: Ed25519 signature (algorithm, key ID, signature value, signed payload), hash chain (current hash, previous hash, position), Merkle inclusion proof (leaf hash, leaf index, root, tree size, sibling path), and verdict derivation proof (method, image ID, receipt, journal, verification timestamp).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The certificate also includes a &lt;code&gt;verification&lt;/code&gt; section with URLs for the public key endpoint, the certificate retrieval endpoint, and the server-side verification endpoint. These are convenience links — all verification can be performed client-side without them.&lt;/p&gt;

&lt;p&gt;The full specification is published at &lt;a href="https://docs.mnemom.ai/" rel="noopener noreferrer"&gt;docs.mnemom.ai&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Don't Claim
&lt;/h2&gt;

&lt;p&gt;Honesty about limitations is itself a form of integrity.&lt;/p&gt;

&lt;p&gt;The proof layer proves honest computation of the verdict derivation rules. It does not prove that the analysis model's observations are correct. The LLM that reads an agent's reasoning trace and identifies concerns can make mistakes — it can miss a subtle boundary violation, misclassify a concern's severity, or hallucinate a problem that doesn't exist. The cryptographic proof covers the deterministic verdict derivation logic, not the probabilistic language model that produces the inputs to that logic.&lt;/p&gt;

&lt;p&gt;Think of it like an auditor's report. We can prove that the auditor followed their stated methodology. We cannot prove that the methodology itself catches every issue. The signature proves who signed the verdict. The chain proves the verdict sequence wasn't reordered. The Merkle tree proves no verdicts were deleted. The ZK proof proves the verdict rules were applied faithfully to whatever the analysis model found. None of these prove the analysis model found the right things.&lt;/p&gt;

&lt;p&gt;This is an honest trade-off. Proving full LLM inference would close this gap but costs $1,000+ per proof and takes hours. We chose to prove the part that is economically practical and most relevant to the trust question operators actually ask: "Were the rules applied honestly to the evidence that was found?"&lt;/p&gt;

&lt;p&gt;We also don't claim immutability of the Merkle root itself. The root is published by our infrastructure. A sufficiently compromised operator could publish a different root. The Merkle tree proves internal consistency — that a given set of checkpoints produces a given root — but the root's authenticity currently depends on the publication channel. Future work on root anchoring (e.g., periodic on-chain commitments) would close this gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  Competitive Landscape
&lt;/h2&gt;

&lt;p&gt;The verifiable AI space is young and fragmented. Different projects prove different things at different costs:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Who&lt;/th&gt;
&lt;th&gt;What It Proves&lt;/th&gt;
&lt;th&gt;Approximate Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;TEE attestation&lt;/td&gt;
&lt;td&gt;EQTY Lab&lt;/td&gt;
&lt;td&gt;Code ran in a trusted execution environment&lt;/td&gt;
&lt;td&gt;Hardware-dependent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Content provenance&lt;/td&gt;
&lt;td&gt;Macaw, C2PA&lt;/td&gt;
&lt;td&gt;Media was created/edited by specific software&lt;/td&gt;
&lt;td&gt;Per-asset&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model inference proofs&lt;/td&gt;
&lt;td&gt;Modulus Labs, Giza&lt;/td&gt;
&lt;td&gt;A specific neural network produced a specific output&lt;/td&gt;
&lt;td&gt;$1,000+/proof&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ZK coprocessor&lt;/td&gt;
&lt;td&gt;Lagrange&lt;/td&gt;
&lt;td&gt;On-chain computation was performed correctly&lt;/td&gt;
&lt;td&gt;Per-query&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Verdict derivation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Mnemom&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Auditor judgment rules were applied faithfully&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$0.005/proof&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These approaches are complementary, not competitive. TEE attestation proves &lt;em&gt;where&lt;/em&gt; code ran. Content provenance proves &lt;em&gt;what&lt;/em&gt; created a media artifact. Inference proofs prove &lt;em&gt;which model&lt;/em&gt; produced an output. We prove &lt;em&gt;how the auditor reached its verdict&lt;/em&gt;. A complete trust stack might use several of these together.&lt;/p&gt;

&lt;p&gt;Our contribution is identifying verdict derivation, the deterministic rules that map structured observations to integrity verdicts, as a distinct proof target. It is orders of magnitude cheaper than inference proofs, yet it answers the specific trust question that matters for agent integrity monitoring.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enterprise Value
&lt;/h2&gt;

&lt;p&gt;The EU AI Act's Article 50 transparency obligations take effect in August 2026, requiring disclosure of AI-generated content and of interactions with AI systems; record-keeping and human-oversight requirements for high-risk systems are imposed elsewhere in the Act. Cryptographic integrity certificates provide a compliance-ready audit trail that goes beyond what the regulation requires: not just logging that monitoring occurred, but proving it was performed honestly.&lt;/p&gt;

&lt;p&gt;For enterprises evaluating integrity monitoring vendors, the proof layer eliminates vendor lock-in as a trust concern. You don't have to trust Mnemom's infrastructure to trust the verdicts. The certificates are self-contained. The verification logic is open source. The cryptographic primitives are standard (Ed25519, SHA-256, RISC Zero STARKs). Any third party can independently verify any certificate without calling our API.&lt;/p&gt;
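&lt;p&gt;What "self-contained" means in practice: a verifier recomputes the certificate's digest locally and compares it to the embedded one, with no API call. The field names and canonicalization below are illustrative assumptions, not the published certificate schema.&lt;/p&gt;

```python
import hashlib
import json

def payload_digest(cert: dict) -> str:
    # Canonicalize the payload (sorted keys, no whitespace) before
    # hashing. Field names here are hypothetical, not Mnemom's schema.
    payload = {k: v for k, v in cert.items() if k not in ("digest", "signature")}
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

cert = {
    "@context": "https://example.org/aip/v1",  # hypothetical context URI
    "checkpoint_id": "cp-001",
    "verdict": "pass",
}
cert["digest"] = payload_digest(cert)

# Third-party check, fully offline:
assert payload_digest(cert) == cert["digest"]
# An Ed25519 signature over this digest would be checked the same way,
# needing only the published public key (e.g. via the `cryptography`
# package), never the issuer's API.
```

&lt;p&gt;The design choice that matters is canonicalization: both issuer and verifier must serialize the payload identically, or honest certificates fail to verify.&lt;/p&gt;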

&lt;p&gt;The standard certificate format also enables multi-vendor trust architectures. If an organization uses multiple integrity monitoring providers, or builds its own alongside a commercial solution, certificates from different providers can be compared, aggregated, and cross-verified with the same tooling. The &lt;code&gt;@context&lt;/code&gt; field and structured proof sections make certificates machine-readable across implementations.&lt;/p&gt;
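&lt;p&gt;One way such tooling could work is to dispatch each certificate to a provider-specific verifier keyed on its &lt;code&gt;@context&lt;/code&gt; value. The context URIs and verifier logic below are hypothetical placeholders, shown only to illustrate the dispatch pattern.&lt;/p&gt;

```python
# Hypothetical per-provider verifiers, keyed by @context URI.
# Neither the URIs nor the checks are real schemas.
verifiers = {
    "https://vendor-a.example/aip/v1": lambda c: c.get("verdict") in ("pass", "fail"),
    "https://vendor-b.example/attest/v2": lambda c: "attestation" in c,
}

def verify_any(cert: dict) -> bool:
    """Route a certificate to the verifier for its declared context."""
    verifier = verifiers.get(cert.get("@context"))
    return bool(verifier and verifier(cert))

batch = [
    {"@context": "https://vendor-a.example/aip/v1", "verdict": "pass"},
    {"@context": "https://vendor-b.example/attest/v2", "attestation": "..."},
]
assert all(verify_any(c) for c in batch)
```

&lt;p&gt;Certificates with an unknown context simply fail closed, which is the behavior you want when aggregating evidence from providers you have not integrated.&lt;/p&gt;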

&lt;h2&gt;
  
  
  Get Started
&lt;/h2&gt;

&lt;p&gt;The proof layer ships today in the Smoltbot gateway and the AIP TypeScript SDK.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Documentation&lt;/strong&gt;: &lt;a href="https://docs.mnemom.ai" rel="noopener noreferrer"&gt;docs.mnemom.ai&lt;/a&gt; — full specification for the certificate format, proof construction, and verification procedures.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API reference&lt;/strong&gt;: &lt;code&gt;GET /v1/checkpoints/:id/certificate&lt;/code&gt;, &lt;code&gt;POST /v1/verify&lt;/code&gt;, &lt;code&gt;GET /v1/agents/:id/merkle-root&lt;/code&gt;, &lt;code&gt;GET /v1/checkpoints/:id/inclusion-proof&lt;/code&gt; — all public, no authentication required.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Live showcase&lt;/strong&gt;: &lt;a href="https://www.mnemom.ai/showcase" rel="noopener noreferrer"&gt;mnemom.ai/showcase&lt;/a&gt; — interactive certificate explorer, chain tamper simulation, Merkle tree visualization, and 3D trust topology.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Whitepaper&lt;/strong&gt;: The full technical specification covering proof construction, security model, threat analysis, and formal verification properties is available at &lt;a href="https://docs.mnemom.ai/whitepaper" rel="noopener noreferrer"&gt;docs.mnemom.ai/whitepaper&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Source code&lt;/strong&gt;: &lt;a href="https://www.github.com/mnemom" rel="noopener noreferrer"&gt;github.com/mnemom&lt;/a&gt; — Apache-licensed. The signing, chain, Merkle, proving, and certificate modules are in the &lt;code&gt;api/src/analyze/&lt;/code&gt; directory. The zkVM guest program and WASM verifier are in &lt;code&gt;zkvm/&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
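&lt;p&gt;As a taste of what you can do with the public endpoints: an inclusion proof fetched for a checkpoint can be checked offline by walking the sibling hashes up to the published root. The &lt;code&gt;(sibling, is_left)&lt;/code&gt; pair format below is an assumption about the wire format, not the documented one.&lt;/p&gt;

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def verify_inclusion(leaf: bytes, proof, root: bytes) -> bool:
    """Recompute the path from a leaf to the root. `proof` is a list
    of (sibling_hash, is_left) pairs; the real endpoint's response
    shape may differ -- this layout is an assumption."""
    node = h(leaf)
    for sibling, is_left in proof:
        node = h(sibling + node) if is_left else h(node + sibling)
    return node == root

# Tiny two-leaf tree built by hand:
a, b = h(b"checkpoint-a"), h(b"checkpoint-b")
root = h(a + b)
assert verify_inclusion(b"checkpoint-a", [(b, False)], root)
assert not verify_inclusion(b"checkpoint-x", [(b, False)], root)
```

&lt;p&gt;The proof is logarithmic in the number of checkpoints, which is why per-checkpoint verification stays cheap even as an agent's history grows.&lt;/p&gt;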

&lt;p&gt;Every integrity verdict now comes with cryptographic evidence. Verify everything yourself. Trust nothing.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Mnemom builds alignment and integrity infrastructure for autonomous agents. AAP and AIP are open source and available on &lt;a href="https://www.npmjs.com/package/@mnemom/agent-alignment-protocol" rel="noopener noreferrer"&gt;npm&lt;/a&gt; and &lt;a href="https://pypi.org/project/agent-alignment-protocol/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://www.github.com/mnemom" rel="noopener noreferrer"&gt;github.com/mnemom&lt;/a&gt; · Live demo: &lt;a href="https://www.mnemom.ai/showcase" rel="noopener noreferrer"&gt;mnemom.ai/showcase&lt;/a&gt;&lt;/p&gt;

</description>
      <category>cryptography</category>
      <category>verifiableintegrity</category>
      <category>aip</category>
      <category>zeroknowledge</category>
    </item>
  </channel>
</rss>
