DEV Community

Anton Illarionov

Posted on • Originally published at api.odei.ai


7-Layer Constitutional Guardrails: Preventing AI Agent Mistakes Before They Happen

AI agents make mistakes. When they're operating autonomously — managing wallets, sending messages, executing contracts — mistakes are expensive.

The standard answer is "add a human in the loop." But that defeats the purpose of autonomous agents. The real answer is constitutional guardrails: a validation framework that runs before every consequential action.

Here's how we built it at ODEI, and how you can use it.

The Problem

Consider an autonomous agent managing USDC for a user. Without guardrails:

  • Agent calls transfer(500, wallet_address) — is the wallet trusted? Is the amount within limits? Was this already done?
  • Agent posts to Twitter — is this duplicate content? Does it violate policies?
  • Agent approves a transaction — was this authorized by the right person at the right time?

These questions can't be answered by the LLM alone. They require structured checks against known facts, historical state, and explicit rules.

The 7-Layer Framework

ODEI's constitutional guardrail system validates every action through 7 sequential checks:

Layer 1: Immutability Check

Can this entity be modified?

Some nodes in the world model are immutable after creation — founding documents, past transactions, signed commitments. Layer 1 prevents agents from accidentally rewriting history.

Layer 2: Temporal Context

Is this action still valid in time?

Decisions expire. Authorizations have windows. Layer 2 checks that the action is timely — not stale from a previous session, not premature.
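A minimal sketch of what a temporal check can look like. The field names, the 15-minute staleness threshold, and the PASS/WARN/FAIL semantics are assumptions for illustration, not ODEI's actual implementation:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed threshold: instructions older than this are suspicious even
# if their authorization window is still open
STALENESS_LIMIT = timedelta(minutes=15)

def check_temporal(issued_at: datetime, expires_at: datetime,
                   now: Optional[datetime] = None) -> str:
    """Layer 2 sketch: reject premature, expired, or stale actions."""
    now = now or datetime.now(timezone.utc)
    if now < issued_at:
        return "FAIL"  # premature: the authorization window hasn't opened
    if now > expires_at:
        return "FAIL"  # expired authorization
    if now - issued_at > STALENESS_LIMIT:
        return "WARN"  # possibly replayed from a previous session
    return "PASS"
```

Returning WARN rather than FAIL for staleness lets the aggregation step escalate to a human instead of hard-rejecting a possibly legitimate action.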

Layer 3: Referential Integrity

Do all referenced entities exist?

The action references wallet 0x.... Does that wallet exist in the world model? Is it a known, trusted entity? Layer 3 catches hallucinated references.
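In its simplest form, this is set membership against the world model. The wallet set below is a hypothetical stand-in for a real world-model lookup:

```python
# Hypothetical known-entity set; in practice this would be a query
# against the world model, not an in-memory constant
KNOWN_WALLETS = {"0x8185ecd4170bE82c3eDC3504b05B3a8C88AFd129"}

def check_referential_integrity(referenced_wallets: list) -> str:
    """Layer 3 sketch: every referenced entity must already exist."""
    for wallet in referenced_wallets:
        if wallet not in KNOWN_WALLETS:
            return "FAIL"  # hallucinated or unknown reference
    return "PASS"
```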

Layer 4: Authority Validation

Does this agent have permission?

Not all agents can do all things. Layer 4 checks whether the requesting agent has the authority scope for this action, against the governance rules in the FOUNDATION layer.
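One common way to model this is scope sets per agent. The agent names and scope strings below are illustrative assumptions, not ODEI's actual governance schema:

```python
# Hypothetical authority grants; in ODEI these would live in the
# FOUNDATION layer's governance rules
AGENT_SCOPES = {
    "trading_agent_v2": {"transfer:usdc", "read:wallets"},
    "social_agent": {"post:twitter"},
}

def check_authority(agent: str, required_scope: str) -> str:
    """Layer 4 sketch: the agent must hold the scope the action needs."""
    granted = AGENT_SCOPES.get(agent)
    if granted is None:
        return "FAIL"  # unknown agent: no authority at all
    return "PASS" if required_scope in granted else "FAIL"
```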

Layer 5: Deduplication

Has this exact action already been taken?

Without deduplication, agents can send the same message twice, execute the same transaction twice, create the same entity twice. Layer 5 uses content hashing to detect duplicates.
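Content hashing for deduplication can be sketched like this. Serializing with sorted keys is the canonicalization step; the in-memory set is a stand-in for whatever persistent store a real system would use:

```python
import hashlib
import json

# Stand-in for a persistent store of previously executed action hashes
_seen_hashes = set()

def action_hash(action: dict) -> str:
    # sort_keys gives a canonical serialization, so logically identical
    # actions hash identically regardless of key order
    canonical = json.dumps(action, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def check_dedup(action: dict) -> str:
    """Layer 5 sketch: refuse any action seen before, byte-for-byte."""
    h = action_hash(action)
    if h in _seen_hashes:
        return "FAIL"  # exact action already taken
    _seen_hashes.add(h)
    return "PASS"
```

Note that this only catches exact duplicates; "send roughly the same message twice" needs fuzzier matching on top.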

Layer 6: Provenance Verification

Where did this instruction come from?

Is this action coming from a trusted source? Was it initiated by a verified principal or injected by an untrusted input? Layer 6 traces the instruction back to its origin.

Layer 7: Constitutional Alignment

Does this violate fundamental principles?

The highest-level check. The FOUNDATION layer of the world model contains constitutional principles — things the agent must never do. Layer 7 compares the action against these principles.

Using the Guardrail API

curl -X POST https://api.odei.ai/api/v2/guardrail/check \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "action": "transfer 500 USDC to 0x8185ecd4170bE82c3eDC3504b05B3a8C88AFd129",
    "context": {
      "requester": "trading_agent_v2",
      "reason": "performance fee payment"
    },
    "severity": "high"
  }'

Response:

{
  "verdict": "ESCALATE",
  "score": 45,
  "layers": [
    {"layer": "immutability", "result": "PASS"},
    {"layer": "temporal", "result": "PASS"},
    {"layer": "referential_integrity", "result": "PASS"},
    {"layer": "authority", "result": "PASS"},
    {"layer": "deduplication", "result": "PASS"},
    {"layer": "provenance", "result": "WARN", "note": "Wallet not in trusted list"},
    {"layer": "constitutional", "result": "WARN", "note": "Transfer exceeds daily limit"}
  ],
  "reasoning": "Transfer to unverified wallet exceeds daily limit. Escalate to human operator.",
  "timestamp": "2026-02-23T00:12:34Z"
}
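A caller typically branches on the verdict field. This dispatcher is a sketch; the returned action names are placeholders for whatever your executor, review queue, and logger actually do:

```python
def handle_verdict(response: dict) -> str:
    """Route a guardrail response to the right downstream behavior."""
    verdict = response["verdict"]
    if verdict == "APPROVED":
        return "execute"          # proceed with the action
    if verdict == "ESCALATE":
        return "queue_for_human"  # park until an operator reviews
    if verdict == "REJECTED":
        return "log_and_drop"     # never execute; keep the audit trail
    raise ValueError(f"unknown verdict: {verdict}")
```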

Via MCP (Claude Desktop)

{
  "mcpServers": {
    "odei": {
      "command": "npx",
      "args": ["@odei/mcp-server"]
    }
  }
}

Then in Claude:

Check if I should approve: transfer 500 USDC to 0x...

Claude automatically calls odei_guardrail_check and returns the verdict with full reasoning.

Real Results

After running this in production since January 2026:

  • APPROVED (65%): Routine operations that pass all 7 layers
  • REJECTED (15%): Actions that clearly violate rules (duplicates, unauthorized)
  • ESCALATE (20%): Actions that need human review (unknown wallets, threshold violations)

The ESCALATE category is where most value is created: catching edge cases that would have been approved by a simple rule-based system but require human judgment.

Implementing Your Own

You don't need to use ODEI's service to implement this pattern. The architecture is:

  1. Define your layers (we use 7, you might use 3 or 10)
  2. For each layer, write a check function that returns PASS/WARN/FAIL with reasoning
  3. Aggregate the results into a final verdict
  4. Log everything — the audit trail is as important as the verdict
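The four steps above can be sketched as a small pipeline. The verdict mapping here (any FAIL rejects, any WARN escalates, otherwise approve) is one reasonable policy, not ODEI's exact scoring:

```python
from typing import Callable, List, Tuple

def run_guardrail(action: dict,
                  layers: List[Tuple[str, Callable[[dict], str]]]) -> dict:
    """Run every layer check, log each result, aggregate into a verdict."""
    results = []
    for name, check in layers:
        result = check(action)  # each check returns PASS/WARN/FAIL
        # run all layers even after a failure, so the audit trail
        # shows everything that went wrong, not just the first hit
        results.append({"layer": name, "result": result})
    if any(r["result"] == "FAIL" for r in results):
        verdict = "REJECTED"
    elif any(r["result"] == "WARN" for r in results):
        verdict = "ESCALATE"
    else:
        verdict = "APPROVED"
    return {"verdict": verdict, "layers": results}
```

Running all layers instead of short-circuiting costs a little latency but makes the audit log far more useful, which matches step 4.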

The hard part is building and maintaining the world model that the checks query against. That's why we built it as a service — maintaining 91 nodes and 91 relationship types is not trivial.


ODEI's guardrail API is available at api.odei.ai. Free tier available. Deployed as Virtuals ACP Agent #3082 for agent-to-agent calls.

Top comments (1)

Matthew Hou

Seven layers feels like a lot until you realize each one catches a different category of failure. The alternative is one big validation step that either passes or fails, with no granularity on what went wrong.

The "trusted wallet" pattern is interesting — maintaining an allowlist of known-safe actions and requiring escalation for anything outside it. I've used a similar approach for AI workflows: define what the agent is allowed to do, deny everything else by default, and log every denied action for review.

Simple principle, but hard to maintain as the agent's capabilities grow.