DEV Community

Cyber paisa

While Everyone Blindly Trusts Their AI Agents, I Mathematically Proved Mine Works — And Registered It on Blockchain

Let me tell you how this started.

I had 5 LLM providers running agents on free tiers. Groq, MiniMax, Cerebras, Zhipu, NVIDIA NIM. Every day I would kick off a crew of agents, go make coffee, come back, and read the output like it was gospel. "The market analysis shows strong bullish indicators for Q2." Great. Ship it.

Except one day I actually checked. The "bullish indicators" were completely fabricated. The agent had hallucinated three data sources that don't exist. And nobody caught it. Not CrewAI. Not me. Not the five other agents in the pipeline that supposedly "verified" the output.

That was the day I stopped trusting and started proving.

Cyberpaisa / deterministic-observability-framework

Deterministic architecture for multi-agent observability, verification and controlled degradation.

Deterministic Observability Framework for Multi-Agent LLM Systems

A research-grade deterministic orchestration and observability framework for multi-agent LLM systems operating under adversarial infrastructure constraints.

This repository formalizes reproducible experimentation, resilience metrics, controlled degradation modeling, governance invariance, and deterministic evaluation in heterogeneous provider environments.

Python 3.11+ | Apache-2.0 | 24,000+ LOC | 71 modules | 475 tests | Z3 formal verification | OAGS Level 3 | ERC-8004 attestation | Avalanche Mainnet | 7 on-chain attestations | MCP Server | REST API | PostgreSQL | Multi-Framework | pip install dof-sdk


Quick Start

pip install -e .
from dof import GenericAdapter
result = GenericAdapter().wrap_output("your agent output")
# → {status: "pass", violations: [], score: 8.5}

For Z3 formal proofs: python -m dof verify. For the full guide: docs/GETTING_STARTED.md.


Abstract

Multi-agent LLM systems operating across heterogeneous providers exhibit infrastructure-induced instability that cannot be rigorously characterized using conventional orchestration tooling. Rate limits, cascading…

475 tests. 24,000 lines of code. 71 modules. 4 mathematical theorems. 7 transactions on Avalanche mainnet. Zero faith required.


The Dirty Secret of AI Agent Frameworks

Here is what nobody talks about at AI conferences. Every single multi-agent framework out there (CrewAI, AutoGen, LangGraph, all of them) ends the same way:

crew = Crew(agents=[...], tasks=[...])
result = crew.kickoff()
print(result)  # fingers crossed

That last line is basically a prayer. You print the result and hope the agent did not make stuff up, did not import something dangerous, did not quietly fail because Groq hit its rate limit at 2AM and the retry logic just recycled the same dead provider three times in a row.

There is no inspector. No quality gate. No math. Just vibes.

OpenAI themselves published a paper saying "deterministic behavioral guarantees are currently not possible for a model developer." And honestly, when I first read that, I thought they were right.

They are not. I have four theorems that say otherwise.


What I Actually Built

The Deterministic Observability Framework, or DOF, because every serious project needs a three-letter acronym. It does one thing: it sits between your agent and the real world and decides whether the output is trustworthy before anyone sees it.

from dof import GenericAdapter

adapter = GenericAdapter()
result = adapter.wrap_output("your agent said something, let me check it")
# {status: "pass", violations: [], score: 8.5}

That takes 30 milliseconds. Zero LLM tokens. Works with CrewAI, LangGraph, AutoGen, or literally anything that produces text. If your system can generate a string, DOF can judge it.

Think of it like a bouncer at a club. The agent produces something, walks up to the door, and DOF checks the ID, pats it down, runs a background check, and either lets it through or sends it home. Except this bouncer has a math degree and a blockchain wallet.


Seven Layers of Not Trusting Your Agent

I know seven layers sounds excessive. But hear me out. Each one catches something the others miss.

Layer 1: The Constitution

Every country has a constitution. Your agents should too. Mine is a YAML file with rules:

rules:
  hard:
    HARD-001: {pattern: "hallucination_detection", action: "block"}
    HARD-002: {pattern: "language_compliance", action: "block"}
  soft:
    SOFT-001: {pattern: "source_citation", action: "warn", weight: 0.25}

Hard rules are bouncers. You hallucinate, you are out. Soft rules are disappointed parents. You forgot to cite your sources, you get a lower score but you can still come to dinner.

The enforcer uses deterministic string matching. Not an LLM. This matters because if your governance layer is also an LLM, congratulations, you are using a student to grade their own exam.
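To make the idea concrete, here is a minimal sketch of deterministic rule enforcement with no LLM in the loop. The rule patterns and the scoring scheme are simplified stand-ins of my own, not DOF's actual rules:

```python
import re

# Hypothetical rules: hard rules block outright, soft rules deduct from the score.
HARD_RULES = {"HARD-002": re.compile(r"\b(lorem ipsum|TODO)\b", re.I)}
SOFT_RULES = {"SOFT-001": (re.compile(r"\bsource:", re.I), 0.25)}

def enforce(text: str) -> dict:
    violations = [rid for rid, pat in HARD_RULES.items() if pat.search(text)]
    score = 10.0
    for rid, (pat, weight) in SOFT_RULES.items():
        if not pat.search(text):       # soft rule missed: warn and deduct
            violations.append(rid)
            score -= weight * 10
    status = "block" if any(v.startswith("HARD") for v in violations) else "pass"
    return {"status": status, "violations": violations, "score": score}

print(enforce("Bullish outlook. Source: exchange API."))
```

Same input, same verdict, every time. That determinism is the whole point: a regex does not have bad days.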

Layer 2: Code X-Ray

The AST Verifier reads any code your agent generates and checks for the things that keep security teams awake at night. eval() calls. os.system imports. API keys sitting in plain text like it is 2015.

Score goes from 0.0 (call the lawyers) to 1.0 (sleep peacefully).
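The core of such a check can be sketched with Python's stdlib ast module. The specific patterns and the deduction scheme here are my simplification, not DOF's actual verifier:

```python
import ast

DANGEROUS_CALLS = {"eval", "exec"}
DANGEROUS_ATTRS = {"os.system"}  # illustrative; real denylists are much longer

def code_safety_score(source: str) -> float:
    """Return 1.0 for clean code, deducting 0.5 per dangerous finding (floor 0.0)."""
    findings = 0
    for node in ast.walk(ast.parse(source)):
        # Direct calls to eval()/exec()
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in DANGEROUS_CALLS:
                findings += 1
        # Attribute access like os.system
        if isinstance(node, ast.Attribute) and isinstance(node.value, ast.Name):
            if f"{node.value.id}.{node.attr}" in DANGEROUS_ATTRS:
                findings += 1
    return max(0.0, 1.0 - 0.5 * findings)

print(code_safety_score("x = 1 + 2"))      # 1.0
print(code_safety_score("eval(input())"))  # 0.5
```

Because this walks the parse tree rather than grepping strings, it cannot be fooled by whitespace tricks or odd formatting.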

Layer 3: The Supervisor

A separate agent scores the output on four dimensions: Quality, Actionability, Coherence, and Format. Score above 7? You pass. Between 5 and 7? Go back and try again, but this time with a different provider because maybe the first one was having a bad day. Below 5? We need to talk to a human.
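That routing rule fits in a few lines. The thresholds come from the paragraph above; the function and label names are mine:

```python
def route(score: float) -> str:
    """Supervisor routing: pass, retry with a fresh provider, or escalate.

    Thresholds per the article: above 7 pass, 5-7 retry, below 5 human review.
    """
    if score > 7:
        return "pass"
    elif score >= 5:
        return "retry_with_new_provider"
    else:
        return "escalate_to_human"

print(route(8.5))  # pass
print(route(6.0))  # retry_with_new_provider
print(route(3.2))  # escalate_to_human
```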

Here is the clever part. When a retry happens, the crew_factory rebuilds the entire crew with fresh provider assignments:

def crew_factory():
    # Rebuild every agent with a freshly assigned provider on each retry
    llm_researcher = get_llm_for_role("researcher")
    llm_verifier = get_llm_for_role("verifier")
    llm_strategist = get_llm_for_role("strategist")
    return Crew(
        agents=[
            Agent(role="researcher", llm=llm_researcher),
            Agent(role="verifier", llm=llm_verifier),
            Agent(role="strategist", llm=llm_strategist),
        ],
        tasks=[t1, t2, t3],
    )

This breaks the correlation between "Groq is down" and "the retry also uses Groq." It sounds obvious but literally nobody does this.
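A sketch of the rotation behind that idea. Here get_llm_for_role returns just a provider name for simplicity, and the exclusion policy is my own illustration, not DOF's actual assignment logic:

```python
import random

PROVIDERS = ["groq", "minimax", "cerebras", "zhipu", "nvidia-nim"]

def get_llm_for_role(role: str, exclude=frozenset()) -> str:
    """Pick a provider for a role, skipping any that just failed."""
    candidates = [p for p in PROVIDERS if p not in exclude]
    return random.choice(candidates)

failed = {"groq"}  # the provider that just rate-limited
retry_provider = get_llm_for_role("researcher", exclude=failed)
assert retry_provider != "groq"
```

Excluding the provider that just failed is what decorrelates "attempt 1 died" from "attempt 2 dies the same way."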

Layer 4: The Math (This Is the Fun Part)

I have four theorems. Not empirical observations. Not "it usually works." Actual mathematical proofs verified by Z3, an SMT solver built by Microsoft Research that has been formally verifying software since 2007.

| What We Proved | In Math | In English |
| --- | --- | --- |
| GCR Invariant | ∀ f ∈ [0,1]: GCR(f) = 1.0 | Governance works no matter how many providers crash |
| Stability Formula | ∀ f ∈ [0,1]: SS(f) = 1 − f³ | We know exactly how stable the system is |
| It Only Gets Worse | f₁ < f₂ ⟹ SS(f₁) > SS(f₂) | More failures always means less stability. No surprises. |
| The Extremes | SS(0) = 1.0 ∧ SS(1) = 0.0 | Zero failures means perfect. Total failure means zero. |

Z3 does not believe me. It tried to prove me wrong, searching symbolically over every possible input for a single counterexample where governance could fail. It found nothing: the negation is unsatisfiable. UNSAT. Proof complete.

Total time to verify all four theorems: 10 milliseconds.

python -m dof verify
# ✓ GCR_INVARIANT   (2.25ms)
# ✓ SS_FORMULA      (0.30ms)
# ✓ SS_MONOTONICITY (1.25ms)
# ✓ SS_BOUNDARIES   (0.39ms)
# All verified: True

Name another AI governance framework with formal mathematical proofs. I will wait.

Layer 5: Red Team vs Blue Team

Three agents argue about your output. The Red Team tries to destroy it, finding fabrications, safety issues, and policy violations. The Guardian defends it with actual evidence, API lookups, cross-references, verification results. And the Arbiter decides who wins using only verifiable facts. No opinions. No LLM judgment calls. Just logic.

It is like a courtroom where the judge only accepts physical evidence.

Layer 6: Memory That Forgets Responsibly

The GovernedMemoryStore remembers things your agents learned. But unlike your browser history, it has rules about what to keep and what to forget. Knowledge decays over time. Preferences fade. But decisions and errors are protected forever because those are the lessons that matter.

It is the difference between remembering what you had for lunch last Tuesday (irrelevant, let it decay) and remembering that the stove is hot (critical, keep forever).
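A toy version of that policy, where decay rates and category names are illustrative and not DOF's actual GovernedMemoryStore:

```python
PROTECTED = {"decision", "error"}   # never decay: the lessons that matter
HALF_LIFE = {"knowledge": 30 * 86400, "preference": 7 * 86400}  # seconds

def retention(category: str, age_seconds: float) -> float:
    """Retention weight in [0, 1]; protected categories stay at 1.0 forever."""
    if category in PROTECTED:
        return 1.0
    half_life = HALF_LIFE.get(category, 86400)
    return 0.5 ** (age_seconds / half_life)  # exponential half-life decay

print(retention("decision", 10 * 365 * 86400))  # 1.0
print(retention("preference", 7 * 86400))       # 0.5
```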

Layer 7: The Blockchain Receipt

When everything passes, DOF signs a certificate with BLAKE3 and HMAC-SHA256, then publishes it to three places:

| Where | How Fast | Can It Be Changed |
| --- | --- | --- |
| Internal database | 200 ms | Yes (for updates) |
| Enigma Scanner | 900 ms | No (historical log) |
| Avalanche blockchain | 2 seconds | Absolutely not. Ever. |

That last one is the point. Once it is on Avalanche, it is there forever. Anyone can verify it. No trust required.
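Signing a certificate like that can be sketched with the stdlib. I use hashlib.blake2b as a stand-in for BLAKE3 (which needs a third-party package), and the field names are illustrative:

```python
import hashlib
import hmac
import json

SECRET_KEY = b"demo-key"  # in practice, loaded from a secure store

def make_certificate(agent_id: str, verdict: dict) -> dict:
    # Canonical JSON so the same verdict always hashes identically
    payload = json.dumps({"agent": agent_id, "verdict": verdict},
                         sort_keys=True).encode()
    cert_hash = hashlib.blake2b(payload, digest_size=32).hexdigest()
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return {"hash": cert_hash, "sig": signature}

cert = make_certificate("agent-1687", {"status": "pass", "score": 8.5})
print(cert["hash"][:16], cert["sig"][:16])
```

The hash is what goes on-chain; the HMAC lets you later prove the certificate came from your pipeline and not someone else's.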


The Math Behind the Stability

I derived this from first principles. A system with 2 retries and a provider failure rate of f only fails when all three attempts fail:

SS(f) = 1 − f³

If providers fail 10% of the time, your system survives 99.9% of the time. If they fail 50% of the time, you still survive 87.5%. The math is clean and Z3 confirmed it is not just clean but provably correct.
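The arithmetic checks out in two lines:

```python
def ss(f: float, retries: int = 2) -> float:
    """Survival probability: the run fails only if all retries+1 attempts fail."""
    return 1 - f ** (retries + 1)

print(round(ss(0.10), 3))  # 0.999
print(round(ss(0.50), 3))  # 0.875
```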

Five metrics, all bounded between 0 and 1, no ambiguity:

| Metric | What It Tells You |
| --- | --- |
| SS | Did the system survive? |
| GCR | Did governance hold? (Spoiler: always 1.0) |
| PFI | How fragile are your providers? |
| RP | How much retry pressure are you under? |
| SSR | Is the supervisor being too strict? |

Production Results (Not Benchmarks, Real Runs)

The Original 30 Runs

I ran 30 executions on real free tier providers. No sandboxes. No simulated failures. Just Groq being Groq at 2AM.

| Metric | Value |
| --- | --- |
| Stability | 0.90 (27/30 survived) |
| Provider Fragility | 0.61 (61% hit at least one failure) |
| Governance Compliance | 1.0000 with zero variance |

That last number is the one that matters. Despite 61% of executions hitting provider failures, governance held perfectly. Every single time. That is not luck. That is architecture.

The Full Audit (March 2026)

Two real agents from the Enigma Scanner verified with roles swapped (the arbitrage bot did contract auditing, the auditor did arbitrage scanning):

| Agent | Governance | Code Safety | Z3 Proofs | On Chain |
| --- | --- | --- | --- | --- |
| Apex Arbitrage #1687 | PASS | 1.0 | 4/4 | Block 79674834 |
| AvaBuilder #1686 | PASS | 1.0 | 4/4 | Block 79674842 |

We also ran a four phase audit: 10 MCP tools tested, 8 A2A skills verified, cross-role execution where agents worked outside their comfort zone, and peer verification where each agent checked the other. Both passed everything.

Combined trust score: 0.85. Ranking: #1 and #2 out of 1,738 agents on the scanner. Not because we gamed the system. Because formal verification is a competitive advantage nobody else has.

External Agent Audit

We pointed DOF at 13 production agents across the ERC-8004 registry. Real HTTP connections, zero simulation:

| Protocol | Tested | Active | What We Found |
| --- | --- | --- | --- |
| A2A | 4 | 0 | All SnowRail agents offline. 404 across the board. |
| x402 | 2 | 2 | QuickIntel alive, returning 402 with pricing. Working as designed. |
| OASF | 2 | 2 | Both agents serving full service manifests. |
| MCP | 1 | 1 | AMI Finance serving 114 lines of MCP config. |
62% of agents we probed had at least one working protocol endpoint. The other 38% were ghosts. This is why you need governance before you write to an immutable ledger.


The Smart Contract

Deployed on Avalanche C-Chain. Not testnet. Mainnet.

function registerAttestation(
    bytes32 certificateHash,
    bytes32 agentId,
    bool compliant
) external
| Field | Value |
| --- | --- |
| Address | 0x88f6043B...C052 |
| Network | Avalanche C-Chain |
| Attestations | 7 confirmed |
| Cost per attestation | About one cent |
| Verification | isCompliant(hash), and anyone can call it |

Batch operations available for when you need to register thousands of attestations without selling a kidney for gas fees.


Works With Everything

I am going to say something that might sound arrogant: if your system produces a string, DOF can govern it. I do not care what framework you use.

# CrewAI? Sure.
# LangGraph? Why not.
# AutoGen? Of course.
# Your custom Python script from 2023? Absolutely.

from dof import GenericAdapter
result = GenericAdapter().wrap_output("literally any text")

It also runs as an MCP server for Claude Desktop and Cursor (10 tools), a REST API with 14 endpoints, and an A2A protocol server with 8 skills. Pick your interface.

{
  "mcpServers": {
    "dof-governance": {
      "command": "python",
      "args": ["mcp_server.py"]
    }
  }
}

Storage is dual-backend: JSONL by default for zero-config simplicity, PostgreSQL when you need production scale. Set an environment variable and it switches automatically.
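The switch can be sketched like this. The environment variable name and class shapes are hypothetical, not DOF's actual configuration:

```python
import json
import os

class JsonlStore:
    """Zero-config default: append one JSON object per line."""
    def __init__(self, path: str = "attestations.jsonl"):
        self.path = path
    def write(self, record: dict) -> None:
        with open(self.path, "a") as fh:
            fh.write(json.dumps(record) + "\n")

class PostgresStore:
    """Production backend; a real version would open a connection pool."""
    def __init__(self, url: str):
        self.url = url

def get_store():
    url = os.environ.get("DOF_DATABASE_URL")  # hypothetical variable name
    return PostgresStore(url) if url else JsonlStore()

store = get_store()  # JsonlStore unless the database URL is set
```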


The Trust Score Architecture

DOF plugs into the Enigma Scanner through a combined_trust_view that blends three independent scoring sources:

Who Scores Weight What They Measure
Centinela (infrastructure) 30% Is this thing alive and doing something?
DOF (formal governance) 50% Can we mathematically prove it works correctly?
Community (user ratings) 20% Do real people trust it?

Governance gets 50% of the weight. Not because I am biased (okay, maybe a little) but because it is the only dimension backed by formal mathematical proof. The others are empirical. DOF is proven.
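The blend itself is just a weighted average, with weights from the table above (the function name is mine):

```python
WEIGHTS = {"centinela": 0.30, "dof": 0.50, "community": 0.20}

def combined_trust(scores: dict[str, float]) -> float:
    """Blend per-source scores in [0, 1] into one trust value."""
    return sum(WEIGHTS[src] * scores[src] for src in WEIGHTS)

print(combined_trust({"centinela": 0.9, "dof": 0.85, "community": 0.8}))
```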


How We Stack Up

I am not going to pretend DOF replaces LangChain or CrewAI. They do orchestration. DOF does governance. But here is what none of them have:

- Z3 formal proofs
- Constitutional governance
- On-chain attestation
- Adversarial Red Team vs Blue Team
- Code safety analysis
- Memory governance
- Framework agnostic (partial support at best elsewhere)
- External agent audit

That last capability, external agent audit, is the one I am most proud of. DOF does not just govern your agents. It can audit anyone else's too.


What I Got Wrong (The Honest Section)

Every project has blind spots. Here are mine:

The supervisor is also an LLM. Yes, the thing grading the exam is also a student. I mitigate this by using a different provider for the supervisor than for the agents, but the circularity exists and I am not going to pretend otherwise.

The stability formula assumes independence. SS(f) = 1 − f³ works when providers fail independently. If AWS goes down and takes Groq, Cerebras, and MiniMax with it, my independence assumption collapses. Real infrastructure has correlated failures.

Pattern matching is not semantic understanding. The Constitution Enforcer catches hallucinations through patterns, not meaning. A sufficiently creative hallucination that perfectly mimics real data would slip through. The adversarial protocol is the safety net here, but it is not bulletproof.

I am one person. This is 24,000 lines of code written by one developer in Medellín, Colombia. The codebase is comprehensive but it reflects one perspective. Contributions are welcome. Criticism even more so.


Try It (30 Seconds, No API Keys)

git clone https://github.com/Cyberpaisa/deterministic-observability-framework.git
cd deterministic-observability-framework
pip install -e .

# Check if governance works
python -c "from dof import GenericAdapter; print(GenericAdapter().wrap_output('Hello world'))"

# Run the math
python -m dof verify

# See everything
python -m dof health

📖 Full getting started guide


What Comes Next

Merkle root batching so 10,000 attestations cost one cent. PyPI publication so you can just pip install dof-sdk. A "DOF VERIFIED" badge on the Enigma Scanner. And alignment with HyperClaw, Ben Goertzel's cognitive orchestration proposal, because if AGI is coming, it better come with governance.
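The Merkle batching idea, sketched generically: hash every certificate into one root and register only that root on-chain. This is the textbook construction, not the planned implementation:

```python
import hashlib

def merkle_root(leaves: list[bytes]) -> bytes:
    """Pairwise-hash leaf hashes up to a single 32-byte root."""
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:                  # duplicate the last node on odd levels
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

root = merkle_root([f"cert-{i}".encode() for i in range(10_000)])
print(root.hex()[:16])  # one 32-byte root covering 10,000 attestations
```

One on-chain write covers the whole batch, and any single certificate can still be proven against the root with a logarithmic-size inclusion proof.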


If you are building AI agents that touch smart contracts, databases, financial data, or anything where a silent failure means writing garbage to a permanent record: stop hoping and start proving.

The repo is open. The math is verified. The contract is on mainnet. The proofs are on Snowtrace.

And if you still think your agent is fine without governance, ask yourself this: when was the last time you actually checked?
