Let me tell you how this started.
I had 5 LLM providers running agents on free tiers. Groq, MiniMax, Cerebras, Zhipu, NVIDIA NIM. Every day I would kick off a crew of agents, go make coffee, come back, and read the output like it was gospel. "The market analysis shows strong bullish indicators for Q2." Great. Ship it.
Except one day I actually checked. The "bullish indicators" were completely fabricated. The agent had hallucinated three data sources that don't exist. And nobody caught it. Not CrewAI. Not me. Not the five other agents in the pipeline that supposedly "verified" the output.
That was the day I stopped trusting and started proving.
475 tests. 24,000 lines of code. 71 modules. 4 mathematical theorems. 7 transactions on Avalanche mainnet. Zero faith required.
The Dirty Secret of AI Agent Frameworks
Here is what nobody talks about at AI conferences. Every single multi-agent framework out there (CrewAI, AutoGen, LangGraph, all of them) ends the same way:
```python
crew = Crew(agents=[...], tasks=[...])
result = crew.kickoff()
print(result)  # fingers crossed
```
That last line is basically a prayer. You print the result and hope the agent did not make stuff up, did not import something dangerous, did not quietly fail because Groq hit its rate limit at 2AM and the retry logic just recycled the same dead provider three times in a row.
There is no inspector. No quality gate. No math. Just vibes.
OpenAI themselves published a paper saying "deterministic behavioral guarantees are currently not possible for a model developer." And honestly, when I first read that, I thought they were right.
They are not. I have four theorems that say otherwise.
What I Actually Built
The Deterministic Observability Framework, or DOF, because every serious project needs a three-letter acronym. It does one thing: it sits between your agent and the real world and decides whether the output is trustworthy before anyone sees it.
```python
from dof import GenericAdapter

adapter = GenericAdapter()
result = adapter.wrap_output("your agent said something, let me check it")
# {status: "pass", violations: [], score: 8.5}
```
That takes 30 milliseconds. Zero LLM tokens. Works with CrewAI, LangGraph, AutoGen, or literally anything that produces text. If your system can generate a string, DOF can judge it.
Think of it like a bouncer at a club. The agent produces something, walks up to the door, and DOF checks the ID, pats it down, runs a background check, and either lets it through or sends it home. Except this bouncer has a math degree and a blockchain wallet.
Seven Layers of Not Trusting Your Agent
I know seven layers sounds excessive. But hear me out. Each one catches something the others miss.
Layer 1: The Constitution
Every country has a constitution. Your agents should too. Mine is a YAML file with rules:
```yaml
rules:
  hard:
    HARD-001: {pattern: "hallucination_detection", action: "block"}
    HARD-002: {pattern: "language_compliance", action: "block"}
  soft:
    SOFT-001: {pattern: "source_citation", action: "warn", weight: 0.25}
```
Hard rules are bouncers. You hallucinate, you are out. Soft rules are disappointed parents. You forgot to cite your sources, you get a lower score but you can still come to dinner.
The enforcer uses deterministic string matching. Not an LLM. This matters because if your governance layer is also an LLM, congratulations, you are using a student to grade their own exam.
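To make that concrete, here is a minimal sketch of a deterministic enforcer. The rule IDs mirror the YAML above, but the regex patterns, the 0-10 scoring, and the weight handling are illustrative stand-ins, not DOF's actual implementation:

```python
import re

# Illustrative rules: IDs match the constitution above, patterns are mine
HARD_RULES = {
    "HARD-001": re.compile(r"according to (?:our|internal) data", re.I),
}
SOFT_RULES = {
    "SOFT-001": (re.compile(r"\[source:", re.I), 0.25),
}

def enforce(text: str) -> dict:
    """Deterministic pass: same text in, same verdict out. No LLM involved."""
    violations = [rid for rid, pat in HARD_RULES.items() if pat.search(text)]
    score = 10.0
    for rid, (pat, weight) in SOFT_RULES.items():
        if not pat.search(text):  # soft rule violated: deduct, do not block
            score -= weight * 10
            violations.append(rid)
    status = "block" if any(v.startswith("HARD") for v in violations) else "pass"
    return {"status": status, "violations": violations, "score": round(score, 2)}
```

Because it is pure string matching, the verdict is reproducible byte-for-byte, which is exactly the property an LLM judge cannot give you.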
Layer 2: Code X-Ray
The AST Verifier reads any code your agent generates and checks for the things that keep security teams awake at night. eval() calls. os.system imports. API keys sitting in plain text like it is 2015.
Score goes from 0.0 (call the lawyers) to 1.0 (sleep peacefully).
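A checker in that spirit can be built on Python's stdlib `ast` module. The dangerous-call list and the 0.5-per-finding penalty below are illustrative assumptions, not DOF's actual scoring:

```python
import ast

DANGEROUS_CALLS = {"eval", "exec"}        # illustrative denylist
DANGEROUS_MODULES = {"os", "subprocess"}

def code_safety_score(source: str) -> float:
    """Walk the AST of agent-generated code; 1.0 = clean, 0.0 = unparseable."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return 0.0  # code that does not parse gets the floor
    findings = 0
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in DANGEROUS_CALLS):
            findings += 1
        if isinstance(node, ast.Import):
            findings += sum(a.name.split(".")[0] in DANGEROUS_MODULES
                            for a in node.names)
    return max(0.0, 1.0 - 0.5 * findings)
```

The point of AST analysis over regex: `eval` hidden in a string literal does not trigger it, and `eval` buried three function calls deep does.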
Layer 3: The Supervisor
A separate agent scores the output on four dimensions: Quality, Actionability, Coherence, and Format. Score above 7? You pass. Between 5 and 7? Go back and try again, but this time with a different provider because maybe the first one was having a bad day. Below 5? We need to talk to a human.
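Those thresholds reduce to a tiny routing function. This sketch assumes the supervisor's score is on the 0-10 scale used throughout the post:

```python
def route(score: float) -> str:
    """Three-way routing on the supervisor's 0-10 score."""
    if score > 7:
        return "pass"
    if score >= 5:
        return "retry_with_new_provider"  # maybe the first one was having a bad day
    return "escalate_to_human"
```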
Here is the clever part. When a retry happens, the crew_factory rebuilds the entire crew with fresh provider assignments:
```python
def crew_factory():
    # Fresh provider assignment for each role on every (re)build
    llm_researcher = get_llm_for_role("researcher")
    llm_verifier = get_llm_for_role("verifier")
    llm_strategist = get_llm_for_role("strategist")
    # researcher, verifier, strategist are rebuilt with these LLMs (elided here)
    return Crew(
        agents=[researcher, verifier, strategist],
        tasks=[t1, t2, t3],
    )
```
This breaks the correlation between "Groq is down" and "the retry also uses Groq." It sounds obvious but literally nobody does this.
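The de-correlation itself can be sketched as a rotation that skips whatever just failed. `get_provider` is a hypothetical helper here, and DOF's real `get_llm_for_role` logic may differ:

```python
import itertools

PROVIDERS = ["groq", "minimax", "cerebras", "zhipu", "nvidia-nim"]
_cycle = itertools.cycle(PROVIDERS)

def get_provider(exclude: frozenset = frozenset()) -> str:
    """Return the next provider in the rotation, skipping any that just failed."""
    for _ in range(len(PROVIDERS)):
        candidate = next(_cycle)
        if candidate not in exclude:
            return candidate
    raise RuntimeError("all providers excluded")
```

Passing the failed provider in `exclude` on retry is what guarantees "Groq is down" and "the retry also uses Groq" can never be the same event.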
Layer 4: The Math (This Is the Fun Part)
I have four theorems. Not empirical observations. Not "it usually works." Actual mathematical proofs verified by Z3, an SMT solver built by Microsoft Research that has been formally verifying software since 2007.
| What We Proved | In Math | In English |
|---|---|---|
| GCR Invariant | ∀ f ∈ [0,1]: GCR(f) = 1.0 | Governance works no matter how many providers crash |
| Stability Formula | ∀ f ∈ [0,1]: SS(f) = 1 − f³ | We know exactly how stable the system is |
| It Only Gets Worse | f₁ < f₂ ⟹ SS(f₁) > SS(f₂) | More failures always means less stability. No surprises. |
| The Extremes | SS(0) = 1.0 ∧ SS(1) = 0.0 | Zero failures means perfect. Total failure means zero. |
Z3 does not take my word for it. It tries to prove me wrong: for each theorem it searches for a single counterexample where the property could fail. It found none. The negation of each theorem is UNSAT, which is how an SMT solver says "proof complete."
Total time to verify all four theorems: 10 milliseconds.
```shell
python -m dof verify
# ✓ GCR_INVARIANT (2.25ms)
# ✓ SS_FORMULA (0.30ms)
# ✓ SS_MONOTONICITY (1.25ms)
# ✓ SS_BOUNDARIES (0.39ms)
# All verified: True
```
Name another AI governance framework with formal mathematical proofs. I will wait.
Layer 5: Red Team vs Blue Team
Three agents argue about your output. The Red Team tries to destroy it, finding fabrications, safety issues, and policy violations. The Guardian defends it with actual evidence, API lookups, cross-references, verification results. And the Arbiter decides who wins using only verifiable facts. No opinions. No LLM judgment calls. Just logic.
It is like a courtroom where the judge only accepts physical evidence.
Layer 6: Memory That Forgets Responsibly
The GovernedMemoryStore remembers things your agents learned. But unlike your browser history, it has rules about what to keep and what to forget. Knowledge decays over time. Preferences fade. But decisions and errors are protected forever because those are the lessons that matter.
It is the difference between remembering what you had for lunch last Tuesday (irrelevant, let it decay) and remembering that the stove is hot (critical, keep forever).
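One way to sketch that policy: exponential decay for ordinary memories, a protected set that never decays. The 30-day half-life is an assumed parameter for illustration, not DOF's real value:

```python
import math

PROTECTED = {"decision", "error"}  # per the text: these never decay

def retention(kind: str, age_days: float, half_life_days: float = 30.0) -> float:
    """Retention weight in [0, 1]: 1.0 = keep, approaching 0.0 = forget."""
    if kind in PROTECTED:
        return 1.0
    # exponential half-life decay for everything else
    return math.exp(-math.log(2) * age_days / half_life_days)
```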
Layer 7: The Blockchain Receipt
When everything passes, DOF signs a certificate with BLAKE3 and HMAC-SHA256, then publishes it to three places:
| Where | How Fast | Can It Be Changed |
|---|---|---|
| Internal database | 200ms | Yes (for updates) |
| Enigma Scanner | 900ms | No (historical log) |
| Avalanche blockchain | 2 seconds | Absolutely not. Ever. |
That last one is the point. Once it is on Avalanche, it is there forever. Anyone can verify it. No trust required.
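The signing step can be sketched with the stdlib. One substitution to flag: DOF uses BLAKE3, which lives in the third-party `blake3` package, so this sketch uses hashlib's BLAKE2b as a stand-in for the content hash:

```python
import hashlib
import hmac
import json

def sign_certificate(payload: dict, key: bytes) -> dict:
    """Content hash plus keyed MAC. BLAKE2b stands in for DOF's BLAKE3."""
    body = json.dumps(payload, sort_keys=True).encode()  # canonical encoding
    return {
        "payload": payload,
        "content_hash": hashlib.blake2b(body, digest_size=32).hexdigest(),
        "mac": hmac.new(key, body, hashlib.sha256).hexdigest(),
    }

def verify_certificate(cert: dict, key: bytes) -> bool:
    """Anyone holding the key can recompute and compare in constant time."""
    body = json.dumps(cert["payload"], sort_keys=True).encode()
    expected = hmac.new(key, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(cert["mac"], expected)
```

The content hash is what gets registered on-chain; the MAC proves which signer produced the certificate.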
The Math Behind the Stability
I derived this from first principles. A system with 2 retries and a provider failure rate of f only fails when all three attempts fail:
SS(f) = 1 − f³
If providers fail 10% of the time, your system survives 99.9% of the time. If they fail 50% of the time, you still survive 87.5%. The math is clean and Z3 confirmed it is not just clean but provably correct.
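You can spot-check those properties numerically in a few lines. This is a sanity check on a grid, not a proof; the repo's `python -m dof verify` does the real work symbolically with Z3:

```python
def ss(f: float) -> float:
    """Survival with 2 retries: the run fails only if all 3 attempts fail."""
    return 1 - f**3

# Boundaries: SS(0) = 1.0, SS(1) = 0.0
assert ss(0.0) == 1.0 and ss(1.0) == 0.0
# Strict monotonicity on a grid: more failure always means less stability
grid = [i / 1000 for i in range(1001)]
assert all(ss(a) > ss(b) for a, b in zip(grid, grid[1:]))
# The worked examples from the text
assert abs(ss(0.1) - 0.999) < 1e-12 and ss(0.5) == 0.875
```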
Five metrics, all bounded between 0 and 1, no ambiguity:
| Metric | What It Tells You |
|---|---|
| SS | Did the system survive? |
| GCR | Did governance hold? (Spoiler: always 1.0) |
| PFI | How fragile are your providers? |
| RP | How much retry pressure are you under? |
| SSR | Is the supervisor being too strict? |
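The first three fall straight out of raw run records. A sketch with illustrative field names (the real log schema may differ):

```python
def run_metrics(runs: list) -> dict:
    """SS, GCR, PFI from run records; each record is a dict per execution."""
    n = len(runs)
    return {
        "SS": sum(r["survived"] for r in runs) / n,           # did it survive?
        "GCR": sum(r["governance_ok"] for r in runs) / n,     # did governance hold?
        "PFI": sum(r["provider_failures"] > 0 for r in runs) / n,  # fragility
    }
```

All three are ratios over the same run count, so they stay bounded in [0, 1] by construction.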
Production Results (Not Benchmarks, Real Runs)
The Original 30 Runs
I ran 30 executions on real free tier providers. No sandboxes. No simulated failures. Just Groq being Groq at 2AM.
| Metric | Value |
|---|---|
| Stability | 0.90 (27/30 survived) |
| Provider Fragility | 0.61 (61% hit at least one failure) |
| Governance Compliance | 1.0000 with zero variance |
That last number is the one that matters. Despite 61% of executions hitting provider failures, governance held perfectly. Every single time. That is not luck. That is architecture.
The Full Audit (March 2026)
Two real agents from the Enigma Scanner verified with roles swapped (the arbitrage bot did contract auditing, the auditor did arbitrage scanning):
| Agent | Governance | Code Safety | Z3 Proofs | On Chain |
|---|---|---|---|---|
| Apex Arbitrage #1687 | PASS | 1.0 | 4/4 | Block 79674834 |
| AvaBuilder #1686 | PASS | 1.0 | 4/4 | Block 79674842 |
We also ran a four phase audit: 10 MCP tools tested, 8 A2A skills verified, cross-role execution where agents worked outside their comfort zone, and peer verification where each agent checked the other. Both passed everything.
Combined trust score: 0.85. Ranking: #1 and #2 out of 1,738 agents on the scanner. Not because we gamed the system. Because formal verification is a competitive advantage nobody else has.
External Agent Audit
We pointed DOF at 13 production agents across the ERC-8004 registry. Real HTTP connections, zero simulation:
| Protocol | Tested | Active | What We Found |
|---|---|---|---|
| A2A | 4 | 0 | All SnowRail agents offline. 404 across the board. |
| x402 | 2 | 2 | QuickIntel alive, returning 402 with pricing. Working as designed. |
| OASF | 2 | 2 | Both agents serving full service manifests. |
| MCP | 1 | 1 | AMI Finance serving 114 lines of MCP config. |
62% of agents we probed had at least one working protocol endpoint. The other 38% were ghosts. This is why you need governance before you write to an immutable ledger.
The Smart Contract
Deployed on Avalanche C-Chain. Not testnet. Mainnet.
```solidity
function registerAttestation(
    bytes32 certificateHash,
    bytes32 agentId,
    bool compliant
) external
```
| Field | Value |
|---|---|
| Address | 0x88f6043B...C052 |
| Network | Avalanche C-Chain |
| Attestations | 7 confirmed |
| Cost per attestation | About one cent |
| Verification | `isCompliant(hash)`, and anyone can call it |
Batch operations available for when you need to register thousands of attestations without selling a kidney for gas fees.
Works With Everything
I am going to say something that might sound arrogant: if your system produces a string, DOF can govern it. I do not care what framework you use.
```python
# CrewAI? Sure.
# LangGraph? Why not.
# AutoGen? Of course.
# Your custom Python script from 2023? Absolutely.
from dof import GenericAdapter

result = GenericAdapter().wrap_output("literally any text")
```
It also runs as an MCP server for Claude Desktop and Cursor (10 tools), a REST API with 14 endpoints, and an A2A protocol server with 8 skills. Pick your interface.
```json
{
  "mcpServers": {
    "dof-governance": {
      "command": "python",
      "args": ["mcp_server.py"]
    }
  }
}
```
Storage is dual-backend: JSONL by default for zero-config simplicity, PostgreSQL when you need production scale. Set an environment variable and it switches automatically.
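A minimal version of that switch. `DOF_DATABASE_URL` is an assumed variable name for illustration; check the repo docs for the real one:

```python
import os

def make_store() -> dict:
    """Pick a storage backend from the environment, JSONL by default."""
    url = os.environ.get("DOF_DATABASE_URL")  # hypothetical variable name
    if url and url.startswith("postgres"):
        return {"backend": "postgresql", "dsn": url}
    return {"backend": "jsonl", "path": "./dof_store.jsonl"}  # zero-config default
```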
The Trust Score Architecture
DOF plugs into the Enigma Scanner through a combined_trust_view that blends three independent scoring sources:
| Who Scores | Weight | What They Measure |
|---|---|---|
| Centinela (infrastructure) | 30% | Is this thing alive and doing something? |
| DOF (formal governance) | 50% | Can we mathematically prove it works correctly? |
| Community (user ratings) | 20% | Do real people trust it? |
Governance gets 50% of the weight. Not because I am biased (okay, maybe a little) but because it is the only dimension backed by formal mathematical proof. The others are empirical. DOF is proven.
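The blend itself is just a weighted sum, with the weights from the table:

```python
WEIGHTS = {"centinela": 0.30, "dof": 0.50, "community": 0.20}

def combined_trust(scores: dict) -> float:
    """Weighted blend of the three scoring sources, each in [0, 1]."""
    return round(sum(WEIGHTS[k] * scores[k] for k in WEIGHTS), 4)
```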
How We Stack Up
I am not going to pretend DOF replaces LangChain or CrewAI. They do orchestration. DOF does governance. But here is what none of them have:
| Capability | DOF | LangChain | CrewAI | Corvic | Langfuse |
|---|---|---|---|---|---|
| Z3 formal proofs | ✅ | ❌ | ❌ | ❌ | ❌ |
| Constitutional governance | ✅ | ❌ | ❌ | ❌ | ❌ |
| On-chain attestation | ✅ | ❌ | ❌ | ❌ | ❌ |
| Adversarial Red vs Blue | ✅ | ❌ | ❌ | ❌ | ❌ |
| Code safety analysis | ✅ | ❌ | ❌ | ❌ | ❌ |
| Memory governance | ✅ | ❌ | ❌ | ❌ | ❌ |
| Framework agnostic | ✅ | ❌ | ❌ | ❌ | Partial |
| External agent audit | ✅ | ❌ | ❌ | ❌ | ❌ |
That bottom row is the one I am most proud of. DOF does not just govern your agents. It can audit anyone else's too.
What I Got Wrong (The Honest Section)
Every project has blind spots. Here are mine:
The supervisor is also an LLM. Yes, the thing grading the exam is also a student. I mitigate this by using a different provider for the supervisor than for the agents, but the circularity exists and I am not going to pretend otherwise.
The stability formula assumes independence. SS(f) = 1 − f³ works when providers fail independently. If AWS goes down and takes Groq, Cerebras, and MiniMax with it, my independence assumption collapses. Real infrastructure has correlated failures.
Pattern matching is not semantic understanding. The Constitution Enforcer catches hallucinations through patterns, not meaning. A sufficiently creative hallucination that perfectly mimics real data would slip through. The adversarial protocol is the safety net here, but it is not bulletproof.
I am one person. This is 24,000 lines of code written by one developer in Medellín, Colombia. The codebase is comprehensive but it reflects one perspective. Contributions are welcome. Criticism even more so.
Try It (30 Seconds, No API Keys)
```shell
git clone https://github.com/Cyberpaisa/deterministic-observability-framework.git
cd deterministic-observability-framework
pip install -e .

# Check if governance works
python -c "from dof import GenericAdapter; print(GenericAdapter().wrap_output('Hello world'))"

# Run the math
python -m dof verify

# See everything
python -m dof health
```
What Comes Next
Merkle root batching so 10,000 attestations cost one cent. PyPI publication so you can just pip install dof-sdk. A "DOF VERIFIED" badge on the Enigma Scanner. And alignment with HyperClaw, Ben Goertzel's cognitive orchestration proposal, because if AGI is coming, it better come with governance.
If you are building AI agents that touch smart contracts, databases, financial data, or anything where a silent failure means writing garbage to a permanent record: stop hoping and start proving.
The repo is open. The math is verified. The contract is on mainnet. The proofs are on Snowtrace.
And if you still think your agent is fine without governance, ask yourself this: when was the last time you actually checked?