slucerodev
# Your AI Agent Just Did Something. Can You Prove What It Was?

You deployed an AI agent. It took an action. Something went wrong.

Now answer these questions:

- What exactly did it do?
- What input caused it?
- Can you reproduce the exact sequence?
- Can you prove to your legal team that it didn't do something else?

If the answer to any of those is "I'm not sure" — you have a governance problem.

## The Actual Problem
Most AI agent frameworks are built around getting things done. That's fine. But they have no answer for proving what was done, why it was done, or reconstructing the exact execution trace after the fact.

Logs help. Traces help. But neither gives you deterministic replay — the ability to take a set of inputs and provably reconstruct the same execution, byte for byte, every time.

Without that, your audit trail is a story you're telling. With it, it's evidence.
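
The difference can be made concrete with a stdlib-only sketch (an illustration of the idea, not ExoArmur's API): deterministic replay means folding the same recorded events through the same canonical serialization always yields a byte-identical digest.

```python
import hashlib
import json

def trace_digest(events):
    """Fold recorded events into one digest.

    Canonical JSON (sorted keys, fixed separators) makes the
    serialization stable, so the same events always hash the same.
    """
    h = hashlib.sha256()
    for event in events:
        h.update(json.dumps(event, sort_keys=True,
                            separators=(",", ":")).encode())
    return h.hexdigest()

events = [
    {"actor": "agent-1", "action": "read_file", "arg": "/etc/hosts"},
    {"actor": "agent-1", "action": "send_email", "arg": "ops@example.com"},
]

# Replaying the same inputs reconstructs the same digest every time.
assert trace_digest(events) == trace_digest(list(events))
```

A log line can be edited after the fact; a digest recomputed from the inputs either matches or it doesn't.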

## What ExoArmur Does
ExoArmur is a governance layer that sits between your AI decision engine and your execution targets. It doesn't replace your agent framework. It wraps it.

Every action that passes through ExoArmur:

- Gets evaluated by a policy decision point before it runs
- Produces a cryptographically bound audit record tied to the original intent
- Is deterministically replayable — same inputs always reconstruct the same trace
- Can be vetoed or queued for human operator approval

The pipeline looks like this:

```
Decision Source → ActionIntent → PolicyDecisionPoint → SafetyGate → [Approval?] → Executor → ExecutionProofBundle
```
The key invariant: no action executes without passing through the governance boundary. Executors are untrusted plugins. The core is immutable. CI enforces determinism with a three-run stability gate on every push.

## See It in 5 Minutes

```bash
pip install exoarmur-core
```

Then run this:

```python
from exoarmur import ReplayEngine
from exoarmur.replay.event_envelope import CanonicalEvent
import hashlib, json

payload = {"kind": "inline", "ref": {"event_id": "01ARZ3NDEKTSV4RRFFQ69G5FAV"}}
event = CanonicalEvent(
    event_id="01ARZ3NDEKTSV4RRFFQ69G5FAV",
    event_type="belief_creation_started",
    actor="demo",
    correlation_id="corr-1",
    payload=payload,
    payload_hash=hashlib.sha256(
        json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    ).hexdigest(),
)

engine = ReplayEngine(audit_store={"corr-1": [event]})
report = engine.replay_correlation("corr-1")

print("Replay result:", getattr(report.result, "value", report.result))
print("Failures:", report.failures or "none")
```
Output:

```
Replay result: success
Failures: none
```
That's a real deterministic replay over a real cryptographically-bound audit event. Not a mock. Not a demo stub. The same code runs in CI on every push.
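
"Cryptographically bound" generally means each audit record commits to the hash of the one before it, so a record can't be edited or dropped without breaking every later hash. A generic sketch of that technique (an illustration of hash chaining, not ExoArmur's internal record format):

```python
import hashlib
import json

GENESIS = "0" * 64  # sentinel hash for the first record

def append_record(chain, payload):
    """Append a record whose hash covers the payload *and* the
    previous record's hash, forming a tamper-evident chain."""
    prev = chain[-1]["hash"] if chain else GENESIS
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256((prev + body).encode()).hexdigest()
    chain.append({"prev": prev, "payload": payload, "hash": digest})

chain = []
append_record(chain, {"action": "send_email", "decision": "allow"})
append_record(chain, {"action": "delete_db", "decision": "deny"})

# Altering the first record would invalidate chain[1]["prev"]
# and every hash after it.
assert chain[1]["prev"] == chain[0]["hash"]
```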

## The Governance Pipeline

Turn on the full V2 governance pipeline with three environment flags:

```bash
EXOARMUR_FLAG_V2_FEDERATION_ENABLED=true \
EXOARMUR_FLAG_V2_CONTROL_PLANE_ENABLED=true \
EXOARMUR_FLAG_V2_OPERATOR_APPROVAL_REQUIRED=true \
python scripts/demo_v2_restrained_autonomy.py --operator-decision deny
```
Output:

```
DEMO_RESULT=DENIED
ACTION_EXECUTED=false
AUDIT_STREAM_ID=det-...
```

The action was vetoed. The denial is in the audit trail. Replay the stream:

```bash
python scripts/demo_v2_restrained_autonomy.py --replay
```
Same trace. Every time. Provably.

## What It's Not
ExoArmur is not an LLM. Not an agent framework. Not a workflow engine. It doesn't care what's making decisions — it only cares that whatever executes passes through the governance boundary and leaves a verifiable trail.

Use it with LangChain. Use it with CrewAI. Use it with a custom decision layer. It wraps whatever you have.

## Why Determinism Is the Core Bet
The three-run stability gate in CI isn't ceremony. It's the central guarantee: if your system can't reproduce the same trace from the same inputs, your audit trail isn't an audit trail. It's a log. Logs can be explained away. Deterministic replay cannot.
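
The gate itself is conceptually tiny. A hedged sketch of the idea (the real gate lives in the repo's CI; `run_pipeline` here is a stand-in for a governed execution):

```python
import hashlib
import json

def run_pipeline(inputs):
    # Stand-in for a governed execution. To pass the gate it must
    # avoid hidden nondeterminism: wall-clock time, randomness,
    # unsorted set iteration, and the like.
    trace = [{"step": i, "input": x} for i, x in enumerate(inputs)]
    blob = json.dumps(trace, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode()).hexdigest()

# Three runs over the same inputs must collapse to a single digest.
digests = {run_pipeline(["intent-a", "intent-b"]) for _ in range(3)}
assert len(digests) == 1, "nondeterminism detected"
```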

This matters right now because:

- Enterprises deploying AI agents are facing internal compliance reviews
- Regulated industries (finance, healthcare, legal) need more than "the model decided"
- Incident response for AI systems has almost no dedicated tooling — ExoArmur is a start

## Try It

```bash
git clone https://github.com/slucerodev/ExoArmur-Core.git
cd ExoArmur-Core
pip install ".[dev]"
python -m pytest -q
```
1033 tests. Three deterministic runs. No external infrastructure required for the core suite.

Repo: github.com/slucerodev/ExoArmur-Core

If you're building agents in production and care about what they actually did — this is built for you.

Top comments (1)

Ali Muwwakkil

One surprising insight: logging and traceability are often afterthoughts, but they should be foundational. In my experience with enterprise teams, the lack of a well-structured logging system is the main reason they struggle to understand AI agent actions. Implementing robust logging frameworks like ELK or Prometheus can provide the necessary transparency and accountability. This not only helps in debugging but also builds trust in AI systems, making them more reliable and actionable. - Ali Muwwakkil (ali-muwwakkil on LinkedIn)