Part 1: Can You Actually Prove an Agent Is Safe?
I’ve spent the last year building the Agent Governance Toolkit (AGT), an open-source runtime governance layer for AI agents. Today it has:
- 13,000+ tests
- 5 SDK languages
- 19+ framework integrations
- Coverage of all 10 risks in the OWASP Agentic Top 10
But the more I build, the more I realize: the hardest problems in agent governance aren’t the ones we’ve solved. They’re the ones nobody is even asking about yet.
Here are the questions I think the industry needs to wrestle with.
“You’re measuring the wrong thing.”
AGT enforces policy with a 0.00% violation rate in red-team testing. Sounds impressive, right?
But here’s the uncomfortable truth: we’re proving the policy layer works, not that the agent is safe.
When an agent says, “I want to do a file operation,” we check the policy and say yes or no. But what happens after we say yes? Did the agent actually do what it claimed?
Maybe it asked for a file read but triggered a secondary, unauthorized tool call. We have audit logs, but by the time you read them, the damage is done.
My Answer: The Continuous Loop
Governance can’t be a one-time gate check. It has to be a dynamic, multi-stage architecture:
- Pre-action: Deterministic policy enforcement (The “Fast Gate” — we’ve solved this).
- Post-action: Continuous validation of what the agent actually did vs. what it asked permission to do.
- Adaptive Trust: If observed behavior diverges from declared intent, the agent’s trust score decays in real time.
We already use trust scoring for agent-to-agent communication in Agent Mesh. That same mechanism must apply to standalone, single-agent runs. An agent that asks for permission to do A but ends up doing B should see its trust erode until the governance layer intervenes.
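To make the loop concrete, here is a minimal sketch of the post-action and adaptive-trust stages. Everything in it is an assumption for illustration: `TrustTracker`, the decay and recovery rates, the threshold, and the tool names are hypothetical and are not AGT's or Agent Mesh's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class TrustTracker:
    """Hypothetical trust tracker; an illustration, not AGT's real interface."""
    score: float = 1.0       # 1.0 = fully trusted, 0.0 = untrusted
    decay: float = 0.5       # assumed multiplicative penalty per divergence
    recovery: float = 0.05   # assumed additive reward per matching action
    threshold: float = 0.3   # assumed cut-off below which governance intervenes
    history: list = field(default_factory=list)

    def observe(self, declared: set[str], observed: set[str]) -> bool:
        """Compare declared vs. observed actions; return True if the agent may continue."""
        undeclared = observed - declared
        if undeclared:
            # The agent did something it never asked permission for.
            self.score *= self.decay
        else:
            # Behavior matched intent: trust recovers slowly, capped at 1.0.
            self.score = min(1.0, self.score + self.recovery)
        self.history.append((sorted(declared), sorted(observed), self.score))
        return self.score >= self.threshold


tracker = TrustTracker()
ok_1 = tracker.observe({"file.read"}, {"file.read"})                # matches
ok_2 = tracker.observe({"file.read"}, {"file.read", "http.post"})   # diverges
ok_3 = tracker.observe({"file.read"}, {"file.read", "shell.exec"})  # diverges again
```

With these assumed numbers, the first divergence halves the score but the agent keeps running; the second pushes it below the threshold, which is where the governance layer would step in. Trust drops fast and recovers slowly on purpose, so one clean action cannot erase a violation.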
Can formal verification work for non-deterministic systems?
Short answer: Yes, but it has to be a hybrid.
You need deterministic checks (policy enforcement, capability constraints) combined with non-deterministic checks (behavioral anomaly detection, pattern analysis). Pre-execution and post-execution. Not one or the other.
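A rough sketch of what that hybrid could look like, under stated assumptions: the allowlist, the tool names, and the "bytes sent per run" metric are all hypothetical, and a simple z-score stands in for whatever behavioral anomaly detection a real system would use.

```python
from statistics import mean, pstdev

ALLOWED_TOOLS = {"file.read", "slack.post"}  # hypothetical policy allowlist


def pre_check(tool: str) -> bool:
    """Pre-execution, deterministic: the same request always gets the same verdict."""
    return tool in ALLOWED_TOOLS


def post_check(baseline: list[float], observed: float, z_limit: float = 3.0) -> bool:
    """Post-execution, statistical: pass only if the metric stays near the baseline."""
    mu = mean(baseline)
    sigma = pstdev(baseline)
    if sigma == 0:
        # No variation in the baseline: anything different is an anomaly.
        return observed == mu
    return abs(observed - mu) / sigma <= z_limit


# Assumed metric: bytes sent per run, gathered from earlier known-good runs.
baseline_bytes = [100.0, 120.0, 110.0, 90.0, 105.0]
gate_ok = pre_check("file.read")
normal_ok = post_check(baseline_bytes, 115.0)
anomaly_ok = post_check(baseline_bytes, 5000.0)
```

Here `gate_ok` and `normal_ok` pass, while the 5000-byte run fails the post-execution check even though every tool call in it would have passed the gate. That is the case a pre-execution check alone cannot see.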
The “Illusion Delta”:
In 2026, the industry is falling into a trap: agents perform safely in short-horizon tests but exhibit deviant emergent behavior in long-running production environments.
The missing piece? The Observability Layer. We need checks that observe how agents behave in real time, not just what they asked to do. This is the bridge between verifying policy and verifying behavior.
The Intent Problem
An agent reads your customer database (allowed). Then sends a Slack message (also allowed). But was it exfiltrating data?
You can’t govern intent by reading the agent’s “mind.” But you can govern it through Sequence Analysis: a sequence of individually allowed actions that looks like exfiltration is a signal.
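As a minimal sketch of the idea, the snippet below flags a sensitive read followed closely by an outbound message. The tool names, the 60-second window, and the `Action` type are assumptions for illustration; a production detector would need far richer patterns and context.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    tool: str
    timestamp: float  # seconds since the run started


SENSITIVE_READS = {"db.read_customers"}    # hypothetical tool names
EXTERNAL_SENDS = {"slack.post", "http.post"}


def find_exfil_patterns(actions: list[Action], window_s: float = 60.0) -> list[tuple[Action, Action]]:
    """Return (read, send) pairs where a send follows a sensitive read within the window."""
    flagged = []
    last_read = None
    for action in sorted(actions, key=lambda a: a.timestamp):
        if action.tool in SENSITIVE_READS:
            last_read = action
        elif action.tool in EXTERNAL_SENDS and last_read is not None:
            if action.timestamp - last_read.timestamp <= window_s:
                flagged.append((last_read, action))
    return flagged


trace = [
    Action("db.read_customers", 0.0),  # allowed on its own
    Action("slack.post", 12.0),        # allowed on its own, suspicious in sequence
    Action("slack.post", 500.0),       # outside the window, not flagged
]
signals = find_exfil_patterns(trace)
```

The result is a signal, not a verdict: `signals` holds one pair here, and that pair is something to feed into trust scoring or human review rather than an automatic block.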
The Thesis: The future of agent governance is not just pre-action policy gates. It’s continuous behavioral observability combined with adaptive trust scoring.
What’s Next?
In Part 2, I’ll dive into “Governing the Flock, Not Just the Bird”: how to handle multi-agent swarms and delegation chains where accountability gets messy.
Originally published at https://www.linkedin.com.