A few weeks ago I was browsing GitHub looking for open source projects to contribute to. I stumbled on Microsoft's agent-governance-toolkit and decided to dig in. What I found surprised me — not because the code was impressive (it was), but because the problem it solved was one I hadn't thought seriously about before.
We talk a lot about building AI agents. We don't talk nearly enough about what happens when they go wrong.
The Rise of Autonomous AI Agents
In 2026, AI agents aren't a research curiosity anymore. They're running in production. Companies are deploying agents that browse the web, write and execute code, query databases, send emails, and call external APIs — all without a human in the loop for each step.
This is genuinely powerful. An agent that can autonomously research, analyze, and act can compress hours of work into minutes. But there's a problem that comes with that power that most teams are only starting to reckon with: an agent that can do anything, will eventually do the wrong thing.
Not out of malice. Out of ambiguity, edge cases, and the fundamental unpredictability of systems built on probabilistic models.
What Can Go Wrong — And Does
Let me give you concrete examples from patterns I saw while contributing to this toolkit.
Data exfiltration via MCP tool poisoning. MCP (Model Context Protocol) is a standard for giving agents access to tools. What most teams don't realize is that an attacker can embed hidden instructions directly in a tool's description. The agent reads the description, interprets it as instructions, and suddenly a "read file" tool is quietly sending your data to an external endpoint. The attack is invisible at the UI level. It lives in the metadata.
Runaway tool calls. An agent that's allowed to call tools indefinitely will sometimes loop. A bug in the task framing, an ambiguous goal, or an unexpected API response can send an agent into a cycle that burns through API credits, hits rate limits, or worse — makes irreversible changes at scale.
PII in memory writes. Agents with memory capabilities can inadvertently store sensitive user data — SSNs, emails, API keys embedded in text — in their long-term memory. Without validation at the write layer, that data persists and propagates.
Prompt injection across delegation chains. When agents spawn sub-agents (which is increasingly common in multi-agent architectures), each handoff is an attack surface. A malicious instruction injected early in a chain can propagate through multiple agents before anyone realizes something is wrong.
These aren't theoretical. Security researchers are actively demonstrating all of these attack patterns in the wild right now.
What Governance Actually Means
"Governance" is one of those words that sounds bureaucratic until you see what the absence of it costs.
In practical terms, governance for AI agents means three things:
1. Policy enforcement before execution. Every tool call, every action an agent wants to take, gets checked against a set of rules before it happens. Not after. Not logged for review later. Before. If the action matches a blocked pattern — a SQL injection attempt, a destructive shell command, a PII pattern — it's stopped. The agent gets a policy violation error. The action doesn't execute.
2. Audit logging you can actually use. Every action an agent takes, whether allowed or blocked, gets recorded with enough context to understand what happened. Not just "agent called tool X" but which agent, which context, what the input was, what the policy decision was, and why.
3. Budget and circuit breaker enforcement. Agents need hard limits — on the number of tool calls per session, on the delegation depth between agents, on execution time. When those limits are hit, the agent stops. Not degrades gracefully. Stops.
How agent-governance-toolkit Implements This
This is where it gets concrete. The toolkit wraps your existing agent frameworks — LangChain, CrewAI, AutoGen, PydanticAI, smolagents, Google ADK — with a governance layer that sits between your agent and the outside world.
You define a policy:
from agent_os.integrations.base import GovernancePolicy
policy = GovernancePolicy(
blocked_patterns=["DROP TABLE", "rm -rf", r"\b\d{3}-\d{2}-\d{4}\b"],
max_tool_calls=10,
require_human_approval=False,
)
Then you wrap your agent:
from agent_os.integrations import LangChainKernel
kernel = LangChainKernel(policy=policy)
governed_agent = kernel.wrap(my_langchain_chain)
Every call to governed_agent.invoke() now goes through a pre-execution check. If the input matches a blocked pattern, the call is rejected before reaching the LLM or any external service. If the agent has exceeded its tool call budget, the call is rejected. If the output contains something problematic, the post-execution check catches it.
The interception happens at multiple levels. Deep hooks intercept individual tool calls within the agent, memory writes are validated for PII before they're saved, and delegation chains between sub-agents are tracked and depth-limited.
For MCP specifically, the toolkit includes a security scanner that checks tool definitions for poisoning patterns — hidden instructions, exfiltration attempts, privilege escalation, role overrides — before those tools are registered with the agent.
Key Concepts Worth Understanding
The trust mesh. In a multi-agent system, trust isn't binary. Different agents should have different permission levels based on their role, their source, and the context they're operating in. The toolkit models this through trust cards and identity verification — each agent has a verifiable identity, and policies can be scoped to specific agents or roles.
Policy as code. Governance policies are defined as YAML files that can be version-controlled, reviewed, and deployed like any other configuration. They have a schema, they can be validated before deployment (agentos policy validate), and they can be diffed between versions. You don't want to be manually updating code every time a new blocked pattern needs to be added.
SLOs for agents. Just like you set service level objectives for APIs and infrastructure, you can set them for agents. Error rate thresholds, latency limits, availability targets. When an agent breaches its SLO, a circuit breaker trips and the agent is taken out of service. This prevents a degraded agent from silently producing bad outputs at scale.
Audit logging as a first-class concern. Every governance decision — allow or block — is recorded with full context. This isn't just for debugging. It's for compliance, incident response, and understanding actual agent behavior in production over time.
Why This Matters Now
The timing matters. We're at an inflection point.
Most teams deploying AI agents today are doing so without any governance layer. They're moving fast, the agents are working well enough in testing, and governance feels like a problem for later. But "later" in AI agent deployments often means "after the first serious incident."
The cost of retrofitting governance onto an existing agent system is much higher than building it in from the start. The audit trail doesn't exist yet, the policy boundaries haven't been defined, and the agents have already been granted permissions that are difficult to revoke without breaking workflows.
This isn't just a security problem either. It's a reliability problem and a trust problem. If your agents take actions that users or customers didn't expect and can't explain, trust in the system erodes quickly. Governance is what makes agent behavior predictable and explainable — not just safe.
Getting Started
The quickstart takes about 10 minutes:
pip install agent-governance-toolkit[full]
There are also three interactive Google Colab notebooks that let you explore the toolkit without setting anything up locally:
- Policy Enforcement 101 — define a policy, see violations blocked in real time
- MCP Security Proxy — scan tool definitions for poisoning patterns
- Multi-Agent Governance — SLOs, circuit breakers, chaos testing
Full source code and docs: github.com/microsoft/agent-governance-toolkit
A Note From a Contributor
I came to this project as someone who builds ML models and data pipelines — not a security engineer. But contributing to this codebase changed how I think about agent design. The attack surfaces are real, the failure modes are unintuitive, and the tooling to address them is still early.
If you're building agents in production, governance isn't optional. It's the difference between a system you can operate confidently and one you're constantly firefighting.
I'm Kanish Tyagi — MS Data Science student at UT Arlington and open source contributor. Find me on GitHub and LinkedIn.
Top comments (0)