Your agent will do something you didn't expect. Every team that has run agents in production for more than a few weeks knows this. The question isn't whether it happens — it's whether your system is designed for that moment.
Microsoft's Agent Governance Toolkit is a well-engineered answer to one version of that question: can we evaluate a declarative policy before a tool call fires? The answer is yes, and AGT does it well. Sub-millisecond evaluation. A solid test corpus. A real compliance story.
But "before a tool call fires" is not the whole question. It's the first clause of a much longer sentence — and everything after that clause is where production failures actually live.
This post is for teams making a governance platform decision today. We'll walk through what each approach covers, where the coverage ends, and why the teams building serious agent infrastructure at scale are landing on Waxell.
TL;DR
| | AGT | Waxell |
|---|---|---|
| Governance timing | Pre-execution only | Pre, mid, and post-execution |
| Agent scope | Framework-attached agents | External agents, framework agents, agentic runtime |
| Policy management | Developer-authored YAML, code deployment required | Dynamic engine — non-technical users, runtime injection |
| Data layer governance | Tool call level | Tool call + database + vector database (Signals / Domains) |
| Cost enforcement | None | BudgetLedger — tree-scoped, enforceable mid-run |
| Durable execution | Saga orchestrator (in-session only) | Suspend, resume, human gates across session boundaries |
| Policy per agent/fleet | Shared policy directory | Different policies per agent and fleet, dynamically |
| Policy categories | Open-ended rule authoring | 26 structured policy categories with scoping |
| Incident disposition | Allow / deny | Warn, block, or redact — scoped per category |
| Built on | Threat model and whitepaper | Millions of production agentic executions |
The Governance Gap AGT Doesn't Cover
AGT's architecture is explicit about its boundary: it governs agent actions, not LLM inputs or outputs, and it runs in-process. The policy evaluation happens before tool dispatch. If the tool is allowed, AGT steps aside.
That's a clear, honest design decision. But it means AGT's governance surface ends precisely where most production failures begin.
Six failure modes appear repeatedly in production agent deployments:

- Runaway loops: the agent re-calls itself or a tool repeatedly.
- Scope creep: the agent pursues a goal beyond the original instruction.
- Data leakage: the agent surfaces data in its output that it shouldn't have retrieved.
- Hallucination-in-action: the agent acts on a false premise mid-run.
- Prompt injection: a retrieved document redirects agent behavior.
- Cascade failures: one agent's output becomes another agent's bad input across a spawn tree.
AGT can address some pre-conditions for some of these failures. A rule that blocks a recursive tool call can interrupt a loop — once. A capability check can prevent scope creep at a specific tool invocation. But a pre-execution policy can't stop a loop that's unfolding across turns. It can't gate an output before it reaches the next agent in a chain. It can't suspend a run when spend crosses a threshold mid-execution. It can't enforce a review step between what the agent decided and what the agent did.
These aren't edge cases. They're the failure modes that matter in production.
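To make the pre- vs mid-execution distinction concrete, here's a toy sketch (not AGT's or Waxell's actual API; the allowlist and threshold are invented for illustration). Every call in a runaway loop can individually pass a per-call check; only a guard that tracks cumulative run state interrupts the loop.

```python
# Illustrative sketch only: a per-call policy sees each tool call in
# isolation, so every call in a runaway loop can be "allowed". A guard
# over cumulative run state is what actually stops the loop.
from collections import Counter

ALLOWED_TOOLS = {"search", "summarize"}   # hypothetical per-call allowlist
MAX_REPEAT = 3                            # hypothetical cumulative constraint

def per_call_check(tool: str) -> bool:
    """Pre-execution view: is this single call permitted?"""
    return tool in ALLOWED_TOOLS

def run_with_cumulative_guard(calls):
    """Mid-execution view: does the run's state still satisfy constraints?"""
    seen = Counter()
    for tool in calls:
        if not per_call_check(tool):
            return f"denied:{tool}"
        seen[tool] += 1
        if seen[tool] > MAX_REPEAT:       # loop detected across turns
            return f"halted:loop:{tool}"
    return "completed"

# Every individual call passes the pre-execution check...
loop = ["search"] * 10
assert all(per_call_check(t) for t in loop)
# ...but the cumulative guard halts the run on the fourth repeat.
print(run_with_cumulative_guard(loop))    # halted:loop:search
```

The point isn't the counter; it's that the enforcement question changes from "is this call allowed?" to "is this run still within bounds?".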
Three Planes of Governance
Every production agent deployment has three surfaces that need governance. Most governance tools cover one.
Plane 1: External and third-party agents. Agents running in external environments — developer tooling, CI pipelines, customer-facing sessions, third-party integrations — operate outside any framework instrumentation. They call your APIs, they read your data, they act on behalf of your users. But they're not running in a process you control, and they can't have framework adapters attached to them.
Plane 2: Framework-built agents. Agents built on LangChain, CrewAI, AutoGen, Semantic Kernel, and similar frameworks. This is where most governance tooling lives, because these frameworks provide attachment points for instrumentation and policy hooks.
Plane 3: The agentic runtime itself. The infrastructure layer that handles agent spawning, state persistence, suspension, resumption, and inter-agent communication. Governance at this layer means enforcing policies on the execution fabric, not just on individual tool calls.
AGT operates primarily on Plane 2. Its adapters attach to framework-built agents. Its in-process model has no surface for Plane 1 agents, and its saga orchestrator provides some runtime governance (compensating transactions for in-session failures) but no cross-session enforcement on Plane 3.
Waxell covers all three. The instrumentation layer auto-instruments 157 libraries across frameworks (Plane 2). External agents emit structured events attributed to the same governance surface via the Waxell installer (Plane 1). The Runtime SDK governs the execution fabric directly — spawn, suspend, resume, budget enforcement, human gates — without requiring any framework attachment (Plane 3).
The Execution Arc: Pre, Mid, and Post
The simplest way to describe the architectural difference is the execution arc.
AGT covers the pre-execution moment. A tool call is about to fire. The policy engine evaluates. Outcome: allow or deny. If allowed, AGT has done its job.
Waxell covers the full arc — and the response options are richer.
Where AGT's disposition is binary (allow or deny), Waxell's incident disposition model works like cloud infrastructure security: warn, block, or redact, scoped per policy category. A tool call that trips a budget threshold can be warned rather than blocked on the first occurrence, letting a human review before enforcement escalates. A response containing PII that shouldn't leave the tenant boundary can be redacted before it reaches the next agent in the chain, rather than halting the run entirely. The response is proportionate to the violation — which is how mature security systems work.
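A minimal sketch of that disposition model, with invented category names (this is not Waxell's actual schema, and real PII redaction covers far more than email addresses):

```python
# Hypothetical sketch of a scoped disposition model. Each policy category
# maps a violation to a proportionate response rather than a binary
# allow/deny. Category names are illustrative.
import re

DISPOSITIONS = {                      # per-category enforcement profile
    "cost.threshold":  "warn",        # first budget trip: warn, don't halt
    "data.pii":        "redact",      # strip PII, let the run continue
    "tool.restricted": "block",       # hard stop
}

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def enforce(category: str, payload: str):
    action = DISPOSITIONS.get(category, "block")  # unknown category: fail closed
    if action == "warn":
        return ("warn", payload)                  # record incident, pass through
    if action == "redact":
        return ("redact", EMAIL.sub("[REDACTED]", payload))
    return ("block", None)

print(enforce("data.pii", "contact alice@example.com"))
# ('redact', 'contact [REDACTED]')
```

Note the default: a category with no configured disposition blocks, which is the fail-closed posture you'd want from any enforcement layer.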
Pre-execution: Tool calls are checked against declared rules before dispatch. Fast enough to not block hot paths.
Mid-execution: This is the governance surface that doesn't exist in AGT. An agent is mid-run. It has made four tool calls. Its spawn tree has consumed $8 of its $10 budget. The next tool call is permitted by policy, but by the time it completes, the budget will be exceeded. Waxell's BudgetLedger enforces at this boundary: the question isn't "did this specific call violate a rule?" but "does the current execution state violate a constraint?"
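A tree-scoped budget check can be sketched in a few lines. This is an illustration of the concept under a simple parent/child spawn model, not Waxell's BudgetLedger implementation:

```python
# Minimal sketch of a tree-scoped budget ledger: spend is enforced
# against the whole spawn tree, not per individual call or per agent.
class BudgetLedger:
    def __init__(self, limit: float):
        self.limit = limit
        self.spend = {}        # run_id -> cost charged so far
        self.parent = {}       # run_id -> parent run_id

    def spawn(self, run_id, parent=None):
        self.spend[run_id] = 0.0
        if parent:
            self.parent[run_id] = parent

    def _root(self, run_id):
        while run_id in self.parent:
            run_id = self.parent[run_id]
        return run_id

    def _tree_total(self, run_id):
        root = self._root(run_id)
        return sum(c for r, c in self.spend.items() if self._root(r) == root)

    def charge(self, run_id, cost):
        """Record spend; enforce the limit across the whole spawn tree."""
        self.spend[run_id] += cost
        if self._tree_total(run_id) > self.limit:
            raise RuntimeError(f"budget exceeded for tree of {run_id}")

ledger = BudgetLedger(limit=10.0)
ledger.spawn("root")
ledger.spawn("child", parent="root")
ledger.charge("root", 8.0)       # fine: tree total 8.0
ledger.charge("child", 1.5)      # fine: tree total 9.5
try:
    ledger.charge("child", 1.0)  # tree total 10.5: enforced mid-run
except RuntimeError as e:
    print(e)
```

Each individual charge is small and unobjectionable; the violation only exists at the tree level, which is why per-call policy can't see it.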
Mid-execution also covers suspension and human gates. An agent drafts a document that will be sent to a customer. Before dispatch, a human review gate fires. The run suspends. The reviewer approves or rejects. The run resumes or terminates. None of this is expressible in a pre-execution policy framework.
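The shape of a suspendable run can be shown with a Python generator. This is only a toy: real durable execution persists state across processes and session boundaries, which a generator does not.

```python
# Toy sketch of a human review gate: the run reaches a decision point,
# suspends, and resumes (or terminates) based on the reviewer's verdict.
def agent_run():
    draft = "Dear customer, ..."       # agent drafts an outbound document
    verdict = yield ("review", draft)  # suspend at the human gate
    if verdict == "approve":
        yield ("sent", draft)          # resume and dispatch
    else:
        yield ("terminated", None)

run = agent_run()
gate, draft = next(run)                # run suspends at the gate
assert gate == "review"
status, _ = run.send("approve")        # reviewer approves; run resumes
print(status)                          # sent
```

The key property is that the run is parked at a defined point with its state intact, and enforcement decides whether it continues.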
Post-execution: Output gates, cost settlement, audit closure, RunEdge DAG completion. Waxell records the full causal graph after each run — what spawned what, which decisions led to which actions, what the cost was across the full tree. Post-execution governance means you can write policies that look at run history, not just the current call.
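Loosely in the spirit of a causal DAG like RunEdge (the field names and query functions here are invented for the sketch), a post-run record lets policies query history rather than a single call:

```python
# Illustrative post-execution record of a spawn tree: who spawned whom,
# and what the cost was across the full tree.
edges = [  # (parent, child, child_cost)
    ("root", "researcher", 2.10),
    ("root", "writer", 1.40),
    ("researcher", "fetcher", 0.55),
]

def children(node):
    return [c for p, c, _ in edges if p == node]

def tree_cost(node):
    """Cost of a node plus everything it transitively spawned."""
    own = sum(cost for _, c, cost in edges if c == node)
    return own + sum(tree_cost(c) for c in children(node))

# A post-execution policy can ask questions of run history:
print(children("root"))              # ['researcher', 'writer']
print(round(tree_cost("root"), 2))   # 4.05
```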
The Dynamic Policy Engine
AGT's policy model is declarative and static: YAML, OPA/Rego, or Cedar rules deployed in a `policies/` directory. Changing a policy means editing a file, testing it, and deploying the new version. That's a developer task.
This is fine when policies are stable and your governance team is your development team. It becomes a bottleneck when policies need to change quickly — new regulation, new customer requirement, new threat pattern identified at 2am — and the people who understand the policy need don't have deployment access.
Waxell's policy engine is dynamic. Policies are injectable at runtime without redeployment. Different agents can run under different policy sets. Different fleets can have different enforcement profiles. A compliance officer can update a policy and push it through the platform UI without opening a terminal or filing a deployment ticket.
The policy surface is structured. Waxell ships 26 policy categories — covering data handling, cost, tool access, output content, identity, inter-agent communication, and more — each with its own scoping controls. Rather than writing rules from scratch against an open schema, teams configure governance against a taxonomy that was built from the actual categories of violations that surface in production. The 26 categories aren't arbitrary; they map to the failure modes and regulatory requirements that production teams have encountered repeatedly enough to warrant a first-class policy type.
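An in-memory sketch of what runtime injection with per-fleet scoping means (hypothetical names throughout; the category strings are illustrative, not Waxell's actual 26-category taxonomy):

```python
# Hypothetical sketch of a dynamic policy engine: policies are injected
# at runtime, scoped per agent or fleet, with no redeploy.
class PolicyStore:
    def __init__(self):
        self.by_scope = {}   # scope ("fleet:x" / "agent:y") -> {category: rule}

    def inject(self, scope, category, rule):
        """Push a policy update live; takes effect on the next evaluation."""
        self.by_scope.setdefault(scope, {})[category] = rule

    def evaluate(self, scope, category, value):
        rule = self.by_scope.get(scope, {}).get(category)
        return rule(value) if rule else "allow"   # no rule configured: allow

store = PolicyStore()
# Compliance pushes a fleet-wide cost rule mid-afternoon, no deploy:
store.inject("fleet:support", "cost.per_call",
             lambda usd: "allow" if usd < 0.50 else "block")
print(store.evaluate("fleet:support", "cost.per_call", 0.75))  # block
print(store.evaluate("fleet:billing", "cost.per_call", 0.75))  # allow
```

The two prints show the scoping point: the same call is blocked for one fleet and allowed for another, because policy sets differ per fleet.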
The evaluation is fast — governance at the pre-execution boundary doesn't add perceptible latency to tool dispatch. But the organizational implication is the bigger difference: AGT makes governance an engineering concern. Waxell makes it an organizational concern.
When a compliance team needs to respond to a regulatory inquiry at 3pm on a Friday, they don't want to be blocked on a deployment pipeline. When a security team identifies a new class of tool call that should require elevated review, they want to push that requirement now, not at the next sprint boundary.
The dynamic policy engine isn't a feature. It's a governance velocity argument.
The Data Layer: Where Tool-Call Governance Ends
There's a category of agent behavior that no tool-call policy can govern: data retrieval.
An agent with permission to call a search tool, a retrieval function, or a vector database query can surface any data those systems return. The tool call is allowed. The policy was satisfied. The governance layer has no view into what the agent is about to see.
For most enterprise agent deployments, this is the actual risk surface. Not "will the agent call a restricted tool?" but "will the agent retrieve data it shouldn't have, surface it in an output, or pass it to the next agent in a spawn chain?"
Waxell's Signals and Domains schema extends governance to the data layer. You declare which agents can access which data sources, at what granularity, under what conditions. The policy enforcement happens at the retrieval boundary — before the data enters the agent's context — not at the tool call boundary where the retrieval was initiated.
This closes the gap that tool-call governance cannot close. An agent can be perfectly well-governed at the AGT level — every tool call checked against a rule, every capability verified — and still exfiltrate data through an unguarded retrieval path. The governed data access layer is the answer to that exposure.
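A sketch of enforcement at the retrieval boundary, assuming a declared agent-to-domains mapping in the spirit of a Signals/Domains schema (the grant table, corpus shape, and function names are invented for illustration):

```python
# Governance at the retrieval boundary: results are filtered BEFORE they
# enter the agent's context, regardless of whether the tool call that
# triggered the retrieval was itself allowed.
GRANTS = {"support-agent": {"kb.public", "kb.product"}}   # declared access

CORPUS = [
    {"domain": "kb.public",  "text": "How to reset a password"},
    {"domain": "hr.salary",  "text": "2025 compensation bands"},
    {"domain": "kb.product", "text": "API rate limits"},
]

def governed_retrieve(agent: str, query: str):
    """Return only results from domains the agent is granted."""
    allowed = GRANTS.get(agent, set())
    hits = [d for d in CORPUS if query.lower() in d["text"].lower()]
    return [d["text"] for d in hits if d["domain"] in allowed]

# The search tool call is permitted either way; the data-layer policy
# decides what actually comes back.
print(governed_retrieve("support-agent", "rate limits"))   # ['API rate limits']
print(governed_retrieve("support-agent", "compensation"))  # []
```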
Policy Management: Who Owns Governance at Your Organization?
This is the operational question that doesn't appear in most governance tool comparisons, but it determines whether your governance investment actually functions in production.
With AGT, the people who can change policies are the people who can edit YAML, run tests, and deploy code. That's a developer profile. For teams where security and compliance are embedded in engineering, this works. For teams where governance ownership sits with a compliance function, a legal team, or a dedicated security organization that doesn't have CI/CD access, it creates a structural dependency.
Every compliance requirement that needs to become an enforcement rule has to go through the engineering queue. Every policy update for a new customer requirement becomes a deployment ticket. The governance function is dependent on development capacity.
Waxell's dynamic policy engine breaks this dependency. Compliance teams author and manage policies directly. The platform provides the enforcement infrastructure; the teams that understand the regulatory context provide the rules. The separation is clean: platform engineering manages Waxell itself; compliance and security manage what runs on top of it.
For regulated industries — financial services, healthcare, legal, any team operating under data residency or audit requirements — this separation isn't a preference. It's a prerequisite.
Production Evidence vs. Whitepaper
AGT is a serious piece of engineering. The codebase is well-tested, the architecture is sound, and the threat model it was designed against is real. But it was designed against a threat model — a structured analysis of what agent governance should address, written before most of the teams now running production agents had encountered the failure modes they needed to govern.
Waxell's governance patterns — budget boundaries, tool-level policy, output gates, kill switch — were designed from incidents. The failure mode taxonomy (loop, scope creep, data leakage, hallucination-in-action, prompt injection, cascade) wasn't derived from a whitepaper. It was catalogued from actual production failures across millions of agentic executions.
This matters for a few reasons that aren't immediately obvious.
First, the edge cases. A threat model anticipates known attack vectors. Production evidence surfaces failure modes that weren't anticipated. The runtime governance patterns in Waxell reflect the shape of failures that teams encountered after they thought they had things under control.
Second, the performance profile. Fast policy evaluation in a benchmark is not the same as fast policy evaluation in a multi-agent spawn tree under real load. Waxell's evaluation performance is calibrated against actual production traffic patterns, not synthetic benchmarks.
Third, the coverage decisions. Every governance system makes tradeoffs about what to enforce and how. Waxell's tradeoffs were made in response to real operational pain. That doesn't make them universally correct — but it does mean they were tested against the actual problem before they shipped.
The Operational Stack
For teams running AGT today and evaluating whether to stay, add to, or replace it, here's the honest picture of what you're managing:
If you keep AGT only: You have pre-execution policy enforcement for framework-attached agents. You have an audit log of policy events and a flight recorder for post-mortem replay. You're building observability, cost tracking, durable execution, and external agent coverage yourself or assembling it from separate tools. You're also accepting that every policy change requires a developer and a deployment.
If you move to Waxell: You get the full execution arc across all three planes, the dynamic policy engine, the governed data access layer, BudgetLedger, durable execution, RunEdge causal DAG, and external agent observability under one governance surface. Policy management is decoupled from engineering deployment.
The migration path is straightforward. Waxell auto-instruments 157 libraries at process start: add `waxell.init()` before your agent initialization, and span-level tracing begins immediately for every LLM call and tool dispatch. Cost records, causal lineage, and budget enforcement layer on top without requiring instrumentation code.
Three Questions to Frame the Decision
If you're deciding now:
1. Who needs to change policies when something goes wrong? If the answer is "someone who doesn't have deployment access," you need a dynamic policy engine.
2. Where does your actual risk surface live? If it's in data retrieval as much as tool dispatch — and for most enterprise deployments, it is — you need data layer governance, not just tool-call governance.
3. What failure modes are you governing for? If you've been running agents in production and you've seen loops, scope creep, or cross-agent data contamination, you need mid-execution enforcement. A pre-execution policy can't stop a failure that's already unfolding.
AGT is a legitimate answer to a specific, well-scoped problem. For teams that need exactly that scope — framework-attached, developer-managed, pre-execution policy enforcement — it's a defensible choice.
For teams that need governance to match the full complexity of how agents fail in production, Waxell is built for that.
Getting Started
```bash
pip install "waxell-observe[all]" waxell-sdk
export WAXELL_API_KEY="wax_sk_..."
export WAXELL_TENANT="your-tenant-slug"
```

Add `waxell.init()` at process start. Spans appear immediately. For the full governance stack (dynamic policy engine, governed data access, BudgetLedger enforcement) see the platform overview or book a reference architecture review.
Waxell is the hosted platform for running, observing, and governing AI agents in production. Built on millions of agentic executions.
Frequently Asked Questions
Can I run AGT and Waxell together, or do I have to choose?
You can run both. AGT's in-process policy evaluation and Waxell's instrumentation layer operate independently; they don't need to know about each other to coexist. If you've already deployed AGT and want to add observability, cost tracking, and mid-execution governance on top, installing `waxell-observe[all]` and `waxell-sdk`, then calling `waxell.init()` before your `PolicyEvaluator()` initialization, is all it takes to get started. The integration guide covers the full stack, including how to wire AGT policy outcomes into Waxell spans and how to use the BudgetLedger as a data source for AGT custom checks.
What does "mid-execution enforcement" actually mean in practice?
Pre-execution governance checks whether a specific tool call is permitted before it fires. Mid-execution governance checks whether the current state of the run satisfies a constraint — regardless of whether any individual tool call violated a rule. The clearest example is cost: a run may be well within budget at every individual tool call, but the cumulative spend across a spawn tree can cross a threshold mid-run. Waxell's BudgetLedger enforces at that boundary, not at the per-call level. Similarly, human review gates are a mid-execution construct: the run reaches a decision point, suspends, waits for a reviewer, and resumes — something a pre-execution policy framework has no mechanism to express.
How does Waxell's dynamic policy engine work — can non-technical teams actually manage policies without deployments?
Yes. Policies in Waxell are managed through the platform UI and API, not through files in a code repository. A compliance officer can update a policy, change enforcement scope, or add a new rule and push it immediately — no deployment ticket, no engineering queue. The policy engine evaluates against Waxell's 26 structured policy categories, so teams are configuring governance against a taxonomy rather than authoring rules from scratch against an open schema. Platform engineering manages the Waxell infrastructure; compliance and security manage what runs on top of it. The separation is clean and doesn't require embedding governance ownership inside the engineering team.
What is the data layer governance Waxell provides, and why doesn't tool-call policy cover it?
Tool-call governance can block a retrieval function from being called. It can't control what data that function returns, or prevent that data from propagating through the agent's context and into downstream agents. Waxell's Signals and Domains schema extends policy enforcement to the retrieval boundary — before data enters the agent's context — not just at the call boundary where retrieval was initiated. For enterprise deployments where the real risk is an agent surfacing data it shouldn't have retrieved, or passing sensitive data to a spawned sub-agent, tool-call governance alone leaves the exposure open. Data layer governance closes it.
How quickly can we get started if we already have AGT deployed?
Basic observability starts in minutes: add `waxell.init()` before your existing `PolicyEvaluator()` initialization and spans begin appearing immediately for every LLM call and tool dispatch. No instrumentation code is required: Waxell auto-instruments 157 libraries at process start. Cost records and causal lineage layer on top automatically. The BudgetLedger integration and dynamic policy engine require additional configuration; the platform overview covers the steps, or you can book a reference architecture review to walk through your specific setup.
Sources
- Microsoft Agent Governance Toolkit — GitHub — source for AGT architecture, scope, and documented non-goals
- Introducing the Agent Governance Toolkit — Microsoft Open Source Blog, April 2, 2026
- Agent Governance Toolkit: Architecture Deep Dive — Microsoft Tech Community
- OWASP Agentic Top 10
- SPIFFE — Secure Production Identity Framework For Everyone
- Waxell Platform Overview
- Waxell Governance Documentation