Logan for Waxell

Posted on • Originally published at waxell.ai

What the Microsoft Agent Governance Toolkit Leaves to You

Microsoft's Agent Governance Toolkit is a serious piece of engineering. Sub-millisecond policy evaluation. OWASP Agentic Top 10 coverage. Post-quantum cryptography already shipped. A 9,500+ test corpus with continuous fuzzing. If you've chosen AGT or are evaluating it, you made a defensible decision.

But there's a question that usually surfaces about a week into any AGT deployment, and it's not about the policy engine: who changes a policy when something goes wrong, and how fast can they do it?

With AGT, changing a policy means editing a YAML file, running tests, and deploying. That's a developer task — which means your compliance team, your security team, your legal team, and your on-call engineer all have to route governance changes through the engineering queue. A new regulatory requirement, a new threat pattern, a customer escalation that needs an immediate enforcement change — all of it waits for a deployment.

This is a design decision, not an oversight. AGT is a library. It was built for teams where governance and engineering are the same function. For those teams, it's the right tool.

For teams where governance needs to move at the speed of incidents — not the speed of deployments — and where the people who understand the regulatory context aren't the same people who have CI/CD access, the gap starts before you even get to observability or cost tracking.

This post catalogues what AGT leaves open, starting with that gap. For each one, we walk through what a DIY build looks like, which open-source tooling covers it, and where a hosted platform fits.


AGT's Explicit Non-Goals

Before cataloguing the gaps, it's worth being clear that these are documented design decisions, not omissions. From the AGT README:

  • Not a prompt guardrail or content moderation tool
  • Governs agent actions, not LLM inputs or outputs; pre-execution only
  • Same-process trust boundary; container isolation recommended for higher-risk workloads
  • Workflow-level policies and intent declaration are on the roadmap but not yet available
  • No tenancy model documented for memory, signing keys, or data residency
  • No model cost table, no token aggregation, no per-user or per-tenant attribution
  • No governance at the data retrieval layer

These aren't weaknesses — they're scope decisions. A policy engine that tried to be everything would be nothing. What follows is the list of things that scope leaves on your plate.


TL;DR: Gaps and How to Fill Them

| Gap | AGT ships | What you need | Platform answer |
| --- | --- | --- | --- |
| Policy management | Developer-authored YAML, deployment required | Non-technical authorship, runtime injection | Waxell dynamic policy engine — 26 categories, warn/block/redact |
| Observability | Audit log + flight recorder | Span-level tracing, causal graph | Waxell Observe — 157 libraries, RunEdge DAG |
| Cost tracking | Nothing | Per-call, per-tenant, BudgetLedger enforcement | Waxell SystemModelCost + BudgetLedger |
| Data layer governance | Nothing | Retrieval-boundary enforcement for DB/vector DB | Waxell Signals and Domains schema |
| Multi-tenancy | Nothing | Schema isolation, per-tenant signing keys | Waxell schema-per-tenant + AXID isolation |
| Durable execution | In-session saga only | Suspend-for-days, human gates, cross-session resume | Waxell Runtime |
| External agent coverage | Framework adapters only | Unified surface for external and third-party agents | Waxell installer |
| Causal lineage | Sequential audit log | Run-level causal graph across sessions | Waxell RunEdge DAG |

Gap 1 — Policy Management: Governance Shouldn't Require a Deployment

What AGT ships: A declarative policy model — YAML, OPA/Rego, or Cedar rules in a policies/ directory. Version-controlled, testable, deployable. Well-suited for teams where developers own governance.

What you need beyond that: The ability to change a policy without a deployment ticket. The ability for a compliance officer, a security analyst, or an on-call engineer without CI/CD access to push an enforcement change when they need to. The ability to assign different policies to different agents and different fleets — so a high-sensitivity finance agent can run under stricter rules than a low-risk internal tool, and that assignment can change without touching the codebase.

The operational reality: agent incidents don't wait for the next sprint. When a new threat pattern is identified at 2am, or when a regulatory deadline brings an immediate compliance requirement, governance needs to move at incident speed.

If you build it yourself: Build a policy management UI on top of AGT's file-based model. Write a deployment pipeline that validates and ships policy changes without a full code review cycle. Scope policies per agent or fleet via configuration. Each of these is solvable individually; together they're a significant internal product build.

Open-source options: AGT's model is inherently developer-centric. There's no open-source layer that adds a non-technical policy authorship surface on top of it.

What a platform provides: Waxell's dynamic policy engine supports 26 structured policy categories — covering data handling, cost, tool access, output content, identity, inter-agent communication, and more — each with scoping controls. Policies are injectable at runtime without redeployment. Different agents and fleets run under different policy sets. The incident disposition model works like cloud infrastructure security: warn, block, or redact, scoped per category. A compliance officer can update a policy and push it live through the platform UI without opening a terminal.


Gap 2 — Observability: Audit Logs Are Not Traces

What AGT ships: An audit log of policy events — which rule fired, on which tool call, with what outcome. A flight recorder for post-mortem replay of a policy violation sequence.

What you need beyond that: Span-level distributed tracing. LLM call latency per turn. Token counts per model call. Tool call arguments and outputs. The full execution graph across spawned sub-agents. A queryable interface for debugging production incidents without trawling raw logs.

The difference matters in practice. A policy audit log tells you that Rule 14 blocked a write_file call at 14:23:07. It doesn't tell you what the agent had done for the 40 turns leading up to that call, which sub-agent spawned the offending run, what model was used at each step, or how many tokens the whole sequence consumed before halting.

Production agent failures rarely announce themselves through policy violations. Policy violations are rare by design — they're the catch, not the signal. The failures that actually hurt — cost overruns, reasoning regressions, emergent behavior that surprises you in a customer demo — don't trigger any rule. They only become visible in spans.

If you build it yourself: Instrument every LLM call with OpenTelemetry. Emit spans to your observability backend of choice. Build a frontend to query across runs. Estimate 3–6 engineer-weeks to a stable prototype; ongoing maintenance as your agent frameworks add new versions.
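To make the DIY path concrete, here is a minimal sketch of the per-call span you would emit. It uses a homemade in-memory recorder standing in for an OpenTelemetry exporter, and a hypothetical `fake_completion` function standing in for a real provider SDK call — the attribute names are illustrative, not a standard schema.

```python
import time
from functools import wraps

SPANS = []  # stands in for an OpenTelemetry exporter backend


def traced_llm_call(model):
    """Decorator sketching the span you'd emit for every LLM call."""
    def deco(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            result = fn(*args, **kwargs)
            SPANS.append({
                "name": "llm.call",
                "model": model,
                "latency_s": time.monotonic() - start,
                # token counts come from the provider's usage payload
                "input_tokens": result.get("usage", {}).get("input_tokens", 0),
                "output_tokens": result.get("usage", {}).get("output_tokens", 0),
            })
            return result
        return wrapper
    return deco


@traced_llm_call(model="example-model")
def fake_completion(prompt):
    # hypothetical stand-in for a real provider SDK call
    return {"text": "ok", "usage": {"input_tokens": 12, "output_tokens": 3}}
```

The work in the real build is not this wrapper — it is applying it uniformly across every framework and SDK your agents use, and keeping it current as those libraries change.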

Open-source options: Langfuse (framework-agnostic, self-hostable, good default choice), Arize Phoenix (strong eval tooling), LangSmith (LangChain-coupled), Helicone (proxy-based, minimal instrumentation). All require some instrumentation; none provide a causal lineage graph across runs.

What a platform provides: pip install waxell-observe[all] auto-instruments 157 libraries at process start — LangChain, CrewAI, AutoGen, the Anthropic SDK, the OpenAI SDK, and 151 others. Spans appear in a trace explorer immediately. No instrumentation code required.


Gap 3 — Cost Tracking and Budget Enforcement

What AGT ships: Nothing. AGT has no model cost table, no token count aggregation, no billing-level attribution. This is documented, not a criticism.

What you need: Per-LLM-call cost records, keyed by model and token type. Aggregated by user, tenant, agent, and time window. If you run agents on behalf of customers — any SaaS product where agents do work for multiple tenants — cost attribution is the difference between knowing your margins and guessing until the invoice lands.

A concrete example: an agent-driven workflow runs across ten turns per request at moderate token counts per turn. At current model pricing, a single session is inexpensive. Multiply by tens of thousands of sessions per day across hundreds of tenants, and without cost tracking, you don't know which tenants are expensive, which workflows are runaway, or whether you're pricing correctly until the model provider bill arrives.

Beyond tracking, you need enforcement. Knowing that the spawn tree has consumed $8 of a $10 budget doesn't help if you can't act on that knowledge mid-run. The gap isn't just visibility — it's the ability to halt or warn when thresholds are crossed in real time.

If you build it yourself: Intercept every LLM call response, log the usage field, join to a pricing table you maintain, aggregate by session and tenant. Easy to prototype, operationally annoying to maintain as model pricing changes and new models are added. Real-time mid-run enforcement requires a live ledger, not just a reporting table.
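A minimal sketch of that pipeline, with hypothetical model names and prices — the pricing table, model IDs, and function names are all illustrative, and a production version would persist to a database rather than a list:

```python
# Hypothetical per-1M-token pricing table you would maintain yourself.
PRICING = {
    "model-a": {"input": 3.00, "output": 15.00},
}

COST_LOG = []  # one record per LLM call; in production, a database table


def record_cost(model, usage, tenant_id, session_id):
    """Join a provider usage payload to the pricing table and log the cost."""
    price = PRICING[model]
    cost = (usage["input_tokens"] / 1_000_000 * price["input"]
            + usage["output_tokens"] / 1_000_000 * price["output"])
    COST_LOG.append({"tenant": tenant_id, "session": session_id,
                     "model": model, "cost": cost})
    return cost


def tenant_spend(tenant_id):
    """Aggregation — the reporting half of the problem."""
    return sum(r["cost"] for r in COST_LOG if r["tenant"] == tenant_id)


def check_budget(tenant_id, budget):
    """Mid-run enforcement sketch: halt when a tenant crosses its budget."""
    if tenant_spend(tenant_id) >= budget:
        raise RuntimeError(f"budget exhausted for tenant {tenant_id}")
```

Note that `check_budget` only works mid-run if every LLM call routes through `record_cost` first — which is exactly the live-ledger requirement, as opposed to a nightly reporting job.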

Open-source options: Helicone and Langfuse both track costs. Helicone is proxy-based (easy to add, adds a network hop); Langfuse requires SDK calls per LLM call. Neither provides a BudgetLedger primitive — a real-time, tree-scoped cost ledger that agents can query mid-run and that policy rules can read to make cost-aware enforcement decisions.

What a platform provides: SystemModelCost records every LLM call with tokens and cost. ModelCostOverride maps custom model endpoints to pricing. Pass a session ID and you get per-user, per-tenant attribution automatically. The BudgetLedger tracks spend across the full spawn tree in real time — a parent agent and all its children share one ledger — and enforces mid-run when thresholds are crossed, not just at the next policy evaluation point.


Gap 4 — Database and Vector Database Governance

What AGT ships: Tool-call-level governance. Before a tool call fires, AGT evaluates whether it's allowed. If it is, the tool dispatches, and AGT's enforcement surface ends there.

The gap: AGT has no mechanism to enforce policy on what data an agent retrieves through that tool call. An agent with permission to call a search function, a retrieval endpoint, or a vector database query can surface any data those systems return. The tool call was allowed. The policy was satisfied. The governance layer never saw what the agent was about to read — or what it passed downstream.

For most enterprise agent deployments, the actual risk surface isn't "will the agent call a restricted tool?" It's "will the agent retrieve data it shouldn't have access to, surface it in an output, or pass it to the next agent in a spawn chain?" Cross-tenant data leakage in a multi-tenant deployment almost never happens through a blocked tool call. It happens through an unrestricted retrieval path.

If you build it yourself: Add authorization middleware to your retrieval layer. Implement agent-aware access control in your vector database. Build schema-level filtering that enforces which agents can see which data. This is solvable, but it puts data governance logic in your retrieval infrastructure rather than in the governance layer where it belongs.
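A sketch of what that middleware looks like at its simplest — an ACL keyed by agent identity, applied to retrieval results before they enter the agent's context. The agent names, source tags, and `governed_search` wrapper are all hypothetical:

```python
# Hypothetical ACL: which data-source tags each agent may read.
AGENT_ACL = {
    "support-agent": {"public-docs", "faq"},
    "finance-agent": {"public-docs", "invoices"},
}


def governed_search(agent_id, query, search_fn):
    """Wrap a retrieval call so results are filtered at the retrieval
    boundary, before anything enters the agent's context."""
    allowed = AGENT_ACL.get(agent_id, set())
    results = search_fn(query)
    return [doc for doc in results if doc["source"] in allowed]


def fake_vector_search(query):
    # stand-in for a real vector DB query that ignores agent identity
    return [
        {"source": "public-docs", "text": "How to reset a password"},
        {"source": "invoices", "text": "ACME invoice #1042"},
    ]
```

Even this toy version illustrates the architectural problem: the filter lives in your retrieval code, invisible to the governance layer, and every new retrieval path has to remember to call it.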

Open-source options: No governance tool provides retrieval-boundary enforcement for arbitrary agent access patterns at time of writing.

What a platform provides: Waxell's Signals and Domains schema extends governance to the data layer. Teams declare which agents can access which data sources, at what granularity, under what conditions. Policy enforcement happens at the retrieval boundary — before the data enters the agent's context — not at the tool call boundary where the retrieval was initiated. An agent can be perfectly well-governed at the AGT tool-call level and still exfiltrate data through an unguarded retrieval path. The governed data access layer closes that gap.


Gap 5 — Multi-Tenancy Beyond Policy Units

What AGT ships: "Policies" as the organizational unit. No documented isolation model for tenant memory, tenant signing keys, or tenant data residency.

What you need if you're building SaaS: Customer A's agent must not read Customer B's episodic memory. Customer A's signed actions must not appear as Customer B's in an audit log. Customer A's data must stay in Customer A's schema. When a compliance auditor asks for evidence of tenant isolation, you need to produce it.

This isn't hypothetical. Any company running agents on behalf of multiple customers faces this question. "We use row-level security" is an answer, but it's an answer you have to build, test, and maintain.

If you build it yourself: Postgres row-level security or schema-per-tenant, Redis namespace isolation per tenant, per-tenant key derivation in your signing layer. Solvable, but it's infrastructure work that pulls engineers away from agent work.
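The signing-key piece, at least, has a compact shape. A sketch of per-tenant key derivation from a single root secret, HKDF-extract-style, so one tenant's signatures can never verify under another's key — the root key here is a placeholder, and in production it would come from a KMS:

```python
import hashlib
import hmac

ROOT_KEY = b"replace-with-a-real-secret-from-your-KMS"  # illustrative only


def tenant_signing_key(tenant_id):
    """Derive a per-tenant key from one root secret."""
    info = f"signing:{tenant_id}".encode()
    return hmac.new(ROOT_KEY, info, hashlib.sha256).digest()


def sign_action(tenant_id, payload):
    """Sign an agent action under the tenant's derived key."""
    key = tenant_signing_key(tenant_id)
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_action(tenant_id, payload, signature):
    """Constant-time verification against the same tenant's key."""
    return hmac.compare_digest(sign_action(tenant_id, payload), signature)
```

The derivation itself is a few lines; the operational burden is everything around it — rotation, audit evidence, and making sure no code path signs with the root key directly.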

Open-source options: No off-the-shelf multi-tenant agent isolation library exists at time of writing.

What a platform provides: Schema-per-tenant isolation in Postgres, Redis namespace isolation, per-tenant AXID signing keys. The isolation model is enforced at the infrastructure layer — agents inherit it automatically rather than relying on application-level guards.


Gap 6 — Durable Execution: Suspend, Resume, Wait

What AGT ships: A saga orchestrator for multi-step action rollback. If Step 3 of a 5-step workflow fails, the saga can unwind Steps 1 and 2. This is valuable — it's the right answer for compensating transactions. But the saga runs within a single execution session. There is no mechanism to suspend an agent mid-run and resume it hours or days later.

The use cases that need this: An agent sends an invoice, then needs to wait up to 7 days for payment confirmation before taking the next action. An agent drafts a sensitive document, routes it to a human for approval, and resumes only after the human approves. A nightly batch workflow that processes queued items, sleeps until the next morning, processes again. A customer onboarding flow that sends a welcome email, waits 48 hours, checks whether the user has completed setup, and branches accordingly.

None of these are addressable with a saga orchestrator. A saga handles rollback within a session. These use cases require checkpointed state that survives session boundaries — and potentially worker crashes.

If you build it yourself: A task queue with scheduled retry, Postgres checkpointing after each await step, a resume dispatcher that handles typed exceptions, idempotency handling for the "worker crashed mid-sleep" case. The infrastructure for durable execution without deterministic replay is non-trivial to get right.
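A toy version of the checkpoint-and-resume loop, using an in-memory dict where the real build would use Postgres. All names here are illustrative; the invoice workflow mirrors the use case above. Note what the sketch does not handle — idempotent re-execution of a step that crashed mid-write, which is where most of the real difficulty lives:

```python
class Suspend(Exception):
    """Raised by a step that must wait for an external event or timer."""


CHECKPOINTS = {}  # stands in for a Postgres checkpoint table keyed by run_id


def run_workflow(run_id, steps, signal=None):
    """Resume from the last checkpoint; persist state after every step."""
    cp = CHECKPOINTS.get(run_id, {"step": 0, "state": {}})
    for i in range(cp["step"], len(steps)):
        try:
            steps[i](cp["state"], signal)
        except Suspend:
            cp["step"] = i                # re-enter at the waiting step
            CHECKPOINTS[run_id] = cp      # persist before yielding control
            return "suspended"
        signal = None                     # a signal is consumed exactly once
        cp["step"] = i + 1
        CHECKPOINTS[run_id] = cp
    return "done"


# Example workflow: send an invoice, wait days for payment, then finalize.
def send_invoice(state, signal):
    state["invoice_sent"] = True


def await_payment(state, signal):
    if signal != "payment_confirmed":
        raise Suspend()                   # worker exits; nothing held in memory


def finalize(state, signal):
    state["done"] = True


STEPS = [send_invoice, await_payment, finalize]
```

Between the "suspended" return and the resumed call, no process needs to stay alive — that is the property that lets an agent wait seven days for a payment webhook.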

Open-source options: Temporal (strong model, requires deterministic replay — significant adoption cost), Inngest (event-driven, not agent-native), LangGraph durable execution (tied to LangGraph), Cloudflare Workflows (infrastructure-coupled).

What a platform provides: Native durable execution — suspend for arbitrary durations, wait for human approval, resume after a signal or timer. The Envelope state machine checkpoints to Postgres after each await. Worker crash → automatic resume from the last checkpoint. No determinism requirement. You write normal Python; the framework handles the rest.


Gap 7 — External Agent Coverage

What AGT ships: Instrumentation adapters for LangChain, CrewAI, AutoGen, and Semantic Kernel. Agents running inside those frameworks are within AGT's governance surface. Agents running outside them are not.

What falls outside that surface: External agents running in developer tooling, CI pipelines, third-party integrations, and customer-facing environments that don't run inside a supported framework. MCP servers running as independent processes. Any agent built before AGT adapters existed for its framework.

Why this matters in practice: Production agent fleets are rarely monolithic. The same logical agent runs in a developer's local environment, in CI, and in a production workflow tool. Without a unified governance surface across all three, you can't attribute cost across them, trace a decision from a local session to a production run, or apply consistent enforcement across the full surface.

If you build it yourself: Build event-emitting hooks for each external environment you need to cover. Write routing and normalization to pipe those events into the same observability backend as your framework agents. Repeat for each new external tool. This is custom engineering at each integration point.

Open-source options: None that provide a unified external agent governance surface across arbitrary external environments and MCP servers.

What a platform provides: The Waxell installer drops configuration that routes structured events from external agents into the same governance surface as your framework-built agents. External agents, framework agents, and the agentic runtime all appear under one observability plane with unified attribution.


Gap 8 — Causal Lineage

What AGT ships: An audit log. A sequential record of policy events. This is the right tool for answering "did this policy fire?" It is not the right tool for answering "what caused this agent to take this action?"

The incident investigation problem: An agent produces an incorrect output. The team wants to know: what spawned this agent? What data did it read in the run that preceded this one? What decision in a parent agent's run caused this child to be spawned with these parameters? What's the full causal chain from the user's original request to this output?

A sequential audit log can't answer those questions. It records events in order; it doesn't record causal relationships between runs. When Agent A spawns Agent B, which calls a tool that triggers Agent C across a different session boundary, the audit log has three separate event streams with no explicit link between them.

If you build it yourself: Propagate a parent run ID through every spawn call. Persist parent-child relationships in a separate table. Build a query layer over that table. Handle the edge cases: signal-triggered resumes, cross-session bridges, timer-fired continuations. A complete lineage model has more edge kinds than it first appears.
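The core data structure is small — typed parent-child edges plus an ancestor walk. This sketch uses an in-memory list where the real build would use a table and a recursive query; the edge type names are illustrative (mirroring the kinds described above), not any tool's actual schema:

```python
# Typed edge records: child run, parent run, and why the child exists.
EDGES = []


def link(child_run, parent_run, edge_type):
    """Record one causal edge, e.g. 'spawn' or 'signal_fire'."""
    EDGES.append({"child": child_run, "parent": parent_run, "type": edge_type})


def ancestors(run_id):
    """Walk the causal chain from a run back to the originating request."""
    chain = []
    current = run_id
    while True:
        edge = next((e for e in EDGES if e["child"] == current), None)
        if edge is None:
            return chain
        chain.append((edge["parent"], edge["type"]))
        current = edge["parent"]
```

The sketch assumes each run has at most one causal parent; the edge cases listed above (timer fires, cross-session bridges, retries) are exactly where that assumption starts needing careful design.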

Open-source options: OpenTelemetry trace propagation covers span-level parent-child relationships but not run-level causal graphs. No open-source tool provides a complete causal lineage model for multi-agent systems at time of writing.

What a platform provides: The RunEdge DAG links every AgentExecutionRun to its causal predecessors via typed edge records: user_start, spawn, signal_fire, domain_callback, resume, timer_fire, retry, cross_session_bridge. The trace explorer renders the full causal graph as a browsable DAG. An incident that traces back through four spawn levels across three sessions is navigable in the UI in under a minute.


The Checklist

If you have AGT and are planning the rest of your stack:

| Gap | DIY estimate | Open-source option | Hosted option |
| --- | --- | --- | --- |
| Policy management (non-technical, runtime) | 4–8 weeks internal product | None | Waxell dynamic policy engine |
| Span-level tracing | 3–6 weeks | Langfuse, Arize Phoenix | Waxell Observe |
| Cost attribution + mid-run enforcement | 1–2 weeks + maintenance | Helicone, Langfuse (tracking only) | Waxell SystemModelCost + BudgetLedger |
| Database / vector DB governance | 2–4 weeks per retrieval layer | None | Waxell Signals and Domains |
| Multi-tenant isolation | 2–4 weeks infra | None | Waxell schema-per-tenant |
| Durable execution (suspend/resume) | 4–8 weeks | Temporal, Inngest | Waxell Runtime |
| External agent coverage | 1–2 weeks per tool | None | Waxell installer |
| Causal lineage | 2–4 weeks + UI | None | Waxell RunEdge DAG |

Most teams tackle these in roughly this order: policy management first if governance velocity is the immediate pressure; observability first if production visibility is the blocker; cost second once tracing is in. Tenancy, data layer governance, and lineage often come later — but they're worth planning for early, because retrofitting them into an existing fleet is significantly harder than building them in from the start.


FAQ

Does Waxell cover everything AGT covers?
Waxell's dynamic policy engine covers the pre-execution enforcement use case AGT is built for — and extends it: more structured policy categories, runtime injection without redeployment, non-technical policy management, and warn/block/redact disposition options beyond allow/deny. The one area where AGT has an advantage Waxell doesn't currently match is multi-language support: AGT ships working enforcement for TypeScript, .NET, Rust, and Go. Waxell is currently Python only.

Can I add Waxell Observe to an existing AGT deployment without changing my agent code?
Yes. pip install waxell-observe[all] and call waxell.init() at process start. The SDK auto-instruments your agent frameworks. No manual span instrumentation required for the frameworks in the supported library list.

What's the minimum I need from a platform if I already have AGT?
Depends on your most pressing gap. If governance velocity — getting policy changes live without a developer deployment — is the immediate need, start with the dynamic policy engine. If cost tracking is the first thing keeping you up at night, start with Waxell Observe. If you need agents that can pause for human approval, start with Waxell Runtime. You don't have to close all eight gaps at once.

Why can't I just use a standard observability platform — Datadog, Grafana, New Relic?
You can. Standard observability platforms handle infrastructure metrics and application traces well. They don't have concepts for LLM token cost, agent spawn trees, mid-run human approval gates, data retrieval governance, or causal lineage across agent sessions. You'd be building those abstractions on top of a generic platform — which is valid, but it's a significant engineering investment.

Is this list complete?
Probably not. Agent infrastructure is moving fast. The eight gaps above are the ones that consistently surface in production deployments today. Security-specific gaps (memory poisoning, prompt injection in tool responses, cross-tenant data leakage through the retrieval layer) overlap with the data governance gap above but deserve their own treatment as the threat landscape matures.


Waxell is the hosted platform for running, observing, and governing AI agents in production. See the platform overview or book a reference architecture review.
