Logan for Waxell

Posted on • Originally published at waxell.ai

What the Microsoft Agent Governance Toolkit Leaves to You

Microsoft's Agent Governance Toolkit is a serious piece of engineering. Sub-millisecond policy evaluation. OWASP Agentic Top 10 coverage. Post-quantum cryptography already shipped. A 9,500+ test corpus with continuous fuzzing. If you've chosen AGT or are evaluating it, you made a defensible decision.

But there's a question that usually surfaces about a week into any AGT deployment, and it's not about the policy engine: who changes a policy when something goes wrong, and how fast can they do it?

With AGT, changing a policy means editing a YAML file, running tests, and deploying. That's a developer task — which means your compliance team, your security team, your legal team, and your on-call engineer all have to route governance changes through the engineering queue. A new regulatory requirement, a new threat pattern, a customer escalation that needs an immediate enforcement change — all of it waits for a deployment.

This is a design decision, not an oversight. AGT is a library. It was built for teams where governance and engineering are the same function. For those teams, it's the right tool.

For teams where governance needs to move at the speed of incidents — not the speed of deployments — and where the people who understand the regulatory context aren't the same people who have CI/CD access, the gap starts before you even get to observability or cost tracking.

This post catalogues what AGT leaves open, starting with that gap. For each one, we walk through what a DIY build looks like, which open-source tooling covers it, and where a hosted platform fits.


AGT's Explicit Non-Goals

Before cataloguing the gaps, it's worth being clear that these are documented design decisions, not omissions. From the AGT README:

  • Not a prompt guardrail or content moderation tool
  • Governs agent actions, not LLM inputs or outputs; pre-execution only
  • Same-process trust boundary; container isolation recommended for higher-risk workloads
  • Workflow-level policies and intent declaration are on the roadmap but not yet available
  • No tenancy model documented for memory, signing keys, or data residency
  • No model cost table, no token aggregation, no per-user or per-tenant attribution
  • No governance at the data retrieval layer

These aren't weaknesses — they're scope decisions. A policy engine that tried to be everything would be nothing. What follows is the list of things that scope leaves on your plate.


TL;DR: Gaps and How to Fill Them

| Gap | AGT ships | What you need | Platform answer |
| --- | --- | --- | --- |
| Policy management | Developer-authored YAML, deployment required | Non-technical authorship, runtime injection | Waxell dynamic policy engine — 26 categories, warn/block/redact |
| Observability | Audit log + flight recorder | Span-level tracing, causal graph | Waxell Observe — 157 libraries, RunEdge DAG |
| Cost tracking | Nothing | Per-call, per-tenant, BudgetLedger enforcement | Waxell SystemModelCost + BudgetLedger |
| Data layer governance | Nothing | Retrieval-boundary enforcement for DB/vector DB | Waxell Signals and Domains schema |
| Multi-tenancy | Nothing | Schema isolation, per-tenant signing keys | Waxell schema-per-tenant + AXID isolation |
| Durable execution | In-session saga only | Suspend-for-days, human gates, cross-session resume | Waxell Runtime |
| External agent coverage | Framework adapters only | Unified surface for external and third-party agents | Waxell installer |
| Causal lineage | Sequential audit log | Run-level causal graph across sessions | Waxell RunEdge DAG |

Gap 1 — Policy Management: Governance Shouldn't Require a Deployment

What AGT ships: A declarative policy model — YAML, OPA/Rego, or Cedar rules in a policies/ directory. Version-controlled, testable, deployable. Well-suited for teams where developers own governance.

What you need beyond that: The ability to change a policy without a deployment ticket. The ability for a compliance officer, a security analyst, or an on-call engineer without CI/CD access to push an enforcement change when they need to. The ability to assign different policies to different agents and different fleets — so a high-sensitivity finance agent can run under stricter rules than a low-risk internal tool, and that assignment can change without touching the codebase.

The operational reality: agent incidents don't wait for the next sprint. When a new threat pattern is identified at 2am, or when a regulatory deadline brings an immediate compliance requirement, governance needs to move at incident speed.

If you build it yourself: Build a policy management UI on top of AGT's file-based model. Write a deployment pipeline that validates and ships policy changes without a full code review cycle. Scope policies per agent or fleet via configuration. Each of these is solvable individually; together they're a significant internal product build.

Open-source options: AGT's model is inherently developer-centric. There's no open-source layer that adds a non-technical policy authorship surface on top of it.

What a platform provides: Waxell's dynamic policy engine supports 26 structured policy categories — covering data handling, cost, tool access, output content, identity, inter-agent communication, and more — each with scoping controls. Policies are injectable at runtime without redeployment. Different agents and fleets run under different policy sets. The incident disposition model works like cloud infrastructure security: warn, block, or redact, scoped per category. A compliance officer can update a policy and push it live through the platform UI without opening a terminal.


Gap 2 — Observability: Audit Logs Are Not Traces

What AGT ships: An audit log of policy events — which rule fired, on which tool call, with what outcome. A flight recorder for post-mortem replay of a policy violation sequence.

What you need beyond that: Span-level distributed tracing. LLM call latency per turn. Token counts per model call. Tool call arguments and outputs. The full execution graph across spawned sub-agents. A queryable interface for debugging production incidents without trawling raw logs.

The difference matters in practice. A policy audit log tells you that Rule 14 blocked a write_file call at 14:23:07. It doesn't tell you what the agent had done for the 40 turns leading up to that call, which sub-agent spawned the offending run, what model was used at each step, or how many tokens the whole sequence consumed before halting.

Production agent failures rarely announce themselves through policy violations. Policy violations are rare by design — they're the catch, not the signal. The failures that actually hurt — cost overruns, reasoning regressions, emergent behavior that surprises you in a customer demo — don't trigger any rule. They only become visible in spans.

If you build it yourself: Instrument every LLM call with OpenTelemetry. Emit spans to your observability backend of choice. Build a frontend to query across runs. Estimate 3–6 engineer-weeks to a stable prototype; ongoing maintenance as your agent frameworks add new versions.
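To make the DIY path concrete, here is a minimal sketch of the per-call span you would emit. It uses a homemade in-memory recorder standing in for an OpenTelemetry exporter, and a hypothetical `fake_completion` function standing in for a real provider SDK call — the attribute names are illustrative, not a standard schema.

```python
import time
from functools import wraps

SPANS = []  # stands in for an OpenTelemetry exporter backend


def traced_llm_call(model):
    """Decorator sketching the span you'd emit for every LLM call."""
    def deco(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            result = fn(*args, **kwargs)
            SPANS.append({
                "name": "llm.call",
                "model": model,
                "latency_s": time.monotonic() - start,
                # token counts come from the provider's usage payload
                "input_tokens": result.get("usage", {}).get("input_tokens", 0),
                "output_tokens": result.get("usage", {}).get("output_tokens", 0),
            })
            return result
        return wrapper
    return deco


@traced_llm_call(model="example-model")
def fake_completion(prompt):
    # hypothetical stand-in for a real provider SDK call
    return {"text": "ok", "usage": {"input_tokens": 12, "output_tokens": 3}}
```

The work in the real build is not this wrapper — it is applying it uniformly across every framework and SDK your agents use, and keeping it current as those libraries change.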

Open-source options: Langfuse (framework-agnostic, self-hostable, good default choice), Arize Phoenix (strong eval tooling), LangSmith (LangChain-coupled), Helicone (proxy-based, minimal instrumentation). All require some instrumentation; none provide a causal lineage graph across runs.

What a platform provides: pip install waxell-observe[all] auto-instruments 157 libraries at process start — LangChain, CrewAI, AutoGen, the Anthropic SDK, the OpenAI SDK, and 151 others. Spans appear in a trace explorer immediately. No instrumentation code required.


Gap 3 — Cost Tracking and Budget Enforcement

What AGT ships: Nothing. AGT has no model cost table, no token count aggregation, no billing-level attribution. This is documented, not a criticism.

What you need: Per-LLM-call cost records, keyed by model and token type. Aggregated by user, tenant, agent, and time window. If you run agents on behalf of customers — any SaaS product where agents do work for multiple tenants — cost attribution is the difference between knowing your margins and guessing until the invoice lands.

A concrete example: an agent-driven workflow runs across ten turns per request at moderate token counts per turn. At current model pricing, a single session is inexpensive. Multiply by tens of thousands of sessions per day across hundreds of tenants, and without cost tracking, you don't know which tenants are expensive, which workflows are runaway, or whether you're pricing correctly until the model provider bill arrives.

Beyond tracking, you need enforcement. Knowing that the spawn tree has consumed $8 of a $10 budget doesn't help if you can't act on that knowledge mid-run. The gap isn't just visibility — it's the ability to halt or warn when thresholds are crossed in real time.

If you build it yourself: Intercept every LLM call response, log the usage field, join to a pricing table you maintain, aggregate by session and tenant. Easy to prototype, operationally annoying to maintain as model pricing changes and new models are added. Real-time mid-run enforcement requires a live ledger, not just a reporting table.
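A minimal sketch of that pipeline, with hypothetical model names and prices — the pricing table, model IDs, and function names are all illustrative, and a production version would persist to a database rather than a list:

```python
# Hypothetical per-1M-token pricing table you would maintain yourself.
PRICING = {
    "model-a": {"input": 3.00, "output": 15.00},
}

COST_LOG = []  # one record per LLM call; in production, a database table


def record_cost(model, usage, tenant_id, session_id):
    """Join a provider usage payload to the pricing table and log the cost."""
    price = PRICING[model]
    cost = (usage["input_tokens"] / 1_000_000 * price["input"]
            + usage["output_tokens"] / 1_000_000 * price["output"])
    COST_LOG.append({"tenant": tenant_id, "session": session_id,
                     "model": model, "cost": cost})
    return cost


def tenant_spend(tenant_id):
    """Aggregation — the reporting half of the problem."""
    return sum(r["cost"] for r in COST_LOG if r["tenant"] == tenant_id)


def check_budget(tenant_id, budget):
    """Mid-run enforcement sketch: halt when a tenant crosses its budget."""
    if tenant_spend(tenant_id) >= budget:
        raise RuntimeError(f"budget exhausted for tenant {tenant_id}")
```

Note that `check_budget` only works mid-run if every LLM call routes through `record_cost` first — which is exactly the live-ledger requirement, as opposed to a nightly reporting job.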

Open-source options: Helicone and Langfuse both track costs. Helicone is proxy-based (easy to add, adds a network hop); Langfuse requires SDK calls per LLM call. Neither provides a BudgetLedger primitive — a real-time, tree-scoped cost ledger that agents can query mid-run and that policy rules can read to make cost-aware enforcement decisions.

What a platform provides: SystemModelCost records every LLM call with tokens and cost. ModelCostOverride maps custom model endpoints to pricing. Pass a session ID and you get per-user, per-tenant attribution automatically. The BudgetLedger tracks spend across the full spawn tree in real time — a parent agent and all its children share one ledger — and enforces mid-run when thresholds are crossed, not just at the next policy evaluation point.


Gap 4 — Database and Vector Database Governance

What AGT ships: Tool-call-level governance. Before a tool call fires, AGT evaluates whether it's allowed. If it is, the tool dispatches, and AGT's enforcement surface ends there.

The gap: AGT has no mechanism to enforce policy on what data an agent retrieves through that tool call. An agent with permission to call a search function, a retrieval endpoint, or a vector database query can surface any data those systems return. The tool call was allowed. The policy was satisfied. The governance layer never saw what the agent was about to read — or what it passed downstream.

For most enterprise agent deployments, the actual risk surface isn't "will the agent call a restricted tool?" It's "will the agent retrieve data it shouldn't have access to, surface it in an output, or pass it to the next agent in a spawn chain?" Cross-tenant data leakage in a multi-tenant deployment almost never happens through a blocked tool call. It happens through an unrestricted retrieval path.

If you build it yourself: Add authorization middleware to your retrieval layer. Implement agent-aware access control in your vector database. Build schema-level filtering that enforces which agents can see which data. This is solvable, but it puts data governance logic in your retrieval infrastructure rather than in the governance layer where it belongs.
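A sketch of what that middleware looks like at its simplest — an ACL keyed by agent identity, applied to retrieval results before they enter the agent's context. The agent names, source tags, and `governed_search` wrapper are all hypothetical:

```python
# Hypothetical ACL: which data-source tags each agent may read.
AGENT_ACL = {
    "support-agent": {"public-docs", "faq"},
    "finance-agent": {"public-docs", "invoices"},
}


def governed_search(agent_id, query, search_fn):
    """Wrap a retrieval call so results are filtered at the retrieval
    boundary, before anything enters the agent's context."""
    allowed = AGENT_ACL.get(agent_id, set())
    results = search_fn(query)
    return [doc for doc in results if doc["source"] in allowed]


def fake_vector_search(query):
    # stand-in for a real vector DB query that ignores agent identity
    return [
        {"source": "public-docs", "text": "How to reset a password"},
        {"source": "invoices", "text": "ACME invoice #1042"},
    ]
```

Even this toy version illustrates the architectural problem: the filter lives in your retrieval code, invisible to the governance layer, and every new retrieval path has to remember to call it.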

Open-source options: No governance tool provides retrieval-boundary enforcement for arbitrary agent access patterns at time of writing.

What a platform provides: Waxell's Signals and Domains schema extends governance to the data layer. Teams declare which agents can access which data sources, at what granularity, under what conditions. Policy enforcement happens at the retrieval boundary — before the data enters the agent's context — not at the tool call boundary where the retrieval was initiated. An agent can be perfectly well-governed at the AGT tool-call level and still exfiltrate data through an unguarded retrieval path. The governed data access layer closes that gap.


Gap 5 — Multi-Tenancy Beyond Policy Units

What AGT ships: "Policies" as the organizational unit. No documented isolation model for tenant memory, tenant signing keys, or tenant data residency.

What you need if you're building SaaS: Customer A's agent must not read Customer B's episodic memory. Customer A's signed actions must not appear as Customer B's in an audit log. Customer A's data must stay in Customer A's schema. When a compliance auditor asks for evidence of tenant isolation, you need to produce it.

This isn't hypothetical. Any company running agents on behalf of multiple customers faces this question. "We use row-level security" is an answer, but it's an answer you have to build, test, and maintain.

If you build it yourself: Postgres row-level security or schema-per-tenant, Redis namespace isolation per tenant, per-tenant key derivation in your signing layer. Solvable, but it's infrastructure work that pulls engineers away from agent work.
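The signing-key piece, at least, has a compact shape. A sketch of per-tenant key derivation from a single root secret, HKDF-extract-style, so one tenant's signatures can never verify under another's key — the root key here is a placeholder, and in production it would come from a KMS:

```python
import hashlib
import hmac

ROOT_KEY = b"replace-with-a-real-secret-from-your-KMS"  # illustrative only


def tenant_signing_key(tenant_id):
    """Derive a per-tenant key from one root secret."""
    info = f"signing:{tenant_id}".encode()
    return hmac.new(ROOT_KEY, info, hashlib.sha256).digest()


def sign_action(tenant_id, payload):
    """Sign an agent action under the tenant's derived key."""
    key = tenant_signing_key(tenant_id)
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_action(tenant_id, payload, signature):
    """Constant-time verification against the same tenant's key."""
    return hmac.compare_digest(sign_action(tenant_id, payload), signature)
```

The derivation itself is a few lines; the operational burden is everything around it — rotation, audit evidence, and making sure no code path signs with the root key directly.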

Open-source options: No off-the-shelf multi-tenant agent isolation library exists at time of writing.

What a platform provides: Schema-per-tenant isolation in Postgres, Redis namespace isolation, per-tenant AXID signing keys. The isolation model is enforced at the infrastructure layer — agents inherit it automatically rather than relying on application-level guards.


Gap 6 — Durable Execution: Suspend, Resume, Wait

What AGT ships: A saga orchestrator for multi-step action rollback. If Step 3 of a 5-step workflow fails, the saga can unwind Steps 1 and 2. This is valuable — it's the right answer for compensating transactions. But the saga runs within a single execution session. There is no mechanism to suspend an agent mid-run and resume it hours or days later.

The use cases that need this: An agent sends an invoice, then needs to wait up to 7 days for payment confirmation before taking the next action. An agent drafts a sensitive document, routes it to a human for approval, and resumes only after the human approves. A nightly batch workflow that processes queued items, sleeps until the next morning, processes again. A customer onboarding flow that sends a welcome email, waits 48 hours, checks whether the user has completed setup, and branches accordingly.

None of these are addressable with a saga orchestrator. A saga handles rollback within a session. These use cases require checkpointed state that survives session boundaries — and potentially worker crashes.

If you build it yourself: A task queue with scheduled retry, Postgres checkpointing after each await step, a resume dispatcher that handles typed exceptions, idempotency handling for the "worker crashed mid-sleep" case. The infrastructure for durable execution without deterministic replay is non-trivial to get right.
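A toy version of the checkpoint-and-resume loop, using an in-memory dict where the real build would use Postgres. All names here are illustrative; the invoice workflow mirrors the use case above. Note what the sketch does not handle — idempotent re-execution of a step that crashed mid-write, which is where most of the real difficulty lives:

```python
class Suspend(Exception):
    """Raised by a step that must wait for an external event or timer."""


CHECKPOINTS = {}  # stands in for a Postgres checkpoint table keyed by run_id


def run_workflow(run_id, steps, signal=None):
    """Resume from the last checkpoint; persist state after every step."""
    cp = CHECKPOINTS.get(run_id, {"step": 0, "state": {}})
    for i in range(cp["step"], len(steps)):
        try:
            steps[i](cp["state"], signal)
        except Suspend:
            cp["step"] = i                # re-enter at the waiting step
            CHECKPOINTS[run_id] = cp      # persist before yielding control
            return "suspended"
        signal = None                     # a signal is consumed exactly once
        cp["step"] = i + 1
        CHECKPOINTS[run_id] = cp
    return "done"


# Example workflow: send an invoice, wait days for payment, then finalize.
def send_invoice(state, signal):
    state["invoice_sent"] = True


def await_payment(state, signal):
    if signal != "payment_confirmed":
        raise Suspend()                   # worker exits; nothing held in memory


def finalize(state, signal):
    state["done"] = True


STEPS = [send_invoice, await_payment, finalize]
```

Between the "suspended" return and the resumed call, no process needs to stay alive — that is the property that lets an agent wait seven days for a payment webhook.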

Open-source options: Temporal (strong model, requires deterministic replay — significant adoption cost), Inngest (event-driven, not agent-native), LangGraph durable execution (tied to LangGraph), Cloudflare Workflows (infrastructure-coupled).

What a platform provides: Native durable execution — suspend for arbitrary durations, wait for human approval, resume after a signal or timer. The Envelope state machine checkpoints to Postgres after each await. Worker crash → automatic resume from the last checkpoint. No determinism requirement. You write normal Python; the framework handles the rest.


Gap 7 — External Agent Coverage

What AGT ships: Instrumentation adapters for LangChain, CrewAI, AutoGen, and Semantic Kernel. Agents running inside those frameworks are within AGT's governance surface. Agents running outside them are not.

What falls outside that surface: External agents running in developer tooling, CI pipelines, third-party integrations, and customer-facing environments that don't run inside a supported framework. MCP servers running as independent processes. Any agent built before AGT adapters existed for its framework.

Why this matters in practice: Production agent fleets are rarely monolithic. The same logical agent runs in a developer's local environment, in CI, and in a production workflow tool. Without a unified governance surface across all three, you can't attribute cost across them, trace a decision from a local session to a production run, or apply consistent enforcement across the full surface.

If you build it yourself: Build event-emitting hooks for each external environment you need to cover. Write routing and normalization to pipe those events into the same observability backend as your framework agents. Repeat for each new external tool. This is custom engineering at each integration point.

Open-source options: None that provide a unified external agent governance surface across arbitrary external environments and MCP servers.

What a platform provides: The Waxell installer drops configuration that routes structured events from external agents into the same governance surface as your framework-built agents. External agents, framework agents, and the agentic runtime all appear under one observability plane with unified attribution.


Gap 8 — Causal Lineage

What AGT ships: An audit log. A sequential record of policy events. This is the right tool for answering "did this policy fire?" It is not the right tool for answering "what caused this agent to take this action?"

The incident investigation problem: An agent produces an incorrect output. The team wants to know: what spawned this agent? What data did it read in the run that preceded this one? What decision in a parent agent's run caused this child to be spawned with these parameters? What's the full causal chain from the user's original request to this output?

A sequential audit log can't answer those questions. It records events in order; it doesn't record causal relationships between runs. When Agent A spawns Agent B, which calls a tool that triggers Agent C across a different session boundary, the audit log has three separate event streams with no explicit link between them.

If you build it yourself: Propagate a parent run ID through every spawn call. Persist parent-child relationships in a separate table. Build a query layer over that table. Handle the edge cases: signal-triggered resumes, cross-session bridges, timer-fired continuations. A complete lineage model has more edge kinds than it first appears.
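The core data structure is small — typed parent-child edges plus an ancestor walk. This sketch uses an in-memory list where the real build would use a table and a recursive query; the edge type names are illustrative (mirroring the kinds described above), not any tool's actual schema:

```python
# Typed edge records: child run, parent run, and why the child exists.
EDGES = []


def link(child_run, parent_run, edge_type):
    """Record one causal edge, e.g. 'spawn' or 'signal_fire'."""
    EDGES.append({"child": child_run, "parent": parent_run, "type": edge_type})


def ancestors(run_id):
    """Walk the causal chain from a run back to the originating request."""
    chain = []
    current = run_id
    while True:
        edge = next((e for e in EDGES if e["child"] == current), None)
        if edge is None:
            return chain
        chain.append((edge["parent"], edge["type"]))
        current = edge["parent"]
```

The sketch assumes each run has at most one causal parent; the edge cases listed above (timer fires, cross-session bridges, retries) are exactly where that assumption starts needing careful design.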

Open-source options: OpenTelemetry trace propagation covers span-level parent-child relationships but not run-level causal graphs. No open-source tool provides a complete causal lineage model for multi-agent systems at time of writing.

What a platform provides: The RunEdge DAG links every AgentExecutionRun to its causal predecessors via typed edge records: user_start, spawn, signal_fire, domain_callback, resume, timer_fire, retry, cross_session_bridge. The trace explorer renders the full causal graph as a browsable DAG. An incident that traces back through four spawn levels across three sessions is navigable in the UI in under a minute.


The Checklist

If you have AGT and are planning the rest of your stack:

| Gap | DIY estimate | Open-source option | Hosted option |
| --- | --- | --- | --- |
| Policy management (non-technical, runtime) | 4–8 weeks internal product | None | Waxell dynamic policy engine |
| Span-level tracing | 3–6 weeks | Langfuse, Arize Phoenix | Waxell Observe |
| Cost attribution + mid-run enforcement | 1–2 weeks + maintenance | Helicone, Langfuse (tracking only) | Waxell SystemModelCost + BudgetLedger |
| Database / vector DB governance | 2–4 weeks per retrieval layer | None | Waxell Signals and Domains |
| Multi-tenant isolation | 2–4 weeks infra | None | Waxell schema-per-tenant |
| Durable execution (suspend/resume) | 4–8 weeks | Temporal, Inngest | Waxell Runtime |
| External agent coverage | 1–2 weeks per tool | None | Waxell installer |
| Causal lineage | 2–4 weeks + UI | None | Waxell RunEdge DAG |

Most teams tackle these in roughly this order: policy management first if governance velocity is the immediate pressure; observability first if production visibility is the blocker; cost second once tracing is in. Tenancy, data layer governance, and lineage often come later — but they're worth planning for early, because retrofitting them into an existing fleet is significantly harder than building them in from the start.


FAQ

Does Waxell cover everything AGT covers?
Waxell's dynamic policy engine covers the pre-execution enforcement use case AGT is built for — and extends it: more structured policy categories, runtime injection without redeployment, non-technical policy management, and warn/block/redact disposition options beyond allow/deny. The one area where AGT has an advantage Waxell doesn't currently match is multi-language support: AGT ships working enforcement for TypeScript, .NET, Rust, and Go. Waxell is currently Python only.

Can I add Waxell Observe to an existing AGT deployment without changing my agent code?
Yes. pip install waxell-observe[all] and call waxell.init() at process start. The SDK auto-instruments your agent frameworks. No manual span instrumentation required for the frameworks in the supported library list.

What's the minimum I need from a platform if I already have AGT?
Depends on your most pressing gap. If governance velocity — getting policy changes live without a developer deployment — is the immediate need, start with the dynamic policy engine. If cost tracking is the first thing keeping you up at night, start with Waxell Observe. If you need agents that can pause for human approval, start with Waxell Runtime. You don't have to close all eight gaps at once.

Why can't I just use a standard observability platform — Datadog, Grafana, New Relic?
You can. Standard observability platforms handle infrastructure metrics and application traces well. They don't have concepts for LLM token cost, agent spawn trees, mid-run human approval gates, data retrieval governance, or causal lineage across agent sessions. You'd be building those abstractions on top of a generic platform — which is valid, but it's a significant engineering investment.

Is this list complete?
Probably not. Agent infrastructure is moving fast. The eight gaps above are the ones that consistently surface in production deployments today. Security-specific gaps (memory poisoning, prompt injection in tool responses, cross-tenant data leakage through the retrieval layer) overlap with the data governance gap above but deserve their own treatment as the threat landscape matures.


Waxell is the hosted platform for running, observing, and governing AI agents in production. See the platform overview or book a reference architecture review.
