Logan for Waxell

Posted on • Originally published at waxell.ai

Combining Microsoft AGT Policies with Waxell Observability: A Reference Architecture

This post is for teams that have made two decisions:

  1. Use Microsoft's Agent Governance Toolkit for policy enforcement.
  2. Add observability, cost tracking, and collaboration on top of that.

We'll show you how the two systems fit together in production: the architecture, the data flow, how identity layers coexist, the two explicit integration points that make them work as a stack, and how to divide operational ownership between teams. No competition. Both products are doing their jobs. This is about connecting them.


The Stack

Think of it as two horizontal layers over your agent process:

┌──────────────────────────────────────────────────────┐
│                   Agent Process                       │
│                                                       │
│  ┌──────────────────┐   ┌───────────────────────────┐│
│  │  AGT Agent OS    │   │  Waxell Observe SDK       ││
│  │  (policy eval)   │   │  (auto-instrumentation)   ││
│  └────────┬─────────┘   └──────────┬────────────────┘│
│           │                        │                  │
│  ┌────────┴────────────────────────┴────────────────┐ │
│  │              Waxell Runtime SDK                   │ │
│  │    (spawn / suspend / resume / ask_user)          │ │
│  └───────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────┘
          │                          │
          ▼                          ▼
   AGT AgentMesh                Waxell domain endpoints
   (SPIFFE workload              (out-of-process action
    identity, A2A/MCP/IATP)       enforcement, BudgetLedger)

AGT Agent OS runs in-process. It evaluates YAML, OPA/Rego, or Cedar rules before every tool call. At 0.029ms for 100 rules, the evaluation is below the noise floor of any downstream latency. It doesn't know about Waxell. It just blocks or allows.

Waxell Observe SDK also runs in-process, alongside AGT. It intercepts the same tool calls via auto-instrumentation and emits spans to Waxell Observe — token counts, latency, tool arguments, cost, model. It doesn't know about AGT either. It just observes.

Waxell Runtime SDK sits underneath both and provides the durable execution infrastructure: agent spawning, suspension, resume, and human-in-the-loop gates.

The two products don't need to know about each other to coexist. The coordination happens at two explicit, optional integration points — the BudgetLedger cross-reference and the identity layer — which you can add progressively.


Initialization Order Matters

Before any code: initialize Waxell before loading AGT. Waxell's instrumentation layer needs to wrap the tool dispatch stack before AGT's hooks attach. If you reverse the order, AGT hooks fire before Waxell spans open and the span won't carry AGT policy attributes.

# agent_init.py — always Waxell first, then AGT

import waxell
from agent_os.policies import PolicyEvaluator

# 1. Initialize Waxell — wraps the instrumentation layer
waxell.init(
    api_key="wax_sk_...",
    tenant="acme-prod",
    observe=True,
    runtime=True,
)

# 2. Load AGT with your policy directory
engine = PolicyEvaluator()
# Register custom checks per AGT docs before loading policies
# e.g. engine.register_check("waxell_budget_check", waxell_budget_check)
engine.load_policies("policies/")

Install both:

pip install waxell-observe[all] waxell-sdk agent-governance-toolkit[full]

Both packages instrument via OpenTelemetry. If you configure them to share the same tracer provider, spans are not duplicated in your backend.
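Because a reversed initialization order fails silently rather than crashing, it can help to make the dependency explicit in the bootstrap code so that a later refactor cannot reorder it unnoticed. The following is a minimal stdlib-only sketch of that idea. The `Bootstrap` class, `BootstrapOrderError`, and the two step functions are hypothetical stand-ins and are not part of Waxell or AGT; in a real agent the step bodies would call `waxell.init(...)` and build the `PolicyEvaluator`.

```python
from typing import Callable, List, Tuple


class BootstrapOrderError(RuntimeError):
    """Raised when a step runs before a step it depends on."""


class Bootstrap:
    """Runs named startup steps and refuses to run them out of order."""

    def __init__(self) -> None:
        self.completed: List[str] = []

    def run(self, name: str, step: Callable[[], None],
            requires: Tuple[str, ...] = ()) -> None:
        # Fail fast if any prerequisite step has not completed yet.
        missing = [dep for dep in requires if dep not in self.completed]
        if missing:
            raise BootstrapOrderError(f"{name!r} requires {missing} to run first")
        step()
        self.completed.append(name)


def init_observability() -> None:
    # Placeholder: a real agent would call waxell.init(...) here.
    pass


def load_policies() -> None:
    # Placeholder: a real agent would build PolicyEvaluator() and load policies here.
    pass


boot = Bootstrap()
boot.run("observability", init_observability)
boot.run("policies", load_policies, requires=("observability",))
```

With this in place, moving the `policies` step above the `observability` step raises at startup instead of producing traces that quietly lack policy attributes.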


Data Flow: From Tool Call to Span

Here's what happens when an agent attempts a tool call under this stack:

Step 1: Waxell Observe opens a span. The SDK records tool=write_file, agent_slug=eng-claude-code, run_id=run_8fA3k.

Step 2: AGT evaluates (in-process, ~0.029 ms). engine.evaluate({"tool_name": "write_file"}) fires and checks the loaded rule set. Rule 7 says write_file requires capability can_write_production. The agent has that capability. Outcome: allow. The outcome is attached to the open span as the attribute waxell.policy.agt.allowed=True.

Step 3: Tool dispatches. The actual file write happens.

Step 4: Waxell Observe closes the span. Latency, output size, and any error are recorded.

Step 5: A RunEdge is created if the tool spawned a sub-agent. The causal link is recorded in the RunEdge DAG.

To wire the AGT outcome into the Waxell span, add a thin wrapper around your tool dispatch:

from opentelemetry import trace
from waxell.exceptions import PolicyViolationHalt

from agent_init import engine  # the PolicyEvaluator loaded in agent_init.py

tracer = trace.get_tracer("waxell.agent")

def governed_tool_call(tool_name: str, args: dict) -> dict:
    # Open the span first so that denied calls are recorded as well
    with tracer.start_as_current_span(f"tool.{tool_name}") as span:

        # AGT evaluates in-process
        decision = engine.evaluate({"tool_name": tool_name})
        span.set_attribute("waxell.policy.agt.allowed", decision.allowed)

        if not decision.allowed:
            span.set_attribute("error", True)
            raise PolicyViolationHalt(
                f"AGT blocked tool '{tool_name}'"
            )

        # Waxell observes the real call; dispatch_tool is your own tool dispatcher
        result = dispatch_tool(tool_name, args)
        return result

After this, every blocked tool call in the Waxell trace explorer shows waxell.policy.agt.allowed=False. You can filter by that attribute to see every AGT policy trigger across all runs in a time window — without touching the AGT audit log separately.
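If you also export spans for offline analysis, the same filter takes only a few lines. The sketch below assumes each exported span is a dict with a `name` and an `attributes` map. That shape, the `blocked_calls_by_tool` helper, and the sample data are illustrative assumptions, not a documented Waxell export format.

```python
from collections import Counter
from typing import Iterable, Mapping

ATTR = "waxell.policy.agt.allowed"


def blocked_calls_by_tool(spans: Iterable[Mapping]) -> Counter:
    """Count spans whose AGT outcome attribute is False, grouped by span name."""
    counts: Counter = Counter()
    for span in spans:
        attributes = span.get("attributes", {})
        # Spans without the attribute never went through the wrapper; skip them.
        if attributes.get(ATTR) is False:
            counts[span["name"]] += 1
    return counts


# Hypothetical exported spans, shaped like the ones governed_tool_call produces.
exported = [
    {"name": "tool.write_file", "attributes": {ATTR: True}},
    {"name": "tool.write_file", "attributes": {ATTR: False}},
    {"name": "tool.delete_branch", "attributes": {ATTR: False}},
    {"name": "tool.write_file", "attributes": {ATTR: False}},
    {"name": "llm.completion", "attributes": {}},
]

report = blocked_calls_by_tool(exported)
print(report.most_common())
```

Grouping by span name works here because the wrapper names each span `tool.{tool_name}`.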


Integration Point 1: AGT Rules That Read Waxell BudgetLedger

The most powerful integration point is cost-aware policy enforcement. AGT can declare a rule that blocks a tool call when current spend exceeds a threshold. But AGT has no cost ledger — that data lives in Waxell.

The bridge is a custom check: AGT calls a Python function before evaluating the rule; that function reads the Waxell BudgetLedger. The YAML and Python patterns below illustrate the integration architecture — consult the AGT documentation for the exact custom check registration interface, as the specific field names may differ from this example.

# policies/cost_guard.yaml
- rule_id: cost_guard_synthesis
  description: "Block expensive synthesis if spawn tree spend > $10"
  condition:
    tool: synthesize_report
    custom_check: waxell_budget_check
  action: deny
  message: "Spawn tree budget exceeded; synthesize_report blocked"

# custom_checks.py
import waxell

def waxell_budget_check(context: dict) -> bool:
    """Return True to ALLOW the tool call, False to DENY."""
    ledger = waxell.budget.get_tree_ledger(
        tree_id=context["axid_spawn_tree"],
        tenant=context["tenant_slug"],
    )
    return ledger.cost_usd < 10.00

This pattern keeps ownership clean: the policy team writes YAML, the platform team manages the BudgetLedger state, and neither needs to touch the other's codebase. The ledger is the source of truth; the AGT rule is the declarative check over it.

One operational note: get_tree_ledger is a network call to Waxell's API. For high-frequency tool calls, cache the result at the span boundary — call once per run, not once per tool invocation — or implement a short TTL cache in the custom check function.
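The short-TTL option could look roughly like the sketch below. It is stdlib-only and takes the ledger read as an injected function, so the real `waxell.budget.get_tree_ledger` call can be swapped in. The `CachedBudgetCheck` class, its parameters, and the 5-second default are assumptions made for illustration.

```python
import time
from typing import Callable, Dict, List, Tuple


class CachedBudgetCheck:
    """Wraps a ledger lookup so repeated checks reuse a recent reading."""

    def __init__(
        self,
        fetch_cost_usd: Callable[[str], float],
        limit_usd: float,
        ttl_seconds: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch_cost_usd
        self._limit = limit_usd
        self._ttl = ttl_seconds
        self._clock = clock
        # tree_id -> (time of read, cost in USD at that time)
        self._cache: Dict[str, Tuple[float, float]] = {}

    def __call__(self, tree_id: str) -> bool:
        """Return True to allow the tool call, False to deny it."""
        now = self._clock()
        cached = self._cache.get(tree_id)
        if cached is None or now - cached[0] >= self._ttl:
            # Cache miss or stale entry: the only path that performs a ledger read.
            cached = (now, self._fetch(tree_id))
            self._cache[tree_id] = cached
        return cached[1] < self._limit


# Stand-in for the network call; records how often the ledger is actually read.
calls: List[str] = []


def fake_fetch(tree_id: str) -> float:
    calls.append(tree_id)
    return 4.20


check = CachedBudgetCheck(fake_fetch, limit_usd=10.00, ttl_seconds=5.0)
allowed = [check("tree-1") for _ in range(100)]
```

The trade-off is staleness: a decision can be based on a reading up to `ttl_seconds` old, so spend can overshoot the limit by whatever the tree consumes inside that window. Pick the TTL with that bound in mind.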


Integration Point 2: Identity — SPIFFE and AXID Side by Side

AGT ships AgentMesh for workload identity using SPIFFE/SVID, the workload identity standard commonly used for service-to-service mutual TLS. Waxell ships AXID, an Ed25519-signed JWT for per-run action provenance. These solve different problems and coexist without conflict.

|                     | AGT AgentMesh / SPIFFE | Waxell AXID |
|---------------------|------------------------|-------------|
| Identifies          | The service or workload | The specific agent run and action |
| Protocol            | mTLS, X.509 SVID | JWT in X-Waxell-AXID header |
| Claims              | SPIFFE URI (workload identity) | Tenant, agent slug, run ID, sub-user, spawn-chain parent |
| TTL                 | Certificate lifetime (hours/days) | 5 minutes per AXID |
| Question it answers | "Is this process authorized to connect to this service?" | "Which run, by which agent, on behalf of which user, in which spawn chain, took this action?" |

In the combined stack: SPIFFE certificates secure the mTLS connection between the agent process and Waxell's domain endpoints. AXID JWTs ride in the X-Waxell-AXID header of the action request, carrying run-level claims. The server verifies both independently.

The practical implication: don't try to consolidate these into one identity layer. SPIFFE is a connection-level primitive; AXID is an action-level primitive. They're operating at different granularities and serve different audit needs.
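To make the two granularities concrete, here is a stdlib-only sketch of a server-side check that treats them as independent gates. It covers only the parts that need no cryptography. The mTLS handshake and the Ed25519 signature verification are assumed to have happened already, and the `AxidClaims` field names, the helper functions, and the `acme.example` trust domain used in the usage line are hypothetical.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

AXID_TTL_SECONDS = 5 * 60  # the 5-minute AXID lifetime stated in the table above


@dataclass(frozen=True)
class AxidClaims:
    # Illustrative field names; the article lists claim categories only.
    tenant: str
    agent_slug: str
    run_id: str
    issued_at: float  # seconds since the epoch


def workload_is_trusted(spiffe_id: str, trust_domain: str) -> bool:
    """Connection-level check: does the peer's SPIFFE ID belong to our trust domain?"""
    parsed = urlparse(spiffe_id)
    return parsed.scheme == "spiffe" and parsed.netloc == trust_domain


def axid_is_fresh(claims: AxidClaims, now: float) -> bool:
    """Action-level check: is the already-verified AXID still inside its TTL?"""
    return 0 <= now - claims.issued_at < AXID_TTL_SECONDS


def authorize(spiffe_id: str, claims: AxidClaims, trust_domain: str, now: float) -> bool:
    # Both layers must pass; neither result is derived from the other.
    return workload_is_trusted(spiffe_id, trust_domain) and axid_is_fresh(claims, now)
```

A call such as `authorize("spiffe://acme.example/agents/eng", claims, "acme.example", now)` passes only when the workload is in the expected trust domain and the AXID was issued less than five minutes before `now`.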


Operational Ownership

One of the concrete benefits of running both products is clean team-boundary separation. The policy team doesn't need to understand Waxell's internals; the platform team doesn't need to review AGT rule logic. Here's how the split typically looks:

| Layer | Owned by | Day-to-day tools |
|-------|----------|------------------|
| AGT policy rules | Security / Compliance | YAML files in policies/ repo, agt verify CLI, AGT audit log |
| Waxell Observe config | Platform Engineering | waxell.init() parameters, instrumentor config, ModelCostOverride pricing table |
| Waxell cost budgets | Platform Engineering + Finance | BudgetLedger limits, SystemModelCost pricing table, cost reports |
| Agent playbooks + capabilities | Product / Agent owners | ConnectAgentProfile in Connect UI or API |
| Incident response | On-call (Platform) | Waxell trace explorer + RunEdge DAG for root cause; AGT flight recorder for policy replay |
| Compliance reporting | Compliance | AGT agt verify attestation output, Waxell audit export |

The compliance team exports attestations from AGT and audit records from Waxell. They don't need to know how either product works internally — just how to pull the artifacts they need for an auditor.


What Each Product Owns in an Incident

When something goes wrong with an agent in production, both tools are useful — but for different questions:

Use the Waxell trace explorer for: "What did the agent actually do?" Navigate the RunEdge DAG to find the originating request, trace every spawn and tool call, identify which turn introduced the problem, check token counts and cost across the run tree.

Use the AGT audit log and flight recorder for: "Did any policy fire?" Replay the policy evaluation sequence to confirm which rules were checked, what data they evaluated against, and whether any rule was violated or circumvented.

The combination gives you both behavioral visibility (Waxell) and policy compliance evidence (AGT) in the same incident. Neither is a substitute for the other.


Common Questions

Does the initialization order really matter?
Yes. Waxell's auto-instrumentation patches the tool dispatch layer at init() time. If AGT loads first, its hooks attach to the un-patched layer. Call waxell.init() before PolicyEvaluator() initialization and you won't hit this.

What if AGT blocks a call that Waxell would have allowed?
AGT blocks first — it's the earlier evaluation point. Waxell's Observe SDK still opens a span for the call, records the AGT deny outcome as a span attribute, and closes the span. You get full visibility into blocked calls in the trace explorer even though they never reached the tool. This is useful: you can see patterns in what's being blocked, not just that something was blocked.

What if I have my own policy layer on top of both?
Fine. The pattern holds. Add your policy evaluation to the governed_tool_call wrapper and record the outcome as an additional span attribute. Multiple policy layers coexist as long as each one records its outcome before dispatching.
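A sketch of that layering, written with plain-Python stand-ins so it runs without either SDK: each layer is a pair of an attribute name and a check, and the outcome is written to the span attributes before it is acted on. The `governed_call` function, the `PolicyDenied` exception, and the `acme.policy.internal.allowed` attribute are hypothetical; a dict stands in for the span.

```python
from typing import Callable, Dict, List, Tuple

# (span attribute name, check returning True to allow)
PolicyLayer = Tuple[str, Callable[[str, dict], bool]]


class PolicyDenied(Exception):
    """Raised when any policy layer denies the tool call."""


def governed_call(
    tool_name: str,
    args: dict,
    layers: List[PolicyLayer],
    dispatch: Callable[[str, dict], dict],
    span_attributes: Dict[str, bool],
) -> dict:
    for attribute, check in layers:
        allowed = check(tool_name, args)
        # Record the outcome before acting on it, so denials are visible too.
        span_attributes[attribute] = allowed
        if not allowed:
            raise PolicyDenied(f"{attribute} blocked tool {tool_name!r}")
    return dispatch(tool_name, args)


layers: List[PolicyLayer] = [
    # Stand-in for the AGT decision.
    ("waxell.policy.agt.allowed", lambda tool, args: True),
    # Hypothetical in-house layer.
    ("acme.policy.internal.allowed", lambda tool, args: tool != "drop_table"),
]

attrs: Dict[str, bool] = {}
result = governed_call(
    "write_file", {"path": "a.txt"}, layers, lambda tool, args: {"ok": True}, attrs
)
```

Layers run in list order and stop at the first denial, so a span for a blocked call carries the outcome of every layer up to and including the one that blocked it.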

Does the BudgetLedger check add latency to every tool call?
Only for tool calls covered by a waxell_budget_check rule. Cache the ledger read at the run or span boundary to bring the per-call overhead down to effectively zero. For high-frequency tool-calling agents, read once at spawn and re-read only on budget change signals.

Do I need both Waxell Runtime and Waxell Observe, or can I use just Observe?
You can use just Observe. If you only need tracing and cost tracking on top of AGT, pip install waxell-observe[all] and waxell.init(observe=True, runtime=False) is a valid configuration. Add Runtime when you need durable execution (suspend, resume, ask_user).

What does this cost operationally?
AGT is open-source and free. Waxell is usage-based; see waxell.ai/pricing. For most teams, the observability and cost-tracking value covers the platform cost within the first month of visibility — cost surprises that surface in the first week of cost tracking typically exceed the annual platform cost.


Getting Started

If you have AGT already deployed and want to add Waxell:

# Add Waxell to your existing agent environment
pip install waxell-observe[all] waxell-sdk

# Set your API key
export WAXELL_API_KEY="wax_sk_..."
export WAXELL_TENANT="your-tenant-slug"

Then add waxell.init() before your existing PolicyEvaluator() initialization. That's it for basic observability — spans start appearing immediately for every LLM call and tool dispatch your agent makes.

For the BudgetLedger integration, add the custom check function and the cost-guard policy file from the examples above. For the AXID + SPIFFE identity layer, see /docs/integrations/microsoft-agt for the full configuration.


Waxell is the hosted platform for running, observing, and collaborating with AI agents. See the platform overview or book a reference architecture review.



Top comments (1)

PEACEBINFLOW

The SPIFFE vs. AXID distinction—connection-level identity versus action-level identity—is the kind of architectural clarity that usually only emerges after you've tried to use one for the other's job and watched it fail. I've seen teams attempt to shove run-level provenance into mTLS certificates and end up with certificate churn that takes down the service mesh. Different granularities, different protocols, different answers to different audit questions. Letting them coexist instead of forcing consolidation is the right call, and it's harder to sell to management than "one identity system to rule them all."

What I find myself thinking about is the initialization order requirement. Waxell before AGT, because the instrumentation wrapper needs to be in place before AGT's hooks attach. It's the kind of detail that's easy to document and easy to get wrong in production when someone refactors the bootstrap sequence six months later and doesn't remember why the order mattered. The failure mode isn't a crash—it's silent: AGT hooks fire before Waxell spans open, so blocked calls don't get recorded with the waxell.policy.agt.allowed=False attribute. Everything looks fine in both dashboards, but the trace explorer is missing data. Those are the worst bugs to debug because nothing is visibly broken.

The BudgetLedger cross-reference pattern—AGT YAML declaring a rule that calls a Python function that reads Waxell's ledger—is a clean separation of concerns in the abstract, but it introduces a network dependency into what was previously a sub-millisecond in-process evaluation. You address it with caching at the span boundary, which is the right mitigation. But it also means the cost guard rule can be up to one span's duration stale. For most use cases that's fine—budgets don't need microsecond precision. For a high-velocity agent burning through tool calls at 50/second, is there a risk of overshooting the budget in the gap between ledger reads, or is the BudgetLedger itself designed with some hysteresis to account for that?